The Ruby parser is implemented in C, so it’s more like comparing C to Elixir; obviously C is going to be faster. We’re actually not doing that badly. It would be interesting to compare with Jason compiled with HiPE; in my benchmarks that makes it at least twice as fast.
The data is also quite different from what you’d face in a regular HTTP app: this JSON is pretty-printed, while most JSON flying over the wire is not. That can make a significant difference in actual performance, so depending on what you want to learn from this, it can be important.
Don’t suppose you’d be up for tweaking the Techempower benchmarks for Elixir to use Jason with HiPE? They’re the basis for so many of those results that I think it would go a long way with the report.
Let’s have some fun, here is how they compare to C++/rapidjson:
# jq
$ time jq '.' data/10mb.json > /dev/null
real 0m0.607s
user 0m0.603s
sys 0m0.004s
$ time jq '.' data/citylots.json > /dev/null
real 0m14.348s
user 0m13.839s
sys 0m0.504s
# Elixir
## 10mb.json
Time taken [Jason]: 521.655ms
Time taken [Poison]: 1358.531ms
## citylots.json
Time taken [Jason]: 12224.44ms
Time taken [Poison]: 33350.239ms
# Ruby
## 10mb.json
$ time ruby app.rb
real 0m0.350s
user 0m0.250s
sys 0m0.020s
## citylots.json
$ time ruby app.rb
real 0m5.632s
user 0m5.393s
sys 0m0.236s
# C++
$ time ./rapidjson-testing < ../../benchmark-large-json-parsing/data/10mb.json
real 0m0.035s
user 0m0.034s
sys 0m0.000s
$ time ./rapidjson-testing < ../../benchmark-large-json-parsing/data/citylots.json
real 0m0.531s
user 0m0.503s
sys 0m0.028s
Admittedly this benchmark is flawed: jq is writing its output to stdout while the Elixir/Ruby/C++ versions just black-hole the data after it is parsed, so jq is artificially handicapped here. In addition, the Elixir version is actually instantiating a tree to hold the whole structure, which is wasted work as well (unsure about Ruby). The C++ version is fully parsing and performing callbacks for every parse event (standard SAX parsing).
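To make the SAX point concrete, here is a toy, standalone sketch of the idea (my own illustration, not rapidjson’s actual API): a scanner over a flat JSON array of integers that fires a callback per value instead of materializing anything, so peak memory stays constant no matter how big the input is.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// SAX-style handler: each parse event triggers a callback; nothing is stored.
struct Handler {
    std::size_t count = 0;
    long long sum = 0;
    void on_int(long long v) { ++count; sum += v; }  // called once per value
};

// Parse a flat JSON array of non-negative integers, e.g. "[1, 2, 3]".
// Returns false on malformed input. Whitespace between tokens is skipped.
bool parse_int_array(const std::string& json, Handler& h) {
    std::size_t i = 0;
    auto skip_ws = [&] {
        while (i < json.size() && std::isspace((unsigned char)json[i])) ++i;
    };
    skip_ws();
    if (i >= json.size() || json[i] != '[') return false;
    ++i;
    skip_ws();
    if (i < json.size() && json[i] == ']') return true;  // empty array
    while (true) {
        skip_ws();
        if (i >= json.size() || !std::isdigit((unsigned char)json[i])) return false;
        long long v = 0;
        while (i < json.size() && std::isdigit((unsigned char)json[i]))
            v = v * 10 + (json[i++] - '0');
        h.on_int(v);  // SAX event: value seen, nothing materialized
        skip_ws();
        if (i < json.size() && json[i] == ',') { ++i; continue; }
        if (i < json.size() && json[i] == ']') return true;
        return false;
    }
}
```

The structure-building (“DOM”) style differs only in that the callbacks would append into a tree instead of just accumulating, which is exactly the wasted work mentioned above.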
I can PR the C++ one if you want; it only needs a normal C++ compiler and CMake installed, nothing else (not even rapidjson, it fetches that itself).
Just to make sure, here is the C++ version as both a SAX parser and as an Elixir-style structure-building document parser (yay, eating memory):
$ time ./rapidjson-sax < ../../benchmark-large-json-parsing/data/10mb.json
real 0m0.038s
user 0m0.034s
sys 0m0.004s
$ time ./rapidjson-sax < ../../benchmark-large-json-parsing/data/citylots.json
real 0m0.529s
user 0m0.485s
sys 0m0.044s
$ time ./rapidjson-structure < ../../benchmark-large-json-parsing/data/10mb.json
real 0m0.037s
user 0m0.036s
sys 0m0.000s
$ time ./rapidjson-structure < ../../benchmark-large-json-parsing/data/citylots.json
real 0m0.533s
user 0m0.504s
sys 0m0.028s
Not much of a difference; honestly the C++ compiler is so good that the work is probably being optimized out, hmm…
EDIT: I added some code to print out details about the structure to ensure it is parsed in full, and it somehow got a few milliseconds faster… so yeah, those numbers are accurate; C++ is just fast, as always…
I could foresee a Rust one outperforming a C++ one, to be honest, but I doubt the current pure-Rust libraries would at this time (though they’re still plenty fast).
Sure, I’ll clean it up and PR it into a cpp/rapidjson directory or something.
Do you want a readme.md or INSTALL file in that directory, or do you want me to edit the root readme to add instructions on how to compile/run it?
I was thinking of adding a statistical benchmarker to the C++ version instead of just using time; do you want me to do that pre-PR?
It will take longer to run (it performs many iterations to reach statistical accuracy), but it would be more detailed, if you don’t mind it potentially taking many minutes or more. I leave it up to you: it would be substantially more accurate but also substantially slower, and since nothing else here uses a statistical benchmarker it seems kind of pointless right now. ^.^;
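For the curious, the sort of harness I mean is roughly this (a hypothetical sketch, not the actual code I’d PR): time the workload over many iterations and report mean and standard deviation rather than a single `time` reading.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal statistical timing harness: repeat the workload `iterations` times
// and summarize the per-run wall-clock samples.
struct TimingStats {
    double mean_ms;
    double stddev_ms;
};

template <typename F>
TimingStats benchmark(F&& work, std::size_t iterations) {
    std::vector<double> samples;
    samples.reserve(iterations);
    for (std::size_t i = 0; i < iterations; ++i) {
        auto start = std::chrono::steady_clock::now();
        work();
        auto end = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(end - start).count());
    }
    double mean = 0.0;
    for (double s : samples) mean += s;
    mean /= samples.size();
    double var = 0.0;
    for (double s : samples) var += (s - mean) * (s - mean);
    var /= samples.size();
    return {mean, std::sqrt(var)};
}
```

A real harness would also want warm-up runs and outlier handling, which is exactly why it takes so much longer than a single `time` invocation.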
Should probably leave it out for now, at least unless a real parsing benchmark harness is set up across all the languages or something.
Does rapidjson validate UTF-8 by default when decoding? I know it has an option, but I don’t think it’s on by default. This should have significant performance implications; without it, we’re comparing apples to oranges.
I’m using the UTF<> argument, so I’d hope so? Let me check the docs… Hmm, I’m unsure whether it’s on by default for parsing, but I found where to set the flag to force it on regardless. Results now:
$ time _builds/rapidjson-sax < ../../data/citylots.json
real 0m0.549s
user 0m0.533s
sys 0m0.016s
$ time _builds/rapidjson-structure < ../../data/citylots.json
real 0m0.542s
user 0m0.509s
sys 0m0.032s
Not seeing much of a difference, so it probably is on by default?
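For reference, the flag I set is rapidjson’s `kParseValidateEncodingFlag`; the per-byte work it adds looks roughly like this standalone sketch (my own toy validator, not rapidjson’s code, and it skips the overlong/surrogate checks a full validator also needs):

```cpp
#include <cstddef>
#include <string>

// Toy UTF-8 validity check: verifies lead bytes and that each multi-byte
// sequence has the right number of 0b10xxxxxx continuation bytes.
bool valid_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;
        if (c < 0x80) extra = 0;                 // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;  // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;  // 4-byte sequence
        else return false;                       // stray continuation/invalid lead
        if (i + extra >= s.size()) return false; // truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80)
                return false;
        i += extra + 1;
    }
    return true;
}
```

Even this cut-down check is a branch or two per byte on top of the parse itself, which is why skipping validation can matter for benchmarks.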
/me has never used rapidjson before, so feel free to check the code, PR incoming in a minute…
You know, it would help if I didn’t compile the SAX version under both names… >.>
Fixed; the results make more sense now:
$ time _builds/rapidjson-sax < ../../data/10mb.json
real 0m0.039s
user 0m0.035s
sys 0m0.004s
$ time _builds/rapidjson-structure < ../../data/10mb.json
real 0m0.048s
user 0m0.040s
sys 0m0.008s
$ time _builds/rapidjson-sax < ../../data/citylots.json
real 0m0.545s
user 0m0.516s
sys 0m0.028s
$ time _builds/rapidjson-structure < ../../data/citylots.json
real 0m0.742s
user 0m0.657s
sys 0m0.084s
There it goes: the structure form should be slower than the SAX form. That makes MUCH more sense!