I really have no idea how popular or well known these benchmarks are, but I’ve been following https://www.techempower.com/benchmarks/ for some time. I was happy to see some elixir/erlang entries recently, but then concerned with their performance. Given that phoenix at least has a high error count, I imagine that is at least part of the issue. I intend to take a look when I both learn enough to feel useful and get the chance, but I’m posting it here in case there are others more able that have the bandwidth to improve things sooner.
I understand that abstract benchmarks are always of only limited use, and each project will have its own performance characteristics; but performing well in a 3rd party (and theoretically unbiased) suite like this could help with adoption.
I was looking at these benchmarks too the other day, and I also noticed the high error rate on Phoenix’s benchmarks. It seems strange to me, like their server wasn’t set up right. It seems implausible that Phoenix could perform so poorly. The GitHub repo is here - I thought about pulling it down and running the benchmarks myself. For one thing, the version of Phoenix is out of date, so we could make a pull request to update those dependencies.
Many of the packages there are entirely by outside groups; that is part of the appeal to me actually, that those evangelizing their favorites can actually submit improvements.
I saw a response from Chris McCord somewhere (Twitter, Reddit, HN, ?) that stated there were serious problems with the configuration, and that they had submitted a PR to fix some issues and nothing had been done. There was also a caveat to the numbers posted: they had their hardware decommissioned and scrambled to get the results published, which meant the frameworks got no time to tweak/fix stuff.
I would say the jury is WAY out on the legitimacy of the numbers.
We don’t know what caused the errors and unfortunately we didn’t have a chance to collaborate with them on a true run. A few months ago they added Phoenix in a preview, but it was a very poor implementation. They were testing JSON benchmarks through the :browser pipeline, complete with csrf token generation. They had a dev DB pool size of 10, where other frameworks were given a pool size of 100. And they also had heavy IO logging, where other frameworks did no logging. We sent a PR to address these issues, and I was hoping to see true results in the latest runs, but no preview was provided this time and we weren’t able to work with them on the errors. tldr; these results are not representative of the framework.
From my point of view there are so many things that are off with the way they do these tests and they even say themselves that results are not comparable unless you are an “expert”.
From a quick glance:
They use wrk, which has the coordinated omission problem. So basically their test results can only be interpreted for a closed system (i.e. not something like the internet, where there is no back-pressure from clients).
They report the best case result for each framework. I think worst case is more interesting if I am running something in production.
For latency they use average latency for ranking. It is well known that averages are bad to use in this case. They could have used proper percentiles or if they don’t do any coordinated omission adjustment (which they don’t) they should use the max latency.
Tests run for way too short a time: 15 seconds. Max concurrency is 256 for anything but plaintext. They say themselves that they didn’t see any difference between running a test for 60 seconds or 15 seconds. Well, I am not surprised :) 60 seconds is way too short as well! Once memory buffers run full, garbage collection kicks in, and database connections get stuck, you will see other problems. I’ve had servers that looked like they were running OK which ground to a complete stop after 30 minutes. I understand it is hard to run the tests for all these frameworks for such a long time, but still.
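To make the point about averages concrete, here is a minimal Python sketch. The latency numbers are made up for illustration (they are not taken from the TechEmpower results), and `percentile` is a small hypothetical helper using the nearest-rank method, not anything their toolset provides.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 98 fast requests and 2 long stalls.
latencies_ms = [10] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
max_ms = max(latencies_ms)

print(f"mean={mean_ms} p50={p50_ms} p99={p99_ms} max={max_ms}")
```

The mean lands somewhere between the fast and the slow requests and describes neither of them, while the median, the 99th percentile and the max together show both the typical case and the stalls.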
Are their tests useless? If you are contemplating setting up an http server in a closed system, then perhaps not. They might be valid if you interpret the data correctly. If you have the http servers on an open system such as the internet? No, then they are not valid.
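A toy simulation of the closed vs open system difference, in Python. This is only a sketch under stated assumptions: a single-worker server that takes 1 ms per request and freezes for one second, a closed-loop client with one connection, and an open-loop client sending every 10 ms. All names and numbers are hypothetical and this is not how wrk is implemented.

```python
STALL_START_MS = 100   # assumed: server freezes at t=100 ms
STALL_END_MS = 1100    # assumed: server resumes at t=1100 ms
SERVICE_MS = 1         # assumed: normal time to serve one request


def finish_time(arrival_ms, server_free_ms):
    """Completion time for a request on a single-worker server with one stall."""
    start = max(arrival_ms, server_free_ms)
    if STALL_START_MS <= start < STALL_END_MS:
        start = STALL_END_MS  # frozen: work only resumes after the stall
    return start + SERVICE_MS


def closed_loop(n_requests):
    """One connection that sends the next request only after the previous response."""
    latencies, now = [], 0
    for _ in range(n_requests):
        done = finish_time(now, now)
        latencies.append(done - now)
        now = done
    return latencies


def open_loop(n_requests, interval_ms):
    """Requests are due on a fixed schedule; latency counts from the intended send time."""
    latencies, server_free = [], 0
    for i in range(n_requests):
        intended = i * interval_ms
        done = finish_time(intended, server_free)
        latencies.append(done - intended)
        server_free = done
    return latencies


closed = closed_loop(200)
opened = open_loop(200, interval_ms=10)
slow_closed = sum(1 for ms in closed if ms > 100)
slow_open = sum(1 for ms in opened if ms > 100)
print(f"slow samples: closed loop={slow_closed}, open loop={slow_open}")
```

In this model the closed-loop client waits out the stall and records a single slow sample out of 200, while the open schedule records roughly a hundred requests that waited more than 100 ms. Same server, same stall, very different latency distribution, which is why the closed-loop numbers do not carry over to an open system.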
And they say (http://tiamat.tsotech.com/unfair-comparisons):
“The tests set what can be considered high-water marks for performance.” That is, any framework is unlikely to perform better than these tests. So at least that is one thing you can take away from the results.
If you look through the data tables and latency graphs phoenix actually does pretty well. For latency it has low “max” and standard deviation which is not completely off the walls.
If you look how it performs under the various concurrency settings it performs as expected with little variation. Some of the faster frameworks fluctuated over 100% between the various concurrency settings. That is not what I call “stable”
Finally, for these tests the Erlang VM has a bottleneck which hopefully will be addressed in OTP20 with multi poll-set. Not that it will be the best, but I have a feeling it will do much better.
There is a danger with these tests that something like phoenix comes out in a bad light. There are python/php/javascript frameworks that come out on top in the way they present the data. I know from experience that in real life they stand no chance if you need performance out of the box.
I hope people don’t look at this and outright discard any framework because of it, but I think that is the case.
Part of the trap that we fall into is that people tend to assume “benchmarks + developer time = all factors”. We’re as guilty of it as anybody by showing the comparisons for the Ruby crowd but there’s always somebody faster. We need to do a better job clearly communicating the non-benchmark related perks of Elixir/Phoenix to clearly differentiate it for the casual observer.
Comparing round 13 vs 14 it seems that Phoenix has gone backwards - does anyone have any idea why? I appreciate that these aren’t real world figures but I’d expect them to be consistent between the two rounds.
You can look at this and say “we have room for improvement” or you can look at this and say “they don’t know how to benchmark.” One will lead to a better product, the other won’t.
When it comes to performance, Elixir/Phoenix/Ecto is a middle-of-the-pack player. For a dynamic language, it’s doing pretty well. But things could be improved. And, maybe some things can’t be improved because of the nature of Erlang.
Thanks @kseg. I can’t disagree with you on your points. I was just curious as to why Phoenix seems to have slipped down the chart compared to previous versions (not against other frameworks). So, for example, the Data Updates test:
Round 12 - 1,100 (27.8%)
Round 13 - 1,915 (65.0%)
Round 14 - 750 (17.1%) - quite a significant drop.
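For a sense of scale, a quick Python sketch of the round-over-round change. It assumes the first number on each line is a raw throughput figure (the unit is not stated here) and ignores the percentages in brackets.

```python
# Figures quoted above for the Data Updates test (unit assumed, not stated).
rounds = {12: 1100, 13: 1915, 14: 750}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


changes = {}
ordered = sorted(rounds)
for prev, cur in zip(ordered, ordered[1:]):
    changes[(prev, cur)] = percent_change(rounds[prev], rounds[cur])
    print(f"Round {prev} -> {cur}: {changes[(prev, cur)]:+.1f}%")
```

So round 13 was up about 74% on round 12, and round 14 is down about 61% on round 13, which also puts it below round 12.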
While I agree that these perf tests can be useful to discover possible issues, I also agree with @cmkarlsson that these tests are a terrible way to compare frameworks, which is according to the site their main purpose (This is a performance comparison of many web application frameworks).
IMO with TE benches, you can’t really conclude whether one framework is faster than another for your particular problem. Moreover speed is not the only factor (nor the most important one) for choosing a web framework. Therefore, I think that these benches are very shallow and misleading, and that they fail in their main purpose.
On the other hand, having a common suite for benching frameworks would be great. If we could have an easy way of installing the TE suite locally and running it, we could have an automated way of benching our frameworks, finding possible bottlenecks, and even detecting performance regressions. Not sure to what extent the TE code can currently be used for that, but I agree that this can be useful.
It’s worth noting that the goal of optimizing should be to get good and consistent performance numbers (latency and throughput), and not to move the framework up this shallow top list. Because, if the latter is the goal, we might end up doing all sort of trickery just to create the illusion that Phoenix is faster, introducing non-obvious trade-offs which will bite us in more complex real-life scenarios.
It might be a bad way to compare frameworks, but it’s a pretty good way to compare the performance of HTTP servers and database drivers.
Personally, I think it’s safe to say that both net/http and Cowboy satisfy the HTTP specification, provide a usable interface to build frameworks on top of, are capable of scaling beyond a single core and have proven to be fairly robust. At that point, performance (latency, throughput, memory usage, …) is the only factor. At least, it’s the only factor from the point of view of people working on these components.
I agree that to a consumer, maybe speed isn’t the only factor. They are very different in another respect: one lets you run Go code, the other lets you run Erlang/Elixir code. Maybe that ought to weigh more on your decision than an X% difference in performance.
Or maybe not. I don’t know what your needs are, and I’m not going to assume that performance isn’t really important to you.
People will take whatever value from these benchmarks they want. Yes, a lot of that will be misguided. And yes, it’s useful to regurgitate the typical “these benchmarks are flawed” and “it doesn’t represent the real world” (which is a pretty condescending thing to say to strangers; really, what people mean is that it doesn’t represent THEIR real world).
I know that we’re largely saying the same thing, I just wanted to get the last word in