Techempower benchmarks

sasajuric · May 30, 2017, 1:57pm

That’s debatable. Last time I checked, these tests took only 10 sec, so they didn’t really measure the cost of garbage collection (which might significantly affect the results for stop-the-world-gc runtimes). They also didn’t observe how CPU bound tasks interfere with I/O ones (which under some conditions can cause tremendous latency increase). For other issues, see comments made by @cmkarlsson in this thread.

It’s not just about language. There are properties, such as fault-tolerance, the ability to troubleshoot production, ecosystem, and whatnot. Those things matter, and IMO matter much more than the speed. Not that speed is irrelevant, but past some point of “acceptable”, it matters much less, and sometimes it can be counterproductive.

There is a difference between performance matters and “I want to use every nanosecond of CPU time as best as possible”. Of course that all of us want to have some reasonable performance. On the other hand, aiming for the fastest possible framework (there’s no such thing, but let’s pretend it exists) is IMO usually wrong, b/c that speed gain is likely obtained by some trade-off, which might not be immediately obvious.

My usual advice is to measure whether the candidate framework is good enough for the desired case. That’s what I did when I first evaluated Erlang, by running a 12 hours load test on a simulation of the system, with 10x of the estimated load. Once I was convinced that Erlang easily handles that (and that it can easily scale), I didn’t really care about the speed anymore. I knew that if in some special cases I need to squeeze out the best performance, I can easily step outside of Erlang for that.

When I mention “real world” I talk about systems running in production (and those which are being developed to run in production in the future). If it’s running in production, it’s real. Otherwise it’s not. Consequently, TE benches are not real-world, but rather some (IMO very poor) attempt to simulate the production. If that makes me condescending, so be it

The problem is that your production is not the same as mine, and definitely not the same as the thing being benched in TE. Hence, even though framework foo might be 100x faster than bar in the TE bench, it might well happen that bar actually produces better number for your case (or mine). Therefore, even if TE benches were done properly (and I don’t think they are), they still wouldn’t tell you a lot about your system, and might even lead you to a bad conclusion.

Which is why I believe there’s no such thing as the proper general bench comparison of frameworks. You can design one for your own system, simply to discard any framework which makes it very hard to deal with the desired load. Past that point, the decision on which framework to use should IMO mostly revolve around other properties.

anthonyb · May 31, 2017, 3:12pm

Yes, you can install and run the TechEmpower benchmarks suite locally via the Vagrant setup they provide. The basics are in the readme, with further details in the documentation.

Among fullstack frameworks using a full ORM (as opposed to raw SQL) with an RDBMS (i.e., MySQL or Postgres), on most of the tests, the performance of Phoenix is respectable, though not exceptional. However, performance on the “Data updates” and “Multiple queries” tests is especially low (with a significant regression on “Data updates” from Round 13 to Round 14) – on the lower end of the fullstack Python and even PHP frameworks. It might be worth investigating what’s going on with those two tests in particular. Is it something with Ecto?

More generally, it might not be a bad idea for the core team members to take a look at the Phoenix code submitted for these benchmarks and ensure Phoenix is putting its best foot forward. The Phoenix website describes Phoenix as “Productive. Reliable. Fast.”. Whatever flaws these benchmarks may have, people will likely look at them and conclude Phoenix isn’t living up to that third adjective.

Note, it isn’t difficult to push changes to the TechEmpower code. Just use their Vagrant setup to make changes on a local machine and send a pull request to the Github repo.

It might also be interesting to add another entry using just Cowboy, Plug, and raw SQL – this could help show the overhead added by Phoenix and Ecto.

brightball · May 31, 2017, 5:14pm

Agreed. In a benchmark like this the goal isn’t a real world test. It’s a “what can you squeeze out of it”. We had another good thread talking about benchmarks that were closer to real world though.

This, for example, gives some great numbers: https://github.com/tbrand/which_is_the_fastest

Based on discussions from here: Which is the fastest web framework? (Link/Repo and results in this Topic)

EDIT: Reading through the rules of these benchmarks, it looks like there might be something that Erlang/Elixir could take advantage of for a performance gain, especially on the multiple queries test. The queries are executed individually with a random id and it strikes me as feasible that if queries could be executed in a process named by query and id that we could naturally leverage the BEAM to de-duplicate in progress requests. I wonder what it would take to do that and/or if it would help.

sasajuric · June 1, 2017, 10:28am

I do agree that the word “fast” in Phoenix advertisement is quite vague and should be perhaps replaced with “scalable”. The pitch I gave at my Phoenix talks is that with the stack of Erlang/Elixir/Phoenix, you can get on board quickly, move forward at a reasonable pace, write reliable, fault-tolerant systems, and be confident that the technology can help you go very far in the case your system becomes very popular and you need to scale it. That’s a bunch of wins on all accounts, and this is why I use and recommend Phoenix.

Raw speed is not Erlang’s forte (in fact sometimes it even sacrifices the speed for some other properties), so usually a sequential code will not be as fast as in many other popular languages. But the question for me isn’t who’s fastest, but rather is the tech fast enough? IME in the vast majority of the cases, Erlang is more than sufficient for my needs.

Doing various algorithmical and technical optimizations can usually take me very far. For example, looking at the rules of the “updates” task, I’m confused with some requirements and limitations. In real life, I’d try to optimize this task by performing a single update of all the target rows, say perhaps using UPDATE ... RETURNING of postgresql. I expect this would do wonders for the perf, since in a single trip to the DB I could change everything, and get all the data (and only that data) which I need.

Another option would be to try updating from multiple processes (currently it’s done sequentially).

This is not what’s advertised on the TE site. My understanding is that they think we should pick our framework based on their graphs, which is something I strongly disagree with given that the tests are synthetic, contrived, and poorly executed. You could just as well roll a dice, fire up a RNG, or consult a tarot expert to pick a framework

anthonyb · June 1, 2017, 12:39pm

These are all good points, but they end up sounding a bit like rationalizations one would expect from a framework that is not near the top of the list. Python isn’t known for its speed either, nor Ruby, nor PHP, but they all have frameworks (even full stack frameworks) beating Phoenix on some of the tests.

But the question for me isn’t who’s fastest, but rather is the tech fast enough?

Sure, but in the “Data updates” test, Phoenix seems downright slow in Round 14, though it was among the fastest in Round 13. Hopefully that can be fixed.

In real life, I’d try to optimize this task by performing a single update of all the target rows

Presumably they are trying to represent real world scenarios in which you do in fact have to do multiple separate updates, perhaps to several different tables/collections. For simplicity, rather than actually creating many different tables in the database, they just have the apps make multiple updates on the same table.

sasajuric · June 1, 2017, 1:01pm

Keep in mind that I’m not a member of the core team, but just a user of Phoenix. As someone who has been using Erlang for 7 years, and Elixir for 4, in a fairly loaded system (~ 2k non-cacheable reqs/sec), I have high confidence that I can reach the desired performance numbers for the vast majority of the cases.

And this leads me back to the point I already raised in this thread. No two real-life (aka production) systems are the same, and the perf bottlenecks can appear anywhere for various reasons. In most cases I’ve seen, the bottlenecks had little to do with the actual framework being chosen. Yet, I’m supposed to pick a framework based on these synthetic and highly contrived example? That doesn’t sounds right to me.

That’s of course worth looking into. Perhaps there’s some low-hanging fruit to be picked there.
However, this significant change in the ranks makes me wonder even more to what extent are these tests accurate and useful.

brightball · June 1, 2017, 1:10pm

Most of that is accomplished via offloading the benchmark to pure C. I just filtered Elixir, Ruby, Python and PHP on the Fortunes benchmark for a quick check and there’s one full stack framework ahead of Phoenix. That is a PHP framework called YAF which is…written in pure C. GitHub - laruence/yaf: Fast php framework written in c, built in php extension

I don’t think anybody here expects Erlang to be faster than pure C. You can offload parts of the benchmark to C, but doing so would eliminate all of the consistency guarantees that come from the BEAM. If anything, the fact that you get all of those guarantees and comparable top end speed with Elixir without having to substitute chunks of your code with another language is a pretty major factor.

I am genuinely curious what happened on the updates though. The benchmarks that have a real concurrency test should, in my opinion, be the biggest target.

brightball · June 1, 2017, 1:23pm

In looking at that commit, why aren’t the updates being done concurrently? Am I reading that code wrong or are they executing 1 - q queries sequentially?

|> json(Enum.map(1..q, fn _ ->
       id = :rand.uniform(10000)
       num = :rand.uniform(10000)
       w = Repo.get(World, id)
       changeset = World.changeset(w, %{randomnumber: num})
        Repo.update(changeset)
       %{id: id, randomnumber: num}
      end))

If q is 10, they are doing 10 get / updates sequentially?

OvermindDL1 · June 1, 2017, 2:26pm

Yep indeed, why is this not looking up and updating many things at once?

anthonyb · June 1, 2017, 2:31pm

I’m generally in agreement, but consider the following statement:

No two real-life (aka production) systems are the same, and the perf bottlenecks can appear anywhere for various reasons.

This seems extreme. Surely there are many systems that have quite similar characteristics and performance demands and for which useful insights could be gained from standardized benchmarks of alternative platforms. I’m just suggesting there may be a middle ground between “benchmarks are completely useless” and “I should pick my framework based solely on a single benchmarking test I saw.” Declaring the former in the face of a recent subpar result seems a bit suspect, particularly when the top pinned tweet at https://twitter.com/elixirphoenix is about a different “synthetic” test where Phoenix seems quite impressive.

In any case, my larger point was just that from a marketing perspective, it might be worth looking at the Phoenix code used for these benchmarks to ensure it is not unnecessarily slow. Hopefully that makes sense, even if you don’t think the benchmarks themselves are particularly useful.

anthonyb · June 1, 2017, 2:37pm

No, I wasn’t talking about the Fortunes benchmark. As noted, Phoenix does fairly well against full stack frameworks on most tests, including Fortunes. It falls down on “Multiple queries” and “Data updates” specifically, and in those cases, the Python, Ruby, and PHP frameworks beating it are not offloading the work to C.

sasajuric · June 1, 2017, 2:51pm

There are many differences between any two systems, such as deployment environment, network speed, database, load patterns, the kind of job that needs to be done, guarantees you need to provide etc. Therefore, a generic framework comparison is IMO not going to tell you anything useful about your possible bottlenecks.

The middle ground IMO is checking whether a tech stack is fast enough for the kind of load you expect in the kind of system you want to build. Later on, benching your system continuously is also useful.

Unlike TE benches, the Phoenix test doesn’t compare Phoenix to anything else. The test merely proves that the stack can handle 2M simultaneous connections. In other words, it demonstrates the capacity of the stack, which is definitely useful to know.

In pragmatic terms you’re certainly right. It would be nice if Phoenix was higher on the list.

OTOH I’m personally terribly sad when I think of the amount of effort invested in TE benches, given that I find them to be worse than useless. IMO, this is a huge waste of effort. Personally, I’d rather see Phoenix team be focused on Phoenix features, stability, security, performance improvements (but focused ones, along the lines of the 2M test), and such.

brightball · June 1, 2017, 3:07pm

On looking at all of these benchmarks…they are all sequential.

github.com

TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Elixir/phoenix/web/controllers/page_controller.ex

defmodule Hello.PageController do
  use Hello.Web, :controller
  alias Hello.World
  alias Hello.Fortune

  def index(conn, _params) do
    conn
    |> put_resp_content_type("application/json", nil)
    |> send_resp(200, Jason.encode_to_iodata!(%{"TE Benchmarks\n" => "Started"}))
  end

  # avoid namespace collision
  def _json(conn, _params) do
    conn
    |> put_resp_content_type("application/json", nil)
    |> send_resp(200, Jason.encode_to_iodata!(%{"message" => "Hello, world!"}))
  end

  def db(conn, _params) do
    conn

This file has been truncated. show original

How has this gone overlooked for so long? Not a single additional process is being used across these benchmarks?

OvermindDL1 · June 1, 2017, 3:20pm

Like really? Why not send_json since it will convert the map to JSON automatically using the phoenix registered JSON encoder (which might not be poison, and thus could be faster than poison)?

I see a lot of various things just wrong with this code… o.O

sasajuric · June 1, 2017, 3:29pm

If multiple processes are used, you could also encode each piece with poison in each process, and then append encoded strings to an iolist in the request handler process.

Also, using encode_to_iolist! instead of encode! should be faster.

Bypassing changesets in update might also help. Of course, it would be best to avoid read followed by the write, which is possible for the update task, but forbidden by the task rules

sasajuric · June 1, 2017, 3:40pm

Also, pool size should probably be increased in this case.

anthonyb · June 1, 2017, 4:24pm

Therefore, a generic framework comparison is IMO not going to tell you anything useful about your possible bottlenecks.

I don’t mean to be arguing heavily in favor of the usefulness of the TE benchmarks in particular, so I don’t want to belabor the point, but again, these statements come off as rather extreme.

Suppose the only information you had was at TechEmpower Framework Benchmarks. Are you saying you would conclude that Symfony2 is just as likely to outperform Phoenix as the other way around in a real-world scenario involving requests that query a database and return HTML?

It’s not practical to implement a real-world version of a given application in 100+ different technology stacks, so various published performance results might at least help narrow down the likely candidates for further investigation.

The middle ground IMO is checking whether a tech stack is fast enough for the kind of load you expect in the kind of system you want to build.

First, how do you check without building the system? Second, being “fast enough” isn’t always the relevant question. Suppose you implement a proof of concept in some stack and estimate that based on its performance and hardware needs, your monthly costs would be $50,000. Is that “fast enough”? What if there is another stack that is so fast it could get the monthly costs down to $5,000. In that case, you might conclude the first system was not “fast enough,” but you wouldn’t know that without having some points of comparison. But you can only build so many proofs of concept – before starting, it might help to have some existing data to point you in the right direction.

Unlike TE benches, the Phoenix test doesn’t compare Phoenix to anything else. The test merely proves that the stack can handle 2M simultaneous connections. In other words, it demonstrates the capacity of the stack, which is definitely useful to know.

I’m not sure I follow. If you want, you could look at any given framework in a set of benchmarks to determine the capacity of that particular framework, without making any comparisons to other frameworks included in the benchmarks. I don’t see how running multiple different frameworks through the same set of tests makes the tests for any individual framework any less useful.

Also, the 2 million websocket connection test was on a $1500/month machine. How do we know if that is good performance? Could we get the same performance on an $80/month Digital Ocean VM using Express? No, but we couldn’t know that without making comparisons. Without points of comparison, the single test in isolation isn’t particularly meaningful or useful.

idi527 · June 1, 2017, 4:34pm

It has been repeated on a $640/month Digital Ocean VM [0]. But it still doesn’t compare well to uWebSockets which would probably handle 2m websocket connections on a much smaller machine.

[0] GitHub - dsander/phoenix-connection-benchmark: Reproduction of the 2 million connection in Phoenix benchmark

brightball · June 1, 2017, 5:02pm

I remember seeing a breakdown of the channels benchmark that made the point that Channels uses 3 elixir processes for each socket to handle failure, reconnects and state properly. It’s not a raw “how many websockets can you hold” example.

Since the ram limit is essentially the socket limit, a raw socket test should be closer to 3x higher.

UWebsocket is another example of subbing out to C. Good looking library though.

idi527 · June 1, 2017, 5:08pm

I’m not sure but I think n2o [0] uses one process per websocket connection to handle all those things. So there are always ways to optimize. It’s similar to the difference between cowboy and elli, which shows that for simple use-cases sometimes it’s better to keep everything in the same process (cowboy creates a process for each connection and then also for each request, elli keeps it all in one).

[0] https://github.com/synrc/n2o