TechEmpower benchmarks

We’d need somebody with more expertise than me to speak to the channels implementation and why it’s built the way it is. @chrismccord want to chime in? :slight_smile:

1 Like

What I’m saying is that I can’t conclude anything reliably from that benchmark. That’s my main point I’m arguing in this thread.

When I first evaluated Erlang, I did a quick, dirty, and very simple simulation of the target system, issuing 10x the expected load for a duration of 12 hours. I also did a few minor experiments, just to make sure I could do some typical stuff, such as talking to the database, working with XML, and the like. I needed to see first-hand that I could handle the desired load. Once I established that, I didn’t care much about a few microseconds here or there.

Sure, but the thing is that raw speed is usually not the only, nor the most important factor. Once you have the tech which can handle your load, other things start to matter, such as the support for fault-tolerance, stable latency, and troubleshooting a running system. Erlang/Elixir excel at this, and that matters, because it improves the uptime and availability, and makes the life of developers much easier. I find this important, because the cost of downtime and the cost of developers are IME much higher than the cost of hardware.

Moreover, with proper algorithmic and technical optimizations, in most cases the observed speed difference between two stacks ends up much smaller than in contrived benchmarks such as TE. In particular, when it comes to Erlang, based on the fact that I worked with it for 7 years, and that others have done wonders with it, I’m confident that it will suffice in the vast majority of cases. In the cases where it doesn’t (e.g. heavy number crunching), I can always step outside and reach for, say, C, Rust, Go, or some other more performant language.

That might be true, but one big problem I have with TE is that they put the ranking list right in your face. It’s literally the first thing they show you. So there’s a big implication that a higher positioned stack is necessarily better.

Another problem is that I think the tests themselves are shallowly executed, which has also been argued by others in this thread.

Finally, the tasks themselves seem quite contrived. Take a look at the updates task. My first idea to optimize this would be to try to update everything in a single round-trip to the database. If that didn’t work, I’d at least remove the needless record read. But that’s not allowed by task rules. Which makes the task highly contrived IMO.

There’s a whole other bag of tricks we can throw at the problem. They usually come with some associated trade-off, but those trade-offs can only be evaluated in real life (which TE is not). Caching, for example, is not allowed, but it’s one of the main optimization techniques. Since you can do easy caching in Erlang without needing to run an additional process and serialize to/from JSON, it could do wonders for the perf, especially compared to many other techs. But that’s not tested at all in TE. So what can I conclude from the results of the TE bench? For me, the answer is: nothing :slight_smile: Not even about performance, let alone about any other properties of a tech stack.
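
To make the caching point concrete, here’s a minimal sketch (module and table names are mine, not from any benchmark) of a read-through cache on top of ETS. Cached values are plain Erlang terms in the VM’s memory, so no extra cache server process sits on the read path and nothing is serialized to/from JSON:

defmodule WorldCache do
  @table :world_cache

  # Create the table once, e.g. from your application's start/2 callback.
  def init do
    :ets.new(@table, [:named_table, :public, read_concurrency: true])
  end

  # Read-through cache: return the cached term, or compute, store, and return it.
  def fetch(id, fun) do
    case :ets.lookup(@table, id) do
      [{^id, value}] ->
        value

      [] ->
        value = fun.()
        :ets.insert(@table, {id, value})
        value
    end
  end
end

# Usage (hypothetical): WorldCache.fetch(id, fn -> Repo.get(World, id) end)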

Assuming that you pay some developers, administrators, support staff, and others to manage a system with such a large userbase, 1.5k/mo doesn’t seem like a significant cost on the total expense sheet. You could get bigger savings by choosing a technology which allows developers to efficiently and confidently manage that kind of load, to keep the system stable and running as much as possible, and to reduce the load on the support team. Again, raw performance is just a part of the story. It matters, sure, but up to a point. You need to balance it with other properties.

Also, in most cases, I don’t expect a dramatic difference between Erlang and other technologies in terms of hardware costs. While for some cases (e.g. computation-heavy tasks) Erlang can be an order of magnitude slower, that doesn’t necessarily mean you can save an order of magnitude in hardware cost, unless 100% of what you do is crunching numbers.

4 Likes

The process per channel and a separate process per socket ensure that:

  • A crash in a channel doesn’t take down the whole connection and other conversations.
  • A busy channel doesn’t block the communication on other channels.

As usual this comes with a trade-off, in this case in terms of both memory, and a slight perf cost (because of extra message hopping and scheduler overhead).
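
To make the isolation concrete, here’s a minimal, hypothetical channel sketch (module and topic names made up): a raise in one handle_in clause only terminates that channel’s process; the transport (socket) process and any other joined channels keep running, and the client can rejoin the crashed topic.

defmodule MyAppWeb.RoomChannel do
  use Phoenix.Channel

  def join("room:" <> _id, _params, socket), do: {:ok, socket}

  # A crash here takes down only this channel process, not the socket
  # process or the other channels multiplexed over the same connection.
  def handle_in("boom", _payload, _socket) do
    raise "simulated failure"
  end

  def handle_in("ping", _payload, socket) do
    {:reply, {:ok, %{pong: true}}, socket}
  end
end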

Currently, if you don’t like that trade-off, you can always fall back to plain cowboy. I argued that Phoenix could be slightly modified to make it possible to opt-out from the default approach and have just one process per socket. Some work has been done by José and me on this, but we kind of stashed it a year ago. José mentioned he’ll get back to it at some point.

Regardless, I believe that the current Phoenix approach is a good default, because it values fault-tolerance and overall system responsiveness, which is a good bias for many cases.

6 Likes

I actually looked at this when we fixed up things for the other benchmarks, but gave up on it. Honestly the config for the DB is insane.
In general the phoenix code is not optimized, and in my eyes it doesn’t look very functional or elixir-ish.

Three significant things slow down phoenix:

  1. Json encoding - obviously slower in elixir than in c.
  2. DB test - pool size of 20 and an unrealistic benchmark setup (PG has max_connections 2000 and the test is done at a low (keepalive) concurrency of 256 - thus not requiring a DB pool at all - and indeed some of the frameworks are not using one; see the config sketch after this list)
  3. Minor optimizations, making it faster and more elixir-ish - (pattern match params, batch update etc.)
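
For reference, the pool size being discussed is just a Repo config value; a hedged sketch (app and repo names are hypothetical):

# config/prod.exs
config :hello_phoenix, HelloPhoenix.Repo,
  pool_size: 20  # the benchmark's value; raising it only helps if concurrency actually exceeds it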

notes on the different functions from back then:

1. json
def _json (json bench) and def db (db+json bench)
use jiffy to encode.

Benchee.run(
  %{
    "poison"   => fn -> Poison.encode!(%{message: "Hello, world!"}) end,
    "jiffy"    => fn -> :jiffy.encode(%{message: "Hello, world!"}) end,
    "jiffyerl" => fn -> :jiffy.encode({[{<<"message">>, <<"Hello, world!">>}]}) end
  },
  time: 50,
  parallel: 8
)

Comparison: 
jiffyerl      343.87 K
jiffy         303.72 K - 1.13x slower
poison        188.26 K - 1.83x slower

2. def queries and def updates params to int
both def queries and def updates currently do less-than-optimal param parsing. by doing the classic (conn, %{"queries" => queries_param}) and pattern matching all the way down to an integer, it’s ~50% faster (but it’s really fast anyhow)

benchee:
    new        4.09 M
    old        2.74 M - 1.49x slower

this does require some rework to handle the missing-params case. I propose this, which is hopefully also much more idiomatic:

  #pattern match queries and value queries_param
  def queries(conn, %{"queries" => queries_param}) do
    q = try do
      String.to_integer(queries_param)
    rescue
      ArgumentError -> :not_integer
    end
    queries_rules(conn, q)
  end

  #queries didn't pattern match above aka are missing
  def queries(conn, _unused_params), do: queries_rules(conn, :missing)

  defp queries_rules(conn, queries_param) do
    case queries_param do
      :missing       -> queries_response(conn, 1,   :missing)       # If the parameter is missing,
      :not_integer   -> queries_response(conn, 1,   :not_integer)   # is not an integer, 
      x when x < 1   -> queries_response(conn, 1,   :less_than_one) # or is an integer less than 1, the value should be interpreted as 1; 
      x when x > 500 -> queries_response(conn, 500, :more_than_500) # if greater than 500, the value should be interpreted as 500.
      x              -> queries_response(conn, x,   :ok)            # The queries parameter must be bounded to between 1 and 500. 
    end
  end 

  defp queries_response(conn, parsed_param, _status) do
    conn
    |> put_resp_content_type("application/json")
    |> send_resp(200, Poison.encode!(Enum.map(1..parsed_param, fn _ -> Repo.get(World, :rand.uniform(10000)) end)))
  end

I would have liked to use Integer.parse, but unfortunately it’s much slower than the try/rescue (elixir might be in need of a String.to_integer() equivalent that returns an integer or :error rather than raising ArgumentError - or an :ok/:error tuple):

#slower than try/rescue :/
case Integer.parse(queries_param) do
  # {int, remainder} - the int is only valid if the remainder is the empty string ""
  {queries_int, ""} -> queries_rules(conn, queries_int)
  _ ->                 queries_rules(conn, :not_integer)
end

the same pattern matching params to int refactor applies to def updates

3. def updates
this is around where I gave up, as I realized the realities (or lack thereof!) of the DB benchmarks.
use the same param matching as above.
the rules do NOT allow batch querying the records (sic!). In my limited tests, running the queries asynchronously was not fruitful (yes, I did test different DB pool sizes and async levels, ymmv).
the rules DO allow batch updating.
this is what I ended up with, which I make no claims about being pretty or fully optimized:

ids = Enum.map(1..q, fn _ -> :rand.uniform(10_000) end)

ws =
  ids
  |> Enum.map(fn id ->
    Repo.one(
      from p in HelloPhoenix.Post,
        where: p.id == ^id,
        select: map(p, [:id, :randomnumber])
    )
  end)
  |> Enum.map(&Map.put(&1, :randomnumber, to_string(:rand.uniform(10_000))))

Repo.insert_all(
  HelloPhoenix.Post,
  Enum.uniq_by(ws, fn x -> x.id end),
  on_conflict: :replace_all,
  conflict_target: :id
)

and then return the ws json encoded. This uses an upsert and does the update in a single batch. Enum.uniq_by(ws, fn x -> x.id end) is there to handle the edge case where :rand.uniform(10_000) returns the same number more than once and there are duplicates in the ids list - obviously postgres barks at a batch update holding opposing truths for the same id, and I hope other DBs do the same. I’m at a loss how this is the spec for the benchmark, and it was the tipping point for me.

In general I would say we should go for pretty, clean code that showcases the readability/productivity of elixir/phoenix - I’m sure phoenix will perform fine (as it already does).

Changing the DB pool size to 2000 is just too much, and I doubt it’s even faster - especially in the real world.

I too would like to see multi-hour benchmarks, and get away from keepalive, no-GC, do-nothing-really-fast-for-15-seconds tests.
I would also add (hot/rollover) code deploys, various peak times, slow clients, endpoints with errors etc. to the multi-hour tests.

4 Likes

Just looking at the 20-query-or-update-per-request part of the test: if we broke each query/update into its own process, we’d be looking at 5,120 possible connections at the same time at the 256 concurrency level. Can the pool go higher?

Just to satisfy my own curiosity I looked at the Go code for these benchmarks expecting to see use of goroutines…but they don’t seem to be using them either. Is that considered “batching”?

2 Likes

This is an interesting example of how these benches can diverge from real life. Issuing concurrent queries is fine if you don’t expect many requests at the same time. However, if that can happen, then this approach might cause some bad effects. For example, if you and I issue our requests at the same time, and I want to update 5k elements while you want to update just 1, you might end up being DoS-ed by my request if all of my 5k updates enter the pool queue before your single update.

A proper way to optimize this IMO is to reduce the number of db roundtrips, which is unfortunately not permitted by the rules.
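
For illustration only (since the rules forbid it), a sketch of what fewer round-trips would look like with Ecto, reusing the World schema from earlier in the thread: one SELECT ... WHERE id IN (...) instead of N separate Repo.get calls.

import Ecto.Query

n = 20  # e.g. the requested query count from the URL param
ids = Enum.map(1..n, fn _ -> :rand.uniform(10_000) end)

# A single round-trip instead of n separate queries.
# (Duplicate ids collapse into one row here, which is exactly the kind of
# shortcut the benchmark rules disallow.)
worlds = Repo.all(from w in World, where: w.id in ^ids)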

In one of the Go examples I was looking at, it had a comment that I’d be interested to dig into a little bit more (from the fastest Go benchmark):

1 Like

To be clear, uWebSockets is an entirely different comparison. Phoenix Channels is a protocol built on top of websockets with distributed pubsub and communication multiplexing. uWebSockets is a raw websocket library. A more apt comparison would be cowboy ws

4 Likes

Just a note from the other numerous times TechEmpower comes up: last year (or before) they added Phoenix in a preview, but it was a very poor implementation. They were testing JSON benchmarks through the :browser pipeline, complete with CSRF token generation. They had a dev DB pool size of 10, where other frameworks were given a pool size of 100. And they also had heavy IO logging, where other frameworks did no logging. We sent a PR to address these issues, and I was hoping to see true results in later runs, but recent runs have shown high error rates and we don’t know the details why, nor have they done a great job with any of the recent code, as shown in this thread. tldr; these results are not representative of the framework and the core-team’s time is better spent elsewhere. For those interested, please feel free to send PRs to improve their code, but it’s not something I lose sleep over :smiley:

15 Likes

any real world high load scenario would put the db pool under pressure within seconds - and would in fact require a db pool.

any real world scenario would not make these kind of mass DB sequential queries - it’s an absolute anti-pattern for a relational DB - if one had a busy endpoint where this behaviour was necessary you would change the data structure and have an educational talk with the person who implemented it (and the person who approved it).

this benchmark is unreal - and it markets itself as real-world: "..provides representative performance measures.." - other benchmarks usually have the courtesy of not pretending to be real - but testing only overhead/throughput (thus using keepalive to accentuate differences) etc.

this unrealness/weirdness doesn’t particularly hurt elixir/phoenix - it just makes it moot to optimize, so I suggest going for pretty, proper code and style - and then not spending more time on it.

3 Likes

I thought it was already clear from the name. I’m just impressed with that library, that’s all. I was planning to run the uwebsockets author’s ws benchmark against cowboy’s ws, but can’t find the time.

What I’m saying is that I can’t conclude anything reliably from that benchmark. That’s my main point I’m arguing in this thread.

No measurements are going to be perfectly reliable. The question is whether there is any useful information in the data. It sounds like you believe the answer is no and that if these same tech stacks were used in real-world scenarios, the relevant performance metrics would show little correlation with the TE benchmark results (i.e., you would expect to see essentially a random permutation of the rank ordering shown on the site). Ultimately, I suppose that’s an empirical question.

Once I established that, I didn’t care much about few microseconds here/there.

Sure, but I don’t think the argument for measuring performance is typically for the sake of saving a few microseconds (assuming that’s a small percentage savings for the task at hand). The argument I hear in favor of Phoenix is that it doesn’t require the typical tradeoff between productivity/developer happiness and speed/reliability. So, the claim is being made that its technical performance (both reliability and speed) is meaningfully superior to that of other options, and that technical performance does indeed matter. It’s hard to make such claims convincingly without doing some measurement (and comparison).

Once you have the tech which can handle your load, other things start to matter, such as the support for fault-tolerance, stable latency, and troubleshooting a running system. Erlang/Elixir excel at this, and that matters, because it improves the uptime and availability, and makes the life of developers much easier.

Yes, but how do you know if Elixir exhibits better fault-tolerance, stable latency, and uptime compared with other options? These are technical performance attributes as well. Presumably that would require some measurements of different systems under similar circumstances.

If you want, you could look at any given framework in a set of benchmarks to determine the capacity of that particular framework

That might be true, but one big problem I have with TE…

Yes, there are plenty of criticisms of TE, but I wasn’t referring to TE. I was addressing your more general claim that individual measurements in isolation are useful, whereas measurements of multiple platforms under similar circumstances are not. Given that the latter can be reduced to the former, this doesn’t make sense.

You could get bigger savings by choosing a technology which allows developers to efficiently and confidently manage that kind of load, and to keep the system stable and running as much as possible, and to reduce the load on the support team. Again, raw performance is just a part of the story.

Just substitute “stability” for “raw performance” in all of your arguments against comparative measurements, and it would seem we also have no way of determining that Elixir/Phoenix is generally any more stable or reliable than any other option. Presumably for each new project, we must build a realistic proof of concept in Elixir and several other stacks to see how each perform in that particular unique system (given, of course, that no two systems are ever sufficiently alike to generalize from any previous data or observations).

Looking through some of the other tests, it does look like many are taking steps to ensure that the same database connection is used to burn through the entire set of requests. Doesn’t Ecto hand it back after every call in the current setup? Is there a way to avoid that? Ecto.Multi maybe? I saw the tests for Ruby using Sequel were all wrapped in DB.synchronize.
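
One hedged way to pin a single connection for a batch of queries in Ecto (not what the benchmark code currently does) is to wrap them in a transaction; the transaction checks out one connection from the pool and runs everything inside on it:

ids = Enum.map(1..20, fn _ -> :rand.uniform(10_000) end)

# All Repo calls inside the function run on the same checked-out connection.
{:ok, worlds} =
  Repo.transaction(fn ->
    Enum.map(ids, fn id -> Repo.get(World, id) end)
  end)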

I don’t mind benchmarks. They can be useful. But I don’t like how most of the benchmarks out there are done because they measure the wrong thing!

I think the TE benchmarks have the right idea. They test more than just the HTTP layer (i.e. database, data serialization, etc.), but they measure the wrong thing. They have the wrong test setup to give you a useful picture.

And phoenix and erlang do great in these benchmarks! If you analyze the results, some of the frameworks that beat phoenix did it only once: for a particular 15-second period, at a particular (low) concurrency level.

Unless the framework consistently gives good results across the various concurrency levels and benchmarks, something is wrong with either the test or the framework, and I would discard the result at once. If the results are jumping all over the place, I would discard them too. A max latency that is too high is also something I would consider a warning flag.

If you sort by max latency:

  • phoenix is #6 for json
  • phoenix is #2 for single query
  • it drops down a bit on the others, but is still better than average (why do they use the average? it should never be used.)

They test a closed system and not the open web. This is a very important distinction. Testing against a closed system and comparing with an open system means you will have the “coordinated omission” problem, i.e. the client and server work together to make sure the server isn’t overloaded. The open web doesn’t work like this. Clients will not stop sending requests because another request is slow.

If someone wants to re-run the TE benchmarks please:

  • Client tooling. Either use a full simulator (I think tsung will work here) or use a benchmarking tool that accounts for coordinated omission (wrk2). The difference is massive!
  • Use much higher concurrency. For the open web, if you are serving 300,000 requests per second, I can tell you that your concurrency is somewhere around the 100,000 mark at least. They test with 256.
  • Run the tests for longer time.
  • Display the full data tables, no summaries.

Run these tests again with the above parameters and I can assure you phoenix will end up considerably higher than it currently does, and it should generally outperform any ruby/python/php framework out there by a big margin.

If you ask me: If you test any other way for the open web you are wasting your time and optimising the wrong thing.

4 Likes

In general, the feeling I get from having followed multiple rounds of attempts by phoenix people to work on the TE benchmarks:

  • It is really, really hard to understand what is happening in these benchmarks
  • It is really, really hard to get anything that looks like perf data that would let you analyse and fix what makes something slow
  • The TE team tend to modify the code of a particular framework whenever they want, breaking it regularly
  • In general, getting any kind of information from the TE team is super hard.

So if you want to fix it, go for it. But I prefer to fix a complex problem or help build the next Orleans for the BEAM. That will yield a far better net result for the time invested.

3 Likes

Yeah, I believe that this selling point is poorly worded. The “speed” is a very relative thing. It depends on your needs, and your requirements. Sometimes you don’t care about milliseconds, other times a single millisecond can be an eternity.

I’d say instead that Erlang/Elixir/Phoenix doesn’t require a trade-off between productivity, fault-tolerance, and scalability, and will give you a “reasonable” speed in many cases, with the option to easily step outside of Erlang for the situations where Erlang simply doesn’t suffice.

By studying the properties of the runtime and the language, and understanding what tools it gives you to help you get those properties.

For example, in Erlang, a process crash doesn’t disturb anything else in the system, but any other process can be notified about it. This gives you a way to isolate crashes, but also to detect and respond to them, thus allowing your systems to self-heal. Shared-nothing concurrency ensures that the failing process doesn’t leave garbage behind.
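
A tiny sketch of that mechanism (nothing Phoenix-specific, just the primitives): the spawned process crashes, the crash stays inside it, and the monitoring process gets a message it can react to.

# spawn_monitor starts the process and monitors it atomically.
{pid, ref} = spawn_monitor(fn -> raise "boom" end)

receive do
  {:DOWN, ^ref, :process, ^pid, reason} ->
    # The caller is unaffected by the crash; it can log, restart, or degrade.
    IO.puts("worker crashed: #{inspect(reason)}")
end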

In contrast, in Go, an uncaught error (panic) in a goroutine takes down the whole OS process. Shared memory means that even if you recover from that panic, you might still end up with an inconsistent global state, leading to other errors. Improving fault-tolerance, while certainly possible, is going to be harder in Go, simply because the language doesn’t give you the kind of support Erlang does. This shouldn’t come as a surprise, given that Erlang was built precisely for systems which run continuously and experience as little downtime as possible.

For a few other interesting and unique properties of BEAM, you can also take a look at my ElixirDaze talk.

The individual measurements can give you some hints about how you’re doing. They can also help you check whether you’re improving or degrading in performance, as you make subsequent changes. This is where a number is good. If today it’s much higher than it was yesterday, then likely some change had affected performance in a bad way.
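
For example, Benchee (already used earlier in this thread) can persist a run and compare later runs against it; a rough sketch, assuming the save/load options of recent Benchee versions:

# First run: record a baseline.
Benchee.run(
  %{"encode" => fn -> Poison.encode!(%{message: "Hello, world!"}) end},
  save: [path: "baseline.benchee", tag: "before"]
)

# After a change: run again and compare against the saved baseline.
Benchee.run(
  %{"encode" => fn -> Poison.encode!(%{message: "Hello, world!"}) end},
  load: "baseline.benchee"
)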

In contrast, bench comparison is a very dangerous territory, because you end up reducing a stack to a number, and then choose one based on comparing numbers, completely disregarding the features it gives you.

I wasn’t suggesting reducing this property to a number. Instead, I’m saying that we should focus more on how the tech helps us with some difficult challenges. Fault-tolerance is one such challenge, which is both required and difficult even in a moderately complex system. Therefore, the question is how the tech can help us deal with this challenge from day one, when things are always simple, all the way to production, when things are never simple.

That’s definitely not an exact science, so it’s always a bet, but a deeper look into what the tech gives you can definitely help you make a more educated guess. In contrast, a simple comparison of two numbers is IMO as good as throwing a coin :slight_smile:

I’m not suggesting doing it for every project. I did it once, the first time I considered Erlang. Once that was followed by the success in production, and my deeper understanding of the tech, I got a much higher confidence about the stack.

2 Likes

I forgot to address this point, so just want to confirm that this is precisely what I’m saying. Regardless of the fact that I believe that choosing a stack amounts to way more than focusing on raw perf, my impression is that TE benches are fundamentally flawed as well as poorly conducted, so I think that their results have little correlation to what would happen in a real world scenario.

2 Likes

By studying properties of the runtime and the language, and understanding what tools does it give you to help you get those properties.

Sure, but then the question becomes how much do these properties matter in practice (i.e., how much more reliability do they really buy, and at what cost)? Ultimately, I think you need to see real measurements to confirm expectations based on analysis. An analysis of the properties of the runtime would be less compelling if we couldn’t also point to examples like the famed nine nines of uptime. You indicated that in your own initial evaluation of Erlang, you tested it out – you didn’t pick it blindly after merely analyzing its properties.

In contrast, bench comparison is a very dangerous territory, because you end up reducing a stack to a number, and then choose one based on comparing numbers, completely disregarding the features it gives you.

Sure. That doesn’t mean they can’t provide value when used appropriately. I suppose the same argument could be made about comparing systems merely based on their properties/features – a particular feature may sound compelling on paper, but if it can’t be demonstrated to tip the cost-benefit scale in the right direction in practice, it has no real value.

Sounds to me like you’re a fan of the benchmarks that show Phoenix in a good light and not so much those that show it in a bad light.

The 2M connection benchmark is basically testing Linux. It opens a socket, accepts connections and reads ~1KB. The most basic TE benchmark is more representative of HTTP load than that is of websocket load. Apparently when a benchmark “demonstrates” that Phoenix doesn’t add much overhead, that’s a good benchmark. If it demonstrates that it adds a lot of overhead, well…the benchmark is flawed, or you should care more about stability, or it doesn’t represent real world examples.

Wrapping the outer handler in a try/catch is neither difficult nor a challenge, which is why net/http does it: go/src/net/http/server.go at master · golang/go · GitHub

There are two groups of people in this conversation about performance: those who think performance could be improved and those who want to talk about everything except performance.