Techempower benchmarks

This is just not true. Our channel benchmarks do more than just accept a connection. We exercise the channel protocol, which spawns an isolated, concurrent process monitored by the underlying WebSocket transport, which is subscribed to our pubsub layer. We also put the pubsub layer through its paces by broadcasting messages to the 2M clients.

This is not a fair point. Saša has done no mental gymnastics here to make up for any apparent Elixir or Phoenix deficiencies. For anyone remotely adept at writing Elixir, the TechEmpower app is obviously poorly written, from configuration to middleware to the lack of concurrency in the db tests, and more. My experience with them over the last two years has given me little confidence to see value in their results.

I’ll also note that this thread is turning unnecessarily negative, so please take care to avoid things going toxic.

8 Likes

It’s designed this way because channels provide:

  1. isolated, concurrent lines of communication on the underlying multiplexed transport process. You don’t want an error in one channel to bring down the others. Likewise, you don’t want one channel to block others when doing any blocking or intensive work. Put simply, it’s designed exactly how you’d manage other systems in the runtime – isolated, concurrent processes that are supervised.
  2. client reception of channel errors. If we used a single transport process and one of your channels caused a crash, you’d take out the physical connection. With our setup, clients get notified of channel crashes without taking out the connection and can rejoin the channel without the extra trip to reconnect.

Yes, these things require 3 processes vs 1, but our 2M-connection benchmarks show exactly that this overhead scales. Phoenix also allows you to write your own transport that would forgo this multiplexed model. @sasajuric has also done recent transport work to make things more extensible for these scenarios. That said, for the vast majority of use cases our default approach is the way to go.
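To make point 1 concrete, here is a minimal, hypothetical channel module (the module, topic, and event names are illustrative, not taken from the benchmark code). A raise in `handle_in/3` kills only that channel’s process; the transport process and the client’s other channels keep running:

```elixir
defmodule MyAppWeb.RoomChannel do
  use Phoenix.Channel

  # Each topic a client joins gets its own supervised channel process.
  def join("room:" <> _room_id, _params, socket) do
    {:ok, socket}
  end

  # If this callback raises, only this channel process crashes; the
  # multiplexed transport survives, the client is notified of the channel
  # crash, and it can rejoin the topic without reconnecting the socket.
  def handle_in("new_msg", %{"body" => body}, socket) do
    broadcast!(socket, "new_msg", %{body: body})
    {:noreply, socket}
  end
end
```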

7 Likes

Thanks for taking the time to break it down.

I’m sorry you have that impression, because that’s not what I’m saying. I wouldn’t have any better opinion of TE even if Phoenix were the winner in all categories.

If you take a look at the post you’ll see that it was a bench which uncovered a couple of issues in Phoenix. Prior to that test, Phoenix was breaking at some 40k connections. So the test was useful to discover those issues and fix them.

The Phoenix bench also initially demonstrated the flaw in Phoenix, and yet I still like it. What I like about it is that it has a very clear focus, and that it doesn’t fall into a trap of comparing stacks based on a number. TE is the complete opposite, with the update test being extremely contrived, and with the ranking of stacks based on some number being very shallow.

Performance can always be improved, but we need properly focused tests for that. The update test is IMO not such a test.

The difficult/challenging part is that you need to do this in every goroutine you spawn. Even then, if there’s an unhandled panic in a goroutine started by some dependency, the system will fail. Another problem is that because of the mutable shared data, you might end up with inconsistent state and further bugs. I’m not saying that it’s impossible to work with those properties, but rather that it’s harder. You need to pay more attention, but no matter how much attention you pay, some risks never fully disappear.

While in Erlang fault-tolerance doesn’t happen by magic, in many cases it’s ensured by the basic properties of the runtime. When I was working on my first system, I was frequently surprised how the system survived various individual crashes and self-healed, even if I didn’t plan for it. The tech had my back when I didn’t. This is something I can’t put a number to. But I can say that I was (and still am) very grateful to learn how much a tech can help me make my systems more resilient.
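The self-healing described above comes largely from supervision. A minimal sketch (all names are illustrative): a GenServer that crashes on a bad message is restarted by a `:one_for_one` supervisor, while any sibling processes keep running undisturbed:

```elixir
defmodule MyApp.Worker do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg, name: __MODULE__)

  def init(state), do: {:ok, state}

  # A poison message crashes only this process...
  def handle_cast(:boom, _state), do: raise("boom")
  def handle_cast(msg, state), do: {:noreply, [msg | state]}
end

# ...and the supervisor restarts it with a fresh state, so the rest of
# the system never notices.
Supervisor.start_link([{MyApp.Worker, []}], strategy: :one_for_one)
```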

4 Likes

I stand corrected. I’m not sure why I got that impression. I should have double checked first. Thank you.

1 Like

Exactly! And this is something you can decide for your particular case.

And this, I think, is the core difference between our opinions. You seem to think that you can somehow assign a number to foo and bar, and then pick one based on that number, whereas I argue that it’s nowhere near as simple as that, and that most likely a number-based comparison is going to be very shallow and misleading.

Nine nines is mostly shallow marketing, so let’s skip that. The fact is that people have been building highly available systems in various technologies, so it’s definitely doable with or without Erlang. The question, however, is how much a particular tech helps you with that. My experience is that Erlang helps a lot, thus making that difficult job easier. However, I cannot quantify that into a number.

Precisely, but the point of that test was to verify that Erlang is capable of handling my load. It wasn’t a comparison of Erlang to other stacks, but merely a verification that the stack is capable for what I needed.

I’m not going to say that bench comparisons are always bad, but I’ve yet to see one which actually makes sense :slight_smile:

Personally I think these things are hard to properly quantify. However, looking at the properties, you can at least decide which ones you value. For example, Erlang has excellent support for fault-tolerance, but is unspectacular at CPU processing and terrible at crunching numbers. Knowing that, I can pick or avoid Erlang, depending on the particular specifics of the problem I’m solving.

And this, I think, is the core difference between our opinions.

I don’t think it is so much that we disagree, but that to some degree you have been caricaturing my arguments and have been somewhat inconsistent in your own.

Please note, my initial post was in response to the following comment from you:

If we could have an easy way of installing TE suite locally and running, we could have an automated way of benching our frameworks, finding out possible bottlenecks, and even detecting regression performance. Not sure to what extent can TE code currently be used for that, but I agree that this can be useful.

So, initially you seemed to acknowledge value in making use of the TE suite for measurements. I was just pointing out that it is indeed fairly easy to get the suite up and running on a local machine with the provided Vagrant setup. You could probably be fully set up and ready to play with the code in about 30 minutes. Furthermore, the endpoints are quite simple, so I would think an experienced and knowledgeable Elixir/Phoenix dev could quickly make improvements if the current code is not as idiomatic or performant as it could be. If nobody cares about what Phoenix looks like in the TE data, that’s fine, but if there’s any interest in making improvements, we’re probably talking about a time investment of a few hours at most.

You seem to think that you can somehow assign a number to foo and bar, and then pick one based on that number,

I don’t think I have said anything like that. When choosing a technology, many attributes may be relevant to the decision. Performance (speed) may be one of those attributes. Reliability may be another attribute. Both of those things (among many others) can be quantified in various ways, and our decisions should be better to the extent that we have high quality measures of attributes that matter.

The question, however, is how much a particular tech helps you with that. My experience is that Erlang helps a lot, thus making that difficult job easier. However, I cannot quantify that into a number.

But you can quantify that into a number, and implicitly, you have. When you say “Erlang helps a lot,” there is an implicit comparison there – you are saying it saves you time, effort, headache, etc. compared with other options you have tried or heard about. If it doesn’t really save you time and cost over other options, then why prefer it and recommend it? You may not have precise measurements of the benefits and cost savings, but (a) you at least have a sense that they are non-trivial (i.e., they exceed some threshold), and (b) they could in principle be measured more precisely, and it would be useful if they were.

Yes, but how do you know if Elixir exhibits better fault-tolerance, stable latency, and uptime compared with other options?

By studying the properties of the runtime and the language, and understanding what tools it gives you to help you achieve those properties.

You indicated that in your own initial evaluation of Erlang, you tested it out – you didn’t pick it blindly after merely analyzing its properties.

Precisely, but the point of that test was to verify that Erlang is capable of handling my load. It wasn’t a comparison of Erlang to other stacks, but merely a verification that the stack is capable for what I needed.

The above exchange is another example where you appear to be disagreeing with me but have in fact simply restated my point. I started by asking how you know Elixir is more reliable, and your initial response was by mere analysis of its properties, independent of its actual real world behavior. I pointed out that you also need to verify any analysis by testing actual behavior, and that you in fact have done so yourself – and now you seem to be agreeing with this point.

Personally I think these things are hard to properly quantify. However, looking at the properties, you can at least decide which ones you value. For example, Erlang has excellent support for fault-tolerance, but is unspectacular at CPU processing and terrible at crunching numbers. Knowing that, I can pick or avoid Erlang, depending on the particular specifics of the problem I’m solving.

How can you pick without making at least some rough attempt to quantify? Let’s say you need to do a lot of CPU processing. How do you know whether Erlang is or isn’t as fast as any other language you might consider, or whether the differences really matter in terms of cost, etc.? Presumably at some point you are attempting to make some measurements, or relying on measurements others have made.

Finally, to be clear, I am not defending the TE benchmarks, and I think you have made many good points about the qualities of Elixir/Erlang and the many factors to consider when evaluating technology solutions. Thank you for sharing your experience and insights.

1 Like

If I did that, it wasn’t intentional, sorry about that.

I did, but only for isolated measurements (i.e. not comparisons). You may also have noticed that I said this:

So my point was never about improving Phoenix position on the TE list.

Moreover, somewhere during this exchange I took a glance at the Phoenix test code, and then at the requirements for the update task, and that further lowered my already low opinion of TE :slight_smile: which is possibly the reason for my gradually worsening tone about TE.

Sure, but it’s not an exact number, and it’s a very subjective and vague interpretation, although it’s based on my own experience. I cannot put a number on it, like TE does, and my point is that the TE number is garbage, just like the number from any other comparison bench would be.

Exactly, it’s a feeling/sense, not a precise number. We might put a number to it in the context of the system. So I could e.g. measure percentage of successful requests, latencies, uptime, etc. But even those numbers wouldn’t be meaningful for comparison between two frameworks. So to restate my point, rather than comparing techs through some number(s), take a look at what they give you, and how you can use their building blocks to make your system.

What I tested was not reliability. It was a load test. The reliability I refer to is about fault-tolerance - the ability of the system to mitigate all sorts of outages, keep running with little-to-no downtime, and self-heal to resume the full service. Perhaps I confused you with my terminology there?

I didn’t perform a particular test of fault-tolerance, other than playing with some toy examples as I was learning the language. To be honest, I was pretty clueless for the first year or so, I did many things wrong, and yet the production miraculously survived all sorts of problems. So the production was the proof of fault-tolerance promise, though I did have some idea about those properties, because I actually studied various material about the tech.

It depends. For some things I’ll of course measure, at least to validate that I have reached desired perf. If I’m unsure, I might do a quick experiment. For some other things, I wouldn’t even touch Erlang. If I’m writing a one-off program (as opposed to a continuously running system), then many benefits and trade-offs of Erlang make little-to-no sense, so I’d likely consider something else.

I didn’t say (or at least didn’t try to say) that measurements are bad. I did benches during my talks, I have a blog post about a shallow bench of Phoenix, and I certainly bench during my work. My point was that comparison benches such as the TE one are fundamentally flawed, because they indicate that the whole stack should be chosen based on a number. And I think this is very shallow and misleading.

It was a good exchange and I hope I didn’t come off as arguing for the sake of argument, because that wasn’t my intention :slight_smile:

2 Likes

I tried to write a quick example using what I naively assume are the fastest possible combinations of libs available to Elixir at https://github.com/fishcakez/FrameworkBenchmarks/commit/4099828651725837bc61946a43fb5afb329ddb0e. I did not do ANY tuning, except to automatically adjust the pool size based on machine size. Hopefully someone can compare it with the existing Phoenix implementation and fix things up there. On my machine throughput is higher for all tests, especially multiple queries and updates.

5 Likes

https://groups.google.com/forum/#!topic/framework-benchmarks/nePDNY9jp-4

Phoenix is missing, again. :frowning:
logs: http://tfb-logs.techempower.com/round-13/preview-2/logs-20161031-1/phoenix/
exact line:

Setup phoenix: dpkg-deb: error: `elixir_1.3.2-1~ubuntu~trusty_amd64.deb' is not a debian format archive
3 Likes

I see the issue:

$ grep -R elixir .
...
./toolset/setup/linux/languages/elixir.sh:fw_get -O http://packages.erlang-solutions.com/debian/pool/elixir_${VERSION}~ubuntu~${RELEASE}_${ARCH}.deb
...
$

That evaluates to http://packages.erlang-solutions.com/debian/pool/elixir_1.3.2-1~ubuntu~trusty_amd64.deb, which doesn’t exist. Their version and arch are out of date.
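For reference, the expansion can be reproduced in a couple of lines (the variable values are taken from the failing round-13 log above):

```shell
# Reconstruct the download URL the same way the setup script's
# interpolation does, using the values from the round-13 run.
VERSION="1.3.2-1"
RELEASE="trusty"
ARCH="amd64"
echo "http://packages.erlang-solutions.com/debian/pool/elixir_${VERSION}~ubuntu~${RELEASE}_${ARCH}.deb"
```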

I’ll submit a pull request.

Edit: Done.

3 Likes

New preview. https://www.techempower.com/benchmarks/previews/round13/azure.html
So far so good. Thanks again for the patch

1 Like

Phoenix did really well on the data updates (2nd best) but not as well on multiple queries, which should be roughly analogous. I wonder what’s up with that.

1 Like

In Multiple Queries it’s just SELECT + serialization to JSON. In Data Updates it’s SELECT + UPDATE + serialization to JSON. The relevant code is here: https://github.com/TechEmpower/FrameworkBenchmarks/blob/master/frameworks/Elixir/phoenix/web/controllers/page_controller.ex But I don’t judge those results, nor have I looked at any optimizations.

But… I don’t like this try/rescue. Why String.to_integer/1 and not Integer.parse/1? Also, why params["queries"] and not Map.fetch(params, :queries)? It all looks too much like Ruby or Python… I don’t know. And as much as I love Python, and work in Ruby (don’t love it though ;P), you have much better ways to express intent in Elixir, so why not use them?

I just don’t trust those benchmarks at all :wink:

1 Like

They’re not equivalent though. It would be Map.fetch(params, "queries"). Personally I would pattern match it in the function head (def queries(conn, %{"queries" => queries_param}) do).
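A sketch of that suggestion, combined with the earlier Integer.parse/1 point (run_queries/1 is a hypothetical helper standing in for the benchmark’s query logic, not code from the TE repo):

```elixir
# Match the raw "queries" param in the function head; Integer.parse/1
# returns {integer, rest} or :error, so no try/rescue is needed.
def queries(conn, %{"queries" => queries_param}) do
  queries =
    case Integer.parse(queries_param) do
      {n, _rest} when n > 0 -> n
      _ -> 1
    end

  # run_queries/1 is a hypothetical helper, not from the TE repo.
  json(conn, run_queries(queries))
end
```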

1 Like

True, I forgot that atoms aren’t garbage collected :slight_smile:

2 Likes

This may help https://github.com/TechEmpower/FrameworkBenchmarks/pull/2350/files

1 Like

Pattern matching is probably faster too. Have you considered putting in a PR?

2 Likes

I believe today is the last day they’ll accept pull requests for round 13.

I’m testing it now. If it is, I’ll try to get one in in time.

3 Likes

Thanks for working on it and submitting PRs to fix the benchmarks!

1 Like