Benchmarks - how important are they to you? To Elixir?

Inspired by @atraac’s post here - how important are benchmarks to you and how important do you think they are to languages like Elixir? (Languages where performance and scalability are some of their biggest selling points.)

As has come up in conversation in a few threads over the years, there seems to be a general frustration with benchmarks like the TechEmpower ones, because they don’t really reflect real-life usage. That’s particularly bad for Elixir and Erlang, because real-life situations are exactly what they excel at!

Yet despite this, we continue to be swayed by them. I suppose we simply resign ourselves to the fact that they at least provide ‘some’ insight, and while this may not be true for the most experienced developers, for the majority (or at least a significant proportion) they appear to be a relevant and important factor.

I know that when I was first drawn to Elixir, the benchmarks where Phoenix (and Plug) were outperforming frameworks like Scala’s Play and Go’s Gin played a fairly significant part in influencing my decision to try Elixir, and I saw it generate excitement in others too. At the very least, I think they help reinforce your decision and make you feel good about your choice. As with many things in life, all the little things add up, and I think benchmarks are definitely one of those things when it comes to tech.

Many of us have also seen Rust’s popularity sky-rocket of late, and it doesn’t seem unreasonable to think that part of this may be due to its place at the top of the TechEmpower benchmarks.

Personally I think it’s difficult to argue against benchmarks playing a role in the developer decision-making process (and thus adoption), but what do you think? Are benchmarks important to you and to languages like Elixir?

4 Likes

Personally I don’t care much about raw hello world style benchmarks.

I’m more interested in practical examples that combine performance, code readability, extensibility, and how much code needs to be written, compared across languages and frameworks while everyone is using the best practices and features of their chosen tech stack.

More so, end-to-end tests such as “I made this type of request; how long does it take to render the response in a browser?”. I don’t care about the layers, just the end result.

For example, Rails tends to be pretty slow at compiling its views (templates) but it has web framework features to help combat that with very little extra coding.

If you directly compared rendering a Rails response on a read-heavy page that assembles a few partial templates and a couple of database queries against Phoenix, then Phoenix is going to win by a lot.

But in the real world with Rails you would add one line of code to cache the response with Russian doll caching, and now your Rails solution doesn’t have to compile anything or perform any database queries. It may end up being faster than the Phoenix response or about the same, who knows.

And the big difference there is that Rails has framework features to make this style of caching as easy as you can ask for, but with Phoenix, if you wanted a similar level of caching, you’d have to write a lot more code and probably get sidetracked trying to develop a cache abstraction on top of Cachex instead of writing your app’s code.
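To make that concrete, here’s roughly what the hand-rolled Phoenix side might look like with Cachex. This is a minimal sketch under several assumptions (noted in the comments), not a recommendation:

```elixir
# A minimal sketch of hand-rolled response caching with Cachex in a
# Phoenix controller. Everything here is illustrative: it assumes a
# `{Cachex, name: :page_cache}` child in your supervision tree, a
# hypothetical MyApp.Content.get_page!/1 context function, and the
# older Phoenix.View rendering API. Unlike Russian doll caching,
# nothing invalidates this entry for you.
defmodule MyAppWeb.PageController do
  use MyAppWeb, :controller

  def show(conn, %{"id" => id}) do
    {_hit_or_miss, body} =
      Cachex.fetch(:page_cache, {:page, id}, fn _key ->
        page = MyApp.Content.get_page!(id)
        {:commit, Phoenix.View.render_to_string(MyAppWeb.PageView, "show.html", page: page)}
      end)

    html(conn, body)
  end
end
```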

I try to consider these things when factoring in benchmarks, because in a real application you aren’t coding in a vacuum. You’re pulling out all the stops to make the best possible experience you can in your current environment. Personally I would want to use the tool that makes things fast enough while being the easiest and most straightforward way to pull it off.

1 Like

Like I mentioned in my post, I understand that most technical people know that benchmarks (TechEmpower especially) do not represent real life. But besides developers, there are also a ton of other people in the industry who are more easily influenced by Medium articles, benchmarks and other stuff that in turn can influence the tech used by a given company.
I feel like TechEmpower shows Elixir in a bad light. I know that we want to present e.g. Phoenix in a fairly real-life way, but the truth is, no other framework there does that.

I’ve seen the .NET implementations for TechEmpower. I would not ever, EVER, in my life write an API like that, using Spans and hardcoded, static headers. But people still use the argument that .NET (Core+) is ‘fast’ and ‘high in benchmarks’; half of Reddit is wet just thinking about ‘how fast .NET Core is’ now, while probably none of them have ever implemented something that has to have a sub-100ms response time, or was ever bottlenecked by setting the date in a header…

What I especially meant with that post is that I’d love to see more knowledgeable people in the Elixir community try to optimize the hell out of that benchmark, to show the ‘outside world’ that we can do it as well. I’m too much of a scrub in this area to really help with anything yet (though if I can, I’d love to).
It would also be awesome to have a benchmark of the real-life framework implementations that @AstonJ mentions here.

In my mind, these benchmarks are more of a marketing statement for frameworks/languages than anything else. People all over the internet keep quoting TechEmpower when they consider the speed of certain stacks. Do they know that that implementation in no way represents their CRUD web app, which will be bottlenecked by IO anyway? Of course. Do they care? I honestly don’t think so. I keep bringing up TechEmpower as an example because, as a backend developer, it is really the only ‘benchmark’ I know of (besides https://benchmarksgame-team.pages.debian.net/ which is even more questionable), which I think says something about its popularity. Also, just googling something like ‘web framework benchmark’ has them at the top.

I think there are a lot of things that people care about when deciding on a stack, and I’m sure that topping benchmarks won’t suddenly make Elixir the coolest kid on the block, but I think it would be a step in the right direction.

2 Likes

Performance is important, but what I really like about the BEAM is how well it scales and how it does really sensible things. If you have your app on a small instance it will just use what it needs and still handle a lot of connections, and if you put it on a bigger instance it will scale up too.

To do the same with other languages you always have to pick strategies that come with different compromises.

I just like how well you can handle scaling; most modifications aren’t a complete rewrite but a small tweak.

I get really annoyed with some of the benchmarks (Benchmarks Game) because they are often using the exact same library for the computations, so it’s really just a question of how well your language’s memory structures line up with the library :stuck_out_tongue:

With low-level languages you are often shooting yourself in the foot, because your app will do the exact thing you told it to, really fast, while ignoring other duties.

I have found that you can write really competitively fast code in Elixir because it does a lot of the right things for you already. I have a JSON diff library that I could not for the life of me compare to other libraries, because at the scale where those ns or ms mattered, the others had already segfaulted.

1 Like

Great minds think alike, Nick!

We had a thread about this on DT and I mentioned that we’ve thought about creating something that would be much more useful than what’s currently out there: benchmarks based around real-world apps (or components of them) written in the style that’s typical of that language or framework (so no cheating!).

I agree caching in Rails is brilliant, and it would be interesting to see how it would compare to a similar type of Phoenix app (who knows, maybe you/we would be surprised by the results?). The results wouldn’t only be useful for developers weighing up tech, but for the maintainers of those projects too, ultimately resulting in what I think would be a better product.

This is exactly why we need a new type of benchmark! :lol:

I mentioned this idea to PragProg very briefly a while ago, and I think we could definitely get a few partners on board that would help give it the kind of exposure it would need to be taken seriously (and it’s awesome that now we have a multi-lang platform that we can leverage to do stuff like this).

It’s definitely something I was hoping we could do, though perhaps next year - however, looking at the landscape I am beginning to think we may need something sooner.

That’s definitely worth making a big deal of as well, Olafur :smiley:

People who don’t understand what they’re talking about, yes.

TBH, reading into the shenanigans around Actix and some of the other “top speed” competitors has negatively affected my opinion of TechEmpower in particular, because it’s gotten so artificial; like 1000HP supercars that are fun to drive on a track but get 25 miles to a tank of gas.

4 Likes

I use benchmarking solely to decide between multiple potential implementations when their other characteristics are identical, so it comes down to which is most performant.
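For what it’s worth, that kind of decision benchmark is cheap to write in Elixir with Benchee; this sketch is essentially the example from Benchee’s own README:

```elixir
# Comparing two candidate implementations with Benchee (essentially the
# example from Benchee's README). Assumes {:benchee, "~> 1.0", only: :dev}
# in your deps; run with `mix run bench.exs`.
list = Enum.to_list(1..10_000)
map_fun = fn i -> [i, i * i] end

Benchee.run(%{
  "flat_map" => fn -> Enum.flat_map(list, map_fun) end,
  "map + flatten" => fn -> list |> Enum.map(map_fun) |> List.flatten() end
})
```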

I see little value in comparisons between languages. Runtime and/or memory efficiency is usually not the most important factor in the apps I am building on a day-to-day basis, so I prefer to make decisions based on developer efficiency and short- and long-term maintainability instead.

4 Likes

In fairness to Actix, while they were doing stuff that isn’t considered best practice in Rust, it is (IIRC anyway) being used in production by Microsoft. I think this further supports the need for different categories in such benchmarks: one for what is typical of that language or framework, and another where you can hack things to your heart’s content to squeeze out every last drop of performance :lol: This would help people see what they can get out of a system from the get-go, and what its potential could be at a later stage.

I’d agree that for 90% of apps most languages would probably suffice; however, for some strange reason we seem to be attracted to the promise of more. I don’t think that’s necessarily a bad thing though - having ambitious plans for a project will probably (IMO) lead to a more polished product, because the developers may be much more passionate about it :smiley:

Edit:

I took a trip down memory lane via an old MetaRuby thread (which was, funnily enough, after looking for a thread to link to in this recent thread - funny how one thing can lead to another :lol:) and this was one of the graphics that was part of the whole hype/excitement:

There are lots of comments about Elixir being like “Ruby on steroids :heart_eyes:” etc :laughing: …so I’m more certain than ever that it (the performance aspect) plays (or played, at least back then) an important part…

1 Like

I get really annoyed with some of the benchmarks (Benchmarks Game) because they are often using the exact same library for the computations…

Please tell us specifically which tasks you think show programs which use “the exact same library”.

Please tell us how many other tasks are shown which you do not think show programs which use “the exact same library”.

If all you mean is that many of the pidigits programs use GMP, and many regex-redux programs use PCRE, then how does that affect the comparison of, say, Erlang and Ruby programs?

2 Likes

Same, that’s what I want to see in benchmarks. People get too focused on low-level details!

50/50; obviously Rails is more mature, but Cachex is quite easy as well, plus many – myself included – find the hand-crafted approach with just a few more lines of code preferable because it’s more explicit. That’s beside the topic though, and I get your point, just not sure if your example is precise.

Generally agreed, but “easiest” isn’t the only important metric. Simple isn’t easy, but simple makes your code more future-proof. So I’d slightly modify your statement as it applies to me: “I prefer to benchmark the simplest solutions (even if they turn out not that easy to code) because that’s what makes a project easily evolvable and is thus part of the ‘real world project’ definition”.

I’ll also link to The Real World Project, which is IMO a very good starting point for more realistic benchmarks.

1 Like

Absolutely. When I work with Elixir and Rust and get torn between 2-3 implementations, I simply write a benchmark to decide what to use in the current project.

Yep! Like every “game” where score is the only thing that matters, the very predictable result is that people will – ahem – game the system to get as much score as possible.

I too don’t take T.E. seriously these days. They don’t do enough code quality control.

1 Like

What about Telemetry, something that is quite specific to the BEAM ecosystem?

1 Like

Most popular languages / frameworks have their own features and libraries to get similar stats.

For example, with Flask, Rails, Laravel and Django there are various forms of “debug toolbars” that give you stats like how long a request took, how many DB queries you ran, how long those queries took, how much time was spent compiling a view, queries against Redis, etc…

These are drop-in solutions requiring zero lines of code, and they are extensible so you can add your own custom stats. There are also language-specific tools for profiling function calls and getting breakdowns of how long it took to execute a specific part of a function, etc…

Then there are also a bunch of APM tools out there for nearly every language that’ll let you get in-depth stats about pretty much anything you could imagine. All results are persisted too, so you can compare and chart the results over time.

System stats like memory/CPU usage are also a solved problem, with many dozens of tools. Oftentimes you can get a moving chart (with weeks of persistence) of this straight from your cloud provider, for free and without having to set anything up. DigitalOcean will do this for you.

1 Like

I meant telemetry could answer your question about request duration…
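For anyone following along, attaching to the event Phoenix emits when a request finishes only takes a few lines. A minimal sketch (the handler id is arbitrary, and the duration measurement arrives in native time units):

```elixir
# A minimal sketch: handle the [:phoenix, :endpoint, :stop] telemetry
# event that Phoenix emits when a request finishes. The handler id is
# arbitrary; the duration is converted from native time units.
:telemetry.attach(
  "log-request-duration",
  [:phoenix, :endpoint, :stop],
  fn _event_name, %{duration: duration}, _metadata, _config ->
    ms = System.convert_time_unit(duration, :native, :millisecond)
    IO.puts("request finished in #{ms}ms")
  end,
  nil
)
```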

1 Like

Yeah totally, like other languages, Elixir has its own take on metric gathering too.

Originally you mentioned that the idea of telemetry is unique to the BEAM. I was just trying to say that lots of other languages have the ability to get various metrics out of their runtime and then display those metrics in easy-to-digest ways. It’s just not called telemetry in most other languages.

1 Like

I’m speaking about telemetry, and you’re speaking of other frameworks… I don’t see the relation.
Have you ever implemented telemetry in Elixir?
How it is done is specific to the BEAM.

Hi Isaac,
thank you for coming here and being curious about our discussion of the project.
I haven’t written code for the Debian Benchmarks Game yet. But yeah, I was looking at GMP, not realizing that Erlang uses GMP under the hood :smiley: and that it also uses PCRE.

So it would probably be interesting to see if we can bring the speed closer to the top.

It’s really hard making these kinds of benchmarks, because each language has its own way of doing things that might not align with those kinds of benchmarks.

So I’ll probably try to contribute Erlang or Elixir examples if I can find some speed-ups, instead of complaining about them on a forum :wink: