Tuning Elixir/Ecto/Phoenix for production

Here are my general performance-tuning tips for running Elixir at high scale:

  • This should be a given, but really: measure everything and “call your shots” before you try to make performance improvements. Have concrete objectives for what you want to achieve on your boxes (tail latency, number of connections, etc.).
  • Performance problems are almost never “low level” or at the VM level. You tune the VM for small optimizations on top of already-optimized code. There’s no magic flag that will give you some huge perf bonus.
  • Performance problems are almost always either at the application level (a bottleneck in your own system code) or in inefficient database queries. As others have said, you’ll get more mileage out of tuning your queries than out of tuning the VM.
  • The only benchmarks that really matter are the ones taken while your system is under saturation. Starving the BEAM of CPU is the quickest way to bring it to its knees. Invest in large boxes you can use to generate enough traffic to really saturate your production machines with realistic traffic. If you can’t generate enough traffic, run some other CPU-intensive process on the production box (something like a small C utility that calculates primes or does floating-point math in a tight loop should do it; a stand-in sketch follows this list).
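
If you don’t have a C utility handy, a throwaway Elixir script started as a separate OS process works as a stand-in. A minimal sketch; the core count and workload here are arbitrary, so scale them to how much headroom you want to remove:

```elixir
# burn.exs — run as a separate OS process (`elixir burn.exs`) to eat CPU on
# the box and simulate saturation.
defmodule Burn do
  # tight floating-point loop that never terminates
  def spin do
    _ = :math.sqrt(:rand.uniform())
    spin()
  end
end

# one spinner per core is an arbitrary choice
for _ <- 1..System.schedulers_online() do
  spawn(&Burn.spin/0)
end

# keep the script alive while the spinners burn CPU
Process.sleep(:infinity)
```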

With that out of the way, here’s what we generally do for our services under heavy load.

  • The default Ecto pool size is generally way too low for our use case. We end up increasing it several times over (see the config sketch after this list).
  • If you’re using hackney / httpoison, watch out for issues with the built-in pool options. This is one of the first places we start to see errors under high load (a dedicated-pool sketch follows this list).
  • You need to heavily sample any traces you’re capturing. Last I checked, a lot of the tracing libs available in Elixir apply backpressure, which can quickly become a bottleneck on your system (sampling example below).
  • Very carefully measure any pooling / queueing solutions you’re using. We’ve seen performance issues with some of the more popular options out there. Make sure you can shed load effectively if you start to get overwhelmed (a simple shedding plug is sketched below).
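
To put a number on the pool-size point, here’s a minimal sketch of the kind of change we mean. `MyApp.Repo` and every value below are placeholders; size them against your own database and load tests.

```elixir
# config/runtime.exs
import Config

config :my_app, MyApp.Repo,
  # Ecto's default pool_size is 10; under heavy load we raise it several times over
  pool_size: 40,
  # DBConnection's queueing knobs: if checkouts keep exceeding queue_target ms
  # across a queue_interval ms window, requests start getting dropped instead
  # of piling up
  queue_target: 50,
  queue_interval: 1_000
```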
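
For hackney / httpoison, one way around the built-in pool limits is to give hot call paths a dedicated pool. A sketch; the pool name, limits, and timeouts are made up, so measure before copying them:

```elixir
# Somewhere during application start-up (e.g. in MyApp.Application.start/2);
# the pool name and limits are placeholders
:hackney_pool.start_pool(:external_api, timeout: 15_000, max_connections: 200)

# Point the hot call path at the dedicated pool explicitly
HTTPoison.get!("https://api.example.com/v1/thing",
  [],
  hackney: [pool: :external_api],
  recv_timeout: 5_000
)
```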
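
For sampling, the exact knob depends on which tracing library you’re on. Assuming the OpenTelemetry SDK, a head-based ratio sampler looks roughly like this (the 1% ratio is an arbitrary starting point):

```elixir
# config/runtime.exs — keep 1% of root traces; children follow their parent's decision
import Config

config :opentelemetry,
  sampler: {:parent_based, %{root: {:trace_id_ratio_based, 0.01}}}
```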
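
And on shedding load: one simple pattern is a plug that rejects requests outright once the box is clearly saturated. A rough sketch; the run-queue threshold is arbitrary and needs to be calibrated from your own saturation tests:

```elixir
defmodule MyAppWeb.Plugs.ShedLoad do
  @moduledoc "Rejects requests with a 503 when the schedulers' run queues back up."
  import Plug.Conn

  # Arbitrary cutoff; calibrate it under saturation
  @max_run_queue 200

  def init(opts), do: opts

  def call(conn, _opts) do
    # Total length of all run queues is a cheap proxy for CPU saturation
    if :erlang.statistics(:total_run_queue_lengths) > @max_run_queue do
      conn
      |> put_resp_header("retry-after", "1")
      |> send_resp(503, "overloaded")
      |> halt()
    else
      conn
    end
  end
end
```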

One final note. We’ve tweaked some scheduler flags on certain services that have very high traffic. This was more important a few OTP versions ago; we’ve recently removed those flags and saw no noticeable degradation. You wanna be really careful about tweaking the schedulers, process priorities, and other such things. It’s pretty easy to add unintentional performance degradations. Generally speaking, the BEAM makes intelligent choices for you. As always, measure your performance under load to see how it behaves.
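
For reference, the scheduler flags in question live in your release’s vm.args. The busy-wait settings below are just a common example of the kind of knob involved, not a recommendation and not necessarily the flags we had set:

```
## rel/vm.args.eex — example only; the BEAM's defaults are usually the right call
## Reduce scheduler busy waiting
+sbwt none
+sbwtdcpu none
+sbwtdio none
```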
