Hi all! I'm wondering how you usually tune your production releases for real-world load. I mean:
Number of DB connections
Any ratio between DB connections and the async I/O threads VM parameter?
Any ratio of DB connections to max concurrent jobs (for those using something like ecto_job, rihanna, oban, etc.)?
Any custom VM parameters that helped you achieve lower latency/higher throughput
Any libs that you switched because of production workload issues
Have you needed Erlang clustering, or have you gotten by without it?
so on…
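To make the question concrete, these are the kinds of knobs I mean; the values below are just library defaults / placeholders, not anything we've actually tuned:

```elixir
# config/runtime.exs (placeholder values, mostly library defaults)
config :my_app, MyApp.Repo,
  pool_size: 10,          # DB connections held by Ecto
  queue_target: 50,       # DBConnection: target wait (ms) for a connection
  queue_interval: 1_000   # DBConnection: window (ms) over which queue_target is evaluated

# for the job-queue side, e.g. Oban's per-queue concurrency
config :my_app, Oban,
  queues: [default: 10]

# plus whatever emulator flags (+S, +A, ...) end up in the release's vm.args
```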
Here at work we are running a few services with Elixir/Phoenix/Ecto without any tuning at all, and the metrics have looked great so far. We are not handling a huge workload yet, but we believe the tide is coming, and I'd like to hear the community's experience with this part of the journey.
Honestly, I don’t touch any tuning until things start to demand it. At most I might tune PostgreSQL a bit after watching its workload for a few days of heavy use, but in general there’s no real need to tune past the defaults of anything until your workload grows, at which point the tuning you need is fairly well documented, if a bit scattered…
You know it would be awesome if someone made a generic tuning guide for phoenix/postgresql/mesh’d setups somewhere, say via a wiki post on this forum.
Stressgrid posted a couple of articles that might be relevant.
I agree with @OvermindDL1 that it would be really nice to have a generic tuning guide for Phoenix/Postgres. It’s probably hard to write, since many of the options will be very “fragile” and have different effects on, e.g., a $10 droplet on DigitalOcean, a Heroku dyno, or the biggest AWS instance. They will also depend on whether you care more about p99 or average latency. Some work better in a shared environment and some are better when the app owns the whole machine.
In general, I’d say you’ll find most optimizations in Postgres itself, either by tweaking options or by analyzing and optimizing your queries, but if you’ve already done that and you’re still struggling, Erlang VM tweaking is going to be very relevant.
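For the query side of that, Postgres’ EXPLAIN ANALYZE is usually the first stop, and recent ecto_sql can run it for you; a minimal sketch, with a made-up Post schema:

```elixir
import Ecto.Query

query =
  from p in MyApp.Post,
    where: p.views > 1_000,
    order_by: [desc: p.inserted_at]

# Prints the Postgres query plan so you can spot sequential scans,
# missing indexes, bad row estimates, and so on.
MyApp.Repo.explain(:all, query, analyze: true)
|> IO.puts()
```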
The TechEmpower Framework Benchmarks Phoenix implementation has some flags set, but it’s pretty old and I don’t know how relevant it still is. Also check its config, which sets a few things.
I’ve made some notes for myself on flags that seem relevant; I might try to collect them into a post at some point. But I’m absolutely not an expert.
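For a flavour of what that looks like, the emulator flags usually end up in the release’s vm.args; this is example syntax only, with made-up values (the defaults are usually sensible):

```
## rel/vm.args

## Number of scheduler threads (defaults to the number of logical cores)
+S 4

## Async thread pool used for some file I/O
+A 64

## Stop schedulers busy-waiting when there is no work
+sbwt none
```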
Here are my general performance-tuning tips for running Elixir at high scale:
This should be a given, but really measure everything and “call your shots” before you try to make performance improvements. Have objectives or goals for what you want to achieve on your boxes (tail latency, number of connections, etc.).
Performance problems are almost never “low level” or at the VM level. You tune the VM for small optimizations on top of already-optimized code. There’s no magic flag that will give you some huge perf bonus.
Performance problems are almost always either at the application level (like a bottleneck in your system code) or in inefficient database queries. As others have said, you’ll get more mileage out of tuning your queries than tuning the VM.
The only benchmarks that really matter are what happens when your system is under saturation. Starving the BEAM of CPU is the quickest way to bring it to its knees. Invest in large boxes you can use to generate enough traffic to really saturate your production machines with realistic traffic. If you can’t generate enough traffic, then run some other CPU-intensive process on the production box (something like a small C utility that calculates primes or does floating-point math in a tight loop should do it).
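Not the external utility described above, but if you just want to soak up the spare CPU from a remote console, a throwaway busy loop per scheduler is a rough stand-in (a sketch; the BEAM still schedules your real processes fairly, they just get far less headroom):

```elixir
defmodule CpuHog do
  # Spawn one tight loop per online scheduler and return the pids
  # so the experiment can be stopped cleanly afterwards.
  def start do
    for _ <- 1..System.schedulers_online() do
      spawn(fn -> burn() end)
    end
  end

  def stop(pids), do: Enum.each(pids, &Process.exit(&1, :kill))

  defp burn, do: burn()
end

pids = CpuHog.start()
# ... run your load test ...
CpuHog.stop(pids)
```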
With that out of the way here’s the stuff we generally do for our services under heavy load.
The default Ecto pool size is generally way too low for our use case. We end up increasing it several times over.
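Concretely that’s just the repo’s pool_size option; a sketch where the number is made up and should come from your own measurements (and stay below Postgres’ max_connections, minus whatever else talks to the database):

```elixir
# config/runtime.exs
config :my_app, MyApp.Repo,
  # Ecto defaults to 10; under heavy traffic we run a multiple of that.
  pool_size: 40
```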
If you’re using hackney / HTTPoison you should watch out for issues with the built-in pool options. This is one of the first places we start to see errors under high load.
You need to heavily sample any traces you’re capturing. Last I checked, a lot of the tracing libs available in Elixir apply backpressure, which can quickly become a bottleneck on your system.
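If the library you’re using can’t sample for you, a crude head-based check before any span is even built does the job (a sketch; record_span/1 is a made-up stand-in for whatever your tracer exposes):

```elixir
defmodule MyApp.Tracing do
  # Keep roughly 1% of traces; the rest are dropped before any work happens.
  @sample_rate 0.01

  def maybe_record(span_fun) when is_function(span_fun, 0) do
    if :rand.uniform() <= @sample_rate do
      record_span(span_fun.())
    else
      :ok
    end
  end

  # Placeholder for the real tracing call.
  defp record_span(_span), do: :ok
end
```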
Very carefully measure any pooling / queueing solutions you’re utilizing. We’ve seen performance issues with some of the more popular options out there. You should make sure that you can shed load effectively if you start to get overwhelmed.
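On the shedding point, one cheap pattern (names here are made up) is to look at a worker’s mailbox depth and refuse new work instead of letting callers pile up behind it:

```elixir
defmodule MyApp.Shedder do
  @max_queue_len 500

  # Reject the request when the worker's mailbox is already deep,
  # rather than queueing it and timing out later anyway.
  def call(worker, request, timeout \\ 5_000) do
    case Process.info(worker, :message_queue_len) do
      {:message_queue_len, len} when len > @max_queue_len ->
        {:error, :overloaded}

      _ ->
        GenServer.call(worker, request, timeout)
    end
  end
end
```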
One final note. We’ve tweaked some scheduler flags on certain services that have very high traffic. This was more important a few OTP versions ago; we’ve since removed those flags and saw no noticeable degradation. You wanna be really careful about tweaking the schedulers, process priorities, and other such things. It’s pretty easy to add unintentional performance degradations. Generally speaking, the BEAM makes intelligent choices for you. As always, measure your performance under load to see how it behaves.
The default pool that hackney uses should be tuned to your own use case. The timeouts tend to be pretty high and the maximum connections are typically way too low for our purposes. You probably want to break out different services you communicate with into their own pools as well to avoid contention.
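Concretely, with hackney/HTTPoison that means starting a named pool per upstream and routing requests to it; a sketch where the pool names, limits, and the payments example are all made up:

```elixir
# In the application supervision tree: a dedicated pool per upstream service.
children = [
  :hackney_pool.child_spec(:payments_pool, timeout: 15_000, max_connections: 200),
  :hackney_pool.child_spec(:search_pool, timeout: 5_000, max_connections: 50)
]

Supervisor.start_link(children, strategy: :one_for_one)

# Requests to that service then opt into its own pool:
HTTPoison.get("https://payments.internal/api/charges", [],
  hackney: [pool: :payments_pool]
)
```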
Like I mentioned before, you should set goals for your service and work to meet them. Depending on what you need, you may not see these issues. In order to hit our goals during traffic spikes we have to really measure and tweak these values. But YMMV.