Performance of Erlang/Elixir in Docker/Kubernetes

ellispritchard · April 8, 2019, 8:41pm

Has anyone else observed throttling of Erlang or Elixir processes running in Kubernetes (or other Docker orchestration platforms)?

I observed this in production a while ago (on a system I no longer have access to), but was unable to make headway. Originally, I noticed a wide variation in ‘ping’ times when connecting to a do-nothing endpoint while the app was not under load, and then noticed throttling being recorded in the Prometheus stats. I couldn’t find a reason, nor reproduce the problem in minikube etc., but then it was a very noisy/busy system, with 30-odd processes of various technologies and load per node.

Recently I came across what is possibly an explanation; this comes in two parts, one, a very detailed examination of CPU usage in the BEAM:

The second clue is talk about scheduler bugs in the Linux kernel, and what CFS quotas are supposed to do anyway:

github.com/kubernetes/kubernetes

CFS quotas can lead to unnecessary throttling

opened 04:06AM - 20 Aug 18 UTC

closed 02:11AM - 01 Nov 20 UTC

bobrik

kind/bug sig/node kind/feature

> /kind bug
>
> This is not a bug in Kubernets per se, it's more of a heads-up. …
>
> I've read this great blog post:
> * https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-manager/
>
> From the blog post I learned that k8s is using cfs quotas to enforce CPU limits. Unfortunately, those can lead to unnecessary throttling, especially for well behaved tenants.
>
> See this unresolved bug in Linux kernel I filed a while back:
> * https://bugzilla.kernel.org/show_bug.cgi?id=198197
>
> There's an open and stalled patch that addresses the issue (I've not verified if it works):
> * https://lore.kernel.org/patchwork/cover/907448/
>
> cc @ConnorDoyle @balajismaniam

It occurs to me that when running in a container under a CFS quota, we should all be setting +sbwt none, so that the scheduler busy-wait optimisation does not burn through the quota and get the Erlang process throttled.
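To check whether a container is actually being throttled, the kernel exposes CFS counters in the cgroup's `cpu.stat` file. Below is a minimal Python sketch, assuming the `nr_periods` and `nr_throttled` fields; the sample values are made up, and the location of the file depends on the cgroup version and mount layout, so reading it is left out here.

```python
def parse_cpu_stat(text):
    """Parse the key/value lines of a cgroup cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_fraction(stats):
    """Fraction of CFS enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Hypothetical file contents; the numbers are invented for illustration.
SAMPLE = """nr_periods 1200
nr_throttled 300
throttled_time 4500000000
"""

if __name__ == "__main__":
    stats = parse_cpu_stat(SAMPLE)
    print(f"throttled in {throttled_fraction(stats):.0%} of periods")
```

Comparing this fraction before and after a flag change such as +sbwt none is one way to see whether busy waiting was eating into the quota.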

Anyone come across this? Thoughts?

tristan · April 8, 2019, 10:56pm

At least one of our projects at work (which runs on k8s) saw a large benefit from enabling +sbwt none. But I don’t know that it is a tweak to make by default.

Another CPU-related configuration to watch out for when running on something like Kubernetes is your number of active schedulers. You can end up with too many schedulers for the CPU share allocated, wasting cycles on managing schedulers and leaving many of them in busy wait.

ellispritchard · April 9, 2019, 10:26am

Interesting. I actually tried running 2xvCPU schedulers to see if that helped: didn’t observe anything, but might have made it worse.

Yes, on dedicated machines, or outside containerisation, I believe it’s been demonstrated that +sbwt does help performance (which is presumably why it was added), at the cost of CPU/energy etc.

tristan · April 9, 2019, 1:17pm

Helps only in some cases, otherwise it would be the default

What do you mean by 2xvCPU schedulers? As in, if you specify 1 vCPU in k8s you used 2 active schedulers? That should hurt performance.

blatyo · April 9, 2019, 1:54pm

If you were running on AWS and using burstable performance instances you could be throttled because you exceeded your quota.

ellispritchard · April 9, 2019, 6:08pm

Certainly didn’t help, didn’t seem to make any difference though: theory was that having more threads would help it compete against Java processes running scores of threads.

ellispritchard · April 9, 2019, 6:23pm

These were m4.xlarge standard instances (4 vCPU), also tried m4.2xlarge (8 vCPU) for a while, so it wasn’t that.

If the VM had exhausted its credits, the whole node would have practically ground to a halt, in my experience. We experienced this on a Kafka cluster once, due to a bug in Kafka 0.8, not fun!

It’s probably not a great idea to run a busy production Kubernetes node on a standard burstable instance without some sort of support for these in the k8s scheduler: k8s basically tries to squeeze as much out of a VM as possible (kind of the whole point), so the node might rarely earn credits. However, AWS T Unlimited instances now allow you to pay to burst above your accumulated credits, so there may be some use-cases where it makes sense.

jola · April 9, 2019, 6:26pm

By default it’ll look at the host system and spin up as many schedulers as there are logical CPUs, but depending on your type of load you can potentially get performance gains by increasing it. Do you know how many schedulers you’re running? It’s not an uncommon problem for VMs (Java etc.) to assume that they have access to all the host CPUs, even if they are limited by e.g. cgroups. I’m not sure how well the BEAM behaves here.

What was also interesting in the article about the BEAM CPU Usage was that even with disabling busy waiting with those three settings, they didn’t see a performance loss. In a cloud environment, when sharing CPUs or when limited by credits, it probably makes sense to disable it by default. Although benchmarking doesn’t hurt.

ellispritchard · April 9, 2019, 7:27pm

By default, it was definitely running the same number of schedulers as available cores (NB this is a system I don’t have access to any more).

I think what I’m interested in finding out is what kind of analysis people have performed on the Erlang VM running in containerised platforms, and what the best VM settings are.

There are many years of experience of tuning the BEAM on dedicated hardware/machines, most of which probably translates pretty well to dedicated VMs, but running it in containers with CFS quotas, sharing a VM with numerous other containers running multiple language technologies (i.e. a heterogeneous environment), is relatively new, and we may not have figured out all the hitches.

jola · April 9, 2019, 7:40pm

Yeah, by default the BEAM is a pretty noisy neighbor. Compared to running some simple single threaded application, it’s harder to reason about performance and resource sharing in a cloud environment. It’ll gladly hog all CPUs because sharing CPUs or paying for cycles wasn’t necessarily a major consideration in its design. A bit like how Go aggressively allocates memory to avoid overhead, tradeoffs were made that don’t make sense in all use cases.

It might be valuable to actually look at reducing the number of schedulers to prevent the BEAM from over-spending on cycles. Even though the host system has some number of CPUs, if you’re limiting the application to say 1 vCPU, maybe you’ll get a behavior closer to the expected one if you also reduce the burstiness of the BEAM. If you only assign it 1/8th of the resources, it might not make sense to allow it to use 100% of them 1/8th of the time. Or it might, that’s up to your use case obviously!

tristan · April 9, 2019, 8:38pm

Right. And note that you can change this at runtime, could be useful in experimenting.

If there are 8 cores and you assign the container N vCPUs you’d start the node with +S 8:N. Then adjust with erlang:system_flag(schedulers_online, NewN).

Would be neat to also see how having it automatically adjust depending on changes in load and allocated vCPUs would work out.
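The `+S 8:N` suggestion above needs a value for N. A minimal Python sketch of one way to derive it from the container's CFS quota and period (both in microseconds) follows. The function names are hypothetical, and rounding a fractional allocation up to the next whole scheduler is an assumption, not something the thread prescribes.

```python
import math


def schedulers_for_quota(quota_us, period_us, host_cpus):
    """Number of online schedulers to match a CFS quota, between 1 and host_cpus.

    A quota of zero or less is treated as "no limit" (cgroup v1 reports -1).
    """
    if quota_us <= 0 or period_us <= 0:
        return host_cpus
    return max(1, min(host_cpus, math.ceil(quota_us / period_us)))


def scheduler_flag(quota_us, period_us, host_cpus):
    """Build the `+S Total:Online` emulator flag from the quota."""
    online = schedulers_for_quota(quota_us, period_us, host_cpus)
    return f"+S {host_cpus}:{online}"


if __name__ == "__main__":
    # A 2-vCPU limit (200 ms of CPU time per 100 ms period) on an 8-core host.
    print(scheduler_flag(200_000, 100_000, 8))  # +S 8:2
```

The same number could be passed to erlang:system_flag(schedulers_online, NewN) at runtime if the allocation changes while the node is up.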

garazdawi · April 10, 2019, 8:32am

There are indeed scenarios when this is the case; however, I would say that if you are not running alone on a machine, you do not want to have spinning enabled, as you start to conflict with other services. I’ve been thinking about maybe changing the default to none or very_short, as that seems to be what most systems need.

I can also see how changing +swct and +swt could be useful, as they adjust how eager the VM is to wake up more schedulers to help do work.

tristan · April 10, 2019, 2:27pm

Nice, setting the default to be what is best for shared environments makes sense.

For +swct what are the cleaning tasks it is referring to? Garbage collection?

Would lowering +swt be best for bursty loads? So that it takes longer to wake up a scheduler in cases where it will just have to go back to sleep shortly after, but if the burst is more sustained it will still get woken up? Or could it be an improvement for general workloads in a shared environment?

Glad you brought them up, I was planning to finally look at benchmarking these but wanted to get a grasp on how to best structure the benchmarks to be most likely to show where they help.

Curious also about the different scheduler balancing options. Do you think using one or the other, +scl or +sub, could be useful in a shared environment?

garazdawi · April 10, 2019, 2:45pm

Internal cleanup work by the schedulers. For instance, delayed de-allocation of remote memory blocks. otp/erts/emulator/internal_doc/DelayedDealloc.md at master · erlang/otp · GitHub

Tinkering with +swt would be the classic tradeoff of latency vs CPU. If you set it to very_high then you will use less CPU to do the work, as the work tends to be co-located, so there is less risk of lock contention, better cache usage, etc. However, at the same time, the average time a job will wait in the run-queue before being allowed to run will go up, so your application will have higher latency.

I don’t consider them relevant at all. The default strategy is best for all scenarios. They exist in order to support the strange needs of some embedded systems run at Ericsson. +scl false is very similar to what you get when you run +swt very_low, so it could possibly be good when you want to optimize for latency.

seb5law · April 11, 2019, 10:40am

How about +swtdio very_low?
Could this improve performance in shared environments as well?

garazdawi · April 11, 2019, 11:31am

Yes. In fact, in OTP 22 its default has been changed to short: http://blog.erlang.org/cd/docs/master/erts-10.3.2/doc/html/erl.html#+sbwtdio

seb5law · April 11, 2019, 12:54pm

We run multiple dockerized Elixir microservices on every host.
For the sake of others seeking performance optimization in shared environments, here is a summary of the changes to the vm.args file that I am currently trying in one of our services:

+sbwt very_short
+swt very_low
+swtdcpu very_low
+swtdio very_low
+swct very_eager

With these ‘optimizations’ I hope to achieve a noticeable performance gain. Let’s see what happens.
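The busy-wait, wakeup-threshold and cleanup flags each take a different set of values, so they are easy to mix up. Here is a minimal Python sketch of a checker for a vm.args fragment like the one above. The value sets are the ones listed in the erl documentation as far as I know them, so verify them against your OTP version; `check_vm_args` is a hypothetical helper, not part of any tool.

```python
BUSY_WAIT = {"none", "very_short", "short", "medium", "long", "very_long"}
WAKEUP = {"very_low", "low", "medium", "high", "very_high"}
CLEANUP = {"very_eager", "eager", "medium", "lazy", "very_lazy"}

# Which value set each scheduler flag accepts (assumed from the erl docs).
ALLOWED = {
    "+sbwt": BUSY_WAIT, "+sbwtdcpu": BUSY_WAIT, "+sbwtdio": BUSY_WAIT,
    "+swt": WAKEUP, "+swtdcpu": WAKEUP, "+swtdio": WAKEUP,
    "+swct": CLEANUP,
}


def check_vm_args(text):
    """Return a list of problems for known scheduler flags in a vm.args text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        parts = line.split()
        # Skip blank lines, comments, and flags this sketch does not know about.
        if not parts or parts[0].startswith("#") or parts[0] not in ALLOWED:
            continue
        flag = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if value not in ALLOWED[flag]:
            problems.append(f"line {number}: {flag} does not accept {value!r}")
    return problems


# The settings quoted in the post above.
SAMPLE = """+sbwt very_short
+swt very_low
+swtdcpu very_low
+swtdio very_low
+swct very_eager
"""

if __name__ == "__main__":
    print(check_vm_args(SAMPLE) or "all values recognised")
```

This only checks that each value is spelled correctly for its flag; it says nothing about whether the chosen direction suits the workload.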

garazdawi · April 11, 2019, 1:21pm

I think you may have set these options the wrong way around.

seb5law · April 11, 2019, 1:47pm

I don’t care about resource consumption, but I want my applications to be as fast as possible. So shouldn’t schedulers, for example, be woken eagerly, consuming more CPU but getting the work done faster?
Or am I misinterpreting the documentation here?
Or is it that these two options, swt and swct, concern cleanup schedulers, which should only run when the CPU is not too busy, so as not to disturb the running applications?
Thanks for the reply!

seb5law · April 12, 2019, 5:38am

@garazdawi Can you please elaborate on my misconception?
Sadly, the time in service with the previously mentioned changes increases by 2ms (21ms now), roughly 2%. So the result of this little test agrees with your statement on my customized emulator flags:

> I think you may have set these options the wrong way around.

Which part of my reasoning is faulty?

As I understand the documentation the sbwt* flags should be set to longer in order to keep idle schedulers alive for longer so new work will be accepted faster without the need of starting a new scheduler.

On the other hand, the swt* flags should be set to a lower value so as to wake up schedulers earlier when new work comes in that the current schedulers can’t handle.