Linux, single machine, nice, multiple BEAM VMs, OOM killer

  1. Linux has nice(1), which lets us set the ‘priority’ of a process. A lower nice value = more important.

  2. Does the OOM killer respect nice? If not, is there an equivalent knob for the OOM killer?

  3. One possible solution I have been thinking of involves running multiple BEAM VMs on a single machine and connecting them via distributed Erlang.

  4. The BEAM VMs have different priority levels: if X is a supervisor of Y, then X lives in a VM that is ‘more important’ than the BEAM VM that Y lives in.

  5. The goal here is that under normal behaviour, supervisors can still respawn workers, even if those workers live on a different BEAM VM (see the sketch after this list).

  6. In the case of an OOM kill, the priorities mean the kernel kills some of the less important worker VMs first.
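Point 5 might look roughly like this on the ‘important’ VM – a minimal sketch, assuming the nodes already share a cookie and each worker VM can be relaunched with a shell command (the node names, nice level and relaunch command are made up for illustration):

    # Runs on the important VM: watch the worker nodes and relaunch any that
    # disappear (e.g. because the OOM killer picked them first).
    defmodule CoreWatcher do
      use GenServer

      # workers maps a node name to the shell command that relaunches its VM, e.g.
      # %{:"worker1@myhost" => "nice -n 10 elixir --sname worker1 -S mix run --no-halt"}
      def start_link(workers), do: GenServer.start_link(__MODULE__, workers)

      @impl true
      def init(workers) do
        :net_kernel.monitor_nodes(true)               # receive :nodeup / :nodedown
        Enum.each(Map.keys(workers), &Node.connect/1)
        {:ok, workers}
      end

      @impl true
      def handle_info({:nodedown, node}, workers) do
        with {:ok, relaunch_cmd} <- Map.fetch(workers, node) do
          # Fire and forget in a separate process so the watcher is not blocked
          # for as long as the relaunched VM lives; the new VM is expected to
          # reconnect to this node on boot.
          Task.start(fn -> System.cmd("sh", ["-c", relaunch_cmd]) end)
        end

        {:noreply, workers}
      end

      def handle_info(_other, workers), do: {:noreply, workers}
    end

This only supervises whole VMs; ordinary OTP supervision still happens inside each worker VM, and whether System.cmd (rather than systemd or a container runtime) is the right way to relaunch a VM is a separate question.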

======

Now, there is a cost here: in addition to the Elixir-process switching that happens inside each VM, the kernel now also has to context-switch between the different BEAM VMs.

Interested in hearing thoughts on either (1) why this wouldn’t work or (2) how it could be made to work.

Main goal: when the machine is overloaded, we want to keep serving requests at the machine’s limit rather than have everything suddenly die at once.

Consider vm.oom-kill = 0 in /etc/sysctl.conf, or vm.overcommit_memory = 2, or echo -100 | sudo tee /proc/<pid>/oom_score_adj (a plain sudo echo … > won’t work, since the redirect runs in your unprivileged shell) – and pray that when the machine is saturated you can still connect to it :see_no_evil:
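For example, each node could set its own score at boot – a minimal sketch, assuming Linux and a process that is allowed to write a negative value (normally root or CAP_SYS_RESOURCE); positive values need no special privileges:

    # Make this VM more or less attractive to the OOM killer:
    # negative = protect, positive = sacrifice first (range -1000..1000).
    defmodule OomScore do
      def set(adjustment) when adjustment in -1000..1000 do
        File.write("/proc/self/oom_score_adj", Integer.to_string(adjustment))
      end
    end

    # On the core VM:   OomScore.set(-100)
    # On a grunt VM:    OomScore.set(500)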


Good advice, thanks! Googling oom_score_adj brought up this interesting article, Taming the OOM killer [LWN.net], which is pretty close to what is needed.

Those are only hints to the OOM killer, though; you can still be killed without warning.

And if the system is really out of memory and you’re pushing your VMs into “please don’t touch” territory, can you really trust that the rest of the system is going to be functional?

As evadne says, it might kill your sshd, your syslogger, or the remote telemetry that would tell you your memory usage has hit its cap, and the node might just get stuck until it collapses completely anyway.

Seems much safer in the end to take sasa’s suggestion in your other thread and rate limit, queue or refuse jobs if you’re near capacity. What kind of work are you doing?

  1. @sasajuric has probably contributed more to Elixir in the past week than I ever will, but at this point I’m just curious whether a system like this can be designed this way.

  2. The system is tiny Docker images (Alpine Linux + Elixir + some Rust code I write). It should not be hard to make everything else more important than my BEAM VMs, and to ensure the ‘grunt-level BEAM VMs’ are the first things the OOM killer takes.

  3. @sasajuric’s technique in that thread requires (1) processes with approximately constant memory usage and (2) being able to predict that usage up front. I would prefer to resolve all of this at runtime.

  4. What I want here is a “protected core” consisting of sshd + syslogd + whatever other non-BEAM daemons + a tiny BEAM VM for the “OTP supervisor processes” – this core should use < 512 MB RAM at all times and always stay alive.

  5. Then I want a bunch of worker/grunt BEAM VMs, which we spawn as load increases and which the OOM killer eventually reaps if necessary (see the sketch after this list).

  6. And with a system like this, I no longer need to worry about processes using approximately constant memory or about calculating limits up front.
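A rough sketch of points 4–6, assuming Linux (/proc/meminfo); the check interval, headroom threshold and launch command are invented for illustration:

    # Runs in the protected-core VM: periodically check free memory and add a
    # grunt VM while there is headroom. Shrinking is left to the OOM killer.
    defmodule GruntScaler do
      use GenServer

      @check_every :timer.seconds(10)
      @headroom_mb 1024                    # invented threshold: keep ~1 GB free

      def start_link(_), do: GenServer.start_link(__MODULE__, :ok)

      @impl true
      def init(:ok) do
        Process.send_after(self(), :check, @check_every)
        {:ok, nil}
      end

      @impl true
      def handle_info(:check, state) do
        if mem_available_mb() > @headroom_mb do
          name = "grunt_#{System.unique_integer([:positive])}"
          cmd = "nice -n 10 elixir --sname #{name} -S mix run --no-halt"
          Task.start(fn -> System.cmd("sh", ["-c", cmd]) end)
        end

        Process.send_after(self(), :check, @check_every)
        {:noreply, state}
      end

      # Parse the MemAvailable line out of /proc/meminfo (value is in kB).
      defp mem_available_mb do
        "/proc/meminfo"
        |> File.read!()
        |> String.split("\n")
        |> Enum.find_value(0, fn line ->
          case String.split(line) do
            ["MemAvailable:", kb, "kB"] -> div(String.to_integer(kb), 1024)
            _ -> nil
          end
        end)
      end
    end

Each grunt could also raise its own oom_score_adj on boot (as in the earlier sketch) so that it, and not the core, is the natural victim when memory runs out.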

This is all theoretical and probably a terrible idea; I’m just curious what problems it runs into.


The system is tiny Docker images (Alpine Linux + Elixir + some Rust code I write). It should not be hard to make everything else more important than my BEAM VMs, and to ensure the ‘grunt-level BEAM VMs’ are the first things the OOM killer takes.

The processes in your Docker containers are still watched by the node’s kernel, so you may not get the isolation you want. Also keep Docker’s own per-container memory limits in mind.

I do think spawning separate VMs will probably get you where you want, since your main CNC (command-and-control) VM would theoretically not be killed, while the VM that over-requests memory should get the boot naturally. I don’t think you even have to play with nice/oom_score_adj. I guess you could be unlucky if the CNC attempted an allocation at the wrong time, in a race-condition sort of way.

Spawn two vm’s and run Stream.cycle([:kaboom]) |> Enum.take(-1) in one and see it burn (do it on a lower memory machine so it doesn’t take forever).

Perhaps you can configure the memory limits to be “let me over-allocate 50 MB” (or whatever a reasonable spike is) for the CNC and “let me over-allocate nothing” for the grunt, which means the grunt should always get hammered when it misbehaves while command and control has some room to maneuver.

E: You could just configure your worker containers’ memory limits. I think you can set them per container, if I remember correctly. Probably quite a portable configuration that way.
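For instance (placeholder numbers and image names; --memory and --oom-score-adj are standard docker run flags):

    # Grunt container: tight memory cap, first in line for the OOM killer.
    docker run -d --memory=512m --oom-score-adj=500 grunt-image

    # Core/CNC container: more headroom and a negative score so it survives pressure.
    docker run -d --memory=1g --oom-score-adj=-500 core-image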

Fair enough as an intellectual exercise, maybe even interesting as an actual exercise too, but I think in any production system you’re better off spending the effort building on Kubernetes et al. and just scaling up new “grunt VM” nodes as the current workers approach a load limit (either via Erlang’s inspection tools or node-level inspectors). I think you might make it work, but the failure point is still there, just hidden further down the line. Better to have the ability to scale anyway.

That’s probably fair advice for things in the cloud, but there are many use cases where this is not possible and resources really are limited.
