Global process limit to prevent vm crash?

We just experienced some crashes in our VMs. A combination of several libraries triggered a UTF-8 bug on a filename that someone uploaded to our server. The process ended up passing an invalid string into a function that got into an infinite loop (or at least had exponential blowup), and that spelled doom for the server. This isn’t a bug we could’ve predicted ahead of time, and it wasn’t in our code.

The question I’m asking myself is why Erlang chose to kill the entire VM when it ran out of memory rather than just killing the single process that went crazy. That would have been cleanly logged and would not have woken anyone up several nights in a row. Is there a setting for this that I can just put into cowboy to limit its workers to a certain size? Nothing on this server should ever require more than a few dozen MB to process pretty much anything, and if it did, I’d almost rather it crash so that I know where we are pushing our luck.

I didn’t write the BEAM VM, so I can’t answer your question about why it was designed this way. That said, if you are looking for help making sure the problem never happens again, posting some code samples (ideally a GH repo where the bug can be reproduced) would go a long way.

There are two issues with this approach:

  1. You’re asking a system that is already in a failure mode to police itself. The entire Erlang error handling philosophy is that, when a layer of operation fails, instead of having that layer attempt to recover itself, you crash that layer and allow something at a higher layer to initiate a reboot from a known good state. If the entire VM is misbehaving and hits a system limit, the best option is to crash the VM and allow the system in charge of starting the VM to do so from a known good state.

  2. It is rarely feasible for the system itself to figure out which process is misbehaving. For one thing, how is that logic supposed to even run if there’s no memory left? More to the point, though, the last process to ask for RAM may or may not be the misbehaving process. If processes A, B, and C go crazy and together use up 99.99% of the RAM, process D might come along answering a simple HTTP request and ask for the last bit of memory that takes usage to 100%. Killing process D won’t help.

There is a :max_heap_size process flag you could probably set in a plug, without needing to actually handle it at spawn time.
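For illustration, here is a minimal sketch of that flag in action, outside of any plug. The module name, the `grow/1` helper, and the limit value are all made up for the example. Note that `:max_heap_size` is measured in machine words rather than bytes, so on a 64-bit VM the 4,000,000 words below work out to roughly 32 MB.

```elixir
defmodule HeapLimitDemo do
  # Hypothetical limit, in words (not bytes). About 32 MB on a 64-bit VM.
  @max_heap_words 4_000_000

  # Runs `fun` in a fresh monitored process with a heap cap and
  # returns the reason that process exited with.
  def run_limited(fun) do
    {pid, ref} =
      spawn_monitor(fn ->
        # kill: true         -> the VM kills this process once it exceeds the cap
        # error_logger: true -> the VM logs a report when that happens
        Process.flag(:max_heap_size, %{
          size: @max_heap_words,
          kill: true,
          error_logger: true
        })

        fun.()
      end)

    receive do
      {:DOWN, ^ref, :process, ^pid, reason} -> reason
    end
  end

  # Stand-in for the runaway function: conses onto a list forever.
  def grow(acc), do: grow([0 | acc])
end
```

Calling `HeapLimitDemo.run_limited(fn -> HeapLimitDemo.grow([]) end)` should come back with `:killed` while the rest of the VM keeps running, whereas a well-behaved function exits with `:normal`. In a plug, the same `Process.flag/2` call would presumably go at the top of `call/2`, assuming the request is handled in its own process. The limit is checked when the process is garbage collected, so treat it as a safety net rather than an exact quota.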

Definitely sorry to hear about that, I’ve been there.


Because the Linux OOM killer is not a good idea and we do not want to bring it into the standard Erlang distribution.
