In 99.99% of the cases where I get Erlang crashed by OOM, the culprit is the error logger handling a failing process.
When you have a big gen_server state (more than 10 MB) and your server
crashes due to an error, the last message and the crash reason are
going to be dumped.
gen_server:format_status doesn’t help at all, because the reason is going
to be dumped anyway.
10 MB of state can take a gigabyte or more of memory to dump as an error reason.
Realistically, not only do we not know that n = 10,000 upfront, but I’m not sure we can even estimate n = 7,500 since, depending on what the users do, different types of connections may use different amounts of resources.
The core issue here to me is: Erlang is famous for its ‘let it crash + restart’ philosophy; I’m just not sure how that works in the context of the entire VM. Unless every process, on every alloc, runs a check of “hey, are we nearly out of memory”, it seems we run into this problem of – everything runs smoothly normally, then, under load, the VM gets OOM-killed by the kernel.
Surely this problem has been solved, right? What solution does WhatsApp / Discord / … use?
My experience with this is that the BEAM, or at least the application, just crashes out with a dump. There is no log error or warning (this may depend on how fast/tightly you hit the memory limit) and nothing in dmesg (because the BEAM kills itself; the kernel doesn’t kill it).
Erlang does have infrastructure to track memory usage (even just checking :erlang.memory() might be what you need, but there is more), so if you have a known limit on your servers, you can check against it and ping your scaling infrastructure to hoist another node, rate limit, let it explode, etc.
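As a minimal sketch of that idea (the 1 GiB cap, the 5-second interval, and the log-only reaction are all assumptions), a small GenServer can poll :erlang.memory/0 and react when total usage crosses a limit:

```elixir
defmodule MemWatch do
  use GenServer
  require Logger

  # Assumed numbers: check every 5 s against a 1 GiB cap.
  @limit_bytes 1024 * 1024 * 1024
  @interval_ms 5_000

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok) do
    schedule()
    {:ok, nil}
  end

  @impl true
  def handle_info(:check, state) do
    total = :erlang.memory(:total)

    if total > @limit_bytes do
      # Placeholder reaction: log it; in practice, ping your scaling
      # infrastructure, start rate limiting, etc.
      Logger.warning("BEAM memory high: #{total} bytes")
    end

    schedule()
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :check, @interval_ms)
end
```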
If your app is complex enough that it may ingest huge amounts of bursty data (video processing?), you probably have multiple nodes being load balanced, etc., and can probably use some heuristic to estimate a workload?
I imagine whatsapp/discord just spec with some leeway to handle spikes and let it scale when needed (I also doubt they bother checking the BEAM specifically and just track a node’s total capacity).
I unfortunately don’t have specific answers to your questions, but if you haven’t already seen it you should take a look at Erlang in Anger, a great free pdf/ebook that covers how to handle and diagnose many production concerns such as this one: https://erlang-in-anger.com/
There are ways to alleviate this. First, Linux’s memory overcommit can be tuned to be more conservative, so allocations will fail earlier instead of the kernel having to invoke the dreaded OOM killer when the situation gets out of control. Second, you can protect some key processes from the OOM killer. There is no 100% guarantee, but you can usually protect a well-behaved process from becoming the victim of a rogue process that allocates like crazy.
AFAIK there’s no support for handling OOM errors. If memory can’t be allocated, the beam process will crash. To prevent this from happening you need to proactively manage the load.
In many cases this can be simplified with conservative limits, pessimistic estimates, and some basic napkin math. For example, let’s assume that a single activity (e.g. processing a user request) requires no more than 50 kB of memory. This means that 10,000 simultaneous clients would require about 500 MB of memory + the fixed overhead of the BEAM and other external OS processes. Based on these numbers, 1 GB of RAM should be enough to manage 10k users, so setting the max connections limit to 10k should significantly reduce the chance of a BEAM crash.
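Spelled out, the napkin math above (all numbers are the assumptions from the text, not measurements):

```elixir
# 50 kB per in-flight activity, 10k simultaneous clients (assumed figures).
per_activity_bytes = 50 * 1024
max_conns = 10_000

# Worst-case memory for the activities alone, before BEAM's fixed overhead.
total_bytes = per_activity_bytes * max_conns
# 512_000_000 bytes, i.e. roughly 500 MB
```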
IMO the most important thing here is to make sure that the memory usage of each activity remains constant. It should not be possible for an end user to allocate an unbounded amount of memory. This can typically be controlled with streaming. If a user can supply an unbounded amount of data, process the input in chunks. If the system fetches the stuff from the db, stream the data, process it, and send the response in chunks. Alternatively, if you can’t do it with streaming (or won’t because it’s too complicated), then consider limiting the input (e.g. a user can fetch or supply at most n items).
In more complex cases (large number of users, many different types of actions) you could consider conducting a synthetic load test. Deploy a system to some staging server, load it with synthetic clients, and observe the memory utilisation. This should give you a more realistic feeling on resource requirements and the system-wide limits.
Finally, you can also consider controlling the load depending on the resource usage. After accepting a request, but before starting to process it, you could decide to wait (or immediately reject the request), depending on memory (and/or CPU) utilisation. The jobs library could help with this, or you can roll your own solution.
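A rolled-your-own version of that gate might look like this sketch (the 800 MB threshold is an assumption; rejecting instead of queueing is one possible policy):

```elixir
defmodule Admission do
  @moduledoc """
  Sketch: refuse new work up front when BEAM memory is already high.
  The threshold is a made-up example value; tune it to your node.
  """

  @max_total_bytes 800 * 1024 * 1024

  def handle(request, fun) do
    if :erlang.memory(:total) > @max_total_bytes do
      # Alternatively: park the request in a queue and retry later.
      {:error, :overloaded}
    else
      {:ok, fun.(request)}
    end
  end
end
```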
I know that the BEAM has per-process heaps. On creation, can we specify per-process heap limits? I.e., if this Elixir process tries to use more than 40 kB, kill it.
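For what it’s worth, the runtime does expose exactly this knob: the max_heap_size process flag / spawn option (available since OTP 19). Sizes are given in machine words, not bytes, so ~40 kB is about 5,000 words on a 64-bit system:

```elixir
# Spawn a process that gets killed once its total heap exceeds ~5_000 words
# (~40 kB on 64-bit); error_logger: true makes the kill show up in the logs.
pid =
  :erlang.spawn_opt(
    fn -> Enum.to_list(1..1_000_000) end,
    [{:max_heap_size, %{size: 5_000, kill: true, error_logger: true}}]
  )

# The same limit can also be set from inside an already-running process:
Process.flag(:max_heap_size, %{size: 5_000, kill: true, error_logger: true})
```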
I am not smart enough to look at a piece of Elixir code and approximate, within a factor of 10x, how much memory it uses. With C, I can manually annotate the mallocs. With C++/Rust, each container/vector likely uses 2x whatever max number of elements it contains (assuming some kind of double-when-full allocator). With something like Elixir, I do not see an easy way to estimate memory usage.
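One low-level aid here (not a full answer): :erts_debug.size/1 and :erts_debug.flat_size/1 report the size of any term in machine words, which can anchor such estimates empirically:

```elixir
# :erts_debug.size/1 returns a term's size in machine words (8 bytes per
# word on a 64-bit system); flat_size/1 counts shared subterms repeatedly.
words = :erts_debug.size(Enum.to_list(1..1_000))
bytes = words * :erlang.system_info(:wordsize)
```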
My fear here is a situation where a connection typically takes 100 kB of memory, but can take 10 MB in a degenerate case. We have a server with 1 GB of memory; based on the 100 kB estimate, we think we can handle max 10,000 users. We set the limit at 5,000 users. But then something triggers a degenerate chain reaction, we suddenly need 5 GB of memory, and we get kill -9-ed.
On the other hand, if we assume the 10 MB degenerate case all the time, we are wasting resources most of the time.
What I would really like, and I do not know if this is possible, is something where:
4.1 we assign a priority to each Elixir process (high priority = important = stuff like OTP supervisors, low priority = grunt workers)
4.2 when Linux is feeling memory pressure, it notifies the Elixir VM; the Elixir VM starts killing low-priority Elixir processes with a “memory pressure” flag (which signals to the supervisor “hey, don’t immediately respawn”)
4.3 alternatively, each supervisor node periodically tracks how much total memory all its descendants use, and we send messages to supervisor nodes of the form “hey, kill half your processes”
I think most of this can be done in Elixir userspace; all we really need is for the kernel to, instead of kill -9-ing us, send some message saying “hey, memory pressure”.
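The tracking half of 4.3 is doable today with stock APIs; a sketch (summing only a supervisor’s direct children, which is an assumption — a real version would walk the whole subtree):

```elixir
defmodule SupMem do
  @moduledoc "Sketch: total bytes used by a supervisor's direct children."

  def total(supervisor) do
    supervisor
    |> Supervisor.which_children()
    |> Enum.map(fn {_id, child, _type, _mods} -> child end)
    # which_children/1 may return :undefined or :restarting in the pid slot.
    |> Enum.filter(&is_pid/1)
    |> Enum.map(fn pid ->
      case Process.info(pid, :memory) do
        {:memory, bytes} -> bytes
        # The process may have died between the two calls.
        nil -> 0
      end
    end)
    |> Enum.sum()
  end
end
```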
@al2o3cr : On 2nd thought, clearly the solution to this problem is bare metal BEAM. When BEAM is the kernel itself, and the only ‘userland’ apps are elixir processes & beam NIFs, it is much easier to reclaim memory by just killing elixir processes. (only half joking)
I’m curious if this can be done. One thing that has definitely changed is that if we only care about the cloud, there is no need for the complexity of dealing with drivers for various hardware devices. In many ways, the BEAM VM already has most of the elements of a minimalist kernel.
People have run the BEAM on metal (the GrISP and LING projects come to mind; there was even someone who made an operating system with BEAM as the orchestrator, but I think that project died), but I think the VM codebase itself doesn’t support failable allocations… This is nontrivial; so much C code that people write ignores failing malloc… Going to Rust doesn’t really help much. You can make NIFs respect failable allocation if you go to Zig, because that language deeply cares about failing allocations and can be easily hooked into the VM’s custom internal allocator. https://www.youtube.com/watch?v=IM_tO8hQgKA
You could collect this info with a synthetic load test and/or by correlating memory usage with load (e.g. number of connections or reqs/sec) from a prod system.
This is of course always possible, though IME the risk can be mitigated with a combination of disciplined programming, practicing code reviews, and measuring the memory usage (synth load testing and/or prod measurements).
I suspect this could become tricky. The VM basically doesn’t know anything about the OTP constructs; it considers all processes to be the same. In your proposal you’re already accounting for that with priorities, but I think the problem is more nuanced: you probably want only your app’s workers to be killed, leaving all other processes intact. Another issue is that relying on the OOM killer or the Linux kernel means the solution would not work on other platforms.
However, I feel that an OOM killer could be implemented in a beam language (e.g. Elixir). You could use memsup to observe the OS memory usage, and if it goes above some user-defined threshold, you could collect workers from the supervision tree of the OTP app, and decide which process(es) to kill.
This could be developed as a generic lib. For example, when starting the top-level supervisor, we could do something like OOM.Supervisor.start_link(children, opts), where opts are used to configure the OOM killer params (e.g. threshold). When the threshold is reached, the killer would terminate some worker processes under this supervisor. Each process could set its own kill priority, e.g. by calling OOM.set_priority(priority). This would allow the app developer to tweak the termination list according to the specifics of their system.
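A very rough sketch of the killing step. OOM.set_priority/1 is the hypothetical API from the text, modeled here by stashing a priority in the process dictionary (an assumption); lower priority means more expendable:

```elixir
defmodule OOM.Killer do
  @moduledoc "Sketch: terminate the most expendable worker under a supervisor."

  # Hypothetical priority API; lower = killed first.
  def set_priority(prio), do: Process.put(:oom_priority, prio)

  def kill_one(supervisor) do
    workers =
      supervisor
      |> Supervisor.which_children()
      |> Enum.flat_map(fn
        # Only consider live workers; leave supervisors intact.
        {_id, pid, :worker, _mods} when is_pid(pid) -> [pid]
        _other -> []
      end)

    case workers do
      [] ->
        :no_victim

      pids ->
        victim = Enum.min_by(pids, &priority/1)
        Process.exit(victim, :oom_kill)
    end
  end

  defp priority(pid) do
    case Process.info(pid, :dictionary) do
      {:dictionary, dict} -> Keyword.get(dict, :oom_priority, 0)
      nil -> 0
    end
  end
end
```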
Some people are doubtful about OOM killers, since there’s some amount of randomness involved, and killing random processes might leave the system in a permanent partially working state (which is worse than restarting everything). However, I think that by being conservative (kill only worker processes of the “main” app) an embedded OOM killer might prove to be useful. IMO this is best evaluated in practice, either in a real system or a fake synthetic one.
Set a memory threshold as a config value, e.g. if the used/total memory ratio approaches, say, 90%, then you should have something in your pipeline that yields back-pressure.
You can also combine the above with Process.list() |> Enum.each(&:erlang.garbage_collect/1), but I’d advise against it because it could strain an already struggling system. Maybe just have a process that checks memory every 5 secs and executes that code if the ratio is 80% or above – which reasonably maps to “a system marching toward its limits but still having resources”.
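A sketch of that check-then-sweep step. Reading used/total via :memsup (part of the :os_mon application) is an assumption; any source of the ratio would do:

```elixir
defmodule GCSweep do
  @moduledoc "Sketch: force a system-wide GC only when OS memory usage is high."

  def maybe_sweep(limit \\ 0.8) do
    # :memsup.get_system_memory_data/0 returns a proplist; requires :os_mon.
    data = :memsup.get_system_memory_data()
    total = Keyword.fetch!(data, :total_memory)
    free = Keyword.get(data, :free_memory, 0)

    if (total - free) / total >= limit do
      # The heavy hammer from the text: GC every process on the node.
      Process.list() |> Enum.each(&:erlang.garbage_collect/1)
    end
  end
end
```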
Obviously the above is not at all a guarantee, but it’s IMO a sane approach because it’s working with OS (or container) system limits. As a bonus, you can use the background process(es) that do the monitoring to also send alarms / warnings via telemetry or various other dashboard systems, so you could intervene in time.
(Finally, and this could be a very random shot in the dark, and my apologies if so – if you have a load balancer in front of your service, just have it enforce a hard upper limit on request size; that’s a good way to make sure that the BEAM won’t spike in memory usage when a huge string is sent to it.)
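(Even without a load balancer, if the HTTP layer is Plug-based the same cap can be applied in-app via the :length option of Plug.Parsers; the 8 MB figure here is an arbitrary example, and whether your stack uses Plug at all is an assumption:)

```elixir
# In the endpoint/router: reject request bodies larger than ~8 MB (example).
plug Plug.Parsers,
  parsers: [:urlencoded, :multipart, :json],
  json_decoder: Jason,
  length: 8_000_000
```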