ECS Fargate kills my Elixir task before crash dump is written

Hi all,

I’m running an Elixir app on ECS Fargate, and occasionally the task is being killed with:

exitCode: 137
OutOfMemoryError: Container killed due to memory usage

From what I understand, that’s ECS killing the container because it exceeded the task’s memory limit.

The problem is that I’m not getting an Erlang crash dump: there are no logs indicating one was written before the container was terminated. The last logs I sometimes see are:

[os_mon] memory supervisor port (memsup): Erlang has closed
[os_mon] cpu supervisor port (cpu_sup): Erlang has closed

I would like to configure things so that the Erlang VM itself crashes due to OOM and writes its crash dump, instead of ECS killing the container immediately.

Is there anyone with a similar setup who was able to get the dump written in a case like this?

Thanks!

In my experience it’s pretty hard to recover a crash dump from a system like that, since the container is usually destroyed and restarted after a crash.

You can try setting the +hmax Size VM flag so that any process that hoards too much memory is logged and killed. But that will only work if it’s a process’s heap that is hoarding the memory; there are plenty of other allocators the memory could be coming from.
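
If you only want the limit on specific processes rather than VM-wide, the per-process equivalent can be set from Elixir with Process.flag/2; a minimal sketch (the 100 MB figure is just an example):

# Cap the calling process at roughly 100 MB, expressed in words (8 bytes each on a 64-bit VM).
max_heap_words = div(100 * 1024 * 1024, 8)
Process.flag(:max_heap_size, %{size: max_heap_words, kill: true, error_logger: true})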

Make sure you also take advantage of the built-in VM memory metrics. IIRC they can be aggregated by allocator, which will also give you a hint about where to look.
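
For a quick look from a remote console, :erlang.memory/0 gives a coarse breakdown by category; a small illustration (return values elided):

# Totals per category: :total, :processes, :system, :atom, :binary, :code, :ets, ...
:erlang.memory()

# Or a single category, e.g. memory held in refc binaries:
:erlang.memory(:binary)

The per-allocator aggregation I mentioned is what tools like recon_alloc expose, if you have the :recon dependency available.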

2 Likes

Thank you for your feedback. We have successfully retrieved crash dumps by setting the ERL_CRASH_DUMP environment variable and mounting a volume on our tasks. When we explicitly ask our app to generate a crash dump, it works correctly. However, in the case of an OOM, no crash dump is written, which is our main issue.
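
For anyone wanting to reproduce that part of the setup, it looks roughly like this (the mount path is only an example):

# In the task definition: point the dump file at the mounted volume.
# ERL_CRASH_DUMP=/mnt/crash-dumps/erl_crash.dump

# To force a dump for testing: halting with a charlist reason makes the VM
# write a crash dump with that reason as the slogan, then exit with status 1.
:erlang.halt(~c"manual crash dump test")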

We also tried using the +hmax Size flag, which is definitely useful, but it still wasn’t enough in our case. Since my first message, we’ve been able to find the root cause; it boils down to something like this:

my_string = String.duplicate("a", 1_000_000)
Regex.replace(~r/(?:https?:\/\/)?(?:[^\s?]+)?\.test\.com.*[?&]key=([^\s&]+)/, my_string, "replace")

On Elixir 1.18.4-otp-27, this snippet causes an OOM error in our task. I couldn’t reproduce the OOM on my local machine, but it’s clear the app was hanging/in a bad state.
On Elixir 1.19.1-otp-28, the same code executes seamlessly in just a few microseconds. I assume the improvements in PCRE2 resolved this particular issue.

I’m still a bit bummed that I don’t know how to better track down this kind of issue next time.

1 Like

With an OOM it’s not really the app crashing for internal reasons; it’s the app being made to crash from the outside by some OOM-reaping code. Depending on how harshly that reaping happens, erl might simply not get to write the crash dump. It’s not up to the erl process at that point.

Are you saying you are successfully getting crash dumps, or that your code now just works without using too much memory?

When your process is OOM-killed, that is the OS forcing it to stop dead. It doesn’t get the opportunity to even know it is about to be killed. Furthermore, memory issues that crash the VM are always so sudden in my experience that it’s rare to pick up any sign of them in telemetry, and no amount of in-code memory checking ever caught anything. Setting hmax alone was not enough, because the memory at fault was never held by a single process.

It took me quite a long time to figure out how to generate crash dumps ahead of an OOM event. This seems heavy, so I don’t expect many people to use it, but for me it was a life saver, and hopefully it helps you. The OOMs I had to debug were things I never would have found without a crash dump. Sometimes they were deep in the bowels of libraries you would never suspect.

The idea is that the Erlang VM has what is called a super carrier, disabled by default. Instead of allocating memory from the OS whenever it needs it, the VM allocates one big slab of memory up front and manages its own allocations from that. If you set the carrier to a limit just shy of the point where k8s or the OS would kill the process, the VM will instead kill itself the moment it runs out, usually on the exact call that tried to allocate too much, and when it does it generates the crash dump, which is always very informative.

+MMscs 3072 ← enables the super carrier, limited to 3072 megabytes
+MMsco true ← allocations are only done from the super carrier; when the super carrier fills up, the VM fails due to out of memory
+Musac false ← my memory on this one is foggy and it may not be needed, but I remember not all memory went to the carrier, and this option made sure it did, ensuring I got a crash dump every time
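
In case it helps: with a Mix release, flags like these would typically go in the release’s vm.args (rel/vm.args.eex). The 3072 MB figure above is just the example value and would be sized a bit below the task’s actual memory limit. A sketch:

## rel/vm.args.eex (sketch; size the super carrier to your own task)
+MMscs 3072
+MMsco true
+Musac false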

I never saw any degradation of performance. The only externally visible side effect I ever saw was that once the super carrier reaches a certain size, the memory taken by the process never ever shrinks again. While that’s a little bit annoying, it was actually something I came to rely on. I always knew the max amount of memory used on any pod since they’d been up, so I was often aware when something was not quite right before it blew up, especially if one pod was markedly higher than the others.

I’ve been in multiple positions since I last used this where, when the VM would crash, no one knew the cause or how to fix it, and everyone would just raise the memory limits and pray it didn’t happen again.

15 Likes

Amazing post! Instant bookmark. Thank you. I’ll definitely try that in my work.

There are other vectors of attack as well, namely having a little heads-up before your program gets reaped.

At $previous_job we used one of them – earlyoom – with some success.

1 Like

What I meant is that I’m able to get the crash dump if it’s generated (for example by using halt/1), and I’ve fixed my code so it no longer runs out of memory.

But you’re definitely addressing the real point of my question, which was how to get a crash dump generated instead of having the supervisor hard kill my app.

I’ll take a look and experiment with the super carrier. +Musac false is mentioned in the documentation; it looks like it’s about forcing sys_alloc allocations to go through the super carrier as well.

Thank you so much for sharing your insights; I think they’ll save me a lot of time!

@dimitarvp Thanks also for sharing those tools; they could definitely come in handy too.