Are you saying you are successfully getting crash dumps, or that your code now just works without using too much memory?
When your process is OOM-killed, that is the OS forcing it to stop dead. It doesn’t even get the opportunity to know it is about to be killed. Furthermore, in my experience memory issues that crash the VM are so sudden that it is rare to pick up any sign of them in telemetry, and no amount of in-code memory checking ever caught anything. Setting hmax alone was not enough either, because the memory at fault was never held by a single process.
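(For anyone unfamiliar: by hmax I mean the max_heap_size limit, which only bounds a single process’s own heap, so it never sees memory spread across binaries, ETS tables, or lots of small processes. A minimal sketch of the per-process form, in Elixir, with made-up numbers:)

```elixir
# Minimal sketch: cap one process's heap at roughly 100 MB worth of heap words.
# This limit only applies to this process's own heap, which is exactly why it
# missed the OOMs described above.
words = div(100 * 1024 * 1024, :erlang.system_info(:wordsize))

spawn(fn ->
  Process.flag(:max_heap_size, %{size: words, kill: true, error_logger: true})
  # ... work that might allocate too much on this process's heap ...
end)
```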
It took me quite a long time to figure out how to generate crash dumps ahead of an OOM event. This setup is a bit heavy, so I don’t expect many people to use it, but for me it was a life saver, and hopefully it helps you. The OOMs I had to debug were caused by things I never would have found without a crash dump; sometimes they were deep in the bowels of libraries you would never suspect.
The idea is that the Erlang VM has what is called a super carrier, disabled by default. Instead of allocating memory from the OS whenever it needs it, the VM allocates one big slab of memory up front and manages its own allocations from that. If you set the super carrier to a limit just shy of the point where k8s or the OS would kill the process, the VM will instead kill itself the moment it runs out, usually on the exact call that tried to allocate too much, and when it does it generates a crash dump, which is always very informative.
+MMscs 3072 ← enables the super carrier and limits it to 3072 megabytes
+MMsco true ← allocations will only be done from the super carrier; when the super carrier fills up, the VM will fail with an out-of-memory error
+Musac false ← my memory on this one is foggy and it may not be needed, but I remember that not all memory went to the carrier, and this option made it so it did, ensuring I got a crash dump every time
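To show how this fits together in practice, here is a rough sketch, assuming an Elixir release where the flags live in rel/vm.args.eex (plain erl takes the same flags, or you can pass them via ERL_FLAGS); the numbers and the dump path are just illustrative:

```
## rel/vm.args.eex (illustrative values)
## The k8s memory limit should sit a bit above the super carrier size
## (e.g. a 3.5 GiB limit for a 3072 MB carrier) so the VM hits its own
## ceiling and writes a crash dump instead of being SIGKILLed.
+MMscs 3072
+MMsco true
+Musac false

## Optional: put the crash dump somewhere that survives the pod restart.
-env ERL_CRASH_DUMP /data/erl_crash.dump
```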
I never saw any degradation in performance. The only externally visible side effect was that once the super carrier grows to a certain size, the memory held by the OS process never shrinks again. While that’s a little annoying, it was actually something I came to rely on: I always knew the maximum amount of memory any pod had used since it came up, so I was often aware that something was not quite right before it blew up, especially if one pod was markedly higher than the others.
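If you want to keep an eye on this from inside the VM rather than from pod metrics, here is a rough sketch, assuming the recon library is in your deps (the names here are my own, not part of the setup above):

```elixir
# Rough sketch using :recon_alloc (from the recon library) to compare how much
# memory the allocators have claimed versus how much is actually in use.
# A pod whose "allocated" number sits far above its siblings was usually the
# early warning sign mentioned above.
allocated_mb = div(:recon_alloc.memory(:allocated), 1024 * 1024)
used_mb = div(:recon_alloc.memory(:used), 1024 * 1024)

IO.puts("allocated: #{allocated_mb} MB, in use: #{used_mb} MB")
```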
I’ve been in multiple positions since I last used this where, whenever the VM crashed, no one knew the cause or how to fix it, and everyone would just raise the memory limits and pray it didn’t happen again.