Erlang process killed. Whodunnit?

I have an app that uses a handful of GenServers and GenStage modules that communicate via PubSub and regular old Kernel.send/2. Over the weekend, we gave it a chunk of work to chew through. We had previously given it smaller chunks of work, and things had been going great – we checked the dials, verified side effects, and kept increasing the workload. But this weekend the process was killed. There wasn’t much in the EC2 logs, just a few lines before it shut down and restarted:

Killed
heart: Mon Oct 25 11:06:01 2021: Erlang has closed.
heart: Mon Oct 25 11:06:01 2021: Wait 5 seconds for Erlang to terminate nicel
heart: Mon Oct 25 11:06:01 2021: Executed "/home/foo/bin/foo daemon" -> 0. Terminating.

It’s running as a built release, as a service. I thought maybe the culprit was too many messages in the GenStage buffers – we defined a max buffer_size of 5_000_000, but from my experiments, when that buffer overflows, a warning gets logged, and we saw no such warning. I checked the EC2 logs around the event, and nothing interesting preceded or followed it… it looked like the regular pitter-patter that had been going on in that log file.
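For reference, the producer’s limit is set in init/1 – roughly like this (the module name and state here are placeholders; the 5_000_000 is our actual setting):

defmodule MyApp.WorkProducer do
  use GenStage

  def start_link(opts) do
    GenStage.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    # Events that arrive faster than consumers demand them queue up here; once
    # the buffer exceeds :buffer_size, GenStage logs that events were discarded.
    {:producer, %{}, buffer_size: 5_000_000}
  end

  @impl true
  def handle_demand(_demand, state) do
    # Real events are pushed in from elsewhere (handle_cast/handle_info);
    # nothing is emitted on demand in this stripped-down sketch.
    {:noreply, [], state}
  end
end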

I discovered this blog post and it looks interesting, but I still don’t see any evidence of one process killing another. Thanks for any ideas!

Have you checked the OOM killer log?

No, I was just reading about that – I may need to finagle some permissions to view those.

I found them, but I don’t see much in there (I’m not entirely sure what I’m looking at):

...
Oct 24 03:32:59 ip-10-0-33-197 systemd[5782]: Received SIGRTMIN+24 from PID 6152 (kill).
Oct 25 11:06:01 ip-10-0-33-197 run_erl[2076]: Erlang closed the connection.
Oct 25 12:19:49 ip-10-0-33-197 systemd[10681]: Listening on GnuPG cryptographic agent (ssh-agent emulation).
...

The “kill event” happened exactly at Oct 25 11:06:01 – but if it says that Erlang itself closed the connection, does that mean that Erlang killed itself?

FYI: /proc/sys/vm/overcommit_memory is 0 (i.e. the kernel’s default heuristic overcommit handling)

syslog had some more info…

Oct 25 11:06:01 ip-10-0-33-197 kernel: [221772.911208] 1_dirty_cpu_sch invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
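So it looks like it was the OOM killer after all. As a next step (not something that’s in the app yet – just a rough sketch to run from a remote console or a periodic task), dumping :erlang.memory/0 should show which part of the VM is growing before the kernel steps in:

:erlang.memory()
|> Enum.each(fn {area, bytes} ->
  IO.puts("#{area}: #{Float.round(bytes / 1_048_576, 1)} MB")
end)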