I have an app that uses a handful of GenServers and GenStage modules that communicate via PubSub and regular old Kernel.send/2. Over the weekend, we gave it a chunk of work to chew through. We had previously given it smaller chunks of work, and things had been going great – we checked the dials, verified side effects, and were increasing the workload. But this weekend the process was killed. There wasn’t much in the EC2 logs, just a few lines before it shut down and restarted:
Killed
heart: Mon Oct 25 11:06:01 2021: Erlang has closed.
heart: Mon Oct 25 11:06:01 2021: Wait 5 seconds for Erlang to terminate nicel
heart: Mon Oct 25 11:06:01 2021: Executed "/home/foo/bin/foo daemon" -> 0. Terminating.
It’s running as a built release, as a service. I thought maybe the culprit was too many messages in the GenStage buffers – we defined a max buffer_size of 5_000_000 – but from my experiments, when that buffer overflows, a warning gets logged, and we saw no such warning. I checked the EC2 logs around the event, and nothing interesting preceded or followed it… it seemed like the regular pitter-patter that has been going on in that log file.
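For reference, the buffer cap I mentioned is the :buffer_size option returned from the producer’s init/1. A minimal sketch of how we set it (the module name and internals here are placeholders, not our actual code):

```elixir
defmodule MyApp.WorkProducer do
  use GenStage

  def start_link(opts) do
    GenStage.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    # When events arrive faster than consumers demand them, GenStage
    # buffers up to :buffer_size events, then discards the excess and
    # logs a warning that N events were discarded from the buffer --
    # which is the warning we never saw.
    {:producer, %{}, buffer_size: 5_000_000}
  end

  @impl true
  def handle_demand(_demand, state) do
    # Placeholder: our real producer dispatches pending work here.
    {:noreply, [], state}
  end
end
```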
I discovered this blog post, and it looks interesting, but I still don’t see any evidence of one process killing another. Thanks for any ideas!