OK, here’s a tricky issue, and I’m looking for a bit of advice about where to solve it…
We just enabled the New Relic agent in our Elixir app. It uses os_mon, from OTP, to poll for the CPU utilisation and report it back to New Relic. cpu_sup, the part of os_mon that deals with CPU reporting, uses a port subprocess, also called cpu_sup, to hook into the OS and retrieve the utilisation/load/etc.
Our application runs on Heroku. During the shutdown of a dyno, Heroku sends a SIGTERM to all processes, whether child or not. BEAM detects this and starts to shut down the applications in sequence. But the cpu_sup subprocess also receives the signal and exits immediately. The next time the New Relic agent polls, the cpu_sup GenServer process blows up with an ArgumentError, as the port is no longer running.
So, where should we resolve this?
Is it a bug in Erlang/OTP that cpu_sup responds to a SIGTERM? I’m not sure what’s expected of ports here, but it feels like it’s important to be able to exit a detached subprocess with SIGTERM, if that ever occurs. I’m not convinced this process is doing anything wrong, per se.
Or is this an issue in Heroku, in that it sends SIGTERM to all processes, not just the parent/root process? If we controlled the OS and supervisor we wouldn’t send SIGTERM to all processes, but I can see why they do this. And I can’t see how we could prevent it, at the moment.
Or can we trap this in the New Relic agent? I’m not sure how this would work, without trapping exits from the cpu_sup GenServer process, which feels a bit nasty.
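To make the agent-side option concrete, here is a minimal sketch of what I have in mind. SafeCpuSample and sample/1 are hypothetical names of my own, not anything in the New Relic agent or OTP. The only real call is :cpu_sup.util/0, which needs os_mon to be running. I'm assuming the poller sees the failure as an exit from the call into cpu_sup, which I haven't confirmed.

```elixir
defmodule SafeCpuSample do
  @moduledoc """
  Hypothetical helper (not part of the New Relic agent or OTP): samples CPU
  utilisation and turns a dying cpu_sup server into an error tuple.
  """

  # `sampler` defaults to :cpu_sup.util/0, which requires the :os_mon
  # application to be running. It is injectable so the wrapper can be
  # exercised without os_mon.
  def sample(sampler \\ &:cpu_sup.util/0) do
    try do
      case sampler.() do
        {:error, reason} -> {:error, reason}
        util when is_number(util) -> {:ok, util}
        other -> {:error, {:unexpected, other}}
      end
    catch
      # A GenServer call exits the caller if the server dies mid-call,
      # so the failure is caught here as an :exit rather than rescued.
      :exit, reason -> {:error, {:exit, reason}}
    end
  end
end

# Simulating a sampler whose server has gone away:
SafeCpuSample.sample(fn -> exit(:port_closed) end)
# => {:error, {:exit, :port_closed}}
```

This would only protect the polling process. I'd expect the crash of the cpu_sup GenServer itself to still be logged, so it may not keep the error out of our exception tracking.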
Or should we just give in and ignore these errors in our exception tracking?
Any thoughts?