Exact SIGTERM release shutdown behavior and exit signal behavior

Hi everyone. I’m trying to understand some error messages that occasionally appear in our logs durning shutdown. Mostly, I’m just trying to use them to understand the shutdown process a bit better, and trapping exit signals, and all that. When the logs appear, there’s typically about ~10 of them, that look like this:

12/20/2023 7:17:21.434 PM +0000 Last message: {:EXIT, #PID<0.28018.0>, :killed}
      19:17:21.434 [][error] GenServer #PID<0.28046.0> terminating ** (stop) killed

They appear ~7 seconds after I see the long that our release received the SIGTERM and is shutting down. There’s a lot of info in that message, and I’m trying to use it to build a mental model of what’s going on.

From using a remote shell prior to the shutdown and dumping the PIDs and process info, I can see that 0.28018.0 is a supervisor3 and 0.28046.0 is a brod_producer that’s linked to it, both of which come from the brod kafka library we use. The kafka topic this is listening to has 64 partitions, so the supervisor has 64 children, though we typically only get log messages from a handful of them.

So what does Last message: {:EXIT, #PID<0.28018.0>, :killed} mean? I believe this means that:

  • the child, the brod_producer is trapping exits
  • It’s linked to the supervisor, so it receives the supervisor’s exit signals as well.
  • The supervisor received a kill signal (e.g. Process.exit(sup, :kill)), which gets turned into a :killed reason when it propagates.

Question 1:: What does it mean that this was the “last message” that it received? Does it mean it was in the process of handling it? That the message was what caused the GenServer to crash? That it was the last one received and hence the last one in its message inbox, and yet to be processed? That it may have finished handling the message and not gotten any new messages since then?

Question 2: That last message is of the form: {:EXIT, from, reason}. Does that get handled just like any ordinary handle_info? The GenServer terminate/2 won’t handle it, right?

Now, for the other part of the log:

[error] GenServer #PID<0.28046.0> terminating ** (stop) killed

This is the child reporting that it got killed.

Question 3: What are the pieces of those log messages? Is stop or killed the exit reason? Can someone identify what code emits that message and why? Does it indicate the process threw an error? That it was brutal killed? From a brief exploration it looked like that came from formatting a SASL error log or something like that, but I can’t seem to get the message to appear in my local experiments.

Question 4: If that’s in response to an exit signal, do we know what sent it that signal?

I know that normally in a clean shutdown, the supervisor will be told to stop, it will in turn try to stop all its children (in reverse order) with something like GenServer.stop/3, giving 5 seconds (by default, or whatever the shutdown value is in the child_spec), and then it will kill it with something like Process.exit(pid, :kill). But in my example here, it looks like the Supervisor itself was killed prior to this, and that the GenServer trapped that exit, so I’m thinking it’s something else that killed it.

Question 5: When a release receives a SIGTERM, I think that turns into an Application.stop/1 - does that send exit signals beyond simply telling Supervisor.stop/3 to the root supervisor of the application? If the shutdown “hangs” by a misbehaving worker process somewhere, is there a second round of brutal kills that is sent?

The brod kafka library we use has a README with a nice process diagram of how it’s setup. What’s odd about it (and maybe not so standard-OTP?) is the process tree has a client at the top, which is a worker, which manages two children which are supervisors. So my hypothesis is that the application is not shutting down cleanly, because the client worker process at the root of the brod tree is being given 5 seconds to stop, but it then tries to stop its children (which are supervisors) and are taking longer than 5 seconds. So then the application kills the client, which leaves the rest of the tree in a bad state. But I can’t quite tie all the pieces together in my mind yet.

I’d appreciate any help from an expert on OTP practices and how exit signals propagate and all that. Thanks!

1 Like

In the first instance, drop into an IEX console and run Application.stop or otherwise terminate your application tree. Observe what happens and how long it takes?