Eventually I figured this out, so I chose not to post it as a question, but it might be useful to anyone looking at a similar problem.
I have a small OTP application that acts as a front to a remote, REST-based 'reservation' system. The main module exposes a simple acquire/release API that communicates with a single GenServer. Client processes (in other apps) that acquire reservations are monitored by this GenServer, so that if they disappear, their reservations are released.
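For context, the shape of the thing is roughly this (a minimal sketch: module and function names are invented, and the slow remote REST call is stubbed out):

```elixir
defmodule Reservations.Server do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def acquire(id), do: GenServer.call(__MODULE__, {:acquire, id})
  def release(id), do: GenServer.call(__MODULE__, {:release, id})

  @impl true
  def init(_opts), do: {:ok, %{}}

  @impl true
  def handle_call({:acquire, id}, {client, _tag}, state) do
    # Monitor the client so its reservation can be released if it dies.
    ref = Process.monitor(client)
    {:reply, :ok, Map.put(state, ref, id)}
  end

  def handle_call({:release, id}, _from, state) do
    # Explicit release: drop the monitor(s) pointing at this reservation.
    {dropped, kept} = Enum.split_with(state, fn {_ref, rid} -> rid == id end)
    Enum.each(dropped, fn {ref, _} -> Process.demonitor(ref, [:flush]) end)
    {:reply, :ok, Map.new(kept)}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    # A client died: release its reservation on the remote system.
    {id, state} = Map.pop(state, ref)
    if id, do: remote_release(id)
    {:noreply, state}
  end

  # Stand-in for the slow round-trip to the remote REST service.
  defp remote_release(_id), do: :ok
end
```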
This seemed to work fine with my initial testing, but as I began to ramp up the number of reservations I noticed that if I gracefully brought the system down while a large number of reservations were held, there would be a number that weren’t released at system exit.
I wasn't quite sure what was going on initially, and tried adding a `Process.flag(:trap_exit, true)` / `terminate/2`, in which I cleaned up any 'residual' reservations (yes, I know, I was just cargo-culting my way through this), but according to the logs, `terminate/2` wasn't even being called.
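For anyone else cargo-culting: the relevant incantation is the trap in `init/1`; without it, the exit signal kills the process outright and `terminate/2` never runs at all (trivial sketch, cleanup elided):

```elixir
defmodule TrapDemo do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    # Without this, the parent's exit signal terminates the process
    # immediately and terminate/2 is never invoked.
    Process.flag(:trap_exit, true)
    {:ok, %{}}
  end

  @impl true
  def terminate(_reason, _state) do
    # Last-ditch cleanup of residual reservations would go here.
    :ok
  end
end
```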
Eventually, I figured it out:
- On system shutdown, all the client processes that acquired reservations exit, as their own application stops. They do this very quickly.
- The GenServer monitoring these processes receives `:DOWN` messages for each one, in quick succession.
- It starts working its way through these one at a time. Each call to the 'real' reservation system to release the resource takes a significant number of milliseconds to round-trip.
- In the meantime, system shutdown is continuing and eventually reaches the reservation app itself.
- The exit signal is sent by the supervisor to the GenServer and, because it's trapping exits, converted to an `:EXIT` message. The delivery of that message is (I believe) converted somewhere in the GenServer module into a call to `terminate/2`.
- However, that message is now stuck behind all the `:DOWN` messages. The supervisor waits the required amount of time (5000ms by default) for the GenServer to exit.
- The GenServer doesn't respond in time and is killed. All the unprocessed `:DOWN` messages in the queue are lost, resulting in unreleased reservations.
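The queueing effect is easy to reproduce in isolation. The sketch below (invented module, with `Process.sleep/1` standing in for the slow REST round-trip) queues a batch of fake `:DOWN` messages and then asks the server to stop; the stop request sits behind them in the mailbox, so shutdown takes roughly batch size times round-trip time:

```elixir
defmodule SlowDrain do
  use GenServer

  # Started unlinked so this standalone demo doesn't entangle the caller.
  def start(delay_ms), do: GenServer.start(__MODULE__, delay_ms)

  @impl true
  def init(delay_ms) do
    Process.flag(:trap_exit, true)
    {:ok, {delay_ms, 0}}
  end

  @impl true
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, {delay, n}) do
    # Stand-in for the slow round-trip to the real reservation system.
    Process.sleep(delay)
    {:noreply, {delay, n + 1}}
  end

  @impl true
  def terminate(_reason, {_delay, n}) do
    # Only runs once every :DOWN queued ahead of the stop request
    # has been drained.
    IO.puts("drained #{n} :DOWN messages before terminating")
  end
end

{:ok, pid} = SlowDrain.start(100)

# Queue 20 fake :DOWN messages at ~100 ms of handling each...
for _ <- 1..20, do: send(pid, {:DOWN, make_ref(), :process, self(), :normal})

# ...so stopping the server takes around 2 s.
{elapsed_us, :ok} = :timer.tc(fn -> GenServer.stop(pid, :shutdown, 10_000) end)
IO.puts("stop took ~#{div(elapsed_us, 1000)} ms")
```

With a big enough backlog and the default 5000ms, that stop is exactly what times out: the process is killed and the remaining `:DOWN`s are lost.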
There is obviously a bottleneck in this design. One fix would be to partition the reservations across multiple GenServers. Another would be simply to extend the shutdown timeout of the GenServer (the latter actually works for me; shutdown time is not particularly sensitive in this case).
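Extending the timeout is just a matter of overriding `:shutdown` in the child spec (sketch, assuming the server module is called `Reservations.Server`; 30 s is an arbitrary figure):

```elixir
defmodule Reservations.Server do
  # Stub standing in for the real reservation server.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts), do: {:ok, opts}
end

# Give the GenServer up to 30 s (instead of the default 5 s) to drain
# its mailbox before the supervisor kills it.
children = [
  Supervisor.child_spec({Reservations.Server, []}, shutdown: 30_000)
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)
```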
One thing I realised is that if the clients were all 'static' children of a `Supervisor`, you might be able to get away with having them explicitly release their reservations, as the shutdown sequence progresses in an orderly fashion, acting as a natural throttle. In my case, however, they are children of a `DynamicSupervisor` and so are (again, I believe) stopped in parallel, which just results in the same bottleneck, but with (some of) the clients timing out rather than the GenServer itself.
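A sketch of the static-children idea (invented module; `IO.puts` stands in for the release call): each client traps exits and releases its own reservation in `terminate/2`, relying on a static `Supervisor` stopping its children one at a time, in reverse start order:

```elixir
defmodule StaticClient do
  use GenServer

  def start_link(id), do: GenServer.start_link(__MODULE__, id)

  @impl true
  def init(id) do
    # Trap exits so terminate/2 runs when the supervisor stops us.
    Process.flag(:trap_exit, true)
    {:ok, id}
  end

  @impl true
  def terminate(_reason, id) do
    # Static children are stopped one at a time, in reverse start
    # order, so these releases are naturally serialised.
    IO.puts("client #{id} released its reservation")
  end
end

children =
  for id <- 1..3 do
    Supervisor.child_spec({StaticClient, id}, id: {StaticClient, id})
  end

{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

# Stops the clients in reverse start order: 3, then 2, then 1.
:ok = Supervisor.stop(sup)
```

Under a `DynamicSupervisor` there is no equivalent ordering to lean on, which is why the same trick doesn't help there.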