Eventually I figured this out, so I chose not to post it as a question, but it might be useful to anyone looking at a similar problem.
I have a small OTP application that acts as a front to a remote, REST-based 'reservation' system. The main module exposes a simple acquire/release API that communicates with a single GenServer. Client processes (in other apps) that acquire reservations are monitored by this GenServer, so that if they disappear, their reservations are released.
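For context, the shape of the thing is roughly this (a minimal sketch: module and function names are invented, and the slow remote REST call is stubbed out):

```elixir
defmodule Reservations.Server do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def acquire(id), do: GenServer.call(__MODULE__, {:acquire, id})
  def release(id), do: GenServer.call(__MODULE__, {:release, id})

  @impl true
  def init(_opts), do: {:ok, %{}}

  @impl true
  def handle_call({:acquire, id}, {client, _tag}, state) do
    # Monitor the client so its reservation can be released if it dies.
    ref = Process.monitor(client)
    {:reply, :ok, Map.put(state, ref, id)}
  end

  def handle_call({:release, id}, _from, state) do
    # Explicit release: drop the monitor(s) pointing at this reservation.
    {dropped, kept} = Enum.split_with(state, fn {_ref, rid} -> rid == id end)
    Enum.each(dropped, fn {ref, _} -> Process.demonitor(ref, [:flush]) end)
    {:reply, :ok, Map.new(kept)}
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
    # A client died: release its reservation on the remote system.
    {id, state} = Map.pop(state, ref)
    if id, do: remote_release(id)
    {:noreply, state}
  end

  # Stand-in for the slow round-trip to the remote REST service.
  defp remote_release(_id), do: :ok
end
```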
This seemed to work fine with my initial testing, but as I began to ramp up the number of reservations I noticed that if I gracefully brought the system down while a large number of reservations were held, there would be a number that weren’t released at system exit.
I wasn't quite sure what was going on initially, and tried adding a `Process.flag(:trap_exit, true)` / `terminate/2`, in which I cleaned up any 'residual' reservations (yes, I know, I was just cargo-culting my way through this), but according to the logs, `terminate/2` wasn't even being called.
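For anyone else cargo-culting: the relevant incantation is the trap in `init/1`; without it, the exit signal kills the process outright and `terminate/2` never runs at all (trivial sketch, cleanup elided):

```elixir
defmodule TrapDemo do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts) do
    # Without this, the parent's exit signal terminates the process
    # immediately and terminate/2 is never invoked.
    Process.flag(:trap_exit, true)
    {:ok, %{}}
  end

  @impl true
  def terminate(_reason, _state) do
    # Last-ditch cleanup of residual reservations would go here.
    :ok
  end
end
```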
Eventually, I figured it out:
- On system shutdown, all the client processes that acquired reservations exit, as their own application stops. They do this very quickly.
- The GenServer monitoring these processes receives `:DOWN` messages for each one, in quick succession.
- It starts working its way through these one at a time. Each call to the 'real' reservation system to release the resource takes a significant number of milliseconds to round-trip.
- In the meantime, system shutdown is continuing and eventually reaches the reservation app itself.
- The exit signal is sent by the supervisor to the GenServer and, because it's trapping exits, converted to an `:EXIT` message. The delivery of that message is (I believe) converted somewhere in the GenServer module into a call to `terminate/2`.
- However, that message is now stuck behind all the `:DOWN` messages. The supervisor waits the required amount of time (5000ms by default) for the GenServer to exit.
- The GenServer doesn't respond in time and is killed. All the unprocessed `:DOWN` messages in the queue are lost, resulting in unreleased reservations.
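The queueing effect is easy to reproduce in isolation. The sketch below (invented module, with `Process.sleep/1` standing in for the slow REST round-trip) queues a batch of fake `:DOWN` messages and then asks the server to stop; the stop request sits behind them in the mailbox, so shutdown takes roughly batch size times round-trip time:

```elixir
defmodule SlowDrain do
  use GenServer

  # Started unlinked so this standalone demo doesn't entangle the caller.
  def start(delay_ms), do: GenServer.start(__MODULE__, delay_ms)

  @impl true
  def init(delay_ms) do
    Process.flag(:trap_exit, true)
    {:ok, {delay_ms, 0}}
  end

  @impl true
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, {delay, n}) do
    # Stand-in for the slow round-trip to the real reservation system.
    Process.sleep(delay)
    {:noreply, {delay, n + 1}}
  end

  @impl true
  def terminate(_reason, {_delay, n}) do
    # Only runs once every :DOWN queued ahead of the stop request
    # has been drained.
    IO.puts("drained #{n} :DOWN messages before terminating")
  end
end

{:ok, pid} = SlowDrain.start(100)

# Queue 20 fake :DOWN messages at ~100 ms of handling each...
for _ <- 1..20, do: send(pid, {:DOWN, make_ref(), :process, self(), :normal})

# ...so stopping the server takes around 2 s.
{elapsed_us, :ok} = :timer.tc(fn -> GenServer.stop(pid, :shutdown, 10_000) end)
IO.puts("stop took ~#{div(elapsed_us, 1000)} ms")
```

With a big enough backlog and the default 5000ms, that stop is exactly what times out: the process is killed and the remaining `:DOWN`s are lost.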
There is obviously a bottleneck in this design. One fix would be to partition the reservations across multiple GenServers. Another would be simply to extend the shutdown timeout of the GenServer (the latter actually works for me; shutdown time is not particularly sensitive in this case).
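Extending the timeout is just a matter of overriding `:shutdown` in the child spec (sketch, assuming the server module is called `Reservations.Server`; 30 s is an arbitrary figure):

```elixir
defmodule Reservations.Server do
  # Stub standing in for the real reservation server.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts), do: {:ok, opts}
end

# Give the GenServer up to 30 s (instead of the default 5 s) to drain
# its mailbox before the supervisor kills it.
children = [
  Supervisor.child_spec({Reservations.Server, []}, shutdown: 30_000)
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)
```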
One thing I realised is that if the clients were all 'static' children of a `Supervisor`, you might be able to get away with having them explicitly release their reservations, as the shutdown sequence progresses in an orderly fashion, acting as a natural throttle. In my case, however, they are children of a `DynamicSupervisor` and so are (again, I believe) stopped in parallel, which just results in the same bottleneck, but with (some of) the clients timing out rather than the GenServer itself.
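A sketch of the static-children idea (invented module; `IO.puts` stands in for the release call): each client traps exits and releases its own reservation in `terminate/2`, relying on a static `Supervisor` stopping its children one at a time, in reverse start order:

```elixir
defmodule StaticClient do
  use GenServer

  def start_link(id), do: GenServer.start_link(__MODULE__, id)

  @impl true
  def init(id) do
    # Trap exits so terminate/2 runs when the supervisor stops us.
    Process.flag(:trap_exit, true)
    {:ok, id}
  end

  @impl true
  def terminate(_reason, id) do
    # Static children are stopped one at a time, in reverse start
    # order, so these releases are naturally serialised.
    IO.puts("client #{id} released its reservation")
  end
end

children =
  for id <- 1..3 do
    Supervisor.child_spec({StaticClient, id}, id: {StaticClient, id})
  end

{:ok, sup} = Supervisor.start_link(children, strategy: :one_for_one)

# Stops the clients in reverse start order: 3, then 2, then 1.
:ok = Supervisor.stop(sup)
```

Under a `DynamicSupervisor` there is no equivalent ordering to lean on, which is why the same trick doesn't help there.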