I’ve encountered a Supervisor behaviour I find surprising.
In a few words, a library I’m working on had an issue where the entire app would shut down after too many errors in a process. All the processes are in a supervision tree with multiple nested supervisors, using strategy: :one_for_one and restart: :permanent.
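For context, the tree is shaped roughly like this; the module names here are made up for illustration, and the Agent is just a trivial stand-in for the real workers:

```elixir
defmodule MyApp.Worker do
  use Agent

  # A trivial permanent worker standing in for the real processes.
  def start_link(_opts), do: Agent.start_link(fn -> :ok end, name: __MODULE__)
end

defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok) do
    children = [
      # Workers (and nested supervisors) default to restart: :permanent.
      {MyApp.Worker, []}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```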
I appreciate that it might be the sensible thing to do, but I’m having trouble finding documentation on this.
My specific problem was that I had forgotten to declare a handle_info clause in a GenServer. I’m using Redis pub/sub, and I was testing what happens when the Redis server is terminated while the Mix app is still running.
The library I’m using, redix_pub_sub, requires a running GenServer to maintain the connection with Redis; that process receives a :disconnect message if something interrupts the connection. Since I wasn’t handling it, stopping Redis caused a stream of errors.
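The fix ended up being a clause along these lines. Treat it as a sketch: the exact tuple shape of the disconnect notification depends on the redix_pub_sub version, so check the library docs; the catch-all clause is the part that actually prevents the crash:

```elixir
defmodule MyApp.Listener do
  use GenServer
  require Logger

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok), do: {:ok, %{connected?: true}}

  # Hypothetical shape of the disconnect notification; check the
  # redix_pub_sub docs for the exact tuple your version sends.
  @impl true
  def handle_info({:redix_pubsub, _pid, :disconnected, _reason}, state) do
    {:noreply, %{state | connected?: false}}
  end

  # Catch-all: log and carry on instead of crashing with a
  # FunctionClauseError on any message we didn't anticipate.
  def handle_info(msg, state) do
    Logger.warning("unexpected message: #{inspect(msg)}")
    {:noreply, state}
  end
end
```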
The process would crash with a FunctionClauseError for the missing handle_info clause, then be restarted as expected, and then crash again for the same reason. I could verify that it would be restarted three times before a complete system failure.
After three restarts and errors, the application would give up and terminate with this message printed on the console:
[info] Application my_app_name exited: shutdown
This would affect that specific supervision tree only, and not the rest. If running inside iex -S mix, for example, I could restart it with Application.ensure_started(:my_app_name).
Implementing the missing function clause solves the immediate problem, of course, but I am wondering if I can control the “shutdown everything” behaviour.
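From what I can tell, the knob is the supervisor’s restart intensity: Elixir’s Supervisor defaults to max_restarts: 3 within max_seconds: 5, which matches the “three restarts, then shutdown” I observed. A sketch of raising it (MyApp.FlakyWorker is a placeholder for the crashing process):

```elixir
defmodule MyApp.FlakyWorker do
  use Agent
  # Placeholder for the process that keeps crashing.
  def start_link(_opts), do: Agent.start_link(fn -> :ok end, name: __MODULE__)
end

defmodule MyApp.TolerantSupervisor do
  use Supervisor

  def start_link(opts \\ []), do: Supervisor.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok) do
    children = [{MyApp.FlakyWorker, []}]

    Supervisor.init(children,
      strategy: :one_for_one,
      # Defaults are max_restarts: 3, max_seconds: 5; past that the
      # supervisor itself exits and the failure propagates up the tree.
      max_restarts: 10,
      max_seconds: 60
    )
  end
end
```

Note that this only moves the threshold: once a child exceeds the intensity, the supervisor still terminates, and if that happens to the top-level supervisor the whole application stops.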
I can reproduce the issue by triggering other errors in quick sequence.
For example, adding a 1 / 0 in a function that I trigger manually is “tolerated”, and the supervisors do their job. Adding it in a callback that is invoked multiple times when a process starts, on the other hand, will take everything down.
As I said, this might be the most sensible thing to do, but I’d like to learn more.