Which conditions lead to a complete crash of a supervision tree?

I’ve encountered a Supervisor behaviour I find surprising.

In short, a library I'm working on had an issue where the entire app would shut down after too many errors in a single process. All the processes are in a supervision tree with multiple nested supervisors, using strategy: :one_for_one and restart: :permanent.
I appreciate that this might be the sensible thing to do, but I'm having trouble finding documentation on it.

My specific problem was that I had forgotten to declare a handle_info clause in a GenServer. I'm using Redis pub/sub, and I was testing what happens when the Redis server is terminated while the Mix app is still running.
The library I'm using, redix_pub_sub, requires a running GenServer to maintain the connection with Redis; that process receives a :disconnect message if something interrupts the connection. Since I wasn't handling it, stopping Redis caused a stream of errors.
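For context, this is roughly the kind of catch-all clause that was missing. A minimal sketch only: the module name is hypothetical and the handler just logs the message, whereas real code would pattern-match on the pub/sub messages it cares about.

```elixir
defmodule MyApp.RedisListener do
  use GenServer
  require Logger

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts), do: {:ok, opts}

  # Catch-all clause: log unexpected messages (such as disconnect
  # notifications) instead of crashing with a FunctionClauseError.
  @impl true
  def handle_info(message, state) do
    Logger.warning("unexpected message: #{inspect(message)}")
    {:noreply, state}
  end
end
```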

The process would crash with a FunctionClauseError for the missing handle_info, then be restarted as expected, and then crash again for the same reason. I could verify that it would be restarted three times before a complete system failure.
After three restarts and errors, the application would give up and terminate with this message printed on the console:

[info]  Application my_app_name exited: shutdown

This would affect only that specific supervision tree, not the rest of the system. If running inside iex -S mix, for example, I could restart it with Application.ensure_started(:my_app_name).

Implementing the missing function clause solves the immediate problem, of course, but I am wondering if I can control the “shutdown everything” behaviour.

I can reproduce the issue by triggering other errors in quick succession.
For example, adding a 1 / 0 in a function that I can trigger manually is "tolerated", and the supervisors do their job. Adding it in a callback that is invoked multiple times when a process starts, on the other hand, will take everything down.

As I said, this might be the most sensible thing to do, but I’d like to learn more.


Hi,

Here's a summary of the relevant behaviour (the Supervisor module documentation covers it in more detail):

By default the supervisor will try to restart a child unless a threshold of max_restarts within a window of max_seconds is reached. If the threshold is reached, the assumption is that the error is persistent and no amount of restarts can bring the child back to a valid state, so the supervisor terminates all its children and then itself. The next higher-level supervisor will then take some action, which may include restarting the terminated supervisor. If that level's restart intensity threshold is also reached, it terminates too. This eventually bubbles up until the top-level application supervisor terminates itself.

The default max_restarts is 3 and the default max_seconds is 5.
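As an illustration, both limits can be tuned when building a supervisor with Supervisor.init/2. A sketch only: the module name, child spec, and chosen limits below are placeholders, not values from your app.

```elixir
defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      # Placeholder child; substitute the real worker specs.
      {MyApp.RedisListener, []}
    ]

    # Tolerate up to 10 restarts within a 30-second window before
    # this supervisor gives up and terminates its children and itself.
    Supervisor.init(children,
      strategy: :one_for_one,
      max_restarts: 10,
      max_seconds: 30
    )
  end
end
```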


Thank you, this is exactly what I was looking for.