Strategies to avoid cascading supervisor crashes?

When a supervised process crashes and restarts repeatedly, it can take down its supervisor; that supervisor then restarts and crashes in turn, and at some point the failure reaches the root application supervisor and brings the whole application down.

This has hit me a couple of times now, and I’m wondering whether there are some decent ways to deal with it. Is it generally handled with deeper supervision nesting, higher restart thresholds, or delaying process actions that could cause errors? Are there other techniques?
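
To make “higher restart thresholds” concrete, this is the kind of knob I mean: the supervisor’s restart intensity options (the worker module here is just a placeholder):

```elixir
# Restart intensity: a supervisor gives up (and crashes itself) once it sees
# more than :max_restarts restarts within :max_seconds seconds.
# The defaults are 3 restarts within 5 seconds.
children = [
  {MyApp.Worker, []}  # placeholder child
]

Supervisor.start_link(children,
  strategy: :one_for_one,
  # Tolerate more crashes before this supervisor itself gives up
  # and escalates the failure to its own parent.
  max_restarts: 10,
  max_seconds: 60
)
```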

I am basically looking to have part of my application down rather than fully down (so that it can be inspected and/or fixed with hot code loading).

Thanks!

2 Likes

You’ll need to handle at least the “now stop trying” part manually. Supervisors are meant to keep trying – they operate under the assumption that human intervention might mean a technician driving out to the device.

That’s why you always want something OS-level as well, like heart or systemd.

1 Like

Thanks!

We have a large cluster across multiple data centers that we regularly upgrade. Our main problem has been incompatible versions of code being released into the cluster. We’re looking for ways to prevent all the nodes from crashing en masse when a mistake lets “bad” data cross node boundaries.

I’m now experimenting with configurable Process.sleep calls in GenServer's handle_continue callback to throttle process restarts, so that repeated failures have a deterministic upper bound on the crash rate.
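
Roughly along these lines – a minimal sketch, with made-up module name and delay value:

```elixir
# Sketch only: the worker defers its (possibly crashing) startup work to
# handle_continue and sleeps there first, so even a tight crash/restart
# loop can't cycle faster than the configured delay.
defmodule MyApp.ThrottledWorker do
  use GenServer

  def start_link(opts) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(opts) do
    # Return quickly so the supervisor isn't blocked during init.
    {:ok, opts, {:continue, :after_restart_delay}}
  end

  @impl true
  def handle_continue(:after_restart_delay, opts) do
    # Configurable pause: puts a deterministic upper bound on the crash rate.
    Process.sleep(Keyword.get(opts, :restart_delay, 5_000))
    {:noreply, do_risky_setup(opts)}
  end

  # Placeholder for the work that might fail and crash the process.
  defp do_risky_setup(opts), do: opts
end
```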

What I’ve done in the past, in a larger system I worked on, was to introduce another supervision layer (let’s call it the parent supervisor) above the supervisor of the problematic process (let’s call that one the child supervisor).

The parent supervisor starts the child supervisor with the restart option set to :temporary. This way the child supervisor still does its job as usual, until sh*t hits the metaphorical fan and it goes down. Because it was started as temporary, the parent supervisor won’t restart it, so it acts as a circuit breaker.
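
In code it looks roughly like this (module names invented, not the exact system I worked on); the important parts are the restart: :temporary override on the child supervisor’s spec and the deliberately low restart intensity inside it:

```elixir
defmodule MyApp.ParentSupervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    children = [
      # Unrelated parts of the app keep running even if the risky subtree dies.
      {MyApp.StableWorker, []},
      # Circuit breaker: once this subtree exceeds its own restart intensity
      # and shuts down, the parent will NOT restart it.
      Supervisor.child_spec({MyApp.ChildSupervisor, []}, restart: :temporary)
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end

defmodule MyApp.ChildSupervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    children = [
      {MyApp.ProblematicWorker, []}
    ]

    # Low intensity so a crash loop escalates quickly, but only this far.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end
```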

Of course you either need some monitoring on top - so a human operator can step in and fix the problem - or some kind of exponential restart strategy. But at least you’ve now isolated the issue to part of your system.

Hope this helps?

4 Likes

That sounds like a good approach as well, thanks! :slight_smile:

I know there is a really good talk from Fred Hebert about how to design robust systems.

I think it is this:

https://ferd.ca/the-hitchhiker-s-guide-to-the-unexpected.html

but I’m not sure, as I think the graphic was different in the talk I have in mind.

Other useful blogs:

https://ferd.ca/it-s-about-the-guarantees.html

1 Like

Your suggestion is precisely what is recommended by the Elixir in Action book:

Opting for the :temporary strategy also means that the parent supervisor won’t be restarted due to too many failures in its children. Even if there are frequent crashes in one child process, say due to corrupt state, you’ll never take down the entire tree, which should improve the availability of the entire system.

3 Likes