Guidelines for Supervision trees and setting restart intensity parameters

I was wondering if there were any books, blogs, or docs out there that discuss supervisor restart intensity parameters (the number of crashes allowed in a period before the supervisor itself crashes) and how to handle top-level application supervisor crashes caused by exceeding those limits.

For example, there’s some information in http://erlang.org/doc/design_principles/sup_princ.html#tuning-the-intensity-and-period. This is helpful, but the numbers seem arbitrarily picked (this may just be the nature of these numbers).

There also seems to be some unwritten(?) knowledge about handling top-level application supervisor crashes due to exceeding restart intensities. If you’re using a 3rd party application, it will have selected a restart intensity for you. If this doesn’t work for your project, I’ve heard of people starting the OTP application as temporary (instead of permanent) and adding code that monitors whether the application is still running and, if not, waits through a cooldown period before restarting it. shoehorn gets into this and other strategies, but it doesn’t provide recommendations.
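To sketch what I mean by that last part (the app name and timings are made up, and this is just the rough shape rather than something I’ve battle-tested):

```elixir
defmodule AppWatchdog do
  use GenServer

  @app :flaky_app
  @check_interval 5_000
  @cooldown 30_000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # The watched application is started as :temporary at boot, e.g.
    # Application.ensure_all_started(@app, :temporary), so its failure
    # does not take the whole node down with it.
    Process.send_after(self(), :check, @check_interval)
    {:ok, %{}}
  end

  @impl true
  def handle_info(:check, state) do
    if running?() do
      Process.send_after(self(), :check, @check_interval)
    else
      # App is down: wait out a cooldown before attempting a restart,
      # so we don't spin on a hopeless failure.
      Process.send_after(self(), :restart, @cooldown)
    end

    {:noreply, state}
  end

  def handle_info(:restart, state) do
    Application.ensure_all_started(@app, :temporary)
    Process.send_after(self(), :check, @check_interval)
    {:noreply, state}
  end

  defp running? do
    Enum.any?(Application.started_applications(), fn {app, _desc, _vsn} -> app == @app end)
  end
end
```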

There’s also the “give up, let the Erlang VM crash, and have systemd restart it” approach.
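On the systemd side that would be something like the fragment below (unit name and paths are made up):

```ini
# Illustrative only: if the BEAM exits, have systemd restart it after a short delay.
[Service]
ExecStart=/srv/my_app/bin/my_app start
Restart=on-failure
RestartSec=5
```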

My hope is that a few people have gone pretty far down the above or similar paths and have written up their experiences. All pointers appreciated.

7 Likes

IME they aren’t so much arbitrary as driven by the needs of specific use cases.

Some applications “should never” crash, so if they start doing so there are deeper problems that need resolving, and restarting that application via the supervision tree is not going to help. Setting the restart limit pessimistically (i.e. low) is a good idea there, since it prevents infinite retries of hopeless situations.

Some applications “might sometimes, if the moon is wrong” crash, and for those a sensible number (as in the docs you pointed to) usually suffices.

Some applications are just, by their nature, going to be flaky. Those are the exception IME, but they exist: things that require 3rd party services to be available; things that require a reliable network; things that require specific hardware devices to be functional … and then the answer gets very application-specific. Can your application work without the dying application, or does the behaviour become undefined or, even worse, damaging? It really all depends.
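As a rough illustration of the first and last cases (module names and numbers here are made up, purely for flavour):

```elixir
defmodule MyApp.WorkerSup do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [MyApp.Worker]

    # "Should never crash": give up quickly and escalate to the next layer up.
    # Supervisor.init(children, strategy: :one_for_one, max_restarts: 1, max_seconds: 5)

    # Flaky by nature (3rd-party service, unreliable network, hardware):
    # tolerate more crashes before escalating.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 10, max_seconds: 60)
  end
end
```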

My personal “best practice” approach is to write applications such that no application they depend on is absolutely required after startup. The application may fail at startup if it cannot do its primary function (e.g. due to an SSL certificate problem or similar misconfiguration), but once running it should have ways to mitigate losing the applications it depends on. This can be done through service degradation, job / log / data buffering (in memory or to disk, though both are finite resources), error notification, …
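A stripped-down sketch of that kind of degradation (names are made up, and flushing/retrying the buffer is omitted):

```elixir
defmodule MetricsReporter do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  def report(event), do: GenServer.cast(__MODULE__, {:report, event})

  @impl true
  def init(_opts), do: {:ok, %{buffer: []}}

  @impl true
  def handle_cast({:report, event}, state) do
    case push_upstream(event) do
      :ok ->
        {:noreply, state}

      {:error, _reason} ->
        # The upstream application/service is down: buffer (bounded!) instead of
        # crashing or blocking callers.
        {:noreply, %{state | buffer: Enum.take([event | state.buffer], 1_000)}}
    end
  end

  defp push_upstream(_event) do
    # Placeholder for the call that depends on another application being up.
    :ok
  end
end
```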

However, I don’t know of any official wisdom on this matter …

1 Like

Go with the defaults; they tend to work for 80% of the cases you are coding for. As for the remaining 20%, do lots and lots and lots of load/stress testing and change the parameters as you go along.

I’m going on memory here… i.e. it might be the wrong talk :expressionless:

An interesting note about the defaults is that Erlang and Elixir chose different ones.

The rationale for the Erlang choice of 1 restart per 5 seconds is here.

The Elixir default of 3 restarts per 5 seconds can be found here. I skimmed through the commit history, but I’m not sure why 3 was chosen.
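For reference, the knobs also go by different names in the two APIs. A minimal Elixir illustration with the Elixir defaults written out explicitly (the child module is a placeholder):

```elixir
# The equivalent Erlang supervisor flags would be:
#   #{strategy => one_for_one, intensity => 1, period => 5}
defmodule DefaultsSup do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    Supervisor.init([MyWorker], strategy: :one_for_one, max_restarts: 3, max_seconds: 5)
  end
end
```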

1 Like

This has been a major learning point for our project as it has matured. The whole point of limiting the supervisor restarts is to allow the next layer up to try to recover the system. For some things this is the right thing to do, but for others it is not.

For example, we were spawning an external program written by another team. No amount of restarts on our side would help it recover, so we wound up using a “circuit breaker” (https://github.com/jlouis/fuse). It allowed us to restart the program an unlimited number of times while still getting “cooldown” periods where we leave it off. This helps the system recover, since it is not spending all its cycles on restarting, and it also helps resolve random race conditions where the program could not start because something was not yet ready.
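Roughly the shape of it (names and thresholds here are made up; check the fuse README for the exact options):

```elixir
defmodule ExternalProgram do
  @fuse :external_program_fuse

  # Call once at startup: tolerate 5 failed starts ("melts") within 30s;
  # once blown, keep the circuit open for 60s before resetting.
  def install_fuse do
    :fuse.install(@fuse, {{:standard, 5, 30_000}, {:reset, 60_000}})
  end

  def maybe_start do
    case :fuse.ask(@fuse, :sync) do
      :ok ->
        case start_port() do
          {:ok, port} ->
            {:ok, port}

          {:error, reason} ->
            # Record the failure; enough melts blow the fuse and force a cooldown.
            :fuse.melt(@fuse)
            {:error, reason}
        end

      :blown ->
        # Circuit is open: skip this restart attempt and try again later.
        {:error, :cooling_down}
    end
  end

  defp start_port do
    # Placeholder for spawning the external program.
    {:ok, Port.open({:spawn, "my_external_program"}, [:binary, :exit_status])}
  end
end
```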

The other problem we faced was the OTP :ssh application. During penetration testing, the app would simply die and take out the entire VM with it. We were able to use Shoehorn to keep restarting it, and things recovered nicely as soon as the attack subsided.

4 Likes

Similarly to @michaelkschmidt, I have some anecdotal experience with supervision trees, from my monolithic Nerves application that has been steadily growing for the better part of two years.

I’ve found that when designing my tree, even simple gen_servers wrapping other people’s libraries end up with their own special supervisor. For example, the currently available sqlite3 adapter for ecto has a pretty nasty race condition that, if the adapter is left on the root supervisor (such as in a Phoenix app), will cause the app to fail to start. If you nest it a few levels deep, or even behind another OTP application instead of just a normal supervisor, it is “stable enough”.
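Roughly what that nesting looks like (module names are placeholders and the numbers are illustrative, not the ones I actually use):

```elixir
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyApp.CoreWorker,
      # The flaky dependency lives under its own supervisor, so its crashes are
      # contained a level down instead of exhausting the root supervisor's intensity.
      MyApp.RepoSupervisor
    ]

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end

defmodule MyApp.RepoSupervisor do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    # A more generous intensity here, local to the flaky child.
    Supervisor.init([MyApp.Repo], strategy: :one_for_one, max_restarts: 10, max_seconds: 30)
  end
end
```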

Side note: most (if not all) of my experience here is in Nerves (as opposed to, say, Phoenix). I’ve found that managing supervision trees isn’t as much of a problem for (at least simple) Phoenix apps, since you can make assumptions such as “the time will be set correctly by the time my app starts”.

4 Likes

There may even be cases where you find the best option is to isolate the external library onto a slave node and have it auto-restarted by a supervisor/gen_server.
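Very much a sketch of the idea (module and names are made up, distribution needs to be configured, `host` must match the hostname part of the running node’s name, and newer OTP releases would use :peer instead of :slave):

```elixir
defmodule IsolatedRunner do
  def run(host, module, fun, args) do
    # Start a throwaway node; it terminates automatically if the master dies.
    {:ok, node} = :slave.start_link(host, :isolated_worker)

    try do
      # The risky call runs on the slave node; if it brings that node down,
      # a supervising process on our side can simply start a fresh one.
      :rpc.call(node, module, fun, args)
    after
      :slave.stop(node)
    end
  end
end
```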

2 Likes