I was wondering whether there are any books, blogs, or docs that discuss supervisor restart intensity parameters (the number of restarts allowed within a period before the supervisor itself gives up and terminates) and how to handle a top-level application supervisor crashing because it exceeded them.
For example, there’s some information in http://erlang.org/doc/design_principles/sup_princ.html#tuning-the-intensity-and-period. This is helpful, but the numbers seem arbitrarily picked (this may just be the nature of these numbers).
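To make the two numbers concrete, here is a toy model of the check as I understand it from those docs: the supervisor gives up once more than `intensity` restarts fall inside the last `period` seconds. This is a sketch in Python rather than OTP code; the class and method names are made up for illustration, and the exact handling of the window boundary is an assumption.

```python
from collections import deque


class RestartWindow:
    """Toy model of a supervisor's restart intensity check (not OTP code)."""

    def __init__(self, intensity, period):
        self.intensity = intensity  # max restarts tolerated within `period`
        self.period = period        # window length in seconds
        self.restarts = deque()     # timestamps of restarts still in the window

    def record_restart(self, now):
        """Record a restart at time `now` (seconds).

        Returns True if the supervisor would keep going, False if it
        would give up because the intensity was exceeded.
        """
        self.restarts.append(now)
        # Forget restarts older than the window (boundary handling is assumed).
        while self.restarts and self.restarts[0] < now - self.period:
            self.restarts.popleft()
        return len(self.restarts) <= self.intensity
```

With `RestartWindow(intensity=1, period=5)`, a restart at t=0 is tolerated, but a second one at t=3 exceeds the limit. With `RestartWindow(intensity=3, period=5)`, three quick restarts followed by one much later are all tolerated, because the early ones have aged out of the window by then.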
There also seems to be some unwritten(?) knowledge about handling top-level application supervisor crashes caused by exceeding restart intensities. If you’re using a third-party application, it will have selected a restart intensity for you. If that doesn’t work for your project, I’ve heard of people starting the OTP application as temporary (instead of permanent), so that its termination doesn’t take the rest of the node down with it, and adding code that monitors whether the application is still running and, if it isn’t, waits for a cooldown period before restarting it. shoehorn gets into this and other strategies, but it doesn’t provide recommendations.
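The temporary-plus-cooldown idea above boils down to a small watchdog loop. Here is a rough sketch of its shape, again in Python as stand-in pseudocode rather than real OTP code. Everything in it is hypothetical: `is_running` and `start` stand for whatever checks and starts the application, and the `max_restarts` cap is my own addition so the loop ends.

```python
import time


def watch(is_running, start, cooldown, max_restarts,
          sleep=time.sleep, poll_interval=1.0):
    """Restart a stopped application after a cooldown.

    `is_running` and `start` are caller-supplied callables (hypothetical
    hooks). Returns the number of restarts made; stops watching after
    `max_restarts` restarts.
    """
    restarts = 0
    while restarts < max_restarts:
        if is_running():
            sleep(poll_interval)  # still up, check again later
            continue
        sleep(cooldown)           # give the cause of the crash time to clear
        start()
        restarts += 1
    return restarts
```

The open questions are the same as for the supervisor numbers: how long the cooldown should be, and what to do once the watchdog itself runs out of attempts.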
There’s also the approach of simply giving up: let the Erlang VM crash and have systemd restart it.
My hope is that a few people have gone pretty far down the above or similar paths and have written up their experiences. All pointers appreciated.