Background
Some weeks ago, one of our critical apps died. The BEAM kept restarting it, but after some time it went down again. The problem was that our app was trying to connect to an external HTTP server that was down at the time. Because the requests were failing, the workers kept dying, and the supervisor kept restarting them without success until it hit its restart limit and shut itself down, so the error bubbled up the supervision tree.
Aftermath
I immediately came here for help and was presented with some solutions. One such solution was to implement a circuit breaker in my workers.
Another recommendation was to read this entire article (which I did):
https://ferd.ca/the-hitchhiker-s-guide-to-the-unexpected.html
Using supervisors as circuit breakers
This article had an idea I really love. To quote it:
(…) I mark the (…) supervisor as having a temporary setting (…)
Then, I add that little highlighted configuration manager. This is me grafting a brain onto my supervisor. What this process will do is go over the supervision tree, possibly compare it to the configuration, and at regular intervals, repair the supervision tree. So it may decide that after 10 seconds, 60 minutes, or a day, (…)
So the author of the article just lets the workers die, and the supervisor dies as well. Because the supervisor is configured as temporary, its parent will never restart it. Ever.
Restarting the supervisor is the sole job of another process, the “configuration manager”. This process checks the supervision tree every X minutes or so and decides whether or not to restart the dead supervisors.
This idea is rather simple, but also amazing!
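To make the idea concrete, here is a rough sketch of how I imagine it could look in Elixir. This is my own guess, not code from the article: the module name `ConfigManager`, the `:root_sup` name, the `:http_workers` id and the 10-second interval are all made up for illustration. One detail it relies on: when a temporary child terminates, the parent supervisor deletes its child spec, so `Supervisor.restart_child/2` cannot bring it back and the manager has to call `Supervisor.start_child/2` with the full spec instead.

```elixir
defmodule ConfigManager do
  # Hypothetical "configuration manager": at a fixed interval it compares the
  # children of `parent` against the expected specs and starts the missing ones.
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    state = %{
      parent: Keyword.fetch!(opts, :parent),
      specs: Keyword.fetch!(opts, :specs),
      interval: Keyword.get(opts, :interval, 10_000)
    }

    # Only schedule here. Calling the parent from init/1 would deadlock,
    # because the parent is still busy starting this very process.
    schedule(state.interval)
    {:ok, state}
  end

  @impl true
  def handle_info(:repair, state) do
    repair(state.parent, state.specs)
    schedule(state.interval)
    {:noreply, state}
  end

  # Starts every expected child that is not currently running under `parent`.
  # Returns the list of Supervisor.start_child/2 results (empty if all is well).
  def repair(parent, specs) do
    running =
      for {id, pid, _type, _mods} <- Supervisor.which_children(parent), is_pid(pid), do: id

    for spec <- specs, spec.id not in running do
      Supervisor.start_child(parent, spec)
    end
  end

  defp schedule(interval), do: Process.send_after(self(), :repair, interval)
end

# The worker supervisor is marked temporary, so the root never restarts it.
# It has no workers here; real workers would go in the empty list.
worker_sup = %{
  id: :http_workers,
  start: {Supervisor, :start_link, [[], [strategy: :one_for_one]]},
  type: :supervisor,
  restart: :temporary
}

children = [
  worker_sup,
  {ConfigManager, parent: :root_sup, specs: [worker_sup], interval: 10_000}
]

{:ok, _root} = Supervisor.start_link(children, strategy: :one_for_one, name: :root_sup)
```

With this layout, a dead `:http_workers` supervisor stays dead until the next `:repair` tick, so the interval is effectively the "open" time of the breaker. A smarter manager could also check whether the external server is reachable before starting the supervisor again.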
Questions
Obviously, I have some questions here.
- Are there any libraries out there that do this already?
- How do I create a “configuration manager”? (How do I tell a process to restart a dead supervisor?)
- What are the main downsides of this approach vs. a typical circuit breaker (fuse, circuit_breaker, breaky, etc.)?
Would love your answers/opinions.