Configuration Managers vs Circuit Breakers

Background

A few weeks ago, one of our critical apps died. The BEAM kept rebooting it, but after a while it went down again. The problem was that our app was trying to connect to an external HTTP server that was down at the time. Because the requests were failing, the workers were dying, the supervisor kept restarting them without success until it exceeded its restart intensity and shut itself down, and so the error bubbled up the tree.

Aftermath

I immediately came here for help and was presented with some solutions. One such solution was to implement a circuit breaker in my workers.
Another recommendation was to read this entire article (which I did):

https://ferd.ca/the-hitchhiker-s-guide-to-the-unexpected.html

Using supervisors as circuit breakers

This article had an idea I really love. To quote it:

(…) I mark the (…) supervisor as having a temporary setting (…)

Then, I add that little highlighted configuration manager. This is me grafting a brain onto my supervisor. What this process will do is go over the supervision tree, possibly compare it to the configuration, and at regular intervals, repair the supervision tree. So it may decide that after 10 seconds, 60 minutes, or a day, (…)

So the author of the article just lets the workers die, and the supervisor along with them. Because the supervisor is configured as temporary, it will never be restarted. Ever.

Restarting the supervisor is the sole job of another process, the “configuration manager”. This process checks the supervision tree every X minutes or so and decides whether or not to restart the dead supervisors.

This idea is rather simple, but also amazing!
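
To make that concrete, here is a minimal sketch of how such a tree could be wired up (all module names and the :http_workers id are placeholders of mine, not from the article): the volatile subtree is started as a :temporary child, so once it exhausts its restart intensity and dies, OTP leaves it dead until something else intervenes.

```elixir
defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    children = [
      # The subtree talking to the flaky HTTP service: once it exceeds its
      # restart intensity and dies, nothing restarts it automatically.
      Supervisor.child_spec({MyApp.HTTPWorkers.Supervisor, []},
        id: :http_workers,
        restart: :temporary
      ),
      # The "configuration manager" that repairs the tree later on.
      MyApp.ConfigManager
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```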

Questions

Obviously, I have some questions here.

  1. Are there any libraries out there that do this already?
  2. How do I create a “configuration manager”? (How do I tell a process to restart a dead supervisor?)
  3. What are the main downsides of this approach vs. a typical circuit breaker (fuse, circuit_breaker, breaky, etc.)?

Would love your answers/opinions.

4 Likes
  1. I’m not aware of any, but I’d suggest also looking for Erlang ones, which I’d expect to be more likely to yield results.
  2. It can be as simple as a GenServer or gen_statem, depending on how you want it to function. It would most likely use monitors to keep track of whether a certain important process is running and to be notified when it crashes. Another option would be linking itself to the important processes while trapping exits, but I’d personally use monitors, because that’s what they’re for. How exactly you deal with the knowledge that things are running, down, or crashing is up to you, as is how the process knows which things should be up and running in the first place. (A sketch follows this list.)
  3. I’d not say that circuit breakers are a replacement for the “configuration manager”. They handle totally different tasks and maybe you even want to use both.
    A circuit breaker monitors calls into a subsystem and, if the number of failing responses exceeds a certain threshold, blocks further calls into that subsystem by short-circuiting into an error. Depending on the library, there are various ways for a blown circuit to heal: it can be time based or use some backoff, and maybe only a fraction of calls is let through until enough successful ones flip it back to normal. A circuit breaker does nothing for the subsystem’s healing besides blocking requests.
    The “configuration manager” process on the other hand does not block anything. Its sole purpose is to monitor processes it knows about and maybe restart them based on some logic in the implementation. It basically handles the “healing” part of a subsystem.
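
To illustrate point 2, here is a rough sketch of such a “configuration manager” (the module names, the parent supervisor, the :http_workers id, and the cooldown are all placeholders): a GenServer that monitors the volatile supervisor and, after a cooldown, re-adds it to its parent.

```elixir
defmodule MyApp.ConfigManager do
  use GenServer

  @cooldown :timer.minutes(1)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Monitor the volatile supervisor if it is running, otherwise repair it.
    case Process.whereis(MyApp.HTTPWorkers.Supervisor) do
      nil ->
        send(self(), :repair)
        {:ok, %{ref: nil}}

      pid ->
        {:ok, %{ref: Process.monitor(pid)}}
    end
  end

  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
    # The watched supervisor died; try to repair the tree after a cooldown.
    Process.send_after(self(), :repair, @cooldown)
    {:noreply, %{state | ref: nil}}
  end

  def handle_info(:repair, state) do
    # A :temporary child is removed from its parent when it dies, so we
    # re-add the spec with start_child/2 instead of calling restart_child/2.
    spec =
      Supervisor.child_spec({MyApp.HTTPWorkers.Supervisor, []},
        id: :http_workers,
        restart: :temporary
      )

    case Supervisor.start_child(MyApp.Supervisor, spec) do
      {:ok, pid} ->
        {:noreply, %{state | ref: Process.monitor(pid)}}

      {:error, _reason} ->
        # Still failing (e.g. the remote service is still down); retry later.
        Process.send_after(self(), :repair, @cooldown)
        {:noreply, state}
    end
  end
end
```
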
2 Likes

And some addendum:

Your use case, afaik, is a supervisor with lots of workers holding connections. If you let that supervisor die it’ll take down all those connections, even the healthy ones. So you don’t want to make the supervisor temporary, but rather the workers, or add another layer between the two.
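
A sketch of that worker-level variant (MyApp.Connection and the list of servers are placeholders): the supervisor stays permanent, only the individual connection workers are :temporary.

```elixir
defmodule MyApp.ConnectionSupervisor do
  use Supervisor

  def start_link(servers),
    do: Supervisor.start_link(__MODULE__, servers, name: __MODULE__)

  @impl true
  def init(servers) do
    children =
      for server <- servers do
        # :temporary workers are dropped when they die instead of being
        # restarted in a loop; healthy siblings keep their connections.
        Supervisor.child_spec({MyApp.Connection, server},
          id: {:conn, server},
          restart: :temporary
        )
      end

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```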

1 Like

I conflate the two here because the author of the article specifically states that using a “configuration manager” like this is implementing a Circuit Breaker using Supervisors. Think about it: the supervisor will only die if its children fail too many times (just as a fuse only blows after too many errors), and then you revive it after some time or according to some strategy (just like a circuit breaker).

A very good point. If I use the article’s approach, I lose granularity!

It does, but this only works for direct calls to those crashing processes. Often requests to your system also involve e.g. pre-processing before calling into the volatile subsystem. With a proper circuit breaker you can stop requests before doing any pre-processing in the event the subsystem is not working. It basically allows you to short circuit in any layer on top of the actual failing subsystem instead of just at the edge of calling into said subsystem.

Edit: As your “configuration manager” knows whether a certain process is running or not, it could also act as a circuit breaker if it exposes an API that lets other processes query that status. But I wouldn’t want to put a process that is meant to be as stable as possible into such a potentially hot path as a circuit breaker switch.
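
For what it’s worth, the caller-side check could look something like this (ConfigManager.status/1 is a hypothetical API, and the hot-path caveat above still applies):

```elixir
defmodule MyApp.Gateway do
  # Hypothetical caller-side check: ask the configuration manager for the
  # subsystem status before doing any pre-processing work.
  def handle_request(request) do
    case MyApp.ConfigManager.status(:http_workers) do
      :up -> request |> preprocess() |> call_subsystem()
      :down -> {:error, :service_unavailable}   # short-circuit early
    end
  end

  # Stand-ins for whatever pre-processing and subsystem call you actually have.
  defp preprocess(request), do: request
  defp call_subsystem(request), do: {:ok, request}
end
```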

1 Like

At the risk of being repetitive … (from your other topic)

The Hitchhiker’s Guide to the Unexpected

Fallacies of Distributed Computing Explained

  1. The network is reliable

… i.e. there are lots of reasons, some temporary, why one would not be able to reach a server. Distributed calls have many more potential causes for failure than local calls.

The manner in which the current design fails seems to indicate that distributed calls are, for convenience’s sake, being treated like local calls, and that “let it crash” is being used in an attempt to sweep the occasional failure (which should be expected and handled as such) under the rug.

I understand the motivation for wanting to delegate this “unhappy path” either to the runtime (via supervisors) or to libraries (that implement the circuit breaker concept in some fashion), but you may have to accept that you need to adopt the circuit breaker concept, or the thinking behind it, in order to solve your particular problem.

As a starting point you may need to separate the responsibilities of dealing with “healthy” and “unhealthy” servers.

because the requests were failing, the workers were dying.

  • Why are the workers dying?
  • How do these workers operate?
    • Does a single worker keep hitting the same server ad infinitum, or does it complete one successful request and then move on to another server?
  • What currently is preventing the worker from being resilient in the face of a failing request?
  • Is there a way for the worker to “survive” a failed request and potentially declare a server as unhealthy?

One possible approach

  • Maintain separate pools of “healthy” and “unhealthy” servers.
  • Workers get their servers from the “healthy” pool.
  • When a worker detects a pattern of failure it moves the server to the “unhealthy” pool.
  • To be paranoid, after detecting a failed connection the worker could exit normally. A fresh worker should be spawned to replace it.
  • A separate process manages the pool of “unhealthy” servers, essentially implementing some sort of back off strategy.
  • When a server first enters the “unhealthy” pool, the manager schedules it to be returned to the “healthy” pool.
  • After the server is returned to the “healthy” pool the server entry remains latent in the “unhealthy” pool until some long-ish latency period expires. If there are no more failures past expiry the latent entry is removed entirely. Additional failures will cause the expiry to be extended.
  • When a server enters the “unhealthy” pool while the latent entry still exists, the delay for being returned to the “healthy” pool is increased (and the expiry is extended).
  • The manager should likely report a server that returns to the “unhealthy” pool too frequently as it may be necessary to remove that server entirely from the system.
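
A rough sketch of such a pool manager (the module name, API, and backoff numbers are all placeholders; the expiry/latency bookkeeping and the "report too-frequent offenders" part are omitted):

```elixir
defmodule MyApp.ServerPool do
  use GenServer

  @base_delay :timer.seconds(30)

  def start_link(servers),
    do: GenServer.start_link(__MODULE__, servers, name: __MODULE__)

  def checkout, do: GenServer.call(__MODULE__, :checkout)
  def report_unhealthy(server), do: GenServer.cast(__MODULE__, {:unhealthy, server})

  @impl true
  def init(servers) do
    {:ok, %{healthy: MapSet.new(servers), unhealthy: %{}}}
  end

  @impl true
  def handle_call(:checkout, _from, state) do
    # Hand out a random healthy server (nil if none are currently available).
    {:reply, state.healthy |> Enum.take_random(1) |> List.first(), state}
  end

  @impl true
  def handle_cast({:unhealthy, server}, state) do
    # Pull the server out of the healthy pool and schedule its return.
    # The delay doubles every time the server re-enters the unhealthy pool.
    delay = Map.get(state.unhealthy, server, @base_delay)
    Process.send_after(self(), {:retry, server}, delay)

    {:noreply,
     %{state |
       healthy: MapSet.delete(state.healthy, server),
       unhealthy: Map.put(state.unhealthy, server, delay * 2)}}
  end

  @impl true
  def handle_info({:retry, server}, state) do
    # Return the server to the healthy pool; its backoff entry stays around,
    # so repeated failures keep increasing the delay.
    {:noreply, %{state | healthy: MapSet.put(state.healthy, server)}}
  end
end
```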

Hi, author of the quoted article here.

  1. To my knowledge, there are no such libraries. I wouldn’t necessarily use one anyway because what needs to be restarted, when, and under which conditions, is not necessarily super easy to make generic. I’ve written some that would “diff” supervision trees and be used to “repair” configuration calls that were missed, and I’ve written some that could just do a cooldown. Some would restart workers, some would restart supervision trees wholesale. I’ve had some that had no automation, but relied on an operator sending a command to restart a thing (it acted as a fuse for major cluster meltdowns). Making this kind of stuff generic kind of implies very flexible monitoring and linking schemes with arbitrary logic, and at that point a GenServer or gen_statem are plenty to go from.
  2. I just call one of the supervisor module’s functions (either to restart a child, or to delete the old one and add a new one), from another OTP process. (The calls are sketched after this list.)
  3. The general circuit breaker is there to detect and react to faults, timeouts, etc. I would use a circuit breaker a lot when I expect failures from the other component rather frequently, especially when there is a need to coordinate fault detection between all workers (i.e. all DB workers may want to expect the remote DB being down and avoid thundering herds). The “config manager” in our case was to cope with supervision trees where each worker connects to a distinct resource, but each worker could also be started, created, or dropped by remote users interacting with our gateway. Since we already had a need to “repair” the config (say the network is down for 1-2 weeks and we miss config changes), it was simple to bolt the retry feature on top of it. The difference is that, really, we wanted to be able to add some smarts to our supervision strategies, whereas a circuit breaker is more of a general overload/flow control mechanism.
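
For reference, the supervisor calls referred to in point 2 look like this (MySup, :worker_id, and MyWorker are placeholder names):

```elixir
# For a child that is stopped but whose spec is still known to the
# supervisor, restart it in place:
{:ok, _pid} = Supervisor.restart_child(MySup, :worker_id)

# ...or drop the old spec and add a fresh one:
:ok = Supervisor.delete_child(MySup, :worker_id)
{:ok, _pid} = Supervisor.start_child(MySup, {MyWorker, []})
```
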
4 Likes

I said these things in the other thread, but I guess it’s worth repeating.

Your service should be able to boot without any of its dependencies. That means that any DBs, queues, or external services can be 100% unavailable and your app should still start. As I said in the other thread, if you can’t do this then your app will be brittle and less reliable. The root of your problem isn’t your supervision strategy. The root problem is that you’re assuming these external services will be mostly available. You need to be more pessimistic.

There isn’t a single supervision strategy, library, circuit breaker, or pattern that is going to make this work for you. Those are all tools. But in order to know how to use those tools you have to start from first principles and design your system to work even when the rest of the world is burning to the ground. That’s why having the ability to start your application without having access to any of its dependencies is a good heuristic for a system that can withstand failure. It means that you aren’t truly dependent on those systems and if you have a transient failure you’ll be in better shape to recover from it. At the very least it points your design in a more reliable direction.
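
One common way to satisfy that heuristic (sketched below with placeholder names) is to avoid connecting in init/1: return immediately and attempt the connection afterwards, retrying on failure, so the app boots even if the external service is unreachable.

```elixir
defmodule MyApp.ExternalClient do
  use GenServer

  @retry :timer.seconds(5)

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(opts) do
    # Never block or crash here just because the dependency is unreachable.
    {:ok, %{conn: nil, opts: opts}, {:continue, :connect}}
  end

  @impl true
  def handle_continue(:connect, state), do: try_connect(state)

  @impl true
  def handle_info(:connect, state), do: try_connect(state)

  defp try_connect(state) do
    case connect(state.opts) do
      {:ok, conn} ->
        {:noreply, %{state | conn: conn}}

      {:error, _reason} ->
        # Dependency is down; stay up and try again later.
        Process.send_after(self(), :connect, @retry)
        {:noreply, state}
    end
  end

  # Placeholder for whatever HTTP/DB client you actually use.
  defp connect(_opts), do: {:error, :not_implemented}
end
```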

Meta note: this probably didn’t warrant a whole new thread and could have been continued in the original one so as not to lose context.

3 Likes