Configuration Managers vs Circuit Breakers

At the risk of being repetitive … (from your other topic)

The Hitchhiker’s Guide to the Unexpected

Fallacies of Distributed Computing Explained

  1. The network is reliable

… i.e. there are lots of reasons, some temporary, why one would not be able to reach a server. Distributed calls have many more potential causes for failure than local calls.

The manner in which the current design fails suggests that distributed calls are, for convenience’s sake, being treated like local calls, and that “let it crash” is being used to sweep the occasional failure (which should be expected and handled as such) under the rug.

I understand the motivation for wanting to delegate this “unhappy path” either to the runtime (via supervisors) or to libraries (that implement the circuit breaker concept in some fashion), but you may have to accept that you need to adopt the circuit breaker concept, or at least the thinking behind it, in order to solve your particular problem.
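To make that concrete, here is a minimal sketch of the circuit-breaker idea in Elixir. The module name, the 5-failure threshold and the 30-second cool-off are purely illustrative assumptions, not tuned values, and this is not a substitute for an existing implementation (the `fuse` Erlang library is the usual reference for this pattern):

```elixir
defmodule Breakers do
  @moduledoc """
  Minimal sketch of the circuit-breaker idea: track consecutive failures
  per server in one process; once a server trips, calls to it are refused
  until a cool-off timer closes the breaker again.
  """
  use GenServer

  # Illustrative assumptions only.
  @max_failures 5
  @cool_off :timer.seconds(30)

  def start_link(_), do: GenServer.start_link(__MODULE__, %{}, name: __MODULE__)

  # :ok    -> breaker closed, go ahead and call the server
  # :blown -> breaker open, skip this server for now
  def ask(server), do: GenServer.call(__MODULE__, {:ask, server})

  # Report a failed request against the server.
  def melt(server), do: GenServer.cast(__MODULE__, {:melt, server})

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:ask, server}, _from, state) do
    case state do
      %{^server => :blown} -> {:reply, :blown, state}
      _ -> {:reply, :ok, state}
    end
  end

  @impl true
  def handle_cast({:melt, server}, state) do
    case Map.get(state, server, 0) do
      :blown ->
        {:noreply, state}

      failures when failures + 1 >= @max_failures ->
        # Trip the breaker and schedule it to close again after the cool-off.
        Process.send_after(self(), {:reset, server}, @cool_off)
        {:noreply, Map.put(state, server, :blown)}

      failures ->
        {:noreply, Map.put(state, server, failures + 1)}
    end
  end

  @impl true
  def handle_info({:reset, server}, state), do: {:noreply, Map.delete(state, server)}
end
```

The point is only the shape of it: callers `ask/1` before making a distributed call, `melt/1` on failure, and a tripped server is skipped rather than hammered.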

As a starting point you may need to separate the responsibilities of dealing with “healthy” and “unhealthy” servers.

because the requests were failing, the workers were dying.

  • Why are the workers dying?
  • How do these workers operate?
    • Does a single worker keep hitting the same server ad infinitum, or does it complete one successful request and then move on to another server?
  • What currently is preventing the worker from being resilient in the face of a failing request?
  • Is there a way for the worker to “survive” a failed request and potentially declare a server as unhealthy?
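On that last question: the usual way to let a worker survive is to treat the failure as data rather than letting it crash the process. A hedged sketch, assuming a request function that returns `{:ok, _}` / `{:error, _}` tuples; `ServerPool.mark_unhealthy/1` is a placeholder that matches the pool sketch at the end of this post:

```elixir
defmodule Worker do
  @moduledoc """
  Sketch of a worker that survives a failed request. The request function
  is passed in (any client returning {:ok, _} | {:error, _} works).
  """

  # One unit of work: on failure the worker reports the server and keeps
  # going (or exits normally), instead of crashing and taking the
  # supervision tree down with it.
  def run(server, request_fun) do
    case request_fun.(server) do
      {:ok, result} ->
        {:ok, result}

      {:error, reason} ->
        ServerPool.mark_unhealthy(server)
        {:error, reason}
    end
  end
end
```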

One possible approach

  • Maintain separate pools of “healthy” and “unhealthy” servers.
  • Workers get their servers from the “healthy” pool.
  • When a worker detects a pattern of failure it moves the server to the “unhealthy” pool.
  • To be paranoid, after detecting a failed connection the worker could exit normally. A fresh worker should be spawned to replace it.
  • A separate process manages the pool of “unhealthy” servers, essentially implementing some sort of back-off strategy.
  • When a server first enters the “unhealthy” pool, the manager schedules it to be returned to the “healthy” pool.
  • After the server is returned to the “healthy” pool the server entry remains latent in the “unhealthy” pool until some long-ish latency period expires. If there are no more failures past expiry the latent entry is removed entirely. Additional failures will cause the expiry to be extended.
  • When a server enters the “unhealthy” pool while the latent entry still exists, the delay for being returned to the “healthy” pool is increased (and the expiry is extended).
  • The manager should likely report a server that returns to the “unhealthy” pool too frequently, as it may be necessary to remove that server entirely from the system.
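To make the pool idea concrete, a rough sketch of such a manager; the module name `ServerPool`, the 5-second initial delay, the doubling back-off and the 5-minute latent-entry lifetime are all assumptions for illustration:

```elixir
defmodule ServerPool do
  @moduledoc """
  Sketch of the healthy/unhealthy pool manager described above.
  Delays and lifetimes are illustrative, not recommendations.
  """
  use GenServer

  @initial_delay :timer.seconds(5)
  @max_delay :timer.minutes(5)

  def start_link(servers),
    do: GenServer.start_link(__MODULE__, servers, name: __MODULE__)

  # Workers only ever draw servers from the "healthy" pool.
  def checkout, do: GenServer.call(__MODULE__, :checkout)

  # A worker that detected a pattern of failure moves the server here.
  def mark_unhealthy(server), do: GenServer.cast(__MODULE__, {:unhealthy, server})

  @impl true
  def init(servers) do
    # healthy: the set of servers workers may use
    # delays:  the "latent entries" - last back-off delay per recently failed server
    {:ok, %{healthy: MapSet.new(servers), delays: %{}}}
  end

  @impl true
  def handle_call(:checkout, _from, state) do
    case Enum.take_random(state.healthy, 1) do
      [server] -> {:reply, {:ok, server}, state}
      [] -> {:reply, {:error, :no_healthy_servers}, state}
    end
  end

  @impl true
  def handle_cast({:unhealthy, server}, state) do
    # First failure gets the initial delay; while the latent entry exists,
    # each further failure doubles the delay (capped at @max_delay).
    delay =
      case state.delays do
        %{^server => previous} -> min(previous * 2, @max_delay)
        _ -> @initial_delay
      end

    Process.send_after(self(), {:return, server}, delay)

    {:noreply,
     %{state | healthy: MapSet.delete(state.healthy, server),
               delays: Map.put(state.delays, server, delay)}}
  end

  @impl true
  def handle_info({:return, server}, state) do
    # Back into the healthy pool; the latent entry is forgotten later,
    # once the server has stayed quiet for a while.
    Process.send_after(self(), {:forget, server}, @max_delay)
    {:noreply, %{state | healthy: MapSet.put(state.healthy, server)}}
  end

  def handle_info({:forget, server}, state) do
    {:noreply, %{state | delays: Map.delete(state.delays, server)}}
  end
end
```

This deliberately leaves out two details from the list above to keep it short: the “forget” timer is not extended when a server fails again, and the “report a server that fails too often” step would need a per-server failure counter plus a `Logger` call (or a message to whoever can remove the server from the system). The overall shape, though — healthy pool for workers, unhealthy pool with growing back-off managed by a separate process — is the important part.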