Examples of "fallback" tasks for error recovery?

Rich_Morin · February 11, 2023, 3:47am

I’ve seen a couple of references to the idea of “let it crash” and how it can recover from incorrect software. For example, in The do’s and don’ts of error handling, Joe Armstrong said:

Assume software is incorrect and will fail at run time
then do something about it at run-time

However, I think the real story may be a bit more nuanced than this. Let’s say that I run the same software, with the same internal state, on the same input data. I’m pretty sure that it will malfunction again in just the same manner. Fortunately, there are reasons why this might not be the case.

If a process has been building up “bad state” over some period of time, restarting it will clear things up, allowing the new copy to run (at least for a while…). This is a lot like power-cycling a computer, network router, etc… In addition, if the problem is related to a timing issue (e.g., a race condition), restarting the process(es) involved will typically resolve the issue.
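As a rough illustration of the "restart clears bad state" point, here is a minimal Python sketch (not Erlang/OTP, and not how a real supervisor is implemented). The `Worker` class, its artificial failure condition, and the `supervise` helper are all hypothetical names invented for this example.

```python
class Worker:
    """Hypothetical worker whose internal state slowly goes bad."""

    def __init__(self):
        self.seen = []  # state accumulated over the worker's lifetime

    def handle(self, item):
        self.seen.append(item)
        if len(self.seen) > 3:  # stand-in for "bad state" building up
            raise RuntimeError("state corrupted")
        return item * 2


def supervise(items, make_worker, max_restarts=5):
    """Process items, replacing the worker with a fresh one when it crashes."""
    worker = make_worker()
    restarts = 0
    results = []
    for item in items:
        while True:
            try:
                results.append(worker.handle(item))
                break
            except Exception:
                restarts += 1
                if restarts > max_restarts:
                    raise  # give up, like a supervisor hitting its restart limit
                worker = make_worker()  # fresh state, then retry the same item
    return results, restarts


results, restarts = supervise(range(8), Worker)
```

The same input that crashed the old worker succeeds on the new one only because the failure depended on accumulated state, not on the input itself.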

Even in the absence of these issues, killing a faulty process and starting up a different version could allow things to keep going. Joe speaks about this possibility in his dissertation:

If an error is detected when trying to achieve a goal, we make an attempt to correct the error. If we cannot correct the error we immediately abort the current task and start performing a simpler task.
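One way to read "start performing a simpler task" is as an ordered list of tasks, tried from most capable to simplest. The Python sketch below shows that reading only. `run_with_fallbacks` and the two example tasks are hypothetical and do not come from the dissertation.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_fallbacks(tasks: Sequence[Callable[[], T]]) -> T:
    """Try each task in order; abort a failing task and move to the next, simpler one."""
    last_error = None
    for task in tasks:
        try:
            return task()
        except Exception as err:
            last_error = err  # remember why the more ambitious task failed
    raise RuntimeError("all tasks failed") from last_error


# Hypothetical tasks: the first is the "real" goal, the second is the simpler one.
def personalized_recommendations():
    raise TimeoutError("recommendation service unavailable")


def bestseller_list():
    return ["book-1", "book-2"]


result = run_with_fallbacks([personalized_recommendations, bestseller_list])
```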

All of this makes me wonder how many production systems have “fallback” tasks of this sort. I know that I’ve never written any, but others may well have done so. Can anyone cite (and describe) some examples?

-r

LostKobrakai · February 11, 2023, 10:08am

One example that comes to my mind is with caching, where you try to keep a cache up to date, but if you fail you might just serve stale content you already have access to. In this case “updating an out of date cached value” is more complex than “using the out of date cached value”.
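A minimal Python sketch of that idea, assuming a single-process, in-memory cache. The class name `StaleOnErrorCache`, the `fetch` callback, and the TTL handling are illustrative assumptions, not the API of Cachex or any other library.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleOnErrorCache:
    """Refresh expired entries, but serve the stale value if the refresh fails."""

    def __init__(self, fetch: Callable[[str], Any], ttl_seconds: float):
        self._fetch = fetch          # the "complex" task: get an up-to-date value
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh, no refresh needed
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[1]      # the "simpler" task: serve stale content
            raise                    # nothing cached, so the error propagates
        self._entries[key] = (now, value)
        return value
```

Note that the fallback only exists once a value has been fetched successfully at least once. A cold cache still surfaces the error.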

I’d also argue that circuit breakers are an implementation of that, even though the “simpler task” is just telling the user that whatever isn’t executed is considered unavailable for the time being. That is not much more useful to the end user, since the functionality is unavailable both when the call errors and when the breaker reports it as unavailable, but the latter is saner for the system to handle, and by preventing access it might allow downstream systems to recover over time.

You might find a lot more of that in topics around distributed systems, where failure is much more obviously inevitable, than in pieces about coding for a single machine. Also, distributed setups commonly involve not being able to control every piece of software that is running, so actually faulty implementations are more of a consideration.

Rich_Morin · February 11, 2023, 2:58pm

Thanks! Here are some references I found for caching and the “circuit breaker” pattern, along with some implementations in Elixir:

How to write a caching server in Elixir
Cachex
CircuitBreaker (Martin Fowler)

The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all. Usually you’ll also want some kind of monitor alert if the circuit breaker trips.
Circuit Breaker Pattern in Elixir (Allan MacGregor)
Hex package search
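The circuit breaker described in the Fowler quote above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the code from any of the linked articles or packages. The names `CircuitBreaker` and `CircuitOpenError`, and the `failure_threshold`, `reset_after`, and `clock` parameters, are assumptions made for this example. The timed reset is an addition beyond what the quote describes.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the protected call is skipped."""


class CircuitBreaker:
    def __init__(self, func, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self._func = func                  # the protected function call
        self._threshold = failure_threshold
        self._reset_after = reset_after    # seconds to wait before a trial call
        self._clock = clock                # injectable for testing
        self._failures = 0
        self._opened_at = None             # None means the circuit is closed

    def call(self, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                # Tripped: fail fast without making the protected call.
                raise CircuitOpenError("circuit open; call skipped")
            # Otherwise fall through and allow one trial call.
        try:
            result = self._func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = self._clock()  # trip (or re-trip) the breaker
            raise
        self._failures = 0                 # success closes the circuit again
        self._opened_at = None
        return result
```

A caller would wrap the risky function once, e.g. `breaker = CircuitBreaker(fetch_remote)`, and then treat `CircuitOpenError` as the signal to run whatever simpler task is available.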