Understanding the advantages of the "let it crash" term

peerreynders · November 6, 2017, 10:18pm

Where did you get the idea that the BEAM crashes? All in all, I think @LostKobrakai elaborated on the supervisor issue sufficiently. A supervision tree is started as part of an OTP application, and typically multiple OTP applications are bundled together as a release to form a “system”.

So it is possible for an OTP application to crash repeatedly during startup only to finally give up and ultimately stop - but that doesn’t crash the BEAM.
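
A minimal sketch of that "give up and stop" behaviour, written in Elixir on my own assumption that it suits this thread. `Flaky` and the threshold values are hypothetical, picked only for illustration. The worker crashes right after every start, the supervisor exceeds its restart limit and shuts itself down, and the calling process (and the VM with it) keeps running.

```elixir
defmodule Flaky do
  # Hypothetical worker: starts successfully, then crashes right away.
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok)

  @impl true
  def init(:ok), do: {:ok, nil, {:continue, :boom}}

  @impl true
  def handle_continue(:boom, _state), do: raise("still broken")
end

# Trap exits so the supervisor's shutdown reaches this process as a message
# instead of taking it down through the link.
Process.flag(:trap_exit, true)

# Assumed thresholds: tolerate at most 2 restarts within 5 seconds.
{:ok, sup} =
  Supervisor.start_link([Flaky], strategy: :one_for_one, max_restarts: 2, max_seconds: 5)

# Wait for the supervisor to give up.
reason =
  receive do
    {:EXIT, ^sup, reason} -> reason
  after
    5_000 -> :timeout
  end

# The supervisor is gone, but this process is still here to report it.
IO.inspect({reason, Process.alive?(sup)})
```

The crash reports that show up in the log while this runs are the diagnostics. Nothing in `Flaky` itself tries to handle the failure.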

And maybe “let it crash” is a bit sensationalist - is “let it fail” better?

Joe Armstrong explains the thinking in Programming Erlang 2e p.201:

Why Crash?

Crashing immediately when something goes wrong is often a very good idea; in fact, it has several advantages.

We don’t have to write defensive code to guard against errors; we just crash.
We don’t have to think about what to do; we just crash, and somebody else will fix the error.
We don’t make matters worse by performing additional computations after we know that things have gone wrong.
We can get very good error diagnostics if we flag the first place where an error occurs. Often continuing after an error has occurred leads to even more errors and makes debugging even more difficult.
When writing error recovery code, we don’t need to bother about why something crashed; we just need to concentrate on cleaning up afterward.
It simplifies the system architecture, so we can think about the application and error recovery as two separate problems, not as one interleaved problem.
…

Getting Some Other Guy to Fix It

Letting somebody else fix an error rather than doing it yourself is a good idea and encourages specialization. If I need surgery, I go to a doctor and don’t try to operate on myself.

If something trivial in my car goes wrong, the car’s control computer will try to fix it. If this fails and something big goes wrong, I have to take the car to the garage, and some other guy fixes it.

If something trivial in an Erlang process goes wrong, I can try to fix it with a catch or try statement. But if this fails and something big goes wrong, I’d better just crash and let some other process fix the error.
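
To illustrate the "let some other process fix the error" part, here is a small Elixir sketch. It is my own example, not from the book: `Divider` and `Demo` are hypothetical names. The worker contains no defensive code, a division by zero crashes it, and a plain `:one_for_one` supervisor plays the role of the other process by starting a fresh worker.

```elixir
defmodule Divider do
  # Hypothetical worker: no defensive checks, so bad input simply crashes it.
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  def divide(a, b), do: GenServer.call(__MODULE__, {:divide, a, b})

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_call({:divide, a, b}, _from, state), do: {:reply, div(a, b), state}
end

defmodule Demo do
  # Polls until `name` is registered to a pid other than `old_pid`.
  def await_restart(name, old_pid) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        await_restart(name, old_pid)
    end
  end
end

# The supervisor is the "other process" that fixes the error by restarting.
{:ok, _sup} = Supervisor.start_link([Divider], strategy: :one_for_one)
old_pid = Process.whereis(Divider)

# The division by zero raises inside the worker; the caller only sees an exit.
outcome =
  try do
    Divider.divide(1, 0)
  catch
    :exit, _reason -> :worker_crashed
  end

# A new worker is registered under the same name and serves requests again.
new_pid = Demo.await_restart(Divider, old_pid)
result = Divider.divide(10, 2)
IO.inspect({outcome, new_pid != old_pid, result})
```

The caller here catches the exit only so the script can continue. In a real system the caller would usually be supervised too and could simply crash as well.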

I used to ask myself the same type of questions - but then I ran into this in Designing for Scalability with Erlang/OTP p.175:

Note how we have grouped dependent processes together in one subset of the tree and related processes in another, starting them from left to right in order of dependency. This forms part of the supervision strategy of a system and in some situations is put in place not by the developer, who focuses only on what particular workers have to do, but by the architect, who has an overall view and understanding of the system and how the different components interact with each other.

So the design of the supervision tree is largely an architectural concern, and it’s this architecture that has to be designed to deal with the failures, not the program code that is down in the weeds. Failures are therefore dealt with in a very generic fashion, so the supervision strategies (and the appropriate restart thresholds) relate to how the system needs to operate.
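
As a rough sketch of what such an architectural decision looks like in code, assume two hypothetical workers, `Store` and `Reporter`, where `Reporter` depends on `Store`. The strategy and thresholds below are assumptions for the example, not recommendations. Note that all of the failure policy lives in the supervisor, and the workers contain none of it.

```elixir
defmodule Store do
  # Hypothetical worker that other workers depend on, so it starts first.
  use Agent
  def start_link(_arg), do: Agent.start_link(fn -> %{} end, name: __MODULE__)
end

defmodule Reporter do
  # Hypothetical worker that depends on Store, so it is listed after it.
  use Agent
  def start_link(_arg), do: Agent.start_link(fn -> :ready end, name: __MODULE__)
end

defmodule PipelineSupervisor do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    # :rest_for_one restarts the crashed child and every child started after it.
    # Assumed thresholds: give up after more than 3 restarts within 5 seconds.
    Supervisor.init([Store, Reporter],
      strategy: :rest_for_one,
      max_restarts: 3,
      max_seconds: 5
    )
  end
end

defmodule TreeDemo do
  # Polls until `name` is registered to a pid other than `old_pid`.
  def await_restart(name, old_pid) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != old_pid ->
        pid

      _ ->
        Process.sleep(10)
        await_restart(name, old_pid)
    end
  end
end

{:ok, _sup} = PipelineSupervisor.start_link([])
store = Process.whereis(Store)
reporter = Process.whereis(Reporter)

# Killing the dependency restarts both it and its dependent.
Process.exit(store, :kill)
reporter2 = TreeDemo.await_restart(Reporter, reporter)
store2 = Process.whereis(Store)

# Killing the dependent leaves the dependency untouched.
Process.exit(reporter2, :kill)
reporter3 = TreeDemo.await_restart(Reporter, reporter2)
store3 = Process.whereis(Store)

IO.inspect({store2 != store, reporter2 != reporter, store3 == store2, reporter3 != reporter2})
```

Changing `:rest_for_one` to `:one_for_one` or `:one_for_all`, or tightening the thresholds, changes how the whole subtree reacts to failure without touching `Store` or `Reporter`.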

If you want a failure context you wouldn’t use Maybe but Either, capturing the context in a Left value (which all subsequent composed computations would leave unmodified). However, you are still focusing on the details of the failure. While the details should be logged for later inspection, they often don’t influence the immediate response. The response is often quite generic: either “give up” or “try again (later) from square one”.
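
For comparison, a sketch of the Either style in Elixir, using the conventional `{:ok, value}` / `{:error, context}` tuples as a stand-in for Right and Left. That mapping is my assumption, and `Signup` and its functions are hypothetical. `with` stops at the first result that does not match and returns it untouched, so the failure context passes through the remaining steps unmodified.

```elixir
defmodule Signup do
  # Hypothetical validation steps; each returns {:ok, value} or {:error, context}.
  def parse_age(text) do
    case Integer.parse(text) do
      {age, ""} -> {:ok, age}
      _ -> {:error, {:not_a_number, text}}
    end
  end

  def check_adult(age) when age >= 18, do: {:ok, age}
  def check_adult(age), do: {:error, {:too_young, age}}

  # The first {:error, context} short-circuits the chain and is returned as is,
  # much like a Left value skipping the rest of a composed computation.
  def validate(text) do
    with {:ok, age} <- parse_age(text),
         {:ok, age} <- check_adult(age) do
      {:ok, %{age: age}}
    end
  end
end

IO.inspect(Signup.validate("42"))
IO.inspect(Signup.validate("abc"))
IO.inspect(Signup.validate("12"))
```

The caller still has to decide what to do with each error context, which is exactly the detail-level handling that a supervisor's generic response sidesteps.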

Sure, a poor man’s version of it can be had with libraries like exceptional - but the appeal of Maybe (or Either, Result, etc.) is that it implicitly “knows” how to deal with Nothing (Left, Failure, etc.) without any additional outside plumbing.