There are a couple of facets to the let-it-crash story.
At the core of it all is the idea of failing fast. This is not exclusive to Erlang, and I believe it’s generally a good practice. We want to fail as soon as something is off. By doing this, we ensure that the symptom and the cause are one and the same, which simplifies problem analysis. By looking at the error log, we can tell both what went wrong and why.
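To make the fail-fast point concrete, here’s a minimal Go sketch (the `loadBatchSize` helper and the setting name are hypothetical). The fail-fast version reports the bad input at the exact place it is detected, so the error log names the cause directly:

```go
package main

import (
	"fmt"
	"strconv"
)

// loadBatchSize is a hypothetical config reader. It fails fast at the
// parse site instead of silently substituting a default, so the symptom
// (the error) and the cause (the bad input) are one and the same.
func loadBatchSize(raw string) (int, error) {
	n, err := strconv.Atoi(raw)
	if err != nil {
		// Fail fast: report the problem exactly where it happens.
		return 0, fmt.Errorf("invalid batch size %q: %w", raw, err)
	}
	return n, nil
}

func main() {
	if _, err := loadBatchSize("abc"); err != nil {
		fmt.Println(err) // the error names the offending input directly
	}
}
```

Had we swallowed the error and continued with a zero value, the failure would only surface much later, far from its cause.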
Now, of course, we don’t want our whole system to crash due to a single error, so we need to isolate the failure of a single task. In many popular languages, this is done by wrapping the task execution in some sort of catch-all statement, or by running the task in a separate OS process. So for example, as someone mentioned here, a typical web framework will indeed do this to make sure that the error is caught and reported properly.
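A sketch of that catch-all pattern in Go (the `safely` wrapper is hypothetical, but it mirrors what a framework does around each request handler):

```go
package main

import "fmt"

// safely runs one task and converts a panic into an ordinary error,
// so one failing task doesn't take down the rest of the system.
// This is roughly what a web framework does around each handler.
func safely(task func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("task failed: %v", r)
		}
	}()
	task()
	return nil
}

func main() {
	err := safely(func() { panic("boom") })
	fmt.Println(err) // task failed: boom
	fmt.Println("still running")
}
```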
However, try-catch is not a perfect solution, for a couple of reasons. First, if shared mutable data is used, a task which fails in the middle could leave the data in an inconsistent state, which means that subsequent tasks might trip over it.
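Here’s a toy Go example of that failure mode (the `Ledger` type is made up for illustration). The panic is caught, yet the shared state is now inconsistent for every later task:

```go
package main

import "fmt"

// Ledger is a toy piece of shared mutable state. A transfer that
// panics midway leaves the balances inconsistent even though the
// panic itself is caught by a catch-all.
type Ledger struct {
	balances map[string]int
}

func (l *Ledger) Transfer(from, to string, amount int) {
	l.balances[from] -= amount
	if amount > 100 {
		panic("limit exceeded") // fails after the debit, before the credit
	}
	l.balances[to] += amount
}

func main() {
	l := &Ledger{balances: map[string]int{"alice": 200, "bob": 0}}
	func() {
		defer func() { recover() }() // catch-all, like a framework would use
		l.Transfer("alice", "bob", 150)
	}()
	// The panic was caught, but money has vanished from the system:
	fmt.Println(l.balances["alice"], l.balances["bob"]) // 50 0
}
```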
Moreover, a task itself could spawn additional concurrent subtasks (threads or lightweight threads), and we need to make sure that failures of these are properly caught as well. A great example of this is the Go language. If a web request handler spawns another goroutine, and there’s an undeferred panic (aka uncaught exception) in that goroutine, the entire system crashes.
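This sketch shows the defensive pattern Go forces on you (the `runSubtask` helper is hypothetical). The key point is in the comment: a recover in the spawning handler cannot catch a child goroutine’s panic, so each goroutine must install its own:

```go
package main

import "fmt"

// runSubtask spawns a goroutine and installs a recover *inside* that
// goroutine, because a recover in the caller cannot catch a panic in a
// different goroutine. Without the inner defer, the panic below would
// terminate the entire OS process, not just this subtask.
func runSubtask(f func()) error {
	done := make(chan error, 1)
	go func() {
		defer func() {
			if r := recover(); r != nil {
				done <- fmt.Errorf("subtask panicked: %v", r)
			}
		}()
		f()
		done <- nil
	}()
	return <-done
}

func main() {
	fmt.Println(runSubtask(func() { panic("boom") })) // subtask panicked: boom
	fmt.Println("process survived")
}
```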
In contrast, using separate OS processes helps with this, but you can’t really run one OS process per task (e.g. per request), so we usually group tasks somehow (which to me is what microservices are about). Now you need to run multiple OS processes, and you need an extra piece of tech (e.g. systemd) to start these things in the proper order, restart failing OS processes, and maybe take down related OS processes as well.
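As a rough illustration of what that extra piece of tech has to provide, here is a sketch of a systemd unit for one of those grouped services (the service names and paths are hypothetical):

```ini
# Hypothetical unit file for one grouped service; systemd supplies the
# ordering, restart, and teardown guarantees discussed above.
[Unit]
Description=Orders service (hypothetical)
# Start only after the database dependency is up; if the database
# service is stopped, this service is taken down with it:
After=orders-db.service
Requires=orders-db.service

[Service]
ExecStart=/usr/local/bin/orders-service
# Restart the OS process whenever it fails fast:
Restart=on-failure
RestartSec=2

[Install]
WantedBy=multi-user.target
```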
With BEAM, all of these issues (and some others) are taken care of directly in our primary tech. If you don’t want a failure of one task to crash other tasks, you’ll typically run the task in a separate process, and fail fast there. With errors being isolated, a failing process doesn’t take down anything else with it (unless you ask for that explicitly via links). Shared-nothing concurrency also ensures that a failing task can’t leave any junk data behind. Moreover, the runtime ensures that when something crashes, the associated resources (memory, open sockets, file handles) are properly released. Finally, the termination of a process is a detectable event, which allows other processes (e.g. supervisors) to take corrective measures and help the system heal itself.
As a result, Erlang-style fault-tolerance is IMO one-size-fits-all. We use the same approach to improve the fault-tolerance of individual small tasks (e.g. request handlers), as well as of background services, or of larger parts of the system. I like to think of the supervision tree as our service manager (like systemd, upstart, or the Windows service manager). It gives us the same capabilities and the same guarantees, it’s highly concurrent, and it’s built into our main language of choice.
In contrast, in most other technologies you need to combine try/catch with microservices backed by an external service manager, and in some cases you might need to resort to homegrown patterns (e.g. if you need to propagate a failure of one small activity across microservice boundaries). Therefore, I consider these other solutions both more complex and less reliable than the Erlang approach.