Understanding the advantages of the "let it crash" philosophy

If their requests ended up on the same JVM scheduler while your request broke it, everything controlled by that scheduler would break apart. On servers with small loads you won’t see a difference, since often only one scheduler is busy at a time, but on larger ones with a lot of requests per second those situations can happen.

For me, in real office life, it looks like this:

We coded a Go application for a customer. There is a huge machinery spun up around it to restart it when a crash happens, and we had to code very carefully to avoid those crashes at all, since restarts are expensive: the application has a spin-up time of about 5 seconds.

For the first iteration of the software I coded a feature-complete clone in Erlang, which was able to handle requests even faster and was easier to read (because there was only a happy path, and not 75% of the codebase devoted to handling errors); the external supervisor was replaced by my own, BEAM-internal one. It was even capable of supervising other auxiliary tools on the same machine, where the Go program had to rely on them being in a working state (and crashed if they were not running, at least until the very last iteration). The actual workhorses of the program were back instantly, and global state or config was not affected by a single crashing worker.

I was even able to push it into a second iteration, until the customer's CTO decided that Erlang was too “eighties” and they would prefer to have the program implemented in Go.

NB: The Go program does rely on RabbitMQ; the Erlang program didn’t…

NB2: The Go program was developed by a team of 4, the Erlang version by me alone in my spare time. So the Erlang version could have been production-ready much earlier than the Go program…

5 Likes

This makes my brain hurt…

5 Likes

Well, his exact words were “we decided to go with modern technologies, so we want the Go version”, but I really got his point when he said that…

2 Likes

One thing that has stuck with me is something Joe Armstrong said: one way to highlight the difference between BEAM languages and other languages is that we don’t have web servers that handle 20 million sessions - we have 20 million web servers that each handle a single session :003:

So when one crashes, it only impacts that single session… and it’s usually restarted automatically if it’s being supervised - hence it’s not catastrophic to ‘let it crash’.
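A toy sketch of that model in Elixir (the supervisor name `MySessions` is made up): each session gets its own supervised lightweight process, so a crash takes down only that one session.

```elixir
# One "web server" per session: each session runs in its own
# supervised process, so a crash kills only that session's process.
{:ok, _sup} = Task.Supervisor.start_link(name: MySessions)

for session_id <- 1..5 do
  Task.Supervisor.start_child(MySessions, fn ->
    # only session 3 crashes; the other four are served normally
    if session_id == 3, do: raise("boom")
    IO.puts("session #{session_id} served")
  end)
end
```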

…I should have said, well, go with Elixir :lol:

3 Likes

I tried, but they explained that it was the same technology, just with a new coat of paint. When I wanted to tell them something about “modern” CPUs and IBM, my boss held me back.

At least the other devs have been keeping an eye on the BEAM since then, and no longer consider it merely “the ugly runtime necessary for Rabbit”.

2 Likes

I would have referred them to something Joe Armstrong said (and which I posted not long ago)…

You’d think a CTO would be aware of stuff like this :lol: but it’s ok - horses for courses right…

2 Likes

Well, that decision was made in February '17. And to be honest, with the Go program we will probably be able to earn more in terms of support fees :wink:

I love Elixir, Erlang, and the BEAM in general, but currently I can only use Erlang while maintaining Exercism's Erlang track.
I dislike Go, but I'm learning to live with it. I use it to pay my rent; bread and butter, at least… And I think it will stay like this for quite a while: Go is much better than the available alternative at my current workplace: JavaScript…

2 Likes

That’s a huge argument! :lol:

I guess they decided to use JS for the same reason, right? :laughing:

No, the JavaScript stuff is legacy only: code we get from the client and need to clean up. Often we try to migrate parts of those JS apps to Go.

1 Like

I’m not convinced by the arguments. Maybe there’s a learning opportunity for me or us. Let me know if this makes sense:

I don’t understand the mention of a “JVM scheduler” here. I’ve never shipped JVM web apps, so I don’t know if this applies to specific web servers/web frameworks, but I thought JVM threads were OS threads, scheduled by the OS itself.

In which frameworks/servers is this true? From my .NET experience, having more than one request per OS thread would only be true in a framework supporting async IO, which can suspend green threads when they do IO and let other requests' green threads run in the meantime. Otherwise I can only visualize one OS thread per request.

Even in async-IO-enabled scenarios, where the request's green thread shares its OS thread with others, exceptions still shouldn’t kill the OS thread. The framework has to be able to capture unhandled user exceptions in order to show error pages, let APMs collect stats, and let libraries automatically log errors. All this being true, the OS thread dying would be a bug in the web framework.

In this case, crashing would be throw new SomeException(), and the cost of that is collecting the stack trace. That costs CPU time, which, depending on your performance constraints, may not be that relevant.

If all of this is true, the only difference I see is having constructs for failure isolation and recovery, which for us on the BEAM are processes and supervisors. Akka, a JVM actor-model library, also advocates “let it crash”.

1 Like

A beginner’s opinion (I need a beginner badge or something, getting tired of mentioning it :slight_smile: )

The difference is that the future will bring CPUs with 1024 cores.
The OS will stop being responsible for scheduling and that will be delegated to the application developers.
Languages like Java, Go, PHP, C#, Haskell, and so on will become the equivalent of a man trying to run with 1024 legs attached to the same body, while Elixir/Erlang will be a happy centipede.
Erlang’s VM will handle scheduling for us, while Java, Go, Haskell, etc. developers will have to do that themselves.
Then comes error handling: in other languages you have to write error-handling code for 1024 cores, while Elixir/Erlang devs will slap on the “Let it crash” sticker and not go insane.

Feel free to correct me if any of the information is incorrect.

The OS will stop being responsible for scheduling and that will be delegated to the application developers.

What makes you think that? Even the Erlang VM is highly dependent on the OS for scheduling. It has a set of scheduler threads to run processes, but also thread pools for I/O, running dirty NIFs, etc., and it’s up to the OS to manage these.

Erlang’s VM will handle scheduling for us, while Java, Go, Haskell, etc. developers will have to do that themselves.

All the languages you cite have user-level scheduling.

https://morsmachine.dk/go-scheduler

Then comes error handling, which in other languages you have to write error handling code for 1024 cores, while Elixir/Erlang devs will slap the “Let it crash” sticker and not go insane.

I think you’re conflating two things. “Let it crash” helps keep the code simpler by moving error handling out of the business logic. It’s not harder to handle errors when you have 1024 threads than when you have 2; the code is the same.

1 Like

To point a few differences:

  • All resources are owned by a process in Erlang, and the VM guarantees clean-up of those resources once the process dies. In Java, Python, etc., if your response handler opens a file and then throws a NullPointerException at some random location later, the framework’s exception handler will catch it and the application will seem OK… until it runs out of file handles after this has happened too many times. I’ve seen a Twisted app lock up hard because of this. So you have to be more careful with your error handling, and spread try/catch over the business logic.

  • This extends to other kinds of resources, for instance registered names, DB connections, or locks of all sorts. In most languages, if you hit a problem where threads are grabbing DB connections and then throwing an exception without releasing the connection, your app will quickly run out of connections and die. In Erlang the connection (a process) can keep a monitor on its owner (another process) and release itself as soon as the owner dies, no matter the reason (sketched below). So the robustness of your application depends on the correctness of the DB connection’s code, which is rather small, rather than on the correctness of every single request handler that uses a DB connection. In the Erlang world this is called the error kernel, and you want to keep it as small as possible.
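A minimal, self-contained sketch of that monitor pattern (the `Conn` module and its messages are invented for illustration):

```elixir
defmodule Conn do
  # The "connection" process watches its owner and releases itself as
  # soon as the owner dies, for whatever reason.
  def checkout(owner) do
    spawn(fn ->
      ref = Process.monitor(owner)
      receive do
        {:DOWN, ^ref, :process, ^owner, _reason} ->
          # close the socket / hand the connection back to the pool here
          IO.puts("owner died, releasing connection")
      end
    end)
  end
end

owner = spawn(fn -> Process.sleep(100); exit(:boom) end)
Conn.checkout(owner)
Process.sleep(200)   # prints "owner died, releasing connection"
```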

5 Likes

What makes you think that the OS will continue to be responsible for 1024 cores, and how do you see that working?
Do Quasar and Spark handle scheduling, or does the developer handle scheduling?
I’m not sure whether you mention user-level scheduling as a benefit on 1024 cores or not. If you mention it as a benefit, could you explain why, and what the implementation looks like?
“Let it crash”, as I understand it, means that when things go bad on 1024 cores you let it crash, so that the processes can restart with clean state. How does Java handle mutable state on 1024 cores?
Can you explain the benefits of writing mutexes on 1024 cores?
Thanks!

1 Like

Ouch… So if you really do one OS process per request, it forks for every single request that comes in? This sounds rather expensive if there is no pooling, while pooling would in turn mean serialisation of requests in the end…

But as was said before, do not think only about web requests.

Let’s think about a connection from my BEAM application to another server, let’s say a database. For some reason the connection is dropped. The managing BEAM process will crash, some centralised crash handler writes debug info to the logs, and then the connection is restarted by the supervisor.
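A minimal sketch of that restart loop, with invented names (`DBConn`; `connect/1` stands in for a real driver call): the process owning the connection simply stops when the socket closes, and its supervisor re-runs `init/1`, i.e. reconnects.

```elixir
defmodule DBConn do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  def init(opts) do
    # connect/1 stands in for whatever driver call opens the connection
    {:ok, conn} = connect(opts)
    {:ok, conn}
  end

  # When the socket closes, crash instead of handling it in place;
  # the supervisor below restarts us, which replays init/1.
  def handle_info({:tcp_closed, _socket}, conn), do: {:stop, :connection_lost, conn}

  defp connect(_opts), do: {:ok, :fake_conn}
end

Supervisor.start_link([DBConn], strategy: :one_for_one)
```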

Something similar in Java involves massive rethrowing of exceptions to get to the point where we have the centralised crash handler (and that’s not even the place where we try to restart). Also, some code has to deal with errors that normally can’t occur there.

But to be honest, the best way to understand it is to actually use it.

1 Like

There are a couple of facets to the let-it-crash story.

At the core of it all is the idea of failing fast. This is not exclusive to Erlang, and I believe it’s generally a good practice. We want to fail as soon as something is off. By doing this, we ensure that the symptom and the cause are one and the same, which simplifies problem analysis. By looking at the error log, we can tell both what went wrong and why.

Now, of course, we don’t want our whole system to crash due to a single error, so we need to isolate the failure of a single task. In many popular languages this is done by wrapping the task execution in some sort of catch-all statement, or by running the task in a separate OS process. So, for example, as someone mentioned here, a typical web framework will indeed do this to make sure the error is caught and reported properly.

However, try-catch is not a perfect solution, for a couple of reasons. First, if shared mutable data is used, a task which fails in the middle could leave the data in an inconsistent state, which means that subsequent tasks might trip over it.

Moreover, a task itself could spawn additional concurrent subtasks (threads or lightweight threads), and we need to make sure that failures in those are properly caught too. A great example of this is the Go language: if a web request handler spawns another goroutine, and there’s an unrecovered panic (roughly, an uncaught exception) in that goroutine, the entire system crashes.

In contrast, using separate OS processes helps with this, but you can’t really run one OS process per task (e.g. per request), so we usually group them somehow (which to me is what microservices are about). Now you need to run multiple OS processes, and you need an extra piece of tech (e.g. systemd) to start these things in the proper order, restart failing OS processes, and maybe take down related OS processes as well.

With BEAM, all of these issues (and some others) are taken care of directly in our primary tech. If you don’t want the failure of one task to crash other tasks, you’ll typically run the task in a separate process, and fail fast there. With errors being isolated, a failing process doesn’t take anything else down with it (unless you ask for that explicitly via links). Shared-nothing concurrency also ensures that a failing thing can’t leave any junk data behind. Moreover, it ensures that whenever something crashes, the associated resources (memory, open sockets, file handles) are properly released. Finally, the termination of a process is a detectable event, which allows other processes (e.g. supervisors) to take corrective measures and help the system heal itself.
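That isolation is easy to see in a few lines of Elixir (a toy demonstration, not production code):

```elixir
parent = self()

# This process crashes; the crash is logged, and nothing else dies.
spawn(fn -> raise "boom" end)

# An unrelated, unlinked process keeps working just fine.
spawn(fn -> send(parent, :still_alive) end)

receive do
  :still_alive -> IO.puts("other work unaffected")
end
```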

As a result, Erlang-style fault-tolerance is IMO one-size-fits-all. We use the same approach to improve the fault-tolerance of individual small tasks (e.g. request handlers), as well as of other background services or larger parts of the system. I like to think of the supervision tree as our service manager (like systemd, upstart, or the Windows service manager). It gives us the same capabilities and the same guarantees, it’s highly concurrent, and it’s built into our main language of choice.

In contrast, in most other technologies, you need to use a combination of try/catch together with microservices backed by an external service manager, and in some cases you might need to resort to your own homegrown patterns (e.g. if you need to propagate a failure of one small activity across microservice boundaries). Therefore, I consider these other solutions to be both more complex and less reliable than the Erlang approach.

HTH :slight_smile:

10 Likes

Well, responding to the original @rower687 question, I have a tangible example of an advantage of the “let it crash” philosophy: if you take it to the extreme, you would no longer add catch-all clauses matching every possible result to your case expressions.

The argument for this is just what @sasajuric said:

To excerpt it a little: exceptions like CaseClauseError are better errors to get than unexpected values escaping out of your own code.
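A minimal sketch of what that looks like in practice (the `Demo` module is made up):

```elixir
defmodule Demo do
  def parse(value) do
    case value do
      {:ok, n} -> n * 2
      # deliberately no catch-all clause: an unexpected value raises
      # CaseClauseError right here, at the cause, not somewhere downstream
    end
  end
end

Demo.parse({:ok, 21})    #=> 42
Demo.parse(:unexpected)  #=> ** (CaseClauseError) no case clause matching: :unexpected
```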

I’ve posted this (with an example) on another topic:

Videos:

Joe Armstrong: Erlang Master Class 2: Video 3 - Handling errors

Fred Hebert: The Zen of Erlang

4 Likes

Yes, there is pooling. What you do when you don’t have async IO but are IO-bound is have a far greater number of OS threads in the pool than you have cores (virtual or otherwise). At some point you start to pay for OS scheduling, which is far less optimal than scheduling that understands “I’m doing IO, please don’t context-switch into me yet”.

I don’t get why that implies serialisation, though. Won’t every thread be independently reading/writing to its own socket?

I personally think it’s important to talk about web requests, because I have a conjecture that 99% of people curious about Elixir will be looking to move from Rails to Elixir for simple CRUD apps, because functional programming.

Now, if people are pursuing functional programming, they’ll get the benefits of not having shared mutable state anywhere they go. Why is “let it crash” better in this context?

This is another conjecture I have. If you don’t have shared mutable state to start with, “let it crash” is only relevant for stateful services. If we don’t write stateful services and there’s no mutable state, working with {:ok, value} | {:error, reason} is just as good. Throwing exceptions in this context would be just a goto statement, an optimization we long ago agreed isn’t worth it.
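For a concrete picture of that style, here is a small sketch of explicit {:ok, _} | {:error, _} handling with `with` (the `Pipeline` module is invented):

```elixir
defmodule Pipeline do
  # Any {:error, reason} returned by a step short-circuits the `with`
  # and is returned unchanged to the caller.
  def run(input) do
    with {:ok, parsed} <- parse(input),
         {:ok, result} <- compute(parsed) do
      {:ok, result}
    end
  end

  defp parse(s) when is_binary(s), do: {:ok, String.to_integer(s)}
  defp parse(_), do: {:error, :not_a_string}

  defp compute(n) when n > 0, do: {:ok, n * 2}
  defp compute(_), do: {:error, :non_positive}
end

Pipeline.run("21")  #=> {:ok, 42}
Pipeline.run(42)    #=> {:error, :not_a_string}
```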

There is an argument for just programming the happy path, and that sounds more interesting to me in the context of functional languages. I can’t say how this compares in Elixir vs Clojure, but I know Elixir being a dynamic language leads to far less code than handling Maybes in statically typed functional languages. But the counter-argument also exists in those languages: having to handle every single case makes you think about failures and handle them properly.

It makes me think “let it crash” says “let’s treat failure as a systems problem, by abstracting code into black boxes that can fail in a few different and easy-to-reason-about ways”, while statically typed functional languages say “let’s think about failure at every step of the computation, from data types (make invalid states unrepresentable) to algorithms”.

When you pool X OS processes and distribute incoming requests over those X processes, you can’t serve more than X requests at a time. Requests that come in while all X processes are busy will have to wait.

If your X is too small, the requests will time out on the client’s side. Depending on how your system is designed, they may be processed by your system anyway, creating a growing backlog. But even if we assume that you recognise those and won’t try to process them, you basically lose requests.

On the BEAM, we can spin up as many BEAM processes as we need, without artificially serialising the requests. We handle them as they come in. Also, since every request is handled immediately, TTFB is (relatively) short, even on a loaded system, so the client will not time out as quickly.
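A toy illustration of that "no fixed pool" model (`Task.start/1` is a real Elixir function; the request loop is simulated):

```elixir
handle = fn req ->
  # each incoming request gets its own lightweight process;
  # a crash in here affects only this one request
  {:ok, _pid} = Task.start(fn ->
    IO.puts("handled request #{req}")
  end)
end

# simulate 10_000 requests arriving; none of them waits for a pool slot
Enum.each(1..10_000, handle)
```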

And I do personally think Elixir is more than just Phoenix. If one comes from Rails to Phoenix, he won’t gain much anyway, since he limits himself to a framework. If, though, one comes from Ruby to Elixir, he has to be open-minded, and to limit himself to a small use case.

There are kinds of errors that you can only handle by re-establishing some connection, by deleting the current state and rebuilding it from some set of rules, or by other things that basically replay your init phase. This is the kind of error where let-it-crash wins, since replaying init is exactly what the supervisor will do after it “hears” that the old process crashed.

Some failures, though, which we can take care of easily, are handled in other ways. E.g., in an interactive shell application I ask the user for a file name to read. When I can’t read it, I will get an error which I won’t crash on; I will tell the user that I can’t read the file and that they have to specify a correct one.
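That expected-error case might look like this (a minimal sketch; `:file.format_error/1` is a real Erlang function):

```elixir
path = IO.gets("file name? ") |> String.trim()

case File.read(path) do
  {:ok, contents} ->
    IO.puts(contents)

  {:error, reason} ->
    # an expected, user-facing failure: report it instead of crashing
    IO.puts("Cannot read #{path}: #{:file.format_error(reason)}. Please enter a valid file name.")
end
```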

Like every idiom, let-it-crash has its exceptions to the rule. But the concept behind it makes it really easy to recover from otherwise irrecoverable errors by simply restarting the affected parts of the application.

When the connection to the RabbitMQ server vanishes, I do not need to tell each and every process that uses the connection that it is gone. The process that owns the connection will die, all the linked processes that use the connection will die as well (because the runtime tells them to), their corresponding supervisors will issue restarts, and everything works again.
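A tiny demonstration of that cascade (toy code; in the real application, supervisors would then restart both processes):

```elixir
conn_owner = spawn(fn ->
  receive do
    :lost -> exit(:connection_lost)   # the connection drops
  end
end)

user = spawn(fn ->
  Process.link(conn_owner)   # a "user" of the connection links to its owner
  Process.sleep(:infinity)
end)

Process.sleep(50)            # give the user time to establish the link
send(conn_owner, :lost)
Process.sleep(50)
Process.alive?(user)         #=> false (the linked user died with the connection)
```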

If I were not using let-it-crash, I would have to code everything that supervisors and linked processes do here on my own (which I effectively did in the Go program mentioned earlier).

1 Like