Understanding the advantages of "let it crash" term

peerreynders · November 6, 2017, 12:00pm

Videos:

Joe Armstrong: Erlang Master Class 2: Video 3 - Handling errors

Fred Hebert: The Zen of Erlang

omnibs · November 6, 2017, 12:31pm

Yes, there is pooling. What you do when you don’t have async IO but are IO bound is have a far greater number of OS threads in the pool than you have cores (virtual or otherwise). At some point you start to pay for OS scheduling, which is far less efficient than scheduling that understands “I’m doing IO, don’t context switch into me yet pls”.

I don’t get why that means serialization, though. Won’t every thread be independently reading/writing to its own socket?

I personally think it’s important to talk about web requests because I have a conjecture that 99% of people curious about Elixir will be looking to move from Rails to Elixir for simple CRUD apps because functional programming.

Now, if people are pursuing functional programming, they’ll get the benefits of not having shared mutable state anywhere they go. Why is “let it crash” better in this context?

This is another conjecture I have. If you don’t have shared mutable state to start with, “let it crash” is only relevant for stateful services. If we don’t write stateful services and there’s no mutable state, working with {:ok, values}|{:error, reason} is just as good. Throwing exceptions in this context would be just a goto statement, an optimization we have long agreed is not worth it.

There is an argument over just programming the happy path, and that sounds more interesting to me in the context of functional languages. I can’t say how this compares in Elixir vs Clojure, but I know it being a dynamic language leads to way less code than handling Maybes in strictly typed functional languages. But the counter argument also exists in those languages - that having to handle every single case match makes you think about failures and handle them properly.

It makes me think “let it crash” says “let’s treat failure as a systems problem, by abstracting code into black boxes that can fail in a few different and easy-to-reason-about ways” and strictly typed functional languages say “let’s think about failure in every step of the computation, from data types (make invalid states unrepresentable) to algorithms”.

NobbZ · November 6, 2017, 1:02pm

When you say you pool X OS-processes and distribute incoming requests over your X processes, you can’t do more than X requests at a time. Requests that come in while all X processes are working will have to wait.

If your X is too small, the requests will time out on the client’s side. Depending on how your system is designed, they will be processed by your system anyway, creating a growing backlog. But even when we assume that you recognize those and won’t try to process them, you basically lose requests.

In the BEAM, we can spin up as many BEAM-processes as we need, without artificially serialising the requests. We handle them as they come in. Also since every process is handled immediately, TTFB is (relatively) short, even on a loaded system, so the client will not time out that fast.
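
To make the waiting effect concrete, here is a minimal sketch in Python rather than Elixir. It is a toy model, not a benchmark: it assumes all requests arrive at the same instant and take the same time to serve, and `start_delays` and its parameters are hypothetical names used only for this illustration.

```python
import heapq


def start_delays(num_requests, pool_size, service_time):
    """Return how long each request waits before a worker picks it up.

    Assumes every request arrives at t=0 and occupies one worker
    for exactly `service_time` seconds.
    """
    # Min-heap of the times at which each worker becomes free again.
    free_at = [0.0] * pool_size
    heapq.heapify(free_at)
    delays = []
    for _ in range(num_requests):
        earliest = heapq.heappop(free_at)  # next worker to become free
        delays.append(earliest)
        heapq.heappush(free_at, earliest + service_time)
    return delays


# Pool of 2 workers: later requests queue behind earlier ones.
pooled = start_delays(6, 2, 1.0)
# One worker per request, which is roughly the process-per-request model
# (scheduling overhead is ignored in this sketch).
per_request = start_delays(6, 6, 1.0)
print(pooled)       # [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
print(per_request)  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

With a fixed pool the wait grows with the backlog, while with one lightweight process per request nothing queues before work begins, which is the point about TTFB above.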

And I do personally think Elixir is more than only Phoenix. If one comes from Rails to Phoenix, he won’t gain much anyway, since he limits himself to a framework. If, though, one comes from Ruby to Elixir, he has to be open-minded and not limit himself to a small use case.

There are kinds of errors that you can only handle by re-establishing some connection, by deleting the current state and rebuilding it from some set of rules, or by other things that basically replay your init phase. These are the kinds of errors where let-it-crash wins, since replaying init is exactly what the supervisor will do after it “hears” that the old process crashed.

Some failures though, which we can take care of easily, are handled in other ways. E.g. in an interactive shell application I ask the user for a file name to read. When I can’t read it, I will get an error which I won’t crash on. I will tell the user that I can’t read the file and that he has to specify a correct one.

Like every idiom, let-it-crash has its exceptions to the rule. But the concept behind it makes it really easy to recover from otherwise irrecoverable errors by simply restarting the affected parts of the application.

When the connection to the RabbitMQ server vanishes, I do not need to tell each and every process that uses this connection, that it is gone. The process that owns the connection will die, all the linked processes that use the connection will die as well (because the runtime tells them to), their corresponding supervisors will issue restarts, and everything works again.

If I were not using let-it-crash, I would have to code all the things that supervisors and linked processes do here on my own (which I effectively did in the earlier mentioned Go program).
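
As a rough illustration of what the runtime does with links, the following Python sketch models exit propagation as a walk over a graph of linked processes. It is a simplification under stated assumptions: no process traps exits (real supervisors do, which is how they survive and restart their children), and `link`, `propagate_exit`, and the process names are hypothetical.

```python
from collections import deque


def link(links, a, b):
    """Record a bidirectional link between processes a and b."""
    links.setdefault(a, set()).add(b)
    links.setdefault(b, set()).add(a)


def propagate_exit(links, crashed):
    """Return every process that exits when `crashed` dies abnormally.

    Assumption: nobody traps exits, so an abnormal exit travels
    across every link, transitively.
    """
    dead = {crashed}
    queue = deque([crashed])
    while queue:
        current = queue.popleft()
        for peer in links.get(current, ()):
            if peer not in dead:
                dead.add(peer)
                queue.append(peer)
    return dead


links = {}
link(links, "amqp_connection", "consumer_a")
link(links, "amqp_connection", "consumer_b")
link(links, "http_listener", "request_handler")  # unrelated subsystem

dead = propagate_exit(links, "amqp_connection")
print(sorted(dead))  # ['amqp_connection', 'consumer_a', 'consumer_b']
```

Only the processes tied to the lost connection go down; the unrelated subsystem keeps running, and restarting the dead set is then the supervisors' job.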

NobbZ · November 6, 2017, 1:13pm

NB: I wasn’t able to understand it for a long time either. But when I started using it (without knowing I did), it slowly took possession. I’m still not really using it correctly or perfectly, I assume, but now that I have dipped my toes in, I do in fact miss similar concepts or mechanics in other languages. And you will never realise the benefits as long as you only look at short-lived processes or single-process applications. Those can’t gain anything from let-it-crash by definition, they just fail. And in fact, for the process itself, it doesn’t matter if it dies because it finished its work or because there was an unhandled error. It’s dead, gone, away! This is no different from Java, C, or anything else. This is no different from an OS process at all. The benefits of let-it-crash are only visible if you take a look at the whole picture: the running program. When you look at it, you realise that the runtime itself takes care of everything for you, and that you do not need to restart the runtime because a process caused an overflow.

Just open your mind and try it out!

peerreynders · November 6, 2017, 2:52pm

Functional programming languages are still largely “sequential”. “Thread-based” programming concerns itself primarily with “what is the next useful thing this thread should do”. This tends to complect together success and failure paths in code, resulting in the happy path being fragmented over either conditional branches or try blocks. So it can be argued that “defensive programming” leads to complexity from the lack of separation of the “normal processing” and “error handling” concerns.

An Erlang/Elixir process has a “mission” - to stay on that “happy path”. But when it recognizes that it is unable to “complete its mission” it terminates with a non-normal reason.

It’s the supervisor’s job to know what to do when one of its processes fails - and that is its only job (i.e. separation of failure handling from normal processing). There are a number of “coping” (supervision) strategies it can be configured with - the correct one ultimately depends on the nature of the process that it is supervising.

Typically supervisors will terminate when their “coping strategies” fail repeatedly in a short enough time interval - as a (sub) system failure may be in progress - and dealing with that type of failure is the single responsibility of the next higher level supervisor.
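
The restart-threshold behaviour can be sketched in a few lines of Python. This is a toy model and not the OTP implementation: the `Supervisor` class, its method names, and the exact boundary handling of the time window are assumptions made for illustration; the defaults of 3 restarts within 5 seconds mirror Elixir's `Supervisor` defaults.

```python
class Supervisor:
    """Toy model of restart intensity with escalation to a parent."""

    def __init__(self, name, max_restarts=3, max_seconds=5.0, parent=None):
        self.name = name
        self.max_restarts = max_restarts
        self.max_seconds = max_seconds
        self.parent = parent
        self.restart_times = []

    def child_crashed(self, now, log):
        # Keep only the restarts that still fall inside the time window.
        self.restart_times = [
            t for t in self.restart_times if now - t <= self.max_seconds
        ]
        self.restart_times.append(now)
        if len(self.restart_times) <= self.max_restarts:
            log.append(f"{self.name}: restarting child")
            return
        # Too many restarts in too short a time: this supervisor gives up,
        # and its own failure becomes the parent's problem.
        log.append(f"{self.name}: giving up")
        self.restart_times.clear()
        if self.parent is not None:
            self.parent.child_crashed(now, log)


root = Supervisor("root")
branch = Supervisor("branch", parent=root)

log = []
for crash_time in (0.0, 1.0, 2.0, 3.0):  # four crashes within five seconds
    branch.child_crashed(crash_time, log)

for line in log:
    print(line)
```

The first three crashes are absorbed by `branch`; the fourth exceeds its threshold, so `branch` itself fails and `root` handles that as a single child failure.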

So “normal processing” occurs in the “regular processes” while “failure handling” happens separately in the supervisors.

In many cases the response to the failure isn’t actually that fine grained - while there are multiple locations where failure can occur or be detected often the response is identical. Typically you have a “there is something terribly wrong with this information and I can’t make sense of it” - so you make note of it and stop - and you don’t try again. Or “things all of a sudden have stopped making sense” - so you make note of it and stop - and then you restart with a clean slate.
In many cases it is impossible to explicitly predict all possible failure paths, much less devise sound custom recovery actions for every single one of them.
In many cases ‘fixing the actual problem’ from the current thread of control is impossible and even if it is possible “fixing the problem” is an entirely “separate mission” from the current happy path.

[quote]but I know it being a dynamic language leads to way less code than handling Maybes in strictly typed functional languages. But the counter argument also exists in those languages
[/quote]
??? Maybes are the basis of Railway oriented programming - Nothing is handled implicitly for you by Maybe which eliminates annoying explicit conditional checks - so there shouldn’t be any “handling”. In fact you hand Maybe the function - it handles the rest.

The one issue that doesn’t go away with JVM-based solutions is the stop-the-world garbage collection. The BEAM’s per process heaps make it extremely efficient to reclaim memory from terminated processes and a single process needing GC doesn’t stop anybody else from accomplishing something useful.

omnibs · November 6, 2017, 3:58pm

I like the argument of failure handling separation. I haven’t found a scenario where I’d give up graceful degradation and just “let it crash” though.

Also, the fact that supervisors themselves crash and that brings down the entire BEAM makes me, personally, carefully consider whether I really want to just “let it crash”. Will this failure happen as frequently as the maximum supervisor restart threshold? Can I predict that? How do I choose a better threshold?

I don’t think it’s that simple if you want any observability in your functional code. You’ll have to wrap your entire response chain in something that can communicate “failures happened at this specific point surrounded by this specific context”. Idk how this works in Haskell, but this is what I can assume, and idk if in F# people just throw, with a stack and some extra context, I’d love to find out.

In Elixir when we fail a match we get all the relevant context in the failure message for that match most of the time (unless you need to poke at previous state to figure out how you got to an invalid one). This is the contrast I was going for with “handling Maybes”.

LostKobrakai · November 6, 2017, 4:28pm

I feel like you’re missing the point of a “tree of supervisors and workers”. Your root supervisor should not get into a state where it would crash under any reasonable circumstance, while supervisors deeper down the tree might actually crash now and then and the workers at the furthest leaves of the tree should be the entities most likely to crash.

While each level of supervisors does also give you additional restarts (by default 3 within 5 seconds per supervisor), the fact that the VM will restart from a very localized set of entities (leaf nodes), growing to an ever larger set of parts in your system (whole branches), will be much more graceful than having crashes take down large parts or the whole app immediately. So if a worker keeps dying, try restarting all workers of that supervisor; if it’s still dying, restart that supervisor and the data cache sitting beside it, and so on moving upward in the tree. So you can quite easily scope the boundaries of restarts in a layered fashion without actually caring about that error handling in the code being executed in any worker processes. Only the setup of supervisors will result in that layering.

And to come back to the graceful degradation. The part/process of your application, which did trigger an error in a leaf worker causing it to crash will therefore not automatically die with it. It could simply monitor that process and if it crashes report to the user that the computation did not complete. You can still be reasonably accurate in your error message, because you know which small part of computation just failed. With this in place you can look into more advanced strategies like retries maybe even with some backoff strategy, while only reporting the error to the user if all of that failed as well.
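
A minimal Python sketch of that "observe, retry with backoff, then report" idea follows. On the BEAM the caller would monitor a separate worker process and react to its exit; here a raised exception stands in for that notification. `run_with_retries`, its parameters, and the tagged-tuple return shape are hypothetical choices for this illustration.

```python
import time


def run_with_retries(task, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run `task`, retrying with exponential backoff when it crashes.

    Returns ("ok", result) on success, or ("error", reason) once all
    attempts have failed, so the caller can report back to the user.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return ("ok", task())
        except Exception as exc:  # stands in for "the worker crashed"
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, ...
    return ("error", repr(last_error))


calls = {"count": 0}


def flaky():
    """Hypothetical piece of work that crashes on its first two runs."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("worker crashed")
    return 42


delays = []  # record the backoff delays instead of actually sleeping
result = run_with_retries(flaky, sleep=delays.append)
print(result, delays)
```

The caller never inspects why the work failed while retrying; it only decides whether to try again or to report the failure, which keeps the error message scoped to the small part of the computation that crashed.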

Edit: In the context of a web request (which you talked about before) this could mean your web request would never do any work with a certain likelihood of crashing. If you expect some work to be able to crash, you’d let that compute elsewhere in the supervision tree, where you can set up the supervision strategy in a way that makes sense for crashes to be handled, while your web request can happily observe what’s happening and report back to the user in case of success or error.

OvermindDL1 · November 6, 2017, 6:40pm

Honestly I really really wish modern CPUs were not so… uniform. I’d love a chip that has, say, 1-8 mega-cores (think modern chips) along with anywhere from 1024 to a million smaller micro-cores that run (much) slower but could run in a couple of different modes: either all being distinct, each with a small amount of allocated memory and with join points, or in a SIMD mode where they run the same operations over a large amount of aligned data. We kind of have that nowadays with a modern CPU and the GPU, but it is still a bit more limited than what I want because of the mismatch between their access. Most programmers are, to be honest, not quite competent enough to program on such CPUs though, so they’ve not been built yet, but we have the ability to make languages that would make such access quite safe now.

[quote=“dom, post:15, topic:9748”]
All the languages you cite have user-level scheduling.
[/quote]

C++ does too! This is entirely valid code!!! ^.^

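#include <functional> // std::bind, std::ref

// channel_t and fiber are the fiber library's channel and fiber types,
// left unnamed as in the original post.
void an_actor(channel_t &chan, int count = 0)
{
	// Pushes 0, 1, 2, ...; push suspends this fiber while the channel is full.
	while(true) chan.push(count++);
}

channel_t chan{1};
// std::ref stops std::bind from copying the channel. The start value is passed
// explicitly because default arguments are not applied through std::bind.
fiber gen(std::bind(an_actor, std::ref(chan), 0));

int i;
chan.pop(i); // `i` == 0
chan.pop(i); // `i` == 1
chan.pop(i); // `i` == 2
chan.pop(i); // `i` == 3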
Exceptions get properly sent through channels and all if necessary. You could build a very BEAM’y-like OTP in C++ pretty easily now (I wish these libraries existed back then, and yet now there are multiples that can do this). ^.^

Oh and hey, no dynamic memory allocation unless you explicitly do it somewhere. ^.^
Even the chan.pop has no memory allocation; you pass in the memory to fill and it is copied in (à la the BEAM).

Oh and a function does not need to be a single level, even the chan.push is doing a reschedule down inside of it waiting on when the chan is pop’d.

And yes, they can migrate between threads/cores just fine (it has a few chooseable default schedulers, or you can make your own, but the general default is a round-robin scheduler with work stealing), so you can saturate your 1024 cores if you so wish. I’ve been wanting to make a little OTP library in C++ using these for a while now, just to see how it ‘feels’ (it is easy to make a polymorphic channel type as well, though depending on how you make it then it may allocate memory unless you force ‘move’ semantics, which should be fine for message types I’d say).

And channels are not the only synchronization abstraction it has either; it also has a lightweight ‘mutex’, condition variables, and a few other things, including even just not rescheduling the fiber and instead giving up its running time and manually rescheduling it later, like say when a network message is retrieved.

It is not pre-emptive unless you use many threads (which is fine, they can be work-steal’ed when a specific fiber runs too long anyway) but that is fixable by just this_fiber.yield()ing on occasion manually (not automatic, but eh).

Since exceptions can propagate if stated to, or just kill the fiber outright (or will kill parents on up depending on the API used, or kill the running system if entirely unhandled anywhere, which is of course easy to fix ^.^), it would not be hard to make an OTP-like setup.

I like forcing channel usage with move-only data though.

It is far more ‘open’ in usage than the BEAM is, so not as ‘safe’, but it is significantly more performant, thus I think it would make a great CNode for an Erlang mesh as you could actually represent real local PID’s as Fibers instead of just faking it.

Yeah this is a big thing where Node and Python and Ruby and such scaling fails pretty hard. Python has libraries (like twisted) that will async data calls through so you can fix it pretty well there, but not always when there is a synchronous call otherwise.

Hear hear, a maybe in Elixir/Erlang is really the :ok/{:ok, value}/:error/{:error, reason} tuples, and there are a host of libraries that can pipeline them (which Elixir itself really should be able to do, but eh…).

And hey, no GC at all in the C++ versions, all memory needed for a fiber is allocated at fiber creation time (though with an optional growable stack if you go that way, but it is not default and you can instead set a stack size, which I prefer to prevent exploding fibers).

Well the Erlang way is an :error tuple with a reason like {:error, reason}.

Or exceptions of course… ^.^;

peerreynders · November 6, 2017, 10:18pm

Where did you get the idea that the BEAM crashes? All in all I think @LostKobrakai elaborated on the supervisor issue sufficiently. And a supervisor tree is started as part of an OTP-application and typically multiple OTP-applications are bundled together as a release to form a “system”.

So it is possible for an OTP application to crash repeatedly during startup only to finally give up and ultimately stop - but that doesn’t crash the BEAM.

And maybe “let it crash” is a bit sensationalist - is “let it fail” better?

Joe Armstrong explains the thinking in Programming Erlang 2e p.201:

Why Crash?

Crashing immediately when something goes wrong is often a very good idea; in fact, it has several advantages.

We don’t have to write defensive code to guard against errors; we just crash.

We don’t have to think about what to do; we just crash, and somebody else will fix the error.
We don’t make matters worse by performing additional computations after we know that things have gone wrong.
We can get very good error diagnostics if we flag the first place where an error occurs. Often continuing after an error has occurred leads to even more errors and makes debugging even more difficult.
When writing error recovery code, we don’t need to bother about why something crashed; we just need to concentrate on cleaning up afterward.
It simplifies the system architecture, so we can think about the application and error recovery as two separate problems, not as one interleaved problem.
…

Getting Some Other Guy to Fix It

Letting somebody else fix an error rather than doing it yourself is a good idea and encourages specialization. If I need surgery, I go to a doctor and don’t try to operate on myself.

If something trivial in my car goes wrong, the car’s control computer will try to fix it. If this fails and something big goes wrong, I have to take the car to the garage, and some other guy fixes it.

If something trivial in an Erlang process goes wrong, I can try to fix it with a catch or try statement. But if this fails and something big goes wrong, I’d better just crash and let some other process fix the error.

I used to ask myself the same type of questions - but then I ran into this in Designing for Scalability with Erlang/OTP p.175:

Note how we have grouped dependent processes together in one subset of the tree and related processes in another, starting them from left to right in order of dependency. This forms part of the supervision strategy of a system and in some situations is put in place not by the developer, who focuses only on what particular workers have to do, but by the architect, who has an overall view and understanding of the system and how the different components interact with each other.

So design of the supervision tree is largely an architectural concern and it’s this architecture that has to be designed to deal with the failures, not the program code that is down in the weeds. Therefore failures are dealt with in a very general, generic fashion (so the supervision strategies, and therefore the appropriate thresholds, relate to how the system needs to operate).

If you want a failure context you wouldn’t use Maybe but Either and capture the context in a Left value (that all subsequent composed computations would leave unmodified). However you are still focusing on the details of the failure. While the details should be logged for later inspection - they often don’t influence the immediate response. The response is often quite generic - either “give up” or “try again (later) from square one”.
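
For readers unfamiliar with Either, here is a small Python sketch of the idea. It is an illustration only: `Left`, `Right`, `bind`, and the two example steps are hypothetical names, and real Either types in Haskell or F# carry static types that this sketch does not.

```python
from dataclasses import dataclass
from typing import Any, Callable, Union


@dataclass(frozen=True)
class Left:
    """Failure branch: carries the context captured where things went wrong."""
    context: Any


@dataclass(frozen=True)
class Right:
    """Success branch: carries the value for the next step."""
    value: Any


Either = Union[Left, Right]


def bind(result: Either, step: Callable[[Any], Either]) -> Either:
    """Apply `step` to a Right; pass a Left through unmodified."""
    if isinstance(result, Left):
        return result
    return step(result.value)


def parse_int(text):
    try:
        return Right(int(text))
    except ValueError:
        return Left({"step": "parse_int", "input": text})


def reciprocal(n):
    if n == 0:
        return Left({"step": "reciprocal", "input": n})
    return Right(1 / n)


def pipeline(text):
    # No explicit error checks between the steps: bind does the plumbing.
    return bind(bind(Right(text), parse_int), reciprocal)


print(pipeline("4"))    # Right(value=0.25)
print(pipeline("abc"))  # Left(context={'step': 'parse_int', 'input': 'abc'})
print(pipeline("0"))    # Left(context={'step': 'reciprocal', 'input': 0})
```

The first `Left` short-circuits everything after it, so the logged context points at the step that failed, while the caller's response can stay as generic as "give up" or "try again".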

A poor man’s version of it that can be leveraged with libraries like exceptional, sure - but the appeal of Maybe (or Either, Result, etc.) is that it implicitly “knows” how to deal with Nothing (Left, Failure, etc.) without any additional outside plumbing.

DianaOlympos · November 7, 2017, 12:37am

The ARM big.LITTLE architecture tried that and they discovered fun stuff. A tale of an impossible bug: big.LITTLE and caching | Mono

OvermindDL1 · November 7, 2017, 5:54pm

Oooo that was a fun read, thanks!

Their issue was running the same code on the different execution contexts, which definitely should not be done (this is why I like so much the concept of OpenCL Kernels ^.^).