Yes, there is pooling. What you do when you don't have async IO but are IO-bound is have a far greater number of OS threads in the pool than you have cores (virtual or otherwise). At some point you start to pay for OS scheduling, which is far less optimal than scheduling that understands "I'm doing IO, don't context switch into me yet pls".
I don't get why that means serialization, though. Won't every thread be independently reading/writing to its own socket port?
I personally think it's important to talk about web requests because I have a conjecture that 99% of people curious about Elixir will be looking to move from Rails to Elixir for simple CRUD apps because functional programming.
Now, if people are pursuing functional programming, they'll get the benefits of not having shared mutable state anywhere they go. Why is "let it crash" better in this context?
This is another conjecture I have. If you don't have shared mutable state to start with, "let it crash" is only relevant for stateful services. If we don't write stateful services and there's no mutable state, working with `{:ok, value} | {:error, reason}` is just as good. Throwing exceptions in this context would be just a `goto` statement, an optimization we have long agreed is not worth it.
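A minimal sketch of that style (the `parse/1` helper is hypothetical):

```elixir
# Failure is just another return value; the caller decides what to do.
case File.read("config.json") do
  {:ok, contents} -> parse(contents)    # parse/1 is hypothetical
  {:error, reason} -> {:error, reason}  # no exception, no non-local jump
end
```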
There is an argument for just programming the happy path, and that sounds more interesting to me in the context of functional languages. I can't say how this compares in Elixir vs Clojure, but I know that being a dynamic language leads to way less code than handling `Maybe`s in strictly typed functional languages. But the counter-argument also exists in those languages: having to handle every single case match makes you think about failures and handle them properly.
It makes me think "let it crash" says "let's treat failure as a systems problem, by abstracting code into black boxes that can fail in a few different and easy-to-reason-about ways", and strictly typed functional languages say "let's think about failure at every step of the computation, from data types (make invalid states unrepresentable) to algorithms".
When you say you pool X OS processes and distribute incoming requests over your X processes, you can't do more than X requests at a time. Requests that come in while all X processes are working will have to wait.
If your X is too small, the requests will time out on the client's side. Depending on how your system is designed, they will be processed by your system anyway, creating a growing backlog. But even when we assume that you recognize those and won't try to process them, you basically lose requests.
On the BEAM, we can spin up as many BEAM processes as we need, without artificially serialising the requests. We handle them as they come in. Also, since every request is handled immediately, TTFB (time to first byte) is relatively short, even on a loaded system, so the client will not time out that fast.
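A sketch of that per-request model (`handle_request/1` is hypothetical):

```elixir
# One cheap BEAM process per request - no fixed pool of size X.
{:ok, _sup} = Task.Supervisor.start_link(name: RequestSup)

accept = fn req ->
  Task.Supervisor.start_child(RequestSup, fn -> handle_request(req) end)
end
```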
And I do personally think Elixir is more than only Phoenix. If one comes from Rails to Phoenix, he won't gain much anyway, since he limits himself to a framework. If, though, one comes from Ruby to Elixir, he has to be open-minded, and limit himself to a small use case.
There are kinds of errors that you can only handle by re-establishing some connection, by deleting current state and rebuilding it from some set of rules, or other things that basically replay your init phase. This is the kind of error where let-it-crash wins, since replaying init is exactly what the supervisor will do after it "hears" that the old process crashed.
Some failures, though, which we can take care of easily, are handled in other ways. E.g. in an interactive shell application I ask the user for a file name to read. When I can't read it, I will get an error which I won't crash on. I will tell the user that I can't read the file and that he has to specify a correct one.
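Sketched (the prompt loop itself is hypothetical):

```elixir
defmodule Prompt do
  # A recoverable failure: report it and ask again instead of crashing.
  def read_file do
    path = String.trim(IO.gets("File to read: "))

    case File.read(path) do
      {:ok, contents} -> contents
      {:error, reason} ->
        IO.puts("Cannot read #{path}: #{:file.format_error(reason)}")
        read_file()
    end
  end
end
```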
As with every idiom, let-it-crash has its exceptions to the rule. But the concept behind it makes it really easy to recover from irrecoverable errors by simply restarting the affected parts of the application.
When the connection to the RabbitMQ server vanishes, I do not need to tell each and every process that uses this connection that it is gone. The process that owns the connection will die, all the linked processes that use the connection will die as well (because the runtime tells them to), their corresponding supervisors will issue restarts, and everything works again.
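A sketch of that ownership pattern, assuming the `amqp` library's `AMQP.Connection.open/0`; the supervision setup is illustrative:

```elixir
defmodule Conn do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # init/1 is the whole recovery story: the supervisor replays it on every restart.
  def init(_opts) do
    {:ok, conn} = AMQP.Connection.open()  # fails the match (crashes) if the broker is gone
    Process.link(conn.pid)                # owner and connection die together
    {:ok, conn}
  end
end

# With :rest_for_one, children listed after Conn (the connection users)
# are restarted whenever Conn itself is restarted.
Supervisor.start_link([Conn], strategy: :rest_for_one)
```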
If I were not using let-it-crash, I would have to code all the things that supervisors and linked processes do here on my own (which I effectively did in the earlier-mentioned Go program).
NB: I wasn't able to understand it for a long time as well. But when I started using it (without knowing I did), it slowly took possession. I'm still not really using it correctly or perfectly, I assume, but as I have toe-tipped, I do in fact miss similar concepts or mechanics in other languages.

And you will never realise the benefits as long as you only look at short-lived processes or single-process applications. Those can't gain anything from let-it-crash by definition; they just fail. And in fact, for the process itself, it doesn't matter if it dies because it finished its work or because there was an unhandled error. It's dead, gone, away! This is no different from Java, C, or anything else. This is no different from an OS process at all. The benefits of let-it-crash are only visible if you take a look at the whole picture: the running program. When you look at it and realise that the runtime itself takes care of everything for you. The fact that you do not need to restart the runtime because a process caused an overflow.
Just open your mind and try it out!
Functional programming languages are still largely "sequential". "Thread-based" programming concerns itself primarily with "what is the next useful thing this thread should do?". This tends to complect success and failure paths in code, resulting in the happy path being fragmented over either conditional branches or try blocks. So it can be argued that "defensive programming" leads to complexity from the lack of separation between the "normal processing" and "error handling" concerns.
An Erlang/Elixir process has a "mission": to stay on that "happy path". But when it recognizes that it is unable to "complete its mission", it terminates with a non-normal reason.
It's the supervisor's job to know what to do when one of its processes fails - and that is its only job (i.e. separation of failure handling from normal processing). There are a number of "coping" (supervision) strategies it can be configured with - the correct one ultimately depends on the nature of the process being supervised.
Typically supervisors will terminate when their "coping strategies" fail repeatedly in a short enough time interval - as a (sub)system failure may be in progress - and dealing with that type of failure is the single responsibility of the next-higher-level supervisor.
So "normal processing" occurs in the "regular processes" while "failure handling" happens separately in the supervisors.
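A sketch of what that configuration looks like in Elixir (the worker modules are hypothetical):

```elixir
children = [OrderTracker, PaymentWatcher]

Supervisor.start_link(children,
  strategy: :one_for_one, # coping strategy: restart only the crashed child
  max_restarts: 3,        # if coping fails 3 times within...
  max_seconds: 5          # ...5 seconds, this supervisor gives up and escalates
)
```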
- In many cases the response to the failure isn't actually that fine-grained - while there are multiple locations where failure can occur or be detected, the response is often identical. Typically you have "there is something terribly wrong with this information and I can't make sense of it" - so you make a note of it and stop - and you don't try again. Or "things all of a sudden have stopped making sense" - so you make a note of it and stop - and then you restart with a clean slate (both sketched after this list).
- In many cases it is impossible to explicitly predict all possible failure paths, much less devise sound custom recovery actions for every single one of them.
- In many cases "fixing the actual problem" from the current thread of control is impossible, and even if it is possible, "fixing the problem" is an entirely "separate mission" from the current happy path.
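A minimal sketch of those two generic responses (names are illustrative):

```elixir
defmodule FailFast do
  require Logger

  # "Make a note of it and stop - don't try again."
  def give_up(input) do
    Logger.error("cannot make sense of: #{inspect(input)}")
    exit(:shutdown)  # :transient children are not restarted after a :shutdown exit
  end

  # "Make a note of it and stop - restart with a clean slate."
  def start_over(state) do
    Logger.error("state stopped making sense: #{inspect(state)}")
    exit(:bad_state) # abnormal exit; the supervisor restarts the process fresh
  end
end
```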
[quote]but I know that being a dynamic language leads to way less code than handling `Maybe`s in strictly typed functional languages. But the counter-argument also exists in those languages[/quote]
??? `Maybe`s are the basis of Railway Oriented Programming - `Nothing` is handled implicitly for you by `Maybe`, which eliminates annoying explicit conditional checks - so there shouldn't be any "handling". In fact you hand `Maybe` the function - it handles the rest.
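For reference, Elixir's closest built-in analogue is `with`, which threads the happy path and short-circuits on the first non-matching value (a minimal sketch; `fetch_user/1` and `fetch_order/1` are hypothetical):

```elixir
with {:ok, user} <- fetch_user(id),
     {:ok, order} <- fetch_order(user) do
  {:ok, order.total}
end
# the first {:error, reason} falls through unchanged - no explicit checks
```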
The one issue that doesn't go away with JVM-based solutions is stop-the-world garbage collection. The BEAM's per-process heaps make it extremely efficient to reclaim memory from terminated processes, and a single process needing GC doesn't stop anybody else from accomplishing something useful.
I like the argument of failure-handling separation. I haven't found a scenario where I'd give up graceful degradation and just "let it crash", though.
Also, the fact that supervisors themselves crash, and that this brings down the entire BEAM, makes me, personally, carefully consider whether I really want to just "let it crash". Will this failure happen as frequently as the maximum supervisor restart threshold? Can I predict that? How do I choose a better threshold?
I don't think it's that simple if you want any observability in your functional code. You'll have to wrap your entire response chain in something that can communicate "failures happened at this specific point, surrounded by this specific context". Idk how this works in Haskell, but this is what I can assume, and idk if in F# people just `throw` with a stack and some extra context - I'd love to find out.
In Elixir, when we fail a match, we get all the relevant context in the failure message for that match most of the time (unless you need to poke at previous state to figure out how you got to an invalid one). This is the contrast I was going for with "handling `Maybe`s".
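For instance, a failed match puts the offending value right in the error:

```elixir
{:ok, conf} = File.read("missing.json")
#=> ** (MatchError) no match of right hand side value: {:error, :enoent}
```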
I feel like you're missing the point of a "tree of supervisors and workers". Your root supervisor should not get into a state where it would crash under any reasonable circumstance, while supervisors deeper down the tree might actually crash now and then, and the workers at the furthest leaves of the tree should be the entities most likely to crash.
While each level of supervisors also gives you additional restarts (3 within 5 seconds per supervisor by default), the fact that the VM restarts a very localized set of entities first (leaf nodes), growing to an ever larger set of parts in your system (whole branches), will be much more graceful than having crashes take down large parts of the app, or the whole app, immediately. So if a worker keeps dying, try restarting all workers of that supervisor; if it's still dying, restart that supervisor and the data cache sitting beside it, and so on, moving upward in the tree. So you can quite easily scope the boundaries of restarts in a layered fashion without the code being executed in any worker process actually caring about that error handling. Only the setup of supervisors results in that layering.
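A sketch of such a layered tree (module names are hypothetical):

```elixir
# Leaf workers crash first and are restarted by WorkerSup. Only if
# WorkerSup itself keeps failing does the crash escalate to RootSup,
# which with :one_for_all also restarts the cache sitting beside it.
children = [
  {DataCache, []},
  {WorkerSup, []}  # a Supervisor with its own crash-prone workers
]

Supervisor.start_link(children, strategy: :one_for_all, name: RootSup)
```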
And to come back to graceful degradation: the part/process of your application which triggered an error in a leaf worker, causing it to crash, will therefore not automatically die with it. It could simply monitor that process and, if it crashes, report to the user that the computation did not complete. You can still be reasonably accurate in your error message, because you know which small part of the computation just failed. With this in place you can look into more advanced strategies, like retries, maybe even with some backoff strategy, while only reporting the error to the user if all of that failed as well.
Edit: In the context of a web request (which you talked about before) this could mean your web request would never do any work with a certain likelihood of crashing. If you expect some work to be able to crash, you'd let that compute elsewhere in the supervision tree, where you can set up the supervision strategy in a way that makes sense for crashes to be handled, while your web request can happily observe what's happening and report back to the user in case of success or error.
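A minimal sketch of that observe-and-report pattern (`risky_work/0` is hypothetical):

```elixir
{pid, ref} = spawn_monitor(fn -> risky_work() end)

receive do
  {:DOWN, ^ref, :process, ^pid, :normal} -> :ok              # finished fine
  {:DOWN, ^ref, :process, ^pid, reason} -> {:error, reason}  # crashed; report or retry
end
```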
Honestly I really, really wish modern CPUs were not so… uniform. I'd love a chip that has, say, 1-8 mega-cores (think modern chips) along with 1024 to a million little smaller micro-cores that run (much) slower but could run in a couple of different modes: either all being distinct, with an individually small amount of allocated memory and join points, or running in a SIMD mode where they run over a large amount of aligned data with the same operations on each. We kind of have that nowadays with a modern CPU and the GPU, but it is still a bit more limited than what I want because of this mismatch between their access. Most programmers are, to be honest, not quite competent enough to program on such CPUs, though, so they haven't been built that way yet, but we have the ability to make languages that would make such access quite safe now.

[quote="dom, post:15, topic:9748"]
All the languages you cite have user-level scheduling.
[/quote]
C++ does too! This is entirely valid code!!! ^.^
```cpp
// Assuming Boost.Fiber, whose API matches the names used here.
#include <boost/fiber/all.hpp>
#include <functional>

using channel_t = boost::fibers::buffered_channel<int>;

void an_actor(channel_t &chan, int count = 0)
{
    // keep producing until the channel is closed
    while (chan.push(count++) == boost::fibers::channel_op_status::success);
}

int main()
{
    channel_t chan{2}; // capacity must be a power of two
    boost::fibers::fiber gen(std::bind(an_actor, std::ref(chan), 0));
    int i;
    chan.pop(i); // `i` == 0
    chan.pop(i); // `i` == 1
    chan.pop(i); // `i` == 2
    chan.pop(i); // `i` == 3
    chan.close(); // unblocks the producer so the fiber can finish
    gen.join();
}
```
Exceptions get properly sent through channels and all if necessary. You could build a very BEAM'y-like OTP in C++ pretty easily now (I wish these libraries existed back then, and yet now there are multiple libraries that can do this). ^.^
Oh and hey, no dynamic memory allocation unless you explicitly do it somewhere. ^.^
Even the `chan.pop` has no memory allocation; you pass in the memory to fill and it is copied in (à la the BEAM).
Oh, and a function does not need to be a single level; even the `chan.push` is doing a reschedule down inside of it, waiting on when the `chan` is popped.
And yes, they can migrate between threads/cores just fine (it has a few choosable default schedulers, or you can make your own, but the general default is a round-robin scheduler with work stealing), so you can saturate your 1024 cores if you so wish. I've been wanting to make a little OTP library in C++ using these for a while now, just to see how it "feels" (it is easy to make a polymorphic channel type as well, though depending on how you make it, it may allocate memory unless you force "move" semantics, which should be fine for message types, I'd say).
And channels are not the only synchronization abstraction it has, either; it also has a lightweight "mutex", condition variables, and a few other things, including even just not rescheduling the fiber, instead giving up its running time and manually rescheduling it later, like, say, when a network message is retrieved.
It is not pre-emptive unless you use many threads (which is fine, they can be work-stolen when a specific fiber runs too long anyway), but that is fixable by just `this_fiber.yield()`ing on occasion manually (not automatic, but eh).
Since exceptions can propagate if stated to, or just kill the fiber outright (or will kill parents on up depending on the API used, or kill the running system if entirely unhandled anywhere, which is of course easy to fix ^.^), it would not be hard to make an OTP-like setup.
I like forcing channel usage with move-only data though.
It is far more "open" in usage than the BEAM is, so not as "safe", but it is significantly more performant; thus I think it would make a great CNode for an Erlang mesh, as you could actually represent real local PIDs as fibers instead of just faking it.
Yeah, this is a big thing where Node and Python and Ruby and such scaling fails pretty hard. Python has libraries (like Twisted) that will async data calls through so you can fix it pretty well there, but not always when there is a synchronous call otherwise.
Hear hear, a `maybe` in Elixir/Erlang is really the `:ok`/`{:ok, value}`/`:error`/`{:error, reason}` tuples, and there are a host of libraries that can pipeline them (which Elixir itself really should be able to do, but eh…).
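A minimal sketch of the kind of pipelining those libraries provide (a hypothetical helper, not any specific library's API; `Jason` is an assumed JSON dependency):

```elixir
# ok_then/2 applies `fun` on success and passes errors through untouched.
ok_then = fn
  {:ok, value}, fun -> fun.(value)
  {:error, _} = error, _fun -> error
end

"config.json"
|> File.read()
|> ok_then.(&Jason.decode/1)
```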
And hey, no GC at all in the C++ version; all memory needed for a fiber is allocated at fiber-creation time (though with an optional growable stack if you go that way, but it is not the default, and you can instead set a stack size, which I prefer, to prevent exploding fibers).
Well, the Erlang way is an `:error` tuple with a reason, like `{:error, reason}`.
Or exceptions of course… ^.^;
Where did you get the idea that the BEAM crashes? All in all I think @LostKobrakai elaborated on the supervisor issue sufficiently. And a supervisor tree is started as part of an OTP application, and typically multiple OTP applications are bundled together as a release to form a "system".
So it is possible for an OTP application to crash repeatedly during startup only to finally give up and ultimately stop - but that doesn't crash the BEAM.
And maybe "let it crash" is a bit sensationalist - is "let it fail" better?
Joe Armstrong explains the thinking in Programming Erlang, 2nd ed., p. 201:
Why Crash?
Crashing immediately when something goes wrong is often a very good idea; in fact, it has several advantages.
- We don't have to write defensive code to guard against errors; we just crash.
- We don't have to think about what to do; we just crash, and somebody else will fix the error.
- We don't make matters worse by performing additional computations after we know that things have gone wrong.
- We can get very good error diagnostics if we flag the first place where an error occurs. Often continuing after an error has occurred leads to even more errors and makes debugging even more difficult.
- When writing error recovery code, we don't need to bother about why something crashed; we just need to concentrate on cleaning up afterward.
- It simplifies the system architecture, so we can think about the application and error recovery as two separate problems, not as one interleaved problem.
…
Getting Some Other Guy to Fix It
Letting somebody else fix an error rather than doing it yourself is a good idea and encourages specialization. If I need surgery, I go to a doctor and don't try to operate on myself.
If something trivial in my car goes wrong, the car's control computer will try to fix it. If this fails and something big goes wrong, I have to take the car to the garage, and some other guy fixes it.
If something trivial in an Erlang process goes wrong, I can try to fix it with a catch or try statement. But if this fails and something big goes wrong, I'd better just crash and let some other process fix the error.
I used to ask myself the same type of questions - but then I ran into this in Designing for Scalability with Erlang/OTP, p. 175:
Note how we have grouped dependent processes together in one subset of the tree and related processes in another, starting them from left to right in order of dependency. This forms part of the supervision strategy of a system and in some situations is put in place not by the developer, who focuses only on what particular workers have to do, but by the architect, who has an overall view and understanding of the system and how the different components interact with each other.
So the design of the supervision tree is largely an architectural concern, and it's this architecture that has to be designed to deal with the failures, not the program code that is down in the weeds. Therefore failures are dealt with in a very general, generic fashion (so the supervision strategies, and therefore the appropriate thresholds, relate to how the system needs to operate).
If you want a failure context you wouldn't use `Maybe` but `Either`, and capture the context in a `Left` value (which all subsequent composed computations would leave unmodified). However, you are still focusing on the details of the failure. While the details should be logged for later inspection, they often don't influence the immediate response. The response is often quite generic: either "give up" or "try again (later) from square one".
A poor man's version of it can be leveraged with libraries like exceptional, sure - but the appeal of `Maybe` (or `Either`, `Result`, etc.) is that it implicitly "knows" how to deal with `Nothing` (`Left`, `Failure`, etc.) without any additional outside plumbing.
The ARM big.LITTLE architecture tried that, and they discovered fun stuff: A tale of an impossible bug: big.LITTLE and caching | Mono
Oooo that was a fun read, thanks!
Their issue was running the same code on the different execution contexts, which definitely should not be done (this is why I like the concept of OpenCL kernels so much ^.^).