Certainly the tools are different. To be clear, I’m not asking for a general comparison of Axum vs Phoenix, or which is better, or which I should choose. In the end, though, they are both web servers.
I’m not really asking for a comparison of the tools; it just seems too complicated to discuss everything at once… My understanding, and I might well be wrong, is that a Rust server will be faster and more memory efficient, but less productive in terms of developer time at the language level, more difficult to scale horizontally, and in need of a separate tool to supervise and restart the server if/when it crashes. And that doesn’t include a comparison of ecosystems (both language and web frameworks), my requirements, or my preference for the feel of the language and community… and I’m absolutely sure I’ve left off many other points of differentiation.
One of the many points of analysis in choosing between server systems is something like uptime. Of course both an Axum and a Phoenix server could be restarted almost immediately, so this is not exactly about uptime. The question is rather how much one server system will crash compared to the other, and how much that matters (as a single point of analysis). If you told me that a Phoenix user could under no circumstances crash the server when writing responses (when using the framework as a normal user, let’s say), that seems clearly superior to “just remember not to use all the standard language features that can cause a panic…”, which seems to be one of the answers given for Rust’s Axum. It has also been separately pointed out that panics can be caught (through an added middleware) so that the server doesn’t go down due to a panic, which may mean that the BEAM doesn’t triumph even in this one regard (though of course it has many pros and cons in other regards). I realize that this point of comparison doesn’t address all the other points I would have to evaluate for a particular use case.
One of the reasons I’m curious about this is to know whether the BEAM would offer an advantage in this single regard (I know there are other pros and cons) over a system like GitHub - roc-lang/basic-webserver: A basic webserver in Roc, which is Roc using a Rust web framework under the hood. Reliability means a lot to me and I’m trying to understand it along this one dimension (servers crashing).
Fault tolerance provided by the Erlang VM and using a catch-all block are two completely different concepts.
One of the most important distinctions is that each Elixir process has its own separate memory, which means that when it crashes, it is guaranteed by design that no other process’s memory will be corrupted. If you catch all panics in Rust and you have entities that share the same data structures in memory, you can easily corrupt that memory space; this is true for every language that has shared mutable values. That said, I am not positive how memory safety plus panics play out in Rust; maybe it handles this better than languages like C/C++.
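To make that concrete, here is a minimal sketch (the names and values are mine, runnable in `iex`): the spawned process crashes, but the parent’s data, living on the parent’s own heap, is untouched; the parent only learns about the crash as a message.

```elixir
parent_state = %{important: "data"}

# spawn_monitor/1 starts an unlinked process and monitors it in one step.
# The raise crashes that process (the crash is logged), and nothing else.
{pid, ref} = spawn_monitor(fn -> raise "boom" end)

receive do
  {:DOWN, ^ref, :process, ^pid, _reason} ->
    # The parent only observes the death as a message; its own heap is intact.
    IO.inspect(parent_state, label: "still intact")
end
```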
“Comparing apples with oranges” to me implies the question is inappropriate or invalid.
The question is invalid. Frameworks and languages are just tools; it’s about how you use them. This is like asking if houses built with Husky hammers will fall apart more or less than houses built with Craftsman hammers. There are many other factors that determine the outcome; the variable of the brand of hammer is insignificant. You can build crash-free services in either language. It takes skill. Any answer to the contrary is lying.
I’m wanting to know if a Phoenix server will crash less often than an Axum server and how much that really matters.
Phoenix servers will not “crash” (you define this in a later post, I’m using your definition) more or less than an Axum server, inherently. How much it matters is entirely up to you, your team, and your tolerance for failures. This is decided on a case-by-case basis, hopefully by engineers, and is not related to the input of which hammer you choose.
Thank you so much for that! You made my day! If I understand you correctly, both technologies provide programmers more or less the same overall level of safety against servers crashing, and it’s really all about how the programmers use those tools…
I mean it’s entirely plausible that both technologies are roughly equal in this regard, and that it really does just come down to programmer skill.
If that turns out to be the case, then I didn’t know that!.. and I’ve learned something!
If I had guessed at the answer to my own question, I would have guessed that the BEAM has a special ingredient that means a Phoenix server crashes less often than an Axum server, programmer skill being equal.
I would be interested to know how much it matters if a server crashes and is restarted… from people with experience running high-demand servers in production. I literally don’t know if people lose sleep over this, or if they don’t care, or where they land on the spectrum between these two extremes, or if one community cares less about this than the other because of the nature of the tech.
I provide support for PLC systems. I know that there are things that can go wrong and things that are basically impossible. I can try to skillfully avoid doing certain things wrong, but I really, really appreciate the extent to which a platform makes dangerous, unfortunate things impossible.
Erlang/Elixir developers still try to avoid things that can go wrong, we just don’t lose sleep trying to cover every little thing. We generally code the happy path, but if an error starts happening over and over in production, then we certainly either fix it or handle it more gracefully. This is why error handling is normally done via :ok/:error tuples—these are things we know can, and likely will, go wrong. Exceptions are left for events that are more exceptional, like briefly losing the network.
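For anyone less familiar with the convention, a small illustrative sketch (the Demo module and its functions are made up): expected failures come back as values you pattern match on, while anything you didn’t anticipate simply raises and is dealt with by the runtime.

```elixir
defmodule Demo do
  # Hypothetical lookup: the expected failure is returned as a value,
  # not raised as an exception.
  def fetch_user(1), do: {:ok, %{id: 1, name: "Ada"}}
  def fetch_user(_), do: {:error, :not_found}

  def greet(id) do
    case fetch_user(id) do
      {:ok, user} -> "hello #{user.name}"
      {:error, :not_found} -> "no such user"
    end
  end
end

Demo.greet(1) #=> "hello Ada"
Demo.greet(2) #=> "no such user"
```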
I would suggest watching some talks by Joe Armstrong (RIP). “Let it Crash” is more about not worrying about what we can’t control. To paraphrase a point from one of Joe’s talks, a language like Rust can provide strong guarantees against runtime errors, but those guarantees are meaningless if one of the machines on the network is hit by a missile.
Erlang was developed to handle telecom, which happens to map pretty nicely to HTTP. While a dropped call can sometimes be a matter of life or death, there isn’t much you can do if a tree takes out a telephone wire, but at least the rest of the network won’t be affected.
If I had guessed at the answer to my own question, I would have guessed that the BEAM has a special ingredient that means a Phoenix server crashes less often than an Axum server, programmer skill being equal.
This is sort of the case. Erlang’s design is meant to encourage fault tolerance; what would be an unhandled exception that would crash a process and render a service unreachable in other languages is handled by the BEAM for us. But the way it does this is not unique. It’s mostly due to an implementation of the actor model baked into the runtime. It can be, and has been, done in other languages, including in Rust. This integration at the core is a point of “more stability” in Elixir/Erlang/Phoenix’s favor, but just because a crash occurs and the runtime handles restarting for you doesn’t mean that critical sections of your application can’t be down - there is no language or framework that can fix that for you.
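A minimal sketch of what that looks like in Elixir terms (the Counter module is mine, not from any post above): the process crashes, and the supervisor restarts it with fresh state while the rest of the system keeps running.

```elixir
defmodule Counter do
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, 0, name: __MODULE__)

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_cast(:boom, _count), do: raise("crash on purpose")
end

# one_for_one: if Counter crashes, only Counter is restarted, with fresh
# state, and everything else keeps serving requests.
{:ok, _sup} = Supervisor.start_link([Counter], strategy: :one_for_one)

GenServer.cast(Counter, :boom)           # crashes the Counter process...
Process.sleep(200)                       # ...give the supervisor a moment
Process.alive?(Process.whereis(Counter)) #=> true (a new, restarted process)
```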
I literally don’t know if people lose sleep over this or if they don’t care, or where they land on the spectrum between these two extremes
This is decided case by case. I, like probably many others on this forum and elsewhere, have been woken up by production outages that were “stop the world, this needs fixing now or people can be hurt/sensitive data can be leaked,” but sometimes, even at the same organizations/projects, an outage is simply “mute this alarm and fix it in the morning.” So yes, people do lose sleep over this. You’d probably be interested in looking at software in safety-critical systems, e.g., aircraft/medical equipment, to see the types of choices in languages/tools they make. I can tell you Elixir isn’t getting deployed on an airplane, but neither is Axum.
Edit: You might also enjoy this video about Zig, or the first few minutes at least.
Edit 2: One more thing on the “do people lose sleep over this”: in cases where a software defect is worth losing sleep over, I wouldn’t say either of Elixir/Axum is better or worse than the other. The first thing I’d evaluate in making a choice is the team and the domain.
This is a good summary point right here (along with the rest of your post). And yet “safe” systems languages still kill people, unfortunately. This is why I wish we’d stop using words like “guarantee.” I don’t really have any experience with systems languages, but like, how can anything running on a computer guarantee anything? I realize I’m supposed to infer the hyperbole, but I’m still human and that word means what it means to me.
I think the unfortunate reality is that some people aren’t exaggerating when they make guarantees, they’re lying or just misinformed. Certainly some languages have design features intended to make things “more safe,” but no amount of language design can protect the programmer from themself.
Actually, it’s pretty funny, I re-read this thread and noticed @D4no0 uses “guarantee” in the sense that an Elixir process’ memory is guaranteed to be isolated… and I agree with that… so I’m apparently one of those people I was just ragging on
I guess all I wanted to stress from my original point—and this becomes especially evident if you read/listen to Joe Armstrong (who was an incredibly entertaining, hilarious, and endearing individual)—is that Erlang’s relatively high level of abstraction literally takes environmental/physical-beyond-computer-hardware consequences into account in its design… which is pretty cool
Yes and no; those actor frameworks will not help you manage runaway tasks (nor will they grant you much in the way of introspection). In any M:N setup without pre-emption, a job that takes too long will hurt responsiveness, giving you bad latency for unrelated tasks in the best case, and locking things up entirely in the worst.
If we define “crash” as “the system no longer does what it is supposed to do,” Rust et al. push way more of the responsibility for avoiding them onto the programmer than Elixir does.
There’s no free lunch, however, and the trade-offs may tilt in Rust’s favor anyway, but I find “language X also has actors with supervision” a bit reductionist. I’ve yet to see any other solution that supports all aspects of resilience (by Hollnagel’s definition) as well as Elixir/Erlang does.
A Rust panic only crashes the thread or the Tokio task. You can end up with broken state in some cases, though.
In my experience, having a panic that breaks shared state is quite rare, but YMMV.
I have had a simple Phoenix app running in production that terminated every 6 months, give or take. I never understood why.
With Rust I’ve had apps with years of uptime.
This, of course, says more about my expertise in these technologies, rather than anything else.
But in any case, if you care about availability, you likely want to set up an external supervisor regardless. Be it systemd, Kubernetes, or an ops person with an alert system - it doesn’t matter.
This is exactly what I wanted to point out. If we talk about actors on the JVM, the tools the VM provides are not enough to make this isolation even close to as good as the Erlang VM’s. I think the fiasco when Scala’s official actor library was considered unsafe and was proven to leak memory is a good example of this; even though it can be argued that this comes down to errors in code, I think the building blocks they were using were incompatible.
Monitoring of the system is another topic, and slapping on a daemon that knows how to restart a server is not a universal solution.
The main benefit of fault tolerance is that you can design huge monoliths that you can ensure work reliably and are much more cost effective than something like k8s. If your entire runtime crashes periodically, the only way to ensure you don’t have downtime is to separate everything into microservices, which is far from optimal in terms of infrastructure costs and the people required to make it work.
I wanted to point out that Elixir, Rust, Java, etc. all have fault tolerance and fault modes. These fault modes are just different. Saying one is the lesser evil is kinda hard.
That is, all of them recover to some extent. Elixir uses supervisors, Rust catches panics (Tokio), Java web servers catch exceptions. Then the question is what they don’t recover from, and how often that happens - and this depends largely on the program, not the technology.
I think this may be overgeneralizing to the point of a false equivalence. If I look at throwing and/or catching exceptions in, say, Java, I’m really dealing with managing errors in logic or data; unanswered is what to do about the runtime impacts, which are left to the developer to figure out (or to delegate to some external library/framework/tool). In Elixir the equivalent isn’t the supervisor but Elixir’s exceptions: try, catch, rescue, etc., all of which exist without invoking supervisors and deal with that logic/data domain. The Elixir/Erlang supervisor is about dealing with the runtime consequences of error: if I fail to capture an error at the logic level, what happens at runtime? Supervisors provide an answer to that question. I appreciate that I’m drawing a somewhat fine line, but I think it’s an important distinction between the BEAM and other runtime environments.
For a lot of problems, having both the logic-level and runtime-level error management questions dealt with by mature constructs built into the VM (or the standard libraries) means that I don’t have to solve those questions myself, even if that solve is just picking the right external library/framework. I expect the BEAM will beat my harebrained ideas about fault tolerance and recovery every day, and it gives me simple rules to play by (which seems in part to be the point of Erlang, if I read Joe Armstrong’s thesis correctly). Yes, with runtime error handling baked in at such a low level, I expect it’s more prescriptive than those other languages, which leave it open, and there are trade-offs such as raw performance and so on… but that’s part of the choice of whether or not the BEAM is suitable for a project. And yes, you can still screw up a BEAM-based application and end up with an application that crashes at the slightest error… but your chances of getting it into the domain of “right” are ultimately better following the more prescriptive path than not.
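As a rough illustration of that fine line (the ConfigLoader module is hypothetical, and it assumes the Jason library is available, as it is in most Phoenix apps): rescue handles errors at the logic/data level inside the process, and whatever you don’t rescue falls through to the supervisor at the runtime level.

```elixir
defmodule ConfigLoader do
  # Logic/data level: errors we anticipate are rescued in-process and
  # returned as values. (Jason is assumed as the JSON library here.)
  def parse(path) do
    {:ok, path |> File.read!() |> Jason.decode!()}
  rescue
    e in [File.Error, Jason.DecodeError] -> {:error, Exception.message(e)}
  end
end

# Runtime level: anything parse/1 does NOT rescue crashes the calling
# process, and the supervisor, not this module, decides what happens next
# (restart the process, restart siblings, give up after too many restarts).
```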
A long time ago I was deciding between Rust and Elixir: Rust for its overall performance and its own prescriptive aspects in dealing with memory and data (C/C++ scare me). I considered conservative choices like Java or C#, too… but ultimately the BEAM won, because I realized I didn’t need the raw performance of Rust, and the BEAM came with fairly straightforward answers to problems like concurrency and fault tolerance, which are not problems I am likely to be great at solving: I would likely end up with a more resilient, available, and concurrent system building on the BEAM than I would with anything else. Elixir was the language of choice because I felt it had the right expressiveness for business systems problems.
The other thing is that the BEAM has features for recovery if a machine crashes. It doesn’t matter how great your type system and tests are if your hardware fails. It is possible to start an application on the BEAM and have it set up so that if node A fails, it will resume on node B.
I had one which started when Phoenix had just seen its first release, and recently I shut it down myself. As a mirror redirector, it handled masses of traffic. The single app replaced a Go variant, including all its supporting pieces (Redis, systemd monitoring, etc.).
So Phoenix is not the problem. Luckily Elixir apps can be debugged in production! Have fun
@lpil I was just reading this: Typing lists and tuples in Elixir - The Elixir programming language, and I’d love to hear what you think about that post, the promise of Elixir’s compile-time type system, and how it might compare to Gleam going into the future. Your insights would be particularly valuable to me because Gleam’s type system is somewhat familiar to me, being quite similar to Elm’s, and I think fundamentally similar in its essence to those of F#, OCaml, Rust, and Haskell (with much less complexity), whereas Elixir’s type system is quite new/unfamiliar to me. After reading the post I linked above, I’m just beginning to be able to imagine how it will actually feel to use Elixir’s types compared to a system like Gleam’s and the others’ (even though I’ve been reading and watching the presentations about the coming Elixir type system).
I know Gleam should be expected to compile faster than Elixir (since the compiler is written in Rust), and I’ve heard José say that one of his concerns with the new type system is the time it may take to compile.
Also, to be clear, I realize this is only one point of comparison between Elixir and Gleam, and that people will prefer different styles of development, etc. Even though Gleam’s type system is (/was?) perhaps the primary differentiator from Elixir, it clearly also differentiates itself in other ways: it compiles to JavaScript, its focus on simplicity, its Elm-like web framework, and in many other ways I’m sure…