My Review
A Review Of:
“Systems that run forever self-heal and scale” by Joe Armstrong (Lambda Jam 2013)
I am a fan of Joe Armstrong. I wish he were still here to share his wisdom with us (he passed away in 2019). Luckily, he left us a great book and several great talks that help explain how he built such excellent, industrial-strength software.
It’s not an exaggeration to say that Erlang and the BEAM (the purpose-built VM that Erlang runs on) have been used to create real-world systems that have run for literally years without downtime. Read that again. YEARS. Anyone who’s ever gotten the “holy crud, the system’s down” call at 2 am will appreciate the wonder of having systems that run for years without tinkering and without failing.
So how did Erlang achieve this miracle? Did they write code that they burned into an EPROM and all it does is print “Hello world” trillions of times in an infinite loop? Nope. Erlang was created to run phone switches. If Erlang fails, you can’t make phone calls. A commonly repeated claim is that Erlang runs on more than 50% of the phone equipment in the entire world.
So how did they achieve this seeming software development miracle? Well, that’s what Joe Armstrong discusses in this talk. The thing that really appeals to me (and why I was and am a fan of his work) is that he didn’t come at the problem from some theoretical, abstract CS angle. He (and Robert Virding and Mike Williams, the co-creators of Erlang) had a practical problem to solve, and they created Erlang in order to solve it. They didn’t set out to build some world-beating technology; they set out to solve a difficult engineering problem and just happened to build one of the best-engineered solutions I’ve ever seen.
So what are some of the use cases that would lead us to use distributed systems? There are a couple that come to mind without thinking long and hard:
- LLMs: Training LLMs is hardware intensive. If I can deploy 100 machines to train an LLM, then I can have the trained model that much faster. If I can deploy 1000 machines, that’s up to 10 times faster again, at least in the ideal case where the work parallelizes cleanly.
- Uptime: As Armstrong points out, there’s never been a time when Google was taken down for a software upgrade. Since 1998 there’s never been a time when I went to look for something on Google and saw the message “Google is down for maintenance; come back later.” Distributed computing allows you to keep things up and running because you don’t have to take the whole system offline in order to upgrade machines.
In sum, this talk isn’t some simple step-by-step “How to build your own distributed system” talk, but it is a great introduction to the subject: why it’s tough and what we can do to make it easier. Plus, Joe was a great speaker who was very good at explaining a pretty tough concept.
Author’s note: This review was published in my January “Impractical Engineer” newsletter. If you’d like to see the whole newsletter, it’s here: The Impractical Engineer. I hope I can be forgiven for plugging my newsletter, but this does seem like a talk that Elixir folks might be interested in seeing.






















