Elixir in Anger and RCAs (root cause analyses) - How do you deal with them?

Background

Recently I read an article that makes a few interesting claims about the nature of RCAs. I would like the community’s opinion in this area with regard to Elixir and the BEAM VM.

Please note this is an open discussion. I encourage you to share your opinion, but if you don’t agree with someone else, please keep in mind there is no “right” or “wrong” here.

Quick Summary

The article starts with a lovely quote:

“Imagine an iron bar thrust into an electric furnace. The bar lengthens, and the “cause” of the lengthening is said to be the heat of the furnace. One is astonished—why should it not be the introduction of the bar into the furnace? Or the existence of the bar? Or the fact that the bar had been previously kept at a lower temperature? None of these possibilities can be termed secondary causes; they are all primary determining causes without which the lengthening phenomenon could not have occurred.”

It then uses this quote to make a point, which I interpret as “complex systems usually fail by a death of a thousand cuts”.

Meaning, it is unlikely for a complex system to fail due to a single critical failure; instead, such systems are more likely to fail because many small failures happen in conjunction and create a “perfect storm”.

The last point of the article then goes on to claim that RCAs as they are known today serve mostly a social purpose, or even an agenda.

NOTE: this is an oversimplification of the article. I truly recommend you (dear reader) give it a full read; I personally think it is worth it.

My interpretation

This is the first time I have heard such a thing. But after some consideration, I do agree with some points of the article. My personal experience tells me that when someone has to write an RCA, heads are about to roll.

While I do like this abstract opinion, I am still not sure about the claims made for complex systems.
During my life I have worked with several Elixir apps. Reading books like Erlang in Anger gave me a very good perspective on how complex systems can fail.

I am used to seeing a complex system fail because of a single issue. For example, atoms being generated dynamically (eventually exhausting the atom table), or a group of GenServers not being garbage collected for long periods of time (causing an eventual crash).
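To illustrate the first of those, here is a minimal sketch of what I mean (the EventHandler module, its function names, and the payload shape are made up for illustration). Atoms are never garbage collected on the BEAM, so converting untrusted input into atoms will eventually exhaust the atom table:

    defmodule EventHandler do
      # Dangerous: String.to_atom/1 creates a new atom for every unseen
      # string, and atoms are never garbage collected, so a steady stream
      # of unique event types will eventually exhaust the atom table and
      # bring down the whole VM.
      def handle_event(%{"type" => type} = event) do
        key = String.to_atom(type)
        {key, event}
      end

      # Safer: only accept atoms that already exist, raising otherwise.
      def handle_event_safe(%{"type" => type} = event) do
        key = String.to_existing_atom(type)
        {key, event}
      end
    end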

What do you think?

Because this article is open to personal interpretation, there is a good chance I am missing a greater point here. Maybe my experience with Elixir systems is not as broad as I think; maybe you have a different take (please do share!) on how complex systems work.

Overall, if you have stories of “Elixir in Anger”, please do share. Did you do an RCA after? Was it a “perfect storm”?

7 Likes

There rarely is only one root cause indeed. But to try and tame the chaos, we can at least assign weights to the several different reasons that we found (or merely suspect).

From then on it becomes a pretty simple game of “always kill the biggest contributor to problems first, then find the next one which is now the biggest, then kill that one too until we’re done”.

Are those assessments of weight distributions correct? Hell no, they are usually grossly wrong, but I have found that managers / CEOs / CTOs like it when you at least take the initiative to do some analysis and attempt to assign risk percentages to the several culprits you’d like to kill off.

The same way big systems die of a thousand paper cuts, they are saved by stitching up one cut at a time. The good news is that fixing one “cut” (i.e. a subtle bug) can improve much more than one place in the code.

Program bugs are like dogs: they like to lie around in the company of other dogs. Don’t allow them. Eliminating subtle failure modes one by one can quadratically reduce the number of visible bugs, since most bugs really love to combine themselves with other bugs.

I found that being meticulous and just patiently untangling the fishing net that represents all the problems in a project really does pay off.

The bad news is, commercial teams rarely have the patience for this. All other things being equal, somebody younger inevitably comes along and pitches the genius idea of rewriting the whole thing! And people being people, they will always prefer the debt that’s due in the future over the debt that’s due now. Because yeah, they’ll make a mess out of the rewrite as well.

(And don’t get me wrong, I more often advocate for rewrites than for repairs but I do recognize when a rewrite has a bigger cost than a repair. Many people can’t and don’t.)


To materialize part of that abstract philosophy when it comes to Elixir, my current work team taught me something valuable: make your code assert-like, e.g.

:ok = ExternalApi.do_stuff()

That way, if it does not return :ok then your error capturing system (Rollbar, AppSignal, HoneyBadger etc.) will record the exact return value and next time around you’ll know how to gracefully handle the [apparently expected] error.
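For example (the {:error, :timeout} value below is invented), a failed match raises a MatchError that carries the exact offending term, which is what ends up in the report:

    :ok = ExternalApi.do_stuff()
    # ** (MatchError) no match of right hand side value: {:error, :timeout}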

These assertive statements serve as invariants that clearly signal “the program is not expected to work if this statement does not return what we expect here”.

And that can and will give you the very first low-hanging fruit to pick when it comes to bugs or logical errors in your code.

I mean yeah, we should all take a page from Ada and just have pre-conditions, post-conditions and invariants, but failing that, we can at least emulate them closely enough.
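For instance, pattern matches and guards can roughly stand in for those contracts. A minimal sketch (the Account module and its fields are invented for illustration):

    defmodule Account do
      defstruct balance: 0

      # "Pre-condition": the guard rejects withdrawals that would overdraw;
      # a call that violates it crashes with a FunctionClauseError.
      def withdraw(%__MODULE__{balance: balance} = account, amount)
          when is_integer(amount) and amount > 0 and amount <= balance do
        new_account = %{account | balance: balance - amount}

        # "Post-condition" / invariant: crash loudly if the balance
        # somehow ends up negative.
        true = new_account.balance >= 0
        new_account
      end
    end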

Or do what Rust does: prefer the slow and painful process of gradually eliminating compiler errors, but when you are done, a huge chunk of all possible bugs is mathematically proven to be impossible in your program (barring a freak hardware event, e.g. a bit flip in non-ECC RAM).

12 Likes

This is more about “Systems Thinking” than it is about Elixir. That being said, those with solid BEAM experience tend to have solid experience with complex systems. The message passing architecture allows for non-determinism which (as much as I hate the analogy) brings to mind biological systems and their passing of molecules between “actors”.

Compare/contrast with a doctor trying to treat a patient. The treatment has varying levels of abstraction. A treatment might make the patient comfortable (treating symptoms) as they expect the body to heal itself given some time. It might go deeper and recognize some vitamin deficiency and recommend supplements. It might go even deeper and recommend a change to diet and exercise. Or even deeper to societal concerns in which a diet that used to be fine no longer is because the available food has different nutrition than it used to. Even deeper down the system is a consideration of the economic motivations of food manufacturers.

So, is the “root cause” the vitamin deficiency, or a lack of some law limiting/requiring some food manufacture technique? It’s complicated and you can choose to address the problem at multiple levels.

8 Likes

Wow, these opinions are so great; this is truly the kind of insight I think has incredible value for everyone reading.

@dimitarvp Over the years, have you found a way to “guess” if a re-write is going to be more costly than a repair? I loved your analogy with “killing the bigger fish first”. Isn’t a re-write of a complex system always going to be more costly than a partial fix? Do let me know!

@gregvaughn Thank you for joining the conversation!
Yes, the original post is more about systems in general, but I believe that the BEAM is a pretty complex system and that specific battle stories with it are sure to be valuable. I liked your analogy with the doctor. Have you, in your long experience, ever faced a situation like that (with a system) that you could share? Have you, for example, tried to fix an issue only to realize that there was something at a deeper level that could be addressed? As engineers, we often have to make decisions about what to fix and when. How do you decide?

2 Likes

+1 for a strong type system! Opaque types certainly help!

This is a bit tangential, but I’ve avoided doing anything on the frontend (read: JS libraries) because it is messy. Even a small project is “hairy”. With TypeScript, there is some relief. With opaque types in Elm, it was a pleasure working on the frontend.
So mathematical guarantees are awesome.
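For what it’s worth, Elixir can approximate some of that with Dialyzer and @opaque typespecs. A minimal sketch (the Money module is made up for illustration):

    defmodule Money do
      defstruct amount: 0, currency: :eur

      # Dialyzer treats @opaque types as abstract outside this module, so
      # callers that reach into the struct internals get flagged.
      @opaque t :: %__MODULE__{amount: non_neg_integer(), currency: atom()}

      @spec new(non_neg_integer(), atom()) :: t()
      def new(amount, currency)
          when is_integer(amount) and amount >= 0 and is_atom(currency) do
        %__MODULE__{amount: amount, currency: currency}
      end

      @spec amount(t()) :: non_neg_integer()
      def amount(%__MODULE__{amount: amount}), do: amount
    end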

5 Likes

Right on. Customer-impacting production problems are rarely caused by one disaster; more often, several minor problems, each not so bad in isolation, somehow happen at the same time.

4 Likes

This will be a bit of a spread out comment but I feel the separate points are connected.

There really is no clear-cut way but I can recommend one thing that almost always netted me an accurate assessment:

The amount of institutional and business knowledge that’s encoded only inside the code and nowhere else.

If your “legacy” system has that all over then no, please don’t rewrite it. You’ll make an even bigger mess and you’ll reintroduce bugs that have been long fixed in the old system even if the fixes are obscure and hard to find in the code (because again, critical business knowledge is encoded there).

Related to that, here’s something that most working programmers ABSOLUTELY HATE hearing:

The programming language makes very little difference.

We the programmers are petty like that and start dividing ourselves into almost-religious camps over BS syntactic preferences. That’s a distraction sitting in front of what’s important: what does that language / runtime / std library give you that is an edge over the competition? In Erlang/Elixir’s case that’s the BEAM VM due to its guarantees. In Haskell / Rust / OCaml it’s super strict typing that mathematically proves certain bugs are not going to happen in your program after it successfully compiles. In Ada it’s that the preconditions / postconditions / invariants of your class will never be invalid, so you can code in peace knowing that e.g. an account balance will never be a negative number or a person’s name will never start with a number, etc.


RE: rewrites, I like rewriting stuff but let’s be clear, that’s a tinkerer’s point of view and priority; they don’t always align with business priorities.

But there’s something else. In the “Refactoring” book by Martin Fowler he makes a very important statement:

“Refactoring code you are new to helps you understand it.”

However, that’s refactoring, NOT rewriting.

In conclusion, I found that if I just start gradually refactoring an old code base to make it easier to iterate on it later, I very often find invaluable insights in the old code and a complete rewrite becomes unnecessary.

8 Likes

I would add that making sure you have the right fix for a bug is even more important than fixing the right bug.

I just encountered a situation where a bug in some data processing logic caused some data to be missing from a particular API response where the FE was expecting it. Some dev had the idea that they knew another place where they could get that same data. Since they were assigned the bug (after all, the “problem” was in a bit of UX that was malfunctioning), they went ahead and added some fallback logic to look at data B if data A was missing. But they were not correct that those data sources were interchangeable, so their fix actually introduced another, even more subtle bug. This apparently happened a few times before the real source of the bug in the BE data processing logic was identified and fixed. But by then, a significant amount of tech debt had been added to an already rough bit of React code (why do there seem to be so many React devs who think unit testing components is optional?).

Bug reports should always be reviewed by more than one person and go through discovery and similar phases before a fix is applied.

4 Likes

Well put! I’d like to add that while you may be able to write something that perfectly fits your present needs, chances are that the legacy code did too at some point. Code doesn’t become “legacy” on its own, and if the conditions that turned it into a ball of mud are still there, then a rewrite won’t stay perfect for long.

In a sense this ties in with where the thread started: before diving into anything as drastic as a rewrite, it’s often a good idea to explore the reasons the codebase went off the rails. It’ll help you regardless of what you decide on in the end.

5 Likes

Since this thread is now mentioning legacy code, I must ask another question: what is legacy code?

Code being old does not mean it is legacy, as long as it can be maintained and evolved without issues. So what do you guys consider legacy code?

And how often does legacy code lead to subtle problems that then need RCAs to be explained?

PS: My opinions on legacy code are influenced by this book (which I have not finished yet):

1 Like

IMO legacy code is hand-me-down. The original authors are gone; the current maintainers are reluctant to make extensive refactoring. It has nothing to do with the technology.

3 Likes

Legacy code is code you are afraid to change. It has ossified and is no longer _soft_ware. It could be 20 years old in an outdated style and versions of tooling, or it could be 2 seconds old because you don’t really understand why your new code is working.

Tests and docs are the best prevention of legacy status because they give maintainers some confidence that their changes do not have unintended consequences.

5 Likes

I think this is spot on; it doesn’t matter much why you’re afraid to change it, just that you are.

Ironically some of the most ossified systems I’ve worked with had decent documentation and the people who built them were still around, but they were overworked and stressed out of their minds. The last thing they wanted was for someone to rock the boat and get them paged in at four in the morning, even though the tests made the risk of that rather small.

In my experience it’s pretty common as legacy systems hardly adapt to the inevitable changes in the world around them. They also have an annoying tendency to warp their environment because it’s easier to change that, which rarely makes for stable solutions.

2 Likes

I don’t want to lead away from the main discussion, but the first thought I had when I read the context of the question was the video of Richard Feynman answering the question of why magnets attract and repel each other. He quickly goes deeper into how difficult a “why” question is to answer.

8 Likes

This is mind-blowing!
Loved it!
I mean, the first few minutes alone helped me understand my life. I have a toddler and he always asks why. It makes sense - we do not share a common framework!

This also builds perfectly on my knowledge of systems. Explaining to managers why something happens can be a hard task on the easiest of days.

I think this was a perfect addition to the conversation!

An experience I share as well.
But this raises yet another question:

  • if legacy systems force the environment around them to adapt to them (thus making stable solutions less likely)
  • if legacy systems have business logic encoded only in their code

Should we rewrite them?

@dimitarvp argues we should possibly avoid such a rewrite if a legacy system has business logic inside
@jhogberg argues that legacy systems make stable long-term solutions to business problems unlikely

It feels to me that if you have one of those (and in my experience you always do), then there is no right way to go about it.

2 Likes

Almost. I am defending the position that “if your ONLY business logic documentation is the code itself, then don’t rewrite it – YET. You could rewrite it later if you manage to contain and isolate good business logic documentation”.

2 Likes

Delightful way to express the way a toddler assimilates information.

3 Likes

It feels to me that if you have one of those (and in my experience you always do), then there is no right way to go about it.

I think there is; believing there’s no good way out is one of the ways legacy systems stay the way they are. I’ve yet to encounter something that couldn’t be righted once we fully comprehended it, either through refactoring or by understanding the problem deeply enough for a rewrite.

Mind you, if it were easy you probably wouldn’t have the problem to begin with, but sometimes you just have to grab your garden gloves and start digging.

1 Like

Yes, there’s rarely one root cause. Ultimately the exercise is to determine how it could be prevented in the future and, if it is worth it, to take the steps required.

I always approach RCAs with the idea that everything will happen again and no incident is such a unique snowflake that it won’t. The RCA uncovers likely repeat causes. You can either take steps to mitigate or not complain when it happens the next time.

Resolve a few of these and you’ll end up with a very reliable system.

1 Like

@Fl4m3Ph03n1x Thank you! I am glad to hear that you enjoyed it, and that it was well received. Yes, toddlers are a good example of the framework dilemma. I have often encountered it at work or, in general, when meeting people from entirely different cultures. Hope you guys are having fun expanding (your) frameworks ;).

2 Likes