Background
Recently I have read an article that makes a few interesting claims on the nature of RCAs. I would like the opinion of the community in this area in regards to Elixir and the BEAM VM.
Please note this is an open discussion. I encourage you to share your opinion but if you don’t agree with someone else please keep in mind there is no “right/wrong” in here.
Quick Summary
The article starts with a lovely quote:
“Imagine an iron bar thrust into an electric furnace. The bar lengthens, and the “cause” of the lengthening is said to be the heat of the furnace. One is astonished—why should it not be the introduction of the bar into the furnace? Or the existence of the bar? Or the fact that the bar had been previously kept at a lower temperature? None of these possibilities can be termed secondary causes; they are all primary determining causes without which the lengthening phenomenon could not have occurred.”
It then uses this quote to make a point, which I interpret as “complex systems usually fail by death of 1000 cuts”.
Meaning, it is unlikely to have a complex system fail due to a single critical failure, instead such systems are more likely to fail because many small failures happened in conjunction to create a “perfect storm”.
The last point of the article then goes on to claim that RCAs as they are known today serve mostly a social purpose, or even an agenda.
NOTE: this is a over simplification of the article. I truly recommend you (dear reader) give it a full read, I personally think it is worth it.
My interpretation
This is the first time I have heard such a thing. But after some consideration, I do agree with some points of the article. My personal experience tells me that when someone has to write an RCA, heads are about to roll.
While I do like this abstract opinion, I am still not sure about the claims made for complex systems.
During my life I have worked with several Elixir apps. Reading books like Erlang in Anger gave me a very good perspective on how complex system can fail.
I am used to seeing a complex system failing because of a single issue. For example, Atoms being generated dynamically, or a group of GenServers not being garbage collected for long periods of time (causing an eventual crash).
What do you think?
Because this article is open to personal interpretation, there is a good chance I am missing a greater point here. Maybe my experience with Elixir systems is not as broad as I would think, maybe you have a different take (please do share !) on how complex systems work.
Overall, if you have stories of “Elixir in Anger”, please do share. Did you do an RCA after? Was it a “perfect storm”?