LetItCrash - A testing library for crash recovery and OTP supervision behavior

Hello,
How is everyone?
I hope you’re well =)

I worked on a project with some legacy code that existed before Oban’s arrival. This legacy code functioned as a “job” that was created alongside dynamic supervisors. Understanding the outcome of a supervisor failure and recovery has always been a challenge.

So I thought about creating a set of tools to help me write tests that would simulate these scenarios.

It’s a narrow path between testing the Supervisor itself (which isn’t the goal) and testing how my code behaves after the Supervisor has done its job.

And from this thought, this lib was born. I still have many ideas for what to add to it. There is a world of scenarios to cover (Workers, DB connections, State), but version 0.1.0 captures the general idea I want to facilitate.
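
To give a feel for the idea, here is a minimal sketch of the kind of test the library is meant to enable. MyWorker and its do_work/1 function are hypothetical stand-ins for something supervised in your application tree; crash!/1 and recovered?/1 are the helpers that come up again later in this thread:

    defmodule MyWorkerCrashTest do
      use ExUnit.Case, async: false

      # MyWorker is a hypothetical GenServer assumed to be supervised in the
      # application's supervision tree and registered under its module name.
      test "worker recovers and keeps serving requests after an unexpected crash" do
        # Simulate an unexpected failure of the worker process.
        LetItCrash.crash!(MyWorker)

        # Assert on behavior, not on OTP itself: the worker came back
        # and can still do useful work.
        assert LetItCrash.recovered?(MyWorker)
        assert {:ok, _result} = MyWorker.do_work(:example)
      end
    end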

I brought it to the forum to see if the idea makes sense =)
I hope it can help more people who are also facing this challenge.

Hex: let_it_crash | Hex
GitHub:

6 Likes

At first glance, it looks like this library helps test the lifecycles that supervision trees define. I don’t see a use case, at least in my day-to-day work. It feels to me like testing whether the new keyword in C# or Java creates a new instance of a class.

First of all, I apologize for the delay in responding to your comment; I was buried by a big project and didn’t come back here.

Thank you so much for taking the time to share your thoughts! I completely understand your perspective, and honestly, I questioned myself many times about whether this library was really necessary or if I was just “testing OTP” unnecessarily.

However, after using let_it_crash actively in our production codebase, I found it incredibly valuable for a real-world problem we were facing. Let me share our experience:

Our Use Case

We had a scoring system with this supervision tree:


    ScoreSupervisor
    ├── DynamicSupervisor (manages calculation workers)
    └── ScoreCoordinator (coordinates the flow)

The Problem: Our system would occasionally get stuck during the normalization phase in production, but we couldn’t figure out why.
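
For reference, here is a simplified sketch of what the supervisor looked like at the time. The module layout is condensed and the ScoreWorkerSupervisor name is illustrative, not the exact production code:

    defmodule ScoreSupervisor do
      use Supervisor

      def start_link(opts) do
        Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
      end

      @impl true
      def init(_opts) do
        children = [
          # Calculation workers are started on demand under this DynamicSupervisor.
          {DynamicSupervisor, name: ScoreWorkerSupervisor, strategy: :one_for_one},
          # The coordinator drives the calculation flow.
          ScoreCoordinator
        ]

        # :one_for_all was the original choice; it turned out to be
        # the source of the bug described below.
        Supervisor.init(children, strategy: :one_for_all)
      end
    end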

What let_it_crash Revealed

When we wrote tests simulating crashes during active processing:


    test "crash during active processing" do
      ScoreCoordinator.start_score_calculations()  # Start with active workers

      LetItCrash.crash!(ScoreCoordinator)

      assert LetItCrash.recovered?(ScoreCoordinator)  # ❌ FAILED!
    end

We discovered a real bug: our supervisor used the :one_for_all strategy, which:

  1. Killed active workers unnecessarily when coordinator crashed
  2. Created orphaned workers trying to communicate with dead processes
  3. Led to failed recovery and the system getting stuck (see the test sketch after this list)
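
To make that concrete, here is a hedged sketch of the kind of assertion that exposed it, assuming the workers run under the DynamicSupervisor from the sketch above (ScoreWorkerSupervisor remains an illustrative name):

    test "active workers survive a coordinator crash" do
      ScoreCoordinator.start_score_calculations()

      # Snapshot the worker pids before the crash.
      workers_before =
        for {_, pid, _, _} <- DynamicSupervisor.which_children(ScoreWorkerSupervisor), do: pid

      LetItCrash.crash!(ScoreCoordinator)
      assert LetItCrash.recovered?(ScoreCoordinator)

      # Under :one_for_all this fails: the supervisor also killed and restarted
      # the workers, so the original pids are gone.
      assert Enum.all?(workers_before, &Process.alive?/1)
    end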

What We Were Actually Testing

You’re absolutely right that we shouldn’t test if OTP works (we know it does!). But we were testing:

1. Our Configuration Choices

  • Is :one_for_all or :rest_for_one better for our specific architecture?

  • Does our coordinator’s trap_exit work correctly with shutdown signals?

2. Our Application Logic (see the coordinator sketch after this list)

  • Does our cleanup code in terminate/2 properly handle active workers?

  • Is our shutting_down? flag preventing new work correctly?

  • Are monitor references being cleaned up?

3. Interactions Between Components

  • What happens to workers when coordinator crashes mid-calculation?

  • Does state remain consistent after recovery?

  • Can the system continue processing after recovery?

4. Edge Cases Specific to Our Domain

  • Crash with zero entities

  • Crash with exactly @max_concurrent_workers entities

  • Crash during normalization phase

  • Multiple rapid crashes
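
To make groups 1 and 2 above concrete, here is a hedged skeleton of the coordinator pieces under test. The function names, the shutting_down? field, the monitor bookkeeping, and the ScoreWorker module are simplified illustrations, not the production code:

    defmodule ScoreCoordinator do
      use GenServer

      def start_link(opts) do
        GenServer.start_link(__MODULE__, opts, name: __MODULE__)
      end

      @impl true
      def init(_opts) do
        # Trap exits so terminate/2 runs on :shutdown and we get a chance to clean up.
        Process.flag(:trap_exit, true)
        {:ok, %{monitors: %{}, shutting_down?: false}}
      end

      @impl true
      def handle_call(:start_work, _from, %{shutting_down?: true} = state) do
        # The shutting_down? flag must refuse new work during teardown.
        {:reply, {:error, :shutting_down}, state}
      end

      def handle_call(:start_work, _from, state) do
        {:ok, pid} = DynamicSupervisor.start_child(ScoreWorkerSupervisor, ScoreWorker)
        ref = Process.monitor(pid)
        {:reply, {:ok, pid}, put_in(state.monitors[ref], pid)}
      end

      @impl true
      def handle_info({:DOWN, ref, :process, _pid, _reason}, state) do
        # Monitor references must be cleaned up when workers exit.
        {:noreply, update_in(state.monitors, &Map.delete(&1, ref))}
      end

      @impl true
      def terminate(_reason, state) do
        # Cleanup on shutdown: demonitor and stop any still-active workers.
        Enum.each(state.monitors, fn {ref, pid} ->
          Process.demonitor(ref, [:flush])
          DynamicSupervisor.terminate_child(ScoreWorkerSupervisor, pid)
        end)

        :ok
      end
    end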

Conclusion

I initially had the same question you expressed. But when our production system had a mysterious bug that only happened under specific failure conditions, let_it_crash gave us a systematic way to:

  1. Reproduce the issue in tests

  2. Understand the root cause

  3. Validate the fix

  4. Ensure it doesn’t regress

I don’t know if I managed to express exactly what this journey was like; let me know if I left more questions than answers haha

Thanks again for the discussion :purple_heart:

3 Likes

I think that the crash! naming may be very confusing …

Unlike crash/1, this function uses :kill instead of :shutdown, which ensures the process is terminated even if it has Process.flag(:trap_exit, true). This is useful for testing processes that trap exits, such as GenServers that need to perform cleanup on normal exits.

I would change it to something like:

    @spec crash(type :: :shutdown | :kill, pid())
    def crash(type \\ :shutdown, pid)
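
For readers less familiar with the distinction being discussed, here is a minimal sketch, independent of the library, of why :kill behaves differently from :shutdown for a process that traps exits:

    # A process that traps exits and logs the first signal it receives as a message.
    pid =
      spawn(fn ->
        Process.flag(:trap_exit, true)

        receive do
          {:EXIT, from, reason} -> IO.inspect({:trapped, from, reason})
        end

        # Stay alive afterwards so the difference with :kill is observable.
        Process.sleep(:infinity)
      end)

    Process.exit(pid, :shutdown)
    Process.sleep(50)
    Process.alive?(pid)  # => true, the :shutdown signal was trapped as a message

    Process.exit(pid, :kill)
    Process.sleep(50)
    Process.alive?(pid)  # => false, :kill cannot be trapped
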
1 Like

I was thinking about doing it the way you suggested before going down the ! route, but I thought it would be “less verbose” to just add the exclamation point as a reference to a “critical operation” haha.

Thinking twice, this might cause confusion about the actual meaning. I have other changes coming soon, and I’ll add this one as well.

Thank you so much for the feedback :purple_heart:

1 Like

While there is usually no rule that is always right, and there are almost always edge cases, we as a community still try to keep conventions clear and simple for everyone, so that in specific cases we know what to expect. It’s not so much a rule about how to live as a standard that improves quality of life.

Therefore it would be amazing, especially for new developers, if you could keep the code compatible with our naming conventions. Here is a direct link to the relevant section on the trailing bang character in function names: Trailing bang (foo!) | Naming conventions @ Elixir hex documentation

2 Likes