First of all, I apologize for the delay in responding to your comment; I was buried under a big project and didn’t make it back here.
Thank you so much for taking the time to share your thoughts! I completely understand your perspective, and honestly, I questioned myself many times about whether this library was really necessary or if I was just “testing OTP” unnecessarily.
However, after using let_it_crash actively in our production codebase, I found it incredibly valuable for a real-world problem we were facing. Let me share our experience:
Our Use Case
We had a scoring system with this supervision tree:
ScoreSupervisor
├── DynamicSupervisor (manages calculation workers)
└── ScoreCoordinator (coordinates the flow)
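For concreteness, here is a simplified sketch of roughly what that supervisor setup looked like (module names are taken from the tree above; `CalculationWorkerSupervisor` and the exact child specs are illustrative, not our real code):

```elixir
defmodule ScoreSupervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      # Calculation workers are started on demand under this DynamicSupervisor
      {DynamicSupervisor, name: CalculationWorkerSupervisor, strategy: :one_for_one},
      # The coordinator drives the scoring flow across those workers
      ScoreCoordinator
    ]

    # :one_for_all was our original strategy choice (it turns out to matter below)
    Supervisor.init(children, strategy: :one_for_all)
  end
end
```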
The Problem: Our system would occasionally get stuck during the normalization phase in production, but we couldn’t figure out why.
What let_it_crash Revealed
When we wrote tests simulating crashes during active processing:
test "crash during active processing" do
  ScoreCoordinator.start_score_calculations()    # Start with active workers
  LetItCrash.crash!(ScoreCoordinator)
  assert LetItCrash.recovered?(ScoreCoordinator) # ❌ FAILED!
end
We discovered a real bug: our supervisor used the :one_for_all strategy, which:
- Killed active workers unnecessarily when coordinator crashed
- Created orphaned workers trying to communicate with dead processes
- Led to failed recovery and the system getting stuck
What We Were Actually Testing
You’re absolutely right that we shouldn’t test if OTP works (we know it does!). But we were testing:
1. **Our Configuration Choices**
2. **Our Application Logic** (see the sketch after this list)
   - Does our cleanup code in terminate/2 properly handle active workers?
   - Is our shutting_down? flag preventing new work correctly?
   - Are monitor references being cleaned up?
3. **Interactions Between Components**
   - What happens to workers when the coordinator crashes mid-calculation?
   - Does state remain consistent after recovery?
   - Can the system continue processing after recovery?
4. **Edge Cases Specific to Our Domain**
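To make point 2 a bit more concrete, here is roughly the shape of the coordinator logic those questions are probing. This is a simplified sketch: terminate/2, the shutting_down? flag, and the monitor cleanup come from the points above, but the state layout and the `CalculationWorkerSupervisor` name are assumptions carried over from the earlier sketch.

```elixir
defmodule ScoreCoordinator do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Trap exits so terminate/2 actually runs when the supervisor shuts us down
    Process.flag(:trap_exit, true)
    {:ok, %{active_workers: [], monitors: %{}, shutting_down?: false}}
  end

  # Refuse new work once shutdown has begun
  @impl true
  def handle_call({:calculate, _job}, _from, %{shutting_down?: true} = state) do
    {:reply, {:error, :shutting_down}, state}
  end

  # ... the normal calculation path is elided ...

  @impl true
  def terminate(_reason, state) do
    # Drop monitor refs so nothing stale survives a restart
    Enum.each(state.monitors, fn {_pid, ref} -> Process.demonitor(ref, [:flush]) end)

    # Ask still-active workers to stop instead of leaving them orphaned
    Enum.each(state.active_workers, fn pid ->
      DynamicSupervisor.terminate_child(CalculationWorkerSupervisor, pid)
    end)

    :ok
  end
end
```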
Conclusion
I initially had the same question you expressed. But when our production system had a mysterious bug that only happened under specific failure conditions, let_it_crash gave us a systematic way to:
- Reproduce the issue in tests
- Understand the root cause
- Validate the fix
- Ensure it doesn’t regress (see the test sketch below)
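To illustrate the last point: the very scenario that originally failed now stays in our suite as a regression test, roughly like this (the extra worker assertion and the `CalculationWorkerSupervisor` name are simplified assumptions, as in the sketches above):

```elixir
test "crash during active processing recovers cleanly (regression)" do
  ScoreCoordinator.start_score_calculations()

  LetItCrash.crash!(ScoreCoordinator)

  # This is the assertion that originally failed under :one_for_all
  assert LetItCrash.recovered?(ScoreCoordinator)

  # In-flight workers should still be supervised rather than orphaned
  # (hypothetical name and check, kept simple for the example)
  assert DynamicSupervisor.which_children(CalculationWorkerSupervisor) != []
end
```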
I don’t know if I managed to express exactly what this journey was like; let me know if I left you with more questions than answers, haha.
Thanks again for the discussion 