I have a production app that uses Commanded. The event handlers are started under a supervisor with the default restart values (max restarts 3, max seconds 5).
The problem is that if one of the Commanded event handlers restarts more than 3 times in 5 seconds, it takes the entire app down with it: the supervisor kills all the other event handlers, and the failure then propagates all the way up to the web server.
What are my options to solve this issue?
This is not specific to Commanded, but using the supervision tree to handle expected failures is not a great idea. The restart behaviour of supervisors is meant to keep a system up and running (as in available), and when it fails to do so, to stop trying so that someone higher up can take a look. Eventually that’s someone on call. That’s what you’re seeing. You could allow more failures before giving up, but retrying more than a few times without any backoff is generally considered wasteful.
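To illustrate the "allow more failures" option, here is a minimal sketch of a supervisor with a higher restart intensity. The module name HandlerSupervisor and the values 10 and 60 are assumptions for illustration, not anything from the original app. The child list would be your event handlers.

```elixir
defmodule HandlerSupervisor do
  # Hypothetical supervisor for event handlers; name and limits are illustrative.
  use Supervisor

  def start_link(children) do
    Supervisor.start_link(__MODULE__, children, name: __MODULE__)
  end

  @impl true
  def init(children) do
    # Tolerate up to 10 restarts within 60 seconds (the defaults are 3 and 5)
    # before this supervisor gives up and exits itself.
    Supervisor.init(children,
      strategy: :one_for_one,
      max_restarts: 10,
      max_seconds: 60
    )
  end
end
```

This only raises the threshold. Once it is exceeded, the supervisor still exits and the failure escalates to its parent, and there is still no backoff between restarts.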
Generally, you should figure out what you do want to happen instead of letting the app fail, and what tradeoffs are involved. With that you can start looking into how best to implement it, especially given that Commanded is likely the tool queueing up the events for processing. I’m not sure what options it provides for dealing with events that keep failing over time.
I’d also suggest considering which classes of errors might cause your event handlers to fail. E.g. a bug is unlikely to resolve without a software update, whereas a network issue might resolve over time. You might even want to handle each type of error differently.
You should implement the error/3 callback function in your event handlers to handle problematic events, so that the event handler does not restart on error.
The example error handling in the docs shows a simple strategy to retry X times and then log and skip an event that cannot be handled.
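A rough sketch of that retry-then-skip strategy, written as a plain module so it can be run without Commanded. RetryPolicy, decide/3 and the limit of 3 are hypothetical names and values. The return values :skip and {:retry, context} are the ones Commanded's error/3 callback accepts, and the comment shows how a handler might delegate to it.

```elixir
defmodule RetryPolicy do
  # Hypothetical helper: decides what a handler's error/3 should return.
  require Logger

  @max_failures 3

  # Intended to be called from an event handler, for example:
  #
  #   def error(error, event, %Commanded.Event.FailureContext{context: context}) do
  #     RetryPolicy.decide(error, event, context)
  #   end
  def decide(error, event, context) do
    # The context map is handed back on the next failure of the same event,
    # so it can carry a running failure count.
    context = Map.update(context, :failures, 1, &(&1 + 1))

    if context.failures >= @max_failures do
      Logger.warning(
        "Skipping event after #{context.failures} failures: " <>
          "#{inspect(event)} (#{inspect(error)})"
      )

      :skip
    else
      {:retry, context}
    end
  end
end
```

If skipping is too risky for your projections, the callback can return {:stop, reason} instead of :skip to stop the handler, or {:retry, delay, context} to wait a number of milliseconds before retrying.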
We ended up solving this by setting up a supervisor that allows our event handlers to crash, without restarting them. At runtime, we can query the supervisor about which event handlers / projectors are stopped and not restarted.
Our operations team is notified of this, so that they can investigate the situation.
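Something along these lines could look like the sketch below. It is an approximation rather than our exact code: HandlerWatch and stopped/1 are made-up names, and it assumes each handler module defines child_spec/1. Because a supervisor forgets a :temporary child once it exits, the stopped handlers are found by comparing the expected list against the children that are still running.

```elixir
defmodule HandlerWatch do
  # Hypothetical wrapper around a supervisor that never restarts its children.

  # `handlers` is a list of modules that define child_spec/1.
  def start_link(handlers) do
    children =
      Enum.map(handlers, fn mod ->
        # Use the module as the child id and never restart it after a crash.
        Supervisor.child_spec(mod, id: mod, restart: :temporary)
      end)

    Supervisor.start_link(children, strategy: :one_for_one, name: __MODULE__)
  end

  # Returns the handlers from `handlers` that are no longer running.
  def stopped(handlers) do
    running =
      for {id, pid, _type, _modules} <- Supervisor.which_children(__MODULE__),
          is_pid(pid),
          do: id

    handlers -- running
  end
end
```

A periodic check can call HandlerWatch.stopped/1 with the full handler list and alert when the result is not empty.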
The reason we did not go with skipping events is that we’re wary of continuing to handle events when there is one we cannot handle.
The only time we have needed this, it worked like a charm. An event had been introduced by accident that a projector did not handle. The projector got stuck and died. The rest of the system kept running, but operations were aware of the issue. We could then handle the situation (in this case by blanking out the event) and restart the projector.