Just a quick question for people who already have experience with OTP and GenServers, especially with error recovery. It concerns the recovery of GenServers, and to some extent of Erlang processes in general.
I know that one of the philosophies of Erlang and Elixir is ‘let it crash’. The Erlang ecosystem provides wonderful tools for monitoring and restarting processes. But I have run into some real-world questions regarding the design of such a system.
Real-world examples
For example, in a Phoenix application I have a custom Mailer GenServer behind an interface that sends an email when a user completes a specific action. This can be done with a cast to the GenServer, but say the mailer process crashes. How do I restart the GenServer with the buffer of mails that still need to be sent? Would you persist these to a database or to ETS?
Another example is a scheduler in the same application that schedules participants in a tournament in a round-robin fashion. It does this during a POST request to the server and saves its state to a MySQL database. What if the scheduling fails (e.g. because of a database error)? Would you retry within the same request, retry when the GenServer is restarted, or just show an error message to the user?
What I am looking for is how to resume these processes and make them more robust. Is anyone willing to share their insights?
I am far from being an Elixir expert. Anyway, the first thing I can think of is to send one message to the GenServer per mail you want to send. Be aware, though, that if the mailer crashes, the supervisor starts a fresh process with an empty mailbox (the process’ mailbox, I mean). The mailbox does not survive the crash, so you lose any queued messages along with the mail that was being sent at the time.
You are right. It looks like the recommended way is either to have a separate process that only stores the messages and never performs the risky operation itself (another process does that), or to hold the information in a message queue or the like.
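To make the first option a bit more concrete, here is a minimal sketch of how I imagine it, not a definitive implementation. The names `MailQueue`, `MailWorker`, `flush/0` and the `deliver` function are made up for illustration. The holder keeps the pending mails in memory only, so they survive a crash of the worker but not a restart of the whole node; for that you would still need a database or disk-backed storage.

```elixir
defmodule MailQueue do
  # Hypothetical holder process: it only stores pending mails and never
  # performs the risky delivery itself, so it has little reason to crash.
  use Agent

  def start_link(_opts) do
    Agent.start_link(fn -> :queue.new() end, name: __MODULE__)
  end

  # Add a mail to the back of the queue.
  def push(mail), do: Agent.update(__MODULE__, &:queue.in(mail, &1))

  # Look at the oldest pending mail without removing it.
  def peek do
    Agent.get(__MODULE__, fn queue ->
      case :queue.peek(queue) do
        {:value, mail} -> {:ok, mail}
        :empty -> :empty
      end
    end)
  end

  # Remove the oldest mail. Call this only after it has been delivered.
  def ack do
    Agent.update(__MODULE__, fn queue ->
      case :queue.out(queue) do
        {{:value, _mail}, rest} -> rest
        {:empty, same} -> same
      end
    end)
  end

  def pending, do: Agent.get(__MODULE__, &:queue.to_list/1)
end

defmodule MailWorker do
  # Hypothetical worker: does the risky part and is allowed to crash.
  use GenServer

  def start_link(deliver), do: GenServer.start_link(__MODULE__, deliver, name: __MODULE__)

  # Deliver everything that is currently pending.
  def flush, do: GenServer.call(__MODULE__, :flush)

  @impl true
  def init(deliver), do: {:ok, deliver}

  @impl true
  def handle_call(:flush, _from, deliver) do
    drain(deliver)
    {:reply, :ok, deliver}
  end

  defp drain(deliver) do
    case MailQueue.peek() do
      {:ok, mail} ->
        # If this raises, the worker crashes and the mail stays queued.
        deliver.(mail)
        MailQueue.ack()
        drain(deliver)

      :empty ->
        :ok
    end
  end
end

# Stand-in for real delivery: just report back to the calling process.
parent = self()
deliver = fn mail -> send(parent, {:delivered, mail}) end

{:ok, _sup} =
  Supervisor.start_link([MailQueue, {MailWorker, deliver}], strategy: :one_for_one)

MailQueue.push("welcome@example.com")
MailQueue.push("receipt@example.com")

# Simulate a crash of the worker and wait until the supervisor restarted it.
old = Process.whereis(MailWorker)
Process.exit(old, :kill)

await_restart = fn await ->
  case Process.whereis(MailWorker) do
    pid when is_pid(pid) and pid != old ->
      pid

    _ ->
      Process.sleep(10)
      await.(await)
  end
end

await_restart.(await_restart)

# The queue lives in a different process, so both mails should still be there.
IO.inspect(MailQueue.pending(), label: "pending after crash")

MailWorker.flush()
IO.inspect(MailQueue.pending(), label: "pending after flush")
```

The important detail is that a mail is only removed with `ack/0` after `deliver` returned, so a crash in the middle of delivery leaves it in the queue for the restarted worker. The flip side is that a mail can be sent twice if the worker dies between delivering and acknowledging, and a mail that always fails would need some retry limit so it does not crash the worker forever.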