Recovering from connection failures to external services

thehunmonkgroup · January 21, 2018, 8:52pm

I’m designing a server-side application that must regularly connect to a database to retrieve/update information related to its ongoing maintenance tasks. I’ve been considering the ‘OTP way’ to make the application durable to failures related to interfacing with external services.

My first thought was to build a simple ‘retry’ mechanism into a process that handles querying the database. In the event that, for example, the database had to be restarted separately from my running application, the retry mechanism would allow for such intermittent outages, and happily re-query the database as soon as it’s back up.

However, in the Elixir/Erlang/OTP world, this approach feels off the mark to me, especially considering the let it crash mantra.

I’d be curious to know how more experienced Elixir/Erlang developers deal with this in a process/supervision tree architecture.

LostKobrakai · January 21, 2018, 9:00pm

Not super experienced with OTP, but “let it crash” does not mean “let everything crash”. In your case the worker that connects to the db can crash for whatever reason. The crash will be detected by other parts of your system, which can then apply whatever retry strategy you see fit. The process handling the retry should live in a part of your supervision tree that is less likely to crash and is not easily disturbed by crashing worker processes.

sasajuric · January 21, 2018, 10:01pm

There are a couple of approaches I can think of. As usual, which one is more suitable depends on the exact circumstances.

A simple solution is to periodically start a short-lived process which operates on the database. For periodic starting, you could use quantum. You tell quantum to regularly invoke some function, and you do the job in the function. The function is invoked in a separate process for each iteration. If that process fails (say due to the database not being available), then the error will be logged, and the job will retry in the next iteration. Sooner or later, the database will come back online, and the job will succeed. I usually use this approach for simpler scenarios, for example if I want to periodically delete/archive some stale things.

On the other hand, if you want to periodically pull some stuff from the database, and then do some other jobs on that data, then a pub-sub approach might be more appropriate. In this version, you could still use quantum to periodically start a pull job, which reads from the database. The job then sends a message to interested processes in the system, which handle the data. That way, a failure in the database puller won’t disturb anything else. The same thing holds for other listeners in the system.
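A minimal sketch of that pub-sub shape, using only the standard library's `Registry` for the dispatch. The module name `PullDemo`, the topic `:db_rows`, and the `fetch_fun` argument (a stand-in for the real database read) are assumptions made for illustration; in a real system the pull job would be started by the scheduler rather than by hand.

```elixir
defmodule PullDemo do
  # Hypothetical topic name; any term works as a Registry key.
  @topic :db_rows

  # Called by each interested process to subscribe itself.
  def subscribe(registry), do: Registry.register(registry, @topic, nil)

  # The pull job. fetch_fun stands in for the real database read.
  # It is meant to run in its own short-lived process, so a raise here
  # only kills that process and the subscribers are not disturbed.
  def pull_and_publish(registry, fetch_fun) do
    rows = fetch_fun.()

    Registry.dispatch(registry, @topic, fn entries ->
      for {pid, _value} <- entries, do: send(pid, {:rows, rows})
    end)
  end
end

{:ok, _} = Registry.start_link(keys: :duplicate, name: PullDemo.Registry)
{:ok, _} = PullDemo.subscribe(PullDemo.Registry)

# One iteration of the job, started unlinked, as a scheduler would do.
{:ok, _pid} =
  Task.start(fn ->
    PullDemo.pull_and_publish(PullDemo.Registry, fn -> [1, 2, 3] end)
  end)

receive do
  {:rows, rows} -> IO.inspect(rows, label: "received")
after
  1_000 -> IO.puts("no data this round")
end
```

If `fetch_fun` raises, the task dies and logs the error, nothing is dispatched, and the subscriber simply sees no message for that round.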

You could also catch the exception and retry, though IME a solution which is both simpler and more reliable is most often to run things in separate processes. If you could describe your scenario in more detail, and explain the main concerns you have about possible failures and desired recovery behaviour, perhaps more detailed advice could be given.
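For comparison, the catch-and-retry variant could look roughly like this. The `Retry` module, its argument names, and the defaults are illustrative assumptions, not part of any library. Note that `rescue` only covers raised exceptions; an exit (for example a `GenServer.call` timeout inside a database client) would still take the caller down, which is one reason the separate-process approach tends to be more robust.

```elixir
defmodule Retry do
  # Runs fun; on a raised exception waits delay_ms and tries again,
  # up to `attempts` tries in total. Exits and throws are not caught.
  def with_retry(fun, attempts \\ 3, delay_ms \\ 100) when attempts > 0 do
    {:ok, fun.()}
  rescue
    error ->
      if attempts > 1 do
        Process.sleep(delay_ms)
        with_retry(fun, attempts - 1, delay_ms)
      else
        {:error, error}
      end
  end
end

# Simulated flaky query: fails on the first two calls, then succeeds.
{:ok, counter} = Agent.start_link(fn -> 0 end)

flaky_query = fn ->
  n = Agent.get_and_update(counter, fn n -> {n + 1, n + 1} end)
  if n < 3, do: raise("database unavailable"), else: :rows
end

IO.inspect(Retry.with_retry(flaky_query, 5, 10))
```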

thehunmonkgroup · January 21, 2018, 10:48pm

[quote=“sasajuric, post:3, topic:11800”]
If you could describe your scenario in more details, and explain the main concerns you have about possible failures and desired recovery behaviour,
[/quote]

I’ve got a pretty thorough description of the problem space in this thread: Using GenServer in a state machine type workflow

I think I may have mischaracterized the needed work by calling it ‘ongoing maintenance tasks’. This is more of an event reactor situation: events come in, and they need to be acted on immediately, which doesn’t seem like a good fit with quantum (looks useful though!). Acting on those events includes some reading from and writing to a central database.

Using Tasks seems like a reasonable approach: launch a Task which queries the database and returns the desired data and/or updates data. If for whatever reasons the Task crashes, I can trap that and retry in the parent process.
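A rough sketch of that idea, assuming the task is started through a `Task.Supervisor` so that it is not linked to the parent. The `DbJob` module, its `run` function, and the attempt and timeout defaults are hypothetical names chosen for illustration; the function passed in stands for the real database query.

```elixir
defmodule DbJob do
  # Runs `fun` in a supervised, unlinked task and waits for the outcome.
  # A crash in the task arrives as {:exit, reason} instead of taking the
  # caller down, so the caller can decide whether to try again.
  def run(sup, fun, attempts \\ 3, timeout \\ 5_000) do
    task = Task.Supervisor.async_nolink(sup, fun)

    case Task.yield(task, timeout) || Task.shutdown(task) do
      {:ok, result} -> {:ok, result}
      {:exit, _reason} when attempts > 1 -> run(sup, fun, attempts - 1, timeout)
      {:exit, reason} -> {:error, reason}
      nil -> {:error, :timeout}
    end
  end
end

{:ok, sup} = Task.Supervisor.start_link()
IO.inspect(DbJob.run(sup, fn -> :rows end))
```

This version blocks the parent while it waits. Inside a GenServer that must stay responsive, the same task can be started without waiting and its result and `:DOWN` messages handled in `handle_info` instead.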

dom · January 21, 2018, 11:03pm

https://ferd.ca/it-s-about-the-guarantees.html

I agree with the other posters. I also had a similar use case and used a GenServer that spawns a task using Task.Supervisor.async_nolink, then schedules the next one with Process.send_after once it completes. If it fails, so be it! Better luck next time.
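One way that arrangement might look is sketched below. The `Poller` module and its option names (`:task_sup`, `:job`, `:interval`) are assumptions for the sake of the example, not the poster's actual code.

```elixir
defmodule Poller do
  use GenServer

  # opts: :task_sup (a running Task.Supervisor), :job (zero-arity fun),
  # :interval (ms between runs). All of these names are illustrative.
  def start_link(opts), do: GenServer.start_link(__MODULE__, Map.new(opts))

  @impl true
  def init(state) do
    send(self(), :run)
    {:ok, state}
  end

  @impl true
  def handle_info(:run, state) do
    # Unlinked task: a crash reaches us only as a :DOWN message.
    Task.Supervisor.async_nolink(state.task_sup, state.job)
    {:noreply, state}
  end

  # The task finished normally and sent back its result.
  def handle_info({ref, _result}, state) when is_reference(ref) do
    Process.demonitor(ref, [:flush])
    schedule(state)
  end

  # The task crashed. So be it: schedule the next run anyway.
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, state), do: schedule(state)

  defp schedule(state) do
    Process.send_after(self(), :run, state.interval)
    {:noreply, state}
  end
end

{:ok, sup} = Task.Supervisor.start_link()
parent = self()
{:ok, _poller} = Poller.start_link(task_sup: sup, interval: 50, job: fn -> send(parent, :tick) end)

receive do
  :tick -> IO.puts("job ran")
after
  1_000 -> IO.puts("job did not run")
end
```

Because the next run is scheduled only after the current task has finished or died, runs never overlap.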

sasajuric · January 22, 2018, 8:22am

OK, that’s a somewhat different story. Glancing at the description on the linked thread, I see a lot of similarities with what I recently worked on myself. I too have a GenServer which responds to some events, and has to start various jobs. I have different needs for handling failures. For some job types, I want to retry after a brief delay. For others, I just want to report a failure to an external service. I also have a simple state-machine workflow. When some job succeeds, I need to start another job. Finally, in some cases, I want to cancel the running jobs, and start from the beginning.

I chose exactly the approach you mention here. I have a GenServer which traps exits, starts jobs as child tasks, and handles :EXIT messages.
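A stripped-down sketch of such a parent process follows. It is an illustration under assumptions, not the code described above: the `JobRunner` name, the `start_job` function, and the single "retry after a delay" policy are all made up for the example.

```elixir
defmodule JobRunner do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  # Starts `fun` as a linked child task. If the task crashes, it is
  # started again after retry_ms (an illustrative retry policy).
  def start_job(server, fun, retry_ms \\ 100),
    do: GenServer.call(server, {:start_job, fun, retry_ms})

  @impl true
  def init(:ok) do
    # Turn crashes of linked processes into {:EXIT, pid, reason} messages.
    Process.flag(:trap_exit, true)
    {:ok, %{}}
  end

  @impl true
  def handle_call({:start_job, fun, retry_ms}, _from, jobs) do
    {:reply, :ok, launch(jobs, {fun, retry_ms})}
  end

  @impl true
  def handle_info({:EXIT, pid, reason}, jobs) do
    case Map.pop(jobs, pid) do
      # Not one of our jobs: nothing to do.
      {nil, jobs} ->
        {:noreply, jobs}

      # The job finished successfully.
      {_job, jobs} when reason == :normal ->
        {:noreply, jobs}

      # The job crashed: retry after a brief delay.
      {{_fun, retry_ms} = job, jobs} ->
        Process.send_after(self(), {:retry, job}, retry_ms)
        {:noreply, jobs}
    end
  end

  def handle_info({:retry, job}, jobs), do: {:noreply, launch(jobs, job)}

  defp launch(jobs, {fun, _retry_ms} = job) do
    {:ok, pid} = Task.start_link(fun)
    Map.put(jobs, pid, job)
  end
end

{:ok, runner} = JobRunner.start_link()
parent = self()
:ok = JobRunner.start_job(runner, fn -> send(parent, :job_done) end)

receive do
  :job_done -> IO.puts("job finished")
after
  1_000 -> IO.puts("job still pending")
end
```

Other policies, such as reporting the failure to an external service or starting a follow-up job on success, would slot into the corresponding `:EXIT` clauses.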

I find myself using this pattern every now and then, and a few months ago I started thinking that a generic behaviour, something like JobParent, could reduce some boilerplate. It would be some combination of GenServer + Registry + Supervisor, which can be used to start immediate children, assign them names, keep track of them, and react to their termination. I did a brief dirty experiment a few months ago, but couldn’t find the time to work more seriously on it. In case someone wants to explore this idea, I’ll be happy to explain my thoughts in more detail.

thehunmonkgroup · January 22, 2018, 11:42pm

Excellent, good to know I’m on the right track.

Thanks to all for the feedback.

joaquinalcerro · May 14, 2018, 4:06am

I am interested.

Let me know when we can talk about it.

Best regards.

sasajuric · May 15, 2018, 7:46am

In the meantime I’ve made the first prototype, which is available here.

joaquinalcerro · May 15, 2018, 12:54pm

Oh perfect. I will check the library and get back to you. Best regards.