GenServers maintaining external connections (that can fail) - how to supervise?

hubertlepicki · July 6, 2016, 5:54am

I have a few supervised GenServers that are started from the application callback module. Each one connects to one of the RabbitMQ queues and reacts in some way to incoming messages.

The connection to RabbitMQ is opened when the GenServer starts. When the connection drops, we reconnect. So far so good. Our code is based on this basic example:

Is there a better, more OTP-like way of doing this? I would think of using Supervisors, but it looks like I would have to write a custom one to implement delays between the respawn attempts, otherwise I’ll reach the restart limit really fast.

I was thinking: when the network goes down, I crash the GenServer instead of handling the reconnect internally.

This way I would avoid the nasty :timer.sleep calls that block the GenServer’s ability to react to other messages.

Basically I want the other parts of my GenServer (reading its state) to stay usable while it reconnects, and the above solution unfortunately blocks.

hubertlepicki · July 6, 2016, 6:01am

Ok the above does not make much sense, but it was great to express my thoughts. Rubber duck debugging.

What I probably need is two processes. One listens to RabbitMQ, and I can safely crash/respawn it as needed. The second one would be a GenServer that keeps the state and stays operational all the time, no matter whether the RabbitMQ listener is up or down.

I still need to supervise the RabbitMQ listener somehow - any good tips here?

hubertlepicki · July 6, 2016, 6:04am

…or I just amend the example to use :timer.send_after() rather than :timer.sleep()
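A minimal sketch of that idea, using `Process.send_after/3` (the Elixir counterpart of `:timer.send_after`). The module name `QueueListener`, the injected `connect_fun`, and the 5-second delay are all made up for illustration; in a real listener `connect_fun` would wrap whatever opens the RabbitMQ connection and would return `{:ok, pid}` or `{:error, reason}`.

```elixir
defmodule QueueListener do
  use GenServer

  # Delay between reconnect attempts in milliseconds (an assumed value).
  @reconnect_delay 5_000

  # connect_fun is a hypothetical zero-arity function returning
  # {:ok, connection_pid} or {:error, reason}.
  def start_link(connect_fun) do
    GenServer.start_link(__MODULE__, connect_fun)
  end

  def status(pid), do: GenServer.call(pid, :status)

  @impl true
  def init(connect_fun) do
    # Connect asynchronously so init/1 returns right away.
    send(self(), :connect)
    {:ok, %{connect_fun: connect_fun, conn: nil}}
  end

  @impl true
  def handle_call(:status, _from, state) do
    status = if state.conn, do: :connected, else: :disconnected
    {:reply, status, state}
  end

  @impl true
  def handle_info(:connect, state) do
    case state.connect_fun.() do
      {:ok, conn} ->
        # Monitor the connection so we are told when it goes away.
        Process.monitor(conn)
        {:noreply, %{state | conn: conn}}

      {:error, _reason} ->
        # Schedule a retry instead of sleeping; calls are still served meanwhile.
        Process.send_after(self(), :connect, @reconnect_delay)
        {:noreply, %{state | conn: nil}}
    end
  end

  def handle_info({:DOWN, _ref, :process, _pid, _reason}, state) do
    # The connection died: mark it as gone and schedule a reconnect.
    Process.send_after(self(), :connect, @reconnect_delay)
    {:noreply, %{state | conn: nil}}
  end
end
```

Because the retry arrives as an ordinary message, `QueueListener.status/1` keeps answering while the process waits for the next attempt.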

sasajuric · July 6, 2016, 6:34am

If you expect a disconnection (which you should), then I’d say you should handle it explicitly, and not leave it to supervisor. One problem you’ll have with supervisor is that it will immediately restart, so you might end up in a tight reconnect attempt loop, and after maximum restart frequency is exceeded the supervisor would crash.
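For reference, a sketch of where that limit is set. The values shown are the documented defaults (3 restarts within 5 seconds); the `Agent` child and its id are only placeholders standing in for a real worker.

```elixir
# Placeholder child; in practice this would be the listener's child spec.
children = [
  %{id: :state_holder, start: {Agent, :start_link, [fn -> %{} end]}}
]

# If children are restarted more than max_restarts times within
# max_seconds, the supervisor itself terminates.
{:ok, sup} =
  Supervisor.start_link(children,
    strategy: :one_for_one,
    max_restarts: 3,
    max_seconds: 5
  )
```

A child that crashes immediately on every start uses up this budget almost at once, which is the tight loop described above.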

So this case is IMO best handled explicitly, and that’s fine since it’s an expected situation (so not a bug), which you want to handle in a custom way. You probably want to delay the reconnect attempt, maybe with an increasing interval between successive attempts, meaning you need to maintain state. This is yet another indication that a supervisor is not the tool for the job, since it supports neither delayed restarts nor carrying state over from a crashed process (which you’d need if you want an increasing reconnect delay).
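One way the increasing interval could be computed, as a sketch only: the `Backoff` module, the 1-second base, and the 30-second cap are assumptions, not something from the example above. The process would keep an attempt counter in its state, bump it on every failed connect, and reset it to 0 on success.

```elixir
defmodule Backoff do
  # Hypothetical values: start at 1 second, never wait more than 30 seconds.
  @base 1_000
  @max 30_000

  # Delay in milliseconds before retry number `attempt` (0-based).
  # Doubles on every attempt until it reaches the cap.
  def delay(attempt) when is_integer(attempt) and attempt >= 0 do
    # Cap the exponent so very large attempt counts don't build huge integers.
    exponent = min(attempt, 16)
    min(@base * Bitwise.bsl(1, exponent), @max)
  end
end
```

Inside the GenServer this would be used along the lines of `Process.send_after(self(), :connect, Backoff.delay(state.attempt))`, where `state.attempt` is the counter mentioned above.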

hubertlepicki · July 6, 2016, 8:07am

Thank you @sasajuric! I was aware of the supervisor’s max_restarts within max_seconds limit, which is why I was thinking about building a custom supervisor. But I think I’ll leave it handled internally, as you say - the network going down is an expected condition.

dom · July 6, 2016, 9:06am

Fred has a good post on that topic: http://ferd.ca/it-s-about-the-guarantees.html

I would suggest also checking out the Connection behaviour and tcp example: https://github.com/fishcakez/connection

Linuus · July 7, 2016, 1:46pm

As stated above, have a look at the Connection library. It’s great.

This talk by Andrea Leopardi may be of interest as well