Message queue handling during bitwalker/swarm node migration

The bitwalker/swarm component is a cool tool. It makes it easy to migrate services around a distributed cluster in an automated fashion. And if the topology of the cluster changes, it will move services around using a hash ring so that services can be quickly located. That migration is interesting, but while it’s happening, my phoenix route handlers end up sending out 500 errors to callers, which triggers AWS alarms and makes everyone unhappy.

1) Has anyone else encountered this, and is there a recommended way to avoid it?

When we started profiling swarm behavior we noticed something else.

Swarm supports migrating GenServers from one node to another using a :begin_handoff / :end_handoff mechanism. At :begin_handoff, the original process instance has the option to say “resume me on the new node, and here’s my current state”, passing his GenServer state. Once re-instantiated on the new node, swarm’s registry will begin reporting the GenServer’s new pid on the new node and callers will begin sending to that pid.

In reading the source code, I don’t see anything that would prevent the old instance of the service from getting more messages until he’s shut down. And indeed, there may already be more messages in his queue already at the time he hands his state over, and I’m pretty sure he’s going to happily process those messages, but his state has already been sent to his new incarnation, so … new state created from those final messages gets lost.

2) Can anyone confirm that the lost-final-messages scenario I mention above does, in fact, happen?

3) Is there a mechanism to extract messages from the queue so they can be sent to the new instance?

4) Is there a way to send a specific return value as the return to all call messages still in the queue when the GenServer dies? That way I could tell them to retry their request again when it would hopefully arrive at and be served by the new service instance.

Cheers!

Michael Watson
Chief Architect
National Instruments, Austin, TX

3 Likes

Maybe something like GitHub - jlouis/fuse: A Circuit Breaker for Erlang or GitHub - klarna/circuit_breaker: 💥 An Erlang library for breaking out of faulty services could work? You would wrap your genserver process in a circuit breaker and return a more appropriate response ({:error, :migrating}?) to all the callers after it has started being moved to another node to avoid accepting messages it cannot handle.

2 Likes

I’ll look into those circuit breakers, thanks!

I have the same concerns since I’m considering using swarm for a project. Did you find an answer to questions 1-3?