Docker Swarm + Phoenix Channels

hubertlepicki · October 10, 2019, 8:36am

So this is an interesting topic you are bringing up.

If you are using Phoenix channels in a cluster, it is likely using the default PG2 back-end. This back-end uses Erlang's pg2 module, which is part of the standard library.
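For reference, this is roughly what the PG2 back-end looks like in the endpoint config of a Phoenix 1.4-era project. Treat it as a sketch: `:my_app`, `MyAppWeb.Endpoint` and `MyApp.PubSub` are placeholder names, and your generated config may differ.

```elixir
# config/config.exs (sketch; app and module names are placeholders)
use Mix.Config

config :my_app, MyAppWeb.Endpoint,
  # Phoenix.PubSub.PG2 is the adapter that builds on Erlang's pg2
  pubsub: [name: MyApp.PubSub, adapter: Phoenix.PubSub.PG2]
```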

The library is pretty minimal yet smart in what it does: your local processes can join a distributed process group, and then based on that Phoenix PubSub sends messages directly to these joined processes.

pg2 monitors the nodes that joined a given group and also handles netsplits and recovery / topology changes accordingly, so your nodes can drop/reconnect/join the cluster and Phoenix channels should work as expected. As in: the nodes that do drop out or become unresponsive might miss some messages, but otherwise things will heal on their own.

More about this in this thread:

and also in this overview of pg2 failure mechanisms:
http://christophermeiklejohn.com/erlang/2013/06/03/erlang-pg2-failure-semantics.html

Now, the situation is slightly different if you want to start your own GenServers and make them visible in the cluster. The main question I would have is: do you want them to be unique? And if so: unique per node or unique in cluster?

It’s easier to have GenServers unique in a node, and then you can actually use the same pg2 to have some sort of clustering/balancing/message passing between these and other processes in the cluster. Go for it if you can.

If you have to make the processes unique in the cluster, you are in a somewhat worse position. Unfortunately, it is very often the case that you need to do so.

I don’t know the peerage library, but I suspect it’s similar to libcluster, which I use, and just provides a mechanism to connect/reconnect/change the topology of the cluster of Erlang nodes and nothing more.
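To give an idea of what "just connecting the nodes" looks like with libcluster on Docker Swarm, here is a sketch using its DNS polling strategy. It assumes a Swarm service named `my_app` (Docker resolves `tasks.<service>` to the IPs of all its tasks) and nodes started with names like `my_app@<ip>`; the topology name, service name and interval are all made up for illustration.

```elixir
# config/prod.exs (sketch; names and values are assumptions)
use Mix.Config

config :libcluster,
  topologies: [
    swarm_dns: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [
        # how often to re-resolve the DNS name, in milliseconds
        polling_interval: 5_000,
        # Docker Swarm DNS name that lists every task of the service
        query: "tasks.my_app",
        # nodes are expected to be named my_app@<task ip>
        node_basename: "my_app"
      ]
    ]
  ]
```

The topologies are then passed to `Cluster.Supervisor` in your application's supervision tree. Note this only forms the cluster; it does nothing about process registration.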

You need something more if you want cluster-wide process registration for unique processes. :global is one obvious choice, but if you are worried about netsplits & recovery, it is not going to work: it just locks up / stops working on a netsplit, and there’s no way to recover as far as I understand either. I tend to rule :global out for any cloud-based deployment for that reason. It’s great if you have two nodes hooked together with a gigabit ethernet cable, but anything in the cloud should expect and will experience networking failures between the nodes.

I can see two tools you can use in that situation: Swarm (bitwalker/swarm on GitHub: easy clustering, registration, and distribution of worker processes for Erlang/Elixir) and Horde (derekkraan/horde on GitHub: a distributed Supervisor and Registry backed by DeltaCrdt).

Both are pretty similar: Swarm has more of its own API, while Horde tries to be compatible with Elixir’s Registry. I’ve used Swarm a lot and Horde just a tiny bit so far, so I can’t tell you much about running the latter in production. Swarm (with libcluster) works great.

With both tools you can register GenServers (and other processes) as unique in the cluster, handle netsplits/topology changes where the cluster breaks in half, configure a minimum cluster quorum, and also handle recovery when two nodes that were briefly disconnected re-connect. The last feature is pretty cool, because you can have situations where two nodes briefly run independently and each starts its own “global” GenServer, and when they re-connect you want to ensure there is only one left running, with the state of both merged. With both Horde and Swarm you can do it.
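As a rough illustration of the Swarm side of this, here is a sketch of a GenServer that handles the messages Swarm sends for handoff and conflict resolution. `Demo.Counter` and its "merge by adding the counts" rule are made up for the example; what merging means depends entirely on your state.

```elixir
defmodule Demo.Counter do
  # Hypothetical worker; only the {:swarm, ...} message shapes come from Swarm.
  use GenServer

  def start_link(initial \\ 0), do: GenServer.start_link(__MODULE__, initial)
  def increment(pid), do: GenServer.cast(pid, :increment)
  def value(pid), do: GenServer.call(pid, :value)

  @impl true
  def init(count), do: {:ok, count}

  @impl true
  def handle_call(:value, _from, count), do: {:reply, count, count}

  # Topology changed and Swarm wants to move this process to another node:
  # reply with the state that should be handed to the new instance.
  def handle_call({:swarm, :begin_handoff}, _from, count) do
    {:reply, {:resume, count}, count}
  end

  @impl true
  def handle_cast(:increment, count), do: {:noreply, count + 1}

  # Sent to the new instance with the state handed over by the old one.
  def handle_cast({:swarm, :end_handoff, handed_over}, _count) do
    {:noreply, handed_over}
  end

  # After a netsplit heals, the surviving instance receives the state of the
  # duplicate. Here we merge by adding the counts (an assumption of this demo).
  def handle_cast({:swarm, :resolve_conflict, other_count}, count) do
    {:noreply, count + other_count}
  end

  # Swarm tells the instance that is no longer needed to shut down.
  @impl true
  def handle_info({:swarm, :die}, count), do: {:stop, :shutdown, count}
end
```

Registration itself would go through `Swarm.register_name(name, module, function, args)`, where the given function has to return `{:ok, pid}`; in a real app you would normally start the worker under a supervisor rather than link it directly.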

I think Swarm may rely on libcluster, so I’m not sure if it’ll work well with the clustering library you use. The same probably goes for Horde.