Pogo - distributed supervisor for Elixir

Pogo is a distributed supervisor for clustered Elixir applications.

It uses battle-tested distributed named process groups (:pg) under the hood to maintain cluster-wide state and coordinate work between local supervisors running on different nodes.

Features of the distributed supervisor:

  • automatically chooses a node to locally supervise each child process
  • a child process running in the cluster can be started or stopped using any local supervisor
  • ensures a child is started only once in the cluster (as long as its child spec is unique)
  • redistributes children when cluster topology changes
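
To make the "chooses a node" and "redistributes children" points concrete, here is a minimal sketch of deterministic placement. It is not Pogo's actual algorithm: `PlacementSketch`, the node names and the plain `:erlang.phash2/2` hashing are assumptions made purely for illustration. The point is only that every node can compute the same owner from the same membership list without talking to the others.

```elixir
defmodule PlacementSketch do
  # Hypothetical helper: picks the node that should supervise `child_id`.
  # Sorting first makes the result independent of the order in which
  # a node happens to see the cluster members.
  def owner(child_id, nodes) when nodes != [] do
    sorted = Enum.sort(nodes)
    Enum.at(sorted, :erlang.phash2(child_id, length(sorted)))
  end
end

# Hypothetical three-node cluster.
nodes = [:"a@host", :"b@host", :"c@host"]

owner = PlacementSketch.owner(:my_worker, nodes)

# Every node computes the same answer, whatever order it sees the members in.
^owner = PlacementSketch.owner(:my_worker, Enum.reverse(nodes))

# When the owner leaves the cluster, one of the remaining nodes takes over.
remaining = List.delete(nodes, owner)
new_owner = PlacementSketch.owner(:my_worker, remaining)
true = new_owner in remaining

IO.inspect({owner, new_owner}, label: "owner before/after topology change")
```

Plain modulo hashing like this moves many children whenever membership changes; a consistent hash ring would move fewer, at the cost of a bit more code.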

In that respect it's similar to Horde or Swarm, but it doesn't provide a distributed registry (Horde and Swarm do). The internals are different, though: Horde uses δ-CRDTs and Swarm uses Interval Tree Clocks for synchronization. Pogo's local supervisors don't exchange messages to synchronize state. Instead they rely on Erlang's process groups, observe group memberships and adjust their local state accordingly.
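
As a rough illustration of the mechanism described above, the following single-node sketch uses only the standard `:pg` module. The scope and group names are hypothetical, and this is not Pogo's code; it only shows how a process can observe group membership instead of exchanging its own synchronization messages. It assumes OTP 25.1 or newer, where `:pg.monitor_scope/1` is available.

```elixir
# Hypothetical scope and group names, for illustration only.
scope = :pogo_sketch_scope
group = {:child, :my_worker}

# Start a pg scope; in a cluster, every node would start the same scope.
{:ok, _pg} = :pg.start_link(scope)

# Subscribe to membership changes for every group in the scope.
{ref, _current_groups} = :pg.monitor_scope(scope)

# A stand-in for a supervised child process.
child =
  spawn(fn ->
    receive do
      :stop -> :ok
    end
  end)

# Register the child in the group; on a cluster this becomes visible everywhere.
:ok = :pg.join(scope, group, child)
[^child] = :pg.get_members(scope, group)

# Observers learn about the new member through a notification.
receive do
  {^ref, :join, ^group, [^child]} -> IO.puts("observed join")
after
  1_000 -> raise "no join notification received"
end

# When the child exits, pg removes it and notifies observers again.
send(child, :stop)

receive do
  {^ref, :leave, ^group, [^child]} -> IO.puts("observed leave")
after
  1_000 -> raise "no leave notification received"
end
```

A supervisor built this way could react to `:join` and `:leave` notifications to decide which children it should run locally.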

For anyone interested, Pogo's inner workings have been detailed in an introductory blog post.

To provide some context, the library was developed at Telnyx as an alternative to Horde, as we couldn't overcome some problems that, to be honest, could have been peculiar to our environment (a 20+ node cluster with quite dynamic membership). Not that it went smoothly, but since version 0.3 it's been quite stable.

Thanks for this cool library!

I have several questions:

Does it mean that in case of a network split there will be two instances of the same child_spec: one in the part of the network that is not connected to the leader, and one restarted because of the topology change? Or will the part of the cluster that is disconnected from the leader just kill all of its local children?

In case of a network split there will be two instances of each child process as there is no concept of a leader or majority in pogo. Once the connectivity is restored, extraneous instances will get terminated.

Hello,

Could you please tell me the status of the lib? Is it still actively maintained?
Has it been thoroughly tested in production apps?

I'm looking for distributed libs to run a GenServer globally across cluster nodes.

I was going to use Horde, but unfortunately the maintainer no longer fixes issues. (I don't blame him for it.)
Swarm seems to have been abandoned.

It seems there aren't many options left…

Have you looked at :process_hub?

It's in alpha stage. I'm hesitant to use it for a production app.

Have you used it?

Only for small-scale projects.

Did you encounter any issues?

I only encountered 1 issue, and I fixed it here: Fix `stop_child` when using string IDs for the children by peaceful-james · Pull Request #1 · alfetahe/process-hub · GitHub

Just follow your gut and pick whichever library feels right to you. Whatever you choose, I think you will have to at some point read the source code to understand what is happening.

I have used :pogo and :horde a lot. :pogo is easy to understand. The source code was only a couple of files when I was using it. :horde was difficult for me to understand, and in a production project we kept seeing it just "losing" visibility of nodes randomly. That was probably our own fault, out of ignorance. However, I have found :process_hub to be reliable and easier to understand. The source code is clear enough that I feel "in control" because I understand, in broad strokes, what is happening.

You did not ask, but here is info on how I run distributed nodes locally with Docker: Using :dns_cluster with docker-compose locally (it can be done). When I wrote this I was using :pogo because I was trying to replace :horde. The same approach works for :process_hub.

This is just my subjective opinion based on my experience. For me, the most important factor when choosing a lib is how responsive and active the authors of the library are. I don't care if there are 1000 GitHub issues so long as the maintainers are responding to people.

Pogo is intentionally small. For example, it doesn't include a distributed registry as Horde does, so its maintenance surface is really small.

After we fixed some initial issues with it at Telnyx, it's been running successfully in production for more than a year now.

It's been used in several projects; the largest one had a cluster of about 30 Elixir nodes distributed globally. Pogo was doing fine there.
