Pogo - distributed supervisor for Elixir

Pogo is a distributed supervisor for clustered Elixir applications.

It uses battle-tested distributed named process groups (:pg) under the hood to maintain cluster-wide state and coordinate work between local supervisors running on different nodes.

Features of the distributed supervisor:

  • automatically chooses a node to locally supervise a child process
  • a child process running in the cluster can be started or stopped using any local supervisor
  • ensures a child is started only once in the cluster (as long as its child spec is unique)
  • redistributes children when cluster topology changes
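
To make the "automatically chooses a node" and "redistributes children" points concrete, here is a minimal sketch of how every node could independently arrive at the same owner for a child, given the same view of the cluster. This is an illustration only and not Pogo's actual algorithm: the `OwnerSketch` module, the `:my_worker` id and the node names are all hypothetical, and plain modulo hashing via `:erlang.phash2/2` is my assumption.

```elixir
defmodule OwnerSketch do
  # Hypothetical helper, NOT part of Pogo: maps a child id onto one node
  # by hashing the id over the sorted list of known nodes.
  def owner(child_id, nodes) when nodes != [] do
    sorted = Enum.sort(nodes)
    Enum.at(sorted, :erlang.phash2(child_id, length(sorted)))
  end
end

nodes = [:"a@host", :"b@host", :"c@host"]

# Every node that sees the same membership computes the same owner,
# whatever order it learned about the nodes in.
owner = OwnerSketch.owner(:my_worker, nodes)
same_owner = OwnerSketch.owner(:my_worker, Enum.reverse(nodes))

# When the topology changes, the owner is recomputed from the new membership.
owner_after_leave = OwnerSketch.owner(:my_worker, List.delete(nodes, owner))

IO.inspect({owner, same_owner, owner_after_leave})
```

Plain modulo hashing moves many children around whenever the node count changes; a real implementation would more likely use something like consistent hashing to limit that churn.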

In that aspect it’s similar to Horde or Swarm, but doesn’t provide a distributed registry (Horde and Swarm do). Internals obviously are different - Horde uses δ-CRDT, Swarm uses Interval Tree Clock for synchronization. Pogo’s local supervisors don’t exchange messages to synchronize state, but rely on Erlang’s process groups, observe their memberships and adjust their local state based on it.
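
For readers who haven't used `:pg` directly, here is a small sketch of the primitive that paragraph refers to. It only uses the standard `:pg` functions from OTP (`start_link/1`, `join/3`, `get_members/2`, `leave/3`); the scope name `:sketch_scope` and group name `:supervisors` are made up for the example and say nothing about which scopes or groups Pogo uses internally.

```elixir
# Start a :pg scope on this node. In a cluster, every participating node
# starts the same scope and :pg keeps the memberships in sync between them.
{:ok, _scope_pid} = :pg.start_link(:sketch_scope)

# A process advertises itself by joining a named group.
:ok = :pg.join(:sketch_scope, :supervisors, self())

# Any node running the scope can read the group's membership.
members = :pg.get_members(:sketch_scope, :supervisors)
IO.inspect(members, label: "members after join")

# Leaving (or the process dying) removes it from the group again.
:ok = :pg.leave(:sketch_scope, :supervisors, self())
members_after_leave = :pg.get_members(:sketch_scope, :supervisors)
IO.inspect(members_after_leave, label: "members after leave")
```

This runs on a single node; with connected nodes, `get_members/2` would also return pids that joined the group on other nodes, which is what lets a supervisor observe the cluster without a custom sync protocol.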

For anyone interested, Pogo’s inner workings have been detailed in an introductory blog post :writing_hand:.

To provide some context, the library was developed at Telnyx as an alternative to Horde as we couldn’t overcome some problems that TBH could have been peculiar to our environment (20+ node cluster with quite dynamic membership). Not that it went smoothly :slight_smile: but since version 0.3 it’s been quite stable.

Thanks for this cool library!

I have several questions:

Does it mean that in case of a network split there will be two instances of the same child_spec: one in the part of the network that is not connected to the leader, and one restarted because of the topology change? Or will the part of the cluster that is disconnected from the leader just kill all its local children?

In case of a network split there will be two instances of each child process as there is no concept of a leader or majority in pogo. Once the connectivity is restored, extraneous instances will get terminated.
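
A toy model of that behaviour, assuming each partition simply picks an owner among the nodes it can still see. The `SplitSketch` module, the hashing scheme and the node names are hypothetical and only meant to show why a split yields one instance per partition and why healing brings it back to one; it is not Pogo's code.

```elixir
defmodule SplitSketch do
  # Hypothetical stand-in for "pick one node per child", NOT Pogo's code.
  def owner(child_id, nodes) do
    sorted = Enum.sort(nodes)
    Enum.at(sorted, :erlang.phash2(child_id, length(sorted)))
  end

  # Nodes that would run the child, given the partitions the cluster is in.
  # Each partition only knows its own nodes, so each picks its own owner.
  def running_on(child_id, partitions) do
    Enum.map(partitions, fn nodes -> owner(child_id, nodes) end)
  end
end

split = [[:"a@host", :"b@host"], [:"c@host"]]
healed = [[:"a@host", :"b@host", :"c@host"]]

during_split = SplitSketch.running_on(:my_worker, split)
after_heal = SplitSketch.running_on(:my_worker, healed)

IO.inspect(during_split, label: "running during split")
IO.inspect(after_heal, label: "running after heal")
```

Once the partitions merge, all nodes agree on a single owner again, and any instance running elsewhere is the extraneous one that gets terminated.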

Hello,

Could you please tell me the status of the lib? Is it still actively maintained?
Has it been thoroughly tested in production apps?

I’m looking for distributed libs to run GenServers globally across cluster nodes.

I was going to use Horde but unfortunately the maintainer does not fix issues anymore. (I don’t blame him for it).
Swarm seems to have been abandoned.

It seems there aren’t many options left…

Have you looked at :process_hub?

It’s in alpha stage. I’m hesitant to use it for a production app.

Have you used it?

Only for small-scale projects.

Did you encounter any issues?

I only encountered 1 issue, and I fixed it here: Fix `stop_child` when using string IDs for the children by peaceful-james · Pull Request #1 · alfetahe/process-hub · GitHub

Just follow your gut and pick whichever library feels right to you. Whatever you choose, I think you will have to at some point read the source code to understand what is happening.

I have used :pogo and :horde a lot. :pogo is easy to understand. The source code was only a couple of files when I was using it. :horde was difficult for me to understand and in a production project we kept seeing it just “losing” visibility of nodes randomly. That was probably our own ignorant fault. However, I have found :process_hub to be reliable and easier to understand. The source code is clear enough that I am able to feel “in control” because I understand, in broad strokes, what is happening.

You did not ask, but here is info on how I run distributed nodes locally with docker: Using :dns_cluster with docker-compose locally (it can be done). When I wrote this I was using :pogo because I was trying to replace :horde. The same approach works for :process_hub.

This is just my subjective opinion based on my experience. For me, the most important factor when choosing a lib is how responsive and active the authors of the library are. I don’t care if there are 1000 GitHub issues so long as the maintainers are responding to people.

Pogo is intentionally small. For example, it doesn’t include a distributed registry as Horde does, so its maintenance surface is really small.

After fixing some initial issues that we had with it at Telnyx, it’s been successfully running in production for more than a year now.

It’s been used in several projects; the largest one had a cluster of about 30 Elixir nodes distributed globally. Pogo was doing fine there.
