We are deploying an Elixir application to production for the first time (it's my first ever Elixir app) and realised that we need to cluster the nodes, because things like Quantum will run on every instance; with a cluster we can configure them to run on a single node.
We are using libcluster with the gossip strategy and it works locally.
We can start two iex sessions with different names and they form a cluster.
We can see the other nodes with Node.list().
However, once deployed to AWS, whilst we can connect the nodes manually, they do not connect automatically.
In the logs we can see the gossip protocol heartbeat.
Given that we can connect manually and we can see the heartbeat, are there other reasons this might be failing?
This is how we have things set up:
defmodule Shared.Application do
  use Application

  def start(_type, _args) do
    children = [
      {
        Cluster.Supervisor,
        [
          cluster_topologies(),
          [name: Shared.ClusterSupervisor]
        ]
      }
    ]

    opts = [strategy: :one_for_one, name: Shared.Supervisor]
    Supervisor.start_link(children, opts)
  end

  defp cluster_topologies do
    [
      gossip: [
        strategy: Cluster.Strategy.Gossip
      ]
    ]
  end
end
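For reference, we're relying on the Gossip strategy's defaults for the multicast settings. Here is a sketch of the same topology with those options spelled out (values as documented for libcluster's Gossip strategy; worth double-checking against the version in use, and the secret is just an example):

  defp cluster_topologies do
    [
      gossip: [
        strategy: Cluster.Strategy.Gossip,
        config: [
          # documented defaults for the Gossip strategy
          port: 45_892,
          if_addr: "0.0.0.0",
          multicast_addr: "230.1.1.251",
          multicast_ttl: 1,
          # optional shared secret so only our own nodes join
          secret: "somepassword"
        ]
      ]
    ]
  end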
Any ideas much appreciated!
I am doing something similar in dev; in production, however, I use a different strategy.
I have this in my runtime.exs:
topology_hosts =
  System.get_env("PARTNERS") ||
    raise """
    environment variable PARTNERS is missing, no cluster will be active
    """

config :gratwifi,
  topology: [
    ipc: [
      strategy: Cluster.Strategy.Epmd,
      config: [
        hosts: topology_hosts |> String.split(",") |> Enum.map(&String.to_atom/1)
      ]
    ]
  ]
And then in application.exs:
def start(_type, _args) do
  topologies = Application.get_env(:gratwifi, :topology)
  ...

  children =
    [
      ...
      {Cluster.Supervisor, [topologies, [name: GratWiFi.ClusterSupervisor]]},
      ...
    ]
The PARTNERS environment variable is set to a comma-separated list of all my nodes (example below).
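With hypothetical node names, the parsed hosts list that the Epmd strategy receives looks like this:

  # PARTNERS="gratwifi@10.0.1.10,gratwifi@10.0.1.11"   (hypothetical node names)
  iex> System.get_env("PARTNERS") |> String.split(",") |> Enum.map(&String.to_atom/1)
  [:"gratwifi@10.0.1.10", :"gratwifi@10.0.1.11"]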
And, very importantly, access to the epmd port is restricted to these hosts on all my nodes.
The default for the Gossip strategy is to use multicast, which is not necessarily supported by all networks, as mentioned in a later version of the docs than the one you linked: Cluster.Strategy.Gossip — libcluster v3.4.1
Thanks for the example. The reason for choosing the Gossip strategy was that we didn't want to statically maintain a list of node names. We are using AWS ECS and will be employing auto-scaling in the future, so new instances will be spun up on demand and will need to connect to the cluster.
In the meantime I tried the libcluster_postgres strategy and this works both locally and on AWS.
It’s perhaps overkill given the cluster will not span multiple regions, so I’ll have a look at the suggestion from @LostKobrakai about multicast support and see if I can get the Gossip protocol to work.
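In case it helps anyone else, here is a minimal sketch of the kind of topology libcluster_postgres expects (strategy module and options as shown in the libcluster_postgres README; the hostnames and credentials are placeholders, so double-check against the version you install):

  config :libcluster,
    topologies: [
      postgres: [
        strategy: LibclusterPostgres.Strategy,
        config: [
          # reuse the same connection details as your Repo
          hostname: "db.example.internal",
          username: "postgres",
          password: "postgres",
          database: "myapp_prod",
          port: 5432,
          parameters: [],
          # all nodes listening on the same channel form one cluster
          channel_name: "cluster"
        ]
      ]
    ]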
Apparently AWS Fargate does not support UDP multicast or broadcast. I’ve not verified this for certain, but it would certainly explain why the Gossip protocol doesn’t work.
If all the nodes are resolvable through a single DNS name, you could use DNSCluster.
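A minimal sketch of what that looks like in the supervision tree, assuming the dns_cluster package is added as a dependency; DNS_CLUSTER_QUERY is just an example env var name and should be a DNS name that resolves to the IPs of all nodes:

  children = [
    # when the query isn't set (e.g. locally), clustering is simply skipped
    {DNSCluster, query: System.get_env("DNS_CLUSTER_QUERY") || :ignore},
    ...
  ]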
Seeing the exact same thing with EC2 instances. It works locally, but on AWS the nodes must be connected manually. It seems the VPC just doesn’t support multicast. I only need a couple of nodes right now, so I guess I could use Epmd for now. I would like this to just be automatic though. Postgres would probably work, but as you said it seems like overkill.
What solution did you end up going with?
We are using libcluster_postgres.
The only thing I have noticed is that when we do a deploy (to ECS, Fargate) it spins up new nodes and they join the same cluster as the existing nodes (prior to them being terminated), so we get a double-sized cluster for 10ish minutes. We could change the cookie value to have two clusters during deployment, but we’ve not seen any issues with the current setup yet.
Also, when the old nodes are said to be terminated in the AWS UI, they actually aren’t for a few more minutes; you can still connect to them and see them as part of the cluster.
Interesting, thanks for sharing. That’s helpful to know how the cookies work. I’m running on EC2 machines. Using iex with the --name and --cookie options, I can get each node to autoconnect using libcluster_postgres. It seems most reliable when I set --name equal to app@ However, when making an Elixir release, the --name seems to force itself to ip-. These then don’t autoconnect, even though I see the “connected to postgres” message from libcluster. I can always call Node.connect and it will work.
Did you run into anything similar? I’ve even gotten to the point of using env.sh in the release directory to successfully get ./app start_iex to show the correct app name with the IP. But in release mode the same settings don’t autoconnect.
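For reference, this is roughly the rel/env.sh.eex override I'm describing; a sketch that assumes a Linux host where hostname -i returns the instance's private IP, and uses a placeholder app name:

  #!/bin/sh
  # rel/env.sh.eex -- sourced by the release start scripts before boot
  export RELEASE_DISTRIBUTION=name
  # hypothetical node name: <app>@<private IP of this instance>
  export RELEASE_NODE="myapp@$(hostname -i)"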
Just tried EPMD, and it works flawlessly with the local DNS names I set up inside of AWS. For some reason I just cannot get Gossip or Postgres to connect automatically when in release mode.
No, we didn’t get any such problems.
We used the IP address as the node name. I’d have to check exactly how we do this; it was a while ago.
iex> Node.self()
:"ip-10-0-2-251@ip-10-0-2-251"
I didn’t see this recommended anywhere, but it seemed the easiest way to get a unique identifier.
Thanks, I’ll keep tinkering.