We are deploying an Elixir application to production for the first time (my first ever Elixir app) and realised that we need to cluster the nodes. Without a cluster, things like Quantum will run on every instance; with a cluster we can configure them to run on a single node.
We are using libcluster with the gossip strategy and it works locally.
We can start two iex sessions with different names and they form a cluster.
We can see the other nodes with Node.list().
However, once deployed to AWS, the nodes do not connect automatically, although we can connect them manually.
In the logs we can see the gossip protocol heartbeat.
Given that we can connect manually and we see the heartbeat, are there other reasons this might be failing?
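For reference, the manual check described above looks roughly like this in a remote iex shell on one of the deployed nodes. The node name `:"shared@10.0.1.23"` is a made-up placeholder, not a value from our deployment. If this succeeds, distribution itself (cookie, epmd, distribution ports) is working, which suggests the problem is in discovery rather than in connecting.

```elixir
# Run inside an iex shell attached to a deployed node.
# :"shared@10.0.1.23" is a hypothetical peer name; substitute a real one.

# This node's own name; it must be a long or short name, not :nonode@nohost.
Node.self()

# The cookie must be identical on every node that should join the cluster.
Node.get_cookie()

# Returns true if the connection succeeds, false if it fails,
# and :ignored if the local node is not running in distributed mode.
Node.connect(:"shared@10.0.1.23")

# Lists the nodes currently connected to this one.
Node.list()
```

Note that seeing a heartbeat line in one node's log does not by itself show that heartbeats from the other nodes are arriving there.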
This is how we have things set up:
defmodule Shared.Application do
  use Application

  def start(_type, _args) do
    children =
      [
        {
          Cluster.Supervisor,
          [
            cluster_topologies(),
            [name: Shared.ClusterSupervisor]
          ]
        }
      ]

    opts = [strategy: :one_for_one, name: Shared.Supervisor]
    Supervisor.start_link(children, opts)
  end

  defp cluster_topologies do
    [
      gossip: [
        strategy: Cluster.Strategy.Gossip
      ]
    ]
  end
end
Any ideas much appreciated!
I am doing something similar in dev. In production, however, I use a different strategy.
I have this in my runtime.exs:
topology_hosts =
  System.get_env("PARTNERS") ||
    raise """
    environment variable PARTNERS is missing, no cluster will be active
    """

config :gratwifi,
  topology: [
    ipc: [
      strategy: Cluster.Strategy.Epmd,
      config: [
        hosts: topology_hosts |> String.split(",") |> Enum.map(&String.to_atom/1)
      ]
    ]
  ]
And then in application.ex:
def start(_type, _args) do
  topologies = Application.get_env(:gratwifi, :topology)
  ...
  children =
    [
      ...
      {Cluster.Supervisor, [topologies, [name: GratWiFi.ClusterSupervisor]]},
      ...
    ]
The PARTNERS environment variable is set to a comma-separated list of all my nodes.
And, very importantly, access to the epmd port is restricted to these hosts on all my nodes.
The default for the gossip strategy is to use multicast, which is not necessarily supported by all networks, as mentioned further down in the docs you linked: Cluster.Strategy.Gossip — libcluster v3.4.1
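To make the network dependency visible, here is a sketch of a gossip topology with the network-related options written out explicitly. The option names come from the Cluster.Strategy.Gossip docs; the values are illustrative only and not necessarily the library defaults, so check them against the docs for your libcluster version.

```elixir
# Sketch only: values below are illustrative, not verified defaults.
topologies = [
  gossip: [
    strategy: Cluster.Strategy.Gossip,
    config: [
      # UDP port the heartbeats are sent to and received on.
      port: 45892,
      # Local address to bind the UDP socket to.
      if_addr: "0.0.0.0",
      # Multicast group the heartbeats are sent to. The network between
      # the nodes has to route multicast for peers to receive them.
      multicast_addr: "233.252.1.32",
      # How many network hops a heartbeat may travel.
      multicast_ttl: 1
    ]
  ]
]
```

If the network drops multicast traffic, each node can keep sending heartbeats without ever seeing its peers, so no automatic connection is made.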
Thanks for the example. The reason for choosing the gossip strategy was that we didn’t want to statically maintain a list of node names. We are using AWS ECS and will be employing auto-scaling in the future, so new instances will be spun up on demand and need to connect to the cluster.
In the meantime I tried the libcluster_postgres strategy and this works both locally and on AWS.
It’s perhaps overkill given the cluster will not span multiple regions, so I’ll have a look at the suggestion from @LostKobrakai about multicast support and see if I can get the gossip strategy to work.
Apparently AWS Fargate does not support UDP multicast or broadcast. I’ve not verified this for certain, but it would explain why the gossip strategy doesn’t work.
If all the nodes are resolvable through a single DNS name, you could use DNSCluster.
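A minimal sketch of what that could look like in the supervision tree, assuming the dns_cluster package is a dependency. The environment variable name `DNS_CLUSTER_QUERY` is just a convention chosen for this example, and its value would be the DNS name that resolves to all your instances.

```elixir
# Sketch only: goes in the children list of your application supervisor.
# DNS_CLUSTER_QUERY is an assumed variable name; when it is unset,
# :ignore tells DNSCluster not to start, which is handy in dev and test.
children = [
  {DNSCluster, query: System.get_env("DNS_CLUSTER_QUERY") || :ignore}
]
```

DNSCluster periodically resolves the name and tries to connect to a node at each returned IP, reusing the basename of the current node. This means each node has to be started with a name of the form basename@ip-address for the connections to succeed.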