I have a production application on Elixir 1.14.3 and OTP 25.2, running Phoenix and LiveView. It was originally deployed to ECS on Fargate as a single node. As we add more customers I need to run more nodes and have them clustered so live updates can be published to every user.
To that end, I’ve added libcluster with the DNSPoll strategy, and each new ECS task registers with AWS Cloud Map for service discovery. The nodes appear to connect, but then they keep disconnecting and reconnecting over and over.
2023-03-08T11:42:47.805-07:00 18:42:47.805 [info] [libcluster:dns] connected to :"portal@172.31.10.50"
2023-03-08T11:42:55.640-07:00 18:42:55.639 [warning] 'global' at node :"portal@172.31.11.97" requested disconnect from node :"portal@172.31.3.16" in order to prevent overlapping partitions
2023-03-08T11:42:55.641-07:00 18:42:55.640 [warning] 'global' at node :"portal@172.31.11.97" requested disconnect from node :"portal@172.31.7.141" in order to prevent overlapping partitions
2023-03-08T11:42:57.814-07:00 18:42:57.813 [info] [libcluster:dns] connected to :"portal@172.31.10.50"
2023-03-08T11:43:02.446-07:00 18:43:02.445 [warning] 'global' at node :"portal@172.31.11.97" requested disconnect from node :"portal@172.31.10.50" in order to prevent overlapping partitions
2023-03-08T11:43:02.446-07:00 18:43:02.446 [warning] 'global' at node :"portal@172.31.11.97" requested disconnect from node :"portal@172.31.3.16" in order to prevent overlapping partitions
2023-03-08T11:43:17.550-07:00 18:43:17.549 [warning] 'global' at node :"portal@172.31.11.97" disconnected node :"portal@172.31.7.141" in order to prevent overlapping partitions
2023-03-08T11:43:17.550-07:00 18:43:17.550 [warning] 'global' at node :"portal@172.31.11.97" disconnected node :"portal@172.31.3.16" in order to prevent overlapping partitions
2023-03-08T11:43:27.466-07:00 18:43:27.465 [warning] 'global' at node :"portal@172.31.11.97" disconnected node :"portal@172.31.7.141" in order to prevent overlapping partitions
2023-03-08T11:43:42.848-07:00 18:43:42.847 [info] [libcluster:dns] connected to :"portal@172.31.7.141"
2023-03-08T11:43:50.673-07:00 18:43:50.672 [warning] 'global' at node :"portal@172.31.11.97" requested disconnect from node :"portal@172.31.3.16" in order to prevent overlapping partitions
2023-03-08T11:43:50.673-07:00 18:43:50.672 [warning] 'global' at node :"portal@172.31.11.97" requested disconnect from node :"portal@172.31.7.141" in order to prevent overlapping partitions
This is the log output from a single container. I think the 'global' disconnect is a symptom rather than the cause: a node loses its connection to the other two nodes, and 'global' then disconnects it from the cluster to prevent overlapping partitions. It isn't limited to a single node; this happens on all of them.
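From the warnings, my understanding is that this is the 'global' overlapping-partition protection that defaults to on in recent OTP 25 releases: when 'global' detects a partial partition it actively disconnects nodes to keep the cluster fully meshed. For reference, it can apparently be disabled with a kernel flag, e.g. in rel/vm.args.eex, although that would only hide the symptom rather than fix whatever is breaking the underlying connections (untested sketch):
# turn off global's overlapping-partition protection (masks the symptom, not the cause)
-kernel prevent_overlapping_partitions false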
The env.sh.eex file:
export PUBLIC_HOSTNAME=`curl ${ECS_CONTAINER_METADATA_URI}/task | jq -r ".Containers[0].Networks[0].IPv4Addresses[0]"`
export RELEASE_DISTRIBUTION=name
export RELEASE_NODE=<%= @release.name %>@${PUBLIC_HOSTNAME}
export RELEASE_COOKIE=monster
export REPLACE_OS_VARS=true
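For context, the jq filter pulls the task's private IPv4 out of the ECS container metadata /task response, which, trimmed to just the fields the filter touches (values illustrative), looks roughly like:
{
  "Containers": [
    { "Networks": [ { "IPv4Addresses": ["172.31.11.97"] } ] }
  ]
}
So each node name ends up as portal@<task private IP>, matching the node names in the logs above.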
The relevant libcluster config from prod.exs:
config :libcluster,
  topologies: [
    dns: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [polling_interval: 5_000, query: "portal.redacted-portal", node_basename: "portal"]
    ]
  ]
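As a sanity check, the A records the DNSPoll strategy sees can be compared against the node list from a remote shell on a running task (a sketch; ./bin/portal remote assumes the release is named portal, and the query string is the same one used in the config above):
# inside a running container: ./bin/portal remote
Node.self()
:inet_res.lookup(~c"portal.redacted-portal", :in, :a)   # should return one A record per task
Node.list()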
I thought I had made some progress by deploying everything into a single AWS availability zone, but that didn't hold up. I know the ports are open and reachable between tasks:
# nc -vz 172.31.7.141 4369
Connection to 172.31.7.141 4369 port [tcp/*] succeeded!
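That said, 4369 is only epmd; the actual distribution traffic uses a dynamically assigned port unless it is pinned. The manual Node.connect below suggests that port is reachable too, but for completeness it could be fixed with the standard kernel flags, e.g. in rel/vm.args.eex (9000 is an arbitrary choice):
# pin the Erlang distribution listener to a single port (sketch)
-kernel inet_dist_listen_min 9000
-kernel inet_dist_listen_max 9000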
I can also connect manually from an IEx shell inside a running container:
iex(portal@172.31.11.97)10> Node.list
[]
iex(portal@172.31.11.97)11> Node.connect :"portal@172.31.7.141"
true
iex(portal@172.31.11.97)12> Node.list
[:"portal@172.31.7.141"]
Everything seems to work except for the constant disconnects. If anyone has run into this and has some pointers, I would greatly appreciate it.