Deploying Clustered Elixir in AWS ECS: Disconnecting and Reconnecting

I have a production application on Elixir 1.14.3 and OTP 25.2, running Phoenix and LiveView. Originally it was deployed to ECS using Fargate with only a single node. As we add more customers I need to deploy more nodes and connect them so live updates can be published to every user.

To that end, I’ve added libcluster using the DNSPoll strategy, and each new node in ECS registers with AWS Cloud Map for service discovery. The nodes are able to connect, but they keep disconnecting and reconnecting over and over.

2023-03-08T11:42:47.805-07:00	18:42:47.805 [info] [libcluster:dns] connected to :"portal@172.31.10.50"
2023-03-08T11:42:55.640-07:00	18:42:55.639 [warning] 'global' at node :"portal@172.31.11.97" requested disconnect from node :"portal@172.31.3.16" in order to prevent overlapping partitions
2023-03-08T11:42:55.641-07:00	18:42:55.640 [warning] 'global' at node :"portal@172.31.11.97" requested disconnect from node :"portal@172.31.7.141" in order to prevent overlapping partitions
2023-03-08T11:42:57.814-07:00	18:42:57.813 [info] [libcluster:dns] connected to :"portal@172.31.10.50"
2023-03-08T11:43:02.446-07:00	18:43:02.445 [warning] 'global' at node :"portal@172.31.11.97" requested disconnect from node :"portal@172.31.10.50" in order to prevent overlapping partitions
2023-03-08T11:43:02.446-07:00	18:43:02.446 [warning] 'global' at node :"portal@172.31.11.97" requested disconnect from node :"portal@172.31.3.16" in order to prevent overlapping partitions
2023-03-08T11:43:17.550-07:00	18:43:17.549 [warning] 'global' at node :"portal@172.31.11.97" disconnected node :"portal@172.31.7.141" in order to prevent overlapping partitions
2023-03-08T11:43:17.550-07:00	18:43:17.550 [warning] 'global' at node :"portal@172.31.11.97" disconnected node :"portal@172.31.3.16" in order to prevent overlapping partitions
2023-03-08T11:43:27.466-07:00	18:43:27.465 [warning] 'global' at node :"portal@172.31.11.97" disconnected node :"portal@172.31.7.141" in order to prevent overlapping partitions
2023-03-08T11:43:42.848-07:00	18:43:42.847 [info] [libcluster:dns] connected to :"portal@172.31.7.141"
2023-03-08T11:43:50.673-07:00	18:43:50.672 [warning] 'global' at node :"portal@172.31.11.97" requested disconnect from node :"portal@172.31.3.16" in order to prevent overlapping partitions
2023-03-08T11:43:50.673-07:00	18:43:50.672 [warning] 'global' at node :"portal@172.31.11.97" requested disconnect from node :"portal@172.31.7.141" in order to prevent overlapping partitions

This is the log output from a single container. I think the global disconnects are a symptom of the underlying problem: a node loses its connection to the other two nodes, and global then disconnects it to prevent overlapping partitions. It doesn’t seem to be limited to a single node; this happens on all of them.

The env.sh.eex file:

export PUBLIC_HOSTNAME=`curl ${ECS_CONTAINER_METADATA_URI}/task | jq -r ".Containers[0].Networks[0].IPv4Addresses[0]"`
export RELEASE_DISTRIBUTION=name
export RELEASE_NODE=<%= @release.name %>@${PUBLIC_HOSTNAME}
export RELEASE_COOKIE=monster
export REPLACE_OS_VARS=true

The relevant config from prod.exs for libcluster:

config :libcluster,
  topologies: [
    dns: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [polling_interval: 5_000, query: "portal.redacted-portal", node_basename: "portal"]
    ]
  ]
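
For reference, DNSPoll builds node names as node_basename@ip from the A records returned by the query, so one sanity check is to run the same lookup the strategy does from a remote iex session (bin/portal remote) on a task. A sketch, assuming the service discovery name resolves from inside the task:

# From a remote iex session on a running task.
# DNSPoll polls the A records for the configured `query` and tries to
# connect to :"portal@<ip>" for every address it gets back.
:inet_res.lookup(~c"portal.redacted-portal", :in, :a)
# e.g. [{172, 31, 10, 50}, {172, 31, 7, 141}, {172, 31, 3, 16}]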

I thought I made some progress by deploying to a single AWS AZ, but that didn’t hold up. I know the ports are open and working:

# nc -vz 172.31.7.141 4369
Connection to 172.31.7.141 4369 port [tcp/*] succeeded!
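
Worth noting that 4369 is only epmd; the actual distribution connection goes over a separate, dynamically assigned port that epmd hands out, and that port also has to be reachable between tasks. A sketch for checking which port a node’s distribution listener is actually using, from a remote iex session:

# Ask the local epmd which node names and distribution ports are
# registered on this host; this port must be open between tasks
# as well, not just 4369.
:net_adm.names()
# e.g. {:ok, [{'portal', 44225}]}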

I can manually connect inside the running container:

iex(portal@172.31.11.97)10> Node.list
[]
iex(portal@172.31.11.97)11> Node.connect :"portal@172.31.7.141"
true
iex(portal@172.31.11.97)12> Node.list
[:"portal@172.31.7.141"]

It seems like everything works except for the constant disconnects. If anyone has experienced this and has some pointers, I would greatly appreciate it.

Sounds like your nodes can’t all talk to each other, hence the partitions.

This article seemed like it might help: https://towardsaws.com/an-elixir-migration-to-microservices-in-aws-as-ecs-fargate-using-service-discovery-to-interconnect-91386a48dfa1#93ac

Create services and allow traffic

Creating the services is not enough on its own; they also need to be able to communicate with each other.

The following snippet creates the services and allows traffic between instances within each service AND between instances of the two services. Be careful to open only the ports you need. If the two services don’t have to communicate with each other, you can remove the first two “connections”.
//connect instances of both services

this.krillinService.connections.allowFrom(
    this.vegetaService,
    ec2.Port.allTcp(),
    `${props.prefix} krillin to vegeta`,
);
this.vegetaService.connections.allowFrom(
    this.krillinService,
    ec2.Port.allTcp(),
    `${props.prefix} vegeta to krillin`,
);
//connect instances within services
this.krillinService.connections.allowFrom(
    this.krillinService,
    ec2.Port.allTcp(),
    `${props.prefix} krillin to krillin`,
);
this.vegetaService.connections.allowFrom(
    this.vegetaService,
    ec2.Port.allTcp(),
    `${props.prefix} vegeta to vegeta`,
);
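
One thing to keep in mind when narrowing those rules: Erlang distribution needs epmd on 4369 plus a distribution listener port that is dynamically assigned by default, so allowing 4369 alone isn’t enough. If you want tighter security group rules than allTcp, the usual approach is to pin the listener to a known range with the inet_dist_listen_min / inet_dist_listen_max kernel parameters and allow only that range plus 4369.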

The default behaviour of global was changed in OTP 25:
https://www.erlang.org/doc/man/global.html

If you don’t need global or a fully connected cluster, you can just set prevent_overlapping_partitions to false.
Otherwise you have to make sure every node can connect to every other node.

Increasing net_setuptime might help as well.
And maybe the distribution buffer size (see prevent_overlapping_partitions in larger clusters · Issue #6214 · erlang/otp · GitHub).
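
For reference, prevent_overlapping_partitions and net_setuptime are :kernel parameters, so the usual place to set them for a release is rel/vm.args.eex (e.g. -kernel prevent_overlapping_partitions false -kernel net_setuptime 30, values here only as an illustration). A sketch for confirming what a running node actually picked up:

# From a remote iex session; returns nil when a parameter was never
# set explicitly and the OTP default applies.
Application.get_env(:kernel, :prevent_overlapping_partitions)
Application.get_env(:kernel, :net_setuptime)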

@schneebyte thanks for the response. I had already set prevent_overlapping_partitions to false, but the global disconnects were only a symptom of the underlying node disconnects. I had also upped net_setuptime and it wasn’t helping.

@BradS2S I had seen this but avoided adding CDK, since I didn’t want to introduce that complexity just yet and was hoping to prove out the idea using the AWS console. In the end, Felipe’s example worked, and I was able to adapt it to my needs with minimal work; it solved the problem.

Comparing the resources the CDK code created with what I had created by hand, I can’t tell what was different, but something is. The new cluster is rock solid and working great.

Thanks everyone for taking time to reply. I sincerely appreciate it.
