Deploying Elixir into ECS causes many "'global' at node :"xxxxx@10.0.X.X" requested disconnect from node :"xxxx@10.0.X.X" in order to prevent overlapping partitions" messages

We have an ECS cluster with 4 services, where each task joins a single Erlang cluster via ECS service discovery.

Currently, when I deploy, new tasks attempt to join the old cluster as they are being brought up (a potential red flag). I think (but am not sure) that this is what triggers the message above so often during deploys: once the new tasks reach a healthy state, the old tasks drop off, and because some nodes are notified of the disconnects before others, the cluster ends up with inconsistent membership views (overlapping partitions).

This is causing many issues, especially with Horde, which keeps trying to connect to the old nodes (something between libcluster and Horde not communicating correctly), so every deploy ends with our Horde registry being reset and processes being orphaned.
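For context, the wiring is roughly the usual libcluster + Horde setup sketched below (the MyApp.* names are placeholders, not our actual modules): libcluster manages node connections, and Horde is started with members: :auto so it follows cluster membership on its own.

# Simplified sketch of the supervision tree; MyApp.* names are placeholders.
topologies = Application.get_env(:libcluster, :topologies, [])

children = [
  # libcluster connects/disconnects nodes based on the poll results
  {Cluster.Supervisor, [topologies, [name: MyApp.ClusterSupervisor]]},
  # members: :auto makes Horde track cluster membership changes automatically
  {Horde.Registry, [name: MyApp.HordeRegistry, keys: :unique, members: :auto]},
  {Horde.DynamicSupervisor,
   [name: MyApp.HordeSupervisor, strategy: :one_for_one, members: :auto]}
]

Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)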

I’m not sure whether this is actually the issue or whether there is something else I’m missing. I’m using libcluster’s DNSPoll strategy (1-second polling interval) for node membership, on OTP 26.
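For reference, the topology config looks roughly like this (a sketch; the topology name, DNS query, and node basename below are made up, the real values come from our ECS service discovery setup):

# config/runtime.exs -- sketch; "my-service.my-namespace" and "my_app" are placeholders
config :libcluster,
  topologies: [
    ecs: [
      strategy: Cluster.Strategy.DNSPoll,
      config: [
        polling_interval: 1_000,          # the 1-second interval mentioned above
        query: "my-service.my-namespace", # ECS service discovery DNS name
        node_basename: "my_app"           # nodes are named my_app@10.0.X.X
      ]
    ]
  ]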

My questions are:

  1. Is this expected for this setup (dynamic cluster with nodes joining / leaving frequently) or am I missing something?
  2. I couldn’t find a concrete answer on this, but is it better to prevent the new nodes from joining the old cluster while they are being brought up (e.g. by ensuring they all run the same version, etc.)?

It might also be related to this warning, which starts roughly a minute before the disconnect messages (though it isn’t always present before them):

[warning] [libcluster:xxxx] unable to connect to :"xxxxx@10.0.X.X"

Which is weird, because it then connects straight afterwards. To rule out security group issues, I’ve also allowed the tasks to communicate with each other on every port.

If you don’t need :global, then I would recommend disabling prevent_overlapping_partitions:
https://www.erlang.org/doc/man/global#prevent_overlapping_partitions
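If you’re unsure whether anything in your app (or a dependency) actually registers names through :global, a quick check from an attached console is (the output shown is just an illustration):

:global.registered_names()
#=> []   # if this stays empty in practice, nothing is relying on :global registration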

Had exactly the same issue deploying in a Fargate ECS cluster with 2 tasks (each with one container running a Phoenix app using the libcluster DNS poller).

Besides disabling prevent_overlapping_partitions as recommended by @schneebyte, I also decreased the “Min running tasks %” in my ECS service to 50%, to ensure at least one task stays up to converge a replicated cache onto the newly created nodes.

That fixed it for my use case.

To give a little bit of ‘how to’ for newbies (like me) in Elixir clustering: if you are using releases and wish to disable the prevent overlapping partitions option, add a new line to rel/vm.args.eex (and rel/remote.vm.args.eex if you also want it applied to remote shells) with the following content:

-kernel prevent_overlapping_partitions false

If you are not using releases and need a quick fix, you can also set the environment variable ELIXIR_ERL_OPTIONS with this setting, e.g. ELIXIR_ERL_OPTIONS='-kernel prevent_overlapping_partitions false'.
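To confirm the flag actually took effect on a running node, you can read the :kernel parameter back, e.g. from a remote console (a small sketch; the result shown is for a node started with the flag above):

Application.get_env(:kernel, :prevent_overlapping_partitions)
#=> false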