hubertlepicki
Distributed application - network split
So I have been investigating the feasibility of employing Erlang’s native distributed application architecture (Distributed Applications — Erlang System Documentation v29.0.2) for a project. I have done some research but I do lack hands on experience.
I am mostly interested in built in failover and takeover mechanisms, I think it will work just fine in my case. The problem are network splits.
I will be running the cluster in the cloud (EC2) and, I am afraid, the connectivity between those two nodes is not as guaranteed as it would be if the nodes were connected by a physical cable in the same date center. The connectivity will and does go down on occasion. The default dist_ac’s behavior is less then ideal:
- nodes disconnect from each other
- they keep running as if they were single nodes in the cluster.
To make things even worse, re-joining the cluster afterwards does not resolve the situation and failover / takeover no longer take effect.
I can detect that a node was disconnected from the cluster, but that’s about it I can do. I can’t find a way to recover from this situation - other than shutting down and starting the whole cluster from scratch.
Any recommendations you may have?
Most Liked Responses
slashdotdash
You can use the Swarm library which provides a distributed process registry. I contributed changes to make it CP, rather than AP, so that it remains consistent during a network partition. This is achieved by using a static quorum distribution strategy as described by @CptnKirk. These changes have not yet been published to Hex and are available from the master branch in GitHub only. I plan to ask Paul to release v3.1.
During a network partition, processes are redistributed so that only one instance of each named registered process is running on the cluster. Those on the wrong side of the split are stopped. When the cluster reforms the processes are redistributed and you can handle migrating process state per GenServer (discard, hand-off, etc.).
slashdotdash
Someone’s written an EC2 cluster strategy for libcluster: GitHub - kyleaa/libcluster_ec2 · GitHub
hubertlepicki
That’s correct but often the heavy workers I have “pleasure” to work with are not confined within BEAM. Every time they do video or image processing, or spawning headless crawling web browsers for example - either external binaries or external C libraries are called and this actually has an impact on the rest of the system.
This is in fact the reason why the worker nodes would ideally be isolated.








