Distributed application - network split

So I have been investigating the feasibility of employing Erlang's native distributed application architecture (http://erlang.org/doc/design_principles/distributed_applications.html) for a project. I have done some research, but I lack hands-on experience.

I am mostly interested in the built-in failover and takeover mechanisms, which I think will work just fine in my case. The problem is network splits.
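For reference, failover and takeover order is driven by the kernel's distributed-application configuration. A minimal sketch in Elixir config syntax (the application and node names are made up; in practice these entries have to end up in the release's sys.config so they are set before the kernel boots):

```elixir
import Config

# Sketch only: :myapp and the node names are placeholders.
# :myapp starts on a@host1; if that node dies, it is restarted on
# b@host2 or c@host3 after the 5-second timeout (failover), and moves
# back when a@host1 returns (takeover).
config :kernel,
  distributed: [{:myapp, 5_000, [:"a@host1", {:"b@host2", :"c@host3"}]}],
  sync_nodes_optional: [:"b@host2", :"c@host3"],
  sync_nodes_timeout: 30_000
```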

I will be running the cluster in the cloud (EC2) and, I am afraid, the connectivity between the nodes is not as reliable as it would be if they were connected by a physical cable in the same data center. The connectivity will and does go down on occasion. The default dist_ac behavior is less than ideal:

  • nodes disconnect from each other
  • they keep running as if each were the only node in the cluster.

To make things even worse, re-joining the cluster afterwards does not resolve the situation, and failover/takeover no longer takes effect.

I can detect that a node was disconnected from the cluster, but that's about all I can do. I can't find a way to recover from this situation - other than shutting down and restarting the whole cluster from scratch.

Any recommendations you may have?


Split-brain scenarios like these are a problem for stateful distributed systems. The dirt-simple solution is to detect a split by noticing that your current cluster has fewer than (total / 2) + 1 nodes in it. If so, you're on the wrong side of the split. Shut yourself down and let something like heart start you back up. Then wait to rejoin the cluster using the normal cluster-joining rules you've defined for your app.
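A minimal sketch of that check, assuming a three-node cluster and a made-up module name (not taken from any library):

```elixir
# Sketch only: QuorumWatch and @cluster_size are assumptions for illustration.
defmodule QuorumWatch do
  use GenServer

  @cluster_size 3  # assumed total number of nodes in the cluster

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(_) do
    # receive {:nodeup, node} / {:nodedown, node} messages
    :net_kernel.monitor_nodes(true)
    {:ok, nil}
  end

  @impl true
  def handle_info({:nodedown, _node}, state) do
    # this node plus every node it can still see
    visible = length(Node.list()) + 1

    if visible < div(@cluster_size, 2) + 1 do
      # minority side of the split: stop the VM and let heart restart it,
      # after which the normal cluster-joining logic runs again
      :erlang.halt()
    end

    {:noreply, state}
  end

  def handle_info(_msg, state), do: {:noreply, state}
end
```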

A more complicated solution might involve shutting down only the request-processing tree of processes responsible for handling requests that could cause damage, leaving a stateless tree running for better administration and triage.

The biggest problem with distribution is state. If your cluster needs distributed state (which is not always the case), then you will probably need some distribution strategy.

For example, if you want the state of the cluster to always be available, then every time something gets written to that state, you would replicate it. Then you would probably need a consensus algorithm (Raft, Paxos), a distributed hash ring and other tricks to keep everything stable. This is the case for something like Lasp or Riak Core.

But if you can keep your state in a DB (even your cache, authentication and whatnot), then that state is not your application's worry. That is why there is no readily available solution for all use cases.

All true, but none of that works if you can't detect changes in topology, especially in split-brain situations where you might end up with two owners of the same resource and start to corrupt that shared DB. It sounds like he's having problems at that level: getting the join/leave/master-election stuff working.

Yep, you are right! My first read led me to believe this was the first step that would eventually end up at the distributed state problem. But that may not be the case \o/

You can use the Swarm library which provides a distributed process registry. I contributed changes to make it CP, rather than AP, so that it remains consistent during a network partition. This is achieved by using a static quorum distribution strategy as described by @CptnKirk. These changes have not yet been published to Hex and are available from the master branch in GitHub only. I plan to ask Paul to release v3.1.

During a network partition, processes are redistributed so that only one instance of each named registered process is running on the cluster. Those on the wrong side of the split are stopped. When the cluster re-forms, the processes are redistributed and you can handle migrating process state per GenServer (discard, hand-off, etc.).
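If you try the master branch, enabling the static quorum strategy is a configuration change along these lines (the quorum size of 3 is just an example for a 3- to 5-node cluster; check the README on master for the exact keys):

```elixir
import Config

# Sketch of enabling the static quorum distribution strategy in Swarm;
# pick a static_quorum_size that matches your expected cluster size.
config :swarm,
  distribution_strategy: Swarm.Distribution.StaticQuorumRing,
  static_quorum_size: 3
```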


Yes, I'm not looking at it from the distributed-state point of view at all just yet. As I originally pointed out, the objective is being able to fail over to backup nodes and recover from failure.

My current thinking is that the built-in dist_ac is not suitable for any cloud-based deployment, as it was designed for co-located servers. It does not handle network splits well - they result in brief freezes of nodes - and afterwards it can't re-join the cluster on its own and restore the topology of apps running on top of the cluster.

I am now experimenting with @bitwalker's libcluster (https://hexdocs.pm/libcluster/readme.html) as a low-level cluster manager, with possibly Swarm running on top, or a custom solution. I think I'd also need to implement an EC2 clustering strategy.
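As a starting point, the topology configuration with one of the built-in strategies looks something like this (node names are invented; an EC2-aware strategy module would replace Cluster.Strategy.Epmd):

```elixir
import Config

# Sketch only: node names are invented, and Cluster.Strategy.Epmd stands in
# for whatever EC2-aware strategy module ends up being used.
config :libcluster,
  topologies: [
    my_cluster: [
      strategy: Cluster.Strategy.Epmd,
      config: [hosts: [:"app@10.0.0.1", :"app@10.0.0.2", :"app@10.0.0.3"]]
    ]
  ]
```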

Someone’s written an EC2 cluster strategy for libcluster: https://github.com/kyleaa/libcluster_ec2


Perfect, thank you :).

If I go with Swarm, I'd have to make some architectural changes, however. The unit being distributed across nodes is in this case a process, rather than an OTP application.

I also can't seem to find a way to specialize nodes. For example, I want to run my web app on some nodes and my heavy worker processes that crunch numbers on others. If I understand Swarm correctly, I'd have to run everything on all nodes, or create separate clusters for the web interface and the worker backend.

Swarm doesn't support node specialisation, but that seems like a good feature to build. You could have configuration mapping a named cluster type (web, worker) to an individual cluster config and then specify the type when registering a process.

Currently you would need separate clusters to restrict where processes can run.
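Outside of Swarm, one workaround is to gate which parts of the supervision tree start on each node. A rough sketch, with made-up module names and an assumed :node_role config key set per node:

```elixir
# Sketch only: MyApp.*, MyAppWeb.Endpoint and :node_role are placeholders.
defmodule MyApp.Application do
  use Application

  @impl true
  def start(_type, _args) do
    # each node sets :node_role (e.g. in its release config) to :web or :worker
    role = Application.get_env(:my_app, :node_role, :web)

    children =
      case role do
        :web    -> [MyAppWeb.Endpoint]
        :worker -> [MyApp.WorkerSupervisor]
      end

    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```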

One point to take into consideration here: the BEAM is really, really good at keeping everything responsive under stress. Let's say you start a heavy worker process that distributes load across your cluster and starts competing for resources with your web app. The impact on response times for, say, REST calls wouldn't be big. That's because the BEAM is one of the few truly preemptive scheduling systems I've ever seen. Depending on the feature, you could rely on this and simply distribute the load across your cluster for all your processes just the same.

I have no links to point you to, but I've personally done some synthetic benchmarks comparing the BEAM and Node.js under stress. Suffice it to say that the BEAM under load (response times under heavy loads of requests) beats Node pretty badly once the machines are around the 90% mark (even a little lower than that). The BEAM had practically the same response times up until 99% of stress.

That's correct, but the heavy workers I often have the "pleasure" to work with are not confined within the BEAM. Every time they do video or image processing, or spawn headless web browsers for crawling, for example, either external binaries or external C libraries are called, and this actually has an impact on the rest of the system.

This is in fact the reason why the worker nodes would ideally be isolated.


I have the same problem. I get all the nodes from the sys.config file and check whether any lower-priority node can connect to any higher-priority node; if it can, I use erlang:halt() to kill it. If I stop the app on the master server, it fails over to the standby. If I start the app on the master, it takes over and the standby stops; this works fine if I use it as optional, and it fails over again if the master stops. But I can't get it to work so that the app fails over when the master crashes - heart gets in the way. It only works if the master has no heart. Any advice?
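In sketch form (the module name and config key are simplified for illustration), the check looks something like this:

```elixir
# Sketch only: PriorityCheck and the :priority_nodes key are illustrative names.
defmodule PriorityCheck do
  @doc """
  Nodes are listed in priority order in the application environment
  (ultimately coming from sys.config). A node halts itself if any
  higher-priority node is reachable.
  """
  def maybe_halt do
    nodes = Application.get_env(:my_app, :priority_nodes, [])
    higher_priority = Enum.take_while(nodes, &(&1 != Node.self()))

    if Enum.any?(higher_priority, &(Node.ping(&1) == :pong)) do
      :erlang.halt()
    end
  end
end
```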