Libcluster intermittent disconnects - what can I configure to create a more reliable cluster?

Hi,

I’m currently running an Elixir application that uses Phoenix channels. I’ve clustered the application using libcluster, but I get intermittent warnings about node disconnects. This is an example of the message I’m getting:

[warning] 'global' at node :"app name@ip" disconnected node :"app name@ip" in order to prevent overlapping partitions

I’ve also had users report de-sync issues, which makes me believe the disconnects are causing real problems.

I’ve deployed my application on Fly.io and it seems that the nodes that are farther from every other node disconnect more often. For instance, most of my servers are in North America and Europe but my node in South America disconnects quite frequently, with my node in Australia being a close second. I’m running 20 nodes in total.

Basically, what my question boils down to is: what can I do about this? Is there a configuration that would allow nodes to wait longer for a response from other nodes? I know there are Erlang flags, but I wanted to ask here first whether those would be a good idea and how I could go about adding them in a Dockerfile (which Fly uses to deploy the application).

Sorry for the long post and any advice would be much appreciated :slight_smile:

Edit: just fixed some formatting in the post

1 Like

It is my understanding that distributed Erlang is not really built for geographically distributed clusters by default. These connections are not (as you have observed) the most reliable, and this leads to partitioning and other problematic behavior.

20 nodes seems pretty high; can you talk about the load that you see?

Did you check net_ticktime? Erlang -- kernel

1 Like

So would Redis clustering be the way to go?

I have 20 nodes because Fly doesn’t have auto-scaling for their Apps v2 platform; you have to pre-allocate your machines.

Cool, would this be configured inside the runtime file in the Elixir app?

Yes, something like this should work:

config :kernel,
  net_ticktime: 120
4 Likes

Thanks, I did try that but I get the following:

Cannot configure base applications: [:kernel]

These applications are already started by the time the configuration
executes and these configurations have no effect.

If you want to configure these applications for a release, you can
specify them in your vm.args file:

-kernel config_key config_value

Alternatively, if you must configure them dynamically, you can wrap
them in a conditional block in your config files:

if System.get_env("RELEASE_MODE") do
  config :kernel, ...
end

and then configure your releases to reboot after configuration:

releases: [
  my_app: [reboot_system_after_config: true]
]

This happened when loading config/config.exs or
one of its imports.

I’m going to try and set it through the vm.args.eex file. Is there a command I can use with iex to check the net_ticktime? I’ve been doing some searching but nothing is jumping out at me.
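Based on the -kernel config_key config_value format from that error message, I’m assuming the line in rel/vm.args.eex would look something like this (using the 120-second value from the earlier example):

## rel/vm.args.eex
-kernel net_ticktime 120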

It appears that nodes just lose their connection from time to time.

This is what the global source code says about it:

%% ----------------------------------------------------------------
%% Prevent Overlapping Partitions Algorithm
%% ========================================
%%
%% 1. When a node lose connection to another node it sends a
%%    {lost_connection, LostConnNode, OtherNode} message to all
%%    other nodes that it knows of.
%% 2. When a lost_connection message is received the receiver
%%    first checks if it has seen this message before. If so, it
%%    just ignores it. If it has not seen it before, it sends the
%%    message to all nodes it knows of. This in order to ensure
%%    that all connected nodes will receive this message. It then
%%    sends a {remove_connection, LostConnRecvNode} message (where
%%    LostConnRecvNode is its own node name) to OtherNode and
%%    clear all information about OtherNode so OtherNode wont be
%%    part of ReceiverNode's cluster anymore. When this information
%%    has been cleared, no lost_connection will be triggered when
%%    a nodedown message for the connection to OtherNode is
%%    received.
%% 3. When a {remove_connection, LostConnRecvNode} message is
%%    received, the receiver node takes down the connection to
%%    LostConnRecvNode and clears its information about
%%    LostConnRecvNode so it is not part of its cluster anymore.
%%    Both nodes will receive a nodedown message due to the
%%    connection being closed, but none of them will send
%%    lost_connection messages since they have cleared information
%%    about the other node.
%%
%% This will take down more connections than the minimum amount
%% of connections to remove in order to form fully connected
%% partitions. For example, if the user takes down a connection
%% between two nodes, the rest of the nodes will disconnect from
%% both of these nodes instead of just one. This is due to:
%% * We do not want to partition a remaining network when a node
%%   has halted. When you receive a nodedown and/or lost_connection
%%   messages you don't know if the corresponding node has halted
%%   or if there are network issues.
%% * We need to decide which connection to take down as soon as
%%   we receive a lost_connection message in order to prevent
%%   inconsistencies entering global's state.
%% * All nodes need to make the same choices independent of
%%   each other.
%% 
%% ----------------------------------------------------------------

So, the real question is: are you sure that the connection between nodes is not being lost from time to time? If you’re sure, what tool do you use to check it?

2 Likes

If you don’t need global then I would set prevent_overlapping_partitions to false: Erlang -- kernel

And depending on how many messages you’re sending between nodes, you might want to increase the distribution buffer busy limit as well; the default is only 1 MB:
https://www.erlang.org/doc/man/erl.html#+zdbbl

Also net_setuptime: Erlang -- kernel
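In a release’s vm.args those would look roughly like this (the values are illustrative examples to show the units, not recommendations):

## kernel parameters: disable the overlapping-partitions check and allow
## more time (in seconds) for connection setup
-kernel prevent_overlapping_partitions false
-kernel net_setuptime 15
## emulator flag: distribution buffer busy limit in kilobytes (32 MB here)
+zdbbl 32768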

You can use Application.get_env(:kernel, :net_ticktime) to check the value.
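For example, in a remote iex session (the 120 here assumes it was set via vm.args as above); :net_kernel.get_net_ticktime/0 is an alternative that reports the value the runtime is actually using:

iex> Application.get_env(:kernel, :net_ticktime)
120
iex> :net_kernel.get_net_ticktime()
120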

1 Like

Thanks everyone for the tips. I’m testing the different configurations and adding a little more monitoring so I can more easily see improvements. I’ll update again once I’ve seen some improvements :slight_smile:

Also very interested in the ideal global config.

We see a lot of packet loss between regions on Fly depending on the region and have managed to only get 4 regions (8 nodes) running somewhat reliably.

Had to build latency monitoring. See: Status - Supabase Realtime

It logs latencies to our logging infra too if they are over a threshold.
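The basic idea is just timing a round trip to each connected node and logging anything over a threshold; a minimal sketch (not our actual implementation, the module name and numbers are made up):

defmodule LatencyPing do
  require Logger

  # Time a trivial RPC round trip to a node, in milliseconds.
  def latency_ms(node) do
    {micros, _result} = :timer.tc(fn -> :rpc.call(node, :erlang, :node, []) end)
    micros / 1000
  end

  # Measure every connected node and log the ones over the threshold.
  def check(threshold_ms \\ 500) do
    for node <- Node.list() do
      ms = latency_ms(node)
      if ms > threshold_ms, do: Logger.warning("high latency to #{node}: #{round(ms)}ms")
      {node, ms}
    end
  end
end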

Note: the default PubSub adapter sends all messages to all nodes.

The Redis adapter would route everything through one box which would negate the whole global cluster idea.

2 Likes

That dashboard is super neat!

Nice to know that I’m not the only one running into these sorts of issues. I’m making config adjustments this week, but I’m also wondering if using Partisan with libcluster would be the way to go? It doesn’t seem like a lot of people have done this, though, and I don’t think I have the know-how to do this in Elixir yet. Though libcluster does mention that you can use Partisan with it.

Partisan won’t solve your problem. Your problem is caused by disconnects, or by latency high enough to be treated as a disconnect. That can’t be solved by a library or any tweak in ERTS, which means the problem is architectural. So the real question is what you use the distribution for and what consistency model you want to have.

For example, global splits the cluster when the full-mesh topology is broken. But global is meant for CP scenarios, where we simply leave the cluster unavailable when a partition occurs. Do you actually need global? What do you use it for?

I’m only considering Partisan because of the following from the README:

Also, Erlang conflates control plane messages with application messages on the same TCP/IP connection and uses a single TCP/IP connection between two nodes, making it liable to the Head-of-line blocking issue. This also leads to congestion and contention that further affects latency.

This model might scale well for datacenter deployments, where low latency can be assumed, but not for geo-distributed deployments, where latency is non-uniform and can show wide variance.

It was just a thought, I’m not considering that as an experiment until I’ve run dry on config changes.

I need global because it’s a Phoenix channel application using Phoenix.PubSub, so it uses distributed Elixir; users need to be able to see events from any other user no matter which server they are connected to.

Neither phoenix nor phoenix_pubsub uses global. If you’re using global, what are you using it for? Can you provide a snippet, please?

I don’t use it in my application.

I can only assume it’s being used by libcluster.

Edit: sorry for the terseness, I was in a rush to reply. I’m pretty sure this is part of libcluster.

It is not used in libcluster. I’ve grepped libcluster, phoenix and phoenix_pubsub, and it is not used in any of them. Perhaps it might be worth grepping your deps folder to find calls to global.

We’ve considered Partisan and I think it would help, but if one TCP connection is bad I’m assuming they all are usually.

We don’t use global. We do a lot of RPC calls for things and have to expect them to time out, not because the function call is heavy but because the network stalls out.

The PG2 PubSub adapter just sends stuff, and I’ve thought about using my network pings to basically route around bad connections with a send_via which would proxy messages to a destination node through another node where I have a good connection.

Not sure if this is a good idea.
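Something like this, roughly (hypothetical module and names, just the shape of the idea):

defmodule SendVia do
  # Proxy a message to a registered process on `dest` by running the final
  # send on `relay`, a node we currently have a healthy connection to.
  def send_via(relay, dest, name, msg) do
    Node.spawn(relay, fn -> send({name, dest}, msg) end)
  end
end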

Or use some other server as a PubSub adapter, one built inherently for global distribution. I’ve naively thought about NATS, but also Bondy, which was built by the guy who maintains Partisan.

He graciously had a long discussion with me and is actively working on it.

Maybe he’s here @aramallo!

1 Like

I am, @chasers! :blush:

I think the way Partisan could help in this situation is exactly what @chasers suggests. However, this can only be accomplished when using the HyParView topology, which maintains a partial mesh that (eventually and probabilistically) forms a fully connected graph and uses transitive message transmission.

Using HyParView rules out the possibility of using Partisan’s OTP behaviours at the moment (I’m working on solving this next; the key issue is how to implement monitoring).

Also notice that using Partisan rules out using Phoenix as it relies on disterl and OTP. For Phoenix to work we would need to fork it and teach it how to use Partisan and Partisan’s OTP behaviours.

I have plans to test Partisan itself and Bondy (which uses Partisan) in Fly.io for this very same reason.

2 Likes

If you’re talking about pubsub then forking is not needed. PubSub already supports adapters for different backends, e.g. there’s a Redis-based backend for Heroku users.
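Swapping it in is a supervision-tree change rather than a fork; something along these lines (option names here are from memory, so check the phoenix_pubsub_redis docs):

# in application.ex, instead of the default PG adapter
{Phoenix.PubSub,
 name: MyApp.PubSub,
 adapter: Phoenix.PubSub.Redis,
 host: "redis.example.com",
 node_name: System.get_env("NODE")}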

3 Likes