Why do BEAM nodes try to connect to all other nodes in the network?

sezaru · September 25, 2021, 7:40pm

Hello,

Why does the BEAM try to connect each node to every other node in the network?

For example, let’s say I have nodes A, B, and C. A needs to talk with B, B needs to talk with A, and C only talks with B.

To do this, I configure libcluster with a topology that looks like this:

A <-------> B <-------> C

I expected A and B to be connected, and B and C too, but A and C actually connect to each other as well.
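For illustration only, here is a minimal sketch of what the libcluster config on node B could look like for the topology above. The original post does not show its actual config, so the Cluster.Strategy.Epmd strategy, the topology name, and the node names below are all assumptions.

```elixir
# config/runtime.exs on node B (hypothetical file location and node names)
import Config

config :libcluster,
  topologies: [
    # :example is an arbitrary topology name chosen for this sketch
    example: [
      # Epmd strategy: connect to a fixed list of known nodes
      strategy: Cluster.Strategy.Epmd,
      # B lists both of its neighbours; A and C would each list only B
      config: [hosts: [:"a@host-a", :"c@host-c"]]
    ]
  ]
```

Even with per-node host lists like this, the nodes still end up fully meshed, because the mesh is built by Distributed Erlang itself once the first connections exist.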

This is OK in my development environment, but in production I have stricter firewall rules, so A cannot reach C’s IP and vice versa.

Everything seems to work, but A and C still try to connect to each other, resulting in warnings like this in my log:

2021-09-25 18:47:47.463 [warn] #PID<0.3141.0> 
↳ global: :"candles@node1.candles.tip-off" failed to connect to :"rocket_dbs@node1.rocket_dbs.tip-off

So, why is this the default behavior of the BEAM? Can I disable or configure it? What are its advantages and disadvantages?

ityonemo · September 25, 2021, 10:32pm

One thing it is important for is obtaining cluster-wide global transactional locks with the :global module. If every node knows every other node, this is not a problem. With a more unusual topology, ensuring that all nodes are aware of the lock is not trivial; I don’t know what guarantees :global makes in that case.
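As a rough sketch of the kind of lock meant here, :global.trans/2 takes a lock on the nodes the caller is currently connected to, runs a function, and releases the lock. The resource name :my_resource and the return value :done are made up for this example.

```elixir
# Lock id is {resource_id, requester_id}; :my_resource is a hypothetical name.
lock_id = {:my_resource, self()}

# :global.trans/2 sets the lock on [node() | Node.list()], runs the fun,
# and then releases the lock. It returns the fun's result, or :aborted
# if the lock could not be acquired.
result =
  :global.trans(lock_id, fn ->
    # critical section: only one lock holder at a time runs this
    :done
  end)
```

Because the lock is only set on nodes the caller knows about, a node that is missing from that list would not see it, which is why a partial mesh makes the guarantees harder to reason about.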

I think that if your nodes have a heterogeneous topology, you should reconsider using Erlang clustering as a “service mesh”, or at least look into a different clustering protocol; the original use case for Erlang clustering is symmetrical redundancy, not a service mesh.

There have been attempts to do so (e.g. “partisan”). I think that project is very interesting but I worry that the abstraction is not quite the right one.

qhwa · September 25, 2021, 11:52pm

According to Erlang’s documentation,

Connections are by default transitive. If a node A connects to node B, and node B has a connection to node C, then node A also tries to connect to node C. This feature can be turned off by using the command-line flag -connect_all false, see the erl(1) manual page in ERTS.

and (as @ityonemo has mentioned)

-connect_all false

If this flag is present, global does not maintain a fully connected network of distributed Erlang nodes, and then global name registration cannot be used; see global(3).

So if you need to discover processes through :global, you will run into some issues.
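A sketch of how the quoted flag might be used from Elixir, assuming every node is started with it (for example `iex --sname b --erl "-connect_all false"`, or a `-connect_all false` line in the release's vm.args). The node names are hypothetical.

```elixir
# On node B: connect explicitly to the peers it needs (hypothetical names).
peers = [:"a@localhost", :"c@localhost"]

# Node.connect/1 returns true on success, false on failure,
# and :ignored when the local node is not running in distributed mode.
results = Enum.map(peers, &Node.connect/1)

# Inspect which nodes this node is currently connected to.
connected = Node.list()
```

With transitive connections disabled, A and C should no longer try to reach each other just because both are connected to B, at the cost of losing :global name registration as the documentation above states.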

On libcluster’s README, it says:

Features

Easy to use provide your own distribution plumbing (i.e. something other than Distributed Erlang), by implementing a small set of callbacks. This allows libcluster to support projects like Partisan.

I haven’t tried Partisan yet, but it does not work in “all to all” mode, and it provides some alternatives to the :global registry and process discovery. As I understand it, handling such network conditions is one of its selling points. But maybe you don’t need it, since A and C won’t talk to each other?