Process distribution, communication and deployment

agiuliano · July 11, 2018, 8:54pm

Hi guys, I’m exploring Elixir for high demanding systems and I’m trying to figure out some scenario I face every day with the software I manage.
When I deploy an Elixir app, I can connect nodes together to create a cluster: is there any limitation on the number of nodes that can connect to each other?
My understanding is that nodes, to check other nodes are alive, send small messages, a kind of heartbeat protocol: does this have impact on big clusters? I’m scared that this communication will also start to be visible in terms of resources the bigger my fleet is. What are the consequences in such an environment?
When processes are connected, I can use OTP to send messages among them, messages will be delivered on the remote node possibly. If I’m deploying the receiver node, I will get a failure in the sender node. Is this a situation I have to managed by myself, or are there any tools I can leverage?
I was reading an interesting post regarding message passing across nodes: sending messages real time in system with millions of requests per second is realistic with Elixir/OTP?

Thank you

idi527 · July 12, 2018, 3:00am

Yes, according to some anecdotal evidence from the heavy users, it’s about 60-200 nodes for the default “distributed erlang” topology (mesh). To go beyond that, check out either partisan [1] [2] or scalable distributed erlang (don’t know if it’s usable yet).

[1] GitHub - lasp-lang/partisan: High-performance, high-scalability distributed computing for the BEAM.
[2] [1802.02652] Partisan: Enabling Cloud-Scale Erlang Applications

agiuliano · July 12, 2018, 7:45am

Interesting, thank you for the reference. I was also reading this paper that seems highlighting the same issues. I have a question on this actually: I understand that registering global names is expensive, and can bring to “stop of the world” behaviors.
But why just by connecting with other nodes also degrades the performance? Supposing that I will send messages using consistent hashing directly to the interested node and I won’t have to register any global process

chrismcg · July 12, 2018, 9:19am

BEAM clusters are fully connected so when you connect a node into a cluster it connects to every other node and gossips with them. From the paper you linked to:

First, maintaining a fully connected mesh of Erlang nodes means that a system with n nodes must maintain O(n^2) active TCP/IP connections and this induces significant network traffic above 40 nodes

agiuliano · July 12, 2018, 11:06am

Thank you Chris, yeah I know that, but I was wondering: when a node connects to the entire cluster, that node actually has n-1 connection to the other nodes. So let’s say in a cluster of 3000 nodes each node maintains 2999 connections. Keeping just the connections alive should be such a problem but for performing heartbeat can use some resource. What’s not clear to me is how this can impact the performance on that host? In theory heartbeat should be done with very small messages that are not been sent every microsecond right?
So if that’s the case, what’s the underlying reason why performance degrades?

chrismcg · July 12, 2018, 11:31am

I am trying to find a reference for this but I don’t have one handy so this may be incorrect. I believe this is because everything goes over the same TCP connection. So every node, every 5 seconds, will send 2999 messages. That’s around 15M messages on the network and the TCP connections can’t do anything else while they’re waiting for a response.

tty · July 12, 2018, 12:23pm

Yes this is what goes on. Granted if your connection is large enough it won’t be an issue, however most large clusters partition the nodes into groups to minimize the mesh interconnections.

OvermindDL1 · July 12, 2018, 10:29pm

For note, you can set up stealth connections, so as long as you have 2+ distinct networks with well defined communication points across them then they won’t combine into single large meshes, so you can scale that way to a lot more than ~200 nodes. Though honestly if you are doing that then you might want a custom protocol and proper API interfaces anyway.

agiuliano · July 16, 2018, 6:36am

Thank you guys. Of course there are different solution to scale “better” one of this is to have a custom protocol or API to call on a specific host in the cluster. But one of the most interesting part in OTP was to have a mesh of nodes and talk with them transparently. One thing that is not clear is: why beams have to chat on the same TCP connection on where messages are sent? It’s just to prevent that a message is sent over the wire and the beam doesn’t know if the destination node is alive or not?

tty · July 16, 2018, 9:05am

I’m not sure why but it has been often requested over the years to have a command/data channel separation. This is especially true once you throw Mnesia into the mix and pass around large datasets.