Hi guys, I’m exploring Elixir for highly demanding systems and I’m trying to figure out how it handles some scenarios I face every day with the software I manage.
When I deploy an Elixir app, I can connect nodes together to create a cluster: is there any limitation on the number of nodes that can connect to each other?
My understanding is that nodes send small messages to check that other nodes are alive, a kind of heartbeat protocol: does this have an impact on big clusters? I’m worried that this communication will become more visible in terms of resources the bigger my fleet gets. What are the consequences in such an environment?
When nodes are connected, I can use OTP to send messages between processes, possibly to a process on a remote node. If I’m redeploying the receiver node, I will get a failure on the sender node. Is this a situation I have to manage myself, or are there tools I can leverage?
I was reading an interesting post regarding message passing across nodes: is sending messages in real time, in a system with millions of requests per second, realistic with Elixir/OTP?
Yes, according to some anecdotal evidence from heavy users, it’s about 60-200 nodes for the default “Distributed Erlang” topology (full mesh). To go beyond that, check out either Partisan or Scalable Distributed Erlang (I don’t know if it’s usable yet).
Interesting, thank you for the reference. I was also reading this paper, which seems to highlight the same issues. I actually have a question on this: I understand that registering global names is expensive and can lead to “stop the world” behavior.
But why does just connecting to other nodes also degrade performance? Suppose I send messages directly to the relevant node using consistent hashing, and I don’t have to register any global process.
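For reference, the routing idea mentioned here could look roughly like the following. It is a Python sketch rather than Elixir, and the `HashRing` class, the node names, and the virtual-node count are all hypothetical; it only illustrates picking a target node from a key without any global registration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash derived from SHA-256 (the choice of hash is an assumption).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        # Each node is placed on the ring at `vnodes` pseudo-random points.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["app@host1", "app@host2", "app@host3"])
print(ring.node_for("user:42"))
```

With a ring like this, removing one node only remaps the keys that node owned; keys owned by the remaining nodes keep their target.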
BEAM clusters are fully connected so when you connect a node into a cluster it connects to every other node and gossips with them. From the paper you linked to:
First, maintaining a fully connected mesh of Erlang nodes means that a system with n nodes must maintain O(n^2) active TCP/IP connections and this induces significant network traffic above 40 nodes
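To put numbers on the quoted O(n^2) claim, here is a rough Python sketch that counts the connections in a full mesh. The function name `mesh_connections` is made up for illustration, and it assumes exactly one TCP connection per pair of nodes.

```python
def mesh_connections(n: int) -> int:
    """Connections in a fully connected mesh of n nodes (one per pair)."""
    return n * (n - 1) // 2


for n in (40, 200, 3000):
    # Each node holds n - 1 connections; the cluster as a whole holds n(n-1)/2.
    print(f"{n} nodes: {n - 1} per node, {mesh_connections(n)} in total")
```

Under that assumption the per-node count grows linearly while the cluster-wide count grows quadratically, which is the growth the paper refers to.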
Thank you Chris, yeah, I know that, but I was wondering: when a node connects to the entire cluster, that node actually has n-1 connections to the other nodes. So let’s say in a cluster of 3000 nodes each node maintains 2999 connections. Just keeping the connections alive shouldn’t be such a problem, though performing the heartbeat can use some resources. What’s not clear to me is how this can impact performance on that host. In theory the heartbeat should be done with very small messages that are not sent every microsecond, right?
So if that’s the case, what’s the underlying reason why performance degrades?
I am trying to find a reference for this but I don’t have one handy, so this may be incorrect. I believe this is because everything goes over the same TCP connection. So every node, every 5 seconds, will send 2999 messages. Across 3000 nodes that’s around 9M messages on the network per round, and the TCP connections can’t do anything else while they’re waiting for a response.
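A quick back-of-the-envelope version of that estimate, as a Python sketch. It assumes, as in this post, 3000 nodes that each send one heartbeat to every other node every 5 seconds; the function names and the interval are illustrative and not taken from the Erlang documentation.

```python
def heartbeats_per_round(nodes: int) -> int:
    """Cluster-wide heartbeats if every node pings every other node once."""
    return nodes * (nodes - 1)


def heartbeats_per_second(nodes: int, interval_s: float) -> float:
    """Average cluster-wide heartbeat rate for a given round interval."""
    return heartbeats_per_round(nodes) / interval_s


NODES = 3000       # cluster size from the discussion
INTERVAL_S = 5.0   # assumed heartbeat interval from this post

print(heartbeats_per_round(NODES))
print(heartbeats_per_second(NODES, INTERVAL_S))
```

Under those assumptions the cluster as a whole exchanges roughly 9M heartbeats per round, while each individual node only sends 2999 of them.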
Yes, this is what goes on. Granted, if your connection is large enough it won’t be an issue; however, most large clusters partition the nodes into groups to minimize the mesh interconnections.
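As a rough illustration of why partitioning helps, the Python sketch below compares one big mesh with several smaller ones. It assumes the groups are isolated full meshes of equal size and ignores whatever links connect the groups; the function names and the group size of 100 are hypothetical.

```python
def mesh_connections(n: int) -> int:
    """Connections in a fully connected mesh of n nodes (one per pair)."""
    return n * (n - 1) // 2


def partitioned_connections(total_nodes: int, group_size: int) -> int:
    """Connections when nodes are split into isolated, equally sized meshes."""
    if total_nodes % group_size != 0:
        raise ValueError("total_nodes must be a multiple of group_size")
    groups = total_nodes // group_size
    return groups * mesh_connections(group_size)


print(mesh_connections(3000))              # one mesh of 3000 nodes
print(partitioned_connections(3000, 100))  # 30 meshes of 100 nodes each
```

With these example numbers the partitioned layout needs about 30 times fewer connections than the single mesh.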
Note that you can set up stealth (hidden) connections, so as long as you have 2+ distinct networks with well-defined communication points between them, they won’t combine into a single large mesh, and you can scale that way to a lot more than ~200 nodes. Though honestly, if you are doing that then you might want a custom protocol and proper API interfaces anyway.
Thank you guys. Of course there are different solutions to scale “better”; one of these is to have a custom protocol or API to call a specific host in the cluster. But one of the most interesting parts of OTP was having a mesh of nodes and talking to them transparently. One thing that is not clear is: why do BEAMs have to chat over the same TCP connection that messages are sent on? Is it just to prevent a message from being sent over the wire when the BEAM doesn’t know whether the destination node is alive?
I’m not sure why, but a command/data channel separation has often been requested over the years. This is especially true once you throw Mnesia into the mix and pass around large datasets.