Scaling/Federating cluster(s) at planetary scale

So I’m new to Elixir and currently work at one of those large tech firms.

I know I would never need this for a personal project, but with that kind of scale in mind I’ve come up with a puzzle, and I can’t seem to find guidance on an approach.

Say I were building discord and wanted to publish messages/events to all connected users.

At first I could use a single node, then scale out and cluster up to perhaps a maximum of 200 nodes. (I’ve read all of Discord’s blog posts on this; they are fantastic and interesting.)

My issue comes in beyond that. I’ve read a number of papers referenced in other posts and on Stack Overflow, but none seem to arrive at a prescribed solution. A full mesh doesn’t work at that size, so you seem to have a few options:

  • Delegate the events to another system like Kafka and subscribe to that instead, just using independent nodes (this seems to throw away much of the point of the BEAM).
  • Federate your system (say, using gRPC to communicate between sub-clusters; I’d prefer to somehow use BEAM messages, but I’m not sure how to do that without joining everything into one cluster).
  • Disable connect_all and EPMD to get rid of noisy traffic (losing a lot of distribution’s benefit, but for ephemeral updates like messages that are persisted in a DB elsewhere this is OK). Then use something like libcluster to discover nodes and keep track of them.
  • Make the “front end session nodes” hidden and create your own topology, using libcluster or something custom-rolled, to load-balance onto a publish/subscribe layer (which could itself be a normal cluster for availability).
  • Consider Lasp.

To be clear, the topology I’m imagining is a very large load-balanced front end (too many nodes to be in a single cluster) that publishes/subscribes to a backing layer using BEAM messaging, instead of having to delegate to another distributed system.

I keep getting hung up because the libcluster approach seems good, but there doesn’t seem to be any discussion or documentation of disabling EPMD for added benefit (a large consistent-hash ring, for instance, would still connect to all nodes and overwhelm the transport layer with heartbeat traffic).
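For what it’s worth, the libcluster side of this is mostly configuration. A minimal (unverified) topology sketch for `config.exs`; the topology name `session_layer` is made up, and Gossip is just one built-in strategy (there are also DNS- and Kubernetes-based ones for orchestrated environments):

```elixir
# config/config.exs -- sketch, not a verified setup.
# "session_layer" is an invented topology name.
config :libcluster,
  topologies: [
    session_layer: [
      # Gossip multicasts heartbeats on the local network; DNS or
      # Kubernetes strategies fit better under an orchestrator.
      strategy: Cluster.Strategy.Gossip
    ]
  ]
```

libcluster only handles discovery and calling connect/disconnect; the connect_all and EPMD questions are still decided by the VM flags you boot with.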


You might find Partisan interesting.


Will take a look, thank you!

It is perfectly fine to run many more than 200. We tested clusters of up to 40,000 nodes running Erlang distribution over TLS (non-mesh topology), and up to 10,000 in a full mesh. With plain TCP it can be even larger, although I do not see how that would be useful at planetary scale; running unencrypted traffic between datacenters is always a disaster.

Running a cluster this large is easier if you are:

  1. Running Erlang/OTP 24 (RC for now). It contains a number of changes compared to OTP 23, fixing many issues with distribution over TLS (this is indeed one of the highlights of the ssl application).
  2. Disabling automatic mesh connections (with “-connect_all false”) helps a lot. It also helps to start distribution dynamically (BEAM is started with -name undefined, and your application then calls net_kernel:start), pacing the connection rate.
  3. Using the “hibernate” option for TLS distribution connections, because it’s unlikely that all of them will be active most of the time.
  4. It may also help to extend net_setuptime from the default 7 seconds to 15 or more once the cluster grows beyond 10k nodes.
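Translated into a release’s vm.args, the points above might look roughly like this; the optfile path and the timeout value are illustrative, and flag spellings should be checked against the erl and kernel docs for your OTP version:

```
## vm.args sketch -- path and timeout value are illustrative
## no automatic full mesh (point 2)
-connect_all false
## skip the EPMD daemon entirely
-start_epmd false
## distribution over TLS (points 1 and 3)
-proto_dist inet_tls
-ssl_dist_optfile /etc/app/ssl_dist.conf
## longer connection setup time, in seconds (point 4)
-kernel net_setuptime 15
```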

The next challenge in a cluster this large is service discovery (e.g. how does a process on node A know about processes on node B?). The distributed process groups module (pg) introduced in OTP 23 solves exactly this problem: it does not do globally locked transactions, providing eventual consistency based on actual network connectivity instead, and it supports overlay networks via scopes.
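That module is pg, and its scope-based API is small. A minimal Elixir sketch (the scope and group names here are invented for illustration):

```elixir
# Start a named pg scope (an isolated overlay of process groups),
# join the current process to a group in it, and look up members.
# :session_scope and :session_managers are invented names.
{:ok, _scope_pid} = :pg.start_link(:session_scope)
:ok = :pg.join(:session_scope, :session_managers, self())
members = :pg.get_members(:session_scope, :session_managers)
# Locally, members includes self(); across a cluster it would also
# include processes that joined this group on connected nodes.
```

get_members/2 returns whatever the local scope process currently knows, which is exactly where the eventual consistency shows up.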

We disable EPMD as well, but not because of performance implications (we just don’t see how to use it).

This setup, of course, has considerable overhead (e.g. just establishing 40k TLS connections takes more than a minute, and heartbeats alone take a full CPU core). Currently I’d recommend limiting a full mesh to 10-12k nodes, or fewer. Even that is enough to support a truly planetary-scale application on low-power cloud machines. You could then switch to bare-metal hardware and get another 4-8x boost, but I can’t even imagine what purpose such a monster could serve.

A more robust solution is to connect only those nodes that need to be connected: for example, frontend nodes connect only to middleware, and middleware to storage. There are a few Code BEAM (or Code Mesh) talks from WhatsApp explaining this setup.


Thanks for the reply!

Turns out I had that talk open in another tab and am working my way through it now.

It makes a lot of sense to have a group of session servers that track which end-user connections are on which nodes via consistent hashing, as described, and then have the session managers know about one another via pg2, so you can ask any one of them to discover the others.
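As a toy illustration of hashing a user onto a session node (the module name and the simple phash2-mod-N scheme are mine; real consistent hashing would remap far fewer users when membership changes):

```elixir
defmodule SessionRing do
  # Toy sketch, invented for illustration: pick a session node for a
  # user. :erlang.phash2/2 gives a stable hash, but mod-N placement
  # like this remaps many users whenever the node list changes; a
  # production system would use a proper consistent-hash ring.
  def node_for(user_id, session_nodes) when session_nodes != [] do
    nodes = Enum.sort(session_nodes)
    index = :erlang.phash2(user_id, length(nodes))
    Enum.at(nodes, index)
  end
end
```

The same user id always routes to the same node for a given membership list, which is what lets any session manager answer “who owns this connection?”.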

My mental image is now: EPMD is off and connect_all is false. You bring up your nodes (on FB infra I’m assuming their Tupperware system, not too dissimilar from Borg or k8s). The chat servers then use DNS, or some other mechanism within the orchestrator, to discover some session servers, and the session servers discover one another (and, through “wandist” or a regular connection, transitively share that information with the chat servers).

What confuses me now is this: if connect_all is false, how do the session managers form their full mesh? The talk mentions that each cluster in the “meta-cluster” is a normal full mesh. Since EPMD is off, is this simply accomplished through the orchestrator with manual connections? libcluster can accomplish this, and so can rolling your own.

Thanks for your super comprehensive response, <3

I really am glad I’m putting some time into learning about the Erlang ecosystem.


The WhatsApp setup has been explained several times; the latest one I can remember is Maxim Fedorov - The art of challenging assumptions | Code Mesh LDN 19 - YouTube (starting at 12:30).

You’re correct: EPMD is off, BEAM listens on a fixed port, connect_all is false, and nodes discover each other through a solution similar to DNS. Long live “wandist”. We no longer have it, and are using distribution over TLS instead.
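For the “fixed port, no EPMD” part, the usual knobs are the kernel listen-port range plus not starting the daemon; the port number here is illustrative:

```
## vm.args sketch: pin the distribution listener to one known port
## so peers can connect without asking EPMD, and don't start the
## EPMD daemon at all.
-kernel inet_dist_listen_min 9100
-kernel inet_dist_listen_max 9100
-start_epmd false
```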

To discover which nodes need to connect to which, there is a notion of “distributed dependencies” between applications. In the app.src file, we have a key named “distributed_dependencies”, similar to “runtime_dependencies”. It lists applications that are expected to be available via the pg registry. When we build a release (e.g. the “session manager” release), it has a statically baked-in list of nodes to connect to. This list is easy to generate, because we know which other releases those applications are added to.
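Since “distributed_dependencies” is internal to WhatsApp, the following is only a guess at its shape, modeled on a standard app.src file; the application names are invented:

```erlang
%% src/session_manager.app.src -- illustrative guess; the
%% "distributed_dependencies" key is a WhatsApp-internal
%% convention, not a standard OTP key.
{application, session_manager, [
  {vsn, "1.0.0"},
  {applications, [kernel, stdlib]},
  %% applications expected to be reachable via pg on remote nodes
  {distributed_dependencies, [chat_storage, presence]}
]}.
```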

You could likely use something like libcluster to achieve the same effect; I haven’t used that tool, so I’m not sure how easy it is.
We’re also researching a more integrated solution that allows more dynamism (e.g. automatic scaling instead of baked-in node names), but it’s far from production stage.


Thanks for your time and patience, Max. Very much appreciated. These questions have been itching at me for a while now and it’s nice to get some color from someone with experience.

Hopefully my little project can produce some useful data to help return the favor.