Scaling/Federating cluster(s) at planetary scale

max-au · March 8, 2021, 5:27am

It is perfectly fine to run many more than 200. We tested clusters up to 40,000 nodes running Erlang distribution over TLS (non-mesh topology), and up to 10,000 in a full mesh. With TCP it can be even larger, although I do not see how it can be useful at the planetary scale. Running unencrypted traffic between datacenters is always a disaster.

Running cluster this large is easier if you are:

Running Erlang/OTP 24 (RC for now). It contains a number of changes compared to OTP23, fixing many issues with distribution over TLS (it is indeed one of the highlights of SSL application)
Disabling mesh connections (with “-connect_all false”) helps a lot. It also helps to to start distribution dynamically (so BEAM is started with -name undefined, and then your application calls net_kernel:start), pacing connection rate.
Using “hibernate” option for TLS distribution connections, because it’s unlikely for all of them to be active most of time.
It may also be helpful to extent net_setuptime from 7 seconds to 15 or even more when cluster grows beyond 10k.

Next challenge in a cluster this large would be service discovery (e.g. how does a process on node A knows processes on node B). In other words, distributed process registry.pg module introduced in OTP 23 solves exactly this problem (it does not do globally locked transactions, providing eventual consistency based on actual network connectivity, and has support for overlay networks - scopes).

We disable EPMD as well, but not because of performance implications (we just don’t see how to use it).

This setup, of course, has considerable overhead (e.g. just establishing 40k TLS connections takes more than a minute, and heartbeat alone takes 1 CPU core). Currently I’d recommend to limit full mesh to 10-12k nodes, or less. Even this amount is enough to support a truly planetary scale application on low-power cloud machines. Then, you can switch to bare metal hardware, and get another 4-8x boost, but I can’t even imagine what purpose such a monster can have.

More robust solution would be to connect only those nodes that need to be connected. For example, frontend nodes are only connected to middleware, and middleware connected to storage. There are a few CodeBEAM (or CodeMESH) talks from WhatsApp explaining this setup.