Ditributed cluster: up to how many nodes we can scale using a library like libcluster?

jaybe78 · June 9, 2025, 5:53pm

Hello,

I would like to know up to how many nodes max we can run a distributed app with a simple libray like libcluster ?

I read somewhere that after 40/50 nodes, you can start having issues like loss of messages and it generally does not scale well above that number.

For the context:

I’m currently building an app with an expectation of 2/3 millions concurrent users.
Those users would spread around 300 000 + channels.
So each channel would contain something like 100 users or less joining and they would mostly exchange chat messages.

Phoenix Presence is set up on those channels so an user can be aware at anytime of whose connected.

I heard about an Erlang library called Partisan that is used for very large cluster, but I guess for my use cases lib cluster should be enough ?

If anyone has been working in a distributed app it would be nice to share your feedback here.

Cheers.

Hermanverschooten · June 10, 2025, 12:21am

There is maybe something for you here.

And I just wanted to mention that libcluster despite it’s name does not cluster.
It’s all Erlang that does the clustering, libcluster just helps you with finding the nodes to connect through the strategies it offers.

jaybe78 · June 10, 2025, 4:49am

Yes seen this one. Thanks.

Your’re right, libcluster only helps you to find the nodes.

Though I still wanted some feedback from people who have used “default” Erlang without any library to manage their cluster, just to see if it makes sense for me at this point to look into a library that can handle much more complex and large cluster.

Also, I guess I’m a bit “intimidated” by Partisan, from what I seen in the docs, there’s no clear documentation about how to set up Partisan locally in an elixir app…

I can see that you can join nodes manually

Though I was looking for a solution where Partisan discover nodes automatically.

In my case I would also have to make it work in an AWS ECS service, and I’m not sure whether it would work seamlessly or it does require some extra configuration !?

The only thing I had to do so far to set up an erlang cluster was to use aws service discovery to resolve DNS names so that was quite easy…

pleb · June 10, 2025, 9:12am

Consider first if your use case actually requires clustering the entire thing, instead of partitioning it into groups of three or five with a stateful load balancer to guide people to the right group.

Check if a consistent hashing scheme like GitHub - lambdaclass/elixir_riak_core: Wrapper to Riak Core for Elixir wouldn’t function better for your use case.

jaybe78 · June 10, 2025, 10:09am

If I understand correctly riak creates virtual nodes within one physical node to spread the load rather than creating new physical nodes ?

pleb · June 10, 2025, 10:31am

Virtual nodes are spread across physical nodes, and the dynamic partitioning scheme allows physical nodes to be added and removed while virtual nodes always find the closest physical server in the event that a node is removed to recover, or spread out when a new node is added.

Although this scheme might not work too well, if the number of participants is not uniform, and one partition is overwhelmed, as was the case with How Discord Scaled Elixir to 5,000,000 Concurrent Users implementing a similar scheme

jaybe78 · June 10, 2025, 12:18pm

Ok thanks.

Riak seems overly complexed (at least for me), I had a look at riak-core-lite but no activity there since 2021…

mpope · June 10, 2025, 1:49pm

I saw this talk at code beam, truly the rocket science of BEAM with 40k nodes!

jaybe78 · June 10, 2025, 4:14pm

lol really interesting but I guess at the moment I’m far from being able to do that kind of work on my own

mpope · June 10, 2025, 4:19pm

Fair enough, thought for food.

For automated discovery specifically take a look at GitHub - nerves-networking/mdns_lite: A simple, no frills mDNS implementation in Elixir

jaybe78 · June 10, 2025, 4:49pm

yes thanks definitely interesting