I would like to know up to how many nodes max we can run a distributed app with a simple libray like libcluster ?
I read somewhere that after 40/50 nodes, you can start having issues like loss of messages and it generally does not scale well above that number.
For the context:
I’m currently building an app with an expectation of 2/3 millions concurrent users.
Those users would spread around 300 000 + channels.
So each channel would contain something like 100 users or less joining and they would mostly exchange chat messages.
Phoenix Presence is set up on those channels so an user can be aware at anytime of whose connected.
I heard about an Erlang library called Partisan that is used for very large cluster, but I guess for my use cases lib cluster should be enough ?
If anyone has been working in a distributed app it would be nice to share your feedback here.
And I just wanted to mention that libcluster despite it’s name does not cluster.
It’s all Erlang that does the clustering, libcluster just helps you with finding the nodes to connect through the strategies it offers.
Your’re right, libcluster only helps you to find the nodes.
Though I still wanted some feedback from people who have used “default” Erlang without any library to manage their cluster, just to see if it makes sense for me at this point to look into a library that can handle much more complex and large cluster.
Also, I guess I’m a bit “intimidated” by Partisan, from what I seen in the docs, there’s no clear documentation about how to set up Partisan locally in an elixir app…
I can see that you can join nodes manually
Though I was looking for a solution where Partisan discover nodes automatically.
In my case I would also have to make it work in an AWS ECS service, and I’m not sure whether it would work seamlessly or it does require some extra configuration !?
The only thing I had to do so far to set up an erlang cluster was to use aws service discovery to resolve DNS names so that was quite easy…
Consider first if your use case actually requires clustering the entire thing, instead of partitioning it into groups of three or five with a stateful load balancer to guide people to the right group.
Virtual nodes are spread across physical nodes, and the dynamic partitioning scheme allows physical nodes to be added and removed while virtual nodes always find the closest physical server in the event that a node is removed to recover, or spread out when a new node is added.
Although this scheme might not work too well, if the number of participants is not uniform, and one partition is overwhelmed, as was the case with How Discord Scaled Elixir to 5,000,000 Concurrent Users implementing a similar scheme