Last year I wrote a presence system that needed to handle 500k concurrent clients, with a few different features: subscribing to user join/leave events, tracking where each user currently is on the site, and a few other things. Our team chose Rust for it, and because we didn’t have much time we went for an unorthodox design - we knew Rust was fast enough to handle the load without distributing it across multiple nodes, so we went with a “one server” approach. All clients connect to the same main server; if it fails, we have a fallback and switch the load balancer over to it. Reconnecting 500k users takes about 20-30s, so a node failure means a bit of downtime, but in practice it’s not a big issue - 500k is an upper limit, the maximum on production so far has been maybe 150-200k, and the service just works; it practically never goes down (there have been no runtime errors so far). And there is a big advantage in my opinion: the code is extremely simple - it’s basically manipulating hashmaps on three threads. No network, no third-party dependencies, no distributed computing problems, nothing. We also know this approach can handle at least a million connections; it may be more, we just didn’t test it further. The obvious problem is that the system should definitely be distributed at some point, especially for much higher scale and because of the thundering herd problem on reconnect.
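To make the “hashmaps on a few threads” point concrete, here is a minimal, self-contained sketch of that shape of design. The names, the single `Mutex`-guarded map, and the thread layout are illustrative assumptions for this example, not our actual code:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

type UserId = u64;
type Location = String;

/// Presence state: a plain hashmap, no persistence, no network.
#[derive(Default)]
struct Presence {
    online: HashMap<UserId, Location>, // user -> current page on the site
}

impl Presence {
    fn join(&mut self, user: UserId, loc: Location) {
        self.online.insert(user, loc);
    }
    #[allow(dead_code)]
    fn leave(&mut self, user: UserId) {
        self.online.remove(&user);
    }
    fn count(&self) -> usize {
        self.online.len()
    }
}

/// Simulate three worker threads mutating the shared map,
/// mirroring the "hashmaps on three threads" idea.
fn run_simulation() -> Arc<Mutex<Presence>> {
    let state = Arc::new(Mutex::new(Presence::default()));
    let handles: Vec<_> = (0..3u64)
        .map(|t| {
            let state = Arc::clone(&state);
            thread::spawn(move || {
                for i in 0..1000u64 {
                    // each thread registers 1000 distinct users
                    state.lock().unwrap().join(t * 1000 + i, format!("/page/{i}"));
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    state
}

fn main() {
    let state = run_simulation();
    println!("online users: {}", state.lock().unwrap().count()); // prints 3000
}
```

In a real server the threads would be fed by connection handlers rather than a loop, but the state management itself really is this simple.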
I wrote this to give context for where I’m coming from. In particular, I’m interested in how easy or hard it might be to scale Phoenix Presence to handle millions of connections, so that I don’t have to write a distributed algorithm myself.
Recently I read about Phoenix’s presence system and started wondering whether there has been any performance/scale testing of the feature. I remember reading about reaching 2 million WebSocket connections using Phoenix channels, but channels are quite a bit simpler and don’t require syncing between nodes. As far as I understand, the Presence feature uses CRDTs to sync state between servers, i.e. if you have N nodes, the state will be replicated to all of them. This is an interesting design, but I think there is a certain disadvantage to it: since each node eventually holds the entire state, adding new nodes will not make processing the data on a single node any faster; it will only spread connections across more nodes.
A while ago I remember reading about how Discord ran into a performance problem with immutable data structures in a similar scenario. Does Phoenix use traditional immutable data structures for this?
And finally, has anyone scale-tested the Presence feature the way channels were tested a few years ago, ideally with a million+ connections and at least a few nodes? I’m really curious about the results.