Phoenix Presence performance/scale testing: has anyone done it already?

Hi,

Last year I wrote a presence system that had to handle 500k concurrent clients and support a few different features: subscribing to user join/leave events, tracking where a user currently is on the site, and a few other things. Our team chose Rust for it, and because we didn’t have much time we went with an unorthodox design. We knew Rust was fast enough to handle the load without distributing it across multiple nodes, so we took a “one server” approach: all clients connect to the same primary server, and if it fails we switch the load balancer to a fallback. Reconnecting 500k users takes about 20-30s, so a node failure means a bit of downtime, but in practice that hasn’t been a big issue - 500k is an upper limit, the production maximum so far has been maybe 150-200k, and the service just works; it has practically never gone down (there have been no runtime exceptions so far).

There is also a big advantage in my opinion: the code is extremely simple. It’s basically manipulating hashmaps on three threads - no network, no third-party dependencies, no distributed-computing problems, nothing. We also know this approach can handle at least a million connections; it may handle more, we just didn’t test beyond that. The obvious problem is that the system should eventually be distributed, especially at much higher scale and to deal with the thundering herd on reconnect.

I wrote all this to give context on where I’m coming from. In particular, I’m interested in how easy or hard it might be to scale Phoenix Presence to millions of connections, so that I don’t have to write a distributed algorithm myself.

Recently I read about Phoenix’s presence system and started wondering whether there has been any performance/scale testing of the feature. I remember reading about reaching 2 million WebSocket connections with Phoenix channels, but channels are quite a bit simpler and don’t require syncing between nodes. As far as I understand, the Presence feature uses CRDTs to sync state between servers, i.e. with N nodes the state is replicated to all of them. This is an interesting design, but I think it has a certain disadvantage: since each node eventually has to hold the entire state, adding nodes won’t make processing on a single node faster - it will only spread the connections across more nodes.
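For readers who haven’t used it, here is a minimal sketch of the usual Presence flow inside a channel (MyAppWeb.RoomChannel is a placeholder; MyAppWeb.Presence is assumed to be defined with `use Phoenix.Presence`, as the Phoenix generators produce):

```elixir
defmodule MyAppWeb.RoomChannel do
  use Phoenix.Channel
  alias MyAppWeb.Presence

  def join("room:" <> _room_id, _params, socket) do
    send(self(), :after_join)
    {:ok, socket}
  end

  def handle_info(:after_join, socket) do
    # Track this user on the channel's topic; the entry is replicated to
    # every node via the CRDT, which is the replication discussed above.
    {:ok, _ref} =
      Presence.track(socket, socket.assigns.user_id, %{
        online_at: System.system_time(:second)
      })

    # Push the full presence state for this topic to the newly joined client.
    push(socket, "presence_state", Presence.list(socket))
    {:noreply, socket}
  end
end
```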

A while ago I remember reading about how Discord ran into a performance problem with immutable data structures in a similar scenario. Does Phoenix use traditional immutable data structures here?

Finally, has anyone scale-tested the Presence feature the way channels were tested a few years ago? Ideally with a million-plus connections and at least a few nodes. I’m really curious about the results.


I did a bit of testing while writing PushEx. We didn’t need 500k clients, but we were in the high tens of thousands.

A few recommendations based on that:

  • Presence ran into issues when I had large topics (500+ clients): it would time out on the track call. Switching to Tracker got it a bit higher (maybe 1000-2000 clients). Ultimately I decided to stop tracking large topics and just assume they had connected clients. It depends on your use case.
  • Use Tracker if you don’t need to broadcast the diff to every client. You will get better performance because you’re doing less work, and you can still send out the diffs manually as needed (see the sketch after this list).
  • Tracker was generally well-behaved for smaller topics (< 50 clients), so the performance characteristics are going to vary wildly based on your topic configuration.
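
A minimal sketch of that Tracker route, closely following the example in the Phoenix.Tracker docs (MyApp.Tracker and the {:join, ...}/{:leave, ...} payloads are placeholder names, not anything PushEx-specific):

```elixir
defmodule MyApp.Tracker do
  use Phoenix.Tracker

  def start_link(opts) do
    opts = Keyword.merge([name: __MODULE__], opts)
    Phoenix.Tracker.start_link(__MODULE__, opts, opts)
  end

  def init(opts) do
    server = Keyword.fetch!(opts, :pubsub_server)
    {:ok, %{pubsub_server: server, node_name: Phoenix.PubSub.node_name(server)}}
  end

  # Unlike Presence, nothing is pushed to clients automatically: you only
  # pay for whatever broadcasting you choose to do with the diff here.
  def handle_diff(diff, state) do
    for {topic, {joins, leaves}} <- diff do
      for {key, meta} <- joins do
        Phoenix.PubSub.direct_broadcast!(
          state.node_name, state.pubsub_server, topic, {:join, key, meta}
        )
      end

      for {key, meta} <- leaves do
        Phoenix.PubSub.direct_broadcast!(
          state.node_name, state.pubsub_server, topic, {:leave, key, meta}
        )
      end
    end

    {:ok, state}
  end
end
```

Tracking is then an explicit call from wherever the client process lives, e.g. `Phoenix.Tracker.track(MyApp.Tracker, self(), topic, user_id, %{})` from a channel’s join flow.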

Folks are running 100-200k concurrent presences in production, but beyond that you will likely become CPU-bound on CRDT merges. There is a large optimization we can make here where we don’t have to observe removes, but I haven’t had the time to prioritize the work. Hope that helps!
