jaybe78
Implementing distributed user counters
Hello,
I’ve recently developed a module that creates distributed user counters, based on what’s described in this post
Basically, locally I use ETS table to atomically know how many people there is in a channel, then I aggregate all counters of all nodes using Phoenix Tracker.
In summary, users starts by joining “room:lobby”, using ETS table, I check what’s the last room id with available slots( <= 70), and if I can’t find one, I assign them to a new topic.
Those topics (“room#{id}”) are created in ascending order (“room#1”, “room#2”, …), and I use the same topics on every nodes.
With that in place I’m able to create channel topics with a max size of 70 or so users.
It works quite well overall, but I have a “problem” which concern the total number of users on channel topics.
As I said previously, the first step before tracking counters across nodes, is to use ETS table locally, to know the current count on topic, so obviously the value I get only concern users who join on that node.
Therefore if I deploy my app on 10 nodes, since the max is 70, I will get 700 users join on every single topic.
It’s not good for me because I don’t want those rooms to be too crowded. (I plan on using Phoenix Presences on those channels, which does not scale very well when there’s too many users)
1) First solution:
To solve that, my first thought was to divide the max, by the number of actives nodes:
For example on 10 nodes, max => 7 users
That means locally 7 users max can join a channel topic.
This works but it’s not perfect either because the users will be load balanced to different nodes.
They might fill room topics on certain nodes faster than on others, so it would create new topics with barely any users while the previous rooms are not filled completely yet.
2) Second solution
To avoid that problem, I could broadcast to every nodes when a specific room is full or since room id are created in ascending order, I could regularly inform the last id with available slots
This would occur in the room tracker after the aggregate.
The only issue is that, there’s always a delay between the moment where users are populated to a specific channel and the aggregate. That information would always arrive too late…
At this point, I\m running out of ideas ![]()
Most Liked Responses
garrison
I’m a little confused with what’s going on in this thread. There’s an implication here that Redis is somehow different than using ETS tables, but there is no reason you can’t just put an ETS table on one node and use it exactly like you would use Redis! You would have no fault tolerance and higher latency, just like Redis. The only reason you would want to actually use Redis is if you wanted its vast array of fancy data structures, which in this case maybe you don’t, seeing as you have already implemented your system on ETS.
Again, it’s perfectly valid to use Redis for its functionality, but “ETS and GenServers” were absolutely designed to be used “cross-node”. OTP literally could not be more designed for that! It’s, like, the entire thing!
This is not a literal answer, but the quintessential example would be something like an exchange/order-matching system: something which processes (financial) transactions. If you want to learn more about that the magic keywords are “LMAX architecture”, plus check out Tigerbeetle for a modern fault-tolerant rendition.
Apologies in advance for nitpicking something you’re probably fully aware of, but CAP and “centralized/distributed” are separate concepts. You can have a distributed consistent system, and you can also have a single-threaded inconsistent system, though this is less common because in a single-threaded system consistency is so easy to achieve.
LostKobrakai
The big change here is not GenServer vs. Redis though. It’s centralized storage of data (CP) vs. distributed storage of data (eventual consistency / AP)
LostKobrakai
Your remarks make it sound like there’s no real reason for the number 70 exactly, so I’d suggest not going with a fixed cutoff point, but rather evalute a solution using low and high watermarks. Start creating new channels at a low watermark, but also make it so only the high watermark makes that number of users be a problem in a channel. That way if people are added to a channel after the low watermark is reached isn’t directly a problem. You can tune those numbers against expected latency of coordination between nodes as well as expected arrival rates of users. This makes even more sense if you seem to be moving people around anyways.
The part about people potentially landing in empty channels cannot really be avoided though. There will always be the case of all existing channels being considered full and the new person being put into a new channel. It’s however a matter of how likely you make that case. E.g. if your balancer does its job how likely is it for that case to happen.
I have no idea about scaling persence, but I’m asking because both likely affect performance in distinct ways. There’s no single knob to dialing performance. I’d also argue that performance work is best evaluated by benchmarking your solution besides reading what others have to say.








