If all you want is the number of connected users, Presence is definitely overkill, because you are sharing each connected user and their metadata across nodes when all you need is a counter.
A simpler solution is to share only the counter. Here is a high-level outline of one possible approach, broken into two steps. I will be glad to clarify any points and answer questions.
Step 1: basic setup
Every time a user connects, the process handling that connection sends a message containing its own PID to another process, which we will call the “LocalCounter”. When the LocalCounter receives this message, it bumps its internal counter and monitors the PID. Once it receives the corresponding DOWN message, it decrements the counter.
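To make this concrete, here is a minimal sketch of what such a LocalCounter could look like as a GenServer. The function names (`count_me/1`, `count/1`) and the state shape are my own assumptions for illustration, not a required API:

```elixir
defmodule LocalCounter do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, Keyword.take(opts, [:name]))
  end

  # Hypothetical helper: called from the process that represents the user.
  def count_me(server), do: GenServer.cast(server, {:count, self()})

  # Hypothetical helper: returns the current number of tracked processes.
  def count(server), do: GenServer.call(server, :count)

  @impl true
  def init(:ok), do: {:ok, %{count: 0}}

  @impl true
  def handle_cast({:count, pid}, state) do
    # Monitor the user process so we are told when it goes away.
    Process.monitor(pid)
    {:noreply, %{state | count: state.count + 1}}
  end

  @impl true
  def handle_call(:count, _from, state), do: {:reply, state.count, state}

  @impl true
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, state) do
    # The monitored process is gone, so it no longer counts as connected.
    {:noreply, %{state | count: state.count - 1}}
  end
end
```

Note that this sketch counts a PID twice if it is sent twice, so each connection process should register itself only once.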
You will also have a separate process, which is the GlobalCounter. The GlobalCounter will receive updates from other processes in the cluster. You will:
- Every X seconds, query the local counter and broadcast a message on a “global_counter” topic with your node name and your local counter.
- Other nodes will receive your message, and they should store the time they received the message, the node name, and the counter.
- The total counter is the sum of the local counter and the latest counter stored for every other node.
- After you broadcast, you should prune any dead node. You can consider a node dead if you haven’t received a broadcast from it in N*X seconds. Alternatively, you can use :net_kernel.monitor_nodes(true) to see when nodes go up and down so you can remove those entries immediately.
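The bookkeeping described in the list above (store, sum, prune) can be sketched as a small pure module that the GlobalCounter would keep in its state. The module name `RemoteCounts` and its functions are hypothetical, and timestamps are passed in explicitly (for example from `System.monotonic_time(:millisecond)`) so the logic is easy to test:

```elixir
defmodule RemoteCounts do
  # State is a map of node name => {count, received_at_ms}.
  def new, do: %{}

  # Record the latest broadcast received from a node.
  def put(state, node, count, now_ms), do: Map.put(state, node, {count, now_ms})

  # Drop a node immediately, e.g. after a nodedown notification.
  def remove(state, node), do: Map.delete(state, node)

  # Drop every node we have not heard from within max_age_ms (N*X).
  def prune(state, now_ms, max_age_ms) do
    :maps.filter(fn _node, {_count, at} -> now_ms - at <= max_age_ms end, state)
  end

  # Total = our own local count plus the last known count of every other node.
  def total(state, local_count) do
    Enum.reduce(state, local_count, fn {_node, {count, _at}}, acc -> acc + count end)
  end
end
```

With X = 5 seconds and N = 3, you would call `RemoteCounts.prune(state, now_ms, 15_000)` after each broadcast.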
The choice of X is important, since it is how frequently you broadcast. A small X means more traffic but fresher counts; a large X means less traffic but staler counts. For example, if X is 5 seconds, your view of another node is at most about 5 seconds behind. X is also the maximum time it takes for a new node to receive updates from all other nodes when it comes up.
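The "every X seconds" loop itself could be sketched with `Process.send_after/3`. To keep the example self-contained, the functions that read the local count and do the actual broadcast are injected as options; `PeriodicBroadcaster`, `:read_local`, `:broadcast`, and the message shape are all assumptions of mine, and in a real system `:broadcast` would wrap whatever pub/sub layer you use:

```elixir
defmodule PeriodicBroadcaster do
  use GenServer

  # opts (all hypothetical names):
  #   :interval_ms - X, in milliseconds
  #   :read_local  - 0-arity fun returning the local count
  #   :broadcast   - 1-arity fun that publishes the message to other nodes
  def start_link(opts), do: GenServer.start_link(__MODULE__, Map.new(opts))

  @impl true
  def init(state) do
    schedule(state.interval_ms)
    {:ok, state}
  end

  @impl true
  def handle_info(:tick, state) do
    # Publish our node name together with our current local count.
    state.broadcast.({:count, node(), state.read_local.()})
    schedule(state.interval_ms)
    {:noreply, state}
  end

  defp schedule(interval_ms), do: Process.send_after(self(), :tick, interval_ms)
end
```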
Step 2: optimizing
The implementation above has one issue: the LocalCounter will likely become a bottleneck. We can address this by using the :counters module in Erlang and turning the LocalCounter into a pool of processes. Here is what we will do:
- Instead of a single local counter, we will start N local counters. We will also use the :counters API to create a counter array of N entries. Each local counter will have an index inside the counter array and update said index.
- Now, when you need to track a given PID, you should do :erlang.phash2(pid, N) to select one of the existing local counters. You can use a Registry to track the local counters.
- Change the global counter so that, instead of asking the local counter for its current count, it traverses all indexes in the :counters reference and adds them up. That’s what you will broadcast now.
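Summing the counter array, as in the last item above, is only a few lines. Here is a sketch, where `CounterSum` is a hypothetical module name; `:counters.info/1` is used to read the array size so N does not need to be passed around:

```elixir
defmodule CounterSum do
  # Adds up every slot of a :counters array (indexes are 1-based).
  def total(counter) do
    %{size: size} = :counters.info(counter)

    Enum.reduce(1..size, 0, fn index, acc ->
      acc + :counters.get(counter, index)
    end)
  end
end
```

Keep in mind that with the :write_concurrency option, a read that races with concurrent writes may be slightly behind, which should be fine for a value that is only broadcast every X seconds anyway.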
In pseudo-code, your CounterSupervisor’s init will look like this:
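@count 8

def init(_) do
  # One slot per local counter; :write_concurrency favors many concurrent writers
  counter = :counters.new(@count, [:write_concurrency])

  children =
    [
      # The Registry must start first so the local counters can register in it
      {Registry, name: CounterRegistry, keys: :unique},
      {GlobalCounter, counter}
    ] ++
      Enum.map(1..@count, fn index ->
        Supervisor.child_spec({LocalCounter, counter: counter, index: index}, id: index)
      end)

  # The strategy is an assumption; pick whichever fits your restart needs
  Supervisor.init(children, strategy: :one_for_one)
end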
The LocalCounter should register itself under CounterRegistry with the index.
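As a rough sketch of that registration, the pooled LocalCounter could look like the following. It assumes the `counter:` and `index:` options and the `CounterRegistry` name from the supervisor outline, and that the Registry is already running; everything else is illustrative:

```elixir
defmodule LocalCounter do
  use GenServer

  def start_link(opts) do
    index = Keyword.fetch!(opts, :index)
    # Register under CounterRegistry using the index as the key.
    name = {:via, Registry, {CounterRegistry, index}}
    GenServer.start_link(__MODULE__, opts, name: name)
  end

  @impl true
  def init(opts) do
    state = %{
      counter: Keyword.fetch!(opts, :counter),
      index: Keyword.fetch!(opts, :index)
    }

    {:ok, state}
  end

  @impl true
  def handle_cast({:count, pid}, state) do
    Process.monitor(pid)
    # Bump only this process's own slot in the shared counter array.
    :counters.add(state.counter, state.index, 1)
    {:noreply, state}
  end

  @impl true
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, state) do
    :counters.sub(state.counter, state.index, 1)
    {:noreply, state}
  end
end
```

The process no longer keeps a count in its own state; the :counters slot is the only source of truth, which is what lets the GlobalCounter read it without sending any message.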
When dispatching to a local counter, you will roughly do this:
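def count_me do
  # phash2 returns a zero-based value, but our indexes start at 1.
  # @count must be the same value used by the CounterSupervisor.
  index = :erlang.phash2(self(), @count) + 1
  name = {:via, Registry, {CounterRegistry, index}}
  GenServer.cast(name, {:count, self()})
end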