Hey,
Thanks for getting back to me.
For context: I’ve implemented your second proposal using Phoenix Tracker, where each local counter is tracked.
The tracker topic is the node name concatenated with the local counter index:
"#{Node.self()}_local_counter_#{index_counter}"
and the key is the room’s own topic:
Phoenix.Tracker.update(
  Messages.Trackers.ChannelsTracker,
  self(),
  "#{Node.self()}_local_counter_#{index_counter}", # topic: node name + counter index
  room,                                            # key: the room's topic
  %{counter: count}                                # metadata: the local count
)
The tracker diff looks like this:

diff #=> %{
  "test-cluster3@127.0.0.1_local_counter_2" => {[
    {"room_0", %{counter: 1}}
  ], []}
}
I then store the added-up counts for each room, per node and per local counter, in an :ets table (roughly as in the sketch below).
This is just for context.
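In case it helps, here is a minimal sketch of how I handle the diff in the tracker module; the :room_counts table name and the exact aggregation are just placeholders of mine, and the table is created elsewhere:

# Minimal sketch of my handle_diff/2 callback. Assumes a named, public
# :ets table called :room_counts that was created elsewhere.
def handle_diff(diff, state) do
  for {topic, {joins, _leaves}} <- diff,
      {room, %{counter: count}} <- joins do
    # topic looks like "node@host_local_counter_N"; keep the latest
    # count reported for that local counter and room
    :ets.insert(:room_counts, {{topic, room}, count})
  end

  {:ok, state}
end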
Please bear with me, and sorry in advance if my questions sound stupid.
To me, there are essentially two scenarios:
- In a cluster, two or more nodes could briefly be unable to talk to each other before the cluster is able to heal. In that scenario, no node is really going down.
- In a cluster, a node could at some point go down for good and get replaced by a new node in the cluster.
I assume there are two different things to accomplish in each of those scenarios.
1) Scenario 1:
During the “netsplit”, both nodes are going to keep doing their own thing as usual, and there will be changes in the local counters of both nodes that Phoenix Tracker could not replicate across the cluster.
If I understand correctly, they might not be able to talk to each other during that time, but they might still be reachable by clients, so we’re going to end up with discrepancies on each node.
Therefore, when they are back up, meaning when they can talk to each other again, I assume each node will end up with correct values in the local counters it owns, but not in the ones belonging to the other node, right?
For example, let’s say you have two nodes, "A" and "B".
The counts stored in the tracker for node A could differ between node A’s view and node B’s view:
Node "A" could read:

"Node_A_local_counter_1" => {[
  {"room_0", %{counter: 10}}
], []}
While node B could read:

"Node_A_local_counter_1" => {[
  {"room_0", %{counter: 3}}
], []}
During that split, 7 more users joined node A in room_0, and those joins could not be replicated to node B, so we end up with inconsistent values.
Regarding “you ask its latest copy of the data again”: how would you get the local counter values from the other nodes?
Does that mean doing something like an RPC call to get the values stored in the tracker of the other nodes?
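For example, is something like this what you have in mind? (Just a rough sketch: fetch_remote_counters is a hypothetical helper of mine, and it relies on :erpc from OTP 23+.)

# Hypothetical helper: ask a remote node for its own view of one of its
# local-counter topics, using the topic naming scheme described above.
def fetch_remote_counters(remote_node, index_counter) do
  topic = "#{remote_node}_local_counter_#{index_counter}"

  :erpc.call(remote_node, Phoenix.Tracker, :list, [
    Messages.Trackers.ChannelsTracker,
    topic
  ])
end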
2) Scenario 2:
In this scenario, a node goes down for good and is not coming back up; it gets “replaced”, or not, by a new node in the cluster.
In this scenario, the local counters of the node that died must be untracked.
You need timeouts or group membership protocols, for example: do other nodes also see that node A is down?
My initial idea is to record a timestamp in the local counters every time we do Tracker.update.
That way we can see the last time a node broadcast an update.
When a node seems to be down, I can compare those timestamps and look for the stalest one to identify the dead node. If, after X seconds, that timestamp still hasn’t changed for the node I was suspicious about, I can start untracking the counters that node holds… (rough sketch below)
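Something along these lines is what I have in mind (just a sketch; the :updated_at field and the 30-second threshold are arbitrary choices of mine):

# Sketch: every update also records when this node last broadcast,
# so other nodes can spot a stale counter.
defp track_count(index_counter, room, count) do
  Phoenix.Tracker.update(
    Messages.Trackers.ChannelsTracker,
    self(),
    "#{Node.self()}_local_counter_#{index_counter}",
    room,
    %{counter: count, updated_at: System.system_time(:millisecond)}
  )
end

# A node's counters look suspect if their last broadcast is older than the threshold.
defp stale?(%{updated_at: updated_at}, threshold_ms \\ 30_000) do
  System.system_time(:millisecond) - updated_at > threshold_ms
end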
Thanks