ROW_NUMBER() isn’t stable across different queries though. If a single row is deleted between process A doing the query and process B doing the query then the whole count is off and it won’t partition properly. The value of the modulus approach is that you’re using a value intrinsic to the row, which ensures that all parties can agree on which partition it belongs to just by looking at it knowing the partition counts.
I’m not really sure I follow your argument with respect to the modulus logic. It doesn’t really have an issue with deleted rows, unless for some reason all rows with a specific modulus value were specifically deleted, which seems hard to imagine in a regular use case. If you’re using auto incrementing IDs, then the probability distribution of modulus values is equal. So unless there is a bias in deletes towards specific modulus results I don’t really see what you’re saying.
Concrete example:
iex(11)> ids = 1..1_000_000
1..1000000
iex(12)> ids |> Enum.map(fn i -> i |> rem(10) end) |> Enum.frequencies
%{
0 => 100000,
1 => 100000,
2 => 100000,
3 => 100000,
4 => 100000,
5 => 100000,
6 => 100000,
7 => 100000,
8 => 100000,
9 => 100000
}
We generate a million auto incrementing keys, partition them modulo 10, and we see how many go to each partition. As expected it’s perfectly even. Let’s delete the first 90%:
iex(13)> ids |> Enum.map(fn i -> i |> rem(10) end) |> Enum.drop(900_000) |> Enum.frequencies
%{
0 => 10000,
1 => 10000,
2 => 10000,
3 => 10000,
4 => 10000,
5 => 10000,
6 => 10000,
7 => 10000,
8 => 10000,
9 => 10000
}
Still even, as expected. Let’s delete a random 90%:
iex(15)> ids |> Enum.map(fn i -> i |> rem(10) end) |> Enum.filter(fn _ -> :rand.uniform(10) == 1 end) |> Enum.frequencies
%{
0 => 10007,
1 => 9958,
2 => 9975,
3 => 9888,
4 => 9937,
5 => 10032,
6 => 10153,
7 => 10072,
8 => 10029,
9 => 9854
}
Still basically even. To get a non even result you have to have some sort of delete pattern that is more common in rows with ids ending with 9
than rows ending with 7
and that just seems very rare to me, and is in any case easy to check in your own dataset to see if it’s happening.