Hey folks,
I need to implement some sort of partitioning of data records. A certain part of records should be processed in one way, another part in another way. So I thought it would be nice to implement a stable partitioning algorithm that splits datasets into a given number of partitions with given proportions.
So, given three partitions and their relative thresholds
[a: 0, b: 0.33, c: 0.66]
I want to be able to take a term, pipe it into a partitioning function and receive the name of the partition:
partition("foo", partitions: partitions) # => :a
partition("bar", partitions: partitions) # => :c
partition("foobar", partitions: partitions) # => :b
My idea is to use a simple hash function to generate a uniformly distributed stable pseudo-random number for any term and then pick the suitable partition.
Here’s what I’ve come up. I’m really curious to see what you think of the approach or if you have ideas for better/faster/more elegant solutions.
defmodule Partitioner do
@spec partition(term :: []) :: integer()
def partition(term, opts \\ []) do
term
|> normalize(opts)
|> do_partition(opts)
end
defp normalize(term, opts) do
salt = Keyword.get(opts, :salt, nil)
{term, salt}
|> :erlang.term_to_binary()
|> do_hash()
|> :binary.first()
|> do_normalize()
end
defp do_hash(binary), do: :crypto.hash(:sha, binary)
defp do_normalize(int), do: int / 255
defp do_partition(float, opts) do
opts
|> Keyword.get(:partitions, a: 0, b: 0.5)
|> Enum.max_by(&match_partition(float, &1))
|> elem(0)
end
defp match_partition(float, {_, threshold}) when float > threshold, do: threshold
defp match_partition(_, _), do: 0
end