Stable Partitioning Algorithm With No Side Effects

wmnnd · March 20, 2020, 10:34pm

Hey folks,

I need to implement some sort of partitioning of data records. A certain part of records should be processed in one way, another part in another way. So I thought it would be nice to implement a stable partitioning algorithm that splits datasets into a given number of partitions with given proportions.

So, given three partitions and their relative thresholds

[a: 0, b: 0.33, c: 0.66]

I want to be able to take a term, pipe it into a partitioning function and receive the name of the partition:

partition("foo", partitions: partitions) # => :a
partition("bar", partitions: partitions) # => :c
partition("foobar", partitions: partitions) # => :b

My idea is to use a simple hash function to generate a uniformly distributed stable pseudo-random number for any term and then pick the suitable partition.

Here’s what I’ve come up. I’m really curious to see what you think of the approach or if you have ideas for better/faster/more elegant solutions.

defmodule Partitioner do
  @spec partition(term :: []) :: integer()
  def partition(term, opts \\ []) do
    term
    |> normalize(opts)
    |> do_partition(opts)
  end

  defp normalize(term, opts) do
    salt = Keyword.get(opts, :salt, nil)

    {term, salt}
    |> :erlang.term_to_binary()
    |> do_hash()
    |> :binary.first()
    |> do_normalize()
  end

  defp do_hash(binary), do: :crypto.hash(:sha, binary)

  defp do_normalize(int), do: int / 255

  defp do_partition(float, opts) do
    opts
    |> Keyword.get(:partitions, a: 0, b: 0.5)
    |> Enum.max_by(&match_partition(float, &1))
    |> elem(0)
  end

  defp match_partition(float, {_, threshold}) when float > threshold, do: threshold

  defp match_partition(_, _), do: 0
end

benwilson512 · March 20, 2020, 10:39pm

Sounds like a job for http://erlang.org/doc/man/erlang.html#phash2-2

iex(10)> partitions = [:a, :b, :c]                                   
[:a, :b, :c]
iex(11)> ["foo", "bar", "foobar"] |> Enum.map(&:erlang.phash2(&1, 2)) |> Enum.map(&Enum.at(partitions, &1))
[:a, :a, :b]

wmnnd · March 20, 2020, 11:09pm

That’s awesome, I was half-expecting to find a function like this in the Erlang standard library