Stable Partitioning Algorithm With No Side Effects

Hey folks,

I need to implement some sort of partitioning of data records. A certain part of records should be processed in one way, another part in another way. So I thought it would be nice to implement a stable partitioning algorithm that splits datasets into a given number of partitions with given proportions.

So, given three partitions and their relative thresholds

[a: 0, b: 0.33, c: 0.66]

I want to be able to take a term, pipe it into a partitioning function and receive the name of the partition:

partition("foo", partitions: partitions) # => :a
partition("bar", partitions: partitions) # => :c
partition("foobar", partitions: partitions) # => :b

My idea is to use a simple hash function to generate a uniformly distributed stable pseudo-random number for any term and then pick the suitable partition.

Here’s what I’ve come up. I’m really curious to see what you think of the approach or if you have ideas for better/faster/more elegant solutions.

defmodule Partitioner do
  @spec partition(term :: []) :: integer()
  def partition(term, opts \\ []) do
    |> normalize(opts)
    |> do_partition(opts)

  defp normalize(term, opts) do
    salt = Keyword.get(opts, :salt, nil)

    {term, salt}
    |> :erlang.term_to_binary()
    |> do_hash()
    |> :binary.first()
    |> do_normalize()

  defp do_hash(binary), do: :crypto.hash(:sha, binary)

  defp do_normalize(int), do: int / 255

  defp do_partition(float, opts) do
    |> Keyword.get(:partitions, a: 0, b: 0.5)
    |> Enum.max_by(&match_partition(float, &1))
    |> elem(0)

  defp match_partition(float, {_, threshold}) when float > threshold, do: threshold

  defp match_partition(_, _), do: 0

Sounds like a job for

iex(10)> partitions = [:a, :b, :c]                                   
[:a, :b, :c]
iex(11)> ["foo", "bar", "foobar"] |>, 2)) |>, &1))
[:a, :a, :b]

That’s awesome, I was half-expecting to find a function like this in the Erlang standard library :smiley:

1 Like