Generating unique integer hashes from text

Hello!

I have a large postgres table that has a text column called help. Typically this is under 200 chars but there is no length limit on that column.

I need to group by this column, but this is super slow - like 8 seconds or so when there are many rows in the query. Obviously grouping by a text column is not a good choice (unless there’s some performance tip that you can share!), so my approach to optimizing this has been adding a new help_key integer column that contains an integer generated as a hash of that text. Grouping by this column has proved to be much faster (now it’s under 1 second).
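To illustrate with Ecto, the two queries look roughly like this (a rough sketch; the count is just an example aggregate):

    import Ecto.Query

    # before: grouping on the raw text column (slow on a large table)
    slow = from(v in "violations", group_by: v.help, select: {v.help, count()})

    # after: grouping on the integer hash column instead (much faster)
    fast = from(v in "violations", group_by: v.help_key, select: {v.help_key, count()})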

Now, what I need is a good way to generate a (reasonably) unique integer from a text so I can ensure that, given 2 different texts, the generated integers are different. I don’t need to decode the hash back to the original text; all I need is a one-way conversion from text to integer.

I know that theoretically this is not possible, but is there a good way to guarantee a reasonably small chance of collision?

My first attempt at a proof of concept is this function:

    @maxint 2_147_483_647

    # Proof of concept: SHA-1 the string, sum its bytes (each offset by its
    # index), and keep the result below the Postgres integer maximum.
    def string_to_integer(str) do
      :crypto.hash(:sha, str)
      |> :binary.bin_to_list()
      |> Enum.with_index()
      |> Enum.map(fn {x, i} -> x + i end)
      |> Enum.sum()
      |> rem(@maxint)
    end

Another idea would be having a separate table with an autoincrement id to store the unique texts, but I’m afraid this would take a lot of DB space.

What would be your approach to this problem?

Take a look at :erlang.phash2/1 :wink:

iex(1)> :erlang.phash2("This is some text")
46491085
iex(2)> :erlang.phash2("This is some more text")
69427601

Have you tried solving this at the database level by adding an index on that column, to see if that is quick enough?

One caveat: by relying on the suggested phash2/1,2, it will be hard to reproduce the technique from another language if that becomes necessary later on.
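A more portable option might be to derive the integer from a standard digest instead, e.g. take the first 8 bytes of a SHA-256 hash and clear the top bit so it fits a signed 64-bit column. A rough, untested sketch (TextHash is just a placeholder name):

    defmodule TextHash do
      import Bitwise

      # SHA-256 the text, keep the first 8 bytes as an unsigned integer,
      # and clear the top bit so the value fits a signed bigint column.
      # Any language with SHA-256 can reproduce the same value.
      def to_bigint(text) do
        :crypto.hash(:sha256, text)
        |> binary_part(0, 8)
        |> :binary.decode_unsigned()
        |> band(0x7FFF_FFFF_FFFF_FFFF)
      end
    end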

Thanks, this looks great. I have to see how I can specify a range larger than a regular integer to minimize the chance of collisions, though.
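If I’m reading the docs right, phash2/1 only uses 2^27 values by default, while phash2/2 takes an explicit range of up to 2^32, so for a bigint column something like this might already be enough (help_text being whatever the text value is):

    # use the full 32-bit range instead of the default 2^27
    help_key = :erlang.phash2(help_text, 4_294_967_296)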

I already have this index, which improves searching, but it doesn’t help with grouping:

    create(
      index("violations", ["(to_tsvector('english', help))"],
        name: :violations_help_vector,
        using: "GIN"
      )
    )

Wikipedia has a nice table of collision probabilities given hash size and number of items.

For example, with a 32-bit hash and 9300 strings there is a 1% chance of collision, and with 77000 strings a 50% chance.
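Those numbers match the usual birthday-problem approximation, which is easy to check from iex (a quick sketch):

    # p ≈ 1 - exp(-n * (n - 1) / (2 * 2^bits))
    collision_prob = fn n, bits ->
      1.0 - :math.exp(-n * (n - 1) / (2 * :math.pow(2, bits)))
    end

    collision_prob.(9_300, 32)   # ≈ 0.01
    collision_prob.(77_000, 32)  # ≈ 0.50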


Thanks, I’ll run some checks using the current content we have and see how many collisions there are.
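Probably something along these lines (a rough sketch; MyApp.Repo is a placeholder for our repo module):

    import Ecto.Query

    # pull the distinct help texts and hash each one
    texts =
      from(v in "violations", distinct: true, select: v.help)
      |> MyApp.Repo.all()

    hashes = Enum.map(texts, &:erlang.phash2(&1, 4_294_967_296))

    # number of distinct texts that end up sharing a hash value
    length(texts) - length(Enum.uniq(hashes))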

Nitpick: if GROUP BY is doing anything useful with these rows, that means you’ve got multiple copies of identical text in the table currently; extracting them to a separate table and referencing them by ID will save space unless your texts are usually shorter than a bigint.
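Roughly something like this, in case it helps (a sketch; the help_texts table, help_text_id column, and module name are just placeholders):

    defmodule MyApp.Repo.Migrations.ExtractHelpTexts do
      use Ecto.Migration

      def change do
        # each distinct text is stored once...
        create table(:help_texts) do
          add(:help, :text, null: false)
        end

        create(unique_index(:help_texts, [:help]))

        # ...and referenced from violations by id
        alter table(:violations) do
          add(:help_text_id, references(:help_texts))
        end
      end
    end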


Then you lose the ability to edit those texts individually. This might be a requirement, or it might not be…

You’re right, probably the best solution is improving the relational design of those tables.

True. Not in my case; these texts don’t need to be edited. I would need to do some cleanup and remove orphan texts, though.