I have a large Postgres table with a text column called help. Typically the value is under 200 chars, but there is no length limit on the column.
I need to group by this column, but this is super slow: around 8 seconds when the query covers many rows. Obviously grouping by a text column is not a good choice (unless there's some performance tip you can share!), so my approach to optimizing this has been to add a new help_key integer column that contains an integer generated by hashing the text. Grouping by this column has proved to be much faster (now it's under 1 second).
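For reference, the schema change looks roughly like this. This is a minimal sketch assuming an Ecto app; MyApp and the articles table name are hypothetical, and the index is there because it helps the grouped queries:

defmodule MyApp.Repo.Migrations.AddHelpKey do
  use Ecto.Migration

  def change do
    # help_key stores an integer hash of help; grouping and comparing
    # fixed-width integers is much cheaper than comparing long text.
    alter table(:articles) do
      add :help_key, :bigint
    end

    create index(:articles, [:help_key])
  end
end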
Now, what I need is a good way to generate a (reasonably) unique integer from a text, so that given two different texts, the generated integers are different. I don't need to decode the hash back to the original text; all I need is a one-way conversion from text to integer.
I know that theoretically this is not possible, but is there a good way to guarantee a reasonably small chance of collision?
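For a sense of what "reasonably small" means here, the birthday bound is the standard estimate: among n uniformly random b-bit values, the probability of at least one collision is roughly n^2 / 2^(b+1). A quick sketch (the module name is just for illustration):

defmodule CollisionEstimate do
  # Birthday-bound approximation: probability of at least one collision
  # among n uniformly random b-bit values is about n^2 / 2^(b+1).
  def probability(n, bits), do: n * n / :math.pow(2, bits + 1)
end

# CollisionEstimate.probability(10_000_000, 64)
# => ~2.7e-6, so a 64-bit hash is comfortably safe at that scale.
# CollisionEstimate.probability(10_000_000, 32)
# => far above 1, i.e. a 32-bit hash is effectively certain to collide.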
My first attempt at a proof of concept is this function:
@maxint 2_147_483_647

def string_to_integer(str) do
  # Hash the text, then fold the digest bytes into a single integer.
  :crypto.hash(:sha, str)
  |> :binary.bin_to_list()
  |> Enum.with_index()
  |> Enum.map(fn {byte, i} -> byte + i end)
  |> Enum.sum()
  # Keep the result within 32-bit signed integer range.
  |> rem(@maxint)
end
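One caveat with the sum: the 20 bytes of a SHA-1 digest (plus their indexes) add up to at most about 5,300, so this function can only ever produce a few thousand distinct values and collisions appear almost immediately. A sketch of an alternative (not from the post) that keeps the digest's leading 64 bits intact instead, yielding a value that fits a Postgres bigint:

defmodule TextKey do
  # Interpret the first 8 bytes of a SHA-256 digest as a signed
  # big-endian 64-bit integer, which fits a Postgres bigint column.
  def to_bigint(text) do
    <<key::signed-big-integer-size(64), _rest::binary>> = :crypto.hash(:sha256, text)
    key
  end
end

Postgres can also do something similar server-side with its built-in hashtext()/hashtextextended() functions, though those are undocumented internals and their output may change between major versions.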
Another idea would be to have a separate table with an autoincrement id to store the unique texts, but I'm afraid this would take a lot of DB space.
Nitpick: if GROUP BY is doing anything useful with these rows, that means you’ve got multiple copies of identical text in the table currently; extracting them to a separate table and referencing them by ID will save space unless your texts are usually shorter than a bigint.
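For a rough sense of scale (hypothetical numbers, not from the post): with 10 million rows, an average 100-byte help text, and 1,000 distinct texts, storing the text inline costs on the order of 1 GB, while an 8-byte bigint reference plus a 1,000-row lookup table costs roughly 80 MB.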