Strange behavior with SHA vs phash2

We have some code that creates unique hashes on documents to handle duplicate checks.

It wasn’t working on nodes that had rebooted (it let dupes through), so as a last resort I switched from SHA to phash2, and now it works perfectly. This seems really strange to me. Is there some kind of timestamp seed or something in the crypto SHA that I’m not aware of?

Here’s the code change that fixed our issues:

hash =
  :erlang.phash2(doc, 1_000_000)

Here’s the original that is not working across restarts:

hash =
  :crypto.hash(:sha, :erlang.term_to_binary(doc)) |> Base.encode64(case: :lower)

Maybe it’s term_to_binary or encode64, but it surprised me a lot.


Sorry for a few follow-up questions. I am not even sure I can help here.

  • Can you consistently reproduce this? Does the same document always produce a different result after a node reboots?
  • What is doc? Is it a binary string or an Erlang term? Something else?
  • Are the initial hash and the hash after the reboot computed on the same node?
  • If they are computed on different nodes, do those nodes run on different server architectures?
  • Are you able to try the different parts of the SHA pipeline individually to see which one is not working?
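
For the last point, here is a minimal sketch of how the stages could be compared. `HashDebug`, `stages/1` and `diff/2` are hypothetical names I made up, and the sample `doc` is only a placeholder for your real document. The idea is to capture every intermediate value once before the restart and once after, then see which stage differs first.

```elixir
defmodule HashDebug do
  # Hypothetical helper: returns every intermediate value of the original
  # pipeline so the stages can be compared before and after a restart.
  def stages(doc) do
    bin = :erlang.term_to_binary(doc)
    digest = :crypto.hash(:sha, bin)

    %{
      term_binary: bin,
      sha: digest,
      encoded: Base.encode64(digest)
    }
  end

  # Lists the stages whose values differ between two captured runs.
  def diff(a, b) do
    for key <- [:term_binary, :sha, :encoded], a[key] != b[key], do: key
  end
end

# Placeholder document; substitute one that is known to slip through.
doc = %{"id" => 1, "title" => "example"}
before_restart = HashDebug.stages(doc)
# In practice this second map would be captured on the rebooted node.
after_restart = HashDebug.stages(doc)
IO.inspect(HashDebug.diff(before_restart, after_restart))
```

If `:term_binary` already shows up in the diff, the hash and the encoding are not the culprits, since both are pure functions of their input bytes.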

My bet is that it is term_to_binary. I am not sure it is meant to always produce the same result, only that it can be decoded with binary_to_term.
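
If that turns out to be the cause, one thing that might be worth trying is the `:deterministic` option of `term_to_binary/2`. This is only a sketch under assumptions: it needs Erlang/OTP 24.1 or later, and as far as I understand the docs, the stable encoding is only promised within the same major OTP release. `DocHash` is a hypothetical module name.

```elixir
defmodule DocHash do
  # Assumes Erlang/OTP 24.1+ for the :deterministic option. The encoding is
  # only documented as stable within one major OTP release.
  def hash(doc) do
    bin = :erlang.term_to_binary(doc, [:deterministic])

    :crypto.hash(:sha, bin)
    |> Base.encode64()
  end
end

# Two equal maps built in a different insertion order.
m1 = Map.new(1..100, fn i -> {i, i} end)
m2 = Map.new(Enum.reverse(1..100), fn i -> {i, i} end)
IO.inspect(DocHash.hash(m1) == DocHash.hash(m2))
```

This keeps the SHA digest from the original code and only changes how the term is serialized before hashing.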

phash2, on the other hand, is designed to always produce a consistent result across different versions of Erlang and machine architectures.


The collision chance for phash2 is gonna be a lot higher than a sha though, so I’d definitely recommend finding a way to use a sha.
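
To put a rough number on that, here is a back-of-the-envelope sketch using the usual birthday approximation. It assumes the hash spreads documents uniformly over the range, which is an idealization, and `Birthday` is a hypothetical module name. With the range of 1_000_000 from the snippet above, the chance of at least one collision comes out slightly above 0.5 at around 1,200 documents.

```elixir
defmodule Birthday do
  # Approximate probability of at least one collision among n values drawn
  # uniformly from `range` buckets: 1 - exp(-n * (n - 1) / (2 * range)).
  def collision_probability(n, range) do
    1.0 - :math.exp(-n * (n - 1) / (2 * range))
  end
end

# Range used in the phash2 call from the original post.
IO.inspect(Birthday.collision_probability(1_200, 1_000_000))
# Largest range phash2 accepts (2^32), for comparison.
IO.inspect(Birthday.collision_probability(1_200, 4_294_967_296))
```

Even at the largest range phash2 accepts, the output is 32 bits at most, against 160 bits for the SHA-1 digest in the original code.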
