Calculating MD5 checksum of large files

I know I can calculate MD5 signatures of strings by doing something like this:

input                          # input is any binary (string)
|> :erlang.md5()
|> Base.encode16(case: :lower)

However, I’m wondering if there’s a good way to calculate file checksums. Reading the entire contents of a file into memory, e.g. input = File.read!(large_file), would not work well for large files. It looks like S3 does “part-level checksums” by calculating the MD5 hash of 16 MB chunks of a file. I guess I’m hoping for something as straightforward as PHP’s md5_file() function.
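For reference, here is my rough sketch of that S3-style scheme in Elixir: MD5 each 16 MB chunk, then MD5 the concatenated digests (my understanding of the multipart-ETag convention; the file name and the trailing "-N" part count are just illustrative, not anything built in):

# Sketch of S3-style part-level checksums (assumed convention,
# not a library API): hash each part, then hash the digests.
part_size = 16 * 1024 * 1024

part_digests =
  "large_file.bin"                # hypothetical path
  |> File.stream!([], part_size)  # stream in 16 MB parts
  |> Enum.map(&:erlang.md5/1)     # 16-byte digest per part

etag =
  part_digests
  |> IO.iodata_to_binary()        # concatenate the part digests
  |> :erlang.md5()                # hash of hashes
  |> Base.encode16(case: :lower)
  |> Kernel.<>("-#{length(part_digests)}")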

And just to clarify: yes, I know MD5 hashes are not cryptographically secure and are not guaranteed to be unique. For our use cases, all these limitations are fine: we don’t need to ensure uniqueness, we just want an easy way to locate files that are likely duplicates, and the database storing this only allows 32 characters for a signature, so MD5 seems to fit the bill nicely.

I’m not aware of any single function you can use like that PHP example, but I think using :crypto.hash_init/1 (docs) together with its related functions, :crypto.hash_update/2 and :crypto.hash_final/1, achieves what you need:

hash = :crypto.hash_init(:md5)

md5 =
  "data.txt"
  |> File.stream!([], 2048)       # stream the file in 2048-byte chunks
  |> Enum.reduce(hash, fn bytes, hash_state ->
    # fold each chunk into the running hash state
    :crypto.hash_update(hash_state, bytes)
  end)
  |> :crypto.hash_final()
  |> Base.encode16()              # pass case: :lower for lowercase hex

IO.puts(md5)
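If you want something closer to PHP’s md5_file(), you could wrap that pipeline in a small helper; the module name, function name, and chunk size below are just illustrative choices, not a standard API:

# Hypothetical helper wrapping the streaming approach above.
defmodule FileHash do
  @chunk_size 2048

  def md5_file(path) do
    path
    |> File.stream!([], @chunk_size)
    |> Enum.reduce(:crypto.hash_init(:md5), &:crypto.hash_update(&2, &1))
    |> :crypto.hash_final()
    |> Base.encode16(case: :lower)
  end
end

FileHash.md5_file("data.txt") |> IO.puts()

Passing :crypto.hash_init(:md5) as the reduce accumulator keeps the hash state out of the surrounding scope, and swapping :md5 for :sha256 would give you a different digest with no other changes.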

Brilliant, thank you! I came across the stream_hash package on Hex, which seems to implement your solution.