I know I can calculate MD5 signatures of strings by doing something like this:
input
|> :erlang.md5()
|> Base.encode16(case: :lower)
However, I’m wondering if there’s a good way to calculate file checksums. Reading the entire contents of a file into a string, e.g. input = File.read!(large_file), would not work well for large files because the whole file has to fit in memory. It looks like S3 does “part-level checksums” by calculating the MD5 hash on 16 MB chunks of a file. I guess I’m hoping for something as straightforward as PHP’s md5_file() function.
And just to clarify: yes, I know MD5 hashes are not cryptographically secure and are not guaranteed to be unique. For our use cases, these limitations are fine: we don’t need to ensure uniqueness, we just want an easy way to locate files that are likely duplicates, and the database storing this only allows 32 characters for a signature, so MD5 seems to fit the bill nicely.
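For reference, here is a rough sketch of the kind of thing I have in mind: feed the file through an incremental hash in fixed-size chunks so that only one chunk is in memory at a time. The module name FileChecksum, the function name md5_file, and the 64 KiB chunk size are all made up for illustration, and this assumes the :crypto application is available (in a Mix project it may need to be listed under extra_applications).

```elixir
defmodule FileChecksum do
  # Hypothetical chunk size; any positive number of bytes should work.
  @chunk_size 64 * 1024

  # Returns the lowercase hex MD5 digest of the file at `path`,
  # reading it in @chunk_size-byte pieces instead of all at once.
  def md5_file(path) do
    File.open!(path, [:read, :binary], fn device ->
      device
      |> IO.binstream(@chunk_size)
      # Fold every chunk into the running hash state.
      |> Enum.reduce(:crypto.hash_init(:md5), fn chunk, state ->
        :crypto.hash_update(state, chunk)
      end)
      |> :crypto.hash_final()
      |> Base.encode16(case: :lower)
    end)
  end
end
```

Calling FileChecksum.md5_file(large_file) should give the same 32-character string as the :erlang.md5() pipeline above would for the same bytes, just without loading the whole file first. Note that this is a whole-file digest, not the per-part scheme S3 appears to use.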