patryk.it
File encoding detection library
Hi guys!
My question is about file encodings. I'm working on a feature that lets users upload a CSV file that might be in any of several charsets. It's not only about UTF-8: I need to detect the encoding of the given file and then parse the data.
In Golang, Python, Java, C++, and Objective-C we have UniversalDetector/chardet libraries. Do we have something similar for Elixir/Erlang? I'm stuck with this.
How are you working with different charsets/encodings in Elixir applications? I know that's not the main problem this language was designed for, but I don't want to use erlports or AWS Lambda only for detecting a charset.
Most Liked
NobbZ
There is no way to get what you want.
Ask your clients to upload using a specified charset only.
If you see the single byte 0xC4, that could be a Latin-1 encoded Ä. In ISO-8859-5 (Cyrillic) the same byte is just as valid, but it encodes a different character, the Ф.
Therefore it is impossible to detect the encoding without knowing the content in advance.
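To make the ambiguity concrete, here is a minimal sketch in Python (chosen only because chardet was mentioned; the variable names are illustrative). It decodes the same single byte with two single-byte codecs from the standard library, and neither decode fails:

```python
# The same byte is valid in both encodings, so validity alone cannot pick one.
raw = b"\xc4"

as_latin1 = raw.decode("latin-1")       # U+00C4, LATIN CAPITAL LETTER A WITH DIAERESIS
as_cyrillic = raw.decode("iso-8859-5")  # U+0424, CYRILLIC CAPITAL LETTER EF

print(as_latin1, as_cyrillic)
```

Both calls succeed, which is why a detector has to fall back on guesses about the content rather than on the bytes alone.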
What you describe from the other languages is usually a very dumb heuristic:
- Check whether the input is valid UTF-8/16/32 using the corresponding validators, perhaps helped by a BOM
- If not, fall back to a single-byte encoding derived from the host's locale settings
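The two steps above could be sketched roughly like this in Python. This is not how chardet or any particular library works internally; `guess_encoding` and its `fallback` parameter are hypothetical names made up for the illustration, and only standard-library calls are used:

```python
import codecs
import locale

# BOM table. UTF-32 is checked before UTF-16 because the UTF-32 LE BOM
# (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data, fallback=None):
    """Hypothetical helper: return a codec name guessed for `data` (bytes)."""
    # Step 1a: a BOM, if present, is the strongest hint available.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Step 1b: without a BOM, accept the input as UTF-8 if it validates.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 2: otherwise guess a single-byte encoding, either the one the
    # caller supplied or whatever the host's locale prefers.
    return fallback or locale.getpreferredencoding(False)
```

For example, `guess_encoding(b"\xc4", fallback="latin-1")` returns `"latin-1"` only because the caller said so; the function has no way to tell that the byte was not meant as ISO-8859-5.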
LostKobrakai
nimble_csv supports converting data to UTF-8 before parsing and converting from UTF-8 before dumping. By default it even ships NimbleCSV.Spreadsheet, which uses UTF-16 LE to work with Excel CSVs. But you'll need to set up a module per encoding.
evadne
The libmagic route is definitely worth considering. I have a library for that. Low profile, but working.








