File encoding detection library

patryk.it · October 16, 2020, 7:53am

Hi guys!

My question is related to file encoding. I’m working on a feature that allows users to upload a CSV file that might be in a different charset. It’s not only about UTF-8: I need to detect the given encoding and then parse the data.

In Golang, Python, Java, C++ and Objective-C we have UniversalDetector/chardet libraries. Is there something similar for Elixir/Erlang? I’m stuck with this.

How are you working with different charsets/encodings in Elixir applications? I know that’s not the main problem this language was designed for, but I don’t want to use erlport or AWS Lambda just to detect a charset.

LostKobrakai · October 16, 2020, 7:59am

nimble_csv supports converting data to UTF-8 before parsing and converting from UTF-8 before dumping. By default it even ships NimbleCSV.Spreadsheet, which uses UTF-16 LE to work with Excel CSVs. But you’ll need to set up a module per encoding.
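A minimal sketch of the "module per encoding" idea, assuming network access for `Mix.install` and a guessed version requirement. The module names `Latin1CSV` and `Utf8CSV` and the sample data are hypothetical; `NimbleCSV.define/2` and its `:encoding` option are the parts taken from the library.

```elixir
# Assumption: network access is available and "~> 1.2" resolves to a
# release that supports the :encoding option.
Mix.install([{:nimble_csv, "~> 1.2"}])

# One parser module per encoding you expect to receive (names are hypothetical).
NimbleCSV.define(Latin1CSV, separator: ",", escape: "\"", encoding: :latin1)
NimbleCSV.define(Utf8CSV, separator: ",", escape: "\"")

# "Käse" encoded as Latin-1: the ä is the single byte 0xE4.
latin1_csv = <<"name,city\nK", 0xE4, "se,Berlin\n">>

# parse_string/1 skips the header row by default and returns UTF-8 fields.
rows = Latin1CSV.parse_string(latin1_csv)
IO.inspect(rows)
```

This only helps once you already know which module to pick, so it does not solve detection by itself.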

patryk.it · October 16, 2020, 9:06am

Thanks for your reply.
You’re right, NimbleCSV supports encoding conversion (via the :encoding option), but my problem is that I don’t know the encoding of the incoming file. It may be UTF-8 or one of many others, like the ones used for German characters.

When I try to parse a CSV with German characters, I get:

no function clause matching in CSV.Decoding.Preprocessing.Lines.starts_sequence?/5

Meh. Do I need to port chardet from the other languages? I’m stuck with this problem.

LostKobrakai · October 16, 2020, 9:12am

I know, that’s why I explicitly quoted only the part where you ask about parsing.

Generally I’d suggest trying UTF-16 LE. If the CSV comes out of Excel, this will be the encoding. If you really need the dark arts of detecting the encoding (especially if no byte order mark is set), then you’ll need to wait for input from people more knowledgeable in that than myself.

So far I’ve just let people copy and paste out of Excel into a textarea, which results in CSV-formatted text getting pasted as well.

NobbZ · October 16, 2020, 9:23am

There is no way to get what you want.

Ask your clients to upload using a specified charset only.

If you see the single byte 0xC4, that could be a Latin-1 encoded Ä, but in ISO-8859-5 (Cyrillic) it is just as valid and encodes a different character, Ф.

Therefore it is impossible to detect the encoding without knowing the content in advance.

What you describe from the other languages is usually a very dumb heuristic:

1. Check if the input is valid UTF-8/16/32 using the corresponding validators, perhaps helped by a BOM.
2. If not, fall back to a single-byte encoding derived from the host’s locale settings.
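The heuristic described above can be sketched in plain Elixir with no dependencies. This is only an illustration under stated assumptions: the module name `EncodingGuess` is hypothetical, the fallback is hard-wired to Latin-1 instead of being read from the host's locale, and without a BOM only UTF-8 is validated (UTF-16/32 without a BOM is not attempted).

```elixir
defmodule EncodingGuess do
  @moduledoc "Hypothetical helper: BOM check, then UTF-8 validation, then a fixed fallback."

  # Returns {:ok, encoding, utf8_binary} or {:error, encoding}.
  # Assumption: :latin1 is the fallback; it is a guess, not a detection.
  def to_utf8(data, fallback \\ :latin1) when is_binary(data) do
    {encoding, body} =
      case :unicode.bom_to_encoding(data) do
        # {:latin1, 0} is what :unicode returns when no BOM is present.
        {:latin1, 0} ->
          if String.valid?(data), do: {:utf8, data}, else: {fallback, data}

        # A BOM was found: trust it and strip it before converting.
        {bom_encoding, bom_length} ->
          {bom_encoding, binary_part(data, bom_length, byte_size(data) - bom_length)}
      end

    case :unicode.characters_to_binary(body, encoding, :utf8) do
      utf8 when is_binary(utf8) -> {:ok, encoding, utf8}
      _error_or_incomplete -> {:error, encoding}
    end
  end
end

# Valid UTF-8 passes through unchanged.
IO.inspect(EncodingGuess.to_utf8("Grüße"))

# The same word in Latin-1 (ü = 0xFC, ß = 0xDF) is not valid UTF-8,
# so the fallback is applied.
IO.inspect(EncodingGuess.to_utf8(<<"Gr", 0xFC, 0xDF, "e">>))

# The lone byte 0xC4 is decoded as Latin-1 "Ä"; the code has no way of
# knowing whether ISO-8859-5 "Ф" was meant instead.
IO.inspect(EncodingGuess.to_utf8(<<0xC4>>))
```

The last call shows the limitation from the post: any byte sequence decodes "successfully" as Latin-1, so a successful fallback says nothing about whether the guess was right.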

patryk.it · October 16, 2020, 9:23am

Anyway, thanks a lot for trying to help! Yes, I don’t have a BOM set in these files.

I’m waiting for an encoding hero.

harmon25 · October 16, 2020, 10:46pm

Tried out a package today called ExMagic that is a NIF wrapper around ‘libmagic’… If running the ‘file’ Unix command helps determine the text encoding (not sure if it does), you might get something meaningful from ExMagic by passing it some raw binary… You could probably get away with just the first 2 KB or so of the file.

ondrej-tucek · October 17, 2020, 1:33pm

I think the file command does, see lesmana’s reply. In short:

$ file -i *
umlaut-iso88591.txt: text/plain; charset=iso-8859-1
umlaut-utf16.txt:    text/plain; charset=utf-16le
umlaut-utf8.txt:     text/plain; charset=utf-

There are also other Unix commands: enca and uchardet. So you could connect to them via ports…
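A minimal sketch of calling the file command from Elixir. `System.cmd/3` runs the external program through a port under the hood. The module name `FileCharset` and the error atoms are hypothetical, and the sketch assumes a `file` binary that understands `--brief` and `--mime-encoding`; the exact charset string it reports depends on the installed libmagic version.

```elixir
defmodule FileCharset do
  # Hypothetical wrapper around the file(1) command.
  # Returns {:ok, charset_string} or {:error, reason}.
  def detect(path) do
    cond do
      not File.regular?(path) ->
        {:error, :not_a_regular_file}

      is_nil(System.find_executable("file")) ->
        {:error, :file_command_not_found}

      true ->
        # --brief drops the file name, --mime-encoding prints only the charset.
        case System.cmd("file", ["--brief", "--mime-encoding", path], stderr_to_stdout: true) do
          {output, 0} -> {:ok, String.trim(output)}
          {output, status} -> {:error, {status, String.trim(output)}}
        end
    end
  end
end

# Usage: write a small Latin-1 sample to a temp file and ask file(1) about it.
path = Path.join(System.tmp_dir!(), "charset_probe.txt")
File.write!(path, <<"Gr", 0xFC, 0xDF, "e\n">>)
IO.inspect(FileCharset.detect(path))
```

The result is still a guess made by libmagic, so it is worth treating the returned charset as a hint and handling conversion failures afterwards.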

harmon25 · October 17, 2020, 4:25pm

Nice!
Not sure if the ExMagic package runs the code with the equivalent of the -i option to actually get back the encoding. I think it just returns the MIME type… It might be an easy tweak though.

evadne · October 17, 2020, 10:52pm

The libMagic route is definitely worth considering. I have a library for that. Low profile, but working.

harmon25 · October 18, 2020, 3:40am

I ended up using ExMagic as it is just a NIF wrapper without any processes/gen_server.

I am curious though, you mention this:

The Server should be run under a pool which provides concurrency and resiliency

Is that really necessary? I may not fully understand NIFs. Does calling a NIF function get serialized somehow? I figured that if you were calling the NIF from, say, multiple BEAM processes, they would all just invoke the C code independently of each other and would not need to be wrapped in a pool for concurrent access?

evadne · October 18, 2020, 1:18pm

Pools as in Poolboy so you have the benefit of not touching the content of the file within the OS process responsible for running BEAM — design wise, I like to isolate external programs and communicate with them using ports rather than use NIFs. When a NIF crashes the whole thing comes down.

harmon25 · October 18, 2020, 2:47pm

While in many cases I would agree, I think libmagic and the file(1) command in Unix have been tested extremely thoroughly, and if they were to crash on some arbitrary input, that would be worthy of a CVE.

If you write your own low-level C implementation and do not have Google security engineers fuzzing it for weeks, then yeah, it would probably be safer as a port…

When comparing gen_magic to ExMagic, I figured the less code wrapped around libmagic - the better.

IMO the addition of a pool and state (gen_statem) to libmagic is unnecessary complexity that may introduce different bugs…