My question is related to file encodings. I’m working on a feature that lets users upload a CSV file that might be in a different charset. It’s not only about UTF-8: I need to detect the encoding first and then parse the data.
In Go, Python, Java, C++, and Objective-C we have UniversalDetector/chardet libraries. Is there something similar for Elixir/Erlang? I’m stuck on this.
How do you work with different charsets/encodings in Elixir applications? I know this isn’t the kind of problem the language was designed for, but I don’t want to use erlport or an AWS Lambda just to detect a charset.
nimble_csv supports converting data to UTF-8 before parsing and converting from UTF-8 before dumping. By default it even ships NimbleCSV.Spreadsheet, which uses UTF-16 LE to work with Excel CSVs. But you’ll need to set up one module per encoding.
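For reference, a minimal sketch of what that per-encoding setup might look like (Latin1CSV is a made-up module name, and this assumes nimble_csv is a dependency and that its :encoding option accepts :latin1 as documented for the :unicode module):

```elixir
# NimbleCSV.Spreadsheet ships ready-made for Excel exports:
# UTF-16 LE, tab-separated, BOM-aware.
alias NimbleCSV.Spreadsheet

# A hypothetical parser for Latin-1 input; NimbleCSV converts the data
# to UTF-8 before parsing and back to Latin-1 when dumping.
NimbleCSV.define(Latin1CSV, separator: ",", escape: "\"", encoding: :latin1)

excel_rows = "excel_export.csv" |> File.read!() |> Spreadsheet.parse_string()
latin1_rows = "legacy.csv" |> File.read!() |> Latin1CSV.parse_string()
```

Note that this only helps once you already know which encoding a given file uses; it doesn’t detect anything.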
Thanks for your reply.
You’re right: NimbleCSV supports encoding conversion (via the :encoding option), but my problem is that I don’t know the encoding of the incoming file up front. It may be UTF-8 or one of many others, e.g. files with German characters.
When I try to parse a CSV with German characters I get:
`no function clause matching in CSV.Decoding.Preprocessing.Lines.starts_sequence?/5`
Meh. Do I need to port chardet from one of the other languages? I’m stuck on this problem.
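For what it’s worth, German characters in a non-UTF-8 CSV very often mean Latin-1 (ISO-8859-1) or Windows-1252. Before porting a whole detector, a crude fallback using only the Erlang standard library may be enough. A sketch (Recode is a made-up module name, and “not valid UTF-8, therefore Latin-1” is a heuristic, not real detection):

```elixir
defmodule Recode do
  # Return a UTF-8 binary: pass valid UTF-8 through unchanged,
  # otherwise assume Latin-1 and convert with the Erlang stdlib.
  def to_utf8(binary) do
    if String.valid?(binary) do
      binary
    else
      :unicode.characters_to_binary(binary, :latin1, :utf8)
    end
  end
end

Recode.to_utf8(<<"Gr", 0xFC, 0xDF, "e">>)
# → "Grüße"
```

The result can then be handed to the CSV parser as plain UTF-8.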
I know; that’s why I explicitly quoted only the part of your question about parsing.
Generally I’d suggest trying UTF-16 LE; if the CSV comes out of Excel, that will be the encoding. If you really need the dark arts of detecting an encoding (especially when no byte order mark is set), then you’ll need to wait for the input of people more knowledgeable about that than myself.
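Checking for a byte order mark is at least cheap: Excel’s “Unicode Text” export starts with the UTF-16 LE BOM bytes FF FE. A minimal sketch (BOM is a made-up module name; files without a BOM still need real detection):

```elixir
defmodule BOM do
  # Sniff the byte order mark at the start of a binary.
  def detect(<<0xEF, 0xBB, 0xBF, _::binary>>), do: :utf8
  def detect(<<0xFF, 0xFE, _::binary>>), do: {:utf16, :little}
  def detect(<<0xFE, 0xFF, _::binary>>), do: {:utf16, :big}
  def detect(_), do: :unknown
end

BOM.detect(<<0xFF, 0xFE, 0x41, 0x00>>)
# → {:utf16, :little}
```

Something like `BOM.detect(File.read!(path))` would then tell you whether the easy cases apply before you reach for anything heavier.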
By now I’ve just let people copy and paste out of Excel into a textarea, which results in the data getting pasted in a CSV-like format as well.
I tried out a package today called ExMagic, which is a NIF wrapper around libmagic. If running the `file` Unix command helps determine the text encoding (I’m not sure it does), you might get something meaningful from ExMagic by passing it some raw binary; you could get away with just the first 2 kB or so of the file.
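If shelling out is acceptable at all, you could also skip the NIF entirely and call file(1) directly: with `--mime-encoding` it prints just its charset guess (e.g. `utf-8`, `iso-8859-1`, `utf-16le`). A sketch, assuming the `file` command is available on the host (Sniff is a made-up module name):

```elixir
defmodule Sniff do
  # -b suppresses the filename in the output;
  # --mime-encoding prints only the charset guess.
  def encoding(path) do
    {out, 0} = System.cmd("file", ["-b", "--mime-encoding", path])
    String.trim(out)
  end
end
```

For the German-character case above, `Sniff.encoding("upload.csv")` would typically come back as something like `"iso-8859-1"`.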
I ended up using ExMagic, as it is just a NIF wrapper without any processes/GenServers.
I am curious though, you mention this:
“The Server should be run under a pool which provides concurrency and resiliency”
Is that really necessary? I may not fully understand NIFs: does calling a NIF function get serialized somehow? I figured that if you call the NIF from multiple BEAM processes, they would all just invoke the C code independently of each other, without needing to be wrapped in a pool for concurrent access.
Pools as in poolboy, so you get the benefit of not touching the content of the file within the OS process responsible for running the BEAM. Design-wise, I like to isolate external programs and communicate with them using ports rather than NIFs: when a NIF crashes, the whole VM comes down.
While in many cases I would agree, I think libmagic and the file(1) command on Unix have been tested extremely thoroughly, and if they were to crash on some arbitrary input, that would be worthy of a CVE.
If you write your own low-level C implementation and don’t have Google security engineers fuzzing it for weeks, then yes, it would probably be safer as a port…
When comparing gen_magic to ExMagic, I figured the less code wrapped around libmagic, the better.
IMO, adding a pool and state (gen_statem) around libmagic is unnecessary complexity that may introduce bugs of its own…