There’s no easy way to detect the character encoding of a file or string of text. The UniversalDetector linked above is probabilistic.
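The closest the Elixir standard library gets is letting you rule encodings out: String.valid?/1 checks whether a binary is well-formed UTF-8, which is a necessary but not sufficient test, since bytes from another encoding can coincidentally form valid UTF-8. A minimal sketch:

```elixir
# String.valid?/1 can rule UTF-8 out, but it cannot prove the text
# was *meant* to be UTF-8 — it only checks byte-sequence validity.
utf8   = "héllo"
latin1 = <<104, 233, 108, 108, 111>>  # "héllo" encoded as ISO-8859-1

String.valid?(utf8)    #=> true
String.valid?(latin1)  #=> false — 0xE9 starts a multi-byte UTF-8
                       #   sequence, and 108 is not a continuation byte
```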
Some required reading on this topic: “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)” by Joel Spolsky. His central point: There Ain’t No Such Thing As Plain Text.
If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.
File.stream!/3 reads files as raw binaries by default, but you can pass it the :utf8 mode if you know for sure that is the file’s encoding (see the File.stream!/3 docs).
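A quick sketch of the difference (the file name here is hypothetical):

```elixir
# Default: each line comes back as a raw binary, with no decoding applied.
File.stream!("haiku.txt")
|> Enum.take(1)

# If you know the file is UTF-8, pass the :utf8 mode so lines are
# decoded on read and invalid byte sequences surface as errors
# instead of slipping through silently.
File.stream!("haiku.txt", [:utf8])
|> Enum.take(1)
```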
On encoding, the Elixir File module docs say:
In order to write and read files, one must use the functions in the IO module. By default, a file is opened in binary mode, which requires the functions IO.binread/2 and IO.binwrite/2 to interact with the file. A developer may pass :utf8 as an option when opening the file, then the slower IO.read/2 and IO.write/2 functions must be used as they are responsible for doing the proper conversions and providing the proper data guarantees.
Note that filenames when given as char lists in Elixir are always treated as UTF-8. In particular, we expect that the shell and the operating system are configured to use UTF-8 encoding. Binary filenames are considered raw and passed to the OS as is.
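Concretely, that dichotomy looks like this (a sketch; haiku.txt is again a hypothetical file):

```elixir
# Binary mode (the default): reads return raw bytes, so use IO.binread/2.
{:ok, device} = File.open("haiku.txt", [:read])
bytes = IO.binread(device, :line)
File.close(device)

# UTF-8 mode: the device does the decoding, so use the slower IO.read/2
# and get back a validated UTF-8 string.
{:ok, device} = File.open("haiku.txt", [:read, :utf8])
line = IO.read(device, :line)
File.close(device)
```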
It looks like the Elixir standard library does not provide facilities for converting among character encodings. For that, you will need iconv or the handy-dandy Elixir library codepagex.
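For example, converting between ISO-8859-1 and UTF-8 with codepagex might look like the sketch below; this assumes its to_string/from_string API and the :iso_8859_1 encoding alias, so check the library’s README for the encodings your build actually ships with:

```elixir
# Decode an ISO-8859-1 (Latin-1) binary into a UTF-8 Elixir string...
latin1 = <<104, 233, 108, 108, 111>>  # "héllo" in Latin-1
{:ok, string} = Codepagex.to_string(latin1, :iso_8859_1)
string  #=> "héllo"

# ...and encode it back the other way.
{:ok, ^latin1} = Codepagex.from_string(string, :iso_8859_1)
```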