UTF8 - encoding issues reading CSV file

I don’t think this should be under ‘Advanced’, as I’m definitely not advanced :wink: Sorry for the length, but I’ve tried a bunch of things – everything I can think of.

I’ve been attempting to read in a CSV file. I found an awesome CSV library: csv. It’s standards compliant and uses parallel streams! My problem seems to be with character encoding. I’m not an expert in this area, so I apologize if I get the terminology wrong.

The file I’m reading is supposedly UTF8 encoded and if I read it in using the CSV library like so:

File.stream!("BADFILE.CSV") |> CSV.Decoder.decode(headers: true) |> Enum.to_list

I get

** (CSV.Lexer.EncodingError) Invalid encoding on line 10983
             lib/csv/decoder.ex:168: CSV.Decoder.handle_error_for_result!/1
    (elixir) lib/stream.ex:454: anonymous fn/4 in Stream.map/2
    (elixir) lib/enum.ex:2744: Enumerable.List.reduce/3
    (elixir) lib/stream.ex:732: Stream.do_list_transform/9
    (elixir) lib/stream.ex:1247: Enumerable.Stream.do_each/4
    (elixir) lib/enum.ex:1477: Enum.reduce/3
    (elixir) lib/enum.ex:2248: Enum.to_list/1

I narrowed it down to the character á, which should be valid Unicode. I then read up on File.stream!/2 and found that it supports a :utf8 mode. So I created a file with just that character in it and tried this:

File.stream!("SHORT_BADFILE.CSV",[:utf8]) |> CSV.Decoder.decode(headers: true) |> Enum.to_list

and get this:

** (UndefinedFunctionError) undefined function :unicode.format_error/1
    (stdlib) :unicode.format_error(:unicode)
    (kernel) file.erl:148: :file.format_error/1
    (elixir) lib/io/stream.ex:6: IO.StreamError.exception/1
    (elixir) lib/io.ex:416: IO.each_stream/2
    (elixir) lib/stream.ex:1099: Stream.do_resource/5
    (elixir) lib/stream.ex:700: Stream.do_transform/8
    (elixir) lib/enum.ex:2066: Enum.take/2
             lib/csv/decoder.ex:153: CSV.Decoder.get_first_row/2

I dug through the Elixir and Erlang source code to figure this one out, and the “UndefinedFunctionError” is misleading: file.format_error/1 tries to call format_error/1 on the module named in the error tuple, and in this case that module (:unicode) doesn’t export one. Here’s the relevant code from file.erl:

format_error({Line, Mod, Reason}) ->
    io_lib:format("~w: ~ts", [Line, Mod:format_error(Reason)]);
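
A quick way to see what’s going on is to check in iex whether :unicode exports that function at all:

iex> Code.ensure_loaded?(:unicode)
true
iex> function_exported?(:unicode, :format_error, 1)
false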

If I try

File.read!("SHORT_BADFILE.CSV")

I get

<<225, 10>>

Those are the bytes I’d expect, but I can’t figure out why they can’t be decoded.
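For what it’s worth, String.valid?/1 confirms those bytes aren’t a valid UTF-8 sequence, whereas the character pasted straight into a string literal is:

iex> String.valid?(<<225, 10>>)
false
iex> String.valid?("á\n")
true
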
Any ideas?

1 Like

It’s strange; I downloaded csv and ran the code above without any errors.
Did you run mix deps.get to fetch your dependencies?

1 Like

Thank you for checking that. I tried copying the á and pasting it into a file. This worked great! Unfortunately, I think the issue is that my source file is incorrectly encoded – or I just don’t know how it’s encoded or how to tell Elixir how to read it.

In the “bad” file, the character is encoded as the single byte 0xE1. This file contains á followed by \n:

$ od -x BAD.CSV
0000000 0ae1
0000002

When I copied the character from my browser into a text file, the character is the 2 bytes 0xC3 0xA1, followed by a \n (0x0A):

od -x good.csv 
0000000 a1c3 000a
0000003

This page has better information on encodings for this character. It seems that good.csv is actually UTF-8 encoded, while bad.csv is something else.
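
One way to check that guess from iex: Erlang’s :unicode module can transcode Latin-1, and converting the lone 0xE1 byte as Latin-1 produces exactly the two UTF-8 bytes (0xC3 0xA1) seen in good.csv:

iex> :unicode.characters_to_binary(<<0xE1>>, :latin1, :utf8)
"á"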

Any thoughts on how to read this mangled encoding?
Either that or I need to find out at which point in the pipeline this file is getting munged.

2 Likes

Wow, I know having clean data is good, but this really drives it home.

I ended up doing this, which feels really hackish:

for <<a <- File.read!("BADFILE.CSV") >>, into: <<>>, do: <<a::utf8>>

especially since it returns the whole file as a single string, so I had to use String.split/2 to get it into a format the CSV library could use.

Looking at this, there’s probably some way to use pattern matching to get the comprehension to output a line at a time, but I’m too tired to work on it right now. It would also be nice if I could use a Stream for the comprehension.
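
For the record, here’s a rough sketch of the same byte-by-byte re-encoding done lazily with streams instead of a comprehension over the whole file. It assumes every byte really is its own character (Latin-1-style), and the output file name is made up:

"BADFILE.CSV"
|> File.stream!([], 2048)
|> Stream.map(fn chunk ->
  # re-encode each byte as its own UTF-8 codepoint
  for <<byte <- chunk>>, into: "", do: <<byte::utf8>>
end)
|> Enum.into(File.stream!("BADFILE_UTF8.CSV"))

The converted copy can then be streamed line-by-line into the CSV library as before.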

2 Likes

Maybe you already tried this, but can you use File.stream!/3?

Unfortunately, there’s no straightforward way to detect the character encoding of a file. There’s a project called UniversalDetector that uses heuristics and statistical analysis to guess the encoding.

It may be that your file just has junk characters. You could run it through some regex to strip any non-printable characters.
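
Something along these lines would do it, as a rough sketch (without the u modifier the regex works byte-wise, so it’s safe on a non-UTF-8 binary, but note it simply throws away everything outside printable ASCII, including the á):

raw = File.read!("BADFILE.CSV")
# drop every byte outside printable ASCII, keeping CR/LF
clean = Regex.replace(~r/[^\x20-\x7e\r\n]/, raw, "")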

2 Likes

No, File.stream!/3 won’t work, according to José Valim’s answer here.

I have proposed a “stream for” and “parallel for” for future Elixir versions, however it is pending some other improvements to the language.

I am excited to see those things added, though!

As far as detection goes, I already know that these files have characters that are valid, just not UTF-8 encoded. I’m currently mulling over pre-processing the files with tr or something similar to convert the odd characters to proper UTF-8.

2 Likes

After far too many hours spent on this, I’ll answer my own question. The solution I came up with is to use iconv to convert the file to UTF-8 encoding. In my case I use

$ cat FILE |  iconv -f WINDOWS-1250 -t UTF-8 -o NEWFILE 

And NEWFILE is loaded correctly into my elixir script!
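
If you’d rather run that conversion from inside the Elixir script instead of the shell, System.cmd/3 can invoke the same tool (file names here are placeholders):

System.cmd("iconv", ["-f", "WINDOWS-1250", "-t", "UTF-8", "-o", "NEWFILE", "FILE"])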

I guess it pays to start out with the correct encoding :blush:

6 Likes

So the file was originally encoded using something like windows-1250/windows-1252/latin-1/latin-2?

I’ve had similar issues; Latin-1 encoding is very common in my country.

I once had a database table that was supposedly UTF-8, but the actual data was Latin-1, and I couldn’t change the table collation without double-encoding the values…

1 Like

Yes, the original encoding must have been one of those. My source for the data couldn’t tell me what the encoding was so I had to guess.

I mistakenly thought it was UTF-8 because the characters looked right when I set my terminal to UTF-8 :smile: Had I been using Elixir/Erlang from the start, I’d have found out right away.

1 Like

I’m not sure this is a solution. IMO, there should be a way to read files using other encodings.

1 Like

There’s no easy way to detect the character encoding of a file or string of text. The UniversalDetector linked above is probabilistic.

Some required reading on this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

There Ain’t No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

File.stream!/3 reads files in raw binary mode by default, but you can pass it an encoding option such as :utf8 if you know for sure that’s what the file is: File.stream!/3 docs

Encoding

In order to write and read files, one must use the functions in the IO module. By default, a file is opened in binary mode, which requires the functions IO.binread/2 and IO.binwrite/2 to interact with the file. A developer may pass :utf8 as an option when opening the file, then the slower IO.read/2 and IO.write/2 functions must be used as they are responsible for doing the proper conversions and providing the proper data guarantees.

Note that filenames when given as char lists in Elixir are always treated as UTF-8. In particular, we expect that the shell and the operating system are configured to use UTF-8 encoding. Binary filenames are considered raw and passed to the OS as is.

It looks like the Elixir standard library does not provide facilities to convert among character encodings. For this, you will need iconv or the handy-dandy Elixir library codepagex.
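
That said, for the narrow Latin-1 case Erlang’s :unicode module (in the standard library) can do the conversion in-process, and a byte like 0xE1 maps to á in both Latin-1 and Windows-1250. A minimal sketch, with placeholder file names:

utf8 = :unicode.characters_to_binary(File.read!("BADFILE.CSV"), :latin1, :utf8)
File.write!("GOODFILE.CSV", utf8)

This only covers code points where Windows-1250 agrees with Latin-1; for the full Windows-1250 range you’d still want iconv or codepagex.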

1 Like

I just had this issue and solved it by specifying the encoding to File.stream!:

    file_path
    |> File.stream!([{:encoding, :latin1}])
    |> CSV.decode(headers: true)
    # ....

I didn’t initially know how the file was encoded; I just used trial-and-error with the various supported encodings listed here.

2 Likes