ejc123
wtUTF8 - encoding issues reading CSV file
I don’t think this should be under ‘Advanced’, as I’m definitely not advanced
Sorry for the length, but I’ve tried a bunch of things – everything I can think of.
I’ve been attempting to read in a CSV file. I found an awesome CSV library, csv. It’s standards compliant and uses parallel streams! My problem seems to be with character encoding. I’m not an expert in this area, so I apologize if I get the terminology wrong.
The file I’m reading is supposedly UTF8 encoded and if I read it in using the CSV library like so:
File.stream!("BADFILE.CSV") |> CSV.Decoder.decode(headers: true) |> Enum.to_list
I get
** (CSV.Lexer.EncodingError) Invalid encoding on line 10983
lib/csv/decoder.ex:168: CSV.Decoder.handle_error_for_result!/1
(elixir) lib/stream.ex:454: anonymous fn/4 in Stream.map/2
(elixir) lib/enum.ex:2744: Enumerable.List.reduce/3
(elixir) lib/stream.ex:732: Stream.do_list_transform/9
(elixir) lib/stream.ex:1247: Enumerable.Stream.do_each/4
(elixir) lib/enum.ex:1477: Enum.reduce/3
(elixir) lib/enum.ex:2248: Enum.to_list/1
I narrowed it down to the character á, which should be valid Unicode. I then read up on File.stream!/2 and found that it supports a :utf8 mode, so I created a file with just that character in it and tried this:
File.stream!("SHORT_BADFILE.CSV",[:utf8]) |> CSV.Decoder.decode(headers: true) |> Enum.to_list
and get this:
** (UndefinedFunctionError) undefined function :unicode.format_error/1
(stdlib) :unicode.format_error(:unicode)
(kernel) file.erl:148: :file.format_error/1
(elixir) lib/io/stream.ex:6: IO.StreamError.exception/1
(elixir) lib/io.ex:416: IO.each_stream/2
(elixir) lib/stream.ex:1099: Stream.do_resource/5
(elixir) lib/stream.ex:700: Stream.do_transform/8
(elixir) lib/enum.ex:2066: Enum.take/2
lib/csv/decoder.ex:153: CSV.Decoder.get_first_row/2
I dug through the Elixir and Erlang source code to figure this one out, and the “UndefinedFunctionError” is misleading: :file.format_error/1 tries to call format_error/1 on whichever module is named in the error tuple, which here is :unicode, and that module doesn’t export such a function. Here’s the relevant code from file.erl:
format_error({Line, Mod, Reason}) ->
io_lib:format("~w: ~ts", [Line, Mod:format_error(Reason)]);
If I try
File.read!("SHORT_BADFILE.CSV")
I get
<<225, 10>>
Those are the bytes I’d expect, but I can’t figure out why they can’t be decoded.
Any ideas?
Marked As Solved
ejc123
After far too many hours spent on this, I’ll answer my own question. The solution I came up with is to use iconv to convert the file to UTF-8 encoding. In my case I use
$ cat FILE | iconv -f WINDOWS-1250 -t UTF-8 -o NEWFILE
And NEWFILE loads correctly into my Elixir script!
I guess it pays to start out with the correct encoding.
Also Liked
nathanl
I just had this issue and solved it by specifying the encoding to File.stream!:
file_path
|> File.stream!([{:encoding, :latin1}])
|> CSV.decode(headers: true)
# ....
I didn’t initially know how the file was encoded; I just used trial-and-error with the various supported encodings listed here.
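Before guessing at encodings, it can help to confirm which lines are not valid UTF-8 in the first place. Here is a minimal sketch, assuming the file is read as raw bytes (the default for File.stream!/1). The module name Utf8Check and the function invalid_lines/1 are made up for illustration and are not part of Elixir or the CSV library:

```elixir
defmodule Utf8Check do
  # Hypothetical helper: lazily yields {line_number, raw_line} for every
  # line of the file that is not valid UTF-8.
  def invalid_lines(path) do
    path
    # No :encoding option, so each line arrives as untouched bytes.
    |> File.stream!()
    # Number the lines starting at 1, like the CSV library's error message.
    |> Stream.with_index(1)
    # Keep only the lines that fail UTF-8 validation.
    |> Stream.reject(fn {line, _number} -> String.valid?(line) end)
    |> Stream.map(fn {line, number} -> {number, line} end)
  end
end
```

Something like Utf8Check.invalid_lines("BADFILE.CSV") |> Enum.take(5) would then show the first few offending lines as raw binaries, which makes it easier to look up the suspicious bytes in a code page table.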
ejc123
Thank you for checking that. I tried copying the á and pasting it into a file. This worked great! Unfortunately, I think the issue is that my source file is incorrectly encoded – or I just don’t know how it’s encoded or how to tell Elixir how to read it.
In the “bad” file, it is encoded as a single byte, 0xE1. This file contains á\n:
$ od -x BAD.CSV
0000000 0ae1
0000002
When I copied the character from my browser into a text file, the character became 2 bytes, 0xC3 0xA1, followed by a \n (0x0a):
$ od -x good.csv
0000000 a1c3 000a
0000003
This page has better information on encodings for this character. It seems that good.csv is actually UTF-8 encoded, while bad.csv is something else.
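The difference between the two dumps can be checked directly in Elixir. In UTF-8, 0xE1 is the lead byte of a three-byte sequence and must be followed by two continuation bytes, so a lone 0xE1 followed by 0x0A is not valid. A small sketch using the bytes from the od output above (the variable names are just for illustration):

```elixir
# The byte sequences from the two od dumps.
bad = <<0xE1, 0x0A>>         # BAD.CSV: a lone 0xE1, then a newline
good = <<0xC3, 0xA1, 0x0A>>  # good.csv: the two-byte UTF-8 form of á, then a newline

# String.valid?/1 reports whether a binary is well-formed UTF-8.
IO.inspect(String.valid?(bad), label: "bad is valid UTF-8")
IO.inspect(String.valid?(good), label: "good is valid UTF-8")

# The UTF-8 bytes are exactly what Elixir stores for the string "á\n".
IO.inspect(good == "á\n", label: "good equals the literal")
```

String.valid?/1 returns false for the first binary and true for the second, which matches the CSV library rejecting the bad file.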
Any thoughts on how to read this mangled encoding?
Either that or I need to find out at which point in the pipeline this file is getting munged.
ejc123
Wow, I know having clean data is good, but this really drives it home.
I ended up doing this, which feels really hackish:
for <<a <- File.read!("BADFILE.CSV") >>, into: <<>>, do: <<a::utf8>>
especially since this returns the whole file as a single string, so I had to use String.split/2 to get it into a format that the CSV library could use.
Looking at this, there’s probably some way to use pattern matching to get the comprehension to output a line at a time, but I’m too tired to work on it right now. It would also be nice if I could use a Stream for the comprehension.
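For what it’s worth, the comprehension above treats every byte as the code point with the same number, which is exactly a Latin-1 to UTF-8 conversion. A streaming version might look like the sketch below. It assumes the file really is Latin-1 (ISO-8859-1); the iconv command earlier in the thread used WINDOWS-1250, which agrees with Latin-1 for á but not for every byte, so this is not a drop-in replacement for that command. The module name Latin1Lines is made up for illustration:

```elixir
defmodule Latin1Lines do
  # Hypothetical helper: lazily reads a Latin-1 file line by line and
  # re-encodes each line as UTF-8.
  def stream(path) do
    path
    # Raw byte lines. Splitting on the newline byte is safe here because
    # Latin-1 is a single-byte encoding.
    |> File.stream!()
    # :unicode.characters_to_binary/3 is part of Erlang's stdlib; it converts
    # from the input encoding (:latin1) to the output encoding (:utf8).
    |> Stream.map(&:unicode.characters_to_binary(&1, :latin1, :utf8))
  end
end
```

The result is an ordinary stream of UTF-8 lines, so it should be usable wherever the File.stream! call was, for example Latin1Lines.stream("BADFILE.CSV") |> CSV.Decoder.decode(headers: true), without loading the whole file or calling String.split/2.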








