I don’t think this should be under ‘Advanced’, as I’m definitely not advanced. Sorry for the length, but I’ve tried everything I can think of.
I’ve been attempting to read in a CSV file. I found an awesome CSV library: csv. It’s standards compliant and uses parallel streams! My problem seems to be with character encoding. I’m not an expert in this area, so I apologize if I get the terminology wrong.
The file I’m reading is supposedly UTF-8 encoded, and if I read it in using the CSV library like so:
File.stream!("BADFILE.CSV") |> CSV.Decoder.decode(headers: true) |> Enum.to_list
I get
** (CSV.Lexer.EncodingError) Invalid encoding on line 10983
lib/csv/decoder.ex:168: CSV.Decoder.handle_error_for_result!/1
(elixir) lib/stream.ex:454: anonymous fn/4 in Stream.map/2
(elixir) lib/enum.ex:2744: Enumerable.List.reduce/3
(elixir) lib/stream.ex:732: Stream.do_list_transform/9
(elixir) lib/stream.ex:1247: Enumerable.Stream.do_each/4
(elixir) lib/enum.ex:1477: Enum.reduce/3
(elixir) lib/enum.ex:2248: Enum.to_list/1
I narrowed it down to this character: á, which should be valid Unicode. I then read up on File.stream!/2 and found that it supports a :utf8 mode. So I created a file with just that character in it and tried this:
File.stream!("SHORT_BADFILE.CSV",[:utf8]) |> CSV.Decoder.decode(headers: true) |> Enum.to_list
and get this:
** (UndefinedFunctionError) undefined function :unicode.format_error/1
(stdlib) :unicode.format_error(:unicode)
(kernel) file.erl:148: :file.format_error/1
(elixir) lib/io/stream.ex:6: IO.StreamError.exception/1
(elixir) lib/io.ex:416: IO.each_stream/2
(elixir) lib/stream.ex:1099: Stream.do_resource/5
(elixir) lib/stream.ex:700: Stream.do_transform/8
(elixir) lib/enum.ex:2066: Enum.take/2
lib/csv/decoder.ex:153: CSV.Decoder.get_first_row/2
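I dug through the Elixir and Erlang source code to figure this one out, and the “UndefinedFunctionError” is misleading: :file.format_error/1 builds its message by calling format_error/1 on whichever module is named in the error tuple, and here that module is :unicode, which has no such function. So the real failure is a Unicode error that simply can’t be formatted. Here’s the relevant code from file.erl: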
format_error({Line, Mod, Reason}) ->
io_lib:format("~w: ~ts", [Line, Mod:format_error(Reason)]);
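To illustrate the dispatch pattern in that clause, here is a minimal Elixir sketch. FakeReporter and Dispatch are hypothetical modules made up for this example; they are not part of Erlang, Elixir, or the csv library. It only shows how calling format_error/1 on a module chosen at runtime turns into an UndefinedFunctionError when that module does not define the function.

```elixir
# Hypothetical stand-in for a module that does implement format_error/1.
defmodule FakeReporter do
  def format_error(reason), do: "problem: #{inspect(reason)}"
end

# Hypothetical helper mirroring the shape of Mod:format_error(Reason):
# the module to call is taken from the error tuple at runtime.
defmodule Dispatch do
  def describe({line, mod, reason}) do
    "#{line}: #{apply(mod, :format_error, [reason])}"
  end
end

IO.puts(Dispatch.describe({1, FakeReporter, :oops}))

# String has no format_error/1, so the same call raises
# UndefinedFunctionError and the original reason is hidden.
try do
  Dispatch.describe({1, String, :oops})
rescue
  e in UndefinedFunctionError -> IO.puts(Exception.message(e))
end
```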
If I try
File.read!("SHORT_BADFILE.CSV")
I get
<<225, 10>>
Those are the bytes I’d expect, but I can’t figure out why they can’t be decoded.
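One check that might narrow this down, as a rough sketch: String.valid?/1 reports whether a binary is well-formed UTF-8, and :unicode.characters_to_binary/3 can convert between encodings. The idea that the file could be Latin-1 (ISO-8859-1) rather than UTF-8 is only an assumption based on the single byte 225; it is not confirmed by anything else here.

```elixir
# The bytes returned by File.read!/1 above.
bytes = <<225, 10>>

# Is the binary well-formed UTF-8?
IO.inspect(String.valid?(bytes))

# For comparison, U+00E1 (a with acute accent) encoded as UTF-8.
utf8_a_acute = <<0xE1::utf8>>
IO.inspect(utf8_a_acute)

# Assumption: the file is really Latin-1. Under that assumption,
# converting the bytes should give the UTF-8 form plus the newline.
converted = :unicode.characters_to_binary(bytes, :latin1, :utf8)
IO.inspect(converted == utf8_a_acute <> "\n")
IO.inspect(String.valid?(converted))
```

If the converted binary turns out to be valid, the same conversion could be applied to each line of the stream before it reaches the CSV decoder.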
Any ideas?