ejc123

wtUTF8 - encoding issues reading CSV file

I don’t think this should be under ‘Advanced’, as I’m definitely not advanced :wink: Sorry for the length, but I’ve tried a bunch of things – everything I can think of.

I’ve been attempting to read in a CSV file. I found an awesome CSV library, csv. It’s standards compliant and uses parallel streams! My problem seems to be with character encoding. I’m not an expert in this area, so I apologize if I get the terminology wrong.

The file I’m reading is supposedly UTF8 encoded and if I read it in using the CSV library like so:

File.stream!("BADFILE.CSV") |> CSV.Decoder.decode(headers: true) |> Enum.to_list

I get

** (CSV.Lexer.EncodingError) Invalid encoding on line 10983
             lib/csv/decoder.ex:168: CSV.Decoder.handle_error_for_result!/1
    (elixir) lib/stream.ex:454: anonymous fn/4 in Stream.map/2
    (elixir) lib/enum.ex:2744: Enumerable.List.reduce/3
    (elixir) lib/stream.ex:732: Stream.do_list_transform/9
    (elixir) lib/stream.ex:1247: Enumerable.Stream.do_each/4
    (elixir) lib/enum.ex:1477: Enum.reduce/3
    (elixir) lib/enum.ex:2248: Enum.to_list/1

I narrowed it down to this character, á, which should be valid Unicode. I then read up on File.stream!/2 and found that it supports a :utf8 mode. So I created a file with just that character in it and tried this:

File.stream!("SHORT_BADFILE.CSV",[:utf8]) |> CSV.Decoder.decode(headers: true) |> Enum.to_list

and get this:

** (UndefinedFunctionError) undefined function :unicode.format_error/1
    (stdlib) :unicode.format_error(:unicode)
    (kernel) file.erl:148: :file.format_error/1
    (elixir) lib/io/stream.ex:6: IO.StreamError.exception/1
    (elixir) lib/io.ex:416: IO.each_stream/2
    (elixir) lib/stream.ex:1099: Stream.do_resource/5
    (elixir) lib/stream.ex:700: Stream.do_transform/8
    (elixir) lib/enum.ex:2066: Enum.take/2
             lib/csv/decoder.ex:153: CSV.Decoder.get_first_row/2

I dug through the Elixir and Erlang source code to figure this one out, and the “UndefinedFunctionError” is misleading: :file.format_error/1 calls format_error/1 on whatever module is named in the error tuple, which here is :unicode, and that module doesn’t export such a function. Here’s the relevant code from file.erl:

format_error({Line, Mod, Reason}) ->
    io_lib:format("~w: ~ts", [Line, Mod:format_error(Reason)]);

If I try

File.read!("SHORT_BADFILE.CSV")

I get

<<225, 10>>

These are the bytes I’d expect, but I can’t figure out why they can’t be decoded.
Any ideas?

Marked As Solved

ejc123

After far too many hours spent on this, I’ll answer my own question. The solution I came up with is to use iconv to convert the file to UTF-8 encoding. In my case I use

$ cat FILE | iconv -f WINDOWS-1250 -t UTF-8 -o NEWFILE

And NEWFILE loads correctly in my Elixir script!

I guess it pays to start out with the correct encoding :blush:
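For readers who would rather do the conversion inside Elixir than shell out to iconv, here is a minimal sketch using Erlang's :unicode module. It rests on an assumption: :unicode understands Latin-1 (ISO-8859-1) but not WINDOWS-1250, so this only stands in for the iconv command when every non-ASCII byte in the file means the same thing in both encodings. That holds for á (0xE1) but not for many other bytes. The module and function names are hypothetical.

```elixir
defmodule EncodingSketch do
  # Hypothetical helper: re-encode a Latin-1 (ISO-8859-1) binary as UTF-8.
  # Assumes the input really is Latin-1; WINDOWS-1250 differs for many bytes.
  def latin1_to_utf8(bytes) when is_binary(bytes) do
    :unicode.characters_to_binary(bytes, :latin1, :utf8)
  end
end

# The two bytes from SHORT_BADFILE.CSV: 0xE1 followed by a newline.
converted = EncodingSketch.latin1_to_utf8(<<225, 10>>)
IO.inspect(converted, binaries: :as_binaries)
```

If the file contains characters outside that overlap, iconv (or a library that knows WINDOWS-1250) remains the safer route.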

Also Liked

nathanl

I just had this issue and solved it by specifying the encoding to File.stream!:

    file_path
    |> File.stream!([{:encoding, :latin1}])
    |> CSV.decode(headers: true)
    # ....

I didn’t initially know how the file was encoded; I just used trial-and-error with the various supported encodings listed here.
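To try the :encoding option without the CSV library in the way, a small sketch like the following may help. It writes the two problem bytes to a scratch file and reads them back through File.stream!. The module name, function name, and scratch file name are made up for illustration, and the sketch assumes the data is Latin-1.

```elixir
defmodule Latin1Stream do
  # Hypothetical helper: read a file line by line, telling the file
  # device that the bytes on disk are Latin-1. This is the same
  # [{:encoding, :latin1}] option used in the post above.
  def read_lines(path) do
    path
    |> File.stream!([{:encoding, :latin1}])
    |> Enum.to_list()
  end
end

# Scratch file holding 0xE1 and a newline, like SHORT_BADFILE.CSV.
path = Path.join(System.tmp_dir!(), "latin1_stream_sketch.csv")
File.write!(path, <<225, 10>>)

lines = Latin1Stream.read_lines(path)
IO.inspect(lines, binaries: :as_binaries)

File.rm!(path)
```

Inspecting with binaries: :as_binaries shows the raw bytes of each line, which makes it easy to see whether the single 0xE1 byte was re-encoded.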

ejc123

Thank you for checking that. I tried copying the á and pasting it into a file. This worked great! Unfortunately, I think the issue is that my source file is incorrectly encoded – or I just don’t know how it’s encoded or how to tell Elixir how to read it.

In the “bad” file, it is encoded as a single byte, 0xE1. This file contains á\n:

$ od -x BAD.CSV
0000000 0ae1
0000002

When I copied the character from my browser into a text file, the character is 2 bytes, 0xC3 0xA1, followed by a \n (0x0a):

$ od -x good.csv
0000000 a1c3 000a
0000003

This page has better information on encodings for this character. It seems that good.csv is actually UTF-8 encoded, while bad.csv is something else.
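The same comparison can be made from Elixir without od. The sketch below feeds the byte sequences from the two dumps to String.valid?/1, which reports whether a binary is well-formed UTF-8. The module name is hypothetical.

```elixir
defmodule Utf8Check do
  # Hypothetical helper: label a binary by whether it is well-formed UTF-8.
  def classify(bytes) when is_binary(bytes) do
    if String.valid?(bytes), do: :utf8, else: :not_utf8
  end
end

bad = <<0xE1, 0x0A>>         # bytes of BAD.CSV
good = <<0xC3, 0xA1, 0x0A>>  # bytes of good.csv

IO.inspect(Utf8Check.classify(bad))   # :not_utf8
IO.inspect(Utf8Check.classify(good))  # :utf8
```

In UTF-8, 0xE1 is the lead byte of a three-byte sequence, so it cannot be followed directly by a newline. That is why the single-byte form is rejected.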

Any thoughts on how to read this mangled encoding?
Either that or I need to find out at which point in the pipeline this file is getting munged.

ejc123

Wow, I know having clean data is good, but this really drives it home.

I ended up doing this, which feels really hackish:

for <<a <- File.read!("BADFILE.CSV") >>, into: <<>>, do: <<a::utf8>>

especially since this returns the whole file as a string so I had to use String.split/2 in order to get it into a format that the CSV library could use.

Looking at this, there’s probably some way to use pattern matching to get the comprehension to output a line at a time, but I’m too tired to work on it right now. It would also be nice if I could use a Stream for the comprehension.
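One possible way to get the line-at-a-time, Stream-based version described above is to keep the comprehension but apply it to each line of a File.stream!. This is a sketch with hypothetical module and function names. It assumes, as the comprehension does, that every byte in the file is its own code point, which amounts to treating the file as Latin-1.

```elixir
defmodule Latin1Lines do
  # Same byte-to-codepoint mapping as the comprehension above,
  # applied to a single line instead of the whole file.
  def to_utf8(line) when is_binary(line) do
    for <<byte <- line>>, into: <<>> do
      <<byte::utf8>>
    end
  end

  # Lazily read raw lines from the file and re-encode each one.
  def stream(path) do
    path
    |> File.stream!()
    |> Stream.map(&to_utf8/1)
  end
end

# Scratch file with two lines: "a" + 0xE1, then "b".
path = Path.join(System.tmp_dir!(), "latin1_lines_sketch.csv")
File.write!(path, <<97, 225, 10, 98, 10>>)

path |> Latin1Lines.stream() |> Enum.each(&IO.inspect(&1, binaries: :as_binaries))

File.rm!(path)
```

Because the result is an ordinary stream of UTF-8 lines, it could presumably be piped into the CSV library in place of the File.stream! call, with no String.split/2 step.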
