ejc123

wtUTF8 - encoding issues reading CSV file

I don’t think this should be under ‘Advanced’, as I’m definitely not advanced :wink: Sorry for the length, but I’ve tried a bunch of things – everything I can think of.

I’ve been attempting to read in a CSV file. I found an awesome CSV library, csv. It’s standards compliant and uses parallel streams! My problem seems to be with character encoding. I’m not an expert in this area, so I apologize if I get the terminology wrong.

The file I’m reading is supposedly UTF8 encoded and if I read it in using the CSV library like so:

File.stream!("BADFILE.CSV") |> CSV.Decoder.decode(headers: true) |> Enum.to_list

I get

** (CSV.Lexer.EncodingError) Invalid encoding on line 10983
             lib/csv/decoder.ex:168: CSV.Decoder.handle_error_for_result!/1
    (elixir) lib/stream.ex:454: anonymous fn/4 in Stream.map/2
    (elixir) lib/enum.ex:2744: Enumerable.List.reduce/3
    (elixir) lib/stream.ex:732: Stream.do_list_transform/9
    (elixir) lib/stream.ex:1247: Enumerable.Stream.do_each/4
    (elixir) lib/enum.ex:1477: Enum.reduce/3
    (elixir) lib/enum.ex:2248: Enum.to_list/1

I narrowed it down to this character á which should be valid Unicode. I then read up on File.stream!/2 and found that it supports a :utf8 mode. So I create a file with just that character in it and try this:

File.stream!("SHORT_BADFILE.CSV",[:utf8]) |> CSV.Decoder.decode(headers: true) |> Enum.to_list

and get this:

** (UndefinedFunctionError) undefined function :unicode.format_error/1
    (stdlib) :unicode.format_error(:unicode)
    (kernel) file.erl:148: :file.format_error/1
    (elixir) lib/io/stream.ex:6: IO.StreamError.exception/1
    (elixir) lib/io.ex:416: IO.each_stream/2
    (elixir) lib/stream.ex:1099: Stream.do_resource/5
    (elixir) lib/stream.ex:700: Stream.do_transform/8
    (elixir) lib/enum.ex:2066: Enum.take/2
             lib/csv/decoder.ex:153: CSV.Decoder.get_first_row/2

I dug through the Elixir and Erlang source code to figure this one out, and the “UndefinedFunctionError” is misleading: :file.format_error/1 calls format_error/1 on whichever module is named in the error tuple, which here is :unicode, and that module doesn’t define it. Here’s the relevant code from file.erl:

format_error({Line, Mod, Reason}) ->
    io_lib:format("~w: ~ts", [Line, Mod:format_error(Reason)]);

If I try

File.read!("SHORT_BADFILE.CSV")

I get

<<225, 10>>

Which is the bytes I’d expect, but I can’t figure out why it can’t be decoded.
Any ideas?
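
For anyone landing here with the same symptom, a quick check of those bytes in iex can be sketched as follows. It assumes the file is in a single-byte encoding such as Latin-1, which is a guess rather than something the output above proves. String.valid?/1 and :unicode.characters_to_binary/2 are standard library functions.

```elixir
# The two bytes read from SHORT_BADFILE.CSV above.
bytes = <<225, 10>>

# 225 (0xE1) is the lead byte of a three-byte UTF-8 sequence, but only a
# newline follows it, so this binary is not valid UTF-8.
IO.inspect(String.valid?(bytes), label: "valid UTF-8?")

# Assumption: the bytes are Latin-1. Re-encode them as UTF-8.
utf8 = :unicode.characters_to_binary(bytes, :latin1)
IO.inspect(utf8, binaries: :as_binaries, label: "as UTF-8 bytes")
IO.inspect(String.valid?(utf8), label: "valid after conversion?")
```

If the first check is false and the last is true, the file is probably not UTF-8 at all, and the decoder is right to complain.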

Marked As Solved

ejc123

After far too many hours spent on this, I’ll answer my own question. The solution I came up with is to use iconv to convert the file to UTF-8 encoding. In my case I use

$ cat FILE | iconv -f WINDOWS-1250 -t UTF-8 -o NEWFILE

And NEWFILE loads correctly in my Elixir script!

I guess it pays to start out with the correct encoding :blush:

Also Liked

nathanl

I just had this issue and solved it by specifying the encoding to File.stream!:

    file_path
    |> File.stream!([{:encoding, :latin1}])
    |> CSV.decode(headers: true)
    # ....

I didn’t initially know how the file was encoded; I just used trial-and-error with the various supported encodings listed here.
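
A minimal way to see what the :encoding mode does, without the CSV library, might look like this. The scratch file name is hypothetical, and the sketch assumes the data is Latin-1.

```elixir
# Hypothetical scratch file holding the Latin-1 bytes for "á\n".
path = Path.join(System.tmp_dir!(), "latin1_sample.csv")
File.write!(path, <<225, 10>>)

# With an :encoding mode the stream decodes the file as Latin-1 and
# yields UTF-8 strings, one per line.
lines =
  path
  |> File.stream!(encoding: :latin1)
  |> Enum.to_list()

IO.inspect(lines, binaries: :as_binaries)
File.rm!(path)
```

The single byte 225 should come back as the two-byte UTF-8 sequence for á, which is what the CSV decoder expects.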

ejc123

Thank you for checking that. I tried copying the á and pasting it into a file. This worked great! Unfortunately, I think the issue is that my source file is incorrectly encoded – or I just don’t know how it’s encoded or how to tell Elixir how to read it.

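In the “bad” file, it is encoded as a single byte, 0xE1. This file contains á\n (od -x prints 16-bit words, which on a little-endian machine shows the bytes e1 0a as 0ae1):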

$ od -x BAD.CSV
0000000 0ae1
0000002

When I copied the character from my browser into a text file, the character is 2 bytes, 0xC3 0xA1, followed by a \n (0x0a):

$ od -x good.csv
0000000 a1c3 000a
0000003

This page has better information on encodings for this character. It seems that good.csv is actually UTF-8 encoded, while bad.csv is something else.

Any thoughts on how to read this mangled encoding?
Either that or I need to find out at which point in the pipeline this file is getting munged.
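
One way to see how widespread the problem is would be to list every line that is not valid UTF-8. The sketch below is only an illustration: EncodingCheck and the sample file are hypothetical names, and it relies on File.stream!/1 returning raw, undecoded lines.

```elixir
defmodule EncodingCheck do
  # Hypothetical helper: returns the 1-based numbers of the lines in `path`
  # that are not valid UTF-8.
  def invalid_lines(path) do
    path
    # Raw bytes split on "\n", with no decoding applied.
    |> File.stream!()
    |> Stream.with_index(1)
    |> Stream.reject(fn {line, _number} -> String.valid?(line) end)
    |> Enum.map(fn {_line, number} -> number end)
  end
end

# Sample file: a header, a UTF-8 encoded á, then a Latin-1 encoded á.
path = Path.join(System.tmp_dir!(), "mixed_sample.csv")
File.write!(path, <<"name\n", 195, 161, "\n", 225, "\n">>)

bad = EncodingCheck.invalid_lines(path)
IO.inspect(bad, label: "lines that are not valid UTF-8")
File.rm!(path)
```

If only a few lines show up, the file was probably damaged somewhere in the pipeline. If every line with an accented character shows up, the whole file is likely in another encoding.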

ejc123

Wow, I know having clean data is good, but this really drives it home.

I ended up doing this, which feels really hackish:

for <<a <- File.read!("BADFILE.CSV") >>, into: <<>>, do: <<a::utf8>>

especially since this returns the whole file as a string so I had to use String.split/2 in order to get it into a format that the CSV library could use.

Looking at this, there’s probably some way to use pattern matching to get the comprehension to output a line at a time, but I’m too tired to work on it right now. It would also be nice if I could use a Stream for the comprehension.
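
A lazy, line-at-a-time version of that comprehension could be sketched like this. It assumes the file is Latin-1, where every byte maps to the code point with the same number, which is also what the comprehension above relies on. Latin1Stream and the sample file are hypothetical names.

```elixir
defmodule Latin1Stream do
  # Hypothetical helper: lazily yields each line of a Latin-1 file as a
  # UTF-8 string, without loading the whole file into memory.
  def lines(path) do
    path
    # Raw lines; splitting on byte 10 is safe for single-byte encodings.
    |> File.stream!()
    |> Stream.map(&:unicode.characters_to_binary(&1, :latin1))
  end
end

# Sample file with a Latin-1 encoded á in the second row.
path = Path.join(System.tmp_dir!(), "latin1_rows.csv")
File.write!(path, <<"a,b\n", 225, ",x\n">>)

rows = path |> Latin1Stream.lines() |> Enum.to_list()
IO.inspect(rows, binaries: :as_binaries)
File.rm!(path)
```

Because the result is still a stream of lines, it should be possible to pipe it straight into the CSV decoder in place of the plain File.stream! call, with no String.split/2 step.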
