Unicode in CSV import

I have a CSV file that I need to process.

I am decoding it with the CSV package (CSV v2.4.1).

The problem I am running into is that the file has lots of Unicode characters (ä, ö, å, so far), and this raises an exception in CSV.

Do I need to set this up differently to process these without raising an exception?

Thanks!

I’d suggest using nimble_csv, which is fine with UTF-8 data.

Are you sure the CSV file is encoded in UTF-8? We use it (the package you are using) ourselves in our app and haven’t had any issues directly related to UTF-8.

To be honest, this is the first time I have come across something like this (in modern times), but opening it up in a text editor, those characters are definitely in there, along with the degree mark (180°) and others I am finding. They render just fine in the text editor (Emacs), but my errors look like this:

** (FunctionClauseError) no function clause matching in CSV.Decoding.Preprocessing.Lines.starts_sequence?/5    
    
    The following arguments were given to CSV.Decoding.Preprocessing.Lines.starts_sequence?/5:
    
        # 1
        <<176, 32, 72, 83, 71, 32, 65, 83, 83, 89, 44, 49, 44, 57, 55, 46, 56, 52, 44, 57, 55, 46, 56, 52, 44, 80, 82, 69, 83, 84, 65, 71, 69, 32, 70, 65, 82, 77, 83, 32, 73, 78, 67, 44, 77, 65, 73, 78, 84, 69, ...>>

where, in this case, 176 is (°)

… in ISO 8859-1 and Windows-1252, among others. It’s not a valid byte in UTF8 except as the second or following byte of a multibyte sequence.
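
To make that concrete, here is a minimal sketch using only the Elixir and Erlang standard library. It assumes the file is Latin-1 (ISO 8859-1), which the thread has not confirmed; Windows-1252 differs in the 0x80 to 0x9F range and would need a different mapping. The module name `EncodingSketch` is made up for illustration.

```elixir
defmodule EncodingSketch do
  # Returns the input unchanged when it is already valid UTF-8.
  # Otherwise ASSUMES the bytes are Latin-1 and re-encodes them as UTF-8.
  def to_utf8(bytes) when is_binary(bytes) do
    if String.valid?(bytes) do
      bytes
    else
      :unicode.characters_to_binary(bytes, :latin1, :utf8)
    end
  end
end

# First bytes of the argument shown in the error above.
raw = <<176, 32, 72, 83, 71, 32, 65, 83, 83, 89, 44, 49>>

# A lone 176 (0xB0) is a UTF-8 continuation byte, so this prints false.
IO.inspect(String.valid?(raw))

# After conversion the degree sign becomes the two-byte sequence <<194, 176>>.
IO.puts(EncodingSketch.to_utf8(raw))
```

Running the file contents through something like this before handing them to the CSV decoder should avoid invalid bytes reaching it, provided the Latin-1 guess is right.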

nimble_csv can deal not only with UTF-8 but also with other character encodings, though you need to know the encoding in advance; it can’t guess it. As for Excel, be aware that by default it doesn’t use UTF-8 (yes, I know) but UTF-16 LE. There’s a preconfigured module, NimbleCSV.Spreadsheet, for that one.
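
One case where the encoding does not have to be guessed is when the file starts with a byte order mark. A rough sketch with Erlang's `:unicode` module follows; `BomSketch` is a hypothetical name, and whether a given Excel export actually carries a BOM is an assumption to check against the real file.

```elixir
defmodule BomSketch do
  # Converts data to UTF-8 when it starts with a byte order mark (BOM).
  # Returns :no_bom otherwise, because the encoding then has to be known
  # from somewhere else.
  def to_utf8(data) when is_binary(data) do
    case :unicode.bom_to_encoding(data) do
      # With no BOM, :unicode reports a default encoding and a length of 0.
      {_default, 0} ->
        :no_bom

      {encoding, bom_length} ->
        rest = binary_part(data, bom_length, byte_size(data) - bom_length)
        # Note: for malformed input this call returns an error tuple
        # instead of a binary, which ends up inside the :ok tuple here.
        {:ok, :unicode.characters_to_binary(rest, encoding, :utf8)}
    end
  end
end

# "a,ä" encoded as UTF-16 LE, preceded by the UTF-16 LE BOM <<0xFF, 0xFE>>.
utf16 = <<0xFF, 0xFE, 0x61, 0x00, 0x2C, 0x00, 0xE4, 0x00>>
IO.inspect(BomSketch.to_utf8(utf16))
```

A file without a BOM, such as the Latin-1 one discussed above, falls into the `:no_bom` branch, which is exactly the situation where the encoding has to be known in advance.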

Would you care to share the file? I don’t know Emacs, but the fact that your editor opens it doesn’t necessarily mean it’s UTF-8; the editor might just be clever :)

Sorry, I missed this message. The file contained a bunch of customer-specific info, so I couldn’t share it. It turned out they didn’t need that info anyway, so I moved on to another project. But thanks!

FWIW, it turns out the starts_sequence? crash is a bug, caused by a missing clause: