No function clause matching in CSV.Decoding.Preprocessing.Lines.starts_sequence?/5

Hey OGs,
I have a function that loads CSV files into my app. It failed when one of my users entered special characters,
for example a name like Zoë.
Here is my function:

defp read(csv) do
  csv.path
  |> Path.expand()
  |> File.stream!()
  |> CSV.decode(headers: true)
  |> Stream.map(fn {:ok, data} -> data end)
  |> Stream.reject(&is_nil/1)
  |> break_point()
  |> Enum.to_list() # It fails here
  |> Enum.uniq()
end

Error:

The following arguments were given to CSV.Decoding.Preprocessing.Lines.starts_sequence?/5:
    
        # 1
        <<145, 32, 72, 111, 97, 100, 44, 105, 110, 115, 116, 97, 103, 114, 97, 109>>
    
        # 2
        "o"
    
        # 3
        false
    
        # 4
        44
    
        # 5
        ""
Attempted function clauses (showing 5 out of 5):
    
        defp starts_sequence?(<<34::utf8(), tail::binary()>>, last_token, false, separator, _) when last_token == <<separator::utf8()>>
        defp starts_sequence?(<<34::utf8(), tail::binary()>>, "", false, separator, _)
        defp starts_sequence?(<<34::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
        defp starts_sequence?(<<head::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
        defp starts_sequence?("", _, quoted, _, sequence_start)

Apparently, it couldn’t recognise the character ë.
I wish I could pass a callback to starts_sequence? so that it returns a meaningful error, or something like that…
How can I prevent this from happening?

Make sure you have the latest version of csv - version 2.5.0 contains a fix for this error.

Thanks a lot, al2o3cr.
My team lead doesn’t want to upgrade since he’s unsure whether it would break other things.
Is there a way I can print out which character it is?

The offending byte is the first one of the binary in the first argument of that error: 145. The preceding character was apparently o (the second argument to starts_sequence?). A byte with value 145 (0x91) is a UTF-8 continuation byte, so it can’t start a character, and the binary is not valid UTF-8 from that point on.

145 happens to be a left single quote (‘) in Windows-1252, which is probably what the creator of the CSV intended.
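
If you want to print the offending byte yourself without touching the csv dependency, something along these lines might work. It is only a sketch: BadByteFinder and first_invalid/1 are made-up names, not part of the csv library, and it uses the same `::utf8` binary matching that starts_sequence? relies on.

```elixir
defmodule BadByteFinder do
  # Hypothetical helper, not part of the csv library.
  # Returns :ok when the whole binary is valid UTF-8, otherwise
  # {:invalid, offset, byte} for the first byte that cannot start a character.
  def first_invalid(binary) when is_binary(binary), do: scan(binary, 0)

  # End of input: everything decoded cleanly.
  defp scan(<<>>, _offset), do: :ok

  # A valid code point: skip it, advancing by its encoded size in bytes.
  defp scan(<<char::utf8, rest::binary>>, offset) do
    scan(rest, offset + byte_size(<<char::utf8>>))
  end

  # Anything else: the next byte does not begin a valid UTF-8 character.
  defp scan(<<byte, _rest::binary>>, offset), do: {:invalid, offset, byte}
end

# The line from the error above, rebuilt from the "o" and the reported bytes.
BadByteFinder.first_invalid("o" <> <<145, 32, 72, 111, 97, 100>>)
# => {:invalid, 1, 145}

BadByteFinder.first_invalid("Zoë")
# => :ok
```

You could run this over each line of File.stream!() before it reaches CSV.decode and log the line number together with the reported offset and byte.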

There’s no good way to handle this error apart from upgrading, as the code that was added back in 2.5.0 is the error handling code!


Thanks, @al2o3cr
I actually upgraded to 2.5.0 anyway. It does not seem to process all the lines that have offending characters; it stops at the first line that contains one.
Apart from the error handling code, is it possible to retrieve the offending byte?

AFAIK the only currently available way to deal with malformed UTF8 in CSV.decode is to pass the undocumented replacement option - it’s passed down unmodified to CSV.Decoding.Decoder where it is documented.

Prior to 2.5.0, this wasn’t an option because replacement happens in the lexer way after the starts_sequence? code in the preprocessor, so bad characters never made it to where they could be replaced.

A possible extension to CSV would be to also accept a 1-argument function as replacement, and provide it with the bad character.
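
Until something like that exists, the same idea can be approximated outside the library by cleaning each line before it is decoded. A rough sketch, assuming you are fine with preprocessing the stream yourself; Utf8Scrub and scrub/2 are hypothetical names and not part of csv:

```elixir
defmodule Utf8Scrub do
  # Hypothetical helper, not part of the csv library.
  # Rebuilds `binary`, calling `fun` with every byte that is not part of a
  # valid UTF-8 character and splicing in the string that `fun` returns.
  def scrub(binary, fun) when is_binary(binary) and is_function(fun, 1) do
    binary
    |> do_scrub(fun, [])
    |> IO.iodata_to_binary()
  end

  # End of input: restore the original order of the collected pieces.
  defp do_scrub(<<>>, _fun, acc), do: Enum.reverse(acc)

  # Valid code point: keep it unchanged.
  defp do_scrub(<<char::utf8, rest::binary>>, fun, acc) do
    do_scrub(rest, fun, [<<char::utf8>> | acc])
  end

  # Invalid byte: let the caller decide what to put in its place.
  defp do_scrub(<<byte, rest::binary>>, fun, acc) do
    do_scrub(rest, fun, [fun.(byte) | acc])
  end
end

# Example replacement function: make the bad byte visible as hex.
show_hex = fn byte -> "<0x" <> Integer.to_string(byte, 16) <> ">" end

Utf8Scrub.scrub("Zo" <> <<145>> <> " Hoad", show_hex)
# => "Zo<0x91> Hoad"
```

In the read/1 pipeline from the first post, this would presumably sit between File.stream!() and CSV.decode as a Stream.map(&Utf8Scrub.scrub(&1, show_hex)) step, so the decoder only ever sees valid UTF8. The replacement function can also log the byte, or map it to whatever character you believe the file's real encoding intended.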
