No function clause matching in CSV.Decoding.Preprocessing.Lines.starts_sequence?/5

docjazither · October 16, 2022, 11:04pm

Hey OGs,
I have a function that loads CSV files into my app. It failed when a user's file contained special characters,
for example a name like Zoë.
Here is my function:

defp read(csv) do
  csv.path
  |> Path.expand()
  |> File.stream!()
  |> CSV.decode(headers: true)
  |> Stream.map(fn {:ok, data} -> data end)
  |> Stream.reject(&is_nil/1)
  |> break_point()
  |> Enum.to_list() # It fails here
  |> Enum.uniq()
end

Error:

The following arguments were given to CSV.Decoding.Preprocessing.Lines.starts_sequence?/5:
    
        # 1
        <<145, 32, 72, 111, 97, 100, 44, 105, 110, 115, 116, 97, 103, 114, 97, 109>>
    
        # 2
        "o"
    
        # 3
        false
    
        # 4
        44
    
        # 5
        ""
Attempted function clauses (showing 5 out of 5):
    
        defp starts_sequence?(<<34::utf8(), tail::binary()>>, last_token, false, separator, _) when last_token == <<separator::utf8()>>
        defp starts_sequence?(<<34::utf8(), tail::binary()>>, "", false, separator, _)
        defp starts_sequence?(<<34::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
        defp starts_sequence?(<<head::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
        defp starts_sequence?("", _, quoted, _, sequence_start)

Apparently, it couldn’t recognise the character ë.
I wish I could supply a callback for starts_sequence? so that it returns a meaningful error or something like that…
How can I prevent this from happening?

al2o3cr · October 16, 2022, 11:20pm

Make sure you have the latest version of csv - version 2.5.0 contains a fix for this error:

github.com/beatrichartz/csv
Pass through non-UTF8 bytes in lines preprocessor
beatrichartz:master ← al2o3cr:fix_invalid_encoding_crash
opened 03:58PM - 15 May 22 UTC by al2o3cr (+13 -0)

Attempting to `CSV.decode` a stream that contains non-UTF8 bytes raises a `FunctionClauseError`:

```
** (FunctionClauseError) no function clause matching in CSV.Decoding.Preprocessing.Lines.starts_sequence?/5

The following arguments were given to CSV.Decoding.Preprocessing.Lines.starts_sequence?/5:

    # 1
    <<225, 110, 100, 101, 122>>

    # 2
    "n"

    # 3
    false

    # 4
    44

    # 5
    ""

Attempted function clauses (showing 5 out of 5):

    defp starts_sequence?(<<34::utf8(), tail::binary()>>, last_token, false, separator, _) when last_token == <<separator::utf8()>>
    defp starts_sequence?(<<34::utf8(), tail::binary()>>, "", false, separator, _)
    defp starts_sequence?(<<34::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
    defp starts_sequence?(<<head::utf8(), tail::binary()>>, _, quoted, separator, sequence_start)
    defp starts_sequence?("", _, quoted, _, sequence_start)

code: result = CSV.decode(stream) |> Enum.to_list()
stacktrace:
  (csv 2.4.1) CSV.Decoding.Preprocessing.Lines.starts_sequence?/5
  (csv 2.4.1) lib/csv/decoding/preprocessing/lines.ex:85: CSV.Decoding.Preprocessing.Lines.start_sequence/3
  (elixir 1.13.0) lib/stream.ex:902: Stream.do_transform_user/6
```

This makes it impossible to handle encoding errors per-line or use machinery like `Decoder`'s `replacement` option.

The code that would prevent this crash was accidentally deleted in https://github.com/beatrichartz/csv/commit/4f5069b99b8c0e4387c9e31798aed508b3f9998f because it is "unused" for files that only contain valid UTF8.

This PR restores the deleted clause and adds a high-level test; existing tests cover `Decoder` and `Lexer` but not the complete pipeline.

docjazither · October 16, 2022, 11:58pm

Thanks a lot, al2o3cr.
My leader doesn’t want to upgrade since he’s unsure whether it will break other things.
Is there a way I can print out which character it is?

al2o3cr · October 17, 2022, 2:01pm

The offending byte is the first byte of the binary in the first argument of that error: 145. The preceding character was apparently o (the second argument to starts_sequence?), which is a complete one-byte character, so the 145 is not the continuation of a multi-byte sequence and is not valid UTF8 on its own.

145 happens to be a left single-quote (‘) in Windows-1252, which is probably what the creator of the CSV intended.

There’s no good way to handle this error apart from upgrading, as the code that was added back in 2.5.0 is the error handling code!
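For locating the offending byte without reading it out of the error message, here is a minimal sketch that uses only the standard library. The module name `Utf8Check` and its function are hypothetical helpers, not part of the csv package; the idea is to consume valid UTF-8 codepoints one at a time and report the first byte that does not start one.

```elixir
defmodule Utf8Check do
  # Hypothetical helper (not part of the csv library).
  # Returns :ok for a valid UTF-8 binary, or {:error, byte, offset}
  # for the first byte that cannot be decoded as part of a codepoint.
  def first_invalid_byte(binary) when is_binary(binary), do: scan(binary, 0)

  defp scan("", _offset), do: :ok

  # A valid codepoint: skip over however many bytes it occupies.
  defp scan(<<char::utf8, rest::binary>>, offset) do
    scan(rest, offset + byte_size(<<char::utf8>>))
  end

  # Anything else: the leading byte is the offender.
  defp scan(<<byte, _rest::binary>>, offset), do: {:error, byte, offset}
end

# The binary from the error above:
Utf8Check.first_invalid_byte(<<145, 32, 72, 111, 97, 100>>)
# → {:error, 145, 0}
```

Running this over each line of the file (for example via `File.stream!/1` with `Stream.with_index/1`) would show which lines contain bad bytes before they ever reach `CSV.decode`.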

docjazither · October 19, 2022, 12:44am

Thanks, @al2o3cr
I actually upgraded to 2.5.0 anyway. It does not seem to process all the lines that have offending characters; it stops at the first line containing one.
Apart from the error handling code, is it possible to retrieve the offending byte?

al2o3cr · October 19, 2022, 1:02am

AFAIK the only currently available way to deal with malformed UTF8 in CSV.decode is to pass the undocumented replacement option - it’s passed down unmodified to CSV.Decoding.Decoder where it is documented.
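One thing that may be worth checking, as an assumption rather than a diagnosis: if `CSV.decode` yields `{:ok, row}` and `{:error, message}` tuples per line (which the `{:ok, data}` pattern in the original `read/1` suggests), then a `Stream.map` with only an `{:ok, data}` clause will itself raise on the first `{:error, _}` tuple. A rough sketch of separating the two kinds of result follows; `DecodeResults` is a hypothetical helper name.

```elixir
defmodule DecodeResults do
  # Hypothetical helper (not part of the csv library).
  # Takes an enumerable of {:ok, row} | {:error, reason} tuples and
  # returns {rows, reasons} instead of crashing on the first error.
  def split(results) do
    {oks, errors} = Enum.split_with(results, &match?({:ok, _}, &1))

    {Enum.map(oks, fn {:ok, row} -> row end),
     Enum.map(errors, fn {:error, reason} -> reason end)}
  end
end

# Usage sketch, assuming csv 2.5.0 and the replacement option described above:
#
#   {rows, errors} =
#     path
#     |> File.stream!()
#     |> CSV.decode(headers: true, replacement: "?")
#     |> DecodeResults.split()
```

The trade-off is that `split/1` reads the whole stream into memory; for large files, logging errors inside a `Stream.flat_map/2` would keep things lazy.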

Prior to 2.5.0, this wasn’t an option because replacement happens in the lexer way after the starts_sequence? code in the preprocessor, so bad characters never made it to where they could be replaced.

A possible extension to CSV would be to also accept a 1-argument function as replacement, and provide it with the bad character.
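To make that idea concrete, here is a sketch of what a function-based replacement could look like if done outside the library, on raw lines before decoding. Nothing here is csv API: `Utf8Scrub.scrub/2` is a hypothetical helper that calls a 1-argument function with each invalid byte and splices in whatever string it returns.

```elixir
defmodule Utf8Scrub do
  # Hypothetical helper (not part of the csv library).
  # Walks the binary, keeps valid UTF-8 codepoints as they are, and
  # replaces each invalid byte with the string returned by replace_fun.
  def scrub(binary, replace_fun)
      when is_binary(binary) and is_function(replace_fun, 1) do
    binary
    |> do_scrub(replace_fun, [])
    |> IO.iodata_to_binary()
  end

  defp do_scrub("", _fun, acc), do: Enum.reverse(acc)

  defp do_scrub(<<char::utf8, rest::binary>>, fun, acc) do
    do_scrub(rest, fun, [<<char::utf8>> | acc])
  end

  defp do_scrub(<<byte, rest::binary>>, fun, acc) do
    do_scrub(rest, fun, [fun.(byte) | acc])
  end
end

# Map the Windows-1252 byte 145 to a left single quote, anything else to "?":
Utf8Scrub.scrub(<<"o", 145, " Hoad">>, fn
  145 -> "‘"
  _other -> "?"
end)
# → "o‘ Hoad"
```

Applied with `Stream.map/2` between `File.stream!/1` and `CSV.decode`, something like this could also serve as a stopgap on older csv versions, since the preprocessor would then only ever see valid UTF-8.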

jarvism · September 18, 2023, 2:27pm

Thank you for this code!!!
As a newb, seeing an actual implementation of reading the CSV is priceless!!!
The answer to this question is also very helpful!
I’m just not there yet.