Error when importing CSV with Explorer.DataFrame.from_csv! possibly related to dtype

Hello,

I’m an elixir beginner, exploring livebook. I’m running a local notebook on a linux machine. Initially I tried to import a CSV file with Explorer like so:

alias Explorer.{DataFrame, Series}
path_to_csv = "./some_file.csv"
DataFrame.from_csv!(path_to_csv)

However, I am seeing the following error:

** (RuntimeError) from_csv failed: {:polars, "Could not parse `Fee Amount` 
    as dtype Int64 at column 16.\nThe current offset in the file is 249 bytes.\n\n
    Consider specifying the correct dtype, increasing\n
    the number of records used to infer the schema,\n
    enabling the `ignore_errors` flag, or adding\n`Fee Amount` to the `null_values` list."}
    (explorer 0.5.5) lib/explorer/data_frame.ex:438: Explorer.DataFrame.from_csv!/2

To check if the problem was indeed with the data in the column Fee Amount, I filled the column with integers. But I get the same error. What can I try to better diagnose this problem?

Hi @alphydan welcome! Are you able to provide a sample of that CSV file?

1 Like

https://hexdocs.pm/explorer/Explorer.DataFrame.html#from_csv/2

You can use :dtype to explicitly set the data type for a column instead of letting it try to infer it.

One small thing that jumps out:

Could not parse Fee Amount as dtype Int64 at column 16.
The current offset in the file is 249 bytes.

This sounds like it’s on at least the second line of the file, and so should be data

That’s a reference to the with_null_values setting in polars_io::csv::CsvReader, which is intended to parse values like "n/a" to null.

The message suggests that the reader found the literal string “Fee Amount” where it expected data.

:thinking: what was a header-shaped value like “Fee Amount” doing on the second line of a CSV?

1 Like

@BradS2S @benwilson512 @al2o3cr Thank you for all your suggestions. “Fee Amount” was not on the second line, but on the header (first row).

A potential problem with the header led me to try changing all headers to the format something_or_the_other instead of Something Or \n the Other.

I found that it failed when I had a return character inside the header. The CSV gets imported correctly now. Thank you for the prompt replies.

Were these quoted? If no, then that’s indeed invalid. If they were, then this sounds like a but in the csv parser.

Ok, here’s an example to further diagnose the problem:

Running $ cat -v example_failing.csv in the terminal shows:

"Timestamp$
(ISO)",Animal Type,Animal Color,"Friends of$
Given Animal","Another$
Timestamp$
(ISO)",Feedstock,"Favorite food$
for Animal",Feed Amount,Feed Frequency^M$
2023,Tiger,Yellow,Lion,2023-03-14,3,meat,32.2,weekly^M$
2023,Tiger,Yellow,Lion,2023-03-14,3,meat,32.2,weekly

(Where $ corresponds to \n or Unix line endings). Trying to import it in livebook:

path_to_csv = "./example_failing.csv"
IO.puts(path_to_csv)
DataFrame.from_csv!(path_to_csv)

leads to this error:

** (RuntimeError) from_csv failed: {:polars, "Could not parse `(ISO)\"` as dtype Int64 at column 1.\n
The current offset in the file is 11 bytes.\n\n

Consider specifying the correct dtype, increasing\n
the number of records used to infer the schema,\n
enabling the `ignore_errors` flag, or adding
\n`(ISO)\"` to the `null_values` list."}
    (explorer 0.5.5) lib/explorer/data_frame.ex:438: Explorer.DataFrame.from_csv!/2

Remove all the new line characters from the column headers.

That sounds like a bug / missing implementation in the parser of polars then. LF or CRLF are supposed to be allowed in quoted fields based on RFC 4180.

Removing all return characters leads to an import without errors.

"Timestamp (ISO)",Animal Type,Animal Color,"Friends of Given Animal","Another Timestamp (ISO)",Feedstock,"Favorite food for Animal",Feed Amount,Feed Frequency^M
2023,Tiger,Yellow,Lion,2023-03-14,3,meat,32.2,weekly^M
2023,Tiger,Yellow,Lion,2023-03-14,3,meat,32.2,weekly

is imported correctly. Does that mean that return characters are not allowed in headers, even with quotes?

return characters outside the header seem to be imported correctly. for example, if I have an element of the csv be Tiger and \nBeasts, but no returns in the header I can see:

  Polars[2 x 9]
  Timestamp (ISO) integer [2023, 2023]
  Animal Type string ["Tiger and \nBeasts", "Tiger"]
  Animal Color string ["Yellow", "Yellow"]
  Friends of Given Animal string ["Lion", "Lion"]
  Another Timestamp (ISO) string ["2023-03-14", "2023-03-14"]
  Feedstock integer [3, 3]
  Favorite food for Animal string ["meat", "meat"]
  Feed Amount float [32.2, 32.2]
  Feed Frequency string ["weekly", "weekly"]
``

Yes you want it on a single line. That parsing error means they don’t expect labels to be on more than one line

If that’s the reasoning then this is a bad error though. Not sure why they’d push such domain specific requirements into what seems to be an otherwise fully capable parser for csv. If they don’t expect multi line headers that’s not a problem with the csv, but with the supplied data.

I’m a guy coming from Python and just the idea of having a new line character in a header row is like saying you are from outer space :slight_smile:

I’m totally fine with headers be required to be single line. I’m just argueing that this should not be a CSV parsing error, but a proper “hey we need you to have single line headers” error.

1 Like

Oh yes the error message could be more helpful but it is a parsing error. It sees the newline character as a new row. It looks forward and sees mostly numbers in the row so it decides to infer int64 which doesn’t work for the second half of the string.

2 Likes

Given it seems to be able to deal with newlines in data fields just fine it doesn’t need to be a parsing error though.

And yet, pandas can process \n in the header or in the fields if they are quoted.

The csv:

"Timestamp (ISO)",Animal Type,Animal Color,"Friends $
of Given Animal","Another Timestamp$
 (ISO)",Feedstock,"Favorite food for Animal",Feed Amount,Feed Frequency^M$
2023,"Tiger and $
Beasts",Yellow,Lion,2023-03-14,3,meat,32.2,weekly^M$

is imported by pandas as:

Timestamp (ISO)	Animal Type	Animal Color	Friends \n of Given Animal	Another Timestamp\n (ISO)	Feedstock	Favorite food for Animal	Feed Amount	Feed Frequency
0	2023	Tiger and \nBeasts	Yellow	Lion	2023-03-14	3	meat	32.2	weekly
1	2023	Tiger	Yellow	Lion	2023-03-14	3	meat	32.2	weekly

Should I give feedback to the Explorer library?

1 Like

Yeah that might not be bad to submit

1 Like