Csv import - how to correctly specify dtype

Hi there,

I’ve just started exploring Livebook, and it is fantastic. I was able to import a 700 MB CSV with more than 1.6 million rows of data just by dragging it from Livebook’s file storage into the notebook and selecting “Dataframe”, which gives me this:

df =
  Kino.FS.file_path("data.csv")
  |> Explorer.DataFrame.from_csv!()

However, Explorer somehow thinks that all columns are strings. I would like to specify the data type of each column.

The Explorer documentation tells me to provide these as an option: ":dtypes - A list/map of {"column_name", dtype} tuples." How exactly do I do this? What are the dtypes I should use?

I tried something like Explorer.DataFrame.from_csv!([{"timestamp", :string}]), but that throws an exception which I don’t fully understand:

** (ArgumentError) expected a keyword list as first argument, got invalid entry: {"timestamp", :string}
    (elixir 1.15.7) lib/keyword.ex:281: Keyword.validate/5
    (elixir 1.15.7) lib/keyword.ex:336: Keyword.validate!/2
    (explorer 0.8.1) lib/explorer/data_frame.ex:558: Explorer.DataFrame.from_csv/2
    (explorer 0.8.1) lib/explorer/data_frame.ex:601: Explorer.DataFrame.from_csv!/2
    #cell:w7fv6gjiaf3lnb55:3: (file)

Any help or pointers to tutorials etc. would be appreciated!

Thank you

1 Like

Okay, I partly figured out how to pass the arguments correctly.

df =
  Kino.FS.file_path("data.csv")
  |> Explorer.DataFrame.from_csv!(
    dtypes: [{"timestamp", :time}],
    header: true,
    infer_schema_length: 10000
  )

I am still wondering what dtypes are available.

Hi @Starlet9334! Welcome :slight_smile:

It looks like you found the right option, :dtypes. The available dtypes can be found here:

  • :binary - Binaries (sequences of bytes)
  • :boolean - Boolean
  • :category - Strings but represented internally as integers
  • :date - Date type that unwraps to Elixir.Date
  • {:datetime, precision} - DateTime type with millisecond/microsecond/nanosecond precision that unwraps to Elixir.NaiveDateTime
  • {:duration, precision} - Duration type with millisecond/microsecond/nanosecond precision that unwraps to Explorer.Duration
  • {:f, size} - a 64-bit or 32-bit floating point number
  • {:s, size} - an 8-bit, 16-bit, 32-bit, or 64-bit signed integer
  • {:u, size} - an 8-bit, 16-bit, 32-bit, or 64-bit unsigned integer
  • :null - nils exclusively
  • :string - UTF-8 encoded binary
  • :time - Time type that unwraps to Elixir.Time
  • {:list, dtype} - A recursive dtype that can store lists. Examples: {:list, :boolean} or a nested list dtype like {:list, {:list, :boolean}}.
  • {:struct, [{key, dtype}]} - A recursive dtype that can store Arrow/Polars structs (not to be confused with Elixir’s struct). This type unwraps to Elixir maps with string keys. Examples: {:struct, [{"a", :string}]} or a nested struct dtype like {:struct, [{"a", {:struct, [{"b", :string}]}}]}.
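For example, a few of these dtypes can be passed together via the :dtypes option. A minimal, self-contained sketch (the file name and column names here are made up for illustration):

```elixir
# Write a small CSV so the sketch is self-contained.
csv = """
name,count,price
widget,3,1.5
gadget,7,2.25
"""

path = Path.join(System.tmp_dir!(), "dtypes_example.csv")
File.write!(path, csv)

df =
  Explorer.DataFrame.from_csv!(path,
    dtypes: %{"name" => :string, "count" => {:s, 64}, "price" => {:f, 64}}
  )

# Inspect the resulting schema as a map of column names to dtypes.
Explorer.DataFrame.dtypes(df)
```

The :dtypes option accepts either a map like the one above or a list of {"column_name", dtype} tuples.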

EDIT: I’ve changed the link to the top of the Series docs. There is a more explicit list there.

If you’re still having trouble, please don’t hesitate to ask follow-up questions! I also think we could link to the list of dtypes from the :dtypes option documentation if that would be helpful…

1 Like

Some other thoughts while I look at this.

  • For your "timestamp" column, you might want one of the :datetime dtypes, e.g. {:datetime, :microsecond}. The :time dtype is for times of day like ~T[12:00:00]. If your timestamps look like "2024-01-01T00:00:00", you want a :datetime one.
  • If you do want one of the :datetime dtypes, note that only naive datetimes are currently supported. Any timezone information will be ignored or fail to parse, so be wary! I suggest converting all timestamps to UTC and simply remembering that they’re in UTC.
  • If you provide the dtype for all columns, you can set the :infer_schema_length option to 0 since there’s nothing to infer. Don’t set it to nil. That tells Explorer to scan the entire CSV.
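Putting those points together, here is a self-contained sketch (with made-up column names and naive ISO-8601 timestamps) that supplies every dtype up front and disables inference:

```elixir
# Write a tiny CSV with naive ISO-8601 timestamps so the sketch runs as-is.
csv = """
timestamp,value
2024-01-01T00:00:00.000000,1.5
2024-01-01T01:00:00.000000,2.5
"""

path = Path.join(System.tmp_dir!(), "timestamps_example.csv")
File.write!(path, csv)

df =
  Explorer.DataFrame.from_csv!(path,
    dtypes: %{"timestamp" => {:datetime, :microsecond}, "value" => {:f, 64}},
    # All dtypes are given, so there is nothing to infer.
    infer_schema_length: 0
  )
```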
2 Likes

Thanks a lot, that’s very helpful!

My timestamps are in unix epoch (seconds since 1970), do you have a suggestion on how to correctly parse those? It looks like the :datetime dtype can only deal with resolutions of milliseconds and higher.

Assuming you have a CSV like this:

timestamp
1710440000
1710450000

This will work:

require Explorer.DataFrame, as: DF

df =
  "path/to/file.csv"
  |> DF.from_csv!(dtypes: [{"timestamp", {:s, 64}}])
  |> DF.mutate(timestamp: cast(1_000 * timestamp, {:datetime, :millisecond}))
# #Explorer.DataFrame<
#   Polars[2 x 1]
#   timestamp datetime[ms] [2024-03-14 18:13:20.000000, 2024-03-14 21:00:00.000000]
# >

It parses the column as a 64-bit integer, then casts it to a datetime. Note the multiplication by 1_000 to convert seconds to milliseconds. The same approach works for :microsecond and :nanosecond; just adjust the multiplier to match your desired precision.
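For completeness, the same sketch adapted to microsecond precision (again self-contained with a throwaway CSV; the only changes are the multiplier and the target dtype):

```elixir
require Explorer.DataFrame, as: DF

# Throwaway CSV of unix timestamps in seconds, as in the example above.
csv = """
timestamp
1710440000
1710450000
"""

path = Path.join(System.tmp_dir!(), "epoch_example.csv")
File.write!(path, csv)

df =
  path
  |> DF.from_csv!(dtypes: [{"timestamp", {:s, 64}}])
  # 1_000_000 microseconds per second.
  |> DF.mutate(timestamp: cast(1_000_000 * timestamp, {:datetime, :microsecond}))
```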

Also, it seems like we should be able to parse unix timestamps directly into datetimes? I was surprised we needed to convert at all, given this from the Polars docs:

Datetime: Datetime representation e.g. 2014-07-08 07:00:00. It is internally represented as a 64 bit integer since the Unix epoch and can have different units such as ns, us, ms.

Might be worth an issue on GitHub.

1 Like

Great, I can confirm it works and gives me the correct timestamps.

1 Like