Explorer.DataFrame.from_csv problem

Hi all

I'm having trouble reading nanosecond-precision timestamps from CSV files with the latest Explorer.

Repro as follows:

./script/ingestion_test.csv

datetime_ns
1738573413286629810

./script/ingestion_test.exs

Mix.install([
  :explorer
])

{options, _, _} = OptionParser.parse(System.argv(), strict: [path: :string])
{:ok, path} = Keyword.fetch(options, :path)

Explorer.DataFrame.from_csv(Path.relative_to_cwd(path), [
  infer_schema_length: 0,
  columns: [
    "datetime_ns"
  ],
  dtypes: [
    {"datetime_ns", {:naive_datetime, :nanosecond}}
  ]
])
|> IO.inspect()

Outcome:

$ elixir ./script/ingestion_test.exs --path ./script/ingestion_test.csv
{:error,
 %RuntimeError{
   message: "Polars Error: could not parse `1738573413286629810` as dtype `datetime[ns]` at column 'datetime_ns' (column number 1)\n\nThe current offset in the file is 12 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `dtypes` argument\n- setting `ignore_errors` to `True`,\n- adding `1738573413286629810` to the `null_values` list.\n\nOriginal error: ```could not find a 'date/datetime' pattern for '1738573413286629810'```"
 }}

Surely I’m missing something elementary?

https://www.unixtimestamp.com/ agrees the timestamp is good. But it won’t parse.
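
The same sanity check can be sketched in plain Elixir, with no Explorer involved. This is only an illustration: the variable names `ns` and `dt` are mine, and it assumes the value is nanoseconds since the Unix epoch. Note that `DateTime` holds at most microsecond precision, so the last three digits are dropped.

```elixir
# The raw value from the CSV, read as nanoseconds since the Unix epoch.
ns = 1_738_573_413_286_629_810

# Convert with the standard library. The result is truncated to
# microseconds because DateTime cannot represent nanoseconds.
dt = DateTime.from_unix!(ns, :nanosecond)
IO.inspect(dt)
# ~U[2025-02-03 09:03:33.286629Z]

# Going back to an integer at microsecond resolution shows the truncation.
IO.inspect(DateTime.to_unix(dt, :microsecond))
```

So the integer itself is a valid timestamp; the failure is in how the CSV reader tries to parse it.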

Many thanks in advance

elixir 1.17.2
erlang 27.0.1
explorer 0.10.1

I think Explorer only supports NaiveDatetime. I made it work with this code in Livebook:

tmp_dir = System.tmp_dir!()
test_csv_path = Path.join(tmp_dir, "test.csv")

File.write!(test_csv_path, 
"""
datetime_ns
1738573413286629810
""")

df = Explorer.DataFrame.from_csv!(test_csv_path, [
  infer_schema_length: 0,
  columns: [
    "datetime_ns"
  ],
  dtypes: [
    datetime_ns: {:u, 64}
  ]
])

dt = df["datetime_ns"]
  |> Explorer.Series.transform(fn dns -> DateTime.from_unix!(dns, :nanosecond) end)


Explorer.DataFrame.put(df, "datetime_ns", dt)

{:ok, frame} = Explorer.DataFrame.from_csv(Path.relative_to_cwd(path), [
  infer_schema_length: 0,
  columns: [
    "datetime_ns"
  ],
  dtypes: [
    {"datetime_ns", {:u, 64}}
  ]
])

frame
|> Explorer.DataFrame.put("dataframe_ns", frame["datetime_ns"] |> Explorer.Series.cast({:naive_datetime, :nanosecond}))
|> IO.inspect()

Output:

#Explorer.DataFrame<
  Polars[1 x 2]
  datetime_ns u64 [1738573413286629810]
  dataframe_ns naive_datetime[ns] [2025-02-03 09:03:33.286629]
>

@evadne Glad you figured it out :slight_smile:

FWIW I couldn’t get Polars (which Explorer uses under the hood) to parse the timestamp directly either, so your cast approach is what I’d use too. Here are a few more pointers.

DF.mutate/2 is nice because you don’t need to keep around the original column:

require Explorer.DataFrame, as: DF

path = "./ingestion_test.csv"

path
|> DF.from_csv!(dtypes: %{datetime_ns: :u64})
|> DF.mutate(datetime_ns: cast(datetime_ns, {:naive_datetime, :nanosecond}))
# #Explorer.DataFrame<
#   Polars[1 x 1]
#   datetime_ns naive_datetime[ns] [2025-02-03 09:03:33.286629]
# >

There is also the :lazy option. It will defer the computation so that you never create the intermediate column in the first place:

path
|> DF.from_csv!(dtypes: %{datetime_ns: :u64}, lazy: true)
|> DF.mutate(datetime_ns: cast(datetime_ns, {:naive_datetime, :nanosecond}))
|> DF.collect()
# #Explorer.DataFrame<
#   Polars[1 x 1]
#   datetime_ns naive_datetime[ns] [2025-02-03 09:03:33.286629]
# >

That option is only needed if you need to go real fast for some reason.

Also, we do have support for non-naive datetimes. But that doesn’t seem super relevant for this use case.
