evadne
February 10, 2025, 4:40am
1
Hi all
I'm having trouble reading nanosecond-precision timestamps from CSVs using the latest Explorer.
Repro as follows:
./script/ingestion_test.csv
datetime_ns
1738573413286629810
./script/ingestion_test.exs
Mix.install([
  :explorer
])

{options, _, _} = OptionParser.parse(System.argv(), strict: [path: :string])
{:ok, path} = Keyword.fetch(options, :path)

Explorer.DataFrame.from_csv(Path.relative_to_cwd(path), [
  infer_schema_length: 0,
  columns: [
    "datetime_ns"
  ],
  dtypes: [
    {"datetime_ns", {:naive_datetime, :nanosecond}}
  ]
])
|> IO.inspect()
Outcome:
$ elixir ./script/ingestion_test.exs --path ./script/ingestion_test.csv
{:error,
%RuntimeError{
message: "Polars Error: could not parse `1738573413286629810` as dtype `datetime[ns]` at column 'datetime_ns' (column number 1)\n\nThe current offset in the file is 12 bytes.\n\nYou might want to try:\n- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),\n- specifying correct dtype with the `dtypes` argument\n- setting `ignore_errors` to `True`,\n- adding `1738573413286629810` to the `null_values` list.\n\nOriginal error: ```could not find a 'date/datetime' pattern for '1738573413286629810'```"
}}
Surely I’m missing something elementary?
https://www.unixtimestamp.com/ agrees the timestamp is good. But it won’t parse.
Many thanks in advance
elixir 1.17.2
erlang 27.0.1
explorer 0.10.1
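The error text says Polars could not find a 'date/datetime' pattern, which suggests the CSV reader expects a textual datetime for that dtype rather than an integer epoch value. As a quick sanity check that the integer itself is a valid nanosecond timestamp, here is a minimal sketch using only the Elixir standard library (no Explorer involved; the variable names are just for illustration):

```elixir
# Standard library only; this does not touch Explorer or Polars.
ns = 1_738_573_413_286_629_810

# Interpret the integer as nanoseconds since the Unix epoch (UTC).
# Elixir's DateTime holds at most microsecond precision, so the
# trailing nanosecond digits are not kept in the result.
{:ok, dt} = DateTime.from_unix(ns, :nanosecond)

IO.puts(DateTime.to_iso8601(dt))
# prints 2025-02-03T09:03:33.286629Z
```

So the value is fine as an epoch offset; the problem is only how the CSV reader is asked to interpret it.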
I think Explorer only supports NaiveDateTime. I made it work with this code in Livebook:
tmp_dir = System.tmp_dir!()
test_csv_path = Path.join(tmp_dir, "test.csv")

File.write!(test_csv_path,
  """
  datetime_ns
  1738573413286629810
  """)

df = Explorer.DataFrame.from_csv!(test_csv_path, [
  infer_schema_length: 0,
  columns: [
    "datetime_ns"
  ],
  dtypes: [
    datetime_ns: {:u, 64}
  ]
])

dt =
  df["datetime_ns"]
  |> Explorer.Series.transform(fn dns -> DateTime.from_unix!(dns, :nanosecond) end)

Explorer.DataFrame.put(df, "datetime_ns", dt)
evadne
February 10, 2025, 6:54pm
3
{:ok, frame} = Explorer.DataFrame.from_csv(Path.relative_to_cwd(path), [
  infer_schema_length: 0,
  columns: [
    "datetime_ns"
  ],
  dtypes: [
    {"datetime_ns", {:u, 64}}
  ]
])

frame
|> Explorer.DataFrame.put("dataframe_ns", frame["datetime_ns"] |> Explorer.Series.cast({:naive_datetime, :nanosecond}))
|> IO.inspect()
Outcome:
#Explorer.DataFrame<
Polars[1 x 2]
datetime_ns u64 [1738573413286629810]
dataframe_ns naive_datetime[ns] [2025-02-03 09:03:33.286629]
>
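Note that the printed value stops at .286629 even though the column dtype is naive_datetime[ns]. A likely reason is that Elixir's NaiveDateTime carries at most microsecond precision, so the last three digits of the raw integer cannot appear once the value is shown as an Elixir struct. The arithmetic can be sketched with the standard library alone (variable names are hypothetical, and this illustrates the unit conversion only, not Explorer's internals):

```elixir
ns = 1_738_573_413_286_629_810

# Split the epoch offset into whole seconds and the sub-second remainder.
seconds = div(ns, 1_000_000_000)  # 1_738_573_413
sub_ns = rem(ns, 1_000_000_000)   # 286_629_810

# Microseconds are the finest unit a NaiveDateTime can hold;
# whatever is left over is lost when the value becomes a struct.
micros = div(sub_ns, 1_000)       # 286_629
dropped_ns = rem(sub_ns, 1_000)   # 810

# Rebuild the naive datetime from the epoch at microsecond precision.
naive = NaiveDateTime.add(~N[1970-01-01 00:00:00.000000], div(ns, 1_000), :microsecond)

IO.inspect({seconds, micros, dropped_ns, naive})
```

If the full nanosecond value matters downstream, keeping the original u64 column alongside the cast one (as above) is one way to avoid losing it.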
@evadne Glad you figured it out.
FWIW I couldn’t get Polars (which Explorer uses under the hood) to parse the timestamp directly either, so your cast approach is what I’d use too. Here are a few more pointers.
DF.mutate/2 is nice because you don’t need to keep the original column around:
require Explorer.DataFrame, as: DF
path = "./ingestion_test.csv"
path
|> DF.from_csv!(dtypes: %{datetime_ns: :u64})
|> DF.mutate(datetime_ns: cast(datetime_ns, {:naive_datetime, :nanosecond}))
# #Explorer.DataFrame<
# Polars[1 x 1]
# datetime_ns naive_datetime[ns] [2025-02-03 09:03:33.286629]
# >
There is also the :lazy option. It defers the computation so that you never create the intermediate column in the first place:
path
|> DF.from_csv!(dtypes: %{datetime_ns: :u64}, lazy: true)
|> DF.mutate(datetime_ns: cast(datetime_ns, {:naive_datetime, :nanosecond}))
|> DF.collect()
# #Explorer.DataFrame<
# Polars[1 x 1]
# datetime_ns naive_datetime[ns] [2025-02-03 09:03:33.286629]
# >
That option is only needed if you need to go real fast for some reason.
Also, we do have support for non-naive datetimes, but that doesn’t seem super relevant for this use case.