Hi there,
I’ve just started exploring Livebook, and it is fantastic. I was able to import a 700 MB CSV with more than 1.6 million rows of data by just dragging it from Livebook’s file storage into the notebook and selecting “Dataframe”, which gives me this:
df =
Kino.FS.file_path("data.csv")
|> Explorer.DataFrame.from_csv!()
However, Explorer somehow infers that all columns are strings. I would like to specify the data type of each column.
The Explorer documentation tells me to provide these via the :dtypes option: “A list/map of {"column_name", dtype} tuples.” How exactly do I do this? What dtypes should I use?
I tried something like Explorer.DataFrame.from_csv!([{"timestamp", :string}]), but that throws an exception which I don’t fully understand:
** (ArgumentError) expected a keyword list as first argument, got invalid entry: {"timestamp", :string}
(elixir 1.15.7) lib/keyword.ex:281: Keyword.validate/5
(elixir 1.15.7) lib/keyword.ex:336: Keyword.validate!/2
(explorer 0.8.1) lib/explorer/data_frame.ex:558: Explorer.DataFrame.from_csv/2
(explorer 0.8.1) lib/explorer/data_frame.ex:601: Explorer.DataFrame.from_csv!/2
#cell:w7fv6gjiaf3lnb55:3: (file)
Any help or pointers to tutorials etc. would be appreciated!
Thank you
Okay, I partly figured out how to pass the arguments correctly.
df =
Kino.FS.file_path("data.csv")
|> Explorer.DataFrame.from_csv!(dtypes: [{"timestamp", :time}], header: true, infer_schema_length: 10000)
I am still wondering what dtypes are available.
Hi @Starlet9334! Welcome!
It looks like you found the right option, :dtypes. The available dtypes can be found here:
- :binary - Binaries (sequences of bytes)
- :boolean - Boolean
- :category - Strings, but represented internally as integers
- :date - Date type that unwraps to Elixir.Date
- {:datetime, precision} - DateTime type with millisecond/microsecond/nanosecond precision that unwraps to Elixir.NaiveDateTime
- {:duration, precision} - Duration type with millisecond/microsecond/nanosecond precision that unwraps to Explorer.Duration
- {:f, size} - a 32-bit or 64-bit floating point number
- {:s, size} - an 8-bit, 16-bit, 32-bit, or 64-bit signed integer
- {:u, size} - an 8-bit, 16-bit, 32-bit, or 64-bit unsigned integer
- :null - nils exclusively
- :string - UTF-8 encoded binary
- :time - Time type that unwraps to Elixir.Time
- {:list, dtype} - A recursive dtype that can store lists. Examples: {:list, :boolean} or a nested list dtype like {:list, {:list, :boolean}}.
- {:struct, [{key, dtype}]} - A recursive dtype that can store Arrow/Polars structs (not to be confused with Elixir structs). This type unwraps to Elixir maps with string keys. Examples: {:struct, [{"a", :string}]} or a nested struct dtype like {:struct, [{"a", {:struct, [{"b", :string}]}}]}.
EDIT: I’ve changed the link to the top of the Series docs. There is a more explicit list there.
If you’re still having trouble, please don’t hesitate with follow-up questions! I also think we can maybe link to the list of dtypes in the documentation if that would be helpful…
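As a quick sketch (the column names "id" and "value" here are hypothetical, not from your CSV), note that :dtypes accepts either a list of {"column_name", dtype} tuples or an equivalent map:

```elixir
path = Kino.FS.file_path("data.csv")

# As a list of {"column_name", dtype} tuples...
df = Explorer.DataFrame.from_csv!(path, dtypes: [{"id", {:s, 64}}, {"value", {:f, 64}}])

# ...or equivalently as a map:
df = Explorer.DataFrame.from_csv!(path, dtypes: %{"id" => {:s, 64}, "value" => {:f, 64}})
```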
Some other thoughts while I look at this.
- For your "timestamp" column, you probably want one of the :datetime dtypes, e.g. {:datetime, :microsecond}. The :time dtype is for times of day like ~T[12:00:00]. If your timestamps look like "2024-01-01T00:00:00", you want a :datetime one.
- If you do want one of the :datetime dtypes, only naive datetimes are currently supported. Any timezone information will be ignored or not parsed, so be wary! I suggest converting all timestamps to UTC and just remembering that that’s the correct timezone.
- If you provide the dtype for all columns, you can set the :infer_schema_length option to 0 since there’s nothing to infer. Don’t set it to nil: that tells Explorer to scan the entire CSV.
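Putting that last point together with your original snippet, it might look like this (the second column name, "value", is just a placeholder for whatever else is in your CSV):

```elixir
# All dtypes are given explicitly, so nothing is left to infer.
df =
  Kino.FS.file_path("data.csv")
  |> Explorer.DataFrame.from_csv!(
    dtypes: [{"timestamp", {:datetime, :microsecond}}, {"value", {:f, 64}}],
    infer_schema_length: 0
  )
```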
Thanks a lot, that’s very helpful!
My timestamps are in Unix epoch (seconds since 1970); do you have a suggestion on how to correctly parse those? It looks like the :datetime dtype can only deal with millisecond resolution or finer.
Assuming you have a CSV like this:
timestamp
1710440000
1710450000
This will work:
require Explorer.DataFrame, as: DF
df =
"path/to/file.csv"
|> DF.from_csv!(dtypes: [{"timestamp", {:s, 64}}])
|> DF.mutate(timestamp: cast(1_000 * timestamp, {:datetime, :millisecond}))
# #Explorer.DataFrame<
#   Polars[2 x 1]
#   timestamp datetime[ms] [2024-03-14 18:13:20.000000, 2024-03-14 21:00:00.000000]
# >
It parses the column as an integer, then casts it to a datetime. Note the multiplication by 1_000 to get the result in milliseconds. This can be adapted for :microsecond and :nanosecond depending on your desired precision.
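For instance, a :microsecond variant of the same pipeline would be a sketch like this (same hypothetical file path as above):

```elixir
require Explorer.DataFrame, as: DF

df =
  "path/to/file.csv"
  |> DF.from_csv!(dtypes: [{"timestamp", {:s, 64}}])
  # 1_000_000 microseconds per second
  |> DF.mutate(timestamp: cast(1_000_000 * timestamp, {:datetime, :microsecond}))
```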
Also, it seems like we should be able to parse Unix timestamps directly into datetimes? I was surprised we needed to convert at all, given this from the Polars docs:
Datetime: Datetime representation, e.g. 2014-07-08 07:00:00. It is internally represented as a 64-bit integer since the Unix epoch and can have different units such as ns, us, ms.
Might be worth an issue on GitHub.
Great, I can confirm it works and gives me the correct timestamps.