Hi there,
I’ve just started exploring Livebook, and it is fantastic. I was able to import a 700 MB CSV with more than 1.6 million rows of data by just dragging it from Livebook’s file storage into the notebook and selecting “Dataframe”, which gives me this:
df =
Kino.FS.file_path("data.csv")
|> Explorer.DataFrame.from_csv!()
However, Explorer somehow thinks that all columns are strings. I would like to specify the data type of each column.
The Explorer documentation tells me to provide these as an option: " :dtypes - A list/map of {"column_name", dtype} tuples." How exactly do I do this? And what dtypes should I use?
I tried something like Explorer.DataFrame.from_csv!([{"timestamp", :string}]) but that throws an exception which I don’t fully understand:
** (ArgumentError) expected a keyword list as first argument, got invalid entry: {"timestamp", :string}
(elixir 1.15.7) lib/keyword.ex:281: Keyword.validate/5
(elixir 1.15.7) lib/keyword.ex:336: Keyword.validate!/2
(explorer 0.8.1) lib/explorer/data_frame.ex:558: Explorer.DataFrame.from_csv/2
(explorer 0.8.1) lib/explorer/data_frame.ex:601: Explorer.DataFrame.from_csv!/2
#cell:w7fv6gjiaf3lnb55:3: (file)
Any help or pointers to tutorials etc. would be appreciated!
Thank you
Okay, I partly figured out how to pass the arguments correctly.
df =
Kino.FS.file_path("data.csv")
|> Explorer.DataFrame.from_csv!(dtypes: [{"timestamp", :time}], header: true, infer_schema_length: 10000)
I am still wondering what dtypes are available.
Hi @Starlet9334! Welcome!
It looks like you found the right option, :dtypes. The available dtypes can be found here:
:binary - Binaries (sequences of bytes)
:boolean - Boolean
:category - Strings but represented internally as integers
:date - Date type that unwraps to Elixir.Date
{:datetime, precision} - DateTime type with millisecond/microsecond/nanosecond precision that unwraps to Elixir.NaiveDateTime
{:duration, precision} - Duration type with millisecond/microsecond/nanosecond precision that unwraps to Explorer.Duration
{:f, size} - a 32-bit or 64-bit floating-point number
{:s, size} - an 8-bit, 16-bit, 32-bit, or 64-bit signed integer
{:u, size} - an 8-bit, 16-bit, 32-bit, or 64-bit unsigned integer
:null - nils exclusively
:string - UTF-8 encoded binary
:time - Time type that unwraps to Elixir.Time
{:list, dtype} - A recursive dtype that can store lists. Examples: {:list, :boolean} or a nested list dtype like {:list, {:list, :boolean}}.
{:struct, [{key, dtype}]} - A recursive dtype that can store Arrow/Polars structs (not to be confused with Elixir’s struct). This type unwraps to Elixir maps with string keys. Examples: {:struct, [{"a", :string}]} or a nested struct dtype like {:struct, [{"a", {:struct, [{"b", :string}]}}]}.
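For example, you can pass any of these dtypes via the :dtypes option as a list of {"column_name", dtype} tuples (a map with the same shape also works). The column names here are just made-up illustrations, not from your CSV:

```elixir
# Hypothetical sketch: assigning an explicit dtype to a few columns.
# Column names "timestamp", "count", "value", and "label" are assumptions.
df =
  Kino.FS.file_path("data.csv")
  |> Explorer.DataFrame.from_csv!(
    dtypes: [
      {"timestamp", {:datetime, :microsecond}},
      {"count", {:s, 64}},
      {"value", {:f, 64}},
      {"label", :category}
    ]
  )
```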
EDIT: I’ve changed the link to the top of the Series docs. There is a more explicit list there.
If you’re still having trouble, please don’t hesitate with follow-up questions! I also think we can maybe link to the list of dtypes in the documentation if that would be helpful…
Some other thoughts while I look at this.
- For your "timestamp" column, you probably want one of the :datetime dtypes, e.g. {:datetime, :microsecond}. The :time dtype is for times of day like ~T[12:00:00]. If your timestamps look like "2024-01-01T00:00:00", you want a :datetime one.
- If you do want one of the :datetime dtypes, note that only naive datetimes are currently supported. Any timezone information will be ignored or not parsed, so be wary! I suggest converting all timestamps to UTC and just remembering that that’s the correct timezone.
- If you provide the dtype for all columns, you can set the :infer_schema_length option to 0, since there’s nothing to infer. Don’t set it to nil; that tells Explorer to scan the entire CSV.
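Putting those points together, a sketch might look like this (the column names are assumptions, and this presumes every column in the file gets an explicit dtype):

```elixir
# Hypothetical sketch: all dtypes supplied up front, so schema
# inference can be skipped entirely with infer_schema_length: 0.
df =
  Kino.FS.file_path("data.csv")
  |> Explorer.DataFrame.from_csv!(
    dtypes: [
      {"timestamp", {:datetime, :microsecond}},
      {"value", {:f, 64}}
    ],
    infer_schema_length: 0
  )
```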
Thanks a lot, that’s very helpful!
My timestamps are in Unix epoch seconds (seconds since 1970); do you have a suggestion on how to parse those correctly? It looks like the :datetime dtypes only support resolutions of milliseconds and finer.
Assuming you have a CSV like this:
timestamp
1710440000
1710450000
This will work:
require Explorer.DataFrame, as: DF
df =
"path/to/file.csv"
|> DF.from_csv!(dtypes: [{"timestamp", {:s, 64}}])
|> DF.mutate(timestamp: cast(1_000 * timestamp, {:datetime, :millisecond}))
# #Explorer.DataFrame<
#   Polars[2 x 1]
#   timestamp datetime[ms] [2024-03-14 18:13:20.000000, 2024-03-14 21:00:00.000000]
# >
It parses the column as an integer and then casts it to a datetime. Note the multiplication by 1_000 to convert seconds to milliseconds. This can be adapted for :microsecond or :nanosecond depending on your desired precision.
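For instance, the microsecond adaptation would just scale by 1_000_000 instead (same assumed "timestamp" column and file path as above):

```elixir
require Explorer.DataFrame, as: DF

# Sketch: seconds since the Unix epoch, scaled to microseconds
# before casting to a microsecond-precision datetime.
df =
  "path/to/file.csv"
  |> DF.from_csv!(dtypes: [{"timestamp", {:s, 64}}])
  |> DF.mutate(timestamp: cast(1_000_000 * timestamp, {:datetime, :microsecond}))
```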
Also, it seems like we should be able to parse Unix timestamps directly into datetimes? I was surprised we needed to convert at all, given this from the Polars docs:
Datetime: Datetime representation e.g. 2014-07-08 07:00:00. It is internally represented as a 64 bit integer since the Unix epoch and can have different units such as ns, us, ms.
Might be worth an issue on GitHub.
Great, I can confirm it works and gives me the correct timestamps.