Equivalent of pandas commands in Explorer

Aurel · October 9, 2024, 8:10am

Hello

I am playing with Explorer coming from pandas, and have some difficulties finding the right way to do some of my usual workflow steps.

Reading data

When using pandas in Python, I use a lot of .loc[] and .iloc[] calls to access data.

What is the idiomatic way to access data in Explorer?
The closest approach I could find is to grab the underlying series and then ask for the value by index (which sounds like a mix of loc and iloc to me), for example:

df["year"][0]

but it doesn’t feel very natural.

Cleaning data

Most of the time when I work in pandas, the first step in the notebook is to clean the database before processing.

Does anybody know how to do it in Explorer?
More precisely, I am looking for the equivalents of the following commands:

Drop invalid columns (columns containing only NaN or nil values), equivalent to pandas' df.dropna(how='all', axis=1)
Same thing for invalid rows: df.dropna(how='all', axis=0)
Data replacement with fillna: replace all NaN values with a given value, df.fillna(0), or do it per column, df.fillna({'col1': 0, 'col2': 3}), etc.
Conversion from string to integer: an equivalent of pandas.to_numeric (with coercion of invalid values) would be useful, as would equivalents of the “str methods” (to do some replacements in the strings before parsing them into numbers)

If anybody can help on these I would be grateful

I understand that these are newbie questions, but I could not figure them out on my own… I’m also biased towards the “pandas way” since I’ve been using it for a long time, so there are probably ways I just don’t see.

Kind regards,
Aurélien

billylanchantin · October 11, 2024, 5:04pm

I also came from Pandas so Explorer concepts were confusing at first.

First, Explorer is built on Polars rather than Pandas. So you’re already at a bit of a disadvantage if you’re trying to carry over your Pandas intuition. This guide may help a bit with that aspect: Coming from Pandas - Polars user guide.

As for the specific questions:

What is the idiomatic way to access data in Explorer?

EDIT: left this section out at first.

This is IMO the biggest mindset shift in transferring from Pandas to Polars. In Polars, accessing data like Pandas does with .loc[] and .iloc[] is discouraged. I suggest reading over that “Coming from Pandas” tutorial to better understand why. There’s a section specifically for selecting data.

The thinking is roughly this: it’s better to try and express what data you want through the built-in functions that are available. Those functions are designed with fast querying/manipulating in mind.

For example, selecting data by its index isn’t usually what you want to do because the index of a particular row is usually incidental. It’ll be better to try and access it by its properties through filtering. E.g. I want the row where name == "Billy", not where index == 5. Filtering by name is just as fast as filtering by index, but filtering by name also works regardless of the row order.

If you really do need data at a specific index, you can do it just like you’ve described. But the Polars philosophy discourages this access pattern in favor of filtering/selecting for a good reason.
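As a rough illustration of the two access styles, here is a minimal sketch. The column names and values are made up for the example, and it assumes Explorer is already installed (for example through Mix.install([:explorer]) in a Livebook setup cell).

```elixir
# Assumes Explorer is available in the current session.
require Explorer.DataFrame, as: DF
alias Explorer.Series, as: S

# Hypothetical data, only for illustration.
df = DF.new(name: ["Ana", "Billy", "Cleo"], year: [2021, 2022, 2023])

# Preferred: describe the row by its properties.
billy = DF.filter(df, name == "Billy")
DF.to_columns(billy)
#=> %{"name" => ["Billy"], "year" => [2022]}

# Positional access still works when you really need it.
first_year = df["year"][0]
#=> 2021

# Same thing, spelled out with pull/2 and at/2.
same_value = df |> DF.pull("year") |> S.at(0)
```

The filter keeps working if the rows get sorted or new rows are prepended, while the positional lookups silently start returning a different row.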

Drop columns with `nil`s

I don’t think we have a specific function. This should work:

require Explorer.DataFrame, as: DF
require Explorer.Series, as: S

df = DF.new(a: [1, 2, 3], b: [4, nil, 6])
# #Explorer.DataFrame<
#   Polars[3 x 2]
#   a s64 [1, 2, 3]
#   b s64 [4, nil, 6]
# >

columns_with_nils =
  df
  # Benefit of Polars: this step happens in parallel.
  |> DF.summarise(for(col <- across(), do: {col.name, nil_count(col) > 0}))
  |> DF.pivot_longer(df.names, names_to: "column", values_to: "has_nil")
  |> DF.filter(has_nil)
  |> DF.pull("column")
  |> S.to_list()
  #=> ["b"]

DF.discard(df, columns_with_nils)
# #Explorer.DataFrame<
#   Polars[3 x 1]
#   a s64 [1, 2, 3]
# >
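Note that the pipeline above drops any column containing at least one nil. If you want the stricter pandas dropna(how='all', axis=1) behaviour from the question, one possible sketch is to compare each column's nil count with its size. It uses the eager Series functions instead of a lazy query, and the extra column c is only there for the example.

```elixir
# Assumes Explorer is available in the current session.
require Explorer.DataFrame, as: DF
alias Explorer.Series, as: S

# Column "c" is hypothetical: it holds nothing but nils.
df = DF.new(a: [1, 2, 3], b: [4, nil, 6], c: [nil, nil, nil])

# A column counts as "all nil" when every element is nil.
all_nil_columns =
  Enum.filter(df.names, fn name ->
    series = DF.pull(df, name)
    S.nil_count(series) == S.size(series)
  end)
#=> ["c"]

# Column "b" survives because it still has some valid values.
cleaned = DF.discard(df, all_nil_columns)
DF.names(cleaned)
#=> ["a", "b"]
```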

Drop rows with `nil`s

Explorer.DataFrame.drop_nil/2
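For reference, a minimal sketch of drop_nil/2 in use. As far as I can tell, it drops a row as soon as any of the selected columns is nil, which corresponds to pandas' how='any' rather than how='all'. The last expression is one possible way to approximate how='all' with a filter. The data is made up for the example.

```elixir
# Assumes Explorer is available in the current session.
require Explorer.DataFrame, as: DF

# Row 0 is complete, row 1 is partly nil, row 2 is entirely nil.
df = DF.new(a: [1, nil, nil], b: [4, 5, nil])

# Drop rows with a nil in any column: only row 0 remains.
any_nil_dropped = DF.drop_nil(df)

# Only look at column "b": rows 0 and 1 remain.
b_checked = DF.drop_nil(df, ["b"])

# Approximation of pandas how="all": keep a row if at least one value is present.
not_all_nil = DF.filter(df, is_not_nil(a) or is_not_nil(b))
```

The filter version has to name the columns explicitly, so it is best suited to frames with a handful of columns.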

Replace `nil`s:

Explorer.Series.fill_missing/2 for a specific column
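A minimal sketch of the single-column case, reusing the small data frame from the earlier example. It shows both the query form inside mutate and the eager call on a series.

```elixir
# Assumes Explorer is available in the current session.
require Explorer.DataFrame, as: DF
alias Explorer.Series, as: S

df = DF.new(a: [1, 2, 3], b: [4, nil, 6])

# Inside a query, refer to the column by its name.
filled_df = DF.mutate(df, b: fill_missing(b, 0))

# Or work on the series directly.
filled_series = S.fill_missing(df["b"], 0)

S.to_list(filled_df["b"])
#=> [4, 0, 6]
```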

For multiple columns at once, do:

col_to_value = %{"a" => 1, "b" => 2}

DF.mutate(df, for col <- across() do
  {col.name, fill_missing(col, ^col_to_value[col.name])}
end)
# #Explorer.DataFrame<
#   Polars[3 x 2]
#   a s64 [1, 2, 3]
#   b s64 [4, 2, 6]
# >

Conversion from string to integer

categorise/2… ?

I’m not as sure about this one. I’d need to know more about what you’re trying to accomplish specifically.
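If the goal is something like pandas.to_numeric(errors='coerce'), a tentative sketch could combine a string replacement with a cast. This assumes a recent Explorer release in which Series.replace/3 is available and in which cast/2 turns unparsable strings into nil instead of raising. I have not checked this across versions, so treat it as a starting point. The sample values are invented.

```elixir
# Assumes Explorer is available in the current session.
alias Explorer.Series, as: S

# Hypothetical raw input: one value with a space separator, one non-numeric entry.
raw = S.from_list(["1 200", "35", "n/a"])

numbers =
  raw
  # Remove the space used as a thousands separator before parsing.
  |> S.replace(" ", "")
  # Parse to integers; the "n/a" entry is expected to come back as nil.
  |> S.cast(:integer)
```

A follow-up fill_missing/2 on the result would then play the role of fillna for the values that failed to parse.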

I understand that these are newbie questions, but I could not figure them out on my own

Newbie questions are welcome here.

I’m also biased towards the “pandas way” since I’ve been using it for a long time

Related: try out Polars for your Python work. I heavily prefer it these days.

Aurel · October 14, 2024, 9:08am

Wow thanks a lot for the detailed response

I obviously need to change my mindset, especially the “mask → filter → mask → filter…” logic inherited from numpy and pandas, but I’ve been using them for a long time so this is not easy…

I will also try to learn how to use across() which you use in your examples.

Regards, Aurélien