The idea is to summarise columns in groups by looking at other columns. For example, I want to pick the
name
column in the row where theid2
column is notnil
.
We’ve got a data frame in which we want to summarise groups by looking at other columns.
For example, let’s say we have a DF
iex(45)> alias Explorer.DataFrame, as: DF
iex(46)> alias Explorer.Series, as: S
iex(47)> df = Explorer.DataFrame.new(%{id: [1, 2, 3, 1], id2: [1, 2, 3, nil], name: ["foo", "bar", "baz", "qux"]})
#Explorer.DataFrame<
Polars[4 x 3]
id integer [1, 2, 3, 1]
id2 integer [1, 2, 3, nil]
name string ["foo", "bar", "baz", "qux"]
iex(48)> grouped = Explorer.DataFrame.group_by(df, ["id"])
#Explorer.DataFrame<
Polars[4 x 3]
Groups: ["id"]
id integer [1, 2, 3, 1]
id2 integer [1, 2, 3, nil]
name string ["foo", "bar", "baz", "qux"]
>
So we’ve grouped our data frame by ID and we would like to give priority for names where id2
is not nil
. How can we do that?
Only workaround I could find is by sorting the columns and then picking the first one, but I am not sure whether I’d have problems with that in more complex situations, or maybe I am using it in a wrong way, not sure, but then what happens when I want to do this for multiple columns?
iex(50)> grouped |> DF.arrange(asc: "id2") |> DF.summarise_with(&[name: S.first(&1["name"])])
#Explorer.DataFrame<
Polars[3 x 2]
id integer [1, 2, 3]
name string ["qux", "bar", "baz"]
>
or I could go with the first non nil value for a column in the group, just to make things easier.