Dynamic column selection for Explorer dataframe

toodle · June 12, 2024, 10:15pm

I’d like to query an Explorer DataFrame containing a “select” column such that the output gives, for each row, the field/column that was pointed to by the “select” column. Example:

iex> DF.print(my_df)
+--------------------------------------------+
| Explorer DataFrame: [rows: 3, columns: 3]  |
+--------------+--------------+--------------+
|     col1     |     col2     |    select    |
|   <string>   |   <string>   |   <string>   |
+==============+==============+==============+
| a            | b            | col1         |
+--------------+--------------+--------------+
| c            | d            | col2         |
+--------------+--------------+--------------+
| e            | f            | col2         |
+--------------+--------------+--------------+

The output I want is:

#Explorer.Series<
  Polars[3]
  string ["a", "d", "f"]
>

This pops up while processing categorical data and is related to a use-case involving one-hot encoding that was discussed here.

I was hoping to be able to use syntax that is fairly short, something like

my_df[.., my_df["select"]]

…but I haven’t been able to get that working yet. What I have gotten working is:

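defmodule Janky do
  alias Explorer.DataFrame, as: DF
  alias Explorer.Series, as: S

  def dynamic_select(dataframe) do
    # Candidate columns: everything except the "select" column itself.
    candidates = Enum.reject(dataframe.names, fn x -> x == "select" end)

    # use mutate_with combined with Series.select
    DF.mutate_with(dataframe, fn df ->
      # Rows whose "select" value matches no column keep the string "nil".
      [
        selection:
          Enum.reduce(candidates, "nil", fn x, acc ->
            this_column = S.equal(df["select"], x)
            S.select(this_column, df[x], acc)
          end)
      ]
    end)
    |> DF.pull("selection")
  end
end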

iex> Janky.dynamic_select(my_df)
#Explorer.Series<
  Polars[3]
  string ["a", "d", "f"]
>

This seems like a lot of code to do something fairly straightforward—is there anything more sleek built into Explorer?

In Pandas, there used to be DataFrame.lookup(), which I guess has been replaced by something slightly more verbose. I feel like this type of dynamic selection is supported by dplyr too…

Anything built into Explorer (or on the roadmap) for this?

billylanchantin · June 12, 2024, 11:33pm

For the two-column example, there’s this:

df["select"] |> S.equal("col1") |> S.select(df["col1"], df["col2"])

To generalize, this works:

Enum.reduce(df.names -- ["select"], df["select"], fn col, acc ->
  df["select"] |> S.equal(col) |> S.select(df[col], acc)
end)
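
For anyone who wants to try the generalized version end to end, here is a minimal sketch that rebuilds the frame from the original post and runs the same reduce. The `Mix.install` call and its version requirement are assumptions for running this as a standalone script, and `picked` is just an illustrative name.

```elixir
# Assumption: run as a standalone script; the version requirement is a guess
# at a release that was current around the time of this thread.
Mix.install([{:explorer, "~> 0.8"}])

alias Explorer.DataFrame, as: DF
alias Explorer.Series, as: S

# The example frame from the original post.
my_df =
  DF.new(
    col1: ["a", "c", "e"],
    col2: ["b", "d", "f"],
    select: ["col1", "col2", "col2"]
  )

# Start from the "select" column itself and, for each candidate column,
# overwrite the rows whose "select" value names that column.
picked =
  Enum.reduce(my_df.names -- ["select"], my_df["select"], fn col, acc ->
    my_df["select"] |> S.equal(col) |> S.select(my_df[col], acc)
  end)

IO.inspect(S.to_list(picked))
# prints ["a", "d", "f"]
```

Because the accumulator starts as the "select" column, any row whose "select" value names no existing column keeps that value unchanged in the result.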

It’s very close to what you had. The overall approach does seem a bit wasteful, and I’d love to do things lazily. But I tried a few other ideas and didn’t get anywhere.

As for the roadmap, I think exposing some fold-based expressions may help here:

exposing the `fold` expressions from Polars · Issue #911 · elixir-explorer/explorer · GitHub

toodle · June 13, 2024, 4:11pm

Ah, I see, so pretty much the same thing without mutate_with. Nice, that is a bit tidier, thanks! Unfortunately, as you mentioned, it does remove the ability to do things lazily.

Even on eager frames, I’m seeing that mutate_with makes things 3-4x faster, which is kind of interesting.

Re: fold, that seems very relevant, although it looks like it would require pretty much the same amount of code as this example uses. I’m thinking of tighter function calls (that maybe wrap fold or what we’ve written here). This seems like a type of horizontal aggregation, and it might be helpful for Explorer to have a small number of those. Polars issue tracking horizontal aggregations.
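
To make the "tight function call" idea concrete, here is a rough sketch of what such a wrapper could look like on top of the code in this thread. `RowPick.pick/2` and the `row_pick` column name are hypothetical and not part of Explorer, and the `Mix.install` version requirement is an assumption. It keeps the expression inside mutate_with and assumes every candidate column has the same dtype as the selector column.

```elixir
# Assumption: standalone script; the version requirement is a guess.
Mix.install([{:explorer, "~> 0.8"}])

defmodule RowPick do
  # Hypothetical helper, not an Explorer API.
  alias Explorer.DataFrame, as: DF
  alias Explorer.Series, as: S

  # For each row, return the value of the column named by `selector`.
  # Assumes all other columns share the selector column's dtype (strings here).
  def pick(df, selector \\ "select") do
    candidates = df.names -- [selector]

    df
    |> DF.mutate_with(fn lazy ->
      picked =
        Enum.reduce(candidates, lazy[selector], fn col, acc ->
          # Where the selector names this column, take its value;
          # otherwise keep whatever has been picked so far.
          lazy[selector] |> S.equal(col) |> S.select(lazy[col], acc)
        end)

      [row_pick: picked]
    end)
    |> DF.pull("row_pick")
  end
end

my_df =
  Explorer.DataFrame.new(
    col1: ["a", "c", "e"],
    col2: ["b", "d", "f"],
    select: ["col1", "col2", "col2"]
  )

my_df |> RowPick.pick() |> Explorer.Series.to_list() |> IO.inspect()
```

The call site shrinks to `RowPick.pick(my_df)`, but the amount of work is unchanged: one equality check and one select per candidate column.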