Explorer - subsetting rows by columns that match one of multiple values

srowley · June 24, 2022, 9:45pm

Explorer supports creating a subset of dataframe rows:

df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])

Explorer.DataFrame.filter(df, Explorer.Series.greater(df["b"], 1))
#Explorer.DataFrame<
  Polars[2 x 2]
  a string ["b", "c"]
  b integer [2, 3]
>

And it supports multiple filters:

df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
b_gt = Explorer.Series.greater(df["b"], 1)
a_eq = Explorer.Series.equal(df["a"], "b")

Explorer.DataFrame.filter(df, Explorer.Series.and(a_eq, b_gt))
#Explorer.DataFrame<
  Polars[1 x 2]
  a string ["b"]
  b integer [2]
>

With that in mind, is there a simple way to subset rows where a column takes any one of several values? For example, something like:

df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])

Explorer.DataFrame.filter(df, Explorer.Series.in(df["b"], [1, 2]))
#Explorer.DataFrame<
  Polars[2 x 2]
  a string ["a", "b"]
  b integer [1, 2]
>

I think you could do this by chaining a bunch of Series.equal/2 and Series.or/2 calls, but at some point that gets pretty tedious.
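
For reference, that chaining can be folded over the list of values rather than written out by hand. This is only a minimal sketch, assuming Explorer is loaded in the session at the same version as the snippets above (where DataFrame.filter accepts a boolean series as its second argument). The names values, mask, and result are just for illustration:

```elixir
df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
values = [1, 2]

# Build one boolean mask per value, then OR them together pairwise.
mask =
  values
  |> Enum.map(fn value -> Explorer.Series.equal(df["b"], value) end)
  |> Enum.reduce(fn next_mask, acc -> Explorer.Series.or(acc, next_mask) end)

# Keep only the rows where b matched at least one of the values.
result = Explorer.DataFrame.filter(df, mask)
```

Enum.reduce/2 raises on an empty list, so this assumes values has at least one element. It also builds one intermediate series per value, so it is probably only reasonable for short lists.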

cigrainger · June 24, 2022, 10:20pm

Oh for sure! We need to add something to handle that. I’ve raised an issue: "Support filtering on a list of values" (Issue #273, elixir-nx/explorer). Thanks for bringing this up!

srowley · June 25, 2022, 1:19pm

Thanks!

Postscript: I realized you can do this with a join on the shared column for now. I don’t know if it’s the most efficient method, but it is workable at least:

df = Explorer.DataFrame.new(a: ["a", "b", "c"], b: [1, 2, 3])
filter = Explorer.DataFrame.new(b: [1, 2])

Explorer.DataFrame.join(df, filter)
#Explorer.DataFrame<
  Polars[2 x 2]
  a string ["a", "b"]
  b integer [1, 2]
>