How to select rows in a dataframe, based on the rows and columns of a second (smaller) dataframe

vegabook · July 27, 2024, 1:17pm

I have a larger dataframe, and I want to select rows in it based on (a smaller subset of) columns and rows from another dataframe:

iex(134)> require Explorer.DataFrame, as: DF
Explorer.DataFrame
iex(135)> df1 = DF.new(strs: ["a", "b", "c", "c"], nums: [1, 2, 3, 4], names: ["joe", "alice", "vikas", "sofija"])
#Explorer.DataFrame<
  Polars[4 x 3]
  strs string ["a", "b", "c", "c"]
  nums s64 [1, 2, 3, 4]
  names string ["joe", "alice", "vikas", "sofija"]
>
iex(136)> df2 = DF.new(strs: ["b", "c"], nums: [2, 4])
#Explorer.DataFrame<
  Polars[2 x 2]
  strs string ["b", "c"]
  nums s64 [2, 4]

So note that in df1 column strs has value "c" repeated. Now I want to select the rows in df1 based on df2, but all conditions must match for each row of df2 so:

if strs is "b", AND nums is 2, then return matching rows from df1
OR
if strs is "c" AND nums is 4, then return matching rows from df1

So the result should be the first and fourth rows of df1.

Basically the question is how do I do grid-based conditions select with AND in rows dimension and OR in the columns.

I have tried zipping the rows of df2 along the following lines, with no success:

iex(138)> dupz
[
  {"720186AN5", 31665, 20240620},
  {"428236BR3", 55, 20240614},
  {"25470DBJ7", 48, 20240611},
  {"Y5S5CGAM4", 236, 20240702},
  {"G84228EP9", 31664, 20240620},
  {"06050WFJ9", 66879, 20240612},
  {"097023CX1", 49, 20240611},
  {"437076DC3", 114644, 20240625},
  {"437076DC3", 7736, 20240625},
  {"M9869GV65", 41446, 20240621},
  {"M9869GV65", 7885, 20240624},
  {"M9869GV65", 7886, 20240624},
  {"M9869GV65", 1595, 20240625},
...]

iex(139)> dupz0 = Enum.at(dupz, 0)
{"720186AN5", 31665, 20240620}
iex(140)> DF.filter(df, {col("CUSIP"), col("MSN"), col("Settlement_date")} == ^dupz0)
** (ArgumentError) cannot invoke Explorer.Series.equal/2 with mismatched dtypes: {#Explorer.Series<
   LazySeries[???]
   string (column("CUSIP"))
 >, #Explorer.Series<
   LazySeries[???]
   s64 (column("MSN"))
 >, #Explorer.Series<
   LazySeries[???]
   s64 (column("Settlement_date"))
 >} and {"720186AN5", 31665, 20240620}
    (explorer 0.8.2) lib/explorer/series.ex:6587: Explorer.Series.dtype_mismatch_error/3
    iex:140: (file)
    iex:140: (file)

(and in the above obviously I’d map through a bunch of tuples for each “row”, but that’s moot because it gives me an error)

What is the best way to achieve this?

vegabook · July 27, 2024, 1:30pm

Lol I think I realised this is basically an inner join.

Leaving it open for a few hours in case I’m wrong.