I have a larger dataframe, and I want to select rows in it based on (a smaller subset of) columns and rows from another dataframe:
iex(134)> require Explorer.DataFrame, as: DF
Explorer.DataFrame
iex(135)> df1 = DF.new(strs: ["a", "b", "c", "c"], nums: [1, 2, 3, 4], names: ["joe", "alice", "vikas", "sofija"])
#Explorer.DataFrame<
Polars[4 x 3]
strs string ["a", "b", "c", "c"]
nums s64 [1, 2, 3, 4]
names string ["joe", "alice", "vikas", "sofija"]
>
iex(136)> df2 = DF.new(strs: ["b", "c"], nums: [2, 4])
#Explorer.DataFrame<
Polars[2 x 2]
strs string ["b", "c"]
nums s64 [2, 4]
So note that in df1
column strs
has value "c"
repeated. Now I want to select the rows in df1
based on df2
, but all conditions must match for each row of df2
so:
- if
strs
is"b"
, ANDnums
is2
, then return matching rows fromdf1
OR - if
strs
is"c"
ANDnums
is4
, then return matching rows fromdf1
So the result should be the first and fourth rows of df1
.
Basically the question is how do I do grid-based conditions select with AND in rows dimension and OR in the columns.
I have tried zipping the rows of df2
along the following lines, with no success:
iex(138)> dupz
[
{"720186AN5", 31665, 20240620},
{"428236BR3", 55, 20240614},
{"25470DBJ7", 48, 20240611},
{"Y5S5CGAM4", 236, 20240702},
{"G84228EP9", 31664, 20240620},
{"06050WFJ9", 66879, 20240612},
{"097023CX1", 49, 20240611},
{"437076DC3", 114644, 20240625},
{"437076DC3", 7736, 20240625},
{"M9869GV65", 41446, 20240621},
{"M9869GV65", 7885, 20240624},
{"M9869GV65", 7886, 20240624},
{"M9869GV65", 1595, 20240625},
...]
iex(139)> dupz0 = Enum.at(dupz, 0)
{"720186AN5", 31665, 20240620}
iex(140)> DF.filter(df, {col("CUSIP"), col("MSN"), col("Settlement_date")} == ^dupz0)
** (ArgumentError) cannot invoke Explorer.Series.equal/2 with mismatched dtypes: {#Explorer.Series<
LazySeries[???]
string (column("CUSIP"))
>, #Explorer.Series<
LazySeries[???]
s64 (column("MSN"))
>, #Explorer.Series<
LazySeries[???]
s64 (column("Settlement_date"))
>} and {"720186AN5", 31665, 20240620}
(explorer 0.8.2) lib/explorer/series.ex:6587: Explorer.Series.dtype_mismatch_error/3
iex:140: (file)
iex:140: (file)
(and in the above obviously I’d map through a bunch of tuples for each “row”, but that’s moot because it gives me an error)
What is the best way to achieve this?