List operation `sort` in Explorer

toodle · May 29, 2024, 11:41pm

I have an Explorer dataframe with a list-typed column ({:list, {:s, 64}}), and I'd like to mutate the dataframe to add a column that contains the sorted version of each list.

iex> df_with_list_col = DF.new(%{
       list_col: [[1, 5, 3], [1, 1, 2, 0]], 
       regular_col: ["regular", "column"]
     })
iex> DF.print(df_with_list_col)
"""
+-------------------------------------------+
| Explorer DataFrame: [rows: 2, columns: 2] |
+---------------------+---------------------+
|     regular_col     |      list_col       |
|      <string>       |     <list[s64]>     |
+=====================+=====================+
| regular             | 1                   |
|                     | 5                   |
|                     | 3                   |
+---------------------+---------------------+
| column              | 1                   |
|                     | 1                   |
|                     | 2                   |
|                     | 0                   |
+---------------------+---------------------+
"""

I wanted to reach for an Explorer.Series list op, since Polars appears to have what I want, but it seems to be missing from the Explorer API. I was expecting to be able to do something like:

iex> DF.mutate(df_with_list_col, list_col_sorted: sort(list_col))
     |> DF.print()
"""
+----------------------------------------------------------------------+
|             Explorer DataFrame: [rows: 2, columns: 3]                |
+---------------------+---------------------+--------------------------+
|     regular_col     |      list_col       |      list_col_sorted     |
|      <string>       |     <list[s64]>     |        <list[s64]>       |
+=====================+=====================+==========================+
| regular             | 1                   | 1                        |
|                     | 5                   | 3                        |
|                     | 3                   | 5                        |
+---------------------+---------------------+--------------------------+
| column              | 1                   | 0                        |
|                     | 1                   | 1                        | 
|                     | 2                   | 1                        |
|                     | 0                   | 2                        | 
+---------------------+---------------------+--------------------------+
"""

Is there perhaps a different recommended way of doing this? I suppose there is one hint here in this open issue, but I’m wondering if there is something much simpler and more obvious that I’m just missing.

billylanchantin · May 30, 2024, 3:55pm

Hi, @toodle!

Unfortunately no, you're not missing anything. Our support for the list column type is fairly new, and we haven't yet exposed all of the related Polars functionality.

Using the same method as the hint in the PR, this should work for now:

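# sort_by/2 and summarise/2 are macros, so Explorer.DataFrame must be required.
require Explorer.DataFrame, as: DF

df_with_list_col
# One row per list element.
|> DF.explode(:list_col)
|> DF.sort_by(list_col)
# Grouping by regular_col only works if its values are unique per row.
|> DF.group_by(:regular_col)
# Collect the sorted elements back into one list per group.
|> DF.summarise(list_col: list_col)
|> DF.print()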
# +-------------------------------------------+
# | Explorer DataFrame: [rows: 2, columns: 2] |
# +---------------------+---------------------+
# |     regular_col     |      list_col       |
# |      <string>       |     <list[s64]>     |
# +=====================+=====================+
# | column              | 0                   |
# |                     | 1                   |
# |                     | 1                   |
# |                     | 2                   |
# +---------------------+---------------------+
# | regular             | 1                   |
# |                     | 3                   |
# |                     | 5                   |
# +---------------------+---------------------+
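One caveat with the pipeline above: it groups by regular_col, so two rows that share the same regular_col value would have their lists merged into one. A possible variant, sketched below, adds a temporary row id and groups by that instead. The row_id column name, the list_col_sorted name, the sample data, and the Explorer version in Mix.install are assumptions made for illustration, not part of the original answer.

```elixir
# Standalone script sketch; the Explorer version is an assumption.
Mix.install([{:explorer, "~> 0.8"}])

require Explorer.DataFrame, as: DF
alias Explorer.Series

# Both rows share the same regular_col value on purpose.
df_with_list_col =
  DF.new(%{
    list_col: [[1, 5, 3], [1, 1, 2, 0]],
    regular_col: ["same", "same"]
  })

# Hypothetical helper column: one unique id per original row.
row_ids = Series.from_list(Enum.to_list(0..(DF.n_rows(df_with_list_col) - 1)))

df_sorted =
  df_with_list_col
  |> DF.put(:row_id, row_ids)
  |> DF.explode(:list_col)
  # Sort by row first, then by element, so each row's elements end up ordered.
  |> DF.sort_by(asc: row_id, asc: list_col)
  |> DF.group_by(:row_id)
  # Keep regular_col via first/1 and collect the elements back into a list.
  |> DF.summarise(regular_col: first(regular_col), list_col_sorted: list_col)

DF.print(df_sorted)
```

Note that this sketch does not say anything about how empty or nil lists behave after explode; that would need checking separately.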

Basically the current workaround is: explode → operate on exploded column → group_by → summarise.
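If the column is small enough to pull into Elixir, another option is to skip the dataframe operations entirely: convert the series to nested lists, sort each one with Enum.sort/1, and put the result back as a new column. This is only a sketch under that assumption; it leaves Polars for the sorting step, so it will be slower on large data, and it assumes the column contains no nil entries. The list_col_sorted name and the Explorer version in Mix.install are illustrative choices.

```elixir
# Standalone script sketch; the Explorer version is an assumption.
Mix.install([{:explorer, "~> 0.8"}])

alias Explorer.DataFrame, as: DF
alias Explorer.Series

df_with_list_col =
  DF.new(%{
    list_col: [[1, 5, 3], [1, 1, 2, 0]],
    regular_col: ["regular", "column"]
  })

# Pull the list column into Elixir, sort each inner list, rebuild a series.
sorted =
  df_with_list_col
  |> DF.pull(:list_col)
  |> Series.to_list()
  |> Enum.map(&Enum.sort/1)
  |> Series.from_list()

# Attach the sorted lists as a new column; row order is unchanged.
df_sorted = DF.put(df_with_list_col, :list_col_sorted, sorted)

DF.print(df_sorted)
```

Unlike the explode-based workaround, this keeps the original list_col and the original row order, since nothing is grouped or re-sorted at the dataframe level.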