Having trouble splitting rows using Explorer.DataFrame

klonowsk · March 7, 2024, 6:55pm

Hi,

I’m having trouble figuring out how to take a dataframe, split one of the text cells based on a function, and create new rows with the results, with all other column data duplicated for each row.

I just can’t figure out how to convert the list of text fragments into something that can be expanded into additional rows.

My (naïve) stab at it looks as follows for now:

require Explorer.DataFrame, as: DF
require Explorer.Series, as: S

defmodule MyDF do
  def apply(df, column, new_column, func) do
    series = DF.pull(df, column)
    list = S.to_list(series)

    new_series =
      list
      |> Enum.map(&func.(&1))
      |> S.from_list()

    DF.put(df, new_column, new_series)
  end
end

df = DF.new(
  class: [1, 2, 1, 3],
  text: ["AAA, BBB", "CCC, DDD", "EEE", "FFF, GGG, HHH"]
)

df
|> MyDF.apply("text", "new_text", &String.split(&1, ","))

Resulting in:

explorer.DataFrame<
Polars[4 x 3]
class integer [1, 2, 1, 3]
text string [“AAA, BBB”, “CCC, DDD”, “EEE”, “FFF, GGG, HHH”]
new_text list[string] [
[“AAA”, " BBB"],
[“CCC”, " DDD"],
[“EEE”],
[“FFF”, …]
]

But I would like that list of new_text to be spread accross multiple rows.

I’m coming from R+ and tidy/dplyr, where I would do something simple like:

df %>%
  dplyr::rowwise() %>%
  mutate(new_text = parse_text(text))

Thank you

klonowsk · March 8, 2024, 8:50pm

Well, seems I figured it out. Perhaps not the most elegant way (if there is one, I’d love to see it). But here goes:

df
|> DF.to_rows_stream()
|> Stream.map(&(Map.pop!(&1, "text")))
|> Enum.into(
    [],
    fn {text, map} -> for y <- String.split(text, ", "), do: Map.put(map, "new_text", y)
    end)
|> List.flatten()
|> DF.new()

Explorer.DataFrame.to_rows_stream() to the rescue. We stream over the rows, Map.pop! the value from column of interest (“text” in this case), leaving the remainder of the columns (in this case only the one extra column - “class”) as a map. Then we split “text” into substrings combine each with the rest of the columns (map), and using Enum.into place each of these new maps into an empty list, flatten the list of lists, and finally feed it into a new dataframe. Which results in exactly what I was looking for:

explorer.DataFrame<
Polars[8 x 3]
class integer [1, 1, 2, 2, 1, …]
new_text string [“AAA”, “BBB”, “CCC”, “DDD”, “EEE”, …]

Wee!

billylanchantin · March 8, 2024, 9:54pm

Hi @klonowsk,

If I’ve followed what you’re trying to achieve, I think this works:

require Explorer.DataFrame, as: DF

df = DF.new(
  class: [1, 2, 1, 3],
  text: ["AAA, BBB", "CCC, DDD", "EEE", "FFF, GGG, HHH"]
)

df
|> DF.mutate(text: split(text, ", "))
|> DF.explode("text")
# #Explorer.DataFrame<
#   Polars[8 x 2]
#   class s64 [1, 1, 2, 2, 1, ...]
#   text string ["AAA", "BBB", "CCC", "DDD", "EEE", ...]
# >

klonowsk · March 9, 2024, 7:24pm

Definitely a cleaner solution. Requires Explorer >= 0.8. I was still on 0.7.2. Now upgraded everything to the latest stable versions, including all other Mix libraries as well as erlang and elixir (using asdf). However, now I’m getting the following error with DF.mutate(text: split(text, ", ")).

** (ErlangError) Erlang error: :nif_not_loaded
:erlang.nif_error(:nif_not_loaded)
(explorer 0.8.1) lib/explorer/polars_backend/native.ex:458: Explorer.PolarsBackend.Native.err/0
(explorer 0.8.1) lib/explorer/polars_backend/data_frame.ex:625: anonymous fn/2 in Explorer.PolarsBackend.DataFrame.mutate_with/3
(elixir 1.16.1) lib/enum.ex:2528: Enum.“-reduce/3-lists^foldl/2-0-”/3
(explorer 0.8.1) lib/explorer/polars_backend/data_frame.ex:623: Explorer.PolarsBackend.DataFrame.mutate_with/3
#cell:ltmduarooibwjsyo:11: (file)

I also tried DF.mutate(text: Explorer.Series.split(text, ", ")) and DF.mutate_with(&[text: S.split(&1["text"], ", ")]). Same error.

Like I mentioned, I’ve upgraded everything to the latest stable versions. No go. Any ideas?

bdarla · March 9, 2024, 8:16pm

I think you missed the caret symbol ^ at text (see below).

df = DF.new(
  class: [1, 2, 1, 3],
  text: ["AAA, BBB", "CCC, DDD", "EEE", "FFF, GGG, HHH"]
)
df
|> DF.mutate(text: split(^text, ", "))
|> DF.explode("text")

explorer.DataFrame<
Polars[8 x 2]
class s64 [1, 1, 2, 2, 1, …]
text string [“AAA”, “BBB”, “CCC”, “DDD”, “EEE”, …]

klonowsk · March 9, 2024, 8:24pm

I thought the caret is used to refer to a variable declared elsewhere. Here “text” is the name of one of the columns in the df. Either way, this doesn’t work:

error: undefined variable “text”

bdarla · March 9, 2024, 9:36pm

Indeed, it seems that a previous cell in my Livebook allowed the code to run. Please, try the following:

Mix.install([
  {:axon, "~> 0.6"}, 
  {:nx, "~> 0.7"}, 
  {:explorer, "~> 0.8"}, 
  {:kino, "~> 0.12"}
])

require Explorer.DataFrame, as: DF
require Explorer.Series, as: S

text_list = ["AAA, BBB", "CCC, DDD", "EEE", "FFF, GGG, HHH"]
text = S.from_list(text_list)
df = DF.new(
class: [1, 2, 1, 3],
text: text_list
)
split_list = S.split(text, ", ")
df
|> DF.mutate(text: ^split_list)
|> DF.explode(:text)

billylanchantin · March 9, 2024, 10:25pm

Whoops! I forgot there was a bug with split which was just fixed last week:

Add native expression for Series.split/2 by H12 · Pull Request #875 · elixir-explorer/explorer · GitHub

I was on main locally so I didn’t see it. To workaround, you’ll need to do a pipeline like this:

df
|> DF.put("text", S.split(df["text"], ", "))
|> DF.explode("text")

(@bdarla’s approach works too.)

Or you can use main until the next release. Sorry about that!

klonowsk · March 11, 2024, 1:42pm

Yes, all the above solutions work, and yes, I fell for the S.split() issue, however, sourcing from git:main results in far too many dependencies braking, so DF.put("text", S.split(df["text"], ", ")) works just fine for this toy example as well as for my needs as I have my own somewhat more complex text splitting function, which I just modified to return a series lists of the split texts to feed into DF.put(). The more relevant function is DF.explode(), which was missing from the version of the Explorer library I was using. So thank you!