Does Explorer have the Polars `gather_every` function and if not, how do I achieve the same thing?

vegabook · May 31, 2024, 9:01am

I have 40 CSVs of 120k rows each which I want to concatenate into a single DataFrame. But these won’t fit into my 32GB of memory, so I want to take every 10th row from each input DataFrame. How do I do that? Polars has gather_every (polars.DataFrame.gather_every in the Polars documentation), but Explorer doesn’t seem to have it.

def mktax_all(skiprate \\ 10) do
  Ftp.local_paths_ls(:marketaxess)
  |> Enum.take(3)
  |> Enum.filter(fn x -> String.contains?(x, "mabond") end)
  |> Enum.map(&read_csv/1)
  |> Enum.filter(fn df -> DF.n_rows(df) > 0 end)
  |> Enum.map(fn df -> DF.filter_with(df,
      &Explorer.Series.equal(&1["SETTLEMENTDATE"], Explorer.Series.mode(df["SETTLEMENTDATE"])[0])) end)
  |> Enum.map(fn df -> DF.gather_every(df, skiprate) end)

EDIT

messy:

  |> Enum.map(fn df -> DF.slice(df, Enum.map(0..(Kernel.div(DF.n_rows(df), skiprate) - 1),
    fn x -> x * skiprate end)) end)

03juan · May 31, 2024, 1:38pm

The chain of Enum calls greedily creates a lot of intermediate data that you then end up dropping before the DF operations.

Consider using a lazy Stream so that each file is processed one at a time up to the filter_with step, and then join the results together at the end.

Ftp.local_paths_ls(:marketaxess)
|> Stream.take(3)
|> Stream.filter(fn x -> String.contains?(x, "mabond") end)
|> Stream.map(&read_csv/1)
|> Stream.filter(fn df -> DF.n_rows(df) > 0 end)
|> Enum.map(fn df ->
      DF.filter_with(df,
         &Explorer.Series.equal(&1["SETTLEMENTDATE"], 
           Explorer.Series.mode(df["SETTLEMENTDATE"])[0]))
   end)
|> DF.concat_rows()
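
To make the eager vs. lazy difference concrete, here is a minimal sketch in plain Elixir with no Explorer involved. The `load` function and the file names are made up for illustration; `load` just stands in for read_csv/1 and prints when it runs, so you can see how much work each version does.

```elixir
# Hypothetical stand-in for read_csv/1: prints when it runs so the
# evaluation order is visible, and returns a dummy value.
load = fn path ->
  IO.puts("loading #{path}")
  String.length(path)
end

paths = ["a.csv", "bb.csv", "ccc.csv"]

# Eager: Enum.map/2 runs all three loads and builds the full
# intermediate list before Enum.take/2 keeps one element.
eager = paths |> Enum.map(load) |> Enum.take(1)

# Lazy: Stream.map/2 does nothing on its own. Enum.take/2 pulls one
# element through the pipeline, so only the first load runs.
lazy = paths |> Stream.map(load) |> Enum.take(1)
```

Both versions return the same result, but the eager one prints three "loading" lines and the lazy one prints a single line.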

As for your “messy” slicing (if it’s even necessary with the streams approach): DF.slice/2 accepts an Elixir Range struct, which is lazy, so you could do:

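|> Enum.map(fn df ->
      # Keep only the rows whose SETTLEMENTDATE equals the most common date
      filtered =
        DF.filter_with(df,
          &Explorer.Series.equal(&1["SETTLEMENTDATE"],
            Explorer.Series.mode(df["SETTLEMENTDATE"])[0]))

      # DFs are 0-indexed, so the last valid index is n_rows - 1.
      # Count the rows after filtering so the range stays in bounds.
      range = 0..(DF.n_rows(filtered) - 1)//skiprate

      DF.slice(filtered, range)
    end)
|> DF.concat_rows()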
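
If you want something closer to gather_every, the index arithmetic can be wrapped in a small helper. This is only a sketch: the `GatherEvery` module and its `indices/3` function are hypothetical names, not part of Explorer, and the `offset` argument is my assumption about what the Polars version offers. It only builds the stepped range, so it runs in plain Elixir.

```elixir
defmodule GatherEvery do
  @moduledoc "Hypothetical helper for illustration; not part of Explorer."

  # Returns the 0-based indices of every `n`-th row, starting at `offset`,
  # for a frame with `n_rows` rows. The result is a stepped range, so no
  # list of indices is built until something enumerates it.
  def indices(n_rows, n, offset \\ 0)
      when is_integer(n_rows) and n_rows >= 0 and is_integer(n) and n > 0 and
             is_integer(offset) and offset >= 0 do
    offset..(n_rows - 1)//n
  end
end

# Check the arithmetic for a 25-row frame, keeping every 10th row:
GatherEvery.indices(25, 10) |> Enum.to_list() |> IO.inspect()
# prints [0, 10, 20]

# With Explorer loaded (assuming DF.slice/2 accepts stepped ranges in your
# version), the call would look like:
#   DF.slice(df, GatherEvery.indices(DF.n_rows(df), skiprate))
```

An empty frame gives the empty range `0..-1//n`, so the helper has no special case for zero rows.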

vegabook · June 1, 2024, 9:45am

I particularly love the // thing; I didn’t realise that was possible. And yeah, I don’t need the skiprate anymore anyway, but it’s nice to know.

vegabook · June 2, 2024, 10:27pm

Feel compelled to say, as an R and Python data science expert, that my initial explorations of Explorer have been incredibly pleasant.

I was not expecting anything like this kind of performance, and the ergonomics are excellent: far better than Pandas, and a proper rival for R (which is built from the ground up for wrangling). Excellent job so far.

The documentation is super helpful too, and it’s only an h away in the REPL, which is great.