I have 40 CSVs of ~120k rows each which I want to concatenate into a single DataFrame. They won't fit into my 32GB of memory, so I want to take every 10th row from each input DataFrame. How do I do that? Polars has gather_every (polars.DataFrame.gather_every — Polars documentation), but Explorer doesn't seem to have it.
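Roughly the behaviour I'm after, sketched with an explicit index list (df here stands for any one of the per-file DataFrames; building index lists by hand like this is what I'd like to avoid):

# every 10th row, starting from row 0
indices = Enum.take_every(0..(DF.n_rows(df) - 1), 10)
every_tenth = DF.slice(df, indices)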
def mktax_all(skiprate \\ 10) do
  Ftp.local_paths_ls(:marketaxess)
  |> Enum.take(3)
  |> Enum.filter(fn x -> String.contains?(x, "mabond") end)
  |> Enum.map(&read_csv/1)
  |> Enum.filter(fn df -> DF.n_rows(df) > 0 end)
  |> Enum.map(fn df ->
    DF.filter_with(df,
      &Explorer.Series.equal(&1["SETTLEMENTDATE"],
        Explorer.Series.mode(df["SETTLEMENTDATE"])[0]))
  end)
  # this is what I want, but DF.gather_every/2 doesn't exist
  |> Enum.map(fn df -> DF.gather_every(df, skiprate) end)
end
EDIT: I got it working with DF.slice/2, but it's messy:
|> Enum.map(fn df ->
  DF.slice(df,
    Enum.map(0..(Kernel.div(DF.n_rows(df), skiprate) - 1), fn x -> x * skiprate end))
end)
The chain of Enums is greedily creating a lot of intermediate data that you then end up dropping before the DF operations.
Consider using a lazy Stream to process each file up to the filter_with step one at a time, and then join the results together at the end.
Ftp.local_paths_ls(:marketaxess)
|> Stream.take(3)
|> Stream.filter(fn x -> String.contains?(x, "mabond") end)
|> Stream.map(&read_csv/1)
|> Stream.filter(fn df -> DF.n_rows(df) > 0 end)
|> Enum.map(fn df ->
DF.filter_with(df,
&Explorer.Series.equal(&1["SETTLEMENTDATE"],
Explorer.Series.mode(df["SETTLEMENTDATE"])[0]))
end)
|> DF.concat_rows()
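Because the final Enum.map realizes the stream, each file is read and filtered down to its modal settlement date before the next file is touched; DF.concat_rows/1 then stitches the resulting (much smaller) list of frames together at the end.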
As for your "messy" slicing (if it's even necessary with the streams approach): DF.slice/2 can enumerate a lazy Elixir Range struct, so you could do:
|> Enum.map(fn df ->
  filtered =
    DF.filter_with(df,
      &Explorer.Series.equal(&1["SETTLEMENTDATE"],
        Explorer.Series.mode(df["SETTLEMENTDATE"])[0]))

  # count rows *after* filtering, and stop at n_rows - 1:
  # DFs are 0-indexed, so 0..n_rows would overrun when skiprate divides n_rows
  n_rows = DF.n_rows(filtered)
  DF.slice(filtered, 0..(n_rows - 1)//skiprate)
end)
|> DF.concat_rows()
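The first..last//step syntax is Elixir's stepped range (available since Elixir 1.12), so the index list never has to be materialized by hand. A quick sanity check in IEx:

Enum.to_list(0..(10 - 1)//3)
#=> [0, 3, 6, 9]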
Particularly love the // thing, I didn't realise that was possible. And yeah, I don't need the skiprate anymore anyway, but it's nice to know.
Feel compelled to say, as an R and Python data science expert, my initial explorations of Explorer have been incredibly pleasant.
I was not expecting anything like this kind of performance, and the ergonomics are excellent: far better than Pandas, and a proper rival for R (which is built from the ground up for wrangling). Excellent job so far.
The documentation is super helpful too, and it's just an h away in the REPL, which is great.
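For anyone who hasn't tried it, the h helper in IEx prints a function's docs right in the shell, e.g.:

h Explorer.DataFrame.slice

It works for any module or function that ships documentation.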