I was playing around with some real-life solutions for 1BRC using Elixir and external libraries such as Flow and Explorer. Explorer's performance is great, but it gets more than 10x worse when going from 500 million lines (or fewer) to 1 billion lines.
Here’s the code with some comments:
defmodule WithExplorer do
  # Results
  # [
  #   1_000_000_000: 675483.000ms,
  #   500_000_000: 58244.713ms,
  #   100_000_000: 10321.046ms,
  #   50_000_000: 5104.949ms,
  # ]
  require Explorer.DataFrame
  alias Explorer.{DataFrame, Series}

  @filename "./data/measurements.txt"

  def run() do
    results =
      @filename
      |> DataFrame.from_csv!(header: false, delimiter: ";", eol_delimiter: "\n")
      |> DataFrame.group_by("column_1")
      |> DataFrame.summarise(
        min: Series.min(column_2),
        mean: Series.mean(column_2),
        max: Series.max(column_2)
      )
      |> DataFrame.arrange(column_1)

    # for idx <- 0..(results["column_1"] |> Series.to_list() |> length() |> Kernel.-(1)) do
    #   "#{results["column_1"][idx]}=#{results["min"][idx]}/#{:erlang.float_to_binary(results["mean"][idx], decimals: 2)}/#{results["max"][idx]}"
    # end
  end
end
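As a side note, the commented-out loop could probably be written without indexing into each series by position. This is only a sketch in plain Elixir: the module name `FormatResults` is made up, it assumes the four columns were already turned into lists (for example with `Series.to_list/1`), and the sample values in the usage line are invented.

```elixir
defmodule FormatResults do
  # Expects four plain lists of equal length (station names, mins, means, maxes).
  # Zipping them walks every list once instead of looking up each index.
  def format(names, mins, means, maxes) do
    [names, mins, means, maxes]
    |> Enum.zip()
    |> Enum.map(fn {name, min, mean, max} ->
      # Same output shape as the commented-out loop: name=min/mean/max,
      # with the mean printed with two decimals.
      "#{name}=#{min}/#{:erlang.float_to_binary(mean, decimals: 2)}/#{max}"
    end)
  end
end

# Made-up sample row, just to show the call.
FormatResults.format(["Oslo"], [-3.2], [5.0], [12.5])
# => ["Oslo=-3.2/5.00/12.5"]
```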
What I observe is that the CPUs stay busy but are not fully utilized, and suddenly a lot of disk I/O shows up. I have some idea of what might be happening, and I wonder if there is a way to control this behavior from the high-level API or by compiling Explorer with some Polars-specific options.
Probably the data no longer fits in memory and then it is using disk swap? If that’s the case, that’s happening at the operating system level, so there isn’t much to control.
However, you can pass the :lazy option to from_csv and then call collect to perform the whole operation at once. It should go easier on memory usage.
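For reference, a minimal sketch of what the lazy variant might look like, assuming the same pipeline as the original post with only `lazy: true` and a final `collect` added. The module name `WithExplorerLazy` is made up for illustration, and whether every step is supported on a lazy frame may depend on the Explorer version, so treat it as untested against the full 1BRC file.

```elixir
# Assumes Explorer is already a dependency, in the same version as the original post.
defmodule WithExplorerLazy do
  require Explorer.DataFrame
  alias Explorer.{DataFrame, Series}

  # Same path as the eager example.
  @filename "./data/measurements.txt"

  def run(filename \\ @filename) do
    filename
    # lazy: true returns a lazy data frame: the steps below only build a query.
    |> DataFrame.from_csv!(header: false, delimiter: ";", eol_delimiter: "\n", lazy: true)
    |> DataFrame.group_by("column_1")
    |> DataFrame.summarise(
      min: Series.min(column_2),
      mean: Series.mean(column_2),
      max: Series.max(column_2)
    )
    |> DataFrame.arrange(column_1)
    # collect/1 runs the whole query at once and returns an eager data frame.
    |> DataFrame.collect()
  end
end
```

Calling `WithExplorerLazy.run()` should return the same summary as the eager version, just computed in one go at the `collect` step.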
Thank you, José. Enabling :lazy cut the time in half. Your suggestion also made me read the docs with more attention and I found that I could set the floats to f32 instead of using f64, which had been automatically inferred.
This made the computation light enough to fit in memory and go even faster, regardless of lazy mode.
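A rough back-of-the-envelope calculation shows why the float width matters at this scale. The sketch below only counts the raw value buffer of the temperature column (rows times bytes per value). It ignores the station-name column and any intermediate buffers, so it is a lower bound and not a measurement of what Polars actually allocates. The module name `ColumnSize` is made up, and the `dtypes` form shown in the comment is my assumption of how the f32 override is passed, so check it against the docs for your Explorer version.

```elixir
defmodule ColumnSize do
  @rows 1_000_000_000

  # Raw size in bytes of one float column: rows * (bits / 8).
  def bytes(bits, rows \\ @rows) when bits in [32, 64], do: rows * div(bits, 8)

  # Same figure in decimal gigabytes (10^9 bytes).
  def gigabytes(bits, rows \\ @rows), do: bytes(bits, rows) / 1_000_000_000
end

ColumnSize.gigabytes(64)
# => 8.0
ColumnSize.gigabytes(32)
# => 4.0

# Assumed way to request f32 when reading (not taken from the original post):
#   DataFrame.from_csv!(filename, header: false, delimiter: ";",
#     dtypes: [{"column_2", {:f, 32}}])
```

So the temperature column alone drops from about 8 GB to about 4 GB, which fits with the f32 runs staying in memory on a 32 GB machine.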
Results:
Reading and aggregating 1 billion lines with Explorer:
- Eager (f64): 675483.00ms
- Lazy (f64): 389491.00ms
- Lazy (f32): 53575.23ms
- Eager (f32): 55091.87ms
You got to love a forum that you can just happen to stumble into and read a post about making “checks notes” … a billion row csv “checks notes again”… run faster.
Good old gaming computer, but I’d love to try on the M1 too.
OS Name: Microsoft Windows 11 Home (WSL)
System Model: X570 AORUS ELITE WIFI
Processor: AMD Ryzen 9 3900X 12-Core Processor, 3801 MHz, 12 Core(s), 24 Logical Processor(s)
Installed Physical Memory (RAM): 32.0 GB