Thank you all for the continued input. This is interesting! I formalized my repo to use Benchee so I could continue trying out some variants. Here are the results (so far):
Name                  ips    average    deviation    median    99th %
python               2.27     0.44 s      ±22.35%    0.45 s    0.64 s
:prim_file async     1.38     0.72 s      ±22.75%    0.63 s    1.04 s
Concurrent           0.63     1.59 s       ±8.90%    1.59 s    1.78 s
Split file           0.40     2.51 s      ±22.56%    2.35 s    3.30 s
Task.async_stream    0.32     3.16 s      ±21.99%    3.19 s    3.95 s
:prim_file           0.31     3.26 s      ±41.38%    2.82 s    5.19 s
File                 0.31     3.27 s      ±24.74%    3.48 s    4.00 s
Jsonrs               0.28     3.54 s      ±20.90%    3.56 s    4.27 s
Comparison:
python               2.27
:prim_file async     1.38 - 1.64x slower +0.28 s
Concurrent           0.63 - 3.59x slower +1.14 s
Split file           0.40 - 5.68x slower +2.07 s
Task.async_stream    0.32 - 7.16x slower +2.72 s
:prim_file           0.31 - 7.37x slower +2.81 s
File                 0.31 - 7.41x slower +2.83 s
Jsonrs               0.28 - 8.02x slower +3.10 s
In short, Python is still the fastest. The fastest Elixir solution (so far) is the one that combines Task.async_stream with :prim_file:
index_file
|> File.stream!()
|> Task.async_stream(fn line ->
  path = String.trim(line)
  {:ok, contents} = :prim_file.read_file(path)
  {:ok, %{"paths" => txt_paths}} = Jason.decode(contents)

  Enum.each(txt_paths, fn p ->
    :prim_file.read_file_info(p)
  end)
end)
|> Stream.run()
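To try different concurrency settings without touching the per-file work, the same pipeline shape can be wrapped so the work is passed in as a function. This is only a rough sketch: the module name `IndexWalk`, the `run/3` function, and the default options are made up for illustration and are not in my repo. `max_concurrency`, `ordered`, and `timeout` are standard `Task.async_stream` options.

```elixir
defmodule IndexWalk do
  # Hypothetical helper (not from the original repo).
  # `work` is any 1-arity function applied to each trimmed line of the index file.
  # Returns the number of lines that were processed.
  def run(index_file, work, opts \\ []) do
    # Assumed defaults: one task per scheduler, results in completion order.
    defaults = [max_concurrency: System.schedulers_online(), ordered: false]
    opts = Keyword.merge(defaults, opts)

    index_file
    |> File.stream!()
    |> Task.async_stream(fn line -> work.(String.trim(line)) end, opts)
    |> Enum.reduce(0, fn {:ok, _result}, count -> count + 1 end)
  end
end
```

A call such as `IndexWalk.run(index_file, fn path -> :prim_file.read_file(path) end, max_concurrency: 32)` would then exercise only the read step at a chosen concurrency level. Whether `ordered: false` or a higher `max_concurrency` helps for this workload is something the benchmark would have to show.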
I tried variants that used either Task.async_stream or :prim_file on its own, but they didn’t perform as well. Loading the file into memory instead of streaming it also didn’t perform as well. I haven’t been able to get jiffy working, so I gave jsonrs a try, but unfortunately it performed the worst of these (!!).
What is challenging here is that the solutions have very different performance characteristics: the Elixir variants range from 0.72 s to 3.54 s on average. In other words, it’s easy to fall into a hole here, so I’m hoping to identify patterns to avoid. I should probably try coming up with simpler use cases, because this one touches on a lot of things at once: streaming, checking the file system, and JSON decoding.
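As one example of a simpler use case, the file system check could be timed on its own, with no streaming and no JSON decoding involved. The sketch below is an assumption about how that might look, not code from my repo: `StatBench`, `time/2`, and `compare/1` are hypothetical names, and it uses `:timer.tc/1` for a single rough measurement rather than Benchee.

```elixir
defmodule StatBench do
  # Hypothetical micro-benchmark (not from the original repo).
  # Runs `fun` on every path and returns the elapsed time in microseconds.
  def time(paths, fun) do
    {micros, :ok} = :timer.tc(fn -> Enum.each(paths, fun) end)
    micros
  end

  # Times the same list of paths with File.stat/1 and :prim_file.read_file_info/1.
  def compare(paths) do
    %{
      file_stat: time(paths, &File.stat/1),
      prim_file: time(paths, &:prim_file.read_file_info/1)
    }
  end
end
```

Calling `StatBench.compare(paths)` with a list of path strings returns a map with one timing per approach. A single run like this is noisy, so it is only useful as a quick sanity check before putting the same two functions into a proper Benchee job.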