My Elixir implementation of the 1 billion row challenge

Looking at the GitHub repo, I think you may have changed some other parts of the code as well, which could have minimized the impact of this one change. I just re-ran the benchmark on my machine using an ETS implementation and got results similar to those I posted above:

# Build an ETS table holding the same entries as the map from above
:ets.new(:sample, [:named_table, :public])
map |> Enum.each(fn {k, v} -> :ets.insert(:sample, {k, v}) end)

# fun3 does the same lookups as fun and fun2 from my earlier post, but against ETS
fun3 = fn n -> 1..n |> Enum.each(fn i -> :ets.lookup(:sample, rem(i, 8)) end) end

# Average microseconds for 100_000 lookups, over 100 runs
1..100 |> Enum.map(fn _ -> :timer.tc(fn -> fun3.(100_000) end) end) |> Enum.map(&elem(&1, 0)) |> Enum.sum() |> div(100)
# 131589
1..100 |> Enum.map(fn _ -> :timer.tc(fn -> fun2.(100_000) end) end) |> Enum.map(&elem(&1, 0)) |> Enum.sum() |> div(100)
# 139659
1..100 |> Enum.map(fn _ -> :timer.tc(fn -> fun.(100_000) end) end) |> Enum.map(&elem(&1, 0)) |> Enum.sum() |> div(100)
# 152674

Also be aware that while ETS lookups are fast (O(1), as you say) and message passing is very efficient on the BEAM, making a call into another process can still be slower than accessing a data structure in the same process. ETS is generally the better choice for very large collections and for data that needs to be accessed from multiple concurrent processes. In this case your worker_pool is never going to be that large. Whether sharing a single ETS table across multiple processes is more efficient than building the worker_pool map in each process, I can't say for sure, but it probably is, given how many processes you could end up spawning to access it. This thread has a good discussion of the tradeoffs between maps and ETS tables.
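To make the point about process calls concrete, here is a minimal sketch (names are illustrative) that reads the same map once in-process and once through an Agent, the Agent standing in for any call into another process:

# Reuses the `map` from the benchmark above; the Agent just wraps it
{:ok, agent} = Agent.start_link(fn -> map end)

in_process = fn n -> 1..n |> Enum.each(fn i -> Map.get(map, rem(i, 8)) end) end

# Each Agent.get is a message round-trip to the agent process
via_agent = fn n -> 1..n |> Enum.each(fn i -> Agent.get(agent, &Map.get(&1, rem(i, 8))) end) end

:timer.tc(fn -> in_process.(100_000) end)
:timer.tc(fn -> via_agent.(100_000) end)

I would expect the Agent version to come out noticeably slower even though both lookups are cheap, simply because of the per-call messaging overhead.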

Rather than focusing on this tiny worker_pool data structure, I think you should look at processing all the lines directly into an ETS table, rather than creating a bunch of maps stored in Agents that you have to merge later. Create Tasks that process lines into the ETS table, then at the end pull the table data for your final output. If you set up your ETS table as an ordered_set, you can probably avoid the step where you sort N items, where N is the number of cities in the data set. You might even look at DETS to avoid running out of memory, given the large data set.
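Here is a rough sketch of that shape. Treat it as a sketch, not a drop-in: the module and function names are made up, the chunk size is untuned, and it assumes the 1BRC line format of city;temperature with one decimal place. To keep concurrent writes race-free, each Task aggregates its chunk into a local map first and then inserts under a {city, ref} key that only it owns; the ordered_set still yields cities in sorted order, so the partials can be merged in a single pass with no explicit sort:

defmodule OneBrc do
  def run(path) do
    table = :ets.new(:stations, [:ordered_set, :public, write_concurrency: true])

    path
    |> File.stream!()
    |> Stream.chunk_every(10_000)
    |> Task.async_stream(&process_chunk(&1, table), ordered: false, timeout: :infinity)
    |> Stream.run()

    report(table)
  end

  # Fold the chunk into a local map, then write the partial results under
  # {city, ref} keys so that no two tasks ever write to the same key.
  defp process_chunk(lines, table) do
    partial =
      Enum.reduce(lines, %{}, fn line, acc ->
        # Assumes "city;temperature\n" with one decimal place, as in 1BRC
        [city, temp] = line |> String.trim_trailing() |> String.split(";")
        t = String.to_float(temp)

        Map.update(acc, city, {t, t, t, 1}, fn {mn, mx, sum, n} ->
          {min(mn, t), max(mx, t), sum + t, n + 1}
        end)
      end)

    ref = make_ref()
    Enum.each(partial, fn {city, stats} -> :ets.insert(table, {{city, ref}, stats}) end)
  end

  # :ets.tab2list walks an ordered_set in key order, so each city's
  # partials come out adjacent; one chunk_by/reduce pass replaces the sort.
  defp report(table) do
    table
    |> :ets.tab2list()
    |> Enum.chunk_by(fn {{city, _ref}, _stats} -> city end)
    |> Enum.map(fn [{{city, _}, first} | rest] ->
      {mn, mx, sum, n} =
        Enum.reduce(rest, first, fn {_, {mn2, mx2, s2, n2}}, {mn1, mx1, s1, n1} ->
          {min(mn1, mn2), max(mx1, mx2), s1 + s2, n1 + n2}
        end)

      "#{city}=#{mn}/#{Float.round(sum / n, 1)}/#{mx}"
    end)
    |> Enum.join(", ")
  end
end

One caveat on DETS: it has no ordered_set table type, so if you swap to DETS for memory reasons you give the sort-avoidance trick back up.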
