For reference, your original code took 1:54 total on my machine.
Simply running
IO.stream(:stdio, :line)
|> Stream.flat_map(&(String.split(&1)))
|> Stream.run()
takes around 35s on my machine. So just reading from IO and splitting is already slower than the full ruby implementation.
Switching to IO.binstream
speeds it up from 35 to 25s , so that’s one step. The difference is that we don’t have UTF8 support, which I assume most of the implementations don’t bother with anyway.
Next lets take a look at @josevalim’s suggestion of using ETS
# Create the ETS table
:ets.new(:words, [{:write_concurrency, true}, :named_table])
# setup input stream for reading and tokenisation
IO.binstream(:stdio, :line)
|> Stream.flat_map(&String.split(&1))
# Insert and update counters for each word in table
|> Enum.each(fn word -> :ets.update_counter(:words, word, {2, 1}, {word, 0}) end)
# Read all rows from the table
:ets.match_object(:words, {:"$0", :"$1"})
|> Enum.sort(fn {_, a}, {_, b} -> b < a end)
# print it beautifully
|> Enum.each(fn {word, count} ->
IO.puts(String.pad_leading(Integer.to_string(count), 8) <> " " <> word)
end)
This runs in 42s on my machine, so we’ve made a big improvement from 114s. We can shave another 2s off by replacing the last line with building an iolist.
|> Enum.map(fn {word, count} ->
[String.pad_leading(Integer.to_string(count), 8), " ", word, "\n"] end)
|> IO.puts
But we’re still not really using the concurrency available to Elixir. @kokolegorille suggested using Flow, which lets us easily run on more than one core. Using almost the exact example posted in the documentation for Flow I ran
IO.binstream(:stdio, :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split(&1))
|> Flow.partition()
|> Flow.reduce(fn -> %{} end, fn word, acc ->
Map.update(acc, word, 1, & &1 + 1)
end)
|> Enum.sort(fn {_, a}, {_, b} -> b < a end)
|> Enum.map(fn {word, count} ->
[String.pad_leading(Integer.to_string(count), 8), " ", word, "\n"] end)
|> IO.puts
which finishes in 21s. That means we’ve caught up with ruby. But I dropped ETS, so lets bring it back.
:ets.new(:words, [{:write_concurrency, true}, :named_table, :public])
IO.binstream(:stdio, :line)
|> Flow.from_enumerable()
|> Flow.flat_map(&String.split(&1))
|> Flow.partition()
|> Flow.each(fn word ->
:ets.update_counter(:words, word, {2, 1}, {word, 0})
end)
|> Flow.run()
:ets.match_object(:words, {:"$0", :"$1"})
|> Enum.sort(fn {_, a}, {_, b} -> b < a end)
|> Enum.map(fn {word, count} ->
[String.pad_leading(Integer.to_string(count), 8), " ", word, "\n"] end)
|> IO.puts
This doesn’t run any faster than the previous Flow version, still clocking in at 21s (saw some variance, maybe 20-22s, not surprising since I’m using around 7/8 virtual cores and I have lots of apps running). I’m not really sure why it’s not faster, considering we’re avoiding updating a Map
. I’m curious if there’s a smarter way of reading the ETS table, maybe even in a sorted fashion?
Finally, the original post claims to allow VMs startup time before measuring, which your examples of using time
does not. Timing the code from start to finish gives me an actual time of at least 0.6s less than what time
reports. If you want to get more accurate than that you’ll need to use a proper testing harness like https://github.com/bencheeorg/benchee
ps fun note about ETS, without {:write_concurrency, true}
the final version takes 120s instead of 21s
pps noticed I forgot an unnecessary .partition
in the final version. Removing it yields a run time of ~17s.