Erlang/Elixir string performance - can this be improved?

evadne · April 11, 2019, 12:13am

Hi, a friend of mine sent this over, so I gave it a try: https://gist.github.com/evadne/43e79dd51c068dc19e42086cdbecc8a7

FWIW

$ curl http://teralink.net/\~serpent/words.txt > words.txt

$ md5 words.txt 
MD5 (words.txt) = 640ef9e082ef75ef07f0e9869e9d8ae2

$du -h words.txt
217M	words.txt
Execution

$ elixir ./count.ex ./words.txt --sort > out-sorted.txt
Started
Using 4 Workers
Processing: 4.748574s
Load Results: 0.820397s
Order Results: 1.99405s
Print Results: 3.136065s
Total Runtime: 10.710908s

$ elixir ./count.ex ./words.txt > out.txt
Started
Using 4 Workers
Processing: 4.944752s
Load Results: 0.836416s
Print Results: 2.556994s
Total Runtime: 8.343633s

$ wc -l out*
  557745 out-sorted.txt
  557745 out.txt
 1115490 total

$ head -n 5 out-sorted.txt
1033175 0
103404 00
353 000
752 0000
19 00000

Please let me know your thoughts. Thanks again!

jola · April 11, 2019, 8:14am

Nice! That’s a really cool solution, and probably scales well! It fails the first requirement though, the input should be piped in. Reading from a file is faster than reading from stdin in general (eg replace IO.binstream with File.read and it runs faster) and gives you much more control, like you show in your code.

You can speed up your version quite a bit if you switch to iolists, instead of printing line by line.

I actually ended up writing an article about this script, with my final version single threaded (13 lines of code) running in 13 seconds https://blog.jola.dev/elixir-string-processing-optimization

alvises · April 11, 2019, 7:20pm

Hi @jola, did you try getting rid of the :line option and processing chunk by chunk? What I’ve seen so far (more details here: Streaming lines from an enum of chunks) is that this

File.stream!("numbers.txt",[],2048) #chunks instead of lines
|> Stream.transform("",fn chunk, acc ->
  [last_line | lines] = 
      acc <> chunk
      |> String.split("\n")
      |> Enum.reverse()
	{Enum.reverse(lines),last_line}
end)

seems to be 2x faster than using the :line option in File.stream!. I still have to try it in your code and see if there is any difference.

jola · April 11, 2019, 7:24pm

I did do that in the article and compared performance as well as some words on the downsides

alvises · April 11, 2019, 7:34pm

thx @jola, do you mean this part?

IO.binstream(:stdio, 102400)
|> Enum.to_list()
|> :binary.list_to_bin()
|> String.split(pattern)

Does it load everything into memory and then splits?

jola · April 11, 2019, 7:48pm

It does like I said in the article, I didn’t spend too much time messing around with the chunk size, but that number gave decent performance improvement compared to reading line by line with that input on my machine.

Also, switching to reading the file directly would be faster, but the original article used stdin in all examples, so it would be not be a fair comparison.

Take a look at @evadne’s solution for one that optimizes reading over 4 workers if you’re curious how fast it can be reading directly from file.

evadne · April 11, 2019, 8:50pm

New version is up. It reads from STDIN and somehow manages to beat the file-based approach.

elixir ./count-stream.ex < words.txt > out.txt
Started
Processing: 4.627264s
Reporting: 0.771289s
Total Runtime: 5.408457s

elixir ./count-stream.ex --sort < words.txt > out.txt
Started
Processing: 4.761361s
Reporting: 1.936912s
Total Runtime: 6.707307s

JEG2 · September 6, 2019, 1:54am

I love this problem. I tried several things that I just knew you be faster, only to learn the hard way that they were not.

Here’s the faster thing I was able to come up with:

defmodule WordCounter do
  def run(table, pattern, parent) do
    spawn(__MODULE__, :read_and_count, [table, pattern, parent])
  end

  def read_and_count(table, pattern, parent) do
    case IO.binread(:line) do
      :eof ->
        send(parent, {:done, self()})

      line ->
        line
        |> String.split(pattern)
        |> Enum.each(fn word ->
          :ets.update_counter(table, word, {2, 1}, {word, 0})
        end)
        read_and_count(table, pattern, parent)
    end
  end

  def wait_on(pid) do
    receive do
      {:done, ^pid} ->
        :ok
    end
  end
end

table = :ets.new(:words, [:public, write_concurrency: true])
pattern = :binary.compile_pattern([" ", "\n"])

Stream.repeatedly(fn -> WordCounter.run(table, pattern, self()) end)
|> Enum.take(System.schedulers_online)
|> Enum.each(fn pid -> WordCounter.wait_on(pid) end)

:ets.tab2list(table)
|> Enum.sort(fn {_, a}, {_, b} -> b < a end)
|> Enum.map(fn {word, count} ->
  [String.pad_leading(Integer.to_string(count), 8), " ", word, "\n"]
end)
|> IO.binwrite

I did like that I was able to get some decent speed with a still pretty straight forward approach. (It just reads lines in multiple processes.)

darinwilson · September 6, 2019, 3:23am

What was the runtime of this implementation?

JEG2 · September 6, 2019, 2:00pm

It runs in about 8.7 seconds on my laptop. I expect that it’s super dependent on how many cores a machine has.

darinwilson · September 6, 2019, 4:01pm

That’s awesome! I think @jola got hers down to 7-ish seconds in her talk (if I’m remembering correctly)? But it would be neat to throw yours onto a 16-core machine and see how it does

OvermindDL1 · September 6, 2019, 4:24pm

I have 16 core here, what needs to be tested? ^.^

JEG2 · September 6, 2019, 4:43pm

My “System Report” says I have 8 cores, but the BEAM starts 16 schedulers when I run it.

Another advantage of my approach though is that memory usage should stay pretty reasonable, since it’s still working line-by-line.

sb8244 · September 6, 2019, 4:59pm

I ran @JEG2 locally and it came up with 9.5s. Here’s the comparison:

➜ time elixir wp_parallel.exs < ./words.txt > /dev/null
elixir wp_parallel.exs < ./words.txt > /dev/null 27.14s user 1.79s system 302% cpu 9.550 total

➜ time elixir wp_singlel.exs < ./words.txt > /dev/null
elixir wp_single.exs < ./words.txt > /dev/null 10.28s user 2.91s system 103% cpu 12.739 total

You can see that the parallel version uses much more CPU, but it brings the time down by ~20%. It is a cool approach to read in parallel. I tried to read at once and then spawn out the processes, which was slower than I expected.

jola · September 6, 2019, 5:35pm

I guess you have 8 physical cores and 16 virtual?

@darinwilson running with > /dev/null gets it down lower

➜  elixirconf time elixir lib/direction3/async_stream.ex < ../words.txt > /dev/null
elixir lib/direction3/async_stream.ex < ../words.txt > /dev/null  17.79s user 3.08s system 380% cpu 5.484 total

and then you can cheat a bit

➜  elixirconf time elixir --erl "+hms 500000000" lib/direction3/async_stream.ex < ../words.txt > /dev/null 
elixir --erl "+hms 500000000" lib/direction3/async_stream.ex < ../words.txt >  16.26s user 2.82s system 420% cpu 4.543 total

For those keeping score at home, 4.5 is almost as fast as C (admittedly it uses 4x CPU)

Here’s my execution time for @JEG2’s version

➜  elixirconf time elixir lib/extra/jeg2.exs < ../words.txt > /dev/null
elixir lib/extra/jeg2.exs < ../words.txt > /dev/null  27.02s user 1.77s system 276% cpu 10.417 total

Using the +hms cheat brings it down to like 7 seconds.

I really like this solution. You don’t have to stitch together prefix+suffix and memory usage is about 1GB on my machine. Speed is totally reasonable

dimitarvp · September 6, 2019, 9:51pm

I can try the code on a 10 / 20 core machine. Can somebody point me at a repo?

evadne · September 9, 2019, 8:06pm

My count-stream solution was sorting by string content instead of by count. This has been fixed and gives a modest speed-up.

Old:

$ elixir ./count.ex ./words.txt --sort 1> out-sorted.txt
Started
Using 4 Workers
Processing: 4.665098s
Load Results: 0.819044s
Order Results: 2.111665s
Print Results: 3.020239s
Total Runtime: 10.620714s

New:

$ elixir ./count.ex ./words.txt --sort 1> out-sorted.txt
Started
Using 4 Workers
Processing: 4.474982s
Load Results: 0.717677s
Order Results: 1.109028s
Print Results: 2.277602s
Total Runtime: 8.583487s

The stream based solution is faster though. Additionally, with the sort function replaced with List.keysort/2 there was another 500ms or so recovered.

jola · September 10, 2019, 4:55am

You weren’t sorting by strings, you were sorting by count (they second value in the tuple). The output was correct. At least in the version I got from your gist.

defp report_sort(entries), do: Enum.sort(entries, fn {_, lhs}, {_, rhs} -> lhs > rhs end)

With List.keysort they get sorted in reverse order though (asc instead of desc). There doesn’t seem to be an option to choose order? But yeah, it’s faster nice find! Something like 200-400ms after a few runs.

And a message to everyone in the thread: I just want to point out that all my measurements are with the output printed to terminal, unless otherwise specified. Just to be clear. Running with > /dev/null or > outputfile is much faster.

alvises · November 1, 2019, 1:54pm

Hey @jola, just re-watched your talk at ElixirConf 2019

To close the “hole” in the pipeline between the Enum.each and :ets.tab2list, what do you think about Enum.reduce with the ets table as the accumulator?

...
|> Enum.reduce(:ets.new(:words, []), fn word, table ->
  :ets.update_counter(table, word, {2, 1}, {word, 0})
  table
end)
|> :ets.tab2list()
...

Performance wise I get same results. But maybe we could argue that returning the table in the reduction is not aesthetically that pleasant

jola · November 1, 2019, 1:56pm

haha, yeah, it closes the hole but it still doesn’t feel great!