Help improving Elixir word count

belaustegui · March 26, 2016, 1:29am

There is an interesting project on GitHub consisting on counting words in different languages.
The exercise input is the latest Hungarian Wikisource XML dump (since it contains a lot of non ASCII characters) and the expected output is a list of words along with their number of appearances sorted in a certain way.
I’ve provided a pull request with an implementation written in Elixir, but it is notably slower than other languages.

Any ideas to make the Elixir implementation more performant?

NobbZ · March 26, 2016, 1:41am

receive do
  interesting_thing -> take_a_closer_look
after @i_got_some_sleep

mkunikow · March 27, 2016, 9:16pm

Hi I assume you are using one core of CPU ? Is it possible to span this to many processes?
But I am not familiar yet with OPT

There is some article http://theerlangelist.blogspot.com/2013/04/parallelizing-independent-tasks.html
GitHub - eproxus/parallel: A parallel Enum implementation for Elixir

NobbZ · March 27, 2016, 10:25pm

This is one of our biggest problems, the rules do clearly discourage the use of threads:

single-thread is preferred but you can add multi-threaded or multicore versions too

But since the singlethreaded version has settled now (on my machine it went from ~150 secs to ~110 secs), I might talk to the projects maintainer, if we are allowed to submit a multithreaded version also (we came in right during housekeeping, and there’s a new rule coming in to have only one submission per language).

trydis · March 27, 2016, 10:38pm

Elixir noob here, but shouldn’t the strip: true on line 3 be trim: true? And then i guess it’s unnecessary to check for “empty” words in the Enum.reduce?

mkunikow · March 27, 2016, 10:41pm

Can someone add more details logs which task takes how much time?
*load file
*plit the input in words and reject empty
…
*write to file

NobbZ · March 27, 2016, 10:54pm

@trydis: We don’t use String.split/3 anymore, it has been replaced by Regex.scan/3, but you are right regarding the empty string, but removing that check doen’t gain anything on my machine.

@mkunikow: Currently we can’t provide such stuff. We are mostly reading blindly from stdin, transform the stream via Stream-module until we first hit our reduce, which then transforms the incoming stream of words into a map of word => integer. The next step is to transform this map into a sorted list of tuples. Sadly the required ordering isn’t natural in Elixir

I had ExProf runing for some experiments, and these experiments guided me from String.split/3 to Regex.scan/3, but so far I have not used “poor mans profiling”.

Also the PR refered to in the OP is already merged and obseleted by a proper mix-project which compiles into an escript.

mkunikow · March 27, 2016, 10:58pm

I assume you can do this by commenting/uncommeting steps. Maybe I will try to get this data

mkunikow · March 28, 2016, 1:18am

Ok here is my micro benchmark (maybe not accurate as profiler)

| wordcount step | time elapsed [s] |
| -------------- | ---------------- |
| read file      | 8.70E-04         |
| flat map(regex)| 0.003366         |
| reduce         | 102.674637       |
| sort           | 112.543272       |
| map            | 112.600831       |
| wrtie file     | 117.441563       |

NobbZ · March 28, 2016, 11:28am

You are doing some false assumptions here, making your poormans profiling more or less useless.

The first two steps you show here, do nothing but creating/modifying a Stream, a lazy enumerator. Only reaching the Enum.reduce-line will actually trigger the evaluation of the Stream, until then there is not a single byte read from input. So these 100 secs you measure for reduce do actually contain all three first steps, while the first one entries of your table do just measure NO-OPs.

If you really want to do a more realistic profiling of the steps, you need to force the stream by piping into Enum.into([]), which again might make your results dirty again.

If one really wants to profile this for the sake of squeezing the last bit out of it, one needs to use dedicated profiling tools like eprof.

But I have to report another gain of ~90% runtime (~120 -> ~70 secs) by throwing away the map and using ets instead. (No PR currently, doing some further tests before hand, hoping to improve runtime even further)

mkunikow · March 28, 2016, 2:54pm

@NobbZ Yes you are right about 3 first steps. This is lazy stream so it will be evaluated in step 3. Sorry my bad . But what I see most time took to load data from stream into Map.

belaustegui · March 28, 2016, 8:35pm

Although I’m very noob with Elixir, I’ve learn a lot with this little exercise. Thank you very much @NoobZ, hoping to see soon the ets implementation!!

NobbZ · March 28, 2016, 9:45pm

There’s already a PR open, but marked as WIP, I’m still fiddling around locally and was able cross the boarder of one minute!

NobbZ · March 29, 2016, 12:28pm

OK, that last PR got merged.

I don’t see any room for further improvements, except for going parallel, which has not been settled on in the rules yet, so we have to wait a bit.

On my dev-machine we went from ~3 minutes of runtime down to ~1 minute for the 2.5 million lines testfile. The 5 million lines testfile does run for about 2 minutes on my machine and the full testfile with about 67 million lines didn’t finish over night¹.

¹: I had it running for about 10 hours of wall clock time. When I came back to my machine it was nearely unresponsive and htop told me, that it consumed 14 GiB of memory (my machine has 8 GiB RAM + 8 GiB swap partition) and had only ~35 minutes of CPU time during the night and of course my HDD was stress tested I can’t tell in which phase of the process it was, since I forgot to give the started node a name, so I wasn’t able to hack in via remote console…

mackenza · April 7, 2016, 3:32pm

I know Erlang/Elixir aren’t strongly suited to this type processing and that concurrency is important to the BEAM, so being so far behind Rust and C++ are fine, but personally I wish we hadn’t been so far behind PHP

Even Bash was faster in the small dataset and used almost no memory.

I wonder if there is another approach altogether? I can’t think of one but it might exist.

good job

NobbZ · April 7, 2016, 4:01pm

The memory consumption for bash is not accurate due to some issues in time, this is only mentioned once in the very first blog post linked, but is still true.

mackenza · April 7, 2016, 4:07pm

fair enough, my disappointment with being ~= to bash in time stands though

NobbZ · April 7, 2016, 4:11pm

I did already some profiling, and found out that a huge amount of time is spent inside of :re-module and tried some things, but even while these things all speed up the small dataset by 5 to 10 seconds on my machine, they are failing at least one test. Until I find a way to repair the tests AND to keep speed up, I’ll report back here

NobbZ · April 8, 2016, 11:07am

OK, using mix profile.fprof I was able to identify some bottlenecks, which I was able to factor away. But identifying further bottlenecks gets harder and harder, since the functions with a huge share of runtime are either anonymous (so hard to identify in code) or already direct calls to Erlang modules (or wrapped calls to internals that I don’t want to use directly since they are either not exported or not documented erlang stuff).

As a summary, I started with the following (all measures against huwikisource, which has about 2.5 million lines):

real	1m1.005s
user	0m59.533s
sys	0m1.511s

And I ended up with:

real	0m41.149s
user	0m41.379s
sys	0m0.362s

So I was able to squeeze about 33% more performance out of it by adding 3 lines (one was just needed for fprof) and changing another line.

juditacs, the maintainer of the project, already merged my changes and told me the runtime of her 5 million lines test:

268 → 220sec on 5M lines.

Here we do only have improved by about 20%, but still good enough. I’m eager to see a new official run on the full dataset.

Last but not least: a link to the PR

uranther · April 11, 2016, 3:25am

Well done on the optimizations! I think your changes are instructional. Avoiding the blank string update to ETS seems to be a minor change compared to the replacement of Regex.scan/2 with a :binary.split/3. It is reassuring to know regex is much slower. Was the Elixir standard library String.split/3 much worse than the Erlang module? Would writing a NIF in Rust be cheating?

Thanks for letting us know about mix profile.fprof. I will remember to use it next time.

Haskell is supposed to be only slightly behind C and C++ in terms of speed, yet is 2 ranks behind Elixir, and behind Bash. I wonder how that wordcount implementation can be improved?