There is an interesting project on GitHub that consists of counting words in different languages.
The exercise input is the latest Hungarian Wikisource XML dump (since it contains a lot of non-ASCII characters), and the expected output is a list of words along with their number of appearances, sorted in a certain way.
I’ve provided a pull request with an implementation written in Elixir, but it is notably slower than other languages.
Any ideas to make the Elixir implementation more performant?
This is one of our biggest problems; the rules clearly discourage the use of threads:
single-thread is preferred but you can add multi-threaded or multicore versions too
But since the single-threaded version has settled now (on my machine it went from ~150 secs to ~110 secs), I might ask the project's maintainer whether we are allowed to also submit a multi-threaded version (we came in right during housekeeping, and there's a new rule coming in allowing only one submission per language).
@trydis: We don’t use String.split/3 anymore; it has been replaced by Regex.scan/3. You are right regarding the empty string, but removing that check doesn’t gain anything on my machine.
@mkunikow: Currently we can’t provide such stuff. We are mostly reading blindly from stdin, transforming the stream via the Stream module until we first hit our reduce, which turns the incoming stream of words into a map of word => integer. The next step is to transform this map into a sorted list of tuples. Sadly the required ordering isn’t natural in Elixir.
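The pipeline described above could be sketched roughly like this (a minimal illustration, not the actual submission; the word splitting and sort order are assumptions here):

```elixir
# Hypothetical sketch of the pipeline: read lines lazily from an IO device,
# split them into words, reduce into a word => count map, then sort.
defmodule WordCountSketch do
  def run(io \\ :stdio) do
    io
    |> IO.stream(:line)                    # lazy: nothing is read yet
    |> Stream.flat_map(&String.split/1)    # lazy: split lines into words
    |> Enum.reduce(%{}, fn word, acc ->    # evaluation starts here
      Map.update(acc, word, 1, &(&1 + 1))
    end)
    |> Enum.sort_by(fn {word, count} -> {-count, word} end)
  end
end
```

Nothing before the `Enum.reduce/3` call does any work; the `Stream` steps only build up a lazy recipe that the reduce then drives.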
I had ExProf running for some experiments, and those experiments guided me from String.split/3 to Regex.scan/3, but so far I have not used “poor man's profiling”.
Also, the PR referred to in the OP has already been merged and superseded by a proper mix project which compiles into an escript.
You are making some false assumptions here, which render your poor man's profiling more or less useless.
The first two steps you show here do nothing but create/modify a Stream, a lazy enumerator. Only reaching the Enum.reduce line actually triggers the evaluation of the Stream; until then, not a single byte is read from input. So the 100 secs you measure for the reduce actually contain all three first steps, while the first entries of your table just measure no-ops.
If you really want a more realistic profiling of the steps, you need to force each stream by piping it into Enum.into/2, which in turn might skew your results again.
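The laziness point can be shown in isolation. In the snippet below the process dictionary is used purely to make evaluation observable (an illustrative device, not part of the actual code):

```elixir
# Streams are lazy: building the pipeline does no work. The side effect
# only fires once an Enum function forces the stream.
stream =
  1..3
  |> Stream.map(fn x ->
    Process.put(:evaluated, true)   # observable side effect
    x * 2
  end)

false = Process.get(:evaluated, false)  # nothing has run yet
[2, 4, 6] = Enum.to_list(stream)        # evaluation happens here
true = Process.get(:evaluated, false)
```

This is why timing the stream-building steps separately measures nothing: all the cost lands on whichever Enum call finally forces the stream.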
If one really wants to profile this for the sake of squeezing out the last bit, one needs dedicated profiling tools like eprof.
But I have to report another runtime gain of roughly 40% (~120 -> ~70 secs) by throwing away the map and using ETS instead. (No PR yet; I'm doing some further tests beforehand, hoping to improve the runtime even further.)
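A minimal sketch of the ETS approach (table name and words are illustrative): `:ets.update_counter/4` increments a counter in place and inserts a default row when the key is missing, avoiding the cost of rebuilding an immutable map on every word.

```elixir
# Count words in an ETS table instead of a Map.
table = :ets.new(:word_counts, [:set, :private])

for word <- ~w(foo bar foo) do
  # {2, 1}: increment position 2 of the tuple by 1;
  # {word, 0}: default row inserted if the key is absent.
  :ets.update_counter(table, word, {2, 1}, {word, 0})
end

counts = table |> :ets.tab2list() |> Enum.sort()
# counts == [{"bar", 1}, {"foo", 2}]
```

Since ETS mutates in place, the reduce no longer has to thread an accumulator through every iteration; the final `:ets.tab2list/1` converts back to a list for sorting.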
I don’t see any room for further improvements, except for going parallel, which has not been settled on in the rules yet, so we have to wait a bit.
On my dev machine we went from ~3 minutes of runtime down to ~1 minute for the 2.5-million-line test file. The 5-million-line test file runs for about 2 minutes on my machine, and the full test file with about 67 million lines didn’t finish overnight¹.
¹: I had it running for about 10 hours of wall-clock time. When I came back to my machine it was nearly unresponsive; htop told me that it had consumed 14 GiB of memory (my machine has 8 GiB RAM + an 8 GiB swap partition) and had accumulated only ~35 minutes of CPU time during the night, and of course my HDD was stress-tested. I can’t tell which phase of the process it was in, since I forgot to give the started node a name, so I wasn’t able to hack in via remote console…
I know Erlang/Elixir aren’t strongly suited to this type of processing and that concurrency is where the BEAM shines, so being so far behind Rust and C++ is fine, but personally I wish we weren’t so far behind PHP.
Even Bash was faster in the small dataset and used almost no memory.
I wonder if there is another approach altogether? I can’t think of one but it might exist.
I already did some profiling and found out that a huge amount of time is spent inside the :re module. I tried some things, but even though they all speed up the small dataset by 5 to 10 seconds on my machine, each of them fails at least one test. Once I find a way to fix the tests AND keep the speedup, I’ll report back here.
OK, using mix profile.fprof I was able to identify some bottlenecks, which I was able to factor away. But identifying further bottlenecks gets harder and harder, since the functions with a huge share of the runtime are either anonymous (and thus hard to locate in the code) or already direct calls to Erlang modules (or wrapped calls to internals that I don’t want to use directly, since they are either not exported or undocumented Erlang internals).
As a summary, I started with the following (all measures against huwikisource, which has about 2.5 million lines):
And I ended up with:
So I was able to squeeze about 33% more performance out of it by adding 3 lines (one was just needed for fprof) and changing another line.
juditacs, the maintainer of the project, already merged my changes and told me the runtime of her 5 million lines test:
268 -> 220sec on 5M lines.
Here we only improved by about 20%, but that's still good enough. I’m eager to see a new official run on the full dataset.
Well done on the optimizations! I think your changes are instructive. Avoiding the blank-string update to ETS seems to be a minor change compared to the replacement of Regex.scan/2 with :binary.split/3. It is good to know how much slower regex is. Was the Elixir standard library's String.split/3 much worse than the Erlang module? Would writing a NIF in Rust be cheating?
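For anyone curious, the two approaches being compared look roughly like this (assuming a split on single spaces; the actual pattern in the submission may differ):

```elixir
# Extracting words via the regex engine versus a fixed-pattern byte split.
line = "alma körte alma"

regex_words =
  Regex.scan(~r/\S+/u, line) |> Enum.map(fn [w] -> w end)

split_words =
  :binary.split(line, " ", [:global])

true = regex_words == split_words
# Both yield ["alma", "körte", "alma"], but :binary.split avoids the
# regex engine (the :re module) entirely.
```

One caveat: unlike the regex scan, `:binary.split/3` with `[:global]` keeps empty strings produced by consecutive separators (unless `:trim_all` is passed), which is presumably why a blank-string check around the ETS update was needed in the first place.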
Thanks for letting us know about mix profile.fprof. I will remember to use it next time.
Haskell is supposed to be only slightly behind C and C++ in terms of speed, yet it is 2 ranks behind Elixir, and behind Bash. I wonder how that word-count implementation could be improved?