Seeking advice to understand why my parallel demo code runs slower than the sequential one

Hi,

I’m currently learning Elixir and I’ve come across this code (from the book “Programming Elixir >= 1.6”):

defmodule Parallel do
  def pmap(collection, func) do
    collection
    |> Enum.map(&(Task.async(fn -> func.(&1) end)))
    |> Enum.map(&Task.await/1)
  end
end

When running the following command:

iex> result = Parallel.pmap 1..1000000, &(&1 * &1)

it creates 1 million processes, one to calculate the square of each number (1 to 1 million).

I’ve added this sequential function called smap to check how much faster the parallel one is:

defmodule Parallel do
  def pmap(collection, func) do
    collection
    |> Enum.map(&(Task.async(fn -> func.(&1) end)))
    |> Enum.map(&Task.await/1)
  end

  def smap(collection, func) do
    collection
    # |> Enum.map(&(fn -> func.(&1) end))
    |> Enum.map(&(func.(&1)))
  end
end

I was astonished to find that the sequential function runs much faster. I went up to 10 million square numbers and it got even worse! I’m running this code on a 2 GHz quad-core Intel Core i5 on macOS.

Am I doing something wrong? Is the overhead of creating, managing and (a)waiting for all the processes causing the parallel version to run slower?

I’d really appreciate any feedback.

Best regards,
Dominik

Side note:
I’ve searched this forum, but the only related topic I could find was this one, and the problem there seems to be that the user is on a single-core machine.

2 Likes

The issue here is that you’re trying to parallelize too primitive an operation. The cost of multiplying two numbers is tiny compared to the cost of spawning a new VM process. Also, in pmap, Enum.map is invoked twice: the collection is traversed once to spawn the tasks and a second time to await them.

6 Likes

Distributing a tiny calculation such as squaring a number across a million processes makes little sense in my opinion. I would try a heavier, longer-running task.

2 Likes

Or, in other words: the cost of Task.async/1 is bigger than that of Kernel.*, so calling Task.async/1 1,000,000 times takes more time than calling Kernel.* 1,000,000 times.
You could change your example to something like this:

iex(1)> f = fn x -> Process.sleep(1_000); x + 1 end
#Function<44.40011524/1 in :erl_eval.expr/5>
iex(2)> s = fn n -> Enum.map(n, f) end
#Function<44.40011524/1 in :erl_eval.expr/5>
iex(3)> a = fn n -> Enum.map(n, &Task.async(fn -> f.(&1) end)) |> Enum.map(&Task.await/1) end
#Function<44.40011524/1 in :erl_eval.expr/5>
iex(4)> :timer.tc(s, [1..10])
{10009653, [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
iex(5)> :timer.tc(a, [1..10])
{1005634, [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
Sequentially, the ten 1-second sleeps add up to about 10 seconds; run in parallel they overlap, and everything finishes in about 1 second.
1 Like

Ok, got it!

Thank you all very much.

1 Like

A couple of notes that can perhaps help you think about processes on the BEAM (they’re not rules though):

  • Usually you want to parallelise things whose running time is longer than the overhead of spawning the processes (as others mentioned here).
  • Besides the spawning overhead, there’s also the overhead for the VM’s scheduler, which in this case has to guarantee that these 1M processes get scheduled, executed, removed, etc.
  • The last point is why, whenever you have a queueable/parallel workload, it’s usually advised to start about the same number of worker processes as you have logical CPUs (see the sketch after this list). This is mostly true for purely CPU-bound work, although due to scheduling, messaging, etc. I’ve found you might still see gains slightly beyond the number of CPUs.
  • When the work interleaves CPU and IO you can run many more worker processes and still see a speedup, although testing and knowing the workloads helps you decide correctly.
  • Workers behind an orchestrator and/or pool are sometimes used not for the speedup but to keep memory usage at a reasonable level while still being able to process an unbounded source of tasks/work. (In the same vein, streams are about controlling memory usage; they’re slower than plain iteration.)
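
A minimal sketch of that rule of thumb (my illustration, not code from this thread), using Task.async_stream/3, whose max_concurrency option defaults to the number of online schedulers anyway:

1..1_000_000
|> Task.async_stream(fn x -> x * x end, max_concurrency: System.schedulers_online())
|> Enum.map(fn {:ok, square} -> square end)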

One thing that helped me see how these factors play out (I think it was in Joe Armstrong’s book Programming Erlang) was building a program that MD5s the contents of all files on your computer, splitting the directory traversal across processes. Up to around a few hundred concurrent processes (I can’t remember the exact number from my tests) there was a speedup, but after a certain point it became slower, as the scheduling, yielding, sending results back to the “main process”, etc. amounted to a higher overhead than the “lulls” while doing IO to get the filesystem information.
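
A rough sketch of that experiment (hypothetical, not the book’s code), assuming paths is a flat list of regular file paths:

paths
|> Task.async_stream(
  # Hash each file's contents; most of the time per element is file IO.
  fn path -> {path, :crypto.hash(:md5, File.read!(path))} end,
  max_concurrency: 200,
  timeout: :timer.minutes(1)
)
|> Enum.map(fn {:ok, {path, digest}} -> {path, Base.encode16(digest)} end)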

Since this also depends on the underlying system (soft&hardware) you usually have to test a bit to get the optimal settings for your particular use case, in that particular configuration.

4 Likes

Apart from everything already said: I would think this is a good place to use Task.await_many/1?

The way you do it, I would think it awaits the tasks in the order of the collection: it does spawn the 1,000,000 tasks, but then blocks on awaiting the first one before moving on to await the others.

Although I do have to say, I don’t really know if this would actually speed anything up in your case, sorry :sweat_smile:
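
For reference, that variant would look something like this (Task.await_many/1 is available since Elixir 1.11):

def pmap(collection, func) do
  collection
  |> Enum.map(&Task.async(fn -> func.(&1) end))
  # Collect all results at once instead of awaiting one task at a time.
  |> Task.await_many()
end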

2 Likes

Couldn’t resist and made a small GitHub repo that tests both of @Dominik’s functions plus a naive parallel chunk algorithm of my own, here: GitHub - dimitarvp/elixir-benchmark-collection-algorithms: Small project that can test your collection processing algorithm (in Elixir)

Here are my results (check the README for how to run the benchmarks, it’s mega-simple):

Operating System: macOS
CPU Information: Intel(R) Xeon(R) W-2150B CPU @ 3.00GHz
Number of Available Cores: 20
Available memory: 64 GB
Elixir 1.12.1
Erlang 23.3.4.4

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 30 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 1.60 min

Benchmarking dominik_serial_0...
Benchmarking dominik_parallel_0...
Benchmarking dimitarvp_parallel_chunk...

Name                               ips        average  deviation         median         99th %
dominik_serial_0                 12.87       77.69 ms     ±9.53%       76.04 ms       92.04 ms
dimitarvp_parallel_chunk          4.71      212.50 ms     ±6.31%      208.92 ms      252.02 ms
dominik_parallel_0               0.157     6359.22 ms     ±8.46%     6258.66 ms     7200.11 ms

Comparison:
dominik_serial_0                 12.87
dimitarvp_parallel_chunk          4.71 - 2.74x slower +134.81 ms
dominik_parallel_0               0.157 - 81.85x slower +6281.53 ms

So while mine is much better than the original parallel version (roughly 30x faster), it’s still 2.74x slower than a simple serial implementation, because, as others have said, just squaring a number is simply not worth parallelizing.

I can guarantee (because I’ve seen it many times) that the moment the function becomes even slightly complex, and the elements of the list are not plain numbers, “my” (of course it’s not mine!) naive algorithm consistently wins against many other home-grown algorithms, simply because it divides the work between all available CPU threads in chunks.
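
A minimal sketch of that chunked approach (my paraphrase, not the exact code from the repo): split the collection into one chunk per scheduler thread, map each chunk inside its own task, then stitch the per-chunk results back together in order.

defmodule ParallelChunk do
  def pmap(collection, func) do
    list = Enum.to_list(collection)
    # One chunk per scheduler thread (at least one element per chunk).
    chunk_size = max(div(length(list), System.schedulers_online()), 1)

    list
    |> Enum.chunk_every(chunk_size)
    |> Enum.map(fn chunk -> Task.async(fn -> Enum.map(chunk, func) end) end)
    |> Enum.flat_map(&Task.await(&1, :infinity))
  end
end

For example, ParallelChunk.pmap(1..1_000_000, &(&1 * &1)) spawns only about as many tasks as you have scheduler threads instead of one million.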

4 Likes

Thank you very much for your additions.

That sounds very interesting. Do you recommend this book generally?

Wow, that’s awesome! Thank you very much for your effort.

Regarding your code and approach, I have a few questions. If you don’t mind answering them I’d be happy to learn from you.

  • Do you recommend using “your” algorithm generally when parallelizing things?
  • Is there a rule of thumb for when it’s useful to parallelize things, or is it just about developing a gut feeling for it?
  • I’ve looked at your code and see that you never create a chunk smaller than 500 elements. Is this number based on experience? Does it apply to other problems/work/calculations as well, or just to this specific case with such a light per-element workload?
2 Likes

2 articles for you today:

Everything the others said is, in short, a set of real-life examples of these laws at work.

4 Likes

Yeah, pretty much. Obviously this is heavily tilted towards how fast my machines are, but I’ve noticed that most Elixir code behaves well up to lists of several hundred elements, and performance starts to gradually decline from there. This also has to do with calling length on them every now and then (which is an O(n) operation). But still, take this number and the assumptions that go with it with a huge bag of salt. Experiment with your own algorithm and find the right number for your use case.

“My” algorithm is basically “make the best use of your CPU cores/threads” but, as @hauleth’s linked articles (especially Amdahl’s Law) show, you don’t get linearly increasing gains from parallelizing things. Under ideal conditions you get about 75% of the theoretical gains (so with 8 cores you can’t expect an 800% improvement; more like 600%).
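
For reference, Amdahl’s law puts the maximum speedup with a parallel fraction p on n cores at 1 / ((1 - p) + p / n). A quick illustration in iex (my own numbers, not taken from the articles):

iex> amdahl = fn p, n -> 1 / ((1 - p) + p / n) end
iex> amdahl.(0.95, 8)
5.925925925925926

So even with 95% of the work parallelizable, 8 cores top out at roughly a 5.9x speedup, not 8x.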

1 Like

Not really a gut feeling. You need a function that actually isn’t instant (like squaring a number is). For example, anything that relies on an external service is heavily parallelizable; we’re not just talking about the number of your CPU cores here. If you have a production-grade DB with a limit of 200-300 connections, you can very easily spawn 200+ Elixir processes that consume 90% of that connection pool, because 98% of the time the code will be waiting on I/O (network/disk) from the DB anyway.
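
A sketch of that idea (io_bound_query/1 is a hypothetical function that spends most of its time waiting on the database):

ids
# Far more concurrent tasks than CPU cores is fine here: each one mostly waits on IO.
|> Task.async_stream(&io_bound_query/1, max_concurrency: 200, timeout: :timer.seconds(30))
|> Enum.map(fn {:ok, row} -> row end)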

The “gut feeling” here comes from experience. But one thing that might help you is this tidbit:

The CPU is always the fastest thing on the machine. Most of the time it waits for other parts of the machine to do their job. So you might as well make the CPU work more and in parallel, as much as you can.

But as with everything, this is not 100% universally applicable (although I’d still bravely say it applies in 90% of the cases).

2 Likes

Alright, got it! I understand that I need to tune the number of processes and the degree of parallelism based on the machines and functions I’m working with. Now I have a good starting point for the future.

Great stuff! Thank you all very much. I really appreciate you sharing your insights so kindly. :slight_smile:

1 Like

Yep. My machine has 10 CPU cores and 20 CPU threads, so my PostgreSQL by default performs best when you saturate 20 connections to it (I’ve tested this many times). That’s why the default values for Ecto (Elixir’s de facto DB data mapper library) don’t perform great on my machine when I stress-test my app (e.g. with tens of thousands of SQL INSERTs): they usually give you a connection pool of 10, which is a reasonable default but not good enough on most workstations.

So when I was doing various Phoenix web app tutorials years ago, I just had to immediately bump my PostgreSQL connection pool size to equal my CPU threads. Then the app would perform best (again, provided it could actually saturate the DB, which isn’t an easy thing to do, by the way).
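
In Ecto that’s a one-line change in your config (my sketch; :my_app and MyApp.Repo are placeholder names):

import Config

config :my_app, MyApp.Repo,
  # Match the pool size to the number of logical CPU threads.
  pool_size: System.schedulers_online()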

There are no universal answers or formulas. If you check the mini app I created above, you’ll see how easy it is to benchmark different functions. When in doubt: measure! :slight_smile:

Parallelism isn’t the easiest thing in programming, that’s a fact. Some analysis and your own tests are needed for you to get the most out of it. Be curious!

1 Like

Good to know!

I’ve cloned your repo to my local hard drive to perform tests in the future :wink:

I’m sure I’ll need to run many tests but, for me, that’s part of the game and also the fun part. That’s why I did these tests and asked here for help, to get a better understanding.

I’ll stay curious :v:

1 Like