Understanding concurrent tasks in context of BEAM schedulers & CPU cores


I’m just reading Elixir in Action and have a question about running concurrent tasks in relation to the number of cores available.

From what I can understand, a scheduler can only run one CPU-bound task at a time. So as long as I spawn no more processes than there are schedulers, a CPU-bound task should take roughly the same amount of time regardless of how many I spawn.

I put together my first ever benchmark test :tada: using Benchee. But the results were not as expected. Spawning 4 tasks took a lot longer than spawning 1 task, even though Benchee reckons I have 8 cores available.

Can anyone shed any light on what’s happening here?



  • Number of BEAM schedulers equals the number of cores c on my machine.
  • cpu_bound = fn -> 1..5_000_000 |> Enum.map(&(&1 + 1)) end executes a “CPU-bound task” which takes some amount of time t on my machine to run.

I’d expect spawning tasks that run cpu_bound to take roughly time t, until the number of tasks spawned exceeds the number of schedulers available.


iex(1)> Bench.run()
Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-8569U CPU @ 2.80GHz
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.10.1
Erlang 22.2.6

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
parallel: 1

Name             ips        average  deviation         median         99th %
spawn_1          1.94      516.54 ms     ±1.19%      517.21 ms      528.99 ms
spawn_2          1.68      595.94 ms     ±1.36%      593.47 ms      614.60 ms
spawn_4          1.26      794.28 ms     ±2.83%      800.31 ms      821.24 ms
spawn_8          0.77     1302.10 ms     ±1.03%     1306.02 ms     1317.07 ms

spawn_1          1.94
spawn_2          1.68 - 1.15x slower +79.40 ms
spawn_4          1.26 - 1.54x slower +277.74 ms
spawn_8          0.77 - 2.52x slower +785.55 ms

And the code:

defmodule Bench do
  def run do
    Benchee.run(
      %{
        "spawn_1" => fn -> MultiProc.async(1..1) end,
        "spawn_2" => fn -> MultiProc.async(1..2) end,
        "spawn_4" => fn -> MultiProc.async(1..4) end,
        "spawn_8" => fn -> MultiProc.async(1..8) end
      },
      time: 10,
      memory_time: 2
    )
  end
end

defmodule MultiProc do
  def async(range \\ 1..4) do
    self = self()

    range
    |> Enum.map(fn x ->
      # Spawn one CPU-bound process per element, reporting back when done.
      spawn(fn ->
        1..5_000_000 |> Enum.map(&(&1 + 1))
        send(self, x)
      end)
    end)
    |> Enum.map(&receive_message(&1))
  end

  def receive_message(_msg) do
    receive do
      n -> n
    end
  end
end

This is only partially true: if you had more schedulers than cores, it would no longer hold. In general one CPU core cannot run more than one task at a time, and you can think of schedulers as “virtual CPUs”.

You have forgotten about scheduling and cache locality. Each process is assigned a budget of N reductions (“virtual CPU units”); when it uses them up (or calls receive) it is preempted and another process takes its place. The problem is that all the data this process is currently using needs to be re-cached, all registers need to be saved somewhere (BEAM is a register-based VM), etc. This is not a negligible amount of work.
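As a rough illustration, you can watch a process burn through reductions using the real `Process.info/2` API (the exact numbers vary by machine and OTP release; the preemption budget is around 4000 reductions in recent OTP versions):

```elixir
# Read the reduction counter of the current process.
{:reductions, before} = Process.info(self(), :reductions)

# Burn some CPU: every function call and arithmetic op costs reductions.
Enum.each(1..100_000, fn _ -> :ok end)

{:reductions, afterwards} = Process.info(self(), :reductions)
IO.puts("reductions used: #{afterwards - before}")
```

Every time that budget runs out mid-task, the scheduler performs the register/cache bookkeeping described above before the next process runs.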

Another thing you have forgotten is Amdahl’s law, which means the possible speedup is limited by the serial part of the work — here, the joining operation.
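Amdahl’s law in a nutshell: if a fraction p of the work can be parallelised across n workers, the best possible speedup is 1 / ((1 - p) + p / n). A quick sketch (the 0.8 figure is an arbitrary assumption for illustration, not measured from the benchmark above):

```elixir
# Amdahl's law: best-case speedup for n workers when fraction p of
# the work is parallelisable.
amdahl = fn p, n -> 1 / ((1 - p) + p / n) end

# Even with 80% of the work parallelisable, 8 workers give
# well under an 8x speedup:
amdahl.(0.8, 8)  # ≈ 3.33
```

Note that with p = 0.8 the speedup never exceeds 5x, no matter how many cores you add.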

Another thing is that it also depends heavily on what your machine is doing in the meantime, as the BEAM does not “own” the CPUs, so the underlying host OS does its own scheduling underneath.
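You can check how many schedulers the BEAM actually started (by default one per logical core, though the OS still decides which physical core runs them, and what else shares those cores):

```elixir
# Number of schedulers currently running (usually one per logical core).
System.schedulers_online()

# What the VM detected about the hardware; may return :unknown
# on some platforms.
:erlang.system_info(:logical_processors_available)
```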


Thanks a lot for the response!

I have no computer science background, and that probably led me to over-simplify how I imagined it would work.

BEAM do not “own” CPUs

That makes sense.

From a brief read on Amdahl’s law, I understand it is a factor at play, but relative to the 8-process spawn taking 2.5 times longer, it sounds like it would be a fairly small one?

I don’t understand the caching and register part, and don’t expect you to explain further!

But one follow up question would be:
if you had to single out one factor causing this 2.5-times-slower operation, which one would be the main culprit?

This one is quite important if you want to write performant software (and it is also partially the culprit behind the SPECTRE attacks on Intel x86 CPUs).

So, in short: there is a difference in access time between the different parts of a computer’s memory. Disk is the slowest, RAM is faster. In the old days, that was the whole story. However, to speed up computation even further, access to RAM was still too slow, so CPUs added caches of their own, which are the quickest memory stores possible (except for direct access to data held in registers). The catch is that the quicker the access, the less memory you have. So when you read some data, the CPU stores it in its cache (often a little more than you asked for, just in case it is a sequential read, e.g. an array), which lets you process the data faster. However, when you jump between unrelated (non-sequential) parts of memory (as in linked lists), the CPU slows down, because at each step it needs to fetch data from the slower RAM again.

Also, as each CPU core can execute only one process at a time (obviously), each time you want to switch to another process you need to store all the information used by that process (all the CPU registers) in RAM and then load the data for the “new” process, so it will “feel” that nothing has changed (except time).

There is even more, like branch prediction, which helps modern CPUs compute even faster (but brings problems like the aforementioned SPECTRE), or SIMD, which is instruction-level parallelism for performing computations on multiple pieces of data at once.

In general it is a very interesting topic that has a huge impact on performance, even in languages like Erlang that live quite far from the “bare metal”. In IT this is generally known as the “law of leaky abstractions”: hiding information about the underlying system doesn’t stop it from affecting your code anyway (another example is NFS, which simulates a local filesystem while in reality dealing with much harder stuff: networking).

Hard to tell, but the general rule is that processes in Erlang aren’t meant to be used as a source of performance improvements (in the old days Erlang was always single-threaded), but as “error containment”, so that the failure of a single process will not affect the rest of the system. Another thing is that Erlang is not the best tool out there for CPU-intensive tasks.
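A minimal sketch of the “error containment” idea: a crash in a spawned process doesn’t take down the caller unless the two are linked, and even then the caller can trap the exit signal and handle it as a message:

```elixir
# An unlinked process can crash without affecting us.
spawn(fn -> raise "boom" end)
Process.sleep(100)
IO.puts("still alive")

# A linked process would propagate the failure, unless we trap exits,
# which turns the crash into an ordinary {:EXIT, pid, reason} message.
Process.flag(:trap_exit, true)
pid = spawn_link(fn -> raise "boom" end)

receive do
  {:EXIT, ^pid, _reason} -> IO.puts("contained a crash from #{inspect(pid)}")
end
```

This is the failure-isolation model that supervisors build on.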


That’s a nice explanation.

Really appreciate the effort you’ve gone to @hauleth, thanks a lot.

This is not strictly true. The BEAM preemptively pauses and switches processes. It’s true that latency can suffer (very minimally, in my experience) if you spawn 100 CPU-bound tasks on a 4-core machine, but that won’t lock up the scheduler the way an infinite loop in a C/C++ program would.
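A quick way to see this preemption in action: spawn far more busy loops than you have cores, and other processes still get scheduled promptly. (The `Busy` module name is made up for illustration; the loop is deliberately infinite, so the pids are killed at the end.)

```elixir
defmodule Busy do
  # An infinite, CPU-bound loop: no receive, no sleeps.
  def loop, do: loop()
end

pids = for _ <- 1..100, do: spawn(&Busy.loop/0)

# Even with 100 busy loops on a handful of cores, this prints
# promptly, because each loop is preempted once it exhausts
# its reduction budget.
IO.puts("still responsive: #{1 + 1}")

Enum.each(pids, &Process.exit(&1, :kill))
```

In C/C++ with one OS thread per task, 100 spin loops would starve everything else; on the BEAM they just share the schedulers.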

See what the author of Benchee has to say on parallel benchmarks. My takeaway is that it is simply not made to benchmark parallel workloads.
