Parallel tuple build paradigm

Nazari · October 15, 2020, 9:38pm

This is driving me crazy, any explanation will be rewarded with a fresh beer here in Spain…
I’ve written a short macro to build tuples with more than two elements in parallel using Task. The macro is named ‘parallel’ so if you type:

s = 100
[:timer.tc(fn ->  {:first, :timer.sleep(s), :timer.sleep(s)} end), 
 :timer.tc(fn -> parallel {:second, :timer.sleep(s), :timer.sleep(s)} end)
]

> [{201235, {:first, :ok, :ok}}, {104703, {:second, :ok, :ok}}]

Parallel version is faster (double) than sequential. In fact the parallel version is faster for any value of s greater than zero. For s=0 values are:

> [{11, {:first, :ok, :ok}}, {242, {:second, :ok, :ok}}]

So the overhead of the macro is about 230 millis. But, when i use paralell in the library I’m building (exun), performance degrades substantially. Is there something I’m missing? This test probes that for any operation over 1 millis for each element, the macro will produce benefits… or not?

The macro:

  defmacro parallel({:{}, _attrs, lista}) do
    task_list =
      lista
      |> Enum.map(fn calculus ->
        {{:., [], [{:__aliases__, [alias: false], [:Task]}, :async]}, [],
         [{:fn, [], [{:->, [], [[], calculus ]}]}]}
      end)

    quote do
      unquote(task_list) |> Task.await_many() |> List.to_tuple()
    end
  end

mpope · October 16, 2020, 12:07am

How are you performing the measurements? Are you using Benchee, or timing through iex?

Nazari · October 16, 2020, 7:36am

Hi mpope, I’m using timing through iex, and I’ll probe Benchee; thanks for the reference!

Nazari · October 16, 2020, 8:27am

With Benchee:

import Exun
expression = "(g(a^b,b^a)/g(b^a,a^b))'a"
context = %{"g(x,y)" => "(x^y/ln(sinh(y^x))+y^tanh(x)/cos(x*y))'x'y'x"}

Benchee.run(%{
"parallel" => fn -> eval(expression, context) end
})

Name               ips        average  deviation         median         99th %
parallel          0.21         4.75 s     ±1.60%         4.75 s         4.80 s

Name               ips        average  deviation         median         99th %
noparallel        0.23         4.44 s     ±6.61%         4.44 s         4.64 s

Still parallel version is slower! My machine:

Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-3635QM CPU @ 2.40GHz
Number of Available Cores: 8
Available memory: 8 GB
Elixir 1.11.0
Erlang 23.1.1

LostKobrakai · October 16, 2020, 8:41am

I hope you’re aware of that, but neither :timer.tc nor Benchee will measure time taken to compile stuff. They measure time taken to run code.

So your initial post’s code is essentially comparing sequential sleeps (single process doing both) vs. async sleeps in multiple processes. It needs half the time because it’s two sleeps after each other vs. two sleeps in parallel.

The 230 millis for s=0 are the overhead not of the macro (compilation is long done at the point of measurement), but the overhead of spawning and using the Task processes for evaluating the code a.k.a. sleeping.

So if you have performance benefits depends on if spawning a process and sending it instructions to eval in parallel is faster than doing the work in a single process. If that’s the case depends not only on your code, but on the number of cores available, how fast the cores do eval the code and how fast message passing will be (especially depends on the size of the messages).

mpope · October 16, 2020, 3:42pm

Maybe I am misunderstanding your library’s usecase. Do you want equation solving to happen in parallel, or do you want the parsing lumped in there as well? Asking because it seems alot is going on in eval and if you can isolate parts it’ll be easier to find what the performance issue is. If you’re just trying to measure the time of solution execution maybe you can pre-parse then measure the parallel solving?