Reduce vs reduce by chunks performance

Hi guys!

I’ve been learning Elixir for some time now, and recently I’ve noticed something weird while playing with benchee.
I was creating different implementations for factorial calculation and measuring their time and memory consumption. Here are two of them:

  1. simple reduce of a range:
    def factorial_with_reduce(n)
        when is_integer(n) and n >= 0 do
        1..n//1
        |> Enum.reduce(1, &*/2)
    end
    
  2. split by chunks, reduce them, then reduce the reduced results:
    def factorial_with_reduce_on_chunks(n, chunk_size \\ 4)
        when is_integer(n) and n >= 0 do
        1..n//1
        |> Stream.chunk_every(chunk_size)
        |> Stream.map(fn list ->
            Enum.reduce(list, 1, &*/2)
        end)
        |> Enum.reduce(1, &*/2)
    end
    

I’ve also created some benchmarks and results surprised me very much.

As I expected, factorial_with_reduce consumed only a constant amount of memory. However, it consumed almost 2.5x more time than factorial_with_reduce_on_chunks.
I created a repo on Github specifically for this question (benchmarks are provided in README.md), please take a closer look if you wish: GitHub - IhorTsoi/factorial_implementations_question: This repo holds the code for performance related elixir question..

My question is: how can the second implementation be faster, if we still need to do n-1 multiplications + create intermediate chunks? How can it be 2.5x faster? Simple reduce does linear multiplication and nothing more, doesn’t it? What’s the problem with it (or my implementation of the function / its benchmark)?

I used small (5’000), medium (100’000) and large inputs (500’000). And at each point, factorial_with_reduce_on_chunks was significantly faster…

I’m really confused about it. Could you please help me on this?

In my opinion, as Enum are greedy functions, they need all data at first sight, the System does wait for 1..n values and then starts processing.

on the other side lazy streams, that’s just need only chunk size data, they don’t need to wait till that time the system is processing 1..n elements.

that’s the reason stream is faster than reduce.

2 Likes

If you want more information, you could check with a profiler (some are shipped with Erlang/Elixir, for instance see the mix profile.eprof task.)
That way you will see where the program is spending its time between the two variants.

2 Likes

I thought that maybe it has to do with multiplication, but my tests show maybe that’s not relevant…I’m including my thought process in case it’s helpful: (edit, I do think it’s multiplication after all)

In the Enum.reduce example, the product is getting large as iterations increase. So it’s doing lots of large number * large number. In the chunked example, the individual products are staying smaller during iteration until they’re combined at the end. The number of large multiplications should be lower in the chunked example, then.

I did a test of benchmarking Enum.each vs Stream chunks. The Enum.each was both faster and constant memory. I did a test of benchmarking large multiplication and…it wasn’t meaningfully different :frowning:

I did another test with Stream removed and the chunked example is 3x faster:

  def each(n) do
    1..n//1
    |> Enum.reduce(1, &*/2)
  end

  def chunk_each(n) do
    1..n//1
    |> Enum.chunk_every(4)
    |> Enum.map(fn list ->
      Enum.reduce(list, 1, &*/2)
    end)
    |> Enum.reduce(1, &*/2)
  end

##### With input 100'000 #####
Name                ips        average  deviation         median         99th %
ChunkEach          0.65         1.53 s    ±20.52%         1.48 s         1.96 s
Each               0.22         4.53 s     ±0.33%         4.53 s         4.54 s

When I change the function to not be reduce, but just each, then the non-chunked example is much faster (Enum.each(& &1))

##### With input 100'000 #####
Name                ips        average  deviation         median         99th %
Each             369.95        2.70 ms     ±1.52%        2.70 ms        2.83 ms
ChunkEach         94.72       10.56 ms     ±5.39%       10.36 ms       13.50 ms

Since the only change is multiplication, it really feels like that has to do with it…

edit: I just inspected the size of the output and I am going back to large multiplication being the culprit. My numbers weren’t large enough. The final number printed to a file is 446 KB large, so those multiplications are going to be more expensive (and there’s more of them).

Another fun thing is playing with the chunk size:

##### With input 100'000 #####
Name                   ips        average  deviation         median         99th %
ChunkEach 40          0.92         1.09 s     ±8.55%         1.09 s         1.24 s
ChunkEach 8           0.86         1.16 s     ±8.65%         1.16 s         1.25 s
ChunkEach 4           0.63         1.59 s    ±25.15%         1.47 s         2.17 s
Each                 0.175         5.72 s     ±0.00%         5.72 s         5.72 s

Comparison: 
ChunkEach 40          0.92
ChunkEach 8           0.86 - 1.06x slower +0.0658 s
ChunkEach 4           0.63 - 1.46x slower +0.50 s
Each                 0.175 - 5.24x slower +4.63 s
3 Likes

Enum.chunk_every is very inefficient for lists, because it uses the same generic implementation as Stream.chunk_every, so I think you could get even better numbers with a specialized version. I noticed this previously here where it had a big impact on the benchmark for a parallel map library that batched work in chunks: Pelemay Fast Parallel map has just been released! - #17 by dom

I couldn’t secure permission from my employer to contribute back the list specialization, but if somebody else could, it would be a cool improvement!

1 Like

In this particular case, the amount of time spent in chunk_every (or iteration in general) is miniscule compared to the cost of multiplication. I’m not sure you’d get better performance unless your list was significantly large.

2 Likes

I cloned the repo and profiled it:

mix profile.eprof -e "FactorialImplementationsQuestion.factorial_with_reduce(100_000)"
Warmup...


Profile results of #PID<0.515.0>
#                                                         CALLS     %    TIME µS/CALL
Total                                                    200006 100.0 5656689   28.28
Enum.reduce/3                                                 1  0.00       0    0.00
anonymous fn/0 in :elixir_compiler_2.__FILE__/1               1  0.00       0    0.00
Range.new/3                                                   1  0.00       1    1.00
FactorialImplementationsQuestion.factorial_with_reduce/1      1  0.00       2    2.00
:erlang.apply/2                                               1  0.00       3    3.00
Enum.reduce_range/5                                      100001  0.56   31882    0.32
:erlang.*/2                                              100000 99.44 5624801   56.25

Profile done over 7 matching functions
mix profile.eprof -e "FactorialImplementationsQuestion.factorial_with_reduce_on_chunks(100_000)"
Warmup...


Profile results of #PID<0.181.0>
#                                                                                     CALLS     %    TIME µS/CALL
Total                                                                                775049 100.0 2254229    2.91
:erlang.max/2                                                                             1  0.00       0    0.00
anonymous fn/0 in :elixir_compiler_1.__FILE__/1                                           1  0.00       0    0.00
anonymous fn/3 in Stream.Reducers.chunk_every/5                                           1  0.00       0    0.00
Enumerable.Range.reduce/3                                                                 1  0.00       0    0.00
anonymous fn/2 in Enumerable.Stream.do_reduce/3                                           2  0.00       0    0.00
anonymous fn/3 in Enumerable.Stream.do_reduce/3                                           2  0.00       0    0.00
Enumerable.Stream.reduce/3                                                                2  0.00       0    0.00
Stream."-fun.chunk_while/4-"/4                                                            1  0.00       0    0.00
anonymous fn/2 in Stream.chunk_while/4                                                    1  0.00       0    0.00
Stream."-fun.after_chunk_while/2-"/2                                                      1  0.00       0    0.00
anonymous fn/2 in Stream.map/2                                                            1  0.00       0    0.00
Stream.chunk_while_fun/2                                                                  1  0.00       0    0.00
Stream.chunk_every/2                                                                      1  0.00       0    0.00
Stream.after_chunk_while/2                                                                1  0.00       0    0.00
Enumerable.impl_for/1                                                                     3  0.00       0    0.00
FactorialImplementationsQuestion.factorial_with_reduce_on_chunks/1                        1  0.00       0    0.00
Range.new/3                                                                               1  0.00       1    1.00
Stream.Reducers.chunk_every/5                                                             1  0.00       1    1.00
Enumerable.Stream."-do_reduce/3-lists^foldl/2-0-"/3                                       4  0.00       1    0.25
Enumerable.Stream.do_each/4                                                               2  0.00       1    0.50
Enumerable.Stream.do_done/2                                                               2  0.00       1    0.50
Stream.map/2                                                                              1  0.00       1    1.00
Stream.chunk_while/4                                                                      1  0.00       1    1.00
Stream.chunk_every/4                                                                      1  0.00       1    1.00
Enumerable.impl_for!/1                                                                    3  0.00       1    0.33
:erlang.apply/2                                                                           1  0.00       2    2.00
Enumerable.Stream.do_reduce/3                                                             2  0.00       2    1.00
Enumerable.reduce/3                                                                       3  0.00       2    0.67
FactorialImplementationsQuestion.factorial_with_reduce_on_chunks/2                        1  0.00       2    2.00
Enum.take/2                                                                           25000  0.13    2964    0.12
Enum.reduce/3                                                                         25001  0.13    3002    0.12
anonymous fn/1 in FactorialImplementationsQuestion.factorial_with_reduce_on_chunks/2  25000  0.14    3141    0.13
:lists.reverse/2                                                                      25000  0.19    4179    0.17
:lists.reverse/1                                                                      25003  0.27    5991    0.24
anonymous fn/4 in Stream.map/2                                                        25000  0.28    6386    0.26
anonymous fn/3 in Enum.reduce/3                                                       25000  0.32    7234    0.29
anonymous fn/3 in Enumerable.Stream.reduce/3                                          50000  0.59   13222    0.26
anonymous fn/5 in Stream.Reducers.chunk_every/5                                      100000  0.85   19166    0.19
Enumerable.Range.reduce/5                                                            100001  1.07   24194    0.24
Enum."-reduce/3-lists^foldl/2-0-"/3                                                  125000  1.08   24319    0.19
anonymous fn/4 in Stream.chunk_while_fun/2                                           100000  1.21   27284    0.27
:erlang.*/2                                                                          125000 93.74 2113130   16.91

Profile done over 42 matching functions

You can clearly see that in the Stream.chunk-example, a lot more intermediate function calls are happening (775049 vs. 200006, 3.8 times as many calls).
However, in both cases more than 90% of the time is spent in the calls to :erlang.*/2.
What is very interesting here, is that the average time per call to :erlang.*/2 is 56.25 µs in the Enum.reduce case, but only 16.91 µs in the Stream.chunk case. That’s a more than 3.3 times speed difference!

The cause of this slowdown has to be that we are dealing with much larger numbers being passed to the multiplication operation in the Enum.reduce case (on average), than we are in the Stream.chunk case.

7 Likes

Guys, thanks very much for your help :grinning_face_with_smiling_eyes:

1 Like