# Body recursion faster than tail recursion in factorial

Hi, I’m new here. I’m finally trying out Elixir after hearing and reading about it for quite some time.

I’m implementing simple algorithms first, and already at factorial I found something perplexing. I’ve implemented it in several versions, one of them being tail recursive. I expected that one to be the fastest, but to my surprise it wasn’t: body recursion was.

I’ve read Erlang's Tail Recursion is Not a Silver Bullet and The Seven Myths of Erlang Performance, but these only talk about lists; when calculating a simple value, as factorial does, tail recursion should be faster, as far as I understand.

I’m doing the benchmark using Benchee, in a script run with `mix run benchmark.exs`. The project is a clean one created with `mix new`. Below is the code for the benchmark and the functions themselves.

Am I missing something? Are my implementations wrong? Is the benchmark wrong? Am I running it wrong? Is this “idiomatic code being optimized”?

```elixir
# benchmark.exs
n = 10000

Benchee.run(%{
  "body" => fn -> Factorial.body(n) end,
  "tail" => fn -> Factorial.tail(n) end
})
```
```elixir
# factorial.ex
defmodule Factorial do
  def body(n) do
    case n do
      0 -> 1
      1 -> 1
      n -> n * body(n - 1)
    end
  end

  def tail(n) do
    tail_internal(1, n)
  end

  defp tail_internal(acc, 0), do: acc

  defp tail_internal(acc, n) do
    tail_internal(acc * n, n - 1)
  end
end
```

One output is below, but the pattern of tail recursion being 5-15% slower repeats consistently, and the gap shrinks as the input grows, presumably because the function calls account for less and less of the total time (the difference is only about 1% when n is 1 000 000).

```
Operating System: Linux
CPU Information: AMD Ryzen 5 3600 6-Core Processor
Number of Available Cores: 12
Available memory: 12.44 GB
Elixir 1.12.2
Erlang 24.0.5

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 14 s

Benchmarking body...
Benchmarking tail...

Name           ips        average  deviation         median         99th %
body        3.20 K      312.96 μs     ±7.01%      309.20 μs      383.20 μs
tail        2.79 K      359.05 μs     ±6.48%         355 μs      441.43 μs

Comparison:
body        3.20 K
tail        2.79 K - 1.15x slower +46.09 μs
```

It might be that cleaning up the stack between each individual call is slower than doing all the calculation while accumulating stack frames and freeing them up in one batch at the end.

Tail-call optimization is much more about “allowing programs to run that could not run without it” and much less about “being faster than body recursion”.
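To sketch that first point (module and numbers are my own, not from the benchmark above): because the recursive call is the last thing the function does, the BEAM can reuse the current frame instead of pushing a new one, so this loop runs in constant stack space at any depth:

```elixir
defmodule Loop do
  # The recursive call is in tail position, so each call replaces the
  # current stack frame rather than stacking a new one on top of it.
  def down(0), do: :done
  def down(n), do: down(n - 1)
end

Loop.down(10_000_000)
```

A body-recursive equivalent would still run on the BEAM (the process stack grows as needed), but it would hold on to every frame until the base case is reached and the calls unwind.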


`10000!` is a pretty big number. Big-integer arithmetic is slow and will dominate the run time, so the small saving in function-call overhead is insignificant.

If you change the multiply to an add, you can clearly see that tail recursion is faster.
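For instance, the addition variants might look like this (a sketch with my own module and names, mirroring the shapes of the factorial versions). Every intermediate result stays a small integer, so arithmetic no longer swamps the call overhead:

```elixir
defmodule Sum do
  # Body-recursive sum of 1..n
  def body(0), do: 0
  def body(n), do: n + body(n - 1)

  # Tail-recursive sum of 1..n, accumulator passed along
  def tail(n), do: tail(n, 0)
  defp tail(0, acc), do: acc
  defp tail(n, acc), do: tail(n - 1, acc + n)
end
```

Benchmarking these two against each other isolates the function-calling overhead from the cost of bignum multiplication.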


Good analysis here by @ferd: Erlang's Tail Recursion is Not a Silver Bullet


This made me curious. My first guess was that passing along two arguments instead of one would add more overhead, so I tested with:

```elixir
defmodule Mod do
  def fun1(n) do
    case n do
      0 -> 1
      n -> fun1(n - 1)
    end
  end

  def fun2(_acc, 0), do: 1

  def fun2(acc, n) do
    fun2(acc, n - 1)
  end

  def fun3(0), do: 1
  def fun3(n), do: fun3(n - 1)
end

n = 10000
Benchee.run(%{
  "one arg" => fn -> Mod.fun1(n) end,
  "two args" => fn -> Mod.fun2(1000000000, n) end,
  "one arg pattern match" => fn -> Mod.fun3(n) end
})
```

But the results say otherwise:

```
Comparison:
two args                    95.19 K
one arg pattern match       83.46 K - 1.14x slower +1.48 μs
one arg                     80.19 K - 1.19x slower +1.97 μs
```

The two-argument version was actually faster than both single-argument test cases.

My next theory was that maybe doing so many multiplications at one time would have overhead.

```elixir
defmodule Mod do
  def fun4(n) do
    n * n * n * n * n * n * n * n * n * n * n * n * n * n * n * n * n
  end

  def fun5(n) do
    case n do
      n when n >= 11843044313729355057238118681361701 -> n
      _ -> fun5(n * n)
    end
  end
end

n = 101
Benchee.run(%{
  "all at once" => fn -> Mod.fun4(n) end,
  "one at a time" => fn -> Mod.fun5(n) end
})
```

Resulted in

```
Name                    ips        average  deviation         median         99th %
one at a time        3.41 M      292.87 ns  ±8152.96%           0 ns        1000 ns
all at once          2.84 M      352.35 ns  ±2610.58%           0 ns        1000 ns

Comparison:
one at a time        3.41 M
all at once          2.84 M - 1.20x slower +59.48 ns
```

So maybe that is it?

Very interesting. That would be related to the implementation of `*`, then, I guess?

This is true, which would also point to multiplication as the source of the difference. Very interesting.

I know `erlc` has the `-S` option to view the generated instructions. I’m not sure whether `elixirc` has something similar to take a deeper look.

No. This is related to the fact that arbitrary-precision integer arithmetic is slow.
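To put a rough number on that (my own quick check, not a proper benchmark): the cost of a single bignum multiplication grows with the size of the operands, which is why the arithmetic rather than the calling convention dominates at `n = 10000`:

```elixir
small = Integer.pow(10, 10)
big = Integer.pow(10, 100_000)

# :timer.tc/1 returns {elapsed_microseconds, result}
{t_small, _} = :timer.tc(fn -> small * small end)
{t_big, _} = :timer.tc(fn -> big * big end)

IO.puts("small*small: #{t_small} µs, big*big: #{t_big} µs")
```

The word-sized multiply is essentially free, while the 100 000-digit multiply takes measurably longer; the factorial benchmark is paying this growing cost at every step.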


Ok, but it happens in both cases. Both implementations do the same number of multiplications, just in a different order. Although by that line of reasoning, changing the operation to addition shouldn’t matter, but it does…

Remember, Elixir automatically promotes integers to arbitrary precision on an as-needed basis. Changing to addition eliminates all the large numbers, so everything fits inside a standard word size (64 bits on most computers).
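You can see the promotion happen with the Erlang debug helper `:erts_debug.flat_size/1` (my own sketch; it reports how many heap words a term occupies, and word-sized integers report 0 because they are stored directly in the tagged word):

```elixir
# A small integer is an immediate value: zero heap words.
IO.inspect(:erts_debug.flat_size(123))

# 10000! overflows a machine word long before the recursion finishes,
# so it ends up as a heap-allocated bignum spanning thousands of words.
big = Enum.reduce(1..10_000, 1, fn n, acc -> n * acc end)
IO.inspect(:erts_debug.flat_size(big) > 1_000)
```

The sums in the addition variant never leave the first category, which is why the function-call overhead becomes visible there.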
