Tail Call Optimization in Elixir & Erlang – not as efficient and important as you probably think

I wrote up a detailed blog post about tail call optimization in Elixir/Erlang and its performance. The TL;DR, sort of, is that non-tail-call-optimized recursive functions (body-recursive) can be faster and more memory-efficient than TCO’d functions. I had never considered this before; the belief that TCO is always faster seems to be a common misconception. My example is also not contrived – it’s an implementation of map.

Posting here to share as I think it’s important knowledge :slight_smile: Also to get feedback, maybe someone has another implementation that blows mine out of the water.

PS: Hope this is the appropriate forum/tag?

11 Likes

Some time ago I read something, I think it was written by @rvirding but I’m not really sure, that touched on this same discussion.

Its TL;DR was roughly: when building a list whose order matters, it does not really matter whether you build it “on the stack” or in an accumulator that you reverse afterwards. One of the two might kill the stack, the other will just stress your GC. But whenever the order of the resulting list no longer matters, or when you reduce to a single value, an accumulator in a tail-recursive function is way better, because it is faster and does not stress the stack as much.
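The reduce-to-a-single-value case can be sketched in Elixir like this (a minimal illustration; `SumDemo` and its function names are made up for this example):

```elixir
defmodule SumDemo do
  # Body-recursive: each call waits on the stack for the result of the
  # next one, so the stack grows with the length of the list.
  def sum_body([]), do: 0
  def sum_body([head | tail]), do: head + sum_body(tail)

  # Tail-recursive: the accumulator carries the running total, so the
  # recursive call is the very last thing the function does and the
  # stack frame can be reused. No reversal is needed for a single value.
  def sum_tail(list, acc \\ 0)
  def sum_tail([], acc), do: acc
  def sum_tail([head | tail], acc), do: sum_tail(tail, head + acc)
end
```

Both return the same result; only the tail-recursive one runs in constant stack space.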

Also I do have an important nitpick about nomenclature:

We don’t write TCO’d functions. Tail Call Optimisation is a compile-time optimisation done by a compiler. This particular optimisation rewrites a tail-recursive function so that it can be compiled into a loop/direct jump in the target language instead of a function call. Since we don’t push and pop on the stack when looping, we save some cycles there, and we also do not need to grow the stack by 20k frames which are all more or less the same.
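A minimal Elixir sketch of a function the compiler can turn into such a loop (`CountDemo` is a made-up name for illustration):

```elixir
defmodule CountDemo do
  # The recursive call is the last expression evaluated, so the compiler
  # can reuse the current stack frame: each "call" becomes a jump back to
  # the top, and the function runs in constant stack space for any n.
  def countdown(0), do: :done
  def countdown(n) when n > 0, do: countdown(n - 1)
end
```

Calling `CountDemo.countdown(10_000_000)` completes without growing the stack, which is exactly the loop/jump behaviour described above.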

There are some discussions around about whether TCO is really a good thing, so in many compilers (not necessarily for Erlang or Elixir) you can switch off TCO (as well as other optimisations). Some languages (-> Python) even refuse to implement TCO in the reference implementation because of these concerns: when looping instead of calling, there are steps missing in the stack trace, and you can’t know how deep you actually are before getting an exception thrown in your face.

1 Like

Sorry, there are some cases where TCO functions are absolutely critical – not doing it can kill your system and make it worthless. You just have to get it right.

The examples you show are all calls to functions which (eventually) return, and in those cases TCO is nice but not a necessity. It may not save you heap but it will save you stack, which can actually save you GC as well[*].

No, the main case where TCO is critical is in process top loops. These functions never return (unless the process dies), so they would build up stack that is never released. Here you have to get it right; there are no alternatives. The same applies if your top loop is actually composed of a set of mutually calling functions. Sorry for pushing this again, and again, but it is critical. :slight_smile:
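Such a top loop can be sketched in Elixir as follows (a minimal example; `Counter` and its message shapes are invented for illustration):

```elixir
defmodule Counter do
  # A process top loop: this function never returns while the process is
  # alive, so the recursive call MUST be in tail position. A body-recursive
  # call here would grow the stack with every message handled, forever.
  def loop(count) do
    receive do
      {:increment, by} ->
        loop(count + by)

      {:get, caller} ->
        send(caller, {:count, count})
        loop(count)

      :stop ->
        :ok
    end
  end
end
```

A process spawned with `spawn(fn -> Counter.loop(0) end)` can handle millions of messages in constant stack space, because every call to `loop/1` replaces the current frame rather than stacking a new one.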

In other cases my way of choosing is the one which gives me the most intelligible code. And that varies depending on what the function is supposed to do and how I choose to do it.

Robert

[*] This is because the heap and stack share the same memory area and when it is full you get a gc. So keeping the stack small can delay gc.

16 Likes

Hey, thanks for pointing that out, and for the nitpick about nomenclature. You are right; I tend to mix the two, as I always think of it as “a function to which TCO will apply” – I’ll update the blog post to make the distinction between TCO and tail-recursive functions clearer.

Interesting to hear about Python – I never thought any language would refuse to implement it. :confused:

1 Like

Totally agreed, of course for some cases TCO is super critical. If I recall correctly, that was the case for OTP servers. I’ll also see that I incorporate your comment into the blog post. Thank you for taking the time to comment, it’s so great to have you around here! :slight_smile:

1 Like

A while back, I had a discussion on GitHub with @josevalim and other members of the Elixir core team, as I wanted to add a “this function is tail-recursive” note to List.foldl/3, and a “this function is not tail-recursive” note to List.foldr/3.

In the end it was not added as people argued that the difference was not important enough.

Coming from a Haskell background, I am still not fully convinced as in Haskell, tail-call optimization is extremely important, since you are dealing with lazily-evaluated infinite lists most of the time.


EDIT: Actually, I just found out about ‘Tail recursion modulo cons’ which basically lets languages that implement it treat the ‘cons’ operator (the operator that is used to add something to the front of a list) as something that does not prevent the rest of the function from being tail-recursive.

2 Likes

I think that actually the ‘tail recursion modulo cons’ is one of the things that creates the misconception of tail-recursion being fast.

In Haskell, the (tail-recursive!) map is written as follows (This is the implementation of map from the Prelude, Haskell’s standard library):

map _ []     = []
map f (x:xs) = f x : map f xs

That’s right! It is written exactly the same (*) as your non-tail-recursive map variant in Elixir:

  def map_body([], _func), do: []
  def map_body([head | tail], func) do
    [func.(head) | map_body(tail, func)]
  end

Note that it is not at all required in Haskell’s version to reverse the list at the end. This reversal is probably where quite a lot of the time is spent in your tail-recursive map (you basically have to iterate over the list a second time), besides the memory requirement of the reverse that you have mentioned.
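For contrast, the tail-recursive Elixir variant with the accumulate-and-reverse pattern would look roughly like this (a sketch; `MapDemo.map_tail` is a hypothetical name, not the implementation from the blog post):

```elixir
defmodule MapDemo do
  # Tail-recursive map: conses each result onto an accumulator, which
  # builds the output in reverse order, then walks the whole list a
  # second time with Enum.reverse/1 to restore the original order.
  def map_tail(list, func, acc \\ [])
  def map_tail([], _func, acc), do: Enum.reverse(acc)
  def map_tail([head | tail], func, acc) do
    map_tail(tail, func, [func.(head) | acc])
  end
end
```

The recursive call is in tail position, but the price is that final `Enum.reverse/1` pass over the entire result.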

So, although it is still critical in the top loops of processes, as @rvirding has mentioned, I think that tail recursion becomes a whole lot less useful when you’re not allowed to construct lists without breaking it.

(*) (bar the parameter order – Haskell likes the list parameter at the end because of currying, while Elixir likes it at the front because of pipelines)

1 Like

As far as I understand how lists work in Elixir/Erlang it also has the “tail recursion modulo cons” optimisation. That’s why a map that builds up a list on the stack is also tail recursive in Elixir/Erlang.

From http://erlang.org/doc/efficiency_guide/myths.html:

In R12B and later releases, there is an optimisation that in many cases reduces the number of words used on the stack in body-recursive calls. A body-recursive list function and a tail-recursive function that calls lists:reverse/1 at the end will use the same amount of memory. lists:map/2, lists:filter/2, list comprehensions, and many other recursive functions now use the same amount of space as their tail-recursive equivalents.

With that in mind, both example functions end up being tail recursive (and thus faster); it’s just that one leverages an additional compiler optimisation to do that, so the fact that the function is tail recursive is hidden from the programmer.

1 Like

Hmm… Now I am not sure any more if any of the conclusions of Erlang’s Tail Recursion is Not a Silver Bullet are actually true.

I haven’t been able to find patch notes for Erlang R12B so far that support or reject this claim. (The section on the Erlang Myths page is not very clear; it is possible that it refers to a different kind of optimization.)


EDIT: I did find this interesting Erlang forum thread, which pointed me to the List Handling part of the Erlang doc’s efficiency section, which, however, states more or less the same as what is included in the Myths section, albeit with a clearer example.

The last post in that forum thread states:

Consider

    map(F, [X|Xs]) -> [F(X) | map(F, Xs)]; 
    map(_, []    ) -> []. 

By the classical definition, this is not tail recursive.
The Prolog equivalent, however, is.

The reason the Prolog version is tail recursive is that it
can build a data structure with a hole in it [F(X) | Hole]
and pass the hole:

    map(F, [X|Xs], &Result) => 
        *Result := [F(X) | Hole], 
        map(F, Xs, &Hole); 
    map(F, [], &Result) => 
        *Result := []. 

And this would be safe even in Erlang because nobody (other
than the garbage collector) can ever see the hole.
This extended version of “tail recursion” applies when the last
expression that the call occurs in is a data structure
construction.

The existing Erlang system doesn’t support extended tail
recursion, largely for the sake of the garbage collector.
But it could, without any change to the source language
or its semantics.

So this seems to refute the idea that the optimization made in R12B is tail-recursion modulo cons.
However, a few posts before, it is stated:

Body recursion:
For every message, build it on the heap, push the message
handle on the stack (1 word), call yourself meaning push the
return address (1 word). If we run out of stack + heap that are
in the same memory block (in the current implementation) do a
garbage collect that will increase the free space.

When the loop is ended we pop one stack frame at the time
(1+1 words) and build one list cons cell (2 words). The result
is built while we return and requires no garbage collections
since the pops from the stack match what’s built on the heap.

Tail recursion:
For every message, build it on the heap, build one list cons
cell (2 words) on the heap and loop to yourself. Garbage
collect just like above. Practically equivalent to the
body recursive variant but build all on the heap and
nothing on the stack.

When the loop is ended, do the lists:reverse/1 to build the
result. A new list is built on the heap requiring the same
size as the old list, and the reversed list becomes garbage.

So, tail recursion should require a top memory consumption of
2 * size(MsgList) words more than body recursion. This will
cause extra garbage collection. Also the lists:reverse/1 is
extra work that the body recursive variant avoids.

In this case I speculate body recursion wins.

In some other implementation where heap and stack are not in
the same memory block the situation changes.

So, because the heap and the stack are in a shared memory space, it seems that the difference is not that huge, and rather in favour of the body-recursive variant.

I am thoroughly confused by what all these different sources claim.

I think considering TCO from the standpoint of performance is possibly missing the point anyway. We can’t shut off our brains and say “Always use TCO” or “Never use TCO”. If there’s a possibility that the data set you’re processing is large enough that it might blow the stack, then use TCO. If TCO isn’t much harder to read or understand than the other way, then use TCO (or don’t). If using TCO will make the code significantly harder to comprehend, then I’d really need to see that there’s truly a chance that I will blow the stack.

I find it a little depressing (but not surprising) that so many software developers want easy answers: “Always do a”; “Never do a”. Life just isn’t that simple or clear-cut. Annoying but true.

As for ‘blowing the stack’. It seems that Erlang is very particular in that the stack and the heap are in the same memory space and grow towards each other; so using a lot of one or a lot of the other is okay (but not using a lot of both).

This is very true. There is no ‘universal tool’ in programming. Or life. :slight_smile:
It can be annoying at times, but it definitely makes the world a whole lot more interesting.

But I agree that you should know what it gives you, and why and when it is important. There are many cases where you end up with TCO as a result of a solution you chose for other reasons. For example, often when you carry values around in accumulators you end up with a tail-recursive function even if this was not the goal.

And the efficiency always needs to be measured. Saving stack can however save you gc even if you may not save memory.

1 Like

The Prolog example may be tail-recursive depending on the implementation, but you pay for it with memory. There is a hole in the list – sort of, anyway, as it refers to an unbound logical variable, which is not free.

No, Erlang does not have “tail recursion modulo cons”. They optimised the stack usage, but it is still not TR. What they mean is that if you write the TR version which then does a lists:reverse, then yes, you will save stack, as lists:reverse is TR, or implemented in C. But you will pay for it by building the list twice: once in the TR function and once in lists:reverse to get it into the right order. Take your pick: more stack and less heap, or less stack and more heap. Pest eller kolera (plague or cholera).

5 Likes

Thank you, that makes it clear.

I just updated the post after all this time (somebody commented) – when the list size is increased, it seems like tail-recursive functions get faster again. E.g. instead of the original 10_000-element list I used a 10_000_000-element list and recorded the benchmarking results here. The tail-recursive functions are almost twice as fast as the body-recursive version and the standard library version.

So, always benchmark your own use case – and it’s time that I give benchee its multiple-inputs feature :smiley:

(sorry for doing a little resurrection but I thought this was a rather important point to raise here)

6 Likes

Great to know!

And more resurrection: I used benchee 1.0 to take this (as one of the “original” benchee benchmarks) for a spin to see how things have improved, and found some peculiar behaviour: https://pragtob.wordpress.com/2019/04/08/revisiting-tail-call-optimization-in-elixir-erlang-with-benchee-1-0/

7 Likes