Enum fusion?

Like @tcoopman said, the problem is side-effects: all of the side-effects of one Enum.map need to happen before any of the side-effects of the following one, or else the code doesn’t necessarily mean the same thing.
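For example, naively fusing two maps whose functions print would interleave output that the unfused pipeline keeps separated. A minimal sketch:

```elixir
# Two separate passes: every "first" line prints before any "second" line.
[1, 2]
|> Enum.map(fn x -> IO.puts("first: #{x}"); x + 1 end)
|> Enum.map(fn x -> IO.puts("second: #{x}"); x * 2 end)
# prints: first: 1, first: 2, second: 2, second: 3

# A naively fused single pass interleaves the prints, so it is not equivalent:
Enum.map([1, 2], fn x ->
  IO.puts("first: #{x}")
  y = x + 1
  IO.puts("second: #{y}")
  y * 2
end)
# prints: first: 1, second: 2, first: 2, second: 3
```

Both versions return `[4, 6]`; only the ordering of the side effects differs, which is exactly what fusion is not allowed to change.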

Loop fusion is workable in a language like Haskell for two reasons:

  • the type system can prove that the functions being fused don’t have side-effects
  • because of lazy evaluation, the functions always run in the same order with or without fusion
3 Likes

To be honest, I am working on an optimizing Elixir compiler right now, on this exact problem. There are a lot of rough edges. For example, maps with side effects can’t simply be joined into one map (as mentioned by @tcoopman above), but my compiler can prove when side effects are not possible.

Without any extra compilers or libraries, tail-recursive functions are always the most efficient way to traverse lists. The next most efficient option is the :lists module, which is inlined by the compiler with special @compile options.

UPD: Elixir’s for is always the slowest way to traverse a list, because it is actually translated to :lists.reverse(Enum.reduce(...)), which performs a double list traversal
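Roughly, a comprehension like `for x <- list, do: x + 1` behaves like the following hand-expanded sketch: one pass for the reduce, then a second for the reverse.

```elixir
list = [1, 2, 3]

# What `for x <- list, do: x + 1` roughly expands to:
result =
  list
  |> Enum.reduce([], fn x, acc -> [x + 1 | acc] end)
  |> :lists.reverse()

# result == [2, 3, 4]
```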

This was my thinking. Although the type system doesn’t encode side effects, it seemed to me that, in theory, a compiler should be able to figure out whether something like a mapper inside Enum.map has side effects. From there, it can simply not perform pipe fusion in the presence of side effects, but do it in the pure cases.

Nah, traditional Elixir and Erlang (and almost any BEAM language) compilers can’t make assumptions about inter-module calls, because modules can be recompiled and reloaded at runtime.

Therefore, even if Elixir’s compiler could prove that the code inside map is pure, it couldn’t join two Enum.maps if they contained any so-called remote calls.

1 Like

If by “efficient” you mean either fastest or least memory used, then this is not correct. As long as you’re not running out of memory or stack, it’s likely that body recursion beats tail recursion on both metrics.

https://ferd.ca/erlang-s-tail-recursion-is-not-a-silver-bullet.html

I mean the special list-traversal type of tail recursion which looks something like

defp increase([head | tail]) do
  [head + 1 | increase(tail)]
end
defp increase([]), do: []

This is the fastest and most memory-efficient way to traverse a list

That’s body recursion, given that the recursive call is not the last expression in the function body.

1 Like

What ferd wrote in his blog post is not body recursion. The Erlang compiler actually performs a tricky optimization where it deletes the stack frame, as in tail recursion, and just points the tail of the cons cell at the return value of the last call.

This is de facto tail recursion. Any code that makes it impossible to drop the stack frame (like throw) will make this function slower

Apologies, this is not generally true and potentially bad advice.

First of all, it performs a double traversal: one for the reduce, another for the reverse (which is optimized in C). It is classic tail recursion.

Second, for can perform map+filtering+flattening in one step, so as soon as you start piping into multiple Enum and Stream, for should win by considerable margins.

The only case where for is slower is when you are using it only for filtering or mapping a single list. And even then, the margins are very thin, as explained in Fred’s article.

In other words, for can be considerably faster in several use cases, and slower by thin margins in one case.
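As an illustration, a filter-then-map pipeline builds an intermediate list, while the equivalent comprehension does the same work in one pass. A minimal sketch:

```elixir
list = [1, 2, 3, 4, 5, 6]

# Two traversals plus an intermediate list:
piped =
  list
  |> Enum.filter(fn x -> rem(x, 2) == 0 end)
  |> Enum.map(fn x -> x * 10 end)

# One traversal, no intermediate list:
comprehended = for x <- list, rem(x, 2) == 0, do: x * 10

# piped == comprehended == [20, 40, 60]
```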

As far as I know, Erlang does not perform this. It generally eliminates stack frames as part of body recursion, but it does not perform any special behaviour for lists. Looking at the BEAM assembly, I see no hints of this either:

bar([H | T]) -> [H * 2 | bar(T)];
bar([]) -> [].
{function, bar, 1, 2}.
  {label,1}.
    {line,[{location,"foo.erl",4}]}.
    {func_info,{atom,foo},{atom,bar},1}.
  {label,2}.
    {test,is_nonempty_list,{f,3},[{x,0}]}.
    {get_list,{x,0},{x,1},{x,0}}.
    {gc_bif,'*',{f,0},2,[{x,1},{integer,2}],{x,1}}.
    {allocate,1,2}.
    {move,{x,1},{y,0}}.
    {call,1,{f,2}}. % bar/1
    {'%',{var_info,{x,0},[{type,{t_list,number,nil}}]}}.
    {test_heap,2,1}.
    {put_list,{y,0},{x,0},{x,0}}.
    {deallocate,1}.
    return.
  {label,3}.
    {test,is_nil,{f,1},[{x,0}]}.
    return.

However, my knowledge may be out of date. Maybe the JIT performs new tricks? Pointers would be appreciated in that case.

8 Likes

I’ve benchmarked all of this several times, and nothing beats the so-called “body” recursive variant, even on lists shorter than 10^5 elements

defmodule X do

  @compile :inline_list_funcs

  def increase([head | tail]) do
    [head + 1 | increase(tail)]
  end
  def increase([]), do: []

  def tail_increase([head | tail], acc) do
    tail_increase(tail, [head + 1 | acc])
  end
  def tail_increase(_, acc), do: :lists.reverse(acc)

  def for_increase(list) do
    for i <- list, do: i + 1
  end

  def enum(list) do
    Enum.map(list, fn x -> x + 1 end)
  end

  def lists(list) do
    :lists.map(fn x -> x + 1 end, list)
  end

end

list = Enum.to_list 1..100_000

%{
  "tail" =>  fn -> X.tail_increase(list, []) end,
  "body" =>  fn -> X.increase(list) end,
  "for" =>   fn -> X.for_increase(list) end,
  "enum" =>  fn -> X.enum(list) end,
  "lists" => fn -> X.lists(list) end
}
|> tap(fn tests ->
  1 = length(Enum.uniq(Enum.map(tests, fn {_, f} -> f.() end)))
end)
|> Benchee.run(time: 10, warmup: 2)

"""
body         907.08        1.10 ms    ±14.56%        1.12 ms        1.43 ms
tail         580.20        1.72 ms    ±35.70%        1.79 ms        3.19 ms
enum         503.79        1.98 ms    ±17.27%        1.91 ms        3.14 ms
lists        495.31        2.02 ms     ±8.98%        2.04 ms        2.50 ms
for          436.56        2.29 ms    ±30.84%        2.30 ms        4.31 ms

Comparison:
body         907.08
tail         580.20 - 1.56x slower +0.62 ms
enum         503.79 - 1.80x slower +0.88 ms
lists        495.31 - 1.83x slower +0.92 ms
for          436.56 - 2.08x slower +1.19 ms
"""

For bigger lists, nested lists, and filter-map, for is still 20-60% behind the recursive-private-functions variant.


Yeah, I might be wrong. I thought I saw this in erlang docs.

1 Like

And this is a great space for optimization my compiler is targeting.

Optimizing for and Enum pipes, flattening nested case-s, and removing the runtime checks for x.y should give around a 20% performance increase for Elixir code

You are right. The Erlang compiler does not do this optimization. It does attempt to keep the stack frame as small as possible when recursing, trimming the stack frame when possible.

Nope, no new tricks for body recursion.

5 Likes

Enum pipes are hard to optimize without changing semantics.

We can optimize for, though. The only way I know to optimize it requires doubling its body in the source code based on the generators. Unless we emit some sort of special Kernel.Utils.for for the cases where for works on a single collection?

Flattening nested cases is interesting, and perhaps even a feature the Erlang/OTP team could be interested in?

And I am definitely curious about the x.y stuff. :slight_smile:

Looking forward to learn more!

Yes, it is hard, but it is not impossible. Right now, I know that these optimizations are possible:

  1. Convert sum, count, product, frequencies to Enum.reduce
  2. Join zip and reduce to zip_reduce
  3. When one of steps is pure, I can join filter and filter, filter and map, map and map, map and reduce, filter and reduce, flat_map and flat_map, flat_map and reduce, filter and each, map and each
  4. Optimize for to plain map. In my compiler this optimization looks like this:
  def join_consecutive([{:lists, :reverse, []} = reverse, {Enum, :reduce, [[], reducer]} = reduce | tail]) do
    case reducer do
      tri(fn x, acc -> [body | acc] end) ->
        [{Enum, :map, [tri(fn x -> body end)]} | join_consecutive(tail)]

      _ ->
        [reverse | join_consecutive([reduce | tail])]
    end
  end
  5. filter and empty? can be joined
  6. If I know that Enum.reduce is definitely called on a list, I can replace the call with :lists.foldl, which removes one call but, more importantly, will be efficiently inlined by the Erlang compiler
  7. Enum.into(%{}) with a fun, or with a map, can become Map.new
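Done by hand, the fusion from point 3 for two pure maps is just function composition; a sketch of the before and after:

```elixir
list = [1, 2, 3]

# Before: two traversals with an intermediate list.
separate = list |> Enum.map(fn x -> x + 1 end) |> Enum.map(fn x -> x * 2 end)

# After fusion: a single traversal with the composed function.
fused = Enum.map(list, fn x -> (x + 1) * 2 end)

# separate == fused == [4, 6, 8]
```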

Yeah, but this is the hardest optimization I am working on. Benchmarks show that there is almost no general rule for when this optimization increases performance. So this is only for code-size reduction, and it must use some profiles, for which I need an infrastructure

defmodule X do

  def dot(map) do
    [
      map.x + map.y,
      map.y + map.x,
      map.x + map.y,
      map.y + map.x,
      map.x + map.y,
      map.y + map.x,
    ]
  end

  def mapdot(%{} = map) do
    [
      map.x + map.y,
      map.y + map.x,
      map.x + map.y,
      map.y + map.x,
      map.x + map.y,
      map.y + map.x,
    ]
  end

  def match(%{x: x, y: y}) do
    [
      x + y,
      y + x,
      x + y,
      y + x,
      x + y,
      y + x,
    ]
  end

end

Benchee.run(%{
  "dot" => fn -> X.dot(%{x: 1, y: 1}) end,
  "mapdot" => fn -> X.mapdot(%{x: 1, y: 1}) end,
  "match" => fn -> X.match(%{x: 1, y: 1}) end,
}, warmup: 1, time: 2)

"""
match        21.30 M       46.96 ns ±62508.26%           0 ns          88 ns
mapdot       20.62 M       48.51 ns ±60512.81%           3 ns          85 ns
dot           7.10 M      140.75 ns ±16462.57%          63 ns         206 ns

Comparison:
match        21.30 M
mapdot       20.62 M - 1.03x slower +1.55 ns
dot           7.10 M - 3.00x slower +93.80 ns
"""

So, the compiler can optimize code which uses . when it knows that the value is a map; however, it generates code which performs the check every time if it can’t prove that the variable map contains a map (and not a module).

So, there are two optimizations possible.

  1. Prove that the map is always a map. The simplest way to do this is to check if there is a module with such functions, haha
  2. Perform the check only once, and duplicate the code for each branch. This will increase the code size, but Erlang and Elixir devs usually write small functions, so I don’t expect performance to decrease
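A hand-written sketch of option 2, with a hypothetical module name: the is-it-a-map check happens once in the clause head, and the body reads the keys directly instead of re-checking at every map.x.

```elixir
defmodule DotOpt do
  # Hypothetical output of option 2: the map check happens once, in the
  # clause head, and the keys are read by pattern matching.
  def dot(%{x: x, y: y}) do
    [x + y, y + x, x + y, y + x, x + y, y + x]
  end

  # A real compiler would also have to emit a fallback clause for the case
  # where the value is a module (making map.x a remote call); omitted here.
end

DotOpt.dot(%{x: 1, y: 1})
# => [2, 2, 2, 2, 2, 2]
```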
2 Likes

I believe the issue right now is that map.x means both a remote call and a map lookup. So I believe it makes it impossible for the code that we generate to be optimized. We started supporting map.x and module.x() as two distinct calls, and we emit some warnings when possible, but it may be a while until we start emitting smarter (and smaller) code.

1 Like

What exactly is plaining of nested cases? Can you show an example?

case x do
  %{y: y} ->
    case y do
      %{z: z} -> {:ok, z}
      _ -> :error
    end
  _ ->
    :error
end

To

case x do
  %{y: %{z: z}} -> {:ok, z}
  _ -> :error
end

This particular optimization is useless on its own, but flattening nested cases can reduce the amount of code, reduce the number of clauses, and help generate more efficient switch instructions

Oh, I assume this happens a lot in Elixir because of with? In that case, once maybe expressions are fully integrated into Erlang/OTP, I am hoping we can compile to something more efficient.

How is this knowable, though? Every one of those functions you listed takes a user-supplied function, and the purity of that function is not knowable.