Losing binary compilation optimizations when moving functions to dynamically compiled module

I’m trying to optimize the codepagex library so that it doesn’t use massive amounts of memory during compilation, because loading a lot of encodings requires an ungodly number of GBs of RAM. The core issue is that the library generates all encodings’ transformation functions into a single module, so with many heavy encodings that module holds a huge number of function clauses in memory while it compiles.

My first thought was to split each encoding into its own module, let each module compile in parallel, and unload it from memory once all of its function clauses have been loaded. However, I suspect this somehow disables the private_append optimization, because I’m now calling into another module and passing the binary reference there; the benchmarks show a massive increase in memory usage.

Here’s how the function clauses are built. My only change here is renaming the functions to to_string and making them public; I’ve also tested keeping them private, but that doesn’t change the benchmark:

defmacro def_to_string(name, encoding) do
  quote(bind_quoted: [n: name, e: encoding], generated: true, unquote: false) do
    alias Codepagex.Mappings.Helpers
    fn_name = Helpers.function_name_for_mapping_name("to_string", n)

    # one match clause per {from, to} codepoint mapping
    for encoding_point <- e do
      case encoding_point do
        {from, to} ->
          defp unquote(fn_name)(
                unquote(from) <> rest,
                acc,
                missing_fun,
                outer_acc
              ) do
            unquote(fn_name)(
              rest,
              acc <> <<unquote(to)::utf8>>,
              missing_fun,
              outer_acc
            )
          end
      end
    end

    defp unquote(fn_name)("", result, _, outer_acc) do
      {:ok, result, outer_acc}
    end

    # no codepoint clause matched: delegate to the caller-supplied missing_fun
    defp unquote(fn_name)(rest, acc, missing_fun, outer_acc) do
      case missing_fun.(rest, outer_acc) do
        res = {:error, _, _} ->
          res

        {:ok, codepoints, new_rest, new_outer_acc} ->
          unquote(fn_name)(
            new_rest,
            acc <> codepoints,
            missing_fun,
            new_outer_acc
          )
      end
    end
  end
end
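
For context, here’s roughly what one generated clause looks like after expansion. This is an illustrative sketch only, assuming fn_name resolved to :to_string_ascii and a mapping entry {"A", 0x41}:

defp to_string_ascii("A" <> rest, acc, missing_fun, outer_acc) do
  # tail call that appends to acc; eligible for the binary append
  # optimization as long as acc has no other live references
  to_string_ascii(rest, acc <> <<0x41::utf8>>, missing_fun, outer_acc)
end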

Here’s how the original code builds the clauses:

# define the to_string_xxx for each encoding
for {n, m} <- @encodings, do: Helpers.def_to_string(n, m)

# define methods to forward to_string(...) to a specific implementation
for {name, _} <- @encodings do
  fun_name = Helpers.function_name_for_mapping_name("to_string", name)

  def to_string(binary, unquote(name |> String.to_atom()), missing_fun, acc) do
    unquote(fun_name)(binary, <<>>, missing_fun, acc)
  end
end

def to_string(_, encoding, _, acc) do
  {:error, "Unknown encoding #{inspect(encoding)}", acc}
end

# define the from_string_xxx for each encoding
for {n, m} <- @encodings, do: Helpers.def_from_string(n, m)

# define methods to forward from_string(...) to a specific implementation
for {name, _} <- @encodings do
  fun_name = Helpers.function_name_for_mapping_name("from_string", name)

  def from_string(
        string,
        unquote(name |> String.to_atom()),
        missing_fun,
        acc
      ) do
    unquote(fun_name)(string, <<>>, missing_fun, acc)
  end
end

def from_string(_, encoding, _, acc) do
  {:error, "Unknown encoding #{inspect(encoding)}", acc}
end
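
For reference, here’s a hypothetical call into this dispatcher (the encoding atom :ascii and the missing_fun are illustrative; the real library registers encodings under their full mapping names and supplies its own missing-fun strategies):

missing_fun = fn rest, acc -> {:error, "unmapped bytes in #{inspect(rest)}", acc} end
to_string("hello", :ascii, missing_fun, nil)
#=> {:ok, "hello", nil}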

And its benchmark:

Name                                          ips        average  deviation         median         99th %
ascii_to_string                           2170.95      0.00046 s    ±10.50%      0.00045 s      0.00060 s
iso_to_string                             2027.25      0.00049 s     ±9.27%      0.00049 s      0.00064 s
iso_from_string                           1588.24      0.00063 s     ±8.83%      0.00061 s      0.00080 s
ascii_from_string                          345.06      0.00290 s     ±9.99%      0.00276 s      0.00342 s
ascii_from_gigantic_string                  0.136         7.38 s     ±0.00%         7.38 s         7.38 s
iso_from_gigantic_string                    0.135         7.40 s     ±0.00%         7.40 s         7.40 s
erlang_unicode_from_gigantic_string        0.0785        12.73 s     ±0.00%        12.73 s        12.73 s

Comparison: 
ascii_to_string                           2170.95
iso_to_string                             2027.25 - 1.07x slower +0.00003 s
iso_from_string                           1588.24 - 1.37x slower +0.00017 s
ascii_from_string                          345.06 - 6.29x slower +0.00244 s
ascii_from_gigantic_string                  0.136 - 16016.26x slower +7.38 s
iso_from_gigantic_string                    0.135 - 16056.97x slower +7.40 s
erlang_unicode_from_gigantic_string        0.0785 - 27643.14x slower +12.73 s

Memory usage statistics:

Name                                   Memory usage
ascii_to_string                           195.31 KB
iso_to_string                             195.31 KB - 1.00x memory usage +0 KB
iso_from_string                           195.31 KB - 1.00x memory usage +0 KB
ascii_from_string                        6757.81 KB - 34.60x memory usage +6562.50 KB
ascii_from_gigantic_string                414.06 KB - 2.12x memory usage +218.75 KB
iso_from_gigantic_string                  195.31 KB - 1.00x memory usage +0 KB
erlang_unicode_from_gigantic_string  15661817.52 KB - 80188.51x memory usage +15661622.20 KB

**All measurements for memory usage were the same**

And here’s my refactor:

for {name, encodings} <- @encodings do
  parsed_name = String.replace(name, ["/", " "], "_")
  module_name = Module.concat(Codepagex.Functions.Generated, parsed_name)

  module_content =
    quote bind_quoted: [name: name, module_name: module_name, encodings: encodings] do
      defmodule module_name do
        require Codepagex.Mappings.Helpers
        alias Codepagex.Mappings.Helpers

        Helpers.def_to_string(name, encodings)
        Helpers.def_from_string(name, encodings)
      end
    end

  {{:module, module_name, module_binary, _}, _} = Code.eval_quoted(module_content)

  :code.load_binary(module_name, ~c"#{module_name}.beam", module_binary)
end

for {name, _} <- @encodings do
  module_name = Helpers.module_name_for_mapping_name(name)

  def to_string(binary, unquote(name |> String.to_atom()), missing_fun, acc) do
    unquote(module_name).to_string(binary, <<>>, missing_fun, acc)
  end

  def from_string(binary, unquote(name |> String.to_atom()), missing_fun, acc) do
    unquote(module_name).from_string(binary, <<>>, missing_fun, acc)
  end
end
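
As an aside, Module.create/3 wraps the quote–eval–load dance in a single call; here’s an untested sketch of the same generation step using it, under the same naming assumptions as my refactor above:

for {name, encodings} <- @encodings do
  parsed_name = String.replace(name, ["/", " "], "_")
  module_name = Module.concat(Codepagex.Functions.Generated, parsed_name)

  contents =
    quote bind_quoted: [name: name, encodings: encodings] do
      require Codepagex.Mappings.Helpers
      alias Codepagex.Mappings.Helpers

      Helpers.def_to_string(name, encodings)
      Helpers.def_from_string(name, encodings)
    end

  # compiles and loads the module in one step, no manual :code.load_binary/3
  Module.create(module_name, contents, Macro.Env.location(__ENV__))
end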

And its benchmark:

Name                                          ips        average  deviation         median         99th %
ascii_to_string                           1165.07      0.00086 s    ±12.50%      0.00084 s      0.00117 s
iso_to_string                             1094.68      0.00091 s    ±14.99%      0.00088 s      0.00125 s
iso_from_string                            900.31      0.00111 s    ±12.60%      0.00108 s      0.00160 s
ascii_from_string                          295.03      0.00339 s    ±14.80%      0.00315 s      0.00429 s
erlang_unicode_from_gigantic_string        0.0615        16.25 s     ±0.00%        16.25 s        16.25 s
ascii_from_gigantic_string                 0.0469        21.33 s     ±0.00%        21.33 s        21.33 s
iso_from_gigantic_string                   0.0395        25.35 s     ±0.00%        25.35 s        25.35 s

Comparison: 
ascii_to_string                           1165.07
iso_to_string                             1094.68 - 1.06x slower +0.00006 s
iso_from_string                            900.31 - 1.29x slower +0.00025 s
ascii_from_string                          295.03 - 3.95x slower +0.00253 s
erlang_unicode_from_gigantic_string        0.0615 - 18936.89x slower +16.25 s
ascii_from_gigantic_string                 0.0469 - 24845.47x slower +21.32 s
iso_from_gigantic_string                   0.0395 - 29528.73x slower +25.34 s

Memory usage statistics:

Name                                   Memory usage
ascii_to_string                             1.56 MB
iso_to_string                               1.75 MB - 1.12x memory usage +0.191 MB
iso_from_string                             2.14 MB - 1.37x memory usage +0.57 MB
ascii_from_string                           8.54 MB - 5.46x memory usage +6.98 MB
erlang_unicode_from_gigantic_string     15294.74 MB - 9779.09x memory usage +15293.18 MB
ascii_from_gigantic_string              38147.34 MB - 24390.48x memory usage +38145.77 MB
iso_from_gigantic_string                38147.13 MB - 24390.34x memory usage +38145.56 MB

**All measurements for memory usage were the same**

Any help or pointers are appreciated!

Here’s the branch with my changes on top of the main library, if you want to see the diff more clearly: mode functions into dynamic module · hawkyre/codepagex@35a5197 · GitHub

And what’s the question here?

Erlang loses this optimization because it allows modules to be changed at runtime in any possible way, so the compiler performs no optimizations that rely on information about a function in another module. (I’m trying to solve this problem in a compiler of mine, but there is a very long way to go.)
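
To illustrate with a minimal sketch, the pattern in question is the tail-recursive append loop from the Efficiency Guide’s “Constructing binaries” section:

defmodule AppendLoop do
  # all calls are local, so the compiler can (within this module) prove acc
  # is the only live reference at each append and grow the binary in place
  def dup(0, acc), do: acc
  def dup(n, acc), do: dup(n - 1, acc <> "x")
end

Once that loop sits behind a cross-module call boundary, the compiler can no longer prove that the accumulator it receives is unaliased, so it falls back to the general, copying-prone append.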

So you can’t have both, and perhaps there is some other way to approach this problem. Maybe you can reduce the amount of code generated at compile time and move some of it to initialization or runtime. Or maybe it’s an XY problem and the root cause is at another level. But these are just wild guesses; I’ll try to give this project a deeper look.

Hey man, I solved it! The issue was that I was creating the accumulation binary outside the module, which kept a reference to it; when there are two references to a binary, Erlang doesn’t optimize this tail-call append pattern, since it doesn’t know what will happen to the binary outside the module. There are some Erlang docs on constructing binaries that cover this, in case they help with your issue. Moving the accumulation binary inside the new module fixed the issue. I commented the new benchmarks in my PR if you’re interested: Divide function clause building in a single module per encoding by hawkyre · Pull Request #34 · tallakt/codepagex · GitHub
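
Concretely, the fix is to let each generated module create its own accumulator instead of receiving it from the dispatcher. An illustrative sketch, with hypothetical module and function names:

# dispatcher: no accumulator binary is created or passed here
def to_string(binary, :ascii, missing_fun, outer_acc) do
  Codepagex.Functions.Generated.Ascii.to_string(binary, missing_fun, outer_acc)
end

# generated module: <<>> is created locally, so the only reference to the
# accumulator lives inside this module and the append optimization applies
def to_string(binary, missing_fun, outer_acc) do
  do_to_string(binary, <<>>, missing_fun, outer_acc)
end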

The dynamic module’s AST is built just like any other module’s, so everything should be exactly the same regardless of how it’s compiled; we just have more control over how the module is built.


That’s some stellar work. Compliments.
