Compiling files with many function heads is very slow (OTP 26 issue?)

I have quite a few files in my project that have many function heads (100s to 1000s), and I don’t think they can easily be rewritten another way. For example, transliterating Japanese into Latin characters: right now I have compile-time code that reads from this Unihan_Readings.txt file [warning: 5 MB text file] and generates functions that match the Japanese character at the beginning of the string. This can’t be replaced by String.split, because some matches are multi-character, and the variation in byte length is even wider, so binary pattern matching is the most obvious solution.
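
For illustration, the generated heads look roughly like this (the function name and readings here are made up, not my actual code):

# Illustrative heads only; the real module is generated from Unihan_Readings.txt
def transliterate(<<"きゃ", rest::binary>>, acc), do: transliterate(rest, ["kya" | acc])
def transliterate(<<"き", rest::binary>>, acc), do: transliterate(rest, ["ki" | acc])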

Anyway, compiling a file like this is currently taking a LONG time, and it seems to have gotten a lot worse under OTP 26. It used to take maybe 30-40 seconds; now it takes 20-30 minutes (M1 MBP running Ventura 13.4, OTP 26.0, Elixir 1.15.3-otp-26). Is there any way to optimize this ahead of time for quicker compilation?

I’m currently running a compile with ERL_COMPILER_OPTIONS=time mix compile --force --profile time. So far (45 minutes in), here’s some of the output:

 beam_ssa_codegen              :      0.715 s  114644.0 kB
 beam_validator_strong         :      0.195 s  114644.0 kB
 beam_a                        :      0.018 s  115052.9 kB
 beam_block                    :      0.025 s  120606.8 kB
 beam_jump                     :      0.105 s  100114.5 kB
 beam_clean                    :      0.002 s  100114.5 kB
 beam_trim                     :      0.000 s  100114.5 kB
 beam_flatten                  :      0.000 s   99683.7 kB
 beam_z                        :      0.000 s   99668.4 kB
 beam_validator_weak           :      0.099 s   99668.4 kB
 beam_asm                      :      0.191 s   95366.4 kB
 beam_ssa_opt                  :   1876.602 s  277640.0 kB
    %% Sub passes of beam_ssa_opt from slowest to fastest:
    ssa_opt_alias              :   1866.038 s  99 %
    ssa_opt_live               :      2.007 s   0 %
    ssa_opt_type_start         :      1.973 s   0 %
    ssa_opt_dead               :      1.113 s   0 %
    ssa_opt_type_continue      :      0.958 s   0 %
    ssa_opt_bsm_shortcut       :      0.621 s   0 %
    ssa_opt_merge_blocks       :      0.447 s   0 %
    ssa_opt_cse                :      0.374 s   0 %

I’ve only seen such a huge slowdown when I was trying to compile for ARM on my x64 Mac. Are you sure you’re using a fully ARM-native terminal, Homebrew, and everything else?

It looks like everything is running on ARM.

$ uname -m
arm64

And Activity Monitor shows beam.smp running as “Apple”, which I believe means arm64.

It would be nice if you showed at least some example of what your functions look like. From what I understand, you are generating code based on this file?

Here’s a stripped-down version of what I’m doing. This example doesn’t use the Unihan file, because that part has a bit of preprocessing, but it’s ultimately similar. Here, hepburn.tsv and ja_punct.tsv are two-column files that look like this:

あ	a
リュ	ryu
ロー	rō
ヲー	wō
ヤ	ya

hepburn.tsv is 604 lines; ja_punct.tsv is 91 lines.

defmodule JA do
  @priv :code.priv_dir(:my_app)
  @ja_path Path.join(@priv, "langs/ja")
  @punct_path Path.join(@priv, "ja_punct.tsv")
  @hepburn_path Path.join(@ja_path, "hepburn.tsv")
  @hepburn File.read!(@hepburn_path)
  |> String.trim()
  |> String.split("\n")
  |> Enum.map(&String.split(&1, "\t"))
  |> Enum.map(&List.to_tuple/1)

  @punct File.read!(@punct_path)
  |> String.trim()
  |> String.split("\n")
  |> Enum.map(&String.split(&1, "\t"))
  |> Enum.map(&List.to_tuple/1)

  # Longest keys first, so multi-character sequences match before their prefixes.
  @all Enum.sort_by(@hepburn ++ @punct, fn {a, _} -> -byte_size(a) end)

  def replace("" <> string), do: replace(string, [])

  def replace("", acc), do: acc

  # Generate one replace/2 head per {key, value} pair; unquote fragments
  # inject each literal at compile time.
  Enum.each(@all, fn {k, v} ->
    def replace(<<unquote(k), rest::binary>>, acc),
      do: replace(rest, [unquote(Macro.escape(v)) | acc])
  end)

  # Fallback: copy a single unmatched byte through unchanged.
  def replace(<<char::binary-size(1), rest::binary>>, acc), do: replace(rest, [char | acc])
end
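
For reference, with the sample rows above, this accumulates the replacements in reverse order:

# Using only the sample mappings shown above
JA.replace("リュロー")
#=> ["rō", "ryu"]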

This might be a good place to start. From what I remember, Enum.sort_by/3 is not one of the most performant functions. A good way to start debugging might be to time the individual steps, and I would begin with this sort, since it’s one of the possible bottlenecks.
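
For example, a minimal sketch that times the sort in the module body with :timer.tc/1 (attribute names taken from the module above):

# Rough compile-time timing; not a precise benchmark.
{micros, sorted} =
  :timer.tc(fn ->
    Enum.sort_by(@hepburn ++ @punct, fn {a, _} -> -byte_size(a) end)
  end)

IO.puts("sort took #{micros} µs")
@all sorted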

Consider something like nimble_csv to parse the TSV files. You’re iterating through the data a bit more often than needed.
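
For example, a sketch of the hepburn.tsv parsing (TSV is a made-up parser module name, and this assumes nimble_csv is added as a dependency; parse_string skips a header row by default, which these files don’t have):

# Hypothetical: TSV is a parser module generated by NimbleCSV at compile time.
NimbleCSV.define(TSV, separator: "\t")

@hepburn @hepburn_path
         |> File.read!()
         |> TSV.parse_string(skip_headers: false)
         |> Enum.map(fn [k, v] -> {k, v} end)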

If I run those module attributes with an |> IO.inspect() at the end, they all finish within a second, which leads me to believe the issue is not so much the parsing of the TSV or the sorting. I just logged a count, and it’s a total of 16,089 items that would be used to generate the functions dynamically. From what I can tell, the compile-time code on the Elixir side runs in a matter of seconds; I think it’s the Erlang compilation that’s taking 30 minutes.

It’s certainly possible that this can be optimized, but from what I can tell, it’s not the culprit. I’m not sure if this is an accurate way to test how long those module attributes take to run, but if I do something like this:

defmodule JA do
  @time System.monotonic_time(:millisecond)

  @data ....
    |> ....
    |> IO.inspect(label: "Done #{System.monotonic_time(:millisecond) - @time}")
end

It shows that all my module attributes run in <200 milliseconds.

That’s useful information. In that case it’s really likely that this is due to the compilation of the functions, not the compile-time logic. IIRC from older discussions, anything beyond around 1,000 functions/heads would be considered problematic, so 16k is indeed large. IIRC the cldr libraries also ran into this every now and then. That might be something to search for.

Yeah, it’s probably well beyond what they’re developing it for, but it is a fairly clean way to code it, which is what drew me to it. Since there’s so much variation in the byte size of what I’m matching, and I ultimately can’t ask all the world’s languages to change, the search space will have however many items the language dictates. I could recode it to progressively pull bytes off the front of the string and accumulate them until they match a key in a Map holding the 16k items, but that seems a bit convoluted compared to the current solution.
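
For what it’s worth, here’s a rough sketch of that alternative (the module name and sample table are made up; @table would hold the 16k entries, e.g. Map.new(@all)). It checks the longest possible prefix first, so it keeps the longest-match behaviour of the sorted function heads:

defmodule JAMap do
  # Stand-in for the real 16k-entry table.
  @table %{"リュ" => "ryu", "ロー" => "rō", "あ" => "a"}
  @max_key @table |> Map.keys() |> Enum.map(&byte_size/1) |> Enum.max()

  def replace(string), do: do_replace(string, [])

  defp do_replace("", acc), do: acc

  defp do_replace(string, acc) do
    case longest_match(string, min(@max_key, byte_size(string))) do
      {value, rest} ->
        do_replace(rest, [value | acc])

      :nomatch ->
        # Same fallback as the current solution: pass one byte through unchanged.
        <<char::binary-size(1), rest::binary>> = string
        do_replace(rest, [char | acc])
    end
  end

  # Try the longest prefix first, shrinking one byte at a time.
  defp longest_match(_string, 0), do: :nomatch

  defp longest_match(string, len) do
    <<prefix::binary-size(len), rest::binary>> = string

    case @table do
      %{^prefix => value} -> {value, rest}
      _ -> longest_match(string, len - 1)
    end
  end
end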

But with all that said, my current solution works just fine, and on OTP 25 it compiled in a reasonable 20-30 seconds rather than 20-30 minutes, so maybe I’ll just go back down to OTP 25 for now.

It looks like this GitHub issue José opened a month ago may be related: Undesired(?) slow down in ssa_opt_alias · Issue #7432 · erlang/otp · GitHub

It’s the only mention of ssa_opt_alias I can find on Google


EDIT: debunked, NVM. See post below

Incorrect Hypothesis

IIRC, multi-clause same-arity function heads compile down to a single function that wraps an equivalent case statement in the Erlang Abstract Format.

Based on my reading of this thread, perhaps what’s taking so long is the compiler doing function-call-reachability checks for every function head. Maybe you could meta-program a single function clause with a single case statement instead, to skip this overhead?

defmodule Test do
  @all [{"foo", :bar}, {"fizz", :buzz}]

  all_case_statements = Enum.flat_map(@all, fn {k, v} ->
    quote do
      <<unquote(k), rest::binary>> -> replace(rest, [unquote(Macro.escape(v)) | var!(acc)])
    end
  end)

  def replace("", acc) do
    acc
  end

  def replace(thing, acc) do
    case thing, do: unquote(all_case_statements)
  end
end

This works for me locally:

Test.replace("foofizzfoo", [])
#=> [:bar, :buzz, :bar]

That shouldn’t make any difference as the compiler does the same optimisations for the case clauses as it does for the function clauses.


That’s what I’m counting on. My (untested, speculative) hypothesis is that the issue (particularly the slow compile times for a ridiculous number of function heads) is partially caused by a performance regression during lexical analysis of definitions and call sites, which a single head/case statement avoids while still being compiled with the same runtime optimizations.

I had to step away from my computer for the day before I could set up a reproduction to see if this helps, but I wanted to share the hypothesis first. Again, total speculation, but not concerning the optimization pass.


I was itching to test this hypothesis, and it looks like it is indeed bunk! Both versions take the same time to compile, based on some rough benchmarking.

This is with roughly 10,000 function heads (strictly 23^3 ≈ 12k, since the ranges below are 0..22) on my M2 machine:

defmodule Test do

  ## BASELINE: run without uncommenting any code
  ## time mix compile --force -> 0.63 secs

  @all (for x <- 0..22, y <- 0..22, z <- 0..22 do
          {{x, y, z}, Integer.to_string(x) <> Integer.to_string(y) <> Integer.to_string(z)}
        end)

  def replace("", acc) do
    acc
  end

  ## TEST 1: uncomment this block
  ## time mix compile --force -> 44.51 secs

  # Enum.each(@all, fn {{x, y, z}, v} ->
  #   def replace(<<unquote(x), unquote(y), unquote(z), rest::binary>>, acc) do
  #     replace(rest, [unquote(Macro.escape(v)) | acc])
  #   end
  # end)

  ## TEST 2: uncomment this block
  ## time mix compile --force -> 44.41 secs

  # all_case_statements =
  #   Enum.flat_map(@all, fn {{x, y, z}, v} ->
  #     quote do
  #       <<unquote(x), unquote(y), unquote(z), rest::binary>> ->
  #         replace(rest, [unquote(Macro.escape(v)) | var!(acc)])
  #     end
  #   end)
  # def replace(thing, acc) do
  #   case thing, do: unquote(all_case_statements)
  # end
end

Interestingly, I didn’t realize I was running OTP 25 at first, so I initially tested this with 100,000 function heads instead (47^3 ≈ 100k), and my tests took 60s. After upgrading to OTP 26, however, they refused to finish within 15 minutes, so :sweat_smile:… If you want a distilled example for a bug report, you can use my code here.


Thanks for doing this! It does at least confirm that my experience isn’t something that’s only affecting me.

It’s a reasonable approach. Sounds like some issue with OTP 26.

I am doing something similar here: GitHub - cogini/public_suffix_list: Elixir library which uses the https://publicsuffix.org/ list to parse DNS domains

That’s over 10k heads, and compilation performance has never been that bad.


Most likely the same as this one: Undesired(?) slow down in ssa_opt_alias · Issue #7432 · erlang/otp · GitHub - TL;DR: they are working on a fix.

This may also help in future cases: How to debug Elixir/Erlang compiler performance - Dashbit Blog


@josevalim I’m a little unclear on what was added in OTP 26 that had such a drastic effect, and what the benefit of having it there is. If it’s for optimizing potentially unreachable function heads, would it be possible to add some sort of compiler annotation to skip the new check for these lengthy modules?

If you have a file that is literally this, or something that essentially amounts to this:

File.read!("many_10s_1000s_of_lines.json")
|> some_parsing()
|> Enum.each(fn {k, v} ->
  def some_function(<<unquote(k), rest::binary>>, acc), do: some_function(rest, [unquote(v) | acc])
end)

would some sort of @compile {:way_too_many_lines, :dont_even_bother} (obviously don’t rely on me for good naming suggestions :slight_smile:) help alleviate this issue? Or maybe I’m misunderstanding what the new code is doing, and it’s equally useful for these kinds of files.

But just in case you do want more great naming suggestions, this could work:

@compile {:i_know_more_than_the_compiler, ignoring: :ample_evidence_to_the_contrary}