Improve regex performance or temporarily disable it during bulk insertion?

I have a resource that has the following fields with regexes:

    attribute :cnpj_basico, :string do
      public? true
      allow_nil? false

      constraints min_length: 8, max_length: 8, match: ~r/^[A-Za-z0-9]{8}$/
    end

    attribute :cnpj_sufixo, :string do
      public? true
      allow_nil? false

      constraints min_length: 6, max_length: 6, match: ~r/^\d{6}$/
    end

    attribute :cnpj_formatado, :string do
      public? true
      allow_nil? false

      constraints min_length: 18, max_length: 18, match: ~r/^\d{2}\.\d{3}\.\d{3}\/\d{4}-\d{2}$/
    end

I need to bulk insert millions of rows into the database using this resource.

I noticed that casting the resource is significantly slower when the match constraints are in place.

Here are the most expensive calls with match enabled using eprof:

#                                                                                  CALLS     % TIME µS/CALL
Total                                                                               3676 100.0 2155    0.59
...
:crypto.strong_rand_bytes_nif/1                                                        1  0.79   17   17.00
Ecto.Type.cast_fun/1                                                                  13  0.84   18    1.38
Ash.Changeset.force_change_attribute/3                                                10  1.07   23    2.30
Enum."-map/2-lists^map/1-1-"/2                                                        80  1.16   25    0.31
:lists.member/2                                                                       47  1.16   25    0.53
:lists.keyfind/3                                                                      40  1.16   25    0.63
:ets.match_object/2                                                                    4  1.30   28    7.00
:re.import/1                                                                         495 24.41  526    1.06
Core.Cnpj.Estabelecimento.persisted/0                                                 41 38.61  832   20.29

And here are the most expensive ones if I remove the match constraints:

#                                                                                  CALLS    % TIME µS/CALL
Total                                                                               3160 100.  592    0.19
Spark.Dsl.Extension.persisted!/3                                                      43 1.69   10    0.23
Ash.Changeset.do_change_attribute/4                                                    7 1.86   11    1.57
:erlang.module_loaded/1                                                               30 1.86   11    0.37
anonymous fn/1 in Ash.Changeset.expand_upsert_fields/2                                37 2.20   13    0.35
Enum."-map/2-lists^map/1-1-"/2                                                        80 2.36   14    0.18
:crypto.strong_rand_bytes_nif/1                                                        1 2.53   15   15.00
Ash.Changeset.force_change_attribute/3                                                10 2.70   16    1.60
:erlang.binary_to_atom/2                                                               8 2.70   16    2.00
:ets.match_object/2                                                                    4 4.22   25    6.25

As you can see, the total time went from 2155 µs to 592 µs.

This is even more pronounced when processing the data in bulk (in chunks of 10_000): my times went from ~6 seconds to 0.3~0.5 seconds.

So, is there some way to optimize these checks in the Ash.Changeset calls during casting?

If not, is there some way for me to disable the regex check during bulk insertion so I can make it faster?

Interesting. You could likely switch to using a change to validate the values, and then do it conditionally in bulk action calls instead of on each attribute?
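One way to read that suggestion (a sketch only; the :skip_format_checks? argument is a hypothetical name, not an Ash built-in): keep the regex in a change on the create action, and let bulk callers opt out of it via an argument.

    create :create do
      argument :skip_format_checks?, :boolean, default: false

      change fn changeset, _context ->
        value = Ash.Changeset.get_attribute(changeset, :cnpj_sufixo)

        cond do
          # Bulk insertion of pre-validated data can opt out of the check.
          Ash.Changeset.get_argument(changeset, :skip_format_checks?) ->
            changeset

          # Regular callers still get the format validation.
          is_binary(value) and Regex.match?(~r/^\d{6}$/, value) ->
            changeset

          true ->
            exception =
              Ash.Error.Changes.InvalidAttribute.exception(
                value: value,
                field: :cnpj_sufixo,
                message: "must be six digits"
              )

            Ash.Changeset.add_error(changeset, exception)
        end
      end
    end

Regular create calls would keep the validation, while the bulk-loading code path passes the argument to skip it.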

Regexes could be translated to binary pattern matches at compile time, that would be significantly faster.
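For example, something like ~r/^\d{6}$/ can already be hand-written as a plain function head with a binary pattern and guards, which never touches :re at all (a minimal illustrative sketch; DigitsCheck is just a made-up module name):

    defmodule DigitsCheck do
      # Exactly six bytes, each an ASCII digit - no regex engine involved.
      def six_digits?(<<a, b, c, d, e, f>>)
          when a in ?0..?9 and b in ?0..?9 and c in ?0..?9 and
                 d in ?0..?9 and e in ?0..?9 and f in ?0..?9,
          do: true

      def six_digits?(_other), do: false
    end

    DigitsCheck.six_digits?("666666")
    #=> true
    DigitsCheck.six_digits?("66a666")
    #=> false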

That’s true :smiley: I’ve wanted a generic regex transpilation library for a while and have thought of writing it myself. Something to take an Elixir regex and transpile it to a match spec, JS, and PostgreSQL especially (naturally accepting that not all regexes can be translated that way).

In unicode_string I have a function Unicode.Set.to_pattern/1 that transpiles Unicode sets into binary patterns (or into :re-compatible regexes). Adapting that to transpile a subset of regex into a pattern match wouldn’t be too much work, so I’ll look at that.

Another thought that strikes me is that maybe, for Ash, something like the old COBOL PIC clause might be a good abstraction - it is declarative and it would map directly to a binary pattern match. The examples in this thread would be supported by such a construct. match could be reserved for regex, and pattern could be added for binary pattern matching.

Here I’m using the following symbols, but they could be anything:

  • X - ASCII alphanumeric
  • A - ASCII alphabetic
  • D - ASCII digits
  • Any other character - a literal

I can envisage additional symbols for

  • Unicode alphanumeric
  • Unicode digit
  • ASCII alphabetic
  • Unicode alphabetic
  • Sign matching (+ / -)
  • Upper/lower case matching
  • Quote mark matching (open then close)
  • Unicode general category matching (generalised form of the above)
  • Unicode script matching

A pattern could look like the following (using the examples in the thread):

# ~r/^[A-Za-z0-9]{8}$/
X(8)

# ~r/^\d{6}$/
D(6)

# ~r/^\d{2}\.\d{3}\.\d{3}\/\d{4}-\d{2}$/
DD.DDD.DDD/DDDD-DD

I think this is quite an elegant and declarative way to express many (but clearly not all!) data format expectations. And since they can be transpiled to binary pattern matches, they would be a good fit for the BEAM. They can also abstract away some of the complexities of "what is a letter", "what is a digit" and "what is whitespace", given Unicode has quite extensive character repertoires for these categories and others.
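To make that concrete, here is a tiny, purely illustrative interpreter for such patterns (PicSketch and pic_match?/2 are made-up names, and the X(8) repetition shorthand is left out for brevity). A real implementation would presumably generate a compile-time binary match rather than interpret the pattern at runtime:

    defmodule PicSketch do
      # An empty pattern must fully consume the value.
      def pic_match?(<<>>, <<>>), do: true

      # X - ASCII alphanumeric
      def pic_match?(<<?X, pat::binary>>, <<c, val::binary>>)
          when c in ?0..?9 or c in ?a..?z or c in ?A..?Z,
          do: pic_match?(pat, val)

      # A - ASCII alphabetic
      def pic_match?(<<?A, pat::binary>>, <<c, val::binary>>)
          when c in ?a..?z or c in ?A..?Z,
          do: pic_match?(pat, val)

      # D - ASCII digit
      def pic_match?(<<?D, pat::binary>>, <<c, val::binary>>)
          when c in ?0..?9,
          do: pic_match?(pat, val)

      # Any other pattern byte must appear literally in the value.
      def pic_match?(<<c, pat::binary>>, <<c, val::binary>>),
        do: pic_match?(pat, val)

      def pic_match?(_pattern, _value), do: false
    end

    PicSketch.pic_match?("DD.DDD.DDD/DDDD-DD", "22.333.333/4444-22")
    #=> true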

Hmm… I much prefer something that doesn’t require the user to learn a new pattern language. It’s a neat idea though, and pulling from an existing standard is the only thing that really makes it tenable.

sick :smiling_face_with_sunglasses:

So @zachdaniel, I just tried your suggestion. Basically, I changed my attributes to this:

    attribute :cnpj_basico, :string do
      public? true
      allow_nil? false

      constraints min_length: 8, max_length: 8
    end

    attribute :cnpj_sufixo, :string do
      public? true
      allow_nil? false

      constraints min_length: 6, max_length: 6
    end

    attribute :cnpj_formatado, :string do
      public? true
      allow_nil? false

      constraints min_length: 18, max_length: 18
    end

And added these changes to my create action:

      change fn changeset, _ ->
        regex = ~r/^[A-Za-z0-9]{8}$/

        cnpj_basico = Ash.Changeset.get_attribute(changeset, :cnpj_basico)

        if not is_nil(cnpj_basico) and Regex.match?(regex, cnpj_basico) do
          changeset
        else
          exception =
            Ash.Error.Changes.InvalidAttribute.exception(
              value: cnpj_basico,
              field: :cnpj_basico,
              message: "must match the pattern %{regex}",
              vars: [regex: inspect(regex)]
            )

          Ash.Changeset.add_error(changeset, exception)
        end
      end

      change fn changeset, _ ->
        regex = ~r/^\d{6}$/

        cnpj_sufixo = Ash.Changeset.get_attribute(changeset, :cnpj_sufixo)

        if not is_nil(cnpj_sufixo) and Regex.match?(regex, cnpj_sufixo) do
          changeset
        else
          exception =
            Ash.Error.Changes.InvalidAttribute.exception(
              value: cnpj_sufixo,
              field: :cnpj_sufixo,
              message: "must match the pattern %{regex}",
              vars: [regex: inspect(regex)]
            )

          Ash.Changeset.add_error(changeset, exception)
        end
      end

      change fn changeset, _ ->
        regex = ~r/^\d{2}\.\d{3}\.\d{3}\/\d{4}-\d{2}$/

        cnpj_formatado = Ash.Changeset.get_attribute(changeset, :cnpj_formatado)

        if not is_nil(cnpj_formatado) and Regex.match?(regex, cnpj_formatado) do
          changeset
        else
          exception =
            Ash.Error.Changes.InvalidAttribute.exception(
              value: cnpj_formatado,
              field: :cnpj_formatado,
              message: "must match the pattern %{regex}",
              vars: [regex: inspect(regex)]
            )

          Ash.Changeset.add_error(changeset, exception)
        end
      end

So, theoretically, AFAIK, these two should be equivalent and take the same amount of time.

But when I compared them both in my bulk loading, the times were very different. For example, here are the times with my changes:

process establishments: 0.235459
process establishments: 0.235015
process establishments: 0.256663
process establishments: 0.229017
process establishments: 0.241808
process establishments: 0.264744
process establishments: 0.264392
process establishments: 0.241751
process establishments: 0.236187
process establishments: 0.24704

Around 0.24s per batch

Now with the match constraint:

process establishments: 1.889976
process establishments: 1.869798
process establishments: 1.88271
process establishments: 1.924556
process establishments: 1.94613
process establishments: 2.154315
process establishments: 2.37499
process establishments: 2.378589
process establishments: 2.349797
process establishments: 2.36511

Around 2.1s per batch

Is this expected? I would expect both to take about the same amount of time. This makes me wonder if the match constraint is doing more work than necessary.

:thinking: that is very surprising. I’m wondering if we may be applying the regex more than once in the case where it’s an attribute constraint :thinking:

Yep, that seems to be the case. I ran both versions with cprof and I can see some interesting differences:

With the changes, I get this number of calls for these functions:

  :re.run/3                                                            3
  :re.import/1                                                         3
  Regex.safe_run/3                                                     3
  Regex.match?/2                                                       3
  anonymous fn/4 in Regex.safe_run/3                                   3

Which makes sense: I have 3 changes and each does one Regex.match? call.

Now, removing the changes and having the match constraint, I get these numbers:

  :re.import/1                                                       495
  :re.run/3                                                            3
  Regex.safe_run/3                                                     3
  Regex.match?/2                                                       3
  anonymous fn/4 in Regex.safe_run/3                                   3

As you can see, all the counts are the same except for :re.import/1, which is called 495 times.

Going back to eprof this is the heaviest call:

:re.import/1                                                        495 33.19  525    1.06 

In other words, it seems that, for some reason, having the match constraint makes Ash call :re.import/1 hundreds of times during a single Ash.Changeset.for_create call.

Can you do constraints: [match: {Spark.Regex, :cache, ["^\\d{8}$", ""]}] instead of constraints: [match: ~r/^\d{8}$/]?

You will need to change match: [ type: :regex ] to match: [ type: :regex_as_mfa ] in lib/ash/type/string.ex.
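For reference (assuming the :regex_as_mfa option type is restored), the first attribute in this thread would then look roughly like this, mirroring the {Spark.Regex, :cache, [source, opts]} form shown above:

    attribute :cnpj_basico, :string do
      public? true
      allow_nil? false

      constraints min_length: 8,
                  max_length: 8,
                  match: {Spark.Regex, :cache, ["^[A-Za-z0-9]{8}$", ""]}
    end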

It was reverted in this commit.

Ah, that explains it. I did see some code in the regex match part that allowed calling a function with the MFA, but when I tried it myself it broke earlier on because the support had been removed.

@zachdaniel Any idea about this? If there are any more tests or code changes that I can do here to help figure this out, just let me know :slight_smile:

I haven’t had a chance to look into this. I don’t think you should have to do that though; the regex should be compiled in. The cached regex stuff was added because of an inadvertent change in Elixir, caused by OTP 28.0 not supporting regexes at compile time. But that’s been patched in 28.1 and the latest Elixir AFAIK, so the regex should be compiled in. And in both cases they use the same option type for regexes.

If you can confirm that going back to the old option and using the tuple format solves the performance issue then it may be a simple fix.

I feel that trying to understand where the 495 calls to :re.import/1 come from is a better bet for fixing the issue than a workaround, no?

I tried to find out who is calling :re.import using this code:

defmodule Test do
  alias Core.Cnpj.Estabelecimento
  alias Core.Ash.Resource.Cast
  
  @args %{
    cnpj_basico: "11111111",
    cnpj_sufixo: "666666",
    cnpj_formatado: "22.333.333/4444-22",
    nome: "Blibs blobs",
    matriz?: false,
    situacao_cadastral: :ativa,
    inicio_da_atividade: Date.utc_today()
  }

  def run do
    tracer =
      spawn(fn ->
        receive_loop = fn receive_loop ->
          receive do
            {:trace_ts, pid, :call, {:re, :import, _args}, _ts} ->
              stack = Process.info(pid, :current_stacktrace)
              IO.puts("\n=== re.import called by #{inspect(pid)} ===")
              IO.inspect(stack)
              receive_loop.(receive_loop)

            _ ->
              receive_loop.(receive_loop)
          end
        end

        receive_loop.(receive_loop)
      end)

    :erlang.trace(:all, true, [:call, :timestamp, {:tracer, tracer}])
    :erlang.trace_pattern({:re, :import, 1}, true, [])
    @args
    |> Cast.cast(:create, Estabelecimento)
  end
end

But the stacktrace I got is not very useful. For my changes, I got this trace 3 times:

=== re.import called by #PID<0.95.0> ===
{:current_stacktrace,
 [
   {:code_server, :call, 1, [file: ~c"code_server.erl", line: 163]},
   {:code_server, :call, 1, [file: ~c"code_server.erl", line: 163]},
   {:code, :ensure_loaded, 1, [file: ~c"code.erl", line: 578]},
   {:error_handler, :undefined_function, 3,
    [file: ~c"error_handler.erl", line: 86]},
   {Ash.Changeset, :run_change_or_validation, 6,
    [file: ~c"lib/ash/changeset/changeset.ex", line: 3245]},
   {Ash.Changeset, :"-run_action_changes/6-fun-2-", 6,
    [file: ~c"lib/ash/changeset/changeset.ex", line: 3150]},
   {Enum, :"-reduce/3-lists^foldl/2-0-", 3, [file: ~c"lib/enum.ex", line: 2520]},
   {Ash.Changeset, :run_action_changes, 6,
    [file: ~c"lib/ash/changeset/changeset.ex", line: 3146]},
   {Ash.Changeset, :do_for_action, 4,
    [file: ~c"lib/ash/changeset/changeset.ex", line: 2577]},
   {Core.Ash.Resource.Cast, :cast, 3,
    [file: ~c"lib/core/ash/resource/cast.ex", line: 14]},
   {:elixir, :eval_external_handler, 3, [file: ~c"elixir.erl", line: 365]},
   {:erl_eval, :do_apply, 7, [file: ~c"erl_eval.erl", line: 924]},
   {:elixir, :eval_forms, 4, [file: ~c"elixir.erl", line: 343]},
   {Module.ParallelChecker, :verify, 1,
    [file: ~c"lib/module/parallel_checker.ex", line: 155]},
   ...
 ]}

And for the match constraint, I get this trace repeated 495 times:

=== re.import called by #PID<0.95.0> ===
{:current_stacktrace,
 [
   {:code_server, :call, 1, [file: ~c"code_server.erl", line: 163]},
   {:elixir_compiler, :dispatch, 4, [file: ~c"elixir_compiler.erl", line: 82]},
   {:elixir_compiler, :compile, 4, [file: ~c"elixir_compiler.erl", line: 56]},
   {:elixir_compiler, :maybe_fast_compile, 2,
    [file: ~c"elixir_compiler.erl", line: 43]},
   {:elixir_lexical, :run, 3, [file: ~c"elixir_lexical.erl", line: 17]},
   {:elixir_compiler, :quoted, 3, [file: ~c"elixir_compiler.erl", line: 21]},
   {Module.ParallelChecker, :verify, 1,
    [file: ~c"lib/module/parallel_checker.ex", line: 155]},
   {Code, :require_file, 2, [file: ~c"lib/code.ex", line: 1613]},
   {Kernel.CLI, :wrapper, 1, [file: ~c"lib/kernel/cli.ex", line: 562]},
   {Enum, :"-map/2-lists^map/1-1-", 2, [file: ~c"lib/enum.ex", line: 1688]},
   {Kernel.CLI, :process_commands, 1, [file: ~c"lib/kernel/cli.ex", line: 93]},
   {Kernel.CLI, :"-main/1-fun-1-", 2, [file: ~c"lib/kernel/cli.ex", line: 39]},
   {Kernel.CLI, :"-exec_fun/2-fun-0-", 3,
    [file: ~c"lib/kernel/cli.ex", line: 141]}
 ]}

:re.import is called by the code generated by expanding the ~r sigil (which needs to re-import the exported binary pattern back to a ref).

~r/foo/

is expanded as

%Regex{
  re_pattern:
    :re.import(
      {:re_exported_pattern, "re-PCRE2¤þ¿", "foo", [:export],
       "<<long-binary>>"}
    ),
  source: "foo",
  opts: []
}

Importing the exported pattern is faster than re-compiling from scratch (which is the reason it was introduced), but it is still slower than the actual matching.

But that should only happen 3 times, since I have only used the sigil 3 times in the resource, no?

Here is a single-file script that reproduces the issue:

Application.put_env(:ash, :validate_domain_resource_inclusion?, false)
Application.put_env(:ash, :validate_domain_config_inclusion?, false)

Mix.install([{:ash, "~> 3.0"}], consolidate_protocols: false)

defmodule Cast do
  def cast(arguments, action, resource) do
    %{accept: accept} = Ash.Resource.Info.action(resource, action)

    with {:ok, resource} <-
           resource
           |> Ash.Changeset.for_create(action, arguments)
           |> Ash.Changeset.apply_attributes() do
      {:ok, Map.take(resource, accept)}
    end
  end

  def cast!(arguments, action, resource),
    do: arguments |> cast(action, resource) |> Ash.Helpers.unwrap_or_raise!()
end

defmodule Tutorial.Profile do
  use Ash.Resource,
    domain: Tutorial

  actions do
    defaults [create: :*]
  end

  attributes do
    uuid_primary_key :id

    attribute :attr_1, :string do
      public? true
      allow_nil? false

      constraints min_length: 8, max_length: 8, match: ~r/^[A-Za-z0-9]{8}$/
    end
  end
end

defmodule Tutorial do
  use Ash.Domain

  resources do
    resource Tutorial.Profile
  end
end
defmodule Test do
  @args %{attr_1: "11111111"}

  def run do
    tracer =
      spawn(fn ->
        receive_loop = fn receive_loop ->
          receive do
            {:trace_ts, pid, :call, {:re, :import, _args}, _ts} ->
              stack = Process.info(pid, :current_stacktrace)
              IO.puts("\n=== re.import called by #{inspect(pid)} ===")
              IO.inspect(stack)
              receive_loop.(receive_loop)

            _ ->
              receive_loop.(receive_loop)
          end
        end

        receive_loop.(receive_loop)
      end)

    :erlang.trace(:all, true, [:call, :timestamp, {:tracer, tracer}])
    :erlang.trace_pattern({:re, :import, 1}, true, [])

    @args |> Cast.cast(:create, Tutorial.Profile)
  end
end

Test.run()

This is more a question for Ash maintainers, since I don’t know what the macros get expanded to.
But to give a pure Elixir example:

# import is called in the lambda, so n times
Enum.map(strings, fn string -> string =~ ~r/foo/ end)

# import is called once
regex = ~r/foo/
Enum.map(strings, fn string -> string =~ regex end)

This is a bit of an unfortunate state of things, but given the way it works in OTP, I don’t think there is an easy way around this. Maybe the compiler could try to be smart and translate the first into the second, but it might break if the lambda invocation is done across nodes.

The ideal fix would be to have actual constants in OTP as @michalmuskala suggested in the past, but I don’t think there’s any active plan for it at the moment.

Ok, I tested your suggestion and it indeed fixed the issue. I didn’t even need to change my constraint; I just left it as a regex, since Spark will convert it into the cached MFA form regardless:

  defp validate_type(:regex_as_mfa, _key, %Regex{} = regex) do
    source = Regex.source(regex)
    opts = Regex.opts(regex)
    {:ok, {Spark.Regex, :cache, [source, opts]}}
  end

So, the only change I made was switching the match type in lib/ash/type/string.ex from :regex to :regex_as_mfa.

With this, the code is at least twice as fast. Here is a comparison using my code above, looped 10k times with eprof:

with :regex:

#                                                                 CALLS     %   TIME µS/CALL
Total                                                            966500 100.0 146590    0.15

with :regex_as_mfa

#                                                                 CALLS    %  TIME µS/CALL
Total                                                            912500 100. 59629    0.07

That is roughly 2.5x faster.

@zachdaniel would you consider reverting the types for string.ex, ci_string.ex, type.ex and match.ex from that commit from :regex to :regex_as_mfa?

Hmm… yes. I would accept a PR that does that :person_bowing:.

There you go :slight_smile: