Pattern Matching Performance vs Regexes

mgwidmann · April 21, 2019, 4:21am

Ok so I have 3 implementations that I’ve benchmarked with surprisingly different results and I was wondering if anyone could explain them.

The problem is this, given a file with a list of strings (about 1100 strings total), each with a placeholder %s either at the beginning or end of the string. The objective is to extract the %s piece. I need to compare pattern matching performance against iterating a list of a regular expressions to determine the fastest implementation. I’m happy to say pattern matching wins in all cases but some of the pattern matching strategies yield dramatically different results than others…

I read through this explanation but it didn’t quite answer all my questions, take a look if you haven’t read this already as it is really good info.
http://erlang.org/doc/efficiency_guide/functions.html

Benchmark Code

The longest and last file has the string "%s1" which will match any string ending with the number 1. That is the worst case match so that is the first test. The second test is a complete miss.

Benchee.run(
  %{
    "matches? hit last element in large file" => fn ->
      PatternMatchTest.matches?("thisisareallylongguess1")
    end,
    "matches_regex? hit last element in large file" => fn ->
      PatternMatchTest.matches_regex?("thisisareallylongguess1")
    end
  },
  time: 10
)

IO.puts("\n\n")

Benchee.run(
  %{
    "matches? miss" => fn ->
      PatternMatchTest.matches?("thisisareallylongguess")
    end,
    "matches? miss" => fn ->
      PatternMatchTest.matches_regex?("thisisareallylongguess")
    end
  },
  time: 10
)

Implementation 1 (List)

Basically, this ends up producing a bunch of definitions like the following:

# Beginning of string matching
def matches?([s1, s2, s3, s4, s5, s6, "-", "s", "o", "m", "e", "t", "h", "i", "n", "g"]) do
  {true, [s1, s2, s3, s4, s5, s6] |> Enum.join("")}
end
# End of string matching
def matches?(["s", "o", "m", "e", "t", "h", "i", "n", "g", "-", s1, s2, s3, s4, s5, s6]) do
  {true, [s1, s2, s3, s4, s5, s6] |> Enum.join("")}
end

I allow for anywhere 1 to 25 characters before the string by varying the number of s variables in the function definition. This generates 25 * 1100 functions and takes a good minute or two to compile .

  def matches?(string) when is_binary(string) do
    string
    |> String.split("", trim: true)
    |> matches?()
  end

  for file <- ["ten_strings.txt", "one_thousand_strings.txt"] do
    for string <- File.stream!(file), i <- 1..25 |> Enum.reverse() do
      character_wildcards = 1..i |> Enum.map(& quote(do: var!(unquote({:"string#{&1}", [], Elixir}))))
      chunks = string
               |> String.trim()
               |> String.codepoints()
               |> Enum.reduce([], fn
                "s", ["%" | tail] ->
                  quote(do: [unquote_splicing(character_wildcards) | unquote(tail)])
                other_char, acc ->
                  [other_char | acc]
              end)
              |> Enum.reverse()


      # quote do
      #   def matches?(unquote(chunks)) do
      #     {true, [unquote_splicing(character_wildcards |> Enum.reverse())] |> Enum.join("")}
      #   end
      # end |> Macro.to_string() |> IO.puts()

      def matches?(unquote(chunks)) do
        {true, [unquote_splicing(character_wildcards |> Enum.reverse())] |> Enum.join("")}
      end
    end
  end

  def matches?(_other), do: false

Benchmark (5.49x speedup over regexes)

Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.8.0
Erlang 21.3.5

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking matches? hit last element in large file...
Benchmarking matches_regex? hit last element in large file...

Name                                                  ips        average  deviation         median         99th %
matches?  hit last element in large file              4.93 K        0.20 ms    ±25.01%       0.187 ms        0.38 ms
matches_regex?  hit last element in large file        0.90 K        1.11 ms    ±20.89%        1.05 ms        1.95 ms

Comparison:
matches? hit last element in large file              4.93 K
matches_regex? hit last element in large file        0.90 K - 5.49x slower +0.91 ms



Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.8.0
Erlang 21.3.5

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking matches? miss...
Benchmarking matches_regex? miss...

Name                                     ips        average  deviation         median         99th %
matches? miss              5.15 K       0.194 ms    ±26.13%       0.178 ms        0.36 ms
matches_regex? miss        0.94 K        1.07 ms    ±18.67%        1.00 ms        1.90 ms

Comparison:
matches? miss              5.15 K
matches_regex? miss        0.94 K - 5.49x slower +0.87 ms

Implementation 2 (Binaries)

This keeps the input in tact and uses binary pattern matching by specifying the size. Again, allowing for 1..25 bytes at the beginning and any number at the end for end of string matches. The functions generated look something like this:

# Beginning of string matching (25 functions defined per string)
def matches(<<s :: binary-size(12)>> <> "-something") do
  {true, s}
end
# End of string matching (one per string)
def matches("something-" <> s) do
  {true, s}
end

Where the size(12) will actually get iterated from 1..25 for when the %s is at the beginning. This generates about 25 * 600 + 450 functions since the end of string matching just generate a single definition.

  for file <- ["default.txt", "extended.txt"] do
    for string <- "lib/bucket_stream/permutations" |> Path.join(file) |> File.stream!(), i <- 1..25 |> Enum.reverse() do
      # Binary version
      [first, last] = string |> String.trim() |> String.split("%s")
      first = if String.length(first) == 0 do
        quote(do: <<var!(s) :: binary-size(unquote(i))>>)
      else
        first
      end
      last = if String.length(last) == 0 do
        quote(do: <<var!(s) :: binary-size(unquote(i))>>)
      else
        last
      end
      chunks = quote do
        unquote(first) <> unquote(last)
      end

      # quote do
      #   def matches?(unquote(chunks)) do
      #     {true, var!(s)}
      #   end
      # end |> Macro.to_string() |> IO.puts()

      def matches?(unquote(chunks)) do
        {true, var!(s)}
      end
    end
  end

  def matches?(_other), do: false

Benchmark (6.96x speedup over regexes)

Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.8.0
Erlang 21.3.5

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking matches? hit last element in large file...
Benchmarking matches_regex? hit last element in large file...

Name                                                               ips        average  deviation         median         99th %
matches? hit last element in large file              6.20 K       0.161 ms    ±32.30%       0.145 ms        0.35 ms
matxhes_regex? hit last element in large file        0.91 K        1.09 ms    ±19.66%        1.02 ms        1.94 ms

Comparison:
matches? hit last element in large file              6.20 K
matches_regex? hit last element in large file        0.91 K - 6.78x slower +0.93 ms



Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.8.0
Erlang 21.3.5

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking matches? miss...
Benchmarking matches_regex? miss...

Name                                     ips        average  deviation         median         99th %
public_permutation? miss              6.25 K       0.160 ms    ±45.62%       0.141 ms        0.36 ms
public_permutation_regex? miss        0.90 K        1.11 ms    ±30.28%        1.04 ms        2.13 ms

Comparison:
public_permutation? miss              6.25 K
public_permutation_regex? miss        0.90 K - 6.96x slower +0.95 ms

Implementation 3 (`when` gaurd clause using `in`)

This implementation defines exactly 26 functions. 1..25 for beginning of string matching, each with a when in guard clause and 1 more for end of string matches with a when in guard clause. The functions generated look something like this:

# Beginning of string matching (25 of these)
def matches?(<<s :: binary-size(12)>> <> string) when string in ["long", "list", "of", "strings", ...] do
  {true, string}
end
# End of string matching (just 1 of these)
def matches?(<<string>> <> s) when string in ["long", "list", "of", "strings", ...] do
  {true, s}
end

list_of_strings = (File.read!("ten_strings.txt") |> String.split("\n", trim: true)) ++
                          (File.read!("one_thousand_strings.txt") |> String.split("\n", trim: true))
                          # []

  grouped = list_of_strings
            |> Enum.group_by(fn p ->
              [first, _last] = p |> String.trim() |> String.split("%s")
              if first == "" do
                "first"
              else
                "last"
              end
            end)
            # |> IO.inspect

  for n <- 1..25 |> Enum.reverse() do
    first = grouped
            |> Map.get("first")
            |> Enum.map(fn string ->
              ["", s] = string |> String.trim() |> String.split("%s")
              s
            end)
    # quote do
    #   def matches?(<<s::binary-size(unquote(n))>> <> string) when string in unquote(first) do
    #     {true, s}
    #   end
    # end |> Macro.to_string() |> IO.puts
    def matches?(<<s::binary-size(unquote(n))>> <> string) when string in unquote(first) do
      {true, s}
    end
  end

  last = grouped
          |> Map.get("last")
          |> Enum.map(fn string ->
            [s, ""] = string |> String.trim() |> String.split("%s")
            s
          end)
  # quote do
  #   def matches?(<<s::binary-size(unquote(n))>> <> string) when string in unquote(last) do
  #     {true, s}
  #   end
  # end |> Macro.to_string() |> IO.puts
  def matches?(<<string>> <> s) when string in unquote(last) do
    {true, s}
  end

  def matches?(_other), do: false

Benchmark (132.96x on hit & 11.30x on miss, speedup over regexes)

Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.8.0
Erlang 21.3.5

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking matches? hit last element in large file...
Benchmarking matches_regex? hit last element in large file...

Name                                                               ips        average  deviation         median         99th %
matches? hit last element in large file            116.94 K     0.00855 ms   ±206.41%     0.00798 ms      0.0200 ms
matches_regex? hit last element in large file        0.88 K        1.14 ms    ±21.64%        1.06 ms        2.14 ms

Comparison:
matches? hit last element in large file            116.94 K
matches_regex? hit last element in large file        0.88 K - 132.96x slower +1.13 ms



Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.8.0
Erlang 21.3.5

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s

Benchmarking matches? miss...
Benchmarking matches_regex? miss...

Name                                     ips        average  deviation         median         99th %
matches? miss             10.53 K      0.0949 ms    ±23.51%      0.0860 ms       0.172 ms
matches_regex? miss        0.93 K        1.07 ms    ±19.71%        1.00 ms        1.90 ms

Comparison:
matches? miss             10.53 K
matches_regex? miss        0.93 K - 11.30x slower +0.98 ms

The guard clause surprised me to be so fast as I expected it to iterate the whole list like the regex implementation and so be slower than the other two.

Regex Implementation

  list_of_strings = (File.read!("ten_strings.txt") |> String.split("\n", trim: true)) ++
                      (File.read!("one_thousand_strings.txt") |> String.split("\n", trim: true))
                      # []
@regexes list_of_strings
    |> Enum.map(fn string ->
      [first, last] = string |> String.trim() |> String.split("%s")
      cond do
        first == "" -> {string, ~r/^(?<s>.+)#{Regex.escape(last)}$/}
        last == "" -> {string, ~r/^#{Regex.escape(first)}(?<s>.+)$/}
        true -> raise "Unable to determine regular expression for string: #{string}"
      end
    end)
    # |> IO.inspect()
  def matches_regex?(string) do
    case Enum.find(@regexes, fn {_string, regex} -> string =~ regex end) do
      {_string, regex} ->
        {true, regex |> Regex.named_captures(string) |> Map.get("s")}
      nil ->
        false
    end
  end

Any idea why the when guard clause is so much faster? Any way to improve the speed?

benwilson512 · April 21, 2019, 4:39pm

I haven’t had a chance to dig into the code, but do make sure to validate that each of your pattern matching versions gets the same results as your regex version. I’ve seen benchmark code that was surprisingly fast but only because it immediately returned with the wrong answer.

mgwidmann · April 22, 2019, 12:26am

Yeah sorry I didn’t put it in there but I did have just a simple positive and negative exunit test that checked for that. I came across the same issue where I got weird results and then wrote a test to check it (and had a bug that caused it to have different performance).

I’ll double check the 3rd implementation to be sure it’s correct since it’s the odd one.

lpil · April 22, 2019, 12:38pm

When in is used in a guard it expands at compile time to a clause per item in the list, which can then be optimised by the BEAM. It does not traverse the list at runtime

The Erlang efficiency guide has some information on pattern matching Erlang -- Functions

axelson · April 22, 2019, 7:27pm

What does %s mean in this case? Are you compiling patterns from the list of strings that you mention?

michalmuskala · April 22, 2019, 8:25pm

The issue with the regex implementation is that you have multiple regexes. The regex engine should be able to build a rather nice state machine for something like this if all the patterns are in a single regex or maybe two - one matching on prefixes and one on suffixes. One way to produce this single regex would be something like:

regex = 
  list_of_strings
  |> Enum.map(fn str -> "^#{String.replace(Regex.escape(str), "%s", "(.+)")}$" end)
  |> Enum.join("|")
  |> Regex.compile!()

case Regex.run(regex, string) do
  nil -> nil
  list -> List.last(list)
end

mgwidmann · April 22, 2019, 8:30pm

%s in this case is just the prefix or suffix string. The .* in the regular expression basically.

mgwidmann · April 22, 2019, 8:34pm

Yeah this will probably be faster you’re right, but would that allow extracting the wildcard string? If I used named captures like I am doing would it be ok to use the same name over and over again? Output needs to be more than just a boolean, I also need to know the string (the runtime one, not from the text files)

michalmuskala · April 22, 2019, 8:37pm

The solution I pasted should return the matched substring.

mgwidmann · April 23, 2019, 1:02am

Surprisingly this didnt improve the performance at all. Actually, I’d lean toward it slowing it down. Heres a snippet of what your code built as a regex (except there is over 1000 of them):

~r/^(.+)003$|^(.+)002$|^(.+)001$|^(.+)03$|^(.+)02$|^(.+)01$|^(.+)3$|^(.+)2$|^(.+)1$/

And the performance results:
(I’ve altered the test to include both a pre and post match for both pattern matching and the regex, so there are 6 tests now)

Name                                                                 ips        average  deviation         median         99th %
matches? hit last element in large file (post-match)            108.59 K     0.00921 ms   ±161.30%     0.00898 ms      0.0220 ms
matches? miss                                                     9.37 K       0.107 ms    ±75.48%       0.101 ms        0.21 ms
matches? hit last element in large file (pre-match)               8.41 K       0.119 ms    ±31.77%       0.114 ms        0.21 ms
matches_regex? miss                                               0.83 K        1.20 ms    ±30.12%        1.15 ms        1.91 ms
matches_regex? hit last element in large file (post-match)        0.78 K        1.28 ms    ±22.43%        1.22 ms        1.91 ms
matches_regex? hit last element in large file (pre-match)         0.70 K        1.43 ms    ±17.13%        1.37 ms        2.21 ms

Comparison:
matches? hit last element in large file (post-match)            108.59 K
matches? miss                                                     9.37 K - 11.58x slower +0.0975 ms
matches? hit last element in large file (pre-match)               8.41 K - 12.91x slower +0.110 ms
matches_regex? miss                                               0.83 K - 130.10x slower +1.19 ms
matches_regex? hit last element in large file (post-match)        0.78 K - 138.49x slower +1.27 ms
matches_regex? hit last element in large file (pre-match)         0.70 K - 155.02x slower +1.42 ms

As you can see, the 3 regex tests are between 130 and 150 times slower than a post match.

mgwidmann · April 23, 2019, 2:26am

So I’m really pushing the boundaries here. I generated a file with 9000 more entries by running the following code:

data = 100..9000 |> Enum.map(fn i -> if(i < 5000, do: "%s#{i}", else: "#{i}%s") end) |> Enum.join("\n")

This froze my computer (since my editor spawned like 10 elixir compiler instances) once and took about half my battery and 15 minutes to compile once I rebooted

There is now a total of 11 + 1085 + 8901 = 9997 patterns. I’ve updated the benchmark code to match the last pre/post match of the 9k file. The regexes cannot handle any more than 2000 strings at a time, so I had to chunk them into 5 different regexes and execute each of them until a match is found.

Here is the new benchmarks for the 3rd implementation vs the regex:

Name                                                                 ips        average  deviation         median         99th %
matches? hit last element in large file (post-match)             25.59 K      0.0391 ms    ±42.17%      0.0360 ms      0.0720 ms
matches? miss                                                     1.24 K        0.81 ms    ±19.88%        0.74 ms        1.47 ms
matches? hit last element in large file (pre-match)               1.11 K        0.90 ms    ±17.18%        0.84 ms        1.61 ms
matches_regex? hit last element in large file (post-match)        0.91 K        1.10 ms    ±18.18%        1.02 ms        2.02 ms
matches_regex? hit last element in large file (pre-match)         0.78 K        1.28 ms    ±18.85%        1.18 ms        2.28 ms
matches_regex? miss                                             0.0702 K       14.24 ms    ±12.50%       13.28 ms       19.59 ms

Comparison:
matches? hit last element in large file (post-match)             25.59 K
matches? miss                                                     1.24 K - 20.62x slower +0.77 ms
matches? hit last element in large file (pre-match)               1.11 K - 23.07x slower +0.86 ms
matches_regex? hit last element in large file (post-match)        0.91 K - 28.20x slower +1.06 ms
matches_regex? hit last element in large file (pre-match)         0.78 K - 32.68x slower +1.24 ms
matches_regex? miss                                             0.0702 K - 364.24x slower +14.20 ms