Ok so I have 3 implementations that I’ve benchmarked with surprisingly different results and I was wondering if anyone could explain them.
The problem is this, given a file with a list of strings (about 1100 strings total), each with a placeholder %s
either at the beginning or end of the string. The objective is to extract the %s
piece. I need to compare pattern matching performance against iterating a list of a regular expressions to determine the fastest implementation. I’m happy to say pattern matching wins in all cases but some of the pattern matching strategies yield dramatically different results than others…
I read through this explanation but it didn’t quite answer all my questions, take a look if you haven’t read this already as it is really good info.
http://erlang.org/doc/efficiency_guide/functions.html
Benchmark Code
The longest and last file has the string "%s1"
which will match any string ending with the number 1. That is the worst case match so that is the first test. The second test is a complete miss.
Benchee.run(
%{
"matches? hit last element in large file" => fn ->
PatternMatchTest.matches?("thisisareallylongguess1")
end,
"matches_regex? hit last element in large file" => fn ->
PatternMatchTest.matches_regex?("thisisareallylongguess1")
end
},
time: 10
)
IO.puts("\n\n")
Benchee.run(
%{
"matches? miss" => fn ->
PatternMatchTest.matches?("thisisareallylongguess")
end,
"matches? miss" => fn ->
PatternMatchTest.matches_regex?("thisisareallylongguess")
end
},
time: 10
)
Implementation 1 (List)
Basically, this ends up producing a bunch of definitions like the following:
# Beginning of string matching
def matches?([s1, s2, s3, s4, s5, s6, "-", "s", "o", "m", "e", "t", "h", "i", "n", "g"]) do
{true, [s1, s2, s3, s4, s5, s6] |> Enum.join("")}
end
# End of string matching
def matches?(["s", "o", "m", "e", "t", "h", "i", "n", "g", "-", s1, s2, s3, s4, s5, s6]) do
{true, [s1, s2, s3, s4, s5, s6] |> Enum.join("")}
end
I allow for anywhere 1 to 25 characters before the string by varying the number of s variables in the function definition. This generates 25 * 1100 functions and takes a good minute or two to compile .
def matches?(string) when is_binary(string) do
string
|> String.split("", trim: true)
|> matches?()
end
for file <- ["ten_strings.txt", "one_thousand_strings.txt"] do
for string <- File.stream!(file), i <- 1..25 |> Enum.reverse() do
character_wildcards = 1..i |> Enum.map(& quote(do: var!(unquote({:"string#{&1}", [], Elixir}))))
chunks = string
|> String.trim()
|> String.codepoints()
|> Enum.reduce([], fn
"s", ["%" | tail] ->
quote(do: [unquote_splicing(character_wildcards) | unquote(tail)])
other_char, acc ->
[other_char | acc]
end)
|> Enum.reverse()
# quote do
# def matches?(unquote(chunks)) do
# {true, [unquote_splicing(character_wildcards |> Enum.reverse())] |> Enum.join("")}
# end
# end |> Macro.to_string() |> IO.puts()
def matches?(unquote(chunks)) do
{true, [unquote_splicing(character_wildcards |> Enum.reverse())] |> Enum.join("")}
end
end
end
def matches?(_other), do: false
Benchmark (5.49x speedup over regexes)
Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.8.0
Erlang 21.3.5
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s
Benchmarking matches? hit last element in large file...
Benchmarking matches_regex? hit last element in large file...
Name ips average deviation median 99th %
matches? hit last element in large file 4.93 K 0.20 ms ±25.01% 0.187 ms 0.38 ms
matches_regex? hit last element in large file 0.90 K 1.11 ms ±20.89% 1.05 ms 1.95 ms
Comparison:
matches? hit last element in large file 4.93 K
matches_regex? hit last element in large file 0.90 K - 5.49x slower +0.91 ms
Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.8.0
Erlang 21.3.5
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s
Benchmarking matches? miss...
Benchmarking matches_regex? miss...
Name ips average deviation median 99th %
matches? miss 5.15 K 0.194 ms ±26.13% 0.178 ms 0.36 ms
matches_regex? miss 0.94 K 1.07 ms ±18.67% 1.00 ms 1.90 ms
Comparison:
matches? miss 5.15 K
matches_regex? miss 0.94 K - 5.49x slower +0.87 ms
Implementation 2 (Binaries)
This keeps the input in tact and uses binary pattern matching by specifying the size. Again, allowing for 1..25
bytes at the beginning and any number at the end for end of string matches. The functions generated look something like this:
# Beginning of string matching (25 functions defined per string)
def matches(<<s :: binary-size(12)>> <> "-something") do
{true, s}
end
# End of string matching (one per string)
def matches("something-" <> s) do
{true, s}
end
Where the size(12)
will actually get iterated from 1..25
for when the %s
is at the beginning. This generates about 25 * 600 + 450 functions since the end of string matching just generate a single definition.
for file <- ["default.txt", "extended.txt"] do
for string <- "lib/bucket_stream/permutations" |> Path.join(file) |> File.stream!(), i <- 1..25 |> Enum.reverse() do
# Binary version
[first, last] = string |> String.trim() |> String.split("%s")
first = if String.length(first) == 0 do
quote(do: <<var!(s) :: binary-size(unquote(i))>>)
else
first
end
last = if String.length(last) == 0 do
quote(do: <<var!(s) :: binary-size(unquote(i))>>)
else
last
end
chunks = quote do
unquote(first) <> unquote(last)
end
# quote do
# def matches?(unquote(chunks)) do
# {true, var!(s)}
# end
# end |> Macro.to_string() |> IO.puts()
def matches?(unquote(chunks)) do
{true, var!(s)}
end
end
end
def matches?(_other), do: false
Benchmark (6.96x speedup over regexes)
Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.8.0
Erlang 21.3.5
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s
Benchmarking matches? hit last element in large file...
Benchmarking matches_regex? hit last element in large file...
Name ips average deviation median 99th %
matches? hit last element in large file 6.20 K 0.161 ms ±32.30% 0.145 ms 0.35 ms
matxhes_regex? hit last element in large file 0.91 K 1.09 ms ±19.66% 1.02 ms 1.94 ms
Comparison:
matches? hit last element in large file 6.20 K
matches_regex? hit last element in large file 0.91 K - 6.78x slower +0.93 ms
Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.8.0
Erlang 21.3.5
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s
Benchmarking matches? miss...
Benchmarking matches_regex? miss...
Name ips average deviation median 99th %
public_permutation? miss 6.25 K 0.160 ms ±45.62% 0.141 ms 0.36 ms
public_permutation_regex? miss 0.90 K 1.11 ms ±30.28% 1.04 ms 2.13 ms
Comparison:
public_permutation? miss 6.25 K
public_permutation_regex? miss 0.90 K - 6.96x slower +0.95 ms
Implementation 3 (when
gaurd clause using in
)
This implementation defines exactly 26 functions. 1..25
for beginning of string matching, each with a when in
guard clause and 1 more for end of string matches with a when in
guard clause. The functions generated look something like this:
# Beginning of string matching (25 of these)
def matches?(<<s :: binary-size(12)>> <> string) when string in ["long", "list", "of", "strings", ...] do
{true, string}
end
# End of string matching (just 1 of these)
def matches?(<<string>> <> s) when string in ["long", "list", "of", "strings", ...] do
{true, s}
end
list_of_strings = (File.read!("ten_strings.txt") |> String.split("\n", trim: true)) ++
(File.read!("one_thousand_strings.txt") |> String.split("\n", trim: true))
# []
grouped = list_of_strings
|> Enum.group_by(fn p ->
[first, _last] = p |> String.trim() |> String.split("%s")
if first == "" do
"first"
else
"last"
end
end)
# |> IO.inspect
for n <- 1..25 |> Enum.reverse() do
first = grouped
|> Map.get("first")
|> Enum.map(fn string ->
["", s] = string |> String.trim() |> String.split("%s")
s
end)
# quote do
# def matches?(<<s::binary-size(unquote(n))>> <> string) when string in unquote(first) do
# {true, s}
# end
# end |> Macro.to_string() |> IO.puts
def matches?(<<s::binary-size(unquote(n))>> <> string) when string in unquote(first) do
{true, s}
end
end
last = grouped
|> Map.get("last")
|> Enum.map(fn string ->
[s, ""] = string |> String.trim() |> String.split("%s")
s
end)
# quote do
# def matches?(<<s::binary-size(unquote(n))>> <> string) when string in unquote(last) do
# {true, s}
# end
# end |> Macro.to_string() |> IO.puts
def matches?(<<string>> <> s) when string in unquote(last) do
{true, s}
end
def matches?(_other), do: false
Benchmark (132.96x on hit & 11.30x on miss, speedup over regexes)
Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.8.0
Erlang 21.3.5
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s
Benchmarking matches? hit last element in large file...
Benchmarking matches_regex? hit last element in large file...
Name ips average deviation median 99th %
matches? hit last element in large file 116.94 K 0.00855 ms ±206.41% 0.00798 ms 0.0200 ms
matches_regex? hit last element in large file 0.88 K 1.14 ms ±21.64% 1.06 ms 2.14 ms
Comparison:
matches? hit last element in large file 116.94 K
matches_regex? hit last element in large file 0.88 K - 132.96x slower +1.13 ms
Operating System: macOS
CPU Information: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
Number of Available Cores: 4
Available memory: 16 GB
Elixir 1.8.0
Erlang 21.3.5
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 24 s
Benchmarking matches? miss...
Benchmarking matches_regex? miss...
Name ips average deviation median 99th %
matches? miss 10.53 K 0.0949 ms ±23.51% 0.0860 ms 0.172 ms
matches_regex? miss 0.93 K 1.07 ms ±19.71% 1.00 ms 1.90 ms
Comparison:
matches? miss 10.53 K
matches_regex? miss 0.93 K - 11.30x slower +0.98 ms
The guard clause surprised me to be so fast as I expected it to iterate the whole list like the regex implementation and so be slower than the other two.
Regex Implementation
list_of_strings = (File.read!("ten_strings.txt") |> String.split("\n", trim: true)) ++
(File.read!("one_thousand_strings.txt") |> String.split("\n", trim: true))
# []
@regexes list_of_strings
|> Enum.map(fn string ->
[first, last] = string |> String.trim() |> String.split("%s")
cond do
first == "" -> {string, ~r/^(?<s>.+)#{Regex.escape(last)}$/}
last == "" -> {string, ~r/^#{Regex.escape(first)}(?<s>.+)$/}
true -> raise "Unable to determine regular expression for string: #{string}"
end
end)
# |> IO.inspect()
def matches_regex?(string) do
case Enum.find(@regexes, fn {_string, regex} -> string =~ regex end) do
{_string, regex} ->
{true, regex |> Regex.named_captures(string) |> Map.get("s")}
nil ->
false
end
end