Is there a faster way to write this code?

performance
strings

#1

Does anyone know a faster way to write this code?

  def escape_body(body) do
    body
    |> String.replace("&", "&")
    |> String.replace("<", "&lt;")
    |> String.replace(">", "&gt;")
  end

It seems like there should be an easy way to avoid all the intermediate strings. I see a related proposal/issue but I don’t think that was ever implemented: https://github.com/elixir-lang/elixir/issues/4473

Note: I was originally using String.graphemes and then matching on each character individually but that was slower. I also I only want to replace those three characters.


#2

Does that mean recursing through the list so that you only create one new list?

Given the nature of the characters you are replacing :binary.bin_to_list/2 and :erlang.list_to_binary/1 might be interesting.


#3

String.replace uses :binary.replace which in turn uses binary:matches with some recursion. By writing a custom version of :binary.replace we can do it all in one go.

I should say that :binary.replace almost works out of the box.

:binary.replace(body, ["&", "<", ">"], <<"&;">>, [:global, {:insert_replaced, 1}])

Unfortunately we need not just to wrap the replacement but substitute them with amp, lt and gt respectively.

Anyway. After a quick look at binary.replace implementation you can do the same with:

  def replace_char("&"), do: "&amp;"
  def replace_char(">"), do: "&gt;"
  def replace_char("<"), do: "&lt;"

  def escape_body(body) do
    match_list = :binary.matches(body, ["&", ">", "<"])
    IO.iodata_to_binary(do_replace(body, match_list, &replace_char/1, 0))
  end

  def do_replace(h, [], _, n) do
    [ :binary.part(h, {n, byte_size(h) - n}) ]
  end
  def do_replace(h, [{a, b} | t ], replace_fun, n) do
    [ :binary.part(h, {n, a - n}),
      replace_fun.(:binary.part(h, {a, b}))
      | do_replace(h, t, replace_fun, a  + b)
    ]
  end

This is basically the binary:replace/3 function with some options taken away and using a function to replace the matches rather than a plain substitution. I don’t know what the penalty of this is.

Obviously this is not more readable than your current example but if you generalize it you can likely have a function like this:

general_replace(body, [{"&", "&amp;"}, {"<", "&lt;"}, {">", "&gt;"}])

EDIT
Here is a the general substitution function:

  def substitute(data, subs) do
    match_list = :binary.matches(data, Map.keys(subs))
    IO.iodata_to_binary(do_replace(data, match_list, subs, 0))
  end

  def do_replace(h, [], _, n) do
    [ :binary.part(h, {n, byte_size(h) - n}) ]
  end
  def do_replace(h, [{a, b} | t ], subs, n) do
    [ :binary.part(h, {n, a - n}),
      Map.get(subs, :binary.part(h, {a, b}))
      | do_replace(h, t, subs, a  + b)
    ]
  end

It takes a body of text and a map where each key in the map will be replaced by its value.


#4

This is the previous code. It is a relatively simple split, map, and join:

  def escape_body(nil), do: nil
  def escape_body(string) do
    String.graphemes(string)
    |> Enum.map(&escape_character/1)
    |> Enum.join()
  end

  def escape_character("&"), do: "&amp;"
  def escape_character("<"), do: "&lt;"
  def escape_character(">"), do: "&gt;"
  def escape_character(other), do: other

#5

Thanks for digging into that! I’ll try it out and maybe do some small benchmarks.


#6

AFAIK the fastest way to do this is using binary pattern matching similar to what is done in Plug.HTML.html_escape/1


#7

I did some benchmarking of this:

  • binary_replace = The adaptation of :binary.replace/3 but with “multi substitution”
  • html_escape = From Plug.HTML
  • string_replace = The original “multi-trip” String.replace
  • binary_to_list = List recursion and pattern matching
  • binary_matching = Naive binary pattern matching
  • graphemes = Convert to graphemes and then Enum.map as earlier suggested

On large document (1.3M) with a small number of replacements:

Name                      ips        average  deviation         median         99th %
binary_replace         480.70        2.08 ms    ±10.99%        2.05 ms        2.53 ms
html_escape             56.08       17.83 ms     ±2.19%       17.80 ms       19.65 ms
string_replace          48.46       20.64 ms     ±1.31%       20.61 ms       21.63 ms
binary_to_list          12.65       79.04 ms    ±20.55%       88.49 ms       99.01 ms
binary_matching          3.17      315.19 ms    ±31.72%      343.34 ms      384.80 ms
graphemes                1.99      502.36 ms    ±10.75%      526.71 ms      549.55 ms

Comparison:
binary_replace         480.70
html_escape             56.08 - 8.57x slower
string_replace          48.46 - 9.92x slower
binary_to_list          12.65 - 37.99x slower
binary_matching          3.17 - 151.51x slower
graphemes                1.99 - 241.48x slower

Large document with large number of replacements:

Name                      ips        average  deviation         median         99th %
html_escape             26.19       38.18 ms     ±6.75%       36.86 ms       44.07 ms
binary_to_list          12.49       80.05 ms    ±20.83%       88.57 ms      109.65 ms
string_replace          11.10       90.06 ms     ±1.04%       90.23 ms       91.76 ms
binary_replace          10.10       98.98 ms     ±4.01%       99.30 ms      106.93 ms
binary_matching          2.95      338.67 ms     ±4.41%      337.16 ms      385.76 ms
graphemes                2.05      487.34 ms     ±8.99%      476.39 ms      578.18 ms

Comparison:
html_escape             26.19
binary_to_list          12.49 - 2.10x slower
string_replace          11.10 - 2.36x slower
binary_replace          10.10 - 2.59x slower
binary_matching          2.95 - 8.87x slower
graphemes                2.05 - 12.76x slower

On a small html page:

Name                      ips        average  deviation         median         99th %                                                                    
binary_replace         463.91        2.16 ms    ±15.29%        2.14 ms        2.91 ms
html_escape            379.62        2.63 ms     ±8.95%        2.64 ms        3.16 ms
string_replace         247.72        4.04 ms     ±9.13%        4.01 ms        4.78 ms
binary_to_list         207.41        4.82 ms    ±14.20%        4.72 ms        6.53 ms
binary_matching         76.33       13.10 ms    ±20.63%       13.37 ms       21.28 ms
graphemes               21.09       47.41 ms    ±13.23%       46.24 ms       72.87 ms

Comparison:
binary_replace         463.91
html_escape            379.62 - 1.22x slower
string_replace         247.72 - 1.87x slower
binary_to_list         207.41 - 2.24x slower
binary_matching         76.33 - 6.08x slower
graphemes               21.09 - 21.99x slower

#8

Wow, thanks for benchmarking all of this! I see that my original approach is by far the slowest :joy:!

I’ll probably go with your binary_replace method since that’ll match the use-case I have well (relatively short binaries without too many replacements needed). But it would be interesting to create a macro/library that sets up replacements using the method that Plug.HTML.html_escape uses.


#9

Did you compile the binary pattern before hand or was that just passing the list of binaries? This could potentially significantly improve the results for the :binary.replace/3 solution.


#10

What did you use to generate those benchmark tables?


#11

looks like benchee


#12

The expensive part seem to be the String.graphemes. Using binary_to_list and Enum.map on incoming data that is 6x faster.

I did not. Just lifted the functionality as is from erlang code. I did try it however with a 1.03x slower result, so not significant for the patterns and data I was testing with.

Indeed.


#13

Yeah, but I need to work with UTF-8 strings since these are from user-input so binary_to_list won’t work for me.

iex(1)> :erlang.binary_to_list("abc🎉d") |> to_string
<<97, 98, 99, 195, 176, 194, 159, 194, 142, 194, 137, 100>>

#14

I’m not fully read up on what it entails to be fully unicode aware but the only solution that takes utf8 into account is your String.graphemes solution. All the others work with individual bytes for matching or uses :binary module which is not unicode aware, this includes String.replace

I think because you are only matching on non-unicode characters (&, > and <) and you are simply replacing these and keeping the rest intact you should be fine.


#15

String.replace supports UTF-8 just fine:

iex(40)> String.replace("abc🎉d", "b", "Z")
"aZc🎉d"

I’d have to test out the others to better understand their support but Plug.HTML.html_escape works as well so at the very least we can copy and generalize their solution.

iex(47)> Plug.HTML.html_escape("abc🎉d<")
"abc🎉d&lt;"

#16

That is good. It uses the binary module for plain substitution and that is not unicode aware. It just matches the binary data in the bytes and replaces the matching binary with other binary and then reassembles the binary. If this works then the other solutions should work.

:erlang.binary_to_list("abc🎉d") |> IO.iodata_to_binary
"abc🎉d"

Seems to work. But again I know hardly anything about unicode.