String.replace and escaping weirdness

cohen · July 25, 2019, 9:51pm

I have some code that finds all instances of the characters \, %, and _, inserting a backslash in front of them to escape them in the resulting SQL string. I’m a little bit confused about the number of \ characters I need to use to do this. I would think that String.replace(string, ~r/([\\%_])/, "\\\\1") would do it, since I put in "\\" for a single backslash, then "\\1" for the backslash-one syntax to get my first capture. However, this results in substituting the characters backslash and 1, e.g., a_b -> a\1b (on IO.puts).

It seems this is because "\\\\1" is the intended syntax for substituting an actual, literal backslash, and "\\\\\\1" does the trick, but to be honest I’m confused about how the escaping is actually working in this case.

Does anyone have any insights?

NobbZ · July 25, 2019, 10:05pm

In the replacement language \ has a special meaning. So if you want it literally, you need to escape it.

Your string \\\\1 is seen by the replacement language as \\1, which results in a replacement of \1 (as printed) or \\1 (as inspected).
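To make the two layers visible, here is a small sketch (the variable names are illustrative, not from the thread) that looks at what the literal "\\\\1" actually contains before the replacement language sees it:

```elixir
# The string literal "\\\\1" holds three characters: \, \ and 1.
replacement = "\\\\1"
IO.puts(replacement)                    # prints \\1
IO.inspect(String.length(replacement))  # prints 3

# The replacement language then reads \\ as one escaped backslash,
# leaving the 1 as an ordinary character rather than a capture reference.
result = String.replace("a_b", ~r/([\\%_])/, replacement)
IO.puts(result)     # prints a\1b
IO.inspect(result)  # prints "a\\1b"
```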

To actually get a single backslash followed by the content of the capture, you need 3 backslashes followed by a one in the replacement language, which in a string literal have to be doubled, such that you end up with 6 of them.

When I do write replacements, I usually use ~S to avoid the duplication, then I can do ~S"\\\1".
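As a minimal sketch of the two spellings side by side (the input string and variable names are made up for illustration), both forms hand the same three-backslash replacement to the regex replacement:

```elixir
pattern = ~r/([\\%_])/
input = "50%_off"

# Six backslashes in a regular string literal become three in the actual string.
with_string = String.replace(input, pattern, "\\\\\\1")

# ~S does no escape processing, so the three backslashes are written as they are.
with_sigil = String.replace(input, pattern, ~S"\\\1")

IO.puts(with_string)                    # prints 50\%\_off
IO.inspect(with_string == with_sigil)   # prints true
IO.inspect("\\\\\\1" == ~S"\\\1")       # prints true
```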

cohen · July 26, 2019, 12:58am

Thanks, @NobbZ! I think I see what you mean by “replacement language”. This seems to be a special case when a regex is passed as the pattern to replace/4, as illustrated below:

$ iex
Erlang/OTP 22 [erts-10.4] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [hipe]

Interactive Elixir (1.9.0) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> "abc" |> String.replace("a", "z\\0") |> IO.puts()
z\0bc
:ok
iex(2)> "abc" |> String.replace(~r/a/, "z\\0") |> IO.puts()
zabc
:ok
iex(3)> "abc" |> String.replace("a", "\\") |> IO.puts()    
\bc
:ok
iex(4)> "abc" |> String.replace("a", "\\\\") |> IO.puts()
\\bc
:ok
iex(5)> "abc" |> String.replace(~r/a/, "\\") |> IO.puts()
\bc
:ok
iex(6)> "abc" |> String.replace(~r/a/, "\\\\") |> IO.puts()
\bc
:ok

Is this behavior documented anywhere? I know the docs mention using “\1”, etc. to do capture substitution, and that implies that \ is being treated specially, but I’m still surprised that even without capture replacement regex patterns cause replacements to behave differently.

Edit: For simple regexes, one can avoid the replacement language by either 1) supplying a list of strings as the pattern argument, or 2) supplying a function as the replacement argument.
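A rough sketch of both workarounds, assuming an Elixir version where String.replace/4 accepts a function replacement for string patterns (the inputs and variable names are illustrative only):

```elixir
input = "a_b%c"

# 1) String patterns (here a list): the replacement string is taken literally,
#    so "\\_" is simply a backslash followed by an underscore.
literal = String.replace("a_b", ["_"], "\\_")
IO.puts(literal)   # prints a\_b

# 2) Function replacement: the return value is inserted as is,
#    without going through the replacement language.
from_fun = String.replace(input, ~r/[\\%_]/, fn match -> "\\" <> match end)
IO.puts(from_fun)  # prints a\_b\%c

# Both ideas combined: a list of strings plus a function.
combined = String.replace(input, ["\\", "%", "_"], fn match -> "\\" <> match end)
IO.puts(combined)  # prints a\_b\%c
```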

cohen · July 26, 2019, 1:19pm

Ah, and when the replacement is a function in Regex.replace/4, it behaves a little differently than in String.replace/4. For Regex, the function gets n + 1 arguments, where n is the number of captures in the regex. The first argument is the whole match and the next n correspond to the captures, whereas String.replace/4's replacement function only ever takes a single argument (the whole match).
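A short sketch of the difference described above (the regex, input, and names are invented for the example):

```elixir
# Regex.replace/4: the whole match comes first, then one argument per capture.
swapped =
  Regex.replace(~r/(\w)=(\d)/, "a=1 b=2", fn _whole, key, value ->
    "#{value}=#{key}"
  end)

IO.puts(swapped)  # prints 1=a 2=b

# String.replace/4: the function receives only the whole match.
wrapped = String.replace("a=1 b=2", ~r/(\w)=(\d)/, fn whole -> "[#{whole}]" end)
IO.puts(wrapped)  # prints [a=1] [b=2]
```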

gcb · June 17, 2023, 6:01am

(i’ve spent too much time looking at sources on a friday to understand this exactly, thanks to “How to prevent LIKE-injections”, so i’m going to post my notes on this 4yr old thread :^) sorry)

the regex/String.replace functions have their own extra parser for the replacement, which runs after the string-literal one.

    test "string works as expected" do
      assert "\\a" == "\\a" # pass
      assert ~S(\a) == "\\a" # pass
      # assert "\\a" == "\\\\a" # fails
      # assert "\\a" == "\\\\\\a" # fails
      assert "	a" == "\ta" # pass (left has a literal tab char before "a")
      assert ~S(\ta) == "\\ta" # pass
      assert ~S(\\ta) == "\\\\ta" # pass
      assert ~S(\	a) == "\\\ta" # pass (left has a literal tab char before "a")
      assert ~S(\\	a) == "\\\\\ta" # pass (left has a literal tab char before "a")
      assert "\\	a" == "\\\ta" # pass (left has a literal tab char before "a")
    end

With strings there are no surprises. Two slashes become one slash, and anything special gets converted only if it has its own slash.

for regex and its replacements, all the slashes are parsed from the left, matching pairs, for the string parsing… and then the same happens again for the special chars in those functions, including slashes/double slashes again. …if i got this right

I do not think all languages do that (Elixir, PHP and Ruby do, for example). Usually you just have to deal with one level to achieve everything (or be limited by it, i guess), unless passing to another eval step that will re-parse the string explicitly. This feels like an extra half eval, so to say. Or maybe other languages provide special cases for strings when parsed in a regex.

for example, in perl/python/PCRE you can replace with "\\\1" and it’s fine.
(with PCRE2 they changed \1 to $1, so it’s harder to compare).

>>> import re
>>> re.sub(r'(%)', r'\\\1', "%abc")
'\\%abc'
>>> re.sub(r'(%)', r"\\\1", "%abc")
'\\%abc'
>>>

seeing it with ~S helps, as it removes the first (string) parser from the picture, and then things behave like most other languages: it is easy to see when you are escaping the slash that would have triggered the special case you wanted. With "\\\\0" or ~S(\\0), it is easy to see that the first two slashes just become one slash and the zero never gets “activated”, because the first slash (on the “second” parser) is escaped and not activating anything.

with "\\\\\\0" or ~S(\\\0) you get the two slashes turned into one slash, and then the zero with its activation slash…

  "\\ \\ \\ 0" 
   |  |  |  |  <- string parsing
~S(\  \  \  0) 
     |     |  
    \\    \0  <- actual parsing you were thinking about

It kinda makes sense when the pairs matches, but when they “leak” to the side it feels stranger.
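The two ~S cases above can be checked with a small sketch like this one (the "%abc" input is borrowed from the Python example; the variable names are illustrative):

```elixir
# ~S(\\0): the pair \\ is an escaped backslash; the 0 stays a plain character.
plain_zero = String.replace("%abc", ~r/%/, ~S(\\0))
IO.puts(plain_zero)  # prints \0abc

# ~S(\\\0): \\ is an escaped backslash, then \0 inserts the whole match.
with_match = String.replace("%abc", ~r/%/, ~S(\\\0))
IO.puts(with_match)  # prints \%abc

# The double-quoted spelling needs every backslash doubled.
IO.inspect(~S(\\\0) == "\\\\\\0")  # prints true
```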

in the end, i think i should have been using ~S instead of " everywhere…!

…

PS: if you thought that was bad, javascript “wins” by parsing both $1 and \$1 the same. PHP8 does the same with PCRE1 compatibility on.

"%abc".replace(/(%)/, "$1")
<- "%abc" 
"%abc".replace(/(%)/, "\$1")
<- "%abc"
"%abc".replace(/(%)/, "\\$1")
<- "\\%abc"
"%abc".replace(/(%)/, "\\\$1")
<- "\\%abc"
"%abc".replace(/(%)/, "\\\\$1")
<- "\\\\%abc"
"%abc".replace(/(%)/, "\\\\\$1")
<- "\\\\%abc"