String.replace and escaping weirdness

gcb · June 17, 2023, 6:01am

(i’ve spent too much time on a friday reading sources to understand this exactly, thanks to “How to prevent LIKE-injections”, so i’m going to post my notes on this 4yr old thread :^) sorry

regex/String.replace functions have their own extra parser for the replacement, which runs after the string literal one.

    test "string works as expected" do
      assert "\\a" == "\\a" # pass
      assert ~S(\a) == "\\a" # pass
      # assert "\\a" == "\\\\a" # fails
      # assert "\\a" == "\\\\\\a" # fails
      assert "	a" == "\ta" # pass (left has a literal tab char before "a")
      assert ~S(\ta) == "\\ta" # pass
      assert ~S(\\ta) == "\\\\ta" # pass
      assert ~S(\	a) == "\\\ta" # pass (left has a literal tab char before "a")
      assert ~S(\\	a) == "\\\\\ta" # pass (left has a literal tab char before "a")
      assert "\\	a" == "\\\ta" # pass (left has a literal tab char before "a")
    end

With strings there are no surprises. Two backslashes become one backslash, and a special character (like the t in \t) only gets converted when it has its own backslash.

for a regex and its replacements, the backslashes are first consumed from the left, in matching pairs, by the string parser… and then the same happens again for the special chars in those functions, including backslashes/double backslashes. …if i got this right

I do not think all languages do that (e.g. elixir, php and ruby). Usually you only have to deal with one level to achieve everything (or be limited by it i guess), unless you pass the string to another eval step that explicitly re-parses it. This feels like an extra half eval, so to say. Or maybe other languages special-case strings when they are parsed for a regex.

for example, in perl/python/PCRE you can replace with "\\\1" and it’s fine (in python, as a raw string).
(with PCRE2 they changed \1 to $1, so it’s harder to compare).

>>> import re
>>> re.sub(r'(%)', r'\\\1', "%abc")
'\\%abc'
>>> re.sub(r'(%)', r"\\\1", "%abc")
'\\%abc'
>>>
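
to make the two layers visible in the python session above, here is a minimal sketch, assuming only the standard re module. It feeds the same template to re.sub once as a raw string and once as an ordinary literal; the names raw and plain are just for illustration.

```python
import re

# Raw string: the literal parser leaves backslashes alone, so re.sub
# receives the four characters \ \ \ 1 and reads them as "\\" then "\1".
raw = r'\\\1'
assert len(raw) == 4
raw_result = re.sub(r'(%)', raw, "%abc")

# Ordinary string: every backslash meant for re.sub has to be doubled,
# because the literal parser consumes one level first.
plain = '\\\\\\1'
assert plain == raw
plain_result = re.sub(r'(%)', plain, "%abc")

print(raw_result)    # \%abc
print(plain_result)  # \%abc
```

so the second parser is there in python too; the raw string just hides the first one.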

looking at it with ~S helps, as it removes the first (string) parser from the picture, and then things behave like most other languages. it becomes easy to see when you are escaping the backslash that would have triggered the special case you wanted. with "\\\\0" or ~S(\\0) it is easy to see that the first two backslashes just become one backslash, and the zero never gets “activated” because the backslash in front of it (on the “second” parser) is itself escaped and so is not activating anything.

with "\\\\\\0" or ~S(\\\0) you get the first two backslashes turned into one backslash, and then the zero with its activation backslash…

  "\\ \\ \\ 0" 
   |  |  |  |  <- string parsing
~S(\  \  \  0) 
     |     |  
    \\    \0  <- actual parsing you were thinking about
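
the same pairing can be checked in python, as a rough sketch. Assumptions: group 1 (\1) stands in for elixir's \0 here, since python spells the whole match \g<0>, and replace_percent is a made-up helper, not anything from a library.

```python
import re

def replace_percent(template: str) -> str:
    """Replace the first '%' in '%abc' using the given replacement template."""
    return re.sub(r'(%)', template, "%abc", count=1)

# Two backslashes: one escaped pair, then a plain "1". The group is never used.
two = replace_percent(r'\\1')

# Three backslashes: one escaped pair, then "\1", which inserts the group.
three = replace_percent(r'\\\1')

# Four backslashes: two escaped pairs, then a plain "1" again.
four = replace_percent(r'\\\\1')

print(two)    # \1abc
print(three)  # \%abc
print(four)   # \\1abc
```

raw strings are used throughout, so only the replacement parser is in play: an even number of backslashes leaves the digit alone, an odd number activates it.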

It kinda makes sense when the pairs match, but when they “leak” to the side it feels stranger.

in the end, i think i should have been using ~S instead of " everywhere…!

…

PS: if you thought that was bad, javascript “wins” by parsing both $1 and \$1 the same. PHP8 does the same with PCRE1 compatibility on.

"%abc".replace(/(%)/, "$1")
<- "%abc" 
"%abc".replace(/(%)/, "\$1")
<- "%abc"
"%abc".replace(/(%)/, "\\$1")
<- "\\%abc"
"%abc".replace(/(%)/, "\\\$1")
<- "\\%abc"
"%abc".replace(/(%)/, "\\\\$1")
<- "\\\\%abc"
"%abc".replace(/(%)/, "\\\\\$1")
<- "\\\\%abc"