When is \s not a \s? Can someone explain Regex patterns?

A \s inside a string represents a space, i.e. " ". In other words:

iex> " " === "\s"
true
iex> "this is a string" === "this\sis\sa\sstring"
true
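
To make that concrete, here is a small sketch (the variable name `space` is just for illustration) checking that the string escape stands for exactly one character, the ASCII space with code point 32:

```elixir
# "\s" inside a double-quoted string is a single byte: the ASCII space.
space = "\s"

IO.inspect(byte_size(space))   # 1
IO.inspect(space == <<32>>)    # true
IO.inspect(?\s)                # 32, the code point of the space character
```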

However, inside a regular expression, \s takes on a different meaning: it is shorthand for “any whitespace character”, so it matches spaces, tabs, newlines, form feeds, and more. Here it replaces all of them in one fell swoop:

iex> str = "\tsome\nthing   with\f\nspaces\s\s"
"\tsome\nthing   with\f\nspaces  "
iex> Regex.replace(~r/\s+/, str, "-")
"-some-thing-with-spaces-"

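One way to see the class behaviour of the regex \s is to test single characters against it one at a time. This is only a sketch: the `candidates` list and its labels are my own, and it covers a handful of common whitespace characters rather than everything \s can match.

```elixir
# Each candidate is a one-character string; "a" is included as a non-match.
candidates = [
  space: " ",
  tab: "\t",
  newline: "\n",
  carriage_return: "\r",
  form_feed: "\f",
  letter: "a"
]

# \A and \z anchor the match to the whole subject, so the result is true
# only when the single character itself belongs to the \s class.
results =
  Map.new(candidates, fn {name, char} ->
    {name, Regex.match?(~r/\A\s\z/, char)}
  end)

IO.inspect(results)
```

Every entry except `letter` should come back as `true`.
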
Whereas if you just want to replace literal spaces, you have to explicitly use a space character and NOT \s:

iex> Regex.replace(~r/ +/, str, "-")
"\tsome\nthing-with\f\nspaces-"

Likewise, you can replace one specific character, such as a tab, by naming it explicitly:

iex> Regex.replace(~r/\t+/, str, "-")
"-some\nthing   with\f\nspaces  "

Can someone explain why this is the case? I just had this realization that \s is not a \s and I wanted to put the thought out to the community in a coherent post.

On a related note, I finally understand how the u (unicode) flag can affect the output. For example, referencing a chart of whitespace characters, we can come up with a string that uses Unicode whitespace characters:

iex> str = "Unicode\u00A0spaces\u2006"
"Unicode spaces "
iex> Regex.replace(~r/\s+/, str, "-")
"Unicode spaces "  # <-- spaces not matched!
iex> Regex.replace(~r/\s+/u, str, "-")
"Unicode-spaces-"  # <-- the u flag causes the spaces to be matched!
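
As far as I understand it, the reason is that without the u flag the regex engine works on raw bytes, and these characters are encoded in UTF-8 as multi-byte sequences that do not contain any ASCII whitespace byte. The sketch below (variable names are mine) prints the bytes of the two characters used above and checks one of them against \s with the u flag:

```elixir
nbsp = "\u00A0"        # NO-BREAK SPACE
six_per_em = "\u2006"  # SIX-PER-EM SPACE

# Show the underlying UTF-8 bytes instead of the printable string.
IO.inspect(nbsp, binaries: :as_binaries)        # <<194, 160>>
IO.inspect(six_per_em, binaries: :as_binaries)  # <<226, 128, 134>>

# With the u flag the subject is treated as Unicode text, so \s can
# recognise the whole code point as whitespace.
IO.inspect(Regex.match?(~r/\s/u, six_per_em))   # true
```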

How \s is interpreted depends on the surrounding syntax, and every sigil has an opportunity to change that interpretation. For example:

iex(3)> IO.puts "\s"
 
:ok
iex(4)> IO.puts ~S(\s)
\s
:ok

It isn’t so much about regular expressions as it is about sigils being allowed to interpret their inner contents.
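
The same two characters can be run through three different sigils to compare what each one hands back. This is a minimal sketch; `Regex.source/1` is used only to look at the pattern text that the ~r sigil kept.

```elixir
# ~s behaves like a double-quoted string: \s is turned into a space.
IO.inspect(~s(\s))                 # " "

# ~S does no escaping at all: the backslash and the s both survive.
IO.inspect(~S(\s))                 # "\\s"

# ~r leaves \s untouched so that the regex engine can interpret it.
IO.inspect(Regex.source(~r/\s/))   # "\\s"
```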


You are mixing two different syntaxes. Strings and regular expressions are different things and have different escapes.

Strings/sigils in Elixir have these escapes: https://elixir-lang.org/getting-started/sigils.html#interpolation-and-escaping-in-string-sigils

Elixir’s regular expression library, on the other hand, is based on PCRE and uses the PCRE escapes, which are described here: http://erlang.org/doc/man/re.html#backslash

The two systems are independent. Any commonalities in their escapes are coincidental, a result of designers gravitating towards the escapes that are already in common use. So \s in a string is different from \s in a regex.

Note also that in strings and sigils, an escape represents one single, specific character. In a regex, an escape can instead match any one of a range of characters (1…n). The two kinds of escape therefore can’t be mapped onto each other, because they serve different purposes.
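
The difference between the two syntaxes shows up most clearly when a regex is built from an ordinary string, because then both sets of escapes are in play. A rough sketch, with a made-up `subject` and variable names of my own:

```elixir
subject = "a\tb  c"

# In a double-quoted string, "\s+" has already become " +" (a literal space
# followed by +) before the regex engine ever sees it.
from_string_escape = Regex.compile!("\s+")

# Doubling the backslash keeps the two characters \ and s, so the regex
# engine receives its own \s whitespace class.
from_regex_escape = Regex.compile!("\\s+")

IO.inspect(Regex.replace(from_string_escape, subject, "-"))  # "a\tb-c"
IO.inspect(Regex.replace(from_regex_escape, subject, "-"))   # "a-b-c"
```

The first pattern only touches the run of literal spaces, while the second also replaces the tab.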
