When is \s not a \s? Can someone explain Regex patterns?

A \s inside a string represents a space, i.e. " ". In other words:

iex> " " === "\s"
true
iex> "this is a string" === "this\sis\sa\sstring"
true

However, inside a regular expression, \s takes on a different meaning: it is shorthand for “whitespace character”, so it matches tabs, newlines, and more. Here it replaces all of them in one go:

iex> str = "\tsome\nthing   with\f\nspaces\s\s"
"\tsome\nthing   with\f\nspaces  "
iex> Regex.replace(~r/\s+/, str, "-")
"-some-thing-with-spaces-"

Whereas if you just want to replace literal spaces, you have to explicitly use a space character and NOT \s:

iex> Regex.replace(~r/ +/, str, "-")
"\tsome\nthing-with\f\nspaces-"

Likewise, you can replace one specific character, such as a tab, by using the escape for just that character:

iex> Regex.replace(~r/\t+/, str, "-")
"-some\nthing   with\f\nspaces  "
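To make the shorthand concrete, here is a minimal sketch that compares \s with a character class spelling out the ASCII whitespace characters it is documented to cover (space, tab, newline, carriage return, form feed, and vertical tab, written here as \x0B). The module and function names are hypothetical and only serve the illustration.

```elixir
defmodule WhitespaceDemo do
  # Hypothetical helper: replace each run matched by the \s shorthand.
  def with_shorthand(str), do: Regex.replace(~r/\s+/, str, "-")

  # Same replacement with the ASCII whitespace characters listed explicitly.
  def with_explicit_class(str), do: Regex.replace(~r/[ \t\n\r\f\x0B]+/, str, "-")
end

str = "\tsome\nthing   with\f\nspaces\s\s"

# Both calls should print the same string for this ASCII-only input.
IO.inspect(WhitespaceDemo.with_shorthand(str))
IO.inspect(WhitespaceDemo.with_explicit_class(str))
```

For the sample string both versions give "-some-thing-with-spaces-", which suggests that \s behaves like a character class here, not like a single character.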

Can someone explain why this is the case? I just had this realization that \s is not a \s and I wanted to put the thought out to the community in a coherent post.

On a related note, I finally understand how the u (Unicode) flag can affect the output. For example, using a chart of whitespace characters, we can build a string that contains Unicode whitespace characters:

iex> str = "Unicode\u00A0spaces\u2006"
"Unicode spaces "
iex> Regex.replace(~r/\s+/, str, "-")
"Unicode spaces "  # <-- spaces not matched!
iex> Regex.replace(~r/\s+/u, str, "-")
"Unicode-spaces-"  # <-- the u flag causes the spaces to be matched!

How the \s is interpreted depends on the surrounding syntax, which all sigils have an opportunity to change. For example:

iex> IO.puts "\s"    # prints a single space
:ok
iex> IO.puts ~S(\s)  # prints the two characters \ and s
\s
:ok

It isn’t so much about regular expressions as it is about sigils being allowed to interpret their inner contents.
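A small sketch of that point, assuming only the built-in ~s, ~S, and ~r sigils plus Regex.source/1; the module and function names are made up for the example. The same two characters, \s, end up as three different things depending on the sigil that wraps them.

```elixir
defmodule SigilDemo do
  # ~s behaves like a double-quoted string: \s becomes one space character.
  def lowercase_s, do: ~s(\s)

  # ~S disables escapes: the backslash and the "s" are kept as two characters.
  def uppercase_s, do: ~S(\s)

  # ~r leaves \s untouched for the regex engine, so the stored source
  # still contains the backslash.
  def regex_source, do: Regex.source(~r/\s/)
end

IO.inspect(SigilDemo.lowercase_s())   # " "
IO.inspect(SigilDemo.uppercase_s())   # "\\s"
IO.inspect(SigilDemo.regex_source())  # "\\s"
```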


You are mixing two different syntaxes. Strings and regular expressions are different things and have different escapes.

Strings/sigils in Elixir have these escapes: https://elixir-lang.org/getting-started/sigils.html#interpolation-and-escaping-in-string-sigils

Elixir’s regular expression library, on the other hand, is based on PCRE and uses the PCRE escapes, which are described here: http://erlang.org/doc/man/re.html#backslash

Both systems are different and any commonalities in their escapes are coincidental and a result of people gravitating towards the escapes that are already in common use when designing systems. So \s in a string is different from \s in a regex.

Note also that in strings/sigils, an escape always stands for one single, specific character. In a regex, an escape can match any one character out of a whole class (1…n possible characters). So they can’t be mapped to each other, because they serve different purposes.
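One place where the two escape systems meet is when a regex is built from an ordinary string. The sketch below assumes Regex.compile!/1 and uses hypothetical module and function names. The string escape is applied first, so "\s+" reaches the regex engine as a literal space followed by +, while "\\s+" reaches it as the whitespace class.

```elixir
defmodule EscapeDemo do
  # The string turns \s into a space before the regex engine sees it,
  # so this pattern only matches literal spaces.
  def from_plain_string, do: Regex.compile!("\s+")

  # The doubled backslash survives the string, so PCRE receives \s+.
  def from_escaped_string, do: Regex.compile!("\\s+")
end

IO.inspect(Regex.source(EscapeDemo.from_plain_string()))          # " +"
IO.inspect(Regex.match?(EscapeDemo.from_plain_string(), "\t"))    # false
IO.inspect(Regex.match?(EscapeDemo.from_escaped_string(), "\t"))  # true
```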