A \s inside a string represents a space, i.e. " ". In other words:
iex> " " === "\s"
true
iex> "this is a string" === "this\sis\sa\sstring"
true
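To double-check that claim, here is a minimal sketch (the variable names are mine) that inspects the binary behind "\s". It should be a single byte with value 32, while the ~S sigil, which skips escaping, keeps the backslash and the s as two separate bytes:

```elixir
# "\s" is resolved when the literal is compiled, leaving one space byte.
space = "\s"
IO.inspect(byte_size(space))      # 1
IO.inspect(:binary.first(space))  # 32, the ASCII code for a space

# ~S disables escapes, so this is a backslash followed by the letter s.
raw = ~S(\s)
IO.inspect(byte_size(raw))        # 2
IO.inspect(raw == "\\s")          # true
```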
However, inside a regular expression, \s takes on a different meaning: it is shorthand for “whitespace character”, so it matches tabs, newlines, and more. Here it replaces all of them in one fell swoop:
iex> str = "\tsome\nthing with\f\nspaces\s\s"
"\tsome\nthing with\f\nspaces  "
iex> Regex.replace(~r/\s+/, str, "-")
"-some-thing-with-spaces-"
Whereas if you only want to replace literal spaces, you have to use an explicit space character and NOT \s:
iex> Regex.replace(~r/ +/, str, "-")
"\tsome\nthing-with\f\nspaces-"
Similarly, you can target one specific character, such as a tab, by naming it explicitly:
iex> Regex.replace(~r/\t+/, str, "-")
"-some\nthing with\f\nspaces  "
Can someone explain why this is the case? I just had this realization that \s is not a \s and I wanted to put the thought out to the community in a coherent post.
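My working theory, offered as a sketch rather than a definitive answer: in a regular string the Elixir compiler resolves \s to a space, but the ~r sigil seems to leave the backslash sequence alone and hand it to the regex engine, which has its own definition of \s. Regex.source/1 and Regex.compile!/1 make the difference visible (the variable names are mine):

```elixir
# The sigil keeps the two characters \ and s in the pattern source.
from_sigil = ~r/\s+/
IO.inspect(Regex.source(from_sigil))           # "\\s+"

# A plain string is unescaped first, so the engine only ever sees a space.
from_string = Regex.compile!("\s+")
IO.inspect(Regex.source(from_string))          # " +"

IO.inspect(Regex.match?(from_sigil, "\t"))     # true: the whitespace class matches a tab
IO.inspect(Regex.match?(from_string, "\t"))    # false: only literal spaces match
```

If that is right, building a pattern from a string while keeping the whitespace class means escaping the backslash ("\\s+") or writing the string with ~S.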
On a related note, I finally understand how the u (Unicode) flag can affect the output. E.g. referencing a chart of whitespace characters, we can come up with a string that uses Unicode whitespace characters:
iex> str = "Unicode\u00A0spaces\u2006"
"Unicode spaces "
iex> Regex.replace(~r/\s+/, str, "-")
"Unicode spaces " # <-- spaces not matched!
iex> Regex.replace(~r/\s+/u, str, "-")
"Unicode-spaces-" # <-- the u flag causes the spaces to be matched!