\s inside a string represents a space, i.e. " ". In other words:
iex> " " === "\s"
true
iex> "this is a string" === "this\sis\sa\sstring"
true
However, inside a regular expression, \s takes on a different meaning: it is shorthand for "whitespace character", so it matches tabs, newlines, form feeds, and more. Here it replaces all of them in one swoop:
iex> str = "\tsome\nthing with\f\nspaces\s\s"
"\tsome\nthing with\f\nspaces  "
iex> Regex.replace(~r/\s+/, str, "-")
"-some-thing-with-spaces-"
Whereas if you just want to replace literal spaces, you have to explicitly use a space character and NOT \s:
iex> Regex.replace(~r/ +/, str, "-")
"\tsome\nthing-with\f\nspaces-"
By contrast, you can replace specific characters like tabs by referencing them literally:
iex> Regex.replace(~r/\t+/, str, "-")
"-some\nthing with\f\nspaces  "
Can someone explain why this is the case? I just had this realization that \s is not a \s, and I wanted to put the thought out to the community in a coherent post.
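A small sketch that may help show where the two meanings come from (the variable names here are only for illustration). As far as I can tell, in a double-quoted string the compiler itself turns \s into a space, while the ~r sigil leaves \s untouched and hands it to the regex engine, which reads it as the whitespace class:

```elixir
# In a double-quoted string, \s is resolved at compile time to one space (byte 32).
string_space = "\s"
IO.inspect(string_space == <<32>>)

# In a ~r sigil, \s is kept as the two characters "\" and "s",
# so the regex engine sees its own whitespace shorthand.
regex_source = Regex.source(~r/\s/)
IO.inspect(regex_source)

# Building a regex from a plain string shows both behaviours side by side.
literal_space_regex = Regex.compile!("\s")    # pattern is a single literal space
whitespace_class_regex = Regex.compile!("\\s") # pattern is \s

tab_matches_literal = Regex.match?(literal_space_regex, "\t")
tab_matches_class = Regex.match?(whitespace_class_regex, "\t")
IO.inspect({tab_matches_literal, tab_matches_class})
```

So the same two characters are interpreted by two different layers: the string escape never reaches the regex engine, and the regex shorthand never gets converted by the string parser.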
Related: I finally understand how the u (Unicode) flag can affect the output. E.g., referencing a chart of whitespace characters, we can come up with a string that uses Unicode whitespace characters:
iex> str = "Unicode\u00A0spaces\u2006"
"Unicode spaces "
iex> Regex.replace(~r/\s+/, str, "-")
"Unicode spaces " # <-- spaces not matched!
iex> Regex.replace(~r/\s+/u, str, "-")
"Unicode-spaces-" # <-- the u flag causes the spaces to be matched!
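To make the effect of the u flag easier to see character by character, here is a rough sketch that checks a few whitespace candidates against \s with and without the flag. The candidate list and names are my own picks for illustration, not an exhaustive chart:

```elixir
# Hypothetical sample of whitespace characters to probe.
candidates = [
  {"space", "\u0020"},
  {"tab", "\t"},
  {"no-break space", "\u00A0"},
  {"six-per-em space", "\u2006"}
]

# For each candidate, record whether \s matches without and with the u flag.
results =
  for {name, char} <- candidates do
    {name, Regex.match?(~r/\s/, char), Regex.match?(~r/\s/u, char)}
  end

Enum.each(results, fn {name, plain, unicode} ->
  IO.puts("#{name}: without u = #{plain}, with u = #{unicode}")
end)
```

The ASCII characters match either way, while the two Unicode spaces from the example above should only match once the u flag is present.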