Regex.scan and repeated words

So I define my regex like so:

{:ok, reg} = Regex.compile("^#{word} | #{word}$|^#{word}$| #{word} ")

With word = "go"

Then I run

Regex.scan(reg, "go go gordon go") |> List.flatten

And I’m expecting to get ["go", " go ", " go"] but instead I get ["go ", " go"].

What am I doing wrong? Does it have to do with how the regex is organized? I’m doing it like this to avoid matching the go in gordon.

You have four alternatives in your regexp:

  1. The word at the beginning of the string, then a space
  2. A space, then the word at the end of the string
  3. The word at the beginning of the string, then the end of the string
  4. A space, the word, then another space

When you scan “go go gordon go”, the following happens. Rule 1 matches “go ” (the word plus the space after it). The second go doesn’t match: it is neither at the beginning nor at the end of the string, and it no longer has a space both before AND after it, because the space before it was already consumed by the first match. gordon, as expected, doesn’t match. Finally, the last go matches rule 2.

In fact, if you had a second space after the first go, the second go would match rule 4 and you would get the three matches you expect.
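To see the consumed space in action, here is a minimal sketch that runs the regex from the question against the original string and against a variant with two spaces after the first go. The variant string is only an illustration and does not come from the question.

```elixir
word = "go"
{:ok, reg} = Regex.compile("^#{word} | #{word}$|^#{word}$| #{word} ")

# Single spaces: the first match takes the space after the first "go",
# so the second "go" has no leading space left for rule 4.
single = Regex.scan(reg, "go go gordon go") |> List.flatten()
# => ["go ", " go"]

# Two spaces after the first "go": one is consumed by rule 1,
# the other is still available as the leading space for rule 4.
double = Regex.scan(reg, "go  go gordon go") |> List.flatten()
# => ["go ", " go ", " go"]
```

Regex.scan only returns non-overlapping matches, which is why a single space cannot be shared by two neighbouring matches.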

One possible way to match all the “go” that are not part of another word is ~r/\bgo\b/ (the \b represents a word boundary). It is not equivalent to your expectation, because it does not return spaces in the matches, but maybe it is what you ultimately want?

word = "go"
{:ok, reg} = Regex.compile("\\b#{word}\\b")
Regex.scan(reg, "go go gordon go") |> List.flatten()
# => ["go", "go", "go"]
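If the word comes from user input rather than a literal, it may contain characters that are special in a regex. A possible sketch, assuming a hypothetical helper module WordScan (not something from this thread), is to pass the word through Regex.escape/1 before interpolating it:

```elixir
defmodule WordScan do
  # Hypothetical helper for illustration only.
  # Returns every occurrence of `word` in `text` that stands between word boundaries.
  def whole_words(text, word) do
    # Regex.escape/1 neutralises regex metacharacters in the interpolated word.
    reg = Regex.compile!("\\b#{Regex.escape(word)}\\b")
    Regex.scan(reg, text) |> List.flatten()
  end
end

result = WordScan.whole_words("go go gordon go", "go")
# => ["go", "go", "go"]
```

Note that \b only behaves as expected when the word starts and ends with a word character (letter, digit or underscore).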

Yep, @lucaong explained what’s going on. To solve the problem, I was about to suggest using negative lookahead + lookbehind, like this:

~r/(?<![^\s])go(?![^\s])/

This works, but @lucaong’s solution using \b is way more elegant.
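For comparison, here is a small sketch of both patterns on the string from the question, plus one extra string with punctuation that I made up to show where they differ. The lookaround version only accepts whitespace (or the string edges) around the word, while \b also treats punctuation as a boundary.

```elixir
lookaround = ~r/(?<![^\s])go(?![^\s])/
boundary = ~r/\bgo\b/

# On the string from the question both patterns agree.
a = Regex.scan(lookaround, "go go gordon go") |> List.flatten()
# => ["go", "go", "go"]
b = Regex.scan(boundary, "go go gordon go") |> List.flatten()
# => ["go", "go", "go"]

# With punctuation next to the word they differ (illustrative input).
c = Regex.scan(lookaround, "go, go!") |> List.flatten()
# => []
d = Regex.scan(boundary, "go, go!") |> List.flatten()
# => ["go", "go"]
```

Which one is preferable depends on whether “go,” should count as a standalone go in your data.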


Marking this one as solution because it explains the “why” behind the issue very well. You even saw through my clumsy coding and understood my underlying intention. Thank you! And thanks to @trisolaran as well.
