String.split on a sub-string within a pattern?

I’m trying all sorts of things with no luck, so I thought I’d ask for some help.

I want to split a string anytime there’s a “b” in “abc”, but not in “ b ”. That is, when a “b” is surrounded by whitespace, it should not act as a splitting character but should appear as an element in the list returned by the split.

String.split("abcde", regex) == ["a", "cde"]
String.split("a b cde", regex) == ["a", "b", "cde"]

So I need (I think) a way to craft a regex that looks for abc, but only splits on the b part.

Regex negative lookbehind (together with negative lookahead) is your friend in cases like this:

iex(24)> String.split("abcde", ~r/((?<!\s)b(?!\s)|\s)/) 
["a", "cde"]
iex(25)> String.split("a b cde", ~r/((?<!\s)b(?!\s)|\s)/)
["a", "b", "cde"]
iex(26)> String.split("ab cde", ~r/((?<!\s)b(?!\s)|\s)/) 
["ab", "cde"]
iex(27)> String.split("a bcde", ~r/((?<!\s)b(?!\s)|\s)/)
["a", "bcde"]

The above pattern matches a b for splitting only if it has no whitespace on either side, i.e. it’s not preceded by whitespace ((?<!\s)) and it’s not followed by whitespace ((?!\s)). It’ll also split on standalone whitespace ((...|\s)).

Depending on your exact needs you may want to adjust the pattern, of course.
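
As an illustration of that kind of adjustment, here is a minimal sketch of the same pattern written in extended mode (the `x` modifier), so each part can carry a comment. The names `splitter` and `loose_splitter` are made up for this example, and the `\s+` variant is only one possible adjustment, assuming you want runs of whitespace treated as a single separator.

```elixir
# Same idea as the pattern above, spelled out in extended mode.
# `splitter` is a hypothetical name used only for this sketch.
splitter = ~r/
  (?<!\s) b (?!\s)   # a b with no whitespace directly before or after it
  | \s               # or a single whitespace character
/x

IO.inspect(String.split("abcde", splitter))    # ["a", "cde"]
IO.inspect(String.split("a b cde", splitter))  # ["a", "b", "cde"]

# Possible adjustment: \s+ collapses runs of whitespace so that
# double spaces do not produce empty strings in the result.
loose_splitter = ~r/(?<!\s)b(?!\s)|\s+/

IO.inspect(String.split("a  b  cde", loose_splitter))  # ["a", "b", "cde"]
```

The outer capture group from the original pattern is dropped here because String.split does not include captures in its result by default.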

Excellent. The resources I had looked at didn’t deal with lookbehind patterns.

I think the pattern can be even simpler. Can you show some sample inputs and what you want to see as outputs?

Well, I’m doing an exercism.io exercise that involves recreating a subset of Forth.

One of the test requirements is that non-word characters are separators between tokens, such that:

"1\x002\x013\n4\r5 6\t7" == "1 2 3 4 5 6 7"

But the exercise involves building a toy language that can do subtraction, so I need a way to distinguish between hyphens used to separate word characters and hyphens used as subtraction symbols:

"1 2 + 4 -" == "1 2 + 4 -"

So actually, what I need to represent in a regex is “all non-word characters EXCEPT a hyphen on its own.” This question was just to unstick me on that last part, finding the hyphen on its own.

Could you just split out the terms and then process the resulting list?

iex(3)> String.split("1\x002\x013\n4\r5 6\t7", ~r{\W+})
["1", "2", "3", "4", "5", "6", "7"]

No, that won’t work. If I split on ALL non-word characters, my mathematical operators vanish. I need to be able to keep the operators (characters in the class [-+*/] that are surrounded by whitespace).

I may end up pre-processing the string and converting the operators to words for the operations they represent. Then if I just split on non-word characters, everything is easy. But I also have to re-convert them back to the original characters at the end. So this seemed like a potentially better way.

Does this help?

iex(2)> String.split("1\x002\x013\n4\r5 6\t7", ~r{[^-+*/\d]+})
["1", "2", "3", "4", "5", "6", "7"]

Oops. There’s supposed to be a hyphen between the 5 and the 6 of the input. I don’t know how that got left out.

The trick is catching that hyphen when it’s between two word characters and splitting on it, but NOT splitting on a hyphen that is adrift in white space.

String.split("1\x002\x013\n4\r5-6\t7", regex) == ["1", "2", "3", "4", "5", "6", "7"]
String.split("5 2 -", regex) == ["5", "2", "-"]
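
For reference, one pattern that appears to satisfy both of the cases above is sketched below. It is not taken from the exercise or from any reply in this thread; the name `separator` is hypothetical, and it assumes that an operator counts as a token only when it has whitespace (or the start or end of the string) on both sides.

```elixir
# `separator` is a hypothetical name. It matches a run made up of:
#   - any character that is neither a word character nor one of + * / -
#   - an operator with a non-whitespace character directly on its left
#   - an operator with a non-whitespace character directly on its right
separator = ~r{(?:[^\w+*/-]|(?<=\S)[-+*/]|[-+*/](?=\S))+}

IO.inspect(String.split("1\x002\x013\n4\r5-6\t7", separator))
# ["1", "2", "3", "4", "5", "6", "7"]

IO.inspect(String.split("5 2 -", separator))
# ["5", "2", "-"]

IO.inspect(String.split("1 2 + 4 -", separator))
# ["1", "2", "+", "4", "-"]
```

A side effect of that assumption is that an operator next to a non-whitespace separator (for example directly after a "\x00") is swallowed as part of the separator.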

Can you give two or three string inputs and the lists you expect the code to produce from those strings? I think that would really help here.

I read this blog series (sorry, it’s in Python) last year that I think will help. It’s about building a simple interpreter; it starts off with simple maths and then moves on to supporting order of operations. It also requires no regex :)

\B matches at a non-word boundary, so \Bb only matches a b that is not at the beginning of a word:

iex(9)> for s <- ["abcde", "a b cde"], do: String.split(s, ~r(\Bb|\s))
[["a", "cde"], ["a", "b", "cde"]]
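
The word-boundary idea also seems to carry over to the hyphen case discussed earlier in the thread, using \b rather than \B: a hyphen sitting directly between two word characters has a word boundary on both sides, while a hyphen adrift in whitespace has none. The sketch below is an assumption about how that could be applied, with the hypothetical names `glued_hyphen` and `separator`; it only handles the hyphen, not the other operators.

```elixir
# \b-\b matches a hyphen that sits directly between two word characters.
# `glued_hyphen` and `separator` are hypothetical names for this sketch.
glued_hyphen = ~r/\b-\b/

IO.inspect(String.split("5-6", glued_hyphen))    # ["5", "6"]
IO.inspect(String.split("5 - 6", glued_hyphen))  # ["5 - 6"]

# Combined with "any run of non-word characters other than a hyphen":
separator = ~r/\b-\b|[^\w-]+/

IO.inspect(String.split("1\x002\x013\n4\r5-6\t7", separator))
# ["1", "2", "3", "4", "5", "6", "7"]

IO.inspect(String.split("5 2 -", separator))
# ["5", "2", "-"]
```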