String.split on a sub-string within a pattern?

MatthewMDavis · October 26, 2016, 12:31pm

I’m trying all sorts of things with no luck, so I thought I’d ask for some help.

I want to split a string anytime there’s a “b” in “abc”, but not in " b ". I.e., when a “b” is surrounded by whitespace, I don’t want it to be a splitting character, but an element in the array returned by the split.

String.split("abcde", regex) == ["a", "cde"]
String.split("a b cde", regex) == ["a", "b", "cde"]

So I need (I think) a way to craft a regex that looks for abc, but only splits on the b part.

jwarlander · October 26, 2016, 1:21pm

Regex Negative Lookbehind is your friend in cases like this:

iex(24)> String.split("abcde", ~r/((?<!\s)b(?!\s)|\s)/) 
["a", "cde"]
iex(25)> String.split("a b cde", ~r/((?<!\s)b(?!\s)|\s)/)
["a", "b", "cde"]
iex(26)> String.split("ab cde", ~r/((?<!\s)b(?!\s)|\s)/) 
["ab", "cde"]
iex(27)> String.split("a bcde", ~r/((?<!\s)b(?!\s)|\s)/)
["a", "bcde"]

The above pattern matches a b for splitting only if it has no whitespace on either side, i.e. it’s not preceded by whitespace ((?<!\s)) and it’s not followed by whitespace ((?!\s)). It also splits on standalone whitespace ((...|\s)).

Depending on your exact needs you may want to adjust the pattern, of course.

MatthewMDavis · October 26, 2016, 1:51pm

Excellent. The resources I had looked at didn’t deal with lookbehind patterns.

JEG2 · October 26, 2016, 2:07pm

I think the pattern can be even simpler. Can you show some sample inputs and what you want to see as outputs?

MatthewMDavis · October 26, 2016, 2:36pm

Well, I’m doing an exercism.io exercise that involves recreating a subset of Forth.

One of the test requirements is that non-word characters are separators between tokens, such that:

"1\x002\x013\n4\r5 6\t7" == "1 2 3 4 5 6 7"

But the exercise involves building a toy language that can do subtraction, so I need a way to distinguish between hyphens used to separate word characters and hyphens used as subtraction symbols:

"1 2 + 4 -" == "1 2 + 4 -"

So actually, what I need to represent in a regex is “all non-word characters EXCEPT a hyphen on its own.” This question was just to unstick me on that last part, finding the hyphen on its own.

JEG2 · October 26, 2016, 2:46pm

Could you just split out the terms and then process the resulting list?

iex(3)> String.split("1\x002\x013\n4\r5 6\t7", ~r{\W+})
["1", "2", "3", "4", "5", "6", "7"]

MatthewMDavis · October 26, 2016, 2:55pm

No, that won’t work. If I split on ALL non-word characters, my mathematical operators vanish. I need to be able to keep the operators (characters in the class [-+*/] that are surrounded by whitespace).

I may end up pre-processing the string and converting the operators to words for the operations they represent. Then if I just split on non-word characters, everything is easy. But I also have to re-convert them back to the original characters at the end. So this seemed like a potentially better way.
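
For reference, the pre-processing idea could look roughly like the sketch below. The module name, the function name, and the placeholder words (PLUS, MINUS, TIMES, DIVIDE) are all made up for illustration, and it assumes an operator counts as "on its own" when it has whitespace or a string edge on both sides.

```elixir
defmodule OperatorWords do
  # Hypothetical placeholder words; any word-character strings that cannot
  # appear in the real input would do.
  @to_word %{"+" => "PLUS", "-" => "MINUS", "*" => "TIMES", "/" => "DIVIDE"}
  @to_op Map.new(@to_word, fn {op, word} -> {word, op} end)

  def tokenize(input) do
    input
    # 1. Replace operators that stand alone (no non-whitespace neighbour).
    |> String.replace(~r{(?<!\S)[-+*/](?!\S)}, &Map.fetch!(@to_word, &1))
    # 2. Split on runs of non-word characters, which now includes "5-6".
    |> String.split(~r{\W+}, trim: true)
    # 3. Turn the placeholder words back into the original operators.
    |> Enum.map(&Map.get(@to_op, &1, &1))
  end
end

IO.inspect(OperatorWords.tokenize("1 2 + 4 -"))
# => ["1", "2", "+", "4", "-"]
IO.inspect(OperatorWords.tokenize("1\x002\x013\n4\r5-6\t7"))
# => ["1", "2", "3", "4", "5", "6", "7"]
```

One caveat with this approach: an input that already contains one of the placeholder words would be converted into an operator in step 3, so the placeholders need to be chosen with that in mind.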

JEG2 · October 26, 2016, 3:32pm

Does this help?

iex(2)> String.split("1\x002\x013\n4\r5 6\t7", ~r{[^-+*/\d]+})
["1", "2", "3", "4", "5", "6", "7"]

MatthewMDavis · October 26, 2016, 3:59pm

Oops. There’s supposed to be a hyphen between the 5 and the 6 of the input. I don’t know how that got left out.

The trick is catching that hyphen when it’s between two word characters and splitting on it, but NOT splitting on a hyphen that is adrift in white space.

String.split("1\x002\x013\n4\r5-6\t7", regex) == ["1", "2", "3", "4", "5", "6", "7"]
String.split("5 2 -", regex) == ["5", "2", "-"]
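
One pattern that appears to satisfy both expectations reuses the lookaround idea from earlier in the thread. This is only a sketch: the module and function names are hypothetical, and it assumes the operators to keep are exactly -, +, * and /. The first alternative splits on a hyphen that sits directly between two word characters; the second splits on any run of characters that are neither word characters nor operators.

```elixir
defmodule ForthTokens do
  # Hypothetical helper, named only for this example.
  def split(input) do
    # (?<=\w)-(?=\w)  a hyphen with a word character on each side
    # [^\w+\-*/]+     a run of anything that is not a word char or an operator
    String.split(input, ~r{(?<=\w)-(?=\w)|[^\w+\-*/]+}, trim: true)
  end
end

IO.inspect(ForthTokens.split("1\x002\x013\n4\r5-6\t7"))
# => ["1", "2", "3", "4", "5", "6", "7"]
IO.inspect(ForthTokens.split("5 2 -"))
# => ["5", "2", "-"]
```

A hyphen adrift in whitespace fails the lookbehind and is excluded from the character class, so it survives as a token of its own.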

JEG2 · October 26, 2016, 6:11pm

Can you give two or three string inputs and the lists you expect the code to produce from those strings? I think that would really help here.

swelham · October 26, 2016, 7:40pm

I read this blog series (sorry, it’s in Python) last year that I think will help. It’s about building a simple interpreter; it starts off with simple maths and then moves on to supporting order of operations. It also requires no regex.

tallakt · October 27, 2016, 4:56pm

\B matches at a non-word boundary, so \Bb only matches a b that is not at the beginning of a word:

iex(9)> for s <- ["abcde", "a b cde"], do: String.split(s, ~r(\Bb|\s))
[["a", "cde"], ["a", "b", "cde"]]
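
Carrying the same idea over to the hyphen case from earlier in the thread takes one adjustment. A hyphen is itself a non-word character, so the assertion flips: a hyphen between two word characters has a word boundary (\b) on each side, while a hyphen adrift in whitespace has none. A minimal sketch, assuming the operators to keep are -, +, * and / (the variable name split is just for illustration):

```elixir
# \b-\b        a hyphen with a word boundary on both sides, e.g. "5-6"
# [^\w+\-*/]+  a run of anything that is not a word char or an operator
split = fn s -> String.split(s, ~r{\b-\b|[^\w+\-*/]+}, trim: true) end

IO.inspect(split.("1\x002\x013\n4\r5-6\t7"))
# => ["1", "2", "3", "4", "5", "6", "7"]
IO.inspect(split.("1 2 + 4 -"))
# => ["1", "2", "+", "4", "-"]
```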