Regex question for hyphen match

belgoros · March 19, 2019, 11:05am

I’m not a pro at Regex and can’t figure out why the following behaviour happens, especially given that matching differs between Javascript, Java, and Ruby (that’s why I’m kind of stuck).
So I’d like to split a String input on one of the below characters:

spaces
underscores
comma
colon
punctuation
special characters

So I have the following code snippet in an iex session:

String.split("testing, one, two car : carpet as java : javascript!!&@$%^& co-operation one_two 1 2", ~r/[\W-[_]|:]+/u, trim: true)

that returns:

["testing", "one", "two", "car", "carpet", "as", "java", "javascript", "co",
 "operation", "one", "two", "1", "2"]

I tried to subtract the underscore from \W (any “non-word” character), because the word class \w (a-zA-Z0-9_) includes it.
As you can see, there is still a problem with the hyphen: co-operation is split in two.
Whatever I try, whenever I put - in the above regex, nothing works; it just breaks the cases that matched before.

Any ideas? Thank you.

NobbZ · March 19, 2019, 11:11am

I’m not sure what you mean by “subtracting the underscore”.

When I simply use ~r/[\W_]/ as the splitter, I get the same result as you do with your more complicated regex. It seems to be splitting on - already.

blatyo · March 19, 2019, 11:58am

If I understand your problem, I think you need to escape - in your character class. The reason is that - is used to define ranges of characters. For example, ~r/[a-z]/ means all characters from a to z, not a, -, and z. You can escape characters in a character class by using \. So, for the previous example, to get it to mean a, -, and z, you’d do ~r/[a\-z]/.
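A minimal sketch of the difference, using only Regex.match?/2 on single characters (the variable names are illustrative and not from the thread). It also shows a third spelling: a hyphen placed last in the class is literal without a backslash.

```elixir
# Unescaped: - defines the range a..z, so the hyphen itself does not match.
range_class = ~r/[a-z]/
# Escaped: the class holds exactly three characters: a, - and z.
escaped_class = ~r/[a\-z]/
# Hyphen in last position: also literal, no backslash needed.
trailing_class = ~r/[az-]/

IO.inspect(Regex.match?(range_class, "m"))     # true
IO.inspect(Regex.match?(range_class, "-"))     # false
IO.inspect(Regex.match?(escaped_class, "-"))   # true
IO.inspect(Regex.match?(escaped_class, "m"))   # false
IO.inspect(Regex.match?(trailing_class, "-"))  # true
```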

Hope that helps

belgoros · March 19, 2019, 12:24pm

@blatyo Exactly, I should not split on the hyphen and should exclude it from matching.
So, escaping - with a backslash kept co-operation intact, but it made the other cases fail:

Before:
String.split("co-operation tree_five one two 1 2 azerty : double:side", ~r/[\W-[_]|:]+/u, trim: true) 
#=> 
["co", "operation", "tree", "five", "one", "two", "1", "2", "azerty", "double",
 "side"]

After:

String.split("co-operation tree_five one two 1 2 azerty : double:side", ~r/[\W-[_]\-|:]+/u, trim: true)
#=> ["co-operation tree_five one two 1 2 azerty : double:side"]

As you can see, the split no longer works on the colon (:) or the underscore (_).

blatyo · March 19, 2019, 12:33pm

Oh, I finally understand what you’re doing. Use this regex instead.

String.split("co-operation tree_five one two 1 2 azerty : double:side", ~r/[^a-zA-Z0-9\-]+/u, trim: true)
#=> ["co-operation", "tree", "five", "one", "two", "1", "2", "azerty", "double",
 "side"]

\W-[_] doesn’t subtract the underscore from the \W character class. As far as I’m aware, there is no supported regex operation like that.

belgoros · March 19, 2019, 1:18pm

@blatyo Thank you for your response. Weird, because I followed this article before applying this subtraction syntax.
Your Regex works but still fails for Unicode characters:

String.split("testing, co-operation tree_five one two 1 2 azerty : double:side schöner javascript!!&@$%^&", ~r/[^a-zA-Z\-]+/u, trim: true)

#=> ["testing", "co-operation", "tree", "five", "one", "two", "azerty", "double",
 "side", "sch", "ner", "javascript"]

As you can see, it splits the word schöner into two parts instead of keeping it whole.
I think it would be the same if I replaced [^a-zA-Z0-9\-]+/u with [^\w\-]+/u.

But in that case the problem of splitting on the underscore is still open.

blatyo · March 19, 2019, 1:34pm

Try this: ~r/([^\w\-]|_)+/u
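As a quick check, here is that regex applied to the longer sample string from earlier in the thread. This is only a sketch: the variable names are illustrative, and it relies on the u modifier making \w Unicode-aware so that ö counts as a word character.

```elixir
sample =
  "testing, co-operation tree_five one two 1 2 azerty : double:side schöner javascript!!&@$%^&"

# [^\w\-] matches anything that is neither a word character nor a hyphen.
# The |_ alternative adds the underscore, which \w would otherwise protect.
# String.split/3 splits on the whole match, so the group is not returned.
words = String.split(sample, ~r/([^\w\-]|_)+/u, trim: true)

IO.inspect(words)
# ["testing", "co-operation", "tree", "five", "one", "two", "1", "2",
#  "azerty", "double", "side", "schöner", "javascript"]
```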

peerreynders · March 19, 2019, 1:36pm

Shorthand Character Classes

In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren’t digits may or may not be included.

iex(1)> String.split("testing, co-operation tree_five one two 1 2 azerty : double:side schöner javascript!!&@$%^&", ~r/[^\p{L}\-]+/u, trim: true)
["testing", "co-operation", "tree", "five", "one", "two", "azerty", "double", 
 "side", "schöner", "javascript"]

Unicode Regular Expressions

You can match a single character belonging to the “letter” category with \p{L}. You can match a single character not belonging to that category with \P{L}.
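Note that the \p{L} version above also drops the digits 1 and 2, because digits are not letters. If the digits should stay (an assumption, since the original requirements do not say), one possible variant adds the Unicode number category \p{N} to the negated class. The variable names are illustrative.

```elixir
text =
  "testing, co-operation tree_five one two 1 2 azerty : double:side schöner javascript!!&@$%^&"

# Split on runs of anything that is not a letter, a number, or a hyphen.
# The underscore is in none of those categories, so it still separates.
tokens = String.split(text, ~r/[^\p{L}\p{N}\-]+/u, trim: true)

IO.inspect(tokens)
# ["testing", "co-operation", "tree", "five", "one", "two", "1", "2",
#  "azerty", "double", "side", "schöner", "javascript"]
```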

belgoros · March 19, 2019, 1:49pm

@blatyo Yep, it worked like a charm, thanks a lot
@peerreynders, yes, Regex has always been kind of an obscure corner

NobbZ · March 19, 2019, 1:53pm

You missed the very first sentence of the article:

Character class subtraction is supported by the XML Schema, XPath, .NET (version 2.0 and later), and JGsoft regex flavors.

Neither Erlang’s nor Elixir’s regex engine (they are basically the same) is in that list.
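In flavors without class subtraction, a negative lookahead is one common way to get a similar effect. The sketch below is not from the article or the thread: it treats \W plus the underscore as separators and then excludes the hyphen, and the name separator is illustrative.

```elixir
# (?!-) refuses to start a match at a hyphen; [\W_] then consumes one
# separator character. Together they act like "[\W_] minus -".
separator = ~r/(?:(?!-)[\W_])+/u

parts =
  String.split("co-operation tree_five schöner : double:side", separator, trim: true)

IO.inspect(parts)
# ["co-operation", "tree", "five", "schöner", "double", "side"]
```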

belgoros · March 19, 2019, 2:06pm

Nice catch, good to know, thank you!