Regex question for hyphen match

I’m no expert with regexes and can’t figure out why the following behaviour happens, especially given how differently matching works in JavaScript, Java, and Ruby (which is why I’m kind of stuck).
I’d like to split a String input on any of the characters below:

  • spaces
  • underscores
  • commas
  • colons
  • punctuation
  • special characters

I ran the following snippet in an iex session:

String.split("testing, one, two car : carpet as java : javascript!!&@$%^& co-operation one_two 1 2", ~r/[\W-[_]|:]+/u, trim: true)

that returns:

["testing", "one", "two", "car", "carpet", "as", "java", "javascript", "co",
 "operation", "one", "two", "1", "2"]

I had to “subtract” the underscore from \W (any “non-word” character), because the word class \w includes it (a-zA-Z0-9_).
As you can see, there is still a problem: the hyphen in co-operation is matched, so the word gets split.
Whatever I try, as soon as I put - into the regex above, nothing works; it just breaks the cases that matched before.

Any ideas? Thank you.

I’m not sure what you mean by “subtracting the underscore”.

When I simply use ~r/[\W_]/ as the splitter, I get the same result as you do with your more complicated regex. It seems to be splitting on - already.
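A quick way to check what \W covers, as a small sketch (the sample strings are made up for illustration): the underscore counts as a word character, so \W alone never matches it, while the hyphen is a non-word character.

```elixir
# \w is [a-zA-Z0-9_] in ASCII mode, so "_" is a word character.
underscore_is_word = Regex.match?(~r/\w/, "_")
underscore_is_nonword = Regex.match?(~r/\W/, "_")
hyphen_is_nonword = Regex.match?(~r/\W/, "-")
IO.inspect({underscore_is_word, underscore_is_nonword, hyphen_is_nonword})
# {true, false, true}

# Adding "_" to the class splits on underscores too, and on "-" as before.
parts = String.split("co-operation one_two", ~r/[\W_]/, trim: true)
IO.inspect(parts)
# ["co", "operation", "one", "two"]
```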

If I understand your problem, I think you need to escape - in your character class. The reason is that - is used to define ranges of characters. For example, ~r/[a-z]/ means all characters from a to z, not a, -, and z. You can escape characters in a character class by using \. So, for the previous example, to get it to mean a, -, and z, you’d write ~r/[a\-z]/.
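To see the difference, here is a minimal sketch; the test string "a-mz" is only an illustration:

```elixir
# Unescaped: "-" between "a" and "z" defines the range a..z,
# so the hyphen in the input is not matched.
range = Regex.scan(~r/[a-z]/, "a-mz")
IO.inspect(range)
# [["a"], ["m"], ["z"]]

# Escaped: the class holds exactly three characters: "a", "-" and "z".
literal = Regex.scan(~r/[a\-z]/, "a-mz")
IO.inspect(literal)
# [["a"], ["-"], ["z"]]
```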

Hope that helps

@blatyo Exactly, I should not split on the hyphen; it should be excluded from matching.
So escaping - with a backslash kept the hyphen intact, but it makes the other cases fail:

Before:

String.split("co-operation tree_five one two 1 2 azerty : double:side", ~r/[\W-[_]|:]+/u, trim: true)
#=> ["co", "operation", "tree", "five", "one", "two", "1", "2", "azerty", "double",
 "side"]

After:

String.split("co-operation tree_five one two 1 2 azerty : double:side", ~r/[\W-[_]\-|:]+/u, trim: true)
#=> ["co-operation tree_five one two 1 2 azerty : double:side"]

As you can see, the split no longer works on the colon (:) or the underscore (_).

Oh, I finally understand what you’re doing. Use this regex instead.

String.split("co-operation tree_five one two 1 2 azerty : double:side", ~r/[^a-zA-Z0-9\-]+/u, trim: true)
#=> ["co-operation", "tree", "five", "one", "two", "1", "2", "azerty", "double",
 "side"]

\W-[_] doesn’t subtract the underscore from the \W character class. As far as I’m aware, this regex flavor doesn’t support any such operation.

@blatyo Thank you for your response. Weird, because I followed this article before applying the subtraction syntax.
Your regex works but still fails for Unicode characters:

String.split("testing, co-operation tree_five one two 1 2 azerty : double:side schöner javascript!!&@$%^&", ~r/[^a-zA-Z\-]+/u, trim: true)

#=> ["testing", "co-operation", "tree", "five", "one", "two", "azerty", "double",
 "side", "sch", "ner", "javascript"]

As you can see, it splits the word schöner into two parts instead of keeping it whole.
I think it would be the same if I replaced [^a-zA-Z0-9\-]+/u with [^\w\-]+/u.

But in that case the problem of splitting on underscores is still open :frowning:

Try this: ~r/([^\w\-]|_)+/u
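As a sketch of how that behaves on the sample sentence from earlier in the thread (assuming the u modifier makes \w Unicode-aware, as the Elixir Regex docs describe):

```elixir
input = "testing, co-operation tree_five one two 1 2 azerty : double:side schöner javascript!!&@$%^&"

# [^\w\-] matches anything that is neither a word character nor a hyphen;
# the "|_" alternative adds the underscore back as a separator.
words = String.split(input, ~r/([^\w\-]|_)+/u, trim: true)
IO.inspect(words)
# ["testing", "co-operation", "tree", "five", "one", "two", "1", "2",
#  "azerty", "double", "side", "schöner", "javascript"]
```

The capture group does not end up in the result, because splitting is done on the whole match by default.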

Shorthand Character Classes

In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren’t digits may or may not be included.

iex(1)> String.split("testing, co-operation tree_five one two 1 2 azerty : double:side schöner javascript!!&@$%^&", ~r/[^\p{L}\-]+/u, trim: true)
["testing", "co-operation", "tree", "five", "one", "two", "azerty", "double", 
 "side", "schöner", "javascript"]

Unicode Regular Expressions

You can match a single character belonging to the “letter” category with \p{L}. You can match a single character not belonging to that category with \P{L}.
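A short sketch of both forms. The last part is an assumption on top of the thread: if digits should survive the split as well, \p{N} (Unicode numbers) can be added to the characters that are kept. The sample strings are made up.

```elixir
# \p{L} matches any Unicode letter, \P{L} any character that is not one.
letters = Regex.scan(~r/\p{L}+/u, "schöner 42")
IO.inspect(letters)
# [["schöner"]]

non_letters = Regex.scan(~r/\P{L}+/u, "schöner 42")
IO.inspect(non_letters)
# [[" 42"]]

# Assumption: digits should be kept too, so \p{N} joins the negated class.
input = "co-operation tree_five 1 2 schöner javascript!!"
words = String.split(input, ~r/[^\p{L}\p{N}\-]+/u, trim: true)
IO.inspect(words)
# ["co-operation", "tree", "five", "1", "2", "schöner", "javascript"]
```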

@blatyo Yep, it worked like a charm, thanks a lot :slight_smile:
@peerreynders, yes, regex has always been kind of an obscure corner :slight_smile:

You missed the very first sentence of the article:

Character class subtraction is supported by the XML Schema, XPath, .NET (version 2.0 and later), and JGsoft regex flavors.

Neither Erlang’s nor Elixir’s regex engine (they are basically the same) is in that list.
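Without subtraction syntax, a similar effect can usually be had with a negated class or a lookahead. The following is only a sketch of those two common workarounds, not something taken from the article; the sample string is made up. It removes the underscore from \w:

```elixir
# [^\W_] reads as "not a non-word character and not an underscore",
# i.e. \w with the underscore taken out.
negated = Regex.scan(~r/[^\W_]+/u, "one_two schöner") |> List.flatten()
IO.inspect(negated)
# ["one", "two", "schöner"]

# Same idea with a negative lookahead: a word character that is not "_".
lookahead = Regex.scan(~r/(?:(?!_)\w)+/u, "one_two schöner") |> List.flatten()
IO.inspect(lookahead)
# ["one", "two", "schöner"]
```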

Nice catch, good to know. Thank you!