I’m not a pro at regex and can’t figure out why the following behaviour happens, especially given how matching differs between JavaScript, Java, and Ruby (which is why I’m kind of stuck).
So I’d like to split a String input on any of the characters below:
spaces
underscores
comma
colon
punctuation
special characters
So I have the following code snippet in an iex session:
String.split("testing, one, two car : carpet as java : javascript!!&@$%^& co-operation one_two 1 2", ~r/[\W-[_]|:]+/u, trim: true)
I had to subtract the underscore symbol from \W (any “non-word” character) as it includes it (a-zA-Z_).
As you see, there is still a problem with matching the hyphen in the word co-operation.
Whatever I try, as soon as I put - into the regex above, nothing works: it just breaks the cases that matched before.
I’m not sure what you mean by “subtracting the underscore”.
When I simply use ~r/[\W_]/ as the splitter, I get the same result as you do with your more complicated regex. It seems to be splitting on - already this way.
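For reference, here is a minimal sketch of that check on the input from the question. The variable names are mine, and the result comment assumes `trim: true` drops every empty string produced by adjacent separators.

```elixir
input =
  "testing, one, two car : carpet as java : javascript!!&@$%^& co-operation one_two 1 2"

# \W already covers the hyphen, and _ is listed explicitly,
# so both co-operation and one_two get split.
parts = String.split(input, ~r/[\W_]/, trim: true)
# parts == ["testing", "one", "two", "car", "carpet", "as", "java", "javascript",
#           "co", "operation", "one", "two", "1", "2"]
```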
If I understand your problem, I think you need to escape - in your character class. The reason is that - is used to define ranges of characters. For example, ~r/[a-z]/ means all characters from a to z, not a, -, and z. You can escape characters in a character class by using \. So, for the previous example, to get it to mean a, -, and z, you’d write ~r/[a\-z]/.
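A small sketch of the difference, using `Regex.scan/2` on a made-up string (the string and variable names are only for illustration):

```elixir
text = "a-m-z"

# Unescaped: the hyphen defines the range a..z, so "-" itself is not matched.
range_matches = Regex.scan(~r/[a-z]/, text) |> List.flatten()
# range_matches == ["a", "m", "z"]

# Escaped: the class contains exactly a, - and z.
literal_matches = Regex.scan(~r/[a\-z]/, text) |> List.flatten()
# literal_matches == ["a", "-", "-", "z"]
```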
@blatyo Exactly, I should not split on hyphen and should exclude it from matching.
So, using a backslash to split on - worked, but it makes other examples fail:
@blatyo Thank you for your response. Weird, because I followed this article before applying this subtraction syntax.
Your Regex works but still fails for Unicode characters:
As you see, it splits the word schöner into two parts instead of keeping it whole.
I think it would be the same if I replaced [^a-zA-Z0-9\-]+/u with [^\w\-]+/u.
But in this case the problem of splitting on the underscore is still open.
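To make the two behaviours concrete, here is a rough sketch on a shortened input of my own. It assumes Elixir's `u` modifier, which as far as I know makes `\w` match Unicode letters as well.

```elixir
input = "schöner co-operation one_two"

# ASCII-only class: ö is not in a-zA-Z0-9, so it acts as a separator.
ascii_parts = String.split(input, ~r/[^a-zA-Z0-9\-]+/u, trim: true)
# ascii_parts == ["sch", "ner", "co-operation", "one", "two"]

# \w with the u modifier keeps ö, but \w also contains the underscore,
# so one_two is no longer split.
word_parts = String.split(input, ~r/[^\w\-]+/u, trim: true)
# word_parts == ["schöner", "co-operation", "one_two"]
```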
In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren’t digits may or may not be included.
You can match a single character belonging to the “letter” category with \p{L}. You can match a single character not belonging to that category with \P{L}.
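Building on that, one possible separator for the original question is a negated class of letters, digits, and the hyphen. This is only a sketch: adding `\p{N}` (the “number” category) to keep digits is my assumption, and the input is a shortened variant of the one in the question.

```elixir
input = "schöner co-operation one_two, java : javascript!! 1 2"

# Split on runs of anything that is not a letter, a number or a hyphen.
# The underscore is neither \p{L} nor \p{N}, so it is treated as a separator.
separator = ~r/[^\p{L}\p{N}\-]+/u

parts = String.split(input, separator, trim: true)
# parts == ["schöner", "co-operation", "one", "two", "java", "javascript", "1", "2"]
```

Listing the characters to keep, rather than subtracting from \W, avoids relying on what a given flavor includes in \w.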