Regex punctuation matchers

mudasobwa · September 25, 2018, 7:23am

AFAIU, Elixir delegates regular expression evaluation to Erlang.

Erlang claims to support Unicode general category matchers.

“ is declared in the Unicode spec as LEFT DOUBLE QUOTATION MARK, in the General Punctuation block. Both Ruby and Perl recognize this symbol as initial quote punctuation (Pi):

[0x201C].pack('U*').match /\p{Pi}/
#⇒ #<MatchData "“">

Both Elixir and Erlang, unfortunately, do not:

Regex.scan(~r/[\p{Pi}\p{Pf}\p{Ps}\p{Pe}]/, "'\"“”‘’«»")
#⇒ [[<<171>>], [<<187>>]] # these are « and »

What am I missing, and/or what should I tune to make the regex engine work properly?

NobbZ · September 25, 2018, 7:48am

You forgot to enable unicode mode:

iex(1)> Regex.scan(~r/[\p{Pi}\p{Pf}\p{Ps}\p{Pe}]/, "'\"“”‘’«»")
[[<<171>>], [<<187>>]]
iex(2)> Regex.scan(~r/[\p{Pi}\p{Pf}\p{Ps}\p{Pe}]/u, "'\"“”‘’«»")
[["“"], ["”"], ["‘"], ["’"], ["«"], ["»"]]
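
For what it's worth, here is a minimal sketch of the difference on a single character. The variable names are just for illustration. The single bytes <<171>> and <<187>> in the first result suggest that without the u modifier the subject is matched byte by byte, so a multi-byte character like “ never matches as a whole:

```elixir
# “ is U+201C, which UTF-8 encodes as the three bytes E2 80 9C.
left_quote = "“"

# Without the u modifier the regex sees three separate bytes,
# none of which is in the Pi category on its own.
without_u = Regex.match?(~r/\p{Pi}/, left_quote)

# With the u modifier the subject is decoded as UTF-8 first,
# so the regex sees the single code point U+201C.
with_u = Regex.match?(~r/\p{Pi}/u, left_quote)

IO.inspect({without_u, with_u})
# Expected, going by the results above: {false, true}
```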

mudasobwa · September 25, 2018, 8:33am

Thanks! I wonder how come it’s disabled by default.

NobbZ · September 25, 2018, 8:40am

Yeah, that’s a thing I do not understand either. Elixir has very good Unicode support, but having to remember to explicitly enable it for a regex every time is a bit…

OvermindDL1 · September 25, 2018, 2:39pm

Because parsing unicode is muuuuuuch slower than parsing ascii in a few common cases, and if you know you don’t need direct unicode matching then there is no need.

mudasobwa · September 25, 2018, 3:02pm

Nah. People who parse looooooong texts with regular expressions, or who need a μs speed-up when parsing a natural-language string of a dozen characters, are surely aware of all the flags.

The gain for, say, parsing [pun intended] emails is negligible, though. For the sake of saving keystrokes (I could buy a new RAM stick for every 100K “u” typed), the switch should be turned on by default IMHO.

OvermindDL1 · September 25, 2018, 3:04pm

Except you can’t always know whether they are parsing a hundred thousand emails, for example. But still, if you think it should be on by default, write up a proposal for the Elixir core mailing list, after first opening a forum thread dedicated to it.

mudasobwa · September 25, 2018, 3:22pm

Fair enough.

This is a matter of personal preference, and I cordially avoid wasting the core team’s time with such requests.

OvermindDL1 · September 25, 2018, 3:23pm

You can always make your own delegating sigil_r macro that auto-adds the unicode tag too and import it where you need.

NobbZ · September 25, 2018, 6:34pm

If one does this, I’d be happy if he did not use sigil_r, it might be confusing when copy pasting into a module that does not have that import… Perhaps sigil_u for unicode?
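
Something along these lines might do, as a rough sketch only. UnicodeSigil and UnicodeSigilDemo are made-up module names and ~u is not part of Elixir. Instead of delegating to sigil_r directly, this version calls Regex.compile!/2, so the regex is compiled at runtime on every call rather than at compile time like ~r:

```elixir
defmodule UnicodeSigil do
  @moduledoc """
  Hypothetical helper (not part of Elixir): `~u` builds a regex like `~r`,
  but always adds the `u` modifier.
  """

  defmacro sigil_u(pattern, modifiers) do
    # `modifiers` arrives as a charlist of the letters written after the
    # closing delimiter; prepend ?u and drop duplicates.
    opts = List.to_string(Enum.uniq([?u | modifiers]))

    quote do
      # Assumption: compiling at runtime is acceptable for this sketch.
      Regex.compile!(unquote(pattern), unquote(opts))
    end
  end
end

defmodule UnicodeSigilDemo do
  import UnicodeSigil

  # Same character class as in the original question, without a trailing u.
  def quotes(text), do: Regex.scan(~u/[\p{Pi}\p{Pf}\p{Ps}\p{Pe}]/, text)
end

IO.inspect(UnicodeSigilDemo.quotes("'\"“”‘’«»"))
```

With the import in place, the call above should give the same six matches as the ~r/…/u version earlier in the thread. Anyone copy-pasting ~u/…/ into a module without the import gets a compile error instead of a silently different regex, which is the point of not reusing sigil_r.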

mudasobwa · September 26, 2018, 6:21am

C’mon

sigil_ฤ.

NobbZ · September 26, 2018, 7:15am

Only [a-zA-Z] is allowed as a sigil name

mudasobwa · September 26, 2018, 7:21am

I live in 2022