Regex punctuation matchers

AFAIU, Elixir delegates regular expression evaluation to Erlang.

Erlang claims to support Unicode general category matchers.

“ is declared in the Unicode spec as LEFT DOUBLE QUOTATION MARK (U+201C) in the General Punctuation block. Both Ruby and Perl recognize this symbol as initial (opening) punctuation:

[0x201C].pack('U*').match /\p{Pi}/
#⇒ #<MatchData "“">

Both Elixir and Erlang, unfortunately, do not:

Regex.scan(~r/[\p{Pi}\p{Pf}\p{Ps}\p{Pe}]/, "'\"“”‘’«»")
#⇒ [[<<171>>], [<<187>>]] # these are « and »

What am I missing, and/or what should I tune to make the regex engine work properly?

You forgot to enable unicode mode:

iex(1)> Regex.scan(~r/[\p{Pi}\p{Pf}\p{Ps}\p{Pe}]/, "'\"“”‘’«»")
[[<<171>>], [<<187>>]]
iex(2)> Regex.scan(~r/[\p{Pi}\p{Pf}\p{Ps}\p{Pe}]/u, "'\"“”‘’«»")
[["“"], ["”"], ["‘"], ["’"], ["«"], ["»"]]
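For anyone wondering where the stray <<171>> and <<187>> come from: as far as I understand it, without the u modifier the subject is matched byte by byte, and the second byte of the UTF-8 encoding of « happens to be the Latin-1 code for «. A minimal sketch of that, using Regex.compile!/2 instead of the sigil (the variable names are just for illustration):

```elixir
input = "«“"

# In UTF-8, « is the two bytes 194 and 171; byte 171 on its own is « in Latin-1.
IO.inspect(:binary.bin_to_list("«"))
# prints [194, 171]

# No modifiers: the subject is treated as a sequence of bytes.
bytewise = Regex.compile!("\\p{Pi}")

# "u" is the same modifier as the trailing u in ~r/.../u.
unicode = Regex.compile!("\\p{Pi}", "u")

# Only the lone byte 171 is matched; “ is not found at all.
IO.inspect(Regex.scan(bytewise, input))

# Both initial-quote characters are matched as whole code points.
IO.inspect(Regex.scan(unicode, input))
```

So the byte-wise result is not a half-working Unicode match, it is a Latin-1 accident.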

Thanks! I wonder how come it’s disabled by default.

Yeah, that’s a thing I do not understand either. Elixir has very good unicode support, but having to remember to explicitly enable it for a regex every time is a bit…

Because parsing unicode is muuuuuuch slower than parsing ascii in a few common cases, and if you know you don’t need direct unicode matching then there is no need.

Nah. People who parse looooooong texts with regular expressions, or who need a μs speed-up when parsing a natural-language string of a dozen characters, are surely aware of all the flags.

The gain for, say, parsing [pun intended] emails is negligible though. For the sake of saving keystrokes (I could buy a new RAM stick for every 100K “u” typed) the switch should be turned on by default IMHO.

Except you can’t always know whether they are parsing a hundred thousand emails, for example. But still, if you think it should be on by default, write up a proposal for the Elixir core mailing list, after first opening a forum thread dedicated to it. :slight_smile:

Fair enough.

This is a matter of personal preference and I cordially avoid wasting core team time with such requests :slight_smile:

You can always make your own sigil macro that delegates to sigil_r and automatically adds the unicode modifier, then import it where you need it. :slight_smile:
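A rough sketch of what such a delegating macro could look like. The module name UnicodeSigil and the sigil name sigil_u are made up for illustration, and I have not tried this beyond the small check at the bottom:

```elixir
defmodule UnicodeSigil do
  # Hypothetical helper, not part of Elixir: ~u behaves like ~r,
  # except that the u (unicode) modifier is always switched on.
  defmacro sigil_u(term, modifiers) do
    # modifiers arrives as a charlist, e.g. ~c"i" for ~u/.../i
    modifiers = if ?u in modifiers, do: modifiers, else: modifiers ++ [?u]

    # Delegate everything else to the stock Kernel.sigil_r/2 macro.
    quote do
      sigil_r(unquote(term), unquote(modifiers))
    end
  end
end

import UnicodeSigil

quotes = Regex.scan(~u/[\p{Pi}\p{Pf}]/, "'\"“”‘’«»")
IO.inspect(quotes)
# prints [["“"], ["”"], ["‘"], ["’"], ["«"], ["»"]]
```

Because it only rewrites the modifier list and hands off to sigil_r, interpolation and the other modifiers should keep working exactly as they do with ~r.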

If one does this, I’d be happy if they did not use sigil_r; it might be confusing when copy-pasting into a module that does not have that import… Perhaps sigil_u for unicode?

C’mon :slight_smile:

sigil_ฤ.

Only [a-zA-Z] is allowed as a sigil name :frowning:

I live in 2022 :slight_smile:
