Regex.match?( ~r/^[[:ascii:]]+$/u, "ü")
returns true
But the Regex docs say:
ascii - Character codes 0-127
And the codepoint of ü is 252.
That is definitely surprising. The use of the u flag does change the meaning of some POSIX character classes, but there’s no mention of [:ascii:] changing in those circumstances, as you noted.
In my unicode_set library I build the regex in the way you expect - just in case that lib helps you in some way.
iex> regex = Unicode.Regex.compile! "[[:ascii:]]"
~r/[\x{0}-\x{7F}]/u
iex> Unicode.Regex.match? regex, "ü"
false
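If pulling in a library is not an option, the same explicit range can be written by hand. This is only a minimal sketch: the AsciiCheck module and its function names are hypothetical, not part of Elixir or unicode_set. It uses an explicit byte range instead of the POSIX class, so it should not depend on how [:ascii:] is implemented.

```elixir
defmodule AsciiCheck do
  # Hypothetical helper module, for illustration only.

  # Regex variant: explicit range 0x00..0x7F instead of [[:ascii:]].
  # No `u` flag, so the comparison is byte-wise.
  def ascii_regex?(string) when is_binary(string) do
    Regex.match?(~r/\A[\x00-\x7F]+\z/, string)
  end

  # Regex-free variant: every byte must be at most 0x7F.
  # The empty string is rejected, mirroring the `+` quantifier above.
  def ascii_bytes?(string) when is_binary(string) do
    string != "" and Enum.all?(:binary.bin_to_list(string), &(&1 <= 0x7F))
  end
end

IO.inspect(AsciiCheck.ascii_regex?("hello"), label: "hello")
IO.inspect(AsciiCheck.ascii_regex?("\u00FC"), label: "u-umlaut")
```

Both functions treat control characters such as <<0>> as ASCII, which matches the documented 0-127 definition.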
I also tried it without the u flag and got the same result.
iex(1)> Regex.match?( ~r/^[[:ascii:]]+$/, "ü")
true
Yes, which is even more weird, since in non-Unicode mode the comparison is meant to be byte-wise. Since << 195, 188 >> = "ü", neither byte falls within 0..127.
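To make the byte-wise point concrete, here is a small sketch that inspects both views of the string. It writes ü as the escape \u00FC so the result does not depend on how the source file is encoded; the variable names are only for illustration.

```elixir
s = "\u00FC"                          # "ü" (U+00FC)

bytes = :binary.bin_to_list(s)        # UTF-8 encoding: [195, 188]
codepoints = String.to_charlist(s)    # Unicode code points: [252]

IO.inspect(bytes, label: "bytes")
IO.inspect(codepoints, label: "code points")

# Neither view contains a value in the documented ASCII range 0..127.
IO.inspect(Enum.any?(bytes ++ codepoints, &(&1 in 0..127)), label: "any value in 0..127?")
```

So whether the engine compares bytes or code points, nothing here should qualify as ASCII under the documented definition.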
I believe [:ascii:] is matching against the extended ASCII table, which is 0-255. So something is wrong in the docs.
Probably something inherited from the erlang regex implementation:
nix shell nixpkgs#erlang -c erl
Erlang/OTP 22 [erts-10.7] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [hipe]
Eshell V10.7 (abort with ^G)
1> re:run(<<"ü">>, "[[:ascii:]]").
{match,[{0,1}]}
2> re:run(<<"ü"/utf8>>, "[[:ascii:]]").
{match,[{0,1}]}
3> re:run("ü", "[[:ascii:]]").
{match,[{0,1}]}
Looking at the docs, I would not expect this:
https://erlang.org/doc/man/re.html#posix-character-classes
EDIT:
iex(1)> match_char = fn c -> Regex.match?( ~r/^[[:ascii:]]+$/u, List.to_string([c])) end
#Function<44.79398840/1 in :erl_eval.expr/5>
iex(2)> length(for c <- 0..255, match_char.(c), do: c)
256
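The same probe can be extended to list exactly where [[:ascii:]] and an explicit 0-127 range disagree. This is a sketch only; the variable names are made up, and the result is deliberately not shown because it depends on the OTP version (and, judging by this thread, possibly on how the underlying tables were built).

```elixir
posix = ~r/\A[[:ascii:]]\z/u
explicit = ~r/\A[\x{0}-\x{7F}]\z/u

# Code points in 0..255 for which the two patterns give different answers.
disagreements =
  Enum.filter(0..255, fn c ->
    s = <<c::utf8>>
    Regex.match?(posix, s) != Regex.match?(explicit, s)
  end)

IO.inspect(length(disagreements), label: "code points where the two disagree")
IO.inspect(disagreements, label: "disagreements", charlists: :as_lists, limit: :infinity)
```

An empty list would mean the POSIX class behaves as documented on that system.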
Well, Regex is just a wrapper over re, so this is pretty obvious.
@AdiletAbylov I would check whether it is a bug in PCRE, which Erlang uses (with a few patches).
I tried this in several online regex checkers, each of which claims to use PCRE, and none of them returns the same result as :re. So it seems to be either a bug or a doc error.
I opened an issue on the erlang OTP repository.
It is definitely a bug in the character class.
The character is: 𑩄 U+11A44 ZANABAZAR SQUARE MARK LONG TSHEG
iex> Regex.match?( ~r/^[[:ascii:]]+$/, <<72260::utf8>>)
true
The challenge is to find a character that will return false
You need to apply Unicode mode to match Unicode properly (one of the unfortunate defaults in Elixir, but it’s too late to fix now).
iex> Regex.match?( ~r/^[[:ascii:]]+$/u, <<72260::utf8>>)
false
iex> Regex.match?( ~r/^[[:ascii:]]+$/u, <<256::utf8>>)
false
Thank you. Whoa! This is the first time I’ve faced a bug in a language.
Being overly pedantic: it is in the standard library, not the language per se (I would count a bug in the compiler as such).
You are right.
Today I learned something new.
Thank you!
The reason for that behaviour was found on the Erlang Slack: namely, the worst idea for an API ever designed, C locales. The [:ascii:] character class uses the isprint() C function behind the scenes, and that function is locale-aware (in some implementations of libc). You aren’t the first one to be bitten in the arse by such quirks:
That’s beyond weird. How can this not be a bug? isprint() doesn’t correlate at all well with [:ascii:], since [:ascii:] is supposed to be 0..127 and a lot of those aren’t printable! And yet they do match.
iex> Regex.match? ~r/[[:ascii:]]/, <<1>>
true
iex> Regex.match? ~r/[[:ascii:]]/, <<0>>
true
I haven’t seen any response to the issue I raised so I’ll be curious if any will be forthcoming …
[:ascii:] roughly translates to isprint(c) || iscntrl(c), which allows it to match on <<0>>. And yes, this is a bug in the implementation as well as a bug in the C standard (the locale API should be burned in fire).
And definitely confusing to those coming from Ruby or Perl, as neither of these has [:ascii:] matching non-ASCII characters, so presumably this means it is not PCRE compliant!?
my $s = "ü";
if ($s =~ /^[[:ascii:]]+$/ ) {
print "$s is ascii\n";
} else {
print "$s is NOT ascii\n";
}
if ( $s =~ /^[[:ascii:]]+$/u ) {
print "$s is ascii-u\n";
} else {
print "$s is NOT ascii-u\n";
}
$ perl /tmp/p
ü is NOT ascii
ü is NOT ascii-u