Why is ü an ASCII character?

AdiletAbylov · February 22, 2021, 1:38pm

Regex.match?( ~r/^[[:ascii:]]+$/u, "ü") returns true

But in Regex docs:

ascii - Character codes 0-127

And ü codepoint is 252

kip · February 22, 2021, 3:42pm

That is definitely surprising. The use of the u flag does change the meaning of some POSIX character classes but there’s no mention of [:ascii:] changing in those circumstances as you noted.

In my unicode_set library I build the regex in the way you expect - just in case that lib helps you in some way.

iex> regex = Unicode.Regex.compile! "[[:ascii:]]"
~r/[\x{0}-\x{7F}]/u
iex> Unicode.Regex.match? regex, "ü"             
false
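
For anyone who would rather not add a dependency, the same explicit range can be written by hand. This is only a minimal sketch: it relies on the range shown in the output above rather than on [:ascii:], and the variable name ascii_only is made up for illustration.

```elixir
# Explicit code point range instead of the POSIX [:ascii:] class.
ascii_only = ~r/^[\x{0}-\x{7F}]+$/u

IO.inspect(Regex.match?(ascii_only, "hello"))
# => true
IO.inspect(Regex.match?(ascii_only, "ü"))
# => false
```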

LostKobrakai · February 22, 2021, 3:44pm

I also tried it without the u flag and got the same result.

iex(1)> Regex.match?( ~r/^[[:ascii:]]+$/, "ü")
true

kip · February 22, 2021, 4:04pm

Yes, which is even weirder, since in non-Unicode mode the comparison is meant to be byte-wise. "ü" is << 195, 188 >>, and neither of those bytes is in the ASCII range 0..127.
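
A quick way to confirm the byte-level claim, as a minimal sketch using only standard-library calls (it assumes the source file or IEx session is UTF-8 encoded; the variable names are just for illustration):

```elixir
# "ü" is U+00FC, which UTF-8 encodes as two bytes.
bytes = :binary.bin_to_list("ü")
IO.inspect(bytes)
# => [195, 188]

# Neither byte falls inside the 7-bit ASCII range 0..127.
all_ascii = Enum.all?(bytes, fn byte -> byte in 0..127 end)
IO.inspect(all_ascii)
# => false
```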

AdiletAbylov · February 22, 2021, 4:37pm

I believe [:ascii:] works on the extended ASCII table, which is 0-255. So something is wrong in the docs.

NobbZ · February 22, 2021, 4:57pm

Probably something inherited from the Erlang regex implementation:

nix shell nixpkgs#erlang -c erl
Erlang/OTP 22 [erts-10.7] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [hipe]

Eshell V10.7  (abort with ^G)
1> re:run(<<"ü">>, "[[:ascii:]]").
{match,[{0,1}]}
2> re:run(<<"ü"/utf8>>, "[[:ascii:]]").
{match,[{0,1}]}
3> re:run("ü", "[[:ascii:]]").         
{match,[{0,1}]}

Sebb · February 22, 2021, 5:00pm

Looking at the docs, I would not expect this:

https://erlang.org/doc/man/re.html#posix-character-classes

EDIT:

iex(1)> match_char = fn c -> Regex.match?( ~r/^[[:ascii:]]+$/u, List.to_string([c])) end
#Function<44.79398840/1 in :erl_eval.expr/5>
iex(2)> length(for c <- 0..255, match_char.(c), do: c)                                  
256
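
For comparison, here is the same enumeration with an explicit 0..127 range, which is what the docs describe for [:ascii:]. This is a sketch only, and match_range is a made-up name:

```elixir
# Same test as above, but with an explicit code point range
# in place of the [:ascii:] class.
match_range = fn c ->
  Regex.match?(~r/^[\x{0}-\x{7F}]+$/u, List.to_string([c]))
end

count = Enum.count(0..255, match_range)
IO.inspect(count)
# => 128
```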

hauleth · February 22, 2021, 5:37pm

Well, Regex is just a wrapper over re, so this is pretty obvious.

@AdiletAbylov I would check whether it isn’t a bug in PCRE, which is used by Erlang (with a few patches).

kip · February 22, 2021, 5:54pm

I tried this in several online regex checkers, each of which claims to use PCRE, and none of them returns the same result as :re. So it seems to be either a bug or a doc error.

kip · February 23, 2021, 12:15am

I opened an issue on the Erlang/OTP repository.

eksperimental · February 23, 2021, 1:52am

It is definitely a bug in the character class.

The character is: 𑩄 U+11A44 ZANABAZAR SQUARE MARK LONG TSHEG

iex> Regex.match?( ~r/^[[:ascii:]]+$/, <<72260::utf8>>)   
true

The challenge is to find a character that will return false

kip · February 23, 2021, 2:11am

You need to apply Unicode mode to match Unicode properly (one of the unfortunate defaults in Elixir, but it’s too late to fix now).

iex> Regex.match?( ~r/^[[:ascii:]]+$/u, <<72260::utf8>>)  
false

iex> Regex.match?( ~r/^[[:ascii:]]+$/u, <<256::utf8>>)    
false

AdiletAbylov · February 23, 2021, 2:21am

Thank you. Whoa! This is the first time I have faced a bug in a language.

hauleth · February 23, 2021, 8:08am

Being overly pedantic: this is the standard library, not the language per se (a bug in the compiler is what I would count as a language bug).

AdiletAbylov · February 23, 2021, 9:40am

You are right.

eksperimental · February 24, 2021, 12:00am

Today I learned something new.
Thank you!

hauleth · February 24, 2021, 7:48pm

The reason for that behaviour was found on the Erlang Slack: C locales, the worst idea for an API ever designed. The [:ascii:] character class uses the isprint() C function behind the scenes, and that function is locale-aware (in some implementations of libc). You aren’t the first one to be bitten in the arse by such quirks:

regex - Should we consider using range [a-z] as a bug? - Stack Overflow
stream_libarchive: workaround various types of locale braindeath · mpv-player/mpv@1e70e82 · GitHub

kip · February 24, 2021, 11:00pm

That’s beyond weird. How can this not be a bug? isprint() doesn’t correlate at all well with [:ascii:], since [:ascii:] is supposed to be 0..127 and a lot of those aren’t printable! And yet they do match.

iex> Regex.match? ~r/[[:ascii:]]/, <<1>>
true
iex> Regex.match? ~r/[[:ascii:]]/, <<0>>
true

I haven’t seen any response to the issue I raised so I’ll be curious if any will be forthcoming …

hauleth · February 25, 2021, 7:54am

[:ascii:] roughly translates to isprint(c) || iscntrl(c), which allows it to match on <<0>>. And yes, this is a bug in the implementation as well as a bug in the C standard (the locale API should be burned in fire).
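
Until that is fixed, one way to sidestep both PCRE and the C locale is to check the bytes directly with binary pattern matching. This is a minimal sketch, not anything from the standard library: the module AsciiCheck and its function names are hypothetical, and it assumes "ASCII" means every byte is in 0..127 and that the input is non-empty (as the + in the original pattern requires).

```elixir
defmodule AsciiCheck do
  # Hypothetical helper, not part of Elixir or Erlang.
  # Returns true only for a non-empty binary whose bytes are all in 0..127.
  def ascii?(<<>>), do: false
  def ascii?(binary) when is_binary(binary), do: all_ascii?(binary)

  # Walk the binary one byte at a time; stop at the first byte above 127.
  defp all_ascii?(<<>>), do: true
  defp all_ascii?(<<byte, rest::binary>>) when byte <= 127, do: all_ascii?(rest)
  defp all_ascii?(_other), do: false
end

IO.inspect(AsciiCheck.ascii?("hello"))
# => true
IO.inspect(AsciiCheck.ascii?("ü"))
# => false
IO.inspect(AsciiCheck.ascii?(<<0, 1>>))
# => true
```

Control bytes such as <<0>> and <<1>> still pass, which matches the documented 0-127 range rather than any notion of printability.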

ianheggie · February 26, 2021, 8:41am

And it is definitely confusing to those coming from Ruby or Perl, as neither of these has [:ascii:] matching non-ASCII characters, so presumably this means it is not PCRE compliant!?

my $s = "ü";
if ($s =~ /^[[:ascii:]]+$/ ) {
  print "$s is ascii\n";
} else {
  print "$s is NOT ascii\n";
}
if ( $s =~ /^[[:ascii:]]+$/u ) {
  print "$s is ascii-u\n";
} else {
  print "$s is NOT ascii-u\n";
}
$ perl /tmp/p
ü is NOT ascii
ü is NOT ascii-u