Why is ü an ASCII character?

Regex.match?( ~r/^[[:ascii:]]+$/u, "ü") returns true

But in Regex docs:

ascii - Character codes 0-127

And the codepoint of ü is 252

That is definitely surprising. The use of the u flag does change the meaning of some POSIX character classes but there’s no mention of [:ascii:] changing in those circumstances as you noted.

In my unicode_set library I build the regex in the way you expect - just in case that lib helps you in some way.

iex> regex = Unicode.Regex.compile! "[[:ascii:]]"
~r/[\x{0}-\x{7F}]/u
iex> Unicode.Regex.match? regex, "ü"             
false
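If pulling in a library is not an option, the same explicit range can be written by hand with the standard Regex module. This is only a minimal sketch: the variable name `ascii_only` is made up for illustration, and it assumes the explicit codepoint range is not affected by whatever makes the POSIX class misbehave.

```elixir
# Explicit codepoint range 0..127 in place of the [:ascii:] POSIX class.
# The u flag makes the regex operate on codepoints rather than bytes.
ascii_only = ~r/^[\x{0}-\x{7F}]+$/u

Regex.match?(ascii_only, "hello")
# => true
Regex.match?(ascii_only, "ü")
# => false

# Count how many of the codepoints 0..255 the range accepts.
Enum.count(0..255, fn c -> Regex.match?(ascii_only, <<c::utf8>>) end)
# => 128
```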

I also tried it without the u flag and got the same result.

iex(1)> Regex.match?( ~r/^[[:ascii:]]+$/, "ü")
true

Yes, which is even weirder, since in non-Unicode mode the comparison is meant to be byte-wise. "ü" is << 195, 188 >> in UTF-8, and neither of those bytes is in the ASCII range 0..127.
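To make the byte versus codepoint distinction concrete, here is a small sketch using only standard library calls. The module name `AsciiCheck` and its `ascii?/1` function are hypothetical, introduced just for this illustration; it is one possible byte-wise check, not what :re does internally.

```elixir
defmodule AsciiCheck do
  # Hypothetical helper: true only if every byte of the binary is in 0..127.
  def ascii?(binary) when is_binary(binary) do
    binary
    |> :binary.bin_to_list()
    |> Enum.all?(fn byte -> byte < 128 end)
  end
end

# "ü" is a single codepoint but two bytes in UTF-8.
String.to_charlist("ü")
# => [252]
:binary.bin_to_list("ü")
# => [195, 188]

AsciiCheck.ascii?("abc")
# => true
AsciiCheck.ascii?("ü")
# => false
```

Checking bytes directly avoids the regex engine altogether, so it does not depend on how the [:ascii:] class is implemented.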

I believe [:ascii:] works on the extended ASCII table, which is 0-255. So something is wrong in the docs.

Probably something inherited from the Erlang regex implementation:

nix shell nixpkgs#erlang -c erl
Erlang/OTP 22 [erts-10.7] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [hipe]

Eshell V10.7  (abort with ^G)
1> re:run(<<"ü">>, "[[:ascii:]]").
{match,[{0,1}]}
2> re:run(<<"ü"/utf8>>, "[[:ascii:]]").
{match,[{0,1}]}
3> re:run("ü", "[[:ascii:]]").         
{match,[{0,1}]}

Looking at the docs, I would not expect this:

https://erlang.org/doc/man/re.html#posix-character-classes

EDIT:

iex(1)> match_char = fn c -> Regex.match?( ~r/^[[:ascii:]]+$/u, List.to_string([c])) end
#Function<44.79398840/1 in :erl_eval.expr/5>
iex(2)> length(for c <- 0..255, match_char.(c), do: c)                                  
256

Well, Regex is just a wrapper over re, so this is pretty obvious.

@AdiletAbylov I would check whether it isn’t a bug in PCRE, which is used by Erlang (with a few patches).

I tried this in several online regex checkers, each of which claims to use PCRE, and none of them returns the same result as :re. So it seems to be either a bug or a doc error.

I opened an issue on the erlang OTP repository.

It is definitely a bug in the character class.

The character is: 𑩄 U+11A44 ZANABAZAR SQUARE MARK LONG TSHEG

iex> Regex.match?( ~r/^[[:ascii:]]+$/, <<72260::utf8>>)   
true

The challenge is to find a character that will return false :wink:

You need to apply Unicode mode to match Unicode properly (one of the unfortunate defaults in Elixir, but it’s too late to fix now).

iex> Regex.match?( ~r/^[[:ascii:]]+$/u, <<72260::utf8>>)  
false

iex> Regex.match?( ~r/^[[:ascii:]]+$/u, <<256::utf8>>)    
false

Thank you. Whoa! This is the first time I have faced a bug in a language :grimacing:

Being overly pedantic: it is a bug in the standard library, not in the language per se (I would count a bug in the compiler as such).

You are right.

Today I learned something new.
Thank you!

The reason for that behaviour was found on the Erlang Slack: C locales, the worst idea for an API ever designed. The [:ascii:] character class uses the isprint() C function behind the scenes, and that function is locale-aware (in some implementations of libc). You aren’t the first one to be bitten in the arse by such quirks:

That’s beyond weird. How can this not be a bug? isprint() doesn’t correlate well at all with [:ascii:], since [:ascii:] is supposed to be 0..127 and a lot of those aren’t printable! And yet they do match.

iex> Regex.match? ~r/[[:ascii:]]/, <<1>>
true
iex> Regex.match? ~r/[[:ascii:]]/, <<0>>
true

I haven’t seen any response to the issue I raised so I’ll be curious if any will be forthcoming …

[:ascii:] roughly translates to isprint(c) || iscntrl(c), which allows it to match on <<0>>. And yes, this is a bug in the implementation as well as a bug in the C standard (the locale API should be burned in fire).

And definitely confusing to those coming from Ruby or Perl, as neither of these has [:ascii:] matching non-ASCII characters, so presumably this means it is not PCRE compliant!?

my $s = "ü";
if ($s =~ /^[[:ascii:]]+$/ ) {
  print "$s is ascii\n";
} else {
  print "$s is NOT ascii\n";
}
if ( $s =~ /^[[:ascii:]]+$/u ) {
  print "$s is ascii-u\n";
} else {
  print "$s is NOT ascii-u\n";
}
$ perl /tmp/p
ü is NOT ascii
ü is NOT ascii-u