Why is ü an ASCII character?

hauleth · February 26, 2021, 9:00am

@ianheggie what is your locale? Because that is the main culprit.

LostKobrakai · February 26, 2021, 9:41am

❯❯❯❯ locale                                                              ~/Temp
LANG="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_CTYPE="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_ALL=
❯❯❯❯ cat p                                                               ~/Temp
my $s = "ü";
if ($s =~ /^[[:ascii:]]+$/ ) {
  print "$s is ascii\n";
} else {
  print "$s is NOT ascii\n";
}
if ( $s =~ /^[[:ascii:]]+$/u ) {
  print "$s is ascii-u\n";
} else {
  print "$s is NOT ascii-u\n";
}
❯❯❯❯ perl p                                                              ~/Temp
ü is NOT ascii
ü is NOT ascii-u
❯❯❯❯ iex                                                                 ~/Temp
Erlang/OTP 23 [erts-11.1.4] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [hipe]

Interactive Elixir (1.11.2) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Regex.match?( ~r/^[[:ascii:]]+$/, "ü")
true

hauleth · February 26, 2021, 12:46pm

I cannot force Perl to run in Latin-1 mode, but grep shows exactly the same quirk:

$ printf "\303\274" | LC_ALL=sv_SE.iso88591 grep -P "[[:ascii:]]"
��

Remember that, for compatibility reasons, Erlang always uses the sv_SE.iso88591 encoding when running the re module, so it is “consistent” with that behaviour.
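
To see the mismatch at the byte level without involving any regex engine: the printf "\303\274" above emits the two UTF-8 bytes of ü, 0xC3 and 0xBC, and neither lies in 0..127. A minimal sketch of a locale-independent check using only tr and wc; the helper names count_non_ascii and is_ascii are made up for illustration:

```shell
#!/bin/sh
# Count the bytes of the argument that fall outside 0..127.
# LC_ALL=C makes tr operate on raw bytes regardless of the caller's locale.
count_non_ascii() {
  printf '%s' "$1" | LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

# Succeed only if every byte of the argument is in 0..127.
is_ascii() {
  [ "$(count_non_ascii "$1")" -eq 0 ]
}

# "ü" spelled as its two UTF-8 bytes so this script itself stays pure ASCII.
u_umlaut=$(printf '\303\274')

count_non_ascii "$u_umlaut"   # prints 2
if is_ascii "$u_umlaut"; then echo "is ascii"; else echo "is NOT ascii"; fi
if is_ascii "hello"; then echo "is ascii"; else echo "is NOT ascii"; fi
```

In a regex, an explicit range such as [\x00-\x7f] states the same 0..127 test directly, so it should not depend on how a named class is defined.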

Answering questions:

is this confusing? YES
is this stupid? YES
should that ever have happened in the first place? NO
why does this happen? Because the C committee doesn’t want to admit that the locales API is a mistake and should be removed.
can we do anything about it? Not really, at least not without breaking backward compatibility. However, we could issue a warning if the [:ascii:] character class is used in precompiled regexes.

kip · February 26, 2021, 12:59pm

…and the documentation can be updated to NOT say 0..127. I guess I can try for my first Erlang PR. If I’m feeling brave.

hauleth · February 26, 2021, 1:16pm

kip · February 26, 2021, 1:24pm

Narrow escape … for me

Thanks for the link.

jhogberg · February 26, 2021, 1:39pm

I’ve merged a documentation fix to master now, thanks for looking into and reporting this.

Nicd · February 26, 2021, 2:04pm

I think we will need an explainer in the Elixir docs as well, or a link to the notes in the Erlang docs. Otherwise many people will make the same mistake (or worse, not know that they’ve made it).

hauleth · February 26, 2021, 2:51pm

The Elixir documentation does not have a regex syntax description AFAIK; it points to the Erlang docs. However, I think we could parse static regexes during compilation and warn about these. It shouldn’t be hard to implement via compiler tracers.
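
Short of a compiler tracer, a much cruder stand-in for the warning idea is to search the source tree for the class name. A minimal sketch, assuming Elixir sources end in .ex or .exs; the function name find_ascii_class is hypothetical, and this is a plain text search rather than a compile-time check, so it will also flag comments and strings:

```shell
#!/bin/sh
# Print "file:line:text" for every line mentioning the [:ascii:] class
# in Elixir source files under the given directory (default: current dir).
# -F treats the pattern as a fixed string, so the brackets are literal.
# --include is an extension supported by GNU and BSD grep.
find_ascii_class() {
  grep -rn -F --include='*.ex' --include='*.exs' '[:ascii:]' "${1:-.}"
}
```

Running find_ascii_class lib from a project root would list candidate regexes to review by hand; grep exits non-zero when nothing matches.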

Sebb · February 26, 2021, 2:57pm

https://hexdocs.pm/elixir/Regex.html#module-character-classes

hauleth · February 26, 2021, 4:18pm

Thanks for the correction.

stevensonmt · February 26, 2021, 4:26pm

Can you explain why it is not possible to fix? It seems that “backwards compatibility” should not mean maintaining reliance upon unintended bugs.

HP4k1h5 · February 27, 2021, 3:41pm

i don’t know what the best way forward is in situations like these, but “backwards compatibility” is often cited as shorthand for reasons that others might understand. i’ve never really understood the constraints on this justification or how unbounded its parameters are. changing anything or adding anything to a language could always break something somewhere on some machine.

is there really someone running critical code that depends on the “faulty”? implementation of a regex api? is this same person, all at the same time constantly, a) not maintaining the software itself, b) IS updating elixir on release, c) not testing the regex with the updated version?

it seems like all of software update management is based on this exact person and NO other persons or use cases. this someone who runs theoretically “important” code that they DON’T maintain but DO update the runtime for, so regularly and without any tests, such that no one can have a working regex in the real world? mind you this person also has to be running code that relies on a quirk in the implementation of a regex module. Seems almost guaranteed that a) such a person doesn’t really matter, and b) there’s lots of people who are probably using the regex incorrectly given their expectations and its actual behavior.

derek-zhou · February 27, 2021, 4:27pm

Backwards compatibility is super-duper important, at least for all long term successful software projects. As @hauleth has explained, this bug involved 3 parties:

erlang
pcre
libc

The latter two are outside erlang’s control, and changing erlang alone will break backward compatibility, not only for [:ascii:] but for many other things as well.

Sebb · February 27, 2021, 4:47pm

OTP uses pcre 8.44 while the latest is 10.37-RC1.
The problem may not occur with the latest pcre.

stevensonmt · February 27, 2021, 5:05pm

I did not mean to suggest backwards compatibility should be ignored. I was more wondering why fixing this bug would be a breaking change, as depending upon unintended behavior is not something I expect to be generally supported. As for this being outside the control of erlang or Elixir, I lack the technical expertise to know, so I defer to you on that. From reading the links in @hauleth’s posts, it seems that character classes are defined according to the locale. Is it not possible to set the locale to some standard at compile time, so that the class can be depended upon to return the same matches regardless of the locale setting on the local system at runtime? If that is possible, would that represent a breaking change?

hauleth · February 27, 2021, 6:09pm

That is how it works right now. It is just that the locale is set to sv_SE.Latin-1 instead of the “expected” C locale.

derek-zhou · February 27, 2021, 6:38pm

I was not aware that OTP forked pcre. In this case maybe the OTP team can fix the bug in pcre and push a patch upstream.

stevensonmt · February 27, 2021, 7:10pm

That makes it sound like a simple fix. What am I missing?

ianheggie · February 28, 2021, 3:35am

My locale is Australia (apart from time, which is GB, as I want weeks to start on Monday):

$ locale
LANG=en_AU.UTF-8
LANGUAGE=en_AU:en_AU:en
LC_CTYPE="en_AU.UTF-8"
LC_NUMERIC=en_AU.UTF-8
LC_TIME=en_GB.UTF-8
LC_COLLATE="en_AU.UTF-8"
LC_MONETARY=en_AU.UTF-8
LC_MESSAGES="en_AU.UTF-8"
LC_PAPER=en_AU.UTF-8
LC_NAME=en_AU.UTF-8
LC_ADDRESS=en_AU.UTF-8
LC_TELEPHONE=en_AU.UTF-8
LC_MEASUREMENT=en_AU.UTF-8
LC_IDENTIFICATION=en_AU.UTF-8
LC_ALL=