Why is ü an ASCII character?

@ianheggie what is your locale? Because that is the main culprit.

❯❯❯❯ locale                                                              ~/Temp
LANG="de_DE.UTF-8"
LC_COLLATE="de_DE.UTF-8"
LC_CTYPE="de_DE.UTF-8"
LC_MESSAGES="de_DE.UTF-8"
LC_MONETARY="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
LC_ALL=
❯❯❯❯ cat p                                                               ~/Temp
my $s = "ü";
if ($s =~ /^[[:ascii:]]+$/ ) {
  print "$s is ascii\n";
} else {
  print "$s is NOT ascii\n";
}
if ( $s =~ /^[[:ascii:]]+$/u ) {
  print "$s is ascii-u\n";
} else {
  print "$s is NOT ascii-u\n";
}%
❯❯❯❯ perl p                                                              ~/Temp
ü is NOT ascii
ü is NOT ascii-u
❯❯❯❯ iex                                                                 ~/Temp
Erlang/OTP 23 [erts-11.1.4] [source] [64-bit] [smp:12:12] [ds:12:12:10] [async-threads:1] [hipe]

Interactive Elixir (1.11.2) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> Regex.match?( ~r/^[[:ascii:]]+$/, "ü")
true

I cannot force Perl to run in Latin-1 mode, but grep shows exactly the same quirk:

$ printf "\303\274" | LC_ALL=sv_SE.iso88591 grep -P "[[:ascii:]]"
ü

Remember that for compatibility reasons Erlang will always use sv_SE.iso88591 encoding when running the re module. So it is “consistent” with that behaviour.
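To make the byte-level picture concrete, here is a minimal sketch that only inspects what "ü" looks like as raw bytes. That the `re` module reads those bytes through Latin-1 tables is the claim from the post above; this snippet illustrates the reinterpretation but does not prove what PCRE does internally.

```elixir
# "ü" is a single code point (U+00FC) but two bytes in UTF-8.
s = "ü"

bytes = :binary.bin_to_list(s)
IO.inspect(bytes, label: "utf-8 bytes")
# prints: utf-8 bytes: [195, 188]   (0xC3 0xBC)

# Neither byte falls in the 7-bit ASCII range 0..127.
IO.inspect(Enum.all?(bytes, &(&1 <= 127)), label: "all bytes <= 127")
# prints: all bytes <= 127: false

# Treat the same two bytes as Latin-1 input and re-encode them as UTF-8
# so they can be displayed: they become two separate characters.
as_latin1 = :unicode.characters_to_binary(s, :latin1, :utf8)
IO.inspect(as_latin1, label: "bytes read as latin-1")
# prints: bytes read as latin-1: "Ã¼"
```

So a byte-oriented engine with Latin-1 tables sees two printable Latin-1 characters rather than one non-ASCII letter.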

Answering questions:

  • is this confusing? YES
  • is this stupid? YES
  • should that ever have happened in the first place? NO
  • why does this happen? Because the C committee doesn’t want to state that the locales API is a mistake and should be removed.
  • can we do anything about it? Not really, at least not without breaking backward compatibility. However, we could issue a warning if the [:ascii:] character class is used in precompiled regexes.
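Until something changes upstream, one way to sidestep the class entirely is to spell out the byte range. The sketch below is a workaround under that assumption; `AsciiCheck` and its function names are hypothetical, not part of Elixir or OTP.

```elixir
defmodule AsciiCheck do
  # Hypothetical helper module, for illustration only.

  # Explicit byte range instead of [[:ascii:]]. Without the `u` modifier
  # the regex runs over raw bytes, so every byte must be in 0x00..0x7F.
  def with_regex?(string) when is_binary(string) do
    Regex.match?(~r/^[\x00-\x7F]+$/, string)
  end

  # The same check without a regex: every byte must be <= 127.
  # The empty string is rejected to mirror the `+` in the regex above.
  def with_bytes?(""), do: false

  def with_bytes?(string) when is_binary(string) do
    string
    |> :binary.bin_to_list()
    |> Enum.all?(&(&1 <= 127))
  end
end

IO.inspect(AsciiCheck.with_regex?("hello"))  # true
IO.inspect(AsciiCheck.with_regex?("ü"))      # false
IO.inspect(AsciiCheck.with_bytes?("ü"))      # false
```

Both variants look only at byte values, so they do not depend on any character tables.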

…and the documentation can be updated to NOT say 0..127. I guess I can try for my first Erlang PR. If I’m feeling brave.


Narrow escape … for me :slight_smile:

Thanks for the link.

I’ve merged a documentation fix to master now, thanks for looking into and reporting this. :slight_smile:


I think we will need an explainer in the Elixir docs as well, or a link to the notes in the Erlang docs. Otherwise many people will make the same mistake (or worse, not know that they’ve made it).

The Elixir documentation does not have a regex syntax description AFAIK; it points to the Erlang docs. However, I think that we could parse static regexes during compilation and warn about these. It shouldn’t be hard to implement that via compiler tracers.
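As a rough illustration of the kind of check such a warning would perform, here is a sketch that inspects an already-built regex at runtime. It is not a compiler tracer, and `RegexLint` with its functions is a hypothetical name; only `Regex.source/1`, `String.contains?/2` and `IO.warn/1` are real calls.

```elixir
defmodule RegexLint do
  # Hypothetical helper, for illustration only; a real implementation
  # would hook into compilation rather than run at runtime.

  # True when the regex source mentions the [:ascii:] class.
  def uses_ascii_class?(%Regex{} = regex) do
    regex
    |> Regex.source()
    |> String.contains?("[:ascii:]")
  end

  # Emits a warning on stderr and returns the regex unchanged,
  # so the call can be dropped into a pipeline.
  def warn_on_ascii_class(%Regex{} = regex) do
    if uses_ascii_class?(regex) do
      IO.warn("[:ascii:] may match bytes above 127; consider [\\x00-\\x7F]")
    end

    regex
  end
end

IO.inspect(RegexLint.uses_ascii_class?(~r/^[[:ascii:]]+$/))   # true
IO.inspect(RegexLint.uses_ascii_class?(~r/^[\x00-\x7F]+$/))   # false
```

A plain substring test like this would also flag an escaped or quoted occurrence of the text, so it is only a starting point.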


https://hexdocs.pm/elixir/Regex.html#module-character-classes


Thanks for the correction.


Can you explain why it is not possible to fix? It seems that “backwards compatibility” should not mean maintaining reliance upon unintended bugs.

i don’t know what the best way forward is in situations like these, but “backwards compatibility” is often cited as shorthand for reasons that others are assumed to understand. i’ve never really understood the constraints on this justification or how unbounded its parameters are. changing anything or adding anything to a language could always break something somewhere on some machine.

is there really someone running critical code that depends on the “faulty” implementation of a regex api? is this same person, all at the same time, a) not maintaining the software itself, b) updating elixir on every release, and c) not testing the regex with the updated version?

it seems like all of software update management is based on this exact person and NO other persons or use cases. this someone who runs theoretically “important” code that they DON’T maintain but DO update the runtime for, so regularly and without any tests, such that no one can have a working regex in the real world? mind you this person also has to be running code that relies on a quirk in the implementation of a regex module. Seems almost guaranteed that a) such a person doesn’t really matter, and b) there’s lots of people who are probably using the regex incorrectly given their expectations and its actual behavior.


Backwards compatibility is super-duper important, at least for all long term successful software projects. As @hauleth has explained, this bug involved 3 parties:

  • erlang
  • pcre
  • libc

The latter two are outside erlang’s control, and changing erlang alone will break backward compatibility, not only for [:ascii:] but for many other things as well.

OTP uses pcre 8.44 while the latest is 10.37-RC1.
The problem may not occur with the latest pcre.

I did not mean to suggest backwards compatibility should be ignored. I was more wondering why fixing this bug would be a breaking change, as depending upon unintended behavior is not something I expect to be generally supported. As for this being outside the control of erlang or Elixir, I lack the technical expertise to know, so I defer to you on that. From reading the links in @hauleth’s posts, it seems that character classes are defined according to the locale. Is it not possible to set the locale to some standard at compile time, so that the class can be depended upon to return the same matches regardless of the locale setting on the local system at runtime? If that is possible, would that represent a breaking change?

That is how it works right now. It’s just that the locale is set to sv_SE.Latin-1 instead of the “expected” C locale.


I was not aware that OTP forked pcre. In this case maybe the OTP team can fix the bug in pcre and push a patch upstream.

That makes it sound like a simple fix. What am I missing?

My locale is Australian (apart from time, which is GB, as I want weeks to start on Monday):

$ locale
LANG=en_AU.UTF-8
LANGUAGE=en_AU:en_AU:en
LC_CTYPE="en_AU.UTF-8"
LC_NUMERIC=en_AU.UTF-8
LC_TIME=en_GB.UTF-8
LC_COLLATE="en_AU.UTF-8"
LC_MONETARY=en_AU.UTF-8
LC_MESSAGES="en_AU.UTF-8"
LC_PAPER=en_AU.UTF-8
LC_NAME=en_AU.UTF-8
LC_ADDRESS=en_AU.UTF-8
LC_TELEPHONE=en_AU.UTF-8
LC_MEASUREMENT=en_AU.UTF-8
LC_IDENTIFICATION=en_AU.UTF-8
LC_ALL=