Regular expressions in Elixir with unicode characters

abtrapp · November 20, 2018, 7:33am

Hi!

Can somebody please explain this:

iex(9)> Regex.match?(~r/\p{Lu}/, "Ö")
true
iex(10)> Regex.match?(~r/\p{Lu}/, "ö")
true
iex(11)> Regex.match?(~r/\p{Lu}/, "o")
false
iex(12)> Regex.match?(~r/\p{Lu}/, "O")
true

Why are all Unicode characters that are not from the English alphabet uppercase characters in Elixir (in contrast to every other language I know and the definition of \p{Lu})?

Elixir 1.7.4 (compiled with Erlang/OTP 21)

mudasobwa · November 20, 2018, 7:57am

To make it even funnier, the Erlang re module, which Regex relies upon, treats composed and decomposed Unicode characters differently:

iex|1 ▶ :re.run(String.normalize("ö", :nfc),"\\p{Lu}")
#⇒ {:match, [{0, 1}]}
iex|2 ▶ :re.run(String.normalize("ö", :nfd),"\\p{Lu}")
#⇒ {:match, [{1, 1}]}

Note the matched position. I would treat it as a bug in the re module, which should probably be addressed.

chriseyre · November 20, 2018, 8:05am

There is the /u option to enable Unicode support in regexes.

Nicd · November 20, 2018, 8:14am

With unicode flag:

iex(1)> Regex.match?(~r/\p{Lu}/u, "Ö")
true
iex(2)> Regex.match?(~r/\p{Lu}/u, "ö")
false
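A minimal sketch of the same check in plain script form (not an iex session), covering both the composed and the decomposed form mentioned above. The variable names are only for illustration, and the strings are normalized explicitly so the result does not depend on how the source file was typed.

```elixir
# Composed (NFC) and decomposed (NFD) forms of the same lowercase letter.
nfc = String.normalize("ö", :nfc)
nfd = String.normalize("ö", :nfd)

upper = ~r/\p{Lu}/u
lower = ~r/\p{Ll}/u

# With the u modifier the subject is read as UTF-8, so neither form
# contains an uppercase letter.
IO.inspect(Regex.match?(upper, nfc))   # false
IO.inspect(Regex.match?(upper, nfd))   # false

# Both forms contain a lowercase letter: "ö" itself, or the base "o".
IO.inspect(Regex.match?(lower, nfc))   # true
IO.inspect(Regex.match?(lower, nfd))   # true

# A real uppercase letter still matches.
IO.inspect(Regex.match?(upper, String.normalize("Ö", :nfc)))   # true
```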

abtrapp · November 20, 2018, 9:59am

Thank you very much! That’s not the default? Performance reasons? Can’t remember the last time I used ASCII

Btw. Wouldn’t it be better to answer that expression with an error than silently returning a wrong answer?

LostKobrakai · November 20, 2018, 10:04am

More likely something along the lines of backwards compatibility.

NobbZ · November 20, 2018, 10:08am

It’s not wrong…

If you do not specify the u option, the Erlang regex module will interpret the input as Latin-1, so an NFD "ö" is "oÌ" <> <<136>>, where the second character is uppercase; for an NFC "ö" the regex sees "Ã¶", where the first character is uppercase.

So as you can see the matches are correct.
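The explanation above can be checked with a short sketch. It looks at the raw UTF-8 bytes of both forms and then reads each byte as one Latin-1 character. The helper names `latin1_view` and `uppercase?` are made up for this illustration; `:unicode.characters_to_binary/2` is only used here to make the Latin-1 reading visible, it is not what `:re` calls internally.

```elixir
nfc = String.normalize("ö", :nfc)
nfd = String.normalize("ö", :nfd)

# Raw UTF-8 bytes of each form.
IO.inspect(:binary.bin_to_list(nfc), charlists: :as_lists)   # [195, 182]
IO.inspect(:binary.bin_to_list(nfd), charlists: :as_lists)   # [111, 204, 136]

# Read every byte as one Latin-1 character and re-encode the result
# as UTF-8 so it can be inspected as a normal Elixir string.
latin1_view = fn bytes -> :unicode.characters_to_binary(bytes, :latin1) end
IO.inspect(latin1_view.(nfc))   # "Ã¶"

# The bytes at the matched offsets: offset 0 for NFC, offset 1 for NFD.
<<first, _rest::binary>> = nfc
<<_o, second, _tail::binary>> = nfd

# Is the code point with this number an uppercase letter?
uppercase? = fn byte -> Regex.match?(~r/^\p{Lu}$/u, <<byte::utf8>>) end
IO.inspect(uppercase?.(first))    # true, 195 is "Ã" in Latin-1
IO.inspect(uppercase?.(second))   # true, 204 is "Ì" in Latin-1
```

This lines up with the positions `{0, 1}` and `{1, 1}` reported earlier in the thread: each match is a single byte that happens to be an uppercase letter when read as Latin-1.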

abtrapp · November 20, 2018, 11:06am

Thanks for the answer.

But latin as default? 8+3 characters were “correct” too for filenames some time ago

Hope that Elixir will not be the next MS-DOS. I’m coming from Ruby and hope that, after the String handling and this regex default, there are not too many things that don’t make any sense and are just “correct” because they were implemented in a specific way some “hundred” years ago

Just for the statistics: Latin: < 10% of the websites / Unicode: > 60% (numbers from 2012, today probably much higher). So when Elixir was invented almost nobody used latin any more. So the “/u” as default would make much more sense imho.

This feels more like Java (deprecations for decades) than fun to me… - luckily most of the language really is fun to work with.

Sorry, forgot the link to the source: Official Google Blog: Unicode over 60 percent of the web

NobbZ · November 20, 2018, 11:14am

Latin-1 comes from the underlying Erlang module. u is not the default as it is usually not needed but has a measurable performance impact. Also, sharing the same default makes it easier to simply drop down to the Erlang regex module.

There was a similar question recently, perhaps the derailed discussion towards the end might help you?
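For anyone who does drop down to the Erlang module, here is a small sketch of the equivalent calls. As far as I know, Elixir's u modifier translates to the `:unicode` and `:ucp` options of `:re`, so treat that mapping as an assumption and check the Regex source for your version.

```elixir
lower = String.normalize("ö", :nfc)
upper = String.normalize("Ö", :nfc)

# :re.run takes the subject first, then the pattern, then the options.
# Without options the subject bytes are read as Latin-1.
IO.inspect(:re.run(lower, "\\p{Lu}"))                      # {:match, [{0, 1}]}

# With :unicode the pattern and subject are treated as UTF-8;
# :ucp additionally makes classes like \w use Unicode properties.
IO.inspect(:re.run(lower, "\\p{Lu}", [:unicode, :ucp]))    # :nomatch
IO.inspect(:re.run(upper, "\\p{Lu}", [:unicode, :ucp]))    # {:match, ...}
```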

abtrapp · November 20, 2018, 11:26am

Perfect. Thanks. So it’s either like I thought (performance) or just a lack of interest in ~ 80-90% of the people coding out there - that’s why I loved Ruby. No “8+3 filenames” … at least >= 1.9 - so maybe there is hope for a future Elixir version too

At least I have a solution and some hints to the reason. Happy and satisfied.

Thanks!

It puzzles me that I didn’t find that issue in the regex unicode search I made before posting this. However: now I know that the community here is very helpful

Cochonours · November 20, 2018, 9:24pm

if you know you don’t need direct unicode matching then there is no need.

Well, it should be in those peculiar cases that a /l or whatever should be specified. I have not worked with Latin-1 text for almost a decade…

kip · November 21, 2018, 1:28am

I would challenge the “unicode is not usually needed” assumption. Unless you have complete confidence in your knowledge of the cultural context of your audience, Latin-1 is a poor assumption. And in an Elixir context, the String.t type is Unicode, so it’s not surprising that developers are caught out by the default behaviour of Regex being Latin-1 (even though I understand why the choice was made, I think it’s hard to justify as the default).

Latin-1 doesn’t even include the Euro symbol. @michalmuskala’s native language can’t be represented in Latin-1 (Polish is Latin-2). And given the global participation of this forum I’d warrant there are many members building applications for a global audience and therefore should not make limited assumptions on character input or output.

I suppose I may be an outlier on this (Australian, living in Singapore, and having worked in several Asian countries for quite a while). It even bugs me that we don’t consistently address José with the name his mother gave him

Elixir is a Unicode-centric language. I think that the :re option "u" should be the default.
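A short sketch of how the default catches people with ordinary names. It follows the character-class example in the Elixir Regex documentation; the variable `name` is just for illustration.

```elixir
name = String.normalize("josé", :nfc)

# Without u, the two UTF-8 bytes of "é" are seen as two separate
# Latin-1 characters, so the string is not "all lowercase letters".
IO.inspect(Regex.match?(~r/^[[:lower:]]+$/, name))    # false

# With u, the subject is read as UTF-8 and "é" counts as lowercase.
IO.inspect(Regex.match?(~r/^[[:lower:]]+$/u, name))   # true
```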

michalmuskala · November 21, 2018, 10:11am

I agree that Elixir regexes should include the u flag by default and require an explicit opt-out. We should generally prefer correctness over speed, with ways to opt in to faster but potentially incorrect results.

Unfortunately, the soonest we could change this would be 2.0 (and there are no plans for 2.0 at the moment). This is because it would be a breaking change: many functions would start returning different results.
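Until then, a project can get the opt-out style locally with a tiny wrapper. This is only a sketch: `UnicodeRegex` and its `compile!/1` are hypothetical names, not part of Elixir, and the sketch assumes `Regex.compile!/2` accepts the modifier string `"u"` as documented.

```elixir
defmodule UnicodeRegex do
  # Hypothetical helper, not part of Elixir: compiles a pattern with
  # the unicode modifier always enabled.
  def compile!(source) when is_binary(source) do
    Regex.compile!(source, "u")
  end
end

upper = UnicodeRegex.compile!("\\p{Lu}")

IO.inspect(Regex.match?(upper, String.normalize("ö", :nfc)))   # false
IO.inspect(Regex.match?(upper, String.normalize("Ö", :nfc)))   # true
```

The trade-off is that patterns built this way are compiled at runtime, unlike the ~r sigil with a literal u modifier.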

kip · November 21, 2018, 4:02pm

Oh, I never had any expectation of a breaking change - it’s just one of those things that, in hindsight, would probably have been a better default. With all the work on globalisation stuff I’m doing I still forget this flag. This conversation even reminded me of one place I hadn’t set it!

tallakt · November 21, 2018, 4:15pm

The sooner it’s done the better

mudasobwa · January 25, 2019, 5:43am

You gotta be kidding. Ruby failed to convert case with String#upcase / String#downcase in stdlib till 2018, and simple unicode normalization with String#unicode_normalize happened to appear only in 2.5.

Ruby literally had the worst unicode support among many languages till a while ago (maybe only Python had worse.)

abtrapp · January 25, 2019, 7:56am

Please re-read the posts. I never said that the UTF-8 handling in Ruby was perfect (it has its quirks, like in every language I know). Upcase/downcase was not supported because it is locale dependent, … and 3rd party gems did what Ruby did not - but Ruby has useful default behavior, and if something does not make sense in practice over time, the default behavior has been changed, … - that’s what I like about Ruby (contrast: Java, where things are “deprecated” for decades). I really don’t want to discuss unicode support in Ruby or start a language war in this forum

This topic should be about how to solve a problem (my initial posting) and maybe (the discussion that started) about how to make Elixir easier to work with for most people (at least I hope so). Elixir should be “fun”, just like Ruby - digging into Erlang is not fun for most users when dealing with a problem that they run into in > 80% of their projects.

I have a lot of respect for people who write languages/libraries/frameworks because it’s always easy to point out some errors/improvements, but it’s hard to get something done right from scratch. So, when I’m new to a language (like Elixir) I try to help by providing some thoughts as a newbie to that language to help to make the transition easier for other people that are following - because after a short time using that language I don’t even think about this stuff any more.

If my English has not been good enough to get that message across, please accept my apology. And if you can point me to the misleading text that led to the “I hate Ruby” bashing post I will try to rewrite it.

mudasobwa · January 25, 2019, 8:25am

I honestly do not see how the documented default behaviour that somehow differs from your expectations might make digging into language less fun.

Also, I am biased because I literally hate the “convention over configuration” paradigm and I am positive Rails killed the great language because at some point it became too easy to jump into. I constantly ask the question “how foo(&:bar) syntax works in ruby in a nutshell” and I constantly receive a vast majority of answers saying “it’s a syntactic sugar.”

abtrapp · January 25, 2019, 8:56am

Let’s say 2019 ~ 80-90% working with UTF-8. (no current numbers here, but I’m pretty sure that we are in that region):

If I call a function that supports UTF-8 on a UTF-8 string I assume it is working with that UTF-8 string and delivers the correct behavior. However, it does not do that: you need a special parameter/option to make the function call return the correct result. This is especially surprising when switching from an OO language (duck typing: that’s a UTF-8 string, it should handle it like that).
If 80-90% of all modern web applications are dealing with UTF-8 (not ASCII/Latin) then something like this should be the default (imho).
If everything should be explicit, there should be a mandatory parameter (this is a Latin string / this is UTF8)… which I would find very ugly
Nothing that changes the world, but it just would make the entry and daily usage easier.

Everybody is biased - I worked with Java when it had stubs and skels. I learned to love “convention over configuration” and DRY and started to have fun with my work after copy/pasting everything 10 times in Java - I wrote an MDSD template generator back then that saved people about 1-2 weeks of typing because they had everything configured (some stuff multiple times)

Imho Rails was such a success because concepts like DRY and “convention over configuration” made the entry easy (and imho the entry can never be too easy) - but I agree that every decision has two sides. When it became so easy that people didn’t even think about what that “database thing” was doing, it led to a lot of nice performance issues that I see in 9/10 projects where I have to get “startup code” to a more performant/stable state that is maintainable.

However: all of the Rails programmers that I know who look into Elixir are looking for one reason only: performance - which I find a little bit sad because I really like some Elixir concepts and I think it makes you a better Rails programmer too (functional core / imperative shell, as an example). If somebody came from Java to Rails, Elixir seems like a little step back (configuration over convention, repeat yourself, the module a.b.c stuff can remind people of Java packages, …) - and things like Erlang naming conventions, default values that made sense when Erlang was written, … do feel like I’m back in the Java world sometimes - but that’s just my opinion.

mudasobwa · January 25, 2019, 9:11am

Eh. DRY in Elixir is way easier than in Ruby FWIW.

Eh. How is ActiveRecord::Base less reminiscent of fully qualified names than ActiveRecord.Base?

I never considered myself a Rails programmer (I even have an explicit tagline “I like Ruby off the Rails” everywhere on the internets,) but I obviously have some experience with Rails. I can tell that I admire Elixir for its conciseness, clarity, expressiveness, resilience, fault-tolerance, and (the biggest one) AST on hand.

I am able to cook Ruby to be performant enough to suit my needs, so performance is a great bonus, but it is definitely not what made me switch.