Regular expressions in Elixir with unicode characters

Nicd · January 25, 2019, 10:01am

Technically, all strings in Elixir are UTF-8; there is no way to handle other character sets in the String and Regex modules (disregarding Erlang functions and charlists).
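A minimal sketch of what that means in practice, using only standard library calls: a string is a UTF-8 encoded binary, so its byte size and its character count diverge as soon as a non-ASCII character appears. The sample values are chosen here for illustration.

```elixir
name = "José"

# Four characters (graphemes), but five bytes: "é" takes two bytes in UTF-8.
String.length(name)       # => 4
byte_size(name)           # => 5

# The binary is valid UTF-8.
String.valid?(name)       # => true

# A charlist holds code points, not bytes: "é" is the single code point 233.
String.to_charlist(name)  # => [74, 111, 115, 233]

# A lone Latin-1 byte for "é" (0xE9) is a binary, but not a valid Elixir string.
String.valid?(<<0xE9>>)   # => false
```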

abtrapp · January 25, 2019, 10:02am

I think we’re going off topic… - can we move the latest part of this thread to a new section in this forum?

Sorry, I wasn’t clear enough here. RoR vs. Ecto, taking a single field/attribute as the example: in Elixir/Ecto you need a DB migration + schema + n changesets with cast + validations, whereas in Ruby you need a migration + one security declaration for the assignments (depending on the RoR version), and even the validations can be derived automatically from the DB definition. Even with a single reused changeset, that is a lot of typing the same stuff. And using :string in the schema and :text in the migration doesn’t make maintenance easier.

Sorry, I wasn’t clear enough about the second point either. I was thinking more of [1,3,2].uniq.sort.count vs. Enum.count(Enum.sort(Enum.uniq([1, 3, 2]))). And no, [1, 3, 2] |> Enum.uniq() |> Enum.sort() |> Enum.count() does not make it better. And yes, I know that you rarely find something like this, mostly in the introduction examples, so the first impression you get from the language is far worse than some “real” code later, where you find out about the beauty of pattern matching, …

I know that some of these decisions are made on purpose and some imho make a lot of sense (no automatic preloading, …), but in some other places the implicit/explicit mixing in Phoenix puzzles me too. controller -> view directory name = convention over configuration -> template name = configuration over convention… Things like this always make me put Elixir somewhere between RoR (everything = convention) and old school Java stuff (everything was configured) - I’m just glad that the template name is not defined in an XML file. For me it would be a lot easier if I knew the idea behind such decisions (like the original question with UTF-8). I don’t want to learn what is convention and what is not, I want to understand why a decision has been made.

I will definitely not argue about conciseness, clarity, expressiveness, resilience, fault-tolerance and quoted expressions - at least not against Elixir in these areas…

I hope (and I’m confident) that the community here will stay like it is now: very friendly & helpful and I hope that Phoenix and Ecto will continue to improve (and hopefully 3rd party libraries will follow, really miss some basic stuff - but on the other hand you learn more about ltree in postgres and other things when you have to solve such things by yourself).

abtrapp · January 25, 2019, 10:05am

Exactly! That was the point of this discussion. Everything is UTF-8, everybody uses UTF-8, so why is the default behavior Latin? To understand the “wrong” result you have to dig into Erlang. And my fear is that the people here are building a really awesome language but have some legacy decisions that make it unnecessarily hard for newbies to jump on that train.

OvermindDL1 · January 25, 2019, 6:28pm

iex(1)> import Enum
Enum
iex(2)> [1,3,2]|>uniq|>sort|>count
3

^.^

I personally find that whole dotted style hard to read though, as it looks like a singular value rather than ‘calling’ something. That’s just weird: in Ruby, how do you differentiate between a.b.c.d where b, c, and d are value accesses on a record and where they are function calls? That’s very confusing, and I don’t see how someone can find that clear… But then again I find Elixir’s dotted access confusing too: is a.c calling a function c on whatever atom/module a points to, or is it doing a struct access? Like what on earth there… ^.^; So I’d usually do:

[1, 3, 2]
|> Enum.uniq()
|> Enum.sort()
|> Enum.count()

As it’s obvious that each line has a singular operation on the previous result, simple and clear and unambiguous.
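One thing to watch when porting the Ruby chain: as far as the standard library goes, Ruby's uniq corresponds to Enum.uniq/1, while Enum.dedup/1 only collapses consecutive repeats. A small sketch with a made-up list shows the difference:

```elixir
list = [1, 3, 2, 1, 1]

# Enum.uniq/1 drops every repeated value, like Ruby's Array#uniq.
Enum.uniq(list)    # => [1, 3, 2]

# Enum.dedup/1 only collapses runs of equal neighbours.
Enum.dedup(list)   # => [1, 3, 2, 1]

# The full pipeline, one operation per line.
list
|> Enum.uniq()
|> Enum.sort()
|> Enum.count()    # => 3
```

For the list [1, 3, 2] used in this thread both functions return the same result, because it contains no duplicates at all.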

Though I find OCaml’s syntax even far more readable:

(* Module access, in Elixir this is List.flatten([[1], [2]]) with same result*)
List.flatten([[1]; [2]])

(* Module access via passed in binding 'm', in Elixir this is `m.flatten([[1], [2]])` *)
(* As OCaml is strongly typed I need to define what interface has to be fulfilled by what is passed in, so... *)
module type Flattenable = sig val flatten : 'a list list -> 'a list end
(* Then use it like this *)
module Blah = (val m : Flattenable) (* give it a temporary scoped name of the defined interface *)
Blah.flatten([[1]; [2]]) (* Then call it, notice this is still the same syntax as a module call even though the module is unknown, it could be List or something else *)

(* Struct access, in Elixir this is s.field *)
s.field (* Lower case start, modules are upper-case start *)

(* Row-typed structure, this is like a map-with-atom-keys usage in Elixir, in Elixir it is m.a *)
m#a (* These forums need a language syntax adjustment, this is so wrong...  ^.^; *)

(* In OCaml maps are a library, not a built-in type like with Elixir, example usage *)
module StringMap = Map.Make(String)
let m = StringMap.(
  empty
  |> add "hello" 1
  |> add "world" 2
)
(* The `Module.(...)` is like a scoped import, you could just as well have done `open Module` instead to open it in the outer scope instead *)
(* The Elixir version of the above would be:
m = (
  import Map
  %{}
  |> put("hello", 1)
  |> put("world", 2)
)
*)

(* And of the above elixir piped example of manipulating a list, in OCaml: *)
(* Notice how `;` is a list separator, a `,` is a tuple separator, you can even tell what things are that way *)
[1; 3; 2]
|> List.sort_uniq compare
|> List.length

(* Or if you want to locally open `List`: *)
List.(
  [1; 3; 2]
  |> sort_uniq compare
  |> length
)

The big thing is that everything is unambiguous, you know what something is and what it can be from how it is used, and it is fully type safe the whole way through.
That style is not very conducive to hot-code-swapping though, but if sets of modules with dyn-typed integration points were swapped atomically then it would work fine.
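For comparison, the Elixir dotted-access ambiguity mentioned earlier can be sketched like this. The variable names are made up for illustration; the only visual cue is whether parentheses follow the name.

```elixir
user = %{name: "José"}
mod = String

# Field access on a map with atom keys (no parentheses).
user.name             # => "José"

# Remote function call on whatever module the variable holds (with parentheses).
mod.upcase("josé")    # => "JOSÉ"
```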

Hmm? What do you mean? If by preloading you mean loading all modules at load-time, that happens in a release (or if you set a flag in mix otherwise, which you shouldn’t need to do).

Yeah I would prefer that to be defined on the use ... line personally too.

It would break backwards compatibility (the regex module has existed since ‘before’ binaries had direct UTF-8 matching support). In addition, looking up in a Unicode table is MUCH slower than in an ASCII table, so it is good to make that cost explicit.
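A short sketch of what making that cost explicit looks like: the `u` modifier on `~r` switches the pattern from byte-wise to Unicode-aware matching. The sample strings are chosen here for illustration.

```elixir
# Without `u` the pattern works on bytes, so `.` matches only the first
# of the two bytes that encode "é".
Regex.run(~r/./, "é")              # => [<<195>>]

# With `u` the pattern works on code points.
Regex.run(~r/./u, "é")             # => ["é"]

# `u` also makes classes such as \w follow Unicode rules.
Regex.match?(~r/^\w+$/u, "José")   # => true
```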

abtrapp · January 25, 2019, 7:35pm

Oh, I just wrote everything in 1 line to compare them, that’s not the code I normally write

automatic association preloading in ActiveRecord / explicit in Ecto -> that’s definitely a good decision

backwards compatibility - my biggest fear as said. In a few years you have a bunch of garbage that nobody can understand and that does not make any sense. And my 2 cents regarding performance: you can make 4.345 / 2.0 = 2 with integer handling faster than floats. Wrong, but faster. Or concatenate Strings by ignoring the second one. Wrong but very fast. If you want to use “slow” concatenation use a _real suffix. That’s imho the wrong way: look how fast Elixir is (in the few cases that are not used at all, and in the majority of the cases you have to write additional code because of backward compatibility). Sometimes you have to make cuts. Can hurt but from 3 decades of experience in languages, frameworks and hardware I can say: if you don’t do it it will hurt even more.

OvermindDL1 · January 25, 2019, 8:26pm

Just as a note, you can trivially make your own sigil_r function to replace the built-in one that defaults to unicode, just use/import it into your module. ^.^
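A rough sketch of how that could look, not a definitive implementation. The module names UnicodeRegex and Demo are made up for this example; the approach assumed here is to hide Kernel's sigil_r/2 and import a macro that appends the `u` modifier before delegating to the built-in one.

```elixir
defmodule UnicodeRegex do
  # Hypothetical helper module; the name is an assumption, not part of Elixir.
  defmacro __using__(_opts) do
    quote do
      # Hide the built-in ~r so that ours is picked up instead.
      import Kernel, except: [sigil_r: 2]
      import UnicodeRegex, only: [sigil_r: 2]
    end
  end

  # Appends the `u` modifier unless it is already present,
  # then delegates to the built-in sigil.
  defmacro sigil_r(term, modifiers) do
    modifiers = if ?u in modifiers, do: modifiers, else: modifiers ++ [?u]

    quote do
      Kernel.sigil_r(unquote(term), unquote(modifiers))
    end
  end
end

defmodule Demo do
  use UnicodeRegex

  # Both patterns are written without `u`, but behave as if it were given.
  def word?(string), do: Regex.match?(~r/^\w+$/, string)
  def first_char(string), do: Regex.run(~r/./, string)
end

Demo.word?("José")      # => true
Demo.first_char("é")    # => ["é"]
```

Any module that does not `use UnicodeRegex` keeps the standard behavior, so the change stays local and visible at the top of each file.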

abtrapp · January 26, 2019, 5:26am

Thanks for the great hint! For all new to Elixir who don’t want to wait if/until this will change and found this page via google… the link to the excellent Elixir documentation about custom sigils: https://elixir-lang.org/getting-started/sigils.html#custom-sigils

chriseyre · March 10, 2019, 5:45pm

Here is some upcoming documentation on using character classes in Elixir regex:

https://hexdocs.pm/elixir/master/Regex.html#module-character-classes

kip · March 10, 2019, 8:35pm

That’s a great contribution, thanks!

Would you think it appropriate to also mention that it supports Unicode category notation? Like

iex> Regex.match? ~r/\p{Lu}\p{Ll}+/, "José"
true

And Unicode script notation too:

iex> Regex.match? ~r/\p{Latin}+/, "José"
true
iex> Regex.match? ~r/\p{Hiragana}+/, "José"
false
iex> Regex.match? ~r/\p{Cherokee}+/, "José"
false
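As a hedged follow-up, the same property classes can be combined with anchors and the `u` modifier, so that multi-byte characters such as "é" are treated as single letters. The inputs below are made up for illustration.

```elixir
# Anchored match over the whole name; with `u`, "é" counts as one lowercase letter.
Regex.match?(~r/^\p{Lu}\p{Ll}+$/u, "José")   # => true

# Extract every uppercase letter.
Regex.scan(~r/\p{Lu}/u, "José Valim")        # => [["J"], ["V"]]

# Script names can be anchored the same way.
Regex.match?(~r/^\p{Greek}+$/u, "αβγ")       # => true
Regex.match?(~r/^\p{Greek}+$/u, "José")      # => false
```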