I’ve read that Elixir’s Unicode support is top notch (José on Twitter, among other places), and I believe I have a rudimentary understanding of codepoints vs. graphemes now, but I’m at a loss as to how to actually demonstrate the difference. Can someone help me with a string that shows the distinction, or educate me on why that question doesn’t make sense?
Specifically, I’d like to use this example in a talk I’m giving to my co-workers later this week (and I’ll be using my MacBook, if that matters in having a way to actually enter the relevant characters).
Talk about a rabbit hole: I actually read the article linked in the tweet (The string type is broken) and found not only much better examples, but also that someone had posted that Elixir already gets basically all of them right.
Programming Elixir has a nice section with examples, starting on page 123 under the heading Double-Quoted Strings Are Binaries:
graphemes(str)
Returns the graphemes in the string. This is different from the codepoints function, which lists combining characters separately. The following example uses a combining diaeresis along with the letter “e” to represent “ë”. (It might not display properly on your ereader.)
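This is not the book's listing, just a minimal sketch of the same idea that can be pasted into IEx. It assumes the decomposed form of “ë” (the letter “e” followed by U+0308, the combining diaeresis) and writes it with a `\u` escape so the combining character is not lost when copying; the variable name `str` is only illustrative.

```elixir
# "e" followed by U+0308 COMBINING DIAERESIS, written as an escape so the
# decomposed form survives copy and paste.
str = "noe\u0308l"

# Codepoints: the combining mark is listed as its own element.
IO.inspect(String.codepoints(str))
IO.inspect(length(String.codepoints(str)))  # 5

# Graphemes: "e" plus the combining mark form a single element.
IO.inspect(String.graphemes(str))
IO.inspect(String.length(str))              # 4, String.length/1 counts graphemes

# Bytes: the combining mark takes two bytes in UTF-8.
IO.inspect(byte_size(str))                  # 6
```

On a Mac the escape also sidesteps the input problem, since typing ë on the keyboard may well produce the precomposed character instead.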
The printed version of this book actually confused me about this distinction, because it is printed incorrectly! It shows the following, which does not match your IEx output:
a = "noël"
IO.puts a
IO.puts String.reverse(a)
IO.puts String.slice(a, 0..2)
IO.puts String.length(a)
b = "😸😾"
IO.puts b
IO.puts String.length(b)
IO.puts String.slice(b, 1..-1//1)
IO.puts String.reverse(b)
c = "baffle"
IO.puts c
IO.puts String.upcase(c)
d = "noël"
IO.puts d
IO.puts a == d
IO.puts String.equivalent?(a, d)
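The last two lines above only show something interesting if the two strings use different encodings of “ë”, which is easy to lose when copying text around. A small sketch that makes the difference explicit with escapes, assuming one string is decomposed and the other precomposed (the variable names are illustrative):

```elixir
decomposed = "noe\u0308l"   # "e" + U+0308 combining diaeresis (5 codepoints)
precomposed = "no\u00EBl"   # U+00EB, a single precomposed "ë" (4 codepoints)

# Binary comparison: the underlying bytes differ.
IO.inspect(decomposed == precomposed)                          # false

# Canonical equivalence: both represent the same text.
IO.inspect(String.equivalent?(decomposed, precomposed))        # true

# Normalizing to NFC composes "e" + U+0308 into U+00EB.
IO.inspect(String.normalize(decomposed, :nfc) == precomposed)  # true

# Reversing works on graphemes, so the diaeresis stays on the "e".
IO.inspect(String.reverse(decomposed) == "le\u0308on")         # true
```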
The one thing that Elixir doesn’t do so far, and that some people would like to see, is provide boolean functions that return interesting information about a given grapheme, such as whether it is alphanumeric, uppercase, a control character, a spacing character, etc.
These functions are mainly useful for validating or filtering data fields so that they only contain characters from a given subset. Using regular expressions in these contexts is often either overkill or at least slower than necessary. (And I’m not entirely sure whether the regex engine follows Unicode rules or Perl-style regex rules for which groups certain characters belong to; I’m fairly certain that it is the second.)
To clarify:
teststring = "noël"
teststring =~ ~r{\w{4}} # false, fails on the ë: without the `u` modifier `\w` only matches `[a-zA-Z0-9_]`, while ë is clearly alphanumeric.
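For comparison, here is a hedged sketch of the same match with Elixir's `u` (unicode) regex modifier, which makes character classes such as `\w` Unicode-aware and enables `\p{...}` properties. It assumes the precomposed “ë” (U+00EB); the variable name `word` is illustrative, and a decomposed string may behave differently because the combining mark is not a letter.

```elixir
word = "no\u00EBl"  # precomposed "ë" (U+00EB)

# Without the u modifier, \w is ASCII-only and the match fails on "ë".
IO.inspect(word =~ ~r/\w{4}/)       # false

# With the u modifier, \w also matches Unicode letters.
IO.inspect(word =~ ~r/\w{4}/u)      # true

# \p{L} (any Unicode letter) needs the u modifier as well.
IO.inspect(word =~ ~r/^\p{L}+$/u)   # true
```

This does not replace dedicated predicate functions for graphemes, but it covers simple validation cases.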