Codepoints vs Graphemes

ckhrysze · March 28, 2016, 3:51am

I’ve read that Elixir’s Unicode support is top notch (Jose on Twitter, among other places), and I believe I now have a rudimentary understanding of codepoints vs graphemes, but I’m at a loss as to how to actually demonstrate it. Can someone help me with a string that shows the distinction, or explain why that question doesn’t make sense?

Specifically, I’d like to use this example in a talk I’m giving to my co-workers later this week (I’ll be using my MacBook, in case that matters for how to enter the relevant characters).

ckhrysze · March 28, 2016, 5:41am

I’d still like to understand this better, but part of my problem ended up being a terminal issue. The following gist at least shows what I was going for: https://gist.github.com/ckhrysze/76ece39ad6a0c4c2c55e

Talk about a rabbit hole: I actually read the article linked in the tweet (The string type is broken) and found not only much better examples, but also that someone had posted that Elixir already gets basically all of them correct.

uranther · March 28, 2016, 4:24pm

Programming Elixir has a nice set of examples on page 123, in the Double-Quoted Strings Are Binaries section:

graphemes(str)
Returns the graphemes in the string. This is different from the codepoints function, which lists combining characters separately. The following example uses a combining diaeresis along with the letter “e” to represent “ë”. (It might not display properly on your ereader.)
iex> String.codepoints "noe\u0308l"
["n", "o", "e", "¨", "l"]
iex> String.graphemes "noe\u0308l"
["n", "o", "ë", "l"]

The printed version of this book actually confused me about this distinction, because it is printed incorrectly! It shows the following, which does not match your IEx output:

iex> String.graphemes "noe\u0308l"
["n", "o", "e¨", "l"]

So in general, if you want each printed character of a string as a list, use String.graphemes/1.
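To make the difference concrete, here is a minimal sketch that counts the same string three ways. The variable name `decomposed` is mine, and the string is written with the `\u0308` escape so it does not depend on how your editor or terminal enters the character.

```elixir
# "noël" written as a plain "e" followed by U+0308 (combining diaeresis).
decomposed = "noe\u0308l"

# Bytes: U+0308 takes two bytes in UTF-8, the other four characters one each.
IO.inspect(byte_size(decomposed), label: "bytes")
# bytes: 6

# Codepoints: the combining mark is counted separately.
IO.inspect(length(String.codepoints(decomposed)), label: "codepoints")
# codepoints: 5

# Graphemes: "e" plus the combining mark count as one visible character.
IO.inspect(String.length(decomposed), label: "graphemes")
# graphemes: 4
```

String.length/1 counts graphemes, which is why it agrees with what you see on screen.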

ckhrysze · March 28, 2016, 4:38pm

I updated my gist here to reflect all the examples listed in The String Type is Broken article.

a = "noël"
IO.puts a
IO.puts String.reverse(a)
IO.puts String.slice(a, 0..2)
IO.puts String.length(a)

b = "😸😾"
IO.puts b
IO.puts String.length(b)
IO.puts String.slice(b, 1..-1)
IO.puts String.reverse(b)

c = "baﬄe"
IO.puts c
IO.puts String.upcase(c)

d = "noël"
IO.puts d
IO.puts a == d
IO.puts String.equivalent?(a, d)

noël
lëon
noë
4
😸😾
2
😾
😾😸
baﬄe
BAFFLE
noël
false
true
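The two spellings of "noël" look identical on screen, so here is a hedged sketch of what the last comparison is probably showing. It assumes one string uses the precomposed ë (U+00EB) and the other uses "e" plus a combining diaeresis (U+0308), which is what the false/true results suggest. The variable names are mine, and String.normalize/2 needs Elixir 1.3 or later.

```elixir
# Precomposed: "ë" is the single codepoint U+00EB.
precomposed = "no\u00EBl"
# Decomposed: "e" followed by U+0308 (combining diaeresis).
decomposed = "noe\u0308l"

# `==` compares the underlying bytes, which differ.
IO.inspect(precomposed == decomposed, label: "==")
# ==: false

# String.equivalent?/2 normalizes both sides before comparing.
IO.inspect(String.equivalent?(precomposed, decomposed), label: "equivalent?")
# equivalent?: true

# Normalizing to NFC composes "e" + U+0308 into U+00EB.
IO.inspect(String.normalize(decomposed, :nfc) == precomposed, label: "NFC ==")
# NFC ==: true

# Same number of graphemes, different number of bytes.
IO.inspect({String.length(precomposed), String.length(decomposed)}, label: "graphemes")
# graphemes: {4, 4}
IO.inspect({byte_size(precomposed), byte_size(decomposed)}, label: "bytes")
# bytes: {5, 6}
```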

Qqwy · May 2, 2016, 5:27am

The one thing that Elixir doesn’t do so far, and that some people would like to see, is provide boolean functions that return interesting information about a given grapheme, such as whether it is alphanumeric, uppercase, a control character, a spacing character, etc.

These functions are mainly useful for validating or filtering data fields so that they only contain characters from a given subset. Often, using regular expressions in these contexts is either overkill or at least slower than necessary (and I’m not entirely sure whether the regex engine follows Unicode rules or Perl-style rules for which classes characters belong to; I’m fairly certain it is the latter).

To clarify:

teststring = "noël"
teststring =~ ~r{\w{4}} # false: fails on the ë because `\w` is defined as `[a-zA-Z0-9_]`, even though ë is clearly alphanumeric.

benwilson512 · May 2, 2016, 10:25am

iex(6)> teststring = "noël"     
"noël"
iex(7)> teststring =~ ~r/\w{4}/ 
false
iex(8)> teststring =~ ~r/\w{4}/u
true
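The only difference between the last two lines is the `u` modifier, which turns on Unicode handling for the pattern. The sketch below repeats the comparison with the ë written as an escape (assuming the precomposed U+00EB), then uses a Unicode property class to build the kind of predicate asked for above. `is_letter` is a hypothetical helper of my own, not something Elixir ships.

```elixir
# Precomposed "ë" (U+00EB), written as an escape so the bytes are unambiguous.
teststring = "no\u00EBl"

# Without `u` the pattern is matched byte by byte, so the two-byte "ë"
# breaks the run of four word characters.
IO.inspect(teststring =~ ~r/\w{4}/, label: "without u")
# without u: false

# With `u` the pattern is matched against codepoints and `\w` follows
# Unicode character properties.
IO.inspect(teststring =~ ~r/\w{4}/u, label: "with u")
# with u: true

# Hypothetical helper: true if the string is exactly one Unicode letter.
is_letter = fn char -> char =~ ~r/\A\p{L}\z/u end

IO.inspect(Enum.map(String.graphemes(teststring), is_letter), label: "letters")
# letters: [true, true, true, true]
IO.inspect(is_letter.("1"), label: "digit")
# digit: false
```

Note that a decomposed "e" plus combining mark is two codepoints, so this helper would reject it. Normalizing the input with String.normalize/2 first is one way around that.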

Qqwy · May 2, 2016, 11:33am

And this is how you learn something new every day. Thank you!