How would you explain the difference between codepoints and graphemes in Strings?

kartheek · January 18, 2022, 7:36am

This is Unicode terminology. A Unicode string can be described in terms of code points, graphemes, and glyphs.

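A code point is the atomic unit of Unicode text: the smallest possible unit, which cannot be divided further at the text level. Each code point is a number assigned to a character or to a combining mark.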
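
To make "code point" a bit more concrete, here is a minimal sketch (not part of the original answer; the variable names are only illustrative). It shows that code points are integers, and that a single code point may take more than one byte once encoded as UTF-8:

```elixir
# "e" followed by U+0301 (combining acute accent), written with escapes.
s = "\u0065\u0301"

# A charlist holds the code points as plain integers.
code_points = String.to_charlist(s)
IO.inspect(code_points, charlists: :as_lists)  # prints [101, 769]

# In UTF-8, each code point is stored as one or more bytes.
accent = <<0x0301::utf8>>
IO.inspect(byte_size(accent))                  # prints 2
IO.inspect(byte_size(s))                       # prints 3
```

So a code point is indivisible as text, even though its binary encoding can span several bytes.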

A grapheme is what a reader perceives as a single character on screen or in print. It may consist of one or more code points.

Some languages have letters (graphemes) that need more than one code point.
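
As a small illustration (my own example, not from the original question), take a Devanagari syllable written as a consonant plus a dependent vowel sign. Assuming the standard `String` functions, it should come back as one grapheme made of two code points:

```elixir
# Devanagari letter NA (U+0928) followed by vowel sign U (U+0941).
# The variable name is illustrative only.
syllable = "\u0928\u0941"

IO.inspect(length(String.graphemes(syllable)))   # prints 1
IO.inspect(length(String.codepoints(syllable)))  # prints 2
```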

Your example string is English, so you won't see a difference.

iex(1)> s = "\u0065\u0301at"
"éat"
iex(2)> String.graphemes(s)
["é", "a", "t"] # length is 3
iex(3)> String.codepoints(s)
["e", "́", "a", "t"] # length is 4
iex(4)> {grapheme, rest} = String.next_grapheme(s)
{"é", "at"} # grapheme is "é"
iex(5)> {second_grapheme, rest} = String.next_grapheme(rest)
{"a", "t"} # second_grapheme is "a"
iex(6)> {codepoint, rest} = String.next_codepoint(s)
{"e", "́at"} # codepoint is "e"; the combining acute accent (U+0301) is still at the start of rest (a little hard to notice)
iex(7)> {second_codepoint, rest} = String.next_codepoint(rest)
{"́", "at"} # second_codepoint is the combining acute accent (U+0301)
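
One practical consequence, sketched here as a script rather than an iex session (the variable names are mine): `String.length/1` counts graphemes, so it can differ from the number of code points. Normalizing to NFC with `String.normalize/2` merges "e" and the accent into the single precomposed code point U+00E9, which is also why two strings that look identical may not compare equal:

```elixir
# "e" + combining accent, the same string as in the session above.
decomposed = "\u0065\u0301at"
# NFC composes "e" + U+0301 into the single code point U+00E9.
composed = String.normalize(decomposed, :nfc)

IO.inspect(String.length(decomposed))                 # prints 3 (graphemes)
IO.inspect(length(String.codepoints(decomposed)))     # prints 4
IO.inspect(length(String.codepoints(composed)))       # prints 3
IO.inspect(decomposed == composed)                    # prints false (different bytes)
IO.inspect(String.equivalent?(decomposed, composed))  # prints true
```

If you compare user-visible text, normalizing both sides first (or using `String.equivalent?/2`) is usually the safer choice.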

Previous discussion in the forum - Codepoints vs Grapheme - #2 by ckhrysze