How would you explain the difference between codepoints and graphemes in Strings?

vikram · January 18, 2022, 6:45am

Hi Everyone,
I'm Vikram, a beginner in Elixir.
Can anyone explain the difference between next_codepoint and next_grapheme?
They give the same output, so I could not understand it.

iex(8)> String.next_codepoint("trimming")
{"t", "rimming"}
iex(9)> String.next_grapheme("trimming")
{"t", "rimming"}

Thank you.

kartheek · January 18, 2022, 7:36am

This is Unicode terminology. A Unicode string can be described in terms of code points, graphemes, and glyphs.

A code point is the atomic unit of text: a single number in the Unicode code space that cannot be divided any further.

A grapheme is what a reader perceives as one character on screen or in print. It may consist of one or more code points.

Some languages have letters (graphemes) that need more than one code point.

Your example string is plain English, so you won’t see a difference.

iex(1)> s = "\u0065\u0301at"
"éat"
iex(2)> String.graphemes(s)
["é", "a", "t"] # length is 3
iex(3)> String.codepoints(s)
["e", "́", "a", "t"] # length is 4
iex(4)> {grapheme, rest} = String.next_grapheme(s)
{"é", "at"} # grapheme is "é"
iex(5)> {second_grapheme, rest} = String.next_grapheme(rest)
{"a", "t"} # second_grapheme is "a"
iex(6)> {codepoint, rest} = String.next_codepoint(s)
{"e", "́at"} # codepoint is "e"; the combining accent is still at the start of rest (hard to notice)
iex(7)> {second_codepoint, rest} = String.next_codepoint(rest)
{"́", "at"} # second_codepoint is the combining acute accent
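
To make the three counts concrete, here is a small sketch that measures the same string in graphemes, code points, and bytes. The variable names are only illustrative; the functions are from Elixir's String and Kernel modules.

```elixir
# Decomposed "é": the letter e (U+0065) followed by a combining acute accent (U+0301).
s = "\u0065\u0301at"

grapheme_count = String.length(s)               # String.length/1 counts graphemes
codepoint_count = length(String.codepoints(s))  # number of code points
byte_count = byte_size(s)                       # number of UTF-8 bytes

IO.inspect({grapheme_count, codepoint_count, byte_count})
# prints {3, 4, 5}
```

The combining accent takes two bytes in UTF-8, which is why the byte count is 5 rather than 4.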

Previous discussion in the forum - Codepoints vs Grapheme - #2 by ckhrysze

stefanchrobot · January 18, 2022, 7:38am

Hey, welcome to the forums! Here’s another example from the Wikipedia article:

iex(6)> "\u00f1"
"ñ"
iex(5)> "\u006E\u0303"
"ñ"
iex(7)> "\u006E\u0303" |> String.next_codepoint()
{"n", "̃"}
iex(8)> "\u006E\u0303" |> String.next_grapheme()
{"ñ", ""}
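
As a follow-up sketch: the two spellings of "ñ" look the same but are different binaries, so `==` says they differ. `String.equivalent?/2` and `String.normalize/2` can be used to compare them by meaning. The variable names here are made up for illustration.

```elixir
precomposed = "\u00F1"       # ñ as a single code point
decomposed = "\u006E\u0303"  # n followed by a combining tilde

same_bytes = precomposed == decomposed                      # compares the raw binaries
equivalent = String.equivalent?(precomposed, decomposed)    # compares after normalization
composed = String.normalize(decomposed, :nfc)               # NFC merges n + tilde into one code point
split = String.codepoints(String.normalize(precomposed, :nfd))  # NFD splits it again

IO.inspect({same_bytes, equivalent, composed == precomposed, length(split)})
# prints {false, true, true, 2}
```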

vikram · January 19, 2022, 2:47pm

Thank you for giving the solution and your valuable time.

vikram · January 19, 2022, 2:48pm

Thank you for your solution and your valuable time.

vikram · January 24, 2022, 1:39pm

@kartheek I have one more question. You said that a codepoint is the atomic part of the text, the smallest possible unit. Can you please explain this one too?
iex> String.codepoints("olá")
["o", "l", "á"]
iex(5)> String.graphemes("olá")
["o", "l", "á"]

Thank you.

kartheek · January 24, 2022, 2:03pm

That "á" is most likely "\u00E1", which is LATIN SMALL LETTER A WITH ACUTE: a single precomposed code point, so codepoints and graphemes return the same list.

á

Unicode characters can be looked up at - U+00E1 LATIN SMALL LETTER A WITH ACUTE: á – Unicode – Codepoints

According to the Elixir docs:

In Elixir you can use a ? in front of a character literal to reveal its code point:

iex(1)> ?á
225
iex(2)> "\u00E1" === "á"
true
iex(3)> 0x00E1 = 225 = ?á # the same code point written in hexadecimal
225

Another site which I use most often - http://compart.com
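
If you want to check which form a string is in, a rough sketch like the one below may help. It assumes the "á" in "olá" really is the precomposed U+00E1, and the variable names are only for illustration.

```elixir
a_acute = "\u00E1"                           # precomposed á, one code point
decomposed = String.normalize(a_acute, :nfd) # canonical decomposition: a + combining acute

precomposed_points = String.to_charlist(a_acute)
decomposed_points = String.to_charlist(decomposed)
decomposed_graphemes = length(String.graphemes(decomposed))

IO.inspect(precomposed_points, charlists: :as_lists)  # prints [225]
IO.inspect(decomposed_points, charlists: :as_lists)   # prints [97, 769]
IO.inspect(decomposed_graphemes)                      # prints 1
```

Both forms are one grapheme; only the number of code points changes.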

LostKobrakai · January 24, 2022, 2:15pm

iex(1)> emoji = "👨‍👩‍👦‍👦"
"👨‍👩‍👦‍👦"
iex(2)> String.graphemes(emoji)
["👨‍👩‍👦‍👦"]
iex(3)> String.codepoints(emoji)
["👨", "‍", "👩", "‍", "👦", "‍", "👦"]
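
The seemingly empty strings in that list are zero-width joiners (U+200D) that glue the individual emoji together. A small sketch, writing the same emoji with escapes so the invisible code points are visible in the source (variable names are illustrative):

```elixir
# Family emoji spelled out: man, ZWJ, woman, ZWJ, boy, ZWJ, boy.
emoji = "\u{1F468}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}"

codepoints = String.codepoints(emoji)
zwj_count = Enum.count(codepoints, &(&1 == "\u200D"))

IO.inspect(length(codepoints))  # prints 7
IO.inspect(zwj_count)           # prints 3
IO.inspect(byte_size(emoji))    # prints 25 (four emoji at 4 bytes, three joiners at 3 bytes)
```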

dimitarvp · January 24, 2022, 2:38pm

I wonder if one day we’ll have clay (or titanium) tablets with emoji on them for the future archeologists to decipher.

vikram · January 25, 2022, 4:12am

@kartheek @LostKobrakai
Thank you for explaining the concept and have a good day.

03juan · January 26, 2022, 9:59am

Here is a very good overview of Unicode

vikram · January 27, 2022, 9:55am

@03juan Thank you