How would you explain the difference between code points and graphemes in Strings?

Hi everyone,
I'm Vikram, a beginner in Elixir.
Can anyone explain the difference between String.next_codepoint/1 and String.next_grapheme/1?
They give the same output here, so I could not understand the difference.

iex(8)> String.next_codepoint("trimming")
{"t", "rimming"}
iex(9)> String.next_grapheme("trimming")
{"t", "rimming"}

Thank you.

This is Unicode-related terminology. A Unicode string is made up of code points, which group into graphemes, which are in turn rendered as glyphs.

A code point is the atomic unit of text: a single number that Unicode assigns to a character or a combining mark. It cannot be divided into smaller characters.

A grapheme is what a reader perceives as a single character on screen or in print. It may consist of one or more code points.

There are languages whose letters (graphemes) need more than one code point.

Your example string is plain English, where every grapheme is a single code point, so you won't see a difference.

iex(1)> s = "\u0065\u0301at"
"éat"
iex(2)> String.graphemes(s)
["é", "a", "t"] # length is 3
iex(3)> String.codepoints(s)
["e", "́", "a", "t"] # length is 4
iex(4)> {grapheme, rest} = String.next_grapheme(s)
{"é", "at"} # grapheme is "é"
iex(5)> {second_grapheme, rest} = String.next_grapheme(rest)
{"a", "t"} # second_grapheme is "a"
iex(6)> {codepoint, rest} = String.next_codepoint(s)
{"e", "́at"} # codepoint is e, top accent is still in rest (little difficult to notice)
iex(7)> {second_codepoint, rest} = String.next_codepoint(rest)
{"́", "at"} # second_codepoint is top accent

Previous discussion in the forum - Codepoints vs Grapheme - #2 by ckhrysze
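
One practical consequence of the example above, as a small sketch (the variable names are only for illustration): String.length/1 and String.reverse/1 work on graphemes, so they give different results from the same operations done by hand on code points.

```elixir
# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "at"
s = "e\u0301at"

IO.inspect(String.length(s))             # 3 graphemes
IO.inspect(length(String.codepoints(s))) # 4 code points
IO.inspect(byte_size(s))                 # 5 bytes in UTF-8 (U+0301 takes two)

# Reversing by grapheme keeps the accent attached to the "e".
by_grapheme = String.reverse(s)

# Reversing by code point moves the accent onto the "a".
by_codepoint = s |> String.codepoints() |> Enum.reverse() |> Enum.join()

IO.inspect(by_grapheme == "tae\u0301")   # true
IO.inspect(by_codepoint == "ta\u0301e")  # true
```

So if you care about what the user sees, the grapheme-based functions are usually the ones you want.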

Hey, welcome to the forums! Here’s another example from the Wikipedia article:

iex(6)> "\u00f1"
"ñ"
iex(5)> "\u006E\u0303"
"ñ"
iex(7)> "\u006E\u0303" |> String.next_codepoint()
{"n", "̃"}
iex(8)> "\u006E\u0303" |> String.next_grapheme()
{"ñ", ""}
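
A follow-up sketch on the two spellings of "ñ" above: they look identical but are different binaries, so == says they differ. String.equivalent?/2 and String.normalize/2 are the standard library functions for comparing or converting between the two forms; the variable names below are just for illustration.

```elixir
precomposed = "\u00F1"  # one code point: LATIN SMALL LETTER N WITH TILDE
decomposed = "n\u0303"  # two code points: "n" plus COMBINING TILDE

# A plain comparison looks at the bytes, which differ.
IO.inspect(precomposed == decomposed)                          # false

# Canonical equivalence treats the two spellings as the same text.
IO.inspect(String.equivalent?(precomposed, decomposed))        # true

# Normalization converts one form into the other.
IO.inspect(String.normalize(decomposed, :nfc) == precomposed)  # true
IO.inspect(String.normalize(precomposed, :nfd) == decomposed)  # true
```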

Thank you for giving the solution and your valuable time.

Thank you for your solution and your valuable time.

@kartheek I have one more question. You said that a code point is the atomic, smallest possible unit of text. Can you please explain this case too? Here both functions return the same result even though "á" has an accent:
iex> String.codepoints("olá")
["o", "l", "á"]
iex(5)> String.graphemes("olá")
["o", "l", "á"]

Thank you.

That "á" is most likely "\u00E1", which is Latin Small Letter A with Acute. It is a single precomposed code point, so codepoints and graphemes give the same result.

á

Unicode characters can be looked up at - U+00E1 LATIN SMALL LETTER A WITH ACUTE – Codepoints

According to the Elixir docs:

In Elixir you can use a ? in front of a character literal to reveal its code point:

iex(1)> ?á
225
iex(2)> "\u00E1" === "á"
true
iex(3)> 0x00E1 = 225 = ?á # the same code point written in hexadecimal
225
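
To tie this back to the "olá" question, here is a small sketch, assuming the "á" really is the precomposed U+00E1: the string has three code points, and only after decomposing it with String.normalize/2 do the code point and grapheme counts diverge. The variable name is just for illustration.

```elixir
# The code points of "olá" as integers; 225 is the same value ?á returns.
IO.inspect(String.to_charlist("ol\u00E1"), charlists: :as_lists)  # [111, 108, 225]

# A code point can be turned back into a string with the utf8 modifier.
IO.inspect(<<225::utf8>> == "\u00E1")  # true

# One code point is not one byte: U+00E1 takes two bytes in UTF-8.
IO.inspect(byte_size("\u00E1"))        # 2

# Decomposing splits "á" into "a" plus a combining accent.
decomposed = String.normalize("ol\u00E1", :nfd)
IO.inspect(length(String.codepoints(decomposed)))  # 4
IO.inspect(String.length(decomposed))              # 3, still three graphemes
```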

Another site that I use often: http://compart.com

iex(1)> emoji = "👨‍👩‍👦‍👦"
"👨‍👩‍👦‍👦"
iex(2)> String.graphemes(emoji)
["👨‍👩‍👦‍👦"]
iex(3)> String.codepoints(emoji)
["👨", "‍", "👩", "‍", "👦", "‍", "👦"]
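
For anyone wondering what the "empty" strings in that codepoints list are, here is a sketch that prints each code point in hex. It assumes the emoji is the usual man, woman, boy, boy family sequence, written with escapes so the invisible characters can't get lost when copying. The joiners are U+200D ZERO WIDTH JOINER.

```elixir
# Family emoji spelled out: four emoji glued together by three U+200D joiners.
emoji = "\u{1F468}\u200D\u{1F469}\u200D\u{1F466}\u200D\u{1F466}"

# Walk the binary one code point at a time and format each as U+XXXX.
hex = for <<cp::utf8 <- emoji>>, do: "U+" <> Integer.to_string(cp, 16)

IO.inspect(hex)
# ["U+1F468", "U+200D", "U+1F469", "U+200D", "U+1F466", "U+200D", "U+1F466"]

IO.inspect(byte_size(emoji))  # 25 bytes: four 4-byte emoji plus three 3-byte joiners

# Counts graphemes; whether the whole family is one grapheme depends on the
# Unicode rules implemented by your Elixir version.
IO.inspect(String.length(emoji))
```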

I wonder if one day we’ll have clay (or titanium) tablets with emoji on them for the future archeologists to decipher.

@kartheek @LostKobrakai
Thank you for explaining the concept and have a good day.

Here is a very good overview of Unicode.

@03juan Thank you
