vikram
January 18, 2022, 6:45am
1
Hi Everyone,
I'm Vikram, a beginner in Elixir.
Can anyone explain the difference between String.next_codepoint/1 and String.next_grapheme/1?
They give the same output, so I can't tell them apart.
iex(8)> String.next_codepoint("trimming")
{"t", "rimming"}
iex(9)> String.next_grapheme("trimming")
{"t", "rimming"}
Thank you.
This is Unicode terminology. Unicode text is described in terms of code points, graphemes and glyphs.
A code point is the smallest unit of Unicode text: a single number assigned to a character or to a combining mark. It cannot be divided further.
A grapheme is what a reader perceives as one character on screen or in print. It may consist of one or more code points.
Some languages have letters (graphemes) that need more than one code point.
Your example string is plain English, where every grapheme is a single code point, so you won't see a difference.
iex(1)> s = "\u0065\u0301at"
"éat"
iex(2)> String.graphemes(s)
["é", "a", "t"] # length is 3
iex(3)> String.codepoints(s)
["e", "́", "a", "t"] # length is 4
iex(4)> {grapheme, rest} = String.next_grapheme(s)
{"é", "at"} # grapheme is "é"
iex(5)> {second_grapheme, rest} = String.next_grapheme(rest)
{"a", "t"} # second_grapheme is "a"
iex(6)> {codepoint, rest} = String.next_codepoint(s)
{"e", "́at"} # codepoint is e, top accent is still in rest (little difficult to notice)
iex(7)> {second_codepoint, rest} = String.next_codepoint(rest)
{"́", "at"} # second_codepoint is top accent
Previous discussion in the forum - Codepoints vs Grapheme - #2 by ckhrysze
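To make the grapheme-to-code-point relationship above easier to see, here is a minimal sketch that pairs each grapheme with the code point numbers it is built from. The variable names (`breakdown`, `grapheme_count`, `codepoint_count`) are only illustrative; the functions are from the standard `String` module.

```elixir
# Same string as in the session above: "e" + COMBINING ACUTE ACCENT, then "at"
s = "e\u0301at"

# Pair every grapheme with the integer code points that make it up
breakdown = for g <- String.graphemes(s), do: {g, String.to_charlist(g)}

# charlists: :as_lists forces the code points to print as numbers
IO.inspect(breakdown, charlists: :as_lists)

# String.length/1 counts graphemes, not code points
grapheme_count = String.length(s)
codepoint_count = length(String.codepoints(s))

IO.inspect({grapheme_count, codepoint_count})
```

The first grapheme carries two code points (101 and 769, i.e. U+0065 and U+0301), which is why the grapheme count is 3 while the code point count is 4.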
Hey, welcome to the forums! Here’s another example from the Wikipedia article:
iex(6)> "\u00f1"
"ñ"
iex(5)> "\u006E\u0303"
"ñ"
iex(7)> "\u006E\u0303" |> String.next_codepoint()
{"n", "̃"}
iex(8)> "\u006E\u0303" |> String.next_grapheme()
{"ñ", ""}
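The two spellings of "ñ" above look identical but are different binaries. As a small sketch (the variable names are mine), `String.equivalent?/2` and `String.normalize/2` from the standard library can be used to compare them or convert one form into the other:

```elixir
precomposed = "\u00F1"   # one code point: LATIN SMALL LETTER N WITH TILDE
decomposed = "n\u0303"   # two code points: "n" + COMBINING TILDE

# Plain comparison looks at the bytes, so the two forms differ
same_bytes = precomposed == decomposed

# Canonical equivalence treats them as the same text
equivalent = String.equivalent?(precomposed, decomposed)

# NFC composes, NFD decomposes
composed = String.normalize(decomposed, :nfc)
split = String.normalize(precomposed, :nfd)

IO.inspect({same_bytes, equivalent, composed == precomposed, split == decomposed})
```

Either form is a single grapheme, so `String.next_grapheme/1` returns the whole "ñ" in both cases; only `String.next_codepoint/1` exposes the difference.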
vikram
January 19, 2022, 2:47pm
4
Thank you for giving the solution and your valuable time.
vikram
January 19, 2022, 2:48pm
5
Thank you for your solution and your valuable time.
vikram
January 24, 2022, 1:39pm
6
@kartheek I have one more question. You said that a code point is the atomic part of text, the smallest possible unit. In that case, can you please explain this result too?
iex> String.codepoints("olá")
["o", "l", "á"]
iex(5)> String.graphemes("olá")
["o", "l", "á"]
Thank you.
That is most likely "\u00E1", which is LATIN SMALL LETTER A WITH ACUTE: á
It is a single precomposed code point, so there is nothing for String.codepoints/1 to split.
Unicode characters can be looked up online, for example: U+00E1 LATIN SMALL LETTER A WITH ACUTE: á – Unicode – Codepoints
According to the Elixir docs:
"In Elixir you can use a ? in front of a character literal to reveal its code point:"
iex(1)> ?á
225
iex(2)> "\u00E1" === "á"
true
iex(3)> 0x00E1 = 225 = ?á # the same code point written in hexadecimal and decimal
225
Another site which I use most often - http://compart.com
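To check which form of "á" a string contains, one option is to build both forms explicitly and compare the counts. This is a minimal sketch with illustrative variable names; it assumes the "olá" in the question used the precomposed U+00E1.

```elixir
precomposed = "ol\u00E1"   # á as one code point (U+00E1)
decomposed = "ola\u0301"   # "a" followed by COMBINING ACUTE ACCENT (U+0301)

# Code point counts differ between the two forms
precomposed_codepoints = length(String.codepoints(precomposed))
decomposed_codepoints = length(String.codepoints(decomposed))

# Grapheme counts are the same: both read as "olá"
precomposed_graphemes = String.length(precomposed)
decomposed_graphemes = String.length(decomposed)

IO.inspect({precomposed_codepoints, decomposed_codepoints})
IO.inspect({precomposed_graphemes, decomposed_graphemes})

# A code point number can be turned back into a string with the utf8 modifier
a_acute = <<0x00E1::utf8>>
IO.inspect(a_acute == "\u00E1")
```

With the precomposed form, code points and graphemes line up one to one, which matches the output in the question.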
iex(1)> emoji = "👨‍👩‍👦‍👦"
"👨‍👩‍👦‍👦"
iex(2)> String.graphemes(emoji)
["👨‍👩‍👦‍👦"]
iex(3)> String.codepoints(emoji)
["👨", "‍", "👩", "‍", "👦", "‍", "👦"] # every second element is the invisible U+200D ZERO WIDTH JOINER
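Because the joiner is invisible and easy to lose when copying, here is a sketch that builds the same family emoji from escapes. The variable names are mine, and the single-grapheme result assumes a reasonably recent Elixir version, as in the session above.

```elixir
zwj = "\u200D"   # ZERO WIDTH JOINER

# MAN, WOMAN, BOY, BOY glued together with joiners
family = Enum.join(["\u{1F468}", "\u{1F469}", "\u{1F466}", "\u{1F466}"], zwj)

codepoints = String.codepoints(family)
codepoint_count = length(codepoints)
joiner_count = Enum.count(codepoints, &(&1 == zwj))

# The whole sequence is treated as one grapheme
grapheme_count = String.length(family)

IO.inspect({codepoint_count, joiner_count, grapheme_count})
```

Four visible emoji plus three joiners give seven code points, while the sequence is still reported as one grapheme.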
I wonder if one day we’ll have clay (or titanium) tablets with emoji on them for the future archeologists to decipher.
vikram
January 25, 2022, 4:12am
10
@kartheek @LostKobrakai
Thank you for explaining the concept and have a good day.
03juan
January 26, 2022, 9:59am
11
Here is a very good overview of Unicode