Where did the name "binaries" come from? And how does this relate to Base2?

These two are not the same: String.split with an empty string splits on graphemes, which are a larger unit than codepoints.

The rules for clustering codepoints into graphemes are defined by the Unicode Standard (Annex #29, Unicode Text Segmentation), and the code for handling them is generated from canonical text files.

iex(4)> String.codepoints("🇺🇸")
["🇺", "🇸"]
iex(5)> String.split("🇺🇸", "", trim: true)
["🇺🇸"]
iex(6)> "🇺🇸" <> <<0>>
<<240, 159, 135, 186, 240, 159, 135, 184, 0>>

The single displayed character 🇺🇸 is one grapheme, composed of two codepoints (U+1F1FA and U+1F1F8), each encoded as 4 bytes in UTF-8, for 8 bytes in total.
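To make the three units concrete, here is a minimal sketch that measures the same flag string in bytes, codepoints, and graphemes. It uses only standard library functions (byte_size/1, String.codepoints/1, String.length/1, String.graphemes/1); the variable names are just for illustration, and the grapheme count assumes a reasonably recent Elixir version.

```elixir
# Measure the flag string from the session above in three different units.
flag = "🇺🇸"

bytes = byte_size(flag)                       # number of UTF-8 bytes
codepoints = length(String.codepoints(flag))  # number of Unicode codepoints
graphemes = String.length(flag)               # String.length/1 counts graphemes

IO.inspect({bytes, codepoints, graphemes})
# prints {8, 2, 1}

# String.graphemes/1 is the dedicated function for splitting on graphemes;
# here it should agree with splitting on an empty string.
IO.inspect(String.graphemes(flag) == String.split(flag, "", trim: true))
```

If all you want is the list of graphemes, String.graphemes/1 states the intent more directly than String.split/3 with an empty pattern.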

Another way that codepoints and graphemes can diverge is combining characters. For instance, U+0308 is “Combining Diaeresis”, which adds ¨ to the preceding character. Example:

iex(9)> s = "ca\u0308t"
"cät"
iex(10)> String.codepoints(s)
["c", "a", "̈", "t"]
iex(11)> String.split(s, "", trim: true)
["c", "ä", "t"]

(Note that the combining character prints very oddly when isolated inside double quotes.)
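One way to sidestep the odd rendering is to look at the codepoints as integers instead of as strings. The sketch below does that, and also shows (as a side note, not something the session above covers) that NFC normalization with String.normalize/2 should fold the two codepoints a + U+0308 into the single precomposed codepoint U+00E4. The variable names are illustrative only.

```elixir
s = "ca\u0308t"

# Codepoints as integers: 776 is U+0308, the combining diaeresis.
# charlists: :as_lists forces a plain list of integers in the output.
IO.inspect(String.to_charlist(s), charlists: :as_lists)
# prints [99, 97, 776, 116]

# Four codepoints, but only three graphemes.
IO.inspect({length(String.codepoints(s)), String.length(s)})
# prints {4, 3}

# NFC normalization composes "a" + U+0308 into the single codepoint U+00E4 (228).
composed = String.normalize(s, :nfc)
IO.inspect(String.to_charlist(composed), charlists: :as_lists)
# prints [99, 228, 116]

# The two strings differ byte for byte, yet are canonically equivalent.
IO.inspect({s == composed, String.equivalent?(s, composed)})
# prints {false, true}
```

So the same visible text can have different codepoint counts depending on normalization, while the grapheme count stays at three.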
