A grapheme is a singular visible character though.
Even individual codepoints can be larger than a single code unit ("word") in UTF-16 or UTF-8 though, so the earlier UTF-16 example was inherently broken. Even something as simple as 💩 is larger than a UTF-8 or UTF-16 code unit and requires multiple units to represent (4 in UTF-8 and 2 in UTF-16). With UTF-32 you can represent a single codepoint as a 32-bit integer (as Unicode codepoints are capped at 21 bits), but even then a codepoint is not a character or grapheme. Even something as simple as a character with a diacritical mark can be represented either as a single codepoint or as a sequence of codepoints, and many such marks are not representable as a single codepoint at all, hence combining marks.
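To put numbers on the code-unit point, here is a quick iex sketch (assuming any reasonably recent Elixir/OTP) showing the same single codepoint taking a different number of code units depending on the encoding:

iex> ?💩                          # the codepoint value, U+1F4A9
128169
iex> byte_size("💩")              # UTF-8: four 8-bit code units
4
iex> byte_size(<<"💩"::utf16>>)   # UTF-16: two 16-bit code units (a surrogate pair)
4
iex> byte_size(<<"💩"::utf32>>)   # UTF-32: one 32-bit code unit
4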
Pretending a unicode character/grapheme can fit into a bounded integer will always eventually fail, and getting a character/grapheme is what people actually intend 99% of the time they access codepoints. Unicode is not an integral format, it is a binary format.
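As an illustration (a sketch; the flag is just an arbitrary example of a multi-codepoint grapheme), a single grapheme can already be several codepoints, so no fixed-width integer can hold "the character":

iex> flag = "\u{1F1FA}\u{1F1F8}"      # two REGIONAL INDICATOR codepoints (U+1F1FA, U+1F1F8)
"🇺🇸"
iex> String.length(flag)              # one grapheme
1
iex> length(String.codepoints(flag))  # but two codepoints
2
iex> byte_size(flag)                  # and eight UTF-8 bytes
8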
The issue in the original post was just printing out the hex format of the UTF-8 binary (where they seemed to want a UTF-16 binary):
iex(13)> <<a, b, c>> = "草"
"草"
iex(14)> {a, b, c}
{232, 141, 137}
iex(15)> <<a, b>> = <<"草"::utf16>>
<<131, 73>>
That is just the way it is encoded, but adding any combining marks to the character will increase the size even further, even though it is still a single printable character/grapheme.
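For example (a sketch; the combining mark is an arbitrary choice, and Erlang's :unicode module is used to get the UTF-16 form at runtime), tacking U+0301 COMBINING ACUTE ACCENT onto the same 草 grows the binary in both encodings while it stays a single grapheme:

iex> s = "草" <> "\u0301"              # 草 plus U+0301 COMBINING ACUTE ACCENT
"草́"
iex> String.length(s)                  # still a single grapheme
1
iex> byte_size(s)                      # UTF-8 grew from 3 bytes to 5
5
iex> byte_size(:unicode.characters_to_binary(s, :utf8, {:utf16, :big}))  # UTF-16 grew from 2 bytes to 4
4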
Either way, Base.encode16/1 was encoding an 8-bit binary to a hex string; it was not encoding Unicode in any form or any way. The Unicode encoding needed to be handled before passing the binary to the hexadecimal encoding (and it should probably be normalized too, to shrink it down as much as possible).
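A rough sketch of that order of operations, assuming (as the earlier example suggests) that UTF-16 big-endian was the target and that NFC is the normalization wanted:

iex> normalized = String.normalize("草", :nfc)   # compose combining sequences where possible
"草"
iex> utf16 = :unicode.characters_to_binary(normalized, :utf8, {:utf16, :big})
<<131, 73>>
iex> Base.encode16(utf16)
"8349"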
And to note, E̋̉̅ is not a decorated E, it is a single character, which just happens to be made up of some combining mark codepoints along with a base codepoint at the binary level. You should never have to worry about that at the ‘string’ level, or you are probably doing something wrong anyway.
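To illustrate (a sketch; the exact combining marks here are a guess at the ones in that E), the string level already treats it as one thing, and only dropping down to codepoints or bytes exposes the pieces:

iex> e = "E" <> "\u030B" <> "\u0309" <> "\u0305"  # an E plus three combining marks
"E̋̉̅"
iex> String.graphemes(e)                           # a single grapheme at the string level
["E̋̉̅"]
iex> length(String.codepoints(e))                  # made of four codepoints underneath
4
iex> byte_size(e)                                  # and seven UTF-8 bytes
7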