In the past, I’ve used Base.encode16() to get the hex value of characters. I did this for 草 (grass) which is used as an example on unicode.org.
Base.encode16("草") produces "E88D89", but according to unicode.org this should be "008349"
To get a decimal value for a character you can do "草" |> Base.encode16() |> Integer.parse(16), which produces {15240585, ""}, but that is not the number given by ?草 or <<x::utf8>> = "草". Those both produce 33609, and we can get the same number by doing Integer.parse("008349", 16).
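Putting the expressions above together in one iex session (note that Integer.parse/2 returns a {value, rest} tuple rather than a bare integer):

```elixir
# Hex of the UTF-8 bytes of 草
Base.encode16("草")                          # => "E88D89"

# Parsing that hex back gives the UTF-8 bytes as a number, not the codepoint
"草" |> Base.encode16() |> Integer.parse(16) # => {15240585, ""}

# The actual codepoint, three ways
?草                                           # => 33609
<<x::utf8>> = "草"
x                                             # => 33609
Integer.parse("8349", 16)                     # => {33609, ""}
```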
I also tried a Base 16 encoding library in Ruby and 草 did produce 008349.
Am I misunderstanding Base.encode16() or is it producing the wrong hex value?
Thanks @sync08 for the link and @NobbZ for the tip on Integer.to_string(?草, 16). I did not know that “E88D89” and “008349” could both appropriately represent the same character: the first is the UTF-8 encoding as hexadecimal, and the second is the Unicode codepoint in hex. I’m just now getting into Unicode and find it very intriguing.
*edit, just wanted to put this here for anyone else who was still curious. To use Base.encode16() to find the Unicode hex codepoint rather than the UTF-8 representation, you could do <<"草"::utf16>> |> Base.encode16(), but using Integer.to_string(?草, 16) is much more succinct.
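For a BMP character like 草, the big-endian UTF-16 bytes happen to line up with the codepoint, so both approaches give the same hex digits (modulo zero-padding):

```elixir
# UTF-16 (big-endian) bytes, hex-encoded
<<"草"::utf16>> |> Base.encode16()  # => "8349"

# Codepoint formatted directly as hex
Integer.to_string(?草, 16)          # => "8349"
```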
Eh, not really, because not all characters are representable in utf16 either. Unicode, in all of its encodings, is unbounded in its combining sequences. Technically I’m not sure Integer.to_string will always work either; Unicode should not be treated as anything but a binary.
Neither UTF-16, UTF-8, nor UTF-32 can encode all characters into a single word/integer; you have to keep it as a binary. Even something simple like E̋̉̅l͆̎̏̑͊ĭͨ̇͆̏͋ẋ̇͗͗ͨ̆̅̌i̓̿ͫͫͩrͯ̂̒̈̈̎ͩ́̄, with its combining marks, has characters that will not fit in any of the above encodings; you require multiple words (8-, 16-, or 32-bit integers) to represent each character.
You cannot fully represent a Unicode character in a single integer (unless you have infinite integer sizes, but at that point it’s just a binary anyway).
Treating a unicode character as an integer is nonsensical from a spec perspective.
A grapheme is a singular visible character though.
Even individual codepoints can be larger than a word size in UTF-16 or UTF-8, so the UTF-16 example prior was inherently broken. Even something as simple as 💩 is larger than the word size of UTF-8 or UTF-16, requiring multiple words to represent (4 words in UTF-8 and 2 in UTF-16). With UTF-32 you can represent a single codepoint as a 32-bit integer (as Unicode codepoints are capped at 21 bits), but even then a codepoint is not a character or grapheme. Even something as simple as a character with a diacritical mark can be represented as a single codepoint or as a sequence of codepoints, and many such marks are not representable as a single codepoint at all, hence why you have combining marks.
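Those word counts are easy to check in iex; byte_size/1 reports bytes, so four 8-bit code units and two 16-bit code units both come out as 4 bytes:

```elixir
byte_size(<<"💩"::utf8>>)   # => 4  (four 8-bit code units)
byte_size(<<"💩"::utf16>>)  # => 4  (two 16-bit code units, a surrogate pair)
byte_size(<<"💩"::utf32>>)  # => 4  (one 32-bit code unit)
?💩                          # => 128169, i.e. U+1F4A9, fits in 21 bits
```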
Pretending a Unicode character/grapheme can fit into a bounded integer will always eventually fail, and that is what people are actually intending to do 99% of the time they access codepoints. Unicode is not an integral format; it is a binary format.
The issue in the original post was just printing out the hex format of the UTF8 binary (where they seemed to be wanting a UTF16 binary):
It is just the way it is encoded, but even then adding any combiners to the character will increase the size even further even though it is still a single printable character/grapheme.
Either way, Base.encode16/1 was encoding an 8-bit binary to a hex string; it was not encoding Unicode in any form or any way, and the Unicode encoding needed to be handled before passing it to the hexadecimal encoding (as well as probably normalizing it first to shrink it down as much as possible).
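The normalization point can be sketched with String.normalize/2: a decomposed "e" plus combining acute and the precomposed "é" are the same grapheme but different binaries, and NFC normalization shrinks the former to the latter:

```elixir
decomposed  = "e\u0301"  # "e" followed by U+0301 combining acute accent
precomposed = "é"        # single codepoint U+00E9

decomposed == precomposed                          # => false (different bytes)
String.equivalent?(decomposed, precomposed)        # => true  (same after normalization)
String.normalize(decomposed, :nfc) == precomposed  # => true
byte_size(decomposed)                              # => 3
byte_size(precomposed)                             # => 2
```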
And to note, it is not a decorated E̋̉̅; it is a single character, which happens to be made up of some combining-mark codepoints along with a base codepoint at the binary level. But you should never need to worry about that at the ‘string’ level, or you are probably doing something wrong anyway.
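The grapheme/codepoint split is visible directly in iex: one printable character, multiple codepoints underneath (using a decomposed "é" as the example):

```elixir
s = "e\u0301"            # base letter "e" + U+0301 combining acute

length(String.graphemes(s))   # => 1, one visible character
length(String.codepoints(s))  # => 2, two codepoints under the hood
String.length(s)              # => 1, String.length counts graphemes
```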
Any Unicode codepoint will always be representable by an unbound integer (which is the default in Elixir).
Of course you cannot represent combined graphemes, which are built from multiple codepoints, as a single integer, but Unicode defines codepoints as the atomic unit. You can always split graphemes into one or more codepoints, but not the other way round.
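Both halves of that are easy to demonstrate: any single codepoint fits comfortably in an (arbitrary-precision) Elixir integer, while a multi-codepoint grapheme splits into a list of them. The combining marks below are illustrative picks (U+030B, U+0309, U+0305), not necessarily the exact ones in the earlier example:

```elixir
# A single codepoint is always just an integer
?💩  # => 128169

# One grapheme made of a base letter plus three combining marks
# splits into four codepoint integers
String.to_charlist("E\u030B\u0309\u0305")  # => [69, 779, 777, 773]
```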
And of course, I do understand that some codepoints need more than one or two bytes in UTF-8/UTF-16.
But I insist that “because not all characters are representable in utf16 either” is a false statement, since you can represent all codepoints in any UTFx encoding, just not necessarily with exactly x bits.
Specifically, they cannot always fit in a single UTF-16 code unit, as occasionally you need surrogate pairs, which is why flat-out hexadecimal-encoding them is quite odd at times. I did not say that UTF-16 could not represent all characters; it certainly can, it just cannot always do it within its word size (where UTF-32 can).
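The surrogate-pair point shows up when you hex-dump the UTF-16 encoding: an astral-plane character like 💩 (U+1F4A9) becomes the pair D83D DCA9, while a BMP character like 草 is a single 16-bit unit, and UTF-32 handles either in one unit:

```elixir
<<"💩"::utf16>> |> Base.encode16()  # => "D83DDCA9", two code units (surrogate pair)
<<"草"::utf16>> |> Base.encode16()  # => "8349", one code unit
<<"💩"::utf32>> |> Base.encode16()  # => "0001F4A9", one 32-bit code unit
```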