Troubleshooting unexpected hex value for Chinese character, 草

Logan · April 15, 2018, 8:18am

In the past, I’ve used Base.encode16() to get the hex value of characters. I did this for 草 (grass) which is used as an example on unicode.org.

Base.encode16("草") produces "E88D89", but according to unicode.org this should be "008349"

To get a decimal value for characters you can do "草" |> Base.encode16() |> Integer.parse(16), which returns {15240585, ""}, but this is not the value given by ?草 or <<x::utf8>> = "草". Those give 33609, and we can get the same number by doing Integer.parse("008349", 16).

I also tried a Base 16 encoding library in Ruby and 草 did produce 008349.

Am I misunderstanding Base.encode16() or is it producing the wrong hex value?

sync08 · April 15, 2018, 8:23am

The resulting “E88D89” is a UTF-8 bytestring.

http://unicode.scarfboy.com/?s=U%2B8349

NobbZ · April 15, 2018, 8:51am

It is the Base16 encoded version of the UTF-8 encoded "草". Depending on the input encoding, the output can differ. The encoding/decoding functions in Base work on the bytes, not the codepoints.

To actually get the hex representation of the Codepoint, you should do Integer.to_string(?草, 16).
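As a rough illustration of the difference (a minimal sketch; the variable names are only for this example), the two approaches can be compared side by side. Note that Integer.parse/2 returns a {value, rest} tuple rather than a bare integer:

```elixir
# Base.encode16/1 hex-encodes the raw bytes of the binary (UTF-8 in this case).
utf8_hex = Base.encode16("草")
IO.inspect(utf8_hex)                      # "E88D89"

# ?草 is the integer codepoint; Integer.to_string/2 renders it in base 16.
codepoint_hex = Integer.to_string(?草, 16)
IO.inspect(codepoint_hex)                 # "8349"

# Integer.parse/2 returns {integer, remainder}, so pattern match on the tuple.
{from_bytes, ""} = Integer.parse(utf8_hex, 16)
{from_codepoint, ""} = Integer.parse("008349", 16)
IO.inspect({from_bytes, from_codepoint})  # {15240585, 33609}

# Left-pad with zeros if the six-digit form is wanted.
padded = String.pad_leading(codepoint_hex, 6, "0")
IO.inspect(padded)                        # "008349"
```

Both hex strings describe the same character; one is the UTF-8 byte sequence and the other is the codepoint number.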

Logan · April 15, 2018, 3:15pm

Thanks @sync08 for the link and @NobbZ for the tip on Integer.to_string(?草, 16) . I did not know that “E88D89” and “008349” would appropriately represent the same character; the first being UTF-8 as hexadecimal and the second as unicode hex codepoint. I’m just now getting into Unicode and find it very intriguing.

*edit, just wanted to put this here for anyone else that was still curious. To use Base.encode16() to find the unicode hex codepoint and not the UTF-8 representation you could do <<"草"::utf16>> |> Base.encode16() but using Integer.to_string(?草, 16) is much more succinct.

OvermindDL1 · April 16, 2018, 7:05pm

Eh not really, because not all characters are representable in utf16 either. Unicode, in every encoding size, puts no bound on combining marks. Technically I’m not sure if Integer.to_string will always work either; unicode should not be treated as anything but a binary.

NobbZ · April 16, 2018, 7:38pm

Integer to string will always work unless unicode changes its internal number schema to something not hexadecimal. And even then integer to string will work, you only need to change the base.

And utf16 can encode everything in unicode, but it can not encode all of it in only 2 bytes.
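A small sketch of that, using U+1F4A9 as an example of a codepoint outside the Basic Multilingual Plane (the variable names are just for illustration). The codepoint still prints as a plain hex integer, while its UTF-16 form needs a surrogate pair, so the hex of the UTF-16 bytes no longer matches the codepoint:

```elixir
# U+1F4A9 written with an escape so the example does not depend on font support.
poo = "\u{1F4A9}"

# Decode the single codepoint out of the UTF-8 binary.
<<cp::utf8>> = poo
cp_hex = Integer.to_string(cp, 16)
IO.inspect(cp_hex)               # "1F4A9"

# Re-encode as UTF-16 (big endian by default): two 16-bit units, 4 bytes.
utf16 = <<cp::utf16>>
IO.inspect(byte_size(utf16))     # 4
utf16_hex = Base.encode16(utf16)
IO.inspect(utf16_hex)            # "D83DDCA9" (high and low surrogate)

# Erlang's :unicode module does the same conversion for a whole binary.
utf16_again = :unicode.characters_to_binary(poo, :utf8, :utf16)
```

So the <<"草"::utf16>> |> Base.encode16() trick only equals the codepoint hex for codepoints that fit in one 16-bit unit.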

OvermindDL1 · April 16, 2018, 7:49pm

UTF16, nor UTF8, nor UTF32 can encode all characters into a single word/integer, you have to keep it as binary as even something simple like E̋̉̅l͆̎̏̑͊ĭͨ̇͆̏͋ẋ̇͗͗ͨ̆̅̌i̓̿ͫͫͩrͯ̂̒̈̈̎ͩ́̄ with combining marks, each character in it will not fit in any of the above encodings and you require multiple words (8/16/32-bit integers) to represent each character.

You Can Not fully represent a unicode character in a single integer (unless you have infinite integral sizes, but at that point it’s just a binary anyway).

Treating a unicode character as an integer is nonsensical from a spec perspective.

$ iex
Erlang/OTP 20 [erts-9.3] [source] [64-bit] [smp:1:1] [ds:1:1:10] [async-threads:10] [hipe] [kernel-poll:false]

Interactive Elixir (1.6.3) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> s = "E̋̉̅l͆̎̏̑͊ĭͨ̇͆̏͋ẋ̇͗͗ͨ̆̅̌i̓̿ͫͫͩrͯ̂̒̈̈̎ͩ́̄"
"E̋̉̅l͆̎̏̑͊ĭͨ̇͆̏͋ẋ̇͗͗ͨ̆̅̌i̓̿ͫͫͩrͯ̂̒̈̈̎ͩ́̄"
iex(2)> byte_size(s)
78
iex(3)> ?E̋̉̅
** (SyntaxError) iex:3: unexpected token: "̋" (column 3, codepoint U+030B)
iex(3)> String.graphemes(s)
["E̋̉̅", "l͆̎̏̑͊", "ĭͨ̇͆̏͋", "ẋ̇͗͗ͨ̆̅̌", "i̓̿ͫͫͩ", "rͯ̂̒̈̈̎ͩ́̄"]

Even Elixir bails out because it cannot represent it (you can only take a very limited subset of unicode to the right of the ? operator in elixir).

NobbZ · April 16, 2018, 7:54pm

One can represent all codepoints in utf8, utf16 or utf32. They all have different requirements in the size of the resulting bytestring, but they all are able to represent all codepoints.

And the decorated E you have there is not a single codepoint but a grapheme.

OvermindDL1 · April 16, 2018, 8:11pm

A grapheme is a singular visible character though.

Even specific codepoints are larger than a wordsize though in UTF16 or UTF8 so the UTF16 example prior was inherently broken, even something as simple as 💩 is larger than the wordsize of UTF8 or UTF16 requiring multiple words to represent (4 words in UTF8 and 2 in UTF16). With UTF32 you can represent a singular codepoint as a 32-bit integer (as unicode codepoints are capped to 21-bits), but even then a codepoint is not a character or grapheme. Even something as simple as a character with a diacritical mark can be represented as a single codepoint or as a sequence of codepoints, and even then many such marks are not representable as a singular codepoint, hence why you have combining marks.
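To make the word counts concrete, here is a minimal sketch that counts code units per encoding for a codepoint. The units helper is a made-up name for this example, not something from the standard library:

```elixir
# Hypothetical helper: number of 8-bit, 16-bit and 32-bit units needed
# to encode one codepoint in UTF-8, UTF-16 and UTF-32 respectively.
units = fn cp ->
  {
    byte_size(<<cp::utf8>>),
    div(byte_size(<<cp::utf16>>), 2),
    div(byte_size(<<cp::utf32>>), 4)
  }
end

grass_units = units.(0x8349)    # U+8349, the character from the original post
poo_units = units.(0x1F4A9)     # U+1F4A9, outside the Basic Multilingual Plane

IO.inspect(grass_units)         # {3, 1, 1}
IO.inspect(poo_units)           # {4, 2, 1}
```

Only the UTF-32 column stays at one unit for every codepoint, and even that says nothing about graphemes built from several codepoints.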

Pretending a unicode character/grapheme can fit into a bounded integer will always eventually fail, and that is what people accessing codepoints actually intend to do 99% of the time. Unicode is not an integral format, it is a binary format.

The issue in the original post was just printing out the hex format of the UTF8 binary (where they seemed to be wanting a UTF16 binary):

iex(13)> <<a, b, c>> = "草"
"草"
iex(14)> {a, b, c}
{232, 141, 137}
iex(15)> <<a, b>> = <<"草"::utf16>>      
<<131, 73>>

It is just the way it is encoded, but even then adding any combiners to the character will increase the size even further even though it is still a single printable character/grapheme.
Either way, Base.encode16/1 was encoding an 8-bit binary to a hex string. It was not encoding unicode in any form or any way, and the unicode encoding needed to be handled before passing it to the hexadecimal encoding (as well as probably normalizing it too, to shrink it down as much as possible).
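
A quick sketch of the normalization point, assuming NFC is the form wanted and using "e" plus a combining acute accent as a stand-in example:

```elixir
# "e" followed by U+0301 COMBINING ACUTE ACCENT: two codepoints, one grapheme.
decomposed = "e\u0301"

# NFC composes the pair into the single codepoint U+00E9 where one exists.
composed = String.normalize(decomposed, :nfc)

decomposed_hex = Base.encode16(decomposed)
composed_hex = Base.encode16(composed)

IO.inspect(decomposed_hex)   # "65CC81" (3 bytes of UTF-8)
IO.inspect(composed_hex)     # "C3A9"   (2 bytes of UTF-8)

# Both forms are still one grapheme at the string level.
IO.inspect(length(String.graphemes(decomposed)))  # 1
IO.inspect(length(String.graphemes(composed)))    # 1
```

The same visible character gives two different hex strings depending on normalization, which is why the encoding step has to be settled before calling Base.encode16/1.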

And to note, it is not a decorated E̋̉̅, it is a singular character, which happens to be made up of some combining marks codepoints along with a basic codepoint in the binary level, but you should never worry about that at the ‘string’ level or probably doing something wrong anyway.

NobbZ · April 16, 2018, 8:31pm

Any Unicode codepoint will always be representable by an unbound integer (which is the default in elixir).

Of course you can not represent combined graphemes which are built from multiple codepoints as a single integer, but Unicode defines Codepoints as the atomic unit. You can always split graphemes in one or many codepoints, but not the other way round.
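
For example (a minimal sketch, using "e" with a combining acute accent as the grapheme), splitting goes from one grapheme to several codepoints, each of which is an ordinary integer:

```elixir
# One visible character built from two codepoints.
s = "e\u0301"

grapheme_count = length(String.graphemes(s))
codepoint_count = length(String.codepoints(s))
IO.inspect({grapheme_count, codepoint_count})   # {1, 2}

# String.to_charlist/1 yields the codepoints as integers.
cps = String.to_charlist(s)
IO.inspect(cps)                                 # [101, 769]

# Each codepoint has its own hex representation.
cps_hex = Enum.map(cps, &Integer.to_string(&1, 16))
IO.inspect(cps_hex)                             # ["65", "301"]
```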

And of course, I do understand that some codepoints need more than one/two bytes in UTF8/UTF16.

But I insist on “because not all characters are representable in utf16 either” being a false statement, since you can represent all codepoints in any UTFx encoding, but not necessarily with exactly x bits.

OvermindDL1 · April 16, 2018, 8:57pm

Specifically, they cannot fit in a wordsize of utf16 as occasionally you need surrogate pairs, which is why flat-out hexadecimal encoding them is quite odd at times. I did not say that utf16 could not represent all characters, it certainly can, it just cannot do it within its wordsize (where utf32 can).