shotleybuilder
Unicode for emoji - string concatenation
Hi,
I don’t understand the behaviour when a string containing Unicode is then concatenated with another string. The resulting string is not rendered correctly when it is streamed into a file.
Here’s something that works:
"❔ foobar" -> "❔ foobar"
This works
"\u2754 foobar" -> "❔ foobar"
This doesn’t work:
"❔" <> "foobar" -> �
Neither does this
"\u2754" <> "foobar" -> �
What am I missing?
Most Liked
NobbZ
U+274C is represented by these 3 bytes in UTF-8: 0xE2 0x9D 0x8C.
If your terminal doesn’t support UTF-8, then you should probably inspect at the byte level rather than rely on the printed representation.
As you have not specified your terminal’s encoding, these bytes could be displayed as almost anything.
Assuming Latin-X, which is often used in the Windows world, the 8x and 9x bytes are not assigned and can cause odd output.
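One way to do the byte-level inspection suggested above, as a minimal sketch. It uses the U+2754 character from the original question and assumes a UTF-8 source file; the temp file name is made up for illustration.

```elixir
# Build the string both ways; the resulting binaries should be identical.
literal = "❔ foobar"
joined = "\u2754" <> " foobar"

# Show the raw bytes instead of the printed representation.
IO.inspect(joined, binaries: :as_binaries)

# U+2754 should appear as the three UTF-8 bytes 0xE2 0x9D 0x94.
IO.inspect(:binary.bin_to_list("\u2754"), base: :hex)

# Check that concatenation kept the binary valid UTF-8.
IO.inspect(String.valid?(joined), label: "valid UTF-8?")
IO.inspect(literal == joined, label: "same bytes?")

# Round-trip through a file (hypothetical file name) and compare the bytes.
path = Path.join(System.tmp_dir!(), "emoji_concat_check.txt")
File.write!(path, joined)
IO.inspect(File.read!(path) == joined, label: "file round-trip intact?")
```

If every check passes but the file still shows � when opened, the program displaying it is probably not decoding it as UTF-8, which would point at the viewer or terminal rather than at the concatenation.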
kip
iex> << 0x1F517 :: utf8 >>
"🔗"
iex> << 0xF0, 0x9F, 0x94, 0x97 >>
"🔗"
iex> << 0xF0, 0x9F, 0x94, 0x97>> == << 0x1f517 :: utf8 >>
true
d8 3d dd 17 is the UTF-16 representation. Elixir strings are UTF-8 encoded, so the bytes are f0 9f 94 97.
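To see both encodings side by side, here is a small sketch using `:unicode.characters_to_binary/3` from Erlang's standard library. The variable names are only for illustration.

```elixir
link = <<0x1F517::utf8>>

# UTF-8 encoding: four bytes.
utf8_bytes = :binary.bin_to_list(link)
IO.inspect(utf8_bytes, base: :hex)

# UTF-16 (big-endian by default): a surrogate pair, also four bytes.
utf16 = :unicode.characters_to_binary(link, :utf8, :utf16)
IO.inspect(:binary.bin_to_list(utf16), base: :hex)

# The bit syntax can produce the same UTF-16 bytes directly.
IO.inspect(utf16 == <<0x1F517::utf16>>, label: "same as ::utf16?")
```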
kip
With a little bit of Erlang magic:
iex> :erlang.binary_to_list << 0x1f517 :: utf8 >>
[240, 159, 148, 151]
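The list above holds bytes, not code points. A short sketch of the difference, using only standard library functions:

```elixir
link = <<0x1F517::utf8>>

# Bytes of the UTF-8 encoding.
bytes = :erlang.binary_to_list(link)
IO.inspect(bytes, label: "bytes")

# Code points: a single integer, 0x1F517.
codepoints = String.to_charlist(link)
IO.inspect(codepoints, charlists: :as_lists, label: "code points")

# byte_size/1 counts bytes; String.length/1 counts graphemes.
IO.inspect(byte_size(link), label: "byte_size")
IO.inspect(String.length(link), label: "String.length")
```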
I have a unicode library that examines code points. For example:
iex> Unicode.category << 0x1F49A :: utf8 >>
[:So]
iex> Unicode.properties << 0x1F49A :: utf8 >>
[[:emoji, :grapheme_base]]
Unfortunately the Regex module doesn’t support the Unicode character class notation [:So:], or that notation for any other Unicode character class. I have another lib, unicode_set, that has some support for Unicode character classes. It doesn’t cover regexes yet, but that should come soon.
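For general categories specifically, the PCRE escape `\p{So}` may be an option in the meantime. This is not from the thread: it is a sketch that assumes the PCRE build shipped with your Erlang/OTP supports Unicode properties and that the regex uses the `u` modifier.

```elixir
heart = <<0x1F49A::utf8>>

# Assumption: \p{So} (general category "Symbol, other") is available
# when the regex is compiled with the u (Unicode) modifier.
symbol_other = ~r/\p{So}/u

is_symbol = Regex.match?(symbol_other, heart)
IO.inspect(is_symbol, label: "heart matches So?")

# A plain ASCII letter is not in category So.
letter_is_symbol = Regex.match?(symbol_other, "a")
IO.inspect(letter_is_symbol, label: "letter matches So?")
```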








