Unicode for emoji - string concatenation

shotleybuilder · June 9, 2020, 3:43pm

Hi,

I don’t understand the behaviour I get when a string containing Unicode is concatenated with another string. The resulting string is not rendered correctly when streamed into a file.

Here’s something that works:
"❔ foobar" -> "❔ foobar"
This also works:
"\u2754 foobar" -> "❔ foobar"

This doesn’t work:
"❔" <> "foobar" -> �
Neither does this:
"\u2754" <> "foobar" -> �

What am I missing?
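
One way to tell whether the concatenation or the display is at fault is to compare bytes instead of glyphs. The following is a minimal sketch; the variable names `escaped` and `expected` are illustrative and not part of the original post.

```elixir
# <> concatenates the underlying UTF-8 bytes; it does not re-encode anything.
escaped = "\u2754" <> "foobar"

# U+2754 encodes to the three bytes E2 9D 94 in UTF-8.
expected = <<0xE2, 0x9D, 0x94>> <> "foobar"

IO.inspect(escaped == expected)       # true
IO.inspect(String.valid?(escaped))    # true
IO.inspect(byte_size(escaped))        # 9
IO.inspect(String.length(escaped))    # 7
```

If these checks pass, the string itself is intact and the replacement character most likely comes from whatever is rendering it.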

NobbZ · June 9, 2020, 3:46pm

works for me:

Erlang/OTP 22 [erts-10.7] [source] [64-bit] [smp:2:2] [ds:2:2:10] [async-threads:1] [hipe]

Interactive Elixir (1.10.3) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> "❔" <> "foobar"
"❔foobar"

shotleybuilder · June 9, 2020, 4:10pm

Well, my terminal doesn’t support emojis but if I run

iex(1)> IO.inspect("\u274c")
"â

Or more simply

iex(1)> "\u274c"
"â

No closing double quote …

benwilson512 · June 9, 2020, 4:19pm

Is your terminal not in UTF8 mode?

NobbZ · June 9, 2020, 4:36pm

U+274C is represented by these 3 bytes in UTF-8: 0xE2 0x9D 0x8C.

If your terminal doesn’t support UTF-8, you should probably inspect the string at the byte level rather than relying on its printed representation.

As you have not specified your terminal’s encoding, these bytes could be displayed as almost anything.

Assuming Latin-X, which is often used in the Windows world, the 8x and 9x bytes are not used and can cause weirdness.
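
Byte-level inspection of the kind suggested here could look like the sketch below. It relies only on the standard `inspect/2` option `binaries: :as_binaries`; the variable name `cross` is illustrative.

```elixir
cross = "\u274c"

# Print the raw bytes instead of the rendered glyph.
IO.puts(inspect(cross, binaries: :as_binaries))   # <<226, 157, 140>>

# Three bytes, but a single character.
IO.inspect(byte_size(cross))       # 3
IO.inspect(String.length(cross))   # 1

# Pattern matching with the utf8 modifier recovers the code point.
<<code_point::utf8>> = cross
IO.inspect(code_point == 0x274C)   # true
```

This output is plain ASCII, so it reads the same regardless of the terminal's encoding.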

shotleybuilder · June 9, 2020, 8:50pm

You gave me fresh motivation to fix this. The fix for rendering emoji in the VS Code terminal was to enable a new Win10 feature called “Beta: Use Unicode UTF-8 for worldwide language support”. iex.bat still freezes in PowerShell (no iex prompt), and the setting makes no difference in the command prompt. The fix works in the Hyper terminal too (but that’s using PowerShell).
However, the issue with the behaviour of the code remains, though I can now see emojis in the terminal [baby steps] (see below).

shotleybuilder · June 9, 2020, 9:03pm

Okay, it appears the problem occurs ‘just’ with the 5 character emojis, i.e. those with a 5-digit hex code point (which just happened to be the kind I was using).

iex(16)> "\u274c" <> " foobar"
"❌ foobar"
iex(17)> "\u1f517" <> " foobar"
"ὑ7 foobar"
iex(18)> "\u2754" <> " foobar"
"❔ foobar"
iex(19)> "\u1F49A" <> " foobar"
"ὉA foobar"

I’m not sure where to go with this, but I’ve found the exmoji library and that might help. Back to this tomorrow.

Thanks for the help.

NobbZ · June 9, 2020, 9:13pm

iex(1)> "\u274c" <> " foobar"
"❌ foobar"
iex(2)> "\u1f517" <> " foobar"
"ὑ7 foobar"
iex(3)> "\u2754" <> " foobar"
"❔ foobar"
iex(4)> "\u1F49A" <> " foobar"
"ὉA foobar"

This is how it looks for me.

As \u takes exactly 4 hex digits, the fifth digit is left over as a literal character in the second and fourth examples.

Not sure what you mean by “5 character emojis”.
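
The 4-digit behaviour can be checked directly. Elixir string literals also accept a braced form, `\u{...}`, for code points that need more than 4 hex digits. A minimal sketch, with illustrative variable names:

```elixir
# "\u1f517" is parsed as U+1F51 followed by the literal character "7".
four_digit = "\u1f517"
IO.inspect(four_digit == "\u1f51" <> "7")   # true
IO.inspect(String.length(four_digit))       # 2

# The braced form takes the whole code point.
braced = "\u{1F517}"
IO.inspect(braced == <<0x1F517::utf8>>)     # true
IO.inspect(String.length(braced))           # 1
```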

shotleybuilder · June 9, 2020, 9:55pm

I think I need to use surrogate pairs. The link emoji is D83D + DD17. Not sure of the syntax in Elixir.
http://www.russellcottrell.com/greek/utilities/SurrogatePairCalculator.htm

kip · June 9, 2020, 11:05pm

iex> << 0x1F517 :: utf8 >>
"🔗"

iex> << 0xF0, 0x9F, 0x94, 0x97 >>
"🔗"

iex> << 0xF0, 0x9F, 0x94, 0x97>> == << 0x1f517 :: utf8 >>
true

d8 3d dd 17 is the UTF-16 representation. Elixir is a UTF-8 language, where the encoding is f0 9f 94 97.
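
The relationship between the surrogate pair and the code point can be worked through with the standard UTF-16 decoding formula. The sketch below uses illustrative variable names and the bitstring `utf16` and `utf8` modifiers.

```elixir
# The surrogate pair quoted earlier in the thread for the link emoji.
high = 0xD83D
low = 0xDD17

# Standard UTF-16 decoding: each surrogate carries 10 bits of the code point.
code_point = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
IO.inspect(code_point == 0x1F517)   # true

# The same code point in UTF-16 (big endian by default) and in UTF-8.
IO.inspect(<<code_point::utf16>> == <<0xD8, 0x3D, 0xDD, 0x17>>)   # true
IO.inspect(<<code_point::utf8>> == <<0xF0, 0x9F, 0x94, 0x97>>)    # true
```

So the surrogate pair never needs to be written out in Elixir source; the code point alone is enough.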

shotleybuilder · June 10, 2020, 9:07am

Kip, many thanks.

Just a quick question. How do you find << 0xF0, 0x9F, 0x94, 0x97 >>?

<< 0x1F49A :: utf8 >> works just fine though as does using the emoji picker in Windows to paste the emoji into the VS Code editor (which was the approach I was taking).

The usual mess of conflicting results led me down the wrong path, but I’ve learnt something along the way … my actual problem was a clumsy piece of regex using class names that didn’t include the emoji bytes(?)

kip · June 10, 2020, 9:38am

With a little bit of Erlang magic:

iex> :erlang.binary_to_list << 0x1f517 :: utf8 >>
[240, 159, 148, 151]

I have a unicode library that examines code points. For example:

iex> Unicode.category << 0x1F49A :: utf8 >>
[:So]
iex> Unicode.properties << 0x1F49A :: utf8 >>
[[:emoji, :grapheme_base]]

Unfortunately the Regex module doesn’t support the Unicode character class [:So:], or any other Unicode character classes. I have another lib, unicode_set, that has some support for Unicode character classes, but not for regexes yet; soon, though.
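
For the regex problem mentioned earlier in the thread, the sketch below shows how the `u` modifier changes matching against an emoji. The last line is an assumption: it uses the PCRE property escape `\p{So}`, which is a different syntax from the `[:So:]` set notation discussed above and depends on the PCRE build behind `Regex` having Unicode property support.

```elixir
link = <<0x1F517::utf8>>

# Without the u modifier the pattern is matched byte by byte,
# so "." sees four separate bytes rather than one character.
IO.inspect(Regex.match?(~r/^.$/, link))    # false
IO.inspect(Regex.match?(~r/^.$/u, link))   # true

# POSIX-style classes such as [[:alnum:]] do not cover emoji.
IO.inspect(Regex.match?(~r/[[:alnum:]]/u, link))   # false

# Assumption: Unicode property escapes are available in this PCRE build.
IO.inspect(Regex.match?(~r/\p{So}/u, link))
```

A pattern built only from classes like `[[:alnum:]]` will therefore skip emoji even when the string itself is valid UTF-8.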

LostKobrakai · June 10, 2020, 10:17am

iex(2)> IEx.configure inspect: [base: :hex]
:ok
iex(3)> << 0x1f517 :: utf8 >>
<<0xF0, 0x9F, 0x94, 0x97>>
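
The same hex view is available without reconfiguring IEx, for example in a script. This is a minimal sketch using `inspect/2` options and `Base.encode16/1`; the variable name `link` is illustrative.

```elixir
link = <<0x1F517::utf8>>

# Ask inspect for a hex byte dump of this one value only.
IO.puts(inspect(link, base: :hex, binaries: :as_binaries))   # <<0xF0, 0x9F, 0x94, 0x97>>

# Or get the bytes as a compact hex string.
IO.puts(Base.encode16(link))   # F09F9497
```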