Help converting regex (to remove emojis)

9mm · December 10, 2018, 4:53am

I’m trying to make a regex to remove emojis based on this thread:

This causes an error:

Regex.replace(~r/[\u{1F600}-\u{1F6FF}]/, "💰 Monies! 💲", "")
# (Regex.CompileError) PCRE does not support \L, \l, \N{name}, \U, or \u at position 2

I tried using \x{FFFF} syntax instead according to this: Regex Tutorial - Unicode Characters and Properties

But I get another error:

Regex.replace(~r/[\x{1F600}-\x{1F6FF}]/, "💰 Monies! 💲", "")
# (Regex.CompileError) character value in \x{} or \o{} is too large at position 9

Apparently I also need the u modifier to enable Unicode, but that doesn’t seem to work either:

Regex.replace(~r/[\x{1F600}-\x{1F6FF}]/u, "💰 Monies! 💲", "")
"💰 Monies! 💲"

kip · December 10, 2018, 5:16am

You might have better luck using the Unicode general category So (Symbol, other). For example:

iex> x = "💰 Monies! 💲"                                
"💰 Monies! 💲"
iex> String.replace x, ~r/\p{So}/u, ""
" Monies! "

BTW, the “correct” approach to this would be:

iex> String.replace x, ~r/\p{Emoji}/u, ""

But Erlang’s re module doesn’t support that character class.

9mm · December 10, 2018, 5:18am

Interesting… does that strip other languages or only emojis? I was trying to avoid stripping Chinese/Japanese/Korean, which it seems like a lot of emoji regexes do.

kip · December 10, 2018, 5:22am

No, it doesn’t strip other languages. The full list of code points in So is here: https://www.fileformat.info/info/unicode/category/So/list.htm

It’s more permissive than just emoji, but it does not include scripts.

iex> String.replace "御言宣", ~r/\p{So}/u, ""
"御言宣"

9mm · December 10, 2018, 5:28am

Hmm yes, and it’s also missing some emoji for some reason. God, this is confusing.

9mm · December 10, 2018, 5:29am

Can you try this string?

x = "abcdefghijklmnopqrstuvwxyz....0123456789 不極，物片類書車裡！十今果半接國先雄 ニッポン」「ニホン」両方使用される中 には文중국, 일본, 베트남 등 한자 문화권에 속하는 아시아 여러 국가에서는 한국어的差异外，通常认为还存在词汇上的差异。例如繁体中文里多用的“原 لمنطقة الشرق الأوسط هيلي: التحرك ضد إيران سيبدأ من مجلس الأمن📌☮️💟🔯☪️㊗️🈵🆚💯❕🔞🚷🔰⁉️⚠️💤🌐🌀▶️🔠🔣↔️↩️👁‍🗨🗨🗨🗨🗨 ◽️🔲🇵🇦🏳️🏳️‍🌈🌈🌈🌈🌈🇹🇲🇹🇷🤛🤜🏼👍🏽👌☝🏼🥝🥦🌶🌽🍎 🍲🍔🥞🍝🍔🍗🌮🍯🥠🥢🍴🥄🥂☕️😀😃😄🤣😂😅😆☺️😊😍😌 😘😗😙😚😜😝😛😋🤨🧐🤓😒😏🤩🤩😎😞😔😖😢😣☹️😩🙁🤯 😰😓😦😲🤒🤕👿👹👽✊🏼"

It leaves the other languages intact, but some emojis are left behind.

9mm · December 10, 2018, 5:33am

I did find this guy’s Ruby gem, but I just wasted like an hour trying to convert its regex, and when it finally worked it erased all the Chinese characters, so obviously I got that wrong.

mudasobwa · December 10, 2018, 5:42am

There are several issues here. Neither of the characters above is in the range you specified in the first place.

▶ to_charlist "💰" 
#⇒ [128176]
▶ to_charlist "💲"
#⇒ [128178]
▶ 0x1F600
#⇒ 128512

To make it work for the range with \u, just interpolate the string literals (I deliberately changed the starting value of the range to what it probably should be):

▶ Regex.replace(~r/[#{"\u{1F000}"}-#{"\u{1F6FF}"}]/u, "💰 Monies! 💲", "")
" Monies! "

\x will work out of the box with a proper range:

Regex.replace(~r/[\x{1F000}-\x{1F6FF}]/u, "💰 Monies! 💲", "")
" Monies! "

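The range problem is easy to check for yourself. A minimal sketch (the variable names are just for illustration) that prints each character's code point in hex and whether it falls inside the original 1F600-1F6FF range:

```elixir
# The range from the original attempt.
emoji_range = 0x1F600..0x1F6FF

# Walk the string one code point at a time and report hex value + membership.
result =
  for <<cp::utf8 <- "💰💲😀">> do
    {<<cp::utf8>>, "U+" <> Integer.to_string(cp, 16), cp in emoji_range}
  end

IO.inspect(result)
# [{"💰", "U+1F4B0", false}, {"💲", "U+1F4B2", false}, {"😀", "U+1F600", true}]
```
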
kip · December 10, 2018, 5:43am

Yes, I see the same. It’s a bit surprising since according to some utility code I wrote they look to be So but re certainly doesn’t agree:

iex> x
"🤨🧐🤓🤩"
iex> Cldr.Unicode.Category.category x 
[:So, :So, :So, :So]
iex> String.replace x, ~r/\p{So}/u, ""
"🤨🧐🤓🤩"

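To see exactly which code points re does not classify as So, something like the following may help. It is only a diagnostic sketch: SoCheck and unmatched/1 are made-up names, and the result for recent emoji depends on the Unicode tables shipped with your OTP version.

```elixir
defmodule SoCheck do
  # Returns the code points of `string` that \p{So} does NOT match.
  def unmatched(string) do
    string
    |> String.codepoints()
    |> Enum.reject(&Regex.match?(~r/\A\p{So}\z/u, &1))
  end
end

# Letters are never So; the copyright sign has been So since early Unicode.
IO.inspect(SoCheck.unmatched("a©"))

# Version-dependent: lists whichever of these emoji your `re` does not treat as So.
IO.inspect(SoCheck.unmatched("🤨🧐🤓🤩"))
```
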
kip · December 10, 2018, 6:09am

Seems that emoji belong to the Common script so this appears to get closer - but it also deletes digits and punctuation.

iex> x = "abcdefghijklmnopqrstuvwxyz....0123456789 不極，物片類書車裡！十今果半接國先雄 ニッポン」「ニホン」両方使用される中 には文중국, 일본, 베트남 등 한자 문화권에 속하는 아시아 여러 국가에서는 한국어的差异外，通常认为还存在词汇上的差异。例如繁体中文里多用的“原 لمنطقة الشرق الأوسط هيلي: التحرك ضد إيران سيبدأ من مجلس الأمن📌☮️💟🔯☪️㊗️🈵🆚💯❕🔞🚷🔰⁉️⚠️💤🌐🌀▶️🔠🔣↔️↩️👁‍🗨🗨🗨🗨🗨 ◽️🔲🇵🇦🏳️🏳️‍🌈🌈🌈🌈🌈🇹🇲🇹🇷🤛🤜🏼👍🏽👌☝🏼🥝🥦🌶🌽🍎 🍲🍔🥞🍝🍔🍗🌮🍯🥠🥢🍴🥄🥂☕️😀😃😄🤣😂😅😆☺️😊😍😌 😘😗😙😚😜😝😛😋🤨🧐🤓😒😏🤩🤩😎😞😔😖😢😣☹️😩🙁🤯 😰😓😦😲🤒🤕👿👹👽✊🏼"
"abcdefghijklmnopqrstuvwxyz....0123456789 不極，物片類書車裡！十今果半接國先雄 ニッポン」「ニホン」両方使用される中 には文중국, 일본, 베트남 등 한자 문화권에 속하는 아시아 여러 국가에서는 한국어的差异外，通常认为还存在词汇上的差异。例如繁体中文里多用的“原 لمنطقة الشرق الأوسط هيلي: التحرك ضد إيران سيبدأ من مجلس الأمن📌☮️💟🔯☪️㊗️🈵🆚💯❕🔞🚷🔰⁉️⚠️💤🌐🌀▶️🔠🔣↔️↩️👁‍🗨🗨🗨🗨🗨 ◽️🔲🇵🇦🏳️🏳️‍🌈🌈🌈🌈🌈🇹🇲🇹🇷🤛🤜🏼👍🏽👌☝🏼🥝🥦🌶🌽🍎 🍲🍔🥞🍝🍔🍗🌮🍯🥠🥢🍴🥄🥂☕️😀😃😄🤣😂😅😆☺️😊😍😌 😘😗😙😚😜😝😛😋🤨🧐🤓😒😏🤩🤩😎😞😔😖😢😣☹️😩🙁🤯 😰😓😦😲🤒🤕👿👹👽✊🏼"
iex> String.replace x, ~r/\p{Common}/u, ""
"abcdefghijklmnopqrstuvwxyz不極物片類書車裡十今果半接國先雄ニッポンニホン両方使用される中には文중국일본베트남등한자문화권에속하는아 시아여러국가에서는한국어的差异外通常认为还存在词汇上的差异例如繁体中文里多用的原لمنطقةالشرقالأوسطهيليالتحركضدإيرانسيبدأمنمجلسالأمن️️️️️️️️‍️️️‍️️️"

9mm · December 10, 2018, 6:11am

That is good but it removed tons of numbers and such, so I’m nervous about using that as I have no idea what all it’s going to remove

kip · December 10, 2018, 6:19am

It’s possible the range you want is a little broader than @mudasobwa suggested (depends on your requirements). As near as I can establish, the relevant Unicode blocks are:

1F000…1F02F; Mahjong Tiles
1F030…1F09F; Domino Tiles
1F0A0…1F0FF; Playing Cards
1F100…1F1FF; Enclosed Alphanumeric Supplement
1F200…1F2FF; Enclosed Ideographic Supplement
1F300…1F5FF; Miscellaneous Symbols and Pictographs
1F600…1F64F; Emoticons
1F650…1F67F; Ornamental Dingbats
1F680…1F6FF; Transport and Map Symbols
1F700…1F77F; Alchemical Symbols
1F780…1F7FF; Geometric Shapes Extended
1F800…1F8FF; Supplemental Arrows-C
1F900…1F9FF; Supplemental Symbols and Pictographs

So perhaps ~r/[\x{1F000}-\x{1F9FF}]/u would be an alternative.
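
As a rough sketch of how that could be wrapped up: EmojiStrip is a hypothetical module name, and the extra ranges are my own assumption rather than an official emoji definition. It adds U+2600-U+27BF (Miscellaneous Symbols and Dingbats), plus U+FE0F (variation selector) and U+200D (zero width joiner) so that no invisible leftovers remain after the visible symbol is removed. It still misses symbols such as ▶ (U+25B6) and ㊗ (U+3297), so treat it as a starting point.

```elixir
defmodule EmojiStrip do
  # Assumption: an approximate set of "emoji-like" code points, not a
  # complete or official list.
  def strip(string) do
    String.replace(
      string,
      ~r/[\x{1F000}-\x{1F9FF}\x{2600}-\x{27BF}\x{FE0F}\x{200D}]/u,
      ""
    )
  end
end

IO.inspect(EmojiStrip.strip("💰 Monies! 💲"))
# " Monies! "
IO.inspect(EmojiStrip.strip("御言宣"))
# "御言宣"
```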

9mm · December 10, 2018, 6:20am

How do I go about showing the actual encoded values in a string? The nightmare I’m facing right now is that some third-party service is returning an internal server error. I originally thought it was caused by emojis, but now I’m thinking maybe it’s some hidden garbage in the string… I want to be able to “see” the invisible characters somehow.

Edit: for example, all the invisible Unicode space characters, or whatever other weird Unicode things could be in there. In the past I’ve encountered strings with invisible characters that I had to delete and retype, and suddenly everything worked.

9mm · December 10, 2018, 6:20am

oh nice! thanks

kip · December 10, 2018, 6:23am

Try String.codepoints/1. Any byte that isn’t valid UTF-8 shows up as a raw binary such as <<255>> rather than as a printable character.

iex> String.codepoints x
["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p",
 "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", ".", ".", ".", ".", "0", "1",
 "2", "3", "4", "5", "6", "7", "8", "9", " ", "不", "極", "，", "物", "片",
 "類", "書", "車", "裡", ...]

If you want to see the underlying bytes, append <<0>> to the string. The null byte makes the binary unprintable, so iex falls back to showing the raw bytes:

iex> x <> <<0>>
<<97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
  113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 46, 46, 46, 46, 48, 49, 50,
  51, 52, 53, 54, 55, 56, 57, 32, 228, 184, 141, 230, 165, 181, 239, 188, 140,
  ...>>

If you want the code points as integers, use to_charlist/1:

iex> to_charlist x
[97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 46, 46, 46, 46, 48, 49, 50,
 51, 52, 53, 54, 55, 56, 57, 32, 19981, 26997, 65292, 29289, 29255, 39006,
 26360, 36554, 35041, ...]
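
Decimal code points are hard to look up, so here is a small sketch that prints them in the usual U+XXXX form instead. Reveal and codepoints_hex/1 are hypothetical names, and it assumes the input is valid UTF-8 (String.to_charlist/1 raises otherwise).

```elixir
defmodule Reveal do
  # Formats every code point as "U+XXXX" (zero-padded to at least 4 hex digits).
  def codepoints_hex(string) do
    string
    |> String.to_charlist()
    |> Enum.map(fn cp ->
      "U+" <> String.pad_leading(Integer.to_string(cp, 16), 4, "0")
    end)
  end
end

# "\u00A0" is a no-break space, which looks like an ordinary space on screen.
IO.inspect(Reveal.codepoints_hex("a\u00A0b"))
# ["U+0061", "U+00A0", "U+0062"]
```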

9mm · December 10, 2018, 6:26am

Wow, awesome. Is there a way I can see the bytes?

And also, how do I make IO.inspect show the full output?

kip · December 10, 2018, 6:29am

IEx.configure [inspect: [limit: :infinity]]

in iex will do the trick

Not sure what you mean about bytes; that’s what x <> <<0>> is showing you.
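
For what it’s worth, a couple of other ways to look at the bytes without appending anything, as a short sketch (the sample string is just an example):

```elixir
x = "a不"

# The bytes as a plain list of integers.
bytes = :binary.bin_to_list(x)
IO.inspect(bytes)
# [97, 228, 184, 141]

# Ask inspect to render the binary as bytes, with no truncation.
shown = inspect(x, binaries: :as_binaries, limit: :infinity)
IO.puts(shown)
# <<97, 228, 184, 141>>

# The same bytes as a hex string.
hex = Base.encode16(x)
IO.puts(hex)
# 61E4B88D
```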

9mm · December 10, 2018, 6:30am

Thanks on that tip!

And yes, I guess I’m just not used to the syntax. I don’t really work with this stuff often at all. I just know that in Ruby I ran into this madness before while trying to strip garbage characters; they showed up in the string as \uu1234 or something (I can’t quite remember), and they were visible when I showed the bytes. I’ll just try one of these methods and hope something stands out! Thanks again for the help.

kip · December 10, 2018, 6:34am

So far nothing in your strings is “garbage”, since it’s all valid UTF-8. If you call String.codepoints/1, any invalid encoding will show up clearly as a raw byte rather than a printable string; that’s often the most useful check for me.

But if your string is valid UTF-8 and you’re getting errors, then it’s likely that the service you’re interacting with doesn’t know how to handle UTF-8, and you’ll need to work out what encoding it expects.
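
A quick sketch of that validity check (the broken string is constructed by hand here, purely for illustration):

```elixir
valid = "💰 Monies! 💲"
# <<255>> can never appear in well-formed UTF-8.
broken = "abc" <> <<255>>

IO.inspect(String.valid?(valid))
# true
IO.inspect(String.valid?(broken))
# false

# The invalid byte comes back as its own raw binary element.
IO.inspect(String.codepoints(broken))
# ["a", "b", "c", <<255>>]
```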

9mm · December 10, 2018, 6:35am

I guess I was thinking of invalid or wacky stuff like U+00A0 NO-BREAK SPACE.
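
Characters like that are valid UTF-8, just invisible. One way to make them stand out is to replace each one with a visible marker. This is only a sketch: Invisible and mark/1 are hypothetical names, and the character list (no-break space, zero width space/non-joiner/joiner, word joiner, BOM) is an assumption about which ones matter, not a complete set.

```elixir
defmodule Invisible do
  # Replaces a handful of common invisible code points with "<U+XXXX>".
  def mark(string) do
    Regex.replace(
      ~r/[\x{00A0}\x{200B}-\x{200D}\x{2060}\x{FEFF}]/u,
      string,
      fn <<cp::utf8>> ->
        "<U+" <> String.pad_leading(Integer.to_string(cp, 16), 4, "0") <> ">"
      end
    )
  end
end

IO.puts(Invisible.mark("a\u00A0b\u200Bc"))
# a<U+00A0>b<U+200B>c
```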