Help converting regex

I'm trying to make a regex to remove emojis based on this thread:

This causes an error:

Regex.replace(~r/[\u{1F600}-\u{1F6FF}]/, "💰 Monies! 💲", "")
# (Regex.CompileError) PCRE does not support \L, \l, \N{name}, \U, or \u at position 2

I tried using \x{FFFF} syntax instead according to this: https://www.regular-expressions.info/unicode.html

But I get another error:

Regex.replace(~r/[\x{1F600}-\x{1F6FF}]/, "💰 Monies! 💲", "")
# (Regex.CompileError) character value in \x{} or \o{} is too large at position 9

Apparently I also need the u modifier to enable Unicode, but that doesn't seem to work either:

Regex.replace(~r/[\x{1F600}-\x{1F6FF}]/u, "💰 Monies! 💲", "")
"💰 Monies! 💲"

You might have better luck using the Unicode general category So (Symbol, other). For example:

iex> x = "💰 Monies! 💲"
"💰 Monies! 💲"
iex> String.replace x, ~r/\p{So}/u, ""
" Monies! "

BTW, the "correct" approach to this would be:

iex> String.replace x, ~r/\p{Emoji}/u, ""

But Erlang's re module doesn't support that character class.

Interesting... does that strip other languages or only emojis? I was trying to avoid stripping Chinese/Japanese/Korean, which it seems a lot of emoji regexes do.

No, it doesn't. The full list of code points in So is here: https://www.fileformat.info/info/unicode/category/So/list.htm

It's more permissive than just emoji, but it does not include scripts.

iex> String.replace "御言宣", ~r/\p{So}/u, ""
"御言宣"

Hmm yes, and it is also missing some emoji for some reason. God, this is confusing.

Can you try this string?

x = "abcdefghijklmnopqrstuvwxyz....0123456789 ไธๆฅต๏ผŒ็‰ฉ็‰‡้กžๆ›ธ่ปŠ่ฃก๏ผๅไปŠๆžœๅŠๆŽฅๅœ‹ๅ…ˆ้›„ ใƒ‹ใƒƒใƒใƒณใ€ใ€Œใƒ‹ใƒ›ใƒณใ€ไธกๆ–นไฝฟ็”จใ•ใ‚Œใ‚‹ไธญ ใซใฏๆ–‡์ค‘๊ตญ, ์ผ๋ณธ, ๋ฒ ํŠธ๋‚จ ๋“ฑ ํ•œ์ž ๋ฌธํ™”๊ถŒ์— ์†ํ•˜๋Š” ์•„์‹œ์•„ ์—ฌ๋Ÿฌ ๊ตญ๊ฐ€์—์„œ๋Š” ํ•œ๊ตญ์–ด็š„ๅทฎๅผ‚ๅค–๏ผŒ้€šๅธธ่ฎคไธบ่ฟ˜ๅญ˜ๅœจ่ฏๆฑ‡ไธŠ็š„ๅทฎๅผ‚ใ€‚ไพ‹ๅฆ‚็นไฝ“ไธญๆ–‡้‡Œๅคš็”จ็š„โ€œๅŽŸ ู„ู…ู†ุทู‚ุฉ ุงู„ุดุฑู‚ ุงู„ุฃูˆุณุท ู‡ูŠู„ูŠ: ุงู„ุชุญุฑูƒ ุถุฏ ุฅูŠุฑุงู† ุณูŠุจุฏุฃ ู…ู† ู…ุฌู„ุณ ุงู„ุฃู…ู†๐Ÿ“Œโ˜ฎ๏ธ๐Ÿ’Ÿ๐Ÿ”ฏโ˜ช๏ธใŠ—๏ธ๐Ÿˆต๐Ÿ†š๐Ÿ’ฏโ•๐Ÿ”ž๐Ÿšท๐Ÿ”ฐโ‰๏ธโš ๏ธ๐Ÿ’ค๐ŸŒ๐ŸŒ€โ–ถ๏ธ๐Ÿ” ๐Ÿ”ฃโ†”๏ธโ†ฉ๏ธ๐Ÿ‘โ€๐Ÿ—จ๐Ÿ—จ๐Ÿ—จ๐Ÿ—จ๐Ÿ—จ โ—ฝ๏ธ๐Ÿ”ฒ๐Ÿ‡ต๐Ÿ‡ฆ๐Ÿณ๏ธ๐Ÿณ๏ธโ€๐ŸŒˆ๐ŸŒˆ๐ŸŒˆ๐ŸŒˆ๐ŸŒˆ๐Ÿ‡น๐Ÿ‡ฒ๐Ÿ‡น๐Ÿ‡ท๐Ÿค›๐Ÿคœ๐Ÿผ๐Ÿ‘๐Ÿฝ๐Ÿ‘Œโ˜๐Ÿผ๐Ÿฅ๐Ÿฅฆ๐ŸŒถ๐ŸŒฝ๐ŸŽ ๐Ÿฒ๐Ÿ”๐Ÿฅž๐Ÿ๐Ÿ”๐Ÿ—๐ŸŒฎ๐Ÿฏ๐Ÿฅ ๐Ÿฅข๐Ÿด๐Ÿฅ„๐Ÿฅ‚โ˜•๏ธ๐Ÿ˜€๐Ÿ˜ƒ๐Ÿ˜„๐Ÿคฃ๐Ÿ˜‚๐Ÿ˜…๐Ÿ˜†โ˜บ๏ธ๐Ÿ˜Š๐Ÿ˜๐Ÿ˜Œ ๐Ÿ˜˜๐Ÿ˜—๐Ÿ˜™๐Ÿ˜š๐Ÿ˜œ๐Ÿ˜๐Ÿ˜›๐Ÿ˜‹๐Ÿคจ๐Ÿง๐Ÿค“๐Ÿ˜’๐Ÿ˜๐Ÿคฉ๐Ÿคฉ๐Ÿ˜Ž๐Ÿ˜ž๐Ÿ˜”๐Ÿ˜–๐Ÿ˜ข๐Ÿ˜ฃโ˜น๏ธ๐Ÿ˜ฉ๐Ÿ™๐Ÿคฏ ๐Ÿ˜ฐ๐Ÿ˜“๐Ÿ˜ฆ๐Ÿ˜ฒ๐Ÿค’๐Ÿค•๐Ÿ‘ฟ๐Ÿ‘น๐Ÿ‘ฝโœŠ๐Ÿผ"

It leaves the other languages intact, but some emojis remain.

I did find this guy's Ruby gem, but I just wasted about an hour trying to convert its regex, and when it finally worked it erased all the Chinese characters, so obviously I got that wrong.

There are several issues here. Neither of the characters above is in the range you originally specified.

▶ to_charlist "💰"
#⇒ [128176]
▶ to_charlist "💲"
#⇒ [128178]
▶ 0x1F600
#⇒ 128512
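
To double-check which block a character falls into, it can help to print its code point in hex and test it against the range. A minimal sketch; `CodePointCheck` and `describe/1` are hypothetical names made up for this illustration:

```elixir
defmodule CodePointCheck do
  # Hypothetical helper, not part of any library.
  # Takes a string holding exactly one code point and returns its hex
  # notation plus whether it lies in the originally attempted range.
  def describe(<<cp::utf8>>) do
    {"U+" <> Integer.to_string(cp, 16), cp in 0x1F600..0x1F6FF}
  end
end

CodePointCheck.describe("\u{1F4B0}") |> IO.inspect()
# prints {"U+1F4B0", false}   (the money bag)
CodePointCheck.describe("\u{1F600}") |> IO.inspect()
# prints {"U+1F600", true}    (the grinning face)
```

The money bag sits in Miscellaneous Symbols and Pictographs, below U+1F600, which is why the original range never matched it.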

To make it work for the range with \u, just interpolate the literals so that Elixir, not PCRE, expands the escape (I deliberately changed the starting value of the range to what it probably should be):

▶ Regex.replace(~r/[#{"\u{1F000}"}-#{"\u{1F6FF}"}]/u, "💰 Monies! 💲", "")
" Monies! "

\x will work out of the box with a proper range:

Regex.replace(~r/[\x{1F000}-\x{1F6FF}]/u, "💰 Monies! 💲", "")
" Monies! "

Yes, I see the same. It's a bit surprising, since according to some utility code I wrote they look to be So, but re certainly doesn't agree:

iex> x
"🤨🧐🤓🤩"
iex> Cldr.Unicode.Category.category x
[:So, :So, :So, :So]
iex> String.replace x, ~r/\p{So}/u, ""
"🤨🧐🤓🤩"

Seems that emoji belong to the Common script so this appears to get closer - but it also deletes digits and punctuation.

iex> x = "abcdefghijklmnopqrstuvwxyz....0123456789 ไธๆฅต๏ผŒ็‰ฉ็‰‡้กžๆ›ธ่ปŠ่ฃก๏ผๅไปŠๆžœๅŠๆŽฅๅœ‹ๅ…ˆ้›„ ใƒ‹ใƒƒใƒใƒณใ€ใ€Œใƒ‹ใƒ›ใƒณใ€ไธกๆ–นไฝฟ็”จใ•ใ‚Œใ‚‹ไธญ ใซใฏๆ–‡์ค‘๊ตญ, ์ผ๋ณธ, ๋ฒ ํŠธ๋‚จ ๋“ฑ ํ•œ์ž ๋ฌธํ™”๊ถŒ์— ์†ํ•˜๋Š” ์•„์‹œ์•„ ์—ฌ๋Ÿฌ ๊ตญ๊ฐ€์—์„œ๋Š” ํ•œ๊ตญ์–ด็š„ๅทฎๅผ‚ๅค–๏ผŒ้€šๅธธ่ฎคไธบ่ฟ˜ๅญ˜ๅœจ่ฏๆฑ‡ไธŠ็š„ๅทฎๅผ‚ใ€‚ไพ‹ๅฆ‚็นไฝ“ไธญๆ–‡้‡Œๅคš็”จ็š„โ€œๅŽŸ ู„ู…ู†ุทู‚ุฉ ุงู„ุดุฑู‚ ุงู„ุฃูˆุณุท ู‡ูŠู„ูŠ: ุงู„ุชุญุฑูƒ ุถุฏ ุฅูŠุฑุงู† ุณูŠุจุฏุฃ ู…ู† ู…ุฌู„ุณ ุงู„ุฃู…ู†๐Ÿ“Œโ˜ฎ๏ธ๐Ÿ’Ÿ๐Ÿ”ฏโ˜ช๏ธใŠ—๏ธ๐Ÿˆต๐Ÿ†š๐Ÿ’ฏโ•๐Ÿ”ž๐Ÿšท๐Ÿ”ฐโ‰๏ธโš ๏ธ๐Ÿ’ค๐ŸŒ๐ŸŒ€โ–ถ๏ธ๐Ÿ” ๐Ÿ”ฃโ†”๏ธโ†ฉ๏ธ๐Ÿ‘โ€๐Ÿ—จ๐Ÿ—จ๐Ÿ—จ๐Ÿ—จ๐Ÿ—จ โ—ฝ๏ธ๐Ÿ”ฒ๐Ÿ‡ต๐Ÿ‡ฆ๐Ÿณ๏ธ๐Ÿณ๏ธโ€๐ŸŒˆ๐ŸŒˆ๐ŸŒˆ๐ŸŒˆ๐ŸŒˆ๐Ÿ‡น๐Ÿ‡ฒ๐Ÿ‡น๐Ÿ‡ท๐Ÿค›๐Ÿคœ๐Ÿผ๐Ÿ‘๐Ÿฝ๐Ÿ‘Œโ˜๐Ÿผ๐Ÿฅ๐Ÿฅฆ๐ŸŒถ๐ŸŒฝ๐ŸŽ ๐Ÿฒ๐Ÿ”๐Ÿฅž๐Ÿ๐Ÿ”๐Ÿ—๐ŸŒฎ๐Ÿฏ๐Ÿฅ ๐Ÿฅข๐Ÿด๐Ÿฅ„๐Ÿฅ‚โ˜•๏ธ๐Ÿ˜€๐Ÿ˜ƒ๐Ÿ˜„๐Ÿคฃ๐Ÿ˜‚๐Ÿ˜…๐Ÿ˜†โ˜บ๏ธ๐Ÿ˜Š๐Ÿ˜๐Ÿ˜Œ ๐Ÿ˜˜๐Ÿ˜—๐Ÿ˜™๐Ÿ˜š๐Ÿ˜œ๐Ÿ˜๐Ÿ˜›๐Ÿ˜‹๐Ÿคจ๐Ÿง๐Ÿค“๐Ÿ˜’๐Ÿ˜๐Ÿคฉ๐Ÿคฉ๐Ÿ˜Ž๐Ÿ˜ž๐Ÿ˜”๐Ÿ˜–๐Ÿ˜ข๐Ÿ˜ฃโ˜น๏ธ๐Ÿ˜ฉ๐Ÿ™๐Ÿคฏ ๐Ÿ˜ฐ๐Ÿ˜“๐Ÿ˜ฆ๐Ÿ˜ฒ๐Ÿค’๐Ÿค•๐Ÿ‘ฟ๐Ÿ‘น๐Ÿ‘ฝโœŠ๐Ÿผ"
"abcdefghijklmnopqrstuvwxyz....0123456789 ไธๆฅต๏ผŒ็‰ฉ็‰‡้กžๆ›ธ่ปŠ่ฃก๏ผๅไปŠๆžœๅŠๆŽฅๅœ‹ๅ…ˆ้›„ ใƒ‹ใƒƒใƒใƒณใ€ใ€Œใƒ‹ใƒ›ใƒณใ€ไธกๆ–นไฝฟ็”จใ•ใ‚Œใ‚‹ไธญ ใซใฏๆ–‡์ค‘๊ตญ, ์ผ๋ณธ, ๋ฒ ํŠธ๋‚จ ๋“ฑ ํ•œ์ž ๋ฌธํ™”๊ถŒ์— ์†ํ•˜๋Š” ์•„์‹œ์•„ ์—ฌ๋Ÿฌ ๊ตญ๊ฐ€์—์„œ๋Š” ํ•œ๊ตญ์–ด็š„ๅทฎๅผ‚ๅค–๏ผŒ้€šๅธธ่ฎคไธบ่ฟ˜ๅญ˜ๅœจ่ฏๆฑ‡ไธŠ็š„ๅทฎๅผ‚ใ€‚ไพ‹ๅฆ‚็นไฝ“ไธญๆ–‡้‡Œๅคš็”จ็š„โ€œๅŽŸ ู„ู…ู†ุทู‚ุฉ ุงู„ุดุฑู‚ ุงู„ุฃูˆุณุท ู‡ูŠู„ูŠ: ุงู„ุชุญุฑูƒ ุถุฏ ุฅูŠุฑุงู† ุณูŠุจุฏุฃ ู…ู† ู…ุฌู„ุณ ุงู„ุฃู…ู†๐Ÿ“Œโ˜ฎ๏ธ๐Ÿ’Ÿ๐Ÿ”ฏโ˜ช๏ธใŠ—๏ธ๐Ÿˆต๐Ÿ†š๐Ÿ’ฏโ•๐Ÿ”ž๐Ÿšท๐Ÿ”ฐโ‰๏ธโš ๏ธ๐Ÿ’ค๐ŸŒ๐ŸŒ€โ–ถ๏ธ๐Ÿ” ๐Ÿ”ฃโ†”๏ธโ†ฉ๏ธ๐Ÿ‘โ€๐Ÿ—จ๐Ÿ—จ๐Ÿ—จ๐Ÿ—จ๐Ÿ—จ โ—ฝ๏ธ๐Ÿ”ฒ๐Ÿ‡ต๐Ÿ‡ฆ๐Ÿณ๏ธ๐Ÿณ๏ธโ€๐ŸŒˆ๐ŸŒˆ๐ŸŒˆ๐ŸŒˆ๐ŸŒˆ๐Ÿ‡น๐Ÿ‡ฒ๐Ÿ‡น๐Ÿ‡ท๐Ÿค›๐Ÿคœ๐Ÿผ๐Ÿ‘๐Ÿฝ๐Ÿ‘Œโ˜๐Ÿผ๐Ÿฅ๐Ÿฅฆ๐ŸŒถ๐ŸŒฝ๐ŸŽ ๐Ÿฒ๐Ÿ”๐Ÿฅž๐Ÿ๐Ÿ”๐Ÿ—๐ŸŒฎ๐Ÿฏ๐Ÿฅ ๐Ÿฅข๐Ÿด๐Ÿฅ„๐Ÿฅ‚โ˜•๏ธ๐Ÿ˜€๐Ÿ˜ƒ๐Ÿ˜„๐Ÿคฃ๐Ÿ˜‚๐Ÿ˜…๐Ÿ˜†โ˜บ๏ธ๐Ÿ˜Š๐Ÿ˜๐Ÿ˜Œ ๐Ÿ˜˜๐Ÿ˜—๐Ÿ˜™๐Ÿ˜š๐Ÿ˜œ๐Ÿ˜๐Ÿ˜›๐Ÿ˜‹๐Ÿคจ๐Ÿง๐Ÿค“๐Ÿ˜’๐Ÿ˜๐Ÿคฉ๐Ÿคฉ๐Ÿ˜Ž๐Ÿ˜ž๐Ÿ˜”๐Ÿ˜–๐Ÿ˜ข๐Ÿ˜ฃโ˜น๏ธ๐Ÿ˜ฉ๐Ÿ™๐Ÿคฏ ๐Ÿ˜ฐ๐Ÿ˜“๐Ÿ˜ฆ๐Ÿ˜ฒ๐Ÿค’๐Ÿค•๐Ÿ‘ฟ๐Ÿ‘น๐Ÿ‘ฝโœŠ๐Ÿผ"
iex> String.replace x, ~r/\p{Common}/u, ""
"abcdefghijklmnopqrstuvwxyzไธๆฅต็‰ฉ็‰‡้กžๆ›ธ่ปŠ่ฃกๅไปŠๆžœๅŠๆŽฅๅœ‹ๅ…ˆ้›„ใƒ‹ใƒƒใƒใƒณใƒ‹ใƒ›ใƒณไธกๆ–นไฝฟ็”จใ•ใ‚Œใ‚‹ไธญใซใฏๆ–‡์ค‘๊ตญ์ผ๋ณธ๋ฒ ํŠธ๋‚จ๋“ฑํ•œ์ž๋ฌธํ™”๊ถŒ์—์†ํ•˜๋Š”์•„ ์‹œ์•„์—ฌ๋Ÿฌ๊ตญ๊ฐ€์—์„œ๋Š”ํ•œ๊ตญ์–ด็š„ๅทฎๅผ‚ๅค–้€šๅธธ่ฎคไธบ่ฟ˜ๅญ˜ๅœจ่ฏๆฑ‡ไธŠ็š„ๅทฎๅผ‚ไพ‹ๅฆ‚็นไฝ“ไธญๆ–‡้‡Œๅคš็”จ็š„ๅŽŸู„ู…ู†ุทู‚ุฉุงู„ุดุฑู‚ุงู„ุฃูˆุณุทู‡ูŠู„ูŠุงู„ุชุญุฑูƒุถุฏุฅูŠุฑุงู†ุณูŠุจุฏุฃู…ู†ู…ุฌู„ุณุงู„ุฃู…ู†๏ธ๏ธ๏ธ๏ธ๏ธ๏ธ๏ธ๏ธโ€๏ธ๏ธ๏ธโ€๏ธ๏ธ๏ธ" 

That is good, but it removed tons of numbers and such, so I'm nervous about using it as I have no idea what else it's going to remove :confused:
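
A small experiment makes the over-matching easier to see. This is only a sketch: the lookahead variant assumes that numbers, punctuation and separators are the things worth keeping, and other Common-script characters (for example "+" or "$", and control characters such as newlines) would still be removed by it.

```elixir
input = "abc 123, ok! \u{1F4B0}"

# \p{Common} on its own also strips spaces, digits and punctuation,
# because they belong to the Common script just like most emoji.
String.replace(input, ~r/\p{Common}/u, "")
# => "abcok"

# Assumption: numbers (N), punctuation (P) and separators (Z) should stay.
# The negative lookahead skips those before \p{Common} gets to match.
String.replace(input, ~r/(?![\p{N}\p{P}\p{Z}])\p{Common}/u, "")
# => "abc 123, ok! "
```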

It's possible the range you want is a little broader than @mudasobwa suggested (it depends on your requirements). As near as I can establish, the relevant Unicode blocks are:

1F000..1F02F; Mahjong Tiles
1F030..1F09F; Domino Tiles
1F0A0..1F0FF; Playing Cards
1F100..1F1FF; Enclosed Alphanumeric Supplement
1F200..1F2FF; Enclosed Ideographic Supplement
1F300..1F5FF; Miscellaneous Symbols and Pictographs
1F600..1F64F; Emoticons
1F650..1F67F; Ornamental Dingbats
1F680..1F6FF; Transport and Map Symbols
1F700..1F77F; Alchemical Symbols
1F780..1F7FF; Geometric Shapes Extended
1F800..1F8FF; Supplemental Arrows-C
1F900..1F9FF; Supplemental Symbols and Pictographs

So perhaps ~r/[\x{1F000}-\x{1F9FF}]/u would be an alternative.
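
Wrapped in a function, that broader range could look like the sketch below. `EmojiStrip` is a hypothetical module name; the range is the one suggested above and nothing more, so symbols outside it (for example U+262E, the peace symbol), variation selectors and zero-width joiners are left in place.

```elixir
defmodule EmojiStrip do
  # Hypothetical module name, used only for this example.
  # Removes every code point in U+1F000..U+1F9FF (the blocks listed above).
  def strip(string) do
    String.replace(string, ~r/[\x{1F000}-\x{1F9FF}]/u, "")
  end
end

EmojiStrip.strip("\u{1F4B0} Monies! \u{1F4B2}")
# => " Monies! "
EmojiStrip.strip("御言宣")
# => "御言宣"
```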

How do I go about showing the actual encoded values in a string? The nightmare I'm facing right now is that some 3rd party service is returning an internal server error. I originally thought it was caused by emojis, but now I'm thinking maybe it's some hidden garbage in the string... I want to be able to "see" the invisible characters somehow.

Edit: for example, all the invisible Unicode space characters, or whatever other weird Unicode things could be in there. I know that in the past I've encountered strings containing invisible things that I had to delete and re-type, and suddenly it worked.

oh nice! thanks

Try String.codepoints/1. Where a byte sequence can't be decoded as UTF-8, you'll see the raw byte (for example <<255>>) instead of a printable string.

iex> String.codepoints x
["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p",
 "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", ".", ".", ".", ".", "0", "1",
 "2", "3", "4", "5", "6", "7", "8", "9", " ", "不", "極", "，", "物", "片",
 "類", "書", "車", "裡", ...]

If you want to see the underlying UTF-8 bytes, append <<0>> to the string so that IEx no longer prints the binary as text:

iex> x <> <<0>>
<<97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
  113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 46, 46, 46, 46, 48, 49, 50,
  51, 52, 53, 54, 55, 56, 57, 32, 228, 184, 141, 230, 165, 181, 239, 188, 140,
  ...>>

If you want the integer code points instead, use to_charlist/1:

iex> to_charlist x
[97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 46, 46, 46, 46, 48, 49, 50,
 51, 52, 53, 54, 55, 56, 57, 32, 19981, 26997, 65292, 29289, 29255, 39006,
 26360, 36554, 35041, ...]

Wow, awesome. Is there a way I can see the bytes?

And also, how do I make IO.inspect show the full output?

IEx.configure [inspect: [limit: :infinity]]

in iex will do the trick

Not sure what you mean about bytes -> that's what x <> <<0>> is showing you.
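
If the <<0>> trick feels too implicit, there are a few other ways to look at the bytes directly. A short sketch using a single accented letter as the sample string (the sample value is just an illustration):

```elixir
s = "\u00E9"   # "é": one code point, two UTF-8 bytes

:binary.bin_to_list(s)               # => [195, 169]
inspect(s, binaries: :as_binaries)   # => "<<195, 169>>"
Base.encode16(s)                     # => "C3A9"
byte_size(s)                         # => 2
String.length(s)                     # => 1
```

Comparing byte_size/1 with String.length/1 is a quick hint that a string contains multi-byte characters.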

Thanks on that tip!

And yes, I guess I'm just not used to the syntax. I don't work with this stuff often at all. I just know that in Ruby I ran into this madness before while trying to strip garbage characters, and they showed up in the string as \uu1234 or something (I can't quite remember), and it was visible when I showed the bytes. I'll just try one of these methods and hope something stands out! Thanks again for the help :smiley:

So far nothing in your strings is "garbage", since it's all valid UTF-8. If you use String.codepoints/1, any invalid encoding will show up clearly as a raw byte such as <<255>> rather than a printable string - that's often the most useful for me.

But if your string is valid UTF-8 and you're still getting errors, then it's likely that the service you're interacting with doesn't know how to handle UTF-8, and you'll need to work out what encoding it expects.
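
A quick way to apply this is to check validity first and, if the check fails, list the offending bytes. A minimal sketch; `Utf8Check` is a hypothetical helper name, not a library module:

```elixir
defmodule Utf8Check do
  # Hypothetical helper for illustration.
  # String.codepoints/1 returns each undecodable byte as its own
  # one-byte binary, so rejecting the valid pieces leaves the bad bytes.
  def invalid_bytes(binary) do
    binary
    |> String.codepoints()
    |> Enum.reject(&String.valid?/1)
  end
end

String.valid?("ok" <> <<255>>)
# => false
Utf8Check.invalid_bytes("ok" <> <<255>> <> "\u00E9")
# => [<<255>>]
```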

I guess I was thinking of invalid stuff or wack stuff like U+00A0 NO-BREAK SPACE.
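
Characters like that are valid UTF-8, so they won't show up as raw bytes, but they can be located by general category. The sketch below is one possible approach, with `Invisible.find/1` being a hypothetical helper: it reports control (Cc), format (Cf) and separator (Z) code points other than an ordinary space, together with their position counted in code points.

```elixir
defmodule Invisible do
  # Hypothetical helper, not from any library.
  # Returns {index, "U+XXXX"} for every control, format or separator
  # code point except the plain ASCII space.
  # Note: String.to_charlist/1 raises on invalid UTF-8, so run
  # String.valid?/1 first if the input is untrusted.
  def find(string) do
    string
    |> String.to_charlist()
    |> Enum.with_index()
    |> Enum.filter(fn {cp, _index} ->
      cp != ?\s and String.match?(<<cp::utf8>>, ~r/\A[\p{Cc}\p{Cf}\p{Z}]\z/u)
    end)
    |> Enum.map(fn {cp, index} ->
      {index, "U+" <> String.pad_leading(Integer.to_string(cp, 16), 4, "0")}
    end)
  end
end

Invisible.find("a\u00A0b\u200Bc")
# => [{1, "U+00A0"}, {3, "U+200B"}]
```

Newlines and tabs are control characters too, so they will be reported as well; filter them out if they are expected in the input.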