iex(1)> tmp = "\ud83d"
** (SyntaxError) iex:1:8: invalid or reserved Unicode code point \u{d83d}. Syntax error after: \u
I’m not really sure how to proceed from here.
How can I clean up my string from invalid code points before decoding it? It’s ok for me to just drop the invalid ones, but I would like to keep the valid ones.
That is a good question. String.chunk/2 looks like a good place to start for weeding out invalid UTF-8 sequences. This section of the docs gives some more information: String — Elixir v1.16.0
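As a rough sketch of that idea (the module and function names here are made up for illustration, not from any library): String.chunk/2 with the :valid trait splits a binary into alternating valid and invalid runs, and String.valid?/1 can then be used to keep only the valid ones. This assumes the input is already a binary containing raw bytes; it does nothing for textual escapes such as a literal \ud83d inside JSON.

```elixir
defmodule Utf8Clean do
  # Hypothetical helper: drops every byte run that is not valid UTF-8
  # and keeps the valid runs in their original order.
  def drop_invalid(binary) when is_binary(binary) do
    binary
    |> String.chunk(:valid)           # split into valid / invalid chunks
    |> Enum.filter(&String.valid?/1)  # keep only the valid chunks
    |> Enum.join()
  end
end

# 0xFF can never appear in well-formed UTF-8, so it is dropped.
IO.inspect(Utf8Clean.drop_invalid(<<"ab", 0xFF, "cd">>))
```

Valid multi-byte characters, emoji included, pass through untouched because they form valid chunks.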
As for emoji code points, there might be a library that can help with that but unfortunately this is as far as my knowledge extends on the subject.
Emoji are part of the official Unicode spec, but most of them sit above \uFFFF, so in \u escapes they are written as two surrogate halves; that may be part of the issue.
edit:
If you don’t want to keep the emoji at all then a simple Enum.filter(raw, &String.valid?/1) should do the trick?
\ud83d is a UTF-16 high surrogate, one half of a “surrogate pair”. It’s normally followed by a low surrogate (in the range \uDC00 to \uDFFF), and the two together represent a character above \uFFFF in UTF-16 systems. A surrogate on its own (a “lone surrogate”) is not a valid code point, so a string containing one can’t be represented in UTF-8 at all and is invalid.
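To make the pairing concrete, here is a small sketch of the standard UTF-16 arithmetic that turns a high and a low surrogate into one code point. The module name is hypothetical, and the low surrogate 0xDE00 is only an example value chosen to pair with \ud83d; the original text never shows what followed it.

```elixir
defmodule SurrogateDemo do
  # High surrogates occupy 0xD800..0xDBFF, low surrogates 0xDC00..0xDFFF.
  def high?(unit), do: unit in 0xD800..0xDBFF
  def low?(unit), do: unit in 0xDC00..0xDFFF

  # Combine a valid pair into the code point it encodes.
  def combine(high, low) when high in 0xD800..0xDBFF and low in 0xDC00..0xDFFF do
    0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
  end
end

cp = SurrogateDemo.combine(0xD83D, 0xDE00)
# cp is 0x1F600, which is encodable as UTF-8
IO.puts(<<cp::utf8>>)
```

A lone \ud83d has no low half to combine with, which is why the escape is rejected in the iex session above.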
Old thread, but I ran into a somewhat related issue with my file names. Code points over 1418 aren’t displayed as their characters; I get a boxed question mark instead (sometimes followed by a space). Code points over 553295 (553_295) cause a codepoint error when I try to convert them to strings.
(Maybe this is due to something that needs to be installed/configured on my computer?)
I had to filter out charlists containing such code points (ones over 553295) when passing elements of the :file.list_dir output to File.dir?, because that function first converts charlists to strings. Charlists with lower code points were still usable, even though they didn’t have a distinct string representation.
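One way that filtering could look, as a sketch rather than the exact code I used: :unicode.characters_to_binary/1 returns a binary when a charlist can be encoded and an error tuple otherwise, so it can serve as the test. The module and function names are invented for the example, and the sample charlists stand in for real directory entries.

```elixir
defmodule CharlistFilter do
  # True when the whole charlist can be encoded as a UTF-8 binary.
  # :unicode.characters_to_binary/1 returns an {:error, _, _} or
  # {:incomplete, _, _} tuple instead of a binary when it cannot.
  def convertible?(charlist) when is_list(charlist) do
    is_binary(:unicode.characters_to_binary(charlist))
  end

  # Keep only the entries that are safe to convert to strings.
  def convertible_only(charlists) do
    Enum.filter(charlists, &convertible?/1)
  end
end

# Sample entries: a plain name, a lone surrogate, and a valid emoji.
names = [~c"ok.txt", [0xD83D], [0x1F600]]
IO.inspect(CharlistFilter.convertible_only(names))
```

The surviving entries can then be passed on to File.dir? without hitting the conversion error.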