(Syntax Error) invalid or reserved Unicode code point

Hey everybody,

I get a response from a (pretty funky) API via HTTPoison and am trying to decode the JSON body with Poison.
It fails with something like this:

{:error,
 %Poison.ParseError{
   data: " { \"config\": ...",
   skip: 86293,
   value: "\\ud83d"
 }}

Toying around with it in IEx also doesn’t work:

iex(1)> tmp = "\ud83d"
** (SyntaxError) iex:1:8: invalid or reserved Unicode code point \u{d83d}. Syntax error after: \u

I’m not really sure how to proceed from here.
How can I strip invalid code points from the string before decoding it? It’s OK for me to just drop the invalid ones, but I would like to keep the valid ones.

You could sanitise the raw response before trying to decode it, for example with Enum.reduce, dropping each value that results in an invalid code point.
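A minimal sketch of that idea, using regexes rather than Enum.reduce, and assuming the failures come from lone \uD800-\uDFFF escape sequences in the still-encoded JSON text (the Sanitize module and both regexes are hypothetical, not a library API):

defmodule Sanitize do
  # High surrogate escape (\uD800-\uDBFF) not followed by a low surrogate.
  @lone_high ~r/\\u[Dd][89ABab][0-9A-Fa-f]{2}(?!\\u[Dd][C-Fc-f][0-9A-Fa-f]{2})/
  # Low surrogate escape (\uDC00-\uDFFF) not preceded by a high surrogate.
  @lone_low ~r/(?<!\\u[Dd][89ABab][0-9A-Fa-f]{2})\\u[Dd][C-Fc-f][0-9A-Fa-f]{2}/

  # Drop lone surrogate escapes; keep well-formed high+low pairs.
  def strip_lone_surrogates(raw) do
    raw
    |> String.replace(@lone_high, "")
    |> String.replace(@lone_low, "")
  end
end

Sanitize.strip_lone_surrogates(~S({"name": "\ud83d FooBar"}))
# => "{\"name\": \" FooBar\"}" (the lone \ud83d is dropped, everything else kept)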

I don’t know which ones are invalid, so I don’t really know which ones to drop.
Is there an allow-list of valid ones built in somewhere?

I realize that a single code point might not be meaningful in itself. Here is an actual value from the API response that also fails in IEx:

iex(1)> "\ud83d\udccd FooBar"
** (SyntaxError) iex:1:2: invalid or reserved Unicode code point \u{d83d}. Syntax error after: \u

The response is a JavaScript block. The above should be this emoji:
https://charbase.com/1f4cd-unicode-round-pushpin

I guess JavaScript uses a different code point set than Elixir? How do I tell Elixir or Poison to use JavaScript’s?

That is a good question. It looks like String.chunk/2 is a good place to start for weeding out invalid UTF-8 sequences. This section of the docs gives some more information: String — Elixir v1.12.2
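For illustration, a minimal sketch of that approach, assuming the binary itself contains invalid UTF-8 bytes (which may not be the case here, since the \uXXXX sequences in the error are plain ASCII escape text):

dirty = <<"abc", 0xFF, "def">>

dirty
|> String.chunk(:valid)          # split into runs of valid and invalid bytes
|> Enum.filter(&String.valid?/1) # keep only the valid runs
|> Enum.join()
# => "abcdef"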

As for emoji code points, there might be a library that can help with that but unfortunately this is as far as my knowledge extends on the subject.

I don’t think emoji fall under the official Unicode spec; that may be part of the issue.

edit:

If you don’t want to keep the emoji at all, then a simple Enum.filter(String.chunk(raw, :valid), &String.valid?/1) should do the trick?

\ud83d is a UTF-16 “high surrogate”; it’s normally followed by a low surrogate (\uDC00-\uDFFF) to represent a code point above U+FFFF in UTF-16 systems. If one appears in the text by itself (a “lone surrogate”), that string can’t be represented in UTF-8 at all and is invalid.
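For illustration, here is the standard surrogate-pair arithmetic applied to the pair from the failing example (a worked sketch, not from any library):

high = 0xD83D
low = 0xDCCD

# 0x10000 plus ten bits from each surrogate recovers the real code point.
codepoint = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
# => 0x1F4CD, the round pushpin

# In Elixir source you would write that code point directly:
<<codepoint::utf8>> == "\u{1F4CD}"
# => true; both are "📍"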

Jason can parse these (when they are paired):

iex(livebook_ky4n2p4p@Matts-MacBook-Pro-2)22> {:ok, decoded} = Jason.decode("{\"foo\":\"\\uD83D\\uDE04\"}")
{:ok, %{"foo" => "😄"}}

But oddly, it doesn’t produce them when encoding; the emoji goes out as raw UTF-8:

iex(livebook_ky4n2p4p@Matts-MacBook-Pro-2)23> {:ok, encoded} = Jason.encode(decoded)
{:ok, "{\"foo\":\"😄\"}"}
iex(livebook_ky4n2p4p@Matts-MacBook-Pro-2)24> String.to_charlist(encoded)
[123, 34, 102, 111, 111, 34, 58, 34, 128516, 34, 125]
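If you do need the escaped form on output, Jason’s :escape option should produce it (hedged: this is from memory of Jason’s docs, so double-check the option name):

Jason.encode!(%{"foo" => "😄"}, escape: :unicode_safe)
# with escape: :unicode_safe, the emoji should come back as a
# \uD83D\uDE04-style escaped surrogate pair instead of raw UTF-8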

Thanks for explaining and for pointing me to Jason! It works perfectly!


Emoji are definitely part of the Unicode specification.


Yes, that was an ignorant statement on my part, from a lack of research. Thanks for the correction.