(Syntax Error) invalid or reserved Unicode code point

Doerge · August 11, 2021, 5:31pm

Hey everybody,

I get a response from a (pretty funky) API, and am trying to decode it with HTTPoison.
It fails with something like this:

{:error,
 %Poison.ParseError{
   data: " { \"config\": ...",
   skip: 86293,
   value: "\\ud83d"
 }}

Toying around with it in IEx also doesn’t work:

iex(1)> tmp = "\ud83d"
** (SyntaxError) iex:1:8: invalid or reserved Unicode code point \u{d83d}. Syntax error after: \u

I’m not really sure how to proceed from here.
How can I clean up my string from invalid code points before decoding it? It’s ok for me to just drop the invalid ones, but I would like to keep the valid ones.

03juan · August 11, 2021, 5:38pm

You could sanitise the raw response with Enum.reduce before trying to decode it, dropping each value that results in an incorrect codepoint.

Doerge · August 11, 2021, 5:56pm

I don’t know which ones are invalid, so I don’t really know which ones to drop?
Is there an allow-list of valid ones builtin somewhere?

Doerge · August 11, 2021, 6:01pm

I realize that a single code point might not be meaningful in itself. Here is an actual value from the API response that also fails in IEx:

iex(1)> "\ud83d\udccd FooBar"
** (SyntaxError) iex:1:2: invalid or reserved Unicode code point \u{d83d}. Syntax error after: \u

The response is a Javascript block. The above should be this emoji:
https://charbase.com/1f4cd-unicode-round-pushpin

I guess JavaScript uses a different set of code points than Elixir? How do I tell Elixir or Poison to use JavaScript’s?

03juan · August 11, 2021, 6:07pm

That is a good question. Looks like String.chunk/2 is a good place to start to weed out invalid UTF-8 characters. The String module docs for Elixir v1.16.0 give some more information.

As for emoji code points, there might be a library that can help with that but unfortunately this is as far as my knowledge extends on the subject.

I don’t think emoji fall into the official unicode spec, that may be part of the issue.

edit:

If you don’t want to keep the emoji at all, then something simple like raw |> String.codepoints() |> Enum.filter(&String.valid?/1) |> Enum.join() should do the trick?
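A minimal sketch of the String.chunk/2 idea, assuming the problem is invalid bytes inside the binary. The sample value of raw is made up for illustration. Note that escape text such as \ud83d inside a JSON document is plain ASCII, so this kind of filtering would not touch it.

```elixir
# Hypothetical input: one invalid byte (0xFF) between two valid parts.
raw = "abc" <> <<0xFF>> <> "déf"

cleaned =
  raw
  # Split into alternating runs of valid and invalid UTF-8.
  |> String.chunk(:valid)
  # Keep only the runs that are valid UTF-8.
  |> Enum.filter(&String.valid?/1)
  |> Enum.join()

IO.inspect(cleaned)
```

Valid multi-byte characters such as é survive; only the invalid run is dropped.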

al2o3cr · August 11, 2021, 6:10pm

\ud83d is a Unicode surrogate, one half of a “surrogate pair”: this high surrogate is normally followed by a low surrogate in the \uDC00 to \uDFFF range, and together they represent a character above \uFFFF in UTF-16 systems. If there’s one in the text by itself (a “lone surrogate”), that string can’t be represented in UTF-8 at all and is invalid.
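To make the pairing concrete, here is a small sketch of the standard UTF-16 arithmetic applied to the pair from the API response (\ud83d\udccd). The module and function names are invented for illustration and are not part of any library.

```elixir
defmodule SurrogateSketch do
  import Bitwise

  # Combine a high surrogate (0xD800..0xDBFF) and a low surrogate
  # (0xDC00..0xDFFF) into the code point the pair encodes in UTF-16.
  def combine(hi, lo) when hi in 0xD800..0xDBFF and lo in 0xDC00..0xDFFF do
    0x10000 + ((hi - 0xD800) <<< 10) + (lo - 0xDC00)
  end
end

code_point = SurrogateSketch.combine(0xD83D, 0xDCCD)

# Encode the combined code point as a UTF-8 string.
pin = <<code_point::utf8>>

IO.inspect(code_point, base: :hex)
IO.puts(pin)
```

The combined value is 0x1F4CD, which is why the same emoji can be written in Elixir source as "\u{1F4CD}", while "\ud83d" on its own is rejected.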

Jason can parse these (when they are paired):

iex(livebook_ky4n2p4p@Matts-MacBook-Pro-2)22> {:ok, decoded} = Jason.decode("{\"foo\":\"\\uD83D\\uDE04\"}")
{:ok, %{"foo" => "😄"}}

But oddly, doesn’t produce them:

iex(livebook_ky4n2p4p@Matts-MacBook-Pro-2)23> {:ok, encoded} = Jason.encode(decoded)
{:ok, "{\"foo\":\"😄\"}"}
iex(livebook_ky4n2p4p@Matts-MacBook-Pro-2)24> String.to_charlist(encoded)
[123, 34, 102, 111, 111, 34, 58, 34, 128516, 34, 125]
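One likely explanation is that Elixir strings are UTF-8, where a character above \uFFFF is stored directly as four bytes, and JSON allows such characters unescaped, so no surrogates are needed on output. The sketch below uses only the standard library to show that the surrogate pair reappears when the same character is converted to UTF-16.

```elixir
smile = "\u{1F604}"

# In UTF-8 the emoji is a single code point stored in four bytes.
IO.inspect(byte_size(smile))
IO.inspect(String.to_charlist(smile))

# Converting to big-endian UTF-16 yields two 16-bit units: the surrogate pair.
utf16 = :unicode.characters_to_binary(smile, :utf8, {:utf16, :big})
<<hi::16, lo::16>> = utf16

IO.inspect({hi, lo}, base: :hex)
```

The two units are 0xD83D and 0xDE04, matching the \uD83D\uDE04 escapes in the decoded input.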

Doerge · August 11, 2021, 6:18pm

Thanks for explaining, and pointing me to Jason! It works perfectly!

kip · August 11, 2021, 10:24pm

Emoji are definitely part of the Unicode specification.

03juan · August 12, 2021, 7:20am

Yes, that was an ignorant statement on my part, from lack of research. Thanks for the correction.

DidactMacros · March 5, 2024, 10:45am

Old thread, but I encountered a somewhat adjacent issue with my file names. Characters with code points over 1418 aren’t shown in their string form; I get a boxed question mark instead, or a boxed question mark with a space. Code points over 553295 (553_295) cause a codepoint error when I attempt a string conversion.

(Maybe this is due to something that needs to be installed/configured on my computer?)

I had to filter out character lists containing such values (ones over 553295) when passing elements of the :file.list_dir output to File.dir?, because that function first converts character lists to strings. Lower values were still usable, even though they didn’t have a distinct string representation.
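A rough sketch of that kind of filter, assuming the goal is to keep only character lists that convert cleanly to a UTF-8 string. The module name, helper name, and sample names are hypothetical. It relies on :unicode.characters_to_binary/1 returning an error tuple rather than a binary when a list contains a value that is not a valid code point, such as a lone surrogate (55296 to 57343).

```elixir
defmodule NameFilterSketch do
  # Hypothetical helper: true when the charlist converts cleanly to a
  # UTF-8 binary, false when the conversion returns an error tuple.
  def convertible?(charlist) do
    is_binary(:unicode.characters_to_binary(charlist))
  end
end

# Made-up sample entries: a plain name, a lone surrogate, and a valid emoji.
names = [~c"notes.txt", [0xD83D], [0x1F4CD]]

usable = Enum.filter(names, &NameFilterSketch.convertible?/1)

IO.inspect(Enum.map(usable, &List.to_string/1))
```

Checking convertibility this way avoids hard-coding a numeric threshold, since the lone surrogate is rejected while the emoji above it is kept.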

al2o3cr · March 5, 2024, 5:42pm

It’s difficult to narrow down what could be causing this because there are a lot of suspects:

the font your terminal program uses
the terminal program
the compiler options used to build the BEAM
the shell’s character set settings
the operating system’s character set settings
the operating system’s implementation of the filesystem APIs

DidactMacros · March 5, 2024, 6:27pm

Ah I thought this may just be a general occurrence.

I’ll check if the same happens on my Linux system.

edit: The boxed question mark occurs only for characters over 43008 while the codepoint error still has the same trigger point.

I don’t think it’s the operating system’s character set since I can see emojis in my file names. It’s likely the terminal.