How to remove invalid Unicode from an API response: Jason.DecodeError unexpected sequence "\\udc51"

Hello!

We are working with responses from an API that contain HTML from websites. Some of these responses contain Unicode code points that Jason is unable to decode. We have a Python script that runs when this fails, and it can JSON-decode these large strings just fine. I’ve also checked whether Jiffy or Poison could handle this, but they fail at the same code point as Jason.

I’ve also tried several Elixir/Erlang String and Unicode functions to filter out anything that’s not valid UTF-8, but the offending code points are not flagged and the whole string is considered valid UTF-8.
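One likely reason the UTF-8 checks pass: in the raw response the offending sequence is a six-character ASCII escape (a backslash followed by udc51), so the text itself is valid UTF-8. The problem only appears once a JSON decoder turns that escape into a code point. A minimal sketch in Python (the language of the fallback script), using a shortened, made-up payload rather than the real response:

```python
import json

# Shortened, made-up payload: at this stage the escape is plain ASCII text.
raw = r'{"html": "\ufffdPNG\udc51"}'

# The raw text is pure ASCII, so any UTF-8 validity check accepts it.
raw_bytes = raw.encode("ascii")
raw_is_valid_utf8 = raw_bytes.decode("utf-8") == raw

# Python's json module is lenient and keeps the lone surrogate in the result.
html = json.loads(raw)["html"]
has_lone_surrogate = any(0xD800 <= ord(ch) <= 0xDFFF for ch in html)

# A lone surrogate has no UTF-8 encoding, so a strict encoder rejects it.
try:
    html.encode("utf-8")
    encodes_cleanly = True
except UnicodeEncodeError:
    encodes_cleanly = False
```

This would also explain why Python “decodes just fine” while decoders that must produce valid UTF-8 refuse: the Python result still contains the unpaired surrogate and fails as soon as it is written out as UTF-8.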

Below is a snippet of the response that we are trying to decode. It is only a very small part of the HTML that we occasionally get back; I have included just the portion that causes the decoding problem. Any help is appreciated!

 {"server": "Microsoft-IIS/10.0", "headers_hash": 1111111111, "host": "127.0.0.1", "html": "\ufffdPNG\r\n\u001a\n\u0000\u0000\u0000\rIHDR\u0000\u0000\u0003\u0000\u0000\u0000\u0002f\b\u0006\u0000\u0000\u0000\ufffd[\ufffd}\u0000\u0000\u0000\u0001sRGB\u0000\ufffd\ufffd\u001c\ufffd\u0000\u0000\u0000\u0004gAMA\u0000\u0000\ufffd\ufffd\u000b\ufffda\u0005\u0000\u0000\u0000\tpHYs\u0000\u0000\u000e\ufffd\u0000\u0000\u000e\ufffd\u0001\ufffd\u0007R\ufffd\ufffd\u04b6\u03c6\ufffd\udc51\ufffd}6\ufffdc\ufffd'\u0005\ufffd\ufffd\ufffd/"}

That “html” value is not HTML but a PNG. Jason wants to decode it into a string, which needs to be valid UTF-8, and this string isn’t.

And as far as I read the Jason sources, Jason shouldn’t care at all.

$ nix shell nixpkgs#elixir -c iex
Erlang/OTP 25 [erts-13.2.2.1] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [jit:ns]

Interactive Elixir (1.14.5) - press Ctrl+C to exit (type h() ENTER for help)
iex(1)> s = ~S|{"server": "Microsoft-IIS/10.0", "headers_hash": 1111111111, "host": "127.0.0.1", "html": "\ufffdPNG\r\n\u001a\n\u0000\u0000\u0000\rIHDR\u0000\u0000\u0003\u0000\u0000\u0000\u0002f\b\u0006\u0000\u0000\u0000\ufffd[\ufffd}\u0000\u0000\u0000\u0001sRGB\u0000\ufffd\ufffd\u001c\ufffd\u0000\u0000\u0000\u0004gAMA\u0000\u0000\ufffd\ufffd\u000b\ufffda\u0005\u0000\u0000\u0000\tpHYs\u0000\u0000\u000e\ufffd\u0000\u0000\u000e\ufffd\u0001\ufffd"}|
"{\"server\": \"Microsoft-IIS/10.0\", \"headers_hash\": 1111111111, \"host\": \"127.0.0.1\", \"html\": \"\\ufffdPNG\\r\\n\\u001a\\n\\u0000\\u0000\\u0000\\rIHDR\\u0000\\u0000\\u0003\\u0000\\u0000\\u0000\\u0002f\\b\\u0006\\u0000\\u0000\\u0000\\ufffd[\\ufffd}\\u0000\\u0000\\u0000\\u0001sRGB\\u0000\\ufffd\\ufffd\\u001c\\ufffd\\u0000\\u0000\\u0000\\u0004gAMA\\u0000\\u0000\\ufffd\\ufffd\\u000b\\ufffda\\u0005\\u0000\\u0000\\u0000\\tpHYs\\u0000\\u0000\\u000e\\ufffd\\u0000\\u0000\\u000e\\ufffd\\u0001\\ufffd\"}"
iex(2)> Mix.install([{:jason, "~> 1.4"}])
Resolving Hex dependencies...
Dependency resolution completed:
New:
  jason 1.4.0
==> jason
Compiling 10 files (.ex)
Generated jason app
:ok
iex(3)> Jason.decode(s)
{:ok,
 %{
   "headers_hash" => 1111111111,
   "host" => "127.0.0.1",
   "html" => <<239, 191, 189, 80, 78, 71, 13, 10, 26, 10, 0, 0, 0, 13, 73, 72,
     68, 82, 0, 0, 3, 0, 0, 0, 2, 102, 8, 6, 0, 0, 0, 239, 191, 189, 91, 239,
     191, 189, 125, 0, 0, 0, 1, 115, 82, ...>>,
   "server" => "Microsoft-IIS/10.0"
 }}

As you can see, Jason happily decodes the blob into a binary.

Ah, sorry I left out the code point that Jason was erroring out on, I’ve updated the snippet to include the \udc51 code point which causes the error. The sample I provided is only a very small snippet of the full HTML that gets returned, but that’s an interesting note about it being a PNG.

The JSON standard is pretty clear that strings are sequences of Unicode code points (which could be encoded as UTF-8, UTF-16 or UTF-32). Elixir strings are UTF-8 and that’s the contract, so I would expect failure if the string is not valid UTF-8.

(Section 9) A string is a sequence of Unicode code points wrapped with quotation marks (U+0022).

Therefore the best solution is for the source system to correct its invalid JSON. Since I suspect you’ll say that’s not going to happen, any solution will need to work around the non-standard binary data and your Python approach may be as good as any.
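If the source cannot be fixed, one possible workaround is to rewrite unpaired surrogate escapes in the raw JSON text before handing it to a strict decoder. The sketch below is only an illustration: the function name is hypothetical, substituting U+FFFD is an assumption about what is acceptable, and it changes the data (which is already lossy here). The same regular-expression idea could be ported to Elixir and run before Jason.decode.

```python
import json
import re

# Matches, in order: an escaped backslash, a well-formed surrogate pair,
# or any single surrogate escape (unpaired if the pair branch did not match).
_ESCAPES = re.compile(
    r"\\\\"
    r"|\\u[dD][89abAB][0-9a-fA-F]{2}\\u[dD][c-fC-F][0-9a-fA-F]{2}"
    r"|\\u[dD][89a-fA-F][0-9a-fA-F]{2}"
)


def replace_lone_surrogate_escapes(raw_json: str) -> str:
    """Hypothetical helper: swap unpaired surrogate escapes for \\ufffd."""

    def fix(match):
        text = match.group(0)
        # A six-character match is a single \uXXXX surrogate escape, i.e. unpaired.
        # Escaped backslashes (2 chars) and valid pairs (12 chars) are kept as-is.
        return "\\ufffd" if len(text) == 6 else text

    return _ESCAPES.sub(fix, raw_json)


# Made-up input: a lone surrogate, a valid pair, and an escaped backslash
# followed by the literal text "udc51" (which must not be touched).
raw = r'{"html": "a\udc51b \ud83d\ude00 c\\udc51"}'
cleaned = replace_lone_surrogate_escapes(raw)
decoded = json.loads(cleaned)["html"]
```

Matching escaped backslashes first keeps the pattern from misreading a literal backslash followed by the text udc51 as an escape.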

It’s really unfortunate to see binary data (the PNG image in this case) being passed off as a string. It should at least be encoded in some fashion.

Thanks for the very informative answer. I think you’re right about the Python approach.

The \ufffdPNG\r\n\u001a\n at the start of the “html” value is the “magic number” that starts a PNG file, except that the first byte, which should have the value 0x89, has been replaced with the Unicode replacement character (\uFFFD).
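A quick way to check that observation, sketched in Python: decoding the eight-byte PNG signature as UTF-8 with replacement of invalid bytes produces exactly the prefix seen in the snippet. That a lenient UTF-8 decode is what happened upstream is an assumption; the sketch only shows it is consistent with the data.

```python
# The eight-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# 0x89 on its own is not valid UTF-8, so a lenient decoder substitutes U+FFFD.
lossy = PNG_SIGNATURE.decode("utf-8", errors="replace")

# Start of the "html" value from the snippet, after JSON unescaping.
observed_prefix = "\ufffdPNG\r\n\u001a\n"

matches = lossy == observed_prefix
```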

The data was already corrupted by the time it got to whatever printed that log line; you’ll need to look earlier in the call stack for where things are going wrong.

It will help troubleshoot if you can capture exactly the bytes that the API is replying with - my suspicion is that it’s sending something like:

"html":"<BYTE 0x89>PNG<BYTE 0x0D><BYTE 0x0A>...etc...

which is problematic from both a “valid UTF8” and a “valid JSON” perspective.
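Once the exact bytes have been captured, both problems can be checked directly. A rough sketch in Python, where the captured body is a made-up stand-in for the suspected reply and the helper name is hypothetical:

```python
import json


def first_invalid_utf8_offset(body: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        body.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        return err.start


# Made-up stand-in for the suspected raw reply: a bare 0x89 plus raw CR/LF.
suspected = b'{"html":"\x89PNG\r\n"}'
offset = first_invalid_utf8_offset(suspected)

# Even without the 0x89 byte, unescaped control characters such as CR and LF
# are not allowed inside a JSON string, so a strict parser rejects the text.
try:
    json.loads('{"html":"PNG\r\n"}')
    control_chars_accepted = True
except json.JSONDecodeError:
    control_chars_accepted = False
```

The offset points at the first byte that is not valid UTF-8, which gives a place to start looking for where the binary data enters the response.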
