How to properly convert this "Octal" encoded UTF16 BE string to UTF8

preciz · July 20, 2020, 12:12pm

This is the string:

"\\376\\377\\0001\\0009\\000 \\0001\\0002\\000 \\0001\\0009\\000 \\000P\\000r\\000o\\000g\\000r\\000a\\000m\\000m\\000f\\000o\\003\\b\\000r\\000d\\000e\\000r\\000u\\000n\\000g\\000 \\0002\\0000\\0002\\0000\\000 \\000-\\000 \\000A\\000n\\000l\\000a\\000g\\000e\\000 \\0002\\000 \\000-\\000 \\000F\\000o\\003\\b\\000r\\000d\\000e\\000r\\000e\\000r\\000g\\000e\\000b\\000n\\000i\\000s\\000s\\000e\\000.\\000d\\000o\\000c\\000x"

That should be converted to: "19 12 19 Programmförderung 2020 - Anlage 2 - Förderergebnisse.docx"

It’s from a PDFs /Info dictionary.

The "\\376\\377" is a BOM (byte order mark), meaning the string is UTF-16 big endian.
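As a quick check of the BOM claim, Erlang's `:unicode.bom_to_encoding/1` can be called from Elixir on the two bytes that `\376\377` stands for. This is only a minimal sketch; the module name `BomCheck` is made up for the example.

```elixir
defmodule BomCheck do
  # Returns {encoding, bom_length_in_bytes} as reported by
  # :unicode.bom_to_encoding/1. When no BOM is recognised the
  # function reports {:latin1, 0}.
  def detect(bytes) when is_binary(bytes), do: :unicode.bom_to_encoding(bytes)
end

# Octal 376 and 377 are the bytes 0xFE and 0xFF.
BomCheck.detect(<<0o376, 0o377, 0, ?1>>)
# => {{:utf16, :big}, 2}
```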

Firefox can read it properly and I assume this is the source code they use:

https://github.com/mozilla/pdf.js/blob/50bc4a18e8c564753365d927d5ec6a6d2cce3072/src/display/metadata.js#L37


  const parser = new SimpleXMLParser();
  const xmlDocument = parser.parseFromString(data);

  this._metadataMap = new Map();

  if (xmlDocument) {
    this._parse(xmlDocument);
  }
}

_repair(data) {
  // Start by removing any "junk" before the first tag (see issue 10395).
  return data
    .replace(/^[^<]+/, "")
    .replace(/>\\376\\377([^<]+)/g, function (all, codes) {
      const bytes = codes
        .replace(/\\([0-3])([0-7])([0-7])/g, function (code, d1, d2, d3) {
          return String.fromCharCode(d1 * 64 + d2 * 8 + d3 * 1);
        })
        .replace(/&(amp|apos|gt|lt|quot);/g, function (str, name) {
          switch (name) {

I tried to replicate this in Elixir, but when it gets to the "ö" the output breaks into unrelated characters.

What would be the correct way to convert the string?

hauleth · July 20, 2020, 12:15pm

You should check the :unicode module for that. It has all the functions you will need.

malaire · July 20, 2020, 12:40pm

The code shown isn’t enough to decode the string, because it doesn’t handle decoding \\003\\b to the UTF-16 code unit 0x0308.

This could be fixed, e.g., by first converting \\b to \\010, but a full solution would need a list of all possible escape codes used in the encoding.
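To illustrate why that pair matters: once `\b` is read as byte 0x08, the bytes 0x03 0x08 form U+0308 (combining diaeresis), which follows a plain "o". A small sketch, with a made-up module name `Diaeresis`, shows the decoded result and how `String.normalize/2` can compose it into a single "ö" if the precomposed form is wanted.

```elixir
defmodule Diaeresis do
  # Decodes raw UTF-16 BE bytes (BOM already removed) into a UTF-8 string.
  def decode_utf16be(bytes) when is_binary(bytes) do
    :unicode.characters_to_binary(bytes, {:utf16, :big}, :utf8)
  end
end

# The bytes that \000o\003\b stands for: "o" followed by 0x03 0x08.
decoded = Diaeresis.decode_utf16be(<<0x00, ?o, 0x03, 0x08>>)
# decoded holds two code points: "o" and U+0308

composed = String.normalize(decoded, :nfc)
# composed holds the single code point U+00F6
```

Whether to normalise is a design choice; both forms render as "ö", but they are not equal as binaries.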

preciz · July 20, 2020, 1:09pm

Can you elaborate on why this escape character is used in the string?

malaire · July 20, 2020, 1:23pm

It looks like UTF16 BE bytes in printable ASCII range are represented as ASCII characters (e.g. P instead of \\120), and at least some UTF16 BE bytes in ASCII control character range use the usual escape codes (e.g. \\b meaning ASCII backspace).

So the encoding might also be using some other escapes like \\n for \\012 (line feed) or \\r for \\015 (carriage return).
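The description above could be turned into a small unescaper along these lines. It is a sketch under assumptions: the module name `PdfEscapes` is hypothetical, and the set of single-letter escapes (`\n`, `\r`, `\t`, `\b`, `\f`) is a guess at what the producer emits, not something confirmed in this thread.

```elixir
defmodule PdfEscapes do
  # Turns backslash escapes into raw bytes and copies every other byte as is.
  # Handles three-digit octal escapes (\ddd) and a few single-letter escapes;
  # any other escaped character is kept as itself (assumption).
  def unescape(data) when is_binary(data), do: do_unescape(data, <<>>)

  defp do_unescape(<<>>, acc), do: acc

  # \ddd with d1 in 0..3 so the value fits in one byte (max octal 377 = 255)
  defp do_unescape(<<?\\, d1, d2, d3, rest::binary>>, acc)
       when d1 in ?0..?3 and d2 in ?0..?7 and d3 in ?0..?7 do
    byte = (d1 - ?0) * 64 + (d2 - ?0) * 8 + (d3 - ?0)
    do_unescape(rest, <<acc::binary, byte>>)
  end

  defp do_unescape(<<?\\, ?n, rest::binary>>, acc), do: do_unescape(rest, <<acc::binary, 10>>)
  defp do_unescape(<<?\\, ?r, rest::binary>>, acc), do: do_unescape(rest, <<acc::binary, 13>>)
  defp do_unescape(<<?\\, ?t, rest::binary>>, acc), do: do_unescape(rest, <<acc::binary, 9>>)
  defp do_unescape(<<?\\, ?b, rest::binary>>, acc), do: do_unescape(rest, <<acc::binary, 8>>)
  defp do_unescape(<<?\\, ?f, rest::binary>>, acc), do: do_unescape(rest, <<acc::binary, 12>>)

  # Unknown escape: drop the backslash, keep the character.
  defp do_unescape(<<?\\, other, rest::binary>>, acc),
    do: do_unescape(rest, <<acc::binary, other>>)

  # Printable characters stand for their own byte value.
  defp do_unescape(<<byte, rest::binary>>, acc),
    do: do_unescape(rest, <<acc::binary, byte>>)
end

# The Elixir literal below contains single backslashes at runtime.
PdfEscapes.unescape("\\376\\377\\0001\\003\\b")
# => <<254, 255, 0, 49, 3, 8>>
```

The result is a raw byte sequence, so it still has to be converted from UTF-16 BE to UTF-8 afterwards.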

dimitarvp · July 21, 2020, 10:41am

Does the pdfinfo tool read this data properly btw?

I don’t even see how \0001\0009 transcodes to "19". This is not how UTF-16BE does it. Looks to be some other encoding.

malaire · July 21, 2020, 10:56am

String \\0001\\0009 is converted to 4 bytes:

\\000 is octal encoding for byte 0x00
1 means ASCII value of character 1, i.e. 0x31
\\000 is octal encoding for byte 0x00
9 means ASCII value of character 9, i.e. 0x39

So that is 4 bytes 0x00 0x31 0x00 0x39 which is UTF16 BE encoding for "19".
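The same pairing of bytes can be shown in Elixir with a bitstring comprehension that reads the binary two bytes at a time. This is only an illustration with a made-up module name `Utf16Units`; treating each 16-bit unit as a code point works for characters like these, but not for surrogate pairs.

```elixir
defmodule Utf16Units do
  # Splits a binary into big-endian 16-bit code units (two bytes each).
  def units(bytes) when is_binary(bytes) do
    for <<unit::16 <- bytes>>, do: unit
  end
end

units = Utf16Units.units(<<0x00, 0x31, 0x00, 0x39>>)
# units is [0x0031, 0x0039], the code points of "1" and "9"

List.to_string(units)
# => "19"
```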

dimitarvp · July 21, 2020, 11:07am

Oh, so as you said some values aren’t exactly UTF-16BE encoded. They are directly their printable equivalents.

Well, that’s still not UTF-16BE per se.

g-andrade · July 21, 2020, 6:15pm

This seems to do it in Erlang:

convert_text_with_bom_to_utf8() ->
    Data = <<"\376\377\0001\0009\000 \0001\0002\000 \0001\0009\000 "
             "\000P\000r\000o\000g\000r\000a\000m\000m\000f\000o\003"
             "\b\000r\000d\000e\000r\000u\000n\000g\000 \0002\0000"
             "\0002\0000\000 \000-\000 \000A\000n\000l\000a\000g"
             "\000e\000 \0002\000 \000-\000 \000F\000o\003\b\000r\000d\000e\000r"
             "\000e\000r\000g\000e\000b\000n\000i\000s\000s\000e"
             "\000.\000d\000o\000c\000x"
           >>,

    {Encoding, BomLength} = unicode:bom_to_encoding(Data), % determine Data encoding from BOM
    <<_:BomLength/bytes, Text/bytes>> = Data, % throw the BOM away
    <<_/bytes>> = unicode:characters_to_binary(Text, Encoding, utf8). % convert to UTF-8

You can interact with it here.

Unfortunately I wasn’t able to declare an equivalent literal value in Elixir; but, as long as your string is in the expected format, the same pattern should work.
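A possible Elixir rendering of the same pattern is sketched below. The module name `BomToUtf8` is hypothetical, and because Elixir strings have no `\ddd` octal escapes, the sample bytes are listed as integers in a binary instead of as a string literal. Only a short prefix of the data is used.

```elixir
defmodule BomToUtf8 do
  # Same steps as the Erlang function: detect the encoding from the BOM,
  # drop the BOM, then convert the remaining bytes to UTF-8.
  def convert(data) when is_binary(data) do
    {encoding, bom_length} = :unicode.bom_to_encoding(data)
    <<_bom::binary-size(bom_length), text::binary>> = data
    :unicode.characters_to_binary(text, encoding, :utf8)
  end
end

# 0o376 0o377 is the BOM; 0o003 0o010 is the pair written as \003\b.
data = <<0o376, 0o377, 0, ?1, 0, ?9, 0, ?F, 0, ?o, 0o003, 0o010, 0, ?r>>

BomToUtf8.convert(data)
# => "19Fo" followed by U+0308 and "r"
```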

If, however, the double-backslashes represent actual backslashes in the original string and are not the result of printing / inspecting it, then they’ll need to be unescaped first.

g-andrade · July 21, 2020, 6:40pm

Now, as far as unescaping the string - if needed - this seems to do the trick:

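  def unescape(data) do
    charlist = String.to_charlist(data)
    erlang_literal = '"#{charlist}"'
    {:ok, [{:string, _, unescaped_charlist}], _} = :erl_scan.string(erlang_literal)
    # The scanned values are raw bytes (e.g. 254, 255 for the BOM), so build the
    # binary directly; List.to_string/1 would UTF-8 encode them as code points.
    :erlang.list_to_binary(unescaped_charlist)
  end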

What it does is transform the string into a charlist containing an Erlang “string expression”, which it proceeds to parse - and so I wouldn’t trust it with untrusted input without further investigation.