How to properly convert this "octal"-encoded UTF-16 BE string to UTF-8

This is the string:

"\\376\\377\\0001\\0009\\000 \\0001\\0002\\000 \\0001\\0009\\000 \\000P\\000r\\000o\\000g\\000r\\000a\\000m\\000m\\000f\\000o\\003\\b\\000r\\000d\\000e\\000r\\000u\\000n\\000g\\000 \\0002\\0000\\0002\\0000\\000 \\000-\\000 \\000A\\000n\\000l\\000a\\000g\\000e\\000 \\0002\\000 \\000-\\000 \\000F\\000o\\003\\b\\000r\\000d\\000e\\000r\\000e\\000r\\000g\\000e\\000b\\000n\\000i\\000s\\000s\\000e\\000.\\000d\\000o\\000c\\000x"

That should be converted to: "19 12 19 Programmförderung 2020 - Anlage 2 - Förderergebnisse.docx"

It’s from a PDF’s /Info dictionary.

The "\\376\\377" is a BOM (byte order mark), meaning the string is UTF-16 big endian.
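As a quick check of the BOM claim, here is a minimal sketch that uses Erlang's `:unicode.bom_to_encoding/1` from Elixir. The sample bytes are typed in by hand (the BOM followed by "19"), so treat it as an illustration rather than the actual PDF data.

```elixir
# \376\377 in octal is 0xFE 0xFF, the UTF-16 big-endian byte order mark.
data = <<0o376, 0o377, 0x00, ?1, 0x00, ?9>>

# Returns the detected encoding and the length of the BOM in bytes.
{encoding, bom_length} = :unicode.bom_to_encoding(data)

IO.inspect({encoding, bom_length})
# {{:utf16, :big}, 2}
```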

Firefox can read it properly and I assume this is the source code they use:

I tried to replicate this in Elixir, but when it gets to the "ö" it breaks into unrelated characters.

What would be the correct way to convert the string?

You should check the :unicode module for that. It has all the functions you will need.


The code shown isn’t enough to decode the string, as it doesn’t handle decoding \\003\\b to the UTF-16 code unit 0x0308 (U+0308, the combining diaeresis that turns the preceding "o" into "ö").

This could be fixed, e.g., by first converting \\b to \\010, but a full solution would need a list of all possible escape codes used in the encoding.
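To illustrate why the "ö" is the tricky part, here is a small sketch with hand-typed bytes. It assumes the escapes have already been resolved to raw bytes, so \\003 is 0x03 and \\b is 0x08. The NFC normalization step at the end is my addition, for cases where a single precomposed "ö" is wanted.

```elixir
# "o" followed by the code unit 0x0308, as raw UTF-16BE bytes.
bytes = <<0x00, ?o, 0x03, 0x08>>

decoded = :unicode.characters_to_binary(bytes, {:utf16, :big}, :utf8)

# decoded holds two code points: "o" and U+0308 (combining diaeresis).
IO.inspect(String.to_charlist(decoded))
# [111, 776]

# Optional: NFC normalization merges them into the single code point U+00F6.
composed = String.normalize(decoded, :nfc)
IO.inspect(String.to_charlist(composed))
# [246]
```

Both forms render as "ö", but they are different byte sequences, which matters if the result is compared against a precomposed string.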


Can you elaborate on why this escape character is used in the string?

It looks like UTF-16 BE bytes in the printable ASCII range are represented as ASCII characters (e.g. P instead of \\120), and at least some UTF-16 BE bytes in the ASCII control character range use the usual escape codes (e.g. \\b meaning ASCII backspace).

So the encoding might also be using some other escapes, like \\n for \\012 (line feed) or \\r for \\015 (carriage return).
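For what it's worth, here is a rough sketch of an unescaper that does not rely on parsing code. It assumes the escape set of PDF literal strings as I understand it from the PDF specification: \n, \r, \t, \b, \f, an escaped parenthesis or backslash, one to three octal digits, and a backslash before a line break as a continuation. `PdfStringSketch` is a made-up name, and the input is assumed to contain real backslashes.

```elixir
defmodule PdfStringSketch do
  # Hypothetical helper, not part of any library: turns the escaped body of a
  # PDF literal string into raw bytes.

  defguardp is_octal(c) when c in ?0..?7

  def unescape(data) when is_binary(data), do: walk(data, [])

  # End of input: the accumulator holds the bytes in reverse order.
  defp walk(<<>>, acc), do: acc |> Enum.reverse() |> :erlang.list_to_binary()

  # \ddd: one to three octal digits; overflow above 255 is dropped.
  defp walk(<<?\\, d, rest::binary>>, acc) when is_octal(d) do
    {value, rest} = take_octal(rest, d - ?0, 2)
    walk(rest, [rem(value, 256) | acc])
  end

  # Named control-character escapes.
  defp walk(<<?\\, ?n, rest::binary>>, acc), do: walk(rest, [0x0A | acc])
  defp walk(<<?\\, ?r, rest::binary>>, acc), do: walk(rest, [0x0D | acc])
  defp walk(<<?\\, ?t, rest::binary>>, acc), do: walk(rest, [0x09 | acc])
  defp walk(<<?\\, ?b, rest::binary>>, acc), do: walk(rest, [0x08 | acc])
  defp walk(<<?\\, ?f, rest::binary>>, acc), do: walk(rest, [0x0C | acc])

  # Backslash before a line break: line continuation, produces no byte.
  defp walk(<<?\\, ?\r, ?\n, rest::binary>>, acc), do: walk(rest, acc)
  defp walk(<<?\\, ?\n, rest::binary>>, acc), do: walk(rest, acc)
  defp walk(<<?\\, ?\r, rest::binary>>, acc), do: walk(rest, acc)

  # Any other escaped byte, including parentheses and backslash, stands for itself.
  defp walk(<<?\\, c, rest::binary>>, acc), do: walk(rest, [c | acc])

  # A lone trailing backslash is dropped.
  defp walk(<<?\\>>, acc), do: walk(<<>>, acc)

  # Unescaped bytes are copied through.
  defp walk(<<c, rest::binary>>, acc), do: walk(rest, [c | acc])

  # Consume up to `left` further octal digits.
  defp take_octal(<<d, rest::binary>>, value, left) when left > 0 and is_octal(d),
    do: take_octal(rest, value * 8 + (d - ?0), left - 1)

  defp take_octal(rest, value, _left), do: {value, rest}
end

# ~S keeps the backslashes literal, like text read straight from the PDF.
IO.inspect(PdfStringSketch.unescape(~S(\376\377\0001\0009)))
# <<254, 255, 0, 49, 0, 57>>
```

The result is a raw byte string, so it still has to go through the BOM detection and UTF-16 to UTF-8 conversion afterwards.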


Does the pdfinfo tool read this data properly btw?

I don’t even see how \0001\0009 transcodes to "19". This is not how UTF-16BE does it. Looks to be some other encoding.

String \\0001\\0009 is converted to 4 bytes:

  1. \\000 is octal encoding for byte 0x00
  2. 1 means ASCII value of character 1, i.e. 0x31
  3. \\000 is octal encoding for byte 0x00
  4. 9 means ASCII value of character 9, i.e. 0x39

So that is 4 bytes 0x00 0x31 0x00 0x39 which is UTF16 BE encoding for "19".
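The same four bytes can be checked directly in Elixir. This is just a small sketch with the bytes written out by hand:

```elixir
# 0x00 0x31 0x00 0x39: two big-endian 16-bit code units, U+0031 and U+0039.
bytes = <<0x00, 0x31, 0x00, 0x39>>

utf8 = :unicode.characters_to_binary(bytes, {:utf16, :big}, :utf8)

IO.inspect(utf8)
# "19"
```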

Oh, so as you said some values aren’t exactly UTF-16BE encoded. They are directly their printable equivalents.

Well, that’s still not UTF-16BE per se. :stuck_out_tongue:

This seems to do it in Erlang:

convert_text_with_bom_to_utf8() ->
    Data = <<"\376\377\0001\0009\000 \0001\0002\000 \0001\0009\000 "
             "\000P\000r\000o\000g\000r\000a\000m\000m\000f\000o\003"
             "\b\000r\000d\000e\000r\000u\000n\000g\000 \0002\0000"
             "\0002\0000\000 \000-\000 \000A\000n\000l\000a\000g"
             "\000e\000 \0002\000 \000-\000 \000F\000o\003\b\000r\000d\000e\000r"
             "\000e\000r\000g\000e\000b\000n\000i\000s\000s\000e"
             "\000.\000d\000o\000c\000x"
           >>,

    {Encoding, BomLength} = unicode:bom_to_encoding(Data), % determine Data encoding from BOM
    <<_:BomLength/bytes, Text/bytes>> = Data, % throw the BOM away
    <<_/bytes>> = unicode:characters_to_binary(Text, Encoding, utf8). % convert to UTF-8


Unfortunately, I wasn’t able to declare an equivalent literal value in Elixir; but, as long as your string is in the expected format, the same pattern should work.
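For reference, a rough Elixir port of the Erlang function above might look like the sketch below. `BomTextSketch` and `to_utf8/1` are made-up names, and the sample input is a short hand-built binary ("For" with the combining diaeresis after the "o") rather than the full file name, since Elixir has no octal string escapes.

```elixir
defmodule BomTextSketch do
  # Hypothetical helper mirroring the Erlang pattern: detect the BOM,
  # drop it, then transcode the remainder to UTF-8.
  def to_utf8(data) when is_binary(data) do
    # Without a BOM this returns {:latin1, 0}, so the data is read as Latin-1.
    {encoding, bom_length} = :unicode.bom_to_encoding(data)
    <<_bom::binary-size(bom_length), text::binary>> = data

    case :unicode.characters_to_binary(text, encoding, :utf8) do
      utf8 when is_binary(utf8) -> {:ok, utf8}
      {:error, _converted, _rest} -> :error
      {:incomplete, _converted, _rest} -> :error
    end
  end
end

sample = <<0xFE, 0xFF, 0, ?F, 0, ?o, 0x03, 0x08, 0, ?r>>
IO.inspect(BomTextSketch.to_utf8(sample))
```

Returning `:error` for invalid or truncated input is a design choice of this sketch; the Erlang version crashes on a failed match instead.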

If, however, the double-backslashes represent actual backslashes in the original string and are not the result of printing / inspecting it, then they’ll need to be unescaped first.


Now, as far as unescaping the string - if needed - this seems to do the trick:

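  def unescape(data) do
    # Wrap the escaped text in double quotes so it reads as an Erlang string literal.
    erlang_literal = String.to_charlist("\"" <> data <> "\"")
    # Let the Erlang scanner resolve the escapes (\376, \b, ...).
    {:ok, [{:string, _, unescaped_charlist}], _} = :erl_scan.string(erlang_literal)
    # Keep the values as raw bytes; List.to_string/1 would re-encode bytes >= 128 as UTF-8.
    :erlang.list_to_binary(unescaped_charlist)
  end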

What it does is transform the string into a charlist containing an Erlang “string expression”, which it proceeds to parse - and so I wouldn’t trust it with untrusted input without further investigation.
