Bitstring with codepoints of non-ASCII characters

I stumbled upon something that baffles me while working with PDF files. The problem arose due to the non-ASCII characters âãÏÓ on the second line of the PDFs.

I started with a base64 encoded PDF whose first two lines, when decoded with Base.decode64!, result in: <<37, 80, 68, 70, 45, 49, 46, 53, 13, 10, 37, 226, 227, 207, 211, 13, 10>> (each 13, 10 pair is a carriage return followed by a newline).

This is not a valid string however, as 226, 227, 207, 211 are the utf-8 codepoints for “âãÏÓ”, but not its bitstring, which is <<195, 162, 195, 163, 195, 143, 195, 147>>.

If we substitute this in the original string so that it becomes <<37, 80, 68, 70, 45, 49, 46, 53, 13, 10, 37, 195, 162, 195, 163, 195, 143, 195, 147, 13, 10>> it is now valid.
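
For anyone who wants to reproduce this, here is a minimal sketch that checks both binaries with String.valid?/1. The variable names are only illustrative, and the byte values are copied from the post above.

```elixir
# Header bytes as they came out of Base.decode64!/1:
# one byte per character after the "%".
raw = <<37, 80, 68, 70, 45, 49, 46, 53, 13, 10, 37, 226, 227, 207, 211, 13, 10>>

# The same header with the four characters re-encoded as UTF-8:
# two bytes per character after the "%".
reencoded =
  <<37, 80, 68, 70, 45, 49, 46, 53, 13, 10, 37, 195, 162, 195, 163, 195, 143, 195, 147, 13, 10>>

# String.valid?/1 checks whether a binary is well-formed UTF-8.
raw_valid = String.valid?(raw)
reencoded_valid = String.valid?(reencoded)

IO.inspect(raw_valid, label: "raw is valid UTF-8")
IO.inspect(reencoded_valid, label: "re-encoded is valid UTF-8")
```

The first check should come back false and the second true, which matches what is described above.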

I also noticed that <<226 :: utf8>> returns "â", but <<226, 277 :: utf8>> returns <<226, 196, 149>>.

The PDF is naturally a binary, but why is this "âãÏÓ" part decoded as its codepoints rather than as its bitstring representation? I understand that this might sound stupid if the answer is obvious, but I still find it strange. Does it have something to do with the fact that each of these characters uses two bytes, so that when there is more than one of them they cannot be meaningfully distinguished?

And as a sidenote, does saving a PDF binary with these codepoints in it work because they are part of a comment (with a %) in the PDF structure and hence the line they are on is completely ignored?

How have they been initially written to the PDF?

Perhaps it’s the source encoding that skews you here?

A quick glance makes me assume that the source was Latin-1/ISO 8859-1 encoded.

I am not a PDF expert, but I think that PDF generally does not use UTF-8 encoding, but rather some single-byte encoding or a built-in font encoding. For example, if the Latin 1/ISO 8859-1 encoding is used, the characters âãÏÓ would each be encoded by one byte corresponding to their code point.

Note that there is no such thing as a UTF-8 code point. There are Unicode code points, and UTF-8 is one possible encoding for them. So 226 is the Unicode code point for â, which is encoded in UTF-8 as the binary <<195, 162>>. In another encoding it can be different, for example in Latin 1/ISO 8859-1 (a single-byte encoding) it is encoded as the binary <<226>>.
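
A small sketch of the difference, assuming the four bytes really are Latin 1 (which is a guess about this particular PDF). It uses Erlang's :unicode.characters_to_binary/3 to re-encode them as UTF-8; the variable names are illustrative.

```elixir
# 226 is the Unicode code point of "â"; the utf8 modifier encodes it as UTF-8.
utf8_bytes = <<226::utf8>>

# In Latin 1 the same character is the single byte 226,
# which on its own is not well-formed UTF-8.
latin1_bytes = <<226>>

# Re-encode the four bytes from the PDF header from Latin 1 to UTF-8.
converted = :unicode.characters_to_binary(<<226, 227, 207, 211>>, :latin1, :utf8)

# Decoding the UTF-8 binary gives back the code points.
code_points = String.to_charlist(converted)

IO.inspect(utf8_bytes, binaries: :as_binaries)
# prints <<195, 162>>
IO.inspect(converted, binaries: :as_binaries)
# prints <<195, 162, 195, 163, 195, 143, 195, 147>>
IO.inspect(code_points, charlists: :as_lists)
# prints [226, 227, 207, 211]
```

So the bytes in the PDF and the code points have the same numeric values only because Latin 1 maps each of its bytes to the Unicode code point with the same number.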

In short, your PDF is not encoded in UTF-8, so it’s normal that you won’t get UTF-8 bitstrings when looking at it in binary form.

Encodings in PDF are tricky.

Textual segments in the PDF are 7-bit ASCII as far as I remember; binary segments may contain arbitrary data.

Text as seen in the rendered PDF does not necessarily exist like that in the PDF, but only as a binary segment listing glyphs to use from another segment. In such a scenario the byte 5 can represent an A while the byte 6 represents the letter ē.

That’s because <<226 :: utf8>> gives you the UTF-8 encoded binary corresponding to the code point 226, which is "â" (or, equivalently, the bitstring <<195, 162>>). The meaning of the expression <<226, 277 :: utf8>> is instead: the byte 226, followed by the bytes corresponding to the code point 277 encoded in UTF-8, which are <<196, 149>>. The utf8 modifier applies only to the segment it is attached to.

The expression <<226 :: utf8, 277 :: utf8>> is probably what you meant to do, and returns, as expected, "âĕ" (or, equivalently, <<195, 162, 196, 149>>).
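
A quick sketch to compare the two forms side by side (the variable names are only for illustration):

```elixir
# Only 277 carries the utf8 modifier here; 226 is emitted as a plain byte.
mixed = <<226, 277::utf8>>

# Both integers are treated as code points and encoded as UTF-8.
both = <<226::utf8, 277::utf8>>

IO.inspect(mixed, binaries: :as_binaries)
# prints <<226, 196, 149>>
IO.inspect(both, binaries: :as_binaries)
# prints <<195, 162, 196, 149>>

# The mixed binary is not well-formed UTF-8, so it is shown as raw bytes;
# the second one is a valid string.
IO.inspect(String.valid?(mixed))
IO.inspect(String.valid?(both))
```

The raw byte 226 looks like the start of a three-byte UTF-8 sequence, but 196 is not a continuation byte, which is why the first binary is not a valid string.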

Thank you both for the replies and for the helpful information! And for correcting me about the unicode codepoints (not UTF-8).

Yes, from the fact that "âãÏÓ" is represented with its codepoints I suppose that the encoding is Latin 1/ISO 8859-1 and that is what got me confused.

Thank you again!
