I stumbled upon something that baffles me while working with PDF files. The problem arose due to the non-ASCII characters âãÏÓ on the second line of the PDFs.
I started with a base64 encoded PDF whose first two lines when decoded with Base.decode64! result in: <<37, 80, 68, 70, 45, 49, 46, 53, 13, 10, 37, 226, 227, 207, 211, 13, 10>> with some carriage returns and newlines.
This is not a valid string however, as 226, 227, 207, 211 are the utf-8 codepoints for “âãÏÓ”, but not its bitstring, which is <<195, 162, 195, 163, 195, 143, 195, 147>>.
If we substitute this in the original string so that it becomes <<37, 80, 68, 70, 45, 49, 46, 53, 13, 10, 37, 195, 162, 195, 163, 195, 143, 195, 147, 13, 10>> it is now valid.
I also noticed that <<226 :: utf8>> returns "â", but <<226, 277 :: utf8>> returns <<226, 196, 149>>.
The PDF is naturally a binary, but why is this "âãÏÓ" part not decoded with its bitstring representation and instead it is decoded with its codepoints? I understand that this might sound stupid if the answer is obvious, but I still find it strange. Does it have something to do with the fact that each of these characters use two bytes? And when there is more than one of them so they cannot be meaningfully distinguished?
And as a sidenote, does saving a PDF binary with these codepoints in it work because they are part of a comment (with a %) in the PDF structure and hence the line they are on is completely ignored?






















