PDFInfo - Extract metadata from a PDF binary with only Regex

preciz · July 23, 2020, 10:30pm

I had to extract the /Info dictionary and /Metadata object from a lot of PDFs, so I used the pdfinfo command-line tool on Linux.

It turns out that pdfinfo doesn’t extract all of the metadata, and sometimes (for example, when the PDF has errors) it doesn’t find any at all.

After reading this: https://www.sans.org/reading-room/whitepapers/forensics/pdf-metadata-extraction-python-38800 I decided I wanted an Elixir library with similar functionality.

Through trial and error, I have developed a library, PDFInfo, with similar but extended functionality.

It deliberately takes a naive Regex approach: it assumes nothing about the PDF’s structure, which makes extracting the metadata more stable.
Its current limitations are that it can’t inflate compressed PDF object streams (this would be solvable with Erlang’s :zlib) and can’t decrypt an encrypted PDF.
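
As a rough illustration of the naive Regex idea, here is a minimal sketch in Python (the language used in the paper linked above). It is not the library’s actual code: the function name `scan_info_entries`, the list of keys, and the sample bytes are all made up for this example. It assumes the values are uncompressed, unencrypted literal strings without nested parentheses, and it scans the raw bytes without parsing the PDF structure at all.

```python
import re

# Standard /Info keys followed by a literal string "(...)".
# Escaped characters such as "\)" are allowed inside the value; hex strings
# "<...>" and unescaped nested parentheses are NOT handled in this sketch.
INFO_ENTRY = re.compile(
    rb"/(Title|Author|Subject|Keywords|Creator|Producer|CreationDate|ModDate)"
    rb"\s*\(((?:\\.|[^\\)])*)\)",
    re.DOTALL,
)


def scan_info_entries(pdf_bytes: bytes) -> dict:
    """Collect every matching key/value pair found anywhere in the raw bytes."""
    entries = {}
    for key, value in INFO_ENTRY.findall(pdf_bytes):
        # A PDF can contain several revisions of the same key, so keep them all.
        entries.setdefault(key.decode("ascii"), []).append(value.decode("latin-1"))
    return entries


# Hypothetical, heavily truncated PDF fragment used only for demonstration.
sample = b"""%PDF-1.4
1 0 obj
<< /Title (Quarterly Report) /Author (Jane Doe) /Producer (ExampleWriter 1.0) >>
endobj
trailer
<< /Info 1 0 R >>
"""

print(scan_info_entries(sample))
```

Because the scan never follows the cross-reference table or the trailer, it still finds these entries when those parts of the file are damaged. The trade-off is that anything stored inside a compressed object stream or an encrypted file stays invisible to it.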

Any kind of feedback is welcome! A code review pointing out mistaken assumptions the code makes about PDFs would be helpful.