Parsing a PDF file

gpatankar21 · June 20, 2019, 12:02pm

I need to parse a resume in PDF format to extract fields such as phone number, GitHub URL, LinkedIn URL, etc. Is there any way to extract this data from the PDF?

NobbZ · June 20, 2019, 12:19pm

I tend to say no, unless the PDF is a proper form, and even then it won’t be easy as far as I know. I have not used any tools that do this, since we read and process resumes manually at my company.

What makes it hard is that a PDF does not necessarily store a single letter of the text you read as a human as actual text. In theory, the whole page could be drawn as a single large vector graphic.

It’s rarely done because of the cost in file size, though, and embedding fonts and text is quite common. Still, tabular views are not necessarily saved as text like that.

Each cell could be saved as a separate, freely positioned box, with no way to determine programmatically which row and column the cell belongs to, even though a human recognizes the table easily.

Just read the resume, or if you have a license for good OCR software, use that. Or require resumes to be submitted in a more machine-readable format.

outlog · June 20, 2019, 1:40pm

It’s difficult if the pdf is not systematically created by some system (say bank statement pdfs from your bank)

See Lib for pdf processing, and experiment with parsing the output from pdftotext. For example, LinkedIn URLs and emails should be easy; phone numbers and addresses will be more difficult.
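The suggestion above, pulling URLs and emails out of the pdftotext output, could be sketched roughly like this. The module name `ResumeFields` and the regular expressions are illustrative assumptions, not something from this thread; the patterns are deliberately loose and will need tuning against real resumes.

```elixir
defmodule ResumeFields do
  # Hypothetical helper: the patterns below are loose illustrations,
  # not a complete grammar for emails or profile URLs.
  defp email_regex, do: ~r/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/
  defp linkedin_regex, do: ~r{(?:https?://)?(?:www\.)?linkedin\.com/in/[\w%-]+}i
  defp github_regex, do: ~r{(?:https?://)?(?:www\.)?github\.com/[\w-]+}i

  # Returns a map of field name => list of unique matches found in the text.
  def extract(text) when is_binary(text) do
    %{
      emails: scan(email_regex(), text),
      linkedin: scan(linkedin_regex(), text),
      github: scan(github_regex(), text)
    }
  end

  # The patterns have no capture groups, so each scan result is a
  # one-element list holding the full match.
  defp scan(regex, text) do
    regex
    |> Regex.scan(text)
    |> List.flatten()
    |> Enum.uniq()
  end
end
```

Calling `ResumeFields.extract(pdf_as_text)` on the text returned by pdftotext gives one list per field. Phone numbers are left out on purpose, since their formats vary too much for a single simple pattern.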

tme_317 · June 20, 2019, 2:21pm

I use Ghostscript to convert the PDF to a txt file then attempt to parse it.

System.cmd("gs", ["-sDEVICE=txtwrite", "-o#{txt_path}", pdf_path])

Parsing this resulting text file can be difficult if the PDFs are not of consistent format but at least you have text to work with.

gpatankar21 · June 21, 2019, 10:22am

I used pdftotext to convert the content of the PDF, but it does not work for large PDFs.
It works when the PDF is about 5 KB or smaller. Otherwise pdf_as_text is empty.

{pdf_as_text,_} = System.cmd("pdftotext", ~w[#{attrs["attachment"].filename} -], cd: "/home/gagan/aviahire-web/uploads/gagan/applicant/3/attachments/thumb")
IO.puts("++++++++")
IO.puts(pdf_as_text)
IO.puts("++++++++")

NobbZ · June 21, 2019, 10:28am

What is the exit code? You really shouldn’t ignore it.

And what do you see when you try to convert the bigger PDFs manually from the terminal?

gpatankar21 · June 21, 2019, 12:30pm

If I try to convert bigger PDFs, I get an empty string in the pdf_as_text variable.
Is there any way I can convert bigger PDFs into text and then extract contents like URLs and phone numbers?

peerreynders · June 21, 2019, 12:54pm

I get an empty string in the pdf_as_text variable.

That isn’t the issue. You are using:

{pdf_as_text,_} = System.cmd(...)

You are ignoring the exit_status from System.cmd/3

You have already observed that you are getting an empty string in pdf_as_text. So use instead:

{pdf_as_text, exit_status} = System.cmd(...)

and find out what the value of exit_status is - its value may give you some hint as to why pdf_as_text is empty.

For example, the pdftotext man page lists these exit codes:

0 No error.
1 Error opening a PDF file.
2 Error opening an output file.
3 Error related to PDF permissions.
99 Other error.
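Putting the advice above together, a minimal sketch that keeps the exit status and turns it into a readable message might look like this. The module name `PdfText` and its function names are hypothetical; the code table is copied from the list above, and `System.cmd/3` returns `{output, exit_status}`.

```elixir
defmodule PdfText do
  # Exit codes as listed above for pdftotext.
  @exit_codes %{
    0 => "No error.",
    1 => "Error opening a PDF file.",
    2 => "Error opening an output file.",
    3 => "Error related to PDF permissions.",
    99 => "Other error."
  }

  # Translates an exit status into the documented message.
  def describe_exit(status) do
    Map.get(@exit_codes, status, "Unknown exit status #{status}")
  end

  # Runs `pdftotext <pdf_path> -`; the trailing "-" sends the text to stdout.
  # Note: System.cmd/3 raises if the pdftotext executable is not installed.
  def to_text(pdf_path) do
    case System.cmd("pdftotext", [pdf_path, "-"]) do
      {text, 0} -> {:ok, text}
      {_output, status} -> {:error, status, describe_exit(status)}
    end
  end
end
```

Passing the arguments as a plain list rather than through `~w[...]` also keeps a filename containing spaces as a single argument.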

NobbZ · June 21, 2019, 1:12pm

Besides what @peerreynders says, please check by running the command in a terminal, as there you might see output on stderr, which is ignored by System.cmd/3 (unless it is combined with stdout using an option to the call).
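The option mentioned above is `stderr_to_stdout: true`. As a rough sketch (the `Cmd` module and `run/2` function are made-up names for illustration), it can be used to capture whatever the command prints on stderr along with the exit status:

```elixir
defmodule Cmd do
  # Hypothetical wrapper: runs a command with stderr merged into stdout,
  # so error messages show up in the returned output instead of being lost.
  def run(cmd, args) when is_binary(cmd) and is_list(args) do
    case System.cmd(cmd, args, stderr_to_stdout: true) do
      {output, 0} -> {:ok, output}
      {output, status} -> {:error, status, output}
    end
  end
end
```

For the case in this thread the call would be something like `Cmd.run("pdftotext", ["resume.pdf", "-"])`, where `resume.pdf` is a placeholder path. One trade-off: with stderr merged in, any warnings are mixed into the extracted text on success as well, so this is mostly useful while debugging.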

gpatankar21 · June 21, 2019, 1:43pm

What is “Other error”?
I am getting 99 as the exit status.
How do I resolve that error?

NobbZ · June 21, 2019, 1:44pm

Have you tried to invoke the command from the terminal? Does it produce any output?

gpatankar21 · June 21, 2019, 1:52pm

I tried to run the command in the terminal, but it’s not working.

NobbZ · June 21, 2019, 1:59pm

What does “it’s not working” mean for you? Does it produce any output? (Note that this is not the first time I’m asking for this output…)

sribe · June 21, 2019, 3:48pm

Do you have any idea what’s actually in your larger PDF? I ask because larger might mean it’s scanned, and that it’s just images, with no text present. So yeah, do what others are asking, run command, check return code & stderr. But realize that if you want to get text from an image, you’re talking about running OCR software.

But for those that do contain text, I’d recommend Apache Tika. People are correct that there’s no requirement that text in a PDF be available in any way that you can make sense of: since a PDF is a list of drawing commands, it could draw the characters in random order. But in real life the PDFs that you come across that are exported from word processing documents or created via print-to-pdf actually contain the text in a usable way. (Columns and tables are still tricky. But plain text is pretty easy to handle.)
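One cheap way to tell the two cases apart, assuming extraction itself succeeded with exit status 0, is to check whether the extracted text is blank. This is only a heuristic sketch, and `TextCheck.needs_ocr?/1` is a hypothetical name: an image-only PDF typically yields nothing but whitespace and page separators.

```elixir
defmodule TextCheck do
  # Heuristic: if nothing but whitespace remains (pdftotext separates pages
  # with form feed characters), the PDF probably contains only images.
  def needs_ocr?(extracted_text) when is_binary(extracted_text) do
    String.trim(extracted_text) == ""
  end
end
```

A `true` result only suggests that OCR is needed. It does not distinguish a scanned document from a genuinely empty one.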

gpatankar21 · June 23, 2019, 4:41am

I ran the same command in the terminal, but I am getting an error. What is meant by invoking the command from the terminal?
Also, please tell me how I can avoid the errors indicated by exit status 99.

NobbZ · June 23, 2019, 6:53am

Which?

Invoking means the same as running in this context.

Depends on the actual error, which you refuse to tell us.

coilardium · November 1, 2021, 7:20am

How do I integrate this lib, GitHub - jalan/pdftotext: Simple PDF text extraction, in my Elixir app?