Parsing a PDF file

I am required to parse a resume in PDF format to extract fields like phone number, GitHub URL, LinkedIn URL, etc. Is there any way to parse the PDF and extract this data?

I tend to say no, unless you have a proper FORM, and even then it won’t be easy AFAIK. I have not used any tools that would do so, though, as we read and process our resumes manually at my company.

What makes it hard is that a PDF does not necessarily store a single letter of the text you read as a human as actual text. In theory, the whole page could be drawn as a single large vector graphic.

It’s rarely done because of the cost in file size, though, and embedding fonts and text is quite common. Still, tabular views are not necessarily saved like that.

They could be saved as one freely positioned box per cell, with no way to tell programmatically which row and column a cell belongs to, even though a human recognizes it easily.

Just read the resume, or if you have a license for good OCR software, use that. Or require resumes to be submitted in a more machine-readable format.


It’s difficult if the PDF is not systematically created by some system (say, bank statement PDFs from your bank).

See Lib for pdf processing, and experiment with parsing the output from pdftotext. For example, LinkedIn URIs and emails should be easy; things like phone numbers and addresses will be more difficult.
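As a rough illustration of parsing the pdftotext output, here is a minimal Elixir sketch that pulls emails, LinkedIn and GitHub URLs, and phone-like digit runs out of text that has already been extracted. The module name ResumeFields and all of the regular expressions are assumptions made up for this example, not part of any library. The phone pattern in particular is deliberately loose and will likely produce false positives on real resumes.

```elixir
defmodule ResumeFields do
  # Hypothetical helper: every pattern below is an illustrative guess,
  # not a complete validator for the field it targets.
  defp email_regex, do: ~r/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/
  defp linkedin_regex, do: ~r{(?:https?://)?(?:www\.)?linkedin\.com/in/[\w%-]+}i
  defp github_regex, do: ~r{(?:https?://)?(?:www\.)?github\.com/[\w-]+}i
  # Optional "+", a digit, at least 7 digits/separators, then a closing digit.
  defp phone_regex, do: ~r/\+?\d[\d\s().-]{7,}\d/

  # Returns a map with a list of unique matches for each field.
  def extract(text) do
    %{
      emails: scan(email_regex(), text),
      linkedin: scan(linkedin_regex(), text),
      github: scan(github_regex(), text),
      phones: scan(phone_regex(), text)
    }
  end

  # The patterns contain no capturing groups, so each scan result
  # is a one-element list holding the full match.
  defp scan(regex, text) do
    regex |> Regex.scan(text) |> List.flatten() |> Enum.uniq()
  end
end
```

Calling ResumeFields.extract(pdf_as_text) on the converted text would then give one list per field; an empty list means nothing matched, which is worth handling explicitly since resume layouts vary a lot.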


I use Ghostscript to convert the PDF to a txt file then attempt to parse it.

System.cmd("gs", ["-sDEVICE=txtwrite", "-o#{txt_path}", pdf_path])

Parsing the resulting text file can be difficult if the PDFs are not in a consistent format, but at least you have text to work with.


I used pdftotext to convert the content of the PDF, but it does not work for large PDFs.
It works when the PDF size is about 5 kB or less. Otherwise the content of pdf_as_text is empty.

{pdf_as_text,_} = System.cmd("pdftotext", ~w[#{attrs["attachment"].filename} -], cd: "/home/gagan/aviahire-web/uploads/gagan/applicant/3/attachments/thumb")
IO.puts("++++++++")
IO.puts(pdf_as_text)
IO.puts("++++++++")

What is the exit code? You really shouldn’t ignore it.

And what do you see when you try to convert the bigger PDFs manually from the terminal?


If I try to convert bigger PDFs, I get an empty string in the pdf_as_text variable.
Is there any way I can convert bigger PDFs into text and then extract contents like URLs and phone numbers?


I get an empty string in the pdf_as_text variable.

That isn’t the issue. You are using:

{pdf_as_text,_} = System.cmd(...)

You are ignoring the exit_status from System.cmd/3.

You have already observed that you are getting an empty string in pdf_as_text. So use instead:

{pdf_as_text, exit_status} = System.cmd(...)

and find out what the value of exit_status is - its value may give you some hint as to why pdf_as_text is empty.

For example, the pdftotext man page lists these exit codes:

0 No error.
1 Error opening a PDF file.
2 Error opening an output file.
3 Error related to PDF permissions.
99 Other error.
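To make the suggestion concrete, here is a minimal sketch of a wrapper that keeps the exit status instead of discarding it and maps it to the descriptions listed above. The module name PdfText and the shape of the return tuples are assumptions chosen for this example. It also assumes the pdftotext executable is on the PATH; System.cmd/3 raises if the executable cannot be found.

```elixir
defmodule PdfText do
  # Exit codes as listed above for pdftotext.
  @exit_codes %{
    0 => "No error.",
    1 => "Error opening a PDF file.",
    2 => "Error opening an output file.",
    3 => "Error related to PDF permissions.",
    99 => "Other error."
  }

  # Human-readable description for an exit status.
  def describe(status), do: Map.get(@exit_codes, status, "Unknown exit status.")

  # Runs `pdftotext <pdf_path> -` (text goes to stdout) and checks the status.
  # Hypothetical return shape: {:ok, text} or {:error, status, description}.
  def extract(pdf_path) do
    case System.cmd("pdftotext", [pdf_path, "-"]) do
      {text, 0} -> {:ok, text}
      {_output, status} -> {:error, status, describe(status)}
    end
  end
end
```

With this in place, a caller can pattern match on {:error, status, reason} and log the reason rather than silently continuing with an empty string.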


Besides what @peerreynders says, please check by running the command in a terminal, as there you might see output on stderr, which System.cmd/3 does not capture (unless it is combined with stdout using an option to the call).
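The option in question is stderr_to_stdout: true. The small script below shows the difference, using a shell one-liner as a stand-in for a failing pdftotext call. The one-liner and its message are made up for the demonstration, and a POSIX sh is assumed to be available.

```elixir
# Stand-in for a failing command: writes a message to stderr, then exits with 99.
# Assumption: a POSIX `sh` is on the PATH.
script = "echo 'some error message' 1>&2; exit 99"

# Default behaviour: stderr is not captured, so `output` is an empty string
# even though the command printed an error (it goes to the terminal instead).
{output, status} = System.cmd("sh", ["-c", script])
IO.inspect({output, status}, label: "without stderr_to_stdout")

# With stderr_to_stdout: true, the error message ends up in the returned string.
{combined, status2} = System.cmd("sh", ["-c", script], stderr_to_stdout: true)
IO.inspect({combined, status2}, label: "with stderr_to_stdout")
```

Passing the same option to the pdftotext call should put whatever the tool prints on stderr into the returned string, which is usually more informative than the bare exit status.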


What is “Other error”?
I am getting 99 as the exit status.
How do I resolve that error?


Have you tried to invoke the command from the terminal? Does it produce any output?


I tried to run the command in the terminal; it’s not working.

What does “it’s not working” mean for you? Does it produce any output? (Not that I’m asking for this output for the first time…)


Do you have any idea what’s actually in your larger PDF? I ask because larger might mean it’s scanned, and that it’s just images, with no text present. So yeah, do what others are asking, run command, check return code & stderr. But realize that if you want to get text from an image, you’re talking about running OCR software.
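One cheap heuristic along these lines: if the conversion succeeds but the result contains nothing except whitespace, the PDF is probably image-only. The sketch below is only that, a heuristic under the assumption that the text was produced by a tool like pdftotext, which separates pages with form feed characters. The module name ScanCheck is made up for this example.

```elixir
defmodule ScanCheck do
  # True when the extracted text has at least one non-whitespace character.
  # Form feeds and newlines count as whitespace, so a result consisting
  # only of page separators is treated as "no text".
  def has_text?(text), do: Regex.match?(~r/\S/, text)
end
```

A result where ScanCheck.has_text?/1 returns false together with exit status 0 would point towards a scanned document that needs OCR, while a non-zero exit status points to a problem with the command itself.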

But for those that do contain text, I’d recommend Apache Tika. People are correct that there’s no requirement that text in a PDF be available in any way that you can make sense of: since it’s a list of drawing commands, it could draw the characters in random order. But in real life, the PDFs you come across that are exported from word processing documents or created via print-to-PDF actually contain the text in a usable way. (Columns and tables are still tricky, but plain text is pretty easy to handle.)


I ran the same command in the terminal, but I am getting an error. What is meant by invoking a command from the terminal?
Also, please tell me how I can avoid the errors indicated by exit status 99.

Which?

Invoking means the same as running in this context.

Depends on the actual error, which you refuse to tell us.

How do I integrate this lib, GitHub - jalan/pdftotext: Simple PDF text extraction, in my Elixir app?