Lib for PDF processing

Hi,

I'm looking for a way to get the content of a PDF in a text format.

Any ideas?

Thanks


Have you looked at pdf2htmlEX?

More PDF libraries here.


It very much depends on what kind of PDFs you are trying to read. I use something like this to scrape PDF bank statements:

# Run pdftotext in the PDF directory; the trailing "-" sends the text to stdout
{pdf_as_text, _exit_status} =
  System.cmd("pdftotext", ~w[-raw bank_statement.pdf -], cd: "/Users/myuser/hello_phoenix/pdfs/")

# Captures: full date, description, short date, amount, running total
statement_line =
  ~r/(^\d{1,2}\.\d{1,2}\.\d{4})\s(.+?(?=\s\d{1,2}\.\d{1,2}))\s(\d{1,2}\.\d{1,2})\s(.+?(?=\s))\s(.*)/

pdf_as_text
|> String.split("\n")
|> Enum.map(fn line ->
  case Regex.run(statement_line, line) do
    [_, full_date, text, date, amount, total] -> {full_date, text, date, amount, total}
    # Lines that are not statement rows (headers, footers, blanks) become nil
    _ -> nil
  end
end)
|> Enum.reject(&is_nil/1)
|> IO.inspect()

Basically, this gets the raw text from the PDF and then runs a regex on each line. Also explore the -layout option for pdftotext, depending on your use case…
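
To make the regex step easier to try without a PDF at hand, here is a minimal sketch that applies the same pattern to a single line. The module name `StatementParser`, the function `parse_line/1`, and the sample line are made up for illustration; a real statement will most likely need its own pattern.

```elixir
defmodule StatementParser do
  # Hypothetical wrapper around the regex from the post above.
  # Returns a 5-tuple for a statement row, or nil for any other line.
  def parse_line(line) do
    case Regex.run(line_regex(), line) do
      [_whole, full_date, text, date, amount, total] ->
        {full_date, text, date, amount, total}

      _ ->
        nil
    end
  end

  # Captures: full date, description, short date, amount, running total.
  defp line_regex do
    ~r/(^\d{1,2}\.\d{1,2}\.\d{4})\s(.+?(?=\s\d{1,2}\.\d{1,2}))\s(\d{1,2}\.\d{1,2})\s(.+?(?=\s))\s(.*)/
  end
end

# Invented sample row, only to show the shape of the result.
sample = "01.02.2016 Grocery store 01.02 -123,45 1.000,00"

IO.inspect(StatementParser.parse_line(sample))
# {"01.02.2016", "Grocery store", "01.02", "-123,45", "1.000,00"}

IO.inspect(StatementParser.parse_line("Page 1 of 3"))
# nil
```

Keeping the parsing in its own function makes it simple to check the pattern against a few lines copied out of the pdftotext output before running it over a whole file.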


Thank you both. pdf2htmlex looks nice, but pdftotext is what I was looking for.

Just adding this example in case somebody wants to scrape a PDF that is stored in a DB, downloaded through HTTP, etc.

You need Porcelain with the goon dependency installed (“go build” it yourself from source, as the shipped binary is currently the wrong version, at least for Mac):

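  def to_text_from_pdf(msg) do
    # msg.pdffile holds the PDF as a Base64-encoded string
    decoded = :base64.decode(msg.pdffile)

    # The two "-" arguments stand for stdin (PDF input) and stdout (text output)
    %Porcelain.Result{out: output, status: _status} =
      Porcelain.exec("pdftotext", ["-raw", "-", "-"], in: decoded)

    output
  end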
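
If you would rather not depend on Porcelain and goon, one possible alternative is to write the decoded bytes to a temporary file and call pdftotext through `System.cmd/3`, as in the earlier post. This is only a sketch under a few assumptions: pdftotext is on the PATH, the PDF arrives as a Base64 string, and the module name `PdfText`, the function `from_base64/2`, and the error atoms are invented for illustration.

```elixir
defmodule PdfText do
  # Hypothetical helper: takes a Base64-encoded PDF and returns
  # {:ok, text} or {:error, reason}.
  def from_base64(encoded, executable \\ "pdftotext") do
    with {:ok, pdf_bytes} <- decode(encoded),
         {:ok, exe} <- find(executable) do
      run(exe, pdf_bytes)
    end
  end

  defp decode(encoded) do
    case Base.decode64(encoded) do
      {:ok, bytes} -> {:ok, bytes}
      :error -> {:error, :invalid_base64}
    end
  end

  # Check for the binary first, because System.cmd/3 raises if it is missing.
  defp find(executable) do
    case System.find_executable(executable) do
      nil -> {:error, :executable_not_found}
      path -> {:ok, path}
    end
  end

  defp run(exe, pdf_bytes) do
    tmp = Path.join(System.tmp_dir!(), "pdf_#{System.unique_integer([:positive])}.pdf")
    File.write!(tmp, pdf_bytes)

    try do
      # The trailing "-" sends the extracted text to stdout.
      case System.cmd(exe, ["-raw", tmp, "-"], stderr_to_stdout: true) do
        {text, 0} -> {:ok, text}
        {output, status} -> {:error, {:pdftotext_failed, status, output}}
      end
    after
      # Remove the temporary file whether or not pdftotext succeeded.
      File.rm(tmp)
    end
  end
end
```

With this you would call something like `PdfText.from_base64(msg.pdffile)` and pattern match on the result. The trade-off is an extra disk write per PDF in exchange for having no external Elixir or Go dependencies.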

How do I integrate this lib, jalan/pdftotext (Simple PDF text extraction), into my Elixir app?