Lib for pdf processing

Max · December 27, 2016, 10:33am

Hi,

I'm looking for a way to get the content of a PDF in a text format.

Any ideas?

Thanks

AstonJ · December 27, 2016, 12:58pm

Have you looked at:

More PDF libraries here.

outlog · December 27, 2016, 7:53pm

It very much depends on what kind of PDFs you are trying to read. I use something like this to scrape PDF bank statements:

# Dump the PDF as raw text to stdout ("-") and capture it
{pdf_as_text, _exit_status} =
  System.cmd("pdftotext", ~w[-raw bank_statement.pdf -], cd: "/Users/myuser/hello_phoenix/pdfs/")

# Captures: full date, text, short date, amount, total
line_regex =
  ~r/(^\d{1,2}\.\d{1,2}\.\d{4})\s(.+?(?=\s\d{1,2}\.\d{1,2}))\s(\d{1,2}\.\d{1,2})\s(.+?(?=\s))\s(.*)/

pdf_as_text
|> String.split("\n")
|> Stream.map(fn line ->
  case Regex.run(line_regex, line) do
    [_, full_date, text, date, amount, total] -> {full_date, text, date, amount, total}
    _ -> nil
  end
end)
# Drop the lines that did not match the pattern
|> Enum.reject(&is_nil/1)
|> IO.inspect()

This basically gets the raw text from the PDF and then runs a regex on each line. Depending on your use case, also explore the -layout option for pdftotext.
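To make the per-line step easier to try out, here is a minimal sketch that applies the same regex to a single made-up line. The module name StatementLine, the function parse/1 and the sample line are hypothetical and only for illustration; real statements will most likely need a different pattern.

```elixir
defmodule StatementLine do
  # Same pattern as in the snippet above.
  # Captures: full date, text, short date, amount, total.
  defp line_regex do
    ~r/(^\d{1,2}\.\d{1,2}\.\d{4})\s(.+?(?=\s\d{1,2}\.\d{1,2}))\s(\d{1,2}\.\d{1,2})\s(.+?(?=\s))\s(.*)/
  end

  # Returns a 5-tuple for lines that match the pattern, nil otherwise.
  def parse(line) do
    case Regex.run(line_regex(), line) do
      [_, full_date, text, date, amount, total] -> {full_date, text, date, amount, total}
      _ -> nil
    end
  end
end

# Made-up sample line, not taken from a real statement.
StatementLine.parse("27.12.2016 Grocery store 27.12 -123,45 1.000,00")
|> IO.inspect()
# => {"27.12.2016", "Grocery store", "27.12", "-123,45", "1.000,00"}

# Lines that are not transactions simply come back as nil.
StatementLine.parse("Page 1 of 3")
|> IO.inspect()
# => nil
```

Note that the pattern expects exactly one whitespace character between fields. With -layout, pdftotext tries to keep the physical layout of the page, so the spacing between columns will probably differ and the regex would need adjusting.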

Max · December 28, 2016, 7:29am

Thank you both. pdf2htmlex looks nice, but pdftotext is what I was looking for.

outlog · January 11, 2017, 12:57pm

Just adding this example in case somebody wants to scrape a PDF that is stored in a DB, downloaded over HTTP, etc.

You need Porcelain with the goon dependency installed (“go build” it yourself from source, as the shipped binary is currently the wrong version, at least for Mac):

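  def to_text_from_pdf(msg) do
    # msg.pdffile holds the PDF as a Base64-encoded string
    decoded = :base64.decode(msg.pdffile)

    # "-" "-" makes pdftotext read the PDF from stdin and write the text to stdout
    %Porcelain.Result{out: output, status: _status} =
      Porcelain.exec("pdftotext", ["-raw", "-", "-"], in: decoded)

    output
  end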

coilardium · November 1, 2021, 7:16am

How do I integrate this lib (jalan/pdftotext: Simple PDF text extraction) into my Elixir app?