Lib for PDF processing



I'm looking for a way to get the content of a PDF in text format.

Any ideas?



Have you looked at:

More PDF libraries here.


It very much depends on what kind of PDFs you are trying to read. I use something like this to scrape PDF bank statements:

{pdf_as_text, _} = System.cmd("pdftotext", ~w[-raw bank_statement.pdf -], cd: "/Users/myuser/hello_phoenix/pdfs/")

pdf_as_text
|> String.split("\n")
|> Enum.map(fn line ->
  # Captures: full_date, text, date, amount, total
  regexline = Regex.run(~r/(^\d{1,2}\.\d{1,2}\.\d{4})\s(.+?(?=\s\d{1,2}\.\d{1,2}))\s(\d{1,2}\.\d{1,2})\s(.+?(?=\s))\s(.*)/, line)
  case regexline do
    [_, full_date, text, date, amount, total] ->
      {full_date, text, date, amount, total}
    _ ->
      # Regex.run returns nil for lines that do not match
      nil
  end
end) |> Enum.reject(fn(x) -> x == nil end) |> Enum.to_list |> IO.inspect

This basically gets the raw text from the PDF and then runs a regex on each line, dropping the lines that don't match. Also explore the -layout option for pdftotext, depending on your use case…
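To try the per-line regex step without a PDF at hand, here is a minimal sketch that runs the same pattern over a hard-coded string. The module name StatementLine and the sample lines are made up for illustration; real pdftotext output will look different depending on the statement layout, so treat the pattern as a starting point only.

```elixir
defmodule StatementLine do
  # Hypothetical helper: applies the same regex as the snippet above to one line.
  # Returns a 5-tuple for a matching line, or nil for anything else.
  def parse(line) do
    regex =
      ~r/(^\d{1,2}\.\d{1,2}\.\d{4})\s(.+?(?=\s\d{1,2}\.\d{1,2}))\s(\d{1,2}\.\d{1,2})\s(.+?(?=\s))\s(.*)/

    case Regex.run(regex, line) do
      [_, full_date, text, date, amount, total] ->
        {full_date, text, date, amount, total}

      _ ->
        nil
    end
  end
end

# Invented sample text standing in for the output of `pdftotext -raw`.
sample = """
Account statement
12.03.2016 Grocery store 12.03 -45,50 1.234,56
13.03.2016 Salary 13.03 2.000,00 3.234,56
"""

sample
|> String.split("\n")
|> Enum.map(&StatementLine.parse/1)
|> Enum.reject(&is_nil/1)
|> IO.inspect()
```

Keeping the parsing in its own function makes it easier to adjust the regex if you switch to -layout, since the column spacing in that mode will likely differ from -raw.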


Thank you both. pdf2htmlex looks nice, but pdftotext is what I was looking for.


Just adding this example in case somebody wants to scrape a PDF that is stored in a DB, downloaded over HTTP, etc.

You need Porcelain with the goon dependency installed (“go build” it yourself from source, as the shipped binary is currently the wrong version, at least for Mac):

  def to_text_from_pdf(msg) do
    decoded = :base64.decode(msg.pdffile)
    %Porcelain.Result{out: output, status: status} = Porcelain.exec("pdftotext", ["-raw", "-", "-"], in: decoded)