Max

Lib for pdf processing

Hi,

I'm looking for a way to get the content of a PDF in a text format.

Any ideas?

Thanks

Marked As Solved

outlog

It very much depends on what kind of PDFs you are trying to read. I use something like this to scrape PDF bank statements:

# Dump the PDF as raw text to stdout; the exit status is ignored here
{pdf_as_text, _exit_status} =
  System.cmd("pdftotext", ~w[-raw bank_statement.pdf -], cd: "/Users/myuser/hello_phoenix/pdfs/")

# Captures: full date, text, short date, amount, total
line_regex = ~r/(^\d{1,2}\.\d{1,2}\.\d{4})\s(.+?(?=\s\d{1,2}\.\d{1,2}))\s(\d{1,2}\.\d{1,2})\s(.+?(?=\s))\s(.*)/

pdf_as_text
|> String.split("\n")
|> Enum.map(fn line ->
  case Regex.run(line_regex, line) do
    [_, full_date, text, date, amount, total] -> {full_date, text, date, amount, total}
    _ -> nil
  end
end)
|> Enum.reject(&is_nil/1)
|> IO.inspect()

This basically gets the raw text from the PDF and then runs a regex on each line. Also explore the -layout option for pdftotext, depending on your use case.
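If you want to switch between -raw and -layout and also notice when pdftotext fails, a minimal sketch could look like the following. The module name `PdfText`, the function names, and the sample statement line are hypothetical and only for illustration; the regex is the one from the snippet above, and it will only fit statements that follow the same "date text date amount total" layout.

```elixir
defmodule PdfText do
  # Hypothetical helper module; names are illustrative, not part of any library.

  # Runs pdftotext on `path` and returns {:ok, text} or {:error, reason}.
  # `mode` is :raw or :layout and maps to the -raw / -layout flags.
  def extract(path, mode \\ :raw) when mode in [:raw, :layout] do
    case System.find_executable("pdftotext") do
      nil ->
        {:error, :pdftotext_not_found}

      exe ->
        # The trailing "-" asks pdftotext to write the text to stdout.
        case System.cmd(exe, ["-#{mode}", path, "-"], stderr_to_stdout: true) do
          {text, 0} -> {:ok, text}
          {output, status} -> {:error, {status, output}}
        end
    end
  end

  # Parses one statement line into a tuple, or returns nil if it does not match.
  def parse_line(line) do
    case Regex.run(line_regex(), line) do
      [_, full_date, text, date, amount, total] -> {full_date, text, date, amount, total}
      _ -> nil
    end
  end

  # Same pattern as in the snippet above.
  defp line_regex do
    ~r/(^\d{1,2}\.\d{1,2}\.\d{4})\s(.+?(?=\s\d{1,2}\.\d{1,2}))\s(\d{1,2}\.\d{1,2})\s(.+?(?=\s))\s(.*)/
  end
end

# Made-up sample line, just to show what the regex captures:
IO.inspect(PdfText.parse_line("01.02.2016 Grocery store 01.02 -45,90 1234,56"))
```

With a real file you would call something like `PdfText.extract("bank_statement.pdf", :layout)` and feed each line of the returned text to `parse_line/1`. Whether -raw or -layout gives easier lines to match depends on the PDF, so it is worth trying both.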

Also Liked

AstonJ

Have you looked at:

More PDF libraries here.

outlog

Just adding this example in case somebody wants to scrape a PDF that is stored in a DB, downloaded over HTTP, etc.

You need Porcelain with the goon dependency installed ("go build" it yourself from source, as the shipped binary is currently the wrong version, at least for Mac):

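  def to_text_from_pdf(msg) do
    # msg.pdffile is expected to hold the PDF as a Base64-encoded binary
    decoded = :base64.decode(msg.pdffile)

    # "-" as both input and output makes pdftotext read from stdin and write to stdout
    %Porcelain.Result{out: output} = Porcelain.exec("pdftotext", ["-raw", "-", "-"], in: decoded)
    output
  end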
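If you would rather avoid Porcelain and goon, one alternative (a different technique from the one above, not a drop-in replacement) is to write the binary to a temporary file and call pdftotext through System.cmd. This is only a sketch: the module name `PdfFromBinary` and the temp-file naming are assumptions, and it expects the already decoded PDF binary rather than the Base64 string.

```elixir
defmodule PdfFromBinary do
  # Hypothetical helper; writes the PDF to a temp file instead of piping via stdin.

  # Returns {:ok, text} or {:error, reason} for an in-memory PDF binary.
  def to_text(pdf_binary) when is_binary(pdf_binary) do
    case System.find_executable("pdftotext") do
      nil ->
        {:error, :pdftotext_not_found}

      exe ->
        tmp = Path.join(System.tmp_dir!(), "pdf_#{System.unique_integer([:positive])}.pdf")
        File.write!(tmp, pdf_binary)

        try do
          # The trailing "-" asks pdftotext to write the text to stdout.
          case System.cmd(exe, ["-raw", tmp, "-"], stderr_to_stdout: true) do
            {text, 0} -> {:ok, text}
            {output, status} -> {:error, {status, output}}
          end
        after
          # Always clean up the temp file, even if pdftotext fails.
          File.rm(tmp)
        end
    end
  end
end
```

For the Base64 case above you would call it as `PdfFromBinary.to_text(:base64.decode(msg.pdffile))`. The trade-off is an extra disk write in exchange for not needing the goon binary.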
