Max
Lib for pdf processing
Hi,
i´m looking for a way to get the content of a pdf in a text format.
Any ideas?
thanks
Marked As Solved
outlog
It very much depends on what kind of pdfs you are trying to read, I use something like this to scrape pdf bank statements:
{pdf_as_text,_} = System.cmd("pdftotext", ~w[-raw bank_statement.pdf -], cd: "/Users/myuser/hello_phoenix/pdfs/")
pdf_as_text
|> String.split("\n")
|> Stream.map(fn line ->
regexline = Regex.run(~r/(^\d{1,2}\.\d{1,2}\.\d{4})\s(.+?(?=\s\d{1,2}\.\d{1,2}))\s(\d{1,2}\.\d{1,2})\s(.+?(?=\s))\s(.*)/, line)
case regexline do
[_,full_date, text, date, amount, total] ->
{full_date, text, date, amount, total}
_ ->
nil
end
end) |> Enum.reject( fn(x) -> x == nil end) |> Enum.to_list |> IO.inspect
basically gets raw text from pdf and then runs a regex on each line, also explore the -layout option for pdftotext, depending on your use case..
5
Also Liked
outlog
just adding this example if somebody wants to scrape a pdf that you have stored in DB, or downloaded through http etc.
you need Porcelain, and with the goon dependency installed (“go build” it yourself from source as the shipped binary is currently wrong version - at least for mac):
def to_text_from_pdf(msg) do
decoded = :base64.decode(msg.pdffile)
%Porcelain.Result{out: output, status: status} = Porcelain.exec("pdftotext", ["-raw", "-", "-"], in: decoded)
output
end
2
Popular in Questions
In Ruby, I can go:
User.find_by(email: "foobar@email.com").update(email: "hello@email.com")
How can I do something similar in Elixir?
...
New
I have an umbrella app.
Some of the apps inside depend on other apps in the umbrella, unsurprisingly.
I’m writing a test for one of the...
New
We have an ECS cluster with 4 services, where each task joins a single cluster, via discovery ECS discovery service.
Currently when I de...
New
I am trying to figure out how Mix knows whether the environment is test, dev, or prod – where is this set?
Thanks.
New
Hello, I get Persian date from my client and convert it to normal calendar like this:
def jalali_string_to_miladi_english_number(persi...
New
I want to highlight html closing tags when i click a html tag. That works in .html files but doesnt work for html.eex templates. How can...
New
I am VERY much an elixir newbie. I have taken one elixir course and one phoenix course on Udemy. During that course, I saw the instructor...
New
I am trying to run a deploy with docker and I successfully runned with this command:
docker build -t romenigld/blog-prod .
but when I t...
New
Hi. I’ve noticed that Windows Powershell has it’s own IEX command and you cannot access Elixir’s IEX due to the conflict. This isn’t a cr...
New
Hi!
Currently I want to submit a form by pressing the Enter key. However, since my input field is of type “textarea” this is just adds a...
New
Other popular topics
If I have a post route which an argument:
post /my_post_route/:my_param1, MyController.my_post_handler
How would get the post params ...
New
Hello, can anybody help here..? I have a list of players and I what to delete an element, but every for loop the list is reverting to ori...
New
I have an umbrella app.
Some of the apps inside depend on other apps in the umbrella, unsurprisingly.
I’m writing a test for one of the...
New
i’m a new one to elixir
which editor can i use
vs code? or atom?
Thanks! :smiley:
New
I tried installing
elixir 1.11.2
erlang 23.3.4
via asdf in my zsh shell. Enabled the versions locally and globally.
When I list them ...
New
I am using Ecto timestamps with postgres, I can see the timestamps() use the :naive_dateime but for my use case I wanted to store the ti...
New
Hello everyone,
Long time lurker first time poster here. I’ve recently begun working on Elixir full-time again! :raised_hands: It’s been...
New
Hi there,
I am working with Ecto-Postgresql and I need to call all of the records from a specific table but the table has 40,000 records...
New
I had some trouble figuring out how to make many-to-many associations work. Once I got it working, I wrote a blog post. Because I’m a nov...
New
There are pre-rolled solutions for other frameworks that do work. However, Phoenix does not seem to have these. Have people had good expe...
New








