Hi!
What is currently the best library/method for parsing text and tabular data out of PDF files in Elixir or Erlang?
I don't think there is one. But would love to be proved wrong. There are a couple of PDF parsers written in Elm, one with an interesting UI. You'd think that Elixir would be a more natural choice given its pattern matching. I have thought of converting one to Elixir, but it would likely only handle text and not tables.
I had a very similar idea. Until I started reading the PDF specification…
(Even after buying two technical books on it)
There is one in Python…
About 1 year ago, I attempted a "side-project" for a guy I met at a laptop repair place. It was a site with some simple functionality, but he wanted PDF parsing. I gave up after about 4 weekends.
I tried using the Python pdfreader library but found it extremely brittle. I'm actually re-reading the email chain from Aug 2023 about it, and I explained to the guy that unpredictable spacing and layouts made a catch-all solution almost impossible (for me, at least; I'm sure some AI wizards born and raised in soundproof bunkers deep under Silicon Valley could do it).
Anyway, I tried another approach which actually works very well, but only if the PDF format does not change. I wrote something that worked for all the test data presented, but the next batch of data was wildly different (different table headings, different number of columns, different data types, random annotations and notes in strange places).
This second approach used a Java jar called Tabula. I got it from here https://github.com/tabulapdf/tabula-java/releases/download/v1.0.5/tabula-1.0.5-jar-with-dependencies.jar but check around, maybe there are later releases. It worked super well (until the inputs started changing like crazy).
Anyway, if you want my honest opinion: do it yourself if you have strong control over the input PDFs and there is a tolerance for mistakes. Otherwise, use an established PDF parsing service.
With all that said, here is some Elixir code I wrote (hastily) about a year ago to parse a specific PDF. I am just copying and pasting it wholesale, so please don't judge it too harshly. Maybe you can scavenge it for parts.
Good luck.
defmodule Tnnt.PdfParser do
  @moduledoc """
  The PDF Parsing module

  There are 2 kinds of pages:

  1. bookings - lists many tenants with their references
  1. payments - lists payments for a single tenant
  """

  alias Ecto.Changeset
  alias Tnnt.Bookings.Booking
  alias Tnnt.Payments.Payment

  NimbleCSV.define(MyParser, separator: ",", escape: "\"")

  def pdf_hash(pdf_file_path) do
    pdf_file_path
    |> File.read!()
    |> then(&:crypto.hash(:sha256, &1))
  end

  def pdf_to_text(pdf_file_path) do
    java_args = [
      "-jar",
      "./tabula-1.0.5-jar-with-dependencies.jar",
      "--pages",
      "all",
      pdf_file_path
    ]

    {output, 0} = System.cmd("java", java_args)
    output
  end

  @doc """
  Parses the CSV text returned by pdf_to_text/1
  """
  def parse(text) do
    init_acc =
      %{
        booking_changesets: [],
        payment_changesets: [],
        unparseable_lines: [],
        imported_file_attrs: %{}
      }

    text
    |> MyParser.parse_string(skip_headers: false)
    |> Enum.reduce_while({:ok, init_acc}, fn cells, {:ok, acc} ->
      case parse_cells(cells, acc) do
        {:ok, new_acc} -> {:cont, {:ok, new_acc}}
        {:error, error} -> {:halt, {:error, error}}
      end
    end)
    |> then(fn result ->
      case result do
        {:ok, new_acc} ->
          new_acc
          |> Map.update(:unparseable_lines, [], &Enum.reverse/1)
          |> then(&{:ok, &1})

        {:error, error} ->
          {:error, error}
      end
    end)
  end

  @bookings_headers_0 [
    "BOOKING REFERENCE",
    "TENANT NAME",
    "DATES",
    "ROOM NUMBER",
    "TOTAL CONTRACT VALUE",
    "VALUE MINUS COMMS"
  ]
  @bookings_headers_1 @bookings_headers_0 ++ ["TRANSFER DUE"]
  @payments_headers ["PAYMENT DUE", "AMOUNT", "PAID BY"]
  @all_headers @bookings_headers_1 ++ @payments_headers

  @type header :: String.t()

  @typedoc """
  The parsing accumulator keeps track of the parsing context.
  """
  @type parsing_acc :: %{
          :current_page_type => :bookings | :payments,
          :current_booking_changeset => Changeset.t(),
          :current_headers => %{non_neg_integer() => header()},
          :current_location => String.t(),
          :booking_changesets => [Ecto.Changeset.t()],
          :unparseable_lines => [String.t()]
        }

  defp tidy_cell(cell) do
    cell
    |> String.replace("\"", "")
    |> String.replace(",", "")
    |> String.trim()
  end

  @spec parse_cells([String.t()], parsing_acc()) :: {:ok, parsing_acc()} | {:error, String.t()}
  defp parse_cells(["Confirmed bookings for " <> location | _], acc) do
    acc
    |> Map.put(:current_location, location)
    |> then(&{:ok, &1})
  end

  defp parse_cells(["Confirmed upcoming bookings for " <> location | _], acc) do
    acc
    |> Map.put(:current_location, location)
    |> then(&{:ok, &1})
  end

  defp parse_cells(cells, acc) do
    {header_pos, entry_pos} =
      cells
      |> Enum.with_index()
      |> Enum.reduce({%{}, %{}}, fn {cell, index}, {header_pos, entry_pos} ->
        tidied_cell = tidy_cell(cell)
        new_entry_pos = Map.put(entry_pos, index, tidied_cell)

        if tidied_cell in @all_headers do
          new_header_pos = Map.put(header_pos, index, tidied_cell)
          {new_header_pos, new_entry_pos}
        else
          {header_pos, new_entry_pos}
        end
      end)

    parsed_headers = Map.values(header_pos)
    found_all_payments_headers? = Enum.all?(@payments_headers, &(&1 in parsed_headers))
    found_all_bookings_headers? = Enum.all?(@bookings_headers_0, &(&1 in parsed_headers))

    new_acc =
      cond do
        found_all_payments_headers? ->
          acc
          |> Map.put(:current_page_type, :payments)
          |> Map.put(:current_headers, header_pos)

        found_all_bookings_headers? ->
          acc
          |> Map.put(:current_page_type, :bookings)
          |> Map.put(:current_headers, header_pos)

        true ->
          parse_non_header_line(acc, cells, entry_pos)
      end

    {:ok, new_acc}
  end

  defp parse_non_header_line(acc, cells, entry_pos) do
    current_headers = Map.get(acc, :current_headers, %{})
    current_location = Map.get(acc, :current_location)
    current_page_type = Map.get(acc, :current_page_type)

    if Enum.any?(current_headers) do
      {row_map, new_entry_pos} =
        Enum.reduce(current_headers, {%{}, entry_pos}, fn {index, header}, {row_map, entry_pos} ->
          {value, new_entry_pos} = Map.pop(entry_pos, index)
          new_row_map = Map.put(row_map, header, value)
          {new_row_map, new_entry_pos}
        end)

      notes =
        new_entry_pos
        |> Map.values()
        |> Enum.reject(&(String.length(&1) == 0))
        |> Enum.join("\n")

      if row_map |> Map.values() |> Enum.all?(&(String.length(&1) > 0)) do
        case current_page_type do
          :bookings ->
            booking_changesets = Map.get(acc, :booking_changesets, [])
            %{"BOOKING REFERENCE" => reference} = row_map

            existing_changeset_index =
              Enum.find_index(booking_changesets, fn booking_changeset ->
                Changeset.get_field(booking_changeset, :reference) == reference
              end)

            # if already have this booking, just add the TRANSFER DUE
            case existing_changeset_index do
              nil ->
                booking_changeset =
                  row_map
                  |> to_booking_attrs(current_location)
                  |> Booking.creation_changeset()

                acc
                |> Map.update(:booking_changesets, [booking_changeset], &[booking_changeset | &1])
                |> Map.put(:current_booking_changeset, booking_changeset)

              existing_changeset_index when is_integer(existing_changeset_index) ->
                transfer_due_string = Map.get(row_map, "TRANSFER DUE")
                transfer_due = parse_date(transfer_due_string, "/")

                booking_changeset =
                  acc.booking_changesets
                  |> Enum.at(existing_changeset_index)
                  |> Changeset.put_change(:transfer_due, transfer_due)

                acc
                |> Map.update!(:booking_changesets, fn booking_changesets ->
                  List.replace_at(booking_changesets, existing_changeset_index, booking_changeset)
                end)
                |> Map.put(:current_booking_changeset, booking_changeset)
            end

          :payments ->
            previous_payment_changeset =
              acc
              |> Map.get(:payment_changesets, [])
              |> List.first()

            payment_changeset =
              row_map
              |> to_payment_attrs(acc.current_booking_changeset, previous_payment_changeset)
              |> then(fn payment_attrs ->
                if String.length(notes) > 0 do
                  Map.put(payment_attrs, :notes, notes)
                else
                  payment_attrs
                end
              end)
              |> Payment.creation_changeset()

            Map.update(acc, :payment_changesets, [payment_changeset], &[payment_changeset | &1])
        end
      else
        maybe_add_line_to_notes(acc, cells)
      end
    else
      maybe_add_line_to_notes(acc, cells)
    end
  end

  defp maybe_add_line_to_notes(acc, cells) do
    reduced_cells = Enum.filter(cells, &(String.length(&1) > 0))

    if Enum.any?(reduced_cells) and !!acc.current_booking_changeset do
      current_reference = Changeset.get_field(acc.current_booking_changeset, :reference)
      new_notes = Enum.join(reduced_cells, ", ")

      updated_notes =
        case Changeset.get_field(acc.current_booking_changeset, :notes) do
          nil -> new_notes
          current_notes when is_binary(current_notes) -> current_notes <> "\n" <> new_notes
        end

      existing_changeset_index =
        Enum.find_index(acc.booking_changesets, fn booking_changeset ->
          Changeset.get_field(booking_changeset, :reference) == current_reference
        end)

      booking_changeset =
        acc.booking_changesets
        |> Enum.at(existing_changeset_index)
        |> Changeset.put_change(:notes, updated_notes)

      acc
      |> Map.update!(:booking_changesets, fn booking_changesets ->
        List.replace_at(booking_changesets, existing_changeset_index, booking_changeset)
      end)
      |> Map.put(:current_booking_changeset, booking_changeset)
    else
      acc
    end
  end

  defp to_booking_attrs(row_map, location) do
    reference = Map.get(row_map, "BOOKING REFERENCE")
    dates_string = Map.get(row_map, "DATES")
    room_number = Map.get(row_map, "ROOM NUMBER")
    tenant_name = Map.get(row_map, "TENANT NAME")
    total_contract_value_string = Map.get(row_map, "TOTAL CONTRACT VALUE")
    value_minus_comms_string = Map.get(row_map, "VALUE MINUS COMMS")
    transfer_due_string = Map.get(row_map, "TRANSFER DUE")

    total_contract_value = Money.parse(total_contract_value_string)
    value_minus_comms = Money.parse(value_minus_comms_string)
    {from_date, to_date} = parse_dates(dates_string)
    transfer_due = parse_date(transfer_due_string, "/")

    %{
      :reference => reference,
      :from_date => from_date,
      :to_date => to_date,
      :room_number => room_number,
      :tenant_name => tenant_name,
      :total_contract_value => total_contract_value,
      :value_minus_comms => value_minus_comms,
      :location => location,
      :transfer_due => transfer_due
    }
  end

  defp to_payment_attrs(row_map, booking_changeset, previous_payment_changeset) do
    payment_due_string = Map.get(row_map, "PAYMENT DUE")
    amount_string = Map.get(row_map, "AMOUNT")
    paid_by = Map.get(row_map, "PAID BY")

    amount = Money.parse(amount_string)
    booking_reference = Changeset.get_field(booking_changeset, :reference)
    transfer_due = Changeset.get_field(booking_changeset, :transfer_due)

    previous_payment_due =
      case previous_payment_changeset do
        nil ->
          nil

        previous_payment_changeset ->
          Changeset.get_field(previous_payment_changeset, :payment_due)
      end

    payment_due =
      Tnnt.Payments.payment_due_from_string(
        payment_due_string,
        transfer_due,
        previous_payment_due
      )

    %{
      :payment_due => payment_due,
      :amount => amount,
      :paid_by => paid_by,
      :booking_reference => booking_reference
    }
  end

  defp parse_dates(dates_string) do
    [from_date_string, to_date_string] = String.split(dates_string, " - ")
    from_date = parse_date(from_date_string)
    to_date = parse_date(to_date_string)
    {from_date, to_date}
  end

  defp parse_date(date_string, separator \\ "-")

  defp parse_date(nil, _separator), do: nil

  defp parse_date(date_string, separator) do
    [day, month, year] =
      date_string
      |> String.trim()
      |> String.split(separator)
      |> Enum.map(&String.to_integer/1)

    Date.new!(year, month, day)
  end
end
I also went for a Java-based solution, i.e. tika-server.
Then I use e.g. HTTPoison to send the PDF content to http://localhost:9998/tika and parse the result with Floki, e.g. for metadata or content.
It is an easy solution that works, though I cannot argue that it is the optimal solution.
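In case it helps anyone try the same route, here is a rough sketch of that flow, assuming tika-server is already running on localhost:9998 (the module and function names are just illustrative, and Accept: text/plain works too if you don't need the XHTML structure):

defmodule PdfViaTika do
  # Sketch only: PUT the raw PDF bytes to a locally running tika-server
  # and pull the text out of the XHTML it returns.
  def extract_text(pdf_file_path) do
    %HTTPoison.Response{status_code: 200, body: xhtml} =
      HTTPoison.put!("http://localhost:9998/tika", File.read!(pdf_file_path), [
        {"Content-Type", "application/pdf"},
        {"Accept", "text/html"}
      ])

    {:ok, document} = Floki.parse_document(xhtml)

    document
    |> Floki.find("body")
    |> Floki.text()
  end
end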
I currently use pdftotext, distributed as part of Poppler, to extract the text out of PDFs, calling it via System.cmd.
After that it's easy to parse the text with standard Elixir code. Previously I used Ghostscript to extract text, but I find pdftotext's -layout option provides more consistent formatting of the output, making it easier to parse.
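If it's useful, a minimal sketch of that call (assuming poppler-utils is installed; passing - as the output file makes pdftotext write to stdout):

def pdf_to_text(pdf_file_path) do
  # -layout preserves the physical layout, which helps when parsing columns
  {output, 0} = System.cmd("pdftotext", ["-layout", pdf_file_path, "-"])
  output
end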
Excellent question!
I've looked into this and found three main ways:
pdf2txt.py
There is pdfreader as mentioned by @kokolegorille. Another one is pdf2txt.py from the pdfminer.six package, which I use myself for some simpler use cases.
One thing that limited me with this was that a recurring element in my documents (a vector and text stamp) didn't show up with the correct line ordering in the extracted plaintext. That could be manageable for one kind of element, but it quickly gets utterly unmanageable. So: it depends.
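For reference, pdf2txt.py prints the extracted text to stdout by default, so driving it from Elixir is roughly a one-liner along these lines (a sketch, assuming pdfminer.six is installed and the script is on the PATH):

def pdf_to_text(pdf_file_path) do
  # Sketch only: shells out to pdfminer.six's pdf2txt.py
  {output, 0} = System.cmd("pdf2txt.py", [pdf_file_path])
  output
end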
This is a really cool blog post on Huggingface setting out how one can use machine learning utilising not only textual information, but also positional information and scanned documents.
There is a neat table describing the licenses of the models. What's really appealing to me is the possibility to run inference on your own.
But last time I checked, the cost of getting started seemed too high for me. Also, I don't think it's as easy as Bumblebee 1-2-3 quite yet. Which brings me to what I went for.
Since we're a Microsoft shop, I looked into Azure Document Intelligence and… yeah, it's really nice actually.
For our use case it's been great. To start with, we're a Microsoft shop, as I mentioned. But primarily, it was easy to get started: the web interface gives you point-and-click labelling for subsequent training. Finally, we don't have to scale massively. I guess inference could get prohibitively expensive if you do.
The coolest option? Gosh no. I wish I were the European version of Sean Moriarty, but I'm certainly not. Azure Document Intelligence allowed me to become terribly productive, and the cost has thus far been way below what would be acceptable.
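I don't have our integration code to hand, but from Elixir the general shape is an HTTP POST plus polling, roughly like the sketch below. The path, api-version and prebuilt-layout model id reflect the v3 Form Recognizer REST API as I remember it, so treat them as assumptions and check the current Azure docs; the endpoint and key are placeholders.

defmodule AzureDocIntelligence do
  # Sketch only: the analyze call is asynchronous. You POST the document,
  # get an Operation-Location header back, then poll that URL for the result.
  @endpoint "https://<your-resource>.cognitiveservices.azure.com"
  @api_key System.get_env("AZURE_DOC_INTELLIGENCE_KEY")

  def analyze(pdf_file_path, model_id \\ "prebuilt-layout") do
    url =
      "#{@endpoint}/formrecognizer/documentModels/#{model_id}:analyze?api-version=2023-07-31"

    %HTTPoison.Response{status_code: 202, headers: headers} =
      HTTPoison.post!(url, File.read!(pdf_file_path), [
        {"Ocp-Apim-Subscription-Key", @api_key},
        {"Content-Type", "application/pdf"}
      ])

    {_, operation_url} =
      Enum.find(headers, fn {name, _value} -> String.downcase(name) == "operation-location" end)

    poll(operation_url)
  end

  defp poll(operation_url) do
    %HTTPoison.Response{body: body} =
      HTTPoison.get!(operation_url, [{"Ocp-Apim-Subscription-Key", @api_key}])

    case Jason.decode!(body) do
      %{"status" => "succeeded"} = result ->
        result

      %{"status" => "failed"} = result ->
        {:error, result}

      _still_running ->
        Process.sleep(1_000)
        poll(operation_url)
    end
  end
end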
Best of luck with your endeavor!
PS. If even @kip balks at the idea of writing a PDF parser, then I'll be running the other way immediately.
Yes!
Also yes to the 3rd-party service.
For tabular data I have had really good luck using Tabula (GitHub - tabulapdf/tabula-java: Extract tables from PDF files). Usage is pretty simple too and you'll get the contents of tables in the PDF back as a CSV on STDOUT:
System.cmd(
  "java",
  [
    "-jar",
    tabula_jar_path,
    "--format",
    "CSV",
    "--lattice",
    "--guess",
    "--pages",
    "all",
    input_file_path
  ],
  env: [{"HOME", "#{System.tmp_dir!()}/tabula"}]
)
The only downside is I needed to add Java to my Docker image… but such is life.
For image-based PDFs, GitHub - VikParuchuri/surya: OCR, layout analysis, reading order, line detection in 90+ languages is excellent. It's recently released, open source and competes with the major cloud services for accuracy. It has some restrictions on commercial use above $5M USD gross p/a (a nice problem to have) but otherwise is free to use. Running a page through surya and passing the bounding box data (as JSON) into ChatGPT seems to give good comprehension of tabular data too. Combining it with a structured output solution like Instructor or Outlines would probably yield good results.
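I haven't wired it up from Elixir myself, but shelling out to the CLI and reading the JSON back would presumably look something like this; the surya_ocr command name comes from the project's README, while the arguments and the location of the results file are assumptions you'd need to check against the current docs:

def ocr_with_surya(pdf_path, results_json_path) do
  # Sketch only: run surya's OCR CLI over the PDF, then load the JSON it wrote.
  # Verify the command name, arguments and output location in the surya README.
  {_output, 0} = System.cmd("surya_ocr", [pdf_path])

  results_json_path
  |> File.read!()
  |> Jason.decode!()
end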
I've been planning to use a similar setup (Elixir + surya) for some open data projects but haven't got round to it yet. Please let us know how you get on with it.
For text-based PDFs I agree with the suggestion of using Tika, although it's been a few years since I worked with it, so take that with a pinch of salt.
Thanks, everyone! These suggestions have been great! I am probably going to start with a 3rd party (either Surya cloud or Microsoft) and then maybe graduate to surya or Tika if costs become too crazy.
Thanks again! (And keep the suggestions coming as this is a great list!)