Best libraries/methods for parsing text and content from PDF files?

Hi!

What is currently the best library/method for parsing text and tabular data out of PDF files in Elixir or Erlang?

1 Like

I don’t think there is one, but I would love to be proved wrong. There are a couple of PDF parsers written in Elm, one with an interesting UI. You’d think that Elixir would be a more natural choice given its pattern matching. I have thought of converting one to Elixir, but it would likely only handle text and not tables.

1 Like

I had a very similar idea. Until I started reading the PDF specification … :slight_smile:

(Even after buying two technical books on it)

6 Likes

There is one in Python…

About a year ago, I attempted a “side project” for a guy I met at a laptop repair place. It was a site with some simple functionality, but he wanted PDF parsing. I gave up after about four weekends.

I tried using the Python pdfreader library but found it extremely brittle. I’m actually reading the email chain from Aug 2023 about it, and I explained to the guy that unpredictable spacing and layouts made a catch-all solution almost impossible (for me, at least; I’m sure some AI wizards born and raised in soundproof bunkers deep under Silicon Valley could do it).

So I tried another approach, which actually works very well, but only if the PDF format does not change. I wrote something that worked for all the test data presented, but the next batch of data was wildly different (different table headings, different numbers of columns, different data types, random annotations and notes in strange places).

This second approach used a Java jar called tabula. I got it from https://github.com/tabulapdf/tabula-java/releases/download/v1.0.5/tabula-1.0.5-jar-with-dependencies.jar, but check around; there may be later releases. It worked super well (until the inputs started changing like crazy).

Anyway, if you want my honest opinion: do it yourself if you have strong control over the input PDFs and there is some tolerance for mistakes. Otherwise, use an established PDF parsing service.

With all that said, here is some Elixir code I wrote (hastily) about a year ago to parse a specific PDF. I am just copying and pasting it wholesale, so please don’t judge it too harshly. Maybe you can scavenge it for parts.

Good luck.

defmodule Tnnt.PdfParser do
  @moduledoc """
  The PDF Parsing module

  There are 2 kinds of pages:
  1. bookings - lists many tenants with their references
  1. payments - lists payments for a single tenant
  """
  alias Ecto.Changeset
  alias Tnnt.Bookings.Booking
  alias Tnnt.Payments.Payment

  NimbleCSV.define(MyParser, separator: ",", escape: "\"")

  def pdf_hash(pdf_file_path) do
    pdf_file_path
    |> File.read!()
    |> then(&:crypto.hash(:sha256, &1))
  end

  def pdf_to_text(pdf_file_path) do
    java_args = [
      "-jar",
      "./tabula-1.0.5-jar-with-dependencies.jar",
      "--pages",
      "all",
      pdf_file_path
    ]

    {output, 0} = System.cmd("java", java_args)
    output
  end

  @doc """
  Parses the CSV text produced by tabula (see `pdf_to_text/1`)
  """
  def parse(text) do
    init_acc =
      %{
        booking_changesets: [],
        payment_changesets: [],
        unparseable_lines: [],
        imported_file_attrs: %{}
      }

    text
    |> MyParser.parse_string(skip_headers: false)
    |> Enum.reduce_while({:ok, init_acc}, fn cells, {:ok, acc} ->
      case parse_cells(cells, acc) do
        {:ok, new_acc} -> {:cont, {:ok, new_acc}}
        {:error, error} -> {:halt, {:error, error}}
      end
    end)
    |> then(fn result ->
      case result do
        {:ok, new_acc} ->
          new_acc
          |> Map.update(:unparseable_lines, [], &Enum.reverse/1)
          |> then(&{:ok, &1})

        {:error, error} ->
          {:error, error}
      end
    end)
  end

  @bookings_headers_0 [
    "BOOKING REFERENCE",
    "TENANT NAME",
    "DATES",
    "ROOM NUMBER",
    "TOTAL CONTRACT VALUE",
    "VALUE MINUS COMMS"
  ]

  @bookings_headers_1 @bookings_headers_0 ++ ["TRANSFER DUE"]
  @payments_headers ["PAYMENT DUE", "AMOUNT", "PAID BY"]
  @all_headers @bookings_headers_1 ++ @payments_headers

  @type header :: String.t()

  @typedoc """
  The parsing accumulator keeps track of the parsing context.
  """
  @type parsing_acc :: %{
          :current_page_type => :bookings | :payments,
          :current_booking_changeset => Changeset.t(),
          :current_headers => %{non_neg_integer() => header()},
          :current_location => String.t(),
          :booking_changesets => [Ecto.Changeset.t()],
          :unparseable_lines => [String.t()]
        }

  defp tidy_cell(cell) do
    cell
    |> String.replace("\"", "")
    |> String.replace(",", "")
    |> String.trim()
  end

  @spec parse_cells([String.t()], parsing_acc()) :: {:ok, parsing_acc()} | {:error, String.t()}
  defp parse_cells(["Confirmed bookings for " <> location | _], acc) do
    acc
    |> Map.put(:current_location, location)
    |> then(&{:ok, &1})
  end

  defp parse_cells(["Confirmed upcoming bookings for " <> location | _], acc) do
    acc
    |> Map.put(:current_location, location)
    |> then(&{:ok, &1})
  end

  defp parse_cells(cells, acc) do
    {header_pos, entry_pos} =
      cells
      |> Enum.with_index()
      |> Enum.reduce({%{}, %{}}, fn {cell, index}, {header_pos, entry_pos} ->
        tidied_cell = tidy_cell(cell)
        new_entry_pos = Map.put(entry_pos, index, tidied_cell)

        if tidied_cell in @all_headers do
          new_header_pos = Map.put(header_pos, index, tidied_cell)
          {new_header_pos, new_entry_pos}
        else
          {header_pos, new_entry_pos}
        end
      end)

    parsed_headers = Map.values(header_pos)
    found_all_payments_headers? = Enum.all?(@payments_headers, &(&1 in parsed_headers))
    found_all_bookings_headers? = Enum.all?(@bookings_headers_0, &(&1 in parsed_headers))

    new_acc =
      cond do
        found_all_payments_headers? ->
          acc
          |> Map.put(:current_page_type, :payments)
          |> Map.put(:current_headers, header_pos)

        found_all_bookings_headers? ->
          acc
          |> Map.put(:current_page_type, :bookings)
          |> Map.put(:current_headers, header_pos)

        true ->
          parse_non_header_line(acc, cells, entry_pos)
      end

    {:ok, new_acc}
  end

  defp parse_non_header_line(acc, cells, entry_pos) do
    current_headers = Map.get(acc, :current_headers, %{})
    current_location = Map.get(acc, :current_location)
    current_page_type = Map.get(acc, :current_page_type)

    if Enum.any?(current_headers) do
      {row_map, new_entry_pos} =
        Enum.reduce(current_headers, {%{}, entry_pos}, fn {index, header}, {row_map, entry_pos} ->
          {value, new_entry_pos} = Map.pop(entry_pos, index)
          new_row_map = Map.put(row_map, header, value)
          {new_row_map, new_entry_pos}
        end)

      notes =
        new_entry_pos
        |> Map.values()
        |> Enum.reject(&(String.length(&1) == 0))
        |> Enum.join("\n")

      if row_map |> Map.values() |> Enum.all?(&(String.length(&1) > 0)) do
        case current_page_type do
          :bookings ->
            booking_changesets = Map.get(acc, :booking_changesets, [])

            %{"BOOKING REFERENCE" => reference} = row_map

            existing_changeset_index =
              Enum.find_index(booking_changesets, fn booking_changeset ->
                Changeset.get_field(booking_changeset, :reference) == reference
              end)

            # if already have this booking, just add the TRANSFER DUE
            case existing_changeset_index do
              nil ->
                booking_changeset =
                  row_map
                  |> to_booking_attrs(current_location)
                  |> Booking.creation_changeset()

                acc
                |> Map.update(:booking_changesets, [booking_changeset], &[booking_changeset | &1])
                |> Map.put(:current_booking_changeset, booking_changeset)

              existing_changeset_index when is_integer(existing_changeset_index) ->
                transfer_due_string = Map.get(row_map, "TRANSFER DUE")
                transfer_due = parse_date(transfer_due_string, "/")

                booking_changeset =
                  acc.booking_changesets
                  |> Enum.at(existing_changeset_index)
                  |> Changeset.put_change(:transfer_due, transfer_due)

                acc
                |> Map.update!(:booking_changesets, fn booking_changesets ->
                  List.replace_at(booking_changesets, existing_changeset_index, booking_changeset)
                end)
                |> Map.put(:current_booking_changeset, booking_changeset)
            end

          :payments ->
            previous_payment_changeset =
              acc
              |> Map.get(:payment_changesets, [])
              |> List.first()

            payment_changeset =
              row_map
              |> to_payment_attrs(acc.current_booking_changeset, previous_payment_changeset)
              |> then(fn payment_attrs ->
                if String.length(notes) > 0 do
                  Map.put(payment_attrs, :notes, notes)
                else
                  payment_attrs
                end
              end)
              |> Payment.creation_changeset()

            Map.update(acc, :payment_changesets, [payment_changeset], &[payment_changeset | &1])
        end
      else
        maybe_add_line_to_notes(acc, cells)
      end
    else
      maybe_add_line_to_notes(acc, cells)
    end
  end

  defp maybe_add_line_to_notes(acc, cells) do
    reduced_cells = Enum.filter(cells, &(String.length(&1) > 0))

    # Use Map.get so this doesn't raise a KeyError before the first booking row
    if Enum.any?(reduced_cells) and Map.get(acc, :current_booking_changeset) do
      current_reference = Changeset.get_field(acc.current_booking_changeset, :reference)
      new_notes = Enum.join(reduced_cells, ", ")

      updated_notes =
        case Changeset.get_field(acc.current_booking_changeset, :notes) do
          nil -> new_notes
          current_notes when is_binary(current_notes) -> current_notes <> "\n" <> new_notes
        end

      existing_changeset_index =
        Enum.find_index(acc.booking_changesets, fn booking_changeset ->
          Changeset.get_field(booking_changeset, :reference) == current_reference
        end)

      booking_changeset =
        acc.booking_changesets
        |> Enum.at(existing_changeset_index)
        |> Changeset.put_change(:notes, updated_notes)

      acc
      |> Map.update!(:booking_changesets, fn booking_changesets ->
        List.replace_at(booking_changesets, existing_changeset_index, booking_changeset)
      end)
      |> Map.put(:current_booking_changeset, booking_changeset)
    else
      acc
    end
  end

  defp to_booking_attrs(row_map, location) do
    reference = Map.get(row_map, "BOOKING REFERENCE")
    dates_string = Map.get(row_map, "DATES")
    room_number = Map.get(row_map, "ROOM NUMBER")
    tenant_name = Map.get(row_map, "TENANT NAME")
    total_contract_value_string = Map.get(row_map, "TOTAL CONTRACT VALUE")
    value_minus_comms_string = Map.get(row_map, "VALUE MINUS COMMS")
    transfer_due_string = Map.get(row_map, "TRANSFER DUE")

    total_contract_value = Money.parse(total_contract_value_string)
    value_minus_comms = Money.parse(value_minus_comms_string)
    {from_date, to_date} = parse_dates(dates_string)
    transfer_due = parse_date(transfer_due_string, "/")

    %{
      :reference => reference,
      :from_date => from_date,
      :to_date => to_date,
      :room_number => room_number,
      :tenant_name => tenant_name,
      :total_contract_value => total_contract_value,
      :value_minus_comms => value_minus_comms,
      :location => location,
      :transfer_due => transfer_due
    }
  end

  defp to_payment_attrs(row_map, booking_changeset, previous_payment_changeset) do
    payment_due_string = Map.get(row_map, "PAYMENT DUE")
    amount_string = Map.get(row_map, "AMOUNT")
    paid_by = Map.get(row_map, "PAID BY")

    amount = Money.parse(amount_string)

    booking_reference = Changeset.get_field(booking_changeset, :reference)
    transfer_due = Changeset.get_field(booking_changeset, :transfer_due)

    previous_payment_due =
      case previous_payment_changeset do
        nil ->
          nil

        previous_payment_changeset ->
          Changeset.get_field(previous_payment_changeset, :payment_due)
      end

    payment_due =
      Tnnt.Payments.payment_due_from_string(
        payment_due_string,
        transfer_due,
        previous_payment_due
      )

    %{
      :payment_due => payment_due,
      :amount => amount,
      :paid_by => paid_by,
      :booking_reference => booking_reference
    }
  end

  defp parse_dates(dates_string) do
    [from_date_string, to_date_string] = String.split(dates_string, " - ")
    from_date = parse_date(from_date_string)
    to_date = parse_date(to_date_string)
    {from_date, to_date}
  end

  defp parse_date(date_string, separator \\ "-")
  defp parse_date(nil, _separator), do: nil

  defp parse_date(date_string, separator) do
    [day, month, year] =
      date_string
      |> String.trim()
      |> String.split(separator)
      |> Enum.map(&String.to_integer/1)

    Date.new!(year, month, day)
  end
end
4 Likes

I also went for a Java-based solution, i.e. tika-server.
Then I use e.g. HTTPoison to send the PDF content to http://localhost:9998/tika and parse the result with Floki, e.g. for metadata or content.
It is an easy solution that works, though I can’t claim it’s the optimal one.
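
A minimal sketch of what that looks like, assuming a tika-server running locally on the default port 9998 and the httpoison and floki packages; the module name is just for illustration:

    # PUT the raw PDF bytes to /tika. With "Accept: text/html" Tika returns an
    # XHTML document: metadata as <meta> tags in <head>, content in <body>.
    defmodule TikaSketch do
      @tika_url "http://localhost:9998/tika"

      def analyze(pdf_path) do
        {:ok, %HTTPoison.Response{status_code: 200, body: xhtml}} =
          HTTPoison.put(@tika_url, File.read!(pdf_path), [
            {"Content-Type", "application/pdf"},
            {"Accept", "text/html"}
          ])

        {:ok, doc} = Floki.parse_document(xhtml)

        metadata =
          doc
          |> Floki.find("head meta")
          |> Enum.map(fn meta ->
            {List.first(Floki.attribute(meta, "name")),
             List.first(Floki.attribute(meta, "content"))}
          end)
          |> Enum.into(%{})

        content =
          doc
          |> Floki.find("body")
          |> Floki.text()

        {metadata, content}
      end
    end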

2 Likes

I currently use ‘pdftotext’, distributed with poppler, to extract the text out of PDFs, invoked via ‘System.cmd’.

After that it’s easy to parse the text file with standard Elixir code. Previously I used Ghostscript to extract text, but I find pdftotext’s -layout option provides more consistent formatting of the text output, making it easier to parse.
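
For reference, the shell-out is about as simple as it gets (a sketch, assuming poppler’s pdftotext is on the PATH; the module name is just an example):

    defmodule PdftotextSketch do
      def extract(pdf_path) do
        # "-layout" keeps the original layout; "-" sends the text to stdout
        # instead of writing a .txt file next to the PDF.
        case System.cmd("pdftotext", ["-layout", pdf_path, "-"]) do
          {text, 0} -> {:ok, text}
          {output, exit_code} -> {:error, {exit_code, output}}
        end
      end
    end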

1 Like

Excellent question :ok_hand:

I’ve looked into this and found three main ways:

  1. Simple: Parsing out the plain text with Python libraries, such as pdf2txt.py
  2. Cool but hard: Using a so-called “Document AI” model, such as Donut
  3. Advanced but also fairly easy: Using a third-party service, such as Azure Document Intelligence

1. Python libraries

There is pdfreader as mentioned by @kokolegorille. Another one is pdf2txt.py from the pdfminer.six package, which I use myself for some simpler use cases.
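
Calling it from Elixir is just a shell-out; a minimal sketch, assuming pdfminer.six is installed and pdf2txt.py is on the PATH (the file name is just an example):

    # pdf2txt.py prints the extracted text to stdout by default
    {text, 0} = System.cmd("pdf2txt.py", ["statement.pdf"])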

One thing that limited me with this was that a recurring element in my documents (a vector and text stamp) didn’t show up with the correct line ordering in the extracted plain text. That could be manageable for one kind of element, but it quickly gets utterly unmanageable. So: it depends.

2. “Document AI” models

There is a really cool blog post on Hugging Face setting out how one can use machine learning that utilises not only textual information but also positional information and scanned documents.

There is a neat table describing the licenses of the models. What’s really appealing to me is the possibility of running inference on your own.

But last time I checked, the cost of getting started seemed too high for me. Also, I don’t think it’s quite as easy as Bumblebee 1-2-3 yet. Which brings me to what I went for :point_down:

3. Third-party service

Since we’re a Microsoft shop I looked into Azure Document Intelligence and… yeah, it’s really nice actually.

  • Extraction of fields and tables
  • Classification of pages ← great for my use case instead of splitting per vector/text stamp
  • Fine-tuning doesn’t require 300 PB of data (the minimum is 5 documents, though that wasn’t enough in my experience)
  • Gives you a REST API endpoint for running inference on the models you have fine-tuned (I created a small Req wrapper and went to town; rough sketch just below)
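
The sketch, for the curious. Treat the endpoint path, api-version and response fields as assumptions from memory rather than gospel (check the Azure docs for your resource); the resource name, model id and document URL are placeholders:

    defmodule DocIntelligenceSketch do
      @endpoint "https://<your-resource>.cognitiveservices.azure.com"
      @api_version "2023-07-31"

      def analyze(model_id, document_url, api_key) do
        analyze_url =
          "#{@endpoint}/formrecognizer/documentModels/#{model_id}:analyze?api-version=#{@api_version}"

        # Analysis is asynchronous: the POST answers 202 with an
        # Operation-Location header pointing at the result to poll.
        response =
          Req.post!(analyze_url,
            headers: [{"ocp-apim-subscription-key", api_key}],
            json: %{urlSource: document_url}
          )

        [operation_location] = Req.Response.get_header(response, "operation-location")
        poll(operation_location, api_key)
      end

      defp poll(url, api_key) do
        response = Req.get!(url, headers: [{"ocp-apim-subscription-key", api_key}])

        case response.body["status"] do
          "succeeded" ->
            {:ok, response.body["analyzeResult"]}

          "failed" ->
            {:error, response.body}

          _still_running ->
            Process.sleep(1_000)
            poll(url, api_key)
        end
      end
    end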

For our use case it’s been great. To start with, we’re a Microsoft shop, as I mentioned. But primarily, it was easy to get started: the web interface gives you point-and-click labelling for subsequent training. Finally, we don’t have to scale massively; I guess inference could get prohibitively expensive if you do.

The coolest option? Gosh no. I wish I were the European version of Sean Moriarty, but I’m certainly not. Azure Document Intelligence allowed me to become terribly productive, and the cost has thus far been way below what would be acceptable.

Best of luck with your endeavor!

PS. If even @kip balks at the idea of writing a PDF parser, then I’ll be running the other way immediately.

5 Likes

Yes!

Also yes to the 3rd-party service.

2 Likes

For tabular data I have had really good luck using Tabula (GitHub - tabulapdf/tabula-java: Extract tables from PDF files). Usage is pretty simple too, and you’ll get the contents of the tables in the PDF back as CSV on stdout:

    System.cmd(
      "java",
      [
        "-jar",
        tabula_jar_path,
        "--format",
        "CSV",
        "--lattice",
        "--guess",
        "--pages",
        "all",
        input_file_path
      ],
      env: [{"HOME", "#{System.tmp_dir!()}/tabula"}]
    )

The only downside is that I needed to add Java to my Docker image… but such is life.
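
If it helps, the CSV coming back on stdout drops straight into NimbleCSV; something like this (a sketch, and the parser module name is just an example):

    # Define a CSV parser at compile time, then parse tabula's stdout into rows
    NimbleCSV.define(TabulaCSV, separator: ",", escape: "\"")

    {csv, 0} =
      System.cmd("java", [
        "-jar", tabula_jar_path,
        "--format", "CSV",
        "--pages", "all",
        input_file_path
      ])

    # A list of lists of strings, one inner list per table row across all pages
    rows = TabulaCSV.parse_string(csv, skip_headers: false)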

4 Likes

For image-based PDFs, surya (GitHub - VikParuchuri/surya: OCR, layout analysis, reading order, line detection in 90+ languages) is excellent. It’s recently released, open source, and competes with the major cloud services for accuracy. It has some restrictions on commercial use above $5M USD gross p/a (a nice problem to have) but is otherwise free to use. Running a page through surya and passing the bounding-box data (as JSON) into ChatGPT seems to give good comprehension of tabular data too. Combining it with a structured-output solution like Instructor or Outlines would probably yield good results.

I’ve been planning to use a similar setup (Elixir + surya) for some open data projects but haven’t got round to it yet. Please let us know how you get on with it.
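
For what it’s worth, the shell-out I have in mind looks roughly like this (a sketch: surya’s surya_ocr CLI comes via pip, and the --results_dir flag plus the results.json path are assumptions from my memory of the README, so double-check them against the current docs):

    # OCR a PDF with surya's CLI, then load the JSON it writes. The output
    # includes the recognised text lines with their bounding boxes, which is
    # what you would pass (as JSON) to an LLM for tabular data.
    results_dir = "surya_results"
    {_out, 0} = System.cmd("surya_ocr", ["scanned.pdf", "--results_dir", results_dir])

    ocr_result =
      results_dir
      |> Path.join("results.json")
      |> File.read!()
      |> Jason.decode!()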

For text-based PDFs I agree with the suggestion of using Tika, although it’s been a few years since I worked with it, so take that with a pinch of salt.

Thanks, everyone! These suggestions have been great! I am probably going to start with a 3rd-party service (either the Surya cloud or Microsoft) and then maybe graduate to surya or Tika if costs become too crazy.

Thanks again! (And keep the suggestions coming as this is a great list!)