ExtractousEx - a NIF wrapper around the Extractous Rust library for extracting text from files

Hi everyone :waving_hand: I’ve just released ExtractousEx, an Elixir library for extracting text and metadata from various document formats using the Rust-based Extractous library. A few other libraries already do this, but they either don’t support all Extractous options or don’t provide precompiled Rust binaries. This project aims to cover 100% of the options available in the Extractous library.

Fun fact: Extractous itself is based on Apache Tika, compiled to native code with GraalVM. So in reality we’re running Java code through Rust in Elixir with the performance of a native function :exploding_head:

Key Features:

  • Extract from PDF, Office docs, HTML, CSV, Markdown, EPUB, and more
  • ~18x faster than unstructured-io with ~11x less memory usage
  • Precompiled binaries for macOS, Linux, and Windows (no Rust compilation needed)
  • Extract from files, bytes, or URLs with identical APIs
  • Optional XML output preserving document structure

Installation:

{:extractous_ex, "~> 0.2.0"}
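
For anyone newer to Mix, the tuple above goes into the `deps` list in `mix.exs`. Here is a minimal sketch of where it lands; the module name `MyApp.MixProject`, the app name `:my_app`, and the version fields are placeholders for illustration, not something the library requires:

```elixir
# mix.exs (placeholder project; only the deps entry comes from this post)
defmodule MyApp.MixProject do
  use Mix.Project

  def project do
    [
      app: :my_app,
      version: "0.1.0",
      elixir: "~> 1.15",
      deps: deps()
    ]
  end

  defp deps do
    [
      # Precompiled NIF binaries are downloaded, so no Rust toolchain is needed
      {:extractous_ex, "~> 0.2.0"}
    ]
  end
end
```

Running `mix deps.get` afterwards fetches the package.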

Usage:

{:ok, result} = ExtractousEx.extract_from_file("document.pdf")

IO.puts(result.content)
# "This is my content"

IO.inspect(result.metadata)
# %{
#  "Content-Length" => ["7335065"],
#  "access_permission:extract_for_accessibility" => ["true"],
#  "access_permission:modify_annotations" => ["true"],
#  "pdf:docinfo:modified" => ["2025-04-07T14:20:54Z"],
#  ...
# }


# Extract from binary data in memory
{:ok, data} = File.read("document.pdf")
{:ok, result} = ExtractousEx.extract_from_bytes(data)

# With options
{:ok, result} = ExtractousEx.extract_from_url(url, xml: true, pdf: [ocr_strategy: "NO_OCR"])
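
The snippets above match directly on `{:ok, result}`, which raises if extraction fails. In application code you would probably want a `case` instead. Below is a rough sketch of that pattern. `ExtractSketch` and `summarize/2` are hypothetical names, not part of ExtractousEx, and the `{:error, reason}` shape is an assumption based on the usual Elixir convention, so check the docs for the actual error values. The extractor is passed in as a function so the sketch runs without the library installed:

```elixir
defmodule ExtractSketch do
  @moduledoc """
  Hypothetical helper, not part of ExtractousEx. It wraps any 1-arity
  extractor function and reduces the result to a small summary.
  """

  # `extractor` is expected to return {:ok, result}, where `result` has
  # `content` and `metadata` fields (as in the usage above), or
  # {:error, reason} (assumed error shape).
  def summarize(source, extractor) when is_function(extractor, 1) do
    case extractor.(source) do
      {:ok, %{content: content, metadata: metadata}} ->
        summary = %{
          # Number of characters of extracted text
          chars: String.length(content),
          # Sorted metadata keys, e.g. "Content-Length"
          metadata_keys: metadata |> Map.keys() |> Enum.sort()
        }

        {:ok, summary}

      {:error, reason} ->
        # Pass the failure through so the caller decides what to do
        {:error, reason}
    end
  end
end
```

With the library installed, a call could look like `ExtractSketch.summarize("document.pdf", &ExtractousEx.extract_from_file/1)`.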

Available on Hex (extractous_ex), with full documentation for ExtractousEx v0.2.0.

PS I would greatly appreciate a :star: on GitHub! :hugs:
