I published Text 0.4.0 today with seven new NLP modules, all native Elixir with no NIFs or ML.
They cover the kinds of preprocessing you might reach for once your sentiment / classification / search pipeline outgrows String.split/1.
From my perspective this release makes Text a largely feature-complete text library, though I'm happy to take feature suggestions.
A few of the more immediately useful additions in this release:
Text.Clean — pipeline-style normalisation
Whitespace, control characters, smart quotes, mojibake, NFC/NFKC. Composable; defaults are sensible.
iex> Text.Clean.clean("<p>it’s <em>cool</em></p>")
"it's cool"
iex> Text.Clean.collapse_whitespace(" hello \tworld \n")
"hello world"
Text.Truecase — restore casing for ALL-CAPS or lowercased text
POS-aware heuristics for proper nouns, acronyms, and sentence starts. Useful when an upstream system has destroyed the casing (chat logs, OCR, screaming customer feedback).
iex> Text.Truecase.truecase("THE QUICK BROWN FOX JUMPS OVER NEW YORK")
"The quick brown fox jumps over New York"
iex> Text.Truecase.truecase("nasa launched apollo 11 in july 1969.")
"NASA launched Apollo 11 in July 1969."
# Add domain-specific terms once at boot
Text.Truecase.add_terms(["GraphQL", "Phoenix"])
Text.Truecase.truecase("we use phoenix and graphql")
#=> "we use Phoenix and GraphQL"
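The term-table part of truecasing can be sketched in a few lines. This is a deliberately simplified stand-in, assuming a hypothetical TruecaseSketch module and a hand-written term map; it has none of the POS-aware heuristics described above and only capitalises the first character of the whole input.

```elixir
defmodule TruecaseSketch do
  # Hypothetical term table: lowercased word => preferred casing.
  @terms %{"nasa" => "NASA", "graphql" => "GraphQL", "phoenix" => "Phoenix"}

  def truecase(text, terms \\ @terms) do
    lowered = String.downcase(text)

    # Swap every run of letters for its table entry, if it has one.
    restored =
      Regex.replace(~r/\p{L}+/u, lowered, fn word -> Map.get(terms, word, word) end)

    capitalize_first(restored)
  end

  # Upcase only the first grapheme; the rest is left untouched.
  defp capitalize_first(text) do
    {first, rest} = String.split_at(text, 1)
    String.upcase(first) <> rest
  end
end

TruecaseSketch.truecase("WE USE PHOENIX AND GRAPHQL.")
#=> "We use Phoenix and GraphQL."
```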
Text.Emoji — detection, stripping, counting, conversion
Backed by the :unicode package’s emoji property tables, so it recognises every codepoint flagged as emoji in the current Unicode release, with no shipped JSON.
iex> Text.Emoji.count("Loved it 🤩 read it twice 📚📚")
3
iex> Text.Emoji.demojize("ship it 🚀")
"ship it :rocket:"
iex> Text.Emoji.emojize("ship it :rocket:")
"ship it 🚀"
Text.Hyphenation — Knuth–Liang TeX-pattern hyphenation
Ships en-US patterns baked in (~5,000). Other languages load from any standard hyph-*.tex file.
iex> Text.Hyphenation.hyphenate("hyphenation")
"hy-phen-ation"
iex> Text.Hyphenation.count("supercalifragilisticexpialidocious")
9
# Load German patterns once; thereafter all calls are fast
Text.Hyphenation.load_language(:de, path: "hyph-de-1996.tex")
Text.Hyphenation.hyphenate("Bundesausbildungsförderungsgesetz", language: :de)
#=> "Bun-des-aus-bil-dungs-för-de-rungs-ge-setz"
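For anyone unfamiliar with how Knuth–Liang patterns work, the core idea fits in a short sketch. Each pattern interleaves letters with digits; every pattern that matches a substring of the dot-padded word votes a digit into the gaps it covers, the highest digit per gap wins, and odd digits mark allowed breaks. The LiangSketch module below is hypothetical, uses a handful of illustrative patterns rather than the shipped en-US set, and assumes minimum fragment lengths of 2 on the left and 3 on the right.

```elixir
defmodule LiangSketch do
  # A handful of illustrative patterns, enough for the word "hyphenation".
  @patterns ~w(hy3ph he2n hena4 hen5at 1na n2at 1tio 2io o2n)
  @left_min 2
  @right_min 3

  def hyphenate(word, patterns \\ @patterns) do
    original = String.graphemes(word)
    letters = Enum.map(original, &String.downcase/1)
    padded = ["."] ++ letters ++ ["."]
    parsed = Enum.map(patterns, &parse/1)
    last = length(padded) - 1

    # scores[g] is the highest digit any matching pattern puts in the gap before padded[g].
    scores =
      for i <- 0..last,
          {chars, digits} <- parsed,
          Enum.slice(padded, i, length(chars)) == chars,
          {digit, k} <- Enum.with_index(digits),
          reduce: %{} do
        acc -> Map.update(acc, i + k, digit, &max(&1, digit))
      end

    count = length(original)

    original
    |> Enum.with_index()
    |> Enum.map_join(fn {char, j} ->
      # The gap before letter j is gap j + 1 once the leading "." is counted.
      break? = rem(Map.get(scores, j + 1, 0), 2) == 1
      if break? and j >= @left_min and count - j >= @right_min, do: "-" <> char, else: char
    end)
  end

  # "hy3ph" becomes {["h", "y", "p", "h"], [0, 0, 3, 0, 0]}: one digit per gap.
  defp parse(pattern) do
    {chars, digits, pending} =
      Enum.reduce(String.graphemes(pattern), {[], [], 0}, fn char, {chars, digits, pending} ->
        case Integer.parse(char) do
          {digit, ""} -> {chars, digits, digit}
          :error -> {[char | chars], [pending | digits], 0}
        end
      end)

    {Enum.reverse(chars), Enum.reverse([pending | digits])}
  end
end

LiangSketch.hyphenate("hyphenation")
#=> "hy-phen-ation"
```

A production implementation would typically store the patterns in a trie instead of scanning the whole list at every position.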
Text.PII — detect & redact common identifiers
Phone, email, credit-card-shaped digits, IBANs, IPv4/IPv6, US SSN. Pattern-based — fast and deterministic. The right tool for “please don’t paste this into the LLM” preflight; pair with a stricter checker if you need legal-grade accuracy.
iex> Text.PII.detect("Email me at jane@example.com or call (415) 555-0142.")
[
  %{type: :email, value: "jane@example.com", offset: 12, length: 16},
  %{type: :phone, value: "(415) 555-0142", offset: 37, length: 14}
]
iex> Text.PII.redact("Card 4111-1111-1111-1111 expires 12/29")
"Card [CREDIT_CARD] expires 12/29"
Text.Spell — Norvig-style spelling suggestions
Edit-distance candidates ranked by frequency in Text.WordFreq (the 30,000-word English frequency table that also ships in 0.4.0).
iex> Text.Spell.correct("speling")
"spelling"
iex> Text.Spell.candidates("teh") |> Enum.take(3)
[
  %{word: "the", distance: 1, frequency: 6_187_267},
  %{word: "tech", distance: 1, frequency: 49_320},
  %{word: "ten", distance: 1, frequency: 21_117}
]
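The Norvig approach is small enough to sketch in full: generate every string one edit away (delete, transpose, replace, insert), keep the ones found in a frequency table, and pick the most frequent. The SpellSketch module and its four-word table with made-up counts are hypothetical stand-ins for Text.Spell and Text.WordFreq, and the sketch only looks at distance 1 over the letters a to z.

```elixir
defmodule SpellSketch do
  @letters for c <- ?a..?z, do: <<c>>

  # Hypothetical frequency table with made-up counts.
  @freq %{"spelling" => 120, "spewing" => 3, "the" => 5000, "ten" => 40}

  # Return the word itself if known, else the most frequent known word one edit away.
  def correct(word, freq \\ @freq) do
    if Map.has_key?(freq, word) do
      word
    else
      case Enum.filter(edits1(word), &Map.has_key?(freq, &1)) do
        [] -> word
        known -> Enum.max_by(known, &Map.fetch!(freq, &1))
      end
    end
  end

  # Every string reachable with one delete, transpose, replace or insert.
  def edits1(word) do
    splits = for i <- 0..String.length(word), do: String.split_at(word, i)

    deletes = for {l, <<_::utf8, rest::binary>>} <- splits, do: l <> rest

    transposes =
      for {l, <<a::utf8, b::utf8, rest::binary>>} <- splits,
          do: l <> <<b::utf8, a::utf8>> <> rest

    replaces = for {l, <<_::utf8, rest::binary>>} <- splits, c <- @letters, do: l <> c <> rest
    inserts = for {l, r} <- splits, c <- @letters, do: l <> c <> r

    MapSet.new(deletes ++ transposes ++ replaces ++ inserts)
  end
end
```

Counting a transposition as a single edit is what puts "the" at distance 1 from "teh".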
Text.Summarize — extractive summarisation via TextRank
Sentence-graph TextRank with configurable similarity (:cosine or :jaccard) and target length.
article = """
The new bridge, opened on Tuesday, connects the two halves of the city for the first time in decades. Engineers worked three winters to anchor the central pier on the riverbed. Residents who used to take a 40-minute ferry now make the trip in five. The mayor said the project came in 2% under budget, a rarity for civic work of this scale.
"""
iex> Text.Summarize.summarize(article, sentences: 2)
"The new bridge, opened on Tuesday, connects the two halves of the city for the first time in decades. Residents who used to take a 40-minute ferry now make the trip in five."