kip
Text - a text analysis library
I’ll shortly be launching Text, a nascent text analysis library.
Current functionality
In this early version (not ready for prime time) it includes:
- Word counting
- N-gram generation
- Language detection (of about 250 languages with pluggable vocabularies and pluggable correlation models)
- An English inflector (singular to plural) using a non-regex algorithmic approach
Future functionality
- A language stemmer, as soon as I finish writing the Snowball compiler
- Parts of speech tagger
Collaboration encouraged
- Contributions in all areas are most welcome
- Non-English speakers who would like to contribute to non-English inflectors are particularly welcome
Next steps
After some polishing this weekend I will publish a version to hex.
kip
Thought I’d share the near term roadmap in a little more detail. Feedback is most definitely welcome on the capabilities you would find most useful. Or any areas you’d like to contribute to.
Step 1: Language recognition
Most natural language processing is language dependent, so identifying the source language is important. The primary way of identifying a language is to split the text into n-grams and then compare statistics of the source text against the same statistics computed from a standard corpus in each candidate language. The Universal Declaration of Human Rights is a standard text published in a lot of languages so this is the corpus I’m using. There are different ways to correlate source text with a corpus. I am primarily using the algorithms in Language Identification from Text Using N-gram Based Cumulative Frequency Addition.
This is now due for delivery on 28th June.
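To make the correlation step concrete, here is a minimal sketch of cumulative frequency addition as described in that paper: each language profile maps n-grams to their relative frequency, and a candidate text is scored by adding up the profile frequencies of its own n-grams. The module name `CfaSketch`, its functions and the one-sentence toy corpora are assumptions made for illustration. They are not the Text API, and real profiles would be built from the UDHR texts.

```elixir
defmodule CfaSketch do
  # Split text into overlapping character n-grams of length n.
  def ngrams(text, n) do
    text
    |> String.downcase()
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end

  # A profile maps each n-gram to its relative frequency in the corpus text.
  def profile(corpus_text, n) do
    grams = ngrams(corpus_text, n)
    total = length(grams)

    grams
    |> Enum.frequencies()
    |> Map.new(fn {gram, count} -> {gram, count / total} end)
  end

  # Score each language by summing the profile frequencies of the text's
  # n-grams and return the language with the highest cumulative sum.
  def detect(text, profiles, n \\ 2) do
    grams = ngrams(text, n)

    {language, _profile} =
      Enum.max_by(profiles, fn {_language, profile} ->
        grams |> Enum.map(&Map.get(profile, &1, 0.0)) |> Enum.sum()
      end)

    language
  end
end

# Toy corpora (hypothetical, far too small for real use).
profiles = %{
  "en" => CfaSketch.profile("the quick brown fox jumps over the lazy dog and then the other thing", 2),
  "de" => CfaSketch.profile("der schnelle braune fuchs springt über den faulen hund und dann das andere", 2)
}

CfaSketch.detect("the other dog", profiles)
```

With profiles this small the scores are only indicative; longer corpora and longer input text are what make the sums discriminate reliably.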
Step 2: Text segmentation
No matter what analysis is required, the text first has to be segmented into grapheme clusters, words and sentences. This is very language dependent. Elixir’s String.graphemes/1 implements the Unicode segmentation algorithm for grapheme clusters so that’s taken care of. Elixir’s String.split/1 splits text into words on Unicode whitespace. That is great for a default case but it’s not sufficient for language-specific segmentation. And we still need sentence segmentation too. Therefore I am implementing the CLDR Segmentation rules which provide language-specific customisation for text segmentation. This is another rules parser (I think so far I have implemented 8 different rules parsers and “compilers” in various parts of the ex_cldr project).
The text segmentation algorithms will be implemented as part of the unicode_string library.
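A small sketch of what the standard library gives you today, and where it stops. It uses only `String.graphemes/1`, `String.codepoints/1` and `String.split/1,3`; the module `SegmentationSketch` and the naive sentence splitter are hypothetical helpers written for this example, not part of unicode_string.

```elixir
defmodule SegmentationSketch do
  # Grapheme clusters: what a reader perceives as one character.
  def grapheme_count(text), do: text |> String.graphemes() |> length()

  # Code points: a decomposed "é" is two of these but one grapheme.
  def codepoint_count(text), do: text |> String.codepoints() |> length()

  # Default word splitting: breaks on Unicode whitespace only.
  def words(text), do: String.split(text)

  # Naive sentence splitting: break after ".", "!" or "?" followed by whitespace.
  def naive_sentences(text), do: String.split(text, ~r/(?<=[.!?])\s+/, trim: true)
end

SegmentationSketch.codepoint_count("cafe\u0301")
SegmentationSketch.grapheme_count("cafe\u0301")
SegmentationSketch.naive_sentences("Dr. Smith arrived. He sat down.")
```

The last call splits after the abbreviation "Dr.", which is exactly the kind of case that language-specific segmentation rules are meant to handle.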
Step 3: Parts of Speech Tagging
Now that we have segments of text we can proceed to understanding what is being expressed. The starting point for this is called “parts of speech tagging”. Because I want a good native Elixir implementation that supports a wide variety of languages with good (but not necessarily the absolute best) tagging, I’m using A Rule-based Part-of-Speech and Morphological Tagging Toolkit, which provides fully trained models for ~90 languages using the open source data maintained in the Universal Dependencies treebanks. The trained models are maintained in the RDRPOSTagger project which also defines a rules engine that I will implement in Elixir. Another parser/compiler.
Step 4: Sentiment Analysis
Now that we have a grammatical breakdown of the target source we can start to identify meaning. Wikipedia says:
A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level—whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, “beyond polarity” sentiment classification looks, for instance, at emotional states such as “angry”, “sad”, and “happy”.
The implementation approach is not yet defined and feedback and suggestions are warmly welcomed.
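As a starting point for discussion only, the simplest form of polarity classification is a lexicon lookup: sum a score per known word and map the total to positive, negative or neutral. The sketch below assumes a tiny hand-written lexicon and a hypothetical `PolaritySketch` module; it is not the approach the library has committed to, and it ignores negation, intensity and context.

```elixir
defmodule PolaritySketch do
  # Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
  @lexicon %{
    "love" => 1,
    "great" => 1,
    "happy" => 1,
    "hate" => -1,
    "awful" => -1,
    "sad" => -1
  }

  # Classify a text as :positive, :negative or :neutral from the summed scores.
  def polarity(text) do
    score =
      text
      |> String.downcase()
      |> String.split(~r/[^\p{L}]+/u, trim: true)
      |> Enum.map(&Map.get(@lexicon, &1, 0))
      |> Enum.sum()

    cond do
      score > 0 -> :positive
      score < 0 -> :negative
      true -> :neutral
    end
  end
end

PolaritySketch.polarity("I love this, it is great")
```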
Step 5
To be determined. It will take 2-4 months to get through the first 4 steps so that’s plenty of time for feedback and collaboration.
kip
I’ve published Text 0.4.0 today with seven new NLP modules (all native Elixir, no NIF or ML).
They cover the kinds of preprocessing you might reach for once your sentiment / classification / search pipeline outgrows String.split/1.
This release represents a largely feature complete text library from my perspective. Happy to take feature suggestions though.
A few of the more immediately useful additions in this release:
Text.Clean — pipeline-style normalisation
Whitespace, control characters, smart quotes, mojibake, NFC/NFKC. Composable; defaults are sensible.
```elixir
iex> Text.Clean.clean("<p>it’s <em>cool</em></p>")
"it's cool"
iex> Text.Clean.collapse_whitespace(" hello \tworld \n")
"hello world"
```
Text.Truecase — restore casing for ALL-CAPS or lowercased text
POS-aware heuristics for proper nouns, acronyms, and sentence starts. Useful when an upstream system has destroyed the casing (chat logs, OCR, screaming customer feedback).
```elixir
iex> Text.Truecase.truecase("THE QUICK BROWN FOX JUMPS OVER NEW YORK")
"The quick brown fox jumps over New York"
iex> Text.Truecase.truecase("nasa launched apollo 11 in july 1969.")
"NASA launched Apollo 11 in July 1969."
# Add domain-specific terms once at boot
Text.Truecase.add_terms(["GraphQL", "Phoenix"])
Text.Truecase.truecase("we use phoenix and graphql")
#=> "we use Phoenix and GraphQL"
```
Text.Emoji — detection, stripping, counting, conversion
Backed by the :unicode package’s emoji property tables, so it recognises every codepoint flagged emoji in the current Unicode release — no shipped JSON.
```elixir
iex> Text.Emoji.count("Loved it 🤩 read it twice 📚📚")
3
iex> Text.Emoji.demojize("ship it 🚀")
"ship it :rocket:"
iex> Text.Emoji.emojize("ship it :rocket:")
"ship it 🚀"
```
Text.Hyphenation — Knuth–Liang TeX-pattern hyphenation
Ships en-US patterns baked in (~5 000). Other languages load from any standard hyph-*.tex file.
```elixir
iex> Text.Hyphenation.hyphenate("hyphenation")
"hy-phen-ation"
iex> Text.Hyphenation.count("supercalifragilisticexpialidocious")
9
# Load German patterns once; thereafter all calls are fast
Text.Hyphenation.load_language(:de, path: "hyph-de-1996.tex")
Text.Hyphenation.hyphenate("Bundesausbildungsförderungsgesetz", language: :de)
#=> "Bun-des-aus-bil-dungs-för-de-rungs-ge-setz"
```
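For anyone curious how pattern-based hyphenation works, here is a minimal sketch of the Knuth-Liang idea: wrap the word in `.` markers, find every substring that matches a pattern, keep the highest digit seen at each inter-letter position, and allow a break wherever that digit is odd. The module `LiangSketch` is hypothetical and the nine patterns are a hand-picked subset that happens to cover one word; this is not how Text.Hyphenation is implemented internally.

```elixir
defmodule LiangSketch do
  # Hand-picked subset of patterns, just enough to hyphenate one example word.
  @patterns ~w(hy3ph he2n hena4 hen5at 1na n2at 1tio 2io o2n)

  def hyphenate(word, patterns \\ @patterns) do
    table = Map.new(patterns, &parse/1)
    letters = String.graphemes(word)
    # Patterns are matched against the lowercased word wrapped in "." markers.
    padded = ["."] ++ Enum.map(letters, &String.downcase/1) ++ ["."]
    size = length(padded)

    # For every substring that is a known pattern, keep the highest digit
    # seen at each inter-letter position.
    scores =
      for start <- 0..(size - 1), len <- 1..(size - start), reduce: %{} do
        acc ->
          key = padded |> Enum.slice(start, len) |> Enum.join()

          case Map.fetch(table, key) do
            {:ok, digits} -> merge(acc, digits, start)
            :error -> acc
          end
      end

    last = length(letters) - 3

    letters
    |> Enum.with_index()
    |> Enum.map(fn {letter, i} ->
      # scores[i + 1] is the value just before letter i; odd values allow a break.
      # Keep at least 2 letters before and 3 after a break, as TeX does for English.
      if i >= 2 and i <= last and rem(Map.get(scores, i + 1, 0), 2) == 1,
        do: "-" <> letter,
        else: letter
    end)
    |> Enum.join()
  end

  # "hy3ph" becomes {"hyph", [0, 0, 3, 0, 0]}: one digit per gap, 0 when absent.
  defp parse(pattern) do
    {letters, digits} =
      pattern
      |> String.graphemes()
      |> Enum.reduce({[], [0]}, fn ch, {ls, [_ | rest] = ds} ->
        case Integer.parse(ch) do
          {n, ""} -> {ls, [n | rest]}
          :error -> {[ch | ls], [0 | ds]}
        end
      end)

    {letters |> Enum.reverse() |> Enum.join(), Enum.reverse(digits)}
  end

  # Fold a pattern's digits into the running maximum per position.
  defp merge(acc, digits, start) do
    digits
    |> Enum.with_index(start)
    |> Enum.reduce(acc, fn {d, pos}, acc -> Map.update(acc, pos, d, &max(&1, d)) end)
  end
end

LiangSketch.hyphenate("hyphenation")
```

A real implementation would store the full pattern set in a trie rather than trying every substring against a map.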
Text.PII — detect & redact common identifiers
Phone, email, credit-card-shaped digits, IBANs, IPv4/IPv6, US SSN. Pattern-based — fast and deterministic. The right tool for “please don’t paste this into the LLM” preflight; pair with a stricter checker if you need legal-grade accuracy.
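To illustrate what "pattern-based" means here, the sketch below detects and redacts email addresses with a single regular expression, using `Regex.scan/3` with `return: :index` to recover byte offsets. The module `PiiSketch`, the deliberately loose email pattern and the `[EMAIL]` placeholder are assumptions for this example and are not taken from Text.PII.

```elixir
defmodule PiiSketch do
  # A deliberately loose email pattern; real detectors are stricter.
  defp email_regex, do: ~r/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/

  # Return one map per match, with byte offset and byte length.
  def detect(text) do
    email_regex()
    |> Regex.scan(text, return: :index)
    |> Enum.map(fn [{offset, length}] ->
      %{type: :email, value: binary_part(text, offset, length), offset: offset, length: length}
    end)
  end

  # Replace every match with a fixed placeholder.
  def redact(text), do: Regex.replace(email_regex(), text, "[EMAIL]")
end

PiiSketch.detect("Email me at jane@example.com or call (415) 555-0142.")
PiiSketch.redact("Email me at jane@example.com or call (415) 555-0142.")
```

Because matching is purely syntactic, the same input always gives the same result, which is what makes this style of check fast and deterministic.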
```elixir
iex> Text.PII.detect("Email me at jane@example.com or call (415) 555-0142.")
[%{type: :email, value: "jane@example.com", offset: 12, length: 16}, %{type: :phone, value: "(415) 555-0142", offset: 37, length: 14}]
iex> Text.PII.redact("Card 4111-1111-1111-1111 expires 12/29")
"Card [CREDIT_CARD] expires 12/29"
```
Text.Spell — Norvig-style spelling suggestions
Edit-distance candidates ranked by frequency in Text.WordFreq (the 30,000-word English frequency table that also ships in 0.4.0).
```elixir
iex> Text.Spell.correct("speling")
"spelling"
iex> Text.Spell.candidates("teh") |> Enum.take(3)
[%{word: "the", distance: 1, frequency: 6_187_267}, %{word: "tech", distance: 1, frequency: 49_320}, %{word: "ten", distance: 1, frequency: 21_117}]
```
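The Norvig approach itself is small enough to sketch: generate every string one edit away (delete, transpose, replace, insert), keep those found in a frequency table, and prefer the most frequent. The module `SpellSketch` and its four-word table with made-up counts are hypothetical, and the code assumes lowercase ASCII input; Text.Spell's own data and return shape differ.

```elixir
defmodule SpellSketch do
  @letters for c <- ?a..?z, do: <<c>>
  # Hypothetical toy frequency table with made-up counts.
  @frequencies %{"the" => 1000, "tech" => 50, "ten" => 20, "spelling" => 5}

  # All strings exactly one edit away from word (lowercase ASCII assumed).
  def edits1(word) do
    splits = for i <- 0..String.length(word), do: String.split_at(word, i)

    deletes = for {l, <<_, rest::binary>>} <- splits, do: l <> rest
    transposes = for {l, <<a, b, rest::binary>>} <- splits, do: l <> <<b, a>> <> rest
    replaces = for {l, <<_, rest::binary>>} <- splits, c <- @letters, do: l <> c <> rest
    inserts = for {l, r} <- splits, c <- @letters, do: l <> c <> r

    MapSet.new(deletes ++ transposes ++ replaces ++ inserts)
  end

  # Known words one edit away, most frequent first.
  def candidates(word) do
    word
    |> edits1()
    |> Enum.filter(&Map.has_key?(@frequencies, &1))
    |> Enum.sort_by(&Map.fetch!(@frequencies, &1), :desc)
  end

  # A known word is returned unchanged; otherwise take the best candidate.
  def correct(word) do
    if Map.has_key?(@frequencies, word) do
      word
    else
      List.first(candidates(word)) || word
    end
  end
end

SpellSketch.correct("speling")
SpellSketch.candidates("teh")
```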
Text.Summarize — extractive summarisation via TextRank
Sentence-graph TextRank with configurable similarity (:cosine or :jaccard) and target length.
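For readers new to TextRank, here is a minimal sketch of the technique: treat sentences as graph nodes, weight edges by Jaccard similarity of their word sets, iterate the weighted PageRank update, and return the top-scoring sentences in their original order. The module `TextRankSketch`, the regex-based sentence splitter, the damping factor of 0.85 and the fixed 30 iterations are assumptions for this example rather than details of Text.Summarize.

```elixir
defmodule TextRankSketch do
  @damping 0.85
  @iterations 30

  def summarize(text, count) do
    sentences = String.split(text, ~r/(?<=[.!?])\s+/, trim: true)
    sets = sentences |> Enum.map(&tokens/1) |> Enum.with_index()
    indices = Enum.map(sets, fn {_set, i} -> i end)

    # Edge weights: Jaccard similarity between each ordered pair of sentences.
    weights =
      for {a, i} <- sets, {b, j} <- sets, i != j, w = jaccard(a, b), w > 0, into: %{} do
        {{i, j}, w}
      end

    # Total outgoing weight per sentence, used to normalise each edge.
    out = Enum.reduce(weights, %{}, fn {{i, _j}, w}, acc -> Map.update(acc, i, w, &(&1 + w)) end)

    initial = Map.new(indices, &{&1, 1.0})

    scores =
      Enum.reduce(1..@iterations, initial, fn _round, scores ->
        Map.new(indices, fn i ->
          incoming =
            for {{j, ^i}, w} <- weights, reduce: 0.0 do
              acc -> acc + w / out[j] * scores[j]
            end

          {i, 1 - @damping + @damping * incoming}
        end)
      end)

    # Pick the highest-scoring sentences, then restore document order.
    indices
    |> Enum.sort_by(fn i -> scores[i] end, :desc)
    |> Enum.take(count)
    |> Enum.sort()
    |> Enum.map(&Enum.at(sentences, &1))
    |> Enum.join(" ")
  end

  defp tokens(sentence) do
    sentence
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    |> MapSet.new()
  end

  defp jaccard(a, b) do
    union = MapSet.size(MapSet.union(a, b))
    if union == 0, do: 0.0, else: MapSet.size(MapSet.intersection(a, b)) / union
  end
end

TextRankSketch.summarize("Cats chase mice. Cats chase birds. Dogs sleep.", 2)
```

In this toy input the two "cats" sentences reinforce each other while the unrelated third sentence receives only the baseline score, so it is the one dropped.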
```elixir
article = """
The new bridge, opened on Tuesday, connects the two halves of the city for the first time in decades. Engineers worked three winters to anchor the central pier on the riverbed. Residents who used to take a 40-minute ferry now make the trip in five. The mayor said the project came in 2 % under budget, a rarity for civic work of this scale.
"""
iex> Text.Summarize.summarize(article, sentences: 2)
"The new bridge, opened on Tuesday, connects the two halves of the city for the first time in decades. Residents who used to take a 40-minute ferry now make the trip in five."
```
kip
text version 2.0 has been published today, right on schedule. The library text_corpus_udhr is also published today; it provides a corpus to support natural language detection.
Language Detection
text contains 3 language classifiers to aid in natural language detection. However it does not include any corpora; these are contained in separate libraries. The available classifiers are:
- Text.Language.Classifier.CommulativeFrequency
- Text.Language.Classifier.NaiveBayesian
- Text.Language.Classifier.RankOrder
Additional classifiers can be added by defining a module that implements the Text.Language.Classifier behaviour.
The library text_corpus_udhr implements the Text.Corpus behaviour for the United Nations Universal Declaration of Human Rights, which is available for download in 423 languages from Unicode.
Examples:
```elixir
iex> Text.Language.detect "this is some english language thing"
{:ok, "en"}
# Options include `:corpus`, `:vocabulary` and `:classifier`
iex> Text.Language.detect "this is some english language thing", corpus: Text.Corpus.Udhr, vocabulary: Text.Vocabulary.Udhr.Quadgram
{:ok, "en"}
```
Word Counting
text contains an implementation of word counting that is oriented towards large streams of words rather than discrete strings. Input to Text.Word.word_count/2 can be a String.t, File.Stream.t or Flow.t allowing flexible streaming of text.
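The stream-oriented idea can be sketched with the standard library alone: treat the input as any enumerable of lines, split lazily, and only materialise the final frequency map. The module `WordCountSketch` and its single function are hypothetical and much simpler than Text.Word.word_count/2; in particular there is no Flow-based parallelism here.

```elixir
defmodule WordCountSketch do
  # Accepts any enumerable of strings, e.g. a list of lines or File.stream!/1.
  def word_count(lines) do
    lines
    |> Stream.flat_map(&String.split/1)
    |> Stream.map(&String.downcase/1)
    |> Enum.frequencies()
  end
end

WordCountSketch.word_count(["the cat", "The dog"])
```

Because only `Enum.frequencies/1` forces evaluation, the same function works on a `File.stream!/1` without loading the whole file into memory.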
English Pluralization
text includes an inflector for the English language that takes an approach based upon An Algorithmic Approach to English Pluralization. See the module Text.Inflect.En and the functions:
- Text.Inflect.En.pluralize/2
- Text.Inflect.En.pluralize_noun/2
- Text.Inflect.En.pluralize_verb/1
- Text.Inflect.En.pluralize_adjective/1
N-Gram generation
The Text.Ngram module supports efficient generation of n-grams of length 2 to 7. See Text.Ngram.ngram/2.
Language detection accuracy
Detection accuracy is reliable at text lengths of 150 characters or more, reasonable at 100 characters and may not be considered acceptable at shorter lengths.
The results are consistent for the range of tested languages with German being a clear exception where the results are unacceptable for now.
- English
- Greek
- Russian
- Spanish
- Finnish
- French
- Icelandic
- Italian
- Japanese
- Simplified Chinese
Further details are contained in the analysis directory of the GitHub repo.
English language with Naive Bayesian classifier
Text.Language.detect/2 with classifier: Text.Language.Classifier.NaiveBayesian and three different vocabularies.
| Text Length | Udhr.Bigram | Udhr.Multigram | Udhr.Quadgram |
|---|---|---|---|
| 50 | 95.6% | 92.7% | 95.6% |
| 100 | 99.9% | 99.5% | 98.8% |
| 150 | 100.0% | 100.0% | 99.3% |
| 300 | 100.0% | 100.0% | 100.0% |
Accuracy for German language detection
German is an exception to the consistent accuracy of most languages and the results are poor. Further analysis is required to understand the underlying cause.
| Text Length | Udhr.Bigram | Udhr.Multigram | Udhr.Quadgram |
|---|---|---|---|
| 50 | 45.6% | 38.0% | 27.2% |
| 100 | 64.4% | 47.2% | 42.3% |
| 150 | 71.6% | 57.6% | 51.4% |
| 300 | 78.7% | 47.2% | 56.9% |