kip

ex_cldr Core Team

Text - a text analysis library

I’ll shortly be launching Text, a nascent text analysis library.

Current functionality

In this early version (not ready for prime time) it includes:

  1. Word counting
  2. N-gram generation
  3. Language detection (of about 250 languages with pluggable vocabularies and pluggable correlation models)
  4. An English inflector (singular to plural) using a non-regex algorithmic approach

Future functionality

  • A language stemmer - as soon as I finish writing the Snowball compiler

  • Parts of speech tagger

Collaboration encouraged

  • Contributions in all areas are most welcome

  • Non-English speakers who would like to contribute to non-English inflectors are particularly welcome

Next steps

After some polishing this weekend I will publish a version to Hex.

kip

ex_cldr Core Team

Thought I’d share the near-term roadmap in a little more detail. Feedback is most definitely welcome on the capabilities you would find most useful, or on any areas you’d like to contribute to.

Step 1: Language recognition

Most natural language processing is language dependent, so identifying the source language is important. The primary way of identifying a language is to split the text into n-grams and then compare statistics of the source text against the same statistics for a standard corpus in each candidate language. The Universal Declaration of Human Rights is a standard text published in a lot of languages, so this is the corpus I’m using. There are different ways to correlate source text against a corpus. I am primarily using the algorithms in Language Identification from Text Using N-gram Based Cumulative Frequency Addition.
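
To make the idea concrete, here is a minimal sketch of one reading of cumulative frequency addition: build a relative-frequency n-gram profile per language, then sum the profile frequencies of the n-grams found in the input and pick the language with the largest total. The module name CfaSketch, the choice of trigrams and the two one-sentence "corpora" (Article 1 of the UDHR) are assumptions for illustration only, not the library's implementation.

```elixir
defmodule CfaSketch do
  # Character n-grams of length `n` from the downcased text.
  def ngrams(text, n \\ 3) do
    text
    |> String.downcase()
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end

  # Relative frequency of each n-gram in a training text.
  def profile(text, n \\ 3) do
    grams = ngrams(text, n)
    total = length(grams)

    grams
    |> Enum.frequencies()
    |> Map.new(fn {gram, count} -> {gram, count / total} end)
  end

  # For each language, add up the profile frequencies of the input's n-grams
  # and return the language with the highest cumulative total.
  def detect(text, profiles, n \\ 3) do
    grams = ngrams(text, n)

    {language, _profile} =
      Enum.max_by(profiles, fn {_language, profile} ->
        grams |> Enum.map(&Map.get(profile, &1, 0.0)) |> Enum.sum()
      end)

    language
  end
end

profiles = %{
  "en" => CfaSketch.profile("all human beings are born free and equal in dignity and rights"),
  "fr" => CfaSketch.profile("tous les êtres humains naissent libres et égaux en dignité et en droits")
}

CfaSketch.detect("born free", profiles)
#=> "en"
```

A real profile would be built from the full corpus text for each language rather than a single sentence.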

This is now due for delivery on 28th June.

Step 2: Text segmentation

No matter what analysis is required, the text first has to be segmented into grapheme clusters, words and sentences, and this is very language dependent. Elixir’s String.graphemes/1 implements the Unicode segmentation algorithm for grapheme clusters, so that’s taken care of. Elixir’s String.split/1 splits on Unicode whitespace, which is great for a default case but not sufficient for language-specific word segmentation. And we still need sentence segmentation too. Therefore I am implementing the CLDR Segmentation rules, which provide language-specific customisation for text segmentation. This is another rules parser (I think so far I have implemented 8 different rules parsers and “compilers” in various parts of the ex_cldr project).

The text segmentation algorithms will be implemented as part of the unicode_string library.
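
A short illustration of where the standard library stops, using only functions from Elixir's String module. The sample strings and the naive sentence-splitting regex are assumptions chosen to show the gaps, not part of any library.

```elixir
# Grapheme clusters: "e" plus a combining acute accent is one user-perceived character.
graphemes = String.graphemes("e\u0301a")
# `graphemes` has two elements: the accented "e" and "a".

# String.split/1 breaks on Unicode whitespace.
words = String.split("the quick  brown\tfox")
#=> ["the", "quick", "brown", "fox"]

# Text written without spaces comes back as a single "word".
unsegmented = String.split("今日は良い天気です")
#=> ["今日は良い天気です"]

# A naive sentence splitter breaks after every ".", "!" or "?" followed by
# whitespace, so it wrongly splits after the abbreviation "Dr.".
sentences = String.split("Dr. Smith arrived. He sat down.", ~r/(?<=[.!?])\s+/)
#=> ["Dr.", "Smith arrived.", "He sat down."]
```

The last two cases are the kind of thing language-specific segmentation rules are meant to handle.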

Step 3: Parts of Speech Tagging

Now that we have segments of text we can proceed to understanding what is being expressed. The starting point for this is called “parts of speech tagging”. Because I want a good native Elixir implementation that supports a wide variety of languages with good (but not necessarily the absolute best) tagging, I’m using A Rule-based Part-of-Speech and Morphological Tagging Toolkit, which provides fully trained models for ~90 languages using the open source data maintained in the Universal Dependencies treebanks. The trained models are maintained in the RDRPOSTagger project, which also defines a rules engine that I will implement in Elixir. Another parser/compiler :slight_smile:
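
As a rough picture of what rule-based tagging means, the sketch below assigns each word an initial tag from a lexicon and then applies contextual correction rules keyed on the previous tag. It is a deliberately simplified toy: the module TagSketch, the lexicon, the tag names and the rule shape are all hypothetical and do not follow the RDRPOSTagger rule format.

```elixir
defmodule TagSketch do
  # Hypothetical toy lexicon: the most likely tag for each known word.
  @lexicon %{"the" => :det, "i" => :pron, "can" => :aux, "fish" => :noun, "rusts" => :verb}

  # Hypothetical correction rules: {previous tag, initial tag, replacement tag}.
  @rules [{:det, :aux, :noun}, {:aux, :noun, :verb}]

  def tag(words) do
    words
    |> Enum.map(&{&1, Map.get(@lexicon, String.downcase(&1), :noun)})
    |> correct(nil, [])
  end

  defp correct([], _previous, acc), do: Enum.reverse(acc)

  defp correct([{word, initial} | rest], previous, acc) do
    # Use the first rule matching the (already corrected) previous tag and the
    # initial tag; otherwise keep the initial tag.
    corrected =
      Enum.find_value(@rules, initial, fn
        {^previous, ^initial, replacement} -> replacement
        _other -> nil
      end)

    correct(rest, corrected, [{word, corrected} | acc])
  end
end

TagSketch.tag(["I", "can", "fish"])
#=> [{"I", :pron}, {"can", :aux}, {"fish", :verb}]

TagSketch.tag(["the", "can", "rusts"])
#=> [{"the", :det}, {"can", :noun}, {"rusts", :verb}]
```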

Step 4: Sentiment Analysis

Now that we have a grammatical breakdown of the target source we can start to identify meaning. Wikipedia says:

A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level—whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, “beyond polarity” sentiment classification looks, for instance, at emotional states such as “angry”, “sad”, and “happy”.

The implementation approach is not yet defined and feedback and suggestions are warmly welcomed.

Step 5

To be determined. It will take 2-4 months to get through the first 4 steps, so that’s plenty of time for feedback and collaboration :slight_smile:

kip

ex_cldr Core Team

I’ve published Text 0.4.0 today with seven new NLP modules (all native Elixir, no NIF or ML).

They cover the kinds of preprocessing you might reach for once your sentiment / classification / search pipeline outgrows String.split/1.

This release represents a largely feature-complete text library from my perspective. Happy to take feature suggestions though.

A few of the more immediately useful additions in this release:

Text.Clean — pipeline-style normalisation

Whitespace, control characters, smart quotes, mojibake, NFC/NFKC. Composable; defaults are sensible.
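
For a feel of what a composable normalisation pipeline can look like, here is a minimal sketch built only from standard String functions. The module CleanSketch and its three steps are assumptions for illustration and are not the Text.Clean implementation.

```elixir
defmodule CleanSketch do
  # Each step takes and returns a string, so steps can be piped in any order.
  def clean(text) do
    text
    |> String.normalize(:nfc)
    |> straighten_quotes()
    |> collapse_whitespace()
  end

  # Replace curly single and double quotes with their ASCII equivalents.
  def straighten_quotes(text) do
    text
    |> String.replace(["\u2018", "\u2019"], "'")
    |> String.replace(["\u201C", "\u201D"], "\"")
  end

  # Split on Unicode whitespace and rejoin with single spaces.
  def collapse_whitespace(text), do: text |> String.split() |> Enum.join(" ")
end

CleanSketch.clean("  it\u2019s   cool \n")
#=> "it's cool"
```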

iex> Text.Clean.clean("<p>it’s   <em>cool</em></p>")
"it's cool"
iex> Text.Clean.collapse_whitespace("  hello \tworld  \n")
"hello world"

Text.Truecase — restore casing for ALL-CAPS or lowercased text

POS-aware heuristics for proper nouns, acronyms, and sentence starts. Useful when an upstream system has destroyed the casing (chat logs, OCR, screaming customer feedback).

iex> Text.Truecase.truecase("THE QUICK BROWN FOX JUMPS OVER NEW YORK")
"The quick brown fox jumps over New York"
iex> Text.Truecase.truecase("nasa launched apollo 11 in july 1969.")
"NASA launched Apollo 11 in July 1969."

# Add domain-specific terms once at boot
Text.Truecase.add_terms(["GraphQL", "Phoenix"])
Text.Truecase.truecase("we use phoenix and graphql")
#=> "we use Phoenix and GraphQL"

Text.Emoji — detection, stripping, counting, conversion

Backed by the :unicode package’s emoji property tables, so it recognises every codepoint flagged emoji in the current Unicode release — no shipped JSON.

iex> Text.Emoji.count("Loved it 🤩 read it twice 📚📚")
3
iex> Text.Emoji.demojize("ship it 🚀")
"ship it :rocket:"
iex> Text.Emoji.emojize("ship it :rocket:")
"ship it 🚀"

Text.Hyphenation — Knuth–Liang TeX-pattern hyphenation

Ships en-US patterns baked in (~5,000). Other languages load from any standard hyph-*.tex file.

iex> Text.Hyphenation.hyphenate("hyphenation")
"hy-phen-ation"
iex> Text.Hyphenation.count("supercalifragilisticexpialidocious")
9
# Load German patterns once; thereafter all calls are fast
Text.Hyphenation.load_language(:de, path: "hyph-de-1996.tex")
Text.Hyphenation.hyphenate("Bundesausbildungsförderungsgesetz", language: :de)
#=> "Bun-des-aus-bil-dungs-för-de-rungs-ge-setz"

Text.PII — detect & redact common identifiers

Phone, email, credit-card-shaped digits, IBANs, IPv4/IPv6, US SSN. Pattern-based — fast and deterministic. The right tool for “please don’t paste this into the LLM” preflight; pair with a stricter checker if you need legal-grade accuracy.
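
Pattern-based detection can be sketched with plain regular expressions. The module PiiSketch and its two deliberately loose patterns are assumptions for illustration, not the patterns the library uses; offsets and lengths here are byte offsets as returned by Regex.scan/3.

```elixir
defmodule PiiSketch do
  # Illustrative, deliberately loose patterns (assumptions, not the library's).
  defp patterns do
    [
      email: ~r/[\w.+-]+@[\w-]+(?:\.[\w-]+)+/,
      ipv4: ~r/\b(?:\d{1,3}\.){3}\d{1,3}\b/
    ]
  end

  # Return every match with its type, value and byte position, in text order.
  def detect(text) do
    found =
      for {type, regex} <- patterns(),
          [{offset, len}] <- Regex.scan(regex, text, return: :index) do
        %{type: type, value: binary_part(text, offset, len), offset: offset, length: len}
      end

    Enum.sort_by(found, & &1.offset)
  end

  # Replace every match with an upper-cased placeholder such as "[EMAIL]".
  def redact(text) do
    Enum.reduce(patterns(), text, fn {type, regex}, acc ->
      Regex.replace(regex, acc, "[" <> String.upcase(Atom.to_string(type)) <> "]")
    end)
  end
end

PiiSketch.redact("Email me at jane@example.com from 10.0.0.1")
#=> "Email me at [EMAIL] from [IPV4]"
```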

iex> Text.PII.detect("Email me at jane@example.com or call (415) 555-0142.")
[
  %{type: :email, value: "jane@example.com", offset: 12, length: 16},
  %{type: :phone, value: "(415) 555-0142", offset: 37, length: 14}
]
iex> Text.PII.redact("Card 4111-1111-1111-1111 expires 12/29")
"Card [CREDIT_CARD] expires 12/29"

Text.Spell — Norvig-style spelling suggestions

Edit-distance candidates ranked by frequency in Text.WordFreq (the 30,000-word English frequency table that also ships in 0.4.0).

iex> Text.Spell.correct("speling")
"spelling"
iex> Text.Spell.candidates("teh") |> Enum.take(3)
[
  %{word: "the", distance: 1, frequency: 6_187_267},
  %{word: "tech", distance: 1, frequency: 49_320},
  %{word: "ten", distance: 1, frequency: 21_117}
]
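
The Norvig approach generates every string one edit away from the input (deletes, transposes, replaces, inserts) and keeps the known word with the highest frequency. The sketch below assumes lowercase ASCII input and a caller-supplied frequency map; the module SpellSketch and the tiny frequency tables are hypothetical and unrelated to Text.WordFreq.

```elixir
defmodule SpellSketch do
  @letters Enum.map(?a..?z, fn c -> <<c>> end)

  # All strings exactly one edit away from `word`.
  def edits1(word) do
    splits = for i <- 0..String.length(word), do: String.split_at(word, i)

    deletes = for {l, <<_::utf8, rest::binary>>} <- splits, do: l <> rest

    transposes =
      for {l, <<a::utf8, b::utf8, rest::binary>>} <- splits,
          do: l <> <<b::utf8, a::utf8>> <> rest

    replaces = for {l, <<_::utf8, rest::binary>>} <- splits, c <- @letters, do: l <> c <> rest
    inserts = for {l, r} <- splits, c <- @letters, do: l <> c <> r

    MapSet.new(deletes ++ transposes ++ replaces ++ inserts)
  end

  # Keep a known word as is; otherwise pick the most frequent known candidate,
  # falling back to the input when no candidate is known.
  def correct(word, freqs) do
    known = Enum.filter(edits1(word), &Map.has_key?(freqs, &1))

    cond do
      Map.has_key?(freqs, word) -> word
      known == [] -> word
      true -> Enum.max_by(known, &Map.fetch!(freqs, &1))
    end
  end
end

SpellSketch.correct("speling", %{"spelling" => 10, "spewing" => 3})
#=> "spelling"
```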

Text.Summarize — extractive summarisation via TextRank

Sentence-graph TextRank with configurable similarity (:cosine or :jaccard) and target length.
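
For readers unfamiliar with TextRank, the sketch below builds a graph whose nodes are sentences and whose edge weights are Jaccard similarities between their word sets, scores the nodes with a weighted PageRank by power iteration, and returns the top sentences in their original order. The module TextRankSketch, the naive sentence splitter, the damping factor of 0.85 and the fixed 20 iterations are assumptions for illustration, not the Text.Summarize implementation.

```elixir
defmodule TextRankSketch do
  @damping 0.85

  # Return the `count` highest-ranked sentences, in document order.
  def summarize(text, count) do
    sentences = String.split(String.trim(text), ~r/(?<=[.!?])\s+/, trim: true)
    scores = sentences |> Enum.map(&word_set/1) |> rank(20)

    sentences
    |> Enum.zip(scores)
    |> Enum.with_index()
    |> Enum.sort_by(fn {{_sentence, score}, _index} -> score end, :desc)
    |> Enum.take(count)
    |> Enum.sort_by(fn {_pair, index} -> index end)
    |> Enum.map(fn {{sentence, _score}, _index} -> sentence end)
  end

  defp word_set(sentence) do
    sentence
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    |> MapSet.new()
  end

  defp jaccard(a, b) do
    union = MapSet.size(MapSet.union(a, b))
    if union == 0, do: 0.0, else: MapSet.size(MapSet.intersection(a, b)) / union
  end

  # Weighted PageRank by power iteration over the sentence similarity graph.
  defp rank(sets, iterations) do
    indexed = Enum.with_index(sets)

    weights =
      for {a, i} <- indexed, {b, j} <- indexed, i != j, into: %{}, do: {{i, j}, jaccard(a, b)}

    # Total outgoing edge weight per sentence.
    out =
      for {_set, i} <- indexed, into: %{} do
        {i, Enum.sum(for {_other, j} <- indexed, j != i, do: weights[{i, j}])}
      end

    initial = Enum.map(indexed, fn _ -> 1.0 end)

    Enum.reduce(1..iterations, initial, fn _round, scores ->
      for {_set, i} <- indexed do
        incoming =
          for {score, j} <- Enum.with_index(scores), j != i, out[j] > 0 do
            weights[{j, i}] / out[j] * score
          end

        1 - @damping + @damping * Enum.sum(incoming)
      end
    end)
  end
end

TextRankSketch.summarize("Cats chase mice. Mice fear cats. The weather is mild today.", 2)
|> Enum.join(" ")
```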

article = """
The new bridge, opened on Tuesday, connects the two halves of the city for the first time in decades. Engineers worked three winters to anchor the central pier on the riverbed. Residents who used to take a 40-minute ferry now make the trip in five. The mayor said the project came in 2 % under budget, a rarity for civic work of this scale.
"""
iex> Text.Summarize.summarize(article, sentences: 2)
"The new bridge, opened on Tuesday, connects the two halves of the city for the first time in decades. Residents who used to take a 40-minute ferry now make the trip in five."
kip

ex_cldr Core Team

text version 2.0 has been published today, right on schedule. The library text_corpus_udhr is also published today; it provides a corpus to support natural language detection.

Language Detection

text contains 3 language classifiers to aid in natural language detection. However it does not include any corpora; these are contained in separate libraries. The available classifiers are:

  • Text.Language.Classifier.CommulativeFrequency
  • Text.Language.Classifier.NaiveBayesian
  • Text.Language.Classifier.RankOrder

Additional classifiers can be added by defining a module that implements the Text.Language.Classifier behaviour.

The library text_corpus_udhr implements the Text.Corpus behaviour for the Universal Declaration of Human Rights, which is available for download in 423 languages from Unicode.

Examples:

iex> Text.Language.detect "this is some english language thing"
{:ok, "en"}

# Options include `:corpus`, `:vocabulary` and `:classifier`
iex> Text.Language.detect "this is some english language thing", corpus: Text.Corpus.Udhr, vocabulary: Text.Vocabulary.Udhr.Quadgram
{:ok, "en"}

Word Counting

text contains an implementation of word counting that is oriented towards large streams of words rather than discrete strings. Input to Text.Word.word_count/2 can be a String.t, File.Stream.t or Flow.t allowing flexible streaming of text.
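
The streaming style can be illustrated with plain Stream and Enum functions: any enumerable of lines is consumed lazily and only the running counts are held in memory. The module WordCountSketch is an assumption for illustration and does not reproduce the options or return shape of Text.Word.word_count/2.

```elixir
defmodule WordCountSketch do
  # Accepts any enumerable of strings (a list, a File.Stream, ...).
  def count(lines) do
    lines
    |> Stream.flat_map(&String.split/1)
    |> Stream.map(&String.downcase/1)
    |> Enum.reduce(%{}, fn word, acc -> Map.update(acc, word, 1, &(&1 + 1)) end)
  end
end

WordCountSketch.count(["the cat", "The dog"])
#=> %{"cat" => 1, "dog" => 1, "the" => 2}

# The same function works line by line over a file (hypothetical path):
# File.stream!("big.txt") |> WordCountSketch.count()
```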

English Pluralization

text includes an inflector for the English language that takes an approach based upon An Algorithmic Approach to English Pluralization. See the module Text.Inflect.En and the functions:

  • Text.Inflect.En.pluralize/2
  • Text.Inflect.En.pluralize_noun/2
  • Text.Inflect.En.pluralize_verb/1
  • Text.Inflect.En.pluralize_adjective/1

N-Gram generation

The Text.Ngram module supports efficient generation of n-grams of length 2 to 7. See Text.Ngram.ngram/2.
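
As a point of reference, character n-grams and their counts can be produced with a sliding window over the graphemes. The module NgramSketch is an assumption for illustration; the guard mirrors the 2 to 7 range mentioned above, and the return shape is not claimed to match Text.Ngram.ngram/2.

```elixir
defmodule NgramSketch do
  # Count the character n-grams of length `n` (2 to 7) in `text`.
  def ngram(text, n) when n in 2..7 do
    text
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> Enum.frequencies()
  end
end

NgramSketch.ngram("banana", 2)
#=> %{"an" => 2, "ba" => 1, "na" => 2}
```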

Language detection accuracy

Detection accuracy is reliable at text lengths of 150 characters or more, reasonable at 100 characters and may not be acceptable at shorter lengths.

The results are consistent across the tested languages listed below. German is a clear exception: its results are unacceptable for now.

  • English
  • Greek
  • Russian
  • Spanish
  • Finnish
  • French
  • Icelandic
  • Italian
  • Japanese
  • Simplified Chinese

Further details are contained in the analysis directory of the GitHub repo.

English language with Naive Bayesian classifier

Text.Language.detect/2 with classifier: Text.Language.Classifier.NaiveBayesian and three different vocabularies.

| Text Length (characters) | Udhr.Bigram | Udhr.Multigram | Udhr.Quadgram |
|---|---|---|---|
| 50 | 95.6% | 92.7% | 95.6% |
| 100 | 99.9% | 99.5% | 98.8% |
| 150 | 100.0% | 100.0% | 99.3% |
| 300 | 100.0% | 100.0% | 100.0% |

Accuracy for German language detection

German is an exception to the consistent accuracy of most languages and the results are poor. Further analysis is required to understand the underlying cause.

| Text Length (characters) | Udhr.Bigram | Udhr.Multigram | Udhr.Quadgram |
|---|---|---|---|
| 50 | 45.6% | 38.0% | 27.2% |
| 100 | 64.4% | 47.2% | 42.3% |
| 150 | 71.6% | 57.6% | 51.4% |
| 300 | 78.7% | 47.2% | 56.9% |