Text - a text analysis library

I’ll shortly be launching Text, a nascent text analysis library.

Current functionality

This early version (not ready for prime time) includes:

  1. Word counting
  2. N-gram generation
  3. Language detection (about 250 languages, with pluggable vocabularies and correlation models)
  4. An English inflector (singular to plural) using a non-regex algorithmic approach

Future functionality

  • A language stemmer - as soon as I finish writing the Snowball compiler

  • A part-of-speech tagger

Collaboration encouraged

  • Contributions in all areas are most welcome

  • Non-English speakers who would like to contribute non-English inflectors are particularly welcome

Next steps

After some polishing this weekend I will publish a version to Hex.

58 Likes

Based on the positive feedback it seems this project has merit. I didn’t finish all the work I planned for the weekend, but I’m clearing a backlog so I can give this project greater attention. The inflector is finished (nouns, pronouns, verbs). I’ve started work on adding the Metaphone 2 (Double Metaphone) algorithm but it’s a slow slog because I can’t find any description of the algorithm - just imperative code. And then some more testing and verification of language detection.

All in all, likely a one week delay.

7 Likes

Thought I’d share the near-term roadmap in a little more detail. Feedback is most definitely welcome on the capabilities you would find most useful, or on any areas you’d like to contribute to.

Step 1: Language recognition

Most natural language processing is language dependent, so identifying the source language is important. The primary way of identifying a language is to split the text into n-grams and then compare statistical analyses of the source text with the same analyses of standard corpora in multiple languages. The Universal Declaration of Human Rights is a standard text published in a great many languages, so this is the corpus I’m using. There are different ways to correlate source text against a corpus; I am primarily using the algorithms in Language Identification from Text Using N-gram Based Cumulative Frequency Addition.
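
As a rough illustration of the idea (a sketch only, not the library’s implementation), a cumulative-frequency classifier can be expressed in a few lines of Elixir. The profiles map of per-language n-gram frequencies is assumed to be precomputed from a corpus:

defmodule DetectionSketch do
  # Score each language by summing the corpus frequencies of the
  # n-grams found in the input text; the highest total wins.
  # `profiles` is assumed to look like %{"en" => %{"the" => 0.012, ...}, ...}
  def detect(text, profiles, n \\ 3) do
    grams =
      text
      |> String.downcase()
      |> String.graphemes()
      |> Enum.chunk_every(n, 1, :discard)
      |> Enum.map(&Enum.join/1)

    profiles
    |> Enum.map(fn {language, frequencies} ->
      {language, grams |> Enum.map(&Map.get(frequencies, &1, 0.0)) |> Enum.sum()}
    end)
    |> Enum.max_by(fn {_language, score} -> score end)
  end
end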

This is now due for delivery on 28th June.

Step 2: Text segmentation

Whatever analysis is required, segmenting the text into grapheme clusters, words and sentences comes first, and this is very language dependent. Elixir’s String.graphemes/1 implements the Unicode segmentation algorithm for grapheme clusters, so that’s taken care of. Elixir’s String.split/1 splits on Unicode whitespace, which is great as a default but not sufficient for language-specific word segmentation. And we still need sentence segmentation too. Therefore I am implementing the CLDR segmentation rules, which provide language-specific customisation for text segmentation. This is another rules parser (I think I have so far implemented 8 different rules parsers and “compilers” in various parts of the ex_cldr project).
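
For instance, with just the standard library:

iex> String.graphemes("e\u0301clair")  # "e" plus a combining acute is one grapheme cluster
["é", "c", "l", "a", "i", "r"]

iex> String.split("das ist ein Haus")
["das", "ist", "ein", "Haus"]

Whitespace splitting breaks down for languages such as Japanese or Thai that do not delimit words with spaces, which is exactly the gap the CLDR rules fill.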

The text segmentation algorithms will be implemented as part of the unicode_string library.

Step 3: Parts of Speech Tagging

Now that we have segments of text we can proceed to understanding what is being expressed. The starting point for this is called “part-of-speech tagging”. Because I want a good native Elixir implementation that supports a wide variety of languages with good (but not necessarily the absolute best) tagging, I’m using A Rule-based Part-of-Speech and Morphological Tagging Toolkit, which provides pre-trained models for ~90 languages built from the open-source data maintained in the Universal Dependencies treebanks. The trained models are maintained in the RDRPOSTagger project, which also defines a rules engine that I will implement in Elixir. Another parser/compiler :slight_smile:
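
To give a flavour of what a rule-based tagger does (a toy illustration only, not RDRPOSTagger’s actual rule format), each word gets an initial tag from a lexicon and context-sensitive exception rules then correct it:

defmodule TaggerSketch do
  # Toy lexicon; a real tagger derives these from treebank data.
  @lexicon %{"i" => :pron, "can" => :aux, "fish" => :noun}

  def tag(words) do
    words
    |> Enum.map(fn word -> {word, Map.get(@lexicon, word, :noun)} end)
    |> apply_rules()
  end

  # One illustrative exception rule: a :noun directly after the
  # auxiliary "can" is retagged as a :verb ("i can fish").
  defp apply_rules([{"can", :aux} = aux, {word, :noun} | rest]) do
    [aux, {word, :verb} | apply_rules(rest)]
  end
  defp apply_rules([head | rest]), do: [head | apply_rules(rest)]
  defp apply_rules([]), do: []
end

iex> TaggerSketch.tag(["i", "can", "fish"])
[{"i", :pron}, {"can", :aux}, {"fish", :verb}]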

Step 4: Sentiment Analysis

Now that we have a grammatical breakdown of the source text we can start to identify meaning. Wikipedia says:

A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level—whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, “beyond polarity” sentiment classification looks, for instance, at emotional states such as “angry”, “sad”, and “happy”.

The implementation approach is not yet defined and feedback and suggestions are warmly welcomed.
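
As one possible starting point for that discussion (purely illustrative, not a committed design), a naive lexicon-based polarity score might look like this:

defmodule SentimentSketch do
  # Tiny illustrative polarity lexicon; a real one would be much
  # larger and ideally per-language.
  @polarity %{"good" => 1, "happy" => 1, "bad" => -1, "sad" => -1}

  def polarity(text) do
    score =
      text
      |> String.downcase()
      |> String.split()
      |> Enum.map(&Map.get(@polarity, &1, 0))
      |> Enum.sum()

    cond do
      score > 0 -> :positive
      score < 0 -> :negative
      true -> :neutral
    end
  end
end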

Step 5

To be determined. It will take 2-4 months to get through the first 4 steps, so that’s plenty of time for feedback and collaboration :slight_smile:

14 Likes

text version 0.2.0 has been published today, right on schedule. In addition, the library text_corpus_udhr is also published today - it provides a corpus to support natural language detection.

Language Detection

text contains 3 language classifiers to aid natural language detection. However it does not include any corpora; these are contained in separate libraries. The available classifiers are:

  • Text.Language.Classifier.CommulativeFrequency
  • Text.Language.Classifier.NaiveBayesian
  • Text.Language.Classifier.RankOrder

Additional classifiers can be added by defining a module that implements the Text.Language.Classifier behaviour.

The library text_corpus_udhr implements the Text.Corpus behaviour for the Universal Declaration of Human Rights, which is available for download in 423 languages from Unicode.

Examples:

iex> Text.Language.detect "this is some english language thing"
{:ok, "en"}

# Options include `:corpus`, `:vocabulary` and `:classifier`
iex> Text.Language.detect "this is some english language thing", corpus: Text.Corpus.Udhr, vocabulary: Text.Vocabulary.Udhr.Quadgram
{:ok, "en"}

Word Counting

text contains an implementation of word counting that is oriented towards large streams of words rather than discrete strings. Input to Text.Word.word_count/2 can be a String.t, a File.Stream.t or a Flow.t, allowing flexible streaming of text.
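
For example, counting the words in a large file without loading it all into memory (assuming the options argument to word_count/2 can be defaulted):

"large_text_file.txt"
|> File.stream!()
|> Text.Word.word_count()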

English Pluralization

text includes an inflector for the English language that takes an approach based upon An Algorithmic Approach to English Pluralization. See the module Text.Inflect.En and the functions:

  • Text.Inflect.En.pluralize/2
  • Text.Inflect.En.pluralize_noun/2
  • Text.Inflect.En.pluralize_verb/1
  • Text.Inflect.En.pluralize_adjective/1
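
A minimal illustration (a regular noun, so the output is what any English pluralizer should produce; the second argument to pluralize_noun/2 is assumed to default):

iex> Text.Inflect.En.pluralize_noun("cat")
"cats"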

N-Gram generation

The Text.Ngram module supports efficient generation of n-grams of length 2 to 7. See Text.Ngram.ngram/2.
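
The underlying idea is simple; a plain-Elixir equivalent for bigrams (not the library’s implementation, which may differ in return type and performance) is:

iex> "elixir" |> String.graphemes() |> Enum.chunk_every(2, 1, :discard) |> Enum.map(&Enum.join/1)
["el", "li", "ix", "xi", "ir"]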

Language detection accuracy

Detection accuracy is reliable at text lengths of 150 characters or more, reasonable at 100 characters, and should not be considered acceptable at shorter lengths.

The results are consistent across the tested languages, with German being a clear exception for which the results are unacceptable for now. The tested languages are:

  • English
  • Greek
  • Russian
  • Spanish
  • Finnish
  • French
  • Icelandic
  • Italian
  • Japanese
  • Simplified Chinese

Further details are contained in the GitHub repo in the analysis directory.

English language with Naive Bayesian classifier

Text.Language.detect/2 with classifier: Text.Language.Classifier.NaiveBayesian and three different vocabularies.

Text length (chars)   Udhr.Bigram   Udhr.Multigram   Udhr.Quadgram
50                    95.6%         92.7%            95.6%
100                   99.9%         99.5%            98.8%
150                   100.0%        100.0%           99.3%
300                   100.0%        100.0%           100.0%

Accuracy for German language detection

German is an exception to the otherwise consistent accuracy and the results are poor. Further analysis is required to understand the underlying cause.

Text length (chars)   Udhr.Bigram   Udhr.Multigram   Udhr.Quadgram
50                    45.6%         38.0%            27.2%
100                   64.4%         47.2%            42.3%
150                   71.6%         57.6%            51.4%
300                   78.7%         47.2%            56.9%

11 Likes

How can I do that?

1 Like

I would say that you have to create a PR in kipcole9/text_corpus_udhr where you put a file, e.g. corpus/udhr/udhr_cze.txt, with a translation of udhr_eng.txt.

1 Like

Seems to be there already, right? https://github.com/kipcole9/text_corpus_udhr/blob/master/corpus/udhr/udhr_ces.txt

Thanks for the interest! Language detection should work at an acceptable level for ~200 languages using the UDHR corpus. Of course you can also contribute additional corpora in a library of your own making as long as it has a module that implements the Text.Corpus behaviour.

Inflection - specifically pluralisation - is on a per-language basis. If you would like to contribute an inflector then a PR with a module called Text.Inflect.<BCP47 language code> that implements a function called pluralize/2 would be “all” that’s required. I will define a Text.Inflection behaviour in the 0.3.0 release to make this more clearly defined.
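
As a sketch of the kind of module such a PR might contain (the second argument and its default are assumptions until the Text.Inflection behaviour lands in 0.3.0):

defmodule Text.Inflect.De do
  # Hypothetical skeleton for a contributed German inflector.
  def pluralize(word, _mode \\ :modern) do
    # Naive placeholder; real language-specific rules go here.
    word <> "en"
  end
end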

4 Likes

Mea culpa, I was looking for that file but obviously not successfully…

Nothing to apologise for. I need to add a Contributing section to the docs - thanks for the prompt to do so :slight_smile:

1 Like

I think this is cool but I don’t have a use case for it. Keep up the good work :+1:

I (we, at dscout) definitely have a use case for nearly all of this work. I hope to contribute in the future, and would love to support the effort financially if you decide to make that possible :yellow_heart:.

6 Likes

Just a little fun addition over coffee this morning - deriving a CLDR locale from natural language. I’ll publish it to Hex after I add some tests.

Examples

iex> Cldr.Text.locale_from_text "this is some text that I think will be English"
{:ok,
 %Cldr.LanguageTag{
   backend: MyApp.Cldr,
   canonical_locale_name: "en-Latn-US",
   cldr_locale_name: "en",
   extensions: %{},
   gettext_locale_name: nil,
   language: "en",
   language_subtags: [],
   language_variant: nil,
   locale: %{},
   private_use: [],
   rbnf_locale_name: "en",
   requested_locale_name: "en",
   script: "Latn",
   territory: :US,
   transform: %{}
 }}

iex> german_text = "Wir wohnen in einem kleinen Haus mit einem Garten. Dort können die Kinder ein bisschen spielen. Unser Sohn kommt bald in die Schule, unsere Tochter geht noch eine Zeit lang in den Kindergarten. Meine Kinder sind am Nachmittag zu Hause. So arbeite ich nur halbtags."
iex> Cldr.Text.locale_from_text german_text
{:ok,
 %Cldr.LanguageTag{
   backend: MyApp.Cldr,
   canonical_locale_name: "de-Latn-DE-1901",
   cldr_locale_name: "de",
   extensions: %{},
   gettext_locale_name: nil,
   language: "de",
   language_subtags: [],
   language_variant: "1901",
   locale: %{},
   private_use: [],
   rbnf_locale_name: "de",
   requested_locale_name: "de-1901",
   script: "Latn",
   territory: :DE,
   transform: %{}
 }}
8 Likes

Just stumbled upon this post. In case you didn’t know:

Thanks much for the link. I’m a bit challenged reading these imperative implementations for two reasons: (a) it’s such ugly code compared to using pattern matching for most of it as one would in Elixir, and (b) as a result, I just want the rules. For Metaphone I can find them, but not for Double Metaphone.

Maybe I’ll do a basic Metaphone implementation first and at least move forward …

I’m really interested in using this library in a project of mine, in particular to generate something similar to “word clouds” where common significant words are highlighted. Is that something you are planning on supporting? Please let me know if there’s any part I can help out with.

1 Like