text version 2.0 was published today, on schedule. The library text_corpus_udhr was also published today; it provides a corpus to support natural language detection.
Language Detection
text contains 3 language classifiers to aid in natural language detection. However, it does not include any corpora; these are contained in separate libraries. The available classifiers are:
Text.Language.Classifier.CommulativeFrequency
Text.Language.Classifier.NaiveBayesian
Text.Language.Classifier.RankOrder
Additional classifiers can be added by defining a module that implements the Text.Language.Classifier behaviour.
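To give a feel for what a rank-order style classifier does, here is a minimal, self-contained sketch of the classic "out-of-place" comparison of n-gram rank profiles. It is an assumption that Text.Language.Classifier.RankOrder works along these lines; the module RankOrderSketch and its functions are hypothetical, use bigrams only, and do not implement the Text.Language.Classifier behaviour.

```elixir
defmodule RankOrderSketch do
  # Hypothetical illustration only; not part of the text library.

  # Builds a rank profile for `text`: a map of bigram => rank, where rank 0
  # is the most frequent bigram. Ties are broken alphabetically.
  def profile(text) do
    text
    |> String.downcase()
    |> String.graphemes()
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> Enum.frequencies()
    |> Enum.sort_by(fn {ngram, count} -> {-count, ngram} end)
    |> Enum.with_index()
    |> Map.new(fn {{ngram, _count}, rank} -> {ngram, rank} end)
  end

  # Sums the rank differences between the two profiles. A bigram that is
  # missing from the language profile costs a fixed penalty.
  def distance(text_profile, language_profile) do
    penalty = map_size(language_profile)

    Enum.reduce(text_profile, 0, fn {ngram, rank}, acc ->
      case Map.fetch(language_profile, ngram) do
        {:ok, language_rank} -> acc + abs(rank - language_rank)
        :error -> acc + penalty
      end
    end)
  end

  # `corpus` is a non-empty map of language => rank profile.
  # Returns the language whose profile is closest to the text.
  def detect(text, corpus) do
    text_profile = profile(text)

    {language, _profile} =
      Enum.min_by(corpus, fn {_language, language_profile} ->
        distance(text_profile, language_profile)
      end)

    {:ok, language}
  end
end
```

A real corpus would be built from far more text per language and from longer n-grams; the tiny profiles this sketch produces are only good for showing the mechanics.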
The library text_corpus_udhr implements the Text.Corpus behaviour for the Universal Declaration of Human Rights, which is available for download in 423 languages from Unicode.
Examples:
iex> Text.Language.detect "this is some english language thing"
{:ok, "en"}
# Options include `:corpus`, `:vocabulary` and `:classifier`
iex> Text.Language.detect "this is some english language thing", corpus: Text.Corpus.Udhr, vocabulary: Text.Vocabulary.Udhr.Quadgram
{:ok, "en"}
Word Counting
text contains an implementation of word counting that is oriented towards large streams of words rather than discrete strings. Input to Text.Word.word_count/2 can be a String.t, File.Stream.t or Flow.t, allowing flexible streaming of text.
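The idea of counting over a stream rather than a single string can be sketched in plain Elixir. This is not the library's implementation: WordCountSketch is a hypothetical module, and the word-splitting rule (runs of letters, digits and apostrophes) is an assumption made for the example.

```elixir
defmodule WordCountSketch do
  # Hypothetical helper; not part of the text library.
  # Accepts any enumerable of lines (a list, a Stream, a File.Stream) and
  # returns a map of lowercased word => count, processing one line at a time.
  def count(lines) do
    lines
    |> Stream.flat_map(&String.split(&1, ~r/[^\p{L}\p{N}']+/u, trim: true))
    |> Stream.map(&String.downcase/1)
    |> Enum.reduce(%{}, fn word, acc -> Map.update(acc, word, 1, &(&1 + 1)) end)
  end
end

WordCountSketch.count(["the cat sat", "The mat"])
# => %{"cat" => 1, "mat" => 1, "sat" => 1, "the" => 2}
```

Because the input is only ever enumerated lazily, the same function could be handed the result of File.stream!/1 for a large file without loading the whole file into memory.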
English Pluralization
text includes an inflector for the English language that follows the approach described in An Algorithmic Approach to English Pluralization. See the module Text.Inflect.En and the functions:
Text.Inflect.En.pluralize/2
Text.Inflect.En.pluralize_noun/2
Text.Inflect.En.pluralize_verb/1
Text.Inflect.En.pluralize_adjective/1
N-Gram generation
The Text.Ngram module supports efficient generation of n-grams of length 2 to 7. See Text.Ngram.ngram/2.
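For readers unfamiliar with n-grams, the following sketch shows what character n-gram generation looks like in plain Elixir. NgramSketch is a hypothetical module written for this example; it is not how Text.Ngram is implemented, it makes no attempt at efficiency, and returning a frequency map is an assumption about a convenient shape rather than a statement about what Text.Ngram.ngram/2 returns.

```elixir
defmodule NgramSketch do
  # Hypothetical helper; not part of the text library.
  # Returns a map of n-gram => count for the character n-grams of `text`.
  # The 2..7 guard mirrors the range of lengths mentioned above.
  def ngrams(text, n) when is_binary(text) and n in 2..7 do
    text
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> Enum.frequencies()
  end
end

NgramSketch.ngrams("banana", 2)
# => %{"an" => 2, "ba" => 1, "na" => 2}
```

Splitting on graphemes rather than bytes keeps multi-byte characters intact, which matters for non-Latin scripts such as Greek or Japanese.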
Language detection accuracy
Detection accuracy is reliable at text lengths of 150 characters or more, reasonable at 100 characters, and may not be acceptable at shorter lengths.
The results are consistent across the following tested languages. German is a clear exception, where the results are unacceptable for now.
- English
- Greek
- Russian
- Spanish
- Finnish
- French
- Icelandic
- Italian
- Japanese
- Simplified Chinese
Further details are contained in the analysis directory of the GitHub repo.
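As a rough illustration of how an accuracy percentage at a given text length could be computed, the sketch below cuts a known-language text into fixed-size samples and counts how many are detected correctly. The sampling scheme is an assumption and may differ from the one used in the analysis directory; AccuracySketch is a hypothetical module, and the detector is passed in as a function so the example runs without the library.

```elixir
defmodule AccuracySketch do
  # Hypothetical helper; not part of the text library.
  # `detector` is any one-argument function returning {:ok, language}.
  # Cuts `text` into consecutive samples of `sample_size` graphemes and
  # returns the percentage of samples detected as `expected`.
  def accuracy(text, sample_size, expected, detector) do
    samples =
      text
      |> String.graphemes()
      |> Enum.chunk_every(sample_size, sample_size, :discard)
      |> Enum.map(&Enum.join/1)

    case samples do
      [] ->
        {:error, :no_samples}

      _ ->
        correct = Enum.count(samples, fn sample -> detector.(sample) == {:ok, expected} end)
        {:ok, Float.round(100 * correct / length(samples), 1)}
    end
  end
end

# A stand-in detector for demonstration: anything containing "z" is "wrong".
stub = fn sample ->
  if String.contains?(sample, "z"), do: {:ok, "xx"}, else: {:ok, "en"}
end

AccuracySketch.accuracy("aaaabbbbzzzzcccc", 4, "en", stub)
# => {:ok, 75.0}
```

With the library installed, a function that wraps Text.Language.detect with the desired :corpus, :vocabulary and :classifier options could be passed in place of the stub.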
English language with Naive Bayesian classifier
Text.Language.detect/2 with classifier: Text.Language.Classifier.NaiveBayesian and three different vocabularies.
| Text Length (characters) | Udhr.Bigram | Udhr.Multigram | Udhr.Quadgram |
|---|---|---|---|
| 50 | 95.6% | 92.7% | 95.6% |
| 100 | 99.9% | 99.5% | 98.8% |
| 150 | 100.0% | 100.0% | 99.3% |
| 300 | 100.0% | 100.0% | 100.0% |
Accuracy for German language detection
German is an exception to the consistent accuracy of most languages and the results are poor. Further analysis is required to understand the underlying cause.
| Text Length (characters) | Udhr.Bigram | Udhr.Multigram | Udhr.Quadgram |
|---|---|---|---|
| 50 | 45.6% | 38.0% | 27.2% |
| 100 | 64.4% | 47.2% | 42.3% |
| 150 | 71.6% | 57.6% | 51.4% |
| 300 | 78.7% | 47.2% | 56.9% |