text version 2.0 has been published today, right on schedule. The library text_corpus_udhr is also published today; it provides a corpus to support natural language detection.
text contains three language classifiers to aid in natural language detection. However, it does not include any corpora; these are provided by separate libraries. The available classifiers are:
Additional classifiers can be added by defining a module that implements the
The library text_corpus_udhr implements the Text.Corpus behaviour for the United Nations Universal Declaration of Human Rights, which is available for download in 423 languages from Unicode.
iex> Text.Language.detect "this is some english language thing"
# Options include `:corpus`, `:vocabulary` and `:classifier`
iex> Text.Language.detect "this is some english language thing", corpus: Text.Corpus.Udhr, vocabulary: Text.Vocabulary.Udhr.Quadgram
text contains an implementation of word counting that is oriented towards large streams of words rather than discrete strings. Input to Text.Word.word_count/2 can be a Flow.t, allowing flexible streaming of text.
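To illustrate what stream-oriented word counting means, here is a minimal sketch in plain Elixir. It is not the implementation behind Text.Word.word_count/2 and it does not use Flow; the module name WordCountSketch and its splitting rule are assumptions made for this example only. It accepts any enumerable of lines, so a list or a File.Stream can be passed in, and it folds the counts into a map without holding the whole text in memory.

```elixir
defmodule WordCountSketch do
  # Hypothetical helper for illustration; not part of the text library.
  # Takes any enumerable of lines (a list, a Stream, a File.Stream)
  # and returns a map of lowercased word => count.
  def count(lines) do
    lines
    # Split each line on runs of characters that are not letters,
    # digits or apostrophes (an assumed, simplistic tokenisation).
    |> Stream.flat_map(&String.split(&1, ~r/[^\p{L}\p{N}']+/u, trim: true))
    |> Stream.map(&String.downcase/1)
    # Fold the lazily produced words into a frequency map.
    |> Enum.reduce(%{}, fn word, acc -> Map.update(acc, word, 1, &(&1 + 1)) end)
  end
end
```

Calling WordCountSketch.count(["The quick fox", "the lazy dog"]) returns a map in which "the" has a count of 2 and every other word a count of 1. A file can be counted the same way by passing File.stream!/1 for a path of your choice.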
text includes an inflector for the English language that takes an approach based upon An Algorithmic Approach to English Pluralization. See the module Text.Inflect.En and the functions:
Text.Ngram module supports efficient generation of n-grams of length
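For readers unfamiliar with n-grams, the following is a minimal sketch of character n-gram counting in plain Elixir. It is not the Text.Ngram implementation and makes no claim about its efficiency; the module name NgramSketch and the choice to lowercase the input are assumptions for illustration only.

```elixir
defmodule NgramSketch do
  # Hypothetical helper for illustration; not part of the text library.
  # Returns a map of each character n-gram of length `n` to its count.
  def frequencies(text, n) when is_binary(text) and is_integer(n) and n > 0 do
    text
    |> String.downcase()
    |> String.graphemes()
    # Slide a window of size n one grapheme at a time,
    # discarding a trailing window that is shorter than n.
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> Enum.frequencies()
  end
end
```

With this sketch, NgramSketch.frequencies("banana", 2) yields the bigram "ba" once and the bigrams "an" and "na" twice each. Frequency tables of this kind are the sort of input a vocabulary-based language classifier compares against a corpus.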
Language detection accuracy
Detection accuracy is reliable for texts of 150 characters or more, reasonable at 100 characters, and may not be acceptable at shorter lengths.
The results are consistent across the range of tested languages, with German being a clear exception where the results are unacceptable for now.
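A caller may want to take these length thresholds into account before trusting a detection result. The sketch below is one way to do that; the module name DetectionGuard, the returned atoms, and the treatment of lengths between 100 and 149 as "reasonable" are assumptions for this example and are not part of the text library.

```elixir
defmodule DetectionGuard do
  # Hypothetical helper for illustration; not part of the text library.
  # Maps the length of the input text (in graphemes) to a rough
  # expectation of detection quality, using the thresholds quoted above.
  def expected_quality(text) when is_binary(text) do
    case String.length(text) do
      n when n >= 150 -> :reliable
      n when n >= 100 -> :reasonable
      _ -> :unreliable
    end
  end
end
```

An application could, for example, skip language detection or ask for more input when DetectionGuard.expected_quality/1 returns :unreliable.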
- Simplified Chinese
Further details are contained in the GitHub repo in the
English language with Naive Bayesian classifier
classifier: Text.Classifier.NaiveBayesian and three different vocabularies.
Accuracy for German language detection
German is an exception to the consistent accuracy of most languages and the results are poor. Further analysis is required to understand the underlying cause.