Lemma: morphological parsing in Elixir

Lemma is a morphological parser (analyser) / lemmatizer written in Elixir. It is implemented using a classic textbook method that relies on an abstraction called a finite state transducer.

What are morphological parsing and lemmatization?

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. For instance:

am, are, is ⇒ be
car, cars, car’s, cars’ ⇒ car

The result of this mapping of text will be something like:
the boy’s cars are different colors ⇒ the boy car be differ color.

Stanford NLP Group
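To make the mapping concrete, here is a toy sketch in Elixir that applies a hand-written lookup table to the example sentence. A real lemmatizer (including Lemma) derives such mappings from morphological rules and a lexicon rather than enumerating them, and also handles derivational forms like different ⇒ differ, which this toy table does not:

```elixir
# A toy "lemmatizer": just a lookup table over the examples above.
lemmas = %{
  "am" => "be", "are" => "be", "is" => "be",
  "car" => "car", "cars" => "car", "car's" => "car", "cars'" => "car"
}

"the boy's cars are different colors"
|> String.split()
|> Enum.map(fn word -> Map.get(lemmas, word, word) end)
|> Enum.join(" ")
#=> "the boy's car be different colors"
```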

Motivation

I originally started this project as part of learning Elixir. Erlang/OTP has outstanding abstractions for finite state machines, and the core of morphological parsing in Lemma is a finite state transducer, which can be seen as an extension of a finite state machine. This similarity made Elixir seem like an ideal language for implementing morphological parsing, although I opted to use libgraph instead of gen_statem in the end. A rough illustration of the idea follows below.
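As a minimal sketch of the concept (not Lemma's actual code): edge labels are {input, output} pairs, so walking the graph simultaneously recognizes an input string and emits its mapped output.

```elixir
defmodule TinyFST do
  # Hard-code a single path that maps "cars" to "car": the first three
  # characters pass through unchanged, the trailing "s" maps to "".
  def new do
    Graph.new()
    |> Graph.add_edge(:s0, :s1, label: {"c", "c"})
    |> Graph.add_edge(:s1, :s2, label: {"a", "a"})
    |> Graph.add_edge(:s2, :s3, label: {"r", "r"})
    |> Graph.add_edge(:s3, :s4, label: {"s", ""})
  end

  # Walk the transducer over the graphemes of `word`, collecting the
  # output side of every edge taken. A real FST would also track
  # accepting states; this sketch only requires the input be consumed.
  def transduce(graph, word) do
    walk(graph, :s0, String.graphemes(word), [])
  end

  defp walk(_graph, _state, [], acc),
    do: {:ok, acc |> Enum.reverse() |> Enum.join()}

  defp walk(graph, state, [ch | rest], acc) do
    graph
    |> Graph.out_edges(state)
    |> Enum.find(fn %Graph.Edge{label: {input, _}} -> input == ch end)
    |> case do
      %Graph.Edge{v2: next, label: {_, out}} -> walk(graph, next, rest, [out | acc])
      nil -> {:error, :no_transition}
    end
  end
end

# TinyFST.new() |> TinyFST.transduce("cars")  #=> {:ok, "car"}
```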

I rewrote the entire implementation several times; the current one is the third rewrite. Each time I either found better ways to implement things or had to give up a more expressive style for performance reasons.

I experimented with building the parser at run time versus at compile time, and benchmarked both implementations. I was surprised to see that the parser built at compile time runs about 4 times faster than the same parser built at run time. The compiler is certainly doing some optimization, and I have yet to find out what. But building the parser at compile time has its own problem: during compilation, memory usage goes up to 3–4 GB.
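For reference, the compile-time variant boils down to evaluating the build step in a module attribute, so the finished structure is embedded in the BEAM file. The sketch below reuses the TinyFST toy from above; Lemma's real build step is of course much larger:

```elixir
defmodule CompileTimeParser do
  # Evaluated once, during compilation. The resulting graph is stored
  # in the .beam file as a literal, which is also why compilation can
  # use a lot of memory while the structure is being built.
  @fst TinyFST.new()

  # At run time this is just a reference to the stored value; the
  # run-time variant would instead build (or fetch) the graph here.
  def fst, do: @fst
end
```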

What is going on behind the scenes? I hope we can test out the hypothesis together.


When a structure is known at compile time, it is cached in a special area of the BEAM file, and when the module is loaded those values are essentially marked as not needing garbage collection. The runtime can also make a lot of other assumptions, like not needing to copy them between processes and instead just pointing to them everywhere, which does indeed give quite a speed boost. :slight_smile:
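One rough way to observe this from a shell is the undocumented :erts_debug.same/2 BIF, which checks pointer identity rather than structural equality (a sketch; the exact sharing behaviour can vary across OTP releases):

```elixir
defmodule Literal do
  # Built during compilation and stored in the BEAM file's literal area.
  @table Map.new(1..1_000, fn i -> {i, i * i} end)
  def table, do: @table
end

# Every call hands back a pointer to the same cached value:
:erts_debug.same(Literal.table(), Literal.table())
#=> true

# A structurally equal map built at run time is a fresh allocation:
runtime = Map.new(1..1_000, fn i -> {i, i * i} end)
runtime == Literal.table()                  #=> true  (structural equality)
:erts_debug.same(runtime, Literal.table())  #=> false (different pointers)
```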


Ahh I see, this makes sense. I previously tried to lemmatize a list of words in parallel using Task.async and Task.await, but this turned out to be a rather bad idea, as the entire finite state transducer was copied between processes for each word in the list.
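Something like this is what I mean, where build_fst/0 and lemmatize/2 stand in for the real functions. When the transducer is captured in the task's closure it gets copied to every worker; if it is a compile-time literal behind a module function, the closure only carries the module call:

```elixir
# Anti-pattern: `fst` is captured by each closure, so the whole
# transducer is copied into every spawned process.
fst = build_fst()

lemmas =
  words
  |> Enum.map(fn word -> Task.async(fn -> lemmatize(fst, word) end) end)
  |> Enum.map(&Task.await/1)

# With the transducer stored as a compile-time literal, the closure
# only references the module, so nothing big crosses process boundaries:
lemmas =
  words
  |> Task.async_stream(fn word -> lemmatize(CompileTimeParser.fst(), word) end)
  |> Enum.map(fn {:ok, lemma} -> lemma end)
```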