Library to split string on unicode word boundaries

aochagavia · September 21, 2022, 1:14pm

I am looking for a way to split UTF-8 strings into chunks that have some logical cohesion (e.g. avoid splitting in the middle of a word or, worse, in the middle of a grapheme cluster). Unicode Standard Annex #29 defines so-called word boundaries, which I would like to use to ensure the chunking also works for non-Latin scripts.

According to Unicode’s definition of word boundaries, a sentence like “the quick brown fox” would be split into “the”, “quick”, “brown”, “fox”. Similarly, the sentence “这只是一些随机的文本” would be split into “这”, “只”, “是”, “一”, “些”, “随”, “机”, “的”, “文”, “本”.

Is there any Elixir library that has functions for this purpose? In Rust we have the unicode-segmentation crate, whose unicode_words function splits a string into Unicode words, but so far I haven’t been able to find anything like that for Elixir.

LostKobrakai · September 21, 2022, 1:29pm

There’s elixir-unicode/unicode on GitHub (Unicode codepoint introspection and fast detection of lower, upper, alpha, numeric, whitespace, ... in Elixir), which probably has some way to do what you want. I’ve never had to use it myself, though, so I can’t be sure.

kip · September 21, 2022, 4:35pm

unicode_string implements the Unicode text segmentation specification and for your case specifically the Unicode word break algorithm. The current version is based upon Unicode 15 released on September 13, 2022.

See Unicode.String.split/2.

TR 29 provides specific break data for several but not all locales. Some locales have data only for line break suppressions (i.e., abbreviations like “i.e.” that do not signal a line break), and some, like the “ja” locale, include data for word breaks too.

Otherwise the default break rules are used. They are worth a look to see what is going on under the hood; it’s not as straightforward as you might think. All of these rules are used to define functions at compile time to optimise performance.

iex> Unicode.String.Segment.known_locales
["de", "el", "en", "en-US", "en-US-POSIX", "es", "fi", "fr", "it", "ja", "pt",
 "root", "ru", "sv", "zh", "zh-Hant"]

Here are your examples, using Unicode.String.split/2:

# "en" uses the default word break rules
iex> Unicode.String.split "the quick brown fox", trim: true
["the", "quick", "brown", "fox"]
# There are word break rules specific to "zh"
iex> Unicode.String.split "这只是一些随机的文本", locale: "zh"
["这", "只", "是", "一", "些", "随", "机", "的", "文", "本"]

Japanese does have specific word break rules so here’s an example. Notice that there is no whitespace separation. Word breaks in Japanese are determined by other means.

iex> Unicode.String.split "注ちゅう文もんの多おおい料りょう理り店", locale: "ja"
["注", "ちゅう", "文", "もんの", "多", "おおい", "料", "りょう",
 "理", "り", "店"]
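
To get from word segments back to the chunks asked about in the original question, one option is to pack consecutive segments greedily up to a size limit. The following is only a minimal sketch: the Chunker module and its pack/2 function are hypothetical names, not part of unicode_string, and the byte-size limit is an assumed chunking criterion. It uses only the Elixir standard library and takes an already-split list of segments as input.

```elixir
defmodule Chunker do
  @moduledoc """
  Hypothetical helper (not part of unicode_string): greedily packs
  consecutive segments into chunks of at most `max_bytes` bytes.
  """

  # `segments` is a list of strings, e.g. the output of a word-boundary splitter.
  # Segments are never split, so a single segment larger than `max_bytes`
  # ends up as its own oversized chunk.
  def pack(segments, max_bytes) when is_integer(max_bytes) and max_bytes > 0 do
    {chunks, current} =
      Enum.reduce(segments, {[], ""}, fn segment, {chunks, current} ->
        cond do
          # Nothing accumulated yet: start a new chunk with this segment.
          current == "" ->
            {chunks, segment}

          # The segment still fits: append it to the current chunk.
          byte_size(current) + byte_size(segment) <= max_bytes ->
            {chunks, current <> segment}

          # It does not fit: close the current chunk and start a new one.
          true ->
            {[current | chunks], segment}
        end
      end)

    # Flush the last chunk, then restore the original order.
    chunks = if current == "", do: chunks, else: [current | chunks]
    Enum.reverse(chunks)
  end
end
```

Assuming Unicode.String.split/2 without trim: true also returns the whitespace segments, something like text |> Unicode.String.split() |> Chunker.pack(64) should yield chunks that concatenate back to the original string and only break at word boundaries.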