I am looking for a way to split UTF-8 strings into chunks that have some logical cohesion (e.g. avoid splitting in the middle of a word or, worse, in the middle of a grapheme cluster). Unicode Standard Annex #29 (UAX #29) defines so-called word boundaries, which I would like to use so that the chunking also works for non-Latin scripts.
According to Unicode’s definition of word boundaries, a sentence like “the quick brown fox” would be split into the words “the”, “quick”, “brown”, “fox”. Similarly, the sentence “这只是一些随机的文本” would be split into “这”, “只”, “是”, “一”, “些”, “随”, “机”, “的”, “文”, “本”, because the default rules place a boundary between every pair of ideographs.
Is there an Elixir library that provides functions for this? In Rust, the unicode-segmentation crate provides the unicode_words function, which splits a string into Unicode words, but so far I haven’t been able to find anything similar for Elixir.
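To make the goal concrete, here is a rough Rust sketch of the kind of chunking I have in mind. It uses only the standard library, so the boundary rule is a deliberately simplified stand-in and not a UAX #29 implementation: letters and digits stick together, every CJK Unified Ideograph (U+4E00 to U+9FFF only) stands alone, and everything else is its own segment. It does not keep grapheme clusters together. The names `is_ideograph`, `joins`, `segments` and `chunk` are made up for illustration; in real Rust code the `segments` step would presumably be replaced by the crate's `split_word_bounds`.

```rust
/// Simplified assumption: only the CJK Unified Ideographs block counts as ideographic.
fn is_ideograph(c: char) -> bool {
    matches!(c as u32, 0x4E00..=0x9FFF)
}

/// Stand-in rule (NOT UAX #29): two adjacent characters belong to the same
/// segment only if both are letters/digits and neither is an ideograph.
fn joins(prev: char, next: char) -> bool {
    prev.is_alphanumeric()
        && next.is_alphanumeric()
        && !is_ideograph(prev)
        && !is_ideograph(next)
}

/// Splits `text` into contiguous segments that together cover the whole string.
fn segments(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    let mut start = 0; // byte offset where the current segment begins
    let mut prev: Option<char> = None;
    for (i, c) in text.char_indices() {
        if let Some(p) = prev {
            if !joins(p, c) {
                out.push(&text[start..i]);
                start = i;
            }
        }
        prev = Some(c);
    }
    if start < text.len() {
        out.push(&text[start..]);
    }
    out
}

/// Greedily packs segments into chunks of at most `max_bytes` bytes.
/// A single segment longer than `max_bytes` becomes its own oversized chunk.
fn chunk(text: &str, max_bytes: usize) -> Vec<&str> {
    let mut chunks = Vec::new();
    let mut start = 0; // byte offset where the current chunk begins
    let mut end = 0; // byte offset where the current chunk ends
    for seg in segments(text) {
        // Close the current chunk if adding this segment would exceed the limit.
        if end > start && end - start + seg.len() > max_bytes {
            chunks.push(&text[start..end]);
            start = end;
        }
        end += seg.len();
    }
    if end > start {
        chunks.push(&text[start..end]);
    }
    chunks
}
```

With this sketch, `chunk("the quick brown fox", 10)` yields "the quick " and "brown fox", and `chunk("这只是一些随机的文本", 9)` yields "这只是", "一些随", "机的文", "本". Unlike unicode_words, the segments here keep whitespace, so concatenating the chunks gives back the original string. What I am missing in Elixir is a proper UAX #29 replacement for the `segments` step.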