```elixir
iex(7)> text =
  "The paint on the windowframe started to chip from the sun's heat. The shelves were dusty.\n"
iex(8)> {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
iex(9)> Chunx.Chunker.Word.chunk(text, tokenizer, chunk_size: 20, chunk_overlap: 10)
{:ok,
 [
   %Chunx.Chunk{
     text: "The paint on the windowframe started to chip from the sun's heat. The shelves were dusty.",
     start_index: 0,
     end_index: 89,
     token_count: 20,
     embedding: nil
   },
   %Chunx.Chunk{
     text: " the sun's heat. The shelves were dusty.\n",
     start_index: 49,
     end_index: 90,
     token_count: 11,
     embedding: nil
   }
 ]}
```
Looks good, I’ll definitely take a look at this; I’m especially interested in semantic and token chunking.
I implemented something similar in unicode_string, which implements the Unicode Text Segmentation standard: rules for splitting multilingual text into words, lines and sentences (and graphemes, but Elixir already takes care of those).
You can see some examples in the readme. Punctuation - especially multilingual punctuation - makes this a “fun” and “interesting” topic for sure!
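For anyone who hasn’t used it, the readme examples look roughly like this (quoted from memory, so treat the option names and exact output as approximate and check the readme):

```elixir
# Rule-based segmentation; the :break and :trim options are as I remember
# them from the unicode_string readme, so this is approximate.
Unicode.String.split("The shelves were dusty. The paint chipped.", break: :sentence)
#=> ["The shelves were dusty. ", "The paint chipped."]

Unicode.String.split("The shelves were dusty.", break: :word, trim: true)
#=> ["The", "shelves", "were", "dusty", "."]
```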
There’s some ambiguity with “index” and UTF-8: it could mean “bytes” or “codepoints”, and the latter is painful to work with (you can’t find the corresponding byte in constant time, for instance).
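A quick illustration of the difference, using only the standard library:

```elixir
s = "das Maß"      # "ß" is a single character but two bytes in UTF-8
byte_size(s)       #=> 8
String.length(s)   #=> 7
```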
Reading the Chunx source indicates that those will always be byte offsets, because that’s what Regex.scan returns and what :erlang.binary_part expects.
IMO calling them start_byte and end_byte would make the nature of the fields more explicit.
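Independently of how Chunx assembles its chunks internally, the byte-offset behaviour of those two building blocks is easy to see in isolation:

```elixir
# Regex.scan with return: :index reports {offset, length} in bytes, even for
# a Unicode-aware pattern; binary_part/3 likewise slices by bytes.
text = "héllo world"

Regex.scan(~r/\w+/u, text, return: :index)
#=> [[{0, 6}], [{7, 5}]]   # "héllo" is 5 characters but 6 bytes

binary_part(text, 7, 5)
#=> "world"
```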
One downside of indexing by bytes is that it’s harder to produce accurate “parse failed” messages that point to a specific character in the input if that input contains multi-byte UTF-8 characters.
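You can recover a character position from a byte offset when you need one for an error message, but only by rescanning the prefix. A minimal sketch (hypothetical helper, not part of Chunx):

```elixir
# Hypothetical helper: convert a byte offset into a character position by
# counting the graphemes in the prefix. This is O(n), which is exactly the
# cost being discussed above. Assumes the offset falls on a character boundary.
defmodule ErrorPosition do
  def char_index(string, byte_offset) do
    string
    |> binary_part(0, byte_offset)
    |> String.length()
  end
end

ErrorPosition.char_index("héllo world", 7)
#=> 6   # "world" starts at byte 7 but at character index 6
```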
I think as long as you make it clear what is being indexed it’s probably OK, so maybe better to call them byte_index_start and byte_index_end. Indexing into binaries always feels like a bit of an anti-pattern to me, though, and in this case you’re already returning the match in the :text field, so maybe it’s not required?
I’m not sure whether it would produce better outcomes for a RAG application (which is the primary focus, if I understand correctly). I suspect unicode_string might be more correct but slower, since it’s rules-based, and it’s not clear (to me, at least) that more correctness is more useful for RAG. It might matter, though, if you’re doing multilingual RAG, especially in languages that don’t use whitespace as a word separator.
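As a concrete case of that last point: whitespace-based splitting finds no word boundaries at all in, say, Japanese, so a whitespace-keyed word chunker sees the whole sentence as one “word”:

```elixir
# String.split/1 splits on whitespace by default; Japanese text has none,
# so the entire sentence comes back as a single token.
String.split("今日は良い天気です。")
#=> ["今日は良い天気です。"]
```

That is the kind of input where a rules-based segmenter like unicode_string presumably has an edge.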