```elixir
iex(7)> text =
  "The paint on the windowframe started to chip from the sun's heat. The shelves were dusty.\n"
iex(8)> {:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("gpt2")
iex(9)> Chunx.Chunker.Word.chunk(text, tokenizer, chunk_size: 20, chunk_overlap: 10)
{:ok,
 [
   %Chunx.Chunk{
     text: "The paint on the windowframe started to chip from the sun's heat. The shelves were dusty.",
     start_index: 0,
     end_index: 89,
     token_count: 20,
     embedding: nil
   },
   %Chunx.Chunk{
     text: " the sun's heat. The shelves were dusty.\n",
     start_index: 49,
     end_index: 90,
     token_count: 11,
     embedding: nil
   }
 ]}
```
Looks good, I’ll definitely take a look at this; I’m especially interested in semantic and token chunking.
I implemented something similar in unicode_string, which implements the Unicode Text Segmentation standard: rules for splitting multilingual text into words, lines and sentences (and graphemes, but Elixir already takes care of those).
You can see some examples in the readme. Punctuation - especially multilingual punctuation - makes this a “fun” and “interesting” topic for sure!
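For anyone who hasn’t used it, the readme examples look roughly like this (quoted from memory, so treat the option names and exact output as approximate and check the readme):

```elixir
# Rule-based segmentation; the :break and :trim options are as I remember
# them from the unicode_string readme, so this is approximate.
Unicode.String.split("The shelves were dusty. The paint chipped.", break: :sentence)
#=> ["The shelves were dusty. ", "The paint chipped."]

Unicode.String.split("The shelves were dusty.", break: :word, trim: true)
#=> ["The", "shelves", "were", "dusty", "."]
```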
There’s some ambiguity with “index” and UTF-8: it could mean “bytes” or “codepoints”, and the latter is painful to work with (you can’t find the corresponding byte in constant time, for instance).
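A quick illustration of the difference, using only the standard library:

```elixir
s = "das Maß"      # "ß" is a single character but two bytes in UTF-8
byte_size(s)       #=> 8
String.length(s)   #=> 7
```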
Reading the Chunx source indicates that those will always be byte offsets, because that’s what Regex.scan returns and what :erlang.binary_part expects.
IMO calling them start_byte and end_byte would make the nature of the fields more explicit.
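Independently of how Chunx assembles its chunks internally, the byte-offset behaviour of those two building blocks is easy to see in isolation:

```elixir
# Regex.scan with return: :index reports {offset, length} in bytes, even for
# a Unicode-aware pattern; binary_part/3 likewise slices by bytes.
text = "héllo world"

Regex.scan(~r/\w+/u, text, return: :index)
#=> [[{0, 6}], [{7, 5}]]   # "héllo" is 5 characters but 6 bytes

binary_part(text, 7, 5)
#=> "world"
```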
One downside of indexing by bytes is that it’s harder to produce accurate “parse failed” messages that point to a specific character in the input if that input contains multi-byte UTF-8 characters.
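You can recover a character position from a byte offset when you need one for an error message, but only by rescanning the prefix. A minimal sketch (hypothetical helper, not part of Chunx):

```elixir
# Hypothetical helper: convert a byte offset into a character position by
# counting the graphemes in the prefix. This is O(n), which is exactly the
# cost being discussed above. Assumes the offset falls on a character boundary.
defmodule ErrorPosition do
  def char_index(string, byte_offset) do
    string
    |> binary_part(0, byte_offset)
    |> String.length()
  end
end

ErrorPosition.char_index("héllo world", 7)
#=> 6   # "world" starts at byte 7 but at character index 6
```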
I think as long as you make it clear what is being indexed it’s probably OK, so maybe better to call them byte_index_start and byte_index_end. Indexing into binaries always feels like a bit of an anti-pattern to me, though, and in this case you’re already returning the match in the :text field, so maybe it’s not required?
I’m not sure whether it would produce better outcomes for a RAG application (which is the primary focus, if I understand correctly). I suspect unicode_string might be more correct but slower, since it’s rules-based, and it’s not clear (to me, at least) that more correctness is more useful for RAG. It might matter, though, if you’re doing multilingual RAG, especially in languages that don’t use whitespace as a word separator.
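As a concrete case of that last point: whitespace-based splitting finds no word boundaries at all in, say, Japanese, so a whitespace-keyed word chunker sees the whole sentence as one “word”:

```elixir
# String.split/1 splits on whitespace by default; Japanese text has none,
# so the entire sentence comes back as a single token.
String.split("今日は良い天気です。")
#=> ["今日は良い天気です。"]
```

That is the kind of input where a rules-based segmenter like unicode_string presumably has an edge.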