Has anyone used the Tokenizers library from the Nx ecosystem with a Split pre-tokenizer and the merged_with_next behaviour? My goal is to use the Elixir bindings to do something like the following Python code, but I cannot find how to set up such a pre-tokenizer in Elixir (a sketch of what I expected to be able to write follows the Python output below):
from tokenizers import Tokenizer, Regex
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Split

samples = ["This is a first test.", "This is a second test."]

# Minimal WordPiece model; the vocabulary only needs the unknown token here.
model = WordPiece({"<unk>": 100}, unk_token="<unk>")
tokenizer = Tokenizer(model)

# Split on word or punctuation runs, merging each delimiter with what follows.
tokenizer.pre_tokenizer = Split(Regex(r"\w+|[^\w\s]+"), behavior="merged_with_next")

for s in samples:
    print("for input=", s)
    print("standalone pre tokenizer:", tokenizer.pre_tokenizer.pre_tokenize_str(s))
    print("----------------")
The output of the above Python code is:
for input= This is a first test.
standalone pre tokenizer: [('This ', (0, 5)), ('is ', (5, 8)), ('a ', (8, 10)), ('first ', (10, 16)), ('test', (16, 20)), ('.', (20, 21))]
----------------
for input= This is a second test.
standalone pre tokenizer: [('This ', (0, 5)), ('is ', (5, 8)), ('a ', (8, 10)), ('second ', (10, 17)), ('test', (17, 21)), ('.', (21, 22))]
----------------
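For reference, this is roughly what I expected to be able to write with the Elixir bindings. It is only a sketch: in particular, I am assuming that the pattern argument of Tokenizers.PreTokenizer.split/3 is interpreted as a regex (the counterpart of Python's Regex), and that is exactly the part I cannot confirm from the docs:

# Sketch only: whether split/3 treats the pattern as a regex rather than a
# literal string is an assumption on my part.
{:ok, model} = Tokenizers.Model.WordPiece.init(%{"<unk>" => 100}, unk_token: "<unk>")
{:ok, tokenizer} = Tokenizers.Tokenizer.init(model)

# Split pre-tokenizer with the :merged_with_next behaviour.
pre_tokenizer = Tokenizers.PreTokenizer.split("\\w+|[^\\w\\s]+", :merged_with_next)
tokenizer = Tokenizers.Tokenizer.set_pre_tokenizer(tokenizer, pre_tokenizer)

for s <- ["This is a first test.", "This is a second test."] do
  IO.puts("for input= #{s}")
  IO.inspect(Tokenizers.PreTokenizer.pre_tokenize(pre_tokenizer, s))
  IO.puts("----------------")
end

If split/3 only accepts literal string patterns, a pointer to the right way to get a regex-based Split with :merged_with_next would be much appreciated.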