Nx Tokenizers: merged_with_next - how to apply this split behavior in a pre-tokenizer from Elixir

Has anyone used the Tokenizers library from the Nx ecosystem with a pre-tokenizer that uses the merged_with_next split behavior?

My goal is to use the Elixir bindings to do something like the following Python code, but I cannot find how to set up such a pre-tokenizer in Elixir:

from tokenizers import Tokenizer, Regex
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Split

samples = ["This is a first test.", "This is a second test."]

# Tiny dummy vocab; only the pre-tokenizer output matters here.
model = WordPiece({"<unk>": 100}, unk_token="<unk>")
tokenizer = Tokenizer(model)

tokenizer.pre_tokenizer = Split(Regex(r"\w+|[^\w\s]+"), behavior="merged_with_next")

for s in samples:
    print("for input=", s)
    print("standalone pre tokenizer:", tokenizer.pre_tokenizer.pre_tokenize_str(s))
    print("----------------")

The output of the above Python code is as follows; note how each regex match is merged with the whitespace that follows it:

for input= This is a first test.
standalone pre tokenizer: [('This ', (0, 5)), ('is ', (5, 8)), ('a ', (8, 10)), ('first ', (10, 16)), ('test', (16, 20)), ('.', (20, 21))]
----------------
for input= This is a second test.
standalone pre tokenizer: [('This ', (0, 5)), ('is ', (5, 8)), ('a ', (8, 10)), ('second ', (10, 17)), ('test', (17, 21)), ('.', (21, 22))]
----------------

I don’t think we’ve exposed this behavior from the Rust library yet.

We’ve been slowly adding things as we need them; if this is something you can do in Rust, it’s probably something we can support. Feel free to open an issue or PR!
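
For anyone taking a crack at that PR, here is a minimal sketch of the equivalent against the underlying Rust tokenizers crate, i.e. roughly what the Elixir binding would need to expose. The Split, SplitPattern, and SplitDelimiterBehavior names come from the crate's pre_tokenizers::split module; treat the exact signatures as an assumption to check against whatever crate version the binding pins:

// Sketch only; verify names against the `tokenizers` crate version you build on.
use tokenizers::pre_tokenizers::split::{Split, SplitPattern};
use tokenizers::tokenizer::{
    OffsetReferential, OffsetType, PreTokenizedString, PreTokenizer, SplitDelimiterBehavior,
};

fn main() -> tokenizers::Result<()> {
    // Same regex and behavior as the Python snippet above: each pattern
    // match is merged with the piece of text that follows it.
    let pre = Split::new(
        SplitPattern::Regex(r"\w+|[^\w\s]+".into()),
        SplitDelimiterBehavior::MergedWithNext,
        false, // invert = false: split on matches of the pattern itself
    )?;

    for s in ["This is a first test.", "This is a second test."] {
        let mut pretokenized = PreTokenizedString::from(s);
        pre.pre_tokenize(&mut pretokenized)?;
        println!("for input= {s}");
        // Mirror Python's pre_tokenize_str output: (piece, (start, end)) pairs.
        for (piece, offsets, _tokens) in
            pretokenized.get_splits(OffsetReferential::Original, OffsetType::Char)
        {
            print!("({piece:?}, {offsets:?}) ");
        }
        println!("\n----------------");
    }
    Ok(())
}

The binding work would then mostly be plumbing: a NIF that constructs this Split pre-tokenizer from an Elixir-side pattern string and behavior atom, and attaches it to the tokenizer.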


Thank you for your response! I guess it is time to start learning some Rust. :)
