Nx Tokenizers: cannot find how to use the merged_with_next pre-tokenizer behavior in Elixir

Has anyone used the Nx Tokenizers library with a pre-tokenizer that uses the merged_with_next behavior?

My goal is to use the Elixir bindings to do something like the following Python code, but I cannot find how to set up such a pre-tokenizer in Elixir:

from tokenizers import Tokenizer, Regex
from tokenizers.models import WordPiece
from tokenizers.pre_tokenizers import Split

samples = ["This is a first test.", "This is a second test."]

# Minimal placeholder vocab; the model does not affect the pre-tokenizer demo
model = WordPiece({"[UNK]": 100}, unk_token="[UNK]")
tokenizer = Tokenizer(model)

tokenizer.pre_tokenizer = Split(Regex(r"\w+|[^\w\s]+"), behavior="merged_with_next")

for s in samples:
    print("for input=", s)
    print("standalone pre tokenizer:", tokenizer.pre_tokenizer.pre_tokenize_str(s))
    print("----------------")

The output of the above Python code is:

for input= This is a first test.
standalone pre tokenizer: [('This ', (0, 5)), ('is ', (5, 8)), ('a ', (8, 10)), ('first ', (10, 16)), ('test', (16, 20)), ('.', (20, 21))]
----------------
for input= This is a second test.
standalone pre tokenizer: [('This ', (0, 5)), ('is ', (5, 8)), ('a ', (8, 10)), ('second ', (10, 17)), ('test', (17, 21)), ('.', (21, 22))]
----------------

I don’t think we’ve exposed this behavior from the Rust library yet.

We’ve been slowly adding things as we need them; if this is something you can do in Rust, it’s probably something we can support. Feel free to open an issue or PR!


Thank you for your response! I guess it is time to start learning some Rust. 🙂


It seems that the latest version of the Tokenizers library (v0.4) provides this feature. Thanks to everyone who contributed to that!

Here is a usage example:

# Split on spaces, merging each space into the preceding piece
pretok = Tokenizers.PreTokenizer.split(" ", :merged_with_previous)
Tokenizers.PreTokenizer.pre_tokenize(pretok, "Hi there. This is a test that merges spaces with previous token.")

… and the outcome:

{:ok,
 [
   {"Hi ", {0, 3}},
   {"there. ", {3, 10}},
   {"This ", {10, 15}},
   {"is ", {15, 18}},
   {"a ", {18, 20}},
   {"test ", {20, 25}},
   {"that ", {25, 30}},
   {"merges ", {30, 37}},
   {"spaces ", {37, 44}},
   {"with ", {44, 49}},
   {"previous ", {49, 58}},
   {"token.", {58, 64}}
 ]}
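
For the :merged_with_next behavior from the original question, the same API should work, with each space attaching to the piece that follows it instead. A minimal sketch, assuming v0.4 also accepts the :merged_with_next atom:

pretok = Tokenizers.PreTokenizer.split(" ", :merged_with_next)
Tokenizers.PreTokenizer.pre_tokenize(pretok, "Hi there. This is a test.")

which should return something like:

{:ok,
 [
   {"Hi", {0, 2}},
   {" there.", {2, 9}},
   {" This", {9, 14}},
   {" is", {14, 17}},
   {" a", {17, 19}},
   {" test.", {19, 25}}
 ]}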

Such a feature allows reconstructing the original text, since whitespace is preserved rather than deleted.
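
As a quick check of that claim, a small sketch using the same calls as above plus standard Enum functions; concatenating the pieces rebuilds the input:

input = "Hi there. This is a test that merges spaces with previous token."
pretok = Tokenizers.PreTokenizer.split(" ", :merged_with_previous)
{:ok, pieces} = Tokenizers.PreTokenizer.pre_tokenize(pretok, input)

# Drop the offsets and join the pieces; nothing was deleted,
# so this restores the original string.
reconstructed =
  pieces
  |> Enum.map(fn {piece, _offsets} -> piece end)
  |> Enum.join()

reconstructed == input
#=> true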