This implementation is inspired by the available Python and JavaScript libraries.
For now, my use case is only to count tokens precisely so I can avoid hitting the limit in my project. But I believe the full encoding and decoding functionality works, given the passing tests. Your feedback is welcome!
That one seems to use Rustler to bind (via NIF) the Rust implementation of the HuggingFace tokenizer. The HF tokenizer uses the same mechanism (BPE) as GPT-3’s, but they are different tokenizers and give different results. (I just confirmed it by comparing the result in its README against the OpenAI API.)
Anyway, binding the Rust implementation is a promising direction. The GPT-3 encoder also has a Rust implementation, tiktoken. For my own use case, the native one is good enough, though.
Thank you for this! I just tested it on a bunch of long transcripts and it got the token count exactly right. It’s also pretty fast: it counted 10k tokens in 15ms.