Just contributed my first Elixir library to the open source community - gpt3_tokenizer!
This is an Elixir tokenizer for OpenAI’s GPT-3 model.
For now, my use case is only to calculate token counts precisely, so I can avoid hitting the model's token limit in my project. But I believe the full encoding and decoding functionality works, given the passing tests. Your feedback is welcome!
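As a rough sketch, counting tokens before sending a prompt looks something like this (the module and function names here are my assumption based on the package name — check the README for the exact API):

```elixir
# Hypothetical usage sketch — names assumed, not confirmed against the docs.
text = "This is an example sentence."

# Encode the text into GPT-3 BPE token ids.
tokens = Gpt3Tokenizer.encode(text)

# Count tokens up front to stay under the model's context limit.
count = Gpt3Tokenizer.token_count(text)

# Decoding the token ids should round-trip back to the original text.
^text = Gpt3Tokenizer.decode(tokens)
```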
No, I haven’t.
It seems to use Rustler to bind (via NIF) the Rust implementation of the HuggingFace tokenizer. The HF tokenizer uses the same mechanism (BPE) as GPT-3’s, but they are different and give different results. (I just confirmed this by comparing the output in its README against the OpenAI API.)
Anyway, binding the Rust implementation is a promising direction. The GPT-3 encoder also has a Rust implementation, tiktoken. For my own use case, the native Elixir one is good enough tho.
Dude, this is great. It’s exactly what I wanted. Now I don’t have to bloat my project with those massive HuggingFace and NIF bindings. So happy.
Thank you for this! I just tested it on a bunch of long transcripts and it got the token count exactly right. It’s also pretty fast: it counted 10k tokens in 15ms.
This is great, thank you! I was looking for a way to drop rust as a dependency for our app.
Have you run any benchmarks against tiktoken by chance?