Gpt3_tokenizer - BPE Encoder Decoder for GPT-3 implemented in Elixir

Just contributed my first Elixir library to the open source community - gpt3_tokenizer!

This is an Elixir tokenizer for OpenAI’s GPT-3 model.

This implementation is inspired by the Python and JavaScript libraries available.

For now, my only use case is calculating token counts precisely so that I avoid hitting the limit in my project. The full encoding and decoding functionality should also work, given that the tests pass. Feedback is welcome!


Nice work, but have you seen this? Tokenizers.Tokenizer (Tokenizers v0.3.2)

No, I haven’t.

It seems to use Rustler to bind the Rust implementation of the Hugging Face tokenizer as a NIF. The HF tokenizer uses the same mechanism (BPE) as GPT-3’s, but the two are different and give different results. (I just confirmed this by comparing the result in its README against the OpenAI API.)
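To illustrate why two BPE tokenizers can disagree, here is a toy sketch in Python. The merge tables `merges_a` and `merges_b` are invented for the example and are not the real GPT-3 or Hugging Face vocabularies; real implementations also work on bytes after a regex pre-tokenization step, which is left out here. The only point is that the same greedy merge loop yields different tokens, and different token counts, when the learned merge rules differ.

```python
def bpe_encode(word, merges):
    """Toy greedy BPE: repeatedly apply the highest-priority merge that is present."""
    # Earlier entries in the merge list have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies any more
        merged = []
        i = 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge tables, made up for this illustration only.
merges_a = [("l", "o"), ("lo", "w"), ("e", "r")]
merges_b = [("o", "w"), ("e", "r")]

tokens_a = bpe_encode("lower", merges_a)
tokens_b = bpe_encode("lower", merges_b)
print(tokens_a, len(tokens_a))
print(tokens_b, len(tokens_b))
```

With these made-up tables the same word splits into a different number of tokens, which is why a count from one tokenizer cannot stand in for the other when checking against a model's limit.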

Anyway, binding a Rust implementation is a promising direction. The GPT-3 encoder also has a Rust implementation called tiktoken. For my own use case, the native Elixir one is good enough, though.


Dude, this is great. It’s exactly what I wanted. Now I don’t have to bloat my project with these massive Hugging Face and NIF bindings. So happy.


Thank you for this! I just tested it on a bunch of long transcripts and it got the token count exactly right. It’s also pretty fast: it counted 10k tokens in 15 ms.


This is great, thank you! I was looking for a way to drop Rust as a dependency for our app.

Have you run any benchmarks against tiktoken, by chance?