This implementation is inspired by the available Python and JavaScript libraries.
For now, my use case is only to count tokens precisely so I can avoid hitting the limit in my project. But I believe the full encoding and decoding functionality works, given the passing tests. Your feedback is welcome!
That one seems to use Rustler to bind (via NIF) the Rust implementation of the HuggingFace tokenizer. The HF tokenizer uses the same mechanism (BPE) as GPT-3’s, but they are different tokenizers and give different results. (I just confirmed it by comparing the result in its README against the OpenAI API.)
Anyway, binding the Rust implementation is a promising direction. The GPT-3 encoder also has a Rust implementation, tiktoken. For my own use case, the native one is good enough, though.
Thank you for this! I just tested it on a bunch of long transcripts and it got the token count exactly right. It’s also pretty fast: it counted 10k tokens in 15ms.