Bumblebee/Axon vs. Python: Performance for sentence embedding

Context

I am experimenting with text embedding with the hope of implementing semantic similarity search inside a Phoenix application.

My target use case involves a user writing a short sentence (typically 5 to 30 words). Within a few seconds, I want to present the user with similar sentences drawn from a collection of equally short sentences previously written by other users.

The test that puzzles me

As a first quick test of feasibility, I am playing with the example posted by @jonatanklosko at Add text embedding serving · Issue #206 · elixir-nx/bumblebee · GitHub:

{:ok, model_info} = Bumblebee.load_model({:hf, "bert-base-uncased"}, architecture: :base)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})

text = "Hello, world!"
inputs = Bumblebee.apply_tokenizer(tokenizer, text)

Axon.predict(model_info.model, model_info.params, inputs).hidden_state[0]

The code executes without error, but when I run it locally on my machine (a MacBook Air, less than 4 years old), the last line Axon.predict(model_info.model, model_info.params, inputs).hidden_state[0] takes more than a minute to complete.

In contrast, the Python equivalent presented at the top of the same GH thread (Add text embedding serving · Issue #206 · elixir-nx/bumblebee · GitHub) completes almost instantaneously (fractions of a second) on the same machine:

from transformers import AutoTokenizer, AutoModel
import torch

# Load pre-trained model tokenizer and model weights
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize input text
text = "Hello, world!"
tokens = tokenizer.encode(text, add_special_tokens=True, return_tensors="pt")

# Generate model embeddings
with torch.no_grad():
    embeddings = model(tokens)[0].squeeze(0)  # Remove batch dimension

# Print the embeddings for the first token
print(embeddings[0])

I am guessing this is not normal and that I am doing something wrong. Any idea what that could be? Is there a better way to retrieve text embedding vectors than the script I am playing with from Add text embedding serving · Issue #206 · elixir-nx/bumblebee · GitHub?

Notes

  • I am using:
      {:bumblebee, "~> 0.5.3"},
      {:nx, "~> 0.7.0"},
  • The 1 minute runtime I am reporting above is for the last Axon.predict(model_info.model, model_info.params, inputs).hidden_state[0] step alone (it does not include the model/tokenizer loading steps).
  • I did read through Nx vs. Python performance for sentence-transformer encoding. I am guessing my issue is different from what’s discussed there, since that post is “only” about a 2x slowdown compared to equivalent Python code, much smaller than the delta I am experiencing. For my application, I’d be more than happy with 2x the Python runtime.
  • Given my use case, since Python is fast enough, I realize I could let Python handle the embedding part and pick things up inside Phoenix after Python completes the embedding. But I’d prefer keeping it all in Elixir if possible.

Hi, the first quick check when something is slow: did you set the backend as described here? Or did you compile the model as described in the post you linked?


Thanks a lot for the pointer @joelpaulkoch.

I unfortunately did not heed the warning at the top of https://hexdocs.pm/bumblebee/Bumblebee.html:

(Can’t say it wasn’t emphasized enough :sweat_smile:).

So I was “just” adding

{:bumblebee, "~> 0.5.3"},
{:nx, "~> 0.7.0"},

to my mix.exs.

I’ve started looking more carefully at the backend setup you linked to. Running out of time for today but I will provide updates once I have had time to look into it further.

Cool, looking forward to your updates!

Hey, I think @joelpaulkoch is spot on: without a backend, all the operations run in pure Elixir, which is not meant for numerical performance. So you want to set EXLA.Backend as the backend (config :nx, default_backend: EXLA.Backend, or in a notebook Nx.global_default_backend(EXLA.Backend)).
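For completeness, here is roughly what that backend setup looks like in a Mix project (a minimal sketch; the exla version shown is an assumption, use whichever release matches your nx version):

# mix.exs
{:bumblebee, "~> 0.5.3"},
{:nx, "~> 0.7.0"},
{:exla, "~> 0.7.0"},

# config/config.exs — every Nx operation now dispatches to the EXLA (XLA) backend
config :nx, default_backend: EXLA.Backend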

For production, you also want to use a serving, in this case Bumblebee.Text.text_embedding, and set compilation options, so that on startup the whole model is compiled into a single efficient computation (whereas the backend dispatches every individual operation separately). Also, you may find this readme useful.
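A minimal sketch of such a serving, reusing the model from your snippet (for real sentence similarity you may prefer a dedicated sentence-embedding model, like the ones used in the RAG docs):

{:ok, model_info} = Bumblebee.load_model({:hf, "bert-base-uncased"}, architecture: :base)
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "bert-base-uncased"})

# Compile the whole forward pass once, for fixed shapes, using EXLA
serving =
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    compile: [batch_size: 4, sequence_length: 64],
    defn_options: [compiler: EXLA]
  )

# One-off call; returns %{embedding: tensor}
Nx.Serving.run(serving, "Hello, world!")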

You can see Generating embeddings in the RAG docs; it includes the serving and also covers similarity lookup using an in-memory index via HNSWLib (or, if you need persistence, you can use pgvector). Sidenote: since you know the sentences are short, you can compile for a smaller sequence length, as in compile: [batch_size: ..., sequence_length: [32, 64]] (multiple values generate multiple versions of the computation, and the shortest one that fits is picked). Batch size depends on how many concurrent requests you expect and how much the hardware can handle; you can probably start with something small, like 4 or even 1.
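To sketch how this fits your Phoenix use case end to end (MyApp.EmbeddingServing is a placeholder name, and the HNSWLib calls follow its README, so double-check the exact signatures against the version you install):

# In the application supervision tree, reusing the serving built above
children = [
  {Nx.Serving, serving: serving, name: MyApp.EmbeddingServing, batch_timeout: 100}
]

# From anywhere in the app: embed a sentence through the shared serving process
%{embedding: embedding} = Nx.Serving.batched_run(MyApp.EmbeddingServing, "a short user sentence")

# In-memory similarity index; 768 is the hidden size of bert-base-uncased
{:ok, index} = HNSWLib.Index.new(:cosine, 768, 1_000_000)
HNSWLib.Index.add_items(index, Nx.new_axis(embedding, 0))

# Later: retrieve the 5 nearest previously indexed sentences
{:ok, labels, distances} = HNSWLib.Index.knn_query(index, embedding, k: 5)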

If anything is not clear or doesn’t work, let me know : )


Thanks a ton, @joelpaulkoch and @jonatanklosko. You were indeed spot on. Once I set the backend to EXLA.Backend, everything started working as fast as expected.

@joelpaulkoch I accepted the answer from @jonatanklosko as it is more complete but really appreciate the earlier pointer nevertheless :pray: