@steffend on Bumblebee main you can specify multiple sequence lengths, in which case we compile one version of the computation per length and each input is routed to the smallest length it fits into. This way short sequences don't carry overly long padding. Here's an example:
# Text embedding with multiple lengths
```elixir
Mix.install([
  {:bumblebee, github: "elixir-nx/bumblebee"},
  {:rustler, ">= 0.0.0", optional: true},
  {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
  {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
  {:kino, "~> 0.10.0"}
])

Nx.global_default_backend(EXLA.Backend)
```
## 🐈⬛
```elixir
repo = "sentence-transformers/all-MiniLM-L6-v2"

{:ok, model_info} = Bumblebee.load_model({:hf, repo})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, repo})

# One version of the computation is compiled per sequence length;
# each input is padded to the smallest length that fits it
serving =
  Bumblebee.Text.TextEmbedding.text_embedding(model_info, tokenizer,
    compile: [batch_size: 32, sequence_length: [16, 32, 64, 128, 512]],
    defn_options: [compiler: EXLA]
  )

# Start the serving under the Livebook supervision tree
Kino.start_child({Nx.Serving, serving: serving, name: MyServing})
```
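If you're curious which bucket a given text lands in, you can tokenize it yourself and count the tokens. A quick sketch (calling `Bumblebee.apply_tokenizer/2` without a fixed length, so the shape reflects the actual token count):

```elixir
# Sketch: count tokens for a text to see which sequence-length
# bucket (16 / 32 / 64 / 128 / 512) it will be padded to
inputs = Bumblebee.apply_tokenizer(tokenizer, "this is a test")
Nx.axis_size(inputs["input_ids"], 1)
```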
```elixir
short_text = "this is a test"

# Routed to a short bucket, so very little padding is added
Nx.Serving.batched_run(MyServing, short_text)
```
```elixir
long_text = String.duplicate("this is a test with a much longer text ", 50)

# Exceeds the smaller buckets, so it is padded to the 512 length
Nx.Serving.batched_run(MyServing, long_text)
```
The first input is bucketed into one of the shorter sequence lengths, so we add little padding and the computation is fast. The second input only fits the largest length, so we pad to 512 tokens and the computation takes noticeably longer.
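You can see the difference yourself by timing both calls. A rough sketch using `:timer.tc` (exact numbers depend on your hardware):

```elixir
# Rough wall-clock comparison between the two buckets (microseconds)
{short_us, _result} = :timer.tc(fn -> Nx.Serving.batched_run(MyServing, short_text) end)
{long_us, _result} = :timer.tc(fn -> Nx.Serving.batched_run(MyServing, long_text) end)
IO.puts("short: #{div(short_us, 1000)}ms, long: #{div(long_us, 1000)}ms")
```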