High memory usage when transcribing with whisper-large-v3 in Nx/EXLA/Bumblebee

Hi, I’m trying to run Whisper models in Elixir using Nx, EXLA, and Bumblebee.

I’ve run into the following issue:
When transcribing audio files smaller than 200 KB with the whisper-large-v3 model, the process consumes over 15 GB of RAM.

This is the script I created:

defmodule WhisperLarge do

  def run do
    Nx.global_default_backend(EXLA.Backend)

    {:ok, model} = Bumblebee.load_model({:hf, "openai/whisper-large-v3"})
    {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-large-v3"})
    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-large-v3"})
    {:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-large-v3"})

    generation_config = Bumblebee.configure(generation_config, max_new_tokens: 448)

    serving =
      Bumblebee.Audio.speech_to_text_whisper(
        model,
        featurizer,
        tokenizer,
        generation_config,
        compile: [batch_size: 1],
        chunk_num_seconds: 3, # note: Whisper pads each chunk to a 30-second window internally
        timestamps: :segments,
        language: "es",
        stream: false
      )

    result = Nx.Serving.run(serving, {:file, "file.wav"})

    text = Enum.map_join(result.chunks, " ", & &1.text)

    IO.puts("Transcription: #{text}")
  end
end
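
For anyone reproducing this, one measurement caveat: EXLA allocates tensor buffers outside the BEAM allocators, so `:erlang.memory/1` under-reports the real footprint, and the number to watch is OS-level RSS (e.g. `ps -o rss= -p <pid>`). A quick way to see the gap from inside the script:

```elixir
# BEAM-managed memory only; EXLA's tensor buffers live outside these
# allocators, so this will be far below the RSS the OS reports.
beam_bytes = :erlang.memory(:total)
IO.puts("BEAM total: #{div(beam_bytes, 1_000_000)} MB")
```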

Dependencies:

      {:bumblebee, "~> 0.6.0"},
      {:nx, "~> 0.9.0"},
      {:exla, "~> 0.9.0"}
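
One direction I’ve been considering, based on my reading of the Bumblebee docs (please correct me if I’m misusing it), is loading the parameters in half precision via the `:type` option of `Bumblebee.load_model/2`. whisper-large-v3 has roughly 1.5B parameters, so the f32 weights alone are around 6 GB, and bf16 should cut that roughly in half:

```elixir
# Sketch: load whisper-large-v3 params as bf16 instead of the default f32.
# Assumes the :type option on Bumblebee.load_model/2 applies to all params.
{:ok, model} =
  Bumblebee.load_model({:hf, "openai/whisper-large-v3"}, type: :bf16)
```

I haven’t verified how much this affects accuracy for Spanish audio, though.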

I’ve also tested smaller models such as whisper-tiny, which consume far less memory, but they aren’t accurate enough for my use case.

Someone on the Elixir Slack suggested switching to whisper.cpp or a Python pipeline, both of which use significantly less memory.
However, I was hoping to do the full transcription process entirely in Elixir.

I’d really appreciate any advice or suggestions on reducing memory usage with large models in Nx/Bumblebee.
