Troubleshooting: Bumblebee audio transcription returns the same text for several parsed webm files

Hi, I am trying to use Bumblebee to transcribe audio to text.
I have several *.webm files with audio that sounds like ‘text audio with Bumblebee’, ‘one, two, three, four, five’, etc.
I can play them back with a media player and each file contains different audio.
However, it looks like Bumblebee almost always generates the same prediction, with text ’ you’:

predictions: %{
  chunks: [
    %{text: " you", start_timestamp_seconds: nil, end_timestamp_seconds: nil}
  ]
}
# ...x5+ times

predictions: %{
  chunks: [
    %{text: " Thank you.", start_timestamp_seconds: nil, end_timestamp_seconds: nil}
  ]
}
predictions: %{
  chunks: [
    %{text: " Bye.", start_timestamp_seconds: nil, end_timestamp_seconds: nil}
  ]
}

(`predictions` is the `IO.inspect` label here)

My code:
In application.ex, the child spec:

children = [
  ...
  {Nx.Serving,
   serving: create_audio_serving(),
   name: Recognizer.AudioServing,
   batch_size: 4,
   batch_timeout: 100}
  ...
]

defp create_audio_serving() do
    # Load the pre-trained model
    {:ok, model_info} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
    {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
    {:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})

    Bumblebee.Audio.speech_to_text_whisper(model_info, featurizer, tokenizer, generation_config,
      compile: [batch_size: 4],
      defn_options: [
        compiler: EXLA
      ]
    )
  end

In the worker process I have this code:

defmodule Recognizer.Room do
  @moduledoc false

  use GenServer, restart: :temporary
  ...
  @impl true
  def handle_cast({:receive_audio_msg, audio_base64}, state) do
    file = "/tmp/audio-#{state.id}.webm"
    audio_data = Base.decode64!(audio_base64)

    File.write!(file, audio_data)
    reader = Xav.Reader.new!(file, read: :audio)

    case Xav.Reader.next_frame(reader) do
      {:ok, frame} ->
        tensor = Xav.Frame.to_nx(frame)
        Task.async(fn -> Nx.Serving.batched_run(Recognizer.AudioServing, tensor) end)

      {:error, :no_keyframe} ->
        Logger.warning("Couldn't decode audio frame - missing keyframe!")
    end

    {:noreply, state}
  end

  @impl true
  def handle_info({_ref, predictions}, state) do
    IO.inspect(predictions, label: :predictions)
    {:noreply, state}
  end
end

I have added the default EXLA backend as suggested in a very similar question, but the predictions are still nonsense.
What am I doing wrong, and why am I getting these random predictions?
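For reference, setting the default EXLA backend mentioned above is usually done in `config/config.exs`. Note this only changes where tensors live by default; the serving's compiler is still controlled by the `defn_options` passed to `speech_to_text_whisper`:

```elixir
# config/config.exs
import Config

# Make EXLA the default Nx backend so tensor operations run on the
# compiled (CPU/GPU) backend instead of the pure-Elixir BinaryBackend.
config :nx, default_backend: EXLA.Backend
```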


The issue is with Xav's `next_frame` — it probably takes only the first frame.
Passing the entire file, like it was done in cool-whisper-server, works:

iex(10)> Nx.Serving.batched_run(Recognizer.AudioServing, {:file, "/home/maryna/Music/audio-17405164795554743983100781892.webm"})
%{
  chunks: [
    %{
      text: " Finally in temporary digital storage.",
      start_timestamp_seconds: nil,
      end_timestamp_seconds: nil
    }
  ]
}
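If you want to stay frame-based rather than passing the whole file, the decode loop has to drain every frame, not just the first. A minimal sketch — `FrameCollector` is a made-up name, and `next_fun` is injected only so the loop is easy to exercise; with Xav you would pass `&Xav.Reader.next_frame/1`, assuming it returns `{:error, :eof}` at end of stream:

```elixir
defmodule FrameCollector do
  # Pull frames from a reader until end of stream, returning them in order.
  # `next_fun` is the frame-fetching function (e.g. &Xav.Reader.next_frame/1).
  def collect(reader, next_fun, acc \\ []) do
    case next_fun.(reader) do
      {:ok, frame} -> collect(reader, next_fun, [frame | acc])
      {:error, :eof} -> Enum.reverse(acc)
    end
  end
end

# With Xav it would be used roughly like this:
#
#   reader = Xav.Reader.new!(file, read: :audio)
#
#   tensors =
#     reader
#     |> FrameCollector.collect(&Xav.Reader.next_frame/1)
#     |> Enum.map(&Xav.Frame.to_nx/1)
#
#   Nx.Serving.batched_run(Recognizer.AudioServing, Nx.concatenate(tensors))
```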

Nx.Serving.batched_run(Recognizer.AudioServing, {:file, file})

How much data is in a frame? Collect a bunch of them, stick them together, then pass them through.

I have done this here, checking whether the concatenated bytes are above a desired size.

Something like this:
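A minimal sketch of that size-threshold buffering idea — the `AudioBuffer` module name and the 64 kB threshold are made up for illustration:

```elixir
defmodule AudioBuffer do
  # Accumulate incoming audio chunks; once the concatenated bytes
  # exceed the threshold, flush them as one binary for transcription.
  @flush_bytes 64_000

  def push(acc, chunk) when is_binary(acc) and is_binary(chunk) do
    acc = acc <> chunk

    if byte_size(acc) >= @flush_bytes do
      # Caller writes `acc` to a file (or decodes it to a tensor)
      # and sends it to the serving, then resets its state to <<>>.
      {:flush, acc}
    else
      {:keep, acc}
    end
  end
end
```

Keeping `acc` in the GenServer state (starting at `<<>>`) and flushing on `{:flush, bytes}` gives Whisper enough audio context per request, instead of a single decoded frame.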

It uses an old version of Membrane and all that, but I use similar mechanisms in this more up-to-date one.
