Troubleshooting: Bumblebee audio transcription returns the same text for several parsed webm files

Hi, I am trying to use Bumblebee to transcribe audio to text.
I have several *.webm files with audio that sounds like ‘text audio with Bumblebee’, ‘one, two, three, four, five’, etc.
I can play them back with a media player and each file contains different audio.
However, it looks like Bumblebee almost always generates the same prediction, with text ’ you’:

predictions: %{
  chunks: [
    %{text: " you", start_timestamp_seconds: nil, end_timestamp_seconds: nil}
  ]
}
# ...x5+ times

predictions: %{
  chunks: [
    %{text: " Thank you.", start_timestamp_seconds: nil, end_timestamp_seconds: nil}
  ]
}
predictions: %{
  chunks: [
    %{text: " Bye.", start_timestamp_seconds: nil, end_timestamp_seconds: nil}
  ]
}

(`predictions` is the `IO.inspect` label here)

My code:
In application.ex, the child spec:

children = [
  ...
  {Nx.Serving,
   serving: create_audio_serving(),
   name: Recognizer.AudioServing,
   batch_size: 4,
   batch_timeout: 100}
  ...
]

defp create_audio_serving() do
    # Load the pre-trained model
    {:ok, model_info} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
    {:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
    {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
    {:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})

    Bumblebee.Audio.speech_to_text_whisper(model_info, featurizer, tokenizer, generation_config,
      compile: [batch_size: 4],
      defn_options: [
        compiler: EXLA
      ]
    )
  end

In the worker process I have this code:

defmodule Recognizer.Room do
  @moduledoc false

  use GenServer, restart: :temporary
  ...
  @impl true
  def handle_cast({:receive_audio_msg, audio_base64}, state) do
    file = "/tmp/audio-#{state.id}.webm"
    audio_data = Base.decode64!(audio_base64)

    File.write!(file, audio_data)
    reader = Xav.Reader.new!(file, read: :audio)

    case Xav.Reader.next_frame(reader) do
      {:ok, frame} ->
        tensor = Xav.Frame.to_nx(frame)
        Task.async(fn -> Nx.Serving.batched_run(Recognizer.AudioServing, tensor) end)

      {:error, :no_keyframe} ->
        Logger.warning("Couldn't decode audio frame - missing keyframe!")
    end

    {:noreply, state}
  end

  @impl true
  def handle_info({_ref, predictions}, state) do
    IO.inspect(predictions, label: :predictions)
    {:noreply, state}
  end
end

I have added the default EXLA backend as suggested in a very similar question, but the predictions are still nonsense.
What am I doing wrong, and why am I getting these random predictions?
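For reference, setting the default EXLA backend mentioned above is usually done in `config/config.exs`. Note this only changes where tensors live by default; the serving's compiler is still controlled by the `defn_options` passed to `speech_to_text_whisper`:

```elixir
# config/config.exs
import Config

# Make EXLA the default Nx backend so tensor operations run on the
# compiled (CPU/GPU) backend instead of the pure-Elixir BinaryBackend.
config :nx, default_backend: EXLA.Backend
```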


The issue is with Xav's `next_frame` — it probably takes only the first frame.
Passing the entire file, like it was done in cool-whisper-server, works:

iex(10)> Nx.Serving.batched_run(Recognizer.AudioServing, {:file, "/home/maryna/Music/audio-17405164795554743983100781892.webm"})
%{
  chunks: [
    %{
      text: " Finally in temporary digital storage.",
      start_timestamp_seconds: nil,
      end_timestamp_seconds: nil
    }
  ]
}
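If you want to stay frame-based rather than passing the whole file, the decode loop has to drain every frame, not just the first. A minimal sketch — `FrameCollector` is a made-up name, and `next_fun` is injected only so the loop is easy to exercise; with Xav you would pass `&Xav.Reader.next_frame/1`, assuming it returns `{:error, :eof}` at end of stream:

```elixir
defmodule FrameCollector do
  # Pull frames from a reader until end of stream, returning them in order.
  # `next_fun` is the frame-fetching function (e.g. &Xav.Reader.next_frame/1).
  def collect(reader, next_fun, acc \\ []) do
    case next_fun.(reader) do
      {:ok, frame} -> collect(reader, next_fun, [frame | acc])
      {:error, :eof} -> Enum.reverse(acc)
    end
  end
end

# With Xav it would be used roughly like this:
#
#   reader = Xav.Reader.new!(file, read: :audio)
#
#   tensors =
#     reader
#     |> FrameCollector.collect(&Xav.Reader.next_frame/1)
#     |> Enum.map(&Xav.Frame.to_nx/1)
#
#   Nx.Serving.batched_run(Recognizer.AudioServing, Nx.concatenate(tensors))
```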

Nx.Serving.batched_run(Recognizer.AudioServing, {:file, file})

How much data is in a frame? Collect a bunch of them, stick them together, then pass them through.

I have done this here, checking whether the concatenated bytes are above a desired size.

Something like this:
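A minimal sketch of that size-threshold buffering idea — the `AudioBuffer` module name and the 64 kB threshold are made up for illustration:

```elixir
defmodule AudioBuffer do
  # Accumulate incoming audio chunks; once the concatenated bytes
  # exceed the threshold, flush them as one binary for transcription.
  @flush_bytes 64_000

  def push(acc, chunk) when is_binary(acc) and is_binary(chunk) do
    acc = acc <> chunk

    if byte_size(acc) >= @flush_bytes do
      # Caller writes `acc` to a file (or decodes it to a tensor)
      # and sends it to the serving, then resets its state to <<>>.
      {:flush, acc}
    else
      {:keep, acc}
    end
  end
end
```

Keeping `acc` in the GenServer state (starting at `<<>>`) and flushing on `{:flush, bytes}` gives Whisper enough audio context per request, instead of a single decoded frame.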

It uses an old version of Membrane and all that, but I use similar mechanisms in this more up-to-date one.
