Audio from Kino.Audio can't be recognized by Whisper in bumblebee

Ljzn · March 10, 2023, 3:21am

Hi, elixir forum. I’m using the whisper-tiny model to convert speech to text but I always get a bad result. I think it may be the format of the audio data, such as endianness. any idea?

Ljzn · March 10, 2023, 6:09am

It works well after changing the sampling rate to 16000!

audio_input = Kino.Input.audio("Speech", sampling_rate: 16000)

And I found this, whisper need use ffmpeg to resample the audio into 16000, so we need set the sampling rate to 16000 if not resampling.

Ljzn · March 10, 2023, 6:14am

Another question, how to set the “decoder id”? So we can recognize language other than english.

Ljzn · March 10, 2023, 6:57am

Got solution! You can config the forced_token_ids in whisper spec:

# set language to `zh`
forced_token_ids = [{1, 50260}, {2, 50359}, {3, 50363}]
whisper = put_in(whisper.spec.forced_token_ids, forced_token_ids)

The full list of tokens can be found here: added_tokens.json · openai/whisper-base at main