Blog Post: Voice Activity Detection in Elixir with Membrane

Building on other people's work, I bashed a few things together, and suddenly I can tell when someone is speaking using Elixir and Membrane.


Great post! In case anyone else runs into the same issue: the latest silero_vad.onnx on their GitHub repo (version 5) expects different inputs than the version used in this post (version 4). The separate h and c inputs are now combined into a single state tensor: {"state", "Float32", [2, nil, 128]}
Here is the updated code:

# v5 keeps a single combined recurrent state tensor of shape {2, batch, 128}
init_state = %{state: Nx.broadcast(0.0, {2, 1, 128})}

live_audio
|> Kino.Control.stream()
|> Kino.listen(init_state, fn
  %{event: :audio_chunk, chunk: data}, %{state: state} ->
    # Wrap the chunk in a batch dimension; sample rate must be int64
    input = Nx.tensor([data])
    sr = Nx.tensor(16_000, type: :s64)
    # v5 returns the speech probability and the updated state
    {output, state_n} = Ortex.run(model, {input, state, sr})
    prob = output |> Nx.squeeze() |> Nx.to_number()
    row = %{x: :os.system_time(), y: prob}
    Kino.VegaLite.push(chart, row, window: 1000)

    # Carry the updated state into the next chunk
    {:cont, %{state: state_n}}
end)
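For comparison, if you are still using the v4 model from the original post, a sketch of the equivalent handler might look like the following. The separate h/c state shapes {2, 1, 64} and the {input, sr, h, c} input order are assumptions based on the v4 ONNX signature, so verify them against your model file; `model`, `live_audio`, and `chart` are the same bindings set up earlier in the post.

```elixir
# Sketch for the v4 silero_vad.onnx, which keeps two separate LSTM
# state tensors (h and c) instead of v5's single combined state.
# Shapes and input order are assumed from the v4 model's signature.
init_state = %{h: Nx.broadcast(0.0, {2, 1, 64}), c: Nx.broadcast(0.0, {2, 1, 64})}

live_audio
|> Kino.Control.stream()
|> Kino.listen(init_state, fn
  %{event: :audio_chunk, chunk: data}, %{h: h, c: c} ->
    input = Nx.tensor([data])
    sr = Nx.tensor(16_000, type: :s64)
    # v4 returns the probability plus both updated state tensors
    {output, h_n, c_n} = Ortex.run(model, {input, sr, h, c})
    prob = output |> Nx.squeeze() |> Nx.to_number()
    Kino.VegaLite.push(chart, %{x: :os.system_time(), y: prob}, window: 1000)

    {:cont, %{h: h_n, c: c_n}}
end)
```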

Thanks! I linked to your post in a small update and also linked the original model that works with the code I had.


Awesome, very nice work!

By the way, I just remembered that about 14 years ago I had a startup where one of the components was a gateway from real CB amateur radio (27 MHz) to our service. It was, of course, built with Erlang and used some magic around a home-grown sox driver, which also had voice/sound activity detection.

P.S.: if I owned a time machine back then, I would use your solution. :slight_smile: