Building on other people’s work, I bashed things together and suddenly I can tell when someone is speaking, using Elixir and Membrane.
Great post! In case anyone else runs into the same issue, the latest silero_vad.onnx on their GitHub repo (version 5) expects different inputs than the version used in this post (version 4). The inputs h and c are now combined into a single state tensor: {"state", "Float32", [2, nil, 128]}
Here is the updated code:
# Silero VAD v5 keeps a single recurrent state tensor of shape {2, batch, 128}
init_state = %{state: Nx.broadcast(0.0, {2, 1, 128})}

live_audio
|> Kino.Control.stream()
|> Kino.listen(init_state, fn
  %{event: :audio_chunk, chunk: data}, %{state: state} ->
    input = Nx.tensor([data])
    sr = Nx.tensor(16_000, type: :s64)
    # v5 takes {input, state, sr} and returns the updated state alongside the output
    {output, state_n} = Ortex.run(model, {input, state, sr})
    prob = output |> Nx.squeeze() |> Nx.to_number()
    row = %{x: :os.system_time(), y: prob}
    Kino.VegaLite.push(chart, row, window: 1000)
    # carry the new state into the next chunk
    {:cont, %{state: state_n}}
end)
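For comparison, here is a minimal sketch of what the version-4 call looks like, assuming v4's separate h and c recurrent states of shape {2, 1, 64} and the input order input, sr, h, c (I haven't rechecked the v4 graph, so treat the shapes and ordering as assumptions; model, live_audio, and chart are the same bindings as above):

# Hypothetical v4 equivalent: the recurrent state is split into h and c,
# each shaped {2, batch, 64}, and the model returns them as separate outputs
init_state = %{h: Nx.broadcast(0.0, {2, 1, 64}), c: Nx.broadcast(0.0, {2, 1, 64})}

live_audio
|> Kino.Control.stream()
|> Kino.listen(init_state, fn
  %{event: :audio_chunk, chunk: data}, %{h: h, c: c} ->
    input = Nx.tensor([data])
    sr = Nx.tensor(16_000, type: :s64)
    # assumed v4 ordering: inputs {input, sr, h, c}, outputs {output, hn, cn}
    {output, hn, cn} = Ortex.run(model, {input, sr, h, c})
    prob = output |> Nx.squeeze() |> Nx.to_number()
    Kino.VegaLite.push(chart, %{x: :os.system_time(), y: prob}, window: 1000)
    # carry both recurrent tensors into the next chunk
    {:cont, %{h: hn, c: cn}}
end)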
Thanks! I linked to your post in a small update and also linked the original model that works with the code I had.
Awesome, very nice work!
By the way, I just remembered that about 14 years ago I had a startup where one of the components was a gateway from real CB amateur radio (27 MHz) to our service. It was of course built with Erlang and used some magic around a home-grown sox driver, which also had voice/sound activity detection.
P.S.: if I had owned a time machine in those days, I would have used your solution.