hey thanks for the welcome…!!!
Basically, I'm making a web app that listens to the microphone on the client, and the server is in charge of transcribing that audio and returning the text to the client.
I want to send the audio over a channel so I can later pass it to some AI model that transcribes it.
Hey @Trini I think at a high level there are roughly two approaches here.
Option 1: Make a client-side file
In this scenario, you use standard JavaScript tooling to capture the audio into a file, then upload that file to Phoenix. From there, Phoenix can send it over to the AI model or similar.
This doesn't really sound like what you want, though.
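To make option 1 concrete, here's a rough sketch of the upload side. The controller name is hypothetical, and it assumes an Nx.Serving named WhisperServing is already running (see the setup sketch further down) and that ffmpeg is installed on the server, since Bumblebee's speech-to-text serving can take a {:file, path} input and decode it with ffmpeg:

defmodule AudioWeb.UploadController do
  use AudioWeb, :controller

  # Expects a multipart form upload under the "audio" param,
  # e.g. post "/transcribe", UploadController, :create in the router.
  def create(conn, %{"audio" => %Plug.Upload{path: path}}) do
    # {:file, path} inputs are decoded and resampled by ffmpeg before
    # being run through the model.
    %{results: [%{text: text}]} =
      Nx.Serving.batched_run(WhisperServing, {:file, path})

    json(conn, %{text: text})
  end
end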
Option 2: Media streaming
For this option I'd check out https://membrane.stream/. Basically, multimedia streaming is a pretty complex topic, but handily there's a whole Elixir framework aimed at exactly that.
I'd say the main downside of Membrane is that, because it can do so much, you have to dig through it a bit and figure out which parts are relevant for you and which aren't.
Option 2 looks quite interesting. I had no idea something like Membrane existed. I'm going to look into both options anyway. Thank you very much for taking the time to answer my question.
Channels support binary payloads, so the example you used could absolutely be used to take an encoded segment of audio and do something with it on the server (like transcribing it). If you're using Bumblebee's Whisper model, you'll need to transcode the microphone capture to PCM at the sample rate Whisper expects (16kHz mono floats). You could do this with ffmpeg on the server, but ideally you'd do it on the client so the raw audio chunk is already in the format you need to pass off to Whisper. At a glance, you can do this with the existing media capture primitives and standard JS functions. Then you ship that final ArrayBuffer up the channel and pass it off to Bumblebee.
The above is basically option 1 of what Ben is talking about, which should be fine for transcribing microphone data every X seconds.
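For reference, here's roughly how that WhisperServing could get wired up at application startup. This is a sketch assuming a Bumblebee version around 0.3, where the result shape is %{results: [%{text: ...}]} as in the channel code below; newer Bumblebee releases moved to a slightly different speech_to_text_whisper API, so check the docs for your version:

{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA]
  )

# Then in the application's supervision tree, registered under the
# name the channel code calls:
# {Nx.Serving, serving: serving, name: WhisperServing}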
That's fantastic! I had considered the whole "chop up the front-end audio every few seconds" approach, but I was concerned that it'd cause issues if a chop happened mid-word. Is the buffer overlapping in some way?
defmodule AudioWeb.AudioChannel do
  use AudioWeb, :channel

  def join(_topic, _payload, socket) do
    {:ok, socket}
  end

  # Binary payloads arrive as {:binary, data} tuples. Here data is
  # expected to already be raw 32-bit float PCM (per the transcoding
  # discussion above), so it can go straight into a tensor.
  def handle_in("chunk", {:binary, data}, socket) do
    %{results: [%{text: text}]} =
      Nx.Serving.batched_run(WhisperServing, Nx.from_binary(data, :f32))

    # Send the transcription back down to the client.
    push(socket, "transcription", %{text: text})
    {:noreply, socket}
  end
end