How to use channels in Phoenix to work with binary audio data?

Hi, to give some context on my question: in the Phoenix documentation I found the following code (Phoenix.Channel — Phoenix v1.7.2):

def handle_in("file_chunk", {:binary, chunk}, socket) do
  ...
  {:reply, :ok, socket}
end

but the truth is I haven't been able to use it for what I need, which is to send audio captured from the microphone by the JS client.

What I would most like to know is whether there is a book or documentation that explains how to handle binary audio and video data in Phoenix.

2 Likes

Hey @Trini welcome! Can you elaborate a bit with respect to what you want to do with that audio?

1 Like

Hey, thanks for the welcome!
Basically I'm building a web app that listens to the microphone on the client, while the server transcribes what it hears and returns the text to the client.
I want to send the audio through a channel and then pass it to an AI model that transcribes it.

1 Like

Hey @Trini I think at a high level there are roughly two approaches here.

Option 1: Make a client side file

In this scenario, you use standard javascript tooling to capture the audio into a file, then upload that file to Phoenix. From there Phoenix can send it over to the AI model or similar.

This doesn’t really sound like what you want, though.
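For what it's worth, Option 1 on the browser side might be sketched roughly like this. Everything here is an assumption about your setup: the `/api/audio` upload path and the mime-type candidates are placeholders, and `pickSupportedMime` is a small hypothetical helper so the recorder picks a format the browser actually supports.

```javascript
// Pure helper (hypothetical): pick the first candidate the predicate accepts.
function pickSupportedMime(candidates, isSupported) {
  return candidates.find(isSupported) || null;
}

// Browser-only wiring (not executed here): record the mic into a Blob,
// then upload the whole file. Invoke startRecording() from your UI code.
async function startRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mimeType = pickSupportedMime(
    ["audio/webm;codecs=opus", "audio/ogg;codecs=opus"],
    (t) => MediaRecorder.isTypeSupported(t)
  );
  const recorder = new MediaRecorder(stream, { mimeType });
  const chunks = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = async () => {
    const file = new Blob(chunks, { type: mimeType });
    // "/api/audio" is a placeholder endpoint on the Phoenix side.
    await fetch("/api/audio", { method: "POST", body: file });
  };
  recorder.start();
  return recorder; // call recorder.stop() when done
}
```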

Option 2: Media streaming

For this option I’d check out https://membrane.stream/. Multimedia streaming is a pretty complex topic, but handily there is a whole Elixir framework dedicated to it.

I’d say the main downside of Membrane is that, because it can do so much, you have to look through it and figure out which parts are relevant for you and which aren’t.

2 Likes

Option 2 looks quite interesting. I had no idea something like Membrane existed. I’m going to look into both options anyway. Thank you very much for taking the time to answer my question.

1 Like

You could also dive into the source code for the Livebook smart cell for speech-to-text:

1 Like

Channels support binary payloads, so the example you used could absolutely be used to take an encoded segment of audio and do something with it on the server (like transcribing it). If you’re using Bumblebee’s Whisper model, you’ll need to transcode the microphone capture to PCM at the sample rate Whisper expects (16 kHz, 32-bit float). You could do this with ffmpeg on the server, but ideally you do it on the client so the raw audio chunk is already in the format you need to pass off to Whisper. At a glance, you can do this with existing media-capture primitives and JS stdlib functions. Then you ship that final ArrayBuffer up the channel and pass it off to Bumblebee :slight_smile:

The above is basically Option 1 of what Ben is talking about, which should be fine for transcribing microphone data every X seconds.
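A rough sketch of that client-side transcoding, assuming Whisper wants 16 kHz mono f32. `downsample` here is a naive averaging decimator (a proper resampler, or `OfflineAudioContext`, would do better), and the `"file_chunk"` event name just mirrors the docs example.

```javascript
// Naive decimating resampler: averages runs of input samples down to the
// target rate. Good enough for speech sketches; a real resampler filters first.
function downsample(samples, fromRate, toRate) {
  if (toRate >= fromRate) return samples;
  const ratio = fromRate / toRate;
  const outLength = Math.floor(samples.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.floor((i + 1) * ratio);
    let sum = 0;
    for (let j = start; j < end; j++) sum += samples[j];
    out[i] = sum / (end - start);
  }
  return out;
}

// Browser-only wiring (not executed here): capture mic audio, convert each
// frame to 16 kHz f32 PCM, and push the raw ArrayBuffer up the channel.
function startCapture(audioChan) {
  const ctx = new AudioContext();
  navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
    const source = ctx.createMediaStreamSource(stream);
    // ScriptProcessorNode is deprecated but simple; AudioWorklet is the
    // modern replacement.
    const proc = ctx.createScriptProcessor(4096, 1, 1);
    proc.onaudioprocess = (e) => {
      const pcm = downsample(e.inputBuffer.getChannelData(0), ctx.sampleRate, 16000);
      audioChan.push("file_chunk", pcm.buffer); // binary payload
    };
    source.connect(proc);
    proc.connect(ctx.destination);
  });
}
```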

4 Likes

The bumblebee folks actually have a single file example of this with all the javascript and everything! :slight_smile:

9 Likes

That’s fantastic! I had considered the whole “chop up the front end audio every few seconds” but I was concerned that that’d cause issues if a chop happened mid word. Is the buffer overlapping in some way?

You can crib off the bumblebee example, and if you want to go low level channels, the relevant elixir code is tiny:

// on captured audio chunks (client-side JavaScript)
audioChan.push("chunk", pcmEncodedAsArrayBuffer)
...
audioChan.on("transcription", ({text}) => console.log(text))

defmodule AudioWeb.AudioChannel do
  use AudioWeb, :channel

  def join(_, _, socket) do
    {:ok, socket}
  end

  def handle_in("chunk", {:binary, data}, socket) do
    %{results: [%{text: text}]} =
      Nx.Serving.batched_run(WhisperServing, Nx.from_binary(data, :f32))

    push(socket, "transcription", %{text: text})
    {:noreply, socket}
  end
end

you can do silence detection on the client and chunk on that
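Silence detection on the client can be as simple as an RMS energy gate. A minimal sketch; the threshold and the "flush after N quiet frames" policy are assumptions you'd tune for your mic and room:

```javascript
// rms: root-mean-square energy of a PCM frame, a crude voice-activity measure.
function rms(samples) {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Accumulate frames and flush the buffered audio once we see `quietFrames`
// consecutive frames under `threshold` (i.e. a pause between words).
function makeSilenceChunker(onChunk, { threshold = 0.01, quietFrames = 5 } = {}) {
  let buffered = [];
  let quiet = 0;
  return function feed(frame) {
    buffered.push(frame);
    quiet = rms(frame) < threshold ? quiet + 1 : 0;
    if (quiet >= quietFrames && buffered.length > quietFrames) {
      onChunk(buffered); // hand the utterance off (e.g. push it up the channel)
      buffered = [];
      quiet = 0;
    }
  };
}
```

Feed each captured PCM frame into the returned function, and `onChunk` fires once per utterance instead of on an arbitrary timer, so a chop mid-word is much less likely.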

1 Like

and a video out today from José :slight_smile:

2 Likes

I was actually building this myself recently (note: I borrowed heavily from José and Chris both).

This will record from the client, upload the MP3 with LiveView, parse the audio from the chunks, and transcribe it using Bumblebee.

2 Likes

I also put together a single Elixir script example of this for anyone less familiar with the full app I linked above.

4 Likes

This is the way I was able to receive and use the binary data for my speech recognition application.

def handle_in("file_chunk", {:binary, <<samples::binary>>}, socket) do
  # ... feed `samples` into the recognizer ...
  {:reply, :ok, socket}
end

The problem I’m having now is that I get a heap overflow from receiving too much data, and some functions in my application overlap.

If anyone has any idea why this is happening, I would really appreciate your input on this question:

1 Like
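On the heap-overflow question above: since the channel replies per chunk (`{:reply, :ok, socket}`), the client can use that ack as backpressure instead of pushing unboundedly. A sketch against the phoenix.js `push(...).receive("ok", cb)` API; the queue limit and the drop-oldest policy are assumptions:

```javascript
// Keep at most `limit` chunks queued, and only push the next one once the
// server has acked the previous push, so neither side buffers unboundedly.
function makeChunkSender(channel, limit = 10) {
  const queue = [];
  let inFlight = false;

  function pump() {
    if (inFlight || queue.length === 0) return;
    inFlight = true;
    channel.push("chunk", queue.shift()).receive("ok", () => {
      inFlight = false;
      pump(); // send the next queued chunk, if any
    });
  }

  return function send(chunk) {
    if (queue.length >= limit) queue.shift(); // drop oldest rather than grow
    queue.push(chunk);
    pump();
  };
}
```

With this in place the server only ever has one chunk per client in flight, which should keep its memory use bounded even when transcription is slower than capture.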