hey thanks for the welcome…!!!
Basically, I'm making a web app that listens to the microphone on the client, and the server is in charge of transcribing that audio and returning the text to the client.
I want to send the audio over a channel so I can later pass it to some AI model that transcribes it.
Hey @Trini I think at a high level there are roughly two approaches here.
Option 1: Make a client-side file
In this scenario, you use standard JavaScript tooling to capture the audio into a file, then upload that file to Phoenix. From there, Phoenix can send it over to the AI model or similar.
This doesn't really sound like what you want, though.
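To make option 1 concrete, here's a rough sketch of the upload side. The controller name is hypothetical, and it assumes an Nx.Serving named WhisperServing is already running (see the setup sketch further down) and that ffmpeg is installed on the server, since Bumblebee's speech-to-text serving can take a {:file, path} input and decode it with ffmpeg:

defmodule AudioWeb.UploadController do
  use AudioWeb, :controller

  # Expects a multipart form upload under the "audio" param,
  # e.g. post "/transcribe", UploadController, :create in the router.
  def create(conn, %{"audio" => %Plug.Upload{path: path}}) do
    # {:file, path} inputs are decoded and resampled by ffmpeg before
    # being run through the model.
    %{results: [%{text: text}]} =
      Nx.Serving.batched_run(WhisperServing, {:file, path})

    json(conn, %{text: text})
  end
end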
Option 2: Media streaming
For this option I'd check out https://membrane.stream/. Basically, multimedia streaming is a pretty complex topic, but handily there's a whole Elixir framework aimed at exactly that.
I'd say the main downside of Membrane is that, because it can do so much, you have to dig through it a bit and figure out which parts are relevant for you and which aren't.
Option 2 looks quite interesting. I had no idea something like Membrane existed. I'm going to look into both options anyway. Thank you very much for taking the time to answer my question.
Channels support binary payloads, so the example you used could absolutely be used to take an encoded segment of audio and do something with it on the server (like transcribing it). If you're using Bumblebee's Whisper model, you'll need to transcode the microphone capture to PCM at the sample rate Whisper expects (16kHz mono floats). You could do this with ffmpeg on the server, but ideally you'd do it on the client so the raw audio chunk is already in the format you need to pass off to Whisper. At a glance, you can do this with the existing media capture primitives and standard JS functions. Then you ship that final ArrayBuffer up the channel and pass it off to Bumblebee.
The above is basically option 1 of what Ben is talking about, which should be fine for transcribing microphone data every X seconds.
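For reference, here's roughly how that WhisperServing could get wired up at application startup. This is a sketch assuming a Bumblebee version around 0.3, where the result shape is %{results: [%{text: ...}]} as in the channel code below; newer Bumblebee releases moved to a slightly different speech_to_text_whisper API, so check the docs for your version:

{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA]
  )

# Then in the application's supervision tree, registered under the
# name the channel code calls:
# {Nx.Serving, serving: serving, name: WhisperServing}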
That's fantastic! I had considered the whole "chop up the front-end audio every few seconds" approach, but I was concerned that it'd cause issues if a chop happened mid-word. Is the buffer overlapping in some way?
defmodule AudioWeb.AudioChannel do
  use AudioWeb, :channel

  def join(_topic, _payload, socket) do
    {:ok, socket}
  end

  # Binary payloads arrive as {:binary, data} tuples. Here data is
  # expected to already be raw 32-bit float PCM (per the transcoding
  # discussion above), so it can go straight into a tensor.
  def handle_in("chunk", {:binary, data}, socket) do
    %{results: [%{text: text}]} =
      Nx.Serving.batched_run(WhisperServing, Nx.from_binary(data, :f32))

    # Send the transcription back down to the client.
    push(socket, "transcription", %{text: text})
    {:noreply, socket}
  end
end