How to stream audio chunks from the browser to the server?

Hi all,

I have the following apps:

  • an application server hosting my Elixir/Phoenix app on Fly.io
  • an AI server hosting OpenAI Whisper Large V3 on a dedicated HuggingFace endpoint

In the app, users can record audio and have it transcribed.

I would like the user’s browser to send chunks of audio data of a configurable duration (e.g. 5 seconds each; later I would like each chunk to overlap the previous one by a configurable duration, e.g. 2 seconds). The idea is that neither server (app nor AI) ever holds the entire audio in memory, and that inference speeds up because smaller audio files are sent.
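To illustrate the slicing I have in mind, here is a rough sketch over raw 16 kHz mono PCM (the names and constants are just for illustration, not from my hook):

const SAMPLE_RATE = 16_000;
const CHUNK_SAMPLES = 5 * SAMPLE_RATE; // 5 s per chunk
const OVERLAP_SAMPLES = 2 * SAMPLE_RATE; // each chunk shares 2 s with the previous one
const STEP = CHUNK_SAMPLES - OVERLAP_SAMPLES;

function* overlappingChunks(pcm /* Float32Array */) {
  for (let start = 0; start < pcm.length; start += STEP) {
    yield pcm.subarray(start, Math.min(start + CHUNK_SAMPLES, pcm.length));
  }
}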

I know that there are some examples of Phoenix LiveView with Bumblebee (see this one), however:

  1. I’m not using Bumblebee
  2. Those examples send the entirety of the data to the AI model

I’ve tried the following in my microphone.js hook:

const SAMPLING_RATE = 16_000;
const CHUNK_DURATION_IN_MS = 5_000;
const CHUNK_OVERLAP_DURATION_IN_MS = 2_000;
const MIME_TYPE = "audio/ogg;codecs=opus";

const MicrophoneV2 = {
  mounted() {
    this.mediaRecorder = null;
    this.recording = false;

    this.el.addEventListener("click", () => {
      if (this.isRecording()) {
        this.stopRecording();
      } else {
        this.startRecording();
      }
    });
  },

  startRecording() {
    this.audioChunks = [];

    navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
      this.mediaRecorder = new MediaRecorder(stream, { mimeType: MIME_TYPE });

      this.mediaRecorder.addEventListener("dataavailable", (event) => {
        if (event.data.size > 0) {
          this.audioChunks.push(event.data);
          this.processChunks();
        }
      });

      this.mediaRecorder.start(CHUNK_DURATION_IN_MS);

      this.updateInterval = setInterval(() => {
        this.mediaRecorder.requestData();
      }, CHUNK_DURATION_IN_MS);
    });
  },

  stopRecording() {
    this.mediaRecorder.addEventListener("stop", () => {
      this.processChunks();
    });

    this.mediaRecorder.stop();
    clearInterval(this.updateInterval);
  },

  processChunks() {
    if (this.audioChunks.length < 1) return;

    const audioBlob = new Blob(this.audioChunks, { type: MIME_TYPE });

    audioBlob
      .arrayBuffer()
      .then((buffer) => {
        const context = new AudioContext({ sampleRate: SAMPLING_RATE });

        context.decodeAudioData(buffer, (audioBuffer) => {
          const pcmBuffer = this.audioBufferToPcm(audioBuffer);
          const converted = this.convertEndianness32(
            pcmBuffer,
            this.getEndianness(),
            this.el.dataset.endianness
          );
          this.upload("audio", [new Blob([converted])]);
        });
      })
      .then(() => {
        this.audioChunks = [];
      })
      .catch((error) => {
        console.error("Error decoding audio data", error);
      });
  },

  isRecording() {
    return this.mediaRecorder && this.mediaRecorder.state === "recording";
  },

  audioBufferToPcm(audioBuffer) {
    const numChannels = audioBuffer.numberOfChannels;
    const length = audioBuffer.length;
    const size = Float32Array.BYTES_PER_ELEMENT * length;
    const buffer = new ArrayBuffer(size);
    const pcmArray = new Float32Array(buffer);
    const channelDataBuffers = Array.from(
      { length: numChannels },
      (x, channel) => audioBuffer.getChannelData(channel)
    );

    // Average all channels upfront, so the PCM is always mono
    for (let i = 0; i < pcmArray.length; i++) {
      let sum = 0;

      for (let channel = 0; channel < numChannels; channel++) {
        sum += channelDataBuffers[channel][i];
      }

      pcmArray[i] = sum / numChannels;
    }

    return buffer;
  },

  convertEndianness32(buffer, from, to) {
    if (from === to) return buffer;

    // If the endianness differs, we swap the bytes of each 32-bit word.
    // (A Uint8Array view is needed; an ArrayBuffer cannot be indexed directly.)
    const bytes = new Uint8Array(buffer);

    for (let i = 0; i < bytes.length; i += 4) {
      const b1 = bytes[i];
      const b2 = bytes[i + 1];

      bytes[i] = bytes[i + 3];
      bytes[i + 1] = bytes[i + 2];
      bytes[i + 2] = b2;
      bytes[i + 3] = b1;
    }

    return buffer;
  },

  getEndianness() {
    const buffer = new ArrayBuffer(2);
    const int16Array = new Uint16Array(buffer);
    const int8Array = new Uint8Array(buffer);

    int16Array[0] = 1;

    if (int8Array[0] === 1) {
      return "little";
    } else {
      return "big";
    }
  },
};

export { MicrophoneV2 };

The problem is in the processChunks method, when I empty the audioChunks like so:

        this.audioChunks = [];

The audioContext.decodeAudioData call doesn’t work on that partial content (seen here); it only works on the entire audio:

Uncaught (in promise) DOMException: The buffer passed to decodeAudioData contains invalid content which cannot be decoded successfully.
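Reduced to a minimal illustration (allChunks is a stand-in for everything recorded since start(); it isn’t a name from my hook):

const context = new AudioContext({ sampleRate: SAMPLING_RATE });

// Decoding everything recorded so far resolves fine...
new Blob(allChunks, { type: MIME_TYPE })
  .arrayBuffer()
  .then((buffer) => context.decodeAudioData(buffer));

// ...but decoding only the most recent chunk rejects with the DOMException above.
new Blob([allChunks[allChunks.length - 1]], { type: MIME_TYPE })
  .arrayBuffer()
  .then((buffer) => context.decodeAudioData(buffer));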

I’ve attempted this, but to no avail, and I would like to avoid complicated binary handling.

How to send a bunch of audio chunks from the browser to the Phoenix server?

I have done this before as a toy project. The trick is that you need to keep the first chunk in memory and then prefix every subsequent chunk with that initial chunk, which contains some metadata about the stream.

I added a bit of functionality which checks for a pause in the speech and then submits that chunk. Otherwise you will get transcription that is cut off mid-sentence, which doesn’t make sense.

I will copy-paste the script when I’m on the computer, and I can answer any questions you have. Warning: it’s somewhat hacky and not production code, but I think it will help you get started.


Here you go (posted from my second account):

hooks.Recorder = {
  async mounted() {
    let firstBlob = null; // contains file header information that we need to store

    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    this.mediaRecorder = new MediaRecorder(stream, {
      mimeType: "audio/ogg; codecs=opus",
      audioBitsPerSecond: 128000,
    });

    // Create an audio context and connect the stream source to an analyzer node
    const audioContext = new AudioContext();
    const source = audioContext.createMediaStreamSource(stream);
    const analyzer = audioContext.createAnalyser();
    source.connect(analyzer);

    const array = new Uint8Array(analyzer.fftSize);

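    // Peak amplitude of the current frame, normalized to roughly 0..1
    // (getByteTimeDomainData returns bytes centered around 128).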
    function getPeakLevel() {
      analyzer.getByteTimeDomainData(array);
      return (
        array.reduce(
          (max, current) => Math.max(max, Math.abs(current - 127)),
          0
        ) / 128
      );
    }

    const reader = new FileReader();
    reader.onloadend = () => {
      const base64String = reader.result.split(",")[1];
      const payload = {
        chunk: base64String,
      };
      this.pushEvent("send_chunk", payload);
    };

    this.mediaRecorder.ondataavailable = (e) => {
      let data = null;
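      // Once the header blob exists, skip chunks under ~25 kB (presumably too little audio to be worth sending).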
      if (firstBlob && e.data.size < 25000) {
        return;
      }
      if (!firstBlob) {
        firstBlob = e.data;
      }
      data = new Blob([firstBlob, e.data], { type: e.type }); // prepend the file information which is stored in the first blob

      reader.readAsDataURL(data);
    };

    let lastTick = 0;
    let now = 0;
    let peakLevels = [];
    const samplingFreq = 80; // ms between peak-level checks
    const samplingWindow = 3; // number of consecutive quiet checks required

    const tick = () => {
      if (!this.mediaRecorder || this.mediaRecorder.state !== "recording") {
        return;
      }
      now = performance.now();

      if (now - lastTick > samplingFreq || !firstBlob) {
        const peakLevel = getPeakLevel();
        peakLevels = [peakLevel, ...peakLevels];
        if (peakLevels.length > samplingWindow) {
          peakLevels.pop();
        }

        if (
          (peakLevels.length === samplingWindow &&
            !peakLevels.find((level) => level > 0.01)) ||
          !firstBlob
        ) {
          // all levels have been below 0.01, so we have a longer silence and can submit the chunk
          this.mediaRecorder.requestData();
          peakLevels = [];
        }

        lastTick = performance.now();
      }
      setTimeout(() => {
        tick();
      }, samplingFreq);
    };

    const startBtn = this.el.querySelector("#start");
    startBtn.addEventListener("click", () => {
      this.mediaRecorder.start();
      now = performance.now();
      lastTick = performance.now();
      // This is a hack to generate the first chunk which can't be empty
      setTimeout(() => {
        tick();
      }, 300);
    });
    const stopBtn = this.el.querySelector("#stop");
    stopBtn.addEventListener("click", () => {
      this.mediaRecorder.stop();
    });
  },
};

As I said, disclaimer, this was just hacked together for a PoC.


Hey many thanks for sharing the code!

Aren’t you just prepending the entire first chunk to all subsequent chunks?

To me it looks like both the header and the audio of the first chunk are prepended to all subsequent chunks.

Or are you extracting the headers somewhere?
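One quick way to check is the blob size (a hypothetical line, not in my hook): real Ogg headers are well under a kilobyte, while a whole first chunk is tens of kilobytes.

console.log("first blob size:", this.firstBlob.size);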

This is what I’ve tried, and I definitely hear the first audio chunk being prepended to all the other chunks:

const SAMPLE_RATE = 16_000;
const CHUNK_DURATION_IN_MS = 5_000;
const CHUNK_OVERLAP_DURATION_IN_MS = 2_000;
const MIME_TYPE = "audio/ogg;codecs=opus";

const MicrophoneV3 = {
  mounted() {
    this.recording = false;
    this.mediaRecorder = null;
    this.firstBlob = null;

    this.el.addEventListener("click", () => {
      if (this.isRecording()) {
        this.stopRecording();
      } else {
        this.startRecording();
      }
    });
  },

  startRecording() {
    this.audioChunks = [];

    navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
      this.mediaRecorder = new MediaRecorder(stream, {
        mimeType: MIME_TYPE,
        audioBitsPerSecond: 128_000,
      });

      this.mediaRecorder.addEventListener("dataavailable", (event) => {
        if (event.data.size > 0) {
          if (this.firstBlob && event.data.size < 25000) {
            return;
          }

          if (!this.firstBlob) {
            this.firstBlob = event.data;
          } else {
            const chunkWithHeader = new Blob([this.firstBlob, event.data], {
              type: event.type,
            });
            this.audioChunks.push(chunkWithHeader);
            this.processChunks();
          }
        }
      });

      this.mediaRecorder.start(CHUNK_DURATION_IN_MS);

      this.updateInterval = setInterval(() => {
        this.mediaRecorder.requestData();
      }, CHUNK_DURATION_IN_MS);
    });
  },

  stopRecording() {
    this.mediaRecorder.addEventListener("stop", () => {
      this.processChunks();
    });

    this.mediaRecorder.stop();
    this.firstBlob = null;
    clearInterval(this.updateInterval);
  },

  processChunks() {
    if (this.audioChunks.length < 1) return;

    const audioBlob = new Blob(this.audioChunks, { type: MIME_TYPE });
    this.upload("audio", [audioBlob]);
    this.audioChunks = [];
  },

  isRecording() {
    return this.mediaRecorder && this.mediaRecorder.state === "recording";
  },

  audioBufferToPcm(audioBuffer) {
    const numChannels = audioBuffer.numberOfChannels;
    const length = audioBuffer.length;
    const size = Float32Array.BYTES_PER_ELEMENT * length;
    const buffer = new ArrayBuffer(size);
    const pcmArray = new Float32Array(buffer);
    const channelDataBuffers = Array.from(
      { length: numChannels },
      (x, channel) => audioBuffer.getChannelData(channel)
    );

    // Average all channels upfront, so the PCM is always mono
    for (let i = 0; i < pcmArray.length; i++) {
      let sum = 0;

      for (let channel = 0; channel < numChannels; channel++) {
        sum += channelDataBuffers[channel][i];
      }

      pcmArray[i] = sum / numChannels;
    }

    return buffer;
  },

  convertEndianness32(buffer, from, to) {
    if (from === to) return buffer;

    // If the endianness differs, we swap the bytes of each 32-bit word.
    // (A Uint8Array view is needed; an ArrayBuffer cannot be indexed directly.)
    const bytes = new Uint8Array(buffer);

    for (let i = 0; i < bytes.length; i += 4) {
      const b1 = bytes[i];
      const b2 = bytes[i + 1];

      bytes[i] = bytes[i + 3];
      bytes[i + 1] = bytes[i + 2];
      bytes[i + 2] = b2;
      bytes[i + 3] = b1;
    }

    return buffer;
  },

  getEndianness() {
    const buffer = new ArrayBuffer(2);
    const int16Array = new Uint16Array(buffer);
    const int8Array = new Uint8Array(buffer);

    int16Array[0] = 1;

    if (int8Array[0] === 1) {
      return "little";
    } else {
      return "big";
    }
  },
};

export { MicrophoneV3 };

Yes, I am, but the first chunk is only 300 ms (you could make it 1 ms), so I didn’t care.

Of course, one could spend more time trying to figure out what actually is in that first chunk, but I couldn’t be bothered in this case.
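If you did want to go down that road, I believe it would look roughly like this (untested sketch, my own helper name): the first blob is a sequence of Ogg pages, and for Opus the first two pages typically carry the OpusHead and OpusTags headers, so you could keep just those pages instead of 300 ms of audio.

async function extractOggHeaderPages(blob) {
  const bytes = new Uint8Array(await blob.arrayBuffer());
  let offset = 0;

  for (let page = 0; page < 2; page++) {
    // Every Ogg page starts with the capture pattern "OggS".
    const magic = String.fromCharCode(...bytes.subarray(offset, offset + 4));
    if (magic !== "OggS") throw new Error("not an Ogg page");

    // Fixed 27-byte header, then the segment table, then the page body.
    const segmentCount = bytes[offset + 26];
    let bodyLength = 0;
    for (let i = 0; i < segmentCount; i++) {
      bodyLength += bytes[offset + 27 + i];
    }
    offset += 27 + segmentCount + bodyLength;
  }

  return blob.slice(0, offset); // header pages only
}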


Okay, nice!
I’ve reduced it to 50 ms, and we no longer hear the first chunk’s audio in subsequent chunks:

const SAMPLE_RATE = 16_000;
const CHUNK_DURATION_IN_MS = 5_000;
const CHUNK_OVERLAP_DURATION_IN_MS = 2_000;
const MIME_TYPE = "audio/ogg;codecs=opus";

const MicrophoneV3 = {
  mounted() {
    this.recording = false;
    this.mediaRecorder = null;
    this.firstBlob = null;

    this.el.addEventListener("click", () => {
      if (this.isRecording()) {
        this.stopRecording();
      } else {
        this.startRecording();
      }
    });
  },

  startRecording() {
    this.audioChunks = [];

    navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
      this.mediaRecorder = new MediaRecorder(stream, {
        mimeType: MIME_TYPE,
        audioBitsPerSecond: 128_000,
      });

      this.mediaRecorder.addEventListener("dataavailable", (event) => {
        if (event.data.size > 0) {
          if (this.firstBlob && event.data.size < 25000) {
            return;
          }

          if (!this.firstBlob) {
            this.firstBlob = event.data;
          } else {
            const chunkWithHeader = new Blob([this.firstBlob, event.data], {
              type: event.type,
            });
            this.audioChunks.push(chunkWithHeader);
            this.processChunks();
          }
        }
      });

      this.mediaRecorder.start(CHUNK_DURATION_IN_MS);

      // Force the first (header-bearing) chunk to be generated early, after 50 ms
      setTimeout(() => {
        this.mediaRecorder.requestData();
      }, 50);

      this.updateInterval = setInterval(() => {
        this.mediaRecorder.requestData();
      }, CHUNK_DURATION_IN_MS);
    });
  },

  stopRecording() {
    this.mediaRecorder.addEventListener("stop", () => {
      this.processChunks();
    });

    this.mediaRecorder.stop();
    this.firstBlob = null;
    clearInterval(this.updateInterval);
  },

  processChunks() {
    if (this.audioChunks.length < 1) return;

    const audioBlob = new Blob(this.audioChunks, { type: MIME_TYPE });
    this.upload("audio", [audioBlob]);
    this.audioChunks = [];
  },

  isRecording() {
    return this.mediaRecorder && this.mediaRecorder.state === "recording";
  },

  audioBufferToPcm(audioBuffer) {
    const numChannels = audioBuffer.numberOfChannels;
    const length = audioBuffer.length;
    const size = Float32Array.BYTES_PER_ELEMENT * length;
    const buffer = new ArrayBuffer(size);
    const pcmArray = new Float32Array(buffer);
    const channelDataBuffers = Array.from(
      { length: numChannels },
      (x, channel) => audioBuffer.getChannelData(channel)
    );

    // Average all channels upfront, so the PCM is always mono
    for (let i = 0; i < pcmArray.length; i++) {
      let sum = 0;

      for (let channel = 0; channel < numChannels; channel++) {
        sum += channelDataBuffers[channel][i];
      }

      pcmArray[i] = sum / numChannels;
    }

    return buffer;
  },

  convertEndianness32(buffer, from, to) {
    if (from === to) return buffer;

    // If the endianness differs, we swap the bytes of each 32-bit word.
    // (A Uint8Array view is needed; an ArrayBuffer cannot be indexed directly.)
    const bytes = new Uint8Array(buffer);

    for (let i = 0; i < bytes.length; i += 4) {
      const b1 = bytes[i];
      const b2 = bytes[i + 1];

      bytes[i] = bytes[i + 3];
      bytes[i + 1] = bytes[i + 2];
      bytes[i + 2] = b2;
      bytes[i + 3] = b1;
    }

    return buffer;
  },

  getEndianness() {
    const buffer = new ArrayBuffer(2);
    const int16Array = new Uint16Array(buffer);
    const int8Array = new Uint8Array(buffer);

    int16Array[0] = 1;

    if (int8Array[0] === 1) {
      return "little";
    } else {
      return "big";
    }
  },
};

export { MicrophoneV3 };

Do you happen to know what would be a production equivalent of that hack :sweat_smile:?

It seems so strange to me that streaming audio chunks from the browser to the server would be such a niche use case that it requires such a hack.

I put this together because I wanted to have live transcriptions (generated by sending those audio chunks to Azure) that are streamed across websockets to visitors of a LiveView page.

As such, I wasn’t interested in the JS part of this application, only the LiveView part.

I guess you could convert the blob into a byte array and then start poking around in it? I really have no clue about media stuff at all, sorry.