I have a service that accepts voice commands. The commands are recorded in the browser and sent to my Phoenix application.
Once I have the file, I call the OpenAI Whisper API to get the transcription.
This works, but it introduces a couple of seconds of lag that I could avoid if I were smarter about uploading files.
Basically, instead of waiting for Plug.Parsers.MULTIPART to fully process the uploaded file and save it to a temporary file, I would love to start uploading the received file to the Whisper API as soon as its first bytes hit the server.
Can anyone point me in the right direction?
So far I have come up with the following options:

1. Disable Plug.Parsers.MULTIPART altogether for my specific request path, and instead parse the request in the Phoenix controller, uploading the file in chunks as I read it from the input stream (a rough sketch follows the list)
2. Write a custom Cowboy handler to do the same as above (this would tie me to Cowboy)
3. Upload the file over a Phoenix.Channel instead
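For reference, here's roughly what I imagine option 1 could look like, using `Plug.Conn.read_part_headers/2` and `Plug.Conn.read_part_body/2` to walk the multipart body without buffering it to disk first. The `WhisperClient` calls are hypothetical placeholders for whatever HTTP client would stream the bytes on to the Whisper API, and this assumes Plug.Parsers has been configured to leave this path's body unread:

```elixir
defmodule MyAppWeb.VoiceController do
  use MyAppWeb, :controller

  def create(conn, _params) do
    upload = WhisperClient.start_upload() # hypothetical
    conn = stream_parts(conn, upload)
    {:ok, text} = WhisperClient.finish(upload) # hypothetical
    json(conn, %{text: text})
  end

  # Walk the multipart parts as they arrive on the socket.
  defp stream_parts(conn, upload) do
    case Plug.Conn.read_part_headers(conn) do
      {:ok, _headers, conn} -> stream_body(conn, upload)
      {:done, conn} -> conn
    end
  end

  # Forward each body chunk upstream as soon as it is read,
  # instead of letting Plug buffer the whole file to a temp file.
  defp stream_body(conn, upload) do
    case Plug.Conn.read_part_body(conn, length: 64_000) do
      {:more, chunk, conn} ->
        WhisperClient.send_chunk(upload, chunk) # hypothetical
        stream_body(conn, upload)

      {:ok, chunk, conn} ->
        WhisperClient.send_chunk(upload, chunk) # hypothetical
        stream_parts(conn, upload)

      {:done, conn} ->
        stream_parts(conn, upload)
    end
  end
end
```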
Any other options? Maybe there's a ready-to-use solution or blog post that my google-fu fails to find?
The first option will work (I worked on some code recently that replaces Plug.Parsers with a custom implementation that uploads to S3 on the fly, effectively a "streaming upload proxy"). You'll probably still need to parse the multipart request yourself, so you'll likely end up copy-pasting a lot from Plug.Parsers.MULTIPART.
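In case it helps, the general shape of that kind of replacement is a module implementing the Plug.Parsers behaviour, listed in the `:parsers` option instead of `:multipart`. All the interesting work happens in `parse/5`; the forwarding sink below is a hypothetical placeholder, and the actual reading loop would be the same `read_part_headers`/`read_part_body` dance as in the controller sketch above:

```elixir
defmodule MyApp.StreamingMultipart do
  @behaviour Plug.Parsers

  @impl true
  def init(opts), do: opts

  # Claim multipart/form-data requests and stream them upstream
  # instead of buffering them into a Plug.Upload temp file.
  @impl true
  def parse(conn, "multipart", "form-data", _params, _opts) do
    conn = MyApp.StreamingSink.forward_parts(conn) # hypothetical
    {:ok, %{}, conn}
  end

  # Any other content type falls through to the next parser.
  def parse(conn, _type, _subtype, _params, _opts), do: {:next, conn}
end
```

Then in the endpoint, something like `plug Plug.Parsers, parsers: [MyApp.StreamingMultipart, :urlencoded, :json]`.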
As for the channel approach: I don't know much about it, but it could work, and it could be simpler (i.e. if you can just copy from socket to socket). If you do something like this, I'd love to see some code.
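Not battle-tested, but a channel version could look roughly like this. Phoenix channels accept binary payloads as `{:binary, data}` tuples, so the browser can push ArrayBuffer chunks straight from MediaRecorder; the `WhisperClient` calls are again hypothetical stand-ins for a streaming HTTP client:

```elixir
defmodule MyAppWeb.VoiceChannel do
  use Phoenix.Channel

  def join("voice:" <> _id, _payload, socket) do
    # Hypothetical: open a streaming upload towards the Whisper API.
    {:ok, assign(socket, :upload, WhisperClient.start_upload())}
  end

  # Binary frames from the client arrive as {:binary, data} payloads.
  def handle_in("chunk", {:binary, data}, socket) do
    WhisperClient.send_chunk(socket.assigns.upload, data) # hypothetical
    {:noreply, socket}
  end

  def handle_in("done", _payload, socket) do
    {:ok, text} = WhisperClient.finish(socket.assigns.upload) # hypothetical
    {:reply, {:ok, %{text: text}}, socket}
  end
end
```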
Hugging Face has a streaming interface that was working at one point. I remember someone working with whisper.cpp who mentioned streaming audio in 1-second chunks, though up to 10 seconds would probably work.
This seems easier with Bumblebee, though, because you could stream chunks in and keep some overlap as a buffer, then compare the end of one chunk to the beginning of the next and trim the duplicated words. A solution for that most likely already exists in the other implementations and could be ported.
You also wouldn't necessarily need to stream with the OpenAI API: you could send smaller, complete PCM chunks and stitch the transcriptions together.
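The trimming itself can be fairly naive and still work reasonably well. A toy sketch of the idea, comparing the tail of one chunk's transcript with the head of the next at the word level (assuming the overlap is at most 10 words):

```elixir
defmodule Stitch do
  @max_overlap 10

  # Merge two overlapping transcripts by dropping the words at the
  # head of `next` that duplicate the tail of `previous`.
  def merge(previous, next) do
    prev_words = String.split(previous)
    next_words = String.split(next)

    # Find the longest run of words shared between the tail of the
    # previous transcript and the head of the next one (0 if none).
    overlap =
      Enum.find(@max_overlap..1//-1, 0, fn n ->
        Enum.take(next_words, n) == Enum.take(prev_words, -n)
      end)

    Enum.join(prev_words ++ Enum.drop(next_words, overlap), " ")
  end
end

# Stitch.merge("turn on the kitchen lights", "kitchen lights and the radio")
# #=> "turn on the kitchen lights and the radio"
```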