I have a service that accepts voice commands. The commands are recorded in the browser and sent to my Phoenix application.
Once I have the file, I call the OpenAI Whisper API to get the transcription.
This works, but it introduces a couple of seconds of lag that I could avoid if I were smarter about uploading files.
Basically, instead of waiting for Plug.Parsers.MULTIPART to fully process the uploaded file and save it to a temporary file, I would love to start uploading the received file to the Whisper API as soon as its first bytes hit the server.
Can anyone point me in the right direction?
So far I have come up with the following options:

1. Disable Plug.Parsers.MULTIPART altogether for my specific request path, and instead parse the request in the Phoenix controller, uploading the file in chunks as I read it from the input stream (a rough sketch follows the list)
2. Write a custom Cowboy handler to do the same as above (this would tie me to Cowboy)
3. Upload the file over a Phoenix.Channel instead
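For reference, here's roughly what I imagine option 1 could look like, using `Plug.Conn.read_part_headers/2` and `Plug.Conn.read_part_body/2` to walk the multipart body without buffering it to disk first. The `WhisperClient` calls are hypothetical placeholders for whatever HTTP client would stream the bytes on to the Whisper API, and this assumes Plug.Parsers has been configured to leave this path's body unread:

```elixir
defmodule MyAppWeb.VoiceController do
  use MyAppWeb, :controller

  def create(conn, _params) do
    upload = WhisperClient.start_upload() # hypothetical
    conn = stream_parts(conn, upload)
    {:ok, text} = WhisperClient.finish(upload) # hypothetical
    json(conn, %{text: text})
  end

  # Walk the multipart parts as they arrive on the socket.
  defp stream_parts(conn, upload) do
    case Plug.Conn.read_part_headers(conn) do
      {:ok, _headers, conn} -> stream_body(conn, upload)
      {:done, conn} -> conn
    end
  end

  # Forward each body chunk upstream as soon as it is read,
  # instead of letting Plug buffer the whole file to a temp file.
  defp stream_body(conn, upload) do
    case Plug.Conn.read_part_body(conn, length: 64_000) do
      {:more, chunk, conn} ->
        WhisperClient.send_chunk(upload, chunk) # hypothetical
        stream_body(conn, upload)

      {:ok, chunk, conn} ->
        WhisperClient.send_chunk(upload, chunk) # hypothetical
        stream_parts(conn, upload)

      {:done, conn} ->
        stream_parts(conn, upload)
    end
  end
end
```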
Any other options? Maybe there's a ready-to-use solution or blog post that my google-fu fails to find?
The first option will work (I worked on some code recently that replaces Plug.Parsers with a custom implementation that uploads to S3 on the fly, effectively a "streaming upload proxy"). You'll probably still need to parse the multipart request yourself, so you'll likely end up copy-pasting a lot from Plug.Parsers.MULTIPART.
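In case it helps, the general shape of that kind of replacement is a module implementing the Plug.Parsers behaviour, listed in the `:parsers` option instead of `:multipart`. All the interesting work happens in `parse/5`; the forwarding sink below is a hypothetical placeholder, and the actual reading loop would be the same `read_part_headers`/`read_part_body` dance as in the controller sketch above:

```elixir
defmodule MyApp.StreamingMultipart do
  @behaviour Plug.Parsers

  @impl true
  def init(opts), do: opts

  # Claim multipart/form-data requests and stream them upstream
  # instead of buffering them into a Plug.Upload temp file.
  @impl true
  def parse(conn, "multipart", "form-data", _params, _opts) do
    conn = MyApp.StreamingSink.forward_parts(conn) # hypothetical
    {:ok, %{}, conn}
  end

  # Any other content type falls through to the next parser.
  def parse(conn, _type, _subtype, _params, _opts), do: {:next, conn}
end
```

Then in the endpoint, something like `plug Plug.Parsers, parsers: [MyApp.StreamingMultipart, :urlencoded, :json]`.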
As for the channel approach: I don't know much about it, but it could work, and it could be simpler (i.e. if you can just copy from socket to socket). If you do something like this, I'd love to see some code.
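Not battle-tested, but a channel version could look roughly like this. Phoenix channels accept binary payloads as `{:binary, data}` tuples, so the browser can push ArrayBuffer chunks straight from MediaRecorder; the `WhisperClient` calls are again hypothetical stand-ins for a streaming HTTP client:

```elixir
defmodule MyAppWeb.VoiceChannel do
  use Phoenix.Channel

  def join("voice:" <> _id, _payload, socket) do
    # Hypothetical: open a streaming upload towards the Whisper API.
    {:ok, assign(socket, :upload, WhisperClient.start_upload())}
  end

  # Binary frames from the client arrive as {:binary, data} payloads.
  def handle_in("chunk", {:binary, data}, socket) do
    WhisperClient.send_chunk(socket.assigns.upload, data) # hypothetical
    {:noreply, socket}
  end

  def handle_in("done", _payload, socket) do
    {:ok, text} = WhisperClient.finish(socket.assigns.upload) # hypothetical
    {:reply, {:ok, %{text: text}}, socket}
  end
end
```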
Hugging Face has a streaming interface that was working at one point. I remember someone working with whisper.cpp who mentioned streaming audio in 1-second chunks, though up to 10 seconds would probably work.
This seems easier with Bumblebee, though, because you could stream chunks in and keep some overlap as a buffer, then compare the end of one chunk to the beginning of the next and trim the duplicated words. A solution for that most likely already exists in the other implementations and could be ported.
You also wouldn't necessarily need to stream with the OpenAI API: you could send smaller, complete PCM chunks and stitch the transcriptions together.
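The trimming itself can be fairly naive and still work reasonably well. A toy sketch of the idea, comparing the tail of one chunk's transcript with the head of the next at the word level (assuming the overlap is at most 10 words):

```elixir
defmodule Stitch do
  @max_overlap 10

  # Merge two overlapping transcripts by dropping the words at the
  # head of `next` that duplicate the tail of `previous`.
  def merge(previous, next) do
    prev_words = String.split(previous)
    next_words = String.split(next)

    # Find the longest run of words shared between the tail of the
    # previous transcript and the head of the next one (0 if none).
    overlap =
      Enum.find(@max_overlap..1//-1, 0, fn n ->
        Enum.take(next_words, n) == Enum.take(prev_words, -n)
      end)

    Enum.join(prev_words ++ Enum.drop(next_words, overlap), " ")
  end
end

# Stitch.merge("turn on the kitchen lights", "kitchen lights and the radio")
# #=> "turn on the kitchen lights and the radio"
```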