Speech to text - are there any tools that can convert an audio source to text?

I’m currently building https://readable.fm in Elixir.

It’s a product based on speech-to-text, but it leverages an external speech-to-text API.
I tried Whisper, which is quite good but doesn’t offer built-in speaker diarization, which I absolutely need.

We can have a chat if you want to know more about the different APIs (I tested 5 of them).

Definitely, send me a private message, I guess?

Hello Christian, I’m curious about your findings as well. My gist so far:

  • Whisper is great at speech recognition but has timing issues, and I’m not yet sure how to deal with silences in speech (for example, I want to hide subs when nobody is talking… maybe run VAD (Voice Activity Detection) on top? Dunno :person_shrugging:). I ran my own instance and also tried it on replicate.com, which is pretty fast (a 1h file in 90 secs, roughly 40x real time).
  • Sonix feels like the GOAT to me. It’s about as good as Whisper at speech recognition, but has way better timings, speaker diarization, etc. BUT: expensive! :slight_smile:
  • rev.ai is what I’m currently using for my Video CMS. It’s actually pretty good and the timings are very accurate. It’s much cheaper than Sonix and I feel it gives me the best bang for the buck right now (I’m currently torn between Sonix and rev.ai :slight_smile:).
  • AWS Transcribe feels “ok”. Timings are good; speech recognition in my examples is maybe a little less accurate than rev.ai.
  • Fireflies is a cool tool, but I haven’t played with it much yet. I think they might use Whisper under the hood: there are a bunch of weird words that Whisper caught successfully (none of the others did), and Fireflies caught them too. They also have timing issues.
  • Deepgram feels very similar to what AWS and rev.ai gave me on my first few tries. It’s pretty cheap though and has lots of features, so I will play with it some more. I give them props for good DX as well.
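
On the "hide subs when nobody is talking" question from the Whisper bullet, here is a rough sketch of the VAD-on-top idea, not something I have run in production. It assumes mono audio already decoded to floats in [-1, 1] and uses a naive RMS energy threshold as a stand-in for a real VAD (something like WebRTC VAD or Silero VAD would likely be more robust). The function names `speech_intervals` and `clip_cues`, and the default threshold values, are made up for illustration.

```python
import math


def speech_intervals(samples, sample_rate, frame_ms=30, threshold=0.02, min_gap_s=0.3):
    """Return (start_s, end_s) spans where the per-frame RMS exceeds `threshold`.

    Naive energy-based detection: an assumption for this sketch, not a real VAD model.
    Spans separated by at most `min_gap_s` seconds are merged into one.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    intervals = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms < threshold:
            continue  # treat this frame as silence
        start, end = i / sample_rate, (i + frame_len) / sample_rate
        if intervals and start - intervals[-1][1] <= min_gap_s:
            intervals[-1] = (intervals[-1][0], end)  # extend the previous span
        else:
            intervals.append((start, end))
    return intervals


def clip_cues(cues, intervals):
    """Trim each (start_s, end_s, text) cue to the speech it overlaps.

    Cues that overlap no detected speech are dropped entirely.
    """
    clipped = []
    for start, end, text in cues:
        overlaps = [(max(start, s), min(end, e))
                    for s, e in intervals if s < end and e > start]
        if overlaps:
            clipped.append((overlaps[0][0], overlaps[-1][1], text))
    return clipped
```

The cues would come from whatever segment timestamps the transcription service returns. Trimming them to detected speech only hides subtitles during silence; it does not fix drift inside a spoken segment.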

Anyway, that’s the quick version from me. Would love a DM from you to hear what your experiences are. Thank you! :slightly_smiling_face: