Speech to text - are there any tools that can convert an audio source to text?

Hello everyone,

I would like to search/index videos by keywords.

The first step is to transcribe the audio to text. I know of some cloud providers that do speech-to-text, but before going that route, I would like to understand what I would need to roll my own.

I know the opposite, text-to-speech, is quite easy with some open source tools.

Have you tackled a similar task of converting an audio source to text? If so, do you know of open source libraries that could help me achieve this?

Thanks for taking the time.

https://github.com/mozilla/DeepSpeech is rather easy to use, but it’s very slow, at least on CPU: transcribing a two-second audio clip of me saying “hello, hello” takes 10 seconds. It’s quite accurate, though.

I also had to use ffmpeg to transcode the recording into a format deepspeech understands:

ffmpeg -i New\ Recording\ 3.m4a -acodec pcm_s16le -ac 1 -ar 16000 audio.wav
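
For anyone scripting this step, here is a minimal Python sketch of the same conversion. It assumes `ffmpeg` is on the PATH; the function names are made up for illustration. Passing the arguments as a list avoids having to escape the spaces in the file name, and the header check simply confirms the output matches the flags used above.

```python
import subprocess
import wave


def ffmpeg_to_wav_command(src, dst, sample_rate=16000):
    """Build the ffmpeg argument list for mono 16-bit PCM WAV output."""
    return [
        "ffmpeg",
        "-i", src,                # input file in any format ffmpeg can decode
        "-acodec", "pcm_s16le",   # 16-bit signed little-endian PCM
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the target rate
        dst,
    ]


def transcode(src, dst):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(ffmpeg_to_wav_command(src, dst), check=True)


def is_mono_16bit_16khz(path):
    """Check a WAV header for 1 channel, 2-byte samples and 16000 Hz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1
            and wav.getsampwidth() == 2
            and wav.getframerate() == 16000
        )
```

Usage would be `transcode("New Recording 3.m4a", "audio.wav")` followed by `is_mono_16bit_16khz("audio.wav")`.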

Thank you for the answer. I was also looking at this because there are Rust bindings available.

Processing time is not that important, because the number of videos is not that big, and I will probably set up a dedicated server for this.

Update: Nice ffmpeg command line :slight_smile:

For those interested in the subject, I did some basic tests with https://github.com/mozilla/DeepSpeech

$ deepspeech --model deepspeech-0.5.1-models/output_graph.pbmm --alphabet deepspeech-0.5.1-models/alphabet.txt --lm deepspeech-0.5.1-models/lm.binary --trie deepspeech-0.5.1-models/trie --audio audio/8455-210777-0068.wav 
Loading model from file deepspeech-0.5.1-models/output_graph.pbmm
TensorFlow: v1.13.1-10-g3e0cc5374d
DeepSpeech: v0.5.1-0-g4b29b78
2019-09-27 01:53:25.693154: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
2019-09-27 01:53:25.693187: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-09-27 01:53:25.693200: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-09-27 01:53:25.693211: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
Loaded model in 0.0174s.
Loading language model from files deepspeech-0.5.1-models/lm.binary deepspeech-0.5.1-models/trie
Loaded language model in 1.38s.
Running inference.
your power is sufficient i said
Inference took 1.562s for 2.590s audio file.

It looks promising and is even faster than expected. And with a compatible card, it is possible to run on GPU :slight_smile:

Thanks again @idi527 for the link. Now I need to glue ffmpeg and deepspeech with Elixir to have the beginning of a working solution.

I had a similar idea around 2 years ago. I used Google for Speech to Text but it was expensive AF. Fortunately they give you some credit when you create a new account.

Google Speech to text was okay’ish for English but the results weren’t something I would just use for anything serious.

How does DeepSpeech perform with things like YouTube videos or screencasts? What about languages other than English?

The system is not in the cloud; it runs on your machine. There are language packs, but I only tested English. I tested it on https://www.youtube.com/watch?v=nNtsBvMPpTk, a video of about 20 minutes. Here is how long it took:

Inference took 678.371s for 1210.247s audio file.
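
To compare runs of different lengths, it helps to divide processing time by audio duration. A small Python sketch that does this for the two timing lines quoted in this thread (the regex assumes the exact wording DeepSpeech printed in those runs):

```python
import re

# Matches the summary line DeepSpeech printed in the runs quoted in this thread.
TIMING = re.compile(r"Inference took ([\d.]+)s for ([\d.]+)s audio file\.")


def real_time_factor(log_line):
    """Return processing time divided by audio duration (lower is faster)."""
    match = TIMING.search(log_line)
    if match is None:
        raise ValueError(f"no timing found in: {log_line!r}")
    took, duration = (float(group) for group in match.groups())
    return took / duration


short_clip = real_time_factor("Inference took 1.562s for 2.590s audio file.")
long_video = real_time_factor("Inference took 678.371s for 1210.247s audio file.")
print(f"short clip: {short_clip:.2f}, long video: {long_video:.2f}")
# prints "short clip: 0.60, long video: 0.56"
```

Both runs come out below 1.0, meaning faster than real time on this machine.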

And here is the raw result. Some parts are OK, and some are really funny mistranscriptions. It is far from perfect, but I don’t need an exact transcript; I only want to extract keywords.

i itinerations selwood ah so thanks for being here and then you be your way it’s a pleasure wait things for canning and here i am going to explain how you can make a a siphoning to your bower to make old you and be the cause and there we are going to use her wabisane is wave after askant different libraries one its separative and when it is done stately i my son a fortiori and i may sinuatin at night a is that these italian of insukameni here a some of my contacts okay first of all a less aspiciatur these are as he was so jess it in a valley a communication between routes in a real time we to be a fashion and that is not a single technology but that a asset of diffeeclety ah it is a composed by its fondamental itty eyes there the the the targets media and deceive percolation and appetites a channel the first is used to accusor millerites the second dose somerton such as the teaneans and the ford a do the real time communication the totersten a terrifiest but he is not important for this too so a samaritan produces a use the sun for thee that a plan my ion and other for the signal and in particular we we need the sun signals to negate in the stablish in a session before the deserters nation in the most use prouvois fails is the sibolic can achieve the security to have in creation in misjoined using the circumvention of the protege as he pissed to describe the session before the literation and other pro the case are used to overcome some at war problems related to the note for example stanistreet your babuyanes outside your look arranata and there a detinet me relay of voltaic so it evie iobant okay this is the over architecture of the system have and the wickens from this picture of the tie that saharaman components and the day to most important at the webellion and the serarius the when interested only on the well of vacation because we want to make a siphon into the bagatto make or demedicus um okay coulacanara from the mental part in the odious saxon are the most 
he used the air of poenitentiae and he in the vitiation are the h two six four and bitias dinies not complete supported by the roses and a v one seems to be the future we had never deserve made a furtiveness and netty i on top of istates a web of vacation with the intuited were but the upon to meet the cold during the years we have made the two different implementation in production on some thousands of cashmerian now we are going to see the incleenations that the uses josephine severitie a sea jovispeter and there he is being their first implementation of an ideal five six client it was presented a guariento thousand twelve and the correctory stick or that there is one under present for goose it realizes immediate dark of course was on what but the site it uses a serocold socket for the signaling and he with it you can make a devious but not only you can rely all or any somesings christain and son and it to port but a destinations okay this is the over alkite tore her that we have views the air arising the siphon with epimenides the stem as i flint realized this but as they be and their six dark that we are previously seen and there they be be the tradition for example there what the which all do caterer will be used and there sepper air lies their all for the fistic that he communicates over what socket transport and the after the session has been established all the mediately terranean is exchanged using the ah some posicles and using warburtons all of this components communicate with advice to the x okanagon to see how to clement a saponified with her only for that so cristofali had tending in this case are the ree the goitered casement and we could they in it metal then we start as i start at this pawn we can register their full extension to the sea pieces pulling the necessity eye and at the end we can make a analogical okay some are cold let it emotive first of all of course is tinoset of the australia vary only one side that we niaiser engine so we call the 
day in it function passing to it the two parameters to coldback on to manage the success of the oration and one for the fall and the if it go if all goes well we are going to create the sictachotes are six stack and there the parameters need it are the several that can be on the clown and you caroline for example a cloud on or a also imprimis then we pass the septanti you arietis composedly the prosopis of pretension in tipi and the severe address the pastoral retention potention to it that because we can find the sun cured that try to register expansion with such a pot passwords and the maker from cause to something in amber so it’s very important to to icare fully the display a is the name that he is being shown in the distaste for win in conical arise and there there were so cataracoui rincon taine say cashibos etenim that there obviously has to be reachable by the kind and the wet socket secure protocol so or the traffic is in created we can attach a event listener to manage it or the relatives and that could distort on the desgrais six super stock a disponere registered the fornication oca calling danusia if the eye on the jestress and passing three the word register and other events listener to manage their little events and there a defined the register findon charities at this point if were always well we can make a new focalpoint then you secondhand on on the descritta in passing to it some parameters that they are only the age the amerement or all uses weland at and that then which can make new front that it okay during the years he were the side to recipient of the capon a using a genus gateway okay we’re going to see this implementation johns is a general pulsate bit by medical company there in the essay we vaticinate with okay it uses ter jest as on our farther mac port messages exchange its ancient is made by flagging we have veway the sepoy but other progenies of course it her provide some interfaces for communication and other topolony to ring the 
artery very important in production ah this picture a wean see the differences from a dead oceanlike because previously superman five communicated therectly with after it but now the jonstone communicate wiv ganosgwah component that is a becancourt which in time communicates with asissi the supralocal and old of this traffic is precipat by apache what seer and potaties in created of course there avanother is that the now with onward any more about the change in the canaries from at the developments devolving gold there were battistino edge because gandagaro with hockey now we are going to see how to prevent a syphon using jones so arsenicated for steps are needed then salivation of the engine now they dosage is called jonas we create the session at tacherai on esteban’s use but the other begins and a den we can make a new cold creating a you for okay let’s go deeper to see more details things of the library we need only to avarice one is done of course and another one is there in a doctor that resolves all abate sit differences between different browses when he seizeth engine cording the hint mantel passing to it there that the bodleian and the code faction and then we create a new section uh not is dead there now the servares uses the standard port so you do need to open another part of the same or using withstand us and there we pass other colleges parameters and at this point we can assess the splain to have a plain and there to make a new fool then a deliathis seccatore coudonagny confusion so at their athebeck as passed the parameters and that they and before was well we can make a new funk invoking the great of her function on the jesuited legion spanner okay and there we can specify the cold the destination of the cold that it okay now some song few words about the seruices because in this case her or another for aminabad two to make the old components faction in cap in this to aview is the ne sera voters and that is based on the asterisk free beetroot and 
neuraline disturbation the is recompensed abelle on beta it has also a community of tetegisti people that you can find it there on the committee the toll you can participate in the project in whatever manner you want and am henterprise version of the artifice that which bore system is called nat voice and the natty i am to poetise well ication we were weary upon us to give you a an environment to start playing with weares i crather viterbo machine that you can easily run a using began i i gave you a lancet the inter next the liver okay to to no how you can easily and your private by exeter okay now i would like to show you a opened over an age theomanie safe reclines at two to water on the stand was reason now came so to recover a testator needed we are going to run a seapiece and for this you can and you you have to done all only a tortion planetography storekeeping to it around by grandfather point you have your saratogy configured and we are going to open the weeniest the ural then we can in self some devout values okay the you can find on petate paramitas are the everards the distraining extension dentify and the pastor cat then i’m no i’m going to logging to lick their looking battono register the fore extension into the several the axes and that this point we can make a from coofs okay licking their colbaton and there we we will see um eremos and local betooke all ise on my lockman of course for simply tioka thesis one soon okay averred had ethereality on privateerly look and machine so esperie the severites that these planeten on and pass for so in the session in going to insert their extension to hundred and look into the severe at the point of one tension is reduced to a incompletely working in another section i installed the exchange to under grant and look in and then this point i can make any cold to their extension to hundred o de envyde okay he ritentando about them and i have anodic obviously detectives there are remote the part and other army 
lockport is the same image of course because alisonae cat okay come back to potentator a fuselike okay this lie is to show you hardly do to understand that there we were entetee you can make a real producer for your castlemorris is the natty i made by nattiest make you all to be to call that we were whereabutes it there absalaam layer of aristoteles that is independent from the specific version of the asterisk okay as you has i promise you desire to linger the first they contain all their related in portarlington of this to and the second is very veriter resting where but the igoma there by taulatin okay is very teresting and there if you want that to discuss something about about the seas on the velamen of material they were everything else did i a desire or air some of my conduct and the i boielle i will be very happy to talk with you thank you we do have time for some questions i deleisonon has a question i heard proletarian but er a deer jonah is not to use in his or corner of turn is a correct yet in no say can you as a high the stanovoi the in de panton are your neck or to pology for example in this case i have er my look of machines so i confluent of arad in and in a note in an utmost the if you if you are connected to a internet e you are behind and not for example you consequens differently and there it a canneto one stone to have your public piadasi you look ordinato make a pinion ah of course in some cases you can have some problems boca in the air depends on your necropoli and you can use the sitimela okay to relay or de meditatin of course in this case the time server or requires a lot of banville because i relatittey or the or end every your verrion for bandit with the tragic key can the you can ever differentations of janesseron the seemelie the project now who estotiland or prophet or how it maintained the because it there was at by the unitarian that is a french company then he also proved from what it is in there about to tear twelve even of very 
well that ease of where what is there mountainous the is present or negabit i think it that i think some wood of arpers works on it sir i don’t avery well but he probably also to bind the worshippers
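
Since the goal is keywords rather than a readable transcript, a first pass can be as simple as counting frequent words. The Python sketch below is only an illustration: the stop-word list, the minimum word length and the function name are all assumptions, and a real index would need a fuller stop-word list and probably a vocabulary filter to discard the invented words seen above.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a complete one).
STOP_WORDS = {
    "that", "this", "with", "have", "from", "they", "there", "then",
    "okay", "some", "were", "going", "make", "your", "here", "their",
    "other", "which", "will",
}


def extract_keywords(transcript, top_n=10, min_length=4):
    """Return the most frequent words that are long enough and not stop words."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(
        word
        for word in words
        if len(word) >= min_length and word not in STOP_WORDS
    )
    return [word for word, _ in counts.most_common(top_n)]
```

Calling `extract_keywords(raw_text, top_n=20)` on the output above would give a list of candidate keywords to review by hand.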

God damn that’s awful! :scream::face_with_hand_over_mouth::003:

Pretty interesting though. I’m subscribing to this thread in case you update it in the future.

I have a hard time understanding how it came up with such a transcription… but it can only get better.

I recommend an MP3-to-text converter. It also supports other audio and video formats, including WAV, OGG, WMA, M4A, and MP4.

Reviving this thread to see if there are any recent updates in this space. Anybody know of any solutions that run on your local server? Perhaps an integration with Bumblebee?

@maz OpenAI’s Whisper works great and it is open source (python):

Wow that looks very good. I’ll look into it, thanks.

Whisper is really good.

As DeepSpeech is not maintained anymore, I switched to Whisper…

And now, I can pass the language in which I want the text to be translated :slight_smile:
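
For reference, a minimal Python sketch of calling Whisper with a language, assuming the `openai-whisper` package and ffmpeg are installed. The helper names and the `"base"` model choice are illustrative. As far as I know, `language` names the language spoken in the audio, and `task="translate"` produces English text rather than a transcript in the original language.

```python
def transcribe_options(language=None, translate=False):
    """Build keyword arguments for Whisper's transcribe()."""
    options = {"task": "translate" if translate else "transcribe"}
    if language is not None:
        options["language"] = language  # spoken language of the audio, e.g. "fr"
    return options


def transcribe(path, model_name="base", language=None, translate=False):
    """Transcribe one file; needs `pip install openai-whisper` and ffmpeg."""
    import whisper  # imported lazily so the helper above works without it

    model = whisper.load_model(model_name)
    return model.transcribe(path, **transcribe_options(language, translate))
```

For example, `transcribe("audio.wav", language="fr")["text"]` should return the French transcript as a single string.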

How was the integration? Was it just a matter of compiling whisper into a binary and using System.cmd()? Or do you have to make sure that the python interpreter is fully installed on your deployment?

I am using erlport to communicate with Python scripts.
I added poolboy to be able to process multiple files concurrently, limited to 10 workers.
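
The bounded-pool idea can be illustrated in Python as well. This is an analogue for readers who stay on the Python side, not the poolboy configuration used here: a thread pool is just the simplest stand-in, and whether threads or separate processes suit a CPU-bound model is left open. `process_files` and its `transcribe` argument are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 10  # mirrors the 10-worker limit mentioned above


def process_files(paths, transcribe, max_workers=MAX_WORKERS):
    """Run transcribe(path) for every path, at most max_workers at a time.

    Results come back in the same order as paths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcribe, paths))
```

Any callable that takes a path works as `transcribe`, so the pool can be tried out with a dummy function before plugging in the real model.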

Here is an example for the sample file, in French…

iex(9)> WhisperStt.speech_to_text path, "fr"
{:ok,
 %{
   language: "fr",
   segments: [
     %{end: 1.0, start: 0.0, text: "C'est le machine-machin."},
     %{
       end: 3.0,
       start: 1.0,
       text: "La machine-machin est la plus mignonne de l'automobile."
     },
     %{end: 4.0, start: 3.0, text: "Il a des détails très tristiques."},
     %{
       end: 5.0,
       start: 4.0,
       text: "Le tristique, la position, le pain, le poids,"
     },
     %{end: 6.0, start: 5.0, text: "plus un mètre incroyable."},
     %{end: 7.0, start: 6.0, text: "Le mètre machine, le poids, le place,"},
...

I have Python installed on the server, and I have a requirements.txt for the plugin.
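
The Python side of such a setup could look roughly like the sketch below. It is a guess at the shape, not the actual script: the function names, the `"base"` model and the bytes handling are assumptions. Whisper's result dict does contain `language` and a list of `segments` with `start`, `end` and `text`, which matches the output shown above; how the return value is converted on the Elixir side depends on the ErlPort version and is not covered here.

```python
def to_segments(result):
    """Reduce a Whisper result dict to language plus start/end/text segments."""
    return {
        "language": result["language"],
        "segments": [
            {
                "start": segment["start"],
                "end": segment["end"],
                "text": segment["text"].strip(),
            }
            for segment in result["segments"]
        ],
    }


def speech_to_text(path, language):
    """Hypothetical entry point for an ErlPort call; names are assumptions."""
    import whisper  # lazy import, needs the openai-whisper package

    # Assumption: an Elixir binary arrives in Python 3 as bytes.
    if isinstance(path, bytes):
        path = path.decode("utf-8")
    if isinstance(language, bytes):
        language = language.decode("utf-8")
    model = whisper.load_model("base")
    return to_segments(model.transcribe(path, language=language))
```

Loading the model on every call keeps the sketch short; a long-running worker would more likely load it once and reuse it.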

I see, thanks, that helps me get an idea of a practical configuration. The output is exactly what it needs to be!

Trying out whisper – pretty astounding. My dev machine is an M1 macbook pro and while it seems pretty fast, whisper doesn’t seem optimized for this processor. Have you experimented with hardware to find a speed boost and to see which is optimal for you?

I did not optimize at all…

The production server has no GPU, so I did it with CPU in mind.

FYI: This just popped up on the changelog.com weekly newsletter this morning.

I don’t know anything more about it, but thought it may be of interest.

Very fast on my M1 mac, I’d say 2x speedup.
