Theo's Site

Writing about technology, self-hosting, and things I find interesting.

Note: This post is 3 years old and may no longer reflect current thinking or accurate information.

The Whisper Speech to Text Library Appears Really Powerful

by theo

There's a new speech-to-text program/library called Whisper that OpenAI just released as open source, and it's impressed me quite a bit so far. It's really powerful, and it competes pretty well with the major incumbent speech-to-text tools in terms of accuracy.

The caveat is that it's not a full-featured tool. Currently all it does is convert an audio file to text, and so far it's just a command line tool. It doesn't have anything more sophisticated like simulated keyboard input or voice training, or any of the other things you'd expect from a well-established desktop speech-to-text program like Dragon NaturallySpeaking. It's intended more as a research model than anything, but the results I've gotten out of it are spectacular. It's not quite a hundred percent perfect, but the error rate is impressively small. In my testing the accuracy is better than even mature speaker-dependent systems like Dragon. It has a very strong model of grammar and gets things right that are really difficult for most speech-to-text programs, like capitalization, prepositions, and other small words.
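For anyone who wants to try the audio-file-to-text step from a script instead of the command line, here is a minimal sketch. It assumes the openai-whisper Python package and ffmpeg are installed. The helper names transcribe_file and save_transcript are made up for illustration, and "base" is just one of the published model sizes, not necessarily the one used for the results described here.

```python
from pathlib import Path


def transcribe_file(audio_path: str, model_name: str = "base") -> str:
    """Return the transcript of a single audio file."""
    # Imported here so the rest of the file loads even without the package.
    # Assumes: pip install openai-whisper, plus ffmpeg on the PATH.
    import whisper

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(audio_path)   # dict; the full transcript is under "text"
    return result["text"].strip()


def save_transcript(audio_path: str, text: str) -> Path:
    """Write the transcript next to the audio file, e.g. post.mp3 -> post.txt."""
    out_path = Path(audio_path).with_suffix(".txt")
    out_path.write_text(text + "\n", encoding="utf-8")
    return out_path
```

Calling save_transcript("post.mp3", transcribe_file("post.mp3")) would leave a post.txt beside the recording. Larger model sizes are generally slower but more accurate, so the model name is worth experimenting with.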

It also gets a lot of technical and specialized terms right, which is something most other speech-to-text systems I've used have a lot of difficulty with. It has the accuracy you'd expect from a speaker-dependent program that's been trained on your voice for a while, even though it's a speaker-independent program that just works off a generic model of speech.

As part of my testing, I read a few of my older blog posts to it. The audio clips and the generated text can be seen here.
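One way to put a number on a test like this is word error rate (WER): the word-level edit distance between the original post and the transcript, divided by the number of words in the original. The sketch below is not how the results in this post were measured. It is a standard-library illustration, and the function names and the normalization rule (lowercase, punctuation dropped) are assumptions made for the example.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase and drop punctuation so only word choice is scored."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference text is empty")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the transcript
                curr[j - 1] + 1,     # extra word in the transcript
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because this normalization throws away capitalization and punctuation, it would not capture two of the things Whisper is notably good at, so a score from it would understate the difference from other tools in that respect.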

It seems to work well with a wide range of microphones. I tried my SM7B (a standard broadcast dynamic microphone), but I also tried more exotic microphones. One of these is a stenomask. A stenomask is a microphone that goes right up against your mouth, and you speak into it in a really soft voice, so it gives you privacy and people nearby can't hear what you are saying. These microphones are very frequently used with speech recognition, but because a stenomask muffles the speaker's voice, a lot of speech recognition programs have trouble with them, and accuracy tends to go downhill compared to a regular microphone. I tried the stenomask with Whisper and the same pattern of reduced accuracy occurred, but the accuracy was still pretty impressive and quite usable.

There are of course some limitations. I'd say it's only kind of sort of open source. You can download the tool that converts audio into text, and you can download a pre-built model for it, but the software to actually generate that model from audio hasn't been released yet. Additionally, the model is based on a lot of data that isn't under an open source license, so you couldn't just regenerate the exact same model from public data sources even if you had the model generation code. So I would say it's not fully open source, although it's still a lot more open than basically any common and widely used speech-to-text program.

It's also just one part of the puzzle of a fully featured speech-to-text system. To be a full competitor to other tools, there would have to be a whole ecosystem of software using this model, and not just what we have now, which is a way to convert an audio recording to text. That includes integrations with other software and with the operating system. Of course, none of that exists yet for this particular speech-to-text model. But it appears that the broader open source community is working on ways to make use of this tool. There is, for instance, a repo on GitHub for a program that can take in live microphone input and run it through speech to text in real time.

I have long been interested in speech-to-text systems because I have a handwriting disability that makes it hard for me to type and write at a normal speed. Hopefully this progress means that some of the big incumbent sellers of speech-to-text software will have competition from the open source community.
