OpenAI released Whisper, an open-source deep learning model for speech recognition. OpenAI’s tests on Whisper show promising results in transcribing audio not only in English but also in several other languages.
Developers and researchers who have experimented with Whisper are also impressed with what the model can do. However, what is perhaps equally important is what Whisper’s release tells us about the shifting culture in artificial intelligence (AI) research and the kind of applications we can expect in the future.
A return to openness?
OpenAI has been criticized for not open-sourcing its models. GPT-3 and DALL-E, two of OpenAI’s most impressive deep learning models, are only available behind paid API services, and there is no way to download and examine them.
In contrast, Whisper was released as a pretrained, open-source model that everyone can download and run on a computing platform of their choice. This latest development comes as the past few months have seen a trend toward more openness among commercial AI research labs.
In May, Meta open-sourced OPT-175B, a large language model (LLM) that matches GPT-3 in size. In July, Hugging Face released BLOOM, another open-source LLM at the scale of GPT-3. And in August, Stability AI released Stable Diffusion, an open-source image generation model that rivals OpenAI’s DALL-E.
Open-source models open new avenues for research on deep learning and for building specialized applications.
OpenAI’s Whisper embraces data diversity
One of the important characteristics of Whisper is the diversity of data used to train it. Whisper was trained on 680,000 hours of multilingual and multitask supervised data collected from the web. A third of the training data is composed of non-English audio examples.
“Whisper can robustly transcribe English speech and perform at a state-of-the-art level with approximately 10 languages, as well as translation from those languages into English,” according to OpenAI.
While the lab’s analysis of languages other than English is not comprehensive, users who have tested it report solid results.
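For readers who want to see what that looks like in practice, here is a minimal sketch using the open-source whisper Python package; the audio file name and model size are placeholders, not part of OpenAI’s examples.

```python
# pip install -U openai-whisper  (requires ffmpeg installed on the system)
import whisper

model = whisper.load_model("medium")  # the multilingual "medium" checkpoint

# Transcribe non-English audio; Whisper detects the spoken language automatically.
result = model.transcribe("interview_french.mp3")  # placeholder file
print(result["language"])  # e.g. "fr"
print(result["text"])      # transcript in the original language

# The same model can also translate the speech directly into English.
translated = model.transcribe("interview_french.mp3", task="translate")
print(translated["text"])
```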
Data diversity, too, has become a growing trend in the AI research community. BLOOM, released this year, was the first language model to support 59 languages, and Meta is working on a model that supports translation across 200 languages.
The move toward greater data and language diversity will help ensure that more people can access and benefit from advances in deep learning.
Run your own model
As Whisper is open source, developers and users can choose to run it on the computing platform of their choice, whether that is a laptop, desktop workstation, mobile device, or cloud server. OpenAI released Whisper in five sizes, each trading accuracy for speed, with the smallest model approximately 60 times faster than the largest.
“Since transcription using the largest Whisper model runs faster than real-time on an [Nvidia] A100 [GPU], I expect there are practical use cases to run smaller models on mobile or desktop systems, once the models are properly ported to the respective environments. This would allow the users to run automatic speech recognition (ASR) without the privacy concerns of uploading their voice data to the cloud, while it may drain more battery and have increased latency compared to the alternative ASR solutions.”
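As a rough sketch of what running Whisper locally involves, the snippet below loads one of the five checkpoints on whatever hardware is available and transcribes a local file; the file name is a placeholder.

```python
import torch
import whisper

# Checkpoint sizes trade accuracy for speed: tiny, base, small, medium, large.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)

result = model.transcribe("meeting.wav")  # placeholder file
print(result["text"])

# Whisper also returns rough segment-level timestamps.
for seg in result["segments"]:
    print(f"[{seg['start']:7.1f}s - {seg['end']:7.1f}s] {seg['text'].strip()}")
```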
Developers who have tried Whisper are impressed by the possibilities it opens up. It could also challenge cloud-based ASR services, which have been the main option until now.
“At first glance, Whisper appears to be much better than other SaaS [software-as-a-service] products in accuracy. Since it is free and programmable, it most likely means a very significant challenge to services that only offer to transcribe,” Gift said.
Gift ran the model on his computer to transcribe hundreds of MP4 files ranging from 10 minutes to several hours in length. For machines with Nvidia GPUs, it may be much more cost-effective to run the model locally and sync the results to the cloud.
“Many content creators that have some programming experience who weren’t initially using transcription services due to cost will immediately adopt Whisper into their workflow,” Gift said.
Gift is now using Whisper to automate transcription in his workflow. And with transcription automated, he can feed the output into other open-source language models, such as text summarizers (a sketch of such a pipeline appears below).
“Content creators from indie to major film studios can use this technology and it has the possibility of being one of the tools in a tipping point in adding AI to our everyday workflows. By making transcription a commodity, now the real AI revolution can begin for those in the content space, from YouTubers to news to feature film (all industries I have worked professionally in),” he said.
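Here is a rough sketch of a workflow along those lines, batch-transcribing a folder of videos and summarizing each transcript with an off-the-shelf summarization model; the folder, file names, and model choices are illustrative assumptions, not Gift’s actual setup.

```python
from pathlib import Path

import whisper
from transformers import pipeline

asr = whisper.load_model("small")  # placeholder size; ffmpeg handles MP4 decoding
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")  # any summarizer works

for video in sorted(Path("recordings").glob("*.mp4")):  # placeholder folder
    transcript = asr.transcribe(str(video))["text"]
    video.with_suffix(".txt").write_text(transcript)

    # Summarize only the first few thousand characters to stay within the
    # summarizer's input limit; chunking would be needed for full coverage.
    summary = summarizer(transcript[:3000], max_length=120, min_length=30)[0]["summary_text"]
    video.with_name(video.stem + ".summary.txt").write_text(summary)
```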
Create your own applications
There are already several initiatives to make Whisper easier to use for people who don’t have the technical skills to set up and run machine learning models. An example is a joint project by journalist Peter Sterne and GitHub engineer Christina Warren to create a “free, secure, and easy-to-use transcription app for journalists” based on Whisper.
Meanwhile, open-source models like Whisper open new possibilities in the cloud. Developers are using platforms like Hugging Face to host Whisper and make it available through API calls.
“It takes a company 10 minutes to create their own transcription service powered by Whisper, and start transcribing calls or audio content even at high scale,” said Jeff Boudier, growth and product manager at Hugging Face.
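As a sketch of the pattern Boudier describes, the request below sends an audio file to a Whisper checkpoint on the Hugging Face Hub through its hosted inference endpoint; the API token and file name are placeholders.

```python
import requests

API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large"
HEADERS = {"Authorization": "Bearer hf_xxx"}  # placeholder Hugging Face API token

with open("customer_call.flac", "rb") as f:   # placeholder audio file
    audio_bytes = f.read()

response = requests.post(API_URL, headers=HEADERS, data=audio_bytes)
response.raise_for_status()
print(response.json())  # typically {"text": "..."} for speech recognition models
```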
Or fine-tune existing models for your purposes
Another benefit of open-source models like Whisper is fine-tuning: the process of taking a pretrained model and optimizing it for a new application. For example, Whisper can be fine-tuned to improve ASR performance in a language that is not well supported by the current model, or to better recognize medical or technical terms. Another interesting direction could be fine-tuning the model for tasks other than ASR, such as speaker verification, sound event detection, and keyword spotting.
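Here is a condensed sketch of what such fine-tuning can look like with the Hugging Face transformers port of Whisper; the dataset, language, and hyperparameters below are placeholders rather than a recommended recipe.

```python
# pip install transformers datasets
from dataclasses import dataclass
from typing import Any

from datasets import Audio, load_dataset
from transformers import (Seq2SeqTrainer, Seq2SeqTrainingArguments,
                          WhisperForConditionalGeneration, WhisperProcessor)

# Placeholder dataset/language: a small Swahili slice of Common Voice (gated; terms apply).
dataset = load_dataset("mozilla-foundation/common_voice_11_0", "sw", split="train[:1%]")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))

processor = WhisperProcessor.from_pretrained(
    "openai/whisper-small", language="swahili", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.config.forced_decoder_ids = None  # let the labels drive decoding during training
model.config.suppress_tokens = []

def prepare(example):
    # Log-Mel features for the encoder, token ids of the reference text for the decoder.
    audio = example["audio"]
    example["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    example["labels"] = processor.tokenizer(example["sentence"]).input_ids
    return example

dataset = dataset.map(prepare, remove_columns=dataset.column_names)

@dataclass
class SpeechCollator:
    processor: Any
    def __call__(self, features):
        batch = self.processor.feature_extractor.pad(
            [{"input_features": f["input_features"]} for f in features], return_tensors="pt"
        )
        labels = self.processor.tokenizer.pad(
            [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
        )
        # Padding tokens are set to -100 so the loss ignores them.
        batch["labels"] = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
        return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-sw",   # placeholder output path
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    max_steps=1000,                  # placeholder hyperparameters
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=SpeechCollator(processor),
    tokenizer=processor.feature_extractor,
)
trainer.train()
```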
“It could be fascinating to see where this heads. For very technical verticals, a fine-tuned version could be a game changer in how they are able to communicate technical information. For example, could this be the start of a revolution in medicine, as primary care physicians could have their dialogue recorded and then eventually automated into AI systems that diagnose patients?”
“We have already received feedback that you can use Whisper as a plug-and-play service to achieve better results than before,” said Philipp Schmid, technical lead at Hugging Face. “Combining this with fine-tuning the model will help improve the performance even further. Especially fine-tuning for languages which were not well represented in the pretraining dataset can improve the performance significantly.”
