Summary
OpenAI recently released Whisper, a 1.6 billion parameter AI model that can transcribe and translate speech audio from 97 different languages. Whisper was trained on 680,000 hours of audio data collected from the web and shows robust zero-shot performance on a wide range of automatic speech recognition (ASR) tasks. Unlike most state-of-the-art ASR models, Whisper is not fine-tuned on any benchmark dataset; instead, it is trained with “weak” supervision on a large-scale, noisy dataset of speech audio paired with transcription text collected from the internet. On the CoVoST2 speech translation benchmark, Whisper set a new zero-shot state-of-the-art record without using any of the benchmark’s training data.
Show Notes
OpenAI recently released Whisper, a 1.6 billion parameter AI model that can transcribe and translate speech audio from 97 different languages.
Whisper was trained on 680,000 hours of audio data collected from the web and shows robust zero-shot performance on a wide range of automatic speech recognition (ASR) tasks.
In zero-shot evaluations on a set of speech recognition datasets, Whisper made on average 55% fewer errors than wav2vec 2.0, a baseline model.
Most state-of-the-art ASR models are pretrained on unlabeled audio and then fine-tuned on benchmark data; for example, Meta’s XLS-R is pretrained on 436K hours of speech audio, then fine-tuned on much smaller benchmark-specific training sets.
On the CoVoST2 speech translation benchmark, Whisper set a new zero-shot state-of-the-art record without using any of the benchmark’s training data.
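For listeners who want to try the model, below is a minimal sketch using OpenAI's open-source whisper Python package (installable via pip install openai-whisper); the file name audio.mp3 is an illustrative placeholder.

```python
# Minimal sketch of zero-shot transcription and translation with
# OpenAI's open-source `whisper` package (pip install openai-whisper).
# "audio.mp3" is a placeholder for any local speech recording.
import whisper

# Load a pretrained checkpoint; "large" is the ~1.6B-parameter model
# discussed above (smaller checkpoints such as "base" also exist).
model = whisper.load_model("large")

# Transcribe the speech in its original language.
result = model.transcribe("audio.mp3")
print(result["text"])

# Translate non-English speech directly into English text.
translation = model.transcribe("audio.mp3", task="translate")
print(translation["text"])
```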