OpenAI’s New Large-v3-Turbo Model
OpenAI has released a new speech-to-text model: large-v3-turbo (interesting discussion). The promise is large-v2 quality at a speed faster than the “base” model, which sounds great.
In this post I’ll share how I made it work and my first impressions.
How to use it in whisperx
Neither whisperx nor faster-whisper supports this turbo model yet. After some tinkering, I figured that the easiest way to add support was to manually edit $HOME/.local/lib/python3.10/site-packages/faster_whisper/utils.py and add the model to the _MODELS mapping there.
_MODELS = {
    "tiny.en": "Systran/faster-whisper-tiny.en",
    ...
    "distil-small.en": "Systran/faster-distil-whisper-small.en",
    "turbo": "deepdml/faster-whisper-large-v3-turbo-ct2",  # added entry
}
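The path to utils.py depends on the Python version and on whether a virtual environment is in use, so the one above may not match your machine. A minimal sketch for locating the file, using only the standard library; `module_file` is a hypothetical helper name, not part of faster-whisper:

```python
import importlib.util


def module_file(name):
    """Return the source file of an importable module, or None if it is not installed."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when the parent package of a dotted name is missing.
        return None
    return spec.origin if spec else None


# Prints the location of the file to edit, or None if faster-whisper is not installed.
print(module_file("faster_whisper.utils"))
```

Run this with the same interpreter that whisperx uses, otherwise it may report a different installation.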
I had previously downloaded the model manually as shown below. whisperx should do that for you the first time you invoke it, but I note it here just in case.
from huggingface_hub import snapshot_download
snapshot_download(repo_id="deepdml/faster-whisper-large-v3-turbo-ct2", repo_type="model")
That was enough for me to have the turbo model run under whisperx.
Whisperx results
OpenAI claims an 8x speed-up from the large model to turbo. I wanted to know how these numbers translated to whisperx, which is itself an acceleration of vanilla whisper.
English
I had a 1 hour 24 minute English mp3 lying around, so I tested it the way I normally use whisperx, including diarization:
whisperx input_file --model turbo --hf_token your_token --diarize \
--output_format srt --verbose False --compute_type float16 \
--language en --align_model WAV2VEC2_ASR_LARGE_LV60K_960H \
--initial_prompt "your prompt" --output_dir .
I’m running on an RTX 4090.
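To repeat the comparison across several models, the command can be wrapped in a small timing harness. This is only a sketch, not necessarily how the numbers in this post were measured: `build_command`, `time_command` and `benchmark` are hypothetical helpers, the flags are a subset of the command above, and the child user time from `os.times()` is only meaningful on Unix-like systems.

```python
import os
import subprocess
import time


def build_command(audio_path, model, hf_token, language="en"):
    """Assemble a whisperx call from a subset of the flags used above."""
    return [
        "whisperx", audio_path,
        "--model", model,
        "--hf_token", hf_token,
        "--diarize",
        "--output_format", "srt",
        "--verbose", "False",
        "--compute_type", "float16",
        "--language", language,
        "--output_dir", ".",
    ]


def time_command(cmd):
    """Run cmd and return (wall-clock seconds, user CPU seconds spent in child processes)."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    return wall, after.children_user - before.children_user


def benchmark(audio_path, hf_token, models=("turbo", "large-v3", "medium")):
    """Time one whisperx run per model and print the results."""
    for model in models:
        wall, user = time_command(build_command(audio_path, model, hf_token))
        print(f"{model}: wall {wall:.0f}s, user {user:.0f}s")
```

Calling `benchmark("input_file", "your_token")` would run the three models one after another. Each run overwrites the previous srt in the output directory, so copy the file between runs if you want to compare transcripts.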
| Model | “user time” |
|---|---|
| Turbo | 2m48s |
| Large-v3 | 3m33s |
| Medium | 3m10s |
This was just an initial test to get a sense. The improvement was already quite satisfying to me.
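To put the table next to the 8x claim, the timings can be turned into ratios relative to turbo. A short sketch of that arithmetic; `to_seconds` is a hypothetical helper, and the figures are the ones from the table:

```python
import re


def to_seconds(stamp):
    """Convert a duration such as '2m48s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected duration format: {stamp!r}")
    minutes, seconds = map(int, match.groups())
    return minutes * 60 + seconds


timings = {"turbo": "2m48s", "large-v3": "3m33s", "medium": "3m10s"}
baseline = to_seconds(timings["turbo"])

# Ratio of each model's time to turbo's time (1.0 means equally fast).
ratios = {model: to_seconds(t) / baseline for model, t in timings.items()}

for model, ratio in ratios.items():
    print(f"{model}: {to_seconds(timings[model])}s, {ratio:.2f}x turbo's time")
```

Large-v3 comes out at about 1.27x turbo's time and medium at about 1.13x. That is far from 8x, though these runs also include alignment and diarization, which presumably do not get faster with a different transcription model.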
Turbo was the fastest of the three. Large-v3 skipped a few sentences from time to time (this is not a new problem to me, but I had never paid attention to it). In the one test I ran, turbo didn’t have this issue. The rest of the text (the part that hadn’t been skipped) was practically identical. While I’m not sure that the skipping is related to the model (see the Catalan results below), it’s reassuring that the remaining text was virtually the same.
Catalan
In Catalan, my 1-hour recording got transcribed in 2m47s using turbo and in 4m8s using large-v2 (what I used to use). That’s a great speed improvement: about a third less processing time.
The results, however, were mixed. Similar to the English case, in my first test there was some text near the beginning that large-v2 skipped but turbo transcribed, only a few lines. But towards the end there was a pretty big chunk that turbo swallowed (tens of lines). I repeated the run and the issue is reproducible. Then I grabbed two different hours from the same podcast. This time, the issue occurred with large-v2 instead.
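Spotting swallowed chunks by reading two transcripts side by side is tedious. One way to narrow the search, assuming that skipped speech leaves a hole in the srt timestamps, is to list long gaps between consecutive cues and check which gaps appear in one model's output but not the other's. A minimal sketch using only the standard library; `parse_cues`, `find_gaps` and the 30-second threshold are illustrative choices:

```python
import re

# Matches an srt timing line such as "00:00:01,000 --> 00:00:04,000".
CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def parse_cues(srt_text):
    """Return a list of (start, end) cue times in seconds."""
    cues = []
    for match in CUE_RE.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        cues.append((start, end))
    return cues


def find_gaps(cues, min_gap=30.0):
    """Return (gap_start, gap_end) pairs where consecutive cues are at least min_gap seconds apart."""
    gaps = []
    for (_, prev_end), (next_start, _) in zip(cues, cues[1:]):
        if next_start - prev_end >= min_gap:
            gaps.append((prev_end, next_start))
    return gaps


sample = """1
00:00:01,000 --> 00:00:04,000
Hello

2
00:00:50,000 --> 00:00:52,500
World
"""
print(find_gaps(parse_cues(sample)))
```

A gap is not proof of skipped text, since it can also be music or genuine silence, so each candidate still needs a listen.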
In Catalan I also observed that turbo was less accurate than large-v2 in a few instances. That’s more or less expected per the table below, but I didn’t expect it to be that noticeable. I will also evaluate large-v3, which I had been putting off, and then decide what to do.