Some speed & GDDR measurements for choosing a GPU and model settings #391
shervinemami started this conversation in General
Replies: 2 comments 4 replies
-
You should also run tests on longer examples, such as an hour of speech data; initialization also takes time.
-
Hi @shervinemami, I also found that FP32 was 2 to 3 times faster than FP16, but on an RTX 3090 this time! I asked a question about that.
-
Hi,
Here is some info that might help some people. I've been playing around with Whisper from the command line and with `model.transcribe()` and `model.decode()`, and also with different decoder options. I found that with my GPU (a GTX 1660 Super), FP32 is several times faster than FP16! I expect this would be true for most older-generation GPUs (e.g. GTX 1080 Ti), because GTX cards are quite weak at FP16, whereas the newer RTX GPUs are very fast at both FP16 and FP32. So the new generation would prefer FP16 while the older generations would prefer FP32.

While it's tempting to buy a new RTX 3060 or similar GPU, my expectation is that a second-hand GTX 1080 Ti, GTX 1660 Super, or similar recent GTX card can give excellent value at the moment, since I expect that in practice the differences in both speed and accuracy between "medium" on a GTX 1660 in FP32 mode and "large" on an RTX 3060 in FP16 will be fairly small.
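To compare FP16 vs FP32 on your own card, a minimal timing harness along these lines could work. Only the `time_call` helper below is runnable without a GPU; the Whisper usage in the comments assumes the standard `openai-whisper` package, and `audio.wav` is a placeholder path:

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Run fn several times and return the best wall-clock time in seconds.

    Taking the minimum over a few repeats filters out one-off costs such as
    model initialization and CUDA kernel compilation mentioned above.
    """
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - t0)
    return best

# Hypothetical usage (requires `pip install openai-whisper` and a CUDA GPU):
#   import whisper
#   model = whisper.load_model("medium.en")
#   t_fp16 = time_call(lambda: model.transcribe("audio.wav", fp16=True))
#   t_fp32 = time_call(lambda: model.transcribe("audio.wav", fp16=False))
#   print(f"FP16: {t_fp16:.2f}s  FP32: {t_fp32:.2f}s")
```

On a GTX-class card the FP32 call should come out faster, per the measurements above; on an RTX card, usually the reverse.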
Another thing I noticed is that `model.decode()` is faster than `model.transcribe()`, and the choice of decoder options also has a big impact on speed, far more than its impact on accuracy / reliability. So I've written a little script to compare many different combinations of decoder options. I've also used nvidia-smi to measure the GPU's GDDR usage.

Here are some speeds for decoding 10 seconds of English audio on my Linux PC with a 6GB CUDA GPU (`GTX 1660 Super`):

(benchmark table omitted)

And a few measurements performing inference purely on a fast desktop CPU (no GPU / CUDA at all) (`6-core i7-8700K @ 4.9GHz`):

(benchmark table omitted)

And comparing some different decoder options, all on the same `medium.en` model:

(benchmark table omitted)
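A sketch of how such an option sweep could be structured; the `option_grid` helper is my own illustration, not the actual benchmark_whisper.py script, and the Whisper calls in the comments assume the standard `openai-whisper` API (`load_audio`, `pad_or_trim`, `log_mel_spectrogram`, `DecodingOptions`, `model.decode`):

```python
from itertools import product

def option_grid(**choices):
    """Expand {name: [values, ...]} into one kwargs dict per combination,
    suitable for passing as whisper.DecodingOptions(**kwargs)."""
    names = list(choices)
    return [dict(zip(names, combo)) for combo in product(*(choices[n] for n in names))]

# Example grid of decoder options to sweep (values chosen for illustration):
grid = option_grid(
    fp16=[True, False],
    beam_size=[None, 3, 5],
    temperature=[0.0, 0.2],
)
print(len(grid))  # 2 * 3 * 2 = 12 combinations

# Hypothetical benchmarking loop (requires openai-whisper and an audio file):
#   import whisper
#   model = whisper.load_model("medium.en")
#   audio = whisper.pad_or_trim(whisper.load_audio("audio.wav"))
#   mel = whisper.log_mel_spectrogram(audio).to(model.device)
#   for kwargs in grid:
#       options = whisper.DecodingOptions(language="en", **kwargs)
#       result = model.decode(mel, options)  # time this call, as above
```

Timing each `model.decode()` call in that loop while watching nvidia-smi gives both the speed and GDDR numbers per option combination.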
Remember that all of these timings are for `model.decode()` (including the input loading and mel spectrogram analysis), whereas `model.transcribe()` is noticeably slower, such as 2x slower on CPU with larger models.

Note that in my testing with a few different 10-second English audio files, the `medium.en` model gave very accurate transcription no matter what decoder options I used; the accuracy seems to change only slightly while the speed varies significantly. So I'll personally be going with one of the faster decoder options. Using models smaller than `medium.en` (such as `base.en`) did start to show a noticeable reduction in accuracy.
) did start having a noticeable reduction in accuracy.I added this 1 line of code to
decoding.py
around line 454, it helped me understand a lot more about why I was seeing different speeds and outputs depending on whether I was using whisper through the cli or transcribe() or decode(), so I recommend adding this line of code todecoding.py
:For example, here are a set of options that takes around 2.5 seconds on my GPU for 10 seconds of audio:
[Decoding Options: DecodingOptions(task='transcribe', language='en', temperature=0.0, sample_len=None, best_of=None, beam_size=3, patience=1.3, length_penalty=None, prompt="Profiling ARM CPU and GPU cores in my 40's", prefix=None, suppress_blank=True, suppress_tokens='-1', without_timestamps=False, max_initial_timestamp=1.0, fp16=False) ]
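The exact line added to `decoding.py` isn't shown in this copy, but judging from the output format above it was presumably a print of the options object. A hypothetical reconstruction, using a simplified stand-in for Whisper's real `DecodingOptions` dataclass:

```python
from dataclasses import dataclass

# Simplified stand-in for whisper's DecodingOptions; the real frozen dataclass
# lives in whisper/decoding.py and has many more fields (beam_size, patience,
# prompt, fp16, etc.):
@dataclass(frozen=True)
class DecodingOptions:
    task: str = "transcribe"
    language: str = "en"
    fp16: bool = False

options = DecodingOptions()

# The kind of one-liner meant here; it prints the full options object, so you
# can see exactly what the CLI, transcribe(), and decode() code paths pass in:
print("[Decoding Options:", options, "]")
```

Dataclasses print all their fields by default, which is why the single `print` yields the complete listing of decoder settings shown above.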
Note: I uploaded my benchmarking script to GitHub in case others want to compare their GPUs, audio files, or models: benchmark_whisper.py