Some speed & GDDR measurements for choosing a GPU and model settings #391

shervinemami · 2022-10-22T11:03:16Z

shervinemami
Oct 22, 2022

Hi,

Here is some info that might help some people. I've been playing around with whisper from commandline and with model.transcribe() and model.decode() and also different decoder options. I found that with my GPU (a GTX 1066 Super), FP32 is several times faster than FP16! I expect this would be true for most older generation GPUs (eg: GTX 1080 Ti), because GTX is quite weak at FP16. Whereas the new RTX GPUs are very fast at both FP16 and FP32, so the new generation would prefer FP16 while the older generations would prefer FP32.

While it's tempting to buy a new RTX 3060 or similar GPU, my expectation is that a 2nd hand GTX 1080 Ti or GTX 1660 Super or similar recent GTX can give excellent value at the moment, since I expect that in practice, the difference in both speed & accuracy for "medium" on GTX 1660 in FP32 mode vs "large" on RTX 3060 in FP16 will be fairly small differences.

Another thing I noticed is that model.decode() is faster than model.transcribe(), and the choice of decoder options also has a big impact on speed, far more than the impact on accuracy / reliability. So I've written a little script to compare many different combinations of decoder options. I've also used nvidia-smi to measure the GPU's GDDR.

Here are some speeds for decoding 10 seconds of English audio on my Linux PC with a 6GB CUDA GPU (GTX 1660 Super):

                                                         TIME:           GPU MEM USED:
    "tiny.en" model on GPU, FP32, GreedyDecoder:         0.20 seconds    1373 MiB GDDR
    "base.en" model on GPU, FP32, GreedyDecoder:         0.26 seconds    1591 MiB GDDR
    "small.en" model on GPU, FP32, GreedyDecoder:        0.6 seconds     2727 MiB GDDR
    "medium.en" model on GPU, FP32, GreedyDecoder:       1.1 seconds     5693 MiB GDDR
    "large" model on GPU using FP32, GreedyDecoder:      <crashed. 6GB is not enough GDDR for large!>

And a few measurements performing inference purely on a fast desktop CPU (not GPU / CUDA at all) (6-core i7-8700K @ 4.9GHz):

                                                         TIME:           RAM USED:
    "tiny.en" model on CPU, FP32, GreedyDecoder:         0.4 seconds     >550 MiB System DDR
    "base.en" model on CPU, FP32, GreedyDecoder:         0.7 seconds     >600 MiB System DDR
    "small.en" model on CPU, FP32, GreedyDecoder:        2.0 seconds     >1500 MiB System DDR
    "medium.en" model on CPU, FP32, GreedyDecoder:       6.0 seconds     >4400 MiB System DDR
    "large" model on CPU, FP32, GreedyDecoder:           10.6 seconds    >6900 MiB System DDR

And comparing some different decoder options, all on the same "medium.en" model:

                                                         TIME:           GPU MEM USED:
    "medium.en" model on GPU, FP32, GreedyDecoder:       1.1 seconds     5693 MiB GDDR
    "medium.en" model on GPU, FP32, BestOf5Decoder:      2.5 seconds     5733 MiB GDDR
    "medium.en" model on GPU, FP32, BeamSize3Decoder:    2.3 seconds     5733 MiB GDDR
    "medium.en" model on GPU, FP32, BeamSize5Decoder:    3.9 seconds     5933 MiB GDDR
    "medium.en" model on GPU, FP32, BeamSize7Decoder:    <crashed. 6GB is not enough GDDR for this!>
    "medium.en" model on GPU, FP16, GreedyDecoder:       3.8 seconds     5693 MiB GDDR
    "medium.en" model on GPU, FP16, BestOf5Decoder:      6.6 seconds     5733 MiB GDDR
    "medium.en" model on GPU, FP16, BeamSize5Decoder:    8.3 seconds     5747 MiB GDDR
    "medium.en" model on GPU, FP16, BeamSize7Decoder:    8.7 seconds     5921 MiB GDDR

Remember that all of these timings are for model.decode (including the input loading and MEL spectrum analysis). Whereas model.transcode is noticeably slower, such as 2x slower on CPU with larger models.

Note that in my testing with a few different 10 second English audio files, the medium.en model gave very accurate transcription no matter what decoder options I used, the accuracy seems to only change slightly while the speed varies significantly. So I'll personally be going with one the faster decoder options. Whereas using smaller models than medium.en (such as base.en) did start having a noticeable reduction in accuracy.

I added this 1 line of code to decoding.py around line 454, it helped me understand a lot more about why I was seeing different speeds and outputs depending on whether I was using whisper through the cli or transcribe() or decode(), so I recommend adding this line of code to decoding.py:

print("[Decoding Options: ", self.options, "]")

For example, here are a set of options that takes around 2.5 seconds on my GPU for 10 seconds of audio:
[Decoding Options: DecodingOptions(task='transcribe', language='en', temperature=0.0, sample_len=None, best_of=None, beam_size=3, patience=1.3, length_penalty=None, prompt="Profiling ARM CPU and GPU cores in my 40's", prefix=None, suppress_blank=True, suppress_tokens='-1', without_timestamps=False, max_initial_timestamp=1.0, fp16=False) ]

Note: I uploaded my benchmarking script to GitHub if others want to compare their GPU or audio files or models: benchmark_whisper.py

FurkanGozukara · 2022-10-22T12:46:12Z

FurkanGozukara
Oct 22, 2022

you should do tests on longer time taking examples

such as 1 hour of speech data. initialization also takes time

2 replies

shervinemami Oct 22, 2022
Author

Hi @FurkanGozukara , yes that would be a good test. But I assume others will eventually post that type of info, since most users of whisper seem to care about long recording transcriptions. Whereas my use case of whisper is on very short phrases that are mostly around 5 seconds long, because I'm using whisper for live dictation as part of my "voice coding" system (speech recognition system for controlling my computer well enough for being a full-time computer programmer without using my hands).

If no-one posts longer benchmarks after a while then ping me and I'll do it.

FurkanGozukara Oct 22, 2022

that makes sense. by the way whisper works better for longer transcription because it uses previous context as well as far as i know

Ca-ressemble-a-du-fake · 2022-12-01T14:36:32Z

Ca-ressemble-a-du-fake
Dec 1, 2022

Hi @shervinemami I also found that FP32 was 2 to 3 times faster than FP16 but on an RTX 3090 this time! I asked a question about that.

2 replies

shervinemami Dec 1, 2022
Author

Hi @Ca-ressemble-a-du-fake , oh that's interesting! I was surprised that on my old GTX 1660 Super, 32-bit is faster than 16-bit, because according to the specs of my GPU, my GPU should be much faster at FP16 than FP32, but my GPU is old so I thought there might be a HW bug causing a bottleneck for FP16 in my GPU. But for an RTX 3090, that's recent enough that I'd definitely expect FP16 should be faster than FP32!
According to TechPowerup (https://www.techpowerup.com/gpu-specs/geforce-rtx-3090.c3622), your RTX3090 can do FP16 compute at the same speed as FP32. But the memory transfers would be roughly 2x faster, and so overall you really should be seeing a speedup on your GPU by using FP16.
This indicates to me that whisper uses an inefficient FP16 implementation that's probably several times slower than it could be, especially that I'm sure I remember OpenAI mention somewhere that the whisper model itself is trained at FP16, and so it should be capable of running fairly well at FP16! @jongwook

dustinjoe Jan 12, 2023

Hi, have you got any update on this FP32 faster thing? I actually ran into a similar situation.
#817
The test run is on Telsa V100 and FP32 seems to be about 1/4 faster than the default FP16 option.
Thanks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Some speed & GDDR measurements for choosing a GPU and model settings #391

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 2 comments 4 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Some speed & GDDR measurements for choosing a GPU and model settings #391

Uh oh!

Uh oh!

shervinemami Oct 22, 2022

Replies: 2 comments · 4 replies

Uh oh!

FurkanGozukara Oct 22, 2022

Uh oh!

shervinemami Oct 22, 2022 Author

Uh oh!

Uh oh!

FurkanGozukara Oct 22, 2022

Uh oh!

Ca-ressemble-a-du-fake Dec 1, 2022

Uh oh!

shervinemami Dec 1, 2022 Author

Uh oh!

dustinjoe Jan 12, 2023

shervinemami
Oct 22, 2022

Replies: 2 comments 4 replies

FurkanGozukara
Oct 22, 2022

shervinemami Oct 22, 2022
Author

Ca-ressemble-a-du-fake
Dec 1, 2022

shervinemami Dec 1, 2022
Author