A while ago, on 2x RTX 3090 I was getting 18.x tokens/s on 70B models. I didn't update for a bit, and was dismayed to see performance drop to 15 t/s. I had some hardware issues, so it took a while to figure out what was going on, but I narrowed it down to a commit between:
7082d24 and f679349
Reading through what landed in that week, the most likely culprits appear to be 5bf3953 and dc68f00.
I can't test against the first one on its own, because it produced errors on multi-GPU, which the second commit fixed. If I run builds from before this range, my performance is back.
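For anyone who wants to pin it down to the exact commit, a bisect between those two builds should do it. Rough sketch only; the model path, prompt and build invocation below are placeholders rather than my exact setup:

```sh
# Rough sketch only: model path, prompt and flags are placeholders, not my exact setup.
git bisect start
git bisect bad  f679349   # build where I measured ~15 t/s
git bisect good 7082d24   # build where I still got ~18 t/s

# At each step, rebuild with CUDA and repeat the same short generation,
# then judge the commit by the eval tokens/s in llama_print_timings:
make clean && make LLAMA_CUBLAS=1 -j
./main -m /models/70b-q4_k_m.gguf -ngl 99 -n 200 -p "some fixed prompt"

git bisect good   # if eval speed is still ~18 t/s
git bisect bad    # if it has dropped to ~15 t/s
git bisect skip   # if the commit errors out on multi-GPU (like 5bf3953 did)
```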
Loading a model across 3 GPUs, like miqu Q5_K_M, the regression is even bigger: from 15.5 t/s down to 11 t/s. Memory use is improved, though; I had to re-arrange how I split the model.
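By re-arranging I just mean the per-GPU proportions passed to --tensor-split; something along these lines (the ratios and model path are illustrative, not my actual values):

```sh
# Illustrative only: model path and split ratios are placeholders, not my actual values.
# Old split across the three cards:
./main -m /models/miqu-q5_k_m.gguf -ngl 99 --tensor-split 0.40,0.30,0.30 -n 200

# After the change, VRAM use per card went down, so the proportions had to be
# shifted to take advantage of the freed memory:
./main -m /models/miqu-q5_k_m.gguf -ngl 99 --tensor-split 0.34,0.33,0.33 -n 200
```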
Some proof:
Pre:
```
llama_print_timings: load time = 528.83 ms
llama_print_timings: sample time = 112.26 ms / 200 runs ( 0.56 ms per token, 1781.55 tokens per second)
llama_print_timings: prompt eval time = 528.67 ms / 22 tokens ( 24.03 ms per token, 41.61 tokens per second)
llama_print_timings: eval time = 10762.82 ms / 199 runs ( 54.08 ms per token, 18.49 tokens per second)
llama_print_timings: total time = 11874.81 ms
Output generated in 12.77 seconds (15.66 tokens/s, 200 tokens, context 22, seed 1952269572)
```
Post:
```
llama_print_timings: load time = 495.04 ms
llama_print_timings: sample time = 113.32 ms / 200 runs ( 0.57 ms per token, 1764.90 tokens per second)
llama_print_timings: prompt eval time = 494.91 ms / 22 tokens ( 22.50 ms per token, 44.45 tokens per second)
llama_print_timings: eval time = 12894.68 ms / 199 runs ( 64.80 ms per token, 15.43 tokens per second)
llama_print_timings: total time = 14055.05 ms / 221 tokens
Output generated in 14.63 seconds (13.67 tokens/s, 200 tokens, context 22, seed 1842804206)
```