A while ago, on 2x RTX 3090 I was getting 18.x tokens/s on 70B models. I didn't update for a bit, and was dismayed to see performance drop to 15 t/s. I had some hardware issues, so it took a while to figure out what was going on, but I narrowed it down to a commit between:
7082d24 and f679349
Reading through what landed in that week, the most likely culprits appear to be 5bf3953 and dc68f00.
I can't test against the first one on its own, because it produced errors on multi-GPU, which the second commit fixed. If I run builds from before this range, my performance is back.
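For anyone who wants to pin it down to the exact commit, a bisect between those two builds should do it. Rough sketch only; the model path, prompt and build invocation below are placeholders rather than my exact setup:

```sh
# Rough sketch only: model path, prompt and flags are placeholders, not my exact setup.
git bisect start
git bisect bad  f679349   # build where I measured ~15 t/s
git bisect good 7082d24   # build where I still got ~18 t/s

# At each step, rebuild with CUDA and repeat the same short generation,
# then judge the commit by the eval tokens/s in llama_print_timings:
make clean && make LLAMA_CUBLAS=1 -j
./main -m /models/70b-q4_k_m.gguf -ngl 99 -n 200 -p "some fixed prompt"

git bisect good   # if eval speed is still ~18 t/s
git bisect bad    # if it has dropped to ~15 t/s
git bisect skip   # if the commit errors out on multi-GPU (like 5bf3953 did)
```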
Loading a model across 3 GPUs, like miqu Q5_K_M, the regression is even bigger: from 15.5 t/s down to 11 t/s. Memory use is improved, though; I had to re-arrange how I split the model.
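By re-arranging I just mean the per-GPU proportions passed to --tensor-split; something along these lines (the ratios and model path are illustrative, not my actual values):

```sh
# Illustrative only: model path and split ratios are placeholders, not my actual values.
# Old split across the three cards:
./main -m /models/miqu-q5_k_m.gguf -ngl 99 --tensor-split 0.40,0.30,0.30 -n 200

# After the change, VRAM use per card went down, so the proportions had to be
# shifted to take advantage of the freed memory:
./main -m /models/miqu-q5_k_m.gguf -ngl 99 --tensor-split 0.34,0.33,0.33 -n 200
```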
Some proof:
Pre:
```
llama_print_timings: load time = 528.83 ms
llama_print_timings: sample time = 112.26 ms / 200 runs ( 0.56 ms per token, 1781.55 tokens per second)
llama_print_timings: prompt eval time = 528.67 ms / 22 tokens ( 24.03 ms per token, 41.61 tokens per second)
llama_print_timings: eval time = 10762.82 ms / 199 runs ( 54.08 ms per token, 18.49 tokens per second)
llama_print_timings: total time = 11874.81 ms
Output generated in 12.77 seconds (15.66 tokens/s, 200 tokens, context 22, seed 1952269572)
```
Post:
```
llama_print_timings: load time = 495.04 ms
llama_print_timings: sample time = 113.32 ms / 200 runs ( 0.57 ms per token, 1764.90 tokens per second)
llama_print_timings: prompt eval time = 494.91 ms / 22 tokens ( 22.50 ms per token, 44.45 tokens per second)
llama_print_timings: eval time = 12894.68 ms / 199 runs ( 64.80 ms per token, 15.43 tokens per second)
llama_print_timings: total time = 14055.05 ms / 221 tokens
Output generated in 14.63 seconds (13.67 tokens/s, 200 tokens, context 22, seed 1842804206)
```