Queries related to llama_bench prompt processing (pp) performance for Qwen3-235B-A22B models on H200-SXM #14174
shrutiramesh1988 asked this question in Q&A (unanswered)
Query 1:
Why does llama-bench prompt processing (pp) show lower performance for Qwen3-235B-A22B-Q8_0 than for Qwen3-235B-A22B-BF16 on H200-SXM?
On H200-SXM, the FP8 TFLOPS is specified as double the BF16 TFLOPS (as highlighted in the attached snapshot). However, when we run the llama-bench prompt processing benchmark for the Qwen3-235B-A22B model on 8 H200-SXM GPUs, Q8_0 achieves fewer tokens per second than BF16 (roughly 9% slower rather than faster):
Q8_0 tokens/s: 768.57 ± 3.41
BF16 tokens/s: 843.21 ± 45.38
Query 2:
How can native FP8 support be enabled with llama-bench?
Could you kindly let us know whether there is any way to enable native FP8 support, which would, in principle, double the throughput compared to BF16? The commands and initialization logs for the two runs are shown below.
./build/bin/llama-bench -m /hpelustre/shruti/qwen-experiments/Qwen3-235B-A22B-BF16-00001-of-00010.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA H200, compute capability 9.0, VMM: yes
Device 1: NVIDIA H200, compute capability 9.0, VMM: yes
Device 2: NVIDIA H200, compute capability 9.0, VMM: yes
Device 3: NVIDIA H200, compute capability 9.0, VMM: yes
Device 4: NVIDIA H200, compute capability 9.0, VMM: yes
Device 5: NVIDIA H200, compute capability 9.0, VMM: yes
Device 6: NVIDIA H200, compute capability 9.0, VMM: yes
Device 7: NVIDIA H200, compute capability 9.0, VMM: yes
build: cdf94a1 (5501)
./build/bin/llama-bench -m /hpelustre/shruti/qwen-experiments/Qwen3-235B-A22B-Q8_0-00001-of-00009.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA H200, compute capability 9.0, VMM: yes
Device 1: NVIDIA H200, compute capability 9.0, VMM: yes
Device 2: NVIDIA H200, compute capability 9.0, VMM: yes
Device 3: NVIDIA H200, compute capability 9.0, VMM: yes
Device 4: NVIDIA H200, compute capability 9.0, VMM: yes
Device 5: NVIDIA H200, compute capability 9.0, VMM: yes
Device 6: NVIDIA H200, compute capability 9.0, VMM: yes
Device 7: NVIDIA H200, compute capability 9.0, VMM: yes
build: cdf94a1 (5501)
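In case it is relevant: the logs above report both GGML_CUDA_FORCE_MMQ and GGML_CUDA_FORCE_CUBLAS as "no". A minimal sketch of how we understand the cuBLAS path could be forced at build time, so that quantized matmuls go through cuBLAS instead of the MMQ kernels (this is our reading of the ggml CMake options named in the log, not something we have verified; please correct us if this is not the right knob):
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=ON
cmake --build build --config Release -j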
Query 3:
Why is there a performance difference between BF16 and FP16 for llama-bench prompt processing (pp) with the Qwen3-235B-A22B model?
On H200, the peak TFLOPS for BF16 and FP16 are the same. However, with the KV-cache types (ctk and ctv) set to fp16 versus bf16, we see the following difference in pp tokens/s:
fp16: 843.21 ± 45.38
bf16: 704.85 ± 24.55
Could you kindly explain the reason behind this performance gap?
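For clarity, the invocation pattern we mean is sketched below; the flags are -ctk / -ctv (cache type for K and V) as in llama-bench, and the model path and values are illustrative rather than the exact commands behind the numbers above:
./build/bin/llama-bench -m Qwen3-235B-A22B-BF16-00001-of-00010.gguf -ctk f16 -ctv f16
./build/bin/llama-bench -m Qwen3-235B-A22B-BF16-00001-of-00010.gguf -ctk bf16 -ctv bf16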
Query 4:
Does llama-bench support tensor parallelism, pipeline parallelism, and data parallelism? If yes, how can they be enabled?
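To frame the question: the multi-GPU options we are aware of in llama-bench are the split mode (-sm none/layer/row) and the tensor split (-ts). A minimal sketch, assuming -sm layer distributes whole layers across GPUs while -sm row splits individual weight matrices (our understanding, not verified for this model):
./build/bin/llama-bench -m <model>.gguf -sm layer
./build/bin/llama-bench -m <model>.gguf -sm row -ts 1,1,1,1,1,1,1,1
It is not clear to us whether either of these corresponds to tensor or pipeline parallelism proper, or whether data parallelism is supported at all.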