Queries related to llama_bench prompt processing (pp) performance for Qwen3-235B-A22B models on H200-SXM #14174
shrutiramesh1988 asked this question in Q&A (unanswered)
Query 1:
Why does llama-bench prompt processing (pp) show lower performance for Qwen3-235B-A22B-Q8_0 than for Qwen3-235B-A22B-BF16 on H200-SXM?
On H200-SXM, the FP8 TFLOPS is specified as double the BF16 TFLOPS (as highlighted in the attached snapshot). However, when we run the llama-bench prompt processing benchmark for the Qwen3-235B-A22B model on 8 H200-SXM GPUs, Q8_0 achieves fewer tokens per second than BF16 (roughly 9% slower rather than faster):
Q8_0 tokens/s: 768.57 ± 3.41
BF16 tokens/s: 843.21 ± 45.38
Query 2:
How can native FP8 support be enabled with llama-bench?
Could you kindly let us know whether there is any way to enable native FP8 support, which would, in principle, double the throughput compared to BF16? The commands and initialization logs for the two runs are shown below.
./build/bin/llama-bench -m /hpelustre/shruti/qwen-experiments/Qwen3-235B-A22B-BF16-00001-of-00010.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA H200, compute capability 9.0, VMM: yes
Device 1: NVIDIA H200, compute capability 9.0, VMM: yes
Device 2: NVIDIA H200, compute capability 9.0, VMM: yes
Device 3: NVIDIA H200, compute capability 9.0, VMM: yes
Device 4: NVIDIA H200, compute capability 9.0, VMM: yes
Device 5: NVIDIA H200, compute capability 9.0, VMM: yes
Device 6: NVIDIA H200, compute capability 9.0, VMM: yes
Device 7: NVIDIA H200, compute capability 9.0, VMM: yes
build: cdf94a1 (5501)
./build/bin/llama-bench -m /hpelustre/shruti/qwen-experiments/Qwen3-235B-A22B-Q8_0-00001-of-00009.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
Device 0: NVIDIA H200, compute capability 9.0, VMM: yes
Device 1: NVIDIA H200, compute capability 9.0, VMM: yes
Device 2: NVIDIA H200, compute capability 9.0, VMM: yes
Device 3: NVIDIA H200, compute capability 9.0, VMM: yes
Device 4: NVIDIA H200, compute capability 9.0, VMM: yes
Device 5: NVIDIA H200, compute capability 9.0, VMM: yes
Device 6: NVIDIA H200, compute capability 9.0, VMM: yes
Device 7: NVIDIA H200, compute capability 9.0, VMM: yes
build: cdf94a1 (5501)
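In case it is relevant: the logs above report both GGML_CUDA_FORCE_MMQ and GGML_CUDA_FORCE_CUBLAS as "no". A minimal sketch of how we understand the cuBLAS path could be forced at build time, so that quantized matmuls go through cuBLAS instead of the MMQ kernels (this is our reading of the ggml CMake options named in the log, not something we have verified; please correct us if this is not the right knob):
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FORCE_CUBLAS=ON
cmake --build build --config Release -j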
Query 3:
Why is there a performance difference between BF16 and FP16 for llama-bench prompt processing (pp) with the Qwen3-235B-A22B model?
On H200, the peak TFLOPS for BF16 and FP16 are the same. However, with the KV-cache types (ctk and ctv) set to fp16 versus bf16, we see the following difference in pp tokens/s:
fp16: 843.21 ± 45.38
bf16: 704.85 ± 24.55
Could you kindly explain the reason behind this performance gap?
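For clarity, the invocation pattern we mean is sketched below; the flags are -ctk / -ctv (cache type for K and V) as in llama-bench, and the model path and values are illustrative rather than the exact commands behind the numbers above:
./build/bin/llama-bench -m Qwen3-235B-A22B-BF16-00001-of-00010.gguf -ctk f16 -ctv f16
./build/bin/llama-bench -m Qwen3-235B-A22B-BF16-00001-of-00010.gguf -ctk bf16 -ctv bf16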
Query 4:
Does llama-bench support tensor parallelism, pipeline parallelism, and data parallelism? If yes, how can they be enabled?
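To frame the question: the multi-GPU options we are aware of in llama-bench are the split mode (-sm none/layer/row) and the tensor split (-ts). A minimal sketch, assuming -sm layer distributes whole layers across GPUs while -sm row splits individual weight matrices (our understanding, not verified for this model):
./build/bin/llama-bench -m <model>.gguf -sm layer
./build/bin/llama-bench -m <model>.gguf -sm row -ts 1,1,1,1,1,1,1,1
It is not clear to us whether either of these corresponds to tensor or pipeline parallelism proper, or whether data parallelism is supported at all.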