
Conversation

@skyloevil (Contributor) commented Sep 22, 2025

Purpose

Add a tuned configuration for E=128, N=384 on NVIDIA_H100_PCIe with a [128,128] block shape, so the fused MoE kernel can reuse the benchmarked settings from vllm/model_executor/layers/fused_moe/configs/E=128,N=384,device_name=NVIDIA_H100_PCIe,block_shape=[128,128].json.
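
For readers new to these files: each tuned config maps a batch size (M) to the Triton launch parameters that the tuning sweep found fastest for that M. A minimal sketch of the layout is shown below; the parameter values are placeholders for illustration, not the benchmarked values this PR adds.

```python
import json

# Sketch of a fused-MoE tuned config (placeholder values, not the ones in this PR).
# Keys are batch sizes (M); values are the Triton launch parameters selected by
# benchmarks/kernels/benchmark_moe.py --tune for that batch size.
example_config = {
    "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 128,
          "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3},
    "64": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 4},
}
print(json.dumps(example_config, indent=4))
```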

Test Plan

python benchmarks/kernels/benchmark_moe.py --tune --save-dir vllm/model_executor/layers/fused_moe/configs --model Qwen/Qwen3-30B-A3B-FP8 --tp-size 2

Signed-off-by: zitian.zhao <[email protected]>
@skyloevil requested a review from mgoin as a code owner on September 22, 2025 at 15:33
@gemini-code-assist (bot) left a comment

Code Review

This pull request adds a tuned configuration file for the fused MoE kernel on NVIDIA H100 GPUs. While the configuration values themselves seem reasonable, there is a critical issue with the filename. It appears to be missing the dtype=fp8_w8a8 specifier, which is necessary for vLLM to load this configuration for FP8 models at runtime. Without the correct filename, the system will fall back to default parameters, negating the performance benefits of this tuning.

Inline review comment on the new config file E=128,N=384,device_name=NVIDIA_H100_PCIe,block_shape=[128,128].json (diff hunk @@ -0,0 +1,146 @@):

Severity: critical

The filename for this configuration is missing the dtype specifier. According to the pull request description, this configuration is for an FP8 model. At runtime, vLLM constructs the configuration filename including a dtype part (e.g., dtype=fp8_w8a8 for FP8 models). Without this in the filename, this configuration file will not be found, and the fused MoE kernel will fall back to default, likely suboptimal, parameters.

To fix this, the file should be renamed to:
E=128,N=384,device_name=NVIDIA_H100_PCIe,dtype=fp8_w8a8,block_shape=[128,128].json

This likely occurred because the benchmark script was run without the --dtype fp8_w8a8 flag, which should be used when tuning for FP8 models.
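
To illustrate why the dtype part matters, here is a simplified sketch of how a filename of this form can be assembled from the expert count, intermediate size, device name, optional dtype, and block shape. The helper name moe_config_file_name and its signature are hypothetical and only mirror the naming pattern described in this comment; they are not the actual vLLM API.

```python
from typing import Optional

def moe_config_file_name(E: int, N: int, device_name: str,
                         dtype: Optional[str] = None,
                         block_shape: Optional[list[int]] = None) -> str:
    """Hypothetical helper mirroring the filename pattern described above."""
    dtype_part = f",dtype={dtype}" if dtype else ""
    block_part = ("" if not block_shape
                  else ",block_shape=[" + ",".join(map(str, block_shape)) + "]")
    return f"E={E},N={N},device_name={device_name}{dtype_part}{block_part}.json"

# Name under which the file added in this PR would be found (no dtype part):
print(moe_config_file_name(128, 384, "NVIDIA_H100_PCIe", block_shape=[128, 128]))
# E=128,N=384,device_name=NVIDIA_H100_PCIe,block_shape=[128,128].json

# Name an FP8 model would look for, per the suggested rename:
print(moe_config_file_name(128, 384, "NVIDIA_H100_PCIe",
                           dtype="fp8_w8a8", block_shape=[128, 128]))
# E=128,N=384,device_name=NVIDIA_H100_PCIe,dtype=fp8_w8a8,block_shape=[128,128].json
```

If the missing flag was indeed the cause, re-running the test-plan command with --dtype fp8_w8a8 added should regenerate the file under a name that includes the dtype part.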

@samanamp (Contributor) commented

Would you share the before and after tuning for the kernels and model e2e execution? AFAIK for H100 the defaults are usually performant.

@skyloevil (Contributor, Author) commented

> Would you share the before and after tuning for the kernels and model e2e execution? AFAIK for H100 the defaults are usually performant.

OK, we'll share them later.
