
Conversation

@ywang96 (Member) commented Sep 16, 2025

Purpose

Add data-parallel ViT (DP ViT) support for Qwen3-VL models. This PR should be merged only after #24727 is merged.

Part of #22743
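
As background, a minimal sketch of the idea (not vLLM's actual implementation; the function name and round-robin policy are illustrative): in "data" mode the full vision encoder is replicated on every rank and the images themselves are sharded across ranks, so encoding needs only a single gather of embeddings at the end rather than per-layer all-reduces over the interconnect.

# Illustrative sketch only: not vLLM's implementation.
from typing import Sequence, TypeVar

T = TypeVar("T")

def shard_images_round_robin(images: Sequence[T], num_ranks: int) -> list[list[T]]:
    """Assign images to ranks round-robin so encoding work is roughly
    balanced across the data-parallel ViT replicas."""
    shards: list[list[T]] = [[] for _ in range(num_ranks)]
    for i, image in enumerate(images):
        shards[i % num_ranks].append(image)
    return shards

# Each rank r then runs embeddings_r = full_vit(shards[r]), followed by
# a single all-gather of the embeddings across ranks.
print(shard_images_round_robin(list(range(10)), 4))
# -> [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]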

Test Plan

Running Qwen3-VL-30B-A3B-Instruct on 4xL40S (PCIe) with the following changes to vision_language.py:

    sampling_params = SamplingParams(
-        temperature=0.2, max_tokens=64, stop_token_ids=req_data.stop_token_ids
+        temperature=0.0, max_tokens=1, stop_token_ids=req_data.stop_token_ids # measure prefill perf
    )

    engine_args = EngineArgs(
        model=model_name,
-       max_model_len=4096,
-       max_num_seqs=5,
        mm_processor_kwargs={
            "min_pixels": 28 * 28,
            "max_pixels": 1280 * 28 * 28,
            "fps": 1,
        },
        limit_mm_per_prompt={modality: 1},
+       tensor_parallel_size=4,
+       mm_encoder_tp_mode="data", # vs "weights"
+       enable_prefix_caching=False,
+       mm_processor_cache_gb=0,
    )
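
For reference, the same configuration expressed directly against vLLM's offline LLM API would look roughly like the sketch below (the HF repo name is assumed, since the model was unreleased when this PR was opened):

from vllm import LLM

# Sketch of the benchmark configuration above;
# "Qwen/Qwen3-VL-30B-A3B-Instruct" is an assumed repo name.
llm = LLM(
    model="Qwen/Qwen3-VL-30B-A3B-Instruct",
    tensor_parallel_size=4,
    mm_encoder_tp_mode="data",    # "data" = DP ViT (this PR); "weights" = TP-sharded ViT
    enable_prefix_caching=False,  # keep repeated prompts from skipping prefill
    mm_processor_cache_gb=0,      # keep image preprocessing results from being cached
    limit_mm_per_prompt={"image": 1},
)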

Test Result

Running 500 image prompts with mm_encoder_tp_mode="weights"

python3 examples/offline_inference/vision_language.py -m qwen3_vl_moe --modality image --num-prompts 500 --seed 0
Processed prompts: 100%|█████████████████████████████████████████████████████████| 1000/1000 [02:30<00:00,  6.63it/s, est. speed input: 6491.45 toks/s, output: 6.63 toks/s]

Running 500 image prompts with mm_encoder_tp_mode="data"

python3 examples/offline_inference/vision_language.py -m qwen3_vl_moe --modality image --num-prompts 500 --seed 0
Processed prompts: 100%|██████████████████████████████████████| 1000/1000 [01:12<00:00, 13.82it/s, est. speed input: 13533.77 toks/s, output: 13.82 toks/s]

These results are somewhat biased, since one would not typically run high TP on GPUs without NVLink, but this is the hardware I have available. Even so, data mode roughly doubles prefill throughput here (est. 13,533 vs. 6,491 input tok/s).


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

ywang96 and others added 29 commits September 12, 2025
@mergify mergify bot added the documentation Improvements or additions to documentation label Sep 16, 2025
@mergify mergify bot added the qwen Related to Qwen models label Sep 16, 2025
@ywang96 ywang96 marked this pull request as ready for review September 17, 2025 06:09
@ywang96 ywang96 requested a review from sighingnow as a code owner September 17, 2025 06:09
@ywang96 ywang96 requested a review from Isotr0py September 17, 2025 07:40
@ywang96 (Member, Author) commented Sep 17, 2025

@tjtanaa Could you take a look too? Thanks

@Isotr0py (Member) left a comment

LGTM!

@Isotr0py Isotr0py added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 17, 2025
@DarkLight1337 (Member) commented

Have you run lm-eval to ensure correctness?

@ywang96 (Member, Author) commented Sep 17, 2025

> Have you run lm-eval to ensure correctness?

@DarkLight1337 Yes, that's what I'm planning to do next. We don't have official numbers to compare against, but it should be fine as long as the results from the two modes match.
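
For reference, such a check could use the lm-evaluation-harness vLLM multimodal backend, run once per encoder mode, and compare scores. A sketch, assuming the vllm-vlm model type and the chartqa task exist in the installed harness version and that extra engine kwargs such as mm_encoder_tp_mode pass through model_args:

lm_eval --model vllm-vlm \
  --model_args pretrained=Qwen/Qwen3-VL-30B-A3B-Instruct,tensor_parallel_size=4,mm_encoder_tp_mode=data \
  --tasks chartqa \
  --batch_size auto \
  --apply_chat_template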

@tjtanaa (Contributor) commented Sep 17, 2025

@ywang96 LGTM as well. Can't wait for the model release.

@ywang96 (Member, Author) commented Sep 17, 2025

I won't merge this PR until I verify the correctness tomorrow.

@ywang96 (Member, Author) commented Sep 18, 2025

Additional results on 4xH200

vllm bench serve  \
--endpoint-type openai-chat \
--model Qwen-SGlang/Qwen3-VL-30B-A3B-Instruct   \
--tokenizer /tmp-nvme/models/Qwen-SGlang/Qwen3-VL-30B-A3B-Instruct \
--endpoint /v1/chat/completions   \
--dataset-name hf   \
--dataset-path lmarena-ai/VisionArena-Chat   \
--hf-split train   \
--num-prompts 1000 --request-rate 3

TP (mm_encoder_tp_mode="weights")

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           3.00      
Benchmark duration (s):                  334.79    
Total input tokens:                      94327     
Total generated tokens:                  121241    
Request throughput (req/s):              2.99      
Output token throughput (tok/s):         362.14    
Total Token throughput (tok/s):          643.88    
---------------Time to First Token----------------
Mean TTFT (ms):                          284.71    
Median TTFT (ms):                        203.83    
P99 TTFT (ms):                           2600.75   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.60     
Median TPOT (ms):                        18.02     
P99 TPOT (ms):                           47.18     
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.90     
Median ITL (ms):                         13.24     
P99 ITL (ms):                            153.90    
==================================================

DP (mm_encoder_tp_mode="data")

============ Serving Benchmark Result ============
Successful requests:                     1000      
Request rate configured (RPS):           3.00      
Benchmark duration (s):                  334.76    
Total input tokens:                      94327     
Total generated tokens:                  121609    
Request throughput (req/s):              2.99      
Output token throughput (tok/s):         363.28    
Total Token throughput (tok/s):          645.05    
---------------Time to First Token----------------
Mean TTFT (ms):                          191.12    
Median TTFT (ms):                        166.74    
P99 TTFT (ms):                           1195.92   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.98     
Median TPOT (ms):                        15.07     
P99 TPOT (ms):                           39.88     
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.23     
Median ITL (ms):                         11.37     
P99 ITL (ms):                            119.14    
==================================================

@vllm-bot vllm-bot merged commit 3127274 into vllm-project:main Sep 18, 2025
45 of 48 checks passed
845473182 pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 18, 2025
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025