[VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling #25557

Isotr0py · 2025-09-24T08:10:07Z

Purpose

Qwen3-VL's max_num_video_tokens calculation is implemented incorrectly; this PR corrects it.

Test Plan

Test Result


Signed-off-by: Isotr0py <[email protected]>

ywang96 · 2025-09-24T08:34:47Z

Should we update this as well?

vllm/vllm/model_executor/models/qwen2_vl.py

Line 85 in 6488f34

_MAX_FRAMES_PER_VIDEO = 600

Isotr0py · 2025-09-24T08:38:14Z

So should we make it configurable or just increase the value for it?

ywang96 · 2025-09-24T08:41:55Z

> So should we make it configurable or just increase the value for it?

Actually, let's add a new get_num_frames_with_most_features inside Qwen3-VL for now. I checked: for Qwen2-VL and Qwen2.5-VL this number will be 14, so we can change this back to 14 and add a new constant inside Qwen3-VL that its get_num_frames_with_most_features refers to.
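A minimal sketch of that suggestion, assuming simplified stand-in classes rather than the actual vLLM ones. The class names, the single-argument method signature, the `frame_budget` parameter, and the Qwen3 constant's name and value are all hypothetical and only illustrate the shape of the change: the shared constant goes back to 14, and the Qwen3-VL side gets its own constant plus its own override.

```python
# Hypothetical stand-ins for illustration; these are not the vLLM classes.
_MAX_FRAMES_PER_VIDEO = 14  # value quoted in the comment for Qwen2-VL / Qwen2.5-VL
_QWEN3_MAX_FRAMES_PER_VIDEO = 24576  # placeholder value, assumed for this sketch


class Qwen2LikeInfo:
    """Profiling helper that caps the frame count with the shared constant."""

    def get_num_frames_with_most_features(self, frame_budget: int) -> int:
        # Use at least one frame and never more than the model's cap.
        return max(1, min(frame_budget, _MAX_FRAMES_PER_VIDEO))


class Qwen3LikeInfo(Qwen2LikeInfo):
    """Overrides only the method, so the shared constant stays untouched."""

    def get_num_frames_with_most_features(self, frame_budget: int) -> int:
        # Same clamping logic, but against the Qwen3-specific constant.
        return max(1, min(frame_budget, _QWEN3_MAX_FRAMES_PER_VIDEO))


if __name__ == "__main__":
    print(Qwen2LikeInfo().get_num_frames_with_most_features(600))
    print(Qwen3LikeInfo().get_num_frames_with_most_features(600))
```

Keeping the override local to the Qwen3-style class means a change to its cap cannot silently alter profiling for the older models.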

Signed-off-by: Isotr0py <[email protected]>

ywang96 · 2025-09-24T23:56:34Z

This PR fixes the mismatch between the calculated number of tokens and the actual number of tokens generated by the ViT, but I'm getting these warnings:
(Worker_TP5 pid=585480) Token indices sequence length is longer than the specified maximum sequence length for this model (414874 > 262144). Running this sequence through the model will result in indexing errors

I think something is wrong with how we call the HF processor, which introduces this warning and could confuse users.

Isotr0py · 2025-09-25T08:04:45Z

This PR will be reworked for Qwen3-VL after #25631 is merged.

mergify · 2025-09-27T07:48:51Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Roger Wang <[email protected]>

ywang96 · 2025-09-27T23:38:29Z

#25810 should solve the problem of the big mismatch between ViT output length and video soft token length, so I'm going to update this PR accordingly.

Signed-off-by: Roger Wang <[email protected]>

ywang96 · 2025-09-28T00:25:56Z

I've updated several pieces of logic:

We've now decoupled the calculation of max video tokens from the number of embeddings in the ViT output, and now include the padding tokens in between them.
Based on the Qwen3-VL default settings, the recommended max pixel count is 24576 * 32 * 32, and therefore the max number of frames should be lowered to 24576, which translates to a total of 148634 tokens. This is not far from our estimate of 12288 * 12.5 = 153600, and I think a bit of overestimation is okay.
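The arithmetic in that comment can be checked with a short script. This is only a worked example using the numbers quoted above; the variable names are made up for illustration, and the meaning of the 12288 and 12.5 factors is not spelled out in the comment, so they are used here as given.

```python
# Worked check of the figures quoted in the comment above.
# Variable names are illustrative; only the numbers come from the thread.
FRAME_SIDE = 32
MAX_PIXELS = 24576 * FRAME_SIDE * FRAME_SIDE  # recommended pixel budget

# At 32 x 32 pixels per frame, the pixel budget allows this many frames.
max_frames = MAX_PIXELS // (FRAME_SIDE * FRAME_SIDE)

# Estimate and reported total, both taken verbatim from the comment.
estimated_tokens = 12288 * 12.5
reported_tokens = 148634

# How far the estimate overshoots the reported total.
overshoot = estimated_tokens - reported_tokens
overshoot_ratio = overshoot / reported_tokens

if __name__ == "__main__":
    print(f"max frames: {max_frames}")
    print(f"estimate: {estimated_tokens:.0f}, reported: {reported_tokens}")
    print(f"overshoot: {overshoot:.0f} tokens ({overshoot_ratio:.2%})")
```

The estimate overshoots the reported total by a few percent, which is in line with the remark that a bit of overestimation is acceptable.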

ywang96 · 2025-09-28T00:34:07Z

vllm/model_executor/models/qwen3_vl.py

+        target_video_size, _ = self.info._get_vision_info(
+            image_width=target_width,
+            image_height=target_height,
+            num_frames=target_num_frames,
+            image_processor=self.info.get_video_processor(),
+        )


This is in fact pretty important.

Previously we were sending a [32, 4096, 4096, 3] input tensor, which would OOM if we turn on DP ViT; this is now corrected to [24576, 32, 32, 3].
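To make the size difference concrete, here is a small sketch that compares the element counts of the two shapes mentioned above. It only counts elements; the actual memory used depends on the dtype and on what the processor and ViT do with the input, neither of which is stated in the comment.

```python
import math

# Shapes quoted in the comment above.
old_shape = (32, 4096, 4096, 3)
new_shape = (24576, 32, 32, 3)

# Total number of scalar elements in each dummy video tensor.
old_elements = math.prod(old_shape)
new_elements = math.prod(new_shape)

# How many times larger the old dummy input was.
shrink_factor = old_elements / new_elements

if __name__ == "__main__":
    print(f"old: {old_elements:,} elements")
    print(f"new: {new_elements:,} elements")
    print(f"old / new: {shrink_factor:.2f}x")
```

By element count alone, the old dummy input was roughly 21 times larger than the corrected one.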

Signed-off-by: Roger Wang <[email protected]>

…le video profiling (#25557) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: simon-mo <[email protected]>

…le video profiling (vllm-project#25557) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: baonudesifeizhai <[email protected]>

init

4da9d23

Signed-off-by: Isotr0py <[email protected]>

mergify bot added qwen Related to Qwen models v1 labels Sep 24, 2025

Isotr0py added 2 commits September 24, 2025 16:14

cleanup debug code

43f24ab

Signed-off-by: Isotr0py <[email protected]>

revert gpu runner to avoid conflict

67a0dc9

Signed-off-by: Isotr0py <[email protected]>

Isotr0py marked this pull request as ready for review September 24, 2025 08:16

Isotr0py requested a review from sighingnow as a code owner September 24, 2025 08:16

Isotr0py requested review from DarkLight1337 and ywang96 September 24, 2025 08:17

Isotr0py added 2 commits September 24, 2025 16:18

hardcode num_frames=1 for image

085bbd1

Signed-off-by: Isotr0py <[email protected]>

miss hardcode num_frames=1

8a14ae1

Signed-off-by: Isotr0py <[email protected]>

Isotr0py added 3 commits September 24, 2025 19:51

fix max frames per video

d21f790

Signed-off-by: Isotr0py <[email protected]>

Merge branch 'main' into fix-qwen3-video-profiling

f8cb079

ooops

26d915b

Signed-off-by: Isotr0py <[email protected]>

Isotr0py mentioned this pull request Sep 25, 2025

[Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video profiling #25648

Merged

5 tasks

Isotr0py changed the title ~~[Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video profiling~~ [Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for configurable video profiling Sep 25, 2025

Isotr0py marked this pull request as draft September 25, 2025 08:04

Isotr0py changed the title ~~[Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for configurable video profiling~~ [VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling Sep 25, 2025

mergify bot added the needs-rebase label Sep 27, 2025

ywang96 added this to the v0.11.0 Cherry Picks milestone Sep 27, 2025

Merge branch 'main' into fix-qwen3-video-profiling

3a8e9d0

Signed-off-by: Roger Wang <[email protected]>

mergify bot removed the needs-rebase label Sep 27, 2025

update

2a58165

Signed-off-by: Roger Wang <[email protected]>

ywang96 added 2 commits September 27, 2025 17:19

update

c3f71b8

Signed-off-by: Roger Wang <[email protected]>

update estimation

241f3db

Signed-off-by: Roger Wang <[email protected]>

ywang96 marked this pull request as ready for review September 28, 2025 00:26

ywang96 reviewed Sep 28, 2025

wwl2755 mentioned this pull request Sep 28, 2025

[MM] Optimize memory profiling for scattered multimodal embeddings #25810

Merged

5 tasks

typo

f2f8c42

Signed-off-by: Roger Wang <[email protected]>

ywang96 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 28, 2025

ywang96 approved these changes Sep 28, 2025

DarkLight1337 approved these changes Sep 28, 2025

ywang96 enabled auto-merge (squash) September 28, 2025 04:04

ywang96 merged commit 0efd540 into vllm-project:main Sep 28, 2025
50 checks passed

Isotr0py deleted the fix-qwen3-video-profiling branch September 28, 2025 07:57
