[VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling #25557
Signed-off-by: Isotr0py <[email protected]>
Should we update this as well?
So should we make it configurable or just increase the value for it?
Actually - let's make a new |
Signed-off-by: Isotr0py <[email protected]>
This PR fixes the mismatch between the calculated number of tokens and the actual number of tokens generated by the ViT, but I'm getting these warnings. I think something is wrong with how we call the HF processor that introduces this warning, which could be confusing to the user.
This PR will be reworked for Qwen3-VL after #25631 is merged.
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Roger Wang <[email protected]>
#25810 should solve the problem of the large mismatch between the ViT output length and the video soft token length, so I'm going to update this PR accordingly.
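For readers following along, here is a rough sketch of the kind of arithmetic behind a video soft token count in a Qwen2-VL-style patch grid. The function name `estimate_video_tokens` is hypothetical, and the default `patch_size`, `temporal_patch_size`, and `merge_size` values are assumptions for illustration, not values taken from this PR or from vLLM's code.

```python
def estimate_video_tokens(
    num_frames: int,
    height: int,
    width: int,
    patch_size: int = 16,          # assumed spatial patch size
    temporal_patch_size: int = 2,  # assumed number of frames per temporal patch
    merge_size: int = 2,           # assumed spatial merge factor
) -> int:
    """Hypothetical helper: token count after spatial merging.

    Assumes the inputs are already resized so that they divide evenly.
    """
    if num_frames % temporal_patch_size:
        raise ValueError("num_frames must be a multiple of temporal_patch_size")
    if height % (patch_size * merge_size) or width % (patch_size * merge_size):
        raise ValueError("height/width must be multiples of patch_size * merge_size")

    grid_t = num_frames // temporal_patch_size
    grid_h = height // patch_size
    grid_w = width // patch_size
    # Each merge_size x merge_size block of patches becomes one token.
    return grid_t * grid_h * grid_w // (merge_size ** 2)


print(estimate_video_tokens(num_frames=4, height=64, width=64))
```

If the profiling estimate and the ViT use different frame counts or resized dimensions for the same video, the two counts produced by this kind of formula will disagree, which is the sort of mismatch discussed above.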
Signed-off-by: Roger Wang <[email protected]>
I've updated several pieces of logic:
target_video_size, _ = self.info._get_vision_info(
    image_width=target_width,
    image_height=target_height,
    num_frames=target_num_frames,
    image_processor=self.info.get_video_processor(),
)
This is in fact pretty important.
Previously we were sending a [32, 4096, 4096, 3] input tensor, which would OOM if DP ViT is turned on; this is now corrected to [24576, 32, 32, 3].
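To put the two shapes in perspective, a quick back-of-the-envelope comparison of the raw dummy input sizes is sketched below. The 2-byte element size is an assumption (fp16/bf16), and this counts only the input tensor, not the ViT activations that would actually drive the OOM.

```python
from math import prod

# Shapes quoted in the comment above.
old_shape = (32, 4096, 4096, 3)
new_shape = (24576, 32, 32, 3)

old_elems = prod(old_shape)
new_elems = prod(new_shape)

# Assumption: 2 bytes per element (fp16/bf16); adjust for other dtypes.
bytes_per_elem = 2
old_gib = old_elems * bytes_per_elem / 2**30
new_gib = new_elems * bytes_per_elem / 2**30

print(f"old: {old_elems:,} elements, {old_gib:.3f} GiB")
print(f"new: {new_elems:,} elements, {new_gib:.3f} GiB")
print(f"reduction: {old_elems / new_elems:.1f}x")
```

Under that assumption the old dummy input is about 3 GiB, against roughly 0.14 GiB for the corrected one, a reduction of about 21x before any ViT compute is considered.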
Signed-off-by: Roger Wang <[email protected]>
…le video profiling (#25557) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: simon-mo <[email protected]>
…le video profiling (vllm-project#25557) Signed-off-by: Isotr0py <[email protected]> Signed-off-by: Roger Wang <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: baonudesifeizhai <[email protected]>