[BugFix] Fix ascend scheduler assert error #3191
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request provides a critical fix for an assertion error that occurs when running multimodal models with the Ascend scheduler. The change correctly prevents partial prefilling of multimodal requests when their encoder inputs cannot be scheduled, which was the root cause of the issue. The fix is well-targeted and effectively ensures the stability of the scheduler for multimodal workloads. The change is correct and I approve of it.
vllm_ascend/core/scheduler.py
Outdated
```python
    new_encoder_budget) = self._try_schedule_encoder_inputs(
        request, num_computed_tokens, num_new_tokens,
        encoder_budget)
if num_new_tokens == 0 or len(encoder_inputs_to_schedule) == 0:
```
This change is critical to prevent a crash when scheduling multimodal requests. Previously, if encoder inputs couldn't be scheduled due to budget constraints, the scheduler would still attempt to prefill the text tokens. This partial prefill left the request in an inconsistent state, causing an assertion failure later in the decoding phase. By adding the `len(encoder_inputs_to_schedule) == 0` check, you ensure the request is deferred until all of its necessary components can be scheduled together, which correctly resolves the issue.
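For readers following along, here is a minimal, self-contained sketch of the deferral behavior this check introduces. `try_schedule_encoder_inputs` and `can_schedule` are hypothetical stand-ins for the real scheduler methods, and the budget handling is deliberately simplified:

```python
# Hypothetical sketch of the deferral guard; not the actual vLLM code.
from typing import List, Tuple

def try_schedule_encoder_inputs(
        num_new_tokens: int,
        encoder_budget: int) -> Tuple[List[int], int, int]:
    # Stand-in: with no remaining encoder budget, no encoder inputs fit.
    if encoder_budget <= 0:
        return [], num_new_tokens, encoder_budget
    return [0], num_new_tokens, encoder_budget - 1

def can_schedule(num_new_tokens: int, encoder_budget: int) -> bool:
    encoder_inputs_to_schedule, num_new_tokens, _ = (
        try_schedule_encoder_inputs(num_new_tokens, encoder_budget))
    # The fix: also defer when no encoder inputs could be scheduled,
    # instead of partially prefilling the text tokens alone.
    if num_new_tokens == 0 or len(encoder_inputs_to_schedule) == 0:
        return False
    return True

print(can_schedule(num_new_tokens=16, encoder_budget=0))  # False: deferred
print(can_schedule(num_new_tokens=16, encoder_budget=8))  # True: scheduled
```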
Signed-off-by: fan2956 <[email protected]>
Force-pushed from ee45ee7 to a005068
What this PR does / why we need it?
Running a multimodal model with the Ascend scheduler may trigger the assertion error `assert (request.num_tokens - request.num_computed_tokens) == 1`.
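For context, a toy illustration (with assumed numbers) of why a partially prefilled request trips this decode-phase invariant: the decode path expects exactly one uncomputed token per request, but skipping the encoder inputs while prefilling the text leaves a larger gap:

```python
# Assumed toy numbers; not taken from an actual failing run.
num_prompt_tokens = 100
num_computed_tokens = 90        # partial prefill: encoder inputs skipped
num_tokens = num_prompt_tokens  # no output token generated yet

# Decode-phase invariant from the error message:
try:
    assert (num_tokens - num_computed_tokens) == 1
except AssertionError:
    print("assert fails: gap is", num_tokens - num_computed_tokens)  # 10
```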
Does this PR introduce any user-facing change?
No
How was this patch tested?