[Bugfix][PD] set max_completion_tokens=1 if req has this value #21841
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can add 🚀
Code Review
This pull request addresses a bug in the disaggregated serving examples where prefill requests were not correctly limited to generating a single token if the max_completion_tokens parameter was present. The fix correctly sets max_completion_tokens to 1 in addition to max_tokens.
The change is correctly applied for chat completion endpoints. However, in disagg_proxy_demo.py, the fix is missing for the standard completion endpoint (create_completion method), which could lead to the same bug persisting for those requests. I've added a comment to highlight this for a more complete fix.
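For illustration, here is a minimal sketch of the idea behind the fix, assuming a helper in the proxy that builds the prefill-side request (the function name build_prefill_request and the plain-dict request layout are hypothetical, not the actual disagg_proxy_demo.py code):

import copy


def build_prefill_request(original_request: dict) -> dict:
    """Clone the client request and force the prefill worker to emit a single token."""
    prefill_request = copy.deepcopy(original_request)
    # Setting max_tokens alone is not enough: for chat completions,
    # max_completion_tokens takes precedence when present, so pin both to 1.
    prefill_request["max_tokens"] = 1
    if "max_completion_tokens" in prefill_request:
        prefill_request["max_completion_tokens"] = 1
    return prefill_request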
examples/online_serving/disaggregated_serving/disagg_proxy_demo.py
Good catch!!
curl http://localhost:10001/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello! What can you do?"}
],
"temperature": 0.7,
"max_completion_tokens": 20
}'
Indeed, using "max_completion_tokens": 20 can reliably reproduce the issue.
Does tests/v1/kv_connector/nixl_integration/toy_proxy_server.py have the same issue?
I just checked these tests, and it seems that there are no tests added for /v1/chat/completions at all. I will add this change in another PR.
chaunceyjiang left a comment
LGTM.
Thanks~
Signed-off-by: Abirdcfly <[email protected]>
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.
Purpose
Test Plan
Test Result
(Optional) Documentation Update
Fix #21712
vllm/vllm/entrypoints/utils.py
Lines 289 to 294 in 04e3850
The function get_max_tokens, used in chat and completions, first looks for the value of max_completion_tokens and only then falls back to max_tokens. This means that if max_completion_tokens has a value, max_tokens will not take effect, no matter what it is set to.
When using python3 benchmark_serving.py --backend openai-chat --endpoint /v1/chat/completions, the request carries max_completion_tokens. Therefore, the max_tokens=1 set by proxy.py becomes useless, and only the value of max_completion_tokens is used by vllm serve:
vllm/benchmarks/backend_request_func.py
Line 376 in ad341c5
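To make the precedence concrete, here is a small standalone sketch (effective_max_tokens is a hypothetical helper written for this example, not the actual get_max_tokens implementation in vllm/entrypoints/utils.py, which also considers other limits):

from typing import Optional


def effective_max_tokens(
    max_tokens: Optional[int],
    max_completion_tokens: Optional[int],
) -> Optional[int]:
    """Illustrative precedence only: max_completion_tokens, when set, wins over max_tokens."""
    return max_completion_tokens if max_completion_tokens is not None else max_tokens


# The benchmark's chat request carries max_completion_tokens=20, so the proxy's
# max_tokens=1 override is ignored:
assert effective_max_tokens(max_tokens=1, max_completion_tokens=20) == 20
# Overriding both fields pins the prefill request to a single token:
assert effective_max_tokens(max_tokens=1, max_completion_tokens=1) == 1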