Conversation

@Abirdcfly
Contributor

cc @youkaichao

On my 8-node H20 setup, I am unable to run the model's max seq len (262144 = 256k). vLLM version: v0.10.1.dev101+gbda9d0535

# vllm serve /tmp/model/Qwen3-Coder-480B-A35B-Instruct-FP8 --served-model-name Qwen3-Coder-480B-A35B-FP8 -dp=8  --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser qwen3_coder

ValueError: To serve at least one request with the models's max seq len (262144), (62.00 GiB KV cache is needed, which is larger than the available KV cache memory (48.66 GiB). Based on the available memory, the estimated maximum model length is 205728. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
# vllm serve /tmp/model/Qwen3-Coder-480B-A35B-Instruct-FP8 --served-model-name Qwen3-Coder-480B-A35B-FP8 -dp=8  --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser qwen3_coder --gpu-memory-utilization 0.95

ValueError: To serve at least one request with the models's max seq len (262144), (62.00 GiB KV cache is needed, which is larger than the available KV cache memory (55.65 GiB). Based on the available memory, the estimated maximum model length is 235296. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
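
As a sanity check on the estimates above, the reported maximum model length scales roughly linearly with the available KV cache memory: 262144 × (48.66 / 62.00) ≈ 205,700 tokens and 262144 × (55.65 / 62.00) ≈ 235,300 tokens, which lines up with the 205728 and 235296 figures in the errors. Below is a minimal sketch of the workaround this PR documents for the FP8 model, capping the context length rather than relying on gpu-memory-utilization alone; the 131072 value comes from the updated command example, so adjust it for your own deployment:

# vllm serve /tmp/model/Qwen3-Coder-480B-A35B-Instruct-FP8 --served-model-name Qwen3-Coder-480B-A35B-FP8 -dp=8  --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser qwen3_coder --max-model-len 131072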

@gemini-code-assist (gemini-code-assist bot) left a comment

Summary of Changes

Hello @Abirdcfly, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses memory limitations when serving the Qwen3-Coder-480B-A35B-Instruct-FP8 model with vLLM. I've updated the recommended vllm serve command to explicitly set a max-model-len and clarified the documentation regarding context length limitations for both BF16 and FP8 models.

Highlights

  • Memory Constraint Resolution: I've added the --max-model-len 131072 argument to the vllm serve command example for the FP8 model, which helps mitigate the ValueError caused by insufficient KV cache memory.
  • Documentation Clarification: I've updated the "Context Length Limitation" tips in the documentation for both the BF16 and FP8 models, advising users that they can either reduce max-model-len or increase gpu-memory-utilization to handle memory constraints (see the command sketch after this list).
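
For illustration only, a hedged sketch combining the two knobs from the updated tip: the 0.95 utilization matches the report above, while 200000 is an arbitrary example value chosen to stay under the roughly 235k maximum that 0.95 yields on that 8×H20 setup, not a figure from the PR:

# vllm serve /tmp/model/Qwen3-Coder-480B-A35B-Instruct-FP8 --served-model-name Qwen3-Coder-480B-A35B-FP8 -dp=8  --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser qwen3_coder --gpu-memory-utilization 0.95 --max-model-len 200000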
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.
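
For example, posting the comment /gemini summary on the pull request page asks the bot to summarize the PR in its current state, and /gemini review triggers a fresh code review.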

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can provide a configuration file and/or a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the root of the repository. Detailed instructions can be found here.
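
A hypothetical sketch of such a setup; the file name styleguide.md is an assumption here, so follow the linked instructions for the exact layout your repository should use:

# Assumed layout: a .gemini/ folder at the repository root containing a custom review style guide
mkdir -p .gemini
printf '# Review style guide\nFollow PEP-8 for Python code.\n' > .gemini/styleguide.md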

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (gemini-code-assist bot) left a comment

Code Review

This pull request updates the documentation for the Qwen3-Coder-480B-A35B model to address memory limitations when running with the maximum sequence length on H20 hardware. The changes correctly add the --max-model-len parameter to the example command for the FP8 model and update the "Using Tips" section for both BF16 and FP8 models with workarounds. My review identifies a minor but recurring typo in the documentation that should be corrected for clarity and consistency.

@jeejeelee merged commit e2fd2c3 into vllm-project:main on Aug 4, 2025
2 checks passed