Conversation

@Abirdcfly
Contributor

cc @youkaichao

On my 8-node H20 setup, I am unable to run the model's max seq len (262144 = 256k). vLLM version: v0.10.1.dev101+gbda9d0535

# vllm serve /tmp/model/Qwen3-Coder-480B-A35B-Instruct-FP8 --served-model-name Qwen3-Coder-480B-A35B-FP8 -dp=8  --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser qwen3_coder

ValueError: To serve at least one request with the models's max seq len (262144), (62.00 GiB KV cache is needed, which is larger than the available KV cache memory (48.66 GiB). Based on the available memory, the estimated maximum model length is 205728. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
# vllm serve /tmp/model/Qwen3-Coder-480B-A35B-Instruct-FP8 --served-model-name Qwen3-Coder-480B-A35B-FP8 -dp=8  --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser qwen3_coder --gpu-memory-utilization 0.95

ValueError: To serve at least one request with the models's max seq len (262144), (62.00 GiB KV cache is needed, which is larger than the available KV cache memory (55.65 GiB). Based on the available memory, the estimated maximum model length is 235296. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
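
As a sanity check on the estimates above, the reported maximum model length scales roughly linearly with the available KV cache memory: 262144 × (48.66 / 62.00) ≈ 205,700 tokens and 262144 × (55.65 / 62.00) ≈ 235,300 tokens, which lines up with the 205728 and 235296 figures in the errors. Below is a minimal sketch of the workaround this PR documents for the FP8 model, capping the context length rather than relying on gpu-memory-utilization alone; the 131072 value comes from the updated command example, so adjust it for your own deployment:

# vllm serve /tmp/model/Qwen3-Coder-480B-A35B-Instruct-FP8 --served-model-name Qwen3-Coder-480B-A35B-FP8 -dp=8  --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser qwen3_coder --max-model-len 131072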

@gemini-code-assist (gemini-code-assist bot) left a comment

Summary of Changes

Hello @Abirdcfly, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses memory limitations when serving the Qwen3-Coder-480B-A35B-Instruct-FP8 model with vLLM. I've updated the recommended vllm serve command to explicitly set a max-model-len and clarified the documentation regarding context length limitations for both BF16 and FP8 models.

Highlights

  • Memory Constraint Resolution: I've added the --max-model-len 131072 argument to the vllm serve command example for the FP8 model, which helps mitigate the ValueError caused by insufficient KV cache memory.
  • Documentation Clarification: I've updated the "Context Length Limitation" tips in the documentation for both the BF16 and FP8 models, advising users that they can either reduce max-model-len or increase gpu-memory-utilization to handle memory constraints (see the command sketch after this list).
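
For illustration only, a hedged sketch combining the two knobs from the updated tip: the 0.95 utilization matches the report above, while 200000 is an arbitrary example value chosen to stay under the roughly 235k maximum that 0.95 yields on that 8×H20 setup, not a figure from the PR:

# vllm serve /tmp/model/Qwen3-Coder-480B-A35B-Instruct-FP8 --served-model-name Qwen3-Coder-480B-A35B-FP8 -dp=8  --enable-expert-parallel --enable-auto-tool-choice --tool-call-parser qwen3_coder --gpu-memory-utilization 0.95 --max-model-len 200000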
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.
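
For example, posting the comment /gemini summary on the pull request page asks the bot to summarize the PR in its current state, and /gemini review triggers a fresh code review.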

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can provide a configuration file and/or a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the root of the repository. Detailed instructions can be found here.
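
A hypothetical sketch of such a setup; the file name styleguide.md is an assumption here, so follow the linked instructions for the exact layout your repository should use:

# Assumed layout: a .gemini/ folder at the repository root containing a custom review style guide
mkdir -p .gemini
printf '# Review style guide\nFollow PEP-8 for Python code.\n' > .gemini/styleguide.md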

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (gemini-code-assist bot) left a comment

Code Review

This pull request updates the documentation for the Qwen3-Coder-480B-A35B model to address memory limitations when running with the maximum sequence length on H20 hardware. The changes correctly add the --max-model-len parameter to the example command for the FP8 model and update the "Using Tips" section for both BF16 and FP8 models with workarounds. My review identifies a minor but recurring typo in the documentation that should be corrected for clarity and consistency.

@jeejeelee merged commit e2fd2c3 into vllm-project:main on Aug 4, 2025
2 checks passed