We are excited to announce the newest official release of vLLM Ascend. This release includes new feature support, performance improvements, and bug fixes. We recommend that users upgrade from 0.7.3 to this version. Please always set VLLM_USE_V1=1 to use the V1 engine.
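Setting the flag in the environment before launching vLLM is enough, for example:

```shell
# vLLM Ascend only supports the V1 engine in this release
export VLLM_USE_V1=1
```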
In this release, we added many enhancements for large-scale expert parallelism. It is recommended to follow the official guide.
Please note that these release notes list all the important changes since the last official release (v0.7.3).
Highlights
- DeepSeek V3/R1 is supported with high quality and performance. MTP works with DeepSeek as well. Please refer to the multi-node tutorials and Large Scale Expert Parallelism.
- Qwen series models now work with graph mode, which is enabled by default with the V1 Engine. Please refer to the Qwen tutorials.
- Disaggregated Prefilling is supported for the V1 Engine. Please refer to the Large Scale Expert Parallelism tutorials.
- Automatic prefix caching and chunked prefill are supported.
- Speculative decoding works with the Ngram and MTP methods.
- w4a8 quantization is now supported for both MoE and dense models. Please refer to the quantization guide.
- Sleep Mode is supported for the V1 engine. Please refer to the Sleep Mode tutorials.
- Dynamic and static EPLB support has been added. This feature is still experimental.
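As an illustration of the speculative decoding highlight above, the Ngram and MTP methods are selected through vLLM's speculative_config. The sketch below only builds the configuration dictionaries that would be passed to the engine; the field names follow upstream vLLM conventions and are assumptions here, not values verified against this Ascend release:

```python
# Sketch only: speculative decoding settings as they would be passed to
# vLLM via LLM(..., speculative_config=...). Field names follow upstream
# vLLM conventions and are assumptions, not Ascend-verified.
ngram_spec = {
    "method": "ngram",            # draft tokens come from prompt n-gram lookup
    "num_speculative_tokens": 5,  # draft tokens proposed per step
    "prompt_lookup_max": 4,       # longest n-gram to match in the prompt
}

# The same knob would select MTP for DeepSeek-style models (illustrative):
mtp_spec = {
    "method": "deepseek_mtp",
    "num_speculative_tokens": 1,
}

print(ngram_spec["method"], mtp_spec["method"])
```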
Note
The following notes are especially relevant when upgrading from the last official release (v0.7.3):
- The V0 Engine is no longer supported as of this release. Please always set VLLM_USE_V1=1 to use the V1 engine with vLLM Ascend.
- Mindie Turbo is no longer needed with this release, and older versions of Mindie Turbo are not compatible, so please do not install it. All of its functionality and enhancements are already included in vLLM Ascend. We'll consider adding it back in the future if needed.
- Torch-npu has been upgraded to 2.5.1.post1 and CANN to 8.2.RC1. Don't forget to upgrade them.
Core
- The Ascend scheduler has been added for the V1 engine. This scheduler has better affinity with Ascend hardware.
- Structured output now works on the V1 Engine.
- A batch of custom ops has been added to improve performance.
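The Ascend scheduler mentioned above is typically switched on through engine-level additional configuration. A minimal sketch, assuming the additional_config key names (ascend_scheduler_config / enabled) used in vLLM Ascend's documentation, which may differ in your version:

```python
# Sketch: extra engine args said to enable the Ascend scheduler on the V1
# engine. The key names here ("ascend_scheduler_config", "enabled") are
# assumptions based on vLLM Ascend's additional_config convention.
additional_config = {
    "ascend_scheduler_config": {
        "enabled": True,
    },
}

# This dict would be passed as LLM(..., additional_config=additional_config)
print(additional_config["ascend_scheduler_config"]["enabled"])
```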
Changes
- Added EPLB support for the Qwen3-moe model. #2000
- Fixed a bug where MTP did not work well with Prefill Decode Disaggregation. #2610 #2554 #2531
- Fixed several bugs to ensure Prefill Decode Disaggregation works well. #2538 #2509 #2502
- Fixed a file-not-found error with shutil.rmtree in torchair mode. #2506
Known Issues
- When running MoE models, Aclgraph mode only works with tensor parallelism; DP/EP does not work in this release.
- Pipeline parallelism is not supported in this release for V1 engine.
- If you use w4a8 quantization with eager mode, please set VLLM_ASCEND_MLA_PARALLEL=1 to avoid OOM errors.
- Accuracy tests with some tools may not produce correct results. This does not affect real user scenarios. We'll fix it in the next post release. #2654
- We have noticed some remaining problems when running vLLM Ascend with Prefill Decode Disaggregation. For example, memory may leak and the service may get stuck. These are caused by known issues in vLLM and vLLM Ascend. We'll fix them in the next post release. #2650 #2604 vLLM#22736 vLLM#23554 vLLM#23981
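For the w4a8 eager-mode workaround listed above, the environment variable can be set before launch, for example:

```shell
# Workaround from Known Issues: avoids OOM when using w4a8 quantization
# together with eager mode in this release
export VLLM_ASCEND_MLA_PARALLEL=1
```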