Here is the development roadmap for 2025 Q3. Contributions and feedback are welcome (Join Slack). The next roadmap (2025 Q4) is #12780
Focus
- Feature compatibility and reliability: make advanced features such as P/D disaggregation, all parallelisms, speculative decoding, and load balancing fully compatible with each other, and bring them to production-level reliability.
- Usability: easy installation on all backends; simple launch scripts for large-scale deployments.
- Kernel optimizations for new generations of hardware (Blackwell, MI350, TPU, etc.).
- Reinforcement learning training framework integration.
Core refactor
- Simplify the overlap scheduler [Feature] Overlap Spec Support #11762 (a toy sketch of the overlap idea follows this list)
- Piecewise CUDA graph and torch.compile support [Feature] Roadmap for Prefill (Piecewise) CUDA Graph #11490
- Flexible memory cache layer (mem_cache_v2) Separate allocation logic from scheduler #11313
- Document the major components
- Remove dead code (e.g., double sparsity, the FlashInfer LoRA backend)
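For context on the overlap scheduler item above, here is a minimal sketch of the general idea: prepare the next batch on the CPU while the current batch runs on the device. The function names, the use of `ThreadPoolExecutor`, and the `sleep`-based workloads are illustrative assumptions only, not SGLang's scheduler code.

```python
# Toy illustration of overlap scheduling: CPU-side batch preparation for step
# N+1 runs concurrently with "device" execution of step N. Sleeps stand in for
# real work; this is not SGLang's implementation.
import time
from concurrent.futures import ThreadPoolExecutor

def schedule_batch(step: int) -> str:
    time.sleep(0.05)              # CPU-side work: batching, memory allocation, etc.
    return f"batch-{step}"

def run_on_device(batch: str) -> None:
    time.sleep(0.10)              # device-side work: forward pass for this batch

def run_overlapped(num_steps: int) -> float:
    start = time.time()
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_batch = pool.submit(schedule_batch, 0)
        for step in range(num_steps):
            batch = next_batch.result()
            # Kick off scheduling of the next batch before running the current
            # one, so CPU scheduling overlaps with device execution.
            if step + 1 < num_steps:
                next_batch = pool.submit(schedule_batch, step + 1)
            run_on_device(batch)
    return time.time() - start

if __name__ == "__main__":
    # Roughly num_steps * 0.10s instead of num_steps * 0.15s for the serial version.
    print(f"overlapped: {run_overlapped(10):.2f}s")
```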
Speculative decoding
- Overlap scheduler [Feature] Overlap Spec Support #11762
- Reference-based speculative decoding [Feature] Speculative decoding support lookahead #9873
- Make speculative decoding compatible with all other features (a minimal draft-and-verify sketch follows this list)
- Fix all corner cases in structured output + speculative decoding + reasoning/function call parsing
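As background for the compatibility work above, this is a minimal sketch of the generic draft-and-verify loop that speculative decoding builds on. The greedy acceptance rule, function names, and toy models are simplifying assumptions, not SGLang's implementation.

```python
# Minimal draft-and-verify sketch of speculative decoding (illustration only).
from typing import Callable, List

def speculative_step(
    target_next_token: Callable[[List[int]], int],  # target model: context -> next token (greedy)
    draft_next_token: Callable[[List[int]], int],   # cheap draft model: context -> next token
    context: List[int],
    num_draft_tokens: int = 4,
) -> List[int]:
    """Propose tokens with the draft model, verify them with the target model,
    and keep the longest accepted prefix plus one token from the target."""
    # 1. Draft phase: generate a short speculative continuation.
    draft, ctx = [], list(context)
    for _ in range(num_draft_tokens):
        t = draft_next_token(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify phase: the target model checks each drafted token in order.
    accepted, ctx = [], list(context)
    for t in draft:
        expected = target_next_token(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            # First mismatch: take the target's token instead and stop.
            accepted.append(expected)
            return accepted
    # All drafted tokens accepted; append one bonus token from the target.
    accepted.append(target_next_token(ctx))
    return accepted

if __name__ == "__main__":
    # Toy "models": the target cycles a fixed sequence, the draft is right 3 out of 4 steps.
    target_seq = [1, 2, 3, 4, 5, 6, 7, 8]
    target = lambda ctx: target_seq[len(ctx) % len(target_seq)]
    draft = lambda ctx: target_seq[len(ctx) % len(target_seq)] if len(ctx) % 4 else 0
    print(speculative_step(target, draft, context=[1, 2]))
```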
KVCache system
- HiCache with storage layer enhancement Hicache Storage Layer Prototype #7704
- Mooncake integration: Support l3 cache (mooncake store) for hiradix cache #7211
- 3FS integration: Add hf3fs support for hicache storage (based on #7704) #7280
- More backends, e.g., https://github.com/ai-dynamo/nixl
- Alternatively, directly map a storage device or remote memory pool as the host memory pool if the hardware permits: Add GDS alternative for hierarchical kv cache #7896
- Holistic scheduling with a global view of the KV cache; more details in [Roadmap] Distributed Serving Enhancement on 2025 H2 #8210
- C++ implementation of the radix and hiradix trees to mitigate scheduling overhead [Refactor] Rewrite Hierarchical Prefix Cache in C++ #7194 [Feature] Radix Tree in C++ #7369 Minor Optimizations in Radix Cache and Schedule Batch #6907 (a toy sketch of the prefix-matching idea follows this list)
- Sliding window attention memory pool optimizations (dynamic ratio)
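To illustrate the prefix-reuse idea behind the radix/hiradix tree items above, here is a toy Python sketch of token-level prefix matching. The class and method names are hypothetical; the real trees additionally track KV block handles, reference counts, host/device tiers, and eviction.

```python
# Minimal token-level radix (prefix) tree sketch, illustrating how a cached
# prefix can be matched and its KV cache reused instead of recomputed.
from typing import Dict, List

class RadixNode:
    def __init__(self):
        self.children: Dict[int, "RadixNode"] = {}  # next token id -> child
        self.hit_count: int = 0                      # simple usage statistic

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens: List[int]) -> None:
        """Insert a token sequence so later requests can reuse its prefix."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, tokens: List[int]) -> int:
        """Return the length of the longest cached prefix of `tokens`."""
        node, length = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.hit_count += 1
            length += 1
        return length

if __name__ == "__main__":
    cache = RadixCache()
    cache.insert([101, 7, 7, 42, 9])           # e.g. tokens of a shared system prompt
    print(cache.match_prefix([101, 7, 7, 99])) # -> 3 tokens of shared prefix
```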
Kernel
- FP4 MoE kernels [POC] enable trtllm fp4 from trtllm wheel #7711
- More attention kernel variants from flashinfer TRTLLM Gen MLA Decode Kernel Integration #7938
- CuDNN attention backend Implement CuDNN Attention Backend with Graph Caching #5505 Integrates cudnn_batch_prefill_with_kv_cache into flashinfer attention backend #7841
- Torch FlexAttention backend feat: Add FlexAttention Backend for Efficient Sparse Attention #9947
- Overlap allreduce in tensor parallelism [WIP] Support TP overlap #9058
- More allreduce optimizations (e.g., multi-node nvlink, fusion) [b200] support trt-llm allreduce fuse rms_norm_add kernel #7621
- Improve FP8 GEMM kernels on Hopper and Blackwell
Parallelism
- Support arbitrary combination of DP attention + TP + EP [1/N] MoE Refactor: refactor select_experts #7966
- Async schedule for pipeline parallelism [Roadmap] Pipeline parallelism refactoring roadmap #11857
- Elastic EP [4/N]Elastic EP support deepep backend #11837
- Support all parallelism + speculative decoding
- Support all parallelism + PD disaggregation
PD Disaggregation
- CPU transfer backend [Roadmap] Distributed Serving Enhancement on 2025 H2 #8210
- Auto scaling in OME
- Fault tolerance, reliability, and health checks
Quantization
- Major roadmap: [Roadmap] Quantization Support #8180
- ModelOpt integration to support all models
- Communication quantization (FP8 allreduce; a toy quantize/dequantize sketch follows this list)
- MXFP4 support
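As a rough illustration of communication quantization, the sketch below applies per-tensor FP8 (E4M3) scaling of the kind an FP8 allreduce would perform before transmitting tensors. It assumes PyTorch >= 2.1 for the `torch.float8_e4m3fn` dtype; the scaling scheme and function names are assumptions, not the fused kernels referenced in this roadmap.

```python
# Hedged sketch of per-tensor FP8 (E4M3) quantization before communication and
# dequantization afterwards. Illustration only, not a kernel-level implementation.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8(x: torch.Tensor):
    """Scale a tensor so its absolute max maps to the FP8 range, then cast."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax
    x_fp8 = (x * scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Undo the scaling after the (hypothetical) reduced-precision allreduce."""
    return x_fp8.to(torch.float32) / scale

if __name__ == "__main__":
    x = torch.randn(1024) * 3.0
    q, s = quantize_fp8(x)
    err = (dequantize_fp8(q, s) - x).abs().max()
    print(f"max abs round-trip error: {err.item():.4f}")  # small relative to the tensor's range
```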
RL framework integration
- AREAL, slime, veRL integration (sorted alphabetically)
- Faster weight sync
- Reproduce DeepSeek/Kimi + GRPO training
Multi-LoRA serving
- Major Roadmap: [Roadmap] Lora Support #2929
- Asynchronous LoRA updates: Support overlapped lora updates #8213
- Support Radix Attention: Support radix cache for Lora feature #7216
- GPU pinning: Support GPU pinning for LoRA #8697
- LRU cache: Implement LRU eviction policy for LoRA adapters #11041 (a toy LRU pool sketch follows this list)
- Performance improvements Improve LoRA Perf by Deprecating FlashInfer and Eliminating Redundant Tensor Ops #8940
- Support embedding layers Feat: support LoRA for embedding layer #8222
- Serve LoRA adapters with different ranks using unified paging
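To make the LRU eviction item above concrete, here is a toy adapter-slot pool built on `collections.OrderedDict`. The class name, slot contents, and eviction trigger are hypothetical; the real pool manages GPU weight buffers, pinning, and asynchronous loading.

```python
# Toy LRU pool for LoRA adapter slots: loading a new adapter into a full pool
# evicts the least recently used one. Illustration only.
from collections import OrderedDict
from typing import Optional

class LoRAAdapterLRUPool:
    def __init__(self, max_slots: int):
        self.max_slots = max_slots
        self._slots: "OrderedDict[str, dict]" = OrderedDict()  # adapter name -> loaded weights

    def get(self, name: str) -> Optional[dict]:
        """Return a loaded adapter and mark it most recently used."""
        if name not in self._slots:
            return None
        self._slots.move_to_end(name)
        return self._slots[name]

    def put(self, name: str, weights: dict) -> Optional[str]:
        """Load an adapter, evicting the least recently used one if the pool is
        full. Returns the evicted adapter's name, if any."""
        evicted = None
        if name not in self._slots and len(self._slots) >= self.max_slots:
            evicted, _ = self._slots.popitem(last=False)  # least recently used
        self._slots[name] = weights
        self._slots.move_to_end(name)
        return evicted

if __name__ == "__main__":
    pool = LoRAAdapterLRUPool(max_slots=2)
    pool.put("adapter-a", {"rank": 8})
    pool.put("adapter-b", {"rank": 16})
    pool.get("adapter-a")                       # "adapter-a" becomes most recently used
    print(pool.put("adapter-c", {"rank": 8}))   # evicts "adapter-b"
```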
Hardware
- Better CI coverage for AMD
- Support Ascend NPU [Roadmap] Supporting Ascend NPU on 2025 H2 #8004
- H20-specific optimizations [Roadmap] Kimi-K2 performance enhancement on H20 GPU #8151
- Support TPU https://github.com/sgl-project/sglang-jax
- Support Intel XPU
- Support more hardware with multi-framework [Roadmap] Supporting multi frameworks on 2025 H2 #8199
Model coverage
- Day 0 support for all upcoming OSS models
- Multi-modal models #
- Language models
- Hybrid Mamba/linear attention models Qwen3-Next support #10233 model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) #10909
API layer
- Provide a gRPC interface layer Roadmap: SGLang Router #10341 [router] 0.2.2 release #11986
- Rewrite the API layer (FastAPI, tokenizer manager) in sgl-router Roadmap: SGLang Router #10341 [router] 0.2.2 release #11986
- Support all advanced APIs (e.g., OpenAI Responses API, MCP integration) Roadmap: SGLang Router #10341 [router] 0.2.2 release #11986