Here is the development roadmap for 2025 Q3. Contributions and feedback are welcome (Join Slack). The next roadmap (2025 Q4) is #12780
Focus
- Feature compatibility and reliability: make advanced features such as P/D disaggregation, all parallelisms, speculative decoding, and load balancing fully compatible with each other, and bring them to production-level reliability.
- Usability: easy installation on all backends; simple launch scripts for large-scale deployments.
- Kernel optimizations for new generations of hardware (Blackwell, MI350, TPU, etc.).
- Reinforcement learning training framework integration.
Core refactor
- Simplify the overlap scheduler [Feature] Overlap Spec Support #11762 (a toy sketch of the overlap idea follows this list)
- Piecewise CUDA graph and torch.compile support [Feature] Roadmap for Prefill (Piecewise) CUDA Graph #11490
- Flexible memory cache layer (mem_cache_v2) Separate allocation logic from scheduler #11313
- Document the major components
- Remove dead code (e.g., double sparsity, the FlashInfer LoRA backend)
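For context on the overlap scheduler item above, here is a minimal sketch of the general idea: prepare the next batch on the CPU while the current batch runs on the device. The function names, the use of `ThreadPoolExecutor`, and the `sleep`-based workloads are illustrative assumptions only, not SGLang's scheduler code.

```python
# Toy illustration of overlap scheduling: CPU-side batch preparation for step
# N+1 runs concurrently with "device" execution of step N. Sleeps stand in for
# real work; this is not SGLang's implementation.
import time
from concurrent.futures import ThreadPoolExecutor

def schedule_batch(step: int) -> str:
    time.sleep(0.05)              # CPU-side work: batching, memory allocation, etc.
    return f"batch-{step}"

def run_on_device(batch: str) -> None:
    time.sleep(0.10)              # device-side work: forward pass for this batch

def run_overlapped(num_steps: int) -> float:
    start = time.time()
    with ThreadPoolExecutor(max_workers=1) as pool:
        next_batch = pool.submit(schedule_batch, 0)
        for step in range(num_steps):
            batch = next_batch.result()
            # Kick off scheduling of the next batch before running the current
            # one, so CPU scheduling overlaps with device execution.
            if step + 1 < num_steps:
                next_batch = pool.submit(schedule_batch, step + 1)
            run_on_device(batch)
    return time.time() - start

if __name__ == "__main__":
    # Roughly num_steps * 0.10s instead of num_steps * 0.15s for the serial version.
    print(f"overlapped: {run_overlapped(10):.2f}s")
```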
Speculative decoding
- Overlap scheduler [Feature] Overlap Spec Support #11762
- Reference-based speculative decoding [Feature] Speculative decoding support lookahead #9873
- Make speculative decoding compatible with all other features (a minimal draft-and-verify sketch follows this list)
- Fix all corner cases in structured output + speculative decoding + reasoning/function call parsing
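As background for the compatibility work above, this is a minimal sketch of the generic draft-and-verify loop that speculative decoding builds on. The greedy acceptance rule, function names, and toy models are simplifying assumptions, not SGLang's implementation.

```python
# Minimal draft-and-verify sketch of speculative decoding (illustration only).
from typing import Callable, List

def speculative_step(
    target_next_token: Callable[[List[int]], int],  # target model: context -> next token (greedy)
    draft_next_token: Callable[[List[int]], int],   # cheap draft model: context -> next token
    context: List[int],
    num_draft_tokens: int = 4,
) -> List[int]:
    """Propose tokens with the draft model, verify them with the target model,
    and keep the longest accepted prefix plus one token from the target."""
    # 1. Draft phase: generate a short speculative continuation.
    draft, ctx = [], list(context)
    for _ in range(num_draft_tokens):
        t = draft_next_token(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify phase: the target model checks each drafted token in order.
    accepted, ctx = [], list(context)
    for t in draft:
        expected = target_next_token(ctx)
        if expected == t:
            accepted.append(t)
            ctx.append(t)
        else:
            # First mismatch: take the target's token instead and stop.
            accepted.append(expected)
            return accepted
    # All drafted tokens accepted; append one bonus token from the target.
    accepted.append(target_next_token(ctx))
    return accepted

if __name__ == "__main__":
    # Toy "models": the target cycles a fixed sequence, the draft is right 3 out of 4 steps.
    target_seq = [1, 2, 3, 4, 5, 6, 7, 8]
    target = lambda ctx: target_seq[len(ctx) % len(target_seq)]
    draft = lambda ctx: target_seq[len(ctx) % len(target_seq)] if len(ctx) % 4 else 0
    print(speculative_step(target, draft, context=[1, 2]))
```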
KVCache system
- HiCache with storage layer enhancement Hicache Storage Layer Prototype #7704
- Mooncake integration: Support l3 cache (mooncake store) for hiradix cache #7211
- 3FS integration: Add hf3fs support for hicache storage (based on #7704) #7280
- More backends, e.g., https://github.com/ai-dynamo/nixl
- Alternatively, directly map a storage device or remote memory pool as the host memory pool if the hardware permits: Add GDS alternative for hierarchical kv cache #7896
- Holistic scheduling with a global view of the KV cache; more details in [Roadmap] Distributed Serving Enhancement on 2025 H2 #8210
- C++ implementation of the radix and hiradix trees to mitigate scheduling overhead [Refactor] Rewrite Hierarchical Prefix Cache in C++ #7194 [Feature] Radix Tree in C++ #7369 Minor Optimizations in Radix Cache and Schedule Batch #6907 (a toy sketch of the prefix-matching idea follows this list)
- Sliding window attention memory pool optimizations (dynamic ratio)
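To illustrate the prefix-reuse idea behind the radix/hiradix tree items above, here is a toy Python sketch of token-level prefix matching. The class and method names are hypothetical; the real trees additionally track KV block handles, reference counts, host/device tiers, and eviction.

```python
# Minimal token-level radix (prefix) tree sketch, illustrating how a cached
# prefix can be matched and its KV cache reused instead of recomputed.
from typing import Dict, List

class RadixNode:
    def __init__(self):
        self.children: Dict[int, "RadixNode"] = {}  # next token id -> child
        self.hit_count: int = 0                      # simple usage statistic

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def insert(self, tokens: List[int]) -> None:
        """Insert a token sequence so later requests can reuse its prefix."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())

    def match_prefix(self, tokens: List[int]) -> int:
        """Return the length of the longest cached prefix of `tokens`."""
        node, length = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            node.hit_count += 1
            length += 1
        return length

if __name__ == "__main__":
    cache = RadixCache()
    cache.insert([101, 7, 7, 42, 9])           # e.g. tokens of a shared system prompt
    print(cache.match_prefix([101, 7, 7, 99])) # -> 3 tokens of shared prefix
```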
Kernel
- FP4 MoE kernels [POC] enable trtllm fp4 from trtllm wheel #7711
- More attention kernel variants from flashinfer TRTLLM Gen MLA Decode Kernel Integration #7938
- CuDNN attention backend Implement CuDNN Attention Backend with Graph Caching #5505 Integrates cudnn_batch_prefill_with_kv_cache into flashinfer attention backend #7841
- Torch FlexAttention backend feat: Add FlexAttention Backend for Efficient Sparse Attention #9947
- Overlap allreduce in tensor parallelism [WIP] Support TP overlap #9058
- More allreduce optimizations (e.g., multi-node nvlink, fusion) [b200] support trt-llm allreduce fuse rms_norm_add kernel #7621
- Improve FP8 GEMM kernels on Hopper and Blackwell
Parallelism
- Support arbitrary combination of DP attention + TP + EP [1/N] MoE Refactor: refactor select_experts #7966
- Async schedule for pipeline parallelism [Roadmap] Pipeline parallelism refactoring roadmap #11857
- Elastic EP [4/N]Elastic EP support deepep backend #11837
- Support all parallelism + speculative decoding
- Support all parallelism + PD disaggregation
PD Disaggregation
- CPU transfer backend [Roadmap] Distributed Serving Enhancement on 2025 H2 #8210
- Auto scaling in OME
- Fault tolerance, reliability, and health checks
Quantization
- Major roadmap: [Roadmap] Quantization Support #8180
- ModelOpt integration to support all models
- Communication quantization (FP8 allreduce; a toy quantize/dequantize sketch follows this list)
- MXFP4 support
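As a rough illustration of communication quantization, the sketch below applies per-tensor FP8 (E4M3) scaling of the kind an FP8 allreduce would perform before transmitting tensors. It assumes PyTorch >= 2.1 for the `torch.float8_e4m3fn` dtype; the scaling scheme and function names are assumptions, not the fused kernels referenced in this roadmap.

```python
# Hedged sketch of per-tensor FP8 (E4M3) quantization before communication and
# dequantization afterwards. Illustration only, not a kernel-level implementation.
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8(x: torch.Tensor):
    """Scale a tensor so its absolute max maps to the FP8 range, then cast."""
    amax = x.abs().max().clamp(min=1e-12)
    scale = FP8_E4M3_MAX / amax
    x_fp8 = (x * scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Undo the scaling after the (hypothetical) reduced-precision allreduce."""
    return x_fp8.to(torch.float32) / scale

if __name__ == "__main__":
    x = torch.randn(1024) * 3.0
    q, s = quantize_fp8(x)
    err = (dequantize_fp8(q, s) - x).abs().max()
    print(f"max abs round-trip error: {err.item():.4f}")  # small relative to the tensor's range
```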
RL framework integration
- AREAL, slime, veRL integration (sorted alphabetically)
- Faster weight sync
- Reproduce DeepSeek/Kimi + GRPO training
Multi-LoRA serving
- Major Roadmap: [Roadmap] Lora Support #2929
- Asynchronous LoRA updates: Support overlapped lora updates #8213
- Support Radix Attention: Support radix cache for Lora feature #7216
- GPU pinning: Support GPU pinning for LoRA #8697
- LRU cache: Implement LRU eviction policy for LoRA adapters #11041 (a toy LRU pool sketch follows this list)
- Performance improvements Improve LoRA Perf by Deprecating FlashInfer and Eliminating Redundant Tensor Ops #8940
- Support embedding layers Feat: support LoRA for embedding layer #8222
- Serve LoRA adapters with different ranks using unified paging
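To make the LRU eviction item above concrete, here is a toy adapter-slot pool built on `collections.OrderedDict`. The class name, slot contents, and eviction trigger are hypothetical; the real pool manages GPU weight buffers, pinning, and asynchronous loading.

```python
# Toy LRU pool for LoRA adapter slots: loading a new adapter into a full pool
# evicts the least recently used one. Illustration only.
from collections import OrderedDict
from typing import Optional

class LoRAAdapterLRUPool:
    def __init__(self, max_slots: int):
        self.max_slots = max_slots
        self._slots: "OrderedDict[str, dict]" = OrderedDict()  # adapter name -> loaded weights

    def get(self, name: str) -> Optional[dict]:
        """Return a loaded adapter and mark it most recently used."""
        if name not in self._slots:
            return None
        self._slots.move_to_end(name)
        return self._slots[name]

    def put(self, name: str, weights: dict) -> Optional[str]:
        """Load an adapter, evicting the least recently used one if the pool is
        full. Returns the evicted adapter's name, if any."""
        evicted = None
        if name not in self._slots and len(self._slots) >= self.max_slots:
            evicted, _ = self._slots.popitem(last=False)  # least recently used
        self._slots[name] = weights
        self._slots.move_to_end(name)
        return evicted

if __name__ == "__main__":
    pool = LoRAAdapterLRUPool(max_slots=2)
    pool.put("adapter-a", {"rank": 8})
    pool.put("adapter-b", {"rank": 16})
    pool.get("adapter-a")                       # "adapter-a" becomes most recently used
    print(pool.put("adapter-c", {"rank": 8}))   # evicts "adapter-b"
```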
Hardware
- Better CI coverage for AMD
- Support Ascend NPU [Roadmap] Supporting Ascend NPU on 2025 H2 #8004
- H20-specific optimizations [Roadmap] Kimi-K2 performance enhancement on H20 GPU #8151
- Support TPU https://github.com/sgl-project/sglang-jax
- Support Intel XPU
- Support more hardware with multi-framework [Roadmap] Supporting multi frameworks on 2025 H2 #8199
Model coverage
- Day 0 support for all upcoming OSS models
- Multi-modal models #
- Language models
- Hybrid Mamba/linear attention models Qwen3-Next support #10233 model: Support Hybrid Mamba2 NemotronHForCausalLM (nvidia/NVIDIA-Nemotron-Nano-9B-v2) #10909
API layer
- Provide a gRPC interface layer Roadmap: SGLang Router #10341 [router] 0.2.2 release #11986
- Rewrite the API layer (FastAPI, tokenizer manager) in sgl-router Roadmap: SGLang Router #10341 [router] 0.2.2 release #11986
- Support all advanced APIs (e.g., OpenAI Responses API, MCP integration) Roadmap: SGLang Router #10341 [router] 0.2.2 release #11986