Dynamo 0.4.0 Release Notes
Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models across any framework, architecture, or deployment scale. It is an open-source project under the Apache 2.0 license, available for installation via pip wheels and as containers from NVIDIA NGC.
As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:
- NVIDIA TensorRT-LLM
- vLLM
- SGLang
Major Features and Improvements
Increasing Framework Support
vLLM Updates
- Added E2E integration tests (#1935) and multimodal example with Llama4 Maverick (#1990)
- Prefill-aware routing for improved performance (#1895)
- Configurable namespace support for vLLM examples (#1909)
- Routing via ApproxKvIndexer with the use_kv_events flag (#1869)
- Updated all vLLM examples to the new UX (#1756)
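The ApproxKvIndexer routing above is based on KV-cache-aware routing: requests are steered toward workers likely to already hold the prompt's prefix in their KV cache. The following is a minimal, self-contained sketch of that idea; the names here (ApproxRouter, block_hashes, BLOCK_SIZE) are illustrative and are not Dynamo's actual API.

```python
# Illustrative sketch of approximate KV-cache-aware routing.
# Hypothetical names; not Dynamo's real implementation.
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (assumed for this sketch)


def block_hashes(tokens):
    """Hash fixed-size token blocks so shared prefixes map to stable keys."""
    hashes = []
    usable = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, usable, BLOCK_SIZE):
        block = ",".join(map(str, tokens[i:i + BLOCK_SIZE]))
        hashes.append(hashlib.sha256(block.encode()).hexdigest())
    return hashes


class ApproxRouter:
    """Route each request to the worker with the most cached prefix blocks."""

    def __init__(self, workers):
        # Track which block hashes each worker is assumed to have cached.
        self.cache = {w: set() for w in workers}

    def route(self, tokens):
        hashes = block_hashes(tokens)
        # Pick the worker whose recorded cache overlaps this request most.
        best = max(
            self.cache,
            key=lambda w: sum(h in self.cache[w] for h in hashes),
        )
        # Assume the chosen worker now caches these blocks.
        self.cache[best].update(hashes)
        return best
```

A repeated prompt (or one sharing a long prefix) is routed back to the same worker, which is the effect prefix-aware routing aims for.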
SGLang Updates
TRT-LLM Updates
- New speculative decoding example: Llama-4 + Eagle-3 (#1828)
Routing Performance
UX Updates
Migration to New Python UX
CLI and Packaging Enhancements
Kubernetes Deployment UX
Examples & Docs Overhaul
Deployment, Kubernetes, and CLI
Helm and Graph Deployments
Planner and Profiling
Performance and Observability
Structured Logging Improvements
- Enhanced structured JSONL logs with span start/close events, trace ID/span ID injection, duration formatting in microseconds, and improved context capture for distributed tracing workflows (PR #2061).
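As a concrete illustration of what such span logs can look like, here is a minimal stdlib-only sketch that emits one JSON object per line (JSONL) with trace/span IDs and a duration in microseconds. The field names (trace_id, span_id, duration_us) are assumptions for this example; Dynamo's actual log schema may differ.

```python
# Minimal sketch of structured JSONL span logging with trace/span IDs.
# Field names are illustrative, not Dynamo's actual schema.
import json
import time
import uuid


def span_event(name, event, trace_id, span_id, started=None):
    """Build one JSONL record for a span start or close event."""
    record = {
        "name": name,
        "event": event,  # "span_start" or "span_close"
        "trace_id": trace_id,
        "span_id": span_id,
    }
    if event == "span_close" and started is not None:
        # Duration reported in microseconds.
        record["duration_us"] = round((time.monotonic() - started) * 1e6)
    return json.dumps(record)  # one JSON object per line -> JSONL


trace_id = uuid.uuid4().hex
span_id = uuid.uuid4().hex[:16]
t0 = time.monotonic()
print(span_event("prefill", "span_start", trace_id, span_id))
print(span_event("prefill", "span_close", trace_id, span_id, started=t0))
```

Because every record carries the same trace_id, a downstream collector can stitch the start/close pair (and spans from other processes) into a single distributed trace.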
Tokenizer & Runtime
Metrics
Bug Fixes
- Fixed GPU resource specifications in LLM deployments (#1812)
- Corrected vLLM, SGLang, and TRTLLM deployment issues, including container builds, runtime packaging, and helm chart updates (#1942, #2062, #1825)
- Addressed port conflicts, deterministic port assignments, and health check improvements (#1937, #1996)
- Improved error handling for empty message lists and invalid configurations (#2067, #2071)
- Fixed nil pointer dereference issues in the Dynamo controller (#2299, #2335)
- Locked dependencies to avoid breaking changes (e.g., Triton 3.4.0 w/ TRT-LLM 1.0.0) (#2233)
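One common way to get the deterministic port assignment mentioned above is to derive a stable port from an instance identifier by hashing it into a fixed range, so restarts of the same instance reclaim the same port. This is a generic sketch of that idea, not Dynamo's actual scheme; PORT_BASE and PORT_RANGE are assumed values.

```python
# Generic sketch of deterministic port assignment by hashing an instance
# ID into a fixed range. Not Dynamo's actual implementation.
import hashlib

PORT_BASE = 8000    # assumed start of the port range
PORT_RANGE = 1000   # assumed size of the port range


def deterministic_port(instance_id: str) -> int:
    """Map an instance ID to a stable port in [PORT_BASE, PORT_BASE + PORT_RANGE)."""
    digest = hashlib.sha256(instance_id.encode()).digest()
    offset = int.from_bytes(digest[:4], "big") % PORT_RANGE
    return PORT_BASE + offset
```

The same instance ID always yields the same port, which avoids the races and collisions that ad hoc ephemeral-port picking can cause.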
Documentation
Guides and Examples
Docs Restructuring
Build, CI, and Test
- Added support for SGLang runtime image builds (#1770)
- Optional TRT-LLM dependency and custom build support (#2113)
- New end-to-end router tests with mockers (#2073)
- Fixed vLLM builds for Blackwell GPUs (#2020)
Release Assets
Python Wheels:
Rust Crates:
Containers:
- nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.4.0
- nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.4.0
- nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.4.0
- nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.4.0
Helm Charts:
Open Issues
- The x86 TRT-LLM container image is not compatible out of the box with B200 GPUs; the dev container still works on B200/GB200
Contributors
We welcome the following new contributors in this release:
@umang-kedia-hpe, @Ethan-ES, @messiaen, @galletas1712, @mc-nv, @zaristei, @jhaotingc, @saurabh-nvidia.
For the full list of changes, see the changelog.