Dynamo Release v0.4.0

@dmitry-tokarev-nv dmitry-tokarev-nv released this 12 Aug 06:29
· 3 commits to release/0.4.0 since this release
73bcc3b

Dynamo 0.4.0 Release Notes

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models—across any framework, architecture, or deployment scale. It's an open-source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Major Features and Improvements

Increasing Framework Support

  • vLLM Updates

    • Added E2E integration tests (#1935) and a multimodal example with Llama4 Maverick (#1990)
    • Prefill-aware routing for improved performance (#1895)
    • Configurable namespace support for vLLM examples (#1909)
    • Routing via ApproxKvIndexer with use_kv_events flag (#1869)
    • Updated all vLLM examples to new UX (#1756)
  • SGLang Updates

    • Receive KV metrics from scheduler (#1789)
    • Disaggregated deployment examples (#2137)
    • Launch and deploy examples added (#2068)
  • TRT-LLM Updates

    • New speculative decoding example: Llama-4 + Eagle-3 (#1828)
  • Routing Performance

    • Removed router hot-path lock for faster request handling (#1963)
    • Added radix tree dumps as router events (#2057)

UX Updates

  • Migration to New Python UX

    • Updated all Python launch flows to the new UX structure (#2003), including refactoring vLLM backend integration (#1983).
    • Removed outdated examples that relied on the old UX (#1899).
  • CLI and Packaging Enhancements

    • Added Python bindings for Dynamo CLI tools (#1799).
    • Updated Python packaging to align with the new UX (#2054).
    • Introduced a Python frontend/ingress node for easier deployment integration (#1912).
    • Added a convenience script to uninstall Dynamo Deploy CRDs (#1933).
  • Kubernetes Deployment UX

    • Enhanced Helm chart flexibility:
      • Added ability to override any podSpec property (#2116).
      • Enabled Helm upgrade via deploy script for smoother iteration (#1936).
      • Added Grove scheduling support to the graph Helm chart (#1954).
    • Introduced Kubernetes deployment examples for vLLM, SGLang, and TRT-LLM (#2062, #2133).
    • New Hello World Kubernetes deployment example (#1854).
  • Examples & Docs Overhaul

    • Hello World Python binding example (#2083).
    • Documentation updated for UX (#2070), reorganized example READMEs (#2174), and refactored core README structure (#2141).

Deployment, Kubernetes, and CLI

  • Helm and Graph Deployments

    • Liveness/readiness probes in graph Helm chart (#1888)
    • Added ability to override any podSpec property (#2116)
    • Support for Grove scheduling in Helm (#1954)
  • Planner and Profiling

    • Deploy SLA profiler and SLA planner to Kubernetes (#2030, #2135)

Performance and Observability

  • Structured Logging Improvements

    • Enhanced structured JSONL logs with span start/close events, trace ID/span ID injection, duration formatting in microseconds, and improved context capture for distributed tracing workflows (#2061)
  • Tokenizer & Runtime

    • De-tokenize performance improved by ~50% (#1868)
    • Runtime now uses all available parallelism (#1858)
  • Metrics

    • Hierarchical Prometheus metrics registry (#2008)
    • Generic ingress handler metrics (#2090)
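
To make the structured logging item above concrete, here is a minimal, stdlib-only Python sketch of JSONL span logging: one record when a span starts, one when it closes, shared trace and span IDs, and a duration in microseconds. It is an illustration only, not Dynamo's implementation. The field names (`event`, `trace_id`, `span_id`, `duration_us`) and the `span` helper are assumptions chosen for this example and may not match the format introduced in #2061.

```python
import io
import json
import time
import uuid
from contextlib import contextmanager


@contextmanager
def span(name, out, trace_id=None):
    """Write one JSONL record when the span starts and one when it closes.

    All field names here are hypothetical, picked for illustration.
    """
    trace_id = trace_id or uuid.uuid4().hex   # 32 hex characters
    span_id = uuid.uuid4().hex[:16]           # 16 hex characters
    base = {"name": name, "trace_id": trace_id, "span_id": span_id}
    start = time.perf_counter()
    out.write(json.dumps({"event": "span_start", **base}) + "\n")
    try:
        # Hand the trace ID to the caller so child spans can reuse it.
        yield trace_id
    finally:
        duration_us = int((time.perf_counter() - start) * 1_000_000)
        record = {"event": "span_close", **base, "duration_us": duration_us}
        out.write(json.dumps(record) + "\n")


def run_demo():
    """Log a parent span with one nested child and return the parsed records."""
    buf = io.StringIO()
    with span("handle_request", buf) as trace_id:
        with span("tokenize", buf, trace_id=trace_id):
            pass
    return [json.loads(line) for line in buf.getvalue().splitlines()]


if __name__ == "__main__":
    for record in run_demo():
        print(json.dumps(record))
```

Because every record carries the same `trace_id`, a log consumer can group the lines for one request and pair each start with its close by `span_id`, even when spans from many requests are interleaved in the same file.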

Bug Fixes

  • Fixed GPU resource specifications in LLM deployments (#1812)
  • Corrected vLLM, SGLang, and TRT-LLM deployment issues, including container builds, runtime packaging, and Helm chart updates (#1942, #2062, #1825)
  • Resolved port conflicts, made port assignments deterministic, and improved health checks (#1937, #1996)
  • Improved error handling for empty message lists and invalid configurations (#2067, #2071)
  • Fixed nil pointer dereference issues in the Dynamo controller (#2299, #2335)
  • Locked dependencies to avoid breaking changes (e.g., Triton 3.4.0 w/ TRT-LLM 1.0.0) (#2233)

Documentation

  • Guides and Examples

    • New hello world Python binding example (#2083)
    • Added multinode, disaggregated, and Grove deployment guides (#2155, #2086)
    • Added AKS/EKS deployment guides (#2080)
  • Docs Restructuring

    • Updated for new Python UX (#2070)
    • Refactored README and reorganized examples (#2141, #2174)

Build, CI, and Test

  • Added support for SGLang runtime image builds (#1770)
  • Optional TRT-LLM dependency and custom build support (#2113)
  • New end-to-end router tests with mockers (#2073)
  • Fixed vLLM builds for Blackwell GPUs (#2020)

Release Assets

Python Wheels:

Rust Crates:

Containers:

Helm Charts:


Open Issues

  • The x86 TRT-LLM container image is not compatible with B200 out of the box. The dev container still works for B200/GB200.

Contributors

We welcome new contributors in this release:
@umang-kedia-hpe, @Ethan-ES, @messiaen, @galletas1712, @mc-nv, @zaristei, @jhaotingc, @saurabh-nvidia.

For the full list of changes, see the changelog.