The open-source orchestration layer above inference engines. Dynamo turns a cluster of GPUs running SGLang, TensorRT-LLM, or vLLM into a coordinated multi-node inference system with disaggregated serving, KV-aware routing, multi-tier caching, and SLA-driven autoscaling.
Dynamo is built around three cooperating concerns: a fast Request Plane for token generation, a responsive Control Plane for scaling and placement, and a resilient Storage & Events Plane for KV reuse and failure recovery. This separation allows each concern to evolve and scale independently.
- Keep TTFT and ITL predictable under bursty, mixed-length traffic. Disaggregated serving isolates compute-bound prefill from memory-bound decode.
- Independently scale prefill and decode pools so each runs on hardware tuned for its workload. No wasted cycles on idle phases.
- KV-aware routing with radix-tree indexing eliminates redundant prefill. Multi-tier KVBM extends effective cache capacity far beyond HBM.
From the architecture.md design doc: a disaggregated inference request passes through nine steps, with KV handoff between prefill and decode negotiated via nvext hints (disaggregated_params).
The KV Router is implemented in Rust (lib/kv-router/src/) and is the critical decision point
for every inference request. It tracks two metrics per worker: potential active blocks
(decode load) and potential new prefill blocks (tokens requiring fresh computation).
These feed a cost function that minimizes redundant prefill while balancing decode load.
The KVIndexer maintains a global radix (prefix) tree built from worker-reported KV events.
Each node in the tree stores a set of worker IDs that have cached that particular block. From
lib/kv-router/src/indexer/radix_tree.rs:
```rust
// lib/kv-router/src/indexer/radix_tree.rs
pub(crate) struct RadixBlock {
    /// Child blocks, keyed by local block hash
    pub children: FxHashMap<LocalBlockHash, SharedRadixBlock>,
    /// Workers that have this block cached
    pub workers: FxHashSet<WorkerWithDpRank>,
    /// External sequence block hash (None for root)
    pub block_hash: Option<ExternalSequenceBlockHash>,
    /// Recency buffer for frequency-based decisions
    pub recent_uses: VecDeque<Instant>,
}

pub struct RadixTree {
    pub root: SharedRadixBlock,
    /// Per-worker O(1) lookup: worker -> (block_hash -> block)
    pub lookup: FxHashMap<WorkerWithDpRank,
                          FxHashMap<ExternalSequenceBlockHash, SharedRadixBlock>>,
    pub expiration_duration: Option<Duration>,
}
```
The find_matches method traverses the tree for a token sequence, tracking which workers share each prefix depth. Workers "drop out" as the traversal descends past the point where their cached blocks end. The algorithm supports early exit when only a single worker remains, plus per-block frequency tracking with TTL-based expiration for approximate mode.
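The traversal described above can be sketched in a few lines of Python. This is a hedged, simplified model of the matching logic (the real implementation is the Rust RadixTree); the `Node` type and dictionary shapes are assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    workers: set = field(default_factory=set)     # workers that have this block cached
    children: dict = field(default_factory=dict)  # local block hash -> child Node

def find_matches(root: Node, block_hashes):
    """Walk the prefix tree; a worker's overlap is the depth of its deepest cached block."""
    overlap = {}  # worker -> number of consecutive prefix blocks cached
    node = root
    for depth, h in enumerate(block_hashes, start=1):
        child = node.children.get(h)
        if child is None:
            break  # no worker has the prefix beyond this depth
        for w in child.workers:
            overlap[w] = depth  # workers absent at this depth "drop out", keeping their last depth
        node = child
    return overlap
```

A worker cached at depth 1 but not depth 2 keeps an overlap of 1, which is exactly the "drop out" behavior the selector later consumes as overlap scores.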
With --router-event-threads 1, a single-threaded RadixTree with TTL/pruning support is used. With N > 1 (default: 4), a ConcurrentRadixTree uses sticky worker routing for per-worker event serialization while allowing concurrent reads.
The core selection logic in lib/kv-router/src/scheduling/selector.rs computes a cost
for each worker and selects the minimum:
```rust
// lib/kv-router/src/scheduling/selector.rs - DefaultWorkerSelector
let get_score = |worker: WorkerWithDpRank| -> f64 {
    let overlap = overlaps.get(&worker).unwrap_or(&0);
    let prefill_token = prefill_tokens.get(&worker).unwrap_or(&isl);
    let potential_prefill_block = *prefill_token as f64 / block_size as f64;
    let decode_block = decode_blocks.get(&worker).unwrap_or(&...); // default elided in source
    // The core cost function: weighted prefill work plus decode load
    overlap_weight * potential_prefill_block + decode_block
};
```
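To make the cost function concrete, here is a hedged Python transcription with a numeric walkthrough. Deriving remaining prefill tokens as ISL minus cached overlap is an assumption for illustration (the real selector reads precomputed per-worker maps), and all numbers are invented:

```python
def worker_cost(isl, block_size, overlap_blocks, decode_blocks, overlap_weight=1.0):
    # Assumed derivation: tokens still needing fresh prefill after cache reuse
    prefill_tokens = isl - overlap_blocks * block_size
    potential_prefill_blocks = prefill_tokens / block_size
    # Lower is better: weighted prefill work plus current decode load
    return overlap_weight * potential_prefill_blocks + decode_blocks

# Worker A has far more cached overlap, so it wins despite higher decode load:
cost_a = worker_cost(isl=1024, block_size=64, overlap_blocks=12, decode_blocks=5)  # 4 + 5 = 9
cost_b = worker_cost(isl=1024, block_size=64, overlap_blocks=2, decode_blocks=3)   # 14 + 3 = 17
```

The router picks the minimum-cost worker, so heavy cache reuse can outweigh a moderately busier decode queue.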
When router_queue_threshold is set, the router maintains a priority queue with pluggable policies
(from lib/kv-router/src/scheduling/policy.rs):
| Policy | Key Formula | Optimizes For | Source |
|---|---|---|---|
| FCFS (default) | priority_jump - arrival_offset | Tail TTFT -- no request waits longer than necessary | Rust |
| LCFS | priority_jump + arrival_offset | Favors newer arrivals (experiment/comparison) | Rust |
| WSPT | (1 + priority_jump) / new_tokens | Average TTFT via Smith's rule (1956); short or high-priority requests first | Rust |
For WSPT, new_tokens = ISL - max_overlap * block_size, using the maximum overlap across all workers. This approximates the realized overlap, since the downstream selector routes to the best-overlap worker. Short requests with high cache hits jump ahead of long ones.
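A minimal sketch of the WSPT priority key, under the stated formula (function name and numbers are illustrative, not from the source):

```python
def wspt_key(isl, max_overlap_blocks, block_size, priority_jump=0.0):
    # new_tokens approximates the prefill work actually remaining after cache reuse
    new_tokens = isl - max_overlap_blocks * block_size
    return (1 + priority_jump) / new_tokens  # larger key => dequeued earlier (Smith's rule)

# A short, well-cached request outranks a long, cold one:
short_hot = wspt_key(isl=512, max_overlap_blocks=6, block_size=64)   # 128 new tokens remain
long_cold = wspt_key(isl=8192, max_overlap_blocks=0, block_size=64)  # 8192 new tokens remain
```

Since the key is inversely proportional to remaining work, the 128-token job gets a key 64x larger than the 8192-token job.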
| Mode | Mechanism | Best For |
|---|---|---|
| NATS Core / Local Indexer (default) | Fire-and-forget pub/sub. Workers maintain local radix trees. Gap detection via monotonic event IDs. | Low latency, simple setup, single-router |
| JetStream (--durable-kv-events) | Persistent NATS stream with durable consumers. State snapshots in NATS Object Store. | Production multi-replica consistency |
KVBM is the multi-tier cache manager that extends effective KV cache capacity far beyond GPU HBM.
Implemented in Rust (lib/kvbm-physical/, lib/kvbm-logical/, lib/kvbm-common/),
it manages a 4-tier memory hierarchy with asynchronous offload/onboard operations coordinated through NIXL.
From kvbm-design.md, each KV block follows a strict lifecycle managed via RAII handles:
| State | Description | Valid Transitions |
|---|---|---|
| Reset | Block uninitialized or recycled. Held in InactivePool, reusable. | init_sequence(salt_hash) --> Partial |
| Partial | Being filled with tokens. Owned by sequence creator thread. | add_token() stays Partial; commit() --> Complete; reset() --> Reset |
| Complete | Fully filled but not yet visible for reuse by other requests. | register() --> Registered; reset() --> Reset |
| Registered | Finalized and visible in the dedup cache. Shared ownership. | Auto drop() triggers Remove event --> Reset |
The KvBlockManager<H, D> orchestrates across memory tiers by managing per-backend block pools:
```
// Conceptual structure from kvbm-design.md
KvBlockManager<H, D> owns:
  - BlockPool<Device>   // GPU-resident blocks (G1)
  - BlockPool<Host>     // CPU pinned-memory blocks (G2)
  - NixlAgent           // Remote communication + memory sharing
  - BlockSetRegistry    // Remote lookup + import/export metadata

Each BlockPool<T> tracks:
  - ActivePool          // Blocks currently in use by sequences
  - InactivePool        // Recycled blocks ready for allocation (free list)
```
Block memory layout is [num_layers][page_size x inner_dim] per block, with block_stride = align_up(num_layers * layer_stride, alignment).
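The stride arithmetic can be sketched directly. The alignment value and dtype width below are assumptions for illustration (the source only gives the align_up formula):

```python
def align_up(x: int, alignment: int) -> int:
    return -(-x // alignment) * alignment  # smallest multiple of `alignment` >= x

def block_stride(num_layers, page_size, inner_dim, dtype_bytes, alignment=256):
    # Bytes for one layer's [page_size x inner_dim] slab
    layer_stride = page_size * inner_dim * dtype_bytes
    return align_up(num_layers * layer_stride, alignment)

# Hypothetical config: 32 layers, 16-token pages, inner dim 128, fp16 (2 bytes)
stride = block_stride(num_layers=32, page_size=16, inner_dim=128, dtype_bytes=2)
```

Aligning the per-block stride lets every block start at a transfer-friendly boundary, which matters once NIXL moves whole blocks between tiers.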
The TransferManager is an asynchronous orchestrator with per-path queues:

| Path | Behavior |
|---|---|
| Device --> Host (offload) | Triggered by the connector scheduler. CUDA D2H copy or custom kernel. Host pool registers the new immutable block, deduped by sequence_hash. |
| Host --> Device (onboard) | Brings a host block back into GPU memory. CUDA H2D copy. Device pool registers the new immutable block for reuse. |
| Host --> Disk (offload) | NIXL Write via POSIX; GPUDirect Storage when available. Also supports network FS (NFS/Lustre/GPFS) for G4 remote storage. |
| Disk --> Device (onboard) | Direct disk-to-GPU via NIXL Read, bypassing CPU when GDS is available. Fastest cold-onboard path for large KV blocks. |
PublishHandle triggers a StoreEvent. When the handle is dropped (eviction or end-of-life),
a RemoveEvent is automatically published to the Dynamo Event Plane (NATS or ZMQ). This ensures
consistent cross-worker cache visibility without explicit deallocation logic.
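The RAII publish/remove pairing can be modeled with a context manager. This is a hedged Python analogy for the Rust handle semantics, not the actual API; the tuple event shape is invented:

```python
class PublishHandle:
    """Store event on creation; Remove event when the handle is dropped/closed."""

    def __init__(self, block_hash, publish):
        self.block_hash = block_hash
        self.publish = publish
        publish(("StoreEvent", block_hash))  # block becomes visible for reuse

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Mirrors Rust's Drop: eviction or end-of-life always emits the Remove event
        self.publish(("RemoveEvent", self.block_hash))

events = []
with PublishHandle("blk-42", events.append):
    pass  # block is visible cluster-wide while the handle is alive
```

Because the Remove event is tied to handle lifetime rather than explicit deallocation calls, remote indexes cannot be left pointing at evicted blocks.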
The Planner is Dynamo's autoscaling controller, implemented in Python (components/src/dynamo/planner/).
It supports two scaling modes: throughput-based (profiling data + traffic prediction) and
load-based (real-time engine metrics + online regression). Both aim to meet TTFT and ITL
SLA targets while minimizing total cost of ownership (TCO).
The throughput-based planner runs a scaling loop every adjustment_interval seconds (default: 180s):

1. Queries Prometheus for avg TTFT, ITL, request count, avg ISL, and avg OSL from the Frontend's /metrics endpoint.
2. Computes correction factors: prefill_correction = actual_ttft / expected_ttft and decode_correction = actual_itl / expected_itl. These adapt profiling-based predictions to real-world behavior (request queueing, prefix cache hits, chunked prefill effects).
3. Predicts next_num_req, next_isl, next_osl using one of four predictor algorithms (see below).
4. Applies new replica counts via connector.set_component_replicas(), non-blocking. Supports KubernetesConnector (patches DGD resources) or VirtualConnector (writes to distributed runtime for external orchestrators).

From planner-design.md and planner_core.py:
```python
# Prefill replicas (single-batched, linear correction effect)
predicted_load = next_requests * next_isl / interval * min(1, prefill_correction)
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)

# Decode replicas (correction applied to ITL SLA target)
corrected_itl = target_itl / decode_correction_factor
throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu(
    itl=corrected_itl,
    context_length=next_isl + next_osl / 2,
)
decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)
```
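A worked example of the prefill formula, as a self-contained sketch (all numbers are hypothetical, chosen only to make the arithmetic checkable):

```python
import math

def prefill_replicas(next_requests, next_isl, interval_s, prefill_correction,
                     tokens_per_gpu_per_s, gpus_per_engine):
    # Predicted prefill token rate, damped by the correction factor (capped at 1)
    predicted_load = next_requests * next_isl / interval_s * min(1, prefill_correction)
    return math.ceil(predicted_load / tokens_per_gpu_per_s / gpus_per_engine)

# 600 requests of ISL 2048 over a 180s interval, correction 0.9,
# profiled at 4000 prefill tokens/s per GPU, 1 GPU per engine:
replicas = prefill_replicas(600, 2048, 180, 0.9, 4000, 1)
# predicted_load = 600 * 2048 / 180 * 0.9 = 6144 tok/s -> ceil(6144 / 4000) = 2
```

The min(1, ...) cap means an optimistic correction factor never shrinks capacity below the raw profiling-based estimate.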
Four predictor implementations in load_predictor.py, all extending BasePredictor:
| Predictor | Algorithm | Best For | Min Data Points |
|---|---|---|---|
| Constant | next = current | Stable workloads, long intervals | 1 |
| ARIMA | Auto-ARIMA (pmdarima) with optional log1p transform. Auto-fallback from raw to log1p if model collapses to (0,d,0). | Trending / seasonal patterns | 5 |
| Kalman | Local linear trend Kalman filter (filterpy). Starts after --kalman-min-points observations. | Bursty traffic | configurable |
| Prophet | Facebook Prophet time-series model. Handles complex seasonality. | Complex seasonal patterns | multiple intervals |
Idle-period handling lives in BasePredictor, which skips leading zeros (the idle period after deployment) until the first non-zero datapoint arrives from live traffic. This prevents cold-start artifacts from distorting predictions. Predictors also support warm-starting from trace files via --load-predictor-warmup-trace.
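The leading-zero skip is simple but easy to get wrong (zeros after traffic starts are real data and must be kept). A minimal sketch, with an invented class name since the source only names BasePredictor:

```python
class IdleSkippingBuffer:
    """Drop leading zeros from the post-deployment idle period; keep all later data."""

    def __init__(self):
        self.data = []
        self.seen_traffic = False

    def add(self, value):
        if not self.seen_traffic and value == 0:
            return  # still idle: ignore cold-start zeros
        self.seen_traffic = True
        self.data.append(value)  # zeros during live traffic are legitimate lulls

buf = IdleSkippingBuffer()
for v in [0, 0, 0, 5, 0, 7]:
    buf.add(v)
# buf.data is now [5, 0, 7]: the mid-stream zero survives
```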
Uses ForwardPassMetrics (FPM) from the Dynamo event plane for real-time scaling without
profiling data. Three specialized regression models in fpm_regression.py:
| Model | Regression | Estimates |
|---|---|---|
| PrefillRegressionModel | 1D: sum_prefill_tokens --> wall_time | TTFT via simulated chunked prefill scheduling |
| DecodeRegressionModel | 1D: sum_decode_kv_tokens --> wall_time | ITL for total decode load (scheduled + queued) |
| AggRegressionModel | 2D: (prefill_tokens, decode_kv_tokens) --> wall_time | Both TTFT and ITL with piggybacked decode/prefill |
Scaling decisions: scale up if ALL engines' estimated latency exceeds SLA; scale down if ALL are below SLA * sensitivity. Only +/-1 per interval with pending-desired guard.
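The ALL-engines rule can be expressed as a tiny decision function. This is a hedged sketch of the stated policy (the sensitivity value and function name are assumptions):

```python
def scale_decision(estimated_latencies, sla, sensitivity=0.8):
    """Return +1 (scale up), -1 (scale down), or 0 (hold) for one engine pool."""
    if all(lat > sla for lat in estimated_latencies):
        return 1   # every engine is violating SLA: add one replica
    if all(lat < sla * sensitivity for lat in estimated_latencies):
        return -1  # every engine has comfortable slack: remove one replica
    return 0       # mixed signals: hold, avoiding oscillation

# With SLA = 20ms and sensitivity 0.8 (threshold 16ms):
# [25, 30] -> +1, [10, 12] -> -1, [10, 25] -> 0
```

Requiring unanimity in both directions, plus the +/-1-per-interval cap, keeps the controller from thrashing on a single hot or idle engine.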
When both modes are enabled, throughput-based scaling (longer interval) sets a lower bound on replicas while load-based scaling (shorter interval) handles real-time adjustments above that floor. This gives the best of both worlds: capacity planning from profiling data and reactive adaptation from live metrics.
Prefill and decode have fundamentally different compute characteristics: prefill is compute-bound (processes all input tokens in one forward pass) while decode is memory-bandwidth-bound (generates one token at a time). Disaggregating them into specialized GPU pools allows each to scale independently, using optimal tensor-parallel (TP) configurations for each phase.
The PrefillRouter (from disagg-serving.md) orchestrates the full flow:

1. The router selects a prefill worker using KV-aware routing (cache overlap + load) or simple load balancing. KV overlap scores from the RadixTree indexer determine which prefill worker can reuse the most cached blocks.
2. The prefill worker computes the KV cache and returns disaggregated_params containing backend-specific transfer metadata. The KV cache lives in the prefill worker's GPU memory.
3. The router injects the prefill result into the decode request and routes it to a decode worker, selected based on available KV capacity and current load.
4. The decode worker coordinates with the prefill worker for direct GPU-to-GPU transfer via NIXL (NVLink, InfiniBand/UCX). Non-blocking -- GPU forward passes continue during transfer.
| Backend | Metadata Format | Transfer Behavior |
|---|---|---|
| SGLang | bootstrap_info (host, port, room_id) | RDMA bootstrap coordination. Prefill runs as a background task -- decode begins immediately while KV transfer proceeds in parallel. |
| vLLM | kv_transfer_params (block IDs, worker connection) | Synchronous prefill. Decode waits for prefill to complete. |
| TensorRT-LLM | opaque_state (serialized internal metadata) | Synchronous prefill. Decode waits for prefill to complete. |
Dynamo supports runtime-reconfigurable xPyD (x prefill workers, y decode workers). Workers can be added or removed without downtime: each new worker registers its RuntimeConfig (including KV capacity), and the router auto-discovers it via the discovery service.
NIXL is the data transfer fabric that enables high-speed KV cache movement across workers and memory
domains. In Dynamo's Rust codebase (lib/memory/src/nixl.rs, lib/kvbm-physical/src/transfer/executor/nixl.rs),
NIXL provides a unified interface over heterogeneous transports (NVLink, InfiniBand/UCX, PCIe, POSIX I/O).
```rust
// lib/memory/src/nixl.rs

/// Trait for storage types that can be registered with NIXL.
pub trait NixlCompatible {
    /// Returns (ptr, size, mem_type, device_id)
    fn nixl_params(&self) -> (*const u8, usize, MemType, u64);
}

/// NIXL descriptor for memory region registration.
pub struct NixlDescriptor {
    pub addr: u64,         // Base address
    pub size: usize,       // Region size in bytes
    pub mem_type: MemType, // host / device / etc.
    pub device_id: u64,    // GPU index for device memory
}
```
The NIXL transfer builder in lib/kvbm-physical/src/transfer/executor/nixl.rs uses
Rust's typestate pattern for compile-time safety -- all required parameters (source layout, destination
layout, block IDs, transfer strategy) must be set before a transfer can be executed:
```rust
// lib/kvbm-physical/src/transfer/executor/nixl.rs
pub struct NixlTransferBuilder<'a, TSrc, TDst, TSrcBlocks, TDstBlocks, TStrategy> {
    src: Option<&'a PhysicalLayout>,
    dst: Option<&'a PhysicalLayout>,
    src_block_ids: Option<&'a [BlockId]>,
    dst_block_ids: Option<&'a [BlockId]>,
    strategy: Option<TransferStrategy>,
    // Phantom markers: Unset -> Set at compile time
}
```
From the KVBM design doc, the bidirectional protocol for cross-worker KV transfer:
1. Each worker creates a NixlAgent and registers its memory regions (device memory) via nixl_register(). NIXL creates remote-accessible descriptors bound to the memory layout.
2. Workers exchange a SerializedNixlBlockLayout containing LayoutConfig (num_layers, page_size, inner_dim, dtype), BlockSetID, base address + stride, and device ID + memory type. This bridges TP mismatches (e.g., TP=4 vs TP=8).
3. FullyContiguous::serialize() encodes physical memory descriptors (address, size, VRAM/DRAM type); deserialize() rehydrates them into a remote memory view with correct offsets. This enables correct gather-scatter across different system configurations.
4. Registration returns a RegistrationHandle. On drop, an automatic Remove event is published, deregistering the block from NIXL and removing it from the remote block registry. This prevents stale memory access and dangling pointers.
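The "rehydrate into a remote memory view with correct offsets" step is just stride arithmetic over the exchanged layout. A hedged sketch, with invented field names standing in for the serialized layout's contents:

```python
from dataclasses import dataclass

@dataclass
class SerializedLayout:
    base_addr: int     # remote base address from the exchanged layout
    block_stride: int  # bytes between consecutive blocks
    layer_stride: int  # bytes between layers within a block

def remote_addr(layout: SerializedLayout, block_id: int, layer: int = 0) -> int:
    """Compute the remote address of one layer slab of one block in the rehydrated view."""
    return layout.base_addr + block_id * layout.block_stride + layer * layout.layer_stride

# Block 2, layer 3 of a hypothetical remote layout:
addr = remote_addr(SerializedLayout(base_addr=0x1000, block_stride=4096, layer_stride=128),
                   block_id=2, layer=3)
```

Both sides computing addresses from the same serialized strides is what lets NIXL issue one-sided reads/writes without any per-block coordination.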
From docs/features/agentic_workloads.md: Agentic LLM inference is dominated by KV-cache
storage and I/O rather than computation. Without leveraging the predictable structure of agent lifecycles,
significant optimizations are left on the table. Dynamo bridges this gap with agentic hints
that flow from the harness through the router to the KV cache manager.
Hints are carried in the request body under nvext on chat completions. The frontend
parses them and passes them to the KV router and backends.
| Hint | Location | Description |
|---|---|---|
| priority | nvext.agent_hints | Unified request priority. Higher values move the request earlier in the router queue (via priority_jump in all scheduling policies) and are forwarded to the backend for scheduling + priority-based eviction. |
| osl | nvext.agent_hints | Expected output sequence length. Used by the router for output block tracking and load-balancing accuracy when --router-track-output-blocks is enabled. |
| speculative_prefill | nvext.agent_hints | After the assistant turn completes, prefills the predicted next-turn prefix to warm the KV cache. Up to ~3x TTFT improvement on turns 2+. |
| cache_control | nvext.cache_control | TTL-based KV cache pinning. Pinned prefixes resist eviction for the specified duration, demoting to host memory rather than deletion. |
| program_id | nvext.agent_hints | (Planned) Identifies the agentic program for program-level metrics and cache affinity. |
| context_type | nvext.agent_hints | (Planned) Semantic type (system prompt, tool definition, reasoning branch) for context-aware eviction. |
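A hypothetical chat-completions payload carrying these hints might look as follows. The field names follow the table above, but the exact value shapes (e.g. the ttl encoding) are illustrative assumptions, not a confirmed wire format:

```python
import json

# Illustrative request body: hints ride alongside standard OpenAI-style fields
request = {
    "model": "Qwen/Qwen3-0.6B",
    "messages": [{"role": "user", "content": "Plan the next tool call."}],
    "nvext": {
        "agent_hints": {
            "priority": 5,               # jump ahead in the router queue
            "osl": 256,                  # expected output length for load tracking
            "speculative_prefill": True, # warm the next-turn prefix after this turn
        },
        "cache_control": {"ttl": 300},   # pin the prefix for 5 minutes (assumed shape)
    },
}
body = json.dumps(request)
```

The frontend parses nvext, feeds agent_hints into the router's scheduling policies, and forwards them to the backend.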
From the router config (lib/kv-router/src/scheduling/config.rs), cache control is enabled
via router_enable_cache_control. When enabled, after generation completes the router calls
a pin_prefix hook (with TTL) to the worker's cache_control service mesh endpoint. Pinned
nodes resist eviction in SGLang's HiCache, demoting to host memory rather than being deleted.
```rust
// From lib/kv-router/src/scheduling/config.rs

/// Enable cache control (PIN with TTL) via the worker's
/// cache_control service mesh endpoint.
pub router_enable_cache_control: bool, // default: false
```
Speculative prefill requests populate nvext.cache_control with the derived TTL automatically.
Future work: TTL could be context-type-aware -- think tokens get lower TTL than system
prompts and tool definitions.
Dynamo is supported directly in LangChain via the NVIDIA AI Endpoints integration. Configure the chat model to use the Dynamo endpoint and pass agent hints directly from the LangChain client.
| Feature | SGLang | TensorRT-LLM | vLLM |
|---|---|---|---|
| Priority-based cache eviction | Available | In progress | In progress |
| Cache pinning (TTL) | Available | In progress | -- |
| Cache prefetching | In progress | -- | -- |
| Speculative prefill | Available | Available | Available |
| Priority-aware routing | Available | Available | Available |
Dynamo 1.0 shipped March 2026 as a production-ready release with strong community adoption. The release spans zero-config deployment, agentic inference, multimodal E/P/D, video generation, and K8s Inference Gateway integration.
BETA: Specify model, HW, and SLA in one YAML. AIConfigurator auto-profiles, the Planner optimizes topology, and Dynamo deploys.

```yaml
apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
spec:
  model: Qwen/Qwen3-0.6B
  backend: vllm
  sla: { ttft: 200.0, itl: 20.0 }
  autoApply: true
```
- Agentic inference: Per-request hints for latency priority, expected output length, and cache pinning TTL. LangChain + NeMo Agent Toolkit integrations with speculative prefill for ~3x TTFT on turns 2+.
- Multimodal E/P/D: Disaggregated encode/prefill/decode with embedding cache -- 30% faster TTFT on image workloads. Block hashing incorporates mm_hash for multimodal objects.
- Video generation: Native FastVideo + SGLang Diffusion support. Real-time 1080p on a single B200.
- K8s Inference Gateway: KV-aware routing inside the standard Kubernetes gateway. EPP plugin with skip_initial_worker_wait for external worker management.
- Object storage tiering: S3/Azure blob support for the G4 tier + global KV events for cluster-wide cache visibility. Event Plane broadcasts StoreEvent / RemoveEvent for smart tiering.
| Result | Model | Hardware | Context |
|---|---|---|---|
| 750x higher throughput | DeepSeek-R1 | GB300 NVL72 | InferenceXv2 benchmark (SemiAnalysis) |
| 7x higher throughput/GPU | DeepSeek R1 | GB200 NVL72 vs B200 | InferenceX benchmark (SemiAnalysis) |
| 7x faster model startup | DeepSeek-V3 | H200 | ModelExpress GPU-to-GPU weight streaming via NIXL/NVLink |
| 2x faster TTFT | Qwen3-Coder 480B | Multi-node | KV-aware routing (Baseten production benchmark) |
| 80% fewer SLA breaches | Various | Production | Planner autoscaling at 5% lower TCO (Alibaba APSARA 2025) |
| 10x inference speedup | Kimi K2 | GB200 | Moonshot AI production deployment |
| 19x faster TTFT | Various | Dell PowerScale + NIXL | Storage-to-GPU KV streaming via GDS |
| Feature | SGLang | TensorRT-LLM | vLLM |
|---|---|---|---|
| Disaggregated Serving | Available | Available | Available |
| KV-Aware Routing | Available | Available | Available |
| SLA-Based Planner | Available | Available | Available |
| KVBM | In progress | Available | Available |
| Multimodal | Available | Available | Available |
| Tool Calling | Available | Available | Available |
Key directories and files explored during this analysis of the ai-dynamo/dynamo repository:
| Path | Purpose |
|---|---|
lib/kv-router/src/scheduling/selector.rs | DefaultWorkerSelector: cost function, softmax sampling, tie-breaking |
lib/kv-router/src/scheduling/policy.rs | FCFS, LCFS, WSPT (Smith's rule) scheduling policies |
lib/kv-router/src/scheduling/config.rs | KvRouterConfig: all router parameters, defaults, cache control toggle |
lib/kv-router/src/indexer/radix_tree.rs | RadixTree: find_matches, block tracking, frequency-based TTL |
lib/kv-router/src/indexer/concurrent_radix_tree.rs | Thread-safe radix tree with sticky worker routing |
lib/kvbm-physical/src/transfer/executor/nixl.rs | NixlTransferBuilder: typestate pattern for safe cross-worker transfers |
lib/memory/src/nixl.rs | NixlCompatible trait, NixlDescriptor, NixlAgent wrappers |
lib/llm/src/block_manager/ | Block layouts, NIXL storage, transfer coordination |
| Path | Purpose |
|---|---|
components/src/dynamo/planner/utils/planner_core.py | Base planner, scaling loop, Prometheus metrics, connector integration |
components/src/dynamo/planner/utils/load_predictor.py | ARIMA, Kalman, Prophet, Constant predictors with idle-period handling |
components/src/dynamo/planner/utils/fpm_regression.py | Prefill/Decode/Agg regression models for load-based scaling |
components/src/dynamo/planner/utils/perf_interpolation.py | NPZ profiling data interpolation for throughput/latency mapping |
components/src/dynamo/planner/kubernetes_connector.py | K8s DGD PATCH for replica scaling via operator |
components/src/dynamo/planner/virtual_connector.py | VirtualConnector for non-K8s orchestrators |
| Path | Content |
|---|---|
docs/design-docs/architecture.md | Three-plane architecture, design goals, fault tolerance |
docs/design-docs/router-design.md | KVIndexer, cost function, event transport modes, inter-router sync |
docs/design-docs/kvbm-design.md | 4-tier hierarchy, block state machine, NIXL integration, framework connectors |
docs/design-docs/planner-design.md | Throughput + load-based scaling, predictor algorithms, connector design |
docs/design-docs/disagg-serving.md | PrefillRouter orchestration, backend-specific metadata, xPyD |
docs/design-docs/dynamo-flow.md | Full request flow with color-coded phases |
docs/features/agentic_workloads.md | nvext hints, cache pinning, speculative prefill, LangChain integration |