NVIDIA Dynamo v1.0 | Rust + Python | Apache 2.0 | Source Code Analysis

NVIDIA Dynamo
Datacenter-Scale Inference Stack

The open-source orchestration layer above inference engines. Dynamo turns a cluster of GPUs running SGLang, TensorRT-LLM, or vLLM into a coordinated multi-node inference system with disaggregated serving, KV-aware routing, multi-tier caching, and SLA-driven autoscaling.

750x higher throughput: DeepSeek-R1 on GB300 NVL72
7x faster model startup: ModelExpress weight streaming
2x faster TTFT: KV-aware routing (Baseten)
80% fewer SLA breaches: Planner autoscaling


Three-Plane Architecture

Dynamo is built around three cooperating concerns: a fast Request Plane for token generation, a responsive Control Plane for scaling and placement, and a resilient Storage & Events Plane for KV reuse and failure recovery. This separation allows each concern to evolve and scale independently.

Full System Architecture
- Request Plane: Client (HTTP/gRPC) -> Frontend (validate + tokenize) -> Smart Router (KV-aware cost function over a RadixTree indexer) -> Prefill workers (xP) and Decode workers (yD) on GPUs.
- Control Plane: Planner (SLA-driven scaling), K8s Operator (DGD reconciler), Grove topology scheduler, Prometheus metrics.
- Storage & Events Plane: KVBM (4-tier cache manager), NIXL (KV transfer), KV events over NATS/ZMQ.
- Memory hierarchy: G1 GPU HBM (active KV + hot cache, ~80 GB per GPU); G2 CPU pinned DRAM (offloaded blocks, page-locked); G3 local SSD (NVMe + GDS, NIXL POSIX I/O); G4 remote/cloud (S3/NFS/RDMA, opaque blob store).

Design Goals

Latency Stability

Keep TTFT and ITL predictable under bursty, mixed-length traffic. Disaggregated serving isolates compute-bound prefill from memory-bound decode.

GPU Efficiency

Independently scale prefill and decode pools so each runs on hardware tuned for its workload. No wasted cycles on idle phases.

Compute Reuse

KV-aware routing with radix-tree indexing eliminates redundant prefill. Multi-tier KVBM extends effective cache capacity far beyond HBM.

End-to-End Request Flow (Disaggregated Mode)

From the architecture.md design doc -- the nine steps of a disaggregated inference request:

1. Client sends request to Frontend -- HTTP POST to the OpenAI-compatible endpoint on port 8000.
2. Frontend validates and preprocesses -- applies the chat template, tokenizes, parses nvext hints.
3. Router selects a Prefill worker -- uses a cost function combining KV cache overlap and decode load.
4. Prefill computes KV cache -- a GPU forward pass generates key-value tensors for the prompt.
5. Prefill returns transfer metadata -- backend-specific disaggregated_params for KV handoff.
6. Router selects a Decode worker -- injects the prefill metadata into the decode request.
7. Decode receives KV via NIXL -- direct GPU-to-GPU transfer over NVLink or InfiniBand/UCX.
8. Decode streams tokens -- auto-regressive generation using the transferred KV state.
9. KV Events update cache visibility -- KVBM may offload/recall blocks based on reuse potential.
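Step 1 can be sketched from the client side. The body below follows the standard OpenAI chat-completions schema; the nvext hint names (priority, osl) come from the agentic-hints section later in this analysis, while the helper function and model name are purely illustrative:

```python
import json

def build_request(prompt: str, priority: int = 0, expected_osl: int = 256) -> dict:
    """Build an OpenAI-compatible chat request carrying Dynamo nvext hints.

    Hypothetical helper: only the nvext field names are taken from the docs.
    """
    return {
        "model": "Qwen/Qwen3-0.6B",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # decode workers stream tokens back (step 8)
        "nvext": {"agent_hints": {"priority": priority, "osl": expected_osl}},
    }

body = build_request("Summarize the design doc.", priority=5)
print(json.dumps(body, indent=2))
```

POSTing this body to the frontend on port 8000 would enter the nine-step flow above at step 1.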


Smart Router: KV-Aware Routing

The KV Router is implemented in Rust (lib/kv-router/src/) and is the critical decision point for every inference request. It tracks two metrics per worker: potential active blocks (decode load) and potential new prefill blocks (tokens requiring fresh computation). These feed a cost function that minimizes redundant prefill while balancing decode load.

Radix Tree KV Cache Indexer

The KVIndexer maintains a global radix (prefix) tree built from worker-reported KV events. Each node in the tree stores a set of worker IDs that have cached that particular block. From lib/kv-router/src/indexer/radix_tree.rs:

// lib/kv-router/src/indexer/radix_tree.rs

pub(crate) struct RadixBlock {
    /// Child blocks, keyed by local block hash
    pub children: FxHashMap<LocalBlockHash, SharedRadixBlock>,
    /// Workers that have this block cached
    pub workers: FxHashSet<WorkerWithDpRank>,
    /// External sequence block hash (None for root)
    pub block_hash: Option<ExternalSequenceBlockHash>,
    /// Recency buffer for frequency-based decisions
    pub recent_uses: VecDeque<Instant>,
}

pub struct RadixTree {
    pub root: SharedRadixBlock,
    /// Per-worker O(1) lookup: worker -> (block_hash -> block)
    pub lookup: FxHashMap<WorkerWithDpRank,
              FxHashMap<ExternalSequenceBlockHash, SharedRadixBlock>>,
    pub expiration_duration: Option<Duration>,
}

The find_matches method traverses the tree for a token sequence, tracking which workers share each prefix depth. Workers "drop out" as we descend where their cached blocks end. The algorithm supports early exit when only a single worker remains, and per-block frequency tracking with TTL-based expiration for approximate mode.
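The matching logic can be sketched in a few lines of Python: each node maps a block hash to a child and records which workers hold that block, and find_matches counts consecutive cached prefix blocks per worker, with workers dropping out where their cached prefix ends. Names here are illustrative, not the actual Rust API:

```python
from collections import defaultdict

class Node:
    def __init__(self):
        self.children = {}    # block_hash -> Node
        self.workers = set()  # workers that have this block cached

class PrefixTree:
    def __init__(self):
        self.root = Node()

    def insert(self, worker, block_hashes):
        node = self.root
        for h in block_hashes:
            node = node.children.setdefault(h, Node())
            node.workers.add(worker)

    def find_matches(self, block_hashes):
        overlaps = defaultdict(int)
        node, candidates = self.root, None
        for h in block_hashes:
            node = node.children.get(h)
            if node is None:
                break  # no worker caches this prefix any deeper
            # workers "drop out" where their cached blocks end
            candidates = node.workers if candidates is None else candidates & node.workers
            for w in candidates:
                overlaps[w] += 1
        return dict(overlaps)

tree = PrefixTree()
tree.insert("worker-1", ["a", "b"])
tree.insert("worker-2", ["a", "b", "c", "d"])
print(tree.find_matches(["a", "b", "c"]))  # {'worker-1': 2, 'worker-2': 3}
```

The overlap counts feed directly into the cost function described next.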

Two Indexer Backends: With --router-event-threads 1, a single-threaded RadixTree with TTL/pruning support is used. With N > 1 (default: 4), a ConcurrentRadixTree uses sticky worker routing for per-worker event serialization while allowing concurrent reads.

Cost Function and Worker Selection

The core selection logic in lib/kv-router/src/scheduling/selector.rs computes a cost for each worker and selects the minimum:

// lib/kv-router/src/scheduling/selector.rs - DefaultWorkerSelector

let get_score = |worker: WorkerWithDpRank| -> f64 {
    let overlap = overlaps.get(&worker).unwrap_or(&0);
    let prefill_token = prefill_tokens.get(&worker).unwrap_or(&isl);
    let potential_prefill_block = *prefill_token as f64 / block_size as f64;
    let decode_block = decode_blocks.get(&worker).unwrap_or(&...);

    // The core cost function:
    let logit = overlap_weight * potential_prefill_block + decode_block;
    logit
};
Routing Decision Flow
1. Incoming request: tokens[], nvext hints.
2. Block hashing: tokens / block_size.
3. RadixTree lookup: find_matches(block_hashes) returns {worker_id: overlap_score}.
4. Queue policy check: FCFS | LCFS | WSPT (Smith's rule); threshold: router_queue_threshold.
5. Cost function (per worker): cost = W * prefill_blocks + decode_blocks, where W = overlap_score_weight (default: 1.0).
6. Worker selection: T=0 uses argmin(cost) with a tree-size tiebreaker; T>0 uses softmax_sample(costs, T).

Example (overlap_score_weight = 1.0, 3 workers, 10-block request):
- Worker 1: 2 cached blocks -> prefill = (10-2) = 8 blocks, decode = 10 active blocks, cost = 1.0*8 + 10 = 18.0
- Worker 2: 5 cached blocks -> prefill = (10-5) = 5 blocks, decode = 5 active blocks, cost = 1.0*5 + 5 = 10.0 (SELECTED)
- Worker 3: 8 cached blocks -> prefill = (10-8) = 2 blocks, decode = 9 active blocks, cost = 1.0*2 + 9 = 11.0

Inter-router sync (multi-replica): router replicas synchronize active block state via NATS Core messaging (AddRequest | MarkPrefillCompleted | Free). Each event carries a unique router ID to prevent self-processing.
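The worked example above can be checked with a few lines of Python. This is a sketch of the T=0 argmin path only; temperature sampling and tie-breaking are omitted:

```python
def cost(cached_blocks, active_decode_blocks, request_blocks=10, w=1.0):
    """cost = W * prefill_blocks + decode_blocks (overlap_score_weight W)."""
    prefill_blocks = request_blocks - cached_blocks  # blocks needing fresh prefill
    return w * prefill_blocks + active_decode_blocks

workers = {
    "worker-1": cost(cached_blocks=2, active_decode_blocks=10),  # 18.0
    "worker-2": cost(cached_blocks=5, active_decode_blocks=5),   # 10.0
    "worker-3": cost(cached_blocks=8, active_decode_blocks=9),   # 11.0
}
selected = min(workers, key=workers.get)  # T=0: argmin(cost)
print(selected, workers[selected])        # worker-2 10.0
```

Note that worker 3 has the most cached blocks but loses on decode load, which is exactly the balance the cost function is designed to strike.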

Scheduling Policies

When router_queue_threshold is set, the router maintains a priority queue with pluggable policies (from lib/kv-router/src/scheduling/policy.rs):

Policy | Key Formula | Optimizes For | Source
FCFS (default) | priority_jump - arrival_offset | Tail TTFT -- no request waits longer than necessary | Rust
LCFS | priority_jump + arrival_offset | Favors newer arrivals (experiment/comparison) | Rust
WSPT | (1 + priority_jump) / new_tokens | Average TTFT via Smith's rule (1956); short or high-priority requests first | Rust
WSPT uses max overlap: The WSPT policy computes new_tokens = ISL - max_overlap * block_size, using the maximum overlap across all workers. This approximates the realized overlap since the downstream selector routes to the best-overlap worker. Short requests with high cache hits jump ahead of long ones.
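The WSPT approximation above can be sketched directly; field names and the max(1, ...) floor are illustrative choices, not the repo's exact code:

```python
def wspt_key(isl, max_overlap_blocks, block_size=16, priority_jump=0):
    """WSPT priority key: (1 + priority_jump) / new_tokens.

    new_tokens approximates realized prefill work as ISL minus the best
    cache overlap across all workers, converted from blocks to tokens.
    """
    new_tokens = max(1, isl - max_overlap_blocks * block_size)
    return (1 + priority_jump) / new_tokens

short_cached = wspt_key(isl=256, max_overlap_blocks=12)  # 64 new tokens
long_cold = wspt_key(isl=4096, max_overlap_blocks=0)     # 4096 new tokens
# A larger key schedules earlier: the short, well-cached request wins.
print(short_cached > long_cold)  # True
```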

KV Event Transport Modes

Mode | Mechanism | Best For
NATS Core / Local Indexer (default) | Fire-and-forget pub/sub. Workers maintain local radix trees. Gap detection via monotonic event IDs. | Low latency, simple setup, single-router
JetStream (--durable-kv-events) | Persistent NATS stream with durable consumers. State snapshots in NATS Object Store. | Production multi-replica consistency


KV Block Manager (KVBM)

KVBM is the multi-tier cache manager that extends effective KV cache capacity far beyond GPU HBM. Implemented in Rust (lib/kvbm-physical/, lib/kvbm-logical/, lib/kvbm-common/), it manages a 4-tier memory hierarchy with asynchronous offload/onboard operations coordinated through NIXL.

KV Block Manager: 4-Tier Memory Hierarchy
- G1 (GPU HBM): active KV blocks + hot prefix cache in HBM3e device memory. DeviceStorage backed by CUDA device buffers; pools split into ActivePool (in-use sequences) and InactivePool (free list); block states Reset -> Partial -> Complete -> Registered. ~80 GB per GPU, lowest latency.
- G2 (CPU pinned memory): page-locked host DRAM for efficient CUDA transfers and NIXL I/O (PinnedStorage backend). Receives device offloads (D2H), can onboard back to device (H2D), and offloads onward to disk; dedup by sequence_hash. TBs of DRAM, cross-node via NIXL.
- G3 (local NVMe SSD): NIXL descriptors expose file offsets/regions for zero-copy I/O and optional GPUDirect Storage (GDS); the disk-to-device onboard path bypasses the CPU via GDS when available. Also supports NFS/Lustre/GPFS mounted filesystems. Multi-TB capacity.
- G4 (remote/cloud): S3, Azure Blob, or RDMA-capable volumes. KVBM treats G4 as an opaque blob store via NIXL get()/put(); v1.0 adds S3/Azure blob support plus global KV events for cluster-wide cache visibility. Unlimited capacity, highest latency.

Block State Machine

From kvbm-design.md, each KV block follows a strict lifecycle managed via RAII handles:

State | Description | Valid Transitions
Reset | Block uninitialized or recycled. Held in InactivePool, reusable. | init_sequence(salt_hash) --> Partial
Partial | Being filled with tokens. Owned by the sequence creator thread. | add_token() stays Partial; commit() --> Complete; reset() --> Reset
Complete | Fully filled but not yet visible for reuse by other requests. | register() --> Registered; reset() --> Reset
Registered | Finalized and visible in the dedup cache. Shared ownership. | drop() auto-triggers a Remove event --> Reset
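The lifecycle can be written out as an explicit transition map; the operation names mirror the table (init_sequence, commit, register, reset), but this is an illustrative sketch, not the Rust API:

```python
# (state, operation) -> next state, per the block lifecycle table
TRANSITIONS = {
    ("Reset", "init_sequence"): "Partial",
    ("Partial", "add_token"): "Partial",
    ("Partial", "commit"): "Complete",
    ("Partial", "reset"): "Reset",
    ("Complete", "register"): "Registered",
    ("Complete", "reset"): "Reset",
    ("Registered", "drop"): "Reset",  # RAII drop publishes a Remove event
}

class Block:
    def __init__(self):
        self.state = "Reset"

    def apply(self, op):
        key = (self.state, op)
        if key not in TRANSITIONS:
            raise ValueError(f"invalid transition {op} from {self.state}")
        self.state = TRANSITIONS[key]
        return self.state

b = Block()
for op in ("init_sequence", "add_token", "commit", "register", "drop"):
    b.apply(op)
print(b.state)  # Reset: the block returned to the free list
```

Disallowed moves (e.g. committing a Reset block) raise, which is the safety property the strict lifecycle buys.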

KVBM Internal Architecture

The KvBlockManager<H, D> orchestrates across memory tiers by managing per-backend block pools:

// Conceptual structure from kvbm-design.md

KvBlockManager<H, D> owns:
  - BlockPool<Device>     // GPU-resident blocks (G1)
  - BlockPool<Host>       // CPU pinned-memory blocks (G2)
  - NixlAgent             // Remote communication + memory sharing
  - BlockSetRegistry      // Remote lookup + import/export metadata

Each BlockPool<T> tracks:
  - ActivePool     // Blocks currently in use by sequences
  - InactivePool   // Recycled blocks ready for allocation (free list)

Block memory layout: [num_layers][page_size x inner_dim]
  block_stride = align_up(num_layers * layer_stride, alignment)
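The stride arithmetic quoted above works out as follows, with made-up layout numbers (fp16, 256-byte alignment are assumptions for the example):

```python
def align_up(x: int, alignment: int) -> int:
    """Round x up to the next multiple of alignment."""
    return (x + alignment - 1) // alignment * alignment

num_layers, page_size, inner_dim = 32, 16, 1024
dtype_bytes, alignment = 2, 256  # e.g. fp16 elements, 256-byte alignment

layer_stride = page_size * inner_dim * dtype_bytes        # bytes per layer slab
block_stride = align_up(num_layers * layer_stride, alignment)
print(layer_stride, block_stride)  # 32768 1048576
```

Here 32 * 32768 bytes is already 256-byte aligned, so align_up is a no-op; with odd layer sizes it would pad the block up to the next aligned boundary.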

Transfer Manager and Data Flows

The TransferManager is an asynchronous orchestrator with per-path queues:

D2H -- Device --> Host (Offload): triggered by the connector scheduler. CUDA D2H copy or custom kernel. The host pool registers a new immutable block, deduped by sequence_hash.

H2D -- Host --> Device (Onboard): brings a host block back into GPU memory via CUDA H2D copy. The device pool registers the new immutable block for reuse.

H2K -- Host --> Disk (Offload): NIXL Write via POSIX; GPUDirect Storage when available. Also supports network filesystems (NFS/Lustre/GPFS) for G4 remote storage.

K2D -- Disk --> Device (Onboard): direct disk-to-GPU via NIXL Read, bypassing the CPU when GDS is available. Fastest cold-onboard path for large KV blocks.

RAII Event Integration: Block lifecycle is managed via RAII handles. When a block is registered, a PublishHandle triggers a StoreEvent. When the handle is dropped (eviction or end-of-life), a RemoveEvent is automatically published to the Dynamo Event Plane (NATS or ZMQ). This ensures consistent cross-worker cache visibility without explicit deallocation logic.
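The RAII event pattern can be sketched in Python with a context manager standing in for a Rust Drop impl: registration publishes a StoreEvent, and releasing the handle always publishes a RemoveEvent, even on early exit. The event "bus" here is just a list; all names are illustrative:

```python
from contextlib import contextmanager

events = []  # stand-in for the Dynamo Event Plane (NATS/ZMQ)

@contextmanager
def publish_handle(block_hash: str):
    """Registering a block announces it; dropping the handle retracts it."""
    events.append(("StoreEvent", block_hash))
    try:
        yield block_hash
    finally:
        # mirrors drop(): eviction or end-of-life always emits Remove
        events.append(("RemoveEvent", block_hash))

with publish_handle("blk-42"):
    pass  # block is visible for reuse while the handle is alive

print(events)  # [('StoreEvent', 'blk-42'), ('RemoveEvent', 'blk-42')]
```

The point of the pattern is that no explicit deallocation call can be forgotten: visibility and retraction are tied to handle lifetime.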


Planner: SLA-Driven Dynamic Scaling

The Planner is Dynamo's autoscaling controller, implemented in Python (components/src/dynamo/planner/). It supports two scaling modes: throughput-based (profiling data + traffic prediction) and load-based (real-time engine metrics + online regression). Both aim to meet TTFT and ITL SLA targets while minimizing total cost of ownership (TCO).

Throughput-Based Scaling Algorithm

1. Metric Collection: every adjustment_interval seconds (default: 180s), queries Prometheus for avg TTFT, ITL, request count, avg ISL, and avg OSL from the Frontend's /metrics endpoint.
2. Correction Factor: computes prefill_correction = actual_ttft / expected_ttft and decode_correction = actual_itl / expected_itl. These adapt profiling-based predictions to real-world behavior (request queueing, prefix cache hits, chunked prefill effects).
3. Load Prediction: forecasts next_num_req, next_isl, next_osl using one of four predictor algorithms (see below).
4. Replica Calculation: computes the required prefill and decode replicas using profiling-based interpolation and the correction factors.
5. Scaling Execution: calls connector.set_component_replicas() non-blocking. Supports KubernetesConnector (patches DGD resources) or VirtualConnector (writes to the distributed runtime for external orchestrators).

Replica Calculation Formulas

From planner-design.md and planner_core.py:

# Prefill replicas (single-batched, linear correction effect)
predicted_load = next_requests * next_isl / interval * min(1, prefill_correction)
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)

# Decode replicas (correction applied to ITL SLA target)
corrected_itl = target_itl / decode_correction_factor
throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu(
    itl=corrected_itl,
    context_length=next_isl + next_osl / 2
)
decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)
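A worked instance of the prefill-replica formula makes the units concrete. All numbers below are assumed for illustration (the interpolated throughput in particular is a placeholder, not a profiled value):

```python
import math

next_requests, next_isl = 120, 2000   # predicted requests and input length
interval = 180                        # adjustment_interval in seconds
prefill_correction = 0.9              # actual_ttft / expected_ttft
interpolated_throughput = 20_000      # prefill tokens/s per GPU (assumed)
gpus_per_engine = 1

# tokens/s of prefill demand, damped when TTFT is running ahead of target
predicted_load = next_requests * next_isl / interval * min(1, prefill_correction)
prefill_replicas = math.ceil(
    predicted_load / interpolated_throughput / gpus_per_engine
)
print(round(predicted_load), prefill_replicas)  # 1200 1
```

1200 prefill tokens/s of demand against 20,000 tokens/s per GPU rounds up to a single prefill replica; the decode formula follows the same shape with the ITL-corrected throughput lookup.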

Load Predictor Algorithms

Four predictor implementations in load_predictor.py, all extending BasePredictor:

Predictor | Algorithm | Best For | Min Data Points
Constant | next = current | Stable workloads, long intervals | 1
ARIMA | Auto-ARIMA (pmdarima) with optional log1p transform; auto-falls back from raw to log1p if the model collapses to (0,d,0) | Trending / seasonal patterns | 5
Kalman | Local linear trend Kalman filter (filterpy); starts after --kalman-min-points observations | Bursty traffic | configurable
Prophet | Facebook Prophet time-series model; handles complex seasonality | Complex seasonal patterns | multiple intervals
Idle Period Handling: All predictors inherit from BasePredictor which skips leading zeros (idle period after deployment) until the first non-zero datapoint from live traffic. This prevents cold-start artifacts from distorting predictions. Predictors also support warm-starting from trace files via --load-predictor-warmup-trace.
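The leading-zero skip can be sketched with the simplest predictor: datapoints before the first non-zero observation are dropped, while zeros after traffic has started are kept as real signal. Class and method names are illustrative:

```python
class ConstantPredictor:
    """Minimal 'next = current' predictor with idle-period handling."""

    def __init__(self):
        self.history = []
        self.seen_traffic = False

    def observe(self, value: float):
        if value == 0 and not self.seen_traffic:
            return  # still in the post-deployment idle period: drop it
        self.seen_traffic = True
        self.history.append(value)

    def predict(self) -> float:
        return self.history[-1] if self.history else 0.0

p = ConstantPredictor()
for v in [0, 0, 0, 40, 55, 0, 60]:  # the trailing zero IS kept
    p.observe(v)
print(p.history, p.predict())  # [40, 55, 0, 60] 60
```

Without the skip, the three leading zeros would drag any trend model toward zero at exactly the moment traffic arrives.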

Load-Based Scaling Mode

Uses ForwardPassMetrics (FPM) from the Dynamo event plane for real-time scaling without profiling data. Three specialized regression models in fpm_regression.py:

Model | Regression | Estimates
PrefillRegressionModel | 1D: sum_prefill_tokens --> wall_time | TTFT via simulated chunked-prefill scheduling
DecodeRegressionModel | 1D: sum_decode_kv_tokens --> wall_time | ITL for total decode load (scheduled + queued)
AggRegressionModel | 2D: (prefill_tokens, decode_kv_tokens) --> wall_time | Both TTFT and ITL with piggybacked decode/prefill

Scaling decisions: scale up if ALL engines' estimated latency exceeds SLA; scale down if ALL are below SLA * sensitivity. Only +/-1 per interval with pending-desired guard.
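The decision rule reduces to a few lines: scale up only when every engine breaches the SLA, scale down only when every engine sits comfortably below it, and move at most one replica either way. The sensitivity default below is an assumption for the sketch:

```python
def scale_delta(estimated_latencies, sla, sensitivity=0.8):
    """Return +1, -1, or 0 replicas per the all-engines rule above."""
    if all(lat > sla for lat in estimated_latencies):
        return +1   # every engine over SLA: add capacity
    if all(lat < sla * sensitivity for lat in estimated_latencies):
        return -1   # every engine well under SLA: shed capacity
    return 0        # mixed signals: hold steady

print(scale_delta([25.0, 31.0, 28.0], sla=20.0))  # 1  (all breach)
print(scale_delta([10.0, 12.0], sla=20.0))        # -1 (all below 16.0)
print(scale_delta([25.0, 12.0], sla=20.0))        # 0  (mixed)
```

Requiring agreement across all engines keeps one hot replica from triggering a scale-up the rest of the pool could absorb.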

Co-existence of Modes

When both modes are enabled, throughput-based scaling (longer interval) sets a lower bound on replicas while load-based scaling (shorter interval) handles real-time adjustments above that floor. This gives the best of both worlds: capacity planning from profiling data and reactive adaptation from live metrics.



Disaggregated Prefill/Decode Serving

Prefill and decode have fundamentally different compute characteristics: prefill is compute-bound (processes all input tokens in one forward pass) while decode is memory-bandwidth-bound (generates one token at a time). Disaggregating them into specialized GPU pools allows each to scale independently, using optimal tensor-parallel (TP) configurations for each phase.

PrefillRouter Orchestration

The PrefillRouter (from disagg-serving.md) orchestrates the full flow:

1. Worker Selection: the router selects a prefill worker using KV-aware routing (cache overlap + load) or simple load balancing. KV overlap scores from the RadixTree indexer determine which prefill worker can reuse the most cached blocks.
2. Prefill Execution: the prefill worker computes the KV cache and returns disaggregated_params containing backend-specific transfer metadata. The KV cache lives in the prefill worker's GPU memory.
3. Decode Routing: the router injects the prefill result into the decode request and routes it to a decode worker, selected based on available KV capacity and current load.
4. KV Transfer + Decode: the decode worker coordinates with the prefill worker for a direct GPU-to-GPU transfer via NIXL (NVLink, InfiniBand/UCX). Non-blocking -- GPU forward passes continue during the transfer.

Backend-Specific Transfer Metadata

Backend | Metadata Format | Transfer Behavior
SGLang | bootstrap_info (host, port, room_id) | RDMA bootstrap coordination. Prefill runs as a background task -- decode begins immediately while the KV transfer proceeds in parallel.
vLLM | kv_transfer_params (block IDs, worker connection) | Synchronous prefill. Decode waits for prefill to complete.
TensorRT-LLM | opaque_state (serialized internal metadata) | Synchronous prefill. Decode waits for prefill to complete.

Runtime-Reconfigurable xPyD

Dynamo supports runtime-reconfigurable xPyD (x prefill workers, y decode workers). Workers can be added or removed without downtime.

Performance Impact: On H100 with R1 Distilled Llama 70B FP8, disaggregated serving on 2 nodes significantly outperforms aggregated serving on 1 or 2 nodes, with superior Pareto curves for throughput vs. latency. The key insight: separating long prefills into dedicated engines prevents them from blocking ongoing decode operations.


NIXL: NVIDIA Interchange Library

NIXL is the data transfer fabric that enables high-speed KV cache movement across workers and memory domains. In Dynamo's Rust codebase (lib/memory/src/nixl.rs, lib/kvbm-physical/src/transfer/executor/nixl.rs), NIXL provides a unified interface over heterogeneous transports (NVLink, InfiniBand/UCX, PCIe, POSIX I/O).

NIXL Architecture in KVBM

// lib/memory/src/nixl.rs

/// Trait for storage types that can be registered with NIXL.
pub trait NixlCompatible {
    /// Returns (ptr, size, mem_type, device_id)
    fn nixl_params(&self) -> (*const u8, usize, MemType, u64);
}

/// NIXL descriptor for memory region registration.
pub struct NixlDescriptor {
    pub addr: u64,       // Base address
    pub size: usize,     // Region size in bytes
    pub mem_type: MemType, // host / device / etc.
    pub device_id: u64,  // GPU index for device memory
}

Typestate Transfer Builder

The NIXL transfer builder in lib/kvbm-physical/src/transfer/executor/nixl.rs uses Rust's typestate pattern for compile-time safety -- all required parameters (source layout, destination layout, block IDs, transfer strategy) must be set before a transfer can be executed:

// lib/kvbm-physical/src/transfer/executor/nixl.rs

pub struct NixlTransferBuilder<'a, TSrc, TDst, TSrcBlocks, TDstBlocks, TStrategy> {
    src: Option<&'a PhysicalLayout>,
    dst: Option<&'a PhysicalLayout>,
    src_block_ids: Option<&'a [BlockId]>,
    dst_block_ids: Option<&'a [BlockId]>,
    strategy: Option<TransferStrategy>,
    // Phantom markers: Unset -> Set at compile time
}

Remote Memory Registration Protocol

From the KVBM design doc, the bidirectional protocol for cross-worker KV transfer:

1. Agent Creation & Memory Registration: each worker sets up a NixlAgent and registers its memory regions (device memory) via nixl_register(). NIXL creates remote-accessible descriptors bound to the memory layout.
2. Metadata Exchange: workers exchange SerializedNixlBlockLayout containing the LayoutConfig (num_layers, page_size, inner_dim, dtype), BlockSetID, base address + stride, and device ID + memory type. This bridges TP mismatches (e.g., TP=4 vs TP=8).
3. Serialization / Deserialization: FullyContiguous::serialize() encodes physical memory descriptors (address, size, VRAM/DRAM type); deserialize() rehydrates them into a remote memory view with correct offsets, enabling correct gather-scatter across different system configurations.
4. Ownership & Lifetime: RAII-based RegistrationHandle. On drop, an automatic Remove event is published, deregistering the block from NIXL and removing it from the remote block registry. This prevents stale memory access and dangling pointers.
Dell + NIXL: Dell integrated PowerScale with NIXL, achieving 19x faster TTFT by enabling direct storage-to-GPU KV cache streaming, bypassing the CPU entirely via GPUDirect Storage.
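The metadata-exchange step can be sketched as a serializable layout descriptor that lets a peer compute remote block offsets. The field names follow the list above (num_layers, page_size, inner_dim, dtype, base address, stride, device ID, memory type), but the JSON wire format and the dataclass itself are illustrative, not NIXL's actual encoding:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class LayoutMetadata:
    num_layers: int
    page_size: int
    inner_dim: int
    dtype: str
    base_addr: int
    block_stride: int
    device_id: int
    mem_type: str  # "device" or "host"

    def block_addr(self, block_id: int) -> int:
        """Address of a block inside the peer's registered region."""
        return self.base_addr + block_id * self.block_stride

local = LayoutMetadata(32, 16, 1024, "fp16",
                       0x7F00_0000_0000, 1 << 20, 0, "device")
wire = json.dumps(asdict(local))             # serialize() on the owner
remote = LayoutMetadata(**json.loads(wire))  # deserialize() on the peer
print(hex(remote.block_addr(3)))             # offset into peer memory
```

Both sides now agree on where block 3 lives without sharing anything beyond the descriptor, which is what makes one-sided NIXL reads and writes possible.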


Agentic Features

From docs/features/agentic_workloads.md: Agentic LLM inference is dominated by KV-cache storage and I/O rather than computation. Without leveraging the predictable structure of agent lifecycles, significant optimizations are left on the table. Dynamo bridges this gap with agentic hints that flow from the harness through the router to the KV cache manager.

nvext API: Agentic Hints

Hints are carried in the request body under nvext on chat completions. The frontend parses them and passes them to the KV router and backends.

Hint | Location | Description
priority | nvext.agent_hints | Unified request priority. Higher values move the request earlier in the router queue (via priority_jump in all scheduling policies) and are forwarded to the backend for scheduling + priority-based eviction.
osl | nvext.agent_hints | Expected output sequence length. Used by the router for output block tracking and load-balancing accuracy when --router-track-output-blocks is enabled.
speculative_prefill | nvext.agent_hints | After the assistant turn completes, prefills the predicted next-turn prefix to warm the KV cache. Up to ~3x TTFT improvement on turns 2+.
cache_control | nvext.cache_control | TTL-based KV cache pinning. Pinned prefixes resist eviction for the specified duration, demoting to host memory rather than being deleted.
program_id | nvext.agent_hints | (Planned) Identifies the agentic program for program-level metrics and cache affinity.
context_type | nvext.agent_hints | (Planned) Semantic type (system prompt, tool definition, reasoning branch) for context-aware eviction.
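Frontend-side hint extraction can be sketched as pulling these fields out of the request body with safe defaults. Only the nvext field names come from the table above; the helper and the cache_control payload shape (ttl_s) are assumptions:

```python
def parse_hints(body: dict) -> dict:
    """Extract nvext agent hints with defaults for absent fields."""
    nvext = body.get("nvext", {})
    agent = nvext.get("agent_hints", {})
    return {
        "priority": agent.get("priority", 0),
        "osl": agent.get("osl"),  # None when the client gave no estimate
        "speculative_prefill": agent.get("speculative_prefill", False),
        "cache_control": nvext.get("cache_control"),  # e.g. {"ttl_s": 600}, shape assumed
    }

body = {"nvext": {"agent_hints": {"priority": 7, "osl": 512},
                  "cache_control": {"ttl_s": 600}}}
print(parse_hints(body))
```

A request with no nvext key at all parses to the neutral defaults, so hint-unaware clients are unaffected.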

Cache Pinning with TTL

From the router config (lib/kv-router/src/scheduling/config.rs), cache control is enabled via router_enable_cache_control. When enabled, after generation completes the router calls a pin_prefix hook (with TTL) to the worker's cache_control service mesh endpoint. Pinned nodes resist eviction in SGLang's HiCache, demoting to host memory rather than being deleted.

// From lib/kv-router/src/scheduling/config.rs

/// Enable cache control (PIN with TTL) via the worker's
/// cache_control service mesh endpoint.
pub router_enable_cache_control: bool,  // default: false
NeMo Agent Toolkit Integration: TTL is dynamically computed as the product of expected reuse count and inter-request time. The NAT profiler pre-computes these expectations during agent evaluations and injects nvext.cache_control with the derived TTL automatically. Future work: TTL could be context-type-aware -- think tokens get lower TTL than system prompts and tool definitions.
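The TTL rule quoted above (expected reuse count times inter-request time) is one line of arithmetic; the profiler numbers below are assumed for illustration:

```python
def pin_ttl_seconds(expected_reuses: float, inter_request_s: float) -> float:
    """TTL = expected reuse count * expected inter-request time."""
    return expected_reuses * inter_request_s

# e.g. a system prompt the NAT profiler expects to be reused ~8 times,
# with requests arriving ~45 s apart:
print(pin_ttl_seconds(8, 45.0))  # 360.0 seconds of pinning
```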

LangChain Integration

Dynamo is supported directly in LangChain via the NVIDIA AI Endpoints integration. Configure the chat model to use the Dynamo endpoint and pass agent hints directly from the LangChain client.

Feature Matrix by Backend

Feature | SGLang | TensorRT-LLM | vLLM
Priority-based cache eviction | Available | In progress | In progress
Cache pinning (TTL) | Available | In progress | --
Cache prefetching | In progress | -- | --
Speculative prefill | Available | Available | Available
Priority-aware routing | Available | Available | Available


v1.0 Features & Benchmarks

Dynamo 1.0 shipped March 2026 as a production-ready release with strong community adoption. The release spans zero-config deployment, agentic inference, multimodal E/P/D, video generation, and K8s Inference Gateway integration.

What's New in 1.0

Zero-Config Deploy (DGDR) [BETA]

Specify model, hardware, and SLA in one YAML. AIConfigurator auto-profiles, the Planner optimizes topology, and Dynamo deploys.

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
spec:
  model: Qwen/Qwen3-0.6B
  backend: vllm
  sla: { ttft: 200.0, itl: 20.0 }
  autoApply: true

Agentic Inference

Per-request hints for latency priority, expected output length, and cache pinning TTL. LangChain + NeMo Agent Toolkit integrations with speculative prefill for ~3x TTFT on turns 2+.

Multimodal E/P/D

Disaggregated encode/prefill/decode with an embedding cache -- 30% faster TTFT on image workloads. Block hashing incorporates mm_hash for multimodal objects.

Video Generation

Native FastVideo + SGLang Diffusion support. Real-time 1080p on a single B200.

K8s Inference Gateway

KV-aware routing inside the standard Kubernetes gateway. EPP plugin with skip_initial_worker_wait for external worker management.

Storage-Tier KV Offload

S3/Azure blob support for the G4 tier plus global KV events for cluster-wide cache visibility. The Event Plane broadcasts StoreEvent / RemoveEvent for smart tiering.

Key Benchmarks

Result | Model | Hardware | Context
750x higher throughput | DeepSeek-R1 | GB300 NVL72 | InferenceXv2 benchmark (SemiAnalysis)
7x higher throughput/GPU | DeepSeek R1 | GB200 NVL72 vs B200 | InferenceX benchmark (SemiAnalysis)
7x faster model startup | DeepSeek-V3 | H200 | ModelExpress GPU-to-GPU weight streaming via NIXL/NVLink
2x faster TTFT | Qwen3-Coder 480B | Multi-node | KV-aware routing (Baseten production benchmark)
80% fewer SLA breaches | Various | Production | Planner autoscaling at 5% lower TCO (Alibaba APSARA 2025)
10x inference speedup | Kimi K2 | GB200 | Moonshot AI production deployment
19x faster TTFT | Various | Dell PowerScale + NIXL | Storage-to-GPU KV streaming via GDS

Backend Support Matrix

Feature | SGLang | TensorRT-LLM | vLLM
Disaggregated Serving | Available | Available | Available
KV-Aware Routing | Available | Available | Available
SLA-Based Planner | Available | Available | Available
KVBM | In progress | Available | Available
Multimodal | Available | Available | Available
Tool Calling | Available | Available | Available


Codebase Map

Key directories and files explored during this analysis of the ai-dynamo/dynamo repository:

Rust Core (Performance-Critical)

Path | Purpose
lib/kv-router/src/scheduling/selector.rs | DefaultWorkerSelector: cost function, softmax sampling, tie-breaking
lib/kv-router/src/scheduling/policy.rs | FCFS, LCFS, WSPT (Smith's rule) scheduling policies
lib/kv-router/src/scheduling/config.rs | KvRouterConfig: all router parameters, defaults, cache control toggle
lib/kv-router/src/indexer/radix_tree.rs | RadixTree: find_matches, block tracking, frequency-based TTL
lib/kv-router/src/indexer/concurrent_radix_tree.rs | Thread-safe radix tree with sticky worker routing
lib/kvbm-physical/src/transfer/executor/nixl.rs | NixlTransferBuilder: typestate pattern for safe cross-worker transfers
lib/memory/src/nixl.rs | NixlCompatible trait, NixlDescriptor, NixlAgent wrappers
lib/llm/src/block_manager/ | Block layouts, NIXL storage, transfer coordination

Python Components (Extensibility)

Path | Purpose
components/src/dynamo/planner/utils/planner_core.py | Base planner, scaling loop, Prometheus metrics, connector integration
components/src/dynamo/planner/utils/load_predictor.py | ARIMA, Kalman, Prophet, Constant predictors with idle-period handling
components/src/dynamo/planner/utils/fpm_regression.py | Prefill/Decode/Agg regression models for load-based scaling
components/src/dynamo/planner/utils/perf_interpolation.py | NPZ profiling data interpolation for throughput/latency mapping
components/src/dynamo/planner/kubernetes_connector.py | K8s DGD PATCH for replica scaling via the operator
components/src/dynamo/planner/virtual_connector.py | VirtualConnector for non-K8s orchestrators

Design Documentation

Path | Content
docs/design-docs/architecture.md | Three-plane architecture, design goals, fault tolerance
docs/design-docs/router-design.md | KVIndexer, cost function, event transport modes, inter-router sync
docs/design-docs/kvbm-design.md | 4-tier hierarchy, block state machine, NIXL integration, framework connectors
docs/design-docs/planner-design.md | Throughput + load-based scaling, predictor algorithms, connector design
docs/design-docs/disagg-serving.md | PrefillRouter orchestration, backend-specific metadata, xPyD
docs/design-docs/dynamo-flow.md | Full request flow with color-coded phases
docs/features/agentic_workloads.md | nvext hints, cache pinning, speculative prefill, LangChain integration