NVIDIA Dynamo Deep Dive - Datacenter-Scale Inference Orchestration

Section 1

Three-Plane Architecture

Dynamo is built around three cooperating concerns: a fast Request Plane for token generation, a responsive Control Plane for scaling and placement, and a resilient Storage & Events Plane for KV reuse and failure recovery. This separation allows each concern to evolve and scale independently.

Full System Architecture

Design Goals

Latency Stability

Keep TTFT and ITL predictable under bursty, mixed-length traffic. Disaggregated serving isolates compute-bound prefill from memory-bound decode.

GPU Efficiency

Independently scale prefill and decode pools so each runs on hardware tuned for its workload. No wasted cycles on idle phases.

Compute Reuse

KV-aware routing with radix-tree indexing eliminates redundant prefill. Multi-tier KVBM extends effective cache capacity far beyond HBM.

End-to-End Request Flow (Disaggregated Mode)

From the architecture.md design doc -- the nine steps of a disaggregated inference request:

Client sends request to Frontend -- HTTP POST to the OpenAI-compatible endpoint on port 8000.

Frontend validates and preprocesses -- applies chat template, tokenizes, parses nvext hints.

Router selects a Prefill worker -- uses cost function combining KV cache overlap and decode load.

Prefill computes KV cache -- GPU forward pass generates key-value tensors for the prompt.

Prefill returns transfer metadata -- backend-specific disaggregated_params for KV handoff.

Router selects a Decode worker -- injects prefill metadata into the decode request.

Decode receives KV via NIXL -- direct GPU-to-GPU transfer using NVLink or InfiniBand/UCX.

Decode streams tokens -- auto-regressive generation using transferred KV state.

KV Events update cache visibility -- KVBM may offload/recall blocks based on reuse potential.

← Back to AI Infra Overview

Section 2

Smart Router: KV-Aware Routing

The KV Router is implemented in Rust (lib/kv-router/src/) and is the critical decision point for every inference request. It tracks two metrics per worker: potential active blocks (decode load) and potential new prefill blocks (tokens requiring fresh computation). These feed a cost function that minimizes redundant prefill while balancing decode load.

Radix Tree KV Cache Indexer

The KVIndexer maintains a global radix (prefix) tree built from worker-reported KV events. Each node in the tree stores a set of worker IDs that have cached that particular block. From lib/kv-router/src/indexer/radix_tree.rs:

// lib/kv-router/src/indexer/radix_tree.rs

pub(crate) struct RadixBlock {
    /// Child blocks, keyed by local block hash
    pub children: FxHashMap<LocalBlockHash, SharedRadixBlock>,
    /// Workers that have this block cached
    pub workers: FxHashSet<WorkerWithDpRank>,
    /// External sequence block hash (None for root)
    pub block_hash: Option<ExternalSequenceBlockHash>,
    /// Recency buffer for frequency-based decisions
    pub recent_uses: VecDeque<Instant>,
}

pub struct RadixTree {
    pub root: SharedRadixBlock,
    /// Per-worker O(1) lookup: worker -> (block_hash -> block)
    pub lookup: FxHashMap<WorkerWithDpRank,
              FxHashMap<ExternalSequenceBlockHash, SharedRadixBlock>>,
    pub expiration_duration: Option<Duration>,
}

The find_matches method traverses the tree for a token sequence, tracking which workers share each prefix depth. Workers "drop out" as we descend where their cached blocks end. The algorithm supports early exit when only a single worker remains, and per-block frequency tracking with TTL-based expiration for approximate mode.

Two Indexer Backends: With --router-event-threads 1, a single-threaded RadixTree with TTL/pruning support is used. With N > 1 (default: 4), a ConcurrentRadixTree uses sticky worker routing for per-worker event serialization while allowing concurrent reads.

Cost Function and Worker Selection

The core selection logic in lib/kv-router/src/scheduling/selector.rs computes a cost for each worker and selects the minimum:

// lib/kv-router/src/scheduling/selector.rs - DefaultWorkerSelector

let get_score = |worker: WorkerWithDpRank| -> f64 {
    let overlap = overlaps.get(&worker).unwrap_or(&0);
    let prefill_token = prefill_tokens.get(&worker).unwrap_or(&isl);
    let potential_prefill_block = prefill_token as f64 / block_size as f64;
    let decode_block = decode_blocks.get(&worker).unwrap_or(&...);

    // The core cost function:
    let logit = overlap_weight * potential_prefill_block + decode_block;
    logit
};

Routing Decision Flow

Scheduling Policies

When router_queue_threshold is set, the router maintains a priority queue with pluggable policies (from lib/kv-router/src/scheduling/policy.rs):

Policy	Key Formula	Optimizes For	Source
`FCFS` (default)	`priority_jump - arrival_offset`	Tail TTFT -- no request waits longer than necessary	Rust
`LCFS`	`priority_jump + arrival_offset`	Favors newer arrivals (experiment/comparison)	Rust
`WSPT`	`(1 + priority_jump) / new_tokens`	Average TTFT via Smith's rule (1956); short or high-priority requests first	Rust

WSPT uses max overlap: The WSPT policy computes new_tokens = ISL - max_overlap * block_size, using the maximum overlap across all workers. This approximates the realized overlap since the downstream selector routes to the best-overlap worker. Short requests with high cache hits jump ahead of long ones.

KV Event Transport Modes

Mode	Mechanism	Best For
NATS Core / Local Indexer (default)	Fire-and-forget pub/sub. Workers maintain local radix trees. Gap detection via monotonic event IDs.	Low latency, simple setup, single-router
JetStream (`--durable-kv-events`)	Persistent NATS stream with durable consumers. State snapshots in NATS Object Store.	Production multi-replica consistency

← Back to AI Infra Overview

Section 3

KV Block Manager (KVBM)

KVBM is the multi-tier cache manager that extends effective KV cache capacity far beyond GPU HBM. Implemented in Rust (lib/kvbm-physical/, lib/kvbm-logical/, lib/kvbm-common/), it manages a 4-tier memory hierarchy with asynchronous offload/onboard operations coordinated through NIXL.

KV Block Manager: 4-Tier Memory Hierarchy

Block State Machine

From kvbm-design.md, each KV block follows a strict lifecycle managed via RAII handles:

State	Description	Valid Transitions
`Reset`	Block uninitialized or recycled. Held in InactivePool, reusable.	`init_sequence(salt_hash)` --> Partial
`Partial`	Being filled with tokens. Owned by sequence creator thread.	`add_token()` stays Partial; `commit()` --> Complete; `reset()` --> Reset
`Complete`	Fully filled but not yet visible for reuse by other requests.	`register()` --> Registered; `reset()` --> Reset
`Registered`	Finalized and visible in the dedup cache. Shared ownership.	Auto `drop()` triggers Remove event --> Reset

KVBM Internal Architecture

The KvBlockManager<H, D> orchestrates across memory tiers by managing per-backend block pools:

// Conceptual structure from kvbm-design.md

KvBlockManager<H, D> owns:
  - BlockPool<Device>     // GPU-resident blocks (G1)
  - BlockPool<Host>       // CPU pinned-memory blocks (G2)
  - NixlAgent             // Remote communication + memory sharing
  - BlockSetRegistry      // Remote lookup + import/export metadata

Each BlockPool<T> tracks:
  - ActivePool     // Blocks currently in use by sequences
  - InactivePool   // Recycled blocks ready for allocation (free list)

Block memory layout: [num_layers][page_size x inner_dim]
  block_stride = align_up(num_layers * layer_stride, alignment)

Transfer Manager and Data Flows

The TransferManager is an asynchronous orchestrator with per-path queues:

D2H

Device --> Host (Offload)

Triggered by the connector scheduler. CUDA D2H copy or custom kernel. Host pool registers new immutable block, deduped by sequence_hash.

H2D

Host --> Device (Onboard)

Brings a host block back into GPU memory. CUDA H2D copy. Device pool registers the new immutable block for reuse.

H2K

Host --> Disk (Offload)

NIXL Write via POSIX; GPUDirect Storage when available. Also supports network FS (NFS/Lustre/GPFS) for G4 remote storage.

K2D

Disk --> Device (Onboard)

Direct disk-to-GPU via NIXL Read, bypassing CPU when GDS is available. Fastest cold-onboard path for large KV blocks.

RAII Event Integration: Block lifecycle is managed via RAII handles. When a block is registered, a PublishHandle triggers a StoreEvent. When the handle is dropped (eviction or end-of-life), a RemoveEvent is automatically published to the Dynamo Event Plane (NATS or ZMQ). This ensures consistent cross-worker cache visibility without explicit deallocation logic.

← Back to AI Infra Overview

Section 4

Planner: SLA-Driven Dynamic Scaling

The Planner is Dynamo's autoscaling controller, implemented in Python (components/src/dynamo/planner/). It supports two scaling modes: throughput-based (profiling data + traffic prediction) and load-based (real-time engine metrics + online regression). Both aim to meet TTFT and ITL SLA targets while minimizing total cost of ownership (TCO).

Throughput-Based Scaling Algorithm

Metric Collection: Every adjustment_interval seconds (default: 180s), queries Prometheus for avg TTFT, ITL, request count, avg ISL, and avg OSL from the Frontend's /metrics endpoint.

Correction Factor: Computes prefill_correction = actual_ttft / expected_ttft and decode_correction = actual_itl / expected_itl. These adapt profiling-based predictions to real-world behavior (request queueing, prefix cache hits, chunked prefill effects).

Load Prediction: Forecasts next_num_req, next_isl, next_osl using one of four predictor algorithms (see below).

Replica Calculation: Computes required prefill and decode replicas using profiling-based interpolation and correction factors.

Scaling Execution: Calls connector.set_component_replicas() non-blocking. Supports KubernetesConnector (patches DGD resources) or VirtualConnector (writes to distributed runtime for external orchestrators).

Replica Calculation Formulas

From planner-design.md and planner_core.py:

# Prefill replicas (single-batched, linear correction effect)
predicted_load = next_requests * next_isl / interval * min(1, prefill_correction)
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)

# Decode replicas (correction applied to ITL SLA target)
corrected_itl = target_itl / decode_correction_factor
throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu(
    itl=corrected_itl,
    context_length=next_isl + next_osl / 2
)
decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)

Load Predictor Algorithms

Four predictor implementations in load_predictor.py, all extending BasePredictor:

Predictor	Algorithm	Best For	Min Data Points
Constant	`next = current`	Stable workloads, long intervals	1
ARIMA	Auto-ARIMA (`pmdarima`) with optional `log1p` transform. Auto-fallback from raw to log1p if model collapses to (0,d,0).	Trending / seasonal patterns	5
Kalman	Local linear trend Kalman filter (`filterpy`). Starts after `--kalman-min-points` observations.	Bursty traffic	configurable
Prophet	Facebook Prophet time-series model. Handles complex seasonality.	Complex seasonal patterns	multiple intervals

Idle Period Handling: All predictors inherit from BasePredictor which skips leading zeros (idle period after deployment) until the first non-zero datapoint from live traffic. This prevents cold-start artifacts from distorting predictions. Predictors also support warm-starting from trace files via --load-predictor-warmup-trace.

Load-Based Scaling Mode

Uses ForwardPassMetrics (FPM) from the Dynamo event plane for real-time scaling without profiling data. Three specialized regression models in fpm_regression.py:

Model	Regression	Estimates
`PrefillRegressionModel`	1D: sum_prefill_tokens --> wall_time	TTFT via simulated chunked prefill scheduling
`DecodeRegressionModel`	1D: sum_decode_kv_tokens --> wall_time	ITL for total decode load (scheduled + queued)
`AggRegressionModel`	2D: (prefill_tokens, decode_kv_tokens) --> wall_time	Both TTFT and ITL with piggybacked decode/prefill

Scaling decisions: scale up if ALL engines' estimated latency exceeds SLA; scale down if ALL are below SLA * sensitivity. Only +/-1 per interval with pending-desired guard.

Co-existence of Modes

When both modes are enabled, throughput-based scaling (longer interval) sets a lower bound on replicas while load-based scaling (shorter interval) handles real-time adjustments above that floor. This gives the best of both worlds: capacity planning from profiling data and reactive adaptation from live metrics.

← Back to AI Infra Overview

Section 5

Disaggregated Prefill/Decode Serving

Prefill and decode have fundamentally different compute characteristics: prefill is compute-bound (processes all input tokens in one forward pass) while decode is memory-bandwidth-bound (generates one token at a time). Disaggregating them into specialized GPU pools allows each to scale independently, using optimal tensor-parallel (TP) configurations for each phase.

PrefillRouter Orchestration

The PrefillRouter (from disagg-serving.md) orchestrates the full flow:

Worker Selection

Router selects a prefill worker using KV-aware routing (cache overlap + load) or simple load balancing. KV overlap scores from the RadixTree indexer determine which prefill worker can reuse the most cached blocks.

Prefill Execution

Prefill worker computes KV cache and returns disaggregated_params containing backend-specific transfer metadata. The KV cache lives in the prefill worker's GPU memory.

Decode Routing

Router injects prefill result into decode request, routes to decode worker. The decode worker is selected based on available KV capacity and current load.

KV Transfer + Decode

Decode worker coordinates with prefill worker for direct GPU-to-GPU transfer via NIXL (NVLink, InfiniBand/UCX). Non-blocking -- GPU forward passes continue during transfer.

Backend-Specific Transfer Metadata

Backend	Metadata Format	Transfer Behavior
SGLang	`bootstrap_info` (host, port, room_id)	RDMA bootstrap coordination. Prefill runs as background task -- decode begins immediately while KV transfer proceeds in parallel.
vLLM	`kv_transfer_params` (block IDs, worker connection)	Synchronous prefill. Decode waits for prefill to complete.
TensorRT-LLM	`opaque_state` (serialized internal metadata)	Synchronous prefill. Decode waits for prefill to complete.

Runtime-Reconfigurable xPyD

Dynamo supports runtime-reconfigurable xPyD (x prefill workers, y decode workers). Workers can be added or removed without downtime:

Add worker: Registers with discovery service and publishes RuntimeConfig (including KV capacity). Router auto-discovers via discovery service.
Remove worker: Drains active requests and deregisters from discovery. Router stops routing new requests to it.

Performance Impact: On H100 with R1 Distilled Llama 70B FP8, disaggregated serving on 2 nodes significantly outperforms aggregated serving on 1 or 2 nodes, with superior Pareto curves for throughput vs. latency. The key insight: separating long prefills into dedicated engines prevents them from blocking ongoing decode operations.

← Back to AI Infra Overview

Section 6

NIXL: NVIDIA Interchange Library

NIXL is the data transfer fabric that enables high-speed KV cache movement across workers and memory domains. In Dynamo's Rust codebase (lib/memory/src/nixl.rs, lib/kvbm-physical/src/transfer/executor/nixl.rs), NIXL provides a unified interface over heterogeneous transports (NVLink, InfiniBand/UCX, PCIe, POSIX I/O).

NIXL Architecture in KVBM

// lib/memory/src/nixl.rs

/// Trait for storage types that can be registered with NIXL.
pub trait NixlCompatible {
    /// Returns (ptr, size, mem_type, device_id)
    fn nixl_params(&self) -> (*const u8, usize, MemType, u64);
}

/// NIXL descriptor for memory region registration.
pub struct NixlDescriptor {
    pub addr: u64,       // Base address
    pub size: usize,     // Region size in bytes
    pub mem_type: MemType, // host / device / etc.
    pub device_id: u64,  // GPU index for device memory
}

Typestate Transfer Builder

The NIXL transfer builder in lib/kvbm-physical/src/transfer/executor/nixl.rs uses Rust's typestate pattern for compile-time safety -- all required parameters (source layout, destination layout, block IDs, transfer strategy) must be set before a transfer can be executed:

// lib/kvbm-physical/src/transfer/executor/nixl.rs

pub struct NixlTransferBuilder<'a, TSrc, TDst, TSrcBlocks, TDstBlocks, TStrategy> {
    src: Option<&'a PhysicalLayout>,
    dst: Option<&'a PhysicalLayout>,
    src_block_ids: Option<&'a [BlockId]>,
    dst_block_ids: Option<&'a [BlockId]>,
    strategy: Option<TransferStrategy>,
    // Phantom markers: Unset -> Set at compile time
}

Remote Memory Registration Protocol

From the KVBM design doc, the bidirectional protocol for cross-worker KV transfer:

Agent Creation & Memory Registration: Each worker sets up a NixlAgent, registers its memory regions (device memory) via nixl_register(). NIXL creates remote-accessible descriptors bound to the memory layout.

Metadata Exchange: Workers exchange SerializedNixlBlockLayout containing LayoutConfig (num_layers, page_size, inner_dim, dtype), BlockSetID, base address + stride, device ID + memory type. This bridges TP mismatches (e.g., TP=4 vs TP=8).

Serialization / Deserialization: FullyContiguous::serialize() encodes physical memory descriptors (address, size, VRAM/DRAM type). deserialize() rehydrates into a remote memory view with correct offsets. Enables correct gather-scatter across different system configurations.

Ownership & Lifetime: RAII-based RegistrationHandle. On drop, automatic Remove event is published, deregistering the block from NIXL and removing it from the remote block registry. Prevents stale memory access and dangling pointers.

Dell + NIXL: Dell integrated PowerScale with NIXL, achieving 19x faster TTFT by enabling direct storage-to-GPU KV cache streaming, bypassing the CPU entirely via GPUDirect Storage.

← Back to AI Infra Overview

Section 7

Agentic Features

From docs/features/agentic_workloads.md: Agentic LLM inference is dominated by KV-cache storage and I/O rather than computation. Without leveraging the predictable structure of agent lifecycles, significant optimizations are left on the table. Dynamo bridges this gap with agentic hints that flow from the harness through the router to the KV cache manager.

nvext API: Agentic Hints

Hints are carried in the request body under nvext on chat completions. The frontend parses them and passes them to the KV router and backends.

Hint	Location	Description
`priority`	`nvext.agent_hints`	Unified request priority. Higher values move the request earlier in the router queue (via `priority_jump` in all scheduling policies) and are forwarded to the backend for scheduling + priority-based eviction.
`osl`	`nvext.agent_hints`	Expected output sequence length. Used by the router for output block tracking and load-balancing accuracy when `--router-track-output-blocks` is enabled.
`speculative_prefill`	`nvext.agent_hints`	After assistant turn completes, prefills the predicted next-turn prefix to warm the KV cache. Up to ~3x TTFT improvement on turns 2+.
`cache_control`	`nvext.cache_control`	TTL-based KV cache pinning. Pinned prefixes resist eviction for the specified duration, demoting to host memory rather than deletion.
`program_id`	`nvext.agent_hints`	(Planned) Identifies the agentic program for program-level metrics and cache affinity.
`context_type`	`nvext.agent_hints`	(Planned) Semantic type (system prompt, tool definition, reasoning branch) for context-aware eviction.

Cache Pinning with TTL

From the router config (lib/kv-router/src/scheduling/config.rs), cache control is enabled via router_enable_cache_control. When enabled, after generation completes the router calls a pin_prefix hook (with TTL) to the worker's cache_control service mesh endpoint. Pinned nodes resist eviction in SGLang's HiCache, demoting to host memory rather than being deleted.

// From lib/kv-router/src/scheduling/config.rs

/// Enable cache control (PIN with TTL) via the worker's
/// cache_control service mesh endpoint.
pub router_enable_cache_control: bool,  // default: false

NeMo Agent Toolkit Integration: TTL is dynamically computed as the product of expected reuse count and inter-request time. The NAT profiler pre-computes these expectations during agent evaluations and injects nvext.cache_control with the derived TTL automatically. Future work: TTL could be context-type-aware -- think tokens get lower TTL than system prompts and tool definitions.

LangChain Integration

Dynamo is supported directly in LangChain via the NVIDIA AI Endpoints integration. Configure the chat model to use the Dynamo endpoint and pass agent hints directly from the LangChain client.

Feature Matrix by Backend

Feature	SGLang	TensorRT-LLM	vLLM
Priority-based cache eviction	Available	In progress	In progress
Cache pinning (TTL)	Available	In progress	--
Cache prefetching	In progress	--	--
Speculative prefill	Available	Available	Available
Priority-aware routing	Available	Available	Available

← Back to AI Infra Overview

Section 8

v1.0 Features & Benchmarks

Dynamo 1.0 shipped March 2026 as a production-ready release with strong community adoption. The release spans zero-config deployment, agentic inference, multimodal E/P/D, video generation, and K8s Inference Gateway integration.

What's New in 1.0

Zero-Config Deploy (DGDR)

BETA Specify model, HW, and SLA in one YAML. AIConfigurator auto-profiles, Planner optimizes topology, Dynamo deploys.

apiVersion: nvidia.com/v1beta1
kind: DynamoGraphDeploymentRequest
spec:
  model: Qwen/Qwen3-0.6B
  backend: vllm
  sla: { ttft: 200.0, itl: 20.0 }
  autoApply: true

Agentic Inference

Per-request hints for latency priority, expected output length, and cache pinning TTL. LangChain + NeMo Agent Toolkit integrations with speculative prefill for ~3x TTFT on turns 2+.

Multimodal E/P/D

Disaggregated encode/prefill/decode with embedding cache -- 30% faster TTFT on image workloads. Block hashing incorporates mm_hash for multimodal objects.

Video Generation

Native FastVideo + SGLang Diffusion support. Real-time 1080p on a single B200.

K8s Inference Gateway

KV-aware routing inside the standard Kubernetes gateway. EPP plugin with skip_initial_worker_wait for external worker management.

Storage-Tier KV Offload

S3/Azure blob support for G4 tier + global KV events for cluster-wide cache visibility. Event Plane broadcasts StoreEvent / RemoveEvent for smart tiering.

Key Benchmarks

Result	Model	Hardware	Context
750x higher throughput	DeepSeek-R1	GB300 NVL72	InferenceXv2 benchmark (SemiAnalysis)
7x higher throughput/GPU	DeepSeek R1	GB200 NVL72 vs B200	InferenceX benchmark (SemiAnalysis)
7x faster model startup	DeepSeek-V3	H200	ModelExpress GPU-to-GPU weight streaming via NIXL/NVLink
2x faster TTFT	Qwen3-Coder 480B	Multi-node	KV-aware routing (Baseten production benchmark)
80% fewer SLA breaches	Various	Production	Planner autoscaling at 5% lower TCO (Alibaba APSARA 2025)
10x inference speedup	Kimi K2	GB200	Moonshot AI production deployment
19x faster TTFT	Various	Dell PowerScale + NIXL	Storage-to-GPU KV streaming via GDS

Backend Support Matrix

Feature	SGLang	TensorRT-LLM	vLLM
Disaggregated Serving	Available	Available	Available
KV-Aware Routing	Available	Available	Available
SLA-Based Planner	Available	Available	Available
KVBM	In progress	Available	Available
Multimodal	Available	Available	Available
Tool Calling	Available	Available	Available

← Back to AI Infra Overview

Source Reference

Codebase Map

Key directories and files explored during this analysis of the ai-dynamo/dynamo repository:

Rust Core (Performance-Critical)

Path	Purpose
`lib/kv-router/src/scheduling/selector.rs`	DefaultWorkerSelector: cost function, softmax sampling, tie-breaking
`lib/kv-router/src/scheduling/policy.rs`	FCFS, LCFS, WSPT (Smith's rule) scheduling policies
`lib/kv-router/src/scheduling/config.rs`	KvRouterConfig: all router parameters, defaults, cache control toggle
`lib/kv-router/src/indexer/radix_tree.rs`	RadixTree: find_matches, block tracking, frequency-based TTL
`lib/kv-router/src/indexer/concurrent_radix_tree.rs`	Thread-safe radix tree with sticky worker routing
`lib/kvbm-physical/src/transfer/executor/nixl.rs`	NixlTransferBuilder: typestate pattern for safe cross-worker transfers
`lib/memory/src/nixl.rs`	NixlCompatible trait, NixlDescriptor, NixlAgent wrappers
`lib/llm/src/block_manager/`	Block layouts, NIXL storage, transfer coordination

Python Components (Extensibility)

Path	Purpose
`components/src/dynamo/planner/utils/planner_core.py`	Base planner, scaling loop, Prometheus metrics, connector integration
`components/src/dynamo/planner/utils/load_predictor.py`	ARIMA, Kalman, Prophet, Constant predictors with idle-period handling
`components/src/dynamo/planner/utils/fpm_regression.py`	Prefill/Decode/Agg regression models for load-based scaling
`components/src/dynamo/planner/utils/perf_interpolation.py`	NPZ profiling data interpolation for throughput/latency mapping
`components/src/dynamo/planner/kubernetes_connector.py`	K8s DGD PATCH for replica scaling via operator
`components/src/dynamo/planner/virtual_connector.py`	VirtualConnector for non-K8s orchestrators

Design Documentation

Path	Content
`docs/design-docs/architecture.md`	Three-plane architecture, design goals, fault tolerance
`docs/design-docs/router-design.md`	KVIndexer, cost function, event transport modes, inter-router sync
`docs/design-docs/kvbm-design.md`	4-tier hierarchy, block state machine, NIXL integration, framework connectors
`docs/design-docs/planner-design.md`	Throughput + load-based scaling, predictor algorithms, connector design
`docs/design-docs/disagg-serving.md`	PrefillRouter orchestration, backend-specific metadata, xPyD
`docs/design-docs/dynamo-flow.md`	Full request flow with color-coded phases
`docs/features/agentic_workloads.md`	nvext hints, cache pinning, speculative prefill, LangChain integration

NVIDIA DynamoDatacenter-Scale Inference Stack

Three-Plane Architecture

Design Goals

End-to-End Request Flow (Disaggregated Mode)

Smart Router: KV-Aware Routing

Radix Tree KV Cache Indexer

Cost Function and Worker Selection

Scheduling Policies

KV Event Transport Modes

KV Block Manager (KVBM)

Block State Machine

KVBM Internal Architecture

Transfer Manager and Data Flows

Planner: SLA-Driven Dynamic Scaling

Throughput-Based Scaling Algorithm

Replica Calculation Formulas

Load Predictor Algorithms

Load-Based Scaling Mode

Co-existence of Modes

Disaggregated Prefill/Decode Serving

PrefillRouter Orchestration

Backend-Specific Transfer Metadata

Runtime-Reconfigurable xPyD

NIXL: NVIDIA Interchange Library

NIXL Architecture in KVBM

Typestate Transfer Builder

Remote Memory Registration Protocol

Agentic Features

nvext API: Agentic Hints

Cache Pinning with TTL

LangChain Integration

Feature Matrix by Backend

v1.0 Features & Benchmarks

What's New in 1.0

Key Benchmarks

Backend Support Matrix

Codebase Map

Rust Core (Performance-Critical)

Python Components (Extensibility)

Design Documentation

NVIDIA Dynamo
Datacenter-Scale Inference Stack