Source-code-level analysis of six systems that form the modern LLM inference stack: serving engines and caching/orchestration infrastructure (vLLM, SGLang, LMCache, Dynamo), and simulation platforms (Vidur, SimAI).
**vLLM:** The first system to introduce PagedAttention for efficient KV cache memory management. V1 features a multi-process architecture with ZeroMQ IPC, a unified token budget scheduler, and an extensive set of attention backends, including FlashAttention 3/4 and MLA variants for DeepSeek models.

**SGLang:** Built around RadixCache for automatic prefix sharing via a radix tree data structure. Features an overlap scheduling mode where the GPU executes the current batch while the CPU prepares the next one. Pioneered the "extend" mode for efficient multi-turn conversations and offers a powerful frontend DSL for LLM programs.

**LMCache:** A middleware layer that sits between inference engines and multi-tier storage. Reduces time-to-first-token by 3–10x by caching KV tensors across GPU, CPU DRAM, NVMe, and remote storage. Unlike built-in prefix caching, LMCache can reuse any repeated text segment across any serving instance in a distributed deployment.

**Dynamo:** NVIDIA's open-source inference framework for disaggregated serving, providing intelligent request routing, prefill-decode separation, and KV cache transfer orchestration across GPU clusters. Integrates with vLLM and other backends to enable production-scale multi-node deployments with dynamic scaling.

**Vidur:** Microsoft's discrete-event-simulation (DES) based inference simulator that models the full serving stack without running any GPU inference. Simulates scheduling policies (vLLM, Sarathi, Orca), predicts execution times via ML models, and enables capacity planning across cluster configurations at a fraction of deployment cost.

**SimAI:** Alibaba Cloud's full-stack simulator extending Vidur with five integrated components: vidur-alibabacloud for PD disaggregation scheduling, AICB for GPU kernel profiling (DeepSeek/Qwen3-MoE), SimCCL for NCCL algorithm decomposition, astra-sim for system simulation, and NS-3 for packet-level network modeling.
| Pipeline Stage | vLLM V1 | SGLang | LMCache |
|---|---|---|---|
| HTTP Entry | FastAPI via Uvicorn<br>`entrypoints/openai/api_server.py`<br>OpenAI-compatible REST routes | FastAPI server<br>`entrypoints/http_server.py`<br>OpenAI + Anthropic + Ollama compatible | N/A (middleware — no HTTP layer) |
| Tokenization | In AsyncLLM process<br>HF tokenizer via `inputs/preprocess.py`<br>Same process as API server | Separate TokenizerManager process<br>Chat template application + multimodal preprocessing<br>Communicates with Scheduler via ZMQ | N/A (uses engine's tokenizer) |
| Scheduling | **Unified token budget**<br>Continuous batching; decode priority over prefill<br>Automatic chunked prefill for long prompts<br>`v1/core/sched/scheduler.py` | **Overlap scheduling**<br>FCFS with extend mode; GPU executes batch N while CPU prepares batch N+1<br>`managers/scheduler.py` | N/A (no scheduling) |
| Prefix Caching | **Block hash**<br>Hash-based block matching with 16-token blocks<br>`cached_block_hash_to_block` dictionary<br>Enabled by default; `find_longest_cache_hit()` | **RadixCache**<br>Radix tree for automatic prefix sharing<br>LRU/LFU eviction; `match_prefix()`<br>No explicit configuration needed | **Content-addressable**<br>256-token chunks with rolling hashes<br>`ChunkedTokenDatabase`; cross-instance dedup<br>Can reuse any repeated text segment |
| KV Cache Memory | Pre-allocated paged blocks<br>`gpu_memory_utilization=0.9` default<br>Free block queue (doubly-linked list)<br>Preemption via recomputation only | ReqToTokenPool + TokenToKVPool<br>Specialized pools for MHA/MLA/Mamba<br>HiRadixCache for three-tier GPU/CPU/storage<br>Sliding window attention pool support | Four latency tiers:<br>L1: CPU DRAM (<1ms, pinned, NUMA-aware)<br>L2: NVMe (1–10ms, persistent)<br>L3: Remote (10–100ms, Redis/RDMA/Mooncake)<br>L4: P2P GPU-to-GPU (5–50ms, NIXL) |
| Attention | FlashAttention 3/4 (primary, Hopper+)<br>FlashInfer, Triton, PagedAttention<br>TRT-LLM attention backend<br>MLA suite: FlashMLA, CUTLASS MLA, Triton MLA, Sparse MLA | FlashInfer (default pre-Hopper)<br>FlashAttention 3 (default Hopper)<br>FA4 (FP4), TRT-LLM MLA/MHA<br>Dual-chunk FA (Qwen2.5-1M long context)<br>Hybrid prefill/decode mixing | N/A (uses engine's attention kernels) |
| Speculative Decoding | Integrated in `GPUModelRunner`<br>`propose_draft_token_ids()`<br>Triton-based rejection sampler<br>EAGLE, EAGLE-3, Medusa, MTP, N-gram | Dedicated EAGLE worker (separate process)<br>Tree-based speculation with CUDA graph runner<br>EAGLE, EAGLE-2/3, MTP, draft models, N-gram | N/A |
| MoE / Expert Parallelism | FusedMoE kernels<br>`--enable-expert-parallel`<br>Expert-parallel load balancing (EPLB) | Triton-based fused MoE with pre-tuned configs<br>EP with DeepEP, Mooncake, Mori, FuseEP<br>FlashInfer all-to-all dispatch<br>Dynamic expert rebalancing (EPLB Manager) | N/A |
| Disaggregated P/D | `KVConnectorBase_V1` interface<br>Scheduler-side: `build_connector_meta()`<br>Worker-side: `start_load_kv()` / `save_kv_layer()`<br>Connectors: NIXL, LMCache, shared storage | Disaggregation directory + HiCache<br>Three-tier GPU/CPU/storage backends | PDBackend via NIXL/RDMA<br>Layer-by-layer pipelined transfers<br>Overlaps KV transfer with inference<br>Plugs into vLLM or SGLang connectors |
| Detokenization | In AsyncLLM (`output_processor`)<br>Incremental detokenization + stop checking<br>Same process as tokenization | Separate DetokenizerManager process<br>Converts token IDs → text incrementally<br>Trims at stop sequences; sends `BatchStrOut` | N/A (uses engine's detokenizer) |
| Dimension | Vidur | SimAI |
|---|---|---|
| Origin | Microsoft Research (MLSys'24) | Alibaba Cloud (NSDI'25) |
| Architecture | Single-repo DES simulator | 5-component full-stack (vidur fork + AICB + SimCCL + astra-sim + NS-3) |
| Deployment Model | Co-located only | Co-located + PD Disaggregation (SplitWise scheduler) |
| Compute Time | sklearn RandomForest on profiled CSV | AICB AIOB real GPU profiling (DeepGEMM/FlashMLA) |
| Communication | None (assumes negligible) | SimCCL + astra-sim (analytical/NS-3/physical) |
| Supported Models | LLaMA-2 (7B/13B/70B) | + DeepSeek-V3-671B, Qwen3-MoE-235B, Qwen3-Next-80B |
| Hardware Requirement | CPU only | Profiling: SM90+ GPU; Simulation: CPU |
| Scheduling Policies | vLLM, Sarathi, Orca, FasterTransformer, LightLLM | All Vidur policies + SplitWise (PD-aware) |
| Network Modeling | Not supported | Analytical (busbw) / NS-3 packet-level / Physical RDMA |
| Key Metric | E2E latency, TTFT, TBT | + pd_p2p_comm_time, prefill/decode breakdown |
SGLang batch forward modes: DECODE, EXTEND, IDLE. Vidur event chain: RequestArrival → Schedule → BatchStageEnd → BatchEnd.

**PagedAttention:** Memory management technique that partitions the KV cache into fixed-size pages (blocks), enabling non-contiguous storage of key-value tensors in GPU memory. Eliminates internal fragmentation and enables memory sharing across requests. Inspired by OS virtual memory paging. Each block typically holds 16 tokens.
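The bookkeeping behind paging can be sketched in a few lines. This is an illustrative toy (the class and method names here are hypothetical, not vLLM's actual ones): a free-block pool plus a per-sequence block table that maps a logical token position to a (physical block, offset) pair.

```python
# Minimal sketch of paged KV cache bookkeeping (hypothetical names,
# not vLLM's actual classes). Logical token positions map to physical
# blocks of BLOCK_SIZE tokens, so sequences need not be contiguous.
BLOCK_SIZE = 16  # tokens per block, matching vLLM's default

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of free physical blocks

    def allocate(self) -> int:
        return self.free.pop()

    def release(self, block_id: int):
        self.free.append(block_id)

class SequenceBlockTable:
    """Maps a sequence's logical token index -> (physical block, offset)."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.blocks: list[int] = []
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:  # current block is full
            self.blocks.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, pos: int) -> tuple[int, int]:
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=8)
seq = SequenceBlockTable(alloc)
for _ in range(20):           # 20 tokens -> 2 blocks (16 + 4)
    seq.append_token()
print(len(seq.blocks))        # 2
print(seq.physical_slot(17))  # (second block's id, offset 1)
```

Because blocks are allocated on demand, the only wasted memory is the unused tail of each sequence's last block, which is what eliminates internal fragmentation at the granularity the entry describes.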
**RadixCache:** KV cache management using a radix tree (compact trie) that automatically shares cached key-value tensors across requests with common token prefixes. Unlike block-hash approaches, the radix tree structure naturally handles arbitrary-length prefix matching and enables efficient LRU/LFU eviction at any granularity.
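A toy radix tree makes the "arbitrary-length matching" point concrete. This sketch is illustrative only (not SGLang's RadixCache implementation): edges store token runs, insertion splits an edge at the point of divergence, and lookup returns how many leading tokens of a new request are already cached.

```python
# Toy radix-style prefix cache (illustrative, not SGLang's RadixCache).
class Node:
    def __init__(self):
        self.children = {}  # first token of edge -> (token_run, child Node)

class RadixPrefixCache:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens):
        node, i = self.root, 0
        while i < len(tokens):
            key = tokens[i]
            if key not in node.children:
                node.children[key] = (tokens[i:], Node())
                return
            run, child = node.children[key]
            # length of common prefix between this edge and the remainder
            j = 0
            while j < len(run) and i + j < len(tokens) and run[j] == tokens[i + j]:
                j += 1
            if j < len(run):  # split the edge at the divergence point
                mid = Node()
                mid.children[run[j]] = (run[j:], child)
                node.children[key] = (run[:j], mid)
                child = mid
            i += j
            node = child

    def match_prefix(self, tokens):
        node, i = self.root, 0
        while i < len(tokens) and tokens[i] in node.children:
            run, child = node.children[tokens[i]]
            j = 0
            while j < len(run) and i + j < len(tokens) and run[j] == tokens[i + j]:
                j += 1
            i += j
            if j < len(run):
                break
            node = child
        return i  # number of cached leading tokens

cache = RadixPrefixCache()
cache.insert([1, 2, 3, 4, 5])   # e.g. system prompt variant A
cache.insert([1, 2, 3, 9, 9])   # shares the [1, 2, 3] prefix
print(cache.match_prefix([1, 2, 3, 4, 7]))  # 4 leading tokens cached
```

Note the match length (4) is not a multiple of any block size: the tree shares exactly as many tokens as actually match, which is the contrast with 16-token block hashing drawn above.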
**Continuous batching:** Dynamic batching strategy where new requests can join or leave a running batch at every iteration step, rather than waiting for all requests in a batch to complete. Maximizes GPU utilization by immediately filling slots freed by completed requests. Both vLLM and SGLang implement this as their core scheduling strategy.
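The per-iteration join/leave behavior can be sketched as a simple scheduler loop (hypothetical helper, not either engine's code): each step admits waiting requests into free slots, generates one token for every running request, and retires finished ones immediately so their slots refill next step.

```python
# Illustrative continuous-batching loop (names are hypothetical).
from collections import deque

def continuous_batching(requests, max_batch):
    """requests: list of (request_id, tokens_to_generate)."""
    waiting = deque(requests)
    running = {}          # request_id -> tokens remaining
    completed_order = []
    steps = 0
    while waiting or running:
        # Admit new requests whenever slots are free (per-iteration join).
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One model forward = one new token for every running request.
        steps += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                completed_order.append(rid)
                del running[rid]   # slot freed for the next iteration
    return steps, completed_order

steps, order = continuous_batching(
    [("a", 3), ("b", 1), ("c", 2), ("d", 2)], max_batch=2)
print(steps, order)
```

With static batching, "c" and "d" could not start until both "a" and "b" finished; here "c" is admitted the moment "b" completes, which is exactly the slot-refilling the entry describes.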
**Chunked prefill:** Technique for handling long prompts by splitting the prefill computation into smaller chunks that can be interleaved with decode tokens in the same batch. Prevents long prompts from monopolizing GPU compute and causing latency spikes for concurrent decode-phase requests. vLLM uses `long_prefill_token_threshold`; SGLang uses extend mode.
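The budgeting arithmetic is simple: each batch has a token budget shared between decodes and prefill, so the prompt is carved into chunks that fit the leftover room. A minimal sketch, with a hypothetical helper name:

```python
# Sketch of chunked-prefill budgeting (hypothetical helper, not the
# engines' actual schedulers).
def split_prefill(prompt_len, token_budget, decode_tokens):
    """Yield per-iteration prefill chunk sizes, given that
    `decode_tokens` slots of each batch go to running decodes."""
    room = token_budget - decode_tokens
    assert room > 0, "budget must exceed decode demand"
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(room, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

# An 8192-token prompt, a 2048-token batch budget, 48 active decodes:
print(split_prefill(8192, 2048, 48))  # four 2000-token chunks + remainder
```

The 48 decode requests keep generating a token every iteration instead of stalling for the four-plus batches the full prefill would otherwise occupy.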
**Prefix caching:** Optimization that reuses KV cache from previously computed token sequences when a new request shares the same prefix (e.g., system prompt, few-shot examples). vLLM uses hash-based block matching (16-token blocks). SGLang uses RadixCache (automatic radix tree). LMCache uses content-addressable 256-token chunks with cross-instance deduplication.
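The hash-based variant chains each block's key off its parent's key, so a cache hit on block k implies the entire prefix through block k matches. A sketch in that spirit (function and variable names here are hypothetical, not vLLM's):

```python
# Sketch of hash-based prefix matching on 16-token blocks (in the
# spirit of vLLM's approach; names are hypothetical).
import hashlib

BLOCK = 16

def block_keys(tokens):
    keys, parent = [], b""
    # only full blocks are hashable; the partial tail is recomputed
    for start in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        blob = parent + bytes(str(tokens[start:start + BLOCK]), "utf8")
        parent = hashlib.sha256(blob).digest()  # chained off parent key
        keys.append(parent)
    return keys

cache = {}  # block key -> physical block id

def longest_cache_hit(tokens):
    hit = 0
    for key in block_keys(tokens):
        if key not in cache:
            break
        hit += BLOCK
    return hit

shared = list(range(40))          # 40-token shared prefix = 2 full blocks
for i, key in enumerate(block_keys(shared)):
    cache[key] = i                # pretend these blocks are resident
print(longest_cache_hit(shared + [99, 98]))  # 32: two full blocks hit
```

Chaining the parent key into each hash is what makes a single dictionary lookup safe: two different prefixes can never share a block key even if one 16-token window happens to repeat.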
**KV cache:** Cached key and value projection tensors from the attention mechanism, stored per layer per token. During autoregressive decoding, only the new token's Q/K/V need to be computed — all previous K/V are read from cache. KV cache memory consumption scales linearly with sequence length and is typically the dominant memory consumer after model weights.
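The linear scaling is easy to quantify with the standard sizing formula. The parameters below are illustrative (roughly LLaMA-2-7B-shaped, not exact specs):

```python
# Back-of-envelope KV cache sizing: 2 tensors (K and V) per layer
# per token, each of size kv_heads * head_dim * dtype_bytes.
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

# 32 layers, 32 KV heads, head_dim 128, fp16, 4096-token sequence:
gb = kv_cache_bytes(4096, 32, 32, 128, 2) / 2**30
print(f"{gb:.1f} GiB")  # 2.0 GiB for a single 4096-token sequence
```

At 512 KiB per token, a few dozen concurrent long sequences exhaust a GPU's spare memory, which is why the paging and offloading machinery above exists.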
**CUDA graphs:** NVIDIA mechanism for capturing and replaying a sequence of GPU operations as a single launchable unit, eliminating per-kernel CPU launch overhead. Both vLLM and SGLang capture CUDA graphs for common batch sizes at startup. SGLang's `CudaGraphRunner` and vLLM's graph management in `GPUModelRunner` reduce decode-phase latency by 10–30%.
**Tensor parallelism:** Distributed strategy that splits individual weight matrices across multiple GPUs, with each GPU computing a portion of every layer's output. Requires all-reduce or all-gather communication between GPUs after each layer. Both engines implement `ColumnParallelLinear` and `RowParallelLinear` layers. Best for intra-node parallelism with fast NVLink interconnect.
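Column-parallel splitting can be demonstrated numerically in a single process. This conceptual sketch (plain Python, no GPUs; all names hypothetical) shards a weight's output columns across "devices", computes each slice independently, and concatenates the slices in place of an all-gather:

```python
# Column-parallel linear sketch: shard output columns, compute
# slices independently, then "all-gather" by concatenation.
def matmul(x, w):  # x: [n], w: [n][m] -> [m]
    return [sum(xi * w[i][j] for i, xi in enumerate(x))
            for j in range(len(w[0]))]

def column_parallel(x, w, shards):
    cols = len(w[0])
    per = cols // shards
    partials = []
    for s in range(shards):  # shard s owns columns [s*per, (s+1)*per)
        w_shard = [row[s * per:(s + 1) * per] for row in w]
        partials.append(matmul(x, w_shard))
    return [v for p in partials for v in p]  # concat = all-gather

x = [1.0, 2.0]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
print(column_parallel(x, w, shards=2))  # equals the unsharded matmul
```

A row-parallel layer is the dual: shard input rows, and the per-shard partial sums are combined with an all-reduce (elementwise add) instead of a concatenation.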
**Pipeline parallelism:** Distributed strategy that assigns different layers of the model to different GPUs, forming a pipeline where activations are sent forward between stages. Better suited for inter-node parallelism where bandwidth is limited. vLLM supports PP via Ray compiled DAG. SGLang implements PP via its scheduler PP mixin. Micro-batching minimizes pipeline bubble overhead.
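How much micro-batching helps is captured by the standard bubble-fraction formula: with p stages and m micro-batches, the idle fraction during pipeline fill and drain is (p − 1)/(m + p − 1).

```python
# Pipeline-bubble arithmetic (standard formula): more micro-batches
# shrink the idle "bubble" at pipeline fill/drain.
def bubble_fraction(stages, microbatches):
    return (stages - 1) / (microbatches + stages - 1)

for m in (1, 4, 16, 64):
    print(m, round(bubble_fraction(4, m), 3))
# With 4 stages: 1 micro-batch wastes 75% of steps; 64 wastes ~4.5%
```

This is why PP schedulers keep many micro-batches in flight rather than pushing whole batches through one stage at a time.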
**Discrete-event simulation (DES):** Simulation methodology where a priority queue of timestamped events drives the system state forward. Each event (e.g., RequestArrival, BatchStageEnd) triggers handler functions that process the event and schedule future events. Vidur and SimAI use this to simulate inference request lifecycles without actual GPU computation.
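A DES core fits in a few lines. This minimal sketch (illustrative, not Vidur's actual classes) uses a heap of timestamped events; popping an event advances the clock, and handlers schedule follow-on events, mirroring the arrival → schedule → batch-end chain:

```python
# Minimal discrete-event-simulation core (illustrative).
import heapq

class Simulator:
    def __init__(self):
        self.clock = 0.0
        self.queue = []   # (timestamp, seq, handler, payload)
        self.seq = 0      # tie-breaker so heap never compares handlers
        self.log = []

    def schedule(self, delay, handler, payload=None):
        heapq.heappush(self.queue, (self.clock + delay, self.seq, handler, payload))
        self.seq += 1

    def run(self):
        while self.queue:
            self.clock, _, handler, payload = heapq.heappop(self.queue)
            handler(self, payload)   # handlers may schedule future events

def request_arrival(sim, rid):
    sim.log.append((sim.clock, "arrival", rid))
    sim.schedule(0.0, schedule_batch, rid)

def schedule_batch(sim, rid):
    sim.schedule(5.0, batch_end, rid)  # pretend a batch takes 5 ms

def batch_end(sim, rid):
    sim.log.append((sim.clock, "batch_end", rid))

sim = Simulator()
sim.schedule(0.0, request_arrival, "r1")
sim.schedule(2.0, request_arrival, "r2")
sim.run()
print(sim.log)  # events processed in timestamp order
```

Because no real computation happens between events, simulating hours of cluster traffic takes seconds on a CPU, which is the basis of the "fraction of deployment cost" claim above.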
**Prefill-decode disaggregation:** Architecture that separates prefill (compute-intensive prompt processing) and decode (memory-bound token generation) onto dedicated GPU pools. Eliminates resource contention but introduces KV cache transfer cost between pools. SimAI models this via SplitWise scheduler with configurable P:D node ratios and KV transfer cost estimation.
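The transfer cost has a simple first-order model: payload size over link bandwidth plus a fixed latency. The formula and parameters below are illustrative assumptions, not SimAI's calibrated model; the 512 KiB/token figure matches the KV-cache sizing example earlier.

```python
# Rough KV-transfer cost model for PD disaggregation (illustrative).
def kv_transfer_ms(prompt_tokens, kv_bytes_per_token, link_gbps,
                   latency_ms=0.05):
    payload = prompt_tokens * kv_bytes_per_token        # bytes to move
    transfer = payload * 8 / (link_gbps * 1e9) * 1e3    # ms on the wire
    return latency_ms + transfer

# 4096-token prompt, 512 KiB of KV per token, 400 Gb/s RDMA link:
print(round(kv_transfer_ms(4096, 512 * 1024, 400), 2))
```

Tens of milliseconds per request is why layer-by-layer pipelined transfers (overlapping the copy with prefill compute, as LMCache's PDBackend does) matter in practice.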
**Collective communication:** Group communication operations (AllReduce, AllGather, ReduceScatter, AllToAll) used in distributed training and inference for gradient synchronization and activation exchange. SimCCL decomposes these into point-to-point flows following NCCL algorithms (Ring, Tree, NVLS), enabling precise network simulation via astra-sim and NS-3.
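The decomposition idea can be sketched for the ring algorithm (conceptual, in the spirit of what SimCCL does; not its actual output format): a ring AllReduce over n ranks becomes 2·(n − 1) steps of chunk-sized point-to-point sends from each rank to its ring neighbor.

```python
# Decomposing AllReduce into point-to-point flows per the NCCL ring
# algorithm: (n-1) reduce-scatter steps + (n-1) all-gather steps,
# each rank sending one payload/n chunk per step to its neighbor.
def ring_allreduce_flows(n_ranks, payload_bytes):
    chunk = payload_bytes / n_ranks
    flows = []  # (step, src, dst, bytes)
    for phase in range(2):                 # 0: reduce-scatter, 1: all-gather
        for step in range(n_ranks - 1):
            for src in range(n_ranks):
                dst = (src + 1) % n_ranks  # fixed ring neighbor
                flows.append((phase * (n_ranks - 1) + step, src, dst, chunk))
    return flows

flows = ring_allreduce_flows(4, 1024)
print(len(flows))                      # 4 ranks * 2*(4-1) steps = 24 sends
print(sum(b for _, s, _, b in flows if s == 0))  # bytes sent by rank 0
```

Each rank moves 2·(n − 1)/n of the payload regardless of n, so the flow list, fed into a packet-level simulator like NS-3, predicts how link bandwidth and topology bound the collective's completion time.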