Source Code Deep Dive · March 2026

The Life of a Request —
AI Inference Engine Deep Dive

Source-code-level analysis of six systems that form the modern LLM inference stack — from serving engines (vLLM, SGLang, LMCache, Dynamo) to simulation platforms (Vidur, SimAI).


Understanding the Modern LLM Serving Stack

These systems — vLLM, SGLang, LMCache, NVIDIA Dynamo, Vidur, and SimAI — together represent the state of the art in open-source and production LLM serving and simulation infrastructure as of March 2026. Each deep-dive in this section is based on reading actual source code from the latest repositories, not just published papers.
Methodology: Every architectural claim in these deep-dives comes from direct source code analysis of the March 2026 repositories. vLLM now runs exclusively on its V1 engine (V0 was removed in v0.11.0). SGLang has matured its overlap scheduling and RadixCache into production-grade components. LMCache provides middleware KV caching that plugs into either engine. NVIDIA Dynamo adds disaggregated serving orchestration with intelligent routing. Vidur provides discrete-event simulation for scheduling policy exploration. SimAI extends Vidur with full-stack PD disaggregation simulation, AICB GPU profiling, and NS-3 network modeling. Together they cover the entire inference lifecycle from request ingestion to token delivery, plus simulation-based capacity planning.
500+ source files indexed · 6 inference systems · 10 pipeline stages traced · 80% subsystem overlap (vLLM/SGLang)

vLLM V1 — PagedAttention Pioneer

The first system to introduce PagedAttention for efficient KV cache memory management. V1 features a multi-process architecture with ZeroMQ IPC, a unified token budget scheduler, and an extensive set of attention backends including FlashAttention 3/4 and MLA variants for DeepSeek models.

SGLang — RadixAttention Innovator

Built around RadixCache for automatic prefix sharing via a radix tree data structure. Features an overlap scheduling mode where the GPU executes the current batch while the CPU prepares the next one. Pioneered the "extend" mode for efficient multi-turn conversations and offers a powerful frontend DSL for LLM programs.

LMCache — Multi-Tier KV Cache Middleware

A middleware layer that sits between inference engines and multi-tier storage. Reduces time-to-first-token by 3–10x by caching KV tensors across GPU, CPU DRAM, NVMe, and remote storage. Unlike built-in prefix caching, LMCache can reuse any repeated text segment across any serving instance in a distributed deployment.
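The tier hierarchy above can be sketched as a lookup in latency order with promotion on hit. This is an illustrative sketch only — the class and method names are hypothetical, not LMCache's actual API, and the latencies are placeholders:

```python
from typing import Optional

class Tier:
    """One storage tier; a dict stands in for DRAM/NVMe/remote storage."""
    def __init__(self, name: str, latency_ms: float):
        self.name = name
        self.latency_ms = latency_ms
        self.store: dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self.store.get(key)

class TieredKVCache:
    def __init__(self):
        # Checked fastest-first: CPU DRAM down to remote storage.
        self.tiers = [
            Tier("L1 CPU DRAM", 1.0),
            Tier("L2 NVMe", 10.0),
            Tier("L3 Remote", 100.0),
        ]

    def put(self, key: str, value: bytes, tier_idx: int = 0) -> None:
        self.tiers[tier_idx].store[key] = value

    def get(self, key: str) -> Optional[bytes]:
        """Return the first hit, promoting it to the fastest tier."""
        for i, tier in enumerate(self.tiers):
            value = tier.get(key)
            if value is not None:
                if i > 0:  # promote on hit so the next lookup is faster
                    self.tiers[0].store[key] = value
                return value
        return None

cache = TieredKVCache()
cache.put("chunk:abc", b"kv-bytes", tier_idx=2)  # lands in the remote tier
assert cache.get("chunk:abc") == b"kv-bytes"     # hit also promotes to L1
assert "chunk:abc" in cache.tiers[0].store
```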

NVIDIA Dynamo — Disaggregated Orchestrator

NVIDIA's open-source inference framework for disaggregated serving, providing intelligent request routing, prefill-decode separation, and KV cache transfer orchestration across GPU clusters. Integrates with vLLM and other backends to enable production-scale multi-node deployments with dynamic scaling.

Vidur — Discrete-Event Simulator

Microsoft's DES-based inference simulator that models the full serving stack without GPU inference. Simulates scheduling policies (vLLM, Sarathi, Orca), predicts execution times via ML models, and enables capacity planning across cluster configurations at a fraction of deployment cost.

SimAI — Full-Stack Inference Simulator

Alibaba Cloud's full-stack simulator extending Vidur with five integrated components: vidur-alibabacloud for PD disaggregation scheduling, AICB for GPU kernel profiling (DeepSeek/Qwen3-MoE), SimCCL for NCCL algorithm decomposition, astra-sim for system simulation, and NS-3 for packet-level network modeling.

Canonical Request Lifecycle
HTTP Request → FastAPI Server → Tokenization → [IPC] → Scheduler → Prefix Cache Lookup → KV Memory Alloc → Model Forward → Attention → Sampling → [IPC] → Detokenization → HTTP Response

Explore Each Engine

Each deep-dive traces a core operation through the system's source code. Serving engines follow an inference request from HTTP entry to token output. Simulation platforms follow a request through the simulated stack, analyzing scheduling, prediction, and network modeling.

Serving Engines

Simulation & Analysis Platforms

Stage-by-Stage Comparison

How do the serving engines handle each stage of the inference pipeline? For simulation platform comparisons, see the dedicated Vidur and SimAI deep-dive pages.
Pipeline Stage · vLLM V1 · SGLang · LMCache

HTTP Entry
  vLLM V1: FastAPI via Uvicorn (entrypoints/openai/api_server.py); OpenAI-compatible REST routes
  SGLang: FastAPI server (entrypoints/http_server.py); OpenAI + Anthropic + Ollama compatible
  LMCache: N/A (middleware — no HTTP layer)

Tokenization
  vLLM V1: in the AsyncLLM process, same process as the API server; HF tokenizer via inputs/preprocess.py
  SGLang: separate TokenizerManager process; chat template application + multimodal preprocessing; communicates with the Scheduler via ZMQ
  LMCache: N/A (uses the engine's tokenizer)

Scheduling
  vLLM V1: unified token budget (v1/core/sched/scheduler.py); continuous batching with decode priority over prefill; automatic chunked prefill for long prompts
  SGLang: overlap scheduling (managers/scheduler.py); FCFS with extend mode; GPU executes batch N while CPU prepares batch N+1
  LMCache: N/A (no scheduling)

Prefix Caching
  vLLM V1: block hash; hash-based matching with 16-token blocks via the cached_block_hash_to_block dictionary; enabled by default; find_longest_cache_hit()
  SGLang: RadixCache; radix tree for automatic prefix sharing; LRU/LFU eviction; match_prefix(); no explicit configuration needed
  LMCache: content-addressable; 256-token chunks with rolling hashes; ChunkedTokenDatabase; cross-instance dedup; can reuse any repeated text segment

KV Cache Memory
  vLLM V1: pre-allocated paged blocks (gpu_memory_utilization=0.9 default); free block queue (doubly linked list); preemption via recomputation only
  SGLang: ReqToTokenPool + TokenToKVPool; specialized pools for MHA/MLA/Mamba; HiRadixCache for three-tier GPU/CPU/storage; sliding-window attention pool support
  LMCache: four latency tiers: L1 CPU DRAM (<1ms, pinned, NUMA-aware); L2 NVMe (1–10ms, persistent); L3 remote (10–100ms, Redis/RDMA/Mooncake); L4 P2P GPU-to-GPU (5–50ms, NIXL)

Attention
  vLLM V1: FlashAttention 3/4 (primary, Hopper+); FlashInfer, Triton, PagedAttention; TRT-LLM attention backend; MLA suite: FlashMLA, CUTLASS MLA, Triton MLA, Sparse MLA
  SGLang: FlashInfer (default pre-Hopper); FlashAttention 3 (default on Hopper); FA4 (FP4); TRT-LLM MLA/MHA; dual-chunk FA (Qwen2.5-1M long context); hybrid prefill/decode mixing
  LMCache: N/A (uses the engine's attention kernels)

Speculative Decoding
  vLLM V1: integrated in GPUModelRunner; propose_draft_token_ids(); Triton-based rejection sampler; EAGLE, EAGLE-3, Medusa, MTP, N-gram
  SGLang: dedicated EAGLE worker (separate process); tree-based speculation with a CUDA graph runner; EAGLE, EAGLE-2/3, MTP, draft models, N-gram
  LMCache: N/A

MoE / Expert Parallelism
  vLLM V1: FusedMoE kernels; --enable-expert-parallel; expert-parallel load balancing (EPLB); Triton-based fused MoE with pre-tuned configs
  SGLang: EP with DeepEP, Mooncake, Mori, FuseEP; FlashInfer all-to-all dispatch; dynamic expert rebalancing (EPLB Manager)
  LMCache: N/A

Disaggregated P/D
  vLLM V1: KVConnectorBase_V1 interface; scheduler-side build_connector_meta(); worker-side start_load_kv() / save_kv_layer(); connectors: NIXL, LMCache, shared storage
  SGLang: disaggregation directory + HiCache; three-tier GPU/CPU/storage backends; PDBackend via NIXL/RDMA
  LMCache: layer-by-layer pipelined transfers; overlaps KV transfer with inference; plugs into vLLM or SGLang connectors

Detokenization
  vLLM V1: in AsyncLLM (output_processor), same process as tokenization; incremental detokenization + stop checking
  SGLang: separate DetokenizerManager process; converts token IDs → text incrementally; trims at stop sequences; sends BatchStrOut
  LMCache: N/A (uses the engine's detokenizer)
Key Insight: vLLM and SGLang share roughly 80% of the same subsystem categories — API server, scheduler, model runner, attention backends, sampling, parallelism — but diverge meaningfully in three areas: caching strategy (block-hash vs. radix tree), scheduling philosophy (unified budget vs. overlap scheduling), and process topology (AsyncLLM combines tokenization/detokenization vs. SGLang's dedicated manager processes). LMCache complements both by adding a cross-instance, multi-tier KV cache layer that neither engine provides natively.

Simulation Platform Comparison

Dimension · Vidur · SimAI

Origin
  Vidur: Microsoft Research (MLSys'24)
  SimAI: Alibaba Cloud (NSDI'25)
Architecture
  Vidur: single-repo DES simulator
  SimAI: 5-component full stack (vidur fork + AICB + SimCCL + astra-sim + NS-3)
Deployment Model
  Vidur: co-located only
  SimAI: co-located + PD disaggregation (SplitWise scheduler)
Compute Time
  Vidur: sklearn RandomForest on profiled CSVs
  SimAI: AICB/AIOB real-GPU profiling (DeepGEMM/FlashMLA)
Communication
  Vidur: none (assumes negligible)
  SimAI: SimCCL + astra-sim (analytical / NS-3 / physical)
Supported Models
  Vidur: LLaMA-2 (7B/13B/70B)
  SimAI: adds DeepSeek-V3-671B, Qwen3-MoE-235B, Qwen3-Next-80B
Hardware Requirement
  Vidur: CPU only
  SimAI: SM90+ GPU for profiling; CPU for simulation
Scheduling Policies
  Vidur: vLLM, Sarathi, Orca, FasterTransformer, LightLLM
  SimAI: all Vidur policies + SplitWise (PD-aware)
Network Modeling
  Vidur: not supported
  SimAI: analytical (busbw) / NS-3 packet-level / physical RDMA
Key Metrics
  Vidur: E2E latency, TTFT, TBT
  SimAI: adds pd_p2p_comm_time and prefill/decode breakdown

Architecture Comparison: vLLM V1 vs SGLang

Both systems use multi-process architectures with ZeroMQ IPC, but their process topologies differ in important ways. vLLM consolidates tokenization and detokenization in a single AsyncLLM process, while SGLang separates them into dedicated manager processes.

vLLM V1 Multi-Process Architecture

Process 1: API Server
  FastAPI Server (Uvicorn): OpenAI-compatible REST API
  AsyncLLM: tokenization + detokenization + streaming
    ↓ ZeroMQ IPC (msgpack serialization)
Process 2: EngineCore
  EngineCore busy loop: schedule() → execute_model() → update_from_output()
  V1 Scheduler: unified token budget; chunked prefill
  KVCacheManager: paged blocks (16-token); hash-based prefix caching
  Executor → Worker: UniProc / MultiProc / Ray
  GPUModelRunner: _update_states → _prepare_inputs → forward → sample

SGLang Six-Layer Process Architecture

Process 1: HTTP
  FastAPI HTTP Server: OpenAI + Anthropic + Ollama API
    ↓ ZeroMQ IPC
Process 2: Tokenizer
  TokenizerManager: tokenization + chat templates + multimodal preprocessing
    ↓ ZeroMQ IPC
Process 3: Scheduler
  Scheduler event loop: normal mode / overlap mode (CPU+GPU pipeline)
  Batch scheduler: FCFS + extend mode
  RadixCache: radix tree prefix match; LRU/LFU eviction
  TpModelWorker: normal / overlap thread variant
  ModelRunner: forward_decode / forward_extend / forward_idle
    ↓ ZeroMQ IPC
Process 4: Detokenizer
  DetokenizerManager: tokens → text; stop-sequence trimming

vLLM V1: Consolidated Process Model

  • Tokenization and detokenization in the same process (AsyncLLM)
  • EngineCore runs a tight busy loop — minimal overhead per step
  • Workers maintain persistent state and receive only incremental diffs
  • InputBatch avoids tensor recreation at each step
  • Data-parallel via DPCoordinator for multi-EngineCore load balancing
  • KV connector interface enables pluggable disaggregated P/D

SGLang: Separated Manager Model

  • Dedicated TokenizerManager and DetokenizerManager as separate processes
  • Overlap scheduling: GPU processes batch N while CPU prepares batch N+1
  • Extend mode incrementally updates existing KV caches (ragged tensors)
  • Three forward modes: DECODE, EXTEND, IDLE
  • DataParallelController with round-robin across DP replicas
  • Grammar mask generation overlapped with GPU inference for zero overhead
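The overlap-scheduling idea — CPU prepares batch N+1 while the GPU runs batch N — can be illustrated with a one-slot producer/consumer pipeline. This is a toy analogy, not SGLang's implementation (which lives inside its scheduler and TpModelWorker); all names here are hypothetical:

```python
import threading
import queue

def cpu_prepare(batch_id: int) -> dict:
    # Stand-in for CPU-side work: batching, memory allocation, input prep.
    return {"id": batch_id, "inputs": list(range(4))}

def gpu_execute(batch: dict) -> int:
    # Stand-in for the GPU forward pass.
    return sum(batch["inputs"])

def run_overlapped(num_batches: int) -> list[int]:
    prepared: queue.Queue = queue.Queue(maxsize=1)  # one batch in flight

    def producer():
        for i in range(num_batches):
            # This put overlaps with gpu_execute() in the consumer loop.
            prepared.put(cpu_prepare(i))
        prepared.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := prepared.get()) is not None:
        results.append(gpu_execute(batch))
    return results

print(run_overlapped(3))  # [6, 6, 6]
```

With `maxsize=1`, the producer is always exactly one batch ahead of the consumer — the same steady-state pipelining the overlap scheduler achieves between CPU and GPU.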

Simulation Platforms: Vidur vs SimAI

Vidur: Single-Repo DES Simulator

  • Pure Python discrete-event simulation — no GPU required
  • Event queue drives RequestArrival → Schedule → BatchStageEnd → BatchEnd
  • sklearn RandomForest predicts execution time from profiled CSV data
  • 5 replica schedulers: vLLM, Sarathi, Orca, FasterTransformer, LightLLM
  • Vidur-Search: explores entire configuration space in ~1 hour on CPU

SimAI: Full-Stack Multi-Component Simulator

  • 5 integrated components: vidur-alibabacloud + AICB + SimCCL + astra-sim + NS-3
  • PD Disaggregation via SplitWise scheduler with P:D ratio control
  • AICB profiles real GPU kernels (DeepGEMM FP8, FlashMLA) on Hopper/Blackwell
  • SimCCL decomposes NCCL collectives (Ring, Tree, NVLS) into P2P flows
  • 3 network backends: Analytical (busbw), NS-3 simulation, Physical RDMA

Key Concepts Glossary

Core terminology used throughout the deep-dive pages. Each concept is tagged with its primary system of origin, though most are applicable across engines.

PagedAttention vLLM

Memory management technique that partitions KV cache into fixed-size pages (blocks), enabling non-contiguous storage of key-value tensors in GPU memory. Eliminates internal fragmentation and enables memory sharing across requests. Inspired by OS virtual memory paging. Each block typically holds 16 tokens.
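The block-table indirection can be shown in a few lines. This sketch assumes 16-token blocks as described above; the class names are illustrative, not vLLM's internal API:

```python
BLOCK_SIZE = 16  # tokens per block, per the PagedAttention description

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # physical block IDs

    def alloc(self) -> int:
        return self.free.pop()  # any free block; physical order is irrelevant

class BlockTable:
    """Maps a sequence's logical blocks to non-contiguous physical blocks."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.blocks: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:  # current block is full
            self.blocks.append(self.allocator.alloc())
        self.num_tokens += 1

    def physical_slot(self, pos: int) -> tuple[int, int]:
        """(physical block, offset) for a logical token position."""
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=64)
table = BlockTable(alloc)
for _ in range(20):          # 20 tokens occupy 2 blocks (16 + 4)
    table.append_token()
assert len(table.blocks) == 2
block, off = table.physical_slot(17)   # token 17 is slot 1 of block 2
assert (block, off) == (table.blocks[1], 1)
```

Because attention kernels read through this indirection, the last block can be only partially filled, which is exactly what eliminates internal fragmentation.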

RadixAttention SGLang

KV cache management using a radix tree (compact trie) that automatically shares cached key-value tensors across requests with common token prefixes. Unlike block-hash approaches, the radix tree structure naturally handles arbitrary-length prefix matching and enables efficient LRU/LFU eviction at any granularity.
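Arbitrary-length prefix matching can be sketched with a trie. For brevity this uses a per-token trie rather than a compressed radix tree, but the matching behavior — the longest shared prefix at any granularity — is the same idea; `match_prefix` mirrors the name used in SGLang's RadixCache:

```python
class TrieNode:
    def __init__(self):
        self.children: dict[int, "TrieNode"] = {}

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens: list[int]) -> None:
        """Record a cached token sequence, sharing existing prefix nodes."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match_prefix(self, tokens: list[int]) -> int:
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])                  # e.g. a cached system prompt
assert cache.match_prefix([1, 2, 3, 9]) == 3   # matches at arbitrary length
assert cache.match_prefix([7, 8]) == 0
```

A real radix tree compresses runs of single-child nodes into one edge, which is what makes eviction and matching efficient at scale.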

Continuous Batching General

Dynamic batching strategy where new requests can join or leave a running batch at every iteration step, rather than waiting for all requests in a batch to complete. Maximizes GPU utilization by immediately filling slots freed by completed requests. Both vLLM and SGLang implement this as their core scheduling strategy.
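A toy step loop makes the benefit concrete: requests join the running batch the moment a slot frees, instead of waiting for the whole batch to drain. This is illustrative only — real schedulers also track token budgets and KV memory:

```python
from collections import deque

def serve(requests: list[int], max_batch: int) -> int:
    """`requests` = tokens left to generate per request; returns step count."""
    waiting = deque(requests)
    running: list[int] = []
    steps = 0
    while waiting or running:
        # Admit new requests into freed slots at every iteration.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step: every running request emits one token.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

# One long and two short requests, batch size 2: the short requests
# finish early and their slots are refilled immediately.
print(serve([4, 1, 1], max_batch=2))  # 4 steps (static batching would take 5)
```

With static batching, the batch `[4, 1]` would occupy both slots for 4 steps even though one request finished after 1 — continuous batching recovers that idle slot.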

Chunked Prefill General

Technique for handling long prompts by splitting the prefill computation into smaller chunks that can be interleaved with decode tokens in the same batch. Prevents long prompts from monopolizing GPU compute and causing latency spikes for concurrent decode-phase requests. vLLM uses long_prefill_token_threshold; SGLang uses extend mode.
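The chunking arithmetic can be sketched as follows. The budget and decode-slot numbers are illustrative assumptions, and the function name is hypothetical — this is the planning idea, not either engine's scheduler:

```python
def plan_prefill_chunks(prompt_len: int, token_budget: int,
                        decode_tokens_per_step: int) -> list[int]:
    """Chunk sizes for one prompt, leaving room for decode tokens each step."""
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        room = token_budget - decode_tokens_per_step  # decode is served first
        chunk = min(remaining, room)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

# An 8000-token prompt under a 2048-token batch budget, with 48 concurrent
# decode requests each consuming one budget slot per step:
print(plan_prefill_chunks(8000, 2048, 48))  # [2000, 2000, 2000, 2000]
```

Each step still produces tokens for the 48 decode requests, so their inter-token latency stays bounded while the long prompt is prefilled over four steps.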

Prefix Caching General

Optimization that reuses KV cache from previously computed token sequences when a new request shares the same prefix (e.g., system prompt, few-shot examples). vLLM uses hash-based block matching (16-token blocks). SGLang uses RadixCache (automatic radix tree). LMCache uses content-addressable 256-token chunks with cross-instance deduplication.
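The hash-based variant can be sketched directly: fixed 16-token blocks, each keyed by a hash chained over everything up to and including that block, so a block's key is prefix-dependent. Function names follow the text; the hashing details are illustrative:

```python
import hashlib

BLOCK_SIZE = 16

def block_hashes(tokens: list[int]) -> list[str]:
    """One chained hash per full block; partial tail blocks are not cached."""
    hashes, parent = [], ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = tokens[i:i + BLOCK_SIZE]
        h = hashlib.sha256((parent + repr(block)).encode()).hexdigest()
        hashes.append(h)
        parent = h  # chaining makes each block's hash prefix-dependent
    return hashes

def find_longest_cache_hit(tokens: list[int], cache: set[str]) -> int:
    """Number of leading tokens covered by cached blocks."""
    hit_blocks = 0
    for h in block_hashes(tokens):
        if h not in cache:
            break
        hit_blocks += 1
    return hit_blocks * BLOCK_SIZE

shared_prefix = list(range(40))            # 2 full blocks + 8 leftover tokens
cache = set(block_hashes(shared_prefix))
query = shared_prefix + [99, 100]
assert find_longest_cache_hit(query, cache) == 32  # 2 cached blocks of 16
```

The block granularity is the key contrast with RadixCache: a hash scheme can only reuse whole 16-token blocks, while a radix tree matches at any token boundary.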

KV Cache General

Cached key and value projection tensors from the attention mechanism, stored per layer per token. During autoregressive decoding, only the new token's Q/K/V need to be computed — all previous K/V are read from cache. KV cache memory consumption scales linearly with sequence length and is typically the dominant memory consumer after model weights.
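The linear scaling is easy to quantify: 2 tensors (K and V) × layers × KV heads × head dimension × bytes per element, per token. The model dimensions below are the widely published Llama-2-7B shapes:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    # K and V each store (kv_heads * head_dim) elements per layer per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama-2-7B: 32 layers, 32 KV heads, head_dim 128, fp16 (2 bytes)
per_token = kv_bytes_per_token(32, 32, 128, 2)
print(per_token)                 # 524288 bytes = 512 KiB per token
print(per_token * 4096 / 2**30)  # 2.0 GiB for a single 4096-token sequence
```

At 2 GiB per 4K-token sequence, a handful of long requests exhausts a GPU's spare memory — which is why paged allocation, prefix sharing, and multi-tier offload all target the KV cache specifically.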

CUDA Graphs General

NVIDIA mechanism for capturing and replaying a sequence of GPU operations as a single launchable unit, eliminating per-kernel CPU launch overhead. Both vLLM and SGLang capture CUDA graphs for common batch sizes at startup. SGLang's CudaGraphRunner and vLLM's graph management in GPUModelRunner reduce decode-phase latency by 10–30%.

Tensor Parallelism (TP) General

Distributed strategy that splits individual weight matrices across multiple GPUs, with each GPU computing a portion of every layer's output. Requires all-reduce or all-gather communication between GPUs after each layer. Both engines implement ColumnParallelLinear and RowParallelLinear layers. Best for intra-node parallelism with fast NVLink interconnect.
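The column-parallel case can be demonstrated in pure Python: each "GPU" holds a slice of the weight columns, computes a partial output, and the shards are concatenated (an all-gather in a real deployment). A minimal sketch, with plain lists standing in for tensors:

```python
def matmul(x, w):
    """Naive matrix multiply over nested lists."""
    rows, inner, cols = len(x), len(w), len(w[0])
    return [[sum(x[i][k] * w[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

def column_parallel(x, w, tp: int):
    """Split w's columns across tp ranks, then gather the column shards."""
    cols = len(w[0])
    shard = cols // tp  # assumes cols divides evenly across ranks
    outs = []
    for rank in range(tp):
        w_shard = [row[rank * shard:(rank + 1) * shard] for row in w]
        outs.append(matmul(x, w_shard))     # this part runs on GPU `rank`
    # all-gather: concatenate the column shards into the full output
    return [sum((o[i] for o in outs), []) for i in range(len(x))]

x = [[1, 2]]
w = [[1, 2, 3, 4], [5, 6, 7, 8]]
assert column_parallel(x, w, tp=2) == matmul(x, w)  # identical to 1 "GPU"
```

Row-parallel layers split the other dimension and finish with an all-reduce instead of an all-gather; stacking a column-parallel layer before a row-parallel one is what lets both engines run an MLP with a single communication step.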

Pipeline Parallelism (PP) General

Distributed strategy that assigns different layers of the model to different GPUs, forming a pipeline where activations are sent forward between stages. Better suited for inter-node parallelism where bandwidth is limited. vLLM supports PP via Ray compiled DAG. SGLang implements PP via its scheduler PP mixin. Micro-batching minimizes pipeline bubble overhead.

Discrete-Event Simulation Vidur

Simulation methodology where a priority queue of timestamped events drives the system state forward. Each event (e.g., RequestArrival, BatchStageEnd) triggers handler functions that process the event and schedule future events. Vidur and SimAI use this to simulate inference request lifecycles without actual GPU computation.
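A minimal event loop in this style fits in a few lines. Event names mirror the text; the single-server queueing model is an illustrative toy, not Vidur's simulator:

```python
import heapq

def simulate(arrivals: list[float], service_time: float) -> dict[float, float]:
    """Single replica, FCFS: returns {arrival_time: completion_time}."""
    events = [(t, "RequestArrival", t) for t in arrivals]
    heapq.heapify(events)  # priority queue ordered by timestamp
    busy_until = 0.0
    completions: dict[float, float] = {}
    while events:
        now, kind, req = heapq.heappop(events)
        if kind == "RequestArrival":
            start = max(now, busy_until)       # queue if the replica is busy
            busy_until = start + service_time
            # Handlers schedule future events, driving the clock forward.
            heapq.heappush(events, (busy_until, "BatchEnd", req))
        elif kind == "BatchEnd":
            completions[req] = now
    return completions

# Two requests, 5-time-unit service: the second request waits in queue.
print(simulate([0.0, 1.0], service_time=5.0))  # {0.0: 5.0, 1.0: 10.0}
```

The simulated clock jumps directly between event timestamps — nothing executes in between — which is why a full serving trace can be replayed on a CPU in seconds.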

PD Disaggregation SimAI

Architecture that separates prefill (compute-intensive prompt processing) and decode (memory-bound token generation) onto dedicated GPU pools. Eliminates resource contention but introduces KV cache transfer cost between pools. SimAI models this via SplitWise scheduler with configurable P:D node ratios and KV transfer cost estimation.
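The transfer cost being traded off can be estimated with a simple bandwidth model. The link speed and per-token KV size below are illustrative assumptions, not figures from SimAI:

```python
def kv_transfer_ms(prompt_tokens: int, kv_bytes_per_token: int,
                   link_gbytes_per_s: float) -> float:
    """Time to ship a prompt's KV cache from the prefill to the decode pool."""
    total_bytes = prompt_tokens * kv_bytes_per_token
    return total_bytes / (link_gbytes_per_s * 1e9) * 1e3

# 4096-token prompt, 512 KiB of KV per token, over an assumed 50 GB/s link:
print(round(kv_transfer_ms(4096, 512 * 1024, 50.0), 1))  # 42.9 ms
```

Whether tens of milliseconds of transfer is worth paying depends on how much prefill/decode interference it removes — exactly the trade-off a SplitWise-style simulation with configurable P:D ratios is meant to quantify.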

Collective Communication SimAI

Group communication operations (AllReduce, AllGather, ReduceScatter, AllToAll) used in distributed training and inference for gradient synchronization and activation exchange. SimCCL decomposes these into point-to-point flows following NCCL algorithms (Ring, Tree, NVLS), enabling precise network simulation via astra-sim and NS-3.