Source-code-level analysis of six systems that form the modern LLM inference stack: serving engines and caching/orchestration infrastructure (vLLM, SGLang, LMCache, Dynamo), and simulation platforms (Vidur, SimAI).
**vLLM:** The first system to introduce PagedAttention for efficient KV cache memory management. V1 features a multi-process architecture with ZeroMQ IPC, a unified token budget scheduler, and an extensive set of attention backends, including FlashAttention 3/4 and MLA variants for DeepSeek models.

**SGLang:** Built around RadixCache for automatic prefix sharing via a radix tree data structure. Features an overlap scheduling mode where the GPU executes the current batch while the CPU prepares the next one. Pioneered the "extend" mode for efficient multi-turn conversations and offers a powerful frontend DSL for LLM programs.

**LMCache:** A middleware layer that sits between inference engines and multi-tier storage. Reduces time-to-first-token by 3–10x by caching KV tensors across GPU, CPU DRAM, NVMe, and remote storage. Unlike built-in prefix caching, LMCache can reuse any repeated text segment across any serving instance in a distributed deployment.

**Dynamo:** NVIDIA's open-source inference framework for disaggregated serving, providing intelligent request routing, prefill-decode separation, and KV cache transfer orchestration across GPU clusters. Integrates with vLLM and other backends to enable production-scale multi-node deployments with dynamic scaling.

**Vidur:** Microsoft's discrete-event-simulation (DES) based inference simulator that models the full serving stack without running any GPU inference. Simulates scheduling policies (vLLM, Sarathi, Orca), predicts execution times via ML models, and enables capacity planning across cluster configurations at a fraction of deployment cost.

**SimAI:** Alibaba Cloud's full-stack simulator extending Vidur with five integrated components: vidur-alibabacloud for PD disaggregation scheduling, AICB for GPU kernel profiling (DeepSeek/Qwen3-MoE), SimCCL for NCCL algorithm decomposition, astra-sim for system simulation, and NS-3 for packet-level network modeling.
| Pipeline Stage | vLLM V1 | SGLang | LMCache |
|---|---|---|---|
| HTTP Entry | FastAPI via Uvicorn<br>`entrypoints/openai/api_server.py`<br>OpenAI-compatible REST routes | FastAPI server<br>`entrypoints/http_server.py`<br>OpenAI + Anthropic + Ollama compatible | N/A (middleware — no HTTP layer) |
| Tokenization | In AsyncLLM process<br>HF tokenizer via `inputs/preprocess.py`<br>Same process as API server | Separate TokenizerManager process<br>Chat template application + multimodal preprocessing<br>Communicates with Scheduler via ZMQ | N/A (uses engine's tokenizer) |
| Scheduling | **Unified token budget**<br>Continuous batching; decode priority over prefill<br>Automatic chunked prefill for long prompts<br>`v1/core/sched/scheduler.py` | **Overlap scheduling**<br>FCFS with extend mode; GPU executes batch N while CPU prepares batch N+1<br>`managers/scheduler.py` | N/A (no scheduling) |
| Prefix Caching | **Block hash**<br>Hash-based block matching with 16-token blocks<br>`cached_block_hash_to_block` dictionary<br>Enabled by default; `find_longest_cache_hit()` | **RadixCache**<br>Radix tree for automatic prefix sharing<br>LRU/LFU eviction; `match_prefix()`<br>No explicit configuration needed | **Content-addressable**<br>256-token chunks with rolling hashes<br>`ChunkedTokenDatabase`; cross-instance dedup<br>Can reuse any repeated text segment |
| KV Cache Memory | Pre-allocated paged blocks<br>`gpu_memory_utilization=0.9` default<br>Free block queue (doubly-linked list)<br>Preemption via recomputation only | ReqToTokenPool + TokenToKVPool<br>Specialized pools for MHA/MLA/Mamba<br>HiRadixCache for three-tier GPU/CPU/storage<br>Sliding window attention pool support | Four latency tiers:<br>L1: CPU DRAM (<1ms, pinned, NUMA-aware)<br>L2: NVMe (1–10ms, persistent)<br>L3: Remote (10–100ms, Redis/RDMA/Mooncake)<br>L4: P2P GPU-to-GPU (5–50ms, NIXL) |
| Attention | FlashAttention 3/4 (primary, Hopper+)<br>FlashInfer, Triton, PagedAttention<br>TRT-LLM attention backend<br>MLA suite: FlashMLA, CUTLASS MLA, Triton MLA, Sparse MLA | FlashInfer (default pre-Hopper)<br>FlashAttention 3 (default Hopper)<br>FA4 (FP4), TRT-LLM MLA/MHA<br>Dual-chunk FA (Qwen2.5-1M long context)<br>Hybrid prefill/decode mixing | N/A (uses engine's attention kernels) |
| Speculative Decoding | Integrated in `GPUModelRunner`<br>`propose_draft_token_ids()`<br>Triton-based rejection sampler<br>EAGLE, EAGLE-3, Medusa, MTP, N-gram | Dedicated EAGLE worker (separate process)<br>Tree-based speculation with CUDA graph runner<br>EAGLE, EAGLE-2/3, MTP, draft models, N-gram | N/A |
| MoE / Expert Parallelism | FusedMoE kernels<br>`--enable-expert-parallel`<br>Expert-parallel load balancing (EPLB) | Triton-based fused MoE with pre-tuned configs<br>EP with DeepEP, Mooncake, Mori, FuseEP<br>FlashInfer all-to-all dispatch<br>Dynamic expert rebalancing (EPLB Manager) | N/A |
| Disaggregated P/D | `KVConnectorBase_V1` interface<br>Scheduler-side: `build_connector_meta()`<br>Worker-side: `start_load_kv()` / `save_kv_layer()`<br>Connectors: NIXL, LMCache, shared storage | Disaggregation directory + HiCache<br>Three-tier GPU/CPU/storage backends | PDBackend via NIXL/RDMA<br>Layer-by-layer pipelined transfers<br>Overlaps KV transfer with inference<br>Plugs into vLLM or SGLang connectors |
| Detokenization | In AsyncLLM (`output_processor`)<br>Incremental detokenization + stop checking<br>Same process as tokenization | Separate DetokenizerManager process<br>Converts token IDs → text incrementally<br>Trims at stop sequences; sends `BatchStrOut` | N/A (uses engine's detokenizer) |
| Dimension | Vidur | SimAI |
|---|---|---|
| Origin | Microsoft Research (MLSys'24) | Alibaba Cloud (NSDI'25) |
| Architecture | Single-repo DES simulator | 5-component full-stack (vidur fork + AICB + SimCCL + astra-sim + NS-3) |
| Deployment Model | Co-located only | Co-located + PD Disaggregation (SplitWise scheduler) |
| Compute Time | sklearn RandomForest on profiled CSV | AICB AIOB real GPU profiling (DeepGEMM/FlashMLA) |
| Communication | None (assumes negligible) | SimCCL + astra-sim (analytical/NS-3/physical) |
| Supported Models | LLaMA-2 (7B/13B/70B) | + DeepSeek-V3-671B, Qwen3-MoE-235B, Qwen3-Next-80B |
| Hardware Requirement | CPU only | Profiling: SM90+ GPU; Simulation: CPU |
| Scheduling Policies | vLLM, Sarathi, Orca, FasterTransformer, LightLLM | All Vidur policies + SplitWise (PD-aware) |
| Network Modeling | Not supported | Analytical (busbw) / NS-3 packet-level / Physical RDMA |
| Key Metric | E2E latency, TTFT, TBT | + pd_p2p_comm_time, prefill/decode breakdown |
SGLang batch forward modes: DECODE, EXTEND, IDLE. Vidur event chain: RequestArrival → Schedule → BatchStageEnd → BatchEnd.

**PagedAttention:** Memory management technique that partitions the KV cache into fixed-size pages (blocks), enabling non-contiguous storage of key-value tensors in GPU memory. Eliminates internal fragmentation and enables memory sharing across requests. Inspired by OS virtual memory paging. Each block typically holds 16 tokens.
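The bookkeeping behind paging can be sketched in a few lines. This is an illustrative toy (the class and method names here are hypothetical, not vLLM's actual ones): a free-block pool plus a per-sequence block table that maps a logical token position to a (physical block, offset) pair.

```python
# Minimal sketch of paged KV cache bookkeeping (hypothetical names,
# not vLLM's actual classes). Logical token positions map to physical
# blocks of BLOCK_SIZE tokens, so sequences need not be contiguous.
BLOCK_SIZE = 16  # tokens per block, matching vLLM's default

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # pool of free physical blocks

    def allocate(self) -> int:
        return self.free.pop()

    def release(self, block_id: int):
        self.free.append(block_id)

class SequenceBlockTable:
    """Maps a sequence's logical token index -> (physical block, offset)."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.blocks: list[int] = []
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:  # current block is full
            self.blocks.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, pos: int) -> tuple[int, int]:
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=8)
seq = SequenceBlockTable(alloc)
for _ in range(20):           # 20 tokens -> 2 blocks (16 + 4)
    seq.append_token()
print(len(seq.blocks))        # 2
print(seq.physical_slot(17))  # (second block's id, offset 1)
```

Because blocks are allocated on demand, the only wasted memory is the unused tail of each sequence's last block, which is what eliminates internal fragmentation at the granularity the entry describes.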
**RadixCache:** KV cache management using a radix tree (compact trie) that automatically shares cached key-value tensors across requests with common token prefixes. Unlike block-hash approaches, the radix tree structure naturally handles arbitrary-length prefix matching and enables efficient LRU/LFU eviction at any granularity.
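A toy radix tree makes the "arbitrary-length matching" point concrete. This sketch is illustrative only (not SGLang's RadixCache implementation): edges store token runs, insertion splits an edge at the point of divergence, and lookup returns how many leading tokens of a new request are already cached.

```python
# Toy radix-style prefix cache (illustrative, not SGLang's RadixCache).
class Node:
    def __init__(self):
        self.children = {}  # first token of edge -> (token_run, child Node)

class RadixPrefixCache:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens):
        node, i = self.root, 0
        while i < len(tokens):
            key = tokens[i]
            if key not in node.children:
                node.children[key] = (tokens[i:], Node())
                return
            run, child = node.children[key]
            # length of common prefix between this edge and the remainder
            j = 0
            while j < len(run) and i + j < len(tokens) and run[j] == tokens[i + j]:
                j += 1
            if j < len(run):  # split the edge at the divergence point
                mid = Node()
                mid.children[run[j]] = (run[j:], child)
                node.children[key] = (run[:j], mid)
                child = mid
            i += j
            node = child

    def match_prefix(self, tokens):
        node, i = self.root, 0
        while i < len(tokens) and tokens[i] in node.children:
            run, child = node.children[tokens[i]]
            j = 0
            while j < len(run) and i + j < len(tokens) and run[j] == tokens[i + j]:
                j += 1
            i += j
            if j < len(run):
                break
            node = child
        return i  # number of cached leading tokens

cache = RadixPrefixCache()
cache.insert([1, 2, 3, 4, 5])   # e.g. system prompt variant A
cache.insert([1, 2, 3, 9, 9])   # shares the [1, 2, 3] prefix
print(cache.match_prefix([1, 2, 3, 4, 7]))  # 4 leading tokens cached
```

Note the match length (4) is not a multiple of any block size: the tree shares exactly as many tokens as actually match, which is the contrast with 16-token block hashing drawn above.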
**Continuous batching:** Dynamic batching strategy where new requests can join or leave a running batch at every iteration step, rather than waiting for all requests in a batch to complete. Maximizes GPU utilization by immediately filling slots freed by completed requests. Both vLLM and SGLang implement this as their core scheduling strategy.
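The per-iteration join/leave behavior can be sketched as a simple scheduler loop (hypothetical helper, not either engine's code): each step admits waiting requests into free slots, generates one token for every running request, and retires finished ones immediately so their slots refill next step.

```python
# Illustrative continuous-batching loop (names are hypothetical).
from collections import deque

def continuous_batching(requests, max_batch):
    """requests: list of (request_id, tokens_to_generate)."""
    waiting = deque(requests)
    running = {}          # request_id -> tokens remaining
    completed_order = []
    steps = 0
    while waiting or running:
        # Admit new requests whenever slots are free (per-iteration join).
        while waiting and len(running) < max_batch:
            rid, n = waiting.popleft()
            running[rid] = n
        # One model forward = one new token for every running request.
        steps += 1
        for rid in list(running):
            running[rid] -= 1
            if running[rid] == 0:
                completed_order.append(rid)
                del running[rid]   # slot freed for the next iteration
    return steps, completed_order

steps, order = continuous_batching(
    [("a", 3), ("b", 1), ("c", 2), ("d", 2)], max_batch=2)
print(steps, order)
```

With static batching, "c" and "d" could not start until both "a" and "b" finished; here "c" is admitted the moment "b" completes, which is exactly the slot-refilling the entry describes.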
**Chunked prefill:** Technique for handling long prompts by splitting the prefill computation into smaller chunks that can be interleaved with decode tokens in the same batch. Prevents long prompts from monopolizing GPU compute and causing latency spikes for concurrent decode-phase requests. vLLM uses `long_prefill_token_threshold`; SGLang uses extend mode.
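The budgeting arithmetic is simple: each batch has a token budget shared between decodes and prefill, so the prompt is carved into chunks that fit the leftover room. A minimal sketch, with a hypothetical helper name:

```python
# Sketch of chunked-prefill budgeting (hypothetical helper, not the
# engines' actual schedulers).
def split_prefill(prompt_len, token_budget, decode_tokens):
    """Yield per-iteration prefill chunk sizes, given that
    `decode_tokens` slots of each batch go to running decodes."""
    room = token_budget - decode_tokens
    assert room > 0, "budget must exceed decode demand"
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(room, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return chunks

# An 8192-token prompt, a 2048-token batch budget, 48 active decodes:
print(split_prefill(8192, 2048, 48))  # four 2000-token chunks + remainder
```

The 48 decode requests keep generating a token every iteration instead of stalling for the four-plus batches the full prefill would otherwise occupy.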
**Prefix caching:** Optimization that reuses KV cache from previously computed token sequences when a new request shares the same prefix (e.g., system prompt, few-shot examples). vLLM uses hash-based block matching (16-token blocks). SGLang uses RadixCache (automatic radix tree). LMCache uses content-addressable 256-token chunks with cross-instance deduplication.
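The hash-based variant chains each block's key off its parent's key, so a cache hit on block k implies the entire prefix through block k matches. A sketch in that spirit (function and variable names here are hypothetical, not vLLM's):

```python
# Sketch of hash-based prefix matching on 16-token blocks (in the
# spirit of vLLM's approach; names are hypothetical).
import hashlib

BLOCK = 16

def block_keys(tokens):
    keys, parent = [], b""
    # only full blocks are hashable; the partial tail is recomputed
    for start in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        blob = parent + bytes(str(tokens[start:start + BLOCK]), "utf8")
        parent = hashlib.sha256(blob).digest()  # chained off parent key
        keys.append(parent)
    return keys

cache = {}  # block key -> physical block id

def longest_cache_hit(tokens):
    hit = 0
    for key in block_keys(tokens):
        if key not in cache:
            break
        hit += BLOCK
    return hit

shared = list(range(40))          # 40-token shared prefix = 2 full blocks
for i, key in enumerate(block_keys(shared)):
    cache[key] = i                # pretend these blocks are resident
print(longest_cache_hit(shared + [99, 98]))  # 32: two full blocks hit
```

Chaining the parent key into each hash is what makes a single dictionary lookup safe: two different prefixes can never share a block key even if one 16-token window happens to repeat.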
**KV cache:** Cached key and value projection tensors from the attention mechanism, stored per layer per token. During autoregressive decoding, only the new token's Q/K/V need to be computed — all previous K/V are read from cache. KV cache memory consumption scales linearly with sequence length and is typically the dominant memory consumer after model weights.
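The linear scaling is easy to quantify with the standard sizing formula. The parameters below are illustrative (roughly LLaMA-2-7B-shaped, not exact specs):

```python
# Back-of-envelope KV cache sizing: 2 tensors (K and V) per layer
# per token, each of size kv_heads * head_dim * dtype_bytes.
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * seq_len

# 32 layers, 32 KV heads, head_dim 128, fp16, 4096-token sequence:
gb = kv_cache_bytes(4096, 32, 32, 128, 2) / 2**30
print(f"{gb:.1f} GiB")  # 2.0 GiB for a single 4096-token sequence
```

At 512 KiB per token, a few dozen concurrent long sequences exhaust a GPU's spare memory, which is why the paging and offloading machinery above exists.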
**CUDA graphs:** NVIDIA mechanism for capturing and replaying a sequence of GPU operations as a single launchable unit, eliminating per-kernel CPU launch overhead. Both vLLM and SGLang capture CUDA graphs for common batch sizes at startup. SGLang's `CudaGraphRunner` and vLLM's graph management in `GPUModelRunner` reduce decode-phase latency by 10–30%.
**Tensor parallelism:** Distributed strategy that splits individual weight matrices across multiple GPUs, with each GPU computing a portion of every layer's output. Requires all-reduce or all-gather communication between GPUs after each layer. Both engines implement `ColumnParallelLinear` and `RowParallelLinear` layers. Best for intra-node parallelism with fast NVLink interconnect.
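Column-parallel splitting can be demonstrated numerically in a single process. This conceptual sketch (plain Python, no GPUs; all names hypothetical) shards a weight's output columns across "devices", computes each slice independently, and concatenates the slices in place of an all-gather:

```python
# Column-parallel linear sketch: shard output columns, compute
# slices independently, then "all-gather" by concatenation.
def matmul(x, w):  # x: [n], w: [n][m] -> [m]
    return [sum(xi * w[i][j] for i, xi in enumerate(x))
            for j in range(len(w[0]))]

def column_parallel(x, w, shards):
    cols = len(w[0])
    per = cols // shards
    partials = []
    for s in range(shards):  # shard s owns columns [s*per, (s+1)*per)
        w_shard = [row[s * per:(s + 1) * per] for row in w]
        partials.append(matmul(x, w_shard))
    return [v for p in partials for v in p]  # concat = all-gather

x = [1.0, 2.0]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
print(column_parallel(x, w, shards=2))  # equals the unsharded matmul
```

A row-parallel layer is the dual: shard input rows, and the per-shard partial sums are combined with an all-reduce (elementwise add) instead of a concatenation.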
**Pipeline parallelism:** Distributed strategy that assigns different layers of the model to different GPUs, forming a pipeline where activations are sent forward between stages. Better suited for inter-node parallelism where bandwidth is limited. vLLM supports PP via Ray compiled DAG. SGLang implements PP via its scheduler PP mixin. Micro-batching minimizes pipeline bubble overhead.
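How much micro-batching helps is captured by the standard bubble-fraction formula: with p stages and m micro-batches, the idle fraction during pipeline fill and drain is (p − 1)/(m + p − 1).

```python
# Pipeline-bubble arithmetic (standard formula): more micro-batches
# shrink the idle "bubble" at pipeline fill/drain.
def bubble_fraction(stages, microbatches):
    return (stages - 1) / (microbatches + stages - 1)

for m in (1, 4, 16, 64):
    print(m, round(bubble_fraction(4, m), 3))
# With 4 stages: 1 micro-batch wastes 75% of steps; 64 wastes ~4.5%
```

This is why PP schedulers keep many micro-batches in flight rather than pushing whole batches through one stage at a time.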
**Discrete-event simulation (DES):** Simulation methodology where a priority queue of timestamped events drives the system state forward. Each event (e.g., RequestArrival, BatchStageEnd) triggers handler functions that process the event and schedule future events. Vidur and SimAI use this to simulate inference request lifecycles without actual GPU computation.
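A DES core fits in a few lines. This minimal sketch (illustrative, not Vidur's actual classes) uses a heap of timestamped events; popping an event advances the clock, and handlers schedule follow-on events, mirroring the arrival → schedule → batch-end chain:

```python
# Minimal discrete-event-simulation core (illustrative).
import heapq

class Simulator:
    def __init__(self):
        self.clock = 0.0
        self.queue = []   # (timestamp, seq, handler, payload)
        self.seq = 0      # tie-breaker so heap never compares handlers
        self.log = []

    def schedule(self, delay, handler, payload=None):
        heapq.heappush(self.queue, (self.clock + delay, self.seq, handler, payload))
        self.seq += 1

    def run(self):
        while self.queue:
            self.clock, _, handler, payload = heapq.heappop(self.queue)
            handler(self, payload)   # handlers may schedule future events

def request_arrival(sim, rid):
    sim.log.append((sim.clock, "arrival", rid))
    sim.schedule(0.0, schedule_batch, rid)

def schedule_batch(sim, rid):
    sim.schedule(5.0, batch_end, rid)  # pretend a batch takes 5 ms

def batch_end(sim, rid):
    sim.log.append((sim.clock, "batch_end", rid))

sim = Simulator()
sim.schedule(0.0, request_arrival, "r1")
sim.schedule(2.0, request_arrival, "r2")
sim.run()
print(sim.log)  # events processed in timestamp order
```

Because no real computation happens between events, simulating hours of cluster traffic takes seconds on a CPU, which is the basis of the "fraction of deployment cost" claim above.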
**Prefill-decode disaggregation:** Architecture that separates prefill (compute-intensive prompt processing) and decode (memory-bound token generation) onto dedicated GPU pools. Eliminates resource contention but introduces KV cache transfer cost between pools. SimAI models this via SplitWise scheduler with configurable P:D node ratios and KV transfer cost estimation.
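The transfer cost has a simple first-order model: payload size over link bandwidth plus a fixed latency. The formula and parameters below are illustrative assumptions, not SimAI's calibrated model; the 512 KiB/token figure matches the KV-cache sizing example earlier.

```python
# Rough KV-transfer cost model for PD disaggregation (illustrative).
def kv_transfer_ms(prompt_tokens, kv_bytes_per_token, link_gbps,
                   latency_ms=0.05):
    payload = prompt_tokens * kv_bytes_per_token        # bytes to move
    transfer = payload * 8 / (link_gbps * 1e9) * 1e3    # ms on the wire
    return latency_ms + transfer

# 4096-token prompt, 512 KiB of KV per token, 400 Gb/s RDMA link:
print(round(kv_transfer_ms(4096, 512 * 1024, 400), 2))
```

Tens of milliseconds per request is why layer-by-layer pipelined transfers (overlapping the copy with prefill compute, as LMCache's PDBackend does) matter in practice.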
**Collective communication:** Group communication operations (AllReduce, AllGather, ReduceScatter, AllToAll) used in distributed training and inference for gradient synchronization and activation exchange. SimCCL decomposes these into point-to-point flows following NCCL algorithms (Ring, Tree, NVLS), enabling precise network simulation via astra-sim and NS-3.
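The decomposition idea can be sketched for the ring algorithm (conceptual, in the spirit of what SimCCL does; not its actual output format): a ring AllReduce over n ranks becomes 2·(n − 1) steps of chunk-sized point-to-point sends from each rank to its ring neighbor.

```python
# Decomposing AllReduce into point-to-point flows per the NCCL ring
# algorithm: (n-1) reduce-scatter steps + (n-1) all-gather steps,
# each rank sending one payload/n chunk per step to its neighbor.
def ring_allreduce_flows(n_ranks, payload_bytes):
    chunk = payload_bytes / n_ranks
    flows = []  # (step, src, dst, bytes)
    for phase in range(2):                 # 0: reduce-scatter, 1: all-gather
        for step in range(n_ranks - 1):
            for src in range(n_ranks):
                dst = (src + 1) % n_ranks  # fixed ring neighbor
                flows.append((phase * (n_ranks - 1) + step, src, dst, chunk))
    return flows

flows = ring_allreduce_flows(4, 1024)
print(len(flows))                      # 4 ranks * 2*(4-1) steps = 24 sends
print(sum(b for _, s, _, b in flows if s == 0))  # bytes sent by rank 0
```

Each rank moves 2·(n − 1)/n of the payload regardless of n, so the flow list, fed into a packet-level simulator like NS-3, predicts how link bandwidth and topology bound the collective's completion time.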