arXiv:2405.05465 · MLSys 2024

Vidur: A Large-Scale Simulation Framework for LLM Inference

Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, Alexey Tumanov -- Georgia Tech & Microsoft Research India

Published: May 2024
Venue: 7th MLSys Conference
Repo: github.com/microsoft/vidur
License: MIT

Problem Statement & Motivation

Optimizing LLM inference deployment is extraordinarily expensive. Providers must navigate a vast configuration space spanning parallelism strategies (TP, PP), scheduling policies (vLLM, Orca, Sarathi-Serve), batch sizes, GPU SKUs (A100, H100), and workload-specific parameters. Evaluating even one configuration point means running the actual model on GPUs, and an exhaustive sweep for a single model can cost over $200K.

  • $694K: daily ChatGPT serving cost (estimate)
  • 42K GPU-hours: brute-force search time
  • $218K: brute-force search cost
  • 2x: cost of mis-configuration

Key Observations from the Paper

Figure 1 Finding: The optimal deployment configuration depends on both the model and the workload trace. An optimal config on one trace can be up to 2x sub-optimal on another trace for the same model (LLaMA2-70B). Even the optimal GPU SKU changes: H100 for Chat-1M, A100 for BWB.

Challenges in Simulating LLM Inference

Fine Time Granularity

LLM inference iterations are just a few milliseconds, unlike training iterations which run for hundreds of ms. Predictions must be accurate at the sub-millisecond level.

Varying Iteration Times

Unlike training where batch sizes are fixed, inference iteration times vary due to different sequence lengths, prefill/decode mix, and dynamic batch composition.

Cascading Errors

Small prediction errors compound over time since requests arrive dynamically and batch compositions change. A 1% per-iteration error can cascade into much larger end-to-end divergence.
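A toy calculation makes the cascading concern concrete. The iteration time and bias below are assumed for illustration, not numbers from the paper:

```python
# Assumed numbers for illustration: a 5 ms real iteration mispredicted by 1%.
real_iter_ms, bias = 5.0, 1.01
real_clock = sim_clock = 0.0
for _ in range(10_000):               # roughly 50 s of decode iterations
    real_clock += real_iter_ms
    sim_clock += real_iter_ms * bias  # simulator runs 1% slow every step
drift = sim_clock - real_clock
print(drift)  # ~500 ms: about 100 iterations' worth of divergence
```

Half a second of clock drift is enough to put requests into different batches than they actually joined, which in turn changes every subsequent iteration's composition.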

Key Contributions

The paper introduces three main components that work together to enable fast, inexpensive LLM inference performance exploration:

Vidur Simulator + Vidur-Bench + Vidur-Search = Optimal Config
1. Vidur Simulator (Section 4)

A high-fidelity discrete-event simulator that predicts request-level LLM inference performance with under 9% error. It emulates the full inference stack: model execution, scheduling, and cluster-level coordination.

Discrete-Event Simulation · ML-based Prediction · Hierarchical Scheduling
2. Vidur-Bench (Section 5)

A benchmark suite with plug-and-play support for workload patterns (Chat-1M, Arxiv-Sum, BWB), scheduling policies (vLLM, Orca, Sarathi-Serve, FasterTransformer, LightLLM), and profiling data for A100/H100 GPUs.

5 schedulers · 3 workloads · 2 GPU SKUs

3. Vidur-Search: Configuration Optimizer

An automated search tool that finds the optimal deployment configuration maximizing QPS-per-dollar while meeting SLO constraints (TTFT P90 < 2s, TBT P99 < 200ms). It uses binary search over the simulator to find maximum capacity for each configuration.
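A minimal sketch of that binary-search idea. The function and its interface are illustrative, not the repo's CapacitySearch API; `meets_slo` stands in for a full simulation run that checks the TTFT/TBT constraints at a given load:

```python
def max_qps_under_slo(meets_slo, lo=0.1, hi=64.0, tol=0.05):
    """Binary-search the highest QPS for which meets_slo(qps) stays True.

    meets_slo is a stand-in for running the simulator at a given load and
    checking the SLOs (an assumption, not the repo's actual API).
    """
    if not meets_slo(lo):
        return 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if meets_slo(mid):
            lo = mid   # SLOs hold: push the load higher
        else:
            hi = mid   # SLO violated: back off
    return lo

# Toy stand-in: pretend the SLOs hold below 3.7 QPS.
print(max_qps_under_slo(lambda qps: qps < 3.7))  # converges just below 3.7
```

Because each probe is a full simulation, the search cost is a handful of simulator runs per configuration rather than a GPU deployment per load level.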

  • 1 hr: search time (CPU)
  • $125: total search cost
  • 35,565: configs evaluated
  • 9,000x: average cost savings

System Architecture (Paper Figure 2)

Vidur's architecture has two main phases: Model Onboarding (offline profiling + ML model training) and Simulation Runtime (discrete-event simulation with hierarchical scheduling).

[Figure 2 diagram. Model Onboarding: model spec (layers, heads, dims) -> offline profiler (CUPTI kernel traces) -> compute profiles (CSV: token-level, attention, comm ops) -> Random Forest models trained on the profiled data -> prediction tables. Simulation Runtime (discrete-event engine): simulation spec (GPU, TP, PP, scheduler) -> 3-tier hierarchical scheduler (global + replica + stage; each policy < 150 lines of Python) and a heapq-based event queue (8 event types covering request, batch, and stage lifecycle) -> metrics tracker (TTFT, TBT, latency; per-request + cluster) -> simulation report (MFU, MBU, KV-cache utilization; plots, Wandb, Chrome traces). Operator classification: token-level ops (Linear, Norm, Activation), sequence-level ops (attention kernels), and comm ops (AllReduce, Send/Recv); key insight: all LLMs decompose into the same small set of operator types, profiled on a single GPU.]
Vidur's two-phase architecture: Model Onboarding (profiling + ML training) feeds into the Simulation Runtime (discrete-event engine).

Execution Time Modeling

Vidur's runtime prediction is built on the key insight that LLM operators can be classified into three categories, each with a different prediction strategy. The paper uses Random Forest regression models trained on profiled data, achieving accuracy that prevents error cascading.

Operator Classification & Prediction Strategy

| Category | Operators | Depends On | Prediction Approach |
|---|---|---|---|
| Token-level | Linear (QKV proj, MLP up/down), Norm, Activation | Total tokens in batch (prefill + decode) | RF on (batch_size, num_tokens) |
| Sequence-level | Attention (prefill: quadratic; decode: memory-bound) | Per-request context length + request history | Separate prefill/decode models; equivalent-batch trick |
| Communication | AllReduce (TP), Send/Recv (PP) | Data size only (model-agnostic) | Pre-profiled lookup indexed by topology |
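To make the three-way split concrete, here is a self-contained toy that composes per-category estimates into one iteration time. The stand-in cost functions and their coefficients are invented for illustration; the real system uses the trained Random Forests for the first two categories and a profiled lookup table for communication:

```python
from dataclasses import dataclass

@dataclass
class Req:
    context_len: int       # tokens already in the KV cache
    tokens_this_iter: int  # tokens processed this iteration (1 for a decode)

# Stand-in cost models with made-up coefficients (milliseconds):
def token_ms(batch_size, num_tokens):   # token-level ops: one batch-wide call
    return 0.002 * num_tokens + 0.01 * batch_size

def attn_ms(context_len, num_tokens):   # sequence-level ops: per request
    return 1e-6 * context_len * num_tokens

def comm_ms(num_tokens):                # communication: depends on data size only
    return 0.001 * num_tokens

def predict_iteration_ms(batch):
    n = sum(r.tokens_this_iter for r in batch)
    t = token_ms(len(batch), n)
    t += sum(attn_ms(r.context_len, r.tokens_this_iter) for r in batch)
    return t + comm_ms(n)

batch = [Req(512, 1), Req(2048, 1), Req(0, 128)]  # two decodes + a fresh prefill chunk
print(round(predict_iteration_ms(batch), 5))  # -> 0.42256
```

The structural point survives the toy coefficients: token-level cost is a single function of total tokens, attention must be summed per request, and communication ignores batch composition entirely.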

Prefill Attention: The Equivalent Batch Trick

Key Insight: Prefill attention cost is quadratic in sequence length. For a batch of P prefills with lengths p_i, the cost is proportional to the sum of the p_i squared. Vidur approximates this by predicting the runtime of a single equivalent prefill of length sqrt(sum(p_i^2)), collapsing the combinatorial space of batch compositions into a single profiled dimension.
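In code the trick is a one-liner (a sketch, not the repo's exact helper):

```python
import math

# Equivalent-prefill trick: pick the single length whose squared cost
# matches the batch's summed squared cost.
def equivalent_prefill_length(prefill_lens):
    return math.sqrt(sum(p * p for p in prefill_lens))

print(equivalent_prefill_length([300, 400]))  # -> 500.0
```

One profiled curve over a single length then covers every mix of prefill lengths in a batch.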

Implementation: ExecutionTime Entity

In the codebase, execution time is decomposed into 20 distinct components, each predicted separately:

# vidur/entities/execution_time.py -- 20 components per iteration
class ExecutionTime:
    def _get_block_execution_time(self) -> float:
        return (
            self._get_attention_layer_execution_time()  # QKV proj + RoPE + KV save + attn + out proj + norm
            + self._get_mlp_layer_execution_time()        # up_proj + down_proj + act + norm + allreduce
            + self._add_time                              # residual connection
        )

    @property
    def model_time(self) -> float:
        block_time = self._get_block_execution_time()
        stage_time = block_time * self._num_layers_per_pipeline_stage
        return (stage_time + self.pipeline_parallel_communication_time) * 1e-3

    @property
    def total_time(self) -> float:
        return self.model_time + self._get_cpu_overhead() * 1e-3
        # CPU overhead = schedule + sampler + prepare_inputs + process_outputs + ray_comm

Random Forest Prediction Pipeline

The sklearn-based predictor in the repo follows this pipeline:

CSV profiles -> feature engineering -> GridSearchCV -> RF models -> lookup tables
# vidur/execution_time_predictor/sklearn_execution_time_predictor.py
# Derived features for attention prediction:
df["num_tokens"] = df[["prefill_chunk_size", "batch_size"]].max(axis=1)
df["is_decode"] = df["prefill_chunk_size"] == 0
df["prefill_chunk_size_squared"] = df["prefill_chunk_size"] ** 2  # captures quadratic attention cost

# Communication features:
df["num_tokens"] = df["size"] / model_config.embedding_dim / 2  # bytes -> tokens

# Model selection: GridSearchCV with MAPE scorer
# Paper chose Random Forest over MLP and polynomial regression
# because RF captures non-linear CUDA kernel characteristics

Why Random Forest? The paper evaluated three approaches: (1) MLPs need large training sets and still fail to capture the CUDA-kernel non-linearities caused by tile/wave quantization; (2) polynomial regression misses these effects entirely; (3) Random Forest strikes the best balance of data frugality and fidelity.
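A minimal reproduction of that selection step using sklearn's public API. The synthetic data and the hyperparameter grid are illustrative, not the repo's actual values:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.integers(1, 4096, size=(200, 2)).astype(float)          # (batch_size, num_tokens)
y = 0.01 * X[:, 1] + 0.1 * X[:, 0] + rng.normal(0, 0.5, 200)    # synthetic runtimes (ms)

# MAPE scorer, as described above; greater_is_better=False negates the score.
mape = make_scorer(mean_absolute_percentage_error, greater_is_better=False)
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [8, 16]},
    scoring=mape,
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

MAPE is a natural fit here because prediction errors matter relative to each operator's runtime, which spans orders of magnitude across batch shapes.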

Scheduling Policies Compared

Vidur implements five scheduling policies, each under 150 lines of Python. The paper classifies them into prefill-prioritizing and decode-prioritizing categories, with Sarathi-Serve bridging both.

| Scheduler | Category | KV-Cache Mgmt | Key Feature | Repo Lines |
|---|---|---|---|---|
| vLLM | Prefill-first | PagedAttention (dynamic blocks) | Eagerly schedules prefills, pauses decodes | ~132 |
| Orca | Decode-first | Static allocation (max blocks) | Continuous batching, iteration-level scheduling | ~55 |
| Sarathi-Serve | Hybrid | PagedAttention + chunked prefills | Chunk size controls prefill/decode tradeoff | ~187 |
| FasterTransformer | Prefill-first | Static allocation, batch-level free | Processes entire batch to completion | ~66 |
| LightLLM | Decode-first | Token-level allocation (block_size=1) | Max-waiting-iters to prevent starvation | ~154 |

Hierarchical Scheduler Architecture

[Diagram. Tier 1: Global Scheduler (round-robin / LOR) -> Tier 2: Replica Schedulers 0..N (batching + memory) -> Tier 3: Stage Schedulers (pipeline microbatch scheduling); each policy under 150 lines of Python.]
Three-tier hierarchical scheduler: Global (routing) -> Replica (batching + memory) -> Stage (microbatch pipeline)

What-If Analysis: Best Policies by Workload (Paper Section 7.3)

Chat-1M (short prefills, many decodes)

Best: Sarathi-Serve with a large batch size (256) on H100. The moderate P:D ratio (2.3) means chunked prefills avoid stalling decodes; vLLM also performs well.

P:D = 2.3 · Best GPU: H100

BWB-4K (long decodes, low P:D)

Best: a smaller batch size (64), with A100 often the better SKU. Long sequences create high KV-cache pressure; decode-prioritizing policies struggle less since there are fewer prefills.

P:D = 0.65 · Best GPU: A100

Simulation Accuracy (Paper Figures 3 & 4)

Vidur was validated across four models (LLaMA2-7B, InternLM-20B, LLaMA2-70B, Qwen-72B), three workloads, and both static and dynamic arrival patterns. The baseline is an optimized vLLM fork with CUDA graph support.

Static Workloads (Figure 3)

| Model | TP | Median Exec-Latency Error | P95 Exec-Latency Error | Worst Case |
|---|---|---|---|---|
| LLaMA2-7B | 1 | 0.30% - 3.01% | 1.83% - 3.33% | 3.33% |
| InternLM-20B | 2 | 1.07% - 1.78% | 0.38% - 1.37% | 1.78% |
| LLaMA2-70B | 4 | 0.15% - 2.53% | 0.25% - 1.30% | 2.86% |
| Qwen-72B | 4 | 0.42% - 1.79% | 0.52% - 1.69% | 1.79% |

Dynamic Workloads at 85% Capacity (Figure 4)

| Model | Median E2E Error | P95 E2E Error | Overall Assessment |
|---|---|---|---|
| LLaMA2-7B (TP1) | 2.88% - 8.50% | 4.55% - 7.47% | Higher error (CPU overhead) |
| InternLM-20B (TP2) | 0.47% - 1.27% | 2.25% - 4.58% | Excellent fidelity |
| LLaMA2-70B (TP4) | 0.51% - 1.64% | 0.12% - 1.82% | Excellent fidelity |
| Qwen-72B (TP4) | 0.41% - 3.29% | 0.13% - 1.18% | Excellent fidelity |

Key Finding (Figure 4): Vidur achieves <5% error in almost all scenarios at 85% capacity. The 7B model shows slightly higher errors because CPU overhead (scheduling, input preparation) dominates at small model sizes and is harder to predict precisely.

Accuracy at Different Load Levels (Appendix Figure 7-8)

75% Capacity
  • LLaMA2-70B: <2.2%
  • Qwen-72B: <3.1%
  • LLaMA2-7B: <7.2%

95% Capacity (Near Saturation)
  • LLaMA2-70B: <1.7%
  • Qwen-72B: <2.8%
  • LLaMA2-7B: <12.7%

Evaluation Results & What-If Analysis

Optimal Configurations Found (Paper Figure 1a & 6)

Vidur-Search explored all 12 model-trace combinations across TP={1,2,4}, PP={1,2,4}, Scheduler={vLLM, Orca+, Sarathi-Serve}, BatchSize={32,64,128,256,512}, and GPU SKU={A100,H100}. The results reveal that no single configuration is universally optimal.

| Model | Workload | Best PP | Best TP | Best Scheduler | Best BS | SKU | QPS/$ |
|---|---|---|---|---|---|---|---|
| LLaMA-7B | Chat-1M | 1 | 1 | Sarathi-Serve | 64 | A100 | 1.831 |
| LLaMA2-70B | Chat-1M | 2 | 2 | Sarathi-Serve | 256 | H100 | 0.201 |
| LLaMA2-70B | BWB-4K | 2 | 4 | vLLM | 64 | A100 | 0.026 |
| Qwen-72B | Chat-1M | 2 | 4 | vLLM | 256 | H100 | 0.091 |
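Enumerating the grid described above gives the raw combination count per model-trace pair. The 35,565 total from Figure 1a is larger, presumably because Vidur-Search probes several QPS values per configuration (that multiplier is our inference, not stated here):

```python
from itertools import product

# Grid dimensions copied from the search-space description above:
tp_degrees = [1, 2, 4]
pp_degrees = [1, 2, 4]
schedulers = ["vLLM", "Orca+", "Sarathi-Serve"]
batch_caps = [32, 64, 128, 256, 512]
gpu_skus = ["A100", "H100"]

grid = list(product(tp_degrees, pp_degrees, schedulers, batch_caps, gpu_skus))
print(len(grid))  # -> 270 raw combinations per model-trace pair
```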

Pareto Frontier Analysis (Paper Figure 5)

Figure 5 Key Takeaways:
1. Configurations optimal on one SLO metric may violate another. Blue points on the Pareto curve satisfy TTFT but not TBT, or vice versa.
2. Small changes in SLO cause large cost differences. For LLaMA2-70B + Chat-1M, changing the TBT SLO from 120ms to 140ms (just 20ms!) shifts the Pareto point and results in ~1.85x reduction in cost.
3. Qwen-72B is ~2x more costly than LLaMA2-70B despite similar sizes, because Qwen uses Multi-Head Attention (MHA) instead of GQA, resulting in 8x higher KV-Cache load.
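A back-of-envelope check of that 8x figure. The shape parameters below (80 layers, head dim 128, 64 KV heads for Qwen-72B's MHA vs 8 for LLaMA2-70B's GQA, FP16 cache entries) are taken from the public model configs as we understand them, so treat them as assumptions:

```python
# Per-token KV-cache footprint as a function of KV-head count.
def kv_bytes_per_token(kv_heads, head_dim=128, layers=80, bytes_per_value=2):
    return 2 * kv_heads * head_dim * layers * bytes_per_value  # 2 = key + value

mha = kv_bytes_per_token(kv_heads=64)  # Qwen-72B: MHA, one KV head per query head
gqa = kv_bytes_per_token(kv_heads=8)   # LLaMA2-70B: GQA with 8 KV-head groups
print(mha // gqa)  # -> 8
```

The same GPU memory therefore holds 8x fewer cached tokens for Qwen-72B, which is what drives its higher serving cost despite the similar parameter count.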

Cost Comparison: Simulation vs Actual (Paper Table 2)

| Scenario | Actual Time | Sim Time | Actual Cost | Sim Cost | Savings |
|---|---|---|---|---|---|
| 7B-Chat1M | 4K hrs | 31 min | $20K | $5 | 3,837x |
| 7B-Arxiv | 10K hrs | 47 min | $52K | $8 | 6,708x |
| 20B-Arxiv | 14K hrs | 25 min | $73K | $4 | 17,746x |
| 70B-Chat1M | 12K hrs | 21 min | $64K | $4 | 18,151x |
| 70B-Arxiv | 15K hrs | 16 min | $78K | $3 | 30,187x |
| 72B-Arxiv | 17K hrs | 16 min | $88K | $3 | 33,354x |
Total: The entire what-if analysis (35,565 simulation runs for Figure 1a) cost $125 on a 96-core CPU machine, compared to $1.14M and 42K GPU-hours for actual deployment. This is a 9,000x average cost reduction.

Paper Claims vs Repository Implementation

We conducted a thorough comparison of the paper's claims against the open-source codebase at github.com/microsoft/vidur. Here is the detailed analysis:

Core Simulator Engine

Paper Claims
  • Discrete-event simulator with priority-based event queue
  • Events for request arrival, batch lifecycle, stage scheduling
  • Time-limit support for bounded simulations
  • Detailed metrics tracking (TTFT, TBT, latency, MFU)
Repository Implementation
  • Implemented: heapq-based event queue in simulator.py (129 lines)
  • Implemented: 8 event types in vidur/events/
  • Implemented: time-limit support and early termination
  • Implemented: MetricsStore with CDF sketches, Wandb, Chrome traces

Scheduling Policies

Paper Claims
  • 5 batching policies: vLLM, Orca, Sarathi-Serve, FasterTransformer, LightLLM
  • 3-tier hierarchical scheduler (global + replica + stage)
  • Each policy under 150 lines of Python
  • Memory management with block-based KV-Cache
Repository Implementation
  • Implemented: all 5 in vidur/scheduler/replica_scheduler/
  • Implemented: global: round-robin, LOR, random; stage: microbatch
  • Verified: Orca: 55, FT: 66, vLLM: 132, LightLLM: 154, Sarathi: 187 lines
  • Implemented: MemoryPlanner computes max batch size from GPU memory

Execution Time Prediction

Paper Claims
  • Random Forest models for runtime prediction
  • Operator triaging: token-level, sequence-level, communication
  • Automatic profiling for parallelism strategies from single GPU
  • Profiled on A100 and H100
Repository Implementation
  • Implemented: RF + Linear predictors in execution_time_predictor/
  • Implemented: 5 profiling data loaders in the sklearn base class
  • Implemented: profiling scripts in vidur/profiling/
  • Included: CSV data in data/profiling/

Model Support

Paper Claims
  • LLaMA2-7B, LLaMA2-70B, InternLM-20B, Qwen-72B
  • Declarative model specification format
  • Easy to add new models
Repository Implementation
  • Implemented + extended: also LLaMA3-8B/70B, CodeLlama-34B, InternLM2-20B, Phi-2
  • Implemented: dataclass configs in vidur/config/model_config.py
  • Verified: a new model is ~15 lines of dataclass

Vidur-Search (Configuration Optimizer)

Paper Claims
  • Binary search for maximum QPS under SLO
  • Parallelized across CPU cores
  • Visualization dashboard for Pareto analysis
  • Configurable SLO constraints (TTFT, TBT)
Repository Implementation
  • Implemented: CapacitySearch with adaptive binary search
  • Implemented: Ray-based parallelization with CPU affinity
  • Implemented: Streamlit dashboard (5 analysis pages)
  • Implemented: scheduling-delay quantile + value thresholds

What Is Missing or Different?

Gaps Between Paper and Implementation

| Feature | Paper Mentions | Repo Status | Notes |
|---|---|---|---|
| Async communication overlap | Future work (Sec 4.5) | Not implemented | Only sync PP scheduling |
| Speculative decoding | Future work (Sec 4.5) | Not implemented | Would need draft-model simulation |
| Energy consumption modeling | Planned (Sec 5.2) | Not implemented | Only FLOPs + memory util |
| Sequence parallelism | Future work (Sec 4.5) | Not implemented | Only TP and PP |
| Prefix caching | Not discussed | Not implemented | Important for production |
| Preemption counting | Mentioned (Sec 5.2) | Implemented | vLLM + Sarathi track this |
| Offline batch optimization | Possible extension (Sec 6) | Partial | Static workload mode supported |

Discrete-Event Simulation: Code Deep Dive

The core simulation loop is remarkably concise. The entire engine is 129 lines of Python:

# vidur/simulator.py -- The entire simulation engine
class Simulator:
    def run(self):
        while self._event_queue and not self._terminate:
            _, event = heapq.heappop(self._event_queue)  # get highest priority event
            self._set_time(event._time)                     # advance simulation clock
            new_events = event.handle_event(               # process event, get next events
                self._scheduler, self._metric_store
            )
            self._add_events(new_events)                    # push new events to queue
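The same pop/advance/push pattern in a fully self-contained toy (the Tick event type and its 5 ms period are invented for illustration):

```python
import heapq

class Tick:
    """Toy event: re-schedules itself 5 ms later until `remaining` hits zero."""
    def __init__(self, time, remaining):
        self.time, self.remaining = time, remaining

    def handle_event(self):
        return [Tick(self.time + 5, self.remaining - 1)] if self.remaining else []

clock, seq, queue = 0.0, 1, [(0.0, 0, Tick(0.0, 3))]
while queue:
    _, _, event = heapq.heappop(queue)
    clock = event.time                 # advance the simulation clock
    for e in event.handle_event():     # follow-up events go back on the heap
        heapq.heappush(queue, (e.time, seq, e))
        seq += 1                       # tie-breaker so heapq never compares Ticks
print(clock)  # -> 15.0
```

Note the monotonic sequence number in each heap entry: it breaks ties between simultaneous events without requiring event objects to be comparable.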

Event Lifecycle

RequestArrival -> GlobalSchedule -> ReplicaSchedule -> ReplicaStageSchedule -> BatchStageEnd -> BatchEnd (BatchStageEnd feeds the next PP stage; on BatchEnd, completed requests are freed and pending requests re-scheduled)
Event flow: Each request triggers a cascade of scheduling and execution events through the three-tier hierarchy.

Memory Management: MemoryPlanner

# vidur/scheduler/utils/memory_planner.py
class MemoryPlanner:
    def get_max_batch_size(self) -> int:
        available = gpu_memory * (1 - margin_fraction)  # typically 80GB * 0.9
        param_memory = 2 * num_parameters_per_device   # FP16
        kv_per_request = (
            2                                          # bytes per float
            * 2                                         # key + value
            * attention_head_dim
            * kv_heads_per_tp_worker
            * max_request_tokens
            * num_layers
        )
        return (available - param_memory) // kv_per_request
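Plugging in illustrative numbers (FP16 70B parameters on a TP4 replica of 80 GB GPUs, 10% margin, 8 GQA KV heads, head dim 128, 80 layers, 4096-token budget; these are assumptions, not the repo's defaults) gives a concrete max batch size:

```python
GB = 1024**3

available = 80 * GB * 0.9      # one 80 GB GPU, 10% margin held back
param_memory = 2 * (70e9 / 4)  # FP16 bytes for 70B params split across TP4
kv_per_request = (
    2            # bytes per FP16 value
    * 2          # key + value
    * 128        # attention head dim
    * (8 // 4)   # KV heads per TP worker (8 GQA heads / TP4)
    * 4096       # max tokens per request
    * 80         # layers
)
max_batch_size = int((available - param_memory) // kv_per_request)
print(max_batch_size)  # -> 126
```

With these assumptions the parameters consume roughly 35 GB of the 72 GB budget, and each request's worst-case KV cache takes about 320 MB, capping the batch at 126 requests.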

Vidur-Bench: Workload Characteristics

The benchmark suite includes three carefully chosen workloads that represent different LLM usage patterns. The key insight is that workload characteristics dramatically affect optimal configuration.

| Dataset | Content | # Queries | Prefill (med) | Decode (med) | P:D | Characteristic |
|---|---|---|---|---|---|---|
| Chat-1M | LMSys conversations | 2M | 417 | 141 | 2.3 | Moderate |
| Arxiv-4K | Paper summaries | 203K | 7827 | 228 | 35.4 | Prefill-heavy |
| BWB-4K | Book translation | 195K | 2396 | 3589 | 0.66 | Decode-heavy |

Impact on Optimal Config: The decode phase can be up to 200x more expensive than prefill per token (Agrawal et al., 2023). BWB-4K has 10x longer decodes and 2x longer prefills compared to Chat-1M, causing completely different optimal configurations. This is why Vidur-Search is essential -- you cannot pick a single best config.

Request Generation in the Repo

The repo implements a flexible request generation framework supporting both synthetic and trace-replay modes:

Interval Generators
  • Poisson (online / dynamic)
  • Gamma (bursty)
  • Static (all at t=0, offline)
  • Trace replay
Length Generators
  • Uniform (min/max range)
  • Zipf (power-law)
  • Fixed (constant length)
  • Trace replay (from CSV)
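A sketch of the Poisson interval mode from the list above (illustrative, not the repo's generator class): inter-arrival gaps are drawn from an exponential distribution whose rate is the target QPS.

```python
import random

def poisson_arrivals(qps, n, seed=42):
    """Generate n arrival timestamps with exponential inter-arrival gaps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(qps)  # gap ~ Exp(qps), mean 1/qps seconds
        times.append(t)
    return times

times = poisson_arrivals(qps=0.5, n=1000)
print(times[-1] / 1000)  # mean gap, should be ~2.0 s at 0.5 QPS
```

Swapping `expovariate` for a gamma draw gives the bursty "Gamma" mode, and replacing the draw with recorded timestamps gives trace replay.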

Use Cases

Capacity Planning

Given a target workload and SLO requirements, determine the minimum GPU fleet size and optimal SKU. Vidur-Search's binary search finds maximum QPS per configuration, then selects the most cost-effective option.

Scheduling Policy Design

Test new scheduling algorithms without GPU access. The extensible API requires implementing only _get_next_batch() and on_batch_end() methods. Sarathi-Serve's chunked prefill was validated this way.

Hardware Selection

Compare A100 vs H100 cost-effectiveness for specific workloads. The paper found that the optimal SKU changes with workload: H100 is better for Chat-1M but A100 wins for BWB due to lower cost-per-GB.

Example: Running Vidur for Capacity Planning
# Run a single simulation
python -m vidur.main \
  --replica_config_model_name meta-llama/Llama-2-70b-hf \
  --cluster_config_num_replicas 4 \
  --replica_config_num_pipeline_stages 2 \
  --replica_config_tensor_parallel_size 4 \
  --replica_scheduler_config_type sarathi \
  --sarathi_scheduler_chunk_size 512 \
  --vllm_scheduler_batch_size_cap 256 \
  --request_generator_config_type synthetic \
  --synthetic_request_generator_interval_generator_config_type poisson \
  --poisson_request_interval_generator_qps 0.5 \
  --length_generator_config_type trace \
  --trace_request_length_generator_trace_file data/processed_traces/chat1m.csv

# Run Vidur-Search for optimal configuration
python -m vidur.config_optimizer.config_explorer.main \
  --output_dir /tmp/vidur_search \
  --time_limit 600 \
  --scheduling_delay_slo_value 5 \
  --scheduling_delay_slo_quantile 0.99
Example: Adding a New Scheduling Policy
# Implementing a new scheduler requires 2 methods:
class MyCustomScheduler(BaseReplicaScheduler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._num_running_batches = 0

    def on_batch_end(self, batch: Batch) -> None:
        self._num_running_batches -= 1
        for request in batch.requests:
            if request.completed:
                self.free(request.id)

    def _get_next_batch(self) -> Batch:
        # Your batching logic here
        # Has access to: self._request_queue, self._allocation_map,
        # self.can_allocate(), self.allocate(), self._max_batch_size
        pass

Limitations & Future Work

Accuracy Limitations
  • Small models (7B): CPU overhead dominates; up to 12.65% error at 95% capacity
  • Near capacity: Errors cascade more at the tipping point
  • GQA batching overhead: Requires hand-tuned correction factors
Scope Limitations
  • Only decoder-only models: No encoder-decoder or MoE
  • No multi-modal models: Different compute patterns
  • FP16 only: No quantization modeling
  • Single-GPU profiling: May miss multi-node comm effects

Future Directions (from Paper Section 4.5 & 9)

Direction Impact Difficulty
Asynchronous communication overlap High Medium
Speculative decoding High High
Sequence parallelism Medium Medium
Energy consumption modeling Medium Low
Disaggregated prefill/decode (DistServe) High Medium
Quantization support High Medium

Key Takeaways

1. Configuration Matters More Than You Think

The cost of mis-configuration is up to 2x. Optimal configs change with workload, model, and even SLO thresholds. A 20ms change in TBT SLO can cause 1.85x cost difference.

2. Simulation is Viable for LLM Inference

Despite cascading error risks, careful operator classification and ML-based prediction achieve under 9% error. The key is decomposing into small, well-understood operator categories.

3. Extensible Architecture Enables Research

Each scheduling policy is under 150 lines. The registry pattern allows adding new policies, models, and workloads without modifying the core simulator.

4. Orders of Magnitude Cost Reduction

From $1.14M to $125 (9,000x savings). From 42K GPU-hours to ~1 CPU-hour. This makes configuration exploration practical even for frequent workload changes.

Repository Structure Summary

| Path | Purpose | Key Contents |
|---|---|---|
| vidur/simulator.py | Core discrete-event engine | Event queue, run loop, trace output |
| vidur/events/ | Event types and handlers | 8 event classes, lifecycle management |
| vidur/scheduler/ | 3-tier scheduler hierarchy | global/, replica/, stage/ schedulers |
| vidur/execution_time_predictor/ | ML-based runtime prediction | sklearn base, RF + LinearRegression |
| vidur/entities/ | Domain objects | Batch, Request, ExecutionTime, Cluster, Replica |
| vidur/config/ | Model and simulation configs | 12+ model configs, device/node SKUs |
| vidur/config_optimizer/ | Vidur-Search implementation | CapacitySearch, ConfigExplorer, Dashboard |
| vidur/profiling/ | GPU profiling scripts | attention/, mlp/, collectives/, cpu_overhead/ |
| vidur/request_generator/ | Workload generation | Poisson/Gamma/Static + Uniform/Zipf/Trace |
| vidur/metrics/ | Metrics collection and plotting | MetricsStore, CDF sketches, Plotly/Wandb |
| data/ | Pre-processed traces and profiles | CSV profiling data, workload traces |