Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, Alexey Tumanov -- Georgia Tech & Microsoft Research India
Optimizing LLM inference deployment is extraordinarily expensive. Providers must navigate a vast configuration space spanning parallelism strategies (TP, PP), scheduling policies (vLLM, Orca, Sarathi-Serve), batch sizes, GPU SKUs (A100, H100), and workload-specific parameters. Each configuration point requires running the actual model on GPUs, costing up to $97K per data point.
LLM inference iterations complete in just a few milliseconds, unlike training iterations, which run for hundreds of milliseconds. Predictions must therefore be accurate at the sub-millisecond level.
Unlike training where batch sizes are fixed, inference iteration times vary due to different sequence lengths, prefill/decode mix, and dynamic batch composition.
Small prediction errors compound over time since requests arrive dynamically and batch compositions change. A 1% per-iteration error can cascade into much larger end-to-end divergence.
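A deliberately pessimistic back-of-envelope makes the stakes concrete (our illustration, not the paper's: it assumes the bias compounds multiplicatively on every iteration, which real errors only partially do):

```python
# Pessimistic illustration: a 1% per-iteration bias, compounded
# multiplicatively over a 500-iteration decode, diverges enormously.
bias = 1.01
divergence = bias ** 500
print(round(divergence, 1))  # roughly 145x
```

In practice errors partially cancel, which is why Vidur's sub-percent per-operator accuracy is what keeps end-to-end error under control.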
The paper introduces three main components that work together to enable fast, inexpensive LLM inference performance exploration:
A high-fidelity discrete-event simulator that predicts request-level LLM inference performance with under 9% error. It emulates the full inference stack: model execution, scheduling, and cluster-level coordination.
A benchmark suite with plug-and-play support for workload patterns (Chat-1M, Arxiv-Sum, BWB), scheduling policies (vLLM, Orca, Sarathi-Serve, FasterTransformer, LightLLM), and profiling data for A100/H100 GPUs.
An automated search tool that finds the optimal deployment configuration, maximizing QPS-per-dollar while meeting SLO constraints (TTFT P90 < 2s, TBT P99 < 200ms). It uses binary search over the simulator to find the maximum serving capacity of each configuration.
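The capacity search is a standard monotone bisection; a minimal sketch (function and parameter names like `meets_slo` are ours, not the repo's):

```python
# Illustrative capacity search: find the highest QPS at which a configuration
# still meets its latency SLOs, assuming meets_slo(qps) is monotone
# (higher load can only make latencies worse).
def max_qps(meets_slo, lo=0.0, hi=64.0, tol=0.05):
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if meets_slo(mid):
            lo = mid   # SLOs met at mid: capacity is at least mid
        else:
            hi = mid   # SLOs violated: capacity is below mid
    return lo
```

In Vidur-Search, each `meets_slo` probe is one full simulator run at that arrival rate; because simulation is cheap, the whole bisection stays in the minutes range. For example, if SLOs hold exactly up to 10 QPS, `max_qps(lambda q: q <= 10)` converges to 10.0.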
Vidur's architecture has two main phases: Model Onboarding (offline profiling + ML model training) and Simulation Runtime (discrete-event simulation with hierarchical scheduling).
Vidur's runtime prediction is built on the key insight that LLM operators can be classified into three categories, each with a different prediction strategy. The paper uses Random Forest regression models trained on profiled data, achieving accuracy that prevents error cascading.
| Category | Operators | Depends On | Prediction Approach |
|---|---|---|---|
| Token-level | Linear (QKV proj, MLP up/down), Norm, Activation | Total tokens in batch (prefill + decode) | RF on (batch_size, num_tokens) |
| Sequence-level | Attention (prefill: quadratic, decode: memory-bound) | Per-request context length + request history | Separate prefill/decode models; equivalent batch trick |
| Communication | AllReduce (TP), Send/Recv (PP) | Data size only (model-agnostic) | Pre-profiled lookup indexed by topology |
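To make the token-level strategy concrete, here is a minimal, self-contained sketch (synthetic data, not the repo's profiler output) of fitting a random forest on (batch_size, num_tokens) pairs:

```python
# Sketch of the token-level prediction strategy: fit a random forest on
# (batch_size, num_tokens) -> runtime samples. Data here is synthetic;
# the repo trains on real profiled kernel timings.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(1, 512, size=(200, 2)).astype(float)  # (batch_size, num_tokens)
runtime_ms = 0.001 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 0.01, 200)  # fake timings

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, runtime_ms)
pred = model.predict([[32.0, 256.0]])[0]  # predicted iteration time in ms
```

The appeal of this decomposition is that each regressor only has to learn a small, smooth kernel-cost surface rather than the full batch dynamics.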
In the codebase, execution time is decomposed into 20 distinct components, each predicted separately:
```python
# vidur/entities/execution_time.py -- 20 components per iteration
class ExecutionTime:
    def _get_block_execution_time(self) -> float:
        return (
            self._get_attention_layer_execution_time()  # QKV proj + RoPE + KV save + attn + out proj + norm
            + self._get_mlp_layer_execution_time()      # up_proj + down_proj + act + norm + allreduce
            + self._add_time                            # residual connection
        )

    @property
    def model_time(self) -> float:
        block_time = self._get_block_execution_time()
        stage_time = block_time * self._num_layers_per_pipeline_stage
        return (stage_time + self.pipeline_parallel_communication_time) * 1e-3

    @property
    def total_time(self) -> float:
        # CPU overhead = schedule + sampler + prepare_inputs + process_outputs + ray_comm
        return self.model_time + self._get_cpu_overhead() * 1e-3
```
The sklearn-based predictor in the repo follows this pipeline:
```python
# vidur/execution_time_predictor/sklearn_execution_time_predictor.py

# Derived features for attention prediction:
df["num_tokens"] = df[["prefill_chunk_size", "batch_size"]].max(axis=1)
df["is_decode"] = df["prefill_chunk_size"] == 0
df["prefill_chunk_size_squared"] = df["prefill_chunk_size"] ** 2  # captures quadratic attention cost

# Communication features:
df["num_tokens"] = df["size"] / model_config.embedding_dim / 2  # bytes -> tokens

# Model selection: GridSearchCV with MAPE scorer.
# The paper chose Random Forest over MLP and polynomial regression
# because RF captures non-linear CUDA kernel characteristics.
```
Vidur implements five scheduling policies, each under 150 lines of Python. The paper classifies them into prefill-prioritizing and decode-prioritizing categories, with Sarathi-Serve bridging both.
| Scheduler | Category | KV-Cache Mgmt | Key Feature | Repo Lines |
|---|---|---|---|---|
| vLLM | Prefill-first | PagedAttention (dynamic blocks) | Eagerly schedules prefills, pauses decodes | ~132 |
| Orca | Decode-first | Static allocation (max blocks) | Continuous batching, iteration-level scheduling | ~55 |
| Sarathi-Serve | Hybrid | PagedAttention + chunked prefills | Chunk-size controls prefill/decode tradeoff | ~187 |
| FasterTransformer | Prefill-first | Static allocation, batch-level free | Processes entire batch to completion | ~66 |
| LightLLM | Decode-first | Token-level allocation (block_size=1) | Max-waiting-iters to prevent starvation | ~154 |
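Sarathi-Serve's chunk size is the knob that bridges the two categories. A hedged sketch of the budgeting idea (our simplification, not repo code: every running decode costs one token, and the leftover budget goes to a chunk of the pending prefill):

```python
# Illustrative chunked-prefill budgeting: each iteration has a fixed token
# budget; running decodes are admitted first, then the remainder is filled
# with a chunk of a pending prefill, so decodes are never stalled.
def build_batch(num_decodes, prefill_tokens_left, token_budget=512):
    decode_tokens = min(num_decodes, token_budget)
    prefill_chunk = max(0, min(token_budget - decode_tokens, prefill_tokens_left))
    return decode_tokens, prefill_chunk
```

With 100 running decodes and a 5,000-token prompt left to prefill, a 512-token budget yields `(100, 412)`: all decodes proceed, and the prefill advances 412 tokens this iteration.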
**Chat-1M (P:D = 2.3, best GPU: H100).** Best: Sarathi-Serve with a large batch size (256) on H100. The moderate prefill-to-decode ratio means chunked prefills avoid stalling decodes; vLLM also performs well.

**BWB (P:D = 0.65, best GPU: A100).** Best: a smaller batch size (64); A100 is often better. Long sequences create high KV-cache pressure, and decode-prioritizing policies struggle less since there are fewer prefills.

Vidur was validated across four models (LLaMA2-7B, InternLM-20B, LLaMA2-70B, Qwen-72B), three workloads, and both static and dynamic arrival patterns. The baseline is an optimized vLLM fork with CUDA graph support.
| Model | TP | Median Exec Latency Error | P95 Exec Latency Error | Worst Case |
|---|---|---|---|---|
| LLaMA2-7B | 1 | 0.30% - 3.01% | 1.83% - 3.33% | 3.33% |
| InternLM-20B | 2 | 1.07% - 1.78% | 0.38% - 1.37% | 1.78% |
| LLaMA2-70B | 4 | 0.15% - 2.53% | 0.25% - 1.30% | 2.86% |
| Qwen-72B | 4 | 0.42% - 1.79% | 0.52% - 1.69% | 1.79% |
| Model | Median E2E Error | P95 E2E Error | Overall Assessment |
|---|---|---|---|
| LLaMA2-7B (TP1) | 2.88% - 8.50% | 4.55% - 7.47% | Higher error (CPU overhead) |
| InternLM-20B (TP2) | 0.47% - 1.27% | 2.25% - 4.58% | Excellent fidelity |
| LLaMA2-70B (TP4) | 0.51% - 1.64% | 0.12% - 1.82% | Excellent fidelity |
| Qwen-72B (TP4) | 0.41% - 3.29% | 0.13% - 1.18% | Excellent fidelity |
Vidur-Search explored all 12 model-trace combinations across TP={1,2,4}, PP={1,2,4}, Scheduler={vLLM, Orca+, Sarathi-Serve}, BatchSize={32,64,128,256,512}, and GPU SKU={A100,H100}. The results reveal that no single configuration is universally optimal.
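That enumeration is easy to size up; per model-trace pair the grid contains:

```python
# Count the search space enumerated above (values taken from the text).
from itertools import product

tp_degrees = [1, 2, 4]
pp_degrees = [1, 2, 4]
schedulers = ["vllm", "orca+", "sarathi-serve"]
batch_sizes = [32, 64, 128, 256, 512]
gpu_skus = ["a100", "h100"]

configs = list(product(tp_degrees, pp_degrees, schedulers, batch_sizes, gpu_skus))
print(len(configs))  # 270 configurations per pair; x12 model-trace pairs = 3,240 points
```

Each of those points would be a separate GPU deployment without the simulator, which is where the cost-savings table below comes from.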
| Model | Workload | Best PP | Best TP | Best Scheduler | Best BS | SKU | QPS/$ |
|---|---|---|---|---|---|---|---|
| LLaMA-7B | Chat-1M | 1 | 1 | Sarathi-Serve | 64 | A100 | 1.831 |
| LLaMA2-70B | Chat-1M | 2 | 2 | Sarathi-Serve | 256 | H100 | 0.201 |
| LLaMA2-70B | BWB-4K | 2 | 4 | vLLM | 64 | A100 | 0.026 |
| Qwen-72B | Chat-1M | 2 | 4 | vLLM | 256 | H100 | 0.091 |
| Scenario | Actual Time | Sim Time | Actual Cost | Sim Cost | Savings |
|---|---|---|---|---|---|
| 7B-Chat1M | 4K hrs | 31 min | $20K | $5 | 3,837x |
| 7B-Arxiv | 10K hrs | 47 min | $52K | $8 | 6,708x |
| 20B-Arxiv | 14K hrs | 25 min | $73K | $4 | 17,746x |
| 70B-Chat1M | 12K hrs | 21 min | $64K | $4 | 18,151x |
| 70B-Arxiv | 15K hrs | 16 min | $78K | $3 | 30,187x |
| 72B-Arxiv | 17K hrs | 16 min | $88K | $3 | 33,354x |
We conducted a thorough comparison of the paper's claims against the open-source codebase at github.com/microsoft/vidur. Here is the detailed analysis:
| Feature | Paper Mentions | Repo Status | Notes |
|---|---|---|---|
| Async communication overlap | Future work (Sec 4.5) | Not implemented | Only sync PP scheduling |
| Speculative decoding | Future work (Sec 4.5) | Not implemented | Would need draft model sim |
| Energy consumption modeling | Planned (Sec 5.2) | Not implemented | Only FLOPs + memory util |
| Sequence parallelism | Future work (Sec 4.5) | Not implemented | Only TP and PP |
| Prefix caching | Not discussed | Not implemented | Important for production |
| Preemption counting | Mentioned (Sec 5.2) | Implemented | vLLM + Sarathi track this |
| Offline batch optimization | Possible extension (Sec 6) | Partially | Static workload mode supported |
The core simulation loop is remarkably concise. The entire engine is 129 lines of Python:
```python
# vidur/simulator.py -- the entire simulation engine
class Simulator:
    def run(self):
        while self._event_queue and not self._terminate:
            _, event = heapq.heappop(self._event_queue)  # pop highest-priority event
            self._set_time(event._time)                  # advance simulation clock
            new_events = event.handle_event(             # process event, collect follow-ups
                self._scheduler, self._metric_store
            )
            self._add_events(new_events)                 # push new events onto the queue
```
```python
# vidur/scheduler/utils/memory_planner.py
class MemoryPlanner:
    def get_max_batch_size(self) -> int:
        available = gpu_memory * (1 - margin_fraction)  # typically 80GB * 0.9
        param_memory = 2 * num_parameters_per_device    # FP16
        kv_per_request = (
            2  # bytes per float
            * 2  # key + value
            * attention_head_dim
            * kv_heads_per_tp_worker
            * max_request_tokens
            * num_layers
        )
        return (available - param_memory) // kv_per_request
```
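Plugging hypothetical numbers into that formula gives a feel for the output (all values below are illustrative, not from the paper: an 80 GB GPU with a 10% margin, a 70B-parameter FP16 model split across TP=4, and made-up attention dimensions):

```python
# All numbers are hypothetical, chosen only to exercise the formula above.
gpu_memory = 80e9                    # 80 GB device
margin_fraction = 0.1
num_parameters_per_device = 17.5e9   # e.g. 70B params / TP=4
attention_head_dim = 128
kv_heads_per_tp_worker = 2           # e.g. 8 KV heads / TP=4
max_request_tokens = 4096
num_layers = 80

available = gpu_memory * (1 - margin_fraction)
param_memory = 2 * num_parameters_per_device  # FP16: 2 bytes per parameter
kv_per_request = (
    2 * 2 * attention_head_dim * kv_heads_per_tp_worker
    * max_request_tokens * num_layers
)
max_batch_size = (available - param_memory) // kv_per_request
print(int(max_batch_size))  # 110 requests fit under these assumptions
```

The planner's output caps the scheduler's batch size, which is why KV-cache pressure (not compute) often bounds batch size on decode-heavy workloads.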
The benchmark suite includes three carefully chosen workloads that represent different LLM usage patterns. The key insight is that workload characteristics dramatically affect optimal configuration.
| Dataset | Content | # Queries | Prefill (med) | Decode (med) | P:D | Characteristic |
|---|---|---|---|---|---|---|
| Chat-1M | LMSys conversations | 2M | 417 | 141 | 2.3 | Moderate |
| Arxiv-4K | Paper summaries | 203K | 7827 | 228 | 35.4 | Prefill-heavy |
| BWB-4K | Book translation | 195K | 2396 | 3589 | 0.66 | Decode-heavy |
The repo implements a flexible request generation framework supporting both synthetic modes (Poisson/Gamma/static arrivals with uniform/Zipf/fixed lengths) and trace replay.
Given a target workload and SLO requirements, determine the minimum GPU fleet size and optimal SKU. Vidur-Search's binary search finds maximum QPS per configuration, then selects the most cost-effective option.
Test new scheduling algorithms without GPU access. The extensible API requires implementing only _get_next_batch() and on_batch_end() methods. Sarathi-Serve's chunked prefill was validated this way.
Compare A100 vs H100 cost-effectiveness for specific workloads. The paper found that the optimal SKU changes with workload: H100 is better for Chat-1M but A100 wins for BWB due to lower cost-per-GB.
```bash
# Run a single simulation
python -m vidur.main \
    --replica_config_model_name meta-llama/Llama-2-70b-hf \
    --cluster_config_num_replicas 4 \
    --replica_config_num_pipeline_stages 2 \
    --replica_config_tensor_parallel_size 4 \
    --replica_scheduler_config_type sarathi \
    --sarathi_scheduler_chunk_size 512 \
    --vllm_scheduler_batch_size_cap 256 \
    --request_generator_config_type synthetic \
    --synthetic_request_generator_interval_generator_config_type poisson \
    --poisson_request_interval_generator_qps 0.5 \
    --length_generator_config_type trace \
    --trace_request_length_generator_trace_file data/processed_traces/chat1m.csv

# Run Vidur-Search for optimal configuration
python -m vidur.config_optimizer.config_explorer.main \
    --output_dir /tmp/vidur_search \
    --time_limit 600 \
    --scheduling_delay_slo_value 5 \
    --scheduling_delay_slo_quantile 0.99
```
```python
# Implementing a new scheduler requires 2 methods:
class MyCustomScheduler(BaseReplicaScheduler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._num_running_batches = 0

    def on_batch_end(self, batch: Batch) -> None:
        self._num_running_batches -= 1
        for request in batch.requests:
            if request.completed:
                self.free(request.id)

    def _get_next_batch(self) -> Batch:
        # Your batching logic here.
        # Has access to: self._request_queue, self._allocation_map,
        # self.can_allocate(), self.allocate(), self._max_batch_size
        pass
```
| Direction | Impact | Difficulty |
|---|---|---|
| Asynchronous communication overlap | High | Medium |
| Speculative decoding | High | High |
| Sequence parallelism | Medium | Medium |
| Energy consumption modeling | Medium | Low |
| Disaggregated prefill/decode (DistServe) | High | Medium |
| Quantization support | High | Medium |
Mis-configuration can inflate serving cost by up to 2x. Optimal configurations shift with workload, model, and even SLO thresholds: a 20 ms change in the TBT SLO alone can swing cost by 1.85x.
Despite cascading error risks, careful operator classification and ML-based prediction achieve under 9% error. The key is decomposing into small, well-understood operator categories.
Each scheduling policy is under 150 lines. The registry pattern allows adding new policies, models, and workloads without modifying the core simulator.
From $1.14M to $125 (9,000x savings). From 42K GPU-hours to ~1 CPU-hour. This makes configuration exploration practical even for frequent workload changes.
| Directory | Purpose | Key Files |
|---|---|---|
| `vidur/simulator.py` | Core discrete-event engine | Event queue, run loop, trace output |
| `vidur/events/` | Event types and handlers | 8 event classes, lifecycle management |
| `vidur/scheduler/` | 3-tier scheduler hierarchy | `global/`, `replica/`, `stage/` schedulers |
| `vidur/execution_time_predictor/` | ML-based runtime prediction | sklearn base, RF + LinearRegression |
| `vidur/entities/` | Domain objects | Batch, Request, ExecutionTime, Cluster, Replica |
| `vidur/config/` | Model and simulation configs | 12+ model configs, device/node SKUs |
| `vidur/config_optimizer/` | Vidur-Search implementation | CapacitySearch, ConfigExplorer, Dashboard |
| `vidur/profiling/` | GPU profiling scripts | `attention/`, `mlp/`, `collectives/`, `cpu_overhead/` |
| `vidur/request_generator/` | Workload generation | Poisson/Gamma/Static + Uniform/Zipf/Trace |
| `vidur/metrics/` | Metrics collection and plotting | MetricsStore, CDF sketches, Plotly/Wandb |
| `data/` | Pre-processed traces and profiles | CSV profiling data, workload traces |