Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, Alexey Tumanov -- Georgia Tech & Microsoft Research India
Optimizing LLM inference deployment is extraordinarily expensive. Providers must navigate a vast configuration space spanning parallelism strategies (TP, PP), scheduling policies (vLLM, Orca, Sarathi-Serve), batch sizes, GPU SKUs (A100, H100), and workload-specific parameters. Each configuration point requires running the actual model on GPUs, costing up to $97K per data point.
LLM inference iterations complete in just a few milliseconds, unlike training iterations, which run for hundreds of milliseconds. Predictions must therefore be accurate at the sub-millisecond level.
Unlike training where batch sizes are fixed, inference iteration times vary due to different sequence lengths, prefill/decode mix, and dynamic batch composition.
Small prediction errors compound over time since requests arrive dynamically and batch compositions change. A 1% per-iteration error can cascade into much larger end-to-end divergence.
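A deliberately pessimistic back-of-envelope makes the stakes concrete (our illustration, not the paper's: it assumes the bias compounds multiplicatively on every iteration, which real errors only partially do):

```python
# Pessimistic illustration: a 1% per-iteration bias, compounded
# multiplicatively over a 500-iteration decode, diverges enormously.
bias = 1.01
divergence = bias ** 500
print(round(divergence, 1))  # roughly 145x
```

In practice errors partially cancel, which is why Vidur's sub-percent per-operator accuracy is what keeps end-to-end error under control.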
The paper introduces three main components that work together to enable fast, inexpensive LLM inference performance exploration:
A high-fidelity discrete-event simulator that predicts request-level LLM inference performance with under 9% error. It emulates the full inference stack: model execution, scheduling, and cluster-level coordination.
A benchmark suite with plug-and-play support for workload patterns (Chat-1M, Arxiv-Sum, BWB), scheduling policies (vLLM, Orca, Sarathi-Serve, FasterTransformer, LightLLM), and profiling data for A100/H100 GPUs.
An automated search tool that finds the optimal deployment configuration, maximizing QPS-per-dollar while meeting SLO constraints (TTFT P90 < 2s, TBT P99 < 200ms). It uses binary search over the simulator to find the maximum serving capacity of each configuration.
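The capacity search is a standard monotone bisection; a minimal sketch (function and parameter names like `meets_slo` are ours, not the repo's):

```python
# Illustrative capacity search: find the highest QPS at which a configuration
# still meets its latency SLOs, assuming meets_slo(qps) is monotone
# (higher load can only make latencies worse).
def max_qps(meets_slo, lo=0.0, hi=64.0, tol=0.05):
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if meets_slo(mid):
            lo = mid   # SLOs met at mid: capacity is at least mid
        else:
            hi = mid   # SLOs violated: capacity is below mid
    return lo
```

In Vidur-Search, each `meets_slo` probe is one full simulator run at that arrival rate; because simulation is cheap, the whole bisection stays in the minutes range. For example, if SLOs hold exactly up to 10 QPS, `max_qps(lambda q: q <= 10)` converges to 10.0.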
Vidur's architecture has two main phases: Model Onboarding (offline profiling + ML model training) and Simulation Runtime (discrete-event simulation with hierarchical scheduling).
Vidur's runtime prediction is built on the key insight that LLM operators can be classified into three categories, each with a different prediction strategy. The paper uses Random Forest regression models trained on profiled data, achieving accuracy that prevents error cascading.
| Category | Operators | Depends On | Prediction Approach |
|---|---|---|---|
| Token-level | Linear (QKV proj, MLP up/down), Norm, Activation | Total tokens in batch (prefill + decode) | RF on (batch_size, num_tokens) |
| Sequence-level | Attention (prefill: quadratic, decode: memory-bound) | Per-request context length + request history | Separate prefill/decode models; equivalent batch trick |
| Communication | AllReduce (TP), Send/Recv (PP) | Data size only (model-agnostic) | Pre-profiled lookup indexed by topology |
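To make the token-level strategy concrete, here is a minimal, self-contained sketch (synthetic data, not the repo's profiler output) of fitting a random forest on (batch_size, num_tokens) pairs:

```python
# Sketch of the token-level prediction strategy: fit a random forest on
# (batch_size, num_tokens) -> runtime samples. Data here is synthetic;
# the repo trains on real profiled kernel timings.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.integers(1, 512, size=(200, 2)).astype(float)  # (batch_size, num_tokens)
runtime_ms = 0.001 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(0, 0.01, 200)  # fake timings

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, runtime_ms)
pred = model.predict([[32.0, 256.0]])[0]  # predicted iteration time in ms
```

The appeal of this decomposition is that each regressor only has to learn a small, smooth kernel-cost surface rather than the full batch dynamics.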
In the codebase, execution time is decomposed into 20 distinct components, each predicted separately:
```python
# vidur/entities/execution_time.py -- 20 components per iteration
class ExecutionTime:
    def _get_block_execution_time(self) -> float:
        return (
            self._get_attention_layer_execution_time()  # QKV proj + RoPE + KV save + attn + out proj + norm
            + self._get_mlp_layer_execution_time()      # up_proj + down_proj + act + norm + allreduce
            + self._add_time                            # residual connection
        )

    @property
    def model_time(self) -> float:
        block_time = self._get_block_execution_time()
        stage_time = block_time * self._num_layers_per_pipeline_stage
        return (stage_time + self.pipeline_parallel_communication_time) * 1e-3

    @property
    def total_time(self) -> float:
        # CPU overhead = schedule + sampler + prepare_inputs + process_outputs + ray_comm
        return self.model_time + self._get_cpu_overhead() * 1e-3
```
The sklearn-based predictor in the repo follows this pipeline:
```python
# vidur/execution_time_predictor/sklearn_execution_time_predictor.py

# Derived features for attention prediction:
df["num_tokens"] = df[["prefill_chunk_size", "batch_size"]].max(axis=1)
df["is_decode"] = df["prefill_chunk_size"] == 0
df["prefill_chunk_size_squared"] = df["prefill_chunk_size"] ** 2  # captures quadratic attention cost

# Communication features:
df["num_tokens"] = df["size"] / model_config.embedding_dim / 2  # bytes -> tokens

# Model selection: GridSearchCV with MAPE scorer.
# The paper chose Random Forest over MLP and polynomial regression
# because RF captures non-linear CUDA kernel characteristics.
```
Vidur implements five scheduling policies, each under 150 lines of Python. The paper classifies them into prefill-prioritizing and decode-prioritizing categories, with Sarathi-Serve bridging both.
| Scheduler | Category | KV-Cache Mgmt | Key Feature | Repo Lines |
|---|---|---|---|---|
| vLLM | Prefill-first | PagedAttention (dynamic blocks) | Eagerly schedules prefills, pauses decodes | ~132 |
| Orca | Decode-first | Static allocation (max blocks) | Continuous batching, iteration-level scheduling | ~55 |
| Sarathi-Serve | Hybrid | PagedAttention + chunked prefills | Chunk-size controls prefill/decode tradeoff | ~187 |
| FasterTransformer | Prefill-first | Static allocation, batch-level free | Processes entire batch to completion | ~66 |
| LightLLM | Decode-first | Token-level allocation (block_size=1) | Max-waiting-iters to prevent starvation | ~154 |
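Sarathi-Serve's chunk size is the knob that bridges the two categories. A hedged sketch of the budgeting idea (our simplification, not repo code: every running decode costs one token, and the leftover budget goes to a chunk of the pending prefill):

```python
# Illustrative chunked-prefill budgeting: each iteration has a fixed token
# budget; running decodes are admitted first, then the remainder is filled
# with a chunk of a pending prefill, so decodes are never stalled.
def build_batch(num_decodes, prefill_tokens_left, token_budget=512):
    decode_tokens = min(num_decodes, token_budget)
    prefill_chunk = max(0, min(token_budget - decode_tokens, prefill_tokens_left))
    return decode_tokens, prefill_chunk
```

With 100 running decodes and a 5,000-token prompt left to prefill, a 512-token budget yields `(100, 412)`: all decodes proceed, and the prefill advances 412 tokens this iteration.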
**Chat-1M (P:D = 2.3, best GPU: H100).** Best: Sarathi-Serve with a large batch size (256) on H100. The moderate prefill-to-decode ratio means chunked prefills avoid stalling decodes; vLLM also performs well.

**BWB (P:D = 0.65, best GPU: A100).** Best: a smaller batch size (64); A100 is often better. Long sequences create high KV-cache pressure, and decode-prioritizing policies struggle less since there are fewer prefills.

Vidur was validated across four models (LLaMA2-7B, InternLM-20B, LLaMA2-70B, Qwen-72B), three workloads, and both static and dynamic arrival patterns. The baseline is an optimized vLLM fork with CUDA graph support.
| Model | TP | Median Exec Latency Error | P95 Exec Latency Error | Worst Case |
|---|---|---|---|---|
| LLaMA2-7B | 1 | 0.30% - 3.01% | 1.83% - 3.33% | 3.33% |
| InternLM-20B | 2 | 1.07% - 1.78% | 0.38% - 1.37% | 1.78% |
| LLaMA2-70B | 4 | 0.15% - 2.53% | 0.25% - 1.30% | 2.86% |
| Qwen-72B | 4 | 0.42% - 1.79% | 0.52% - 1.69% | 1.79% |
| Model | Median E2E Error | P95 E2E Error | Overall Assessment |
|---|---|---|---|
| LLaMA2-7B (TP1) | 2.88% - 8.50% | 4.55% - 7.47% | Higher error (CPU overhead) |
| InternLM-20B (TP2) | 0.47% - 1.27% | 2.25% - 4.58% | Excellent fidelity |
| LLaMA2-70B (TP4) | 0.51% - 1.64% | 0.12% - 1.82% | Excellent fidelity |
| Qwen-72B (TP4) | 0.41% - 3.29% | 0.13% - 1.18% | Excellent fidelity |
Vidur-Search explored all 12 model-trace combinations across TP={1,2,4}, PP={1,2,4}, Scheduler={vLLM, Orca+, Sarathi-Serve}, BatchSize={32,64,128,256,512}, and GPU SKU={A100,H100}. The results reveal that no single configuration is universally optimal.
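That enumeration is easy to size up; per model-trace pair the grid contains:

```python
# Count the search space enumerated above (values taken from the text).
from itertools import product

tp_degrees = [1, 2, 4]
pp_degrees = [1, 2, 4]
schedulers = ["vllm", "orca+", "sarathi-serve"]
batch_sizes = [32, 64, 128, 256, 512]
gpu_skus = ["a100", "h100"]

configs = list(product(tp_degrees, pp_degrees, schedulers, batch_sizes, gpu_skus))
print(len(configs))  # 270 configurations per pair; x12 model-trace pairs = 3,240 points
```

Each of those points would be a separate GPU deployment without the simulator, which is where the cost-savings table below comes from.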
| Model | Workload | Best PP | Best TP | Best Scheduler | Best BS | SKU | QPS/$ |
|---|---|---|---|---|---|---|---|
| LLaMA-7B | Chat-1M | 1 | 1 | Sarathi-Serve | 64 | A100 | 1.831 |
| LLaMA2-70B | Chat-1M | 2 | 2 | Sarathi-Serve | 256 | H100 | 0.201 |
| LLaMA2-70B | BWB-4K | 2 | 4 | vLLM | 64 | A100 | 0.026 |
| Qwen-72B | Chat-1M | 2 | 4 | vLLM | 256 | H100 | 0.091 |
| Scenario | Actual Time | Sim Time | Actual Cost | Sim Cost | Savings |
|---|---|---|---|---|---|
| 7B-Chat1M | 4K hrs | 31 min | $20K | $5 | 3,837x |
| 7B-Arxiv | 10K hrs | 47 min | $52K | $8 | 6,708x |
| 20B-Arxiv | 14K hrs | 25 min | $73K | $4 | 17,746x |
| 70B-Chat1M | 12K hrs | 21 min | $64K | $4 | 18,151x |
| 70B-Arxiv | 15K hrs | 16 min | $78K | $3 | 30,187x |
| 72B-Arxiv | 17K hrs | 16 min | $88K | $3 | 33,354x |
We conducted a thorough comparison of the paper's claims against the open-source codebase at github.com/microsoft/vidur. Here is the detailed analysis:
| Feature | Paper Mentions | Repo Status | Notes |
|---|---|---|---|
| Async communication overlap | Future work (Sec 4.5) | Not implemented | Only sync PP scheduling |
| Speculative decoding | Future work (Sec 4.5) | Not implemented | Would need draft model sim |
| Energy consumption modeling | Planned (Sec 5.2) | Not implemented | Only FLOPs + memory util |
| Sequence parallelism | Future work (Sec 4.5) | Not implemented | Only TP and PP |
| Prefix caching | Not discussed | Not implemented | Important for production |
| Preemption counting | Mentioned (Sec 5.2) | Implemented | vLLM + Sarathi track this |
| Offline batch optimization | Possible extension (Sec 6) | Partially | Static workload mode supported |
The core simulation loop is remarkably concise. The entire engine is 129 lines of Python:
```python
# vidur/simulator.py -- the entire simulation engine
class Simulator:
    def run(self):
        while self._event_queue and not self._terminate:
            _, event = heapq.heappop(self._event_queue)  # pop highest-priority event
            self._set_time(event._time)                  # advance simulation clock
            new_events = event.handle_event(             # process event, collect follow-ups
                self._scheduler, self._metric_store
            )
            self._add_events(new_events)                 # push new events onto the queue
```
```python
# vidur/scheduler/utils/memory_planner.py
class MemoryPlanner:
    def get_max_batch_size(self) -> int:
        available = gpu_memory * (1 - margin_fraction)  # typically 80GB * 0.9
        param_memory = 2 * num_parameters_per_device    # FP16
        kv_per_request = (
            2  # bytes per float
            * 2  # key + value
            * attention_head_dim
            * kv_heads_per_tp_worker
            * max_request_tokens
            * num_layers
        )
        return (available - param_memory) // kv_per_request
```
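Plugging hypothetical numbers into that formula gives a feel for the output (all values below are illustrative, not from the paper: an 80 GB GPU with a 10% margin, a 70B-parameter FP16 model split across TP=4, and made-up attention dimensions):

```python
# All numbers are hypothetical, chosen only to exercise the formula above.
gpu_memory = 80e9                    # 80 GB device
margin_fraction = 0.1
num_parameters_per_device = 17.5e9   # e.g. 70B params / TP=4
attention_head_dim = 128
kv_heads_per_tp_worker = 2           # e.g. 8 KV heads / TP=4
max_request_tokens = 4096
num_layers = 80

available = gpu_memory * (1 - margin_fraction)
param_memory = 2 * num_parameters_per_device  # FP16: 2 bytes per parameter
kv_per_request = (
    2 * 2 * attention_head_dim * kv_heads_per_tp_worker
    * max_request_tokens * num_layers
)
max_batch_size = (available - param_memory) // kv_per_request
print(int(max_batch_size))  # 110 requests fit under these assumptions
```

The planner's output caps the scheduler's batch size, which is why KV-cache pressure (not compute) often bounds batch size on decode-heavy workloads.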
The benchmark suite includes three carefully chosen workloads that represent different LLM usage patterns. The key insight is that workload characteristics dramatically affect optimal configuration.
| Dataset | Content | # Queries | Prefill (med) | Decode (med) | P:D | Characteristic |
|---|---|---|---|---|---|---|
| Chat-1M | LMSys conversations | 2M | 417 | 141 | 2.3 | Moderate |
| Arxiv-4K | Paper summaries | 203K | 7827 | 228 | 35.4 | Prefill-heavy |
| BWB-4K | Book translation | 195K | 2396 | 3589 | 0.66 | Decode-heavy |
The repo implements a flexible request generation framework supporting both synthetic modes (Poisson/Gamma/static arrivals with uniform/Zipf/fixed lengths) and trace replay.
Given a target workload and SLO requirements, determine the minimum GPU fleet size and optimal SKU. Vidur-Search's binary search finds maximum QPS per configuration, then selects the most cost-effective option.
Test new scheduling algorithms without GPU access. The extensible API requires implementing only _get_next_batch() and on_batch_end() methods. Sarathi-Serve's chunked prefill was validated this way.
Compare A100 vs H100 cost-effectiveness for specific workloads. The paper found that the optimal SKU changes with workload: H100 is better for Chat-1M but A100 wins for BWB due to lower cost-per-GB.
```bash
# Run a single simulation
python -m vidur.main \
    --replica_config_model_name meta-llama/Llama-2-70b-hf \
    --cluster_config_num_replicas 4 \
    --replica_config_num_pipeline_stages 2 \
    --replica_config_tensor_parallel_size 4 \
    --replica_scheduler_config_type sarathi \
    --sarathi_scheduler_chunk_size 512 \
    --vllm_scheduler_batch_size_cap 256 \
    --request_generator_config_type synthetic \
    --synthetic_request_generator_interval_generator_config_type poisson \
    --poisson_request_interval_generator_qps 0.5 \
    --length_generator_config_type trace \
    --trace_request_length_generator_trace_file data/processed_traces/chat1m.csv

# Run Vidur-Search for optimal configuration
python -m vidur.config_optimizer.config_explorer.main \
    --output_dir /tmp/vidur_search \
    --time_limit 600 \
    --scheduling_delay_slo_value 5 \
    --scheduling_delay_slo_quantile 0.99
```
```python
# Implementing a new scheduler requires 2 methods:
class MyCustomScheduler(BaseReplicaScheduler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._num_running_batches = 0

    def on_batch_end(self, batch: Batch) -> None:
        self._num_running_batches -= 1
        for request in batch.requests:
            if request.completed:
                self.free(request.id)

    def _get_next_batch(self) -> Batch:
        # Your batching logic here.
        # Has access to: self._request_queue, self._allocation_map,
        # self.can_allocate(), self.allocate(), self._max_batch_size
        pass
```
| Direction | Impact | Difficulty |
|---|---|---|
| Asynchronous communication overlap | High | Medium |
| Speculative decoding | High | High |
| Sequence parallelism | Medium | Medium |
| Energy consumption modeling | Medium | Low |
| Disaggregated prefill/decode (DistServe) | High | Medium |
| Quantization support | High | Medium |
Mis-configuration can inflate serving cost by up to 2x. Optimal configurations shift with workload, model, and even SLO thresholds: a 20 ms change in the TBT SLO alone can swing cost by 1.85x.
Despite cascading error risks, careful operator classification and ML-based prediction achieve under 9% error. The key is decomposing into small, well-understood operator categories.
Each scheduling policy is under 150 lines. The registry pattern allows adding new policies, models, and workloads without modifying the core simulator.
From $1.14M to $125 (9,000x savings). From 42K GPU-hours to ~1 CPU-hour. This makes configuration exploration practical even for frequent workload changes.
| Directory | Purpose | Key Files |
|---|---|---|
| `vidur/simulator.py` | Core discrete-event engine | Event queue, run loop, trace output |
| `vidur/events/` | Event types and handlers | 8 event classes, lifecycle management |
| `vidur/scheduler/` | 3-tier scheduler hierarchy | `global/`, `replica/`, `stage/` schedulers |
| `vidur/execution_time_predictor/` | ML-based runtime prediction | sklearn base, RF + LinearRegression |
| `vidur/entities/` | Domain objects | Batch, Request, ExecutionTime, Cluster, Replica |
| `vidur/config/` | Model and simulation configs | 12+ model configs, device/node SKUs |
| `vidur/config_optimizer/` | Vidur-Search implementation | CapacitySearch, ConfigExplorer, Dashboard |
| `vidur/profiling/` | GPU profiling scripts | `attention/`, `mlp/`, `collectives/`, `cpu_overhead/` |
| `vidur/request_generator/` | Workload generation | Poisson/Gamma/Static + Uniform/Zipf/Trace |
| `vidur/metrics/` | Metrics collection and plotting | MetricsStore, CDF sketches, Plotly/Wandb |
| `data/` | Pre-processed traces and profiles | CSV profiling data, workload traces |