How Vidur models GPU kernel latency without any GPUs, drives simulation through a priority-queue event loop, and generates realistic request traffic from synthetic distributions or production traces.
Vidur's simulation engine rests on three pillars that work in concert: an execution time predictor that estimates how long each GPU operation takes for a given batch, a discrete event system that advances simulated time through a priority queue, and a request generator that produces realistic workload patterns. Together, these components allow Vidur to simulate an entire LLM serving cluster on a single CPU in seconds.
At the heart of Vidur's ability to simulate without GPUs is the execution time predictor. The base class BaseExecutionTimePredictor defines a 19-component breakdown of every batch's execution time. This is not a single monolithic prediction -- every sub-operation of a transformer layer is modeled independently.
```python
# vidur/execution_time_predictor/base_execution_time_predictor.py
class BaseExecutionTimePredictor(ABC):
    def __init__(self, predictor_config, replica_config,
                 replica_scheduler_config, metrics_config):
        self._config = predictor_config
        self._replica_config = replica_config
        self._model_config = replica_config.model_config
        self._replica_scheduler_provider = str(replica_scheduler_config.get_type())
        self._block_size = replica_scheduler_config.block_size
        self._num_layers_per_pipeline_stage = (
            self._model_config.num_layers // self._replica_config.num_pipeline_stages
        )

    def get_execution_time(self, batch: Batch, pipeline_stage: int) -> ExecutionTime:
        # Compute communication overheads conditionally
        if pipeline_stage == self._replica_config.num_pipeline_stages - 1:
            pipeline_parallel_communication_time = 0
        else:
            pipeline_parallel_communication_time = (
                self._get_pipeline_parallel_communication_time(batch)
            )
        if self._replica_config.tensor_parallel_size == 1:
            tensor_parallel_communication_time = 0
        else:
            tensor_parallel_communication_time = (
                self._get_tensor_parallel_communication_time(batch)
            )
        return ExecutionTime(
            self._num_layers_per_pipeline_stage,
            self._get_attention_rope_execution_time(batch),
            self._get_attention_kv_cache_save_execution_time(batch),
            self._get_attention_decode_execution_time(batch),
            self._get_attention_prefill_execution_time(batch),
            self._get_attention_layer_pre_proj_execution_time(batch),
            self._get_attention_layer_post_proj_execution_time(batch),
            self._get_mlp_layer_up_proj_execution_time(batch),
            self._get_mlp_layer_down_proj_execution_time(batch),
            self._get_mlp_layer_act_execution_time(batch),
            self._get_attn_norm_layer_act_execution_time(batch),
            self._get_mlp_norm_layer_act_execution_time(batch),
            self._get_add_layer_act_execution_time(batch),
            tensor_parallel_communication_time,
            pipeline_parallel_communication_time,
            self._get_schedule_time(batch),
            self._get_sampler_e2e_time(batch),
            self._get_prepare_inputs_e2e_time(batch),
            self._get_process_model_outputs_time(batch),
            self._get_ray_comm_time(batch),
        )
```
Every call to get_execution_time() produces an ExecutionTime entity that decomposes a single pipeline stage's execution into fine-grained sub-operations. These are aggregated to compute the total time:
| Component | Category | Description |
|---|---|---|
| attention_rope | Attention | Rotary Position Embedding computation |
| attention_kv_cache_save | Attention | Writing K/V to cache memory |
| attention_decode | Attention | Decode attention (batch_size x kv_cache_size) |
| attention_prefill | Attention | Prefill attention (kv_cache x chunk_size^2) |
| attention_pre_proj | Attention | QKV projection (linear layer) |
| attention_post_proj | Attention | Output projection (linear layer) |
| mlp_up_proj | MLP | Up projection / gate projection |
| mlp_down_proj | MLP | Down projection |
| mlp_act | MLP | Activation function (SiLU, GELU) |
| attn_norm | Norm | Input layer normalization (RMSNorm / LayerNorm) |
| mlp_norm | Norm | Post-attention normalization |
| add | Residual | Residual connection addition |
| tensor_parallel_comm | Communication | All-reduce across tensor parallel workers |
| pipeline_parallel_comm | Communication | Send/recv between pipeline stages |
| schedule | CPU Overhead | Scheduler CPU time |
| sampler_e2e | CPU Overhead | Token sampling end-to-end |
| prepare_inputs_e2e | CPU Overhead | Input preparation overhead |
| process_model_outputs | CPU Overhead | Output processing overhead |
| ray_comm_time | CPU Overhead | Ray distributed communication |
```python
# vidur/entities/execution_time.py
class ExecutionTime(BaseEntity):
    def _get_block_execution_time(self) -> float:
        return (
            self._get_attention_layer_execution_time()
            # pre_proj + post_proj + rope + kv_save + decode + prefill + TP_comm + norm
            + self._get_mlp_layer_execution_time()
            # up_proj + down_proj + act + TP_comm + norm
            + self._add_time  # residual
        )

    @property
    def model_time(self) -> float:
        block_execution_time = self._get_block_execution_time()
        pipeline_stage_execution_time = (
            block_execution_time * self._num_layers_per_pipeline_stage
        )
        # return in seconds (internal times are ms)
        return (
            pipeline_stage_execution_time + self.pipeline_parallel_communication_time
        ) * 1e-3

    @property
    def total_time(self) -> float:
        # model GPU time + CPU overhead (schedule, sampler, ray, etc.)
        return self.model_time + self._get_cpu_overhead() * 1e-3
```
A pipeline stage's total time is (block_time x num_layers_per_stage) + pipeline_comm + cpu_overhead, where each block is attention_layer + mlp_layer + residual. Communication costs (TP all-reduce, PP send/recv) contribute only when the corresponding parallelism degree exceeds 1.
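As a concrete illustration of this aggregation, here is the arithmetic spelled out with made-up per-operation times (none of these numbers come from actual profiling):

```python
# Hypothetical per-operation times in milliseconds (illustrative only).
block_ms = {
    "attention": 0.40,   # pre/post proj + rope + kv_save + decode/prefill + norm
    "mlp": 0.35,         # up/down proj + act + norm
    "residual_add": 0.02,
}
pipeline_parallel_comm_ms = 0.10
cpu_overhead_ms = 1.5          # schedule + sampler + input prep + outputs + ray
num_layers_per_stage = 40      # e.g. 80 layers split over 2 pipeline stages

block_time_ms = sum(block_ms.values())
stage_time_ms = block_time_ms * num_layers_per_stage
model_time_s = (stage_time_ms + pipeline_parallel_comm_ms) * 1e-3  # ms -> s
total_time_s = model_time_s + cpu_overhead_ms * 1e-3
```

With these numbers, the GPU model time dominates (30.9 ms per stage) while CPU overhead adds a fixed 1.5 ms per batch.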
SklearnExecutionTimePredictor is the workhorse implementation. It trains ML models on profiled data collected from real GPUs, then uses pre-computed prediction tables during simulation. The pipeline has three phases: data loading + feature engineering, model training with GridSearchCV, and full-range prediction caching.
Five input CSV files contain profiled data from real GPU runs. Each is filtered to match the exact model architecture and parallelism configuration being simulated. File paths use template substitution with {DEVICE}, {MODEL}, and {NETWORK_DEVICE} placeholders:
```python
# vidur/execution_time_predictor/sklearn_execution_time_predictor.py
def _get_input_files(self):
    input_files = [
        self._config.compute_input_file,
        self._config.attention_input_file,
        self._config.all_reduce_input_file,
        self._config.send_recv_input_file,
        self._config.cpu_overhead_input_file,
    ]
    for i in range(len(input_files)):
        input_files[i] = (
            input_files[i]
            .replace("{DEVICE}", self._replica_config.device)
            .replace("{MODEL}", self._model_config.get_name())
            .replace("{NETWORK_DEVICE}", self._replica_config.network_device)
        )
    return tuple(input_files)

def _load_compute_df(self, file_path):
    df = self._read_input_file(file_path)
    df = df.drop_duplicates()
    # Filter to exact model architecture
    df = df[
        (df["n_head"] == self._model_config.num_q_heads)
        & (df["n_kv_head"] == self._model_config.num_kv_heads)
        & (df["n_embd"] == self._model_config.embedding_dim)
        & (df["n_expanded_embd"] == self._model_config.mlp_hidden_dim)
        & (df["use_gated_mlp"] == self._model_config.use_gated_mlp)
        & (df["vocab_size"] == self._model_config.vocab_size)
        & (df["num_tensor_parallel_workers"]
           == self._replica_config.tensor_parallel_size)
    ]
    return df
```
Attention data requires additional derived features. The is_decode flag separates prefill and decode operations, while prefill_chunk_size_squared captures the quadratic scaling of self-attention. For communication data, byte sizes are converted to token counts using the embedding dimension:
```python
def _get_attention_df_with_derived_features(self, df):
    df_with_derived_features = df.copy()
    df_with_derived_features["num_tokens"] = df_with_derived_features[
        ["prefill_chunk_size", "batch_size"]
    ].max(axis=1)
    df_with_derived_features["is_decode"] = (
        df_with_derived_features["prefill_chunk_size"] == 0
    )
    df_with_derived_features["prefill_chunk_size_squared"] = (
        df_with_derived_features["prefill_chunk_size"] ** 2
    )
    return df_with_derived_features

def _get_all_reduce_df_with_derived_features(self, df):
    df_with_derived_features = df.copy()
    # convert bytes to num tokens: each token = 2 * embedding_dim bytes
    df_with_derived_features["num_tokens"] = (
        df_with_derived_features["size"] / self._model_config.embedding_dim / 2
    )
    return df_with_derived_features
```
Each sub-operation gets its own model trained via scikit-learn's GridSearchCV. The scorer is MAPE (Mean Absolute Percentage Error), and all data is used for training -- there is no train/test split because the goal is interpolation within the known profiling domain, not extrapolation:
```python
def _train_model(self, model_name, df, feature_cols, target_col):
    if len(df) == 0:
        raise Exception(f"Training data for model {model_name} is empty")

    model_hash = self._get_model_hash(model_name, df)
    cached_model = self._load_model_from_cache(model_name, model_hash)
    if cached_model:
        return cached_model

    model = self._get_estimator()  # abstract: LinearRegression or RandomForest
    grid_search_params = self._get_grid_search_params()  # abstract: hyperparameter grid
    # Cap the number of CV folds at the number of available samples
    cv = min(self._config.k_fold_cv_splits, len(df))
    grid_search = GridSearchCV(
        estimator=model,
        param_grid=grid_search_params,
        scoring=self._get_scorer(),  # MAPE scorer
        cv=cv,
        n_jobs=self._config.num_training_job_threads,
    )
    # No train/test split -- we want to predict within the profiled domain
    X, y = df[feature_cols], df[target_col]
    grid_search.fit(X, y)
    self._store_model_in_cache(model_name, model_hash, grid_search.best_estimator_)
    return grid_search.best_estimator_
```
A custom mean_absolute_percentage_error scorer handles the zero-true-value edge case: if the true value is 0 and the prediction is also 0, the error is 0; if the true value is 0 but the prediction is non-zero, the error is 100%. This prevents division-by-zero crashes during cross-validation.
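A minimal sketch of that zero-handling convention (not Vidur's exact implementation, which plugs into scikit-learn's scorer API):

```python
def mean_absolute_percentage_error(y_true, y_pred):
    """MAPE with the zero-true-value convention described above:
    true == 0 and pred == 0 -> 0% error; true == 0, pred != 0 -> 100%."""
    errors = []
    for t, p in zip(y_true, y_pred):
        if t == 0:
            errors.append(0.0 if p == 0 else 1.0)
        else:
            errors.append(abs(t - p) / abs(t))
    return sum(errors) / len(errors) * 100  # percent
```

A true value of 0 never triggers a division, so GridSearchCV can safely score folds containing zero-latency rows.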
```python
def _train_models(self):
    models = self._train_compute_models()                # 9-10 models
    models.update(self._train_cpu_overhead_models())     # 5 models
    models.update(self._train_attention_layer_models())  # 2 models
    return models

# Compute models: features = [num_tokens], target = time_stats.{op}.median
#   attn_pre_proj, attn_post_proj, mlp_up_proj, mlp_down_proj, mlp_act,
#   input_layernorm, post_attention_layernorm, attn_rope, add, attn_kv_cache_save
# Attention models (separate prefill and decode):
#   attn_prefill: features=[kv_cache_size, prefill_chunk_size_squared]
#   attn_decode:  features=[batch_size, kv_cache_size]
# CPU overhead models: features = [batch_size]
#   schedule, sampler_e2e, prepare_inputs_e2e, process_model_outputs, ray_comm_time
# Communication models (conditional):
#   all_reduce: features=[num_tokens] -- only if TP > 1
#   send_recv:  features=[num_tokens] -- only if PP > 1
```
After training, Vidur pre-computes predictions for every possible input combination and stores them as dictionary lookups. This means simulation-time lookups are O(1) hash table accesses, not model inference calls:
```python
def _predict_for_attention_layer_models(self):
    # Decode grid: batch_size x kv_cache_size
    decode_batch_size_range = np.arange(1, self._config.prediction_max_batch_size + 1)
    decode_kv_cache_size_range = np.arange(
        0,
        self._config.prediction_max_tokens_per_request + 1,
        self._config.kv_cache_prediction_granularity,
    )
    # Prefill grid: kv_cache_size x prefill_chunk_size
    prefill_kv_cache_size_range = np.arange(
        0,
        self._config.prediction_max_tokens_per_request + 1,
        self._config.kv_cache_prediction_granularity,
    )
    prefill_prefill_chunk_size_range = np.arange(
        1, self._config.prediction_max_prefill_chunk_size + 1
    )
    # Cartesian product -> all combinations
    # model.predict(X) -> Dict[(tuple_of_features,), predicted_time]
```
During simulation, each abstract method implementation is a single dictionary lookup. Prefill attention uses a geometric aggregation of per-request chunk sizes, while decode attention uses the mean KV-cache size across the batch, rounded to the prediction granularity:
```python
# Simple compute lookups: key is (total_num_tokens_rounded,)
def _get_mlp_layer_up_proj_execution_time(self, batch):
    return self._predictions["mlp_up_proj"][(batch._total_num_tokens_rounded,)]

# Attention decode: features = (batch_size, avg_kv_cache_size)
def _get_attention_decode_execution_time(self, batch):
    decode_batch_size, decode_avg_kv_cache_size = (
        self._get_batch_decode_attention_params(batch)
    )
    if decode_batch_size == 0:
        return 0
    return self._predictions["attn_decode"][
        (decode_batch_size, decode_avg_kv_cache_size)
    ] * (
        1
        + self._attention_decode_batching_overhead_fraction
        * int(decode_batch_size > 1)
    )

# Attention prefill: aggregate chunk sizes geometrically
def _get_attention_prefill_execution_time(self, batch):
    prefill_params = self._get_batch_prefill_attention_params(batch)
    if len(prefill_params) == 0:
        return 0
    kv_cache_sizes, prefill_chunk_sizes = zip(*prefill_params)
    agg_kv_cache_size = sum(kv_cache_sizes)
    agg_prefill_chunk_size = sum([x**2 for x in prefill_chunk_sizes]) ** 0.5
    return self._predictions["attn_prefill"][
        (agg_kv_cache_size, round(agg_prefill_chunk_size) ** 2)
    ] * (
        1
        + self._attention_prefill_batching_overhead_fraction
        * int(len(prefill_params) > 1)
    )
```
For models using grouped-query attention (num_q_heads > num_kv_heads), a configurable batching_overhead_fraction is applied to both prefill and decode attention when batch size exceeds 1. This captures the additional overhead of batching with GQA compared to MHA.
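Putting the lookup and overhead logic together, here is a simplified standalone sketch (the prediction table, its single key, and the 10% overhead fraction are invented for illustration):

```python
# Toy prediction table keyed on (agg_kv_cache_size, agg_chunk_size_squared);
# both the key and the 2.0 ms value are made up.
predictions = {(0, 1024): 2.0}

def prefill_time(prefill_params, overhead_fraction=0.1):
    kv_cache_sizes, chunk_sizes = zip(*prefill_params)
    agg_kv = sum(kv_cache_sizes)
    # Geometric aggregation: prefill attention work scales with the sum of
    # squares, so the batch behaves like one request of size sqrt(sum(c^2)).
    agg_chunk = sum(c ** 2 for c in chunk_sizes) ** 0.5
    base = predictions[(agg_kv, round(agg_chunk) ** 2)]
    # Batching overhead applies only when more than one request is batched.
    return base * (1 + overhead_fraction * int(len(prefill_params) > 1))

# Two prefill chunks of 24 and 21 tokens: sqrt(24^2 + 21^2) ~= 31.9 -> 32.
batched = prefill_time([(0, 24), (0, 21)])
```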
Both concrete predictor implementations inherit from SklearnExecutionTimePredictor and only need to override two methods: _get_estimator() and _get_grid_search_params(). The training pipeline, caching, and prediction logic are entirely shared.
```python
# linear_regression_execution_time_predictor.py
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

class LinearRegressionExecutionTimePredictor(SklearnExecutionTimePredictor):
    def _get_grid_search_params(self):
        return {
            "polynomialfeatures__degree": self._config.polynomial_degree,
            "polynomialfeatures__include_bias": self._config.polynomial_include_bias,
            "polynomialfeatures__interaction_only": self._config.polynomial_interaction_only,
            "linearregression__fit_intercept": self._config.fit_intercept,
        }

    def _get_estimator(self):
        return make_pipeline(PolynomialFeatures(), LinearRegression())
```
Uses a polynomial feature + linear regression pipeline. GridSearchCV explores polynomial degree, bias inclusion, and interaction-only features. Suited for operations with known polynomial scaling (e.g., attention is O(n^2) in sequence length).
```python
# random_forrest_execution_time_predictor.py
from sklearn.ensemble import RandomForestRegressor

class RandomForrestExecutionTimePredictor(SklearnExecutionTimePredictor):
    def _get_grid_search_params(self):
        return {
            "n_estimators": self._config.num_estimators,
            "max_depth": self._config.max_depth,
            "min_samples_split": self._config.min_samples_split,
        }

    def _get_estimator(self):
        return RandomForestRegressor()
```
Uses a Random Forest ensemble. GridSearchCV tunes tree count, max depth, and min samples per split. Better for operations with non-polynomial or irregular scaling patterns. More robust to outliers in profiling data.
```python
# vidur/execution_time_predictor/execution_time_predictor_registry.py
class ExecutionTimePredictorRegistry(BaseRegistry):
    @classmethod
    def get_key_from_str(cls, key_str):
        return ExecutionTimePredictorType.from_str(key_str)

ExecutionTimePredictorRegistry.register(
    ExecutionTimePredictorType.RANDOM_FORREST,
    RandomForrestExecutionTimePredictor,
)
ExecutionTimePredictorRegistry.register(
    ExecutionTimePredictorType.LINEAR_REGRESSION,
    LinearRegressionExecutionTimePredictor,
)
```
Each GPU type is modeled with two critical parameters: fp16_tflops (compute throughput) and total_memory_gb (VRAM capacity). These are used by the MFU calculator and memory management:
```python
# vidur/config/device_sku_config.py
@dataclass
class BaseDeviceSKUConfig(BaseFixedConfig):
    fp16_tflops: int
    total_memory_gb: int

@dataclass
class A40DeviceSKUConfig(BaseDeviceSKUConfig):
    fp16_tflops: int = 150
    total_memory_gb: int = 45

@dataclass
class A100DeviceSKUConfig(BaseDeviceSKUConfig):
    fp16_tflops: int = 312
    total_memory_gb: int = 80

@dataclass
class H100DeviceSKUConfig(BaseDeviceSKUConfig):
    fp16_tflops: int = 1000
    total_memory_gb: int = 80
```
| GPU | FP16 TFLOPS | Memory (GB) | Architecture |
|---|---|---|---|
| A40 | 150 | 45 | Ampere |
| A100 | 312 | 80 | Ampere |
| H100 | 1000 | 80 | Hopper |
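The fp16_tflops figure feeds the MFU (Model FLOPs Utilization) metric. A generic sketch of that calculation, assuming MFU is simply achieved throughput over device peak (Vidur's MFU calculator accounts for the model architecture in more detail):

```python
def mfu_percent(achieved_flops, elapsed_s, peak_tflops):
    # MFU = achieved FLOP throughput / device peak throughput
    achieved_tflops = achieved_flops / elapsed_s / 1e12
    return achieved_tflops / peak_tflops * 100.0

# 156 TFLOP of useful model work executed in 1 second on an A100
# (312 peak FP16 TFLOPS) corresponds to 50% MFU.
a100_mfu = mfu_percent(achieved_flops=156e12, elapsed_s=1.0, peak_tflops=312)
```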
Nodes combine a device type with a device count. The num_devices_per_node parameter is critical for determining whether tensor/pipeline parallelism spans across nodes (affecting communication latency):
```python
# vidur/config/node_sku_config.py
@dataclass
class A100PairwiseNvlinkNodeSKUConfig(BaseNodeSKUConfig):
    device_sku_type: DeviceSKUType = DeviceSKUType.A100
    num_devices_per_node: int = 4  # Pairwise NVLink: 4 GPUs

@dataclass
class A100DgxNodeSKUConfig(BaseNodeSKUConfig):
    device_sku_type: DeviceSKUType = DeviceSKUType.A100
    num_devices_per_node: int = 8  # Full DGX: 8 GPUs with NVSwitch

@dataclass
class H100DgxNodeSKUConfig(BaseNodeSKUConfig):
    device_sku_type: DeviceSKUType = DeviceSKUType.H100
    num_devices_per_node: int = 8  # H100 DGX: 8 GPUs with NVSwitch
```
The predictor checks whether num_workers > devices_per_node to detect multi-node setups. When multi-node, it selects devices_per_node=1 send/recv profiling data (inter-node communication). For intra-node, it uses devices_per_node=2.
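That heuristic fits in a single expression; the function name below is illustrative, not Vidur's:

```python
def send_recv_profile_devices(num_workers, devices_per_node):
    # Multi-node: workers span nodes, so use inter-node (devices_per_node=1)
    # profiling data; otherwise use intra-node (devices_per_node=2) data.
    return 1 if num_workers > devices_per_node else 2
```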
Every LLM architecture is fully parameterized as a dataclass. The base class BaseModelConfig defines the complete set of architectural parameters that drive both execution time prediction and memory management:
```python
# vidur/config/model_config.py
@dataclass
class BaseModelConfig(BaseFixedConfig):
    num_layers: int               # Total transformer layers
    num_q_heads: int              # Query attention heads
    num_kv_heads: int             # Key/Value heads (GQA when < num_q_heads)
    embedding_dim: int            # Hidden dimension
    mlp_hidden_dim: int           # Feed-forward intermediate dimension
    max_position_embeddings: int
    use_gated_mlp: bool           # LLaMA-style gated MLP vs vanilla
    use_bias: bool
    use_qkv_bias: bool
    activation: ActivationType    # SiLU, GELU, etc.
    norm: NormType                # RMS_NORM or LAYER_NORM
    post_attn_norm: bool          # Whether post-attention layernorm exists
    vocab_size: int
    rope_theta: float             # RoPE base frequency
    partial_rotary_factor: float = 1.0
    no_tensor_parallel: bool = False
```
| Model | Layers | Q Heads | KV Heads | Embed Dim | MLP Dim | Vocab |
|---|---|---|---|---|---|---|
| Llama-2-7B | 32 | 32 | 32 | 4096 | 11008 | 32768 |
| Llama-2-70B | 80 | 64 | 8 | 8192 | 28672 | 32768 |
| Llama-3-8B | 32 | 32 | 8 | 4096 | 14336 | 128256 |
| Llama-3-70B | 80 | 64 | 8 | 8192 | 28672 | 128256 |
| CodeLlama-34B | 48 | 64 | 8 | 8192 | 22016 | 32768 |
| InternLM-20B | 60 | 40 | 40 | 5120 | 13824 | 103168 |
| InternLM2-20B | 48 | 48 | 8 | 6144 | 16384 | 92544 |
| Phi-2 | 32 | 32 | 32 | 2560 | 10240 | 51200 |
| Qwen-72B | 80 | 64 | 64 | 8192 | 24576 | 152064 |
Phi-2 applies RoPE to only a fraction of each attention head's dimensions, configured via partial_rotary_factor=0.4.
Vidur's simulation is driven by a priority queue of events. Each event has a timestamp, type, and unique ID. When processed, an event may generate new events that are pushed back into the queue. Time advances non-uniformly, jumping directly to the next event's timestamp.
```python
# vidur/events/base_event.py
class BaseEvent(ABC):
    _id = 0

    def __init__(self, time: float, event_type: EventType):
        self._time = time
        self._id = BaseEvent.generate_id()
        self._event_type = event_type
        self._priority_number = self._get_priority_number()

    def _get_priority_number(self):
        return (self._time, self._id, self.event_type)

    def __lt__(self, other):
        # Three-level comparison: time -> event_type -> id
        if self._time == other._time:
            if self._event_type == other._event_type:
                return self._id < other._id
            return self._event_type < other._event_type
        else:
            return self._time < other._time

    @abstractmethod
    def handle_event(self, scheduler, metrics_store) -> List["BaseEvent"]:
        pass
```
The EventType enum defines the numeric ordering that breaks ties when multiple events occur at the same timestamp. Lower numbers have higher priority:
```python
# vidur/types/event_type.py
class EventType(BaseIntEnum):
    # At any given time step, call the schedule event at the last
    # to ensure that all the requests are processed
    BATCH_STAGE_ARRIVAL = 1     # Highest priority -- data arriving at a stage
    REQUEST_ARRIVAL = 2         # New request enters the system
    BATCH_STAGE_END = 3         # GPU computation for a stage completes
    BATCH_END = 4               # All stages of a batch complete
    GLOBAL_SCHEDULE = 5         # Distribute requests to replicas
    REPLICA_SCHEDULE = 6        # Schedule batches within a replica
    REPLICA_STAGE_SCHEDULE = 7  # Schedule work on a specific pipeline stage
```
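Because EventType is an integer enum, comparing (time, event_type, id) tuples reproduces the ordering that __lt__ describes. A small heapq demonstration with hand-built priority tuples (the tuples stand in for real event objects):

```python
import heapq

# Priority tuples mirror (time, event_type, id): ties at the same timestamp
# are broken by event type, then by creation order.
BATCH_STAGE_ARRIVAL, REQUEST_ARRIVAL, GLOBAL_SCHEDULE = 1, 2, 5

queue = []
heapq.heappush(queue, (10.0, GLOBAL_SCHEDULE, 3))
heapq.heappush(queue, (10.0, BATCH_STAGE_ARRIVAL, 2))
heapq.heappush(queue, (5.0, REQUEST_ARRIVAL, 1))

order = [heapq.heappop(queue) for _ in range(3)]
# Earliest time pops first; at t=10.0 the stage arrival beats the
# global-schedule event, so scheduling runs after all arrivals land.
```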
```python
# request_arrival_event.py
def handle_event(self, scheduler, metrics):
    scheduler.add_request(self._request)
    metrics.on_request_arrival(self.time, self._request)
    return [GlobalScheduleEvent(self.time)]
```
Adds the request to the global scheduler's queue and immediately triggers a GlobalScheduleEvent at the same timestamp to attempt dispatching.
```python
# global_schedule_event.py
def handle_event(self, scheduler, metrics):
    self._request_mapping = scheduler.schedule()
    for replica_id, request in self._request_mapping:
        self._replica_set.add(replica_id)
        scheduler.get_replica_scheduler(replica_id).add_request(request)
    return [
        ReplicaScheduleEvent(self.time, rid)
        for rid in self._replica_set
    ]
```
Invokes the global scheduler's dispatch logic (round-robin, load-balanced, etc.) and maps requests to replicas. Triggers ReplicaScheduleEvent for each affected replica.
```python
# replica_schedule_event.py
def handle_event(self, scheduler, metrics):
    replica_scheduler = scheduler.get_replica_scheduler(self._replica_id)
    self._batches = replica_scheduler.on_schedule()
    if not self._batches:
        return []
    for batch in self._batches:
        batch.on_schedule(self.time)
    return [
        BatchStageArrivalEvent(
            self.time,
            self._replica_id,
            0,  # stage_id = 0 (first stage)
            batch,
        )
        for batch in self._batches
    ]
```
Calls the replica's scheduler (ORCA, vLLM, sarathi, etc.) to form batches. Each batch starts at pipeline stage 0 via BatchStageArrivalEvent.
```python
# replica_stage_schedule_event.py
def handle_event(self, scheduler, metrics):
    stage_scheduler = scheduler._replica_schedulers[
        self._replica_id
    ]._replica_stage_schedulers[self._stage_id]
    self._batch, self._batch_stage, execution_time = (
        stage_scheduler.on_schedule()
    )
    if not (self._batch and self._batch_stage):
        return []
    # THIS IS WHERE TIME ADVANCES:
    return [
        BatchStageEndEvent(
            self.time + self._batch_stage.execution_time,
            self._replica_id,
            self._stage_id,
            stage_scheduler.is_last_stage,
            self._batch,
            self._batch_stage,
        )
    ]
```
The critical event that advances simulation time. The stage scheduler calls the execution time predictor, and a BatchStageEndEvent is scheduled at current_time + predicted_execution_time.
```python
# batch_stage_end_event.py
def handle_event(self, scheduler, metrics):
    scheduler.get_replica_stage_scheduler(
        self._replica_id, self._stage_id
    ).on_stage_end()
    self._batch_stage.on_stage_end(self.time)
    next_events = [
        ReplicaStageScheduleEvent(self.time, self._replica_id, self._stage_id)
    ]
    if self._is_last_stage:
        return next_events + [
            BatchEndEvent(self.time, self._replica_id, self._batch)
        ]
    return next_events + [
        BatchStageArrivalEvent(
            self.time, self._replica_id, self._stage_id + 1, self._batch
        )
    ]
```
When a pipeline stage finishes, it either advances to the next stage (BatchStageArrival(stage+1)) or, if this was the last stage, completes the batch (BatchEndEvent). It also re-schedules the current stage for new work.
```python
# batch_end_event.py
def handle_event(self, scheduler, metrics):
    self._batch.on_batch_end(self.time)
    replica_scheduler = scheduler.get_replica_scheduler(self._replica_id)
    replica_scheduler.on_batch_end(self._batch)
    memory_usage = replica_scheduler.memory_usage_percent
    metrics.on_batch_end(self.time, self._batch, self._replica_id, memory_usage)
    return [ReplicaScheduleEvent(self.time, self._replica_id)]
```
Finalizes the batch: updates request completion state, frees memory, records metrics. Triggers a new ReplicaScheduleEvent to check if more batches can be formed from queued requests.
The Simulator class orchestrates everything. Its run() method is a clean, minimal event loop that processes events until the queue empties or a time limit is reached:
```python
# vidur/simulator.py
class Simulator:
    def __init__(self, config: SimulationConfig):
        self._time = 0
        self._terminate = False
        self._time_limit = config.time_limit or float("inf")
        self._event_queue = []  # Python heapq (min-heap)
        self._cluster = Cluster(config.cluster_config, ...)
        self._metric_store = MetricsStore(config)
        self._request_generator = RequestGeneratorRegistry.get(
            config.request_generator_config.get_type(),
            config.request_generator_config,
        )
        self._scheduler = GlobalSchedulerRegistry.get(
            config.cluster_config.global_scheduler_config.get_type(),
            config,
            self._cluster.replicas,
        )
        self._init_event_queue()

    def _init_event_queue(self):
        # Generate ALL requests upfront, add as arrival events
        requests = self._request_generator.generate()
        for request in requests:
            self._add_event(RequestArrivalEvent(request.arrived_at, request))

    def run(self):
        while self._event_queue and not self._terminate:
            _, event = heapq.heappop(self._event_queue)  # Pop lowest priority number
            self._set_time(event._time)                  # TIME JUMPS to event time
            new_events = event.handle_event(             # Process event
                self._scheduler, self._metric_store
            )
            self._add_events(new_events)                 # Push new events to heap

    def _add_event(self, event):
        heapq.heappush(self._event_queue, (event._priority_number, event))

    def _set_time(self, time):
        self._time = time
        if self._time > self._time_limit:
            self._terminate = True
```
Time advances only through self._set_time(event._time). This is the hallmark of discrete event simulation: there is no computation "between" events. When ReplicaStageScheduleEvent creates a BatchStageEndEvent at current_time + execution_time, that predicted execution time (from the sklearn model) determines when simulation time next advances significantly.
A key design choice: all requests are generated upfront in _init_event_queue() and added as RequestArrivalEvents. This means the event queue starts populated with all future arrivals. The request generator runs once before the simulation loop begins.
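The whole loop can be reduced to a few lines. This minimal sketch (run, handler, and the request/done payloads are illustrative, not Vidur's API) shows the same shape: pop the earliest event, jump time to it, let the handler schedule future events:

```python
import heapq

def run(events, handler):
    """Minimal discrete-event loop: events are (time, seq, payload) tuples,
    and the handler returns (delay, payload) pairs to schedule later."""
    queue, seq = [], 0
    for t, payload in events:
        heapq.heappush(queue, (t, seq, payload))
        seq += 1
    now = 0.0
    while queue:
        now, _, payload = heapq.heappop(queue)  # time jumps to the event
        for dt, new_payload in handler(now, payload):
            heapq.heappush(queue, (now + dt, seq, new_payload))
            seq += 1
    return now  # time of the last processed event

def handler(now, payload):
    # Each "request" schedules one follow-up "done" event 0.5s later.
    return [(0.5, "done")] if payload == "request" else []

final_time = run([(0.0, "request"), (1.0, "request")], handler)
```

Two requests arriving at t=0.0 and t=1.0 each spawn a completion 0.5s later, so the loop terminates at t=1.5 after processing four events.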
```python
def _write_output(self):
    self._metric_store.plot()  # Generate all metric plots
    if self._config.metrics_config.write_json_trace:
        self._write_event_trace()  # JSON event log
    if self._config.metrics_config.enable_chrome_trace:
        self._write_chrome_trace()  # Chrome DevTools format
```
The simulator optionally writes Chrome trace files that can be loaded in chrome://tracing for visual timeline analysis. This is achieved through each event's to_chrome_trace() method, with BatchStageEndEvent being the primary source of trace entries.
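The Chrome trace format itself is a simple JSON list of events with microsecond timestamps. A hedged sketch of what one "complete" entry might look like (mapping replicas to pid and stages to tid is an assumption here, not Vidur's exact output):

```python
def to_chrome_trace(name, start_s, duration_s, replica_id, stage_id):
    """Build one Chrome Trace Event Format "complete" event (ph="X")."""
    return {
        "name": name,
        "ph": "X",              # complete event: carries both ts and dur
        "ts": start_s * 1e6,    # chrome://tracing expects microseconds
        "dur": duration_s * 1e6,
        "pid": replica_id,      # assumed mapping: one row group per replica
        "tid": stage_id,        # assumed mapping: one row per pipeline stage
    }

entry = to_chrome_trace("batch_1", start_s=1.0, duration_s=0.25,
                        replica_id=0, stage_id=2)
```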
Vidur supports two fundamental request generation strategies: synthetic (parameterized distributions) and trace replay (real production data). The synthetic generator composes independent interval and length generators via the registry pattern.
```python
# vidur/request_generator/synthetic_request_generator.py
class SyntheticRequestGenerator(BaseRequestGenerator):
    def __init__(self, config):
        self.request_length_generator = RequestLengthGeneratorRegistry.get(
            config.length_generator_config.get_type(),
            config.length_generator_config,
        )
        self.request_interval_generator = RequestIntervalGeneratorRegistry.get(
            config.interval_generator_config.get_type(),
            config.interval_generator_config,
        )

    def _generate_next_request(self, last_arrived_at):
        inter_request_time = (
            self.request_interval_generator.get_next_inter_request_time()
        )
        arrived_at = last_arrived_at + inter_request_time
        prefill_tokens, decode_tokens = (
            self.request_length_generator.get_next_num_tokens()
        )
        return Request(
            arrived_at=arrived_at,
            num_prefill_tokens=int(prefill_tokens),
            num_decode_tokens=int(decode_tokens),
        )

    def _generate_requests(self):
        requests = []
        current_time = 0
        # Duration-based, count-based, or trace-exhaustion generation
        if self.config.duration is not None:
            while current_time < self.config.duration:
                request = self._generate_next_request(current_time)
                current_time = request.arrived_at
                requests.append(request)
        elif self.config.num_requests is not None:
            for _ in range(self.config.num_requests):
                request = self._generate_next_request(current_time)
                current_time = request.arrived_at
                requests.append(request)
        return requests
```
```python
# poisson_request_interval_generator.py
class PoissonRequestIntervalGenerator:
    def __init__(self, config):
        self.qps = config.qps
        self.std = 1.0 / self.qps
        self.max_interval = self.std * 3.0

    def get_next_inter_request_time(self):
        next_interval = -math.log(1.0 - random.random()) / self.qps
        # Clamp to 3 sigma to avoid extreme gaps
        return min(next_interval, self.max_interval)
```
Models memoryless arrivals via inverse CDF of the exponential distribution: -ln(1-U)/lambda. Intervals are clamped to 3 standard deviations to prevent extremely long gaps.
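A self-contained sketch of the same sampling scheme, seeded for reproducibility (not Vidur's exact code):

```python
import math
import random

def poisson_intervals(qps, n, seed=0):
    """Inverse-CDF exponential sampling with the 3-sigma clamp above."""
    rng = random.Random(seed)
    max_interval = 3.0 / qps  # std of Exp(qps) is 1/qps
    out = []
    for _ in range(n):
        interval = -math.log(1.0 - rng.random()) / qps
        out.append(min(interval, max_interval))
    return out

# At 2 QPS the mean gap is ~0.5s; the clamp pulls it slightly below 1/qps.
intervals = poisson_intervals(qps=2.0, n=1000)
```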
```python
# gamma_request_interval_generator.py
from scipy.stats import gamma

class GammaRequestIntervalGenerator:
    def __init__(self, config):
        cv = config.cv  # coefficient of variation
        self.qps = config.qps
        self.gamma_shape = 1.0 / (cv ** 2)

    def get_next_inter_request_time(self):
        gamma_scale = 1.0 / (self.qps * self.gamma_shape)
        return gamma.rvs(self.gamma_shape, scale=gamma_scale)
```
The Gamma distribution generalizes exponential arrivals with a coefficient of variation (CV) parameter. CV=1 equals Poisson; CV>1 produces burstier traffic; CV<1 produces more regular arrivals. The shape parameter is 1/CV^2.
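The shape/scale mapping can be verified with a few lines of arithmetic, using only the standard mean and variance formulas for the Gamma distribution (mean = shape * scale, variance = shape * scale^2):

```python
def gamma_params(qps, cv):
    """Shape/scale so the mean inter-arrival time is 1/qps with the given CV."""
    shape = 1.0 / cv ** 2
    scale = 1.0 / (qps * shape)
    return shape, scale

shape, scale = gamma_params(qps=4.0, cv=2.0)
mean = shape * scale                           # should equal 1/qps = 0.25
cv_check = (shape * scale ** 2) ** 0.5 / mean  # should recover cv = 2.0
```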
```python
# Uniform length generator
def get_next_num_tokens(self):
    total = random.uniform(self.config.min_tokens, self.config.max_tokens)
    decode = math.ceil(total / (1 + self.config.prefill_to_decode_ratio))
    prefill = total - decode
    return prefill, decode
```
```python
# Zipf length generator
def get_next_num_tokens(self):
    total = self.zipf_generator.next()
    decode = total / (1 + self.config.prefill_to_decode_ratio)
    prefill = total - decode
    return prefill, decode
```
```python
# vidur/request_generator/trace_replay_request_generator.py
class TraceReplayRequestGenerator(BaseRequestGenerator):
    def __init__(self, config):
        self.trace_df = pd.read_csv(config.trace_file)
        # Scale tokens
        self.trace_df["num_prefill_tokens"] *= config.prefill_scale_factor
        self.trace_df["num_decode_tokens"] *= config.decode_scale_factor
        # Enforce constraints: integers, minimum 1, total <= max
        self.trace_df["num_prefill_tokens"] = (
            self.trace_df["num_prefill_tokens"].astype(int).clip(lower=1)
        )
        self.trace_df["num_decode_tokens"] = (
            self.trace_df["num_decode_tokens"].astype(int).clip(lower=1)
        )
        # If total exceeds max, reduce prefill tokens
        total_tokens = (
            self.trace_df["num_prefill_tokens"] + self.trace_df["num_decode_tokens"]
        )
        diff_tokens = (total_tokens - config.max_tokens).clip(lower=0)
        self.trace_df["num_prefill_tokens"] -= diff_tokens
        # Rescale arrival times to change effective QPS
        self.trace_df["arrived_at"] *= config.time_scale_factor

    def generate_requests(self):
        return [
            Request(
                arrived_at=row["arrived_at"],
                num_prefill_tokens=row["num_prefill_tokens"],
                num_decode_tokens=row["num_decode_tokens"],
            )
            for _, row in self.trace_df.iterrows()
        ]
```
The trace replay generator exposes three scaling knobs: prefill_scale_factor and decode_scale_factor for adjusting token lengths, and time_scale_factor for compressing or stretching the arrival pattern. A time_scale_factor of 0.5 halves every arrival timestamp, doubling the effective QPS. Token counts are always clipped to ensure at least 1 prefill and 1 decode token, and the total never exceeds max_tokens.
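The per-request clamping rules can be expressed without pandas as a small sketch (clamp_request is an illustrative name, not part of Vidur):

```python
def clamp_request(prefill, decode, max_tokens):
    """Trace-replay constraints: integer counts, at least one token of each
    kind, and prefill absorbs any overflow beyond max_tokens."""
    prefill = max(int(prefill), 1)
    decode = max(int(decode), 1)
    overflow = max(prefill + decode - max_tokens, 0)
    return prefill - overflow, decode
```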
The MetricsStore is a comprehensive measurement system that captures every aspect of simulation behavior. It uses two core data structures: DataSeries for indexed (x, y) data and CDFSketch (backed by DDSketch) for streaming quantile estimation.
```python
# vidur/metrics/cdf_sketch.py
class CDFSketch:
    def __init__(self, metric_name, save_table_to_wandb=True, save_plots=True):
        self._sketch = DDSketch(relative_accuracy=0.001)  # 0.1% accuracy
        self._metric_name = metric_name
        self._last_data = 0

    def put(self, data: float):
        self._last_data = data
        self._sketch.add(data)

    def put_delta(self, delta: float):
        # Incremental update: add to last value
        data = self._last_data + delta
        self.put(data)

    def _to_df(self):
        # Generate CDF at 1% intervals for plotting
        quantiles = np.linspace(0, 1, 101)
        quantile_values = [self._sketch.get_quantile_value(q) for q in quantiles]
        return pd.DataFrame({"cdf": quantiles, self._metric_name: quantile_values})

    def print_distribution_stats(self, plot_name):
        # Reports: min, max, mean, p25, p50, p75, p95, p99, p99.9, count, sum
        ...
```
With relative_accuracy=0.001, any quantile query (p50, p99, p99.9) is guaranteed to be within 0.1% of the true value. This is far more memory-efficient than storing all values, especially when tracking millions of per-token metrics.
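To make that guarantee concrete, here is a toy version of the idea behind DDSketch (a hypothetical re-implementation for illustration, not the ddsketch library's actual code): values are mapped into logarithmically sized buckets, so any value in a bucket is recoverable within the relative accuracy, and memory grows with the number of buckets rather than the number of samples.

```python
import math
from collections import Counter

class TinyDDSketch:
    """Toy relative-error quantile sketch; illustrative only."""

    def __init__(self, relative_accuracy=0.001):
        self.alpha = relative_accuracy
        # Bucket boundaries are powers of gamma; each bucket spans a
        # (1 - alpha, 1 + alpha) relative range around its midpoint.
        self.gamma = (1 + self.alpha) / (1 - self.alpha)
        self.buckets = Counter()  # bucket index -> count
        self.count = 0

    def add(self, value):
        # Positive values only, for simplicity
        index = math.ceil(math.log(value, self.gamma))
        self.buckets[index] += 1
        self.count += 1

    def get_quantile_value(self, q):
        rank = q * (self.count - 1)
        seen = -1
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen >= rank:
                # Bucket midpoint: within alpha of every value in the bucket
                return 2 * self.gamma ** index / (self.gamma + 1)
        return None

sketch = TinyDDSketch()
for v in range(1, 100_001):
    sketch.add(float(v))
p99 = sketch.get_quantile_value(0.99)
# p99 lands within ~0.1% of the true value (~99000) while storing only
# a few thousand buckets instead of 100,000 raw samples.
```

Storing all 100,000 floats would take ~800 KB; the sketch needs only the bucket counters, and the bucket count is bounded by the dynamic range of the data, not its volume.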
| Category | Storage | Metrics |
|---|---|---|
| Request Time Distributions | DataSeries | E2E time, execution time, model execution time, scheduling delay, preemption time, prefill time, decode time (all with normalized variants) |
| Request Histograms | DataSeries | Inter-arrival delay, total tokens, prefill tokens, decode tokens, P/D ratio, num restarts |
| Token Metrics | CDFSketch | Decode token execution+preemption time (TPOT-equivalent) |
| Batch Count Distributions | CDFSketch | Batch num tokens, prefill tokens, decode tokens, batch size |
| Batch Time Distributions | CDFSketch | Batch execution time |
| GPU Operation Metrics | CDFSketch | Per-operation breakdown: MLP up/down/act, attention pre/post proj, prefill, decode, KV save, RoPE, norms, all-reduce, pipeline send/recv |
| CPU Operation Metrics | CDFSketch | Schedule, sampler E2E, prepare inputs, model execution E2E, process outputs, Ray comm |
| Utilization Metrics | SeriesAverageMeter | Per-replica memory usage, per-replica-stage busy time %, per-replica-stage MFU |
| Completion Time Series | DataSeries | Request arrivals/completions over time, prefill completions, decode completions |
Metrics are collected via event-driven callbacks. The @if_write_metrics decorator ensures that metric collection can be disabled for performance. Key collection points:
```python
# Callback chain through event handlers:

# RequestArrivalEvent -> metrics.on_request_arrival(time, request)
#   Records: arrival count, token histograms, inter-arrival delay

# ReplicaStageScheduleEvent -> metrics.on_replica_stage_schedule(...)
#   Records: ALL per-operation execution times (the 19-component breakdown)
#   Also: busy_time, MFU per replica/stage

# BatchStageEndEvent -> metrics.on_batch_stage_end(...)
#   Records: busy_time=0, MFU=0 (stage becomes idle)

# BatchEndEvent -> metrics.on_batch_end(time, batch, ...)
#   Records: batch execution time, token counts, memory usage
#   For completed requests: E2E time, scheduling delay, preemption time
#   For each token: per-token execution time (TPOT tracking)
```
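The disable-for-performance pattern behind @if_write_metrics can be sketched as a conditional decorator (a minimal sketch with assumed attribute names, not Vidur's exact implementation):

```python
import functools

def if_write_metrics(func):
    """Skip the wrapped method entirely when metrics are disabled."""
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if not self._should_write_metrics:
            return None
        return func(self, *args, **kwargs)
    return wrapper

class MetricsStore:
    def __init__(self, should_write_metrics=True):
        self._should_write_metrics = should_write_metrics
        self.arrivals = 0

    @if_write_metrics
    def on_request_arrival(self, time, request):
        # Only runs when metric collection is enabled
        self.arrivals += 1
```

Guarding at the decorator level means a disabled store pays one attribute check per callback instead of the full cost of histogram and sketch updates.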
The MetricsStore produces five categories of output, all stored under output_dir/plots/.
All plots are generated with Plotly Express and optionally logged to Weights & Biases for experiment tracking. CSV files are saved alongside plots for programmatic analysis.
The complete flow from configuration to results follows a clear pipeline. Understanding how these pieces connect is essential for extending Vidur or interpreting its outputs.
1. **Configuration.** SimulationConfig aggregates model config (Llama-2-70B), device config (A100), node config (DGX 8-GPU), scheduler config (Orca/vLLM), and predictor config (LinearRegression).
2. **Request generation.** The request generator produces Request objects with arrival times and token counts.
3. **Event seeding.** Each request is wrapped in a RequestArrivalEvent and pushed onto the heap.
Rather than calling model.predict() during simulation, all predictions are pre-computed as Dict[(features_tuple), time] lookups. This trades memory for speed: simulation-time prediction is a hash table access, not a model inference call. The cache system uses MD5 hashing of the configuration + data to avoid retraining across runs.
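A minimal sketch of this trade (hypothetical names and a toy model; Vidur's actual cache layout may differ): predictions are materialized into a dict keyed by a feature tuple, and the table itself is keyed by an MD5 digest of the configuration so identical runs can reuse it.

```python
import hashlib
import json

def cache_key(config: dict) -> str:
    # Stable digest of the configuration; identical configs map to the
    # same key, so the trained table can be reused across runs.
    blob = json.dumps(config, sort_keys=True).encode()
    return hashlib.md5(blob).hexdigest()

def precompute_predictions(model_predict, feature_grid):
    # Run the (slow) model once per grid point, up front
    return {features: model_predict(features) for features in feature_grid}

# Toy "trained model": latency grows with batch size and context length
def model_predict(features):
    batch_size, context_length = features
    return 0.1 * batch_size + 0.001 * context_length

grid = [(b, c) for b in range(1, 5) for c in range(0, 4096, 512)]
table = precompute_predictions(model_predict, grid)

# Simulation-time "prediction" is just a hash lookup:
latency = table[(2, 1024)]  # 0.1*2 + 0.001*1024 = 1.224
```

The grid must cover every feature combination the simulation can produce, which is why inputs are rounded to a fixed granularity before lookup.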
Events at the same timestamp are ordered by type: arrivals before completions before scheduling. This ensures that when a batch finishes and new requests arrive simultaneously, the system state is correctly updated before scheduling decisions are made. The three-level ordering (time, event_type, id) provides deterministic simulation.
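The three-level ordering can be sketched with a heapq-based loop (the enum values here are illustrative; Vidur's actual EventType assigns priorities 1-7):

```python
import heapq
from enum import IntEnum

class EventType(IntEnum):
    # Lower value = higher priority at equal timestamps
    REQUEST_ARRIVAL = 1
    BATCH_END = 2
    SCHEDULE = 3

class Event:
    _next_id = 0

    def __init__(self, time, event_type):
        self.time = time
        self.event_type = event_type
        self.id = Event._next_id  # monotonically increasing tiebreaker
        Event._next_id += 1

    def __lt__(self, other):
        # Three-level ordering: time, then type priority, then creation id
        return (self.time, self.event_type, self.id) < (
            other.time, other.event_type, other.id
        )

heap = []
heapq.heappush(heap, Event(5.0, EventType.SCHEDULE))
heapq.heappush(heap, Event(5.0, EventType.REQUEST_ARRIVAL))
heapq.heappush(heap, Event(5.0, EventType.BATCH_END))

order = [heapq.heappop(heap).event_type for _ in range(3)]
# At the same timestamp: arrival drains first, then completion, then scheduling
```

Because the id tiebreaker reflects creation order, two runs with the same inputs pop events in exactly the same sequence, which is what makes the simulation reproducible.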
Prefill attention uses geometric aggregation of chunk sizes (sqrt(sum(x^2))) to handle batched prefill. Decode attention uses arithmetic mean of KV-cache sizes. Both are rounded to the kv_cache_prediction_granularity to match the pre-computed prediction grid. This balances accuracy with memory usage for the prediction tables.
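The two aggregation rules can be sketched as follows (function names are illustrative, and the round-up direction is an assumption; Vidur may round differently):

```python
import math

def round_to_granularity(value, granularity):
    # Assumed: round UP to the next multiple of the prediction granularity
    return ((value + granularity - 1) // granularity) * granularity

def prefill_feature(chunk_sizes, granularity=64):
    # Geometric aggregation over prefill chunk sizes: sqrt(sum(x^2))
    agg = math.sqrt(sum(x * x for x in chunk_sizes))
    return round_to_granularity(int(agg), granularity)

def decode_feature(kv_cache_sizes, granularity=64):
    # Arithmetic mean of per-request KV-cache sizes
    agg = sum(kv_cache_sizes) / len(kv_cache_sizes)
    return round_to_granularity(int(agg), granularity)

print(prefill_feature([512, 512]))      # sqrt(2)*512 ~= 724 -> 768
print(decode_feature([100, 300, 500]))  # mean 300 -> 320
```

The sqrt-of-sum-of-squares form reflects that prefill attention cost grows roughly quadratically in chunk length, so two 512-token chunks cost about as much as one 724-token chunk, not one 1024-token chunk.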
Tensor parallel communication adds a fixed NCCL launch overhead plus a per-device skew term: prediction + nccl_cpu_launch_overhead_ms + nccl_cpu_skew_overhead_per_device_ms * TP^1.25. Pipeline parallel communication uses profiled send/recv data with multi-node awareness. The TP=1 and last-stage optimizations skip communication entirely.
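The tensor-parallel formula works out as follows (the overhead constants below are illustrative placeholders, not profiled values):

```python
def tp_communication_time(prediction_ms, tp_size,
                          nccl_cpu_launch_overhead_ms=0.02,
                          nccl_cpu_skew_overhead_per_device_ms=0.01):
    if tp_size == 1:
        return 0.0  # TP=1 optimization: no all-reduce happens at all
    # Profiled all-reduce prediction + fixed launch cost + per-device skew
    return (prediction_ms
            + nccl_cpu_launch_overhead_ms
            + nccl_cpu_skew_overhead_per_device_ms * tp_size ** 1.25)

print(tp_communication_time(0.5, 1))  # 0.0
print(tp_communication_time(0.5, 8))  # 0.5 + 0.02 + 0.01 * 8**1.25
```

The superlinear TP^1.25 skew term captures that with more participating devices, the all-reduce increasingly waits on the slowest kernel launch.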
| Component | File | Key Class/Function |
|---|---|---|
| Base Predictor | execution_time_predictor/base_execution_time_predictor.py | BaseExecutionTimePredictor.get_execution_time() |
| sklearn Pipeline | execution_time_predictor/sklearn_execution_time_predictor.py | SklearnExecutionTimePredictor._train_models() |
| Linear Regression | execution_time_predictor/linear_regression_execution_time_predictor.py | LinearRegressionExecutionTimePredictor |
| Random Forest | execution_time_predictor/random_forrest_execution_time_predictor.py | RandomForrestExecutionTimePredictor |
| Execution Time Entity | entities/execution_time.py | ExecutionTime.model_time, .total_time |
| Event Base | events/base_event.py | BaseEvent.__lt__(), .handle_event() |
| Event Types | types/event_type.py | EventType enum (1-7 priority) |
| Simulator | simulator.py | Simulator.run() event loop |
| Synthetic Generator | request_generator/synthetic_request_generator.py | SyntheticRequestGenerator._generate_requests() |
| Trace Generator | request_generator/trace_replay_request_generator.py | TraceReplayRequestGenerator.generate_requests() |
| Poisson Intervals | request_generator/poisson_request_interval_generator.py | -log(1-U)/qps |
| Gamma Intervals | request_generator/gamma_request_interval_generator.py | shape=1/cv^2 |
| CDF Sketch | metrics/cdf_sketch.py | CDFSketch(DDSketch(0.001)) |
| Metrics Store | metrics/metrics_store.py | MetricsStore.on_batch_end() |
| Device Config | config/device_sku_config.py | A100: 312 TFLOPS, 80GB |
| Model Config | config/model_config.py | Llama2_70BModelConfig |
| Node Config | config/node_sku_config.py | A100DgxNodeSKUConfig: 8 GPUs |