Complete file-level invocation trace: from python3 -m vidur.main through initialization, event loop, and output — every file, every class, every call in order.
vidur/main.py → Simulator → Event Loop → Output

Vidur is a discrete event simulator. Execution has three macro phases: initialization (build all objects, generate requests, fill event queue), event loop (pop events in time order, each event handler returns new events), and output (write metrics and traces). The sections below show the full call chain at file level.
The entry point. Invoked via python3 -m vidur.main. Does exactly three things:
```python
def main() -> None:
    config: SimulationConfig = SimulationConfig.create_from_cli_args()  # ① parse CLI
    set_seeds(config.seed)                                              # ② deterministic RNG
    simulator = Simulator(config)                                       # ③ build & run
    simulator.run()
```
Parses all CLI arguments into a nested dataclass hierarchy. Key sub-configs:
| Sub-Config | Controls | File |
|---|---|---|
| cluster_config | num_replicas, replica_config (model, device, parallelism), global_scheduler_config | config/config.py |
| request_generator_config | Synthetic vs trace-replay, interval/length generators | config/config.py |
| metrics_config | output_dir, write_metrics, write_json_trace, enable_chrome_trace | config/config.py |
Sets random seeds for Python's random and numpy to ensure reproducible simulations.
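A minimal sketch of what set_seeds likely does (the numpy call is guarded here so the sketch runs standalone; the real implementation lives in vidur/utils/random.py):

```python
import random

def set_seeds(seed: int) -> None:
    # Seed Python's RNG; numpy is seeded too when available
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
```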
Simulator.__init__() builds the entire simulation infrastructure. Each sub-component triggers a cascade of further object creation.
Cluster.__init__() creates num_replicas Replica objects. Each Replica validates model/device constraints (e.g., layers divisible by pipeline stages, embedding_dim divisible by tensor_parallel_size).
```python
# entities/cluster.py
self._replicas = {}
for _ in range(num_replicas):
    replica = Replica(replica_config, generator_config)
    self._replicas[replica.id] = replica
```
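The divisibility constraints mentioned above can be sketched as follows (the function name and messages are assumptions, not Vidur's exact code):

```python
def validate_replica(num_layers: int, num_pipeline_stages: int,
                     embedding_dim: int, tensor_parallel_size: int) -> None:
    # Pipeline parallelism needs an equal layer count per stage
    if num_layers % num_pipeline_stages != 0:
        raise ValueError("num_layers must be divisible by num_pipeline_stages")
    # Tensor parallelism needs an equal embedding shard per rank
    if embedding_dim % tensor_parallel_size != 0:
        raise ValueError("embedding_dim must be divisible by tensor_parallel_size")

validate_replica(num_layers=32, num_pipeline_stages=4,
                 embedding_dim=4096, tensor_parallel_size=8)  # passes
```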
The metrics store initializes DataSeries and CDFSketch objects for tracking request latency, batch sizes, utilization, and token-level metrics.
The registry pattern selects the concrete generator class based on config type:
Uses interval generators (Poisson, Gamma, Static, Trace) and length generators (Uniform, Zipf, Trace, Fixed) to produce synthetic requests.
Reads a CSV trace file with arrival times and token counts, replaying exact request patterns.
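The registry pattern described above can be sketched as follows (class and key names are illustrative, not Vidur's exact API):

```python
class GeneratorRegistry:
    _registry = {}

    @classmethod
    def register(cls, key):
        # Decorator: map a config key to a concrete generator class
        def decorator(generator_cls):
            cls._registry[key] = generator_cls
            return generator_cls
        return decorator

    @classmethod
    def get(cls, key, *args, **kwargs):
        # Look up the class by key and instantiate it
        return cls._registry[key](*args, **kwargs)


@GeneratorRegistry.register("synthetic")
class SyntheticGenerator:
    def __init__(self, seed=0):
        self.seed = seed


gen = GeneratorRegistry.get("synthetic", seed=42)
```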
This is the deepest initialization chain. The global scheduler's __init__() creates the entire scheduling hierarchy:
The hierarchy includes the replica schedulers, their per-stage schedulers, and the execution time predictor (LinearRegressionExecutionTimePredictor or RandomForrestExecutionTimePredictor).

Global scheduler concrete implementations (all route requests to replicas):
| Scheduler | Strategy |
|---|---|
| RandomGlobalScheduler | Random replica selection |
| RoundRobinGlobalScheduler | Round-robin rotation |
| LORGlobalScheduler | Least Outstanding Requests |
After all objects are built, the simulator populates the event queue with the initial set of events:
```python
# simulator.py
def _init_event_queue(self):
    requests = self._request_generator.generate()  # → List[Request]
    for request in requests:
        self._add_event(RequestArrivalEvent(request.arrived_at, request))
```
Each request becomes a RequestArrivalEvent timestamped at its arrival time, pushed into a min-heap priority queue. Also registers atexit(_write_output) for cleanup.
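The atexit registration mentioned above can be sketched like this (MiniSimulator is an illustrative name):

```python
import atexit

class MiniSimulator:
    def __init__(self):
        self.wrote_output = False
        # Runs at interpreter exit, even if an exception escapes main()
        atexit.register(self._write_output)

    def _write_output(self):
        self.wrote_output = True  # the real hook plots metrics and writes traces
```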
```python
# simulator.py
def run(self):
    while self._event_queue and not self._terminate:
        _, event = heapq.heappop(self._event_queue)  # ① pop earliest event
        self._set_time(event._time)                  # ② advance sim clock
        new_events = event.handle_event(             # ③ dispatch to handler
            self._scheduler, self._metric_store)
        self._add_events(new_events)                 # ④ push new events
```
This is the core loop. Every handle_event() returns zero or more new events that get pushed back into the heap. The simulation advances by processing events in chronological order until the queue is empty or time_limit is reached.
Events are ordered in the heap by a (time, event_id, event_type) tuple. This guarantees deterministic ordering when multiple events share the same timestamp.
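A minimal sketch of why tuple keys give deterministic pops (event names and ids here are illustrative):

```python
import heapq

queue = []
heapq.heappush(queue, (0.5, 2, "BatchEndEvent"))
heapq.heappush(queue, (0.5, 1, "RequestArrivalEvent"))
heapq.heappush(queue, (0.1, 3, "GlobalScheduleEvent"))

# Pops compare the tuples element by element: earliest time first,
# then lowest event_id when two events share a timestamp.
order = [heapq.heappop(queue)[2] for _ in range(3)]
```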
Each event type lives in its own file under vidur/events/. Below is the complete chain showing what each event does, what it calls, and what new events it creates.
| Action | Target | File |
|---|---|---|
| Add request to global queue | scheduler.add_request(request) | scheduler/global_scheduler/base_global_scheduler.py |
| Record arrival metrics | metrics_store.on_request_arrival() | metrics/metrics_store.py |
Creates: GlobalScheduleEvent
The request lands in scheduler._request_queue and waits. Multiple requests can accumulate in this queue before any batch is formed.
```python
def handle_event(self, scheduler, metrics_store):
    scheduler.add_request(self._request)
    metrics_store.on_request_arrival(self.time, self._request)
    return [GlobalScheduleEvent(self.time)]
```
| Action | Target | File |
|---|---|---|
| Decide which replica handles which request | scheduler.schedule() | scheduler/global_scheduler/{random,round_robin,lor}.py |
| Push request into replica's queue | replica_scheduler.add_request(request) | scheduler/replica_scheduler/base_replica_scheduler.py |
Creates: ReplicaScheduleEvent × one per affected replica
```python
def handle_event(self, scheduler, metrics_store):
    self._replica_set = set()
    self._request_mapping = scheduler.schedule()
    for replica_id, request in self._request_mapping:
        self._replica_set.add(replica_id)
        scheduler.get_replica_scheduler(replica_id).add_request(request)
    return [
        ReplicaScheduleEvent(self.time, replica_id)
        for replica_id in self._replica_set
    ]
```
| Action | Target | File |
|---|---|---|
| Decide which requests to batch together | replica_scheduler.on_schedule() | scheduler/replica_scheduler/{vllm,sarathi,orca,...}.py |
| Mark batch as scheduled | batch.on_schedule(time) | entities/batch.py |
| Record memory usage | metrics_store.on_replica_schedule() | metrics/metrics_store.py |
Creates: BatchStageArrivalEvent × one per batch (stage_id = 0), or empty list if no batch can be formed
replica_scheduler.on_schedule() pulls multiple requests from the queue to form ONE batch. It does NOT create one batch per request. The internal _get_next_batch() loops through the request queue, adding requests until it hits memory, token, or batch-size limits. If the replica is already running the maximum number of concurrent batches (_num_running_batches ≥ _num_stages), it returns an empty list — the requests simply wait.
```python
def handle_event(self, scheduler, metrics_store):
    replica_scheduler = scheduler.get_replica_scheduler(self._replica_id)
    self._batches = replica_scheduler.on_schedule()
    if not self._batches:
        return []  # ← if the replica is busy or the queue is empty, do nothing
    memory_usage_percent = replica_scheduler.memory_usage_percent
    metrics_store.on_replica_schedule(
        self.time, self._replica_id, memory_usage_percent)
    for batch in self._batches:
        batch.on_schedule(self.time)
    return [
        BatchStageArrivalEvent(self.time, self._replica_id, 0, batch)
        for batch in self._batches
    ]
```
```python
# base_replica_scheduler.py — on_schedule()
def on_schedule(self) -> List[Batch]:
    scheduled_batches = []
    while self._num_running_batches < self._num_stages:
        batch = self._get_next_batch()  # ← pulls multiple requests from the queue into ONE batch
        if not batch:
            break
        scheduled_batches.append(batch)
        self._num_running_batches += 1
    return scheduled_batches
```
```python
# vllm_replica_scheduler.py — _get_next_batch() (vLLM as the example, simplified)
def _get_next_batch(self) -> Optional[Batch]:
    requests, num_tokens = [], []
    while self._request_queue:              # ← walk the whole queue
        if not self._can_allocate(...):     # insufficient memory → stop
            break
        if total_tokens > max_tokens:       # token cap reached → stop
            break
        if len(requests) >= batch_size_cap: # batch-size cap reached → stop
            break
        request = self._request_queue.pop(0)
        requests.append(request)            # ← multiple requests join the same batch
        num_tokens.append(next_tokens)
    if not requests:
        return None
    return Batch(self._replica_id, requests, num_tokens)
```
| Action | Target | File |
|---|---|---|
| Enqueue batch at stage | stage_scheduler.add_batch(batch) | scheduler/replica_stage_scheduler/replica_stage_schduler.py |
Creates: ReplicaStageScheduleEvent
```python
def handle_event(self, scheduler, metrics_store):
    scheduler.get_replica_stage_scheduler(
        self._replica_id, self._stage_id
    ).add_batch(self._batch)
    return [
        ReplicaStageScheduleEvent(
            self.time, self._replica_id, self._stage_id)
    ]
```
stage_scheduler.on_schedule() checks _is_busy first. If this stage is already executing a batch, it returns (None, None, None) and the event produces no further events — the new batch waits in the stage queue until the current one finishes.
| Action | Target | File |
|---|---|---|
| Dequeue batch and compute execution time (if not busy) | stage_scheduler.on_schedule() | scheduler/replica_stage_scheduler/replica_stage_schduler.py |
| Predict execution duration | execution_time_predictor.get_execution_time(batch, stage_id) | execution_time_predictor/base_execution_time_predictor.py |
| Create BatchStage entity | BatchStage(..., execution_time) | entities/batch_stage.py |
| Mark stage as scheduled | batch_stage.on_schedule(time) | entities/batch_stage.py |
Creates: BatchStageEndEvent scheduled at time + execution_time
```python
def handle_event(self, scheduler, metrics_store):
    stage_scheduler = scheduler._replica_schedulers[
        self._replica_id
    ]._replica_stage_schedulers[self._stage_id]
    self._batch, self._batch_stage, execution_time = \
        stage_scheduler.on_schedule()
    if not (self._batch and self._batch_stage):
        return []
    self._batch_stage.on_schedule(self.time)
    metrics_store.on_replica_stage_schedule(
        self.time, self._replica_id, self._stage_id,
        self._batch_stage, execution_time)
    self._is_last_stage = stage_scheduler.is_last_stage
    return [
        BatchStageEndEvent(
            self.time + self._batch_stage.execution_time,  # ← the ONLY time advance
            self._replica_id, self._stage_id,
            self._is_last_stage, self._batch, self._batch_stage),
    ]
```
```python
# scheduler/replica_stage_scheduler/replica_stage_schduler.py
def on_schedule(self):
    if self._is_busy or not self._batch_queue:
        return None, None, None
    self._is_busy = True
    batch = self._batch_queue.pop(0)
    execution_time = self._execution_time_predictor.get_execution_time(
        batch, self._stage_id)  # ← where execution_time comes from
    total_execution_time = execution_time.total_time
    model_execution_time = execution_time.model_time
    batch_stage = BatchStage(
        batch.id, self._replica_id, self._stage_id,
        total_execution_time, model_execution_time,
        batch.requests, batch.num_tokens)
    return batch, batch_stage, execution_time
```
| Action | Target | File |
|---|---|---|
| Release stage resources | stage_scheduler.on_stage_end() | scheduler/replica_stage_scheduler/replica_stage_schduler.py |
| Finalize stage timing | batch_stage.on_stage_end(time) | entities/batch_stage.py |
| Record stage metrics | metrics_store.on_batch_stage_end() | metrics/metrics_store.py |
Creates: a ReplicaStageScheduleEvent for this stage, plus BatchEndEvent if this was the last stage, otherwise BatchStageArrivalEvent with stage_id + 1 → batch moves to the next pipeline stage.

```python
def handle_event(self, scheduler, metrics_store):
    scheduler.get_replica_stage_scheduler(
        self._replica_id, self._stage_id
    ).on_stage_end()
    self._batch_stage.on_stage_end(self.time)
    metrics_store.on_batch_stage_end(
        self._batch_stage, self.time,
        self._replica_id, self._stage_id)
    next_events = [
        ReplicaStageScheduleEvent(
            self.time, self._replica_id, self._stage_id),
    ]
    if self._is_last_stage:
        return next_events + [
            BatchEndEvent(self.time, self._replica_id, self._batch)]
    return next_events + [
        BatchStageArrivalEvent(
            self.time, self._replica_id,
            self._stage_id + 1, self._batch)]
```
| Action | Target | File |
|---|---|---|
| Complete batch, update request tokens | batch.on_batch_end(time) | entities/batch.py |
| Free memory blocks, handle completed/preempted requests | replica_scheduler.on_batch_end(batch) | scheduler/replica_scheduler/base_replica_scheduler.py |
| Record batch completion metrics | metrics_store.on_batch_end() | metrics/metrics_store.py |
Creates: ReplicaScheduleEvent → loops back to Event 3, forming the next batch
```python
def handle_event(self, scheduler, metrics_store):
    self._batch.on_batch_end(self.time)
    replica_scheduler = scheduler.get_replica_scheduler(self._replica_id)
    replica_scheduler.on_batch_end(self._batch)
    memory_usage_percent = replica_scheduler.memory_usage_percent
    metrics_store.on_batch_end(
        self.time, self._batch,
        self._replica_id, memory_usage_percent)
    return [ReplicaScheduleEvent(self.time, self._replica_id)]
```
Batch formation is gated by _num_running_batches < _num_stages. When the replica is already executing its maximum number of concurrent batches, on_schedule() returns an empty list, and the request simply waits in the queue.
Here is a concrete timeline showing how the two paths differ:
```
t=0.1  RequestArrivalEvent(Req A)
       → GlobalScheduleEvent → assigns Req A to replica 0
       → ReplicaScheduleEvent → on_schedule()
           _num_running_batches(0) < _num_stages(1) ✓
           → _get_next_batch(): pulls Req A (only 1 in queue) → Batch(size=1)
           → _num_running_batches = 1
       → BatchStageArrivalEvent → ... → batch starts executing
```
_get_next_batch() has no waiting/accumulation mechanism — it pulls whatever is currently in the queue, even if only 1 request. This correctly models continuous batching (as used by vLLM): the system never waits for more requests, it processes what's available immediately. New requests arriving later join the next iteration via the BatchEnd → ReplicaSchedule path (Path C below).
```
t=0.2  RequestArrivalEvent(Req B)
       → GlobalScheduleEvent → assigns Req B to replica 0
       → ReplicaScheduleEvent → on_schedule()
           _num_running_batches(1) < _num_stages(1) ✗  ← GUARD FAILS
           → return []  ← empty list; the event chain ends here
       Req B stays in _request_queue, waiting...

t=0.3  RequestArrivalEvent(Req C) → same thing, Req C also waits
```
```
t=0.5  BatchEndEvent(Batch of Req A)
       → _num_running_batches -= 1 → now 0
       → ReplicaScheduleEvent → on_schedule()
           _num_running_batches(0) < _num_stages(1) ✓
           → _get_next_batch(): queue has [Req B, Req C]
               while _request_queue:
                   add Req B ✓, add Req C ✓  ← multiple requests form one batch
           → Batch(requests=[Req B, Req C])
       → BatchStageArrivalEvent → ... → batch starts executing
```
The event chain (Events 3→4→5→6→7→3) forms a loop. A single request goes through this loop multiple times — once is not enough. But the call graph above doesn't make this obvious. This section traces the full lifecycle of a single request with P prefill tokens and D decode tokens.
```python
def _get_request_next_num_tokens(self, request):
    if request.is_prefill_complete:
        return 1  # decode: process only 1 token per iteration
    return request.num_prefill_tokens  # prefill: process everything in one shot
```
Prefill processes all P tokens in one shot. Decode processes exactly 1 token per iteration.
```python
def on_batch_end(self, batch):
    self._num_running_batches -= 1
    for request in batch.requests:
        if request.completed:
            self.free(request.id)  # completed → free memory
        else:
            self._preempted_requests.append(request)
            # ↑ not completed → back into the preempted queue;
            #   the next on_schedule() picks it up again
```
After each batch, non-completed requests go into _preempted_requests. The next _get_next_batch() picks them up again — each time with num_tokens=1 (decode).
Inside on_batch_end(), there is a critical +1 that changes the iteration count:
```python
def on_batch_end(self, time, num_tokens_processed):
    self._num_processed_tokens += num_tokens_processed  # after prefill: P
    if self._num_processed_tokens == self._num_prefill_tokens:
        self._is_prefill_complete = True
        # we get one decode token when the prefill processing completes
        self._num_processed_tokens += 1  # ← P → P+1: the first decode token is free!
        if self._prefill_completed_at == 0:
            self._prefill_completed_at = time
    if self._num_processed_tokens == self.total_tokens:  # P + D?
        self._completed = True
```
Walk through the state of num_processed_tokens at each batch iteration. The request is complete when it reaches total_tokens = P + D = 640.
| Iter | Phase | num_tokens in Batch | processed before | += tokens | +1 bonus | processed after | Status |
|---|---|---|---|---|---|---|---|
| 1 | Prefill | 512 | 0 | +512 | +1 | 513 | prefill_complete=True, → preempted queue |
| 2 | Decode | 1 | 513 | +1 | — | 514 | → preempted queue |
| 3 | Decode | 1 | 514 | +1 | — | 515 | → preempted queue |
| ⋮ (each iteration: +1 token) | |||||||
| 128 | Decode | 1 | 639 | +1 | — | 640 = P+D | ✓ completed=True, freed |
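The token accounting in the table can be reproduced with a short sketch (illustrative, not Vidur's code):

```python
def count_iterations(P: int, D: int) -> int:
    """Count batch iterations for a request with P prefill and D decode tokens."""
    processed, iters, prefill_done = 0, 0, False
    while processed < P + D:
        tokens = P if not prefill_done else 1  # prefill in one shot, then 1 token/iter
        processed += tokens
        if not prefill_done and processed == P:
            prefill_done = True
            processed += 1  # the "+1 bonus": first decode token comes with prefill
        iters += 1
    return iters

count_iterations(512, 128)  # → 128, matching the table
```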
Mapping the 128 iterations back to the event chain. Each row is one pass through Events 3→4→5→6→7→3.
```
Iter 1 (Prefill):
  ReplicaScheduleEvent → _get_next_batch()
      request from _request_queue, num_tokens=512
  → Batch(requests=[Req], num_tokens=[512])
  → BatchStageArrival → ReplicaStageSchedule
      → execution_time_predictor(batch with 512 prefill tokens) → longer execution_time
  → BatchStageEnd → BatchEnd
      → Batch.on_batch_end() → Request.on_batch_end(num_tokens=512)
          processed: 0 → 512 → 513 (bonus +1)
      → VLLMScheduler.on_batch_end()
          request.completed? No (513 < 640) → _preempted_requests.append(request)
  → ReplicaScheduleEvent  ← loop back to Event 3
```
```
Iter 2 (Decode #1):
  ReplicaScheduleEvent → _get_next_batch()
      request from _preempted_requests, is_prefill_complete=True → num_tokens=1
  → Batch(requests=[Req, ...], num_tokens=[1, ...])
      ↑ the same batch can contain decode tokens from different requests
  → BatchStageArrival → ReplicaStageSchedule
      → execution_time_predictor(batch with N decode tokens) → shorter execution_time
  → BatchStageEnd → BatchEnd
      → Request.on_batch_end(num_tokens=1)
          processed: 513 → 514
      → request not completed → _preempted_requests.append(request)
  → ReplicaScheduleEvent  ← loop back to Event 3

⋮ (repeats D-2 more times)
```
```
Iter 128 (Decode #127, final):
  ReplicaScheduleEvent → _get_next_batch()
  → Batch(requests=[Req, ...], num_tokens=[1, ...])
  → ... → BatchEnd
      → Request.on_batch_end(num_tokens=1)
          processed: 639 → 640 = total_tokens → completed=True
      → VLLMScheduler.on_batch_end()
          request.completed? Yes → self.free(request.id)  ← memory freed
```
A critical detail: in decode iterations, _get_next_batch() pulls from both _preempted_requests (ongoing decode) and _request_queue (new arrivals). This means:
```
Batch iteration N:
  Batch(
      requests  = [Req_A,  Req_B,  Req_C,  Req_D ],
      num_tokens= [1,      1,      1,      256   ]
  )             ↑decode ↑decode ↑decode ↑prefill(new arrival)
```
One batch can contain a mix of decode tokens (1 each) from ongoing requests AND prefill tokens from a newly arrived request. The execution time predictor receives the full batch and computes a single execution time that accounts for both prefill and decode work.
The Sarathi scheduler overrides _get_request_next_num_tokens() to cap prefill tokens per iteration:
```python
def _get_request_next_num_tokens(self, request, batch_contains_prefill, num_batch_tokens):
    if request.is_prefill_complete:
        return 1  # decode: still 1 token
    next_num_tokens = min(
        request.num_prefill_tokens - request.num_processed_tokens,
        self._config.chunk_size - num_batch_tokens,  # ← capped at chunk_size
    )
    return max(0, next_num_tokens)
```
With chunk_size=512 and P=2048 prefill tokens:
| Iter | Phase | num_tokens | processed after |
|---|---|---|---|
| 1 | Prefill chunk 1 | 512 | 512 |
| 2 | Prefill chunk 2 | 512 | 1024 |
| 3 | Prefill chunk 3 | 512 | 1536 |
| 4 | Prefill chunk 4 | 512 | 2048 → 2049 (+1 bonus) |
| 5..D+3 | Decode | 1 | 2049 → ... → P+D |
Total: ⌈P/chunk_size⌉ + (D−1) iterations. Chunked prefill trades more iterations for lower per-iteration latency, allowing decode requests to interleave.
| Scheduler | Prefill Iterations | Decode Iterations | Total Batch Iterations |
|---|---|---|---|
| vLLM / Orca / FasterTransformer | 1 | D − 1 | D |
| Sarathi (chunk_size = C) | ⌈P / C⌉ | D − 1 | ⌈P/C⌉ + D − 1 |
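The iteration counts in the comparison table can be reproduced with a small helper (a sketch; total_iterations is an illustrative name, not a Vidur function):

```python
import math
from typing import Optional

def total_iterations(P: int, D: int, chunk_size: Optional[int] = None) -> int:
    # chunk_size=None models vLLM/Orca (whole prefill in 1 iteration);
    # an int models Sarathi's chunked prefill: ceil(P / C) chunks.
    prefill_iters = 1 if chunk_size is None else math.ceil(P / chunk_size)
    return prefill_iters + (D - 1)

total_iterations(2048, 128)                  # vLLM: 1 + 127 = 128
total_iterations(2048, 128, chunk_size=512)  # Sarathi: 4 + 127 = 131
```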
The original Vidur has no PD disagg: every request runs prefill and decode on the same replica, and the +1 bonus is unconditional. There is no KV cache transfer, no splitwise scheduler, and no separate P/D replica pools.
The SimAI fork (vidur-alibabacloud) achieves PD disaggregation by modifying three points in the existing event chain — not by adding new event types, but by changing the behavior of existing ones.
At initialization, replicas are split by pd_node_ratio:
```python
# Example: 4 replicas, pd_node_ratio=0.5
#   Replica 0, 1 → ReplicaType.PREFILL (P pool)
#   Replica 2, 3 → ReplicaType.DECODE  (D pool)
for replica_id, replica in self._replicas.items():
    if replica_id < self._num_prefill_nodes:
        self.prefill_replicas[replica_id] = replica
        replica.replica_type = ReplicaType.PREFILL
    else:
        self.decode_replicas[replica_id] = replica
        replica.replica_type = ReplicaType.DECODE
```
In schedule(), each new request is assigned to both a P replica and a D replica upfront:
```python
def schedule(self):
    prefill_request_mapping = []
    for request in self._request_queue:
        request.request_type = RequestType.PREFILL
        # Assign P replica (round-robin in P pool)
        p_replica_id = self.p_request_counter % len(self.prefill_replicas)
        self.p_request_counter += 1
        request.prefill_replica_id = p_replica_id
        # Assign D replica (round-robin in D pool)
        d_replica_id = (self.d_request_counter % len(self.decode_replicas)) \
            + len(self.prefill_replicas)
        self.d_request_counter += 1
        request.decode_replica_id = d_replica_id
        prefill_request_mapping.append((p_replica_id, request))
    return prefill_request_mapping  # ← only the P-replica mapping is returned
```
schedule() only returns prefill_request_mapping. The request is initially sent to the P replica only. The D replica gets its ReplicaScheduleEvent later — from BatchEndEvent, not from here.
This is where the entire PD disagg logic lives. The original BatchEndEvent was 42 lines. The SimAI version is 151 lines. After the normal batch-end handling, it adds:
```python
# batch_end_event.py — SimAI version (simplified)
def handle_event(self, scheduler, metrics_store):
    # ① Normal Vidur logic (unchanged)
    replica_scheduler = scheduler.get_replica_scheduler(self._replica_id)
    self._batch.on_batch_end(self.time)
    replica_scheduler.on_batch_end(self._batch)
    events = [ReplicaScheduleEvent(self.time, self._replica_id)]
    # ② NEW: PD disagg cross-replica transfer
    if scheduler.__class__.__name__ == 'SplitwiseGlobalScheduler':
        for request in self._batch.requests:
            if request.is_prefill_complete \
                    and request.request_type == RequestType.DECODE \
                    and replica_scheduler.replica.replica_type == ReplicaType.PREFILL:
                # ③ Calculate KV cache transfer time
                request.pd_p2p_comm_size = request.estimate_kv_cache_size(
                    request.num_processed_tokens, replica_scheduler.replica)
                request.pd_p2p_comm_bandwidth = replica_scheduler.replica.pd_p2p_comm_bandwidth
                request.pd_p2p_comm_time = request.pd_p2p_comm_size / request.pd_p2p_comm_bandwidth
                # ④ Set D replica arrival = prefill done + KV transfer time
                request.decode_arrived_at = request.prefill_completed_at + request.pd_p2p_comm_time
                # ⑤ Create ReplicaScheduleEvent for D replica with DELAYED time
                events.append(
                    ReplicaScheduleEvent(request.decode_arrived_at, request.decode_replica_id))
                # ↑ note: the time is not self.time but a point in the future!
    return events
```
```python
# New fields added to Request
request_type = RequestType.PREFILL   # PREFILL → DECODE
prefill_replica_id = None
decode_replica_id = None
prefill_arrived_at = arrived_at
decode_arrived_at = float('inf')     # ← set later
pd_p2p_comm_size = float('inf')
pd_p2p_comm_time = float('inf')
pd_p2p_comm_bandwidth = 0

# New fields added to Replica
replica_type = ReplicaType.MIXED     # ← set to PREFILL or DECODE by scheduler
pd_p2p_comm_bandwidth = ...
pd_p2p_comm_dtype = ...
pd_node_ratio = ...
pending_requests = []
```
The request_type starts as PREFILL and changes to DECODE in on_batch_end() when prefill completes (same place as the +1 bonus). This flag is how BatchEndEvent knows to trigger the cross-replica transfer.
```
═══ P Replica (Prefill Pool) ═══
RequestArrivalEvent(t=0.0)
→ GlobalScheduleEvent → SplitwiseGlobalScheduler.schedule()
    request.prefill_replica_id = 0 (P pool)
    request.decode_replica_id  = 2 (D pool)
    return [(replica_id=0, request)]  ← sent to the P replica only
→ ReplicaScheduleEvent(t=0.0, replica=0)  ← P replica
→ BatchStageArrival → ReplicaStageSchedule
    execution_time = predict(batch with 512 prefill tokens)
→ BatchStageEndEvent(t=0.0 + 0.8s)
→ BatchEndEvent(t=0.8)  ← the biggest change is here
    ① Batch.on_batch_end() → Request.on_batch_end(num_tokens=512)
        processed: 0 → 512 → 513 (bonus +1)
        request_type: PREFILL → DECODE
    ② Detect: is_prefill_complete + DECODE type + on PREFILL replica
    ③ estimate_kv_cache_size(513 tokens) = 1.6 GB
    ④ pd_p2p_comm_time = 1.6 GB / 50 GB/s = 0.032s
    ⑤ decode_arrived_at = 0.8 + 0.032 = 0.832s
    ⑥ events = [
          ReplicaScheduleEvent(0.8,   replica=0),  ← P replica schedules its next batch
          ReplicaScheduleEvent(0.832, replica=2)   ← D replica starts after the KV transfer completes
       ]
```
```
═══ D Replica (Decode Pool) ═══
ReplicaScheduleEvent(t=0.832, replica=2)  ← from the BatchEndEvent above
→ on_schedule() → _get_next_batch()
    request from queue, is_prefill_complete=True → num_tokens=1
→ BatchStageArrival → ReplicaStageSchedule → BatchStageEnd → BatchEnd
    processed: 513 → 514
→ ReplicaScheduleEvent(t=..., replica=2)  ← D replica continues the decode loop
⋮ (D-1 iterations on the D replica)
```
| Aspect | Vidur (original) | SimAI (vidur-alibabacloud) |
|---|---|---|
| Replica pools | All replicas identical | P pool + D pool (split by pd_node_ratio) |
| schedule() | Assign to 1 replica | Assign to P replica + record D replica; only return P mapping |
| Prefill execution | Same replica as decode | P replica only |
| P→D handoff | — | BatchEndEvent detects prefill_complete on P replica, calculates KV transfer time, creates delayed ReplicaScheduleEvent for D replica |
| KV cache transfer | — | 2 × tokens × hidden_dim × layers × dtype_bytes / bandwidth |
| Decode execution | Same replica as prefill | D replica only (all D−1 iterations) |
| +1 bonus | Kept (correct for co-located) | Still kept — first decode token is not re-generated on D replica |
| New event types | — | None — reuses existing ReplicaScheduleEvent with delayed time |
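The KV-cache transfer formula in the table can be sketched as follows (parameter names are illustrative, not the fork's exact API):

```python
def kv_transfer_time_s(num_tokens: int, hidden_dim: int, num_layers: int,
                       dtype_bytes: int, bandwidth_bytes_per_s: float) -> float:
    # Factor of 2: one K and one V vector per token per layer
    kv_bytes = 2 * num_tokens * hidden_dim * num_layers * dtype_bytes
    return kv_bytes / bandwidth_bytes_per_s

# e.g. 513 tokens, fp16, on an assumed 4096-dim / 32-layer model over 50 GB/s:
kv_transfer_time_s(513, 4096, 32, 2, 50e9)
```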
The core idea: extend BatchEndEvent.handle_event() to detect the prefill→decode transition on a P replica, and inject a time-delayed ReplicaScheduleEvent targeting the D replica. The delay equals the KV cache transfer time. This is the second point in the entire simulator where time advances (the first being the execution_time in ReplicaStageScheduleEvent).
When a request sits idle in _preempted_requests between one batch ending and the next batch starting, this gap is captured at two granularities:
Accumulated in on_batch_stage_schedule() — measures the gap between the last stage completion and the next stage start:
```python
def on_batch_stage_schedule(self, time):
    self._latest_stage_scheduled_at = time
    if self._latest_stage_completed_at == 0:
        self._preempted_time = 0
    else:
        self._preempted_time += \
            time - self._latest_stage_completed_at
        # ↑ this is the waiting time
```
Reported as metric: request_preemption_time
Computed in on_batch_schedule() — measures per-iteration wait from last batch completion to next batch start:
```python
def on_batch_schedule(self, time):
    self._latest_iteration_scheduling_delay = \
        time - self._latest_iteration_completed_at
    # ↑ per-iteration waiting time
    if not self._scheduled:  # first time scheduled
        self._scheduling_delay = \
            time - self._arrived_at
        # ↑ wait from arrival to first being scheduled
```
Reported as metric: request_scheduling_delay
```
Request arrives at t=0.0
│
├─ on_batch_schedule(t=0.5)
│      scheduling_delay = 0.5 - 0.0 = 0.5s  ← arrival to first schedule
│
├─ on_batch_stage_schedule(t=0.5)
│      _latest_stage_completed_at == 0 → preempted_time = 0
│
├─ on_batch_stage_end(t=1.0)  ← prefill done
│      _latest_stage_completed_at = 1.0
│
│   ╌╌╌ request idles in _preempted_requests ╌╌╌ 0.8s gap
│
├─ on_batch_stage_schedule(t=1.8)  ← next batch starts
│      preempted_time += 1.8 - 1.0 = +0.8s  ← first gap recorded
│
├─ on_batch_stage_end(t=1.85)  ← decode iter done
│      _latest_stage_completed_at = 1.85
│
│   ╌╌╌ request idles in _preempted_requests ╌╌╌ 0.3s gap
│
├─ on_batch_stage_schedule(t=2.15)
│      preempted_time += 2.15 - 1.85 = +0.3s  ← second gap recorded
│
│   ⋮ (repeats for each decode iteration)
│
└─ Request completes. Final metrics:
       scheduling_delay = 0.5s                       ← arrival → first schedule
       preempted_time   = 0.8 + 0.3 + ... = Σ(all gaps)
       execution_time   = Σ(all stage execution times)
       e2e_time         = completed_at - arrived_at  ← includes everything
```
| Metric | Field | Formula | Meaning |
|---|---|---|---|
| request_scheduling_delay | _scheduling_delay | first scheduled_at − arrived_at | Initial queue wait before ever being batched |
| request_preemption_time | _preempted_time | Σ(stage_schedule_i − stage_end_{i−1}) | Cumulative idle time across ALL inter-batch gaps |
| request_execution_time | _execution_time | Σ(stage execution_time_i) | Cumulative GPU execution time |
| request_execution_plus_preemption_time | (computed) | execution_time + preempted_time | Total time from first batch to last batch (minus scheduling delay) |
| request_e2e_time | e2e_time | completed_at − arrived_at | Everything: scheduling_delay + execution + preemption |
Note that e2e_time ≈ scheduling_delay + execution_time + preempted_time, but only approximately: preempted_time is measured at the stage level while scheduling_delay is at the iteration level, and there's a small gap in the first iteration before the first stage schedule.
| Transition | New Event Time | Time Source |
|---|---|---|
| RequestArrival → GlobalSchedule | self.time | Same instant — scheduling is instantaneous |
| GlobalSchedule → ReplicaSchedule | self.time | Same instant — dispatching is instantaneous |
| ReplicaSchedule → BatchStageArrival | self.time | Same instant — batching is instantaneous |
| BatchStageArrival → ReplicaStageSchedule | self.time | Same instant — stage enqueue is instantaneous |
| ReplicaStageSchedule → BatchStageEnd | self.time + execution_time | ⏱ THE ONLY TIME ADVANCE — execution_time from ExecutionTimePredictor |
| BatchStageEnd → ReplicaStageSchedule | self.time | Same instant — release stage for next batch |
| BatchStageEnd → BatchStageArrival (next stage) | self.time | Same instant — inter-stage transfer is instantaneous |
| BatchStageEnd → BatchEnd (last stage) | self.time | Same instant — finalization is instantaneous |
| BatchEnd → ReplicaSchedule | self.time | Same instant — re-scheduling is instantaneous |
```python
# replica_stage_schedule_event.py, handle_event()
BatchStageEndEvent(
    self.time + self._batch_stage.execution_time,  # ← the ONLY time advance
    ...
)
```
Where execution_time comes from:
execution_time_predictor.get_execution_time(batch, stage_id) returns an ExecutionTime object with the total duration, which is stored in BatchStage.execution_time. Every simulated duration therefore depends entirely on the ExecutionTimePredictor's fidelity.
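A minimal sketch of what such an execution-time breakdown object might look like; the real ExecutionTime in entities/execution_time.py tracks many more components (attention, MLP, communication), so this is illustrative only:

```python
from dataclasses import dataclass

@dataclass
class ExecutionTime:
    model_time: float  # compute portion
    comm_time: float   # communication portion

    @property
    def total_time(self) -> float:
        # total duration used to schedule the BatchStageEndEvent
        return self.model_time + self.comm_time

et = ExecutionTime(model_time=0.7, comm_time=0.1)
```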
Events don't just create other events — they also mutate entity state through callback methods. Here's a map of which entity methods are called by which events.
| Callback Method | Called By | State Change |
|---|---|---|
| on_batch_schedule(time) | Batch.on_schedule() | Sets scheduled_at, scheduling_delay |
| on_batch_stage_schedule(time) | BatchStage.on_schedule() | Sets latest_stage_scheduled_at, calculates preempted_time |
| on_batch_stage_end(time, ...) | BatchStage.on_stage_end() | Accumulates execution_time, model_execution_time |
| on_batch_end(time, num_tokens) | Batch.on_batch_end() | Increments num_processed_tokens; if all done → completed=True |
Batch delegates to its requests:

- on_schedule(time) → calls request.on_batch_schedule() for each request
- on_batch_end(time) → calls request.on_batch_end() for each request with its num_tokens

BatchStage delegates to its requests:

- on_schedule(time) → calls request.on_batch_stage_schedule() for each request
- on_stage_end(time) → calls request.on_batch_stage_end() for each request

_write_output() is called automatically when the process exits (even on exceptions). It triggers the metrics pipeline:
| Action | Target | Output |
|---|---|---|
| Plot all metrics | metric_store.plot() | CDF/histogram PNGs |
| Write JSON trace | _write_event_trace() | event_trace.json |
| Write Chrome trace | _write_chrome_trace() | chrome_trace.json |
Every file involved in a standard simulation run, grouped by role.
| File | Role | Invocation Timing |
|---|---|---|
| vidur/main.py | Entry point | Once at startup |
| vidur/simulator.py | Orchestrator — init, event loop, output | Entire lifecycle |
| File | Role | Invocation Timing |
|---|---|---|
| vidur/config/config.py | SimulationConfig dataclass tree | Phase 1 — CLI parsing |
| vidur/config/model_config.py | LLM model specs | Phase 1 |
| vidur/config/device_sku_config.py | GPU/device specs | Phase 1 |
| vidur/config/node_sku_config.py | Node specifications | Phase 1 |
| vidur/config/base_poly_config.py | Polymorphic config base | Phase 1 |
| vidur/config/flat_dataclass.py | CLI argument flattening | Phase 1 |
| File | Role | Invocation Timing |
|---|---|---|
| vidur/entities/cluster.py | Creates and holds replicas | Phase 2 — init |
| vidur/entities/replica.py | Model + device config, constraints | Phase 2 — init |
| vidur/entities/request.py | Per-request state machine | Phase 3 (created) → Phase 5 (mutated) |
| vidur/entities/batch.py | Groups requests, delegates callbacks | Phase 5 — created by ReplicaScheduler |
| vidur/entities/batch_stage.py | Per-stage execution record | Phase 5 — created by ReplicaStageScheduleEvent |
| vidur/entities/execution_time.py | Execution time breakdown | Phase 5 — returned by predictor |
| File | Event | Invocation Timing |
|---|---|---|
| vidur/events/base_event.py | Abstract base | Every event inherits |
| vidur/events/request_arrival_event.py | RequestArrival | Once per request |
| vidur/events/global_schedule_event.py | GlobalSchedule | After each request arrival |
| vidur/events/replica_schedule_event.py | ReplicaSchedule | After global schedule + after each batch end |
| vidur/events/batch_stage_arrival_event.py | BatchStageArrival | Per batch per pipeline stage |
| vidur/events/replica_stage_schedule_event.py | ReplicaStageSchedule | Per stage execution |
| vidur/events/batch_stage_end_event.py | BatchStageEnd | After stage execution_time elapses |
| vidur/events/batch_end_event.py | BatchEnd | After last stage completes |
| File | Role | Invocation Timing |
|---|---|---|
| scheduler/global_scheduler/base_global_scheduler.py | Base class + init cascade | Phase 2 (init) + Phase 5 (every event) |
| scheduler/global_scheduler/{random,round_robin,lor}.py | Request → replica assignment | Phase 5 — GlobalScheduleEvent |
| scheduler/replica_scheduler/base_replica_scheduler.py | Batching logic base | Phase 5 — ReplicaScheduleEvent, BatchEndEvent |
| scheduler/replica_scheduler/{vllm,sarathi,orca,...}.py | Concrete batching strategies | Phase 5 — on_schedule() |
| scheduler/replica_stage_scheduler/replica_stage_schduler.py | Per-stage batch queue | Phase 5 — stage arrival/schedule/end events |
| File | Role | Invocation Timing |
|---|---|---|
| request_generator/base_request_generator.py | Abstract generator | Phase 3 |
| request_generator/synthetic_request_generator.py | Generate from distributions | Phase 3 — generate() |
| request_generator/trace_replay_request_generator.py | Replay from CSV | Phase 3 — generate() |
| File | Role | Invocation Timing |
|---|---|---|
| execution_time_predictor/base_execution_time_predictor.py | Computes attention/MLP/comm time | Phase 5 — ReplicaStageScheduleEvent |
| execution_time_predictor/{linear_regression,random_forrest}.py | Concrete predictors | Phase 5 — get_execution_time() |
| File | Role | Invocation Timing |
|---|---|---|
| vidur/metrics/metrics_store.py | Collect metrics from all events | Phase 2 (init) + Phase 5 (every event) + Phase 6 (plot) |
| vidur/utils/random.py | Seed management | Phase 1 |
| vidur/logger.py | Logging setup | All phases |