Execution Call Chain

Complete file-level invocation trace: from python3 -m vidur.main through initialization, event loop, and output — every file, every class, every call in order.

vidur/main.py → Simulator → Event Loop → Output

Table of Contents

  1. High-Level Overview
  2. Phase 1 — Entry Point & Config
  3. Phase 2 — Simulator Initialization
  4. Phase 3 — Event Queue Bootstrap
  5. Phase 4 — Main Event Loop
  6. Phase 5 — Event Chain Detail
  7. Prefill / Decode Iteration Lifecycle
  8. PD Disagg Gap & Waiting Time
  9. Event Timestamp Propagation
  10. Entity State Transitions
  11. Phase 6 — Output & Metrics
  12. Complete File Map

1. High-Level Overview

Vidur is a discrete event simulator. Execution has three macro phases: initialization (build all objects, generate requests, fill event queue), event loop (pop events in time order, each event handler returns new events), and output (write metrics and traces). The diagram below shows the full call chain at file level.
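The pop-handle-push pattern described above can be sketched in a few lines (a generic discrete-event skeleton, not Vidur's actual classes):

```python
# Minimal discrete-event loop: pop the earliest event, let its handler
# return follow-up events, push them back. Names here are illustrative.
import heapq
import itertools

counter = itertools.count()  # tie-breaker for deterministic ordering
queue = []

def push(time, name):
    heapq.heappush(queue, (time, next(counter), name))

def handle(time, name):
    # each handler returns zero or more new events
    if name == "arrival":
        return [(time + 1.0, "end")]  # schedule completion 1s later
    return []

push(0.0, "arrival")
log = []
while queue:
    time, _, name = heapq.heappop(queue)
    log.append((time, name))
    for t, n in handle(time, name):
        push(t, n)
print(log)  # [(0.0, 'arrival'), (1.0, 'end')]
```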

Execution Flow Overview
INIT:   main.py main() → SimulationConfig.create_from_cli_args() (config/config.py) → Simulator.__init__() (simulator.py) → Cluster → Replica (entities/cluster.py), MetricsStore (metrics/metrics_store.py), RequestGenerator (request_generator/), GlobalScheduler (scheduler/global_scheduler/) → ReplicaScheduler (scheduler/replica_scheduler/) → ExecutionTimePredictor (execution_time_predictor/), ReplicaStageScheduler (replica_stage_scheduler/) → _init_event_queue() → RequestArrivalEvent × N
LOOP:   Simulator.run() — while event_queue: heappop → event.handle_event() → heappush new events; RequestArrival → GlobalSchedule → ReplicaSchedule → BatchStageArrival → ReplicaStageSchedule → BatchStageEnd → (not last stage: next stage; last stage: BatchEnd → loop back)
OUTPUT: _write_output() → MetricsStore.plot() → event_trace.json, chrome_trace.json, CDF plots (PNG)

2. Phase 1 — Entry Point & Config

1 — vidur/main.py — main()

The entry point. Invoked via python3 -m vidur.main. Does exactly three things:

def main() -> None:
    config: SimulationConfig = SimulationConfig.create_from_cli_args()  # ① parse CLI
    set_seeds(config.seed)                                              # ② deterministic RNG
    simulator = Simulator(config)                                        # ③ build & run
    simulator.run()
2 — vidur/config/config.py — SimulationConfig.create_from_cli_args()

Parses all CLI arguments into a nested dataclass hierarchy. Key sub-configs:

| Sub-Config | Controls | File |
| --- | --- | --- |
| cluster_config | num_replicas, replica_config (model, device, parallelism), global_scheduler_config | config/config.py |
| request_generator_config | Synthetic vs trace-replay, interval/length generators | config/config.py |
| metrics_config | output_dir, write_metrics, write_json_trace, enable_chrome_trace | config/config.py |
3 — vidur/utils/random.py — set_seeds(seed)

Sets random seeds for Python's random and numpy to ensure reproducible simulations.
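A minimal sketch of what set_seeds presumably does (names and the exact numpy call are assumptions; the numpy line is commented so the sketch stays stdlib-only):

```python
import random

def set_seeds(seed: int) -> None:
    """Seed Python's RNG; the real helper also seeds numpy the same way."""
    random.seed(seed)
    # numpy.random.seed(seed)  # assumed numpy counterpart

set_seeds(42)
first = random.random()
set_seeds(42)
print(first == random.random())  # True — identical stream after reseeding
```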

3. Phase 2 — Simulator Initialization

Simulator.__init__() builds the entire simulation infrastructure. Each sub-component triggers a cascade of further object creation.

A — vidur/simulator.py, vidur/entities/cluster.py, vidur/entities/replica.py — Cluster & Replica Creation

Cluster.__init__() creates num_replicas Replica objects. Each Replica validates model/device constraints (e.g., layers divisible by pipeline stages, embedding_dim divisible by tensor_parallel_size).

# entities/cluster.py
self._replicas = {}
for _ in range(num_replicas):
    replica = Replica(replica_config, generator_config)
    self._replicas[replica.id] = replica
B — vidur/simulator.py, vidur/metrics/metrics_store.py — MetricsStore Creation

Initializes DataSeries and CDFSketch objects for tracking request latency, batch sizes, utilization, and token-level metrics.

C — vidur/simulator.py, vidur/request_generator/request_generator_registry.py — RequestGenerator Creation (Registry)

The registry pattern selects the concrete generator class based on config type:

SyntheticRequestGenerator — request_generator/synthetic_request_generator.py

Uses interval generators (Poisson, Gamma, Static, Trace) and length generators (Uniform, Zipf, Trace, Fixed) to produce synthetic requests.

TraceReplayRequestGenerator — request_generator/trace_replay_request_generator.py

Reads a CSV trace file with arrival times and token counts, replaying exact request patterns.

D — vidur/simulator.py, vidur/scheduler/global_scheduler/global_scheduler_registry.py — GlobalScheduler Creation (triggers deep initialization cascade)

This is the deepest initialization chain. The global scheduler's __init__() creates the entire scheduling hierarchy:

Initialization Cascade inside GlobalScheduler.__init__():
  1. Creates ExecutionTimePredictor via registry → LinearRegressionExecutionTimePredictor or RandomForrestExecutionTimePredictor
  2. For each replica, creates a ReplicaScheduler via registry (one of: vLLM, Sarathi, Orca, LightLLM, FasterTransformer)
  3. Each ReplicaScheduler creates a MemoryPlanner to calculate max batch size
  4. Each ReplicaScheduler creates ReplicaStageScheduler × num_pipeline_stages

Global scheduler concrete implementations (all route requests to replicas):

| Scheduler | Strategy |
| --- | --- |
| RandomGlobalScheduler | Random replica selection |
| RoundRobinGlobalScheduler | Round-robin rotation |
| LORGlobalScheduler | Least Outstanding Requests |

4. Phase 3 — Event Queue Bootstrap

vidur/simulator.py, vidur/request_generator/*.py, vidur/events/request_arrival_event.py

_init_event_queue()

After all objects are built, the simulator populates the event queue with the initial set of events:

# simulator.py
def _init_event_queue(self):
    requests = self._request_generator.generate()  # → List[Request]
    for request in requests:
        self._add_event(RequestArrivalEvent(request.arrived_at, request))

Each request becomes a RequestArrivalEvent timestamped at its arrival time, pushed into a min-heap priority queue. Also registers atexit(_write_output) for cleanup.

5. Phase 4 — Main Event Loop

vidur/simulator.py

Simulator.run()

# simulator.py
def run(self):
    while self._event_queue and not self._terminate:
        _, event = heapq.heappop(self._event_queue)   # ① pop earliest event
        self._set_time(event._time)                       # ② advance sim clock
        new_events = event.handle_event(                  # ③ dispatch to handler
            self._scheduler, self._metric_store)
        self._add_events(new_events)                       # ④ push new events

This is the core loop. Every handle_event() returns zero or more new events that get pushed back into the heap. The simulation advances by processing events in chronological order until the queue is empty or time_limit is reached.

Event Ordering: Events are ordered by (time, event_id, event_type) tuple. This guarantees deterministic ordering when multiple events share the same timestamp.
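The tie-break can be seen directly with heapq tuples (a sketch; event_id here is just an increasing counter):

```python
# Two events at t=5.0: the smaller event_id pops first, so runs are
# deterministic even when timestamps collide.
import heapq

heap = []
heapq.heappush(heap, (5.0, 2, "B"))
heapq.heappush(heap, (5.0, 1, "A"))
heapq.heappush(heap, (3.0, 3, "C"))
order = [heapq.heappop(heap)[2] for _ in range(3)]
print(order)  # ['C', 'A', 'B'] — time first, then event_id breaks the tie
```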

6. Phase 5 — Event Chain Detail

Each event type lives in its own file under vidur/events/. Below is the complete chain showing what each event does, what it calls, and what new events it creates.

1 — RequestArrivalEvent — vidur/events/request_arrival_event.py

Request enters the system

| Action | Target | File |
| --- | --- | --- |
| Add request to global queue | scheduler.add_request(request) | scheduler/global_scheduler/base_global_scheduler.py |
| Record arrival metrics | metrics_store.on_request_arrival() | metrics/metrics_store.py |

Creates: GlobalScheduleEvent

Note: This only enqueues the request — it does NOT immediately form a batch. The request joins scheduler._request_queue and waits. Multiple requests can accumulate in this queue before any batch is formed.
handle_event() source:
def handle_event(self, scheduler, metrics_store):
    scheduler.add_request(self._request)
    metrics_store.on_request_arrival(self.time, self._request)
    return [GlobalScheduleEvent(self.time)]
2 — GlobalScheduleEvent — vidur/events/global_schedule_event.py

Assign requests to replicas

| Action | Target | File |
| --- | --- | --- |
| Decide which replica handles which request | scheduler.schedule() | scheduler/global_scheduler/{random,round_robin,lor}.py |
| Push request into replica's queue | replica_scheduler.add_request(request) | scheduler/replica_scheduler/base_replica_scheduler.py |

Creates: ReplicaScheduleEvent × one per affected replica

handle_event() source:
def handle_event(self, scheduler, metrics_store):
    self._replica_set = set()
    self._request_mapping = scheduler.schedule()

    for replica_id, request in self._request_mapping:
        self._replica_set.add(replica_id)
        scheduler.get_replica_scheduler(replica_id).add_request(request)

    return [
        ReplicaScheduleEvent(self.time, replica_id)
        for replica_id in self._replica_set
    ]
3 — ReplicaScheduleEvent — vidur/events/replica_schedule_event.py

Form batches on a replica

| Action | Target | File |
| --- | --- | --- |
| Decide which requests to batch together | replica_scheduler.on_schedule() | scheduler/replica_scheduler/{vllm,sarathi,orca,...}.py |
| Mark batch as scheduled | batch.on_schedule(time) | entities/batch.py |
| Record memory usage | metrics_store.on_replica_schedule() | metrics/metrics_store.py |

Creates: BatchStageArrivalEvent × one per batch (stage_id = 0), or empty list if no batch can be formed

This is where batching actually happens: replica_scheduler.on_schedule() pulls multiple requests from the queue to form ONE batch. It does NOT create one batch per request. The internal _get_next_batch() loops through the request queue, adding requests until it hits memory, token, or batch-size limits. If the replica is already running the maximum number of concurrent batches (_num_running_batches ≥ _num_stages), it returns an empty list — the requests simply wait.
handle_event() source:
def handle_event(self, scheduler, metrics_store):
    replica_scheduler = scheduler.get_replica_scheduler(self._replica_id)
    self._batches = replica_scheduler.on_schedule()

    if not self._batches:
        return []    # ← if the replica is busy or the queue is empty, do nothing

    memory_usage_percent = replica_scheduler.memory_usage_percent
    metrics_store.on_replica_schedule(
        self.time, self._replica_id, memory_usage_percent)

    for batch in self._batches:
        batch.on_schedule(self.time)

    return [
        BatchStageArrivalEvent(self.time, self._replica_id, 0, batch)
        for batch in self._batches
    ]
on_schedule() → _get_next_batch() batching logic:
# base_replica_scheduler.py — on_schedule()
def on_schedule(self) -> List[Batch]:
    scheduled_batches = []
    while self._num_running_batches < self._num_stages:
        batch = self._get_next_batch()     # ← pull several requests from the queue into one batch
        if not batch:
            break
        scheduled_batches.append(batch)
        self._num_running_batches += 1
    return scheduled_batches

# vllm_replica_scheduler.py — _get_next_batch() (vLLM, simplified)
def _get_next_batch(self) -> Optional[Batch]:
    requests, num_tokens = [], []
    while self._request_queue:              # ← walk the whole queue
        if not self._can_allocate(...):     # out of memory → stop
            break
        if total_tokens > max_tokens:       # token limit → stop
            break
        if len(requests) >= batch_size_cap: # batch-size cap → stop
            break
        request = self._request_queue.pop(0)
        requests.append(request)            # ← several requests join the same batch
        num_tokens.append(next_tokens)
    if not requests:
        return None
    return Batch(self._replica_id, requests, num_tokens)
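The batching loop above can be exercised with a runnable simplification (the token budget and batch cap are illustrative values, not Vidur's actual config; requests are reduced to their next-token counts):

```python
from typing import List

def form_batch(queue: List[int], token_budget: int = 2048, batch_cap: int = 8) -> List[int]:
    """Pull requests off the queue head until a token or batch-size limit is hit."""
    num_tokens, total = [], 0
    while queue:
        nxt = queue[0]
        if total + nxt > token_budget:    # token limit → stop
            break
        if len(num_tokens) >= batch_cap:  # batch-size cap → stop
            break
        queue.pop(0)
        num_tokens.append(nxt)
        total += nxt
    return num_tokens

q = [512, 1, 1, 1, 2048]
print(form_batch(q))  # [512, 1, 1, 1] — the 2048-token request exceeds the budget
```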
4 — BatchStageArrivalEvent — vidur/events/batch_stage_arrival_event.py

Batch arrives at a pipeline stage

| Action | Target | File |
| --- | --- | --- |
| Enqueue batch at stage | stage_scheduler.add_batch(batch) | scheduler/replica_stage_scheduler/replica_stage_scheduler.py |

Creates: ReplicaStageScheduleEvent

handle_event() source:
def handle_event(self, scheduler, metrics_store):
    scheduler.get_replica_stage_scheduler(
        self._replica_id, self._stage_id
    ).add_batch(self._batch)

    return [
        ReplicaStageScheduleEvent(
            self.time, self._replica_id, self._stage_id)
    ]
5 — ReplicaStageScheduleEvent — vidur/events/replica_stage_schedule_event.py

Execute batch on a pipeline stage

Guard: stage_scheduler.on_schedule() checks _is_busy first. If this stage is already executing a batch, it returns (None, None, None) and the event produces no further events — the new batch waits in the stage queue until the current one finishes.
| Action | Target | File |
| --- | --- | --- |
| Dequeue batch and compute execution time (if not busy) | stage_scheduler.on_schedule() | scheduler/replica_stage_scheduler/replica_stage_scheduler.py |
| Predict execution duration | execution_time_predictor.get_execution_time(batch, stage_id) | execution_time_predictor/base_execution_time_predictor.py |
| Create BatchStage entity | BatchStage(..., execution_time) | entities/batch_stage.py |
| Mark stage as scheduled | batch_stage.on_schedule(time) | entities/batch_stage.py |

Creates: BatchStageEndEvent scheduled at time + execution_time

handle_event() source:
def handle_event(self, scheduler, metrics_store):
    stage_scheduler = scheduler._replica_schedulers[
        self._replica_id
    ]._replica_stage_schedulers[self._stage_id]

    self._batch, self._batch_stage, execution_time = \
        stage_scheduler.on_schedule()

    if not (self._batch and self._batch_stage):
        return []

    self._batch_stage.on_schedule(self.time)
    metrics_store.on_replica_stage_schedule(
        self.time, self._replica_id, self._stage_id,
        self._batch_stage, execution_time)

    self._is_last_stage = stage_scheduler.is_last_stage

    return [
        BatchStageEndEvent(
            self.time + self._batch_stage.execution_time,  # ← the only point where time advances
            self._replica_id, self._stage_id,
            self._is_last_stage, self._batch, self._batch_stage),
    ]
stage_scheduler.on_schedule() source:
# scheduler/replica_stage_scheduler/replica_stage_scheduler.py
def on_schedule(self):
    if self._is_busy or not self._batch_queue:
        return None, None, None

    self._is_busy = True
    batch = self._batch_queue.pop(0)
    execution_time = self._execution_time_predictor.get_execution_time(
        batch, self._stage_id)             # ← where execution_time comes from
    total_execution_time = execution_time.total_time
    model_execution_time = execution_time.model_time
    batch_stage = BatchStage(
        batch.id, self._replica_id, self._stage_id,
        total_execution_time, model_execution_time,
        batch.requests, batch.num_tokens)

    return batch, batch_stage, execution_time
6 — BatchStageEndEvent — vidur/events/batch_stage_end_event.py

Pipeline stage completes — branch point

| Action | Target | File |
| --- | --- | --- |
| Release stage resources | stage_scheduler.on_stage_end() | scheduler/replica_stage_scheduler/replica_stage_scheduler.py |
| Finalize stage timing | batch_stage.on_stage_end(time) | entities/batch_stage.py |
| Record stage metrics | metrics_store.on_batch_stage_end() | metrics/metrics_store.py |
Branch Logic:
  • Always: creates ReplicaStageScheduleEvent (allow next batch to use this stage)
  • If NOT last stage: creates BatchStageArrivalEvent for stage_id + 1 → batch moves to next pipeline stage
  • If last stage: creates BatchEndEvent → batch is fully processed
handle_event() source:
def handle_event(self, scheduler, metrics_store):
    scheduler.get_replica_stage_scheduler(
        self._replica_id, self._stage_id
    ).on_stage_end()

    self._batch_stage.on_stage_end(self.time)
    metrics_store.on_batch_stage_end(
        self._batch_stage, self.time,
        self._replica_id, self._stage_id)

    next_events = [
        ReplicaStageScheduleEvent(
            self.time, self._replica_id, self._stage_id),
    ]

    if self._is_last_stage:
        return next_events + [
            BatchEndEvent(self.time, self._replica_id, self._batch)]

    return next_events + [
        BatchStageArrivalEvent(
            self.time, self._replica_id,
            self._stage_id + 1, self._batch)]
7 — BatchEndEvent — vidur/events/batch_end_event.py

Batch completes all stages — loop back

| Action | Target | File |
| --- | --- | --- |
| Complete batch, update request tokens | batch.on_batch_end(time) | entities/batch.py |
| Free memory blocks, handle completed/preempted requests | replica_scheduler.on_batch_end(batch) | scheduler/replica_scheduler/base_replica_scheduler.py |
| Record batch completion metrics | metrics_store.on_batch_end() | metrics/metrics_store.py |

Creates: ReplicaScheduleEvent → loops back to Event 3, forming the next batch

handle_event() source:
def handle_event(self, scheduler, metrics_store):
    self._batch.on_batch_end(self.time)
    replica_scheduler = scheduler.get_replica_scheduler(self._replica_id)
    replica_scheduler.on_batch_end(self._batch)

    memory_usage_percent = replica_scheduler.memory_usage_percent
    metrics_store.on_batch_end(
        self.time, self._batch,
        self._replica_id, memory_usage_percent)

    return [ReplicaScheduleEvent(self.time, self._replica_id)]
The Complete Event Cycle:
RequestArrival → GlobalSchedule → ReplicaSchedule → BatchStageArrival → ReplicaStageSchedule → BatchStageEnd → (next stage OR BatchEnd) → ReplicaSchedule → ... This cycle repeats until all requests have generated all their decode tokens.

When Does Multi-Request Batching Actually Happen?

Key Question: Every request arrival triggers the full chain (Events 1→2→3). Does that mean each arrival pulls multiple requests into a batch?

Answer: No. Most of the time, the chain is short-circuited by the guard _num_running_batches < _num_stages. When the replica is already executing a batch, on_schedule() returns an empty list, and the request simply waits in the queue.

Here is a concrete timeline showing how the two paths differ:

Path A — Request Arrival when replica is idle (first request or queue was empty)

t=0.1  RequestArrivalEvent(Req A) → GlobalScheduleEvent → assigns Req A to replica 0
  → ReplicaScheduleEvent → on_schedule()
       _num_running_batches(0) < _num_stages(1) ✓
       → _get_next_batch(): pulls Req A (only 1 in queue) → Batch(size=1)
       → _num_running_batches = 1 → BatchStageArrivalEvent → ... → batch starts executing
Yes, batch size = 1 is normal here. _get_next_batch() has no waiting/accumulation mechanism — it pulls whatever is currently in the queue, even if only 1 request. This correctly models continuous batching (as used by vLLM): the system never waits for more requests, it processes what's available immediately. New requests arriving later join the next iteration via the BatchEnd → ReplicaSchedule path (Path C below).

Path B — Request Arrival when replica is busy (common case)

t=0.2  RequestArrivalEvent(Req B) → GlobalScheduleEvent → assigns Req B to replica 0
  → ReplicaScheduleEvent → on_schedule()
       _num_running_batches(1) < _num_stages(1) ✗  ← GUARD FAILS → return []  ← empty list; the event chain ends here
       Req B stays in _request_queue, waiting...

t=0.3  RequestArrivalEvent(Req C)  → same thing, Req C also waits

Path C — BatchEnd triggers real multi-request batching

t=0.5  BatchEndEvent(Batch of Req A)
  → _num_running_batches -= 1 → now 0 → ReplicaScheduleEvent → on_schedule()
       _num_running_batches(0) < _num_stages(1) ✓
       → _get_next_batch(): queue has [Req B, Req C]
           while _request_queue:
               add Req B ✓, add Req C ✓  ← multiple requests form a single batch
       → Batch(requests=[Req B, Req C])
  → BatchStageArrivalEvent → ... → batch starts executing
Summary: The event chain triggered by RequestArrival fires every time, but is mostly a no-op when the replica is busy. The real multi-request batching happens when BatchEnd triggers ReplicaSchedule — by that time, multiple requests have accumulated in the queue and are pulled into a single batch.

Prefill / Decode Iteration Lifecycle — How Many Batches Does One Request Need?

The event chain (Events 3→4→5→6→7→3) forms a loop. A single request goes through this loop multiple times — once is not enough. But the call graph above doesn't make this obvious. This section traces the full lifecycle of a single request with P prefill tokens and D decode tokens.

The Two Key Mechanisms

① How many tokens per iteration?

scheduler/replica_scheduler/base_replica_scheduler.py
def _get_request_next_num_tokens(self, request):
    if request.is_prefill_complete:
        return 1                        # decode: only 1 token per iteration
    return request.num_prefill_tokens  # prefill: all tokens in one shot

Prefill processes all P tokens in one shot. Decode processes exactly 1 token per iteration.

② How does a request re-enter the loop?

scheduler/replica_scheduler/vllm_replica_scheduler.py
def on_batch_end(self, batch):
    self._num_running_batches -= 1
    for request in batch.requests:
        if request.completed:
            self.free(request.id)         # completed → free memory
        else:
            self._preempted_requests.append(request)
            # ↑ not completed → back into the preempted queue; the next on_schedule() picks it up again

After each batch, non-completed requests go into _preempted_requests. The next _get_next_batch() picks them up again — each time with num_tokens=1 (decode).

The Hidden +1: Prefill Awards the First Decode Token for Free

vidur/entities/request.py

Inside on_batch_end(), there is a critical +1 that changes the iteration count:

def on_batch_end(self, time, num_tokens_processed):
    self._num_processed_tokens += num_tokens_processed  # after prefill: P

    if self._num_processed_tokens == self._num_prefill_tokens:
        self._is_prefill_complete = True
        # we get one decode token when the prefill processing completes
        self._num_processed_tokens += 1    # ← P → P+1: the first decode token is free!

        if self._prefill_completed_at == 0:
            self._prefill_completed_at = time

    if self._num_processed_tokens == self.total_tokens:  # P+D?
        self._completed = True
This models real LLM behavior: the prefill forward pass computes the KV cache for all P input tokens and produces the first output token in the same pass. So the first decode token costs zero additional iterations.

Concrete Example: Request(P=512, D=128)

Walk through the state of num_processed_tokens at each batch iteration. The request is complete when it reaches total_tokens = P + D = 640.

| Iter | Phase | num_tokens in Batch | processed before | += tokens | +1 bonus | processed after | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Prefill | 512 | 0 | +512 | +1 | 513 | prefill_complete=True → preempted queue |
| 2 | Decode | 1 | 513 | +1 | — | 514 | → preempted queue |
| 3 | Decode | 1 | 514 | +1 | — | 515 | → preempted queue |
| ⋮ | Decode | 1 | (each iteration: +1 token) | +1 | — | ⋮ | → preempted queue |
| 128 | Decode | 1 | 639 | +1 | — | 640 = P+D ✓ | completed=True, freed |
Total: D batch iterations (not D+1) = 1 prefill iteration + (D−1) decode iterations. The first decode token is included in the prefill iteration via the +1 bonus.
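The table's arithmetic can be checked with a short sketch that mirrors the counter logic above (a simplification of Request.on_batch_end with vLLM-style all-at-once prefill, not the actual class):

```python
def iterations_to_complete(P: int, D: int) -> int:
    """Count batch iterations for a request with P prefill and D decode tokens."""
    processed, iters = 0, 0
    while processed < P + D:
        iters += 1
        step = P if processed < P else 1  # prefill all at once, then 1 token/iter
        processed += step
        if processed == P:
            processed += 1  # the first decode token comes free with prefill
    return iters

print(iterations_to_complete(512, 128))  # 128 — i.e. exactly D iterations
```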

What the Event Loop Actually Looks Like for This Request

Mapping the 128 iterations back to the event chain. Each row is one pass through Events 3→4→5→6→7→3.

Iter 1 (Prefill):
  ReplicaScheduleEvent → _get_next_batch()
    → request from _request_queue, num_tokens=512 → Batch(requests=[Req], num_tokens=[512])
  → BatchStageArrival → ReplicaStageSchedule
    → execution_time_predictor(batch with 512 prefill tokens) → longer execution_time
  → BatchStageEnd → BatchEnd
    → Batch.on_batch_end() → Request.on_batch_end(num_tokens=512)
      processed: 0 → 512 → 513 (bonus +1)
    → VLLMScheduler.on_batch_end()
      request.completed? No (513 < 640) → _preempted_requests.append(request) → ReplicaScheduleEvent  ← loop back to Event 3

Iter 2 (Decode #1):
  ReplicaScheduleEvent → _get_next_batch()
    → request from _preempted_requests, is_prefill_complete=True → num_tokens=1 → Batch(requests=[Req, ...], num_tokens=[1, ...])
      ↑ the same batch can contain decode tokens from different requests
  → BatchStageArrival → ReplicaStageSchedule
    → execution_time_predictor(batch with N decode tokens) → shorter execution_time
  → BatchStageEnd → BatchEnd
    → Request.on_batch_end(num_tokens=1)
      processed: 513 → 514 → request not completed → _preempted_requests.append(request) → ReplicaScheduleEvent  ← loop back to Event 3

⋮ (repeats D-2 more times)

Iter 128 (Decode #127, final):
  ReplicaScheduleEvent → _get_next_batch()
    → Batch(requests=[Req, ...], num_tokens=[1, ...])
  → ... → BatchEnd
    → Request.on_batch_end(num_tokens=1)
      processed: 639 → 640 = total_tokens → completed=True
    → VLLMScheduler.on_batch_end()
      request.completed? Yes → self.free(request.id)  ← memory freed

Continuous Batching: Different Requests Share a Batch

A critical detail: in decode iterations, _get_next_batch() pulls from both _preempted_requests (ongoing decode) and _request_queue (new arrivals). This means:

Batch iteration N:
  Batch(
    requests = [Req_A,  Req_B,  Req_C,  Req_D ],
    num_tokens= [1,      1,      1,      256  ]
  )               ↑decode  ↑decode  ↑decode  ↑prefill(new arrival)

One batch can contain a mix of decode tokens (1 each) from ongoing requests AND prefill tokens from a newly arrived request. The execution time predictor receives the full batch and computes a single execution time that accounts for both prefill and decode work.
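A tiny sketch of how per-request token counts mix in one batch (next_num_tokens mirrors the _get_request_next_num_tokens logic quoted earlier; the request tuples are illustrative):

```python
def next_num_tokens(num_prefill_tokens: int, num_processed_tokens: int) -> int:
    # decode requests contribute 1 token; a fresh prefill contributes all P tokens
    if num_processed_tokens >= num_prefill_tokens:
        return 1
    return num_prefill_tokens

# (P, processed) pairs: three requests mid-decode, one newly arrived
requests = [(512, 513), (300, 301), (128, 129), (256, 0)]
num_tokens = [next_num_tokens(p, done) for p, done in requests]
print(num_tokens)  # [1, 1, 1, 256] — three decodes plus one prefill in one batch
```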

Sarathi Variant: Chunked Prefill Changes the Count

The Sarathi scheduler overrides _get_request_next_num_tokens() to cap prefill tokens per iteration:

scheduler/replica_scheduler/sarathi_replica_scheduler.py
def _get_request_next_num_tokens(self, request, batch_contains_prefill, num_batch_tokens):
    if request.is_prefill_complete:
        return 1                                          # decode: still 1 token

    next_num_tokens = min(
        request.num_prefill_tokens - request.num_processed_tokens,
        self._config.chunk_size - num_batch_tokens,     # ← capped at chunk_size
    )
    return max(0, next_num_tokens)

With chunk_size=512 and P=2048 prefill tokens:

| Iter | Phase | num_tokens | processed after |
| --- | --- | --- | --- |
| 1 | Prefill chunk 1 | 512 | 512 |
| 2 | Prefill chunk 2 | 512 | 1024 |
| 3 | Prefill chunk 3 | 512 | 1536 |
| 4 | Prefill chunk 4 | 512 | 2048 → 2049 (+1 bonus) |
| 5 .. D+3 | Decode | 1 | 2049 → ... → P+D |

Total: ⌈P/chunk_size⌉ + (D−1) iterations. Chunked prefill trades more iterations for lower per-iteration latency, allowing decode requests to interleave.
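The two iteration-count formulas from the text in code form:

```python
import math

def vllm_iterations(P: int, D: int) -> int:
    return 1 + (D - 1)  # one prefill pass + (D-1) decode passes = D

def sarathi_iterations(P: int, D: int, C: int) -> int:
    return math.ceil(P / C) + (D - 1)  # chunked prefill + (D-1) decode passes

print(sarathi_iterations(2048, 128, 512))  # 4 + 127 = 131
print(vllm_iterations(2048, 128))          # 128
```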

Summary: Iteration Counts by Scheduler

| Scheduler | Prefill Iterations | Decode Iterations | Total Batch Iterations |
| --- | --- | --- | --- |
| vLLM / Orca / FasterTransformer | 1 | D − 1 | D |
| Sarathi (chunk_size = C) | ⌈P / C⌉ | D − 1 | ⌈P/C⌉ + D − 1 |
Why this matters for understanding the call graph: The event chain Events 3→7→3 is not a one-shot pipeline — it is a loop that executes D times per request (vLLM) or more (Sarathi). A simulation of 100 requests with average D=256 generates approximately 100 × 256 = 25,600 full cycles of Events 3→4→5→6→7→3. Each cycle produces a BatchStageEndEvent with a time advance (the only point where simulation time moves forward).
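The back-of-envelope count in the note above, taken one step further (the five-events-per-cycle figure is my own rough assumption based on Events 3→7, not from the source):

```python
# Rough event-count estimate for the simulation described in the text.
num_requests = 100
avg_decode_tokens = 256  # D: batch iterations per request under vLLM
cycles = num_requests * avg_decode_tokens
print(cycles)            # 25600 full 3→4→5→6→7 cycles
# assuming roughly five event pops per cycle → on the order of ~128k events
print(cycles * 5)
```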

PD Disagg Gap & Request Waiting Time Between Batches

Original Vidur: No PD Disaggregation

The original Vidur has no PD disagg: every request runs prefill and decode on the same replica, and the +1 bonus is unconditional. There is no KV cache transfer, no splitwise scheduler, and no separate P/D replica pools.

SimAI Fork: What Changed in the Event System for PD Disagg

The SimAI fork (vidur-alibabacloud) achieves PD disaggregation by modifying three points in the existing event chain — not by adding new event types, but by changing the behavior of existing ones.

Change 1: SplitwiseGlobalScheduler — split replicas into P and D pools

scheduler/global_scheduler/splitwise_global_scheduler.py

At initialization, replicas are split by pd_node_ratio:

# Example: 4 replicas, pd_node_ratio=0.5
# Replica 0, 1 → ReplicaType.PREFILL  (P pool)
# Replica 2, 3 → ReplicaType.DECODE   (D pool)
for replica_id, replica in self._replicas.items():
    if replica_id < self._num_prefill_nodes:
        self.prefill_replicas[replica_id] = replica
        replica.replica_type = ReplicaType.PREFILL
    else:
        self.decode_replicas[replica_id] = replica
        replica.replica_type = ReplicaType.DECODE

In schedule(), each new request is assigned to both a P replica and a D replica upfront:

def schedule(self):
    for request in self._request_queue:
        request.request_type = RequestType.PREFILL

        # Assign P replica (round-robin in P pool)
        replica_id = self.p_request_counter % len(self.prefill_replicas)
        request.prefill_replica_id = replica_id

        # Assign D replica (round-robin in D pool)
        replica_id = (self.d_request_counter % len(self.decode_replicas)) + len(self.prefill_replicas)
        request.decode_replica_id = replica_id

    return prefill_request_mapping  # ← returns only the P-replica mapping
schedule() only returns prefill_request_mapping. The request is initially sent to the P replica only. The D replica gets its ReplicaScheduleEvent later — from BatchEndEvent, not from here.
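The pool split and dual round-robin assignment can be sketched as follows (deriving the pool size as int(num_replicas * pd_node_ratio) is an assumption; the fork may compute it differently):

```python
# Splitwise-style assignment: replicas [0, num_prefill) form the P pool,
# the rest form the D pool; each request gets one id from each pool.
num_replicas, pd_node_ratio = 4, 0.5
num_prefill = int(num_replicas * pd_node_ratio)  # replicas 0..1 → P pool

p_counter = d_counter = 0
assignments = []
for _ in range(5):  # five incoming requests
    p_id = p_counter % num_prefill                            # round-robin in P pool
    d_id = d_counter % (num_replicas - num_prefill) + num_prefill  # round-robin in D pool
    assignments.append((p_id, d_id))
    p_counter += 1
    d_counter += 1

print(assignments)  # [(0, 2), (1, 3), (0, 2), (1, 3), (0, 2)]
```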

Change 2: BatchEndEvent — the critical cross-replica handoff

events/batch_end_event.py

This is where the entire PD disagg logic lives. The original BatchEndEvent was 42 lines. The SimAI version is 151 lines. After the normal batch-end handling, it adds:

# batch_end_event.py — SimAI version (simplified)
def handle_event(self, scheduler, metrics_store):
    # ① Normal Vidur logic (unchanged)
    self._batch.on_batch_end(self.time)
    replica_scheduler.on_batch_end(self._batch)
    events = [ReplicaScheduleEvent(self.time, self._replica_id)]

    # ② NEW: PD disagg cross-replica transfer
    if scheduler.__class__.__name__ == 'SplitwiseGlobalScheduler':
        for request in self._batch.requests:
            if request.is_prefill_complete \
               and request.request_type == RequestType.DECODE \
               and replica_scheduler.replica.replica_type == ReplicaType.PREFILL:

                # ③ Calculate KV cache transfer time
                request.pd_p2p_comm_size = request.estimate_kv_cache_size(
                    request.num_processed_tokens, replica_scheduler.replica)
                request.pd_p2p_comm_bandwidth = replica.pd_p2p_comm_bandwidth
                request.pd_p2p_comm_time = request.pd_p2p_comm_size / request.pd_p2p_comm_bandwidth

                # ④ Set D replica arrival = prefill done + KV transfer time
                request.decode_arrived_at = request.prefill_completed_at + request.pd_p2p_comm_time

                # ⑤ Create ReplicaScheduleEvent for D replica with DELAYED time
                events.append(
                    ReplicaScheduleEvent(request.decode_arrived_at, request.decode_replica_id))
                    # ↑ note: the time is not self.time but a future timestamp!

    return events

Change 3: Request & Replica entities — new PD fields

entities/request.py
# New fields added to Request
request_type = RequestType.PREFILL  # PREFILL → DECODE
prefill_replica_id = None
decode_replica_id = None
prefill_arrived_at = arrived_at
decode_arrived_at = float('inf')  # ← set later
pd_p2p_comm_size = float('inf')
pd_p2p_comm_time = float('inf')
pd_p2p_comm_bandwidth = 0
entities/replica.py
# New fields added to Replica
replica_type = ReplicaType.MIXED
# ↑ set to PREFILL or DECODE by scheduler
pd_p2p_comm_bandwidth = ...
pd_p2p_comm_dtype = ...
pd_node_ratio = ...
pending_requests = []

The request_type starts as PREFILL and changes to DECODE in on_batch_end() when prefill completes (same place as the +1 bonus). This flag is how BatchEndEvent knows to trigger the cross-replica transfer.

The Modified Event Flow with PD Disagg

═══ P Replica (Prefill Pool) ═══

RequestArrivalEvent(t=0.0)
  → GlobalScheduleEvent → SplitwiseGlobalScheduler.schedule()
      request.prefill_replica_id = 0   (P pool)
      request.decode_replica_id  = 2   (D pool)
      return [(replica_id=0, request)]     ← dispatched only to the P replica
  → ReplicaScheduleEvent(t=0.0, replica=0)    ← P replica
  → BatchStageArrival → ReplicaStageSchedule
      execution_time = predict(batch with 512 prefill tokens) → BatchStageEndEvent(t=0.0 + 0.8s)
  → BatchEndEvent(t=0.8)  ← the biggest change lives here
      ① Batch.on_batch_end() → Request.on_batch_end(num_tokens=512)
         processed: 0 → 512 → 513 (bonus +1)
         request_type: PREFILL → DECODE
      ② Detect: is_prefill_complete + DECODE type + on PREFILL replica
      ③ estimate_kv_cache_size(513 tokens) = 1.6 GB
      ④ pd_p2p_comm_time = 1.6 GB / 50 GB/s = 0.032s
      ⑤ decode_arrived_at = 0.8 + 0.032 = 0.832s
      ⑥ events = [
           ReplicaScheduleEvent(0.8, replica=0),    ← P replica schedules its next batch
           ReplicaScheduleEvent(0.832, replica=2)   ← D replica starts after the KV transfer completes
         ]

═══ D Replica (Decode Pool) ═══

ReplicaScheduleEvent(t=0.832, replica=2)    ← 來自上面的 BatchEndEventon_schedule() → _get_next_batch()
      request from queue, is_prefill_complete=True → num_tokens=1
  → BatchStageArrival → ReplicaStageSchedule → BatchStageEnd → BatchEnd
      processed: 513 → 514ReplicaScheduleEvent(t=..., replica=2)    ← D replica 繼續 decode loop
      ⋮ (D-1 iterations on D replica)

Vidur vs SimAI: Comparison

| Aspect | Vidur (original) | SimAI (vidur-alibabacloud) |
|---|---|---|
| Replica pools | All replicas identical | P pool + D pool (split by pd_node_ratio) |
| schedule() | Assign to 1 replica | Assign to P replica + record D replica; only return P mapping |
| Prefill execution | Same replica as decode | P replica only |
| P→D handoff | N/A (co-located) | BatchEndEvent detects prefill_complete on P replica, calculates KV transfer time, creates delayed ReplicaScheduleEvent for D replica |
| KV cache transfer | Not modeled | 2 × tokens × hidden_dim × layers × dtype_bytes / bandwidth |
| Decode execution | Same replica as prefill | D replica only (all D−1 iterations) |
| +1 bonus | Kept (correct for co-located) | Still kept — first decode token is not re-generated on D replica |
| New event types | — | None — reuses existing ReplicaScheduleEvent with delayed time |
Key Insight: SimAI achieves PD disagg without adding any new event types. It only modifies BatchEndEvent.handle_event() to detect the prefill→decode transition on a P replica, and inject a time-delayed ReplicaScheduleEvent targeting the D replica. The delay = KV cache transfer time. This is the second point in the entire simulator where time advances (the first being the execution_time in ReplicaStageScheduleEvent).
Note on +1 bonus: The SimAI fork keeps the +1 bonus. In real PD disagg, the first decode token produced on the P replica is discarded and must be regenerated on the D replica (costing an extra iteration, making total = D+1). SimAI does not model this discard — it keeps total = D. This is a simplification in the current SimAI implementation.
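The transfer-time formula from the table row above is easy to sanity-check numerically. The sketch below uses illustrative numbers (an 8192-hidden, 80-layer fp16 model and a 50 GB/s link, which differ from the 1.6 GB example earlier); the helper names are made up for this example:

```python
def kv_cache_bytes(num_tokens, hidden_dim, num_layers, dtype_bytes=2):
    # Leading 2 = one K and one V tensor per layer (formula from the table above)
    return 2 * num_tokens * hidden_dim * num_layers * dtype_bytes

def transfer_time_s(size_bytes, bandwidth_bytes_per_s):
    return size_bytes / bandwidth_bytes_per_s

# Illustrative shape: 513 tokens, hidden_dim=8192, 80 layers, fp16, 50 GB/s link
size = kv_cache_bytes(num_tokens=513, hidden_dim=8192, num_layers=80)
print(f"{size / 1e9:.2f} GB, {transfer_time_s(size, 50e9) * 1e3:.1f} ms")
# → 1.34 GB, 26.9 ms
```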

Is Waiting Time Between Batches Recorded?

Yes — via two separate metrics.

When a request sits idle in _preempted_requests between one batch ending and the next batch starting, this gap is captured at two granularities:

Stage-Level: preempted_time

vidur/entities/request.py

Accumulated in on_batch_stage_schedule() — measures the gap between the last stage completion and the next stage start:

def on_batch_stage_schedule(self, time):
    self._latest_stage_scheduled_at = time
    if self._latest_stage_completed_at == 0:
        self._preempted_time = 0
    else:
        self._preempted_time += \
            time - self._latest_stage_completed_at
            # ↑ this is the waiting time

Reported as metric: request_preemption_time

Iteration-Level: scheduling_delay

vidur/entities/request.py

Computed in on_batch_schedule() — measures per-iteration wait from last batch completion to next batch start:

def on_batch_schedule(self, time):
    self._latest_iteration_scheduling_delay = \
        time - self._latest_iteration_completed_at
        # ↑ per-iteration waiting time

    if not self._scheduled:   # first time this request is scheduled
        self._scheduling_delay = \
            time - self._arrived_at
            # ↑ wait from arrival to first being scheduled

Reported as metric: request_scheduling_delay

Concrete Timeline: Where Each Timing Is Captured

Request arrives at t=0.0
    │
    ├─ on_batch_schedule(t=0.5)
    │     scheduling_delay = 0.5 - 0.0 = 0.5s  ← from arrival to first scheduling
    │
    ├─ on_batch_stage_schedule(t=0.5)
    │     _latest_stage_completed_at == 0 → preempted_time = 0
    │
    ├─ on_batch_stage_end(t=1.0)   ← prefill completes
    │     _latest_stage_completed_at = 1.0
    │
    │  ╌╌╌ request sits idle in _preempted_requests ╌╌╌  0.8s gap
    │
    ├─ on_batch_stage_schedule(t=1.8)   ← next batch starts
    │     preempted_time += 1.8 - 1.0 = +0.8s  ← first gap recorded
    │
    ├─ on_batch_stage_end(t=1.85)   ← decode iter completes
    │     _latest_stage_completed_at = 1.85
    │
    │  ╌╌╌ request sits idle in _preempted_requests ╌╌╌  0.3s gap
    │
    ├─ on_batch_stage_schedule(t=2.15)
    │     preempted_time += 2.15 - 1.85 = +0.3s  ← second gap recorded
    │
    │  ⋮  (repeats for each decode iteration)
    │
    └─ Request completes. Final metrics:
         scheduling_delay = 0.5s     ← arrival → first scheduling
         preempted_time   = 0.8 + 0.3 + ... = Σ(all gaps)
         execution_time   = Σ(all stage execution times)
         e2e_time         = completed_at - arrived_at  ← includes everything
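The timeline's bookkeeping reduces to simple arithmetic. A sketch (not Vidur code) that replays the timestamps above:

```python
# Each on_batch_stage_schedule() adds (now - last_stage_end) to preempted_time.
arrived_at      = 0.0
stage_schedules = [0.5, 1.8, 2.15]   # each time a batch stage picks up the request
stage_ends      = [1.0, 1.85]        # prefill end, first decode end

scheduling_delay = stage_schedules[0] - arrived_at
preempted_time = sum(
    sched - end for end, sched in zip(stage_ends, stage_schedules[1:])
)

print(scheduling_delay, round(preempted_time, 2))  # 0.5 1.1
```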

All Request Timing Metrics

| Metric | Field | Formula | Meaning |
|---|---|---|---|
| request_scheduling_delay | _scheduling_delay | first scheduled_at − arrived_at | Initial queue wait before ever being batched |
| request_preemption_time | _preempted_time | Σ(stage_scheduleᵢ − stage_endᵢ₋₁) | Cumulative idle time across ALL inter-batch gaps |
| request_execution_time | _execution_time | Σ(stage execution_timeᵢ) | Cumulative GPU execution time |
| request_execution_plus_preemption_time | (computed) | execution_time + preempted_time | Total time from first batch to last batch (minus scheduling delay) |
| request_e2e_time | e2e_time | completed_at − arrived_at | Everything: scheduling_delay + execution + preemption |
The relationship:
e2e_time ≈ scheduling_delay + execution_time + preempted_time
This is approximate because preempted_time is measured at the stage level while scheduling_delay is at the iteration level, and there's a small gap in the first iteration before the first stage schedule.

Event Timestamp Propagation — Where Does Time Actually Advance?

Key Insight: The simulation clock advances at exactly ONE point in the entire event chain.
All scheduling, dispatching, and queue operations are modeled as instantaneous (zero latency). The only place where simulated time moves forward is when a batch stage finishes execution on the GPU.
| Transition | New Event Time | Time Source |
|---|---|---|
| RequestArrival → GlobalSchedule | self.time | Same instant — scheduling is instantaneous |
| GlobalSchedule → ReplicaSchedule | self.time | Same instant — dispatching is instantaneous |
| ReplicaSchedule → BatchStageArrival | self.time | Same instant — batching is instantaneous |
| BatchStageArrival → ReplicaStageSchedule | self.time | Same instant — stage enqueue is instantaneous |
| ReplicaStageSchedule → BatchStageEnd | self.time + execution_time | THE ONLY TIME ADVANCE — execution_time from ExecutionTimePredictor |
| BatchStageEnd → ReplicaStageSchedule | self.time | Same instant — release stage for next batch |
| BatchStageEnd → BatchStageArrival (next stage) | self.time | Same instant — inter-stage transfer is instantaneous |
| BatchStageEnd → BatchEnd (last stage) | self.time | Same instant — finalization is instantaneous |
| BatchEnd → ReplicaSchedule | self.time | Same instant — re-scheduling is instantaneous |

The Single Line That Drives the Clock

# replica_stage_schedule_event.py, handle_event()
BatchStageEndEvent(
    self.time + self._batch_stage.execution_time,  # ← the only time advance
    ...
)

Where execution_time comes from:

  1. scheduler/replica_stage_scheduler/replica_stage_scheduler.py calls execution_time_predictor.get_execution_time(batch, stage_id)
  2. execution_time_predictor/base_execution_time_predictor.py computes attention, MLP, allreduce, send/recv, and pipeline communication times based on batch size, num_tokens, and model config
  3. Returns an ExecutionTime object with the total duration → stored in BatchStage.execution_time
Implication: This means the simulation models a world where only GPU compute takes time. All CPU-side scheduling, memory allocation, request routing, and inter-stage data transfer are assumed to have zero overhead. The accuracy of the entire simulation therefore depends entirely on the ExecutionTimePredictor's fidelity.
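The whole transition table can be compressed into a toy event loop. The sketch below uses made-up handler names and a hard-coded execution time; it only demonstrates that, of all the transitions, just the stage-schedule step advances the popped timestamp:

```python
import heapq

# Toy reduction of Simulator.run(): every handler returns events stamped
# with the current time, except the stage-schedule handler, which stamps
# current time + execution_time -- the single place time moves forward.
EXECUTION_TIME = 0.8

def handle(kind, time):
    if kind == "ReplicaStageSchedule":
        return [("BatchStageEnd", time + EXECUTION_TIME)]  # the only advance
    chain = {
        "RequestArrival": "GlobalSchedule",
        "GlobalSchedule": "ReplicaSchedule",
        "ReplicaSchedule": "BatchStageArrival",
        "BatchStageArrival": "ReplicaStageSchedule",
    }
    nxt = chain.get(kind)
    return [(nxt, time)] if nxt else []  # zero-latency hops

queue = [(0.0, "RequestArrival")]
clock = 0.0
while queue:
    clock, kind = heapq.heappop(queue)
    for nxt, t in handle(kind, clock):
        heapq.heappush(queue, (t, nxt))
print(clock)  # 0.8, the clock only moved on the stage-end transition
```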

10. Entity State Transitions

Events don't just create other events — they also mutate entity state through callback methods. Here's a map of which entity methods are called by which events.

Request State Machine

vidur/entities/request.py
| Callback Method | Called By | State Change |
|---|---|---|
| on_batch_schedule(time) | Batch.on_schedule() | Sets scheduled_at, scheduling_delay |
| on_batch_stage_schedule(time) | BatchStage.on_schedule() | Sets latest_stage_scheduled_at, calculates preempted_time |
| on_batch_stage_end(time, ...) | BatchStage.on_stage_end() | Accumulates execution_time, model_execution_time |
| on_batch_end(time, num_tokens) | Batch.on_batch_end() | Increments num_processed_tokens; if all done → completed=True |

Batch & BatchStage

vidur/entities/batch.py

Batch

  • on_schedule(time) — calls request.on_batch_schedule() for each request
  • on_batch_end(time) — calls request.on_batch_end() for each request with its num_tokens
vidur/entities/batch_stage.py

BatchStage

  • on_schedule(time) — calls request.on_batch_stage_schedule() for each request
  • on_stage_end(time) — calls request.on_batch_stage_end() for each request
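The delegation pattern behind these bullets is a plain fan-out loop. A condensed sketch (method names match the bullets; bodies and fields are illustrative, not Vidur's actual implementation):

```python
class Request:
    def __init__(self):
        self.scheduled_at = None
        self.num_processed_tokens = 0

    def on_batch_schedule(self, time):
        self.scheduled_at = time

    def on_batch_end(self, time, num_tokens):
        self.num_processed_tokens += num_tokens

class Batch:
    def __init__(self, requests, num_tokens):
        self.requests = requests
        self.num_tokens = num_tokens      # per-request token counts

    def on_schedule(self, time):
        for r in self.requests:           # fan out to every request
            r.on_batch_schedule(time)

    def on_batch_end(self, time):
        for r, n in zip(self.requests, self.num_tokens):
            r.on_batch_end(time, n)

batch = Batch([Request(), Request()], [512, 1])
batch.on_schedule(0.5)
batch.on_batch_end(1.0)
print([r.num_processed_tokens for r in batch.requests])  # [512, 1]
```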

11. Phase 6 — Output & Metrics

vidur/simulator.py

_write_output() — registered via atexit

Called automatically when the process exits (even on exceptions). Triggers the metrics pipeline:

| Action | Target | Output |
|---|---|---|
| Plot all metrics | metric_store.plot() | CDF/histogram PNGs |
| Write JSON trace | _write_event_trace() | event_trace.json |
| Write Chrome trace | _write_chrome_trace() | chrome_trace.json |
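The atexit pattern itself is standard-library behavior and can be sketched in a few lines (the file name mirrors the table above; the writer body is illustrative):

```python
import atexit
import json
import os
import tempfile

out_dir = tempfile.mkdtemp()

def write_output():
    # Stand-in for Simulator._write_output(): dump a trace file on exit.
    with open(os.path.join(out_dir, "event_trace.json"), "w") as f:
        json.dump([{"event": "RequestArrival", "time": 0.0}], f)

# Registered hooks fire on normal interpreter exit and after unhandled
# exceptions alike (but not on os._exit or a hard kill).
atexit.register(write_output)
```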

12. Complete File Map

Every file involved in a standard simulation run, grouped by role.

Core Simulation

| File | Role | Invocation Timing |
|---|---|---|
| vidur/main.py | Entry point | Once at startup |
| vidur/simulator.py | Orchestrator — init, event loop, output | Entire lifecycle |

Configuration

| File | Role | Invocation Timing |
|---|---|---|
| vidur/config/config.py | SimulationConfig dataclass tree | Phase 1 — CLI parsing |
| vidur/config/model_config.py | LLM model specs | Phase 1 |
| vidur/config/device_sku_config.py | GPU/device specs | Phase 1 |
| vidur/config/node_sku_config.py | Node specifications | Phase 1 |
| vidur/config/base_poly_config.py | Polymorphic config base | Phase 1 |
| vidur/config/flat_dataclass.py | CLI argument flattening | Phase 1 |

Entities

| File | Role | Invocation Timing |
|---|---|---|
| vidur/entities/cluster.py | Creates and holds replicas | Phase 2 — init |
| vidur/entities/replica.py | Model + device config, constraints | Phase 2 — init |
| vidur/entities/request.py | Per-request state machine | Phase 3 (created) → Phase 5 (mutated) |
| vidur/entities/batch.py | Groups requests, delegates callbacks | Phase 5 — created by ReplicaScheduler |
| vidur/entities/batch_stage.py | Per-stage execution record | Phase 5 — created by ReplicaStageScheduleEvent |
| vidur/entities/execution_time.py | Execution time breakdown | Phase 5 — returned by predictor |

Events

| File | Event | Invocation Timing |
|---|---|---|
| vidur/events/base_event.py | Abstract base | Every event inherits |
| vidur/events/request_arrival_event.py | RequestArrival | Once per request |
| vidur/events/global_schedule_event.py | GlobalSchedule | After each request arrival |
| vidur/events/replica_schedule_event.py | ReplicaSchedule | After global schedule + after each batch end |
| vidur/events/batch_stage_arrival_event.py | BatchStageArrival | Per batch per pipeline stage |
| vidur/events/replica_stage_schedule_event.py | ReplicaStageSchedule | Per stage execution |
| vidur/events/batch_stage_end_event.py | BatchStageEnd | After stage execution_time elapses |
| vidur/events/batch_end_event.py | BatchEnd | After last stage completes |

Schedulers

| File | Role | Invocation Timing |
|---|---|---|
| scheduler/global_scheduler/base_global_scheduler.py | Base class + init cascade | Phase 2 (init) + Phase 5 (every event) |
| scheduler/global_scheduler/{random,round_robin,lor}.py | Request → replica assignment | Phase 5 — GlobalScheduleEvent |
| scheduler/replica_scheduler/base_replica_scheduler.py | Batching logic base | Phase 5 — ReplicaScheduleEvent, BatchEndEvent |
| scheduler/replica_scheduler/{vllm,sarathi,orca,...}.py | Concrete batching strategies | Phase 5 — on_schedule() |
| scheduler/replica_stage_scheduler/replica_stage_scheduler.py | Per-stage batch queue | Phase 5 — stage arrival/schedule/end events |

Request Generation

| File | Role | Invocation Timing |
|---|---|---|
| request_generator/base_request_generator.py | Abstract generator | Phase 3 |
| request_generator/synthetic_request_generator.py | Generate from distributions | Phase 3 — generate() |
| request_generator/trace_replay_request_generator.py | Replay from CSV | Phase 3 — generate() |

Execution Time Prediction

| File | Role | Invocation Timing |
|---|---|---|
| execution_time_predictor/base_execution_time_predictor.py | Computes attention/MLP/comm time | Phase 5 — ReplicaStageScheduleEvent |
| execution_time_predictor/{linear_regression,random_forrest}.py | Concrete predictors | Phase 5 — get_execution_time() |

Metrics & Output

| File | Role | Invocation Timing |
|---|---|---|
| vidur/metrics/metrics_store.py | Collect metrics from all events | Phase 2 (init) + Phase 5 (every event) + Phase 6 (plot) |
| vidur/utils/random.py | Seed management | Phase 1 |
| vidur/logger.py | Logging setup | All phases |