Complete file-level invocation trace: from python3 -m vidur.main through initialization, event loop, and output — every file, every class, every call in order.
vidur/main.py → Simulator → Event Loop → Output

Vidur is a discrete event simulator. Execution has three macro phases: initialization (build all objects, generate requests, fill event queue), event loop (pop events in time order, each event handler returns new events), and output (write metrics and traces). The sections below show the full call chain at file level.
The entry point. Invoked via python3 -m vidur.main. Does exactly three things:
```python
def main() -> None:
    config: SimulationConfig = SimulationConfig.create_from_cli_args()  # ① parse CLI
    set_seeds(config.seed)                                              # ② deterministic RNG
    simulator = Simulator(config)                                       # ③ build & run
    simulator.run()
```
Parses all CLI arguments into a nested dataclass hierarchy. Key sub-configs:
| Sub-Config | Controls | File |
|---|---|---|
| cluster_config | num_replicas, replica_config (model, device, parallelism), global_scheduler_config | config/config.py |
| request_generator_config | Synthetic vs trace-replay, interval/length generators | config/config.py |
| metrics_config | output_dir, write_metrics, write_json_trace, enable_chrome_trace | config/config.py |
Sets random seeds for Python's random and numpy to ensure reproducible simulations.
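A minimal sketch of what set_seeds likely does (the numpy call is guarded here so the sketch runs standalone; the real implementation lives in vidur/utils/random.py):

```python
import random

def set_seeds(seed: int) -> None:
    # Seed Python's RNG; numpy is seeded too when available
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
```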
Simulator.__init__() builds the entire simulation infrastructure. Each sub-component triggers a cascade of further object creation.
Cluster.__init__() creates num_replicas Replica objects. Each Replica validates model/device constraints (e.g., layers divisible by pipeline stages, embedding_dim divisible by tensor_parallel_size).
```python
# entities/cluster.py
self._replicas = {}
for _ in range(num_replicas):
    replica = Replica(replica_config, generator_config)
    self._replicas[replica.id] = replica
```
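The divisibility constraints mentioned above can be sketched as follows (the function name and messages are assumptions, not Vidur's exact code):

```python
def validate_replica(num_layers: int, num_pipeline_stages: int,
                     embedding_dim: int, tensor_parallel_size: int) -> None:
    # Pipeline parallelism needs an equal layer count per stage
    if num_layers % num_pipeline_stages != 0:
        raise ValueError("num_layers must be divisible by num_pipeline_stages")
    # Tensor parallelism needs an equal embedding shard per rank
    if embedding_dim % tensor_parallel_size != 0:
        raise ValueError("embedding_dim must be divisible by tensor_parallel_size")

validate_replica(num_layers=32, num_pipeline_stages=4,
                 embedding_dim=4096, tensor_parallel_size=8)  # passes
```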
The metrics store initializes DataSeries and CDFSketch objects for tracking request latency, batch sizes, utilization, and token-level metrics.
The registry pattern selects the concrete generator class based on config type:
Uses interval generators (Poisson, Gamma, Static, Trace) and length generators (Uniform, Zipf, Trace, Fixed) to produce synthetic requests.
Reads a CSV trace file with arrival times and token counts, replaying exact request patterns.
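The registry pattern described above can be sketched as follows (class and key names are illustrative, not Vidur's exact API):

```python
class GeneratorRegistry:
    _registry = {}

    @classmethod
    def register(cls, key):
        # Decorator: map a config key to a concrete generator class
        def decorator(generator_cls):
            cls._registry[key] = generator_cls
            return generator_cls
        return decorator

    @classmethod
    def get(cls, key, *args, **kwargs):
        # Look up the class by key and instantiate it
        return cls._registry[key](*args, **kwargs)


@GeneratorRegistry.register("synthetic")
class SyntheticGenerator:
    def __init__(self, seed=0):
        self.seed = seed


gen = GeneratorRegistry.get("synthetic", seed=42)
```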
This is the deepest initialization chain. The global scheduler's __init__() creates the entire scheduling hierarchy:
The hierarchy includes the replica schedulers, their per-stage schedulers, and the execution time predictor (LinearRegressionExecutionTimePredictor or RandomForrestExecutionTimePredictor).

Global scheduler concrete implementations (all route requests to replicas):
| Scheduler | Strategy |
|---|---|
| RandomGlobalScheduler | Random replica selection |
| RoundRobinGlobalScheduler | Round-robin rotation |
| LORGlobalScheduler | Least Outstanding Requests |
After all objects are built, the simulator populates the event queue with the initial set of events:
```python
# simulator.py
def _init_event_queue(self):
    requests = self._request_generator.generate()  # → List[Request]
    for request in requests:
        self._add_event(RequestArrivalEvent(request.arrived_at, request))
```
Each request becomes a RequestArrivalEvent timestamped at its arrival time, pushed into a min-heap priority queue. Also registers atexit(_write_output) for cleanup.
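The atexit registration mentioned above can be sketched like this (MiniSimulator is an illustrative name):

```python
import atexit

class MiniSimulator:
    def __init__(self):
        self.wrote_output = False
        # Runs at interpreter exit, even if an exception escapes main()
        atexit.register(self._write_output)

    def _write_output(self):
        self.wrote_output = True  # the real hook plots metrics and writes traces
```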
```python
# simulator.py
def run(self):
    while self._event_queue and not self._terminate:
        _, event = heapq.heappop(self._event_queue)  # ① pop earliest event
        self._set_time(event._time)                  # ② advance sim clock
        new_events = event.handle_event(             # ③ dispatch to handler
            self._scheduler, self._metric_store)
        self._add_events(new_events)                 # ④ push new events
```
This is the core loop. Every handle_event() returns zero or more new events that get pushed back into the heap. The simulation advances by processing events in chronological order until the queue is empty or time_limit is reached.
Events are ordered in the heap by a (time, event_id, event_type) tuple. This guarantees deterministic ordering when multiple events share the same timestamp.
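A minimal sketch of why tuple keys give deterministic pops (event names and ids here are illustrative):

```python
import heapq

queue = []
heapq.heappush(queue, (0.5, 2, "BatchEndEvent"))
heapq.heappush(queue, (0.5, 1, "RequestArrivalEvent"))
heapq.heappush(queue, (0.1, 3, "GlobalScheduleEvent"))

# Pops compare the tuples element by element: earliest time first,
# then lowest event_id when two events share a timestamp.
order = [heapq.heappop(queue)[2] for _ in range(3)]
```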
Each event type lives in its own file under vidur/events/. Below is the complete chain showing what each event does, what it calls, and what new events it creates.
| Action | Target | File |
|---|---|---|
| Add request to global queue | scheduler.add_request(request) | scheduler/global_scheduler/base_global_scheduler.py |
| Record arrival metrics | metrics_store.on_request_arrival() | metrics/metrics_store.py |
Creates: GlobalScheduleEvent
The request lands in scheduler._request_queue and waits. Multiple requests can accumulate in this queue before any batch is formed.
```python
def handle_event(self, scheduler, metrics_store):
    scheduler.add_request(self._request)
    metrics_store.on_request_arrival(self.time, self._request)
    return [GlobalScheduleEvent(self.time)]
```
| Action | Target | File |
|---|---|---|
| Decide which replica handles which request | scheduler.schedule() | scheduler/global_scheduler/{random,round_robin,lor}.py |
| Push request into replica's queue | replica_scheduler.add_request(request) | scheduler/replica_scheduler/base_replica_scheduler.py |
Creates: ReplicaScheduleEvent × one per affected replica
```python
def handle_event(self, scheduler, metrics_store):
    self._replica_set = set()
    self._request_mapping = scheduler.schedule()
    for replica_id, request in self._request_mapping:
        self._replica_set.add(replica_id)
        scheduler.get_replica_scheduler(replica_id).add_request(request)
    return [
        ReplicaScheduleEvent(self.time, replica_id)
        for replica_id in self._replica_set
    ]
```
| Action | Target | File |
|---|---|---|
| Decide which requests to batch together | replica_scheduler.on_schedule() | scheduler/replica_scheduler/{vllm,sarathi,orca,...}.py |
| Mark batch as scheduled | batch.on_schedule(time) | entities/batch.py |
| Record memory usage | metrics_store.on_replica_schedule() | metrics/metrics_store.py |
Creates: BatchStageArrivalEvent × one per batch (stage_id = 0), or empty list if no batch can be formed
replica_scheduler.on_schedule() pulls multiple requests from the queue to form ONE batch. It does NOT create one batch per request. The internal _get_next_batch() loops through the request queue, adding requests until it hits memory, token, or batch-size limits. If the replica is already running the maximum number of concurrent batches (_num_running_batches ≥ _num_stages), it returns an empty list — the requests simply wait.
```python
def handle_event(self, scheduler, metrics_store):
    replica_scheduler = scheduler.get_replica_scheduler(self._replica_id)
    self._batches = replica_scheduler.on_schedule()
    if not self._batches:
        return []  # ← if the replica is busy or the queue is empty, do nothing
    memory_usage_percent = replica_scheduler.memory_usage_percent
    metrics_store.on_replica_schedule(
        self.time, self._replica_id, memory_usage_percent)
    for batch in self._batches:
        batch.on_schedule(self.time)
    return [
        BatchStageArrivalEvent(self.time, self._replica_id, 0, batch)
        for batch in self._batches
    ]
```
```python
# base_replica_scheduler.py — on_schedule()
def on_schedule(self) -> List[Batch]:
    scheduled_batches = []
    while self._num_running_batches < self._num_stages:
        batch = self._get_next_batch()  # ← pulls multiple requests from the queue into ONE batch
        if not batch:
            break
        scheduled_batches.append(batch)
        self._num_running_batches += 1
    return scheduled_batches
```
```python
# vllm_replica_scheduler.py — _get_next_batch() (vLLM as the example, simplified)
def _get_next_batch(self) -> Optional[Batch]:
    requests, num_tokens = [], []
    while self._request_queue:              # ← walk the whole queue
        if not self._can_allocate(...):     # insufficient memory → stop
            break
        if total_tokens > max_tokens:       # token cap reached → stop
            break
        if len(requests) >= batch_size_cap: # batch-size cap reached → stop
            break
        request = self._request_queue.pop(0)
        requests.append(request)            # ← multiple requests join the same batch
        num_tokens.append(next_tokens)
    if not requests:
        return None
    return Batch(self._replica_id, requests, num_tokens)
```
| Action | Target | File |
|---|---|---|
| Enqueue batch at stage | stage_scheduler.add_batch(batch) | scheduler/replica_stage_scheduler/replica_stage_schduler.py |
Creates: ReplicaStageScheduleEvent
```python
def handle_event(self, scheduler, metrics_store):
    scheduler.get_replica_stage_scheduler(
        self._replica_id, self._stage_id
    ).add_batch(self._batch)
    return [
        ReplicaStageScheduleEvent(
            self.time, self._replica_id, self._stage_id)
    ]
```
stage_scheduler.on_schedule() checks _is_busy first. If this stage is already executing a batch, it returns (None, None, None) and the event produces no further events — the new batch waits in the stage queue until the current one finishes.
| Action | Target | File |
|---|---|---|
| Dequeue batch and compute execution time (if not busy) | stage_scheduler.on_schedule() | scheduler/replica_stage_scheduler/replica_stage_schduler.py |
| Predict execution duration | execution_time_predictor.get_execution_time(batch, stage_id) | execution_time_predictor/base_execution_time_predictor.py |
| Create BatchStage entity | BatchStage(..., execution_time) | entities/batch_stage.py |
| Mark stage as scheduled | batch_stage.on_schedule(time) | entities/batch_stage.py |
Creates: BatchStageEndEvent scheduled at time + execution_time
```python
def handle_event(self, scheduler, metrics_store):
    stage_scheduler = scheduler._replica_schedulers[
        self._replica_id
    ]._replica_stage_schedulers[self._stage_id]
    self._batch, self._batch_stage, execution_time = \
        stage_scheduler.on_schedule()
    if not (self._batch and self._batch_stage):
        return []
    self._batch_stage.on_schedule(self.time)
    metrics_store.on_replica_stage_schedule(
        self.time, self._replica_id, self._stage_id,
        self._batch_stage, execution_time)
    self._is_last_stage = stage_scheduler.is_last_stage
    return [
        BatchStageEndEvent(
            self.time + self._batch_stage.execution_time,  # ← the ONLY time advance
            self._replica_id, self._stage_id,
            self._is_last_stage, self._batch, self._batch_stage),
    ]
```
```python
# scheduler/replica_stage_scheduler/replica_stage_schduler.py
def on_schedule(self):
    if self._is_busy or not self._batch_queue:
        return None, None, None
    self._is_busy = True
    batch = self._batch_queue.pop(0)
    execution_time = self._execution_time_predictor.get_execution_time(
        batch, self._stage_id)  # ← where execution_time comes from
    total_execution_time = execution_time.total_time
    model_execution_time = execution_time.model_time
    batch_stage = BatchStage(
        batch.id, self._replica_id, self._stage_id,
        total_execution_time, model_execution_time,
        batch.requests, batch.num_tokens)
    return batch, batch_stage, execution_time
```
| Action | Target | File |
|---|---|---|
| Release stage resources | stage_scheduler.on_stage_end() | scheduler/replica_stage_scheduler/replica_stage_schduler.py |
| Finalize stage timing | batch_stage.on_stage_end(time) | entities/batch_stage.py |
| Record stage metrics | metrics_store.on_batch_stage_end() | metrics/metrics_store.py |
Creates: a ReplicaStageScheduleEvent for this stage, plus BatchEndEvent if this was the last stage, otherwise BatchStageArrivalEvent with stage_id + 1 → batch moves to the next pipeline stage.

```python
def handle_event(self, scheduler, metrics_store):
    scheduler.get_replica_stage_scheduler(
        self._replica_id, self._stage_id
    ).on_stage_end()
    self._batch_stage.on_stage_end(self.time)
    metrics_store.on_batch_stage_end(
        self._batch_stage, self.time,
        self._replica_id, self._stage_id)
    next_events = [
        ReplicaStageScheduleEvent(
            self.time, self._replica_id, self._stage_id),
    ]
    if self._is_last_stage:
        return next_events + [
            BatchEndEvent(self.time, self._replica_id, self._batch)]
    return next_events + [
        BatchStageArrivalEvent(
            self.time, self._replica_id,
            self._stage_id + 1, self._batch)]
```
| Action | Target | File |
|---|---|---|
| Complete batch, update request tokens | batch.on_batch_end(time) | entities/batch.py |
| Free memory blocks, handle completed/preempted requests | replica_scheduler.on_batch_end(batch) | scheduler/replica_scheduler/base_replica_scheduler.py |
| Record batch completion metrics | metrics_store.on_batch_end() | metrics/metrics_store.py |
Creates: ReplicaScheduleEvent → loops back to Event 3, forming the next batch
```python
def handle_event(self, scheduler, metrics_store):
    self._batch.on_batch_end(self.time)
    replica_scheduler = scheduler.get_replica_scheduler(self._replica_id)
    replica_scheduler.on_batch_end(self._batch)
    memory_usage_percent = replica_scheduler.memory_usage_percent
    metrics_store.on_batch_end(
        self.time, self._batch,
        self._replica_id, memory_usage_percent)
    return [ReplicaScheduleEvent(self.time, self._replica_id)]
```
Batch formation is gated by _num_running_batches < _num_stages. When the replica is already executing its maximum number of concurrent batches, on_schedule() returns an empty list, and the request simply waits in the queue.
Here is a concrete timeline showing how the two paths differ:
```
t=0.1  RequestArrivalEvent(Req A)
       → GlobalScheduleEvent → assigns Req A to replica 0
       → ReplicaScheduleEvent → on_schedule()
           _num_running_batches(0) < _num_stages(1) ✓
           → _get_next_batch(): pulls Req A (only 1 in queue) → Batch(size=1)
           → _num_running_batches = 1
       → BatchStageArrivalEvent → ... → batch starts executing
```
_get_next_batch() has no waiting/accumulation mechanism — it pulls whatever is currently in the queue, even if only 1 request. This correctly models continuous batching (as used by vLLM): the system never waits for more requests, it processes what's available immediately. New requests arriving later join the next iteration via the BatchEnd → ReplicaSchedule path (Path C below).
```
t=0.2  RequestArrivalEvent(Req B)
       → GlobalScheduleEvent → assigns Req B to replica 0
       → ReplicaScheduleEvent → on_schedule()
           _num_running_batches(1) < _num_stages(1) ✗  ← GUARD FAILS
           → return []  ← empty list; the event chain ends here
       Req B stays in _request_queue, waiting...

t=0.3  RequestArrivalEvent(Req C) → same thing, Req C also waits
```
```
t=0.5  BatchEndEvent(Batch of Req A)
       → _num_running_batches -= 1 → now 0
       → ReplicaScheduleEvent → on_schedule()
           _num_running_batches(0) < _num_stages(1) ✓
           → _get_next_batch(): queue has [Req B, Req C]
               while _request_queue:
                   add Req B ✓, add Req C ✓  ← multiple requests form one batch
           → Batch(requests=[Req B, Req C])
       → BatchStageArrivalEvent → ... → batch starts executing
```
The event chain (Events 3→4→5→6→7→3) forms a loop. A single request goes through this loop multiple times — once is not enough. But the call graph above doesn't make this obvious. This section traces the full lifecycle of a single request with P prefill tokens and D decode tokens.
```python
def _get_request_next_num_tokens(self, request):
    if request.is_prefill_complete:
        return 1  # decode: process only 1 token per iteration
    return request.num_prefill_tokens  # prefill: process everything in one shot
```
Prefill processes all P tokens in one shot. Decode processes exactly 1 token per iteration.
```python
def on_batch_end(self, batch):
    self._num_running_batches -= 1
    for request in batch.requests:
        if request.completed:
            self.free(request.id)  # completed → free memory
        else:
            self._preempted_requests.append(request)
            # ↑ not completed → back into the preempted queue;
            #   the next on_schedule() picks it up again
```
After each batch, non-completed requests go into _preempted_requests. The next _get_next_batch() picks them up again — each time with num_tokens=1 (decode).
Inside on_batch_end(), there is a critical +1 that changes the iteration count:
```python
def on_batch_end(self, time, num_tokens_processed):
    self._num_processed_tokens += num_tokens_processed  # after prefill: P
    if self._num_processed_tokens == self._num_prefill_tokens:
        self._is_prefill_complete = True
        # we get one decode token when the prefill processing completes
        self._num_processed_tokens += 1  # ← P → P+1: the first decode token is free!
        if self._prefill_completed_at == 0:
            self._prefill_completed_at = time
    if self._num_processed_tokens == self.total_tokens:  # P + D?
        self._completed = True
```
Walk through the state of num_processed_tokens at each batch iteration. The request is complete when it reaches total_tokens = P + D = 640.
| Iter | Phase | num_tokens in Batch | processed before | += tokens | +1 bonus | processed after | Status |
|---|---|---|---|---|---|---|---|
| 1 | Prefill | 512 | 0 | +512 | +1 | 513 | prefill_complete=True, → preempted queue |
| 2 | Decode | 1 | 513 | +1 | — | 514 | → preempted queue |
| 3 | Decode | 1 | 514 | +1 | — | 515 | → preempted queue |
| ⋮ (each iteration: +1 token) | |||||||
| 128 | Decode | 1 | 639 | +1 | — | 640 = P+D | ✓ completed=True, freed |
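The token accounting in the table can be reproduced with a short sketch (illustrative, not Vidur's code):

```python
def count_iterations(P: int, D: int) -> int:
    """Count batch iterations for a request with P prefill and D decode tokens."""
    processed, iters, prefill_done = 0, 0, False
    while processed < P + D:
        tokens = P if not prefill_done else 1  # prefill in one shot, then 1 token/iter
        processed += tokens
        if not prefill_done and processed == P:
            prefill_done = True
            processed += 1  # the "+1 bonus": first decode token comes with prefill
        iters += 1
    return iters

count_iterations(512, 128)  # → 128, matching the table
```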
Mapping the 128 iterations back to the event chain. Each row is one pass through Events 3→4→5→6→7→3.
```
Iter 1 (Prefill):
  ReplicaScheduleEvent → _get_next_batch()
      request from _request_queue, num_tokens=512
  → Batch(requests=[Req], num_tokens=[512])
  → BatchStageArrival → ReplicaStageSchedule
      → execution_time_predictor(batch with 512 prefill tokens) → longer execution_time
  → BatchStageEnd → BatchEnd
      → Batch.on_batch_end() → Request.on_batch_end(num_tokens=512)
          processed: 0 → 512 → 513 (bonus +1)
      → VLLMScheduler.on_batch_end()
          request.completed? No (513 < 640) → _preempted_requests.append(request)
  → ReplicaScheduleEvent  ← loop back to Event 3
```
```
Iter 2 (Decode #1):
  ReplicaScheduleEvent → _get_next_batch()
      request from _preempted_requests, is_prefill_complete=True → num_tokens=1
  → Batch(requests=[Req, ...], num_tokens=[1, ...])
      ↑ the same batch can contain decode tokens from different requests
  → BatchStageArrival → ReplicaStageSchedule
      → execution_time_predictor(batch with N decode tokens) → shorter execution_time
  → BatchStageEnd → BatchEnd
      → Request.on_batch_end(num_tokens=1)
          processed: 513 → 514
      → request not completed → _preempted_requests.append(request)
  → ReplicaScheduleEvent  ← loop back to Event 3

⋮ (repeats D-2 more times)
```
```
Iter 128 (Decode #127, final):
  ReplicaScheduleEvent → _get_next_batch()
  → Batch(requests=[Req, ...], num_tokens=[1, ...])
  → ... → BatchEnd
      → Request.on_batch_end(num_tokens=1)
          processed: 639 → 640 = total_tokens → completed=True
      → VLLMScheduler.on_batch_end()
          request.completed? Yes → self.free(request.id)  ← memory freed
```
A critical detail: in decode iterations, _get_next_batch() pulls from both _preempted_requests (ongoing decode) and _request_queue (new arrivals). This means:
```
Batch iteration N:
  Batch(
      requests  = [Req_A,  Req_B,  Req_C,  Req_D ],
      num_tokens= [1,      1,      1,      256   ]
  )             ↑decode ↑decode ↑decode ↑prefill(new arrival)
```
One batch can contain a mix of decode tokens (1 each) from ongoing requests AND prefill tokens from a newly arrived request. The execution time predictor receives the full batch and computes a single execution time that accounts for both prefill and decode work.
The Sarathi scheduler overrides _get_request_next_num_tokens() to cap prefill tokens per iteration:
```python
def _get_request_next_num_tokens(self, request, batch_contains_prefill, num_batch_tokens):
    if request.is_prefill_complete:
        return 1  # decode: still 1 token
    next_num_tokens = min(
        request.num_prefill_tokens - request.num_processed_tokens,
        self._config.chunk_size - num_batch_tokens,  # ← capped at chunk_size
    )
    return max(0, next_num_tokens)
```
With chunk_size=512 and P=2048 prefill tokens:
| Iter | Phase | num_tokens | processed after |
|---|---|---|---|
| 1 | Prefill chunk 1 | 512 | 512 |
| 2 | Prefill chunk 2 | 512 | 1024 |
| 3 | Prefill chunk 3 | 512 | 1536 |
| 4 | Prefill chunk 4 | 512 | 2048 → 2049 (+1 bonus) |
| 5..D+3 | Decode | 1 | 2049 → ... → P+D |
Total: ⌈P/chunk_size⌉ + (D−1) iterations. Chunked prefill trades more iterations for lower per-iteration latency, allowing decode requests to interleave.
| Scheduler | Prefill Iterations | Decode Iterations | Total Batch Iterations |
|---|---|---|---|
| vLLM / Orca / FasterTransformer | 1 | D − 1 | D |
| Sarathi (chunk_size = C) | ⌈P / C⌉ | D − 1 | ⌈P/C⌉ + D − 1 |
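The iteration counts in the comparison table can be reproduced with a small helper (a sketch; total_iterations is an illustrative name, not a Vidur function):

```python
import math
from typing import Optional

def total_iterations(P: int, D: int, chunk_size: Optional[int] = None) -> int:
    # chunk_size=None models vLLM/Orca (whole prefill in 1 iteration);
    # an int models Sarathi's chunked prefill: ceil(P / C) chunks.
    prefill_iters = 1 if chunk_size is None else math.ceil(P / chunk_size)
    return prefill_iters + (D - 1)

total_iterations(2048, 128)                  # vLLM: 1 + 127 = 128
total_iterations(2048, 128, chunk_size=512)  # Sarathi: 4 + 127 = 131
```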
The original Vidur has no PD disagg: every request runs prefill and decode on the same replica, and the +1 bonus is unconditional. There is no KV cache transfer, no splitwise scheduler, and no separate P/D replica pools.
The SimAI fork (vidur-alibabacloud) achieves PD disaggregation by modifying three points in the existing event chain — not by adding new event types, but by changing the behavior of existing ones.
At initialization, replicas are split by pd_node_ratio:
```python
# Example: 4 replicas, pd_node_ratio=0.5
#   Replica 0, 1 → ReplicaType.PREFILL (P pool)
#   Replica 2, 3 → ReplicaType.DECODE  (D pool)
for replica_id, replica in self._replicas.items():
    if replica_id < self._num_prefill_nodes:
        self.prefill_replicas[replica_id] = replica
        replica.replica_type = ReplicaType.PREFILL
    else:
        self.decode_replicas[replica_id] = replica
        replica.replica_type = ReplicaType.DECODE
```
In schedule(), each new request is assigned to both a P replica and a D replica upfront:
```python
def schedule(self):
    prefill_request_mapping = []
    for request in self._request_queue:
        request.request_type = RequestType.PREFILL
        # Assign P replica (round-robin in P pool)
        p_replica_id = self.p_request_counter % len(self.prefill_replicas)
        self.p_request_counter += 1
        request.prefill_replica_id = p_replica_id
        # Assign D replica (round-robin in D pool)
        d_replica_id = (self.d_request_counter % len(self.decode_replicas)) \
            + len(self.prefill_replicas)
        self.d_request_counter += 1
        request.decode_replica_id = d_replica_id
        prefill_request_mapping.append((p_replica_id, request))
    return prefill_request_mapping  # ← only the P-replica mapping is returned
```
schedule() only returns prefill_request_mapping. The request is initially sent to the P replica only. The D replica gets its ReplicaScheduleEvent later — from BatchEndEvent, not from here.
This is where the entire PD disagg logic lives. The original BatchEndEvent was 42 lines. The SimAI version is 151 lines. After the normal batch-end handling, it adds:
```python
# batch_end_event.py — SimAI version (simplified)
def handle_event(self, scheduler, metrics_store):
    # ① Normal Vidur logic (unchanged)
    replica_scheduler = scheduler.get_replica_scheduler(self._replica_id)
    self._batch.on_batch_end(self.time)
    replica_scheduler.on_batch_end(self._batch)
    events = [ReplicaScheduleEvent(self.time, self._replica_id)]
    # ② NEW: PD disagg cross-replica transfer
    if scheduler.__class__.__name__ == 'SplitwiseGlobalScheduler':
        for request in self._batch.requests:
            if request.is_prefill_complete \
                    and request.request_type == RequestType.DECODE \
                    and replica_scheduler.replica.replica_type == ReplicaType.PREFILL:
                # ③ Calculate KV cache transfer time
                request.pd_p2p_comm_size = request.estimate_kv_cache_size(
                    request.num_processed_tokens, replica_scheduler.replica)
                request.pd_p2p_comm_bandwidth = replica_scheduler.replica.pd_p2p_comm_bandwidth
                request.pd_p2p_comm_time = request.pd_p2p_comm_size / request.pd_p2p_comm_bandwidth
                # ④ Set D replica arrival = prefill done + KV transfer time
                request.decode_arrived_at = request.prefill_completed_at + request.pd_p2p_comm_time
                # ⑤ Create ReplicaScheduleEvent for D replica with DELAYED time
                events.append(
                    ReplicaScheduleEvent(request.decode_arrived_at, request.decode_replica_id))
                # ↑ note: the time is not self.time but a point in the future!
    return events
```
```python
# New fields added to Request
request_type = RequestType.PREFILL   # PREFILL → DECODE
prefill_replica_id = None
decode_replica_id = None
prefill_arrived_at = arrived_at
decode_arrived_at = float('inf')     # ← set later
pd_p2p_comm_size = float('inf')
pd_p2p_comm_time = float('inf')
pd_p2p_comm_bandwidth = 0

# New fields added to Replica
replica_type = ReplicaType.MIXED     # ← set to PREFILL or DECODE by scheduler
pd_p2p_comm_bandwidth = ...
pd_p2p_comm_dtype = ...
pd_node_ratio = ...
pending_requests = []
```
The request_type starts as PREFILL and changes to DECODE in on_batch_end() when prefill completes (same place as the +1 bonus). This flag is how BatchEndEvent knows to trigger the cross-replica transfer.
```
═══ P Replica (Prefill Pool) ═══
RequestArrivalEvent(t=0.0)
→ GlobalScheduleEvent → SplitwiseGlobalScheduler.schedule()
    request.prefill_replica_id = 0 (P pool)
    request.decode_replica_id  = 2 (D pool)
    return [(replica_id=0, request)]  ← sent to the P replica only
→ ReplicaScheduleEvent(t=0.0, replica=0)  ← P replica
→ BatchStageArrival → ReplicaStageSchedule
    execution_time = predict(batch with 512 prefill tokens)
→ BatchStageEndEvent(t=0.0 + 0.8s)
→ BatchEndEvent(t=0.8)  ← the biggest change is here
    ① Batch.on_batch_end() → Request.on_batch_end(num_tokens=512)
        processed: 0 → 512 → 513 (bonus +1)
        request_type: PREFILL → DECODE
    ② Detect: is_prefill_complete + DECODE type + on PREFILL replica
    ③ estimate_kv_cache_size(513 tokens) = 1.6 GB
    ④ pd_p2p_comm_time = 1.6 GB / 50 GB/s = 0.032s
    ⑤ decode_arrived_at = 0.8 + 0.032 = 0.832s
    ⑥ events = [
          ReplicaScheduleEvent(0.8,   replica=0),  ← P replica schedules its next batch
          ReplicaScheduleEvent(0.832, replica=2)   ← D replica starts after the KV transfer completes
       ]
```
```
═══ D Replica (Decode Pool) ═══
ReplicaScheduleEvent(t=0.832, replica=2)  ← from the BatchEndEvent above
→ on_schedule() → _get_next_batch()
    request from queue, is_prefill_complete=True → num_tokens=1
→ BatchStageArrival → ReplicaStageSchedule → BatchStageEnd → BatchEnd
    processed: 513 → 514
→ ReplicaScheduleEvent(t=..., replica=2)  ← D replica continues the decode loop
⋮ (D-1 iterations on the D replica)
```
| Aspect | Vidur (original) | SimAI (vidur-alibabacloud) |
|---|---|---|
| Replica pools | All replicas identical | P pool + D pool (split by pd_node_ratio) |
| schedule() | Assign to 1 replica | Assign to P replica + record D replica; only return P mapping |
| Prefill execution | Same replica as decode | P replica only |
| P→D handoff | — | BatchEndEvent detects prefill_complete on P replica, calculates KV transfer time, creates delayed ReplicaScheduleEvent for D replica |
| KV cache transfer | — | 2 × tokens × hidden_dim × layers × dtype_bytes / bandwidth |
| Decode execution | Same replica as prefill | D replica only (all D−1 iterations) |
| +1 bonus | Kept (correct for co-located) | Still kept — first decode token is not re-generated on D replica |
| New event types | — | None — reuses existing ReplicaScheduleEvent with delayed time |
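The KV-cache transfer formula in the table can be sketched as follows (parameter names are illustrative, not the fork's exact API):

```python
def kv_transfer_time_s(num_tokens: int, hidden_dim: int, num_layers: int,
                       dtype_bytes: int, bandwidth_bytes_per_s: float) -> float:
    # Factor of 2: one K and one V vector per token per layer
    kv_bytes = 2 * num_tokens * hidden_dim * num_layers * dtype_bytes
    return kv_bytes / bandwidth_bytes_per_s

# e.g. 513 tokens, fp16, on an assumed 4096-dim / 32-layer model over 50 GB/s:
kv_transfer_time_s(513, 4096, 32, 2, 50e9)
```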
The core idea: extend BatchEndEvent.handle_event() to detect the prefill→decode transition on a P replica, and inject a time-delayed ReplicaScheduleEvent targeting the D replica. The delay equals the KV cache transfer time. This is the second point in the entire simulator where time advances (the first being the execution_time in ReplicaStageScheduleEvent).
When a request sits idle in _preempted_requests between one batch ending and the next batch starting, this gap is captured at two granularities:
Accumulated in on_batch_stage_schedule() — measures the gap between the last stage completion and the next stage start:
```python
def on_batch_stage_schedule(self, time):
    self._latest_stage_scheduled_at = time
    if self._latest_stage_completed_at == 0:
        self._preempted_time = 0
    else:
        self._preempted_time += \
            time - self._latest_stage_completed_at
        # ↑ this is the waiting time
```
Reported as metric: request_preemption_time
Computed in on_batch_schedule() — measures per-iteration wait from last batch completion to next batch start:
```python
def on_batch_schedule(self, time):
    self._latest_iteration_scheduling_delay = \
        time - self._latest_iteration_completed_at
    # ↑ per-iteration waiting time
    if not self._scheduled:  # first time scheduled
        self._scheduling_delay = \
            time - self._arrived_at
        # ↑ wait from arrival to first being scheduled
```
Reported as metric: request_scheduling_delay
```
Request arrives at t=0.0
│
├─ on_batch_schedule(t=0.5)
│      scheduling_delay = 0.5 - 0.0 = 0.5s  ← arrival to first schedule
│
├─ on_batch_stage_schedule(t=0.5)
│      _latest_stage_completed_at == 0 → preempted_time = 0
│
├─ on_batch_stage_end(t=1.0)  ← prefill done
│      _latest_stage_completed_at = 1.0
│
│   ╌╌╌ request idles in _preempted_requests ╌╌╌ 0.8s gap
│
├─ on_batch_stage_schedule(t=1.8)  ← next batch starts
│      preempted_time += 1.8 - 1.0 = +0.8s  ← first gap recorded
│
├─ on_batch_stage_end(t=1.85)  ← decode iter done
│      _latest_stage_completed_at = 1.85
│
│   ╌╌╌ request idles in _preempted_requests ╌╌╌ 0.3s gap
│
├─ on_batch_stage_schedule(t=2.15)
│      preempted_time += 2.15 - 1.85 = +0.3s  ← second gap recorded
│
│   ⋮ (repeats for each decode iteration)
│
└─ Request completes. Final metrics:
       scheduling_delay = 0.5s                       ← arrival → first schedule
       preempted_time   = 0.8 + 0.3 + ... = Σ(all gaps)
       execution_time   = Σ(all stage execution times)
       e2e_time         = completed_at - arrived_at  ← includes everything
```
| Metric | Field | Formula | Meaning |
|---|---|---|---|
| request_scheduling_delay | _scheduling_delay | first scheduled_at − arrived_at | Initial queue wait before ever being batched |
| request_preemption_time | _preempted_time | Σ(stage_schedule_i − stage_end_{i−1}) | Cumulative idle time across ALL inter-batch gaps |
| request_execution_time | _execution_time | Σ(stage execution_time_i) | Cumulative GPU execution time |
| request_execution_plus_preemption_time | (computed) | execution_time + preempted_time | Total time from first batch to last batch (minus scheduling delay) |
| request_e2e_time | e2e_time | completed_at − arrived_at | Everything: scheduling_delay + execution + preemption |
Note that e2e_time ≈ scheduling_delay + execution_time + preempted_time, but only approximately: preempted_time is measured at the stage level while scheduling_delay is at the iteration level, and there's a small gap in the first iteration before the first stage schedule.
| Transition | New Event Time | Time Source |
|---|---|---|
| RequestArrival → GlobalSchedule | self.time | Same instant — scheduling is instantaneous |
| GlobalSchedule → ReplicaSchedule | self.time | Same instant — dispatching is instantaneous |
| ReplicaSchedule → BatchStageArrival | self.time | Same instant — batching is instantaneous |
| BatchStageArrival → ReplicaStageSchedule | self.time | Same instant — stage enqueue is instantaneous |
| ReplicaStageSchedule → BatchStageEnd | self.time + execution_time | ⏱ THE ONLY TIME ADVANCE — execution_time from ExecutionTimePredictor |
| BatchStageEnd → ReplicaStageSchedule | self.time | Same instant — release stage for next batch |
| BatchStageEnd → BatchStageArrival (next stage) | self.time | Same instant — inter-stage transfer is instantaneous |
| BatchStageEnd → BatchEnd (last stage) | self.time | Same instant — finalization is instantaneous |
| BatchEnd → ReplicaSchedule | self.time | Same instant — re-scheduling is instantaneous |
```python
# replica_stage_schedule_event.py, handle_event()
BatchStageEndEvent(
    self.time + self._batch_stage.execution_time,  # ← the ONLY time advance
    ...
)
```
Where execution_time comes from:
execution_time_predictor.get_execution_time(batch, stage_id) returns an ExecutionTime object with the total duration, which is stored in BatchStage.execution_time. Every simulated duration therefore depends entirely on the ExecutionTimePredictor's fidelity.
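A minimal sketch of what such an execution-time breakdown object might look like; the real ExecutionTime in entities/execution_time.py tracks many more components (attention, MLP, communication), so this is illustrative only:

```python
from dataclasses import dataclass

@dataclass
class ExecutionTime:
    model_time: float  # compute portion
    comm_time: float   # communication portion

    @property
    def total_time(self) -> float:
        # total duration used to schedule the BatchStageEndEvent
        return self.model_time + self.comm_time

et = ExecutionTime(model_time=0.7, comm_time=0.1)
```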
Events don't just create other events — they also mutate entity state through callback methods. Here's a map of which entity methods are called by which events.
| Callback Method | Called By | State Change |
|---|---|---|
| on_batch_schedule(time) | Batch.on_schedule() | Sets scheduled_at, scheduling_delay |
| on_batch_stage_schedule(time) | BatchStage.on_schedule() | Sets latest_stage_scheduled_at, calculates preempted_time |
| on_batch_stage_end(time, ...) | BatchStage.on_stage_end() | Accumulates execution_time, model_execution_time |
| on_batch_end(time, num_tokens) | Batch.on_batch_end() | Increments num_processed_tokens; if all done → completed=True |
Batch delegates to its requests:

- on_schedule(time) → calls request.on_batch_schedule() for each request
- on_batch_end(time) → calls request.on_batch_end() for each request with its num_tokens

BatchStage delegates to its requests:

- on_schedule(time) → calls request.on_batch_stage_schedule() for each request
- on_stage_end(time) → calls request.on_batch_stage_end() for each request

_write_output() is called automatically when the process exits (even on exceptions). It triggers the metrics pipeline:
| Action | Target | Output |
|---|---|---|
| Plot all metrics | metric_store.plot() | CDF/histogram PNGs |
| Write JSON trace | _write_event_trace() | event_trace.json |
| Write Chrome trace | _write_chrome_trace() | chrome_trace.json |
Every file involved in a standard simulation run, grouped by role.
| File | Role | Invocation Timing |
|---|---|---|
| vidur/main.py | Entry point | Once at startup |
| vidur/simulator.py | Orchestrator — init, event loop, output | Entire lifecycle |
| File | Role | Invocation Timing |
|---|---|---|
| vidur/config/config.py | SimulationConfig dataclass tree | Phase 1 — CLI parsing |
| vidur/config/model_config.py | LLM model specs | Phase 1 |
| vidur/config/device_sku_config.py | GPU/device specs | Phase 1 |
| vidur/config/node_sku_config.py | Node specifications | Phase 1 |
| vidur/config/base_poly_config.py | Polymorphic config base | Phase 1 |
| vidur/config/flat_dataclass.py | CLI argument flattening | Phase 1 |
| File | Role | Invocation Timing |
|---|---|---|
| vidur/entities/cluster.py | Creates and holds replicas | Phase 2 — init |
| vidur/entities/replica.py | Model + device config, constraints | Phase 2 — init |
| vidur/entities/request.py | Per-request state machine | Phase 3 (created) → Phase 5 (mutated) |
| vidur/entities/batch.py | Groups requests, delegates callbacks | Phase 5 — created by ReplicaScheduler |
| vidur/entities/batch_stage.py | Per-stage execution record | Phase 5 — created by ReplicaStageScheduleEvent |
| vidur/entities/execution_time.py | Execution time breakdown | Phase 5 — returned by predictor |
| File | Event | Invocation Timing |
|---|---|---|
| vidur/events/base_event.py | Abstract base | Every event inherits |
| vidur/events/request_arrival_event.py | RequestArrival | Once per request |
| vidur/events/global_schedule_event.py | GlobalSchedule | After each request arrival |
| vidur/events/replica_schedule_event.py | ReplicaSchedule | After global schedule + after each batch end |
| vidur/events/batch_stage_arrival_event.py | BatchStageArrival | Per batch per pipeline stage |
| vidur/events/replica_stage_schedule_event.py | ReplicaStageSchedule | Per stage execution |
| vidur/events/batch_stage_end_event.py | BatchStageEnd | After stage execution_time elapses |
| vidur/events/batch_end_event.py | BatchEnd | After last stage completes |
| File | Role | Invocation Timing |
|---|---|---|
| scheduler/global_scheduler/base_global_scheduler.py | Base class + init cascade | Phase 2 (init) + Phase 5 (every event) |
| scheduler/global_scheduler/{random,round_robin,lor}.py | Request → replica assignment | Phase 5 — GlobalScheduleEvent |
| scheduler/replica_scheduler/base_replica_scheduler.py | Batching logic base | Phase 5 — ReplicaScheduleEvent, BatchEndEvent |
| scheduler/replica_scheduler/{vllm,sarathi,orca,...}.py | Concrete batching strategies | Phase 5 — on_schedule() |
| scheduler/replica_stage_scheduler/replica_stage_schduler.py | Per-stage batch queue | Phase 5 — stage arrival/schedule/end events |
| File | Role | Invocation Timing |
|---|---|---|
| request_generator/base_request_generator.py | Abstract generator | Phase 3 |
| request_generator/synthetic_request_generator.py | Generate from distributions | Phase 3 — generate() |
| request_generator/trace_replay_request_generator.py | Replay from CSV | Phase 3 — generate() |
| File | Role | Invocation Timing |
|---|---|---|
| execution_time_predictor/base_execution_time_predictor.py | Computes attention/MLP/comm time | Phase 5 — ReplicaStageScheduleEvent |
| execution_time_predictor/{linear_regression,random_forrest}.py | Concrete predictors | Phase 5 — get_execution_time() |
| File | Role | Invocation Timing |
|---|---|---|
| vidur/metrics/metrics_store.py | Collect metrics from all events | Phase 2 (init) + Phase 5 (every event) + Phase 6 (plot) |
| vidur/utils/random.py | Seed management | Phase 1 |
| vidur/logger.py | Logging setup | All phases |