Request Lifecycle & I/O Formats

Step-by-step walkthrough of how a request flows through SimAI's 6 stages, with detailed input/output file formats, CSV headers, backend comparison, and configuration parameters at each step.

vidur-alibabacloud · AICB · astra-sim · NS-3

4. End-to-End Request Lifecycle

When a simulated inference request enters SimAI, it travels through six distinct stages spanning all five components. The walkthrough below traces this journey from arrival to metrics collection.

Request Lifecycle Flow Across Components
[Diagram: vidur → AICB / astra-sim → SimCCL → NS-3. Step 1 Request Arrival → Step 2 Global Scheduling → Step 3 Batch Formation → Step 4a AICB Compute Profile + Step 4b astra-sim Comm Estimation → Step 5 KV Cache Transfer (NS-3 packet-level RDMA P2P simulation) → Step 6 Metrics Collection, with timing feedback returning to vidur.]
Step 1: Request Arrival

vidur-alibabacloud

A RequestArrivalEvent fires in the discrete-event simulation loop. The event carries num_prefill_tokens, num_decode_tokens, and an arrival timestamp. A Request entity is created with an empty DAG (nx.DiGraph()) that will be populated as the request progresses through the system.

Input Sources & File Formats

Requests are produced by a Request Generator (vidur/request_generator/). SimAI supports two top-level modes:

Mode A — SYNTHETIC (programmatic generation)

Arrival intervals and token lengths are generated independently. You pick one Interval Generator and one Length Generator.

Interval Generators

POISSON
  Params: qps (float, default 0.5), seed
  Logic:  interval = −ln(1−rand()) / qps, capped at 3σ
  Source: vidur/request_generator/poisson_request_interval_generator.py

GAMMA
  Params: qps (float, default 0.2), cv (float, default 0.5), seed
  Logic:  shape = 1/cv², scale = 1/(qps×shape)
  Source: vidur/request_generator/gamma_request_interval_generator.py

STATIC
  Params: fixed interval value
  Logic:  constant inter-arrival time
  Source: vidur/request_generator/static_request_interval_generator.py

TRACE
  Params: trace_file (CSV path), start_time, end_time, time_scale_factor
  Logic:  reads CSV column arrival_time (datetime), computes inter-arrival times via .diff(), applies time_scale_factor
  Source: vidur/request_generator/trace_request_interval_generator.py
Trace Interval CSV format: Must contain an arrival_time column with datetime strings (e.g. 2021-01-04 12:00:15). Default trace: data/processed_traces/AzureFunctionsInvocationTraceForTwoWeeksJan2021Processed.csv
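The POISSON interval formula above is the standard inverse-CDF sample of an exponential distribution. A minimal sketch (the function name and cap derivation are illustrative; the cap uses the fact that Exp(qps) has standard deviation 1/qps):

```python
import math
import random

def poisson_interval(qps: float, rng: random.Random) -> float:
    """Sample an inter-arrival time: interval = -ln(1 - rand()) / qps,
    capped at 3 sigma (sigma of Exp(qps) is 1/qps, so the cap is 3/qps)."""
    interval = -math.log(1.0 - rng.random()) / qps
    return min(interval, 3.0 / qps)

# Accumulate five arrival timestamps at qps = 0.5 (one request every ~2 s)
rng = random.Random(42)
arrivals, t = [], 0.0
for _ in range(5):
    t += poisson_interval(0.5, rng)
    arrivals.append(t)
```

At qps = 0.5 every sampled interval is at most 3/0.5 = 6 seconds, which keeps a single extreme draw from stalling the simulated arrival stream.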
Length Generators

FIXED
  Params: prefill_tokens (int, default 2048), decode_tokens (int, default 512)
  Logic:  returns a constant (prefill, decode) pair
  Source: vidur/request_generator/fixed_request_length_generator.py

ZIPF
  Params: min_tokens (1024), max_tokens (4096), theta (0.6), prefill_to_decode_ratio (20.0), scramble, seed
  Logic:  Zipf-distributed total tokens, split by ratio
  Source: vidur/request_generator/zipf_request_length_generator.py

UNIFORM
  Params: min_tokens (1024), max_tokens (4096), prefill_to_decode_ratio (20.0), seed
  Logic:  uniform distribution between min/max, split by ratio
  Source: vidur/request_generator/uniform_request_length_generator.py

TRACE
  Params: trace_file (CSV path), prefill_scale_factor (1.0), decode_scale_factor (1.0), max_tokens (4096), seed
  Logic:  reads CSV columns num_prefill_tokens, num_decode_tokens; applies scale factors and clips
  Source: vidur/request_generator/trace_request_length_generator.py
Trace Length CSV format: Must contain num_prefill_tokens and num_decode_tokens integer columns. Default trace: data/processed_traces/sharegpt_8k_filtered_stats_llama2_tokenizer.csv
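The "split by ratio" step in the UNIFORM and ZIPF generators can be sketched as follows. This is illustrative, not vidur's code; the rounding rule (and the function name) are assumptions, but the invariant is the documented one: prefill/decode ≈ prefill_to_decode_ratio.

```python
import random

def uniform_length(min_tokens: int, max_tokens: int,
                   prefill_to_decode_ratio: float, rng: random.Random):
    """Draw a total token count uniformly, then split it so that
    prefill/decode approximates the configured ratio."""
    total = rng.randint(min_tokens, max_tokens)
    decode = max(1, round(total / (prefill_to_decode_ratio + 1)))
    prefill = total - decode
    return prefill, decode

prefill, decode = uniform_length(1024, 4096, 20.0, random.Random(0))
```

With the default ratio of 20.0, roughly 1/21 of the total tokens become decode tokens, matching prompt-heavy production traces.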

Mode B — TRACE_REPLAY (replay a recorded trace)

Reads arrival time and token counts from a single CSV file. Config: TraceRequestGeneratorConfig.

Trace Replay CSV format (e.g. data/processed_traces/splitwise_conv.csv):
arrived_at,num_prefill_tokens,num_decode_tokens
0.0102006,1024,10
0.0105234,2048,15
0.0215440,1536,8
Columns: arrived_at (float, seconds) — absolute arrival timestamp; num_prefill_tokens (int) — input/prompt tokens; num_decode_tokens (int) — output/generation tokens.
Config params: trace_file, prefill_scale_factor (1.0), decode_scale_factor (1.0), time_scale_factor (1.0), max_tokens (4096), seed (42).
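A hypothetical loader showing how the trace-replay config parameters interact with the CSV columns (this mirrors the documented behavior; the function name and rounding are assumptions, not vidur's actual implementation):

```python
import csv

def load_trace_replay(path: str, prefill_scale: float = 1.0,
                      decode_scale: float = 1.0, time_scale: float = 1.0,
                      max_tokens: int = 4096):
    """Read (arrived_at, num_prefill_tokens, num_decode_tokens) rows,
    scaling timestamps and token counts, and clipping tokens at max_tokens."""
    requests = []
    with open(path) as f:
        for row in csv.DictReader(f):
            prefill = min(round(int(row["num_prefill_tokens"]) * prefill_scale), max_tokens)
            decode = min(round(int(row["num_decode_tokens"]) * decode_scale), max_tokens)
            requests.append((float(row["arrived_at"]) * time_scale, prefill, decode))
    return requests
```

A time_scale_factor of 0.5 compresses the trace to half its wall-clock duration, doubling the effective request rate without touching token counts.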

CLI Examples

# Synthetic: Poisson arrival + Fixed length
python -m vidur.main \
  --request_generator_config_type synthetic \
  --interval_generator_config_type poisson \
  --poisson_request_interval_generator_config_qps 100 \
  --length_generator_config_type fixed \
  --fixed_request_length_generator_config_prefill_tokens 1024 \
  --fixed_request_length_generator_config_decode_tokens 10

# Trace Replay
python -m vidur.main \
  --request_generator_config_type trace_replay \
  --trace_request_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \
  --trace_request_generator_config_time_scale_factor 0.5

Output → Request Object

Defined in vidur/entities/request.py. At this stage only the identity and token fields are set; timing/DAG fields are populated in later steps.

Request(
  # ── Identity & tokens (set at creation) ──
  _id: int                    # auto-incremented unique ID
  _arrived_at: float            # arrival timestamp (seconds)
  _num_prefill_tokens: int       # input / prompt tokens
  _num_decode_tokens: int        # output / generation tokens
  _num_processed_tokens: int = 0 # tokens processed so far

  # ── DAG (empty now, populated in Step 2) ──
  dag: nx.DiGraph()              # task dependency graph
  node_id: int = 0              # counter for unique node IDs in DAG
  nodes: dict = {}              # node_id → Node object mapping
  root_node: Node = None        # entry point (PromptTask)
  request_type: RequestType      # PREFILL=1 / DECODE=2 / MIXED=0

  # ── Replica assignment (set in Step 2) ──
  prefill_replica_id: int = None
  decode_replica_id: int = None

  # ── Timing fields (populated in Steps 3–6) ──
  _scheduled_at: float = 0          # first scheduling time
  _prefill_completed_at: float = 0  # TTFT base timestamp
  _completed_at: float = 0          # request completion time
  _execution_time: float = 0        # GPU execution time
  _model_execution_time: float = 0  # model-specific execution time
  _scheduling_delay: float = 0      # wait before first scheduling
  _preempted_time: float = 0        # time lost to preemption
  prefill_arrived_at: float         # prefill start (= arrived_at)
  decode_arrived_at: float = inf    # decode start (after KV transfer)

  # ── PD communication (set in Step 5) ──
  pd_p2p_comm_size: float = inf    # KV cache transfer bytes
  pd_p2p_comm_time: float = inf    # KV cache transfer time (s)
  pd_p2p_comm_bandwidth: float = 0 # effective bandwidth
  pd_p2p_bytes_per_token: int = None # 2 (fp16) / 4 (fp32) / 8 (fp64)
  pd_p2p_comm_dtype: str = None   # "float16", "float32", etc.

  # ── Status flags ──
  _scheduled: bool = False
  _preempted: bool = False
  _completed: bool = False
  _is_prefill_complete: bool = False
  _num_restarts: int = 0
)

The generator also creates a RequestArrivalEvent(_time=arrived_at, _request=request) which is pushed into the simulation's min-heap event queue.

Step 2: Global Scheduling

vidur-alibabacloud

The global scheduler routes the request to a replica pool. In PD disaggregation mode, it separates prefill and decode into different replica pools.

Input & Configuration

Input: Request object from Step 1
Global Scheduler Types: RANDOM, ROUND_ROBIN, LOR (Least Outstanding Requests), SPLIT_WISE (PD-aware)
Key Config:
  pd_node_ratio (float, default 0.5): fraction of replicas dedicated to prefill
  num_replicas (int, default 1): total GPU replicas in cluster

Output → Populated DAG

Sets request.prefill_replica_id and request.decode_replica_id, then builds the DAG. The exact DAG shape depends on whether the prefill and decode replicas are the same or different (code in splitwise_global_scheduler.py):

Case A: Same Replica (no PD split)

PromptTask ──→ TokenTask
  # linked via prefill_task.chain = [decode_task]
  # no KV transfer needed

Case B: Different Replicas (PD disaggregation)

PromptTask ──→ KVCacheTransferFlow ──→ TokenTask
  # edges in request.dag (nx.DiGraph)
  # KV transfer bridges the two replicas
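The two DAG shapes can be sketched with networkx (which vidur uses for request.dag). The string node labels below stand in for the real PromptTask/Flow/TokenTask entities, and the function name is illustrative:

```python
import networkx as nx

def build_request_dag(prefill_replica_id: int, decode_replica_id: int) -> nx.DiGraph:
    """Case A (same replica): PromptTask -> TokenTask, no KV transfer.
    Case B (PD split): PromptTask -> KVCacheTransferFlow -> TokenTask."""
    dag = nx.DiGraph()
    if prefill_replica_id == decode_replica_id:
        dag.add_edge("PromptTask", "TokenTask")
    else:
        dag.add_edge("PromptTask", "KVCacheTransferFlow")
        dag.add_edge("KVCacheTransferFlow", "TokenTask")
    return dag

dag = build_request_dag(0, 1)  # different replicas → Case B
```

Because the DAG is a simple chain in both cases, a topological sort yields the exact execution order the scheduler must respect.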

DAG Node Structures

vidur/entities/task.py
vidur/entities/flow.py
vidur/entities/node.py
# ── Base Node (all DAG nodes inherit from this) ──
Node(BaseEntity):
  node_id: int                # unique ID within this request's DAG
  state: NodeState             # NONE=0 → QUEUED=1 → RUNNING=2 → COMPLETED=4
                               #   (or BLOCKED=3 if preempted, ABORTED=5)
  request: Request             # back-reference to owning request
  chain: list[Node]            # consecutive nodes (used in same-replica case)
  num_preemptions: int = 0

# ── PromptTask (prefill stage) ──
PromptTask(Task → Node):
  task_type: TaskType = PROMPT # = 1
  prompt_size: int             # = request.num_prefill_tokens
  tokens_per_iteration: int    # = prompt_size (process all at once)
  processing_tokens: int = 0   # tokens being processed this iteration
  processed_tokens: int = 0    # total tokens processed
  generating_tokens: int = 0   # tokens being generated this iteration
  generated_tokens: int = 0    # total generated (max 1 for prefill)
  is_prefill_complete: bool = False
  cleanup_memory: bool = False # keep KV cache for decode
  batch_size: int = 1
  duration: float = 0.0       # set by execution time predictor

  # memory = 2 × num_tokens × mlp_hidden_dim × num_layers × bytes_per_token

# ── TokenTask (decode stage) ──
TokenTask(Task → Node):
  task_type: TaskType = TOKEN  # = 2
  token_size: int              # = request.num_decode_tokens − 1
                               #   (1 token already generated in prefill)
  tokens_per_iteration: int = 1 # auto-regressive: 1 token at a time
  is_prefill_complete: bool = True
  # same processing/generated fields as PromptTask

# ── KVCacheTransferFlow (PD transfer) ──
KVCacheTransferFlow(Flow → Node):
  flow_type: FlowType = KVCacheTransfer  # = 1
  src: Replica                 # prefill replica
  dest: Replica                # decode replica
  size: float                  # KV cache size in bytes
                               #   = estimate_kv_cache_size(prefill_tokens, replica)
  duration: float = 0.0        # set after transfer simulation
  notify: bool = False
  batch_size: int = 1

After DAG construction, the global scheduler emits a GlobalScheduleEvent which triggers ReplicaScheduleEvent(s) for the assigned replica(s).

Step 3: Replica Scheduling + Batching

vidur-alibabacloud

The replica scheduler on the target replica builds a batch from waiting requests, considering KV cache memory, batch size limits, and token limits.

Input & Configuration

Input: list of Request objects with DAG nodes assigned to this replica
Replica Scheduler Types: SARATHI (chunk_size=512), VLLM (max_tokens_in_batch=4096), ORCA, LIGHTLLM (max_waiting_iters=10), FASTER_TRANSFORMER, SPLIT_WISE (max_tokens_in_batch=4096)
Common Config: batch_size_cap (128), block_size (16), watermark_blocks_fraction (0.01), num_blocks (optional)

Output → Batch Object + Event Chain

vidur/entities/batch.py
Batch(BaseEntity):
  _replica_id: int              # which replica executes this batch
  _requests: List[Request]       # requests grouped into this batch
  _num_tokens: List[int]         # tokens per request in this batch
  _total_num_tokens: int         # sum of all tokens
  _num_prefill_tokens: int       # prefill tokens (non-completed only)
  _total_num_tokens_rounded: int # tokens rounded to 8-token boundary
  _scheduled_at: float = None   # when batch was scheduled
  _completed_at: float = None   # when batch completed

  # Properties:
  num_decode_tokens → _total_num_tokens − _num_prefill_tokens
  all_requests_completed → bool  # True if all done
  preempted_requests → List[Request]
  completed_requests → List[Request]

Event Chain (discrete-event simulation)

The simulation is event-driven. Each event generates successor events that are pushed into a min-heap sorted by timestamp. The full chain from scheduling to completion:

vidur/events/
# ── Event chain for a single batch ──

RequestArrivalEvent(time, request)
  → GlobalScheduleEvent(time)                  # route to replica
    → ReplicaScheduleEvent(time, replica_id)   # build batch
      → BatchStageArrivalEvent(time, replica_id, stage_id=0, batch)
        → ReplicaStageScheduleEvent(time, replica_id, stage_id)
          # calls execution_time_predictor → gets duration
          → BatchStageEndEvent(time + duration, replica_id, stage_id, batch)

# If pipeline has more stages (PP > 1):
BatchStageEndEvent → BatchStageArrivalEvent(next stage) + ReplicaStageScheduleEvent(current)

# If last pipeline stage:
BatchStageEndEvent → BatchEndEvent(time, replica_id, batch)

# BatchEndEvent handles PD logic:
BatchEndEvent:
  # For each request that completed prefill:
  #   calculates pd_p2p_comm_time
  #   sets request.decode_arrived_at = prefill_completed_at + pd_p2p_comm_time
  #   → ReplicaScheduleEvent(decode_arrived_at, decode_replica_id)
BaseEvent fields: Every event carries _time (float), _id (auto-incremented int), _event_type (EventType enum), and _priority_number = (time, id, event_type) for min-heap ordering.
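The (time, id, event_type) priority tuple can be demonstrated with Python's heapq directly; the event-type strings below are illustrative stand-ins for the EventType enum. Tuples compare element-wise, so the auto-incremented id breaks ties between simultaneous events in creation order:

```python
import heapq

# Push three events out of order; two share the same timestamp.
queue = []
heapq.heappush(queue, (0.5, 2, "BATCH_STAGE_END"))
heapq.heappush(queue, (0.5, 1, "REPLICA_SCHEDULE"))
heapq.heappush(queue, (0.1, 3, "REQUEST_ARRIVAL"))

# Pops come back sorted by time, then by id for equal timestamps.
order = [heapq.heappop(queue) for _ in range(len(queue))]
```

This is why the simulation is deterministic for a fixed seed: even ties at identical timestamps resolve the same way on every run.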
Step 4: Execution Time Prediction

vidur-alibabacloud aicb astra-sim

This step answers: “How long does this batch take to execute?” The answer is always composed of two parts:

Core formula (all backends):
execution_time = (block_time × num_layers_per_stage + pp_comm_time) / 1000
where block_time = attention_layer_time + mlp_layer_time + add_time
and each layer time = compute time (kernel execution on GPU) + communication time (collective ops like AllReduce across GPUs).

The 4 backends differ in how they estimate these two components.

Backend Comparison: Compute

vidur
  Compute method: RandomForest trained on GPU profiling CSVs → O(1) lookup table
  Granularity:    homogeneous (all layers identical)
  Speed:          fastest
aicb
  Compute method: pre-profiled per-layer CSV (real GPU, no ML training)
  Granularity:    per-layer heterogeneous + phase-aware (prefill/decode)
  Speed:          very fast
simai_analytical
  Compute method: same as vidur (identical RandomForest compute path)
simai_simulation
  Compute method: same as vidur (identical RandomForest compute path)
Backend Comparison: Communication (per parallelism type)

vidur
  TP (AllReduce):    ✓ RandomForest on all_reduce.csv
  PP (SendRecv):     ✓ RandomForest on send_recv.csv
  EP/MoE (AllToAll): ✗ not modeled
aicb
  TP (AllReduce):    ✗ = 0 (TODO in code)
  PP (SendRecv):     ✓ RandomForest on send_recv.csv
  EP/MoE (AllToAll): ✓ comm_size / bandwidth
simai_analytical
  TP (AllReduce):    ✓ SimAI_analytical binary + busbw.yaml
  PP (SendRecv):     ✓ RandomForest on send_recv.csv
  EP/MoE (AllToAll): ✗ not modeled
simai_simulation
  TP (AllReduce):    ✓ NS-3 packet-level simulation
  PP (SendRecv):     ✓ RandomForest on send_recv.csv
  EP/MoE (AllToAll): ✗ not modeled
Key insights:
1. The 4 backends are mutually exclusive — no mixing (e.g. you cannot combine AICB compute with NS-3 communication).
2. vidur, simai_analytical, simai_simulation share the same RandomForest compute path; they only differ in TP communication estimation.
3. aicb is the only backend that models EP/MoE AllToAll communication (via bandwidth formula), but it currently does NOT model TP AllReduce (= 0 with TODO in base_execution_time_predictor.py:84).
4. PP send_recv is handled identically by all 4 backends (RandomForest lookup on send_recv.csv).

Backend A: vidur (RandomForest on profiling CSVs)

vidur/execution_time_predictor/sklearn_execution_time_predictor.py

Trains 11 separate RandomForest models from pre-collected GPU profiling data. After training, predictions are pre-generated for all possible token counts and stored as O(1) lookup tables (Python dicts).

# Flow:
Batch(batch_size, num_tokens_per_request)
  → extract features (num_tokens, kv_cache_size)
  → lookup pre-computed dict: {(num_tokens,): time_ms}
  → sum across 11 sub-models per layer
  → execution_time
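The pre-generated lookup design can be sketched with a toy stand-in for the trained RandomForest (the linear cost function below is purely illustrative; vidur's real predictions come from the profiling CSVs):

```python
# Toy stand-in for a trained model: any callable f(num_tokens) -> milliseconds.
def toy_model_ms(num_tokens: int) -> float:
    return 0.01 * num_tokens + 0.5

# Pre-generate predictions for every possible token count once, then do
# O(1) dict lookups at simulation time, mirroring vidur's cached design.
lookup = {(n,): toy_model_ms(n) for n in range(1, 4097)}

def batch_time_ms(tokens_per_request) -> float:
    """Sum the per-request predictions for one batch."""
    return sum(lookup[(n,)] for n in tokens_per_request)
```

Paying the prediction cost once up front is what makes the per-event hot path of the discrete-event loop a dictionary access rather than a model inference.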

Input File 1: Compute Profiling CSVs

Pre-collected by running each model operator on a real GPU at various token counts. The RandomForest learns f(num_tokens) → time_ms for each operator.

# File: data/profiling/compute/{DEVICE}/{MODEL}/mlp.csv
# Example: data/profiling/compute/h100/meta-llama/Llama-2-7b-hf/mlp.csv
#
# Feature columns (input to RandomForest):
num_tokens                          # int — tokens in this profiling run
num_tensor_parallel_workers         # int — TP degree used during profiling
n_head, n_kv_head, n_embd           # model architecture params
n_expanded_embd, vocab_size         # model architecture params
use_gated_mlp                       # bool

# Target columns (what RandomForest predicts) — each with min/max/mean/median/std:
time_stats.emb.*                    # embedding layer time
time_stats.input_layernorm.*        # input LayerNorm time
time_stats.attn_pre_proj.*          # QKV projection time
time_stats.attn_rope.*              # rotary position encoding time
time_stats.attn_post_proj.*         # output projection time
time_stats.post_attention_layernorm.*  # post-attention LayerNorm time
time_stats.mlp_up_proj.*            # MLP up-projection time
time_stats.mlp_act.*                # activation function time
time_stats.mlp_down_proj.*          # MLP down-projection time
time_stats.add.*                    # residual add time
# File: data/profiling/compute/{DEVICE}/{MODEL}/attention.csv
#
# Additional feature columns:
num_tokens, kv_cache_size           # input features

# Target columns:
time_stats.attn_prefill.*           # prefill attention kernel time
time_stats.attn_decode.*            # decode attention kernel time
time_stats.attn_kv_cache_save.*     # KV cache write time

Input File 2: Network Profiling CSVs

Pre-collected by running collective operations at various message sizes. Also trained via RandomForest: f(message_size) → time_ms.

# File: data/profiling/network/{NETWORK_DEVICE}/all_reduce.csv
# Example: data/profiling/network/h100_pairwise_nvlink/all_reduce.csv
#
# Columns:
size                                # int — message size in bytes
num_workers                         # int — number of GPUs
rank                                # int — GPU rank
collective                          # str — operation type ("all_reduce")
devices_per_node                    # int — GPUs per machine
max_devices_per_node                # int
time_stats.all_reduce.min           # float — min observed time (ms)
time_stats.all_reduce.max           # float
time_stats.all_reduce.mean          # float — training target
time_stats.all_reduce.median        # float
time_stats.all_reduce.std           # float

# File: data/profiling/network/{NETWORK_DEVICE}/send_recv.csv
# Same structure but for point-to-point send/recv (used for PP communication)
RandomForest training config: GridSearchCV with n_estimators ∈ {250,500,750}, max_depth ∈ {8,16,32}, min_samples_split ∈ {2,5,10}. Scoring = MAPE. Trained models are cached as pickle files in cache/.

Backend B: aicb (Pre-profiled Per-Layer Lookup)

vidur/entities/execution_time.py
aicb/

Completely different architecture. No RandomForest training. Instead, AICB pre-profiles every model layer on a real GPU (DeepGEMM for matmul, FlashMLA for attention) and saves per-layer compute time + comm size to a CSV. At simulation time, it does a direct lookup per layer.

# Flow:
Batch(batch_size, num_tokens, phase=prefill|decode)
  → determine AICB CSV file by (model, tp, pp, ep, bs, seq, phase)
  → load CSV: per-layer comp_time + comm_size
  → for each layer:
  │    compute = comp_time (from CSV)
  │    comm    = comm_size / bandwidth (from replica config)
  │    layer_time = compute + comm
  → sum across all layers
  → execution_time

Input File: AICB Pre-profiled CSV

# File: results/workload/vidur-{model}-world_size{ws}-tp{tp}-pp{pp}-ep{ep}-bs{bs}-seq{seq}-{phase}.csv
# Example: results/workload/vidur-deepseek-671B-world_size16-tp2-pp1-ep8-bs1-seq1024-prefill.csv
#
# Format: TAB-separated, no header row
# Columns:
layer_id    # int — layer index (0, 1, 2, ...)
layer_name  # str — "attention", "mlp", "moe", etc.
comp_time   # float — compute time in NANOSECONDS (from real GPU profiling)
comm_size   # float — communication size in BYTES (for collective ops)

# Example rows:
0	attention	125000	0
0	mlp	98000	0
0	moe	210000	67108864
1	attention	125000	0
1	mlp	98000	0
1	moe	210000	67108864

How comm_time is calculated from comm_size

# For MoE layers (expert parallelism):
if phase == "prefill":
    bandwidth = rdma_bandwidth    # cross-node (default 800 Gbps)
else:  # decode
    bandwidth = nvlink_bandwidth  # intra-node (default 1600 Gbps)

moe_comm_time = comm_size / (bandwidth * 1024**3 / 8)  # Gbps → Bytes/s
layer_time = comp_time * 1e-9 + moe_comm_time  # ns → s

# For attention/MLP layers:
# TP communication time = 0 (currently disabled in code)
layer_time = comp_time * 1e-9
Key differences from vidur: (1) No ML training — direct profiled values; (2) Per-layer heterogeneous — each layer can have different times (critical for MoE models where layers alternate between dense and expert); (3) Phase-aware — separate CSVs for prefill vs decode; (4) Supports models vidur cannot: DeepSeek-671B, Qwen3-Moe-235B, Qwen3-Next-80B.
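The AICB lookup path above can be condensed into a short sketch: parse the tab-separated profile, then sum per-layer compute plus MoE AllToAll time using the phase-dependent bandwidth rule. Function names are illustrative, not AICB's API:

```python
import csv

def load_aicb_layers(path: str):
    """Parse the headerless TSV: layer_id, layer_name, comp_time (ns), comm_size (bytes)."""
    with open(path) as f:
        return [(int(r[0]), r[1], float(r[2]), float(r[3]))
                for r in csv.reader(f, delimiter="\t")]

def aicb_execution_time_s(layers, phase: str,
                          rdma_gbps: float = 800, nvlink_gbps: float = 1600) -> float:
    """RDMA bandwidth for prefill, NVLink for decode; TP comm is 0 as in the code."""
    gbps = rdma_gbps if phase == "prefill" else nvlink_gbps
    bytes_per_s = gbps * 1024**3 / 8  # Gbps → bytes/s
    total = 0.0
    for _, name, comp_ns, comm_bytes in layers:
        t = comp_ns * 1e-9            # ns → s
        if name == "moe":
            t += comm_bytes / bytes_per_s
        total += t
    return total
```

For the example layer-0 rows above (attention 125 µs, mlp 98 µs, moe 210 µs + 64 MiB AllToAll), the prefill-phase total is 433 µs of compute plus 625 µs of MoE communication.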

Backend C: simai_analytical (RandomForest compute + analytical network model)

vidur/execution_time_predictor/communication_time_predictor.py

Compute time: identical to vidur (same RandomForest, same profiling CSVs). Communication time: replaces vidur's RandomForest with an external SimAI_analytical binary that uses a topology-aware analytical model.

# Flow:
Batch(batch_size, num_tokens)
  → [COMPUTE] same RandomForest as vidur → comp_time
  → [COMM] generate workload file (AllReduce size = hidden_dim × tokens × 2 bytes)
  →        run: SimAI_analytical -w {workload} -g {world_size} -g_p_s 8
  →        parse output CSV → comm_time (μs → ms)
  → execution_time = comp_time + comm_time + NCCL_overhead

Input File 1: Same profiling CSVs as vidur

Uses the exact same mlp.csv, attention.csv files described in Backend A for compute time.

Input File 2: Bus Bandwidth YAML (for comm time)

Specifies the effective bandwidth (GB/s) for each collective operation within each parallelism group. The analytical model uses these values to estimate communication latency without packet-level simulation.

# File: example/busbw.yaml

test                          # scenario name (first line)
TP:                           # Tensor Parallelism group
  allreduce,: 300             # AllReduce bandwidth in GB/s
  allgather,: 280             # AllGather bandwidth in GB/s
  reducescatter,: 280         # ReduceScatter bandwidth in GB/s
  alltoall,: 230              # AllToAll bandwidth in GB/s
EP:                           # Expert Parallelism group
  allreduce,: null
  allgather,: 45
  reducescatter,: 45
  alltoall,: 80
PP:                           # Pipeline Parallelism group
  busbw: 47.5                 # single value for send_recv (GB/s)
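The analytical model reduces each collective to a first-order bandwidth formula. A sketch, assuming time = message size / effective bus bandwidth (the real SimAI_analytical binary is topology-aware, so treat this as the zeroth-order intuition only):

```python
def analytical_collective_time_s(size_bytes: int, busbw_gb_per_s: float) -> float:
    """First-order estimate: message size divided by effective bus bandwidth (GB/s)."""
    return size_bytes / (busbw_gb_per_s * 1e9)

# TP AllReduce message of hidden_dim x tokens x 2 bytes (per the workload
# generation rule above), at the example TP allreduce busbw of 300 GB/s.
# hidden_dim=4096 and tokens=1024 are illustrative values.
size = 4096 * 1024 * 2
t = analytical_collective_time_s(size, 300)
```

At 300 GB/s an 8 MiB AllReduce costs roughly 28 µs, which is why the analytical backend resolves each collective in O(1) rather than simulating packets.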

Input File 3: SimAI Configuration

# Topology:  example/topo  (network topology definition)
# Config:    astra-sim-alibabacloud/inputs/config/SimAI.conf
#            (collective algorithm selection, buffer sizes, etc.)

Output of SimAI_analytical binary

# Output: results/analytical_EndToEnd_{id}.csv
# Parser reads: last row, 6th column (index 5) = total comm time in μs
# Converts to ms: latency = float(rows[-1][5]) * 1e-3

# Results are cached by MD5(workload_params + topo + conf)
# to avoid redundant runs
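The parsing and caching rules above can be sketched directly (function names are illustrative; the column index and MD5 keying follow the documented behavior):

```python
import csv
import hashlib

def cache_key(workload_params: str, topo: str, conf: str) -> str:
    """Key results by MD5 of the inputs so identical runs hit the cache."""
    return hashlib.md5((workload_params + topo + conf).encode()).hexdigest()

def parse_analytical_output_ms(path: str) -> float:
    """Last row, 6th column (index 5) holds total comm time in microseconds."""
    with open(path) as f:
        rows = list(csv.reader(f))
    return float(rows[-1][5]) * 1e-3  # µs → ms
```

Since the binary's runtime dominates the predictor, the MD5 cache turns repeated batches with identical shapes into pure lookups.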

Backend D: simai_simulation (RandomForest compute + NS-3 packet-level simulation)

vidur/execution_time_predictor/communication_time_predictor.py

Compute time: identical to vidur & simai_analytical. Communication time: replaces the analytical formula with a full NS-3 packet-level RDMA simulation. Same workload generation as simai_analytical, but runs through NS-3 instead.

# Flow (only comm path differs from simai_analytical):
Batch(batch_size, num_tokens)
  → [COMPUTE] same RandomForest as vidur → comp_time
  → [COMM] generate same workload file
  →        run: AS_SEND_LAT=6 AS_NVLS_ENABLE=1 \
  →              SimAI_simulator -t 16 -w {workload} -n {topo} -c {conf}
  →        parse output CSV → comm_time (μs → ms)
  → execution_time = comp_time + comm_time + NCCL_overhead

# Output: results/ncclFlowModel_EndToEnd_{id}.csv
# Parser reads: last row, 2nd column (index 1) = total comm time in μs
vs simai_analytical: The only difference is the binary called. SimAI_analytical uses bandwidth formulas for O(1) estimation; SimAI_simulator runs a full NS-3 simulation that models packet-level congestion, ECMP routing, and PFC back-pressure. More accurate for large-scale topologies with contention, but orders of magnitude slower.

Appendix: AICB Training Workload File Format

example/workload_analytical.txt

This text format is used by the astra-sim / SimAI binaries for training workload simulation (not inference). Included here for reference as it shares the same SimAI ecosystem.

# Line 1: model config metadata
HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 16 pp: 12 ...
# Line 2: total number of operations
1789

# Remaining lines: one operation per line (space-delimited, 12 fields)
# op_name | layer_id | iterations | fwd_comm_type | fwd_comm_size | fwd_compute
#         | bwd_comm_type | bwd_comm_size | bwd_compute | dp_comm_type | dp_comm_size | priority

attention_column -1 1750840 ALLGATHER 50331648 875420 REDUCESCATTER 0 875420 NONE 0 100
mlp_moelayer     -1 1750840 ALLTOALL  67108864 875420 ALLTOALL 67108864 875420 NONE 0 100

# comm_type values: ALLREDUCE | ALLGATHER | REDUCESCATTER | ALLTOALL | NONE
# comm_size: bytes    fwd/bwd_compute: microseconds    priority: typically 100
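A minimal parser for one operation line of this format, mapping the 12 space-delimited fields to the header names above (the function name is illustrative):

```python
def parse_workload_op(line: str) -> dict:
    """Split one 12-field operation line of the training workload format."""
    f = line.split()
    return {
        "op_name": f[0], "layer_id": int(f[1]), "iterations": int(f[2]),
        "fwd_comm_type": f[3], "fwd_comm_size": int(f[4]), "fwd_compute": int(f[5]),
        "bwd_comm_type": f[6], "bwd_comm_size": int(f[7]), "bwd_compute": int(f[8]),
        "dp_comm_type": f[9], "dp_comm_size": int(f[10]), "priority": int(f[11]),
    }

op = parse_workload_op(
    "attention_column -1 1750840 ALLGATHER 50331648 875420 "
    "REDUCESCATTER 0 875420 NONE 0 100")
```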

Step 4 Output (all backends)

# Regardless of backend, the final output is:
execution_time: float  # seconds — written into BatchStage

# This becomes:
BatchStageEndEvent(_time = current_time + execution_time, batch, stage_id)
# → eventually → BatchEndEvent when last pipeline stage completes
# Final time breakdown for all backends:
model_time = (
    (attention_layer_time + mlp_layer_time + add_time)  # per-layer block
    × num_layers_per_stage                               # = num_layers / PP
    + pipeline_parallel_communication_time               # send_recv between stages
) / 1000  # ms → s

attention_layer_time = attn_pre_proj + attn_post_proj + attn_rope
    + attn_kv_cache_save + attn_prefill_or_decode
    + tensor_parallel_comm_time  # ← THIS is what differs per backend
    + attn_norm_time

tensor_parallel_comm_time =
    backend_predicted_comm_ms
    + nccl_cpu_launch_overhead_ms                        # default 0.02
    + nccl_cpu_skew_overhead_per_device_ms × tp**1.25   # non-linear scaling
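The NCCL overhead terms above compose as a one-liner. A sketch: launch_overhead_ms defaults to the documented 0.02, but the per-device skew coefficient is configuration-dependent, so the 0.0 default below is a placeholder, not vidur's default:

```python
def tensor_parallel_comm_time_ms(backend_predicted_ms: float, tp: int,
                                 launch_overhead_ms: float = 0.02,
                                 skew_overhead_per_device_ms: float = 0.0) -> float:
    """Backend comm prediction + constant CPU launch cost + superlinear
    skew cost that grows as tp**1.25 with the TP degree."""
    return (backend_predicted_ms
            + launch_overhead_ms
            + skew_overhead_per_device_ms * tp ** 1.25)
```

The tp**1.25 exponent makes the skew term grow faster than linearly, modeling the increasing straggler effect as more GPUs join each AllReduce.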
Step 5: KV Cache Transfer

simccl astra-sim ns-3

In PD disaggregation mode, once the prefill replica completes, the KV cache must be transferred to the decode replica.

Input & Configuration

Input: KVCacheTransferFlow DAG node from the request, containing the KV cache size (derived from the model's num_kv_heads, attention_head_dim, num_layers, num_prefill_tokens)
Network Config:
  pd_p2p_comm_bandwidth (int, default 800 Gbps)
  pd_p2p_comm_dtype (str, default "float16")
  nvlink_bandwidth (int, default 1600 Gbps)
  rdma_bandwidth (int, default 800 Gbps)
Simulation Config (NS-3):
  simai_simulation_topo: topology file (generated by gen_Topo_Template.py)
  simai_simulation_config: astra-sim-alibabacloud/inputs/config/SimAI.conf

Output

Writes into the request: pd_p2p_comm_time (float, seconds), pd_p2p_comm_size (int, bytes), pd_p2p_comm_bandwidth (float), pd_p2p_bytes_per_token (float). Triggers DecodeCompletionEvent scheduling on the decode replica.
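A back-of-envelope sketch of the step-5 outputs: KV cache bytes (K and V per layer) and transfer time at the configured bandwidth. This uses the standard KV-cache size formula; the code's estimate_kv_cache_size() may differ in detail, and the model dimensions below are illustrative:

```python
def kv_cache_transfer(num_prefill_tokens: int, num_layers: int,
                      num_kv_heads: int, head_dim: int,
                      bytes_per_elem: int = 2,       # float16
                      bandwidth_gbps: float = 800):  # RDMA default
    """Return (pd_p2p_comm_size in bytes, pd_p2p_comm_time in seconds).
    The factor of 2 covers the K and V tensors per layer."""
    size_bytes = (2 * num_prefill_tokens * num_kv_heads * head_dim
                  * num_layers * bytes_per_elem)
    time_s = size_bytes / (bandwidth_gbps * 1024**3 / 8)  # Gbps → bytes/s
    return size_bytes, time_s

# e.g. a Llama-2-7B-like shape: 32 layers, 32 KV heads, head_dim 128,
# 1024 prompt tokens → 512 MiB of KV cache
size, t = kv_cache_transfer(1024, 32, 32, 128)
```

At the default 800 Gbps RDMA bandwidth this 512 MiB transfer takes 5 ms, which is then added to prefill_completed_at to produce decode_arrived_at.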

Step 6: Metrics Collection

vidur-alibabacloud

Once all DAG nodes complete, vidur records the full metrics suite and exports them to CSV files.

Input

Completed Request objects with all timing fields populated by Steps 1–5.

Output → request_metrics.csv

Output directory: simulator_output/YYYY-MM-DD_HH-MM-SS-XXXXXX/
# request_metrics.csv columns (from metrics/constants.py):
Request Id,
request_e2e_time,              # total latency (s)
request_execution_time,        # GPU execution time (s)
request_model_execution_time,  # model-specific execution (s)
request_preemption_time,       # preemption wait (s)
request_scheduling_delay,      # pre-schedule wait (s)
prefill_e2e_time,              # TTFT: prefill_completed_at - arrived_at
decode_time,                   # decode duration (s)
tbt,                           # time-between-tokens: decode_time / num_decode_tokens
arrived_at,                    # arrival timestamp (s)
scheduled_at,                  # first scheduling timestamp (s)
prefill_completed_at,          # TTFT base timestamp (s)
decode_arrived_at,             # decode phase start (s)
completed_at,                  # completion timestamp (s)
request_num_prefill_tokens,    # input tokens (int)
request_num_decode_tokens,     # output tokens (int)
prefill_replica_id,            # prefill replica (PD mode)
decode_replica_id,             # decode replica (PD mode)
pd_p2p_comm_size,              # KV cache transfer bytes
pd_p2p_comm_time,              # KV cache transfer time (s)
pd_p2p_comm_bandwidth,         # effective bandwidth
pd_p2p_bytes_per_token,        # bytes per token in transfer
pd_p2p_comm_dtype              # data type (float16, etc.)
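The headline latency metrics can be recomputed from the raw timestamp columns, following the column comments above (a sanity-check sketch, not vidur's metrics code; the function name is illustrative):

```python
def derived_request_metrics(row: dict) -> dict:
    """Derive TTFT, e2e latency, decode time, and TBT from timestamps."""
    ttft = row["prefill_completed_at"] - row["arrived_at"]
    e2e = row["completed_at"] - row["arrived_at"]
    decode_time = row["completed_at"] - row["decode_arrived_at"]
    tbt = decode_time / row["request_num_decode_tokens"]
    return {"prefill_e2e_time": ttft, "request_e2e_time": e2e,
            "decode_time": decode_time, "tbt": tbt}

row = {"arrived_at": 0.0, "prefill_completed_at": 0.8,
       "decode_arrived_at": 0.9, "completed_at": 2.9,
       "request_num_decode_tokens": 100}
m = derived_request_metrics(row)
```

A check like this is useful when post-processing request_metrics.csv: if the derived values diverge from the exported columns, a timestamp field was not populated as expected.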

Additional Output Files

batch_metrics.csv
  Toggle: store_batch_metrics (true)
  Columns: batch_size, batch_num_tokens, batch_num_prefill_tokens, batch_num_decode_tokens, batch_execution_time
chrome_trace.json
  Toggle: enable_chrome_trace (true)
  Chrome DevTools timeline for visual profiling
event_trace.json
  Toggle: write_json_trace (false)
  Detailed timestamped event log
CDF/histogram plots (PNG)
  Toggle: store_plots (true)
  Latency CDFs, batch size histograms, utilization time series