Step-by-step walkthrough of how a request flows through SimAI's 6 stages, with detailed input/output file formats, CSV headers, backend comparison, and configuration parameters at each step.
When a simulated inference request enters SimAI, it travels through six distinct stages spanning all five components (vidur, alibabacloud, AICB, astra-sim, NS-3). The following step-flow traces this journey from arrival to metrics collection.
A RequestArrivalEvent fires in the discrete-event simulation loop. The event carries num_prefill_tokens, num_decode_tokens, and an arrival timestamp. A Request entity is created with an empty DAG (nx.DiGraph()) that will be populated as the request progresses through the system.
Requests are produced by a Request Generator (vidur/request_generator/). SimAI supports two top-level modes:
**Synthetic.** Arrival intervals and token lengths are generated independently: you pick one Interval Generator and one Length Generator.
**Interval Generators**

| Type | Config Parameters | Formula / Logic | Source File |
|---|---|---|---|
| `POISSON` | `qps` (float, default 0.5), `seed` | `interval = −ln(1−rand()) / qps`, capped at 3σ | `vidur/request_generator/poisson_request_interval_generator.py` |
| `GAMMA` | `qps` (float, default 0.2), `cv` (float, default 0.5), `seed` | `shape = 1/cv²`, `scale = 1/(qps×shape)` | `vidur/request_generator/gamma_request_interval_generator.py` |
| `STATIC` | fixed interval value | constant inter-arrival time | `vidur/request_generator/static_request_interval_generator.py` |
| `TRACE` | `trace_file` (CSV path), `start_time`, `end_time`, `time_scale_factor` | reads CSV column `arrival_time` (datetime), computes inter-arrival via `.diff()`, applies `time_scale_factor` | `vidur/request_generator/trace_request_interval_generator.py` |

The interval trace CSV requires an `arrival_time` column with datetime strings (e.g. `2021-01-04 12:00:15`). Default trace: `data/processed_traces/AzureFunctionsInvocationTraceForTwoWeeksJan2021Processed.csv`
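The Poisson interval formula is easy to sketch; the standalone helper below (hypothetical, not the vidur class itself) mirrors `interval = −ln(1−rand()) / qps` with the 3σ cap:

```python
import math
import random

def poisson_intervals(qps: float, n: int, seed: int = 42) -> list[float]:
    """Draw n exponential inter-arrival times for a Poisson process,
    mirroring interval = -ln(1 - rand()) / qps, capped at 3 sigma."""
    rng = random.Random(seed)
    sigma = 1.0 / qps   # std-dev of the exponential distribution
    cap = 3 * sigma     # 3-sigma cap, as in the table above
    return [min(-math.log(1.0 - rng.random()) / qps, cap) for _ in range(n)]

intervals = poisson_intervals(qps=100.0, n=5)
# mean inter-arrival ≈ 1/qps = 0.01 s; every sample ≤ 0.03 s
```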
**Length Generators**

| Type | Config Parameters | Logic | Source File |
|---|---|---|---|
| `FIXED` | `prefill_tokens` (int, default 2048), `decode_tokens` (int, default 512) | returns a constant (prefill, decode) pair | `vidur/request_generator/fixed_request_length_generator.py` |
| `ZIPF` | `min_tokens` (1024), `max_tokens` (4096), `theta` (0.6), `prefill_to_decode_ratio` (20.0), `scramble`, `seed` | Zipf-distributed total tokens, split by ratio | `vidur/request_generator/zipf_request_length_generator.py` |
| `UNIFORM` | `min_tokens` (1024), `max_tokens` (4096), `prefill_to_decode_ratio` (20.0), `seed` | uniform distribution between min/max, split by ratio | `vidur/request_generator/uniform_request_length_generator.py` |
| `TRACE` | `trace_file` (CSV path), `prefill_scale_factor` (1.0), `decode_scale_factor` (1.0), `max_tokens` (4096), `seed` | reads CSV columns `num_prefill_tokens`, `num_decode_tokens`; applies scale factors and clips | `vidur/request_generator/trace_request_length_generator.py` |

The length trace CSV requires integer columns `num_prefill_tokens` and `num_decode_tokens`. Default trace: `data/processed_traces/sharegpt_8k_filtered_stats_llama2_tokenizer.csv`
**Trace replay.** Reads arrival times and token counts from a single CSV file. Config: `TraceRequestGeneratorConfig`.

Example trace (`data/processed_traces/splitwise_conv.csv`):

```
arrived_at,num_prefill_tokens,num_decode_tokens
0.0102006,1024,10
0.0105234,2048,15
0.0215440,1536,8
```

Columns: `arrived_at` (float, seconds) — absolute arrival timestamp; `num_prefill_tokens` (int) — input/prompt tokens; `num_decode_tokens` (int) — output/generation tokens. Config parameters: `trace_file`, `prefill_scale_factor` (1.0), `decode_scale_factor` (1.0), `time_scale_factor` (1.0), `max_tokens` (4096), `seed` (42).
```sh
# Synthetic: Poisson arrival + Fixed length
python -m vidur.main \
  --request_generator_config_type synthetic \
  --interval_generator_config_type poisson \
  --poisson_request_interval_generator_config_qps 100 \
  --length_generator_config_type fixed \
  --fixed_request_length_generator_config_prefill_tokens 1024 \
  --fixed_request_length_generator_config_decode_tokens 10

# Trace Replay
python -m vidur.main \
  --request_generator_config_type trace_replay \
  --trace_request_generator_config_trace_file ./data/processed_traces/splitwise_conv.csv \
  --trace_request_generator_config_time_scale_factor 0.5
```
Defined in vidur/entities/request.py. At this stage only the identity and token fields are set; timing/DAG fields are populated in later steps.
```
Request(
    # ── Identity & tokens (set at creation) ──
    _id: int                          # auto-incremented unique ID
    _arrived_at: float                # arrival timestamp (seconds)
    _num_prefill_tokens: int          # input / prompt tokens
    _num_decode_tokens: int           # output / generation tokens
    _num_processed_tokens: int = 0    # tokens processed so far

    # ── DAG (empty now, populated in Step 2) ──
    dag: nx.DiGraph()                 # task dependency graph
    node_id: int = 0                  # counter for unique node IDs in DAG
    nodes: dict = {}                  # node_id → Node object mapping
    root_node: Node = None            # entry point (PromptTask)
    request_type: RequestType         # PREFILL=1 / DECODE=2 / MIXED=0

    # ── Replica assignment (set in Step 2) ──
    prefill_replica_id: int = None
    decode_replica_id: int = None

    # ── Timing fields (populated in Steps 3–6) ──
    _scheduled_at: float = 0          # first scheduling time
    _prefill_completed_at: float = 0  # TTFT base timestamp
    _completed_at: float = 0          # request completion time
    _execution_time: float = 0        # GPU execution time
    _model_execution_time: float = 0  # model-specific execution time
    _scheduling_delay: float = 0      # wait before first scheduling
    _preempted_time: float = 0        # time lost to preemption
    prefill_arrived_at: float         # prefill start (= arrived_at)
    decode_arrived_at: float = inf    # decode start (after KV transfer)

    # ── PD communication (set in Step 5) ──
    pd_p2p_comm_size: float = inf     # KV cache transfer bytes
    pd_p2p_comm_time: float = inf     # KV cache transfer time (s)
    pd_p2p_comm_bandwidth: float = 0  # effective bandwidth
    pd_p2p_bytes_per_token: int = None  # 2 (fp16) / 4 (fp32) / 8 (fp64)
    pd_p2p_comm_dtype: str = None     # "float16", "float32", etc.

    # ── Status flags ──
    _scheduled: bool = False
    _preempted: bool = False
    _completed: bool = False
    _is_prefill_complete: bool = False
    _num_restarts: int = 0
)
```
The generator also creates a RequestArrivalEvent(_time=arrived_at, _request=request) which is pushed into the simulation's min-heap event queue.
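The min-heap event queue can be sketched with `heapq` and a comparable event type (simplified; the real events carry more fields than this):

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

_ids = count()  # monotonically increasing id, like the simulator's event ids

@dataclass(order=True)
class SimEvent:
    """Toy stand-in for RequestArrivalEvent: ordered by (time, id)."""
    time: float
    id: int = field(default_factory=lambda: next(_ids))
    name: str = field(default="", compare=False)

queue: list[SimEvent] = []
heapq.heappush(queue, SimEvent(0.5, name="RequestArrivalEvent"))
heapq.heappush(queue, SimEvent(0.1, name="RequestArrivalEvent"))
first = heapq.heappop(queue)   # earliest timestamp pops first
```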
The global scheduler routes the request to a replica pool. In PD disaggregation mode, it separates prefill and decode into different replica pools.
| Input | `Request` object from Step 1 |
|---|---|
| Global Scheduler Types | `RANDOM`, `ROUND_ROBIN`, `LOR` (Least Outstanding Requests), `SPLIT_WISE` (PD-aware) |
| Key Config | `pd_node_ratio` (float, default 0.5) — fraction of replicas dedicated to prefill; `num_replicas` (int, default 1) — total GPU replicas in cluster |
Sets `request.prefill_replica_id` and `request.decode_replica_id`, then builds the DAG. The exact DAG shape depends on whether the prefill and decode replicas are the same or different (code in `splitwise_global_scheduler.py`):

Same replica (colocated prefill and decode):

```
PromptTask ──→ TokenTask
# linked via prefill_task.chain = [decode_task]
# no KV transfer needed
```

Different replicas (PD disaggregation):

```
PromptTask ──→ KVCacheTransferFlow ──→ TokenTask
# edges in request.dag (nx.DiGraph)
# KV transfer bridges the two replicas
```
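The cross-replica DAG can be reproduced directly with `networkx`, the library vidur's `request.dag` uses. Node payloads here are plain strings, whereas the real graph holds `PromptTask` / `KVCacheTransferFlow` / `TokenTask` objects:

```python
import networkx as nx

# Build the PD-disaggregated task chain as a directed graph
dag = nx.DiGraph()
dag.add_edge("PromptTask", "KVCacheTransferFlow")
dag.add_edge("KVCacheTransferFlow", "TokenTask")

# A chain has exactly one valid execution order
order = list(nx.topological_sort(dag))
# → ['PromptTask', 'KVCacheTransferFlow', 'TokenTask']
```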
```
# ── Base Node (all DAG nodes inherit from this) ──
Node(BaseEntity):
    node_id: int              # unique ID within this request's DAG
    state: NodeState          # NONE=0 → QUEUED=1 → RUNNING=2 → COMPLETED=4
                              # (or BLOCKED=3 if preempted, ABORTED=5)
    request: Request          # back-reference to owning request
    chain: list[Node]         # consecutive nodes (used in same-replica case)
    num_preemptions: int = 0

# ── PromptTask (prefill stage) ──
PromptTask(Task → Node):
    task_type: TaskType = PROMPT    # = 1
    prompt_size: int                # = request.num_prefill_tokens
    tokens_per_iteration: int       # = prompt_size (process all at once)
    processing_tokens: int = 0      # tokens being processed this iteration
    processed_tokens: int = 0       # total tokens processed
    generating_tokens: int = 0      # tokens being generated this iteration
    generated_tokens: int = 0       # total generated (max 1 for prefill)
    is_prefill_complete: bool = False
    cleanup_memory: bool = False    # keep KV cache for decode
    batch_size: int = 1
    duration: float = 0.0           # set by execution time predictor
    # memory = 2 × num_tokens × mlp_hidden_dim × num_layers × bytes_per_token

# ── TokenTask (decode stage) ──
TokenTask(Task → Node):
    task_type: TaskType = TOKEN     # = 2
    token_size: int                 # = request.num_decode_tokens − 1
                                    # (1 token already generated in prefill)
    tokens_per_iteration: int = 1   # auto-regressive: 1 token at a time
    is_prefill_complete: bool = True
    # same processing/generated fields as PromptTask

# ── KVCacheTransferFlow (PD transfer) ──
KVCacheTransferFlow(Flow → Node):
    flow_type: FlowType = KVCacheTransfer   # = 1
    src: Replica              # prefill replica
    dest: Replica             # decode replica
    size: float               # KV cache size in bytes
                              # = estimate_kv_cache_size(prefill_tokens, replica)
    duration: float = 0.0     # set after transfer simulation
    notify: bool = False
    batch_size: int = 1
```
After DAG construction, the global scheduler emits a GlobalScheduleEvent which triggers ReplicaScheduleEvent(s) for the assigned replica(s).
The replica scheduler on the target replica builds a batch from waiting requests, considering KV cache memory, batch size limits, and token limits.
| Input | List of `Request` objects with DAG nodes assigned to this replica |
|---|---|
| Replica Scheduler Types | `SARATHI` (chunk_size=512), `VLLM` (max_tokens_in_batch=4096), `ORCA`, `LIGHTLLM` (max_waiting_iters=10), `FASTER_TRANSFORMER`, `SPLIT_WISE` (max_tokens_in_batch=4096) |
| Common Config | `batch_size_cap` (128), `block_size` (16), `watermark_blocks_fraction` (0.01), `num_blocks` (optional) |
```
Batch(BaseEntity):
    _replica_id: int                  # which replica executes this batch
    _requests: List[Request]          # requests grouped into this batch
    _num_tokens: List[int]            # tokens per request in this batch
    _total_num_tokens: int            # sum of all tokens
    _num_prefill_tokens: int          # prefill tokens (non-completed only)
    _total_num_tokens_rounded: int    # tokens rounded to 8-token boundary
    _scheduled_at: float = None       # when batch was scheduled
    _completed_at: float = None       # when batch completed

    # Properties:
    num_decode_tokens → _total_num_tokens − _num_prefill_tokens
    all_requests_completed → bool     # True if all done
    preempted_requests → List[Request]
    completed_requests → List[Request]
```
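The 8-token rounding behind `_total_num_tokens_rounded` is a one-liner; this helper (hypothetical name, not vidur's code) shows the usual round-up-to-multiple idiom:

```python
def round_to_boundary(num_tokens: int, multiple: int = 8) -> int:
    """Round a token count up to the next multiple-of-8 boundary,
    a sketch of how _total_num_tokens_rounded relates to _total_num_tokens."""
    return ((num_tokens + multiple - 1) // multiple) * multiple

round_to_boundary(1027)  # → 1032
```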
The simulation is event-driven. Each event generates successor events that are pushed into a min-heap sorted by timestamp. The full chain from scheduling to completion:
```
# ── Event chain for a single batch ──
RequestArrivalEvent(time, request)
  → GlobalScheduleEvent(time)                  # route to replica
  → ReplicaScheduleEvent(time, replica_id)     # build batch
  → BatchStageArrivalEvent(time, replica_id, stage_id=0, batch)
  → ReplicaStageScheduleEvent(time, replica_id, stage_id)
      # calls execution_time_predictor → gets duration
  → BatchStageEndEvent(time + duration, replica_id, stage_id, batch)

# If pipeline has more stages (PP > 1):
BatchStageEndEvent → BatchStageArrivalEvent(next stage) + ReplicaStageScheduleEvent(current)

# If last pipeline stage:
BatchStageEndEvent → BatchEndEvent(time, replica_id, batch)

# BatchEndEvent handles PD logic:
BatchEndEvent:
  # For each request that completed prefill:
  #   calculates pd_p2p_comm_time
  #   sets request.decode_arrived_at = prefill_completed_at + pd_p2p_comm_time
  #   → ReplicaScheduleEvent(decode_arrived_at, decode_replica_id)
```
Every event carries `_time` (float), `_id` (auto-incremented int), `_event_type` (an `EventType` enum), and `_priority_number = (time, id, event_type)` for min-heap ordering.
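The `_priority_number` tuple relies on plain lexicographic comparison, so ties on `_time` fall back to insertion `_id` before event type (illustrative values):

```python
# (time, id, event_type) tuples compare field by field, left to right
a = (1.0, 0, 5)   # pushed first (smaller id)
b = (1.0, 1, 2)   # same timestamp, later id → ordered after a
c = (0.5, 2, 9)   # earlier timestamp always wins

order = sorted([b, a, c])
# → [c, a, b]: time first, then insertion id breaks the tie
```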
This step answers: “How long does this batch take to execute?” The answer is always composed of two parts:
```
execution_time = (block_time × num_layers_per_stage + pp_comm_time) / 1000
block_time     = attention_layer_time + mlp_layer_time + add_time
```

| Backend | Compute Time Method | Granularity | Speed |
|---|---|---|---|
| `vidur` | RandomForest trained on GPU profiling CSVs → O(1) lookup table | Homogeneous (all layers identical) | Fastest |
| `aicb` | Pre-profiled per-layer CSV (real GPU, no ML training) | Per-layer heterogeneous + phase-aware (prefill/decode) | Very fast |
| `simai_analytical` | ← same as `vidur` (identical RandomForest compute path) | | |
| `simai_simulation` | ← same as `vidur` (identical RandomForest compute path) | | |
| Backend | TP (AllReduce) | PP (SendRecv) | EP/MoE (AllToAll) |
|---|---|---|---|
| `vidur` | ✓ RandomForest on `all_reduce.csv` | ✓ RandomForest on `send_recv.csv` (all 4 backends) | ✗ not modeled |
| `aicb` | ✗ = 0 (TODO in code) | ✓ shared `send_recv.csv` path | ✓ `comm_size / bandwidth` |
| `simai_analytical` | ✓ `SimAI_analytical` binary + `busbw.yaml` | ✓ shared `send_recv.csv` path | ✗ not modeled |
| `simai_simulation` | ✓ NS-3 packet-level simulation | ✓ shared `send_recv.csv` path | ✗ not modeled |

- `vidur`, `simai_analytical`, and `simai_simulation` share the same RandomForest compute path; they differ only in TP communication estimation.
- `aicb` is the only backend that models EP/MoE AllToAll communication (via a bandwidth formula), but it currently does NOT model TP AllReduce (= 0, with a TODO in `base_execution_time_predictor.py:84`).
- All four backends predict PP (SendRecv) communication from the same `send_recv.csv` profile.
**Backend A (`vidur`).** Trains 11 separate RandomForest models from pre-collected GPU profiling data. After training, predictions are pre-generated for all possible token counts and stored as O(1) lookup tables (Python dicts).
```
# Flow:
Batch(batch_size, num_tokens_per_request)
  → extract features (num_tokens, kv_cache_size)
  → lookup pre-computed dict: {(num_tokens,): time_ms}
  → sum across 11 sub-models per layer
  → execution_time
```
Pre-collected by running each model operator on a real GPU at various token counts. The RandomForest learns f(num_tokens) → time_ms for each operator.
```
# File: data/profiling/compute/{DEVICE}/{MODEL}/mlp.csv
# Example: data/profiling/compute/h100/meta-llama/Llama-2-7b-hf/mlp.csv
#
# Feature columns (input to RandomForest):
num_tokens                    # int — tokens in this profiling run
num_tensor_parallel_workers   # int — TP degree used during profiling
n_head, n_kv_head, n_embd     # model architecture params
n_expanded_embd, vocab_size   # model architecture params
use_gated_mlp                 # bool

# Target columns (what RandomForest predicts) — each with min/max/mean/median/std:
time_stats.emb.*                       # embedding layer time
time_stats.input_layernorm.*           # input LayerNorm time
time_stats.attn_pre_proj.*             # QKV projection time
time_stats.attn_rope.*                 # rotary position encoding time
time_stats.attn_post_proj.*            # output projection time
time_stats.post_attention_layernorm.*  # post-attention LayerNorm time
time_stats.mlp_up_proj.*               # MLP up-projection time
time_stats.mlp_act.*                   # activation function time
time_stats.mlp_down_proj.*             # MLP down-projection time
time_stats.add.*                       # residual add time
```

```
# File: data/profiling/compute/{DEVICE}/{MODEL}/attention.csv
#
# Additional feature columns:
num_tokens, kv_cache_size     # input features

# Target columns:
time_stats.attn_prefill.*        # prefill attention kernel time
time_stats.attn_decode.*         # decode attention kernel time
time_stats.attn_kv_cache_save.*  # KV cache write time
```
Pre-collected by running collective operations at various message sizes. Also trained via RandomForest: f(message_size) → time_ms.
```
# File: data/profiling/network/{NETWORK_DEVICE}/all_reduce.csv
# Example: data/profiling/network/h100_pairwise_nvlink/all_reduce.csv
#
# Columns:
size                          # int — message size in bytes
num_workers                   # int — number of GPUs
rank                          # int — GPU rank
collective                    # str — operation type ("all_reduce")
devices_per_node              # int — GPUs per machine
max_devices_per_node          # int
time_stats.all_reduce.min     # float — min observed time (ms)
time_stats.all_reduce.max     # float
time_stats.all_reduce.mean    # float — training target
time_stats.all_reduce.median  # float
time_stats.all_reduce.std     # float

# File: data/profiling/network/{NETWORK_DEVICE}/send_recv.csv
# Same structure but for point-to-point send/recv (used for PP communication)
```
RandomForest hyperparameter grid: `n_estimators` ∈ {250, 500, 750}, `max_depth` ∈ {8, 16, 32}, `min_samples_split` ∈ {2, 5, 10}. Scoring = MAPE. Trained models are cached as pickle files in `cache/`.
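A minimal sketch of this train-then-lookup pipeline, using the grid and MAPE scoring above but toy profiling data (the real feature extraction and column handling live in vidur's predictor code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Toy stand-in for a profiling CSV: num_tokens feature → mean layer time (ms)
X = np.array([[128], [256], [512], [1024]])
y = np.array([0.11, 0.21, 0.40, 0.82])

grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={
        "n_estimators": [250, 500, 750],
        "max_depth": [8, 16, 32],
        "min_samples_split": [2, 5, 10],
    },
    # sklearn's MAPE scorer is negated so that larger is better
    scoring="neg_mean_absolute_percentage_error",
    cv=2,
)
grid.fit(X, y)

# Pre-generate an O(1) lookup table keyed by token count
lookup = {(t,): float(grid.predict([[t]])[0]) for t in (128, 256, 512, 1024)}
```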
**Backend B (`aicb`).** Completely different architecture: no RandomForest training. Instead, AICB pre-profiles every model layer on a real GPU (DeepGEMM for matmul, FlashMLA for attention) and saves per-layer compute time plus communication size to a CSV. At simulation time, it performs a direct per-layer lookup.
```
# Flow:
Batch(batch_size, num_tokens, phase=prefill|decode)
  → determine AICB CSV file by (model, tp, pp, ep, bs, seq, phase)
  → load CSV: per-layer comp_time + comm_size
  → for each layer:
  │     compute = comp_time (from CSV)
  │     comm = comm_size / bandwidth (from replica config)
  │     layer_time = compute + comm
  → sum across all layers
  → execution_time
```
```
# File: results/workload/vidur-{model}-world_size{ws}-tp{tp}-pp{pp}-ep{ep}-bs{bs}-seq{seq}-{phase}.csv
# Example: results/workload/vidur-deepseek-671B-world_size16-tp2-pp1-ep8-bs1-seq1024-prefill.csv
#
# Format: TAB-separated, no header row
# Columns:
#   layer_id    # int — layer index (0, 1, 2, ...)
#   layer_name  # str — "attention", "mlp", "moe", etc.
#   comp_time   # float — compute time in NANOSECONDS (from real GPU profiling)
#   comm_size   # float — communication size in BYTES (for collective ops)

# Example rows:
0  attention  125000  0
0  mlp        98000   0
0  moe        210000  67108864
1  attention  125000  0
1  mlp        98000   0
1  moe        210000  67108864
```
```
# For MoE layers (expert parallelism):
if phase == "prefill":
    bandwidth = rdma_bandwidth      # cross-node (default 800 Gbps)
else:                               # decode
    bandwidth = nvlink_bandwidth    # intra-node (default 1600 Gbps)
moe_comm_time = comm_size / (bandwidth * 1024**3 / 8)   # Gbps → bytes/s
layer_time = comp_time * 1e-9 + moe_comm_time           # ns → s

# For attention/MLP layers:
# TP communication time = 0 (currently disabled in code)
layer_time = comp_time * 1e-9
```
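The MoE formula above as a runnable helper (hypothetical function name; the input numbers come from the example CSV rows):

```python
def aicb_layer_time(comp_time_ns: float, comm_size_bytes: float, phase: str,
                    rdma_gbps: float = 800.0,
                    nvlink_gbps: float = 1600.0) -> float:
    """MoE layer time = compute (ns → s) + AllToAll size / bandwidth,
    with the bandwidth chosen by phase as in the formula above."""
    bandwidth_gbps = rdma_gbps if phase == "prefill" else nvlink_gbps
    bytes_per_s = bandwidth_gbps * 1024**3 / 8   # Gbps → bytes/s
    return comp_time_ns * 1e-9 + comm_size_bytes / bytes_per_s

# MoE layer from the example rows: 210000 ns compute, 64 MiB AllToAll
t_prefill = aicb_layer_time(210000, 67108864, "prefill")
# ≈ 8.35e-4 s (0.625 ms AllToAll over RDMA + 0.21 ms compute)
```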
**Backend C (`simai_analytical`).** Compute time: identical to `vidur` (same RandomForest, same profiling CSVs). Communication time: replaces vidur's RandomForest with an external `SimAI_analytical` binary that uses a topology-aware analytical model.
```
# Flow:
Batch(batch_size, num_tokens)
  → [COMPUTE] same RandomForest as vidur → comp_time
  → [COMM] generate workload file (AllReduce size = hidden_dim × tokens × 2 bytes)
  → run: SimAI_analytical -w {workload} -g {world_size} -g_p_s 8
  → parse output CSV → comm_time (μs → ms)
  → execution_time = comp_time + comm_time + NCCL_overhead
```
Uses the exact same mlp.csv, attention.csv files described in Backend A for compute time.
Specifies the effective bandwidth (GB/s) for each collective operation within each parallelism group. The analytical model uses these values to estimate communication latency without packet-level simulation.
```yaml
# File: example/busbw.yaml
test                    # scenario name (first line)
TP:                     # Tensor Parallelism group
  allreduce,: 300       # AllReduce bandwidth in GB/s
  allgather,: 280       # AllGather bandwidth in GB/s
  reducescatter,: 280   # ReduceScatter bandwidth in GB/s
  alltoall,: 230        # AllToAll bandwidth in GB/s
EP:                     # Expert Parallelism group
  allreduce,: null
  allgather,: 45
  reducescatter,: 45
  alltoall,: 80
PP:                     # Pipeline Parallelism group
  busbw: 47.5           # single value for send_recv (GB/s)
```

```
# Topology: example/topo (network topology definition)
# Config:   astra-sim-alibabacloud/inputs/config/SimAI.conf
#           (collective algorithm selection, buffer sizes, etc.)
# Output:   results/analytical_EndToEnd_{id}.csv
# Parser reads: last row, 6th column (index 5) = total comm time in μs
# Converts to ms: latency = float(rows[-1][5]) * 1e-3
# Results are cached by MD5(workload_params + topo + conf)
# to avoid redundant runs
```
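The MD5-keyed result cache can be sketched as follows (hypothetical key derivation; the real backend decides exactly which parameters enter the hash):

```python
import hashlib

def comm_cache_key(workload_params: str, topo_path: str, conf_path: str) -> str:
    """Derive a deterministic cache key from the simulation inputs,
    so identical (workload, topo, conf) triples reuse the cached result."""
    blob = "|".join([workload_params, topo_path, conf_path]).encode()
    return hashlib.md5(blob).hexdigest()

key = comm_cache_key("allreduce:size=67108864:ws=16", "example/topo",
                     "astra-sim-alibabacloud/inputs/config/SimAI.conf")
# same inputs → same 32-char hex key → cached comm_time reused
```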
**Backend D (`simai_simulation`).** Compute time: identical to `vidur` and `simai_analytical`. Communication time: replaces the analytical formula with a full NS-3 packet-level RDMA simulation. Same workload generation as `simai_analytical`, but runs through NS-3 instead.
```
# Flow (only the comm path differs from simai_analytical):
Batch(batch_size, num_tokens)
  → [COMPUTE] same RandomForest as vidur → comp_time
  → [COMM] generate same workload file
  → run: AS_SEND_LAT=6 AS_NVLS_ENABLE=1 \
         SimAI_simulator -t 16 -w {workload} -n {topo} -c {conf}
  → parse output CSV → comm_time (μs → ms)
  → execution_time = comp_time + comm_time + NCCL_overhead

# Output: results/ncclFlowModel_EndToEnd_{id}.csv
# Parser reads: last row, 2nd column (index 1) = total comm time in μs
```
SimAI_analytical uses bandwidth formulas for O(1) estimation; SimAI_simulator runs a full NS-3 simulation that models packet-level congestion, ECMP routing, and PFC back-pressure. More accurate for large-scale topologies with contention, but orders of magnitude slower.
This text format is used by the astra-sim / SimAI binaries for training workload simulation (not inference). Included here for reference as it shares the same SimAI ecosystem.
```
# Line 1: model config metadata
HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 16 pp: 12 ...
# Line 2: total number of operations
1789
# Remaining lines: one operation per line (space-delimited, 12 fields)
#   op_name | layer_id | iterations | fwd_comm_type | fwd_comm_size | fwd_compute
#   | bwd_comm_type | bwd_comm_size | bwd_compute | dp_comm_type | dp_comm_size | priority
attention_column -1 1750840 ALLGATHER 50331648 875420 REDUCESCATTER 0 875420 NONE 0 100
mlp_moelayer -1 1750840 ALLTOALL 67108864 875420 ALLTOALL 67108864 875420 NONE 0 100
# comm_type values: ALLREDUCE | ALLGATHER | REDUCESCATTER | ALLTOALL | NONE
# comm_size: bytes   fwd/bwd_compute: microseconds   priority: typically 100
```
```
# Regardless of backend, the final output is:
execution_time: float   # seconds — written into BatchStage

# This becomes:
BatchStageEndEvent(_time = current_time + execution_time, batch, stage_id)
# → eventually → BatchEndEvent when last pipeline stage completes
```

```
# Final time breakdown for all backends:
model_time = (
    (attention_layer_time + mlp_layer_time + add_time)   # per-layer block
    × num_layers_per_stage                               # = num_layers / PP
    + pipeline_parallel_communication_time               # send_recv between stages
) / 1000                                                 # ms → s

attention_layer_time = attn_pre_proj + attn_post_proj + attn_rope
                     + attn_kv_cache_save + attn_prefill_or_decode
                     + tensor_parallel_comm_time         # ← THIS is what differs per backend
                     + attn_norm_time

tensor_parallel_comm_time =
      backend_predicted_comm_ms
    + nccl_cpu_launch_overhead_ms                        # default 0.02
    + nccl_cpu_skew_overhead_per_device_ms × tp**1.25    # non-linear scaling
```
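The NCCL overhead formula is easy to evaluate for concrete values. In this sketch only the 0.02 ms launch-overhead default comes from the text above; the skew coefficient used in the example call is illustrative:

```python
def tp_comm_time_ms(backend_predicted_comm_ms: float, tp: int,
                    launch_overhead_ms: float = 0.02,
                    skew_overhead_per_device_ms: float = 0.0) -> float:
    """tensor_parallel_comm_time with the two NCCL CPU overheads:
    a fixed launch cost plus a skew term that scales as tp**1.25."""
    return (backend_predicted_comm_ms
            + launch_overhead_ms
            + skew_overhead_per_device_ms * tp ** 1.25)

# e.g. 0.5 ms predicted comm, TP=8, illustrative skew of 0.01 ms/device:
t = tp_comm_time_ms(0.5, tp=8, skew_overhead_per_device_ms=0.01)
# ≈ 0.5 + 0.02 + 0.01 × 8^1.25 ≈ 0.6545 ms
```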
In PD disaggregation mode, once the prefill replica completes, the KV cache must be transferred to the decode replica.
| Input | `KVCacheTransferFlow` DAG node from the request, containing the KV cache size (derived from the model's `num_kv_heads`, `attention_head_dim`, `num_layers`, `num_prefill_tokens`) |
|---|---|
| Network Config | `pd_p2p_comm_bandwidth` (int, default 800 Gbps), `pd_p2p_comm_dtype` (str, default `"float16"`), `nvlink_bandwidth` (int, default 1600 Gbps), `rdma_bandwidth` (int, default 800 Gbps) |
| Simulation Config (NS-3) | `simai_simulation_topo` — topology file (generated by `gen_Topo_Template.py`); `simai_simulation_config` — `astra-sim-alibabacloud/inputs/config/SimAI.conf` |
Writes into the request: pd_p2p_comm_time (float, seconds), pd_p2p_comm_size (int, bytes), pd_p2p_comm_bandwidth (float), pd_p2p_bytes_per_token (float). Triggers DecodeCompletionEvent scheduling on the decode replica.
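A back-of-the-envelope sketch of the transfer cost, assuming the standard 2 × tokens × kv_heads × head_dim × layers × dtype-bytes KV-cache layout and the same Gbps → bytes/s conversion used elsewhere in this walkthrough (the model numbers in the example call are hypothetical):

```python
def kv_cache_transfer_time_s(num_prefill_tokens: int, num_layers: int,
                             num_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2,        # float16
                             bandwidth_gbps: float = 800.0) -> float:
    """KV transfer time = size / bandwidth, with size following the
    2 (K and V) × tokens × kv_heads × head_dim × layers × dtype-bytes layout."""
    size_bytes = (2 * num_prefill_tokens * num_kv_heads * head_dim
                  * num_layers * bytes_per_elem)
    return size_bytes / (bandwidth_gbps * 1024**3 / 8)   # Gbps → bytes/s

# e.g. a 7B-class model (hypothetical): 32 layers, 32 KV heads, head_dim 128,
# 1024 prefill tokens in fp16 → 512 MiB over 800 Gbps
t = kv_cache_transfer_time_s(1024, 32, 32, 128)
# → 0.005 s
```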
Once all DAG nodes complete, vidur records the full metrics suite and exports them to CSV files.
Completed Request objects with all timing fields populated by Steps 1–5.
Output directory:

```
simulator_output/YYYY-MM-DD_HH-MM-SS-XXXXXX/
```

```
# request_metrics.csv columns (from metrics/constants.py):
Request Id,
request_e2e_time,             # total latency (s)
request_execution_time,       # GPU execution time (s)
request_model_execution_time, # model-specific execution (s)
request_preemption_time,      # preemption wait (s)
request_scheduling_delay,     # pre-schedule wait (s)
prefill_e2e_time,             # TTFT: prefill_completed_at - arrived_at
decode_time,                  # decode duration (s)
tbt,                          # time-between-tokens: decode_time / num_decode_tokens
arrived_at,                   # arrival timestamp (s)
scheduled_at,                 # first scheduling timestamp (s)
prefill_completed_at,         # TTFT base timestamp (s)
decode_arrived_at,            # decode phase start (s)
completed_at,                 # completion timestamp (s)
request_num_prefill_tokens,   # input tokens (int)
request_num_decode_tokens,    # output tokens (int)
prefill_replica_id,           # prefill replica (PD mode)
decode_replica_id,            # decode replica (PD mode)
pd_p2p_comm_size,             # KV cache transfer bytes
pd_p2p_comm_time,             # KV cache transfer time (s)
pd_p2p_comm_bandwidth,        # effective bandwidth
pd_p2p_bytes_per_token,       # bytes per token in transfer
pd_p2p_comm_dtype             # data type (float16, etc.)
```
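The derived columns can be recomputed from the timestamp columns, which makes a handy sanity check when post-processing the CSV (the row values here are synthetic, not real simulator output):

```python
# One request_metrics.csv row as a dict (synthetic values)
row = {
    "arrived_at": 0.0102006,
    "prefill_completed_at": 0.0502006,
    "completed_at": 0.2502006,
    "request_num_decode_tokens": 10,
}

ttft = row["prefill_completed_at"] - row["arrived_at"]     # prefill_e2e_time
decode_time = row["completed_at"] - row["prefill_completed_at"]
tbt = decode_time / row["request_num_decode_tokens"]       # time-between-tokens
# ttft ≈ 0.04 s, decode_time ≈ 0.2 s, tbt ≈ 0.02 s
```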
| File | Config Toggle | Description |
|---|---|---|
| `batch_metrics.csv` | `store_batch_metrics` (true) | `batch_size`, `batch_num_tokens`, `batch_num_prefill_tokens`, `batch_num_decode_tokens`, `batch_execution_time` |
| `chrome_trace.json` | `enable_chrome_trace` (true) | Chrome DevTools timeline for visual profiling |
| `event_trace.json` | `write_json_trace` (false) | Detailed timestamped event log |
| CDF/histogram plots (PNG) | `store_plots` (true) | Latency CDFs, batch size histograms, utilization time series |