Life of an Inference Request in SimAI

Alibaba Cloud's full-stack simulator integrating Vidur scheduling, AICB GPU profiling, SimCCL collective decomposition, astra-sim system simulation, and NS-3 network simulation — enabling PD disaggregation analysis without deploying real clusters.

aliyun/SimAI · NSDI'25

Table of Contents

  1. System Positioning
  2. Vidur → SimAI: What Changed and Why
  3. Five-Component Architecture
  4. Execution Time Breakdown
  5. New Core Entities
  6. Configuration System
  7. Three Simulation Modes
  8. Example CLI
  9. Key Insights

1. System Positioning

SimAI is Alibaba Cloud's full-stack simulator for AI training AND inference. It extends Microsoft's Vidur (a discrete-event inference simulator) with AICB GPU profiling, SimCCL collective communication decomposition, astra-sim system simulation, and NS-3 packet-level network simulation. Together, these five components form an end-to-end pipeline that can predict training iteration time or inference latency metrics without deploying a single real GPU.

Important Distinction: SimAI is NOT a serving engine (like vLLM or SGLang). It is a simulator — it models inference to enable capacity planning, network topology design, and Prefill-Decode (PD) disaggregation analysis. No actual inference is performed.

Who Should Use SimAI?

Infrastructure architects evaluating cluster configurations, network engineers comparing topology designs (fat-tree, rail-optimized, dual-ToR), and ML platform teams estimating the cost/benefit of PD disaggregation before committing hardware.

2. Vidur → SimAI: What Changed and Why

The following table details every dimension where vidur-alibabacloud diverges from the original Microsoft Vidur. These changes enable full-stack network-aware simulation with Prefill-Decode disaggregation support.

Dimension: Original Vidur → SimAI vidur-alibabacloud
Deployment Model: Co-located (prefill + decode on same replica) → PD disaggregation (separate prefill and decode replica pools)
Compute Time Estimation: sklearn RandomForest trained on profiled CSVs → AICB AIOB real GPU profiling (DeepGEMM / FlashMLA kernels)
Communication Simulation: None (replica-internal communication assumed negligible) → SimCCL + astra-sim + NS-3 full-stack network simulation
Supported Models: LLaMA2-7B / 13B / 70B → adds DeepSeek-V3-671B, Qwen3-MoE-235B, Qwen3-Next-80B
Hardware Requirement: Pure CPU simulation → profiling needs a Hopper/Blackwell GPU (SM90+); the simulation itself remains CPU-only
Global Scheduler: Random, Round-Robin, Least Outstanding Requests (LOR) → adds SplitWise (PD-aware routing to prefill/decode pools)
Replica Scheduler: Faster Transformer, Orca, Sarathi, vLLM, LightLLM → adds SplitWise (PD-aware batch formation)
Task Representation: Basic execution time estimation → DAG with PromptTask / TokenTask / KVCacheTransferFlow
Output Metrics: Basic latency / throughput → adds detailed PD timing breakdown (prefill_e2e, decode_e2e, pd_p2p_comm_time, pd_p2p_comm_size)
Key Takeaway: The biggest architectural shift is moving from Vidur's "compute-only, single-replica" model to SimAI's "compute + communication, multi-pool PD disaggregation" model. Every other change (AICB profiling, SimCCL, NS-3) supports this fundamental expansion.

3. Five-Component Architecture

SimAI is not a monolith — it is a federation of five repositories, each handling a different layer of the simulation stack. The following diagram shows how data flows between them.

SimAI Five-Component Architecture & Data Flow
[Architecture diagram. Components, top to bottom: vidur-alibabacloud (Python; scheduling + orchestration: global scheduler, replica scheduler, SplitWise, event loop, metrics) → AICB (Python + CUDA; workload generation + GPU profiling via AIOB, DeepGEMM / FlashMLA) → SimCCL (C++ library; collective decomposition: AllReduce, AllGather, ReduceScatter) → astra-sim-alibabacloud (C++; workload, system, and network-interface layers) → ns-3-alibabacloud (C++; packet-level simulation: RDMA, ECN/PFC, fat-tree). Data flow: per-op compute_time from GPU profiling, a workload description file (.txt), collective decomposition into point-to-point flows (bytes), and comm_time fed back up as timing feedback.]
vidur

Orchestration Layer

The discrete-event simulation engine. Manages request arrival, global scheduling (SplitWise), replica scheduling, batch formation, and metrics collection. Written in Python.

aicb

Workload & Profiling Layer

Generates realistic workload descriptions and profiles actual GPU kernel times using AIOB (AI Operation Benchmark). Captures DeepGEMM, FlashMLA, and NCCL operator latencies.

simccl

Collective Decomposition

Decomposes high-level collective operations (AllReduce, AllGather, ReduceScatter) into point-to-point data transfers. Supports ring, tree, and halving-doubling algorithms.

astra-sim

System Simulation

Three-layer simulation engine: Workload layer (reads execution graphs), System layer (schedules compute/comm), Network layer (pluggable backend: analytical or NS-3).

ns-3

Network Simulation

Packet-level network simulation with RDMA transport, ECN/PFC congestion control, and configurable topologies (fat-tree, rail-optimized, dual-ToR). The highest-fidelity backend.

4. Execution Time Breakdown

A request's end-to-end latency decomposes into 5 phases. This section summarizes what parameters affect each phase. For the detailed calculation logic (RandomForest models, backend comparison, per-layer formulas), see the Lifecycle page.

e2e_time = completed_at − arrived_at

Timeline

Non-PD Mode (same replica)

① Prefill Sched.
② Prefill Exec
④ Decode Sched.
⑤ Decode Iterations (×N)

PD Disaggregation Mode

Prefill Replica
① Prefill Sched.
② Prefill Exec
Network
③ KV Cache Transfer
Decode Replica
④ Decode Sched.
⑤ Decode Iterations

What Affects Each Phase

Prefill Scheduling Delay

Time from request arrival to the first batch containing this request. Event-driven: the scheduler tries immediately when a request arrives, but may be blocked.

Parameter How it affects scheduling delay
num_prefill_tokens Longer prompts need more KV blocks upfront and consume more batch token budget — harder to fit.
KV cache memory usage When allocated blocks + watermark ≥ total blocks, no new prefill can be scheduled until decode requests free memory.
num_pipeline_stages Pipeline can hold at most this many batches in-flight; when full, new batches wait for one to finish.
max_tokens_in_batch If adding this request's tokens would exceed the limit, it waits for the current batch to execute.
batch_size_cap Hard cap on requests per micro-batch; excess requests must queue for the next batch.
QPS / arrival rate Higher QPS fills KV memory and batch slots faster, increasing contention and queue depth.
Preemption Memory thrashing evicts the request; entire prefill must restart from scratch.
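
The memory condition in the table can be sketched as a simple admission check. This is a hypothetical helper, not the actual vidur scheduler code; block_size and the watermark reserve are illustrative:

```python
def can_schedule_prefill(num_prompt_tokens: int, block_size: int,
                         allocated_blocks: int, total_blocks: int,
                         watermark_blocks: int) -> bool:
    """Admission check sketched from the table above: a new prefill needs
    ceil(prompt_tokens / block_size) KV blocks up front, and scheduling it
    must not eat into the watermark reserve."""
    needed = -(-num_prompt_tokens // block_size)   # ceil division
    return allocated_blocks + needed + watermark_blocks <= total_blocks

# A 300-token prompt with 16-token blocks needs 19 KV blocks.
print(can_schedule_prefill(300, 16, 900, 1000, 30))  # True: 70 free above watermark
print(can_schedule_prefill(300, 16, 960, 1000, 30))  # False: only 10 free above watermark
```

Until decode requests complete and free blocks, the second request sits in the waiting queue, which is exactly the scheduling delay this phase measures.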

Prefill Execution

GPU computation for processing all input tokens in one forward pass (single batch).

Parameter How it affects prefill time
num_prefill_tokens More tokens = more compute per layer; attention cost scales quadratically with sequence length.
num_layers Total time is block_time × layers_per_stage; more layers = proportionally longer.
Model architecture Larger hidden_dim and more attention heads increase per-layer compute (matmul size).
tensor_parallel_size Splits compute across GPUs (faster) but adds 2 AllReduce per layer (communication overhead).
num_pipeline_stages Each stage handles fewer layers (faster per stage) but adds SendRecv latency between stages.
GPU device H100 kernels are roughly 2× faster than A100 for the same operation (profiled per device).
Execution backend vidur/aicb/simai_analytical/simai_simulation use different methods to estimate compute & comm time.

KV Cache Transfer PD mode only

Network transfer of KV cache from prefill replica to decode replica. Skipped in non-PD mode.

Parameter How it affects transfer time
num_prefill_tokens More tokens = larger KV cache to transfer (linear relationship).
Model dims kv_cache_size = 2 × tokens × num_kv_heads × head_dim × num_layers × bytes_per_element.
pd_p2p_comm_bandwidth Higher bandwidth (e.g. 800 Gbps) proportionally reduces transfer time.
pd_p2p_comm_dtype fp16 = 2 bytes/element, fp32 = 4 bytes — doubles the transfer size.
transfer_time = kv_cache_size / pd_p2p_comm_bandwidth
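
Plugging the two formulas above into a short sketch (the model dimensions are illustrative values for a LLaMA-3-8B-class model, not SimAI code):

```python
def kv_cache_bytes(num_tokens, num_kv_heads, head_dim, num_layers,
                   bytes_per_element=2):
    """kv_cache_size = 2 (K and V) x tokens x kv_heads x head_dim x layers x bytes."""
    return 2 * num_tokens * num_kv_heads * head_dim * num_layers * bytes_per_element

def transfer_time_ms(size_bytes, bandwidth_gbps):
    """transfer_time = kv_cache_size / pd_p2p_comm_bandwidth."""
    return size_bytes * 8 / (bandwidth_gbps * 1e9) * 1e3

# ~300-token prompt, 8 KV heads, head_dim 128, 32 layers, fp16 (2 B/element):
size = kv_cache_bytes(300, 8, 128, 32)        # 39,321,600 bytes (~39 MB)
print(round(transfer_time_ms(size, 800), 3))  # 0.393 ms over an 800 Gbps link
```

Switching pd_p2p_comm_dtype to fp32 (4 bytes/element) doubles both the size and the transfer time, as the table notes.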

Decode Scheduling Delay

Wait before the first decode batch. Non-PD: near-zero (same replica, next scheduler cycle). PD: queueing time on D replica after KV transfer.

Parameter How it affects decode scheduling
D replica load More concurrent decode requests on D = pipeline/batch slots fill up, new arrivals queue.
KV cache on D Decode only needs 1 block/iter, but if D's memory is nearly full, even this can block.
batch_size_cap If D's current batch already has max requests, new decode must wait for next cycle.
Non-PD mode Request stays on same replica; slots into next batch immediately with continuous batching (≈ 0).

Decode Iterations

Auto-regressive generation: num_decode_tokens − 1 serial iterations, each producing 1 token. Typically the dominant phase for long outputs.

Parameter How it affects decode time
num_decode_tokens Directly determines iteration count; 128 output tokens = 127 serial iterations.
Context length KV cache grows each iteration; attn_decode cost increases as context lengthens (later tokens cost more).
Concurrent batch_size More decode requests in same batch = larger total KV cache to attend over, increasing per-iter time.
num_layers Each iteration passes through all layers; more layers = proportionally longer per iteration.
tensor_parallel_size Same tradeoff as prefill: splits compute but adds 2 AllReduce per layer per iteration.
GPU device Faster GPU = faster per-iteration decode; decode is memory-bandwidth bound (not compute bound).
Inter-batch gaps Scheduler overhead between iterations; total = Σ(per_iter_time + gap) across all iterations.
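
The accumulation in the last row can be written out directly. This is a hypothetical helper; the per-iteration times and gap are illustrative, loosely based on the ~10.8 ms TBT figure in the timing reference:

```python
def decode_phase_ms(num_decode_tokens, per_iter_times_ms, gap_ms=0.0):
    """total = sum(per_iter_time + gap) over num_decode_tokens - 1
    serial iterations (the first token comes out of prefill)."""
    iters = num_decode_tokens - 1
    assert len(per_iter_times_ms) == iters
    return sum(t + gap_ms for t in per_iter_times_ms)

# 128 output tokens -> 127 iterations; per-iteration time creeps up
# as the KV cache (context) grows each iteration.
times = [10.8 + 0.001 * i for i in range(127)]
print(round(decode_phase_ms(128, times, gap_ms=0.05) / 1e3, 2))  # ~1.39 s
```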

Typical Timing Reference

Single-request latency at low QPS (no queueing contention), non-PD mode (TP=1, PP=1). The RTX 5090 column is measured with vLLM v0.1 + LLaMA-3.1 8B (bf16). Other columns are estimates based on device specs.

Non-PD Mode — ~300 prefill + 128 decode tokens

Phase · LLaMA-2 7B A100 (estimated) · LLaMA-3.1 8B RTX 5090 (measured) · LLaMA-3 8B H100 (estimated) · LLaMA-2 70B H100 TP=4 (estimated)
① Prefill Sched. · ~1 ms · < 1 ms · ~1 ms · ~2 ms
② Prefill Exec · ~18 ms · 14 ms · ~8 ms · ~50 ms
③ KV Transfer · skipped (non-PD)
④ Decode Sched. · ≈ 0 · ≈ 0 · ≈ 0 · ≈ 0
⑤ Decode (127×) · ~1.4 s · 1.37 s · ~0.8 s · ~2.5 s
E2E · ~1.4 s · 1.39 s · ~0.8 s · ~2.6 s
TTFT · ~19 ms · 14 ms · ~9 ms · ~52 ms
TBT (avg) · ~11 ms · 10.8 ms · ~6 ms · ~20 ms
RTX 5090 measurement details: LLaMA-3.1 8B Instruct, bf16, vLLM v0.1 with FlashAttention + CUDA graphs, 32 GB GDDR7 (1.8 TB/s bandwidth), prompt ~303 tokens, decode 128 tokens. TBT ~10.8 ms/token is comparable to A100 (~11 ms) despite being a consumer GPU — decode is memory-bandwidth bound and RTX 5090's GDDR7 bandwidth (1.8 TB/s) is close to A100's HBM2e (2 TB/s).
PD disagg at low QPS: E2E is nearly identical because the same GPU work is done either way. PD disaggregation adds the KV transfer (~0.3-1.6 ms) plus D-replica queueing while removing no compute. The benefit appears at high QPS: prefill and decode no longer compete for the same GPU, which reduces scheduling delay and increases throughput.

Supported Devices & Models

Supported Devices A40, A100, H100, H800
NOT supported Consumer GPUs (RTX 5090, etc.) — no profiling CSVs available. You must profile on the target GPU and add CSVs to data/profiling/compute/ and data/profiling/network/.
Example Models LLaMA-2 7B/13B/70B, LLaMA-3 8B/70B, Qwen, CodeLlama, Mixtral, DeepSeek-671B, Qwen3-Moe-235B

5. New Core Entities

SimAI introduces several new entity classes to model PD disaggregation. These classes extend Vidur's original entity hierarchy with DAG-based task representation and explicit KV cache transfer flows.

5.1 Request with DAG

vidur-alibabacloud
vidur/entities/request.py

The Request class now carries a directed acyclic graph (DAG) that encodes the dependency relationships between prefill tasks, decode tasks, and KV cache transfer flows. This is the fundamental data structure enabling PD disaggregation simulation.

import networkx as nx

class Request(BaseEntity):
    """A single inference request with PD disaggregation support."""

    def __init__(self, ...):
        self.dag = nx.DiGraph()         # Task dependency graph
        self.prefill_replica_id = None # Assigned prefill replica
        self.decode_replica_id = None  # Assigned decode replica

        # PD communication metrics (populated after KV transfer)
        self.pd_p2p_comm_size = float('inf')
        self.pd_p2p_comm_time = float('inf')

        # Timing breakdown
        self.prefill_e2e = None
        self.decode_e2e = None
        self.prefill_start_timestamp = None
        self.decode_start_timestamp = None

5.2 Replica Types

vidur-alibabacloud
vidur/entities/replica.py

SimAI introduces a ReplicaType enum to differentiate between co-located (mixed) replicas and PD-disaggregated replicas. The global scheduler uses this to route requests to the appropriate pool.

from enum import IntEnum

class ReplicaType(IntEnum):
    MIXED   = 0  # Co-located: prefill + decode on same replica
    PREFILL = 1  # Dedicated prefill replica
    DECODE  = 2  # Dedicated decode replica

5.3 Node / Task / Flow Hierarchy

vidur-alibabacloud
vidur/entities/node.py

Node is the abstract base class for all executable units in the request DAG. Each node has a start time, end time, and duration.

class Node(BaseEntity):
    """Base class for Task and Flow."""
    start_time: float   # simulation time when the node starts
    end_time: float     # simulation time when the node finishes
    duration: float     # end_time - start_time
vidur/entities/task.py

Task extends Node for compute operations. Two concrete subclasses model the PD split.

class Task(Node):
    """Compute operation."""
    pass

class PromptTask(Task):
    """Prefill computation."""
    pass

class TokenTask(Task):
    """Decode computation."""
    pass
vidur/entities/flow.py

Flow extends Node for data transfer operations. The key subclass is KVCacheTransferFlow, which models the PD KV cache transfer.

class Flow(Node):
    """Data transfer operation."""
    size_bytes: int     # payload size in bytes
    bandwidth: float    # link bandwidth available to this flow

class KVCacheTransferFlow(Flow):
    """KV cache transfer from prefill to decode replica.
    In PD disaggregation mode, this flow is inserted into the
    request DAG between PromptTask and TokenTask nodes."""

    def __init__(self, src_replica_id, dst_replica_id, kv_cache_size):
        self.src_replica_id = src_replica_id
        self.dst_replica_id = dst_replica_id
        self.kv_cache_size = kv_cache_size
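
These classes are wired into the 3-node request DAG that PD mode builds per request. A minimal stdlib sketch of that wiring, using graphlib in place of the networkx DiGraph that vidur-alibabacloud actually uses; the node records are hypothetical stand-ins for task/flow instances:

```python
from graphlib import TopologicalSorter

# Hypothetical node records standing in for PromptTask /
# KVCacheTransferFlow / TokenTask instances.
nodes = {
    0: {"kind": "PromptTask"},                  # prefill, on a P replica
    1: {"kind": "KVCacheTransferFlow",          # P -> D network flow
        "src_replica_id": 0, "dst_replica_id": 4,
        "kv_cache_size": 39_321_600},
    2: {"kind": "TokenTask"},                   # all decode iterations
}
deps = {1: {0}, 2: {1}}   # transfer waits on prefill; decode waits on transfer
order = list(TopologicalSorter(deps).static_order())
print([nodes[n]["kind"] for n in order])
# ['PromptTask', 'KVCacheTransferFlow', 'TokenTask']
```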

5.4 Interconnect Types

vidur-alibabacloud
vidur/entities/interconnect.py

SimAI models multiple interconnect types to accurately simulate communication latency across different hardware links. Each interconnect has a configurable bandwidth and latency.

class Interconnect:
    """Hardware interconnect abstraction.

    Supported link types:
      NVLink    - intra-node GPU-GPU (e.g., 900 GB/s per direction)
      RDMA      - inter-node GPU-GPU via RoCEv2/InfiniBand
      Ethernet  - standard Ethernet (fallback)
      PCIe      - CPU-GPU or NIC-GPU transfer
      DummyLink - zero-latency link for testing
    """

    def __init__(self, link_type, bandwidth_bps, latency_us=0):
        self.link_type = link_type
        self.bandwidth_bps = bandwidth_bps
        self.latency_us = latency_us

    def transfer_time(self, size_bytes) -> float:
        """Estimate transfer time in microseconds."""
        return (size_bytes * 8 / self.bandwidth_bps * 1e6
                + self.latency_us)

6. Configuration System

SimAI extends Vidur's configuration with new parameters for PD disaggregation, network simulation, and MoE model support. All parameters are exposed as CLI flags and can be set in config files.

6.1 New Config Parameters

vidur-alibabacloud
vidur/config/replica_config.py
from dataclasses import dataclass, field

@dataclass
class ReplicaConfig:
    # ===== PD Disaggregation =====
    pd_p2p_comm_bandwidth: int = 800      # Gbps (KV transfer link)
    pd_p2p_comm_dtype: str = 'float16'    # KV cache data type
    pd_node_ratio: float = 0.5            # P:D ratio (0.5 = 1:1)

    # ===== Network Bandwidth =====
    nvlink_bandwidth: int = 1600          # Gbps (NVLink per direction)
    rdma_bandwidth: int = 800             # Gbps (RDMA per NIC)

    # ===== MoE (Mixture of Experts) =====
    expert_model_parallel_size: int = 1  # Expert parallelism degree

    # ===== Simulation Backend =====
    backend: str = "vidur"
    # Choices: "vidur"              - original Vidur (CPU only)
    #          "simai_simulation"   - NS-3 full network sim
    #          "simai_analytical"   - Bus bandwidth estimation
    #          "aicb"               - AICB GPU profiling backend

6.2 Execution Time Predictor Config

vidur-alibabacloud
vidur/config/execution_time_predictor_config.py
@dataclass
class RandomForrestExecutionTimePredictorConfig:
    """Config for the execution time predictor.
    In SimAI, this predictor can delegate to AICB for
    real GPU profiling instead of sklearn RandomForest."""

    backend: str = "vidur"
    # "vidur" - sklearn RandomForest on profiled CSV
    # "aicb"  - AICB AIOB real GPU kernel profiling

    compute_cache_dir: str = "./compute_cache"
    # Directory for cached AICB profiling results

    model_name: str = "deepseek-671B"
    # Target model for profiling
Configuration Hierarchy: SimAI's config system follows a layered approach: default values → config file → CLI flags. CLI flags always take precedence. All PD-related parameters have sensible defaults that produce co-located (non-disaggregated) behavior, maintaining backward compatibility with original Vidur.
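
The precedence chain amounts to a layered dict merge. A hypothetical helper, not the actual vidur config loader; a None CLI value stands in for an unset flag:

```python
def resolve_config(defaults, file_cfg, cli_flags):
    """Layered resolution: defaults < config file < CLI flags."""
    merged = dict(defaults)
    merged.update(file_cfg)
    # Only CLI flags the user actually set override the lower layers.
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged

cfg = resolve_config(
    {"pd_node_ratio": 0.5, "backend": "vidur"},   # defaults (co-located)
    {"backend": "simai_analytical"},               # config file
    {"pd_node_ratio": 0.3, "backend": None},       # CLI flags (backend unset)
)
print(cfg)   # {'pd_node_ratio': 0.3, 'backend': 'simai_analytical'}
```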

7. Three Simulation Modes

SimAI offers three simulation backends, each trading off speed for fidelity. Choose the right mode based on your exploration stage and available resources.

Mode · Backend · Speed · Fidelity · Hardware Required · Use Case
Analytical · Bus bandwidth estimation · Fast (★★★) · Low · CPU only · Quick exploration, parameter sweeps
NS-3 Simulation · Full packet-level network sim · Slow · High (★★★) · CPU only (multi-core recommended) · Topology comparison, congestion analysis
Physical · Real RDMA traffic · Real-time (★★) · Highest (★★★★) · RDMA-capable cluster · Final validation, production calibration

Analytical Mode

Uses simple bus bandwidth formulas: time = message_size / (bandwidth * num_links). No congestion modeling, no packet-level simulation. Suitable for rapid parameter sweeps across hundreds of configurations.
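
The bus-bandwidth formula is a one-liner; the sketch below is illustrative and its parameters are assumptions, not the actual analytical backend:

```python
def analytical_comm_time_us(message_bytes, link_bandwidth_gbps, num_links=1):
    """time = message_size / (bandwidth * num_links); no congestion modeled."""
    return message_bytes * 8 / (link_bandwidth_gbps * 1e9 * num_links) * 1e6

# An 8 MB TP AllReduce payload striped across 2 x 400 Gbps links:
print(round(analytical_comm_time_us(8 * 2**20, 400, num_links=2), 1))  # 83.9 us
```

Because congestion and packetization are ignored, this is a lower bound; the NS-3 backend below captures the effects this formula misses.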

# Backend flag for analytical mode
--backend simai_analytical

NS-3 Mode

Full packet-level simulation with RDMA transport, ECN/PFC congestion control, and real topology modeling. Captures contention, head-of-line blocking, and incast effects. 10-100x slower than analytical but much more accurate.

# Backend flag for NS-3 mode
--backend simai_simulation

Physical Mode

Runs actual RDMA traffic on a physical cluster. Provides ground-truth measurements for validation. Requires RoCEv2 or InfiniBand capable NICs and an actual cluster deployment.

# Physical mode uses real hardware
--backend physical
Mode Selection Strategy: Three simulation modes let you trade off speed vs fidelity. Start with analytical for broad parameter sweeps (seconds per run), then validate interesting configurations with NS-3 (minutes per run), and finally confirm with physical on the actual cluster.

8. Example CLI

The following command runs a PD disaggregation simulation for DeepSeek-V3-671B with the SplitWise scheduler and AICB-profiled execution times:

Full PD Disaggregation Simulation

python -m vidur.main \
  --replica_config_model_name deepseek-671B \
  --replica_config_pd_p2p_comm_bandwidth 800 \
  --replica_config_nvlink_bandwidth 1600 \
  --replica_config_pd_node_ratio 0.5 \
  --global_scheduler_config_type split_wise \
  --replica_scheduler_config_type split_wise \
  --random_forrest_execution_time_predictor_config_backend aicb

Quick Analytical Run

For a fast parameter sweep using the analytical backend:

python -m vidur.main \
  --replica_config_model_name qwen3-moe-235B \
  --replica_config_pd_node_ratio 0.3 \
  --global_scheduler_config_type split_wise \
  --replica_scheduler_config_type split_wise \
  --random_forrest_execution_time_predictor_config_backend vidur \
  --backend simai_analytical

NS-3 High-Fidelity Run

For detailed network simulation with topology specification:

python -m vidur.main \
  --replica_config_model_name deepseek-671B \
  --replica_config_pd_p2p_comm_bandwidth 800 \
  --replica_config_rdma_bandwidth 800 \
  --global_scheduler_config_type split_wise \
  --replica_scheduler_config_type split_wise \
  --backend simai_simulation \
  --network_topology fat_tree

CLI Parameter Reference

Flag Description Default
--replica_config_model_name Target model to simulate -
--replica_config_pd_p2p_comm_bandwidth PD KV transfer bandwidth (Gbps) 800
--replica_config_nvlink_bandwidth NVLink bandwidth per direction (Gbps) 1600
--replica_config_rdma_bandwidth RDMA bandwidth per NIC (Gbps) 800
--replica_config_pd_node_ratio Fraction of nodes for prefill (0.5 = 1:1 P:D) 0.5
--replica_config_pd_p2p_comm_dtype Data type for KV cache transfer float16
--global_scheduler_config_type Global scheduler algorithm round_robin
--replica_scheduler_config_type Replica scheduler algorithm vllm
--random_forrest_execution_time_predictor_config_backend Execution time predictor backend vidur
--backend Simulation backend (vidur/simai_simulation/simai_analytical) vidur
--replica_config_expert_model_parallel_size Expert parallelism degree for MoE models 1

9. Key Insights

Cross-Component Integration: SimAI's PD disaggregation simulation capability comes from cross-component integration. No single component can model PD end-to-end: vidur provides the scheduling and event loop, AICB profiles compute time, SimCCL decomposes collective operations, astra-sim orchestrates the system simulation, and NS-3 provides packet-level network fidelity. The innovation is in the glue between these components — the workload file format, the timing callback interface, and the DAG-based task representation that threads through all five layers.
Hardware Limitation: Profiling with AICB requires an SM90+ GPU (NVIDIA Hopper or Blackwell architecture). This means you need access to H100, H200, or B200 GPUs to generate the profiling data that feeds into SimAI's execution time predictions. The simulation itself runs on CPU, but the profiling step is GPU-bound. If you lack SM90+ hardware, you must rely on pre-cached profiling results or use the original Vidur RandomForest predictor as a fallback.
DAG Complexity: The DAG-based task representation introduces additional complexity compared to Vidur's linear execution model. Each request in PD mode generates exactly three DAG nodes (PromptTask → KVCacheTransferFlow → TokenTask); in co-located mode (same replica), it generates only two nodes (PromptTask → TokenTask). The single TokenTask node handles all decode tokens internally through multiple batch iterations (tokens_per_iteration = 1), rather than creating a separate node per token. This keeps the DAG lightweight regardless of output length.
Backward Compatibility: SimAI maintains full backward compatibility with the original Vidur. If you set backend=vidur and use the default schedulers (round_robin + vllm), SimAI behaves identically to upstream Vidur. The PD disaggregation features only activate when you explicitly configure SplitWise scheduling and set pd_node_ratio < 1.0.
MoE Support: SimAI adds native support for Mixture-of-Experts (MoE) models like DeepSeek-V3 and Qwen3-MoE. The expert_model_parallel_size parameter controls how experts are distributed across GPUs. AICB profiles MoE-specific kernels including expert routing, top-k gating, and sparse matrix operations, providing accurate compute time estimates for these architectures.

Appendix: Repository Map

SimAI spans five repositories under the aliyun GitHub organization. The following table maps each component to its repository and primary language.

Component Repository Language Role
vidur vidur-alibabacloud Python Scheduling, orchestration, metrics
aicb aliyun/aicb Python + CUDA Workload generation, GPU profiling
simccl SimCCL C++ Collective communication decomposition
astra-sim astra-sim-alibabacloud C++ System simulation engine
ns-3 ns-3-alibabacloud C++ Packet-level network simulation

Build Dependencies

The C++ components (SimCCL, astra-sim, NS-3) are linked together at compile time via CMake. SimCCL is compiled as a static library that astra-sim links against, and NS-3 is built as a separate shared library loaded by astra-sim's network layer.

Python ↔ C++ Bridge

vidur (Python) communicates with astra-sim (C++) through workload description files (.txt). vidur writes the execution graph as a text file, then invokes astra-sim as a subprocess. astra-sim returns timing results that vidur reads back to advance its event loop.

Appendix: Output Metrics

SimAI produces a rich set of per-request and aggregate metrics. The PD-specific metrics are unique to SimAI and not available in the original Vidur.

Metric Unit Description New in SimAI?
ttft ms Time To First Token -
tbt ms Time Between Tokens (avg) -
e2e_latency ms End-to-end request latency -
prefill_e2e ms Prefill phase end-to-end time ✓
decode_e2e ms Decode phase end-to-end time ✓
pd_p2p_comm_time ms KV cache transfer time (P→D) ✓
pd_p2p_comm_size bytes KV cache transfer size ✓
prefill_replica_id - Assigned prefill replica ✓
decode_replica_id - Assigned decode replica ✓
vidur-alibabacloud
vidur/metrics/metrics_store.py

The metrics store collects all per-request metrics and exports them to CSV. The PD-specific fields are only populated when running in PD disaggregation mode.

class MetricsStore:
    """Central metrics collection for simulation results."""

    def on_request_complete(self, request: Request):
        # Standard Vidur metrics
        self._record("ttft", request.ttft)
        self._record("tbt", request.avg_tbt)
        self._record("e2e_latency", request.e2e_latency)

        # NEW: PD disaggregation metrics
        if request.prefill_e2e is not None:
            self._record("prefill_e2e", request.prefill_e2e)
            self._record("decode_e2e", request.decode_e2e)
            self._record("pd_p2p_comm_time", request.pd_p2p_comm_time)
            self._record("pd_p2p_comm_size", request.pd_p2p_comm_size)

Appendix: Request DAG Example

In PD disaggregation mode, a request produces a DAG with exactly three nodes. The single TokenTask node handles all decode tokens internally through batch iterations (tokens_per_iteration = 1), rather than creating a separate node per token. The KVCacheTransferFlow sits between prefill and decode, creating the critical PD communication dependency.

Request DAG in PD Disaggregation Mode
[DAG diagram: PromptTask (node_id 0, prefill replica) → KVCacheTransferFlow (node_id 1, network) → TokenTask (node_id 2, decode replica). PromptTask prefills all tokens at once; its compute_time comes from AICB profiling. The flow's pd_p2p_comm_time comes from NS-3 or the analytical backend. TokenTask is a single node with token_size = num_decode_tokens − 1 that internally iterates with tokens_per_iteration = 1, so its total cost is roughly compute_time × num_decode_tokens, batched by the replica scheduler. The DAG has only 3 nodes; decode tokens are handled by batch iterations within the single TokenTask.]
Critical Path: The TTFT (Time To First Token) in PD mode is prefill_compute_time + pd_p2p_comm_time + first_decode_compute_time. This means the KV cache transfer time directly impacts user-perceived latency. Optimizing PD bandwidth (via pd_p2p_comm_bandwidth) and topology (to minimize hops between P and D replicas) is crucial for PD disaggregation performance.
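
The critical path reads directly as a sum; the numbers below are hypothetical, roughly in line with the H100 figures in the timing reference:

```python
def ttft_pd_ms(prefill_ms, kv_transfer_ms, first_decode_ms):
    """PD-mode TTFT = prefill compute + KV transfer + first decode compute."""
    return prefill_ms + kv_transfer_ms + first_decode_ms

# ~8 ms prefill, ~0.4 ms KV transfer at 800 Gbps, ~6 ms first decode iter:
print(round(ttft_pd_ms(8.0, 0.4, 6.0), 1))   # 14.4 ms
```

The transfer term is the only one the network topology controls, which is why PD bandwidth and P-to-D hop count matter for TTFT.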

Appendix: SplitWise Scheduler

The SplitWise scheduler is SimAI's key addition to Vidur's scheduling layer. It operates at two levels: the global scheduler routes requests to the correct replica pool, and the replica scheduler manages batch formation within each pool.

vidur-alibabacloud
vidur/scheduler/global_scheduler/split_wise_global_scheduler.py

Global Level: Pool Routing

The SplitWise global scheduler partitions replicas into prefill and decode pools based on pd_node_ratio. When a new request arrives, it routes the initial prefill to the least-loaded prefill replica. After prefill completes and KV cache is transferred, the decode phase is routed to the least-loaded decode replica.

class SplitWiseGlobalScheduler(BaseGlobalScheduler):
    """PD-aware global scheduler.
    Splits replicas into prefill and decode pools."""

    def __init__(self, config, replicas):
        self.pd_node_ratio = config.pd_node_ratio
        n_prefill = int(len(replicas) * self.pd_node_ratio)
        self.prefill_replicas = replicas[:n_prefill]
        self.decode_replicas = replicas[n_prefill:]

    def schedule(self, request: Request) -> int:
        # Route prefill to least-loaded prefill replica
        target = min(
            self.prefill_replicas,
            key=lambda r: r.pending_requests
        )
        request.prefill_replica_id = target.id
        return target.id
vidur-alibabacloud
vidur/scheduler/replica_scheduler/split_wise_replica_scheduler.py

Replica Level: Batch Formation

The SplitWise replica scheduler extends the base Sarathi-style continuous batching with PD awareness. Prefill replicas only process prefill requests, decode replicas only process decode iterations. This specialization eliminates the prefill-decode interference that degrades performance in co-located deployments.

class SplitWiseReplicaScheduler(BaseReplicaScheduler):
    """PD-aware replica scheduler.
    Handles batch formation for a single replica type."""

    def _build_batch(self) -> Batch:
        if self.replica_type == ReplicaType.PREFILL:
            # Only schedule prefill requests
            candidates = [r for r in self.waiting
                         if not r.prefill_complete]
        else:
            # Only schedule decode iterations
            candidates = [r for r in self.running
                         if r.prefill_complete]
        return self._form_batch(candidates)
pd_node_ratio Sensitivity: The pd_node_ratio parameter is critical for performance. Too many prefill replicas (ratio > 0.5) leads to decode starvation and high TBT. Too few prefill replicas (ratio < 0.3) causes prefill queueing and high TTFT. SimAI's simulation capability makes it practical to sweep this parameter across dozens of values without provisioning real hardware.
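
Because each run is a single CLI invocation, such a sweep is a short driver script. The sketch below only builds the command lines (flags as in the CLI reference above); passing each to subprocess.run is left as the assumed execution step:

```python
def sweep_commands(ratios):
    """One vidur.main invocation per pd_node_ratio value, using the
    analytical backend for speed."""
    base = ["python", "-m", "vidur.main",
            "--replica_config_model_name", "deepseek-671B",
            "--global_scheduler_config_type", "split_wise",
            "--replica_scheduler_config_type", "split_wise",
            "--backend", "simai_analytical"]
    return [base + ["--replica_config_pd_node_ratio", str(r)] for r in ratios]

for cmd in sweep_commands([0.2, 0.3, 0.4, 0.5, 0.6]):
    print(" ".join(cmd))   # or: subprocess.run(cmd, check=True)
```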

Appendix: AICB Profiling Pipeline

AICB (AI Communication Benchmark) provides two key capabilities to SimAI: workload description generation and real GPU kernel profiling. The profiling pipeline is designed to be run once per model/hardware combination and cached for repeated simulations.

Step 1: Model Description (aicb)

AICB reads the model architecture description (layer count, hidden dimension, attention heads, MoE config) and generates a comprehensive operator list. For DeepSeek-V3-671B, this includes MLA (Multi-head Latent Attention), DeepGEMM sparse experts, and shared expert layers.

Step 2: AIOB Kernel Profiling (aicb)

The AIOB (AI Operation Benchmark) module executes each operator on the actual GPU and records its execution time. For compute operators, it uses DeepGEMM for matrix multiplications and FlashMLA for attention. For communication operators, it measures NCCL collective latencies.

Step 3: Workload File Generation (aicb)

AICB outputs a workload description file (.txt) that lists every operation in an iteration with its profiled timing. This file is consumed by astra-sim's workload layer. The format encodes: operation type, data size, compute time, communication collective type, and dependencies.

Step 4: Cache for Simulation (vidur-alibabacloud)

The profiled compute times are cached in the compute_cache directory. Vidur's execution time predictor loads these cached values instead of running sklearn RandomForest predictions. This makes subsequent simulation runs fast while maintaining profiling accuracy.

Workload File Format Example
Source: aicb/workload_generator/generate_workload.py

# AICB workload description file for DeepSeek-V3-671B
# Format: op_type  data_size  compute_time_us  comm_type  deps
COMP    0          1250          NONE       -1     # QKV projection (DeepGEMM)
COMP    0          890           NONE       0      # MLA attention (FlashMLA)
COMP    0          420           NONE       1      # Output projection
COMM    8388608   0             ALLREDUCE  2      # TP AllReduce (8MB)
COMP    0          340           NONE       3      # Expert routing + gating
COMP    0          1680          NONE       4      # Sparse expert FFN (DeepGEMM)
COMM    16777216  0             ALLTOALL   5      # EP All-to-All (16MB)
COMP    0          560           NONE       6      # Shared expert FFN
COMM    8388608   0             ALLREDUCE  7      # TP AllReduce (8MB)
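Parsing this format is straightforward; the field names below follow the comment header in the example (this is a minimal sketch, not AICB's actual parser).

```python
def parse_workload(text: str):
    """Parse AICB-style workload lines into dicts, per the format in the
    example above: op_type, data_size, compute_time_us, comm_type, deps."""
    ops = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip trailing comments
        if not line:
            continue  # skip blank / comment-only lines
        op_type, data_size, compute_us, comm_type, dep = line.split()
        ops.append({
            "op_type": op_type,
            "data_size": int(data_size),         # bytes (0 for compute ops)
            "compute_time_us": int(compute_us),  # profiled time (0 for comm ops)
            "comm_type": comm_type,              # NONE / ALLREDUCE / ALLTOALL / ...
            "dep": int(dep),                     # index of predecessor op, -1 = none
        })
    return ops
```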

Appendix: Supported Network Topologies

The NS-3 backend supports multiple data center network topologies. Choosing the right topology significantly impacts PD communication latency and collective operation performance.

| Topology | Structure | Max Hops (P→D) | Bisection BW | Best For |
| --- | --- | --- | --- | --- |
| Fat-tree | 3-tier Clos (core/agg/ToR) | 6 | Full | General purpose, balanced |
| Rail-Optimized | GPU-rank aligned rails | 2 | Partial | AllReduce-heavy workloads |
| Dual-ToR | Redundant ToR switches | 4 | High | PD disagg with fault tolerance |
| Single-Switch | All GPUs on one switch | 2 | Full | Small clusters (≤64 GPUs) |
Congestion Control (ns-3)

The NS-3 backend models DCQCN (Data Center Quantized Congestion Notification) congestion control with ECN marking and PFC (Priority Flow Control) pause frames. When multiple KV cache transfers compete for bandwidth on shared links, the simulation captures the resulting throughput degradation and queueing delays. These effects are invisible to the analytical backend.
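The core of DCQCN's sender-side behavior, cut the rate multiplicatively on congestion feedback and recover toward the pre-cut rate otherwise, can be sketched as follows. Constants and naming are illustrative, not the values used in SimAI.conf or the NS-3 implementation.

```python
class DcqcnSender:
    """Simplified DCQCN sender-side rate control: on a CNP (ECN-triggered
    Congestion Notification Packet) the rate is cut in proportion to a
    congestion estimate alpha; quiet periods decay alpha and recover the
    rate halfway toward the pre-cut target. Constants are illustrative."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 16):
        self.rate = line_rate_gbps    # current sending rate
        self.target = line_rate_gbps  # rate to recover toward
        self.alpha = 1.0              # congestion estimate in [0, 1]
        self.g = g                    # EWMA gain for alpha updates

    def on_cnp(self):
        # Congestion feedback: raise alpha, remember the pre-cut rate,
        # then cut the current rate multiplicatively.
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2

    def on_timer_no_cnp(self):
        # A period with no congestion feedback: decay alpha and
        # recover halfway toward the target (fast recovery).
        self.alpha = (1 - self.g) * self.alpha
        self.rate = (self.rate + self.target) / 2
```

In the simulation this kind of per-queue-pair state is what lets competing KV cache flows throttle each other realistically instead of splitting bandwidth ideally.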

Topology Configuration (astra-sim)

Network topologies are specified via JSON configuration files that describe switch connectivity, link bandwidth, and latency parameters. astra-sim reads this configuration and passes it to the NS-3 backend during initialization. Custom topologies can be defined by modifying these configuration files.
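For a sense of what such a description contains, here is a toy two-ToR topology built and serialized as JSON. The schema is invented for this sketch and is not astra-sim's actual configuration format; the real templates live under inputs/topo/.

```python
import json

# Toy topology description: an illustrative schema only, NOT the actual
# astra-sim/NS-3 configuration format (see inputs/topo/ for real templates).
topology = {
    "name": "two-tor-demo",
    "switches": ["tor0", "tor1", "agg0"],
    "hosts": [f"gpu{i}" for i in range(8)],
    "links": (
        # 4 GPUs per ToR at 400 Gbps
        [{"a": f"gpu{i}", "b": "tor0", "bw_gbps": 400, "lat_us": 1} for i in range(4)]
        + [{"a": f"gpu{i}", "b": "tor1", "bw_gbps": 400, "lat_us": 1} for i in range(4, 8)]
        # ToR uplinks to the aggregation switch at 800 Gbps
        + [{"a": "tor0", "b": "agg0", "bw_gbps": 800, "lat_us": 1},
           {"a": "tor1", "b": "agg0", "bw_gbps": 800, "lat_us": 1}]
    ),
}
config_json = json.dumps(topology, indent=2)
```

Generating topologies programmatically like this, rather than hand-writing JSON, is essentially what inputs/topo/gen_Topo_Template.py does for the supported topology families.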

Appendix: SimCCL Algorithm Mapping

SimCCL maps high-level collective communication primitives to concrete point-to-point transfer schedules. The algorithm selection depends on the message size, number of participants, and network topology.

| Collective | Algorithm | P2P Transfers | Typical Use |
| --- | --- | --- | --- |
| AllReduce | Ring (large msg) / Tree (small msg) | 2(n-1) / 2 log(n) | TP gradient sync |
| AllGather | Ring / Recursive Halving-Doubling | n-1 | PP layer gathering |
| ReduceScatter | Ring / Direct | n-1 | ZeRO gradient partitioning |
| AllToAll | Direct exchange | n(n-1) | MoE expert routing |
| Broadcast | Binary tree | log(n) | KV cache distribution |
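The transfer counts in the table are simple functions of the participant count n; a small helper makes them concrete (a sketch of the table's formulas, not SimCCL's actual selection logic):

```python
import math

def p2p_phase_count(collective: str, n: int, large_msg: bool = True) -> int:
    """P2P transfer counts per the table above, as a function of n.
    Assumes n is a power of two for the log-based algorithms."""
    if collective == "allreduce":
        # Ring for large messages, tree for small ones
        return 2 * (n - 1) if large_msg else 2 * int(math.log2(n))
    if collective in ("allgather", "reducescatter"):
        return n - 1                 # ring: one chunk per step
    if collective == "alltoall":
        return n * (n - 1)           # direct pairwise exchange
    if collective == "broadcast":
        return int(math.log2(n))     # binary tree depth
    raise ValueError(f"unknown collective: {collective}")
```

This is why MoE expert routing (AllToAll) dominates network load at scale: its transfer count grows quadratically in n while every other collective grows linearly or logarithmically.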
SimCCL Ring AllReduce Decomposition
Source: astra-sim-alibabacloud/extern/SimCCL/src/algorithm.cpp

For a ring AllReduce with n GPUs and message size M, SimCCL generates 2(n-1) phases. Each phase consists of n concurrent point-to-point transfers of size M/n. The first n-1 phases are reduce-scatter, the next n-1 are allgather.

// Simplified Ring AllReduce decomposition (C++)
#include <cstddef>
#include <vector>

// One point-to-point transfer emitted by the decomposition.
struct P2PTransfer {
    int src;
    int dst;
    size_t size;
    int phase;
};

struct RingAllReduce {
    void decompose(int num_gpus, size_t message_size,
                   std::vector<P2PTransfer>& transfers);
};

void RingAllReduce::decompose(
    int num_gpus,
    size_t message_size,
    std::vector<P2PTransfer>& transfers
) {
    // For simplicity, assumes message_size divides evenly by num_gpus.
    size_t chunk_size = message_size / num_gpus;

    // Phase 1: Reduce-Scatter (n-1 steps). Chunk indices are omitted:
    // only transfer sizes and phases matter for timing simulation.
    for (int step = 0; step < num_gpus - 1; step++) {
        for (int gpu = 0; gpu < num_gpus; gpu++) {
            int dst = (gpu + 1) % num_gpus;
            transfers.push_back({
                .src = gpu,
                .dst = dst,
                .size = chunk_size,
                .phase = step
            });
        }
    }

    // Phase 2: AllGather (n-1 steps)
    for (int step = 0; step < num_gpus - 1; step++) {
        for (int gpu = 0; gpu < num_gpus; gpu++) {
            int dst = (gpu + 1) % num_gpus;
            transfers.push_back({
                .src = gpu,
                .dst = dst,
                .size = chunk_size,
                .phase = num_gpus - 1 + step
            });
        }
    }
}
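A quick sanity check on the decomposition: each GPU sends one M/n chunk in each of the 2(n-1) phases, so per-GPU traffic is 2(n-1)·M/n, approaching 2M as n grows.

```python
def ring_allreduce_bytes_per_gpu(n: int, message_size: int) -> float:
    """Bytes each GPU sends in the ring decomposition above:
    2*(n-1) phases, one M/n chunk sent per phase."""
    chunk = message_size / n
    return 2 * (n - 1) * chunk

# e.g. 8 GPUs, 8 MB message: each GPU sends 2 * 7 * 1 MB = 14 MB,
# matching the 2(n-1) transfers listed for ring AllReduce in the table.
```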

Appendix: Event Types in SimAI

SimAI's discrete-event simulation loop processes the following event types. Events marked with a star are new additions from SimAI (not present in original Vidur).

| Event | Trigger | Handler | New? |
| --- | --- | --- | --- |
| RequestArrivalEvent | Trace timestamp | Creates Request, routes to global scheduler | - |
| BatchScheduleEvent | Replica ready | Forms batch, invokes execution time predictor | - |
| BatchEndEvent | Batch execution completes | Updates request state, triggers next step | - |
| PrefillCompleteEvent | Prefill batch ends (PD mode) | Initiates KV cache transfer to decode replica | ★ |
| KVCacheTransferCompleteEvent | KV transfer finishes | Records pd_p2p_comm_time, enqueues for decode | ★ |
| RequestCompletionEvent | All decode tokens generated | Collects metrics, removes from replica | - |
| CommSimCompleteEvent | astra-sim/NS-3 returns result | Updates comm_time in batch execution estimate | ★ |
Event Ordering: All events are processed in timestamp order via a priority queue (min-heap). When two events share the same timestamp, they are processed in insertion order. The simulation clock only advances when the next event's timestamp is greater than the current time. This ensures deterministic, reproducible results across runs.
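The tie-breaking rule described (timestamp order, then insertion order) is exactly what a (timestamp, sequence) heap key provides. A minimal sketch of such a queue, not vidur's actual implementation:

```python
import heapq
import itertools

class EventQueue:
    """Min-heap keyed by (timestamp, insertion sequence): events with
    equal timestamps pop in insertion order, so replays are deterministic."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # monotonically increasing tie-breaker

    def push(self, timestamp: float, event):
        # The sequence number also prevents heapq from ever comparing
        # the event payloads themselves.
        heapq.heappush(self._heap, (timestamp, next(self._seq), event))

    def pop(self):
        timestamp, _, event = heapq.heappop(self._heap)
        return timestamp, event

q = EventQueue()
q.push(2.0, "BatchEndEvent")
q.push(1.0, "RequestArrivalEvent")
q.push(1.0, "BatchScheduleEvent")  # ties with arrival: pops second
```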

Repository Structure

SimAI is a multi-repo project composed of 5 git submodules plus top-level orchestration. Below is the complete directory tree with descriptions of each component's role.

Top-Level Structure

SimAI/
├── README.md              ── Project overview, scenarios, setup guide
├── Dockerfile             ── Docker image (nvidia/pytorch base + AICB + Vidur)
├── .gitmodules            ── Submodule config: SimCCL, aicb, ns-3-alibabacloud
├── scripts/
│   └── build.sh           ── Master build: -c analytical | ns3 | phy
├── example/
│   ├── microAllReduce.txt ── Sample AllReduce workload (8 GPUs, TP=8)
│   ├── workload_analytical.txt ── Sample analytical workload
│   └── busbw.yaml        ── Bus bandwidth config (TP/DP/EP/PP per-op BW)
├── docs/
│   ├── Tutorial.md        ── Comprehensive usage tutorial
│   └── SimAI_Intro_Online.pdf ── Presentation slides
│
├── vidur-alibabacloud/    ── ① Inference scheduling simulator (Python)
├── aicb/                  ── ② Workload generation + GPU profiling (Python+CUDA)
├── SimCCL/                ── ③ Collective communication decomposition (C++)
├── astra-sim-alibabacloud/ ── ④ System simulation engine (C++)
└── ns-3-alibabacloud/     ── ⑤ Packet-level network simulator (C++)

vidur-alibabacloud — Inference Scheduling Simulator

vidur-alibabacloud/vidur/
├── main.py                 ── Entry point (python -m vidur.main)
├── simulator.py            ── DES event loop, manages clock + event queue
├── config/
│   ├── config.py           ── All config dataclasses (ReplicaConfig, PD params)
│   └── model_config.py     ── Model specs (DeepSeek-671B, Qwen3-MoE-235B, etc.)
├── entities/               ── Core data models
│   ├── request.py          ── Request with DAG (nx.DiGraph), PD metadata
│   ├── replica.py          ── ReplicaType: MIXED / PREFILL / DECODE
│   ├── batch.py            ── Batch of requests for co-execution
│   ├── task.py             ── PromptTask (prefill) / TokenTask (decode)
│   ├── flow.py             ── KVCacheTransferFlow (PD disaggregation)
│   ├── node.py             ── Base abstraction for Task/Flow in DAG
│   └── interconnect.py     ── NVLink / RDMA / Ethernet / PCIe link models
├── events/                 ── DES event types
│   ├── request_arrival_event.py ── Request enters system
│   ├── replica_schedule_event.py ── Triggers replica scheduling
│   └── batch_end_event.py  ── Batch execution completes
├── scheduler/
│   ├── global_scheduler/
│   │   ├── splitwise_global_scheduler.py ── PD-aware: P-pool + D-pool routing
│   │   ├── lor_global_scheduler.py ── Least Outstanding Requests
│   │   └── round_robin_global_scheduler.py
│   └── replica_scheduler/
│       ├── splitwise_replica_scheduler.py ── PD-aware per-replica scheduling
│       └── vllm_replica_scheduler.py ── vLLM-style scheduling policy
├── execution_time_predictor/
│   ├── sklearn_execution_time_predictor.py ── RandomForest / AICB backend
│   ├── communication_time_predictor.py    ── SimAI NS-3 / analytical backend
│   └── SimAIWorkload.py   ── Workload file builder for astra-sim
├── request_generator/      ── Synthetic / trace-based / Poisson arrivals
└── metrics/                ── TTFT, TBT, E2E latency, PD breakdown

aicb — Workload Generation & GPU Profiling

aicb/
├── aicb.py                  ── Main entry point for benchmark execution
├── workload_applyer.py      ── Applies workloads to GPU cluster (runs collectives)
├── workload_generator/
│   ├── SimAI_inference_workload_generator.py ── Inference workload (prefill/decode)
│   ├── SimAI_training_workload_generator.py  ── Training workload files
│   └── mocked_model/
│       ├── MockedModel.py   ── Base class, InferencePhase enum (PREFILL/DECODE)
│       ├── inference/
│       │   ├── MockedDeepSeek.py  ── DeepSeek-V3 MLA + MoE architecture mock
│       │   ├── MockedQwen3Moe.py  ── Qwen3-MoE architecture mock
│       │   ├── MockedQwen3Next.py ── Qwen3-Next (hybrid attention) mock
│       │   ├── AiobDeepSeek.py    ── GPU kernel profiling: FlashMLA, DeepGEMM FP8
│       │   ├── AiobQwen3Moe.py    ── Qwen3-MoE GPU kernel profiling
│       │   └── AiobQwen3Next.py   ── Qwen3-Next GPU kernel profiling
│       └── training/        ── DeepSpeed, Megatron, DeepSeek training mocks
├── utils/
│   ├── utils.py             ── CommType, CommGroup, Strategy enums, compute cache
│   ├── deepgemm_utils.py    ── FP8 GEMM (per_token/per_block quantization)
│   └── timer.py             ── CudaEventTimer for GPU kernel timing
├── workload/
│   └── simAI/model_workload/ ── Pre-generated workload files (.txt)
├── scripts/inference_configs/
│   ├── deepseek_default.json  ── DeepSeek-V3 model config
│   ├── qwen3_moe_default.json ── Qwen3-MoE model config
│   └── qwen3_next_default.json ── Qwen3-Next model config
└── log_analyzer/            ── Result analysis and plotting tools

SimCCL — Collective Communication Decomposition

SimCCL's core logic currently lives inside astra-sim-alibabacloud/astra-sim/system/MockNccl* files. The standalone SimCCL repo contains documentation; the full implementation will be released separately.

astra-sim-alibabacloud — System Simulation Engine

astra-sim-alibabacloud/
├── CMakeLists.txt          ── Top-level CMake config
├── build/
│   ├── simai_analytical/   ── Build config → bin/SimAI_analytical
│   ├── astra_ns3/          ── Build config → bin/SimAI_simulator
│   └── simai_phy/          ── Build config → bin/SimAI_phynet
├── astra-sim/
│   ├── system/              ── Core simulation (88 files)
│   │   ├── Sys.cc/.hh       ── Main system class, event dispatch
│   │   ├── MockNcclGroup.cc/.h ── NCCL algo decomposition (Ring/Tree/NVLS)
│   │   ├── MockNcclChannel.cc/.h ── SingleFlow, ncclTree, ncclChannelNode
│   │   ├── MockNccl.h       ── Algorithm IDs, base/hw latency tables
│   │   ├── MockNcclQps.h    ── QPS tracking per connection
│   │   ├── SimAiFlowModelRdma.cc ── RDMA flow model
│   │   └── calbusbw.cc      ── Bus bandwidth calculator
│   ├── network_frontend/
│   │   ├── analytical/      ── AnaSim: fast tick-based estimation
│   │   ├── ns3/             ── NS-3 integration (entry.h = main bridge)
│   │   └── phynet/          ── Physical RDMA traffic generation
│   └── workload/
│       ├── Workload.cc/.hh  ── Workload parser (reads .txt workload files)
│       └── Layer.cc/.hh     ── Single compute+comm layer representation
├── inputs/
│   ├── config/
│   │   └── SimAI.conf       ── 60+ params: CC_MODE, PFC, ECN, monitoring
│   ├── topo/
│   │   └── gen_Topo_Template.py ── 5 topologies: Spectrum-X, HPN, DCN+
│   └── ratio/               ── NIC/NVLink/ATA performance ratio CSV

ns-3-alibabacloud — Packet-Level Network Simulator

ns-3-alibabacloud/
├── simulation/src/point-to-point/model/  ── Core network models (37 files)
│   ├── rdma-hw.cc/.h        ── RDMA NIC: QP management, CC algorithms
│   ├── rdma-queue-pair.cc/.h ── QP state: rate, window, seq tracking
│   ├── switch-node.cc/.h    ── Switch: ECMP hash routing, ECN marking
│   ├── switch-mmu.cc/.h     ── Switch MMU: buffer mgmt, PFC, RED/ECN
│   ├── qbb-net-device.cc/.h ── QBB NIC: PFC pause/resume, WRR scheduling
│   ├── nvswitch-node.cc/.h  ── NVSwitch: intra-node NVLS routing
│   ├── int-header.h         ── INT telemetry header (for HPCC)
│   └── pint.h               ── Probabilistic INT (for HPCC-PINT)
├── simulation/src/network/utils/
│   └── custom-header.cc/.h ── Packet header with IP, port, seq, INT
├── analysis/                ── Post-simulation analysis tools
│   ├── fct_analysis.py      ── Flow Completion Time analysis
│   ├── qlen_analysis.py     ── Queue length analysis
│   ├── bw_analysis.py       ── Bandwidth utilization analysis
│   └── qp_cnp_analysis.py   ── CNP count per QP analysis
└── docs/images/             ── Network topology diagrams
Cross-Component Data Flow: AICB profiles GPU kernels → produces compute times → vidur-alibabacloud reads them via execution_time_predictor → generates workload .txt files → astra-sim reads workload → decomposes collectives via MockNcclGroup (SimCCL) → sends P2P flows to NS-3 → NS-3 returns FCT → astra-sim returns communication time → vidur records metrics.