Life of an Inference Request in SimAI

Alibaba Cloud's full-stack simulator integrating Vidur scheduling, AICB GPU profiling, SimCCL collective decomposition, astra-sim system simulation, and NS-3 network simulation — enabling PD disaggregation analysis without deploying real clusters.

aliyun/SimAI · NSDI'25

Table of Contents

  1. System Positioning
  2. Vidur → SimAI: What Changed and Why
  3. Five-Component Architecture
  4. Execution Time Breakdown
  5. New Core Entities
  6. Configuration System
  7. Three Simulation Modes
  8. Example CLI
  9. Key Insights

1. System Positioning

SimAI is Alibaba Cloud's full-stack simulator for AI training AND inference. It extends Microsoft's Vidur (a discrete-event inference simulator) with AICB GPU profiling, SimCCL collective communication decomposition, astra-sim system simulation, and NS-3 packet-level network simulation. Together, these five components form an end-to-end pipeline that can predict training iteration time or inference latency metrics without deploying a single real GPU.

Important Distinction: SimAI is NOT a serving engine (like vLLM or SGLang). It is a simulator — it models inference to enable capacity planning, network topology design, and Prefill-Decode (PD) disaggregation analysis. No actual inference is performed.

Who Should Use SimAI?

Infrastructure architects evaluating cluster configurations, network engineers comparing topology designs (fat-tree, rail-optimized, dual-ToR), and ML platform teams estimating the cost/benefit of PD disaggregation before committing hardware.

2. Vidur → SimAI: What Changed and Why

The following table details every dimension where vidur-alibabacloud diverges from the original Microsoft Vidur. These changes enable full-stack network-aware simulation with Prefill-Decode disaggregation support.

Dimension: Original Vidur → SimAI vidur-alibabacloud
Deployment Model: Co-located (prefill + decode on same replica) → PD disaggregation (separate prefill and decode replica pools)
Compute Time Estimation: sklearn RandomForest trained on profiled CSVs → AICB AIOB real GPU profiling (DeepGEMM / FlashMLA kernels)
Communication Simulation: None (replica-internal communication assumed negligible) → SimCCL + astra-sim + NS-3 full-stack network simulation
Supported Models: LLaMA2-7B / 13B / 70B → adds DeepSeek-V3-671B, Qwen3-MoE-235B, Qwen3-Next-80B
Hardware Requirement: Pure CPU simulation → profiling needs a Hopper/Blackwell GPU (SM90+); the simulation itself remains CPU-only
Global Scheduler: Random, Round-Robin, Least Outstanding Requests (LOR) → adds SplitWise (PD-aware routing to prefill/decode pools)
Replica Scheduler: Faster Transformer, Orca, Sarathi, vLLM, LightLLM → adds SplitWise (PD-aware batch formation)
Task Representation: Basic execution time estimation → DAG with PromptTask / TokenTask / KVCacheTransferFlow
Output Metrics: Basic latency / throughput → adds detailed PD timing breakdown (prefill_e2e, decode_e2e, pd_p2p_comm_time, pd_p2p_comm_size)
Key Takeaway: The biggest architectural shift is moving from Vidur's "compute-only, single-replica" model to SimAI's "compute + communication, multi-pool PD disaggregation" model. Every other change (AICB profiling, SimCCL, NS-3) supports this fundamental expansion.

3. Five-Component Architecture

SimAI is not a monolith — it is a federation of five repositories, each handling a different layer of the simulation stack. The following diagram shows how data flows between them.

SimAI Five-Component Architecture & Data Flow
[Architecture diagram. Components, top to bottom: vidur-alibabacloud (Python; scheduling + orchestration: global scheduler, replica scheduler, SplitWise, event loop, metrics) → AICB (Python + CUDA; workload generation + GPU profiling via AIOB, DeepGEMM / FlashMLA) → SimCCL (C++ library; collective decomposition: AllReduce, AllGather, ReduceScatter) → astra-sim-alibabacloud (C++; workload, system, and network-interface layers) → ns-3-alibabacloud (C++; packet-level simulation: RDMA, ECN/PFC, fat-tree). Data flow: per-op compute_time from GPU profiling, a workload description file (.txt), collective decomposition into point-to-point flows (bytes), and comm_time fed back up as timing feedback.]
vidur

Orchestration Layer

The discrete-event simulation engine. Manages request arrival, global scheduling (SplitWise), replica scheduling, batch formation, and metrics collection. Written in Python.

aicb

Workload & Profiling Layer

Generates realistic workload descriptions and profiles actual GPU kernel times using AIOB (AI Operation Benchmark). Captures DeepGEMM, FlashMLA, and NCCL operator latencies.

simccl

Collective Decomposition

Decomposes high-level collective operations (AllReduce, AllGather, ReduceScatter) into point-to-point data transfers. Supports ring, tree, and halving-doubling algorithms.

astra-sim

System Simulation

Three-layer simulation engine: Workload layer (reads execution graphs), System layer (schedules compute/comm), Network layer (pluggable backend: analytical or NS-3).

ns-3

Network Simulation

Packet-level network simulation with RDMA transport, ECN/PFC congestion control, and configurable topologies (fat-tree, rail-optimized, dual-ToR). The highest-fidelity backend.

4. Execution Time Breakdown

A request's end-to-end latency decomposes into 5 phases. This section summarizes what parameters affect each phase. For the detailed calculation logic (RandomForest models, backend comparison, per-layer formulas), see the Lifecycle page.

e2e_time = completed_at − arrived_at

Timeline

Non-PD Mode (same replica)

① Prefill Sched.
② Prefill Exec
④ Decode Sched.
⑤ Decode Iterations (×N)

PD Disaggregation Mode

Prefill Replica
① Prefill Sched.
② Prefill Exec
Network
③ KV Cache Transfer
Decode Replica
④ Decode Sched.
⑤ Decode Iterations

What Affects Each Phase

Prefill Scheduling Delay

Time from request arrival to the first batch containing this request. Event-driven: the scheduler tries immediately when a request arrives, but may be blocked.

Parameter How it affects scheduling delay
num_prefill_tokens Longer prompts need more KV blocks upfront and consume more batch token budget — harder to fit.
KV cache memory usage When allocated blocks + watermark ≥ total blocks, no new prefill can be scheduled until decode requests free memory.
num_pipeline_stages Pipeline can hold at most this many batches in-flight; when full, new batches wait for one to finish.
max_tokens_in_batch If adding this request's tokens would exceed the limit, it waits for the current batch to execute.
batch_size_cap Hard cap on requests per micro-batch; excess requests must queue for the next batch.
QPS / arrival rate Higher QPS fills KV memory and batch slots faster, increasing contention and queue depth.
Preemption Memory thrashing evicts the request; entire prefill must restart from scratch.
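
The memory condition in the table can be sketched as a simple admission check. This is a hypothetical helper, not the actual vidur scheduler code; block_size and the watermark reserve are illustrative:

```python
def can_schedule_prefill(num_prompt_tokens: int, block_size: int,
                         allocated_blocks: int, total_blocks: int,
                         watermark_blocks: int) -> bool:
    """Admission check sketched from the table above: a new prefill needs
    ceil(prompt_tokens / block_size) KV blocks up front, and scheduling it
    must not eat into the watermark reserve."""
    needed = -(-num_prompt_tokens // block_size)   # ceil division
    return allocated_blocks + needed + watermark_blocks <= total_blocks

# A 300-token prompt with 16-token blocks needs 19 KV blocks.
print(can_schedule_prefill(300, 16, 900, 1000, 30))  # True: 70 free above watermark
print(can_schedule_prefill(300, 16, 960, 1000, 30))  # False: only 10 free above watermark
```

Until decode requests complete and free blocks, the second request sits in the waiting queue, which is exactly the scheduling delay this phase measures.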

Prefill Execution

GPU computation for processing all input tokens in one forward pass (single batch).

Parameter How it affects prefill time
num_prefill_tokens More tokens = more compute per layer; attention cost scales quadratically with sequence length.
num_layers Total time is block_time × layers_per_stage; more layers = proportionally longer.
Model architecture Larger hidden_dim and more attention heads increase per-layer compute (matmul size).
tensor_parallel_size Splits compute across GPUs (faster) but adds 2 AllReduce per layer (communication overhead).
num_pipeline_stages Each stage handles fewer layers (faster per stage) but adds SendRecv latency between stages.
GPU device H100 kernels are roughly 2× faster than A100 for the same operation (profiled per device).
Execution backend vidur/aicb/simai_analytical/simai_simulation use different methods to estimate compute & comm time.

KV Cache Transfer PD mode only

Network transfer of KV cache from prefill replica to decode replica. Skipped in non-PD mode.

Parameter How it affects transfer time
num_prefill_tokens More tokens = larger KV cache to transfer (linear relationship).
Model dims kv_cache_size = 2 × tokens × num_kv_heads × head_dim × num_layers × bytes_per_element.
pd_p2p_comm_bandwidth Higher bandwidth (e.g. 800 Gbps) proportionally reduces transfer time.
pd_p2p_comm_dtype fp16 = 2 bytes/element, fp32 = 4 bytes — doubles the transfer size.
transfer_time = kv_cache_size / pd_p2p_comm_bandwidth
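
Plugging the two formulas above into a short sketch (the model dimensions are illustrative values for a LLaMA-3-8B-class model, not SimAI code):

```python
def kv_cache_bytes(num_tokens, num_kv_heads, head_dim, num_layers,
                   bytes_per_element=2):
    """kv_cache_size = 2 (K and V) x tokens x kv_heads x head_dim x layers x bytes."""
    return 2 * num_tokens * num_kv_heads * head_dim * num_layers * bytes_per_element

def transfer_time_ms(size_bytes, bandwidth_gbps):
    """transfer_time = kv_cache_size / pd_p2p_comm_bandwidth."""
    return size_bytes * 8 / (bandwidth_gbps * 1e9) * 1e3

# ~300-token prompt, 8 KV heads, head_dim 128, 32 layers, fp16 (2 B/element):
size = kv_cache_bytes(300, 8, 128, 32)        # 39,321,600 bytes (~39 MB)
print(round(transfer_time_ms(size, 800), 3))  # 0.393 ms over an 800 Gbps link
```

Switching pd_p2p_comm_dtype to fp32 (4 bytes/element) doubles both the size and the transfer time, as the table notes.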

Decode Scheduling Delay

Wait before the first decode batch. Non-PD: near-zero (same replica, next scheduler cycle). PD: queueing time on D replica after KV transfer.

Parameter How it affects decode scheduling
D replica load More concurrent decode requests on D = pipeline/batch slots fill up, new arrivals queue.
KV cache on D Decode only needs 1 block/iter, but if D's memory is nearly full, even this can block.
batch_size_cap If D's current batch already has max requests, new decode must wait for next cycle.
Non-PD mode Request stays on same replica; slots into next batch immediately with continuous batching (≈ 0).

Decode Iterations

Auto-regressive generation: num_decode_tokens − 1 serial iterations, each producing 1 token. Typically the dominant phase for long outputs.

Parameter How it affects decode time
num_decode_tokens Directly determines iteration count; 128 output tokens = 127 serial iterations.
Context length KV cache grows each iteration; attn_decode cost increases as context lengthens (later tokens cost more).
Concurrent batch_size More decode requests in same batch = larger total KV cache to attend over, increasing per-iter time.
num_layers Each iteration passes through all layers; more layers = proportionally longer per iteration.
tensor_parallel_size Same tradeoff as prefill: splits compute but adds 2 AllReduce per layer per iteration.
GPU device Faster GPU = faster per-iteration decode; decode is memory-bandwidth bound (not compute bound).
Inter-batch gaps Scheduler overhead between iterations; total = Σ(per_iter_time + gap) across all iterations.
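
The accumulation in the last row can be written out directly. This is a hypothetical helper; the per-iteration times and gap are illustrative, loosely based on the ~10.8 ms TBT figure in the timing reference:

```python
def decode_phase_ms(num_decode_tokens, per_iter_times_ms, gap_ms=0.0):
    """total = sum(per_iter_time + gap) over num_decode_tokens - 1
    serial iterations (the first token comes out of prefill)."""
    iters = num_decode_tokens - 1
    assert len(per_iter_times_ms) == iters
    return sum(t + gap_ms for t in per_iter_times_ms)

# 128 output tokens -> 127 iterations; per-iteration time creeps up
# as the KV cache (context) grows each iteration.
times = [10.8 + 0.001 * i for i in range(127)]
print(round(decode_phase_ms(128, times, gap_ms=0.05) / 1e3, 2))  # ~1.39 s
```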

Typical Timing Reference

Single-request latency at low QPS (no queueing contention), non-PD mode (TP=1, PP=1). The RTX 5090 column is measured with vLLM v0.1 + LLaMA-3.1 8B (bf16). Other columns are estimates based on device specs.

Non-PD Mode — ~300 prefill + 128 decode tokens

Phase · LLaMA-2 7B A100 (estimated) · LLaMA-3.1 8B RTX 5090 (measured) · LLaMA-3 8B H100 (estimated) · LLaMA-2 70B H100 TP=4 (estimated)
① Prefill Sched. · ~1 ms · < 1 ms · ~1 ms · ~2 ms
② Prefill Exec · ~18 ms · 14 ms · ~8 ms · ~50 ms
③ KV Transfer · skipped (non-PD)
④ Decode Sched. · ≈ 0 · ≈ 0 · ≈ 0 · ≈ 0
⑤ Decode (127×) · ~1.4 s · 1.37 s · ~0.8 s · ~2.5 s
E2E · ~1.4 s · 1.39 s · ~0.8 s · ~2.6 s
TTFT · ~19 ms · 14 ms · ~9 ms · ~52 ms
TBT (avg) · ~11 ms · 10.8 ms · ~6 ms · ~20 ms
RTX 5090 measurement details: LLaMA-3.1 8B Instruct, bf16, vLLM v0.1 with FlashAttention + CUDA graphs, 32 GB GDDR7 (1.8 TB/s bandwidth), prompt ~303 tokens, decode 128 tokens. TBT ~10.8 ms/token is comparable to A100 (~11 ms) despite being a consumer GPU — decode is memory-bandwidth bound and RTX 5090's GDDR7 bandwidth (1.8 TB/s) is close to A100's HBM2e (2 TB/s).
PD disagg at low QPS: E2E is nearly identical because the same GPU work is done either way. PD disaggregation adds the KV transfer (~0.3-1.6 ms) plus D-replica queueing while removing no compute. The benefit appears at high QPS: prefill and decode no longer compete for the same GPU, which reduces scheduling delay and increases throughput.

Supported Devices & Models

Supported Devices A40, A100, H100, H800
NOT supported Consumer GPUs (RTX 5090, etc.) — no profiling CSVs available. You must profile on the target GPU and add CSVs to data/profiling/compute/ and data/profiling/network/.
Example Models LLaMA-2 7B/13B/70B, LLaMA-3 8B/70B, Qwen, CodeLlama, Mixtral, DeepSeek-671B, Qwen3-Moe-235B

5. New Core Entities

SimAI introduces several new entity classes to model PD disaggregation. These classes extend Vidur's original entity hierarchy with DAG-based task representation and explicit KV cache transfer flows.

5.1 Request with DAG

vidur-alibabacloud
vidur/entities/request.py

The Request class now carries a directed acyclic graph (DAG) that encodes the dependency relationships between prefill tasks, decode tasks, and KV cache transfer flows. This is the fundamental data structure enabling PD disaggregation simulation.

import networkx as nx

class Request(BaseEntity):
    """A single inference request with PD disaggregation support."""

    def __init__(self, ...):
        self.dag = nx.DiGraph()         # Task dependency graph
        self.prefill_replica_id = None # Assigned prefill replica
        self.decode_replica_id = None  # Assigned decode replica

        # PD communication metrics (populated after KV transfer)
        self.pd_p2p_comm_size = float('inf')
        self.pd_p2p_comm_time = float('inf')

        # Timing breakdown
        self.prefill_e2e = None
        self.decode_e2e = None
        self.prefill_start_timestamp = None
        self.decode_start_timestamp = None

5.2 Replica Types

vidur-alibabacloud
vidur/entities/replica.py

SimAI introduces a ReplicaType enum to differentiate between co-located (mixed) replicas and PD-disaggregated replicas. The global scheduler uses this to route requests to the appropriate pool.

from enum import IntEnum

class ReplicaType(IntEnum):
    MIXED   = 0  # Co-located: prefill + decode on same replica
    PREFILL = 1  # Dedicated prefill replica
    DECODE  = 2  # Dedicated decode replica

5.3 Node / Task / Flow Hierarchy

vidur-alibabacloud
vidur/entities/node.py

Node is the abstract base class for all executable units in the request DAG. Each node has a start time, end time, and duration.

class Node(BaseEntity):
    """Base class for Task and Flow."""
    start_time: float   # simulation time when the node starts
    end_time: float     # simulation time when the node finishes
    duration: float     # end_time - start_time
vidur/entities/task.py

Task extends Node for compute operations. Two concrete subclasses model the PD split.

class Task(Node):
    """Compute operation."""
    pass

class PromptTask(Task):
    """Prefill computation."""
    pass

class TokenTask(Task):
    """Decode computation."""
    pass
vidur/entities/flow.py

Flow extends Node for data transfer operations. The key subclass is KVCacheTransferFlow, which models the PD KV cache transfer.

class Flow(Node):
    """Data transfer operation."""
    size_bytes: int     # payload size in bytes
    bandwidth: float    # link bandwidth available to this flow

class KVCacheTransferFlow(Flow):
    """KV cache transfer from prefill to decode replica.
    In PD disaggregation mode, this flow is inserted into the
    request DAG between PromptTask and TokenTask nodes."""

    def __init__(self, src_replica_id, dst_replica_id, kv_cache_size):
        self.src_replica_id = src_replica_id
        self.dst_replica_id = dst_replica_id
        self.kv_cache_size = kv_cache_size
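
These classes are wired into the 3-node request DAG that PD mode builds per request. A minimal stdlib sketch of that wiring, using graphlib in place of the networkx DiGraph that vidur-alibabacloud actually uses; the node records are hypothetical stand-ins for task/flow instances:

```python
from graphlib import TopologicalSorter

# Hypothetical node records standing in for PromptTask /
# KVCacheTransferFlow / TokenTask instances.
nodes = {
    0: {"kind": "PromptTask"},                  # prefill, on a P replica
    1: {"kind": "KVCacheTransferFlow",          # P -> D network flow
        "src_replica_id": 0, "dst_replica_id": 4,
        "kv_cache_size": 39_321_600},
    2: {"kind": "TokenTask"},                   # all decode iterations
}
deps = {1: {0}, 2: {1}}   # transfer waits on prefill; decode waits on transfer
order = list(TopologicalSorter(deps).static_order())
print([nodes[n]["kind"] for n in order])
# ['PromptTask', 'KVCacheTransferFlow', 'TokenTask']
```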

5.4 Interconnect Types

vidur-alibabacloud
vidur/entities/interconnect.py

SimAI models multiple interconnect types to accurately simulate communication latency across different hardware links. Each interconnect has a configurable bandwidth and latency.

class Interconnect:
    """Hardware interconnect abstraction.

    Supported link types:
      NVLink    - intra-node GPU-GPU (e.g., 900 GB/s per direction)
      RDMA      - inter-node GPU-GPU via RoCEv2/InfiniBand
      Ethernet  - standard Ethernet (fallback)
      PCIe      - CPU-GPU or NIC-GPU transfer
      DummyLink - zero-latency link for testing
    """

    def __init__(self, link_type, bandwidth_bps, latency_us=0):
        self.link_type = link_type
        self.bandwidth_bps = bandwidth_bps
        self.latency_us = latency_us

    def transfer_time(self, size_bytes) -> float:
        """Estimate transfer time in microseconds."""
        return (size_bytes * 8 / self.bandwidth_bps * 1e6
                + self.latency_us)

6. Configuration System

SimAI extends Vidur's configuration with new parameters for PD disaggregation, network simulation, and MoE model support. All parameters are exposed as CLI flags and can be set in config files.

6.1 New Config Parameters

vidur-alibabacloud
vidur/config/replica_config.py
from dataclasses import dataclass, field

@dataclass
class ReplicaConfig:
    # ===== PD Disaggregation =====
    pd_p2p_comm_bandwidth: int = 800      # Gbps (KV transfer link)
    pd_p2p_comm_dtype: str = 'float16'    # KV cache data type
    pd_node_ratio: float = 0.5            # P:D ratio (0.5 = 1:1)

    # ===== Network Bandwidth =====
    nvlink_bandwidth: int = 1600          # Gbps (NVLink per direction)
    rdma_bandwidth: int = 800             # Gbps (RDMA per NIC)

    # ===== MoE (Mixture of Experts) =====
    expert_model_parallel_size: int = 1  # Expert parallelism degree

    # ===== Simulation Backend =====
    backend: str = "vidur"
    # Choices: "vidur"              - original Vidur (CPU only)
    #          "simai_simulation"   - NS-3 full network sim
    #          "simai_analytical"   - Bus bandwidth estimation
    #          "aicb"               - AICB GPU profiling backend

6.2 Execution Time Predictor Config

vidur-alibabacloud
vidur/config/execution_time_predictor_config.py
@dataclass
class RandomForrestExecutionTimePredictorConfig:
    """Config for the execution time predictor.
    In SimAI, this predictor can delegate to AICB for
    real GPU profiling instead of sklearn RandomForest."""

    backend: str = "vidur"
    # "vidur" - sklearn RandomForest on profiled CSV
    # "aicb"  - AICB AIOB real GPU kernel profiling

    compute_cache_dir: str = "./compute_cache"
    # Directory for cached AICB profiling results

    model_name: str = "deepseek-671B"
    # Target model for profiling
Configuration Hierarchy: SimAI's config system follows a layered approach: default values → config file → CLI flags. CLI flags always take precedence. All PD-related parameters have sensible defaults that produce co-located (non-disaggregated) behavior, maintaining backward compatibility with original Vidur.
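
The precedence chain amounts to a layered dict merge. A hypothetical helper, not the actual vidur config loader; a None CLI value stands in for an unset flag:

```python
def resolve_config(defaults, file_cfg, cli_flags):
    """Layered resolution: defaults < config file < CLI flags."""
    merged = dict(defaults)
    merged.update(file_cfg)
    # Only CLI flags the user actually set override the lower layers.
    merged.update({k: v for k, v in cli_flags.items() if v is not None})
    return merged

cfg = resolve_config(
    {"pd_node_ratio": 0.5, "backend": "vidur"},   # defaults (co-located)
    {"backend": "simai_analytical"},               # config file
    {"pd_node_ratio": 0.3, "backend": None},       # CLI flags (backend unset)
)
print(cfg)   # {'pd_node_ratio': 0.3, 'backend': 'simai_analytical'}
```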

7. Three Simulation Modes

SimAI offers three simulation backends, each trading off speed for fidelity. Choose the right mode based on your exploration stage and available resources.

Mode · Backend · Speed · Fidelity · Hardware Required · Use Case
Analytical · Bus bandwidth estimation · Fast (★★★) · Low · CPU only · Quick exploration, parameter sweeps
NS-3 Simulation · Full packet-level network sim · Slow · High (★★★) · CPU only (multi-core recommended) · Topology comparison, congestion analysis
Physical · Real RDMA traffic · Real-time (★★) · Highest (★★★★) · RDMA-capable cluster · Final validation, production calibration

Analytical Mode

Uses simple bus bandwidth formulas: time = message_size / (bandwidth * num_links). No congestion modeling, no packet-level simulation. Suitable for rapid parameter sweeps across hundreds of configurations.
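
The bus-bandwidth formula is a one-liner; the sketch below is illustrative and its parameters are assumptions, not the actual analytical backend:

```python
def analytical_comm_time_us(message_bytes, link_bandwidth_gbps, num_links=1):
    """time = message_size / (bandwidth * num_links); no congestion modeled."""
    return message_bytes * 8 / (link_bandwidth_gbps * 1e9 * num_links) * 1e6

# An 8 MB TP AllReduce payload striped across 2 x 400 Gbps links:
print(round(analytical_comm_time_us(8 * 2**20, 400, num_links=2), 1))  # 83.9 us
```

Because congestion and packetization are ignored, this is a lower bound; the NS-3 backend below captures the effects this formula misses.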

# Backend flag for analytical mode
--backend simai_analytical

NS-3 Mode

Full packet-level simulation with RDMA transport, ECN/PFC congestion control, and real topology modeling. Captures contention, head-of-line blocking, and incast effects. 10-100x slower than analytical but much more accurate.

# Backend flag for NS-3 mode
--backend simai_simulation

Physical Mode

Runs actual RDMA traffic on a physical cluster. Provides ground-truth measurements for validation. Requires RoCEv2 or InfiniBand capable NICs and an actual cluster deployment.

# Physical mode uses real hardware
--backend physical
Mode Selection Strategy: Three simulation modes let you trade off speed vs fidelity. Start with analytical for broad parameter sweeps (seconds per run), then validate interesting configurations with NS-3 (minutes per run), and finally confirm with physical on the actual cluster.

8. Example CLI

The following command runs a PD disaggregation simulation for DeepSeek-V3-671B with the SplitWise scheduler and AICB-profiled execution times:

Full PD Disaggregation Simulation

python -m vidur.main \
  --replica_config_model_name deepseek-671B \
  --replica_config_pd_p2p_comm_bandwidth 800 \
  --replica_config_nvlink_bandwidth 1600 \
  --replica_config_pd_node_ratio 0.5 \
  --global_scheduler_config_type split_wise \
  --replica_scheduler_config_type split_wise \
  --random_forrest_execution_time_predictor_config_backend aicb

Quick Analytical Run

For a fast parameter sweep using the analytical backend:

python -m vidur.main \
  --replica_config_model_name qwen3-moe-235B \
  --replica_config_pd_node_ratio 0.3 \
  --global_scheduler_config_type split_wise \
  --replica_scheduler_config_type split_wise \
  --random_forrest_execution_time_predictor_config_backend vidur \
  --backend simai_analytical

NS-3 High-Fidelity Run

For detailed network simulation with topology specification:

python -m vidur.main \
  --replica_config_model_name deepseek-671B \
  --replica_config_pd_p2p_comm_bandwidth 800 \
  --replica_config_rdma_bandwidth 800 \
  --global_scheduler_config_type split_wise \
  --replica_scheduler_config_type split_wise \
  --backend simai_simulation \
  --network_topology fat_tree

CLI Parameter Reference

Flag Description Default
--replica_config_model_name Target model to simulate -
--replica_config_pd_p2p_comm_bandwidth PD KV transfer bandwidth (Gbps) 800
--replica_config_nvlink_bandwidth NVLink bandwidth per direction (Gbps) 1600
--replica_config_rdma_bandwidth RDMA bandwidth per NIC (Gbps) 800
--replica_config_pd_node_ratio Fraction of nodes for prefill (0.5 = 1:1 P:D) 0.5
--replica_config_pd_p2p_comm_dtype Data type for KV cache transfer float16
--global_scheduler_config_type Global scheduler algorithm round_robin
--replica_scheduler_config_type Replica scheduler algorithm vllm
--random_forrest_execution_time_predictor_config_backend Execution time predictor backend vidur
--backend Simulation backend (vidur/simai_simulation/simai_analytical) vidur
--replica_config_expert_model_parallel_size Expert parallelism degree for MoE models 1

9. Key Insights

Cross-Component Integration: SimAI's PD disaggregation simulation capability comes from cross-component integration. No single component can model PD end-to-end: vidur provides the scheduling and event loop, AICB profiles compute time, SimCCL decomposes collective operations, astra-sim orchestrates the system simulation, and NS-3 provides packet-level network fidelity. The innovation is in the glue between these components — the workload file format, the timing callback interface, and the DAG-based task representation that threads through all five layers.
Hardware Limitation: Profiling with AICB requires an SM90+ GPU (NVIDIA Hopper or Blackwell architecture). This means you need access to H100, H200, or B200 GPUs to generate the profiling data that feeds into SimAI's execution time predictions. The simulation itself runs on CPU, but the profiling step is GPU-bound. If you lack SM90+ hardware, you must rely on pre-cached profiling results or use the original Vidur RandomForest predictor as a fallback.
DAG Complexity: The DAG-based task representation introduces additional complexity compared to Vidur's linear execution model. Each request in PD mode generates exactly three DAG nodes (PromptTask → KVCacheTransferFlow → TokenTask); in co-located mode (same replica), it generates only two nodes (PromptTask → TokenTask). The single TokenTask node handles all decode tokens internally through multiple batch iterations (tokens_per_iteration = 1), rather than creating a separate node per token. This keeps the DAG lightweight regardless of output length.
Backward Compatibility: SimAI maintains full backward compatibility with the original Vidur. If you set backend=vidur and use the default schedulers (round_robin + vllm), SimAI behaves identically to upstream Vidur. The PD disaggregation features only activate when you explicitly configure SplitWise scheduling and set pd_node_ratio < 1.0.
MoE Support: SimAI adds native support for Mixture-of-Experts (MoE) models like DeepSeek-V3 and Qwen3-MoE. The expert_model_parallel_size parameter controls how experts are distributed across GPUs. AICB profiles MoE-specific kernels including expert routing, top-k gating, and sparse matrix operations, providing accurate compute time estimates for these architectures.

Appendix: Repository Map

SimAI spans five repositories under the aliyun GitHub organization. The following table maps each component to its repository and primary language.

Component Repository Language Role
vidur vidur-alibabacloud Python Scheduling, orchestration, metrics
aicb aliyun/aicb Python + CUDA Workload generation, GPU profiling
simccl SimCCL C++ Collective communication decomposition
astra-sim astra-sim-alibabacloud C++ System simulation engine
ns-3 ns-3-alibabacloud C++ Packet-level network simulation

Build Dependencies

The C++ components (SimCCL, astra-sim, NS-3) are linked together at compile time via CMake. SimCCL is compiled as a static library that astra-sim links against, and NS-3 is built as a separate shared library loaded by astra-sim's network layer.

Python ↔ C++ Bridge

vidur (Python) communicates with astra-sim (C++) through workload description files (.txt). vidur writes the execution graph as a text file, then invokes astra-sim as a subprocess. astra-sim returns timing results that vidur reads back to advance its event loop.

Appendix: Output Metrics

SimAI produces a rich set of per-request and aggregate metrics. The PD-specific metrics are unique to SimAI and not available in the original Vidur.

Metric Unit Description New in SimAI?
ttft ms Time To First Token -
tbt ms Time Between Tokens (avg) -
e2e_latency ms End-to-end request latency -
prefill_e2e ms Prefill phase end-to-end time ✓
decode_e2e ms Decode phase end-to-end time ✓
pd_p2p_comm_time ms KV cache transfer time (P→D) ✓
pd_p2p_comm_size bytes KV cache transfer size ✓
prefill_replica_id - Assigned prefill replica ✓
decode_replica_id - Assigned decode replica ✓
vidur-alibabacloud
vidur/metrics/metrics_store.py

The metrics store collects all per-request metrics and exports them to CSV. The PD-specific fields are only populated when running in PD disaggregation mode.

class MetricsStore:
    """Central metrics collection for simulation results."""

    def on_request_complete(self, request: Request):
        # Standard Vidur metrics
        self._record("ttft", request.ttft)
        self._record("tbt", request.avg_tbt)
        self._record("e2e_latency", request.e2e_latency)

        # NEW: PD disaggregation metrics
        if request.prefill_e2e is not None:
            self._record("prefill_e2e", request.prefill_e2e)
            self._record("decode_e2e", request.decode_e2e)
            self._record("pd_p2p_comm_time", request.pd_p2p_comm_time)
            self._record("pd_p2p_comm_size", request.pd_p2p_comm_size)

Appendix: Request DAG Example

In PD disaggregation mode, a request produces a DAG with exactly three nodes. The single TokenTask node handles all decode tokens internally through batch iterations (tokens_per_iteration = 1), rather than creating a separate node per token. The KVCacheTransferFlow sits between prefill and decode, creating the critical PD communication dependency.

Request DAG in PD Disaggregation Mode
[DAG diagram: PromptTask (node_id 0, prefill replica) → KVCacheTransferFlow (node_id 1, network) → TokenTask (node_id 2, decode replica). PromptTask prefills all tokens at once; its compute_time comes from AICB profiling. The flow's pd_p2p_comm_time comes from NS-3 or the analytical backend. TokenTask is a single node with token_size = num_decode_tokens − 1 that internally iterates with tokens_per_iteration = 1, so its total cost is roughly compute_time × num_decode_tokens, batched by the replica scheduler. The DAG has only 3 nodes; decode tokens are handled by batch iterations within the single TokenTask.]
Critical Path: The TTFT (Time To First Token) in PD mode is prefill_compute_time + pd_p2p_comm_time + first_decode_compute_time. This means the KV cache transfer time directly impacts user-perceived latency. Optimizing PD bandwidth (via pd_p2p_comm_bandwidth) and topology (to minimize hops between P and D replicas) is crucial for PD disaggregation performance.
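
The critical path reads directly as a sum; the numbers below are hypothetical, roughly in line with the H100 figures in the timing reference:

```python
def ttft_pd_ms(prefill_ms, kv_transfer_ms, first_decode_ms):
    """PD-mode TTFT = prefill compute + KV transfer + first decode compute."""
    return prefill_ms + kv_transfer_ms + first_decode_ms

# ~8 ms prefill, ~0.4 ms KV transfer at 800 Gbps, ~6 ms first decode iter:
print(round(ttft_pd_ms(8.0, 0.4, 6.0), 1))   # 14.4 ms
```

The transfer term is the only one the network topology controls, which is why PD bandwidth and P-to-D hop count matter for TTFT.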

Appendix: SplitWise Scheduler

The SplitWise scheduler is SimAI's key addition to Vidur's scheduling layer. It operates at two levels: the global scheduler routes requests to the correct replica pool, and the replica scheduler manages batch formation within each pool.

vidur-alibabacloud
vidur/scheduler/global_scheduler/split_wise_global_scheduler.py

Global Level: Pool Routing

The SplitWise global scheduler partitions replicas into prefill and decode pools based on pd_node_ratio. When a new request arrives, it routes the initial prefill to the least-loaded prefill replica. After prefill completes and KV cache is transferred, the decode phase is routed to the least-loaded decode replica.

class SplitWiseGlobalScheduler(BaseGlobalScheduler):
    """PD-aware global scheduler.
    Splits replicas into prefill and decode pools."""

    def __init__(self, config, replicas):
        self.pd_node_ratio = config.pd_node_ratio
        n_prefill = int(len(replicas) * self.pd_node_ratio)
        self.prefill_replicas = replicas[:n_prefill]
        self.decode_replicas = replicas[n_prefill:]

    def schedule(self, request: Request) -> int:
        # Route prefill to least-loaded prefill replica
        target = min(
            self.prefill_replicas,
            key=lambda r: r.pending_requests
        )
        request.prefill_replica_id = target.id
        return target.id
vidur-alibabacloud
vidur/scheduler/replica_scheduler/split_wise_replica_scheduler.py

Replica Level: Batch Formation

The SplitWise replica scheduler extends the base Sarathi-style continuous batching with PD awareness. Prefill replicas only process prefill requests, decode replicas only process decode iterations. This specialization eliminates the prefill-decode interference that degrades performance in co-located deployments.

class SplitWiseReplicaScheduler(BaseReplicaScheduler):
    """PD-aware replica scheduler.
    Handles batch formation for a single replica type."""

    def _build_batch(self) -> Batch:
        if self.replica_type == ReplicaType.PREFILL:
            # Only schedule prefill requests
            candidates = [r for r in self.waiting
                         if not r.prefill_complete]
        else:
            # Only schedule decode iterations
            candidates = [r for r in self.running
                         if r.prefill_complete]
        return self._form_batch(candidates)
pd_node_ratio Sensitivity: The pd_node_ratio parameter is critical for performance. Too many prefill replicas (ratio > 0.5) leads to decode starvation and high TBT. Too few prefill replicas (ratio < 0.3) causes prefill queueing and high TTFT. SimAI's simulation capability makes it practical to sweep this parameter across dozens of values without provisioning real hardware.
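
Because each run is a single CLI invocation, such a sweep is a short driver script. The sketch below only builds the command lines (flags as in the CLI reference above); passing each to subprocess.run is left as the assumed execution step:

```python
def sweep_commands(ratios):
    """One vidur.main invocation per pd_node_ratio value, using the
    analytical backend for speed."""
    base = ["python", "-m", "vidur.main",
            "--replica_config_model_name", "deepseek-671B",
            "--global_scheduler_config_type", "split_wise",
            "--replica_scheduler_config_type", "split_wise",
            "--backend", "simai_analytical"]
    return [base + ["--replica_config_pd_node_ratio", str(r)] for r in ratios]

for cmd in sweep_commands([0.2, 0.3, 0.4, 0.5, 0.6]):
    print(" ".join(cmd))   # or: subprocess.run(cmd, check=True)
```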

Appendix: AICB Profiling Pipeline

AICB (AI Communication Benchmark) provides two key capabilities to SimAI: workload description generation and real GPU kernel profiling. The profiling pipeline is designed to be run once per model/hardware combination and cached for repeated simulations.

Step 1: Model Description (aicb)

AICB reads the model architecture description (layer count, hidden dimension, attention heads, MoE config) and generates a comprehensive operator list. For DeepSeek-V3-671B, this includes MLA (Multi-head Latent Attention), DeepGEMM sparse experts, and shared expert layers.

Step 2: AIOB Kernel Profiling (aicb)

The AIOB (AI Operation Benchmark) module executes each operator on the actual GPU and records its execution time. For compute operators, it uses DeepGEMM for matrix multiplications and FlashMLA for attention. For communication operators, it measures NCCL collective latencies.

Step 3: Workload File Generation (aicb)

AICB outputs a workload description file (.txt) that lists every operation in an iteration with its profiled timing. This file is consumed by astra-sim's workload layer. The format encodes: operation type, data size, compute time, communication collective type, and dependencies.

Step 4: Cache for Simulation (vidur-alibabacloud)

The profiled compute times are cached in the compute_cache directory. Vidur's execution time predictor loads these cached values instead of running sklearn RandomForest predictions. This makes subsequent simulation runs fast while maintaining profiling accuracy.

Workload File Format Example
Source: aicb/workload_generator/generate_workload.py

# AICB workload description file for DeepSeek-V3-671B
# Format: op_type  data_size  compute_time_us  comm_type  deps
COMP    0          1250          NONE       -1     # QKV projection (DeepGEMM)
COMP    0          890           NONE       0      # MLA attention (FlashMLA)
COMP    0          420           NONE       1      # Output projection
COMM    8388608   0             ALLREDUCE  2      # TP AllReduce (8MB)
COMP    0          340           NONE       3      # Expert routing + gating
COMP    0          1680          NONE       4      # Sparse expert FFN (DeepGEMM)
COMM    16777216  0             ALLTOALL   5      # EP All-to-All (16MB)
COMP    0          560           NONE       6      # Shared expert FFN
COMM    8388608   0             ALLREDUCE  7      # TP AllReduce (8MB)
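Parsing this format is straightforward; the field names below follow the comment header in the example (this is a minimal sketch, not AICB's actual parser).

```python
def parse_workload(text: str):
    """Parse AICB-style workload lines into dicts, per the format in the
    example above: op_type, data_size, compute_time_us, comm_type, deps."""
    ops = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip trailing comments
        if not line:
            continue  # skip blank / comment-only lines
        op_type, data_size, compute_us, comm_type, dep = line.split()
        ops.append({
            "op_type": op_type,
            "data_size": int(data_size),         # bytes (0 for compute ops)
            "compute_time_us": int(compute_us),  # profiled time (0 for comm ops)
            "comm_type": comm_type,              # NONE / ALLREDUCE / ALLTOALL / ...
            "dep": int(dep),                     # index of predecessor op, -1 = none
        })
    return ops
```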

Appendix: Supported Network Topologies

The NS-3 backend supports multiple data center network topologies. Choosing the right topology significantly impacts PD communication latency and collective operation performance.

| Topology | Structure | Max Hops (P→D) | Bisection BW | Best For |
| --- | --- | --- | --- | --- |
| Fat-tree | 3-tier Clos (core/agg/ToR) | 6 | Full | General purpose, balanced |
| Rail-Optimized | GPU-rank aligned rails | 2 | Partial | AllReduce-heavy workloads |
| Dual-ToR | Redundant ToR switches | 4 | High | PD disagg with fault tolerance |
| Single-Switch | All GPUs on one switch | 2 | Full | Small clusters (≤64 GPUs) |
Congestion Control (ns-3)

The NS-3 backend models DCQCN (Data Center Quantized Congestion Notification) congestion control with ECN marking and PFC (Priority Flow Control) pause frames. When multiple KV cache transfers compete for bandwidth on shared links, the simulation captures the resulting throughput degradation and queueing delays. These effects are invisible to the analytical backend.
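The core of DCQCN's sender-side behavior, cut the rate multiplicatively on congestion feedback and recover toward the pre-cut rate otherwise, can be sketched as follows. Constants and naming are illustrative, not the values used in SimAI.conf or the NS-3 implementation.

```python
class DcqcnSender:
    """Simplified DCQCN sender-side rate control: on a CNP (ECN-triggered
    Congestion Notification Packet) the rate is cut in proportion to a
    congestion estimate alpha; quiet periods decay alpha and recover the
    rate halfway toward the pre-cut target. Constants are illustrative."""

    def __init__(self, line_rate_gbps: float, g: float = 1 / 16):
        self.rate = line_rate_gbps    # current sending rate
        self.target = line_rate_gbps  # rate to recover toward
        self.alpha = 1.0              # congestion estimate in [0, 1]
        self.g = g                    # EWMA gain for alpha updates

    def on_cnp(self):
        # Congestion feedback: raise alpha, remember the pre-cut rate,
        # then cut the current rate multiplicatively.
        self.alpha = (1 - self.g) * self.alpha + self.g
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2

    def on_timer_no_cnp(self):
        # A period with no congestion feedback: decay alpha and
        # recover halfway toward the target (fast recovery).
        self.alpha = (1 - self.g) * self.alpha
        self.rate = (self.rate + self.target) / 2
```

In the simulation this kind of per-queue-pair state is what lets competing KV cache flows throttle each other realistically instead of splitting bandwidth ideally.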

Topology Configuration (astra-sim)

Network topologies are specified via JSON configuration files that describe switch connectivity, link bandwidth, and latency parameters. astra-sim reads this configuration and passes it to the NS-3 backend during initialization. Custom topologies can be defined by modifying these configuration files.
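For a sense of what such a description contains, here is a toy two-ToR topology built and serialized as JSON. The schema is invented for this sketch and is not astra-sim's actual configuration format; the real templates live under inputs/topo/.

```python
import json

# Toy topology description: an illustrative schema only, NOT the actual
# astra-sim/NS-3 configuration format (see inputs/topo/ for real templates).
topology = {
    "name": "two-tor-demo",
    "switches": ["tor0", "tor1", "agg0"],
    "hosts": [f"gpu{i}" for i in range(8)],
    "links": (
        # 4 GPUs per ToR at 400 Gbps
        [{"a": f"gpu{i}", "b": "tor0", "bw_gbps": 400, "lat_us": 1} for i in range(4)]
        + [{"a": f"gpu{i}", "b": "tor1", "bw_gbps": 400, "lat_us": 1} for i in range(4, 8)]
        # ToR uplinks to the aggregation switch at 800 Gbps
        + [{"a": "tor0", "b": "agg0", "bw_gbps": 800, "lat_us": 1},
           {"a": "tor1", "b": "agg0", "bw_gbps": 800, "lat_us": 1}]
    ),
}
config_json = json.dumps(topology, indent=2)
```

Generating topologies programmatically like this, rather than hand-writing JSON, is essentially what inputs/topo/gen_Topo_Template.py does for the supported topology families.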

Appendix: SimCCL Algorithm Mapping

SimCCL maps high-level collective communication primitives to concrete point-to-point transfer schedules. The algorithm selection depends on the message size, number of participants, and network topology.

| Collective | Algorithm | P2P Transfers | Typical Use |
| --- | --- | --- | --- |
| AllReduce | Ring (large msg) / Tree (small msg) | 2(n-1) / 2 log(n) | TP gradient sync |
| AllGather | Ring / Recursive Halving-Doubling | n-1 | PP layer gathering |
| ReduceScatter | Ring / Direct | n-1 | ZeRO gradient partitioning |
| AllToAll | Direct exchange | n(n-1) | MoE expert routing |
| Broadcast | Binary tree | log(n) | KV cache distribution |
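The transfer counts in the table are simple functions of the participant count n; a small helper makes them concrete (a sketch of the table's formulas, not SimCCL's actual selection logic):

```python
import math

def p2p_phase_count(collective: str, n: int, large_msg: bool = True) -> int:
    """P2P transfer counts per the table above, as a function of n.
    Assumes n is a power of two for the log-based algorithms."""
    if collective == "allreduce":
        # Ring for large messages, tree for small ones
        return 2 * (n - 1) if large_msg else 2 * int(math.log2(n))
    if collective in ("allgather", "reducescatter"):
        return n - 1                 # ring: one chunk per step
    if collective == "alltoall":
        return n * (n - 1)           # direct pairwise exchange
    if collective == "broadcast":
        return int(math.log2(n))     # binary tree depth
    raise ValueError(f"unknown collective: {collective}")
```

This is why MoE expert routing (AllToAll) dominates network load at scale: its transfer count grows quadratically in n while every other collective grows linearly or logarithmically.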
SimCCL Ring AllReduce Decomposition
Source: astra-sim-alibabacloud/extern/SimCCL/src/algorithm.cpp

For a ring AllReduce with n GPUs and message size M, SimCCL generates 2(n-1) phases. Each phase consists of n concurrent point-to-point transfers of size M/n. The first n-1 phases are reduce-scatter, the next n-1 are allgather.

// Simplified Ring AllReduce decomposition (C++)
#include <cstddef>
#include <vector>

// One point-to-point transfer emitted by the decomposition.
struct P2PTransfer {
    int src;
    int dst;
    size_t size;
    int phase;
};

struct RingAllReduce {
    void decompose(int num_gpus, size_t message_size,
                   std::vector<P2PTransfer>& transfers);
};

void RingAllReduce::decompose(
    int num_gpus,
    size_t message_size,
    std::vector<P2PTransfer>& transfers
) {
    // For simplicity, assumes message_size divides evenly by num_gpus.
    size_t chunk_size = message_size / num_gpus;

    // Phase 1: Reduce-Scatter (n-1 steps). Chunk indices are omitted:
    // only transfer sizes and phases matter for timing simulation.
    for (int step = 0; step < num_gpus - 1; step++) {
        for (int gpu = 0; gpu < num_gpus; gpu++) {
            int dst = (gpu + 1) % num_gpus;
            transfers.push_back({
                .src = gpu,
                .dst = dst,
                .size = chunk_size,
                .phase = step
            });
        }
    }

    // Phase 2: AllGather (n-1 steps)
    for (int step = 0; step < num_gpus - 1; step++) {
        for (int gpu = 0; gpu < num_gpus; gpu++) {
            int dst = (gpu + 1) % num_gpus;
            transfers.push_back({
                .src = gpu,
                .dst = dst,
                .size = chunk_size,
                .phase = num_gpus - 1 + step
            });
        }
    }
}
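A quick sanity check on the decomposition: each GPU sends one M/n chunk in each of the 2(n-1) phases, so per-GPU traffic is 2(n-1)·M/n, approaching 2M as n grows.

```python
def ring_allreduce_bytes_per_gpu(n: int, message_size: int) -> float:
    """Bytes each GPU sends in the ring decomposition above:
    2*(n-1) phases, one M/n chunk sent per phase."""
    chunk = message_size / n
    return 2 * (n - 1) * chunk

# e.g. 8 GPUs, 8 MB message: each GPU sends 2 * 7 * 1 MB = 14 MB,
# matching the 2(n-1) transfers listed for ring AllReduce in the table.
```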

Appendix: Event Types in SimAI

SimAI's discrete-event simulation loop processes the following event types. Events marked with a star are new additions from SimAI (not present in original Vidur).

| Event | Trigger | Handler | New? |
| --- | --- | --- | --- |
| RequestArrivalEvent | Trace timestamp | Creates Request, routes to global scheduler | - |
| BatchScheduleEvent | Replica ready | Forms batch, invokes execution time predictor | - |
| BatchEndEvent | Batch execution completes | Updates request state, triggers next step | - |
| PrefillCompleteEvent | Prefill batch ends (PD mode) | Initiates KV cache transfer to decode replica | ★ |
| KVCacheTransferCompleteEvent | KV transfer finishes | Records pd_p2p_comm_time, enqueues for decode | ★ |
| RequestCompletionEvent | All decode tokens generated | Collects metrics, removes from replica | - |
| CommSimCompleteEvent | astra-sim/NS-3 returns result | Updates comm_time in batch execution estimate | ★ |
Event Ordering: All events are processed in timestamp order via a priority queue (min-heap). When two events share the same timestamp, they are processed in insertion order. The simulation clock only advances when the next event's timestamp is greater than the current time. This ensures deterministic, reproducible results across runs.
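The tie-breaking rule described (timestamp order, then insertion order) is exactly what a (timestamp, sequence) heap key provides. A minimal sketch of such a queue, not vidur's actual implementation:

```python
import heapq
import itertools

class EventQueue:
    """Min-heap keyed by (timestamp, insertion sequence): events with
    equal timestamps pop in insertion order, so replays are deterministic."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # monotonically increasing tie-breaker

    def push(self, timestamp: float, event):
        # The sequence number also prevents heapq from ever comparing
        # the event payloads themselves.
        heapq.heappush(self._heap, (timestamp, next(self._seq), event))

    def pop(self):
        timestamp, _, event = heapq.heappop(self._heap)
        return timestamp, event

q = EventQueue()
q.push(2.0, "BatchEndEvent")
q.push(1.0, "RequestArrivalEvent")
q.push(1.0, "BatchScheduleEvent")  # ties with arrival: pops second
```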

Repository Structure

SimAI is a multi-repo project composed of 5 git submodules plus top-level orchestration. Below is the complete directory tree with descriptions of each component's role.

Top-Level Structure

SimAI/
├── README.md              ── Project overview, scenarios, setup guide
├── Dockerfile             ── Docker image (nvidia/pytorch base + AICB + Vidur)
├── .gitmodules            ── Submodule config: SimCCL, aicb, ns-3-alibabacloud
├── scripts/
│   └── build.sh           ── Master build: -c analytical | ns3 | phy
├── example/
│   ├── microAllReduce.txt ── Sample AllReduce workload (8 GPUs, TP=8)
│   ├── workload_analytical.txt ── Sample analytical workload
│   └── busbw.yaml        ── Bus bandwidth config (TP/DP/EP/PP per-op BW)
├── docs/
│   ├── Tutorial.md        ── Comprehensive usage tutorial
│   └── SimAI_Intro_Online.pdf ── Presentation slides
│
├── vidur-alibabacloud/    ── ① Inference scheduling simulator (Python)
├── aicb/                  ── ② Workload generation + GPU profiling (Python+CUDA)
├── SimCCL/                ── ③ Collective communication decomposition (C++)
├── astra-sim-alibabacloud/ ── ④ System simulation engine (C++)
└── ns-3-alibabacloud/     ── ⑤ Packet-level network simulator (C++)

vidur-alibabacloud — Inference Scheduling Simulator

vidur-alibabacloud/vidur/
├── main.py                 ── Entry point (python -m vidur.main)
├── simulator.py            ── DES event loop, manages clock + event queue
├── config/
│   ├── config.py           ── All config dataclasses (ReplicaConfig, PD params)
│   └── model_config.py     ── Model specs (DeepSeek-671B, Qwen3-MoE-235B, etc.)
├── entities/               ── Core data models
│   ├── request.py          ── Request with DAG (nx.DiGraph), PD metadata
│   ├── replica.py          ── ReplicaType: MIXED / PREFILL / DECODE
│   ├── batch.py            ── Batch of requests for co-execution
│   ├── task.py             ── PromptTask (prefill) / TokenTask (decode)
│   ├── flow.py             ── KVCacheTransferFlow (PD disaggregation)
│   ├── node.py             ── Base abstraction for Task/Flow in DAG
│   └── interconnect.py     ── NVLink / RDMA / Ethernet / PCIe link models
├── events/                 ── DES event types
│   ├── request_arrival_event.py ── Request enters system
│   ├── replica_schedule_event.py ── Triggers replica scheduling
│   └── batch_end_event.py  ── Batch execution completes
├── scheduler/
│   ├── global_scheduler/
│   │   ├── splitwise_global_scheduler.py ── PD-aware: P-pool + D-pool routing
│   │   ├── lor_global_scheduler.py ── Least Outstanding Requests
│   │   └── round_robin_global_scheduler.py
│   └── replica_scheduler/
│       ├── splitwise_replica_scheduler.py ── PD-aware per-replica scheduling
│       └── vllm_replica_scheduler.py ── vLLM-style scheduling policy
├── execution_time_predictor/
│   ├── sklearn_execution_time_predictor.py ── RandomForest / AICB backend
│   ├── communication_time_predictor.py    ── SimAI NS-3 / analytical backend
│   └── SimAIWorkload.py   ── Workload file builder for astra-sim
├── request_generator/      ── Synthetic / trace-based / Poisson arrivals
└── metrics/                ── TTFT, TBT, E2E latency, PD breakdown

aicb — Workload Generation & GPU Profiling

aicb/
├── aicb.py                  ── Main entry point for benchmark execution
├── workload_applyer.py      ── Applies workloads to GPU cluster (runs collectives)
├── workload_generator/
│   ├── SimAI_inference_workload_generator.py ── Inference workload (prefill/decode)
│   ├── SimAI_training_workload_generator.py  ── Training workload files
│   └── mocked_model/
│       ├── MockedModel.py   ── Base class, InferencePhase enum (PREFILL/DECODE)
│       ├── inference/
│       │   ├── MockedDeepSeek.py  ── DeepSeek-V3 MLA + MoE architecture mock
│       │   ├── MockedQwen3Moe.py  ── Qwen3-MoE architecture mock
│       │   ├── MockedQwen3Next.py ── Qwen3-Next (hybrid attention) mock
│       │   ├── AiobDeepSeek.py    ── GPU kernel profiling: FlashMLA, DeepGEMM FP8
│       │   ├── AiobQwen3Moe.py    ── Qwen3-MoE GPU kernel profiling
│       │   └── AiobQwen3Next.py   ── Qwen3-Next GPU kernel profiling
│       └── training/        ── DeepSpeed, Megatron, DeepSeek training mocks
├── utils/
│   ├── utils.py             ── CommType, CommGroup, Strategy enums, compute cache
│   ├── deepgemm_utils.py    ── FP8 GEMM (per_token/per_block quantization)
│   └── timer.py             ── CudaEventTimer for GPU kernel timing
├── workload/
│   └── simAI/model_workload/ ── Pre-generated workload files (.txt)
├── scripts/inference_configs/
│   ├── deepseek_default.json  ── DeepSeek-V3 model config
│   ├── qwen3_moe_default.json ── Qwen3-MoE model config
│   └── qwen3_next_default.json ── Qwen3-Next model config
└── log_analyzer/            ── Result analysis and plotting tools

SimCCL — Collective Communication Decomposition

SimCCL's core logic currently lives inside astra-sim-alibabacloud/astra-sim/system/MockNccl* files. The standalone SimCCL repo contains documentation; the full implementation will be released separately.

astra-sim-alibabacloud — System Simulation Engine

astra-sim-alibabacloud/
├── CMakeLists.txt          ── Top-level CMake config
├── build/
│   ├── simai_analytical/   ── Build config → bin/SimAI_analytical
│   ├── astra_ns3/          ── Build config → bin/SimAI_simulator
│   └── simai_phy/          ── Build config → bin/SimAI_phynet
├── astra-sim/
│   ├── system/              ── Core simulation (88 files)
│   │   ├── Sys.cc/.hh       ── Main system class, event dispatch
│   │   ├── MockNcclGroup.cc/.h ── NCCL algo decomposition (Ring/Tree/NVLS)
│   │   ├── MockNcclChannel.cc/.h ── SingleFlow, ncclTree, ncclChannelNode
│   │   ├── MockNccl.h       ── Algorithm IDs, base/hw latency tables
│   │   ├── MockNcclQps.h    ── QPS tracking per connection
│   │   ├── SimAiFlowModelRdma.cc ── RDMA flow model
│   │   └── calbusbw.cc      ── Bus bandwidth calculator
│   ├── network_frontend/
│   │   ├── analytical/      ── AnaSim: fast tick-based estimation
│   │   ├── ns3/             ── NS-3 integration (entry.h = main bridge)
│   │   └── phynet/          ── Physical RDMA traffic generation
│   └── workload/
│       ├── Workload.cc/.hh  ── Workload parser (reads .txt workload files)
│       └── Layer.cc/.hh     ── Single compute+comm layer representation
├── inputs/
│   ├── config/
│   │   └── SimAI.conf       ── 60+ params: CC_MODE, PFC, ECN, monitoring
│   ├── topo/
│   │   └── gen_Topo_Template.py ── 5 topologies: Spectrum-X, HPN, DCN+
│   └── ratio/               ── NIC/NVLink/ATA performance ratio CSV

ns-3-alibabacloud — Packet-Level Network Simulator

ns-3-alibabacloud/
├── simulation/src/point-to-point/model/  ── Core network models (37 files)
│   ├── rdma-hw.cc/.h        ── RDMA NIC: QP management, CC algorithms
│   ├── rdma-queue-pair.cc/.h ── QP state: rate, window, seq tracking
│   ├── switch-node.cc/.h    ── Switch: ECMP hash routing, ECN marking
│   ├── switch-mmu.cc/.h     ── Switch MMU: buffer mgmt, PFC, RED/ECN
│   ├── qbb-net-device.cc/.h ── QBB NIC: PFC pause/resume, WRR scheduling
│   ├── nvswitch-node.cc/.h  ── NVSwitch: intra-node NVLS routing
│   ├── int-header.h         ── INT telemetry header (for HPCC)
│   └── pint.h               ── Probabilistic INT (for HPCC-PINT)
├── simulation/src/network/utils/
│   └── custom-header.cc/.h ── Packet header with IP, port, seq, INT
├── analysis/                ── Post-simulation analysis tools
│   ├── fct_analysis.py      ── Flow Completion Time analysis
│   ├── qlen_analysis.py     ── Queue length analysis
│   ├── bw_analysis.py       ── Bandwidth utilization analysis
│   └── qp_cnp_analysis.py   ── CNP count per QP analysis
└── docs/images/             ── Network topology diagrams
Cross-Component Data Flow: AICB profiles GPU kernels → produces compute times → vidur-alibabacloud reads them via execution_time_predictor → generates workload .txt files → astra-sim reads workload → decomposes collectives via MockNcclGroup (SimCCL) → sends P2P flows to NS-3 → NS-3 returns FCT → astra-sim returns communication time → vidur records metrics.