Alibaba Cloud's full-stack simulator integrating Vidur scheduling, AICB GPU profiling, SimCCL collective decomposition, astra-sim system simulation, and NS-3 network simulation — enabling PD disaggregation analysis without deploying real clusters.
aliyun/SimAI · NSDI'25

SimAI is Alibaba Cloud's full-stack simulator for AI training and inference. It extends Microsoft's Vidur (a discrete-event inference simulator) with AICB GPU profiling, SimCCL collective communication decomposition, astra-sim system simulation, and NS-3 packet-level network simulation. Together, these five components form an end-to-end pipeline that can predict training iteration time or inference latency metrics without deploying a single real GPU.
Infrastructure architects evaluating cluster configurations, network engineers comparing topology designs (fat-tree, rail-optimized, dual-ToR), and ML platform teams estimating the cost/benefit of PD disaggregation before committing hardware.
The following table details every dimension where vidur-alibabacloud diverges from the original Microsoft Vidur. These changes enable full-stack network-aware simulation with Prefill-Decode disaggregation support.
| Dimension | Original Vidur | SimAI vidur-alibabacloud |
|---|---|---|
| Deployment Model | Co-located (prefill + decode on same replica) | PD disaggregation support — separate prefill and decode replica pools |
| Compute Time Estimation | sklearn RandomForest trained on profiled CSV | AICB AIOB real GPU profiling (DeepGEMM / FlashMLA kernels) |
| Communication Simulation | None (assumes replica-internal communication is negligible) | SimCCL + astra-sim + NS-3 full-stack network simulation |
| Supported Models | LLaMA2-7B / 13B / 70B | + DeepSeek-V3-671B, Qwen3-MoE-235B, Qwen3-Next-80B |
| Hardware Requirement | Pure CPU simulation | Profiling needs Hopper/Blackwell GPU (SM90+); simulation itself is CPU-only |
| Global Scheduler | Random, Round-Robin, Least Outstanding Requests (LOR) | + SplitWise (PD-aware routing to prefill/decode pools) |
| Replica Scheduler | Faster Transformer, Orca, Sarathi, vLLM, LightLLM | + SplitWise (PD-aware batch formation) |
| Task Representation | Basic execution time estimation | DAG with PromptTask / TokenTask / KVCacheTransferFlow |
| Output Metrics | Basic latency / throughput | + Detailed PD timing breakdown: prefill_e2e, decode_e2e, pd_p2p_comm_time, pd_p2p_comm_size |
SimAI is not a monolith — it is a federation of five repositories, each handling a different layer of the simulation stack. The following diagram shows how data flows between them.
The discrete-event simulation engine. Manages request arrival, global scheduling (SplitWise), replica scheduling, batch formation, and metrics collection. Written in Python.
Generates realistic workload descriptions and profiles actual GPU kernel times using AIOB (AI Operation Benchmark). Captures DeepGEMM, FlashMLA, and NCCL operator latencies.
Decomposes high-level collective operations (AllReduce, AllGather, ReduceScatter) into point-to-point data transfers. Supports ring, tree, and halving-doubling algorithms.
Three-layer simulation engine: Workload layer (reads execution graphs), System layer (schedules compute/comm), Network layer (pluggable backend: analytical or NS-3).
Packet-level network simulation with RDMA transport, ECN/PFC congestion control, and configurable topologies (fat-tree, rail-optimized, dual-ToR). The highest-fidelity backend.
A request's end-to-end latency decomposes into 5 phases. This section summarizes what parameters affect each phase. For the detailed calculation logic (RandomForest models, backend comparison, per-layer formulas), see the Lifecycle page.
e2e_time = completed_at − arrived_at
Time from request arrival to the first batch containing this request. Event-driven: the scheduler tries immediately when a request arrives, but may be blocked.
| Parameter | How it affects scheduling delay |
|---|---|
| `num_prefill_tokens` | Longer prompts need more KV blocks upfront and consume more batch token budget — harder to fit. |
| KV cache memory usage | When allocated blocks + watermark ≥ total blocks, no new prefill can be scheduled until decode requests free memory. |
| `num_pipeline_stages` | Pipeline can hold at most this many batches in-flight; when full, new batches wait for one to finish. |
| `max_tokens_in_batch` | If adding this request's tokens would exceed the limit, it waits for the current batch to execute. |
| `batch_size_cap` | Hard cap on requests per micro-batch; excess requests must queue for the next batch. |
| QPS / arrival rate | Higher QPS fills KV memory and batch slots faster, increasing contention and queue depth. |
| Preemption | Memory thrashing evicts the request; the entire prefill must restart from scratch. |
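The admission conditions in the table above can be combined into a single predicate. The sketch below is illustrative only (not SimAI's actual scheduler code); the function and parameter names are stand-ins for the config values and scheduler state described above.

```python
def can_schedule_prefill(
    num_prefill_tokens: int,
    free_kv_blocks: int,
    block_size: int,
    watermark_blocks: int,
    batch_tokens: int,
    max_tokens_in_batch: int,
    batch_requests: int,
    batch_size_cap: int,
) -> bool:
    """Illustrative check: can a new prefill request join the current batch?"""
    needed_blocks = -(-num_prefill_tokens // block_size)  # ceiling division
    if free_kv_blocks - needed_blocks < watermark_blocks:
        return False  # KV memory pressure: wait for decodes to free blocks
    if batch_tokens + num_prefill_tokens > max_tokens_in_batch:
        return False  # would exceed the batch token budget
    if batch_requests + 1 > batch_size_cap:
        return False  # per-batch request cap reached
    return True
```

A request that fails any of these checks accrues scheduling delay until a later scheduler cycle.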
GPU computation for processing all input tokens in one forward pass (single batch).
| Parameter | How it affects prefill time |
|---|---|
| `num_prefill_tokens` | More tokens = more compute per layer; attention cost scales quadratically with sequence length. |
| `num_layers` | Total time is block_time × layers_per_stage; more layers = proportionally longer. |
| Model architecture | Larger hidden_dim and more attention heads increase per-layer compute (matmul size). |
| `tensor_parallel_size` | Splits compute across GPUs (faster) but adds 2 AllReduce per layer (communication overhead). |
| `num_pipeline_stages` | Each stage handles fewer layers (faster per stage) but adds SendRecv latency between stages. |
| GPU device | H100 kernels are ~2× faster than A100 for the same operation (profiled per device). |
| Execution backend | vidur/aicb/simai_analytical/simai_simulation use different methods to estimate compute & comm time. |
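A toy cost model (ours, not SimAI's RandomForest/AICB predictor) shows how the table's parameters interact: pipeline parallelism divides layers across stages, tensor parallelism divides per-layer compute but adds two AllReduces per layer, and consecutive stages pay a SendRecv hop. All timing inputs are hypothetical placeholders.

```python
def prefill_time_ms(
    num_layers: int,
    per_layer_compute_ms: float,   # single-GPU compute per layer (assumed known)
    tensor_parallel_size: int,
    num_pipeline_stages: int,
    allreduce_ms: float,           # one TP AllReduce
    sendrecv_ms: float,            # one PP stage-to-stage transfer
) -> float:
    """Illustrative prefill-time model: not SimAI's actual predictor."""
    layers_per_stage = num_layers // num_pipeline_stages
    # TP splits matmul work but adds 2 AllReduce per layer
    per_layer = per_layer_compute_ms / tensor_parallel_size + 2 * allreduce_ms
    stage_time = layers_per_stage * per_layer
    # One SendRecv between each pair of consecutive pipeline stages
    return stage_time * num_pipeline_stages + sendrecv_ms * (num_pipeline_stages - 1)
```

With TP=4 the per-layer compute shrinks 4×, but the two AllReduces per layer cap the speedup — exactly the tradeoff the table describes.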
Network transfer of KV cache from prefill replica to decode replica. Skipped in non-PD mode.
| Parameter | How it affects transfer time |
|---|---|
| `num_prefill_tokens` | More tokens = larger KV cache to transfer (linear relationship). |
| Model dims | kv_cache_size = 2 × tokens × num_kv_heads × head_dim × num_layers × bytes_per_element. |
| `pd_p2p_comm_bandwidth` | Higher bandwidth (e.g. 800 Gbps) proportionally reduces transfer time. |
| `pd_p2p_comm_dtype` | fp16 = 2 bytes/element, fp32 = 4 bytes — doubles the transfer size. |
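The size and time relationships in the table can be checked numerically. This is a minimal sketch (helper names are ours): bandwidth is assumed to be in Gbit/s, and the model dimensions used below are illustrative, not profiled values.

```python
def kv_cache_size_bytes(num_tokens: int, num_kv_heads: int, head_dim: int,
                        num_layers: int, bytes_per_element: int = 2) -> int:
    """kv_cache_size = 2 (K and V) x tokens x heads x head_dim x layers x dtype bytes."""
    return 2 * num_tokens * num_kv_heads * head_dim * num_layers * bytes_per_element

def kv_transfer_time_s(size_bytes: int, bandwidth_gbps: float) -> float:
    """transfer_time = kv_cache_size / pd_p2p_comm_bandwidth (bytes -> bits)."""
    return size_bytes * 8 / (bandwidth_gbps * 1e9)

# Illustrative dims: 2048 prompt tokens, 32 KV heads, head_dim 128, 32 layers, fp16
size = kv_cache_size_bytes(2048, 32, 128, 32, 2)   # 1 GiB
t = kv_transfer_time_s(size, 800)                  # ~10.7 ms at 800 Gbps
```

At these (assumed) dimensions the transfer adds roughly 10 ms to TTFT, which is why PD bandwidth matters.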
transfer_time = kv_cache_size / pd_p2p_comm_bandwidth

Wait before the first decode batch. Non-PD: near-zero (same replica, next scheduler cycle). PD: queueing time on the D replica after KV transfer.
| Parameter | How it affects decode scheduling |
|---|---|
| D replica load | More concurrent decode requests on D = pipeline/batch slots fill up, new arrivals queue. |
| KV cache on D | Decode only needs 1 block/iter, but if D's memory is nearly full, even this can block. |
| `batch_size_cap` | If D's current batch already has max requests, new decode must wait for next cycle. |
| Non-PD mode | Request stays on same replica; slots into next batch immediately with continuous batching (≈ 0). |
Auto-regressive generation: num_decode_tokens − 1 serial iterations, each producing 1 token. Typically the dominant phase for long outputs.
| Parameter | How it affects decode time |
|---|---|
| `num_decode_tokens` | Directly determines iteration count; 128 output tokens = 127 serial iterations. |
| Context length | KV cache grows each iteration; attn_decode cost increases as context lengthens (later tokens cost more). |
| Concurrent batch_size | More decode requests in same batch = larger total KV cache to attend over, increasing per-iter time. |
| `num_layers` | Each iteration passes through all layers; more layers = proportionally longer per iteration. |
| `tensor_parallel_size` | Same tradeoff as prefill: splits compute but adds 2 AllReduce per layer per iteration. |
| GPU device | Faster GPU = faster per-iteration decode; decode is memory-bandwidth bound (not compute bound). |
| Inter-batch gaps | Scheduler overhead between iterations; total = Σ(per_iter_time + gap) across all iterations. |
Single-request latency at low QPS (no queueing contention), non-PD mode (TP=1, PP=1). The RTX 5090 column is measured with vLLM v0.1 + LLaMA-3.1 8B (bf16). Other columns are estimates based on device specs.
| Phase | LLaMA-2 7B A100 (estimated) | LLaMA-3.1 8B RTX 5090 (measured) | LLaMA-3 8B H100 (estimated) | LLaMA-2 70B H100 TP=4 (estimated) |
|---|---|---|---|---|
| ① Prefill Sched. | ~1 ms | < 1 ms | ~1 ms | ~2 ms |
| ② Prefill Exec | ~18 ms | 14 ms | ~8 ms | ~50 ms |
| ③ KV Transfer | skipped (non-PD) | skipped (non-PD) | skipped (non-PD) | skipped (non-PD) |
| ④ Decode Sched. | ≈ 0 | ≈ 0 | ≈ 0 | ≈ 0 |
| ⑤ Decode (127×) | ~1.4 s | 1.37 s | ~0.8 s | ~2.5 s |
| E2E | ~1.4 s | 1.39 s | ~0.8 s | ~2.6 s |
| TTFT | ~19 ms | 14 ms | ~9 ms | ~52 ms |
| TBT (avg) | ~11 ms | 10.8 ms | ~6 ms | ~20 ms |
| Category | Details |
|---|---|
| Supported Devices | A40, A100, H100, H800 |
| NOT supported | Consumer GPUs (RTX 5090, etc.) — no profiling CSVs available. You must profile on the target GPU and add CSVs to data/profiling/compute/ and data/profiling/network/. |
| Example Models | LLaMA-2 7B/13B/70B, LLaMA-3 8B/70B, Qwen, CodeLlama, Mixtral, DeepSeek-671B, Qwen3-Moe-235B |
SimAI introduces several new entity classes to model PD disaggregation. These classes extend Vidur's original entity hierarchy with DAG-based task representation and explicit KV cache transfer flows.
The Request class now carries a directed acyclic graph (DAG) that encodes the dependency relationships between prefill tasks, decode tasks, and KV cache transfer flows. This is the fundamental data structure enabling PD disaggregation simulation.
import networkx as nx
class Request(BaseEntity):
"""A single inference request with PD disaggregation support."""
def __init__(self, ...):
self.dag = nx.DiGraph() # Task dependency graph
self.prefill_replica_id = None # Assigned prefill replica
self.decode_replica_id = None # Assigned decode replica
# PD communication metrics (populated after KV transfer)
self.pd_p2p_comm_size = float('inf')
self.pd_p2p_comm_time = float('inf')
# Timing breakdown
self.prefill_e2e = None
self.decode_e2e = None
self.prefill_start_timestamp = None
self.decode_start_timestamp = None
SimAI introduces a ReplicaType enum to differentiate between co-located (mixed) replicas and PD-disaggregated replicas. The global scheduler uses this to route requests to the appropriate pool.
from enum import IntEnum
class ReplicaType(IntEnum):
MIXED = 0 # Co-located: prefill + decode on same replica
PREFILL = 1 # Dedicated prefill replica
DECODE = 2 # Dedicated decode replica
Node is the abstract base class for all executable units in the request DAG. Each node has a start time, end time, and duration.
class Node(BaseEntity):
    """Base class for Task and Flow."""
    start_time: float
    end_time: float
    duration: float
Task extends Node for compute operations. Two concrete subclasses model the PD split.
class Task(Node):
"""Compute operation."""
pass
class PromptTask(Task):
"""Prefill computation."""
pass
class TokenTask(Task):
"""Decode computation."""
pass
Flow extends Node for data transfer operations. The key subclass is KVCacheTransferFlow, which models the PD KV cache transfer.
class Flow(Node):
    """Data transfer operation."""
    size_bytes: int
    bandwidth: float
class KVCacheTransferFlow(Flow):
"""KV cache transfer from prefill to decode replica.
In PD disaggregation mode, this flow is inserted into the
request DAG between PromptTask and TokenTask nodes."""
def __init__(self, src_replica_id, dst_replica_id, kv_cache_size):
self.src_replica_id = src_replica_id
self.dst_replica_id = dst_replica_id
self.kv_cache_size = kv_cache_size
SimAI models multiple interconnect types to accurately simulate communication latency across different hardware links. Each interconnect has a configurable bandwidth and latency.
class Interconnect:
    """Hardware interconnect abstraction.

    Supported link types:
      NVLink:    intra-node GPU-GPU (e.g., 900 GB/s per direction)
      RDMA:      inter-node GPU-GPU via RoCEv2/InfiniBand
      Ethernet:  standard Ethernet (fallback)
      PCIe:      CPU-GPU or NIC-GPU transfer
      DummyLink: zero-latency link for testing
    """
def __init__(self, link_type, bandwidth_bps, latency_us=0):
self.link_type = link_type
self.bandwidth_bps = bandwidth_bps
self.latency_us = latency_us
def transfer_time(self, size_bytes) -> float:
"""Estimate transfer time in microseconds."""
return (size_bytes * 8 / self.bandwidth_bps * 1e6
+ self.latency_us)
SimAI extends Vidur's configuration with new parameters for PD disaggregation, network simulation, and MoE model support. All parameters are exposed as CLI flags and can be set in config files.
from dataclasses import dataclass, field
@dataclass
class ReplicaConfig:
# ===== PD Disaggregation =====
    pd_p2p_comm_bandwidth: int = 800   # Gbps (KV cache transfer link)
pd_p2p_comm_dtype: str = 'float16' # KV cache data type
pd_node_ratio: float = 0.5 # P:D ratio (0.5 = 1:1)
# ===== Network Bandwidth =====
    nvlink_bandwidth: int = 1600       # Gbps (NVLink per direction)
    rdma_bandwidth: int = 800          # Gbps (RDMA per NIC)
# ===== MoE (Mixture of Experts) =====
expert_model_parallel_size: int = 1 # Expert parallelism degree
# ===== Simulation Backend =====
backend: str = "vidur"
# Choices: "vidur" - original Vidur (CPU only)
# "simai_simulation" - NS-3 full network sim
# "simai_analytical" - Bus bandwidth estimation
# "aicb" - AICB GPU profiling backend
@dataclass
class RandomForrestExecutionTimePredictorConfig:
"""Config for the execution time predictor.
In SimAI, this predictor can delegate to AICB for
real GPU profiling instead of sklearn RandomForest."""
backend: str = "vidur"
# "vidur" - sklearn RandomForest on profiled CSV
# "aicb" - AICB AIOB real GPU kernel profiling
compute_cache_dir: str = "./compute_cache"
# Directory for cached AICB profiling results
model_name: str = "deepseek-671B"
# Target model for profiling
Configuration values are resolved in the order: default values → config file → CLI flags. CLI flags always take precedence. All PD-related parameters have sensible defaults that produce co-located (non-disaggregated) behavior, maintaining backward compatibility with original Vidur.
SimAI offers three simulation backends, each trading off speed for fidelity. Choose the right mode based on your exploration stage and available resources.
| Mode | Backend | Speed | Fidelity | Hardware Required | Use Case |
|---|---|---|---|---|---|
| Analytical | Bus bandwidth estimation | ★★★ Fast | ★ Low | CPU only | Quick exploration, parameter sweeps |
| NS-3 Simulation | Full packet-level network sim | ★ Slow | ★★★ High | CPU only (multi-core recommended) | Topology comparison, congestion analysis |
| Physical | Real RDMA traffic | ★★ Real-time | ★★★★ Highest | RDMA-capable cluster | Final validation, production calibration |
Uses simple bus bandwidth formulas: time = message_size / (bandwidth * num_links). No congestion modeling, no packet-level simulation. Suitable for rapid parameter sweeps across hundreds of configurations.
# Backend flag for analytical mode
--backend simai_analytical
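The analytical formula above can be written as a one-liner; this helper is ours (not a SimAI API), assuming size in bytes, per-link bandwidth in Gbit/s, and a result in microseconds.

```python
def analytical_comm_time_us(message_bytes: int, link_bandwidth_gbps: float,
                            num_links: int) -> float:
    """time = message_size / (bandwidth * num_links); no congestion modeling."""
    bits = message_bytes * 8
    return bits / (link_bandwidth_gbps * 1e9 * num_links) * 1e6

# E.g. an 8 MiB AllReduce chunk over 8 x 100 Gbps links
t = analytical_comm_time_us(8 * 2**20, 100, 8)   # ~83.9 us
```

Because the model ignores contention entirely, it runs fast enough to evaluate hundreds of configurations, at the cost of the fidelity the NS-3 backend provides.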
Full packet-level simulation with RDMA transport, ECN/PFC congestion control, and real topology modeling. Captures contention, head-of-line blocking, and incast effects. 10-100x slower than analytical but much more accurate.
# Backend flag for NS-3 mode
--backend simai_simulation
Runs actual RDMA traffic on a physical cluster. Provides ground-truth measurements for validation. Requires RoCEv2 or InfiniBand capable NICs and an actual cluster deployment.
# Physical mode uses real hardware
--backend physical
The following command runs a PD disaggregation simulation for DeepSeek-V3-671B with the SplitWise scheduler and AICB-profiled execution times:
python -m vidur.main \
--replica_config_model_name deepseek-671B \
--replica_config_pd_p2p_comm_bandwidth 800 \
--replica_config_nvlink_bandwidth 1600 \
--replica_config_pd_node_ratio 0.5 \
--global_scheduler_config_type split_wise \
--replica_scheduler_config_type split_wise \
--random_forrest_execution_time_predictor_config_backend aicb
For a fast parameter sweep using the analytical backend:
python -m vidur.main \
--replica_config_model_name qwen3-moe-235B \
--replica_config_pd_node_ratio 0.3 \
--global_scheduler_config_type split_wise \
--replica_scheduler_config_type split_wise \
--random_forrest_execution_time_predictor_config_backend vidur \
--backend simai_analytical
For detailed network simulation with topology specification:
python -m vidur.main \
--replica_config_model_name deepseek-671B \
--replica_config_pd_p2p_comm_bandwidth 800 \
--replica_config_rdma_bandwidth 800 \
--global_scheduler_config_type split_wise \
--replica_scheduler_config_type split_wise \
--backend simai_simulation \
--network_topology fat_tree
| Flag | Description | Default |
|---|---|---|
| `--replica_config_model_name` | Target model to simulate | - |
| `--replica_config_pd_p2p_comm_bandwidth` | PD KV transfer bandwidth (Gbps) | 800 |
| `--replica_config_nvlink_bandwidth` | NVLink bandwidth per direction (Gbps) | 1600 |
| `--replica_config_rdma_bandwidth` | RDMA bandwidth per NIC (Gbps) | 800 |
| `--replica_config_pd_node_ratio` | Fraction of nodes for prefill (0.5 = 1:1 P:D) | 0.5 |
| `--replica_config_pd_p2p_comm_dtype` | Data type for KV cache transfer | float16 |
| `--global_scheduler_config_type` | Global scheduler algorithm | round_robin |
| `--replica_scheduler_config_type` | Replica scheduler algorithm | vllm |
| `--random_forrest_execution_time_predictor_config_backend` | Execution time predictor backend | vidur |
| `--backend` | Simulation backend (vidur/simai_simulation/simai_analytical) | vidur |
| `--replica_config_expert_model_parallel_size` | Expert parallelism degree for MoE models | 1 |
A single TokenTask node handles all decode tokens internally through multiple batch iterations (tokens_per_iteration = 1), rather than creating a separate node per token. This keeps the DAG lightweight regardless of output length.
With backend=vidur and the default schedulers (round_robin + vllm), SimAI behaves identically to upstream Vidur. The PD disaggregation features only activate when you explicitly configure SplitWise scheduling and set pd_node_ratio < 1.0.
The expert_model_parallel_size parameter controls how experts are distributed across GPUs. AICB profiles MoE-specific kernels including expert routing, top-k gating, and sparse matrix operations, providing accurate compute time estimates for these architectures.
SimAI spans five repositories under the aliyun GitHub organization. The following table maps each component to its repository and primary language.
| Component | Repository | Language | Role |
|---|---|---|---|
| vidur | vidur-alibabacloud | Python | Scheduling, orchestration, metrics |
| aicb | aliyun/aicb | Python + CUDA | Workload generation, GPU profiling |
| simccl | SimCCL | C++ | Collective communication decomposition |
| astra-sim | astra-sim-alibabacloud | C++ | System simulation engine |
| ns-3 | ns-3-alibabacloud | C++ | Packet-level network simulation |
The C++ components (SimCCL, astra-sim, NS-3) are linked together at compile time via CMake. SimCCL is compiled as a static library that astra-sim links against, and NS-3 is built as a separate shared library loaded by astra-sim's network layer.
vidur (Python) communicates with astra-sim (C++) through workload description files (.txt). vidur writes the execution graph as a text file, then invokes astra-sim as a subprocess. astra-sim returns timing results that vidur reads back to advance its event loop.
SimAI produces a rich set of per-request and aggregate metrics. The PD-specific metrics are unique to SimAI and not available in the original Vidur.
| Metric | Unit | Description | New in SimAI? |
|---|---|---|---|
| `ttft` | ms | Time To First Token | - |
| `tbt` | ms | Time Between Tokens (avg) | - |
| `e2e_latency` | ms | End-to-end request latency | - |
| `prefill_e2e` | ms | Prefill phase end-to-end time | ✓ |
| `decode_e2e` | ms | Decode phase end-to-end time | ✓ |
| `pd_p2p_comm_time` | ms | KV cache transfer time (P→D) | ✓ |
| `pd_p2p_comm_size` | bytes | KV cache transfer size | ✓ |
| `prefill_replica_id` | - | Assigned prefill replica | ✓ |
| `decode_replica_id` | - | Assigned decode replica | ✓ |
The metrics store collects all per-request metrics and exports them to CSV. The PD-specific fields are only populated when running in PD disaggregation mode.
class MetricsStore:
"""Central metrics collection for simulation results."""
def on_request_complete(self, request: Request):
# Standard Vidur metrics
self._record("ttft", request.ttft)
self._record("tbt", request.avg_tbt)
self._record("e2e_latency", request.e2e_latency)
# NEW: PD disaggregation metrics
if request.prefill_e2e is not None:
self._record("prefill_e2e", request.prefill_e2e)
self._record("decode_e2e", request.decode_e2e)
self._record("pd_p2p_comm_time", request.pd_p2p_comm_time)
self._record("pd_p2p_comm_size", request.pd_p2p_comm_size)
In PD disaggregation mode, a request produces a DAG with exactly three nodes. The single TokenTask node handles all decode tokens internally through batch iterations (tokens_per_iteration = 1), rather than creating a separate node per token. The KVCacheTransferFlow sits between prefill and decode, creating the critical PD communication dependency.
In PD mode, TTFT = prefill_compute_time + pd_p2p_comm_time + first_decode_compute_time. This means the KV cache transfer time directly impacts user-perceived latency. Optimizing PD bandwidth (via pd_p2p_comm_bandwidth) and topology (to minimize hops between P and D replicas) is crucial for PD disaggregation performance.
The SplitWise scheduler is SimAI's key addition to Vidur's scheduling layer. It operates at two levels: the global scheduler routes requests to the correct replica pool, and the replica scheduler manages batch formation within each pool.
The SplitWise global scheduler partitions replicas into prefill and decode pools based on pd_node_ratio. When a new request arrives, it routes the initial prefill to the least-loaded prefill replica. After prefill completes and KV cache is transferred, the decode phase is routed to the least-loaded decode replica.
class SplitWiseGlobalScheduler(BaseGlobalScheduler):
"""PD-aware global scheduler.
Splits replicas into prefill and decode pools."""
def __init__(self, config, replicas):
self.pd_node_ratio = config.pd_node_ratio
n_prefill = int(len(replicas) * self.pd_node_ratio)
self.prefill_replicas = replicas[:n_prefill]
self.decode_replicas = replicas[n_prefill:]
def schedule(self, request: Request) -> int:
# Route prefill to least-loaded prefill replica
target = min(
self.prefill_replicas,
key=lambda r: r.pending_requests
)
request.prefill_replica_id = target.id
return target.id
The SplitWise replica scheduler extends the base Sarathi-style continuous batching with PD awareness. Prefill replicas only process prefill requests, decode replicas only process decode iterations. This specialization eliminates the prefill-decode interference that degrades performance in co-located deployments.
class SplitWiseReplicaScheduler(BaseReplicaScheduler):
"""PD-aware replica scheduler.
Handles batch formation for a single replica type."""
def _build_batch(self) -> Batch:
if self.replica_type == ReplicaType.PREFILL:
# Only schedule prefill requests
candidates = [r for r in self.waiting
if not r.prefill_complete]
else:
# Only schedule decode iterations
candidates = [r for r in self.running
if r.prefill_complete]
return self._form_batch(candidates)
The pd_node_ratio parameter is critical for performance. Too many prefill replicas (ratio > 0.5) leads to decode starvation and high TBT. Too few prefill replicas (ratio < 0.3) causes prefill queueing and high TTFT. SimAI's simulation capability makes it practical to sweep this parameter across dozens of values without provisioning real hardware.
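Such a sweep reduces to one simulator invocation per candidate ratio. A sketch of a driver (the flag names come from the CLI reference above; the helper itself is ours, not a SimAI script):

```python
def sweep_commands(ratios):
    """Build one vidur.main command per candidate pd_node_ratio value."""
    base = [
        "python", "-m", "vidur.main",
        "--global_scheduler_config_type", "split_wise",
        "--replica_scheduler_config_type", "split_wise",
    ]
    return [base + ["--replica_config_pd_node_ratio", str(r)] for r in ratios]
```

Each command list can then be launched with subprocess.run(cmd), and the resulting TTFT/TBT metrics compared across ratios.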
AICB (AI Communication Benchmark) provides two key capabilities to SimAI: workload description generation and real GPU kernel profiling. The profiling pipeline is designed to be run once per model/hardware combination and cached for repeated simulations.
AICB reads the model architecture description (layer count, hidden dimension, attention heads, MoE config) and generates a comprehensive operator list. For DeepSeek-V3-671B, this includes MLA (Multi-head Latent Attention), DeepGEMM sparse experts, and shared expert layers.
The AIOB (AI Operation Benchmark) module executes each operator on the actual GPU and records its execution time. For compute operators, it uses DeepGEMM for matrix multiplications and FlashMLA for attention. For communication operators, it measures NCCL collective latencies.
AICB outputs a workload description file (.txt) that lists every operation in an iteration with its profiled timing. This file is consumed by astra-sim's workload layer. The format encodes: operation type, data size, compute time, communication collective type, and dependencies.
The profiled compute times are cached in the compute_cache directory. Vidur's execution time predictor loads these cached values instead of running sklearn RandomForest predictions. This makes subsequent simulation runs fast while maintaining profiling accuracy.
# AICB workload description file for DeepSeek-V3-671B
# Format: op_type data_size compute_time_us comm_type deps
COMP 0 1250 NONE -1 # QKV projection (DeepGEMM)
COMP 0 890 NONE 0 # MLA attention (FlashMLA)
COMP 0 420 NONE 1 # Output projection
COMM 8388608 0 ALLREDUCE 2 # TP AllReduce (8MB)
COMP 0 340 NONE 3 # Expert routing + gating
COMP 0 1680 NONE 4 # Sparse expert FFN (DeepGEMM)
COMM 16777216 0 ALLTOALL 5 # EP All-to-All (16MB)
COMP 0 560 NONE 6 # Shared expert FFN
COMM 8388608 0 ALLREDUCE 7 # TP AllReduce (8MB)
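A minimal parser sketch for records in this format (field layout as shown in the sample above; the parser itself is ours, not part of AICB):

```python
def parse_workload_line(line: str):
    """Parse one 'op_type data_size compute_time_us comm_type deps' record.

    Returns None for comment-only or blank lines.
    """
    fields = line.split("#", 1)[0].split()  # drop trailing comment
    if len(fields) != 5:
        return None
    op_type, data_size, time_us, comm_type, deps = fields
    return {
        "op_type": op_type,            # COMP or COMM
        "data_size": int(data_size),   # bytes (0 for compute ops)
        "compute_time_us": int(time_us),
        "comm_type": comm_type,        # NONE / ALLREDUCE / ALLTOALL / ...
        "deps": int(deps),             # index of predecessor op (-1 = none)
    }
```

astra-sim's workload layer consumes records like these to rebuild the per-iteration dependency chain.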
The NS-3 backend supports multiple data center network topologies. Choosing the right topology significantly impacts PD communication latency and collective operation performance.
| Topology | Structure | Max Hops (P→D) | Bisection BW | Best For |
|---|---|---|---|---|
| Fat-tree | 3-tier Clos (core/agg/ToR) | 6 | Full | General purpose, balanced |
| Rail-Optimized | GPU-rank aligned rails | 2 | Partial | AllReduce-heavy workloads |
| Dual-ToR | Redundant ToR switches | 4 | High | PD disagg with fault tolerance |
| Single-Switch | All GPUs on one switch | 2 | Full | Small clusters (≤64 GPUs) |
The NS-3 backend models DCQCN (Data Center QCN) congestion control with ECN marking and PFC (Priority Flow Control) pause frames. When multiple KV cache transfers compete for bandwidth on shared links, the simulation captures the resulting throughput degradation and queueing delays. This is invisible to the analytical backend.
Network topologies are specified via JSON configuration files that describe switch connectivity, link bandwidth, and latency parameters. astra-sim reads this configuration and passes it to the NS-3 backend during initialization. Custom topologies can be defined by modifying these configuration files.
SimCCL maps high-level collective communication primitives to concrete point-to-point transfer schedules. The algorithm selection depends on the message size, number of participants, and network topology.
| Collective | Algorithm | P2P Transfers | Typical Use |
|---|---|---|---|
| AllReduce | Ring (large msg) / Tree (small msg) | 2(n-1) / 2 log(n) | TP gradient sync |
| AllGather | Ring / Recursive Halving-Doubling | n-1 | PP layer gathering |
| ReduceScatter | Ring / Direct | n-1 | ZeRO gradient partitioning |
| AllToAll | Direct exchange | n(n-1) | MoE expert routing |
| Broadcast | Binary tree | log(n) | KV cache distribution |
For a ring AllReduce with n GPUs and message size M, SimCCL generates 2(n-1) phases. Each phase consists of n concurrent point-to-point transfers of size M/n. The first n-1 phases are reduce-scatter, the next n-1 are allgather.
// Simplified Ring AllReduce decomposition (C++)
void RingAllReduce::decompose(
int num_gpus,
size_t message_size,
std::vector<P2PTransfer>& transfers
) {
size_t chunk_size = message_size / num_gpus;
// Phase 1: Reduce-Scatter (n-1 steps)
for (int step = 0; step < num_gpus - 1; step++) {
for (int gpu = 0; gpu < num_gpus; gpu++) {
int dst = (gpu + 1) % num_gpus;
transfers.push_back({
.src = gpu,
.dst = dst,
.size = chunk_size,
.phase = step
});
}
}
// Phase 2: AllGather (n-1 steps)
for (int step = 0; step < num_gpus - 1; step++) {
for (int gpu = 0; gpu < num_gpus; gpu++) {
int dst = (gpu + 1) % num_gpus;
transfers.push_back({
.src = gpu,
.dst = dst,
.size = chunk_size,
.phase = num_gpus - 1 + step
});
}
}
}
SimAI's discrete-event simulation loop processes the following event types. Events marked with a star are new additions from SimAI (not present in original Vidur).
| Event | Trigger | Handler | New? |
|---|---|---|---|
| `RequestArrivalEvent` | Trace timestamp | Creates Request, routes to global scheduler | - |
| `BatchScheduleEvent` | Replica ready | Forms batch, invokes execution time predictor | - |
| `BatchEndEvent` | Batch execution completes | Updates request state, triggers next step | - |
| `PrefillCompleteEvent` | Prefill batch ends (PD mode) | Initiates KV cache transfer to decode replica | ★ |
| `KVCacheTransferCompleteEvent` | KV transfer finishes | Records pd_p2p_comm_time, enqueues for decode | ★ |
| `RequestCompletionEvent` | All decode tokens generated | Collects metrics, removes from replica | - |
| `CommSimCompleteEvent` | astra-sim/NS-3 returns result | Updates comm_time in batch execution estimate | ★ |
SimAI is a multi-repo project composed of 5 git submodules plus top-level orchestration. Below is the complete directory tree with descriptions of each component's role.
SimAI/
├── README.md ── Project overview, scenarios, setup guide
├── Dockerfile ── Docker image (nvidia/pytorch base + AICB + Vidur)
├── .gitmodules ── Submodule config: SimCCL, aicb, ns-3-alibabacloud
├── scripts/
│ └── build.sh ── Master build: -c analytical | ns3 | phy
├── example/
│ ├── microAllReduce.txt ── Sample AllReduce workload (8 GPUs, TP=8)
│ ├── workload_analytical.txt ── Sample analytical workload
│ └── busbw.yaml ── Bus bandwidth config (TP/DP/EP/PP per-op BW)
├── docs/
│ ├── Tutorial.md ── Comprehensive usage tutorial
│ └── SimAI_Intro_Online.pdf ── Presentation slides
│
├── vidur-alibabacloud/ ── ① Inference scheduling simulator (Python)
├── aicb/ ── ② Workload generation + GPU profiling (Python+CUDA)
├── SimCCL/ ── ③ Collective communication decomposition (C++)
├── astra-sim-alibabacloud/ ── ④ System simulation engine (C++)
└── ns-3-alibabacloud/ ── ⑤ Packet-level network simulator (C++)
vidur-alibabacloud/vidur/
├── main.py ── Entry point (python -m vidur.main)
├── simulator.py ── DES event loop, manages clock + event queue
├── config/
│ ├── config.py ── All config dataclasses (ReplicaConfig, PD params)
│ └── model_config.py ── Model specs (DeepSeek-671B, Qwen3-MoE-235B, etc.)
├── entities/ ── Core data models
│ ├── request.py ── Request with DAG (nx.DiGraph), PD metadata
│ ├── replica.py ── ReplicaType: MIXED / PREFILL / DECODE
│ ├── batch.py ── Batch of requests for co-execution
│ ├── task.py ── PromptTask (prefill) / TokenTask (decode)
│ ├── flow.py ── KVCacheTransferFlow (PD disaggregation)
│ ├── node.py ── Base abstraction for Task/Flow in DAG
│ └── interconnect.py ── NVLink / RDMA / Ethernet / PCIe link models
├── events/ ── DES event types
│ ├── request_arrival_event.py ── Request enters system
│ ├── replica_schedule_event.py ── Triggers replica scheduling
│ └── batch_end_event.py ── Batch execution completes
├── scheduler/
│ ├── global_scheduler/
│ │ ├── splitwise_global_scheduler.py ── PD-aware: P-pool + D-pool routing
│ │ ├── lor_global_scheduler.py ── Least Outstanding Requests
│ │ └── round_robin_global_scheduler.py
│ └── replica_scheduler/
│ ├── splitwise_replica_scheduler.py ── PD-aware per-replica scheduling
│ └── vllm_replica_scheduler.py ── vLLM-style scheduling policy
├── execution_time_predictor/
│ ├── sklearn_execution_time_predictor.py ── RandomForest / AICB backend
│ ├── communication_time_predictor.py ── SimAI NS-3 / analytical backend
│ └── SimAIWorkload.py ── Workload file builder for astra-sim
├── request_generator/ ── Synthetic / trace-based / Poisson arrivals
└── metrics/ ── TTFT, TBT, E2E latency, PD breakdown
aicb/
├── aicb.py ── Main entry point for benchmark execution
├── workload_applyer.py ── Applies workloads to GPU cluster (runs collectives)
├── workload_generator/
│ ├── SimAI_inference_workload_generator.py ── Inference workload (prefill/decode)
│ ├── SimAI_training_workload_generator.py ── Training workload files
│ └── mocked_model/
│ ├── MockedModel.py ── Base class, InferencePhase enum (PREFILL/DECODE)
│ ├── inference/
│ │ ├── MockedDeepSeek.py ── DeepSeek-V3 MLA + MoE architecture mock
│ │ ├── MockedQwen3Moe.py ── Qwen3-MoE architecture mock
│ │ ├── MockedQwen3Next.py ── Qwen3-Next (hybrid attention) mock
│ │ ├── AiobDeepSeek.py ── GPU kernel profiling: FlashMLA, DeepGEMM FP8
│ │ ├── AiobQwen3Moe.py ── Qwen3-MoE GPU kernel profiling
│ │ └── AiobQwen3Next.py ── Qwen3-Next GPU kernel profiling
│ └── training/ ── DeepSpeed, Megatron, DeepSeek training mocks
├── utils/
│ ├── utils.py ── CommType, CommGroup, Strategy enums, compute cache
│ ├── deepgemm_utils.py ── FP8 GEMM (per_token/per_block quantization)
│ └── timer.py ── CudaEventTimer for GPU kernel timing
├── workload/
│ └── simAI/model_workload/ ── Pre-generated workload files (.txt)
├── scripts/inference_configs/
│ ├── deepseek_default.json ── DeepSeek-V3 model config
│ ├── qwen3_moe_default.json ── Qwen3-MoE model config
│ └── qwen3_next_default.json ── Qwen3-Next model config
└── log_analyzer/ ── Result analysis and plotting tools
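The reason `MockedModel.py` carries a PREFILL/DECODE enum is that the two phases have very different compute shapes: prefill attention cost scales with the full prompt length, while decode processes one query token against the KV cache. A rough back-of-the-envelope sketch of that asymmetry for a dense attention layer (the enum name comes from the tree above; the FLOPs formula and values are illustrative, not AICB's actual model):

```python
from enum import Enum

class InferencePhase(Enum):          # enum from MockedModel.py; member values assumed
    PREFILL = "prefill"
    DECODE = "decode"

def attention_flops(phase: InferencePhase, d_model: int,
                    seq_len: int, kv_len: int) -> int:
    """Approximate per-layer attention FLOPs (2 FLOPs per multiply-add).

    Prefill runs seq_len query tokens, decode runs exactly one --
    which is why the Aiob* profilers time the two phases separately.
    """
    q_tokens = seq_len if phase is InferencePhase.PREFILL else 1
    proj = 8 * q_tokens * d_model * d_model          # Q/K/V/O projection GEMMs
    scores = 4 * q_tokens * kv_len * d_model         # QK^T and attention*V
    return proj + scores

prefill = attention_flops(InferencePhase.PREFILL, 4096, 1024, 1024)
decode = attention_flops(InferencePhase.DECODE, 4096, 1024, 1024)
```

With a 1024-token prompt the prefill step here costs roughly 1000x the FLOPs of one decode step — the compute imbalance that motivates PD disaggregation in the first place.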
SimCCL's core logic currently lives inside the astra-sim-alibabacloud/astra-sim/system/MockNccl* files. The standalone SimCCL repo contains only documentation; the full implementation will be released separately.
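What that decomposition does, conceptually: turn one collective call into the point-to-point flows the network simulator can route. For the standard NCCL ring algorithm, an AllReduce over N ranks becomes 2(N-1) steps (reduce-scatter followed by all-gather), each rank sending a 1/N-sized chunk to its ring neighbor per step. A simplified sketch of the flow list (the real MockNcclGroup also handles Tree/NVLS, channels, and chunk offsets):

```python
def ring_allreduce_flows(ranks: int, nbytes: int) -> list[tuple[int, int, int, int]]:
    """Expand a ring AllReduce into (step, src, dst, bytes) P2P flows.

    2*(ranks-1) steps total: reduce-scatter then all-gather, each rank
    forwarding one 1/ranks-sized chunk to its next ring neighbor per step.
    """
    chunk = nbytes // ranks
    flows = []
    for step in range(2 * (ranks - 1)):
        for src in range(ranks):
            flows.append((step, src, (src + 1) % ranks, chunk))
    return flows

flows = ring_allreduce_flows(4, 4096)
```

Each rank ends up sending 2(N-1)/N of the buffer in total — e.g. 6144 of 4096 bytes... per rank across 6 steps for N=4 — and it is these per-step flows, not the abstract collective, that get handed to NS-3.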
astra-sim-alibabacloud/
├── CMakeLists.txt ── Top-level CMake config
├── build/
│ ├── simai_analytical/ ── Build config → bin/SimAI_analytical
│ ├── astra_ns3/ ── Build config → bin/SimAI_simulator
│ └── simai_phy/ ── Build config → bin/SimAI_phynet
├── astra-sim/
│ ├── system/ ── Core simulation (88 files)
│ │ ├── Sys.cc/.hh ── Main system class, event dispatch
│ │ ├── MockNcclGroup.cc/.h ── NCCL algo decomposition (Ring/Tree/NVLS)
│ │ ├── MockNcclChannel.cc/.h ── SingleFlow, ncclTree, ncclChannelNode
│ │ ├── MockNccl.h ── Algorithm IDs, base/hw latency tables
│ │ ├── MockNcclQps.h ── QPS tracking per connection
│ │ ├── SimAiFlowModelRdma.cc ── RDMA flow model
│ │ └── calbusbw.cc ── Bus bandwidth calculator
│ ├── network_frontend/
│ │ ├── analytical/ ── AnaSim: fast tick-based estimation
│ │ ├── ns3/ ── NS-3 integration (entry.h = main bridge)
│ │ └── phynet/ ── Physical RDMA traffic generation
│ └── workload/
│ ├── Workload.cc/.hh ── Workload parser (reads .txt workload files)
│ └── Layer.cc/.hh ── Single compute+comm layer representation
├── inputs/
│ ├── config/
│ │ └── SimAI.conf ── 60+ params: CC_MODE, PFC, ECN, monitoring
│ ├── topo/
│ │ └── gen_Topo_Template.py ── 5 topologies: Spectrum-X, HPN, DCN+
│ └── ratio/ ── NIC/NVLink/ATA performance ratio CSV
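The analytical backend (`SimAI_analytical` / AnaSim) trades packet-level fidelity for speed: instead of simulating every packet, it estimates collective time from a latency/bandwidth (alpha-beta) model, in the spirit of the base/hw latency tables in MockNccl.h. A hedged sketch for ring AllReduce — the formula and constants are illustrative, not AnaSim's exact model:

```python
def ring_allreduce_time_us(nbytes: int, ranks: int,
                           bw_gbps: float, step_latency_us: float) -> float:
    """Alpha-beta estimate of ring AllReduce completion time in microseconds.

    2*(ranks-1) serialized steps; each moves a 1/ranks chunk over the
    slowest link (bw_gbps in GB/s) plus a fixed per-step base latency.
    """
    steps = 2 * (ranks - 1)
    chunk_bytes = nbytes / ranks
    per_step_us = step_latency_us + chunk_bytes / (bw_gbps * 1e3)  # GB/s == bytes/ns
    return steps * per_step_us

# 1 GiB AllReduce across 8 ranks at 50 GB/s with 5 us per-step latency:
t = ring_allreduce_time_us(1 << 30, 8, 50.0, 5.0)
```

This is why the analytical mode runs in milliseconds while the NS-3 mode can take hours: the cost of the estimate is independent of message size and packet count.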
ns-3-alibabacloud/
├── simulation/src/point-to-point/model/ ── Core network models (37 files)
│ ├── rdma-hw.cc/.h ── RDMA NIC: QP management, CC algorithms
│ ├── rdma-queue-pair.cc/.h ── QP state: rate, window, seq tracking
│ ├── switch-node.cc/.h ── Switch: ECMP hash routing, ECN marking
│ ├── switch-mmu.cc/.h ── Switch MMU: buffer mgmt, PFC, RED/ECN
│ ├── qbb-net-device.cc/.h ── QBB NIC: PFC pause/resume, WRR scheduling
│ ├── nvswitch-node.cc/.h ── NVSwitch: intra-node NVLS routing
│ ├── int-header.h ── INT telemetry header (for HPCC)
│ └── pint.h ── Probabilistic INT (for HPCC-PINT)
├── simulation/src/network/utils/
│ └── custom-header.cc/.h ── Packet header with IP, port, seq, INT
├── analysis/ ── Post-simulation analysis tools
│ ├── fct_analysis.py ── Flow Completion Time analysis
│ ├── qlen_analysis.py ── Queue length analysis
│ ├── bw_analysis.py ── Bandwidth utilization analysis
│ └── qp_cnp_analysis.py ── CNP count per QP analysis
└── docs/images/ ── Network topology diagrams
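A concrete example of what `switch-mmu.cc` models: the RED-style ECN marking curve that RoCE congestion control (e.g. DCQCN) depends on. Below Kmin no packets are marked, between Kmin and Kmax the marking probability ramps linearly up to Pmax, and above Kmax every packet is marked. A sketch with the conventional DCQCN parameter names (the thresholds in the example are illustrative, not SimAI.conf defaults):

```python
def ecn_mark_probability(qlen_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """RED/ECN marking probability as a function of egress queue length.

    0 below Kmin, linear ramp to Pmax at Kmax, 1.0 above Kmax --
    the curve a RoCE switch applies per-packet on enqueue.
    """
    if qlen_kb <= kmin_kb:
        return 0.0
    if qlen_kb >= kmax_kb:
        return 1.0
    return pmax * (qlen_kb - kmin_kb) / (kmax_kb - kmin_kb)
```

Marked packets trigger CNPs back to the sender (which `qp_cnp_analysis.py` above counts), closing the congestion-control loop that determines the flow completion times NS-3 reports upstream.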
End-to-end data flow: AICB profiles GPU kernels and produces compute times → vidur-alibabacloud reads them via execution_time_predictor and generates workload .txt files → astra-sim parses the workload and decomposes collectives via MockNcclGroup (SimCCL) → the resulting P2P flows go to NS-3, which returns flow completion times → astra-sim aggregates them into communication time → vidur records the final metrics.
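The same flow as pseudocode, one call per component. Every function name below is illustrative shorthand for an entire tool, not a real API:

```
def simulate(cluster_config, model, trace):
    compute_times = aicb_profile(model)                    # ② real-GPU kernel timing
    workload = vidur_build_workload(trace, compute_times)  #    DES + workload .txt files
    flows = simccl_decompose(workload)                     # ③ collectives -> P2P flows
    fcts = ns3_simulate(flows, cluster_config)             # ⑤ packet-level FCTs
    comm_times = astra_sim_aggregate(fcts)                 # ④ per-layer comm time
    return vidur_metrics(compute_times, comm_times)        #    TTFT / TBT / E2E latency
```

Only the first step touches a GPU; once the kernel timings are cached, every downstream stage is pure CPU simulation, so sweeping cluster configurations is cheap.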