Communication Simulation: SimCCL + astra-sim + NS-3

From collective operations to packet-level network simulation — the full communication stack


Table of Contents

  1. Communication Stack Overview
  2. NCCL Algorithm Simulation (MockNcclGroup)
  3. SingleFlow Data Structure
  4. Group Topology
  5. astra-sim Three Backends
  6. P2P Transport Mechanics: RDMA, DCQCN, PFC, ECMP
  7. Network Topology Generation
  8. Supported Network Topologies — Deep Dive
  9. NS-3 Configuration
  10. Build System
  11. End-to-End Example

Page Organization

Layer 1: Collective Decomposition
  • §2 NCCL Algorithms (Ring/Tree/NVLS)
  • §3 SingleFlow Data Structure
  • §4 Group Topology (TP/DP/EP/PP)
Layer 2: Transport & Network
  • §5 astra-sim Backends
  • §6 RDMA QP · DCQCN/HPCC/TIMELY · ECN · PFC · ECMP · NACK · NVLS
Layer 3: Topology & Config
  • §7 Topology Generation · §8 Topology Deep Dives
  • §9 NS-3 Configuration (SimAI.conf)
  • §10 Build System · §11 E2E Example

Communication Stack Overview

SimAI's communication simulation covers the complete network stack, from high-level collective operations (AllReduce, AllGather) down to packet-level RDMA transport with DCQCN congestion control and PFC flow control. This page is organized into three layers:

  1. Collective Decomposition (Sections 2-4): how NCCL algorithms decompose collectives into point-to-point flows using Ring, Tree, and NVLS patterns.
  2. Transport & Network Mechanics (Sections 5-6): how flows traverse the network via RoCEv2 RDMA with DCQCN/HPCC/TIMELY congestion control, RED-based ECN marking, PFC back-pressure, and per-flow ECMP routing.
  3. Topology & Configuration (Sections 7-11): the five supported datacenter topologies (including Spectrum-X, AlibabaHPN, and DCN+), NS-3 configuration parameters, the build system, and an end-to-end usage example.

Communication simulation stack (diagram):

  • Layer 1, Collective Operations: high-level communication primitives from the training workload (AllReduce, AllGather, ReduceScatter, AllToAll)
  • Layer 2, SimCCL / MockNCCL: algorithm decomposition into point-to-point flows (Ring, Tree, NVLS, NVLS-Tree, CollNet)
  • Layer 3, astra-sim (system simulation): event-driven scheduling, flow dependency tracking, workload orchestration (Flow Scheduler, Dependency DAG, Workload Manager)
  • Layer 4, Network Backend: packet-level, analytical, or physical network simulation (NS-3 packet-level, Analytical busbw, Physical RDMA)

Each layer has a well-defined interface to the one below it. Collective operations produce FlowModels (a set of SingleFlow structs). astra-sim consumes these flows and dispatches them to the network backend. The backend reports completion events back to astra-sim, which then triggers dependent flows.

NCCL Algorithm Simulation (MockNcclGroup)

The MockNcclGroup class is the core of SimCCL. It faithfully replicates NCCL's algorithm selection and flow decomposition logic, translating high-level collective operations into point-to-point data flows that can be simulated by the network backend.

2.1 Supported Algorithms

astra-sim/workload/SimCCL/MockNcclGroup.h SimCCL

SimCCL mirrors the six NCCL algorithm types. Each algorithm defines a different communication pattern for moving data between GPUs:

// NCCL algorithm definitions mirrored in SimCCL
#define NCCL_ALGO_TREE            0   // Tree reduction (hierarchical)
#define NCCL_ALGO_RING            1   // Ring (bandwidth-optimal)
#define NCCL_ALGO_COLLNET_DIRECT  2   // Direct CollNet (switch-assisted)
#define NCCL_ALGO_COLLNET_CHAIN   3   // Chain CollNet (pipelined)
#define NCCL_ALGO_NVLS            4   // NVLink SHARP via NVSwitch (Hopper+)
#define NCCL_ALGO_NVLS_TREE       5   // NVLS + Tree hybrid

#define NCCL_NUM_ALGORITHMS       6
#define NCCL_NUM_PROTOCOLS        3   // LL, LL128, Simple

2.2 Algorithm Selection Logic

Algorithm selection in SimCCL is GPU-type-aware. The system considers the GPU architecture (A100/A800, H100/H800), the group type (TP, DP, PP, EP), the number of ranks, and whether NVLS is enabled. This mirrors NCCL's real-world decision tree:

astra-sim/workload/SimCCL/MockNcclGroup.cc SimCCL
ncclInfo* MockNcclGroup::get_algo_proto_info(
    GroupType type, int rank, ComType op, uint64_t data_size) {

    ncclInfo* info = new ncclInfo();
    info->nBytes = data_size;
    info->op = op;

    if (op == All_Reduce && type == TP) {
        if (gpu_type == H100 || gpu_type == H800) {
            if (nRanks >= 8 && NVLSenable)
                info->algorithm = NCCL_ALGO_NVLS;  // Hopper favors NVLS
            else
                info->algorithm = NCCL_ALGO_RING;
        } else if (gpu_type == A100 || gpu_type == A800) {
            info->algorithm = NCCL_ALGO_RING;   // Ampere uses Ring
        }
    } else if (op == All_Reduce && type == DP) {
        info->algorithm = NCCL_ALGO_RING;       // DP always Ring
    } else if (op == All_to_All) {
        info->algorithm = NCCL_ALGO_RING;       // AllToAll uses Ring
    } else if (op == All_Gather) {
        if (type == TP && NVLSenable && nRanks >= 8)
            info->algorithm = NCCL_ALGO_NVLS_TREE;
        else
            info->algorithm = NCCL_ALGO_RING;
    }

    return info;
}
Key Insight: Algorithm selection is GPU-type-aware: A100 uses Ring, H100 favors NVLS for TP groups with 8+ ranks. This distinction can lead to 2-3x performance differences in large-scale training simulations.

2.3 Ring AllReduce Decomposition

Ring AllReduce is the most common collective algorithm. It operates in two phases: ReduceScatter (each GPU sends 1/N of its data around the ring, reducing as it goes, until each GPU holds one fully reduced chunk) and AllGather (each GPU forwards the reduced chunks around the ring until every GPU holds all of them). Each phase takes N-1 steps, so N GPUs need 2*(N-1) steps in total.

[Figure: Ring AllReduce for 4 GPUs. Phase 1, ReduceScatter (steps 1-3): each GPU sends 1/4 of its data around the ring, reducing at each hop; afterwards each GPU holds 1/4 of the fully reduced data. Phase 2, AllGather (steps 4-6): each GPU passes its reduced chunk around the ring; afterwards all 4 GPUs hold the complete reduced data. Total: 2*(4-1) = 6 steps.]
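The two phases can be traced with a small schedule generator. This is an illustrative sketch, not SimCCL's code: `RingStep`, `ringAllReduceSchedule`, and `ringChunkBytes` are hypothetical names, and it assumes a single ring in which rank r always sends to rank (r+1) mod N.

```cpp
#include <cstdint>
#include <vector>

// One send performed by one rank in one step (hypothetical helper type).
struct RingStep {
    int step;      // 0 .. 2*(N-1)-1
    int src, dst;  // sender and its ring successor
    int chunk;     // index of the 1/N chunk being sent
    bool reduce;   // true during ReduceScatter, false during AllGather
};

// Enumerates every send of a ring AllReduce over n ranks.
std::vector<RingStep> ringAllReduceSchedule(int n) {
    std::vector<RingStep> out;
    for (int step = 0; step < 2 * (n - 1); ++step) {
        bool reduce = step < n - 1;
        for (int r = 0; r < n; ++r) {
            // ReduceScatter step s: rank r sends chunk (r - s) mod n.
            // AllGather step s' = step - (n-1): rank r sends chunk (r + 1 - s') mod n,
            // starting with the chunk it fully reduced, (r + 1) mod n.
            int raw = reduce ? (r - step) : (r + 1 - (step - (n - 1)));
            int chunk = ((raw % n) + n) % n;
            out.push_back({step, r, (r + 1) % n, chunk, reduce});
        }
    }
    return out;
}

// Per-flow payload when the buffer is split across ranks and channels.
uint64_t ringChunkBytes(uint64_t dataBytes, int nRanks, int nChannels) {
    return dataBytes / nRanks / nChannels;
}
```

For 4 ranks the schedule contains 6 steps of 4 sends each, and a 64 MiB buffer on 8 ranks with one channel yields 8 MiB per flow.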
astra-sim/workload/SimCCL/MockNcclGroup.cc SimCCL

Ring AllReduce Flow Generation

std::map<int, shared_ptr<FlowModels>>
MockNcclGroup::genAllReduceRingFlowModels(
    GroupType type, int rank, uint64_t data_size) {

    int nranks = gp_info.nRanks;
    int chunkcount = 2 * (nranks - 1);  // reduce + broadcast phases
    chunksize = data_size / nranks / ringchannels.size();

    // Phase 1: ReduceScatter
    for (int step = 0; step < nranks - 1; step++) {
        for (auto& ring : ringchannels) {
            int src_rank = ring[(ring_idx + nranks - step) % nranks];
            int dest_rank = ring[(ring_idx + nranks - step + 1) % nranks];

            tmp_result = SingleFlow(
                flow_id, src_rank, dest_rank, chunksize,
                {prev_flow_id},      // depends on previous step
                {},                   // no parallel deps
                {child_flow_id},      // next step depends on this
                ring_id, chunk_id,
                chunkcount, "RING");
        }
    }

    // Phase 2: AllGather
    for (int step = 0; step < nranks - 1; step++) {
        for (auto& ring : ringchannels) {
            // Similar flow generation with broadcast semantics
            tmp_result = SingleFlow(
                flow_id, src_rank, dest_rank, chunksize,
                {prev_flow_id}, {}, {child_flow_id},
                ring_id, chunk_id, chunkcount, "RING");
        }
    }

    return flow_models;
}
Ring Complexity: Ring AllReduce takes 2*(N-1) steps for N GPUs, whereas NVLS on Hopper reduces this to O(1) steps using NVSwitch. For 8 GPUs, Ring needs 14 steps while NVLS needs only a single multicast + reduce operation through the NVSwitch fabric.

2.4 NVLS AllReduce (Hopper / Blackwell)

NVLS (NVLink SHARP) leverages NVSwitch, which provides all-to-all NVLink connectivity within a node, to perform reductions and multicast inside the switch. Instead of passing data around a ring, all GPUs can simultaneously access a shared memory region through the NVSwitch. The NVLS-Tree variant combines NVSwitch-based intra-node communication with a tree reduction for inter-node communication.

astra-sim/workload/SimCCL/MockNcclGroup.cc SimCCL

NVLS-Tree AllReduce

shared_ptr<FlowModels>
MockNcclGroup::genallReduceNVLSTreeFlowModels(
    GroupType type, int rank, uint64_t data_size) {

    // Step 1: NVLS Reduce (intra-node via NVSwitch)
    // All GPUs within a node reduce their data through
    // the NVSwitch shared memory multicast
    generate_flow_model_nvls_tree_allreduce_up(
        rank, data_size, flow_models);

    // Step 2: Tree Reduce (inter-node via NIC)
    // One representative GPU per node participates in
    // a tree reduction across nodes

    // Step 3: Tree Broadcast (inter-node via NIC)
    // Root broadcasts the fully reduced result down the tree

    // Step 4: NVLS Broadcast (intra-node via NVSwitch)
    // Representative GPU shares the result with all node-local GPUs
    generate_flow_model_nvls_tree_allreduce_down(
        rank, data_size, flow_models);

    return flow_models;
}

// NVLS uses multicast addressing through NVSwitch
// Each GPU writes to a shared buffer that is atomically
// reduced by the NVSwitch hardware itself
void MockNcclGroup::generate_flow_model_nvls_tree_allreduce_up(
    int rank, uint64_t data_size,
    shared_ptr<FlowModels> flow_models) {

    // NVLS multicast: each GPU sends to NVSwitch
    for (auto& nvswitch : gp_info.NVSwitchs) {
        SingleFlow flow(flow_id, rank, nvswitch,
            data_size / nRanks,
            {}, {}, {child_ids},
            0, chunk_id, chunk_count, "NVLS");
        flow_models->add(flow);
    }
}

2.5 Hardware Latency Parameters

SimCCL models hardware-specific latencies for each algorithm and protocol combination. These base latency values are added to the data transfer time to account for protocol overhead, synchronization, and memory copies:

astra-sim/workload/SimCCL/MockNcclGroup.cc SimCCL
// Base latency (microseconds) per algorithm per protocol
// Protocols: [LL, LL128, Simple]
static const float baseLat[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS] = {
    {6.8,  14.0, 0},       // Tree
    {6.6,  14.0, 8.4},     // Ring
    {0,    0,    0},       // CollNet Direct
    {0,    0,    0},       // CollNet Chain
    {0,    0,    23.0},    // NVLS
    {0,    0,    0},       // NVLS-Tree
};

// Per-path latencies: [NVLINK, PCI, NET]
// Each indexed by [path][algorithm][protocol]
static float hwLat[3][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS] = {
    // NVLINK path latencies
    { {0.6, 1.25, 28.0},   // Tree over NVLINK
      {0.6, 1.9,  3.4},    // Ring over NVLINK
      {0,   0,    0},      // CollNet Direct
      {0,   0,    0},      // CollNet Chain
      {0,   0,    0},      // NVLS
      {0,   0,    0} },    // NVLS-Tree

    // PCI path latencies
    { {1.0, 1.0,  0},      // Tree over PCI
      {1.0, 1.0,  0},      // Ring over PCI
      ... },

    // Network (NIC) path latencies
    { {5.0, 8.5,  0},      // Tree over NET
      {2.7, 5.5,  0},      // Ring over NET
      ... },
};

SingleFlow Data Structure

The SingleFlow struct is the fundamental unit of communication in SimAI. Every collective operation is decomposed into a directed acyclic graph (DAG) of SingleFlows. Each flow represents a point-to-point data transfer between two ranks, with explicit dependency tracking via prev and child_flow_id fields.

astra-sim/workload/SimCCL/FlowModels.h SimCCL
struct SingleFlow {
    int flow_id;                  // Unique identifier for this flow
    int src, dest;                // Source and destination rank
    uint64_t flow_size;           // Data size in bytes

    vector<int> prev;            // Dependency flows (must complete first)
    vector<int> parallel_deps;   // Flows that run in parallel
    vector<int> child_flow_id;   // Downstream flows (triggered on completion)

    int channel_id;               // Ring/tree channel index
    int chunk_id;                 // Which chunk of the collective
    int chunk_count;              // Total chunks in this collective
    string conn_type;             // "RING", "TREE", "NVLS", "PXN_INIT"

    // Constructor
    SingleFlow(int id, int s, int d, uint64_t size,
              vector<int> p, vector<int> par,
              vector<int> child,
              int ch, int ck, int cc,
              string ct)
        : flow_id(id), src(s), dest(d), flow_size(size),
          prev(p), parallel_deps(par), child_flow_id(child),
          channel_id(ch), chunk_id(ck), chunk_count(cc),
          conn_type(ct) {}
};

The flow DAG ensures correct execution ordering. For a Ring AllReduce with 4 GPUs, the prev field chains flows so that step 2 of the ReduceScatter cannot begin until step 1 completes. The child_flow_id enables the simulation engine to eagerly dispatch dependent flows once a flow finishes.

  1. Flow Creation: MockNcclGroup decomposes the collective into SingleFlows with explicit dependencies.
  2. Flow Scheduling: astra-sim checks prev dependencies and dispatches flows whose dependencies are all satisfied.
  3. Network Transfer: the backend (NS-3 / Analytical) simulates the point-to-point transfer.
  4. Completion Callback: the backend notifies astra-sim; child_flow_id flows are checked and potentially dispatched.
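The dispatch cycle can be sketched as a dependency counter per flow. This is a simplified illustration rather than astra-sim's scheduler: `MiniFlow` and `dispatchOrder` are hypothetical names, every transfer is assumed to take the same time, and each flow's `child_flow_id` list is assumed to be consistent with its children's `prev` lists.

```cpp
#include <map>
#include <queue>
#include <vector>

// Reduced stand-in for SingleFlow: only the fields needed for ordering.
struct MiniFlow {
    int flow_id;
    std::vector<int> prev;           // flows that must finish first
    std::vector<int> child_flow_id;  // flows to re-check on completion
};

// Returns the order in which flows complete when ready flows are served
// first-come, first-served.
std::vector<int> dispatchOrder(const std::vector<MiniFlow>& flows) {
    std::map<int, int> pending;            // flow_id -> unfinished dependencies
    std::map<int, const MiniFlow*> byId;
    std::queue<int> ready;
    for (const auto& f : flows) {
        pending[f.flow_id] = static_cast<int>(f.prev.size());
        byId[f.flow_id] = &f;
        if (f.prev.empty()) ready.push(f.flow_id);  // no deps: dispatch at once
    }
    std::vector<int> order;
    while (!ready.empty()) {
        int id = ready.front();
        ready.pop();
        order.push_back(id);  // stands in for the backend's completion callback
        for (int child : byId[id]->child_flow_id)
            if (--pending[child] == 0) ready.push(child);
    }
    return order;
}
```

A three-flow chain completes in order 0, 1, 2; in a diamond (0 feeds 1 and 2, which both feed 3), flow 3 is dispatched only after both parents have finished.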

Group Topology

SimAI organizes GPU ranks into communication groups based on parallelism strategy. Each GroupInfo struct describes the membership, topology, and NVSwitch connectivity of a group. The GroupType enum determines which collective algorithm is used:

astra-sim/workload/SimCCL/MockNcclGroup.h SimCCL
enum GroupType { TP, DP, PP, EP, DP_EP, NONE };

struct GroupInfo {
    GroupType type;            // Parallelism type
    int nNodes;                // Number of nodes in this group
    int nRanks;                // Number of GPU ranks
    vector<int> Ranks;        // List of rank IDs
    vector<int> NVSwitchs;     // NVSwitch IDs (if applicable)
};

// Example: 8-GPU TP group within one node
// GroupInfo { TP, 1, 8, {0,1,2,3,4,5,6,7}, {8,9,10,11} }
//   - 8 ranks on 1 node
//   - 4 NVSwitches (H100 SXM)

GroupType Communication Patterns

GroupType | Typical Op | Algorithm   | Description
TP        | AllReduce  | Ring / NVLS | Tensor parallel gradient sync within a node; NVLS preferred on H100+ with 8+ ranks
DP        | AllReduce  | Ring        | Data parallel gradient sync across nodes; bandwidth-bound over NIC
EP        | AllToAll   | Ring        | Expert parallel token dispatch for MoE models; each GPU sends to all experts
PP        | SendRecv   | P2P         | Pipeline parallel activation transfer between stages; latency-sensitive
DP_EP     | AllReduce  | Ring        | Combined data-parallel + expert-parallel gradient sync

The group topology determines how MockNcclGroup generates ring or tree channels. For a TP group confined to one node, rings are constructed using the NVLink topology (e.g., rail-optimized patterns). For DP groups spanning multiple nodes, rings traverse NIC links and the network fabric.

astra-sim Three Backends

astra-sim supports three different network backends, each offering a different trade-off between simulation speed and fidelity. Users choose the backend at build time, and the same workload description works with all three.

5.1 Analytical Mode (AnaSim)

astra-sim/network/analytical/AnaSim.hh astra-sim

The analytical backend uses a simple event queue with bus bandwidth estimates to compute transfer times. It runs in seconds but does not model congestion, queueing, or flow contention:

class AnaSim {
    static queue<struct CallTask> call_list;
    static uint64_t tick;

    static void Run() {
        while (!call_list.empty()) {
            CallTask task = call_list.front();
            while (tick != task.time) tick++;
            call_list.pop();
            task.fun_ptr(task.fun_arg);  // Execute callback
        }
    }

    static void Schedule(
        uint64_t delay,              // Computed from busbw config
        void (*fun_ptr)(void*),      // Completion callback
        void* fun_arg) {
        call_list.push({tick + delay, fun_ptr, fun_arg});
    }
};

// Transfer time = data_size / busbw + base_latency
// No congestion, no contention, no queue dynamics
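The schedule-then-run pattern above can also be written with a time-ordered priority queue, which lets the clock jump directly to the next event and tolerates events scheduled out of order. The sketch below is an assumption-laden alternative, not AnaSim's implementation; `MiniSim` and its members are hypothetical names.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

class MiniSim {
public:
    using Callback = std::function<void()>;

    // Schedule cb to fire `delay` ticks after the current time.
    void schedule(uint64_t delay, Callback cb) {
        events_.push(Event{now_ + delay, seq_++, std::move(cb)});
    }

    // Pop events in time order, moving the clock straight to each one.
    void run() {
        while (!events_.empty()) {
            Event ev = events_.top();
            events_.pop();
            now_ = ev.time;
            ev.cb();  // a callback may schedule further events
        }
    }

    uint64_t now() const { return now_; }

private:
    struct Event {
        uint64_t time;
        uint64_t seq;  // tie-breaker: same-time events fire in FIFO order
        Callback cb;
        bool operator>(const Event& o) const {
            return time != o.time ? time > o.time : seq > o.seq;
        }
    };
    std::priority_queue<Event, std::vector<Event>, std::greater<Event>> events_;
    uint64_t now_ = 0;
    uint64_t seq_ = 0;
};
```

In an analytical backend of this shape, the delay passed to `schedule` would be the estimated transfer time, and the callback would mark the flow complete.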

5.2 NS-3 Simulation Mode

NS-3

The NS-3 backend provides full packet-level simulation with realistic network behavior. This is the highest-fidelity software simulation mode, modeling every aspect of the network fabric:

  • RDMA / QBB protocol — Emulates RDMA over Converged Ethernet (RoCEv2)
  • Congestion control (QCN / PFC) — Quantized Congestion Notification and Priority Flow Control
  • Adaptive routing — Models ECMP and adaptive load balancing across paths
  • Realistic queue dynamics — Switch buffer occupancy, tail drops, ECN marking
  • Multi-path effects — Incast, hash collisions, flow-level fairness
Fidelity Matters: NS-3 models real congestion effects (QCN/PFC) that analytical mode misses — critical for large-scale accuracy. At 1000+ GPU scale, network contention can add 20-40% overhead that only packet-level simulation captures.

5.3 Physical Mode (Beta)

astra-sim/network/physical/PhyNetSim.cc astra-sim

The physical backend generates actual RDMA traffic on real hardware. This provides the highest fidelity validation but requires an actual RDMA cluster:

void send_flow(int src, int dst, uint64_t maxPacketCount,
    void (*msg_handler)(void*), void* fun_arg,
    int tag, sim_request* request) {

    ncclFlowTag flowtag = request->flowTag;
    TransportData send_data = TransportData(
        flowtag.channel_id,
        flowtag.chunk_id,
        flowtag.chunk_count,
        flowtag.current_flow_id,
        flowtag.child_flow_id,
        flowtag.tree_flow_list,
        flowtag.data_size);

    // Serialize and send via actual RDMA verbs
    rdma_transport->post_send(
        dst, &send_data, sizeof(send_data),
        tag, msg_handler, fun_arg);
}

Backend Comparison

Mode       | Speed          | Fidelity | Hardware           | Use Case
Analytical | ~seconds       | Low      | CPU only           | Quick estimation, parameter sweeps, early design exploration
NS-3 Sim   | ~minutes-hours | High     | CPU (multi-thread) | Network topology design, congestion analysis, QoS tuning
Physical   | Real-time      | Highest  | RDMA cluster       | Validation against real hardware, final design sign-off

P2P Transport Mechanics: How KV Cache Actually Travels

When SimAI simulates a KV cache transfer from a prefill node to a decode node, the NS-3 backend doesn't just estimate latency — it simulates the full RoCEv2 RDMA transport stack at packet granularity. This section traces the exact mechanisms: how RDMA Queue Pairs are created, how DCQCN adjusts sending rates on congestion, how PFC prevents buffer overflow, and how ECMP distributes flows across equal-cost paths.

KV Cache P2P Transfer: Where Each Mechanism Activates
[Diagram: path of a KV cache transfer from the prefill GPU through the sender NIC, switch fabric, and receiver NIC to the decode GPU, including the sender-side feedback loop. Each stage is annotated with the subsection (6.1 to 6.14) that covers it.]
  1. §6.1 Prefill GPU → QP Creation: RdmaHw.AddQueuePair(size=KV_cache_bytes, pg=3)
  2. §6.10/§6.12/§6.13 Sender NIC Gates: Window check (IsWinBound) → Multi-QP split → WRR NIC scheduling
  3. §6.3/§6.4/§6.5 Switch Fabric: ECN marking (RED) → PFC pause check → ECMP hash routing → PSW if cross-rail
  4. §6.9 NVLS Check: Same server? → route via NVSwitch (2880 Gbps) instead of NIC
  5. §6.8/§6.6 Receiver NIC: Seq check → ACK or NACK (protocol 0xFD) → FCT tracking (actual vs ideal)
  6. §6.2/§6.11/§6.10 Feedback Loop: CNP → DCQCN rate decrease (α×EWMA) | or HPCC/TIMELY/DCTCP | Window adjust
  7. §6.6/§6.14 Complete: qp_finish() → FCT recorded → monitoring output → callback to astra-sim → Decode GPU

6.1 RDMA Queue Pair (QP) Lifecycle

Every P2P flow in SimAI becomes one or more RDMA Queue Pairs. The call chain starts from astra-sim's collective decomposition and ends at NS-3's packet-level simulation:

  1. MockNcclGroup decomposes collective → SingleFlow
     e.g., AllReduce(64MB, 8 GPUs) → 14 SingleFlow objects (2×(N-1) for ring)
  2. astra-sim calls sim_send(src, dst, bytes)
     Each SingleFlow triggers a sim_send() to the NS-3 backend with flow_size bytes.
  3. NS-3 creates RdmaClient → RdmaDriver.AddQueuePair()
     A QP is created with src/dst IP, port, priority group, window size (BDP), and base RTT. The QP starts sending packets at line rate.
  4. Packets traverse switches → ECN marking → ACK/CNP
     9000-byte jumbo frames traverse the topology. Switches check queue depth, mark ECN if congested, and trigger PFC if the buffer is critically full. The receiver sends an ACK or generates a CNP.
  5. All bytes ACKed → qp_finish() → callback to astra-sim
     FCT (Flow Completion Time) is recorded. astra-sim's event system proceeds to the next dependent operation.

ns-3-alibabacloud
simulation/src/point-to-point/model/rdma-queue-pair.h
class RdmaQueuePair : public Object {
    Ipv4Address sip, dip;             // Source & destination IP
    uint16_t sport, dport;             // Source & dest port (ECMP hash input)
    uint64_t m_size;                    // Total bytes to transfer (= KV cache size)
    uint16_t m_pg;                      // Priority group (queue index for PFC)
    uint64_t snd_nxt, snd_una;          // Next seq to send, unacked seq
    uint32_t m_win;                     // Window size (BDP-based)
    DataRate m_rate;                    // Current sending rate (DCQCN-controlled)
    uint64_t m_baseRtt;                 // Base RTT for this src-dst pair
};

6.2 DCQCN Congestion Control (CC_MODE=1)

SimAI's default congestion control is DCQCN (Data Center QCN), Mellanox's adaptation of QCN for RoCEv2. It operates in three phases: rate decrease on CNP, alpha tracking via EWMA, and rate recovery via additive/hyper-additive increase.

CC_MODE options: 1 = DCQCN (Mellanox, default), 3 = HPCC (High Precision CC), 7 = TIMELY (RTT-based), 8 = DCTCP, 10 = HPCC-PINT (with INT sampling). All implemented in rdma-hw.cc.
ns-3-alibabacloud
simulation/src/point-to-point/model/rdma-hw.h — DCQCN parameters
// DCQCN rate control state (per QP)
double m_g;                    // EWMA gain = 0.00390625 (1/256)
double m_rateOnFirstCNP;       // Rate fraction on first CNP (1.0 = full)
double m_rpgTimeReset;         // Rate increase timer = 900 μs
double m_rateDecreaseInterval;  // Rate decrease check interval = 4 μs
double m_alpha_resume_interval; // Alpha EWMA update interval = 1 μs
DataRate m_rai;                // Additive increase = 50 Mb/s
DataRate m_rhai;               // Hyper-additive increase = 100 Mb/s
uint32_t m_rpgThreshold;       // Fast recovery rounds before additive increase = 1

Rate Decrease (on CNP received)

When a CNP (Congestion Notification Packet) arrives at the sender, DCQCN reduces the rate multiplicatively. The key formula:

DCQCN Rate Decrease Formula

// On each rate_decrease_interval (4 μs), if CNP was received:
target_rate = rate                     // remember the pre-cut rate; recovery climbs back toward it
rate = max(rate × (1 - alpha / 2), MIN_RATE)

// Alpha is updated via EWMA every alpha_resume_interval (1 μs):
// If CNP received in this interval:
alpha = (1 - m_g) × alpha + m_g        // α increases toward 1
// If no CNP:
alpha = (1 - m_g) × alpha              // α decays toward 0

// With m_g = 1/256, alpha converges slowly — providing stability

Rate Increase (recovery)

DCQCN Rate Increase — Three Phases

// Every RP_TIMER (900 μs), if no CNP received:
// Phase 1: Fast Recovery (first m_rpgThreshold rounds), target_rate unchanged
rate = (rate + target_rate) / 2

// Phase 2: Additive Increase
target_rate = target_rate + m_rai      // +50 Mb/s per interval
rate = (rate + target_rate) / 2

// Phase 3: Hyper-Additive Increase (after sustained no-congestion)
target_rate = target_rate + m_rhai     // +100 Mb/s per interval
rate = (rate + target_rate) / 2

6.3 ECN Marking at Switches (RED-based)

Switches mark packets with ECN (setting the CE codepoint in the ECN field of the IPv4 header) using probabilistic RED (Random Early Detection). The marking probability increases linearly from 0 at kmin to pmax at kmax; above kmax every packet is marked:

ns-3-alibabacloud
simulation/src/point-to-point/model/switch-mmu.cc — ShouldSendCN()
bool SwitchMmu::ShouldSendCN(uint32_t ifindex, uint32_t qIndex) {
    if (qIndex == 0) return false;          // Queue 0 = highest priority, never marked
    if (egress_bytes[ifindex][qIndex] > kmax[ifindex])
        return true;                           // Above kmax: always mark (100%)
    if (egress_bytes[ifindex][qIndex] > kmin[ifindex]) {
        // Between kmin and kmax: linear probability
        double p = pmax[ifindex]
                 * (double)(egress_bytes[ifindex][qIndex] - kmin[ifindex])
                 / (kmax[ifindex] - kmin[ifindex]);
        if (UniformVariable(0, 1).GetValue() < p)
            return true;                       // Probabilistic mark
    }
    return false;                              // Below kmin: never mark
}

The kmin/kmax/pmax values are rate-dependent, configured per link speed in SimAI.conf:

Link Speed | kmin (KB) | kmax (KB) | pmax
25 Gbps    | 100       | 400       | 0.2
100 Gbps   | 400       | 1600      | 0.2
200 Gbps   | 300       | 1200      | 0.8
400 Gbps   | 800       | 3200      | 0.2

6.4 PFC (Priority Flow Control) — Layer 2 Back-Pressure

PFC operates at Layer 2, independently from DCQCN. When a switch's ingress buffer fills beyond the threshold, it sends a PAUSE frame to the upstream sender, halting transmission on that priority queue. This prevents packet loss but can cause head-of-line blocking and PFC storms.

ns-3-alibabacloud
simulation/src/point-to-point/model/switch-mmu.cc — PFC PAUSE/RESUME logic
bool SwitchMmu::CheckShouldPause(uint32_t port, uint32_t qIndex) {
    return !paused[port][qIndex] &&
           (hdrm_bytes[port][qIndex] > 0 ||                  // Headroom occupied
            GetSharedUsed(port, qIndex) >= GetPfcThreshold(port));  // Shared buffer full
}

bool SwitchMmu::CheckShouldResume(uint32_t port, uint32_t qIndex) {
    if (!paused[port][qIndex]) return false;
    return hdrm_bytes[port][qIndex] == 0 &&                 // Headroom drained
           (GetSharedUsed(port,qIndex) == 0 ||
            GetSharedUsed(port,qIndex) + resume_offset       // 3 KB hysteresis
                <= GetPfcThreshold(port));
}

// Switch buffer: 12 MB total (static) or 32 MB (SimAI.conf BUFFER_SIZE)
// Dynamic threshold: USE_DYNAMIC_PFC_THRESHOLD = 1 (enabled by default)
// 8 priority queues (qCnt=8), queue 0 = highest (ACK/NACK/CNP bypass PFC)
PFC vs DCQCN (Two Layers of Defense): DCQCN (Layer 4) proactively reduces the sending rate when switches mark ECN; it is the fine-grained rate control. PFC (Layer 2) is the last resort that pauses the link when buffers are critically full. In a well-tuned system, DCQCN should react fast enough that PFC rarely triggers. SimAI tracks PFC events in PFC_OUTPUT_FILE; high PFC counts indicate that DCQCN parameters need tuning (e.g., a lower kmin or a shorter RATE_DECREASE_INTERVAL).
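The pause/resume conditions form a hysteresis loop, which a single-queue model makes easy to see. This is a minimal sketch with a fixed threshold; `PfcQueue` and `update` are hypothetical names, and the dynamic threshold is deliberately left out.

```cpp
#include <cstdint>

struct PfcQueue {
    uint64_t sharedUsed = 0;    // bytes of shared buffer in use
    uint64_t hdrmUsed = 0;      // bytes of headroom in use
    uint64_t threshold = 0;     // pause threshold, bytes (fixed here)
    uint64_t resumeOffset = 0;  // hysteresis, bytes
    bool paused = false;

    bool shouldPause() const {
        return !paused && (hdrmUsed > 0 || sharedUsed >= threshold);
    }

    bool shouldResume() const {
        if (!paused) return false;
        return hdrmUsed == 0 &&
               (sharedUsed == 0 || sharedUsed + resumeOffset <= threshold);
    }

    // Re-evaluate after occupancy changes.
    // Returns +1 if a PAUSE would be sent, -1 for a RESUME, 0 otherwise.
    int update() {
        if (shouldPause())  { paused = true;  return +1; }
        if (shouldResume()) { paused = false; return -1; }
        return 0;
    }
};
```

Because resume requires occupancy to fall `resumeOffset` bytes below the threshold, a queue hovering right at the threshold does not flap between PAUSE and RESUME.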

6.5 ECMP Routing (Per-Flow, MurmurHash3)

SimAI uses per-flow ECMP routing at every switch. When a packet arrives at a switch with multiple equal-cost next hops, the switch hashes the flow's 5-tuple to deterministically pick a path:

ns-3-alibabacloud
simulation/src/point-to-point/model/switch-node.cc — ECMP hash
int SwitchNode::GetOutDev(Ptr<Packet> p, CustomHeader &ch) {
    // Hash input (12 bytes): src_ip(4B) + dst_ip(4B) + src_port(2B) + dst_port(2B)
    union { uint8_t u8[12]; uint32_t u32[3]; } buf;
    buf.u32[0] = ch.sip;    buf.u32[1] = ch.dip;
    buf.u32[2] = ch.sport | ((uint32_t)ch.dport << 16);

    // MurmurHash3 with per-switch seed (= node ID)
    uint32_t idx = EcmpHash(buf.u8, 12, m_ecmpSeed) % nexthops.size();
    return nexthops[idx];     // Deterministic path for this flow
}
ECMP and PD Disaggregation: In PD disaggregation, KV cache transfers are large, long-lived flows. Per-flow ECMP means each KV transfer is locked to a single path for its entire duration — it cannot spread across multiple links. This creates hash collision risk: if two large KV transfers hash to the same ASW→PSW link, they compete for bandwidth while other links sit idle. This is a known limitation of per-flow ECMP for elephant flows, and is why rail-optimized topologies (where same-rail traffic avoids PSW entirely) are preferred for TP-heavy workloads.
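Per-flow path pinning follows directly from hashing a fixed key. The sketch below uses FNV-1a as a stand-in for MurmurHash3, since only determinism matters for the illustration; `FlowKey`, `flowHash`, and `pickNextHop` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

struct FlowKey {
    uint32_t sip, dip;
    uint16_t sport, dport;
};

// FNV-1a over the 12 key bytes, mixed with a per-switch seed.
// Stand-in for MurmurHash3: a different function, same role.
uint32_t flowHash(const FlowKey& k, uint32_t seed) {
    const uint32_t words[3] = {
        k.sip, k.dip,
        static_cast<uint32_t>(k.sport) | (static_cast<uint32_t>(k.dport) << 16)};
    uint32_t h = 2166136261u ^ seed;
    for (uint32_t w : words) {
        for (int i = 0; i < 4; ++i) {
            h ^= (w >> (8 * i)) & 0xFFu;
            h *= 16777619u;
        }
    }
    return h;
}

// Index of the next hop this flow uses at a switch with the given seed.
size_t pickNextHop(const FlowKey& k, uint32_t seed, size_t numNextHops) {
    return flowHash(k, seed) % numNextHops;
}
```

Every packet of a flow carries the same key, so it always maps to the same index at a given switch. Two distinct flows can map to the same index as well, which is the collision case described above.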

6.6 Flow Completion Time (FCT) Tracking

ns-3-alibabacloud
astra-sim/network_frontend/ns3/entry.h — qp_finish()
void qp_finish(FILE *fout, Ptr<RdmaQueuePair> q) {
    uint64_t standalone_fct = base_rtt
        + total_bytes * 8000000000lu / bandwidth;  // Ideal FCT (no congestion)

    fprintf(fout, "%08x %08x %u %u %lu %lu %lu %lu\n",
        q->sip, q->dip,             // Source & dest IP (hex)
        q->sport, q->dport,          // Ports
        q->m_size,                    // Data bytes (= KV cache size)
        q->startTime,                 // Start time (ns)
        Simulator::Now() - q->startTime,  // Actual FCT (ns)
        standalone_fct);              // Ideal FCT (ns, zero-congestion)
}
// FCT ratio = actual_fct / ideal_fct → measures congestion impact
// For PD disagg: this directly determines pd_p2p_comm_time
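The ideal-FCT arithmetic and the resulting slowdown ratio can be checked in isolation. The sketch below reuses the expression from the excerpt; `idealFctNs` and `fctSlowdown` are hypothetical helper names, and the 8 µs base RTT in the example is an assumed value.

```cpp
#include <cstdint>

// Ideal completion time in ns: base RTT plus serialization at full link rate.
// bytes * 8e9 must fit in uint64_t, which limits this form to roughly
// 2.3 GB per flow before the multiplication overflows.
uint64_t idealFctNs(uint64_t bytes, uint64_t bandwidthBps, uint64_t baseRttNs) {
    return baseRttNs + bytes * 8000000000ULL / bandwidthBps;
}

// Slowdown = actual FCT / ideal FCT; 1.0 means no congestion penalty.
double fctSlowdown(uint64_t actualFctNs, uint64_t idealNs) {
    return static_cast<double>(actualFctNs) / static_cast<double>(idealNs);
}
```

For a 320 MB transfer over a 400 Gb/s link, serialization takes 6.4 ms, so the ideal FCT with an 8 µs base RTT is 6,408,000 ns. A measured FCT of 12,816,000 ns would then be a 2× slowdown.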

6.7 Transport Stack Summary

Layer                 | Mechanism                                           | Implementation                   | Key Parameters
L4 Transport          | RoCEv2 RDMA with QP state                           | rdma-hw.cc, rdma-queue-pair.h    | PACKET_PAYLOAD_SIZE=9000, HAS_WIN=1
L4 Congestion Control | DCQCN (default) / HPCC / TIMELY                     | rdma-hw.cc cnp_received_mlx()    | CC_MODE=1, EWMA_GAIN=1/256, RATE_AI=50Mb/s
L3 ECN Marking        | Probabilistic RED at switch egress                  | switch-mmu.cc ShouldSendCN()     | kmin/kmax per link speed, pmax=0.2
L3 Routing            | Per-flow ECMP (MurmurHash3)                         | switch-node.cc GetOutDev()       | 5-tuple hash, seed=node_id
L2 Flow Control       | PFC with dynamic threshold                          | switch-mmu.cc, qbb-net-device.h  | 8 queues, 3KB hysteresis, 32MB buffer
L1 Physical           | Point-to-point links with configurable BW/latency   | Topology file                    | NVLink 2880Gbps, NIC 100-400Gbps
Why This Matters for PD Disaggregation Accuracy: The analytical backend estimates KV transfer time as size / bandwidth. The NS-3 backend captures effects that analytical mode misses: (1) DCQCN rate ramp-up — new QPs start at line rate but quickly converge when competing; (2) ECMP hash collisions — two KV transfers to the same decode node may collide on a PSW uplink; (3) PFC cascading — a congested decode node can PFC-pause the entire upstream path; (4) cross-traffic interference — TP AllReduce on the same rails competes with PD KV transfers. These effects can cause 2-5× FCT inflation compared to the ideal size/BW estimate, making NS-3 essential for accurate PD disagg analysis.

6.8 NACK and Layer-2 Retransmission

When a receiver detects an out-of-order sequence number, it generates a NACK packet (protocol 0xFD). The sender retransmits from the NACKed sequence. This provides a reliable transport layer beneath congestion control.

Key parameters from SimAI.conf:

ns-3-alibabacloud
simulation/src/point-to-point/model/rdma-hw.cc lines 41-44, 584-602 — ReceiverCheckSeq()
// rdma-hw.cc — ReceiverCheckSeq()
int RdmaHw::ReceiverCheckSeq(uint32_t seq, Ptr<RdmaRxQueuePair> q, uint32_t size) {
    uint32_t expected = q->ReceiverNextExpectedSeq;
    if (seq == expected) {
        q->ReceiverNextExpectedSeq = expected + size;
        // Check if need to send ACK
        if (m_ack_interval == 0) return 1; // ACK
        else return 5; // delayed ACK
    } else if (seq > expected) {
        // Gap detected → generate NACK
        return 2; // NACK
    } else {
        // Duplicate or retransmitted packet
        return 4; // duplicate
    }
}
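
The receiver-side check above can be exercised with a small Python model. This is a sketch that mirrors only the simplified excerpt (return codes 1, 2, 4, 5 as listed there); the `Receiver` class and the sample packet sequence are hypothetical.

```python
# Return codes as they appear in the excerpt above.
ACK, NACK, DUPLICATE, DELAYED_ACK = 1, 2, 4, 5

class Receiver:
    def __init__(self, ack_interval: int = 0):
        self.next_expected = 0          # plays the role of ReceiverNextExpectedSeq
        self.ack_interval = ack_interval

    def check_seq(self, seq: int, size: int) -> int:
        if seq == self.next_expected:
            self.next_expected += size  # in-order: advance the expected sequence
            return ACK if self.ack_interval == 0 else DELAYED_ACK
        if seq > self.next_expected:
            return NACK                 # gap: sender must resend from next_expected
        return DUPLICATE                # already-received data

rx = Receiver()
# Packet at seq 1000 is initially lost, so 2000 arrives early and triggers a NACK.
results = [rx.check_seq(seq, 1000) for seq in (0, 2000, 1000, 1000)]
```

In this run the second packet produces a NACK, the retransmitted packet at 1000 is accepted, and the repeated 1000 is classified as a duplicate.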

6.9 NVSwitch / NVLS Dedicated Routing

SimAI has a dedicated NVSwitchNode class (separate from regular SwitchNode) for intra-node NVLink routing. This enables distinct routing paths for GPU-to-GPU traffic within the same server versus inter-server traffic through NICs.

ns-3-alibabacloud
simulation/src/point-to-point/model/rdma-hw.h lines 49-51, rdma-hw.cc lines 203-212
// rdma-hw.h
std::unordered_map<uint32_t, std::vector<int>> m_rtTable;              // Normal routing
std::unordered_map<uint32_t, std::vector<int>> m_rtTable_nxthop_nvswitch; // NVSwitch routing
uint32_t m_gpus_per_server;  // Determines intra-node boundary

// rdma-hw.cc — route selection
if (nvls_enable && IsInSameServer(sip, dip)) {
    // Use NVSwitch routing table
    nexthops = m_rtTable_nxthop_nvswitch[dip];
} else {
    // Use normal ECMP routing
    nexthops = m_rtTable[dip];
}
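
The route-selection branch above can be sketched in Python as follows. The same-server test shown here (integer division of a node ID by `gpus_per_server`) is an assumption about how `IsInSameServer` could work; the real function takes IP addresses and may differ. The routing tables are hypothetical.

```python
def is_in_same_server(src: int, dst: int, gpus_per_server: int) -> bool:
    # Assumption: GPU node IDs are assigned contiguously, server by server.
    return src // gpus_per_server == dst // gpus_per_server

def select_next_hops(src, dst, rt_table, rt_table_nvswitch,
                     gpus_per_server, nvls_enable):
    if nvls_enable and is_in_same_server(src, dst, gpus_per_server):
        return rt_table_nvswitch[dst]   # intra-node: go through the NVSwitch
    return rt_table[dst]                # inter-node: normal ECMP next hops

# Hypothetical tables: node 100 is an NVSwitch, nodes 200/201 are ASW next hops.
rt_table = {3: [200], 9: [200, 201]}
rt_table_nvswitch = {3: [100]}
intra = select_next_hops(0, 3, rt_table, rt_table_nvswitch, 8, True)
inter = select_next_hops(0, 9, rt_table, rt_table_nvswitch, 8, True)
```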

6.10 Window + Rate Dual Control

SimAI uses both window-based and rate-based flow control simultaneously. A packet is sent only if both conditions are met: (1) within window, AND (2) within rate limit.

ns-3-alibabacloud
simulation/src/point-to-point/model/rdma-queue-pair.h lines 25-28, rdma-queue-pair.cc lines 112-121, 168-190
// rdma-queue-pair.cc
bool RdmaQueuePair::IsWinBound() {
    uint64_t on_the_fly = snd_nxt - snd_una;
    uint64_t win;
    if (m_var_win) {
        win = m_win * m_rate.GetBitRate() / m_max_rate.GetBitRate();
        if (win == 0) win = 1;
    } else {
        win = m_win;
    }
    return on_the_fly >= win;
}
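
The variable-window arithmetic in `IsWinBound()` is easy to check by hand. The sketch below reproduces it in Python with integer math; the function names and the sample numbers are illustrative only.

```python
def effective_window(win: int, rate_bps: int, max_rate_bps: int, var_win: bool) -> int:
    """Window in bytes; with var_win it shrinks in proportion to the current rate."""
    if not var_win:
        return win
    return max(1, win * rate_bps // max_rate_bps)  # never let the window reach 0

def is_win_bound(snd_nxt: int, snd_una: int, win: int,
                 rate_bps: int, max_rate_bps: int, var_win: bool) -> bool:
    on_the_fly = snd_nxt - snd_una  # bytes sent but not yet acknowledged
    return on_the_fly >= effective_window(win, rate_bps, max_rate_bps, var_win)

# A QP throttled to 25% of line rate sees its 100 KB window shrink to 25 KB,
# so 30 KB in flight blocks it only when the variable window is enabled.
blocked_var = is_win_bound(30_000, 0, 100_000, 25 * 10**9, 100 * 10**9, True)
blocked_fixed = is_win_bound(30_000, 0, 100_000, 25 * 10**9, 100 * 10**9, False)
```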

6.11 Alternative Congestion Control: HPCC, TIMELY, DCTCP

Beyond DCQCN (CC_MODE=1), SimAI implements four more congestion control algorithms (HPCC, TIMELY, DCTCP, and HPCC-PINT), each using different network signals:

Why are CC_MODE numbers non-consecutive (1, 3, 7, 8, 10)? SimAI inherits its CC_MODE numbering from the upstream HPCC simulator codebase. Modes 2, 4, 5, 6, 9 were reserved as placeholders for algorithms that were planned but never implemented (e.g., DCQCN variants, Swift). We verified in rdma-hw.cc that only modes 1, 3, 7, 8, 10 have actual if (m_cc_mode == N) branches — the gaps are historical, not omissions in our analysis.

HPCC (CC_MODE=3): Uses INT (In-Network Telemetry) headers. Each switch adds an IntHop record (timestamp, bytes, queue length, line rate). The sender uses precise utilization info to set rate:

ns-3-alibabacloud
simulation/src/point-to-point/model/int-header.h — IntHop structure
// IntHop structure (int-header.h)
struct IntHop {
    uint32_t time : 24;     // Timestamp (ns)
    uint32_t bytes : 20;    // Bytes transmitted since last sample
    uint32_t qlen : 17;     // Queue occupancy
    uint32_t lineRate : 3;  // Encoded link rate
};
// Max 5 hops per packet (IntHeader::maxHop = 5)
// Target utilization: m_targetUtil = 0.95

TIMELY (CC_MODE=7): RTT-based, no ECN required:

ns-3-alibabacloud
simulation/src/point-to-point/model/rdma-hw.h — TIMELY parameters
// Parameters (rdma-hw.h)
double m_tmly_alpha;    // EWMA weight for the RTT-difference (gradient) estimate
double m_tmly_beta;     // Multiplicative rate-decrease factor
uint64_t m_tmly_TLow;  // Low RTT threshold
uint64_t m_tmly_THigh; // High RTT threshold
uint64_t m_tmly_minRtt; // Minimum RTT, used to normalize the RTT gradient

DCTCP (CC_MODE=8): ECN-based with additive increase:

ns-3-alibabacloud
simulation/src/point-to-point/model/rdma-hw.h — DCTCP parameters
DataRate m_dctcp_rai;  // Additive increase = 1000 Mb/s (from SimAI.conf DCTCP_RATE_AI)

Comparison of all supported congestion control algorithms:

| CC Algorithm | Signal | Rate Decrease | Rate Increase | Use Case |
| --- | --- | --- | --- | --- |
| DCQCN (1) | ECN via CNP | α×EWMA multiplicative | AI/HAI additive | Default, RoCEv2 |
| HPCC (3) | INT telemetry | Precise utilization-based | Target 95% utilization | High precision |
| TIMELY (7) | RTT measurement | RTT > THigh | RTT < TLow | No switch support |
| DCTCP (8) | ECN marking | ECN-proportional | Additive (1Gbps) | TCP-like |
| HPCC-PINT (10) | Probabilistic INT | Sampled utilization | Same as HPCC | Reduced overhead |

6.12 NIC Packet Scheduling (Weighted Round-Robin)

When multiple QPs compete for the same NIC, packets are scheduled via weighted round-robin. ACK/NACK packets always receive absolute priority.

ns-3-alibabacloud
simulation/src/point-to-point/model/qbb-net-device.cc lines 79-139 — GetNextQindex()
// qbb-net-device.cc — GetNextQindex()
int QbbNetDevice::GetNextQindex(bool paused[]) {
    // 1. Check high-priority ACK queue first
    if (m_ackQ.size() > 0) return -1;  // ACK has absolute priority

    // 2. Round-robin over QPs, skipping paused priorities
    for (int i = 1; i <= m_queue->m_qpGrp->GetN(); i++) {
        int idx = (m_qpidx + i) % m_queue->m_qpGrp->GetN();
        Ptr<RdmaQueuePair> qp = m_queue->m_qpGrp->Get(idx);
        if (!paused[qp->m_pg]           // Not PFC-paused
            && qp->GetBytesLeft() > 0    // Has data to send
            && !qp->IsWinBound()         // Within window
            && qp->m_nextAvail <= now) { // Rate-limited timer expired
            m_qpidx = idx;
            return idx;
        }
    }
    return -2; // Nothing to send
}
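
The selection loop above can be modelled in a few lines of Python. This is a sketch of the simplified excerpt only: the `QP` dataclass, its field names, and the `ack_pending` flag are hypothetical stand-ins for the C++ members.

```python
from dataclasses import dataclass

@dataclass
class QP:
    pg: int            # priority group, used to index the PFC pause array
    bytes_left: int
    win_bound: bool    # True when the window check blocks this QP
    next_avail: int    # earliest time the rate limiter allows the next packet

def get_next_qindex(qps, last_idx, paused, now, ack_pending=False):
    if ack_pending:
        return -1                       # ACK/NACK queue has absolute priority
    n = len(qps)
    for i in range(1, n + 1):           # start just after the last served QP
        idx = (last_idx + i) % n
        qp = qps[idx]
        if (not paused[qp.pg] and qp.bytes_left > 0
                and not qp.win_bound and qp.next_avail <= now):
            return idx
    return -2                           # nothing eligible to send

qps = [QP(3, 100, False, 0), QP(3, 0, False, 0),
       QP(3, 100, True, 0), QP(3, 100, False, 50)]
paused = [False] * 8
early = get_next_qindex(qps, 0, paused, now=10)  # QP 3 is still rate-limited
late = get_next_qindex(qps, 0, paused, now=60)   # QP 3 has become eligible
```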

6.13 Multi-QP per Flow (_QPS_PER_CONNECTION_)

Large flows can be split across multiple RDMA Queue Pairs for parallel transmission. Each QP gets a different source port, resulting in a different ECMP hash and potentially a different physical path. This is how SimAI can simulate multi-path load balancing for elephant flows.

astra-sim
simulation/src/point-to-point/model/entry.h lines 21, 110-112
#define _QPS_PER_CONNECTION_ 1  // Default: 1 QP per flow (configurable)

void SendFlow(int src, int dst, uint64_t maxPacketCount, ...) {
    uint64_t perQP = (maxPacketCount + _QPS_PER_CONNECTION_ - 1) / _QPS_PER_CONNECTION_;
    uint64_t remaining = maxPacketCount;

    for (int i = 0; i < _QPS_PER_CONNECTION_; i++) {
        uint64_t thisQP = min(perQP, remaining);
        remaining -= thisQP;
        uint32_t port = portNumber[src][dst]++;  // Unique port → unique ECMP hash
        // Each QP gets a different src port → different ECMP path
        RdmaClientHelper client(pg, sip, dip, port, dport, thisQP, ...);
    }
}
Key Insight: Each QP gets a different source port → different ECMP hash → potentially different physical path. This is how SimAI simulates multi-path load balancing for elephant flows.
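
The ceiling-division split used in `SendFlow` can be checked with a short Python sketch. The function name and the starting port number are hypothetical; only the arithmetic follows the excerpt.

```python
def split_across_qps(max_packet_count: int, qps_per_connection: int, first_port: int):
    """Return (source_port, packet_count) for each QP of one flow."""
    per_qp = (max_packet_count + qps_per_connection - 1) // qps_per_connection
    remaining = max_packet_count
    plan = []
    for i in range(qps_per_connection):
        this_qp = min(per_qp, remaining)    # the last QPs may get fewer packets
        remaining -= this_qp
        plan.append((first_port + i, this_qp))  # distinct port per QP
    return plan

plan = split_across_qps(max_packet_count=10, qps_per_connection=4, first_port=10000)
```

Note that when the flow has fewer packets than QPs, some QPs receive a count of zero, so a real implementation may want to skip creating them.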

6.14 Monitoring Outputs and Link Failure

SimAI provides six monitoring output files for observing network behavior at different granularities, plus support for link failure simulation:

| File | Content | Interval |
| --- | --- | --- |
| FCT_OUTPUT_FILE | Flow completion times (actual vs ideal) | Per flow |
| QLEN_MON_FILE | Switch queue occupancy (per port, per queue) | 10 ms |
| BW_MON_FILE | Host-level transmit bytes | 10 ms |
| RATE_MON_FILE | Per-QP sending rate (DCQCN-controlled) | 100 μs |
| CNP_MON_FILE | CNP reception count per QP | 100 μs |
| PFC_OUTPUT_FILE | PFC PAUSE/RESUME events | Per event |

Link failure simulation:

ns-3-alibabacloud
SimAI.conf lines 47, 57-65; rdma-hw.cc lines 762-802
# SimAI.conf — Link failure configuration
# Format: LINK_DOWN <timestamp> <node_A> <node_B>
LINK_DOWN 0 0 0  # 0 0 0 = no failure

# Can simulate link failures at specified times to study resilience
# Example: LINK_DOWN 1000000 12 48  — fail link between node 12 and 48 at t=1ms
Monitoring Tip: Combine FCT_OUTPUT_FILE (macro-level flow completion) with RATE_MON_FILE (micro-level rate dynamics) to diagnose why specific collective operations experience high latency. High PFC counts in PFC_OUTPUT_FILE combined with queue buildup in QLEN_MON_FILE indicate congestion hotspots that DCQCN alone cannot resolve.
Section 7

Network Topology Generation

SimAI includes a Python topology generator (gen_Topo_Template.py) that produces network descriptions for several real-world datacenter topologies. These topologies define the physical connectivity between GPUs, NVSwitches, ToR switches, aggregate switches (ASW), and pod switches (PSW).

astra-sim/network/ns3/gen_Topo_Template.py NS-3
# gen_Topo_Template.py
# Supported topologies:
# - Spectrum-X: Rail-optimized single ToR (4096 GPUs default)
# - AlibabaHPN Single-Plane: Dual ToR (15360 GPUs)
# - AlibabaHPN Dual-Plane: Dual ToR with dual plane (15360 GPUs)
# - DCN+ Single-ToR: Single ToR topology (512 GPUs)
# - DCN+ Dual-ToR: Dual ToR topology (512 GPUs)

# Key parameters:
# gps:  GPU per server (typically 8)
# nvbw: NVLink bandwidth (Gbps), e.g. 900 for H100
# bw:   NIC-to-ASW bandwidth (Gbps), e.g. 100, 200, 400
# nl:   NVLink latency (ns), typically 1000
# l:    NIC latency (ns), typically 1000

# Output format (plain text):
# Line 1: <total_nodes> <gpu_per_server> <nv_switch_num>
#         <switch_nodes> <links> <gpu_type>
# Line 2: <switch_node_ids> (space-separated)
# Line 3+: <src> <dst> <bandwidth> <latency> <error_rate>

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('-topo', type=str,
    choices=['Spectrum-X', 'AlibabaHPN-SinglePlane',
             'AlibabaHPN-DualPlane', 'DCN-SingleToR',
             'DCN-DualToR'])
parser.add_argument('-g', type=int,
    help='Total number of GPUs')
parser.add_argument('-gt', type=str,
    choices=['A100', 'A800', 'H100', 'H800'])
parser.add_argument('-bw', type=str,
    help='NIC bandwidth, e.g. 100Gbps')
[Figure: Spectrum-X / rail-optimized topology (16 GPUs, 2 servers). Source: gen_Topo_Template.py → Rail_Opti_SingleToR(). Two PSW spines (PSW-0, PSW-1) above eight ASW leaves, one per rail (ASW-0 to ASW-7); each server (G0-G7, G8-G15) has one NVSwitch. G0 and G8 both attach to ASW-0, so AllReduce on Rail 0 stays within one switch (no PSW hop); cross-rail G0 → G3 requires G0 → ASW-0 → PSW → ASW-3 → G3 (2 extra hops). Links: GPU ↔ NVSwitch over NVLink (2880 Gbps), GPU ↔ ASW over NIC (100-400 Gbps), ASW ↔ PSW full mesh. Rail-aligned NIC connections follow GPU_i % 8 → ASW_i.]

The topology file defines every link in the network with its bandwidth (Gbps), latency (ns), and error rate. The NS-3 backend reads this file to construct the simulation network. Different topologies lead to dramatically different congestion patterns, especially for AllToAll operations in MoE training.
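
The plain-text output format listed in the generator comments (a six-field header, a line of switch IDs, then one link per line) can be read with a short parser. The sketch below is a minimal illustration: the `Link` class, the dictionary keys, and every value in `SAMPLE` are invented, and the unit suffixes in the sample are an assumption about how bandwidth and latency are written.

```python
from dataclasses import dataclass

@dataclass
class Link:
    src: int
    dst: int
    bandwidth: str   # kept as text, e.g. "400Gbps" (suffix format is assumed)
    latency: str
    error_rate: str

def parse_topology(text: str) -> dict:
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header = lines[0]
    topo = {
        "total_nodes": int(header[0]),
        "gpu_per_server": int(header[1]),
        "nv_switch_num": int(header[2]),
        "switch_nodes": int(header[3]),
        "links": int(header[4]),
        "gpu_type": header[5],
        "switch_node_ids": [int(x) for x in lines[1]],
    }
    link_list = [Link(int(f[0]), int(f[1]), f[2], f[3], f[4]) for f in lines[2:]]
    if len(link_list) != topo["links"]:
        raise ValueError("link count does not match the header")
    topo["link_list"] = link_list
    return topo

# Invented toy file: 2 GPUs (0, 1), one NVSwitch (2), one ASW (3).
SAMPLE = """4 2 1 1 3 H100
2 3
0 2 2880Gbps 1000ns 0
1 2 2880Gbps 1000ns 0
0 3 400Gbps 1000ns 0
"""
topo = parse_topology(SAMPLE)
```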

Section 7

Supported Network Topologies — Deep Dive

The topology generator gen_Topo_Template.py implements five distinct topology functions that map to three named architecture families: Spectrum-X, AlibabaHPN, and DCN+. The critical architectural distinction is between rail-optimized topologies (where GPU_i on every server connects to a dedicated ASW_i) and non-rail-optimized topologies (where all GPUs in a segment share the same ASW). This choice profoundly impacts collective communication performance.

Rail-Optimized vs Non-Rail-Optimized — The Key Design Decision
In rail-optimized topologies (Spectrum-X, AlibabaHPN), GPU_i within each server connects to ASW[i % gps]. This means all GPU-0s across the cluster share ASW-0, all GPU-1s share ASW-1, and so on. This is optimal for AllReduce, where GPU_i communicates with GPU_i on other servers, so all traffic stays within a single rail (one ASW hop). Cross-rail traffic (GPU-0 → GPU-3) requires traversal up to the PSW layer and back.

In non-rail-optimized topologies (DCN+), all GPUs in a segment connect to the same ASW (or pair). There is no rail alignment — all communication patterns are treated equally. This is better for AllToAll (MoE) workloads but sub-optimal for AllReduce.

1. Spectrum-X — Rail-Optimized Single-ToR

astra-sim/network/ns3/gen_Topo_Template.py → Rail_Opti_SingleToR() NS-3

Each GPU connects to its own rail-aligned ASW based on its index within the server. With 8 GPUs per server, there are 8 ASW per segment, one per GPU position: GPU_i → ASW[group × gps + (i % gps)]. Every ASW connects to all PSW in a full mesh. This is NVIDIA's recommended topology for AllReduce-heavy workloads.

# Key parameters (defaults):
asw_switch_num_per_segment = gpu_per_server  # 8 ASW per segment (one per rail)
gpu_count = 4096          # Default GPU count
nic_bandwidth = 400Gbps   # NIC → ASW bandwidth
nvlink_bw = 2880Gbps      # NVLink bandwidth

# Wiring logic:
# GPU_i → NVSwitch (NVLink, intra-node)
# GPU_i → ASW[group * gps + (i % gps)]  (NIC, rail-aligned)
# ASW_j → ALL PSW  (full mesh uplinks)
[Figure: Spectrum-X, rail-optimized single-ToR (2 servers × 8 GPUs). Three PSW spines (PSW-0 to PSW-2) above eight ASW/ToR switches, one per rail (ASW-0 to ASW-7); each server has one NVSwitch. G0 (server 0) and G8 (server 1) share ASW-0, so same-rail AllReduce stays local; cross-rail G0 → G3 must traverse ASW-0 → PSW → ASW-3. Links: NVLink (2880 Gbps), NIC → ASW (400 Gbps), ASW → PSW full mesh. asw_switch_num_per_segment = gpu_per_server = 8.]
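
The wiring rule GPU_i → ASW[group × gps + (i % gps)] can be contrasted with the non-rail case in a few lines of Python. This is a sketch under one assumption: `group` is taken to be the index of the segment that the GPU belongs to, computed from a hypothetical `gpus_per_segment` parameter. The generator's own bookkeeping may differ.

```python
def rail_asw_index(gpu_id: int, gps: int, gpus_per_segment: int) -> int:
    """Rail-optimized: GPU position within its server selects the ASW."""
    group = gpu_id // gpus_per_segment      # assumption: segment index
    return group * gps + (gpu_id % gps)

def non_rail_asw_index(gpu_id: int, gpus_per_segment: int) -> int:
    """Non-rail (single ToR): every GPU in a segment shares one ASW."""
    return gpu_id // gpus_per_segment

# Two servers of 8 GPUs in one 16-GPU segment, as in the figures.
rail = [rail_asw_index(g, 8, 16) for g in range(16)]
flat = [non_rail_asw_index(g, 16) for g in range(16)]
```

With these inputs G0 and G8 both map to ASW 0 in the rail-optimized case, while in the non-rail case all sixteen GPUs map to the same switch.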

2. AlibabaHPN — Dual-ToR Single-Plane

astra-sim/network/ns3/gen_Topo_Template.py → Rail_Opti_DualToR_SinglePlane() NS-3

Rail-optimized like Spectrum-X, but each GPU connects to two ASW switches (dual-homed), providing link redundancy. The ASW are split into two sets (ASW-A and ASW-B), but both connect to the same PSW pool (single plane). With 8 GPUs per server, there are gps × 2 = 16 ASW per segment.

# Key parameters (defaults):
asw_switch_num_per_segment = gpu_per_server * 2  # 16 ASW per segment
gpu_count = 15360         # Default GPU count
nic_bandwidth = 200Gbps   # NIC → ASW bandwidth
asw_sets = 2              # asw_switch_1[] and asw_switch_2[]
psw_sets = 1              # Single PSW pool (both ASW sets → same PSW)

# Wiring logic:
# GPU_i → ASW-A[i % gps]  (NIC-1, rail-aligned)
# GPU_i → ASW-B[i % gps]  (NIC-2, rail-aligned)
# ASW-A[*] → ALL PSW      (full mesh)
# ASW-B[*] → ALL PSW      (full mesh)
[Figure: Alibaba HPN, dual-ToR single-plane (2 servers × 8 GPUs). A single PSW pool (PSW-0 to PSW-2) above two ToR sets, ASW-A (A0 to A7) and ASW-B (B0 to B7); each server has one NVSwitch. Each GPU is dual-homed: NIC-1 → ASW-A[rail], NIC-2 → ASW-B[rail]. Both ASW-A and ASW-B connect to the same PSW pool (single plane). asw_switch_num_per_segment = gpu_per_server × 2 = 16.]

3. AlibabaHPN — Dual-ToR Dual-Plane

astra-sim/network/ns3/gen_Topo_Template.py → Rail_Opti_DualToR_DualPlane() NS-3

The most fault-tolerant topology. Like HPN Single-Plane, each GPU is dual-homed to ASW-A and ASW-B. But the PSW layer is also split: ASW-A connects only to PSW-A, and ASW-B connects only to PSW-B — forming two completely independent network planes. If one entire plane fails, the other can still carry all traffic. The link formula uses psw_switch_num / pod_num / 2 (half PSW per plane).

# Key parameters (defaults):
asw_switch_num_per_segment = gpu_per_server * 2  # 16 ASW per segment
gpu_count = 15360
psw_sets = 2              # PSW split into psw_switch_1[] and psw_switch_2[]

# Wiring logic:
# GPU_i → ASW-A[i % gps]  (NIC-1)
# GPU_i → ASW-B[i % gps]  (NIC-2)
# ASW-A[*] → PSW-A[*] only  (Plane A)
# ASW-B[*] → PSW-B[*] only  (Plane B)
# Two independent planes — no cross-plane links
[Figure: Alibaba HPN, dual-ToR dual-plane (2 servers × 8 GPUs). Plane A: PSW-A0 to PSW-A2 above ASW A0 to A7. Plane B: PSW-B0 to PSW-B2 above ASW B0 to B7. ASW-A connects to PSW-A only and ASW-B to PSW-B only, with no cross-plane links. NIC-1 → ASW-A (Plane A), NIC-2 → ASW-B (Plane B). If Plane A fails entirely, Plane B still provides full connectivity.]

4. DCN+ — Non-Rail Single-ToR

astra-sim/network/ns3/gen_Topo_Template.py → No_Rail_Opti_SingleToR() NS-3

The simplest topology with no rail optimization. All GPUs in a segment connect to the same single ASW. There is only 1 ASW per segment, meaning 8 GPUs (or more, controlled by nics_per_aswitch) share the same top-of-rack switch. This eliminates rail structure entirely — all communication patterns are equal. The ASW connects to all PSW in a full mesh.

# Key parameters (defaults):
asw_switch_num_per_segment = 1  # Only 1 ASW per segment (no rails!)
gpu_count = 32+             # Small clusters

# Wiring logic:
# ALL GPU in segment → SAME ASW  (group_account tracks nics_per_aswitch)
# ASW → ALL PSW  (full mesh uplinks)
# No rail alignment — G0, G1, ..., G7 all share the same switch
[Figure: DCN+ single-ToR, non-rail-optimized (2 servers × 8 GPUs). PSW-0 to PSW-2 above a single ASW-0 that serves all 16 GPUs; each server has one NVSwitch. No rail alignment: G0 → G3 stays on the local ASW (no PSW hop), but the ASW becomes a bottleneck with many flows, which is bad for AllReduce at scale. asw_switch_num_per_segment = 1; ASW → PSW full mesh.]

5. DCN+ — Non-Rail Dual-ToR

astra-sim/network/ns3/gen_Topo_Template.py → No_Rail_Opti_DualToR() NS-3

Non-rail-optimized with dual-ToR redundancy. There are 2 ASW per segment (ASW-1 and ASW-2), but they are not rail-aligned — all GPUs in the segment connect to both ASW switches. Both ASW sets connect to the same PSW pool (single plane). This provides link redundancy for MoE workloads without rail structure.

# Key parameters:
asw_switch_num_per_segment = 2  # 2 ASW per segment (not rail-aligned)

# Wiring logic:
# ALL GPU in segment → ASW-1  (NIC-1, no rail alignment)
# ALL GPU in segment → ASW-2  (NIC-2, no rail alignment)
# ASW-1 → ALL PSW  (full mesh)
# ASW-2 → ALL PSW  (full mesh, same pool)
[Figure: DCN+ dual-ToR, non-rail-optimized (2 servers × 8 GPUs). A single PSW pool (PSW-0 to PSW-2) above ASW_A-0 and ASW_B-0, each serving all 16 GPUs. Every GPU is dual-homed (NIC-1 → ASW_A-0, NIC-2 → ASW_B-0) for redundancy, with no GPU-specific rail assignment. asw_switch_num_per_segment = 2; both ASW connect to the same PSW pool in a full mesh.]

Topology Comparison Matrix

| Feature | Spectrum-X | HPN Single-Plane | HPN Dual-Plane | DCN+ SingleToR | DCN+ DualToR |
| --- | --- | --- | --- | --- | --- |
| Rail-Optimized | Yes | Yes | Yes | No | No |
| ToR Redundancy | Single | Dual | Dual | Single | Dual |
| Network Planes | 1 | 1 | 2 | 1 | 1 |
| ASW per Segment | gps (8) | gps×2 (16) | gps×2 (16) | 1 | 2 |
| GPU→ASW Pattern | Rail-aligned | Rail-aligned, dual-homed | Rail-aligned, dual-homed | All→same ASW | All→both ASW |
| ASW→PSW | Full mesh | Full mesh | Per-plane mesh | Full mesh | Full mesh |
| AllReduce Efficiency | Optimal (same-rail) | Optimal + redundant | Optimal + fault-tolerant | Sub-optimal | Sub-optimal + redundant |
| AllToAll (MoE) Efficiency | Requires PSW hops | Requires PSW hops | Plane-isolated | Equal path | Equal path + redundant |
| Default GPU Count | 4096 | 15360 | 15360 | 32+ | 32+ |
| Best For | AllReduce-heavy TP/DP | Production PD disaggregation | Fault-tolerant production | Small clusters, MoE | MoE with redundancy |
Significance for PD (Prefill-Decode) Disaggregation Simulation
In PD disaggregation, prefill and decode phases run on separate GPU groups connected through the network fabric. The topology choice directly determines the KV-cache transfer latency between prefill and decode nodes. With AlibabaHPN Dual-Plane, one plane can be dedicated to KV-cache transfers while the other handles gradient synchronization — eliminating interference. With Spectrum-X, rail alignment means KV-cache transfers between same-rail GPU pairs (e.g., GPU-0 on prefill → GPU-0 on decode) complete with minimal latency (single ASW hop), but cross-rail transfers suffer. SimAI enables quantifying this trade-off before hardware procurement.
MoE (Mixture of Experts) Topology Implications
MoE models use AllToAll collectives to route tokens to expert GPUs, generating traffic patterns that cross rail boundaries. In rail-optimized topologies, AllToAll creates heavy cross-rail traffic that must traverse the PSW layer, potentially causing congestion at ASW-PSW uplinks. In DCN+ non-rail topologies, all GPUs share the same ASW, making AllToAll traffic patterns uniform — no path is inherently worse than another. This is why SimAI's topology comparison workflow (generating Spectrum-X, HPN, and DCN+ topologies for the same workload) is critical for MoE model deployment decisions.

Source Code → Topology Function Mapping

astra-sim/network/ns3/gen_Topo_Template.py
# CLI argument → Function mapping:
# 'Spectrum-X'             → Rail_Opti_SingleToR()
# 'AlibabaHPN-SinglePlane' → Rail_Opti_DualToR_SinglePlane()
# 'AlibabaHPN-DualPlane'   → Rail_Opti_DualToR_DualPlane()
# 'DCN-SingleToR'          → No_Rail_Opti_SingleToR()
# 'DCN-DualToR'            → No_Rail_Opti_DualToR()

# The key variable that distinguishes rail vs non-rail:
# Rail-optimized:     asw_switch_num_per_segment = gpu_per_server (or ×2)
#   → GPU_i connects to ASW[i % gps]  (rail-aligned)
# Non-rail-optimized: asw_switch_num_per_segment = 1 (or 2)
#   → All GPUs in segment share same ASW(s)
Section 8

NS-3 Configuration (SimAI.conf)

The NS-3 simulation backend is configured through a combination of a configuration file (SimAI.conf) and environment variables. The configuration file controls network protocol parameters, while environment variables control simulation-level behavior.

SimAI.conf (Network Protocol)

NS-3
# Congestion control
ENABLE_QCN 1
USE_DYNAMIC_PFC_THRESHOLD 1
CC_MODE 1          # 0=disabled, 1=DCQCN, 3=HPCC, 7=TIMELY, 8=DCTCP, 10=HPCC-PINT

# Packet size
PACKET_PAYLOAD_SIZE 9000  # Jumbo frames

# PFC (Priority Flow Control)
PAUSE_TIME 5
L2_WAIT_FOR_ACK 0

# Switch buffer
BUFFER_SIZE 16777216  # 16MB per port
KMIN 1500
KMAX 100000
PMAX 0.2

# ECN
ECN_ENABLED 1
DCTCP_GAIN 0.00390625
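
The file uses a simple "KEY value" layout with `#` comments, which makes it easy to inspect programmatically. The following is a minimal sketch of a reader for that layout; `parse_simai_conf` is a hypothetical helper, not part of SimAI, and it keeps every value as a string rather than guessing types.

```python
def parse_simai_conf(text: str) -> dict:
    """Parse 'KEY value...' lines, ignoring blank lines and # comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        key, *values = line.split()
        # Single-valued keys map to a string, multi-valued keys to a list.
        conf[key] = values[0] if len(values) == 1 else values
    return conf

sample = """
# Congestion control
ENABLE_QCN 1
CC_MODE 1          # 1=DCQCN
PMAX 0.2
LINK_DOWN 0 0 0
"""
conf = parse_simai_conf(sample)
```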

Environment Variables

astra-sim
# Packet sending latency (microseconds)
AS_SEND_LAT=6

# Enable NVLS algorithm
AS_NVLS_ENABLE=1

# Enable PXN optimization
AS_PXN_ENABLE=0

# Logging level
AS_LOG_LEVEL=INFO

# GPU type override
AS_GPU_TYPE=H100

# Number of NVLink channels
AS_NV_CHANNELS=8

# Ring channel count
AS_RING_CHANNELS=2

# Network bandwidth override (Gbps)
AS_NET_BW=400

# Simulation thread count
AS_SIM_THREADS=16

The ENABLE_QCN and USE_DYNAMIC_PFC_THRESHOLD settings are particularly important for accuracy. QCN (Quantized Congestion Notification) allows switches to signal senders to slow down before buffers overflow. PFC (Priority Flow Control) provides lossless Ethernet semantics required by RDMA. Together, they model the complex feedback loops that determine real-world network performance.
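
The KMIN, KMAX, and PMAX settings feed the probabilistic ECN marking that drives DCQCN. The sketch below shows the standard RED-style marking curve those three parameters describe. Whether `ShouldSendCN()` matches it exactly is an assumption, and queue length is assumed to be in the same unit as KMIN/KMAX.

```python
import random

def ecn_mark_probability(qlen: float, kmin: float, kmax: float, pmax: float) -> float:
    """RED-style curve: 0 below kmin, linear up to pmax at kmax, 1 above kmax."""
    if qlen <= kmin:
        return 0.0
    if qlen > kmax:
        return 1.0
    return pmax * (qlen - kmin) / (kmax - kmin)

def should_mark(qlen, kmin, kmax, pmax, rng=random.random) -> bool:
    # Mark the packet when a uniform draw falls below the marking probability.
    return rng() < ecn_mark_probability(qlen, kmin, kmax, pmax)

# Values from the SimAI.conf excerpt: KMIN 1500, KMAX 100000, PMAX 0.2.
halfway = ecn_mark_probability(50_750, 1500, 100_000, 0.2)
```

At the midpoint between KMIN and KMAX the marking probability is half of PMAX, so with these settings roughly one packet in ten is marked there.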

Section 9

Build System

SimAI uses a unified build script that compiles the appropriate backend based on a command-line flag. Each backend produces a separate binary with the same command-line interface, making it easy to switch between simulation modes:

scripts/build.sh astra-sim
# Analytical backend (fastest build + fastest simulation)
./scripts/build.sh -c analytical
# Output: bin/SimAI_analytical
# Dependencies: C++ compiler only
# Build time: ~30 seconds

# NS-3 Simulation backend (full packet-level)
./scripts/build.sh -c ns3
# Output: bin/SimAI_simulator
# Dependencies: NS-3 library (auto-built)
# Build time: ~5 minutes (first build)

# Physical backend (real RDMA traffic)
./scripts/build.sh -c phy
# Output: bin/SimAI_phynet
# Dependencies: libibverbs, librdmacm
# Build time: ~1 minute

Analytical Binary

SimAI_analytical — Lightweight, no external dependencies beyond a C++17 compiler. Ideal for CI/CD pipelines and quick iteration on workload configurations.

NS-3 Simulator Binary

SimAI_simulator — Full NS-3 integration. The build script automatically fetches and compiles the NS-3 library with SimAI's custom RDMA/QBB modules.

Physical Binary

SimAI_phynet — Requires RDMA-capable NICs and the libibverbs / librdmacm libraries. Must be deployed on actual cluster nodes.

Section 10

End-to-End Example

Here is a complete workflow for running a communication simulation from topology generation through execution to output analysis. This example simulates a 128-GPU Spectrum-X cluster with A100 GPUs running a micro AllReduce benchmark:

1. Generate Topology

# Generate a 128-GPU Spectrum-X topology with A100s and 100Gbps NICs
python3 gen_Topo_Template.py \
  -topo Spectrum-X \
  -g 128 \
  -gt A100 \
  -bw 100Gbps

# Output: Spectrum-X_128g_8gps_100Gbps_A100
# Contains: 128 GPUs, 64 NVSwitches, ToR/ASW/PSW switches, all links

2. Prepare Workload

# microAllReduce.txt — A simple AllReduce benchmark
# Format: num_passes op_type data_size group_type ...
1 ALLREDUCE 1048576 TP    # 1MB AllReduce on TP group
1 ALLREDUCE 67108864 DP  # 64MB AllReduce on DP group
1 ALLTOALL 16777216 EP   # 16MB AllToAll on EP group

3. Run Simulation

# Run with NS-3 backend
#   -t: number of simulation threads (16 here)
#   -w: workload file
#   -n: topology file
#   -c: NS-3 configuration
AS_SEND_LAT=6 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \
  -t 16 \
  -w ./microAllReduce.txt \
  -n ./Spectrum-X_128g_8gps_100Gbps_A100 \
  -c ./SimAI.conf

# Or run with analytical backend for quick results
AS_SEND_LAT=6 ./bin/SimAI_analytical \
  -w ./microAllReduce.txt \
  -n ./Spectrum-X_128g_8gps_100Gbps_A100

4. Analyze Output

# Output files (NS-3 mode):
# - fct.txt:        Flow Completion Times per flow
# - bandwidth.txt:  Per-link bandwidth utilization over time
# - queue.txt:      Switch queue occupancy over time
# - rate.txt:       Sending rate per flow over time
# - cnp.txt:        Congestion Notification Packet counts
# - pfc.txt:        PFC pause frame events

# Key metric: total collective completion time
# Compare analytical vs NS-3 to assess congestion impact
Pro Tip: Start with analytical mode for rapid iteration on workload parameters, then switch to NS-3 for final validation. The analytical mode can complete in seconds what NS-3 takes minutes to simulate, but it will miss congestion effects that can cause 20-40% performance degradation at scale.

Advanced: Multi-Workload Simulation

For full training iteration simulation, combine the workload generator (AICB) with SimAI. AICB generates realistic computation + communication interleaving patterns, and SimAI handles the network simulation:

# Step 1: Generate workload with AICB
python3 -m aicb.main \
  --model_name llama_70b \
  --tp 8 --dp 16 --pp 1 \
  --world_size 128 \
  --output_dir ./workloads/

# Step 2: Feed into SimAI
AS_SEND_LAT=6 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \
  -t 16 \
  -w ./workloads/llama_70b_tp8_dp16.txt \
  -n ./Spectrum-X_128g_8gps_100Gbps_H100 \
  -c ./SimAI.conf

# The simulation interleaves compute phases (estimated analytically)
# with communication phases (simulated by NS-3)

Topology Comparison Workflow

One of SimAI's most valuable use cases is comparing different network topologies for the same workload. By generating multiple topology files and running the same workload against each, you can quantify the performance impact of different network designs:

# Generate three topologies for comparison
python3 gen_Topo_Template.py -topo Spectrum-X -g 1024 -gt H100 -bw 400Gbps
python3 gen_Topo_Template.py -topo AlibabaHPN-SinglePlane -g 1024 -gt H100 -bw 400Gbps
python3 gen_Topo_Template.py -topo AlibabaHPN-DualPlane -g 1024 -gt H100 -bw 400Gbps

# Run the same workload on each
for topo in Spectrum-X AlibabaHPN-SinglePlane AlibabaHPN-DualPlane; do
  AS_SEND_LAT=6 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \
    -t 16 \
    -w ./workloads/llama_70b.txt \
    -n ./${topo}_1024g_8gps_400Gbps_H100 \
    -c ./SimAI.conf \
    -o ./results/${topo}/
done

# Compare FCT distributions across topologies
python3 analyze_results.py --dirs ./results/
Scaling Note: NS-3 simulation time scales roughly O(N*F) where N is GPU count and F is total flows. A 128-GPU AllReduce takes ~30 seconds; a 4096-GPU full training iteration can take 30+ minutes. Use the analytical backend for initial exploration, then NS-3 for final numbers.
Deep Dive

Algorithm Internals: Channel Construction

Before flows can be generated, SimCCL must construct ring and tree channels. A channel is a specific ordering of ranks that defines the communication pattern. Multiple channels can be used simultaneously to increase bandwidth utilization. The number of channels depends on the GPU type and NVLink topology:

astra-sim/workload/SimCCL/MockNcclGroup.cc SimCCL

Ring Channel Construction

// Ring channels are built from the NVLink topology
// Each channel is a permutation of ranks optimized for
// the physical NVLink connectivity
void MockNcclGroup::buildRingChannels() {
    int nChannels = getenv("AS_RING_CHANNELS")
                    ? atoi(getenv("AS_RING_CHANNELS"))
                    : default_ring_channels;

    for (int c = 0; c < nChannels; c++) {
        vector<int> ring;
        // Build ring following NVLink adjacency
        // Ensures each hop uses a physical NVLink
        for (int r = 0; r < nRanks; r++) {
            ring.push_back(
                gp_info.Ranks[(r + c) % nRanks]);
        }
        ringchannels.push_back(ring);
    }
}

// Tree channels for hierarchical algorithms
void MockNcclGroup::buildTreeChannels() {
    // Build binary tree with root at rank 0
    // Child(i) = 2*i+1, 2*i+2
    // Used for inter-node reduction in NVLS-Tree
}
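
Both constructions above are small enough to model directly. The Python sketch below reproduces the rotation used in the ring excerpt and the Child(i) = 2*i+1, 2*i+2 rule from the tree comment; the function names are hypothetical, and the tree helpers assume the implicit array layout of a binary tree rooted at index 0.

```python
def ring_channels(ranks, n_channels):
    """Channel c is the rank list rotated by c positions, as in the excerpt."""
    n = len(ranks)
    return [[ranks[(r + c) % n] for r in range(n)] for c in range(n_channels)]

def tree_children(i: int, n_ranks: int):
    """Children of tree position i, dropping indices past the last rank."""
    return [c for c in (2 * i + 1, 2 * i + 2) if c < n_ranks]

def tree_parent(i: int):
    """Parent of tree position i; the root (0) has none."""
    return None if i == 0 else (i - 1) // 2

rings = ring_channels([0, 1, 2, 3], n_channels=2)
children = {i: tree_children(i, 8) for i in range(8)}
```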

AllGather and ReduceScatter Decomposition

While AllReduce is the most common collective, SimCCL also decomposes AllGather and ReduceScatter independently. AllGather is used extensively in tensor parallelism (e.g., gathering weight shards for the forward pass), while ReduceScatter is used for gradient reduction with sharded optimizers (ZeRO Stage 2+):

AllGather (Ring)

Each GPU starts with 1/N of the data and ends with all data. N-1 ring steps, each sending data_size/N bytes. Total data transferred per GPU: (N-1)/N * data_size.

ReduceScatter (Ring)

Each GPU starts with full data and ends with 1/N of the reduced result. N-1 ring steps with reduction at each hop. Same bandwidth as AllGather but includes reduce operations.
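
The step and byte counts stated for the two ring collectives can be verified with a short calculation. The sketch below uses hypothetical names and an illustrative 8-GPU, 64 MiB case, and assumes the data size divides evenly by the number of ranks.

```python
def ring_collective_cost(n_ranks: int, data_size: int):
    """Steps, bytes per step, and bytes sent per GPU for ring AllGather/ReduceScatter."""
    steps = n_ranks - 1
    chunk = data_size // n_ranks        # each step moves one 1/N chunk
    return steps, chunk, steps * chunk

MIB = 1 << 20
steps, chunk, per_gpu = ring_collective_cost(n_ranks=8, data_size=64 * MIB)
# Ring AllReduce is a ReduceScatter followed by an AllGather,
# so it sends twice as much per GPU: 2 * (N-1)/N * data_size.
allreduce_per_gpu = 2 * per_gpu
```

For 8 GPUs and 64 MiB this gives 7 steps of 8 MiB each, or 56 MiB per GPU, which matches the (N-1)/N * data_size formula above.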

AllToAll for MoE Expert Parallelism

AllToAll is the most network-intensive collective, used in Mixture-of-Experts (MoE) models to dispatch tokens to the appropriate expert. Unlike AllReduce where data flows in one direction around a ring, AllToAll requires every GPU to send unique data to every other GPU. This creates an N*N communication pattern that can severely stress the network fabric:

astra-sim/workload/SimCCL/MockNcclGroup.cc SimCCL
std::map<int, shared_ptr<FlowModels>>
MockNcclGroup::genAllToAllRingFlowModels(
    GroupType type, int rank, uint64_t data_size) {

    int nranks = gp_info.nRanks;
    // Each GPU sends data_size/nranks to each other GPU
    uint64_t per_peer_size = data_size / nranks;

    for (int peer = 0; peer < nranks; peer++) {
        if (peer == rank) continue;
        // Direct P2P send to each peer
        SingleFlow flow(
            flow_id++, rank, gp_info.Ranks[peer],
            per_peer_size,
            {},           // No dependencies (all sends parallel)
            {},
            {child_ids},  // Barrier after all sends complete
            0, peer, nranks, "RING");
    }

    return flow_models;
}
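
To see how quickly the flow count grows, the sketch below enumerates the point-to-point flows of one AllToAll across a whole group, using the same per-peer size (data_size / nranks) as the excerpt. The function name and tuple layout are hypothetical and carry none of the dependency fields of a real SingleFlow.

```python
def all_to_all_flows(ranks, data_size: int):
    """One (src, dst, bytes) flow for every ordered pair of distinct ranks."""
    per_peer = data_size // len(ranks)
    return [(src, dst, per_peer) for src in ranks for dst in ranks if src != dst]

MIB = 1 << 20
flows = all_to_all_flows(ranks=[0, 1, 2, 3], data_size=16 * MIB)
flows_64 = len(all_to_all_flows(list(range(64)), 16 * MIB))
```

A group of N ranks produces N × (N-1) concurrent flows, so 4 ranks yield 12 flows and 64 ranks yield 4032.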
Advanced

PXN Optimization: Proxy Network Communication

PXN (short for PCI × NVLink in NCCL) is an optimization in which GPUs within a node use NVLink to forward data to a proxy GPU that has direct NIC access, rather than each GPU accessing the NIC directly. This reduces NIC contention and uses the high-bandwidth NVLink fabric for intra-node data gathering before sending across the network:

astra-sim/workload/SimCCL/MockNcclGroup.cc SimCCL
// PXN flow generation: 3 phases
// Phase 1: NVLink gather to proxy GPU (PXN_INIT)
// Phase 2: Network transfer from proxy (NET)
// Phase 3: NVLink scatter from proxy (PXN_FINAL)

if (PXNenable && cross_node_transfer) {
    // Step 1: Local GPUs send to proxy via NVLink
    for (auto& local_rank : node_ranks) {
        if (local_rank == proxy_rank) continue;
        flows.push_back(SingleFlow(
            id++, local_rank, proxy_rank,
            chunk_size, {}, {}, {net_flow_id},
            channel, chunk, count, "PXN_INIT"));
    }

    // Step 2: Proxy sends aggregated data over network
    flows.push_back(SingleFlow(
        id++, proxy_rank, remote_proxy,
        total_chunk_size,
        {pxn_init_flow_ids},  // Wait for all local gathers
        {}, {pxn_final_ids},
        channel, chunk, count, "NET"));

    // Step 3: Remote proxy scatters to local GPUs via NVLink
    for (auto& remote_rank : remote_node_ranks) {
        if (remote_rank == remote_proxy) continue;
        flows.push_back(SingleFlow(
            id++, remote_proxy, remote_rank,
            chunk_size,
            {net_flow_id},  // Wait for network transfer
            {}, {},
            channel, chunk, count, "PXN_FINAL"));
    }
}

PXN is controlled by the AS_PXN_ENABLE environment variable. It is most beneficial when a node has fewer NICs than GPUs (e.g., 4 NICs for 8 GPUs), a common datacenter configuration in which several GPUs would otherwise share a NIC. The trade-off is additional NVLink traffic in exchange for reduced NIC contention.
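The three-phase dependency structure can be sketched in isolation as follows. This is a simplified illustration, not the SimCCL implementation: `PxnFlow` and `pxnFlows` are hypothetical names, the first rank of each node is assumed to be its proxy, and only the `prev` dependencies are modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for SimCCL's SingleFlow.
struct PxnFlow {
    int id;
    int src;
    int dst;
    std::vector<int> prev;  // ids of flows that must finish first
    std::string type;       // "PXN_INIT", "NET", or "PXN_FINAL"
};

// Builds the PXN flows for one node-to-node transfer.
// Assumption for this sketch: local[0] and remote[0] are the proxy GPUs.
std::vector<PxnFlow> pxnFlows(const std::vector<int>& local,
                              const std::vector<int>& remote) {
    assert(!local.empty() && !remote.empty());
    std::vector<PxnFlow> flows;
    int id = 0;

    // Phase 1: every non-proxy local GPU sends to the proxy over NVLink.
    std::vector<int> init_ids;
    for (std::size_t i = 1; i < local.size(); ++i) {
        init_ids.push_back(id);
        flows.push_back({id++, local[i], local[0], {}, "PXN_INIT"});
    }

    // Phase 2: a single network flow that waits for all local gathers.
    const int net_id = id;
    flows.push_back({id++, local[0], remote[0], init_ids, "NET"});

    // Phase 3: the remote proxy scatters over NVLink after the transfer.
    for (std::size_t i = 1; i < remote.size(); ++i) {
        flows.push_back({id++, remote[0], remote[i], {net_id}, "PXN_FINAL"});
    }
    return flows;
}
```

For two 4-GPU nodes this produces 3 PXN_INIT flows, 1 NET flow, and 3 PXN_FINAL flows, so only one flow crosses the NIC, which illustrates the trade of extra NVLink traffic for less NIC contention.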

Summary

Key Concepts Recap

SimCCL = NCCL Mirror

SimCCL

Faithfully replicates NCCL's algorithm selection (Ring, Tree, NVLS, NVLS-Tree) and flow decomposition logic. Selection is GPU-type-aware: A100 and H100 use different algorithms.

SingleFlow = Atomic Unit

SimCCL

Every collective is decomposed into a DAG of SingleFlows with explicit prev/child dependencies. This enables fine-grained simulation of overlapping and pipelined communication.

Three Backends

astra-sim

Analytical (fast/low-fi), NS-3 (slow/high-fi), Physical (real hardware). The same workload file works with all three, so the choice depends on how much accuracy is needed.

Topology-Aware

NS-3

Supports Spectrum-X, Alibaba HPN, and DCN+ topologies. GPU-NVSwitch-ToR-ASW-PSW hierarchy. Real datacenter network architectures.

Bottom Line: SimAI's communication stack transforms high-level collective operations (AllReduce, AllGather, AllToAll) into a precise DAG of point-to-point flows, then simulates their traversal through realistic datacenter network topologies. This enables network architects to evaluate topology designs and training engineers to optimize parallelism strategies — all without provisioning a single GPU.