From collective operations to packet-level network simulation — the full communication stack
SimCCL · astra-sim · NS-3

SimAI's communication simulation covers the complete network stack — from high-level collective operations (AllReduce, AllGather) down to packet-level RDMA transport with DCQCN congestion control and PFC flow control. This page provides a comprehensive deep-dive organized into three layers: (1) Collective Decomposition (Sections 2-4): how NCCL algorithms decompose collectives into point-to-point flows using Ring, Tree, and NVLS patterns; (2) Transport & Network Mechanics (Sections 5-6): how flows traverse the network via RoCEv2 RDMA with DCQCN/HPCC/TIMELY congestion control, RED-based ECN marking, PFC back-pressure, and per-flow ECMP routing; (3) Topology & Configuration (Sections 7-10): the five supported datacenter topologies (Spectrum-X, AlibabaHPN, DCN+), NS-3 configuration parameters, and end-to-end usage examples.
Each layer has a well-defined interface to the one below it. Collective operations produce FlowModels (a set of SingleFlow structs). astra-sim consumes these flows and dispatches them to the network backend. The backend reports completion events back to astra-sim, which then triggers dependent flows.
The MockNcclGroup class is the core of SimCCL. It faithfully replicates NCCL's algorithm selection and flow decomposition logic, translating high-level collective operations into point-to-point data flows that can be simulated by the network backend.
SimCCL mirrors the six NCCL algorithm types. Each algorithm defines a different communication pattern for moving data between GPUs:
// NCCL algorithm definitions mirrored in SimCCL
#define NCCL_ALGO_TREE 0 // Tree reduction (hierarchical)
#define NCCL_ALGO_RING 1 // Ring (bandwidth-optimal)
#define NCCL_ALGO_COLLNET_DIRECT 2 // Direct CollNet (switch-assisted)
#define NCCL_ALGO_COLLNET_CHAIN 3 // Chain CollNet (pipelined)
#define NCCL_ALGO_NVLS 4 // NVLink Switching (Hopper+)
#define NCCL_ALGO_NVLS_TREE 5 // NVLS + Tree hybrid
#define NCCL_NUM_ALGORITHMS 6
#define NCCL_NUM_PROTOCOLS 3 // LL, LL128, Simple
Algorithm selection in SimCCL is GPU-type-aware. The system considers the GPU architecture (A100/A800, H100/H800), the group type (TP, DP, PP, EP), the number of ranks, and whether NVLink Switching is available. This mirrors NCCL's real-world decision tree:
ncclInfo* MockNcclGroup::get_algo_proto_info(
    GroupType type, int rank, ComType op, uint64_t data_size) {
  ncclInfo* info = new ncclInfo();
  info->nBytes = data_size;
  info->op = op;
  if (op == All_Reduce && type == TP) {
    if (gpu_type == H100 || gpu_type == H800) {
      if (nRanks >= 8 && NVLSenable)
        info->algorithm = NCCL_ALGO_NVLS;   // Hopper favors NVLS
      else
        info->algorithm = NCCL_ALGO_RING;
    } else if (gpu_type == A100 || gpu_type == A800) {
      info->algorithm = NCCL_ALGO_RING;     // Ampere uses Ring
    }
  } else if (op == All_Reduce && type == DP) {
    info->algorithm = NCCL_ALGO_RING;       // DP always Ring
  } else if (op == All_to_All) {
    info->algorithm = NCCL_ALGO_RING;       // AllToAll uses Ring
  } else if (op == All_Gather) {
    if (type == TP && NVLSenable && nRanks >= 8)
      info->algorithm = NCCL_ALGO_NVLS_TREE;
    else
      info->algorithm = NCCL_ALGO_RING;
  }
  return info;
}
The Ring AllReduce is the most common collective algorithm. It operates in two phases: ReduceScatter (each GPU sends 1/N of its data around the ring, reducing as it goes) and AllGather (each GPU broadcasts its reduced chunk). For N GPUs, this requires 2*(N-1) steps total.
std::map<int, shared_ptr<FlowModels>>
MockNcclGroup::genAllReduceRingFlowModels(
    GroupType type, int rank, uint64_t data_size) {
  int nranks = gp_info.nRanks;
  int chunkcount = 2 * (nranks - 1);   // reduce + broadcast phases
  uint64_t chunksize = data_size / nranks / ringchannels.size();
  // Phase 1: ReduceScatter
  for (int step = 0; step < nranks - 1; step++) {
    for (auto& ring : ringchannels) {
      int src_rank  = ring[(ring_idx + nranks - step) % nranks];
      int dest_rank = ring[(ring_idx + nranks - step + 1) % nranks];
      tmp_result = SingleFlow(
          flow_id, src_rank, dest_rank, chunksize,
          {prev_flow_id},    // depends on previous step
          {},                // no parallel deps
          {child_flow_id},   // next step depends on this
          ring_id, chunk_id,
          chunkcount, "RING");
    }
  }
  // Phase 2: AllGather
  for (int step = 0; step < nranks - 1; step++) {
    for (auto& ring : ringchannels) {
      // Similar flow generation with broadcast semantics
      tmp_result = SingleFlow(
          flow_id, src_rank, dest_rank, chunksize,
          {prev_flow_id}, {}, {child_flow_id},
          ring_id, chunk_id, chunkcount, "RING");
    }
  }
  return flow_models;
}
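The two-phase schedule can be illustrated standalone (a simplified single-channel model written for this page, not SimCCL code): at each ReduceScatter step, every rank forwards the chunk it just reduced to its ring successor.

```python
def ring_reduce_scatter_schedule(nranks):
    """Toy schedule: (step, src, dest, chunk) tuples for one ring channel."""
    schedule = []
    for step in range(nranks - 1):
        for r in range(nranks):
            chunk = (r - step) % nranks  # chunk being forwarded this step
            schedule.append((step, r, (r + 1) % nranks, chunk))
    return schedule

sched = ring_reduce_scatter_schedule(4)
# 4 ranks -> 3 steps x 4 transfers = 12 flows in the ReduceScatter phase;
# the AllGather phase doubles this, giving the 2*(N-1) steps noted above.
print(len(sched))
```

With 8 ranks this yields 56 flows per phase per channel, which is why even a single AllReduce explodes into a large flow DAG.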
NVLS (NVLink Switching) leverages NVSwitch to provide all-to-all NVLink connectivity within a node. Instead of passing data around a ring, all GPUs can simultaneously access a shared memory region through the NVSwitch. The NVLS-Tree variant combines NVSwitch-based intra-node communication with a tree reduction for inter-node communication.
shared_ptr<FlowModels>
MockNcclGroup::genallReduceNVLSTreeFlowModels(
    GroupType type, int rank, uint64_t data_size) {
  // Step 1: NVLS Reduce (intra-node via NVSwitch)
  //   All GPUs within a node reduce their data through
  //   the NVSwitch shared-memory multicast
  generate_flow_model_nvls_tree_allreduce_up(
      rank, data_size, flow_models);
  // Step 2: Tree Reduce (inter-node via NIC)
  //   One representative GPU per node participates in
  //   a tree reduction across nodes
  // Step 3: Tree Broadcast (inter-node via NIC)
  //   Root broadcasts the fully reduced result down the tree
  // Step 4: NVLS Broadcast (intra-node via NVSwitch)
  //   Representative GPU shares the result with all node-local GPUs
  generate_flow_model_nvls_tree_allreduce_down(
      rank, data_size, flow_models);
  return flow_models;
}

// NVLS uses multicast addressing through NVSwitch.
// Each GPU writes to a shared buffer that is atomically
// reduced by the NVSwitch hardware itself.
void MockNcclGroup::generate_flow_model_nvls_tree_allreduce_up(
    int rank, uint64_t data_size,
    shared_ptr<FlowModels> flow_models) {
  // NVLS multicast: each GPU sends to NVSwitch
  for (auto& nvswitch : gp_info.NVSwitchs) {
    SingleFlow flow(flow_id, rank, nvswitch,
                    data_size / nRanks,
                    {}, {}, {child_ids},
                    0, chunk_id, chunk_count, "NVLS");
    flow_models->add(flow);
  }
}
SimCCL models hardware-specific latencies for each algorithm and protocol combination. These base latency values are added to the data transfer time to account for protocol overhead, synchronization, and memory copies:
// Base latency (microseconds) per algorithm per protocol
// Protocols: [LL, LL128, Simple]
static const float baseLat[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS] = {
{6.8, 14.0, 0}, // Tree
{6.6, 14.0, 8.4}, // Ring
{0, 0, 0}, // CollNet Direct
{0, 0, 0}, // CollNet Chain
{0, 0, 23.0}, // NVLS
{0, 0, 0}, // NVLS-Tree
};
// Per-path latencies: [NVLINK, PCI, NET]
// Each indexed by [path][algorithm][protocol]
static float hwLat[3][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS] = {
// NVLINK path latencies
{ {0.6, 1.25, 28.0}, // Tree over NVLINK
{0.6, 1.9, 3.4}, // Ring over NVLINK
{0, 0, 0}, // CollNet Direct
{0, 0, 0}, // CollNet Chain
{0, 0, 0}, // NVLS
{0, 0, 0} }, // NVLS-Tree
// PCI path latencies
{ {1.0, 1.0, 0}, // Tree over PCI
{1.0, 1.0, 0}, // Ring over PCI
... },
// Network (NIC) path latencies
{ {5.0, 8.5, 0}, // Tree over NET
{2.7, 5.5, 0}, // Ring over NET
... },
};
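Reading the two tables together, a plausible latency model (our interpretation — SimCCL's exact combination logic is not reproduced here) adds the base latency once, plus one hardware-path latency per step. For a Ring AllReduce over NVLink with the Simple protocol:

```python
NCCL_ALGO_TREE, NCCL_ALGO_RING = 0, 1
LL, LL128, SIMPLE = 0, 1, 2  # protocol indices

base_lat = {  # microseconds, from the baseLat table above
    NCCL_ALGO_TREE: [6.8, 14.0, 0.0],
    NCCL_ALGO_RING: [6.6, 14.0, 8.4],
}
hw_lat_nvlink = {  # per-step latency over the NVLINK path, from hwLat
    NCCL_ALGO_TREE: [0.6, 1.25, 28.0],
    NCCL_ALGO_RING: [0.6, 1.9, 3.4],
}

def ring_latency_us(nranks, proto):
    """Hedged model: base latency + one NVLink hop per ring step, 2*(N-1) steps."""
    steps = 2 * (nranks - 1)
    return base_lat[NCCL_ALGO_RING][proto] + steps * hw_lat_nvlink[NCCL_ALGO_RING][proto]

# 8-rank ring, Simple protocol: 8.4 + 14 * 3.4 ≈ 56 µs (latency term only;
# the bandwidth term data_size/busbw is added on top of this)
print(round(ring_latency_us(8, SIMPLE), 1))
```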
The SingleFlow struct is the fundamental unit of communication in SimAI. Every collective operation is decomposed into a directed acyclic graph (DAG) of SingleFlows. Each flow represents a point-to-point data transfer between two ranks, with explicit dependency tracking via prev and child_flow_id fields.
struct SingleFlow {
  int flow_id;                // Unique identifier for this flow
  int src, dest;              // Source and destination rank
  uint64_t flow_size;         // Data size in bytes
  vector<int> prev;           // Dependency flows (must complete first)
  vector<int> parallel_deps;  // Flows that run in parallel
  vector<int> child_flow_id;  // Downstream flows (triggered on completion)
  int channel_id;             // Ring/tree channel index
  int chunk_id;               // Which chunk of the collective
  int chunk_count;            // Total chunks in this collective
  string conn_type;           // "RING", "TREE", "NVLS", "PXN_INIT"

  // Constructor
  SingleFlow(int id, int s, int d, uint64_t size,
             vector<int> p, vector<int> par,
             vector<int> child,
             int ch, int ck, int cc,
             string ct)
      : flow_id(id), src(s), dest(d), flow_size(size),
        prev(p), parallel_deps(par), child_flow_id(child),
        channel_id(ch), chunk_id(ck), chunk_count(cc),
        conn_type(ct) {}
};
The flow DAG ensures correct execution ordering. For a Ring AllReduce with 4 GPUs, the prev field chains flows so that step 2 of the ReduceScatter cannot begin until step 1 completes. The child_flow_id enables the simulation engine to eagerly dispatch dependent flows once a flow finishes.
1. MockNcclGroup decomposes the collective into SingleFlows with explicit dependencies
2. astra-sim checks prev dependencies and dispatches flows with all dependencies satisfied
3. The backend (NS-3 / Analytical) simulates the point-to-point transfer
4. The backend notifies astra-sim; child_flow_id flows are checked and potentially dispatched
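The dispatch loop can be sketched as a minimal event-driven scheduler (hypothetical code illustrating the prev/child_flow_id contract, not astra-sim's implementation):

```python
from collections import deque

def run_flow_dag(flows):
    """flows: {flow_id: {'prev': [...], 'child': [...]}}. Returns completion order."""
    done, order = set(), []
    # Flows with no prev dependencies are dispatched immediately
    ready = deque(f for f, spec in flows.items() if not spec['prev'])
    while ready:
        f = ready.popleft()
        done.add(f)        # backend reports completion
        order.append(f)
        for child in flows[f]['child']:
            # dispatch a child only once ALL of its prev deps completed
            if all(p in done for p in flows[child]['prev']):
                ready.append(child)
    return order

# Two-step ring chain: flow 1 cannot start before flow 0 completes
dag = {0: {'prev': [], 'child': [1]}, 1: {'prev': [0], 'child': []}}
print(run_flow_dag(dag))  # [0, 1]
```

The `all(p in done ...)` check is what makes step 2 of a ReduceScatter wait for step 1, exactly as described above.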
SimAI organizes GPU ranks into communication groups based on parallelism strategy. Each GroupInfo struct describes the membership, topology, and NVSwitch connectivity of a group. The GroupType enum determines which collective algorithm is used:
enum GroupType { TP, DP, PP, EP, DP_EP, NONE };

struct GroupInfo {
  GroupType type;         // Parallelism type
  int nNodes;             // Number of nodes in this group
  int nRanks;             // Number of GPU ranks
  vector<int> Ranks;      // List of rank IDs
  vector<int> NVSwitchs;  // NVSwitch IDs (if applicable)
};
// Example: 8-GPU TP group within one node
// GroupInfo { TP, 1, 8, {0,1,2,3,4,5,6,7}, {8,9,10,11} }
// - 8 ranks on 1 node
// - 4 NVSwitches (H100 SXM)
| GroupType | Typical Op | Algorithm | Description |
|---|---|---|---|
| TP | AllReduce | Ring / NVLS | Tensor parallel gradient sync within a node; NVLS preferred on H100+ with 8+ ranks |
| DP | AllReduce | Ring | Data parallel gradient sync across nodes; bandwidth-bound over NIC |
| EP | AllToAll | Ring | Expert parallel token dispatch for MoE models; each GPU sends to all experts |
| PP | SendRecv | P2P | Pipeline parallel activation transfer between stages; latency-sensitive |
| DP_EP | AllReduce | Ring | Combined data-parallel + expert-parallel gradient sync |
The group topology determines how MockNcclGroup generates ring or tree channels. For a TP group confined to one node, rings are constructed using the NVLink topology (e.g., rail-optimized patterns). For DP groups spanning multiple nodes, rings traverse NIC links and the network fabric.
astra-sim supports three different network backends, each offering a different trade-off between simulation speed and fidelity. Users choose the backend at build time, and the same workload description works with all three.
The analytical backend uses a simple event queue with bus bandwidth estimates to compute transfer times. It runs in seconds but does not model congestion, queueing, or flow contention:
class AnaSim {
  static queue<struct CallTask> call_list;
  static uint64_t tick;

  static void Run() {
    while (!call_list.empty()) {
      CallTask task = call_list.front();
      while (tick != task.time) tick++;  // Advance virtual clock to the event
      call_list.pop();
      task.fun_ptr(task.fun_arg);        // Execute callback
    }
  }

  static void Schedule(
      uint64_t delay,           // Computed from busbw config
      void (*fun_ptr)(void*),   // Completion callback
      void* fun_arg) {
    call_list.push({tick + delay, fun_ptr, fun_arg});
  }
};
// Transfer time = data_size / busbw + base_latency
// No congestion, no contention, no queue dynamics
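The whole analytical cost model fits in one line; a sketch (the bus bandwidth and base latency below are illustrative values, not SimAI defaults):

```python
def analytical_transfer_ns(data_bytes, busbw_gbps, base_latency_ns):
    """Transfer time = data_size / busbw + base_latency. No congestion modeled."""
    # Gbps = bits per nanosecond, so bytes*8 / Gbps yields nanoseconds
    return data_bytes * 8 / busbw_gbps + base_latency_ns

# 64 MB over a 400 Gbps bus with a 1 µs base latency
t = analytical_transfer_ns(64 * 2**20, 400, 1000)
print(round(t))  # roughly 1.34 ms, expressed in ns
```

Because contention is absent, every flow on a shared link completes in the same time it would alone — exactly the gap the NS-3 backend closes.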
The NS-3 backend provides full packet-level simulation with realistic network behavior. This is the highest-fidelity software simulation mode, modeling queue dynamics, congestion control, PFC flow control, and ECMP routing at packet granularity.
The physical backend generates actual RDMA traffic on real hardware. This provides the highest fidelity validation but requires an actual RDMA cluster:
void send_flow(int src, int dst, uint64_t maxPacketCount,
               void (*msg_handler)(void*), void* fun_arg,
               int tag, sim_request* request) {
  ncclFlowTag flowtag = request->flowTag;
  TransportData send_data = TransportData(
      flowtag.channel_id,
      flowtag.chunk_id,
      flowtag.chunk_count,
      flowtag.current_flow_id,
      flowtag.child_flow_id,
      flowtag.tree_flow_list,
      flowtag.data_size);
  // Serialize and send via actual RDMA verbs
  rdma_transport->post_send(
      dst, &send_data, sizeof(send_data),
      tag, msg_handler, fun_arg);
}
| Mode | Speed | Fidelity | Hardware | Use Case |
|---|---|---|---|---|
| Analytical | ~seconds | Low | CPU only | Quick estimation, parameter sweeps, early design exploration |
| NS-3 Sim | ~minutes-hours | High | CPU (multi-thread) | Network topology design, congestion analysis, QoS tuning |
| Physical | Real-time | Highest | RDMA cluster | Validation against real hardware, final design sign-off |
When SimAI simulates a KV cache transfer from a prefill node to a decode node, the NS-3 backend doesn't just estimate latency — it simulates the full RoCEv2 RDMA transport stack at packet granularity. This section traces the exact mechanisms: how RDMA Queue Pairs are created, how DCQCN adjusts sending rates on congestion, how PFC prevents buffer overflow, and how ECMP distributes flows across equal-cost paths.
1. RdmaHw.AddQueuePair(size=KV_cache_bytes, pg=3)
2. Window check (IsWinBound) → Multi-QP split → WRR NIC scheduling
3. ECN marking (RED) → PFC pause check → ECMP hash routing → PSW if cross-rail
4. Same server? → route via NVSwitch (2880 Gbps) instead of NIC
5. Seq check → ACK or NACK (protocol 0xFD) → FCT tracking (actual vs ideal)
6. CNP → DCQCN rate decrease (α×EWMA) | or HPCC/TIMELY/DCTCP | Window adjust
7. qp_finish() → FCT recorded → monitoring output → callback to astra-sim → Decode GPU
Every P2P flow in SimAI becomes one or more RDMA Queue Pairs. The call chain starts from astra-sim's collective decomposition and ends at NS-3's packet-level simulation:
1. e.g., AllReduce(64MB, 8 GPUs) → 14 SingleFlow objects (2×(N-1) for ring)
2. Each SingleFlow triggers a sim_send() to the NS-3 backend with flow_size bytes
3. A QP is created with src/dst IP, port, priority group, window size (BDP), and base RTT. The QP starts sending packets at line rate.
4. 9000-byte jumbo frames traverse the topology. Switches check queue depth, mark ECN if congested, and trigger PFC if buffers go critical. The receiver sends an ACK or generates a CNP.
5. FCT (Flow Completion Time) is recorded. astra-sim's event system proceeds to the next dependent operation.
class RdmaQueuePair : public Object {
  Ipv4Address sip, dip;       // Source & destination IP
  uint16_t sport, dport;      // Source & dest port (ECMP hash input)
  uint64_t m_size;            // Total bytes to transfer (= KV cache size)
  uint16_t m_pg;              // Priority group (queue index for PFC)
  uint64_t snd_nxt, snd_una;  // Next seq to send, unacked seq
  uint32_t m_win;             // Window size (BDP-based)
  DataRate m_rate;            // Current sending rate (DCQCN-controlled)
  uint64_t m_baseRtt;         // Base RTT for this src-dst pair
};
SimAI's default congestion control is DCQCN (Data Center QCN), Mellanox's adaptation of QCN for RoCEv2. It operates in three phases: rate decrease on CNP, alpha tracking via EWMA, and rate recovery via additive/hyper-additive increase.
CC_MODE values: 1 = DCQCN (Mellanox, default), 3 = HPCC (High Precision CC), 7 = TIMELY (RTT-based), 8 = DCTCP (ECN-based), 10 = HPCC-PINT (with INT sampling). All are implemented in rdma-hw.cc.
// DCQCN rate control state (per QP)
double m_g; // EWMA gain = 0.00390625 (1/256)
double m_rateOnFirstCNP; // Rate fraction on first CNP (1.0 = full)
double m_rpgTimeReset; // Rate increase timer = 900 μs
double m_rateDecreaseInterval; // Rate decrease check interval = 4 μs
double m_alpha_resume_interval; // Alpha EWMA update interval = 1 μs
DataRate m_rai; // Additive increase = 50 Mb/s
DataRate m_rhai; // Hyper-additive increase = 100 Mb/s
uint32_t m_rpgThreshold; // Fast recovery threshold = 1 CNP
When a CNP (Congestion Notification Packet) arrives at the sender, DCQCN reduces the rate multiplicatively. The key formula:
// On each rate_decrease_interval (4 μs), if CNP was received:
target_rate = rate × (1 - alpha / 2)
rate = max(rate × (1 - alpha / 2), MIN_RATE)
// Alpha is updated via EWMA every alpha_resume_interval (1 μs):
// If CNP received in this interval:
alpha = (1 - m_g) × alpha + m_g // α increases toward 1
// If no CNP:
alpha = (1 - m_g) × alpha // α decays toward 0
// With m_g = 1/256, alpha converges slowly — providing stability
// Every RP_TIMER (900 μs), if no CNP received:
// Phase 1: Fast Recovery (first m_rpgThreshold rounds)
rate = (rate + target_rate) / 2
// Phase 2: Additive Increase
rate = rate + m_rai // +50 Mb/s per interval
// Phase 3: Hyper-Additive Increase (after sustained no-congestion)
rate = rate + m_rhai // +100 Mb/s per interval
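The three phases can be tied together in a toy rate controller (a simplified sketch of the update rules above with the timers elided, not rdma-hw.cc; note the pre-decrease rate is saved as the target so fast recovery can climb back):

```python
MIN_RATE = 0.1   # Gbps floor, illustrative
G = 1.0 / 256    # EWMA gain m_g

def dcqcn_on_cnp(rate, target, alpha):
    """Multiplicative decrease on CNP; alpha moves toward 1 via EWMA."""
    target = rate                              # remember pre-decrease rate
    rate = max(rate * (1 - alpha / 2), MIN_RATE)
    alpha = (1 - G) * alpha + G                # CNP seen: alpha -> 1
    return rate, target, alpha

def dcqcn_recover(rate, target, rai=0.05):
    """Fast recovery halves the gap to target, then additive increase (+50 Mb/s)."""
    if rate < target:
        return (rate + target) / 2, target     # Phase 1: fast recovery
    return rate + rai, target                  # Phase 2: additive increase

r, t, a = dcqcn_on_cnp(100.0, 100.0, 1.0)
print(r)  # 100 * (1 - 1/2) = 50.0 Gbps after the first CNP at alpha=1
```

One recovery round then brings the rate to (50 + 100) / 2 = 75 Gbps, illustrating why a single CNP is painful but transient.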
Switches mark packets with ECN (setting the CE codepoint in the IPv4 header's ECN field) based on probabilistic RED (Random Early Detection). The marking probability increases linearly between kmin and kmax:
bool SwitchMmu::ShouldSendCN(uint32_t ifindex, uint32_t qIndex) {
  if (qIndex == 0) return false;  // Queue 0 = highest priority, never marked
  if (egress_bytes[ifindex][qIndex] > kmax[ifindex])
    return true;                  // Above kmax: always mark (100%)
  if (egress_bytes[ifindex][qIndex] > kmin[ifindex]) {
    // Between kmin and kmax: linear probability
    double p = pmax[ifindex]
        * (double)(egress_bytes[ifindex][qIndex] - kmin[ifindex])
        / (kmax[ifindex] - kmin[ifindex]);
    if (UniformVariable(0, 1).GetValue() < p)
      return true;                // Probabilistic mark
  }
  return false;                   // Below kmin: never mark
}
The kmin/kmax/pmax values are rate-dependent, configured per link speed in SimAI.conf:
| Link Speed | kmin (KB) | kmax (KB) | pmax |
|---|---|---|---|
| 25 Gbps | 100 | 400 | 0.2 |
| 100 Gbps | 400 | 1600 | 0.2 |
| 200 Gbps | 300 | 1200 | 0.8 |
| 400 Gbps | 800 | 3200 | 0.2 |
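Plugging the 100 Gbps row into ShouldSendCN's linear ramp gives a quick feel for the numbers (a standalone recomputation, queue depth in KB):

```python
def ecn_mark_probability(qlen_kb, kmin_kb, kmax_kb, pmax):
    """RED-style marking probability as implemented in ShouldSendCN."""
    if qlen_kb <= kmin_kb:
        return 0.0                 # below kmin: never mark
    if qlen_kb > kmax_kb:
        return 1.0                 # above kmax: always mark
    return pmax * (qlen_kb - kmin_kb) / (kmax_kb - kmin_kb)

# 100 Gbps link: kmin=400 KB, kmax=1600 KB, pmax=0.2
p = ecn_mark_probability(1000, 400, 1600, 0.2)
print(round(p, 3))  # 0.2 * 600/1200 = 0.1
```

At a 1000 KB queue, one packet in ten is marked — enough to generate a steady CNP stream back to the sender before the queue ever reaches the PFC threshold.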
PFC operates at Layer 2, independently from DCQCN. When a switch's ingress buffer fills beyond the threshold, it sends a PAUSE frame to the upstream sender, halting transmission on that priority queue. This prevents packet loss but can cause head-of-line blocking and PFC storms.
bool SwitchMmu::CheckShouldPause(uint32_t port, uint32_t qIndex) {
  return !paused[port][qIndex] &&
         (hdrm_bytes[port][qIndex] > 0 ||                        // Headroom occupied
          GetSharedUsed(port, qIndex) >= GetPfcThreshold(port)); // Shared buffer full
}

bool SwitchMmu::CheckShouldResume(uint32_t port, uint32_t qIndex) {
  if (!paused[port][qIndex]) return false;
  return hdrm_bytes[port][qIndex] == 0 &&        // Headroom drained
         (GetSharedUsed(port, qIndex) == 0 ||
          GetSharedUsed(port, qIndex) + resume_offset  // 3 KB hysteresis
              <= GetPfcThreshold(port));
}

// Switch buffer: 12 MB total (static) or 32 MB (SimAI.conf BUFFER_SIZE)
// Dynamic threshold: USE_DYNAMIC_PFC_THRESHOLD = 1 (enabled by default)
// 8 priority queues (qCnt=8), queue 0 = highest (ACK/NACK/CNP bypass PFC)
Watch PFC_OUTPUT_FILE: high PFC counts indicate that DCQCN parameters need tuning (e.g., a lower kmin or a faster RATE_DECREASE_INTERVAL).
SimAI uses per-flow ECMP routing at every switch. When a packet arrives at a switch with multiple equal-cost next hops, the switch hashes the flow's 5-tuple to deterministically pick a path:
int SwitchNode::GetOutDev(Ptr<Packet> p, CustomHeader &ch) {
  // Extract 5-tuple: src_ip(4B) + dst_ip(4B) + src_port(2B) + dst_port(2B)
  union { uint8_t u8[12]; uint32_t u32[3]; } buf;
  buf.u32[0] = ch.sip;
  buf.u32[1] = ch.dip;
  buf.u32[2] = ch.sport | ((uint32_t)ch.dport << 16);
  // MurmurHash3 with per-switch seed (= node ID)
  uint32_t idx = EcmpHash(buf.u8, 12, m_ecmpSeed) % nexthops.size();
  return nexthops[idx];  // Deterministic path for this flow
}
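The key property is determinism: the same 5-tuple always lands on the same next hop, while changing the source port can move a flow to a different path. A sketch with a stand-in mixing function in place of MurmurHash3 (illustrative only):

```python
def ecmp_pick(sip, dip, sport, dport, seed, n_nexthops):
    """Deterministic next-hop choice from the 5-tuple (toy hash, not MurmurHash3)."""
    h = seed
    for field in (sip, dip, sport, dport):
        h = (h * 1000003 ^ field) & 0xFFFFFFFF  # simple 32-bit mix per field
    return h % n_nexthops

a = ecmp_pick(0x0A000001, 0x0A000002, 10000, 100, seed=7, n_nexthops=4)
b = ecmp_pick(0x0A000001, 0x0A000002, 10000, 100, seed=7, n_nexthops=4)
print(a == b)  # same flow, same path — always True
```

This determinism is also ECMP's weakness: two elephant flows whose tuples happen to hash to the same uplink will share it for their entire lifetime, which is the hash-collision effect discussed later.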
void qp_finish(FILE *fout, Ptr<RdmaQueuePair> q) {
  uint64_t standalone_fct = base_rtt
      + total_bytes * 8000000000lu / bandwidth;  // Ideal FCT (no congestion)
  fprintf(fout, "%08x %08x %u %u %lu %lu %lu %lu\n",
          q->sip, q->dip,                   // Source & dest IP (hex)
          q->sport, q->dport,               // Ports
          q->m_size,                        // Data bytes (= KV cache size)
          q->startTime,                     // Start time (ns)
          Simulator::Now() - q->startTime,  // Actual FCT (ns)
          standalone_fct);                  // Ideal FCT (ns, zero-congestion)
}
// FCT ratio = actual_fct / ideal_fct → measures congestion impact
// For PD disagg: this directly determines pd_p2p_comm_time
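The ideal-FCT formula is easy to recompute by hand (same units as qp_finish(): bytes, bits/s, nanoseconds):

```python
def ideal_fct_ns(base_rtt_ns, total_bytes, bandwidth_bps):
    """standalone_fct = base_rtt + bytes * 8e9 / bandwidth, in ns."""
    return base_rtt_ns + total_bytes * 8_000_000_000 // bandwidth_bps

# 64 MB KV-cache chunk over 100 Gbps with a 10 µs base RTT
ideal = ideal_fct_ns(10_000, 64 * 2**20, 100_000_000_000)
print(ideal)  # 5,378,709 ns ≈ 5.38 ms
```

An actual FCT of, say, 16 ms against this ideal gives an FCT ratio of ~3 — squarely in the congestion-inflation range discussed below.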
| Layer | Mechanism | Implementation | Key Parameters |
|---|---|---|---|
| L4 Transport | RoCEv2 RDMA with QP state | rdma-hw.cc, rdma-queue-pair.h | PACKET_PAYLOAD_SIZE=9000, HAS_WIN=1 |
| L4 Congestion Control | DCQCN (default) / HPCC / TIMELY | rdma-hw.cc cnp_received_mlx() | CC_MODE=1, EWMA_GAIN=1/256, RATE_AI=50Mb/s |
| L3 ECN Marking | Probabilistic RED at switch egress | switch-mmu.cc ShouldSendCN() | kmin/kmax per link speed, pmax=0.2 |
| L3 Routing | Per-flow ECMP (MurmurHash3) | switch-node.cc GetOutDev() | 5-tuple hash, seed=node_id |
| L2 Flow Control | PFC with dynamic threshold | switch-mmu.cc, qbb-net-device.h | 8 queues, 3KB hysteresis, 32MB buffer |
| L1 Physical | Point-to-point links with configurable BW/latency | Topology file | NVLink 2880Gbps, NIC 100-400Gbps |
In analytical mode, a transfer costs simply size / bandwidth. The NS-3 backend captures effects that analytical mode misses: (1) DCQCN rate ramp-up — new QPs start at line rate but quickly converge when competing; (2) ECMP hash collisions — two KV transfers to the same decode node may collide on a PSW uplink; (3) PFC cascading — a congested decode node can PFC-pause the entire upstream path; (4) cross-traffic interference — TP AllReduce on the same rails competes with PD KV transfers. These effects can cause 2-5× FCT inflation compared to the ideal size/BW estimate, making NS-3 essential for accurate PD disagg analysis.
When a receiver detects an out-of-order sequence number, it generates a NACK packet (protocol 0xFD). The sender retransmits from the NACKed sequence. This provides a reliable transport layer beneath congestion control.
Key parameters from SimAI.conf:
- L2_CHUNK_SIZE = 4000 — chunk size for reliability (bytes)
- L2_ACK_INTERVAL = 1 — ACK every N chunks
- L2_BACK_TO_ZERO = 0 — Go-Back-N mode (0=selective, 1=go-back-to-zero)

// rdma-hw.cc — ReceiverCheckSeq()
int RdmaHw::ReceiverCheckSeq(uint32_t seq, Ptr<RdmaRxQueuePair> q, uint32_t size) {
  uint32_t expected = q->ReceiverNextExpectedSeq;
  if (seq == expected) {
    q->ReceiverNextExpectedSeq = expected + size;
    // Check whether an ACK needs to be sent
    if (m_ack_interval == 0) return 1;  // ACK
    else return 5;                      // delayed ACK
  } else if (seq > expected) {
    // Gap detected → generate NACK
    return 2;                           // NACK
  } else {
    // Duplicate or retransmitted packet
    return 4;                           // duplicate
  }
}
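The branch logic maps cleanly onto receiver actions; a minimal restatement (return-code names follow the comments above — the full retransmission machinery is omitted):

```python
ACK, NACK, DUPLICATE, DELAYED_ACK = 1, 2, 4, 5

def receiver_check_seq(seq, expected, size, ack_interval=1):
    """Mirror of ReceiverCheckSeq's branches; returns (action, new_expected)."""
    if seq == expected:
        action = ACK if ack_interval == 0 else DELAYED_ACK
        return action, expected + size      # in-order: advance expected seq
    if seq > expected:
        return NACK, expected               # gap: NACK, expected unchanged
    return DUPLICATE, expected              # stale retransmission

print(receiver_check_seq(0, 0, 1000))        # in-order
print(receiver_check_seq(2000, 1000, 1000))  # gap -> NACK
```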
SimAI has a dedicated NVSwitchNode class (separate from regular SwitchNode) for intra-node NVLink routing. This enables distinct routing paths for GPU-to-GPU traffic within the same server versus inter-server traffic through NICs.
- m_rtTable_nxthop_nvswitch for next-hop routing through the NVSwitch
- m_gpus_per_server determines which GPUs are in the same server
- With nvls_enable=true on a QP, packets route through the NVSwitch instead of NIC→ASW
- SwitchAsHostSend() handles the NVLS reply path

// rdma-hw.h
std::unordered_map<uint32_t, std::vector<int>> m_rtTable; // Normal routing
std::unordered_map<uint32_t, std::vector<int>> m_rtTable_nxthop_nvswitch; // NVSwitch routing
uint32_t m_gpus_per_server; // Determines intra-node boundary
// rdma-hw.cc — route selection
if (nvls_enable && IsInSameServer(sip, dip)) {
// Use NVSwitch routing table
nexthops = m_rtTable_nxthop_nvswitch[dip];
} else {
// Use normal ECMP routing
nexthops = m_rtTable[dip];
}
SimAI uses both window-based and rate-based flow control simultaneously. A packet is sent only if both conditions are met: (1) within window, AND (2) within rate limit.
- HAS_WIN=1 — enables window-based control
- VAR_WIN=1 — enables a variable window that scales with the current rate: effective_window = m_win × (current_rate / max_rate)
- IsWinBound() checks if on-the-fly bytes exceed the window

// rdma-queue-pair.cc
bool RdmaQueuePair::IsWinBound() {
  uint64_t on_the_fly = snd_nxt - snd_una;
  uint64_t win;
  if (m_var_win) {
    win = m_win * m_rate.GetBitRate() / m_max_rate.GetBitRate();
    if (win == 0) win = 1;
  } else {
    win = m_win;
  }
  return on_the_fly >= win;
}
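The variable window couples the two mechanisms: as DCQCN lowers the rate, the in-flight byte budget shrinks proportionally. A numeric restatement (window and rate values illustrative):

```python
def effective_window(m_win, rate_bps, max_rate_bps, var_win=True):
    """effective_window = m_win * (current_rate / max_rate), floored at 1 byte."""
    if not var_win:
        return m_win
    return max(m_win * rate_bps // max_rate_bps, 1)

def is_win_bound(snd_nxt, snd_una, win):
    """True when on-the-fly bytes have reached the window."""
    return (snd_nxt - snd_una) >= win

# BDP window of 512 KB; DCQCN has throttled the QP to 25% of line rate
w = effective_window(512 * 1024, 100_000_000_000, 400_000_000_000)
print(w)  # 131072 bytes — only a quarter of the BDP may be in flight
```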
Beyond DCQCN (CC_MODE=1), SimAI implements three more congestion control algorithms, each using different network signals:
Note in rdma-hw.cc that only modes 1, 3, 7, 8, and 10 have actual if (m_cc_mode == N) branches — the gaps in the numbering are historical, not omissions in our analysis.
HPCC (CC_MODE=3) — Uses INT (In-Network Telemetry) headers. Each switch adds an IntHop record (timestamp, bytes, queue length, line rate). The sender uses precise utilization info to set rate:
// IntHop structure (int-header.h)
struct IntHop {
  uint32_t time : 24;      // Timestamp (ns)
  uint32_t bytes : 20;     // Bytes transmitted since last sample
  uint32_t qlen : 17;      // Queue occupancy
  uint32_t lineRate : 3;   // Encoded link rate
};
// Max 5 hops per packet (IntHeader::maxHop = 5)
// Target utilization: m_targetUtil = 0.95
TIMELY (CC_MODE=7) — RTT-based, no ECN required:
// Parameters (rdma-hw.h)
double m_tmly_alpha; // Rate decrease factor
double m_tmly_beta; // Rate increase factor
uint64_t m_tmly_TLow; // Low RTT threshold
uint64_t m_tmly_THigh; // High RTT threshold
uint64_t m_tmly_minRtt; // Minimum RTT tracking
DCTCP (CC_MODE=8) — ECN-based with additive increase:
DataRate m_dctcp_rai; // Additive increase = 1000 Mb/s (from SimAI.conf DCTCP_RATE_AI)
Comparison of all supported congestion control algorithms:
| CC Algorithm | Signal | Rate Decrease | Rate Increase | Use Case |
|---|---|---|---|---|
| DCQCN (1) | ECN via CNP | α×EWMA multiplicative | AI/HAI additive | Default, RoCEv2 |
| HPCC (3) | INT telemetry | Precise utilization-based | Target 95% utilization | High precision |
| TIMELY (7) | RTT measurement | RTT > THigh | RTT < TLow | No switch support |
| DCTCP (8) | ECN marking | ECN-proportional | Additive (1Gbps) | TCP-like |
| HPCC-PINT (10) | Probabilistic INT | Sampled utilization | Same as HPCC | Reduced overhead |
When multiple QPs compete for the same NIC, packets are scheduled via weighted round-robin. ACK/NACK packets always receive absolute priority.
// qbb-net-device.cc — GetNextQindex()
int QbbNetDevice::GetNextQindex(bool paused[]) {
  // 1. Check high-priority ACK queue first
  if (m_ackQ.size() > 0) return -1;  // ACK has absolute priority
  // 2. Round-robin over QPs, skipping paused priorities
  for (int i = 1; i <= m_queue->m_qpGrp->GetN(); i++) {
    int idx = (m_qpidx + i) % m_queue->m_qpGrp->GetN();
    Ptr<RdmaQueuePair> qp = m_queue->m_qpGrp->Get(idx);
    if (!paused[qp->m_pg]            // Not PFC-paused
        && qp->GetBytesLeft() > 0    // Has data to send
        && !qp->IsWinBound()         // Within window
        && qp->m_nextAvail <= now) { // Rate-limit timer expired
      m_qpidx = idx;
      return idx;
    }
  }
  return -2;  // Nothing to send
}
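The four eligibility checks can be restated as a standalone predicate (an illustrative model of the loop above, not the NS-3 code):

```python
def next_qp_index(qps, paused, last_idx, now):
    """qps: list of dicts with keys pg, bytes_left, win_bound, next_avail.
    Returns the next eligible QP index round-robin, or -2 if none."""
    n = len(qps)
    for i in range(1, n + 1):
        idx = (last_idx + i) % n        # resume after the last-served QP
        qp = qps[idx]
        if (not paused[qp['pg']]        # not PFC-paused
                and qp['bytes_left'] > 0
                and not qp['win_bound']
                and qp['next_avail'] <= now):
            return idx
    return -2                           # nothing eligible to send

qps = [{'pg': 3, 'bytes_left': 0,   'win_bound': False, 'next_avail': 0},
       {'pg': 3, 'bytes_left': 100, 'win_bound': False, 'next_avail': 0}]
print(next_qp_index(qps, paused=[False] * 8, last_idx=0, now=0))  # 1
```

Starting the scan at last_idx + 1 is what makes the scheduling round-robin rather than strict-priority across QPs.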
- Finished QPs (GetBytesLeft()=0) are automatically cleaned up

Large flows can be split across multiple RDMA Queue Pairs for parallel transmission. Each QP gets a different source port, resulting in a different ECMP hash and potentially a different physical path. This is how SimAI can simulate multi-path load balancing for elephant flows.
#define _QPS_PER_CONNECTION_ 1  // Default: 1 QP per flow (configurable)

void SendFlow(int src, int dst, uint64_t maxPacketCount, ...) {
  uint64_t perQP = (maxPacketCount + _QPS_PER_CONNECTION_ - 1) / _QPS_PER_CONNECTION_;
  uint64_t remaining = maxPacketCount;
  for (int i = 0; i < _QPS_PER_CONNECTION_; i++) {
    uint64_t thisQP = min(perQP, remaining);
    remaining -= thisQP;
    uint32_t port = portNumber[src][dst]++;  // Unique port → unique ECMP hash
    // Each QP gets a different src port → potentially a different ECMP path
    RdmaClientHelper client(pg, sip, dip, port, dport, thisQP, ...);
  }
}
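The per-QP split is ceiling division; with _QPS_PER_CONNECTION_ > 1 an elephant flow spreads across several source ports, and hence potentially several ECMP paths (arithmetic mirroring SendFlow above):

```python
def split_flow(total_packets, qps_per_connection):
    """Ceiling-divide packets across QPs; the last QP takes the remainder."""
    per_qp = (total_packets + qps_per_connection - 1) // qps_per_connection
    remaining, sizes = total_packets, []
    for _ in range(qps_per_connection):
        this_qp = min(per_qp, remaining)
        sizes.append(this_qp)
        remaining -= this_qp
    return sizes

print(split_flow(10, 4))  # [3, 3, 3, 1] — four QPs, four distinct ECMP hashes
```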
SimAI provides six monitoring output files for observing network behavior at different granularities, plus support for link failure simulation:
| File | Content | Interval |
|---|---|---|
| FCT_OUTPUT_FILE | Flow completion times (actual vs ideal) | Per flow |
| QLEN_MON_FILE | Switch queue occupancy (per port, per queue) | 10 ms |
| BW_MON_FILE | Host-level transmit bytes | 10 ms |
| RATE_MON_FILE | Per-QP sending rate (DCQCN-controlled) | 100 μs |
| CNP_MON_FILE | CNP reception count per QP | 100 μs |
| PFC_OUTPUT_FILE | PFC PAUSE/RESUME events | Per event |
Link failure simulation:
# SimAI.conf — Link failure configuration
# Format: LINK_DOWN <timestamp> <node_A> <node_B>
LINK_DOWN 0 0 0 # 0 0 0 = no failure
# Can simulate link failures at specified times to study resilience
# Example: LINK_DOWN 1000000 12 48 — fail link between node 12 and 48 at t=1ms
Correlate FCT_OUTPUT_FILE (macro-level flow completion) with RATE_MON_FILE (micro-level rate dynamics) to diagnose why specific collective operations experience high latency. High PFC counts in PFC_OUTPUT_FILE combined with queue buildup in QLEN_MON_FILE indicate congestion hotspots that DCQCN alone cannot resolve.
SimAI includes a Python topology generator (gen_Topo_Template.py) that produces network descriptions for several real-world datacenter topologies. These topologies define the physical connectivity between GPUs, NVSwitches, ToR switches, aggregate switches (ASW), and pod switches (PSW).
# gen_Topo_Template.py
# Supported topologies:
# - Spectrum-X: Rail-optimized single ToR (4096 GPUs default)
# - AlibabaHPN Single-Plane: Dual ToR (15360 GPUs)
# - AlibabaHPN Dual-Plane: Dual ToR with dual plane (15360 GPUs)
# - DCN+ Single-ToR: Single ToR topology (512 GPUs)
# - DCN+ Dual-ToR: Dual ToR topology (512 GPUs)
# Key parameters:
# gps: GPU per server (typically 8)
# nvbw: NVLink bandwidth (Gbps), e.g. 900 for H100
# bw: NIC-to-ASW bandwidth (Gbps), e.g. 100, 200, 400
# nl: NVLink latency (ns), typically 1000
# l: NIC latency (ns), typically 1000
# Output format (plain text):
# Line 1: <total_nodes> <gpu_per_server> <nv_switch_num>
# <switch_nodes> <links> <gpu_type>
# Line 2: <switch_node_ids> (space-separated)
# Line 3+: <src> <dst> <bandwidth> <latency> <error_rate>
parser = argparse.ArgumentParser()
parser.add_argument('-topo', type=str,
                    choices=['Spectrum-X', 'AlibabaHPN-SinglePlane',
                             'AlibabaHPN-DualPlane', 'DCN-SingleToR',
                             'DCN-DualToR'])
parser.add_argument('-g', type=int,
                    help='Total number of GPUs')
parser.add_argument('-gt', type=str,
                    choices=['A100', 'A800', 'H100', 'H800'])
parser.add_argument('-bw', type=str,
                    help='NIC bandwidth, e.g. 100Gbps')
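The three-section file layout described above can be sketched as a minimal writer (write_topology and its field values are a hypothetical emitter for illustration, not part of gen_Topo_Template.py):

```python
def write_topology(path, total_nodes, gpus_per_server, nv_switches,
                   switch_ids, links, gpu_type):
    """links: list of (src, dst, bandwidth, latency, error_rate) tuples."""
    with open(path, 'w') as f:
        # Line 1: counts header
        f.write(f"{total_nodes} {gpus_per_server} {nv_switches} "
                f"{len(switch_ids)} {len(links)} {gpu_type}\n")
        # Line 2: switch node IDs
        f.write(" ".join(str(s) for s in switch_ids) + "\n")
        # Line 3+: one link per line
        for src, dst, bw, lat, err in links:
            f.write(f"{src} {dst} {bw} {lat} {err}\n")

# Tiny example: 8 GPUs + 1 NVSwitch + 1 ToR switch, one link listed
write_topology("mini_topo.txt", 10, 8, 1, [9],
               [(0, 9, "400Gbps", "1000ns", 0)], "H100")
```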
The topology file defines every link in the network with its bandwidth (Gbps), latency (ns), and error rate. The NS-3 backend reads this file to construct the simulation network. Different topologies lead to dramatically different congestion patterns, especially for AllToAll operations in MoE training.
The topology generator gen_Topo_Template.py implements five distinct topology functions that map to three named architecture families: Spectrum-X, AlibabaHPN, and DCN+. The critical architectural distinction is between rail-optimized topologies (where GPUi on every server connects to a dedicated ASWi) and non-rail-optimized topologies (where all GPUs in a segment share the same ASW). This choice profoundly impacts collective communication performance.
Each GPU connects to its own rail-aligned ASW based on its index within the server. With 8 GPUs per server, there are 8 ASW per segment — one per GPU position. GPUi → ASW[group × gps + (i % gps)]. Every ASW connects to ALL PSW in a full mesh. This is NVIDIA's recommended topology for AllReduce-heavy workloads.
# Key parameters (defaults):
asw_switch_num_per_segment = gpu_per_server # 8 ASW per segment (one per rail)
gpu_count = 4096 # Default GPU count
nic_bandwidth = 400Gbps # NIC → ASW bandwidth
nvlink_bw = 2880Gbps # NVLink bandwidth
# Wiring logic:
# GPU_i → NVSwitch (NVLink, intra-node)
# GPU_i → ASW[group * gps + (i % gps)] (NIC, rail-aligned)
# ASW_j → ALL PSW (full mesh uplinks)
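The rail-alignment rule is plain index arithmetic; a quick standalone check (assuming 8 GPUs per server, as above):

```python
def rail_asw(gpu_id, gps, group):
    """GPU_i connects to ASW[group * gps + (i % gps)] — its rail switch."""
    return group * gps + (gpu_id % gps)

# GPU position 3 on every server in group 0 shares ASW index 3 (rail 3),
# so same-position GPUs reach each other in one ASW hop
print([rail_asw(g, 8, 0) for g in (3, 11, 19)])  # [3, 3, 3]
```

This is why rail-optimized fabrics favor Ring AllReduce: NCCL's rings pair same-position GPUs across servers, keeping most inter-node traffic on a single rail switch.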
Rail-optimized like Spectrum-X, but each GPU connects to two ASW switches (dual-homed), providing link redundancy. The ASW are split into two sets (ASW-A and ASW-B), but both connect to the same PSW pool (single plane). With 8 GPUs per server, there are gps × 2 = 16 ASW per segment.
# Key parameters (defaults):
asw_switch_num_per_segment = gpu_per_server * 2 # 16 ASW per segment
gpu_count = 15360 # Default GPU count
nic_bandwidth = 200Gbps # NIC → ASW bandwidth
asw_sets = 2 # asw_switch_1[] and asw_switch_2[]
psw_sets = 1 # Single PSW pool (both ASW sets → same PSW)
# Wiring logic:
# GPU_i → ASW-A[i % gps] (NIC-1, rail-aligned)
# GPU_i → ASW-B[i % gps] (NIC-2, rail-aligned)
# ASW-A[*] → ALL PSW (full mesh)
# ASW-B[*] → ALL PSW (full mesh)
The most fault-tolerant topology. Like HPN Single-Plane, each GPU is dual-homed to ASW-A and ASW-B. But the PSW layer is also split: ASW-A connects only to PSW-A, and ASW-B connects only to PSW-B — forming two completely independent network planes. If one entire plane fails, the other can still carry all traffic. The link formula uses psw_switch_num / pod_num / 2 (half PSW per plane).
# Key parameters (defaults):
asw_switch_num_per_segment = gpu_per_server * 2 # 16 ASW per segment
gpu_count = 15360
psw_sets = 2 # PSW split into psw_switch_1[] and psw_switch_2[]
# Wiring logic:
# GPU_i → ASW-A[i % gps] (NIC-1)
# GPU_i → ASW-B[i % gps] (NIC-2)
# ASW-A[*] → PSW-A[*] only (Plane A)
# ASW-B[*] → PSW-B[*] only (Plane B)
# Two independent planes — no cross-plane links
The simplest topology with no rail optimization. All GPUs in a segment connect to the same single ASW. There is only 1 ASW per segment, meaning 8 GPUs (or more, controlled by nics_per_aswitch) share the same top-of-rack switch. This eliminates rail structure entirely: no GPU position has a dedicated rail, so all traffic patterns contend for the same switch uplinks. The ASW connects to all PSW in a full mesh.
# Key parameters (defaults):
asw_switch_num_per_segment = 1 # Only 1 ASW per segment (no rails!)
gpu_count = 32+ # Small clusters
# Wiring logic:
# ALL GPU in segment → SAME ASW (group_account tracks nics_per_aswitch)
# ASW → ALL PSW (full mesh uplinks)
# No rail alignment — G0, G1, ..., G7 all share the same switch
Non-rail-optimized with dual-ToR redundancy. There are 2 ASW per segment (ASW-1 and ASW-2), but they are not rail-aligned — all GPUs in the segment connect to both ASW switches. Both ASW sets connect to the same PSW pool (single plane). This provides link redundancy for MoE workloads without rail structure.
# Key parameters:
asw_switch_num_per_segment = 2 # 2 ASW per segment (not rail-aligned)
# Wiring logic:
# ALL GPU in segment → ASW-1 (NIC-1, no rail alignment)
# ALL GPU in segment → ASW-2 (NIC-2, no rail alignment)
# ASW-1 → ALL PSW (full mesh)
# ASW-2 → ALL PSW (full mesh, same pool)
# CLI argument → Function mapping:
#   'Spectrum-X'             → Rail_Opti_SingleToR()
#   'AlibabaHPN-SinglePlane' → Rail_Opti_DualToR_SinglePlane()
#   'AlibabaHPN-DualPlane'   → Rail_Opti_DualToR_DualPlane()
#   'DCN-SingleToR'          → No_Rail_Opti_SingleToR()
#   'DCN-DualToR'            → No_Rail_Opti_DualToR()
# The key variable that distinguishes rail vs non-rail:
#   Rail-optimized:     asw_switch_num_per_segment = gpu_per_server (or ×2)
#                       → GPU_i connects to ASW[i % gps] (rail-aligned)
#   Non-rail-optimized: asw_switch_num_per_segment = 1 (or 2)
#                       → All GPUs in segment share same ASW(s)
The NS-3 simulation backend is configured through a combination of a configuration file (SimAI.conf) and environment variables. The configuration file controls network protocol parameters, while environment variables control simulation-level behavior.
# Congestion control
ENABLE_QCN 1
USE_DYNAMIC_PFC_THRESHOLD 1
CC_MODE 1 # 0=disabled, 1=DCQCN
# Packet size
PACKET_PAYLOAD_SIZE 9000 # Jumbo frames
# PFC (Priority Flow Control)
PAUSE_TIME 5
L2_WAIT_FOR_ACK 0
# Switch buffer
BUFFER_SIZE 16777216 # 16MB per port
# RED/ECN marking thresholds
KMIN 1500
KMAX 100000
PMAX 0.2
# ECN
ECN_ENABLED 1
DCTCP_GAIN 0.00390625
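The KMIN/KMAX/PMAX triple defines a RED-style marking curve: no marking below KMIN, certain marking above KMAX, and a linear ramp up to PMAX in between. A minimal sketch of the implied marking probability (the function name is ours; the backend's actual implementation may differ in details):

```python
def ecn_mark_probability(queue_bytes, kmin=1500, kmax=100000, pmax=0.2):
    """RED-style ECN marking curve implied by KMIN/KMAX/PMAX."""
    if queue_bytes <= kmin:
        return 0.0          # queue shallow: never mark
    if queue_bytes >= kmax:
        return 1.0          # queue deep: always mark
    # linear ramp from 0 at KMIN up to PMAX at KMAX
    return pmax * (queue_bytes - kmin) / (kmax - kmin)
```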
# Packet sending latency (microseconds)
AS_SEND_LAT=6
# Enable NVLS algorithm
AS_NVLS_ENABLE=1
# Enable PXN optimization
AS_PXN_ENABLE=0
# Logging level
AS_LOG_LEVEL=INFO
# GPU type override
AS_GPU_TYPE=H100
# Number of NVLink channels
AS_NV_CHANNELS=8
# Ring channel count
AS_RING_CHANNELS=2
# Network bandwidth override (Gbps)
AS_NET_BW=400
# Simulation thread count
AS_SIM_THREADS=16
The ENABLE_QCN and USE_DYNAMIC_PFC_THRESHOLD settings are particularly important for accuracy. QCN (Quantized Congestion Notification) lets congested switches signal senders to slow down before buffers overflow; in RoCEv2 this takes the form of ECN marks that receivers echo back to senders as Congestion Notification Packets (the DCQCN feedback loop). PFC (Priority Flow Control) provides the lossless Ethernet semantics required by RDMA. Together, they model the complex feedback loops that determine real-world network performance.
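A simplified sketch of the DCQCN sender reaction these parameters feed, where DCTCP_GAIN plays the role of the gain g. The real algorithm involves timers, recovery phases, and a precise update order that are omitted here:

```python
def dcqcn_on_cnp(rate_gbps, alpha, g=0.00390625):
    """Simplified DCQCN reaction to a received CNP:
    alpha tracks the congestion estimate (EWMA with gain g),
    and the rate is cut multiplicatively by alpha/2."""
    alpha = (1 - g) * alpha + g              # congestion estimate rises
    rate_gbps = rate_gbps * (1 - alpha / 2)  # multiplicative decrease
    return rate_gbps, alpha
```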
SimAI uses a unified build script that compiles the appropriate backend based on a command-line flag. Each backend produces a separate binary with the same command-line interface, making it easy to switch between simulation modes:
# Analytical backend (fastest build + fastest simulation)
./scripts/build.sh -c analytical
# Output: bin/SimAI_analytical
# Dependencies: C++ compiler only
# Build time: ~30 seconds
# NS-3 Simulation backend (full packet-level)
./scripts/build.sh -c ns3
# Output: bin/SimAI_simulator
# Dependencies: NS-3 library (auto-built)
# Build time: ~5 minutes (first build)
# Physical backend (real RDMA traffic)
./scripts/build.sh -c phy
# Output: bin/SimAI_phynet
# Dependencies: libibverbs, librdmacm
# Build time: ~1 minute
SimAI_analytical — Lightweight, no external dependencies beyond a C++17 compiler. Ideal for CI/CD pipelines and quick iteration on workload configurations.
SimAI_simulator — Full NS-3 integration. The build script automatically fetches and compiles the NS-3 library with SimAI's custom RDMA/QBB modules.
SimAI_phynet — Requires RDMA-capable NICs and the libibverbs / librdmacm libraries. Must be deployed on actual cluster nodes.
Here is a complete workflow for running a communication simulation from topology generation through execution to output analysis. This example simulates a 128-GPU Spectrum-X cluster with A100 GPUs running a micro AllReduce benchmark:
# Generate a 128-GPU Spectrum-X topology with A100s and 100Gbps NICs
python3 gen_Topo_Template.py \
    -topo Spectrum-X \
    -g 128 \
    -gt A100 \
    -bw 100Gbps
# Output: Spectrum-X_128g_8gps_100Gbps_A100
# Contains: 128 GPUs, 64 NVSwitches, ToR/ASW/PSW switches, all links
# microAllReduce.txt — A simple AllReduce benchmark
# Format: num_passes op_type data_size group_type ...
1 ALLREDUCE 1048576 TP    # 1MB AllReduce on TP group
1 ALLREDUCE 67108864 DP   # 64MB AllReduce on DP group
1 ALLTOALL 16777216 EP    # 16MB AllToAll on EP group
# Run with NS-3 backend
# -t: simulation threads, -w: workload file, -n: topology, -c: NS-3 configuration
AS_SEND_LAT=6 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \
    -t 16 \
    -w ./microAllReduce.txt \
    -n ./Spectrum-X_128g_8gps_100Gbps_A100 \
    -c ./SimAI.conf

# Or run with analytical backend for quick results
AS_SEND_LAT=6 ./bin/SimAI_analytical \
    -w ./microAllReduce.txt \
    -n ./Spectrum-X_128g_8gps_100Gbps_A100
# Output files (NS-3 mode):
# - fct.txt:       Flow Completion Times per flow
# - bandwidth.txt: Per-link bandwidth utilization over time
# - queue.txt:     Switch queue occupancy over time
# - rate.txt:      Sending rate per flow over time
# - cnp.txt:       Congestion Notification Packet counts
# - pfc.txt:       PFC pause frame events
# Key metric: total collective completion time
# Compare analytical vs NS-3 to assess congestion impact
For full training iteration simulation, combine the workload generator (AICB) with SimAI. AICB generates realistic computation + communication interleaving patterns, and SimAI handles the network simulation:
# Step 1: Generate workload with AICB
python3 -m aicb.main \
    --model_name llama_70b \
    --tp 8 --dp 16 --pp 1 \
    --world_size 128 \
    --output_dir ./workloads/

# Step 2: Feed into SimAI
AS_SEND_LAT=6 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \
    -t 16 \
    -w ./workloads/llama_70b_tp8_dp16.txt \
    -n ./Spectrum-X_128g_8gps_100Gbps_H100 \
    -c ./SimAI.conf

# The simulation interleaves compute phases (estimated analytically)
# with communication phases (simulated by NS-3)
One of SimAI's most valuable use cases is comparing different network topologies for the same workload. By generating multiple topology files and running the same workload against each, you can quantify the performance impact of different network designs:
# Generate three topologies for comparison
python3 gen_Topo_Template.py -topo Spectrum-X -g 1024 -gt H100 -bw 400Gbps
python3 gen_Topo_Template.py -topo AlibabaHPN-SinglePlane -g 1024 -gt H100 -bw 400Gbps
python3 gen_Topo_Template.py -topo AlibabaHPN-DualPlane -g 1024 -gt H100 -bw 400Gbps

# Run the same workload on each
for topo in Spectrum-X AlibabaHPN-SinglePlane AlibabaHPN-DualPlane; do
    AS_SEND_LAT=6 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \
        -t 16 \
        -w ./workloads/llama_70b.txt \
        -n ./${topo}_1024g_8gps_400Gbps_H100 \
        -c ./SimAI.conf \
        -o ./results/${topo}/
done

# Compare FCT distributions across topologies
python3 analyze_results.py --dirs ./results/
Before flows can be generated, SimCCL must construct ring and tree channels. A channel is a specific ordering of ranks that defines the communication pattern. Multiple channels can be used simultaneously to increase bandwidth utilization. The number of channels depends on the GPU type and NVLink topology:
// Ring channels are built from the NVLink topology
// Each channel is a permutation of ranks optimized for
// the physical NVLink connectivity
void MockNcclGroup::buildRingChannels() {
int nChannels = getenv("AS_RING_CHANNELS")
? atoi(getenv("AS_RING_CHANNELS"))
: default_ring_channels;
for (int c = 0; c < nChannels; c++) {
std::vector<int> ring;
// Build ring following NVLink adjacency
// Ensures each hop uses a physical NVLink
for (int r = 0; r < nRanks; r++) {
ring.push_back(
gp_info.Ranks[(r + c) % nRanks]);
}
ringchannels.push_back(ring);
}
}
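The rotation logic above can be sketched in a few lines of Python. This is illustrative only: it mirrors the modulo arithmetic, not the real NVLink-aware channel builder:

```python
def build_ring_channels(ranks, n_channels):
    """Channel c is the rank list rotated by c positions,
    spreading traffic across multiple rings."""
    n = len(ranks)
    return [[ranks[(r + c) % n] for r in range(n)]
            for c in range(n_channels)]

# build_ring_channels([0, 1, 2, 3], 2)
# → [[0, 1, 2, 3], [1, 2, 3, 0]]
```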
// Tree channels for hierarchical algorithms
void MockNcclGroup::buildTreeChannels() {
// Build binary tree with root at rank 0
// Child(i) = 2*i+1, 2*i+2
// Used for inter-node reduction in NVLS-Tree
}
While AllReduce is the most common collective, SimCCL also decomposes AllGather and ReduceScatter independently. AllGather is used extensively in tensor parallelism (e.g., gathering weight shards for the forward pass), while ReduceScatter is used for gradient reduction with sharded optimizers (ZeRO Stage 2+):
Each GPU starts with 1/N of the data and ends with all data. N-1 ring steps, each sending data_size/N bytes. Total data transferred per GPU: (N-1)/N * data_size.
Each GPU starts with full data and ends with 1/N of the reduced result. N-1 ring steps with reduction at each hop. Same bandwidth as AllGather but includes reduce operations.
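The byte counts above follow directly from the ring schedule; a one-line sketch (the function name is ours, for illustration):

```python
def ring_bytes_per_gpu(data_size, n):
    """Bytes each GPU sends in a ring AllGather or ReduceScatter:
    N-1 steps, each moving data_size/N bytes."""
    return (n - 1) * (data_size // n)

# 1 MiB AllGather across 8 GPUs: each GPU sends 7/8 of the data
# ring_bytes_per_gpu(1048576, 8) → 917504
```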
AllToAll is the most network-intensive collective, used in Mixture-of-Experts (MoE) models to dispatch tokens to the appropriate expert. Unlike AllReduce, where data flows in one direction around a ring, AllToAll requires every GPU to send unique data to every other GPU. This creates an all-pairs pattern of N×(N−1) point-to-point flows that can severely stress the network fabric:
std::map<int, std::shared_ptr<FlowModels>>
MockNcclGroup::genAllToAllRingFlowModels(
GroupType type, int rank, uint64_t data_size) {
int nranks = gp_info.nRanks;
// Each GPU sends data_size/nranks to each other GPU
uint64_t per_peer_size = data_size / nranks;
for (int peer = 0; peer < nranks; peer++) {
if (peer == rank) continue;
// Direct P2P send to each peer
SingleFlow flow(
flow_id++, rank, gp_info.Ranks[peer],
per_peer_size,
{}, // No dependencies (all sends parallel)
{},
{child_ids}, // Barrier after all sends complete
0, peer, nranks, "RING");
}
return flow_models;
}
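The resulting flow set can be sketched independently of the SingleFlow machinery (illustrative Python, not SimCCL API):

```python
def alltoall_flows(nranks, data_size):
    """Enumerate the point-to-point flows of a flat AllToAll:
    every rank sends data_size/nranks to every other rank."""
    per_peer = data_size // nranks
    return [(src, dst, per_peer)
            for src in range(nranks)
            for dst in range(nranks)
            if src != dst]

# 8 ranks → 8 × 7 = 56 simultaneous flows, all crossing the fabric
```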
PXN (Proxy-based cross-Node communication) is an optimization where GPUs within a node use NVLink to forward data to a proxy GPU that has direct NIC access, rather than each GPU accessing the NIC directly. This reduces NIC contention and leverages the high-bandwidth NVLink fabric for intra-node data gathering before sending across the network:
// PXN flow generation: 3 phases
// Phase 1: NVLink gather to proxy GPU (PXN_INIT)
// Phase 2: Network transfer from proxy (NET)
// Phase 3: NVLink scatter from proxy (PXN_FINAL)
if (PXNenable && cross_node_transfer) {
// Step 1: Local GPUs send to proxy via NVLink
for (auto& local_rank : node_ranks) {
if (local_rank == proxy_rank) continue;
flows.push_back(SingleFlow(
id++, local_rank, proxy_rank,
chunk_size, {}, {}, {net_flow_id},
channel, chunk, count, "PXN_INIT"));
}
// Step 2: Proxy sends aggregated data over network
flows.push_back(SingleFlow(
id++, proxy_rank, remote_proxy,
total_chunk_size,
{pxn_init_flow_ids}, // Wait for all local gathers
{}, {pxn_final_ids},
channel, chunk, count, "NET"));
// Step 3: Remote proxy scatters to local GPUs via NVLink
for (auto& remote_rank : remote_node_ranks) {
if (remote_rank == remote_proxy) continue;
flows.push_back(SingleFlow(
id++, remote_proxy, remote_rank,
chunk_size,
{net_flow_id}, // Wait for network transfer
{}, {},
channel, chunk, count, "PXN_FINAL"));
}
}
PXN is controlled by the AS_PXN_ENABLE environment variable. It is most beneficial when the NIC-to-GPU ratio is less than 1:1 (e.g., 4 NICs for 8 GPUs), which is common in many datacenter configurations. The trade-off is increased NVLink traffic in exchange for reduced NIC contention.
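A back-of-the-envelope sketch of the trade-off (purely illustrative accounting, not SimAI code):

```python
def pxn_flow_counts(local_gpus, nics):
    """Rough per-chunk accounting: without PXN each GPU issues its
    own network send; with PXN one proxy per NIC sends the
    aggregated chunk, at the cost of extra NVLink gathers."""
    without_pxn = local_gpus          # one network flow per GPU
    with_pxn = nics                   # one network flow per proxy/NIC
    nvlink_extra = local_gpus - nics  # gathers to proxies over NVLink
    return without_pxn, with_pxn, nvlink_extra

# 8 GPUs sharing 4 NICs: 8 network flows drop to 4,
# in exchange for 4 extra NVLink hops per chunk
```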
Faithfully replicates NCCL's algorithm selection (Ring, Tree, NVLS, NVLS-Tree) and flow decomposition logic. GPU-type-aware: A100 vs H100 use different algorithms.
Every collective is decomposed into a DAG of SingleFlows with explicit prev/child dependencies. This enables fine-grained simulation of overlapping and pipelined communication.
Analytical (fast/low-fi), NS-3 (slow/high-fi), Physical (real hardware). Same workload file works with all three. Choose based on accuracy needs.
Supports Spectrum-X, Alibaba HPN, and DCN+ topologies. GPU-NVSwitch-ToR-ASW-PSW hierarchy. Real datacenter network architectures.