From collective operations to packet-level network simulation — the full communication stack
SimCCL · astra-sim · NS-3

SimAI's communication simulation covers the complete network stack — from high-level collective operations (AllReduce, AllGather) down to packet-level RDMA transport with DCQCN congestion control and PFC flow control. This page provides a comprehensive deep-dive organized into three layers: (1) Collective Decomposition (Sections 2-4): how NCCL algorithms decompose collectives into point-to-point flows using Ring, Tree, and NVLS patterns; (2) Transport & Network Mechanics (Sections 5-6): how flows traverse the network via RoCEv2 RDMA with DCQCN/HPCC/TIMELY congestion control, RED-based ECN marking, PFC back-pressure, and per-flow ECMP routing; (3) Topology & Configuration (Sections 7-10): the five supported datacenter topologies (Spectrum-X, AlibabaHPN, DCN+), NS-3 configuration parameters, and end-to-end usage examples.
Each layer has a well-defined interface to the one below it. Collective operations produce FlowModels (a set of SingleFlow structs). astra-sim consumes these flows and dispatches them to the network backend. The backend reports completion events back to astra-sim, which then triggers dependent flows.
The MockNcclGroup class is the core of SimCCL. It faithfully replicates NCCL's algorithm selection and flow decomposition logic, translating high-level collective operations into point-to-point data flows that can be simulated by the network backend.
SimCCL mirrors the six NCCL algorithm types. Each algorithm defines a different communication pattern for moving data between GPUs:
// NCCL algorithm definitions mirrored in SimCCL
#define NCCL_ALGO_TREE 0 // Tree reduction (hierarchical)
#define NCCL_ALGO_RING 1 // Ring (bandwidth-optimal)
#define NCCL_ALGO_COLLNET_DIRECT 2 // Direct CollNet (switch-assisted)
#define NCCL_ALGO_COLLNET_CHAIN 3 // Chain CollNet (pipelined)
#define NCCL_ALGO_NVLS 4 // NVLink Switching (Hopper+)
#define NCCL_ALGO_NVLS_TREE 5 // NVLS + Tree hybrid
#define NCCL_NUM_ALGORITHMS 6
#define NCCL_NUM_PROTOCOLS 3 // LL, LL128, Simple
Algorithm selection in SimCCL is GPU-type-aware. The system considers the GPU architecture (A100/A800, H100/H800), the group type (TP, DP, PP, EP), the number of ranks, and whether NVLink Switching is available. This mirrors NCCL's real-world decision tree:
ncclInfo* MockNcclGroup::get_algo_proto_info(
    GroupType type, int rank, ComType op, uint64_t data_size) {
  ncclInfo* info = new ncclInfo();
  info->nBytes = data_size;
  info->op = op;
  if (op == All_Reduce && type == TP) {
    if (gpu_type == H100 || gpu_type == H800) {
      if (nRanks >= 8 && NVLSenable)
        info->algorithm = NCCL_ALGO_NVLS;   // Hopper favors NVLS
      else
        info->algorithm = NCCL_ALGO_RING;
    } else if (gpu_type == A100 || gpu_type == A800) {
      info->algorithm = NCCL_ALGO_RING;     // Ampere uses Ring
    }
  } else if (op == All_Reduce && type == DP) {
    info->algorithm = NCCL_ALGO_RING;       // DP always Ring
  } else if (op == All_to_All) {
    info->algorithm = NCCL_ALGO_RING;       // AllToAll uses Ring
  } else if (op == All_Gather) {
    if (type == TP && NVLSenable && nRanks >= 8)
      info->algorithm = NCCL_ALGO_NVLS_TREE;
    else
      info->algorithm = NCCL_ALGO_RING;
  }
  return info;
}
The Ring AllReduce is the most common collective algorithm. It operates in two phases: ReduceScatter (each GPU sends 1/N of its data around the ring, reducing as it goes) and AllGather (each GPU broadcasts its reduced chunk). For N GPUs, this requires 2*(N-1) steps total.
std::map<int, shared_ptr<FlowModels>>
MockNcclGroup::genAllReduceRingFlowModels(
    GroupType type, int rank, uint64_t data_size) {
  int nranks = gp_info.nRanks;
  int chunkcount = 2 * (nranks - 1);   // reduce + broadcast phases
  uint64_t chunksize = data_size / nranks / ringchannels.size();
  // Phase 1: ReduceScatter
  for (int step = 0; step < nranks - 1; step++) {
    for (auto& ring : ringchannels) {
      int src_rank  = ring[(ring_idx + nranks - step) % nranks];
      int dest_rank = ring[(ring_idx + nranks - step + 1) % nranks];
      tmp_result = SingleFlow(
          flow_id, src_rank, dest_rank, chunksize,
          {prev_flow_id},    // depends on previous step
          {},                // no parallel deps
          {child_flow_id},   // next step depends on this
          ring_id, chunk_id,
          chunkcount, "RING");
    }
  }
  // Phase 2: AllGather
  for (int step = 0; step < nranks - 1; step++) {
    for (auto& ring : ringchannels) {
      // Similar flow generation with broadcast semantics
      tmp_result = SingleFlow(
          flow_id, src_rank, dest_rank, chunksize,
          {prev_flow_id}, {}, {child_flow_id},
          ring_id, chunk_id, chunkcount, "RING");
    }
  }
  return flow_models;
}
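The two-phase schedule can be illustrated standalone (a simplified single-channel model written for this page, not SimCCL code): at each ReduceScatter step, every rank forwards the chunk it just reduced to its ring successor.

```python
def ring_reduce_scatter_schedule(nranks):
    """Toy schedule: (step, src, dest, chunk) tuples for one ring channel."""
    schedule = []
    for step in range(nranks - 1):
        for r in range(nranks):
            chunk = (r - step) % nranks  # chunk being forwarded this step
            schedule.append((step, r, (r + 1) % nranks, chunk))
    return schedule

sched = ring_reduce_scatter_schedule(4)
# 4 ranks -> 3 steps x 4 transfers = 12 flows in the ReduceScatter phase;
# the AllGather phase doubles this, giving the 2*(N-1) steps noted above.
print(len(sched))
```

With 8 ranks this yields 56 flows per phase per channel, which is why even a single AllReduce explodes into a large flow DAG.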
NVLS (NVLink Switching) leverages NVSwitch to provide all-to-all NVLink connectivity within a node. Instead of passing data around a ring, all GPUs can simultaneously access a shared memory region through the NVSwitch. The NVLS-Tree variant combines NVSwitch-based intra-node communication with a tree reduction for inter-node communication.
shared_ptr<FlowModels>
MockNcclGroup::genallReduceNVLSTreeFlowModels(
    GroupType type, int rank, uint64_t data_size) {
  // Step 1: NVLS Reduce (intra-node via NVSwitch)
  //   All GPUs within a node reduce their data through
  //   the NVSwitch shared-memory multicast
  generate_flow_model_nvls_tree_allreduce_up(
      rank, data_size, flow_models);
  // Step 2: Tree Reduce (inter-node via NIC)
  //   One representative GPU per node participates in
  //   a tree reduction across nodes
  // Step 3: Tree Broadcast (inter-node via NIC)
  //   Root broadcasts the fully reduced result down the tree
  // Step 4: NVLS Broadcast (intra-node via NVSwitch)
  //   Representative GPU shares the result with all node-local GPUs
  generate_flow_model_nvls_tree_allreduce_down(
      rank, data_size, flow_models);
  return flow_models;
}

// NVLS uses multicast addressing through NVSwitch.
// Each GPU writes to a shared buffer that is atomically
// reduced by the NVSwitch hardware itself.
void MockNcclGroup::generate_flow_model_nvls_tree_allreduce_up(
    int rank, uint64_t data_size,
    shared_ptr<FlowModels> flow_models) {
  // NVLS multicast: each GPU sends to NVSwitch
  for (auto& nvswitch : gp_info.NVSwitchs) {
    SingleFlow flow(flow_id, rank, nvswitch,
                    data_size / nRanks,
                    {}, {}, {child_ids},
                    0, chunk_id, chunk_count, "NVLS");
    flow_models->add(flow);
  }
}
SimCCL models hardware-specific latencies for each algorithm and protocol combination. These base latency values are added to the data transfer time to account for protocol overhead, synchronization, and memory copies:
// Base latency (microseconds) per algorithm per protocol
// Protocols: [LL, LL128, Simple]
static const float baseLat[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS] = {
{6.8, 14.0, 0}, // Tree
{6.6, 14.0, 8.4}, // Ring
{0, 0, 0}, // CollNet Direct
{0, 0, 0}, // CollNet Chain
{0, 0, 23.0}, // NVLS
{0, 0, 0}, // NVLS-Tree
};
// Per-path latencies: [NVLINK, PCI, NET]
// Each indexed by [path][algorithm][protocol]
static float hwLat[3][NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS] = {
// NVLINK path latencies
{ {0.6, 1.25, 28.0}, // Tree over NVLINK
{0.6, 1.9, 3.4}, // Ring over NVLINK
{0, 0, 0}, // CollNet Direct
{0, 0, 0}, // CollNet Chain
{0, 0, 0}, // NVLS
{0, 0, 0} }, // NVLS-Tree
// PCI path latencies
{ {1.0, 1.0, 0}, // Tree over PCI
{1.0, 1.0, 0}, // Ring over PCI
... },
// Network (NIC) path latencies
{ {5.0, 8.5, 0}, // Tree over NET
{2.7, 5.5, 0}, // Ring over NET
... },
};
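Reading the two tables together, a plausible latency model (our interpretation — SimCCL's exact combination logic is not reproduced here) adds the base latency once, plus one hardware-path latency per step. For a Ring AllReduce over NVLink with the Simple protocol:

```python
NCCL_ALGO_TREE, NCCL_ALGO_RING = 0, 1
LL, LL128, SIMPLE = 0, 1, 2  # protocol indices

base_lat = {  # microseconds, from the baseLat table above
    NCCL_ALGO_TREE: [6.8, 14.0, 0.0],
    NCCL_ALGO_RING: [6.6, 14.0, 8.4],
}
hw_lat_nvlink = {  # per-step latency over the NVLINK path, from hwLat
    NCCL_ALGO_TREE: [0.6, 1.25, 28.0],
    NCCL_ALGO_RING: [0.6, 1.9, 3.4],
}

def ring_latency_us(nranks, proto):
    """Hedged model: base latency + one NVLink hop per ring step, 2*(N-1) steps."""
    steps = 2 * (nranks - 1)
    return base_lat[NCCL_ALGO_RING][proto] + steps * hw_lat_nvlink[NCCL_ALGO_RING][proto]

# 8-rank ring, Simple protocol: 8.4 + 14 * 3.4 ≈ 56 µs (latency term only;
# the bandwidth term data_size/busbw is added on top of this)
print(round(ring_latency_us(8, SIMPLE), 1))
```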
The SingleFlow struct is the fundamental unit of communication in SimAI. Every collective operation is decomposed into a directed acyclic graph (DAG) of SingleFlows. Each flow represents a point-to-point data transfer between two ranks, with explicit dependency tracking via prev and child_flow_id fields.
struct SingleFlow {
  int flow_id;                // Unique identifier for this flow
  int src, dest;              // Source and destination rank
  uint64_t flow_size;         // Data size in bytes
  vector<int> prev;           // Dependency flows (must complete first)
  vector<int> parallel_deps;  // Flows that run in parallel
  vector<int> child_flow_id;  // Downstream flows (triggered on completion)
  int channel_id;             // Ring/tree channel index
  int chunk_id;               // Which chunk of the collective
  int chunk_count;            // Total chunks in this collective
  string conn_type;           // "RING", "TREE", "NVLS", "PXN_INIT"

  // Constructor
  SingleFlow(int id, int s, int d, uint64_t size,
             vector<int> p, vector<int> par,
             vector<int> child,
             int ch, int ck, int cc,
             string ct)
      : flow_id(id), src(s), dest(d), flow_size(size),
        prev(p), parallel_deps(par), child_flow_id(child),
        channel_id(ch), chunk_id(ck), chunk_count(cc),
        conn_type(ct) {}
};
The flow DAG ensures correct execution ordering. For a Ring AllReduce with 4 GPUs, the prev field chains flows so that step 2 of the ReduceScatter cannot begin until step 1 completes. The child_flow_id enables the simulation engine to eagerly dispatch dependent flows once a flow finishes.
1. MockNcclGroup decomposes the collective into SingleFlows with explicit dependencies
2. astra-sim checks prev dependencies and dispatches flows with all dependencies satisfied
3. The backend (NS-3 / Analytical) simulates the point-to-point transfer
4. The backend notifies astra-sim; child_flow_id flows are checked and potentially dispatched
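The dispatch loop can be sketched as a minimal event-driven scheduler (hypothetical code illustrating the prev/child_flow_id contract, not astra-sim's implementation):

```python
from collections import deque

def run_flow_dag(flows):
    """flows: {flow_id: {'prev': [...], 'child': [...]}}. Returns completion order."""
    done, order = set(), []
    # Flows with no prev dependencies are dispatched immediately
    ready = deque(f for f, spec in flows.items() if not spec['prev'])
    while ready:
        f = ready.popleft()
        done.add(f)        # backend reports completion
        order.append(f)
        for child in flows[f]['child']:
            # dispatch a child only once ALL of its prev deps completed
            if all(p in done for p in flows[child]['prev']):
                ready.append(child)
    return order

# Two-step ring chain: flow 1 cannot start before flow 0 completes
dag = {0: {'prev': [], 'child': [1]}, 1: {'prev': [0], 'child': []}}
print(run_flow_dag(dag))  # [0, 1]
```

The `all(p in done ...)` check is what makes step 2 of a ReduceScatter wait for step 1, exactly as described above.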
SimAI organizes GPU ranks into communication groups based on parallelism strategy. Each GroupInfo struct describes the membership, topology, and NVSwitch connectivity of a group. The GroupType enum determines which collective algorithm is used:
enum GroupType { TP, DP, PP, EP, DP_EP, NONE };

struct GroupInfo {
  GroupType type;         // Parallelism type
  int nNodes;             // Number of nodes in this group
  int nRanks;             // Number of GPU ranks
  vector<int> Ranks;      // List of rank IDs
  vector<int> NVSwitchs;  // NVSwitch IDs (if applicable)
};
// Example: 8-GPU TP group within one node
// GroupInfo { TP, 1, 8, {0,1,2,3,4,5,6,7}, {8,9,10,11} }
// - 8 ranks on 1 node
// - 4 NVSwitches (H100 SXM)
| GroupType | Typical Op | Algorithm | Description |
|---|---|---|---|
| TP | AllReduce | Ring / NVLS | Tensor parallel gradient sync within a node; NVLS preferred on H100+ with 8+ ranks |
| DP | AllReduce | Ring | Data parallel gradient sync across nodes; bandwidth-bound over NIC |
| EP | AllToAll | Ring | Expert parallel token dispatch for MoE models; each GPU sends to all experts |
| PP | SendRecv | P2P | Pipeline parallel activation transfer between stages; latency-sensitive |
| DP_EP | AllReduce | Ring | Combined data-parallel + expert-parallel gradient sync |
The group topology determines how MockNcclGroup generates ring or tree channels. For a TP group confined to one node, rings are constructed using the NVLink topology (e.g., rail-optimized patterns). For DP groups spanning multiple nodes, rings traverse NIC links and the network fabric.
astra-sim supports three different network backends, each offering a different trade-off between simulation speed and fidelity. Users choose the backend at build time, and the same workload description works with all three.
The analytical backend uses a simple event queue with bus bandwidth estimates to compute transfer times. It runs in seconds but does not model congestion, queueing, or flow contention:
class AnaSim {
  static queue<struct CallTask> call_list;
  static uint64_t tick;

  static void Run() {
    while (!call_list.empty()) {
      CallTask task = call_list.front();
      while (tick != task.time) tick++;  // Advance virtual clock to the event
      call_list.pop();
      task.fun_ptr(task.fun_arg);        // Execute callback
    }
  }

  static void Schedule(
      uint64_t delay,           // Computed from busbw config
      void (*fun_ptr)(void*),   // Completion callback
      void* fun_arg) {
    call_list.push({tick + delay, fun_ptr, fun_arg});
  }
};
// Transfer time = data_size / busbw + base_latency
// No congestion, no contention, no queue dynamics
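The whole analytical cost model fits in one line; a sketch (the bus bandwidth and base latency below are illustrative values, not SimAI defaults):

```python
def analytical_transfer_ns(data_bytes, busbw_gbps, base_latency_ns):
    """Transfer time = data_size / busbw + base_latency. No congestion modeled."""
    # Gbps = bits per nanosecond, so bytes*8 / Gbps yields nanoseconds
    return data_bytes * 8 / busbw_gbps + base_latency_ns

# 64 MB over a 400 Gbps bus with a 1 µs base latency
t = analytical_transfer_ns(64 * 2**20, 400, 1000)
print(round(t))  # roughly 1.34 ms, expressed in ns
```

Because contention is absent, every flow on a shared link completes in the same time it would alone — exactly the gap the NS-3 backend closes.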
The NS-3 backend provides full packet-level simulation with realistic network behavior. This is the highest-fidelity software simulation mode, modeling queue dynamics, congestion control, PFC flow control, and ECMP routing at packet granularity.
The physical backend generates actual RDMA traffic on real hardware. This provides the highest fidelity validation but requires an actual RDMA cluster:
void send_flow(int src, int dst, uint64_t maxPacketCount,
               void (*msg_handler)(void*), void* fun_arg,
               int tag, sim_request* request) {
  ncclFlowTag flowtag = request->flowTag;
  TransportData send_data = TransportData(
      flowtag.channel_id,
      flowtag.chunk_id,
      flowtag.chunk_count,
      flowtag.current_flow_id,
      flowtag.child_flow_id,
      flowtag.tree_flow_list,
      flowtag.data_size);
  // Serialize and send via actual RDMA verbs
  rdma_transport->post_send(
      dst, &send_data, sizeof(send_data),
      tag, msg_handler, fun_arg);
}
| Mode | Speed | Fidelity | Hardware | Use Case |
|---|---|---|---|---|
| Analytical | ~seconds | Low | CPU only | Quick estimation, parameter sweeps, early design exploration |
| NS-3 Sim | ~minutes-hours | High | CPU (multi-thread) | Network topology design, congestion analysis, QoS tuning |
| Physical | Real-time | Highest | RDMA cluster | Validation against real hardware, final design sign-off |
When SimAI simulates a KV cache transfer from a prefill node to a decode node, the NS-3 backend doesn't just estimate latency — it simulates the full RoCEv2 RDMA transport stack at packet granularity. This section traces the exact mechanisms: how RDMA Queue Pairs are created, how DCQCN adjusts sending rates on congestion, how PFC prevents buffer overflow, and how ECMP distributes flows across equal-cost paths.
1. RdmaHw.AddQueuePair(size=KV_cache_bytes, pg=3)
2. Window check (IsWinBound) → Multi-QP split → WRR NIC scheduling
3. ECN marking (RED) → PFC pause check → ECMP hash routing → PSW if cross-rail
4. Same server? → route via NVSwitch (2880 Gbps) instead of NIC
5. Seq check → ACK or NACK (protocol 0xFD) → FCT tracking (actual vs ideal)
6. CNP → DCQCN rate decrease (α×EWMA) | or HPCC/TIMELY/DCTCP | Window adjust
7. qp_finish() → FCT recorded → monitoring output → callback to astra-sim → Decode GPU
Every P2P flow in SimAI becomes one or more RDMA Queue Pairs. The call chain starts from astra-sim's collective decomposition and ends at NS-3's packet-level simulation:
1. e.g., AllReduce(64MB, 8 GPUs) → 14 SingleFlow objects (2×(N-1) for ring)
2. Each SingleFlow triggers a sim_send() to the NS-3 backend with flow_size bytes
3. A QP is created with src/dst IP, port, priority group, window size (BDP), and base RTT. The QP starts sending packets at line rate.
4. 9000-byte jumbo frames traverse the topology. Switches check queue depth, mark ECN if congested, and trigger PFC if buffers go critical. The receiver sends an ACK or generates a CNP.
5. FCT (Flow Completion Time) is recorded. astra-sim's event system proceeds to the next dependent operation.
class RdmaQueuePair : public Object {
  Ipv4Address sip, dip;       // Source & destination IP
  uint16_t sport, dport;      // Source & dest port (ECMP hash input)
  uint64_t m_size;            // Total bytes to transfer (= KV cache size)
  uint16_t m_pg;              // Priority group (queue index for PFC)
  uint64_t snd_nxt, snd_una;  // Next seq to send, unacked seq
  uint32_t m_win;             // Window size (BDP-based)
  DataRate m_rate;            // Current sending rate (DCQCN-controlled)
  uint64_t m_baseRtt;         // Base RTT for this src-dst pair
};
SimAI's default congestion control is DCQCN (Data Center QCN), Mellanox's adaptation of QCN for RoCEv2. It operates in three phases: rate decrease on CNP, alpha tracking via EWMA, and rate recovery via additive/hyper-additive increase.
CC_MODE values: 1 = DCQCN (Mellanox, default), 3 = HPCC (High Precision CC), 7 = TIMELY (RTT-based), 8 = DCTCP (ECN-based), 10 = HPCC-PINT (with INT sampling). All are implemented in rdma-hw.cc.
// DCQCN rate control state (per QP)
double m_g; // EWMA gain = 0.00390625 (1/256)
double m_rateOnFirstCNP; // Rate fraction on first CNP (1.0 = full)
double m_rpgTimeReset; // Rate increase timer = 900 μs
double m_rateDecreaseInterval; // Rate decrease check interval = 4 μs
double m_alpha_resume_interval; // Alpha EWMA update interval = 1 μs
DataRate m_rai; // Additive increase = 50 Mb/s
DataRate m_rhai; // Hyper-additive increase = 100 Mb/s
uint32_t m_rpgThreshold; // Fast recovery threshold = 1 CNP
When a CNP (Congestion Notification Packet) arrives at the sender, DCQCN reduces the rate multiplicatively. The key formula:
// On each rate_decrease_interval (4 μs), if CNP was received:
target_rate = rate × (1 - alpha / 2)
rate = max(rate × (1 - alpha / 2), MIN_RATE)
// Alpha is updated via EWMA every alpha_resume_interval (1 μs):
// If CNP received in this interval:
alpha = (1 - m_g) × alpha + m_g // α increases toward 1
// If no CNP:
alpha = (1 - m_g) × alpha // α decays toward 0
// With m_g = 1/256, alpha converges slowly — providing stability
// Every RP_TIMER (900 μs), if no CNP received:
// Phase 1: Fast Recovery (first m_rpgThreshold rounds)
rate = (rate + target_rate) / 2
// Phase 2: Additive Increase
rate = rate + m_rai // +50 Mb/s per interval
// Phase 3: Hyper-Additive Increase (after sustained no-congestion)
rate = rate + m_rhai // +100 Mb/s per interval
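The three phases can be tied together in a toy rate controller (a simplified sketch of the update rules above with the timers elided, not rdma-hw.cc; note the pre-decrease rate is saved as the target so fast recovery can climb back):

```python
MIN_RATE = 0.1   # Gbps floor, illustrative
G = 1.0 / 256    # EWMA gain m_g

def dcqcn_on_cnp(rate, target, alpha):
    """Multiplicative decrease on CNP; alpha moves toward 1 via EWMA."""
    target = rate                              # remember pre-decrease rate
    rate = max(rate * (1 - alpha / 2), MIN_RATE)
    alpha = (1 - G) * alpha + G                # CNP seen: alpha -> 1
    return rate, target, alpha

def dcqcn_recover(rate, target, rai=0.05):
    """Fast recovery halves the gap to target, then additive increase (+50 Mb/s)."""
    if rate < target:
        return (rate + target) / 2, target     # Phase 1: fast recovery
    return rate + rai, target                  # Phase 2: additive increase

r, t, a = dcqcn_on_cnp(100.0, 100.0, 1.0)
print(r)  # 100 * (1 - 1/2) = 50.0 Gbps after the first CNP at alpha=1
```

One recovery round then brings the rate to (50 + 100) / 2 = 75 Gbps, illustrating why a single CNP is painful but transient.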
Switches mark packets with ECN (setting the CE codepoint in the IPv4 header's ECN field) based on probabilistic RED (Random Early Detection). The marking probability increases linearly between kmin and kmax:
bool SwitchMmu::ShouldSendCN(uint32_t ifindex, uint32_t qIndex) {
  if (qIndex == 0) return false;  // Queue 0 = highest priority, never marked
  if (egress_bytes[ifindex][qIndex] > kmax[ifindex])
    return true;                  // Above kmax: always mark (100%)
  if (egress_bytes[ifindex][qIndex] > kmin[ifindex]) {
    // Between kmin and kmax: linear probability
    double p = pmax[ifindex]
        * (double)(egress_bytes[ifindex][qIndex] - kmin[ifindex])
        / (kmax[ifindex] - kmin[ifindex]);
    if (UniformVariable(0, 1).GetValue() < p)
      return true;                // Probabilistic mark
  }
  return false;                   // Below kmin: never mark
}
The kmin/kmax/pmax values are rate-dependent, configured per link speed in SimAI.conf:
| Link Speed | kmin (KB) | kmax (KB) | pmax |
|---|---|---|---|
| 25 Gbps | 100 | 400 | 0.2 |
| 100 Gbps | 400 | 1600 | 0.2 |
| 200 Gbps | 300 | 1200 | 0.8 |
| 400 Gbps | 800 | 3200 | 0.2 |
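Plugging the 100 Gbps row into ShouldSendCN's linear ramp gives a quick feel for the numbers (a standalone recomputation, queue depth in KB):

```python
def ecn_mark_probability(qlen_kb, kmin_kb, kmax_kb, pmax):
    """RED-style marking probability as implemented in ShouldSendCN."""
    if qlen_kb <= kmin_kb:
        return 0.0                 # below kmin: never mark
    if qlen_kb > kmax_kb:
        return 1.0                 # above kmax: always mark
    return pmax * (qlen_kb - kmin_kb) / (kmax_kb - kmin_kb)

# 100 Gbps link: kmin=400 KB, kmax=1600 KB, pmax=0.2
p = ecn_mark_probability(1000, 400, 1600, 0.2)
print(round(p, 3))  # 0.2 * 600/1200 = 0.1
```

At a 1000 KB queue, one packet in ten is marked — enough to generate a steady CNP stream back to the sender before the queue ever reaches the PFC threshold.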
PFC operates at Layer 2, independently from DCQCN. When a switch's ingress buffer fills beyond the threshold, it sends a PAUSE frame to the upstream sender, halting transmission on that priority queue. This prevents packet loss but can cause head-of-line blocking and PFC storms.
bool SwitchMmu::CheckShouldPause(uint32_t port, uint32_t qIndex) {
  return !paused[port][qIndex] &&
         (hdrm_bytes[port][qIndex] > 0 ||                        // Headroom occupied
          GetSharedUsed(port, qIndex) >= GetPfcThreshold(port)); // Shared buffer full
}

bool SwitchMmu::CheckShouldResume(uint32_t port, uint32_t qIndex) {
  if (!paused[port][qIndex]) return false;
  return hdrm_bytes[port][qIndex] == 0 &&        // Headroom drained
         (GetSharedUsed(port, qIndex) == 0 ||
          GetSharedUsed(port, qIndex) + resume_offset  // 3 KB hysteresis
              <= GetPfcThreshold(port));
}

// Switch buffer: 12 MB total (static) or 32 MB (SimAI.conf BUFFER_SIZE)
// Dynamic threshold: USE_DYNAMIC_PFC_THRESHOLD = 1 (enabled by default)
// 8 priority queues (qCnt=8), queue 0 = highest (ACK/NACK/CNP bypass PFC)
Watch PFC_OUTPUT_FILE: high PFC counts indicate that DCQCN parameters need tuning (e.g., a lower kmin or a faster RATE_DECREASE_INTERVAL).
SimAI uses per-flow ECMP routing at every switch. When a packet arrives at a switch with multiple equal-cost next hops, the switch hashes the flow's 5-tuple to deterministically pick a path:
int SwitchNode::GetOutDev(Ptr<Packet> p, CustomHeader &ch) {
  // Extract 5-tuple: src_ip(4B) + dst_ip(4B) + src_port(2B) + dst_port(2B)
  union { uint8_t u8[12]; uint32_t u32[3]; } buf;
  buf.u32[0] = ch.sip;
  buf.u32[1] = ch.dip;
  buf.u32[2] = ch.sport | ((uint32_t)ch.dport << 16);
  // MurmurHash3 with per-switch seed (= node ID)
  uint32_t idx = EcmpHash(buf.u8, 12, m_ecmpSeed) % nexthops.size();
  return nexthops[idx];  // Deterministic path for this flow
}
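The key property is determinism: the same 5-tuple always lands on the same next hop, while changing the source port can move a flow to a different path. A sketch with a stand-in mixing function in place of MurmurHash3 (illustrative only):

```python
def ecmp_pick(sip, dip, sport, dport, seed, n_nexthops):
    """Deterministic next-hop choice from the 5-tuple (toy hash, not MurmurHash3)."""
    h = seed
    for field in (sip, dip, sport, dport):
        h = (h * 1000003 ^ field) & 0xFFFFFFFF  # simple 32-bit mix per field
    return h % n_nexthops

a = ecmp_pick(0x0A000001, 0x0A000002, 10000, 100, seed=7, n_nexthops=4)
b = ecmp_pick(0x0A000001, 0x0A000002, 10000, 100, seed=7, n_nexthops=4)
print(a == b)  # same flow, same path — always True
```

This determinism is also ECMP's weakness: two elephant flows whose tuples happen to hash to the same uplink will share it for their entire lifetime, which is the hash-collision effect discussed later.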
void qp_finish(FILE *fout, Ptr<RdmaQueuePair> q) {
  uint64_t standalone_fct = base_rtt
      + total_bytes * 8000000000lu / bandwidth;  // Ideal FCT (no congestion)
  fprintf(fout, "%08x %08x %u %u %lu %lu %lu %lu\n",
          q->sip, q->dip,                   // Source & dest IP (hex)
          q->sport, q->dport,               // Ports
          q->m_size,                        // Data bytes (= KV cache size)
          q->startTime,                     // Start time (ns)
          Simulator::Now() - q->startTime,  // Actual FCT (ns)
          standalone_fct);                  // Ideal FCT (ns, zero-congestion)
}
// FCT ratio = actual_fct / ideal_fct → measures congestion impact
// For PD disagg: this directly determines pd_p2p_comm_time
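The ideal-FCT formula is easy to recompute by hand (same units as qp_finish(): bytes, bits/s, nanoseconds):

```python
def ideal_fct_ns(base_rtt_ns, total_bytes, bandwidth_bps):
    """standalone_fct = base_rtt + bytes * 8e9 / bandwidth, in ns."""
    return base_rtt_ns + total_bytes * 8_000_000_000 // bandwidth_bps

# 64 MB KV-cache chunk over 100 Gbps with a 10 µs base RTT
ideal = ideal_fct_ns(10_000, 64 * 2**20, 100_000_000_000)
print(ideal)  # 5,378,709 ns ≈ 5.38 ms
```

An actual FCT of, say, 16 ms against this ideal gives an FCT ratio of ~3 — squarely in the congestion-inflation range discussed below.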
| Layer | Mechanism | Implementation | Key Parameters |
|---|---|---|---|
| L4 Transport | RoCEv2 RDMA with QP state | rdma-hw.cc, rdma-queue-pair.h | PACKET_PAYLOAD_SIZE=9000, HAS_WIN=1 |
| L4 Congestion Control | DCQCN (default) / HPCC / TIMELY | rdma-hw.cc cnp_received_mlx() | CC_MODE=1, EWMA_GAIN=1/256, RATE_AI=50Mb/s |
| L3 ECN Marking | Probabilistic RED at switch egress | switch-mmu.cc ShouldSendCN() | kmin/kmax per link speed, pmax=0.2 |
| L3 Routing | Per-flow ECMP (MurmurHash3) | switch-node.cc GetOutDev() | 5-tuple hash, seed=node_id |
| L2 Flow Control | PFC with dynamic threshold | switch-mmu.cc, qbb-net-device.h | 8 queues, 3KB hysteresis, 32MB buffer |
| L1 Physical | Point-to-point links with configurable BW/latency | Topology file | NVLink 2880Gbps, NIC 100-400Gbps |
In analytical mode, a transfer costs simply size / bandwidth. The NS-3 backend captures effects that analytical mode misses: (1) DCQCN rate ramp-up — new QPs start at line rate but quickly converge when competing; (2) ECMP hash collisions — two KV transfers to the same decode node may collide on a PSW uplink; (3) PFC cascading — a congested decode node can PFC-pause the entire upstream path; (4) cross-traffic interference — TP AllReduce on the same rails competes with PD KV transfers. These effects can cause 2-5× FCT inflation compared to the ideal size/BW estimate, making NS-3 essential for accurate PD disagg analysis.
When a receiver detects an out-of-order sequence number, it generates a NACK packet (protocol 0xFD). The sender retransmits from the NACKed sequence. This provides a reliable transport layer beneath congestion control.
Key parameters from SimAI.conf:
- L2_CHUNK_SIZE = 4000 — chunk size for reliability (bytes)
- L2_ACK_INTERVAL = 1 — ACK every N chunks
- L2_BACK_TO_ZERO = 0 — Go-Back-N mode (0=selective, 1=go-back-to-zero)

// rdma-hw.cc — ReceiverCheckSeq()
int RdmaHw::ReceiverCheckSeq(uint32_t seq, Ptr<RdmaRxQueuePair> q, uint32_t size) {
  uint32_t expected = q->ReceiverNextExpectedSeq;
  if (seq == expected) {
    q->ReceiverNextExpectedSeq = expected + size;
    // Check whether an ACK needs to be sent
    if (m_ack_interval == 0) return 1;  // ACK
    else return 5;                      // delayed ACK
  } else if (seq > expected) {
    // Gap detected → generate NACK
    return 2;                           // NACK
  } else {
    // Duplicate or retransmitted packet
    return 4;                           // duplicate
  }
}
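The branch logic maps cleanly onto receiver actions; a minimal restatement (return-code names follow the comments above — the full retransmission machinery is omitted):

```python
ACK, NACK, DUPLICATE, DELAYED_ACK = 1, 2, 4, 5

def receiver_check_seq(seq, expected, size, ack_interval=1):
    """Mirror of ReceiverCheckSeq's branches; returns (action, new_expected)."""
    if seq == expected:
        action = ACK if ack_interval == 0 else DELAYED_ACK
        return action, expected + size      # in-order: advance expected seq
    if seq > expected:
        return NACK, expected               # gap: NACK, expected unchanged
    return DUPLICATE, expected              # stale retransmission

print(receiver_check_seq(0, 0, 1000))        # in-order
print(receiver_check_seq(2000, 1000, 1000))  # gap -> NACK
```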
SimAI has a dedicated NVSwitchNode class (separate from regular SwitchNode) for intra-node NVLink routing. This enables distinct routing paths for GPU-to-GPU traffic within the same server versus inter-server traffic through NICs.
- m_rtTable_nxthop_nvswitch for next-hop routing through the NVSwitch
- m_gpus_per_server determines which GPUs are in the same server
- With nvls_enable=true on a QP, packets route through the NVSwitch instead of NIC→ASW
- SwitchAsHostSend() handles the NVLS reply path

// rdma-hw.h
std::unordered_map<uint32_t, std::vector<int>> m_rtTable; // Normal routing
std::unordered_map<uint32_t, std::vector<int>> m_rtTable_nxthop_nvswitch; // NVSwitch routing
uint32_t m_gpus_per_server; // Determines intra-node boundary
// rdma-hw.cc — route selection
if (nvls_enable && IsInSameServer(sip, dip)) {
// Use NVSwitch routing table
nexthops = m_rtTable_nxthop_nvswitch[dip];
} else {
// Use normal ECMP routing
nexthops = m_rtTable[dip];
}
SimAI uses both window-based and rate-based flow control simultaneously. A packet is sent only if both conditions are met: (1) within window, AND (2) within rate limit.
- HAS_WIN=1 — enables window-based control
- VAR_WIN=1 — enables a variable window that scales with the current rate: effective_window = m_win × (current_rate / max_rate)
- IsWinBound() checks if on-the-fly bytes exceed the window

// rdma-queue-pair.cc
bool RdmaQueuePair::IsWinBound() {
  uint64_t on_the_fly = snd_nxt - snd_una;
  uint64_t win;
  if (m_var_win) {
    win = m_win * m_rate.GetBitRate() / m_max_rate.GetBitRate();
    if (win == 0) win = 1;
  } else {
    win = m_win;
  }
  return on_the_fly >= win;
}
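The variable window couples the two mechanisms: as DCQCN lowers the rate, the in-flight byte budget shrinks proportionally. A numeric restatement (window and rate values illustrative):

```python
def effective_window(m_win, rate_bps, max_rate_bps, var_win=True):
    """effective_window = m_win * (current_rate / max_rate), floored at 1 byte."""
    if not var_win:
        return m_win
    return max(m_win * rate_bps // max_rate_bps, 1)

def is_win_bound(snd_nxt, snd_una, win):
    """True when on-the-fly bytes have reached the window."""
    return (snd_nxt - snd_una) >= win

# BDP window of 512 KB; DCQCN has throttled the QP to 25% of line rate
w = effective_window(512 * 1024, 100_000_000_000, 400_000_000_000)
print(w)  # 131072 bytes — only a quarter of the BDP may be in flight
```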
Beyond DCQCN (CC_MODE=1), SimAI implements three more congestion control algorithms, each using different network signals:
Note in rdma-hw.cc that only modes 1, 3, 7, 8, and 10 have actual if (m_cc_mode == N) branches — the gaps in the numbering are historical, not omissions in our analysis.
HPCC (CC_MODE=3) — Uses INT (In-Network Telemetry) headers. Each switch adds an IntHop record (timestamp, bytes, queue length, line rate). The sender uses precise utilization info to set rate:
// IntHop structure (int-header.h)
struct IntHop {
  uint32_t time : 24;      // Timestamp (ns)
  uint32_t bytes : 20;     // Bytes transmitted since last sample
  uint32_t qlen : 17;      // Queue occupancy
  uint32_t lineRate : 3;   // Encoded link rate
};
// Max 5 hops per packet (IntHeader::maxHop = 5)
// Target utilization: m_targetUtil = 0.95
TIMELY (CC_MODE=7) — RTT-based, no ECN required:
// Parameters (rdma-hw.h)
double m_tmly_alpha; // Rate decrease factor
double m_tmly_beta; // Rate increase factor
uint64_t m_tmly_TLow; // Low RTT threshold
uint64_t m_tmly_THigh; // High RTT threshold
uint64_t m_tmly_minRtt; // Minimum RTT tracking
DCTCP (CC_MODE=8) — ECN-based with additive increase:
DataRate m_dctcp_rai; // Additive increase = 1000 Mb/s (from SimAI.conf DCTCP_RATE_AI)
Comparison of all supported congestion control algorithms:
| CC Algorithm | Signal | Rate Decrease | Rate Increase | Use Case |
|---|---|---|---|---|
| DCQCN (1) | ECN via CNP | α×EWMA multiplicative | AI/HAI additive | Default, RoCEv2 |
| HPCC (3) | INT telemetry | Precise utilization-based | Target 95% utilization | High precision |
| TIMELY (7) | RTT measurement | RTT > THigh | RTT < TLow | No switch support |
| DCTCP (8) | ECN marking | ECN-proportional | Additive (1Gbps) | TCP-like |
| HPCC-PINT (10) | Probabilistic INT | Sampled utilization | Same as HPCC | Reduced overhead |
When multiple QPs compete for the same NIC, packets are scheduled via weighted round-robin. ACK/NACK packets always receive absolute priority.
// qbb-net-device.cc — GetNextQindex()
int QbbNetDevice::GetNextQindex(bool paused[]) {
  // 1. Check high-priority ACK queue first
  if (m_ackQ.size() > 0) return -1;  // ACK has absolute priority
  // 2. Round-robin over QPs, skipping paused priorities
  for (int i = 1; i <= m_queue->m_qpGrp->GetN(); i++) {
    int idx = (m_qpidx + i) % m_queue->m_qpGrp->GetN();
    Ptr<RdmaQueuePair> qp = m_queue->m_qpGrp->Get(idx);
    if (!paused[qp->m_pg]            // Not PFC-paused
        && qp->GetBytesLeft() > 0    // Has data to send
        && !qp->IsWinBound()         // Within window
        && qp->m_nextAvail <= now) { // Rate-limit timer expired
      m_qpidx = idx;
      return idx;
    }
  }
  return -2;  // Nothing to send
}
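The four eligibility checks can be restated as a standalone predicate (an illustrative model of the loop above, not the NS-3 code):

```python
def next_qp_index(qps, paused, last_idx, now):
    """qps: list of dicts with keys pg, bytes_left, win_bound, next_avail.
    Returns the next eligible QP index round-robin, or -2 if none."""
    n = len(qps)
    for i in range(1, n + 1):
        idx = (last_idx + i) % n        # resume after the last-served QP
        qp = qps[idx]
        if (not paused[qp['pg']]        # not PFC-paused
                and qp['bytes_left'] > 0
                and not qp['win_bound']
                and qp['next_avail'] <= now):
            return idx
    return -2                           # nothing eligible to send

qps = [{'pg': 3, 'bytes_left': 0,   'win_bound': False, 'next_avail': 0},
       {'pg': 3, 'bytes_left': 100, 'win_bound': False, 'next_avail': 0}]
print(next_qp_index(qps, paused=[False] * 8, last_idx=0, now=0))  # 1
```

Starting the scan at last_idx + 1 is what makes the scheduling round-robin rather than strict-priority across QPs.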
- Finished QPs (GetBytesLeft()=0) are automatically cleaned up

Large flows can be split across multiple RDMA Queue Pairs for parallel transmission. Each QP gets a different source port, resulting in a different ECMP hash and potentially a different physical path. This is how SimAI can simulate multi-path load balancing for elephant flows.
#define _QPS_PER_CONNECTION_ 1  // Default: 1 QP per flow (configurable)

void SendFlow(int src, int dst, uint64_t maxPacketCount, ...) {
  uint64_t perQP = (maxPacketCount + _QPS_PER_CONNECTION_ - 1) / _QPS_PER_CONNECTION_;
  uint64_t remaining = maxPacketCount;
  for (int i = 0; i < _QPS_PER_CONNECTION_; i++) {
    uint64_t thisQP = min(perQP, remaining);
    remaining -= thisQP;
    uint32_t port = portNumber[src][dst]++;  // Unique port → unique ECMP hash
    // Each QP gets a different src port → potentially a different ECMP path
    RdmaClientHelper client(pg, sip, dip, port, dport, thisQP, ...);
  }
}
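The per-QP split is ceiling division; with _QPS_PER_CONNECTION_ > 1 an elephant flow spreads across several source ports, and hence potentially several ECMP paths (arithmetic mirroring SendFlow above):

```python
def split_flow(total_packets, qps_per_connection):
    """Ceiling-divide packets across QPs; the last QP takes the remainder."""
    per_qp = (total_packets + qps_per_connection - 1) // qps_per_connection
    remaining, sizes = total_packets, []
    for _ in range(qps_per_connection):
        this_qp = min(per_qp, remaining)
        sizes.append(this_qp)
        remaining -= this_qp
    return sizes

print(split_flow(10, 4))  # [3, 3, 3, 1] — four QPs, four distinct ECMP hashes
```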
SimAI provides six monitoring output files for observing network behavior at different granularities, plus support for link failure simulation:
| File | Content | Interval |
|---|---|---|
| FCT_OUTPUT_FILE | Flow completion times (actual vs ideal) | Per flow |
| QLEN_MON_FILE | Switch queue occupancy (per port, per queue) | 10 ms |
| BW_MON_FILE | Host-level transmit bytes | 10 ms |
| RATE_MON_FILE | Per-QP sending rate (DCQCN-controlled) | 100 μs |
| CNP_MON_FILE | CNP reception count per QP | 100 μs |
| PFC_OUTPUT_FILE | PFC PAUSE/RESUME events | Per event |
Link failure simulation:
# SimAI.conf — Link failure configuration
# Format: LINK_DOWN <timestamp> <node_A> <node_B>
LINK_DOWN 0 0 0 # 0 0 0 = no failure
# Can simulate link failures at specified times to study resilience
# Example: LINK_DOWN 1000000 12 48 — fail link between node 12 and 48 at t=1ms
Correlate FCT_OUTPUT_FILE (macro-level flow completion) with RATE_MON_FILE (micro-level rate dynamics) to diagnose why specific collective operations experience high latency. High PFC counts in PFC_OUTPUT_FILE combined with queue buildup in QLEN_MON_FILE indicate congestion hotspots that DCQCN alone cannot resolve.
SimAI includes a Python topology generator (gen_Topo_Template.py) that produces network descriptions for several real-world datacenter topologies. These topologies define the physical connectivity between GPUs, NVSwitches, ToR switches, aggregate switches (ASW), and pod switches (PSW).
# gen_Topo_Template.py
# Supported topologies:
# - Spectrum-X: Rail-optimized single ToR (4096 GPUs default)
# - AlibabaHPN Single-Plane: Dual ToR (15360 GPUs)
# - AlibabaHPN Dual-Plane: Dual ToR with dual plane (15360 GPUs)
# - DCN+ Single-ToR: Single ToR topology (512 GPUs)
# - DCN+ Dual-ToR: Dual ToR topology (512 GPUs)
# Key parameters:
# gps: GPU per server (typically 8)
# nvbw: NVLink bandwidth (Gbps), e.g. 900 for H100
# bw: NIC-to-ASW bandwidth (Gbps), e.g. 100, 200, 400
# nl: NVLink latency (ns), typically 1000
# l: NIC latency (ns), typically 1000
# Output format (plain text):
# Line 1: <total_nodes> <gpu_per_server> <nv_switch_num>
# <switch_nodes> <links> <gpu_type>
# Line 2: <switch_node_ids> (space-separated)
# Line 3+: <src> <dst> <bandwidth> <latency> <error_rate>
parser = argparse.ArgumentParser()
parser.add_argument('-topo', type=str,
                    choices=['Spectrum-X', 'AlibabaHPN-SinglePlane',
                             'AlibabaHPN-DualPlane', 'DCN-SingleToR',
                             'DCN-DualToR'])
parser.add_argument('-g', type=int,
                    help='Total number of GPUs')
parser.add_argument('-gt', type=str,
                    choices=['A100', 'A800', 'H100', 'H800'])
parser.add_argument('-bw', type=str,
                    help='NIC bandwidth, e.g. 100Gbps')
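The three-section file layout described above can be sketched as a minimal writer (write_topology and its field values are a hypothetical emitter for illustration, not part of gen_Topo_Template.py):

```python
def write_topology(path, total_nodes, gpus_per_server, nv_switches,
                   switch_ids, links, gpu_type):
    """links: list of (src, dst, bandwidth, latency, error_rate) tuples."""
    with open(path, 'w') as f:
        # Line 1: counts header
        f.write(f"{total_nodes} {gpus_per_server} {nv_switches} "
                f"{len(switch_ids)} {len(links)} {gpu_type}\n")
        # Line 2: switch node IDs
        f.write(" ".join(str(s) for s in switch_ids) + "\n")
        # Line 3+: one link per line
        for src, dst, bw, lat, err in links:
            f.write(f"{src} {dst} {bw} {lat} {err}\n")

# Tiny example: 8 GPUs + 1 NVSwitch + 1 ToR switch, one link listed
write_topology("mini_topo.txt", 10, 8, 1, [9],
               [(0, 9, "400Gbps", "1000ns", 0)], "H100")
```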
The topology file defines every link in the network with its bandwidth (Gbps), latency (ns), and error rate. The NS-3 backend reads this file to construct the simulation network. Different topologies lead to dramatically different congestion patterns, especially for AllToAll operations in MoE training.
The topology generator gen_Topo_Template.py implements five distinct topology functions that map to three named architecture families: Spectrum-X, AlibabaHPN, and DCN+. The critical architectural distinction is between rail-optimized topologies (where GPUi on every server connects to a dedicated ASWi) and non-rail-optimized topologies (where all GPUs in a segment share the same ASW). This choice profoundly impacts collective communication performance.
Each GPU connects to its own rail-aligned ASW based on its index within the server. With 8 GPUs per server, there are 8 ASW per segment — one per GPU position. GPUi → ASW[group × gps + (i % gps)]. Every ASW connects to ALL PSW in a full mesh. This is NVIDIA's recommended topology for AllReduce-heavy workloads.
# Key parameters (defaults):
asw_switch_num_per_segment = gpu_per_server # 8 ASW per segment (one per rail)
gpu_count = 4096 # Default GPU count
nic_bandwidth = 400Gbps # NIC → ASW bandwidth
nvlink_bw = 2880Gbps # NVLink bandwidth
# Wiring logic:
# GPU_i → NVSwitch (NVLink, intra-node)
# GPU_i → ASW[group * gps + (i % gps)] (NIC, rail-aligned)
# ASW_j → ALL PSW (full mesh uplinks)
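The rail-alignment rule is plain index arithmetic; a quick standalone check (assuming 8 GPUs per server, as above):

```python
def rail_asw(gpu_id, gps, group):
    """GPU_i connects to ASW[group * gps + (i % gps)] — its rail switch."""
    return group * gps + (gpu_id % gps)

# GPU position 3 on every server in group 0 shares ASW index 3 (rail 3),
# so same-position GPUs reach each other in one ASW hop
print([rail_asw(g, 8, 0) for g in (3, 11, 19)])  # [3, 3, 3]
```

This is why rail-optimized fabrics favor Ring AllReduce: NCCL's rings pair same-position GPUs across servers, keeping most inter-node traffic on a single rail switch.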
Rail-optimized like Spectrum-X, but each GPU connects to two ASW switches (dual-homed), providing link redundancy. The ASW are split into two sets (ASW-A and ASW-B), but both connect to the same PSW pool (single plane). With 8 GPUs per server, there are gps × 2 = 16 ASW per segment.
# Key parameters (defaults):
asw_switch_num_per_segment = gpu_per_server * 2 # 16 ASW per segment
gpu_count = 15360 # Default GPU count
nic_bandwidth = 200Gbps # NIC → ASW bandwidth
asw_sets = 2 # asw_switch_1[] and asw_switch_2[]
psw_sets = 1 # Single PSW pool (both ASW sets → same PSW)
# Wiring logic:
# GPU_i → ASW-A[i % gps] (NIC-1, rail-aligned)
# GPU_i → ASW-B[i % gps] (NIC-2, rail-aligned)
# ASW-A[*] → ALL PSW (full mesh)
# ASW-B[*] → ALL PSW (full mesh)
The most fault-tolerant topology. Like HPN Single-Plane, each GPU is dual-homed to ASW-A and ASW-B. But the PSW layer is also split: ASW-A connects only to PSW-A, and ASW-B connects only to PSW-B — forming two completely independent network planes. If one entire plane fails, the other can still carry all traffic. The link formula uses psw_switch_num / pod_num / 2 (half PSW per plane).
# Key parameters (defaults):
asw_switch_num_per_segment = gpu_per_server * 2 # 16 ASW per segment
gpu_count = 15360
psw_sets = 2 # PSW split into psw_switch_1[] and psw_switch_2[]
# Wiring logic:
# GPU_i → ASW-A[i % gps] (NIC-1)
# GPU_i → ASW-B[i % gps] (NIC-2)
# ASW-A[*] → PSW-A[*] only (Plane A)
# ASW-B[*] → PSW-B[*] only (Plane B)
# Two independent planes — no cross-plane links
The simplest topology with no rail optimization. All GPUs in a segment connect to the same single ASW. There is only 1 ASW per segment, meaning 8 GPUs (or more, controlled by nics_per_aswitch) share the same top-of-rack switch. This eliminates rail structure entirely: no GPU position has a dedicated rail, so all traffic patterns contend for the same switch uplinks. The ASW connects to all PSW in a full mesh.
# Key parameters (defaults):
asw_switch_num_per_segment = 1 # Only 1 ASW per segment (no rails!)
gpu_count = 32+ # Small clusters
# Wiring logic:
# ALL GPU in segment → SAME ASW (group_account tracks nics_per_aswitch)
# ASW → ALL PSW (full mesh uplinks)
# No rail alignment — G0, G1, ..., G7 all share the same switch
Non-rail-optimized with dual-ToR redundancy. There are 2 ASW per segment (ASW-1 and ASW-2), but they are not rail-aligned — all GPUs in the segment connect to both ASW switches. Both ASW sets connect to the same PSW pool (single plane). This provides link redundancy for MoE workloads without rail structure.
# Key parameters:
asw_switch_num_per_segment = 2 # 2 ASW per segment (not rail-aligned)
# Wiring logic:
# ALL GPU in segment → ASW-1 (NIC-1, no rail alignment)
# ALL GPU in segment → ASW-2 (NIC-2, no rail alignment)
# ASW-1 → ALL PSW (full mesh)
# ASW-2 → ALL PSW (full mesh, same pool)
# CLI argument → Function mapping:
#   'Spectrum-X'             → Rail_Opti_SingleToR()
#   'AlibabaHPN-SinglePlane' → Rail_Opti_DualToR_SinglePlane()
#   'AlibabaHPN-DualPlane'   → Rail_Opti_DualToR_DualPlane()
#   'DCN-SingleToR'          → No_Rail_Opti_SingleToR()
#   'DCN-DualToR'            → No_Rail_Opti_DualToR()
# The key variable that distinguishes rail vs non-rail:
#   Rail-optimized:     asw_switch_num_per_segment = gpu_per_server (or ×2)
#                       → GPU_i connects to ASW[i % gps] (rail-aligned)
#   Non-rail-optimized: asw_switch_num_per_segment = 1 (or 2)
#                       → All GPUs in segment share same ASW(s)
The NS-3 simulation backend is configured through a combination of a configuration file (SimAI.conf) and environment variables. The configuration file controls network protocol parameters, while environment variables control simulation-level behavior.
# Congestion control
ENABLE_QCN 1
USE_DYNAMIC_PFC_THRESHOLD 1
CC_MODE 1 # 0=disabled, 1=DCQCN
# Packet size
PACKET_PAYLOAD_SIZE 9000 # Jumbo frames
# PFC (Priority Flow Control)
PAUSE_TIME 5
L2_WAIT_FOR_ACK 0
# Switch buffer
BUFFER_SIZE 16777216 # 16MB per port
# RED/ECN marking thresholds
KMIN 1500
KMAX 100000
PMAX 0.2
# ECN
ECN_ENABLED 1
DCTCP_GAIN 0.00390625
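The KMIN/KMAX/PMAX triple defines a RED-style marking curve: no marking below KMIN, certain marking above KMAX, and a linear ramp up to PMAX in between. A minimal sketch of the implied marking probability (the function name is ours; the backend's actual implementation may differ in details):

```python
def ecn_mark_probability(queue_bytes, kmin=1500, kmax=100000, pmax=0.2):
    """RED-style ECN marking curve implied by KMIN/KMAX/PMAX."""
    if queue_bytes <= kmin:
        return 0.0          # queue shallow: never mark
    if queue_bytes >= kmax:
        return 1.0          # queue deep: always mark
    # linear ramp from 0 at KMIN up to PMAX at KMAX
    return pmax * (queue_bytes - kmin) / (kmax - kmin)
```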
# Packet sending latency (microseconds)
AS_SEND_LAT=6
# Enable NVLS algorithm
AS_NVLS_ENABLE=1
# Enable PXN optimization
AS_PXN_ENABLE=0
# Logging level
AS_LOG_LEVEL=INFO
# GPU type override
AS_GPU_TYPE=H100
# Number of NVLink channels
AS_NV_CHANNELS=8
# Ring channel count
AS_RING_CHANNELS=2
# Network bandwidth override (Gbps)
AS_NET_BW=400
# Simulation thread count
AS_SIM_THREADS=16
The ENABLE_QCN and USE_DYNAMIC_PFC_THRESHOLD settings are particularly important for accuracy. QCN (Quantized Congestion Notification) lets congested switches signal senders to slow down before buffers overflow; in RoCEv2 this takes the form of ECN marks that receivers echo back to senders as Congestion Notification Packets (the DCQCN feedback loop). PFC (Priority Flow Control) provides the lossless Ethernet semantics required by RDMA. Together, they model the complex feedback loops that determine real-world network performance.
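A simplified sketch of the DCQCN sender reaction these parameters feed, where DCTCP_GAIN plays the role of the gain g. The real algorithm involves timers, recovery phases, and a precise update order that are omitted here:

```python
def dcqcn_on_cnp(rate_gbps, alpha, g=0.00390625):
    """Simplified DCQCN reaction to a received CNP:
    alpha tracks the congestion estimate (EWMA with gain g),
    and the rate is cut multiplicatively by alpha/2."""
    alpha = (1 - g) * alpha + g              # congestion estimate rises
    rate_gbps = rate_gbps * (1 - alpha / 2)  # multiplicative decrease
    return rate_gbps, alpha
```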
SimAI uses a unified build script that compiles the appropriate backend based on a command-line flag. Each backend produces a separate binary with the same command-line interface, making it easy to switch between simulation modes:
# Analytical backend (fastest build + fastest simulation)
./scripts/build.sh -c analytical
# Output: bin/SimAI_analytical
# Dependencies: C++ compiler only
# Build time: ~30 seconds
# NS-3 Simulation backend (full packet-level)
./scripts/build.sh -c ns3
# Output: bin/SimAI_simulator
# Dependencies: NS-3 library (auto-built)
# Build time: ~5 minutes (first build)
# Physical backend (real RDMA traffic)
./scripts/build.sh -c phy
# Output: bin/SimAI_phynet
# Dependencies: libibverbs, librdmacm
# Build time: ~1 minute
SimAI_analytical — Lightweight, no external dependencies beyond a C++17 compiler. Ideal for CI/CD pipelines and quick iteration on workload configurations.
SimAI_simulator — Full NS-3 integration. The build script automatically fetches and compiles the NS-3 library with SimAI's custom RDMA/QBB modules.
SimAI_phynet — Requires RDMA-capable NICs and the libibverbs / librdmacm libraries. Must be deployed on actual cluster nodes.
Here is a complete workflow for running a communication simulation from topology generation through execution to output analysis. This example simulates a 128-GPU Spectrum-X cluster with A100 GPUs running a micro AllReduce benchmark:
# Generate a 128-GPU Spectrum-X topology with A100s and 100Gbps NICs
python3 gen_Topo_Template.py \
    -topo Spectrum-X \
    -g 128 \
    -gt A100 \
    -bw 100Gbps
# Output: Spectrum-X_128g_8gps_100Gbps_A100
# Contains: 128 GPUs, 64 NVSwitches, ToR/ASW/PSW switches, all links
# microAllReduce.txt — A simple AllReduce benchmark
# Format: num_passes op_type data_size group_type ...
1 ALLREDUCE 1048576 TP    # 1MB AllReduce on TP group
1 ALLREDUCE 67108864 DP   # 64MB AllReduce on DP group
1 ALLTOALL 16777216 EP    # 16MB AllToAll on EP group
# Run with NS-3 backend
# -t: simulation threads, -w: workload file, -n: topology, -c: NS-3 configuration
AS_SEND_LAT=6 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \
    -t 16 \
    -w ./microAllReduce.txt \
    -n ./Spectrum-X_128g_8gps_100Gbps_A100 \
    -c ./SimAI.conf

# Or run with analytical backend for quick results
AS_SEND_LAT=6 ./bin/SimAI_analytical \
    -w ./microAllReduce.txt \
    -n ./Spectrum-X_128g_8gps_100Gbps_A100
# Output files (NS-3 mode):
# - fct.txt:       Flow Completion Times per flow
# - bandwidth.txt: Per-link bandwidth utilization over time
# - queue.txt:     Switch queue occupancy over time
# - rate.txt:      Sending rate per flow over time
# - cnp.txt:       Congestion Notification Packet counts
# - pfc.txt:       PFC pause frame events
# Key metric: total collective completion time
# Compare analytical vs NS-3 to assess congestion impact
For full training iteration simulation, combine the workload generator (AICB) with SimAI. AICB generates realistic computation + communication interleaving patterns, and SimAI handles the network simulation:
# Step 1: Generate workload with AICB
python3 -m aicb.main \
    --model_name llama_70b \
    --tp 8 --dp 16 --pp 1 \
    --world_size 128 \
    --output_dir ./workloads/

# Step 2: Feed into SimAI
AS_SEND_LAT=6 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \
    -t 16 \
    -w ./workloads/llama_70b_tp8_dp16.txt \
    -n ./Spectrum-X_128g_8gps_100Gbps_H100 \
    -c ./SimAI.conf

# The simulation interleaves compute phases (estimated analytically)
# with communication phases (simulated by NS-3)
One of SimAI's most valuable use cases is comparing different network topologies for the same workload. By generating multiple topology files and running the same workload against each, you can quantify the performance impact of different network designs:
# Generate three topologies for comparison
python3 gen_Topo_Template.py -topo Spectrum-X -g 1024 -gt H100 -bw 400Gbps
python3 gen_Topo_Template.py -topo AlibabaHPN-SinglePlane -g 1024 -gt H100 -bw 400Gbps
python3 gen_Topo_Template.py -topo AlibabaHPN-DualPlane -g 1024 -gt H100 -bw 400Gbps

# Run the same workload on each
for topo in Spectrum-X AlibabaHPN-SinglePlane AlibabaHPN-DualPlane; do
    AS_SEND_LAT=6 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator \
        -t 16 \
        -w ./workloads/llama_70b.txt \
        -n ./${topo}_1024g_8gps_400Gbps_H100 \
        -c ./SimAI.conf \
        -o ./results/${topo}/
done

# Compare FCT distributions across topologies
python3 analyze_results.py --dirs ./results/
Before flows can be generated, SimCCL must construct ring and tree channels. A channel is a specific ordering of ranks that defines the communication pattern. Multiple channels can be used simultaneously to increase bandwidth utilization. The number of channels depends on the GPU type and NVLink topology:
// Ring channels are built from the NVLink topology
// Each channel is a permutation of ranks optimized for
// the physical NVLink connectivity
void MockNcclGroup::buildRingChannels() {
int nChannels = getenv("AS_RING_CHANNELS")
? atoi(getenv("AS_RING_CHANNELS"))
: default_ring_channels;
for (int c = 0; c < nChannels; c++) {
std::vector<int> ring;
// Build ring following NVLink adjacency
// Ensures each hop uses a physical NVLink
for (int r = 0; r < nRanks; r++) {
ring.push_back(
gp_info.Ranks[(r + c) % nRanks]);
}
ringchannels.push_back(ring);
}
}
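The rotation logic above can be sketched in a few lines of Python. This is illustrative only: it mirrors the modulo arithmetic, not the real NVLink-aware channel builder:

```python
def build_ring_channels(ranks, n_channels):
    """Channel c is the rank list rotated by c positions,
    spreading traffic across multiple rings."""
    n = len(ranks)
    return [[ranks[(r + c) % n] for r in range(n)]
            for c in range(n_channels)]

# build_ring_channels([0, 1, 2, 3], 2)
# → [[0, 1, 2, 3], [1, 2, 3, 0]]
```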
// Tree channels for hierarchical algorithms
void MockNcclGroup::buildTreeChannels() {
// Build binary tree with root at rank 0
// Child(i) = 2*i+1, 2*i+2
// Used for inter-node reduction in NVLS-Tree
}
While AllReduce is the most common collective, SimCCL also decomposes AllGather and ReduceScatter independently. AllGather is used extensively in tensor parallelism (e.g., gathering weight shards for the forward pass), while ReduceScatter is used for gradient reduction with sharded optimizers (ZeRO Stage 2+):
Each GPU starts with 1/N of the data and ends with all data. N-1 ring steps, each sending data_size/N bytes. Total data transferred per GPU: (N-1)/N * data_size.
Each GPU starts with full data and ends with 1/N of the reduced result. N-1 ring steps with reduction at each hop. Same bandwidth as AllGather but includes reduce operations.
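The byte counts above follow directly from the ring schedule; a one-line sketch (the function name is ours, for illustration):

```python
def ring_bytes_per_gpu(data_size, n):
    """Bytes each GPU sends in a ring AllGather or ReduceScatter:
    N-1 steps, each moving data_size/N bytes."""
    return (n - 1) * (data_size // n)

# 1 MiB AllGather across 8 GPUs: each GPU sends 7/8 of the data
# ring_bytes_per_gpu(1048576, 8) → 917504
```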
AllToAll is the most network-intensive collective, used in Mixture-of-Experts (MoE) models to dispatch tokens to the appropriate expert. Unlike AllReduce, where data flows in one direction around a ring, AllToAll requires every GPU to send unique data to every other GPU. This creates an all-pairs pattern of N×(N−1) point-to-point flows that can severely stress the network fabric:
std::map<int, std::shared_ptr<FlowModels>>
MockNcclGroup::genAllToAllRingFlowModels(
GroupType type, int rank, uint64_t data_size) {
int nranks = gp_info.nRanks;
// Each GPU sends data_size/nranks to each other GPU
uint64_t per_peer_size = data_size / nranks;
for (int peer = 0; peer < nranks; peer++) {
if (peer == rank) continue;
// Direct P2P send to each peer
SingleFlow flow(
flow_id++, rank, gp_info.Ranks[peer],
per_peer_size,
{}, // No dependencies (all sends parallel)
{},
{child_ids}, // Barrier after all sends complete
0, peer, nranks, "RING");
}
return flow_models;
}
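The resulting flow set can be sketched independently of the SingleFlow machinery (illustrative Python, not SimCCL API):

```python
def alltoall_flows(nranks, data_size):
    """Enumerate the point-to-point flows of a flat AllToAll:
    every rank sends data_size/nranks to every other rank."""
    per_peer = data_size // nranks
    return [(src, dst, per_peer)
            for src in range(nranks)
            for dst in range(nranks)
            if src != dst]

# 8 ranks → 8 × 7 = 56 simultaneous flows, all crossing the fabric
```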
PXN (Proxy-based cross-Node communication) is an optimization where GPUs within a node use NVLink to forward data to a proxy GPU that has direct NIC access, rather than each GPU accessing the NIC directly. This reduces NIC contention and leverages the high-bandwidth NVLink fabric for intra-node data gathering before sending across the network:
// PXN flow generation: 3 phases
// Phase 1: NVLink gather to proxy GPU (PXN_INIT)
// Phase 2: Network transfer from proxy (NET)
// Phase 3: NVLink scatter from proxy (PXN_FINAL)
if (PXNenable && cross_node_transfer) {
// Step 1: Local GPUs send to proxy via NVLink
for (auto& local_rank : node_ranks) {
if (local_rank == proxy_rank) continue;
flows.push_back(SingleFlow(
id++, local_rank, proxy_rank,
chunk_size, {}, {}, {net_flow_id},
channel, chunk, count, "PXN_INIT"));
}
// Step 2: Proxy sends aggregated data over network
flows.push_back(SingleFlow(
id++, proxy_rank, remote_proxy,
total_chunk_size,
{pxn_init_flow_ids}, // Wait for all local gathers
{}, {pxn_final_ids},
channel, chunk, count, "NET"));
// Step 3: Remote proxy scatters to local GPUs via NVLink
for (auto& remote_rank : remote_node_ranks) {
if (remote_rank == remote_proxy) continue;
flows.push_back(SingleFlow(
id++, remote_proxy, remote_rank,
chunk_size,
{net_flow_id}, // Wait for network transfer
{}, {},
channel, chunk, count, "PXN_FINAL"));
}
}
PXN is controlled by the AS_PXN_ENABLE environment variable. It is most beneficial when the NIC-to-GPU ratio is less than 1:1 (e.g., 4 NICs for 8 GPUs), which is common in many datacenter configurations. The trade-off is increased NVLink traffic in exchange for reduced NIC contention.
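A back-of-the-envelope sketch of the trade-off (purely illustrative accounting, not SimAI code):

```python
def pxn_flow_counts(local_gpus, nics):
    """Rough per-chunk accounting: without PXN each GPU issues its
    own network send; with PXN one proxy per NIC sends the
    aggregated chunk, at the cost of extra NVLink gathers."""
    without_pxn = local_gpus          # one network flow per GPU
    with_pxn = nics                   # one network flow per proxy/NIC
    nvlink_extra = local_gpus - nics  # gathers to proxies over NVLink
    return without_pxn, with_pxn, nvlink_extra

# 8 GPUs sharing 4 NICs: 8 network flows drop to 4,
# in exchange for 4 extra NVLink hops per chunk
```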
Faithfully replicates NCCL's algorithm selection (Ring, Tree, NVLS, NVLS-Tree) and flow decomposition logic. GPU-type-aware: A100 vs H100 use different algorithms.
Every collective is decomposed into a DAG of SingleFlows with explicit prev/child dependencies. This enables fine-grained simulation of overlapping and pipelined communication.
Analytical (fast/low-fi), NS-3 (slow/high-fi), Physical (real hardware). Same workload file works with all three. Choose based on accuracy needs.
Supports Spectrum-X, Alibaba HPN, and DCN+ topologies. GPU-NVSwitch-ToR-ASW-PSW hierarchy. Real datacenter network architectures.