A centralised routing controller for SimAI/NS-3 that eliminates ECMP hash collisions by pre-assigning source ports before each RDMA flow is created.
NS-3 · Central Controller · ECMP
Large-scale LLM training clusters use collective communication (AllReduce, AllGather, ReduceScatter)
to synchronise model parameters across GPUs. Each collective is decomposed into many concurrent
point-to-point RDMA flows. The underlying network uses ECMP (Equal-Cost Multi-Path)
routing: at every switch, a hash function over the flow 5-tuple
(sip, dip, sport, dport, protocol) selects one of several equal-cost output ports.
Goal: Implement a centralised controller that, before any RDMA flow is created, pre-computes the complete hop-by-hop path and steers the flow away from paths already in use — eliminating hash collisions entirely.
┌─────────────────────────────────────────────────────────┐
│ AstraSim (collective communication scheduler)           │
│  · Decides when each GPU issues a send/recv             │
│  · Implements ring-AllReduce, tree-AllReduce, etc.      │
│  · Calls sim_send() / sim_recv() to inject traffic      │
└─────────────────────┬───────────────────────────────────┘
                      │ sim_send / sim_recv API
┌─────────────────────▼───────────────────────────────────┐
│ NS-3 (packet-level network simulator)                   │
│                                                         │
│  ┌────────────────────────────────────────┐             │
│  │ CentralController (NEW)                │             │
│  │  · Maintains global view of all paths  │             │
│  │  · AssignSport() — collision-free port │             │
│  │  · ReleasePath() — free on completion  │             │
│  └────────────────────────────────────────┘             │
│                                                         │
│  RdmaHw        – RDMA NIC logic per GPU                 │
│  RdmaQueuePair – one RDMA QP per flow                   │
│  SwitchNode    – ECMP forwarding logic                  │
│  QbbNetDevice  – queuing, PFC pause, ECN marking        │
└─────────────────────────────────────────────────────────┘
In original SimAI, sport comes from a per-(src, dst) counter:
portNumber[src][dst]++, initialised at ~10,000. Successive flows from GPU A to GPU B get
sports 10000, 10001, 10002, … Because ECMP hashes modulo the number of equal-cost paths, consecutive
sport values for the same (sip, dip) often produce the same hash
remainder — causing frequent collisions.
When two flows share the same output port at a switch, they share link bandwidth. In a ring-AllReduce, the collective finishes only when the last GPU completes — a single congested link delays the entire collective.
PFC (Priority-based Flow Control) propagates backpressure upstream, potentially stalling unrelated flows. A localised collision can turn into cluster-wide head-of-line blocking.
SwitchNode::EcmpHash() is a static Murmur3 function with a per-switch
seed (m_ecmpSeed). The same 5-tuple may take different branches at
different switches. The NIC selection at the source host also uses Murmur3 with fixed seed
0x8BADF00D.
# Murmur3 hash over the 12-byte buffer: sip(4) | dip(4) | sport(2) | dport(2)
from struct import pack, unpack_from

def rotl(x, r):
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def ecmp_hash(sip, dip, sport, dport, seed):
    buf = pack('<III', sip, dip, (sport & 0xFFFF) | (dport << 16))
    h = seed
    for i in range(0, 12, 4):
        k, = unpack_from('<I', buf, i)
        k = (k * 0xcc9e2d51) & 0xFFFFFFFF   # mix step 1
        k = rotl(k, 15)
        k = (k * 0x1b873593) & 0xFFFFFFFF   # mix step 2
        h ^= k
        h = rotl(h, 13)
        h = (h * 5 + 0xe6546b64) & 0xFFFFFFFF
    # finalisation: XOR in the length (12 bytes), then avalanche
    h ^= 12
    h ^= h >> 16
    h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13
    h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h
The only field in the 5-tuple that the sender controls freely is sport. By choosing
sport carefully, we can steer the flow onto any desired path, because every
switch re-hashes on the full 5-tuple.
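The steering claim can be sanity-checked with a toy stand-in hash (a simple affine function, not the simulator's Murmur3; the only point it illustrates is that sport perturbs the hash input): scanning sport values reaches every ECMP bucket for a fixed (sip, dip, dport).

```python
def bucket(sip, dip, sport, dport, n_paths, seed=0):
    # Toy affine stand-in for the switch hash (NOT the simulator's Murmur3).
    return (sip * 31 + dip * 17 + sport * 7 + dport + seed) % n_paths

# Fixed (sip, dip, dport); only sport varies, exactly the sender's situation.
SIP, DIP, DPORT = 0x0A000001, 0x0A000101, 100

reached = {}                       # bucket -> first sport that lands there
for sport in range(1, 65536):
    b = bucket(SIP, DIP, sport, DPORT, n_paths=8)
    reached.setdefault(b, sport)
    if len(reached) == 8:          # every equal-cost path is now reachable
        break
```

With this stand-in, eight consecutive sports already cover all eight buckets; with real Murmur3 the scan order is pseudo-random, but the same exhaustive search still reaches every path.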
The controller's strategy, in four steps:
1. Track every active flow's reserved hops globally, in m_hopToFlows and m_flowPaths.
2. For each new flow, pre-compute the path for each candidate sport value (1, 2, 3, …) using an exact replica of the switch hash functions.
3. Assign the first sport whose path shares no (switch, output-port) hop with any existing active flow.
4. Release the reservations when the flow completes (QpComplete()).
m_switches       : nodeId → { ecmpSeed, rtTable[dip] → [outPorts] }
m_adjacency      : nodeId → { portIdx → neighborNodeId }
m_hostRtTable    : hostId → { dip → [nicIndices] }
m_hostNicToSwitch: hostId → { nicIdx → switchNodeId }
m_hopToFlows     : HopKey(nodeId, port) → set of active flowKeys
m_flowPaths      : flowKey → Path (list of (nodeId, port))
Updated on every AssignSport() and ReleasePath() call.
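As an illustrative sketch, the reservation state can be mirrored with plain dicts and sets (Python stand-ins for the C++ maps; the function and variable names here are hypothetical):

```python
from collections import defaultdict

hop_to_flows = defaultdict(set)   # (nodeId, port) -> set of active flowKeys
flow_paths = {}                   # flowKey -> [(nodeId, port), ...]

def has_conflict(path):
    # Skip the final hop: the leaf->GPU link is identical for every sport.
    return any(hop_to_flows[hop] for hop in path[:-1])

def reserve_path(path, flow_key):
    flow_paths[flow_key] = path
    for hop in path[:-1]:
        hop_to_flows[hop].add(flow_key)

def release_path(flow_key):
    for hop in flow_paths.pop(flow_key):
        hop_to_flows[hop].discard(flow_key)

# Two candidate paths that share a leaf uplink (18, 0) but not the final hop:
p1 = [(18, 0), (20, 3), (19, 5)]
p2 = [(18, 0), (20, 4), (19, 6)]
reserve_path(p1, ("A", "B", 1, 100))
```

While p1 is active, has_conflict(p2) is True; after release_path it is False again, which is the same lifecycle AssignSport() and ReleasePath() drive in the controller.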
AssignSport(sip, dip, dport, src, dst):
if not enabled or src/dst on same server:
return default_sport ← intra-server: NVSwitch, no ECMP
// Early exit: if probe path has ≤ 1 hop, CC can't help
probe = ComputePath(sip, dip, fallbackSport, dport)
if probe.size() <= 1:
return fallbackSport
for sport = 1, 2, 3, …, 65534:
path = ComputePath(sip, dip, sport, dport)
if no hop in path[:-1] is already reserved: ← skip final hop
ReservePath(path, flowKey)
log to CSV if CC_LOG_FILE is set
return sport
return default_sport ← fallback (practically unreachable)
Step 1 — NIC selection at source host
buf = [ sip | dip | (sport << 0) | (dport << 16) ] (12 bytes)
h = murmur3(buf, seed=0x8BADF00D) ← matches RdmaQueuePair::GetHash()
nic = m_hostRtTable[srcNode][dip][ h % #nics ]
currentNode = m_hostNicToSwitch[srcNode][nic]
Step 2 — Switch-by-switch ECMP traversal
while currentNode ≠ dstNode:
sw = m_switches[currentNode]
h = murmur3(buf, seed=sw.ecmpSeed) ← matches SwitchNode::GetOutDev()
port = sw.rtTable[dip][ h % #ports ]
append (currentNode, port) to path
currentNode = m_adjacency[currentNode][port]
This replicates the exact hash computation used in the live NS-3 simulation, so the predicted path is always correct.
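The traversal can be sketched end-to-end on a hypothetical two-spine topology (all node IDs, tables, and hash_fn below are made up for illustration; the real controller replays the Murmur3 routine with each switch's actual seed):

```python
def hash_fn(sip, dip, sport, dport, seed):
    # Deliberately simple stand-in hash (NOT Murmur3), used only for this toy.
    return (sip * 31 + dip * 17 + sport * 7 + dport + seed) & 0xFFFFFFFF

# leaf18 --(2 uplinks)--> spine20/spine21 --> leaf19 --> host 8
switches = {
    18: {"seed": 1, "rt": {8: [0, 1]}},   # leaf18: ECMP over two uplinks
    20: {"seed": 2, "rt": {8: [0]}},      # spine20: single downlink
    21: {"seed": 3, "rt": {8: [0]}},      # spine21: single downlink
    19: {"seed": 4, "rt": {8: [2]}},      # leaf19: port 2 reaches host 8
}
adjacency = {18: {0: 20, 1: 21}, 20: {0: 19}, 21: {0: 19}, 19: {2: 8}}

def compute_path(sip, dip, sport, dport, first_switch, dst_node):
    node, path = first_switch, []
    while node != dst_node:                       # Step 2: hop-by-hop replay
        sw = switches[node]
        ports = sw["rt"][dip]
        h = hash_fn(sip, dip, sport, dport, sw["seed"])
        port = ports[h % len(ports)]
        path.append((node, port))
        node = adjacency[node][port]
    return path
```

Changing only sport flips the leaf's ECMP choice and therefore the spine: compute_path(100, 8, 1, 100, 18, 8) and compute_path(100, 8, 2, 100, 18, 8) diverge at the first hop while sharing the same final hop.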
AstraSim NS-3 RdmaHw Network
──────── ─────────── ───────
sim_send(A→B)
│
└──► AddQueuePair(sip, dip,
sport=portNumber[s][d]++,
dport, size)
│
Create RdmaQueuePair
(sport = ~10000+)
│
Transmit ─────────────────►
Switch hashes
→ may collide
→ PFC backpressure
◄── ACKs ────────────────
│
QpComplete()
│
◄──────────┘
notify AstraSim
AstraSim RdmaHw/CC Network
──────── ────────── ───────
sim_send(A→B)
│
└──► AddQueuePair(sport=1)
│
┌─────────────────────┐
│ CentralController │
│ ::AssignSport() │
│ for s = 1,2,3,…: │
│ path=ComputePath()│
│ if no conflict: │
│ ReservePath() │
│ return s │
└──────────┬──────────┘
sport = s (conflict-free)
│
Create RdmaQueuePair
│
Transmit ─────────────────►
→ unique path
→ no collision
◄── ACKs ────────────────
│
┌─────────▼──────────┐
│ CC::ReleasePath() │
│ free reserved hops │
└────────────────────┘
│
◄──────────┘
notify AstraSim
Called once after SetRoutingEntries() finishes populating all NS-3 routing tables:
for each NS-3 node:
build m_adjacency from GetDevice(d)->GetChannel()
if SwitchNode:
copy GetEcmpSeed() and GetRtTable() into m_switches
if GPU host with RdmaHw:
copy m_rtTable (dip → NIC list) into m_hostRtTable
for each NIC, find which switch it connects to → m_hostNicToSwitch
Called from RdmaHw::QpComplete() when a flow finishes:
ReleasePath(sip, dip, sport, dport):
key = FlowKey(sip, dip, sport, dport)
path = m_flowPaths[key]
for each (nodeId, port) in path:
m_hopToFlows[(nodeId, port)].remove(key)
delete m_flowPaths[key]
HasConflict() originally iterated over all path hops, including
the final hop rail_switch → dst_GPU. This final hop is always identical
regardless of sport. Fix: skip path.back() — only check spine hops where path
diversity exists.
Same-rail GPU pairs have 1-hop paths. After excluding the final hop → zero spine hops
→ HasConflict() always returns false → CC assigned sport=1 to
all concurrent same-rail flows, causing map key collisions. Fix: early-exit if probe
path has ≤ 1 hop.
| Env Variable | Description |
|---|---|
| CC_ENABLE=1 | Enable controller (default) |
| CC_ENABLE=0 | Disable — flows use original ECMP behaviour |
| CC_LOG_FILE=paths.csv | Write per-flow path assignments to CSV |
| TRACE_INJECT_FILE=<path> | Bypass AstraSim — drive simulation from a trace CSV |
No recompilation needed — toggling the controller is a one-character change to the run command
(CC_ENABLE=0/1).
AstraSim only supports collective workloads. For disaggregated inference, the critical traffic is point-to-point KV cache transfers (P-node → D-node). A lightweight TraceInjector mode bypasses AstraSim entirely and drives NS-3 directly from a CSV trace.
# timestamp_ns,src_node_id,dst_node_id,size_bytes
0,0,8,10485760 ← GPU 0 → GPU 8, 10 MB at t=0
0,1,9,10485760 ← GPU 1 → GPU 9, 10 MB at t=0
100000,0,8,10485760 ← another flow 100µs later
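A minimal round-trip of this schema, writing a trace and parsing it back the way a TraceInjector-style reader would (a sketch, not the project's actual generator):

```python
import csv
import io

# timestamp_ns, src_node_id, dst_node_id, size_bytes
rows = [
    (0, 0, 8, 10 * 1024 * 1024),        # GPU 0 -> GPU 8, 10 MiB at t=0
    (0, 1, 9, 10 * 1024 * 1024),        # GPU 1 -> GPU 9, 10 MiB at t=0
    (100_000, 0, 8, 10 * 1024 * 1024),  # another flow 100 us later
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
trace_text = buf.getvalue()

# Reader side: one (t_ns, src, dst, size) tuple per line.
flows = [tuple(map(int, line.split(',')))
         for line in trace_text.strip().splitlines()]
```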
| Pattern | Description |
|---|---|
| constant | Fixed inter-arrival = 1/rate (or --interval-ns) |
| poisson | Exponential inter-arrival (Poisson process) |
| burst | Periodic bursts: burst_size flows at the same timestamp |
| hotspot | Fraction of flows from one specific server pair → max collision |
| server_pair | All GPUs on src server → matching GPUs on dst server, N rounds |
| one_to_one | Fixed (src, dst) GPU pair, N flows staggered by interval |
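The poisson and burst patterns, for example, reduce to a few lines each; this is a simplified sketch, not the actual gen_kv_trace.py implementation:

```python
import random

def poisson_arrivals(rate_per_s, n, seed=42):
    """Poisson process: exponential inter-arrival times, timestamps in ns."""
    rng = random.Random(seed)
    t, out = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate_per_s) * 1e9   # seconds -> ns
        out.append(int(t))
    return out

def burst_arrivals(burst_size, period_ns, n_bursts):
    """Periodic bursts: burst_size flows share each timestamp."""
    return [b * period_ns for b in range(n_bursts) for _ in range(burst_size)]
```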
The simulation has two modes: AstraSim workload mode (full collective communication) and TraceInjector mode (inject point-to-point flows from a CSV). Both produce the same fct.txt output.
# Build NS-3 backend (compiles central-controller.cc, rdma-hw.cc patches, etc.)
./scripts/build.sh -c ns3
# DCN+ fat-tree: 16 GPUs (2 servers × 8 GPU/server), 8 spine switches
python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \
-topo DCN+ -g 16 -gps 8 -gt A100 -bw 100Gbps -nvbw 2400Gbps \
-asn 2 -npa 8 -psn 8 -apbw 100Gbps -app 2
# Spectrum-X: 128 GPUs (rail-optimized, for training workloads)
python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \
-topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps
Bypasses AstraSim entirely. Inject flows from a CSV trace file.
# 1. Generate trace (server 0 GPUs → server 1 GPUs, 10 MB each)
python3 gen_kv_trace.py \
--arrival server_pair --src-server 0 --dst-server 1 \
--gpus-per-server 8 --kv-mb 10 --rounds 1 \
--out traces/my_trace.csv
# 2a. Run WITH CC
CC_ENABLE=1 CC_LOG_FILE=results/cc/paths.csv \
TRACE_INJECT_FILE=traces/my_trace.csv \
AS_SEND_LAT=3 ./bin/SimAI_simulator -t 1 \
-n <topology_dir> -c results/cc/SimAI_local.conf
# 2b. Run WITHOUT CC (A/B comparison)
CC_ENABLE=0 \
TRACE_INJECT_FILE=traces/my_trace.csv \
AS_SEND_LAT=3 ./bin/SimAI_simulator -t 1 \
-n <topology_dir> -c results/nocc/SimAI_local.conf
Run real collective workloads (AllReduce, AllToAll, etc.) through AstraSim → SimCCL → NS-3.
CC_ENABLE=1 CC_LOG_FILE=results/cc/paths.csv \
AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 1 \
-w workloads/test_cc_compare.txt \
-n <topology_dir> \
-c results/cc/SimAI_local.conf
| Variable | Values | Effect |
|---|---|---|
CC_ENABLE | 1 (default) / 0 | Enable/disable CentralController sport assignment |
CC_LOG_FILE | file path | Write per-flow path CSV (flow_id, sip, dip, sport, hops) |
TRACE_INJECT_FILE | CSV path | TraceInjector mode: bypass AstraSim, inject flows from CSV |
AS_SEND_LAT | microseconds | Software send latency before NIC injection (default 6) |
AS_NVLS_ENABLE | 1 / 0 | Enable NVSwitch NVLS transport for intra-server flows |
# FCT CDF plot (CC vs No CC)
python3 plot_fct_cdf.py \
--cc results/cc/fct.txt --nocc results/nocc/fct.txt \
--out results/fct_cdf.png
# Export per-flow analysis CSV
python3 export_flow_csv.py \
--fct results/cc/fct.txt --paths results/cc/paths.csv \
--out results/cc/flows.csv
# Collision analysis
python3 analyze_collisions.py \
--cc results/cc/paths.csv --nocc results/nocc/fct.txt \
--trace traces/my_trace.csv --bw-gbps 100 --send-lat-ns 3000
| File | Format | Contents |
|---|---|---|
fct.txt | Space-separated, 1 line per flow | sip dip sport dport size start_ns fct_ns ideal_ns |
paths.csv | CSV (only with CC_LOG_FILE) | flow_id,sip,dip,sport,dport,n_hops,hops |
flows.csv | CSV (from export_flow_csv.py) | flow_id,src_gpu,dst_gpu,sport,spine_switch,fct_ns,slowdown,cc_assigned |
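Given the fct.txt format above, per-flow slowdown is simply fct_ns divided by ideal_ns; a parsing sketch (field names assumed from the format description):

```python
def parse_fct_line(line):
    """One fct.txt record: sip dip sport dport size start_ns fct_ns ideal_ns."""
    sip, dip, sport, dport, size, start_ns, fct_ns, ideal_ns = map(int, line.split())
    return {"sip": sip, "dip": dip, "sport": sport, "dport": dport,
            "size": size, "start_ns": start_ns, "fct_ns": fct_ns,
            "slowdown": fct_ns / ideal_ns}   # 1.0 = flow ran at ideal speed

# Example record (values illustrative, not taken from a real run):
rec = parse_fct_line("167772161 167772417 1 100 10485760 0 1693000 850000")
```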
Topology: DCN+ Single-ToR fat-tree — 16 GPUs, 2 servers, 8 spine switches. Every cross-server flow traverses exactly 3 hops. Workload: 8 simultaneous 10 MB flows: GPU i (server 0) → GPU i+8 (server 1).
Server 0 (GPUs 0–7)                 Server 1 (GPUs 8–15)
┌─────────────────┐                 ┌─────────────────┐
│ GPU0 GPU1 … GPU7│                 │ GPU8 GPU9 …GPU15│
│   └───┴──┬──┘   │                 │   └───┴──┬──┘   │
│       NVSw16    │                 │       NVSw17    │
│      (NVLink)   │                 │      (NVLink)   │
└──────────┼──────┘                 └──────┼──────────┘
           │                               │
         Leaf18                          Leaf19
           │ 8 × 100Gbps uplinks           │ 8 × 100Gbps uplinks
           └───────────┬───────────────────┘
  Spine20 Spine21 Spine22 Spine23 Spine24 Spine25 Spine26 Spine27
GPU 0→8 sport=10000 Spine 27 FCT= 850µs (1.0×)
GPU 1→9 sport=10000 Spine 20 FCT= 850µs (1.0×)
GPU 2→10 sport=10000 Spine 23 FCT= 850µs (1.0×)
GPU 3→11 sport=10000 Spine 24 FCT= 850µs (1.0×)
GPU 4→12 sport=10000 Spine 25 FCT=1693µs (2.0×) ←
GPU 5→13 sport=10000 Spine 22 FCT=1693µs (2.0×) ←
GPU 6→14 sport=10000 Spine 21 FCT=1693µs (2.0×) ←
GPU 7→15 sport=10000 Spine 26 FCT=1693µs (2.0×) ←
4 of 8 spines carry 2 flows each → 50% bandwidth → 2× FCT
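These FCTs are consistent with simple serialisation arithmetic (a sketch; the measured 850 µs adds packet headers and propagation delay on top of the ideal transmission time):

```python
size_bits = 10 * 1024 * 1024 * 8     # one 10 MiB KV-cache flow
link_bps = 100e9                     # 100 Gbps spine link

ideal_us = size_bits / link_bps * 1e6           # sole user of the link
shared_us = size_bits / (link_bps / 2) * 1e6    # two flows collide on one spine
```

ideal_us is about 839 µs and shared_us about 1678 µs, matching the observed 1.0× vs 2.0× split.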
GPU 0→8 sport=1 Spine 27 FCT=850µs (1.0×)
GPU 1→9 sport=1 Spine 20 FCT=850µs (1.0×)
GPU 2→10 sport=1 Spine 23 FCT=850µs (1.0×)
GPU 3→11 sport=1 Spine 24 FCT=850µs (1.0×)
GPU 4→12 sport=2 Spine 25 FCT=850µs (1.0×) ← reassigned
GPU 5→13 sport=4 Spine 22 FCT=850µs (1.0×) ← reassigned
GPU 6→14 sport=2 Spine 21 FCT=850µs (1.0×) ← reassigned
GPU 7→15 sport=1 Spine 26 FCT=850µs (1.0×)
Every spine carries exactly 1 flow → full bandwidth → 1.0× FCT
| Metric | With CC | Without CC | Improvement |
|---|---|---|---|
| Mean FCT | 850 µs | 1,271 µs | −33% |
| Max FCT | 850 µs | 1,693 µs | −50% |
| Mean Slowdown | 1.0× | 1.5× | — |
| Spine Collisions | 0 | 4 pairs | — |
| Layer | With CC | Without CC | Δ |
|---|---|---|---|
| AllReduce 128 MB | 2,554,511 | 2,559,883 | +0.21% |
| AllReduce 64 MB | 1,282,541 | 1,276,607 | −0.47% |
| AllReduce 32 MB | 645,833 | 662,472 | +2.51% |
| Total | 4,482,885 | 4,498,962 | +0.36% |
Small improvement because 8 concurrent AllReduce rings already produce diverse
(sip, dip) pairs that naturally spread across different spines via ECMP.
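A balls-in-bins estimate makes both results plausible: n flows hashing uniformly into k spines hit k·(1 − (1 − 1/k)^n) distinct spines on average. For the fat-tree burst (8 flows, 8 spines) that is about 5.25 distinct spines, so collisions are expected in the baseline; with the far more numerous and diverse (sip, dip) pairs of 8 concurrent rings, the load averages out and CC has little left to fix.

```python
def expected_distinct(n, k):
    # Expected number of distinct bins hit when n flows hash uniformly into k bins.
    return k * (1 - (1 - 1 / k) ** n)

e = expected_distinct(8, 8)   # the no-CC fat-tree burst scenario
```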
| sport | Flow Count | % |
|---|---|---|
| 1 | 1,628 | 84.7% |
| 2 | 263 | 13.7% |
| 3 | 27 | 1.4% |
| 4 | 2 | 0.1% |
| 5 | 1 | 0.1% |
With 128 × 127 = 16,256 concurrent flows, the network is fully saturated. CC provides only 0.02% improvement — resolving collisions doesn't free capacity when all links are already at 100%. CC search depth reached 183, and simulation took 119 minutes vs. 2 minutes without CC.
| Scenario | CC Benefit | Root Cause |
|---|---|---|
| 8-ring AllReduce (Spectrum-X) | +0.36% | Diverse (sip,dip) already spread by ECMP |
| Full AllToAll 128G | +0.02% | Network fully saturated |
| Hotspot (same server pair) | ~0% | Final-hop congestion dominates |
| Fat-tree burst (8 flows, 8 spines) | −33% mean, −50% max | ECMP collision on spine; CC spreads perfectly |
Conditions for large CC benefit: (1) Flows traverse multiple candidate spine paths, (2) concurrent flow count ≤ available spine paths, (3) default ECMP produces hash collisions, (4) destination GPUs are distinct (no shared final hop).
| File | Change |
|---|---|
| central-controller.h/cc | NEW: CentralController singleton class (~300 lines). Declares Initialize(), AssignSport(), ReleasePath(), ComputePath(), HasConflict(), ReservePath(). Maintains topology tables and dynamic reservation state. |
| switch-node.h | MOD: EcmpHash() moved from private to public static. Added GetEcmpSeed() and GetRtTable() getters so CC can replicate switch-level hashing. |
| rdma-hw.cc | MOD: AddQueuePair() calls AssignSport() to override sport before QP creation; QpComplete() calls ReleasePath() to free reserved hops. |
| common.h | MOD: Added #include <central-controller.h>. After SetRoutingEntries(), calls Initialize() and reads the CC_ENABLE / CC_LOG_FILE env vars. |
| entry.h | MOD: SendFlow() calls CC's AssignSport() before registering the flow. New SendFlowTrace() helper for TraceInjector mode with a sentinel ncclFlowTag (current_flow_id == -1). |
| AstraSimNetwork.cc | MOD: if the TRACE_INJECT_FILE env var is set, reads the CSV trace and schedules each flow via Simulator::Schedule(NanoSeconds(t), &SendFlowTrace, ...), bypassing AstraSim entirely. |
| Script | Description |
|---|---|
| gen_kv_trace.py | Generate KV cache transfer traces with 6 arrival patterns (constant, poisson, burst, hotspot, server_pair, one_to_one). Configurable topology mapping (128 GPUs, 16 servers × 8 GPU/server). |
| analyze_collisions.py | Count ECMP spine collisions. Replicates the exact Murmur3 hash used by NS-3 switches. Supports both static (all pairs) and concurrent (time-window-overlapping) collision counting. Can read actual NS-3 paths from CC_LOG_FILE. |
| plot_fct_cdf.py | Plot FCT CDF and per-flow sorted bar chart comparing CC vs No CC. Reads fct.txt files from both runs. |
| export_flow_csv.py | Join fct.txt + paths.csv into a per-flow analysis CSV with fields: flow_id, src/dst GPU, sport, spine_switch, hops, FCT, slowdown, cc_assigned. |
This section traces the execution flow through the 5 C++ source files, showing exactly when each is invoked and what it does at the code level.
Contains main() and the ASTRASimNetwork class that bridges AstraSim ↔ NS-3.
main(argc, argv)
① user_param_prase() // parse -t thread -w workload -n topo -c conf
② main1(topo, conf) // → common.h: ReadConf → SetConfig → SetupNetwork
// (builds NS-3 topology, initializes CentralController)
③ if env TRACE_INJECT_FILE: // ← NEW: TraceInjector bypass mode
for line in csv:
Simulator::Schedule(NanoSeconds(t), &SendFlowTrace, src, dst, size)
Simulator::Run() // run NS-3 with injected flows only
return
④ // Normal mode: create ASTRASimNetwork + AstraSim::Sys per GPU
for j in nodes_num:
networks[j] = new ASTRASimNetwork(j, 0)
systems[j] = new AstraSim::Sys(networks[j], ...)
for i in nodes_num:
systems[i]->workload->fire() // kick off collective workloads
Simulator::Run()
| Method | What it does |
|---|---|
sim_send() | Registers flow in sentHash, then calls SendFlow() (in entry.h) to inject an RDMA QP into NS-3 |
sim_recv() | Matches arriving data against expeRecvHash. If data already arrived: invoke callback immediately. If not: register in expeRecvHash to wait. |
sim_get_time() | Returns NS-3's Simulator::Now().GetNanoSeconds() to AstraSim |
sim_schedule() | Wraps Simulator::Schedule() for AstraSim callbacks |
When TRACE_INJECT_FILE is set, the entire AstraSim workload system is bypassed. Flows are read from a CSV (timestamp_ns,src,dst,size_bytes) and scheduled directly into NS-3 via SendFlowTrace(). This enables isolated network experiments (e.g., KV cache transfer patterns) without needing the full SimAI stack.
Defines the flow injection functions and all flow completion callbacks. This is where the CentralController's AssignSport() is called for every flow.
SendFlow() — normal mode (called by AstraSim):

void SendFlow(src, dst, maxPacketCount, msg_handler, fun_arg, tag, request) {
port = portNumber[src][dst]++;
// ← NEW: CentralController picks a conflict-free sport
port = CentralController::Instance().AssignSport(
serverAddress[src], serverAddress[dst],
100 /*dport*/, src, dst, port);
sender_src_port_map[{port, {src, dst}}] = request->flowTag;
RdmaClientHelper clientHelper(pg, srcIP, dstIP, port, dport, size, ...);
appCon = clientHelper.Install(n.Get(src));
appCon.Start(Time(send_lat)); // inject RDMA QP into NS-3
}
SendFlowTrace() — TraceInjector mode (no AstraSim context):

void SendFlowTrace(src, dst, size_bytes) {
port = portNumber[src][dst]++;
port = CentralController::Instance().AssignSport(...);
AstraSim::ncclFlowTag sentinel; // current_flow_id = -1 (sentinel)
sender_src_port_map[{port, {src, dst}}] = sentinel;
// Same RDMA QP injection, but no AstraSim callbacks
RdmaClientHelper clientHelper(..., nullptr, nullptr, ...);
appCon.Start(Time(send_lat));
}
| Callback | Triggered by | What it does |
|---|---|---|
qp_finish() | NS-3 when all packets arrive at dst | Writes to fct.txt, checks is_receive_finished(), calls notify_receiver_receive_data() → triggers AstraSim recv callback. For sentinel flows (TraceInjector): early return. |
send_finish() | NS-3 when all packets leave src NIC | Checks is_sending_finished(), calls notify_sender_sending_finished() → triggers AstraSim send callback. For sentinel flows: early return. |
The infrastructure glue file. Defines global state, reads config, builds the NS-3 topology, and initializes the CentralController.
main1(topo, conf)
① ReadConf(topo, conf) // Parse config: CC_MODE, DATA_RATE, LINK_DELAY, PFC, DCQCN params...
② SetConfig() // Apply NS-3 defaults (PauseTime, QcnEnabled, IntHeader mode)
// cc_mode=3 → DCQCN, cc_mode=7 → TIMELY, cc_mode=10 → PINT
③ SetupNetwork(qp_finish, send_finish)
// a. Read topology file: node_num, gpus_per_server, nvswitch_num, switch_num
// b. Create NS-3 nodes (hosts, SwitchNode, NVSwitchNode)
// c. Install point-to-point links with QBB (lossless Ethernet)
// d. Configure RDMA per host (RdmaHw with CC mode, rates, PFC)
// e. CalculateRoutes() → SetRoutingEntries()
// f. NEW: Initialize CentralController ↓
// After SetRoutingEntries():
CentralController::Instance().Initialize(n, nodeIdToRdmaHw, ipToNodeId, gpus_per_server);
// Env var control:
if (env "CC_LOG_FILE") CC.SetLogFile(path); // CSV: flow_id,sip,dip,sport,hops
if (env "CC_ENABLE" != "0") CC.Enable(); // default: enabled
else CC.Disable(); // for A/B comparison
Singleton class with the full API and internal data structures for conflict-free path assignment.
| Method | When called | What it does |
|---|---|---|
Initialize() | Once, in SetupNetwork | Reads NS-3 topology: adjacency graph, switch ECMP seeds + routing tables, host NIC→switch mapping |
AssignSport() | Every flow (SendFlow / SendFlowTrace) | Probes sport 1..65534, finds first conflict-free path, reserves it |
ReleasePath() | QP completion (rdma-hw.cc) | Unreserves all hops for the completed flow |
| Structure | Purpose |
|---|---|
m_switches | Per-switch: ecmpSeed + rtTable (dip → [outPorts]). Read-only after Initialize. |
m_adjacency | nodeId → {portIdx → neighborNodeId}. Physical topology graph. |
m_hopToFlows | (switchId, portIdx) → set of active FlowKeys. The dynamic reservation state. |
m_flowPaths | FlowKey → Path. Reverse lookup for ReleasePath. |
AssignSport() — the hot path (called per flow):

AssignSport(sip, dip, dport, srcNodeId, dstNodeId, fallbackSport) {
if (!m_enabled || !m_initialized) return fallbackSport;
if (srcNodeId/gpusPerServer == dstNodeId/gpusPerServer) return fallbackSport;
// ↑ intra-server → NVSwitch, no ECMP, CC can't help
// Quick check: if probe path has ≤1 hop, CC is useless
Path probe = ComputePath(sip, dip, fallbackSport, dport, ...);
if (probe.size() <= 1) return fallbackSport;
// Exhaustive search: try sport 1..65534
for (sport = 1; sport < 65535; sport++) {
Path path = ComputePath(sip, dip, sport, dport, ...);
if (!HasConflict(path)) {
ReservePath(path, {sip, dip, sport, dport});
return sport; // ← conflict-free sport found
}
}
return fallbackSport; // no conflict-free sport exists
}
ComputePath() — deterministic ECMP replay:

ComputePath(sip, dip, sport, dport, srcNodeId, dstNodeId) {
buf = {sip, dip, (sport | dport<<16)}; // 12-byte hash input
// Step 1: NIC selection at host (same hash as rdma-hw.cc)
h = Murmur3(buf, 12, 0x8BADF00D);
nicIdx = hostNics[h % hostNics.size()];
currentNode = nicToSwitch[nicIdx];
// Step 2: switch-by-switch ECMP traversal
while (currentNode is switch) {
h = Murmur3(buf, 12, switch.ecmpSeed); // ← per-switch seed
portIdx = switch.rtTable[dip][h % numPorts];
path.push_back({currentNode, portIdx});
currentNode = adjacency[currentNode][portIdx];
}
return path;
}
ComputePath() uses the exact same Murmur3 hash with the exact same per-switch ECMP seed as the NS-3 SwitchNode. This means the CC can predict the path any 5-tuple will take through the fabric before injecting the flow. Changing sport changes the hash output, which changes the ECMP port selection, which routes the flow through different spine links.
HasConflict() + ReservePath() / ReleasePath():

HasConflict(path) {
// Check all hops EXCEPT the last one (leaf→GPU has no diversity)
for (i = 0; i + 1 < path.size(); i++)
if (m_hopToFlows[{path[i].node, path[i].port}].size() > 0)
return true;
return false;
}
ReservePath(path, flowKey) {
m_flowPaths[flowKey] = path;
for (i = 0; i + 1 < path.size(); i++)
m_hopToFlows[{path[i].node, path[i].port}].insert(flowKey);
}
ReleasePath(sip, dip, sport, dport) { // called from rdma-hw.cc QpComplete()
flowKey = {sip, dip, sport, dport};
path = m_flowPaths[flowKey];
for hop in path:
m_hopToFlows[hop].erase(flowKey);
m_flowPaths.erase(flowKey);
}
① AstraSim decides to send a collective chunk
→ ASTRASimNetwork::sim_send() [AstraSimNetwork.cc]
② sim_send() calls SendFlow()
→ SendFlow(src, dst, size, callback, tag) [entry.h]
③ SendFlow() asks CC for a conflict-free port
→ CentralController::AssignSport() [central-controller.cc]
→ ComputePath(sip, dip, sport_candidate) // replays ECMP hash
→ HasConflict(path) // checks m_hopToFlows
→ ReservePath(path, flowKey) // reserves hops
→ return sport
④ SendFlow() injects RDMA QP with CC-assigned sport
→ RdmaClientHelper.Install() + Start() [entry.h → NS-3]
⑤ NS-3 simulates packet-level RDMA transport
→ switches use ECMP hash (same Murmur3) [ns-3 switch-node.cc]
→ congestion control (DCQCN/TIMELY/PINT) [configured in common.h]
⑥ All packets arrive → QP completes
→ RdmaHw::QpComplete() [rdma-hw.cc]
→ CentralController::ReleasePath() // free reserved hops
→ qp_finish() [entry.h]
→ notify_receiver_receive_data() // triggers AstraSim callback
| Aspect | Original SimAI | This Repo (with CC) |
|---|---|---|
| Sport Assignment | Per-(src,dst) counter: portNumber[s][d]++ starting ~10000 | CentralController::AssignSport() — searches for collision-free sport from 1 |
| Path Awareness | None — flows are fire-and-forget; routing is per-switch ECMP | Global view of all active flow paths; reserves and releases hops |
| Congestion Response | Reactive only — DCQCN (rate reduction) + PFC (pause frames) | Proactive collision avoidance + reactive DCQCN/PFC |
| Switch-Level Changes | EcmpHash() is private; no external access to seeds/routing | EcmpHash() public static; GetEcmpSeed() and GetRtTable() added |
| Workload Modes | AstraSim collectives only (AllReduce, AllGather, etc.) | AstraSim collectives + TraceInjector for point-to-point KV cache flows |
| Flow Completion | QpComplete() notifies AstraSim only | QpComplete() also calls CC::ReleasePath() to free reserved hops |
| Analysis Tools | Basic fct.txt output | paths.csv logging + Python scripts for collision analysis, FCT CDF, path visualisation |
| New Files | — | central-controller.h/cc, gen_kv_trace.py, analyze_collisions.py, plot_fct_cdf.py, export_flow_csv.py |