Central Controller: ECMP Collision-Free Routing

A centralised routing controller for SimAI/NS-3 that eliminates ECMP hash collisions by pre-assigning source ports before each RDMA flow is created.

NS-3 · Central Controller · ECMP

Table of Contents

  1. Overview & Motivation
  2. The ECMP Hash Collision Problem
  3. Central Controller Design
  4. Flow Diagrams: Before vs. After
  5. Implementation Deep Dive
  6. TraceInjector — KV Cache Simulation
  7. Experiments & Results
  8. Source File Reference
  9. Differences from Original SimAI

Overview & Motivation

Large-scale LLM training clusters use collective communication (AllReduce, AllGather, ReduceScatter) to synchronise model parameters across GPUs. Each collective is decomposed into many concurrent point-to-point RDMA flows. The underlying network uses ECMP (Equal-Cost Multi-Path) routing: at every switch, a hash function over the flow 5-tuple (sip, dip, sport, dport, protocol) selects one of several equal-cost output ports.

Goal: Implement a centralised controller that, before any RDMA flow is created, pre-computes the complete hop-by-hop path and steers the flow away from paths already in use — eliminating hash collisions entirely.

SIMAI ARCHITECTURE WITH CENTRAL CONTROLLER
┌─────────────────────────────────────────────────────────┐
│  AstraSim  (collective communication scheduler)          │
│  · Decides when each GPU issues a send/recv              │
│  · Implements ring-AllReduce, tree-AllReduce, etc.       │
│  · Calls sim_send() / sim_recv() to inject traffic       │
└─────────────────────┬───────────────────────────────────┘
                      │  sim_send / sim_recv API
┌─────────────────────▼───────────────────────────────────┐
│  NS-3  (packet-level network simulator)                  │
│                                                          │
│  ┌────────────────────────────────────────┐              │
│  │  CentralController (NEW)                │              │
│  │  · Maintains global view of all paths  │              │
│  │  · AssignSport() — collision-free port  │              │
│  │  · ReleasePath() — free on completion  │              │
│  └────────────────────────────────────────┘              │
│                                                          │
│  RdmaHw        – RDMA NIC logic per GPU                  │
│  RdmaQueuePair – one RDMA QP per flow                    │
│  SwitchNode    – ECMP forwarding logic                   │
│  QbbNetDevice  – queuing, PFC pause, ECN marking         │
└─────────────────────────────────────────────────────────┘

The ECMP Hash Collision Problem

2.1 How Original SimAI Assigns Ports

In original SimAI, sport comes from a per-(src, dst) counter: portNumber[src][dst]++, initialised at ~10,000. Successive flows from GPU A to GPU B get sports 10000, 10001, 10002, … Because each switch reduces the ECMP hash modulo the small number of equal-cost ports, these uncoordinated sport choices frequently land concurrent flows on the same output port, so hash collisions are common.

2.2 Why Collisions Matter

Bandwidth Halving

When two flows share the same output port at a switch, they share link bandwidth. In a ring-AllReduce, the collective finishes only when the last GPU completes — a single congested link delays the entire collective.

PFC Cascade

PFC (Priority-based Flow Control) propagates backpressure upstream, potentially stalling unrelated flows. A localised collision can turn into cluster-wide head-of-line blocking.

2.3 ECMP Hash Internals

SwitchNode::EcmpHash() is a static Murmur3 function with a per-switch seed (m_ecmpSeed). The same 5-tuple may take different branches at different switches. The NIC selection at the source host also uses Murmur3 with fixed seed 0x8BADF00D.

# Murmur3 (x86_32) over a 12-byte buffer: sip(4) | dip(4) | sport(2)|dport(2)
from struct import pack, unpack_from

def rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def ecmp_hash(sip, dip, sport, dport, seed):
    buf = pack('<III', sip, dip, (sport | (dport << 16)) & 0xFFFFFFFF)
    h = seed & 0xFFFFFFFF
    for i in range(0, 12, 4):
        k = unpack_from('<I', buf, i)[0]
        k = (k * 0xcc9e2d51) & 0xFFFFFFFF   # mix step 1
        k = rotl32(k, 15)
        k = (k * 0x1b873593) & 0xFFFFFFFF   # mix step 2
        h ^= k
        h = rotl32(h, 13)
        h = (h * 5 + 0xe6546b64) & 0xFFFFFFFF
    # finalization (fmix32), buffer length = 12
    h ^= 12
    h ^= h >> 16; h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13; h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h
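To make the collision mechanics concrete, here is a toy sketch (Python's hashlib stands in for Murmur3; the birthday-style statistics do not depend on the hash function, and the GPU/spine numbering is invented):

```python
# Toy demo: 8 concurrent flows with identical sports land on spines at random,
# so some spines get 2+ flows (birthday collisions). hashlib stands in for Murmur3.
import hashlib

N_SPINES = 8

def spine_of(sip, dip, sport, dport=100):
    key = f"{sip}-{dip}-{sport}-{dport}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "little")
    return h % N_SPINES  # ECMP: hash remainder selects the uplink

# GPU i (server 0) -> GPU i+8 (server 1), all using sport=10000
spines = [spine_of(sip=i, dip=i + 8, sport=10000) for i in range(8)]
n_collided = len(spines) - len(set(spines))
print(f"spines: {spines}  collided slots: {n_collided}")
```

With 8 flows hashed uniformly over 8 spines, roughly 5 distinct spines are expected, i.e. about 3 flows share a spine with someone else; the controller's job is to drive this count to zero.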

Central Controller Design

3.1 Core Idea

The only field in the 5-tuple that the sender controls freely is sport. By choosing sport carefully, we can steer the flow onto any desired path, because every switch re-hashes on the full 5-tuple.

3.2 Algorithm Steps

  1. Maintain a global view of all active flow paths via m_hopToFlows and m_flowPaths.
  2. When a new flow arrives, simulate its full hop-by-hop path for each candidate sport value (1, 2, 3, …) using an exact replica of the switch hash functions.
  3. Return the first sport whose path shares no (switch, output-port) hop with any existing active flow.
  4. Reserve that path so future flows avoid it.
  5. Release the reservation when the flow completes (QpComplete()).

3.3 Data Structures

Topology Tables (read-only)

m_switches       : nodeId → { ecmpSeed,
                    rtTable[dip] → [outPorts] }
m_adjacency      : nodeId → { portIdx →
                    neighborNodeId }
m_hostRtTable    : hostId → { dip →
                    [nicIndices] }
m_hostNicToSwitch: hostId → { nicIdx →
                    switchNodeId }

Dynamic Reservation State

m_hopToFlows  : HopKey(nodeId, port) →
                 set of active flowKeys
m_flowPaths   : flowKey →
                 Path (list of (nodeId, port))

Updated on every AssignSport() and ReleasePath() call.
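A minimal Python model of the reservation state (variable names mirror the C++ members; the hop values are illustrative, and the C++ version additionally skips the final hop when reserving, per the Bug 1 fix):

```python
# Minimal model of the CC reservation state. HopKey = (nodeId, port),
# FlowKey = (sip, dip, sport, dport). Mirrors m_hopToFlows / m_flowPaths.
from collections import defaultdict

m_hop_to_flows = defaultdict(set)   # (nodeId, port) -> {flowKey, ...}
m_flow_paths = {}                   # flowKey -> [(nodeId, port), ...]

def reserve_path(path, flow_key):
    m_flow_paths[flow_key] = path
    for hop in path:
        m_hop_to_flows[hop].add(flow_key)

def release_path(flow_key):
    for hop in m_flow_paths.pop(flow_key):
        m_hop_to_flows[hop].discard(flow_key)

flow = (0x0B000001, 0x0B000009, 1, 100)
reserve_path([(18, 3), (21, 7)], flow)   # leaf 18 port 3, spine 21 port 7
assert m_hop_to_flows[(21, 7)] == {flow}
release_path(flow)
assert not m_hop_to_flows[(21, 7)]
```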

3.4 AssignSport — Pseudocode

AssignSport(sip, dip, dport, src, dst):
    if not enabled or src/dst on same server:
        return default_sport       ← intra-server: NVSwitch, no ECMP

    // Early exit: if probe path has ≤ 1 hop, CC can't help
    probe = ComputePath(sip, dip, fallbackSport, dport)
    if probe.size() <= 1:
        return fallbackSport

    for sport = 1, 2, 3, … 65535:
        path = ComputePath(sip, dip, sport, dport)
        if no hop in path[:-1] is already reserved:  ← skip final hop
            ReservePath(path, flowKey)
            log to CSV if CC_LOG_FILE is set
            return sport

    return default_sport       ← fallback (practically unreachable)

3.5 ComputePath — Deterministic Path Tracing

Step 1 — NIC selection at source host
  buf = [ sip | dip | (sport << 0) | (dport << 16) ]   (12 bytes)
  h   = murmur3(buf, seed=0x8BADF00D)     ← matches RdmaQueuePair::GetHash()
  nic = m_hostRtTable[srcNode][dip][ h % #nics ]
  currentNode = m_hostNicToSwitch[srcNode][nic]

Step 2 — Switch-by-switch ECMP traversal
  while currentNode ≠ dstNode:
      sw   = m_switches[currentNode]
      h    = murmur3(buf, seed=sw.ecmpSeed)   ← matches SwitchNode::GetOutDev()
      port = sw.rtTable[dip][ h % #ports ]
      append (currentNode, port) to path
      currentNode = m_adjacency[currentNode][port]

This replicates the exact hash computation used in the live NS-3 simulation, so the predicted path is always correct.
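The two steps can be replayed on a toy fabric (a sketch only: the seeds, node IDs, and routing tables are invented, and a hashlib stand-in replaces the Murmur3 used by NS-3):

```python
# Toy ComputePath: replay deterministic ECMP over an invented 2-spine fabric.
import hashlib

def h32(sip, dip, sport, dport, seed):
    key = f"{seed}:{sip}-{dip}-{sport}-{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "little")

# nodeId -> (ecmpSeed, rtTable: dip -> [outPorts]); hosts are nodes 0 and 1
switches = {
    10: (7,  {1: [0, 1]}),   # src leaf: two uplinks toward the spines
    20: (13, {1: [2]}),      # spine A: one downlink toward dst leaf
    21: (17, {1: [2]}),      # spine B
    11: (23, {1: [0]}),      # dst leaf: port 0 goes to host 1
}
adjacency = {10: {0: 20, 1: 21}, 20: {2: 11}, 21: {2: 11}, 11: {0: 1}}

def compute_path(sip, dip, sport, dport, first_switch, dst):
    node, path = first_switch, []
    while node != dst:
        seed, rt = switches[node]
        ports = rt[dip]
        port = ports[h32(sip, dip, sport, dport, seed) % len(ports)]
        path.append((node, port))
        node = adjacency[node][port]
    return path

path = compute_path(sip=0, dip=1, sport=1, dport=100, first_switch=10, dst=1)
print(path)  # three hops; the middle hop is spine 20 or 21, depending on the hash
```

Changing sport changes the first hash, which flips the leaf's uplink choice, which is exactly the steering lever AssignSport() searches over.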

Flow Diagrams: Before vs. After

Without Central Controller

AstraSim          NS-3 RdmaHw          Network
────────          ───────────          ───────
sim_send(A→B)
   │
   └──► AddQueuePair(sip, dip,
          sport=portNumber[s][d]++,
          dport, size)
              │
         Create RdmaQueuePair
         (sport = ~10000+)
              │
         Transmit ─────────────────►
                    Switch hashes
                    → may collide
                    → PFC backpressure
         ◄── ACKs ────────────────
              │
         QpComplete()
              │
   ◄──────────┘
 notify AstraSim

With Central Controller

AstraSim   RdmaHw/CC              Network
────────   ──────────              ───────
sim_send(A→B)
   │
   └──► AddQueuePair(sport=1)
              │
    ┌─────────────────────┐
    │ CentralController   │
    │ ::AssignSport()     │
    │  for s = 1,2,3,…:   │
    │   path=ComputePath()│
    │   if no conflict:   │
    │    ReservePath()    │
    │    return s         │
    └──────────┬──────────┘
         sport = s (conflict-free)
              │
         Create RdmaQueuePair
              │
         Transmit ─────────────────►
                    → unique path, no collision
         ◄── ACKs ────────────────
              │
    ┌─────────▼──────────┐
    │ CC::ReleasePath()  │
    │ free reserved hops │
    └────────────────────┘
              │
   ◄──────────┘
 notify AstraSim

Implementation Deep Dive

5.1 Initialization

Called once after SetRoutingEntries() finishes populating all NS-3 routing tables:

for each NS-3 node:
    build m_adjacency from GetDevice(d)->GetChannel()
    if SwitchNode:
        copy GetEcmpSeed() and GetRtTable() into m_switches
    if GPU host with RdmaHw:
        copy m_rtTable (dip → NIC list) into m_hostRtTable
        for each NIC, find which switch it connects to → m_hostNicToSwitch

5.2 Path Release

Called from RdmaHw::QpComplete() when a flow finishes:

ReleasePath(sip, dip, sport, dport):
    key  = FlowKey(sip, dip, sport, dport)
    path = m_flowPaths[key]
    for each (nodeId, port) in path:
        m_hopToFlows[(nodeId, port)].remove(key)
    delete m_flowPaths[key]

5.3 Bug Fixes — Lessons Learned

Bug 1: Final-Hop Conflict

HasConflict() originally iterated over all path hops, including the final hop rail_switch → dst_GPU. This final hop is always identical regardless of sport. Fix: skip path.back() — only check spine hops where path diversity exists.

Bug 2: 1-Hop Sport=1 Collision

Same-rail GPU pairs have 1-hop paths. After excluding the final hop → zero spine hops → HasConflict() always returns false → CC assigned sport=1 to all concurrent same-rail flows, causing map key collisions. Fix: early-exit if probe path has ≤ 1 hop.

5.4 Runtime Configuration

Env Variable Description
CC_ENABLE=1 Enable controller (default)
CC_ENABLE=0 Disable — flows use original ECMP behaviour
CC_LOG_FILE=paths.csv Write per-flow path assignments to CSV
TRACE_INJECT_FILE=<path> Bypass AstraSim — drive simulation from a trace CSV

No recompilation needed — toggling the controller is a one-character change to the run command (CC_ENABLE=0/1).
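In Python terms, the gate that common.h implements amounts to the following (a sketch of the logic, not the actual C++):

```python
# Sketch of the CC_ENABLE / CC_LOG_FILE gate read at startup.
import os

def cc_config(env=os.environ):
    enabled = env.get("CC_ENABLE", "1") != "0"   # default: enabled
    log_file = env.get("CC_LOG_FILE")            # None -> no path CSV logging
    return enabled, log_file

assert cc_config({"CC_ENABLE": "0"}) == (False, None)
assert cc_config({}) == (True, None)
assert cc_config({"CC_LOG_FILE": "paths.csv"}) == (True, "paths.csv")
```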

TraceInjector — KV Cache Simulation

AstraSim only supports collective workloads. For disaggregated inference, the critical traffic is point-to-point KV cache transfers (P-node → D-node). A lightweight TraceInjector mode bypasses AstraSim entirely and drives NS-3 directly from a CSV trace.

6.1 Trace Format

# timestamp_ns,src_node_id,dst_node_id,size_bytes
0,0,8,10485760          ← GPU 0 → GPU 8, 10 MB at t=0
0,1,9,10485760          ← GPU 1 → GPU 9, 10 MB at t=0
100000,0,8,10485760     ← another flow 100µs later
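Parsing the format is a one-liner per row (a sketch; the three rows are the example above):

```python
# Parse KV-trace CSV rows: timestamp_ns,src_node_id,dst_node_id,size_bytes
import csv, io

trace_text = """0,0,8,10485760
0,1,9,10485760
100000,0,8,10485760
"""

flows = [tuple(map(int, row)) for row in csv.reader(io.StringIO(trace_text)) if row]
for t_ns, src, dst, size in flows:
    print(f"t={t_ns}ns  GPU {src} -> GPU {dst}  {size / 2**20:.0f} MB")
# -> t=0ns  GPU 0 -> GPU 8  10 MB  (etc.)
```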

6.2 gen_kv_trace.py — Arrival Patterns

Pattern Description
constant Fixed inter-arrival = 1/rate (or --interval-ns)
poisson Exponential inter-arrival (Poisson process)
burst Periodic bursts: burst_size flows at the same timestamp
hotspot Fraction of flows from one specific server pair → max collision
server_pair All GPUs on src server → matching GPUs on dst server, N rounds
one_to_one Fixed (src, dst) GPU pair, N flows staggered by interval
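As an illustration, the poisson pattern reduces to exponential inter-arrival gaps (a sketch of the idea; the real gen_kv_trace.py flags and defaults may differ):

```python
# Sketch: exponential inter-arrival gaps give a Poisson arrival process.
import random

def poisson_arrivals(rate_per_s, n_flows, seed=42):
    rng = random.Random(seed)           # seeded -> reproducible traces
    t_ns, out = 0.0, []
    for _ in range(n_flows):
        t_ns += rng.expovariate(rate_per_s) * 1e9   # gap in nanoseconds
        out.append(int(t_ns))
    return out

ts = poisson_arrivals(rate_per_s=10_000, n_flows=5)
print(ts)  # non-decreasing timestamps, mean gap ~100,000 ns
```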

How to Run the Simulation

The simulation has two modes: AstraSim workload mode (full collective communication) and TraceInjector mode (inject point-to-point flows from a CSV). Both produce the same fct.txt output.

Step 1: Build SimAI

# Build NS-3 backend (compiles central-controller.cc, rdma-hw.cc patches, etc.)
./scripts/build.sh -c ns3

Step 2: Generate Topology

# DCN+ fat-tree: 16 GPUs (2 servers × 8 GPU/server), 8 spine switches
python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \
  -topo DCN+ -g 16 -gps 8 -gt A100 -bw 100Gbps -nvbw 2400Gbps \
  -asn 2 -npa 8 -psn 8 -apbw 100Gbps -app 2

# Spectrum-X: 128 GPUs (rail-optimized, for training workloads)
python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \
  -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps

Step 3: Run Simulation

Mode A — TraceInjector (recommended for CC experiments)

Bypasses AstraSim entirely. Inject flows from a CSV trace file.

# 1. Generate trace (server 0 GPUs → server 1 GPUs, 10 MB each)
python3 gen_kv_trace.py \
    --arrival server_pair --src-server 0 --dst-server 1 \
    --gpus-per-server 8 --kv-mb 10 --rounds 1 \
    --out traces/my_trace.csv

# 2a. Run WITH CC
CC_ENABLE=1 CC_LOG_FILE=results/cc/paths.csv \
  TRACE_INJECT_FILE=traces/my_trace.csv \
  AS_SEND_LAT=3 ./bin/SimAI_simulator -t 1 \
  -n <topology_dir> -c results/cc/SimAI_local.conf

# 2b. Run WITHOUT CC (A/B comparison)
CC_ENABLE=0 \
  TRACE_INJECT_FILE=traces/my_trace.csv \
  AS_SEND_LAT=3 ./bin/SimAI_simulator -t 1 \
  -n <topology_dir> -c results/nocc/SimAI_local.conf

Mode B — AstraSim Workload (full collective communication)

Run real collective workloads (AllReduce, AllToAll, etc.) through AstraSim → SimCCL → NS-3.

CC_ENABLE=1 CC_LOG_FILE=results/cc/paths.csv \
  AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 1 \
  -w workloads/test_cc_compare.txt \
  -n <topology_dir> \
  -c results/cc/SimAI_local.conf

Environment Variables

Variable            Values             Effect
CC_ENABLE           1 (default) / 0    Enable/disable CentralController sport assignment
CC_LOG_FILE         file path          Write per-flow path CSV (flow_id, sip, dip, sport, hops)
TRACE_INJECT_FILE   CSV path           TraceInjector mode: bypass AstraSim, inject flows from CSV
AS_SEND_LAT         microseconds       Software send latency before NIC injection (default 6)
AS_NVLS_ENABLE      1 / 0              Enable NVSwitch NVLS transport for intra-server flows

Step 4: Analyze Results

# FCT CDF plot (CC vs No CC)
python3 plot_fct_cdf.py \
    --cc results/cc/fct.txt --nocc results/nocc/fct.txt \
    --out results/fct_cdf.png

# Export per-flow analysis CSV
python3 export_flow_csv.py \
    --fct results/cc/fct.txt --paths results/cc/paths.csv \
    --out results/cc/flows.csv

# Collision analysis
python3 analyze_collisions.py \
    --cc results/cc/paths.csv --nocc results/nocc/fct.txt \
    --trace traces/my_trace.csv --bw-gbps 100 --send-lat-ns 3000

Output Files

File        Format                             Contents
fct.txt     Space-separated, 1 line per flow   sip dip sport dport size start_ns fct_ns ideal_ns
paths.csv   CSV (only with CC_LOG_FILE)        flow_id,sip,dip,sport,dport,n_hops,hops
flows.csv   CSV (from export_flow_csv.py)      flow_id,src_gpu,dst_gpu,sport,spine_switch,fct_ns,slowdown,cc_assigned
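A sketch of consuming fct.txt with the column layout above (the two example rows are invented; the last two columns give the per-flow slowdown):

```python
# Compute per-flow slowdown (fct_ns / ideal_ns) from fct.txt-style lines.
fct_lines = """\
11.0.0.1 11.0.0.9 1 100 10485760 0 850000 850000
11.0.0.2 11.0.0.10 1 100 10485760 0 1693000 850000
"""

slowdowns = []
for line in fct_lines.strip().splitlines():
    sip, dip, sport, dport, size, start_ns, fct_ns, ideal_ns = line.split()
    slowdowns.append(int(fct_ns) / int(ideal_ns))
print([f"{s:.2f}x" for s in slowdowns])  # ['1.00x', '1.99x']
```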

Experiments & Results

7.1 Best Case — Fat-Tree Server-Pair Burst

Topology: DCN+ Single-ToR fat-tree — 16 GPUs, 2 servers, 8 spine switches. Every cross-server flow traverses exactly 3 hops. Workload: 8 simultaneous 10 MB flows: GPU i (server 0) → GPU i+8 (server 1).

DCN+ FAT-TREE TOPOLOGY (16 GPU, 8 SPINES)
  Server 0 (GPUs 0–7)            Server 1 (GPUs 8–15)
  ┌─────────────────┐             ┌─────────────────┐
  │ GPU0 GPU1 … GPU7│             │ GPU8 GPU9 …GPU15│
  │  │    │       │ │             │  │    │       │  │
  │  └────┴───┬───┘ │             │  └────┴───┬───┘  │
  │       NVSw16    │             │       NVSw17      │
  │       (NVLink)  │             │       (NVLink)    │
  └───────────┼─────┘             └─────┼─────────────┘
              │                         │
           Leaf18                    Leaf19
            │ │ │ │ │ │ │ │         │ │ │ │ │ │ │ │
            │ │ │ │ │ │ │ │ 100Gbps │ │ │ │ │ │ │ │
     Spine20 Spine21 Spine22 Spine23 Spine24 Spine25 Spine26 Spine27

Without CC — 4 Spine Collisions

GPU 0→8   sport=10000 Spine 27  FCT=  850µs (1.0×)
GPU 1→9   sport=10000 Spine 20  FCT=  850µs (1.0×)
GPU 2→10  sport=10000 Spine 23  FCT=  850µs (1.0×)
GPU 3→11  sport=10000 Spine 24  FCT=  850µs (1.0×)
GPU 4→12  sport=10000 Spine 25  FCT=1693µs (2.0×) ←
GPU 5→13  sport=10000 Spine 22  FCT=1693µs (2.0×) ←
GPU 6→14  sport=10000 Spine 21  FCT=1693µs (2.0×) ←
GPU 7→15  sport=10000 Spine 26  FCT=1693µs (2.0×) ←

4 of 8 spines carry 2 flows each → 50% bandwidth → 2× FCT

With CC — Perfect Spread

GPU 0→8   sport=1  Spine 27  FCT=850µs (1.0×)
GPU 1→9   sport=1  Spine 20  FCT=850µs (1.0×)
GPU 2→10  sport=1  Spine 23  FCT=850µs (1.0×)
GPU 3→11  sport=1  Spine 24  FCT=850µs (1.0×)
GPU 4→12  sport=2  Spine 25  FCT=850µs (1.0×) ← reassigned
GPU 5→13  sport=4  Spine 22  FCT=850µs (1.0×) ← reassigned
GPU 6→14  sport=2  Spine 21  FCT=850µs (1.0×) ← reassigned
GPU 7→15  sport=1  Spine 26  FCT=850µs (1.0×)

Every spine carries exactly 1 flow → full bandwidth → 1.0× FCT

Metric With CC Without CC Improvement
Mean FCT 850 µs 1,271 µs −33%
Max FCT 850 µs 1,693 µs −50%
Mean Slowdown 1.0× 1.5×
Spine Collisions 0 4 pairs
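The summary statistics follow directly from the per-flow FCTs above; a quick arithmetic check:

```python
# Without CC: 4 clean flows at 850 µs and 4 collided flows at 1693 µs.
no_cc = [850] * 4 + [1693] * 4

mean_no_cc = sum(no_cc) / len(no_cc)   # 1271.5 µs ("1,271 µs" in the table)
mean_gain = 1 - 850 / mean_no_cc       # ~0.33 -> -33% mean FCT
max_gain = 1 - 850 / max(no_cc)        # ~0.50 -> -50% max FCT
print(f"mean {mean_no_cc:.1f} µs, gains: {mean_gain:.0%} / {max_gain:.0%}")
```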

7.2 8-Ring AllReduce — Spectrum-X 128 GPU

Collective With CC (ns) Without CC (ns) Δ
AllReduce 128 MB 2,554,511 2,559,883 +0.21%
AllReduce 64 MB 1,282,541 1,276,607 −0.47%
AllReduce 32 MB 645,833 662,472 +2.51%
Total 4,482,885 4,498,962 +0.36%

Small improvement because 8 concurrent AllReduce rings already produce diverse (sip, dip) pairs that naturally spread across different spines via ECMP.

Sport Assignment Distribution

sport Flow Count %
1 1,628 84.7%
2 263 13.7%
3 27 1.4%
4 2 0.1%
5 1 0.1%

7.3 Full AllToAll 128 GPU — Network Saturation

With 128 × 127 = 16,256 concurrent flows, the network is fully saturated. CC provides only 0.02% improvement — resolving collisions doesn't free capacity when all links are already at 100%. CC search depth reached 183, and simulation took 119 minutes vs. 2 minutes without CC.

7.4 When Does CC Help?

Scenario CC Benefit Root Cause
8-ring AllReduce (Spectrum-X) +0.36% Diverse (sip,dip) already spread by ECMP
Full AllToAll 128G +0.02% Network fully saturated
Hotspot (same server pair) ~0% Final-hop congestion dominates
Fat-tree burst (8 flows, 8 spines) −33% mean, −50% max ECMP collision on spine; CC spreads perfectly

Conditions for large CC benefit: (1) Flows traverse multiple candidate spine paths, (2) concurrent flow count ≤ available spine paths, (3) default ECMP produces hash collisions, (4) destination GPUs are distinct (no shared final hop).

Source File Reference

8.1 C++ Patches (NS-3 / AstraSim)

File Change
central-controller.h/cc NEW CentralController singleton class (~300 lines). Declares Initialize(), AssignSport(), ReleasePath(), ComputePath(), HasConflict(), ReservePath(). Maintains topology tables and dynamic reservation state.
switch-node.h MOD EcmpHash() moved from private to public static. Added GetEcmpSeed() and GetRtTable() getters so CC can replicate switch-level hashing.
rdma-hw.cc MOD AddQueuePair(): calls AssignSport() to override sport before QP creation. QpComplete(): calls ReleasePath() to free reserved hops.
common.h MOD Added #include <central-controller.h>. After SetRoutingEntries(): calls Initialize(), reads CC_ENABLE / CC_LOG_FILE env vars.
entry.h MOD SendFlow(): calls CC's AssignSport() before registering the flow. New SendFlowTrace() helper for TraceInjector mode with sentinel ncclFlowTag (current_flow_id == -1).
AstraSimNetwork.cc MOD If TRACE_INJECT_FILE env var is set, reads CSV trace and schedules each flow via Simulator::Schedule(NanoSeconds(t), &SendFlowTrace, ...), bypassing AstraSim entirely.

8.2 Python Analysis Scripts

Script Description
gen_kv_trace.py Generate KV cache transfer traces with 6 arrival patterns (constant, poisson, burst, hotspot, server_pair, one_to_one). Configurable topology mapping (128 GPUs, 16 servers × 8 GPU/server).
analyze_collisions.py Count ECMP spine collisions. Replicates the exact Murmur3 hash used by NS-3 switches. Supports both static (all pairs) and concurrent (time-window-overlapping) collision counting. Can read actual NS-3 paths from CC_LOG_FILE.
plot_fct_cdf.py Plot FCT CDF and per-flow sorted bar chart comparing CC vs No CC. Reads fct.txt files from both runs.
export_flow_csv.py Join fct.txt + paths.csv into a per-flow analysis CSV with fields: flow_id, src/dst GPU, sport, spine_switch, hops, FCT, slowdown, cc_assigned.

Source Code Deep Dive — What Each File Does

This section traces the execution flow through the 5 C++ source files, showing exactly when each is invoked and what it does at the code level.

AstraSimNetwork.cc — Main entry point + TraceInjector

Contains main() and the ASTRASimNetwork class that bridges AstraSim ↔ NS-3.

Execution Order

main(argc, argv)
  ① user_param_prase()         // parse -t thread -w workload -n topo -c conf
  ② main1(topo, conf)          // → common.h: ReadConf → SetConfig → SetupNetwork
                               //   (builds NS-3 topology, initializes CentralController)
  ③ if env TRACE_INJECT_FILE:  // ← NEW: TraceInjector bypass mode
      for line in csv:
        Simulator::Schedule(NanoSeconds(t), &SendFlowTrace, src, dst, size)
      Simulator::Run()         // run NS-3 with injected flows only
      return
  ④ // Normal mode: create ASTRASimNetwork + AstraSim::Sys per GPU
    for j in nodes_num:
      networks[j] = new ASTRASimNetwork(j, 0)
      systems[j]  = new AstraSim::Sys(networks[j], ...)
    for i in nodes_num:
      systems[i]->workload->fire()  // kick off collective workloads
    Simulator::Run()

ASTRASimNetwork class — AstraSim ↔ NS-3 bridge

Method           What it does
sim_send()       Registers flow in sentHash, then calls SendFlow() (in entry.h) to inject an RDMA QP into NS-3
sim_recv()       Matches arriving data against expeRecvHash. If data already arrived: invoke callback immediately; if not: register in expeRecvHash to wait.
sim_get_time()   Returns NS-3's Simulator::Now().GetNanoSeconds() to AstraSim
sim_schedule()   Wraps Simulator::Schedule() for AstraSim callbacks
TraceInjector: When TRACE_INJECT_FILE is set, the entire AstraSim workload system is bypassed. Flows are read from a CSV (timestamp_ns,src,dst,size_bytes) and scheduled directly into NS-3 via SendFlowTrace(). This enables isolated network experiments (e.g., KV cache transfer patterns) without needing the full SimAI stack.

entry.h — SendFlow + CentralController integration

Defines the flow injection functions and all flow completion callbacks. This is where the CentralController's AssignSport() is called for every flow.

SendFlow() — normal mode (called by AstraSim)

void SendFlow(src, dst, maxPacketCount, msg_handler, fun_arg, tag, request) {
  port = portNumber[src][dst]++;

  // ← NEW: CentralController picks a conflict-free sport
  port = CentralController::Instance().AssignSport(
      serverAddress[src], serverAddress[dst],
      100 /*dport*/, src, dst, port);

  sender_src_port_map[{port, {src, dst}}] = request->flowTag;

  RdmaClientHelper clientHelper(pg, srcIP, dstIP, port, dport, size, ...);
  appCon = clientHelper.Install(n.Get(src));
  appCon.Start(Time(send_lat));  // inject RDMA QP into NS-3
}

SendFlowTrace() — TraceInjector mode (no AstraSim context)

void SendFlowTrace(src, dst, size_bytes) {
  port = portNumber[src][dst]++;
  port = CentralController::Instance().AssignSport(...);

  AstraSim::ncclFlowTag sentinel;  // current_flow_id = -1 (sentinel)
  sender_src_port_map[{port, {src, dst}}] = sentinel;

  // Same RDMA QP injection, but no AstraSim callbacks
  RdmaClientHelper clientHelper(..., nullptr, nullptr, ...);
  appCon.Start(Time(send_lat));
}

Completion Callbacks

Callback        Triggered by                          What it does
qp_finish()     NS-3 when all packets arrive at dst   Writes to fct.txt, checks is_receive_finished(), calls notify_receiver_receive_data() → triggers AstraSim recv callback. For sentinel flows (TraceInjector): early return.
send_finish()   NS-3 when all packets leave src NIC   Checks is_sending_finished(), calls notify_sender_sending_finished() → triggers AstraSim send callback. For sentinel flows: early return.

common.h — Network setup + CC initialization

The infrastructure glue file. Defines global state, reads config, builds the NS-3 topology, and initializes the CentralController.

Three main functions, called in order by main1()

main1(topo, conf)
  ① ReadConf(topo, conf)   // Parse config: CC_MODE, DATA_RATE, LINK_DELAY, PFC, DCQCN params...
  ② SetConfig()            // Apply NS-3 defaults (PauseTime, QcnEnabled, IntHeader mode)
                           //   cc_mode=3 → DCQCN, cc_mode=7 → TIMELY, cc_mode=10 → PINT
  ③ SetupNetwork(qp_finish, send_finish)
       // a. Read topology file: node_num, gpus_per_server, nvswitch_num, switch_num
       // b. Create NS-3 nodes (hosts, SwitchNode, NVSwitchNode)
       // c. Install point-to-point links with QBB (lossless Ethernet)
       // d. Configure RDMA per host (RdmaHw with CC mode, rates, PFC)
       // e. CalculateRoutes() → SetRoutingEntries()
       // f. NEW: Initialize CentralController ↓

CentralController initialization (added in SetupNetwork)

// After SetRoutingEntries():
CentralController::Instance().Initialize(n, nodeIdToRdmaHw, ipToNodeId, gpus_per_server);

// Env var control:
if (env "CC_LOG_FILE")   CC.SetLogFile(path);   // CSV: flow_id,sip,dip,sport,hops
if (env "CC_ENABLE" != "0") CC.Enable();  // default: enabled
else                       CC.Disable(); // for A/B comparison

central-controller.h — CC class declaration

Singleton class with the full API and internal data structures for conflict-free path assignment.

Public API

Method          When called                             What it does
Initialize()    Once, in SetupNetwork                   Reads NS-3 topology: adjacency graph, switch ECMP seeds + routing tables, host NIC→switch mapping
AssignSport()   Every flow (SendFlow / SendFlowTrace)   Probes sport 1..65534, finds first conflict-free path, reserves it
ReleasePath()   QP completion (rdma-hw.cc)              Unreserves all hops for the completed flow

Internal Data Structures

Structure      Purpose
m_switches     Per-switch: ecmpSeed + rtTable (dip → [outPorts]). Read-only after Initialize.
m_adjacency    nodeId → {portIdx → neighborNodeId}. Physical topology graph.
m_hopToFlows   (switchId, portIdx) → set of active FlowKeys. The dynamic reservation state.
m_flowPaths    FlowKey → Path. Reverse lookup for ReleasePath.

central-controller.cc — CC core logic

AssignSport() — the hot path (called per flow)

AssignSport(sip, dip, dport, srcNodeId, dstNodeId, fallbackSport) {
  if (!m_enabled || !m_initialized) return fallbackSport;
  if (srcNodeId/gpusPerServer == dstNodeId/gpusPerServer) return fallbackSport;
      // ↑ intra-server → NVSwitch, no ECMP, CC can't help

  // Quick check: if probe path has ≤1 hop, CC is useless
  Path probe = ComputePath(sip, dip, fallbackSport, dport, ...);
  if (probe.size() <= 1) return fallbackSport;

  // Exhaustive search: try sport 1..65534
  for (sport = 1; sport < 65535; sport++) {
    Path path = ComputePath(sip, dip, sport, dport, ...);
    if (!HasConflict(path)) {
      ReservePath(path, {sip, dip, sport, dport});
      return sport;  // ← conflict-free sport found
    }
  }
  return fallbackSport;  // no conflict-free sport exists
}

ComputePath() — deterministic ECMP replay

ComputePath(sip, dip, sport, dport, srcNodeId, dstNodeId) {
  buf = {sip, dip, (sport | dport<<16)};  // 12-byte hash input

  // Step 1: NIC selection at host (same hash as rdma-hw.cc)
  h = Murmur3(buf, 12, 0x8BADF00D);
  nicIdx = hostNics[h % hostNics.size()];
  currentNode = nicToSwitch[nicIdx];

  // Step 2: switch-by-switch ECMP traversal
  while (currentNode is switch) {
    h = Murmur3(buf, 12, switch.ecmpSeed);  // ← per-switch seed
    portIdx = switch.rtTable[dip][h % numPorts];
    path.push_back({currentNode, portIdx});
    currentNode = adjacency[currentNode][portIdx];
  }
  return path;
}
Key: ComputePath() uses the exact same Murmur3 hash with the exact same per-switch ECMP seed as the NS-3 SwitchNode. This means the CC can predict the path any 5-tuple will take through the fabric before injecting the flow. Changing sport changes the hash output, which changes the ECMP port selection, which routes the flow through different spine links.

HasConflict() + ReservePath() / ReleasePath()

HasConflict(path) {
  // Check all hops EXCEPT the last one (leaf→GPU has no diversity)
  for (i = 0; i + 1 < path.size(); i++)
    if (m_hopToFlows[{path[i].node, path[i].port}].size() > 0)
      return true;
  return false;
}

ReservePath(path, flowKey) {
  m_flowPaths[flowKey] = path;
  for (i = 0; i + 1 < path.size(); i++)
    m_hopToFlows[{path[i].node, path[i].port}].insert(flowKey);
}

ReleasePath(sip, dip, sport, dport) {  // called from rdma-hw.cc QpComplete()
  flowKey = {sip, dip, sport, dport};
  path = m_flowPaths[flowKey];
  for hop in path:
    m_hopToFlows[hop].erase(flowKey);
  m_flowPaths.erase(flowKey);
}
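The final-hop exclusion from Bug 1 is easy to unit-test against a Python mirror of these functions (a sketch; the set-of-flows container mirrors m_hopToFlows, and the hop tuples are invented):

```python
# Mirror of HasConflict/ReservePath with the final-hop exclusion (Bug 1 fix):
# the last hop (leaf -> GPU) is identical for every sport, so it never counts.
from collections import defaultdict

hop_to_flows = defaultdict(set)

def has_conflict(path):
    return any(hop_to_flows[hop] for hop in path[:-1])  # skip path[-1]

def reserve(path, key):
    for hop in path[:-1]:                               # skip path[-1]
        hop_to_flows[hop].add(key)

# Flow A: leaf18 -> spine21 -> leaf19 (final hop to the GPU)
reserve([(18, 3), (21, 7), (19, 0)], key="A")

# Flow B reaches the SAME destination GPU via a different spine: no conflict,
# even though both paths end on (19, 0) -- that hop is shared by construction.
assert not has_conflict([(18, 4), (22, 7), (19, 0)])
# Flow C that would reuse spine 21 port 7 DOES conflict.
assert has_conflict([(18, 4), (21, 7), (19, 0)])
```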

Complete Lifecycle: One Flow Through All 5 Files

① AstraSim decides to send a collective chunk
   ASTRASimNetwork::sim_send()                 [AstraSimNetwork.cc]

② sim_send() calls SendFlow()
   SendFlow(src, dst, size, callback, tag)     [entry.h]

③ SendFlow() asks CC for a conflict-free port
   CentralController::AssignSport()            [central-controller.cc]
     ComputePath(sip, dip, sport_candidate)    // replays ECMP hash
     HasConflict(path)                         // checks m_hopToFlows
     ReservePath(path, flowKey)                // reserves hops
     return sport

④ SendFlow() injects RDMA QP with CC-assigned sport
   RdmaClientHelper.Install() + Start()        [entry.h → NS-3]

⑤ NS-3 simulates packet-level RDMA transport
   → switches use ECMP hash (same Murmur3)     [ns-3 switch-node.cc]
   → congestion control (DCQCN/TIMELY/PINT)    [configured in common.h]

⑥ All packets arrive → QP completes
   RdmaHw::QpComplete()                        [rdma-hw.cc]
     CentralController::ReleasePath()          // free reserved hops
     qp_finish()                               [entry.h]
       notify_receiver_receive_data()          // triggers AstraSim recv callback

Differences from Original SimAI

Aspect Original SimAI This Repo (with CC)
Sport Assignment Per-(src,dst) counter: portNumber[s][d]++ starting ~10000 CentralController::AssignSport() — searches for collision-free sport from 1
Path Awareness None — flows are fire-and-forget; routing is per-switch ECMP Global view of all active flow paths; reserves and releases hops
Congestion Response Reactive only — DCQCN (rate reduction) + PFC (pause frames) Proactive collision avoidance + reactive DCQCN/PFC
Switch-Level Changes EcmpHash() is private; no external access to seeds/routing EcmpHash() public static; GetEcmpSeed() and GetRtTable() added
Workload Modes AstraSim collectives only (AllReduce, AllGather, etc.) AstraSim collectives + TraceInjector for point-to-point KV cache flows
Flow Completion QpComplete() notifies AstraSim only QpComplete() also calls CC::ReleasePath() to free reserved hops
Analysis Tools Basic fct.txt output paths.csv logging + Python scripts for collision analysis, FCT CDF, path visualisation
New Files (none) central-controller.h/cc, gen_kv_trace.py, analyze_collisions.py, plot_fct_cdf.py, export_flow_csv.py

9.1 Assumptions & Limitations

Assumptions

  • Zero control-plane delay (synchronous call inside simulator)
  • Complete global topology view at initialization
  • Static routing tables (no changes during simulation)
  • Exact hash replication — same Murmur3 as SwitchNode

Limitations

  • Sequential sport search (sport=1,2,3…) — O(K×H) per flow
  • Serialised path assignment — blocks NS-3 event loop
  • Conservative conflict: any shared (switch,port) is a conflict
  • No link failure recovery — stale topology tables
  • Diminishing returns at network saturation