Central Controller: ECMP Collision-Free Routing

A centralised routing controller for SimAI/NS-3 that eliminates ECMP hash collisions by pre-assigning source ports before each RDMA flow is created.

NS-3 · Central Controller · ECMP

Table of Contents

  1. Overview & Motivation
  2. The ECMP Hash Collision Problem
  3. Central Controller Design
  4. Flow Diagrams: Before vs. After
  5. Implementation Deep Dive
  6. TraceInjector — KV Cache Simulation
  7. Experiments & Results
  8. Source File Reference
  9. Differences from Original SimAI

Overview & Motivation

Large-scale LLM training clusters use collective communication (AllReduce, AllGather, ReduceScatter) to synchronise model parameters across GPUs. Each collective is decomposed into many concurrent point-to-point RDMA flows. The underlying network uses ECMP (Equal-Cost Multi-Path) routing: at every switch, a hash function over the flow 5-tuple (sip, dip, sport, dport, protocol) selects one of several equal-cost output ports.

Goal: Implement a centralised controller that, before any RDMA flow is created, pre-computes the complete hop-by-hop path and steers the flow away from paths already in use — eliminating hash collisions entirely.

SIMAI ARCHITECTURE WITH CENTRAL CONTROLLER
┌─────────────────────────────────────────────────────────┐
│  AstraSim  (collective communication scheduler)          │
│  · Decides when each GPU issues a send/recv              │
│  · Implements ring-AllReduce, tree-AllReduce, etc.       │
│  · Calls sim_send() / sim_recv() to inject traffic       │
└─────────────────────┬───────────────────────────────────┘
                      │  sim_send / sim_recv API
┌─────────────────────▼───────────────────────────────────┐
│  NS-3  (packet-level network simulator)                  │
│                                                          │
│  ┌────────────────────────────────────────┐              │
│  │  CentralController (NEW)                │              │
│  │  · Maintains global view of all paths  │              │
│  │  · AssignSport() — collision-free port  │              │
│  │  · ReleasePath() — free on completion  │              │
│  └────────────────────────────────────────┘              │
│                                                          │
│  RdmaHw        – RDMA NIC logic per GPU                  │
│  RdmaQueuePair – one RDMA QP per flow                    │
│  SwitchNode    – ECMP forwarding logic                   │
│  QbbNetDevice  – queuing, PFC pause, ECN marking         │
└─────────────────────────────────────────────────────────┘

The ECMP Hash Collision Problem

2.1 How Original SimAI Assigns Ports

In original SimAI, sport comes from a per-(src, dst) counter: portNumber[src][dst]++, initialised at ~10,000. Successive flows from GPU A to GPU B get sports 10000, 10001, 10002, … Because each switch reduces the ECMP hash modulo the small number of equal-cost ports, these uncoordinated sport choices frequently land concurrent flows on the same output port, so hash collisions are common.

2.2 Why Collisions Matter

Bandwidth Halving

When two flows share the same output port at a switch, they share link bandwidth. In a ring-AllReduce, the collective finishes only when the last GPU completes — a single congested link delays the entire collective.

PFC Cascade

PFC (Priority-based Flow Control) propagates backpressure upstream, potentially stalling unrelated flows. A localised collision can turn into cluster-wide head-of-line blocking.

2.3 ECMP Hash Internals

SwitchNode::EcmpHash() is a static Murmur3 function with a per-switch seed (m_ecmpSeed). The same 5-tuple may take different branches at different switches. The NIC selection at the source host also uses Murmur3 with fixed seed 0x8BADF00D.

# Murmur3 (x86_32) over a 12-byte buffer: sip(4) | dip(4) | sport(2)|dport(2)
from struct import pack, unpack_from

def rotl32(x, r):
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def ecmp_hash(sip, dip, sport, dport, seed):
    buf = pack('<III', sip, dip, (sport | (dport << 16)) & 0xFFFFFFFF)
    h = seed & 0xFFFFFFFF
    for i in range(0, 12, 4):
        k = unpack_from('<I', buf, i)[0]
        k = (k * 0xcc9e2d51) & 0xFFFFFFFF   # mix step 1
        k = rotl32(k, 15)
        k = (k * 0x1b873593) & 0xFFFFFFFF   # mix step 2
        h ^= k
        h = rotl32(h, 13)
        h = (h * 5 + 0xe6546b64) & 0xFFFFFFFF
    # finalization (fmix32), buffer length = 12
    h ^= 12
    h ^= h >> 16; h = (h * 0x85ebca6b) & 0xFFFFFFFF
    h ^= h >> 13; h = (h * 0xc2b2ae35) & 0xFFFFFFFF
    h ^= h >> 16
    return h
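To make the collision mechanics concrete, here is a toy sketch (Python's hashlib stands in for Murmur3; the birthday-style statistics do not depend on the hash function, and the GPU/spine numbering is invented):

```python
# Toy demo: 8 concurrent flows with identical sports land on spines at random,
# so some spines get 2+ flows (birthday collisions). hashlib stands in for Murmur3.
import hashlib

N_SPINES = 8

def spine_of(sip, dip, sport, dport=100):
    key = f"{sip}-{dip}-{sport}-{dport}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:4], "little")
    return h % N_SPINES  # ECMP: hash remainder selects the uplink

# GPU i (server 0) -> GPU i+8 (server 1), all using sport=10000
spines = [spine_of(sip=i, dip=i + 8, sport=10000) for i in range(8)]
n_collided = len(spines) - len(set(spines))
print(f"spines: {spines}  collided slots: {n_collided}")
```

With 8 flows hashed uniformly over 8 spines, roughly 5 distinct spines are expected, i.e. about 3 flows share a spine with someone else; the controller's job is to drive this count to zero.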

Central Controller Design

3.1 Core Idea

The only field in the 5-tuple that the sender controls freely is sport. By choosing sport carefully, we can steer the flow onto any desired path, because every switch re-hashes on the full 5-tuple.

3.2 Algorithm Steps

  1. Maintain a global view of all active flow paths via m_hopToFlows and m_flowPaths.
  2. When a new flow arrives, simulate its full hop-by-hop path for each candidate sport value (1, 2, 3, …) using an exact replica of the switch hash functions.
  3. Return the first sport whose path shares no (switch, output-port) hop with any existing active flow.
  4. Reserve that path so future flows avoid it.
  5. Release the reservation when the flow completes (QpComplete()).

3.3 Data Structures

Topology Tables (read-only)

m_switches       : nodeId → { ecmpSeed,
                    rtTable[dip] → [outPorts] }
m_adjacency      : nodeId → { portIdx →
                    neighborNodeId }
m_hostRtTable    : hostId → { dip →
                    [nicIndices] }
m_hostNicToSwitch: hostId → { nicIdx →
                    switchNodeId }

Dynamic Reservation State

m_hopToFlows  : HopKey(nodeId, port) →
                 set of active flowKeys
m_flowPaths   : flowKey →
                 Path (list of (nodeId, port))

Updated on every AssignSport() and ReleasePath() call.
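A minimal Python model of the reservation state (variable names mirror the C++ members; the hop values are illustrative, and the C++ version additionally skips the final hop when reserving, per the Bug 1 fix):

```python
# Minimal model of the CC reservation state. HopKey = (nodeId, port),
# FlowKey = (sip, dip, sport, dport). Mirrors m_hopToFlows / m_flowPaths.
from collections import defaultdict

m_hop_to_flows = defaultdict(set)   # (nodeId, port) -> {flowKey, ...}
m_flow_paths = {}                   # flowKey -> [(nodeId, port), ...]

def reserve_path(path, flow_key):
    m_flow_paths[flow_key] = path
    for hop in path:
        m_hop_to_flows[hop].add(flow_key)

def release_path(flow_key):
    for hop in m_flow_paths.pop(flow_key):
        m_hop_to_flows[hop].discard(flow_key)

flow = (0x0B000001, 0x0B000009, 1, 100)
reserve_path([(18, 3), (21, 7)], flow)   # leaf 18 port 3, spine 21 port 7
assert m_hop_to_flows[(21, 7)] == {flow}
release_path(flow)
assert not m_hop_to_flows[(21, 7)]
```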

3.4 AssignSport — Pseudocode

AssignSport(sip, dip, dport, src, dst):
    if not enabled or src/dst on same server:
        return default_sport       ← intra-server: NVSwitch, no ECMP

    // Early exit: if probe path has ≤ 1 hop, CC can't help
    probe = ComputePath(sip, dip, fallbackSport, dport)
    if probe.size() <= 1:
        return fallbackSport

    for sport = 1, 2, 3, … 65535:
        path = ComputePath(sip, dip, sport, dport)
        if no hop in path[:-1] is already reserved:  ← skip final hop
            ReservePath(path, flowKey)
            log to CSV if CC_LOG_FILE is set
            return sport

    return default_sport       ← fallback (practically unreachable)

3.5 ComputePath — Deterministic Path Tracing

Step 1 — NIC selection at source host
  buf = [ sip | dip | (sport << 0) | (dport << 16) ]   (12 bytes)
  h   = murmur3(buf, seed=0x8BADF00D)     ← matches RdmaQueuePair::GetHash()
  nic = m_hostRtTable[srcNode][dip][ h % #nics ]
  currentNode = m_hostNicToSwitch[srcNode][nic]

Step 2 — Switch-by-switch ECMP traversal
  while currentNode ≠ dstNode:
      sw   = m_switches[currentNode]
      h    = murmur3(buf, seed=sw.ecmpSeed)   ← matches SwitchNode::GetOutDev()
      port = sw.rtTable[dip][ h % #ports ]
      append (currentNode, port) to path
      currentNode = m_adjacency[currentNode][port]

This replicates the exact hash computation used in the live NS-3 simulation, so the predicted path is always correct.
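The two steps can be replayed on a toy fabric (a sketch only: the seeds, node IDs, and routing tables are invented, and a hashlib stand-in replaces the Murmur3 used by NS-3):

```python
# Toy ComputePath: replay deterministic ECMP over an invented 2-spine fabric.
import hashlib

def h32(sip, dip, sport, dport, seed):
    key = f"{seed}:{sip}-{dip}-{sport}-{dport}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "little")

# nodeId -> (ecmpSeed, rtTable: dip -> [outPorts]); hosts are nodes 0 and 1
switches = {
    10: (7,  {1: [0, 1]}),   # src leaf: two uplinks toward the spines
    20: (13, {1: [2]}),      # spine A: one downlink toward dst leaf
    21: (17, {1: [2]}),      # spine B
    11: (23, {1: [0]}),      # dst leaf: port 0 goes to host 1
}
adjacency = {10: {0: 20, 1: 21}, 20: {2: 11}, 21: {2: 11}, 11: {0: 1}}

def compute_path(sip, dip, sport, dport, first_switch, dst):
    node, path = first_switch, []
    while node != dst:
        seed, rt = switches[node]
        ports = rt[dip]
        port = ports[h32(sip, dip, sport, dport, seed) % len(ports)]
        path.append((node, port))
        node = adjacency[node][port]
    return path

path = compute_path(sip=0, dip=1, sport=1, dport=100, first_switch=10, dst=1)
print(path)  # three hops; the middle hop is spine 20 or 21, depending on the hash
```

Changing sport changes the first hash, which flips the leaf's uplink choice, which is exactly the steering lever AssignSport() searches over.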

Flow Diagrams: Before vs. After

Without Central Controller

AstraSim          NS-3 RdmaHw          Network
────────          ───────────          ───────
sim_send(A→B)
   │
   └──► AddQueuePair(sip, dip,
          sport=portNumber[s][d]++,
          dport, size)
              │
         Create RdmaQueuePair
         (sport = ~10000+)
              │
         Transmit ─────────────────►
                    Switch hashes
                    → may collide
                    → PFC backpressure
         ◄── ACKs ────────────────
              │
         QpComplete()
              │
   ◄──────────┘
 notify AstraSim

With Central Controller

AstraSim   RdmaHw/CC              Network
────────   ──────────              ───────
sim_send(A→B)
   │
   └──► AddQueuePair(sport=1)
              │
    ┌─────────────────────┐
    │ CentralController   │
    │ ::AssignSport()     │
    │  for s = 1,2,3,…:   │
    │   path=ComputePath()│
    │   if no conflict:   │
    │    ReservePath()    │
    │    return s         │
    └──────────┬──────────┘
         sport = s (conflict-free)
              │
         Create RdmaQueuePair
              │
         Transmit ─────────────────►
                    → unique path, no collision
         ◄── ACKs ────────────────
              │
    ┌─────────▼──────────┐
    │ CC::ReleasePath()  │
    │ free reserved hops │
    └────────────────────┘
              │
   ◄──────────┘
 notify AstraSim

Implementation Deep Dive

5.1 Initialization

Called once after SetRoutingEntries() finishes populating all NS-3 routing tables:

for each NS-3 node:
    build m_adjacency from GetDevice(d)->GetChannel()
    if SwitchNode:
        copy GetEcmpSeed() and GetRtTable() into m_switches
    if GPU host with RdmaHw:
        copy m_rtTable (dip → NIC list) into m_hostRtTable
        for each NIC, find which switch it connects to → m_hostNicToSwitch

5.2 Path Release

Called from RdmaHw::QpComplete() when a flow finishes:

ReleasePath(sip, dip, sport, dport):
    key  = FlowKey(sip, dip, sport, dport)
    path = m_flowPaths[key]
    for each (nodeId, port) in path:
        m_hopToFlows[(nodeId, port)].remove(key)
    delete m_flowPaths[key]

5.3 Bug Fixes — Lessons Learned

Bug 1: Final-Hop Conflict

HasConflict() originally iterated over all path hops, including the final hop rail_switch → dst_GPU. This final hop is always identical regardless of sport. Fix: skip path.back() — only check spine hops where path diversity exists.

Bug 2: 1-Hop Sport=1 Collision

Same-rail GPU pairs have 1-hop paths. After excluding the final hop → zero spine hops → HasConflict() always returns false → CC assigned sport=1 to all concurrent same-rail flows, causing map key collisions. Fix: early-exit if probe path has ≤ 1 hop.

5.4 Runtime Configuration

Env Variable Description
CC_ENABLE=1 Enable controller (default)
CC_ENABLE=0 Disable — flows use original ECMP behaviour
CC_LOG_FILE=paths.csv Write per-flow path assignments to CSV
TRACE_INJECT_FILE=<path> Bypass AstraSim — drive simulation from a trace CSV

No recompilation needed — toggling the controller is a one-character change to the run command (CC_ENABLE=0/1).
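In Python terms, the gate that common.h implements amounts to the following (a sketch of the logic, not the actual C++):

```python
# Sketch of the CC_ENABLE / CC_LOG_FILE gate read at startup.
import os

def cc_config(env=os.environ):
    enabled = env.get("CC_ENABLE", "1") != "0"   # default: enabled
    log_file = env.get("CC_LOG_FILE")            # None -> no path CSV logging
    return enabled, log_file

assert cc_config({"CC_ENABLE": "0"}) == (False, None)
assert cc_config({}) == (True, None)
assert cc_config({"CC_LOG_FILE": "paths.csv"}) == (True, "paths.csv")
```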

TraceInjector — KV Cache Simulation

AstraSim only supports collective workloads. For disaggregated inference, the critical traffic is point-to-point KV cache transfers (P-node → D-node). A lightweight TraceInjector mode bypasses AstraSim entirely and drives NS-3 directly from a CSV trace.

6.1 Trace Format

# timestamp_ns,src_node_id,dst_node_id,size_bytes
0,0,8,10485760          ← GPU 0 → GPU 8, 10 MB at t=0
0,1,9,10485760          ← GPU 1 → GPU 9, 10 MB at t=0
100000,0,8,10485760     ← another flow 100µs later
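Parsing the format is a one-liner per row (a sketch; the three rows are the example above):

```python
# Parse KV-trace CSV rows: timestamp_ns,src_node_id,dst_node_id,size_bytes
import csv, io

trace_text = """0,0,8,10485760
0,1,9,10485760
100000,0,8,10485760
"""

flows = [tuple(map(int, row)) for row in csv.reader(io.StringIO(trace_text)) if row]
for t_ns, src, dst, size in flows:
    print(f"t={t_ns}ns  GPU {src} -> GPU {dst}  {size / 2**20:.0f} MB")
# -> t=0ns  GPU 0 -> GPU 8  10 MB  (etc.)
```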

6.2 gen_kv_trace.py — Arrival Patterns

Pattern Description
constant Fixed inter-arrival = 1/rate (or --interval-ns)
poisson Exponential inter-arrival (Poisson process)
burst Periodic bursts: burst_size flows at the same timestamp
hotspot Fraction of flows from one specific server pair → max collision
server_pair All GPUs on src server → matching GPUs on dst server, N rounds
one_to_one Fixed (src, dst) GPU pair, N flows staggered by interval
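As an illustration, the poisson pattern reduces to exponential inter-arrival gaps (a sketch of the idea; the real gen_kv_trace.py flags and defaults may differ):

```python
# Sketch: exponential inter-arrival gaps give a Poisson arrival process.
import random

def poisson_arrivals(rate_per_s, n_flows, seed=42):
    rng = random.Random(seed)           # seeded -> reproducible traces
    t_ns, out = 0.0, []
    for _ in range(n_flows):
        t_ns += rng.expovariate(rate_per_s) * 1e9   # gap in nanoseconds
        out.append(int(t_ns))
    return out

ts = poisson_arrivals(rate_per_s=10_000, n_flows=5)
print(ts)  # non-decreasing timestamps, mean gap ~100,000 ns
```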

How to Run the Simulation

The simulation has two modes: AstraSim workload mode (full collective communication) and TraceInjector mode (inject point-to-point flows from a CSV). Both produce the same fct.txt output.

Step 1: Build SimAI

# Build NS-3 backend (compiles central-controller.cc, rdma-hw.cc patches, etc.)
./scripts/build.sh -c ns3

Step 2: Generate Topology

# DCN+ fat-tree: 16 GPUs (2 servers × 8 GPU/server), 8 spine switches
python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \
  -topo DCN+ -g 16 -gps 8 -gt A100 -bw 100Gbps -nvbw 2400Gbps \
  -asn 2 -npa 8 -psn 8 -apbw 100Gbps -app 2

# Spectrum-X: 128 GPUs (rail-optimized, for training workloads)
python3 ./astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \
  -topo Spectrum-X -g 128 -gt A100 -bw 100Gbps -nvbw 2400Gbps

Step 3: Run Simulation

Mode A — TraceInjector (recommended for CC experiments)

Bypasses AstraSim entirely. Inject flows from a CSV trace file.

# 1. Generate trace (server 0 GPUs → server 1 GPUs, 10 MB each)
python3 gen_kv_trace.py \
    --arrival server_pair --src-server 0 --dst-server 1 \
    --gpus-per-server 8 --kv-mb 10 --rounds 1 \
    --out traces/my_trace.csv

# 2a. Run WITH CC
CC_ENABLE=1 CC_LOG_FILE=results/cc/paths.csv \
  TRACE_INJECT_FILE=traces/my_trace.csv \
  AS_SEND_LAT=3 ./bin/SimAI_simulator -t 1 \
  -n <topology_dir> -c results/cc/SimAI_local.conf

# 2b. Run WITHOUT CC (A/B comparison)
CC_ENABLE=0 \
  TRACE_INJECT_FILE=traces/my_trace.csv \
  AS_SEND_LAT=3 ./bin/SimAI_simulator -t 1 \
  -n <topology_dir> -c results/nocc/SimAI_local.conf

Mode B — AstraSim Workload (full collective communication)

Run real collective workloads (AllReduce, AllToAll, etc.) through AstraSim → SimCCL → NS-3.

CC_ENABLE=1 CC_LOG_FILE=results/cc/paths.csv \
  AS_SEND_LAT=3 AS_NVLS_ENABLE=1 ./bin/SimAI_simulator -t 1 \
  -w workloads/test_cc_compare.txt \
  -n <topology_dir> \
  -c results/cc/SimAI_local.conf

Environment Variables

Variable            Values             Effect
CC_ENABLE           1 (default) / 0    Enable/disable CentralController sport assignment
CC_LOG_FILE         file path          Write per-flow path CSV (flow_id, sip, dip, sport, hops)
TRACE_INJECT_FILE   CSV path           TraceInjector mode: bypass AstraSim, inject flows from CSV
AS_SEND_LAT         microseconds       Software send latency before NIC injection (default 6)
AS_NVLS_ENABLE      1 / 0              Enable NVSwitch NVLS transport for intra-server flows

Step 4: Analyze Results

# FCT CDF plot (CC vs No CC)
python3 plot_fct_cdf.py \
    --cc results/cc/fct.txt --nocc results/nocc/fct.txt \
    --out results/fct_cdf.png

# Export per-flow analysis CSV
python3 export_flow_csv.py \
    --fct results/cc/fct.txt --paths results/cc/paths.csv \
    --out results/cc/flows.csv

# Collision analysis
python3 analyze_collisions.py \
    --cc results/cc/paths.csv --nocc results/nocc/fct.txt \
    --trace traces/my_trace.csv --bw-gbps 100 --send-lat-ns 3000

Output Files

File        Format                             Contents
fct.txt     Space-separated, 1 line per flow   sip dip sport dport size start_ns fct_ns ideal_ns
paths.csv   CSV (only with CC_LOG_FILE)        flow_id,sip,dip,sport,dport,n_hops,hops
flows.csv   CSV (from export_flow_csv.py)      flow_id,src_gpu,dst_gpu,sport,spine_switch,fct_ns,slowdown,cc_assigned
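A sketch of consuming fct.txt with the column layout above (the two example rows are invented; the last two columns give the per-flow slowdown):

```python
# Compute per-flow slowdown (fct_ns / ideal_ns) from fct.txt-style lines.
fct_lines = """\
11.0.0.1 11.0.0.9 1 100 10485760 0 850000 850000
11.0.0.2 11.0.0.10 1 100 10485760 0 1693000 850000
"""

slowdowns = []
for line in fct_lines.strip().splitlines():
    sip, dip, sport, dport, size, start_ns, fct_ns, ideal_ns = line.split()
    slowdowns.append(int(fct_ns) / int(ideal_ns))
print([f"{s:.2f}x" for s in slowdowns])  # ['1.00x', '1.99x']
```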

Experiments & Results

7.1 Best Case — Fat-Tree Server-Pair Burst

Topology: DCN+ Single-ToR fat-tree — 16 GPUs, 2 servers, 8 spine switches. Every cross-server flow traverses exactly 3 hops. Workload: 8 simultaneous 10 MB flows: GPU i (server 0) → GPU i+8 (server 1).

DCN+ FAT-TREE TOPOLOGY (16 GPU, 8 SPINES)
  Server 0 (GPUs 0–7)            Server 1 (GPUs 8–15)
  ┌─────────────────┐             ┌─────────────────┐
  │ GPU0 GPU1 … GPU7│             │ GPU8 GPU9 …GPU15│
  │  │    │       │ │             │  │    │       │  │
  │  └────┴───┬───┘ │             │  └────┴───┬───┘  │
  │       NVSw16    │             │       NVSw17      │
  │       (NVLink)  │             │       (NVLink)    │
  └───────────┼─────┘             └─────┼─────────────┘
              │                         │
           Leaf18                    Leaf19
            │ │ │ │ │ │ │ │         │ │ │ │ │ │ │ │
            │ │ │ │ │ │ │ │ 100Gbps │ │ │ │ │ │ │ │
     Spine20 Spine21 Spine22 Spine23 Spine24 Spine25 Spine26 Spine27

Without CC — 4 Spine Collisions

GPU 0→8   sport=10000 Spine 27  FCT=  850µs (1.0×)
GPU 1→9   sport=10000 Spine 20  FCT=  850µs (1.0×)
GPU 2→10  sport=10000 Spine 23  FCT=  850µs (1.0×)
GPU 3→11  sport=10000 Spine 24  FCT=  850µs (1.0×)
GPU 4→12  sport=10000 Spine 25  FCT=1693µs (2.0×) ←
GPU 5→13  sport=10000 Spine 22  FCT=1693µs (2.0×) ←
GPU 6→14  sport=10000 Spine 21  FCT=1693µs (2.0×) ←
GPU 7→15  sport=10000 Spine 26  FCT=1693µs (2.0×) ←

4 of 8 spines carry 2 flows each → 50% bandwidth → 2× FCT

With CC — Perfect Spread

GPU 0→8   sport=1  Spine 27  FCT=850µs (1.0×)
GPU 1→9   sport=1  Spine 20  FCT=850µs (1.0×)
GPU 2→10  sport=1  Spine 23  FCT=850µs (1.0×)
GPU 3→11  sport=1  Spine 24  FCT=850µs (1.0×)
GPU 4→12  sport=2  Spine 25  FCT=850µs (1.0×) ← reassigned
GPU 5→13  sport=4  Spine 22  FCT=850µs (1.0×) ← reassigned
GPU 6→14  sport=2  Spine 21  FCT=850µs (1.0×) ← reassigned
GPU 7→15  sport=1  Spine 26  FCT=850µs (1.0×)

Every spine carries exactly 1 flow → full bandwidth → 1.0× FCT

Metric With CC Without CC Improvement
Mean FCT 850 µs 1,271 µs −33%
Max FCT 850 µs 1,693 µs −50%
Mean Slowdown 1.0× 1.5×
Spine Collisions 0 4 pairs
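The summary statistics follow directly from the per-flow FCTs above; a quick arithmetic check:

```python
# Without CC: 4 clean flows at 850 µs and 4 collided flows at 1693 µs.
no_cc = [850] * 4 + [1693] * 4

mean_no_cc = sum(no_cc) / len(no_cc)   # 1271.5 µs ("1,271 µs" in the table)
mean_gain = 1 - 850 / mean_no_cc       # ~0.33 -> -33% mean FCT
max_gain = 1 - 850 / max(no_cc)        # ~0.50 -> -50% max FCT
print(f"mean {mean_no_cc:.1f} µs, gains: {mean_gain:.0%} / {max_gain:.0%}")
```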

7.2 8-Ring AllReduce — Spectrum-X 128 GPU

Collective With CC (ns) Without CC (ns) Δ
AllReduce 128 MB 2,554,511 2,559,883 +0.21%
AllReduce 64 MB 1,282,541 1,276,607 −0.47%
AllReduce 32 MB 645,833 662,472 +2.51%
Total 4,482,885 4,498,962 +0.36%

Small improvement because 8 concurrent AllReduce rings already produce diverse (sip, dip) pairs that naturally spread across different spines via ECMP.

Sport Assignment Distribution

sport Flow Count %
1 1,628 84.7%
2 263 13.7%
3 27 1.4%
4 2 0.1%
5 1 0.1%

7.3 Full AllToAll 128 GPU — Network Saturation

With 128 × 127 = 16,256 concurrent flows, the network is fully saturated. CC provides only 0.02% improvement — resolving collisions doesn't free capacity when all links are already at 100%. CC search depth reached 183, and simulation took 119 minutes vs. 2 minutes without CC.

7.4 When Does CC Help?

Scenario CC Benefit Root Cause
8-ring AllReduce (Spectrum-X) +0.36% Diverse (sip,dip) already spread by ECMP
Full AllToAll 128G +0.02% Network fully saturated
Hotspot (same server pair) ~0% Final-hop congestion dominates
Fat-tree burst (8 flows, 8 spines) −33% mean, −50% max ECMP collision on spine; CC spreads perfectly

Conditions for large CC benefit: (1) Flows traverse multiple candidate spine paths, (2) concurrent flow count ≤ available spine paths, (3) default ECMP produces hash collisions, (4) destination GPUs are distinct (no shared final hop).

Source File Reference

8.1 C++ Patches (NS-3 / AstraSim)

File Change
central-controller.h/cc NEW CentralController singleton class (~300 lines). Declares Initialize(), AssignSport(), ReleasePath(), ComputePath(), HasConflict(), ReservePath(). Maintains topology tables and dynamic reservation state.
switch-node.h MOD EcmpHash() moved from private to public static. Added GetEcmpSeed() and GetRtTable() getters so CC can replicate switch-level hashing.
rdma-hw.cc MOD AddQueuePair(): calls AssignSport() to override sport before QP creation. QpComplete(): calls ReleasePath() to free reserved hops.
common.h MOD Added #include <central-controller.h>. After SetRoutingEntries(): calls Initialize(), reads CC_ENABLE / CC_LOG_FILE env vars.
entry.h MOD SendFlow(): calls CC's AssignSport() before registering the flow. New SendFlowTrace() helper for TraceInjector mode with sentinel ncclFlowTag (current_flow_id == -1).
AstraSimNetwork.cc MOD If TRACE_INJECT_FILE env var is set, reads CSV trace and schedules each flow via Simulator::Schedule(NanoSeconds(t), &SendFlowTrace, ...), bypassing AstraSim entirely.

8.2 Python Analysis Scripts

Script Description
gen_kv_trace.py Generate KV cache transfer traces with 6 arrival patterns (constant, poisson, burst, hotspot, server_pair, one_to_one). Configurable topology mapping (128 GPUs, 16 servers × 8 GPU/server).
analyze_collisions.py Count ECMP spine collisions. Replicates the exact Murmur3 hash used by NS-3 switches. Supports both static (all pairs) and concurrent (time-window-overlapping) collision counting. Can read actual NS-3 paths from CC_LOG_FILE.
plot_fct_cdf.py Plot FCT CDF and per-flow sorted bar chart comparing CC vs No CC. Reads fct.txt files from both runs.
export_flow_csv.py Join fct.txt + paths.csv into a per-flow analysis CSV with fields: flow_id, src/dst GPU, sport, spine_switch, hops, FCT, slowdown, cc_assigned.

Source Code Deep Dive — What Each File Does

This section traces the execution flow through the 5 C++ source files, showing exactly when each is invoked and what it does at the code level.

AstraSimNetwork.cc — Main entry point + TraceInjector

Contains main() and the ASTRASimNetwork class that bridges AstraSim ↔ NS-3.

Execution Order

main(argc, argv)
  ① user_param_prase()         // parse -t thread -w workload -n topo -c conf
  ② main1(topo, conf)          // → common.h: ReadConf → SetConfig → SetupNetwork
                               //   (builds NS-3 topology, initializes CentralController)
  ③ if env TRACE_INJECT_FILE:  // ← NEW: TraceInjector bypass mode
      for line in csv:
        Simulator::Schedule(NanoSeconds(t), &SendFlowTrace, src, dst, size)
      Simulator::Run()         // run NS-3 with injected flows only
      return
  ④ // Normal mode: create ASTRASimNetwork + AstraSim::Sys per GPU
    for j in nodes_num:
      networks[j] = new ASTRASimNetwork(j, 0)
      systems[j]  = new AstraSim::Sys(networks[j], ...)
    for i in nodes_num:
      systems[i]->workload->fire()  // kick off collective workloads
    Simulator::Run()

ASTRASimNetwork class — AstraSim ↔ NS-3 bridge

Method           What it does
sim_send()       Registers flow in sentHash, then calls SendFlow() (in entry.h) to inject an RDMA QP into NS-3
sim_recv()       Matches arriving data against expeRecvHash. If data already arrived: invoke callback immediately; if not: register in expeRecvHash to wait.
sim_get_time()   Returns NS-3's Simulator::Now().GetNanoSeconds() to AstraSim
sim_schedule()   Wraps Simulator::Schedule() for AstraSim callbacks
TraceInjector: When TRACE_INJECT_FILE is set, the entire AstraSim workload system is bypassed. Flows are read from a CSV (timestamp_ns,src,dst,size_bytes) and scheduled directly into NS-3 via SendFlowTrace(). This enables isolated network experiments (e.g., KV cache transfer patterns) without needing the full SimAI stack.

entry.h — SendFlow + CentralController integration

Defines the flow injection functions and all flow completion callbacks. This is where the CentralController's AssignSport() is called for every flow.

SendFlow() — normal mode (called by AstraSim)

void SendFlow(src, dst, maxPacketCount, msg_handler, fun_arg, tag, request) {
  port = portNumber[src][dst]++;

  // ← NEW: CentralController picks a conflict-free sport
  port = CentralController::Instance().AssignSport(
      serverAddress[src], serverAddress[dst],
      100 /*dport*/, src, dst, port);

  sender_src_port_map[{port, {src, dst}}] = request->flowTag;

  RdmaClientHelper clientHelper(pg, srcIP, dstIP, port, dport, size, ...);
  appCon = clientHelper.Install(n.Get(src));
  appCon.Start(Time(send_lat));  // inject RDMA QP into NS-3
}

SendFlowTrace() — TraceInjector mode (no AstraSim context)

void SendFlowTrace(src, dst, size_bytes) {
  port = portNumber[src][dst]++;
  port = CentralController::Instance().AssignSport(...);

  AstraSim::ncclFlowTag sentinel;  // current_flow_id = -1 (sentinel)
  sender_src_port_map[{port, {src, dst}}] = sentinel;

  // Same RDMA QP injection, but no AstraSim callbacks
  RdmaClientHelper clientHelper(..., nullptr, nullptr, ...);
  appCon.Start(Time(send_lat));
}

Completion Callbacks

Callback        Triggered by                          What it does
qp_finish()     NS-3 when all packets arrive at dst   Writes to fct.txt, checks is_receive_finished(), calls notify_receiver_receive_data() → triggers AstraSim recv callback. For sentinel flows (TraceInjector): early return.
send_finish()   NS-3 when all packets leave src NIC   Checks is_sending_finished(), calls notify_sender_sending_finished() → triggers AstraSim send callback. For sentinel flows: early return.

common.h — Network setup + CC initialization

The infrastructure glue file. Defines global state, reads config, builds the NS-3 topology, and initializes the CentralController.

Three main functions, called in order by main1()

main1(topo, conf)
  ① ReadConf(topo, conf)   // Parse config: CC_MODE, DATA_RATE, LINK_DELAY, PFC, DCQCN params...
  ② SetConfig()            // Apply NS-3 defaults (PauseTime, QcnEnabled, IntHeader mode)
                           //   cc_mode=3 → DCQCN, cc_mode=7 → TIMELY, cc_mode=10 → PINT
  ③ SetupNetwork(qp_finish, send_finish)
       // a. Read topology file: node_num, gpus_per_server, nvswitch_num, switch_num
       // b. Create NS-3 nodes (hosts, SwitchNode, NVSwitchNode)
       // c. Install point-to-point links with QBB (lossless Ethernet)
       // d. Configure RDMA per host (RdmaHw with CC mode, rates, PFC)
       // e. CalculateRoutes() → SetRoutingEntries()
       // f. NEW: Initialize CentralController ↓

CentralController initialization (added in SetupNetwork)

// After SetRoutingEntries():
CentralController::Instance().Initialize(n, nodeIdToRdmaHw, ipToNodeId, gpus_per_server);

// Env var control:
if (env "CC_LOG_FILE")   CC.SetLogFile(path);   // CSV: flow_id,sip,dip,sport,hops
if (env "CC_ENABLE" != "0") CC.Enable();  // default: enabled
else                       CC.Disable(); // for A/B comparison

central-controller.h — CC class declaration

Singleton class with the full API and internal data structures for conflict-free path assignment.

Public API

Method          When called                             What it does
Initialize()    Once, in SetupNetwork                   Reads NS-3 topology: adjacency graph, switch ECMP seeds + routing tables, host NIC→switch mapping
AssignSport()   Every flow (SendFlow / SendFlowTrace)   Probes sport 1..65534, finds first conflict-free path, reserves it
ReleasePath()   QP completion (rdma-hw.cc)              Unreserves all hops for the completed flow

Internal Data Structures

Structure      Purpose
m_switches     Per-switch: ecmpSeed + rtTable (dip → [outPorts]). Read-only after Initialize.
m_adjacency    nodeId → {portIdx → neighborNodeId}. Physical topology graph.
m_hopToFlows   (switchId, portIdx) → set of active FlowKeys. The dynamic reservation state.
m_flowPaths    FlowKey → Path. Reverse lookup for ReleasePath.

central-controller.cc — CC core logic

AssignSport() — the hot path (called per flow)

AssignSport(sip, dip, dport, srcNodeId, dstNodeId, fallbackSport) {
  if (!m_enabled || !m_initialized) return fallbackSport;
  if (srcNodeId/gpusPerServer == dstNodeId/gpusPerServer) return fallbackSport;
      // ↑ intra-server → NVSwitch, no ECMP, CC can't help

  // Quick check: if probe path has ≤1 hop, CC is useless
  Path probe = ComputePath(sip, dip, fallbackSport, dport, ...);
  if (probe.size() <= 1) return fallbackSport;

  // Exhaustive search: try sport 1..65534
  for (sport = 1; sport < 65535; sport++) {
    Path path = ComputePath(sip, dip, sport, dport, ...);
    if (!HasConflict(path)) {
      ReservePath(path, {sip, dip, sport, dport});
      return sport;  // ← conflict-free sport found
    }
  }
  return fallbackSport;  // no conflict-free sport exists
}

ComputePath() — deterministic ECMP replay

ComputePath(sip, dip, sport, dport, srcNodeId, dstNodeId) {
  buf = {sip, dip, (sport | dport<<16)};  // 12-byte hash input

  // Step 1: NIC selection at host (same hash as rdma-hw.cc)
  h = Murmur3(buf, 12, 0x8BADF00D);
  nicIdx = hostNics[h % hostNics.size()];
  currentNode = nicToSwitch[nicIdx];

  // Step 2: switch-by-switch ECMP traversal
  while (currentNode is switch) {
    h = Murmur3(buf, 12, switch.ecmpSeed);  // ← per-switch seed
    portIdx = switch.rtTable[dip][h % numPorts];
    path.push_back({currentNode, portIdx});
    currentNode = adjacency[currentNode][portIdx];
  }
  return path;
}
Key: ComputePath() uses the exact same Murmur3 hash with the exact same per-switch ECMP seed as the NS-3 SwitchNode. This means the CC can predict the path any 5-tuple will take through the fabric before injecting the flow. Changing sport changes the hash output, which changes the ECMP port selection, which routes the flow through different spine links.

HasConflict() + ReservePath() / ReleasePath()

HasConflict(path) {
  // Check all hops EXCEPT the last one (leaf→GPU has no diversity)
  for (i = 0; i + 1 < path.size(); i++)
    if (m_hopToFlows[{path[i].node, path[i].port}].size() > 0)
      return true;
  return false;
}

ReservePath(path, flowKey) {
  m_flowPaths[flowKey] = path;
  for (i = 0; i + 1 < path.size(); i++)
    m_hopToFlows[{path[i].node, path[i].port}].insert(flowKey);
}

ReleasePath(sip, dip, sport, dport) {  // called from rdma-hw.cc QpComplete()
  flowKey = {sip, dip, sport, dport};
  path = m_flowPaths[flowKey];
  for hop in path:
    m_hopToFlows[hop].erase(flowKey);
  m_flowPaths.erase(flowKey);
}
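The final-hop exclusion from Bug 1 is easy to unit-test against a Python mirror of these functions (a sketch; the set-of-flows container mirrors m_hopToFlows, and the hop tuples are invented):

```python
# Mirror of HasConflict/ReservePath with the final-hop exclusion (Bug 1 fix):
# the last hop (leaf -> GPU) is identical for every sport, so it never counts.
from collections import defaultdict

hop_to_flows = defaultdict(set)

def has_conflict(path):
    return any(hop_to_flows[hop] for hop in path[:-1])  # skip path[-1]

def reserve(path, key):
    for hop in path[:-1]:                               # skip path[-1]
        hop_to_flows[hop].add(key)

# Flow A: leaf18 -> spine21 -> leaf19 (final hop to the GPU)
reserve([(18, 3), (21, 7), (19, 0)], key="A")

# Flow B reaches the SAME destination GPU via a different spine: no conflict,
# even though both paths end on (19, 0) -- that hop is shared by construction.
assert not has_conflict([(18, 4), (22, 7), (19, 0)])
# Flow C that would reuse spine 21 port 7 DOES conflict.
assert has_conflict([(18, 4), (21, 7), (19, 0)])
```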

Complete Lifecycle: One Flow Through All 5 Files

① AstraSim decides to send a collective chunk
   ASTRASimNetwork::sim_send()                 [AstraSimNetwork.cc]

② sim_send() calls SendFlow()
   SendFlow(src, dst, size, callback, tag)     [entry.h]

③ SendFlow() asks CC for a conflict-free port
   CentralController::AssignSport()            [central-controller.cc]
     ComputePath(sip, dip, sport_candidate)    // replays ECMP hash
     HasConflict(path)                         // checks m_hopToFlows
     ReservePath(path, flowKey)                // reserves hops
     return sport

④ SendFlow() injects RDMA QP with CC-assigned sport
   RdmaClientHelper.Install() + Start()        [entry.h → NS-3]

⑤ NS-3 simulates packet-level RDMA transport
   → switches use ECMP hash (same Murmur3)     [ns-3 switch-node.cc]
   → congestion control (DCQCN/TIMELY/PINT)    [configured in common.h]

⑥ All packets arrive → QP completes
   RdmaHw::QpComplete()                        [rdma-hw.cc]
     CentralController::ReleasePath()          // free reserved hops
     qp_finish()                               [entry.h]
       notify_receiver_receive_data()          // triggers AstraSim recv callback

Differences from Original SimAI

Aspect Original SimAI This Repo (with CC)
Sport Assignment Per-(src,dst) counter: portNumber[s][d]++ starting ~10000 CentralController::AssignSport() — searches for collision-free sport from 1
Path Awareness None — flows are fire-and-forget; routing is per-switch ECMP Global view of all active flow paths; reserves and releases hops
Congestion Response Reactive only — DCQCN (rate reduction) + PFC (pause frames) Proactive collision avoidance + reactive DCQCN/PFC
Switch-Level Changes EcmpHash() is private; no external access to seeds/routing EcmpHash() public static; GetEcmpSeed() and GetRtTable() added
Workload Modes AstraSim collectives only (AllReduce, AllGather, etc.) AstraSim collectives + TraceInjector for point-to-point KV cache flows
Flow Completion QpComplete() notifies AstraSim only QpComplete() also calls CC::ReleasePath() to free reserved hops
Analysis Tools Basic fct.txt output paths.csv logging + Python scripts for collision analysis, FCT CDF, path visualisation
New Files (none) central-controller.h/cc, gen_kv_trace.py, analyze_collisions.py, plot_fct_cdf.py, export_flow_csv.py

9.1 Assumptions & Limitations

Assumptions

  • Zero control-plane delay (synchronous call inside simulator)
  • Complete global topology view at initialization
  • Static routing tables (no changes during simulation)
  • Exact hash replication — same Murmur3 as SwitchNode

Limitations

  • Sequential sport search (sport=1,2,3…) — O(K×H) per flow
  • Serialised path assignment — blocks NS-3 event loop
  • Conservative conflict: any shared (switch,port) is a conflict
  • No link failure recovery — stale topology tables
  • Diminishing returns at network saturation