LLaMA 3 8B on RTX 5090: End-to-End SimAI Experiment

GPU profiling, RandomForest training, PD disaggregation simulation, and NS-3 KV cache transfer integration — from profiling to production-trace evaluation.

RTX 5090 · LLaMA 3 8B · Azure Trace · NS-3

Contents

  1. Experiment Overview
  2. Step 1: GPU Profiling on RTX 5090
  3. Step 2: RandomForest Training
  4. Validation: Roofline Behavior
  5. Step 3: PD Aggregated vs Disaggregated
  6. Step 4: NS-3 KV Cache Transfer Integration
  7. Step 5: Three-Way Comparison
  8. Azure Trace Workload Analysis
  9. Key Findings

Experiment Overview

Goal: Evaluate LLaMA 3 8B inference performance on RTX 5090 using the full SimAI stack — from kernel-level GPU profiling through NS-3 packet-level network simulation. This is the first end-to-end SimAI experiment on a consumer GPU (RTX 5090, 32GB GDDR7).

Component | Configuration
Model | LLaMA 3 8B (32 layers, GQA 32q/8kv, hidden=4096, MLP=14336)
GPU | NVIDIA RTX 5090 (32GB GDDR7, SM 12.0, PyTorch 2.10, CUDA 12.8)
Parallelism | TP=1, PP=1 (single GPU per replica)
Cluster | 4 GPUs, 100 Gbps inter-GPU network
Trace | Azure LLM Inference Trace (conversation, 19,366 requests, first 500 used)
Scheduler | vLLM (aggregated) / SplitWise (disaggregated)

Step 1: GPU Profiling on RTX 5090

We wrote a standalone profiling script (profile_llama3_8b.py) that benchmarks each LLaMA 3 8B operation using CUDA events. This bypasses Vidur's sarathi-dependent profiling pipeline, which doesn't support SM 12.0 (Blackwell).

Commands

# MLP profiling (21 seconds, 259 data points)
python3 profile_llama3_8b.py --mode mlp --max_tokens 4096 \
  --output_dir data/profiling/compute/rtx5090/meta-llama/Meta-Llama-3-8B

# Attention profiling (47 seconds, 6872 data points)
python3 profile_llama3_8b.py --mode attention --max_tokens 4096 --max_batch_size 128 \
  --output_dir data/profiling/compute/rtx5090/meta-llama/Meta-Llama-3-8B

The output CSVs are written to data/profiling/compute/rtx5090/meta-llama/Meta-Llama-3-8B/{mlp,attention}.csv.

Profiling Results

Dataset | Rows | Time | Description
mlp.csv | 259 | 21s | 9 ops × 259 token counts (1–4096)
attention.csv | 6,872 | 47s | 5,600 prefill + 1,272 decode combos

Operations Profiled

Operation | PyTorch | Time at 1 token (ms)
mlp_up_proj | nn.Linear(4096, 28672) | 0.148
mlp_down_proj | nn.Linear(14336, 4096) | 0.086
attn_pre_proj | nn.Linear(4096, 6144) | 0.041
attn_post_proj | nn.Linear(4096, 4096) | 0.031
mlp_act | SiLU × gate (SwiGLU) | 0.012
input_layernorm | nn.RMSNorm(4096) | 0.010
attn_kv_cache_save | K,V copy to cache | 0.014

Step 2: RandomForest Training

Trained 11 RandomForest models using GridSearchCV (n_estimators=[250,500,750], max_depth=[8,16,32], min_samples_split=[2,5,10]).

# Train all 11 RF models
python3 train_rf_predictor.py --train \
  --mlp_csv data/profiling/compute/rtx5090/meta-llama/Meta-Llama-3-8B/mlp.csv \
  --attention_csv data/profiling/compute/rtx5090/meta-llama/Meta-Llama-3-8B/attention.csv \
  --output_dir trained_models/rtx5090

# Query execution time for 512 prefill tokens
python3 train_rf_predictor.py --query --num_tokens 512 --output_dir trained_models/rtx5090

# Sweep all token counts
python3 train_rf_predictor.py --sweep --output_dir trained_models/rtx5090

Model | MAPE (CV) | MAPE (actual) | Features
attn_kv_cache_save | 2.5% | <1% | num_tokens
input_layernorm | 3.3% | <1% | num_tokens
add | 3.9% | <1% | num_tokens
mlp_act | 11.1% | <1% | num_tokens
attn_prefill | 12.6% | 0.3% | kv_cache_size, chunk²
mlp_up_proj | 14.2% | <1% | num_tokens
attn_decode | 26.6% | 1.2% | batch_size, kv_cache
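Note: CV MAPE is measured on held-out folds, so it is the more conservative of the two figures. The "actual" MAPE is computed on the full profiled dataset, which includes the points the models were trained on, and is at most 1.2% for all models.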
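For reference, MAPE is the mean of |predicted − measured| / measured, expressed as a percentage. The sketch below shows the computation on made-up timings; the numbers are illustrative and are not taken from mlp.csv or attention.csv, and the function name is our own.

```python
def mape(measured, predicted):
    """Mean absolute percentage error, in percent."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("inputs must be non-empty and the same length")
    errors = [abs(p - m) / m for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Illustrative timings in ms (hypothetical, not from the profiling CSVs)
measured = [0.10, 0.20, 0.40]
predicted = [0.11, 0.19, 0.40]
print(f"MAPE = {mape(measured, predicted):.1f}%")
```

Evaluating the same function on held-out folds versus the training set is what separates the two MAPE columns in the table.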

Validation: Roofline Behavior

The profiled execution times match the roofline model described by Sarathi-Serve (OSDI'24, Figure 6).

Figure: RF predicted execution time vs prefill tokens. Top: per-component time (1 layer). Bottom: total prefill, flat at 1–64 tokens (memory-bound) and linear at 128+ (compute-bound).

Tokens | Per-Layer (ms) | 32-Layer (ms) | Regime
1 | 0.37 | 11.8 | Memory-bound
8 | 0.36 | 11.6 | Memory-bound
32 | 0.37 | 11.9 | Memory-bound
64 | 0.39 | 12.5 | Memory-bound
128 | 0.49 | 15.8 | Transition
512 | 1.39 | 44.5 | Compute-bound
1024 | 2.65 | 84.7 | Compute-bound
4096 | 9.69 | 310.0 | Compute-bound
The crossover at ~64–128 tokens comes earlier than the A100's 128–512 tokens (Sarathi-Serve Fig. 6). This is expected given the RTX 5090's GDDR7 bandwidth-to-compute ratio and the smaller model.
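One way to locate the end of the flat region is to look for the first token count whose per-layer time rises noticeably above the 1-token baseline. The sketch below does this with the per-layer values from the table; the 10% threshold is our assumption, not something the experiment defines.

```python
# Per-layer prefill times (ms), copied from the table above
per_layer_ms = {1: 0.37, 8: 0.36, 32: 0.37, 64: 0.39,
                128: 0.49, 512: 1.39, 1024: 2.65, 4096: 9.69}


def first_above_baseline(times, tolerance=0.10):
    """Smallest token count whose time exceeds the 1-token baseline
    by more than `tolerance` (an assumed threshold)."""
    baseline = times[min(times)]
    for tokens in sorted(times):
        if times[tokens] > baseline * (1.0 + tolerance):
            return tokens
    return None


knee = first_above_baseline(per_layer_ms)
# Marginal cost per token in the linear region (1024 -> 4096 tokens)
slope_us = 1000 * (per_layer_ms[4096] - per_layer_ms[1024]) / (4096 - 1024)
print(f"flat region ends before {knee} tokens")
print(f"about {slope_us:.2f} us per token per layer in the linear region")
```

With these inputs the first point above the threshold is 128 tokens, which agrees with the 64–128 crossover stated above.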

Step 3: PD Aggregated vs Disaggregated

Vidur Code Changes for RTX 5090

Before running simulations, we registered RTX 5090 as a new device in Vidur:

File Modified | Change
vidur/types/device_sku_type.py | Added RTX5090 = 5
vidur/types/node_sku_type.py | Added RTX5090_SINGLE = 7
vidur/config/device_sku_config.py | Added RTX5090DeviceSKUConfig (fp16_tflops=209, total_memory_gb=32)
vidur/config/node_sku_config.py | Added RTX5090SingleNodeSKUConfig (1 device per node)
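The four changes amount to two new enum members and two small config classes. The sketch below is a self-contained stand-in that shows their shape using only the standard library. The class names, enum values, and the two device fields come from the table; the `get_type()` method, the `num_devices_per_node` field name, and the absence of Vidur base classes are assumptions made for illustration and may not match the real files.

```python
from dataclasses import dataclass
from enum import IntEnum


class DeviceSKUType(IntEnum):
    # Stand-in for vidur/types/device_sku_type.py; only the new member is shown
    RTX5090 = 5


class NodeSKUType(IntEnum):
    # Stand-in for vidur/types/node_sku_type.py; only the new member is shown
    RTX5090_SINGLE = 7


@dataclass
class RTX5090DeviceSKUConfig:
    fp16_tflops: int = 209
    total_memory_gb: int = 32

    @staticmethod
    def get_type() -> DeviceSKUType:  # assumed hook for registry lookup
        return DeviceSKUType.RTX5090


@dataclass
class RTX5090SingleNodeSKUConfig:
    num_devices_per_node: int = 1  # assumed field name

    @staticmethod
    def get_type() -> NodeSKUType:
        return NodeSKUType.RTX5090_SINGLE


device = RTX5090DeviceSKUConfig()
print(device.get_type().name, device.fp16_tflops, device.total_memory_gb)
```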

Simulation Commands

# 4-GPU Aggregated (vLLM + Round-Robin)
python3 -m vidur.main \
  --replica_config_device rtx5090 \
  --replica_config_network_device rtx5090_single \
  --replica_config_model_name meta-llama/Meta-Llama-3-8B \
  --replica_config_tensor_parallel_size 1 \
  --replica_config_num_pipeline_stages 1 \
  --cluster_config_num_replicas 4 \
  --global_scheduler_config_type round_robin \
  --replica_scheduler_config_type vllm \
  --random_forrest_execution_time_predictor_config_backend vidur \
  --length_generator_config_type trace \
  --trace_request_length_generator_config_trace_file ./data/processed_traces/azure_conv_trace.csv \
  --trace_request_length_generator_config_max_tokens 4096 \
  --interval_generator_config_type trace \
  --trace_request_interval_generator_config_trace_file ./data/processed_traces/azure_conv_trace.csv \
  --trace_request_interval_generator_config_start_time "2023-11-16 18:15:00" \
  --trace_request_interval_generator_config_end_time "2023-11-16 19:15:00" \
  --synthetic_request_generator_config_num_requests 500

# 4-GPU PD Disaggregated (SplitWise, 2P+2D, 100Gbps)
python3 -m vidur.main \
  --replica_config_device rtx5090 \
  --replica_config_network_device rtx5090_single \
  --replica_config_model_name meta-llama/Meta-Llama-3-8B \
  --replica_config_tensor_parallel_size 1 \
  --replica_config_num_pipeline_stages 1 \
  --replica_config_pd_p2p_comm_bandwidth 100 \
  --replica_config_pd_p2p_comm_dtype float16 \
  --replica_config_pd_node_ratio 0.5 \
  --cluster_config_num_replicas 4 \
  --global_scheduler_config_type split_wise \
  --replica_scheduler_config_type split_wise \
  # ... same trace args as above

We ran the Azure conversation trace (500 requests, real arrival times) on 4 GPUs with two configurations:

Metric | 4GPU Aggregated | 4GPU Disaggregated
Scheduler | vLLM + Round-Robin | SplitWise (2P + 2D)
True TTFT mean | 146ms | 383ms
True TTFT p99 | 453ms | 1,158ms
E2E mean | 3.09s | 3.32s
E2E p99 | 8.61s | 9.42s
Preemption | 261ms | 9ms
Throughput | 5,549 tok/s | 5,541 tok/s

Deep Dive: Why Disaggregation's TTFT is 2.6× Worse

True TTFT breakdown:

Component | Aggregated | Disaggregated | Delta | % of gap
Scheduling delay | 16.1ms | 37.5ms | +21.4ms | 9%
Prefill compute | 132.9ms | 200.8ms | +68.0ms | 29%
KV transfer | 0ms | 167.7ms | +167.7ms | 71%
First decode iter | 13.4ms | 14.3ms | +0.9ms | <1%
True TTFT | 146ms | 383ms | +237ms | 100%
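The True TTFT row matches the sum of the prefill, KV transfer, and first-decode rows. The scheduling-delay row is not an additional term in that sum: adding it would overshoot both totals, and the three other shares already add up to 100%. A short check of the arithmetic, using only the values in the table:

```python
# Mean per-component times in ms, copied from the table above
agg = {"prefill": 132.9, "kv_transfer": 0.0, "first_decode": 13.4}
disagg = {"prefill": 200.8, "kv_transfer": 167.7, "first_decode": 14.3}

ttft_agg = sum(agg.values())
ttft_disagg = sum(disagg.values())
gap = ttft_disagg - ttft_agg

# Share of the TTFT gap contributed by each component
shares = {name: 100 * (disagg[name] - agg[name]) / gap for name in agg}

for name, pct in shares.items():
    delta = disagg[name] - agg[name]
    print(f"{name:12s} {delta:+7.1f} ms  {pct:5.1f}% of gap")
print(f"TTFT {ttft_agg:.1f} ms vs {ttft_disagg:.1f} ms, gap {gap:.1f} ms")
```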

Factor 1: KV Cache Transfer is the Dominant Cost (71%)

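KV transfer accounts for 71% of the TTFT gap. The mean transfer is 157 MB and the simulated mean transfer time is 167.7 ms. Simple division by 100 Gbps gives only about 13 ms for 157 MB (compare the 128 MB row of the NS-3 table in Step 4, 10.74 ms), so the reported figure is not single-flow wire time alone. This cost is unavoidable with disaggregation and zero with aggregation.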
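The per-request KV payload follows from the model configuration in the overview table. The sketch below assumes the usual GQA cache layout (K and V stored for each layer and each KV head, head dimension = hidden / query heads) and float16 elements, matching `--replica_config_pd_p2p_comm_dtype float16`. The function names are our own, and the transfer time is wire time only.

```python
LAYERS, KV_HEADS, HIDDEN, Q_HEADS = 32, 8, 4096, 32
HEAD_DIM = HIDDEN // Q_HEADS   # 128
BYTES_PER_ELEM = 2             # float16


def kv_bytes(num_tokens: int) -> int:
    """K and V for every layer: 2 * layers * kv_heads * head_dim * dtype bytes."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * num_tokens


def ideal_transfer_ms(num_bytes: int, gbps: float = 100.0) -> float:
    """Wire time only: no latency, headers, or queuing."""
    return num_bytes * 8 / (gbps * 1e9) * 1e3


print(kv_bytes(1))  # 131072 bytes, i.e. 128 KiB per token
print(f"{ideal_transfer_ms(kv_bytes(1024)):.2f} ms for a 1024-token prompt")
```

A 1024-token prompt therefore carries 128 MiB of KV cache, which is the size used in the NS-3 example in Step 4.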

Factor 2: Prefill Queuing from Halved Capacity (29%)

With 2P+2D, each prefill GPU handles ~1.9 req/s (2× the load). Prefill compute increases from 133ms to 201ms.

Prefill compute by token bucket:

Tokens | Agg (ms) | Disagg (ms) | Slowdown
0–256 | 52 | 119 | 2.3×
257–512 | 79 | 152 | 1.9×
513–1024 | 106 | 180 | 1.7×
1025–2048 | 131 | 198 | 1.5×
2049–4096 | 298 | 356 | 1.2×
Small prefill requests suffer most (2.3×) because queuing delay dominates their latency. Large requests slow down by only 1.2× because GPU compute dominates.

Why Disagg Loses Despite Zero Preemption

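Disaggregation removes 252ms of preemption (261ms → 9ms) but adds 236ms to TTFT (168ms KV transfer + 68ms prefill queuing). The two nearly cancel, leaving a 16ms difference, and mean E2E still ends up higher with disaggregation (3.32s vs 3.09s).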

QPS Sweep: Does Disaggregation Ever Win?

We swept Poisson arrival rate from 2 to 60 req/s with Zipf-distributed token lengths, comparing three configurations.
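A workload of this kind can be generated with the standard library alone. The sketch below draws exponential inter-arrival gaps for a Poisson process and samples token lengths with probability proportional to 1 / rank^theta. The Zipf exponent, the 1–4096 range, and the function names are assumptions for illustration; the sweep's actual generator settings are not stated here.

```python
import random
from itertools import accumulate


def poisson_arrivals(qps, num_requests, rng):
    """Arrival timestamps (s) with exponential inter-arrival gaps."""
    gaps = (rng.expovariate(qps) for _ in range(num_requests))
    return list(accumulate(gaps))


def zipf_lengths(num_requests, rng, min_tokens=1, max_tokens=4096, theta=0.6):
    """Token lengths with P(rank k) proportional to 1 / k**theta."""
    lengths = range(min_tokens, max_tokens + 1)
    weights = [1.0 / (rank ** theta) for rank in range(1, len(lengths) + 1)]
    return rng.choices(lengths, weights=weights, k=num_requests)


rng = random.Random(0)  # fixed seed for repeatability
arrivals = poisson_arrivals(qps=20, num_requests=1000, rng=rng)
prefill = zipf_lengths(1000, rng)
print(f"observed rate: {len(arrivals) / arrivals[-1]:.1f} req/s")
print(f"shortest/longest prompt: {min(prefill)}/{max(prefill)} tokens")
```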

Figure: TTFT breakdown (Scheduling + Prefill + KV Transfer + First Decode). Blue = scheduling, Green = prefill compute, Red = KV transfer, Orange = first decode. Black line = total TTFT.

Figure: E2E latency and preemption across QPS. Left: TTFT stacked bars. Right: E2E (solid) and preemption (triangles).

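QPS | Agg (4 mixed) TTFT | Agg (4 mixed) E2E | Disagg 2P+2D TTFT | Disagg 2P+2D E2E | Disagg 1P+3D TTFT | Disagg 1P+3D E2E
2 | 56ms | 1.6s | 127ms | 1.7s | 133ms | 1.7s
10 | 61ms | 1.9s | 138ms | 2.1s | 170ms | 1.9s
20 | 71ms | 2.7s | 175ms | 3.3s | 306ms | 2.5s
30 | 82ms | 3.3s | 253ms | 5.0s | 957ms | 3.5s
40 | 95ms | 3.7s | 354ms | 6.1s | 2,693ms | 5.3s
60 | 133ms | 4.4s | 798ms | 7.6s | 4,684ms | 7.3s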

Key Observations:
  1. Aggregation wins on TTFT at all QPS levels.
  2. 1P+3D has lower E2E than 2P+2D from QPS 10 upward, but its TTFT explodes from QPS 30 (957ms, then 2,693ms at QPS 40) because of the single prefill GPU bottleneck.
  3. 2P+2D preemption explodes at QPS>20; 1P+3D keeps preemption low (<80ms) with 3 decode GPUs.

Why Disaggregation Never Crosses Over: For LLaMA 3 8B on 4× RTX 5090 at 100 Gbps, disaggregation never beats aggregation on TTFT, and on E2E only 1P+3D at QPS 20 comes out ahead (2.5s vs 2.7s). It may become viable with higher bandwidth (400Gbps+), larger clusters (32+ GPUs), or larger models.

Step 4: NS-3 KV Cache Transfer Integration

SimAI 1.5's Vidur computes PD KV cache transfer time as kv_size / bandwidth. We wired NS-3 into this path:

Step A: Build NS-3 and Generate Topology

# Build SimAI_simulator (NS-3 + astra-sim)
# Requires: GCC 9.4+ (GCC 13 needs CXXFLAGS="-include cstdint")
# Requires: sudo mkdir -p /etc/astra-sim/simulation && sudo chmod 777 /etc/astra-sim/simulation
cd ~/CS8803_DNS/SimAI
CXXFLAGS="-include cstdint" bash scripts/build.sh -c ns3

# Generate 2-GPU topology (100 Gbps, for P2P KV transfer)
python3 astra-sim-alibabacloud/inputs/topo/gen_Topo_Template.py \
  -topo DCN+ -g 2 -gps 2 -gt RTX5090 -bw 100Gbps -nvbw 100Gbps

Step B: Pre-compute NS-3 KV Transfer Latency

KV cache transfer is unidirectional. We use REDUCESCATTER with 2×kv_size on 2 GPUs to model single-direction transfer:

# Example: simulate 128 MB KV cache transfer (unidirectional)
# Use REDUCESCATTER with 2x size: each node sends exactly 128 MB
cat > /tmp/kv_wl.txt << WL
HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 ep: 1 pp: 1 vpp: 1 ga: 1 all_gpus: 2 checkpoints: 0 checkpoint_initiates: 0
1
kv_xfer -1 1 REDUCESCATTER 268435456 1 NONE 0 1 NONE 0 1
WL
# ↑ 268435456 = 2 × 134217728 (2× kv_size for unidirectional P2P)

AS_SEND_LAT=6 AS_NVLS_ENABLE=0 ./bin/SimAI_simulator -t 2 \
  -w /tmp/kv_wl.txt \
  -n ./DCN+SingleToR_2g_2gps_100Gbps_RTX5090 \
  -c ./astra-sim-alibabacloud/inputs/config/SimAI.conf

# Result: 10863.6 µs = 10.86 ms (vs simple 10737.4 µs = +1.2%)

# Automated: build_ns3_kv_lookup.py sweeps 11 sizes (1-512 MB)
# Output: vidur-alibabacloud/ns3_kv_transfer_lookup.json
python3 build_ns3_kv_lookup.py
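The workload file for any KV size can be produced from a small template. The sketch below reproduces the hand-written example above for a given byte count; the function name is our own, and it is not taken from build_ns3_kv_lookup.py, whose internals are not shown here.

```python
def kv_workload_text(kv_bytes: int) -> str:
    """Workload for a one-way transfer of kv_bytes between 2 GPUs.

    Mirrors the hand-written example: REDUCESCATTER is issued with
    2 x kv_bytes so that each of the two nodes sends kv_bytes.
    """
    header = ("HYBRID_TRANSFORMER_FWD_IN_BCKWD model_parallel_NPU_group: 2 "
              "ep: 1 pp: 1 vpp: 1 ga: 1 all_gpus: 2 "
              "checkpoints: 0 checkpoint_initiates: 0")
    layer = f"kv_xfer -1 1 REDUCESCATTER {2 * kv_bytes} 1 NONE 0 1 NONE 0 1"
    return "\n".join([header, "1", layer]) + "\n"


MIB = 1024 * 1024
text = kv_workload_text(128 * MIB)
print(text, end="")
```

For 128 MiB this emits the same three lines as the heredoc, including the 268435456-byte collective size.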

Step C: Patch Vidur to Use NS-3 Lookup

File | Change
vidur/events/batch_end_event.py | Added _ns3_kv_transfer_time(). Replaced size / bandwidth with NS-3 lookup.
ns3_kv_transfer_lookup.json | New file: 11 entries mapping KV bytes → NS-3 latency (µs).
# Original code (vidur/events/batch_end_event.py:136):
request.pd_p2p_comm_time = request.pd_p2p_comm_size / request.pd_p2p_comm_bandwidth

# Patched code:
ns3_t = _ns3_kv_transfer_time(request.pd_p2p_comm_size)
if ns3_t is not None:
    request.pd_p2p_comm_time = ns3_t
else:
    request.pd_p2p_comm_time = request.pd_p2p_comm_size / request.pd_p2p_comm_bandwidth
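The body of `_ns3_kv_transfer_time()` is not shown above. One plausible shape for such a helper is a sorted table with linear interpolation between the pre-computed sizes, returning None outside the table so the caller falls back to size / bandwidth. The sketch below is hypothetical: the JSON layout (byte count as key, latency in µs as value), the function names, and the interpolation choice are all assumptions, and any unit conversion needed by `pd_p2p_comm_time` is left out.

```python
import bisect
import json


def load_lookup(path):
    """Load {bytes: latency_us} pairs and return them sorted by size."""
    with open(path) as f:
        raw = json.load(f)
    return sorted((int(size), float(lat)) for size, lat in raw.items())


def interpolate_latency_us(table, num_bytes):
    """Linearly interpolate latency; None outside the table so the
    caller can fall back to size / bandwidth."""
    if not table or num_bytes < table[0][0] or num_bytes > table[-1][0]:
        return None
    sizes = [size for size, _ in table]
    i = bisect.bisect_left(sizes, num_bytes)
    if sizes[i] == num_bytes:
        return table[i][1]
    (s0, l0), (s1, l1) = table[i - 1], table[i]
    return l0 + (l1 - l0) * (num_bytes - s0) / (s1 - s0)


MIB = 1024 * 1024
# Two entries taken from the 128 MB and 256 MB NS-3 results in this step (µs)
table = [(128 * MIB, 10860.0), (256 * MIB, 21720.0)]
print(interpolate_latency_us(table, 192 * MIB))  # between the two entries
print(interpolate_latency_us(table, 1 * MIB))    # outside the table
```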

NS-3 vs Simple Bandwidth Model

KV Size | Simple (ms) | NS-3 (ms) | Overhead
1 MB | 0.08 | 0.09 | +9.3%
16 MB | 1.34 | 1.36 | +1.6%
64 MB | 5.37 | 5.44 | +1.2%
128 MB | 10.74 | 10.86 | +1.2%
256 MB | 21.47 | 21.72 | +1.1%
512 MB | 42.95 | 43.43 | +1.1%
NS-3 adds ~1–9% overhead in the unidirectional P2P simulation. For transfers of 64 MB and larger, NS-3 is within about 1.2% of simple division.
Contribution: This is the first integration of NS-3 into SimAI's PD disagg KV transfer path.

Step 5: Three-Way Comparison

Azure conversation trace, 500 requests, real arrival times, 4 GPUs, 100 Gbps network.

Metric | Aggregated | Disagg (Simple) | Disagg (NS-3)
True TTFT mean | 146ms | 383ms | 397ms
True TTFT p99 | 453ms | 1,158ms | 1,205ms
E2E mean | 3.09s | 3.32s | 3.32s
E2E p99 | 8.61s | 9.42s | 9.42s
Preemption | 261ms | 9ms | 9ms
KV transfer mean | N/A | 168ms | 182ms (+8.6%)
KV transfer p99 | N/A | 555ms | 603ms (+8.6%)
Throughput | 5,549 tok/s | 5,541 tok/s | 5,542 tok/s
Interpretation: True TTFT = prefill + KV transfer + first decode. Switching to the NS-3 lookup raises mean KV transfer time by 14ms (168ms → 182ms), which carries through to mean TTFT (383ms → 397ms).

Azure Trace Workload Analysis

Full statistical analysis of the Azure LLM Inference Trace (conversation) dataset. 19,366 requests over 58.4 minutes.

Summary Statistics

Metric | Value
Total Requests | 19,366
Duration | 3501.7s (58.4 min)
Avg Request Rate | 5.53 req/s
Peak Request Rate (10s bin) | 9.80 req/s
Mean Inter-arrival | 180.8 ms
Median Inter-arrival | 118.4 ms
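These summary metrics can be derived from the arrival timestamps alone. The sketch below shows one way to compute them, using a toy list of timestamps rather than the Azure trace; the function name and the binning details are our own assumptions about how such statistics are typically produced.

```python
from collections import Counter
from statistics import mean, median


def arrival_stats(timestamps_s, bin_s=10.0):
    """Average rate, peak binned rate, and inter-arrival stats (ms)."""
    ts = sorted(timestamps_s)
    duration = ts[-1] - ts[0]
    gaps_ms = [(b - a) * 1e3 for a, b in zip(ts, ts[1:])]
    bins = Counter(int((t - ts[0]) // bin_s) for t in ts)
    return {
        "avg_rate": len(ts) / duration,
        "peak_rate": max(bins.values()) / bin_s,
        "mean_gap_ms": mean(gaps_ms),
        "median_gap_ms": median(gaps_ms),
    }


# Toy input: 5 req/s for 20 s plus a burst of 10 extra requests in the last second
ts = [i * 0.2 for i in range(100)] + [19.0 + i * 0.1 for i in range(10)]
stats = arrival_stats(ts)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

A median gap well below the mean gap, as in the table above (118.4 ms vs 180.8 ms), is the signature of bursty arrivals.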

Token Distribution Statistics

Metric | Prefill (Context) | Decode (Generated)
Mean | 1154.7 | 211.1
Median | 1020 | 129
P95 | 4083 | 451
P99 | 4142 | 601
Max | 14,050 | 1,000

Request Arrival Rate

Figure: Request rate over time. Request arrival rate in 10s bins. Mean: 5.53 req/s, Peak: 9.8 req/s.

Token Count Distribution

Figure: Token distribution histograms. Prefill median: 1020, Decode median: 129. Both distributions are right-skewed.

Token Count CDF (Log Scale)

Figure: Token CDF. ~50% of prefill ≤1,020 tokens; ~95% of decode ≤451 tokens.

Prefill vs Decode Scatter

Figure: Prefill vs Decode scatter. No strong correlation between prefill and decode lengths, which supports independent PD scheduling.

Inter-arrival Time Distribution

Figure: Inter-arrival time. Median inter-arrival: 118.4 ms. The arrival pattern is bursty.

Key Observations
  • Avg 5.53 req/s, peak 9.8 req/s: the scheduler must handle ~1.8× the average rate during bursts.
  • Prefill 5.5× larger than decode on avg (1,155 vs 211) — prefill dominates GPU compute.
  • Weak prefill-decode correlation supports independent PD scheduling.
  • Heavy tail: P99 prefill 4142, max 14,050 — may cause HOL blocking without preemption.

Key Findings

1. RTX 5090 profiling matches roofline theory

Memory-bound flat region at 1–64 tokens, linear at 128+. Crossover earlier than A100.

2. PD disaggregation needs sufficient load

At ~3.8 req/s on 4 GPUs, aggregation outperforms disaggregation on all latency metrics.

3. NS-3 adds ~1–9% over the simple bandwidth model

For single-flow unidirectional KV transfer, NS-3 shows ~1–9% overhead (about 1% for large transfers, 9% at 1 MB). On the Azure trace, mean KV transfer time rises by 8.6% (168ms → 182ms).

4. SimAI 1.5 gap: PD KV transfer not routed through NS-3

NS-3 only covers TP AllReduce; PD P2P was hardcoded. GitHub issues #210, #259 confirm this gap. Our patch demonstrates the integration.