An H100 nsys deep dive validating (and partially refuting) the SimpleCPUOffloadConnector's multi-stream design. Source-code intent: dedicated load_stream/store_stream let DMA run alongside compute. Empirical reality: 135 CUDA streams exist, but every kernel and every H2D copy is attributed to stream 7. GPU metrics show that DMA and FlashAttn are never co-active.
In production runs of the SimpleCPUOffloadConnector with FCFS scheduling on H100, decode-only steps slow down 2.8× (p50) and 6× (p99) when bulk H2D + D2H DMA fires concurrently versus when no DMA is active. The connector deep dive explains the design — multi-stream, async, ~5ms launch hidden under compute. The numbers say the design is not delivering. Either the design is wrong, or the design isn't running as written.
| Composition | DMA class | p50 ms | p90 ms | p99 ms |
|---|---|---|---|---|
| decode-only | no DMA | 15.0 | 18.5 | 22.8 |
| decode-only | load only | 24.1 | 51.0 | 88.0 |
| decode-only | load + store | 41.9 | 95.4 | 137.9 |
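The 2.8× and 6× figures follow directly from the table. A minimal sketch of that arithmetic, with the percentiles copied by hand from the rows above (the dictionary layout and the `slowdown` helper are illustrative names, not part of the benchmark tooling):

```python
# Step-time percentiles in ms, copied from the table above.
step_ms = {
    "no DMA":       {"p50": 15.0, "p90": 18.5, "p99": 22.8},
    "load only":    {"p50": 24.1, "p90": 51.0, "p99": 88.0},
    "load + store": {"p50": 41.9, "p90": 95.4, "p99": 137.9},
}


def slowdown(dma_class: str, pct: str) -> float:
    """Ratio of a DMA class's step time to the no-DMA baseline at one percentile."""
    return step_ms[dma_class][pct] / step_ms["no DMA"][pct]


if __name__ == "__main__":
    for dma_class in ("load only", "load + store"):
        ratios = {p: round(slowdown(dma_class, p), 2) for p in ("p50", "p90", "p99")}
        print(dma_class, ratios)
```

Note that the ratio grows toward the tail in both DMA classes, so the penalty is not a constant per-step overhead.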
Initial hypothesis: H100 has 3 TB/s HBM. Bulk H2D writes pinned-host → HBM at PCIe Gen5 speeds. FlashAttn reads KV from HBM at compute speeds. If both compete for the same memory subsystem at the same time, attention runs at degraded effective bandwidth → kernel duration grows → step elongates. This is the classic memory-contention story.
Profiling with nsys lets us check it directly: GPU SM/Tensor/DRAM utilization counters at 100 Hz, plus per-event CUPTI traces with stream attribution.
The H100 trace contains 2.55 M GPU metric samples at 100 Hz across 850 s. Bucketing 68 k samples in the steady-state window (170 s … 850 s) by which activity is in flight at the sample timestamp:
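The bucketing step can be sketched as follows. This is a minimal illustration that assumes the bulk-H2D and FlashAttn events have already been extracted as `(start, end)` intervals in the same time base as the metric samples; the function names and toy timestamps are hypothetical and this is not the actual analysis script.

```python
from bisect import bisect_right
from collections import Counter


def make_active_fn(intervals):
    """Return f(t) that is True when t lies inside any [start, end) interval."""
    # Merge overlapping intervals so one bisect lookup per sample is enough.
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [s for s, _ in merged]

    def active(t):
        i = bisect_right(starts, t) - 1
        return i >= 0 and t < merged[i][1]

    return active


def bucket_samples(sample_ts, h2d_intervals, fa_intervals):
    """Count samples per (bulk_h2d_active, flash_attn_active) bucket."""
    h2d_active = make_active_fn(h2d_intervals)
    fa_active = make_active_fn(fa_intervals)
    counts = Counter()
    for t in sample_ts:
        counts[(h2d_active(t), fa_active(t))] += 1
    return counts


if __name__ == "__main__":
    # Toy data: five samples, one H2D copy, one FlashAttn kernel.
    counts = bucket_samples([0, 10, 20, 30, 40], [(5, 15)], [(18, 35)])
    for bucket, n in sorted(counts.items()):
        print(bucket, n)
```

In a real run, each bucket would also accumulate the DRAM, SM, and Tensor utilization values of its samples; a `(True, True)` bucket that stays empty is what "never co-active" means here.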
If H2D really runs on a dedicated stream, GPU metrics should sometimes show H2D and FA co-active (concurrent kernels on different streams). Before drawing conclusions from the metrics, let's check what the code actually does. The path from torch.cuda.Stream to the CUDA driver call is short — four files, ~10 lines:
```python
low_pri, _ = torch.cuda.Stream.priority_range()
self.load_stream = torch.cuda.Stream(priority=low_pri)
self.store_stream = torch.cuda.Stream(priority=low_pri)

self._store_params = build_params(gpu, cpu, store_stream)
self._load_params = build_params(cpu, gpu, load_stream)

stream_handle=stream.cuda_stream  # raw cudaStream_t / CUstream

start_event.record(stream)
copy_blocks(src_blocks, dst_blocks, params)
end_event.record(stream)

err = _batch_memcpy_fn(dst, src, sizes, n, ctypes.addressof(params.attrs), ..., params.stream_handle)
```

nsys's TARGET_INFO_CUDA_STREAM table confirms the CUDA context created 135 streams at runtime, including the priority-0 NON_BLOCKING streams that should hold load_stream/store_stream. But when we look at which streams actually carry traffic:
| streamId | type / priority | kernels | memcpys | role |
|---|---|---|---|---|
| 7 | NULL / prio=0 | 1,169,146 | 119,394 (16.5 GB) | all compute + all H2D |
| 17 | NON_BLOCKING / prio=0 | 0 | 2,881 (~0 GB) | small D2H metadata only |
| 6, 8-16, 18-140 (133 streams) | various | 0 | 0 | created but completely idle |
The 73 H2D events ≥ 100 MB in steady state, totaling 9.5 GB, are exactly what we'd expect connector bulk loads to look like, and all of them live on stream 7. The dedicated load_stream/store_stream that the source code creates carry no visible traffic.
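A per-stream count like the one in the table can be reproduced from the SQLite export with a single query. The sketch below assumes the export exposes `CUPTI_ACTIVITY_KIND_KERNEL` and `CUPTI_ACTIVITY_KIND_MEMCPY` tables with `streamId` and `bytes` columns; check the schema of your own export before relying on it. The `toy_db` helper builds a hypothetical in-memory stand-in so the query can be run without the real trace. This is not the content of the verify_streams scripts.

```python
import sqlite3

# Assumed table/column names from an nsys SQLite export; verify against your file.
PER_STREAM_SQL = """
SELECT streamId,
       SUM(kernels) AS kernels,
       SUM(memcpys) AS memcpys,
       SUM(bytes)   AS memcpy_bytes
FROM (
    SELECT streamId, 1 AS kernels, 0 AS memcpys, 0 AS bytes
    FROM CUPTI_ACTIVITY_KIND_KERNEL
    UNION ALL
    SELECT streamId, 0, 1, bytes
    FROM CUPTI_ACTIVITY_KIND_MEMCPY
)
GROUP BY streamId
ORDER BY streamId
"""


def per_stream_traffic(conn):
    """Return (streamId, kernel count, memcpy count, memcpy bytes) rows."""
    return conn.execute(PER_STREAM_SQL).fetchall()


def toy_db():
    """Hypothetical miniature database with the two assumed tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (streamId INTEGER)")
    conn.execute(
        "CREATE TABLE CUPTI_ACTIVITY_KIND_MEMCPY (streamId INTEGER, bytes INTEGER)"
    )
    conn.executemany(
        "INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (?)", [(7,), (7,), (7,)]
    )
    conn.executemany(
        "INSERT INTO CUPTI_ACTIVITY_KIND_MEMCPY VALUES (?, ?)",
        [(7, 100_000_000), (17, 64)],
    )
    return conn


if __name__ == "__main__":
    for row in per_stream_traffic(toy_db()):
        print(row)
```

Against the real export, replace `toy_db()` with `sqlite3.connect("/tmp/nsys_analyze/nsys_h100.sqlite")`. Streams that appear in TARGET_INFO_CUDA_STREAM but not in this result carried no kernels or memcpys at all.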
There is one critical piece of source code that makes everything consistent. From worker.py:213-217:
```python
def start_load_kv(self) -> None:
    # NOTE: we defer launching both load and store to get_finished(),
    # which runs after model execution. This hides the CPU-side
    # block copy op overhead (~5ms) behind GPU compute.
    pass
```
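The consequence of that comment is an ordering constraint: no copy can be issued until the step's model execution has been submitted. A toy sketch of the call order, using a hypothetical `DeferredLaunchWorker` class that is not the connector's real implementation and that models only CPU-side sequencing, not GPU behavior:

```python
class DeferredLaunchWorker:
    """Hypothetical toy that mimics the deferred-launch call order only."""

    def __init__(self):
        self.pending = []    # copy jobs queued by the scheduler side
        self.timeline = []   # order in which work is issued from the CPU

    def queue_load(self, job: str) -> None:
        self.pending.append(job)

    def start_load_kv(self) -> None:
        # Intentionally a no-op, as in the snippet above.
        pass

    def execute_model(self) -> None:
        self.timeline.append("model_forward")

    def get_finished(self) -> None:
        # Copies are issued only here, after model execution was submitted.
        for job in self.pending:
            self.timeline.append(f"launch:{job}")
        self.pending.clear()


if __name__ == "__main__":
    w = DeferredLaunchWorker()
    w.queue_load("load-A")
    w.start_load_kv()
    w.execute_model()
    w.get_finished()
    print(w.timeline)
```

In this toy, every copy launch lands in the timeline after the forward pass of the same step, regardless of how many streams exist.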
```
# Source nsys-rep on PACE
/storage/scratch1/1/hlin464/swe_smith/results/nsys_swe_h100_jps10_dram64_fcfs_cpuoff/nsys_h100.nsys-rep
# SQLite export on lab (366 MB)
/tmp/nsys_analyze/nsys_h100.sqlite
# Bench: Llama-3.1-8B-Instruct, 100 jobs, jobs_100_pinjab.json
# Connector: SimpleCPUOffloadConnector, CPU pool = 64 GB
# Scheduling: FCFS
```
| Script | Output |
|---|---|
| /tmp/dram_bw_during_dma.py | DRAM Read BW%, SM Active%, Tensor Active% bucketed by (BULK_H2D, FLASH_ATTN) co-activity |
| /tmp/verify_streams.py | Per-stream kernel + memcpy event counts; cross-validate with TARGET_INFO_CUDA_STREAM |
| /tmp/verify_streams2.py | Steady-state per-stream memcpy size buckets (filter cold-start init copies) |
| /tmp/verify_streams3.py | Cross-reference all 135 streams with their priority and traffic |
| /tmp/analyze_flashattn.py | FlashAttn kernel duration with/without H2D temporal overlap |
```
# PACE editable install
/storage/project/r-rs275-0/hlin464/vllm-agent-kvcache/vllm/v1/simple_kv_offload/
├── worker.py        # SimpleCPUOffloadWorker; lines 189-217 critical
├── copy_backend.py  # DmaCopyBackend, _copy_loop daemon thread
├── cuda_mem_ops.py  # cuMemcpyBatchAsync invocation, line 148
└── manager.py       # scheduler-side admission logic
# Findings doc on PACE
/storage/project/r-rs275-0/hlin464/agent-kvcache/nsys_h100_dma_findings.md
```