H100 nsys 2025.3.2 SimpleCPUOffload PR #37160 follow-up 2026-04-27

Does Bulk H2D Really Overlap with Decode?

An H100 nsys deep dive validating (and partially refuting) the SimpleCPUOffloadConnector's multi-stream design. Source-code intent: dedicated load_stream/store_stream let DMA run alongside compute. Empirical reality: 135 CUDA streams exist, but every kernel and every H2D copy attributes to stream 7. GPU metrics show DMA and FlashAttn never co-active.

vllm/v1/simple_kv_offload/{worker,copy_backend,cuda_mem_ops}.py
nsys 2025.3.2 GPU_METRICS @ 100Hz, CUPTI_ACTIVITY_KIND_{KERNEL,MEMCPY}

TL;DR Three findings, in order of confidence

1. Why we profiled

In production runs of the SimpleCPUOffloadConnector with FCFS scheduling on H100, decode-only steps slow down 2.8× (p50) and roughly 6× (p99) when bulk H2D + D2H DMA fires in the same step, versus when no DMA is active. The connector deep dive explains the design: multi-stream, async, with the ~5 ms launch hidden under compute. The numbers say the design is not delivering. Either the design is wrong, or the design isn't running as written.

The slowdown numbers from sched_trace

Composition   DMA class      p50 ms   p90 ms   p99 ms
decode-only   no DMA           15.0     18.5     22.8
decode-only   load only        24.1     51.0     88.0
decode-only   load + store     41.9     95.4    137.9
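
The slowdown factors quoted above follow directly from this table. A minimal sketch of the arithmetic, with the table values copied by hand (the dictionary layout and the `slowdown` helper are illustrative and not part of sched_trace):

```python
# Step latencies in ms, copied by hand from the sched_trace table.
LATENCY_MS = {
    "no DMA":       {"p50": 15.0, "p90": 18.5, "p99": 22.8},
    "load only":    {"p50": 24.1, "p90": 51.0, "p99": 88.0},
    "load + store": {"p50": 41.9, "p90": 95.4, "p99": 137.9},
}


def slowdown(dma_class: str, percentile: str) -> float:
    """Ratio of a DMA class's step latency to the no-DMA baseline."""
    baseline = LATENCY_MS["no DMA"][percentile]
    return LATENCY_MS[dma_class][percentile] / baseline


for dma_class in ("load only", "load + store"):
    ratios = {p: round(slowdown(dma_class, p), 2) for p in ("p50", "p90", "p99")}
    print(dma_class, ratios)
```

For the load + store class this gives about 2.8× at p50 and about 6× at p99; the tail degrades more than the median.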

Initial hypothesis: H100 has 3 TB/s HBM. Bulk H2D writes pinned-host → HBM at PCIe Gen5 speeds. FlashAttn reads KV from HBM at compute speeds. If both compete for the same memory subsystem at the same time, attention runs at degraded effective bandwidth → kernel duration grows → step elongates. This is the classic memory-contention story.

Profiling with nsys lets us check it directly: GPU SM/Tensor/DRAM utilization counters at 100 Hz, plus per-event CUPTI traces with stream attribution.

2. What the GPU was actually doing

The H100 trace contains 2.55 M GPU metric samples at 100 Hz across 850 s. Bucketing 68 k samples in the steady-state window (170 s … 850 s) by which activity is in flight at the sample timestamp:

  • 68k GPU metric samples analyzed
  • 3 activity classes (mutually exclusive)
  • 0 samples with H2D + FA both active
  • 100 Hz sampling rate

DRAM Read Bandwidth, SM Active, Tensor Active per regime (p50, %)

Regime       n        DRAM Read BW   SM Active   Tensor Active
IDLE         32 098   26.0           57.0        39.0
FLASH_ATTN    4 469   30.0           93.0        75.0
BULK_H2D     31 478    1.0            0.0         0.0

Metrics: DRAM Read Bandwidth = % of peak HBM read bandwidth; SM Active = % of cycles with at least one warp running; Tensor Active = % of cycles with tensor cores executing.
Regimes: IDLE = no FA and no bulk H2D in flight, other kernels (GEMM, layernorm) may be running; FLASH_ATTN = FlashAttnFwdSm90 in flight; BULK_H2D = H2D memcpy >10 MB in flight.

The contention hypothesis predicts an overlap regime. There is none.
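
The bucketing behind these numbers can be sketched as an interval lookup: each metric sample is labeled by whether its timestamp falls inside a FlashAttn kernel interval, a bulk H2D interval, both, or neither. This is a minimal sketch on synthetic timestamps, not the actual /tmp/dram_bw_during_dma.py script; the function names and the explicit OVERLAP bucket are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter


def build_lookup(intervals):
    """Merge (start, end) intervals; return parallel sorted start/end lists."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [m[0] for m in merged], [m[1] for m in merged]


def is_active(lookup, t):
    """True if timestamp t lies inside any merged interval [start, end)."""
    starts, ends = lookup
    i = bisect_right(starts, t) - 1
    return i >= 0 and t < ends[i]


def classify(t, fa_lookup, h2d_lookup):
    """Label one sample timestamp by which activity is in flight."""
    fa, h2d = is_active(fa_lookup, t), is_active(h2d_lookup, t)
    if fa and h2d:
        return "OVERLAP"
    if fa:
        return "FLASH_ATTN"
    if h2d:
        return "BULK_H2D"
    return "IDLE"


# Synthetic example: two attention bursts with one bulk copy between them.
fa_lookup = build_lookup([(0, 10), (30, 40)])
h2d_lookup = build_lookup([(12, 28)])
counts = Counter(classify(t, fa_lookup, h2d_lookup) for t in range(0, 40, 2))
print(dict(counts))
```

Keeping OVERLAP as its own bucket is what makes an empty overlap regime an observation rather than an artifact of the labeling order.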

3. The stream handle's full code path

If H2D really runs on a dedicated stream, GPU metrics should sometimes show H2D and FA co-active (a copy in flight on one stream while kernels run on another). Before drawing conclusions from the metrics, let's check what the code actually does. The path from torch.cuda.Stream to the CUDA driver call is short, spanning three files and ~10 lines:

From SimpleCPUOffloadWorker.__init__ to cuMemcpyBatchAsync

1. vllm/v1/simple_kv_offload/worker.py:189-192: two dedicated streams created

    low_pri, _ = torch.cuda.Stream.priority_range()
    self.load_stream = torch.cuda.Stream(priority=low_pri)
    self.store_stream = torch.cuda.Stream(priority=low_pri)

2. copy_backend.py:46-48: streams flow into params builders

    self._store_params = build_params(gpu, cpu, store_stream)
    self._load_params = build_params(cpu, gpu, load_stream)

   A daemon thread polls a queue and dispatches copies.

3. cuda_mem_ops.py:113: stream handle captured into BatchMemcpyParams

    stream_handle=stream.cuda_stream  # raw cudaStream_t / CUstream

   The C-level stream pointer is stored in the params struct.

4. copy_backend.py:97-101: events recorded on the dedicated stream

    start_event.record(stream)
    copy_blocks(src_blocks, dst_blocks, params)
    end_event.record(stream)

5. cuda_mem_ops.py:139-149: cuMemcpyBatchAsync invoked with the stream handle

    err = _batch_memcpy_fn(dst, src, sizes, n,
      ctypes.addressof(params.attrs), ..., params.stream_handle)

   The driver-level call gets the dedicated stream, not the default.

Code-level conclusion: the multi-stream design is real.
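
The hand-off in steps 2 to 5 can be illustrated without a GPU. The sketch below is a toy model, not the connector's code: `Params`, `fake_copy_blocks`, `copy_loop`, and the handle values are hypothetical stand-ins. It shows only the structural point that the stream handle travels inside the per-job params object, so the daemon thread that dispatches the copy does not choose the stream.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    stream_handle: int  # stands in for the raw CUstream pointer


def fake_copy_blocks(src_blocks, dst_blocks, params, log):
    # Stand-in for the driver call: record which handle the copy was given.
    log.append((params.stream_handle, list(zip(src_blocks, dst_blocks))))


def copy_loop(jobs, log):
    # Worker body: block on the queue, dispatch each job, stop on None.
    while True:
        job = jobs.get()
        if job is None:
            break
        src_blocks, dst_blocks, params = job
        fake_copy_blocks(src_blocks, dst_blocks, params, log)


jobs, log = queue.Queue(), []
worker = threading.Thread(target=copy_loop, args=(jobs, log), daemon=True)
worker.start()

load_params = Params(stream_handle=0x1000)   # hypothetical handle values
store_params = Params(stream_handle=0x2000)
jobs.put(([0, 1], [7, 8], load_params))      # a "load" job
jobs.put(([7], [3], store_params))           # a "store" job
jobs.put(None)                               # sentinel: shut the worker down
worker.join()

for handle, pairs in log:
    print(hex(handle), pairs)
```

In this model the handle seen at the copy call is whatever was captured when the params were built. Whether the real driver call honors it from a non-main thread is a separate question that this sketch cannot answer.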

4. The stream attribution paradox

nsys's TARGET_INFO_CUDA_STREAM table confirms the CUDA context created 135 streams at runtime, including the priority-0 NON_BLOCKING streams that should hold load_stream/store_stream. But when we look at which streams actually carry traffic:

streamId                         type / priority         kernels     memcpys             role
7                                NULL / prio=0           1,169,146   119,394 (16.5 GB)   all compute + all H2D
17                               NON_BLOCKING / prio=0   0           2,881 (~0 GB)       small D2H metadata only
6, 8-16, 18-140 (133 streams)    various                 0           0                   created but completely idle

[Figure: Stream usage map. Every CUDA stream in the trace, colored by traffic: all kernel + bulk H2D traffic (stream 7); small D2H metadata only (stream 17); created but zero events (133 streams).]

The 73 H2D events ≥ 100 MB totaling 9.5 GB in steady-state — exactly what we'd expect connector bulk loads to look like — all live on stream 7. The dedicated load_stream/store_stream the source code creates are not visibly active.
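
A per-stream tally of this kind can be produced with one SQL aggregation over the exported activity tables. The sketch below runs against a tiny synthetic in-memory database rather than the real export. The table names come from the trace configuration above, but the `streamId` and `bytes` columns and the query shape are assumptions about the nsys SQLite schema, not a copy of /tmp/verify_streams.py.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Synthetic stand-ins for the exported tables; real exports have more columns.
conn.executescript("""
CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (streamId INTEGER);
CREATE TABLE CUPTI_ACTIVITY_KIND_MEMCPY (streamId INTEGER, bytes INTEGER);
INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (7), (7), (7);
INSERT INTO CUPTI_ACTIVITY_KIND_MEMCPY VALUES (7, 100), (7, 200), (17, 4);
""")

PER_STREAM_SQL = """
SELECT streamId,
       SUM(kernels) AS kernels,
       SUM(memcpys) AS memcpys,
       SUM(bytes)   AS bytes
FROM (
    SELECT streamId, 1 AS kernels, 0 AS memcpys, 0 AS bytes
    FROM CUPTI_ACTIVITY_KIND_KERNEL
    UNION ALL
    SELECT streamId, 0, 1, bytes
    FROM CUPTI_ACTIVITY_KIND_MEMCPY
)
GROUP BY streamId
ORDER BY streamId
"""

rows = conn.execute(PER_STREAM_SQL).fetchall()
for stream_id, kernels, memcpys, total_bytes in rows:
    print(stream_id, kernels, memcpys, total_bytes)
```

Streams that appear in TARGET_INFO_CUDA_STREAM but never in this result are the "created but idle" ones.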

Three explanations, ranked by likelihood

(A) Most likely: nsys CUPTI doesn't attribute cuMemcpyBatchAsync correctly

(B) Plausible but weaker: the daemon thread loses CUDA context association

(C) Unlikely: GPU metric sampling perturbs CUPTI

Important: nsys SQL ≠ ground truth for stream concurrency.

5. Reconciling the two pieces of evidence

There is one critical piece of source code that makes everything consistent. From worker.py:213-217:

def start_load_kv(self) -> None:
    # NOTE: we defer launching both load and store to get_finished(),
    # which runs after model execution. This hides the CPU-side
    # block copy op overhead (~5ms) behind GPU compute.
    pass

The actual per-step pipeline (with deferred launch)

[Timeline: per-step ordering, current code (worker.py:213-217 deferred launch)]
  CPU:             sched.schedule() → get_finished() → launch_copy → sched.schedule()
  Compute stream:  model.forward() (FA, GEMM, etc.) → model.forward()
  load_stream:     cuMemcpyBatchAsync (H2D)
  SM hardware:     SM Active = 93% → idle, SM = 0% → SM Active = 93%

Forward finishes → CPU calls get_finished → DMA enqueued → DMA runs alone → next forward enqueued. The dedicated load_stream is technically separate, but compute had nothing pending while DMA ran. SM idles, GPU metrics show 0%.

What we'd see if launch were moved before forward

[Timeline: hypothetical, launch_copy() moved to start_load_kv() (pre-forward)]
  CPU:             sched.schedule() → start_load_kv → launch_copy → sched.schedule()
  Compute stream:  model.forward() → model.forward()
  load_stream:     H2D running concurrently
  SM hardware:     SM busy throughout, DMA hidden → SM busy

If H2D is enqueued on load_stream BEFORE forward, the GPU's copy engine and compute SMs run in parallel. The dedicated stream's purpose is fulfilled, the per-step gap disappears, and the 2.8× decode slowdown should collapse.
The contention model was the wrong frame; the pipelining model fits all observations.
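
The difference between the two orderings reduces to a two-line timing model. The sketch below is a toy under stated assumptions: the copy engine and the SMs overlap perfectly when both have work queued, there is no bandwidth contention, and the input durations are hypothetical rather than measured.

```python
def step_time_deferred(forward_ms: float, dma_ms: float) -> float:
    # Current ordering: DMA is enqueued only after forward completes,
    # so compute and copy serialize within the step.
    return forward_ms + dma_ms


def step_time_prelaunch(forward_ms: float, dma_ms: float) -> float:
    # Hypothetical ordering: DMA is enqueued before forward. With ideal
    # overlap on separate engines, the longer of the two sets the step time.
    return max(forward_ms, dma_ms)


# Hypothetical durations in ms, chosen only to illustrate the two regimes.
for forward_ms, dma_ms in [(15.0, 10.0), (15.0, 25.0)]:
    deferred = step_time_deferred(forward_ms, dma_ms)
    prelaunch = step_time_prelaunch(forward_ms, dma_ms)
    print(f"forward={forward_ms} dma={dma_ms} "
          f"deferred={deferred} prelaunch={prelaunch}")
```

Under this model the copy is fully hidden only when it is shorter than the forward pass. When it is longer, the step becomes DMA-bound and pre-launching recovers at most the forward duration.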

6. Fix direction & ranking

× Wrong directions to chase
What to actually try
  • Move launch_copy to start_load_kv

Validation plan for the proposed fix

7. Open questions

Q1.
Q2.
Q3.

8. Reproduction

Trace

# Source nsys-rep on PACE
/storage/scratch1/1/hlin464/swe_smith/results/nsys_swe_h100_jps10_dram64_fcfs_cpuoff/nsys_h100.nsys-rep

# SQLite export on lab (366 MB)
/tmp/nsys_analyze/nsys_h100.sqlite

# Bench: Llama-3.1-8B-Instruct, 100 jobs, jobs_100_pinjab.json
# Connector: SimpleCPUOffloadConnector, CPU pool = 64 GB
# Scheduling: FCFS

Analysis scripts

Script                        Output
/tmp/dram_bw_during_dma.py    DRAM Read BW%, SM Active%, Tensor Active% bucketed by (BULK_H2D, FLASH_ATTN) co-activity
/tmp/verify_streams.py        Per-stream kernel + memcpy event counts; cross-validate with TARGET_INFO_CUDA_STREAM
/tmp/verify_streams2.py       Steady-state per-stream memcpy size buckets (filter cold-start init copies)
/tmp/verify_streams3.py       Cross-reference all 135 streams with their priority and traffic
/tmp/analyze_flashattn.py     FlashAttn kernel duration with/without H2D temporal overlap

Source code

# PACE editable install
/storage/project/r-rs275-0/hlin464/vllm-agent-kvcache/vllm/v1/simple_kv_offload/
├── worker.py          # SimpleCPUOffloadWorker; lines 189-217 critical
├── copy_backend.py    # DmaCopyBackend, _copy_loop daemon thread
├── cuda_mem_ops.py    # cuMemcpyBatchAsync invocation, line 148
└── manager.py         # scheduler-side admission logic

# Findings doc on PACE
/storage/project/r-rs275-0/hlin464/agent-kvcache/nsys_h100_dma_findings.md