vLLM SimpleCPUOffload — H100 nsys DMA Overlap Analysis

Setup · Why this analysis

1. Why we profiled

In production runs of the SimpleCPUOffloadConnector with FCFS scheduling on H100, decode-only steps slow down 2.8× (p50) and 6× (p99) when bulk H2D + D2H DMA fires concurrently versus when no DMA is active. The connector deep dive explains the design — multi-stream, async, ~5ms launch hidden under compute. The numbers say the design is not delivering. Either the design is wrong, or the design isn't running as written.

The slowdown numbers from sched_trace

Composition	DMA class	p50 ms	p90 ms	p99 ms
decode-only	no DMA	15.0	18.5	22.8
decode-only	load only	24.1	51.0	88.0
decode-only	load + store	41.9	95.4	137.9
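
The headline ratios quoted in the opening paragraph can be recomputed directly from this table. A minimal sketch with the table values hard-coded; the dictionary and function names are illustrative and are not part of sched_trace:

```python
# Step-time percentiles (ms) copied from the sched_trace table above.
STEP_MS = {
    "no DMA":       {"p50": 15.0, "p90": 18.5, "p99": 22.8},
    "load only":    {"p50": 24.1, "p90": 51.0, "p99": 88.0},
    "load + store": {"p50": 41.9, "p90": 95.4, "p99": 137.9},
}


def slowdown(dma_class: str, pct: str) -> float:
    """Ratio of a DMA class's step time to the no-DMA baseline at one percentile."""
    return STEP_MS[dma_class][pct] / STEP_MS["no DMA"][pct]


if __name__ == "__main__":
    for dma_class in ("load only", "load + store"):
        ratios = ", ".join(
            f"{p} {slowdown(dma_class, p):.1f}x" for p in ("p50", "p90", "p99")
        )
        print(f"{dma_class:>12}: {ratios}")
```

The load + store row gives roughly 2.8x at p50 and 6.0x at p99, which are the figures quoted in the text.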

Initial hypothesis: H100 has 3 TB/s HBM. Bulk H2D writes pinned-host → HBM at PCIe Gen5 speeds. FlashAttn reads KV from HBM at compute speeds. If both compete for the same memory subsystem at the same time, attention runs at degraded effective bandwidth → kernel duration grows → step elongates. This is the classic memory-contention story.
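
A back-of-envelope check helps size this hypothesis. The sketch below compares the HBM figure quoted above with a nominal PCIe Gen5 x16 rate of about 64 GB/s per direction. That PCIe number is an assumption about the link (the trace does not report it), and achieved rates are typically lower.

```python
HBM_BW_GBPS = 3000.0       # ~3 TB/s HBM, as quoted in the text
PCIE_GEN5_X16_GBPS = 64.0  # ASSUMPTION: nominal per-direction link rate, not measured here


def hbm_share(link_gbps: float, hbm_gbps: float = HBM_BW_GBPS) -> float:
    """Fraction of HBM bandwidth that a copy arriving at link_gbps can consume."""
    return link_gbps / hbm_gbps


if __name__ == "__main__":
    share = hbm_share(PCIE_GEN5_X16_GBPS)
    print(f"H2D at nominal link rate uses about {share:.1%} of HBM bandwidth")
```

Under these assumptions a bulk H2D copy can claim only a few percent of HBM bandwidth, which is one more reason to test the contention story against counters rather than take it for granted.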

Profiling with nsys lets us check it directly: GPU SM/Tensor/DRAM utilization counters at 100 Hz, plus per-event CUPTI traces with stream attribution.

Evidence · GPU counters

2. What the GPU was actually doing

The H100 trace contains 2.55 M GPU metric samples at 100 Hz across 850 s. Bucketing 68 k samples in the steady-state window (170 s … 850 s) by which activity is in flight at the sample timestamp:

Metric	Value
GPU metric samples analyzed	68k
Activity classes (mutually exclusive)	3
Samples with H2D + FA both active	0
Sampling rate	100 Hz

DRAM Read Bandwidth, SM Active, Tensor Active per regime (p50)

Regime	n	DRAM Read Bandwidth (% of peak HBM read bandwidth)	SM Active (% of cycles at least one warp running)	Tensor Active (% of cycles tensor cores executing)
IDLE	32 098	26	57	39
FLASH_ATTN	4 469	30	93	75
BULK_H2D	31 478	1	0	0

IDLE = no FA, no bulk H2D (other kernels such as GEMM and layernorm)
FLASH_ATTN = FlashAttnFwdSm90 in flight
BULK_H2D = H2D memcpy >10MB in flight

The contention hypothesis predicts an overlap regime. There is none.
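
The bucketing used in this section amounts to a point-in-interval lookup per metric sample. A minimal sketch, assuming kernel and memcpy events have already been reduced to (start, end) pairs; the function names, the BOTH label, and the toy data are illustrative and are not taken from /tmp/dram_bw_during_dma.py:

```python
from bisect import bisect_right
from collections import Counter


def merge(intervals):
    """Sort (start, end) pairs and merge overlaps; return parallel start/end lists."""
    starts, ends = [], []
    for start, end in sorted(intervals):
        if ends and start <= ends[-1]:
            ends[-1] = max(ends[-1], end)  # extend the current merged interval
        else:
            starts.append(start)
            ends.append(end)
    return starts, ends


def active(lookup, t):
    """True if timestamp t falls inside any merged interval [start, end)."""
    starts, ends = lookup
    i = bisect_right(starts, t) - 1
    return i >= 0 and t < ends[i]


def classify(t, fa, h2d):
    """Assign one sample timestamp to an activity class."""
    in_fa, in_h2d = active(fa, t), active(h2d, t)
    if in_fa and in_h2d:
        return "BOTH"  # the overlap regime the contention hypothesis predicts
    if in_fa:
        return "FLASH_ATTN"
    if in_h2d:
        return "BULK_H2D"
    return "IDLE"


if __name__ == "__main__":
    # Toy timestamps in ms; real inputs would come from the nsys SQLite export.
    fa = merge([(0, 10), (30, 40)])
    h2d = merge([(12, 25)])
    samples = range(0, 40, 10)  # one sample every 10 ms, i.e. 100 Hz
    print(Counter(classify(t, fa, h2d) for t in samples))
```

Keeping an explicit BOTH class makes the absence of overlap a counted result (zero samples) rather than something implied by the other three buckets.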

Source · What the code says

3. The stream handle's full code path

If H2D really runs on a dedicated stream, GPU metrics should sometimes show H2D and FA co-active (a copy and a kernel running concurrently on different streams). Before drawing conclusions from the metrics, let's check what the code actually does. The path from torch.cuda.Stream to the CUDA driver call is short: five steps across three files, about 10 lines in total.

From SimpleCPUOffloadWorker.__init__ to cuMemcpyBatchAsync

Step 1. vllm/v1/simple_kv_offload/worker.py:189-192. Two dedicated streams created.

    low_pri, _ = torch.cuda.Stream.priority_range()
    self.load_stream = torch.cuda.Stream(priority=low_pri)
    self.store_stream = torch.cuda.Stream(priority=low_pri)

Step 2. copy_backend.py:46-48. Streams flow into params builders.

    self._store_params = build_params(gpu, cpu, store_stream)
    self._load_params = build_params(cpu, gpu, load_stream)

A daemon thread polls a queue and dispatches copies.

Step 3. cuda_mem_ops.py:113. Stream handle captured into BatchMemcpyParams.

    stream_handle=stream.cuda_stream  # raw cudaStream_t / CUstream

The C-level stream pointer is stored in the params struct.

Step 4. copy_backend.py:97-101. Events recorded on the dedicated stream.

    start_event.record(stream)
    copy_blocks(src_blocks, dst_blocks, params)
    end_event.record(stream)

Step 5. cuda_mem_ops.py:139-149. cuMemcpyBatchAsync invoked with the stream handle.

    err = _batch_memcpy_fn(dst, src, sizes, n,
        ctypes.addressof(params.attrs), ..., params.stream_handle)

The driver-level call gets the dedicated stream, not the default.

Code-level conclusion: the multi-stream design is real.

Paradox · Code vs nsys

4. The stream attribution paradox

nsys's TARGET_INFO_CUDA_STREAM table confirms the CUDA context created 135 streams at runtime, including the priority-0 NON_BLOCKING streams that should hold load_stream/store_stream. But when we look at which streams actually carry traffic:

streamId	type / priority	kernels	memcpys	role
7	NULL / prio=0	1,169,146	119,394 (16.5 GB)	all compute + all H2D
17	NON_BLOCKING / prio=0	0	2,881 (~0 GB)	small D2H metadata only
6, 8-16, 18-140 (133 streams)	various	0	0	created but completely idle

Figure: Stream usage map (every CUDA stream in the trace, colored by traffic). Legend: all kernel + bulk H2D traffic (stream 7); small D2H metadata only (stream 17); created but zero events (133 streams).

The 73 H2D events ≥ 100 MB totaling 9.5 GB in steady-state — exactly what we'd expect connector bulk loads to look like — all live on stream 7. The dedicated load_stream/store_stream the source code creates are not visibly active.
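
A per-stream tally like the table above can be pulled from the SQLite export with two queries. The sketch below assumes the export contains CUPTI_ACTIVITY_KIND_KERNEL and CUPTI_ACTIVITY_KIND_MEMCPY tables with streamId, bytes, and copyKind columns, and that copyKind = 1 denotes host-to-device. Table and column names can vary between nsys versions, so treat them as assumptions to check against the actual schema. The helper names and the toy database are illustrative, and this is not the content of /tmp/verify_streams.py.

```python
import sqlite3

# Kernel and memcpy event counts (plus memcpy bytes) per stream.
PER_STREAM_SQL = """
SELECT streamId,
       SUM(is_kernel) AS kernels,
       SUM(is_memcpy) AS memcpys,
       SUM(nbytes)    AS memcpy_bytes
FROM (
    SELECT streamId, 1 AS is_kernel, 0 AS is_memcpy, 0 AS nbytes
    FROM CUPTI_ACTIVITY_KIND_KERNEL
    UNION ALL
    SELECT streamId, 0, 1, bytes
    FROM CUPTI_ACTIVITY_KIND_MEMCPY
)
GROUP BY streamId
ORDER BY streamId
"""

# Large copies of one kind, grouped by the stream they were attributed to.
BULK_COPY_SQL = """
SELECT streamId, COUNT(*) AS events, SUM(bytes) AS total_bytes
FROM CUPTI_ACTIVITY_KIND_MEMCPY
WHERE copyKind = ? AND bytes >= ?
GROUP BY streamId
ORDER BY streamId
"""

HTOD = 1  # ASSUMPTION: copyKind value for host-to-device copies


def per_stream_traffic(conn):
    return conn.execute(PER_STREAM_SQL).fetchall()


def bulk_h2d_by_stream(conn, min_bytes=100 * 10**6):
    # ASSUMPTION: "100 MB" is read as 100 * 10**6 bytes.
    return conn.execute(BULK_COPY_SQL, (HTOD, min_bytes)).fetchall()


def make_toy_db():
    """Tiny in-memory stand-in for the export, with only the columns used here."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (streamId INTEGER)")
    conn.execute(
        "CREATE TABLE CUPTI_ACTIVITY_KIND_MEMCPY "
        "(streamId INTEGER, copyKind INTEGER, bytes INTEGER)"
    )
    conn.executemany(
        "INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (?)", [(7,), (7,), (7,)]
    )
    conn.executemany(
        "INSERT INTO CUPTI_ACTIVITY_KIND_MEMCPY VALUES (?, ?, ?)",
        [(7, 1, 150_000_000), (7, 1, 4096), (17, 2, 64)],
    )
    return conn


if __name__ == "__main__":
    conn = make_toy_db()  # replace with sqlite3.connect(<path to export>) for real data
    print(per_stream_traffic(conn))
    print(bulk_h2d_by_stream(conn))
```

Streams that were created but never carried an event will not appear in either result, so they have to be recovered by comparing against the stream list in TARGET_INFO_CUDA_STREAM.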

Three explanations, ranked by likelihood

(A) Most likely: nsys CUPTI doesn't attribute cuMemcpyBatchAsync correctly

(B) Plausible but weaker: the daemon thread loses CUDA context association

(C) Unlikely: GPU metric sampling perturbs CUPTI

Important: nsys SQL ≠ ground truth for stream concurrency.

Synthesis · Why both can be true

5. Reconciling the two pieces of evidence

There is one critical piece of source code that makes everything consistent. From worker.py:213-217:

def start_load_kv(self) -> None:
    # NOTE: we defer launching both load and store to get_finished(),
    # which runs after model execution. This hides the CPU-side
    # block copy op overhead (~5ms) behind GPU compute.
    pass

The actual per-step pipeline (with deferred launch)

Figure: Per-step ordering, current code (worker.py:213-217 deferred launch)

Lane	Timeline (left to right)
CPU	sched.schedule() → get_finished() / launch_copy → sched.schedule()
Compute stream	model.forward() (FA, GEMM, etc.) → model.forward()
load_stream	cuMemcpyBatchAsync (H2D), between the two forwards
SM hardware	SM Active = 93% → idle (SM = 0%) → SM Active = 93%

Forward finishes → CPU calls get_finished → DMA enqueued → DMA runs alone → next forward enqueued. The dedicated load_stream is technically separate, but compute had nothing pending while DMA ran. SM idles, GPU metrics show 0%.

What we'd see if launch were moved before forward

Figure: Hypothetical, launch_copy() moved to start_load_kv() (pre-forward)

Lane	Timeline (left to right)
CPU	sched.schedule() → start_load_kv / launch_copy → sched.schedule()
Compute stream	model.forward()
load_stream	H2D running concurrently with forward
SM hardware	SM busy throughout (DMA hidden)

If H2D is enqueued on load_stream BEFORE forward, the GPU's copy engine and compute SMs run in parallel. The dedicated stream's purpose is fulfilled, the per-step gap disappears, and the 2.8× decode slowdown should collapse.

The contention model was the wrong frame; the pipelining model fits all observations.
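
The difference between the two orderings can be sized with a toy step-time model. This is a sketch under strong assumptions, not a measurement: it treats the no-DMA p50 (15.0 ms) as pure forward time, attributes the rest of each DMA-class p50 to a serialized copy, and assumes a perfectly overlapped copy costs nothing beyond max(forward, copy). The function names are illustrative.

```python
def serialized_step(forward_ms: float, dma_ms: float) -> float:
    """Deferred launch: the copy runs after forward, so the two durations add."""
    return forward_ms + dma_ms


def overlapped_step(forward_ms: float, dma_ms: float) -> float:
    """Pre-forward launch with ideal overlap: the longer of the two sets the step."""
    return max(forward_ms, dma_ms)


def implied_dma(step_ms: float, forward_ms: float) -> float:
    """Copy time implied by a serialized step, under this model's assumptions."""
    return step_ms - forward_ms


if __name__ == "__main__":
    forward = 15.0  # no-DMA decode-only p50 from the sched_trace table
    for label, step in (("load only", 24.1), ("load + store", 41.9)):
        dma = implied_dma(step, forward)
        print(
            f"{label}: implied copy {dma:.1f} ms, "
            f"serialized {serialized_step(forward, dma):.1f} ms, "
            f"ideal overlap {overlapped_step(forward, dma):.1f} ms"
        )
```

In this model, overlap hides at most min(forward, copy) per step, so a copy that outlasts the forward pass would still lengthen the step even with perfect concurrency.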

Action · What to change

6. Fix direction & ranking

× Wrong directions to chase

✓ What to actually try

Move launch_copy to start_load_kv

Validation plan for the proposed fix

Open issues · Follow-ups

7. Open questions

Q1.

Q2.

Q3.

Reproduction · Scripts & data

8. Reproduction

Trace

# Source nsys-rep on PACE
/storage/scratch1/1/hlin464/swe_smith/results/nsys_swe_h100_jps10_dram64_fcfs_cpuoff/nsys_h100.nsys-rep

# SQLite export on lab (366 MB)
/tmp/nsys_analyze/nsys_h100.sqlite

# Bench: Llama-3.1-8B-Instruct, 100 jobs, jobs_100_pinjab.json
# Connector: SimpleCPUOffloadConnector, CPU pool = 64 GB
# Scheduling: FCFS

Analysis scripts

Script	Output
`/tmp/dram_bw_during_dma.py`	DRAM Read BW%, SM Active%, Tensor Active% bucketed by (BULK_H2D, FLASH_ATTN) co-activity
`/tmp/verify_streams.py`	Per-stream kernel + memcpy event counts; cross-validate with TARGET_INFO_CUDA_STREAM
`/tmp/verify_streams2.py`	Steady-state per-stream memcpy size buckets (filter cold-start init copies)
`/tmp/verify_streams3.py`	Cross-reference all 135 streams with their priority and traffic
`/tmp/analyze_flashattn.py`	FlashAttn kernel duration with/without H2D temporal overlap
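
The last comparison in this table, kernel duration with versus without temporal overlap, reduces to an interval-intersection test. A minimal sketch with illustrative names and toy data; it is not the content of /tmp/analyze_flashattn.py:

```python
from statistics import median


def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def split_by_overlap(kernels, copies):
    """Partition kernel durations by whether the kernel overlaps any copy.

    Quadratic in the worst case; fine for a sketch, but a real trace would
    want sorted intervals and a sweep or bisect lookup.
    """
    with_copy, without_copy = [], []
    for k in kernels:
        bucket = with_copy if any(overlaps(k, c) for c in copies) else without_copy
        bucket.append(k[1] - k[0])
    return with_copy, without_copy


if __name__ == "__main__":
    # Toy (start, end) pairs in ms.
    kernels = [(0, 2), (5, 8), (10, 12)]
    copies = [(6, 7)]
    with_copy, without_copy = split_by_overlap(kernels, copies)
    print("overlapping:", median(with_copy) if with_copy else None)
    print("non-overlapping:", median(without_copy) if without_copy else None)
```

If the overlapping bucket comes back empty on real data, that is itself the finding: there are no co-active kernels to compare.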

Source code

# PACE editable install
/storage/project/r-rs275-0/hlin464/vllm-agent-kvcache/vllm/v1/simple_kv_offload/
├── worker.py          # SimpleCPUOffloadWorker; lines 189-217 critical
├── copy_backend.py    # DmaCopyBackend, _copy_loop daemon thread
├── cuda_mem_ops.py    # cuMemcpyBatchAsync invocation, line 148
└── manager.py         # scheduler-side admission logic

# Findings doc on PACE
/storage/project/r-rs275-0/hlin464/agent-kvcache/nsys_h100_dma_findings.md

