Large-DRAM Sweep: DRAM Size Determines LMCache Effectiveness

Four-GPU comparison reveals LMCache's benefit window depends on VRAM size: H100 (80 GB) benefits at JPS 3–10, H200 (141 GB) only at JPS=6, A100 (40 GB) only at JPS=3

Llama-3.1-8B-Instruct H200 / H100 / A100 / L40S PACE Large-DRAM Sweep 2026-04-12 seed=42

1. Experiment Context

1.1 Research Question

Under what conditions does LMCache's CPU KV offload provide net benefit? We test across 4 GPUs and 5 load levels with sufficient DRAM (400 GB on H200/H100, 200 GB on A100, 100 GB on L40S). The initial sweep (64 GB DRAM) showed LMCache consistently slower than FCFS across all tested GPUs and load levels. The large-DRAM sweep tests whether providing adequate DRAM capacity eliminates the bottleneck and reveals a regime where LMCache's offload tier delivers genuine reuse benefit.

1.2 Three Strategies Under Test

FCFS (baseline)

Stock vLLM v1 scheduler. Block-level prefix cache is enabled, but evicted blocks are simply recomputed on the next turn. No explicit cross-turn KV preservation.

Failure mode: recompute of full context every turn under high concurrency

LMCache (v0.3.7)

Upstream production KV cache layer. Offloads evicted KV blocks to CPU DRAM via PCIe, reloads on hit. Integrates with vLLM through the v1 KV transfer connector interface (LMCacheConnectorV1).

Failure mode: wait_for_save blocks the step loop on every prefill — see KV Transfer Deep Dive

Continuum

Research prototype (this work). Keeps KV blocks for active agent jobs pinned in GPU VRAM across tool-call gaps, bypassing the evict/recompute cycle entirely. No CPU transfer. Identifies jobs via job_id and releases pin on job_completed signal.

Failure mode: pinned blocks can starve new admissions (Turn 2 explosion) — see §4

1.3 Workload Design

The benchmark simulates an 8-turn agent job: the client sends a prompt, the server generates a short response containing a tool call, the client sleeps for a fixed tool-execution time, then appends a (large) tool output back onto the context and starts the next turn. Each turn grows the prompt by ~2K tokens, so turn 8 sees a prompt of ~13K tokens. Jobs arrive on a Poisson process at rate JPS (jobs per second).

Context growth breakdown across 8 turns (measured)

Each turn's input prompt contains the full conversation so far: original user prompt + all previous turns' responses + all previous tool outputs. Three quantities determine how the context grows:

Turn Input prompt (measured) Completion (fixed) Tool output appended after turn = Next turn's input
19220~1,728 (target 1,640)1,840
21,84020~1,589 (target 1,510)3,449
33,44920~2,556 (target 2,455)6,025
46,02520~1,409 (target 1,335)7,454
57,45420~2,825 (target 2,730)10,299
610,29920~2,014 (target 1,930)12,333
712,33320~840 (target 775)13,193
813,19320— (last turn, no tool call)job ends

, about 80 tokens per tool output. The relationship next_input = this_input + 20 + tool holds exactly across all 7 transitions." data-zh="實測 vs target:controlled-sweep H200 JSON 裡的 per_turn_avg_prompt_tokens 顯示 tool output 比 target 序列稍大(穩定多 65–101 tokens)。這是 _generate_random_tool_output 的 header — f\"[{{job_id}}:turn{{turn_idx}}:{{seed[:8]}}] \" — 的固定 overhead,每個 tool output 約 80 tokens。next_input = this_input + 20 + tool 的關係在所有 7 次 transition 都精確成立。">Measured vs target: the per_turn_avg_prompt_tokens field in the controlled-sweep H200 JSON shows slightly larger tool outputs than the target sequence (+65 to +101 tokens consistently). This is the fixed overhead of the _generate_random_tool_output header — f"[{{job_id}}:turn{{turn_idx}}:{{seed[:8]}}] ", about 80 tokens per tool output. The relationship next_input = this_input + 20 + tool holds exactly across all 7 transitions.

1.4 Why Random Prompts Are Load-Bearing

A retracted earlier result

An earlier version of this benchmark (dir 20260406_static-prompt_jps-sweep_d90s_tool0.5s/) sent the same tool output string to every job. vLLM's block-level prefix cache — enabled by default — then hashed identical blocks across jobs, giving ~99% prefix cache hit and effectively 0 real KV eviction. Under that regime, LMCache appeared to beat FCFS by 39–66% on A100/L40S. That result is an artifact and is retracted. Under random prompts (31% prefix cache hit, realistic KV pressure), LMCache is slower than FCFS on every tested large-VRAM GPU.

The fix (committed in ab_benchmark.py:70) uses per-job per-turn seeded randomness:

def _generate_random_tool_output(job_id: str, turn_idx: int,
                                 target_tokens: int) -> str:
    # Uses a seeded RNG per (job_id, turn_idx) for reproducibility.
    seed = hashlib.md5(f"{job_id}:{turn_idx}".encode()).hexdigest()
    rng = random.Random(seed)
    header = f"[{job_id}:turn{turn_idx}:{seed[:8]}] "
    ...  # random words from a ~80-word pool

Each (job, turn) combination gets a unique token sequence. vLLM's prefix cache can still match the first few tokens of system prompt but cannot share tool-output bodies across jobs. All results in Sections 2–6 use this randomized prompt.

1.5 Benchmark Configuration

All numbers in Sections 2–6 of this report come from the primary large-DRAM sweep. Historical experiments are documented separately.

Large-DRAM Sweep (2026-04-12) — PrimaryPrimary

Source: our_experiments/results/20260412_large_dram_sweep/

ModelLlama-3.1-8B-Instruct (bf16, ~16 GB)
ArrivalPoisson
Turns / job8
Context @ turn 8~13,000 tokens
Tool output / turnRandomized, sequence [1640, 1510, 2455, 1335, 2730, 1930, 775]
Completion tokens20/turn (min=max)
Tool exec time0.5s fixed
JPS1, 3, 6, 10, 15
Seed42
DRAM provisionedH200/H100: 400 GB; A100: 200 GB; L40S: 100 GB
GPUsH200 5/5, H100 5/5, A100 partial, L40S partial
PoliciesFCFS, LMCache, Continuum
Historical Experiments — archivedHistorical

Full methodology and results for these experiments are documented in lmcache-kv-transfer.html.

  • Initial 64 GB sweep20260407_controlled-sweep/baseline with default DRAM (64 GB); all 4 GPUs complete
  • Bottleneck ablation20260407_bottleneck/5 LMCache sync variants on H200; partial other GPUs
  • NVTX profiling20260411_nsys-{h200,h100,a100,l40s}/per-call wait_for_save attribution; all 4 GPUs complete
  • Coalesce sync validation20260412_coalesce_sync/negative result; see KV Transfer Deep Dive §1.4

1.6 KV Cache Budget per GPU

For Llama 3.1 8B bf16 (~15 GB weights) at 85% gpu-memory-utilization. Block size = 2 MiB × 16 tokens. These determine the saturation point for each GPU.

GPU VRAM KV budget Token capacity Max concurrent 13K-ctx jobs Saturating JPS
L40S48 GB~28 GiB~208K~162–3
A100-PCIe40 GB~17 GiB~139K~101–2
H10080 GB~53 GiB~434K~335–8
H200141 GB~106 GiB~868K~6610–15
Why larger VRAM shifts the saturation point, not the ordering

More VRAM = more concurrent jobs at the same JPS before KV eviction kicks in. Below the saturation point, all three policies perform similarly because prefix cache handles reuse. Above saturation, the three policies diverge based on how they handle eviction: FCFS recomputes, LMCache CPU-offloads (with wait_for_save penalty), Continuum pins. The ranking is stable across GPU sizes; what changes is which JPS value saturates each GPU.

1.7 The DRAM Hypothesis

Why DRAM size is the decisive variable

The initial sweep (64 GB DRAM) used the cluster default memory configuration. Under that constraint, LMCache's CPU offload tier — which evicts KV blocks from GPU VRAM to DRAM — was itself DRAM-capacity-constrained. With only 64 GB available and a ~16 GB model occupying system memory, the effective offload pool was small, resulting in poor cache hit rates and amplified wait_for_save overhead with no compensating reuse benefit. LMCache was slower than FCFS on every GPU and JPS combination in the initial sweep.

The large-DRAM sweep tests the hypothesis: with sufficient DRAM, the offload pool ceases to be the bottleneck, and LMCache's DRAM tier can absorb enough evicted KV blocks to deliver net savings at moderate concurrency. We provision 400 GB DRAM on H200/H100, 200 GB on A100, and 100 GB on L40S — large enough to hold the entire KV working set of any tested load level. This experiment isolates whether the DRAM capacity or the wait_for_save per-step synchronization overhead is the primary factor limiting LMCache's benefit.

1.8 Data Provenance

✓ PRIMARY  our_experiments/results/20260412_large_dram_sweep/ — H200 5/5, H100 5/5, A100 partial, L40S partial (this report)
HISTORICAL our_experiments/results/20260407_controlled-sweep/ — H200 15/15, H100 15/15, L40S 15/15, A100 15/15 (initial sweep, 64 GB DRAM)
HISTORICAL our_experiments/results/20260407_bottleneck/ — H200 28/28; H100 10/28, L40S 14/28, A100 0/28 (partial; nsys data in KV Transfer page supersedes)
HISTORICAL our_experiments/results/20260411_nsys-{h200,h100,a100,l40s}/ — NVTX profiling, 6 profiles per GPU (fcfs+lmcache × jps1/6/10), all 4 GPUs complete
HISTORICAL our_experiments/results/20260412_coalesce_sync/{H200,H100,A100}/ — coalesce sync validation, negative result; see KV Transfer §1.4
✗ RETRACTED 20260406_static-prompt_jps-sweep_d90s_tool0.5s/ — prefix cache artifact
✗ RETRACTED jps_sweep_v2/, full_sweep/, quick_sweep/ — same static-prompt family, kept on disk for reproducibility only

Canonical documents: BOTTLENECK_REPORT.md (bottleneck ablation analysis), EXPERIMENTS.md (initial sweep, the "definitive experiment" per author note). The older experiments.md (lowercase) contains the retracted Exp 4 and should be treated as historical only.

2. Large-DRAM Sweep: Four-GPU JPS Comparison

The large-DRAM sweep covers JPS from 1 to 15 across four GPU types with large DRAM provisioned (400 GB on H200/H100, 200 GB on A100, 100 GB on L40S), measuring average job completion time (JCT). The key question is whether sufficient DRAM shifts the regime where LMCache is competitive. All data below is from our_experiments/results/20260412_large_dram_sweep/, seed=42.

H200 (141 GB VRAM, 400 GB DRAM)

H200 Key Takeaway

With 141 GB VRAM, H200 rarely evicts KV blocks until JPS=6. LMCache wins narrowly at JPS=6 only (−4.2% vs Continuum), but Continuum dominates at all other load levels. At JPS=15, LMCache collapses to 684.8 s versus Continuum's 185.1 s — PCIe saturation overwhelms the DRAM benefit.

JPS FCFS Avg (s) Continuum Avg (s) LMCache Avg (s) Winner
1 8.55 8.67 9.75 (+14%) FCFS
3 28.11 20.67 24.29 Continuum
6 192.15 66.00 63.22 (−4.2% vs Continuum) LMCache
10 416.92 119.39 158.69 (+33%) Continuum
15 702.50 185.08 684.80 (+270%) Continuum

H100 (80 GB VRAM, 400 GB DRAM)

H100 Key Takeaway

H100 with 80 GB VRAM is the sweet spot for LMCache. With 400 GB DRAM, KV eviction pressure is substantial enough at JPS 3–10 that LMCache's offload tier pays off. LMCache beats Continuum by −31% at JPS=3 and −34% at JPS=6. At JPS=15, LMCache collapses (689.5 s vs Continuum 259.5 s) due to PCIe congestion.

JPS FCFS Avg (s) Continuum Avg (s) LMCache Avg (s) Winner
1 9.01 8.75 10.21 (+13%) Continuum
3 68.85 39.75 27.51 (−31% vs Continuum) LMCache
6 244.07 100.38 65.85 (−34% vs Continuum) LMCache
10 456.30 172.89 165.19 (−4.5% vs Continuum) LMCache
15 739.24 259.52 689.47 (+166%) Continuum

A100 (40 GB VRAM, 200 GB DRAM) — partial data

A100 Key Takeaway

A100's 40 GB VRAM forces heavy KV eviction even at JPS=1. LMCache delivers a meaningful win over FCFS at JPS=3 (−46%), but PCIe Gen4 x16 bandwidth limits scalability: at JPS=6, LMCache is already slower than FCFS. Continuum data is unavailable for the large-DRAM sweep; comparison is FCFS vs LMCache only.

JPS FCFS Avg (s) LMCache Avg (s) Winner (vs FCFS)
1 45.00 70.52 (+57%) FCFS
3 415.14 223.40 (−46%) LMCache
6 943.54 1034.36 (+10%) FCFS
10 — (timeout) 2009.26

L40S (48 GB VRAM, 100 GB DRAM) — partial data

L40S Key Takeaway

L40S large-DRAM sweep data is incomplete: only FCFS was measured across all JPS levels; Continuum data is absent; LMCache data is unavailable. The FCFS trend shows rapid saturation — at JPS=15, 29 jobs timed out, indicating the GPU is deeply overloaded. Full three-way comparison is pending.

JPS FCFS Avg (s) Continuum Avg (s) Notes
1 134.55 68.55 Continuum available at JPS=1 only
3 501.28
6 1045.63
10 1834.86
15 3374.62 29 timeouts

Large-DRAM Sweep: Cross-GPU Summary

GPU VRAM DRAM LMCache Sweet Spot Peak Advantage vs Continuum
H100 80 GB 400 GB JPS 3–10 −34% at JPS=6
H200 141 GB 400 GB JPS=6 only −4.2% at JPS=6
A100 40 GB 200 GB JPS=3 only (vs FCFS) −46% vs FCFS at JPS=3
L40S 48 GB 100 GB Data incomplete

3. DRAM Impact Analysis

The large-DRAM sweep results reveal a VRAM-dependent benefit window for LMCache. This section explains the mechanism: why DRAM size matters, why it is not sufficient at very high concurrency, and how PCIe bandwidth creates a hard ceiling.

3.1 Why VRAM Pressure Is the Key Variable

LMCache offloads evicted KV blocks from GPU VRAM to CPU DRAM via PCIe. Whether this offload is beneficial depends on three interacting factors: (1) DRAM capacity, (2) DRAM LRU eviction under concurrency, and (3) PCIe bandwidth saturation.

VRAM Pressure Decision Tree

Large VRAM (H200, 141 GB)

  • Fewer KV evictions at low-to-medium JPS
  • LMCache offload fires rarely
  • Overhead without cache benefit

LMCache window: JPS=6 only

Moderate VRAM (H100, 80 GB)

  • KV eviction begins at JPS=3
  • DRAM offload pays off with sufficient hit rate
  • wait_for_save offset by saved recompute

LMCache window: JPS 3–10

Small VRAM (A100, 40 GB)

  • Heavy KV eviction even at JPS=1
  • PCIe Gen4 x16 limits DMA throughput
  • Bottleneck amplified by slower bus

LMCache window: JPS=3 only

3.2 PCIe Bandwidth as a Hard Ceiling

At very high JPS (≥15), PCIe becomes the binding constraint regardless of DRAM size. As the number of concurrent jobs grows, each vLLM step encounters more prefill requests in the batch, each triggering a separate wait_for_save DMA. The cumulative PCIe bandwidth consumed exceeds available PCIe throughput (H100: Gen5 x16 ≈ 128 GB/s; A100: Gen4 x16 ≈ 64 GB/s), causing DMA latency to spike. The measured consequence:

GPU JPS=15 Continuum (s) JPS=15 LMCache (s) LMCache overhead
H200 185.08 684.80 +270%
H100 259.52 689.47 +166%
Technical mechanism: wait_for_save overhead

For the root cause analysis (slot_mapping.cuda(), cudaStreamSynchronize, nsys profiling, ablation experiments, and per-step overhead breakdown), see: KV Transfer Deep Dive

Key finding: the per-step wait_for_save cost (~15 ms) is not the root cause. LMCache is profitable when DRAM cache hit rate is sufficient — see §6 for per-job trace evidence.

4. Per-Turn Latency

All data below use the large-DRAM sweep results. H200 and H100 are shown side by side for each load level. Breaking latency down by conversation turn reveals three structurally distinct failure modes — one per scheduling policy.

4.1 JPS=3 (below saturation)

H200 — Per-turn latency, JPS=3 0s 2.75s 5.5s 8.25s 11s Latency (s) T1 T2 T3 T4 T5 T6 T7 T8 Turn FCFS LMCache Continuum H100 — Per-turn latency, JPS=3 0s 2.75s 5.5s 8.25s 11s Latency (s) T1 T2 T3 T4 T5 T6 T7 T8 Turn FCFS LMCache Continuum

At JPS=3 on H200, all three policies remain below 4 s per turn. Continuum is the fastest throughout (933 ms at T1, plateauing near 2.5 s), LMCache occupies the middle band, and FCFS is the slowest. VRAM pressure is minimal at this load, so LMCache's offload overhead is visible but small. On H100, the picture reverses: LMCache is the fastest policy, starting at 1420 ms and holding a flat ~3 s plateau from T3 onwards. FCFS grows linearly to 10.5 s by T7, revealing the queuing penalty. Continuum shows a spike at T1 (5952 ms) driven by pin-allocation overhead, then drops to 3–4 s — still above LMCache. H100's smaller VRAM (80 GB) creates enough eviction pressure at JPS=3 that LMCache's DRAM offload pays off, a dynamic absent on H200.

4.2 JPS=6 (saturation threshold)

H200 — Per-turn latency, JPS=6 0s 8.75s 17.5s 26.25s 35s Latency (s) T1 T2 T3 T4 T5 T6 T7 T8 Turn FCFS LMCache Continuum H100 — Per-turn latency, JPS=6 0s 11.5s 23s 34.5s 46s Latency (s) T1 T2 T3 T4 T5 T6 T7 T8 Turn FCFS LMCache Continuum

At the saturation threshold the three shapes diverge structurally. On H200, LMCache (3.6–9.7 s range) significantly beats FCFS (6.8–34.3 s at T7). Continuum has its characteristic T1 spike (17.9 s) then drops to 4.6–7.7 s — the flattest tail. LMCache provides the lowest variance T1–T8 curve. On H100 the same pattern is amplified: FCFS reaches 45 s by T7 while LMCache stays 4.2–9.9 s. Continuum has a dramatic T2 spike (34.5 s — first full prefill competing for pinned VRAM slots) then drops to 5–18 s. The PCIe bus is not yet saturated at JPS=6 on either GPU, so LMCache's offload tier delivers meaningful savings.

4.3 JPS=10 (high load)

H200 — Per-turn latency, JPS=10 0s 20s 40s 60s 80s Latency (s) T1 T2 T3 T4 T5 T6 T7 T8 Turn FCFS LMCache Continuum H100 — Per-turn latency, JPS=10 0s 22.5s 45s 67.5s 90s Latency (s) T1 T2 T3 T4 T5 T6 T7 T8 Turn FCFS LMCache Continuum

At JPS=10, Continuum's pinned-VRAM advantage fully materialises. T3–T8 on both GPUs hold at 5–12.5 s while FCFS climbs to 72–88 s. The T2 spike grows to 54.9 s (H200) and 85.9 s (H100) — reflecting high admission contention for a smaller pinned pool. LMCache (7.5–26.8 s on H200) sits between Continuum and FCFS but shows acceleration at later turns: the PCIe bus is beginning to back up, and each turn's KV reload competes with concurrent onloads from other jobs. FCFS and LMCache are nearly parallel lines growing at the same rate, confirming that LMCache's benefit at this load level is only at early turns.

4.4 JPS=15 (fully saturated)

H200 — Per-turn latency, JPS=15 0s 39s 78s 117s 156s Latency (s) T1 T2 T3 T4 T5 T6 T7 T8 Turn FCFS LMCache Continuum H100 — Per-turn latency, JPS=15 0s 39s 78s 117s 156s Latency (s) T1 T2 T3 T4 T5 T6 T7 T8 Turn FCFS LMCache Continuum

At full saturation, LMCache collapses. On H200, LMCache T5–T8 (106–149 s) exceeds FCFS T5–T8 (104–137 s) — a qualitative reversal. PCIe bus saturation makes KV onload itself the new bottleneck: each turn's reload queues behind concurrent onloads from other jobs, so LMCache becomes strictly worse than doing nothing. H100 mirrors the H200 pattern exactly: LMCache T7 reaches 152.9 s while FCFS T7 is 150.8 s. Continuum's T2 reaches 110.6 s (H200) and 155.5 s (H100), but T3–T8 remain at 5–14 s across both GPUs. Continuum's pinned-KV guarantee means T3–T8 are entirely decoupled from system load; no other policy achieves this.

4.5 Failure Mode Summary

The per-turn latency data across all four JPS levels identifies three structurally distinct failure modes — each produced by a different mechanism and visible in a different part of the latency curve.

FCFS: linear queuing growth

T8/T1 ratio grows with JPS: 2.1× at JPS=3, 4.5× at JPS=6, 4.5× at JPS=10, 6.3× at JPS=15 (H200). Every turn recomputes the full context KV; as context grows with turn number and the batch is saturated, per-turn cost compounds. No structural ceiling — the failure mode is unbounded linear growth.

Continuum: T1/T2 spike, then flat

T8/T1 ratio is <1 at all JPS levels (H200: 2.5× at JPS=3, 0.27× at JPS=6, 0.26× at JPS=10, 0.19× at JPS=15). Later turns are faster than T1 because pinned KV eliminates recompute. The failure mode is front-loaded: T2 admission contention grows with JPS (up to 155 s on H100 JPS=15). Once past T2 the job is structurally decoupled from load.

LMCache: flat then PCIe wall

T8/T1 stays moderate at JPS=3–6 (2.1–2.3× on H200, H100) but explodes to 10.4× at H200 JPS=15 (and 11.3× on H100). The PCIe bus saturates, onload serialises, and each turn waits behind the previous turn's KV transfer. Unlike FCFS (which always grows) LMCache has a bi-modal regime: flat below PCIe saturation, catastrophic above it.

5. Intra-Job Latency Spread

The T8/T1 ratio measures how much a job's own later turns are penalised relative to its first turn — a proxy for intra-job starvation. Unlike inter-job starvation (early vs. late arrivals), intra-job spread directly determines whether a multi-turn user experience degrades over the course of a conversation.

GPU / JPS Policy T1 (ms) T8 (ms) T8/T1
H200 JPS=3 FCFS 146131182.1×
Continuum 93323072.5×
LMCache 119027372.3×
H200 JPS=6 FCFS 6804308124.5×
Continuum 1787748780.3×
LMCache 365575382.1×
H200 JPS=10 FCFS 16019729174.6×
Continuum 1994052370.3×
LMCache 7474233123.1×
H200 JPS=15 FCFS 213471335496.3×
Continuum 2792252250.2×
LMCache 1385114411210.4×
Three structurally different intra-job spread mechanisms

FCFS: T8/T1 grows monotonically with JPS. Each turn recomputes the full accumulated context, so the ratio tracks context-size growth. No mechanism exists to break the growth — the only relief is lower concurrency.

Continuum: T8/T1 < 1 at JPS ≥ 6 on both GPUs. Later turns benefit from pinned KV; the ratio falls below 1 because T8's incremental prefill (≈860 tokens) is much cheaper than T1's fresh admission. Intra-job spread is inverted — the conversation gets faster over time once past T2.

LMCache: T8/T1 stays 2.1–3.1× at JPS=3–10, then explodes to 10.4× (H200) and 11.3× (H100) at JPS=15. The collapse is non-linear: once PCIe bandwidth is saturated, each concurrent KV onload serialises behind all others, creating a cascading delay that compounds across turns. LMCache is the only policy with a bi-modal intra-job spread.

6. Per-Job KV Cache Lifecycle (Scheduler Trace)

Using vLLM's per-step scheduler trace (SCHED_TRACE_PATH), we can track block-level KV cache behavior for individual jobs. The table below shows job_0020 from the H100 LMCache JPS=6 trace test (255 jobs, seed=42, 400 GB DRAM).

6.1 job_0020 KV Cache Trace (H100, JPS=6)

Theoretical best = (prev turn prompt + 20) / this turn prompt. All values use job_0020's actual counts.

Turn Prompt Cacheable (prev turn) Theoretical Best Actual Hit Actual Rate Gap Free Blk Action
192 ~80 (sys prompt) 87% 80 87% 0% 22,748free
21,845 112 (92+20) 6.1% 112 6.1% 0% 19,654free
33,431 1,865 (1845+20) 54.4% 1,840 53.6% -0.8% 16,701free
46,065 3,451 (3431+20) 56.9% 3,440 56.7% -0.2% 13,821free
57,512 6,085 (6065+20) 81.0% 6,080 80.9% -0.1% 10,549free
610,298 7,532 (7512+20) 73.1% 112 1.1% -72.0% 1,799free
712,392 10,318 (10298+20) 83.3% 112 0.9% -82.4% 1,907free
813,278 12,412 (12392+20) 93.5% 112 0.8% -92.7% 2,332free
Turn 6–8: 92% wasted recompute

Turns 1–5 achieve near-theoretical hit (gap < 1%). Turn 6–8 should theoretically hit 73–93% but collapses to 1% — a gap of 72–92% representing entirely avoidable recompute due to DRAM LRU eviction under 255 concurrent jobs.

6.2 Interpretation

Turn 1–5: DRAM offload working

Turn 1 hits 87% (system prompt cached). Turn 2 drops to 6% (randomized tool output defeats prefix sharing). Turns 3–5 recover to 54%→57%→81% as LMCache reloads prior turns' KV from DRAM. The hit tokens match exactly: Turn 5 hits 6,080 tokens = Turn 1–4's total context.

Turn 6–8: DRAM tier evicted

Cache hit collapses to 1% (only system prompt). During the 0.5 s tool-call gap between Turn 5 and Turn 6, other concurrent jobs' offload traffic overwrites job_0020's DRAM entries via LRU eviction. Free blocks drop to 1,799 (VRAM 93% full). LMCache's DRAM tier cannot retain KV for all 255 concurrent jobs — this is the capacity wall.

Why action=free does not mean KV is deleted

In vLLM v1, action=free releases the block reservation. The KV data remains in VRAM's prefix cache until evicted. This is distinct from Continuum's action=pin, which keeps blocks unevictable across tool-call gaps. Three states: allocated (in use, unevictable) → cached (free but KV retained, evictable) → empty (no KV).

6.3 KV Cache Operations per Scheduler Step

Each vLLM scheduler step processes a batch of requests. The KV cache operations differ by request phase and scheduling policy.

Block states

Allocated

Owned by an active request. Cannot be evicted.

Cached

Released but KV data still in VRAM. Reusable but evictable.

Empty

No KV data. Available for allocation.

Cache hit detection — two-tier lookup

When a new request enters schedule(), the scheduler performs a two-tier cache lookup to determine how many tokens can be skipped.

Tier 1: VRAM prefix cache (local)

Called via kv_cache_manager.get_computed_blocks(request).

Mechanism: Chain-hashed blocks checked against block pool's cached map. Stops at first miss = longest prefix match.

What can hit: Any completed request's blocks with matching token prefix (same-job prior turns, shared system prompt, etc.).

Returns: num_new_local_computed_tokens = blocks × 16. Max = prompt − 1.

Tier 2: External KV store (LMCache DRAM)

Called only if KV connector configured. LMCache delegates to its engine.

Mechanism: LMCache checks DRAM for blocks beyond what VRAM found. The difference is loaded from DRAM to GPU during forward pass.

Returns: num_external_computed_tokens. FCFS/Continuum: always 0.

Tokens to compute = prompt − (local hit + external hit)

Example: Turn 5 (7,512 tokens). Tier 1 found 6,080 in VRAM. Tier 2: 0. Compute: 7,512 − 6,080 = 1,432 tokens (vs 7,512 without cache).

During schedule() — block allocation

Request Phase Operation Details
New request (first prefill) allocate_slots + prefix_cache_hit Check prefix cache + DRAM, allocate remaining, preempt if needed.
Running decode allocate_slots May need 1 new block per 16 tokens.
Preempted request free All blocks freed. Must recompute on reschedule.

After forward pass — request completion

When a request finishes, _free_request() dispatches based on policy:

Condition FCFS LMCache Continuum
Mid-job turn action=free
allocated → cached (evictable)
wait_for_saveaction=free
offload to DRAM, then allocated → cached
action=pin
stays allocated (unevictable) for TTL
Last turn (is_last_step) action=free
Same as mid-job
wait_for_saveaction=free
Same (offloads for future reuse)
action=free
No pin — freed immediately

During tool-call gap

Policy What happens to this job's KV Next turn hit
FCFS May be evicted. No backup. Depends on prefix cache survival
LMCache VRAM evictable + DRAM copy (also LRU evictable) Hit if DRAM survives LRU (§6.1)
Continuum Pinned. No eviction possible. ~100%
Source code reference

Dispatch in scheduler.py:_free_blocks(). FCFS → free; LMCache → wait_for_save + free; Continuum → pin.

6.4 GPU Hardware: Why Async Offload Should Work (But Doesn't)

KV offload/onload uses PCIe DMA (Copy Engine), not GPU compute (SM). They can theoretically run with zero interference.

SM (Compute)

132 units (H100). Forward pass kernels.

CE (Copy Engine)

2–3 units. PCIe DMA transfers. Independent.

NVLink Engine

18 units (H100). GPU↔GPU for TP.

Current timeline (blocking)

LMCache v0.3.7 (current):

LMCache: wait_for_save() blocks CPU → GPU idle between steps CPU SM (compute) CE (DMA) Step N Step N+1 execute_ model() slot_mapping.cuda() Sync #1: cudaStreamSync(default) store_stream.synchronize() Sync #2: waits for all DMA copies _free_blocks() action=free schedule() cache hit + allocate execute_ model() model() forward pass FlashAttn + GEMM + LayerNorm GPU IDLE — SM has no work CPU blocked in wait_for_save, cannot launch next forward Next Forward N+1 CE idle (forward running) batched_from_gpu() multi_layer_kv_transfer (DMA) ~70–100 ms wasted: SM idle while CE offloads + CPU synchronizes vllm_v1_adapter.py:1055 gpu_connector.py:348 gpu_model_runner.py:2062 CPU blocked (synchronize) Compute (SM kernels) DMA transfer (CE) Scheduler (CPU) GPU idle kv_connector_model_runner_mixin.py:117 — finally: kv_connector.wait_for_save() blocks until store_stream completes

PR #37160 SimpleCPUOffloadConnector:

PR #37160: wait_for_save() = pass → offload overlaps with next forward CPU SM (compute) CE (DMA) Step N Step N+1 execute_ model() pass get_finished() launch DMA + event.query() _free_blocks free schedule() cache hit + allocate execute_ model() pass get_finished() poll events (non-blocking) model() forward N FlashAttn + GEMM + LayerNorm model() forward N+1 FlashAttn + GEMM + LayerNorm CE idle launch_copy() DMA GPU→CPU on store_stream (async) SM + CE overlap wait_for_save()=pass worker.py:203 gpu_model_runner.py:2062 scheduler.py:280 Key differences from LMCache: 1. wait_for_save() = pass (no blocking) 2. DMA launched in get_finished() (deferred) 3. event.query() not stream.synchronize() 4. No slot_mapping.cuda() (no GC trigger) Non-blocking (pass / event.query) Compute (SM) DMA (CE, async) Scheduler Result: GPU SM utilization ~100% — no idle gap between steps. DMA runs in background on CE.

Full call chain with source locations

// vLLM Step Loop
│
├── schedule()                               // scheduler.py:280
│   ├── Tier 1: kv_cache_manager.get_computed_blocks(req)  // kv_cache_manager.py:153
│   │   └── coordinator.find_longest_cache_hit(block_hashes)
│   │       └── block_pool.get_cached_block(hash)        // single_type_kv_cache_manager.py:252
│   │           chain hash: hash(parent_hash, token_ids, extra_keys)  // kv_cache_utils.py:539
│   │
│   ├── Tier 2 (LMCache only):
│   │   └── connector.get_num_new_matched_tokens()      // lmcache_connector.py:158
│   │       └── lmcache_engine.get_num_new_matched_tokens  // vllm_v1_adapter.py:1123
│   │
│   ├── allocate_slots()                              // kv_cache_manager.py:196
│   └── _free_request() → _connector_finished()       // scheduler.py:1481, 1657
│       └── _free_blocks()                             // scheduler.py:1514
│           ├── FCFS:      kv_cache_manager.free(req)
│           ├── LMCache:   connector.request_finished() + free
│           └── Continuum: pin_request(req, ttl)         // scheduler.py:240
│
├── execute_model()                             // gpu_model_runner.py:2000
│   └── with maybe_get_kv_connector_output():    // kv_connector_model_runner_mixin.py:97
│       │
│       ├── ENTER: start_load_kv()                // onload from DRAM (async, load_stream)
│       │
│       ├── model(input_ids, positions, ...)     // forward pass (default stream, SM)
│       │   GPU simultaneously: SM=forward, CE=onload DMA
│       │
│       └── EXIT: wait_for_save()  ⚠ CPU BLOCKS HERE
│           └── lmcache_connector.wait_for_save()    // lmcache_connector.py:111
│               └── vllm_v1_adapter.wait_for_save()  // vllm_v1_adapter.py:1019
│                   │
│                   ├── slot_mapping.cuda()          // vllm_v1_adapter.py:1055
│                   │   → PyTorch allocator GC → cudaStreamSynchronize(default)
│                   │   Sync #1: waits for forward pass to finish
│                   │
│                   ├── lmcache_engine.store()       // cache_engine.py:277
│                   │   └── gpu_connector.batched_from_gpu()  // gpu_connector.py:339
│                   │       ├── with torch.cuda.stream(store_stream):
│                   │       │   lmc_ops.multi_layer_kv_transfer()  // gpu_connector.py:279
│                   │       │   GPU→CPU DMA on Copy Engine
│                   │       │
│                   │       └── store_stream.synchronize()  // gpu_connector.py:348
│                   │           Sync #2: waits for DMA to complete
│                   │
│                   └── CPU blocked → cannot launch next step
│
└── // back to step loop → next schedule() + execute_model()

Three synchronization points

# Location Source What it blocks
1 slot_mapping.cuda() vllm_v1_adapter.py:1055 Allocator GC syncs default stream — waits for forward pass
2 store_stream.synchronize() gpu_connector.py:348 Waits for DMA copies on store_stream (~15 ms real transfer)
3 store_stream.synchronize() gpu_connector.py:309 Per-chunk sync for non-CUDA target (always for CPU offload)

Why async overlap fails

The problem is CPU-side, not GPU-side

GPU can overlap SM+CE, but CPU blocks in finally: wait_for_save() and can't launch next step. GPU idles.

The fix: defer synchronization

Pass wait_for_save=False so execute_model returns immediately. Sync in get_finished() after next schedule. See upstream PR #37160.

Full root cause analysis in dedicated page

Detailed nsys profiling and ablation experiments are in: LMCache KV Transfer Pipeline — Internal Architecture

7. Key Findings

Finding 1: DRAM size is the decisive variable

LMCache's benefit window is VRAM-dependent, not DRAM-dependent alone. With sufficient DRAM (400 GB), LMCache is competitive only when VRAM pressure forces meaningful eviction rates. The DRAM tier provides value only when eviction rate × cache hit rate × recompute cost exceeds the per-step wait_for_save overhead.

Finding 2: H100 (80 GB) is LMCache's sweet spot

H100 with 400 GB DRAM delivers significant LMCache gains at JPS 3–10: −31% at JPS=3, −34% at JPS=6, −4.5% at JPS=10. The 80 GB VRAM creates sufficient eviction pressure without PCIe saturation at moderate load — the ideal operating regime for DRAM-backed KV offload.

Finding 3: H200 (141 GB) — Continuum dominates

H200's large VRAM suppresses eviction at most load levels, leaving LMCache with overhead but no cache benefit. LMCache barely wins at JPS=6 only (−4.2% vs Continuum). Continuum outperforms at all other JPS values. The H200 is not well-suited for LMCache unless the workload intentionally applies VRAM pressure.

Finding 4: A100 (40 GB) — limited by PCIe Gen4

A100's smaller VRAM creates heavy eviction pressure that LMCache can exploit — but only at JPS=3, where it beats FCFS by −46%. At JPS=6, LMCache is already slower than FCFS (+10%), because PCIe Gen4 x16 (≈64 GB/s) cannot keep up with the eviction rate, causing wait_for_save DMA latency to dominate over saved recompute.

Finding 5: At JPS=15, LMCache collapses on all GPUs

Even with 400 GB DRAM, LMCache degrades catastrophically at JPS=15: H100 goes from 165 s (JPS=10, best) to 689 s (JPS=15), while Continuum remains at 260 s. H200 is equally severe: 684 s vs Continuum's 185 s. PCIe bandwidth saturation — not DRAM capacity — is the fundamental limit at extreme concurrency.

Finding 6: Practical recommendation

Use LMCache on 80 GB class GPUs (H100 or equivalent) at moderate concurrency (JPS 3–6 for this workload profile). Provision at least 200 GB DRAM to avoid offload-tier capacity constraints. Do not use LMCache on H200 or larger VRAM GPUs unless the workload is specifically designed to create high KV eviction pressure. Do not run LMCache at extreme concurrency (JPS ≥ 15) on any GPU — PCIe saturation causes severe performance collapse.

Large-DRAM Sweep: DRAM Size Determines LMCache Effectiveness | Four-GPU JPS Sweep | vllm-continuum

Authors: hlin464 (Georgia Tech) | Cluster: Georgia Tech PACE

Back to LMCache Deep Dive  |  KV Transfer Deep Dive  |  AI Infrastructure Overview

Large-DRAM Sweep 2026-04-12 | Last updated 2026-04-12