Large-DRAM Sweep: DRAM Size Determines LMCache Effectiveness

1. Experiment Context

1.1 Research Question

Under what conditions does LMCache's CPU KV offload provide net benefit? We test across 4 GPUs and 5 load levels with sufficient DRAM (400 GB on H200/H100, 200 GB on A100, 100 GB on L40S). The initial sweep (64 GB DRAM) showed LMCache consistently slower than FCFS across all tested GPUs and load levels. The large-DRAM sweep tests whether providing adequate DRAM capacity eliminates the bottleneck and reveals a regime where LMCache's offload tier delivers genuine reuse benefit.

1.2 Three Strategies Under Test

FCFS (baseline)

Stock vLLM v1 scheduler. Block-level prefix cache is enabled, but evicted blocks are simply recomputed on the next turn. No explicit cross-turn KV preservation.

Failure mode: recompute of full context every turn under high concurrency

LMCache (v0.3.7)

Upstream production KV cache layer. Offloads evicted KV blocks to CPU DRAM via PCIe, reloads on hit. Integrates with vLLM through the v1 KV transfer connector interface (LMCacheConnectorV1).

Failure mode: wait_for_save blocks the step loop on every prefill — see KV Transfer Deep Dive

Continuum

Research prototype (this work). Keeps KV blocks for active agent jobs pinned in GPU VRAM across tool-call gaps, bypassing the evict/recompute cycle entirely. No CPU transfer. Identifies jobs via job_id and releases pin on job_completed signal.

Failure mode: pinned blocks can starve new admissions (Turn 2 explosion) — see §4

1.3 Workload Design

The benchmark simulates an 8-turn agent job: the client sends a prompt, the server generates a short response containing a tool call, the client sleeps for a fixed tool-execution time, then appends a (large) tool output back onto the context and starts the next turn. Each turn grows the prompt by ~2K tokens, so turn 8 sees a prompt of ~13K tokens. Jobs arrive on a Poisson process at rate JPS (jobs per second).

Context growth breakdown across 8 turns (measured)

Each turn's input prompt contains the full conversation so far: original user prompt + all previous turns' responses + all previous tool outputs. Three quantities determine how the context grows:

Input prompt — cumulative; what vLLM receives as the prefill for that turn
Completion — fixed 20 tokens/turn via min_tokens=max_tokens=20 in ab_benchmark.py:190. Not variable.
Tool output — appended between turns by the client. Follows a fixed target length sequence [1640, 1510, 2455, 1335, 2730, 1930, 775] for the 7 tool outputs between turns 1–7. Content is per-job random (seeded by hashlib.md5(f'{job_id}:{turn_idx}') in ab_benchmark.py:78) so length is reproducible but tokens are unique per job → defeats prefix cache sharing.

Turn	Input prompt (measured)	Completion (fixed)	Tool output appended after turn	= Next turn's input
1	92	20	~1,728 (target 1,640)	1,840
2	1,840	20	~1,589 (target 1,510)	3,449
3	3,449	20	~2,556 (target 2,455)	6,025
4	6,025	20	~1,409 (target 1,335)	7,454
5	7,454	20	~2,825 (target 2,730)	10,299
6	10,299	20	~2,014 (target 1,930)	12,333
7	12,333	20	~840 (target 775)	13,193
8	13,193	20	— (last turn, no tool call)	job ends

, about 80 tokens per tool output. The relationship next_input = this_input + 20 + tool holds exactly across all 7 transitions." data-zh="實測 vs target：controlled-sweep H200 JSON 裡的 per_turn_avg_prompt_tokens 顯示 tool output 比 target 序列稍大（穩定多 65–101 tokens）。這是 _generate_random_tool_output 的 header — f\"[{{job_id}}:turn{{turn_idx}}:{{seed[:8]}}] \" — 的固定 overhead，每個 tool output 約 80 tokens。next_input = this_input + 20 + tool 的關係在所有 7 次 transition 都精確成立。">Measured vs target: the per_turn_avg_prompt_tokens field in the controlled-sweep H200 JSON shows slightly larger tool outputs than the target sequence (+65 to +101 tokens consistently). This is the fixed overhead of the _generate_random_tool_output header — f"[{{job_id}}:turn{{turn_idx}}:{{seed[:8]}}] ", about 80 tokens per tool output. The relationship next_input = this_input + 20 + tool holds exactly across all 7 transitions.

1.4 Why Random Prompts Are Load-Bearing

A retracted earlier result

An earlier version of this benchmark (dir 20260406_static-prompt_jps-sweep_d90s_tool0.5s/) sent the same tool output string to every job. vLLM's block-level prefix cache — enabled by default — then hashed identical blocks across jobs, giving ~99% prefix cache hit and effectively 0 real KV eviction. Under that regime, LMCache appeared to beat FCFS by 39–66% on A100/L40S. That result is an artifact and is retracted. Under random prompts (31% prefix cache hit, realistic KV pressure), LMCache is slower than FCFS on every tested large-VRAM GPU.

The fix (committed in ab_benchmark.py:70) uses per-job per-turn seeded randomness:

def _generate_random_tool_output(job_id: str, turn_idx: int,
                                 target_tokens: int) -> str:
    # Uses a seeded RNG per (job_id, turn_idx) for reproducibility.
    seed = hashlib.md5(f"{job_id}:{turn_idx}".encode()).hexdigest()
    rng = random.Random(seed)
    header = f"[{job_id}:turn{turn_idx}:{seed[:8]}] "
    ...  # random words from a ~80-word pool

Each (job, turn) combination gets a unique token sequence. vLLM's prefix cache can still match the first few tokens of system prompt but cannot share tool-output bodies across jobs. All results in Sections 2–6 use this randomized prompt.

1.5 Benchmark Configuration

All numbers in Sections 2–6 of this report come from the primary large-DRAM sweep. Historical experiments are documented separately.

Large-DRAM Sweep (2026-04-12) — PrimaryPrimary

Source: our_experiments/results/20260412_large_dram_sweep/

Model	Llama-3.1-8B-Instruct (bf16, ~16 GB)
Arrival	Poisson
Turns / job	8
Context @ turn 8	~13,000 tokens
Tool output / turn	Randomized, sequence [1640, 1510, 2455, 1335, 2730, 1930, 775]
Completion tokens	20/turn (min=max)
Tool exec time	0.5s fixed
JPS	1, 3, 6, 10, 15
Seed	42
DRAM provisioned	H200/H100: 400 GB; A100: 200 GB; L40S: 100 GB
GPUs	H200 5/5, H100 5/5, A100 partial, L40S partial
Policies	FCFS, LMCache, Continuum

Historical Experiments — archivedHistorical

Full methodology and results for these experiments are documented in lmcache-kv-transfer.html.

Initial 64 GB sweep — 20260407_controlled-sweep/ — baseline with default DRAM (64 GB); all 4 GPUs complete
Bottleneck ablation — 20260407_bottleneck/ — 5 LMCache sync variants on H200; partial other GPUs
NVTX profiling — 20260411_nsys-{h200,h100,a100,l40s}/ — per-call wait_for_save attribution; all 4 GPUs complete
Coalesce sync validation — 20260412_coalesce_sync/ — negative result; see KV Transfer Deep Dive §1.4

1.6 KV Cache Budget per GPU

For Llama 3.1 8B bf16 (~15 GB weights) at 85% gpu-memory-utilization. Block size = 2 MiB × 16 tokens. These determine the saturation point for each GPU.

GPU	VRAM	KV budget	Token capacity	Max concurrent 13K-ctx jobs	Saturating JPS
L40S	48 GB	~28 GiB	~208K	~16	2–3
A100-PCIe	40 GB	~17 GiB	~139K	~10	1–2
H100	80 GB	~53 GiB	~434K	~33	5–8
H200	141 GB	~106 GiB	~868K	~66	10–15

Why larger VRAM shifts the saturation point, not the ordering

More VRAM = more concurrent jobs at the same JPS before KV eviction kicks in. Below the saturation point, all three policies perform similarly because prefix cache handles reuse. Above saturation, the three policies diverge based on how they handle eviction: FCFS recomputes, LMCache CPU-offloads (with wait_for_save penalty), Continuum pins. The ranking is stable across GPU sizes; what changes is which JPS value saturates each GPU.

1.7 The DRAM Hypothesis

Why DRAM size is the decisive variable

The initial sweep (64 GB DRAM) used the cluster default memory configuration. Under that constraint, LMCache's CPU offload tier — which evicts KV blocks from GPU VRAM to DRAM — was itself DRAM-capacity-constrained. With only 64 GB available and a ~16 GB model occupying system memory, the effective offload pool was small, resulting in poor cache hit rates and amplified wait_for_save overhead with no compensating reuse benefit. LMCache was slower than FCFS on every GPU and JPS combination in the initial sweep.

The large-DRAM sweep tests the hypothesis: with sufficient DRAM, the offload pool ceases to be the bottleneck, and LMCache's DRAM tier can absorb enough evicted KV blocks to deliver net savings at moderate concurrency. We provision 400 GB DRAM on H200/H100, 200 GB on A100, and 100 GB on L40S — large enough to hold the entire KV working set of any tested load level. This experiment isolates whether the DRAM capacity or the wait_for_save per-step synchronization overhead is the primary factor limiting LMCache's benefit.

1.8 Data Provenance

✓ PRIMARY  our_experiments/results/20260412_large_dram_sweep/ — H200 5/5, H100 5/5, A100 partial, L40S partial (this report)
HISTORICAL our_experiments/results/20260407_controlled-sweep/ — H200 15/15, H100 15/15, L40S 15/15, A100 15/15 (initial sweep, 64 GB DRAM)
HISTORICAL our_experiments/results/20260407_bottleneck/ — H200 28/28; H100 10/28, L40S 14/28, A100 0/28 (partial; nsys data in KV Transfer page supersedes)
HISTORICAL our_experiments/results/20260411_nsys-{h200,h100,a100,l40s}/ — NVTX profiling, 6 profiles per GPU (fcfs+lmcache × jps1/6/10), all 4 GPUs complete
HISTORICAL our_experiments/results/20260412_coalesce_sync/{H200,H100,A100}/ — coalesce sync validation, negative result; see KV Transfer §1.4
✗ RETRACTED 20260406_static-prompt_jps-sweep_d90s_tool0.5s/ — prefix cache artifact
✗ RETRACTED jps_sweep_v2/, full_sweep/, quick_sweep/ — same static-prompt family, kept on disk for reproducibility only

Canonical documents: BOTTLENECK_REPORT.md (bottleneck ablation analysis), EXPERIMENTS.md (initial sweep, the "definitive experiment" per author note). The older experiments.md (lowercase) contains the retracted Exp 4 and should be treated as historical only.

2. Large-DRAM Sweep: Four-GPU JPS Comparison

The large-DRAM sweep covers JPS from 1 to 15 across four GPU types with large DRAM provisioned (400 GB on H200/H100, 200 GB on A100, 100 GB on L40S), measuring average job completion time (JCT). The key question is whether sufficient DRAM shifts the regime where LMCache is competitive. All data below is from our_experiments/results/20260412_large_dram_sweep/, seed=42.

H200 (141 GB VRAM, 400 GB DRAM)

H200 Key Takeaway

With 141 GB VRAM, H200 rarely evicts KV blocks until JPS=6. LMCache wins narrowly at JPS=6 only (−4.2% vs Continuum), but Continuum dominates at all other load levels. At JPS=15, LMCache collapses to 684.8 s versus Continuum's 185.1 s — PCIe saturation overwhelms the DRAM benefit.

JPS	FCFS Avg (s)	Continuum Avg (s)	LMCache Avg (s)	Winner
1	8.55	8.67	9.75 (+14%)	FCFS
3	28.11	20.67	24.29	Continuum
6	192.15	66.00	63.22 (−4.2% vs Continuum)	LMCache
10	416.92	119.39	158.69 (+33%)	Continuum
15	702.50	185.08	684.80 (+270%)	Continuum

H100 (80 GB VRAM, 400 GB DRAM)

H100 Key Takeaway

H100 with 80 GB VRAM is the sweet spot for LMCache. With 400 GB DRAM, KV eviction pressure is substantial enough at JPS 3–10 that LMCache's offload tier pays off. LMCache beats Continuum by −31% at JPS=3 and −34% at JPS=6. At JPS=15, LMCache collapses (689.5 s vs Continuum 259.5 s) due to PCIe congestion.

JPS	FCFS Avg (s)	Continuum Avg (s)	LMCache Avg (s)	Winner
1	9.01	8.75	10.21 (+13%)	Continuum
3	68.85	39.75	27.51 (−31% vs Continuum)	LMCache
6	244.07	100.38	65.85 (−34% vs Continuum)	LMCache
10	456.30	172.89	165.19 (−4.5% vs Continuum)	LMCache
15	739.24	259.52	689.47 (+166%)	Continuum

A100 (40 GB VRAM, 200 GB DRAM) — partial data

A100 Key Takeaway

A100's 40 GB VRAM forces heavy KV eviction even at JPS=1. LMCache delivers a meaningful win over FCFS at JPS=3 (−46%), but PCIe Gen4 x16 bandwidth limits scalability: at JPS=6, LMCache is already slower than FCFS. Continuum data is unavailable for the large-DRAM sweep; comparison is FCFS vs LMCache only.

JPS	FCFS Avg (s)	LMCache Avg (s)	Winner (vs FCFS)
1	45.00	70.52 (+57%)	FCFS
3	415.14	223.40 (−46%)	LMCache
6	943.54	1034.36 (+10%)	FCFS
10	— (timeout)	2009.26	—

L40S (48 GB VRAM, 100 GB DRAM) — partial data

L40S Key Takeaway

L40S large-DRAM sweep data is incomplete: only FCFS was measured across all JPS levels; Continuum data is absent; LMCache data is unavailable. The FCFS trend shows rapid saturation — at JPS=15, 29 jobs timed out, indicating the GPU is deeply overloaded. Full three-way comparison is pending.

JPS	FCFS Avg (s)	Continuum Avg (s)	Notes
1	134.55	68.55	Continuum available at JPS=1 only
3	501.28	—
6	1045.63	—
10	1834.86	—
15	3374.62	—	29 timeouts

Large-DRAM Sweep: Cross-GPU Summary

GPU	VRAM	DRAM	LMCache Sweet Spot	Peak Advantage vs Continuum
H100	80 GB	400 GB	JPS 3–10	−34% at JPS=6
H200	141 GB	400 GB	JPS=6 only	−4.2% at JPS=6
A100	40 GB	200 GB	JPS=3 only (vs FCFS)	−46% vs FCFS at JPS=3
L40S	48 GB	100 GB	Data incomplete	—

3. DRAM Impact Analysis

The large-DRAM sweep results reveal a VRAM-dependent benefit window for LMCache. This section explains the mechanism: why DRAM size matters, why it is not sufficient at very high concurrency, and how PCIe bandwidth creates a hard ceiling.

3.1 Why VRAM Pressure Is the Key Variable

LMCache offloads evicted KV blocks from GPU VRAM to CPU DRAM via PCIe. Whether this offload is beneficial depends on three interacting factors: (1) DRAM capacity, (2) DRAM LRU eviction under concurrency, and (3) PCIe bandwidth saturation.

VRAM Pressure Decision Tree

Large VRAM (H200, 141 GB)

Fewer KV evictions at low-to-medium JPS
LMCache offload fires rarely
Overhead without cache benefit

LMCache window: JPS=6 only

Moderate VRAM (H100, 80 GB)

KV eviction begins at JPS=3
DRAM offload pays off with sufficient hit rate
wait_for_save offset by saved recompute

LMCache window: JPS 3–10

Small VRAM (A100, 40 GB)

Heavy KV eviction even at JPS=1
PCIe Gen4 x16 limits DMA throughput
Bottleneck amplified by slower bus

LMCache window: JPS=3 only

3.2 PCIe Bandwidth as a Hard Ceiling

At very high JPS (≥15), PCIe becomes the binding constraint regardless of DRAM size. As the number of concurrent jobs grows, each vLLM step encounters more prefill requests in the batch, each triggering a separate wait_for_save DMA. The cumulative PCIe bandwidth consumed exceeds available PCIe throughput (H100: Gen5 x16 ≈ 128 GB/s; A100: Gen4 x16 ≈ 64 GB/s), causing DMA latency to spike. The measured consequence:

GPU	JPS=15 Continuum (s)	JPS=15 LMCache (s)	LMCache overhead
H200	185.08	684.80	+270%
H100	259.52	689.47	+166%

Technical mechanism: wait_for_save overhead

For the root cause analysis (slot_mapping.cuda(), cudaStreamSynchronize, nsys profiling, ablation experiments, and per-step overhead breakdown), see: KV Transfer Deep Dive

Key finding: the per-step wait_for_save cost (~15 ms) is not the root cause. LMCache is profitable when DRAM cache hit rate is sufficient — see §6 for per-job trace evidence.

4. Per-Turn Latency

All data below use the large-DRAM sweep results. H200 and H100 are shown side by side for each load level. Breaking latency down by conversation turn reveals three structurally distinct failure modes — one per scheduling policy.

4.1 JPS=3 (below saturation)

At JPS=3 on H200, all three policies remain below 4 s per turn. Continuum is the fastest throughout (933 ms at T1, plateauing near 2.5 s), LMCache occupies the middle band, and FCFS is the slowest. VRAM pressure is minimal at this load, so LMCache's offload overhead is visible but small. On H100, the picture reverses: LMCache is the fastest policy, starting at 1420 ms and holding a flat ~3 s plateau from T3 onwards. FCFS grows linearly to 10.5 s by T7, revealing the queuing penalty. Continuum shows a spike at T1 (5952 ms) driven by pin-allocation overhead, then drops to 3–4 s — still above LMCache. H100's smaller VRAM (80 GB) creates enough eviction pressure at JPS=3 that LMCache's DRAM offload pays off, a dynamic absent on H200.

4.2 JPS=6 (saturation threshold)

At the saturation threshold the three shapes diverge structurally. On H200, LMCache (3.6–9.7 s range) significantly beats FCFS (6.8–34.3 s at T7). Continuum has its characteristic T1 spike (17.9 s) then drops to 4.6–7.7 s — the flattest tail. LMCache provides the lowest variance T1–T8 curve. On H100 the same pattern is amplified: FCFS reaches 45 s by T7 while LMCache stays 4.2–9.9 s. Continuum has a dramatic T2 spike (34.5 s — first full prefill competing for pinned VRAM slots) then drops to 5–18 s. The PCIe bus is not yet saturated at JPS=6 on either GPU, so LMCache's offload tier delivers meaningful savings.

4.3 JPS=10 (high load)

At JPS=10, Continuum's pinned-VRAM advantage fully materialises. T3–T8 on both GPUs hold at 5–12.5 s while FCFS climbs to 72–88 s. The T2 spike grows to 54.9 s (H200) and 85.9 s (H100) — reflecting high admission contention for a smaller pinned pool. LMCache (7.5–26.8 s on H200) sits between Continuum and FCFS but shows acceleration at later turns: the PCIe bus is beginning to back up, and each turn's KV reload competes with concurrent onloads from other jobs. FCFS and LMCache are nearly parallel lines growing at the same rate, confirming that LMCache's benefit at this load level is only at early turns.

4.4 JPS=15 (fully saturated)

At full saturation, LMCache collapses. On H200, LMCache T5–T8 (106–149 s) exceeds FCFS T5–T8 (104–137 s) — a qualitative reversal. PCIe bus saturation makes KV onload itself the new bottleneck: each turn's reload queues behind concurrent onloads from other jobs, so LMCache becomes strictly worse than doing nothing. H100 mirrors the H200 pattern exactly: LMCache T7 reaches 152.9 s while FCFS T7 is 150.8 s. Continuum's T2 reaches 110.6 s (H200) and 155.5 s (H100), but T3–T8 remain at 5–14 s across both GPUs. Continuum's pinned-KV guarantee means T3–T8 are entirely decoupled from system load; no other policy achieves this.

4.5 Failure Mode Summary

The per-turn latency data across all four JPS levels identifies three structurally distinct failure modes — each produced by a different mechanism and visible in a different part of the latency curve.

FCFS: linear queuing growth

T8/T1 ratio grows with JPS: 2.1× at JPS=3, 4.5× at JPS=6, 4.5× at JPS=10, 6.3× at JPS=15 (H200). Every turn recomputes the full context KV; as context grows with turn number and the batch is saturated, per-turn cost compounds. No structural ceiling — the failure mode is unbounded linear growth.

Continuum: T1/T2 spike, then flat

T8/T1 ratio is <1 at all JPS levels (H200: 2.5× at JPS=3, 0.27× at JPS=6, 0.26× at JPS=10, 0.19× at JPS=15). Later turns are faster than T1 because pinned KV eliminates recompute. The failure mode is front-loaded: T2 admission contention grows with JPS (up to 155 s on H100 JPS=15). Once past T2 the job is structurally decoupled from load.

LMCache: flat then PCIe wall

T8/T1 stays moderate at JPS=3–6 (2.1–2.3× on H200, H100) but explodes to 10.4× at H200 JPS=15 (and 11.3× on H100). The PCIe bus saturates, onload serialises, and each turn waits behind the previous turn's KV transfer. Unlike FCFS (which always grows) LMCache has a bi-modal regime: flat below PCIe saturation, catastrophic above it.

5. Intra-Job Latency Spread

The T8/T1 ratio measures how much a job's own later turns are penalised relative to its first turn — a proxy for intra-job starvation. Unlike inter-job starvation (early vs. late arrivals), intra-job spread directly determines whether a multi-turn user experience degrades over the course of a conversation.

GPU / JPS	Policy	T1 (ms)	T8 (ms)	T8/T1
H200 JPS=3	FCFS	1461	3118	2.1×
	Continuum	933	2307	2.5×
	LMCache	1190	2737	2.3×
H200 JPS=6	FCFS	6804	30812	4.5×
	Continuum	17877	4878	0.3×
	LMCache	3655	7538	2.1×
H200 JPS=10	FCFS	16019	72917	4.6×
	Continuum	19940	5237	0.3×
	LMCache	7474	23312	3.1×
H200 JPS=15	FCFS	21347	133549	6.3×
	Continuum	27922	5225	0.2×
	LMCache	13851	144112	10.4×

Three structurally different intra-job spread mechanisms

FCFS: T8/T1 grows monotonically with JPS. Each turn recomputes the full accumulated context, so the ratio tracks context-size growth. No mechanism exists to break the growth — the only relief is lower concurrency.

Continuum: T8/T1 < 1 at JPS ≥ 6 on both GPUs. Later turns benefit from pinned KV; the ratio falls below 1 because T8's incremental prefill (≈860 tokens) is much cheaper than T1's fresh admission. Intra-job spread is inverted — the conversation gets faster over time once past T2.

LMCache: T8/T1 stays 2.1–3.1× at JPS=3–10, then explodes to 10.4× (H200) and 11.3× (H100) at JPS=15. The collapse is non-linear: once PCIe bandwidth is saturated, each concurrent KV onload serialises behind all others, creating a cascading delay that compounds across turns. LMCache is the only policy with a bi-modal intra-job spread.

6. Per-Job KV Cache Lifecycle (Scheduler Trace)

Using vLLM's per-step scheduler trace (SCHED_TRACE_PATH), we can track block-level KV cache behavior for individual jobs. The table below shows job_0020 from the H100 LMCache JPS=6 trace test (255 jobs, seed=42, 400 GB DRAM).

6.1 job_0020 KV Cache Trace (H100, JPS=6)

Theoretical best = (prev turn prompt + 20) / this turn prompt. All values use job_0020's actual counts.

Turn	Prompt	Cacheable (prev turn)	Theoretical Best	Actual Hit	Actual Rate	Gap	Free Blk	Action
1	92	~80 (sys prompt)	87%	80	87%	0%	22,748	free
2	1,845	112 (92+20)	6.1%	112	6.1%	0%	19,654	free
3	3,431	1,865 (1845+20)	54.4%	1,840	53.6%	-0.8%	16,701	free
4	6,065	3,451 (3431+20)	56.9%	3,440	56.7%	-0.2%	13,821	free
5	7,512	6,085 (6065+20)	81.0%	6,080	80.9%	-0.1%	10,549	free
6	10,298	7,532 (7512+20)	73.1%	112	1.1%	-72.0%	1,799	free
7	12,392	10,318 (10298+20)	83.3%	112	0.9%	-82.4%	1,907	free
8	13,278	12,412 (12392+20)	93.5%	112	0.8%	-92.7%	2,332	free

Turn 6–8: 92% wasted recompute

Turns 1–5 achieve near-theoretical hit (gap < 1%). Turn 6–8 should theoretically hit 73–93% but collapses to 1% — a gap of 72–92% representing entirely avoidable recompute due to DRAM LRU eviction under 255 concurrent jobs.

6.2 Interpretation

Turn 1–5: DRAM offload working

Turn 1 hits 87% (system prompt cached). Turn 2 drops to 6% (randomized tool output defeats prefix sharing). Turns 3–5 recover to 54%→57%→81% as LMCache reloads prior turns' KV from DRAM. The hit tokens match exactly: Turn 5 hits 6,080 tokens = Turn 1–4's total context.

Turn 6–8: DRAM tier evicted

Cache hit collapses to 1% (only system prompt). During the 0.5 s tool-call gap between Turn 5 and Turn 6, other concurrent jobs' offload traffic overwrites job_0020's DRAM entries via LRU eviction. Free blocks drop to 1,799 (VRAM 93% full). LMCache's DRAM tier cannot retain KV for all 255 concurrent jobs — this is the capacity wall.

Why action=free does not mean KV is deleted

In vLLM v1, action=free releases the block reservation. The KV data remains in VRAM's prefix cache until evicted. This is distinct from Continuum's action=pin, which keeps blocks unevictable across tool-call gaps. Three states: allocated (in use, unevictable) → cached (free but KV retained, evictable) → empty (no KV).

6.3 KV Cache Operations per Scheduler Step

Each vLLM scheduler step processes a batch of requests. The KV cache operations differ by request phase and scheduling policy.

Block states

Allocated

Owned by an active request. Cannot be evicted.

Cached

Released but KV data still in VRAM. Reusable but evictable.

Empty

No KV data. Available for allocation.

Cache hit detection — two-tier lookup

When a new request enters schedule(), the scheduler performs a two-tier cache lookup to determine how many tokens can be skipped.

Tier 1: VRAM prefix cache (local)

Called via kv_cache_manager.get_computed_blocks(request).

Mechanism: Chain-hashed blocks checked against block pool's cached map. Stops at first miss = longest prefix match.

What can hit: Any completed request's blocks with matching token prefix (same-job prior turns, shared system prompt, etc.).

Returns: num_new_local_computed_tokens = blocks × 16. Max = prompt − 1.

Tier 2: External KV store (LMCache DRAM)

Called only if KV connector configured. LMCache delegates to its engine.

Mechanism: LMCache checks DRAM for blocks beyond what VRAM found. The difference is loaded from DRAM to GPU during forward pass.

Returns: num_external_computed_tokens. FCFS/Continuum: always 0.

Tokens to compute = prompt − (local hit + external hit)

Example: Turn 5 (7,512 tokens). Tier 1 found 6,080 in VRAM. Tier 2: 0. Compute: 7,512 − 6,080 = 1,432 tokens (vs 7,512 without cache).

During schedule() — block allocation

Request Phase	Operation	Details
New request (first prefill)	`allocate_slots` + `prefix_cache_hit`	Check prefix cache + DRAM, allocate remaining, preempt if needed.
Running decode	`allocate_slots`	May need 1 new block per 16 tokens.
Preempted request	`free`	All blocks freed. Must recompute on reschedule.

After forward pass — request completion

When a request finishes, _free_request() dispatches based on policy:

Condition	FCFS	LMCache	Continuum
Mid-job turn	`action=free` allocated → cached (evictable)	`wait_for_save` → `action=free` offload to DRAM, then allocated → cached	`action=pin` stays allocated (unevictable) for TTL
Last turn (is_last_step)	`action=free` Same as mid-job	`wait_for_save` → `action=free` Same (offloads for future reuse)	`action=free` No pin — freed immediately

During tool-call gap

Policy	What happens to this job's KV	Next turn hit
FCFS	May be evicted. No backup.	Depends on prefix cache survival
LMCache	VRAM evictable + DRAM copy (also LRU evictable)	Hit if DRAM survives LRU (§6.1)
Continuum	Pinned. No eviction possible.	~100%

Source code reference

Dispatch in scheduler.py:_free_blocks(). FCFS → free; LMCache → wait_for_save + free; Continuum → pin.

6.4 GPU Hardware: Why Async Offload Should Work (But Doesn't)

KV offload/onload uses PCIe DMA (Copy Engine), not GPU compute (SM). They can theoretically run with zero interference.

SM (Compute)

132 units (H100). Forward pass kernels.

CE (Copy Engine)

2–3 units. PCIe DMA transfers. Independent.

NVLink Engine

18 units (H100). GPU↔GPU for TP.

Current timeline (blocking)

LMCache v0.3.7 (current):

PR #37160 SimpleCPUOffloadConnector:

Full call chain with source locations

// vLLM Step Loop
│
├── schedule()                               // scheduler.py:280
│   ├── Tier 1: kv_cache_manager.get_computed_blocks(req)  // kv_cache_manager.py:153
│   │   └── coordinator.find_longest_cache_hit(block_hashes)
│   │       └── block_pool.get_cached_block(hash)        // single_type_kv_cache_manager.py:252
│   │           chain hash: hash(parent_hash, token_ids, extra_keys)  // kv_cache_utils.py:539
│   │
│   ├── Tier 2 (LMCache only):
│   │   └── connector.get_num_new_matched_tokens()      // lmcache_connector.py:158
│   │       └── lmcache_engine.get_num_new_matched_tokens  // vllm_v1_adapter.py:1123
│   │
│   ├── allocate_slots()                              // kv_cache_manager.py:196
│   └── _free_request() → _connector_finished()       // scheduler.py:1481, 1657
│       └── _free_blocks()                             // scheduler.py:1514
│           ├── FCFS:      kv_cache_manager.free(req)
│           ├── LMCache:   connector.request_finished() + free
│           └── Continuum: pin_request(req, ttl)         // scheduler.py:240
│
├── execute_model()                             // gpu_model_runner.py:2000
│   └── with maybe_get_kv_connector_output():    // kv_connector_model_runner_mixin.py:97
│       │
│       ├── ENTER: start_load_kv()                // onload from DRAM (async, load_stream)
│       │
│       ├── model(input_ids, positions, ...)     // forward pass (default stream, SM)
│       │   GPU simultaneously: SM=forward, CE=onload DMA
│       │
│       └── EXIT: wait_for_save()  ⚠ CPU BLOCKS HERE
│           └── lmcache_connector.wait_for_save()    // lmcache_connector.py:111
│               └── vllm_v1_adapter.wait_for_save()  // vllm_v1_adapter.py:1019
│                   │
│                   ├── slot_mapping.cuda()          // vllm_v1_adapter.py:1055
│                   │   → PyTorch allocator GC → cudaStreamSynchronize(default)
│                   │   Sync #1: waits for forward pass to finish
│                   │
│                   ├── lmcache_engine.store()       // cache_engine.py:277
│                   │   └── gpu_connector.batched_from_gpu()  // gpu_connector.py:339
│                   │       ├── with torch.cuda.stream(store_stream):
│                   │       │   lmc_ops.multi_layer_kv_transfer()  // gpu_connector.py:279
│                   │       │   GPU→CPU DMA on Copy Engine
│                   │       │
│                   │       └── store_stream.synchronize()  // gpu_connector.py:348
│                   │           Sync #2: waits for DMA to complete
│                   │
│                   └── CPU blocked → cannot launch next step
│
└── // back to step loop → next schedule() + execute_model()

Three synchronization points

#	Location	Source	What it blocks
1	`slot_mapping.cuda()`	`vllm_v1_adapter.py:1055`	Allocator GC syncs default stream — waits for forward pass
2	`store_stream.synchronize()`	`gpu_connector.py:348`	Waits for DMA copies on store_stream (~15 ms real transfer)
3	`store_stream.synchronize()`	`gpu_connector.py:309`	Per-chunk sync for non-CUDA target (always for CPU offload)

Why async overlap fails

The problem is CPU-side, not GPU-side

GPU can overlap SM+CE, but CPU blocks in finally: wait_for_save() and can't launch next step. GPU idles.

The fix: defer synchronization

Pass wait_for_save=False so execute_model returns immediately. Sync in get_finished() after next schedule. See upstream PR #37160.

Full root cause analysis in dedicated page

Detailed nsys profiling and ablation experiments are in: LMCache KV Transfer Pipeline — Internal Architecture

7. Key Findings

Finding 1: DRAM size is the decisive variable

LMCache's benefit window is VRAM-dependent, not DRAM-dependent alone. With sufficient DRAM (400 GB), LMCache is competitive only when VRAM pressure forces meaningful eviction rates. The DRAM tier provides value only when eviction rate × cache hit rate × recompute cost exceeds the per-step wait_for_save overhead.

Finding 2: H100 (80 GB) is LMCache's sweet spot

H100 with 400 GB DRAM delivers significant LMCache gains at JPS 3–10: −31% at JPS=3, −34% at JPS=6, −4.5% at JPS=10. The 80 GB VRAM creates sufficient eviction pressure without PCIe saturation at moderate load — the ideal operating regime for DRAM-backed KV offload.

Finding 3: H200 (141 GB) — Continuum dominates

H200's large VRAM suppresses eviction at most load levels, leaving LMCache with overhead but no cache benefit. LMCache barely wins at JPS=6 only (−4.2% vs Continuum). Continuum outperforms at all other JPS values. The H200 is not well-suited for LMCache unless the workload intentionally applies VRAM pressure.

Finding 4: A100 (40 GB) — limited by PCIe Gen4

A100's smaller VRAM creates heavy eviction pressure that LMCache can exploit — but only at JPS=3, where it beats FCFS by −46%. At JPS=6, LMCache is already slower than FCFS (+10%), because PCIe Gen4 x16 (≈64 GB/s) cannot keep up with the eviction rate, causing wait_for_save DMA latency to dominate over saved recompute.

Finding 5: At JPS=15, LMCache collapses on all GPUs

Even with 400 GB DRAM, LMCache degrades catastrophically at JPS=15: H100 goes from 165 s (JPS=10, best) to 689 s (JPS=15), while Continuum remains at 260 s. H200 is equally severe: 684 s vs Continuum's 185 s. PCIe bandwidth saturation — not DRAM capacity — is the fundamental limit at extreme concurrency.

Finding 6: Practical recommendation

Use LMCache on 80 GB class GPUs (H100 or equivalent) at moderate concurrency (JPS 3–6 for this workload profile). Provision at least 200 GB DRAM to avoid offload-tier capacity constraints. Do not use LMCache on H200 or larger VRAM GPUs unless the workload is specifically designed to create high KV eviction pressure. Do not run LMCache at extreme concurrency (JPS ≥ 15) on any GPU — PCIe saturation causes severe performance collapse.

Large-DRAM Sweep: DRAM Size Determines LMCache Effectiveness | Four-GPU JPS Sweep | vllm-continuum

Authors: hlin464 (Georgia Tech) | Cluster: Georgia Tech PACE

Back to LMCache Deep Dive | KV Transfer Deep Dive | AI Infrastructure Overview

Large-DRAM Sweep 2026-04-12 | Last updated 2026-04-12