Research Update 2026-04-14 — LMCache Bottleneck & Predictive Prefetch Roadmap

1. Context & KV-Cache Strategies

We measure how three KV-cache policies behave across GPU memory profiles and load levels, to decide where LMCache's CPU offload tier is profitable and where a predictive prefetch path should replace it.

1.1 The four strategies we are comparing

The timeline (from kv_strategies.html) shows the VRAM/DRAM lifecycle of each strategy during a single agent tool-call gap, for both a short tool (< TTL) and a long tool (> TTL), and marks where each approach breaks.

FCFS is the vLLM baseline. Continuum pins the job's KV in VRAM for the gap. LMCache offloads to CPU DRAM. These three are the policies measured below. Predictive Prefetch is the direction this work is moving toward (§4).

1.2 Experiment parameters

Parameter	Value
Model	Llama-3.1-8B-Instruct (bf16, ~16 GB)
Turns per job	8 (tool call → tool output → next turn)
Prompt growth	~2K tokens/turn, reaching ~13K at turn 8
Completion tokens	20 per turn
Tool execution time	0.5 s (fixed)
Arrival process	Poisson at rate JPS ∈ {1, 3, 6, 10, 15}
Seeds	42 (deterministic, same random prompts across policies)
GPUs	H200 (141 GB), H100 (80 GB), A100 (40 GB), L40S (48 GB)
DRAM sweep	Small = 64 GB (default); Large = 400 GB (H100/H200), 200 GB (A100), 100 GB (L40S)
Data provenance	`20260412_large_dram_sweep/` (primary) + `20260407_controlled-sweep/` (64 GB baseline)

Why this workload

Random per-turn tool outputs (7 fixed sizes shuffled, seed=42) defeat vLLM's block-level prefix cache. This isolates the KV preservation strategy from opportunistic prefix reuse, so differences in JCT reflect the policy itself rather than cache coincidence.
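
As a rough illustration of this workload, the sketch below draws Poisson arrivals and shuffles a fixed set of tool-output sizes under a single seed. The seven values in `TOOL_OUTPUT_TOKENS`, the function names, and the mapping of one size to each of the seven gaps between eight turns are assumptions made for illustration; the update itself states only that there are seven fixed sizes, seed 42, and Poisson arrivals.

```python
import random

TURNS_PER_JOB = 8
# Hypothetical values: the benchmark's actual seven sizes are not listed in this update.
TOOL_OUTPUT_TOKENS = [800, 1200, 1600, 2000, 2400, 2800, 3200]


def poisson_arrivals(jps: float, n_jobs: int, rng: random.Random) -> list[float]:
    """Arrival times (seconds) of a Poisson process with rate `jps`."""
    t, arrivals = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(jps)  # exponential inter-arrival gap, mean 1/jps
        arrivals.append(t)
    return arrivals


def tool_outputs_for_job(rng: random.Random) -> list[int]:
    """One shuffled tool-output size per gap between consecutive turns (assumed mapping)."""
    sizes = TOOL_OUTPUT_TOKENS[:]
    rng.shuffle(sizes)
    return sizes


def build_workload(jps: float, n_jobs: int, seed: int = 42):
    """List of (arrival_time_s, tool_output_sizes) pairs, one per job."""
    rng = random.Random(seed)  # same seed gives the same workload for every policy
    arrivals = poisson_arrivals(jps, n_jobs, rng)
    return [(t, tool_outputs_for_job(rng)) for t in arrivals]
```

Because the generator is seeded once per run, every policy sees the same arrival times and the same tool-output order, so a JCT difference cannot come from the draw.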

2. Avg JCT vs JPS — Large vs Small DRAM

Each chart shows average job completion time (JCT, log scale) against arrival rate (JPS, jobs per second) for H200 and H100. The contrast between large-DRAM (400 GB) and small-DRAM (64 GB) is the central finding: DRAM size gates whether LMCache's offload tier is ever profitable.

2.1 Large-DRAM sweep (400 GB DRAM)

Large-DRAM takeaway

With 400 GB DRAM, H100 (80 GB VRAM) is LMCache's sweet spot: its average JCT is 31% lower than Continuum's at JPS=3 and 34% lower at JPS=6. H200 (141 GB VRAM) has too little eviction pressure to amortize the offload cost, so LMCache wins only narrowly and only at JPS=6 (−4.2%). LMCache collapses on both GPUs at JPS=15, where PCIe saturates.

2.2 Small-DRAM sweep (64 GB DRAM, default)

Small-DRAM takeaway

With only 64 GB DRAM, LMCache is slower than FCFS at every JPS level on every GPU tested. At JPS=10 on H200, LMCache reaches 516 s while FCFS is at 453 s and Continuum at 126 s, a 4× gap between LMCache and Continuum. The DRAM tier is too small to retain hit-worthy KV across jobs under concurrency: the wait_for_save overhead is still paid, but the cache hits that would repay it do not arrive. The large-DRAM sweep is what reveals LMCache's real-world regime.
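
To put the DRAM sizes in perspective, the back-of-the-envelope sketch below estimates the KV footprint of one job at full context. It assumes the public Llama-3.1-8B attention geometry (32 layers, 8 KV heads, head dimension 128), bf16 KV with no quantization, and that the whole DRAM budget is available for KV; the last assumption makes the job counts an upper bound, not a measurement. The function names are illustrative.

```python
# Llama-3.1-8B attention geometry from the public model config (assumed unmodified).
NUM_LAYERS = 32
NUM_KV_HEADS = 8      # grouped-query attention
HEAD_DIM = 128
BYTES_PER_ELEM = 2    # bf16


def kv_bytes_per_token() -> int:
    # The factor 2 covers the separate K and V tensors in every layer.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def full_context_jobs_that_fit(dram_gb: float, context_tokens: int = 13_000) -> int:
    """Upper bound on full-length job contexts that fit if all of `dram_gb` held KV."""
    job_bytes = kv_bytes_per_token() * context_tokens
    return int(dram_gb * 1e9 // job_bytes)


if __name__ == "__main__":
    print(kv_bytes_per_token())             # 131072 bytes, i.e. 128 KiB per token
    print(full_context_jobs_that_fit(64))   # 37
    print(full_context_jobs_that_fit(400))  # 234
```

Under these assumptions a 13K-token job holds about 1.7 GB of KV, so the 64 GB tier can keep at most a few dozen full contexts, against a couple of hundred at 400 GB.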

2.3 Bottom line

Regime	Winner	Why
Small DRAM (64 GB), any GPU	Continuum	Offloaded KV is overwritten by next jobs before reuse; LMCache pays cost without cache benefit
Large DRAM, H100 (80 GB VRAM), JPS 3–10	LMCache	Eviction rate is high enough and DRAM hit rate is high enough to amortize `wait_for_save`
Large DRAM, H200 (141 GB VRAM)	Continuum	VRAM is rarely full — LMCache's offload fires rarely, overhead dominates
Any GPU, JPS=15	Continuum	PCIe bus saturates — LMCache's onload queues serially, 3–4× worse than Continuum

3. Per-Turn Latency — Three Distinct Failure Modes

Large-DRAM results broken down by conversation turn. Each pair shows H200 (left) and H100 (right). The per-turn shapes expose the structural failure mode of each policy — queuing growth (FCFS), admission spike then flat (Continuum), or flat then PCIe wall (LMCache).

3.1 JPS=3 (below saturation)

On H200 all three stay below 4 s/turn with minimal VRAM pressure. On H100, LMCache holds ~3 s throughout while FCFS climbs to 10.5 s by T7 — H100's 80 GB VRAM already creates enough eviction pressure at JPS=3 for LMCache's DRAM tier to pay off.

3.2 JPS=6 (saturation threshold)

Three structurally distinct shapes emerge. Continuum has a T1/T2 admission spike then stays flat (~5–8 s). LMCache is the flattest — its DRAM hit rate pays off. FCFS climbs linearly, hitting 34 s (H200) / 45 s (H100) at T7.

3.3 JPS=10 (high load)

Continuum's T3–T8 stay at 5–12 s while FCFS climbs to 72–88 s. LMCache sits in between but shows acceleration at later turns — the PCIe bus is starting to back up.

3.4 JPS=15 (fully saturated)

LMCache collapses: its T5–T8 latencies on H200 (106–149 s) now exceed those of FCFS (104–137 s). PCIe bus saturation makes KV onload the new bottleneck. Continuum's T3–T8 stay at 5–14 s on both GPUs because pinned KV is structurally decoupled from system load.

3.5 Failure-mode summary

FCFS — linear queueing

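The T8/T1 ratio (turn-8 latency over turn-1 latency) grows with JPS (2.1× → 6.3×). FCFS recomputes the full context every turn, so per-turn latency has no structural ceiling and keeps growing linearly with the turn index.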

Continuum — T1/T2 spike, flat tail

T8/T1 < 1 at JPS ≥ 6 (tail faster than first turn). Front-loaded admission contention, then pinned KV decouples from load.

LMCache — flat, then PCIe wall

Moderate T8/T1 at JPS 3–6 (2.1–2.3×), exploding to 10–11× at JPS=15. Bimodal: flat below PCIe saturation, catastrophic above.
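
The T8/T1 figures quoted in this summary are ratios of last-turn to first-turn latency. A minimal sketch of that computation follows; the function names are illustrative, and any latency list passed in the usage note is made up to show the shape of the calculation, not taken from the sweep.

```python
def tail_ratio(per_turn_latency_s: list[float]) -> float:
    """Latency of the last turn divided by latency of the first turn (T8/T1 for 8 turns)."""
    if len(per_turn_latency_s) < 2:
        raise ValueError("need at least two turns")
    return per_turn_latency_s[-1] / per_turn_latency_s[0]


def describe(ratio: float) -> str:
    """Coarse reading of the ratio: below 1 means the tail beats the first turn."""
    if ratio < 1.0:
        return "tail faster than first turn"
    return "tail slower than first turn"
```

For example, a made-up profile that starts at 10 s and settles at 5 s gives `tail_ratio` 0.5, the "spike, then flat tail" shape, while one that climbs from 2 s to 8 s gives 4.0.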

4. Next Steps — Roadmap Toward Predictive Prefetch

LMCache hits a PCIe wall at JPS=15; Continuum pins VRAM instead. The synthesis is predictive prefetch: offload during tool execution, prefetch back just before the tool returns. Four phases, each isolating one unknown.

Phase 1 Port to `SimpleCPUOffloading`

Swap LMCache out for vLLM's SimpleCPUOffloading substrate. The aim is the same hit rate without the wait_for_save stalls.
Gate: match LMCache's large-DRAM H100 JCT within ±5% at JPS 3–10.
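
One way to make this gate mechanical is a relative-tolerance check over the JPS points in range. The sketch below is an assumption about how the check could be scripted; the function name and the dictionary layout (JPS mapped to average JCT in seconds) are hypothetical.

```python
def passes_gate(candidate_jct_s: dict[int, float],
                reference_jct_s: dict[int, float],
                rel_tol: float = 0.05) -> bool:
    """True if the candidate JCT is within rel_tol of the reference at every reference JPS."""
    return all(
        abs(candidate_jct_s[jps] - ref) <= rel_tol * ref
        for jps, ref in reference_jct_s.items()
    )
```

The reference dictionary would hold LMCache's large-DRAM H100 results for the swept points inside the gate's range (JPS 3, 6, and 10); a single point outside ±5% fails the gate.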

Phase 2 Bidirectional PCIe + oracle prefetch

Run offload and prefetch concurrently over the H100's full-duplex PCIe Gen5 ×16 link; assume a perfect oracle for tool-return time.
Gate: zero wait on tool return → matches Continuum at JPS 3–10 without pinning VRAM.
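
The timing behind this gate can be sketched for a single job: offload when the tool call starts, then begin the reload just early enough to finish as the tool returns. Everything below is a simplified model under stated assumptions: the names are hypothetical, bandwidths are caller-supplied rather than measured, the oracle return time is exact, and contention from other jobs sharing the link is ignored.

```python
from dataclasses import dataclass


@dataclass
class PrefetchPlan:
    offload_done_s: float    # when the device-to-host copy finishes
    prefetch_start_s: float  # when to start the host-to-device copy
    hidden: bool             # True if both copies fit inside the tool gap


def plan_prefetch(kv_bytes: float, tool_return_s: float,
                  d2h_gb_per_s: float, h2d_gb_per_s: float,
                  margin_s: float = 0.0) -> PrefetchPlan:
    """Times are seconds since the tool call started; bandwidths are in GB/s."""
    offload_done = kv_bytes / (d2h_gb_per_s * 1e9)
    onload_time = kv_bytes / (h2d_gb_per_s * 1e9)
    # Latest start that still lands the KV in VRAM before the tool returns.
    prefetch_start = tool_return_s - onload_time - margin_s
    # In this model the reload cannot begin before the offload has finished.
    hidden = prefetch_start >= offload_done
    return PrefetchPlan(offload_done, max(prefetch_start, offload_done), hidden)
```

When `hidden` is False the gap is too short to hide both copies, and the job would wait on tool return for the remainder of the reload.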

Phase 3 Scoring — which KV to move?

Benefit

Saved prefill tokens on reload
Blocking cost for other requests

Capacity

VRAM eviction pressure
DRAM size — first-order per §2
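
The update lists the inputs to the score but not a formula. The sketch below is one hypothetical way to combine them, purely to show how the four inputs could interact: the names, the token-equivalent unit for blocking cost, the choice to subtract blocking cost from saved prefill, and the multiplicative VRAM-pressure weight are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    job_id: str
    kv_bytes: int
    saved_prefill_tokens: int    # tokens that need no recompute if the KV is reloaded
    blocking_cost_tokens: float  # hypothetical: delay imposed on other requests, in token-equivalents


def score(c: Candidate, vram_pressure: float, dram_free_bytes: int) -> float:
    """Hypothetical score: higher means more worth moving to DRAM.

    vram_pressure is assumed to lie in [0, 1], with 1 meaning VRAM is full.
    """
    if c.kv_bytes > dram_free_bytes:
        return 0.0  # DRAM cannot hold this KV, so moving it has no value
    net_benefit = c.saved_prefill_tokens - c.blocking_cost_tokens
    return max(net_benefit, 0.0) * vram_pressure
```

A scheduler could rank idle jobs by this score and offload from the top until VRAM pressure drops; whether a multiplicative form is appropriate is exactly what this phase is meant to determine.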

Phase 4 Real traces + imperfect estimator

Replace the synthetic benchmark with real agent tool-call traces; train a predictor for tool duration.
Gate: match Continuum on real traces with less VRAM, and beat LMCache on H100 JPS 3–10.

