Research Update — LMCache Bottleneck & Predictive Prefetch Roadmap

Advisor sync, 2026-04-14: where LMCache pays off, where it collapses, and how bidirectional predictive prefetch can replace it.

Llama-3.1-8B-Instruct · 8-turn agent, Poisson JPS · H200 / H100 / A100 / L40S · PACE + Lab 5090

1. Context & KV-Cache Strategies

We measure how three KV-cache policies (FCFS, Continuum, LMCache) behave across GPU memory profiles and load levels, to decide where LMCache's CPU offload tier is profitable and where a predictive prefetch path should replace it.

1.1 The four strategies we are comparing

The timeline in kv_strategies.html shows the VRAM/DRAM lifecycle of each strategy during a single agent tool-call gap, for both a short tool (< TTL) and a long tool (> TTL), which together show where each approach breaks.

FCFS is the vLLM baseline. Continuum pins the job's KV in VRAM for the gap. LMCache offloads to CPU DRAM. Predictive Prefetch is the direction this work is moving toward; it is not one of the three measured policies in the results below.

1.2 Experiment parameters

Parameter | Value
Model | Llama-3.1-8B-Instruct (bf16, ~16 GB)
Turns per job | 8 (tool call → tool output → next turn)
Prompt growth | ~2K tokens/turn, reaching ~13K at turn 8
Completion tokens | 20 per turn
Tool execution time | 0.5 s (fixed)
Arrival process | Poisson at rate JPS ∈ {1, 3, 6, 10, 15}
Seed | 42 (deterministic, same random prompts across policies)
GPUs | H200 (141 GB), H100 (80 GB), A100 (40 GB), L40S (48 GB)
DRAM sweep | Small = 64 GB (default); Large = 400 GB (H100/H200), 200 GB (A100), 100 GB (L40S)
Data provenance | 20260412_large_dram_sweep/ (primary) + 20260407_controlled-sweep/ (64 GB baseline)

Why this workload

Random per-turn tool outputs (7 fixed sizes shuffled, seed=42) defeat vLLM's block-level prefix cache. This isolates the KV preservation strategy from opportunistic prefix reuse, so differences in JCT reflect the policy itself rather than cache coincidence.
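As a rough illustration of the workload described above, the sketch below generates Poisson job arrivals and a per-job shuffle of seven tool-output sizes from a single seeded generator, so every policy sees the same sequence. The function names and the seven size values are hypothetical; this update does not list the actual sizes or the benchmark's code.

```python
import random

# Hypothetical token counts: the real seven sizes are not listed in this update.
TOOL_OUTPUT_SIZES = [500, 1000, 1500, 2000, 2500, 3000, 3500]

def poisson_arrivals(jps, n_jobs, rng):
    """Arrival timestamps (s) of a Poisson process with rate `jps` jobs per second."""
    t, arrivals = 0.0, []
    for _ in range(n_jobs):
        t += rng.expovariate(jps)  # exponential inter-arrival gap, mean 1/jps
        arrivals.append(t)
    return arrivals

def shuffled_tool_outputs(rng):
    """One tool output per gap between the 8 turns, in a job-specific random order."""
    sizes = TOOL_OUTPUT_SIZES[:]
    rng.shuffle(sizes)
    return sizes

def build_workload(jps, n_jobs, seed=42):
    """Deterministic list of (arrival_time_s, tool_output_sizes) pairs."""
    rng = random.Random(seed)
    arrivals = poisson_arrivals(jps, n_jobs, rng)
    return [(t, shuffled_tool_outputs(rng)) for t in arrivals]
```

Because the generator is seeded once per workload, calling build_workload(jps, n_jobs) for each policy reproduces the same arrivals and the same per-job size order, while the differing order across jobs is what keeps one job's prompt from sharing a prefix with another's.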

2. Avg JCT vs JPS — Large vs Small DRAM

Each chart shows average job completion time (log scale) against arrival rate for H200 and H100. The contrast between large-DRAM (400 GB) and small-DRAM (64 GB) is the central finding: DRAM size gates whether LMCache's offload tier is ever profitable.

2.1 Large-DRAM sweep (400 GB DRAM)

[Figure: Avg JCT (log scale, 1 s to 1000 s) vs JPS (1, 3, 6, 10, 15) at 400 GB DRAM. Panels: H200, H100. Series: FCFS, LMCache, Continuum.]

Large-DRAM takeaway

With 400 GB DRAM, H100 (80 GB VRAM) is LMCache's sweet spot: −31% at JPS=3 and −34% at JPS=6 vs Continuum. H200 (141 GB VRAM) has too little eviction pressure to amortize the offload cost, so LMCache narrowly wins only at JPS=6 (−4.2%). On both GPUs, LMCache collapses at JPS=15, where PCIe saturates.

2.2 Small-DRAM sweep (64 GB DRAM, default)

[Figure: Avg JCT (log scale, 1 s to 1000 s) vs JPS (1, 3, 6, 10) at 64 GB DRAM. Panels: H200 (series: FCFS, LMCache, Continuum) and H100 (indicative; series: FCFS (est.), LMCache (est.); dashed = extrapolated). H100 data is partial: exact numbers are pending a full rerun of the 20260407 controlled-sweep for H100+Continuum at 64 GB.]

Small-DRAM takeaway

With only 64 GB DRAM, LMCache is slower than FCFS at every JPS level on every GPU tested. At JPS=10 on H200, LMCache hits 516 s while FCFS is 453 s and Continuum is 126 s (about 4× faster than LMCache). The DRAM tier is too small to retain reusable KV across jobs under concurrency, so LMCache pays the wait_for_save overhead without earning cache hits in return. The large-DRAM sweep is what reveals LMCache's real-world regime.
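The gaps quoted above follow directly from the three JCT values in the text; a small sketch of the arithmetic (the helper names are ours, not from the benchmark code):

```python
# Avg JCT in seconds on H200, 64 GB DRAM, JPS=10, as quoted in the text.
jct_s = {"LMCache": 516.0, "FCFS": 453.0, "Continuum": 126.0}

def slowdown(policy, baseline):
    """How many times longer `policy` takes than `baseline`."""
    return jct_s[policy] / jct_s[baseline]

def pct_change(policy, baseline):
    """Signed percentage difference in JCT of `policy` relative to `baseline`."""
    return 100.0 * (jct_s[policy] - jct_s[baseline]) / jct_s[baseline]

print(round(slowdown("LMCache", "Continuum"), 2))  # 4.1
print(round(slowdown("FCFS", "Continuum"), 2))     # 3.6
print(round(pct_change("LMCache", "FCFS"), 1))     # 13.9
```

The same pct_change convention (negative means faster than the baseline) matches how the −31% and −34% figures in the large-DRAM takeaway are stated.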

2.3 Bottom line

Regime | Winner | Why
Small DRAM (64 GB), any GPU | Continuum | Offloaded KV is overwritten by next jobs before reuse; LMCache pays cost without cache benefit
Large DRAM, H100 (80 GB VRAM), JPS 3–10 | LMCache | Eviction rate is high enough and DRAM hit rate is high enough to amortize wait_for_save
Large DRAM, H200 (141 GB VRAM) | Continuum | VRAM is rarely full, so LMCache's offload fires rarely and overhead dominates
Any GPU, JPS=15 | Continuum | PCIe bus saturates; LMCache's onload queues serially, 3–4× worse than Continuum
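The bottom-line table can be read as a lookup rule. The sketch below encodes only the configurations that were actually measured (64 GB or 400 GB DRAM, 80 GB or 141 GB VRAM) and returns None for anything else rather than interpolating; the function name and the exact-match thresholds are our assumptions, not part of the benchmark.

```python
from typing import Optional

def recommended_policy(vram_gb: int, dram_gb: int, jps: int) -> Optional[str]:
    """Winner per the bottom-line table, or None if the regime is not covered by it."""
    if jps == 15:
        return "Continuum"   # PCIe saturates on any GPU
    if dram_gb == 64:
        return "Continuum"   # small DRAM: offloaded KV is overwritten before reuse
    if dram_gb == 400 and vram_gb == 141:
        return "Continuum"   # H200: VRAM rarely full, offload overhead dominates
    if dram_gb == 400 and vram_gb == 80 and 3 <= jps <= 10:
        return "LMCache"     # H100: eviction and hit rates amortize wait_for_save
    return None              # not covered by the table; do not extrapolate

print(recommended_policy(vram_gb=80, dram_gb=400, jps=6))   # LMCache
print(recommended_policy(vram_gb=141, dram_gb=400, jps=6))  # Continuum
```

Note that the H200 row follows the table as written, so LMCache's narrow −4.2% win at JPS=6 on H200 is deliberately folded into the Continuum row.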

3. Per-Turn Latency — Three Distinct Failure Modes

Large-DRAM results broken down by conversation turn. Each pair shows H200 (left) and H100 (right). The per-turn shapes expose the structural failure mode of each policy — queuing growth (FCFS), admission spike then flat (Continuum), or flat then PCIe wall (LMCache).

3.1 JPS=3 (below saturation)

[Figure: Per-turn latency at JPS=3, turns T1 to T8. Panels: H200 (y-axis 0 to 11 s), H100 (y-axis 0 to 11 s). Series: FCFS, LMCache, Continuum.]

On H200 all three stay below 4 s/turn with minimal VRAM pressure. On H100, LMCache holds ~3 s throughout while FCFS climbs to 10.5 s by T7 — H100's 80 GB VRAM already creates enough eviction pressure at JPS=3 for LMCache's DRAM tier to pay off.

3.2 JPS=6 (saturation threshold)

[Figure: Per-turn latency at JPS=6, turns T1 to T8. Panels: H200 (y-axis 0 to 35 s), H100 (y-axis 0 to 46 s). Series: FCFS, LMCache, Continuum.]

Three structurally distinct shapes emerge. Continuum has a T1/T2 admission spike then stays flat (~5–8 s). LMCache is the flattest — its DRAM hit rate pays off. FCFS climbs linearly, hitting 34 s (H200) / 45 s (H100) at T7.

3.3 JPS=10 (high load)

[Figure: Per-turn latency at JPS=10, turns T1 to T8. Panels: H200 (y-axis 0 to 80 s), H100 (y-axis 0 to 90 s). Series: FCFS, LMCache, Continuum.]

Continuum's T3–T8 stay at 5–12 s while FCFS climbs to 72–88 s. LMCache sits in between but shows acceleration at later turns — the PCIe bus is starting to back up.

3.4 JPS=15 (fully saturated)

[Figure: Per-turn latency at JPS=15, turns T1 to T8. Panels: H200 (y-axis 0 to 156 s), H100 (y-axis 0 to 156 s). Series: FCFS, LMCache, Continuum.]

LMCache collapses: its T5–T8 latencies on H200 (106–149 s) now exceed FCFS's (104–137 s). PCIe bus saturation makes KV onload the new bottleneck. Continuum's T3–T8 stay at 5–14 s on both GPUs, because pinned KV is structurally decoupled from system load.

3.5 Failure-mode summary

FCFS — linear queueing

T8/T1 grows with JPS (2.1× → 6.3×). The full context is recomputed every turn, so latency grows linearly with no structural ceiling.

Continuum — T1/T2 spike, flat tail

T8/T1 < 1 at JPS ≥ 6 (tail faster than first turn). Front-loaded admission contention, then pinned KV decouples from load.

LMCache — flat, then PCIe wall

Moderate T8/T1 at JPS 3–6 (2.1–2.3×), exploding to 10–11× at JPS=15. Bimodal: flat below PCIe saturation, catastrophic above.
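The T8/T1 metric used in these summaries is the last-turn latency divided by the first-turn latency. A minimal sketch of how it could be computed from a per-turn latency list follows; the function name is ours and the sample series is made up purely to show the shape, not taken from the sweep.

```python
def tail_ratio(per_turn_latency_s):
    """T8/T1: last-turn latency divided by first-turn latency for one policy."""
    if len(per_turn_latency_s) < 2:
        raise ValueError("need at least two turns")
    first, last = per_turn_latency_s[0], per_turn_latency_s[-1]
    if first <= 0:
        raise ValueError("first-turn latency must be positive")
    return last / first

# Made-up series with a front-loaded spike and a flat tail (not measured data).
spike_then_flat = [8.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
print(tail_ratio(spike_then_flat))  # 0.625
```

A ratio below 1 corresponds to the Continuum pattern (tail faster than the first turn), while a ratio that climbs with JPS corresponds to the FCFS and saturated-LMCache patterns.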

4. Next Steps — Roadmap Toward Predictive Prefetch

LMCache hits a PCIe wall at JPS=15; Continuum pins VRAM instead. The synthesis is predictive prefetch: offload during tool execution, prefetch back just before the tool returns. Four phases, each isolating one unknown.

Phase 1: Port to SimpleCPUOffloading

  • Swap LMCache out for vLLM's SimpleCPUOffloading substrate — same hit rate, without wait_for_save stalls.
  • Gate: match LMCache's large-DRAM H100 JCT within ±5% at JPS 3–10.

Phase 2: Bidirectional PCIe + oracle prefetch

  • Run offload and prefetch concurrently on H100 Gen5 ×16 full-duplex; assume a perfect oracle for tool-return time.
  • Gate: zero wait on tool return → matches Continuum at JPS 3–10 without pinning VRAM.
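To make the "zero wait on tool return" gate concrete, the sketch below estimates the KV size of one job and the latest moment a prefetch can start while still finishing before the tool returns. The model geometry (32 layers, 8 KV heads, head dimension 128, bf16) is Llama-3.1-8B's published configuration; the 50 GB/s effective one-way PCIe bandwidth, the safety margin, and all function names are assumptions for illustration only.

```python
# Llama-3.1-8B geometry: 32 layers, 8 KV heads (GQA), head dim 128, bf16 = 2 bytes.
N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES_PER_ELEM = 32, 8, 128, 2

def kv_bytes(n_tokens):
    """KV-cache bytes for `n_tokens`; the leading 2 covers both K and V."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * n_tokens

def transfer_seconds(n_tokens, pcie_gb_per_s):
    """One-way copy time at an assumed effective bandwidth in GB/s."""
    return kv_bytes(n_tokens) / (pcie_gb_per_s * 1e9)

def prefetch_start(tool_start_s, tool_duration_s, n_tokens,
                   pcie_gb_per_s, safety_s=0.0):
    """Latest host-to-device start time so the KV is resident when the tool returns.

    With an oracle, tool_duration_s is exact; safety_s is a hypothetical margin.
    """
    t_return = tool_start_s + tool_duration_s
    t_xfer = transfer_seconds(n_tokens, pcie_gb_per_s)
    return max(tool_start_s, t_return - t_xfer - safety_s)

print(kv_bytes(1))                                       # 131072
print(round(transfer_seconds(13_000, 50.0), 4))          # 0.0341
print(round(prefetch_start(0.0, 0.5, 13_000, 50.0), 4))  # 0.4659
```

Under these assumptions a single ~13K-token job moves about 1.7 GB in roughly 34 ms each way, which fits comfortably inside the 0.5 s tool gap; the open question for Phase 2 is what happens when many such transfers contend for the same link.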

Phase 3: Scoring (which KV to move?)

Benefit

  • Saved prefill tokens on reload
  • Blocking cost for other requests

Capacity

  • VRAM eviction pressure
  • DRAM size — first-order per §2
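The Phase 3 scoring function is not yet defined in this update. Purely to show how the four inputs listed above might combine, here is one possible shape: a time-denominated benefit (prefill time saved minus blocking cost) scaled by a capacity factor. Every name, the multiplicative form, and the [0, 1] normalization are hypothetical and should not be read as the chosen design.

```python
def clamp01(x):
    """Clamp a value into the range [0, 1]."""
    return max(0.0, min(1.0, x))

def move_score(saved_prefill_tokens, prefill_tokens_per_s, blocking_s,
               vram_pressure, dram_free_frac):
    """Hypothetical score for offloading one job's KV; higher means move it.

    benefit_s: prefill time avoided on reload, minus time other requests are blocked.
    capacity:  high only when VRAM is under pressure AND DRAM has room to hold the KV.
    """
    benefit_s = saved_prefill_tokens / prefill_tokens_per_s - blocking_s
    capacity = clamp01(vram_pressure) * clamp01(dram_free_frac)
    return benefit_s * capacity

# Illustrative inputs only: 13K tokens saved at an assumed 10K tokens/s prefill rate.
print(round(move_score(13_000, 10_000, 0.3, vram_pressure=1.0, dram_free_frac=1.0), 3))  # 1.0
print(move_score(13_000, 10_000, 0.3, vram_pressure=0.0, dram_free_frac=1.0))            # 0.0
```

The multiplicative capacity term mirrors the §2 finding that a small or full DRAM tier removes the benefit regardless of how much prefill would otherwise be saved.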

Phase 4: Real traces + imperfect estimator

  • Replace the synthetic benchmark with real agent tool-call traces; train a predictor for tool duration.
  • Gate: match Continuum on real traces with less VRAM, and beat LMCache on H100 JPS 3–10.

Research update 2026-04-14 · LMCache bottleneck & predictive prefetch roadmap · vllm-continuum
