Advisor sync, 2026-04-14: where LMCache pays off, where it collapses, and how bidirectional predictive prefetch can replace it.
We measure how three KV-cache policies behave across GPU memory profiles and load levels, to decide where LMCache's CPU offload tier is profitable and where a predictive prefetch path should replace it.
The timeline below (from kv_strategies.html) shows the VRAM/DRAM lifecycle of each strategy during a single agent tool-call gap. It covers two cases, a short tool call (< TTL) and a long tool call (> TTL), to show where each approach breaks.
FCFS is the vLLM baseline. Continuum pins the job's KV in VRAM for the gap. LMCache offloads to CPU DRAM. Predictive Prefetch is the direction this work is moving toward.
| Parameter | Value |
|---|---|
| Model | Llama-3.1-8B-Instruct (bf16, ~16 GB) |
| Turns per job | 8 (tool call → tool output → next turn) |
| Prompt growth | ~2K tokens/turn, reaching ~13K at turn 8 |
| Completion tokens | 20 per turn |
| Tool execution time | 0.5 s (fixed) |
| Arrival process | Poisson at rate JPS ∈ {1, 3, 6, 10, 15} |
| Seeds | 42 (deterministic, same random prompts across policies) |
| GPUs | H200 (141 GB), H100 (80 GB), A100 (40 GB), L40S (48 GB) |
| DRAM sweep | Small = 64 GB (default); Large = 400 GB (H100/H200), 200 GB (A100), 100 GB (L40S) |
| Data provenance | 20260412_large_dram_sweep/ (primary) + 20260407_controlled-sweep/ (64 GB baseline) |
Random per-turn tool outputs (7 fixed sizes shuffled, seed=42) defeat vLLM's block-level prefix cache. This isolates the KV preservation strategy from opportunistic prefix reuse, so differences in JCT reflect the policy itself rather than cache coincidence.
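A minimal sketch of how a workload with these properties could be generated. It covers only Poisson arrival times and the shuffled order of tool-output sizes. The seed, turn count, and JPS values come from the parameter table; the seven size values and all function names are hypothetical placeholders, not the harness's actual values.

```python
import random

TURNS_PER_JOB = 8   # from the parameter table
SEED = 42           # from the parameter table

# Hypothetical tool-output sizes in tokens (the real seven values are not listed here).
TOOL_OUTPUT_SIZES = [500, 1000, 1500, 2000, 2500, 3000, 3500]


def poisson_arrivals(jps, n_jobs, rng):
    """Arrival times (seconds) of a Poisson process with rate `jps` jobs/second."""
    t = 0.0
    arrivals = []
    for _ in range(n_jobs):
        t += rng.expovariate(jps)  # exponential inter-arrival gaps
        arrivals.append(t)
    return arrivals


def tool_output_schedule(rng):
    """One shuffled ordering of the fixed sizes: one tool output per gap between turns."""
    sizes = list(TOOL_OUTPUT_SIZES)
    rng.shuffle(sizes)
    return sizes


def build_workload(jps, n_jobs, seed=SEED):
    """Return [(arrival_time, tool_output_sizes), ...]; identical for identical seeds."""
    rng = random.Random(seed)
    arrivals = poisson_arrivals(jps, n_jobs, rng)
    return [(t, tool_output_schedule(rng)) for t in arrivals]


workload = build_workload(jps=3, n_jobs=5)
```

Because the generator is seeded once per run, every policy sees the same arrival times and the same size ordering, which is what makes the comparison across policies deterministic.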
Each chart shows average job completion time (log scale) against arrival rate for H200 and H100. The contrast between large-DRAM (400 GB) and small-DRAM (64 GB) is the central finding: DRAM size gates whether LMCache's offload tier is ever profitable.
With 400 GB DRAM, H100 (80 GB VRAM) is LMCache's sweet spot: average JCT is 31% lower than Continuum at JPS=3 and 34% lower at JPS=6. H200 (141 GB VRAM) has too little eviction pressure to amortize the offload cost, so LMCache wins only narrowly and only at JPS=6 (−4.2%). On both GPUs LMCache collapses at JPS=15, where PCIe saturates.
With only 64 GB DRAM, LMCache is slower than FCFS at every JPS level on every GPU tested. At JPS=10 on H200, LMCache takes 516 s, against 453 s for FCFS and 126 s for Continuum (a 4× gap between LMCache and Continuum). The DRAM tier is too small to retain hit-worthy KV across jobs under concurrency: the wait_for_save overhead is still paid, but the cache hit rate does not repay it. The large-DRAM sweep is what reveals LMCache's real-world regime.
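The ratios behind the JPS=10, H200, 64 GB figures can be checked directly. The three JCT values are the ones quoted above; the helper name is illustrative.

```python
# Average JCT in seconds at JPS=10 on H200 with 64 GB DRAM, as quoted in the text.
jct = {"LMCache": 516.0, "FCFS": 453.0, "Continuum": 126.0}


def slowdown(policy, baseline):
    """How many times slower `policy` is than `baseline`."""
    return jct[policy] / jct[baseline]


for policy in ("LMCache", "FCFS"):
    print(f"{policy} vs Continuum: {slowdown(policy, 'Continuum'):.2f}x")
print(f"LMCache vs FCFS: {(slowdown('LMCache', 'FCFS') - 1) * 100:.1f}% slower")
# LMCache vs Continuum: 4.10x
# FCFS vs Continuum: 3.60x
# LMCache vs FCFS: 13.9% slower
```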
| Regime | Winner | Why |
|---|---|---|
| Small DRAM (64 GB), any GPU | Continuum | Offloaded KV is overwritten by later jobs before it can be reused; LMCache pays the offload cost without the cache benefit |
| Large DRAM, H100 (80 GB VRAM), JPS 3–10 | LMCache | Eviction rate and DRAM hit rate are both high enough to amortize wait_for_save |
| Large DRAM, H200 (141 GB VRAM) | Continuum | VRAM is rarely full, so LMCache's offload rarely fires and its overhead dominates |
| Any GPU, JPS=15 | Continuum | The PCIe bus saturates and LMCache's onloads queue serially, 3–4× worse than Continuum |
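The regime table can be read as a lookup. The sketch below encodes only the four rows above and returns `None` for anything the table does not cover (for example A100 or L40S with large DRAM). The function name and the use of `>=` / `<=` comparisons in place of the exact measured points are assumptions made for illustration.

```python
def pick_policy(vram_gb, dram_gb, jps):
    """Winner according to the regime table, or None where the table is silent."""
    if jps >= 15:        # row 4: PCIe saturates on any GPU
        return "Continuum"
    if dram_gb <= 64:    # row 1: small DRAM, any GPU
        return "Continuum"
    if vram_gb >= 141:   # row 3: H200, VRAM rarely full
        return "Continuum"
    if vram_gb == 80 and 3 <= jps <= 10:  # row 2: H100 with large DRAM
        return "LMCache"
    return None          # not covered by the table


print(pick_policy(vram_gb=80, dram_gb=400, jps=6))   # LMCache
print(pick_policy(vram_gb=141, dram_gb=400, jps=6))  # Continuum
```

Rows are checked from the most general condition to the most specific, so the JPS=15 row overrides the H100 large-DRAM row, as in the table.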
Large-DRAM results broken down by conversation turn. Each pair shows H200 (left) and H100 (right). The per-turn shapes expose the structural failure mode of each policy — queuing growth (FCFS), admission spike then flat (Continuum), or flat then PCIe wall (LMCache).
On H200 all three stay below 4 s/turn with minimal VRAM pressure. On H100, LMCache holds ~3 s throughout while FCFS climbs to 10.5 s by T7 — H100's 80 GB VRAM already creates enough eviction pressure at JPS=3 for LMCache's DRAM tier to pay off.
Three structurally distinct shapes emerge. Continuum has a T1/T2 admission spike then stays flat (~5–8 s). LMCache is the flattest — its DRAM hit rate pays off. FCFS climbs linearly, hitting 34 s (H200) / 45 s (H100) at T7.
Continuum's T3–T8 stay at 5–12 s while FCFS climbs to 72–88 s. LMCache sits in between but shows acceleration at later turns — the PCIe bus is starting to back up.
LMCache collapses: its T5–T8 latencies on H200 (106–149 s) now exceed those of FCFS (104–137 s). PCIe bus saturation makes KV onload the new bottleneck. Continuum's T3–T8 stay at 5–14 s on both GPUs because pinned KV is structurally decoupled from system load.
FCFS: T8/T1 grows with JPS (2.1× → 6.3×). The full context is recomputed every turn, so growth is linear with no structural ceiling.
Continuum: T8/T1 < 1 at JPS ≥ 6 (the tail is faster than the first turn). Admission contention is front-loaded; after that, pinned KV decouples from load.
LMCache: moderate T8/T1 at JPS 3–6 (2.1–2.3×), exploding to 10–11× at JPS=15. Bimodal: flat below PCIe saturation, catastrophic above.
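The T8/T1 metric used in these summaries is the last-turn latency divided by the first-turn latency. A minimal sketch of the computation follows; the function name is illustrative, and the per-turn list is made-up input chosen only to show the shape of the call, not measured data.

```python
def tail_ratio(per_turn_seconds):
    """T8/T1: latency of the last turn divided by latency of the first turn."""
    if len(per_turn_seconds) < 2:
        raise ValueError("need at least two turns")
    return per_turn_seconds[-1] / per_turn_seconds[0]


# Made-up per-turn latencies (seconds), NOT measurements from the sweep.
example = [10.0, 9.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0]
print(tail_ratio(example))  # 0.5
```

A ratio below 1 means the first turn was the slowest (front-loaded contention); a ratio well above 1 means latency accumulates toward the end of the job.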
LMCache hits a PCIe wall at JPS=15; Continuum pins VRAM instead. The synthesis is predictive prefetch: offload during tool execution, prefetch back just before the tool returns. Four phases, each isolating one unknown.
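A back-of-envelope sketch of the timing constraint predictive prefetch has to satisfy for one job: the offload and the prefetch must both fit inside the tool-call gap. The KV size follows from the public Llama-3.1-8B configuration (32 layers, 8 KV heads, head dimension 128, bf16). The 25 GB/s effective PCIe bandwidth, the safety margin, and all function names are assumptions for illustration, not measured values or part of any existing implementation.

```python
# Llama-3.1-8B architecture constants (public model config).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # bf16

# Assumption: effective host<->device bandwidth in bytes/second (not measured here).
PCIE_BYTES_PER_S = 25e9


def kv_bytes(num_tokens):
    """KV-cache size: K and V, for every layer and KV head, per token."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * num_tokens


def transfer_seconds(num_tokens, bandwidth=PCIE_BYTES_PER_S):
    """One-way transfer time for a context of `num_tokens`, ignoring contention."""
    return kv_bytes(num_tokens) / bandwidth


def plan_gap(num_tokens, predicted_tool_s, safety_margin_s=0.05):
    """Schedule offload at t=0 and prefetch just before the predicted tool return."""
    one_way = transfer_seconds(num_tokens)
    prefetch_start = predicted_tool_s - one_way - safety_margin_s
    return {
        "offload_done_s": one_way,
        "prefetch_start_s": prefetch_start,
        # The offload must have finished before the prefetch begins.
        "feasible": prefetch_start >= one_way,
    }


print(kv_bytes(1))             # 131072 bytes per token
print(plan_gap(13_000, 0.5))   # ~13K-token context, 0.5 s tool call
```

Under these assumptions a single ~13K-token context moves in well under 0.1 s, so one job fits comfortably in a 0.5 s gap. The sketch deliberately ignores concurrent transfers sharing the bus, which is the contention that matters at high JPS.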
Research update 2026-04-14 · LMCache bottleneck & predictive prefetch roadmap · vllm-continuum