Four-GPU comparison reveals LMCache's benefit window depends on VRAM size: H100 (80 GB) benefits at JPS 3–10, H200 (141 GB) only at JPS=6, A100 (40 GB) only at JPS=3
Under what conditions does LMCache's CPU KV offload provide net benefit? We test across 4 GPUs and 5 load levels with sufficient DRAM (400 GB on H200/H100, 200 GB on A100, 100 GB on L40S). The initial sweep (64 GB DRAM) showed LMCache consistently slower than FCFS across all tested GPUs and load levels. The large-DRAM sweep tests whether providing adequate DRAM capacity eliminates the bottleneck and reveals a regime where LMCache's offload tier delivers genuine reuse benefit.
Stock vLLM v1 scheduler. Block-level prefix cache is enabled, but evicted blocks are simply recomputed on the next turn. No explicit cross-turn KV preservation.
Failure mode: recompute of full context every turn under high concurrency
Upstream production KV cache layer. Offloads evicted KV blocks to CPU DRAM via PCIe, reloads on hit. Integrates with vLLM through the v1 KV transfer connector interface (LMCacheConnectorV1).
Failure mode: wait_for_save blocks the step loop on every prefill — see KV Transfer Deep Dive
Research prototype (this work). Keeps KV blocks for active agent jobs pinned in GPU VRAM across tool-call gaps, bypassing the evict/recompute cycle entirely. No CPU transfer. Identifies jobs via job_id and releases pin on job_completed signal.
Failure mode: pinned blocks can starve new admissions (Turn 2 explosion) — see §4
The benchmark simulates an 8-turn agent job: the client sends a prompt, the server generates a short response containing a tool call, the client sleeps for a fixed tool-execution time, then appends a (large) tool output back onto the context and starts the next turn. Each turn grows the prompt by ~2K tokens, so turn 8 sees a prompt of ~13K tokens. Jobs arrive on a Poisson process at rate JPS (jobs per second).
Each turn's input prompt contains the full conversation so far: original user prompt + all previous turns' responses + all previous tool outputs. Three quantities determine how the context grows:
min_tokens=max_tokens=20 in ab_benchmark.py:190. Not variable.[1640, 1510, 2455, 1335, 2730, 1930, 775] for the 7 tool outputs between turns 1–7. Content is per-job random (seeded by hashlib.md5(f'{job_id}:{turn_idx}') in ab_benchmark.py:78) so length is reproducible but tokens are unique per job → defeats prefix cache sharing.| Turn | Input prompt (measured) | Completion (fixed) | Tool output appended after turn | = Next turn's input |
|---|---|---|---|---|
| 1 | 92 | 20 | ~1,728 (target 1,640) | 1,840 |
| 2 | 1,840 | 20 | ~1,589 (target 1,510) | 3,449 |
| 3 | 3,449 | 20 | ~2,556 (target 2,455) | 6,025 |
| 4 | 6,025 | 20 | ~1,409 (target 1,335) | 7,454 |
| 5 | 7,454 | 20 | ~2,825 (target 2,730) | 10,299 |
| 6 | 10,299 | 20 | ~2,014 (target 1,930) | 12,333 |
| 7 | 12,333 | 20 | ~840 (target 775) | 13,193 |
| 8 | 13,193 | 20 | — (last turn, no tool call) | job ends |
, about 80 tokens per tool output. The relationship next_input = this_input + 20 + tool holds exactly across all 7 transitions."
data-zh="實測 vs target:controlled-sweep H200 JSON 裡的 per_turn_avg_prompt_tokens 顯示 tool output 比 target 序列稍大(穩定多 65–101 tokens)。這是 _generate_random_tool_output 的 header — f\"[{{job_id}}:turn{{turn_idx}}:{{seed[:8]}}] \" — 的固定 overhead,每個 tool output 約 80 tokens。next_input = this_input + 20 + tool 的關係在所有 7 次 transition 都精確成立。">Measured vs target: the per_turn_avg_prompt_tokens field in the controlled-sweep H200 JSON shows slightly larger tool outputs than the target sequence (+65 to +101 tokens consistently). This is the fixed overhead of the _generate_random_tool_output header — f"[{{job_id}}:turn{{turn_idx}}:{{seed[:8]}}] ", about 80 tokens per tool output. The relationship next_input = this_input + 20 + tool holds exactly across all 7 transitions.
An earlier version of this benchmark (dir 20260406_static-prompt_jps-sweep_d90s_tool0.5s/) sent the same tool output string to every job. vLLM's block-level prefix cache — enabled by default — then hashed identical blocks across jobs, giving ~99% prefix cache hit and effectively 0 real KV eviction. Under that regime, LMCache appeared to beat FCFS by 39–66% on A100/L40S. That result is an artifact and is retracted. Under random prompts (31% prefix cache hit, realistic KV pressure), LMCache is slower than FCFS on every tested large-VRAM GPU.
The fix (committed in ab_benchmark.py:70) uses per-job per-turn seeded randomness:
def _generate_random_tool_output(job_id: str, turn_idx: int,
target_tokens: int) -> str:
# Uses a seeded RNG per (job_id, turn_idx) for reproducibility.
seed = hashlib.md5(f"{job_id}:{turn_idx}".encode()).hexdigest()
rng = random.Random(seed)
header = f"[{job_id}:turn{turn_idx}:{seed[:8]}] "
... # random words from a ~80-word pool
Each (job, turn) combination gets a unique token sequence. vLLM's prefix cache can still match the first few tokens of system prompt but cannot share tool-output bodies across jobs. All results in Sections 2–6 use this randomized prompt.
All numbers in Sections 2–6 of this report come from the primary large-DRAM sweep. Historical experiments are documented separately.
Source: our_experiments/results/20260412_large_dram_sweep/
| Model | Llama-3.1-8B-Instruct (bf16, ~16 GB) |
| Arrival | Poisson |
| Turns / job | 8 |
| Context @ turn 8 | ~13,000 tokens |
| Tool output / turn | Randomized, sequence [1640, 1510, 2455, 1335, 2730, 1930, 775] |
| Completion tokens | 20/turn (min=max) |
| Tool exec time | 0.5s fixed |
| JPS | 1, 3, 6, 10, 15 |
| Seed | 42 |
| DRAM provisioned | H200/H100: 400 GB; A100: 200 GB; L40S: 100 GB |
| GPUs | H200 5/5, H100 5/5, A100 partial, L40S partial |
| Policies | FCFS, LMCache, Continuum |
Full methodology and results for these experiments are documented in lmcache-kv-transfer.html.
20260407_controlled-sweep/ — baseline with default DRAM (64 GB); all 4 GPUs complete20260407_bottleneck/ — 5 LMCache sync variants on H200; partial other GPUs20260411_nsys-{h200,h100,a100,l40s}/ — per-call wait_for_save attribution; all 4 GPUs complete20260412_coalesce_sync/ — negative result; see KV Transfer Deep Dive §1.4For Llama 3.1 8B bf16 (~15 GB weights) at 85% gpu-memory-utilization. Block size = 2 MiB × 16 tokens. These determine the saturation point for each GPU.
| GPU | VRAM | KV budget | Token capacity | Max concurrent 13K-ctx jobs | Saturating JPS |
|---|---|---|---|---|---|
| L40S | 48 GB | ~28 GiB | ~208K | ~16 | 2–3 |
| A100-PCIe | 40 GB | ~17 GiB | ~139K | ~10 | 1–2 |
| H100 | 80 GB | ~53 GiB | ~434K | ~33 | 5–8 |
| H200 | 141 GB | ~106 GiB | ~868K | ~66 | 10–15 |
More VRAM = more concurrent jobs at the same JPS before KV eviction kicks in. Below the saturation point, all three policies perform similarly because prefix cache handles reuse. Above saturation, the three policies diverge based on how they handle eviction: FCFS recomputes, LMCache CPU-offloads (with wait_for_save penalty), Continuum pins. The ranking is stable across GPU sizes; what changes is which JPS value saturates each GPU.
The initial sweep (64 GB DRAM) used the cluster default memory configuration. Under that constraint, LMCache's CPU offload tier — which evicts KV blocks from GPU VRAM to DRAM — was itself DRAM-capacity-constrained. With only 64 GB available and a ~16 GB model occupying system memory, the effective offload pool was small, resulting in poor cache hit rates and amplified wait_for_save overhead with no compensating reuse benefit. LMCache was slower than FCFS on every GPU and JPS combination in the initial sweep.
The large-DRAM sweep tests the hypothesis: with sufficient DRAM, the offload pool ceases to be the bottleneck, and LMCache's DRAM tier can absorb enough evicted KV blocks to deliver net savings at moderate concurrency. We provision 400 GB DRAM on H200/H100, 200 GB on A100, and 100 GB on L40S — large enough to hold the entire KV working set of any tested load level. This experiment isolates whether the DRAM capacity or the wait_for_save per-step synchronization overhead is the primary factor limiting LMCache's benefit.
our_experiments/results/20260412_large_dram_sweep/ — H200 5/5, H100 5/5, A100 partial, L40S partial (this report)our_experiments/results/20260407_controlled-sweep/ — H200 15/15, H100 15/15, L40S 15/15, A100 15/15 (initial sweep, 64 GB DRAM)our_experiments/results/20260407_bottleneck/ — H200 28/28; H100 10/28, L40S 14/28, A100 0/28 (partial; nsys data in KV Transfer page supersedes)our_experiments/results/20260411_nsys-{h200,h100,a100,l40s}/ — NVTX profiling, 6 profiles per GPU (fcfs+lmcache × jps1/6/10), all 4 GPUs completeour_experiments/results/20260412_coalesce_sync/{H200,H100,A100}/ — coalesce sync validation, negative result; see KV Transfer §1.420260406_static-prompt_jps-sweep_d90s_tool0.5s/ — prefix cache artifactjps_sweep_v2/, full_sweep/, quick_sweep/ — same static-prompt family, kept on disk for reproducibility onlyCanonical documents: BOTTLENECK_REPORT.md (bottleneck ablation analysis), EXPERIMENTS.md (initial sweep, the "definitive experiment" per author note). The older experiments.md (lowercase) contains the retracted Exp 4 and should be treated as historical only.
The large-DRAM sweep covers JPS from 1 to 15 across four GPU types with large DRAM provisioned (400 GB on H200/H100, 200 GB on A100, 100 GB on L40S), measuring average job completion time (JCT). The key question is whether sufficient DRAM shifts the regime where LMCache is competitive. All data below is from our_experiments/results/20260412_large_dram_sweep/, seed=42.
With 141 GB VRAM, H200 rarely evicts KV blocks until JPS=6. LMCache wins narrowly at JPS=6 only (−4.2% vs Continuum), but Continuum dominates at all other load levels. At JPS=15, LMCache collapses to 684.8 s versus Continuum's 185.1 s — PCIe saturation overwhelms the DRAM benefit.
| JPS | FCFS Avg (s) | Continuum Avg (s) | LMCache Avg (s) | Winner |
|---|---|---|---|---|
| 1 | 8.55 | 8.67 | 9.75 (+14%) | FCFS |
| 3 | 28.11 | 20.67 | 24.29 | Continuum |
| 6 | 192.15 | 66.00 | 63.22 (−4.2% vs Continuum) | LMCache |
| 10 | 416.92 | 119.39 | 158.69 (+33%) | Continuum |
| 15 | 702.50 | 185.08 | 684.80 (+270%) | Continuum |
H100 with 80 GB VRAM is the sweet spot for LMCache. With 400 GB DRAM, KV eviction pressure is substantial enough at JPS 3–10 that LMCache's offload tier pays off. LMCache beats Continuum by −31% at JPS=3 and −34% at JPS=6. At JPS=15, LMCache collapses (689.5 s vs Continuum 259.5 s) due to PCIe congestion.
| JPS | FCFS Avg (s) | Continuum Avg (s) | LMCache Avg (s) | Winner |
|---|---|---|---|---|
| 1 | 9.01 | 8.75 | 10.21 (+13%) | Continuum |
| 3 | 68.85 | 39.75 | 27.51 (−31% vs Continuum) | LMCache |
| 6 | 244.07 | 100.38 | 65.85 (−34% vs Continuum) | LMCache |
| 10 | 456.30 | 172.89 | 165.19 (−4.5% vs Continuum) | LMCache |
| 15 | 739.24 | 259.52 | 689.47 (+166%) | Continuum |
A100's 40 GB VRAM forces heavy KV eviction even at JPS=1. LMCache delivers a meaningful win over FCFS at JPS=3 (−46%), but PCIe Gen4 x16 bandwidth limits scalability: at JPS=6, LMCache is already slower than FCFS. Continuum data is unavailable for the large-DRAM sweep; comparison is FCFS vs LMCache only.
| JPS | FCFS Avg (s) | LMCache Avg (s) | Winner (vs FCFS) |
|---|---|---|---|
| 1 | 45.00 | 70.52 (+57%) | FCFS |
| 3 | 415.14 | 223.40 (−46%) | LMCache |
| 6 | 943.54 | 1034.36 (+10%) | FCFS |
| 10 | — (timeout) | 2009.26 | — |
L40S large-DRAM sweep data is incomplete: only FCFS was measured across all JPS levels; Continuum data is absent; LMCache data is unavailable. The FCFS trend shows rapid saturation — at JPS=15, 29 jobs timed out, indicating the GPU is deeply overloaded. Full three-way comparison is pending.
| JPS | FCFS Avg (s) | Continuum Avg (s) | Notes |
|---|---|---|---|
| 1 | 134.55 | 68.55 | Continuum available at JPS=1 only |
| 3 | 501.28 | — | |
| 6 | 1045.63 | — | |
| 10 | 1834.86 | — | |
| 15 | 3374.62 | — | 29 timeouts |
| GPU | VRAM | DRAM | LMCache Sweet Spot | Peak Advantage vs Continuum |
|---|---|---|---|---|
| H100 | 80 GB | 400 GB | JPS 3–10 | −34% at JPS=6 |
| H200 | 141 GB | 400 GB | JPS=6 only | −4.2% at JPS=6 |
| A100 | 40 GB | 200 GB | JPS=3 only (vs FCFS) | −46% vs FCFS at JPS=3 |
| L40S | 48 GB | 100 GB | Data incomplete | — |
The large-DRAM sweep results reveal a VRAM-dependent benefit window for LMCache. This section explains the mechanism: why DRAM size matters, why it is not sufficient at very high concurrency, and how PCIe bandwidth creates a hard ceiling.
LMCache offloads evicted KV blocks from GPU VRAM to CPU DRAM via PCIe. Whether this offload is beneficial depends on three interacting factors: (1) DRAM capacity, (2) DRAM LRU eviction under concurrency, and (3) PCIe bandwidth saturation.
LMCache window: JPS=6 only
LMCache window: JPS 3–10
LMCache window: JPS=3 only
At very high JPS (≥15), PCIe becomes the binding constraint regardless of DRAM size. As the number of concurrent jobs grows, each vLLM step encounters more prefill requests in the batch, each triggering a separate wait_for_save DMA. The cumulative PCIe bandwidth consumed exceeds available PCIe throughput (H100: Gen5 x16 ≈ 128 GB/s; A100: Gen4 x16 ≈ 64 GB/s), causing DMA latency to spike. The measured consequence:
| GPU | JPS=15 Continuum (s) | JPS=15 LMCache (s) | LMCache overhead |
|---|---|---|---|
| H200 | 185.08 | 684.80 | +270% |
| H100 | 259.52 | 689.47 | +166% |
For the root cause analysis (slot_mapping.cuda(), cudaStreamSynchronize, nsys profiling, ablation experiments, and per-step overhead breakdown), see:
KV Transfer Deep Dive
Key finding: the per-step wait_for_save cost (~15 ms) is not the root cause. LMCache is profitable when DRAM cache hit rate is sufficient — see §6 for per-job trace evidence.
All data below use the large-DRAM sweep results. H200 and H100 are shown side by side for each load level. Breaking latency down by conversation turn reveals three structurally distinct failure modes — one per scheduling policy.
At JPS=3 on H200, all three policies remain below 4 s per turn. Continuum is the fastest throughout (933 ms at T1, plateauing near 2.5 s), LMCache occupies the middle band, and FCFS is the slowest. VRAM pressure is minimal at this load, so LMCache's offload overhead is visible but small. On H100, the picture reverses: LMCache is the fastest policy, starting at 1420 ms and holding a flat ~3 s plateau from T3 onwards. FCFS grows linearly to 10.5 s by T7, revealing the queuing penalty. Continuum shows a spike at T1 (5952 ms) driven by pin-allocation overhead, then drops to 3–4 s — still above LMCache. H100's smaller VRAM (80 GB) creates enough eviction pressure at JPS=3 that LMCache's DRAM offload pays off, a dynamic absent on H200.
At the saturation threshold the three shapes diverge structurally. On H200, LMCache (3.6–9.7 s range) significantly beats FCFS (6.8–34.3 s at T7). Continuum has its characteristic T1 spike (17.9 s) then drops to 4.6–7.7 s — the flattest tail. LMCache provides the lowest variance T1–T8 curve. On H100 the same pattern is amplified: FCFS reaches 45 s by T7 while LMCache stays 4.2–9.9 s. Continuum has a dramatic T2 spike (34.5 s — first full prefill competing for pinned VRAM slots) then drops to 5–18 s. The PCIe bus is not yet saturated at JPS=6 on either GPU, so LMCache's offload tier delivers meaningful savings.
At JPS=10, Continuum's pinned-VRAM advantage fully materialises. T3–T8 on both GPUs hold at 5–12.5 s while FCFS climbs to 72–88 s. The T2 spike grows to 54.9 s (H200) and 85.9 s (H100) — reflecting high admission contention for a smaller pinned pool. LMCache (7.5–26.8 s on H200) sits between Continuum and FCFS but shows acceleration at later turns: the PCIe bus is beginning to back up, and each turn's KV reload competes with concurrent onloads from other jobs. FCFS and LMCache are nearly parallel lines growing at the same rate, confirming that LMCache's benefit at this load level is only at early turns.
At full saturation, LMCache collapses. On H200, LMCache T5–T8 (106–149 s) exceeds FCFS T5–T8 (104–137 s) — a qualitative reversal. PCIe bus saturation makes KV onload itself the new bottleneck: each turn's reload queues behind concurrent onloads from other jobs, so LMCache becomes strictly worse than doing nothing. H100 mirrors the H200 pattern exactly: LMCache T7 reaches 152.9 s while FCFS T7 is 150.8 s. Continuum's T2 reaches 110.6 s (H200) and 155.5 s (H100), but T3–T8 remain at 5–14 s across both GPUs. Continuum's pinned-KV guarantee means T3–T8 are entirely decoupled from system load; no other policy achieves this.
The per-turn latency data across all four JPS levels identifies three structurally distinct failure modes — each produced by a different mechanism and visible in a different part of the latency curve.
T8/T1 ratio grows with JPS: 2.1× at JPS=3, 4.5× at JPS=6, 4.5× at JPS=10, 6.3× at JPS=15 (H200). Every turn recomputes the full context KV; as context grows with turn number and the batch is saturated, per-turn cost compounds. No structural ceiling — the failure mode is unbounded linear growth.
T8/T1 ratio is <1 at all JPS levels (H200: 2.5× at JPS=3, 0.27× at JPS=6, 0.26× at JPS=10, 0.19× at JPS=15). Later turns are faster than T1 because pinned KV eliminates recompute. The failure mode is front-loaded: T2 admission contention grows with JPS (up to 155 s on H100 JPS=15). Once past T2 the job is structurally decoupled from load.
T8/T1 stays moderate at JPS=3–6 (2.1–2.3× on H200, H100) but explodes to 10.4× at H200 JPS=15 (and 11.3× on H100). The PCIe bus saturates, onload serialises, and each turn waits behind the previous turn's KV transfer. Unlike FCFS (which always grows) LMCache has a bi-modal regime: flat below PCIe saturation, catastrophic above it.
The T8/T1 ratio measures how much a job's own later turns are penalised relative to its first turn — a proxy for intra-job starvation. Unlike inter-job starvation (early vs. late arrivals), intra-job spread directly determines whether a multi-turn user experience degrades over the course of a conversation.
| GPU / JPS | Policy | T1 (ms) | T8 (ms) | T8/T1 |
|---|---|---|---|---|
| H200 JPS=3 | FCFS | 1461 | 3118 | 2.1× |
| Continuum | 933 | 2307 | 2.5× | |
| LMCache | 1190 | 2737 | 2.3× | |
| H200 JPS=6 | FCFS | 6804 | 30812 | 4.5× |
| Continuum | 17877 | 4878 | 0.3× | |
| LMCache | 3655 | 7538 | 2.1× | |
| H200 JPS=10 | FCFS | 16019 | 72917 | 4.6× |
| Continuum | 19940 | 5237 | 0.3× | |
| LMCache | 7474 | 23312 | 3.1× | |
| H200 JPS=15 | FCFS | 21347 | 133549 | 6.3× |
| Continuum | 27922 | 5225 | 0.2× | |
| LMCache | 13851 | 144112 | 10.4× |
FCFS: T8/T1 grows monotonically with JPS. Each turn recomputes the full accumulated context, so the ratio tracks context-size growth. No mechanism exists to break the growth — the only relief is lower concurrency.
Continuum: T8/T1 < 1 at JPS ≥ 6 on both GPUs. Later turns benefit from pinned KV; the ratio falls below 1 because T8's incremental prefill (≈860 tokens) is much cheaper than T1's fresh admission. Intra-job spread is inverted — the conversation gets faster over time once past T2.
LMCache: T8/T1 stays 2.1–3.1× at JPS=3–10, then explodes to 10.4× (H200) and 11.3× (H100) at JPS=15. The collapse is non-linear: once PCIe bandwidth is saturated, each concurrent KV onload serialises behind all others, creating a cascading delay that compounds across turns. LMCache is the only policy with a bi-modal intra-job spread.
Using vLLM's per-step scheduler trace (SCHED_TRACE_PATH), we can track block-level KV cache behavior for individual jobs. The table below shows job_0020 from the H100 LMCache JPS=6 trace test (255 jobs, seed=42, 400 GB DRAM).
Theoretical best = (prev turn prompt + 20) / this turn prompt. All values use job_0020's actual counts.
| Turn | Prompt | Cacheable (prev turn) | Theoretical Best | Actual Hit | Actual Rate | Gap | Free Blk | Action |
|---|---|---|---|---|---|---|---|---|
| 1 | 92 | ~80 (sys prompt) | 87% | 80 | 87% | 0% | 22,748 | free |
| 2 | 1,845 | 112 (92+20) | 6.1% | 112 | 6.1% | 0% | 19,654 | free |
| 3 | 3,431 | 1,865 (1845+20) | 54.4% | 1,840 | 53.6% | -0.8% | 16,701 | free |
| 4 | 6,065 | 3,451 (3431+20) | 56.9% | 3,440 | 56.7% | -0.2% | 13,821 | free |
| 5 | 7,512 | 6,085 (6065+20) | 81.0% | 6,080 | 80.9% | -0.1% | 10,549 | free |
| 6 | 10,298 | 7,532 (7512+20) | 73.1% | 112 | 1.1% | -72.0% | 1,799 | free |
| 7 | 12,392 | 10,318 (10298+20) | 83.3% | 112 | 0.9% | -82.4% | 1,907 | free |
| 8 | 13,278 | 12,412 (12392+20) | 93.5% | 112 | 0.8% | -92.7% | 2,332 | free |
Turns 1–5 achieve near-theoretical hit (gap < 1%). Turn 6–8 should theoretically hit 73–93% but collapses to 1% — a gap of 72–92% representing entirely avoidable recompute due to DRAM LRU eviction under 255 concurrent jobs.
Turn 1 hits 87% (system prompt cached). Turn 2 drops to 6% (randomized tool output defeats prefix sharing). Turns 3–5 recover to 54%→57%→81% as LMCache reloads prior turns' KV from DRAM. The hit tokens match exactly: Turn 5 hits 6,080 tokens = Turn 1–4's total context.
Cache hit collapses to 1% (only system prompt). During the 0.5 s tool-call gap between Turn 5 and Turn 6, other concurrent jobs' offload traffic overwrites job_0020's DRAM entries via LRU eviction. Free blocks drop to 1,799 (VRAM 93% full). LMCache's DRAM tier cannot retain KV for all 255 concurrent jobs — this is the capacity wall.
In vLLM v1, action=free releases the block reservation. The KV data remains in VRAM's prefix cache until evicted. This is distinct from Continuum's action=pin, which keeps blocks unevictable across tool-call gaps. Three states: allocated (in use, unevictable) → cached (free but KV retained, evictable) → empty (no KV).
Each vLLM scheduler step processes a batch of requests. The KV cache operations differ by request phase and scheduling policy.
Owned by an active request. Cannot be evicted.
Released but KV data still in VRAM. Reusable but evictable.
No KV data. Available for allocation.
When a new request enters schedule(), the scheduler performs a two-tier cache lookup to determine how many tokens can be skipped.
Called via kv_cache_manager.get_computed_blocks(request).
Mechanism: Chain-hashed blocks checked against block pool's cached map. Stops at first miss = longest prefix match.
What can hit: Any completed request's blocks with matching token prefix (same-job prior turns, shared system prompt, etc.).
Returns: num_new_local_computed_tokens = blocks × 16. Max = prompt − 1.
Called only if KV connector configured. LMCache delegates to its engine.
Mechanism: LMCache checks DRAM for blocks beyond what VRAM found. The difference is loaded from DRAM to GPU during forward pass.
Returns: num_external_computed_tokens. FCFS/Continuum: always 0.
Example: Turn 5 (7,512 tokens). Tier 1 found 6,080 in VRAM. Tier 2: 0. Compute: 7,512 − 6,080 = 1,432 tokens (vs 7,512 without cache).
| Request Phase | Operation | Details |
|---|---|---|
| New request (first prefill) | allocate_slots + prefix_cache_hit |
Check prefix cache + DRAM, allocate remaining, preempt if needed. |
| Running decode | allocate_slots |
May need 1 new block per 16 tokens. |
| Preempted request | free |
All blocks freed. Must recompute on reschedule. |
When a request finishes, _free_request() dispatches based on policy:
| Condition | FCFS | LMCache | Continuum |
|---|---|---|---|
| Mid-job turn |
action=freeallocated → cached (evictable) |
wait_for_save → action=freeoffload to DRAM, then allocated → cached |
action=pinstays allocated (unevictable) for TTL |
| Last turn (is_last_step) |
action=freeSame as mid-job |
wait_for_save → action=freeSame (offloads for future reuse) |
action=freeNo pin — freed immediately |
| Policy | What happens to this job's KV | Next turn hit |
|---|---|---|
| FCFS | May be evicted. No backup. | Depends on prefix cache survival |
| LMCache | VRAM evictable + DRAM copy (also LRU evictable) | Hit if DRAM survives LRU (§6.1) |
| Continuum | Pinned. No eviction possible. | ~100% |
Dispatch in scheduler.py:_free_blocks(). FCFS → free; LMCache → wait_for_save + free; Continuum → pin.
KV offload/onload uses PCIe DMA (Copy Engine), not GPU compute (SM). They can theoretically run with zero interference.
132 units (H100). Forward pass kernels.
2–3 units. PCIe DMA transfers. Independent.
18 units (H100). GPU↔GPU for TP.
LMCache v0.3.7 (current):
PR #37160 SimpleCPUOffloadConnector:
// vLLM Step Loop
│
├── schedule() // scheduler.py:280
│ ├── Tier 1: kv_cache_manager.get_computed_blocks(req) // kv_cache_manager.py:153
│ │ └── coordinator.find_longest_cache_hit(block_hashes)
│ │ └── block_pool.get_cached_block(hash) // single_type_kv_cache_manager.py:252
│ │ chain hash: hash(parent_hash, token_ids, extra_keys) // kv_cache_utils.py:539
│ │
│ ├── Tier 2 (LMCache only):
│ │ └── connector.get_num_new_matched_tokens() // lmcache_connector.py:158
│ │ └── lmcache_engine.get_num_new_matched_tokens // vllm_v1_adapter.py:1123
│ │
│ ├── allocate_slots() // kv_cache_manager.py:196
│ └── _free_request() → _connector_finished() // scheduler.py:1481, 1657
│ └── _free_blocks() // scheduler.py:1514
│ ├── FCFS: kv_cache_manager.free(req)
│ ├── LMCache: connector.request_finished() + free
│ └── Continuum: pin_request(req, ttl) // scheduler.py:240
│
├── execute_model() // gpu_model_runner.py:2000
│ └── with maybe_get_kv_connector_output(): // kv_connector_model_runner_mixin.py:97
│ │
│ ├── ENTER: start_load_kv() // onload from DRAM (async, load_stream)
│ │
│ ├── model(input_ids, positions, ...) // forward pass (default stream, SM)
│ │ GPU simultaneously: SM=forward, CE=onload DMA
│ │
│ └── EXIT: wait_for_save() ⚠ CPU BLOCKS HERE
│ └── lmcache_connector.wait_for_save() // lmcache_connector.py:111
│ └── vllm_v1_adapter.wait_for_save() // vllm_v1_adapter.py:1019
│ │
│ ├── slot_mapping.cuda() // vllm_v1_adapter.py:1055
│ │ → PyTorch allocator GC → cudaStreamSynchronize(default)
│ │ Sync #1: waits for forward pass to finish
│ │
│ ├── lmcache_engine.store() // cache_engine.py:277
│ │ └── gpu_connector.batched_from_gpu() // gpu_connector.py:339
│ │ ├── with torch.cuda.stream(store_stream):
│ │ │ lmc_ops.multi_layer_kv_transfer() // gpu_connector.py:279
│ │ │ GPU→CPU DMA on Copy Engine
│ │ │
│ │ └── store_stream.synchronize() // gpu_connector.py:348
│ │ Sync #2: waits for DMA to complete
│ │
│ └── CPU blocked → cannot launch next step
│
└── // back to step loop → next schedule() + execute_model()
| # | Location | Source | What it blocks |
|---|---|---|---|
| 1 | slot_mapping.cuda() |
vllm_v1_adapter.py:1055 |
Allocator GC syncs default stream — waits for forward pass |
| 2 | store_stream.synchronize() |
gpu_connector.py:348 |
Waits for DMA copies on store_stream (~15 ms real transfer) |
| 3 | store_stream.synchronize() |
gpu_connector.py:309 |
Per-chunk sync for non-CUDA target (always for CPU offload) |
GPU can overlap SM+CE, but CPU blocks in finally: wait_for_save() and can't launch next step. GPU idles.
Pass wait_for_save=False so execute_model returns immediately. Sync in get_finished() after next schedule. See upstream PR #37160.
Detailed nsys profiling and ablation experiments are in: LMCache KV Transfer Pipeline — Internal Architecture
LMCache's benefit window is VRAM-dependent, not DRAM-dependent alone. With sufficient DRAM (400 GB), LMCache is competitive only when VRAM pressure forces meaningful eviction rates. The DRAM tier provides value only when eviction rate × cache hit rate × recompute cost exceeds the per-step wait_for_save overhead.
H100 with 400 GB DRAM delivers significant LMCache gains at JPS 3–10: −31% at JPS=3, −34% at JPS=6, −4.5% at JPS=10. The 80 GB VRAM creates sufficient eviction pressure without PCIe saturation at moderate load — the ideal operating regime for DRAM-backed KV offload.
H200's large VRAM suppresses eviction at most load levels, leaving LMCache with overhead but no cache benefit. LMCache barely wins at JPS=6 only (−4.2% vs Continuum). Continuum outperforms at all other JPS values. The H200 is not well-suited for LMCache unless the workload intentionally applies VRAM pressure.
A100's smaller VRAM creates heavy eviction pressure that LMCache can exploit — but only at JPS=3, where it beats FCFS by −46%. At JPS=6, LMCache is already slower than FCFS (+10%), because PCIe Gen4 x16 (≈64 GB/s) cannot keep up with the eviction rate, causing wait_for_save DMA latency to dominate over saved recompute.
Even with 400 GB DRAM, LMCache degrades catastrophically at JPS=15: H100 goes from 165 s (JPS=10, best) to 689 s (JPS=15), while Continuum remains at 260 s. H200 is equally severe: 684 s vs Continuum's 185 s. PCIe bandwidth saturation — not DRAM capacity — is the fundamental limit at extreme concurrency.
Use LMCache on 80 GB class GPUs (H100 or equivalent) at moderate concurrency (JPS 3–6 for this workload profile). Provision at least 200 GB DRAM to avoid offload-tier capacity constraints. Do not use LMCache on H200 or larger VRAM GPUs unless the workload is specifically designed to create high KV eviction pressure. Do not run LMCache at extreme concurrency (JPS ≥ 15) on any GPU — PCIe saturation causes severe performance collapse.
Large-DRAM Sweep: DRAM Size Determines LMCache Effectiveness | Four-GPU JPS Sweep | vllm-continuum
Authors: hlin464 (Georgia Tech) | Cluster: Georgia Tech PACE
Back to LMCache Deep Dive | KV Transfer Deep Dive | AI Infrastructure Overview
Large-DRAM Sweep 2026-04-12 | Last updated 2026-04-12