KV Cache Strategies

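How each strategy handles the KV cache during tool-call gaps under high memory pressure (KV cache more than 90% full). Each strategy is shown twice: first for a short tool call that returns before the TTL expires, then for a long tool call that outlasts it.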

Short tool call: the tool returns before the TTL expires

1. FCFS: recompute on return

KV freed immediately. Blocks evicted by other requests → full recompute on return.

Speed: Slowest

Timeline markers: tool started, tool finished

VRAM: LLM inference (KV allocated) → tool executing, KV freed then evicted (blocks taken by others) → recompute ⚠ (re-prefill all tokens) → LLM inference (resume generation) → …

DRAM: unused
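The FCFS behavior above can be sketched with a toy block pool. This is a minimal illustration, not the scheduler of any real serving engine: `BlockPool`, `BLOCK_TOKENS`, and `tokens_to_recompute` are hypothetical names, and the block size of 16 tokens is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

BLOCK_TOKENS = 16  # assumed tokens per KV block (illustrative)


@dataclass
class BlockPool:
    """Toy VRAM pool: block id -> owning request id (None when free)."""
    owner: Dict[int, Optional[str]] = field(default_factory=dict)

    def allocate(self, request_id: str, n_blocks: int) -> List[int]:
        free = [b for b, o in self.owner.items() if o is None]
        if len(free) < n_blocks:
            raise MemoryError("not enough free KV blocks")
        taken = free[:n_blocks]
        for b in taken:
            self.owner[b] = request_id
        return taken

    def free(self, request_id: str) -> None:
        # FCFS frees the blocks as soon as the tool call starts.
        for b, o in list(self.owner.items()):
            if o == request_id:
                self.owner[b] = None

    def blocks_of(self, request_id: str) -> List[int]:
        return [b for b, o in self.owner.items() if o == request_id]


def tokens_to_recompute(pool: BlockPool, request_id: str, context_tokens: int) -> int:
    """Tokens that must be re-prefilled because their KV blocks are gone."""
    surviving = len(pool.blocks_of(request_id)) * BLOCK_TOKENS
    return max(0, context_tokens - surviving)


pool = BlockPool({i: None for i in range(8)})
pool.allocate("agent", 6)   # 6 blocks hold a 96-token context
pool.free("agent")          # tool started: KV freed immediately
pool.allocate("other", 8)   # under pressure, other requests take the blocks
to_recompute = tokens_to_recompute(pool, "agent", context_tokens=96)
```

With every block reassigned, `to_recompute` equals the full context length, which is the "re-prefill all tokens" step in the timeline.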

2. Continuum: VRAM pin with TTL

KV pinned in VRAM. Tool returns before TTL expires → cache hit, zero recompute.

Speed: ★ Fastest

Timeline markers: tool started, tool finished

VRAM: LLM inference (KV allocated) → tool executing, KV pinned ✓ (blocks in VRAM, not evictable) → LLM inference (cache hit, zero recompute) → …

DRAM: unused
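The TTL rule can be written as a small check. The sketch below is not Continuum's actual API: `PinnedKV` and `resume_outcome` are hypothetical names, and it assumes that under high memory pressure unpinned blocks are evicted right away, as the panels describe.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PinnedKV:
    pinned_at: float  # time the tool call started (seconds)
    ttl: float        # how long the blocks stay pinned (seconds)

    def is_pinned(self, now: float) -> bool:
        return now < self.pinned_at + self.ttl


def resume_outcome(pin: PinnedKV, tool_return_time: float,
                   context_tokens: int) -> Tuple[str, int]:
    """Return (outcome, tokens to re-prefill) when the tool returns."""
    if pin.is_pinned(tool_return_time):
        return ("cache_hit", 0)
    # Assumption: once unpinned under pressure, the blocks are already gone.
    return ("recompute", context_tokens)


pin = PinnedKV(pinned_at=0.0, ttl=30.0)
short_tool = resume_outcome(pin, tool_return_time=12.0, context_tokens=4096)
long_tool = resume_outcome(pin, tool_return_time=45.0, context_tokens=4096)
```

Here `short_tool` is a cache hit, while `long_tool` falls back to a full recompute because the tool outlasted the TTL.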

3. LMCache: DRAM offload + reactive reload

KV offloaded to DRAM, GPU freed. On tool return, reload from DRAM.

Speed: Medium

Timeline markers: tool started, tool finished

VRAM: LLM inference (KV allocated) → offload ↓ (GPU→CPU) → GPU memory freed (available for other requests) → reload ↑ (CPU→GPU) → LLM inference (restored, skip recompute) → …

DRAM: KV data held in CPU pinned memory
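The cost of a reactive reload is roughly the KV size divided by the host-to-device bandwidth. The sketch below estimates it under stated assumptions: the model shape (32 layers, 8 KV heads, head dimension 128, fp16) and the 25 GB/s effective bandwidth are illustrative values, not measurements, and the function names are hypothetical.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache: keys and values (factor 2) for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def transfer_seconds(n_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time to copy n_bytes across the CPU-GPU link at a given bandwidth."""
    return n_bytes / bandwidth_bytes_per_s


# Assumed, illustrative configuration.
size = kv_cache_bytes(tokens=8192, layers=32, kv_heads=8, head_dim=128)
reload_s = transfer_seconds(size, bandwidth_bytes_per_s=25e9)
print(f"{size / 2**30:.2f} GiB, reload takes about {reload_s * 1e3:.0f} ms")
```

With these assumed numbers the request waits for the whole copy after the tool returns, which is why the panel rates reactive reload between a cache hit and a full recompute.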

4. Predictive Prefetch (Ours): DRAM offload + proactive prefetch

KV offloaded to DRAM, GPU freed. Prefetch KV back before tool returns → zero wait.

Speed: ★ Fastest

Timeline markers: tool started, tool finished

VRAM: LLM inference (KV allocated) → offload ↓ (GPU→CPU) → GPU memory freed (available for others) → prefetch ↑ (at t−δ) → LLM inference (cache hit, zero wait) → …

DRAM: KV data held in CPU pinned memory
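The "prefetch at t−δ" step can be sketched as a timing calculation. This is one possible reading, not the authors' implementation: it assumes t is the predicted tool-return time and δ is the transfer time plus a safety margin. The names `prefetch_start` and `wait_on_return` are hypothetical.

```python
def prefetch_start(predicted_return: float, transfer_s: float,
                   margin_s: float = 0.0) -> float:
    """Start the CPU->GPU copy delta = transfer_s + margin_s before t."""
    return predicted_return - (transfer_s + margin_s)


def wait_on_return(predicted_return: float, actual_return: float,
                   transfer_s: float, margin_s: float = 0.0) -> float:
    """Seconds the request stalls after the tool actually returns."""
    start = prefetch_start(predicted_return, transfer_s, margin_s)
    # If the tool returns before the prefetch begins, fall back to a
    # reactive reload that starts at the return time.
    start = min(start, actual_return)
    return max(0.0, start + transfer_s - actual_return)


on_time = wait_on_return(predicted_return=10.0, actual_return=10.0,
                         transfer_s=0.5, margin_s=0.25)
early = wait_on_return(predicted_return=10.0, actual_return=9.5,
                       transfer_s=0.5, margin_s=0.25)
very_early = wait_on_return(predicted_return=10.0, actual_return=9.0,
                            transfer_s=0.5, margin_s=0.25)
```

When the prediction holds, `on_time` is zero. A tool that returns earlier than predicted waits for whatever part of the copy is still outstanding, and in the worst case (`very_early`) the wait equals a full reactive reload.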

Long tool call: the tool returns after the TTL expires

1. FCFS: recompute on return

KV freed immediately. Blocks evicted → full recompute on return.

Speed: Slowest

Timeline markers: tool started, tool finished

VRAM: LLM inference (KV allocated) → tool executing, KV freed then evicted (blocks taken by others) → recompute ⚠ (re-prefill all tokens) → LLM inference (resume generation) → …

DRAM: unused

2. Continuum: VRAM pin with TTL

KV pinned, but TTL expires before tool returns → blocks unpinned and evicted → must recompute.

Speed: Slowest

Timeline markers: tool started, TTL expired, tool finished

VRAM: LLM inference (KV allocated) → KV pinned ✓ (TTL active) → unpinned, then evicted (blocks taken by others) → recompute ⚠ (re-prefill all tokens) → LLM inference (resume generation) → …

DRAM: unused

3. LMCache: DRAM offload + reactive reload

KV safe in DRAM regardless of tool duration. On return, reload from DRAM.

Speed: Medium

Timeline markers: tool started, tool finished

VRAM: LLM inference (KV allocated) → offload ↓ (GPU→CPU) → GPU memory freed (available for other requests) → reload ↑ (CPU→GPU) → LLM inference (restored, skip recompute) → …

DRAM: KV data held in CPU pinned memory

4. Predictive Prefetch (Ours): DRAM offload + proactive prefetch

KV safe in DRAM. Prefetch completes before tool returns, giving the same performance as in the short tool case.

Speed: ★ Fastest

Timeline markers: tool started, tool finished

VRAM: LLM inference (KV allocated) → offload ↓ (GPU→CPU) → GPU memory freed (available for others) → prefetch ↑ (at t−δ) → LLM inference (cache hit, zero wait) → …

DRAM: KV data held in CPU pinned memory
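The two cases can be compared side by side with a small model of the delay each strategy adds when the tool returns. This is a simplified sketch of what the panels show, not benchmark code. The strategy labels, the function name `resume_delay`, and the timing values are all assumptions, and the prefetch branch assumes the return time is predicted accurately.

```python
def resume_delay(strategy: str, tool_s: float, ttl_s: float,
                 recompute_s: float, transfer_s: float) -> float:
    """Extra seconds before generation resumes after the tool returns."""
    if strategy == "fcfs":
        return recompute_s                      # always re-prefill
    if strategy == "vram_pin_ttl":
        return 0.0 if tool_s < ttl_s else recompute_s
    if strategy == "dram_reactive":
        return transfer_s                       # reload after return
    if strategy == "dram_prefetch":
        return 0.0                              # copy finished before return
    raise ValueError(f"unknown strategy: {strategy}")


STRATEGIES = ["fcfs", "vram_pin_ttl", "dram_reactive", "dram_prefetch"]

# Illustrative numbers only: 30 s TTL, 2 s recompute, 0.05 s reload.
for tool_s in (10.0, 120.0):
    delays = {s: resume_delay(s, tool_s, ttl_s=30.0,
                              recompute_s=2.0, transfer_s=0.05)
              for s in STRATEGIES}
    print(f"tool={tool_s:>5.0f}s", delays)
```

Only the TTL-pinned strategy changes between the two runs: it matches the prefetch strategy for the short tool call and degrades to the FCFS delay for the long one.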

Agentic KV Cache — Four Strategies