Agentic KV Cache — Four Strategies

How each strategy handles the KV cache during tool-call gaps under high memory pressure (KV cache usage >90%).

Scenarios: Short Tool Call (< TTL) | Long Tool Call (> TTL)
1. FCFS: recompute on return

KV freed immediately. Blocks evicted by other requests → full recompute on return.

Timeline markers: tool started, tool finished. Speed: Slowest
VRAM: LLM inference (KV allocated) → tool executing, KV freed then evicted (blocks taken by others) → recompute ⚠ (re-prefill all tokens) → LLM inference (resume generation)
DRAM: unused
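
The FCFS behavior above can be illustrated with a toy block pool. This is a minimal sketch, not any engine's actual allocator: `BlockPool`, `tokens_to_prefill_on_return`, the pool size of 10 blocks, and the 4096-token context are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BlockPool:
    """Toy fixed-size pool of KV blocks (hypothetical, not a real engine API)."""
    capacity: int
    owner: dict = field(default_factory=dict)  # block id -> request id

    def allocate(self, req: str, n: int) -> bool:
        """Give `req` n free blocks, or return False if too few are free."""
        free = [b for b in range(self.capacity) if b not in self.owner]
        if len(free) < n:
            return False
        for b in free[:n]:
            self.owner[b] = req
        return True

    def free(self, req: str) -> None:
        """Release every block held by `req`."""
        self.owner = {b: r for b, r in self.owner.items() if r != req}

    def holds(self, req: str) -> int:
        """Number of blocks currently held by `req`."""
        return sum(1 for r in self.owner.values() if r == req)


def tokens_to_prefill_on_return(pool: BlockPool, req: str, context_tokens: int) -> int:
    """If the request's blocks are gone, every context token is re-prefilled."""
    return 0 if pool.holds(req) > 0 else context_tokens


pool = BlockPool(capacity=10)
pool.allocate("agent", 9)   # 9 of 10 blocks in use: high memory pressure
pool.free("agent")          # tool call starts: KV freed immediately
pool.allocate("other", 9)   # a waiting request takes the freed blocks
recompute = tokens_to_prefill_on_return(pool, "agent", context_tokens=4096)
```

In this sketch the returning request finds none of its blocks, so the whole context counts as recompute work.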
2. Continuum: VRAM pin with TTL

KV pinned in VRAM. Tool returns before TTL expires → cache hit, zero recompute.

Timeline markers: tool started, tool finished. Speed: ★ Fastest
VRAM: LLM inference (KV allocated) → tool executing, KV pinned ✓ (blocks in VRAM, not evictable) → LLM inference (cache hit, zero recompute)
DRAM: unused
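
The TTL rule can be sketched as a simple check at tool-return time. This is an assumption-laden illustration rather than Continuum's implementation: `PinnedKV`, `resume_cost`, the 5-second TTL, and the choice to treat a return exactly at the TTL as a miss are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PinnedKV:
    """KV blocks pinned in VRAM at tool start, with a time-to-live in seconds."""
    pinned_at: float
    ttl: float

    def is_hit(self, return_time: float) -> bool:
        # Hit only if the tool returns before the TTL expires.
        # Treating the exact boundary as a miss is an arbitrary choice here.
        return (return_time - self.pinned_at) < self.ttl


def resume_cost(entry: PinnedKV, return_time: float, context_tokens: int) -> int:
    """Tokens to re-prefill on return: 0 on a hit, the full context after expiry."""
    return 0 if entry.is_hit(return_time) else context_tokens


entry = PinnedKV(pinned_at=0.0, ttl=5.0)
short_tool = resume_cost(entry, return_time=2.0, context_tokens=4096)  # < TTL
long_tool = resume_cost(entry, return_time=8.0, context_tokens=4096)   # > TTL
```

The long tool call falls back to the recompute path because the pin has expired and the blocks are treated as evicted.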
3. LMCache: DRAM offload + reactive reload

KV offloaded to DRAM, GPU freed. On tool return, reload from DRAM.

Timeline markers: tool started, tool finished. Speed: Medium
VRAM: LLM inference (KV allocated) → offload ↓ (GPU→CPU) → GPU memory freed (available for other requests) → reload ↑ (CPU→GPU) → LLM inference (restored, skip recompute)
DRAM: KV data held in CPU pinned memory
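
The cost of a reactive reload is roughly the KV size divided by the CPU→GPU transfer bandwidth. The sketch below is a back-of-envelope model and does not use LMCache's API: the function names, the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements), the 8192-token context, and the 25 GB/s bandwidth are all illustrative assumptions.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """Size of the KV cache: one K and one V vector per layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def reload_seconds(kv_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Time the request waits for a reload that starts only when the tool returns."""
    return kv_bytes / bandwidth_bytes_per_s


# Assumed model shape and context length (hypothetical values).
size = kv_cache_bytes(tokens=8192, layers=32, kv_heads=8,
                      head_dim=128, bytes_per_elem=2)
# Assumed effective host-to-device bandwidth of 25 GB/s.
wait = reload_seconds(size, bandwidth_bytes_per_s=25e9)
```

Under these assumptions the cache is 1 GiB, and the reload wait sits on the critical path after the tool finishes, which is why this strategy lands between recompute and a VRAM hit.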
4. Predictive Prefetch (Ours): DRAM offload + proactive prefetch

KV offloaded to DRAM, GPU freed. Prefetch KV back before tool returns → zero wait.

Timeline markers: tool started, tool finished. Speed: ★ Fastest
VRAM: LLM inference (KV allocated) → offload ↓ (GPU→CPU) → GPU memory freed (available for others) → prefetch ↑ (at t−δ) → LLM inference (cache hit, zero wait)
DRAM: KV data held in CPU pinned memory
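
The effect of starting the prefetch at t−δ can be sketched as a stall calculation. This is a simplified model under stated assumptions, not the authors' scheduler: `stall_on_return` and its parameters are hypothetical, the transfer time is taken as a known constant, and a tool that returns before the prefetch has started is assumed to trigger an immediate reload.

```python
def stall_on_return(predicted_return: float, actual_return: float,
                    delta: float, transfer_time: float) -> float:
    """Seconds the request waits for its KV after the tool actually returns.

    The prefetch is scheduled at predicted_return - delta. If the tool returns
    before that point, the transfer starts at the actual return instead.
    """
    prefetch_start = min(predicted_return - delta, actual_return)
    ready_at = prefetch_start + transfer_time
    return max(0.0, ready_at - actual_return)


# Accurate prediction with delta >= transfer time: KV is back before the return.
proactive = stall_on_return(10.0, 10.0, delta=0.05, transfer_time=0.04)
# delta = 0 reduces to a reactive reload: the full transfer is exposed.
reactive = stall_on_return(10.0, 10.0, delta=0.0, transfer_time=0.04)
# Tool returns far earlier than predicted: the prefetch has not started yet.
early = stall_on_return(10.0, 5.0, delta=0.05, transfer_time=0.04)
```

In this model the wait disappears only when δ covers the transfer time and the predicted return is not later than the actual one; a misprediction degrades toward the reactive reload case rather than toward recompute.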

Legend: LLM inference; KV pinned in VRAM; TTL expired → evicted; KV freed / evicted; DRAM offload / reload; Predictive prefetch; Recompute (costly); GPU memory available