Data structures, APIs, important files, and the role of the KV Cache Manager in vLLM’s scheduler.
LLM serving lives and dies by KV cache management.
Sections 2–4 dissect each layer bottom up.
Four classes form the foundation.
KVCacheBlockThe atom.
@dataclass(slots=True)
class KVCacheBlock:
block_id: int # 0 .. num_gpu_blocks-1 (null_block = 0)
ref_cnt: int = 0 # >0 means "someone is using me, don't evict"
_block_hash: BlockHashWithGroupId | None = None # hash key when block is full and cached
# Doubly-linked-list pointers (manipulated only by FreeKVCacheBlockQueue)
prev_free_block: "KVCacheBlock | None" = None
next_free_block: "KVCacheBlock | None" = None
is_null: bool = False # placeholder block; never cached, never freed
Lifecycle of a block:
FreeKVCacheBlockQueueA custom doubly-linked list.
Head = LRU (evicted first). Tail = most-recently-released.
Allocation: pop oldest candidate.
O(1) mid-queue removal.
Plain counter. ⚠ See §8.
BlockHashToBlockMapThe prefix cache's hash index.
class BlockHashToBlockMap:
# Stored on BlockPool as .cached_block_hash_to_block
_cache: dict[BlockHashWithGroupId, KVCacheBlock | dict[int, KVCacheBlock]]
def get_one_block(self, key) -> KVCacheBlock | None: ... # lookup
def insert(self, key, block): ... # cache_full_blocks
def pop(self, key, block_id) -> KVCacheBlock | None: ... # _maybe_evict_cached_block
def __len__(self) -> int: return len(self._cache) # ← true "how much content does CPU have?"
BlockPool (the container)Defined in block_pool.py:129.
class BlockPool:
def __init__(self, num_gpu_blocks, enable_caching, hash_block_size, ...):
self.num_gpu_blocks = num_gpu_blocks
self.blocks: list[KVCacheBlock] = [KVCacheBlock(i) for i in range(num_gpu_blocks)]
self.free_block_queue = FreeKVCacheBlockQueue(self.blocks)
self.cached_block_hash_to_block = BlockHashToBlockMap()
# null_block (block_id=0) is a sentinel; never cached, never freed
self.null_block = self.free_block_queue.popleft()
self.null_block.is_null = True
self.enable_caching = enable_caching
self.kv_event_queue: list[KVCacheEvent] = [] # for distributed KV event broadcast
Eight methods cover the entire scheduler surface.
| Method | Purpose | State change |
|---|---|---|
get_cached_block(hash, group_ids) |
Prefix-cache lookup | Read-only |
get_new_blocks(n) |
Allocate n blocks | popleft_n; may evict cached hashes |
touch(blocks) |
Increment ref_cnt on prefix hit | Remove from free queue if ref=0; ref+=1 |
cache_full_blocks(...) |
Mark full block cache-eligible | Insert into hash map |
free_blocks(ordered_blocks) |
Release on finish / preempt | ref -=1; append to free queue; hash stays |
_maybe_evict_cached_block(block) |
Evict hash on reuse | Pop from hash map; reset hash |
evict_blocks(block_ids) |
Explicit eviction | Same as above, per-block |
get_num_free_blocks() |
Statistics | Returns counter; ⚠ see §8 |
get_new_blocks()The primary allocation path.
def get_new_blocks(self, num_blocks: int) -> list[KVCacheBlock]:
if num_blocks > self.get_num_free_blocks():
raise ValueError(f"Cannot get {num_blocks} free blocks")
ret = self.free_block_queue.popleft_n(num_blocks) # head of LRU → oldest
if self.enable_caching:
for block in ret:
self._maybe_evict_cached_block(block) # drop hash if it had one
assert block.ref_cnt == 0
block.ref_cnt += 1 # take ownership
return ret
Eviction happens inline at allocation time.
touch()def touch(self, blocks: Sequence[KVCacheBlock]) -> None:
for block in blocks:
# ref_cnt=0 means sitting in free queue as eviction candidate
if block.ref_cnt == 0 and not block.is_null:
self.free_block_queue.remove(block) # O(1) mid-queue removal
block.ref_cnt += 1
This is where the doubly-linked list pays off.
Each layer adds one concern.
SingleTypeKVCacheManagerManages KV blocks for one cache group.
KVCacheCoordinatorCoordinates multiple groups (full + sliding + mamba).
class KVCacheCoordinator:
managers: list[SingleTypeKVCacheManager] # one per group
block_pool: BlockPool # shortcut to group 0 (or shared)
def allocate_slots(self, request, num_new_tokens) -> list[list[KVCacheBlock]] | None:
# Fans out across all groups; returns None if any group runs out
def find_longest_cache_hit(self, hashes, max_len) -> tuple[list[list[KVCacheBlock]], int]:
# Used by both GPU-side lookup and CPU-offload connector
KVCacheManagerThe Scheduler's only entry point.
Prefix lookup before prefill.
Allocate KV; None → OOM → preempt.
Release block chain on finish.
Per-group block-id list.
Per-step bookkeeping reset.
Exposes underlying BlockPool.
All paths relative to vllm/.
| File | ~LOC | What it defines | Called by |
|---|---|---|---|
v1/core/kv_cache_utils.py |
1700+ | KVCacheBlock, FreeKVCacheBlockQueue, hash helpers, BlockHash types |
block_pool.py, single_type manager |
v1/core/block_pool.py |
510 | BlockPool, BlockHashToBlockMap |
Coordinator, SingleType, CPU offload |
v1/core/single_type_kv_cache_manager.py |
800+ | SingleTypeKVCacheManager (per-group block allocation, hash assignment) |
Coordinator |
v1/core/kv_cache_coordinator.py |
400+ | KVCacheCoordinator, factory get_kv_cache_coordinator(...) |
KVCacheManager, CPU offload scheduler |
v1/core/kv_cache_manager.py |
600+ | KVCacheManager (scheduler-facing API) |
Scheduler |
v1/core/kv_cache_metrics.py |
~150 | KVCacheMetricsCollector (residency histograms, allocation events) |
BlockPool (optional injection) |
v1/kv_cache_interface.py |
~300 | KVCacheConfig, KVCacheGroupSpec, group specs (full / sliding / mamba) |
Engine init, factories |
v1/simple_kv_offload/manager.py |
700+ | SimpleCPUOffloadScheduler (wires a 2nd BlockPool against pinned host memory) |
SimpleCPUOffloadConnector |
v1/simple_kv_offload/worker.py |
~600 | SimpleCPUOffloadWorker (CUDA streams, GPU↔CPU DMA) |
SimpleCPUOffloadConnector |
distributed/kv_transfer/kv_connector/v1/simple_cpu_offload_connector.py |
~250 | SimpleCPUOffloadConnector (thin wrapper over scheduler/worker halves) |
Scheduler.connector |
vLLM instantiates the same BlockPool class twice.
vllm/v1/core/kv_cache_manager.pyKVCacheManager.__init__torch.zeros(...) on CUDA--gpu-memory-utilizationvllm/v1/simple_kv_offload/manager.py:110SimpleCPUOffloadScheduler initcpu_bytes_to_use_per_rank# vllm/v1/simple_kv_offload/manager.py:110
class SimpleCPUOffloadScheduler:
def __init__(self, vllm_config, kv_cache_config, cpu_capacity_bytes, lazy_offload=False):
self.cpu_kv_cache_config = self._derive_cpu_config(kv_cache_config, cpu_capacity_bytes)
self.num_cpu_blocks = self.cpu_kv_cache_config.num_blocks
# ← Same factory, different config
self.cpu_coordinator: KVCacheCoordinator = get_kv_cache_coordinator(
kv_cache_config=self.cpu_kv_cache_config,
max_model_len=vllm_config.model_config.max_model_len,
enable_caching=True, # prefix hash stays on
enable_kv_cache_events=self.enable_kv_cache_events,
hash_block_size=self.block_size,
)
self.cpu_block_pool: BlockPool = self.cpu_coordinator.block_pool
# GPU side reference, bound by the connector after Scheduler builds kv_cache_manager
self._gpu_block_pool: BlockPool | None = None
GPU BlockPool |
CPU BlockPool |
|
|---|---|---|
| Python class | Same — vllm.v1.core.block_pool.BlockPool |
|
| Free queue + hash index | Same | |
ref_cnt, touch, free, get_new_blocks |
Same API, same semantics | |
kv_cache_config |
GPU probe | _derive_cpu_config(...) (derived) |
| Where tensors live | CUDA device memory | Pinned host memory |
| Who allocates | GPU worker init_worker_distributed_environment |
SimpleCPUOffloadWorker via cudaHostAlloc |
| Who builds it | KVCacheManager on Scheduler start | SimpleCPUOffloadScheduler on connector init |
Every Scheduler.schedule() call follows a predictable KV-touch sequence.
self.kv_cache_manager.new_step_starts()Reset per-step bookkeeping. Top of schedule().
self.kv_cache_manager.block_pool.get_num_free_blocks()Read for tracing. Scheduler itself doesn't gate on it.
self.kv_cache_manager.get_computed_blocks(request)Prefix lookup on admission (waiting → running).
self.connector.get_num_new_matched_tokens(request, num_computed)Connector-layer prefix lookup on CPU pool.
self.kv_cache_manager.allocate_slots(request, num_new_tokens, ...)Allocate KV; None → preempt.
self.kv_cache_manager.free(preempted_req)Return victim's blocks to free queue. Hashes NOT evicted.
self.kv_cache_manager.get_block_ids(request_id)Produce block IDs for worker + offload connector.
self.kv_cache_manager.free(finished_req)Release finished request's chain (may be deferred by pin TTL).
# Inside Scheduler.schedule() — simplified
def schedule(self):
self.kv_cache_manager.new_step_starts() # 1
if sched_trace.enabled(): # 2 (tracing only)
free_blocks = self.kv_cache_manager.block_pool.get_num_free_blocks()
for request in self.waiting:
computed_blocks = self.kv_cache_manager.get_computed_blocks(request) # 3 local
if self.connector is not None:
n_ext, is_async = self.connector.get_num_new_matched_tokens( # 4 external (CPU)
request, num_computed=len(computed_blocks) * block_size
)
new_blocks = self.kv_cache_manager.allocate_slots(request, ...) # 5
if new_blocks is None:
victim = self._pick_preempt_victim()
self.kv_cache_manager.free(victim) # 6
new_blocks = self.kv_cache_manager.allocate_slots(request, ...) # retry
block_ids = self.kv_cache_manager.get_block_ids(request.request_id) # 7
scheduler_output.add(request, block_ids)
return scheduler_output
The agent-kvcache fork emits a step_snapshot event per schedule() call.
# vllm/v1/core/sched/scheduler.py:484 (simplified)
sched_trace.step_snapshot(
step=self._sched_step_counter,
running=self.running, waiting=self.waiting,
free_blocks=self.kv_cache_manager.block_pool.get_num_free_blocks(), # ⚠ see §8
total_blocks=self.cache_config.num_gpu_blocks,
cpu_free_blocks=cpu_block_pool.get_num_free_blocks() if cbp else None,
cpu_total_blocks=getattr(sm, "num_cpu_blocks", None),
num_pinned=len(self.pinned_requests),
pinned_blocks=pinned_blocks,
...
)
get_num_free_blocks()It's a plain counter of blocks in the free queue.
# The actual "blocks with usable cached content" count:
cpu_cached_blocks = len(cpu_block_pool.cached_block_hash_to_block)
# Three distinct numbers the viewer should show:
empty_blocks = free_count - cached_count # truly empty
cached_blocks = len(cached_block_hash_to_block) # has content, evictable
ref_blocks = total - free_count # ref_cnt > 0 (in-flight, active)
The trace schema currently emits only free/total. Adding cpu_cached_blocks fixes this.
The counter is correct for allocation but wrong for reporting. The word "free" is overloaded.
KVCacheBlock — one record per physical blockFreeKVCacheBlockQueue — O(1) doubly-linked LRU listBlockHashToBlockMap — hash → block indexBlockPool — owns all three; exposes full APIget_cached_block(hash, group_ids) — prefix lookupget_new_blocks(n) — allocate, may evicttouch(blocks) — increment ref on prefix hitfree_blocks(blocks) — decrement refget_num_free_blocks() — ⚠ free-queue lengthv1/core/kv_cache_utils.py — atom typesv1/core/block_pool.py — the poolv1/core/kv_cache_coordinator.py — group coordinatorv1/core/kv_cache_manager.py — scheduler-facing APIv1/simple_kv_offload/manager.py — CPU offload pool buildernew_step_starts() — per-step resetget_computed_blocks() — GPU prefix lookupallocate_slots() — allocate for new tokensfree() — release on preempt / finishget_block_ids() — produce scheduler outputSame class, different config. Both participate in prefix lookup.
get_num_free_blocks() counts cached-but-evictable as 'free'. Can under-count DRAM used by 400×.