Attention Backends, Sampling Pipeline, CUDA Graphs, and Memory Management -- traced through actual source code from vllm/v1/worker/
GPUModelRunner FlashAttention FlashInfer Sampling CUDA Graphs
vLLM V1 restructures the worker-side execution around a persistent batch paradigm. Instead of rebuilding input tensors from scratch each iteration, the GPUModelRunner keeps a long-lived InputBatch on the CPU, only applying deltas (adds/removes) between steps. This eliminates the O(batch_size) per-step overhead that dominated V0.
execute_model() now returns None and stores ephemeral state in ExecuteModelState. The engine then calls sample_tokens(grammar_output) separately, allowing structured-output grammar bitmasks to be applied between the forward pass and sampling. This two-phase design enables overlap between GPU computation and CPU-side grammar processing.
- `vllm/v1/worker/gpu_model_runner.py` -- 6000+ lines. Contains GPUModelRunner: the main orchestrator for all V1 GPU inference.
- `vllm/v1/worker/gpu_input_batch.py` -- InputBatch: maintains CPU-side arrays for token IDs, sampling params, and block tables.
- `vllm/v1/worker/block_table.py` -- BlockTable and MultiGroupBlockTable: map virtual blocks to physical KV cache pages.
- `vllm/v1/worker/gpu_worker.py` -- Worker: initializes the device, loads the model, profiles memory, and delegates to GPUModelRunner.
The execute_model method is the heartbeat of each inference step. It receives a SchedulerOutput from the engine core and orchestrates the entire forward pass. The two-phase split (execute_model + sample_tokens) was introduced to support structured output grammars that need to compute bitmasks in parallel with the forward pass.
```python
# gpu_model_runner.py line 3755
def execute_model(
    self,
    scheduler_output: "SchedulerOutput",
    intermediate_tensors: IntermediateTensors | None = None,
) -> ModelRunnerOutput | AsyncModelRunnerOutput | IntermediateTensors | None:
    with record_function_or_nullcontext("gpu_model_runner: preprocess"):
        # Phase 1: Update persistent batch states
        deferred_state_corrections_fn = self._update_states(scheduler_output)
        # Phase 2: Build GPU input tensors from persistent batch
        logits_indices, spec_decode_metadata = self._prepare_inputs(
            scheduler_output, num_scheduled_tokens_np)
        # Phase 3: Build attention metadata for the chosen backend
        attn_metadata, spec_decode_common_attn_metadata = (
            self._build_attention_metadata(...))
    # Phase 4: Run the actual model forward
    with set_forward_context(attn_metadata, ...):
        model_output = self._model_forward(
            input_ids=input_ids,
            positions=positions,
            intermediate_tensors=intermediate_tensors,
            inputs_embeds=inputs_embeds, **model_kwargs)
    # Phase 5: Extract hidden states, compute logits
    sample_hidden_states = hidden_states[logits_indices]
    logits = self.model.compute_logits(sample_hidden_states)
    # Store state for sample_tokens() to pick up
    self.execute_model_state = ExecuteModelState(
        scheduler_output, logits, spec_decode_metadata, ...)
    return None  # Signal: call sample_tokens() next
```
```python
# gpu_model_runner.py line 372
class ExecuteModelState(NamedTuple):
    """Ephemeral cached state transferred between execute_model()
    and sample_tokens(), after execute_model() returns None."""

    scheduler_output: "SchedulerOutput"
    logits: torch.Tensor
    spec_decode_metadata: SpecDecodeMetadata | None
    spec_decode_common_attn_metadata: CommonAttentionMetadata | None
    hidden_states: torch.Tensor
    sample_hidden_states: torch.Tensor
    aux_hidden_states: list[torch.Tensor] | None
    ec_connector_output: ECConnectorOutput | None
    cudagraph_stats: CUDAGraphStat | None
    slot_mappings: dict[str, torch.Tensor] | None
```
```python
# gpu_model_runner.py line 4107
@torch.inference_mode
def sample_tokens(self, grammar_output: "GrammarOutput | None"):
    # Unpack ephemeral state from execute_model()
    (scheduler_output, logits, spec_decode_metadata, ...) = self.execute_model_state
    self.execute_model_state = None
    # Apply grammar bitmask if structured output is enabled
    if grammar_output is not None:
        apply_grammar_bitmask(scheduler_output, grammar_output,
                              self.input_batch, logits)
    # Actually sample
    sampler_output = self._sample(logits, spec_decode_metadata)
    # Async bookkeeping: update output_token_ids, num_computed_tokens
    self._update_states_after_model_execute(
        sampler_output.sampled_token_ids, scheduler_output)
```
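From the engine's perspective, the two phases pair up as a simple protocol: call `execute_model()`, get `None` back, then call `sample_tokens()` with whatever grammar output was computed in the meantime. A toy sketch of that contract (stub class and invented logits for illustration; this is not the engine's actual loop):

```python
# Illustrative sketch of the two-phase execute_model/sample_tokens protocol.
class ToyRunner:
    """Stand-in for GPUModelRunner: execute_model() stashes ephemeral
    state and returns None; sample_tokens() consumes it exactly once."""

    def __init__(self):
        self.execute_model_state = None

    def execute_model(self, scheduler_output):
        # The forward pass would run here; stash "logits" for phase two.
        self.execute_model_state = {"logits": [0.1, 0.7, 0.2]}
        return None  # signal: call sample_tokens() next

    def sample_tokens(self, grammar_output):
        # Unpack and clear the ephemeral state.
        state, self.execute_model_state = self.execute_model_state, None
        logits = state["logits"]
        if grammar_output is not None:
            # A grammar bitmask would mask disallowed tokens here.
            logits = [l if ok else float("-inf")
                      for l, ok in zip(logits, grammar_output)]
        return max(range(len(logits)), key=logits.__getitem__)

runner = ToyRunner()
assert runner.execute_model(scheduler_output=None) is None
# Token 1 has the highest logit but is masked by the "grammar", so 2 wins.
token = runner.sample_tokens(grammar_output=[True, False, True])
```

The point of the split is visible even in the sketch: any CPU work on `grammar_output` can happen between the two calls, overlapping with the GPU forward pass.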
InputBatch is one of V1's key optimizations. Rather than reconstructing every tensor from scratch each step, it maintains pre-allocated numpy/torch arrays sized to max_num_reqs x max_model_len. Requests are inserted/removed via index manipulation, and only the relevant slices are copied to GPU.
```python
# gpu_input_batch.py line 81
class InputBatch:
    def __init__(self, max_num_reqs, max_model_len, ...):
        # Token IDs: [max_num_reqs, max_model_len] -- NOT pinned (too large)
        self.token_ids_cpu_tensor = torch.zeros(
            (max_num_reqs, max_model_len), device="cpu", dtype=torch.int32,
            pin_memory=False)
        self.token_ids_cpu = self.token_ids_cpu_tensor.numpy()
        # Computed-token counts: [max_num_reqs] -- PINNED for fast GPU copy
        self.num_computed_tokens_cpu_tensor = torch.zeros(
            (max_num_reqs,), device="cpu", dtype=torch.int32,
            pin_memory=pin_memory)
        # Sampling parameters -- each has CPU + GPU mirrors
        self.temperature = torch.empty(
            (max_num_reqs,), dtype=torch.float32, device=device)
        self.temperature_cpu_tensor = torch.empty(
            (max_num_reqs,), dtype=torch.float32, device="cpu",
            pin_memory=pin_memory)
        self.temperature_cpu = self.temperature_cpu_tensor.numpy()
        # Top-p, top-k, penalties follow the same pattern:
        # GPU tensor + pinned CPU tensor + numpy view
        self.top_p = torch.empty(...)  # GPU
        self.top_p_cpu = ...           # numpy view of pinned CPU
        self.top_k = torch.empty(...)
        self.frequency_penalties = torch.empty(...)
        self.presence_penalties = torch.empty(...)
        self.repetition_penalties = torch.empty(...)
        # Track which reqs use which sampling type
        self.greedy_reqs: set[str] = set()
        self.random_reqs: set[str] = set()
```
```python
# gpu_input_batch.py line 316
def add_request(self, request: CachedRequestState) -> int:
    req_index = self._register_add_request(request)
    # Copy prompt token ids into the persistent buffer
    num_prompt_tokens = length_from_prompt_token_ids_or_embeds(
        request.prompt_token_ids, request.prompt_embeds)
    self.num_prompt_tokens[req_index] = num_prompt_tokens
    if request.prompt_token_ids is not None:
        self.token_ids_cpu[req_index, :num_prompt_tokens] = request.prompt_token_ids
    # Copy output token ids (for continued requests)
    start_idx = num_prompt_tokens
    end_idx = start_idx + len(request.output_token_ids)
    self.token_ids_cpu[req_index, start_idx:end_idx] = request.output_token_ids
    # Set sampling parameters at the request index
    self.temperature_cpu[req_index] = request.sampling_params.temperature
    self.top_p_cpu[req_index] = request.sampling_params.top_p
    self.top_k_cpu[req_index] = request.sampling_params.top_k
    ...
```
This pattern is repeated for every sampling parameter. The numpy view enables zero-copy writes on the CPU, while the pinned CPU tensor enables DMA transfers to the GPU without involving the CPU during the copy.
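The mirror pattern can be illustrated with plain numpy arrays standing in for both sides (in vLLM the CPU side is a pinned torch tensor whose `.numpy()` view shares memory, and the device copy is an async `non_blocking` transfer; the buffer names below mirror the excerpt but the values are made up):

```python
import numpy as np

MAX_NUM_REQS = 8

# CPU-side persistent buffer (in vLLM: a pinned torch tensor whose
# .numpy() view allows zero-copy scalar writes from Python).
temperature_cpu = np.zeros(MAX_NUM_REQS, dtype=np.float32)
# "GPU" mirror (in vLLM: a torch tensor living on the device).
temperature_gpu = np.empty(MAX_NUM_REQS, dtype=np.float32)

# Adding a request writes one scalar at its index -- no reallocation.
req_index = 3
temperature_cpu[req_index] = 0.8

# Each step, only the active prefix [0:num_reqs] is copied to the device.
num_reqs = 4
temperature_gpu[:num_reqs] = temperature_cpu[:num_reqs]
assert temperature_gpu[3] == np.float32(0.8)
```

The design choice is that writes stay O(1) per request and the per-step device copy is one contiguous slice, which is what makes the persistent-batch approach cheaper than rebuilding tensors.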
_prepare_inputs() is the performance-critical bridge between CPU state and GPU computation. It reads from the persistent InputBatch and produces exactly the tensors the model needs: input_ids, positions, attention metadata, and slot mappings.
```python
# gpu_model_runner.py line 1784
def _prepare_inputs(self, scheduler_output, num_scheduled_tokens):
    num_reqs = self.input_batch.num_reqs
    # OPTIMIZATION: Start block table GPU copy first (overlaps with CPU work)
    self.input_batch.block_table.commit_block_table(num_reqs)
    # Build request indices: [2, 5, 3] -> [0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
    req_indices = np.repeat(self.arange_np[:num_reqs], num_scheduled_tokens)
    # Cumulative tokens and per-token query position
    cu_num_tokens = self._get_cumsum_and_arange(
        num_scheduled_tokens, self.query_pos.np)
    # Positions: num_computed_tokens[req] + query_offset
    positions_np = (
        self.input_batch.num_computed_tokens_cpu[req_indices]
        + self.query_pos.np[:cu_num_tokens[-1]])
    # Flatten token_ids from 2D [req, pos] to 1D via index_select
    token_indices = positions_np + req_indices * max_model_len
    torch.index_select(
        self.input_batch.token_ids_cpu_tensor.flatten(), 0,
        torch.from_numpy(token_indices),
        out=self.input_ids.cpu[:total_num_scheduled_tokens])
    # Prepare query_start_loc for attention
    self.query_start_loc.np[0] = 0
    self.query_start_loc.np[1:num_reqs + 1] = cu_num_tokens
    self.query_start_loc.copy_to_gpu()
    # Compute slot_mapping via Triton kernel (GPU-side)
    self.input_batch.block_table.compute_slot_mapping(
        num_reqs, self.query_start_loc.gpu[:num_reqs + 1],
        self.positions[:total_num_scheduled_tokens])
```
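The index arithmetic can be reproduced in a few lines of numpy (toy sizes, invented buffer contents; the names mirror the excerpt):

```python
import numpy as np

max_model_len = 8
# Persistent 2D token buffer: [req, pos]. Row r holds request r's tokens;
# here row r simply contains the values r*8 .. r*8+7 so gathers are easy
# to verify by eye.
token_ids_cpu = np.arange(3 * max_model_len).reshape(3, max_model_len)

num_scheduled_tokens = np.array([2, 5, 3])  # tokens per request this step
num_computed_tokens = np.array([4, 0, 1])   # already-processed prefix lengths

# [2, 5, 3] -> [0, 0, 1, 1, 1, 1, 1, 2, 2, 2]
req_indices = np.repeat(np.arange(3), num_scheduled_tokens)

# Per-token offset within its request's scheduled chunk: [0,1, 0..4, 0..2]
query_pos = np.concatenate([np.arange(n) for n in num_scheduled_tokens])

# Absolute position of each token in its sequence
positions = num_computed_tokens[req_indices] + query_pos

# Flatten [req, pos] -> 1D gather indices, then one index_select-style gather
token_indices = positions + req_indices * max_model_len
input_ids = token_ids_cpu.ravel()[token_indices]

# Request 0 was scheduled 2 tokens at positions 4 and 5 -> tokens 4, 5
assert input_ids[:2].tolist() == [4, 5]
```

One vectorized gather replaces a Python loop over requests, which is the whole trick behind the flattened `token_indices` computation.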
The BlockTable maintains a 2D mapping of [request, block_index] to physical block IDs. The slot mapping kernel converts (position) into (block_id * block_size + offset) for each token, which tells the attention kernel exactly where to read/write KV cache entries.
```python
# block_table.py -- Triton kernel for slot mapping.
# _compute_slot_mapping_kernel computes, per token:
#   slot = block_table[req][position // block_size] * block_size
#          + position % block_size
# This runs on GPU to avoid a CPU-GPU sync on position data.

# block_table.py line 141
def compute_slot_mapping(self, num_reqs, query_start_loc, positions):
    _compute_slot_mapping_kernel[(num_reqs + 1,)](
        num_tokens, max_num_batched_tokens,
        query_start_loc, positions,
        self.block_table.gpu, self.block_table.gpu.stride(0),
        self.block_size, self.slot_mapping.gpu,
        PAD_ID=PAD_SLOT_ID)
```
```python
# block_table.py line 173 -- map_to_kernel_blocks.
# When allocation block size != kernel block size, e.g. 32-token
# allocation blocks backing 16-token kernel blocks:
#   kv_manager_block_id 0 -> kernel blocks [0, 1]
#   kv_manager_block_id 1 -> kernel blocks [2, 3]
kernel_block_ids = (
    kv_manager_block_ids.reshape(-1, 1) * blocks_per_kv_block
    + kernel_block_arange).reshape(-1)
```
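Both formulas can be checked with a toy numpy version (the block-table contents are invented for illustration):

```python
import numpy as np

block_size = 4
# block_table[req][i] = physical block backing the i-th logical block
block_table = np.array([[7, 2, 9]])  # one request, three logical blocks

# slot = block_table[req][pos // block_size] * block_size + pos % block_size
positions = np.array([0, 3, 4, 10])  # token positions for request 0
slots = (block_table[0, positions // block_size] * block_size
         + positions % block_size)
# pos 0 -> block 7, slot 28; pos 3 -> 31; pos 4 -> block 2, slot 8;
# pos 10 -> block 9, slot 38
assert slots.tolist() == [28, 31, 8, 38]

# map_to_kernel_blocks: one allocation block expands to two kernel blocks
blocks_per_kv_block = 2
kv_manager_block_ids = np.array([0, 1, 5])
kernel_block_arange = np.arange(blocks_per_kv_block)
kernel_block_ids = (
    kv_manager_block_ids.reshape(-1, 1) * blocks_per_kv_block
    + kernel_block_arange).reshape(-1)
assert kernel_block_ids.tolist() == [0, 1, 2, 3, 10, 11]
```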
vLLM V1 supports multiple attention backends through a clean abstraction layer. Each backend provides three components: an AttentionBackend (static capabilities), an AttentionMetadataBuilder (builds per-step metadata), and an AttentionImpl (the actual kernel dispatch).
```python
# flash_attn.py line 200
@dataclass
class FlashAttentionMetadata:
    num_actual_tokens: int          # Tokens excluding padding
    max_query_len: int              # Max query length in batch
    query_start_loc: torch.Tensor   # Cumulative query lengths [num_reqs+1]
    max_seq_len: int                # Max full sequence length
    seq_lens: torch.Tensor          # Per-request sequence lengths
    block_table: torch.Tensor       # Block table [num_reqs, max_blocks]
    slot_mapping: torch.Tensor      # Where to write K/V entries
    # Cascade attention (shared-prefix optimization)
    use_cascade: bool
    common_prefix_len: int
    cu_prefix_query_lens: torch.Tensor | None
    # FA3 AOT scheduling metadata (pre-computed work distribution)
    scheduler_metadata: torch.Tensor | None
    max_num_splits: int = 0         # 0 = use FA3 heuristics
```
```python
# flashinfer.py line 486
@dataclass
class FlashInferMetadata:
    num_actual_tokens: int
    slot_mapping: torch.Tensor
    q_data_type: torch.dtype
    num_decodes: int            # Number of decode requests
    num_decode_tokens: int
    num_prefills: int           # Number of prefill requests
    num_prefill_tokens: int
    # Separate dispatch paths for prefill and decode
    prefill: FIPrefill | TRTLLMPrefill | None
    decode: FIDecode | TRTLLMDecode | None
    # Cascade attention (entire batch, distinct from prefill/decode)
    use_cascade: bool
    cascade_wrapper: MultiLevelCascadeAttentionWrapper | None
```
| Feature | FlashAttention (FA2/FA3) | FlashInfer |
|---|---|---|
| Compute Capability | SM >= 8.0 (Ampere+) | SM 7.5 - 12.1 (Turing through Blackwell) |
| KV Cache Layout | (2, num_blocks, block_size, num_kv_heads, head_size) | (num_blocks, 2, block_size, num_kv_heads, head_size) |
| Block Sizes | MultipleOf(16) -- any multiple of 16 | Fixed: 16, 32, 64 |
| FP8 KV Cache | FA3 only (fp8_e4m3) | fp8_e4m3, fp8_e5m2 |
| CUDA Graph Support | FA3: ALWAYS; FA2: UNIFORM_BATCH | UNIFORM_BATCH (with TRTLLM decode) or UNIFORM_SINGLE_TOKEN_DECODE |
| KV Cache Update | Separate (reshape_and_cache_flash) | Separate (forward_includes_kv_cache_update = False) |
| Prefill/Decode Split | Unified varlen API handles both | Separate BatchPrefillWrapper + BatchDecodeWrapper |
| AOT Scheduling | FA3 only -- pre-computes work distribution | No AOT; uses TRTLLM fast_decode_plan |
| Cascade Attention | Via prefix KV + suffix KV split | Via MultiLevelCascadeAttentionWrapper |
| Attention Sinks | FA3 on SM >= 9.0 | Via TRTLLM on SM >= 10.0 (Blackwell) |
On Blackwell (SM 10.0), FlashInfer can use the TensorRT-LLM attention kernel for decoding, which provides better performance for uniform decode batches. The builder checks this at initialization:
```python
# flashinfer.py line 627
# Prefer TRTLLM attention for decoding in all cases.
# This allows AttentionCGSupport.UNIFORM_BATCH mode.
self.use_trtllm_decode_attention = can_use_trtllm_attention(
    self.num_qo_heads, self.num_kv_heads)

# The FlashInfer builder tracks separate wrappers per batch size for CUDA graphs
if self.enable_cuda_graph:
    self._decode_wrappers_cudagraph: dict[int, BatchDecodeWrapper] = {}
    self._decode_cudagraph_max_bs = (1 + num_spec_tokens) * max_num_reqs
```
```python
# flashinfer.py line 745
def _get_decode_wrapper(self, batch_size, use_cudagraph=False):
    if use_cudagraph:
        decode_wrapper = self._decode_wrappers_cudagraph.get(batch_size)
    else:
        decode_wrapper = self._decode_wrapper
    if decode_wrapper is None:
        decode_wrapper = BatchDecodeWithPagedKVCacheWrapper(
            self._get_workspace_buffer(),
            get_kv_cache_layout(),
            use_cuda_graph=use_cudagraph,
            paged_kv_indptr_buffer=paged_kv_indptr,  # Pre-allocated for CG
            paged_kv_indices_buffer=paged_kv_indices,
            paged_kv_last_page_len_buffer=paged_kv_last_page_len,
            use_tensor_cores=True)  # Always use tensor cores
    return decode_wrapper
```
The Sampler in V1 follows a strict ordering of operations to ensure correctness. Logprobs are computed before any modifications to logits, and the sampling itself uses a clean split between greedy and random paths.
```python
# sampler.py line 67
def forward(self, logits, sampling_metadata, ...):
    # Step 1: Compute raw logprobs BEFORE any modifications
    if num_logprobs is not None:
        if logprobs_mode == "raw_logprobs":
            raw_logprobs = self.compute_logprobs(logits)  # log_softmax
        elif logprobs_mode == "raw_logits":
            raw_logprobs = logits.to(torch.float32)
    # Step 2: Convert to float32
    logits = logits.to(torch.float32)
    # Steps 3-6: Apply processors and penalties
    logits = self.apply_logits_processors(logits, sampling_metadata, ...)
    # Step 7: Sample the next token
    sampled, processed_logprobs = self.sample(logits, sampling_metadata)
    sampled = sampled.long()  # FlashInfer returns int32; need int64 for indexing
    # Step 8: Gather top-k logprobs
    if num_logprobs:
        logprobs_tensors = self.gather_logprobs(raw_logprobs, num_logprobs, sampled)
    # Return as GPU tensors (async copy to CPU happens later)
    return SamplerOutput(
        sampled_token_ids=sampled.to(torch.int32).unsqueeze(-1),
        logprobs_tensors=logprobs_tensors)
```
```python
# sampler.py line 147
def sample(self, logits, sampling_metadata):
    greedy_sampled = None
    # Fast path: all greedy
    if not sampling_metadata.all_random:
        greedy_sampled = logits.argmax(dim=-1).view(-1)
        if sampling_metadata.all_greedy:
            return greedy_sampled, None
    # Apply temperature (in-place division);
    # avoid division by zero for greedy requests
    temp = sampling_metadata.temperature
    temp = torch.where(temp < 1e-5, 1.0, temp)
    logits.div_(temp.unsqueeze(dim=1))
    # Apply argmax-invariant logits processors (e.g., min_p)
    for processor in sampling_metadata.logitsprocs.argmax_invariant:
        logits = processor.apply(logits)
    # Top-k and/or top-p sampling via dedicated sampler
    random_sampled, processed_logprobs = self.topk_topp_sampler(
        logits, sampling_metadata.generators,
        sampling_metadata.top_k, sampling_metadata.top_p)
    # Merge: use greedy for temp < epsilon, random otherwise
    if greedy_sampled is None:
        return random_sampled, processed_logprobs
    sampled = torch.where(
        sampling_metadata.temperature < 1e-5,
        greedy_sampled, random_sampled, out=greedy_sampled)
    return sampled, processed_logprobs
```
```python
# sampler.py line 266 -- apply_logits_processors
def apply_logits_processors(self, logits, sampling_metadata, ...):
    # 1. Allowed token IDs whitelist (masked_fill_ with -inf)
    if sampling_metadata.allowed_token_ids_mask is not None:
        logits.masked_fill_(sampling_metadata.allowed_token_ids_mask, float("-inf"))
    # 2. Bad words exclusion
    if bad_words_token_ids:
        apply_bad_words(logits, bad_words_token_ids, output_token_ids)
    # 3. Non-argmax-invariant processors (min_tokens, logit_bias)
    for processor in sampling_metadata.logitsprocs.non_argmax_invariant:
        logits = processor.apply(logits)
    # 4. Penalties (repetition, frequency, presence)
    logits = self.apply_penalties(logits, sampling_metadata, output_token_ids)
    return logits
```
- **Greedy:** a simple argmax(dim=-1), used when temperature < 1e-5; no distribution sampling is needed. V1 computes it regardless (as a fallback for mixed batches), then uses torch.where to select per request.
- **Temperature:** an in-place logits.div_(temp.unsqueeze(1)). For greedy requests in mixed batches, temperature is set to 1.0 to avoid division by zero (the greedy result is used anyway via torch.where).
- **Top-k / top-p:** handled by TopKTopPSampler, which uses optimized kernels. Top-k first filters to k candidates, then top-p (nucleus) further filters by cumulative probability. Per-request generators support seeded sampling.
- **Penalties:** applied via apply_all_penalties(): repetition penalty multiplies logits of repeated tokens, frequency penalty subtracts proportionally to count, presence penalty subtracts on boolean presence. All use vectorized operations.
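The mixed-batch path can be sketched in numpy (invented logits; torch tensors and the fused kernels stand in for nothing fancier than these array ops):

```python
import numpy as np

EPS = 1e-5
logits = np.array([[2.0, 1.0, 0.5],   # request 0: greedy (temp 0)
                   [2.0, 1.0, 0.5]])  # request 1: random (temp 0.5)
temperature = np.array([0.0, 0.5])

# Presence penalty: subtract a constant from tokens already generated.
presence_penalty = np.array([0.0, 1.5])
output_token_mask = np.array([[0, 0, 0],
                              [1, 0, 0]], dtype=bool)  # req 1 emitted token 0
logits = logits - presence_penalty[:, None] * output_token_mask

# Guard against division by zero: greedy rows divide by 1.0 (result unused).
safe_temp = np.where(temperature < EPS, 1.0, temperature)
logits = logits / safe_temp[:, None]

greedy_sampled = logits.argmax(axis=-1)

# Softmax + per-request draw stands in for the top-k/top-p sampler.
rng = np.random.default_rng(0)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
random_sampled = np.array([rng.choice(len(p), p=p) for p in probs])

# Merge: greedy result where temperature ~ 0, sampled result elsewhere.
sampled = np.where(temperature < EPS, greedy_sampled, random_sampled)
assert sampled[0] == 0  # the greedy request always picks the argmax
```

Note how request 0's bogus division by 1.0 is harmless: its row of `random_sampled` is computed but discarded by the final `where`, exactly the trade the real sampler makes to stay fully vectorized.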
CUDA graphs eliminate kernel launch overhead by recording a sequence of GPU operations and replaying them as a single unit. vLLM V1 captures graphs for common batch configurations during warmup, then replays them during inference. The tricky part: attention metadata changes every step (different seq_lens, block_tables), so the system must carefully manage which fields are graph-compatible.
| CUDAGraphMode | Description | When Used |
|---|---|---|
| `NONE` | No CUDA graphs; eager execution | Profiling, first step with KV scale calibration |
| `FULL` | Entire forward pass captured in one graph | FA3 (`ALWAYS` support), uniform decode batches |
| `PIECEWISE` | Graph captured per-layer (torch.compile) | Mixed prefill-decode with FA2 |
```python
# gpu_model_runner.py line 6077
def _capture_cudagraphs(self, batch_descriptors, cudagraph_runtime_mode):
    for batch_desc in batch_descriptors:
        # Only capture ubatched graphs for FULL + uniform decode
        allow_microbatching = (
            self.parallel_config.use_ubatching
            and cudagraph_runtime_mode == CUDAGraphMode.FULL
            and uniform_decode
            and check_ubatch_thresholds(...))
        self._warmup_and_capture(
            batch_desc,
            cudagraph_runtime_mode=cudagraph_runtime_mode,
            allow_microbatching=allow_microbatching)
    torch.accelerator.synchronize()
```
```python
# Each attention backend declares its CUDA graph compatibility.

# FlashAttention FA3: full cudagraph for all cases
_cudagraph_support = AttentionCGSupport.ALWAYS

# FlashAttention FA2: only works for uniform decode batches
# (special max_query_len=1 packed-GQA handling breaks mixed prefill-decode)
_cudagraph_support = AttentionCGSupport.UNIFORM_BATCH

# FlashInfer: UNIFORM_BATCH with TRTLLM decode, else single-token decode only
if has_trtllm_support:
    return AttentionCGSupport.UNIFORM_BATCH
else:
    return AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE
```
```python
# flash_attn.py line 326 -- Pre-allocate scheduler_metadata for CUDA graphs
if self.use_full_cuda_graph and self.aot_schedule:
    # FA3 scheduler_metadata size: 1 + round_up(batch_size, 4) * 4.
    # The +1 is for tile_count_semaphore (synchronization);
    # 4 slots per batch element: prepare_varlen + dynamic_split
    # + sort_batches + head_swizzle
    max_batch_size = max(max_num_seqs, max_cudagraph_capture_size or 0)
    self.scheduler_metadata = torch.zeros(
        1 + round_up(max_batch_size, 4) * 4,
        dtype=torch.int32, device=self.device)
    # Upper bound on splits, for pre-allocation of intermediate buffers
    self.max_num_splits = (
        self.attention_config.flash_attn_max_num_splits_for_cuda_graph)
```
Setting max_num_splits > 1 during CUDA graph capture forces FA3 to pre-allocate larger intermediate buffers of shape [num_splits, num_heads, num_tokens, head_size]. This guarantees the captured graph has enough memory for any runtime configuration, at the cost of higher peak memory usage. Outside of graph capture, max_num_splits = 0 lets FA3 use its own heuristics.
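The sizing arithmetic from the excerpt can be written out as a runnable check (pure Python; `round_up` is a stand-in for vLLM's utility of the same name):

```python
def round_up(x: int, multiple: int) -> int:
    """Round x up to the nearest multiple (stand-in for vLLM's helper)."""
    return -(-x // multiple) * multiple

def scheduler_metadata_len(max_batch_size: int) -> int:
    # 1 slot for the tile_count_semaphore, plus 4 int32 slots per
    # (padded) batch element for the AOT scheduler's work distribution.
    return 1 + round_up(max_batch_size, 4) * 4

assert scheduler_metadata_len(5) == 1 + 8 * 4  # 5 pads to 8 -> 33 ints
assert scheduler_metadata_len(4) == 17
```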
Unlike FlashAttention, FlashInfer requires a separate decode wrapper per batch size when using
CUDA graphs. This is because the wrapper's plan() method pre-allocates fixed buffers sized
to the batch:
```python
# flashinfer.py line 568
if self.enable_cuda_graph:
    # One BatchDecodeWithPagedKVCacheWrapper per batch size
    self._decode_wrappers_cudagraph: dict[int, BatchDecodeWrapper] = {}
    self._decode_cudagraph_max_bs = (1 + num_spec_tokens) * max_num_reqs
```
vLLM's memory management follows a strict protocol: profile peak usage, calculate remaining budget,
then allocate the KV cache to fill exactly that budget. This happens during initialization
in Worker.determine_available_memory().
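The budget calculation itself reduces to one subtraction; a sketch with made-up byte counts (the real inputs come from `memory_profiling` and `profile_cudagraph_memory`, and the helper name here is hypothetical):

```python
GiB = 1024 ** 3

def kv_cache_budget(requested_bytes: int,
                    non_kv_cache_bytes: int,
                    cudagraph_estimate_bytes: int) -> int:
    """Whatever the profiled forward pass and CUDA graphs do not claim
    out of the gpu_memory_utilization budget goes to the KV cache."""
    budget = requested_bytes - non_kv_cache_bytes - cudagraph_estimate_bytes
    if budget <= 0:
        raise ValueError("model + graphs exceed the gpu_memory_utilization budget")
    return budget

# e.g. 72 GiB requested (0.9 * 80 GiB), 30 GiB weights + activations,
# 1 GiB estimated for captured CUDA graphs -> 41 GiB of KV cache
assert kv_cache_budget(72 * GiB, 30 * GiB, 1 * GiB) == 41 * GiB
```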
```python
# gpu_worker.py line 331
def determine_available_memory(self) -> int:
    # Execute a forward pass with dummy inputs to profile peak usage
    with memory_profiling(
        self.init_snapshot,
        weights_memory=int(self.model_runner.model_memory_usage),
    ) as profile_result:
        self.model_runner.profile_run()
    profile_torch_peak = torch.accelerator.memory_stats(self.device).get(
        "allocated_bytes.all.peak", 0)
    # Profile CUDA graph memory if graphs will be captured
    cudagraph_memory_estimate = 0
    if not self.model_config.enforce_eager and not current_platform.is_rocm():
        cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
    # Calculate the budget
    self.available_kv_cache_memory_bytes = (
        self.requested_memory
        - profile_result.non_kv_cache_memory
        - cudagraph_memory_estimate)
```
```python
# gpu_model_runner.py line 5712
def profile_run(self):
    # 1. Profile the multimodal encoder if applicable
    if self.supports_mm_inputs:
        dummy_encoder_outputs = self.model.embed_multimodal(**batched_dummy_mm_inputs)
        # Cache them to simulate peak encoder cache usage
        for i, output in enumerate(dummy_encoder_outputs):
            self.encoder_cache[f"tmp_{i}"] = output
    # 2. Run a dummy forward with max_num_tokens (worst-case activation memory)
    hidden_states, last_hidden_states = self._dummy_run(
        self.max_num_tokens, is_profile=True)
    # 3. Run a dummy sampler (for its memory overhead)
    if get_pp_group().is_last_rank:
        output = self._dummy_sampler_run(last_hidden_states)
    self._sync_device()
    del hidden_states, output
    self.encoder_cache.clear()
    gc.collect()
```
Executors sit between the engine core and workers, handling the distributed execution strategy. The hierarchy provides three deployment modes: single-GPU, multi-process (same node), and Ray-based (multi-node).
```python
# uniproc_executor.py line 26
class UniProcExecutor(Executor):
    def _init_executor(self):
        self.driver_worker = WorkerWrapperBase(rpc_rank=0)
        distributed_init_method, rank, local_rank = self._distributed_args()
        kwargs = dict(
            vllm_config=self.vllm_config,
            local_rank=local_rank, rank=rank,
            distributed_init_method=distributed_init_method,
            is_driver_worker=True)
        self.driver_worker.init_worker(all_kwargs=[kwargs])
        self.driver_worker.init_device()
        self.driver_worker.load_model()

    def execute_model(self, scheduler_output, non_block=False):
        # Direct method call -- no IPC overhead
        output = self.collective_rpc(
            "execute_model", args=(scheduler_output,),
            non_block=non_block, single_value=True)
        return output

# Async scheduling support: a ThreadPoolExecutor for output processing
if self.max_concurrent_batches > 1:
    self.async_output_thread = ThreadPoolExecutor(max_workers=1)
```
```python
# multiproc_executor.py line 96
class MultiprocExecutor(Executor):
    supports_pp: bool = True

    def _init_executor(self):
        tp_size, pp_size, pcp_size = self._get_parallel_sizes()
        assert self.world_size == tp_size * pp_size * pcp_size
        # Use the loopback address for local communication
        distributed_init_method = get_distributed_init_method(
            get_loopback_ip(), get_open_port())
        # Set up a MessageQueue for efficient SchedulerOutput broadcasting
        self.rpc_broadcast_mq = MessageQueue(
            self.world_size, self.local_world_size,
            max_chunk_bytes=max_chunk_bytes)
        # Spawn worker processes (one per GPU).
        # Each worker process runs Worker.init_device() + Worker.load_model();
        # communication happens via the shared-memory MessageQueue.
```
```python
# gpu_worker.py line 105
class Worker(WorkerBase):
    def __init__(self, vllm_config, local_rank, rank, ...):
        # Configure float32 matmul precision
        precision = envs.VLLM_FLOAT32_MATMUL_PRECISION
        torch.set_float32_matmul_precision(precision)

    def init_device(self):
        # Set the CUDA device, init distributed, take a memory snapshot
        self.device = torch.device(f"cuda:{self.local_rank}")
        init_worker_distributed_environment(...)
        # Construct the model runner
        self.model_runner = GPUModelRunner(self.vllm_config, self.device)
        # Take a baseline memory snapshot for the budget calculation
        self.init_snapshot = MemorySnapshot(device=self.device)
        self.requested_memory = request_memory(self.init_snapshot, self.cache_config)

    def load_model(self):
        with self._maybe_get_memory_pool_context(tag="weights"):
            self.model_runner.load_model()

    def determine_available_memory(self) -> int:
        # Profile the model, compute the KV cache budget
        ...
```
vLLM V1 supports a sleep mode where the worker offloads model weights to free GPU memory (e.g., for resource sharing). The Worker has two levels:
```python
# gpu_worker.py line 157
def sleep(self, level=1):
    # Level 1: offload weights only (tag="weights")
    # Level 2: save all buffers to CPU, offload everything
    if level == 2:
        self._sleep_saved_buffers = {
            name: buffer.cpu().clone()
            for name, buffer in model.named_buffers()}
    allocator = CuMemAllocator.get_instance()
    allocator.sleep(offload_tags=("weights",) if level == 1 else tuple())

def wake_up(self, tags=None):
    allocator.wake_up(tags)
    # Restore saved buffers
    for name, buffer in model.named_buffers():
        if name in self._sleep_saved_buffers:
            buffer.data.copy_(self._sleep_saved_buffers[name].data)
```
- **Persistent batch:** InputBatch maintains pre-allocated arrays across steps, and only deltas are applied -- dramatically reducing per-step CPU overhead compared to V0's per-step reconstruction.
- **Two-phase execution:** the execute_model() + sample_tokens() split allows grammar bitmask computation to overlap with the GPU forward pass, and enables future optimizations such as speculative-decoding interleaving.
- **Backend abstraction:** the clean AttentionBackend / MetadataBuilder / Impl separation means FlashAttention, FlashInfer, and future backends (e.g., Triton-based) can coexist. The system even supports different backends for different layer groups.
- **CPU-GPU overlap:** the block table GPU copy starts before CPU input prep finishes, sampling output copies to the CPU on a separate CUDA stream, and slot mapping runs as a Triton kernel on the GPU while the CPU computes positions.
- **CUDA graph modes:** three modes (NONE, FULL, PIECEWISE) adapt to backend capabilities. FA3 gets ALWAYS support; FA2 and FlashInfer fall back to UNIFORM_BATCH or single-token decode only.
- **Memory profiling:** memory is profiled by running the actual model at max batch size, then subtracting from the user's gpu_memory_utilization budget. CUDA graph memory is estimated separately to avoid double-counting.
| File | Key Classes/Functions | Lines Read |
|---|---|---|
| vllm/v1/worker/gpu_model_runner.py | GPUModelRunner, execute_model, _update_states, _prepare_inputs, _sample, _capture_cudagraphs | 6000+ |
| vllm/v1/worker/gpu_worker.py | Worker, init_device, determine_available_memory, sleep/wake_up | ~500 |
| vllm/v1/worker/gpu_input_batch.py | InputBatch, CachedRequestState, add_request, SamplingMetadata | ~350 |
| vllm/v1/worker/block_table.py | BlockTable, MultiGroupBlockTable, compute_slot_mapping, map_to_kernel_blocks | ~300 |
| vllm/v1/attention/backends/flash_attn.py | FlashAttentionBackend, FlashAttentionMetadata, FlashAttentionMetadataBuilder | ~600 |
| vllm/v1/attention/backends/flashinfer.py | FlashInferBackend, FlashInferMetadata, FlashInferMetadataBuilder, TRTLLM paths | ~1050 |
| vllm/v1/sample/sampler.py | Sampler, forward, sample, apply_temperature, apply_logits_processors | ~320 |
| vllm/v1/executor/uniproc_executor.py | UniProcExecutor, collective_rpc, execute_model | ~150 |
| vllm/v1/executor/multiproc_executor.py | MultiprocExecutor, MessageQueue IPC, FutureWrapper | ~150 |