An empirical study measuring iteration-level interference across 5 GPUs, revealing a binary step-function pattern governed by the per-step token budget
A modern LLM server receives hundreds of concurrent requests whose prompt lengths and output lengths vary by orders of magnitude. How should it pack them onto a single GPU without wasting compute?
Static batching: collect N requests, pad them to the length of the longest one, run them through the model together, and return all N responses only when the slowest has finished. The GPU sits idle while short requests wait on long ones. For typical chat workloads, utilization drops below 20%.
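To make the padding waste concrete, the sketch below computes what fraction of a padded batch holds real tokens. The batch lengths are hypothetical values chosen for illustration, not measurements from this study.

```python
def padded_utilization(lengths):
    """Fraction of token slots in a padded batch that hold real tokens."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    # Every request is padded to the longest one in the batch.
    padded_slots = max(lengths) * len(lengths)
    return sum(lengths) / padded_slots


# Hypothetical batch: seven short chat replies and one long one.
batch = [40, 55, 60, 70, 80, 90, 100, 2000]
print(f"useful fraction: {padded_utilization(batch):.1%}")
```

With one long straggler among short requests, most slots in this toy batch are padding, which is the effect described above.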
Continuous batching: re-schedule the batch at every single iteration, i.e., every time the model takes one forward pass to produce the next token. Finished requests leave the batch immediately; newly arrived requests can join at the very next iteration. No padding, no stragglers. This iteration-level scheduling is the foundation of vLLM, TGI, SGLang, TensorRT-LLM, and every modern LLM serving engine.
Why it matters: continuous batching is the reason a single H100 can serve hundreds of concurrent chat users. Compared to static batching, it delivers roughly 10–20× higher throughput on realistic workloads. Every production LLM endpoint you've ever hit — ChatGPT, Claude, Gemini, open-source endpoints — uses some form of it.
Every LLM request has two phases that live inside the same model weights but behave completely differently on GPU hardware:
Prefill: the model ingests the entire prompt (say, 2048 tokens) in one forward pass. All 2048 positions flow through every layer simultaneously, and self-attention is an O(L²) matrix multiply. This is dense arithmetic: a prefill of a few thousand tokens completely saturates the tensor cores. Compute-bound.
Decode: after prefill, the model generates the reply one token at a time. Each step only contributes 1 query row, but that query must attend over the entire past KV cache. The FLOPs are trivial, yet the KV cache is huge (hundreds of MB per sequence), so the GPU spends most of its time waiting on HBM reads. Memory-bandwidth-bound.
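As a rough check on the "hundreds of MB per sequence" figure, the sketch below estimates KV cache size from model shape. The model configuration (32 layers, 8 KV heads, head dimension 128, 2-byte values) is an assumption loosely modeled on an 8B-class model with grouped-query attention; the study does not state which model was served.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache held for one sequence of seq_len tokens."""
    # Factor 2: each layer stores one K tensor and one V tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


# Assumed model shape (hypothetical, for illustration only).
size = kv_cache_bytes(seq_len=2048, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{size / 2**20:.0f} MiB per 2048-token sequence")
```

Every decode step has to read this cache once per sequence to produce a single token, which is why decode is limited by memory bandwidth rather than arithmetic.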
To answer the question directly: yes, there is a fixed per-step token budget, and the scheduler fills it with a mix of prefill and decode work every iteration. Here is exactly what vLLM v1 does on each scheduling step:
When vLLM launches you set --max-num-batched-tokens; call this budget B. Typical values: B = 2048 on small GPUs, B = 4096–8192 on H100. Every iteration is allowed to process at most B tokens through the model, total, across all requests combined.
Running decode requests are scheduled first at 1 token each. Each prefill request then takes min(remaining_prompt, B − already_used) tokens from its prompt. If the remaining budget is smaller than the prompt, the prompt gets chunked and the rest is processed in later iterations; this is chunked prefill, always-on in vLLM v1.
The selected tokens (however many come from decode plus however many come from prefill) are concatenated into one flat 1D sequence of length ≤ B. A side array cu_seqlens records where each request's tokens start and end, so the attention kernel (FlashAttention / FlashInfer) knows not to let request A's tokens attend to request B's KV cache. From the model's point of view, it's just one big forward pass over a sequence of at most B tokens.
The whole flat batch goes through every transformer layer exactly once. Every decode request emits its 1 new token. Every prefill request advances by however many of its tokens were allocated this step (the whole prompt, or just a chunk). The loop repeats. This is why prefill and decode can coexist in one iteration — they're just different rows of the same flattened batch.
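The budget-filling steps above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not vLLM's scheduler code: the `Request` class, `schedule_step`, and `cu_seqlens` helper are hypothetical names, and real schedulers also handle KV cache limits, preemption, and priorities.

```python
from dataclasses import dataclass
from itertools import accumulate


@dataclass
class Request:
    req_id: str
    remaining_prompt: int  # prompt tokens not yet prefilled; 0 means decoding


def schedule_step(requests, budget):
    """Return {req_id: tokens this step}: decode first, then prefill chunks."""
    plan = {}
    used = 0
    # Decode requests each contribute exactly one token.
    for r in requests:
        if r.remaining_prompt == 0 and used < budget:
            plan[r.req_id] = 1
            used += 1
    # Prefill requests take whatever budget is left, chunking if needed.
    for r in requests:
        if r.remaining_prompt > 0 and used < budget:
            take = min(r.remaining_prompt, budget - used)
            plan[r.req_id] = take
            used += take
    return plan


def cu_seqlens(plan):
    """Cumulative offsets marking each request's slice of the flat batch."""
    return [0, *accumulate(plan.values())]


# 32 running decodes plus one new 2048-token prompt, budget B = 8192.
reqs = [Request(f"d{i}", 0) for i in range(32)] + [Request("new", 2048)]
plan = schedule_step(reqs, budget=8192)
offsets = cu_seqlens(plan)
print(plan["new"], offsets[-1])
```

The last offset is the total number of rows in the flat batch, so it can never exceed the budget passed to `schedule_step`.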
H100, B = 8192. 32 decode requests are running, so 32 tokens are already reserved. A user sends a new request with a 2048-token prompt. Remaining budget = 8192 − 32 = 8160 ≥ 2048, so the entire prompt fits into this single iteration. The flat batch has shape [2080, hidden_dim] (2048 prefill rows stacked on top of 32 decode rows), one forward pass handles everything, and both the new user's first output token and every existing decode user's next token come out together. Total cost ≈ a prefill-dominated iteration — still small, because 2080 tokens is well below what the H100 can chew in one shot.
Now imagine a burst: 16 users each send a 4096-token prompt at the same moment. Total prefill demand = 65,536 tokens, but B is only 8192. The scheduler chunks it — iteration 1 takes ~8160 prefill tokens, iteration 2 takes 8160 more, and so on for roughly 8 iterations. During those 8 iterations, every decode user is stuck behind a wall of prefill compute and their per-token latency blows up. That cost — how badly decode latency spikes during a prefill burst, and how it depends on B, the GPU, and the burst size — is exactly what Sections 3–5 measure.
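The iteration count for the burst follows from simple arithmetic, sketched below. The function name is hypothetical, and the calculation assumes every decode stream keeps its 1-token reservation for the whole burst.

```python
import math


def burst_iterations(num_prefill, prefill_len, budget, num_decode):
    """Iterations needed to drain a prefill burst, and total prefill tokens."""
    per_step = budget - num_decode  # prefill tokens that fit in one iteration
    if per_step <= 0:
        raise ValueError("decode requests already consume the whole budget")
    total = num_prefill * prefill_len
    return math.ceil(total / per_step), total


iters, total = burst_iterations(num_prefill=16, prefill_len=4096,
                                budget=8192, num_decode=32)
print(iters, total)
```

For these inputs the burst fills eight iterations completely and leaves a 256-token remainder for a ninth, which matches the "roughly 8 iterations" estimate.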
Launch num_decode streaming requests. Wait for all to enter decode state. Sleep 2s for steady state.
Trace captures per-step batch composition and iteration timing.
Send num_prefill requests of prefill_len tokens (max_tokens=1). Trace captures mixed iterations.
Verify decode returns to baseline.
Cancel decode streams → drain KV cache → probe verify → next condition.
| Variable | RTX 6000 / 5090 | L40S | H100 / H200 |
|---|---|---|---|
| prefill_len | 128, 512, 1024, 2048 | 128, 512, 1024, 2048 | 128, 512, 1024, 2048, 4096 |
| num_prefill | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4, 8, 16 |
| num_decode | 0, 4, 8, 12 | 0, 8, 16, 32 | 0, 8, 16, 32, 64, 128 |
interference_pct = (mean_mixed - mean_baseline) / mean_baseline × 100%
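The metric above translates directly into code. The sketch below assumes the inputs are lists of per-iteration times in milliseconds; the sample values are hypothetical, not measurements from this study.

```python
from statistics import mean


def interference_pct(mixed_ms, baseline_ms):
    """Percent increase of mean mixed-iteration time over mean baseline."""
    base = mean(baseline_ms)
    if base <= 0:
        raise ValueError("baseline mean must be positive")
    return (mean(mixed_ms) - base) / base * 100.0


# Hypothetical per-iteration timings in ms, for illustration only.
baseline = [10.0, 11.0, 10.5, 10.5]
mixed = [40.0, 44.0]
print(f"{interference_pct(mixed, baseline):+.0f}%")
```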
The central finding: prefill-decode interference is not gradual — it is a step function governed by whether the prefill workload fits within a single iteration's token budget.
Below threshold: prefill tokens fit in one iteration alongside decode. Overhead is negligible (<10%). The GPU processes both in a single forward pass.
Above threshold: prefill overflows into multiple iterations. Each overflow iteration is dominated by prefill compute, and iteration time rises by +200% to +1716% over the baseline (~13-70ms).
H100. Baseline decode: ~8-19ms. Threshold at ~8K total prefill tokens. Can absorb 4×512 (2K tokens) with only ~5% overhead.
How to read this table. 32 decode streams are running in the background. We then inject a burst of num_prefill concurrent prefill requests, each with a prompt of prefill_len tokens. Each cell shows how much the decode iteration time jumps (Δ%) relative to the pure-decode baseline (~11 ms on H100). Rows = how many prefill requests arrive at once; columns = how long each prefill prompt is. Total prefill tokens per burst = row × column.
| num_prefill ↓ prefill_len → | 128 | 512 | 1024 | 2048 | 4096 |
|---|---|---|---|---|---|
| 1 req | <5% | <5% | <5% | ~7% | ~40% |
| 2 req | <5% | <5% | ~8% | ~50% | ~250% |
| 4 req | <5% | ~5% | ~30% | 349% | 710% |
| 8 req | <5% | ~20% | ~90% | 580% | 835% |
| 16 req | ~5% | ~70% | ~200% | 660% | 836% |
Reading the diagonal: 1 req × 2048 = 2K total prefill tokens → +7% (negligible). 4 req × 2048 = 8K total → jumps to +349%. 16 req × 4096 = 65K total → +836%. The transition is sharp: once total prefill tokens exceed the H100's ~8K per-step budget, chunked prefill can no longer hide the work behind a single iteration, and decode latency jumps by 3–10×.
H200. Nearly identical to H100 in compute behavior (same sm_90 architecture). Slightly lower baseline (17ms vs 19ms at 128 decode streams) due to higher memory bandwidth.
| num_decode | H100 baseline | H100 mixed | H100 Δ% | H200 baseline | H200 mixed | H200 Δ% |
|---|---|---|---|---|---|---|
| 8 | 8.4ms | 45.3ms | +443% | 7.0ms | 43.1ms | +514% |
| 32 | 10.6ms | 47.5ms | +349% | 9.1ms | 39.4ms | +333% |
| 64 | 13.3ms | 43.0ms | +224% | 11.7ms | 26.0ms | +123% |
| 128 | 19.1ms | 28.4ms | +49% | 16.8ms | 23.0ms | +36% |
L40S. Baseline: ~25-29ms. Threshold at ~4K tokens. Notably stable baseline with very low variance.
| num_decode | baseline | mixed | Δ% |
|---|---|---|---|
| 8 | 24.9ms | 100.4ms | +304% |
| 16 | 26.3ms | 115.6ms | +340% |
| 32 | 29.3ms | 106.2ms | +262% |
RTX 5090. Baseline: ~13ms. Fast compute processes prefill quickly, resulting in lower absolute interference. Same threshold pattern at ~4K tokens.
| num_decode | baseline | mixed | Δ% |
|---|---|---|---|
| 4 | 12.5ms | 90.6ms | +623% |
| 8 | 12.9ms | 115.8ms | +799% |
| 12 | 13.2ms | 102.5ms | +674% |
RTX 6000. The most severely affected GPU. Baseline: ~47-94ms. Even 1×128 (128 tokens) causes >60% interference. Maximum: 4×2048 → +2903%.
| GPU | Arch | BF16 TFLOPS | Mem BW | Baseline | Threshold | Max Δ% |
|---|---|---|---|---|---|---|
| RTX 6000 | Turing | 16 | 562 GB/s | ~70ms | ~4K tokens | +2903% |
| RTX 5090 | Blackwell | ~400 | 1792 GB/s | ~13ms | ~4K tokens | +874% |
| L40S | Ada | 362 | 652 GB/s | ~29ms | ~4K tokens | +402% |
| H100 | Hopper | 989 | 3031 GB/s | ~19ms | ~8K tokens | +897% |
| H200 | Hopper+ | 989 | 4267 GB/s | ~17ms | ~8K tokens | +1006% |
RTX 6000/5090/L40S threshold ≈ 4K tokens. H100/H200 ≈ 8K tokens. Directly reflects the per-step token budget that vLLM's chunked prefill scheduler allows.
H200 (4267 GB/s) → 17ms baseline. RTX 6000 (562 GB/s) → 70ms. Decode is memory-bound, so baseline scales inversely with bandwidth.
Prefill hurts decode (+200-2900%), but decode barely affects prefill (<10% TTFT increase under load). The interference is one-directional.
Example: H100, num_decode=128, 16×4096 prefill injection
| # | Hypothesis | Verdict |
|---|---|---|
| H1 | Mixed iterations are slower than pure decode | Confirmed |
| H2 | Interference scales linearly with total prefill tokens | Rejected — binary step function, not linear |
| H3 | Interference is in forward pass, not scheduler | Confirmed — schedule_ms ≈ 0.1-0.5ms constant |
| H4 | Decode recovers to baseline after prefill | Confirmed — within 5% on all GPUs |
| H5 | Faster GPUs have similar relative interference | Partial — same binary pattern, but threshold varies with token budget |
A 4096-token chunk is invisible on H100 but catastrophic on RTX 6000. The optimal chunk size must scale with GPU compute throughput. max_num_batched_tokens is the single most important knob.
Our data quantifies the cost that PD disaggregation (Splitwise, DistServe) eliminates. On H100 with 128 concurrent decodes, a single 16×4096 burst adds 168ms to every decode step; summed over the 128 streams, that is 128 × 168ms = 21.5s of cumulative user-facing delay per affected step.
The interference is binary — either below threshold (free) or above (catastrophic). An admission controller that limits concurrent prefill to stay below the per-step budget would eliminate interference entirely.
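One way such an admission controller could look is sketched below. This is a hypothetical design, not something evaluated in this study: the class and method names are invented, prompts are admitted in FIFO order, and a prompt too large to ever fit in one step is admitted alone (it will still be chunked and will still interfere).

```python
from collections import deque


class PrefillAdmissionController:
    """Admit queued prompts only while their total fits the per-step budget."""

    def __init__(self, budget):
        self.budget = budget
        self.queue = deque()  # (req_id, prompt_len) in arrival order

    def submit(self, req_id, prompt_len):
        self.queue.append((req_id, prompt_len))

    def admit(self, num_decode):
        """Pop prompts for the next step without exceeding the budget."""
        room = self.budget - num_decode  # decode streams hold 1 token each
        admitted = []
        while self.queue and self.queue[0][1] <= room:
            req_id, prompt_len = self.queue.popleft()
            admitted.append(req_id)
            room -= prompt_len
        if not admitted and self.queue:
            # Head prompt exceeds a whole step's room: admit it alone so it
            # does not starve, accepting that it will be chunked.
            admitted.append(self.queue.popleft()[0])
        return admitted


ctrl = PrefillAdmissionController(budget=8192)
for i in range(4):
    ctrl.submit(f"p{i}", 2048)
first = ctrl.admit(num_decode=32)
second = ctrl.admit(num_decode=32)
print(first, second)
```

In this example the 4×2048 burst is split across two steps instead of overflowing one, at the price of a later first token for the deferred prompt.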
Sweep --max-num-batched-tokens values (512–8192) to map the exact Pareto curve of prefill_latency vs decode_interference.