Prefill-Decode Interference in Continuous Batching

An empirical study measuring iteration-level interference across 5 GPU architectures — revealing a binary step-function pattern governed by the per-step token budget

Llama 3.1 8B · 5 GPUs · RTX 6000 → H200 · vLLM V1 · FCFS · Georgia Tech PACE

1. Background & Motivation

1.1 From Static Batching to Continuous Batching

A modern LLM server receives hundreds of concurrent requests whose prompt lengths and output lengths vary by orders of magnitude. How should it pack them onto a single GPU without wasting compute?

Static batching (the old way)

Collect N requests, pad them to the length of the longest one, run them through the model together, and return all N responses only when the slowest has finished. The GPU sits idle while short requests wait on long ones. For typical chat workloads, utilization drops below 20%.

Continuous batching (Orca, OSDI 2022)

Re-schedule the batch at every single iteration — i.e., every time the model takes one forward pass to produce the next token. Finished requests leave the batch immediately; newly-arrived requests can join the batch at the very next iteration. No padding, no stragglers. This iteration-level scheduling is the foundation of vLLM, TGI, SGLang, TensorRT-LLM, and every modern LLM serving engine.

Why it matters: continuous batching is the reason a single H100 can serve hundreds of concurrent chat users. Compared to static batching, it delivers roughly 10–20× higher throughput on realistic workloads. Every production LLM endpoint you've ever hit — ChatGPT, Claude, Gemini, open-source endpoints — uses some form of it.

1.2 Prefill vs Decode: Two Very Different Workloads

Every LLM request has two phases that live inside the same model weights but behave completely differently on GPU hardware:

Prefill — ingesting the prompt

The model ingests the entire prompt (say, 2048 tokens) in one forward pass. All 2048 positions flow through every layer simultaneously, and self-attention is an O(L²) matrix multiply. This is dense arithmetic — a prefill of a few thousand tokens completely saturates the tensor cores. Compute-bound.

Decode — generating output tokens

After prefill, the model generates the reply one token at a time. Each step only contributes 1 query row, but that query must attend over the entire past KV cache. The FLOPs are trivial, yet the KV cache is huge (hundreds of MB per sequence), so the GPU spends most of its time waiting on HBM reads. Memory-bandwidth-bound.
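A back-of-envelope check of the "hundreds of MB per sequence" figure, as a minimal sketch. The layer count, KV-head count, and head dimension are the published Llama 3.1 8B configuration, and 2-byte (BF16) storage is an assumption. None of these values is stated on this page.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes of KV cache held by one sequence of seq_len tokens."""
    # Factor 2: every layer stores one key and one value vector per KV head per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return seq_len * per_token


per_token_kib = kv_cache_bytes(1) / 1024
seq_mib = kv_cache_bytes(2048) / 2**20
print(f"{per_token_kib:.0f} KiB per token, {seq_mib:.0f} MiB for a 2048-token sequence")
# prints: 128 KiB per token, 256 MiB for a 2048-token sequence
```

Each decode step has to stream this cache from HBM for every running sequence while adding only one new query row, which is why the step is limited by bandwidth rather than arithmetic.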

The opportunity continuous batching tries to exploit: Prefill saturates compute but barely touches memory bandwidth. Decode saturates memory bandwidth but barely touches compute. In principle, if you run them in the same iteration they use orthogonal resources — you get prefill work done "for free" while the GPU would otherwise sit idle waiting on decode's memory loads. That is the core promise. Whether it actually holds up in practice is what this page measures.

1.3 How Prefill and Decode Share a Single Iteration

vLLM enforces a fixed per-step token budget, and the scheduler fills it with a mix of prefill and decode work every iteration. Here is what vLLM V1 does on each scheduling step:

① The token budget is fixed at server start

When vLLM launches you set --max-num-batched-tokens; call this budget B. Typical values: B = 2048 on small GPUs, B = 4096–8192 on H100. Every iteration is allowed to process at most B tokens through the model, total, across all requests combined.

② Each iteration, the scheduler fills the budget

  • Walk the running queue. For each decoding request, reserve exactly 1 token (the next token to be generated). If 128 decodes are running, that's 128 tokens out of B.
  • Walk the waiting queue. For each pending prefill request, assign up to min(remaining_prompt, B − already_used) tokens from its prompt. If the remaining budget is smaller than the prompt, the prompt gets chunked and the rest is processed in later iterations — this is chunked prefill, always-on in vLLM v1.
  • Stop when the budget is exhausted or both queues are empty.
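The three bullets above can be sketched as a small pure function. This is an illustration of the budget-filling rule only, not vLLM's scheduler code: the `Request` class and `schedule_step` name are hypothetical, and the sketch ignores KV-cache block allocation, preemption, and any cap on concurrent sequences.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    remaining_prompt: int  # prompt tokens not yet prefilled; 0 means the request is decoding


def schedule_step(running, waiting, budget):
    """Return {req_id: tokens scheduled this step}; the total never exceeds budget."""
    plan = {}
    used = 0
    # 1) Decodes first: every running request reserves exactly one token.
    for req in running:
        if used >= budget:
            break
        plan[req.req_id] = 1
        used += 1
    # 2) Then prefills in arrival (FCFS) order, chunked to whatever budget is left.
    for req in waiting:
        if used >= budget:
            break
        chunk = min(req.remaining_prompt, budget - used)
        plan[req.req_id] = chunk
        used += chunk
    return plan


running = [Request(f"d{i}", 0) for i in range(32)]
waiting = [Request("p0", 2048)]
plan = schedule_step(running, waiting, budget=8192)
print(plan["p0"], sum(plan.values()))  # 2048 2080
```

With 32 decodes and one 2048-token prompt the whole prompt fits, giving a 2080-token step. A prompt that does not fit simply receives a smaller chunk and waits for later steps to finish.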

③ Flatten everything into a single tensor

The selected tokens, however many come from decode plus however many come from prefill, are concatenated into one flat 1D sequence of length ≤ B. A side array cu_seqlens records where each request's tokens start and end, so the attention kernel (FlashAttention / FlashInfer) never lets request X's tokens attend to request Y's KV cache. From the model's point of view, it is just one big forward pass over a sequence of at most B tokens.
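A minimal sketch of how such a cumulative-offset array can be built from the per-request token counts. The helper name and the toy step are hypothetical. Real kernels take the offsets as an integer tensor on the GPU rather than a Python list.

```python
from itertools import accumulate


def build_cu_seqlens(tokens_per_request):
    """Cumulative offsets: request i owns rows cu[i]:cu[i + 1] of the flat batch."""
    return [0, *accumulate(tokens_per_request)]


# Hypothetical step: three decodes (1 token each) plus one 5-token prefill chunk.
cu = build_cu_seqlens([1, 1, 1, 5])
print(cu)  # [0, 1, 2, 3, 8]

spans = [(cu[i], cu[i + 1]) for i in range(len(cu) - 1)]
print(spans)  # [(0, 1), (1, 2), (2, 3), (3, 8)]
```

The last entry equals the total number of rows in the flat batch, so it can never exceed the token budget.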

④ One forward pass handles all of it

The whole flat batch goes through every transformer layer exactly once. Every decode request emits its 1 new token. Every prefill request advances by however many of its tokens were allocated this step (the whole prompt, or just a chunk). The loop repeats. This is why prefill and decode can coexist in one iteration — they're just different rows of the same flattened batch.

Concrete example

H100, B = 8192. 32 decode requests are running, so 32 tokens are already reserved. A user sends a new request with a 2048-token prompt. Remaining budget = 8192 − 32 = 8160 ≥ 2048, so the entire prompt fits into this single iteration. The flat batch has shape [2080, hidden_dim] (2048 prefill rows stacked on top of 32 decode rows), one forward pass handles everything, and both the new user's first output token and every existing decode user's next token come out together. Total cost ≈ a prefill-dominated iteration — still small, because 2080 tokens is well below what the H100 can chew in one shot.

Now imagine a burst: 16 users each send a 4096-token prompt at the same moment. Total prefill demand = 65,536 tokens, but B is only 8192. The scheduler chunks it — iteration 1 takes ~8160 prefill tokens, iteration 2 takes 8160 more, and so on for roughly 8 iterations. During those 8 iterations, every decode user is stuck behind a wall of prefill compute and their per-token latency blows up. That cost — how badly decode latency spikes during a prefill burst, and how it depends on B, the GPU, and the burst size — is exactly what Sections 3–5 measure.
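The iteration count for the burst can be checked with a few lines of arithmetic. The sketch assumes the decode pool stays at 32 streams for the whole burst (finished prefills do not join it, as is the case for the max_tokens=1 prefills in the protocol) and that prompts can be split across steps at any token boundary. The function name is hypothetical.

```python
def steps_to_drain(prompt_lens, num_decode, budget):
    """Iterations needed to prefill every prompt while num_decode streams keep decoding."""
    per_step = budget - num_decode  # prefill tokens that fit in each step
    if per_step <= 0:
        raise ValueError("decode alone fills the budget")
    total = sum(prompt_lens)
    return -(-total // per_step)  # ceiling division


print(steps_to_drain([4096] * 16, num_decode=32, budget=8192))  # 9
print(steps_to_drain([2048], num_decode=32, budget=8192))       # 1
```

The result is eight full 8160-token steps plus a ninth carrying the remaining 256 tokens, which matches the "roughly 8 iterations" estimate.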

The core question this page answers: When both workloads share the same iteration, how much does prefill slow down decode? Is the interference gradual (smoothly worse as prefill grows) or threshold-based (fine until some cliff, then catastrophic)? Does decode recover immediately after the prefill burst ends, or does it stay degraded?

2. Experimental Protocol

Phase 1: PRIME

Launch num_decode streaming requests. Wait for all to enter decode state. Sleep 2s for steady state.

Phase 2: BASELINE — 3s of pure decode

Trace captures per-step batch composition and iteration timing.

Phase 3: INJECT — send prefill requests

Send num_prefill requests of prefill_len tokens (max_tokens=1). Trace captures mixed iterations.

Phase 4: RECOVERY — 3s post-injection

Verify decode returns to baseline.

Phase 5: CLEANUP

Cancel decode streams → drain KV cache → probe verify → next condition.

Variables

Variable | RTX 6000 / 5090 | L40S | H100 / H200
prefill_len | 128, 512, 1024, 2048 | 128, 512, 1024, 2048 | 128, 512, 1024, 2048, 4096
num_prefill | 1, 2, 4 | 1, 2, 4, 8 | 1, 2, 4, 8, 16
num_decode | 0, 4, 8, 12 | 0, 8, 16, 32 | 0, 8, 16, 32, 64, 128
Interference metric: interference_pct = (mean_mixed - mean_baseline) / mean_baseline × 100%
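The metric translates directly into code. This sketch assumes the trace has already been split into pure-decode and mixed iterations. The timings below are made up for illustration and are not measurements.

```python
from statistics import mean


def interference_pct(baseline_ms, mixed_ms):
    """Relative increase of mean iteration time during injection over pure decode, in percent."""
    base = mean(baseline_ms)
    return (mean(mixed_ms) - base) / base * 100.0


# Made-up per-iteration timings in milliseconds, for illustration only.
baseline = [10.0, 10.0, 10.0]
mixed = [40.0, 50.0, 60.0]
print(interference_pct(baseline, mixed))  # 400.0
```

A value of 400.0 means mixed iterations take five times as long as baseline iterations on average.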

3. The Binary Interference Effect

The central finding: prefill-decode interference is not gradual — it is a step function governed by whether the prefill workload fits within a single iteration's token budget.

Below Threshold

Prefill tokens fit in one iteration alongside decode. Overhead is negligible (<10%). The GPU processes both in a single forward pass.

Above Threshold

Prefill overflows into multiple iterations. Each overflow iteration is dominated by prefill compute. Iteration time rises above the baseline (~13-70ms) by +200% to +1716%.

[Figure: Interference Pattern (Schematic). x-axis: Total Prefill Tokens, 128 to 65K; y-axis: interference, 0% to 1500%; the threshold is marked on the curve.]

4. Per-GPU Results

4.1 H100 (Hopper, 80 GB, 989 BF16 TFLOPS)

Baseline decode: ~8-19ms. Threshold at ~8K total prefill tokens. With 32 decode streams, a 4×512 burst (2K tokens) costs only ~5% overhead, while 4×2048 (8K tokens) already costs +349%.

Interference Heatmap — H100, num_decode=32

How to read this table. 32 decode streams are running in the background. We then inject a burst of num_prefill concurrent prefill requests, each with a prompt of prefill_len tokens. Each cell shows how much the decode iteration time jumps (Δ%) relative to the pure-decode baseline (~11 ms on H100). Rows = how many prefill requests arrive at once; columns = how long each prefill prompt is. Total prefill tokens per burst = row × column. Green ≈ harmless, red ≈ catastrophic.

num_prefill ↓ / prefill_len → | 128 | 512 | 1024 | 2048 | 4096
1 req | <5% | <5% | <5% | ~7% | ~40%
2 req | <5% | <5% | ~8% | ~50% | ~250%
4 req | <5% | ~5% | ~30% | 349% | 710%
8 req | <5% | ~20% | ~90% | 580% | 835%
16 req | ~5% | ~70% | ~200% | 660% | 836%

Reading the diagonal: 1 req × 2048 = 2K total prefill tokens → +7% (negligible). 4 req × 2048 = 8K total → jumps to +349%. 16 req × 4096 = 65K total → +836%. The transition is sharp: once total prefill tokens exceed the H100's ~8K per-step budget, chunked prefill can no longer hide the work behind a single iteration, and decode latency jumps by 3–10×.
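The threshold rule can be written as a one-line comparison. The sketch assumes B = 8192 on the H100, the value used in the Section 1.3 example, since the measured configuration is described only as "~8K". The function name is hypothetical.

```python
def prefill_regime(num_prefill, prefill_len, num_decode, budget):
    """Classify a burst as 'fits' (one iteration) or 'overflows' (several iterations)."""
    total_prefill = num_prefill * prefill_len
    free = budget - num_decode  # budget left after one token per decode stream
    return "fits" if total_prefill <= free else "overflows"


for n, length in [(1, 2048), (4, 2048), (16, 4096)]:
    print(n, length, prefill_regime(n, length, num_decode=32, budget=8192))
# 1 2048 fits
# 4 2048 overflows
# 16 4096 overflows
```

This predicts where the cliff sits, not the size of Δ%. The heatmap shows that some cells below the budget, such as 2×2048 at ~50%, are already noticeably slower than baseline.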

4.2 H200 (Hopper+, 141 GB, 989 TFLOPS, 4267 GB/s)

Nearly identical to H100 in compute behavior (same sm_90 architecture). Slightly lower baseline (17ms vs 19ms) due to higher memory bandwidth.

4×2048 Interference — H100 vs H200

num_decode | H100 baseline | H100 mixed | H100 Δ% | H200 baseline | H200 mixed | H200 Δ%
8 | 8.4ms | 45.3ms | +443% | 7.0ms | 43.1ms | +514%
32 | 10.6ms | 47.5ms | +349% | 9.1ms | 39.4ms | +333%
64 | 13.3ms | 43.0ms | +224% | 11.7ms | 26.0ms | +123%
128 | 19.1ms | 28.4ms | +49% | 16.8ms | 23.0ms | +36%

4.3 L40S (Ada Lovelace, 48 GB, 362 TFLOPS)

Baseline: ~25-29ms. Threshold at ~4K tokens. Notably stable baseline with very low variance.

4×2048 Interference — L40S

num_decode | baseline | mixed | Δ%
8 | 24.9ms | 100.4ms | +304%
16 | 26.3ms | 115.6ms | +340%
32 | 29.3ms | 106.2ms | +262%

4.4 RTX 5090 (Blackwell, 32 GB, ~400 TFLOPS)

Baseline: ~13ms. Fast compute processes prefill quickly, resulting in lower absolute interference. Same threshold pattern at ~4K tokens.

4×2048 Interference — RTX 5090

num_decode | baseline | mixed | Δ%
4 | 12.5ms | 90.6ms | +623%
8 | 12.9ms | 115.8ms | +799%
12 | 13.2ms | 102.5ms | +674%

4.5 RTX 6000 (Turing, 24 GB, 16.3 TFLOPS)

The most severely affected GPU. Baseline: ~39-94ms. Even 1×128 (128 tokens) causes >60% interference. Maximum: 4×2048 → +2903%.

4×2048 Interference — RTX 6000

num_decode | baseline | mixed | Δ%
4 | 39ms | 1160ms | +2903%
8 | 47ms | 1101ms | +2234%
12 | 94ms | 1558ms | +1557%

5. Cross-GPU Comparison

GPU | Arch | BF16 TFLOPS | Mem BW | Baseline | Threshold | Max Δ%
RTX 6000 | Turing | 16 | 562 GB/s | ~70ms | ~4K tokens | +2903%
RTX 5090 | Blackwell | ~400 | 1792 GB/s | ~13ms | ~4K tokens | +874%
L40S | Ada | 362 | 652 GB/s | ~29ms | ~4K tokens | +402%
H100 | Hopper | 989 | 3031 GB/s | ~19ms | ~8K tokens | +897%
H200 | Hopper+ | 989 | 4267 GB/s | ~17ms | ~8K tokens | +1006%

Key Observations

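Threshold Tracks the Token Budget

RTX 6000/5090/L40S threshold ≈ 4K tokens. H100/H200 ≈ 8K tokens. This directly reflects the per-step token budget that vLLM's chunked prefill scheduler allows rather than raw compute: the RTX 6000 (16 TFLOPS) and the RTX 5090 (~400 TFLOPS) share the same ~4K threshold.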

Baseline Falls with Bandwidth

H200 (4267 GB/s) → 17ms baseline. RTX 6000 (562 GB/s) → 70ms. Decode is memory-bound, so the baseline falls as bandwidth rises, though not in strict proportion: about 7.6× the bandwidth yields a baseline about 4× lower.

Interference is Asymmetric

Prefill hurts decode (+200-2900%), but decode barely affects prefill (<10% TTFT increase under load). The interference is one-directional.

Phase Comparison: Baseline vs Mixed vs Recovery

Example: H100, num_decode=128, 16×4096 prefill injection

Baseline: 19ms | Mixed: 72ms (+282%) | Recovery: 19ms
Recovery is immediate: On all 5 GPUs, decode iteration time returns to baseline within 1-2 iterations after prefill completes. The interference is transient, not persistent.
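One way to test the recovery claim against a trace is to count iterations until the time falls back within a tolerance of baseline. This is a minimal sketch: the function name is hypothetical, the 5% tolerance mirrors the "within 5%" criterion used for hypothesis H4, and the trace is synthetic rather than measured.

```python
def iterations_to_recover(post_injection_ms, baseline_ms, tolerance=0.05):
    """Index of the first post-injection iteration within tolerance of baseline, or None."""
    limit = baseline_ms * (1.0 + tolerance)
    for i, t in enumerate(post_injection_ms):
        if t <= limit:
            return i
    return None


# Synthetic trace in milliseconds, invented for illustration; not measured data.
trace = [72.0, 19.4, 19.1, 19.0]
print(iterations_to_recover(trace, baseline_ms=19.0))  # 1
```

The function reports only the first iteration that is back under the limit. A stricter check would also require every later iteration to stay under it.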

6. Hypotheses Verdict

# | Hypothesis | Verdict
H1 | Mixed iterations are slower than pure decode | Confirmed
H2 | Interference scales linearly with total prefill tokens | Rejected: binary step function, not linear
H3 | Interference is in the forward pass, not the scheduler | Confirmed: schedule_ms ≈ 0.1-0.5ms, constant
H4 | Decode recovers to baseline after prefill | Confirmed: within 5% on all GPUs
H5 | Faster GPUs have similar relative interference | Partial: same binary pattern, but threshold varies with token budget

7. Implications for System Design

Chunked Prefill Tuning

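A chunk that is cheap on the H100 can be catastrophic on the RTX 6000: a single 2048-token prefill costs ~7% on H100 (32 decodes), while even a 128-token prefill causes >60% interference on the RTX 6000. The optimal chunk size must therefore scale with GPU compute throughput, and max_num_batched_tokens is the single most important knob.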

Disaggregation ROI

Our data quantifies the cost that prefill-decode (PD) disaggregation (Splitwise, DistServe) eliminates. On H100 with 128 concurrent decodes, a single 16×4096 burst adds 168ms to every decode step; summed over the 128 streams, that is 128 × 168ms ≈ 21.5s of cumulative user-facing delay per affected step.

Admission Control

The interference is binary: either below threshold (near-free) or above it (catastrophic). An admission controller that limits concurrent prefill to stay below the per-step budget would keep decode in the low-overhead (<10%) regime.
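A minimal sketch of such an admission controller, assuming it sits in front of the engine and decides which queued prompts to release for the next step. The class and method names are hypothetical, and this is not a feature of vLLM.

```python
from collections import deque


class PrefillAdmissionController:
    """Release queued prompts only while their total fits in one step's leftover budget."""

    def __init__(self, budget):
        self.budget = budget
        self.pending = deque()  # prompt lengths in FCFS order

    def submit(self, prompt_len):
        self.pending.append(prompt_len)

    def release(self, num_decode):
        """Return the prompt lengths admitted for the next step."""
        free = self.budget - num_decode
        admitted = []
        while self.pending and self.pending[0] <= free:
            length = self.pending.popleft()
            admitted.append(length)
            free -= length
        return admitted


ctrl = PrefillAdmissionController(budget=8192)
for _ in range(4):
    ctrl.submit(2048)
print(ctrl.release(num_decode=32))  # [2048, 2048, 2048]
print(ctrl.release(num_decode=32))  # [2048]
```

The fourth prompt is held back for one step rather than pushing the batch over the budget, trading a little time-to-first-token for stable decode latency. A prompt longer than the free budget would block the queue in this sketch, so a practical version would still need chunking for oversized prompts.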

8. Future Directions