Exp: Prefill-Decode Interference in Continuous Batching

1. Background & Motivation

1.1 From Static Batching to Continuous Batching

A modern LLM server receives hundreds of concurrent requests whose prompt lengths and output lengths vary by orders of magnitude. How should it pack them onto a single GPU without wasting compute?

Static batching (the old way)

Collect N requests, pad them to the length of the longest one, run them through the model together, and return all N responses only when the slowest has finished. Short requests wait on long ones, and the padded slots spend GPU cycles on tokens nobody asked for. For chat workloads with widely varying lengths, useful utilization can drop below 20%.
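To make the padding cost concrete, here is a minimal sketch that computes the fraction of padded token slots carrying real tokens. The request lengths are hypothetical, and the ratio only covers padding waste, not the time finished requests spend waiting on the slowest one.

```python
def static_batch_utilization(lengths):
    """Fraction of padded token slots that hold real tokens."""
    if not lengths:
        raise ValueError("need at least one request")
    padded_slots = len(lengths) * max(lengths)  # every row padded to the longest
    return sum(lengths) / padded_slots

# Hypothetical batch: seven short requests padded to one long one.
lengths = [50, 80, 120, 60, 90, 40, 70, 2000]
util = static_batch_utilization(lengths)
print(f"useful slots: {util:.1%}")
```

A single long request in the batch is enough to push the useful fraction well below one half.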

Continuous batching (Orca, OSDI 2022)

Re-schedule the batch at every single iteration, i.e., every time the model takes one forward pass to produce the next token. Finished requests leave the batch immediately; newly arrived requests can join at the very next iteration. No padding, no stragglers. This iteration-level scheduling is the foundation of vLLM, TGI, SGLang, TensorRT-LLM, and most other modern LLM serving engines.

Why it matters: continuous batching is a large part of why a single H100 can serve hundreds of concurrent chat users. Compared to static batching, it is commonly reported to deliver roughly 10–20× higher throughput on realistic workloads. Most production LLM endpoints, hosted or open-source, are understood to use some form of it.

1.2 Prefill vs Decode: Two Very Different Workloads

Every LLM request has two phases that live inside the same model weights but behave completely differently on GPU hardware:

Prefill — ingesting the prompt

The model ingests the entire prompt (say, 2048 tokens) in one forward pass. All 2048 positions flow through every layer simultaneously, and self-attention is an O(L²) matrix multiply. This is dense arithmetic — a prefill of a few thousand tokens completely saturates the tensor cores. Compute-bound.

Decode — generating output tokens

After prefill, the model generates the reply one token at a time. Each step only contributes 1 query row, but that query must attend over the entire past KV cache. The FLOPs are trivial, yet the KV cache is huge (hundreds of MB per sequence), so the GPU spends most of its time waiting on HBM reads. Memory-bandwidth-bound.
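The contrast between the two phases can be illustrated with arithmetic intensity (FLOPs per byte moved). The sketch below models a single square weight matrix only. The hidden size of 4096 is a hypothetical value, attention and KV-cache traffic are ignored, and the ridge point is derived from the H100 figures quoted on this page (989 TFLOPS, 3031 GB/s).

```python
def linear_layer_intensity(num_tokens, d_model, bytes_per_elem=2):
    """FLOPs per byte for one d_model x d_model matmul over num_tokens rows."""
    flops = 2 * num_tokens * d_model * d_model              # multiply + add
    weight_bytes = bytes_per_elem * d_model * d_model       # weights read once per pass
    act_bytes = bytes_per_elem * 2 * num_tokens * d_model   # read input, write output
    return flops / (weight_bytes + act_bytes)

# Ridge point: above it a kernel is compute-bound, below it bandwidth-bound.
H100_RIDGE = 989e12 / 3031e9  # about 326 FLOPs per byte

prefill = linear_layer_intensity(2048, 4096)  # one 2048-token prompt
decode = linear_layer_intensity(1, 4096)      # one decode token
print(f"prefill {prefill:.0f}, decode {decode:.2f}, ridge {H100_RIDGE:.0f} FLOPs/byte")
```

Under these assumptions the prefill pass sits well above the ridge and a lone decode token sits far below it. Batching many decode tokens raises the decode figure, since they share one read of the weights.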

The opportunity continuous batching tries to exploit: Prefill saturates compute but barely touches memory bandwidth. Decode saturates memory bandwidth but barely touches compute. In principle, if you run them in the same iteration they use orthogonal resources — you get prefill work done "for free" while the GPU would otherwise sit idle waiting on decode's memory loads. That is the core promise. Whether it actually holds up in practice is what this page measures.

1.3 How Prefill and Decode Share a Single Iteration

vLLM v1 enforces a fixed per-step token budget, and the scheduler fills it with a mix of prefill and decode work every iteration. Each scheduling step works as follows:

① The token budget is fixed at server start

When vLLM launches you set --max-num-batched-tokens; call this budget B. Typical values: B = 2048 on small GPUs, B = 4096–8192 on H100. Every iteration is allowed to process at most B tokens through the model, total, across all requests combined.

↓

② Each iteration, the scheduler fills the budget

Walk the running queue. For each decoding request, reserve exactly 1 token (the next token to be generated). If 128 decodes are running, that's 128 tokens out of B.
Walk the waiting queue. For each pending prefill request, assign up to min(remaining_prompt, B − already_used) tokens from its prompt. If the remaining budget is smaller than the prompt, the prompt gets chunked and the rest is processed in later iterations — this is chunked prefill, always-on in vLLM v1.
Stop when the budget is exhausted or both queues are empty.
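The budget-filling loop described in step ② can be sketched in a few lines. This is an illustrative model of the description above, not vLLM's actual scheduler code: the function name and data layout are hypothetical, and it ignores preemption, KV-cache limits, and the hand-off of a finished prefill into the running queue.

```python
from collections import deque

def schedule_step(num_decoding, waiting, budget):
    """Plan one iteration; return (decode_tokens, prefill_chunks).

    waiting is a deque of [request_id, remaining_prompt_tokens] and is
    mutated in place so that unfinished prompts carry over to the next step.
    """
    decode_tokens = min(num_decoding, budget)  # 1 token per running decode
    used = decode_tokens
    chunks = []
    while waiting and used < budget:
        req = waiting[0]
        take = min(req[1], budget - used)      # chunk the prompt if it does not fit
        chunks.append((req[0], take))
        used += take
        req[1] -= take
        if req[1] == 0:
            waiting.popleft()                  # prompt fully ingested
    return decode_tokens, chunks

# Three 4096-token prompts arrive while 32 decodes are running, B = 8192.
waiting = deque([["a", 4096], ["b", 4096], ["c", 4096]])
decode_tokens, chunks = schedule_step(32, waiting, 8192)
print(decode_tokens, chunks, list(waiting))
```

In this run request "b" is split: 4064 of its tokens go into the current step and the remaining 32 wait for the next one.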

↓

③ Flatten everything into a single tensor

The selected tokens (however many come from decode plus however many come from prefill) are concatenated into one flat 1D sequence of length ≤ B. A side array cu_seqlens records where each request's tokens start and end, so the attention kernel (FlashAttention / FlashInfer) knows not to let one request's tokens attend to another request's KV cache. From the model's point of view, it's just one big forward pass over a sequence of at most B tokens.
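As a small illustration of the side array, the sketch below builds cumulative sequence offsets from per-request token counts. The helper name is hypothetical and the ordering of requests in the flat batch is arbitrary here. Real variable-length attention kernels typically take separate offset arrays for queries and for keys; this covers the query side only.

```python
from itertools import accumulate

def build_cu_seqlens(tokens_per_request):
    """Cumulative offsets: request i owns rows cu[i]:cu[i + 1] of the flat batch."""
    return [0] + list(accumulate(tokens_per_request))

# 32 decode requests contribute 1 token each; one prefill contributes 2048.
counts = [1] * 32 + [2048]
cu = build_cu_seqlens(counts)

prefill_rows = slice(cu[32], cu[33])  # rows belonging to the prefill request
print(len(cu), cu[-1], prefill_rows)
```

The last entry equals the total number of rows in the flat batch, and the array always has one more entry than there are requests.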

↓

④ One forward pass handles all of it

The whole flat batch goes through every transformer layer exactly once. Every decode request emits its 1 new token. Every prefill request advances by however many of its tokens were allocated this step (the whole prompt, or just a chunk). The loop repeats. This is why prefill and decode can coexist in one iteration — they're just different rows of the same flattened batch.

Concrete example

H100, B = 8192. 32 decode requests are running, so 32 tokens are already reserved. A user sends a new request with a 2048-token prompt. Remaining budget = 8192 − 32 = 8160 ≥ 2048, so the entire prompt fits into this single iteration. The flat batch has shape [2080, hidden_dim] (2048 prefill rows stacked on top of 32 decode rows), one forward pass handles everything, and both the new user's first output token and every existing decode user's next token come out together. Total cost ≈ a prefill-dominated iteration — still small, because 2080 tokens is well below what the H100 can chew in one shot.

Now imagine a burst: 16 users each send a 4096-token prompt at the same moment. Total prefill demand = 65,536 tokens, but B is only 8192. The scheduler chunks it: iteration 1 takes ~8160 prefill tokens, iteration 2 takes 8160 more, and so on for 8 nearly full iterations plus a short ninth (65,536 / 8160 ≈ 8.03). During those iterations, every decode user is stuck behind a wall of prefill compute and their per-token latency blows up. That cost (how badly decode latency spikes during a prefill burst, and how it depends on B, the GPU, and the burst size) is exactly what Sections 3–5 measure.
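The iteration counts in both examples follow from one division. The sketch below assumes, as the examples do, that 32 decodes stay active for the whole burst and that each takes exactly one token per step; the function name is hypothetical.

```python
import math

def burst_iterations(num_prefill, prefill_len, num_decode, budget):
    """Iterations needed to ingest a prefill burst alongside running decodes."""
    per_step = budget - num_decode  # tokens left for prefill each iteration
    if per_step <= 0:
        raise ValueError("decodes alone exhaust the budget")
    return math.ceil(num_prefill * prefill_len / per_step)

single = burst_iterations(1, 2048, num_decode=32, budget=8192)
burst = burst_iterations(16, 4096, num_decode=32, budget=8192)
leftover = 16 * 4096 - 8 * (8192 - 32)  # tokens spilling into the final step
print(single, burst, leftover)
```

The single 2048-token prompt fits in one step. The 16×4096 burst fills eight steps almost completely and leaves a small remainder for one more.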

The core question this page answers: When both workloads share the same iteration, how much does prefill slow down decode? Is the interference gradual (smoothly worse as prefill grows) or threshold-based (fine until some cliff, then catastrophic)? Does decode recover immediately after the prefill burst ends, or does it stay degraded?

2. Experimental Protocol

Phase 1: PRIME

Launch num_decode streaming requests. Wait for all to enter decode state. Sleep 2s for steady state.

↓

Phase 2: BASELINE — 3s of pure decode

Trace captures per-step batch composition and iteration timing.

↓

Phase 3: INJECT — send prefill requests

Send num_prefill requests of prefill_len tokens (max_tokens=1). Trace captures mixed iterations.

↓

Phase 4: RECOVERY — 3s post-injection

Verify decode returns to baseline.

↓

Phase 5: CLEANUP

Cancel decode streams → drain KV cache → probe verify → next condition.

Variables

Variable	RTX 6000 / 5090	L40S	H100 / H200
`prefill_len`	128, 512, 1024, 2048	128, 512, 1024, 2048	128, 512, 1024, 2048, 4096
`num_prefill`	1, 2, 4	1, 2, 4, 8	1, 2, 4, 8, 16
`num_decode`	0, 4, 8, 12	0, 8, 16, 32	0, 8, 16, 32, 64, 128
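The sweep in the table can be enumerated programmatically. The sketch below assumes every combination in a column was run as a full cross product, which the table suggests but does not state. The dictionary keys are labels taken from the column headers.

```python
from itertools import product

GRIDS = {
    "RTX 6000 / 5090": {
        "prefill_len": [128, 512, 1024, 2048],
        "num_prefill": [1, 2, 4],
        "num_decode": [0, 4, 8, 12],
    },
    "L40S": {
        "prefill_len": [128, 512, 1024, 2048],
        "num_prefill": [1, 2, 4, 8],
        "num_decode": [0, 8, 16, 32],
    },
    "H100 / H200": {
        "prefill_len": [128, 512, 1024, 2048, 4096],
        "num_prefill": [1, 2, 4, 8, 16],
        "num_decode": [0, 8, 16, 32, 64, 128],
    },
}

def conditions(gpu_class):
    """Yield one dict per (prefill_len, num_prefill, num_decode) combination."""
    grid = GRIDS[gpu_class]
    keys = ("prefill_len", "num_prefill", "num_decode")
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

for name in GRIDS:
    print(name, sum(1 for _ in conditions(name)), "conditions")
```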

Interference metric: interference_pct = (mean_mixed - mean_baseline) / mean_baseline × 100%
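The metric is straightforward to compute from per-iteration timings. In the sketch below the function name mirrors the metric, while the input lists stand in for trace samples whose real format is not shown on this page.

```python
from statistics import mean

def interference_pct(baseline_ms, mixed_ms):
    """Relative slowdown of mixed iterations over pure-decode iterations, in percent."""
    base = mean(baseline_ms)
    mixed = mean(mixed_ms)
    return (mixed - base) / base * 100.0

# Using the rounded H100, num_decode=32, 4x2048 means reported in Section 4.2.
pct = interference_pct([10.6], [47.5])
print(f"{pct:+.0f}%")
```

Feeding in the rounded table values gives a figure about one point below the reported +349%, presumably because the reported number was computed from unrounded timings.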

3. The Binary Interference Effect

The central finding: prefill-decode interference is not gradual — it is a step function governed by whether the prefill workload fits within a single iteration's token budget.

Below Threshold

Prefill tokens fit in one iteration alongside decode. Overhead is negligible (<10%). The GPU processes both in a single forward pass.

Above Threshold

Prefill overflows into multiple iterations. Each overflow iteration is dominated by prefill compute. Iteration time rises from its pure-decode baseline (roughly 7–94ms depending on GPU and num_decode) by +200% to +2900%.

Interference Pattern (Schematic)

[Figure: schematic of decode interference (y-axis, 0% to 1500%) against total prefill tokens (x-axis, 128 to 65K), with the step marked at the threshold.]

4. Per-GPU Results

4.1 H100 (Hopper, 80 GB, 989 BF16 TFLOPS)

Baseline decode: ~8-19ms. Threshold at ~8K total prefill tokens. At num_decode=32, a 4×512 burst (2K total) costs only about 5%, while 4×2048 (8K total) already costs +349%.

Interference Heatmap — H100, num_decode=32

How to read this table. 32 decode streams are running in the background. We then inject a burst of num_prefill concurrent prefill requests, each with a prompt of prefill_len tokens. Each cell shows how much the decode iteration time jumps (Δ%) relative to the pure-decode baseline (~11 ms on H100). Rows = how many prefill requests arrive at once; columns = how long each prefill prompt is. Total prefill tokens per burst = row × column. Green ≈ harmless, red ≈ catastrophic.

num_prefill ↓ prefill_len →	128	512	1024	2048	4096
1 req	<5%	<5%	<5%	~7%	~40%
2 req	<5%	<5%	~8%	~50%	~250%
4 req	<5%	~5%	~30%	349%	710%
8 req	<5%	~20%	~90%	580%	835%
16 req	~5%	~70%	~200%	660%	836%

Reading three cells: 1 req × 2048 = 2K total prefill tokens → +7% (negligible). 4 req × 2048 = 8K total → jumps to +349%. 16 req × 4096 = 65K total → +836%. The transition is sharp: once total prefill tokens exceed the H100's ~8K per-step budget, chunked prefill can no longer hide the work behind a single iteration, and decode iteration time rises to roughly 4.5× to 9.4× its baseline.
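To see which heatmap cells overflow a single iteration, the sketch below compares each burst's total prefill tokens against the budget left after the 32 decodes. It assumes B = 8192 for this H100 run, in line with the "~8K per-step budget" above; the exact setting is not stated.

```python
BUDGET = 8192      # assumed --max-num-batched-tokens for the H100 run
NUM_DECODE = 32
PREFILL_LENS = [128, 512, 1024, 2048, 4096]
NUM_PREFILLS = [1, 2, 4, 8, 16]

def overflows(num_prefill, prefill_len, budget=BUDGET, num_decode=NUM_DECODE):
    """True if the burst cannot fit in one iteration next to the decodes."""
    return num_prefill * prefill_len > budget - num_decode

for n in NUM_PREFILLS:
    row = ["overflow" if overflows(n, length) else "fits" for length in PREFILL_LENS]
    print(f"{n:>2} req", *row, sep="\t")
```

Note that 4×2048 is exactly 8192 tokens, which no longer fits once 32 slots are reserved for decode. The measured heatmap is not a pure function of this predicate (8×1024 and 16×512 also overflow yet show smaller jumps), so treat it as a first-order guide only.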

4.2 H200 (Hopper+, 141 GB, 989 TFLOPS, 4267 GB/s)

Nearly identical to H100 in compute behavior (same sm_90 architecture). Slightly lower baseline (17ms vs 19ms at num_decode=128), consistent with its higher memory bandwidth.

4×2048 Interference — H100 vs H200

num_decode	H100 baseline	H100 mixed	H100 Δ%	H200 baseline	H200 mixed	H200 Δ%
8	8.4ms	45.3ms	+443%	7.0ms	43.1ms	+514%
32	10.6ms	47.5ms	+349%	9.1ms	39.4ms	+333%
64	13.3ms	43.0ms	+224%	11.7ms	26.0ms	+123%
128	19.1ms	28.4ms	+49%	16.8ms	23.0ms	+36%

4.3 L40S (Ada Lovelace, 48 GB, 362 TFLOPS)

Baseline: ~25-29ms. Threshold at ~4K tokens. Notably stable baseline with very low variance.

4×2048 Interference — L40S

num_decode	baseline	mixed	Δ%
8	24.9ms	100.4ms	+304%
16	26.3ms	115.6ms	+340%
32	29.3ms	106.2ms	+262%

4.4 RTX 5090 (Blackwell, 32 GB, ~400 TFLOPS)

Baseline: ~13ms. Same threshold pattern at ~4K tokens. Mixed-iteration times (~90–116ms) are close to the L40S in absolute terms, but the much lower baseline makes the relative interference larger (+623% to +799%).

4×2048 Interference — RTX 5090

num_decode	baseline	mixed	Δ%
4	12.5ms	90.6ms	+623%
8	12.9ms	115.8ms	+799%
12	13.2ms	102.5ms	+674%

4.5 RTX 6000 (Turing, 24 GB, 16.3 TFLOPS)

The most severely affected GPU. Baseline: ~39-94ms. Even 1×128 (128 tokens) causes >60% interference. Maximum: 4×2048 → +2903%.

4×2048 Interference — RTX 6000

num_decode	baseline	mixed	Δ%
4	39ms	1160ms	+2903%
8	47ms	1101ms	+2234%
12	94ms	1558ms	+1557%

5. Cross-GPU Comparison

GPU	Arch	BF16 TFLOPS	Mem BW	Baseline	Threshold	Max Δ%
RTX 6000	Turing	16	562 GB/s	~70ms	~4K tokens	+2903%
RTX 5090	Blackwell	~400	1792 GB/s	~13ms	~4K tokens	+874%
L40S	Ada	362	652 GB/s	~29ms	~4K tokens	+402%
H100	Hopper	989	3031 GB/s	~19ms	~8K tokens	+897%
H200	Hopper+	989	4267 GB/s	~17ms	~8K tokens	+1006%

Key Observations

Threshold ∝ Compute: RTX 6000/5090/L40S threshold ≈ 4K tokens. H100/H200 ≈ 8K tokens. Directly reflects the per-step token budget that vLLM's chunked prefill scheduler allows.
Baseline ∝ 1/Bandwidth: H200 (4267 GB/s) → 17ms baseline. RTX 6000 (562 GB/s) → 70ms. Decode is memory-bound, so baseline scales inversely with bandwidth.
Interference is Asymmetric: Prefill hurts decode (+200-2900%), but decode barely affects prefill (<10% TTFT increase under load). The interference is one-directional.

Phase Comparison: Baseline vs Mixed vs Recovery

Example: H100, num_decode=128, 16×4096 prefill injection

Phase	Iteration time
Baseline	19ms
Mixed	72ms (+282%)
Recovery	19ms

Recovery is immediate: On all 5 GPUs, decode iteration time returns to baseline within 1-2 iterations after prefill completes. The interference is transient, not persistent.
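One way to check recovery from a trace is to count how many post-burst iterations pass before timing is back within a tolerance of baseline. The sketch below uses the 5% tolerance from hypothesis H4. The function name and the sample timings are hypothetical, not measured values.

```python
def iterations_to_recover(post_burst_ms, baseline_ms, tolerance=0.05):
    """Index of the first post-burst iteration within tolerance of baseline.

    Returns None if no iteration in the trace gets back under the limit.
    """
    limit = baseline_ms * (1 + tolerance)
    for i, step_ms in enumerate(post_burst_ms):
        if step_ms <= limit:
            return i
    return None

# Hypothetical trace: iteration times right after the last prefill chunk.
trace = [72.0, 21.5, 19.2, 19.0]
recovered_at = iterations_to_recover(trace, baseline_ms=19.0)
print(recovered_at)
```

A stricter variant could require that all later iterations also stay under the limit, which guards against a single lucky sample.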

6. Hypotheses Verdict

#	Hypothesis	Verdict
H1	Mixed iterations are slower than pure decode	Confirmed
H2	Interference scales linearly with total prefill tokens	Rejected — binary step function, not linear
H3	Interference is in forward pass, not scheduler	Confirmed — schedule_ms ≈ 0.1-0.5ms constant
H4	Decode recovers to baseline after prefill	Confirmed — within 5% on all GPUs
H5	Faster GPUs have similar relative interference	Partial — same binary pattern, but threshold varies with token budget

7. Implications for System Design

Chunked Prefill Tuning: A 4096-token chunk is tolerable on H100 (about +40% at num_decode=32) but would be catastrophic on RTX 6000, where even 1×128 costs more than 60%. The optimal chunk size must scale with GPU compute throughput. max_num_batched_tokens is the single most important knob.
Disaggregation ROI: Our data provides the cost that PD disaggregation (Splitwise, DistServe) eliminates. On H100 with 128 concurrent decodes, a single 16×4096 burst adds 168ms to every decode step → 128 × 168ms = 21.5s cumulative user-facing delay summed across the 128 streams.
Admission Control: The interference is binary: either below threshold (near-free) or above (catastrophic). An admission controller that limits concurrent prefill to stay below the per-step budget would keep decode overhead within the <10% seen below threshold.
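A minimal sketch of such an admission controller is shown below. It is a hypothetical FIFO policy, not an existing vLLM feature: whole prompts are admitted only while they fit in the budget left after the running decodes, and everything behind the first prompt that does not fit is deferred to preserve arrival order.

```python
def admit_prefills(pending_lens, num_decode, budget):
    """Split pending prompt lengths into (admitted, deferred) for one step."""
    room = budget - num_decode  # tokens left once each decode takes 1
    admitted, deferred = [], []
    for prompt_len in pending_lens:
        if not deferred and prompt_len <= room:
            admitted.append(prompt_len)
            room -= prompt_len
        else:
            deferred.append(prompt_len)  # keep FIFO order behind the first miss
    return admitted, deferred

# Four 2048-token prompts arrive while 32 decodes run on B = 8192.
admitted, deferred = admit_prefills([2048] * 4, num_decode=32, budget=8192)
print(admitted, deferred)
```

One caveat of this policy: a prompt longer than the whole budget would never be admitted, so a real controller would still need chunking or a fallback path for it.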

8. Future Directions

Chunked prefill budget sweep: Explicit --max-num-batched-tokens values (512–8192) to map the exact Pareto curve of prefill_latency vs decode_interference.
Larger models with TP: Llama 3.1 70B on 2×H100 with TP=2. Inter-GPU all-reduce may shift the pattern from compute-bound to communication-bound.
Real workload traces: Replace synthetic prompts with ShareGPT/LMSYS-Chat traces to capture realistic arrival patterns.
Dynamic chunk sizing: Adaptive policy that reduces chunk size when many decodes are active, targeting a specific TBT SLA.
Attention backend comparison: FlashAttention 2 vs 3 vs FlashInfer on the same GPU — different backends may amplify or dampen interference.