Matrix Sweep: Agentic KV Scheduling Across GPUs

72-run experiment: 3 GPUs × 6 policy variants × 4 JPS arrival rates, each run replaying 100 SWE-smith trajectories. Each trajectory is replayed turn-by-turn with a 1 s synthesized tool-exec delay between turns.

H100 80GB · A100 40GB · H200 141GB · Llama-3.1-8B-Instruct · vLLM v0.19 fork · SWE-smith-trajectories · 2026-04-16

Hung-Chun Lin · Georgia Tech

1. Experiment Setup

72 total runs · 100 jobs per run · 1,791 turns per run (avg 17.9/job) · ~300K step_snapshot rows

1.1 Hardware

| GPU | VRAM | KV Cache Tokens (16k ctx) | Partition |
|---|---|---|---|
| H100 80GB HBM3 | 80 GB | ~458K | PACE gpu-h100 |
| A100 PCIE 40GB (mixed in partition) | 40 GB | ~90K | PACE gpu-a100 |
| A100 PCIE 80GB (mixed in partition) | 80 GB | ~200K | PACE gpu-a100 |
| H200 141GB HBM3e | 141 GB | ~825K | PACE gpu-h200 |

1.2 Model & dataset

| Field | Value |
|---|---|
| Model | meta-llama/Llama-3.1-8B-Instruct (14.99 GB, bfloat16) |
| Dataset | princeton-nlp/SWE-smith-trajectories: 100 curated trajectories, mean 17.9 turns/job, median 6945 input / 30 output tokens per turn (highly prefix-cache-friendly) |

Tool distribution

| Tool | Share |
|---|---|
| cd | 26% |
| editor_view | 21% |
| editor_create | 13% |
| editor_str_replace | 12% |
| submit | 11% |
| find | 6% |
| rm | 5% |
| grep | 4% |
| others | 2% |

Each SWE-smith trajectory is an OpenAI-format chat log: a system prompt, the issue description as the user message, and alternating assistant / tool turns. A worked example showing the JSON structure together with its per-turn token budget lives in §2.1.

1.3 Scheduling policies tested

| Policy variant | KV strategy | External tier |
|---|---|---|
| fcfs | V1 FCFS admission | - |
| continuum | Pin held-KV for 2 s TTL if estimated tool-exec ≤ threshold | - |
| cpu_offload_dram16 | SimpleCPUOffloadConnector | 16 GB DRAM |
| cpu_offload_dram128 | SimpleCPUOffloadConnector | 128 GB DRAM |
| lmcache_dram16 | LMCacheConnectorV1 | 16 GB DRAM |
| lmcache_dram128 | LMCacheConnectorV1 | 128 GB DRAM |
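
The continuum row describes a simple rule: pin the held KV for a 2 s TTL when the estimated tool-exec time is at or below a threshold. A minimal sketch of that rule follows. The names `decide_pin`, `pin_still_valid`, and `PinDecision` are hypothetical, and the threshold is left as a parameter because the source does not state its value; this is not the fork's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

PIN_TTL_S = 2.0  # TTL taken from the policy table


@dataclass
class PinDecision:
    pinned: bool
    expires_at: Optional[float]  # absolute time (s) at which the pin lapses


def decide_pin(now_s: float, est_tool_exec_s: float, threshold_s: float,
               ttl_s: float = PIN_TTL_S) -> PinDecision:
    """Pin the finished turn's KV only if the tool is expected to return soon."""
    if est_tool_exec_s <= threshold_s:
        return PinDecision(pinned=True, expires_at=now_s + ttl_s)
    return PinDecision(pinned=False, expires_at=None)


def pin_still_valid(decision: PinDecision, now_s: float) -> bool:
    """A pin protects the KV blocks only until its TTL expires."""
    return (decision.pinned
            and decision.expires_at is not None
            and now_s < decision.expires_at)
```

With the 1 s synthesized tool-exec delay used in this sweep, a next turn arriving at `now_s + 1.0` would still find its pin valid under a 2 s TTL.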

1.4 Arrival rate sweep

Jobs arrive as a Poisson process. JPS (jobs per second) values: 1, 3, 6, 10. Each job runs its full ~18-turn trajectory; between turns a 1-second synthesized tool-exec delay is inserted to emulate real agent loop back-pressure.
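
A Poisson arrival process at rate JPS is equivalent to drawing independent exponential inter-arrival gaps with mean 1/JPS seconds. The sketch below generates job start times that way. The function name and the fixed seed are illustrative assumptions, not the harness's actual code.

```python
import random


def poisson_arrival_times(num_jobs, jps, seed=0):
    """Return sorted arrival timestamps (s) for a Poisson process of rate `jps`."""
    rng = random.Random(seed)
    t = 0.0
    times = []
    for _ in range(num_jobs):
        t += rng.expovariate(jps)  # exponential gap, mean 1/jps seconds
        times.append(t)
    return times


# Example: the four arrival rates of the sweep, 100 jobs each.
schedules = {jps: poisson_arrival_times(100, jps) for jps in (1, 3, 6, 10)}
```

At JPS = 10 all 100 jobs are expected to arrive within roughly 10 s, so nearly every job is concurrently in flight for most of the run.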

2. Per-turn prefill / decode token budget

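Every chat turn of a job goes through exactly two phases inside the vLLM engine: prefill, which processes the prompt tokens, and decode, which emits one output token per engine step.

Minimum engine steps for a single turn = 1 (prefill; more than 1 if the prompt is chunked) + N (one per decoded token) = 1 + num_output_tokens.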
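
The step formula can be written as a small helper. This sketch assumes chunked prefill splits the prompt into pieces of at most `max_num_batched_tokens`; the function name is hypothetical, and the default budget of 8,192 is the value used in the worked example in §2.1.

```python
import math


def min_engine_steps(prompt_tokens, output_tokens, max_num_batched_tokens=8192):
    """Lower bound on engine steps for one turn: prefill chunks + decode steps."""
    prefill_steps = max(1, math.ceil(prompt_tokens / max_num_batched_tokens))
    decode_steps = output_tokens  # one forward pass per emitted token
    return prefill_steps + decode_steps


# Turn 2 of the worked example: 2,155 prompt tokens, 28 output tokens.
steps_turn2 = min_engine_steps(2155, 28)  # 1 prefill step + 28 decode steps

# A median-sized turn (6,945 in / 30 out) under a smaller, assumed 2,048 budget.
steps_median_chunked = min_engine_steps(6945, 30, max_num_batched_tokens=2048)
```

This is a lower bound only: queueing, preemption, and prefill retries after eviction add steps on top of it.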

2.1 Worked example — astropy__astropy-12907 turn 2

The trajectory (abridged) and the token accounting for turn 2 share one block below. Counts are representative (Llama-3.1 tokenizer).

# ── Trajectory (abridged, 2 of ~17 turns shown) ──────────────────────────
{
  "instance_id": "astropy__astropy-12907",
  "messages": [
    {"role": "system",
     "content": "You are a SWE agent. Use the tools to explore the repo and fix the issue…"},

    {"role": "user",
     "content": "<issue>Modeling's `separability_matrix` does not compute separability correctly for nested CompoundModels</issue><repo>astropy/astropy</repo>"},

    // turn 1 — explore
    {"role": "assistant",
     "content": "Let me look at the modeling package.",
     "tool_calls": [{"id": "c1", "type": "function",
       "function": {"name": "cd", "arguments": "{\"path\": \"astropy/modeling\"}"}}]},
    {"role": "tool", "tool_call_id": "c1", "content": "/astropy/modeling"},

    // turn 2 — read the suspect file   ← the turn we score below
    {"role": "assistant",
     "content": "Opening separable.py to inspect _separable().",
     "tool_calls": [{"id": "c2", "type": "function",
       "function": {"name": "editor_view", "arguments": "{\"path\": \"separable.py\", \"view_range\": [1, 120]}"}}]},
    {"role": "tool", "tool_call_id": "c2",
     "content": "def _separable(transform): …  # ~115 lines of source"},

    // … 15 more turns (editor_str_replace, editor_view, editor_str_replace, submit) …
  ]
}

# ── Turn 2 · Input side — prompt_tokens fed to the engine ────────────────
  system prompt                              …   604 tok
  user (issue body + reproduction code)      … 1,498 tok
  turn 1 assistant text + tool_calls JSON    …    35 tok
  turn 1 synthesized ```bash cd ```          …     5 tok
  turn 1 tool response "/astropy/modeling"   …    10 tok
  turn 2 assistant role prefix               …     3 tok
                                             ─────────
  prompt_tokens (prefill input)           = 2,155 tok

# ── Turn 2 · Output side — tokens generated during decode ────────────────
  "Opening separable.py to inspect _separable()."        → 10 tok
  tool_calls JSON (editor_view, separable.py, 1..120)    → 13 tok
  synthesized ```bash\neditor_view\n```                  →  5 tok
                                                         ─────────
  output_tokens                                       =    28 tok

# ── Turn 2 · Minimum engine steps ────────────────────────────────────────
  prefill   : 1 step   (2,155 ≤ max_num_batched_tokens = 8,192 → one forward pass)
  decode    : 28 steps (one forward pass per emitted token)
                                               ─────────
  min steps                             = 29 steps
  general form: 1 + num_output_tokens
Prefix cache effect. With enable_prefix_caching=True, turn 1's full prefix (2,152 tok) is reused, so the prefill for turn 2 only computes the 3-tok new role prefix — one cheap forward pass instead of 2,155 tokens of attention. The step count is unchanged (still 1 + 28 = 29); prefix cache saves compute and KV memory, not engine steps.
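
The prefix-cache arithmetic above can be sketched as follows. The helper name is hypothetical. The `block_size` parameter models the assumption that cached KV is reused in whole blocks: with `block_size=1` the function reproduces the token-exact accounting used in this section, while a block size of 16 (an assumed setting here) rounds the reusable prefix down to a block boundary.

```python
def prefill_tokens_to_compute(prompt_tokens, reusable_prefix_tokens, block_size=1):
    """Tokens the prefill must actually compute after prefix-cache reuse."""
    # Only whole blocks of the shared prefix count as cache hits.
    cached = (reusable_prefix_tokens // block_size) * block_size
    cached = min(cached, prompt_tokens)
    return prompt_tokens - cached


# Token-exact accounting from the worked example: 2,155 prompt, 2,152 reusable.
exact = prefill_tokens_to_compute(2155, 2152)

# Same turn if reuse is limited to full 16-token blocks (assumption).
block_aligned = prefill_tokens_to_compute(2155, 2152, block_size=16)
```

Either way the prefill remains a single engine step, which is why the step count stays at 29.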

2.2 Synthesized tool-call injection

Continuum's pin estimator classifies the upcoming tool by parsing a trailing ```bash\n<tool_name>\n``` block in the assistant's content. Raw SWE-smith trajectories carry tools in the OpenAI tool_calls field instead, so a preprocessor synthesizes that bash block from tool_calls[0].function.name before tokenization — the extra 5 tokens shown in the example above.
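
A minimal sketch of such a preprocessor is shown below. The function name, the newline used to join the block to the existing content, and the choice to return a copy are assumptions; only the use of `tool_calls[0].function.name` and the trailing bash block come from the description above.

```python
import copy

FENCE = "`" * 3  # three backticks, built here to keep this listing readable


def synthesize_bash_block(message):
    """Append a trailing bash block naming the first tool call, if any."""
    if message.get("role") != "assistant" or not message.get("tool_calls"):
        return message  # nothing to synthesize for this message
    tool_name = message["tool_calls"][0]["function"]["name"]
    out = copy.deepcopy(message)  # leave the raw trajectory untouched
    content = out.get("content") or ""
    out["content"] = f"{content}\n{FENCE}bash\n{tool_name}\n{FENCE}"
    return out


example = synthesize_bash_block({
    "role": "assistant",
    "content": "Opening separable.py to inspect _separable().",
    "tool_calls": [{"id": "c2", "type": "function",
                    "function": {"name": "editor_view", "arguments": "{}"}}],
})
```

Running the preprocessor before tokenization keeps the token accounting honest, since the synthesized block is counted on both the output side of its own turn and the input side of every later turn.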

3. Results — Summary Table (avg job duration, seconds)

Below: mean total turn duration per job including queue wait, across 100 jobs × ~18 turns. Lower = better. Bold highlights per-row winner.

H100 80GB

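| JPS | fcfs | continuum | cpu_off-16G | cpu_off-128G | lmcache-16G | lmcache-128G |
|---|---|---|---|---|---|---|
| 1 | **41.0** | 41.3 | 50.5 | 41.6 | 57.9 | 53.1 |
| 3 | 232.9 | **69.0** | 232.3 | 79.8 | 260.6 | 96.3 |
| 6 | 276.0 | **75.8** | 268.4 | 85.3 | 297.5 | 105.0 |
| 10 | 281.0 | **79.1** | 290.3 | 91.8 | 319.4 | 105.7 |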

A100 PCIE 40GB / 80GB (mixed — see annotations)

| JPS | fcfs | continuum | cpu_off-16G | cpu_off-128G | lmcache-16G | lmcache-128G |
|---|---|---|---|---|---|---|
| 1 | 707.7 (80G) | **188.4 (80G)** | 959.2 (40G) | 202.8 (80G) | 680.9 (80G) | 256.9 (80G) |
| 3 | 1086.5 (40G) | **231.4 (40G)** | 1057.2 (40G) | 234.2 (80G) | 1144.3 (40G) | 466.5 (40G) |
| 6 | 966.3 (80G) | **233.3 (40G)** | 1057.3 (40G) | 241.6 (80G) | 1168.3 (40G) | 310.4 (80G) |
| 10 | 895.7 (80G) | **238.0 (40G)** | 1419.2 (40G) | 240.9 (80G) | 1159.0 (40G) | 292.8 (80G) |
⚠ A100 VRAM is not fixed on PACE. The gpu-a100 partition mixes A100 40GB and A100 80GB cards. A bare --gres=gpu:a100:1 request can land on either variant, which roughly doubles the available KV budget (~90K → ~200K tokens at 16k ctx) and silently shifts the policy spectrum — particularly the cpu_offload / lmcache crossover. The 24 cells above were not filtered: each landed on whichever SKU was free at submit time (tagged inline per cell; overall 12 × 40G + 12 × 80G). To reproduce cleanly, pin the exact SKU (e.g. --constraint=A100-40GB) instead of letting the scheduler pick.

H200 141GB

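| JPS | fcfs | continuum | cpu_off-16G | cpu_off-128G | lmcache-16G | lmcache-128G |
|---|---|---|---|---|---|---|
| 1 | 39.6 | 40.1 | 45.7 | **37.2** | 53.1 | 47.3 |
| 3 | 58.3 | **57.2** | 69.3 | 61.6 | 84.3 | 61.3 |
| 6 | **62.1** | 65.0 | 76.6 | 64.7 | 97.0 | 66.5 |
| 10 | **66.1** | 66.3 | 80.8 | 67.1 | 95.6 | 85.6 |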
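Headline: Continuum on H100 delivers a 3.3–3.7× latency reduction over FCFS at JPS ≥ 3. On A100 it is 3.7–4.7× faster. On H200 everything is fast: the 141 GB card's KV cache largely eliminates contention, so FCFS, Continuum, and cpu_offload_128G land within about 6% of each other and only the lmcache and undersized-DRAM variants lag.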
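
The headline ratios can be recomputed directly from the summary tables. The sketch below hard-codes the fcfs and continuum columns at JPS ≥ 3 and divides them; the dictionary layout and function name are illustrative. Note that the A100 ratios compare cells that landed on different SKUs (see the annotations in the A100 table), so they are not a like-for-like comparison.

```python
# (fcfs, continuum) average job duration in seconds, keyed by JPS.
H100 = {3: (232.9, 69.0), 6: (276.0, 75.8), 10: (281.0, 79.1)}
A100 = {3: (1086.5, 231.4), 6: (966.3, 233.3), 10: (895.7, 238.0)}
H200 = {3: (58.3, 57.2), 6: (62.1, 65.0), 10: (66.1, 66.3)}


def speedups(table):
    """Continuum speedup over FCFS (fcfs / continuum) per arrival rate."""
    return {jps: fcfs / cont for jps, (fcfs, cont) in table.items()}


for name, table in (("H100", H100), ("A100", A100), ("H200", H200)):
    ratios = speedups(table)
    print(name, {jps: round(r, 2) for jps, r in ratios.items()})
```

On these numbers the H100 ratios fall between about 3.4× and 3.6×, the A100 ratios between about 3.8× and 4.7×, and the H200 ratios stay within 5% of 1.0.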

3.1 H100 80GB — 9-chart panel

[Figure: H100 max fwd tokens histogram]
Fig. H100-a: Histogram of max forward tokens per scheduler step across all policies (log-scale y). The dominant cluster at 1–4 tokens = decode steps. The budget-cap spike at 2048 = prefill-heavy steps hitting max_num_scheduled_tokens. Continuum and cpu_offload_128G show tighter distributions; cpu_offload_16G and lmcache_16G have heavier right tails due to prefill retries after eviction.
[Figure: H100 scatter: max fwd tokens vs step time]
Fig. H100-b: Per-step scatter: x = max fwd tokens, y = measured step duration (ms). Global linear regression: y = 0.0258·x + 23.7, R² = 0.605. Interpretation: each additional token costs ~26 μs of step time; the 24 ms intercept is fixed per-step overhead (kernel launch + scheduler + Python dispatch).
[Figure: H100 turn vs JCT by policy]
Fig. H100-c: Per-turn mean duration (including queue wait), colored by policy. FCFS / cpu_offload_16G / lmcache_16G show the classic queue-buildup pattern: later turns hit progressively longer queues as concurrent jobs accumulate. Continuum and cpu_offload_128G stay flat, indicating KV reuse successfully prevents rework.

3.2 A100 PCIE 40GB — 9-chart panel

[Figure: A100 max fwd tokens histogram]
Fig. A100-a: A100 histogram. With only ~90K KV tokens, chunked-prefill fragmentation is extreme: lmcache_16G shows a broad secondary peak around 256 tokens (constant small-chunk prefill). Continuum shifts left-heavy (decode-dominated) because pinning retains prefix blocks across turns.
[Figure: A100 scatter regression]
Fig. A100-b: A100 regression: y = 0.0973·x + 66.2 ms, R² = 0.722. A100 is 3.8× slower per token than H100 (26 μs → 97 μs) and has 2.8× higher per-step overhead (24 → 66 ms), reflecting slower PCIe 4.0 and weaker Tensor Core throughput.
[Figure: A100 turn vs JCT by policy]
Fig. A100-c: A100 reveals the full policy spectrum. FCFS collapses (>700 s by turn 15). lmcache_16G / cpu_offload_16G are even worse due to slow PCIe 4.0 KV transfers on cache misses. Continuum and cpu_offload_128G hold the line at ~250 s asymptote. On tight-KV GPUs, smart policy is the difference between 4× and 15× slowdown at JPS ≥ 3.

3.3 H200 141GB — 9-chart panel

[Figure: H200 max fwd tokens histogram]
Fig. H200-a: H200 histogram. All policies converge into a near-identical shape because the 141 GB KV cache fits essentially the entire working set. lmcache_16G is the only outlier: deliberately tight DRAM forces external-cache thrash even when GPU KV is plentiful.
[Figure: H200 scatter regression]
Fig. H200-b: H200 regression: y = 0.0371·x + 21.4 ms, R² = 0.385. Notably lower R² than H100/A100: compute is so fast that per-step variance is dominated by non-compute factors (scheduler overhead, I/O, Python GIL), not max fwd tokens. Per-token cost (37 μs) is 1.4× H100's, but intercept (21 ms) is actually lower; H200's kernel launch overhead is smaller.
[Figure: H200 turn vs JCT by policy]
Fig. H200-c: H200 turn-latency: almost flat for all policies except lmcache_16G and cpu_offload_16G (which still suffer from forced eviction). Confirms that policy matters only when KV pressure exists.

4. Regression Analysis — Compute Cost Model

Fitting step_duration = slope·max_fwd_tokens + intercept across all policies on each GPU:

| GPU | n (steps) | slope (μs/token) | intercept (ms) | R² | Interpretation |
|---|---|---|---|---|---|
| H100 | 108,518 | 25.8 | 23.7 | 0.605 | Compute-bound: throughput predicts time well. |
| A100 | 103,999 | 97.3 | 66.2 | 0.722 | Strongest linear fit: compute fully saturates kernels. |
| H200 | 89,355 | 37.1 | 21.4 | 0.385 | Compute too fast: overhead dominates step time. |
The A100-to-H100 slope ratio (97 / 26 ≈ 3.8×) is in the same range as the inverse ratio of published peak bfloat16 dense-matmul throughput (about 3.2×, 312 vs 989 TFLOPS), which supports reading the trace as compute-bound on those two GPUs. H200 does not fit that pattern: its published peak bfloat16 throughput matches H100's, yet its fitted slope is 1.4× higher (37 vs 26 μs/token). Together with the lower R², which is not noise, this signals that on H200 scheduler Python overhead (~21 ms/step floor) is now the optimization target, not more fused kernels.
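
The fit itself is ordinary least squares on (max_fwd_tokens, step_duration) pairs. The sketch below shows one way to compute the slope, intercept, and R² in plain Python and then reuse the fitted coefficients as a step-time predictor. The function names are hypothetical, and the sample points are synthetic rather than taken from the traces.

```python
def fit_step_cost(max_fwd_tokens, step_ms):
    """OLS fit of step_ms = slope * tokens + intercept.

    Returns (slope in μs/token, intercept in ms, R²).
    """
    n = len(max_fwd_tokens)
    mean_x = sum(max_fwd_tokens) / n
    mean_y = sum(step_ms) / n
    sxx = sum((x - mean_x) ** 2 for x in max_fwd_tokens)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(max_fwd_tokens, step_ms))
    slope = sxy / sxx                      # ms per token
    intercept = mean_y - slope * mean_x    # ms
    ss_res = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(max_fwd_tokens, step_ms))
    ss_tot = sum((y - mean_y) ** 2 for y in step_ms)
    r2 = 1.0 - ss_res / ss_tot
    return slope * 1000.0, intercept, r2


def predict_step_ms(tokens, slope_us_per_token, intercept_ms):
    """Predicted step duration (ms) from the fitted cost model."""
    return slope_us_per_token / 1000.0 * tokens + intercept_ms


# Synthetic, noise-free points generated from the H100 coefficients.
xs = [1, 4, 256, 2048]
ys = [0.0258 * x + 23.7 for x in xs]
slope_us, intercept_ms, r2 = fit_step_cost(xs, ys)
```

With the H100 coefficients, `predict_step_ms(2048, 25.8, 23.7)` gives roughly 76.5 ms for a step that hits a 2,048-token budget, versus about 23.7 ms for a pure decode step.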

5. Takeaways

Continuum wins decisively on tight-KV GPUs

H100 JPS ≥ 3: 3.3–3.7× faster than FCFS. A100 JPS ≥ 3: 3.7–4.7× faster. The gain comes from pinning held-KV for 2 s, which lets the next turn of a job hit prefix cache at ~98% instead of re-prefilling 6–8 K tokens.

cpu_offload_128G is a viable Continuum alternative

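When DRAM is abundant (≥ working-set size), SimpleCPUOffloadConnector reaches 80–90% of Continuum's performance without any custom scheduling logic (on H100 at JPS ≥ 3 its latency is 13–16% higher). It even edges out Continuum on H200 at JPS = 1 and JPS = 6, where DRAM transfer cost is trivial relative to compute. On A100 at JPS = 10 the two are effectively tied (240.9 s vs 238.0 s), but the cpu_offload_128G run landed on an 80 GB card and the Continuum run on a 40 GB card, so the tie does not show that cpu_offload_128G can replace Continuum on tight-KV hardware, where PCIe 4.0 becomes the bottleneck.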

Undersized DRAM is no better than no DRAM, and often worse

At JPS ≥ 3, lmcache_16G is slower than FCFS (which has no external tier) in every H100 and A100 cell, by 8–14% on H100. cpu_offload_16G is at best on par with FCFS (within about 3% on H100), so it pays for an external tier without gaining anything. LRU thrashes: KV is offloaded then immediately evicted before reuse, paying transfer cost both ways with no benefit. Rule of thumb: DRAM must exceed the sum of all concurrent jobs' peak KV, not average.
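
The rule of thumb can be turned into a rough sizing check. The sketch below assumes the standard Llama-3.1-8B attention layout (32 layers, 8 KV heads, head dimension 128) stored in bfloat16, which works out to 128 KiB of KV per token. It ignores any connector metadata or fragmentation overhead, treats 1 GB as 10^9 bytes, and uses a hypothetical concurrency level and per-job peak.

```python
# Llama-3.1-8B attention layout; 2 bytes per element for bfloat16.
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BYTES_PER_ELEM = 32, 8, 128, 2

# Keys and values are both stored, hence the leading factor of 2.
KV_BYTES_PER_TOKEN = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def dram_capacity_tokens(dram_gb):
    """Upper bound on tokens of KV that fit in `dram_gb` (1 GB = 1e9 bytes)."""
    return int(dram_gb * 1e9) // KV_BYTES_PER_TOKEN


def dram_is_sufficient(dram_gb, peak_kv_tokens_per_job):
    """Rule of thumb: DRAM must hold the SUM of concurrent jobs' PEAK KV."""
    return sum(peak_kv_tokens_per_job) <= dram_capacity_tokens(dram_gb)


# Hypothetical load: 20 concurrent jobs, each peaking at 8,000 tokens of KV.
peaks = [8000] * 20
fits_16g = dram_is_sufficient(16, peaks)
fits_128g = dram_is_sufficient(128, peaks)
```

Under these assumptions a 16 GB tier holds about 122K tokens and is exceeded by the hypothetical 160K-token peak, while a 128 GB tier holds close to 1M tokens.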

H200 makes policy selection moot

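With 141 GB HBM3e, the entire 100-job working set fits in VRAM. Continuum and cpu_offload_128G stay within ±8% of FCFS at every JPS; the exceptions are lmcache_128G (up to about 30% slower) and the 16 GB-DRAM variants (15–56% slower). Buying H200 is effectively buying insensitivity to scheduler bugs.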

5.1 Engineering issues surfaced by this matrix

5.2 What we did not measure (limitations)

7. Interactive Trace Viewer

Every run produces per-step JSONL traces (step_snapshot + step_decision + prefix_cache_lookup). Load them into the interactive scheduler trace viewer to inspect individual steps, prefill/decode phases, KV block allocation, pin state, and admission limits.

Local viewer path: ~/Project/Agentic_KVCache_management/sched_trace_viewer.html

Open in browser, then load sched_trace.<policy>.steps.jsonl + .requests.jsonl from any run directory under results/matrix_100job_{h100,a100,h200}/<policy>_jps<N>/.
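
For scripted inspection outside the viewer, the same files can be read as ordinary JSONL. The sketch below builds a run directory from the layout described above and loads one record per line. The helper names are hypothetical, and no field names are assumed because the record schema is not described here.

```python
import json
from pathlib import Path


def run_dir(results_root, gpu, policy, jps):
    """Directory for one run, following results/matrix_100job_<gpu>/<policy>_jps<N>/."""
    return Path(results_root) / f"matrix_100job_{gpu}" / f"{policy}_jps{jps}"


def load_jsonl(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    rows = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


# Example: locate the step trace for Continuum on H100 at JPS = 3.
steps_path = run_dir("results", "h100", "continuum", 3) / "sched_trace.continuum.steps.jsonl"
```

Calling `load_jsonl(steps_path)` on an existing run then yields one dict per trace record, which can be counted or filtered with ordinary Python.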

Viewer features