Interactive single-file HTML viewer for vLLM V1 scheduler traces. Reconstructs per-step scheduling decisions from sched_trace.<policy>.{steps,requests}.jsonl, maps each step to one of ten admission buckets, and pinpoints the exact scheduler.py break that blocked admission.
vLLM V1's scheduler (Scheduler.schedule(), ~500 lines) applies a priority-ordered sequence of checks — token budget, KV allocator, sequence cap, LoRA cap, KV connector state — against every waiting request on every step. When throughput drops, the interesting question is which of those checks broke first and how often. Reading GPU utilization plots or TTFT percentiles tells you performance is bad; it does not tell you whether to tune max_num_scheduled_tokens, add more KV blocks, bump max_num_seqs, or fix a fragmentation bug in the allocator.
sched_trace_viewer.html is a single self-contained HTML file that:
scheduler.py.steps.jsonl and optionally requests.jsonl; no backend required. An optional /api/traces endpoint enables remote listing without downloading.break inside schedule() fired — the trace is a post-hoc snapshot. Everything the viewer reports about admission bottlenecks is reconstructed from snapshot fields using the same priority order the scheduler uses internally. This is why the classifier can occasionally disagree with reality on edge cases (e.g., an allocator that refused due to a LoRA cap rather than KV).
Each scheduling run produces two JSONL files: a steps file containing step-level events and a requests file containing per-request state at every step. The steps file is required; the requests file enables drill-down, prefix-cache-hit analysis, and alloc-rejected reason detection.
# Example trace layout on disk
/home/hclin/Agentic_KVCache_management/logs/cpu_offload/
├── sched_trace.cpu_offload.steps.jsonl # ~20–30 MB typical, sometimes up to hundreds
└── sched_trace.cpu_offload.requests.jsonl # ~200–500 MB typical (streamed for >50MB)
Every line is an independent JSON object. The event field discriminates.
Emitted at the top of every schedule() call, before any admission. Captures the queues exactly as the scheduler is about to process them.
{
"event": "step_snapshot",
"step": 5423,
"ts": 1713567890.123, // unix seconds
"policy": "cpu_offload", // scheduling policy name
"free_blocks": 533, // KV blocks currently free
"total_blocks": 10318, // KV blocks total
"num_running": 12,
"num_waiting": 16,
"num_pinned": 0, // pinned job count (CPU-offload holds)
"running_req_ids": ["swe_0010_t6", ...],
"waiting_req_ids": ["swe_0011_t10", ...],
"running_job_ids": ["swe_0010", ...],
"waiting_job_ids": ["swe_0011", ...],
"cpu_free_blocks": 3841, // DRAM offload pool (when enabled)
"cpu_total_blocks": 4096,
"pinned_blocks": 0,
"pinned_job_ids": [],
"max_num_scheduled_tokens": 2048, // token budget per step
"max_num_running_reqs": 100 // concurrent-seq cap
}
Emitted at the end of every schedule() call, after admission decisions are made. Records exactly what the scheduler decided to run this step.
{
"event": "step_decision",
"step": 5423,
"ts": 1713567890.125,
"policy": "cpu_offload",
"scheduled_new_req_ids": [], // first-time admit from waiting
"scheduled_resumed_req_ids": [], // re-admit after preempt / remote-KV
"scheduled_running_req_ids": ["swe_0010_t6", ...], // running reqs granted more tokens
"preempted_req_ids": [], // demoted from running to waiting
"num_scheduled_tokens": {
"swe_0010_t6": 1, // decode: 1 token
"swe_0019_t7": 1552 // prefill chunk: 1552 tokens
}
}
Emitted once per request, on the first step it is scheduled. Splits the prefix-cache hit into local (vLLM's own cache) vs external (from a KV connector like CPUOffload or LMCache).
{
"event": "prefix_cache_lookup",
"step": 1000,
"ts": 1713567000.000,
"policy": "cpu_offload",
"req_id": "swe_0019_t7",
"job_id": "swe_0019",
"local_hit_tokens": 96, // vLLM prefix cache hit
"external_hit_tokens": 0, // connector hit
"total_hit_tokens": 96,
"total_prompt_tokens": 5960,
"num_tokens": 5960,
"connector": "CPUOffloadConnector",
"load_kv_async": false
}
For every request observed in either queue on a step, one JSON record is appended. Keyed by (step, req_id) — the viewer uses this for click-to-expand drill-down.
{
"step": 5423,
"ts": 1713567890.123,
"queue": "running", // "running" | "waiting"
"req_id": "swe_0010_t6",
"job_id": "swe_0010",
"status": "RequestStatus.RUNNING",
"priority": 0,
"arrival_time": 1713567000.0,
"max_tokens": 8192,
"num_prompt_tokens": 5960,
"num_tokens": 5987, // prompt + output so far
"num_output_tokens": 27,
"num_computed_tokens": 5987,
"num_cached_tokens": 96, // prefix cache hit (total)
"num_preemptions": 0,
"num_external_computed_tokens": 0,
"is_prefill_chunk": false,
"resumable": true,
"prompt_prefix": [128000, 882, ...], // first _MAX_PROMPT_FP token ids
"prompt_suffix": [128009, ...],
"output_tail": [1, 2, 3, ...],
"prompt_prefix_text": "<|begin_of_text|>You are...", // only if SCHED_TRACE_TOKENIZER set
"output_tail_text": "...final answer.",
"block_hashes_count": 374
}
file.stream().pipeThrough(TextDecoderStream)) to avoid OOM.
The emitter (vllm/v1/core/sched/trace.py) respects these environment variables at module load:
| Env var | Purpose | Default |
|---|---|---|
SCHED_TRACE_PATH | Output directory (required to enable tracing) | unset → tracing off |
SCHED_TRACE_POLICY | Policy name stamped into every record; used as filename infix | unknown |
SCHED_TRACE_EVERY_N | Emit only every N-th step (sampling) | 1 |
SCHED_TRACE_MAX_STEPS | Stop emitting after this many steps | 1_000_000 |
SCHED_TRACE_TOKENIZER | Tokenizer path; when set, *_text fields are decoded for drill-down UI | unset → token IDs only |
SCHED_TRACE_EVERY_N=10 divides trace size by ~10 but breaks the ability to track per-step running-queue age (the viewer's (N) counter inside running pills). Use sampling for large sweeps, no sampling when debugging a specific bottleneck.
Three paths. Pick whichever matches your deployment.
When the viewer is served by an HTTP server that also exposes a GET /api/traces endpoint returning a JSON list of trace directories, the landing page shows a browsable list. Clicking Load fetches /api/file?path=… with the absolute path on the server — so the viewer keeps the full path and shows it in the header bar.
[
{
"name": "dma_highload_a100_jps15_dram128",
"date": "2026-04-19 16:02",
"steps_mb": 23.4,
"requests_mb": 348.4,
"dma_mb": 0.8,
"steps": ["/home/hclin/logs/dma_highload_a100_jps15_dram128/sched_trace.cpu_offload.steps.jsonl"],
"requests": ["/home/hclin/logs/dma_highload_a100_jps15_dram128/sched_trace.cpu_offload.requests.jsonl"]
},
...
]
GPU is inferred client-side from the name via regex — the server does not need to populate a GPU field. The classifier recognises a100, h100, h200, b200, 5090 (or rtx5090), l40s, and cpu_offload, respecting underscore/hyphen separators.
Open the viewer via file:// or HTTP, drop either or both JSONL files onto the card, or pick them with the button. The viewer auto-detects steps vs requests files by probing the first few records:
{event: step_snapshot} or {event: step_decision} → treated as steps file.{req_id, step} with no event field → treated as requests file..requests.jsonl / .steps.jsonl.file.webkitRelativePath (gives you the relative subdirectory), or use the remote-trace path to keep the full absolute path.
Dropping only the steps file is enough to render the timeline and Statistics tab. Dropping requests too enables: per-req token-count display in pills, prefix-cache hit details, click-to-expand modal, and the minNeed computation used to distinguish kv-exhausted from kv-tight in bottleneck bucketing.
After loading a trace the top of the page shows a single-line summary with color-coded tokens for each axis of the run. Everything except the source path is derived from the first step_snapshot and from aggregates over all steps; the source path comes from whichever loader was used.
| Token | Where it comes from | Notes |
|---|---|---|
CPU_OFFLOAD | step_snapshot.policy | Uppercased; fallback unknown. |
A100-40GB | Inferred from total_blocks | <8K → 5090, <15K → A100-40GB, <35K → H100/A100-80GB range, <60K → H200-141GB, else unknown. |
KV:10318blk(20GB) | step_snapshot.total_blocks | Block → bytes assumes 16 tokens × 32 layers × 8 KV-heads × 128 dim × 2 (KV) × 2 B = 2 MB per block (Llama-3.1-8B default). |
DRAM:65536blk(128GB) | step_snapshot.cpu_total_blocks | Only shown when a CPU offload connector is active. |
100 jobs | Distinct job_ids across all running/waiting queues | Not request count — one job typically has multiple turns = multiple request_ids. |
steps 0–5423 (5424) | First/last step value across step_snapshot events | Parenthetical count reflects emitted records, which may differ from last − first + 1 if SCHED_TRACE_EVERY_N > 1. |
487.9s | last.ts − first.ts | Wall-clock duration of the instrumented window. |
maxRun:100 | max(s.num_running for s in STEPS) | Peak concurrency actually achieved. |
342K reqs | REQS.size | Total per-req observations in the requests file (= sum of queue sizes across all steps). |
src .../sched_trace.cpu_offload.steps.jsonl | FILE_NAMES.steps_path | Right-aligned pill; hover shows full path. Full absolute path when loaded via /api, basename when dropped locally (browser security). |
The default tab. Each row is one step. Five columns: step, ts (elapsed + delta), batch classification (aggregate badges), running queue (one pill per req), waiting queue (one pill per req). Use the Start / End / Window controls to page through the trace.
Each pill represents one request at one step. The category is determined by set-difference against step_decision:
| Pill | Computed as | Meaning |
|---|---|---|
| running | scheduled_running_req_ids | Already running; granted more tokens this step. |
| new | scheduled_new_req_ids | First-time admit from waiting queue. |
| resumed | scheduled_resumed_req_ids | Re-admit after earlier preemption or remote-KV wait. |
| preempted | preempted_req_ids | Demoted from running queue back to waiting this step (KV pressure). |
| skipped-run | running_req_ids \ scheduled_running_req_ids | Sat in running queue but scheduler granted zero tokens (exhausted token budget / mamba-align / encoder cap). |
| waiting | waiting_req_ids \ (new ∪ resumed) | In the waiting queue and not admitted this step. A headline symptom of throughput loss. |
| pinned | Overlay when job_id ∈ pinned_job_ids | CPU-offload / prefix-sharing is holding this job's blocks in VRAM (not evictable). |
| prefill | Red outline when num_computed_tokens < num_prompt_tokens | This step is still prefilling (not yet decoding). Large prefill chunks are the primary consumer of token budget. |
A running pill for job swe_0008, turn 9, on step 1000:
j0008, t9 (3) [385, fwd:1542t, 1542t left] hit:5/481(1%)
│ │ │ │ │ │ │
│ │ │ │ │ │ └── prefix cache hit: 5 blocks of 481 (1%)
│ │ │ │ │ └── prefill remaining: 1542 prompt tokens left
│ │ │ │ └── fwd this step: 1542 tokens (scheduled by decision)
│ │ │ └── KV blocks currently held: 385
│ │ └── age in running queue (N-th consecutive step; p3 = 3rd prefill step, 3 = 3rd decode step)
│ └── turn index within job (t0 = first turn; t9 = 10th turn of conversation)
└── job short id (strips "job_"/"swe_" prefix)
Waiting pills show different fields (block counts estimated from prompt size; prefix-cache-lookup may be peeked ahead from future steps for reqs not yet looked up):
j0011, t10 [prefill:521b(8330t), 8410t total, 1% cached]
│ │ │ │ │ │ │
│ │ │ │ │ │ └── cache hit percentage of total prompt
│ │ │ │ │ └── num_tokens (prompt + any output so far)
│ │ │ │ └── tokens still needed (post-cache)
│ │ │ └── blocks needed to admit (ceil((prompt − cached) / 16))
│ │ └── prefill estimate (not yet running, but will start in prefill)
│ └── turn 10
└── job id (swe_0011)
The third column stacks small badges summarising the step. Notable entries:
run:12 / new:0 / res:0 / pre:0 / skip-run:0 / wait:16 — counts of each pill category.pin:0 — number of pinned jobs (listed inline when present).tok:1553/2048 — sum of num_scheduled_tokens over budget. ≥90% triggers a 🔥 marker.cached:60b — sum of prefix-cached blocks across all observed requests this step.VRAM blk info — ava / used / pin / active / total blocks, with MB/GB conversion.DRAM blk info — analogous for CPU-offload pool.N blocked: alloc-rejected(free=533,allocator refused—hash/frag) — the admission-blocked reason — a dark-red badge appears only when waiting queue has reqs that didn't get admitted this step. This is the critical diagnostic.Five aggregate sections — computed once on first visit and cached.
Every step is classified into exactly one of ten buckets using the same priority order the scheduler applies inside schedule(). A horizontal bar chart plus inline legend with a one-line description per bucket.
| Bucket | Trigger condition | Meaning |
|---|---|---|
| tok-budget | unadmitted > 0 AND tokenSum ≥ 0.95 × budget |
Hard limit — token budget saturated. |
| kv-exhausted | unadmitted > 0 AND (effectiveFree < minNeed OR free_blocks < minNeed OR free_blocks < 20) |
Hard limit — KV provably insufficient. |
| max-seqs | unadmitted > 0 AND num_running ≥ max_num_running_reqs |
Hard limit — concurrent-sequence cap. |
| alloc-frag | unadmitted > 0 AND no hard limit hit AND free_blocks > 500 AND newCount == 0 |
Allocator refused — hash / fragmentation, free count looks fine. |
| alloc-exhausted | unadmitted > 0 AND no hard limit hit AND newCount > 0 |
Partial admit — allocator ran out mid-loop. |
| kv-tight | unadmitted > 0 AND (effectiveFree < 200 OR 20 ≤ free_blocks ≤ 500) |
Gray zone — near-full but cannot prove insufficient. |
| admitted-all | num_waiting > 0 AND unadmitted == 0 |
Healthy — every waiting req got in this step. |
| under-capacity | num_waiting == 0 AND num_running > 0 |
Healthy — no backlog, engine still busy. |
| idle | num_waiting == 0 AND num_running == 0 |
Engine idle. |
| unknown | Trace missing fields | Older trace version lacking free_blocks / waiting-req metadata. |
Derived quantities used by the classifier:
unadmitted = num_waiting − (|scheduled_new| + |scheduled_resumed|)minNeed = min over waiting reqs of ⌈(num_prompt_tokens − num_cached_tokens) / 16⌉ — smallest block count needed to admit any one waiting req. Requires requests file.runningConsumed = Σ over running reqs where fwd > 1 of ⌈fwd / 16⌉ — KV blocks the running queue's prefill will eat this step.effectiveFree = free_blocks − runningConsumed — approximates what's actually available to waiting reqs after the running queue takes its share.Two stacked area charts plotting block occupancy per step: VRAM (free/pinned/active/total) and DRAM (free/used/total when a CPU-offload connector is present). Lets you see preemption waves, pin pressure, and offload-pool growth directly.
Histogram over prefix_cache_lookup events, bucketed by (local_hit + external_hit) / total_prompt_tokens. Shows whether a workload relies on prefix reuse (heavy left tail = bad for agentic) and how much of the hit is served locally vs via a connector.
Twenty bins of tokenSum / budget per step. Dual x-axis: the top tick row is percentage (0%, 20%, …, 100%), the bottom row is the absolute token count (e.g. 1.6k, 8192t) so you can read the budget in concrete units. A red dashed line marks the mean.
Headline stats — total steps, mean/p50/p95 running queue size, mean/p50/p95 waiting queue size, peak KV usage, preempt count, and connector (if any) prefix-cache hit rate. Use it when you want numbers to paste into an experiment report.
This is the critical diagnostic path. Both the per-step red blocked badge and the Statistics tab buckets run the same classifier. Here we walk through the scheduler.py admission loop and map each break / continue / skip back to a reason code.
Condensed from vllm/v1/core/sched/scheduler.py, approximately lines 693–900:
# --- Stage 1: running queue (not shown here) consumes token_budget first,
# preempting low-priority reqs if allocate_slots fails.
# --- Stage 2: waiting-queue admission loop ---
while (self.waiting or self.skipped_waiting) and token_budget > 0:
# [MAX-SEQS] concurrent-sequence cap — tokens/KV don't matter here.
if len(self.running) == self.max_num_running_reqs:
break
request = request_queue.peek_request()
# [SKIP-AND-CONTINUE] LoRA cap / KV-connector "need remote" / WAITING_FOR_REMOTE_KVS
# — does NOT break the loop, just defers this req and tries the next.
if blocked_by_lora_or_connector(request):
request_queue.pop_request()
step_skipped_waiting.prepend_request(request)
continue
# num_new_tokens is capped at token_budget below.
num_new_tokens = request.num_tokens - num_computed_tokens
if 0 < long_prefill_token_threshold < num_new_tokens:
num_new_tokens = long_prefill_token_threshold
# [TOK-BUDGET] chunked-prefill off & single-req needs more than remaining budget.
if not enable_chunked_prefill and num_new_tokens > token_budget:
break
num_new_tokens = min(num_new_tokens, token_budget)
# [KV-EXHAUSTED (can_fit_full_sequence)] if scheduler_reserve_full_isl is on.
if reserve_full_isl and not kv_cache_manager.can_fit_full_sequence(request, ...):
break
# [KV-EXHAUSTED / KV-TIGHT / ALLOC-FRAG / ALLOC-EXHAUSTED].
# allocate_slots returns None if the KV block allocator cannot find suitable blocks
# (considers fragmentation, hash-collision policy, and existing pins).
new_blocks = kv_cache_manager.allocate_slots(request, num_new_tokens, ...)
if new_blocks is None:
break
# Admission succeeds — push into running queue, record decision.
self.running.append(request)
num_scheduled_tokens[request_id] = num_new_tokens
token_budget -= num_new_tokens
scheduled_new_reqs.append(request)
# loop ends naturally when waiting empties or token_budget hits 0.
The trace does not record which break fired. The viewer reconstructs by checking conditions in the same order the scheduler would see them, stopping at the first match. Priority matches the loop's textual order so the attribution is correct for any step where exactly one condition held at break time.
| Priority | Reason | Check | scheduler.py position |
|---|---|---|---|
| 1 | tok-budget | tokenSum ≥ 0.95 × budget | Outer while token_budget > 0 exit + break on chunked-prefill single-req overflow (~L813) |
| 2 | kv-after-run | effectiveFree < minNeed | allocate_slots returns None (~L885) — running queue's prefill consumed the free blocks before waiting got a chance |
| 3 | kv-insuf | free_blocks < minNeed | allocate_slots returns None (~L885) — snapshot already shows fewer free blocks than the smallest waiting req needs |
| 4 | kv-low | free_blocks < 20 | Same as above; fallback when minNeed can't be computed |
| 5 | max-seqs | num_running ≥ max_num_running_reqs | break at L694 |
| 6 | kv-tight-after-run | effectiveFree < 200 | Same as kv-* but below the hard-exhaustion threshold. Indicative, not definitive. |
| 7 | alloc-exhausted | newCount > 0 (partial admit) | allocate_slots returns None mid-loop after some admits succeeded |
| 8 | alloc-rejected | free_blocks > 500 + zero admits | allocate_slots returns None despite plenty of free blocks = hash collision or fragmentation in the KVCacheManager.block_pool. |
| 9 | kv-marginal | 20 ≤ free_blocks ≤ 500 | Gray zone — can't prove which condition fired |
| 10 | unknown | fallback | Trace missing the fields needed to classify |
The per-step red badge exposes all ten reason codes (including the fine-grained kv-after-run, kv-insuf, kv-low, and kv-marginal). The Statistics tab consolidates them down to six top-level names for legibility:
| Statistics bucket | Absorbs these per-step reasons |
|---|---|
tok-budget | tok-budget |
kv-exhausted | kv-after-run, kv-insuf, kv-low |
max-seqs | max-seqs |
alloc-frag | alloc-rejected |
alloc-exhausted | alloc-exhausted |
kv-tight | kv-tight-after-run, kv-marginal |
continue, they don't break — the waiting queue keeps being served). It also cannot directly identify WAITING_FOR_REMOTE_KVS: those reqs skip via _is_blocked_waiting_status and never reach allocate_slots. If a step's unadmitted count is high while all buckets report healthy, check the step's waiting-queue pills for high WAITING_FOR_REMOTE_KVS proportion (visible in the drill-down modal).
Concrete case from a CPU-offload trace. The viewer's timeline showed this batch classification:
step=1000
run:12 wait:16
tok:1553/2048
16 blocked: alloc-rejected(free=533, allocator refused—hash/frag)
cached:60b
VRAM: ava:533 used:5178/5711 (10.1 GB used / 11.2 GB total)
DRAM: ava:3841 used:255/4096 ( 510 MB used / 8.0 GB total)
unadmitted = 16 − 0 = 16.
tokenSum=1553 vs budget=2048 → 75.8%, below 95% threshold. No.
free_blocks=533, runningConsumed ≈ 60 (a few running reqs had fwd>1), effectiveFree ≈ 473. Smallest waiting req's minNeed = ⌈(5605−56)/16⌉ = 347. Since 473 ≥ 347, not kv-exhausted. free_blocks=533 ≥ 20, not kv-low either. No.
num_running=12 vs max_num_running_reqs=100. No.
effectiveFree=473 ≥ 200. No.
newCount = 0, not partial admit. No.
free_blocks=533 > 500 AND newCount == 0. YES — this step is alloc-frag.
The scheduler had 533 free blocks on hand and 16 waiting requests, the smallest of which needed 347 blocks. Numerically the allocation should succeed. But kv_cache_manager.allocate_slots() returned None for every single one of those 16 — which means the underlying KVCacheManager could not find a qualifying free block. The most common reasons, in order of likelihood:
num_cached_tokens per req shows >0 on most running reqs.pin:0 in this step, earlier steps may have taken out transient pins that haven't been released. Re-check preceding steps.free_blocks/total_blocks > 5%, but possible with high churn.What to do: enable chunked prefill if off; drop scheduler_reserve_full_isl; or cap cache_salt variance so prefix hashes converge rather than fragmenting.
Summary of where trace.py emits, from scheduler.py's point of view:
# Before starting vllm serve
export SCHED_TRACE_PATH=/home/hclin/logs/my_run
export SCHED_TRACE_POLICY=cpu_offload
export SCHED_TRACE_TOKENIZER=/home/hclin/shared_models/Meta-Llama-3.1-8B-Instruct # optional
export SCHED_TRACE_EVERY_N=1 # 1 = every step; 10 = every 10th
export SCHED_TRACE_MAX_STEPS=100000
vllm serve <model> --scheduling-policy cpu_offload --port 8199 ...
Traces are opened lazily on first emit and flushed on process exit. For graceful shutdown use kill -2 (SIGINT) — never kill -9 on a GPU process (corrupts CUDA driver state).
The emitter accepts **extra kwargs on all three primary functions (step_snapshot, step_decision, prefix_cache_event). New fields land in the JSONL verbatim — no schema version bump needed. The viewer ignores unknown keys but will happily display them in the drill-down modal when a user clicks a pill.
If you instrument the break points directly (e.g., sched_trace.break_reason_event(step, reason="alloc_refused")) the viewer's classifier could be retired and the attribution would become ground-truth. That change requires editing scheduler.py at each break; see the Scheduler page for the full loop layout.
| Control | What it does |
|---|---|
| Refresh icon | Re-fetch /api/traces; spins while loading. |
| Filter input | Case-insensitive substring match on trace name. |
| Sort group | Name / Date / GPU. Click same key to toggle ↑/↓. Default: Date ↓ (newest first). |
| Load button | Fetch steps file for this trace. |
| +reqs button | Fetch both steps and requests file (streamed for large files). |
| Drag-drop zone | Accepts files or folders. Folder drag populates webkitRelativePath. |
| Control | What it does |
|---|---|
Start step | Inclusive lower bound of the displayed range. |
End step | Inclusive upper bound. |
Window | Max rows rendered at once. Default 500 — higher ≈ more scroll, more DOM. |
Apply | Re-render with new range. |
Load another | Return to landing page. |
Timeline / Statistics | Tab toggle. Statistics computes on first visit and caches. |