Scheduler Trace Viewer

1. What the Viewer Solves

vLLM V1's scheduler (Scheduler.schedule(), ~500 lines) applies a priority-ordered sequence of checks — token budget, KV allocator, sequence cap, LoRA cap, KV connector state — against every waiting request on every step. When throughput drops, the interesting question is which of those checks broke first and how often. Reading GPU utilization plots or TTFT percentiles tells you performance is bad; it does not tell you whether to tune max_num_scheduled_tokens, add more KV blocks, bump max_num_seqs, or fix a fragmentation bug in the allocator.

sched_trace_viewer.html is a single self-contained HTML file that:

Replays every step — shows the running and waiting queues, which requests got admitted/preempted, tokens scheduled, and KV/DRAM block usage, as a scrollable timeline.
Bucketizes admission bottlenecks — ten mutually-exclusive categories, each mapped to a specific break point in scheduler.py.
Explains allocator refusals — distinguishes true KV exhaustion from fragmentation (free block count looks fine, but the allocator still returns None) and from partial admission (some reqs got in, the next one didn't).
Runs entirely in the browser — drop in steps.jsonl and optionally requests.jsonl; no backend required. An optional /api/traces endpoint enables remote listing without downloading.

Design constraint: the viewer never has direct knowledge of which break inside schedule() fired — the trace is a post-hoc snapshot. Everything the viewer reports about admission bottlenecks is reconstructed from snapshot fields using the same priority order the scheduler uses internally. This is why the classifier can occasionally disagree with reality on edge cases (e.g., an allocator that refused due to a LoRA cap rather than KV).

1.1 Layered architecture

vllm/v1/core/sched/scheduler.py

Scheduler.schedule() — calls sched_trace emitters before & after the admission loop.

↓

vllm/v1/core/sched/trace.py

Lightweight emitter. Writes JSONL records for step_snapshot, step_decision, prefix_cache_lookup events.

↓

sched_trace.<policy>.{steps,requests}.jsonl

Two JSONL files per scheduling run — step-level events in one, per-request drill-down state in the other.

↓

sched_trace_viewer.html

Client-side parser + renderer. Also provides /api/traces fetch path for remote browsing.

2. Trace Data Format

Each scheduling run produces two JSONL files: a steps file containing step-level events and a requests file containing per-request state at every step. The steps file is required; the requests file enables drill-down, prefix-cache-hit analysis, and alloc-rejected reason detection.

# Example trace layout on disk
/home/hclin/Agentic_KVCache_management/logs/cpu_offload/
├── sched_trace.cpu_offload.steps.jsonl       # ~20–30 MB typical, sometimes up to hundreds
└── sched_trace.cpu_offload.requests.jsonl    # ~200–500 MB typical (streamed for >50MB)

2.1 Steps file — three event types

Every line is an independent JSON object. The event field discriminates.

event: step_snapshot

Emitted at the top of every schedule() call, before any admission. Captures the queues exactly as the scheduler is about to process them.

{
  "event": "step_snapshot",
  "step": 5423,
  "ts": 1713567890.123,          // unix seconds
  "policy": "cpu_offload",        // scheduling policy name
  "free_blocks": 533,             // KV blocks currently free
  "total_blocks": 10318,           // KV blocks total
  "num_running": 12,
  "num_waiting": 16,
  "num_pinned": 0,                  // pinned job count (CPU-offload holds)
  "running_req_ids": ["swe_0010_t6", ...],
  "waiting_req_ids": ["swe_0011_t10", ...],
  "running_job_ids": ["swe_0010", ...],
  "waiting_job_ids": ["swe_0011", ...],
  "cpu_free_blocks": 3841,         // DRAM offload pool (when enabled)
  "cpu_total_blocks": 4096,
  "pinned_blocks": 0,
  "pinned_job_ids": [],
  "max_num_scheduled_tokens": 2048,  // token budget per step
  "max_num_running_reqs": 100       // concurrent-seq cap
}

event: step_decision

Emitted at the end of every schedule() call, after admission decisions are made. Records exactly what the scheduler decided to run this step.

{
  "event": "step_decision",
  "step": 5423,
  "ts": 1713567890.125,
  "policy": "cpu_offload",
  "scheduled_new_req_ids": [],                      // first-time admit from waiting
  "scheduled_resumed_req_ids": [],                  // re-admit after preempt / remote-KV
  "scheduled_running_req_ids": ["swe_0010_t6", ...], // running reqs granted more tokens
  "preempted_req_ids": [],                         // demoted from running to waiting
  "num_scheduled_tokens": {
    "swe_0010_t6": 1,                              // decode: 1 token
    "swe_0019_t7": 1552                            // prefill chunk: 1552 tokens
  }
}

event: prefix_cache_lookup

Emitted once per request, on the first step it is scheduled. Splits the prefix-cache hit into local (vLLM's own cache) vs external (from a KV connector like CPUOffload or LMCache).

{
  "event": "prefix_cache_lookup",
  "step": 1000,
  "ts": 1713567000.000,
  "policy": "cpu_offload",
  "req_id": "swe_0019_t7",
  "job_id": "swe_0019",
  "local_hit_tokens": 96,                   // vLLM prefix cache hit
  "external_hit_tokens": 0,                 // connector hit
  "total_hit_tokens": 96,
  "total_prompt_tokens": 5960,
  "num_tokens": 5960,
  "connector": "CPUOffloadConnector",
  "load_kv_async": false
}

2.2 Requests file — per-step per-req state

For every request observed in either queue on a step, one JSON record is appended. Keyed by (step, req_id) — the viewer uses this for click-to-expand drill-down.

{
  "step": 5423,
  "ts": 1713567890.123,
  "queue": "running",                  // "running" | "waiting"
  "req_id": "swe_0010_t6",
  "job_id": "swe_0010",
  "status": "RequestStatus.RUNNING",
  "priority": 0,
  "arrival_time": 1713567000.0,
  "max_tokens": 8192,
  "num_prompt_tokens": 5960,
  "num_tokens": 5987,                // prompt + output so far
  "num_output_tokens": 27,
  "num_computed_tokens": 5987,
  "num_cached_tokens": 96,             // prefix cache hit (total)
  "num_preemptions": 0,
  "num_external_computed_tokens": 0,
  "is_prefill_chunk": false,
  "resumable": true,
  "prompt_prefix": [128000, 882, ...],         // first _MAX_PROMPT_FP token ids
  "prompt_suffix": [128009, ...],
  "output_tail": [1, 2, 3, ...],
  "prompt_prefix_text": "<|begin_of_text|>You are...",  // only if SCHED_TRACE_TOKENIZER set
  "output_tail_text": "...final answer.",
  "block_hashes_count": 374
}

Size trade-off: the requests file is typically 10×–20× larger than the steps file because it writes one record per req per step. For traces >50 MB, the viewer auto-switches to a streaming parser (file.stream().pipeThrough(TextDecoderStream)) to avoid OOM.

2.3 Sampling & tokenizer controls

The emitter (vllm/v1/core/sched/trace.py) respects these environment variables at module load:

Env var	Purpose	Default
`SCHED_TRACE_PATH`	Output directory (required to enable tracing)	unset → tracing off
`SCHED_TRACE_POLICY`	Policy name stamped into every record; used as filename infix	`unknown`
`SCHED_TRACE_EVERY_N`	Emit only every N-th step (sampling)	`1`
`SCHED_TRACE_MAX_STEPS`	Stop emitting after this many steps	`1_000_000`
`SCHED_TRACE_TOKENIZER`	Tokenizer path; when set, `*_text` fields are decoded for drill-down UI	unset → token IDs only

On sampling: setting SCHED_TRACE_EVERY_N=10 divides trace size by ~10 but breaks the ability to track per-step running-queue age (the viewer's (N) counter inside running pills). Use sampling for large sweeps, no sampling when debugging a specific bottleneck.

3. Loading a Trace

Three paths. Pick whichever matches your deployment.

3.1 Remote trace list (recommended)

When the viewer is served by an HTTP server that also exposes a GET /api/traces endpoint returning a JSON list of trace directories, the landing page shows a browsable list. Clicking Load fetches /api/file?path=… with the absolute path on the server — so the viewer keeps the full path and shows it in the header bar.

Expected /api/traces response shape

[
  {
    "name": "dma_highload_a100_jps15_dram128",
    "date": "2026-04-19 16:02",
    "steps_mb": 23.4,
    "requests_mb": 348.4,
    "dma_mb": 0.8,
    "steps":    ["/home/hclin/logs/dma_highload_a100_jps15_dram128/sched_trace.cpu_offload.steps.jsonl"],
    "requests": ["/home/hclin/logs/dma_highload_a100_jps15_dram128/sched_trace.cpu_offload.requests.jsonl"]
  },
  ...
]

GPU is inferred client-side from the name via regex — the server does not need to populate a GPU field. The classifier recognises a100, h100, h200, b200, 5090 (or rtx5090), l40s, and cpu_offload, respecting underscore/hyphen separators.

3.2 Local file picker / drag-drop

Open the viewer via file:// or HTTP, drop either or both JSONL files onto the card, or pick them with the button. The viewer auto-detects steps vs requests files by probing the first few records:

Contains {event: step_snapshot} or {event: step_decision} → treated as steps file.
Contains {req_id, step} with no event field → treated as requests file.
Falls back to filename suffix .requests.jsonl / .steps.jsonl.

Browser limitation: the File API only exposes basename; the full path stays on disk. Drag a folder instead of a file to populate file.webkitRelativePath (gives you the relative subdirectory), or use the remote-trace path to keep the full absolute path.

3.3 Only one file is required

Dropping only the steps file is enough to render the timeline and Statistics tab. Dropping requests too enables: per-req token-count display in pills, prefix-cache hit details, click-to-expand modal, and the minNeed computation used to distinguish kv-exhausted from kv-tight in bottleneck bucketing.

4. The Header Info Bar

After loading a trace the top of the page shows a single-line summary with color-coded tokens for each axis of the run. Everything except the source path is derived from the first step_snapshot and from aggregates over all steps; the source path comes from whichever loader was used.

CPU_OFFLOAD · A100-40GB · KV:10318blk(20GB) · DRAM:65536blk(128GB) · 100 jobs · steps 0–5423 (5424) · 487.9s · maxRun:100 · 342K reqs src/home/hclin/logs/…/sched_trace.cpu_offload.steps.jsonl

Token	Where it comes from	Notes
`CPU_OFFLOAD`	`step_snapshot.policy`	Uppercased; fallback `unknown`.
`A100-40GB`	Inferred from `total_blocks`	<8K → 5090, <15K → A100-40GB, <35K → H100/A100-80GB range, <60K → H200-141GB, else unknown.
`KV:10318blk(20GB)`	`step_snapshot.total_blocks`	Block → bytes assumes 16 tokens × 32 layers × 8 KV-heads × 128 dim × 2 (KV) × 2 B = 2 MB per block (Llama-3.1-8B default).
`DRAM:65536blk(128GB)`	`step_snapshot.cpu_total_blocks`	Only shown when a CPU offload connector is active.
`100 jobs`	Distinct `job_id`s across all running/waiting queues	Not request count — one job typically has multiple turns = multiple request_ids.
`steps 0–5423 (5424)`	First/last `step` value across `step_snapshot` events	Parenthetical count reflects emitted records, which may differ from `last − first + 1` if `SCHED_TRACE_EVERY_N > 1`.
`487.9s`	last.ts − first.ts	Wall-clock duration of the instrumented window.
`maxRun:100`	max(s.num_running for s in STEPS)	Peak concurrency actually achieved.
`342K reqs`	`REQS.size`	Total per-req observations in the requests file (= sum of queue sizes across all steps).
`src .../sched_trace.cpu_offload.steps.jsonl`	FILE_NAMES.steps_path	Right-aligned pill; hover shows full path. Full absolute path when loaded via /api, basename when dropped locally (browser security).

5. Timeline View

The default tab. Each row is one step. Five columns: step, ts (elapsed + delta), batch classification (aggregate badges), running queue (one pill per req), waiting queue (one pill per req). Use the Start / End / Window controls to page through the trace.

5.1 Pill taxonomy

Each pill represents one request at one step. The category is determined by set-difference against step_decision:

running new resumed preempted skipped-run waiting pinned (purple border) prefill step (red outline)

Pill	Computed as	Meaning
running	`scheduled_running_req_ids`	Already running; granted more tokens this step.
new	`scheduled_new_req_ids`	First-time admit from waiting queue.
resumed	`scheduled_resumed_req_ids`	Re-admit after earlier preemption or remote-KV wait.
preempted	`preempted_req_ids`	Demoted from running queue back to waiting this step (KV pressure).
skipped-run	`running_req_ids \ scheduled_running_req_ids`	Sat in running queue but scheduler granted zero tokens (exhausted token budget / mamba-align / encoder cap).
waiting	`waiting_req_ids \ (new ∪ resumed)`	In the waiting queue and not admitted this step. A headline symptom of throughput loss.
pinned	Overlay when `job_id ∈ pinned_job_ids`	CPU-offload / prefix-sharing is holding this job's blocks in VRAM (not evictable).
prefill	Red outline when `num_computed_tokens < num_prompt_tokens`	This step is still prefilling (not yet decoding). Large prefill chunks are the primary consumer of token budget.

5.2 Pill label anatomy

A running pill for job swe_0008, turn 9, on step 1000:

j0008, t9 (3) [385, fwd:1542t, 1542t left] hit:5/481(1%)
│      │  │    │    │         │            │
│      │  │    │    │         │            └── prefix cache hit: 5 blocks of 481 (1%)
│      │  │    │    │         └── prefill remaining: 1542 prompt tokens left
│      │  │    │    └── fwd this step: 1542 tokens (scheduled by decision)
│      │  │    └── KV blocks currently held: 385
│      │  └── age in running queue (N-th consecutive step; p3 = 3rd prefill step, 3 = 3rd decode step)
│      └── turn index within job (t0 = first turn; t9 = 10th turn of conversation)
└── job short id (strips "job_"/"swe_" prefix)

Waiting pills show different fields (block counts estimated from prompt size; prefix-cache-lookup may be peeked ahead from future steps for reqs not yet looked up):

j0011, t10 [prefill:521b(8330t), 8410t total, 1% cached]
│      │    │        │      │        │         │
│      │    │        │      │        │         └── cache hit percentage of total prompt
│      │    │        │      │        └── num_tokens (prompt + any output so far)
│      │    │        │      └── tokens still needed (post-cache)
│      │    │        └── blocks needed to admit (ceil((prompt − cached) / 16))
│      │    └── prefill estimate (not yet running, but will start in prefill)
│      └── turn 10
└── job id (swe_0011)

5.3 Batch classification column

The third column stacks small badges summarising the step. Notable entries:

run:12 / new:0 / res:0 / pre:0 / skip-run:0 / wait:16 — counts of each pill category.
pin:0 — number of pinned jobs (listed inline when present).
tok:1553/2048 — sum of num_scheduled_tokens over budget. ≥90% triggers a 🔥 marker.
cached:60b — sum of prefix-cached blocks across all observed requests this step.
VRAM blk info — ava / used / pin / active / total blocks, with MB/GB conversion.
DRAM blk info — analogous for CPU-offload pool.
N blocked: alloc-rejected(free=533,allocator refused—hash/frag) — the admission-blocked reason — a dark-red badge appears only when waiting queue has reqs that didn't get admitted this step. This is the critical diagnostic.

Clicking a pill opens a modal showing every field for that (step, req_id) from the requests file — status, arrival time, token counts, output tail, etc. Requires the requests file to be loaded.

6. Statistics Tab

Five aggregate sections — computed once on first visit and cached.

6.1 Per-Step Bottleneck Breakdown

Every step is classified into exactly one of ten buckets using the same priority order the scheduler applies inside schedule(). A horizontal bar chart plus inline legend with a one-line description per bucket.

Bucket	Trigger condition	Meaning
tok-budget	`unadmitted > 0 AND tokenSum ≥ 0.95 × budget`	Hard limit — token budget saturated.
kv-exhausted	`unadmitted > 0 AND (effectiveFree < minNeed OR free_blocks < minNeed OR free_blocks < 20)`	Hard limit — KV provably insufficient.
max-seqs	`unadmitted > 0 AND num_running ≥ max_num_running_reqs`	Hard limit — concurrent-sequence cap.
alloc-frag	`unadmitted > 0 AND no hard limit hit AND free_blocks > 500 AND newCount == 0`	Allocator refused — hash / fragmentation, free count looks fine.
alloc-exhausted	`unadmitted > 0 AND no hard limit hit AND newCount > 0`	Partial admit — allocator ran out mid-loop.
kv-tight	`unadmitted > 0 AND (effectiveFree < 200 OR 20 ≤ free_blocks ≤ 500)`	Gray zone — near-full but cannot prove insufficient.
admitted-all	`num_waiting > 0 AND unadmitted == 0`	Healthy — every waiting req got in this step.
under-capacity	`num_waiting == 0 AND num_running > 0`	Healthy — no backlog, engine still busy.
idle	`num_waiting == 0 AND num_running == 0`	Engine idle.
unknown	Trace missing fields	Older trace version lacking `free_blocks` / waiting-req metadata.

Derived quantities used by the classifier:

unadmitted = num_waiting − (|scheduled_new| + |scheduled_resumed|)
minNeed = min over waiting reqs of ⌈(num_prompt_tokens − num_cached_tokens) / 16⌉ — smallest block count needed to admit any one waiting req. Requires requests file.
runningConsumed = Σ over running reqs where fwd > 1 of ⌈fwd / 16⌉ — KV blocks the running queue's prefill will eat this step.
effectiveFree = free_blocks − runningConsumed — approximates what's actually available to waiting reqs after the running queue takes its share.

6.2 VRAM / DRAM Utilization Over Time

Two stacked area charts plotting block occupancy per step: VRAM (free/pinned/active/total) and DRAM (free/used/total when a CPU-offload connector is present). Lets you see preemption waves, pin pressure, and offload-pool growth directly.

6.3 Prefix Cache Hit Distribution

Histogram over prefix_cache_lookup events, bucketed by (local_hit + external_hit) / total_prompt_tokens. Shows whether a workload relies on prefix reuse (heavy left tail = bad for agentic) and how much of the hit is served locally vs via a connector.

6.4 Token Budget Utilization Histogram

Twenty bins of tokenSum / budget per step. Dual x-axis: the top tick row is percentage (0%, 20%, …, 100%), the bottom row is the absolute token count (e.g. 1.6k, 8192t) so you can read the budget in concrete units. A red dashed line marks the mean.

Interpretation: a right-skewed distribution means most steps saturate the budget (good throughput utilization). A left-skewed distribution with many zero / near-zero bins and a spike at 100% often signals alternating prefill/decode cycles where idle-looking decode steps coexist with a few huge prefill chunks.

6.5 Summary Table

Headline stats — total steps, mean/p50/p95 running queue size, mean/p50/p95 waiting queue size, peak KV usage, preempt count, and connector (if any) prefix-cache hit rate. Use it when you want numbers to paste into an experiment report.

7. Admission-Blocked Reasons ↔ scheduler.py

This is the critical diagnostic path. Both the per-step red blocked badge and the Statistics tab buckets run the same classifier. Here we walk through the scheduler.py admission loop and map each break / continue / skip back to a reason code.

7.1 The admission loop, annotated

Condensed from vllm/v1/core/sched/scheduler.py, approximately lines 693–900:

# --- Stage 1: running queue (not shown here) consumes token_budget first,
#             preempting low-priority reqs if allocate_slots fails.

# --- Stage 2: waiting-queue admission loop ---
while (self.waiting or self.skipped_waiting) and token_budget > 0:
    # [MAX-SEQS] concurrent-sequence cap — tokens/KV don't matter here.
    if len(self.running) == self.max_num_running_reqs:
        break

    request = request_queue.peek_request()

    # [SKIP-AND-CONTINUE] LoRA cap / KV-connector "need remote" / WAITING_FOR_REMOTE_KVS
    # — does NOT break the loop, just defers this req and tries the next.
    if blocked_by_lora_or_connector(request):
        request_queue.pop_request()
        step_skipped_waiting.prepend_request(request)
        continue

    # num_new_tokens is capped at token_budget below.
    num_new_tokens = request.num_tokens - num_computed_tokens
    if 0 < long_prefill_token_threshold < num_new_tokens:
        num_new_tokens = long_prefill_token_threshold

    # [TOK-BUDGET] chunked-prefill off & single-req needs more than remaining budget.
    if not enable_chunked_prefill and num_new_tokens > token_budget:
        break

    num_new_tokens = min(num_new_tokens, token_budget)

    # [KV-EXHAUSTED (can_fit_full_sequence)] if scheduler_reserve_full_isl is on.
    if reserve_full_isl and not kv_cache_manager.can_fit_full_sequence(request, ...):
        break

    # [KV-EXHAUSTED / KV-TIGHT / ALLOC-FRAG / ALLOC-EXHAUSTED].
    # allocate_slots returns None if the KV block allocator cannot find suitable blocks
    # (considers fragmentation, hash-collision policy, and existing pins).
    new_blocks = kv_cache_manager.allocate_slots(request, num_new_tokens, ...)
    if new_blocks is None:
        break

    # Admission succeeds — push into running queue, record decision.
    self.running.append(request)
    num_scheduled_tokens[request_id] = num_new_tokens
    token_budget -= num_new_tokens
    scheduled_new_reqs.append(request)
# loop ends naturally when waiting empties or token_budget hits 0.

7.2 Reason priority & how the viewer picks

The trace does not record which break fired. The viewer reconstructs by checking conditions in the same order the scheduler would see them, stopping at the first match. Priority matches the loop's textual order so the attribution is correct for any step where exactly one condition held at break time.

Priority	Reason	Check	scheduler.py position
1	`tok-budget`	`tokenSum ≥ 0.95 × budget`	Outer `while token_budget > 0` exit + `break` on chunked-prefill single-req overflow (~L813)
2	`kv-after-run`	`effectiveFree < minNeed`	`allocate_slots` returns None (~L885) — running queue's prefill consumed the free blocks before waiting got a chance
3	`kv-insuf`	`free_blocks < minNeed`	`allocate_slots` returns None (~L885) — snapshot already shows fewer free blocks than the smallest waiting req needs
4	`kv-low`	`free_blocks < 20`	Same as above; fallback when `minNeed` can't be computed
5	`max-seqs`	`num_running ≥ max_num_running_reqs`	`break` at L694
6	`kv-tight-after-run`	`effectiveFree < 200`	Same as kv-* but below the hard-exhaustion threshold. Indicative, not definitive.
7	`alloc-exhausted`	`newCount > 0` (partial admit)	`allocate_slots` returns None mid-loop after some admits succeeded
8	`alloc-rejected`	`free_blocks > 500` + zero admits	`allocate_slots` returns None despite plenty of free blocks = hash collision or fragmentation in the `KVCacheManager.block_pool`.
9	`kv-marginal`	`20 ≤ free_blocks ≤ 500`	Gray zone — can't prove which condition fired
10	`unknown`	fallback	Trace missing the fields needed to classify

7.3 Statistics-tab bucket consolidation

The per-step red badge exposes all ten reason codes (including the fine-grained kv-after-run, kv-insuf, kv-low, and kv-marginal). The Statistics tab consolidates them down to six top-level names for legibility:

Statistics bucket	Absorbs these per-step reasons
`tok-budget`	`tok-budget`
`kv-exhausted`	`kv-after-run`, `kv-insuf`, `kv-low`
`max-seqs`	`max-seqs`
`alloc-frag`	`alloc-rejected`
`alloc-exhausted`	`alloc-exhausted`
`kv-tight`	`kv-tight-after-run`, `kv-marginal`

Limits of post-hoc attribution: the classifier cannot distinguish LoRA cap from other skips (they continue, they don't break — the waiting queue keeps being served). It also cannot directly identify WAITING_FOR_REMOTE_KVS: those reqs skip via _is_blocked_waiting_status and never reach allocate_slots. If a step's unadmitted count is high while all buckets report healthy, check the step's waiting-queue pills for high WAITING_FOR_REMOTE_KVS proportion (visible in the drill-down modal).

8. Worked Example: Step 1000

Concrete case from a CPU-offload trace. The viewer's timeline showed this batch classification:

step=1000
run:12   wait:16
tok:1553/2048
16 blocked: alloc-rejected(free=533, allocator refused—hash/frag)
cached:60b
VRAM: ava:533 used:5178/5711  (10.1 GB used / 11.2 GB total)
DRAM: ava:3841 used:255/4096  ( 510 MB used /  8.0 GB total)

8.1 Walking the classifier

num_waiting = 16 > 0 → has backlog, leave "no-backlog" branch.

↓

newCount = 0 (scheduled_new + scheduled_resumed = 0) → unadmitted = 16 − 0 = 16.

↓

Priority 1 — tok-budget? tokenSum=1553 vs budget=2048 → 75.8%, below 95% threshold. No.

↓

Priority 2–4 — kv-exhausted? free_blocks=533, runningConsumed ≈ 60 (a few running reqs had fwd>1), effectiveFree ≈ 473. Smallest waiting req's minNeed = ⌈(5605−56)/16⌉ = 347. Since 473 ≥ 347, not kv-exhausted. free_blocks=533 ≥ 20, not kv-low either. No.

↓

Priority 5 — max-seqs? num_running=12 vs max_num_running_reqs=100. No.

↓

Priority 6 — kv-tight? effectiveFree=473 ≥ 200. No.

↓

Priority 7 — alloc-exhausted? newCount = 0, not partial admit. No.

↓

Priority 8 — alloc-rejected? free_blocks=533 > 500 AND newCount == 0. YES — this step is alloc-frag.

8.2 What it means

The scheduler had 533 free blocks on hand and 16 waiting requests, the smallest of which needed 347 blocks. Numerically the allocation should succeed. But kv_cache_manager.allocate_slots() returned None for every single one of those 16 — which means the underlying KVCacheManager could not find a qualifying free block. The most common reasons, in order of likelihood:

Prefix-cache hash locality: free blocks are physically free but logically pinned to past prefix hashes. The block pool refuses to reassign them when a new req doesn't match the hash. num_cached_tokens per req shows >0 on most running reqs.
CPUOffloadConnector pinning: despite pin:0 in this step, earlier steps may have taken out transient pins that haven't been released. Re-check preceding steps.
Genuine fragmentation: the free-block doubly-linked list has no sufficiently long contiguous run — rare when free_blocks/total_blocks > 5%, but possible with high churn.

What to do: enable chunked prefill if off; drop scheduler_reserve_full_isl; or cap cache_salt variance so prefix hashes converge rather than fragmenting.

9. Instrumentation Flow — Producing a Trace

Summary of where trace.py emits, from scheduler.py's point of view:

step_snapshot

Top of schedule() — pristine queue state.

prefix_cache_event

Inside waiting loop, first time a req is scheduled.

step_decision

End of schedule() — final admit / preempt sets.

requests.jsonl

Written by step_snapshot: one record per queued request per step.

9.1 Enabling tracing on a running server

# Before starting vllm serve
export SCHED_TRACE_PATH=/home/hclin/logs/my_run
export SCHED_TRACE_POLICY=cpu_offload
export SCHED_TRACE_TOKENIZER=/home/hclin/shared_models/Meta-Llama-3.1-8B-Instruct  # optional
export SCHED_TRACE_EVERY_N=1        # 1 = every step; 10 = every 10th
export SCHED_TRACE_MAX_STEPS=100000

vllm serve <model> --scheduling-policy cpu_offload --port 8199 ...

Traces are opened lazily on first emit and flushed on process exit. For graceful shutdown use kill -2 (SIGINT) — never kill -9 on a GPU process (corrupts CUDA driver state).

9.2 Adding new fields / events

The emitter accepts **extra kwargs on all three primary functions (step_snapshot, step_decision, prefix_cache_event). New fields land in the JSONL verbatim — no schema version bump needed. The viewer ignores unknown keys but will happily display them in the drill-down modal when a user clicks a pill.

If you instrument the break points directly (e.g., sched_trace.break_reason_event(step, reason="alloc_refused")) the viewer's classifier could be retired and the attribution would become ground-truth. That change requires editing scheduler.py at each break; see the Scheduler page for the full loop layout.

10. Quick Reference

10.1 Landing-page controls

Control	What it does
Refresh icon	Re-fetch /api/traces; spins while loading.
Filter input	Case-insensitive substring match on trace name.
Sort group	Name / Date / GPU. Click same key to toggle ↑/↓. Default: Date ↓ (newest first).
Load button	Fetch steps file for this trace.
+reqs button	Fetch both steps and requests file (streamed for large files).
Drag-drop zone	Accepts files or folders. Folder drag populates webkitRelativePath.

10.2 Main-view controls

Control	What it does
`Start step`	Inclusive lower bound of the displayed range.
`End step`	Inclusive upper bound.
`Window`	Max rows rendered at once. Default 500 — higher ≈ more scroll, more DOM.
`Apply`	Re-render with new range.
`Load another`	Return to landing page.
`Timeline / Statistics`	Tab toggle. Statistics computes on first visit and caches.

10.3 Legend cheat-sheet

Pill colors (per-req)

running new resumed preempted skipped-run waiting

Bucket colors (Statistics)

tok-budget kv-exhausted max-seqs alloc-frag alloc-exhausted kv-tight admitted-all under-capacity idle