Scheduler Trace Viewer

Interactive single-file HTML viewer for vLLM V1 scheduler traces. Reconstructs per-step scheduling decisions from sched_trace.<policy>.{steps,requests}.jsonl, maps each step to one of ten admission buckets, and pinpoints the exact scheduler.py break that blocked admission.

Agentic_KVCache_management Pure client-side HTML + JS 10-way admission bucketization Zero deploy required
sched_trace_viewer.html · vllm/v1/core/sched/trace.py · vllm/v1/core/sched/scheduler.py

1. What the Viewer Solves

vLLM V1's scheduler (Scheduler.schedule(), ~500 lines) applies a priority-ordered sequence of checks — token budget, KV allocator, sequence cap, LoRA cap, KV connector state — against every waiting request on every step. When throughput drops, the interesting question is which of those checks broke first and how often. Reading GPU utilization plots or TTFT percentiles tells you performance is bad; it does not tell you whether to tune max_num_scheduled_tokens, add more KV blocks, bump max_num_seqs, or fix a fragmentation bug in the allocator.

sched_trace_viewer.html is a single self-contained HTML file that:

Design constraint: the viewer never has direct knowledge of which break inside schedule() fired — the trace is a post-hoc snapshot. Everything the viewer reports about admission bottlenecks is reconstructed from snapshot fields using the same priority order the scheduler uses internally. This is why the classifier can occasionally disagree with reality on edge cases (e.g., an allocator that refused due to a LoRA cap rather than KV).

1.1 Layered architecture

vllm/v1/core/sched/scheduler.py

Scheduler.schedule() — calls sched_trace emitters before & after the admission loop.

vllm/v1/core/sched/trace.py

Lightweight emitter. Writes JSONL records for step_snapshot, step_decision, prefix_cache_lookup events.

sched_trace.<policy>.{steps,requests}.jsonl

Two JSONL files per scheduling run — step-level events in one, per-request drill-down state in the other.

sched_trace_viewer.html

Client-side parser + renderer. Also provides /api/traces fetch path for remote browsing.

2. Trace Data Format

Each scheduling run produces two JSONL files: a steps file containing step-level events and a requests file containing per-request state at every step. The steps file is required; the requests file enables drill-down, prefix-cache-hit analysis, and alloc-rejected reason detection.

# Example trace layout on disk
/home/hclin/Agentic_KVCache_management/logs/cpu_offload/
├── sched_trace.cpu_offload.steps.jsonl       # ~20–30 MB typical, sometimes up to hundreds
└── sched_trace.cpu_offload.requests.jsonl    # ~200–500 MB typical (streamed for >50MB)

2.1 Steps file — three event types

Every line is an independent JSON object. The event field discriminates.

event: step_snapshot

Emitted at the top of every schedule() call, before any admission. Captures the queues exactly as the scheduler is about to process them.

{
  "event": "step_snapshot",
  "step": 5423,
  "ts": 1713567890.123,          // unix seconds
  "policy": "cpu_offload",        // scheduling policy name
  "free_blocks": 533,             // KV blocks currently free
  "total_blocks": 10318,           // KV blocks total
  "num_running": 12,
  "num_waiting": 16,
  "num_pinned": 0,                  // pinned job count (CPU-offload holds)
  "running_req_ids": ["swe_0010_t6", ...],
  "waiting_req_ids": ["swe_0011_t10", ...],
  "running_job_ids": ["swe_0010", ...],
  "waiting_job_ids": ["swe_0011", ...],
  "cpu_free_blocks": 3841,         // DRAM offload pool (when enabled)
  "cpu_total_blocks": 4096,
  "pinned_blocks": 0,
  "pinned_job_ids": [],
  "max_num_scheduled_tokens": 2048,  // token budget per step
  "max_num_running_reqs": 100       // concurrent-seq cap
}

event: step_decision

Emitted at the end of every schedule() call, after admission decisions are made. Records exactly what the scheduler decided to run this step.

{
  "event": "step_decision",
  "step": 5423,
  "ts": 1713567890.125,
  "policy": "cpu_offload",
  "scheduled_new_req_ids": [],                      // first-time admit from waiting
  "scheduled_resumed_req_ids": [],                  // re-admit after preempt / remote-KV
  "scheduled_running_req_ids": ["swe_0010_t6", ...], // running reqs granted more tokens
  "preempted_req_ids": [],                         // demoted from running to waiting
  "num_scheduled_tokens": {
    "swe_0010_t6": 1,                              // decode: 1 token
    "swe_0019_t7": 1552                            // prefill chunk: 1552 tokens
  }
}

event: prefix_cache_lookup

Emitted once per request, on the first step it is scheduled. Splits the prefix-cache hit into local (vLLM's own cache) vs external (from a KV connector like CPUOffload or LMCache).

{
  "event": "prefix_cache_lookup",
  "step": 1000,
  "ts": 1713567000.000,
  "policy": "cpu_offload",
  "req_id": "swe_0019_t7",
  "job_id": "swe_0019",
  "local_hit_tokens": 96,                   // vLLM prefix cache hit
  "external_hit_tokens": 0,                 // connector hit
  "total_hit_tokens": 96,
  "total_prompt_tokens": 5960,
  "num_tokens": 5960,
  "connector": "CPUOffloadConnector",
  "load_kv_async": false
}

2.2 Requests file — per-step per-req state

For every request observed in either queue on a step, one JSON record is appended. Keyed by (step, req_id) — the viewer uses this for click-to-expand drill-down.

{
  "step": 5423,
  "ts": 1713567890.123,
  "queue": "running",                  // "running" | "waiting"
  "req_id": "swe_0010_t6",
  "job_id": "swe_0010",
  "status": "RequestStatus.RUNNING",
  "priority": 0,
  "arrival_time": 1713567000.0,
  "max_tokens": 8192,
  "num_prompt_tokens": 5960,
  "num_tokens": 5987,                // prompt + output so far
  "num_output_tokens": 27,
  "num_computed_tokens": 5987,
  "num_cached_tokens": 96,             // prefix cache hit (total)
  "num_preemptions": 0,
  "num_external_computed_tokens": 0,
  "is_prefill_chunk": false,
  "resumable": true,
  "prompt_prefix": [128000, 882, ...],         // first _MAX_PROMPT_FP token ids
  "prompt_suffix": [128009, ...],
  "output_tail": [1, 2, 3, ...],
  "prompt_prefix_text": "<|begin_of_text|>You are...",  // only if SCHED_TRACE_TOKENIZER set
  "output_tail_text": "...final answer.",
  "block_hashes_count": 374
}
Size trade-off: the requests file is typically 10×–20× larger than the steps file because it writes one record per req per step. For traces >50 MB, the viewer auto-switches to a streaming parser (file.stream().pipeThrough(TextDecoderStream)) to avoid OOM.

2.3 Sampling & tokenizer controls

The emitter (vllm/v1/core/sched/trace.py) respects these environment variables at module load:

Env varPurposeDefault
SCHED_TRACE_PATHOutput directory (required to enable tracing)unset → tracing off
SCHED_TRACE_POLICYPolicy name stamped into every record; used as filename infixunknown
SCHED_TRACE_EVERY_NEmit only every N-th step (sampling)1
SCHED_TRACE_MAX_STEPSStop emitting after this many steps1_000_000
SCHED_TRACE_TOKENIZERTokenizer path; when set, *_text fields are decoded for drill-down UIunset → token IDs only
On sampling: setting SCHED_TRACE_EVERY_N=10 divides trace size by ~10 but breaks the ability to track per-step running-queue age (the viewer's (N) counter inside running pills). Use sampling for large sweeps, no sampling when debugging a specific bottleneck.

3. Loading a Trace

Three paths. Pick whichever matches your deployment.

3.1 Remote trace list (recommended)

When the viewer is served by an HTTP server that also exposes a GET /api/traces endpoint returning a JSON list of trace directories, the landing page shows a browsable list. Clicking Load fetches /api/file?path=… with the absolute path on the server — so the viewer keeps the full path and shows it in the header bar.

Expected /api/traces response shape

[
  {
    "name": "dma_highload_a100_jps15_dram128",
    "date": "2026-04-19 16:02",
    "steps_mb": 23.4,
    "requests_mb": 348.4,
    "dma_mb": 0.8,
    "steps":    ["/home/hclin/logs/dma_highload_a100_jps15_dram128/sched_trace.cpu_offload.steps.jsonl"],
    "requests": ["/home/hclin/logs/dma_highload_a100_jps15_dram128/sched_trace.cpu_offload.requests.jsonl"]
  },
  ...
]

GPU is inferred client-side from the name via regex — the server does not need to populate a GPU field. The classifier recognises a100, h100, h200, b200, 5090 (or rtx5090), l40s, and cpu_offload, respecting underscore/hyphen separators.

3.2 Local file picker / drag-drop

Open the viewer via file:// or HTTP, drop either or both JSONL files onto the card, or pick them with the button. The viewer auto-detects steps vs requests files by probing the first few records:

Browser limitation: the File API only exposes basename; the full path stays on disk. Drag a folder instead of a file to populate file.webkitRelativePath (gives you the relative subdirectory), or use the remote-trace path to keep the full absolute path.

3.3 Only one file is required

Dropping only the steps file is enough to render the timeline and Statistics tab. Dropping requests too enables: per-req token-count display in pills, prefix-cache hit details, click-to-expand modal, and the minNeed computation used to distinguish kv-exhausted from kv-tight in bottleneck bucketing.

4. The Header Info Bar

After loading a trace the top of the page shows a single-line summary with color-coded tokens for each axis of the run. Everything except the source path is derived from the first step_snapshot and from aggregates over all steps; the source path comes from whichever loader was used.

CPU_OFFLOAD · A100-40GB · KV:10318blk(20GB) · DRAM:65536blk(128GB) · 100 jobs · steps 0–5423 (5424) · 487.9s · maxRun:100 · 342K reqs src/home/hclin/logs/…/sched_trace.cpu_offload.steps.jsonl
TokenWhere it comes fromNotes
CPU_OFFLOADstep_snapshot.policyUppercased; fallback unknown.
A100-40GBInferred from total_blocks<8K → 5090, <15K → A100-40GB, <35K → H100/A100-80GB range, <60K → H200-141GB, else unknown.
KV:10318blk(20GB)step_snapshot.total_blocksBlock → bytes assumes 16 tokens × 32 layers × 8 KV-heads × 128 dim × 2 (KV) × 2 B = 2 MB per block (Llama-3.1-8B default).
DRAM:65536blk(128GB)step_snapshot.cpu_total_blocksOnly shown when a CPU offload connector is active.
100 jobsDistinct job_ids across all running/waiting queuesNot request count — one job typically has multiple turns = multiple request_ids.
steps 0–5423 (5424)First/last step value across step_snapshot eventsParenthetical count reflects emitted records, which may differ from last − first + 1 if SCHED_TRACE_EVERY_N > 1.
487.9slast.ts − first.tsWall-clock duration of the instrumented window.
maxRun:100max(s.num_running for s in STEPS)Peak concurrency actually achieved.
342K reqsREQS.sizeTotal per-req observations in the requests file (= sum of queue sizes across all steps).
src .../sched_trace.cpu_offload.steps.jsonlFILE_NAMES.steps_pathRight-aligned pill; hover shows full path. Full absolute path when loaded via /api, basename when dropped locally (browser security).

5. Timeline View

The default tab. Each row is one step. Five columns: step, ts (elapsed + delta), batch classification (aggregate badges), running queue (one pill per req), waiting queue (one pill per req). Use the Start / End / Window controls to page through the trace.

5.1 Pill taxonomy

Each pill represents one request at one step. The category is determined by set-difference against step_decision:

running new resumed preempted skipped-run waiting pinned (purple border) prefill step (red outline)
PillComputed asMeaning
runningscheduled_running_req_idsAlready running; granted more tokens this step.
newscheduled_new_req_idsFirst-time admit from waiting queue.
resumedscheduled_resumed_req_idsRe-admit after earlier preemption or remote-KV wait.
preemptedpreempted_req_idsDemoted from running queue back to waiting this step (KV pressure).
skipped-runrunning_req_ids \ scheduled_running_req_idsSat in running queue but scheduler granted zero tokens (exhausted token budget / mamba-align / encoder cap).
waitingwaiting_req_ids \ (new ∪ resumed)In the waiting queue and not admitted this step. A headline symptom of throughput loss.
pinnedOverlay when job_id ∈ pinned_job_idsCPU-offload / prefix-sharing is holding this job's blocks in VRAM (not evictable).
prefillRed outline when num_computed_tokens < num_prompt_tokensThis step is still prefilling (not yet decoding). Large prefill chunks are the primary consumer of token budget.

5.2 Pill label anatomy

A running pill for job swe_0008, turn 9, on step 1000:

j0008, t9 (3) [385, fwd:1542t, 1542t left] hit:5/481(1%)
│      │  │    │    │         │            │
│      │  │    │    │         │            └── prefix cache hit: 5 blocks of 481 (1%)
│      │  │    │    │         └── prefill remaining: 1542 prompt tokens left
│      │  │    │    └── fwd this step: 1542 tokens (scheduled by decision)
│      │  │    └── KV blocks currently held: 385
│      │  └── age in running queue (N-th consecutive step; p3 = 3rd prefill step, 3 = 3rd decode step)
│      └── turn index within job (t0 = first turn; t9 = 10th turn of conversation)
└── job short id (strips "job_"/"swe_" prefix)

Waiting pills show different fields (block counts estimated from prompt size; prefix-cache-lookup may be peeked ahead from future steps for reqs not yet looked up):

j0011, t10 [prefill:521b(8330t), 8410t total, 1% cached]
│      │    │        │      │        │         │
│      │    │        │      │        │         └── cache hit percentage of total prompt
│      │    │        │      │        └── num_tokens (prompt + any output so far)
│      │    │        │      └── tokens still needed (post-cache)
│      │    │        └── blocks needed to admit (ceil((prompt − cached) / 16))
│      │    └── prefill estimate (not yet running, but will start in prefill)
│      └── turn 10
└── job id (swe_0011)

5.3 Batch classification column

The third column stacks small badges summarising the step. Notable entries:

Clicking a pill opens a modal showing every field for that (step, req_id) from the requests file — status, arrival time, token counts, output tail, etc. Requires the requests file to be loaded.

6. Statistics Tab

Five aggregate sections — computed once on first visit and cached.

6.1 Per-Step Bottleneck Breakdown

Every step is classified into exactly one of ten buckets using the same priority order the scheduler applies inside schedule(). A horizontal bar chart plus inline legend with a one-line description per bucket.

BucketTrigger conditionMeaning
tok-budget unadmitted > 0 AND tokenSum ≥ 0.95 × budget Hard limit — token budget saturated.
kv-exhausted unadmitted > 0 AND (effectiveFree < minNeed OR free_blocks < minNeed OR free_blocks < 20) Hard limit — KV provably insufficient.
max-seqs unadmitted > 0 AND num_running ≥ max_num_running_reqs Hard limit — concurrent-sequence cap.
alloc-frag unadmitted > 0 AND no hard limit hit AND free_blocks > 500 AND newCount == 0 Allocator refused — hash / fragmentation, free count looks fine.
alloc-exhausted unadmitted > 0 AND no hard limit hit AND newCount > 0 Partial admit — allocator ran out mid-loop.
kv-tight unadmitted > 0 AND (effectiveFree < 200 OR 20 ≤ free_blocks ≤ 500) Gray zone — near-full but cannot prove insufficient.
admitted-all num_waiting > 0 AND unadmitted == 0 Healthy — every waiting req got in this step.
under-capacity num_waiting == 0 AND num_running > 0 Healthy — no backlog, engine still busy.
idle num_waiting == 0 AND num_running == 0 Engine idle.
unknown Trace missing fields Older trace version lacking free_blocks / waiting-req metadata.

Derived quantities used by the classifier:

6.2 VRAM / DRAM Utilization Over Time

Two stacked area charts plotting block occupancy per step: VRAM (free/pinned/active/total) and DRAM (free/used/total when a CPU-offload connector is present). Lets you see preemption waves, pin pressure, and offload-pool growth directly.

6.3 Prefix Cache Hit Distribution

Histogram over prefix_cache_lookup events, bucketed by (local_hit + external_hit) / total_prompt_tokens. Shows whether a workload relies on prefix reuse (heavy left tail = bad for agentic) and how much of the hit is served locally vs via a connector.

6.4 Token Budget Utilization Histogram

Twenty bins of tokenSum / budget per step. Dual x-axis: the top tick row is percentage (0%, 20%, …, 100%), the bottom row is the absolute token count (e.g. 1.6k, 8192t) so you can read the budget in concrete units. A red dashed line marks the mean.

Interpretation: a right-skewed distribution means most steps saturate the budget (good throughput utilization). A left-skewed distribution with many zero / near-zero bins and a spike at 100% often signals alternating prefill/decode cycles where idle-looking decode steps coexist with a few huge prefill chunks.

6.5 Summary Table

Headline stats — total steps, mean/p50/p95 running queue size, mean/p50/p95 waiting queue size, peak KV usage, preempt count, and connector (if any) prefix-cache hit rate. Use it when you want numbers to paste into an experiment report.

7. Admission-Blocked Reasons ↔ scheduler.py

This is the critical diagnostic path. Both the per-step red blocked badge and the Statistics tab buckets run the same classifier. Here we walk through the scheduler.py admission loop and map each break / continue / skip back to a reason code.

7.1 The admission loop, annotated

Condensed from vllm/v1/core/sched/scheduler.py, approximately lines 693–900:

# --- Stage 1: running queue (not shown here) consumes token_budget first,
#             preempting low-priority reqs if allocate_slots fails.

# --- Stage 2: waiting-queue admission loop ---
while (self.waiting or self.skipped_waiting) and token_budget > 0:
    # [MAX-SEQS] concurrent-sequence cap — tokens/KV don't matter here.
    if len(self.running) == self.max_num_running_reqs:
        break

    request = request_queue.peek_request()

    # [SKIP-AND-CONTINUE] LoRA cap / KV-connector "need remote" / WAITING_FOR_REMOTE_KVS
    # — does NOT break the loop, just defers this req and tries the next.
    if blocked_by_lora_or_connector(request):
        request_queue.pop_request()
        step_skipped_waiting.prepend_request(request)
        continue

    # num_new_tokens is capped at token_budget below.
    num_new_tokens = request.num_tokens - num_computed_tokens
    if 0 < long_prefill_token_threshold < num_new_tokens:
        num_new_tokens = long_prefill_token_threshold

    # [TOK-BUDGET] chunked-prefill off & single-req needs more than remaining budget.
    if not enable_chunked_prefill and num_new_tokens > token_budget:
        break

    num_new_tokens = min(num_new_tokens, token_budget)

    # [KV-EXHAUSTED (can_fit_full_sequence)] if scheduler_reserve_full_isl is on.
    if reserve_full_isl and not kv_cache_manager.can_fit_full_sequence(request, ...):
        break

    # [KV-EXHAUSTED / KV-TIGHT / ALLOC-FRAG / ALLOC-EXHAUSTED].
    # allocate_slots returns None if the KV block allocator cannot find suitable blocks
    # (considers fragmentation, hash-collision policy, and existing pins).
    new_blocks = kv_cache_manager.allocate_slots(request, num_new_tokens, ...)
    if new_blocks is None:
        break

    # Admission succeeds — push into running queue, record decision.
    self.running.append(request)
    num_scheduled_tokens[request_id] = num_new_tokens
    token_budget -= num_new_tokens
    scheduled_new_reqs.append(request)
# loop ends naturally when waiting empties or token_budget hits 0.

7.2 Reason priority & how the viewer picks

The trace does not record which break fired. The viewer reconstructs by checking conditions in the same order the scheduler would see them, stopping at the first match. Priority matches the loop's textual order so the attribution is correct for any step where exactly one condition held at break time.

PriorityReasonCheckscheduler.py position
1tok-budgettokenSum ≥ 0.95 × budgetOuter while token_budget > 0 exit + break on chunked-prefill single-req overflow (~L813)
2kv-after-runeffectiveFree < minNeedallocate_slots returns None (~L885) — running queue's prefill consumed the free blocks before waiting got a chance
3kv-insuffree_blocks < minNeedallocate_slots returns None (~L885) — snapshot already shows fewer free blocks than the smallest waiting req needs
4kv-lowfree_blocks < 20Same as above; fallback when minNeed can't be computed
5max-seqsnum_running ≥ max_num_running_reqsbreak at L694
6kv-tight-after-runeffectiveFree < 200Same as kv-* but below the hard-exhaustion threshold. Indicative, not definitive.
7alloc-exhaustednewCount > 0 (partial admit)allocate_slots returns None mid-loop after some admits succeeded
8alloc-rejectedfree_blocks > 500 + zero admitsallocate_slots returns None despite plenty of free blocks = hash collision or fragmentation in the KVCacheManager.block_pool.
9kv-marginal20 ≤ free_blocks ≤ 500Gray zone — can't prove which condition fired
10unknownfallbackTrace missing the fields needed to classify

7.3 Statistics-tab bucket consolidation

The per-step red badge exposes all ten reason codes (including the fine-grained kv-after-run, kv-insuf, kv-low, and kv-marginal). The Statistics tab consolidates them down to six top-level names for legibility:

Statistics bucketAbsorbs these per-step reasons
tok-budgettok-budget
kv-exhaustedkv-after-run, kv-insuf, kv-low
max-seqsmax-seqs
alloc-fragalloc-rejected
alloc-exhaustedalloc-exhausted
kv-tightkv-tight-after-run, kv-marginal
Limits of post-hoc attribution: the classifier cannot distinguish LoRA cap from other skips (they continue, they don't break — the waiting queue keeps being served). It also cannot directly identify WAITING_FOR_REMOTE_KVS: those reqs skip via _is_blocked_waiting_status and never reach allocate_slots. If a step's unadmitted count is high while all buckets report healthy, check the step's waiting-queue pills for high WAITING_FOR_REMOTE_KVS proportion (visible in the drill-down modal).

8. Worked Example: Step 1000

Concrete case from a CPU-offload trace. The viewer's timeline showed this batch classification:

step=1000
run:12   wait:16
tok:1553/2048
16 blocked: alloc-rejected(free=533, allocator refused—hash/frag)
cached:60b
VRAM: ava:533 used:5178/5711  (10.1 GB used / 11.2 GB total)
DRAM: ava:3841 used:255/4096  ( 510 MB used /  8.0 GB total)

8.1 Walking the classifier

num_waiting = 16 > 0 → has backlog, leave "no-backlog" branch.
newCount = 0 (scheduled_new + scheduled_resumed = 0) unadmitted = 16 − 0 = 16.
Priority 1 — tok-budget? tokenSum=1553 vs budget=2048 → 75.8%, below 95% threshold. No.
Priority 2–4 — kv-exhausted? free_blocks=533, runningConsumed ≈ 60 (a few running reqs had fwd>1), effectiveFree ≈ 473. Smallest waiting req's minNeed = ⌈(5605−56)/16⌉ = 347. Since 473 ≥ 347, not kv-exhausted. free_blocks=533 ≥ 20, not kv-low either. No.
Priority 5 — max-seqs? num_running=12 vs max_num_running_reqs=100. No.
Priority 6 — kv-tight? effectiveFree=473 ≥ 200. No.
Priority 7 — alloc-exhausted? newCount = 0, not partial admit. No.
Priority 8 — alloc-rejected? free_blocks=533 > 500 AND newCount == 0. YES — this step is alloc-frag.

8.2 What it means

The scheduler had 533 free blocks on hand and 16 waiting requests, the smallest of which needed 347 blocks. Numerically the allocation should succeed. But kv_cache_manager.allocate_slots() returned None for every single one of those 16 — which means the underlying KVCacheManager could not find a qualifying free block. The most common reasons, in order of likelihood:

  1. Prefix-cache hash locality: free blocks are physically free but logically pinned to past prefix hashes. The block pool refuses to reassign them when a new req doesn't match the hash. num_cached_tokens per req shows >0 on most running reqs.
  2. CPUOffloadConnector pinning: despite pin:0 in this step, earlier steps may have taken out transient pins that haven't been released. Re-check preceding steps.
  3. Genuine fragmentation: the free-block doubly-linked list has no sufficiently long contiguous run — rare when free_blocks/total_blocks > 5%, but possible with high churn.

What to do: enable chunked prefill if off; drop scheduler_reserve_full_isl; or cap cache_salt variance so prefix hashes converge rather than fragmenting.

9. Instrumentation Flow — Producing a Trace

Summary of where trace.py emits, from scheduler.py's point of view:

1
step_snapshot
Top of schedule() — pristine queue state.
2
prefix_cache_event
Inside waiting loop, first time a req is scheduled.
3
step_decision
End of schedule() — final admit / preempt sets.
4
requests.jsonl
Written by step_snapshot: one record per queued request per step.

9.1 Enabling tracing on a running server

# Before starting vllm serve
export SCHED_TRACE_PATH=/home/hclin/logs/my_run
export SCHED_TRACE_POLICY=cpu_offload
export SCHED_TRACE_TOKENIZER=/home/hclin/shared_models/Meta-Llama-3.1-8B-Instruct  # optional
export SCHED_TRACE_EVERY_N=1        # 1 = every step; 10 = every 10th
export SCHED_TRACE_MAX_STEPS=100000

vllm serve <model> --scheduling-policy cpu_offload --port 8199 ...

Traces are opened lazily on first emit and flushed on process exit. For graceful shutdown use kill -2 (SIGINT) — never kill -9 on a GPU process (corrupts CUDA driver state).

9.2 Adding new fields / events

The emitter accepts **extra kwargs on all three primary functions (step_snapshot, step_decision, prefix_cache_event). New fields land in the JSONL verbatim — no schema version bump needed. The viewer ignores unknown keys but will happily display them in the drill-down modal when a user clicks a pill.

If you instrument the break points directly (e.g., sched_trace.break_reason_event(step, reason="alloc_refused")) the viewer's classifier could be retired and the attribution would become ground-truth. That change requires editing scheduler.py at each break; see the Scheduler page for the full loop layout.

10. Quick Reference

10.1 Landing-page controls

ControlWhat it does
Refresh iconRe-fetch /api/traces; spins while loading.
Filter inputCase-insensitive substring match on trace name.
Sort groupName / Date / GPU. Click same key to toggle ↑/↓. Default: Date ↓ (newest first).
Load buttonFetch steps file for this trace.
+reqs buttonFetch both steps and requests file (streamed for large files).
Drag-drop zoneAccepts files or folders. Folder drag populates webkitRelativePath.

10.2 Main-view controls

ControlWhat it does
Start stepInclusive lower bound of the displayed range.
End stepInclusive upper bound.
WindowMax rows rendered at once. Default 500 — higher ≈ more scroll, more DOM.
ApplyRe-render with new range.
Load anotherReturn to landing page.
Timeline / StatisticsTab toggle. Statistics computes on first visit and caches.

10.3 Legend cheat-sheet

Pill colors (per-req)

running new resumed preempted skipped-run waiting

Bucket colors (Statistics)

tok-budget kv-exhausted max-seqs alloc-frag alloc-exhausted kv-tight admitted-all under-capacity idle