Hands-On Labs: AI Inference Infrastructure

16-week graduate course — from first serving to disaggregated P/D, with vLLM, SGLang, and LMCache

3-5 hrs/week · 1-2× A100 80GB

Course Philosophy

This course follows a measure-first, understand-second approach. Each week, you will:

  1. Run experiments to observe a phenomenon (e.g., TTFT increases with request rate)
  2. Collect metrics to quantify the effect (e.g., TTFT goes from 50ms to 2000ms)
  3. Read source code to understand the mechanism (e.g., queuing in the scheduler)
  4. Write analysis to synthesize understanding (e.g., explain saturation as a queuing theory problem)

Prerequisites

Table of Contents

  1. Environment Setup Guide
  2. Course Roadmap
  3. Week 1 — First LLM Serving
  4. Week 2 — Offline Throughput & Profiling
  5. Week 3 — Roofline Model
  6. Week 4 — PagedAttention
  7. Week 5 — Prefix Caching (APC)
  8. Week 6 — RadixAttention
  9. Week 7 — Continuous vs Static Batching
  10. Week 8 — Chunked Prefill
  11. Week 9 — Scheduling Policies
  12. Week 10 — CUDA Graphs
  13. Week 11 — Speculative Decoding
  14. Week 12 — Quantization
  15. Week 13 — Tensor Parallelism
  16. Week 14 — Mixture of Experts (MoE)
  17. Week 15 — LMCache
  18. Week 16 — Disaggregated Prefill/Decode
  19. Models & Datasets Reference

Environment Setup Guide

Complete this setup before Week 1. All 16 labs assume these tools, models, and datasets are already available.

Hardware Requirements

Minimum Hardware

  • 1-2× NVIDIA A100 80 GB (or equivalent: H100, L40S)
  • CUDA 12.1+ with cuDNN 8.9+
  • 64 GB system RAM minimum, 128 GB recommended
  • 100 GB free disk for models & datasets
  • Weeks 13-14 require 2× GPUs with NVLink for tensor/expert parallelism

Conda Environment Setup

Create a dedicated conda environment with all required packages:

# Create and activate environment
conda create -n infer python=3.12 -y
conda activate infer

# Install core serving frameworks
pip install vllm
pip install "sglang[all]"

# Install KV cache management and evaluation
pip install lmcache lm-eval

# Verify installations
python -c "import vllm; print(f'vLLM {vllm.__version__}')"
python -c "import sglang; print(f'SGLang {sglang.__version__}')"
python -c "import lmcache; print('LMCache OK')"

Model Downloads

Download all models once; they are reused across the entire semester:

# Primary model — used in most labs
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct

# Small model — used for speculative decoding draft
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Quantized variants — Week 12
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
huggingface-cli download neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8

# MoE model — Week 14 (requires ~100GB disk)
huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1

Dataset Downloads

# ShareGPT conversation dataset — primary benchmark workload
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Verify file
python -c "import json; d=json.load(open('ShareGPT_V3_unfiltered_cleaned_split.json')); print(f'{len(d)} conversations')"
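To understand what the benchmarks will replay, it helps to peek at the dataset's structure. A small sketch, assuming the common ShareGPT V3 schema (each record has a `conversations` list of `{"from": "human"|"gpt", "value": ...}` turns); swap the inline sample for the real file once downloaded:

```python
import json

def summarize_sharegpt(records):
    """Count turns and rough prompt/response sizes per conversation."""
    stats = []
    for rec in records:
        turns = rec.get("conversations", [])
        human_chars = sum(len(t["value"]) for t in turns if t["from"] == "human")
        gpt_chars = sum(len(t["value"]) for t in turns if t["from"] == "gpt")
        stats.append({"id": rec.get("id"), "turns": len(turns),
                      "human_chars": human_chars, "gpt_chars": gpt_chars})
    return stats

if __name__ == "__main__":
    # Real usage: records = json.load(open("ShareGPT_V3_unfiltered_cleaned_split.json"))
    records = [{"id": "demo", "conversations": [
        {"from": "human", "value": "Hi"},
        {"from": "gpt", "value": "Hello!"}]}]
    print(summarize_sharegpt(records))
```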

SLURM Job Script Template

For cluster users — save as run_lab.sbatch and customize per week:

#!/bin/bash
#SBATCH --job-name=infer-lab
#SBATCH --gres=gpu:A100:1
#SBATCH --time=04:00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8

module load cuda/12.4
conda activate infer

# Your experiment commands here

GPU Monitoring Setup

Run this in a separate terminal to collect GPU metrics during experiments:

# Log GPU power, utilization, clocks, memory every 1 second
nvidia-smi dmon -s pucm -d 1 -f gpu_monitor.csv &

# Quick GPU status check
nvidia-smi --query-gpu=name,memory.total,memory.used,utilization.gpu --format=csv
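The dmon log is easy to summarize in a few lines of Python. A sketch that reads column names from dmon's own comment header instead of hard-coding them, since the column set depends on the `-s` flags you pass:

```python
def parse_dmon(lines):
    """Parse `nvidia-smi dmon` output into a list of row dicts.

    dmon prints two comment lines (column names, then units) and repeats
    them periodically; we take the first as the header and skip the rest.
    """
    header, rows = None, []
    for line in lines:
        if line.startswith("#"):
            if header is None:
                header = line[1:].split()   # first comment line = names
            continue
        vals = line.split()
        if header and len(vals) == len(header):
            rows.append(dict(zip(header, vals)))
    return rows

# Tiny inline sample; real usage: parse_dmon(open("gpu_monitor.csv"))
sample = [
    "# gpu    sm   mem",
    "# Idx     %     %",
    "    0    87    62",
    "    0    91    64",
]
rows = parse_dmon(sample)
avg_sm = sum(float(r["sm"]) for r in rows) / len(rows)
print(f"avg SM util: {avg_sm:.1f}%")   # avg SM util: 89.0%
```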

Quick Verification

Launch vLLM and send one request to confirm everything works:

# Terminal 1: Start vLLM server
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Terminal 2: Send a test request
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 16
  }' | python -m json.tool

# Expected: JSON response with "Paris" in the output

Troubleshooting Common Issues

CUDA Out of Memory

If you see OOM errors, reduce --max-model-len or --gpu-memory-utilization:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

HuggingFace Token Authentication

Some models (e.g., Llama) require a HuggingFace token. Set it up once:

huggingface-cli login
# Paste your token from https://huggingface.co/settings/tokens

# Or set the environment variable
export HF_TOKEN="hf_your_token_here"

Port Conflicts

If a port is already in use, find and kill the process:

# Find process using port 8000
lsof -i :8000
# Kill it
kill -9 $(lsof -t -i :8000)

Conda Environment Issues

If packages conflict, create a fresh environment. vLLM and SGLang update frequently — pin versions for reproducibility:

pip install vllm==0.8.0
pip install "sglang[all]==0.4.5"

Helper Scripts

Save these utility scripts for use throughout the semester:

# save as: wait_for_server.sh
#!/bin/bash
# Wait for a vLLM/SGLang server to be ready
PORT=${1:-8000}
echo "Waiting for server on port $PORT..."
while ! curl -s http://localhost:$PORT/health > /dev/null 2>&1; do
  sleep 2
done
echo "Server is ready!"

# save as: extract_metrics.py
# Quick script to parse benchmark output files
import re, sys

def parse_benchmark(filepath):
    with open(filepath) as f:
        text = f.read()
    metrics = {}
    for pattern, name in [
        (r'Throughput:\s+([\d.]+)\s+requests/s', 'req/s'),
        (r'Mean TTFT.*?:\s+([\d.]+)\s+ms', 'ttft_mean'),
        (r'P99 TTFT.*?:\s+([\d.]+)\s+ms', 'ttft_p99'),
        (r'Mean ITL.*?:\s+([\d.]+)\s+ms', 'itl_mean'),
        (r'P99 ITL.*?:\s+([\d.]+)\s+ms', 'itl_p99'),
    ]:
        m = re.search(pattern, text)
        if m: metrics[name] = float(m.group(1))
    return metrics

if __name__ == "__main__":
    for f in sys.argv[1:]:
        print(f"{f}: {parse_benchmark(f)}")

Setup Complete! If the test request returned a valid response, your environment is ready. Shut down the server (Ctrl+C) and proceed to the Course Roadmap.

Course Roadmap — 16 Weeks in 5 Phases

The following roadmap organizes all 16 weekly labs into 5 progressive phases. Each card shows the week number, topic, key systems used, and the primary phenomenon you will observe and analyze.

Each phase builds on the previous one. Phase 1 establishes your baseline understanding and measurement toolkit. Phases 2-3 explore memory management and scheduling — the two pillars of serving system design. Phase 4 covers hardware-level optimizations. Phase 5 extends to multi-GPU and production deployments.

Time Commitment: Each weekly lab requires 3-5 hours: ~1 hour for setup, ~1-2 hours for experiments, ~1 hour for source code reading, and ~1 hour for written analysis. Plan GPU time accordingly — batch your experiments to minimize idle GPU time.

PHASE 1 Foundations & Benchmarking (Weeks 1-3)

Week 1

First LLM Serving — vLLM & SGLang basics, online serving, ShareGPT benchmarks, TTFT/ITL metrics

Week 2

Offline Throughput & Profiling — benchmark_throughput, PyTorch Profiler, Perfetto, nvidia-smi dmon, call graphs

Week 3

Roofline Model — nsys/ncu profiling, A100 peak analysis, memory-bound vs compute-bound, decode latency theory

PHASE 2 Memory & KV Cache (Weeks 4-6)

Week 4

PagedAttention — vLLM vs HuggingFace baseline, max-num-seqs sweep, block-size tuning, KV cache manager

Week 5

Prefix Caching (APC) — enable-prefix-caching toggle, FP8 KV cache, shared prefix sweeps, hash_block_tokens

Week 6

RadixAttention — SGLang radix cache, few-shot workloads, LPM vs FCFS, sgl.function DSL, radix_cache.py

PHASE 3 Scheduling & Batching (Weeks 7-9)

Week 7

Continuous vs Static Batching — HuggingFace static baseline, output-length variance, scheduler.py, Orca paper

Week 8

Chunked Prefill — HOL blocking, max-num-batched-tokens sweep, ITL timeseries analysis

Week 9

Scheduling Policies — FCFS/LPM/DFS-weight, cache-aware DP routing, sglang_router

PHASE 4 Optimizations (Weeks 10-12)

Week 10

CUDA Graphs — enforce-eager vs captured graphs, warmup phase, memory overhead, CudaGraphRunner

Week 11

Speculative Decoding — draft model, N-gram, EAGLE, acceptance rate vs num_speculative_tokens

Week 12

Quantization — FP16/FP8/AWQ-INT4/GPTQ-INT4, lm_eval GSM8K accuracy, Pareto frontier

PHASE 5 Multi-GPU & System Integration (Weeks 13-16)

Week 13

Tensor Parallelism — TP=1,2,4, NCCL all-reduce profiling, NVLink bandwidth

Week 14

MoE — Mixtral-8x7B, expert routing, memory vs compute, expert parallelism

Week 15

LMCache — cross-restart persistence, tiered storage GPU→CPU→disk, chunk_size sweep

Week 16

Disaggregated P/D — vLLM P2pNcclConnector, SGLang disaggregation-mode, KV transfer analysis, capstone

Week 1 — First LLM Serving

Learning Objectives

  • Launch vLLM and SGLang servers and send requests via curl
  • Run benchmark_serving.py (vLLM) and bench_serving (SGLang) with the ShareGPT dataset
  • Understand TTFT, ITL (inter-token latency), throughput, and request rate
  • Compare vLLM and SGLang on identical workloads

Setup & Configuration

# Start vLLM server (Terminal 1)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --disable-log-requests

# Start SGLang server (Terminal 2)
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8001

Experiments

1. Manual curl Test

Send individual requests to both servers and observe the response structure:

# vLLM completions API
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"Explain transformers in 3 sentences.","max_tokens":128}'

# SGLang completions API
curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"Explain transformers in 3 sentences.","max_tokens":128}'
2. vLLM Benchmark — Varying Request Rates

Run benchmark_serving.py at request rates 1, 4, 10, and inf (closed-loop):

for rate in 1 4 10 inf; do
  # assumes the vLLM server from Setup & Configuration is already running on port 8000
  python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 \
    --request-rate $rate \
    --port 8000 \
    2>&1 | tee results_vllm_rr${rate}.txt
done
3. SGLang Benchmark — Same Workload

for rate in 1 4 10 inf; do
  python -m sglang.bench_serving \
    --backend sglang \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 \
    --request-rate $rate \
    --port 8001 \
    2>&1 | tee results_sglang_rr${rate}.txt
done
4. Collect and Compare Results

Extract TTFT, ITL, throughput from each result file. Plot request rate vs. median TTFT and median ITL for both systems.
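A minimal parse-and-plot helper for this step, assuming the benchmark prints lines like `Median TTFT (ms): 123.45` (adjust the regex to whatever your benchmark version actually emits):

```python
import re

def ttft_by_rate(results):
    """Map {request_rate: benchmark_output_text} -> sorted (rate, ttft) points."""
    points = []
    for rate, text in results.items():
        m = re.search(r"Median TTFT.*?([\d.]+)", text)
        if m:
            points.append((rate, float(m.group(1))))
    return sorted(points)

if __name__ == "__main__":
    # Placeholder texts; real usage reads results_vllm_rr*.txt files
    fake = {1: "Median TTFT (ms): 52.1", 4: "Median TTFT (ms): 95.7",
            10: "Median TTFT (ms): 810.0"}
    print(ttft_by_rate(fake))
    # To plot: xs, ys = zip(*ttft_by_rate(fake))
    # import matplotlib.pyplot as plt
    # plt.plot(xs, ys, marker="o"); plt.xlabel("request rate"); plt.ylabel("median TTFT (ms)")
```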

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| TTFT (p50, p99) | Time to first token | ms |
| ITL (p50, p99) | Inter-token latency | ms |
| Throughput | Output tokens per second | tok/s |
| Request latency (p50, p99) | End-to-end per-request time | ms |

Source Code Reading

  • vllm/entrypoints/openai/api_server.py — How the OpenAI-compatible server starts and handles requests
  • benchmarks/benchmark_serving.py — How requests are generated with Poisson arrival at different rates
  • sglang/bench_serving.py — SGLang equivalent benchmark harness

Written Analysis (1-2 pages)

  • How does TTFT change as request rate increases from 1 to inf? Why?
  • At what request rate does the system become saturated? What evidence supports this?
  • Compare vLLM vs SGLang: which has better TTFT? Better throughput? Hypothesize why.
Understanding Request Rate: Request rate controls how fast requests arrive (Poisson process). At rate=1, one request arrives per second on average. At rate=inf, all requests are sent immediately (closed-loop benchmark). As rate increases, the server queue grows, TTFT increases due to queuing delay, but throughput may also increase due to better batching. The saturation point is where TTFT begins to diverge — the server can no longer keep up with arrivals.
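The arrival process described above is easy to reproduce offline. A sketch of Poisson arrival generation that mirrors the idea (not the exact code) used by the benchmark harness, with `rate=inf` degenerating to the closed-loop case:

```python
import random

def poisson_arrivals(rate, n, seed=0):
    """n arrival timestamps for a Poisson process at `rate` requests/sec:
    inter-arrival gaps are exponentially distributed with mean 1/rate.
    rate=inf sends everything at t=0 (closed-loop benchmark)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        if rate != float("inf"):
            t += rng.expovariate(rate)
        times.append(t)
    return times

for rate in (1, 4, float("inf")):
    times = poisson_arrivals(rate, 200)
    print(f"rate={rate}: 200 requests over {times[-1] - times[0]:.1f}s")
```

At rate=4 the 200 requests span roughly 50 seconds on average; the server only saturates when its service rate falls below the arrival rate.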

Streaming Response Test

Test streaming mode to observe tokens arriving one at a time:

# vLLM streaming
curl -N http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Count from 1 to 20:",
    "max_tokens": 64,
    "stream": true
  }'

# SGLang streaming
curl -N http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Count from 1 to 20:",
    "max_tokens": 64,
    "stream": true
  }'

Week 2 — Offline Throughput & Profiling

Learning Objectives

  • Measure offline (batched) throughput with benchmark_throughput.py and bench_one_batch
  • Profile a single forward pass with benchmark_latency.py
  • Capture and read PyTorch Profiler traces in Perfetto
  • Trace the call graph: AsyncLLM → EngineCore → Worker

Setup & Configuration

# Ensure GPU monitoring is running
nvidia-smi dmon -s pucm -d 1 -f gpu_monitor_w2.csv &

Experiments

1. Offline Throughput — vLLM

python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 500 \
  --output-json throughput_vllm.json
2. Offline Throughput — SGLang

python -m sglang.bench_one_batch \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --batch-size 32 64 128 \
  --input-len 512 \
  --output-len 256
3. Single-Request Latency Profiling

python benchmarks/benchmark_latency.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 512 \
  --output-len 128 \
  --batch-size 1 \
  --num-iters 10
4. PyTorch Profiler + Perfetto Trace

Enable the PyTorch profiler and export a trace for visualization in Perfetto UI:

# Run vLLM with profiling enabled
VLLM_TORCH_PROFILER_DIR=./profiles vllm serve \
  meta-llama/Llama-3.1-8B-Instruct --port 8000

# Send a few requests, then open the trace:
# Go to https://ui.perfetto.dev and load the .json trace file
5. GPU Monitoring Analysis

Analyze the gpu_monitor_w2.csv to observe GPU utilization patterns during offline vs. online workloads.

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| Offline throughput | Total tokens/sec in batched mode | tok/s |
| Per-token latency | Average time per output token (batch=1) | ms |
| GPU utilization | From nvidia-smi dmon | % |
| Kernel time breakdown | Attention vs. MLP vs. other from Perfetto trace | % |

Source Code Reading

  • vllm/engine/async_llm_engine.py — AsyncLLM: the top-level async engine that manages request queues
  • vllm/v1/engine/core.py — EngineCore: the synchronous core that runs scheduling + model execution
  • vllm/worker/worker.py — Worker: GPU-side model runner and KV cache management

Written Analysis (1-2 pages)

  • What percentage of forward-pass time is spent in attention vs. MLP layers? Does this match theory?
  • How does offline throughput compare to online throughput at high request rates from Week 1?
  • Draw the AsyncLLM → EngineCore → Worker call graph and annotate which thread/process each runs in.
Perfetto Trace Reading Guide: When viewing the trace in Perfetto UI:
• The top rows show CPU threads (Python async loop, tokenizer, etc.)
• Lower rows show CUDA stream activity on the GPU
• Look for the repeating pattern: scheduler → model_execute → sampler
• Each decode step should be visible as a cluster of CUDA kernels
• Zoom in on a single decode step to see: attention kernels, MLP GEMMs, RMSNorm, etc.
• Note the gaps between kernels — these are CPU-side launch overhead (eliminated by CUDA graphs in Week 10)
Related Deep Dive: vLLM EngineCore Internals

Week 3 — Roofline Model

Learning Objectives

  • Profile GPU kernels using nsys profile and ncu for roofline analysis
  • Understand A100 specs: 312 TFLOPS BF16, 2 TB/s HBM bandwidth
  • Calculate theoretical decode latency: 16 GB model / 2 TB/s = 8 ms
  • Identify which kernels are memory-bound vs compute-bound

Setup & Configuration

# Install nsight systems and nsight compute (usually comes with CUDA toolkit)
which nsys && which ncu
# If not found: module load nsight-systems nsight-compute

Experiments

1. Nsys Profile — Full Serving Trace

nsys profile -o llm_trace --trace=cuda,nvtx \
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --batch-size 1 --input-len 512 --output-len 64 --num-iters 3
2. NCU Roofline — Per-Kernel Analysis

ncu --set roofline --target-processes all \
  -o roofline_report \
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --batch-size 1 --input-len 128 --output-len 16 --num-iters 1
3. Theoretical Decode Latency Calculation

For Llama-3.1-8B in BF16: model size ≈ 16 GB. A100 HBM bandwidth = 2 TB/s. Theoretical minimum decode step = 16 GB / 2 TB/s = 8 ms. Compare against your measured per-token latency from the nsys trace.
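The calculation generalizes to any model and GPU. A small helper; the 12.5 ms measured value below is a placeholder to substitute with your own nsys number:

```python
def decode_floor_ms(params_billion, bytes_per_param, hbm_tb_per_s):
    """Lower bound on one decode step: every weight byte must cross HBM
    once per token. Ignores KV cache and activation traffic (optimistic)."""
    model_gb = params_billion * bytes_per_param          # e.g. 8 * 2 = 16 GB
    return model_gb / (hbm_tb_per_s * 1000) * 1000.0     # GB / (GB/s) -> ms

floor = decode_floor_ms(8, 2, 2.0)                       # Llama-3.1-8B BF16, A100
print(f"theoretical floor: {floor:.1f} ms")              # theoretical floor: 8.0 ms

measured_ms = 12.5   # placeholder: substitute your measured per-token latency
print(f"HBM utilization: {100 * floor / measured_ms:.0f}%")   # 64%
```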

4

Prefill vs Decode Roofline

Open the NCU roofline report. Identify where attention kernels and GEMM kernels fall. Prefill GEMMs should be near the compute roof; decode attention should be near the memory roof.

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| Measured decode latency | Actual per-token time from nsys | ms |
| Theoretical decode latency | model_size / HBM_bandwidth | ms |
| Arithmetic intensity | FLOPs / bytes for key kernels | FLOP/B |
| HBM utilization | Fraction of peak bandwidth achieved | % |

Source Code Reading

  • vllm/model_executor/models/llama.py — Trace the forward() method to see which operations map to which GPU kernels
  • Identify: QKV projection → FlashAttention → output projection → gate/up projection → SiLU → down projection

Written Analysis (1-2 pages)

  • What is the ratio of measured decode latency to theoretical minimum? What accounts for the gap?
  • Draw a roofline diagram with your measured kernel data points. Label which are memory-bound and compute-bound.
  • How does arithmetic intensity change from batch=1 to batch=32 for the same kernels?
Tip: NCU roofline profiling is very slow — it instruments every kernel. Use --num-iters 1 and short sequences. A full roofline run may take 10-30 minutes.

Batch Size Scaling Experiment

Run the same model at batch sizes 1, 4, 16, 64 and observe how arithmetic intensity shifts kernels from memory-bound to compute-bound:

for bs in 1 4 16 64; do
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --batch-size $bs --input-len 256 --output-len 64 \
    2>&1 | tee results_roofline_bs${bs}.txt
done

Key Formulas

  • Arithmetic Intensity (AI) = FLOPs / Bytes transferred
  • Ridge Point = Peak TFLOPS / Peak Bandwidth = 312 TFLOPS / 2 TB/s = 156 FLOP/Byte for A100
  • Decode (batch=1): AI ≈ 1 FLOP/Byte → deep in memory-bound territory
  • Prefill (seq=2048): AI ≈ 2048 FLOP/Byte → compute-bound for GEMMs
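These formulas can be checked numerically for a single GEMM. A sketch counting FLOPs and bytes for `C[m,n] = A[m,k] @ B[k,n]`; the hidden size 4096 is illustrative, and counting all three matrices (not just weights) is why the prefill AI comes out lower than the weights-only estimate above:

```python
def gemm_ai(m, n, k, bytes_per_el=2):
    """Arithmetic intensity of C[m,n] = A[m,k] @ B[k,n] in BF16.
    FLOPs = 2*m*n*k; bytes counts A, B, and C each moved once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

RIDGE = 312e12 / 2e12          # A100: 312 TFLOPS / 2 TB/s = 156 FLOP/B

for m in (1, 2048):            # decode (batch=1) vs prefill (2048-token chunk)
    ai = gemm_ai(m, 4096, 4096)
    bound = "compute" if ai > RIDGE else "memory"
    print(f"m={m}: AI={ai:.1f} FLOP/B -> {bound}-bound")
# m=1:    AI=1.0    -> memory-bound
# m=2048: AI=1024.0 -> compute-bound
```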

Week 4 — PagedAttention

Learning Objectives

  • Understand how PagedAttention eliminates KV cache fragmentation via block-level memory management
  • Compare vLLM PagedAttention throughput against a HuggingFace generate() baseline
  • Sweep --max-num-seqs and --block-size to observe memory utilization changes

Setup & Configuration

# HuggingFace baseline script (save as hf_baseline.py)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, time

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

prompts = ["Write a short story about AI."] * 16
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start
total_tokens = (outputs.shape[1] - inputs["input_ids"].shape[1]) * len(prompts)
print(f"HF throughput: {total_tokens/elapsed:.1f} tok/s")

Experiments

1. HuggingFace Baseline

python hf_baseline.py
2. vLLM — max-num-seqs Sweep

for seqs in 1 4 16 64 256; do
  python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-num-seqs $seqs \
    --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 300 \
    2>&1 | tee results_paged_seqs${seqs}.txt
done
3. Block Size Sweep

for bs in 8 16 32; do
  python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --block-size $bs \
    --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 300 \
    2>&1 | tee results_paged_bs${bs}.txt
done
4. Memory Utilization Monitoring

During each run, log GPU memory usage. Note the KV cache block allocation reported in vLLM logs (e.g., '# GPU blocks: 1234').

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| HF baseline throughput | HuggingFace generate() throughput | tok/s |
| vLLM throughput per max-num-seqs | Throughput at each concurrency level | tok/s |
| GPU blocks allocated | KV cache blocks at each block-size | blocks |
| Memory waste | Internal fragmentation at each block-size | % |

Source Code Reading

  • vllm/v1/core/kv_cache_manager.py — How blocks are allocated, freed, and managed for PagedAttention
  • vllm/v1/core/kv_cache_utils.py — FreeKVCacheBlockQueue: the free list data structure

Written Analysis (1-2 pages)

  • What is the throughput improvement of vLLM over HuggingFace? What causes this gap?
  • How does block-size affect internal fragmentation vs. metadata overhead? What is the optimal block size?
  • Explain how PagedAttention's block table maps virtual sequence positions to physical GPU memory blocks.
Background — How PagedAttention Works: Traditional KV cache allocates a contiguous buffer per sequence for the maximum possible length. This wastes memory when sequences are shorter than max. PagedAttention divides KV cache into fixed-size blocks (like OS virtual memory pages). A block table maps logical positions to physical blocks. Blocks are allocated on demand and freed immediately when a sequence finishes, eliminating both external fragmentation (gaps between sequences) and most internal fragmentation (only the last block of each sequence may have unused slots).
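The block-table idea above can be sketched in a few lines. This is a toy model for intuition, not vLLM's actual data structure:

```python
class BlockTable:
    """Toy PagedAttention block table: maps logical token positions of one
    sequence to physical KV cache blocks drawn from a shared free list."""
    def __init__(self, block_size, free_blocks):
        self.block_size = block_size
        self.free = free_blocks     # shared pool of physical block ids
        self.blocks = []            # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.block_size == 0:
            self.blocks.append(self.free.pop())   # allocate on demand
        self.num_tokens += 1

    def physical_slot(self, pos):
        """Translate a logical token position to (physical block, offset)."""
        return self.blocks[pos // self.block_size], pos % self.block_size

    def release(self):
        self.free.extend(self.blocks)   # blocks reusable immediately on finish
        self.blocks.clear()
        self.num_tokens = 0

free = list(range(100))
seq = BlockTable(block_size=16, free_blocks=free)
for _ in range(40):
    seq.append_token()
print(len(seq.blocks))         # 3 blocks for 40 tokens (only last is partial)
print(seq.physical_slot(33))   # (97, 1)
```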

Understanding Block Size Tradeoffs

Smaller Blocks (e.g., 8)

  • Less internal fragmentation (avg waste = block_size/2 = 4 tokens)
  • More metadata overhead (more block table entries)
  • More kernel launch overhead for attention

Larger Blocks (e.g., 32)

  • More internal fragmentation (avg waste = 16 tokens)
  • Less metadata, fewer block table entries
  • Better GPU efficiency per attention kernel
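You can estimate these tradeoffs before running anything. A sketch computing blocks used and wasted slots for a hypothetical set of sequence lengths (the lengths below are made up for illustration):

```python
def frag_stats(seq_lens, block_size):
    """Blocks needed and % of slots lost to internal fragmentation."""
    total_tokens = sum(seq_lens)
    blocks = sum(-(-l // block_size) for l in seq_lens)   # ceil division
    slots = blocks * block_size
    waste_pct = 100 * (slots - total_tokens) / slots
    return blocks, waste_pct

lens = [37, 210, 993, 64, 1500]   # hypothetical sequence lengths
for bs in (8, 16, 32):
    blocks, waste = frag_stats(lens, bs)
    print(f"block_size={bs}: {blocks} blocks, {waste:.1f}% slots wasted")
```

Waste grows with block size while the number of block-table entries shrinks, which is exactly the tradeoff sketched in the two columns above.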
Related Deep Dive: vLLM Attention & KV Cache

Week 5 — Prefix Caching (APC)

Learning Objectives

  • Understand Automatic Prefix Caching (APC) and how it reuses KV cache blocks across requests with shared prefixes
  • Measure TTFT improvement from prefix caching on workloads with shared system prompts
  • Experiment with FP8 KV cache to increase effective cache capacity

Setup & Configuration

# Launch vLLM with APC enabled
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching --port 8000

# Launch vLLM WITHOUT APC (baseline)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --no-enable-prefix-caching --port 8002

Experiments

1. APC vs No-APC Comparison

# With APC (port 8000)
python benchmarks/benchmark_prefix_caching.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 --port 8000 \
  2>&1 | tee results_apc_on.txt

# Without APC (port 8002)
python benchmarks/benchmark_prefix_caching.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 --port 8002 \
  2>&1 | tee results_apc_off.txt
2. Shared Prefix Length Sweep

for prefix_len in 0 256 512 1024 2048; do
  python benchmarks/benchmark_prefix_caching.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --shared-prefix-len $prefix_len \
    --num-prompts 200 --port 8000 \
    2>&1 | tee results_apc_prefix${prefix_len}.txt
done
3. FP8 KV Cache

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --port 8003

# Repeat prefix caching benchmark on port 8003
python benchmarks/benchmark_prefix_caching.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 --port 8003 \
  2>&1 | tee results_apc_fp8.txt

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| TTFT (APC on vs off) | Time to first token with/without prefix caching | ms |
| Cache hit rate | Fraction of prefix blocks reused | % |
| TTFT vs prefix length | How TTFT scales with shared prefix length | ms |
| GPU blocks (FP16 vs FP8) | Total cache blocks available under each dtype | blocks |

Source Code Reading

  • vllm/v1/core/kv_cache_utils.py — hash_block_tokens(): how block content is hashed for prefix matching
  • vllm/v1/core/kv_cache_manager.py — Look for prefix caching logic: how cached blocks are found and reused
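The chained hashing behind prefix matching can be sketched as follows. This mirrors the idea of hash_block_tokens() (each block's identity folds in its parent's hash, so identical token blocks under different prefixes never collide) but simplifies the details:

```python
import hashlib

def block_hash(prev_hash, block_tokens):
    # Mix in the parent hash so a block's identity depends on its
    # entire prefix, not just its own tokens (simplified sketch).
    h = hashlib.sha256()
    h.update(str(prev_hash).encode())
    h.update(",".join(map(str, block_tokens)).encode())
    return h.hexdigest()

def hash_sequence(tokens, block_size=16):
    hashes, prev = [], ""
    full = len(tokens) - len(tokens) % block_size
    for i in range(0, full, block_size):   # only full blocks are cacheable
        prev = block_hash(prev, tokens[i:i + block_size])
        hashes.append(prev)
    return hashes

a = hash_sequence(list(range(40)))                # blocks [0..15], [16..31]
b = hash_sequence(list(range(16)) + [999] * 24)   # same first block, then diverges
print(a[0] == b[0], a[1] == b[1])                 # True False
```

A lookup table keyed by these hashes is all the cache needs: a new request walks its own block hashes left to right and reuses cached blocks until the first miss.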

Written Analysis (1-2 pages)

  • How much TTFT improvement does APC provide at each shared prefix length? Plot the relationship.
  • Does FP8 KV cache degrade output quality? Design an experiment to test this.
  • Explain the hash-based prefix matching mechanism. What are its limitations?
Real-World Use Case — System Prompts: In production, many applications share a long system prompt (e.g., 'You are a helpful assistant that...') across all user requests. Without APC, every request recomputes KV cache for this shared prefix. With APC enabled, the first request computes and caches the system prompt, and all subsequent requests reuse it — reducing TTFT from hundreds of milliseconds to near-zero for the shared portion. This is especially impactful for RAG applications where the retrieved context is often repeated.

Week 6 — RadixAttention

Learning Objectives

  • Understand SGLang's RadixAttention — a radix tree for token-level KV cache reuse
  • Compare RadixAttention (SGLang) vs APC (vLLM) on few-shot workloads
  • Experiment with LPM (Longest Prefix Match) vs FCFS scheduling
  • Write a multi-turn program with sgl.function DSL

Setup & Configuration

# SGLang with radix cache enabled (default)
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8001

# SGLang with radix cache disabled
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --disable-radix-cache \
  --port 8004

Experiments

1. Radix Cache On vs Off

# Few-shot workload benchmark
python -m sglang.bench_serving \
  --backend sglang --port 8001 \
  --dataset-name generated-shared-prefix \
  --num-prompts 200 --request-rate 4 \
  2>&1 | tee results_radix_on.txt

python -m sglang.bench_serving \
  --backend sglang --port 8004 \
  --dataset-name generated-shared-prefix \
  --num-prompts 200 --request-rate 4 \
  2>&1 | tee results_radix_off.txt
2. LPM vs FCFS Scheduling

SGLang uses Longest Prefix Match scheduling by default. Compare against FCFS by toggling the scheduling policy:

# LPM scheduling (default)
python -m sglang.bench_serving \
  --backend sglang --port 8001 \
  --dataset-name generated-shared-prefix \
  --num-prompts 300 --request-rate 8 \
  2>&1 | tee results_lpm.txt

# FCFS scheduling
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --schedule-policy fcfs --port 8005
python -m sglang.bench_serving \
  --backend sglang --port 8005 \
  --dataset-name generated-shared-prefix \
  --num-prompts 300 --request-rate 8 \
  2>&1 | tee results_fcfs.txt
3. sgl.function DSL Example

import sglang as sgl

@sgl.function
def multi_turn(s, question1, question2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question1)
    s += sgl.assistant(sgl.gen("answer1", max_tokens=256))
    s += sgl.user(question2)
    s += sgl.assistant(sgl.gen("answer2", max_tokens=256))

state = multi_turn.run(
    question1="What is PagedAttention?",
    question2="How does it compare to RadixAttention?")
print(state["answer1"])
print(state["answer2"])

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| TTFT (radix on/off) | First token time with/without radix cache | ms |
| Cache hit rate | Prefix reuse ratio on few-shot workload | % |
| TTFT (LPM vs FCFS) | Scheduling policy effect on first-token time | ms |

Source Code Reading

  • sglang/srt/mem_cache/radix_cache.py — match_prefix(): how the radix tree performs longest-prefix matching at token granularity
  • sglang/srt/managers/schedule_batch.py — LPM vs FCFS scheduling logic

Written Analysis (1-2 pages)

  • Compare RadixAttention's token-level matching vs APC's block-level hashing. When does each approach win?
  • Why does LPM scheduling improve throughput on shared-prefix workloads? What happens on non-shared workloads?
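For intuition on token-level matching, here is a toy trie with the same interface idea as match_prefix(). SGLang's real radix cache adds edge compression, LRU eviction, and reference counting; this sketch only shows why token granularity beats block granularity on partial overlaps:

```python
class TokenTrie:
    """Token-level prefix index (simplified stand-in for a radix cache)."""
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

trie = TokenTrie()
trie.insert([1, 2, 3, 4, 5])
print(trie.match_prefix([1, 2, 3, 9]))   # 3
```

A block-level cache with block size 16 would reuse nothing here, since the shared prefix never fills a complete block; token-level matching recovers all 3 shared tokens.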

Week 7 — Continuous vs Static Batching

Learning Objectives

  • Understand Orca-style continuous batching: new requests enter mid-batch as others finish
  • Measure the throughput loss of static (HuggingFace) batching when output lengths vary
  • Read the vLLM scheduler to understand how requests are added/removed from running batch

Setup & Configuration

# Prepare a workload with high output-length variance
# ShareGPT naturally has variance; we can also create synthetic workloads
python -c "
import json, random
data = []
for i in range(200):
    out_len = random.choice([16, 32, 64, 256, 512])
    data.append({'input': 'Tell me about AI.', 'output_len': out_len})
json.dump(data, open('variable_output.json','w'))
"

Experiments

1. HuggingFace Static Batching Baseline

Use the HuggingFace generate() script from Week 4 with varying output lengths. Observe that all sequences must wait for the longest one to finish.

2. vLLM Continuous Batching — Uniform Outputs

python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 256 --output-len 128 \
  --num-prompts 300 \
  2>&1 | tee results_cb_uniform.txt
3. vLLM Continuous Batching — Variable Outputs

python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 300 \
  2>&1 | tee results_cb_variable.txt
4. Batch Occupancy Analysis

Enable vLLM's --log-stats and observe how the running batch size fluctuates over time with continuous batching. With static batching, the batch size stays constant (wasting GPU cycles on padding).

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| Static batch throughput | HF generate() with padding | tok/s |
| Continuous batch throughput | vLLM on same workload | tok/s |
| Avg batch occupancy | Mean running sequences over time | seqs |

Source Code Reading

  • vllm/v1/core/scheduler.py — schedule(): how requests are added to and removed from the running batch each step
  • Reference: Orca (Yu et al., OSDI 2022) — the foundational continuous batching paper

Written Analysis (1-2 pages)

  • Quantify the throughput improvement of continuous over static batching. How does output-length variance affect the gap?
  • Explain iteration-level vs request-level scheduling. What are the tradeoffs?
Tip — Creating a Fair Comparison: For a proper static vs continuous comparison, use the same prompts, same model, and same GPU. The key variable is the batching strategy. Use a workload with high variance in output lengths (e.g., some requests want 16 tokens, others want 512) — this is where continuous batching shines.
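Before burning GPU time, you can predict the gap with a back-of-the-envelope simulation. This idealized model ignores prefill cost and memory limits; its point is only that static batching pays the maximum output length per batch while continuous batching pays (roughly) the sum of output lengths:

```python
import random

def static_steps(out_lens, batch):
    """Static batching: each batch runs until its longest sequence finishes."""
    return sum(max(out_lens[i:i + batch]) for i in range(0, len(out_lens), batch))

def continuous_steps(out_lens, batch):
    """Continuous batching (idealized): a finished sequence's slot is
    refilled on the next step, so every step emits `batch` useful tokens."""
    total = sum(out_lens)
    return -(-total // batch)   # ceil division

random.seed(0)
lens = [random.choice([16, 32, 64, 256, 512]) for _ in range(200)]
s, c = static_steps(lens, 16), continuous_steps(lens, 16)
print(f"static: {s} steps, continuous: {c} steps, ratio: {s / c:.2f}")
```

With uniform output lengths the two converge; the ratio grows with output-length variance, which is why the ShareGPT workload shows a larger gap than the fixed-length run.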

Detailed Timing Breakdown

With static batching (batch=16, max_output=512), every sequence holds its slot until the longest one in the batch emits all 512 tokens; continuous batching instead refills freed slots each iteration. Watch the running-batch size fluctuate in the stats log:

# Visualize batch occupancy over time (from vLLM stats logs)
# Look for lines like: "Avg running: 42.3, Avg waiting: 12.1"
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests \
  --port 8000 2>&1 | grep "running"
Related Deep Dive: vLLM Scheduler Internals

Week 8 — Chunked Prefill

Learning Objectives

  • Understand head-of-line (HOL) blocking: a long prefill delays all decode tokens in the batch
  • Measure ITL spikes caused by large prefills and how chunked prefill mitigates them
  • Sweep --max-num-batched-tokens to find the optimal chunk size

Setup & Configuration

# Without chunked prefill
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --no-enable-chunked-prefill --port 8000

# With chunked prefill (default in recent vLLM)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill --port 8002

# Aggressive chunking
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill --max-num-batched-tokens 512 --port 8003

Experiments

1. HOL Blocking Demonstration

Create a workload mixing long prefills (2048+ tokens) with short streaming requests. Without chunked prefill, observe ITL spikes in the short requests.

python benchmarks/benchmark_serving.py \
  --backend vllm --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100 --request-rate 4 \
  2>&1 | tee results_no_chunk.txt
2. Chunked Prefill — Default

python benchmarks/benchmark_serving.py \
  --backend vllm --port 8002 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100 --request-rate 4 \
  2>&1 | tee results_chunk_default.txt
3. max-num-batched-tokens Sweep

for tokens in 256 512 1024 2048 4096; do
  vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-chunked-prefill \
    --max-num-batched-tokens $tokens \
    --port 8010 &
  sleep 30  # wait for server startup
  python benchmarks/benchmark_serving.py \
    --backend vllm --port 8010 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 100 --request-rate 4 \
    2>&1 | tee results_chunk_${tokens}.txt
  kill %1
done
4. ITL Timeseries Analysis

Plot the per-token ITL over time for each configuration. Without chunked prefill, you should see periodic spikes. With chunked prefill, ITL should be smoother.

Metrics to Collect

Metric | Description | Unit
ITL p50, p99 | Inter-token latency percentiles | ms
TTFT p50, p99 | Time to first token (may increase with chunking) | ms
ITL p99/p50 ratio | Measure of ITL consistency (lower is better) | ratio

Source Code Reading

  • vllm/v1/core/scheduler.py: how the scheduler splits a long prefill into chunks and interleaves decode tokens

Written Analysis (1-2 pages)

  • What is the TTFT vs ITL tradeoff of chunked prefill? Smaller chunks reduce ITL spikes but increase TTFT — why?
  • What is the optimal max-num-batched-tokens for your hardware and workload? Justify with data.
HOL Blocking Visualized: Imagine a batch with 30 decode requests generating tokens smoothly at 10ms each. A new request arrives with a 4096-token prefill. Without chunked prefill, the entire batch pauses for ~200ms while the prefill computes — every decode user sees a ~200ms stutter in their stream. With chunked prefill (max_batched_tokens=512), the prefill is split into 8 chunks of 512 tokens each. Each chunk takes ~25ms and is interleaved with decode steps, reducing the maximum stutter to ~25ms.
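The arithmetic in that scenario is worth writing down once, since you will repeat it for your own measured prefill times:

```python
# Back-of-envelope stutter calculation from the scenario above.
prefill_tokens = 4096
prefill_time_ms = 200.0       # time to prefill 4096 tokens in one shot
max_batched_tokens = 512      # chunk size

chunks = prefill_tokens // max_batched_tokens        # 8 chunks
stutter_per_chunk_ms = prefill_time_ms / chunks      # 25 ms per chunk
print(chunks, stutter_per_chunk_ms)

# Decode users now see eight ~25 ms pauses spread across many steps
# instead of one 200 ms freeze, which is exactly what flattens the
# p99 ITL in your timeseries plot.
```

Re-run this with your measured prefill time and each swept chunk size to predict the ITL spike height before you plot it.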

Week 9 — Scheduling Policies

Learning Objectives

  • Compare FCFS, LPM, and DFS-weight scheduling policies in SGLang
  • Understand cache-aware data-parallel (DP) routing with sglang_router
  • Measure the impact of scheduling policy on cache hit rate and throughput

Experiments

1. FCFS vs LPM vs DFS-weight

for policy in fcfs lpm dfs-weight; do
  python -m sglang.launch_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --schedule-policy $policy \
    --port 8001 &
  sleep 30
  python -m sglang.bench_serving \
    --backend sglang --port 8001 \
    --dataset-name generated-shared-prefix \
    --num-prompts 300 --request-rate 8 \
    2>&1 | tee results_sched_${policy}.txt
  kill %1; sleep 5
done
2. Cache-Aware DP Routing

If you have 2 GPUs, launch SGLang with data parallelism and the cache-aware router:

python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dp 2 --port 8001

# The sglang_router automatically routes requests to maximize cache hits
python -m sglang.bench_serving \
  --backend sglang --port 8001 \
  --dataset-name generated-shared-prefix \
  --num-prompts 500 --request-rate 10 \
  2>&1 | tee results_dp_router.txt

Metrics to Collect

Metric | Description | Unit
Throughput per policy | FCFS vs LPM vs DFS-weight | tok/s
Cache hit rate per policy | Prefix reuse on shared-prefix workload | %
TTFT p50, p99 | First-token latency under each policy | ms

Source Code Reading

  • sglang/srt/managers/scheduler.py: scheduling policy implementations (FCFS, LPM, DFS-weight)
  • sglang/srt/router/sglang_router: cache-aware DP request routing

Written Analysis (1-2 pages)

  • Which scheduling policy achieves the highest cache hit rate? Why?
  • How does cache-aware DP routing improve throughput over random routing? What workloads benefit most?
Scheduling Policy Explained: FCFS — First Come First Served. Processes requests in arrival order. Simple but ignores cache locality.
LPM — Longest Prefix Match. Prioritizes requests whose prefix is already cached, maximizing cache hits.
DFS-weight — Depth-First Search with weight. Biases toward completing requests from the same prefix subtree, improving locality.
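The LPM idea fits in a few lines. This is a minimal sketch, not SGLang's implementation (which matches against a radix tree rather than a flat list of cached prefixes):

```python
def shared_prefix_len(a, b):
    """Length of the common prefix of two token-id sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def lpm_pick(waiting, cached_prefixes):
    """Pick the waiting request whose prompt shares the longest prefix
    with something already cached, so its prefill skips the most work."""
    def score(req):
        return max((shared_prefix_len(req, p) for p in cached_prefixes),
                   default=0)
    return max(waiting, key=score)

cached = [[1, 2, 3, 4, 5]]
waiting = [[9, 9, 9], [1, 2, 3, 7], [1, 2, 3, 4, 5, 6]]
print(lpm_pick(waiting, cached))  # -> [1, 2, 3, 4, 5, 6]
```

FCFS would serve [9, 9, 9] first and evict useful cache entries sooner; LPM's bias toward cached prefixes is what drives the hit-rate gap you measure on the shared-prefix workload.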

Workload Impact Analysis

Test the same policies on a non-shared (ShareGPT) workload to show that LPM has no advantage when prefixes are unique:

for policy in fcfs lpm; do
  python -m sglang.launch_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --schedule-policy $policy --port 8001 &
  sleep 30
  python -m sglang.bench_serving \
    --backend sglang --port 8001 \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 --request-rate 8 \
    2>&1 | tee results_sched_sharegpt_${policy}.txt
  kill %1; sleep 5
done
Related Deep Dive: SGLang Scheduler Deep Dive

Week 10 — CUDA Graphs

Learning Objectives

  • Understand CUDA graph capture: recording GPU operations once, replaying them without CPU overhead
  • Measure latency reduction from CUDA graphs vs --enforce-eager mode
  • Observe the warmup/capture phase at server startup and its memory cost

Experiments

1. Eager Mode vs CUDA Graphs

# Eager mode (no CUDA graphs)
python benchmarks/benchmark_latency.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enforce-eager \
  --batch-size 1 --input-len 512 --output-len 128 \
  2>&1 | tee results_eager.txt

# CUDA graphs (default)
python benchmarks/benchmark_latency.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --batch-size 1 --input-len 512 --output-len 128 \
  2>&1 | tee results_cudagraph.txt
2. Startup Time Comparison

Time the server startup with and without CUDA graphs. The graph capture phase (warmup) adds significant startup time but reduces per-request latency.

# Measure time until the server answers its health check
start=$(date +%s)
vllm serve meta-llama/Llama-3.1-8B-Instruct --enforce-eager --port 8000 &
until curl -sf http://localhost:8000/health > /dev/null; do sleep 1; done
echo "eager startup: $(( $(date +%s) - start ))s"
# Repeat without --enforce-eager (e.g. on port 8002) to include graph capture
3. Memory Overhead Analysis

Compare GPU memory usage between eager and CUDA graph modes. CUDA graphs pre-allocate memory for captured operations, reducing available KV cache space.

Metrics to Collect

Metric | Description | Unit
Decode latency (eager) | Per-token time without CUDA graphs | ms
Decode latency (graphs) | Per-token time with CUDA graphs | ms
Server startup time | Eager vs graph capture startup | s
Memory overhead | Extra GPU memory used by captured graphs | MB

Source Code Reading

  • vllm/worker/model_runner.py: CudaGraphRunner, showing how graphs are captured during warmup and replayed during inference

Written Analysis (1-2 pages)

  • What is the latency reduction from CUDA graphs? Where does the speedup come from (CPU launch overhead elimination)?
  • What are the constraints of CUDA graphs? (Fixed tensor sizes, no dynamic control flow)
Understanding CUDA Graph Capture: During warmup, vLLM runs dummy forward passes at various batch sizes (1, 2, 4, ..., max_batch_size). Each pass is recorded as a CUDA graph. During inference, the engine selects the graph matching the current batch size and replays it. This eliminates CPU-side kernel launch overhead (~10-20us per kernel × ~100 kernels per layer × 32 layers = significant savings). The tradeoff: each captured graph consumes GPU memory for the recorded computation, reducing KV cache capacity.
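Multiplying out the ballpark figures quoted above (these are the box's own rough numbers, not measurements) shows why the win is largest at small batch sizes:

```python
# Rough CPU launch-overhead estimate from the figures above
# (all inputs are ballpark assumptions, not measured values).
launch_us_per_kernel = 15   # ~10-20 us per kernel launch
kernels_per_layer = 100
layers = 32

overhead_ms = launch_us_per_kernel * kernels_per_layer * layers / 1000
print(f"per-step CPU launch overhead: ~{overhead_ms:.0f} ms")

# At batch size 1, the GPU work per decode step is only a few ms, so a
# replayed graph (one launch for the whole step) removes most of the
# wall-clock time. At large batch sizes the GPU compute dominates and
# the eager/graph ratio in your sweep should shrink toward 1.
```

Compare this predicted overhead against the eager-vs-graph gap you measure in the batch size sweep below.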

Batch Size Sweep with CUDA Graphs

for bs in 1 4 16 64; do
  # Eager mode
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --enforce-eager --batch-size $bs \
    --input-len 256 --output-len 64 \
    2>&1 | tee results_eager_bs${bs}.txt

  # CUDA graph mode
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --batch-size $bs \
    --input-len 256 --output-len 64 \
    2>&1 | tee results_cudagraph_bs${bs}.txt
done

Plot the latency improvement ratio (eager/graph) for each batch size. The speedup is typically larger at small batch sizes where CPU launch overhead is a larger fraction of total time.

Related Deep Dive: vLLM Model Runner

Week 11 — Speculative Decoding

Learning Objectives

  • Understand speculative decoding: a small draft model proposes tokens, the target model verifies in parallel
  • Compare draft-model, N-gram, and EAGLE speculation methods
  • Measure acceptance rate vs num_speculative_tokens
  • Observe that spec decoding helps at low QPS but hurts at high QPS

Experiments

1. Draft Model Speculation

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --num-speculative-tokens 5 \
  --port 8000

python benchmarks/benchmark_serving.py \
  --backend vllm --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100 --request-rate 2 \
  2>&1 | tee results_spec_draft.txt
2. N-gram Speculation

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model '[ngram]' \
  --num-speculative-tokens 5 \
  --ngram-prompt-lookup-max 4 \
  --port 8002

python benchmarks/benchmark_serving.py \
  --backend vllm --port 8002 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100 --request-rate 2 \
  2>&1 | tee results_spec_ngram.txt
3. num_speculative_tokens Sweep

for k in 1 3 5 7; do
  vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --num-speculative-tokens $k \
    --port 8010 &
  sleep 30
  python benchmarks/benchmark_serving.py \
    --backend vllm --port 8010 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 100 --request-rate 2 \
    2>&1 | tee results_spec_k${k}.txt
  kill %1; sleep 5
done
4. Low QPS vs High QPS

Run the same speculative decoding config at request rates 1, 4, and 10. At low QPS, speculation reduces per-request latency. At high QPS, the overhead of draft model + verification can hurt throughput.

Metrics to Collect

Metric | Description | Unit
Acceptance rate | Fraction of draft tokens accepted by target | %
Per-request latency | With vs without speculation | ms
Throughput at high QPS | Spec decoding overhead under load | tok/s

Source Code Reading

  • vllm/spec_decode/: speculative decoding module (draft worker, scorer, verification logic)

Written Analysis (1-2 pages)

  • Plot acceptance rate vs num_speculative_tokens. Why does acceptance rate decrease with more tokens?
  • At what QPS does speculative decoding start hurting throughput? Explain the mechanism.
Understanding Acceptance Rate: When the draft model proposes k tokens, the target model verifies all k in a single forward pass (versus k separate forward passes without speculation). The expected number of tokens emitted per target forward pass is roughly k × acceptance_rate + 1, so speculation only pays off if the draft model is significantly cheaper than the target. At high QPS, the draft model's GPU cycles compete with serving existing decode batches, reducing overall throughput even if per-request latency improves.
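A slightly more careful version of that estimate, under the simplifying assumption that each draft token is accepted independently with probability `a` and generation stops at the first rejection (real acceptance is position-dependent, so treat this as a sketch):

```python
def tokens_per_step(k, a):
    """Expected tokens emitted per target forward pass with k draft
    tokens, each accepted i.i.d. with probability a; the target always
    contributes one token of its own after verification."""
    accepted = sum(a ** i for i in range(1, k + 1))  # E[# accepted drafts]
    return accepted + 1

for k in (1, 3, 5, 7):
    print(k, round(tokens_per_step(k, a=0.7), 2))

# Returns diminish: the chance that ALL of a longer draft survives
# verification shrinks geometrically in k, which is why the measured
# acceptance rate falls as num_speculative_tokens grows in your sweep.
```

Fit `a` from your measured acceptance rate at k=1 and check how well this curve predicts your k=3/5/7 results.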

Comparison Matrix

Method | Extra Memory | Acceptance Rate | Best For
Draft Model | ~2-4 GB (TinyLlama) | Medium-High | General text, low QPS
N-gram | Negligible | Low (content-dependent) | Repetitive/templated text
EAGLE | ~0.5-1 GB (lightweight head) | High | Code, structured output
Related Deep Dive: vLLM Architecture Overview

Week 12 — Quantization

Learning Objectives

  • Compare FP16, FP8, AWQ-INT4, and GPTQ-INT4 model variants
  • Measure accuracy degradation using lm_eval on GSM8K
  • Plot the Pareto frontier: memory vs throughput vs quality

Experiments

1. Throughput Comparison Across Precisions

# FP16 (baseline)
python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 2>&1 | tee results_fp16.txt

# FP8
python benchmarks/benchmark_throughput.py \
  --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --num-prompts 200 2>&1 | tee results_fp8.txt

# AWQ-INT4
python benchmarks/benchmark_throughput.py \
  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --num-prompts 200 2>&1 | tee results_awq.txt
2. Accuracy Evaluation — GSM8K

# FP16
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
  --tasks gsm8k --batch_size auto \
  2>&1 | tee eval_fp16.txt

# FP8
lm_eval --model vllm \
  --model_args pretrained=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --tasks gsm8k --batch_size auto \
  2>&1 | tee eval_fp8.txt

# AWQ-INT4
lm_eval --model vllm \
  --model_args pretrained=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4,quantization=awq \
  --tasks gsm8k --batch_size auto \
  2>&1 | tee eval_awq.txt
3. Memory Usage Comparison

Record GPU memory used by each model variant (from vLLM startup logs or nvidia-smi). FP16 ≈ 16 GB, FP8 ≈ 8 GB, INT4 ≈ 4 GB for model weights.

Metrics to Collect

Model | Memory (GB) | Throughput (tok/s) | GSM8K Accuracy
FP16 | ~16 | (fill in) | (fill in)
FP8 | ~8 | (fill in) | (fill in)
AWQ-INT4 | ~4 | (fill in) | (fill in)

Source Code Reading

  • vllm/model_executor/layers/quantization/: quantization implementations (AWQ, GPTQ, FP8)

Written Analysis (1-2 pages)

  • Plot the Pareto frontier with memory on x-axis, throughput on y-axis, and GSM8K accuracy as point labels.
  • When is INT4 quantization acceptable? When is FP8 a better choice? Discuss use-case tradeoffs.
Quantization Primer: FP16/BF16: 16-bit floating point. Full precision baseline. 2 bytes per parameter.
FP8: 8-bit floating point (E4M3 or E5M2). Native on Hopper GPUs. 1 byte per parameter. Minimal accuracy loss.
AWQ (Activation-aware Weight Quantization): INT4 with per-channel scaling. Identifies salient weights to preserve. 0.5 bytes per parameter.
GPTQ: Post-training quantization to INT4 using second-order approximation. Similar to AWQ in size.

Why Quantization Improves Throughput

Two complementary effects drive throughput gains:

  1. Reduced memory bandwidth: Decode is memory-bound. INT4 model reads 4× fewer bytes from HBM per token → theoretical 4× decode speedup
  2. More KV cache space: Smaller model weights leave more GPU memory for KV cache → supports larger batch sizes → higher throughput
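Effect 1 can be bounded with simple arithmetic. This sketch estimates the memory-bandwidth floor on per-token decode latency at batch size 1 (KV cache reads are ignored, so real latencies will sit somewhat above these floors):

```python
# Memory-bound decode floor for an 8B model on A100 (2039 GB/s HBM2e).
# Each decode step must stream every weight byte from HBM once.
params = 8e9          # Llama-3.1-8B parameter count (approx.)
bandwidth = 2039e9    # A100 80GB memory bandwidth, bytes/s

results = {}
for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    weight_bytes = params * bytes_per_param
    results[name] = weight_bytes / bandwidth * 1e3  # ms/token, batch 1
    print(f"{name}: {weight_bytes / 1e9:.0f} GB weights, "
          f">= {results[name]:.1f} ms/token floor")
```

The FP16/INT4 ratio is exactly 4x by construction; your measured speedup will be smaller because dequantization compute and KV reads are not free.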

Online Serving Quality Test

Beyond GSM8K, test output quality with real prompts:

# Send the same complex prompt to FP16 and INT4 servers
# Compare outputs side by side for coherence, factual accuracy
for port in 8000 8002; do
  curl -s http://localhost:$port/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "prompt": "Explain the difference between TCP and UDP, including when to use each protocol.",
      "max_tokens": 256,
      "temperature": 0
    }' | python -m json.tool > output_port${port}.json
done

diff output_port8000.json output_port8002.json

Week 13 — Tensor Parallelism

Learning Objectives

  • Understand tensor parallelism (TP): splitting weight matrices across GPUs
  • Measure latency and throughput at TP=1, 2, 4
  • Profile NCCL all-reduce communication overhead using nsys
  • Understand NVLink bandwidth: 600 GB/s bidirectional on A100
Hardware Requirement: This lab requires 2-4 GPUs with NVLink. TP=4 requires 4 GPUs.

Experiments

1. TP Scaling — Latency

for tp in 1 2 4; do
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size $tp \
    --batch-size 1 --input-len 512 --output-len 128 \
    2>&1 | tee results_tp${tp}_latency.txt
done
2. TP Scaling — Throughput

for tp in 1 2 4; do
  python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size $tp \
    --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 300 \
    2>&1 | tee results_tp${tp}_throughput.txt
done
3. NCCL All-Reduce Profiling

nsys profile -o tp2_trace --trace=cuda,nvtx,nccl \
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --batch-size 1 --input-len 256 --output-len 32 --num-iters 3

Open in nsys GUI and identify NCCL all-reduce calls. Measure their duration and frequency per decode step.

Metrics to Collect

Metric | Description | Unit
Decode latency per TP | Per-token time at TP=1,2,4 | ms
All-reduce time | NCCL communication per decode step | us
Communication fraction | All-reduce time / total step time | %

Source Code Reading

  • vllm/distributed/: tensor-parallel communication (column/row parallel linear layers, all-reduce)

Written Analysis (1-2 pages)

  • Does TP=2 reduce latency by 2×? Why or why not? Quantify the communication overhead.
  • At what batch size does TP=2 throughput exceed TP=1? Why does this crossover point exist?
How Tensor Parallelism Works: In TP, each linear layer is split across GPUs. For a weight matrix W of shape [H, H]:
Column parallelism: GPU i holds W[:, i·H/TP : (i+1)·H/TP]. The input is broadcast; each GPU produces a column shard of the output → requires an all-gather (or the next layer consumes the shard directly).
Row parallelism: GPU i holds W[i·H/TP : (i+1)·H/TP, :]. The input is split; each GPU produces a partial sum → requires an all-reduce.
Each transformer layer has 2 all-reduce operations (after attention output and after MLP down projection). With NVLink at 600 GB/s, a typical all-reduce for Llama-8B at TP=2 takes ~20-50 microseconds.
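A quick sanity check on that 20-50 microsecond figure (hidden size 4096 for Llama-8B; all other numbers come from the text above):

```python
# At batch size 1 in BF16, each all-reduce moves one [batch, hidden]
# activation tensor between the TP ranks.
hidden = 4096
batch = 1
bytes_per_elt = 2
msg = hidden * batch * bytes_per_elt   # 8 KiB per all-reduce

nvlink = 600e9                         # bytes/s (A100 NVLink, bidirectional)
wire_time_us = msg / nvlink * 1e6
print(f"message: {msg} bytes, wire time: {wire_time_us:.3f} us")

# The 8 KiB payload needs well under a microsecond of NVLink time, so
# the observed 20-50 us per all-reduce is almost entirely fixed NCCL
# launch/synchronization latency. This latency floor is why small-batch
# TP=2 rarely delivers a clean 2x latency win.
```

At large batch sizes the payload grows proportionally and the all-reduce becomes bandwidth-bound, which feeds directly into the crossover question in the written analysis.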
Related Deep Dive: vLLM Distributed Execution

Week 14 — Mixture of Experts (MoE)

Learning Objectives

  • Understand MoE architecture: sparse activation, expert routing, top-k selection
  • Serve Mixtral-8x7B and observe memory vs compute characteristics
  • Experiment with expert parallelism (--ep-size)
Hardware Requirement: Mixtral-8x7B requires ~100 GB of GPU memory. Use 2× A100 80GB with TP=2, or a single H100 with quantization.

Experiments

1. Serve Mixtral-8x7B

vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2 \
  --port 8000

python benchmarks/benchmark_serving.py \
  --backend vllm --port 8000 \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100 --request-rate 4 \
  2>&1 | tee results_moe_tp2.txt
2. Expert Parallelism

# Expert parallelism distributes experts across GPUs
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2 --enable-expert-parallel \
  --port 8002

# In SGLang with explicit EP
python -m sglang.launch_server \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tp 2 --ep-size 2 --port 8003
3. Memory vs Compute Analysis

Compare memory usage and throughput: Mixtral-8x7B has 46.7B total parameters but only activates ~12.9B per token (top-2 of 8 experts). Compare against Llama-3.1-8B which has fewer total parameters but activates all of them.

Metrics to Collect

Metric | Description | Unit
GPU memory (Mixtral) | Total memory for all expert weights | GB
Throughput (MoE vs dense) | Mixtral vs Llama-8B at similar quality | tok/s
Expert load balance | Token distribution across experts | %

Source Code Reading

  • vllm/model_executor/models/mixtral.py: MoE layer implementation (router, expert selection, expert execution)

Written Analysis (1-2 pages)

  • Why does MoE require more memory than a dense model of similar quality? What is the memory-compute tradeoff?
  • How does expert parallelism differ from tensor parallelism? When should you use each?
MoE Memory Paradox: Mixtral-8x7B has 46.7B total parameters but only activates ~12.9B per token (top-2 out of 8 experts). This means it has dense-model quality at sparse-model compute cost — but pays the full memory cost. In BF16, Mixtral needs ~93 GB just for weights, compared to ~16 GB for Llama-8B. The memory-compute ratio is the defining characteristic of MoE: more memory for better quality-per-FLOP.
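The numbers behind that box, worked out explicitly (BF16 at 2 bytes per parameter; parameter counts from the text above):

```python
# Dense memory cost vs sparse compute cost for Mixtral-8x7B in BF16.
total_params = 46.7e9    # all 8 experts + shared layers
active_params = 12.9e9   # top-2 experts actually used per token
bytes_per_param = 2

total_gb = total_params * bytes_per_param / 1e9    # must be RESIDENT
active_gb = active_params * bytes_per_param / 1e9  # actually READ/token
print(f"weights in memory: {total_gb:.0f} GB, read per token: {active_gb:.0f} GB")
print(f"only {active_params / total_params:.0%} of params are active per token")

# You pay dense memory cost (~93 GB resident) for sparse compute cost
# (~28% of the FLOPs/bandwidth), which is exactly the memory-compute
# tradeoff your lab measurements should quantify.
```

Note the decode-bandwidth implication: per token, Mixtral streams ~26 GB rather than ~93 GB, so its decode speed is closer to a 13B dense model than a 47B one.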

Week 15 — LMCache

Learning Objectives

  • Understand LMCache: external KV cache storage that persists across server restarts
  • Configure tiered storage: GPU → CPU → disk
  • Sweep chunk_size to find optimal KV cache granularity

Setup & Configuration

# Create LMCache config file: lmcache_config.yaml
cat > lmcache_config.yaml <<'EOF'
chunk_size: 256
local_device: "cpu"
remote_url: null
remote_serde: null

# Tiered storage config
storage:
  - type: "gpu"
    capacity_gb: 4
  - type: "cpu"
    capacity_gb: 16
  - type: "disk"
    path: "/tmp/lmcache_disk"
    capacity_gb: 64
EOF

Experiments

1. LMCache Integration with vLLM

# Launch vLLM with LMCache
LMCACHE_CONFIG_FILE=lmcache_config.yaml vllm serve \
  meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config lmcache_config.yaml \
  --port 8000
2. Cross-Restart Persistence Test

Send requests with a shared system prompt, shut down the server, restart, and send the same requests. With LMCache, the KV cache is restored from disk/CPU, eliminating re-computation.

# First run: populate cache
python benchmarks/benchmark_serving.py \
  --backend vllm --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 50 --request-rate 2 \
  2>&1 | tee results_lmcache_cold.txt

# Restart server, re-run (cache should be warm from disk)
# Kill and restart the server, then:
python benchmarks/benchmark_serving.py \
  --backend vllm --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 50 --request-rate 2 \
  2>&1 | tee results_lmcache_warm.txt
3. chunk_size Sweep

for cs in 64 128 256 512 1024; do
  # Update lmcache_config.yaml with chunk_size=$cs
  sed -i "s/chunk_size: .*/chunk_size: $cs/" lmcache_config.yaml
  LMCACHE_CONFIG_FILE=lmcache_config.yaml vllm serve \
    meta-llama/Llama-3.1-8B-Instruct \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' \
    --port 8010 &
  sleep 30
  python benchmarks/benchmark_serving.py \
    --backend vllm --port 8010 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --num-prompts 100 --request-rate 4 \
    2>&1 | tee results_lmcache_cs${cs}.txt
  kill %1; sleep 5
done

Metrics to Collect

Metric | Description | Unit
TTFT (cold vs warm) | First token time before/after cache is populated | ms
Cache restore time | Time to load KV cache from disk on restart | ms
Throughput per chunk_size | Effect of cache granularity on performance | tok/s

Source Code Reading

  • lmcache/: LMCache core (storage backends, chunk management, vLLM integration)

Written Analysis (1-2 pages)

  • How much TTFT improvement does cross-restart persistence provide? What workloads benefit most?
  • What is the optimal chunk_size? Discuss the tradeoff between granularity (more sharing) and overhead (more metadata).
Tiered Storage Explained: LMCache implements a multi-tier caching hierarchy similar to CPU cache levels:
GPU tier: Fastest access (~0.1ms), but limited by GPU memory
CPU tier: Medium speed (~1-5ms), uses system RAM
Disk tier: Slowest (~10-50ms), but virtually unlimited capacity and survives restarts
When a cache entry is accessed, it is promoted to the GPU tier. When GPU tier is full, least-recently-used entries are demoted to CPU, then disk.
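To put the tier latencies in context, here is the size of one cache chunk for Llama-3.1-8B (32 layers, 8 KV heads via GQA, head dim 128), assuming BF16 KV entries and the default chunk_size of 256:

```python
# Size of one LMCache chunk for Llama-3.1-8B with BF16 KV cache.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elt = 2
chunk_tokens = 256

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elt  # K + V
chunk_mb = kv_bytes_per_token * chunk_tokens / 2**20
print(f"{kv_bytes_per_token / 1024:.0f} KB/token -> "
      f"{chunk_mb:.0f} MB per {chunk_tokens}-token chunk")

# At sequential disk read speeds of ~1-5 GB/s, a 32 MB chunk loads in
# roughly 6-32 ms, consistent with the 10-50 ms disk-tier figure above,
# and typically far cheaper than recomputing a 256-token prefill slice.
```

This also explains the chunk_size tradeoff in the sweep: smaller chunks enable finer-grained sharing but multiply the per-chunk metadata and I/O requests.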

Storage Tier Latency Test

Observe the TTFT difference when KV cache is served from each tier:

# Test 1: Cold start (no cache anywhere)
# TTFT = full prefill computation time

# Test 2: Warm GPU cache (repeat same request)
# TTFT ≈ 0 (KV cache already on GPU)

# Test 3: CPU-only cache (restart server, GPU cache lost)
# TTFT = CPU→GPU transfer time only

# Test 4: Disk-only cache (clear CPU cache too)
# TTFT = Disk→GPU transfer time
Related Deep Dive: LMCache Architecture

Week 16 — Disaggregated Prefill/Decode

Learning Objectives

  • Understand disaggregated serving: separate prefill and decode onto different GPU pools
  • Set up vLLM P2pNcclConnector for KV cache transfer between prefill and decode nodes
  • Use SGLang's --disaggregation-mode for P/D separation
  • Measure KV transfer time vs input length
Hardware Requirement: Disaggregated P/D requires at least 2 GPUs — one for prefill, one for decode.

Experiments

1. vLLM Disaggregated Setup

# Prefill instance (GPU 0)
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer"}'

# Decode instance (GPU 1)
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8001 \
  --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer"}'
2. SGLang Disaggregated Mode

# SGLang with disaggregation
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --port 8002

python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --port 8003
3. KV Transfer Time vs Input Length

Measure how KV cache transfer time scales with input sequence length. Longer inputs produce larger KV caches that take more time to transfer between prefill and decode GPUs.

for input_len in 128 256 512 1024 2048 4096; do
  python benchmarks/benchmark_serving.py \
    --backend vllm --port 8001 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random \
    --random-input-len $input_len --random-output-len 64 \
    --num-prompts 50 --request-rate 2 \
    2>&1 | tee results_pd_input${input_len}.txt
done
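Before measuring, it helps to predict the transfer cost. This sketch assumes Llama-3.1-8B's KV shape (32 layers, 8 KV heads, head dim 128, BF16) and an NVLink hop at ~600 GB/s between the prefill and decode GPUs; the numbers are back-of-envelope, not measurements:

```python
# Expected KV cache size and wire time vs input length.
layers, kv_heads, head_dim, bytes_per_elt = 32, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elt  # K + V

sizes_mb = {}
for n in (128, 512, 2048, 4096):
    sizes_mb[n] = n * kv_per_token / 2**20
    wire_us = n * kv_per_token / 600e9 * 1e6
    print(f"{n:5d} tokens: {sizes_mb[n]:6.0f} MB KV, ~{wire_us:4.0f} us on NVLink")

# Size grows linearly with input length, and even 4096 tokens (512 MB)
# needs under 1 ms of wire time, so measured transfer overhead is often
# dominated by launch/serialization latency rather than raw bandwidth.
```

Compare the slope of your measured transfer times against this linear model: a matching slope with a large intercept points at fixed per-transfer overhead, which is one of the capstone questions below.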
4. Capstone Analysis

Design an experiment comparing standard serving vs disaggregated P/D on a mixed workload with both short and long prefills. When does disaggregation help?

Metrics to Collect

Metric | Description | Unit
KV transfer time | Time to send KV cache from prefill to decode GPU | ms
TTFT (P/D vs standard) | First-token latency comparison | ms
ITL (P/D vs standard) | Decode-side ITL without prefill interference | ms
GPU utilization (prefill vs decode) | Compute efficiency of each pool | %

Source Code Reading

  • vllm/distributed/kv_transfer/: P2pNcclConnector, the NCCL-based KV cache transfer between prefill and decode instances
  • sglang/srt/disaggregation/: SGLang's disaggregation mode implementation

Written Analysis (1-2 pages)

  • Plot KV transfer time vs input length. Is the relationship linear? What determines the transfer bandwidth?
  • When does disaggregated P/D improve overall system performance vs standard serving? Consider: prefill-heavy workloads, latency-sensitive decode, GPU utilization.
  • Final capstone: synthesize your findings from all 16 weeks. What are the 3 most impactful optimizations for LLM inference, and why?

Capstone Project Guidelines

Final Report Structure (4-6 pages)

  1. Executive Summary: What are the 3 most impactful inference optimizations you studied? Rank and justify.
  2. System Comparison: Compare vLLM vs SGLang on at least 3 dimensions (latency, throughput, cache effectiveness). Use data from your weekly labs.
  3. Configuration Guide: For a production deployment of Llama-3.1-8B on 2× A100, what is your recommended configuration? (batch size, block size, chunked prefill settings, whether to use APC, etc.)
  4. Future Directions: What optimization opportunities remain? What would you investigate with more time?

Bonus Experiments (Optional)

  • Combine disaggregated P/D with LMCache: does external KV storage improve P/D transfer?
  • Compare disaggregated P/D throughput vs standard serving at various prefill/decode length ratios
  • Try combining speculative decoding with prefix caching — do the benefits stack?
  • Profile disaggregated P/D with nsys to measure KV transfer latency breakdown (serialization, network, deserialization)

Models & Datasets Reference Card

Models

Model | Parameters | Precision | Used In
meta-llama/Llama-3.1-8B-Instruct | 8B | BF16 | Weeks 1-11, 13, 15-16
TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 1.1B | FP16 | Week 11 (draft model)
neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 | 8B | FP8 | Weeks 5, 12
hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 | 8B | INT4 | Week 12
mistralai/Mixtral-8x7B-Instruct-v0.1 | 46.7B | BF16 | Week 14

Datasets

Dataset | Description | Used In
ShareGPT_V3_unfiltered_cleaned_split.json | Real-world multi-turn conversations with natural length distribution | Weeks 1-5, 7-8, 11-14, 16
generated-shared-prefix | Synthetic workload with shared prefixes (built into SGLang bench) | Weeks 6, 9
GSM8K | Grade-school math benchmark for accuracy evaluation (via lm_eval) | Week 12

Key Tools

Tool | Purpose
vllm serve | Launch vLLM OpenAI-compatible server
sglang.launch_server | Launch SGLang server
benchmark_serving.py | Online serving benchmark (vLLM)
bench_serving | Online serving benchmark (SGLang)
benchmark_throughput.py | Offline throughput benchmark (vLLM)
benchmark_latency.py | Single-batch latency profiling (vLLM)
nsys / ncu | NVIDIA Nsight Systems / Nsight Compute GPU profiling
lm_eval | Language model evaluation harness
nvidia-smi dmon | GPU monitoring (power, utilization, clocks, memory)

Recommended Papers

Paper | Venue | Relevant Weeks
Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI 2022 | Week 7
Efficient Memory Management for Large Language Model Serving with PagedAttention | SOSP 2023 | Weeks 4-5
SGLang: Efficient Execution of Structured Language Model Programs | NeurIPS 2024 | Weeks 6, 9
Fast Inference from Transformers via Speculative Decoding | ICML 2023 | Week 11
Splitwise: Efficient generative LLM inference using phase splitting | ISCA 2024 | Week 16
LMCache: Optimizing KV Cache Sharing Across LLM Serving Instances | arXiv 2024 | Week 15

Grading Rubric (per weekly lab)

Component | Weight | Description
Experiment Execution | 30% | All experiments completed, commands run correctly, results captured
Metrics Collection | 20% | All required metrics recorded in tables/plots, units correct
Source Code Reading | 15% | Evidence of reading the specified files, key functions identified and explained
Written Analysis | 30% | Thoughtful answers to analysis questions, supported by data, correct reasoning
Presentation | 5% | Clear formatting, labeled plots, organized report

Weekly Submission Checklist

Architecture Quick Reference

vLLM Architecture

Client Request
    ↓
api_server.py    (FastAPI)
    ↓
AsyncLLM         (async engine)
    ↓
EngineCore       (scheduler + executor)
    ↓
Worker           (model runner on GPU)
    ↓
ModelRunner      (forward pass)

SGLang Architecture

Client Request
    ↓
TokenizerManager (HTTP + tokenize)
    ↓
Scheduler        (RadixCache + policy)
    ↓
TpModelWorker    (TP group)
    ↓
ModelRunner      (forward pass)

NVIDIA A100 80GB Quick Specs

Compute

  • 312 TFLOPS BF16
  • 156 TFLOPS FP32
  • 624 TFLOPS INT8

Memory

  • 80 GB HBM2e
  • 2039 GB/s bandwidth
  • 40 MB L2 cache

Interconnect

  • NVLink: 600 GB/s
  • PCIe Gen4: 64 GB/s
  • TDP: 400W