Hands-On Labs: AI Inference Infrastructure

16-week graduate course — from first serving to disaggregated P/D, with vLLM, SGLang, and LMCache

3-5 hrs/week · 1-2× A100 80GB

Course Philosophy

This course follows a measure-first, understand-second approach. Each week, you will:

  1. Run experiments to observe a phenomenon (e.g., TTFT increases with request rate)
  2. Collect metrics to quantify the effect (e.g., TTFT goes from 50ms to 2000ms)
  3. Read source code to understand the mechanism (e.g., queuing in the scheduler)
  4. Write analysis to synthesize understanding (e.g., explain saturation as a queuing theory problem)

Prerequisites

Table of Contents

  1. Environment Setup Guide
  2. Course Roadmap
  3. Week 1 — First LLM Serving
  4. Week 2 — Offline Throughput & Profiling
  5. Week 3 — Roofline Model
  6. Week 4 — PagedAttention
  7. Week 5 — Prefix Caching (APC)
  8. Week 6 — RadixAttention
  9. Week 7 — Continuous vs Static Batching
  10. Week 8 — Chunked Prefill
  11. Week 9 — Scheduling Policies
  12. Week 10 — CUDA Graphs
  13. Week 11 — Speculative Decoding
  14. Week 12 — Quantization
  15. Week 13 — Tensor Parallelism
  16. Week 14 — Mixture of Experts (MoE)
  17. Week 15 — LMCache
  18. Week 16 — Disaggregated Prefill/Decode
  19. Models & Datasets Reference

Environment Setup Guide

Complete this setup before Week 1. All 16 labs assume these tools, models, and datasets are already available.

Hardware Requirements

Minimum Hardware

  • 1-2× NVIDIA A100 80 GB (or equivalent: H100, L40S)
  • CUDA 12.1+ with cuDNN 8.9+
  • 64 GB system RAM minimum, 128 GB recommended
  • 100 GB free disk for models & datasets
  • Weeks 13-14 require 2× GPUs with NVLink for tensor/expert parallelism

Conda Environment Setup

Create a dedicated conda environment with all required packages:

# Create and activate environment
conda create -n infer python=3.12 -y
conda activate infer

# Install core serving frameworks
pip install vllm
pip install "sglang[all]"

# Install KV cache management and evaluation
pip install lmcache lm-eval

# Verify installations
python -c "import vllm; print(f'vLLM {vllm.__version__}')"
python -c "import sglang; print(f'SGLang {sglang.__version__}')"
python -c "import lmcache; print('LMCache OK')"

Model Downloads

Download all models once; they are reused across the entire semester:

# Primary model — used in most labs
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct

# Small model — used for speculative decoding draft
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Quantized variants — Week 12
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
huggingface-cli download neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8

# MoE model — Week 14 (requires ~100GB disk)
huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1

Dataset Downloads

# ShareGPT conversation dataset — primary benchmark workload
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Verify file
python -c "import json; d=json.load(open('ShareGPT_V3_unfiltered_cleaned_split.json')); print(f'{len(d)} conversations')"
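To understand what the benchmarks will replay, it helps to peek at the dataset's structure. A small sketch, assuming the common ShareGPT V3 schema (each record has a `conversations` list of `{"from": "human"|"gpt", "value": ...}` turns); swap the inline sample for the real file once downloaded:

```python
import json

def summarize_sharegpt(records):
    """Count turns and rough prompt/response sizes per conversation."""
    stats = []
    for rec in records:
        turns = rec.get("conversations", [])
        human_chars = sum(len(t["value"]) for t in turns if t["from"] == "human")
        gpt_chars = sum(len(t["value"]) for t in turns if t["from"] == "gpt")
        stats.append({"id": rec.get("id"), "turns": len(turns),
                      "human_chars": human_chars, "gpt_chars": gpt_chars})
    return stats

if __name__ == "__main__":
    # Real usage: records = json.load(open("ShareGPT_V3_unfiltered_cleaned_split.json"))
    records = [{"id": "demo", "conversations": [
        {"from": "human", "value": "Hi"},
        {"from": "gpt", "value": "Hello!"}]}]
    print(summarize_sharegpt(records))
```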

SLURM Job Script Template

For cluster users — save as run_lab.sbatch and customize per week:

#!/bin/bash
#SBATCH --job-name=infer-lab
#SBATCH --gres=gpu:A100:1
#SBATCH --time=04:00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8

module load cuda/12.4
conda activate infer

# Your experiment commands here

GPU Monitoring Setup

Run this in a separate terminal to collect GPU metrics during experiments:

# Log GPU power, utilization, clocks, memory every 1 second
nvidia-smi dmon -s pucm -d 1 -f gpu_monitor.csv &

# Quick GPU status check
nvidia-smi --query-gpu=name,memory.total,memory.used,utilization.gpu --format=csv
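The dmon log is easy to summarize in a few lines of Python. A sketch that reads column names from dmon's own comment header instead of hard-coding them, since the column set depends on the `-s` flags you pass:

```python
def parse_dmon(lines):
    """Parse `nvidia-smi dmon` output into a list of row dicts.

    dmon prints two comment lines (column names, then units) and repeats
    them periodically; we take the first as the header and skip the rest.
    """
    header, rows = None, []
    for line in lines:
        if line.startswith("#"):
            if header is None:
                header = line[1:].split()   # first comment line = names
            continue
        vals = line.split()
        if header and len(vals) == len(header):
            rows.append(dict(zip(header, vals)))
    return rows

# Tiny inline sample; real usage: parse_dmon(open("gpu_monitor.csv"))
sample = [
    "# gpu    sm   mem",
    "# Idx     %     %",
    "    0    87    62",
    "    0    91    64",
]
rows = parse_dmon(sample)
avg_sm = sum(float(r["sm"]) for r in rows) / len(rows)
print(f"avg SM util: {avg_sm:.1f}%")   # avg SM util: 89.0%
```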

Quick Verification

Launch vLLM and send one request to confirm everything works:

# Terminal 1: Start vLLM server
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

# Terminal 2: Send a test request
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "The capital of France is",
    "max_tokens": 16
  }' | python -m json.tool

# Expected: JSON response with "Paris" in the output

Troubleshooting Common Issues

CUDA Out of Memory

If you see OOM errors, reduce --max-model-len or --gpu-memory-utilization:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85

HuggingFace Token Authentication

Some models (e.g., Llama) require a HuggingFace token. Set it up once:

huggingface-cli login
# Paste your token from https://huggingface.co/settings/tokens

# Or set the environment variable
export HF_TOKEN="hf_your_token_here"

Port Conflicts

If a port is already in use, find and kill the process:

# Find process using port 8000
lsof -i :8000
# Kill it
kill -9 $(lsof -t -i :8000)

Conda Environment Issues

If packages conflict, create a fresh environment. vLLM and SGLang update frequently — pin versions for reproducibility:

pip install vllm==0.8.0
pip install "sglang[all]==0.4.5"

Helper Scripts

Save these utility scripts for use throughout the semester:

# save as: wait_for_server.sh
#!/bin/bash
# Wait for a vLLM/SGLang server to be ready
PORT=${1:-8000}
echo "Waiting for server on port $PORT..."
while ! curl -s http://localhost:$PORT/health > /dev/null 2>&1; do
  sleep 2
done
echo "Server is ready!"

# save as: extract_metrics.py
# Quick script to parse benchmark output files
import re, sys

def parse_benchmark(filepath):
    with open(filepath) as f:
        text = f.read()
    metrics = {}
    for pattern, name in [
        (r'Throughput:\s+([\d.]+)\s+requests/s', 'req/s'),
        (r'Mean TTFT.*?:\s+([\d.]+)\s+ms', 'ttft_mean'),
        (r'P99 TTFT.*?:\s+([\d.]+)\s+ms', 'ttft_p99'),
        (r'Mean ITL.*?:\s+([\d.]+)\s+ms', 'itl_mean'),
        (r'P99 ITL.*?:\s+([\d.]+)\s+ms', 'itl_p99'),
    ]:
        m = re.search(pattern, text)
        if m: metrics[name] = float(m.group(1))
    return metrics

if __name__ == "__main__":
    for f in sys.argv[1:]:
        print(f"{f}: {parse_benchmark(f)}")

Setup Complete! If the test request returned a valid response, your environment is ready. Shut down the server (Ctrl+C) and proceed to the Course Roadmap.

Course Roadmap — 16 Weeks in 5 Phases

The following roadmap organizes all 16 weekly labs into 5 progressive phases. Each card shows the week number, topic, key systems used, and the primary phenomenon you will observe and analyze.

Each phase builds on the previous one. Phase 1 establishes your baseline understanding and measurement toolkit. Phases 2-3 explore memory management and scheduling — the two pillars of serving system design. Phase 4 covers hardware-level optimizations. Phase 5 extends to multi-GPU and production deployments.

Time Commitment: Each weekly lab requires 3-5 hours: ~1 hour for setup, ~1-2 hours for experiments, ~1 hour for source code reading, and ~1 hour for written analysis. Plan GPU time accordingly — batch your experiments to minimize idle GPU time.

PHASE 1 Foundations & Benchmarking (Weeks 1-3)

Week 1

First LLM Serving — vLLM & SGLang basics, online serving, ShareGPT benchmarks, TTFT/ITL metrics

Week 2

Offline Throughput & Profiling — benchmark_throughput, PyTorch Profiler, Perfetto, nvidia-smi dmon, call graphs

Week 3

Roofline Model — nsys/ncu profiling, A100 peak analysis, memory-bound vs compute-bound, decode latency theory

PHASE 2 Memory & KV Cache (Weeks 4-6)

Week 4

PagedAttention — vLLM vs HuggingFace baseline, max-num-seqs sweep, block-size tuning, KV cache manager

Week 5

Prefix Caching (APC) — enable-prefix-caching toggle, FP8 KV cache, shared prefix sweeps, hash_block_tokens

Week 6

RadixAttention — SGLang radix cache, few-shot workloads, LPM vs FCFS, sgl.function DSL, radix_cache.py

PHASE 3 Scheduling & Batching (Weeks 7-9)

Week 7

Continuous vs Static Batching — HuggingFace static baseline, output-length variance, scheduler.py, Orca paper

Week 8

Chunked Prefill — HOL blocking, max-num-batched-tokens sweep, ITL timeseries analysis

Week 9

Scheduling Policies — FCFS/LPM/DFS-weight, cache-aware DP routing, sglang_router

PHASE 4 Optimizations (Weeks 10-12)

Week 10

CUDA Graphs — enforce-eager vs captured graphs, warmup phase, memory overhead, CudaGraphRunner

Week 11

Speculative Decoding — draft model, N-gram, EAGLE, acceptance rate vs num_speculative_tokens

Week 12

Quantization — FP16/FP8/AWQ-INT4/GPTQ-INT4, lm_eval GSM8K accuracy, Pareto frontier

PHASE 5 Multi-GPU & System Integration (Weeks 13-16)

Week 13

Tensor Parallelism — TP=1,2,4, NCCL all-reduce profiling, NVLink bandwidth

Week 14

MoE — Mixtral-8x7B, expert routing, memory vs compute, expert parallelism

Week 15

LMCache — cross-restart persistence, tiered storage GPU→CPU→disk, chunk_size sweep

Week 16

Disaggregated P/D — vLLM P2pNcclConnector, SGLang disaggregation-mode, KV transfer analysis, capstone

Week 1 — First LLM Serving

Learning Objectives

  • Launch vLLM and SGLang servers and send requests via curl
  • Run benchmark_serving.py (vLLM) and bench_serving (SGLang) with the ShareGPT dataset
  • Understand TTFT, ITL (inter-token latency), throughput, and request rate
  • Compare vLLM and SGLang on identical workloads

Setup & Configuration

# Start vLLM server (Terminal 1)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --disable-log-requests

# Start SGLang server (Terminal 2)
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8001

Experiments

1. Manual curl Test

Send individual requests to both servers and observe the response structure:

# vLLM completions API
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"Explain transformers in 3 sentences.","max_tokens":128}'

# SGLang completions API
curl http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"Explain transformers in 3 sentences.","max_tokens":128}'
2. vLLM Benchmark — Varying Request Rates

Run benchmark_serving.py at request rates 1, 4, 10, and inf (closed-loop):

for rate in 1 4 10 inf; do
  # assumes the vLLM server from Setup & Configuration is already running on port 8000
  python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 \
    --request-rate $rate \
    --port 8000 \
    2>&1 | tee results_vllm_rr${rate}.txt
done
3. SGLang Benchmark — Same Workload

for rate in 1 4 10 inf; do
  python -m sglang.bench_serving \
    --backend sglang \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 \
    --request-rate $rate \
    --port 8001 \
    2>&1 | tee results_sglang_rr${rate}.txt
done
4. Collect and Compare Results

Extract TTFT, ITL, throughput from each result file. Plot request rate vs. median TTFT and median ITL for both systems.
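A minimal parse-and-plot helper for this step, assuming the benchmark prints lines like `Median TTFT (ms): 123.45` (adjust the regex to whatever your benchmark version actually emits):

```python
import re

def ttft_by_rate(results):
    """Map {request_rate: benchmark_output_text} -> sorted (rate, ttft) points."""
    points = []
    for rate, text in results.items():
        m = re.search(r"Median TTFT.*?([\d.]+)", text)
        if m:
            points.append((rate, float(m.group(1))))
    return sorted(points)

if __name__ == "__main__":
    # Placeholder texts; real usage reads results_vllm_rr*.txt files
    fake = {1: "Median TTFT (ms): 52.1", 4: "Median TTFT (ms): 95.7",
            10: "Median TTFT (ms): 810.0"}
    print(ttft_by_rate(fake))
    # To plot: xs, ys = zip(*ttft_by_rate(fake))
    # import matplotlib.pyplot as plt
    # plt.plot(xs, ys, marker="o"); plt.xlabel("request rate"); plt.ylabel("median TTFT (ms)")
```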

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| TTFT (p50, p99) | Time to first token | ms |
| ITL (p50, p99) | Inter-token latency | ms |
| Throughput | Output tokens per second | tok/s |
| Request latency (p50, p99) | End-to-end per-request time | ms |

Source Code Reading

  • vllm/entrypoints/openai/api_server.py — How the OpenAI-compatible server starts and handles requests
  • benchmarks/benchmark_serving.py — How requests are generated with Poisson arrival at different rates
  • sglang/bench_serving.py — SGLang equivalent benchmark harness

Written Analysis (1-2 pages)

  • How does TTFT change as request rate increases from 1 to inf? Why?
  • At what request rate does the system become saturated? What evidence supports this?
  • Compare vLLM vs SGLang: which has better TTFT? Better throughput? Hypothesize why.
Understanding Request Rate: Request rate controls how fast requests arrive (Poisson process). At rate=1, one request arrives per second on average. At rate=inf, all requests are sent immediately (closed-loop benchmark). As rate increases, the server queue grows, TTFT increases due to queuing delay, but throughput may also increase due to better batching. The saturation point is where TTFT begins to diverge — the server can no longer keep up with arrivals.
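The arrival process described above is easy to reproduce offline. A sketch of Poisson arrival generation that mirrors the idea (not the exact code) used by the benchmark harness, with `rate=inf` degenerating to the closed-loop case:

```python
import random

def poisson_arrivals(rate, n, seed=0):
    """n arrival timestamps for a Poisson process at `rate` requests/sec:
    inter-arrival gaps are exponentially distributed with mean 1/rate.
    rate=inf sends everything at t=0 (closed-loop benchmark)."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        if rate != float("inf"):
            t += rng.expovariate(rate)
        times.append(t)
    return times

for rate in (1, 4, float("inf")):
    times = poisson_arrivals(rate, 200)
    print(f"rate={rate}: 200 requests over {times[-1] - times[0]:.1f}s")
```

At rate=4 the 200 requests span roughly 50 seconds on average; the server only saturates when its service rate falls below the arrival rate.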

Streaming Response Test

Test streaming mode to observe tokens arriving one at a time:

# vLLM streaming
curl -N http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Count from 1 to 20:",
    "max_tokens": 64,
    "stream": true
  }'

# SGLang streaming
curl -N http://localhost:8001/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "prompt": "Count from 1 to 20:",
    "max_tokens": 64,
    "stream": true
  }'

Week 2 — Offline Throughput & Profiling

Learning Objectives

  • Measure offline (batched) throughput with benchmark_throughput.py and bench_one_batch
  • Profile a single forward pass with benchmark_latency.py
  • Capture and read PyTorch Profiler traces in Perfetto
  • Trace the call graph: AsyncLLM → EngineCore → Worker

Setup & Configuration

# Ensure GPU monitoring is running
nvidia-smi dmon -s pucm -d 1 -f gpu_monitor_w2.csv &

Experiments

1. Offline Throughput — vLLM

python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 500 \
  --output-json throughput_vllm.json
2. Offline Throughput — SGLang

python -m sglang.bench_one_batch \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --batch-size 32 64 128 \
  --input-len 512 \
  --output-len 256
3. Single-Request Latency Profiling

python benchmarks/benchmark_latency.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 512 \
  --output-len 128 \
  --batch-size 1 \
  --num-iters 10
4. PyTorch Profiler + Perfetto Trace

Enable the PyTorch profiler and export a trace for visualization in Perfetto UI:

# Run vLLM with profiling enabled
VLLM_TORCH_PROFILER_DIR=./profiles vllm serve \
  meta-llama/Llama-3.1-8B-Instruct --port 8000

# Send a few requests, then open the trace:
# Go to https://ui.perfetto.dev and load the .json trace file
5. GPU Monitoring Analysis

Analyze the gpu_monitor_w2.csv to observe GPU utilization patterns during offline vs. online workloads.

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| Offline throughput | Total tokens/sec in batched mode | tok/s |
| Per-token latency | Average time per output token (batch=1) | ms |
| GPU utilization | From nvidia-smi dmon | % |
| Kernel time breakdown | Attention vs. MLP vs. other from Perfetto trace | % |

Source Code Reading

  • vllm/engine/async_llm_engine.py — AsyncLLM: the top-level async engine that manages request queues
  • vllm/v1/engine/core.py — EngineCore: the synchronous core that runs scheduling + model execution
  • vllm/worker/worker.py — Worker: GPU-side model runner and KV cache management

Written Analysis (1-2 pages)

  • What percentage of forward-pass time is spent in attention vs. MLP layers? Does this match theory?
  • How does offline throughput compare to online throughput at high request rates from Week 1?
  • Draw the AsyncLLM → EngineCore → Worker call graph and annotate which thread/process each runs in.
Perfetto Trace Reading Guide: When viewing the trace in Perfetto UI:
• The top rows show CPU threads (Python async loop, tokenizer, etc.)
• Lower rows show CUDA stream activity on the GPU
• Look for the repeating pattern: scheduler → model_execute → sampler
• Each decode step should be visible as a cluster of CUDA kernels
• Zoom in on a single decode step to see: attention kernels, MLP GEMMs, RMSNorm, etc.
• Note the gaps between kernels — these are CPU-side launch overhead (eliminated by CUDA graphs in Week 10)
Related Deep Dive: vLLM EngineCore Internals

Week 3 — Roofline Model

Learning Objectives

  • Profile GPU kernels using nsys profile and ncu for roofline analysis
  • Understand A100 specs: 312 TFLOPS BF16, 2 TB/s HBM bandwidth
  • Calculate theoretical decode latency: 16 GB model / 2 TB/s = 8 ms
  • Identify which kernels are memory-bound vs compute-bound

Setup & Configuration

# Install nsight systems and nsight compute (usually comes with CUDA toolkit)
which nsys && which ncu
# If not found: module load nsight-systems nsight-compute

Experiments

1. Nsys Profile — Full Serving Trace

nsys profile -o llm_trace --trace=cuda,nvtx \
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --batch-size 1 --input-len 512 --output-len 64 --num-iters 3
2. NCU Roofline — Per-Kernel Analysis

ncu --set roofline --target-processes all \
  -o roofline_report \
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --batch-size 1 --input-len 128 --output-len 16 --num-iters 1
3. Theoretical Decode Latency Calculation

For Llama-3.1-8B in BF16: model size ≈ 16 GB. A100 HBM bandwidth = 2 TB/s. Theoretical minimum decode step = 16 GB / 2 TB/s = 8 ms. Compare against your measured per-token latency from the nsys trace.
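The calculation generalizes to any model and GPU. A small helper; the 12.5 ms measured value below is a placeholder to substitute with your own nsys number:

```python
def decode_floor_ms(params_billion, bytes_per_param, hbm_tb_per_s):
    """Lower bound on one decode step: every weight byte must cross HBM
    once per token. Ignores KV cache and activation traffic (optimistic)."""
    model_gb = params_billion * bytes_per_param          # e.g. 8 * 2 = 16 GB
    return model_gb / (hbm_tb_per_s * 1000) * 1000.0     # GB / (GB/s) -> ms

floor = decode_floor_ms(8, 2, 2.0)                       # Llama-3.1-8B BF16, A100
print(f"theoretical floor: {floor:.1f} ms")              # theoretical floor: 8.0 ms

measured_ms = 12.5   # placeholder: substitute your measured per-token latency
print(f"HBM utilization: {100 * floor / measured_ms:.0f}%")   # 64%
```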

4

Prefill vs Decode Roofline

Open the NCU roofline report. Identify where attention kernels and GEMM kernels fall. Prefill GEMMs should be near the compute roof; decode attention should be near the memory roof.

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| Measured decode latency | Actual per-token time from nsys | ms |
| Theoretical decode latency | model_size / HBM_bandwidth | ms |
| Arithmetic intensity | FLOPs / bytes for key kernels | FLOP/B |
| HBM utilization | Fraction of peak bandwidth achieved | % |

Source Code Reading

  • vllm/model_executor/models/llama.py — Trace the forward() method to see which operations map to which GPU kernels
  • Identify: QKV projection → FlashAttention → output projection → gate/up projection → SiLU → down projection

Written Analysis (1-2 pages)

  • What is the ratio of measured decode latency to theoretical minimum? What accounts for the gap?
  • Draw a roofline diagram with your measured kernel data points. Label which are memory-bound and compute-bound.
  • How does arithmetic intensity change from batch=1 to batch=32 for the same kernels?
Tip: NCU roofline profiling is very slow — it instruments every kernel. Use --num-iters 1 and short sequences. A full roofline run may take 10-30 minutes.

Batch Size Scaling Experiment

Run the same model at batch sizes 1, 4, 16, 64 and observe how arithmetic intensity shifts kernels from memory-bound to compute-bound:

for bs in 1 4 16 64; do
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --batch-size $bs --input-len 256 --output-len 64 \
    2>&1 | tee results_roofline_bs${bs}.txt
done

Key Formulas

  • Arithmetic Intensity (AI) = FLOPs / Bytes transferred
  • Ridge Point = Peak TFLOPS / Peak Bandwidth = 312 TFLOPS / 2 TB/s = 156 FLOP/Byte for A100
  • Decode (batch=1): AI ≈ 1 FLOP/Byte → deep in memory-bound territory
  • Prefill (seq=2048): AI ≈ 2048 FLOP/Byte → compute-bound for GEMMs
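These formulas can be checked numerically for a single GEMM. A sketch counting FLOPs and bytes for `C[m,n] = A[m,k] @ B[k,n]`; the hidden size 4096 is illustrative, and counting all three matrices (not just weights) is why the prefill AI comes out lower than the weights-only estimate above:

```python
def gemm_ai(m, n, k, bytes_per_el=2):
    """Arithmetic intensity of C[m,n] = A[m,k] @ B[k,n] in BF16.
    FLOPs = 2*m*n*k; bytes counts A, B, and C each moved once."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

RIDGE = 312e12 / 2e12          # A100: 312 TFLOPS / 2 TB/s = 156 FLOP/B

for m in (1, 2048):            # decode (batch=1) vs prefill (2048-token chunk)
    ai = gemm_ai(m, 4096, 4096)
    bound = "compute" if ai > RIDGE else "memory"
    print(f"m={m}: AI={ai:.1f} FLOP/B -> {bound}-bound")
# m=1:    AI=1.0    -> memory-bound
# m=2048: AI=1024.0 -> compute-bound
```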

Week 4 — PagedAttention

Learning Objectives

  • Understand how PagedAttention eliminates KV cache fragmentation via block-level memory management
  • Compare vLLM PagedAttention throughput against a HuggingFace generate() baseline
  • Sweep --max-num-seqs and --block-size to observe memory utilization changes

Setup & Configuration

# HuggingFace baseline script (save as hf_baseline.py)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, time

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

prompts = ["Write a short story about AI."] * 16
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start
total_tokens = (outputs.shape[1] - inputs["input_ids"].shape[1]) * len(prompts)
print(f"HF throughput: {total_tokens/elapsed:.1f} tok/s")

Experiments

1. HuggingFace Baseline

python hf_baseline.py
2. vLLM — max-num-seqs Sweep

for seqs in 1 4 16 64 256; do
  python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-num-seqs $seqs \
    --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 300 \
    2>&1 | tee results_paged_seqs${seqs}.txt
done
3. Block Size Sweep

for bs in 8 16 32; do
  python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --block-size $bs \
    --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 300 \
    2>&1 | tee results_paged_bs${bs}.txt
done
4. Memory Utilization Monitoring

During each run, log GPU memory usage. Note the KV cache block allocation reported in vLLM logs (e.g., '# GPU blocks: 1234').

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| HF baseline throughput | HuggingFace generate() throughput | tok/s |
| vLLM throughput per max-num-seqs | Throughput at each concurrency level | tok/s |
| GPU blocks allocated | KV cache blocks at each block-size | blocks |
| Memory waste | Internal fragmentation at each block-size | % |

Source Code Reading

  • vllm/v1/core/kv_cache_manager.py — How blocks are allocated, freed, and managed for PagedAttention
  • vllm/v1/core/kv_cache_utils.py — FreeKVCacheBlockQueue: the free list data structure

Written Analysis (1-2 pages)

  • What is the throughput improvement of vLLM over HuggingFace? What causes this gap?
  • How does block-size affect internal fragmentation vs. metadata overhead? What is the optimal block size?
  • Explain how PagedAttention's block table maps virtual sequence positions to physical GPU memory blocks.
Background — How PagedAttention Works: Traditional KV cache allocates a contiguous buffer per sequence for the maximum possible length. This wastes memory when sequences are shorter than max. PagedAttention divides KV cache into fixed-size blocks (like OS virtual memory pages). A block table maps logical positions to physical blocks. Blocks are allocated on demand and freed immediately when a sequence finishes, eliminating both external fragmentation (gaps between sequences) and most internal fragmentation (only the last block of each sequence may have unused slots).
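The block-table idea above can be sketched in a few lines. This is a toy model for intuition, not vLLM's actual data structure:

```python
class BlockTable:
    """Toy PagedAttention block table: maps logical token positions of one
    sequence to physical KV cache blocks drawn from a shared free list."""
    def __init__(self, block_size, free_blocks):
        self.block_size = block_size
        self.free = free_blocks     # shared pool of physical block ids
        self.blocks = []            # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % self.block_size == 0:
            self.blocks.append(self.free.pop())   # allocate on demand
        self.num_tokens += 1

    def physical_slot(self, pos):
        """Translate a logical token position to (physical block, offset)."""
        return self.blocks[pos // self.block_size], pos % self.block_size

    def release(self):
        self.free.extend(self.blocks)   # blocks reusable immediately on finish
        self.blocks.clear()
        self.num_tokens = 0

free = list(range(100))
seq = BlockTable(block_size=16, free_blocks=free)
for _ in range(40):
    seq.append_token()
print(len(seq.blocks))         # 3 blocks for 40 tokens (only last is partial)
print(seq.physical_slot(33))   # (97, 1)
```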

Understanding Block Size Tradeoffs

Smaller Blocks (e.g., 8)

  • Less internal fragmentation (avg waste = block_size/2 = 4 tokens)
  • More metadata overhead (more block table entries)
  • More kernel launch overhead for attention

Larger Blocks (e.g., 32)

  • More internal fragmentation (avg waste = 16 tokens)
  • Less metadata, fewer block table entries
  • Better GPU efficiency per attention kernel
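You can estimate these tradeoffs before running anything. A sketch computing blocks used and wasted slots for a hypothetical set of sequence lengths (the lengths below are made up for illustration):

```python
def frag_stats(seq_lens, block_size):
    """Blocks needed and % of slots lost to internal fragmentation."""
    total_tokens = sum(seq_lens)
    blocks = sum(-(-l // block_size) for l in seq_lens)   # ceil division
    slots = blocks * block_size
    waste_pct = 100 * (slots - total_tokens) / slots
    return blocks, waste_pct

lens = [37, 210, 993, 64, 1500]   # hypothetical sequence lengths
for bs in (8, 16, 32):
    blocks, waste = frag_stats(lens, bs)
    print(f"block_size={bs}: {blocks} blocks, {waste:.1f}% slots wasted")
```

Waste grows with block size while the number of block-table entries shrinks, which is exactly the tradeoff sketched in the two columns above.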
Related Deep Dive: vLLM Attention & KV Cache

Week 5 — Prefix Caching (APC)

Learning Objectives

  • Understand Automatic Prefix Caching (APC) and how it reuses KV cache blocks across requests with shared prefixes
  • Measure TTFT improvement from prefix caching on workloads with shared system prompts
  • Experiment with FP8 KV cache to increase effective cache capacity

Setup & Configuration

# Launch vLLM with APC enabled
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching --port 8000

# Launch vLLM WITHOUT APC (baseline)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --no-enable-prefix-caching --port 8002

Experiments

1. APC vs No-APC Comparison

# With APC (port 8000)
python benchmarks/benchmark_prefix_caching.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 --port 8000 \
  2>&1 | tee results_apc_on.txt

# Without APC (port 8002)
python benchmarks/benchmark_prefix_caching.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 --port 8002 \
  2>&1 | tee results_apc_off.txt
2. Shared Prefix Length Sweep

for prefix_len in 0 256 512 1024 2048; do
  python benchmarks/benchmark_prefix_caching.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --shared-prefix-len $prefix_len \
    --num-prompts 200 --port 8000 \
    2>&1 | tee results_apc_prefix${prefix_len}.txt
done
3. FP8 KV Cache

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --kv-cache-dtype fp8 \
  --port 8003

# Repeat prefix caching benchmark on port 8003
python benchmarks/benchmark_prefix_caching.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 --port 8003 \
  2>&1 | tee results_apc_fp8.txt

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| TTFT (APC on vs off) | Time to first token with/without prefix caching | ms |
| Cache hit rate | Fraction of prefix blocks reused | % |
| TTFT vs prefix length | How TTFT scales with shared prefix length | ms |
| GPU blocks (FP16 vs FP8) | Total cache blocks available under each dtype | blocks |

Source Code Reading

  • vllm/v1/core/kv_cache_utils.py — hash_block_tokens(): how block content is hashed for prefix matching
  • vllm/v1/core/kv_cache_manager.py — Look for prefix caching logic: how cached blocks are found and reused
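The chained hashing behind prefix matching can be sketched as follows. This mirrors the idea of hash_block_tokens() (each block's identity folds in its parent's hash, so identical token blocks under different prefixes never collide) but simplifies the details:

```python
import hashlib

def block_hash(prev_hash, block_tokens):
    # Mix in the parent hash so a block's identity depends on its
    # entire prefix, not just its own tokens (simplified sketch).
    h = hashlib.sha256()
    h.update(str(prev_hash).encode())
    h.update(",".join(map(str, block_tokens)).encode())
    return h.hexdigest()

def hash_sequence(tokens, block_size=16):
    hashes, prev = [], ""
    full = len(tokens) - len(tokens) % block_size
    for i in range(0, full, block_size):   # only full blocks are cacheable
        prev = block_hash(prev, tokens[i:i + block_size])
        hashes.append(prev)
    return hashes

a = hash_sequence(list(range(40)))                # blocks [0..15], [16..31]
b = hash_sequence(list(range(16)) + [999] * 24)   # same first block, then diverges
print(a[0] == b[0], a[1] == b[1])                 # True False
```

A lookup table keyed by these hashes is all the cache needs: a new request walks its own block hashes left to right and reuses cached blocks until the first miss.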

Written Analysis (1-2 pages)

  • How much TTFT improvement does APC provide at each shared prefix length? Plot the relationship.
  • Does FP8 KV cache degrade output quality? Design an experiment to test this.
  • Explain the hash-based prefix matching mechanism. What are its limitations?
Real-World Use Case — System Prompts: In production, many applications share a long system prompt (e.g., 'You are a helpful assistant that...') across all user requests. Without APC, every request recomputes KV cache for this shared prefix. With APC enabled, the first request computes and caches the system prompt, and all subsequent requests reuse it — reducing TTFT from hundreds of milliseconds to near-zero for the shared portion. This is especially impactful for RAG applications where the retrieved context is often repeated.

Week 6 — RadixAttention

Learning Objectives

  • Understand SGLang's RadixAttention — a radix tree for token-level KV cache reuse
  • Compare RadixAttention (SGLang) vs APC (vLLM) on few-shot workloads
  • Experiment with LPM (Longest Prefix Match) vs FCFS scheduling
  • Write a multi-turn program with sgl.function DSL

Setup & Configuration

# SGLang with radix cache enabled (default)
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --port 8001

# SGLang with radix cache disabled
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --disable-radix-cache \
  --port 8004

Experiments

1. Radix Cache On vs Off

# Few-shot workload benchmark
python -m sglang.bench_serving \
  --backend sglang --port 8001 \
  --dataset-name generated-shared-prefix \
  --num-prompts 200 --request-rate 4 \
  2>&1 | tee results_radix_on.txt

python -m sglang.bench_serving \
  --backend sglang --port 8004 \
  --dataset-name generated-shared-prefix \
  --num-prompts 200 --request-rate 4 \
  2>&1 | tee results_radix_off.txt
2. LPM vs FCFS Scheduling

SGLang uses Longest Prefix Match scheduling by default. Compare against FCFS by toggling the scheduling policy:

# LPM scheduling (default)
python -m sglang.bench_serving \
  --backend sglang --port 8001 \
  --dataset-name generated-shared-prefix \
  --num-prompts 300 --request-rate 8 \
  2>&1 | tee results_lpm.txt

# FCFS scheduling
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --schedule-policy fcfs --port 8005
python -m sglang.bench_serving \
  --backend sglang --port 8005 \
  --dataset-name generated-shared-prefix \
  --num-prompts 300 --request-rate 8 \
  2>&1 | tee results_fcfs.txt
3. sgl.function DSL Example

import sglang as sgl

@sgl.function
def multi_turn(s, question1, question2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question1)
    s += sgl.assistant(sgl.gen("answer1", max_tokens=256))
    s += sgl.user(question2)
    s += sgl.assistant(sgl.gen("answer2", max_tokens=256))

state = multi_turn.run(
    question1="What is PagedAttention?",
    question2="How does it compare to RadixAttention?")
print(state["answer1"])
print(state["answer2"])

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| TTFT (radix on/off) | First token time with/without radix cache | ms |
| Cache hit rate | Prefix reuse ratio on few-shot workload | % |
| TTFT (LPM vs FCFS) | Scheduling policy effect on first-token time | ms |

Source Code Reading

  • sglang/srt/mem_cache/radix_cache.py — match_prefix(): how the radix tree performs longest-prefix matching at token granularity
  • sglang/srt/managers/schedule_batch.py — LPM vs FCFS scheduling logic

Written Analysis (1-2 pages)

  • Compare RadixAttention's token-level matching vs APC's block-level hashing. When does each approach win?
  • Why does LPM scheduling improve throughput on shared-prefix workloads? What happens on non-shared workloads?
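For intuition on token-level matching, here is a toy trie with the same interface idea as match_prefix(). SGLang's real radix cache adds edge compression, LRU eviction, and reference counting; this sketch only shows why token granularity beats block granularity on partial overlaps:

```python
class TokenTrie:
    """Token-level prefix index (simplified stand-in for a radix cache)."""
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def match_prefix(self, tokens):
        """Length of the longest cached prefix of `tokens`."""
        node, n = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, n = node[t], n + 1
        return n

trie = TokenTrie()
trie.insert([1, 2, 3, 4, 5])
print(trie.match_prefix([1, 2, 3, 9]))   # 3
```

A block-level cache with block size 16 would reuse nothing here, since the shared prefix never fills a complete block; token-level matching recovers all 3 shared tokens.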

Week 7 — Continuous vs Static Batching

Learning Objectives

  • Understand Orca-style continuous batching: new requests enter mid-batch as others finish
  • Measure the throughput loss of static (HuggingFace) batching when output lengths vary
  • Read the vLLM scheduler to understand how requests are added/removed from running batch

Setup & Configuration

# Prepare a workload with high output-length variance
# ShareGPT naturally has variance; we can also create synthetic workloads
python -c "
import json, random
data = []
for i in range(200):
    out_len = random.choice([16, 32, 64, 256, 512])
    data.append({'input': 'Tell me about AI.', 'output_len': out_len})
json.dump(data, open('variable_output.json','w'))
"

Experiments

1. HuggingFace Static Batching Baseline

Use the HuggingFace generate() script from Week 4 with varying output lengths. Observe that all sequences must wait for the longest one to finish.

2. vLLM Continuous Batching — Uniform Outputs

python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-len 256 --output-len 128 \
  --num-prompts 300 \
  2>&1 | tee results_cb_uniform.txt
3. vLLM Continuous Batching — Variable Outputs

python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 300 \
  2>&1 | tee results_cb_variable.txt
4. Batch Occupancy Analysis

Enable vLLM's --log-stats and observe how the running batch size fluctuates over time with continuous batching. With static batching, the batch size stays constant (wasting GPU cycles on padding).

Metrics to Collect

| Metric | Description | Unit |
| --- | --- | --- |
| Static batch throughput | HF generate() with padding | tok/s |
| Continuous batch throughput | vLLM on same workload | tok/s |
| Avg batch occupancy | Mean running sequences over time | seqs |

Source Code Reading

  • vllm/v1/core/scheduler.py — schedule(): how requests are added to and removed from the running batch each step
  • Reference: Orca (Yu et al., OSDI 2022) — the foundational continuous batching paper

Written Analysis (1-2 pages)

  • Quantify the throughput improvement of continuous over static batching. How does output-length variance affect the gap?
  • Explain iteration-level vs request-level scheduling. What are the tradeoffs?
Tip — Creating a Fair Comparison: For a proper static vs continuous comparison, use the same prompts, same model, and same GPU. The key variable is the batching strategy. Use a workload with high variance in output lengths (e.g., some requests want 16 tokens, others want 512) — this is where continuous batching shines.
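Before burning GPU time, you can predict the gap with a back-of-the-envelope simulation. This idealized model ignores prefill cost and memory limits; its point is only that static batching pays the maximum output length per batch while continuous batching pays (roughly) the sum of output lengths:

```python
import random

def static_steps(out_lens, batch):
    """Static batching: each batch runs until its longest sequence finishes."""
    return sum(max(out_lens[i:i + batch]) for i in range(0, len(out_lens), batch))

def continuous_steps(out_lens, batch):
    """Continuous batching (idealized): a finished sequence's slot is
    refilled on the next step, so every step emits `batch` useful tokens."""
    total = sum(out_lens)
    return -(-total // batch)   # ceil division

random.seed(0)
lens = [random.choice([16, 32, 64, 256, 512]) for _ in range(200)]
s, c = static_steps(lens, 16), continuous_steps(lens, 16)
print(f"static: {s} steps, continuous: {c} steps, ratio: {s / c:.2f}")
```

With uniform output lengths the two converge; the ratio grows with output-length variance, which is why the ShareGPT workload shows a larger gap than the fixed-length run.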

Detailed Timing Breakdown

With static batching (batch=16, max_output=512), every sequence holds its slot until the longest one in the batch emits all 512 tokens; continuous batching instead refills freed slots each iteration. Watch the running-batch size fluctuate in the stats log:

# Visualize batch occupancy over time (from vLLM stats logs)
# Look for lines like: "Avg running: 42.3, Avg waiting: 12.1"
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests \
  --port 8000 2>&1 | grep "running"
Related Deep Dive: vLLM Scheduler Internals

Week 8 — Chunked Prefill

Learning Objectives

  • Understand head-of-line (HOL) blocking: a long prefill delays all decode tokens in the batch
  • Measure ITL spikes caused by large prefills and how chunked prefill mitigates them
  • Sweep --max-num-batched-tokens to find the optimal chunk size

Setup & Configuration

# Without chunked prefill
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --no-enable-chunked-prefill --port 8000

# With chunked prefill (default in recent vLLM)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill --port 8002

# Aggressive chunking
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-chunked-prefill --max-num-batched-tokens 512 --port 8003

Experiments

1. HOL Blocking Demonstration

Create a workload mixing long prefills (2048+ tokens) with short streaming requests. Without chunked prefill, observe ITL spikes in the short requests.

python benchmarks/benchmark_serving.py \
  --backend vllm --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100 --request-rate 4 \
  2>&1 | tee results_no_chunk.txt
2. Chunked Prefill — Default

python benchmarks/benchmark_serving.py \
  --backend vllm --port 8002 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100 --request-rate 4 \
  2>&1 | tee results_chunk_default.txt
3. max-num-batched-tokens Sweep

for tokens in 256 512 1024 2048 4096; do
  vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-chunked-prefill \
    --max-num-batched-tokens $tokens \
    --port 8010 &
  sleep 30  # wait for server startup
  python benchmarks/benchmark_serving.py \
    --backend vllm --port 8010 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 100 --request-rate 4 \
    2>&1 | tee results_chunk_${tokens}.txt
  kill %1
done
4. ITL Timeseries Analysis

Plot the per-token ITL over time for each configuration. Without chunked prefill, you should see periodic spikes. With chunked prefill, ITL should be smoother.

Metrics to Collect

Metric | Description | Unit
ITL p50, p99 | Inter-token latency percentiles | ms
TTFT p50, p99 | Time to first token (may increase with chunking) | ms
ITL p99/p50 ratio | Measure of ITL consistency (lower is better) | ratio

Source Code Reading

  • vllm/v1/core/scheduler.py: how the scheduler splits a long prefill into chunks and interleaves decode tokens

Written Analysis (1-2 pages)

  • What is the TTFT vs ITL tradeoff of chunked prefill? Smaller chunks reduce ITL spikes but increase TTFT — why?
  • What is the optimal max-num-batched-tokens for your hardware and workload? Justify with data.
HOL Blocking Visualized: Imagine a batch with 30 decode requests generating tokens smoothly at 10ms each. A new request arrives with a 4096-token prefill. Without chunked prefill, the entire batch pauses for ~200ms while the prefill computes — every decode user sees a ~200ms stutter in their stream. With chunked prefill (max_batched_tokens=512), the prefill is split into 8 chunks of 512 tokens each. Each chunk takes ~25ms and is interleaved with decode steps, reducing the maximum stutter to ~25ms.
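The arithmetic in that scenario is worth writing down once, since you will repeat it for your own measured prefill times:

```python
# Back-of-envelope stutter calculation from the scenario above.
prefill_tokens = 4096
prefill_time_ms = 200.0       # time to prefill 4096 tokens in one shot
max_batched_tokens = 512      # chunk size

chunks = prefill_tokens // max_batched_tokens        # 8 chunks
stutter_per_chunk_ms = prefill_time_ms / chunks      # 25 ms per chunk
print(chunks, stutter_per_chunk_ms)

# Decode users now see eight ~25 ms pauses spread across many steps
# instead of one 200 ms freeze, which is exactly what flattens the
# p99 ITL in your timeseries plot.
```

Re-run this with your measured prefill time and each swept chunk size to predict the ITL spike height before you plot it.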

Week 9 — Scheduling Policies

Learning Objectives

  • Compare FCFS, LPM, and DFS-weight scheduling policies in SGLang
  • Understand cache-aware data-parallel (DP) routing with sglang_router
  • Measure the impact of scheduling policy on cache hit rate and throughput

Experiments

1. FCFS vs LPM vs DFS-weight

for policy in fcfs lpm dfs-weight; do
  python -m sglang.launch_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --schedule-policy $policy \
    --port 8001 &
  sleep 30
  python -m sglang.bench_serving \
    --backend sglang --port 8001 \
    --dataset-name generated-shared-prefix \
    --num-prompts 300 --request-rate 8 \
    2>&1 | tee results_sched_${policy}.txt
  kill %1; sleep 5
done
2. Cache-Aware DP Routing

If you have 2 GPUs, launch SGLang with data parallelism and the cache-aware router:

python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dp 2 --port 8001

# The sglang_router automatically routes requests to maximize cache hits
python -m sglang.bench_serving \
  --backend sglang --port 8001 \
  --dataset-name generated-shared-prefix \
  --num-prompts 500 --request-rate 10 \
  2>&1 | tee results_dp_router.txt

Metrics to Collect

Metric | Description | Unit
Throughput per policy | FCFS vs LPM vs DFS-weight | tok/s
Cache hit rate per policy | Prefix reuse on shared-prefix workload | %
TTFT p50, p99 | First-token latency under each policy | ms

Source Code Reading

  • sglang/srt/managers/scheduler.py: scheduling policy implementations (FCFS, LPM, DFS-weight)
  • sglang/srt/router/sglang_router: cache-aware DP request routing

Written Analysis (1-2 pages)

  • Which scheduling policy achieves the highest cache hit rate? Why?
  • How does cache-aware DP routing improve throughput over random routing? What workloads benefit most?
Scheduling Policy Explained: FCFS — First Come First Served. Processes requests in arrival order. Simple but ignores cache locality.
LPM — Longest Prefix Match. Prioritizes requests whose prefix is already cached, maximizing cache hits.
DFS-weight — Depth-First Search with weight. Biases toward completing requests from the same prefix subtree, improving locality.
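The LPM idea fits in a few lines. This is a minimal sketch, not SGLang's implementation (which matches against a radix tree rather than a flat list of cached prefixes):

```python
def shared_prefix_len(a, b):
    """Length of the common prefix of two token-id sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def lpm_pick(waiting, cached_prefixes):
    """Pick the waiting request whose prompt shares the longest prefix
    with something already cached, so its prefill skips the most work."""
    def score(req):
        return max((shared_prefix_len(req, p) for p in cached_prefixes),
                   default=0)
    return max(waiting, key=score)

cached = [[1, 2, 3, 4, 5]]
waiting = [[9, 9, 9], [1, 2, 3, 7], [1, 2, 3, 4, 5, 6]]
print(lpm_pick(waiting, cached))  # -> [1, 2, 3, 4, 5, 6]
```

FCFS would serve [9, 9, 9] first and evict useful cache entries sooner; LPM's bias toward cached prefixes is what drives the hit-rate gap you measure on the shared-prefix workload.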

Workload Impact Analysis

Test the same policies on a non-shared (ShareGPT) workload to show that LPM has no advantage when prefixes are unique:

for policy in fcfs lpm; do
  python -m sglang.launch_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --schedule-policy $policy --port 8001 &
  sleep 30
  python -m sglang.bench_serving \
    --backend sglang --port 8001 \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 200 --request-rate 8 \
    2>&1 | tee results_sched_sharegpt_${policy}.txt
  kill %1; sleep 5
done
Related Deep Dive: SGLang Scheduler Deep Dive

Week 10 — CUDA Graphs

Learning Objectives

  • Understand CUDA graph capture: recording GPU operations once, replaying them without CPU overhead
  • Measure latency reduction from CUDA graphs vs --enforce-eager mode
  • Observe the warmup/capture phase at server startup and its memory cost

Experiments

1. Eager Mode vs CUDA Graphs

# Eager mode (no CUDA graphs)
python benchmarks/benchmark_latency.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enforce-eager \
  --batch-size 1 --input-len 512 --output-len 128 \
  2>&1 | tee results_eager.txt

# CUDA graphs (default)
python benchmarks/benchmark_latency.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --batch-size 1 --input-len 512 --output-len 128 \
  2>&1 | tee results_cudagraph.txt
2. Startup Time Comparison

Time the server startup with and without CUDA graphs. The graph capture phase (warmup) adds significant startup time but reduces per-request latency.

# Measure time until the server answers its health check
start=$(date +%s)
vllm serve meta-llama/Llama-3.1-8B-Instruct --enforce-eager --port 8000 &
until curl -sf http://localhost:8000/health > /dev/null; do sleep 1; done
echo "eager startup: $(( $(date +%s) - start ))s"
# Repeat without --enforce-eager (e.g. on port 8002) to include graph capture
3. Memory Overhead Analysis

Compare GPU memory usage between eager and CUDA graph modes. CUDA graphs pre-allocate memory for captured operations, reducing available KV cache space.

Metrics to Collect

Metric | Description | Unit
Decode latency (eager) | Per-token time without CUDA graphs | ms
Decode latency (graphs) | Per-token time with CUDA graphs | ms
Server startup time | Eager vs graph capture startup | s
Memory overhead | Extra GPU memory used by captured graphs | MB

Source Code Reading

  • vllm/worker/model_runner.py: CudaGraphRunner, showing how graphs are captured during warmup and replayed during inference

Written Analysis (1-2 pages)

  • What is the latency reduction from CUDA graphs? Where does the speedup come from (CPU launch overhead elimination)?
  • What are the constraints of CUDA graphs? (Fixed tensor sizes, no dynamic control flow)
Understanding CUDA Graph Capture: During warmup, vLLM runs dummy forward passes at various batch sizes (1, 2, 4, ..., max_batch_size). Each pass is recorded as a CUDA graph. During inference, the engine selects the graph matching the current batch size and replays it. This eliminates CPU-side kernel launch overhead (~10-20us per kernel × ~100 kernels per layer × 32 layers = significant savings). The tradeoff: each captured graph consumes GPU memory for the recorded computation, reducing KV cache capacity.
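Multiplying out the ballpark figures quoted above (these are the box's own rough numbers, not measurements) shows why the win is largest at small batch sizes:

```python
# Rough CPU launch-overhead estimate from the figures above
# (all inputs are ballpark assumptions, not measured values).
launch_us_per_kernel = 15   # ~10-20 us per kernel launch
kernels_per_layer = 100
layers = 32

overhead_ms = launch_us_per_kernel * kernels_per_layer * layers / 1000
print(f"per-step CPU launch overhead: ~{overhead_ms:.0f} ms")

# At batch size 1, the GPU work per decode step is only a few ms, so a
# replayed graph (one launch for the whole step) removes most of the
# wall-clock time. At large batch sizes the GPU compute dominates and
# the eager/graph ratio in your sweep should shrink toward 1.
```

Compare this predicted overhead against the eager-vs-graph gap you measure in the batch size sweep below.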

Batch Size Sweep with CUDA Graphs

for bs in 1 4 16 64; do
  # Eager mode
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --enforce-eager --batch-size $bs \
    --input-len 256 --output-len 64 \
    2>&1 | tee results_eager_bs${bs}.txt

  # CUDA graph mode
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --batch-size $bs \
    --input-len 256 --output-len 64 \
    2>&1 | tee results_cudagraph_bs${bs}.txt
done

Plot the latency improvement ratio (eager/graph) for each batch size. The speedup is typically larger at small batch sizes where CPU launch overhead is a larger fraction of total time.

Related Deep Dive: vLLM Model Runner

Week 11 — Speculative Decoding

Learning Objectives

  • Understand speculative decoding: a small draft model proposes tokens, the target model verifies in parallel
  • Compare draft-model, N-gram, and EAGLE speculation methods
  • Measure acceptance rate vs num_speculative_tokens
  • Observe that spec decoding helps at low QPS but hurts at high QPS

Experiments

1. Draft Model Speculation

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
  --num-speculative-tokens 5 \
  --port 8000

python benchmarks/benchmark_serving.py \
  --backend vllm --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100 --request-rate 2 \
  2>&1 | tee results_spec_draft.txt
2. N-gram Speculation

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-model '[ngram]' \
  --num-speculative-tokens 5 \
  --ngram-prompt-lookup-max 4 \
  --port 8002

python benchmarks/benchmark_serving.py \
  --backend vllm --port 8002 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100 --request-rate 2 \
  2>&1 | tee results_spec_ngram.txt
3. num_speculative_tokens Sweep

for k in 1 3 5 7; do
  vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --num-speculative-tokens $k \
    --port 8010 &
  sleep 30
  python benchmarks/benchmark_serving.py \
    --backend vllm --port 8010 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 100 --request-rate 2 \
    2>&1 | tee results_spec_k${k}.txt
  kill %1; sleep 5
done
4. Low QPS vs High QPS

Run the same speculative decoding config at request rates 1, 4, and 10. At low QPS, speculation reduces per-request latency. At high QPS, the overhead of draft model + verification can hurt throughput.

Metrics to Collect

Metric | Description | Unit
Acceptance rate | Fraction of draft tokens accepted by target | %
Per-request latency | With vs without speculation | ms
Throughput at high QPS | Spec decoding overhead under load | tok/s

Source Code Reading

  • vllm/spec_decode/: speculative decoding module (draft worker, scorer, verification logic)

Written Analysis (1-2 pages)

  • Plot acceptance rate vs num_speculative_tokens. Why does acceptance rate decrease with more tokens?
  • At what QPS does speculative decoding start hurting throughput? Explain the mechanism.
Understanding Acceptance Rate: When the draft model proposes k tokens, the target model verifies all k in a single forward pass (versus k separate forward passes without speculation). The expected number of tokens emitted per target forward pass is roughly k × acceptance_rate + 1, so speculation only pays off if the draft model is significantly cheaper than the target. At high QPS, the draft model's GPU cycles compete with serving existing decode batches, reducing overall throughput even if per-request latency improves.
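A slightly more careful version of that estimate, under the simplifying assumption that each draft token is accepted independently with probability `a` and generation stops at the first rejection (real acceptance is position-dependent, so treat this as a sketch):

```python
def tokens_per_step(k, a):
    """Expected tokens emitted per target forward pass with k draft
    tokens, each accepted i.i.d. with probability a; the target always
    contributes one token of its own after verification."""
    accepted = sum(a ** i for i in range(1, k + 1))  # E[# accepted drafts]
    return accepted + 1

for k in (1, 3, 5, 7):
    print(k, round(tokens_per_step(k, a=0.7), 2))

# Returns diminish: the chance that ALL of a longer draft survives
# verification shrinks geometrically in k, which is why the measured
# acceptance rate falls as num_speculative_tokens grows in your sweep.
```

Fit `a` from your measured acceptance rate at k=1 and check how well this curve predicts your k=3/5/7 results.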

Comparison Matrix

Method | Extra Memory | Acceptance Rate | Best For
Draft Model | ~2-4 GB (TinyLlama) | Medium-High | General text, low QPS
N-gram | Negligible | Low (content-dependent) | Repetitive/templated text
EAGLE | ~0.5-1 GB (lightweight head) | High | Code, structured output
Related Deep Dive: vLLM Architecture Overview

Week 12 — Quantization

Learning Objectives

  • Compare FP16, FP8, AWQ-INT4, and GPTQ-INT4 model variants
  • Measure accuracy degradation using lm_eval on GSM8K
  • Plot the Pareto frontier: memory vs throughput vs quality

Experiments

1. Throughput Comparison Across Precisions

# FP16 (baseline)
python benchmarks/benchmark_throughput.py \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 200 2>&1 | tee results_fp16.txt

# FP8
python benchmarks/benchmark_throughput.py \
  --model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --num-prompts 200 2>&1 | tee results_fp8.txt

# AWQ-INT4
python benchmarks/benchmark_throughput.py \
  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq \
  --num-prompts 200 2>&1 | tee results_awq.txt
2. Accuracy Evaluation — GSM8K

# FP16
lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
  --tasks gsm8k --batch_size auto \
  2>&1 | tee eval_fp16.txt

# FP8
lm_eval --model vllm \
  --model_args pretrained=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
  --tasks gsm8k --batch_size auto \
  2>&1 | tee eval_fp8.txt

# AWQ-INT4
lm_eval --model vllm \
  --model_args pretrained=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4,quantization=awq \
  --tasks gsm8k --batch_size auto \
  2>&1 | tee eval_awq.txt
3. Memory Usage Comparison

Record GPU memory used by each model variant (from vLLM startup logs or nvidia-smi). FP16 ≈ 16 GB, FP8 ≈ 8 GB, INT4 ≈ 4 GB for model weights.

Metrics to Collect

Model | Memory (GB) | Throughput (tok/s) | GSM8K Accuracy
FP16 | ~16 | (fill in) | (fill in)
FP8 | ~8 | (fill in) | (fill in)
AWQ-INT4 | ~4 | (fill in) | (fill in)

Source Code Reading

  • vllm/model_executor/layers/quantization/: quantization implementations (AWQ, GPTQ, FP8)

Written Analysis (1-2 pages)

  • Plot the Pareto frontier with memory on x-axis, throughput on y-axis, and GSM8K accuracy as point labels.
  • When is INT4 quantization acceptable? When is FP8 a better choice? Discuss use-case tradeoffs.
Quantization Primer: FP16/BF16: 16-bit floating point. Full precision baseline. 2 bytes per parameter.
FP8: 8-bit floating point (E4M3 or E5M2). Native on Hopper GPUs. 1 byte per parameter. Minimal accuracy loss.
AWQ (Activation-aware Weight Quantization): INT4 with per-channel scaling. Identifies salient weights to preserve. 0.5 bytes per parameter.
GPTQ: Post-training quantization to INT4 using second-order approximation. Similar to AWQ in size.

Why Quantization Improves Throughput

Two complementary effects drive throughput gains:

  1. Reduced memory bandwidth: Decode is memory-bound. INT4 model reads 4× fewer bytes from HBM per token → theoretical 4× decode speedup
  2. More KV cache space: Smaller model weights leave more GPU memory for KV cache → supports larger batch sizes → higher throughput
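Effect 1 can be bounded with simple arithmetic. This sketch estimates the memory-bandwidth floor on per-token decode latency at batch size 1 (KV cache reads are ignored, so real latencies will sit somewhat above these floors):

```python
# Memory-bound decode floor for an 8B model on A100 (2039 GB/s HBM2e).
# Each decode step must stream every weight byte from HBM once.
params = 8e9          # Llama-3.1-8B parameter count (approx.)
bandwidth = 2039e9    # A100 80GB memory bandwidth, bytes/s

results = {}
for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    weight_bytes = params * bytes_per_param
    results[name] = weight_bytes / bandwidth * 1e3  # ms/token, batch 1
    print(f"{name}: {weight_bytes / 1e9:.0f} GB weights, "
          f">= {results[name]:.1f} ms/token floor")
```

The FP16/INT4 ratio is exactly 4x by construction; your measured speedup will be smaller because dequantization compute and KV reads are not free.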

Online Serving Quality Test

Beyond GSM8K, test output quality with real prompts:

# Send the same complex prompt to FP16 and INT4 servers
# Compare outputs side by side for coherence, factual accuracy
for port in 8000 8002; do
  curl -s http://localhost:$port/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "prompt": "Explain the difference between TCP and UDP, including when to use each protocol.",
      "max_tokens": 256,
      "temperature": 0
    }' | python -m json.tool > output_port${port}.json
done

diff output_port8000.json output_port8002.json

Week 13 — Tensor Parallelism

Learning Objectives

  • Understand tensor parallelism (TP): splitting weight matrices across GPUs
  • Measure latency and throughput at TP=1, 2, 4
  • Profile NCCL all-reduce communication overhead using nsys
  • Understand NVLink bandwidth: 600 GB/s bidirectional on A100
Hardware Requirement: This lab requires 2-4 GPUs with NVLink. TP=4 requires 4 GPUs.

Experiments

1. TP Scaling — Latency

for tp in 1 2 4; do
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size $tp \
    --batch-size 1 --input-len 512 --output-len 128 \
    2>&1 | tee results_tp${tp}_latency.txt
done
2. TP Scaling — Throughput

for tp in 1 2 4; do
  python benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size $tp \
    --dataset ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 300 \
    2>&1 | tee results_tp${tp}_throughput.txt
done
3. NCCL All-Reduce Profiling

nsys profile -o tp2_trace --trace=cuda,nvtx,nccl \
  python benchmarks/benchmark_latency.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2 \
    --batch-size 1 --input-len 256 --output-len 32 --num-iters 3

Open in nsys GUI and identify NCCL all-reduce calls. Measure their duration and frequency per decode step.

Metrics to Collect

Metric | Description | Unit
Decode latency per TP | Per-token time at TP=1,2,4 | ms
All-reduce time | NCCL communication per decode step | us
Communication fraction | All-reduce time / total step time | %

Source Code Reading

  • vllm/distributed/: tensor-parallel communication (column/row parallel linear layers, all-reduce)

Written Analysis (1-2 pages)

  • Does TP=2 reduce latency by 2×? Why or why not? Quantify the communication overhead.
  • At what batch size does TP=2 throughput exceed TP=1? Why does this crossover point exist?
How Tensor Parallelism Works: In TP, each linear layer is split across GPUs. For a weight matrix W of shape [H, H]:
Column parallelism: GPU i holds W[:, i·H/TP : (i+1)·H/TP]. The input is broadcast; each GPU produces a column shard of the output → requires an all-gather (or the next layer consumes the shard directly).
Row parallelism: GPU i holds W[i·H/TP : (i+1)·H/TP, :]. The input is split; each GPU produces a partial sum → requires an all-reduce.
Each transformer layer has 2 all-reduce operations (after attention output and after MLP down projection). With NVLink at 600 GB/s, a typical all-reduce for Llama-8B at TP=2 takes ~20-50 microseconds.
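A quick sanity check on that 20-50 microsecond figure (hidden size 4096 for Llama-8B; all other numbers come from the text above):

```python
# At batch size 1 in BF16, each all-reduce moves one [batch, hidden]
# activation tensor between the TP ranks.
hidden = 4096
batch = 1
bytes_per_elt = 2
msg = hidden * batch * bytes_per_elt   # 8 KiB per all-reduce

nvlink = 600e9                         # bytes/s (A100 NVLink, bidirectional)
wire_time_us = msg / nvlink * 1e6
print(f"message: {msg} bytes, wire time: {wire_time_us:.3f} us")

# The 8 KiB payload needs well under a microsecond of NVLink time, so
# the observed 20-50 us per all-reduce is almost entirely fixed NCCL
# launch/synchronization latency. This latency floor is why small-batch
# TP=2 rarely delivers a clean 2x latency win.
```

At large batch sizes the payload grows proportionally and the all-reduce becomes bandwidth-bound, which feeds directly into the crossover question in the written analysis.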
Related Deep Dive: vLLM Distributed Execution

Week 14 — Mixture of Experts (MoE)

Learning Objectives

  • Understand MoE architecture: sparse activation, expert routing, top-k selection
  • Serve Mixtral-8x7B and observe memory vs compute characteristics
  • Experiment with expert parallelism (--ep-size)
Hardware Requirement: Mixtral-8x7B requires ~100 GB of GPU memory. Use 2× A100 80GB with TP=2, or a single H100 with quantization.

Experiments

1. Serve Mixtral-8x7B

vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2 \
  --port 8000

python benchmarks/benchmark_serving.py \
  --backend vllm --port 8000 \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 100 --request-rate 4 \
  2>&1 | tee results_moe_tp2.txt
2. Expert Parallelism

# Expert parallelism distributes experts across GPUs
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tensor-parallel-size 2 --enable-expert-parallel \
  --port 8002

# In SGLang with explicit EP
python -m sglang.launch_server \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --tp 2 --ep-size 2 --port 8003
3. Memory vs Compute Analysis

Compare memory usage and throughput: Mixtral-8x7B has 46.7B total parameters but only activates ~12.9B per token (top-2 of 8 experts). Compare against Llama-3.1-8B which has fewer total parameters but activates all of them.

Metrics to Collect

Metric | Description | Unit
GPU memory (Mixtral) | Total memory for all expert weights | GB
Throughput (MoE vs dense) | Mixtral vs Llama-8B at similar quality | tok/s
Expert load balance | Token distribution across experts | %

Source Code Reading

  • vllm/model_executor/models/mixtral.py: MoE layer implementation (router, expert selection, expert execution)

Written Analysis (1-2 pages)

  • Why does MoE require more memory than a dense model of similar quality? What is the memory-compute tradeoff?
  • How does expert parallelism differ from tensor parallelism? When should you use each?
MoE Memory Paradox: Mixtral-8x7B has 46.7B total parameters but only activates ~12.9B per token (top-2 out of 8 experts). This means it has dense-model quality at sparse-model compute cost — but pays the full memory cost. In BF16, Mixtral needs ~93 GB just for weights, compared to ~16 GB for Llama-8B. The memory-compute ratio is the defining characteristic of MoE: more memory for better quality-per-FLOP.
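The numbers behind that box, worked out explicitly (BF16 at 2 bytes per parameter; parameter counts from the text above):

```python
# Dense memory cost vs sparse compute cost for Mixtral-8x7B in BF16.
total_params = 46.7e9    # all 8 experts + shared layers
active_params = 12.9e9   # top-2 experts actually used per token
bytes_per_param = 2

total_gb = total_params * bytes_per_param / 1e9    # must be RESIDENT
active_gb = active_params * bytes_per_param / 1e9  # actually READ/token
print(f"weights in memory: {total_gb:.0f} GB, read per token: {active_gb:.0f} GB")
print(f"only {active_params / total_params:.0%} of params are active per token")

# You pay dense memory cost (~93 GB resident) for sparse compute cost
# (~28% of the FLOPs/bandwidth), which is exactly the memory-compute
# tradeoff your lab measurements should quantify.
```

Note the decode-bandwidth implication: per token, Mixtral streams ~26 GB rather than ~93 GB, so its decode speed is closer to a 13B dense model than a 47B one.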

Week 15 — LMCache

Learning Objectives

  • Understand LMCache: external KV cache storage that persists across server restarts
  • Configure tiered storage: GPU → CPU → disk
  • Sweep chunk_size to find optimal KV cache granularity

Setup & Configuration

# Create LMCache config file: lmcache_config.yaml
cat > lmcache_config.yaml <<'EOF'
chunk_size: 256
local_device: "cpu"
remote_url: null
remote_serde: null

# Tiered storage config
storage:
  - type: "gpu"
    capacity_gb: 4
  - type: "cpu"
    capacity_gb: 16
  - type: "disk"
    path: "/tmp/lmcache_disk"
    capacity_gb: 64
EOF

Experiments

1. LMCache Integration with vLLM

# Launch vLLM with LMCache
LMCACHE_CONFIG_FILE=lmcache_config.yaml vllm serve \
  meta-llama/Llama-3.1-8B-Instruct \
  --kv-transfer-config lmcache_config.yaml \
  --port 8000
2. Cross-Restart Persistence Test

Send requests with a shared system prompt, shut down the server, restart, and send the same requests. With LMCache, the KV cache is restored from disk/CPU, eliminating re-computation.

# First run: populate cache
python benchmarks/benchmark_serving.py \
  --backend vllm --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 50 --request-rate 2 \
  2>&1 | tee results_lmcache_cold.txt

# Restart server, re-run (cache should be warm from disk)
# Kill and restart the server, then:
python benchmarks/benchmark_serving.py \
  --backend vllm --port 8000 \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 50 --request-rate 2 \
  2>&1 | tee results_lmcache_warm.txt
3. chunk_size Sweep

for cs in 64 128 256 512 1024; do
  # Update lmcache_config.yaml with chunk_size=$cs
  sed -i "s/chunk_size: .*/chunk_size: $cs/" lmcache_config.yaml
  LMCACHE_CONFIG_FILE=lmcache_config.yaml vllm serve \
    meta-llama/Llama-3.1-8B-Instruct \
    --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' \
    --port 8010 &
  sleep 30
  python benchmarks/benchmark_serving.py \
    --backend vllm --port 8010 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --num-prompts 100 --request-rate 4 \
    2>&1 | tee results_lmcache_cs${cs}.txt
  kill %1; sleep 5
done

Metrics to Collect

Metric | Description | Unit
TTFT (cold vs warm) | First token time before/after cache is populated | ms
Cache restore time | Time to load KV cache from disk on restart | ms
Throughput per chunk_size | Effect of cache granularity on performance | tok/s

Source Code Reading

  • lmcache/: LMCache core (storage backends, chunk management, vLLM integration)

Written Analysis (1-2 pages)

  • How much TTFT improvement does cross-restart persistence provide? What workloads benefit most?
  • What is the optimal chunk_size? Discuss the tradeoff between granularity (more sharing) and overhead (more metadata).
Tiered Storage Explained: LMCache implements a multi-tier caching hierarchy similar to CPU cache levels:
GPU tier: Fastest access (~0.1ms), but limited by GPU memory
CPU tier: Medium speed (~1-5ms), uses system RAM
Disk tier: Slowest (~10-50ms), but virtually unlimited capacity and survives restarts
When a cache entry is accessed, it is promoted to the GPU tier. When GPU tier is full, least-recently-used entries are demoted to CPU, then disk.
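To put the tier latencies in context, here is the size of one cache chunk for Llama-3.1-8B (32 layers, 8 KV heads via GQA, head dim 128), assuming BF16 KV entries and the default chunk_size of 256:

```python
# Size of one LMCache chunk for Llama-3.1-8B with BF16 KV cache.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elt = 2
chunk_tokens = 256

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elt  # K + V
chunk_mb = kv_bytes_per_token * chunk_tokens / 2**20
print(f"{kv_bytes_per_token / 1024:.0f} KB/token -> "
      f"{chunk_mb:.0f} MB per {chunk_tokens}-token chunk")

# At sequential disk read speeds of ~1-5 GB/s, a 32 MB chunk loads in
# roughly 6-32 ms, consistent with the 10-50 ms disk-tier figure above,
# and typically far cheaper than recomputing a 256-token prefill slice.
```

This also explains the chunk_size tradeoff in the sweep: smaller chunks enable finer-grained sharing but multiply the per-chunk metadata and I/O requests.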

Storage Tier Latency Test

Observe the TTFT difference when KV cache is served from each tier:

# Test 1: Cold start (no cache anywhere)
# TTFT = full prefill computation time

# Test 2: Warm GPU cache (repeat same request)
# TTFT ≈ 0 (KV cache already on GPU)

# Test 3: CPU-only cache (restart server, GPU cache lost)
# TTFT = CPU→GPU transfer time only

# Test 4: Disk-only cache (clear CPU cache too)
# TTFT = Disk→GPU transfer time
Related Deep Dive: LMCache Architecture

Week 16 — Disaggregated Prefill/Decode

Learning Objectives

  • Understand disaggregated serving: separate prefill and decode onto different GPU pools
  • Set up vLLM P2pNcclConnector for KV cache transfer between prefill and decode nodes
  • Use SGLang's --disaggregation-mode for P/D separation
  • Measure KV transfer time vs input length
Hardware Requirement: Disaggregated P/D requires at least 2 GPUs — one for prefill, one for decode.

Experiments

1. vLLM Disaggregated Setup

# Prefill instance (GPU 0)
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8000 \
  --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer"}'

# Decode instance (GPU 1)
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --port 8001 \
  --kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer"}'
2. SGLang Disaggregated Mode

# SGLang with disaggregation
python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode prefill \
  --port 8002

python -m sglang.launch_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --disaggregation-mode decode \
  --port 8003
3. KV Transfer Time vs Input Length

Measure how KV cache transfer time scales with input sequence length. Longer inputs produce larger KV caches that take more time to transfer between prefill and decode GPUs.

for input_len in 128 256 512 1024 2048 4096; do
  python benchmarks/benchmark_serving.py \
    --backend vllm --port 8001 \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random \
    --random-input-len $input_len --random-output-len 64 \
    --num-prompts 50 --request-rate 2 \
    2>&1 | tee results_pd_input${input_len}.txt
done
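Before measuring, it helps to predict the transfer cost. This sketch assumes Llama-3.1-8B's KV shape (32 layers, 8 KV heads, head dim 128, BF16) and an NVLink hop at ~600 GB/s between the prefill and decode GPUs; the numbers are back-of-envelope, not measurements:

```python
# Expected KV cache size and wire time vs input length.
layers, kv_heads, head_dim, bytes_per_elt = 32, 8, 128, 2
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elt  # K + V

sizes_mb = {}
for n in (128, 512, 2048, 4096):
    sizes_mb[n] = n * kv_per_token / 2**20
    wire_us = n * kv_per_token / 600e9 * 1e6
    print(f"{n:5d} tokens: {sizes_mb[n]:6.0f} MB KV, ~{wire_us:4.0f} us on NVLink")

# Size grows linearly with input length, and even 4096 tokens (512 MB)
# needs under 1 ms of wire time, so measured transfer overhead is often
# dominated by launch/serialization latency rather than raw bandwidth.
```

Compare the slope of your measured transfer times against this linear model: a matching slope with a large intercept points at fixed per-transfer overhead, which is one of the capstone questions below.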
4. Capstone Analysis

Design an experiment comparing standard serving vs disaggregated P/D on a mixed workload with both short and long prefills. When does disaggregation help?

Metrics to Collect

Metric | Description | Unit
KV transfer time | Time to send KV cache from prefill to decode GPU | ms
TTFT (P/D vs standard) | First-token latency comparison | ms
ITL (P/D vs standard) | Decode-side ITL without prefill interference | ms
GPU utilization (prefill vs decode) | Compute efficiency of each pool | %

Source Code Reading

  • vllm/distributed/kv_transfer/: P2pNcclConnector, the NCCL-based KV cache transfer between prefill and decode instances
  • sglang/srt/disaggregation/: SGLang's disaggregation mode implementation

Written Analysis (1-2 pages)

  • Plot KV transfer time vs input length. Is the relationship linear? What determines the transfer bandwidth?
  • When does disaggregated P/D improve overall system performance vs standard serving? Consider: prefill-heavy workloads, latency-sensitive decode, GPU utilization.
  • Final capstone: synthesize your findings from all 16 weeks. What are the 3 most impactful optimizations for LLM inference, and why?

Capstone Project Guidelines

Final Report Structure (4-6 pages)

  1. Executive Summary: What are the 3 most impactful inference optimizations you studied? Rank and justify.
  2. System Comparison: Compare vLLM vs SGLang on at least 3 dimensions (latency, throughput, cache effectiveness). Use data from your weekly labs.
  3. Configuration Guide: For a production deployment of Llama-3.1-8B on 2× A100, what is your recommended configuration? (batch size, block size, chunked prefill settings, whether to use APC, etc.)
  4. Future Directions: What optimization opportunities remain? What would you investigate with more time?

Bonus Experiments (Optional)

  • Combine disaggregated P/D with LMCache: does external KV storage improve P/D transfer?
  • Compare disaggregated P/D throughput vs standard serving at various prefill/decode length ratios
  • Try combining speculative decoding with prefix caching — do the benefits stack?
  • Profile disaggregated P/D with nsys to measure KV transfer latency breakdown (serialization, network, deserialization)

Models & Datasets Reference Card

Models

Model | Parameters | Precision | Used In
meta-llama/Llama-3.1-8B-Instruct | 8B | BF16 | Weeks 1-11, 13, 15-16
TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 1.1B | FP16 | Week 11 (draft model)
neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 | 8B | FP8 | Weeks 5, 12
hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 | 8B | INT4 | Week 12
mistralai/Mixtral-8x7B-Instruct-v0.1 | 46.7B | BF16 | Week 14

Datasets

Dataset | Description | Used In
ShareGPT_V3_unfiltered_cleaned_split.json | Real-world multi-turn conversations with natural length distribution | Weeks 1-5, 7-8, 11-14, 16
generated-shared-prefix | Synthetic workload with shared prefixes (built into SGLang bench) | Weeks 6, 9
GSM8K | Grade-school math benchmark for accuracy evaluation (via lm_eval) | Week 12

Key Tools

Tool | Purpose
vllm serve | Launch vLLM OpenAI-compatible server
sglang.launch_server | Launch SGLang server
benchmark_serving.py | Online serving benchmark (vLLM)
bench_serving | Online serving benchmark (SGLang)
benchmark_throughput.py | Offline throughput benchmark (vLLM)
benchmark_latency.py | Single-batch latency profiling (vLLM)
nsys / ncu | NVIDIA Nsight Systems / Nsight Compute GPU profiling
lm_eval | Language model evaluation harness
nvidia-smi dmon | GPU monitoring (power, utilization, clocks, memory)

Recommended Papers

Paper | Venue | Relevant Weeks
Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI 2022 | Week 7
Efficient Memory Management for Large Language Model Serving with PagedAttention | SOSP 2023 | Weeks 4-5
SGLang: Efficient Execution of Structured Language Model Programs | NeurIPS 2024 | Weeks 6, 9
Fast Inference from Transformers via Speculative Decoding | ICML 2023 | Week 11
Splitwise: Efficient generative LLM inference using phase splitting | ISCA 2024 | Week 16
LMCache: Optimizing KV Cache Sharing Across LLM Serving Instances | arXiv 2024 | Week 15

Grading Rubric (per weekly lab)

Component | Weight | Description
Experiment Execution | 30% | All experiments completed, commands run correctly, results captured
Metrics Collection | 20% | All required metrics recorded in tables/plots, units correct
Source Code Reading | 15% | Evidence of reading the specified files, key functions identified and explained
Written Analysis | 30% | Thoughtful answers to analysis questions, supported by data, correct reasoning
Presentation | 5% | Clear formatting, labeled plots, organized report

Weekly Submission Checklist

Architecture Quick Reference

vLLM Architecture

Client Request
    ↓
api_server.py    (FastAPI)
    ↓
AsyncLLM         (async engine)
    ↓
EngineCore       (scheduler + executor)
    ↓
Worker           (model runner on GPU)
    ↓
ModelRunner      (forward pass)

SGLang Architecture

Client Request
    ↓
TokenizerManager (HTTP + tokenize)
    ↓
Scheduler        (RadixCache + policy)
    ↓
TpModelWorker    (TP group)
    ↓
ModelRunner      (forward pass)

NVIDIA A100 80GB Quick Specs

Compute

  • 312 TFLOPS BF16
  • 156 TFLOPS FP32
  • 624 TFLOPS INT8

Memory

  • 80 GB HBM2e
  • 2039 GB/s bandwidth
  • 40 MB L2 cache

Interconnect

  • NVLink: 600 GB/s
  • PCIe Gen4: 64 GB/s
  • TDP: 400W