16-week graduate course — from first serving to disaggregated P/D, with vLLM, SGLang, and LMCache
3-5 hrs/week · 1-2× A100 80GB

This course follows a measure-first, understand-second approach: each week you run an experiment, collect the numbers, and only then connect them to the system mechanism that produced them.
Complete this setup before Week 1. All 16 labs assume these tools, models, and datasets are already available.
Create a dedicated conda environment with all required packages:
# Create and activate environment
conda create -n infer python=3.12 -y
conda activate infer
# Install core serving frameworks
pip install vllm
pip install "sglang[all]"
# Install KV cache management and evaluation
pip install lmcache lm-eval
# Verify installations
python -c "import vllm; print(f'vLLM {vllm.__version__}')"
python -c "import sglang; print(f'SGLang {sglang.__version__}')"
python -c "import lmcache; print('LMCache OK')"
Download all models once; they are reused across the entire semester:
# Primary model — used in most labs
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
# Small model — used for speculative decoding draft
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0
# Quantized variants — Week 12
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
huggingface-cli download neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
# MoE model — Week 14 (requires ~100GB disk)
huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1
# ShareGPT conversation dataset — primary benchmark workload
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
# Verify file
python -c "import json; d=json.load(open('ShareGPT_V3_unfiltered_cleaned_split.json')); print(f'{len(d)} conversations')"
For cluster users — save as run_lab.sbatch and customize per week:
#!/bin/bash
#SBATCH --job-name=infer-lab
#SBATCH --gres=gpu:A100:1
#SBATCH --time=04:00:00
#SBATCH --mem=64G
#SBATCH --cpus-per-task=8
module load cuda/12.4
conda activate infer
# Your experiment commands here
Run this in a separate terminal to collect GPU metrics during experiments:
# Log GPU power, utilization, clocks, memory every 1 second
nvidia-smi dmon -s pucm -d 1 -f gpu_monitor.csv &
# Quick GPU status check
nvidia-smi --query-gpu=name,memory.total,memory.used,utilization.gpu --format=csv
Launch vLLM and send one request to confirm everything works:
# Terminal 1: Start vLLM server
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# Terminal 2: Send a test request
curl -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "The capital of France is",
"max_tokens": 16
}' | python -m json.tool
# Expected: JSON response with "Paris" in the output
If you see OOM errors, reduce --max-model-len or --gpu-memory-utilization:
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--max-model-len 4096 \
--gpu-memory-utilization 0.85
Some models (e.g., Llama) require a HuggingFace token. Set it up once:
huggingface-cli login
# Paste your token from https://huggingface.co/settings/tokens
# Or set the environment variable
export HF_TOKEN="hf_your_token_here"
If a port is already in use, find and kill the process:
# Find process using port 8000
lsof -i :8000
# Kill it
kill -9 $(lsof -t -i :8000)
If packages conflict, create a fresh environment. vLLM and SGLang update frequently — pin versions for reproducibility:
pip install vllm==0.8.0
pip install "sglang[all]==0.4.5"
Save these utility scripts for use throughout the semester:
# save as: wait_for_server.sh
#!/bin/bash
# Wait for a vLLM/SGLang server to be ready
PORT=${1:-8000}
echo "Waiting for server on port $PORT..."
while ! curl -s http://localhost:$PORT/health > /dev/null 2>&1; do
    sleep 2
done
echo "Server is ready!"
# save as: extract_metrics.py
# Quick script to parse benchmark output files
import re, sys

def parse_benchmark(filepath):
    with open(filepath) as f:
        text = f.read()
    metrics = {}
    for pattern, name in [
        (r'Throughput:\s+([\d.]+)\s+requests/s', 'req/s'),
        (r'Mean TTFT.*?:\s+([\d.]+)\s+ms', 'ttft_mean'),
        (r'P99 TTFT.*?:\s+([\d.]+)\s+ms', 'ttft_p99'),
        (r'Mean ITL.*?:\s+([\d.]+)\s+ms', 'itl_mean'),
        (r'P99 ITL.*?:\s+([\d.]+)\s+ms', 'itl_p99'),
    ]:
        m = re.search(pattern, text)
        if m:
            metrics[name] = float(m.group(1))
    return metrics

if __name__ == "__main__":
    for f in sys.argv[1:]:
        print(f"{f}: {parse_benchmark(f)}")
The following roadmap organizes all 16 weekly labs into 5 progressive phases. Each card shows the week number, topic, key systems used, and the primary phenomenon you will observe and analyze.
Each phase builds on the previous one. Phase 1 establishes your baseline understanding and measurement toolkit. Phases 2-3 explore memory management and scheduling — the two pillars of serving system design. Phase 4 covers hardware-level optimizations. Phase 5 extends to multi-GPU and production deployments.
First LLM Serving — vLLM & SGLang basics, online serving, ShareGPT benchmarks, TTFT/ITL metrics
Offline Throughput & Profiling — benchmark_throughput, PyTorch Profiler, Perfetto, nvidia-smi dmon, call graphs
Roofline Model — nsys/ncu profiling, A100 peak analysis, memory-bound vs compute-bound, decode latency theory
PagedAttention — vLLM vs HuggingFace baseline, max-num-seqs sweep, block-size tuning, KV cache manager
Prefix Caching (APC) — enable-prefix-caching toggle, FP8 KV cache, shared prefix sweeps, hash_block_tokens
RadixAttention — SGLang radix cache, few-shot workloads, LPM vs FCFS, sgl.function DSL, radix_cache.py
Continuous vs Static Batching — HuggingFace static baseline, output-length variance, scheduler.py, Orca paper
Chunked Prefill — HOL blocking, max-num-batched-tokens sweep, ITL timeseries analysis
Scheduling Policies — FCFS/LPM/DFS-weight, cache-aware DP routing, sglang_router
CUDA Graphs — enforce-eager vs captured graphs, warmup phase, memory overhead, CudaGraphRunner
Speculative Decoding — draft model, N-gram, EAGLE, acceptance rate vs num_speculative_tokens
Quantization — FP16/FP8/AWQ-INT4/GPTQ-INT4, lm_eval GSM8K accuracy, Pareto frontier
Tensor Parallelism — TP=1,2,4, NCCL all-reduce profiling, NVLink bandwidth
MoE — Mixtral-8x7B, expert routing, memory vs compute, expert parallelism
LMCache — cross-restart persistence, tiered storage GPU→CPU→disk, chunk_size sweep
Disaggregated P/D — vLLM P2pNcclConnector, SGLang disaggregation-mode, KV transfer analysis, capstone
# Start vLLM server (Terminal 1)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8000 \
--disable-log-requests
# Start SGLang server (Terminal 2)
python -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--port 8001
Send individual requests to both servers and observe the response structure:
# vLLM completions API
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"Explain transformers in 3 sentences.","max_tokens":128}'
# SGLang completions API
curl http://localhost:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"Explain transformers in 3 sentences.","max_tokens":128}'
Run benchmark_serving.py at request rates 1, 4, 10, and inf (inf submits all prompts at once):
for rate in 1 4 10 inf; do
# (the vLLM server from the smoke test should already be running on port 8000)
python benchmarks/benchmark_serving.py \
--backend vllm \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 200 \
--request-rate $rate \
--port 8000 \
2>&1 | tee results_vllm_rr${rate}.txt
done
for rate in 1 4 10 inf; do
python -m sglang.bench_serving \
--backend sglang \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 200 \
--request-rate $rate \
--port 8001 \
2>&1 | tee results_sglang_rr${rate}.txt
done
Extract TTFT, ITL, throughput from each result file. Plot request rate vs. median TTFT and median ITL for both systems.
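A minimal sketch of the plotting step, assuming the medians have already been extracted from the result files (the numbers below are placeholders, not measurements). Mapping "inf" to float("inf") keeps the x-axis sorted:

```python
def to_series(results: dict[str, float]) -> tuple[list[float], list[float]]:
    # Map rate labels ("1", "4", "10", "inf") to floats so the x-axis
    # sorts numerically, keeping y-values aligned with their rates.
    pairs = sorted((float(rate), value) for rate, value in results.items())
    return [p[0] for p in pairs], [p[1] for p in pairs]

vllm_ttft = {"1": 45.0, "10": 210.0, "4": 90.0, "inf": 900.0}  # placeholders
rates, ttfts = to_series(vllm_ttft)

# import matplotlib.pyplot as plt
# plt.plot(rates, ttfts, marker="o", label="vLLM")
# plt.xlabel("request rate (req/s)"); plt.ylabel("median TTFT (ms)")
# plt.legend(); plt.savefig("ttft_vs_rate.png")
```

Repeat with the SGLang results on the same axes to make the comparison direct.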
| Metric | Description | Unit |
|---|---|---|
| TTFT (p50, p99) | Time to first token | ms |
| ITL (p50, p99) | Inter-token latency | ms |
| Throughput | Output tokens per second | tok/s |
| Request latency (p50, p99) | End-to-end per-request time | ms |
Test streaming mode to observe tokens arriving one at a time:
# vLLM streaming
curl -N http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Count from 1 to 20:",
"max_tokens": 64,
"stream": true
}'
# SGLang streaming
curl -N http://localhost:8001/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Count from 1 to 20:",
"max_tokens": 64,
"stream": true
}'
# Ensure GPU monitoring is running
nvidia-smi dmon -s pucm -d 1 -f gpu_monitor_w2.csv &
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 500 \
--output-json throughput_vllm.json
python -m sglang.bench_one_batch \
--model meta-llama/Llama-3.1-8B-Instruct \
--batch-size 32 64 128 \
--input-len 512 \
--output-len 256
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--input-len 512 \
--output-len 128 \
--batch-size 1 \
--num-iters 10
Enable the PyTorch profiler and export a trace for visualization in Perfetto UI:
# Run vLLM with profiling enabled
VLLM_TORCH_PROFILER_DIR=./profiles vllm serve \
meta-llama/Llama-3.1-8B-Instruct --port 8000
# Send a few requests, then open the trace:
# Go to https://ui.perfetto.dev and load the .json trace file
Analyze the gpu_monitor_w2.csv to observe GPU utilization patterns during offline vs. online workloads.
| Metric | Description | Unit |
|---|---|---|
| Offline throughput | Total tokens/sec in batched mode | tok/s |
| Per-token latency | Average time per output token (batch=1) | ms |
| GPU utilization % | From nvidia-smi dmon | % |
| Kernel time breakdown | Attention vs. MLP vs. other from Perfetto trace | % |
# Install nsight systems and nsight compute (usually comes with CUDA toolkit)
which nsys && which ncu
# If not found: module load nsight-systems nsight-compute
nsys profile -o llm_trace --trace=cuda,nvtx \
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--batch-size 1 --input-len 512 --output-len 64 --num-iters 3
ncu --set roofline --target-processes all \
-o roofline_report \
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--batch-size 1 --input-len 128 --output-len 16 --num-iters 1
For Llama-3.1-8B in BF16: model size ≈ 16 GB. A100 HBM bandwidth = 2 TB/s. Theoretical minimum decode step = 16 GB / 2 TB/s = 8 ms. Compare against your measured per-token latency from the nsys trace.
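The arithmetic above can be scripted as a sanity check (the 16 GB and 2 TB/s figures are the round numbers from the text, not measured values):

```python
def min_decode_latency_ms(model_bytes: float, hbm_bytes_per_s: float) -> float:
    # Decode is memory-bound at batch 1: every weight must stream from
    # HBM once per generated token, so bandwidth sets a hard floor.
    return model_bytes / hbm_bytes_per_s * 1e3

print(f"{min_decode_latency_ms(16e9, 2e12):.1f} ms/token floor")  # 8.0 ms
```

If your measured per-token time is, say, 2-3x this floor, the gap is kernel overhead and imperfect bandwidth utilization rather than missing FLOPs.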
Open the NCU roofline report. Identify where attention kernels and GEMM kernels fall. Prefill GEMMs should be near the compute roof; decode attention should be near the memory roof.
| Metric | Description | Unit |
|---|---|---|
| Measured decode latency | Actual per-token time from nsys | ms |
| Theoretical decode latency | model_size / HBM_bandwidth | ms |
| Arithmetic intensity | FLOPs / bytes for key kernels | FLOP/B |
| HBM utilization % | Fraction of peak bandwidth achieved | % |
Run the same model at batch sizes 1, 4, 16, 64 and observe how arithmetic intensity shifts kernels from memory-bound to compute-bound:
for bs in 1 4 16 64; do
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--batch-size $bs --input-len 256 --output-len 64 \
2>&1 | tee results_roofline_bs${bs}.txt
done
# HuggingFace baseline script (save as hf_baseline.py)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch, time

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# Llama tokenizers ship without a pad token; set one and pad on the left
# so generation continues from real prompt tokens, not padding
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

prompts = ["Write a short story about AI."] * 16
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start
total_tokens = (outputs.shape[1] - inputs["input_ids"].shape[1]) * len(prompts)
print(f"HF throughput: {total_tokens/elapsed:.1f} tok/s")
python hf_baseline.py
for seqs in 1 4 16 64 256; do
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--max-num-seqs $seqs \
--dataset ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 300 \
2>&1 | tee results_paged_seqs${seqs}.txt
done
for bs in 8 16 32; do
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--block-size $bs \
--dataset ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 300 \
2>&1 | tee results_paged_bs${bs}.txt
done
During each run, log GPU memory usage. Note the KV cache block allocation reported in vLLM logs (e.g., '# GPU blocks: 1234').
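To interpret that log line, here is a sketch of per-block KV size for Llama-3.1-8B (GQA with 32 layers, 8 KV heads, head_dim 128 — verify these against the model's config.json):

```python
def kv_block_bytes(layers: int = 32, kv_heads: int = 8, head_dim: int = 128,
                   block_size: int = 16, bytes_per_el: int = 2) -> int:
    # 2x for K and V; one FP16 entry per layer/head/dim/token-slot.
    return 2 * layers * kv_heads * head_dim * block_size * bytes_per_el

per_block = kv_block_bytes()
print(f"{per_block / 2**20:.2f} MiB per 16-token block")  # 2.00 MiB
# Multiply by the "# GPU blocks: N" count from the vLLM log to get
# the total KV cache capacity in bytes.
```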
| Metric | Description | Unit |
|---|---|---|
| HF baseline throughput | HuggingFace generate() tok/s | tok/s |
| vLLM throughput per max-num-seqs | Throughput at each concurrency level | tok/s |
| GPU blocks allocated | KV cache blocks at each block-size | blocks |
| Memory waste % | Internal fragmentation at each block-size | % |
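A quick model of the internal fragmentation you are asked to estimate: only the last block of a sequence is partially filled, so waste depends on sequence length modulo block size (the 100-token sequence below is an arbitrary example):

```python
def internal_frag_pct(seq_len: int, block_size: int) -> float:
    # Slots past seq_len in the final allocated block are wasted.
    used_blocks = -(-seq_len // block_size)  # ceiling division
    allocated = used_blocks * block_size
    return 100.0 * (allocated - seq_len) / allocated

for bs in (8, 16, 32):
    print(f"block_size={bs:2d}: {internal_frag_pct(100, bs):4.1f}% waste (100-token seq)")
```

Larger blocks mean fewer allocator operations but more waste per sequence, which is why the block-size sweep above has a sweet spot.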
# Launch vLLM with APC enabled
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-prefix-caching --port 8000
# Launch vLLM WITHOUT APC (baseline)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--no-enable-prefix-caching --port 8002
# With APC (port 8000)
python benchmarks/benchmark_prefix_caching.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 200 --port 8000 \
2>&1 | tee results_apc_on.txt
# Without APC (port 8002)
python benchmarks/benchmark_prefix_caching.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 200 --port 8002 \
2>&1 | tee results_apc_off.txt
for prefix_len in 0 256 512 1024 2048; do
python benchmarks/benchmark_prefix_caching.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--shared-prefix-len $prefix_len \
--num-prompts 200 --port 8000 \
2>&1 | tee results_apc_prefix${prefix_len}.txt
done
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-prefix-caching \
--kv-cache-dtype fp8 \
--port 8003
# Repeat prefix caching benchmark on port 8003
python benchmarks/benchmark_prefix_caching.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 200 --port 8003 \
2>&1 | tee results_apc_fp8.txt
| Metric | Description | Unit |
|---|---|---|
| TTFT (APC on vs off) | Time to first token with/without prefix caching | ms |
| Cache hit rate | Fraction of prefix blocks reused | % |
| TTFT vs prefix length | How TTFT scales with shared prefix length | ms |
| GPU blocks (FP16 vs FP8) | Total cache blocks available under each dtype | blocks |
# SGLang with radix cache enabled (default)
python -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--port 8001
# SGLang with radix cache disabled
python -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--disable-radix-cache \
--port 8004
# Few-shot workload benchmark
python -m sglang.bench_serving \
--backend sglang --port 8001 \
--dataset-name generated-shared-prefix \
--num-prompts 200 --request-rate 4 \
2>&1 | tee results_radix_on.txt
python -m sglang.bench_serving \
--backend sglang --port 8004 \
--dataset-name generated-shared-prefix \
--num-prompts 200 --request-rate 4 \
2>&1 | tee results_radix_off.txt
SGLang uses Longest Prefix Match scheduling by default. Compare against FCFS by toggling the scheduling policy:
# LPM scheduling (default)
python -m sglang.bench_serving \
--backend sglang --port 8001 \
--dataset-name generated-shared-prefix \
--num-prompts 300 --request-rate 8 \
2>&1 | tee results_lpm.txt
# FCFS scheduling
python -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--schedule-policy fcfs --port 8005
python -m sglang.bench_serving \
--backend sglang --port 8005 \
--dataset-name generated-shared-prefix \
--num-prompts 300 --request-rate 8 \
2>&1 | tee results_fcfs.txt
import sglang as sgl

# Point the DSL at the running SGLang server
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:8001"))

@sgl.function
def multi_turn(s, question1, question2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question1)
    s += sgl.assistant(sgl.gen("answer1", max_tokens=256))
    s += sgl.user(question2)
    s += sgl.assistant(sgl.gen("answer2", max_tokens=256))

state = multi_turn.run(
    question1="What is PagedAttention?",
    question2="How does it compare to RadixAttention?")
print(state["answer1"])
print(state["answer2"])
| Metric | Description | Unit |
|---|---|---|
| TTFT (radix on/off) | First token time with/without radix cache | ms |
| Cache hit rate | Prefix reuse ratio on few-shot workload | % |
| TTFT (LPM vs FCFS) | Scheduling policy effect on first-token time | ms |
# Prepare a workload with high output-length variance
# ShareGPT naturally has variance; we can also create synthetic workloads
python -c "
import json, random
data = []
for i in range(200):
    out_len = random.choice([16, 32, 64, 256, 512])
    data.append({'input': 'Tell me about AI.', 'output_len': out_len})
json.dump(data, open('variable_output.json','w'))
"
Use the HuggingFace generate() script from Week 4 with varying output lengths. Observe that all sequences must wait for the longest one to finish.
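A rough calculation of the padding waste this implies, assuming the five output lengths from the synthetic workload above and that the whole batch runs until the longest sequence finishes:

```python
def static_batch_waste_pct(output_lens: list[int]) -> float:
    # Each sequence holds its batch slot until the longest one finishes,
    # so allocated decode slots = batch * max_len; the rest is padding.
    max_len = max(output_lens)
    allocated = len(output_lens) * max_len
    return 100.0 * (allocated - sum(output_lens)) / allocated

lens = [16, 32, 64, 256, 512]  # the synthetic output lengths from above
print(f"{static_batch_waste_pct(lens):.0f}% of decode slots spent on padding")  # 66%
```

Continuous batching recovers exactly these slots by admitting new sequences the moment old ones finish.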
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--input-len 256 --output-len 128 \
--num-prompts 300 \
2>&1 | tee results_cb_uniform.txt
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 300 \
2>&1 | tee results_cb_variable.txt
vLLM logs scheduler stats by default (do not pass --disable-log-stats); watch how the running batch size fluctuates over time with continuous batching. With static batching, the batch size stays constant, wasting GPU cycles on padding.
| Metric | Description | Unit |
|---|---|---|
| Static batch throughput | HF generate() with padding | tok/s |
| Continuous batch throughput | vLLM on same workload | tok/s |
| Avg batch occupancy | Mean running sequences over time | seqs |
With static batching (batch=16, max_output=512), every slot stays occupied until the longest sequence finishes. Compare this against vLLM's fluctuating running batch:
# Visualize batch occupancy over time (from vLLM stats logs)
# Look for lines like: "Avg running: 42.3, Avg waiting: 12.1"
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--disable-log-requests \
--port 8000 2>&1 | grep "running"
# Without chunked prefill
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--no-enable-chunked-prefill --port 8000
# With chunked prefill (default in recent vLLM)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-chunked-prefill --port 8002
# Aggressive chunking
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-chunked-prefill --max-num-batched-tokens 512 --port 8003
Create a workload mixing long prefills (2048+ tokens) with short streaming requests. Without chunked prefill, observe ITL spikes in the short requests.
python benchmarks/benchmark_serving.py \
--backend vllm --port 8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 100 --request-rate 4 \
2>&1 | tee results_no_chunk.txt
python benchmarks/benchmark_serving.py \
--backend vllm --port 8002 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 100 --request-rate 4 \
2>&1 | tee results_chunk_default.txt
for tokens in 256 512 1024 2048 4096; do
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-chunked-prefill \
--max-num-batched-tokens $tokens \
--port 8010 &
sleep 30 # wait for server startup
python benchmarks/benchmark_serving.py \
--backend vllm --port 8010 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 100 --request-rate 4 \
2>&1 | tee results_chunk_${tokens}.txt
kill %1
done
Plot the per-token ITL over time for each configuration. Without chunked prefill, you should see periodic spikes. With chunked prefill, ITL should be smoother.
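One way to reduce those timeseries to the single consistency number in the deliverables table — a sketch, assuming you have already extracted per-token ITLs in milliseconds (the two example lists are synthetic):

```python
def itl_spikiness(itl_ms: list[float]) -> float:
    # p99/p50 ratio: 1.0 means perfectly even token pacing.
    s = sorted(itl_ms)
    p50 = s[len(s) // 2]
    p99 = s[min(len(s) - 1, int(len(s) * 0.99))]
    return p99 / p50

smooth = [10.0] * 100              # chunked prefill: even ITLs
spiky = [10.0] * 95 + [200.0] * 5  # long prefills stall decode, injecting spikes
print(itl_spikiness(smooth), itl_spikiness(spiky))  # 1.0 20.0
```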
| Metric | Description | Unit |
|---|---|---|
| ITL p50, p99 | Inter-token latency percentiles | ms |
| TTFT p50, p99 | Time to first token (may increase with chunking) | ms |
| ITL p99/p50 ratio | Measure of ITL consistency (lower is better) | ratio |
for policy in fcfs lpm dfs-weight; do
python -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--schedule-policy $policy \
--port 8001 &
sleep 30
python -m sglang.bench_serving \
--backend sglang --port 8001 \
--dataset-name generated-shared-prefix \
--num-prompts 300 --request-rate 8 \
2>&1 | tee results_sched_${policy}.txt
kill %1; sleep 5
done
If you have 2 GPUs, launch SGLang with data parallelism and the cache-aware router:
python -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--dp 2 --port 8001
# The sglang_router automatically routes requests to maximize cache hits
python -m sglang.bench_serving \
--backend sglang --port 8001 \
--dataset-name generated-shared-prefix \
--num-prompts 500 --request-rate 10 \
2>&1 | tee results_dp_router.txt
| Metric | Description | Unit |
|---|---|---|
| Throughput per policy | FCFS vs LPM vs DFS-weight | tok/s |
| Cache hit rate per policy | Prefix reuse on shared-prefix workload | % |
| TTFT p50, p99 | First-token latency under each policy | ms |
Test the same policies on a non-shared (ShareGPT) workload to show that LPM has no advantage when prefixes are unique:
for policy in fcfs lpm; do
python -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--schedule-policy $policy --port 8001 &
sleep 30
python -m sglang.bench_serving \
--backend sglang --port 8001 \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 200 --request-rate 8 \
2>&1 | tee results_sched_sharegpt_${policy}.txt
kill %1; sleep 5
done
# Eager mode (no CUDA graphs)
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--enforce-eager \
--batch-size 1 --input-len 512 --output-len 128 \
2>&1 | tee results_eager.txt
# CUDA graphs (default)
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--batch-size 1 --input-len 512 --output-len 128 \
2>&1 | tee results_cudagraph.txt
Time the server startup with and without CUDA graphs. The graph capture phase (warmup) adds significant startup time but reduces per-request latency.
# Time server startup (seconds until /health responds), using wait_for_server.sh from Week 0
vllm serve meta-llama/Llama-3.1-8B-Instruct --enforce-eager --port 8000 &
time bash wait_for_server.sh 8000
# vs graph capture (default)
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8002 &
time bash wait_for_server.sh 8002
Compare GPU memory usage between eager and CUDA graph modes. CUDA graphs pre-allocate memory for captured operations, reducing available KV cache space.
| Metric | Description | Unit |
|---|---|---|
| Decode latency (eager) | Per-token time without CUDA graphs | ms |
| Decode latency (graphs) | Per-token time with CUDA graphs | ms |
| Server startup time | Eager vs graph capture startup | s |
| Memory overhead | Extra GPU memory used by captured graphs | MB |
for bs in 1 4 16 64; do
# Eager mode
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--enforce-eager --batch-size $bs \
--input-len 256 --output-len 64 \
2>&1 | tee results_eager_bs${bs}.txt
# CUDA graph mode
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--batch-size $bs \
--input-len 256 --output-len 64 \
2>&1 | tee results_cudagraph_bs${bs}.txt
done
Plot the latency improvement ratio (eager/graph) for each batch size. The speedup is typically larger at small batch sizes where CPU launch overhead is a larger fraction of total time.
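A toy model of why the speedup shrinks with batch size: a fixed per-step CPU launch cost is amortized over more per-sequence compute. The constants below are illustrative assumptions, not A100 measurements:

```python
def step_time_ms(batch: int, compute_ms_per_seq: float = 0.05,
                 launch_ms: float = 2.0, use_graph: bool = False) -> float:
    # Eager mode pays full kernel-launch overhead each step; a captured
    # CUDA graph replays the step with a much smaller fixed cost (assumed).
    overhead = 0.1 if use_graph else launch_ms
    return overhead + compute_ms_per_seq * batch

for b in (1, 4, 16, 64):
    ratio = step_time_ms(b) / step_time_ms(b, use_graph=True)
    print(f"batch={b}: {ratio:.2f}x speedup from graphs")
```

Compare the shape of this curve with your measured eager/graph ratios.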
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--num-speculative-tokens 5 \
--port 8000
python benchmarks/benchmark_serving.py \
--backend vllm --port 8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 100 --request-rate 2 \
2>&1 | tee results_spec_draft.txt
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-model "[ngram]" \
--num-speculative-tokens 5 \
--ngram-prompt-lookup-max 4 \
--port 8002
python benchmarks/benchmark_serving.py \
--backend vllm --port 8002 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 100 --request-rate 2 \
2>&1 | tee results_spec_ngram.txt
for k in 1 3 5 7; do
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--speculative-model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--num-speculative-tokens $k \
--port 8010 &
sleep 30
python benchmarks/benchmark_serving.py \
--backend vllm --port 8010 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 100 --request-rate 2 \
2>&1 | tee results_spec_k${k}.txt
kill %1; sleep 5
done
Run the same speculative decoding config at request rates 1, 4, and 10. At low QPS, speculation reduces per-request latency. At high QPS, the overhead of draft model + verification can hurt throughput.
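The latency side of this trade-off follows from the acceptance rate. Under the standard i.i.d.-acceptance model, the expected number of tokens emitted per target forward pass with k drafted tokens and per-token acceptance probability alpha is sum over i=0..k of alpha^i:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    # One token is always emitted (the corrected/bonus token); the i-th
    # draft survives with probability alpha**i.
    # Closed form: (1 - alpha**(k + 1)) / (1 - alpha) for alpha < 1.
    return sum(alpha ** i for i in range(k + 1))

for alpha in (0.4, 0.6, 0.8):
    print(f"alpha={alpha}: {expected_tokens_per_step(alpha, 5):.2f} tokens/step")
```

Check your measured acceptance rates against this curve: past the point where the marginal alpha^i becomes tiny, raising num_speculative_tokens only adds draft and verification overhead.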
| Metric | Description | Unit |
|---|---|---|
| Acceptance rate | Fraction of draft tokens accepted by target | % |
| Per-request latency | With vs without speculation | ms |
| Throughput at high QPS | Spec decoding overhead under load | tok/s |
| Method | Extra Memory | Acceptance Rate | Best For |
|---|---|---|---|
| Draft Model | ~2-4 GB (TinyLlama) | Medium-High | General text, low QPS |
| N-gram | Negligible | Low (content-dependent) | Repetitive/templated text |
| EAGLE | ~0.5-1 GB (lightweight head) | High | Code, structured output |
# FP16 (baseline)
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 200 2>&1 | tee results_fp16.txt
# FP8
python benchmarks/benchmark_throughput.py \
--model neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--num-prompts 200 2>&1 | tee results_fp8.txt
# AWQ-INT4
python benchmarks/benchmark_throughput.py \
--model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
--quantization awq \
--num-prompts 200 2>&1 | tee results_awq.txt
# FP16
lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
--tasks gsm8k --batch_size auto \
2>&1 | tee eval_fp16.txt
# FP8
lm_eval --model vllm \
--model_args pretrained=neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 \
--tasks gsm8k --batch_size auto \
2>&1 | tee eval_fp8.txt
# AWQ-INT4
lm_eval --model vllm \
--model_args pretrained=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4,quantization=awq \
--tasks gsm8k --batch_size auto \
2>&1 | tee eval_awq.txt
Record GPU memory used by each model variant (from vLLM startup logs or nvidia-smi). FP16 ≈ 16 GB, FP8 ≈ 8 GB, INT4 ≈ 4 GB for model weights.
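The memory column is simple arithmetic over the 8B weight count (KV cache and activations come on top of this):

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    # 8 bits per byte; 1B params at 16 bits = 2 GB per billion params.
    return params_billion * bits_per_param / 8

for name, bits in (("FP16", 16), ("FP8", 8), ("AWQ-INT4", 4)):
    print(f"{name}: ~{weight_gb(8.0, bits):.0f} GB of weights")
```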
| Model | Memory (GB) | Throughput (tok/s) | GSM8K Accuracy |
|---|---|---|---|
| FP16 | ~16 | — | — |
| FP8 | ~8 | — | — |
| AWQ-INT4 | ~4 | — | — |
Two complementary effects drive throughput gains: smaller weights mean each memory-bound decode step streams fewer bytes from HBM, and the memory freed from weights leaves room for more KV cache blocks, allowing larger running batches.
Beyond GSM8K, test output quality with real prompts:
# Send the same complex prompt to FP16 and INT4 servers
# Compare outputs side by side for coherence, factual accuracy
for port in 8000 8002; do
curl -s http://localhost:$port/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "Explain the difference between TCP and UDP, including when to use each protocol.",
"max_tokens": 256,
"temperature": 0
}' | python -m json.tool > output_port${port}.json
done
diff output_port8000.json output_port8002.json
for tp in 1 2 4; do
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size $tp \
--batch-size 1 --input-len 512 --output-len 128 \
2>&1 | tee results_tp${tp}_latency.txt
done
for tp in 1 2 4; do
python benchmarks/benchmark_throughput.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size $tp \
--dataset ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 300 \
2>&1 | tee results_tp${tp}_throughput.txt
done
nsys profile -o tp2_trace --trace=cuda,nvtx,nccl \
python benchmarks/benchmark_latency.py \
--model meta-llama/Llama-3.1-8B-Instruct \
--tensor-parallel-size 2 \
--batch-size 1 --input-len 256 --output-len 32 --num-iters 3
Open in nsys GUI and identify NCCL all-reduce calls. Measure their duration and frequency per decode step.
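A rough estimate of per-step all-reduce volume to compare against what nsys shows, assuming Llama-3.1-8B's hidden size of 4096, 32 layers, BF16, and the usual two all-reduces per decoder layer (after the attention output projection and after the MLP down projection):

```python
def allreduce_bytes_per_step(batch: int = 1, hidden: int = 4096,
                             layers: int = 32, bytes_per_el: int = 2,
                             allreduces_per_layer: int = 2) -> int:
    # Each all-reduce carries one (batch, hidden) activation tensor;
    # actual wire traffic depends on the NCCL algorithm used.
    return batch * hidden * bytes_per_el * allreduces_per_layer * layers

print(f"{allreduce_bytes_per_step() / 2**20:.1f} MiB per decode token at TP>1")
```

At batch 1 the payload is tiny, so decode all-reduces are latency-bound rather than bandwidth-bound — which is what the nsys trace should show.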
| Metric | Description | Unit |
|---|---|---|
| Decode latency per TP | Per-token time at TP=1,2,4 | ms |
| All-reduce time | NCCL communication per decode step | us |
| Communication fraction | All-reduce time / total step time | % |
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
--tensor-parallel-size 2 \
--port 8000
python benchmarks/benchmark_serving.py \
--backend vllm --port 8000 \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--dataset-name sharegpt \
--dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 100 --request-rate 4 \
2>&1 | tee results_moe_tp2.txt
# Expert parallelism distributes experts across GPUs
vllm serve mistralai/Mixtral-8x7B-Instruct-v0.1 \
--tensor-parallel-size 2 \
--enable-expert-parallel \
--port 8002
# In SGLang with explicit EP
python -m sglang.launch_server \
--model mistralai/Mixtral-8x7B-Instruct-v0.1 \
--tp 2 --port 8003
Compare memory usage and throughput: Mixtral-8x7B has 46.7B total parameters but activates only ~12.9B per token (top-2 of 8 experts). Compare against Llama-3.1-8B, which has fewer total parameters but activates all of them.
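The memory-vs-compute asymmetry in numbers (parameter counts are those quoted above):

```python
# Mixtral-8x7B: memory scales with TOTAL params, per-token compute with ACTIVE.
TOTAL_B, ACTIVE_B = 46.7, 12.9  # billions of parameters, from the text above

def moe_memory_compute_ratio(total_b: float = TOTAL_B,
                             active_b: float = ACTIVE_B) -> float:
    return total_b / active_b

print(f"BF16 weights: ~{TOTAL_B * 2:.0f} GB held in HBM")
print(f"each token touches only 1/{moe_memory_compute_ratio():.1f} of them")
```

This is why MoE serving is memory-capacity-hungry but can match a much larger dense model's quality at dense-8B-like per-token compute.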
| Metric | Description | Unit |
|---|---|---|
| GPU memory (Mixtral) | Total memory for all expert weights | GB |
| Throughput (MoE vs dense) | Mixtral vs Llama-8B at similar quality | tok/s |
| Expert load balance | Token distribution across experts | % |
# Create LMCache config file: lmcache_config.yaml
cat > lmcache_config.yaml <<'EOF'
chunk_size: 256
local_device: "cpu"
remote_url: null
remote_serde: null
# Tiered storage config
storage:
- type: "gpu"
capacity_gb: 4
- type: "cpu"
capacity_gb: 16
- type: "disk"
path: "/tmp/lmcache_disk"
capacity_gb: 64
EOF
# Launch vLLM with LMCache
LMCACHE_CONFIG_FILE=lmcache_config.yaml vllm serve \
meta-llama/Llama-3.1-8B-Instruct \
--kv-transfer-config lmcache_config.yaml \
--port 8000
Send requests with a shared system prompt, shut down the server, restart, and send the same requests. With LMCache, the KV cache is restored from disk/CPU, eliminating re-computation.
# First run: populate cache
python benchmarks/benchmark_serving.py \
--backend vllm --port 8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 50 --request-rate 2 \
2>&1 | tee results_lmcache_cold.txt
# Restart server, re-run (cache should be warm from disk)
# Kill and restart the server, then:
python benchmarks/benchmark_serving.py \
--backend vllm --port 8000 \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 50 --request-rate 2 \
2>&1 | tee results_lmcache_warm.txt
for cs in 64 128 256 512 1024; do
# Update lmcache_config.yaml with chunk_size=$cs
sed -i "s/chunk_size: .*/chunk_size: $cs/" lmcache_config.yaml
LMCACHE_CONFIG_FILE=lmcache_config.yaml vllm serve \
meta-llama/Llama-3.1-8B-Instruct \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' \
--port 8010 &
sleep 30
python benchmarks/benchmark_serving.py \
--backend vllm --port 8010 \
--model meta-llama/Llama-3.1-8B-Instruct \
--num-prompts 100 --request-rate 4 \
2>&1 | tee results_lmcache_cs${cs}.txt
kill %1; sleep 5
done
| Metric | Description | Unit |
|---|---|---|
| TTFT (cold vs warm) | First token time before/after cache is populated | ms |
| Cache restore time | Time to load KV cache from disk on restart | ms |
| Throughput per chunk_size | Effect of cache granularity on performance | tok/s |
Observe the TTFT difference when KV cache is served from each tier:
# Test 1: Cold start (no cache anywhere)
# TTFT = full prefill computation time
# Test 2: Warm GPU cache (repeat same request)
# TTFT ≈ 0 (KV cache already on GPU)
# Test 3: CPU-only cache (restart server, GPU cache lost)
# TTFT = CPU→GPU transfer time only
# Test 4: Disk-only cache (clear CPU cache too)
# TTFT = Disk→GPU transfer time
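The four tier tests above can be driven by a small client that times the first streamed chunk directly, rather than relying on the benchmark harness. A minimal sketch against the OpenAI-compatible /v1/completions endpoint started earlier (the assumption that the first SSE line corresponds to the first token is worth verifying against your vLLM version):

```python
import json
import time
import urllib.request

# Time-to-first-token via streaming: elapsed time until the first SSE line
# arrives from the completions endpoint.
def measure_ttft(prompt: str, port: int = 8000) -> float:
    payload = json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": prompt,
        "max_tokens": 64,
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        f"http://localhost:{port}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.readline()  # first streamed line ≈ first token
        return (time.perf_counter() - start) * 1000  # ms

# Usage (requires a running server): send the same long prompt twice.
# The first call pays full prefill (cold); the second should hit the warm tier.
# prompt = "Repeat this context. " * 200
# print(measure_ttft(prompt), measure_ttft(prompt))
```

Repeating the pair of calls around a server restart distinguishes the GPU, CPU, and disk tiers.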
# Prefill instance (GPU 0)
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8000 \
--kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer"}'
# Decode instance (GPU 1)
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
--port 8001 \
--kv-transfer-config '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer"}'
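With the two instances above, clients need a thin proxy that routes each request through both pools. A sketch of the pattern used in vLLM's disaggregated-prefill examples (the assumption being that the connector delivers the KV cache from producer to consumer once both instances have seen the request): send the request to the prefill instance with max_tokens=1, then replay it on the decode instance, which reuses the transferred KV.

```python
import json
import urllib.request

PREFILL = "http://localhost:8000"  # kv_producer instance (GPU 0)
DECODE = "http://localhost:8001"   # kv_consumer instance (GPU 1)

def _post(base_url: str, body: dict) -> dict:
    req = urllib.request.Request(
        base_url + "/v1/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    return json.load(urllib.request.urlopen(req))

def generate(prompt: str, max_tokens: int = 64) -> str:
    base = {"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": prompt}
    _post(PREFILL, {**base, "max_tokens": 1})             # prefill only
    out = _post(DECODE, {**base, "max_tokens": max_tokens})  # decode reuses KV
    return out["choices"][0]["text"]
```

This is a serial sketch for clarity; the production examples overlap the two calls and handle failures.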
# SGLang with disaggregation
python -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode prefill \
--port 8002
python -m sglang.launch_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--disaggregation-mode decode \
--port 8003
Measure how KV cache transfer time scales with input sequence length. Longer inputs produce larger KV caches that take more time to transfer between prefill and decode GPUs.
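Before running the sweep, it helps to estimate the transfer sizes from Llama-3.1-8B's published config (32 layers, 8 KV heads under GQA, head_dim 128, BF16). The 25 GB/s effective link bandwidth below is an assumption for illustration; measure your own with a P2P bandwidth test:

```python
# Back-of-envelope KV cache size per token for Llama-3.1-8B (GQA).
layers, kv_heads, head_dim, dtype_bytes = 32, 8, 128, 2
# Factor of 2 for K and V tensors:
bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(bytes_per_token / 1024, "KiB per token")  # 128.0 KiB

for input_len in [128, 256, 512, 1024, 2048, 4096]:
    mib = input_len * bytes_per_token / 2**20
    # Assumed effective inter-GPU bandwidth of 25 GB/s (verify on your hardware):
    ms = input_len * bytes_per_token / 25e9 * 1e3
    print(f"{input_len:5d} tokens: {mib:7.1f} MiB, ~{ms:5.2f} ms @ 25 GB/s")
```

At 4096 input tokens the cache is ~512 MiB, so transfer time becomes a visible fraction of TTFT; compare these estimates against the measured sweep below.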
for input_len in 128 256 512 1024 2048 4096; do
python benchmarks/benchmark_serving.py \
--backend vllm --port 8001 \
--model meta-llama/Llama-3.1-8B-Instruct \
--dataset-name random \
--random-input-len $input_len --random-output-len 64 \
--num-prompts 50 --request-rate 2 \
2>&1 | tee results_pd_input${input_len}.txt
done
Design an experiment comparing standard serving vs disaggregated P/D on a mixed workload with both short and long prefills. When does disaggregation help?
| Metric | Description | Unit |
|---|---|---|
| KV transfer time | Time to send KV cache from prefill to decode GPU | ms |
| TTFT (P/D vs standard) | First-token latency comparison | ms |
| ITL (P/D vs standard) | Decode-side ITL without prefill interference | ms |
| GPU utilization (prefill vs decode) | Compute efficiency of each pool | % |
| Model | Parameters | Precision | Used In |
|---|---|---|---|
| meta-llama/Llama-3.1-8B-Instruct | 8B | BF16 | Weeks 1-11, 13, 15-16 |
| TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 1.1B | FP16 | Week 11 (draft model) |
| neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8 | 8B | FP8 | Weeks 5, 12 |
| hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 | 8B | INT4 | Week 12 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 46.7B | BF16 | Week 14 |
| Dataset | Description | Used In |
|---|---|---|
| ShareGPT_V3_unfiltered_cleaned_split.json | Real-world multi-turn conversations with natural length distribution | Weeks 1-5, 7-8, 11-14, 16 |
| generated-shared-prefix | Synthetic workload with shared prefixes (built into SGLang bench) | Weeks 6, 9 |
| GSM8K | Grade-school math benchmark for accuracy evaluation (via lm_eval) | Week 12 |
| Tool | Purpose |
|---|---|
| vllm serve | Launch vLLM OpenAI-compatible server |
| sglang.launch_server | Launch SGLang server |
| benchmark_serving.py | Online serving benchmark (vLLM) |
| bench_serving | Online serving benchmark (SGLang) |
| benchmark_throughput.py | Offline throughput benchmark (vLLM) |
| benchmark_latency.py | Single-batch latency profiling (vLLM) |
| nsys / ncu | NVIDIA Nsight Systems / Nsight Compute GPU profiling |
| lm_eval | Language model evaluation harness |
| nvidia-smi dmon | GPU monitoring (power, utilization, clocks, memory) |
| Paper | Venue | Relevant Weeks |
|---|---|---|
| Orca: A Distributed Serving System for Transformer-Based Generative Models | OSDI 2022 | Week 7 |
| Efficient Memory Management for Large Language Model Serving with PagedAttention | SOSP 2023 | Weeks 4-5 |
| SGLang: Efficient Execution of Structured Language Model Programs | NeurIPS 2024 | Weeks 6, 9 |
| Fast Inference from Transformers via Speculative Decoding | ICML 2023 | Week 11 |
| Splitwise: Efficient generative LLM inference using phase splitting | ISCA 2024 | Week 16 |
| LMCache: Optimizing KV Cache Sharing Across LLM Serving Instances | arXiv 2024 | Week 15 |
| Component | Weight | Description |
|---|---|---|
| Experiment Execution | 30% | All experiments completed, commands run correctly, results captured |
| Metrics Collection | 20% | All required metrics recorded in tables/plots, units correct |
| Source Code Reading | 15% | Evidence of reading the specified files, key functions identified and explained |
| Written Analysis | 30% | Thoughtful answers to analysis questions, supported by data, correct reasoning |
| Presentation | 5% | Clear formatting, labeled plots, organized report |
Client Request
↓
api_server.py (FastAPI)
↓
AsyncLLM (async engine)
↓
EngineCore (scheduler + executor)
↓
Worker (model runner on GPU)
↓
ModelRunner (forward pass)
Client Request
↓
TokenizerManager (HTTP + tokenize)
↓
Scheduler (RadixCache + policy)
↓
TpModelWorker (TP group)
↓
ModelRunner (forward pass)
Compute
Memory
Interconnect