Week 0 — PACE Environment Setup

Learning Objectives

By the end of this page you will have:

A working SSH alias pace that gets you from your laptop to a PACE login node in one command.
Two conda envs in project storage: vllm-lab pinned to vllm==0.6.3 for Weeks 1–5, 7, 8, 10, 11, 12, 13, 14, 15, 16; and sglang-lab pinned to sglang==0.5.10.post1 for Weeks 1 (SGLang half), 6, 9.
~285 GB of model weights pre-downloaded into $HFCACHE: Llama-3.1-8B-Instruct, TinyLlama-1.1B, Llama-3.1-8B-Instruct-FP8, Llama-3.1-8B-Instruct-AWQ-INT4, Llama-3.1-70B-Instruct (optional for Week 13).
A hello-world Slurm job that launches a vLLM server on an H100, verifies /health is reachable, sends one completion request, and shuts down with kill -2.
A troubleshooting playbook for the ten most common PACE failure modes you will hit in Weeks 1–16 (FA2 sm_80 error, OOM, PENDING priority, sm_120 Blackwell kernels, etc.).

  

Substitute these placeholders throughout the page before running anything:

Placeholder	Example used on this page	Replace with
`YOUR_GTID`	`hlin464`	Your GT username (the one you use for SSH)
`YOUR_ACCOUNT`	`gts-rs275-paid`	Your course's paid Slurm account (check with `pace-quota`)
`YOUR_PROJECT`	`r-rs275-0`	Your course's project storage share (`/storage/project/$YOUR_PROJECT/$YOUR_GTID/`)
`$HFCACHE`	`/storage/project/r-rs275-0/hlin464/hfcache/hub`	Any large writable path — don't use $HOME because PACE home is only ~10 GB

Prerequisites

What you need

A PACE Phoenix account with membership in a project with paid GPU access. The labs assume --account=gts-rs275-paid — substitute your own account name. To check your accounts: pace-quota
SSH key on your local machine (the PACE OnDemand portal also works but command-line access is much faster for these labs)
Georgia Tech VPN or campus network. PACE login nodes are not reachable from off-campus without the VPN.
Hugging Face account with a token, AND a request submitted for the gated Llama models. Visit https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct and click "Agree and access repository"
Comfort with bash, Python, and conda. We will not teach these basics here.

PACE Hard Rules — Read These First:

Never run GPU work on the login node. Login nodes have ~4GB RAM and no GPU. Use sbatch or salloc.
Never kill -9 a vLLM/SGLang process. SIGKILL during CUDA execution can leave the GPU driver in a bad state that only a node reboot clears. Always use kill -2 (SIGINT) for graceful shutdown.
Never compile / pip install large packages on the login node. Use salloc to grab a CPU compute node first.
Always use the paid account + inferno queue for GPU work. The free queue (embers) can preempt your job at any moment, killing in-progress benchmarks.
Use a separate venv/conda env for these labs. Don't pip-install anything into a shared system Python or someone else's environment.

SSH

SSH access & SSH config

1. Generate a key (one time)

# On your local machine
ssh-keygen -t ed25519 -C "my-name@gatech.edu" -f ~/.ssh/id_ed25519_pace
ssh-copy-id -i ~/.ssh/id_ed25519_pace.pub yourgtid@login-phoenix.pace.gatech.edu

2. Set up SSH config (saves typing for the rest of the semester)

# Add to ~/.ssh/config on your local machine
Host pace
    HostName login-phoenix.pace.gatech.edu
    User yourgtid
    IdentityFile ~/.ssh/id_ed25519_pace
    ServerAliveInterval 60
    ServerAliveCountMax 5
    # ControlMaster reuses one TCP connection across multiple ssh calls
    # — you only have to type your password / 2FA once per ~10 min
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600

mkdir -p ~/.ssh/sockets
chmod 700 ~/.ssh ~/.ssh/sockets

3. Test the connection

ssh pace "hostname && pace-quota"

If you see your storage usage and account balances printed, SSH works.

Storage

PACE storage layout

PACE has THREE storage tiers, each with different size, speed, persistence, and access permissions. Putting things in the wrong tier is the #1 cause of beginner pain.

Tier	Path	Quota	Persistence	Use For
Home	`~/` = `/storage/home/hcoda1/X/yourgtid`	~10 GB / 1M files	Permanent, backed up	dotfiles, shell config, small scripts
Project	`~/r-rs275-0` → `/storage/project/r-rs275-0/yourgtid`	1 TB shared	Permanent (group)	conda envs, model weights, source code, results
Scratch	`/storage/scratch1/X/yourgtid`	15 TB	Purged after 60 days inactive	dataset files, intermediate experiment outputs, large logs

Recommended layout for these labs:

# Conda envs — large, want them permanent
/storage/project/r-rs275-0/yourgtid/.conda_envs/

# Model weights — very large, want them permanent
/storage/project/r-rs275-0/yourgtid/hfcache/hub/

# Project source code — small, want them permanent
/storage/project/r-rs275-0/yourgtid/AI-inference-hands-on-note/

# Datasets — bulky, may regenerate, OK to put in scratch
/storage/scratch1/X/yourgtid/data/

Note: Many labs at GT use a symlink so ~/r-rs275-0 points to the project storage. If your account has it, just use ~/r-rs275-0/... as the path.

Check your quotas

pace-quota
# Shows: home, scratch, project usage and account balances

Conda

Conda environments — vLLM and SGLang separately

Why two separate environments? vLLM and SGLang have conflicting torch and transformers requirements. SGLang 0.5.x requires torch==2.9.x and transformers==5.x, whereas vLLM builds pin an older torch and transformers<5 (vllm-continuum requires torch==2.8.0, and the vllm==0.6.3 wheel installed below pins its own torch version). Installing both into one env breaks server startup. We learned this the hard way.

1. Allocate a CPU node first (DO NOT install on login node)

# Get an interactive 1-hour CPU shell with 16GB RAM
salloc -N1 --cpus-per-task=4 --mem=16G -t 01:00:00 \
       -A gts-rs275-paid -q inferno -p cpu-small
# Wait for prompt to change — you are now on a compute node

2. Create the vLLM environment

module load anaconda3
eval "$(conda shell.bash hook)"

ENV=/storage/project/r-rs275-0/$USER/.conda_envs/vllm-lab
conda create -p $ENV python=3.11 -y
conda activate $ENV

# Install vLLM (pinned to a stable version compatible with this lab)
pip install --no-cache-dir "vllm==0.6.3"
pip install --no-cache-dir accelerate matplotlib pandas

# Verify
python -c "import vllm; print(f'vLLM {vllm.__version__}')"
python -c "import torch; print(f'torch {torch.__version__}')"

3. Create the SGLang environment (separate)

ENV2=/storage/project/r-rs275-0/$USER/.conda_envs/sglang-lab
conda create -p $ENV2 python=3.11 -y
conda activate $ENV2

pip install --no-cache-dir "sglang[all]==0.5.10.post1"
pip install --no-cache-dir matplotlib pandas cachetools

# Verify
python -c "import sglang; print(f'SGLang {sglang.__version__}')"
python -c "import torch; print(f'torch {torch.__version__}')"

4. Exit the salloc session

exit
# You are back on the login node. The envs are saved permanently in project storage.
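
The one-line verification commands above can also be wrapped in a small script that compares installed versions against the pins on this page. This is a minimal sketch, not part of the lab scripts: check_pins and the two pin dictionaries are hypothetical names, and it relies only on the standard-library importlib.metadata.

```python
from importlib import metadata

# Pins taken from this page; adjust them if your course updates the versions.
VLLM_LAB_PINS = {"vllm": "0.6.3"}
SGLANG_LAB_PINS = {"sglang": "0.5.10.post1"}


def check_pins(pins):
    """Return {package: (expected, installed)} for every pin that does not match.

    `installed` is None when the package is missing from the active env.
    """
    mismatches = {}
    for package, expected in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[package] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    # Run inside the activated vllm-lab env; use SGLANG_LAB_PINS for sglang-lab.
    problems = check_pins(VLLM_LAB_PINS)
    for package, (expected, installed) in problems.items():
        print(f"{package}: expected {expected}, found {installed}")
    if not problems:
        print("All pins match.")
```

Running it in the wrong env (for example, vllm-lab pins inside sglang-lab) should report the package as missing, which is a quick way to confirm which env is active.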

Recipe: which env do I activate for which week?

Weeks	Environment
1, 2, 3, 4, 5, 7, 8, 10, 11, 12, 15	`vllm-lab`
6, 9	`sglang-lab`
1, 2, 14, 16 (both, run separately)	`vllm-lab` + `sglang-lab`
13 (needs 2+ GPUs)	`vllm-lab`

Models

Downloading model weights

1. Set up Hugging Face authentication

# Get a token from https://huggingface.co/settings/tokens (read access is enough)
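# Add to your ~/.bashrc on PACE
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export HF_HOME=/storage/project/r-rs275-0/$USER/hfcache/hub
# By default the hub cache is $HF_HOME/hub; point it at $HF_HOME so that
# huggingface-cli writes the models--* folders directly under $HF_HOME
export HF_HUB_CACHE=$HF_HOME
export TRANSFORMERS_CACHE=$HF_HOME
export HFCACHE=$HF_HOME   # the $HFCACHE name used elsewhere on this page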

# Apply to current shell
source ~/.bashrc

2. Download the models used in these labs

Do this in a CPU node session. huggingface-cli download is network- and disk-bound, so no GPU is needed.

salloc -N1 --cpus-per-task=4 --mem=16G -t 02:00:00 \
       -A gts-rs275-paid -q inferno -p cpu-small

module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /storage/project/r-rs275-0/$USER/.conda_envs/vllm-lab

# Llama-3.1-8B-Instruct (16 GB, used in weeks 1-12, 15)
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
    --local-dir $HF_HOME/Llama-3.1-8B-Instruct

# TinyLlama-1.1B (2 GB, used as draft model in week 11)
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0

# FP8 and AWQ quantized variants (week 12)
huggingface-cli download neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4

# Llama-3.1-70B-Instruct (140 GB BF16, week 13 TP, week 14 reference)
# Skip if you don't need 70B — this takes ~30 min and 263 GB on disk
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
    --local-dir $HF_HOME/Llama-3.1-70B-Instruct

exit

3. Verify the downloads

ls -la $HF_HOME/
# Llama-3.1-8B-Instruct/  (~16 GB)
# models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/  (HF cache format)
# models--neuralmagic--Meta-Llama-3.1-8B-Instruct-FP8/
# models--hugging-quants--Meta-Llama-3.1-8B-Instruct-AWQ-INT4/
# Llama-3.1-70B-Instruct/  (263 GB)

# Total: ~285 GB — check your project quota allows this
du -sh $HF_HOME

Why use HF_HUB_OFFLINE=1 in scripts? Llama models are gated, so even with weights downloaded locally, vLLM will try to fetch config.json from HF Hub at startup. If your token isn't readable in the compute node environment, the request gets a 401 error and the server crashes. Setting export HF_HUB_OFFLINE=1 in every sbatch script tells the Hugging Face libraries to use ONLY the local cache, with no network calls. We use this in every experiment script.
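
Because an offline start can only succeed if the files are already on disk, a short pre-flight check before launching the server may save a queue wait. The sketch below is an assumption-laden illustration: preflight is a hypothetical helper, and it only understands the flat layout produced by --local-dir downloads (config.json plus *.safetensors in one folder), not the models--* cache layout.

```python
import os
from pathlib import Path


def preflight(model_dir):
    """Return a list of problems that would likely break an offline server start."""
    root = Path(model_dir)
    if not root.is_dir():
        return [f"model directory not found: {root}"]
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json is missing")
    if not any(root.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    if os.environ.get("HF_HUB_OFFLINE") != "1":
        problems.append("HF_HUB_OFFLINE is not set to 1")
    return problems


if __name__ == "__main__":
    # Expands $HF_HOME from the environment set up earlier on this page.
    target = os.path.expandvars("$HF_HOME/Llama-3.1-8B-Instruct")
    for problem in preflight(target):
        print("PROBLEM:", problem)
```

An empty list means the three conditions checked here hold; it does not prove the weights are complete or uncorrupted.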

Slurm

Slurm: GPU types, queues, accounts

Available GPU partitions

Partition	GRES name	VRAM	SM	FA2?	vLLM?	CPU:GPU max
gpu-v100	v100	32 GB	7.0	❌	❌	—
gpu-rtx6000	rtx_6000	24 GB	7.5	❌	❌	6:1
gpu-a100	A100	40 GB	8.0	✓	✓	8:1
gpu-l40s	L40S	48 GB	8.9	✓	✓	4:1
gpu-h100	H100	80 GB	9.0	✓	✓	8:1
gpu-h200	H200	141 GB	9.0	✓	✓	16:1
gpu-rtxpro-blackwell	rtx_pro_6000_blackwell	96 GB	12.0	✓	~	—

V100 and RTX 6000 do NOT work with vLLM v1. Their compute capability is below 8.0, so FlashAttention 2 (FA2) is unsupported. The vLLM engine v1 hard-crashes during model load. Use A100 or newer.

Account & queue choice

Combination	Cost	Time limit	Preemption?	Use for these labs?
`-A gts-rs275 -q embers`	free	8 hr	✓ (yes)	❌ risky — benchmarks die mid-run
`-A gts-rs275-paid -q inferno`	$0.67/GPU-hr	unlimited	❌ (no)	✓ recommended
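
For budgeting, the paid rate in the table can be turned into an upper bound per job. This sketch assumes billing is linear in GPU-hours at the $0.67 rate shown above and ignores any CPU or memory charges; job_cost_usd is a hypothetical helper name.

```python
def job_cost_usd(gpus, hours, rate_per_gpu_hour=0.67):
    """Upper bound on the charge for a job that runs to its full time limit."""
    return gpus * hours * rate_per_gpu_hour


# The hello-world job on this page: 1 GPU with a 30-minute limit.
print(f"${job_cost_usd(1, 0.5):.3f}")
```

A job that finishes early should cost less than this bound, since the time limit is only a ceiling.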

Slurm command cheat sheet

# Interactive 30-min H100 shell
salloc -N1 --gres=gpu:H100:1 --cpus-per-task=8 --mem=64G -t 00:30:00 \
       -A gts-rs275-paid -q inferno

# Submit a batch job
sbatch myjob.sbatch

# See your jobs
squeue -u $USER --format='%.10i %.20j %.10T %.12M %.20R'

# Cancel a job
scancel JOBID

# See queue depth for a partition
squeue -p gpu-h100 --format='%.8T' | sort | uniq -c

# Check available nodes in a partition
sinfo -p gpu-h100 --format='%P %a %D %T %N'
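
The queue-depth pipeline above (sort | uniq -c) has a direct Python equivalent, which is handy when a lab script needs the counts rather than printed text. A minimal sketch; count_states is a hypothetical helper, and the sample string stands in for real squeue output.

```python
from collections import Counter


def count_states(squeue_output):
    """Count jobs per state from `squeue --format='%.8T'` output, skipping the header."""
    lines = [line.strip() for line in squeue_output.splitlines() if line.strip()]
    return Counter(line for line in lines if line != "STATE")


# Stand-in for: subprocess.run(["squeue", "-p", "gpu-h100", "--format=%.8T"],
#                              capture_output=True, text=True).stdout
sample = """   STATE
 RUNNING
 PENDING
 PENDING
"""
print(count_states(sample))
```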

First Job

Your first vLLM job: hello world

Use this template as your starting point. It is the minimum sbatch that will reliably start a vLLM server, send one request, and shut down cleanly.

1. Create the sbatch file

#!/bin/bash
# Save as ~/r-rs275-0/AI-inference-hands-on-note/sbatch/hello_vllm.sbatch
# (the #!/bin/bash line must stay first, or sbatch rejects the script)
#SBATCH --job-name=hello-vllm
#SBATCH --gres=gpu:H100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=00:30:00
#SBATCH --account=gts-rs275-paid
#SBATCH -q inferno
#SBATCH --output=hello_vllm_%j.out
#SBATCH --error=hello_vllm_%j.err

set -euo pipefail
echo "=== START: $(date) ==="

# Activate environment
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /storage/project/r-rs275-0/$USER/.conda_envs/vllm-lab

# Use local model cache, force offline (no HF Hub calls)
export HF_HOME=/storage/project/r-rs275-0/$USER/hfcache/hub
export TRANSFORMERS_CACHE=$HF_HOME
export HF_HUB_OFFLINE=1

# Sanity check
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
python -c 'import torch; print(f"torch {torch.__version__}, cuda {torch.cuda.is_available()}")'

# Start the server in the background, write its log to a file
LOGFILE=$PWD/server_$SLURM_JOB_ID.log
python -m vllm.entrypoints.openai.api_server \
    --model $HF_HOME/Llama-3.1-8B-Instruct \
    --port 8000 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --disable-log-requests \
    > $LOGFILE 2>&1 &
SERVER_PID=$!
echo "Server PID: $SERVER_PID"

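# Wait for the /health endpoint to respond (max 5 minutes)
READY=0
for i in $(seq 1 60); do
    sleep 5
    if ! kill -0 $SERVER_PID 2>/dev/null; then
        echo "SERVER CRASHED"; tail -30 $LOGFILE; exit 1
    fi
    # -f makes curl fail on HTTP errors, so only a successful /health counts
    if curl -sf http://localhost:8000/health >/dev/null 2>&1; then
        READY=1; echo "READY after $((i*5))s"; break
    fi
done
# Without this check a timed-out loop would fall through to the test request
if [ "$READY" -ne 1 ]; then
    echo "TIMEOUT: /health not reachable after 300s"; tail -30 $LOGFILE
    kill -2 $SERVER_PID 2>/dev/null || true
    exit 1
fi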

# Send a test request
curl -s http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "/storage/project/r-rs275-0/'$USER'/hfcache/hub/Llama-3.1-8B-Instruct",
       "prompt": "The capital of France is",
       "max_tokens": 16
     }' | python -m json.tool

# GRACEFUL shutdown — SIGINT, never SIGKILL
echo "=== Stopping server (SIGINT) ==="
kill -2 $SERVER_PID 2>/dev/null || true
wait $SERVER_PID 2>/dev/null || true

echo "=== END: $(date) ==="
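
Later labs drive the server from Python rather than bash, so the same wait-for-/health logic is sketched here in Python. This is an illustration under assumptions, not the course's own helper: wait_for_health and its parameters are hypothetical names, and it uses only the standard library.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(url, timeout_s=300, interval_s=5, is_alive=lambda: True):
    """Poll `url` until it returns HTTP 200 and return the elapsed seconds.

    Raises RuntimeError if `is_alive()` turns False (the server process died)
    and TimeoutError if the endpoint is still unhealthy after `timeout_s`.
    """
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if not is_alive():
            raise RuntimeError("server process exited before becoming healthy")
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return time.monotonic() - start
        except (urllib.error.URLError, OSError):
            pass  # not listening yet or returned an error; retry after the sleep
        time.sleep(interval_s)
    raise TimeoutError(f"{url} not healthy after {timeout_s}s")
```

With a subprocess.Popen handle named proc, a call might look like wait_for_health("http://localhost:8000/health", is_alive=lambda: proc.poll() is None), which mirrors the kill -0 check in the sbatch script.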

2. Submit and watch

# Submit
sbatch hello_vllm.sbatch
# → "Submitted batch job 1234567"

# Watch the queue (Ctrl+C to stop watching)
watch -n 5 squeue -u $USER

# When the job starts running, tail the live output
tail -f hello_vllm_1234567.out

3. What success looks like

=== START: Thu Apr 9 23:00:00 EDT 2026 ===
NVIDIA H100 80GB HBM3, 81559 MiB
torch 2.8.0+cu126, cuda True
Server PID: 12345
READY after 25s
{
    "id": "cmpl-abc123",
    "choices": [{"text": " Paris, located in the north...", ...}],
    "usage": {"completion_tokens": 16, ...}
}
=== Stopping server (SIGINT) ===
=== END: Thu Apr 9 23:00:50 EDT 2026 ===

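If you see Paris in the response, your environment is fully working. Version strings, PIDs, and timings in your log may differ from this sample, depending on which vLLM build you installed. Proceed to Week 1.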

Troubleshooting

Common errors and fixes

These are the errors we actually hit while building these labs. If you see one, the fix is probably here.

1. Server hangs forever / Health check times out at 300s

Symptom: The server log shows model loading completes, but the Python parent never gets "READY".

Cause: If you launched the server with subprocess.Popen(stdout=PIPE) in Python and never read the pipe, the pipe buffer (~64KB) fills and the server blocks on every print. The server is alive but stuck. This is a deadlock, not a startup problem.

Fix: Always redirect stdout to a file:

import subprocess

log_file = open("/tmp/server.log", "w")  # the server writes here instead of into a pipe
proc = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)  # cmd: the server argv list
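
To see why the file redirect avoids the deadlock, the sketch below launches a stand-in child that prints about 1 MB, far more than the ~64KB pipe buffer mentioned above, and lets it finish without anyone reading its output. launch_with_log is a hypothetical helper; the noisy child is a placeholder for the real server command.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def launch_with_log(cmd, log_path):
    """Start `cmd` with stdout and stderr appended to `log_path`.

    The child writes straight to the file, so there is no pipe buffer to fill.
    """
    log_file = open(log_path, "ab")
    proc = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    log_file.close()  # the child keeps its own copy of the file descriptor
    return proc


if __name__ == "__main__":
    # Stand-in for the server: prints ~1 MB and exits.
    noisy = [sys.executable, "-c", "print('x' * 1_000_000)"]
    log = Path(tempfile.gettempdir()) / "noisy_child.log"
    proc = launch_with_log(noisy, log)
    proc.wait(timeout=30)
    print("exit code:", proc.returncode, "log bytes:", log.stat().st_size)
```

Swapping stdout=log_file for stdout=subprocess.PIPE in the same experiment, and never reading the pipe, would be expected to leave the child blocked, which is the hang described in this item.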

2. ValueError: KV cache memory ... larger than the available KV cache memory

Symptom: ValueError: To serve at least one request with the model's max seq len (131072), 16.00 GiB KV cache is needed, which is larger than the available KV cache memory (12.74 GiB)

Cause: vLLM defaults to max_model_len = max position embedding, which for Llama-3.1 is 131,072. Reserving KV cache for even a single 131K-token sequence takes 16 GiB, more than is left on the GPU after the weights are loaded.

Fix: Always pass --max-model-len 4096 (or 8192) and --gpu-memory-utilization 0.90 to the vLLM server.
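
The 16.00 GiB in the error message can be reproduced by hand, which also shows how much a shorter --max-model-len saves. The sketch assumes the published Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) and a 2-byte BF16 cache; kv_cache_gib is a hypothetical helper name.

```python
def kv_cache_gib(seq_len, num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache needed for one sequence of `seq_len` tokens, in GiB."""
    # Factor 2: every layer stores both a key and a value vector per KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return seq_len * bytes_per_token / 2**30


for n in (131072, 8192, 4096):
    print(n, kv_cache_gib(n))
# 131072 16.0
# 8192 1.0
# 4096 0.5
```

Each token costs 128 KiB of cache under these assumptions, so capping the context at 4096 tokens reserves 0.5 GiB per sequence instead of 16 GiB.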

3. 401 Unauthorized: Cannot access gated repo for url ...

Symptom: Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/resolve/main/config.json

Cause: vLLM tries to validate the model from HF Hub even if you pass a local path, because it parses the model id and falls back to Hub for config.json. The compute node may not have your HF token.

Fix: Set HF_HUB_OFFLINE=1 in the sbatch script — forces use of local cache only:

export HF_HUB_OFFLINE=1
# And use the absolute local path, not the HF model id
python -m vllm.entrypoints.openai.api_server \
    --model $HF_HOME/Llama-3.1-8B-Instruct \
    ...

4. ImportError: TokenizersBackend has no attribute all_special_tokens_extended

Cause: transformers 5.x removed this attribute. vLLM < 0.7 still calls it. You probably installed SGLang into the same env as vLLM, which upgraded transformers.

Fix: Downgrade transformers in the vLLM env, OR (better) keep vLLM and SGLang in separate envs as recommended in the Conda section of this page.

pip install --no-cache-dir 'transformers<5'

5. Cannot use FA version 2 ... compute capability >= 8

Cause: You are running on V100 (sm 7.0) or RTX 6000 Turing (sm 7.5). FlashAttention 2 needs Ampere (sm 8.0) or newer.

Fix: Use A100, L40S, H100, or H200 instead. Change --gres=gpu:V100:1 to --gres=gpu:A100:1.

6. Maximum CPU:GPU ratio of N:1 for gpu-X node class

Cause: Each PACE GPU partition has a max CPUs-per-GPU ratio. L40S is 4:1, A100 is 8:1, RTX 6000 is 6:1, H200 is 16:1. You requested too many CPUs per GPU.

Fix: Lower --cpus-per-task. For 1 L40S use 4, for 1 A100 use 8, for 1 H200 use 16.

7. Job stays PENDING with reason (Priority) for hours

Cause: The cluster is busy and many higher-priority jobs are ahead of you. squeue -p gpu-h100 --format='%.8T' | sort | uniq -c shows the queue depth.

Fix: Submit the same job to multiple GPU types in parallel (one sbatch per GPU type, each writing to its own results subdirectory tagged by GPU). Whichever job starts first does the run; cancel the others with scancel. We use this trick across the labs.
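
One way to script the multi-GPU submission is to generate one sbatch command per GPU type, capping CPUs at the per-partition ratio from item 6. This is a dry-run sketch under assumptions: build_sbatch_commands and the RESULTS_DIR variable are hypothetical, the script itself would need to read RESULTS_DIR, and the sketch relies on sbatch command-line options taking precedence over #SBATCH lines in the file.

```python
import shlex

# Max CPUs per GPU, copied from the partition table on this page.
MAX_CPUS_PER_GPU = {"A100": 8, "L40S": 4, "H100": 8, "H200": 16}


def build_sbatch_commands(script, gpu_types, cpus_wanted=8, results_root="results"):
    """Return one sbatch argv per GPU type, each tagged with its own results dir."""
    commands = []
    for gpu in gpu_types:
        cpus = min(cpus_wanted, MAX_CPUS_PER_GPU[gpu])
        commands.append([
            "sbatch",
            f"--gres=gpu:{gpu}:1",
            f"--cpus-per-task={cpus}",
            f"--job-name=hello-{gpu.lower()}",
            f"--export=ALL,RESULTS_DIR={results_root}/{gpu}",
            script,
        ])
    return commands


for argv in build_sbatch_commands("hello_vllm.sbatch", ["H100", "A100", "L40S"]):
    print(shlex.join(argv))  # dry run; pass argv to subprocess.run to submit
```

Printing first makes it easy to review the commands before spending allocation; once one job starts, the remaining job IDs can be passed to scancel.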

8. Engine core initialization failed (Blackwell sm_120)

Cause: Your vLLM build does not have CUDA kernels compiled for sm_120 (Blackwell). The continuum vLLM fork on PACE is compiled for sm_70/75/80/89/90 only.

Fix: Use H100/H200 instead of RTX Pro 6000 Blackwell, OR build vLLM yourself with TORCH_CUDA_ARCH_LIST including 12.0.

9. HuggingFace baseline crashes: Using a `device_map` requires `accelerate`

Cause: The accelerate package is missing. transformers requires it for any device_map usage (which we use to put the model on the GPU).

Fix: pip install accelerate

10. benchmark_serving.py prints DEPRECATED instead of running

Cause: Newer vLLM (~0.7+) deprecated the standalone benchmark scripts in favor of vllm bench serve, but in some forks the new CLI is not yet wired up.

Fix: Either pin to vLLM 0.6.x, or use a custom client that calls the OpenAI HTTP API directly (we use one in scripts/bench_client.py).
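
A client that calls the OpenAI-compatible HTTP API directly can be very small. The sketch below is not the course's scripts/bench_client.py; it is a hypothetical single-request client (complete is an assumed name) that posts to the same /v1/completions endpoint used in the hello-world job and times the round trip with the standard library only.

```python
import json
import time
import urllib.request


def complete(base_url, model, prompt, max_tokens=16, timeout_s=60):
    """POST one request to /v1/completions and return (text, latency_seconds)."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        body = json.load(resp)
    latency = time.perf_counter() - start
    return body["choices"][0]["text"], latency
```

Against the hello-world server, a call would pass "http://localhost:8000" as base_url and the same absolute model path given to --model. A real benchmark would add concurrency and token-level timing, which this sketch deliberately leaves out.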

You're ready — what's next: If your hello-world job ran and printed Paris, you have everything needed for the labs. Proceed to Week 1: First LLM Serving →