The complete tutorial for getting vLLM and SGLang running on the PACE Phoenix HPC cluster. Read this BEFORE Week 1. Covers SSH, storage layout, conda environments, model downloads, Slurm submission, GPU selection, and troubleshooting the most common errors.
- An SSH alias, pace, that gets you from your laptop to a PACE login node in one command.
- Two conda environments: vllm-lab, pinned to vllm==0.6.3, for Weeks 1–5, 7, 8, 10, 11, 12, 13, 14, 15, 16; and sglang-lab, pinned to sglang==0.5.10.post1, for Weeks 1 (SGLang half), 6, 9.
- Model weights under $HFCACHE: Llama-3.1-8B-Instruct, TinyLlama-1.1B, Llama-3.1-8B-Instruct-FP8, Llama-3.1-8B-Instruct-AWQ-INT4, Llama-3.1-70B-Instruct (optional for Week 13).
- An sbatch script that starts a vLLM server, waits until /health is reachable, sends one completion request, and shuts down with kill -2.

| Placeholder | Example used on this page | Replace with |
|---|---|---|
| YOUR_GTID | hlin464 | Your GT username (the one you use for SSH) |
| YOUR_ACCOUNT | gts-rs275-paid | Your course's paid Slurm account (check with pace-quota) |
| YOUR_PROJECT | r-rs275-0 | Your course's project storage share (/storage/project/$YOUR_PROJECT/$YOUR_GTID/) |
| $HFCACHE | /storage/project/r-rs275-0/hlin464/hfcache/hub | Any large writable path. Don't use $HOME, because PACE home is only ~10 GB |
- All examples use --account=gts-rs275-paid; substitute your own account name. To check your accounts, run pace-quota.
- Run compute work through sbatch or salloc.
- Never kill -9 a vLLM/SGLang process. SIGKILL during CUDA execution can corrupt the driver state and require a node reboot. Always use kill -2 (SIGINT) for graceful shutdown.
- For environment installs and model downloads, use salloc to grab a CPU compute node first.
- The free queue (embers) can preempt your job at any moment, killing in-progress benchmarks.

# On your local machine
ssh-keygen -t ed25519 -C "my-name@gatech.edu" -f ~/.ssh/id_ed25519_pace
ssh-copy-id -i ~/.ssh/id_ed25519_pace.pub yourgtid@login-phoenix.pace.gatech.edu
# Add to ~/.ssh/config on your local machine
Host pace
    HostName login-phoenix.pace.gatech.edu
    User yourgtid
    IdentityFile ~/.ssh/id_ed25519_pace
    ServerAliveInterval 60
    ServerAliveCountMax 5
    # ControlMaster reuses one TCP connection across multiple ssh calls,
    # so you only type your password / 2FA once per ~10 min
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600
mkdir -p ~/.ssh/sockets
chmod 700 ~/.ssh ~/.ssh/sockets
ssh pace "hostname && pace-quota"
If you see your storage usage and account balances printed, SSH works.
PACE has THREE storage tiers, each with different size, speed, persistence, and access permissions. Putting things in the wrong tier is the #1 cause of beginner pain.
| Tier | Path | Quota | Persistence | Use For |
|---|---|---|---|---|
| Home | ~/ = /storage/home/hcoda1/X/yourgtid | 20 GB / 1M files | Permanent, backed up | dotfiles, shell config, small scripts |
| Project | ~/r-rs275-0 → /storage/project/r-rs275-0/yourgtid | 1 TB shared | Permanent (group) | conda envs, model weights, source code, results |
| Scratch | /storage/scratch1/X/yourgtid | 15 TB | Purged after 60 days inactive | dataset files, intermediate experiment outputs, large logs |
# Conda envs — large, want them permanent
/storage/project/r-rs275-0/yourgtid/.conda_envs/
# Model weights — very large, want them permanent
/storage/project/r-rs275-0/yourgtid/hfcache/hub/
# Project source code — small, want them permanent
/storage/project/r-rs275-0/yourgtid/AI-inference-hands-on-note/
# Datasets — bulky, may regenerate, OK to put in scratch
/storage/scratch1/X/yourgtid/data/
Note: Many labs at GT use a symlink so ~/r-rs275-0 points to the project storage. If your account has it, just use ~/r-rs275-0/... as the path.
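If you script against this layout, it can help to build the paths in one place instead of hard-coding them. The sketch below is only an illustration: the helper name lab_paths and its dictionary keys are hypothetical, and the scratch directory is passed in as-is because the X component of the scratch path differs per user.

```python
from pathlib import PurePosixPath


def lab_paths(project: str, gtid: str, scratch_root: str) -> dict:
    """Return the recommended locations for envs, weights, code, and data.

    project      -- project storage share, e.g. "r-rs275-0"
    gtid         -- your GT username
    scratch_root -- your own scratch directory (see the tier table)
    """
    base = PurePosixPath("/storage/project") / project / gtid
    return {
        "conda_envs": base / ".conda_envs",
        "hf_cache": base / "hfcache" / "hub",
        "source": base / "AI-inference-hands-on-note",
        "data": PurePosixPath(scratch_root) / "data",
    }


# Example values copied from this page; replace them with your own.
paths = lab_paths("r-rs275-0", "yourgtid", "/storage/scratch1/X/yourgtid")
for name, path in paths.items():
    print(f"{name:11s} {path}")
```

Everything that must survive goes under project storage; only the regenerable data directory lands in scratch.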
pace-quota
# Shows: home, scratch, project usage and account balances
Keep vLLM and SGLang in separate environments. SGLang pulls in torch==2.9.x and transformers==5.x, but vllm-continuum (and most vLLM builds) require torch==2.8.0 and transformers<5. Combining both in one env breaks server startup. We learned this the hard way.
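A quick way to catch a mixed-up environment before launching a server is to compare the installed versions against the constraints quoted above. This is a minimal sketch, not part of vLLM or SGLang: the function names are hypothetical, the version parsing is deliberately simple, and the torch series is a parameter in case your vLLM build pins a different one.

```python
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Turn '2.8.0+cu126' into (2, 8, 0); non-numeric suffixes are dropped."""
    public = version.split("+", 1)[0]
    parts = []
    for piece in public.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def vllm_env_ok(torch_version: str, transformers_version: str,
                torch_series: tuple = (2, 8)) -> bool:
    """Check the constraints quoted above: torch 2.8.x and transformers < 5."""
    torch_ok = release_tuple(torch_version)[:2] == torch_series
    transformers_ok = release_tuple(transformers_version)[:1] < (5,)
    return torch_ok and transformers_ok


def installed(name: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    torch_v, transformers_v = installed("torch"), installed("transformers")
    if torch_v is None or transformers_v is None:
        print("torch or transformers is not installed in this environment")
    else:
        print("vllm-lab constraints satisfied:", vllm_env_ok(torch_v, transformers_v))
```

Run it inside the activated environment; a False result usually means another install upgraded torch or transformers behind your back.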
# Get an interactive 1-hour CPU shell with 16GB RAM
salloc -N1 --cpus-per-task=4 --mem=16G -t 01:00:00 \
-A gts-rs275-paid -q inferno -p cpu-small
# Wait for prompt to change — you are now on a compute node
module load anaconda3
eval "$(conda shell.bash hook)"
ENV=/storage/project/r-rs275-0/$USER/.conda_envs/vllm-lab
conda create -p $ENV python=3.11 -y
conda activate $ENV
# Install vLLM (pinned to a stable version compatible with this lab)
pip install --no-cache-dir "vllm==0.6.3"
pip install --no-cache-dir accelerate matplotlib pandas
# Verify
python -c "import vllm; print(f'vLLM {vllm.__version__}')"
python -c "import torch; print(f'torch {torch.__version__}')"
ENV2=/storage/project/r-rs275-0/$USER/.conda_envs/sglang-lab
conda create -p $ENV2 python=3.11 -y
conda activate $ENV2
pip install --no-cache-dir "sglang[all]==0.5.10.post1"
pip install --no-cache-dir matplotlib pandas cachetools
# Verify
python -c "import sglang; print(f'SGLang {sglang.__version__}')"
python -c "import torch; print(f'torch {torch.__version__}')"
exit
# You are back on the login node. The envs are saved permanently in project storage.
| Weeks | Environment |
|---|---|
| 1, 2, 3, 4, 5, 7, 8, 10, 11, 12, 15 | vllm-lab |
| 6, 9 | sglang-lab |
| 1, 2, 14, 16 (both, run separately) | vllm-lab + sglang-lab |
| 13 (needs 2+ GPUs) | vllm-lab |
# Get a token from https://huggingface.co/settings/tokens (read access is enough)
# Add to your ~/.bashrc on PACE
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export HF_HOME=/storage/project/r-rs275-0/$USER/hfcache/hub
export HF_HUB_CACHE=$HF_HOME   # otherwise downloads land in $HF_HOME/hub
export TRANSFORMERS_CACHE=$HF_HOME
# Apply to current shell
source ~/.bashrc
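Before downloading anything, it is worth confirming that these variables are actually visible and that the cache is not inside your home directory. The check below is a hedged sketch: check_hf_env is a hypothetical helper, and it assumes tokens start with hf_ as in the placeholder above.

```python
import os


def check_hf_env(env: dict, home: str) -> list:
    """Return human-readable problems with the Hugging Face settings in env."""
    problems = []
    if not env.get("HF_TOKEN", "").startswith("hf_"):
        problems.append("HF_TOKEN is missing or does not look like a token")
    cache = env.get("HF_HOME")
    if not cache:
        problems.append("HF_HOME is not set")
    else:
        home_prefix = home.rstrip("/") + "/"
        if cache == home.rstrip("/") or cache.startswith(home_prefix):
            problems.append("HF_HOME is inside the home directory, which has a small quota")
    return problems


# Check the current shell's environment and print anything suspicious.
for problem in check_hf_env(dict(os.environ), os.path.expanduser("~")):
    print("WARNING:", problem)
```

No output means the settings look sane; it does not prove the token is valid or has access to gated repos.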
Do this in a CPU node session: huggingface-cli download is network- and disk-bound, so no GPU is needed.
salloc -N1 --cpus-per-task=4 --mem=16G -t 02:00:00 \
-A gts-rs275-paid -q inferno -p cpu-small
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /storage/project/r-rs275-0/$USER/.conda_envs/vllm-lab
# Llama-3.1-8B-Instruct (16 GB, used in weeks 1-12, 15)
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
--local-dir $HF_HOME/Llama-3.1-8B-Instruct
# TinyLlama-1.1B (2 GB, used as draft model in week 11)
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0
# FP8 and AWQ quantized variants (week 12)
huggingface-cli download neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
# Llama-3.1-70B-Instruct (140 GB BF16, week 13 TP, week 14 reference)
# Skip if you don't need 70B: the full repo download takes ~30 min and 263 GB on disk
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
--local-dir $HF_HOME/Llama-3.1-70B-Instruct
exit
ls -la $HF_HOME/
# Llama-3.1-8B-Instruct/ (~16 GB)
# models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/ (HF cache format)
# models--neuralmagic--Meta-Llama-3.1-8B-Instruct-FP8/
# models--hugging-quants--Meta-Llama-3.1-8B-Instruct-AWQ-INT4/
# Llama-3.1-70B-Instruct/ (263 GB)
# Total: ~285 GB — check your project quota allows this
du -sh $HF_HOME
vLLM tries to fetch config.json from HF Hub at startup, even when you pass a local path. If your token isn't readable in the compute node environment, the request gets a 401 error and the server crashes. Setting export HF_HUB_OFFLINE=1 in every sbatch script tells the HF libraries to use only the local cache and make no network calls. We use this in every experiment script.
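Offline mode only helps if the local copy is complete, so a cheap pre-flight check can save a queued GPU job from failing at startup. The sketch below is an assumption-laden illustration: local_model_ready is a hypothetical helper, and it only looks for a parseable config.json plus at least one weight file, not for every shard.

```python
import json
import os
from pathlib import Path


def local_model_ready(model_dir) -> bool:
    """True if model_dir holds a parseable config.json and at least one weight file."""
    model_dir = Path(model_dir)
    config = model_dir / "config.json"
    if not config.is_file():
        return False
    try:
        json.loads(config.read_text())
    except ValueError:
        return False
    has_safetensors = any(model_dir.glob("*.safetensors"))
    has_bin = any(model_dir.glob("*.bin"))
    return has_safetensors or has_bin


# Set this before importing transformers, vllm, or huggingface_hub,
# since the flag is typically read when those libraries are imported.
os.environ["HF_HUB_OFFLINE"] = "1"

# Usage (path is the example location used on this page):
# local_model_ready(os.path.join(os.environ["HF_HOME"], "Llama-3.1-8B-Instruct"))
```

This works for models downloaded with --local-dir; models stored in the HF cache format keep their files under a snapshots subdirectory, so point the check there instead.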
| Partition | GRES name | VRAM | SM | FA2? | vLLM? | CPU:GPU max |
|---|---|---|---|---|---|---|
| gpu-v100 | v100 | 32 GB | 7.0 | ❌ | ❌ | — |
| gpu-rtx6000 | rtx_6000 | 24 GB | 7.5 | ❌ | ❌ | 6:1 |
| gpu-a100 | A100 | 40 GB | 8.0 | ✓ | ✓ | 8:1 |
| gpu-l40s | L40S | 48 GB | 8.9 | ✓ | ✓ | 4:1 |
| gpu-h100 | H100 | 80 GB | 9.0 | ✓ | ✓ | 8:1 |
| gpu-h200 | H200 | 141 GB | 9.0 | ✓ | ✓ | 16:1 |
| gpu-rtxpro-blackwell | rtx_pro_6000_blackwell | 96 GB | 12.0 | ✓ | ~ | — |
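
The table can also be treated as data when you want to pick a GPU programmatically. The following sketch copies the rows above into a list and filters them; the Gpu class and usable_gpus function are hypothetical names, and the Blackwell row (marked "~" for vLLM) is conservatively treated as unsupported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    partition: str
    gres: str
    vram_gb: int
    sm: float
    fa2: bool
    vllm: bool


# Rows copied from the table above. The Blackwell row is marked "~" for vLLM
# there; it is treated as unsupported here to stay on the safe side.
GPUS = [
    Gpu("gpu-v100", "v100", 32, 7.0, False, False),
    Gpu("gpu-rtx6000", "rtx_6000", 24, 7.5, False, False),
    Gpu("gpu-a100", "A100", 40, 8.0, True, True),
    Gpu("gpu-l40s", "L40S", 48, 8.9, True, True),
    Gpu("gpu-h100", "H100", 80, 9.0, True, True),
    Gpu("gpu-h200", "H200", 141, 9.0, True, True),
    Gpu("gpu-rtxpro-blackwell", "rtx_pro_6000_blackwell", 96, 12.0, True, False),
]


def usable_gpus(min_vram_gb: int) -> list:
    """GPUs that run vLLM with FlashAttention 2 and have at least min_vram_gb."""
    fits = [g for g in GPUS if g.vllm and g.fa2 and g.vram_gb >= min_vram_gb]
    return sorted(fits, key=lambda g: g.vram_gb)


for gpu in usable_gpus(min_vram_gb=40):
    print(f"--gres=gpu:{gpu.gres}:1   # {gpu.vram_gb} GB, partition {gpu.partition}")
```

Sorting by VRAM puts the smallest card that fits first, which is usually also the one with the shortest queue.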
| Combination | Cost | Time limit | Preemption? | Use for these labs? |
|---|---|---|---|---|
| -A gts-rs275 -q embers | free | 8 hr | ✓ (yes) | ❌ risky: benchmarks die mid-run |
| -A gts-rs275-paid -q inferno | $0.67/GPU-hr | unlimited | ❌ (no) | ✓ recommended |
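
To budget a run on the paid queue, a rough estimate is enough. The sketch below assumes the charge is simply GPU count times wall-clock hours times the $0.67/GPU-hr rate from the table; job_cost_usd is a hypothetical helper, and the actual accounting on PACE may differ.

```python
def job_cost_usd(gpus: int, hours: float, rate_per_gpu_hour: float = 0.67) -> float:
    """Estimated charge for a paid-queue job: GPUs x wall-clock hours x hourly rate."""
    if gpus < 0 or hours < 0:
        raise ValueError("gpus and hours must be non-negative")
    return round(gpus * hours * rate_per_gpu_hour, 2)


# A 1-GPU hour-long benchmark and a 2-GPU, 3-hour tensor-parallel run.
print("1 GPU x 1 h:", job_cost_usd(1, 1))
print("2 GPU x 3 h:", job_cost_usd(2, 3))
```

Compare the estimate against the balance reported by pace-quota before submitting long jobs.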
# Interactive 30-min H100 shell
salloc -N1 --gres=gpu:H100:1 --cpus-per-task=8 --mem=64G -t 00:30:00 \
-A gts-rs275-paid -q inferno
# Submit a batch job
sbatch myjob.sbatch
# See your jobs
squeue -u $USER --format='%.10i %.20j %.10T %.12M %.20R'
# Cancel a job
scancel JOBID
# See queue depth for a partition
squeue -p gpu-h100 --format='%.8T' | sort | uniq -c
# Check available nodes in a partition
sinfo -p gpu-h100 --format='%P %a %D %T %N'
Use this template as your starting point. It is the minimum sbatch that will reliably start a vLLM server, send one request, and shut down cleanly.
#!/bin/bash
# save as ~/r-rs275-0/AI-inference-hands-on-note/sbatch/hello_vllm.sbatch
#SBATCH --job-name=hello-vllm
#SBATCH --gres=gpu:H100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=00:30:00
#SBATCH --account=gts-rs275-paid
#SBATCH -q inferno
#SBATCH --output=hello_vllm_%j.out
#SBATCH --error=hello_vllm_%j.err
set -euo pipefail
echo "=== START: $(date) ==="
# Activate environment
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /storage/project/r-rs275-0/$USER/.conda_envs/vllm-lab
# Use local model cache, force offline (no HF Hub calls)
export HF_HOME=/storage/project/r-rs275-0/$USER/hfcache/hub
export TRANSFORMERS_CACHE=$HF_HOME
export HF_HUB_OFFLINE=1
# Sanity check
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
python -c 'import torch; print(f"torch {torch.__version__}, cuda {torch.cuda.is_available()}")'
# Start the server in the background, write its log to a file
LOGFILE=$PWD/server_$SLURM_JOB_ID.log
python -m vllm.entrypoints.openai.api_server \
--model $HF_HOME/Llama-3.1-8B-Instruct \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.90 \
--disable-log-requests \
> $LOGFILE 2>&1 &
SERVER_PID=$!
echo "Server PID: $SERVER_PID"
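# Wait for the /health endpoint to respond (max 5 minutes)
READY=0
for i in $(seq 1 60); do
    sleep 5
    if ! kill -0 $SERVER_PID 2>/dev/null; then
        echo "SERVER CRASHED"; tail -30 $LOGFILE; exit 1
    fi
    if curl -sf http://localhost:8000/health >/dev/null 2>&1; then
        READY=1; echo "READY after $((i*5))s"; break
    fi
done
# Give up cleanly if the server never became ready
if [ "$READY" -ne 1 ]; then
    echo "SERVER NOT READY after 300s"; tail -30 $LOGFILE
    kill -2 $SERVER_PID 2>/dev/null || true; exit 1
fi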
# Send a test request
curl -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "/storage/project/r-rs275-0/'$USER'/hfcache/hub/Llama-3.1-8B-Instruct",
"prompt": "The capital of France is",
"max_tokens": 16
}' | python -m json.tool
# GRACEFUL shutdown — SIGINT, never SIGKILL
echo "=== Stopping server (SIGINT) ==="
kill -2 $SERVER_PID 2>/dev/null || true
wait $SERVER_PID 2>/dev/null || true
echo "=== END: $(date) ==="
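If you drive the server from Python rather than bash, the readiness loop in the template translates directly. This is a sketch under stated assumptions: http_ok and wait_until_ready are hypothetical helpers, and the probe and liveness checks are passed in as callables so the loop can be exercised without a real server.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """True if a GET to url returns HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return False


def wait_until_ready(probe, is_alive, attempts=60, interval_s=5.0, sleep=time.sleep):
    """Poll probe() until it succeeds.

    Returns the number of attempts used, raises RuntimeError if the server
    process dies, and raises TimeoutError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        sleep(interval_s)
        if not is_alive():
            raise RuntimeError("server process exited before becoming ready")
        if probe():
            return attempt
    raise TimeoutError(f"server not ready after {attempts} attempts")


# Usage with a server started via subprocess.Popen (proc):
# wait_until_ready(lambda: http_ok("http://localhost:8000/health"),
#                  lambda: proc.poll() is None)
```

The defaults mirror the template: 60 attempts at 5-second intervals, with a crash check on every pass.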
# Submit
sbatch hello_vllm.sbatch
# → "Submitted batch job 1234567"
# Watch the queue (Ctrl+C to stop watching)
watch -n 5 squeue -u $USER
# When the job starts running, tail the live output
tail -f hello_vllm_1234567.out
=== START: Wed Apr 9 23:00:00 EDT 2026 ===
NVIDIA H100 80GB HBM3, 81559 MiB
torch 2.8.0+cu126, cuda True
Server PID: 12345
READY after 25s
{
"id": "cmpl-abc123",
"choices": [{"text": " Paris, located in the north...", ...}],
"usage": {"completion_tokens": 16, ...}
}
=== Stopping server (SIGINT) ===
=== END: Wed Apr 9 23:00:50 EDT 2026 ===
If you see Paris in the response, your environment is fully working. Proceed to Week 1.
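The same SIGINT-only shutdown rule applies when a Python script owns the server process. The sketch below is illustrative only: stop_gracefully is a hypothetical helper, the child process is a stand-in that handles SIGINT the way a well-behaved server would, and the code is POSIX-only.

```python
import signal
import subprocess
import sys


def stop_gracefully(proc: subprocess.Popen, timeout_s: float = 60.0):
    """Send SIGINT and wait for exit. Returns the exit code, or None on timeout.

    SIGKILL is never sent; on timeout the caller (or Slurm) decides what to do.
    """
    if proc.poll() is None:
        proc.send_signal(signal.SIGINT)
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return None
    return proc.returncode


# Stand-in for a server: installs a SIGINT handler, announces it, then idles.
CHILD = (
    "import signal, sys, time\n"
    "signal.signal(signal.SIGINT, lambda *_: sys.exit(0))\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)

proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True)
proc.stdout.readline()  # block until the child has installed its handler
exit_code = stop_gracefully(proc, timeout_s=10)
proc.stdout.close()
print("child exit code:", exit_code)
```

A real vLLM or SGLang server needs longer than the stand-in to release GPU memory, so keep the timeout generous.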
These are the errors we actually hit while building these labs. If you see one, the fix is probably here.
Symptom: The server log shows model loading completes, but the Python parent never gets "READY".
Cause: If you launched the server with subprocess.Popen(stdout=PIPE) in Python and never read the pipe, the pipe buffer (~64KB) fills and the server blocks on every print. The server is alive but stuck. This is a deadlock, not a startup problem.
Fix: Always redirect stdout to a file:
import subprocess
# cmd is the server launch command, e.g. a list of argv strings
log_file = open("/tmp/server.log", "w")
proc = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
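To convince yourself the file redirect avoids the deadlock, you can reproduce the situation with a harmless child process. This is a self-contained sketch: the noisy child is a stand-in for a chatty server, and the log goes to a temporary directory rather than a path from this page.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in for a chatty server: writes ~200 KB, well past a 64 KB pipe buffer.
NOISY_CHILD = "print('x' * 200_000)"

log_path = Path(tempfile.mkdtemp()) / "server.log"
with open(log_path, "w") as log_file:
    proc = subprocess.Popen(
        [sys.executable, "-c", NOISY_CHILD],
        stdout=log_file,
        stderr=subprocess.STDOUT,
    )
    # No pipe to drain, so the child can never block on a full buffer.
    return_code = proc.wait(timeout=30)

print("exit code:", return_code, "log bytes:", log_path.stat().st_size)
```

With stdout=subprocess.PIPE and no reader, the same child would stall once the pipe buffer filled; with a file it runs to completion.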
Symptom: ValueError: To serve at least one request with the model's max seq len (131072), 16.00 GiB KV cache is needed, which is larger than the available KV cache memory (12.74 GiB)
Cause: vLLM defaults max_model_len to the model's maximum position embeddings, which for Llama-3.1 is 131,072. A single 131K-token sequence needs about 16 GiB of KV cache, more than the GPU has left after loading the weights.
Fix: Always pass --max-model-len 4096 (or 8192) and --gpu-memory-utilization 0.90 to the vLLM server.
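The 16.00 GiB in the error message can be reproduced by hand. The sketch below assumes the published Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) and 2-byte BF16 values; kv_cache_gib is a hypothetical helper, not a vLLM function.

```python
def kv_cache_gib(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache size in GiB for `tokens` cached positions.

    Defaults are the published Llama-3.1-8B shape (32 layers, 8 KV heads,
    head dimension 128) with 2-byte BF16 values.
    """
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return tokens * per_token / 2**30


print(kv_cache_gib(131072))  # 16.0
print(kv_cache_gib(4096))    # 0.5
```

Each token costs 128 KiB of cache, so capping the context at 4096 tokens drops the minimum reservation from 16 GiB to 0.5 GiB.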
Symptom: Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/resolve/main/config.json
Cause: vLLM tries to validate the model from HF Hub even if you pass a local path, because it parses the model id and falls back to Hub for config.json. The compute node may not have your HF token.
Fix: Set HF_HUB_OFFLINE=1 in the sbatch script — forces use of local cache only:
export HF_HUB_OFFLINE=1
# And use the absolute local path, not the HF model id
python -m vllm.entrypoints.openai.api_server \
--model $HF_HOME/Llama-3.1-8B-Instruct \
...
Cause: transformers 5.x removed this attribute. vLLM < 0.7 still calls it. You probably installed SGLang into the same env as vLLM, which upgraded transformers.
Fix: Downgrade transformers in the vLLM env, OR (better) keep vLLM and SGLang in separate envs as we recommend at the top of this page.
pip install --no-cache-dir 'transformers<5'
Cause: You are running on V100 (sm 7.0) or RTX 6000 Turing (sm 7.5). FlashAttention 2 needs Ampere (sm 8.0) or newer.
Fix: Use A100, L40S, H100, or H200 instead. Change --gres=gpu:V100:1 to --gres=gpu:A100:1.
Cause: Each PACE GPU partition has a max CPUs-per-GPU ratio. L40S is 4:1, A100 is 8:1, RTX 6000 is 6:1, H200 is 16:1. You requested too many CPUs per GPU.
Fix: Lower --cpus-per-task. For 1 L40S use 4, for 1 A100 use 8, for 1 H200 use 16.
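If you generate sbatch files for several GPU types, clamping the CPU request avoids this rejection. A minimal sketch, using the ratios listed on this page; clamp_cpus is a hypothetical helper and the keys are the GRES names from the GPU table.

```python
# Max CPUs per GPU, taken from the ratios listed on this page.
MAX_CPUS_PER_GPU = {"rtx_6000": 6, "A100": 8, "L40S": 4, "H100": 8, "H200": 16}


def clamp_cpus(gres: str, gpus: int, requested_cpus: int) -> int:
    """Return a --cpus-per-task value that respects the partition's CPU:GPU cap."""
    limit = MAX_CPUS_PER_GPU[gres] * gpus
    return min(requested_cpus, limit)


print(clamp_cpus("L40S", 1, 8))   # 4
print(clamp_cpus("H200", 1, 8))   # 8
```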
Cause: The cluster is busy and many higher-priority jobs are ahead of you. squeue -p gpu-h100 --format='%.8T' | sort | uniq -c shows the queue depth.
Fix: Submit the same job to multiple GPU types in parallel (one sbatch per GPU type, each writing to its own results subdirectory tagged by GPU). The first one to start runs; you can cancel the others. We use this trick across the labs.
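One way to script this is to generate one sbatch command per GPU type, relying on the fact that sbatch command-line options take precedence over the #SBATCH lines in the script. The sketch below only builds and prints the commands. The RESULTS_DIR variable is a hypothetical name that your sbatch script would have to read, and the CPU counts follow the ratios on this page.

```python
import shlex

# (GRES name, CPUs for one GPU) pairs, following the CPU:GPU caps on this page.
TARGETS = [("A100", 8), ("L40S", 4), ("H100", 8), ("H200", 16)]


def sbatch_commands(script: str, results_root: str) -> list:
    """One sbatch argv per GPU type, each with its own results directory."""
    commands = []
    for gres, cpus in TARGETS:
        results_dir = f"{results_root}/{gres}"
        commands.append([
            "sbatch",
            f"--gres=gpu:{gres}:1",
            f"--cpus-per-task={cpus}",
            f"--job-name=bench-{gres}",
            # RESULTS_DIR is a hypothetical variable; the script must use it.
            f"--export=ALL,RESULTS_DIR={results_dir}",
            script,
        ])
    return commands


for argv in sbatch_commands("hello_vllm.sbatch", "results"):
    print(shlex.join(argv))
```

Printing first lets you review the commands; pass each argv to subprocess.run when you are ready to submit, then scancel the jobs you no longer need.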
Cause: Your vLLM build does not have CUDA kernels compiled for sm_120 (Blackwell). The continuum vLLM fork on PACE is compiled for sm_70/75/80/89/90 only.
Fix: Use H100/H200 instead of RTX Pro 6000 Blackwell, OR build vLLM yourself with TORCH_CUDA_ARCH_LIST including 12.0.
Cause: The accelerate package is missing. transformers requires it for any device_map usage (which we use to put the model on the GPU).
Fix: pip install accelerate
Cause: Newer vLLM (~0.7+) deprecated the standalone benchmark scripts in favor of vllm bench serve, but in some forks the new CLI is not yet wired up.
Fix: Either pin to vLLM 0.6.x, or use a custom client that calls the OpenAI HTTP API directly (we use one in scripts/bench_client.py).
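A client that talks to the OpenAI-compatible endpoint directly needs only the standard library. The sketch below is not the scripts/bench_client.py mentioned above; it is a minimal illustration with hypothetical function names that times a single request using the same /v1/completions fields shown in the hello-world output.

```python
import json
import time
import urllib.request


def build_request(base_url: str, model: str, prompt: str, max_tokens: int):
    """Build a POST to the OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(response: dict, elapsed_s: float) -> float:
    """Completion tokens divided by wall-clock seconds for one request."""
    return response["usage"]["completion_tokens"] / elapsed_s


def timed_completion(base_url, model, prompt, max_tokens=16, timeout_s=60.0):
    """Send one request and return (generated text, tokens per second)."""
    request = build_request(base_url, model, prompt, max_tokens)
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        body = json.load(resp)
    elapsed = time.perf_counter() - start
    return body["choices"][0]["text"], tokens_per_second(body, elapsed)


# Usage against a running server (model must match the --model path):
# text, tps = timed_completion("http://localhost:8000", model_path,
#                              "The capital of France is")
```

A single request measures latency, not throughput; a real benchmark would issue many requests concurrently and aggregate the results.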