Week 0 — PACE Environment Setup

The complete tutorial for getting vLLM and SGLang running on the PACE Phoenix HPC cluster. Read this BEFORE Week 1. Covers SSH, storage layout, conda environments, model downloads, Slurm submission, GPU selection, and troubleshooting the most common errors.

One-time setup | ~2 hours | PACE Phoenix | vLLM + SGLang

By the end of this page you will have

  • A working SSH alias pace that gets you from your laptop to a PACE login node in one command.
  • Two conda envs in project storage: vllm-lab pinned to vllm==0.6.3 for Weeks 1–5, 7, 8, 10, 11, 12, 13, 14, 15, 16; and sglang-lab pinned to sglang==0.5.10.post1 for Weeks 1 (SGLang half), 6, 9.
  • ~285 GB of model weights pre-downloaded into $HFCACHE: Llama-3.1-8B-Instruct, TinyLlama-1.1B, Llama-3.1-8B-Instruct-FP8, Llama-3.1-8B-Instruct-AWQ-INT4, Llama-3.1-70B-Instruct (optional for Week 13).
  • A hello-world Slurm job that launches a vLLM server on an H100, verifies /health is reachable, sends one completion request, and shuts down with kill -2.
  • A troubleshooting playbook for the ten most common PACE failure modes you will hit in Weeks 1–16 (FA2 sm_80 error, OOM, PENDING priority, sm_120 Blackwell kernels, etc.).

Substitute these placeholders throughout the page before running anything:

Placeholder | Example used on this page | Replace with
YOUR_GTID | hlin464 | Your GT username (the one you use for SSH)
YOUR_ACCOUNT | gts-rs275-paid | Your course's paid Slurm account (check with pace-quota)
YOUR_PROJECT | r-rs275-0 | Your course's project storage share (/storage/project/$YOUR_PROJECT/$YOUR_GTID/)
$HFCACHE | /storage/project/r-rs275-0/hlin464/hfcache/hub | Any large writable path; don't use $HOME because PACE home is only ~10 GB
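
One way to apply these substitutions to a copied script is a small sed filter. This is a minimal sketch, not part of the lab tooling: personalize is a hypothetical helper name, and it assumes the example values (gts-rs275-paid, r-rs275-0, yourgtid) appear literally in the text you feed it.

```shell
#!/bin/bash
# Hypothetical helper: rewrite this page's example values to your own.
#   $1 = your Slurm account, $2 = your project share, $3 = your GT username
personalize() {
    sed -e "s|gts-rs275-paid|$1|g" \
        -e "s|r-rs275-0|$2|g" \
        -e "s|yourgtid|$3|g"
}

# Example (reads a script on stdin, writes the personalized copy to stdout):
#   personalize gts-ab1-paid r-ab1-0 gburdell3 < hello_vllm.sbatch > my_hello.sbatch
```

Scripts on this page that use $USER need no change for the username, since $USER expands to your own login on PACE.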

What you need

  • A PACE Phoenix account with membership in a project with paid GPU access. The labs assume --account=gts-rs275-paid — substitute your own account name. To check your accounts: pace-quota
  • SSH key on your local machine (the PACE OnDemand portal also works but command-line access is much faster for these labs)
  • Georgia Tech VPN or campus network: PACE login nodes cannot be reached from off-campus networks without the VPN
  • Hugging Face account with a token, AND a request submitted for the gated Llama models. Visit https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct and click "Agree and access repository"
  • Comfort with bash, Python, and conda. We will not teach these basics here.
PACE Hard Rules — Read These First:
  1. Never run GPU work on the login node. Login nodes have ~4GB RAM and no GPU. Use sbatch or salloc.
  2. Never kill -9 a vLLM/SGLang process. SIGKILL gives the process no chance to release the GPU cleanly; killing it during CUDA execution can leave the driver in a corrupted state that requires a node reboot. Always use kill -2 (SIGINT) for graceful shutdown.
  3. Never compile / pip install large packages on the login node. Use salloc to grab a CPU compute node first.
  4. Always use the paid account + inferno queue for GPU work. The free queue (embers) can preempt your job at any moment, killing in-progress benchmarks.
  5. Use a separate venv/conda env for these labs. Don't pip-install anything into a shared system Python or someone else's environment.

SSH access & SSH config

1. Generate a key (one time)

# On your local machine
ssh-keygen -t ed25519 -C "my-name@gatech.edu" -f ~/.ssh/id_ed25519_pace
ssh-copy-id -i ~/.ssh/id_ed25519_pace.pub yourgtid@login-phoenix.pace.gatech.edu

2. Set up SSH config (saves typing for the rest of the semester)

# Add to ~/.ssh/config on your local machine
Host pace
    HostName login-phoenix.pace.gatech.edu
    User yourgtid
    IdentityFile ~/.ssh/id_ed25519_pace
    ServerAliveInterval 60
    ServerAliveCountMax 5
    # ControlMaster reuses one TCP connection across multiple ssh calls
    # — you only have to type your password / 2FA once per ~10 min
    ControlMaster auto
    ControlPath ~/.ssh/sockets/%r@%h-%p
    ControlPersist 600

mkdir -p ~/.ssh/sockets
chmod 700 ~/.ssh ~/.ssh/sockets

3. Test the connection

ssh pace "hostname && pace-quota"

If you see your storage usage and account balances printed, SSH works.

PACE storage layout

PACE has THREE storage tiers, each with different size, speed, persistence, and access permissions. Putting things in the wrong tier is the #1 cause of beginner pain.

Tier | Path | Quota | Persistence | Use for
Home | ~/ = /storage/home/hcoda1/X/yourgtid | 20 GB / 1M files | Permanent, backed up | dotfiles, shell config, small scripts
Project | ~/r-rs275-0 = /storage/project/r-rs275-0/yourgtid | 1 TB shared | Permanent (group) | conda envs, model weights, source code, results
Scratch | /storage/scratch1/X/yourgtid | 15 TB | Purged after 60 days inactive | dataset files, intermediate experiment outputs, large logs

Recommended layout for these labs:
# Conda envs — large, want them permanent
/storage/project/r-rs275-0/yourgtid/.conda_envs/

# Model weights — very large, want them permanent
/storage/project/r-rs275-0/yourgtid/hfcache/hub/

# Project source code — small, want them permanent
/storage/project/r-rs275-0/yourgtid/AI-inference-hands-on-note/

# Datasets — bulky, may regenerate, OK to put in scratch
/storage/scratch1/X/yourgtid/data/

Note: Many labs at GT use a symlink so ~/r-rs275-0 points to the project storage. If your account has it, just use ~/r-rs275-0/... as the path.
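
The recommended layout can be created in one step. The following is a minimal sketch: make_lab_dirs is a hypothetical helper name, and the base directory is passed as an argument so you can substitute your own project share. The subdirectory names are the ones listed above.

```shell
#!/bin/bash
# Hypothetical helper: create the lab directory layout under a base directory.
# Subdirectory names follow the recommended layout on this page.
make_lab_dirs() {
    local base="$1"
    mkdir -p "$base/.conda_envs" \
             "$base/hfcache/hub" \
             "$base/AI-inference-hands-on-note"
}

# Example (run on PACE, substituting your own project share):
#   make_lab_dirs "/storage/project/r-rs275-0/$USER"
```

mkdir -p is safe to re-run: it leaves existing directories and their contents untouched.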

Check your quotas

pace-quota
# Shows: home, scratch, project usage and account balances

Conda environments — vLLM and SGLang separately

Why two separate environments? vLLM and SGLang have conflicting torch and transformers requirements. SGLang 0.5.x requires torch==2.9.x and transformers==5.x, but vllm-continuum (and most vLLM builds) require torch==2.8.0 and transformers<5. Trying to use both in one env breaks the server startup. We learned this the hard way.

1. Allocate a CPU node first (DO NOT install on login node)

# Get an interactive 1-hour CPU shell with 16GB RAM
salloc -N1 --cpus-per-task=4 --mem=16G -t 01:00:00 \
       -A gts-rs275-paid -q inferno -p cpu-small
# Wait for prompt to change — you are now on a compute node

2. Create the vLLM environment

module load anaconda3
eval "$(conda shell.bash hook)"

ENV=/storage/project/r-rs275-0/$USER/.conda_envs/vllm-lab
conda create -p $ENV python=3.11 -y
conda activate $ENV

# Install vLLM (pinned to a stable version compatible with this lab)
pip install --no-cache-dir "vllm==0.6.3"
pip install --no-cache-dir accelerate matplotlib pandas

# Verify
python -c "import vllm; print(f'vLLM {vllm.__version__}')"
python -c "import torch; print(f'torch {torch.__version__}')"

3. Create the SGLang environment (separate)

ENV2=/storage/project/r-rs275-0/$USER/.conda_envs/sglang-lab
conda create -p $ENV2 python=3.11 -y
conda activate $ENV2

pip install --no-cache-dir "sglang[all]==0.5.10.post1"
pip install --no-cache-dir matplotlib pandas cachetools

# Verify
python -c "import sglang; print(f'SGLang {sglang.__version__}')"
python -c "import torch; print(f'torch {torch.__version__}')"

4. Exit the salloc session

exit
# You are back on the login node. The envs are saved permanently in project storage.

Recipe: which env do I activate for which week?

Weeks | Environment
1, 2, 3, 4, 5, 7, 8, 10, 11, 12, 15 | vllm-lab
6, 9 | sglang-lab
1, 2, 14, 16 (both, run separately) | vllm-lab + sglang-lab
13 (needs 2+ GPUs) | vllm-lab

Downloading model weights

1. Set up Hugging Face authentication

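# Get a token from https://huggingface.co/settings/tokens (read access is enough)
# Add to your ~/.bashrc on PACE
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export HF_HOME=/storage/project/r-rs275-0/$USER/hfcache/hub
# By default the hub cache is $HF_HOME/hub; point it at $HF_HOME so that
# downloads without --local-dir land directly in this directory
export HF_HUB_CACHE=$HF_HOME
export TRANSFORMERS_CACHE=$HF_HOME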

# Apply to current shell
source ~/.bashrc

2. Download the models used in these labs

Do this in a CPU node session (huggingface-cli download is network/disk bound, no GPU needed)

salloc -N1 --cpus-per-task=4 --mem=16G -t 02:00:00 \
       -A gts-rs275-paid -q inferno -p cpu-small

module load anaconda3
eval "$(conda shell.bash hook)"   # needed in a fresh shell before conda activate
conda activate /storage/project/r-rs275-0/$USER/.conda_envs/vllm-lab

# Llama-3.1-8B-Instruct (16 GB, used in weeks 1-12, 15)
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
    --local-dir $HF_HOME/Llama-3.1-8B-Instruct

# TinyLlama-1.1B (2 GB, used as draft model in week 11)
huggingface-cli download TinyLlama/TinyLlama-1.1B-Chat-v1.0

# FP8 and AWQ quantized variants (week 12)
huggingface-cli download neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
huggingface-cli download hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4

# Llama-3.1-70B-Instruct (140 GB BF16, week 13 TP, week 14 reference)
# Skip if you don't need 70B — this takes ~30 min and 263 GB on disk
huggingface-cli download meta-llama/Llama-3.1-70B-Instruct \
    --local-dir $HF_HOME/Llama-3.1-70B-Instruct

exit

3. Verify the downloads

ls -la $HF_HOME/
# Llama-3.1-8B-Instruct/  (~16 GB)
# models--TinyLlama--TinyLlama-1.1B-Chat-v1.0/  (HF cache format)
# models--neuralmagic--Meta-Llama-3.1-8B-Instruct-FP8/
# models--hugging-quants--Meta-Llama-3.1-8B-Instruct-AWQ-INT4/
# Llama-3.1-70B-Instruct/  (263 GB)

# Total: ~285 GB — check your project quota allows this
du -sh $HF_HOME

Why use HF_HUB_OFFLINE=1 in scripts? Llama models are gated. Even with the weights downloaded locally, vLLM will try to fetch config.json from the HF Hub at startup. If your token isn't readable in the compute-node environment, that request gets a 401 error and the server crashes. Setting export HF_HUB_OFFLINE=1 in every sbatch script tells the HF libraries to use ONLY the local cache, with no network calls. We use this in every experiment script.
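
Because offline mode removes the fallback to the Hub, it helps to confirm that a local model directory looks complete before the server starts. The following is a minimal sketch, not part of the lab scripts: check_local_model is a hypothetical helper name, and it assumes a flat --local-dir download whose weights are stored as *.safetensors files. It does not apply to the models--org--name cache layout.

```shell
#!/bin/bash
# Hypothetical pre-flight check: confirm a local model directory looks
# complete before starting a server with HF_HUB_OFFLINE=1.
check_local_model() {
    local dir="$1"
    if [ ! -f "$dir/config.json" ]; then
        echo "missing config.json in $dir" >&2
        return 1
    fi
    # Expect at least one weight shard; the safetensors format is assumed here.
    if ! ls "$dir"/*.safetensors >/dev/null 2>&1; then
        echo "no *.safetensors files in $dir" >&2
        return 1
    fi
    echo "ok: $dir"
}

# Example (in an sbatch script, before launching the server):
#   check_local_model "$HF_HOME/Llama-3.1-8B-Instruct" || exit 1
```

Failing early this way costs a second or two, instead of waiting for the server to crash partway through startup.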

Slurm: GPU types, queues, accounts

Available GPU partitions

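Partition | GRES name | VRAM | SM | FA2? | vLLM? | CPU:GPU max
gpu-v100 | v100 | 32 GB | 7.0 | no | no | -
gpu-rtx6000 | rtx_6000 | 24 GB | 7.5 | no | no | 6:1
gpu-a100 | A100 | 40 GB | 8.0 | yes | yes | 8:1
gpu-l40s | L40S | 48 GB | 8.9 | yes | yes | 4:1
gpu-h100 | H100 | 80 GB | 9.0 | yes | yes | 8:1
gpu-h200 | H200 | 141 GB | 9.0 | yes | yes | 16:1
gpu-rtxpro-blackwell | rtx_pro_6000_blackwell | 96 GB | 12.0 | - | ~ | -
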
V100 and RTX 6000 do NOT work with vLLM v1. Their compute capability is below 8.0, so FlashAttention 2 (FA2) is unsupported. The vLLM engine v1 hard-crashes during model load. Use A100 or newer.
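
The rule above reduces to a comparison on the major compute capability. The following is a minimal sketch using the SM values from the partition table; supports_fa2 is a hypothetical helper name, and the 8.0 threshold is the one stated on this page.

```shell
#!/bin/bash
# Succeeds (exit 0) if the compute capability, e.g. "8.9", is at least 8.0,
# the FlashAttention 2 threshold stated on this page.
supports_fa2() {
    local cap="$1"
    local major="${cap%%.*}"   # text before the first dot, e.g. "8" from "8.9"
    [ "$major" -ge 8 ]
}

# Compute capabilities taken from the partition table on this page
for entry in "V100:7.0" "rtx_6000:7.5" "A100:8.0" "L40S:8.9" "H100:9.0" "H200:9.0"; do
    gpu="${entry%%:*}"
    cap="${entry##*:}"
    if supports_fa2 "$cap"; then
        echo "$gpu (sm $cap): FA2 supported"
    else
        echo "$gpu (sm $cap): FA2 not supported"
    fi
done
```

On a compute node, recent NVIDIA drivers can report the value directly with nvidia-smi --query-gpu=compute_cap --format=csv,noheader. Older drivers may not support that query field.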

Account & queue choice

Combination | Cost | Time limit | Preemption? | Use for these labs?
-A gts-rs275 -q embers | free | 8 hr | yes | risky: benchmarks die mid-run
-A gts-rs275-paid -q inferno | $0.67/GPU-hr | unlimited | no | recommended

Slurm command cheat sheet

# Interactive 30-min H100 shell
salloc -N1 --gres=gpu:H100:1 --cpus-per-task=8 --mem=64G -t 00:30:00 \
       -A gts-rs275-paid -q inferno

# Submit a batch job
sbatch myjob.sbatch

# See your jobs
squeue -u $USER --format='%.10i %.20j %.10T %.12M %.20R'

# Cancel a job
scancel JOBID

# See queue depth for a partition
squeue -p gpu-h100 --format='%.8T' | sort | uniq -c

# Check available nodes in a partition
sinfo -p gpu-h100 --format='%P %a %D %T %N'

Your first vLLM job: hello world

Use this template as your starting point. It is the minimum sbatch that will reliably start a vLLM server, send one request, and shut down cleanly.

1. Create the sbatch file

#!/bin/bash
# Save as ~/r-rs275-0/AI-inference-hands-on-note/sbatch/hello_vllm.sbatch
# (sbatch requires the #!/bin/bash line to be the first line of the file)
#SBATCH --job-name=hello-vllm
#SBATCH --gres=gpu:H100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=00:30:00
#SBATCH --account=gts-rs275-paid
#SBATCH -q inferno
#SBATCH --output=hello_vllm_%j.out
#SBATCH --error=hello_vllm_%j.err

set -euo pipefail
echo "=== START: $(date) ==="

# Activate environment
module load anaconda3
eval "$(conda shell.bash hook)"
conda activate /storage/project/r-rs275-0/$USER/.conda_envs/vllm-lab

# Use local model cache, force offline (no HF Hub calls)
export HF_HOME=/storage/project/r-rs275-0/$USER/hfcache/hub
export HF_HUB_CACHE=$HF_HOME   # cache-format models live directly under $HF_HOME
export TRANSFORMERS_CACHE=$HF_HOME
export HF_HUB_OFFLINE=1

# Sanity check
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
python -c 'import torch; print(f"torch {torch.__version__}, cuda {torch.cuda.is_available()}")'

# Start the server in the background, write its log to a file
LOGFILE=$PWD/server_$SLURM_JOB_ID.log
python -m vllm.entrypoints.openai.api_server \
    --model $HF_HOME/Llama-3.1-8B-Instruct \
    --port 8000 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --disable-log-requests \
    > $LOGFILE 2>&1 &
SERVER_PID=$!
echo "Server PID: $SERVER_PID"

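# Wait for the /health endpoint to respond (max 5 minutes)
READY=0
for i in $(seq 1 60); do
    sleep 5
    if ! kill -0 $SERVER_PID 2>/dev/null; then
        echo "SERVER CRASHED"; tail -30 $LOGFILE; exit 1
    fi
    # -f makes curl fail on HTTP error responses, not just connection errors
    if curl -sf http://localhost:8000/health >/dev/null 2>&1; then
        READY=1; echo "READY after $((i*5))s"; break
    fi
done

# Stop here if the server never became healthy, instead of sending a request
if [ "$READY" -ne 1 ]; then
    echo "TIMEOUT after 300s"; tail -30 $LOGFILE
    kill -2 $SERVER_PID 2>/dev/null || true
    exit 1
fi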

# Send a test request
curl -s http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "/storage/project/r-rs275-0/'$USER'/hfcache/hub/Llama-3.1-8B-Instruct",
       "prompt": "The capital of France is",
       "max_tokens": 16
     }' | python -m json.tool

# GRACEFUL shutdown — SIGINT, never SIGKILL
echo "=== Stopping server (SIGINT) ==="
kill -2 $SERVER_PID 2>/dev/null || true
wait $SERVER_PID 2>/dev/null || true

echo "=== END: $(date) ==="

2. Submit and watch

# Submit
sbatch hello_vllm.sbatch
# → "Submitted batch job 1234567"

# Watch the queue (Ctrl+C to stop watching)
watch -n 5 squeue -u $USER

# When the job starts running, tail the live output
tail -f hello_vllm_1234567.out

3. What success looks like

=== START: Wed Apr 9 23:00:00 EDT 2026 ===
NVIDIA H100 80GB HBM3, 81559 MiB
torch 2.8.0+cu126, cuda True
Server PID: 12345
READY after 25s
{
    "id": "cmpl-abc123",
    "choices": [{"text": " Paris, located in the north...", ...}],
    "usage": {"completion_tokens": 16, ...}
}
=== Stopping server (SIGINT) ===
=== END: Wed Apr 9 23:00:50 EDT 2026 ===

If you see Paris in the response, your environment is fully working. Proceed to Week 1.

Common errors and fixes

These are the errors we actually hit while building these labs. If you see one, the fix is probably here.

1. Server hangs forever / Health check times out at 300s

Symptom: The server log shows model loading completes, but the Python parent never gets "READY".

Cause: If you launched the server with subprocess.Popen(stdout=PIPE) in Python and never read the pipe, the pipe buffer (~64KB) fills and the server blocks on every print. The server is alive but stuck. This is a deadlock, not a startup problem.

Fix: Always redirect stdout to a file:

import subprocess
# cmd is your server launch command (a list of arguments), defined elsewhere
log_file = open("/tmp/server.log", "w")
proc = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)

2. ValueError: KV cache memory ... larger than the available KV cache memory

Symptom: ValueError: To serve at least one request with the model's max seq len (131072), 16.00 GiB KV cache is needed, which is larger than the available KV cache memory (12.74 GiB)

Cause: vLLM defaults max_model_len to the model's maximum position embeddings, which for Llama-3.1 is 131,072. Reserving KV cache for even a single 131K-token sequence needs more GPU memory than is left after the weights are loaded.

Fix: Always pass --max-model-len 4096 (or 8192) and --gpu-memory-utilization 0.90 to the vLLM server.
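
The 16.00 GiB in the error message can be reproduced by hand. The following is a sketch that assumes Llama-3.1-8B's published configuration (32 layers, 8 KV heads, head dimension 128) and a 2-byte BF16 KV cache. Those values come from the model config, not from this page, and kv_bytes_per_token is a hypothetical helper name.

```shell
#!/bin/bash
# KV cache bytes per token =
#   2 (one K and one V tensor) * layers * kv_heads * head_dim * bytes_per_value
kv_bytes_per_token() {
    local layers="$1" kv_heads="$2" head_dim="$3" dtype_bytes="$4"
    echo $(( 2 * layers * kv_heads * head_dim * dtype_bytes ))
}

# Assumed Llama-3.1-8B values: 32 layers, 8 KV heads, head_dim 128, BF16 (2 bytes)
per_token=$(kv_bytes_per_token 32 8 128 2)

echo "per token:     $(( per_token / 1024 )) KiB"                  # 128 KiB
echo "131072 tokens: $(( per_token * 131072 / 1073741824 )) GiB"   # 16 GiB
echo "4096 tokens:   $(( per_token * 4096 / 1048576 )) MiB"        # 512 MiB
```

With --max-model-len 4096, a single full-length sequence needs about 512 MiB of KV cache instead of 16 GiB, which is why the smaller limit fits comfortably in the 12.74 GiB reported as available.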

3. 401 Unauthorized: Cannot access gated repo for url ...

Symptom: Cannot access gated repo for url https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/resolve/main/config.json

Cause: vLLM tries to validate the model from HF Hub even if you pass a local path, because it parses the model id and falls back to Hub for config.json. The compute node may not have your HF token.

Fix: Set HF_HUB_OFFLINE=1 in the sbatch script — forces use of local cache only:

export HF_HUB_OFFLINE=1
# And use the absolute local path, not the HF model id
python -m vllm.entrypoints.openai.api_server \
    --model $HF_HOME/Llama-3.1-8B-Instruct \
    ...

4. ImportError: TokenizersBackend has no attribute all_special_tokens_extended

Cause: transformers 5.x removed this attribute. vLLM < 0.7 still calls it. You probably installed SGLang into the same env as vLLM, which upgraded transformers.

Fix: Downgrade transformers in the vLLM env, OR (better) keep vLLM and SGLang in separate envs as we recommend at the top of this page.

pip install --no-cache-dir 'transformers<5'

5. Cannot use FA version 2 ... compute capability >= 8

Cause: You are running on V100 (sm 7.0) or RTX 6000 Turing (sm 7.5). FlashAttention 2 needs Ampere (sm 8.0) or newer.

Fix: Use A100, L40S, H100, or H200 instead. Change --gres=gpu:V100:1 to --gres=gpu:A100:1.

6. Maximum CPU:GPU ratio of N:1 for gpu-X node class

Cause: Each PACE GPU partition has a max CPUs-per-GPU ratio. L40S is 4:1, A100 is 8:1, RTX 6000 is 6:1, H200 is 16:1. You requested too many CPUs per GPU.

Fix: Lower --cpus-per-task. For 1 L40S use 4, for 1 A100 use 8, for 1 H200 use 16.
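
The ratios can be kept in one lookup so scripts never request too many CPUs. The following is a minimal sketch using the ratios quoted on this page. The function names are hypothetical, and scaling the limit linearly with the number of GPUs is an assumption based on the N:1 wording of the error.

```shell
#!/bin/bash
# Max CPUs per GPU, taken from the partition table and ratios quoted on this page.
max_cpus_per_gpu() {
    case "$1" in
        rtx_6000) echo 6 ;;
        A100)     echo 8 ;;
        L40S)     echo 4 ;;
        H100)     echo 8 ;;
        H200)     echo 16 ;;
        *)        echo "unknown GPU type: $1" >&2; return 1 ;;
    esac
}

# Largest CPU request for a job with N GPUs of one type
# (assumes the limit scales linearly with the GPU count).
max_cpus_for_job() {
    local per_gpu
    per_gpu=$(max_cpus_per_gpu "$1") || return 1
    echo $(( per_gpu * $2 ))
}

max_cpus_for_job L40S 1   # prints 4
max_cpus_for_job H200 2   # prints 32
```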

7. Job stays PENDING with reason (Priority) for hours

Cause: The cluster is busy and many higher-priority jobs are ahead of you. squeue -p gpu-h100 --format='%.8T' | sort | uniq -c shows the queue depth.

Fix: Submit the same job to several GPU types in parallel (one sbatch per GPU type, each writing to its own results subdirectory tagged by GPU). Whichever job starts first runs; cancel the others with scancel. We use this trick across the labs.
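
A dry-run sketch of that fan-out follows. It only prints the sbatch commands so you can inspect them before running anything. The submit_cmd helper, the GPU_TAG variable, and the results/<gpu>/ directory names are illustrative assumptions, not part of the lab scripts. The sketch relies on sbatch command-line options overriding the matching #SBATCH lines inside the script.

```shell
#!/bin/bash
# Dry run: print one sbatch command per GPU type instead of submitting.
# --parsable makes sbatch print only the job ID, which is handy for scancel later.
submit_cmd() {
    local gpu="$1" cpus="$2" script="$3"
    echo "sbatch --parsable --gres=gpu:${gpu}:1 --cpus-per-task=${cpus}" \
         "--export=ALL,GPU_TAG=${gpu} --output=results/${gpu}/%j.out ${script}"
}

# GPU types and CPU counts follow the CPU:GPU ratios quoted on this page.
for pair in "A100:8" "L40S:4" "H100:8" "H200:16"; do
    gpu="${pair%%:*}"
    cpus="${pair##*:}"
    submit_cmd "$gpu" "$cpus" hello_vllm.sbatch
done
```

Slurm does not create missing output directories, so create results/<gpu>/ for each GPU type (mkdir -p) before submitting. Once one job starts, pass the remaining job IDs to scancel.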

8. Engine core initialization failed (Blackwell sm_120)

Cause: Your vLLM build does not have CUDA kernels compiled for sm_120 (Blackwell). The continuum vLLM fork on PACE is compiled for sm_70/75/80/89/90 only.

Fix: Use H100/H200 instead of RTX Pro 6000 Blackwell, OR build vLLM yourself with TORCH_CUDA_ARCH_LIST including 12.0.

9. HuggingFace baseline crashes: Using a `device_map` requires `accelerate`

Cause: The accelerate package is missing. transformers requires it for any device_map usage (which we use to put the model on the GPU).

Fix: pip install accelerate

10. benchmark_serving.py prints DEPRECATED instead of running

Cause: Newer vLLM (~0.7+) deprecated the standalone benchmark scripts in favor of vllm bench serve, but in some forks the new CLI is not yet wired up.

Fix: Either pin to vLLM 0.6.x, or use a custom client that calls the OpenAI HTTP API directly (we use one in scripts/bench_client.py).

You're ready — what's next: If your hello-world job ran and printed Paris, you have everything needed for the labs. Proceed to Week 1: First LLM Serving →