vLLM (virtual large language model) is an open-source library from UC Berkley.

A key innovation for inference is PagedAttention, which is an efficient virtual memory/page method for storing the KV cache, which in longer contexts can end up committing more memory than the models themselves.

PagedAttention also allows continuous batching, which is a method for handling multiple requests to LLMs that enables less idle time.

Like DeepSpeed, it also supports optimized CUDA kernels for lower latency inference.

It is built on top of Megatron and can interface with DeepSpeed.

Deploying vLLM on Kubernetes

https://docs.vllm.ai/en/latest/deployment/k8s.html

Deploying vLLM on VM

Pre-requires

VM setting

In VM → Hardware:

Click Display, set to Default (or VirtIO-GPU).

Edit your PCI Device (01:00.0) and UNTICK “Primary GPU”. Keep All Functions + PCI-Express checked.

Install NVIDIA driver on ubuntu

Inside the VM:

sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
# after reboot
nvidia-smi
lspci -nnk | grep -iA3 nvidia

Install vLLM

sudo apt update
sudo apt install -y python3-venv python3-pip build-essential
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
python -m pip install -U pip wheel setuptools

Install PyTorch (CUDA build)

Pick the CUDA 12.x wheel from PyTorch’s selector. Example (CUDA 12.4 wheel—if the site shows cu126/cu128, use that instead):

pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio

Install vLLM

pip install vllm

Sanity check

python - <<'PY'
import torch, vllm
print("CUDA available:", torch.cuda.is_available())
print("CUDA reported by PyTorch:", torch.version.cuda)
print("Torch:", torch.__version__)
print("GPU:", torch.cuda.get_device_name(0))
print("vLLM:", vllm.__version__)
PY

Hugging Face Create a token

Sign in at huggingface.co → click your avatar → Settings → Access Tokens → New token.

Name it and choose Role = Read (enough to download models).

Click Create and copy the token (looks like hf_********).

Use the token on your vLLM

# in your vLLM Python env
pip install -U "huggingface_hub[cli]" sentencepiece
git config --global credential.helper store
hf auth login          # paste your hf_ token when prompted
hf auth whoami         # sanity check

Accept the model terms (once, in browser)

Sign in at Hugging Face with the account you’ll use on the VM. Open: https://huggingface.co/google/gemma-3-4b-it Click Agree and access (or Request access) and confirm. If you might also use the base model, do the same for google/gemma-3-4b

Deploy Gemma 3 4B on your vLLM VM (RTX 2070, 8 GB)

# one-shot shell
export HUGGINGFACE_HUB_TOKEN=hf_xxxxxxxxx...   # (optionally: export HF_TOKEN=$HUGGINGFACE_HUB_TOKEN)
# systemd (recommended)
sudo tee /etc/systemd/system/vllm.env >/dev/null <<'EOF'
HUGGINGFACE_HUB_TOKEN=hf_xxxxxxxxx...
HF_HOME=/opt/models/.cache/huggingface
EOF
sudo chmod 600 /etc/systemd/system/vllm.env
# then in /etc/systemd/system/vllm.service under [Service]:
# EnvironmentFile=/etc/systemd/system/vllm.env
sudo systemctl daemon-reload && sudo systemctl restart vllm
HF_HOME=/opt/models/.cache/huggingface \
vllm serve google/gemma-3-270m-it \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 2048 \           # you can raise later (even 4096 fits)
  --max-num-seqs 1 \               # keep concurrency low at first
  --gpu-memory-utilization 0.80 \  # headroom for kernels
  --swap-space 2 \                 # your VM has 10GB RAM; 2GB is safe
  --download-dir /opt/models \
  --trust-remote-code
sudo mkdir -p /opt/models && sudo chown $USER:$USER /opt/models
HF_HOME=/opt/models/.cache/huggingface \
vllm serve google/embeddinggemma-300m \
--host 0.0.0.0 --port 8000 \
--download-dir /opt/models

firewall setting

Open the port if you’ll call it from outside the VM

sudo ufw allow from 192.168.1.0/24 to any port 8000 proto tcp

Health check

alpine-docker:~# curl -i http://192.168.1.243:8000/health
HTTP/1.1 200 OK
date: Mon, 27 Oct 2025 03:47:22 GMT
server: uvicorn
content-length: 0

Run it as a service

# /etc/systemd/system/vllm-embed.service
[Unit]
Description=vLLM - EmbeddingGemma-300M
After=network-online.target
Wants=network-online.target
 
[Service]
User=yanboyang713
Environment=HF_HOME=/opt/models/.cache/huggingface
# If the model is gated, add: Environment=HUGGINGFACE_HUB_TOKEN=hf_xxx
ExecStart=/home/yanboyang713/venvs/vllm/bin/vllm serve google/embeddinggemma-300m \
  --task embedding --host 0.0.0.0 --port 8001 --download-dir /opt/models
Restart=always
RestartSec=3
 
[Install]
WantedBy=multi-user.target
 
sudo systemctl daemon-reload
sudo systemctl enable --now vllm-embed
sudo systemctl status vllm-embed --no-pager

Deploying vLLM on UVA High-Performance Computing Systems

login

ssh -Y rhe9cf@login.hpc.virginia.edu

Choose storage location

Use /scratch for Hugging Face model cache and your vLLM environment, not /home, because /scratch is intended for large computational work. Note that /scratch is temporary, not backed up, and files not accessed for more than 90 days may be deleted.

mkdir -p /scratch/$USER/vllm/{envs,hf-cache,logs,scripts}
cd /scratch/$USER/vllm

Set cache paths:

export HF_HOME=/scratch/$USER/vllm/hf-cache
export HUGGINGFACE_HUB_CACHE=$HF_HOME/hub
export TRANSFORMERS_CACHE=$HF_HOME/transformers

Load Python / Miniforge

UVA recommends Miniforge for Python on HPC; available versions can be checked with module spider miniforge, and the default can be loaded with module load miniforge.

module purge
module load miniforge/24.11.3-py3.12

If that exact version is unavailable, check:

module spider miniforge

Then load the available Python 3.12 or Python 3.11 Miniforge module.

Create a vLLM environment

vLLM’s stable docs require Linux, Python 3.10–3.13, and NVIDIA GPUs with compute capability 7.5 or higher.

conda create -p /scratch/$USER/vllm/envs/vllm python=3.12 -y
source activate /scratch/$USER/vllm/envs/vllm

Install uv and vLLM:

python -m pip install uv
uv pip install vllm --torch-backend=auto
uv pip install openai

vLLM’s docs recommend creating a fresh Python environment and show uv venv —python 3.12; they also recommend uv for installing vLLM wheels.

Check installation:

python -c "import vllm; print(vllm.__version__)"
python -c "import torch; print(torch.cuda.is_available())"

The second command may show False on a login node. That is okay; test CUDA inside a GPU job.

Choose a model and GPU size

For your first test, use a small or medium model. Examples:

export MODEL_ID="Qwen/Qwen2.5-7B-Instruct"

Avoid choosing a model that violates UVA policy. UVA’s RC usage policy page says users must comply with acceptable-use rules and includes restrictions on downloading or using prohibited applications such as DeepSeek AI on RC resources.

First test: interactive GPU session

Use this for debugging before writing a batch script.

salloc -A <your_allocation> \
  -p gpu \
  --gres=gpu:v100:1 \
  -c 8 \
  --mem=16G \
  -t 01:00:00

You can use uva slurm script generator