vLLM (virtual large language model) is an open-source LLM inference and serving library that originated at UC Berkeley.
Its key innovation for inference is PagedAttention, which borrows virtual-memory paging from operating systems to store the KV cache in fixed-size blocks rather than in one contiguous region per request. With long contexts, the KV cache can end up consuming more memory than the model weights themselves.
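To make the KV cache claim concrete, here is a back-of-the-envelope estimate in Python. It uses the usual accounting for a decoder-only transformer (one key tensor and one value tensor per layer, each holding `num_kv_heads * head_dim` values per token). The model dimensions are hypothetical round numbers chosen for illustration, not the configuration of any particular model.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for `batch_size` sequences of `seq_len` tokens."""
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model: 32 layers, 32 KV heads of dimension 128, fp16 (2 bytes per element).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
print(f"KV cache per token: {per_token / 2**10:.0f} KiB")

# 16 concurrent sequences of 8192 tokens each.
total = kv_cache_bytes(32, 32, 128, seq_len=8192, batch_size=16)
print(f"KV cache for 16 x 8192 tokens: {total / 2**30:.0f} GiB")
```

With these assumed dimensions the cache costs 512 KiB per token and 64 GiB for the whole batch, several times the roughly 14 GB that 7 billion fp16 parameters would occupy.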
PagedAttention also enables continuous batching: instead of waiting for a whole batch to finish, the server admits new requests as soon as running ones complete, which reduces GPU idle time.
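The effect of continuous batching on idle time can be illustrated with a toy simulation. This is a sketch of the scheduling idea only, not vLLM's scheduler: it ignores prefill, memory limits, and preemption, and the request lengths are made up.

```python
from collections import deque


def static_batching_steps(lengths, capacity):
    """Each batch runs until its longest request finishes; only then does the next batch start."""
    total = 0
    for i in range(0, len(lengths), capacity):
        total += max(lengths[i:i + capacity])
    return total


def continuous_batching_steps(lengths, capacity):
    """After every decode step, finished requests leave and queued requests take the free slots."""
    queue = deque(lengths)
    running = []  # remaining decode steps for each in-flight request
    steps = 0
    while queue or running:
        # Fill any free slots before the next decode step.
        while queue and len(running) < capacity:
            running.append(queue.popleft())
        steps += 1
        # Requests with one step left finish now; the rest carry on.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical decode lengths (tokens to generate), in arrival order.
lengths = [8, 2, 2, 2, 8, 2]
static_steps = static_batching_steps(lengths, capacity=2)
continuous_steps = continuous_batching_steps(lengths, capacity=2)
print("static:", static_steps, "continuous:", continuous_steps)
```

For this made-up workload the static schedule needs 18 decode steps and the continuous one 14, because short requests no longer hold a slot hostage until the longest request in their batch finishes.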
Like DeepSpeed, it also ships optimized CUDA kernels for lower-latency inference.
It is built on PyTorch rather than on Megatron; for multi-GPU inference it implements Megatron-LM's tensor-parallel algorithm.
Deploying vLLM on Kubernetes
https://docs.vllm.ai/en/latest/deployment/k8s.html
Deploying vLLM on a VM
Prerequisites
VM settings
In VM → Hardware:
Click Display, set to Default (or VirtIO-GPU).
Edit your PCI Device (01:00.0) and UNTICK “Primary GPU”. Keep All Functions + PCI-Express checked.
Install the NVIDIA driver on Ubuntu
Inside the VM:
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
# after reboot
nvidia-smi
lspci -nnk | grep -iA3 nvidia
Install vLLM
sudo apt update
sudo apt install -y python3-venv python3-pip build-essential
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
python -m pip install -U pip wheel setuptools
Install PyTorch (CUDA build)
Pick the CUDA 12.x wheel from PyTorch’s selector. Example (CUDA 12.4 wheel—if the site shows cu126/cu128, use that instead):
pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio
Install vLLM
pip install vllm
Sanity check
python - <<'PY'
import torch, vllm
print("CUDA available:", torch.cuda.is_available())
print("CUDA reported by PyTorch:", torch.version.cuda)
print("Torch:", torch.__version__)
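# get_device_name raises when no GPU is visible, so only call it if CUDA is available
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))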
print("vLLM:", vllm.__version__)
PY
Hugging Face
Create a token
Sign in at huggingface.co → click your avatar → Settings → Access Tokens → New token.
Name it and choose Role = Read (enough to download models).
Click Create and copy the token (looks like hf_********).
Use the token on your vLLM VM
# in your vLLM Python env
pip install -U "huggingface_hub[cli]" sentencepiece
git config --global credential.helper store
hf auth login # paste your hf_ token when prompted
hf auth whoami # sanity check
Accept the model terms (once, in browser)
Sign in at Hugging Face with the account you’ll use on the VM.
Open: https://huggingface.co/google/gemma-3-4b-it
Click Agree and access (or Request access) and confirm.
If you might also use the base model, do the same for google/gemma-3-4b.
Deploy Gemma 3 4B on your vLLM VM (RTX 2070, 8 GB)
# one-shot shell
export HF_TOKEN=hf_xxxxxxxxx... # huggingface_hub reads HF_TOKEN (HUGGING_FACE_HUB_TOKEN is the older name)
# systemd (recommended)
sudo tee /etc/systemd/system/vllm.env >/dev/null <<'EOF'
HF_TOKEN=hf_xxxxxxxxx...
HF_HOME=/opt/models/.cache/huggingface
EOF
sudo chmod 600 /etc/systemd/system/vllm.env
# then in /etc/systemd/system/vllm.service under [Service]:
# EnvironmentFile=/etc/systemd/system/vllm.env
sudo systemctl daemon-reload && sudo systemctl restart vllm
# Flag notes (listed here because a comment after a trailing "\" breaks the line continuation):
#   --max-model-len 2048           you can raise this later (even 4096 fits)
#   --max-num-seqs 1               keep concurrency low at first
#   --gpu-memory-utilization 0.80  leaves headroom for kernels
#   --swap-space 2                 your VM has 10 GB RAM; 2 GB is safe
HF_HOME=/opt/models/.cache/huggingface \
vllm serve google/gemma-3-270m-it \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 2048 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.80 \
  --swap-space 2 \
  --download-dir /opt/models \
  --trust-remote-code
sudo mkdir -p /opt/models && sudo chown $USER:$USER /opt/models
HF_HOME=/opt/models/.cache/huggingface \
vllm serve google/embeddinggemma-300m \
--host 0.0.0.0 --port 8000 \
  --download-dir /opt/models
Firewall settings
Open the port if you’ll call it from outside the VM:
sudo ufw allow from 192.168.1.0/24 to any port 8000 proto tcp
Health check
alpine-docker:~# curl -i http://192.168.1.243:8000/health
HTTP/1.1 200 OK
date: Mon, 27 Oct 2025 03:47:22 GMT
server: uvicorn
content-length: 0
Run it as a service
# /etc/systemd/system/vllm-embed.service
[Unit]
Description=vLLM - EmbeddingGemma-300M
After=network-online.target
Wants=network-online.target
[Service]
User=yanboyang713
Environment=HF_HOME=/opt/models/.cache/huggingface
# If the model is gated, add: Environment=HF_TOKEN=hf_xxx
ExecStart=/home/yanboyang713/venvs/vllm/bin/vllm serve google/embeddinggemma-300m \
--task embedding --host 0.0.0.0 --port 8001 --download-dir /opt/models
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now vllm-embed
sudo systemctl status vllm-embed --no-pager
Deploying vLLM on UVA High-Performance Computing Systems
Log in
ssh -Y rhe9cf@login.hpc.virginia.edu
Choose a storage location
Use /scratch for Hugging Face model cache and your vLLM environment, not /home, because /scratch is intended for large computational work. Note that /scratch is temporary, not backed up, and files not accessed for more than 90 days may be deleted.
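Because of the 90-day rule, it can help to check occasionally which cached files have not been read recently. The sketch below lists files whose last access time is older than a threshold. It only approximates whatever the cluster's purge process actually checks, and the helper name `stale_files` and the directory layout are assumptions for illustration.

```python
import os
import time
from pathlib import Path


def stale_files(root, max_age_days=90, now=None):
    """Return files under `root` whose last access time is older than `max_age_days`."""
    root = Path(root)
    if not root.is_dir():
        return []
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for path in root.rglob("*"):
        # st_atime is the last access time reported by the filesystem.
        if path.is_file() and path.stat().st_atime < cutoff:
            stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Assumed layout: /scratch/$USER/vllm
    root = os.path.join("/scratch", os.environ.get("USER", ""), "vllm")
    for path in stale_files(root):
        print(path)
```

Anything this prints is a candidate for re-downloading or for copying to longer-term storage before it disappears.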
mkdir -p /scratch/$USER/vllm/{envs,hf-cache,logs,scripts}
cd /scratch/$USER/vllm
Set cache paths:
export HF_HOME=/scratch/$USER/vllm/hf-cache
export HUGGINGFACE_HUB_CACHE=$HF_HOME/hub
export TRANSFORMERS_CACHE=$HF_HOME/transformers
Load Python / Miniforge
UVA recommends Miniforge for Python on HPC; available versions can be checked with module spider miniforge, and the default can be loaded with module load miniforge.
module purge
module load miniforge/24.11.3-py3.12
If that exact version is unavailable, check:
module spider miniforge
Then load the available Python 3.12 or Python 3.11 Miniforge module.
Create a vLLM environment
vLLM’s stable docs require Linux, Python 3.10–3.13, and NVIDIA GPUs with compute capability 7.0 or higher (which includes the V100).
conda create -p /scratch/$USER/vllm/envs/vllm python=3.12 -y
source activate /scratch/$USER/vllm/envs/vllm
Install uv and vLLM:
python -m pip install uv
uv pip install vllm --torch-backend=auto
uv pip install openai
vLLM’s docs recommend creating a fresh Python environment (their example uses uv venv --python 3.12) and installing the vLLM wheels with uv.
Check installation:
python -c "import vllm; print(vllm.__version__)"
python -c "import torch; print(torch.cuda.is_available())"
The second command may print False on a login node. That is expected; test CUDA inside a GPU job.
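Once you are inside a GPU job, a slightly fuller check can compare the Python version and the GPU's compute capability against vLLM's stated requirements. The sketch below is one way to do that. The helper names are made up, and the minimum capability of (7, 0) is an assumption; replace it with whatever the requirements page for your vLLM version states.

```python
import sys


def python_supported(version, low=(3, 10), high=(3, 13)):
    """True if (major, minor) falls inside the inclusive range [low, high]."""
    return low <= tuple(version[:2]) <= high


def gpu_supported(capability, minimum):
    """True if the (major, minor) compute capability is at least `minimum`."""
    return tuple(capability) >= tuple(minimum)


if __name__ == "__main__":
    print("Python OK:", python_supported(sys.version_info))
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this environment")
    else:
        if torch.cuda.is_available():
            cap = torch.cuda.get_device_capability(0)
            # Assumption: (7, 0) as the minimum; check the vLLM docs for your version.
            print("GPU:", torch.cuda.get_device_name(0), cap,
                  "OK:", gpu_supported(cap, (7, 0)))
        else:
            print("No CUDA device visible (expected on a login node)")
```

Run it with the vLLM environment activated; on a login node it should fall through to the last branch.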
Choose a model and GPU size
For your first test, use a small or medium model. Example:
export MODEL_ID="Qwen/Qwen2.5-7B-Instruct"
Avoid choosing a model that violates UVA policy. UVA’s RC usage policy page says users must comply with acceptable-use rules and includes restrictions on downloading or using prohibited applications such as DeepSeek AI on RC resources.
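To judge whether a model fits a given GPU, a rough first estimate is parameter count times bytes per parameter, compared against the share of GPU memory vLLM is allowed to use. The sketch below assumes about 7.6 billion parameters for the example model, fp16 weights, and a 0.9 memory-utilization setting. It ignores activations and the CUDA context, so treat the result as an upper bound on what is left for the KV cache.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of the model weights in GiB."""
    return num_params * bytes_per_param / 2**30


def kv_budget_gib(num_params: float, gpu_mem_gib: float,
                  bytes_per_param: int = 2, utilization: float = 0.9) -> float:
    """Memory left for the KV cache after the weights; negative means the weights do not fit."""
    return gpu_mem_gib * utilization - weight_memory_gib(num_params, bytes_per_param)


# Assumption: roughly 7.6e9 parameters for a 7B-class model, stored in fp16.
params = 7.6e9
print(f"weights: {weight_memory_gib(params):.1f} GiB")
for gpu_mem in (16, 32):  # V100 cards exist in 16 GB and 32 GB variants
    print(f"{gpu_mem} GiB GPU: {kv_budget_gib(params, gpu_mem):.1f} GiB left for KV cache")
```

Under these assumptions a 16 GB card holds the weights with almost nothing left for the KV cache, so a smaller model or a larger GPU is the safer choice for a first test.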
First test: interactive GPU session
Use this for debugging before writing a batch script.
salloc -A <your_allocation> \
-p gpu \
--gres=gpu:v100:1 \
-c 8 \
--mem=16G \
-t 01:00:00
You can also generate the batch script with UVA’s Slurm Script Generator.
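When the interactive test works, the same resource request can be carried over to a batch script. As a sketch, the hypothetical helper below renders the salloc options above as `#SBATCH` directives. It is not the output of UVA's generator, and the default command (`nvidia-smi`) is only a placeholder for whatever the job should run.

```python
def render_sbatch(account, partition="gpu", gres="gpu:v100:1", cpus=8,
                  mem="16G", time_limit="01:00:00", command="nvidia-smi"):
    """Render a minimal Slurm batch script mirroring the salloc options."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH -A {account}",
        f"#SBATCH -p {partition}",
        f"#SBATCH --gres={gres}",
        f"#SBATCH -c {cpus}",
        f"#SBATCH --mem={mem}",
        f"#SBATCH -t {time_limit}",
        "",
        command,  # placeholder: replace with the commands the job should run
    ]
    return "\n".join(lines) + "\n"


script = render_sbatch("<your_allocation>")
print(script)
```

Save the printed text to a file and submit it with sbatch once the placeholder account and command have been filled in.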