Deploying vLLM on Kubernetes

https://docs.vllm.ai/en/latest/deployment/k8s.html

Deploying vLLM on a VM

Prerequisites

VM settings

In VM → Hardware:

Click Display, set to Default (or VirtIO-GPU).

Edit your PCI Device (01:00.0) and UNTICK “Primary GPU”. Keep All Functions + PCI-Express checked.
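
If the host is Proxmox VE (the Hardware dialog above matches its PCI passthrough options), the same settings can also be made from the host CLI. A rough sketch, assuming VM ID 100 and a q35 machine type (both are placeholders, adjust to your setup):

qm set 100 --vga virtio                     # Display = VirtIO-GPU
qm set 100 --hostpci0 0000:01:00,pcie=1     # pass the whole device (all functions), PCI-Express; no x-vga option, i.e. not the primary GPU
qm config 100 | grep -E '^(vga|hostpci)'    # confirm the resulting config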

Install NVIDIA driver

Inside the VM:

sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
# after reboot
nvidia-smi
lspci -nnk | grep -iA3 nvidia
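
If nvidia-smi still fails after the reboot, a few generic Linux checks (nothing vLLM-specific) help confirm whether the proprietary kernel module actually loaded and the passed-through card is bound to it:

lsmod | grep -i nvidia                     # are the nvidia kernel modules loaded?
cat /proc/driver/nvidia/version            # driver version as seen by the kernel
sudo dmesg | grep -iE 'nvidia|nvrm' | tail # recent driver messages (Secure Boot / passthrough problems show up here)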

Install vLLM

sudo apt update
sudo apt install -y python3-venv python3-pip build-essential
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
python -m pip install -U pip wheel setuptools

Install PyTorch (CUDA build)

Pick the CUDA 12.x wheel from PyTorch’s selector. The example below uses the CUDA 12.4 (cu124) index; if the selector shows cu126/cu128, use that index URL instead:

pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio

Install vLLM

pip install vllm

Sanity check

python - <<'PY'
import torch, vllm
print("CUDA available:", torch.cuda.is_available())
print("CUDA reported by PyTorch:", torch.version.cuda)
print("Torch:", torch.__version__)
print("GPU:", torch.cuda.get_device_name(0))
print("vLLM:", vllm.__version__)
PY

Hugging Face: create a token

Sign in at huggingface.co → click your avatar → Settings → Access Tokens → New token.

Name it and choose Role = Read (enough to download models).

Click Create and copy the token (looks like hf_********).

Use the token on your vLLM VM

# in your vLLM Python env
pip install -U "huggingface_hub[cli]" sentencepiece
git config --global credential.helper store
hf auth login          # paste your hf_ token when prompted
hf auth whoami         # sanity check
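
Optionally, you can pre-fetch the model weights before starting the server so the first vllm serve start does not block on the download. A minimal sketch using the same CLI; depending on your huggingface_hub version the subcommand may be exposed as hf download or huggingface-cli download, and the weights land in the default cache (~/.cache/huggingface) unless HF_HOME is set:

hf download google/gemma-3-270m-it     # pre-populate the local cache with the model weights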

Accept the model terms (once, in browser)

Sign in at Hugging Face with the account you’ll use on the VM.

Open: https://huggingface.co/google/gemma-3-4b-it

Click Agree and access (or Request access) and confirm.

If you might also use the base model, do the same for google/gemma-3-4b.

Deploy Gemma 3 4B on your vLLM VM (RTX 2070, 8 GB)

# Pass the token: one-shot shell ...
export HUGGINGFACE_HUB_TOKEN=hf_xxxxxxxxx...   # (optionally: export HF_TOKEN=$HUGGINGFACE_HUB_TOKEN)

# ... or via systemd (recommended)
sudo tee /etc/systemd/system/vllm.env >/dev/null <<'EOF'
HUGGINGFACE_HUB_TOKEN=hf_xxxxxxxxx...
HF_HOME=/opt/models/.cache/huggingface
EOF
sudo chmod 600 /etc/systemd/system/vllm.env
# then in /etc/systemd/system/vllm.service under [Service]:
# EnvironmentFile=/etc/systemd/system/vllm.env
sudo systemctl daemon-reload && sudo systemctl restart vllm

# Create the model/cache directory before the first start
sudo mkdir -p /opt/models && sudo chown $USER:$USER /opt/models

# Serve Gemma. Flag notes are kept out of the command so the line continuations stay valid:
#   --max-model-len 2048         you can raise this later (even 4096 fits)
#   --max-num-seqs 1             keep concurrency low at first
#   --gpu-memory-utilization 0.80  leaves headroom for kernels
#   --swap-space 2               the VM has 10GB RAM; 2GB is safe
HF_HOME=/opt/models/.cache/huggingface \
vllm serve google/gemma-3-270m-it \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 2048 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.80 \
  --swap-space 2 \
  --download-dir /opt/models \
  --trust-remote-code

# Serve EmbeddingGemma-300M (use a different port if the Gemma server is already on 8000)
HF_HOME=/opt/models/.cache/huggingface \
vllm serve google/embeddinggemma-300m \
  --host 0.0.0.0 --port 8000 \
  --download-dir /opt/models
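
Once the server is up it exposes an OpenAI-compatible API, so a quick smoke test with curl works; the IP, port, and prompt below are just examples, match them to your setup:

curl http://192.168.1.243:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "google/gemma-3-270m-it",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 32
      }'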

Firewall settings

Open the port if you’ll call it from outside the VM

sudo ufw allow from 192.168.1.0/24 to any port 8000 proto tcp

Health check

alpine-docker:~# curl -i http://192.168.1.243:8000/health
HTTP/1.1 200 OK
date: Mon, 27 Oct 2025 03:47:22 GMT
server: uvicorn
content-length: 0

Run it as a service

# /etc/systemd/system/vllm-embed.service
[Unit]
Description=vLLM - EmbeddingGemma-300M
After=network-online.target
Wants=network-online.target
 
[Service]
User=yanboyang713
Environment=HF_HOME=/opt/models/.cache/huggingface
# If the model is gated, add: Environment=HUGGINGFACE_HUB_TOKEN=hf_xxx
ExecStart=/home/yanboyang713/venvs/vllm/bin/vllm serve google/embeddinggemma-300m \
  --task embedding --host 0.0.0.0 --port 8001 --download-dir /opt/models
Restart=always
RestartSec=3
 
[Install]
WantedBy=multi-user.target
 
sudo systemctl daemon-reload
sudo systemctl enable --now vllm-embed
sudo systemctl status vllm-embed --no-pager
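
With the unit running, the embedding server can be exercised the same way as the health check above. A minimal sketch against the OpenAI-compatible /v1/embeddings route; the IP and input text are placeholders, and the port matches the 8001 used in the unit file:

curl http://192.168.1.243:8001/v1/embeddings \
  -H 'Content-Type: application/json' \
  -d '{"model": "google/embeddinggemma-300m", "input": "hello world"}'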