Deploying vLLM on Kubernetes
https://docs.vllm.ai/en/latest/deployment/k8s.html
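The official guide above has the complete walkthrough (model cache volumes, probes, autoscaling, etc.). As a rough orientation only, here is a minimal Deployment plus Service sketch applied via a bash heredoc to match the style of these notes; the image tag, model name, GPU resource, and the hf-token Secret are assumptions to adapt to your cluster:
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-gemma
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-gemma
  template:
    metadata:
      labels:
        app: vllm-gemma
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: ["vllm serve google/gemma-3-270m-it --max-model-len 2048 --port 8000"]
        env:
        - name: HUGGINGFACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token   # assumes: kubectl create secret generic hf-token --from-literal=token=hf_xxx
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1  # requires the NVIDIA device plugin on the node
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-gemma
spec:
  selector:
    app: vllm-gemma
  ports:
  - port: 8000
    targetPort: 8000
EOF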
Deploying vLLM on a VM
Prerequisites
VM settings
In VM → Hardware:
Click Display, set to Default (or VirtIO-GPU).
Edit your PCI Device (01:00.0) and UNTICK “Primary GPU”. Keep All Functions + PCI-Express checked.
Install NVIDIA driver
Inside the VM:
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
sudo reboot
# after reboot
nvidia-smi
lspci -nnk | grep -iA3 nvidia
Install vLLM
sudo apt update
sudo apt install -y python3-venv python3-pip build-essential
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate
python -m pip install -U pip wheel setuptools
Install PyTorch (CUDA build)
Pick the CUDA 12.x wheel from PyTorch's selector. The example below uses the CUDA 12.4 (cu124) wheel; if the selector shows cu126/cu128, use that index URL instead:
pip install --index-url https://download.pytorch.org/whl/cu124 torch torchvision torchaudio
Install vLLM
pip install vllm
Sanity check
python - <<'PY'
import torch, vllm
print("CUDA available:", torch.cuda.is_available())
print("CUDA reported by PyTorch:", torch.version.cuda)
print("Torch:", torch.__version__)
print("GPU:", torch.cuda.get_device_name(0))
print("vLLM:", vllm.__version__)
PY
Hugging Face
Create a token
Sign in at huggingface.co → click your avatar → Settings → Access Tokens → New token.
Name it and choose Role = Read (enough to download models).
Click Create and copy the token (looks like hf_********).
Use the token on your vLLM VM
# in your vLLM Python env
pip install -U "huggingface_hub[cli]" sentencepiece
git config --global credential.helper store
hf auth login # paste your hf_ token when prompted
hf auth whoami # sanity check
Accept the model terms (once, in browser)
Sign in at Hugging Face with the account you'll use on the VM. Open https://huggingface.co/google/gemma-3-4b-it, click Agree and access (or Request access), and confirm. If you might also use the base model, do the same for google/gemma-3-4b.
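Optionally, once access is granted, you can confirm the token can actually see the gated repo by pulling one small file with the hf CLI installed above (the model and filename here are just for the check):
# downloads only config.json into the local HF cache and prints its path;
# a 403 means the model terms have not been accepted for this account
hf download google/gemma-3-4b-it config.json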
Deploy Gemma 3 4B on your vLLM VM (RTX 2070, 8 GB)
# one-shot shell
export HUGGINGFACE_HUB_TOKEN=hf_xxxxxxxxx... # (optionally: export HF_TOKEN=$HUGGINGFACE_HUB_TOKEN)
# systemd (recommended)
sudo tee /etc/systemd/system/vllm.env >/dev/null <<'EOF'
HUGGINGFACE_HUB_TOKEN=hf_xxxxxxxxx...
HF_HOME=/opt/models/.cache/huggingface
EOF
sudo chmod 600 /etc/systemd/system/vllm.env
# then in /etc/systemd/system/vllm.service under [Service]:
# EnvironmentFile=/etc/systemd/system/vllm.env
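That vllm.service unit is not shown elsewhere in these notes; a minimal sketch, written with the same sudo tee pattern and assuming the user, venv path, and serve flags used in the examples on this page, could look like this (adjust to your setup):
sudo tee /etc/systemd/system/vllm.service >/dev/null <<'EOF'
[Unit]
Description=vLLM - Gemma 3
After=network-online.target
Wants=network-online.target

[Service]
User=yanboyang713
EnvironmentFile=/etc/systemd/system/vllm.env
ExecStart=/home/yanboyang713/venvs/vllm/bin/vllm serve google/gemma-3-270m-it \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 2048 --gpu-memory-utilization 0.80 \
  --download-dir /opt/models --trust-remote-code
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
EOF
If you want it to start on boot, also run sudo systemctl enable vllm; the restart command below will start it for the current session either way.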
sudo systemctl daemon-reload && sudo systemctl restart vllm
Create the model directory first (both HF_HOME and --download-dir point at it):
sudo mkdir -p /opt/models && sudo chown $USER:$USER /opt/models
Then start the server. The flag notes are kept above the command because a comment after a trailing backslash breaks the line continuation:
# --max-model-len 2048           you can raise this later (even 4096 fits)
# --max-num-seqs 1               keep concurrency low at first
# --gpu-memory-utilization 0.80  leaves headroom for CUDA kernels
# --swap-space 2                 the VM has 10 GB RAM; 2 GB of CPU swap space is safe
HF_HOME=/opt/models/.cache/huggingface \
vllm serve google/gemma-3-270m-it \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 2048 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.80 \
  --swap-space 2 \
  --download-dir /opt/models \
  --trust-remote-code
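With the server up, a quick request against the OpenAI-compatible chat completions endpoint confirms end-to-end generation (adjust the model name to whatever you served):
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "google/gemma-3-270m-it",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 32
      }'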
Deploy EmbeddingGemma-300M
HF_HOME=/opt/models/.cache/huggingface \
vllm serve google/embeddinggemma-300m \
  --host 0.0.0.0 --port 8000 \
  --download-dir /opt/models
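A quick check of the embeddings endpoint (here against port 8000 as started above; the systemd service further down uses 8001):
curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "google/embeddinggemma-300m", "input": "hello world"}'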
Firewall setting
Open the port if you'll call it from outside the VM:
sudo ufw allow from 192.168.1.0/24 to any port 8000 proto tcp
Health check
alpine-docker:~# curl -i http://192.168.1.243:8000/health
HTTP/1.1 200 OK
date: Mon, 27 Oct 2025 03:47:22 GMT
server: uvicorn
content-length: 0
Run it as a service
# /etc/systemd/system/vllm-embed.service
[Unit]
Description=vLLM - EmbeddingGemma-300M
After=network-online.target
Wants=network-online.target
[Service]
User=yanboyang713
Environment=HF_HOME=/opt/models/.cache/huggingface
# If the model is gated, add: Environment=HUGGINGFACE_HUB_TOKEN=hf_xxx
ExecStart=/home/yanboyang713/venvs/vllm/bin/vllm serve google/embeddinggemma-300m \
--task embedding --host 0.0.0.0 --port 8001 --download-dir /opt/models
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
sudo systemctl daemon-reload
sudo systemctl enable --now vllm-embed
sudo systemctl status vllm-embed --no-pager
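A couple of quick checks once the service is running (note this unit listens on port 8001):
# follow the service logs while the model loads
journalctl -u vllm-embed -f
# once the server reports it is up, the health endpoint should return 200
curl -i http://localhost:8001/health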