Ollama is a platform that makes local development with open-source Large Language Models (LLMs) a breeze. With Ollama, everything you need to run an LLM, the model weights and all of the configuration, is packaged into a single Modelfile. Think Docker for LLMs.

Download Ollama to Get Started

As a first step, you should download Ollama to your machine. Ollama is supported on all major platforms: macOS, Windows, and Linux.

To download Ollama, you can either visit the official GitHub repo and follow the download links from there, or visit the official website and grab the installer if you are on a Mac or a Windows machine.

I’m on Linux. If you’re a Linux user like me, you can run the following command to execute the installer script (the GitHub repo also documents a manual install):

yanboyang713@Meta-Scientific-Linux ~ % curl -fsSL https://ollama.com/install.sh | sh
>>> Installing ollama to /usr/local
[sudo] password for yanboyang713:
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink '/etc/systemd/system/default.target.wants/ollama.service' → '/etc/systemd/system/ollama.service'.
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.
WARNING: No NVIDIA/AMD GPU detected. Ollama will run in CPU-only mode.

The installation typically takes a few minutes. During the install, any NVIDIA/AMD GPUs are auto-detected, so make sure the drivers are already installed. CPU-only mode works fine too, but it can be much slower.
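
Once the installer finishes, it is worth a quick sanity check. A minimal check, assuming the default systemd service and port shown in the install log above:

# Print the client and server version
ollama --version

# Confirm the systemd service is active
systemctl status ollama

# The API answers on the default port and prints "Ollama is running"
curl http://127.0.0.1:11434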

Get the Model from the Ollama Model Library

Next, you can visit the model library to check the list of all model families currently supported. The default model downloaded is the one with the latest tag. On the page for each model, you can get more info such as the size and quantization used.

You can search through the list of tags to locate the model that you want to run. For each model family, there are typically foundational models of different sizes and instruction-tuned variants. I’m interested in running the Gemma 3 4B model, from the Gemma family of lightweight models built by Google DeepMind.

You can run the model using the ollama run command to pull and start interacting with the model directly. However, you can also pull the model onto your machine first and then run it. This is very similar to how you work with Docker images.

For Gemma 3 4B, running the following pull command downloads the model onto your machine:

ollama pull gemma3:4b
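
Once the pull completes, you can confirm the model is available locally and inspect its metadata:

# List models stored on this machine
ollama list

# Show details such as architecture, parameter count, quantization, and context length
ollama show gemma3:4b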

Get the model from Hugging Face

Download the GGUF locally

# Install the Hugging Face CLI
pip install -U "huggingface_hub[cli]" sentencepiece

# Let the CLI store your token as a git credential so you only log in once
git config --global credential.helper store
hf auth login          # paste your hf_ token when prompted

# Download the GGUF files to a local directory
hf download google/gemma-3-4b-it-qat-q4_0-gguf \
  --local-dir /opt/models/gemma3-4b-it-q4_0
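
After the download finishes, confirm that both the model GGUF and the mmproj projector file referenced below are present:

ls -lh /opt/models/gemma3-4b-it-q4_0/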

Make the files readable by the ollama service user

# Give the ollama service user ownership and read access to the model directory
sudo chown -R ollama:ollama /opt/models
sudo chmod -R a+rX /opt/models

# Verify access as the service user
sudo -u ollama ls -lh /opt/models/gemma3-4b-it-q4_0/gemma-3-4b-it-q4_0.gguf

# Make sure the multimodal projector file is readable as well
sudo chown ollama:ollama /opt/models/gemma3-4b-it-q4_0/mmproj-model-f16-4B.gguf
sudo chmod a+r /opt/models/gemma3-4b-it-q4_0/mmproj-model-f16-4B.gguf

Write a Modelfile that references the local GGUF

A minimal Modelfile with sensible defaults for an 8 GB GPU. Save the following as Modelfile:

# Gemma 3 4B IT (QAT Q4_0), local GGUF
FROM /opt/models/gemma3-4b-it-q4_0/gemma-3-4b-it-q4_0.gguf
ADAPTER /opt/models/gemma3-4b-it-q4_0/mmproj-model-f16-4B.gguf
# Make Ollama format prompts for Gemma chat
PARSER gemma
 
# Minimal, explicit chat template for Gemma 3 (user/model turns;
# the system prompt is folded into the first user turn)
TEMPLATE """<start_of_turn>user
{{ if .System }}{{ .System }} {{ end }}{{ .Prompt }}<end_of_turn>
<start_of_turn>model
{{ .Response }}<end_of_turn>
"""
# Context window – start conservative, raise later after checking VRAM
PARAMETER num_ctx 4096
 
# Put as many layers on the GPU as possible for speed.
# If VRAM gets tight after load, lower this (e.g., 35 → 25 → 15) and recreate.
PARAMETER num_gpu 35
 
# Sampling: steady/concise output; adjust to taste
PARAMETER temperature 0.1
PARAMETER top_p 0.95
PARAMETER top_k 64
 
# Optional system prompt
SYSTEM You are a helpful, concise assistant.

How to tune PARAMETERs (what changes and why)

  • num_ctx – max tokens kept in the context window.
    • ↑ increases memory used by the KV cache and slows prefill.
    • For 8 GB: start at 4096, then try 6144/8192 only if you still have free VRAM.
  • num_gpu – how many layers get placed on the GPU.
    • ↑ = faster but more VRAM for weights; ↓ frees VRAM for KV.
    • Goal: after loading, keep ≥ 1–1.5 GiB free (watch with watch -n 0.5 nvidia-smi).
  • Sampling (temperature, top_p, top_k) – style/creativity vs. determinism.
    • Lower temperature (0.1–0.3) = more focused; raise if too terse.
  • Other useful ones (optional)
    • repeat_penalty (e.g., 1.1–1.2) to reduce loops.
    • stop to add stop strings.
    • num_keep (advanced): how many initial tokens to always keep when sliding the window.
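
You do not have to rebuild the model just to experiment with these values: most of them can be overridden per request through the API's options field, or interactively in the REPL with /set parameter. A quick sketch using the gemma3:4b model pulled earlier:

# Override num_ctx and temperature for a single request
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Summarize what a context window is.",
  "stream": false,
  "options": { "num_ctx": 8192, "temperature": 0.2 }
}'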

After editing a Modelfile, re-run ollama create … -f Modelfile to rebuild that local model.

Create & test:

ollama rm boyang/gemma3-4b-it:v1 2>/dev/null || true
yanboyang713@vllm:~/models$ OLLAMA_LOG=debug ollama create boyang/gemma3-4b-it:v1 -f /home/yanboyang713/models/Modelfile
gathering model components
copying file sha256:76aed0a8285b83102f18b5d60e53c70d09eb4e9917a20ce8956bd546452b56e2 100%
parsing GGUF
using existing layer sha256:76aed0a8285b83102f18b5d60e53c70d09eb4e9917a20ce8956bd546452b56e2
creating new layer sha256:6484116035edb2addfd185f1ffe887edb5096b6c65fbf87583f08c0935a36bc5
creating new layer sha256:95b19da7ce4d7cf34cd929495d70496ae9a964bbeddab0b865a0dc7859ac84d4
writing manifest
success
 
ollama run boyang/gemma3-4b-it:v1 "Explain context length in 3 short sentences."
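
If the create step succeeds, you can inspect exactly what was baked into the new model, which helps when debugging template or parameter issues:

# Print the stored Modelfile, template, and parameters of the custom model
ollama show --modelfile boyang/gemma3-4b-it:v1
ollama show --template boyang/gemma3-4b-it:v1
ollama show --parameters boyang/gemma3-4b-it:v1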

Run the Model

Run the model using the ollama run command as shown:

yanboyang713@Meta-Scientific-Linux ~ % ollama run gemma3:4b
>>> hello, where is the capital of China?
The capital of China is **Beijing**.
 
It's a huge, historic city and a major cultural and political center. 😊
 
Do you want to know anything more about Beijing?
 
>>> /bye

Doing so starts an Ollama REPL where you can interact with the Gemma 3 4B model.
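
Inside the REPL, a few slash commands are worth knowing (type /? for the full list):

/show info                      # model details: architecture, parameters, quantization
/set parameter num_ctx 8192     # change a runtime parameter for this session
/set system You are a concise assistant   # set a system prompt (used in the next section)
/save mymodel                   # save the current session as a new model
/bye                            # exit the REPL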

Customize Model Behavior with System Prompts

You can customize an LLM by setting a system prompt for a specific desired behavior. The workflow is:

  • Set system prompt for desired behavior.
  • Save the model by giving it a name.
  • Exit the REPL and run the model you just created.

Say you want the model to always explain concepts or answer questions in plain English, with as little technical jargon as possible. Here’s how to go about doing it:

>>> /set system For all questions asked answer in plain English avoiding technical jargon as much as possible
Set system message.
>>> /save ipe
Created new model 'ipe'
>>> /bye

Now run the model you just created:

ollama run ipe
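
The same customization can be captured in a Modelfile instead of the REPL, which makes it easier to keep under version control. A minimal sketch (the ipe name is just the example from above):

FROM gemma3:4b
SYSTEM For all questions asked, answer in plain English, avoiding technical jargon as much as possible.

Then rebuild it with ollama create ipe -f Modelfile.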

How to Use the Ollama API

Ollama API reference link: https://github.com/ollama/ollama/blob/main/docs/api.md

Access the API locally using curl

curl http://localhost:11434/api/generate -d '{ "model": "boyang/gemma3-4b-it:v1", "prompt": "How are you today?"}'

By default, the endpoint streams a series of JSON objects, one chunk at a time, which is not very readable. Adding the "stream": false parameter returns a single JSON object whose response field contains the full answer.

curl http://localhost:11434/api/generate -d '{ "model": "boyang/gemma3-4b-it:v1", "prompt": "How are you today?", "stream": false}'
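
The API also exposes a chat endpoint that takes a list of role-tagged messages, which is the natural fit for multi-turn conversations (see the API reference linked above):

curl http://localhost:11434/api/chat -d '{
  "model": "boyang/gemma3-4b-it:v1",
  "messages": [
    { "role": "user", "content": "How are you today?" }
  ],
  "stream": false
}'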

Export Port to Public
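
By default the service only listens on 127.0.0.1:11434. To reach it from other machines, you can set the OLLAMA_HOST environment variable for the systemd service (this is the approach described in the Ollama FAQ) and open the port in your firewall. A sketch of the systemd override:

# Open an override file for the ollama service
sudo systemctl edit ollama.service

# In the editor, add:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0:11434"

# Reload and restart so the new binding takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama

Keep in mind that the API has no built-in authentication, so restrict access with firewall rules or a reverse proxy before exposing it beyond your local network.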

qwen2.5-omni

Multimodal AI

Reference List

  1. https://github.com/ollama/ollama
  2. https://www.kdnuggets.com/ollama-tutorial-running-llms-locally-made-super-simple
  3. https://www.gpu-mart.com/blog/ollama-api-usage-examples
  4. https://ai.google.dev/gemma/docs/integrations/ollama