Ollama Advanced Guide 2026: Production-Grade Configuration and Optimization for Local LLMs
From Installation to GPU Acceleration, API Services, and Model Tuning — Master It All
Ollama is currently the simplest tool for running local LLMs — with a single command you can run open-source models like Llama 3, Mistral, and Qwen. But behind that simplicity lie many optimization details; most people's Ollama setups only achieve about 30% of their potential performance.
1. Installation and Environment Setup
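macOS / Linux
bash
# The install script is written for Linux. On macOS, if it reports that it only supports Linux, download the app from ollama.com or run: brew install ollama
curl -fsSL https://ollama.com/install.sh | sh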
Windows
Download the official installer, or use WSL2:
bash
# Inside WSL2
curl -fsSL https://ollama.com/install.sh | sh
Verify Installation
bash
ollama --version
ollama run llama3.2 # Download and run Llama 3.2 3B
2. GPU Acceleration Configuration
This is the most critical factor affecting performance.
macOS (Apple Silicon)
Ollama automatically uses Metal GPU acceleration on Apple Silicon, with no extra configuration needed. The unified memory architecture of M1/M2/M3 lets the GPU address system RAM directly, which makes running large models efficient.
Windows/Linux (NVIDIA GPU)
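Ollama automatically detects NVIDIA GPUs, but you need a working NVIDIA driver installed:
bash
# Check if the GPU is recognized: the PROCESSOR column shows the CPU/GPU split for each loaded model
ollama ps
# Check if GPU acceleration is active. Debug logs come from the server process, so set OLLAMA_DEBUG there
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i gpu
If the server log reports model layers being offloaded to the GPU (the exact wording varies between versions), or ollama ps shows GPU in the PROCESSOR column, GPU acceleration is enabled.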
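The same check can be scripted against the REST API. The sketch below assumes the server is at the default address and that each entry returned by /api/ps carries `size` and `size_vram` fields, as the Ollama API documentation describes. The helper names `gpu_share`, `loaded_models`, and `report` are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if your server differs


def gpu_share(model_entry: dict) -> float:
    """Return the fraction (0.0 to 1.0) of a loaded model that resides in VRAM."""
    total = model_entry.get("size", 0)
    in_vram = model_entry.get("size_vram", 0)
    return in_vram / total if total else 0.0


def loaded_models(base_url: str = OLLAMA_URL) -> list:
    """Fetch the currently loaded models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        return json.load(resp).get("models", [])


def report(models: list) -> list:
    """Format one summary line per loaded model."""
    return [f"{m.get('name', '?')}: {gpu_share(m):.0%} on GPU" for m in models]
```

With a model loaded, `print("\n".join(report(loaded_models())))` lists each model and its GPU share. A value well below 100% suggests that part of the model spilled over to system RAM.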
Handling Insufficient Memory
bash
# Use quantized versions (smaller, faster, slightly lower quality)
ollama run llama3.2:3b-instruct-q4_K_M # 4-bit quantization
ollama run qwen2.5:7b-instruct-q4_K_M
# Check a model's parameter count and quantization level to estimate its memory requirements
ollama show llama3.3:70b
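As a rough rule of thumb, the weights of a model take about (parameter count × bits per weight ÷ 8) bytes. The sketch below turns that into a quick fit check. It is an approximation only: the `overhead_gib` default is an assumed placeholder for the KV cache and runtime buffers, not a measured value, and 4-bit k-quants such as q4_K_M average somewhat more than 4 bits per weight.

```python
def estimate_weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits(params_billion: float, bits_per_weight: float,
         available_gib: float, overhead_gib: float = 1.5) -> bool:
    """Return True if weights plus an assumed overhead fit in available memory."""
    needed = estimate_weight_gib(params_billion, bits_per_weight) + overhead_gib
    return needed <= available_gib


if __name__ == "__main__":
    for name, params in [("3B", 3), ("7B", 7), ("70B", 70)]:
        print(name, f"{estimate_weight_gib(params, 4):.1f} GiB at 4 bits")
```

By this estimate a 70B model at 4 bits needs over 30 GiB for the weights alone, which is why it does not fit on a single 24 GB card without offloading part of it to system RAM.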
3. Using the REST API
Ollama comes with a built-in REST API, defaulting to http://localhost:11434:
Basic Call
bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a quicksort in Python",
  "stream": false
}'
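When "stream" is omitted or set to true, /api/generate returns newline-delimited JSON objects, each carrying a fragment of the answer in "response" and a "done" flag on the last one. The following is a minimal sketch of consuming that stream with only the standard library. The function names are illustrative, and the default address is assumed.

```python
import json
import urllib.request


def build_generate_request(model, prompt, stream=True,
                           base_url="http://localhost:11434"):
    """Build a POST request for /api/generate without sending it."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": stream}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def collect_stream(lines):
    """Join the 'response' fragments of NDJSON chunks until one is marked done."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(model, prompt):
    """Send the request to a running server and return the full answer."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return collect_stream(resp)
```

Calling `generate("llama3.2", "Write a quicksort in Python")` requires a running server with the model pulled. To show text as it arrives, print each fragment inside the loop instead of joining at the end.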
Chat Mode (Maintains Context)
bash
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:7b",
  "messages": [
    {"role": "user", "content": "Hello, I'm learning Python"},
    {"role": "assistant", "content": "Great! What would you like to start with?"},
    {"role": "user", "content": "Let's start with list operations"}
  ]
}'
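The server itself keeps no conversation state: context is maintained because the client resends the full message list on every call, as the curl example does. One way to manage that history is sketched below. The `Conversation` class is a hypothetical helper, and the reply is read from `message.content` in the non-streaming response.

```python
import json
import urllib.request


class Conversation:
    """Hypothetical helper that accumulates chat history for /api/chat."""

    def __init__(self, model, base_url="http://localhost:11434"):
        self.model = model
        self.base_url = base_url
        self.messages = []  # full history, resent on every request

    def payload(self, user_text):
        """Build the body for the next turn without mutating the history."""
        pending = self.messages + [{"role": "user", "content": user_text}]
        return {"model": self.model, "messages": pending, "stream": False}

    def record(self, user_text, assistant_text):
        """Store a completed exchange so later turns can see it."""
        self.messages.append({"role": "user", "content": user_text})
        self.messages.append({"role": "assistant", "content": assistant_text})

    def ask(self, user_text):
        """Send one turn to a running server and return the reply text."""
        body = json.dumps(self.payload(user_text)).encode("utf-8")
        request = urllib.request.Request(
            f"{self.base_url}/api/chat",
            data=body,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=120) as resp:
            reply = json.load(resp)["message"]["content"]
        self.record(user_text, reply)
        return reply
```

Because the history grows with every turn, long sessions eventually exceed the model's context window, so trimming or summarizing old turns is left to the caller.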
OpenAI-Compatible API (Important!)
Ollama supports the OpenAI format, so you can point the OpenAI SDK at it directly:
python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Any string, no authentication locally
)
response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello'}]
)
print(response.choices[0].message.content)
This means most tools that speak the OpenAI API and let you set a custom base URL can connect to Ollama directly.
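To make concrete what "OpenAI format" means on the wire, here is a sketch of the same call without the SDK: a POST to /v1/chat/completions with a Bearer header, and a reply read from `choices[0].message.content`. The function names are illustrative, and the placeholder key is assumed to be ignored by a local server.

```python
import json
import urllib.request


def build_chat_completion_request(model, user_text,
                                  base_url="http://localhost:11434/v1",
                                  api_key="ollama"):
    """Build an OpenAI-style chat completion request without sending it."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": user_text}]}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder key for local use
        },
        method="POST",
    )


def first_reply(response_body: dict) -> str:
    """Extract the assistant text from an OpenAI-style response body."""
    return response_body["choices"][0]["message"]["content"]


def complete(model, user_text, **kwargs):
    """Send the request to a running server and return the reply text."""
    request = build_chat_completion_request(model, user_text, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return first_reply(json.load(resp))
```

Since only `base_url` and `api_key` differ between a local and a hosted endpoint, keeping both in configuration lets the same code switch between them.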
4. Model Selection Guide (2026)
For Chinese scenarios, the Qwen2.5 series is highly recommended: Alibaba's models are generally stronger than Llama at Chinese understanding and generation.
5. Integration with Open WebUI
Open WebUI provides a ChatGPT-like graphical interface for Ollama:
bash
# Install with Docker (recommended)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
Visit http://localhost:3000, and you'll see a full AI chat interface.
6. Integration with Continue.dev (VS Code Extension)
Continue is an open-source VS Code AI coding plugin that supports Ollama:
json
// ~/.continue/config.json
{
  "models": [
    {
      "title": "Qwen2.5 Coder (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5 Coder (Autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b" // Small model for autocomplete, fast
  }
}
This gives you a completely free, fully local, data-never-leaves-your-computer AI coding assistant.
7. Performance Tuning
bash
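# Set number of parallel requests per model (the default varies by version; raise it if the CPU/GPU is strong)
OLLAMA_NUM_PARALLEL=2 ollama serve
# Set how long models stay in memory (default 5 minutes, set 0 to release immediately)
OLLAMA_KEEP_ALIVE=10m ollama serve
# Cap VRAM usage. Caution: this variable is undocumented, and releases that honored it read an absolute byte count rather than a ratio, so verify it against your version (20 GiB shown here)
OLLAMA_MAX_VRAM=21474836480 ollama serve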
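Each command above starts a separate server, so in practice the variables are set together on one process. A minimal sketch of doing that from Python follows. `serve_env` and `start_server` are hypothetical helpers, and only the two documented variables from this section are handled.

```python
import os
import subprocess


def serve_env(num_parallel=None, keep_alive=None, base_env=None):
    """Return a copy of the environment with the chosen Ollama settings applied."""
    env = dict(os.environ if base_env is None else base_env)
    if num_parallel is not None:
        if num_parallel < 1:
            raise ValueError("num_parallel must be at least 1")
        env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    if keep_alive is not None:
        env["OLLAMA_KEEP_ALIVE"] = keep_alive  # e.g. "10m", or "0" to unload at once
    return env


def start_server(**settings):
    """Launch `ollama serve` with the settings; requires ollama on PATH."""
    return subprocess.Popen(["ollama", "serve"], env=serve_env(**settings))
```

For example, `start_server(num_parallel=2, keep_alive="10m")` returns a `Popen` handle that can later be stopped with `terminate()`. If Ollama runs as a system service, the variables belong in the service configuration instead.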