
Ollama Advanced Guide 2026: Production-Grade Configuration and Optimization for Local LLMs

From Installation to GPU Acceleration, API Services, and Model Tuning — Master It All

Ollama is currently one of the simplest tools for running local LLMs: a single command runs open-source models such as Llama 3, Mistral, and Qwen. Behind that simplicity lie many optimization details, and a default setup often leaves much of the hardware unused (roughly 30% of achievable performance, by this guide's estimate).

1. Installation and Environment Setup

macOS / Linux

bash
curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the official installer, or use WSL2:
bash
# Inside WSL2
curl -fsSL https://ollama.com/install.sh | sh

Verify Installation

bash
ollama --version
ollama run llama3.2  # Download and run Llama 3.2 3B
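
The install can also be verified over HTTP once the server is running. The following is a minimal sketch, assuming the server listens on the default http://localhost:11434 and exposes the /api/version and /api/tags endpoints; the helper names are illustrative and not part of Ollama.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # assumed default address


def get_json(path, base_url=BASE_URL, timeout=5):
    """GET a JSON document from the Ollama server."""
    with urllib.request.urlopen(f"{base_url}{path}", timeout=timeout) as resp:
        return json.load(resp)


def installed_models(tags_payload):
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def check_install(base_url=BASE_URL):
    """Return (server_version, model_names); raises URLError if the server is down."""
    version = get_json("/api/version", base_url)["version"]
    models = installed_models(get_json("/api/tags", base_url))
    return version, models
```

Calling `check_install()` after `ollama run llama3.2` should list that model among the installed names.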

2. GPU Acceleration Configuration

This is the most critical factor affecting performance.

macOS (Apple Silicon)

Ollama automatically uses Metal GPU acceleration on Apple Silicon — no extra configuration needed. The unified memory architecture of M1/M2/M3 makes running large models very efficient:
  • M2 Pro (16GB): Can run 13B models smoothly
  • M3 Max (48GB): Can run 70B models (quantized)

Windows/Linux (NVIDIA GPU)

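Ollama automatically detects NVIDIA GPUs, but the NVIDIA driver must be properly installed:
bash
# Check that the driver sees the GPU
nvidia-smi

# With a model loaded, the PROCESSOR column shows the GPU/CPU split
ollama ps

# Check whether GPU acceleration is active. Debug logging is a server setting,
# so set it on `ollama serve` (stop any running instance first), then load a model
OLLAMA_DEBUG=1 ollama serve 2>&1 | grep -i gpu

If the log reports model layers being offloaded to the GPU once a model is loaded, GPU acceleration is enabled.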
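
The same check can be scripted against the REST API. This sketch assumes the /api/ps endpoint returns, for each loaded model, a total `size` and a `size_vram` field in bytes; verify the field names against your Ollama version. The function names are hypothetical.

```python
import json
import urllib.request


def gpu_fraction(model_entry):
    """Fraction of a loaded model's memory that resides in VRAM (0.0 to 1.0)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0.0
    return model_entry.get("size_vram", 0) / size


def loaded_models(base_url="http://localhost:11434"):
    """Return {model_name: gpu_fraction} for every model currently loaded."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as resp:
        payload = json.load(resp)
    return {m["name"]: gpu_fraction(m) for m in payload.get("models", [])}
```

A fraction of 1.0 means the model sits entirely in VRAM; anything lower means part of it runs on the CPU, which usually slows generation noticeably.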

Handling Insufficient Memory

bash
# Use quantized versions (smaller, faster, slightly lower quality)
ollama run llama3.2:3b-instruct-q4_K_M  # 4-bit quantization
ollama run qwen2.5:7b-instruct-q4_K_M

# Inspect a model's details (parameter count, quantization, context length)
ollama show llama3.3:70b
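
To see why quantization matters, the weight size can be estimated from the parameter count and the bits stored per weight. This is a back-of-the-envelope sketch: the `overhead_gb` allowance for the KV cache and runtime buffers is an assumption, and 4-bit formats such as q4_K_M use somewhat more than 4 bits per weight on average.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, available_gb, overhead_gb=1.5):
    """Rough check; overhead_gb is an assumed allowance for KV cache and buffers."""
    needed = weight_memory_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= available_gb


print(weight_memory_gb(7, 16))  # 7B model at 16-bit precision
print(weight_memory_gb(7, 4))   # the same model at 4 bits per weight
```

Going from 16 bits to 4 bits cuts the weights of a 7B model from about 14 GB to about 3.5 GB, which is the difference between not fitting and fitting on an 8 GB GPU.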

3. Using the REST API

Ollama comes with a built-in REST API, defaulting to http://localhost:11434:

Basic Call

bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a quicksort in Python",
  "stream": false
}'
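
The same call can be made from Python with only the standard library. This sketch assumes the default address and that the reply carries the generated text in a `response` field; when `stream` is true, the server is assumed to send one JSON object per line, each with a text fragment and a `done` flag. Function names are illustrative.

```python
import json
import urllib.request


def generate_request(model, prompt, stream=False, base_url="http://localhost:11434"):
    """Build the POST request for /api/generate without sending it."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def join_stream(lines):
    """Concatenate the "response" fragments of a streamed, line-delimited reply."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate(model, prompt):
    """Send a non-streaming request and return the generated text."""
    with urllib.request.urlopen(generate_request(model, prompt), timeout=120) as resp:
        return json.load(resp)["response"]
```

For a streamed reply, pass the open HTTP response object to `join_stream`, since iterating it yields one line at a time.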
    

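Chat Mode (Maintains Context)

The chat endpoint is stateless: context is maintained by sending the full message history with every request. Avoid apostrophes inside the single-quoted JSON body, since they would end the shell string early.

bash
curl http://localhost:11434/api/chat -d '{
  "model": "qwen2.5:7b",
  "messages": [
    {"role": "user", "content": "Hello, I am learning Python"},
    {"role": "assistant", "content": "Great! What would you like to start with?"},
    {"role": "user", "content": "Start with list operations"}
  ]
}'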
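
Because the server keeps no conversation state, the client has to carry the history. A minimal sketch of that bookkeeping follows, assuming a non-streaming /api/chat reply holds the assistant turn under a `message` key; `ChatSession` and `post_chat` are hypothetical names, and the `send` parameter is injectable so the class can be exercised without a server.

```python
import json
import urllib.request


def post_chat(payload, base_url="http://localhost:11434"):
    """Send one /api/chat request and return the parsed JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


class ChatSession:
    """Keeps the message history client-side and resends it on every turn."""

    def __init__(self, model, send=post_chat):
        self.model = model
        self.send = send  # swap in a fake for offline testing
        self.messages = []

    def ask(self, text):
        self.messages.append({"role": "user", "content": text})
        reply = self.send(
            {"model": self.model, "messages": self.messages, "stream": False}
        )
        self.messages.append(reply["message"])  # remember the assistant turn
        return reply["message"]["content"]
```

Each call to `ask` sends a longer history than the last, so long conversations eventually need trimming or summarizing to stay within the model's context window.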
    

OpenAI-Compatible API (Important!)

Ollama exposes an OpenAI-compatible endpoint, so existing OpenAI SDK code can use it by changing the base URL:
python
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # Required by the SDK but not checked locally; any string works
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Hello'}],
)
print(response.choices[0].message.content)

This means most tools that support the OpenAI API can connect to Ollama directly.
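
One way to make an application switch between a hosted endpoint and local Ollama is to resolve the endpoint from the environment. The sketch below assumes the `OPENAI_BASE_URL` and `OPENAI_API_KEY` variable names, which the official OpenAI Python SDK also reads; `client_settings` is a hypothetical helper, and other tools may use different variable names.

```python
import os

OLLAMA_DEFAULTS = {
    "base_url": "http://localhost:11434/v1",
    "api_key": "ollama",  # placeholder; not checked by a local Ollama server
}


def client_settings(env=None):
    """Explicit OPENAI_* variables win; otherwise fall back to local Ollama."""
    env = os.environ if env is None else env
    return {
        "base_url": env.get("OPENAI_BASE_URL", OLLAMA_DEFAULTS["base_url"]),
        "api_key": env.get("OPENAI_API_KEY", OLLAMA_DEFAULTS["api_key"]),
    }
```

With the SDK installed, `OpenAI(**client_settings())` would then talk to Ollama unless the environment points elsewhere.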

4. Model Selection Guide (2026)

| Model | Parameters | Memory Required | Use Case |
|---|---|---|---|
| Llama 3.2 3B | 3B | ~2GB | Quick Q&A, simple tasks |
| Qwen2.5 7B | 7B | ~5GB | Chinese tasks, code (recommended!) |
| Llama 3.3 70B | 70B | ~40GB | Complex reasoning (needs large memory) |
| DeepSeek-R1 14B | 14B | ~10GB | Math/code reasoning |
| Mistral 7B | 7B | ~5GB | English tasks, general |
| Phi-4 | 14B | ~9GB | Code generation, reasoning |

For Chinese scenarios, the Qwen2.5 series is highly recommended: Alibaba's models are generally stronger than Llama at Chinese understanding and generation.
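
The memory figures in the model list above can be turned into a quick filter for a given machine. This is a sketch only: the model tags are assumed to match the Ollama library names, and `headroom_gb` is an assumed safety margin for the OS and context cache.

```python
# (model tag, approximate memory in GB) from the model list above
MODELS = [
    ("llama3.2:3b", 2),
    ("qwen2.5:7b", 5),
    ("mistral:7b", 5),
    ("phi4", 9),
    ("deepseek-r1:14b", 10),
    ("llama3.3:70b", 40),
]


def models_that_fit(available_gb, headroom_gb=2):
    """Return the tags whose memory need plus headroom fits in available_gb."""
    return [tag for tag, need in MODELS if need + headroom_gb <= available_gb]


print(models_that_fit(16))
```

On a 16 GB machine this keeps everything up to the 14B models and excludes the 70B one.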

5. Integration with Open WebUI

Open WebUI provides a ChatGPT-like graphical interface for Ollama:

bash
# Install with Docker (recommended)
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

Visit http://localhost:3000, and you'll see a full AI chat interface where you can:

  • Switch between different models
  • Manage conversation history
  • Upload documents for analysis
  • Create custom AI personas

6. Integration with Continue.dev (VS Code Extension)

Continue is an open-source VS Code AI coding plugin that supports Ollama:

json
// ~/.continue/config.json
{
  "models": [
    {
      "title": "Qwen2.5 Coder (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen2.5 Coder (Autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"  // Small model for autocomplete, fast
  }
}

This gives you a completely free, fully local AI coding assistant whose data never leaves your computer.

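7. Performance Tuning

These are server settings, so set them together in the environment of the `ollama serve` process:

bash
# Number of requests each model processes in parallel
# (the default depends on the Ollama version and available memory)
export OLLAMA_NUM_PARALLEL=2

# How long models stay in memory after a request (default 5 minutes; 0 unloads immediately)
export OLLAMA_KEEP_ALIVE=10m

# VRAM to keep free per GPU, in bytes (1 GiB here); the value is a byte count, not a usage ratio
export OLLAMA_GPU_OVERHEAD=1073741824

ollama serve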
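
Parallelism only pays off if the client actually sends requests concurrently. The sketch below assumes a server started with a parallel setting of 2 and uses a matching number of worker threads; it also passes `keep_alive` per request, which the generate endpoint accepts as an override of the server default. `run_parallel` is an illustrative helper.

```python
from concurrent.futures import ThreadPoolExecutor
import json
import urllib.request


def generate(prompt, model="llama3.2", keep_alive="10m",
             base_url="http://localhost:11434"):
    """One non-streaming /api/generate call with a per-request keep_alive."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "stream": False, "keep_alive": keep_alive}
    ).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["response"]


def run_parallel(prompts, send=generate, workers=2):
    """Send prompts concurrently; results keep the order of the prompts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, prompts))
```

Timing `run_parallel` with `workers=1` and then `workers=2` on the same prompts is a simple way to confirm the server setting took effect. Requests beyond the server's parallel limit wait in its queue, so extra workers add little.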


Further Reading

  • Building a Private AI Assistant with Open WebUI + Ollama
  • DeepSeek R1 Local Deployment Tutorial
  • LLM API Cost Optimization Guide