Deploy TinyLlama 1.1B on Raspberry Pi 5 — Home automation assistant

Complete setup guide for running TinyLlama 1.1B locally on a Raspberry Pi 5 as a home automation assistant

Advanced · 15 minutes


Tags: edge-ai, local-llm, deployment, on-device, raspberry-pi-5


Overview

Run TinyLlama 1.1B directly on a Raspberry Pi 5 as a home automation assistant. Local inference keeps your data on the device, avoids network round-trips to a cloud API, and has no ongoing API costs.

Specs: ARM CPU · 4GB RAM
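To see why a 1.1B-parameter model fits comfortably in 4GB of RAM, a rough back-of-the-envelope estimate of the weight size helps. The sketch below is an approximation only: it counts the weights alone, the bit widths are nominal (real GGUF quantizations such as Q4_K_M average slightly more than 4 bits per weight), and the KV cache and runtime overhead come on top. The function name is illustrative, not part of any library.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# 1.1 billion parameters at a few nominal bit widths (illustrative values)
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(1.1e9, bits):.2f} GiB")
```

At a nominal 4 bits per weight this comes to roughly half a GiB, which leaves most of the Pi's memory for the operating system and other services.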

Installation

bash
# Install Ollama, an easy-to-use local inference runtime
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

Download Model

bash
# Pull TinyLlama 1.1B (downloads GGUF quantized weights automatically)
ollama pull tinyllama

# Run interactive chat
ollama run tinyllama

# Start API server (skip this if the installer already started Ollama as a background service)
ollama serve

# API available at http://localhost:11434
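Before wiring anything else to the API, it can be useful to confirm that the server is reachable and that the model has actually been pulled. The following is a minimal sketch that queries Ollama's `/api/tags` endpoint (which lists locally available models) using only the standard library. The helper names are illustrative, and the matching rule (ignoring the `:tag` suffix) is a simplifying assumption.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address used throughout this guide


def installed_models(tags_payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [m["name"] for m in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str) -> bool:
    """True if `wanted` matches an installed model, ignoring the ':tag' suffix."""
    base = wanted.split(":")[0]
    return any(name.split(":")[0] == base for name in installed_models(tags_payload))


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Ask the running Ollama server which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

With the server running, `has_model(fetch_tags(), "tinyllama")` should return `True` once the pull has finished. A connection error from `fetch_tags()` usually means the server is not running.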

Python Integration

python
import json
from typing import Iterator

import httpx


class LocalAI:
    """Interface to local TinyLlama 1.1B running on Raspberry Pi 5."""

    BASE_URL = "http://localhost:11434"
    MODEL = "tinyllama"

    def chat(self, message: str, system: str = "") -> str:
        """Single-turn chat."""
        messages = [{"role": "user", "content": message}]
        if system:
            # Only send a system message when one was provided
            messages.insert(0, {"role": "system", "content": system})
        resp = httpx.post(
            f"{self.BASE_URL}/api/chat",
            json={"model": self.MODEL, "messages": messages, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]

    def stream(self, message: str) -> Iterator[str]:
        """Streaming chat for real-time output."""
        with httpx.stream(
            "POST",
            f"{self.BASE_URL}/api/chat",
            json={
                "model": self.MODEL,
                "messages": [{"role": "user", "content": message}],
                "stream": True,
            },
            timeout=120,
        ) as r:
            r.raise_for_status()
            # Ollama streams one JSON object per line
            for line in r.iter_lines():
                if line:
                    chunk = json.loads(line)
                    if not chunk.get("done"):
                        yield chunk["message"]["content"]

# Usage
ai = LocalAI()
response = ai.chat("Help me set up a home automation routine")
print(response)

# Streaming
for token in ai.stream("Explain home automation step by step"):
    print(token, end="", flush=True)
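For an assistant that actually controls devices, free-form replies are hard to act on. One possible approach, sketched below, is to ask the model for JSON (Ollama's chat endpoint accepts `"format": "json"`) and validate the reply against an allow-list before doing anything with it. The device names, the action schema, and the system prompt are all hypothetical placeholders, and a model as small as TinyLlama may not follow the schema reliably, which is why the parser rejects anything unexpected.

```python
import json

# Hypothetical device registry for illustration only; replace with your own devices.
ALLOWED = {
    "living_room_light": {"on", "off"},
    "thermostat": {"set_temperature"},
}

SYSTEM_PROMPT = (
    "Reply only with JSON of the form "
    '{"device": "<name>", "action": "<action>", "value": <number or null>}.'
)


def build_request(message: str, model: str = "tinyllama") -> dict:
    """Payload for POST /api/chat that asks Ollama to constrain output to JSON."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ],
        "format": "json",
        "stream": False,
    }


def parse_action(reply: str):
    """Return (device, action, value) for a valid, allowed command, else None."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    device, action = data.get("device"), data.get("action")
    if not isinstance(device, str) or not isinstance(action, str):
        return None
    if device not in ALLOWED or action not in ALLOWED[device]:
        return None
    return device, action, data.get("value")
```

The payload from `build_request` can be posted to `/api/chat` the same way `LocalAI.chat` does, and the returned message content passed to `parse_action`. Treating every rejected reply as "no action" keeps a malformed answer from toggling a device.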

Custom Modelfile

bash
# Create optimized configuration for the home automation assistant
# num_ctx is set to 2048, TinyLlama's trained context length
cat > Modelfile << 'MODELEOF'
FROM tinyllama

PARAMETER num_ctx 2048
PARAMETER temperature 0.7
PARAMETER top_p 0.9

SYSTEM "You are a home automation assistant running locally on a Raspberry Pi 5. Be concise, accurate, and helpful."
MODELEOF

ollama create home-automation-assistant -f Modelfile
ollama run home-automation-assistant
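The same parameters can also be supplied per request through the `options` field of the chat API instead of being baked into a Modelfile, which is convenient while experimenting. The helper below is a small sketch of that idea. The function name is illustrative, and the option values are examples rather than tuned recommendations.

```python
def chat_payload(message: str, model: str = "tinyllama", system: str = "", **options) -> dict:
    """Build a /api/chat request body, passing sampling parameters via `options`."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": message})
    payload = {"model": model, "messages": messages, "stream": False}
    if options:
        # Mirrors Modelfile PARAMETER lines, e.g. num_ctx, temperature, top_p
        payload["options"] = dict(options)
    return payload


payload = chat_payload(
    "Is the porch light on?",
    system="You are a concise home automation assistant.",
    temperature=0.7,
    top_p=0.9,
)
```

Per-request options suit quick comparisons. Once the values settle, moving them into the Modelfile keeps every client consistent.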

Performance Profile

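| Metric | Value |
| --- | --- |
| Hardware | ARM CPU |
| Memory | 4GB RAM |
| Speed | 10-40 tokens/sec (CPU) / 40-100+ tok/s (GPU) |
| First token | <1s (CPU) / <200ms (GPU) |
| Context | 2048 tokens (TinyLlama's trained context length) |
| Cost | $0 (after hardware) |

On a Raspberry Pi 5, Ollama runs on the CPU, so the CPU figures are the relevant ones. The GPU figures apply only to hosts with a supported GPU.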
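Rather than relying on generic figures, throughput can be measured on your own board. A non-streaming Ollama response includes timing fields such as `eval_count` (generated tokens) and `eval_duration`, `prompt_eval_duration`, and `load_duration` (all in nanoseconds). The sketch below converts them into more readable numbers. The function name and the returned keys are illustrative choices.

```python
NS_PER_SECOND = 1_000_000_000


def generation_stats(resp: dict) -> dict:
    """Summarize timing fields from a non-streaming Ollama response."""
    eval_seconds = resp.get("eval_duration", 0) / NS_PER_SECOND
    tokens = resp.get("eval_count", 0)
    return {
        # Generation speed, excluding model load and prompt processing
        "tokens_per_second": tokens / eval_seconds if eval_seconds else 0.0,
        # Time spent processing the prompt before the first generated token
        "prompt_seconds": resp.get("prompt_eval_duration", 0) / NS_PER_SECOND,
        # Time spent loading the model into memory (near zero when already warm)
        "load_seconds": resp.get("load_duration", 0) / NS_PER_SECOND,
    }
```

Passing the parsed JSON of a `/api/chat` response (with `"stream": False`) to `generation_stats` gives a quick view of whether slowness comes from model loading, prompt processing, or generation itself.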

Production Setup with FastAPI

python
from fastapi import FastAPI
from pydantic import BaseModel

# LocalAI is the client class defined in the Python Integration section above

app = FastAPI(title="Raspberry Pi 5 AI API")
ai = LocalAI()


class ChatRequest(BaseModel):
    message: str
    system: str = ""


class ChatResponse(BaseModel):
    response: str
    model: str
    device: str


# Declared with plain `def` so FastAPI runs the blocking HTTP call in its
# thread pool instead of stalling the event loop
@app.post("/chat", response_model=ChatResponse)
def chat_endpoint(req: ChatRequest):
    response = ai.chat(req.message, req.system)
    return ChatResponse(response=response, model="TinyLlama 1.1B", device="Raspberry Pi 5")


@app.get("/health")
async def health():
    return {"status": "ok", "model": "TinyLlama 1.1B", "device": "Raspberry Pi 5"}

Troubleshooting

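  • Slow inference: Switch to Q4_K_M quantization and reduce the context window (num_ctx).
  • Out of memory: Use a smaller model or a Q3_K_S quant.
  • GPU not used: This is expected on the Raspberry Pi 5, where Ollama runs on the CPU. On other hosts, install CUDA/Metal drivers and check the Ollama logs.
  • High latency: Warm up the model by sending a dummy request on startup.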
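The warm-up tip can be sketched as follows. Ollama loads a model when it receives a request to `/api/generate` that names the model without a prompt, and the `keep_alive` field controls how long the model stays in memory afterwards. The helper names and the 30-minute value are assumptions for illustration, and the standard library is used here to avoid extra dependencies.

```python
import json
import urllib.request


def warmup_request(model: str = "tinyllama", keep_alive: str = "30m",
                   base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a prompt-less /api/generate request that only loads the model."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(timeout: float = 120.0, **kwargs) -> bool:
    """Send the warm-up request; return False if the server cannot be reached."""
    try:
        with urllib.request.urlopen(warmup_request(**kwargs), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection errors, HTTP errors, and timeouts
        return False
```

Calling `warm_up()` once when your service starts means the first real request does not pay the model load time. A longer `keep_alive` trades memory for responsiveness.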

Resources

  • Ollama library: https://ollama.com/library
  • GGUF format: https://github.com/ggerganov/llama.cpp
  • Hardware guide: https://ollama.com/blog/hardware-recommendations
  • Related tools: ollama, llama.cpp, tinyllama