Deploy Mistral 7B Q4 on Fly.io Machines — Geo-distributed AI

A complete setup guide for self-hosting Mistral 7B Q4 on Fly.io Machines for geo-distributed AI.

Advanced · 15 minutes


Tags: edge-ai, local-llm, deployment, on-device, fly.io-machines


Overview

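Run Mistral 7B Q4 directly on Fly.io Machines for geo-distributed AI. Self-hosted inference keeps prompts on infrastructure you control, lets you place the model in regions close to your users to cut network round-trip time, and avoids per-token API fees (Fly.io still bills for the Machines themselves).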

Specs: Micro VMs · 8 GB RAM
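
A rough back-of-the-envelope check suggests why 8 GB is a workable budget for a 4-bit 7B model. The sketch below assumes about 7.2 billion parameters, about 4.5 bits per weight for a Q4 GGUF file, and Mistral 7B's published attention shape (32 layers, 8 KV heads, head dimension 128) with a 16-bit KV cache. The constants and function names are illustrative assumptions, not measurements taken on a Fly.io Machine.

```python
# Rough memory estimate for a 4-bit 7B model; every constant is an assumption.
PARAMS = 7.2e9          # approximate parameter count of Mistral 7B
BITS_PER_WEIGHT = 4.5   # typical for Q4 GGUF quantization, varies by variant
N_LAYERS = 32           # transformer layers
N_KV_HEADS = 8          # grouped-query attention key/value heads
HEAD_DIM = 128          # dimension per attention head
KV_BYTES = 2            # bytes per cache entry (fp16)

GIB = 1024 ** 3


def weights_gib(params: float = PARAMS, bits: float = BITS_PER_WEIGHT) -> float:
    """Size of the quantized weights in GiB."""
    return params * bits / 8 / GIB


def kv_cache_gib(num_ctx: int) -> float:
    """Size of the key/value cache for num_ctx tokens in GiB."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES  # keys + values
    return per_token * num_ctx / GIB


def total_gib(num_ctx: int) -> float:
    """Weights plus KV cache, before any runtime overhead."""
    return weights_gib() + kv_cache_gib(num_ctx)


if __name__ == "__main__":
    for ctx in (4096, 32768):
        print(f"num_ctx={ctx}: ~{total_gib(ctx):.1f} GiB before runtime overhead")
```

Under these assumptions a 4096-token context needs roughly 4.3 GiB, while a 32768-token context approaches 7.8 GiB, which leaves very little headroom on an 8 GB Machine.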

Installation

bash

# Install Ollama, the easiest local inference runtime
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

Download Model

bash

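# Pull Mistral 7B Q4 (downloads GGUF quantized weights automatically).
# If the tag mistral-7b-q4 is not found, check https://ollama.com/library for
# the exact name; the library publishes Mistral 7B under "mistral".
ollama pull mistral-7b-q4

# Run interactive chat
ollama run mistral-7b-q4

# Start API server (skip if the installer already started Ollama as a background service)
ollama serve

# API available at http://localhost:11434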
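
Before wiring up a client, it can help to confirm that the server is reachable and the model is installed. The sketch below queries Ollama's `GET /api/tags` endpoint, which lists locally available models. It uses only the standard library, and the helper names `fetch_tags` and `has_model` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from this guide


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """Return the JSON body of GET /api/tags (the locally installed models)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


def has_model(tags: dict, name: str) -> bool:
    """True if `name` matches an installed model, with or without a ':tag' suffix."""
    installed = [m.get("name", "") for m in tags.get("models", [])]
    return any(n == name or n.split(":")[0] == name for n in installed)


if __name__ == "__main__":
    try:
        tags = fetch_tags()
    except OSError as exc:  # urllib's URLError is a subclass of OSError
        print(f"Ollama is not reachable: {exc}")
    else:
        print("model present:", has_model(tags, "mistral-7b-q4"))
```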

Python Integration

python
import json
from typing import Iterator

import httpx


class LocalAI:
    """Interface to Mistral 7B Q4 served by Ollama on a Fly.io Machine."""

    BASE_URL = "http://localhost:11434"
    MODEL = "mistral-7b-q4"

    def chat(self, message: str, system: str = "") -> str:
        """Single-turn chat."""
        messages = []
        if system:  # omit the system turn when no prompt is given
            messages.append({"role": "system", "content": system})
        messages.append({"role": "user", "content": message})
        resp = httpx.post(
            f"{self.BASE_URL}/api/chat",
            json={"model": self.MODEL, "messages": messages, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]

    def stream(self, message: str) -> Iterator[str]:
        """Streaming chat for real-time output."""
        with httpx.stream(
            "POST",
            f"{self.BASE_URL}/api/chat",
            json={
                "model": self.MODEL,
                "messages": [{"role": "user", "content": message}],
                "stream": True,
            },
            timeout=120,
        ) as r:
            r.raise_for_status()
            # Each non-empty line is one JSON object; the last has "done": true.
            for line in r.iter_lines():
                if line:
                    chunk = json.loads(line)
                    if not chunk.get("done"):
                        yield chunk["message"]["content"]


# Usage
ai = LocalAI()
response = ai.chat("Help me with geo-distributed AI")
print(response)

# Streaming
for token in ai.stream("Explain geo-distributed AI step by step"):
    print(token, end="", flush=True)
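
With `"stream": True`, Ollama's `/api/chat` endpoint returns newline-delimited JSON: each line carries a fragment of the assistant message, and the final line has `"done": true`. The sketch below shows how those fragments combine into a full reply. The function name `assemble_reply` is hypothetical, and the sample lines are hand-written to match the response shape, not real model output.

```python
import json
from typing import Iterable


def assemble_reply(lines: Iterable[str]) -> str:
    """Concatenate the content fragments of a streamed /api/chat response."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk closes the stream
    return "".join(parts)


# Hand-written sample in the shape of a streamed response.
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(assemble_reply(sample))  # prints "Hello"
```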

Custom Modelfile

bash

# Create optimized configuration for geo-distributed AI
cat > Modelfile << 'MODELEOF'
FROM mistral-7b-q4

PARAMETER num_ctx 4096
PARAMETER temperature 0.7
PARAMETER top_p 0.9

SYSTEM "You are an AI assistant specialized in geo-distributed AI. You run locally on Fly.io Machines. Be concise, accurate, and helpful."
MODELEOF

ollama create geo-distributed-AI-assistant -f Modelfile
ollama run geo-distributed-AI-assistant
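
The same parameters can also be supplied per request instead of being baked into a Modelfile, since Ollama's `/api/chat` accepts an `options` object. The sketch below builds such a request body; `build_chat_payload` and `DEFAULT_OPTIONS` are hypothetical names, and the values simply mirror the `PARAMETER` lines in the Modelfile.

```python
from typing import Optional

# Mirrors the PARAMETER lines in the Modelfile.
DEFAULT_OPTIONS = {"num_ctx": 4096, "temperature": 0.7, "top_p": 0.9}


def build_chat_payload(model: str, message: str, system: str = "",
                       overrides: Optional[dict] = None) -> dict:
    """Build a non-streaming /api/chat request body with sampling options."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": message})
    options = {**DEFAULT_OPTIONS, **(overrides or {})}  # overrides win on conflict
    return {"model": model, "messages": messages, "stream": False, "options": options}


payload = build_chat_payload(
    "mistral-7b-q4", "Summarize this log", overrides={"temperature": 0.2}
)
print(payload["options"])
```

Per-request options suit experiments with sampling settings; a Modelfile is the better fit once the configuration is settled and should be shared across clients.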

Performance Profile

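| Metric | Value |
|---|---|
| Hardware | Micro VMs |
| Memory | 8 GB |
| Speed | 10-40 tokens/sec (CPU) / 40-100+ tokens/sec (GPU) |
| First token | <200 ms (GPU) / <1 s (CPU) |
| Context | 4096-32768 tokens |
| Cost | No per-token API fees; Fly.io Machine time is billed separately |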
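
These figures vary with hardware and load, so it is worth measuring them on your own Machine. The final object of an Ollama response reports `eval_count`, `eval_duration`, `load_duration`, and `prompt_eval_duration` (durations in nanoseconds). The sketch below derives throughput and an approximate first-token delay from them; the helper names are hypothetical and the example numbers are made up.

```python
def tokens_per_second(stats: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    duration_ns = stats.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return stats.get("eval_count", 0) / (duration_ns / 1e9)


def time_to_first_token_s(stats: dict) -> float:
    """Approximate first-token delay: model load plus prompt evaluation, in seconds."""
    total_ns = stats.get("load_duration", 0) + stats.get("prompt_eval_duration", 0)
    return total_ns / 1e9


# Made-up numbers in the shape of a final /api/chat response object.
example = {
    "eval_count": 120,
    "eval_duration": 6_000_000_000,
    "load_duration": 50_000_000,
    "prompt_eval_duration": 350_000_000,
}
print(f"{tokens_per_second(example):.1f} tokens/sec")
print(f"{time_to_first_token_s(example):.2f} s to first token")
```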

Production Setup with FastAPI

python
from fastapi import FastAPI
from pydantic import BaseModel

# LocalAI is the client class defined in the Python Integration section.

app = FastAPI(title="Fly.io Machines AI API")
ai = LocalAI()


class ChatRequest(BaseModel):
    message: str
    system: str = ""


class ChatResponse(BaseModel):
    response: str
    model: str
    device: str


# Declared with plain `def`: ai.chat() blocks on HTTP, so FastAPI runs the
# handler in a worker thread instead of stalling the event loop.
@app.post("/chat", response_model=ChatResponse)
def chat_endpoint(req: ChatRequest):
    response = ai.chat(req.message, req.system)
    return ChatResponse(response=response, model="Mistral 7B Q4", device="Fly.io Machines")


@app.get("/health")
async def health():
    return {"status": "ok", "model": "Mistral 7B Q4", "device": "Fly.io Machines"}
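
In a geo-distributed deployment it is useful to know which region answered a request. Fly.io sets the `FLY_REGION` environment variable inside each Machine, so the health payload could report it. The sketch below is one possible shape; the helper names `device_label` and `health_payload` and the extra `region` field are assumptions, not part of the API above.

```python
import os
from typing import Mapping


def device_label(env: Mapping[str, str] = os.environ) -> str:
    """Describe where this instance runs, e.g. 'Fly.io Machines (fra)'."""
    region = env.get("FLY_REGION")
    return f"Fly.io Machines ({region})" if region else "Fly.io Machines"


def health_payload(env: Mapping[str, str] = os.environ) -> dict:
    """Body for a /health response that includes the serving region."""
    return {
        "status": "ok",
        "model": "Mistral 7B Q4",
        "device": device_label(env),
        "region": env.get("FLY_REGION", "unknown"),
    }


# Simulated environment; on a real Machine, call health_payload() with no arguments.
print(health_payload({"FLY_REGION": "fra"}))
```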

Troubleshooting

  • Slow inference: Switch to Q4_K_M quantization, reduce context window
  • Out of memory: Use smaller model or Q3_K_S quant
  • GPU not used: Install CUDA/Metal drivers, check ollama logs
  • High latency: Warm up model by sending a dummy request on startup
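
The warm-up step can be sketched as follows. Ollama loads a model into memory when `/api/generate` receives a request with a model name and no prompt, and the `keep_alive` field controls how long it stays loaded. The helper names and the `30m` default are assumptions chosen for illustration, and only the standard library is used.

```python
import json
import urllib.request


def warmup_request(model: str, keep_alive: str = "30m",
                   base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a POST to /api/generate with no prompt, which only loads the model."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(model: str, timeout: float = 300.0) -> bool:
    """Send the warm-up request; return True if the server answered with HTTP 200."""
    try:
        with urllib.request.urlopen(warmup_request(model), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP error
        return False


if __name__ == "__main__":
    print("warmed up:", warm_up("mistral-7b-q4"))
```

Calling `warm_up` once when the application starts moves the model load time out of the first user request.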

Resources

  • Ollama library: https://ollama.com/library
  • GGUF format: https://github.com/ggerganov/llama.cpp
  • Hardware guide: https://ollama.com/blog/hardware-recommendations
  • Related tools: ollama, llama.cpp, mistral