Deploy Mistral 7B Q4 on Fly.io Machines — Geo-distributed AI

Complete setup guide for self-hosting Mistral 7B Q4 on Fly.io Machines for geo-distributed AI.

Advanced · about 15 minutes

Tags: edge-ai, local-llm, deployment, on-device, fly.io-machines

Deploy Mistral 7B Q4 on Fly.io Machines

Overview

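Run Mistral 7B Q4 directly on Fly.io Machines for geo-distributed AI. Self-hosted inference keeps prompts on infrastructure you control, avoids the round trip to a third-party API, and carries no per-token API fees, although Fly.io still bills for the Machines themselves.

Specs: Micro VMs · 8 GB memory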

Installation

```bash
# Install Ollama, the easiest local inference runtime
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version
```

Download Model

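```bash
# Pull Mistral 7B (Ollama's default tag for this model is a 4-bit quantized GGUF build)
ollama pull mistral:7b

# Run interactive chat
ollama run mistral:7b

# Start the API server if the installer did not already start it as a service
# (pull and run both need the server to be running)
ollama serve

# API available at http://localhost:11434
```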
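Before wiring up a client, it can help to confirm that the server is reachable and that the model is actually present. The following is a minimal sketch, assuming Ollama's `/api/tags` endpoint, which lists the models available on the server. The helper names `parse_model_names` and `list_models` are illustrative and not part of Ollama.

```python
import json
import urllib.request


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]


def list_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask the running Ollama server which models it has pulled."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return parse_model_names(json.load(resp))
```

Calling `list_models()` on the Machine should return a list that includes the tag you pulled. An empty list suggests the pull did not complete.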

Python Integration

```python
import json
from typing import Iterator

import httpx


class LocalAI:
    """Interface to a Mistral 7B Q4 model served by Ollama on Fly.io Machines."""

    BASE_URL = "http://localhost:11434"
    MODEL = "mistral:7b"  # must match the tag pulled with `ollama pull`

    def chat(self, message: str, system: str = "") -> str:
        """Single-turn chat."""
        messages = [{"role": "user", "content": message}]
        if system:
            # Only send a system message when one was provided
            messages.insert(0, {"role": "system", "content": system})
        resp = httpx.post(
            f"{self.BASE_URL}/api/chat",
            json={"model": self.MODEL, "messages": messages, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]

    def stream(self, message: str) -> Iterator[str]:
        """Streaming chat for real-time output."""
        with httpx.stream(
            "POST",
            f"{self.BASE_URL}/api/chat",
            json={
                "model": self.MODEL,
                "messages": [{"role": "user", "content": message}],
                "stream": True,
            },
            timeout=120,
        ) as r:
            r.raise_for_status()
            # Each non-empty line is one JSON chunk; the last has "done": true
            for line in r.iter_lines():
                if line:
                    chunk = json.loads(line)
                    if not chunk.get("done"):
                        yield chunk["message"]["content"]


# Usage
ai = LocalAI()
response = ai.chat("Help me with geo-distributed AI")
print(response)

# Streaming
for token in ai.stream("Explain geo-distributed AI step by step"):
    print(token, end="", flush=True)
```
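The streaming path relies on the server sending one JSON object per line, with a final object marked `"done": true`. As a sketch of that parsing step in isolation, the example below feeds hand-written sample lines (not captured server output) through a hypothetical `iter_tokens` helper, so it runs without a server.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragment carried by each streamed chat chunk."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chunk = json.loads(line)
        if chunk.get("done"):
            break  # the terminating chunk carries no further text
        yield chunk["message"]["content"]


# Simulated stream: two content chunks followed by the terminating chunk.
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
reply = "".join(iter_tokens(sample))
print(reply)  # prints "Hello"
```

Keeping the parser separate from the HTTP call makes it easy to unit-test the client without a running model.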

Custom Modelfile

```bash
# Create optimized configuration for geo-distributed AI
cat > Modelfile << 'MODELEOF'
FROM mistral:7b
PARAMETER num_ctx 4096
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are an AI assistant specialized in geo-distributed AI. You run locally on Fly.io Machines. Be concise, accurate, and helpful."
MODELEOF

# Build the custom model from the Modelfile, then run it
ollama create geo-distributed-AI-assistant -f Modelfile
ollama run geo-distributed-AI-assistant
```
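A Modelfile bakes parameters into a named model. Ollama's chat API also accepts an `options` object on each request that uses the same parameter names, which is handy for one-off overrides. The sketch below only builds the request body. `build_chat_payload` is a hypothetical helper, and the override values are arbitrary examples.

```python
def build_chat_payload(model: str, message: str, **options) -> dict:
    """Build an /api/chat body, attaching per-request options if given."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": message}],
        "stream": False,
    }
    if options:
        # Same names as Modelfile PARAMETER lines, e.g. temperature, num_ctx
        payload["options"] = options
    return payload


payload = build_chat_payload(
    "geo-distributed-AI-assistant",
    "Summarize the trade-offs of multi-region inference.",
    temperature=0.2,
    num_ctx=2048,
)
```

Posting `payload` to `/api/chat` should apply the lower temperature and smaller context for that request only, leaving the model's stored defaults unchanged.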

Performance Profile

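| Metric | Value |
| --- | --- |
| Hardware | Micro VMs |
| Memory | 8 GB |
| Speed | 10-40 tokens/sec (CPU) / 40-100+ tokens/sec (GPU) |
| First token | <200 ms (GPU) / <1 s (CPU) |
| Context | 4096-32768 tokens |
| Cost | No per-token API fees; Fly.io bills for Machine usage |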
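These figures translate into rough capacity and latency estimates. The back-of-envelope sketch below makes two assumptions that do not come from this guide: roughly 7.2 billion parameters, and about 4.5 bits per weight for a 4-bit quantization. It ignores the KV cache and runtime overhead, so treat the results as lower bounds.

```python
def weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def response_seconds(n_tokens: int, tokens_per_sec: float, first_token_s: float) -> float:
    """First-token delay plus steady-state generation time."""
    return first_token_s + n_tokens / tokens_per_sec


# Assumed model size and quantization density (not from this guide).
print(f"weights: ~{weights_gb(7.2e9, 4.5):.2f} GB")

# A 256-token reply at the slow and fast ends of the CPU range above,
# using the 1 s upper bound for the first token.
print(f"CPU, 10 tok/s: ~{response_seconds(256, 10, 1.0):.1f} s")
print(f"CPU, 40 tok/s: ~{response_seconds(256, 40, 1.0):.1f} s")
```

Under these assumptions the weights take about half of an 8 GB Machine, and a CPU-only reply of a few hundred tokens takes on the order of seconds to tens of seconds.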

Production Setup with FastAPI

```python
from fastapi import FastAPI
from pydantic import BaseModel

# LocalAI is the client class defined in the Python Integration section;
# import it from wherever you saved it.

app = FastAPI(title="Fly.io Machines AI API")
ai = LocalAI()


class ChatRequest(BaseModel):
    message: str
    system: str = ""


class ChatResponse(BaseModel):
    response: str
    model: str
    device: str


# Plain `def` so FastAPI runs the blocking httpx call in its thread pool
@app.post("/chat", response_model=ChatResponse)
def chat_endpoint(req: ChatRequest):
    response = ai.chat(req.message, req.system)
    return ChatResponse(response=response, model="Mistral 7B Q4", device="Fly.io Machines")


@app.get("/health")
async def health():
    return {"status": "ok", "model": "Mistral 7B Q4", "device": "Fly.io Machines"}
```
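In a geo-distributed deployment it is often useful to know which region answered a request. Fly.io exposes the region code to each Machine through the `FLY_REGION` environment variable, so one possible extension of the health check is sketched below. `health_payload` is a hypothetical helper, and the `"unknown"` fallback is an assumption for running outside Fly.io.

```python
import os


def health_payload(model: str = "Mistral 7B Q4") -> dict:
    """Build a health response that records which region served it."""
    return {
        "status": "ok",
        "model": model,
        "device": "Fly.io Machines",
        # FLY_REGION is absent when running outside Fly.io, hence the fallback.
        "region": os.environ.get("FLY_REGION", "unknown"),
    }
```

Returning `health_payload()` from the `/health` handler would let a monitoring script confirm that requests from different locations reach the expected region.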

Troubleshooting

Slow inference: Switch to Q4_K_M quantization or reduce the context window.

Out of memory: Use a smaller model or a Q3_K_S quant.

GPU not used: Install CUDA/Metal drivers and check the Ollama logs.

High latency: Warm up the model by sending a dummy request on startup.
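The warm-up suggestion can be sketched as follows. It assumes Ollama's documented behavior that a `/api/generate` request with a model name and no prompt loads the model into memory, and that `keep_alive` controls how long it stays loaded. The helper names, the `10m` value, and the `mistral:7b` tag are illustrative, so substitute whatever tag you pulled.

```python
import json
import urllib.request


def warmup_request(base_url: str, model: str, keep_alive: str = "10m") -> urllib.request.Request:
    """Build a request that loads the model without generating text."""
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode()
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(base_url: str = "http://localhost:11434", model: str = "mistral:7b") -> None:
    """Send the warm-up request; intended to be called once at startup."""
    with urllib.request.urlopen(warmup_request(base_url, model), timeout=300) as resp:
        resp.read()  # drain the response so the connection closes cleanly
```

Calling `warm_up()` before the API starts accepting traffic moves the model-load delay out of the first user request.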

Resources

Ollama library: https://ollama.com/library

GGUF format: https://github.com/ggerganov/llama.cpp

Hardware guide: https://ollama.com/blog/hardware-recommendations
