Deploy CF AI Models on Cloudflare Workers AI — Edge CDN inference

Complete setup guide for running CF AI Models locally on Cloudflare Workers AI for edge CDN inference

Advanced · 15 minutes


Overview

Run CF AI Models directly on Cloudflare Workers AI for edge CDN inference. Local inference keeps your data on your own hardware, avoids network round trips to a remote API, and has no ongoing per-request API costs.

Specs: V8 isolates · Serverless

Installation

bash

# Install Ollama, the easiest local inference runtime
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

Download Model

bash

# Pull CF AI Models (downloads GGUF quantized weights automatically)
ollama pull cf-ai-models

# Run interactive chat
ollama run cf-ai-models

# Start API server
ollama serve
# API available at http://localhost:11434
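Before wiring up a client, it can help to confirm that the server is reachable and that the model was actually pulled. The sketch below assumes the server answers `GET /api/tags` with a JSON object holding a `models` list whose entries carry a `name` field such as `cf-ai-models:latest`. The helper names `installed_models`, `has_model`, and `fetch_tags` are illustrative, not part of any library.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # address from the step above


def installed_models(tags_payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body (assumed shape)."""
    return [m["name"] for m in tags_payload.get("models", [])]


def has_model(tags_payload: dict, wanted: str) -> bool:
    """True if `wanted` matches a listed model, ignoring any ':tag' suffix."""
    return any(name.split(":")[0] == wanted for name in installed_models(tags_payload))


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """GET /api/tags from a running server; raises URLError if it is down."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Typical use once `ollama serve` is running:
#   if not has_model(fetch_tags(), "cf-ai-models"):
#       print("Model missing: run `ollama pull cf-ai-models` first")
```

Keeping the parsing separate from the network call makes the check easy to exercise against a saved response.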

Python Integration

python
import json
from typing import Iterator

import httpx


class LocalAI:
    """Interface to local CF AI Models running on Cloudflare Workers AI."""

    BASE_URL = "http://localhost:11434"
    MODEL = "cf-ai-models"

    def chat(self, message: str, system: str = "") -> str:
        """Single-turn chat."""
        # Only send a system message when one was provided.
        messages = [{"role": "system", "content": system}] if system else []
        messages.append({"role": "user", "content": message})
        resp = httpx.post(
            f"{self.BASE_URL}/api/chat",
            json={"model": self.MODEL, "messages": messages, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]

    def stream(self, message: str) -> Iterator[str]:
        """Streaming chat for real-time output."""
        with httpx.stream(
            "POST",
            f"{self.BASE_URL}/api/chat",
            json={
                "model": self.MODEL,
                "messages": [{"role": "user", "content": message}],
                "stream": True,
            },
            timeout=120,
        ) as r:
            r.raise_for_status()
            for line in r.iter_lines():
                if line:
                    # Each non-empty line is one JSON chunk.
                    chunk = json.loads(line)
                    if not chunk.get("done"):
                        yield chunk["message"]["content"]


# Usage
ai = LocalAI()
response = ai.chat("Help me with edge CDN inference")
print(response)

# Streaming
for token in ai.stream("Explain edge CDN inference step by step"):
    print(token, end="", flush=True)
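The streaming method relies on the server sending one JSON object per line, with a final object marked `"done": true`. The sketch below isolates that parsing step so it can be tried without a running server. The function name `iter_tokens` and the sample lines are made up for illustration; real chunks carry additional fields.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from newline-delimited JSON chat chunks."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        if chunk.get("done"):
            break  # the final chunk signals the end of the reply
        yield chunk.get("message", {}).get("content", "")


# Hand-written sample in the assumed chunk format.
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    "",
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"done": true}',
]
print("".join(iter_tokens(sample)))  # prints: Hello
```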

Custom Modelfile

bash

# Create optimized configuration for edge CDN inference
cat > Modelfile << 'MODELEOF'
FROM cf-ai-models

PARAMETER num_ctx 4096
PARAMETER temperature 0.7
PARAMETER top_p 0.9

SYSTEM "You are an AI assistant specialized in edge CDN inference. You run locally on Cloudflare Workers AI. Be concise, accurate, and helpful."
MODELEOF

ollama create edge-CDN-inference-assistant -f Modelfile
ollama run edge-CDN-inference-assistant
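If several variants of this configuration are needed (for example, different context sizes), the Modelfile text can be generated rather than hand-edited. This is a minimal sketch using only the `FROM`, `PARAMETER`, and `SYSTEM` directives shown above. The helper `render_modelfile` is hypothetical, and it assumes the system prompt contains no double quotes.

```python
def render_modelfile(base: str, params: dict, system: str) -> str:
    """Build Modelfile text from a base model, parameters, and a system prompt."""
    if '"' in system:
        # Assumption of this sketch: keep quoting simple by rejecting quotes.
        raise ValueError("system prompt must not contain double quotes")
    lines = [f"FROM {base}", ""]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    lines += ["", f'SYSTEM "{system}"']
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "cf-ai-models",
    {"num_ctx": 4096, "temperature": 0.7, "top_p": 0.9},
    "You are an AI assistant specialized in edge CDN inference.",
)
print(modelfile_text)
```

Writing `modelfile_text` to a file named `Modelfile` yields the input expected by the `ollama create ... -f Modelfile` command above.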

Performance Profile

| Metric | Value |
| --- | --- |
| Hardware | V8 isolates |
| Memory | Serverless |
| Speed | 10-40 tokens/sec (CPU) / 40-100+ tokens/sec (GPU) |
| First token | <200 ms (GPU) / <1 s (CPU) |
| Context | 4096-32768 tokens |
| Cost | $0 (after hardware) |
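To turn these figures into a rough response-time estimate, add the first-token latency to the generation time (output tokens divided by throughput). The sketch below applies that arithmetic to the bounds quoted in the table. It is a back-of-the-envelope model that ignores prompt length and model load time, and `estimate_seconds` is an illustrative name.

```python
def estimate_seconds(n_tokens: int, tokens_per_sec: float, first_token_s: float) -> float:
    """Rough wall-clock time for a reply: first-token wait plus generation time."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return first_token_s + n_tokens / tokens_per_sec


# A 400-token reply at the slow CPU bound (10 tokens/sec, 1 s to first token)
cpu_estimate = estimate_seconds(400, 10, 1.0)   # 1 + 400/10 = 41 seconds
# The same reply at the slow GPU bound (40 tokens/sec, 0.2 s to first token)
gpu_estimate = estimate_seconds(400, 40, 0.2)   # 0.2 + 400/40, about 10.2 seconds
print(f"CPU: {cpu_estimate:.1f}s, GPU: {gpu_estimate:.1f}s")
```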

Production Setup with FastAPI

python
from fastapi import FastAPI
from pydantic import BaseModel

# LocalAI is the class defined in the Python Integration section above.
app = FastAPI(title="Cloudflare Workers AI API")
ai = LocalAI()


class ChatRequest(BaseModel):
    message: str
    system: str = ""


class ChatResponse(BaseModel):
    response: str
    model: str
    device: str


@app.post("/chat", response_model=ChatResponse)
def chat_endpoint(req: ChatRequest):
    # Plain def: FastAPI runs it in a threadpool, so the blocking
    # HTTP call inside ai.chat does not stall the event loop.
    response = ai.chat(req.message, req.system)
    return ChatResponse(response=response, model="CF AI Models", device="Cloudflare Workers AI")


@app.get("/health")
async def health():
    return {"status": "ok", "model": "CF AI Models", "device": "Cloudflare Workers AI"}
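A client for the `/chat` endpoint only needs to POST a JSON body matching `ChatRequest` and read the `response` field of the reply. The sketch below uses only the standard library. The base URL `http://localhost:8000` is an assumption (a common default when serving locally), and `build_chat_request` and `call_chat` are hypothetical helper names.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # assumed address of the FastAPI app


def build_chat_request(base_url: str, message: str, system: str = "") -> urllib.request.Request:
    """Create a POST request whose JSON body matches the ChatRequest model."""
    body = json.dumps({"message": message, "system": system}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_chat(message: str, system: str = "", base_url: str = API_URL, timeout: float = 120.0) -> str:
    """Send the request and return the `response` field of the ChatResponse."""
    req = build_chat_request(base_url, message, system)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]


# With the API running:
#   print(call_chat("Help me with edge CDN inference"))
```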

Troubleshooting

Slow inference: Switch to Q4_K_M quantization, reduce context window
Out of memory: Use smaller model or Q3_K_S quant
GPU not used: Install CUDA/Metal drivers, check ollama logs
High latency: Warm up model by sending a dummy request on startup
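The warm-up tip can be sketched as a small retry loop that sends one throwaway request at startup, so the model is loaded before the first real user arrives. The function `warm_up` and its parameters are illustrative. It takes any zero-argument callable, so it does not depend on a particular client.

```python
import time
from typing import Callable


def warm_up(send: Callable[[], object], attempts: int = 3, delay_s: float = 1.0) -> bool:
    """Call `send` until it succeeds or attempts run out; report success."""
    for attempt in range(attempts):
        try:
            send()  # the dummy request; its result is discarded
            return True
        except Exception:
            # The server may still be starting; wait before the next try.
            if attempt < attempts - 1:
                time.sleep(delay_s)
    return False


# Example with the LocalAI class from the Python Integration section:
#   ready = warm_up(lambda: LocalAI().chat("ping"))
```

Returning a boolean rather than raising lets the caller decide whether a failed warm-up should block startup or only be logged.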

Resources

Related tools: ollama, llama.cpp, cf