Deploy CF AI Models on Cloudflare Workers AI — Edge CDN inference
Complete setup guide for running CF AI Models locally on Cloudflare Workers AI for edge CDN inference
Overview
Run CF AI Models directly on Cloudflare Workers AI for edge CDN inference. Local inference keeps prompts and outputs on your own machine, avoids a network round trip to a hosted API, and has no per-request API fees.
Specs: V8 isolates · Serverless
Installation
```bash
# Install Ollama, the easiest local inference runtime
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version
```
Download Model
```bash
# Pull CF AI Models (downloads GGUF quantized weights automatically)
ollama pull cf-ai-models

# Run interactive chat
ollama run cf-ai-models

# Start API server
ollama serve
# API available at http://localhost:11434
```
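Once `ollama serve` is running, one way to confirm that the API is reachable and the model was pulled is to list the installed models. The sketch below assumes Ollama's `/api/tags` endpoint on the default port shown above. The helper names `parse_model_names`, `installed_models`, and `has_model` are illustrative and not part of Ollama.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # default Ollama address from the step above


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]


def installed_models(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """Ask the local server which models it has; raises URLError if it is down."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.load(resp))


def has_model(names: list[str], model: str) -> bool:
    """True if `model` is installed, with or without an explicit tag such as ':latest'."""
    return any(n == model or n.split(":", 1)[0] == model for n in names)
```

Calling `has_model(installed_models(), "cf-ai-models")` before starting an application gives an early, readable failure if the pull step was skipped.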
Python Integration
```python
import json
from typing import Iterator

import httpx


class LocalAI:
    """Interface to local CF AI Models running on Cloudflare Workers AI."""

    BASE_URL = "http://localhost:11434"
    MODEL = "cf-ai-models"

    def chat(self, message: str, system: str = "") -> str:
        """Single-turn chat."""
        # Only send a system message when one was actually provided
        messages = [{"role": "system", "content": system}] if system else []
        messages.append({"role": "user", "content": message})
        resp = httpx.post(
            f"{self.BASE_URL}/api/chat",
            json={"model": self.MODEL, "messages": messages, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]

    def stream(self, message: str) -> Iterator[str]:
        """Streaming chat for real-time output."""
        with httpx.stream(
            "POST",
            f"{self.BASE_URL}/api/chat",
            json={
                "model": self.MODEL,
                "messages": [{"role": "user", "content": message}],
                "stream": True,
            },
            timeout=120,
        ) as r:
            r.raise_for_status()
            # The server sends one JSON object per line
            for line in r.iter_lines():
                if line:
                    chunk = json.loads(line)
                    if not chunk.get("done"):
                        yield chunk["message"]["content"]


# Usage
ai = LocalAI()
response = ai.chat("Help me with edge CDN inference")
print(response)

# Streaming
for token in ai.stream("Explain edge CDN inference step by step"):
    print(token, end="", flush=True)
```
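To make the streaming format concrete: with `"stream": True`, the chat endpoint returns newline-delimited JSON, where each object carries a text fragment in `message.content` and the last object has `done` set to true. The following sketch parses such lines without a running server. `iter_tokens` is a hypothetical helper, and the sample lines are hand-written to match that shape rather than captured model output.

```python
import json
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON chat chunks."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chunk = json.loads(line)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break  # the final chunk marks the end of the reply


# Hand-written sample lines shaped like a streamed reply (not real model output)
sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
reply = "".join(iter_tokens(sample))
```

Separating the parsing from the HTTP call this way makes the streaming logic easy to unit test.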
Custom Modelfile
```bash
# Create optimized configuration for edge CDN inference
cat > Modelfile << 'MODELEOF'
FROM cf-ai-models
PARAMETER num_ctx 4096
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are an AI assistant specialized in edge CDN inference. You run locally on Cloudflare Workers AI. Be concise, accurate, and helpful."
MODELEOF

ollama create edge-CDN-inference-assistant -f Modelfile
ollama run edge-CDN-inference-assistant
```
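The same parameters can also be supplied per request instead of being baked into a Modelfile, which is handy when experimenting with values. This is a minimal sketch assuming the `options` field of Ollama's chat API. `build_chat_payload` is a hypothetical helper, and the prompt text is only an example.

```python
MODEL = "cf-ai-models"  # model name used throughout this guide


def build_chat_payload(message: str, system: str = "", **options) -> dict:
    """Build a non-streaming /api/chat request body with optional sampling options."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": message})
    payload = {"model": MODEL, "messages": messages, "stream": False}
    if options:
        # e.g. num_ctx, temperature, top_p: the same names as the PARAMETER lines
        payload["options"] = options
    return payload


payload = build_chat_payload(
    "Summarize edge caching", num_ctx=4096, temperature=0.7, top_p=0.9
)
```

The resulting dictionary can be passed as the `json=` argument of an HTTP POST to `/api/chat`. Options sent this way apply only to that request.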
Production Setup with FastAPI
```python
from fastapi import FastAPI
from pydantic import BaseModel

# Assumes the LocalAI class from the Python Integration section is saved as local_ai.py
from local_ai import LocalAI

app = FastAPI(title="Cloudflare Workers AI API")
ai = LocalAI()


class ChatRequest(BaseModel):
    message: str
    system: str = ""


class ChatResponse(BaseModel):
    response: str
    model: str
    device: str


@app.post("/chat", response_model=ChatResponse)
def chat_endpoint(req: ChatRequest):
    # Plain `def`: FastAPI runs it in a worker thread, so the blocking
    # HTTP call inside ai.chat() does not stall the event loop.
    response = ai.chat(req.message, req.system)
    return ChatResponse(response=response, model="CF AI Models", device="Cloudflare Workers AI")


@app.get("/health")
async def health():
    return {"status": "ok", "model": "CF AI Models", "device": "Cloudflare Workers AI"}
```
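A client for this service could look like the sketch below, which uses only the standard library. It assumes the app is served at `http://localhost:8000` (the default for `uvicorn`), which the guide does not specify. `build_chat_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request

SERVICE_URL = "http://localhost:8000"  # assumption: uvicorn's default host and port


def build_chat_request(
    message: str, system: str = "", base_url: str = SERVICE_URL
) -> urllib.request.Request:
    """Build a POST request matching the ChatRequest schema (message, system)."""
    body = json.dumps({"message": message, "system": system}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(message: str, system: str = "") -> str:
    """Send the request and return the `response` field of the ChatResponse."""
    with urllib.request.urlopen(build_chat_request(message, system), timeout=120) as resp:
        return json.load(resp)["response"]
```

With the service running, `ask("Help me with edge CDN inference")` would return the model's reply as a string.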
Troubleshooting
Slow inference: Switch to Q4_K_M quantization, reduce context window
Out of memory: Use smaller model or Q3_K_S quant
GPU not used: Install CUDA/Metal drivers, check ollama logs
High latency: Warm up model by sending a dummy request on startup
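The warm-up tip can be sketched as follows. It relies on Ollama's documented behavior that a request to `/api/generate` with a model name and no prompt loads the model into memory, and that `keep_alive` controls how long it stays loaded. The function names, the 30-minute value, and the injectable `post` parameter (used to make the function testable) are assumptions for illustration.

```python
import json
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:11434"
MODEL = "cf-ai-models"


def _post_json(url: str, payload: dict, timeout: float) -> dict:
    """POST a JSON body and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def warm_up(
    model: str = MODEL,
    keep_alive: str = "30m",
    post: Callable[[str, dict, float], dict] = _post_json,
) -> bool:
    """Load the model into memory; return False if the server cannot be reached."""
    # No prompt: the server only loads the model and replies once it is ready
    payload = {"model": model, "keep_alive": keep_alive, "stream": False}
    try:
        post(f"{BASE_URL}/api/generate", payload, 300.0)
        return True
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return False
```

Calling `warm_up()` once at application startup moves the model load time out of the first user request.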