Deploy Any ONNX Model on ONNX Runtime CrossPlatform — Cross-platform deployment
Complete setup guide for running Any ONNX Model locally on ONNX Runtime CrossPlatform for cross-platform deployment
Deploy Any ONNX Model on ONNX Runtime CrossPlatform
Overview
Run Any ONNX Model directly on ONNX Runtime CrossPlatform for cross-platform deployment. Local inference offers privacy, zero latency, and no ongoing API costs.
Specs: ONNX Runtime · Variable
Installation
bash
Install Ollama — easiest local inference runtime
curl -fsSL https://ollama.com/install.sh | shVerify installation
ollama --version
Download Model
bash
Pull Any ONNX Model (downloads GGUF quantized weights automatically)
ollama pull any-onnx-modelRun interactive chat
ollama run any-onnx-modelStart API server
ollama serve
API available at http://localhost:11434
Python Integration
python
import httpx
from typing import Iteratorclass LocalAI:
"""Interface to local Any ONNX Model running on ONNX Runtime CrossPlatform."""
BASE_URL = "http://localhost:11434"
MODEL = "any-onnx-model"
def chat(self, message: str, system: str = "") -> str:
"""Single-turn chat."""
resp = httpx.post(
f"{self.BASE_URL}/api/chat",
json={
"model": self.MODEL,
"messages": [
{"role": "system", "content": system},
{"role": "user", "content": message}
],
"stream": False
},
timeout=120
)
resp.raise_for_status()
return resp.json()["message"]["content"]
def stream(self, message: str) -> Iterator[str]:
"""Streaming chat for real-time output."""
with httpx.stream(
"POST",
f"{self.BASE_URL}/api/chat",
json={"model": self.MODEL, "messages": [{"role": "user", "content": message}], "stream": True},
timeout=120
) as r:
for line in r.iter_lines():
if line:
import json
chunk = json.loads(line)
if not chunk.get("done"):
yield chunk["message"]["content"]
Usage
ai = LocalAI()
response = ai.chat("Help me with cross-platform deployment")
print(response)Streaming
for token in ai.stream("Explain cross-platform deployment step by step"):
print(token, end="", flush=True)
Custom Modelfile
bash
Create optimized configuration for cross-platform deployment
cat > Modelfile << 'MODELEOF'
FROM any-onnx-modelPARAMETER num_ctx 4096
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM "You are an AI assistant specialized in cross-platform deployment. You run locally on ONNX Runtime CrossPlatform. Be concise, accurate, and helpful."
MODELEOF
ollama create cross-platform-deployment-assistant -f Modelfile
ollama run cross-platform-deployment-assistant
Performance Profile
Production Setup with FastAPI
python
from fastapi import FastAPI
from pydantic import BaseModelapp = FastAPI(title="ONNX Runtime CrossPlatform AI API")
ai = LocalAI()
class ChatRequest(BaseModel):
message: str
system: str = ""
class ChatResponse(BaseModel):
response: str
model: str
device: str
@app.post("/chat", response_model=ChatResponse)
async def chat_endpoint(req: ChatRequest):
response = ai.chat(req.message, req.system)
return ChatResponse(response=response, model="Any ONNX Model", device="ONNX Runtime CrossPlatform")
@app.get("/health")
async def health():
return {"status": "ok", "model": "Any ONNX Model", "device": "ONNX Runtime CrossPlatform"}
Troubleshooting
Slow inference: Switch to Q4_K_M quantization, reduce context window
Out of memory: Use smaller model or Q3_K_S quant
GPU not used: Install CUDA/Metal drivers, check ollama logs
High latency: Warm up model by sending a dummy request on startup
Resources
Also available in 中文.