Deploy Gemma 2B on Android Smartphone — On-device mobile AI

Complete setup guide for running Gemma 2B locally on Android Smartphone for on-device mobile AI

Advanced · 15 minutes

Deploy Gemma 2B on Android Smartphone

Overview

Run Gemma 2B directly on an Android smartphone for on-device mobile AI. Local inference keeps prompts on the device, avoids network round-trip latency, and has no ongoing API costs.

Specs: Qualcomm NPU · 6-12 GB RAM

Installation

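bash

# Install Ollama (easiest local inference runtime).
# On Android, run this inside a Linux userland such as Termux. The script
# targets desktop/server Linux and may not work there; in that case use the
# environment's package manager instead (assumption: Termux provides an
# `ollama` package, installable with `pkg install ollama`).
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version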

Download Model

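bash

# `pull` and `run` talk to the Ollama server. If none is running yet,
# start `ollama serve` in a separate session first.

# Pull Gemma 2B (downloads GGUF quantized weights automatically)
ollama pull gemma:2b

# Run interactive chat
ollama run gemma:2b

# Start API server
ollama serve
# API available at http://localhost:11434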
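Before sending chat requests, it can help to confirm that the server is reachable and that the model has actually been pulled. The sketch below is not part of the original guide: it queries Ollama's `/api/tags` endpoint, which lists locally available models, using only the standard library. The helper names `model_is_available` and `fetch_local_models` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # default Ollama address used in this guide


def model_is_available(tags_payload: dict, model: str) -> bool:
    """Return True if `model` appears in a parsed /api/tags response."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    # Ollama reports full tags such as "gemma:2b"; a bare name means ":latest".
    wanted = model if ":" in model else f"{model}:latest"
    return wanted in names


def fetch_local_models(base_url: str = BASE_URL, timeout: float = 5.0) -> dict:
    """GET /api/tags and return the decoded JSON body (raises on failure)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `model_is_available(fetch_local_models(), "gemma:2b")` should report whether the pull step succeeded; a connection error from `fetch_local_models` usually means `ollama serve` is not running.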

Python Integration

python

import json
from typing import Iterator

import httpx


class LocalAI:
    """Interface to local Gemma 2B running on Android Smartphone."""

    BASE_URL = "http://localhost:11434"
    MODEL = "gemma:2b"

    def chat(self, message: str, system: str = "") -> str:
        """Single-turn chat."""
        messages = [{"role": "user", "content": message}]
        if system:
            # Only send a system message when one was provided
            messages.insert(0, {"role": "system", "content": system})
        resp = httpx.post(
            f"{self.BASE_URL}/api/chat",
            json={"model": self.MODEL, "messages": messages, "stream": False},
            timeout=120,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]

    def stream(self, message: str) -> Iterator[str]:
        """Streaming chat for real-time output."""
        with httpx.stream(
            "POST",
            f"{self.BASE_URL}/api/chat",
            json={
                "model": self.MODEL,
                "messages": [{"role": "user", "content": message}],
                "stream": True,
            },
            timeout=120,
        ) as r:
            r.raise_for_status()
            # Ollama streams one JSON object per line
            for line in r.iter_lines():
                if line:
                    chunk = json.loads(line)
                    if not chunk.get("done"):
                        yield chunk["message"]["content"]


# Usage
ai = LocalAI()
response = ai.chat("Help me with on-device mobile AI")
print(response)

# Streaming
for token in ai.stream("Explain on-device mobile AI step by step"):
    print(token, end="", flush=True)
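The `chat` method above is single-turn: each call sends only the latest message. Because `/api/chat` accepts the whole `messages` list, a multi-turn conversation amounts to resending earlier turns with each request. A minimal sketch, assuming a hypothetical `build_messages` helper and an arbitrary `max_turns` cap to keep the prompt within a small context window:

```python
from typing import Dict, List

Message = Dict[str, str]


def build_messages(
    history: List[Message],
    user_message: str,
    system: str = "",
    max_turns: int = 4,
) -> List[Message]:
    """Assemble the `messages` list for one /api/chat request.

    `history` holds alternating user/assistant messages from earlier turns.
    Only the most recent `max_turns` exchanges (2 messages each) are kept.
    """
    # Guard against max_turns == 0: history[-0:] would return everything
    recent = history[-2 * max_turns:] if max_turns > 0 else []
    messages: List[Message] = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.extend(recent)
    messages.append({"role": "user", "content": user_message})
    return messages
```

After each reply, the caller would append both the user message and the assistant's answer to `history`. Dropping old turns by count is the simplest policy; counting tokens instead would track the context limit more closely.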

Custom Modelfile

bash

# Create optimized configuration for on-device mobile AI
cat > Modelfile << 'MODELEOF'
FROM gemma:2b

PARAMETER num_ctx 4096
PARAMETER temperature 0.7
PARAMETER top_p 0.9

SYSTEM "You are an AI assistant specialized in on-device mobile AI. You run locally on Android Smartphone. Be concise, accurate, and helpful."
MODELEOF

ollama create on-device-mobile-AI-assistant -f Modelfile
ollama run on-device-mobile-AI-assistant
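The same sampling settings can also be supplied per request rather than baked into a Modelfile: Ollama's chat endpoint accepts an `options` object with keys such as `num_ctx`, `temperature`, and `top_p`. The sketch below only builds the JSON body, which can then be posted to `/api/chat`. The helper name `chat_payload` and its validation rules are assumptions for illustration.

```python
from typing import Any, Dict


def chat_payload(
    model: str,
    message: str,
    num_ctx: int = 4096,
    temperature: float = 0.7,
    top_p: float = 0.9,
) -> Dict[str, Any]:
    """Build a non-streaming /api/chat body with per-request options."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    if temperature < 0.0:
        raise ValueError("temperature must be non-negative")
    return {
        "model": model,
        "messages": [{"role": "user", "content": message}],
        "stream": False,
        # Mirrors the PARAMETER lines of the Modelfile
        "options": {"num_ctx": num_ctx, "temperature": temperature, "top_p": top_p},
    }
```

Per-request options are convenient for experiments, for example trying a smaller `num_ctx` to reduce memory use, while the Modelfile remains the place for defaults that every client should share.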

Performance Profile

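| Metric | Value |
| --- | --- |
| Hardware | Qualcomm NPU |
| Memory | 6-12 GB |
| Speed | 10-40 tokens/sec (CPU) / 40-100+ tokens/sec (GPU) |
| First token | <200 ms (GPU) / <1 s (CPU) |
| Context | 4096 tokens as configured above (Gemma 2B supports up to 8192) |
| Cost | $0 (after hardware) |

These figures are rough estimates. Ollama's GPU backends target desktop hardware, so on a phone the CPU figures are the relevant ones.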
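Because the figures above are only rough, it is worth measuring on the actual device. A non-streaming Ollama response carries timing fields: `eval_count` (generated tokens) plus `eval_duration`, `prompt_eval_duration`, and `load_duration`, all in nanoseconds. The sketch below turns those fields into tokens per second and the time spent before generation starts. The helper name `generation_stats` is hypothetical, and treating load plus prompt time as a proxy for first-token latency is an approximation.

```python
from typing import Dict

NS_PER_SECOND = 1_000_000_000


def generation_stats(response: Dict) -> Dict[str, float]:
    """Derive throughput numbers from the timing fields of an Ollama response."""
    eval_count = response.get("eval_count", 0)
    eval_ns = response.get("eval_duration", 0)
    pre_ns = response.get("load_duration", 0) + response.get("prompt_eval_duration", 0)
    return {
        # Generated tokens divided by generation time
        "tokens_per_second": eval_count / (eval_ns / NS_PER_SECOND) if eval_ns else 0.0,
        # Model load plus prompt processing: a rough proxy for time to first token
        "seconds_before_generation": pre_ns / NS_PER_SECOND,
    }
```

Passing the parsed JSON of a `/api/chat` or `/api/generate` reply to `generation_stats` gives numbers that can be compared directly against the table.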

Production Setup with FastAPI

python

from fastapi import FastAPI
from pydantic import BaseModel

# LocalAI is the client class defined in the Python Integration section

app = FastAPI(title="Android Smartphone AI API")
ai = LocalAI()


class ChatRequest(BaseModel):
    message: str
    system: str = ""


class ChatResponse(BaseModel):
    response: str
    model: str
    device: str


# Plain `def` endpoints run in FastAPI's thread pool, so the blocking
# httpx call inside ai.chat() does not stall the event loop.
@app.post("/chat", response_model=ChatResponse)
def chat_endpoint(req: ChatRequest):
    response = ai.chat(req.message, req.system)
    return ChatResponse(response=response, model="Gemma 2B", device="Android Smartphone")


@app.get("/health")
async def health():
    return {"status": "ok", "model": "Gemma 2B", "device": "Android Smartphone"}
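A service like the one above can hide the model's load time from the first caller by loading the model at startup. According to Ollama's API documentation, a `/api/generate` request without a prompt loads the model into memory, and `keep_alive` controls how long it stays loaded. The sketch below uses only the standard library; `warmup_body` and `warm_up` are hypothetical names, and the `30m` duration is an arbitrary choice.

```python
import json
import urllib.request


def warmup_body(model: str, keep_alive: str = "30m") -> bytes:
    """JSON body that asks Ollama to load `model` without generating text."""
    return json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")


def warm_up(model: str, base_url: str = "http://localhost:11434",
            timeout: float = 120.0) -> bool:
    """Send the warm-up request; return True if the server answered with HTTP 200."""
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=warmup_body(model),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and timeouts are OSError subclasses: server down or unreachable
        return False
```

Calling `warm_up(...)` once from the application's startup hook, and logging a warning when it returns `False`, keeps the API usable even if Ollama is started later.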

Troubleshooting

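  • Slow inference: switch to Q4_K_M quantization or reduce the context window (num_ctx).
  • Out of memory: use a smaller model or a Q3_K_S quant.
  • GPU not used: CUDA and Metal drivers apply to desktop hosts only; on Android expect CPU inference, and check the ollama logs to see which backend was loaded.
  • High first-request latency: warm up the model by sending a dummy request on startup.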
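To judge whether a quantization level is likely to fit in memory, a back-of-the-envelope estimate is weights ≈ parameters × bits per weight / 8, plus the KV cache and runtime overhead. The sketch below covers only that arithmetic. The inputs used to illustrate it (roughly 2.5 billion parameters for Gemma 2B, about 4.5 effective bits per weight for a 4-bit quant, and a 1.5× headroom factor) are assumptions, and `weight_bytes` and `fits_in_ram` are hypothetical helpers.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in bytes."""
    return n_params * bits_per_weight / 8


def fits_in_ram(n_params: float, bits_per_weight: float, free_ram_gib: float,
                headroom: float = 1.5) -> bool:
    """Crude check: weights times a headroom factor must fit in free RAM.

    `headroom` is an assumed multiplier standing in for the KV cache and
    runtime buffers; tune it from measurements on the actual device.
    """
    needed_gib = weight_bytes(n_params, bits_per_weight) * headroom / 2**30
    return needed_gib <= free_ram_gib
```

For example, `fits_in_ram(2.5e9, 4.5, free_ram_gib=4.0)` checks a 4-bit quant against 4 GiB of free memory. Note that free RAM on a phone is usually well below the headline 6-12 GB because the OS and other apps hold a share of it.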

Resources

  • Ollama library: https://ollama.com/library
  • GGUF format: https://github.com/ggerganov/llama.cpp
  • Hardware guide: https://ollama.com/blog/hardware-recommendations