Model Registry Setup: Production Setup Guide
Version control and management for production ML models
Model Registry for LLM Apps: Production Setup Guide
A model registry answers four questions every production AI system eventually gets asked: which model/prompt combination is running right now, what did we run last Tuesday, who approved the change, and how do we roll back? Classic ML registries (MLflow-style) version trained weights; LLM applications need a wider notion — what you must version is the generation config: model ID + prompt version + parameters + tool schemas. This guide sets up both layers pragmatically.
What "model registry" means for LLM apps
For API-based LLM systems, the deployable artifact isn't weights — it's a config tuple:
yaml
registry entry: the unit you version, approve, and roll back
name: support-triage
version: 14
model: gpt-5-mini-2026-xx # pinned snapshot, never a floating alias
prompt_ref: triage-prompt@v9 # prompt store reference
params: { max_tokens: 300 }
tools_ref: triage-tools@v3
eval: { suite: triage-eval-v4, score: 0.94, run: 2026-06-10 }
approved_by: zyql
status: production # draft | staging | production | retired
The registry is then: a store for these entries + a promotion workflow + an audit log. Implementation options by team size:
registry/ directory of YAML in git + a loader. Git *is* the audit log and rollback mechanism. Honestly sufficient for most.The non-negotiable rule regardless of tooling: applications resolve configs by (name, environment) at startup — never hardcode model IDs or inline prompts in code. That indirection is what makes rollback a one-line change instead of a deploy.
The promotion workflow (where the value lives)
text
draft → staging → production
↑ gate: eval suite ≥ threshold ↑ gate: canary clean
When you also have fine-tuned weights
Self-hosted or hosted fine-tunes add the classic layer: artifact storage (S3/HF private repos for LoRA adapters), lineage (base model + dataset version + training run), and serving integration (the vLLM/hosted-adapter deployment pulls a registry-tagged artifact). MLflow or a HF-repos-plus-tags convention both work; what matters is that the *same* promotion workflow governs weights and configs — one approval trail.
Runtime integration
python
resolve at startup; cache; subscribe to changes if you want flip-without-restart
cfg = registry.resolve('support-triage', env='production')
resp = await client.chat.completions.create(
model=cfg.model, messages=render(cfg.prompt, ticket), **cfg.params)
log(feature='support-triage', registry_version=cfg.version, tokens=resp.usage, ...)
Logging the registry version on every call is the payoff move: incidents become "version 14 started erroring at 14:02" instead of archaeology, and cost dashboards slice by version (observability).
Anti-patterns
-latest) in production entries — silent model drift defeats the whole systemFAQ
Isn't this overkill before product-market fit? The git-directory version is an afternoon of work and pays off at the *first* "what changed?" incident. Skip the platform tooling, not the indirection.
Where do A/B tests fit? Two registry entries live simultaneously with a traffic split — the registry gives the variants identity; your experiment layer assigns traffic.
Multi-provider routing? Registry entries can name an equivalence class served by the gateway instead of a single model — the registry versions intent; the gateway resolves vendors.
*Last updated: June 2026.*
Also available in 中文.