Model Registry Setup: Production Setup Guide

Version control and management for production ML models

A model registry answers four questions every production AI system eventually gets asked: which model/prompt combination is running right now, what did we run last Tuesday, who approved the change, and how do we roll back? Classic ML registries (MLflow-style) version trained weights; LLM applications need a wider notion — what you must version is the generation config: model ID + prompt version + parameters + tool schemas. This guide sets up both layers pragmatically.

What "model registry" means for LLM apps

For API-based LLM systems, the deployable artifact isn't weights — it's a config tuple:

yaml
# registry entry: the unit you version, approve, and roll back
name: support-triage
version: 14
model: gpt-5-mini-2026-xx        # pinned snapshot, never a floating alias
prompt_ref: triage-prompt@v9     # prompt store reference
params: { max_tokens: 300 }
tools_ref: triage-tools@v3
eval: { suite: triage-eval-v4, score: 0.94, run: 2026-06-10 }
approved_by: zyql
status: production               # draft | staging | production | retired

The registry is then: a store for these entries + a promotion workflow + an audit log. Implementation options by team size:

Small team: a registry/ directory of YAML in git + a loader. Git *is* the audit log and rollback mechanism. Honestly sufficient for most.

Platform layer: LangSmith/Langfuse-style prompt registries give you versioned prompts with UI diffing and environment labels; pair them with git for params and tools.

MLflow-class registry: when you *also* ship fine-tuned weights (below) and want one system for both.

The non-negotiable rule regardless of tooling: applications resolve configs by (name, environment) at startup — never hardcode model IDs or inline prompts in code. That indirection is what makes rollback a one-line change instead of a deploy.
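
A minimal sketch of that indirection, assuming the YAML files have already been parsed into objects. The `RegistryEntry` and `Registry` names are illustrative, and treating the `status` field as the environment label is an assumption of this sketch rather than something the entry format requires.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RegistryEntry:
    name: str
    version: int
    model: str
    prompt_ref: str
    params: dict = field(default_factory=dict)
    status: str = "draft"  # draft | staging | production | retired


class Registry:
    """In-memory stand-in for a registry/ directory of parsed entries."""

    def __init__(self, entries):
        self._entries = list(entries)

    def resolve(self, name, env):
        """Return the highest-version entry for `name` whose status equals `env`."""
        matches = [e for e in self._entries if e.name == name and e.status == env]
        if not matches:
            raise LookupError(f"no {env!r} entry for {name!r}")
        return max(matches, key=lambda e: e.version)


registry = Registry([
    RegistryEntry("support-triage", 13, "gpt-5-mini-2026-xx", "triage-prompt@v8",
                  {"max_tokens": 300}, status="retired"),
    RegistryEntry("support-triage", 14, "gpt-5-mini-2026-xx", "triage-prompt@v9",
                  {"max_tokens": 300}, status="production"),
])

cfg = registry.resolve("support-triage", env="production")
print(cfg.version, cfg.prompt_ref)  # 14 triage-prompt@v9
```

Application code only ever names `support-triage` and an environment, so moving the `production` label back to version 13 changes what the next startup resolves without touching the caller.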

The promotion workflow (where the value lives)

text
draft ──(gate: eval suite ≥ threshold)──→ staging ──(gate: canary clean)──→ production

Eval gate: no entry reaches staging without a scored run of your eval suite attached. The registry entry carrying its eval score is what turns "I think the new prompt is better" into an auditable claim.

Canary gate: promote to a traffic slice first, watching quality/latency/cost deltas. Canary analysis covers the mechanics; the registry's job is making "old version" a pointer you can flip back.

Retirement discipline: provider model deprecations arrive on *their* schedule. The registry's pinned-snapshot field plus a deprecation calendar is how you migrate deliberately instead of during an outage.
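
The two gates and the deprecation calendar can be sketched as plain functions. Everything here is hypothetical scaffolding: the `Entry` fields, the 0.90 threshold, the `canary_clean` flag (assumed to be set by whatever monitors the canary slice), and the shutdown date are illustrative values, not provider data.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta
from typing import Optional

ORDER = ["draft", "staging", "production"]


class GateError(Exception):
    """Raised when a promotion gate is not satisfied."""


@dataclass(frozen=True)
class Entry:
    name: str
    version: int
    model: str
    status: str = "draft"
    eval_score: Optional[float] = None  # score of the attached eval run, if any
    canary_clean: bool = False          # set by whatever watches the canary slice


def promote(entry, eval_threshold=0.90):
    """Advance one step along draft -> staging -> production, enforcing gates."""
    if entry.status not in ORDER[:-1]:
        raise GateError(f"cannot promote from {entry.status!r}")
    target = ORDER[ORDER.index(entry.status) + 1]
    if target == "staging":
        if entry.eval_score is None:
            raise GateError("eval gate: no scored eval run attached")
        if entry.eval_score < eval_threshold:
            raise GateError(f"eval gate: {entry.eval_score} < {eval_threshold}")
    if target == "production" and not entry.canary_clean:
        raise GateError("canary gate: canary not marked clean")
    return replace(entry, status=target)


def retiring_soon(entries, shutdown_dates, today, horizon_days=90):
    """Non-retired entries whose pinned model shuts down within the horizon."""
    cutoff = today + timedelta(days=horizon_days)
    return [e for e in entries
            if e.status != "retired"
            and shutdown_dates.get(e.model, date.max) <= cutoff]


draft = Entry("support-triage", 15, "example-snapshot-a", eval_score=0.94)
staged = promote(draft)
live = promote(replace(staged, canary_clean=True))
print(staged.status, live.status)  # staging production

calendar = {"example-snapshot-a": date(2030, 3, 1)}  # made-up shutdown date
print(retiring_soon([live], calendar, today=date(2030, 1, 15)))
```

Because `promote` raises rather than returning a flag, a missing eval score blocks the promotion instead of being silently skipped.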

When you also have fine-tuned weights

Self-hosted or hosted fine-tunes add the classic layer: artifact storage (S3/HF private repos for LoRA adapters), lineage (base model + dataset version + training run), and serving integration (the vLLM/hosted-adapter deployment pulls a registry-tagged artifact). MLflow or a HF-repos-plus-tags convention both work; what matters is that the *same* promotion workflow governs weights and configs — one approval trail.
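
One way to keep lineage honest is to refuse promotion of any weight artifact with an incomplete record. The sketch below is an assumption about how such a record might look; the field names and the example URI are hypothetical and not tied to MLflow or any particular store.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AdapterArtifact:
    name: str
    version: int
    artifact_uri: str     # where the adapter weights live (hypothetical path below)
    base_model: str       # lineage: the model the adapter was trained on
    dataset_version: str  # lineage: the dataset snapshot used
    training_run: str     # lineage: identifier of the training run


def missing_lineage(artifact):
    """Names of fields that are empty, so a promotion step can reject the artifact."""
    return [f.name for f in fields(artifact) if not getattr(artifact, f.name)]


adapter = AdapterArtifact(
    name="triage-adapter",
    version=3,
    artifact_uri="s3://example-bucket/adapters/triage/v3",
    base_model="example-base-model",
    dataset_version="triage-data@v7",
    training_run="",  # not recorded yet
)
print(missing_lineage(adapter))  # ['training_run']
```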

Runtime integration

python
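# Resolve at startup and cache; subscribe to changes if you want flip-without-restart.
# `registry`, `client`, `render`, and `log` are objects your application provides.
cfg = registry.resolve('support-triage', env='production')

async def triage(ticket):
    resp = await client.chat.completions.create(
        model=cfg.model, messages=render(cfg.prompt, ticket), **cfg.params)
    # Log the registry version on every call, plus any other fields you track.
    log(feature='support-triage', registry_version=cfg.version, tokens=resp.usage)
    return resp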

Logging the registry version on every call is the payoff move: incidents become "version 14 started erroring at 14:02" instead of archaeology, and cost dashboards can slice by version.
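
Once the version is on every log record, per-version breakdowns are a small aggregation. A sketch, assuming each record is a dict with `registry_version` and a `status` field; both the record shape and the sample data are made up for illustration.

```python
from collections import Counter


def error_rate_by_version(records):
    """Fraction of calls that errored, keyed by registry version."""
    calls, errors = Counter(), Counter()
    for r in records:
        v = r["registry_version"]
        calls[v] += 1
        if r["status"] == "error":
            errors[v] += 1
    return {v: errors[v] / calls[v] for v in sorted(calls)}


records = [
    {"feature": "support-triage", "registry_version": 13, "status": "ok"},
    {"feature": "support-triage", "registry_version": 13, "status": "ok"},
    {"feature": "support-triage", "registry_version": 14, "status": "error"},
    {"feature": "support-triage", "registry_version": 14, "status": "ok"},
]
print(error_rate_by_version(records))  # {13: 0.0, 14: 0.5}
```

The same grouping key works for token counts or latency, which is what lets a cost dashboard slice by version.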

Anti-patterns

Floating aliases (-latest) in production entries — silent model drift defeats the whole system

Prompts in code, params in env vars, model in config — three sources of truth, no atomic rollback

Eval scores recorded "somewhere else" — if promotion doesn't *require* the score, the gate erodes

Registry without runtime version logging — you versioned configs but can't tell which one served a request
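
Several of these anti-patterns can be caught mechanically before an entry is merged. The check below is a heuristic sketch over an entry already parsed into a dict; the `-latest` pattern and the rule set are assumptions and will not recognise every provider's alias naming.

```python
import re

# Heuristic: a model ID that is, or ends in, "latest" is treated as a floating alias.
FLOATING = re.compile(r"(^|[-_:@/])latest$", re.IGNORECASE)


def lint_entry(entry):
    """Return a list of human-readable problems found in one registry entry."""
    problems = []
    model = entry.get("model", "")
    if not model:
        problems.append("model is missing")
    elif FLOATING.search(model):
        problems.append(f"model {model!r} is a floating alias")
    if "prompt_ref" not in entry:
        problems.append("prompt_ref is missing (prompt may be inlined in code)")
    if entry.get("status") in {"staging", "production"}:
        if (entry.get("eval") or {}).get("score") is None:
            problems.append("eval score is missing on a promoted entry")
    return problems


bad = {"name": "support-triage", "version": 15, "model": "example-model-latest",
       "prompt_ref": "triage-prompt@v9", "status": "production"}
for problem in lint_entry(bad):
    print(problem)
```

Running a check like this in CI on the registry directory is one way to make the rules binding rather than advisory.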

FAQ

Isn't this overkill before product-market fit? The git-directory version is an afternoon of work and pays off at the *first* "what changed?" incident. Skip the platform tooling, not the indirection.

Where do A/B tests fit? Two registry entries live simultaneously with a traffic split — the registry gives the variants identity; your experiment layer assigns traffic.

Multi-provider routing? Registry entries can name an equivalence class served by the gateway instead of a single model — the registry versions intent; the gateway resolves vendors.
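
For the A/B question above, the experiment layer's half of the job can be as small as a deterministic hash split over registry versions. This is a sketch under stated assumptions: the `assign_variant` helper, the salt, and the 90/10 weights are illustrative, and a real experiment platform would add exposure logging and guardrails.

```python
import hashlib


def assign_variant(unit_id, variants, salt="support-triage-ab"):
    """Pick a registry version for `unit_id`.

    `variants` is a list of (registry_version, weight) pairs whose weights
    sum to 1.0. The same unit_id and salt always map to the same version.
    """
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode()).digest()
    point = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    cumulative = 0.0
    for version, weight in variants:
        cumulative += weight
        if point < cumulative:
            return version
    return variants[-1][0]  # guard against floating-point rounding


variants = [(14, 0.9), (15, 0.1)]  # both entries live at once
version = assign_variant("user-1234", variants)
print(version)
```

The returned version is what gets resolved from the registry and logged with the call, so each variant keeps its identity in dashboards.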

*Last updated: June 2026.*