LLM Fallback Chains: Production Patterns

Automatic fallback between LLM providers on failure

In production, model providers fail: rate limits, timeouts, regional outages, the occasional 500. A fallback chain keeps your app up by automatically retrying the request against an alternate model or provider when the primary fails. For an app that depends on hosted models, it is one of the most valuable reliability patterns to have in place.

The pattern

Define an ordered list of models. Try the first; on failure (error or timeout), fall through to the next. Combine with per-attempt timeouts and capped retries so a slow provider can't hang the request.

```python
# Install first: pip install litellm
from litellm import completion

resp = completion(
    model="gpt-4o",  # primary model
    messages=[{"role": "user", "content": "Summarize this ticket."}],
    # Models to try, in order, if the primary fails.
    fallbacks=["claude-3-5-sonnet-latest", "gpt-4o-mini"],
    timeout=20,      # request timeout in seconds
    num_retries=2,   # capped retries
)
print(resp.choices[0].message.content)
```

LiteLLM gives you provider-agnostic fallbacks for free behind one OpenAI-compatible call — compare gateways in LiteLLM vs Portkey.
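If you roll your own instead, the loop behind the pattern can be sketched in plain Python. This is a minimal sketch, not LiteLLM's implementation: `call_with_fallbacks`, `TransientError`, and the `call_model` callback are hypothetical names, and the sketch assumes your provider client raises `TransientError` for 429, 5xx, and timeout failures.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (429, 5xx, timeout)."""


def call_with_fallbacks(
    models: Sequence[str],
    call_model: Callable[[str, str, float], str],  # (model, prompt, timeout_s) -> text
    prompt: str,
    timeout_s: float = 20.0,
    max_retries: int = 2,
    backoff_s: float = 0.5,
) -> Tuple[str, str]:
    """Try each model in order; return (model_used, text) from the first success."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(max_retries + 1):
            try:
                return model, call_model(model, prompt, timeout_s)
            except TransientError as exc:
                last_error = exc
                if attempt < max_retries:
                    # Exponential backoff before retrying the same model.
                    time.sleep(backoff_s * (2 ** attempt))
            # Any other exception (for example a bad request) propagates
            # immediately: neither a retry nor a fallback would fix it.
    raise RuntimeError("all models in the chain failed") from last_error
```

Passing the provider call in as `call_model` keeps the chain logic independent of any SDK, which also makes it easy to exercise with a stub that fails on the primary and succeeds on the fallback.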

Design choices

Order by capability *and* cost. A common chain is "primary frontier model → comparable model from another vendor → cheap model" so you degrade gracefully rather than failing.

Per-attempt timeout. Without it, one stalled provider blocks the whole chain. Keep each attempt short.
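A little arithmetic shows why short attempts matter. The sketch below assumes a chain where every model gets the same retry count and timeout, and it adds an overall request budget, which is an assumption beyond the text above; `worst_case_seconds` and `attempt_timeout` are hypothetical helper names.

```python
import time
from typing import Optional


def worst_case_seconds(n_models: int, retries_per_model: int, timeout_s: float) -> float:
    """Upper bound if every attempt runs to its timeout (backoff sleeps excluded)."""
    return n_models * (retries_per_model + 1) * timeout_s


def attempt_timeout(deadline: float, per_attempt_s: float, now: Optional[float] = None) -> float:
    """Timeout for the next attempt, clipped to what is left of the overall budget.

    `deadline` is a time.monotonic() timestamp; `now` is injectable for testing.
    """
    now = time.monotonic() if now is None else now
    remaining = deadline - now
    if remaining <= 0:
        raise TimeoutError("request budget exhausted")
    return min(per_attempt_s, remaining)
```

With three models, two retries each, and a 20-second timeout, `worst_case_seconds(3, 2, 20)` comes to 180 seconds, far longer than most callers will wait. Clipping each attempt to the remaining budget is one way to keep the whole chain bounded.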

Distinguish error types. Retry on 429/5xx/timeout; do not retry on 400 (bad request) — that'll fail every time.
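The retry rule above fits in a few lines. This is a sketch with a hypothetical helper name, `is_retryable`; how you obtain the status code depends on your client library, and treating a failure with no status code as non-retryable is an assumption you may want to revisit for connection errors.

```python
from typing import Optional


def is_retryable(status_code: Optional[int] = None, timed_out: bool = False) -> bool:
    """Return True for transient failures: timeouts, 429, and any 5xx."""
    if timed_out:
        return True
    if status_code is None:
        # Assumption: no status and no timeout is treated as non-retryable.
        return False
    return status_code == 429 or 500 <= status_code <= 599
```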

Cross-provider, not just cross-model. Fallbacks within one provider don't help during a provider-wide outage; span vendors. See GPT-4o vs Claude for picking comparable pairs.

Watch for prompt incompatibilities. A prompt tuned for one model may behave differently on the fallback; keep prompts portable.

Beyond fallback: load balancing

When you have multiple keys/regions, also load-balance across healthy endpoints to spread rate limits and reduce latency — the complementary pattern to fallback. Pair this with retries, circuit breakers, and observability for a robust stack.
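As a rough illustration of balancing across healthy endpoints, the sketch below rotates round-robin and skips any endpoint that failed recently. `EndpointPool`, its method names, and the fixed cooldown are all hypothetical; a production circuit breaker would usually track failure counts and probe before fully reopening.

```python
import time
from typing import Callable, Dict, List, Sequence


class EndpointPool:
    """Round-robin over endpoints, skipping any that are cooling down after a failure."""

    def __init__(
        self,
        endpoints: Sequence[str],
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._endpoints: List[str] = list(endpoints)
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._unhealthy_until: Dict[str, float] = {}
        self._next = 0

    def mark_failed(self, endpoint: str) -> None:
        """Take an endpoint out of rotation for cooldown_s seconds."""
        self._unhealthy_until[endpoint] = self._clock() + self._cooldown_s

    def pick(self) -> str:
        """Return the next healthy endpoint in rotation order."""
        now = self._clock()
        n = len(self._endpoints)
        for offset in range(n):
            idx = (self._next + offset) % n
            endpoint = self._endpoints[idx]
            if self._unhealthy_until.get(endpoint, float("-inf")) <= now:
                self._next = (idx + 1) % n
                return endpoint
        raise RuntimeError("no healthy endpoints")
```

The two patterns compose: call `pick()` for each attempt, call `mark_failed()` on a transient error, and move to the next model in the chain only when the pool for the current one is exhausted.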

FAQ

What should I retry on? Transient errors: 429, 5xx, and timeouts. Do not retry other 4xx client errors such as 400, 401, or 403; the same request will fail again.

Won't fallbacks hide problems? Log every fallback event. A rising fallback rate is an early warning, not something to swallow silently.

Same prompt across models? Keep prompts portable, and test the fallback model so quality doesn't crater on failover.

Library or roll my own? LiteLLM and Portkey give you this out of the box; rolling your own is fine but reimplements the same logic.
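Logging fallback events, as the FAQ recommends, needs very little machinery. The sketch below uses only the standard library; `FallbackStats`, the logger name, and the log fields are hypothetical, and in practice you would likely export the rate to whatever metrics system you already run.

```python
import logging
from collections import Counter
from typing import Optional

logger = logging.getLogger("llm.fallback")  # hypothetical logger name


class FallbackStats:
    """Counts requests and records which ones were served by a fallback model."""

    def __init__(self) -> None:
        self.requests = 0
        self.fallbacks: Counter = Counter()

    def record(self, primary: str, served_by: str, reason: Optional[str] = None) -> None:
        """Record one completed request; log a warning if a fallback served it."""
        self.requests += 1
        if served_by != primary:
            self.fallbacks[(primary, served_by)] += 1
            logger.warning(
                "fallback primary=%s served_by=%s reason=%s", primary, served_by, reason
            )

    def fallback_rate(self) -> float:
        """Fraction of recorded requests that were served by a fallback."""
        if self.requests == 0:
            return 0.0
        return sum(self.fallbacks.values()) / self.requests
```

Alerting when `fallback_rate()` crosses a threshold turns a silent failover into the early warning described above.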

Summary

A fallback chain is cheap insurance: an ordered list of models across providers, per-attempt timeouts, retries only on transient errors, and logging on every failover. Add load balancing across healthy endpoints and your LLM app survives the outages that will inevitably happen.

*Last updated: June 2026. Verify APIs against the LiteLLM docs.*
