LLM Fallback Strategy: When Models Go Down, Can Your App Survive?

Production-grade LLM applications need a Plan B—graceful degradation under timeouts, rate limits, and outages.

LLM Fallback Strategy: Insure Your Application

Anyone who has put an LLM application into production knows the truth: APIs will fail. Timeouts, 429 rate limits, sporadic 500s, even full provider outages. If your app is tightly coupled to a single model, when it sneezes, you go down.

A fallback strategy is the insurance your application needs.

Common Failures

  • Timeout: Request sent, no response.
  • Rate limit (429): Too many requests, blocked by the provider.
  • Service error (5xx): Provider-side glitches.
  • Full outage: OpenAI / Anthropic occasionally experience widespread failures.
  • Content refusal: The model refuses to answer for safety reasons.

For each failure type, you need a "what next?" plan.

    Layered Fallback Strategies

    From lightweight to heavy, each layer catches failures:

    Layer 1: Retry. Transient errors such as timeouts and 429s often resolve after a short wait. Use exponential backoff: don't hammer the API with immediate retries, and don't wait a fixed interval blindly.

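    python
    import time

    class RateLimitError(Exception):
        """Placeholder; use your provider SDK's rate-limit (429) exception."""

    def call_with_retry(fn, max_retries=3):
        for attempt in range(max_retries):
            try:
                return fn()
            except (TimeoutError, RateLimitError):
                if attempt == max_retries - 1:
                    raise  # out of attempts: let the caller fall back
                time.sleep(2 ** attempt)  # waits 1s, then 2s, with the default 3 attempts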
    

    Layer 2: Model Fallback. If the primary model fails, automatically switch to a backup. A common chain: GPT-4o as the primary, Claude as the first fallback, then a cheap, fast small model as a last resort.

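    python
    # Model identifiers are illustrative; use the exact names your providers expose.
    MODELS = ["gpt-4o", "claude-3-5-sonnet", "gpt-4o-mini"]
    DEFAULT_REPLY = "The AI assistant is temporarily unavailable. Please try again later."

    def call_model(model, messages):
        """Placeholder for your own provider or gateway call."""
        raise NotImplementedError

    def chat_with_fallback(messages):
        for model in MODELS:
            try:
                return call_model(model, messages)
            except Exception as exc:  # in production, catch only provider errors
                print(f"{model} failed: {exc!r}")  # record it, then try the next model
        return DEFAULT_REPLY  # every model failed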
    

    This is why production apps should use a unified gateway layer (e.g., LiteLLM) to abstract multiple models—switching only requires config changes, not business logic changes.

    Layer 3: Cache. Cache answers to common questions. When all models are down, at least the high-frequency questions can still be answered from cache. Semantic caching can even match questions that are worded differently but mean the same thing.
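
As a rough illustration of this layer, here is a minimal exact-match cache with expiry. The class name `AnswerCache`, the one-hour TTL, and the normalization rule are assumptions made for the sketch. A semantic cache would replace the key lookup with an embedding similarity search, which is not shown here.

```python
import time

class AnswerCache:
    """Exact-match answer cache with per-entry expiry (hypothetical helper)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock    # injectable so expiry can be tested
        self._entries = {}     # normalized question -> (stored_at, answer)

    @staticmethod
    def _key(question):
        # Collapse case and whitespace so trivial variations share one entry.
        return " ".join(question.lower().split())

    def get(self, question):
        key = self._key(question)
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, answer = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[key]  # stale: drop it and report a miss
            return None
        return answer

    def put(self, question, answer):
        self._entries[self._key(question)] = (self._clock(), answer)
```

Checking the cache before calling any model also saves cost and latency on a normal day, not only during an outage.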

    Layer 4: Graceful Degradation. When everything fails, don't throw a 500 at the user. Return a polite fallback: "The AI assistant is temporarily unavailable. Please try again later or contact customer support."
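
One way to sketch this last layer is a wrapper that always returns a well-formed reply. The function name `answer_or_degrade`, the `generate` callable, and the `degraded` flag are hypothetical. `generate` stands for whatever runs the earlier layers and raises once they are exhausted.

```python
FALLBACK_MESSAGE = (
    "The AI assistant is temporarily unavailable. "
    "Please try again later or contact customer support."
)

def answer_or_degrade(generate, question):
    """Return a normal reply, or a polite fallback payload instead of an error.

    `generate` is a hypothetical callable that runs the retry, model, and
    cache layers and raises when all of them fail.
    """
    try:
        return {"text": generate(question), "degraded": False}
    except Exception:
        # The user still gets a well-formed response; the flag lets the UI
        # and dashboards tell degraded replies apart from real answers.
        return {"text": FALLBACK_MESSAGE, "degraded": True}
```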

    Design Tips

    Always set a timeout, and keep it short. Don't rely on client defaults, which can run to tens of seconds or more; few users will wait that long. Set 5-15 seconds depending on the scenario, and fall back if the limit is exceeded.
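
If your calls are asynchronous, a deadline can be enforced with the standard library's `asyncio.wait_for`, as in the sketch below. `call_with_timeout`, `slow_model`, and `fast_model` are hypothetical names, and the two model functions are stand-ins for real provider requests. Many HTTP clients and provider SDKs also accept their own timeout setting, which is worth configuring as well.

```python
import asyncio

async def call_with_timeout(make_call, timeout_seconds=10.0, fallback=None):
    """Await make_call(), returning `fallback` if it exceeds the deadline."""
    try:
        # wait_for cancels the pending call once the timeout expires.
        return await asyncio.wait_for(make_call(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        return fallback

async def slow_model():
    # Stand-in for a provider request that hangs.
    await asyncio.sleep(1.0)
    return "late answer"

async def fast_model():
    # Stand-in for a healthy provider request.
    return "answer"

if __name__ == "__main__":
    print(asyncio.run(call_with_timeout(slow_model, 0.05, "fallback reply")))
```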

    Consider the quality drop in the fallback chain. Switching from GPT-4o to a small model will degrade answer quality. For critical scenarios, it's better to return "try again later" than to serve a noticeably worse answer that hurts your reputation.
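
One possible way to encode this trade-off is to let each scenario declare which models it may fall back to. The scenario names and the `chain_for` helper below are invented for illustration. A critical scenario simply gets a shorter chain and reaches the "try again later" reply sooner.

```python
FULL_CHAIN = ["gpt-4o", "claude-3-5-sonnet", "gpt-4o-mini"]

# Hypothetical scenario names: which models each scenario may fall back to.
ALLOWED_MODELS = {
    "casual_chat": set(FULL_CHAIN),
    "contract_review": {"gpt-4o", "claude-3-5-sonnet"},  # never the small model
}

def chain_for(scenario):
    """Return the fallback chain for a scenario, preserving priority order."""
    allowed = ALLOWED_MODELS.get(scenario, set(FULL_CHAIN))
    return [model for model in FULL_CHAIN if model in allowed]
```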

    Don't hardcode fallback logic in business code. Extract a unified calling layer that handles retries, switching, caching, and graceful degradation in one place. Business code just says "I need an answer"; fault tolerance is the infrastructure's job.

    Monitor fallback trigger rates. If backup models are used frequently, the primary model has a problem worth investigating. LLM observability tools can track this rate for you.
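
Before adopting a dedicated tool, the trigger rate can be approximated with an in-process counter like the one below. `FallbackStats` is a hypothetical helper. A real deployment would more likely export these counts to its metrics system than keep them in memory.

```python
from collections import Counter

class FallbackStats:
    """Counts which model ends up serving each request (hypothetical helper)."""

    def __init__(self, primary):
        self.primary = primary
        self.served_by = Counter()

    def record(self, model):
        self.served_by[model] += 1

    def fallback_rate(self):
        """Fraction of requests that were not served by the primary model."""
        total = sum(self.served_by.values())
        if total == 0:
            return 0.0
        return 1.0 - self.served_by[self.primary] / total
```

Call `record()` with the model that actually produced the reply, then alert when `fallback_rate()` stays above a threshold you choose.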

    A Complete Fallback Chain

    
    User request
      → Check cache (return if hit)
      → Primary model + retry
      → Backup model + retry
      → Graceful degradation reply
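
The chain above could be wired together roughly as follows. This is a sketch under simplifying assumptions: the cache is a plain dict, `call_model` is a hypothetical function you supply, and every exception is treated as retryable, which a production version would narrow down.

```python
import time

def with_retry(fn, max_retries=2, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff; re-raise the last error."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)

def answer(question, cache, models, call_model, default_reply, sleep=time.sleep):
    """Cache, then each model in order (with retry), then the degraded reply."""
    if question in cache:
        return cache[question]
    for model in models:
        try:
            reply = with_retry(lambda: call_model(model, question), sleep=sleep)
        except Exception:
            continue  # this model is exhausted; move down the chain
        cache[question] = reply  # only real answers are cached
        return reply
    return default_reply
```

Business code then calls `answer(...)` and never needs to know which layer produced the reply.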
    

    Summary

    The robustness of an LLM application isn't measured by how well it works when everything is fine, but by how stable it is when things go wrong. Before going to production, ask yourself: if the model goes down right now, what will my users see? If you can't answer, it's time to add a fallback strategy.
