Graceful Shutdown for AI

Properly handling shutdown signals in AI inference servers

Graceful Shutdown for AI Services

Every deploy, autoscale-down, and spot-instance reclaim sends your AI service a SIGTERM. What happens next separates clean operations from a trail of half-finished generations, dropped streams, and double-billed work: an AI service has unusually long in-flight requests (seconds to minutes of generation), so naive shutdown loses more work than it would for a normal API. This guide implements graceful shutdown for the three AI service shapes: request/response APIs, streaming endpoints, and queue workers.

The shutdown contract

On SIGTERM, a well-behaved service:

  • Stops accepting new work (fail readiness probe / stop polling the queue)
  • Finishes or safely aborts in-flight work within a drain window
  • Releases resources cleanly (flush logs/metrics, close DB/HTTP pools, ack/nack queue messages correctly)
  • Exits before the orchestrator's hard-kill (SIGKILL) deadline

The AI-specific tension: a single generation from a 70B-parameter model can outlive the default 30-second grace period. You either extend the deadline or design aborts that don't waste the work.

    Request/response API (FastAPI shape)

    python
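    import asyncio, signal
    from contextlib import asynccontextmanager
    from fastapi import FastAPI, Request, Response

    draining = False
    in_flight = 0

    def start_drain():
        global draining
        draining = True  # readiness goes red below

    @asynccontextmanager
    async def lifespan(app: FastAPI):
        loop = asyncio.get_running_loop()
        # Caveat: this may replace the SIGTERM handler your ASGI server installed;
        # verify that the server still begins its own shutdown.
        loop.add_signal_handler(signal.SIGTERM, start_drain)
        yield
        # lifespan exit: wait for in-flight to hit zero (bounded)
        for _ in range(120):  # match your max generation time
            if in_flight == 0:
                break
            await asyncio.sleep(1)

    app = FastAPI(lifespan=lifespan)

    @app.middleware("http")
    async def track_in_flight(request: Request, call_next):
        # Count requests so the drain loop above has something to wait on
        global in_flight
        in_flight += 1
        try:
            return await call_next(request)
        finally:
            in_flight -= 1

    @app.get('/healthz/ready')
    async def ready():
        return Response(status_code=503 if draining else 200)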

    The load balancer sees readiness fail, routes new traffic elsewhere; existing requests complete. Set terminationGracePeriodSeconds (K8s) to your p99 generation time plus margin — the single most-forgotten line; the default 30s silently truncates long generations at scale.
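The "p99 plus margin" rule can be worked out from observed latencies. The following is a minimal sketch: the function name, the 15-second margin, and the sample data are illustrative assumptions, not measurements from a real service.

```python
import math

def grace_period_seconds(latencies_s, margin_s=15, floor_s=30):
    """Return the p99 of observed generation latencies plus a fixed margin.

    latencies_s: generation durations in seconds (illustrative input).
    margin_s: extra headroom added on top of p99 (assumed value).
    floor_s: lower bound; 30 matches the Kubernetes default grace period.
    """
    if not latencies_s:
        return floor_s
    ordered = sorted(latencies_s)
    # Nearest-rank p99 using integer math: ceil(0.99 * n)
    rank = (99 * len(ordered) + 99) // 100
    p99 = ordered[rank - 1]
    return max(floor_s, math.ceil(p99 + margin_s))

# Hypothetical samples: most generations take 8s, a few long ones take 95s.
samples = [8.0] * 97 + [95.0] * 3
print(grace_period_seconds(samples))  # → 110
```

The result is the number you would put in terminationGracePeriodSeconds. With these samples the default of 30 would cut off the slowest 3% of generations.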

    Streaming endpoints: the special case

    A half-delivered stream is worse than an error — the user watched the answer stop mid-sentence. On drain:

  • Let active streams finish (they're already producing; killing them wastes the spend) — this is why your grace period must cover full generation time.
  • If you must abort (hard deadline approaching): send an explicit in-band event (data: {"error":"server_restarting","resume":true}) before closing, so clients show "reconnecting…" instead of a frozen cursor — and cancel the upstream provider call so you stop paying for tokens nobody will receive (the same disconnect discipline as in the streaming recipe).
  • Client-side: a resume path (re-ask with the partial response as context, or re-fetch from a job-state endpoint) turns restarts into a blip.
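The abort path can be sketched with plain asyncio. This is an assumption-laden illustration rather than a drop-in implementation: `stream_with_drain`, `fake_upstream`, and the `abort_after_s` parameter are hypothetical names, and the upstream is assumed to be an async generator whose `aclose()` cancels the provider call. The deadline is only checked between chunks, so a stalled upstream would need an additional timeout.

```python
import asyncio
import json

async def stream_with_drain(upstream, drain_event, abort_after_s):
    """Relay chunks from `upstream` as SSE lines.

    Once `drain_event` is set, the stream may keep running for
    `abort_after_s` more seconds. After that it emits an in-band
    error event and stops.
    """
    loop = asyncio.get_running_loop()
    deadline = None
    try:
        async for chunk in upstream:
            yield f"data: {json.dumps({'token': chunk})}\n\n"
            if drain_event.is_set():
                if deadline is None:
                    deadline = loop.time() + abort_after_s
                if loop.time() >= deadline:
                    # Tell the client explicitly instead of freezing mid-sentence
                    yield 'data: {"error":"server_restarting","resume":true}\n\n'
                    return
    finally:
        # Close the upstream so no further tokens are generated or billed
        await upstream.aclose()

async def fake_upstream(closed):
    """Hypothetical stand-in for a provider stream; records when it is closed."""
    try:
        for i in range(1000):
            await asyncio.sleep(0)
            yield f"tok{i}"
    finally:
        closed.append(True)

async def demo():
    drain = asyncio.Event()
    drain.set()  # pretend SIGTERM already arrived
    closed = []
    stream = stream_with_drain(fake_upstream(closed), drain, abort_after_s=0)
    events = [event async for event in stream]
    return events, closed

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a FastAPI service, a generator like this would typically be passed to a `StreamingResponse` with the `text/event-stream` media type, with `abort_after_s` set just below the grace period.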

    Queue workers: rely on redelivery, but make it safe

    For batch/async AI work (enrichment runs, webhook processors), shutdown is simpler *if* two properties hold:

  • Stop polling on SIGTERM, finish the current message, ack it. If the deadline hits mid-task: nack/let visibility timeout expire → the queue redelivers to another worker.
  • Idempotency makes redelivery free: key work by message ID (skip if a result already exists) so a redelivered task doesn't double-call the LLM or double-write. Without idempotency, "graceful" shutdown still produces duplicate side effects — this is the same exactly-once discipline as API retry handling.

    For multi-step agent jobs, checkpoint progress (a LangGraph checkpointer does this natively) so redelivery resumes mid-graph instead of restarting a 20-step run.
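The stop-polling and idempotency rules can be sketched as a small worker loop. Everything here is a stand-in chosen for illustration: an in-memory `asyncio.Queue` plays the broker, a dict plays the result store, and `run_worker` and `process` are hypothetical names. A real broker's ack/nack calls will differ.

```python
import asyncio

async def run_worker(queue, results, process, stop_event):
    """Pull (message_id, payload) pairs until stop_event is set.

    results: dict used as the idempotency store (message_id -> result).
    process: async callable doing the expensive work, e.g. an LLM call.
    """
    while not stop_event.is_set():  # stop polling once shutdown begins
        try:
            message_id, payload = queue.get_nowait()
        except asyncio.QueueEmpty:
            await asyncio.sleep(0.01)
            continue
        if message_id not in results:  # redelivered and already done: skip
            results[message_id] = await process(payload)
        queue.task_done()  # "ack" only after the result is stored

async def demo():
    queue = asyncio.Queue()
    for msg in [("m1", "a"), ("m2", "b"), ("m1", "a")]:  # m1 is delivered twice
        queue.put_nowait(msg)
    calls = []

    async def process(payload):
        calls.append(payload)  # count how often the "LLM" is actually called
        return payload.upper()

    results = {}
    stop = asyncio.Event()
    worker = asyncio.create_task(run_worker(queue, results, process, stop))
    await queue.join()  # all three deliveries acked
    stop.set()          # stands in for the SIGTERM handler
    await worker
    return results, calls

if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The duplicate delivery of `m1` is acked without a second call to `process`, which is the behavior that makes redelivery after a hard kill cheap. In a real service, `stop.set` would be registered as the SIGTERM handler.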

    Self-hosted model servers

    vLLM-class servers handle in-flight completion on SIGTERM; your job is the orchestration around them: drain via readiness *before* stopping the pod, and during rollouts keep maxUnavailable conservative — GPU pods are slow to start (model load measured in minutes), so aggressive rollout settings create capacity gaps. Preload/warm the replacement before draining the old (serving guide).

    Test it or it doesn't work

    Graceful shutdown rots silently. Two cheap tests: a CI/staging script that starts load, sends SIGTERM, and asserts zero dropped/duplicated requests; and chaos-style pod kills in staging on a schedule. The first deploy during peak traffic is the wrong time to learn your grace period is 30 seconds.
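A very small version of the first test might look like the sketch below. It assumes a POSIX system, and the child process is a hypothetical stand-in for your service: it holds one "in-flight" job, finishes it after SIGTERM, then exits 0. A real check would start the actual server, drive load against it, and compare request and response counts.

```python
import signal
import subprocess
import sys
import textwrap

# Hypothetical child service: finishes its in-flight job after SIGTERM, then exits 0.
CHILD = textwrap.dedent("""
    import signal, sys, time
    draining = False
    def on_term(signum, frame):
        global draining
        draining = True
    signal.signal(signal.SIGTERM, on_term)
    print("ready", flush=True)
    while not draining:
        time.sleep(0.05)
    time.sleep(0.2)  # stand-in for finishing an in-flight generation
    print("completed", flush=True)
    sys.exit(0)
""")

def run_shutdown_check(timeout_s=10):
    """Start the child, send SIGTERM, and report how it exited."""
    with subprocess.Popen([sys.executable, "-c", CHILD],
                          stdout=subprocess.PIPE, text=True) as proc:
        # Wait until the handler is installed before signalling
        ready = proc.stdout.readline().strip()
        proc.send_signal(signal.SIGTERM)
        try:
            returncode = proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            proc.kill()  # mirrors the orchestrator's SIGKILL after the deadline
            raise
        rest = proc.stdout.read().strip()
    return ready, returncode, rest

if __name__ == "__main__":
    print(run_shutdown_check())
```

An exit code of 0 together with the "completed" line shows the in-flight work finished. A negative return code would mean the process died from the signal instead of draining.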

    FAQ

    What about SIGKILL? Unhandleable by definition — your protection is the redelivery+idempotency design, which makes even hard kills lose nothing but time.

    Spot/preemptible GPUs? Same machinery with a shorter fuse (cloud preemption warnings range from ~30s to 2 minutes) — checkpoint aggressively and treat preemption notice as SIGTERM.

    Serverless platforms? The platform owns drain semantics — read your provider's lifecycle docs; your idempotency layer is still what saves you.


    *Last updated: June 2026.*