Graceful Shutdown for AI Services

Properly handling shutdown signals in AI inference servers

Every deploy, autoscale-down, and spot-instance reclaim sends your AI service a SIGTERM. What happens next separates clean operations from a trail of half-finished generations, dropped streams, and double-billed work. AI services have unusually long in-flight requests (seconds to minutes of generation), so a naive shutdown loses more work than it would for a typical API. This guide implements graceful shutdown for three AI service shapes: request/response APIs, streaming endpoints, and queue workers.

The shutdown contract

On SIGTERM, a well-behaved service:

Stops accepting new work (fail readiness probe / stop polling the queue)

Finishes or safely aborts in-flight work within a drain window

Releases cleanly (flush logs/metrics, close DB/HTTP pools, ack/nack queue messages correctly)

Exits before the orchestrator's hard-kill (SIGKILL) deadline

The AI-specific tension: a long generation from a large model (70B parameters, say) can outlive the default 30-second grace period. You either extend the deadline or design aborts that don't waste the work.
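The four steps of the contract can be sketched with plain asyncio. This is a minimal illustration under stated assumptions, not a production implementation: `DrainController`, `track`, and `drain` are hypothetical names, and releasing resources (step 3) is left to the caller after `drain` returns.

```python
import asyncio


class DrainController:
    """Hypothetical helper: tracks in-flight work and bounds the drain."""

    def __init__(self) -> None:
        self.draining = False          # step 1: refuse new work once True
        self._tasks = set()            # in-flight asyncio tasks

    def track(self, coro) -> asyncio.Task:
        """Start a unit of work, refusing it if a drain has begun."""
        if self.draining:
            coro.close()               # avoid a "never awaited" warning
            raise RuntimeError("draining: not accepting new work")
        task = asyncio.ensure_future(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def drain(self, window_s: float) -> int:
        """Step 2: wait up to window_s, cancel stragglers, return how many were cancelled."""
        self.draining = True
        if not self._tasks:
            return 0
        _, pending = await asyncio.wait(self._tasks, timeout=window_s)
        for task in pending:
            task.cancel()              # safe abort before the hard-kill deadline
        await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)
```

A SIGTERM handler would call `drain(window_s)` with a window shorter than the orchestrator's grace period, then flush and close resources and exit (steps 3 and 4).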

Request/response API (FastAPI shape)

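```python
import asyncio
import signal
from contextlib import asynccontextmanager
from fastapi import FastAPI, Request, Response

draining = False   # flipped on SIGTERM; readiness reports 503 while True
in_flight = 0      # requests currently being served

def start_drain():
    global draining
    draining = True                            # readiness goes red below

@asynccontextmanager
async def lifespan(app: FastAPI):
    loop = asyncio.get_running_loop()
    # Caveat: this replaces any SIGTERM handler installed earlier (including the
    # ASGI server's own), so make sure the server is still told to exit.
    loop.add_signal_handler(signal.SIGTERM, start_drain)
    yield
    # lifespan exit: wait for in-flight to hit zero (bounded)
    for _ in range(120):                      # match your max generation time
        if in_flight == 0: break
        await asyncio.sleep(1)

app = FastAPI(lifespan=lifespan)

@app.middleware("http")
async def track_in_flight(request: Request, call_next):
    global in_flight
    in_flight += 1                             # count every request being served
    try:
        return await call_next(request)
    finally:
        in_flight -= 1

@app.get('/healthz/ready')
async def ready():
    return Response(status_code=503 if draining else 200)
```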

Once readiness fails, the load balancer routes new traffic elsewhere while existing requests complete. Set terminationGracePeriodSeconds (K8s) to your p99 generation time plus margin. It is the single most-forgotten line: the default 30s silently truncates long generations at scale.
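As a rough illustration of the "p99 generation time plus margin" rule, the arithmetic might look like the sketch below. The function name, the cleanup allowance, and the 20% default margin are assumptions chosen for the example, not recommended values.

```python
import math


def grace_period_seconds(p99_generation_s: float, cleanup_s: float = 5, margin_pct: int = 20) -> int:
    """Drain budget: longest expected generation plus cleanup time, padded by a margin."""
    if p99_generation_s < 0 or cleanup_s < 0 or margin_pct < 0:
        raise ValueError("inputs must be non-negative")
    # Integer percentage keeps the arithmetic free of float rounding surprises.
    return math.ceil((p99_generation_s + cleanup_s) * (100 + margin_pct) / 100)


# Example: a 95 s p99 generation with 5 s of cleanup and a 20% margin.
budget = grace_period_seconds(95, cleanup_s=5)   # 120
```

The result is the value you would put in terminationGracePeriodSeconds. The in-process drain loop should use a bound no larger than this, otherwise the hard kill arrives first.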

Streaming endpoints: the special case

A half-delivered stream is worse than an error — the user watched the answer stop mid-sentence. On drain:

Let active streams finish (they're already producing; killing them wastes the spend) — this is why your grace period must cover full generation time.

If you must abort (hard deadline approaching): send an explicit in-band event (data: {"error":"server_restarting","resume":true}) before closing, so clients show "reconnecting…" instead of a frozen cursor — and cancel the upstream provider call so you stop paying for tokens nobody will receive (the same disconnect discipline as in the streaming recipe).

Client-side: a resume path (re-ask with the partial response as context, or re-fetch from a job-state endpoint) turns restarts into a blip.
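The abort path can be sketched as an async generator that relays tokens as SSE frames and watches a hard-deadline event. This is a sketch under assumptions: `sse_with_drain` and the `abort` event are hypothetical names, `upstream` stands in for whatever token stream your provider SDK exposes, and closing it with `aclose()` is assumed to cancel the provider call (real SDKs may need a different cancellation mechanism).

```python
import asyncio
import json
from typing import AsyncIterator


async def sse_with_drain(upstream: AsyncIterator[str], abort: asyncio.Event) -> AsyncIterator[str]:
    """Relay tokens as SSE frames; on abort, tell the client and stop the upstream."""
    abort_wait = asyncio.ensure_future(abort.wait())
    try:
        while True:
            next_token = asyncio.ensure_future(upstream.__anext__())
            done, _ = await asyncio.wait(
                {next_token, abort_wait}, return_when=asyncio.FIRST_COMPLETED
            )
            if next_token not in done:
                # Hard deadline reached mid-generation: cancel the pending read,
                # then send the explicit in-band event before closing.
                next_token.cancel()
                await asyncio.gather(next_token, return_exceptions=True)
                yield 'data: {"error":"server_restarting","resume":true}\n\n'
                return
            try:
                token = next_token.result()
            except StopAsyncIteration:
                return                     # generation finished normally
            yield f"data: {json.dumps({'token': token})}\n\n"
    finally:
        abort_wait.cancel()
        aclose = getattr(upstream, "aclose", None)
        if aclose is not None:
            await aclose()                 # stop paying for tokens nobody will receive
```

Racing each token read against the abort event means a stalled upstream cannot hold the stream open past the deadline. The client still needs to treat the error event as a cue to reconnect or resume.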

Queue workers: rely on redelivery, but make it safe

For batch/async AI work (enrichment runs, webhook processors), shutdown is simpler *if* two properties hold:

Stop polling on SIGTERM, finish the current message, ack it. If the deadline hits mid-task, nack the message or let its visibility timeout expire so the queue redelivers it to another worker.

Idempotency makes redelivery free: key work by message ID (skip if a result already exists) so a redelivered task doesn't double-call the LLM or double-write. Without idempotency, "graceful" shutdown still produces duplicate side effects — this is the same exactly-once discipline as API retry handling.

For multi-step agent jobs, checkpoint progress (a LangGraph checkpointer does this natively) so redelivery resumes mid-graph instead of restarting a 20-step run.
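The two properties above can be illustrated with an in-memory stand-in. This is a sketch, not a client for any real queue: `run_worker`, `handle`, and the `results` dict are hypothetical, an `asyncio.Queue` plays the broker, and `task_done()` plays the role of an ack.

```python
import asyncio
from typing import Awaitable, Callable


async def run_worker(
    queue: asyncio.Queue,                        # items are (message_id, payload) tuples
    results: dict,                               # idempotency store keyed by message ID
    handle: Callable[[str], Awaitable[str]],     # the expensive LLM call
    stop: asyncio.Event,                         # set by the SIGTERM handler
) -> None:
    while not stop.is_set():                     # stop polling once draining
        try:
            message_id, payload = await asyncio.wait_for(queue.get(), timeout=0.1)
        except asyncio.TimeoutError:
            continue                             # nothing to do; re-check the stop flag
        if message_id not in results:            # redelivered and already done: skip the work
            results[message_id] = await handle(payload)
        queue.task_done()                        # "ack" only after the result is stored
```

Because the stop flag is checked only between messages, the current message always finishes before the worker exits. In a real deployment the results store would be a durable database or cache shared by all workers, not a process-local dict.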

Self-hosted model servers

vLLM-class servers handle in-flight completion on SIGTERM; your job is the orchestration around them: drain via readiness *before* stopping the pod, and during rollouts keep maxUnavailable conservative — GPU pods are slow to start (model load measured in minutes), so aggressive rollout settings create capacity gaps. Preload/warm the replacement before draining the old (serving guide).

Test it or it doesn't work

Graceful shutdown rots silently. Two cheap tests: a CI/staging script that starts load, sends SIGTERM, and asserts zero dropped/duplicated requests; and chaos-style pod kills in staging on a schedule. The first deploy during peak traffic is the wrong time to learn your grace period is 30 seconds.
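A skeleton for the first test might look like the following. It is a sketch under assumptions: the child process is a stand-in for your service, `sigterm_drain_check` is a hypothetical name, and it checks only for a clean exit. A real version would drive load and compare request IDs to catch drops and duplicates. It relies on POSIX signals.

```python
import signal
import subprocess
import sys
import textwrap

# Stand-in service: finishes the unit of work it has started, then exits 0 on SIGTERM.
CHILD = textwrap.dedent("""
    import signal, sys, time
    stop = False
    def on_term(signum, frame):
        global stop
        stop = True
    signal.signal(signal.SIGTERM, on_term)
    print("ready", flush=True)
    done = 0
    while not stop:
        time.sleep(0.05)      # one unit of in-flight work
        done += 1
    print(f"drained after {done} units", flush=True)
    sys.exit(0)
""")


def sigterm_drain_check(timeout_s: float = 5.0) -> str:
    """Start the child, send SIGTERM, and return its remaining output if it exited cleanly."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE, text=True)
    assert proc.stdout.readline().strip() == "ready"     # wait until the handler is installed
    proc.send_signal(signal.SIGTERM)
    try:
        out, _ = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()                                       # what the orchestrator would do
        proc.communicate()
        raise AssertionError("service did not exit within the grace period")
    assert proc.returncode == 0, f"unclean exit: {proc.returncode}"
    return out
```

Setting `timeout_s` to the grace period you actually deploy with makes the test fail as soon as drain time outgrows it.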

FAQ

What about SIGKILL? Unhandleable by definition — your protection is the redelivery+idempotency design, which makes even hard kills lose nothing but time.

Spot/preemptible GPUs? Same machinery with a shorter fuse (cloud preemption warnings range from ~30s to 2 minutes) — checkpoint aggressively and treat preemption notice as SIGTERM.

Serverless platforms? The platform owns drain semantics — read your provider's lifecycle docs; your idempotency layer is still what saves you.

*Last updated: June 2026.*
