Graceful Shutdown for AI Services
Properly handling shutdown signals in AI inference servers
Every deploy, autoscale-down, and spot-instance reclaim sends your AI service a SIGTERM. What happens next separates clean operations from a trail of half-finished generations, dropped streams, and double-billed work: an AI service has unusually long in-flight requests (seconds to minutes of generation), so naive shutdown loses more work than it would for a normal API. This guide implements graceful shutdown for the three AI service shapes: request/response APIs, streaming endpoints, and queue workers.
The shutdown contract
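On SIGTERM, a well-behaved service stops taking new work (readiness fails, so the load balancer routes elsewhere), lets in-flight requests finish within a bounded deadline, and exits before the orchestrator escalates to SIGKILL.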
The AI-specific tension: a 70B generation can outlive a default 30-second grace period. You either extend the deadline or design aborts that don't waste the work.
Request/response API (FastAPI shape)
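```python
import asyncio, signal
from contextlib import asynccontextmanager
from fastapi import FastAPI, Request, Response

draining = False
in_flight = 0

def start_drain():
    global draining
    draining = True  # readiness goes red below

@asynccontextmanager
async def lifespan(app: FastAPI):
    loop = asyncio.get_running_loop()
    # Caveat: this replaces any SIGTERM handler the ASGI server installed.
    # The server must still be told to exit, so chain to its handler if needed.
    loop.add_signal_handler(signal.SIGTERM, start_drain)
    yield
    # lifespan exit: wait for in-flight to hit zero (bounded)
    for _ in range(120):  # match your max generation time
        if in_flight == 0:
            break
        await asyncio.sleep(1)

app = FastAPI(lifespan=lifespan)

@app.middleware("http")
async def track_in_flight(request: Request, call_next):
    # Count every request so the lifespan exit knows when the drain is done
    global in_flight
    in_flight += 1
    try:
        return await call_next(request)
    finally:
        in_flight -= 1

@app.get('/healthz/ready')
async def ready():
    return Response(status_code=503 if draining else 200)
```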
The load balancer sees readiness fail and routes new traffic elsewhere while existing requests complete. Set terminationGracePeriodSeconds (K8s) to your p99 generation time plus margin. It is the single most-forgotten line: the default 30s silently truncates long generations at scale.
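As a rough illustration of "p99 generation time plus margin", the sketch below derives a grace period from observed generation durations. The function name, the `margin_s` and `drain_delay_s` knobs, and the sample data are all hypothetical; treat it as one way to do the arithmetic, not a recommended formula.

```python
import math

def grace_period_seconds(durations_s, margin_s=15.0, drain_delay_s=5.0):
    """Suggest a termination grace period from observed generation times.

    durations_s: observed end-to-end generation durations, in seconds.
    margin_s: hypothetical safety margin on top of p99.
    drain_delay_s: hypothetical allowance for the load balancer to notice
        the failing readiness probe before traffic actually stops.
    """
    if not durations_s:
        raise ValueError("need at least one observed duration")
    ordered = sorted(durations_s)
    # Nearest-rank p99: smallest sample with at least 99% of samples at or below it
    rank = math.ceil(0.99 * len(ordered))
    p99 = ordered[rank - 1]
    return math.ceil(drain_delay_s + p99 + margin_s)

# Made-up sample: most generations are short, a few run past a minute
samples = [2.0, 4.5, 8.0, 12.0, 41.0, 55.0, 73.0]
suggested = grace_period_seconds(samples)
```

With a sample like this the suggestion lands well above 30 seconds, which is the gap the default grace period hides.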
Streaming endpoints: the special case
A half-delivered stream is worse than an error: the user watched the answer stop mid-sentence. On drain, send a final event (data: {"error":"server_restarting","resume":true}) before closing, so clients show "reconnecting…" instead of a frozen cursor, and cancel the upstream provider call so you stop paying for tokens nobody will receive (the same disconnect discipline as in the streaming recipe).
Queue workers: rely on redelivery, but make it safe
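For batch/async AI work (enrichment runs, webhook processors), shutdown is simpler *if* two properties hold: the queue redelivers any job that was not acknowledged (so acknowledge only after the work completes), and processing is idempotent, so a redelivered job cannot apply its effects twice.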
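The redelivery-plus-idempotency pairing can be sketched without a real broker. In the illustration below, a `queue.Queue` stands in for the message queue, a `set` stands in for a durable store of completed idempotency keys, and `Shutdown`, `run_worker`, and `handler` are hypothetical names. The point is the ordering: a job is acknowledged only after its effect is recorded, and a duplicate delivery is dropped by key.

```python
import queue

class Shutdown(Exception):
    """Raised by a handler when the process is told to stop mid-job."""

def run_worker(jobs, handler, done_keys, results):
    """Process (key, payload) jobs until the queue is empty or a job is interrupted."""
    while True:
        try:
            key, payload = jobs.get_nowait()
        except queue.Empty:
            return
        if key in done_keys:
            continue  # duplicate delivery: effect already applied, drop it
        try:
            results[key] = handler(payload)
        except Shutdown:
            jobs.put((key, payload))  # simulate the broker redelivering the un-acked job
            return
        done_keys.add(key)  # "ack" only after the effect is recorded

jobs = queue.Queue()
for item in [("a", 1), ("b", 2), ("a", 1)]:  # "a" is delivered twice
    jobs.put(item)

calls = []
interrupted = {"once": True}

def handler(payload):
    if payload == 2 and interrupted["once"]:
        interrupted["once"] = False
        raise Shutdown()  # pretend SIGTERM arrived mid-generation
    calls.append(payload)
    return payload * 10

done, results = set(), {}
run_worker(jobs, handler, done, results)  # stops at "b", which goes back on the queue
run_worker(jobs, handler, done, results)  # replacement worker finishes the rest
```

After both runs each payload has been handled exactly once, even though one job was interrupted and another was delivered twice. In a real system the completed-key store has to be durable and shared between workers, which this in-memory sketch does not attempt.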
For multi-step agent jobs, checkpoint progress (a LangGraph checkpointer does this natively) so redelivery resumes mid-graph instead of restarting a 20-step run.
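To make the checkpoint idea concrete without relying on any particular framework's API, here is a minimal file-based sketch. It is not how LangGraph implements checkpointing; `run_steps`, the JSON layout, and the `stop_after` parameter (which simulates a shutdown partway through) are all hypothetical.

```python
import json, os, tempfile

def run_steps(job_id, steps, checkpoint_dir, stop_after=None):
    """Run steps in order, persisting progress after each one.

    Returns (state, finished). stop_after simulates a shutdown after that
    many steps have run in this invocation.
    """
    path = os.path.join(checkpoint_dir, f"{job_id}.json")
    state = {"next": 0, "outputs": []}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # resume from the last completed step
    ran = 0
    while state["next"] < len(steps):
        if stop_after is not None and ran >= stop_after:
            return state, False  # interrupted; the checkpoint is already on disk
        state["outputs"].append(steps[state["next"]](state["outputs"]))
        state["next"] += 1
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)  # atomic rename, so a kill never leaves a torn file
        ran += 1
    return state, True

executed = []

def make_step(i):
    def step(previous_outputs):
        executed.append(i)
        return i * i
    return step

steps = [make_step(i) for i in range(5)]
with tempfile.TemporaryDirectory() as d:
    _, finished_first = run_steps("job-1", steps, d, stop_after=2)
    state, finished_second = run_steps("job-1", steps, d)
```

The second call picks up at step 2 instead of step 0, so no step runs twice. A step that was killed between finishing and writing its checkpoint would still be repeated, which is why the steps themselves should be idempotent too.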
Self-hosted model servers
vLLM-class servers handle in-flight completion on SIGTERM; your job is the orchestration around them: drain via readiness *before* stopping the pod, and during rollouts keep maxUnavailable conservative — GPU pods are slow to start (model load measured in minutes), so aggressive rollout settings create capacity gaps. Preload/warm the replacement before draining the old (serving guide).
Test it or it doesn't work
Graceful shutdown rots silently. Two cheap tests: a CI/staging script that starts load, sends SIGTERM, and asserts zero dropped/duplicated requests; and chaos-style pod kills in staging on a schedule. The first deploy during peak traffic is the wrong time to learn your grace period is 30 seconds.
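A skeleton for the first kind of test might look like the following. The child process here is a stand-in that merely installs a SIGTERM handler and pretends to finish one in-flight job. A real test would launch your actual service, drive load against it, and compare request IDs for drops and duplicates. The names `CHILD` and `sigterm_drains_cleanly` are hypothetical, and the sketch assumes a POSIX system.

```python
import signal, subprocess, sys, textwrap

# Stand-in service: waits for SIGTERM, "finishes" in-flight work, exits 0
CHILD = textwrap.dedent("""
    import signal, sys, time
    stop = False
    def on_term(signum, frame):
        global stop
        stop = True
    signal.signal(signal.SIGTERM, on_term)
    print("ready", flush=True)
    while not stop:
        time.sleep(0.05)
    time.sleep(0.2)  # stand-in for finishing an in-flight generation
    print("drained", flush=True)
    sys.exit(0)
""")

def sigterm_drains_cleanly(timeout_s=10.0):
    """Start the child, send SIGTERM, and report whether it drained and exited 0."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE, text=True)
    try:
        # Wait until the handler is installed before signalling
        assert proc.stdout.readline().strip() == "ready"
        proc.send_signal(signal.SIGTERM)
        out, _ = proc.communicate(timeout=timeout_s)
    finally:
        if proc.poll() is None:
            proc.kill()  # never leave the child behind if the test itself fails
    return proc.returncode == 0 and "drained" in out
```

Setting `timeout_s` to your configured grace period turns "it hung past the deadline" into a test failure rather than a surprise in production.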
FAQ
What about SIGKILL? Unhandleable by definition — your protection is the redelivery+idempotency design, which makes even hard kills lose nothing but time.
Spot/preemptible GPUs? Same machinery with a shorter fuse (cloud preemption warnings range from ~30s to 2 minutes) — checkpoint aggressively and treat preemption notice as SIGTERM.
Serverless platforms? The platform owns drain semantics — read your provider's lifecycle docs; your idempotency layer is still what saves you.
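For the spot/preemptible answer above, "treat preemption notice as SIGTERM" can be sketched as a small poller that funnels the notice into the same drain path. How you detect the notice is provider-specific, so `notice_pending` is a hypothetical callable you would supply. `watch_for_preemption` and `raise_sigterm` are illustrative names as well.

```python
import os, signal, time

def raise_sigterm():
    """Deliver SIGTERM to this process so the normal drain path runs."""
    os.kill(os.getpid(), signal.SIGTERM)

def watch_for_preemption(notice_pending, deliver=raise_sigterm,
                         poll_interval_s=1.0, max_polls=None):
    """Poll notice_pending() and call deliver() once when it returns True.

    Returns True if a notice was seen, False if max_polls ran out first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if notice_pending():
            deliver()
            return True
        polls += 1
        time.sleep(poll_interval_s)
    return False

# Dry run with a fake notice source and a fake deliver, so nothing is signalled
answers = iter([False, False, True])
seen = []
found = watch_for_preemption(lambda: next(answers),
                             deliver=lambda: seen.append("drain"),
                             poll_interval_s=0.0)
```

In a service this would typically run in a background thread with the default `deliver`, so that a preemption notice and an ordinary SIGTERM exercise exactly the same shutdown code.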
*Last updated: June 2026.*