AI Production Incident Response: Debugging ML Systems in Production

Runbooks, root cause analysis, and systematic debugging for AI system failures

AI systems fail in unique ways requiring specialized incident response processes. Common failure modes: 1) Model performance degradation: accuracy drops, precision/recall imbalance, increased error rate on specific query types. 2) Latency spikes: upstream LLM API latency, vector store query performance, increased context lengths. 3) Cost explosions: token usage spike from prompt injection or malformed inputs, runaway retry loops. 4) Safety incidents: model outputs harmful or policy-violating content. 5) Data quality issues: upstream data drift causing feature distribution shifts. Runbook template: [Incident Type: Model Accuracy Drop] -> Check: recent deployments, data pipeline changes, upstream service changes -> Measure: evaluation set performance, slice-level metrics by query type -> Triage: compare to previous model version on evaluation set -> Remediate: rollback if regression confirmed, add new failure cases to evaluation dataset -> Verify: monitor metrics post-fix for 24 hours. On-call tooling: Grafana dashboard per ML service, anomaly alerts on key metrics, automated evaluation job running hourly. Rollback procedure: keep last 3 model versions deployed in staging, 5-minute rollback via feature flag or load balancer redirect. Post-incident: blameless postmortem within 48 hours, 5-whys root cause analysis, action items with owners and deadlines, update runbooks with new learnings.

Also available in 中文.