⏱️ 60 min

Incident Response for ML Systems

Handle production ML failures gracefully with runbooks, rollbacks, and postmortems

ML Incident Taxonomy

ML incidents are different from software incidents because they're often gradual and probabilistic rather than sudden and binary. A broken API returns a 500 error immediately. A degraded model keeps returning predictions — just worse ones. By the time the degradation is obvious, significant business damage may have occurred.
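Because a degraded model fails silently, something has to compare recent prediction quality against a known baseline. The sketch below is one minimal way to do that, assuming ground-truth labels eventually arrive for at least some predictions. The class name `RollingAccuracyMonitor`, the window size, and the `max_drop` threshold are illustrative assumptions, not values from this lesson.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Tracks accuracy over the most recent labeled predictions (hypothetical helper)."""

    def __init__(self, baseline_accuracy, window_size=500, max_drop=0.05):
        # baseline_accuracy: accuracy measured at deployment time (assumed known)
        # max_drop: how far below baseline we tolerate before flagging (assumed value)
        self.baseline_accuracy = baseline_accuracy
        self.max_drop = max_drop
        self.outcomes = deque(maxlen=window_size)  # True/False per labeled prediction

    def record(self, prediction, label):
        """Store whether one prediction matched its ground-truth label."""
        self.outcomes.append(prediction == label)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def is_degraded(self):
        """True only when the window is full and accuracy fell below the tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet; avoid paging on a handful of samples
        return self.accuracy() < self.baseline_accuracy - self.max_drop
```

Requiring a full window before alerting trades detection speed for fewer false alarms; a shorter window reacts faster but is noisier.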

Common ML incidents and response

**Sudden performance drop** (data pipeline failure, feature store outage) → Roll back to previous model version immediately. Investigate pipeline.

**Gradual drift** (model accuracy declining over weeks) → Trigger retraining. If training data is also drifted, collect new labels first.

**Prediction distribution shift** (model always predicting one class) → Check feature preprocessing. Common cause: a numeric feature that used to be normalized is now arriving raw.

**Latency spike** (p99 latency suddenly 10x) → Check for a model version with larger memory footprint, or a batch of unusually long inputs.
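Several of the symptoms above can be checked mechanically during triage. The following is a rough sketch, not a production runbook: the function names, the 95% dominant-class threshold, and the assumption that the feature was standardized to roughly zero mean and unit variance at training time are all hypothetical choices made for illustration. Only the 10x latency multiplier comes from the text above. All inputs are assumed to be non-empty lists.

```python
import math
from collections import Counter


def dominant_class_share(predictions):
    """Fraction of predictions that fall in the single most common class."""
    counts = Counter(predictions)
    return max(counts.values()) / len(predictions)


def looks_unnormalized(values, expected_mean=0.0, expected_std=1.0, tolerance=3.0):
    """Flag a feature batch whose mean is far from its training-time mean.

    Assumes the feature was standardized at training time (hypothetical setup).
    """
    mean = sum(values) / len(values)
    return abs(mean - expected_mean) > tolerance * expected_std


def p99(latencies_ms):
    """99th percentile using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(len(ordered) * 99 / 100)
    return ordered[rank - 1]


def triage(predictions, feature_values, latencies_ms, baseline_p99_ms):
    """Return the names of the symptoms that appear in the current batch."""
    findings = []
    if dominant_class_share(predictions) > 0.95:  # assumed threshold
        findings.append("prediction_collapse")
    if looks_unnormalized(feature_values):
        findings.append("feature_not_normalized")
    if p99(latencies_ms) >= 10 * baseline_p99_ms:  # "suddenly 10x"
        findings.append("latency_spike")
    return findings


# Example: a batch where a raw (unscaled) feature has collapsed predictions to one class.
print(triage(
    predictions=[1] * 99 + [0],
    feature_values=[250.0, 310.0, 198.0],
    latencies_ms=[20] * 100,
    baseline_p99_ms=50,
))
```

Each finding maps back to a response listed above: a collapsed prediction distribution or an unnormalized feature points to preprocessing, while a latency spike points to the model version or input sizes.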

Sharan Initiatives — AI, Finance, Photography & More