Handle production ML failures gracefully with runbooks, rollbacks, and postmortems
ML incidents differ from conventional software incidents because they are often gradual and probabilistic rather than sudden and binary. A broken API returns a 500 error immediately. A degraded model keeps returning predictions, just worse ones. By the time the degradation is obvious, significant business damage may already have occurred.
- **Sudden performance drop** (data pipeline failure, feature store outage): → Roll back to previous model version immediately. Investigate pipeline.
- **Gradual drift** (model accuracy declining over weeks): → Trigger retraining. If training data is also drifted, collect new labels first.
- **Prediction distribution shift** (model always predicting one class): → Check feature preprocessing. Common cause: a numeric feature that used to be normalized is now arriving raw.
- **Latency spike** (p99 latency suddenly 10x): → Check for a model version with larger memory footprint, or a batch of unusually long inputs.
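
The four incident types above can be encoded as a first-pass triage function that an on-call engineer or alerting job might run. The following is a minimal sketch, not a definitive implementation: the `Snapshot` fields, the `triage` function, and every threshold are hypothetical and would need tuning against your own monitoring data. Only the 10x latency factor comes from the text above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Snapshot:
    """Hypothetical monitoring snapshot; all field names are illustrative."""
    accuracy_now: float                # accuracy over the most recent window
    accuracy_last_week: float          # accuracy one week ago
    accuracy_baseline: float           # accuracy at deployment time
    p99_latency_ms: float              # current p99 latency
    p99_baseline_ms: float             # typical p99 latency
    recent_predictions: Sequence[str]  # recent predicted class labels


# Assumed thresholds for illustration; tune them for your own system.
SUDDEN_DROP = 0.10      # accuracy lost within one week
GRADUAL_DROP = 0.05     # accuracy lost since deployment
DOMINANT_CLASS = 0.95   # share of predictions going to a single class
LATENCY_FACTOR = 10.0   # p99 multiple that counts as a spike


def triage(s: Snapshot) -> Optional[str]:
    """Return the first matching playbook entry, or None if nothing fires."""
    # Sudden drops are checked first so they are not mistaken for slow drift.
    if s.accuracy_last_week - s.accuracy_now >= SUDDEN_DROP:
        return "sudden_drop: roll back to previous model version, investigate pipeline"

    # A single class dominating recent predictions suggests a preprocessing bug.
    if s.recent_predictions:
        top_count = Counter(s.recent_predictions).most_common(1)[0][1]
        if top_count / len(s.recent_predictions) >= DOMINANT_CLASS:
            return "prediction_shift: check feature preprocessing"

    if s.p99_latency_ms >= LATENCY_FACTOR * s.p99_baseline_ms:
        return "latency_spike: check model memory footprint and input lengths"

    # Slow decline relative to the deployment baseline.
    if s.accuracy_baseline - s.accuracy_now >= GRADUAL_DROP:
        return "gradual_drift: trigger retraining, collect new labels if training data drifted"

    return None
```

The checks run in order of urgency, so a snapshot that matches several conditions reports the one that calls for the fastest response. In practice the returned string would link to the matching runbook rather than stand in for it.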