⏱️ 70 min

Production Monitoring & Alerting

Build real-time dashboards and alerts to keep your ML system healthy

What to Monitor

ML systems need monitoring at three levels. Most teams only monitor infrastructure (CPU, memory). The more important signals are model-specific.

Infrastructure metrics (your DevOps team already has these)

- Prediction latency (p50, p95, p99)
- Request throughput and error rates
- GPU/CPU utilization and memory
- Queue depth (for async serving)
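To make the latency percentiles concrete, here is a minimal sketch that computes p50, p95, and p99 from raw samples using only the standard library. The sample values and the `latency_percentiles` helper are hypothetical; in production these numbers would normally come from your metrics backend rather than from in-process lists.

```python
import statistics


def latency_percentiles(samples_s):
    """Return p50, p95 and p99 for a list of latency samples in seconds."""
    if len(samples_s) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = statistics.quantiles(samples_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Hypothetical samples: mostly fast requests with a slow tail.
samples = [0.020] * 90 + [0.050] * 8 + [0.400, 1.200]
stats = latency_percentiles(samples)
mean_latency = statistics.mean(samples)

print(stats, mean_latency)
```

With a tail like this, the mean stays low while p99 is an order of magnitude higher, which is why tail percentiles are tracked alongside (or instead of) averages.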

ML-specific metrics (often missing)

- Prediction distribution: Are predictions shifting over time? On data where only 0.5% of transactions are fraudulent, a model that always predicts 'not fraud' scores 99.5% accuracy but has zero utility.
- Feature distribution: Which input features are drifting?
- Business KPI: What is the actual business impact (click-through rate, fraud caught, etc.)?
- Confidence distribution: Is the model becoming less certain over time?
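One common way to quantify a shift in a prediction or feature distribution is the Population Stability Index (PSI). The sketch below is an illustration under stated assumptions, not a method prescribed by this lesson: the `psi` helper, the bin edges, and the sample scores are all hypothetical, and the 0.25 threshold is only a frequently quoted rule of thumb.

```python
import math


def psi(expected, actual, edges):
    """Population Stability Index between two samples over shared bin edges."""

    def fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges at or below v; the last bin is open-ended.
            idx = sum(1 for e in edges if v >= e)
            counts[idx] += 1
        total = len(values)
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / total, 1e-6) for c in counts]

    p = fractions(expected)
    q = fractions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


edges = [0.25, 0.5, 0.75]  # hypothetical bin edges for scores in [0, 1]
baseline = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]   # scores at training time
recent = [0.05, 0.1, 0.1, 0.15, 0.2, 0.2, 0.3, 0.8]   # scores collapsing toward 0

no_drift = psi(baseline, baseline, edges)
drift = psi(baseline, recent, edges)
print(no_drift, drift)
```

Identical samples give a PSI of zero, while the collapsed `recent` scores give a large value. The same function can be applied per input feature to see which ones are drifting.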

Instrumenting a FastAPI Model with Prometheus

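```python
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app
from pydantic import BaseModel

# Metrics
PREDICTIONS = Counter("model_predictions_total", "Total predictions", ["label"])
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency")
CONFIDENCE = Histogram(
    "prediction_confidence",
    "Model confidence",
    buckets=[0.5, 0.7, 0.8, 0.9, 0.95, 1.0],
)

app = FastAPI()
# Expose all registered metrics for Prometheus to scrape.
app.mount("/metrics", make_asgi_app())


class PredictionRequest(BaseModel):
    # Assumed request schema; adjust to match your model's inputs.
    features: list[float]


# `model` is assumed to be loaded elsewhere (for example at startup) and to
# expose predict(features) -> (prediction, confidence).


@app.post("/predict")
def predict(request: PredictionRequest):
    # A plain `def` lets FastAPI run the blocking predict call in a
    # threadpool instead of stalling the event loop.
    # LATENCY.time() records the duration even if predict raises.
    with LATENCY.time():
        prediction, confidence = model.predict(request.features)

    # Record metrics
    PREDICTIONS.labels(label=str(prediction)).inc()
    CONFIDENCE.observe(confidence)

    return {"prediction": prediction, "confidence": confidence}
```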
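The bucket list passed to `CONFIDENCE` determines what you can later ask of the data, because Prometheus histogram buckets are cumulative: each bucket counts observations less than or equal to its upper bound (`le`), and the client adds a final `+Inf` bucket. The following pure-Python sketch mimics that counting so the effect of the chosen bounds is visible; `cumulative_counts` and the sample confidences are hypothetical and do not use the Prometheus client.

```python
import math

# Same upper bounds as the CONFIDENCE histogram, plus the implicit +Inf bucket.
BUCKETS = [0.5, 0.7, 0.8, 0.9, 0.95, 1.0, math.inf]


def cumulative_counts(observations, upper_bounds=BUCKETS):
    """Count observations <= each upper bound, like Prometheus `le` buckets."""
    return {le: sum(1 for x in observations if x <= le) for le in upper_bounds}


# Hypothetical confidence values from seven predictions.
confidences = [0.31, 0.48, 0.62, 0.85, 0.93, 0.97, 0.99]
counts = cumulative_counts(confidences)

# Fraction of predictions with confidence above 0.9.
high_confidence_share = 1 - counts[0.9] / counts[math.inf]
print(counts, high_confidence_share)
```

Note that 0.31 and 0.48 both fall into the `le=0.5` bucket and cannot be told apart afterwards. If low-confidence predictions matter for your alerts, consider adding lower bounds such as 0.1 or 0.3.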


Sharan Initiatives — AI, Finance, Photography & More