Scale ML inference across multiple instances to handle production traffic
A single model server can handle only so many requests per second. When traffic exceeds that capacity, you scale horizontally: run multiple identical model instances behind a load balancer. The key insight for ML inference is that model loading is expensive (seconds to minutes) but serving is fast (milliseconds). This means you want long-lived instances, not ephemeral scale-to-zero containers.
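
The load-once, serve-many pattern described above can be sketched as follows. This is a minimal illustration using only the standard library, not a production server: `ModelHolder`, the `load_seconds` delay, and the toy doubling "model" are all hypothetical stand-ins for a real framework's loading and inference calls.

```python
import threading
import time


class ModelHolder:
    """Loads a (hypothetical) model once and reuses it for every request."""

    def __init__(self, load_seconds: float = 0.05):
        self._load_seconds = load_seconds  # stand-in for the real load time
        self._lock = threading.Lock()
        self._model = None
        self.load_count = 0  # how many times the expensive load actually ran

    def _load(self):
        # Placeholder for reading weights from disk or object storage.
        time.sleep(self._load_seconds)
        self.load_count += 1
        return lambda x: 2 * x  # toy "model": doubles its input

    def get(self):
        # Double-checked locking so concurrent first requests load only once.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._load()
        return self._model

    def predict(self, x):
        # Cheap per-request path: reuse the already-loaded model.
        return self.get()(x)


holder = ModelHolder()
results = [holder.predict(i) for i in range(5)]
print(results, holder.load_count)
```

Five predictions trigger a single load. In a long-lived instance that cost is paid once per process, whereas a scale-to-zero container pays it again on every cold start.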
- **Replicated stateless API**: Multiple FastAPI instances behind nginx or a cloud load balancer. Simple, and sufficient for most cases.
- **Kubernetes Deployment**: Declare the desired replica count and Kubernetes maintains it. The HPA (Horizontal Pod Autoscaler) scales on CPU or memory utilization out of the box; GPU utilization and other signals require custom or external metrics.
- **Async worker pool**: Queue-based (for example, Celery workers on a RabbitMQ broker) for batch-heavy workloads. Workers pull jobs from the queue and process them independently.
- **Serverless inference**: Managed services such as AWS SageMaker or Google Vertex AI. Some configurations scale to zero when idle, which means high cold-start latency (~5-30 s) on the first request.
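
As a rough illustration of the async worker pool pattern from the list above, the sketch below uses an in-process `queue.Queue` and threads as a stand-in for a real broker such as RabbitMQ. The `predict` function and `run_worker_pool` helper are hypothetical names for this example only; a real deployment would run the workers as separate processes or pods.

```python
import queue
import threading


def predict(x: int) -> int:
    # Toy stand-in for a model call.
    return x * x


def run_worker_pool(inputs, num_workers: int = 4):
    """Fan jobs out to workers that pull from a shared queue."""
    jobs = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            item = jobs.get()
            if item is None:  # sentinel: no more work for this worker
                jobs.task_done()
                break
            job_id, payload = item
            output = predict(payload)
            with results_lock:
                results[job_id] = output
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job_id, payload in enumerate(inputs):
        jobs.put((job_id, payload))
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    # Return results in submission order, regardless of completion order.
    return [results[i] for i in range(len(inputs))]


pool_output = run_worker_pool([1, 2, 3, 4, 5], num_workers=3)
print(pool_output)
```

Because workers pull work rather than having it pushed to them, a slow job delays only the worker handling it, and adding workers raises throughput without changing the producer.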
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server  # Must match spec.selector.matchLabels
    spec:
      containers:
        - name: model-server
          image: your-registry/model-server:v1.2
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "2"
              memory: "4Gi"
            limits:
              cpu: "4"
              memory: "8Gi"
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30  # Wait for model to load
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60  # Percent of the CPU request, averaged across pods
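
To see how the autoscaler above arrives at a replica count, here is a simplified sketch of the scaling rule described in the Kubernetes documentation: `desired = ceil(current_replicas * current_metric / target_metric)`, clamped to the configured bounds. The function name and defaults are assumptions chosen to mirror this manifest. The real controller also applies stabilization windows and handles unready pods and missing metrics, all of which this sketch ignores.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float = 60.0,
                     min_replicas: int = 2,
                     max_replicas: int = 20,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA rule: scale in proportion to metric / target."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: keep the current count to avoid flapping.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


scale_up = desired_replicas(4, 90)      # 4 pods at 90% of requested CPU
hold = desired_replicas(4, 62)          # within 10% of the 60% target
scale_down = desired_replicas(4, 15)    # mostly idle, clamped to the minimum
capped = desired_replicas(10, 150)      # overloaded, clamped to the maximum
print(scale_up, hold, scale_down, capped)
```

Utilization is measured against each pod's CPU request (`cpu: "2"` here), not its limit, so a pod using 1.8 cores reports 90% and pushes the Deployment toward more replicas.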