⏱️ 65 min

Horizontal Scaling Strategies

Scale ML inference across multiple instances to handle production traffic

Scaling Patterns for ML Inference

A single model server can handle only so many requests per second. When traffic exceeds that capacity, you scale horizontally: run multiple identical model instances behind a load balancer. The key insight for ML inference is that model loading is expensive (seconds to minutes) but serving is fast (milliseconds). This means you want long-lived instances, not ephemeral scale-to-zero containers.
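
To make the load-once, serve-many idea concrete, here is a minimal sketch in plain Python. The `ModelServer` class, its `load_seconds` parameter, and the doubling "model" are hypothetical stand-ins, not part of any real serving framework; the `health()` method only illustrates the kind of status a readiness endpoint might report.

```python
import time


class ModelServer:
    """Hypothetical long-lived server process: load once, serve many times."""

    def __init__(self, load_seconds: float = 0.05):
        # Stand-in for a real load that may take seconds to minutes.
        self._load_seconds = load_seconds
        self._model = None
        self.load_count = 0

    def load(self) -> None:
        """Expensive step, meant to run once at process startup."""
        time.sleep(self._load_seconds)   # simulates reading weights from disk
        self._model = lambda x: 2 * x    # placeholder "model"
        self.load_count += 1

    @property
    def ready(self) -> bool:
        return self._model is not None

    def health(self) -> int:
        """Status code a readiness endpoint could return."""
        return 200 if self.ready else 503

    def predict(self, x: float) -> float:
        """Cheap step: reuses the already-loaded model."""
        if not self.ready:
            raise RuntimeError("model not loaded yet")
        return self._model(x)


server = ModelServer()
status_before_load = server.health()   # not ready: no traffic should arrive yet
server.load()                           # paid once per instance
predictions = [server.predict(x) for x in range(5)]
```

Every request after startup reuses the same in-memory model, which is why tearing instances down and recreating them per request would be wasteful.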

Common scaling architectures

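- **Replicated stateless API**: Multiple FastAPI instances behind nginx or a cloud load balancer. Simple, and sufficient for most cases.
- **Kubernetes Deployment**: Declare the desired replica count and Kubernetes maintains it. The HPA (Horizontal Pod Autoscaler) scales on CPU or memory utilization out of the box; GPU utilization and other signals such as request rate or queue depth must be exposed as custom metrics.
- **Async worker pool**: Queue-based (for example, Celery workers with a RabbitMQ broker) for batch-heavy workloads. Workers pull jobs from the queue and process them independently.
- **Serverless inference**: Managed services such as AWS SageMaker and Google Vertex AI. Depending on the service and configuration, these can scale to zero when idle, at the cost of high cold-start latency (~5-30s).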
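
The async worker pool pattern can be sketched with the standard library alone. This is an illustration under stated assumptions, not a Celery or RabbitMQ setup: an in-process `queue.Queue` stands in for the broker, threads stand in for worker processes, and `fake_predict`, `worker`, and `run_pool` are hypothetical names.

```python
import queue
import threading


def fake_predict(x: int) -> int:
    """Placeholder for a real model call."""
    return x * x


def worker(jobs: queue.Queue, results: dict, lock: threading.Lock) -> None:
    """Pull jobs until a None sentinel arrives, then exit."""
    while True:
        item = jobs.get()
        if item is None:          # sentinel: no more work for this worker
            break
        job_id, payload = item
        output = fake_predict(payload)
        with lock:                # guard the shared results dict
            results[job_id] = output


def run_pool(payloads: list, num_workers: int = 3) -> list:
    jobs: queue.Queue = queue.Queue()
    results: dict = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for job_id, payload in enumerate(payloads):
        jobs.put((job_id, payload))
    for _ in threads:             # one sentinel per worker
        jobs.put(None)
    for t in threads:
        t.join()
    # Workers finish in arbitrary order; job ids restore the input order.
    return [results[i] for i in range(len(payloads))]


outputs = run_pool([1, 2, 3, 4])
```

Because producers only enqueue work and never wait on a specific worker, throughput can be raised by adding workers without changing the producer side.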

Kubernetes HPA for a Model Deployment

yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: model-server
        image: your-registry/model-server:v1.2
        resources:
          requests:
            cpu: "2"
            memory: "4Gi"
          limits:
            cpu: "4"
            memory: "8Gi"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30  # Wait for model to load
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
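
To see how the HPA turns the 60% target into a replica count, the sketch below reproduces the scaling rule described in the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to `minReplicas` and `maxReplicas`. The function name and signature are hypothetical, and the real controller also handles pod readiness, missing metrics, and stabilization windows, which are left out here. CPU utilization is measured relative to the container's CPU request, so with a request of 2 cores, 60% corresponds to an average of about 1.2 cores per pod.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float = 60.0,
    min_replicas: int = 2,
    max_replicas: int = 20,
    tolerance: float = 0.1,
) -> int:
    """Simplified version of the HPA replica calculation."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Never go below minReplicas or above maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


scale_up = desired_replicas(2, 90)      # load well above the 60% target
scale_down = desired_replicas(4, 30)    # load well below the target
capped = desired_replicas(10, 180)      # would need 30 pods, capped at 20
steady = desired_replicas(5, 63)        # within tolerance, no change
```

With the values from the manifest above, two pods averaging 90% CPU would be scaled to three, and no amount of load pushes the count past 20.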


Sharan Initiatives — AI, Finance, Photography & More