Deploy ML models in production with low latency and high availability
There are four common model serving architectures, each suited to different latency, throughput, and cost requirements:
Synchronous (request-response): The client sends a request and waits for the prediction. Simple to implement, but inference occupies the request-handling thread for the duration of the call. Best for: interactive applications with low traffic.
Batch (asynchronous): Requests are queued and processed in batches. This raises throughput at the cost of higher per-request latency. Best for: report generation, embedding pipelines, and other non-interactive workloads.
Streaming: The model consumes a continuous stream of inputs from a system such as Kafka or Kinesis. Best for: fraud detection, real-time recommendations, IoT sensor processing.
Edge: The model runs on-device (phone, IoT device). There is no network round-trip, data stays on the device, and inference works offline. Usually requires model quantization and an on-device runtime such as TensorFlow Lite or ONNX Runtime.
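To make the throughput-for-latency trade of the queued, batched approach concrete, here is a minimal sketch using only the standard library. The names `drain_batch`, `serve_queue`, and `mean_model` are hypothetical, and `mean_model` is a stand-in for a real model's batched forward pass; a production system would more likely use a message broker and a worker process.

```python
import queue
from typing import Callable


def drain_batch(q: queue.Queue, max_batch_size: int) -> list:
    """Pull up to max_batch_size pending requests off the queue without blocking."""
    batch = []
    while len(batch) < max_batch_size:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            break
    return batch


def serve_queue(q: queue.Queue, predict_batch: Callable[[list], list],
                max_batch_size: int = 32) -> list:
    """Process everything currently queued, making one model call per batch."""
    results = []
    while True:
        batch = drain_batch(q, max_batch_size)
        if not batch:
            return results
        results.extend(predict_batch(batch))


def mean_model(batch: list) -> list:
    # Hypothetical stand-in for a real model: scores each row by its mean.
    return [sum(row) / len(row) for row in batch]
```

With five queued requests and `max_batch_size=2`, the model is called three times (batches of 2, 2, and 1) rather than five. Each request waits for its batch to be assembled, which is where the added latency comes from.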
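from contextlib import asynccontextmanager

import torch
from fastapi import FastAPI
from pydantic import BaseModel

model = None  # populated once at startup by lifespan()


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Load the model at startup and release it at shutdown."""
    global model
    # Assumes model.pt is a full pickled nn.Module from a trusted source.
    # PyTorch 2.6+ defaults to weights_only=True, which rejects such files.
    model = torch.load("model.pt", map_location="cpu", weights_only=False)
    model.eval()  # inference mode for dropout and batch norm
    yield
    del model


app = FastAPI(lifespan=lifespan)


class PredictionRequest(BaseModel):
    features: list[float]


class PredictionResponse(BaseModel):
    prediction: float  # index of the most probable class
    confidence: float  # softmax probability of that class


# A plain def (not async def) lets FastAPI run the blocking inference call
# in its threadpool instead of stalling the event loop.
@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    x = torch.tensor(request.features).unsqueeze(0)  # add a batch dimension
    with torch.no_grad():  # no gradients needed for inference
        logits = model(x)
        probs = torch.softmax(logits, dim=-1)
    return PredictionResponse(
        prediction=float(probs.argmax()),
        confidence=float(probs.max()),
    )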
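A client for the `/predict` endpoint might look like the following sketch. It uses only the standard library. The URL is an assumption (a server running locally on port 8000), and `build_predict_request` and `predict_remote` are hypothetical helper names, not part of FastAPI.

```python
import json
import urllib.request

# Assumed address of a locally running server; adjust to your deployment.
PREDICT_URL = "http://localhost:8000/predict"


def build_predict_request(features: list[float],
                          url: str = PREDICT_URL) -> urllib.request.Request:
    """Build a POST request whose JSON body matches PredictionRequest."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_remote(features: list[float], timeout: float = 5.0) -> dict:
    """Send the request and return the decoded JSON response."""
    req = build_predict_request(features)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `predict_remote([0.1, 0.2, 0.3])` would return a dict with `prediction` and `confidence` keys, provided the feature vector has the length the model expects. Setting an explicit timeout keeps a slow inference call from hanging the client indefinitely.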