⏱️ 75 min

Model Serving Architectures

Deploy ML models in production with low latency and high availability

Serving Patterns

There are four common model serving architectures, each suited to different latency, throughput, and cost requirements:

Synchronous REST API

The client sends a request and waits for the prediction response. Simple to implement, but each request ties up a server worker until inference finishes, so slow models limit concurrency. Best for: interactive applications with low traffic.

Asynchronous batch prediction

Requests are queued and processed in batches. Throughput is higher than with per-request serving, at the cost of higher latency for each individual request. Best for: report generation, embedding pipelines, non-interactive workloads.
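The queue-and-batch idea can be sketched with nothing more than asyncio. This is a minimal illustration rather than a production design: `predict_batch` is a hypothetical stand-in for a real model, and `max_batch` and `max_wait` are assumed tuning parameters that do not come from the lesson.

```python
import asyncio


def predict_batch(batch: list[list[float]]) -> list[float]:
    """Hypothetical stand-in for a model: scores each row by its mean."""
    return [sum(row) / len(row) for row in batch]


async def batch_worker(queue: asyncio.Queue, max_batch: int = 8, max_wait: float = 0.01) -> None:
    """Group queued requests into batches and resolve each request's future."""
    loop = asyncio.get_running_loop()
    while True:
        items = [await queue.get()]  # wait for at least one request
        deadline = loop.time() + max_wait
        while len(items) < max_batch:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                items.append(await asyncio.wait_for(queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        # One model call for the whole batch, then fan the results back out.
        outputs = predict_batch([features for features, _ in items])
        for (_, future), output in zip(items, outputs):
            future.set_result(output)


async def submit(queue: asyncio.Queue, features: list[float]) -> float:
    """Enqueue one request and wait for its result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((features, future))
    return await future


async def demo() -> list[float]:
    queue: asyncio.Queue = asyncio.Queue()
    worker = asyncio.create_task(batch_worker(queue))
    results = await asyncio.gather(
        *(submit(queue, [float(i), float(i + 2)]) for i in range(5))
    )
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return list(results)
```

Calling `asyncio.run(demo())` submits five requests that the worker serves in grouped model calls. Raising `max_wait` tends to produce larger batches but adds latency to every request, which is the trade-off described above.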

Streaming inference

The model consumes a continuous stream of events from a platform such as Kafka or Kinesis and scores each one as it arrives. Best for: fraud detection, real-time recommendations, IoT sensor processing.
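As a rough sketch of the streaming pattern, the generator below scores events one at a time while keeping a small rolling window of state. It makes several assumptions: a plain Python iterable stands in for a Kafka or Kinesis consumer, and a simple threshold rule (with hypothetical parameters `window_size` and `threshold`) stands in for a trained fraud model.

```python
from collections import deque
from typing import Iterable, Iterator


def stream_inference(
    events: Iterable[float],
    window_size: int = 3,
    threshold: float = 2.0,
) -> Iterator[tuple[float, bool]]:
    """Yield (amount, flagged) for each event as it arrives.

    An event is flagged when it exceeds `threshold` times the mean of the
    previous `window_size` events. This rule is a placeholder for a model.
    """
    window: deque[float] = deque(maxlen=window_size)
    for amount in events:
        if len(window) == window_size:
            baseline = sum(window) / window_size
            flagged = amount > threshold * baseline
        else:
            flagged = False  # not enough history yet
        yield amount, flagged
        window.append(amount)  # update state after scoring


# Example: a list stands in for the live stream.
flags = [flagged for _, flagged in stream_inference([10.0, 12.0, 11.0, 50.0, 12.0])]
```

Because the function is a generator, it never needs the whole stream in memory, so the same loop shape would apply to an unbounded consumer.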

Edge deployment

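The model runs on the device itself (phone, IoT device). There is no network round-trip, data stays on the device, and inference works offline. Usually requires shrinking the model, for example through quantization, and an on-device runtime such as TensorFlow Lite or ONNX Runtime.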
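To make the quantization step concrete, here is a small sketch of symmetric int8 quantization in plain Python. It only illustrates the arithmetic; real toolchains such as TensorFlow Lite or ONNX Runtime use their own converters and calibration steps, and the function names below are hypothetical.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
```

Each weight is stored in one byte instead of four, and the round-trip error per weight is at most half of `scale`. That storage saving is why quantization is a common step before shipping a model to a device.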

Building a FastAPI Model Endpoint

python

from fastapi import FastAPI
from pydantic import BaseModel
import torch
from contextlib import asynccontextmanager

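# Loaded once at startup and shared by all requests.
model = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global model
    # Assumes model.pt holds a full pickled nn.Module from a trusted source.
    # PyTorch 2.6+ defaults to weights_only=True, which rejects such files.
    model = torch.load("model.pt", map_location="cpu", weights_only=False)
    model.eval()  # put dropout and batch norm layers in inference mode
    yield
    model = None  # release the reference on shutdown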

app = FastAPI(lifespan=lifespan)

class PredictionRequest(BaseModel):
    features: list[float]

class PredictionResponse(BaseModel):
    prediction: float
    confidence: float

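@app.post("/predict", response_model=PredictionResponse)
def predict(request: PredictionRequest):
    # Plain `def`: FastAPI runs it in a thread pool, so CPU-bound inference
    # does not block the event loop as it would inside `async def`.
    x = torch.tensor(request.features).unsqueeze(0)  # shape (1, n_features)
    with torch.no_grad():
        logits = model(x)
        probs = torch.softmax(logits, dim=-1)
    return PredictionResponse(
        prediction=float(probs.argmax()),  # index of the most likely class
        confidence=float(probs.max()),  # probability of that class
    )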
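
A client for the endpoint above might look like the following sketch, which uses only the standard library. The base URL `http://localhost:8000` is an assumption (a local development server), and the helper names `build_predict_request` and `call_predict` are hypothetical.

```python
import json
import urllib.request


def build_predict_request(
    features: list[float],
    base_url: str = "http://localhost:8000",  # assumed local dev server
) -> urllib.request.Request:
    """Build a POST request whose JSON body matches PredictionRequest."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_predict(features: list[float], base_url: str = "http://localhost:8000") -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(features, base_url)
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read())
```

With the server running, `call_predict([0.1, 0.2, 0.3])` should return a dictionary with `prediction` and `confidence` keys, provided the feature list has the length the model expects.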


Sharan Initiatives — AI, Finance, Photography & More