⏱️ 70 min

Inference Optimization & Caching

Cut prediction latency and cost with quantization, distillation, and request caching

The Inference Optimization Toolkit

Training speed matters less than serving speed: you train occasionally, but every request pays the serving latency. These techniques reduce latency with little loss of accuracy:

Quantization

Convert model weights from float32 to float16 or int8. This shrinks the model by about 2x (float16) or 4x (int8) and often speeds up inference by 2-3x on hardware with native low-precision support. Post-training quantization requires no retraining. Quantization-aware training usually preserves more accuracy, at the cost of a retraining pass.
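
To make the float32-to-int8 conversion concrete, here is a minimal sketch of affine (asymmetric) per-tensor quantization in plain Python. The function names and the sample weights are hypothetical, and real frameworks add per-channel scales, calibration, and fused kernels on top of this idea.

```python
def quantize_int8(values):
    """Map floats to int8 codes using one scale and zero point for the whole tensor."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep 0.0 exactly representable
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 for an all-zero tensor
    zero_point = round(-128 - lo / scale) # int8 code that represents 0.0
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate floats from int8 codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.2, -0.4, 0.0, 0.33, 0.9, 2.1]   # hypothetical sample weights
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print("int8 codes:", q)
print("scale:", scale, "zero point:", zp)
print("max round-trip error:", max_error)
```

Each weight now takes 1 byte instead of 4, which is where the roughly 4x size reduction comes from. The price is a rounding error of about half a quantization step per weight.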

Knowledge distillation

Train a smaller 'student' model to mimic a larger 'teacher' model. The student learns from the teacher's output probabilities (soft labels), not just the ground truth. A well-tuned student can often reach 90-95% of the teacher's accuracy with 10-20% of the parameters.
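
The soft-label idea can be sketched as a loss function. This is a minimal pure-Python version of the commonly used formulation (temperature-softened KL divergence blended with cross-entropy on the true label). The `temperature` and `alpha` values and the example logits are assumptions chosen for illustration, not recommended settings.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.7):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-label term: how far the student's softened output is from the teacher's
    soft = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # Hard-label term: ordinary cross-entropy against the ground truth
    hard = -math.log(softmax(student_logits)[true_label])
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


teacher = [4.0, 1.0, 0.5]          # hypothetical teacher logits
close_student = [3.5, 1.2, 0.4]    # student that roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]      # student that disagrees

print("loss (close student):", distillation_loss(close_student, teacher, true_label=0))
print("loss (far student):  ", distillation_loss(far_student, teacher, true_label=0))
```

Scaling the soft term by the squared temperature keeps its gradient magnitude comparable to the hard-label term when the temperature changes.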

Request caching

Identical or near-identical inputs should return cached predictions without hitting the model. Near-identical inputs must first be normalized so that they map to the same cache key. Use Redis with a TTL matched to how quickly your model or its input data changes. Cache hit rates of 30-60% are common for recommendation and search systems.
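
The caching flow can be sketched as follows. To keep the example self-contained, a small in-memory `TTLCache` class stands in for Redis. In production its `get` and `set` would map to Redis GET and SET with an expiry. All names here (`TTLCache`, `cache_key`, `cached_predict`, `fake_model`) and the normalization rules are hypothetical choices for illustration.

```python
import hashlib
import json
import time


class TTLCache:
    """In-memory stand-in for a key-value store with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:   # entry is stale: drop it and report a miss
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)


def normalize(value):
    """Collapse near-identical inputs: trim/lowercase strings, round floats."""
    if isinstance(value, str):
        return value.strip().lower()
    if isinstance(value, float):
        return round(value, 3)
    return value


def cache_key(features, model_version):
    """Stable key: model version plus a hash of the normalized, sorted features."""
    normalized = {name: normalize(v) for name, v in features.items()}
    payload = json.dumps(normalized, sort_keys=True)
    return f"pred:{model_version}:{hashlib.sha256(payload.encode()).hexdigest()}"


def cached_predict(features, predict_fn, cache, model_version="v1"):
    """Return (prediction, was_cache_hit); call the model only on a miss."""
    key = cache_key(features, model_version)
    hit = cache.get(key)
    if hit is not None:
        return hit, True
    prediction = predict_fn(features)
    cache.set(key, prediction)
    return prediction, False


def fake_model(features):
    """Placeholder for a real model call."""
    return 0.5


cache = TTLCache(ttl_seconds=300)
print(cached_predict({"query": " Shoes ", "price": 19.9991}, fake_model, cache))
print(cached_predict({"price": 19.9993, "query": "shoes"}, fake_model, cache))
```

Putting the model version in the key means a new deployment never serves predictions cached from the old model, so the TTL only has to cover drift in the underlying data.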

Quantizing a PyTorch Model

python

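import io

import torch

# Assumes model.pt holds a full pickled nn.Module from a trusted source.
# PyTorch 2.6+ defaults to weights_only=True, which rejects full-module pickles.
model = torch.load("model.pt", map_location="cpu", weights_only=False)
model.eval()

# Dynamic quantization: no calibration data needed, runs on CPU
quantized_model = torch.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear, torch.nn.LSTM},  # Layer types to quantize
    dtype=torch.qint8
)


def serialized_size(m):
    """Bytes needed to save the state dict.

    Quantized layers keep their packed int8 weights outside parameters(),
    so summing over parameters() would miss them.
    """
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes


# Compare sizes
original_size = serialized_size(model)
quantized_size = serialized_size(quantized_model)
print(f"Original: {original_size / 1e6:.1f} MB")
print(f"Quantized: {quantized_size / 1e6:.1f} MB")
print(f"Compression: {original_size / quantized_size:.1f}x")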

Example output (exact figures depend on the model):

Original: 418.3 MB
Quantized: 104.6 MB
Compression: 4.0x


Sharan Initiatives — AI, Finance, Photography & More