Module 5
Design ML infrastructure that handles real production load
Scale inference across multiple instances with load balancers
Speed up predictions with quantization, distillation, and caching
Reduce cloud costs without sacrificing latency or accuracy
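
As a rough illustration of the caching objective above, here is a minimal sketch that memoizes predictions for repeated inputs. The names `slow_model`, `cached_predict`, and `CALLS` are hypothetical and exist only for this example; `slow_model` stands in for an expensive model call, and a real service would likely use a shared cache rather than an in-process one.

```python
from functools import lru_cache

# Counts how often the stand-in model actually runs (illustration only).
CALLS = {"model": 0}


def slow_model(features: tuple) -> float:
    """Hypothetical stand-in for an expensive model inference call."""
    CALLS["model"] += 1
    return sum(features) / len(features)


@lru_cache(maxsize=1024)
def cached_predict(features: tuple) -> float:
    # lru_cache requires hashable arguments, so features are passed as a tuple.
    return slow_model(features)


if __name__ == "__main__":
    cached_predict((1.0, 2.0, 3.0))
    cached_predict((1.0, 2.0, 3.0))  # identical input: served from the cache
    print("model calls:", CALLS["model"])
    print("cache hits:", cached_predict.cache_info().hits)
```

Caching of this kind only helps when identical inputs recur and the model's output for a given input does not change between calls.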