Practical techniques to make training faster, more stable, and more reproducible
These techniques are applied by default in most serious ML projects. They're not exotic research ideas — they're engineering discipline for reliable training.
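Run the forward and backward passes in float16 (half precision) where it is numerically safe, while keeping the weights and their updates in float32. On GPUs with dedicated half-precision hardware this can roughly double throughput and roughly halve activation memory, usually with little or no accuracy loss. In PyTorch it takes two additions: `scaler = GradScaler()` to scale the loss, and `with autocast():` around the forward pass.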
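To see why the loss is scaled at all, here is a minimal sketch that uses only the standard library to round values to float16. The gradient value `1e-8` and the scale `65536.0` (2**16) are illustrative assumptions, not numbers taken from any particular model, and the helper `to_float16` is hypothetical.

```python
import struct

def to_float16(x: float) -> float:
    """Round a Python float to the nearest float16 value and convert it back."""
    return struct.unpack("e", struct.pack("e", x))[0]

tiny_grad = 1e-8       # hypothetical gradient, below float16's smallest subnormal
loss_scale = 65536.0   # assumed scale factor (2**16)

# Stored directly in float16, the gradient underflows to zero and is lost.
unscaled = to_float16(tiny_grad)

# Scaled first, the same gradient lands in float16's representable range.
scaled = to_float16(tiny_grad * loss_scale)

# Dividing by the scale in higher precision recovers roughly the original value.
recovered = scaled / loss_scale

print(unscaled, scaled, recovered)
```

This is the same idea the scaler applies to the loss before the backward pass: multiply before the float16 computation, then divide back out in float32 before the optimizer uses the gradients.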
Cap the global norm of the gradients at a threshold (typically 1.0). This prevents exploding gradients, a common failure mode in RNNs and transformers. Without it, a single bad batch can produce an update large enough to derail weeks of training. Call `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)` after the backward pass and before `optimizer.step()`.
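The arithmetic behind norm clipping can be sketched in plain Python. This is an illustration of the idea, not PyTorch's implementation: the function name `clip_by_global_norm` and the nested-list representation of gradients are assumptions made for the example.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale every gradient by one shared factor so the combined L2 norm is at most max_norm.

    grads is a list of lists of floats, one inner list per (hypothetical) parameter tensor.
    """
    # The norm is taken over all parameters together, not per tensor.
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Never scale up: gradients already under the threshold are left unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]  # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))
print(norm_before, norm_after)
```

Because every gradient is multiplied by the same factor, the update keeps its direction and only its length shrinks.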
The learning rate should change during training. Common schedules:

- Cosine annealing: smoothly decays LR following a cosine curve; widely used for fine-tuning
- Warmup + decay: ramp LR from 0 to peak over first 5% of steps, then decay; standard for training from scratch
- ReduceLROnPlateau: reduce LR when validation metric stops improving; useful for unknown training dynamics
import torch
from torch.cuda.amp import GradScaler, autocast  # recent PyTorch releases prefer torch.amp
from torch.optim.lr_scheduler import CosineAnnealingLR

# Assumes model, optimizer, criterion, train_loader, and num_epochs are already defined
scaler = GradScaler()
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)

for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()

        # Mixed precision forward pass
        with autocast():
            outputs = model(batch["input"])
            loss = criterion(outputs, batch["target"])

        # Scaled backward pass
        scaler.scale(loss).backward()

        # Unscale before clipping so max_norm applies to the true gradients
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

        # Update weights (the scaler skips the step if gradients contain inf or NaN)
        scaler.step(optimizer)
        scaler.update()

    # Step the schedule once per epoch, matching T_max=num_epochs
    scheduler.step()
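
The training loop uses cosine annealing. For the warmup + decay schedule, a minimal sketch of the learning-rate multiplier might look like the following. The function name `warmup_cosine_factor`, the choice of cosine for the decay phase, and the values of `peak_lr` and `total_steps` are all assumptions made for illustration.

```python
import math

def warmup_cosine_factor(step: int, total_steps: int, warmup_frac: float = 0.05) -> float:
    """Return a multiplier in [0, 1] to apply to the peak learning rate at a given step."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak LR over the warmup steps
        return step / warmup_steps
    # Cosine decay from the peak down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

peak_lr = 3e-4      # hypothetical peak learning rate
total_steps = 1000  # hypothetical number of optimizer steps
lrs = [peak_lr * warmup_cosine_factor(s, total_steps) for s in range(total_steps + 1)]
print(lrs[0], lrs[50], lrs[-1])
```

One way to plug this into PyTorch is `torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda step: warmup_cosine_factor(step, total_steps))`, with the optimizer created at the peak learning rate and `scheduler.step()` called once per batch rather than once per epoch.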