Train large models efficiently across multiple GPUs using data and model parallelism
A single GPU has a fixed amount of memory (typically 24-80 GB). When your model doesn't fit in that memory, or when training on a single GPU would take weeks, distributed training is the answer. There are two fundamental approaches: data parallelism (split the data across GPUs, each holding a full copy of the model) and model parallelism (split the model itself across GPUs). In practice, most production training uses data parallelism with gradient synchronization, and the largest models combine it with model parallelism.
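To make the memory limit concrete, here is a rough back-of-the-envelope estimate. It assumes plain fp32 training with an Adam-style optimizer (4 bytes per parameter each for weights and gradients, plus 8 bytes for the two optimizer moment buffers) and ignores activations, so real usage will be higher. The function name and the example parameter counts are illustrative, not taken from any particular model.

```python
def training_memory_gb(num_params, bytes_weights=4, bytes_grads=4, bytes_optim=8):
    """Rough lower bound on training memory in GB (1 GB = 1e9 bytes).

    Counts weights, gradients, and optimizer state only; activations
    and framework overhead are deliberately left out.
    """
    bytes_per_param = bytes_weights + bytes_grads + bytes_optim
    return num_params * bytes_per_param / 1e9


# Hypothetical model sizes, compared against an 80 GB GPU
for name, n in [("1B", 1_000_000_000), ("7B", 7_000_000_000)]:
    need = training_memory_gb(n)
    verdict = "fits" if need <= 80 else "does not fit"
    print(f"{name} params: ~{need:.0f} GB, {verdict} on an 80 GB GPU")
```

Under these assumptions, a model in the low billions of parameters already exceeds a single 80 GB card before any activations are counted. Data parallelism alone does not help in that case, because every GPU still holds a full copy of the model.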
**Data Parallelism** (most common):

- Each GPU holds a complete copy of the model
- Each GPU processes a different mini-batch
- Gradients are averaged (all-reduce) across all GPUs after each step
- Works well when the model fits on a single GPU

**Model Parallelism** (for very large models):

- Different layers or parameter groups live on different GPUs
- Tensors are passed between GPUs during the forward and backward passes
- Required for models like GPT-3 that don't fit on one GPU
- Pipeline parallelism is a common variant
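The gradient-averaging step in data parallelism can be sketched in plain Python. The example below uses a hypothetical one-parameter model (prediction `w * x` with mean squared error) and simulates two "GPUs" as list slices. It only illustrates why averaging per-shard gradients matches the full-batch gradient when the shards are equal in size; it is not how DDP or NCCL implement all-reduce internally.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for an averaging all-reduce: every worker receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
world_size = 2  # number of simulated GPUs

# Each simulated worker computes a gradient on its own equal-sized shard
shard = len(xs) // world_size
local_grads = [
    mse_grad(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
    for r in range(world_size)
]

synced = all_reduce_mean(local_grads)  # what every worker holds after the sync
full = mse_grad(w, xs, ys)             # single-GPU gradient on the whole batch

print("per-worker gradients:", local_grads)
print("after all-reduce:    ", synced[0])
print("full-batch gradient: ", full)
```

Because every worker ends up with the same averaged gradient, each one applies the same optimizer update and the model copies stay identical from step to step.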
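```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def train(rank, world_size):
    # Initialize the process group (NCCL is the backend for CUDA GPUs)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # LOCAL_RANK is the GPU index on this machine; it equals rank on a single node
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Wrap the model in DDP (YourModel is a placeholder for your own nn.Module)
    model = YourModel().to(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # DistributedSampler ensures each GPU sees different data
    # (dataset is a placeholder for your own Dataset)
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)  # batch size is per GPU

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for epoch in range(10):
        sampler.set_epoch(epoch)  # Important: shuffle differently each epoch
        for batch in loader:
            batch = batch.to(local_rank)
            loss = model(batch)  # assumes the model's forward pass returns the loss
            optimizer.zero_grad()
            loss.backward()  # DDP automatically averages gradients
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for every process it starts
    train(int(os.environ["RANK"]), int(os.environ["WORLD_SIZE"]))

# Launch with torchrun:
# torchrun --nproc_per_node=4 train.py
```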
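To see why `DistributedSampler` and `set_epoch` matter, here is a simplified pure-Python sketch of roughly how indices are divided among ranks with the sampler's default settings: shuffle with an epoch-dependent seed, pad so the length divides evenly, then take a strided slice per rank. `shard_indices` is a hypothetical helper written for illustration. It uses Python's `random` module rather than PyTorch's generator, so the actual permutations PyTorch produces will differ.

```python
import math
import random


def shard_indices(dataset_len, world_size, rank, epoch, seed=0):
    """Simplified sketch of per-rank index selection (hypothetical helper)."""
    indices = list(range(dataset_len))
    # Seed with the epoch so every rank builds the same permutation,
    # and the permutation changes from one epoch to the next
    random.Random(seed + epoch).shuffle(indices)

    # Pad by repeating leading indices so every rank gets the same count
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices += indices[: total - dataset_len]

    # Strided slice: rank r takes positions r, r + world_size, ...
    return indices[rank:total:world_size]


world_size = 4
shards = [shard_indices(10, world_size, r, epoch=0) for r in range(world_size)]
for r, s in enumerate(shards):
    print(f"rank {r}: {s}")

# With batch_size=32 per GPU, one optimizer step covers this many samples
print("effective batch size:", 32 * world_size)
```

Two practical points follow from this. If `set_epoch` is never called, the seed never changes and every epoch reuses the same ordering. And because `batch_size` is per process, the effective batch size grows with the number of GPUs, which may call for revisiting the learning rate.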