Understand how attackers fool AI systems and how to defend against them
Adversarial attacks exploit the fact that ML models learn a simplified mapping from inputs to outputs. That mapping can be broken by carefully crafted inputs whose changes are imperceptible to humans but cause dramatic model failures. The most famous demonstration: add a small amount of carefully computed noise to an image of a panda, and a state-of-the-art image classifier becomes 99.3% confident it's a gibbon, even though the image looks unchanged to a human.
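To make "carefully computed noise" concrete, here is a minimal sketch of the idea on a toy logistic-regression model, using only the standard library. The weights, the input, and the helper names (`predict`, `fgsm_step`) are hypothetical and chosen for illustration. Each feature moves by only 0.03, but the small shifts all push the score in the same direction and add up across 100 features.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def sign(value):
    return (value > 0) - (value < 0)

def fgsm_step(weights, bias, x, y, epsilon):
    """Shift every feature by epsilon in the direction that raises the loss."""
    p = predict(weights, bias, x)
    # For binary cross-entropy, d(loss)/d(x_i) = (p - y) * w_i.
    grad = [(p - y) * w for w in weights]
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical 100-feature model and an input whose true label is 1.
weights = [0.5 if i % 2 == 0 else -0.5 for i in range(100)]
bias = 0.0
x = [0.52 if i % 2 == 0 else 0.48 for i in range(100)]

x_adv = fgsm_step(weights, bias, x, y=1, epsilon=0.03)

p_clean = predict(weights, bias, x)    # roughly 0.73: classified as class 1
p_adv = predict(weights, bias, x_adv)  # roughly 0.38: now classified as class 0
print(f"clean: {p_clean:.2f}, adversarial: {p_adv:.2f}")
```

The perturbation is bounded per feature, yet the change in the model's score grows with the number of features. This is one common explanation for why high-dimensional inputs such as images are easy to attack.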
- **Evasion attacks**: Modify inputs at test time to cause misclassification. Most studied. Relevant for: spam filters, fraud detection, autonomous vehicles, face recognition.
- **Poisoning attacks**: Corrupt training data to cause the model to learn wrong behavior. Relevant for: any system that retrains on user-provided data.
- **Model extraction**: Query a model to reconstruct its parameters. Used to steal proprietary models.
- **Model inversion**: Reconstruct training data from model outputs. Privacy attack.
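Evasion is the attack type explored in most detail in this section, so poisoning deserves a small illustration of its own. The sketch below assumes a deliberately simple one-dimensional nearest-centroid classifier. The data points and the names `fit_threshold` and `classify` are hypothetical. It shows how a few mislabeled points submitted by an attacker can shift the decision boundary when a system retrains on user-provided data.

```python
from statistics import mean

def fit_threshold(samples):
    """Fit a 1-D nearest-centroid classifier and return its decision threshold.

    Assumes class 1 lies to the right of class 0 on the number line.
    """
    mean0 = mean(x for x, label in samples if label == 0)
    mean1 = mean(x for x, label in samples if label == 1)
    return (mean0 + mean1) / 2

def classify(threshold, x):
    return 1 if x > threshold else 0

# Clean training data: class 0 clusters near 1, class 1 clusters near 9.
clean = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]

# Attacker-submitted points: located in class-0 territory but labeled class 1.
poison = [(0.0, 1), (0.0, 1), (0.0, 1)]

clean_threshold = fit_threshold(clean)              # 5.0
poisoned_threshold = fit_threshold(clean + poison)  # 2.75

# An input at 4.0 is class 0 for the clean model, class 1 for the poisoned one.
print(classify(clean_threshold, 4.0), classify(poisoned_threshold, 4.0))
```

Real poisoning attacks target far more complex models, but the mechanism is the same: the attacker never touches the model directly, only the data it learns from.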
import torch
import torch.nn.functional as F


def fgsm_attack(image, epsilon, data_grad):
    """Fast Gradient Sign Method: creates adversarial examples."""
    # Step each pixel by epsilon in the direction that increases the loss
    sign_data_grad = data_grad.sign()
    perturbed_image = image + epsilon * sign_data_grad
    return torch.clamp(perturbed_image, 0, 1)  # Keep pixel values valid


def adversarial_train_step(model, optimizer, x, y, epsilon=0.1):
    """Training step that includes adversarial examples."""
    # Work on a detached copy so the caller's tensor is not modified
    x_req = x.clone().detach().requires_grad_(True)

    # Standard forward pass to get the loss gradient w.r.t. the input
    output = model(x_req)
    loss = F.cross_entropy(output, y)
    model.zero_grad()
    loss.backward()

    # Generate adversarial examples
    x_adv = fgsm_attack(x_req.detach(), epsilon, x_req.grad)

    # Train on mix of clean and adversarial examples
    optimizer.zero_grad()
    output_clean = model(x)
    output_adv = model(x_adv)
    loss = 0.5 * F.cross_entropy(output_clean, y) + 0.5 * F.cross_entropy(output_adv, y)
    loss.backward()
    optimizer.step()
    return loss.item()
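One way to check whether adversarial training is helping is to compare accuracy on clean inputs with accuracy on FGSM-perturbed versions of the same inputs. The following is a minimal sketch under stated assumptions: `fgsm_accuracy` is a hypothetical helper, and the linear model and random batch are placeholders so the block runs on its own. In practice you would pass your trained model and a held-out batch with pixel values in [0, 1].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_accuracy(model, x, y, epsilon):
    """Return (clean_accuracy, adversarial_accuracy) for one batch."""
    was_training = model.training
    model.eval()

    # Gradient of the loss with respect to the input, not the weights
    x_req = x.clone().detach().requires_grad_(True)
    logits = model(x_req)
    loss = F.cross_entropy(logits, y)
    (grad,) = torch.autograd.grad(loss, x_req)

    # Single FGSM step, clamped to the valid pixel range
    x_adv = torch.clamp(x_req.detach() + epsilon * grad.sign(), 0, 1)

    with torch.no_grad():
        clean_acc = (logits.argmax(dim=1) == y).float().mean().item()
        adv_acc = (model(x_adv).argmax(dim=1) == y).float().mean().item()

    model.train(was_training)  # Restore the caller's train/eval mode
    return clean_acc, adv_acc

# Placeholder model and data, for illustration only
torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
x = torch.rand(16, 1, 28, 28)
y = torch.randint(0, 10, (16,))

clean_acc, adv_acc = fgsm_accuracy(toy_model, x, y, epsilon=0.1)
print(f"clean accuracy: {clean_acc:.2f}, FGSM accuracy: {adv_acc:.2f}")
```

A large gap between the two numbers suggests the model is still fragile at that epsilon. Note that FGSM is a single-step attack, so good FGSM accuracy alone does not guarantee robustness against stronger multi-step attacks.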