⏱️ 70 min

Adversarial Attacks & Defenses

Understand how attackers fool AI systems and how to defend against them

The Adversarial Threat Model

Adversarial attacks exploit the fact that ML models learn a simplified mapping from inputs to outputs. That mapping can be broken by carefully crafted inputs whose changes are imperceptible to humans but cause dramatic model failures. The most famous demonstration: add a small amount of carefully computed noise to an image of a panda, and an image classifier that was state of the art at the time becomes 99.3% confident it's a gibbon, even though the two images look identical to a human.
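
One way to build intuition for why such tiny changes can matter is a toy linear model: when an input has many features, nudging every feature by a small amount in the worst-case direction adds up to a large change in the score. The sketch below is illustrative only. The weights, the input, and the helper names (`score`, `sign`, `perturb_against_weights`) are made up for this example, and real image classifiers are nonlinear.

```python
# Toy linear "classifier": score = w . x + b, predict class 1 if score > 0.
# All numbers below are invented for illustration.

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sign(v):
    return (v > 0) - (v < 0)

def perturb_against_weights(w, x, epsilon):
    """For a linear score the gradient with respect to x is w itself, so
    moving each feature by epsilon against sign(w) lowers the score by
    epsilon * sum(|w_i|)."""
    return [xi - epsilon * sign(wi) for wi, xi in zip(w, x)]

dim = 1000
w = [0.01 if i % 2 == 0 else -0.01 for i in range(dim)]  # many small weights
b = 0.0
x = [0.5 + 0.01 * sign(wi) for wi in w]  # clean input, features in [0, 1]

epsilon = 0.02  # maximum change allowed per feature
x_adv = perturb_against_weights(w, x, epsilon)

clean_score = score(w, b, x)
adv_score = score(w, b, x_adv)
max_change = max(abs(a - c) for a, c in zip(x_adv, x))

print("clean prediction:", int(clean_score > 0))
print("adversarial prediction:", int(adv_score > 0))
print("largest per-feature change:", round(max_change, 4))
```

No single feature moves by more than `epsilon`, yet the score shifts by `epsilon` times the sum of the absolute weights, which is enough here to flip the predicted class.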

Types of attacks

- **Evasion attacks**: Modify inputs at test time to cause misclassification. Most studied. Relevant for: spam filters, fraud detection, autonomous vehicles, face recognition.
- **Poisoning attacks**: Corrupt training data to cause the model to learn wrong behavior. Relevant for: any system that retrains on user-provided data.
- **Model extraction**: Query a model to reconstruct its parameters. Used to steal proprietary models.
- **Model inversion**: Reconstruct training data from model outputs. Privacy attack.
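
To make model extraction concrete, here is a minimal sketch under strong simplifying assumptions: the victim is a purely linear model that returns its raw score, so `dim + 1` queries recover it exactly. The names `SECRET_W`, `SECRET_B`, `predict`, and `extract_linear_model` are hypothetical. Real targets are nonlinear and often return only labels or probabilities, so practical extraction needs far more queries and yields an approximation rather than the exact parameters.

```python
# Hypothetical black-box victim: the attacker may call predict() but
# cannot read SECRET_W or SECRET_B. Values are invented for illustration.
SECRET_W = [0.5, -1.25, 2.0]
SECRET_B = 0.75

def predict(x):
    """Query interface that returns a raw linear score."""
    return sum(w * xi for w, xi in zip(SECRET_W, x)) + SECRET_B

def extract_linear_model(query, dim):
    """Recover the weights and bias of a linear model with dim + 1 queries."""
    zero = [0.0] * dim
    bias = query(zero)                      # f(0) = b
    weights = []
    for i in range(dim):
        unit = zero.copy()
        unit[i] = 1.0
        weights.append(query(unit) - bias)  # f(e_i) - f(0) = w_i
    return weights, bias

stolen_w, stolen_b = extract_linear_model(predict, dim=3)
print("recovered weights:", stolen_w)
print("recovered bias:", stolen_b)
```

The attacker never sees the parameters directly. Everything is inferred from query responses, which is why rate limiting and returning less detailed outputs are common mitigations.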

Adversarial Training as Defense

```python
import torch
import torch.nn.functional as F

def fgsm_attack(image, epsilon, data_grad):
    """Fast Gradient Sign Method: build adversarial examples from an input gradient."""
    # Step each pixel by epsilon in the direction that increases the loss
    sign_data_grad = data_grad.sign()
    perturbed_image = image + epsilon * sign_data_grad
    # Keep pixel values valid (assumes inputs are scaled to [0, 1])
    return torch.clamp(perturbed_image, 0, 1).detach()

def adversarial_train_step(model, optimizer, x, y, epsilon=0.1):
    """Training step on an equal mix of clean and FGSM adversarial examples."""
    # Work on copies so the caller's tensor is not modified and no stale
    # input gradient accumulates across calls
    x = x.detach()
    x_in = x.clone().requires_grad_(True)

    # Forward pass used only to get the loss gradient w.r.t. the input
    attack_loss = F.cross_entropy(model(x_in), y)
    data_grad = torch.autograd.grad(attack_loss, x_in)[0]

    # Generate adversarial examples
    x_adv = fgsm_attack(x, epsilon, data_grad)

    # Train on a mix of clean and adversarial examples
    optimizer.zero_grad()
    output_clean = model(x)
    output_adv = model(x_adv)
    loss = 0.5 * F.cross_entropy(output_clean, y) + 0.5 * F.cross_entropy(output_adv, y)
    loss.backward()
    optimizer.step()
    return loss.item()
```
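
In equation form, the training step above can be summarized as follows. The notation is an assumption introduced here for illustration: $f_\theta$ is the model with parameters $\theta$, $\mathcal{L}$ is the cross-entropy loss, and $\epsilon$ is the perturbation budget (`epsilon` in the code).

```latex
% FGSM perturbation, clipped to the valid pixel range [0, 1]
x_{\text{adv}} = \operatorname{clip}_{[0,1]}\!\Big( x + \epsilon \cdot \operatorname{sign}\big( \nabla_x \mathcal{L}(f_\theta(x), y) \big) \Big)

% Training loss: equal-weight mix of clean and adversarial examples
\mathcal{L}_{\text{train}} = \tfrac{1}{2}\, \mathcal{L}(f_\theta(x), y) + \tfrac{1}{2}\, \mathcal{L}(f_\theta(x_{\text{adv}}), y)
```

The 50/50 weighting mirrors the code above. Other mixes are possible, and the right balance between clean and robust accuracy likely depends on the task.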