Description:
This was my first hands-on project in adversarial machine learning. I implemented the PGD (projected gradient descent) white-box attack from Madry et al. against LeNet on MNIST and ResNet18 on CIFAR-10, then systematically analyzed how attack hyperparameters, training augmentations, and regularization techniques affect a model's vulnerability. Finally, I defended both models with adversarial training, shrinking the gap between clean and adversarial accuracy from roughly 45–65 percentage points to 2–14 percentage points at little cost to clean accuracy (about 1.6 points lower on CIFAR-10, essentially unchanged on MNIST). The project gave me practical experience with crafting adversarial examples, evaluating robustness, and weighing the tradeoffs between model utility and security.
Abstract:
An implementation and analysis of PGD adversarial attacks on image classifiers. By iteratively perturbing inputs within a bounded ε-ball, the attack reliably fools undefended models. Adversarial training, that is, training on these attack-generated examples, yields models that are robust to PGD with minimal loss of clean accuracy.
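
To make the "iteratively perturb, then project back into the ε-ball" idea concrete, here is a minimal sketch of the PGD loop. It is not the project's code. It assumes an ℓ∞ ball, a random start, and inputs scaled to [0, 1], and it swaps the neural networks for a toy logistic-regression model so the gradient can be written by hand and the example runs with only the standard library. The names `pgd_attack`, `predict`, and `loss_grad_wrt_input`, and all parameter values, are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Toy stand-in classifier: logistic regression with a 0 logit threshold."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if logit > 0 else 0


def loss_grad_wrt_input(w, b, x, y):
    """Gradient of binary cross-entropy with respect to the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def clip(v, lo, hi):
    return min(hi, max(lo, v))


def pgd_attack(w, b, x, y, eps=0.2, step_size=0.05, num_steps=20, seed=0):
    """PGD under an assumed l-infinity constraint of radius eps around x."""
    rng = random.Random(seed)
    # Assumed random start: a uniform point inside the eps-ball.
    x_adv = [clip(xi + rng.uniform(-eps, eps), 0.0, 1.0) for xi in x]
    for _ in range(num_steps):
        grad = loss_grad_wrt_input(w, b, x_adv, y)
        # Ascent step: move each coordinate along the sign of the loss gradient.
        x_adv = [v + step_size * sign(g) for v, g in zip(x_adv, grad)]
        # Projection: pull each coordinate back into [x_i - eps, x_i + eps].
        x_adv = [clip(v, xi - eps, xi + eps) for v, xi in zip(x_adv, x)]
        # Keep the result a valid input (assumed range [0, 1]).
        x_adv = [clip(v, 0.0, 1.0) for v in x_adv]
    return x_adv


# Toy usage: hypothetical weights and a point the model classifies as class 1.
w, b = [2.0, -3.0, 1.0], 0.0
x, y = [0.6, 0.3, 0.5], 1
x_adv = pgd_attack(w, b, x, y)
print("clean prediction:", predict(w, b, x))
print("adversarial prediction:", predict(w, b, x_adv))
```

With a real network, the hand-written gradient would be replaced by automatic differentiation of the loss with respect to the input. Adversarial training then amounts to running an attack like this on each training batch and fitting the model on the perturbed inputs.
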

| Undefended | LeNet - MNIST | ResNet18 - CIFAR-10 |
|---|---|---|
| Clean Accuracy | 99.41% | 88.31% |
| Adversarial Accuracy | 54.78% | 23.37% |
| Accuracy Difference | 44.63% | 64.94% |

| Adversarially Trained | LeNet - MNIST | ResNet18 - CIFAR-10 |
|---|---|---|
| Clean Accuracy | 99.50% | 86.75% |
| Adversarial Accuracy | 97.43% | 72.97% |
| Accuracy Difference | 2.07% | 13.78% |
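
The "Accuracy Difference" rows are simply clean accuracy minus adversarial accuracy, in percentage points. As a quick check, the short sketch below recomputes them from the reported accuracies; the dictionary layout and the helper name `accuracy_gap` are my own illustration, not part of the project code.

```python
# (clean accuracy, adversarial accuracy) in percent, copied from the tables.
results = {
    ("LeNet - MNIST", "undefended"): (99.41, 54.78),
    ("ResNet18 - CIFAR-10", "undefended"): (88.31, 23.37),
    ("LeNet - MNIST", "adversarially trained"): (99.50, 97.43),
    ("ResNet18 - CIFAR-10", "adversarially trained"): (86.75, 72.97),
}


def accuracy_gap(clean, adversarial):
    """Gap between clean and adversarial accuracy, in percentage points."""
    return round(clean - adversarial, 2)


for (model, training), (clean, adv) in results.items():
    print(f"{model} ({training}): gap = {accuracy_gap(clean, adv):.2f} points")
```
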