Increasing Confidence in Adversarial Robustness Evaluations
University of Tübingen
tl;dr: We propose a test that enables you to find flawed adversarial robustness evaluations. Passing this test produces compelling evidence that the attacks used have sufficient power to evaluate the model's robustness.
News
Sept '22 | Our paper was accepted at NeurIPS 2022!
June '22 | The pre-print is now available on arXiv: arxiv.org/abs/2206.13991.
April '22 | Our paper was accepted for an oral presentation at The Art of Robustness: Devil and Angel in Adversarial Machine Learning workshop at CVPR 2022.
April '22 | Our work was accepted at the ML Evaluation Standards Workshop at ICLR 2022.
Abstract
Hundreds of defenses have been proposed in the past years to make deep neural networks robust against minimal (adversarial) input perturbations. However, only a handful of them have held up their claims, because correctly evaluating robustness is extremely challenging: weak attacks often fail to find adversarial examples even when they exist, thereby making a vulnerable network look robust.
In this paper, we propose a test to identify weak attacks. Our test introduces a small and simple modification into a neural network that guarantees the existence of an adversarial example for every sample. Consequently, any correct attack must succeed in attacking this modified network.
For eleven out of thirteen previously-published defenses, the original evaluation of the defense fails our test, while stronger attacks that break these defenses pass it.
We hope that attack unit tests such as ours will be a major component in future robustness evaluations and increase confidence in an empirical field that today is riddled with skepticism and disbelief.
Assessing adversarial robustness can be tricky
Measuring the adversarial robustness of a neural network comes with several challenges. Most importantly, it is unclear how to interpret the observation that an attack does not find any adversarial perturbation. Does this mean that the model is truly robust? Or does it rather mean that the attack was too weak and a stronger attack would still produce adversarial examples? This ambiguity can lead to the conclusion that a model is robust even though it is susceptible to adversarial perturbations.
We therefore propose to use so-called active tests, such as the binarization test described below, to detect attacks that are too weak to support a robustness claim.
Our binarization test helps identify insufficiently strong attacks
We construct a binary classification problem around a clean example such that there exists a valid "adversarial" example within the feasible set of the attack's threat model (e.g., its ε-bound). The problem is defined as follows: the classifier has to distinguish between mildly perturbed images (i.e., perturbations well within the ε-bound) and some strongly perturbed images (i.e., on the boundary of the ε-bound). Consequently, there exists at least one "adversarial" example, namely the one on the boundary, that any sufficiently strong attack should find. To keep the modified classifier as close as possible to the original classifier in question, we build the binary classifier by re-using the original model's feature extractor and only replacing the classification readout with a new binary one.
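The following is a minimal PyTorch sketch of this construction as we describe it above. It is an illustration under simplifying assumptions, not a reference implementation: the function and parameter names (make_binarized_classifier, n_inner, n_boundary, steps), the sampling scheme for the inner and boundary images, and the way the binary readout is fit are all choices made for this sketch.

import torch
import torch.nn as nn


def make_binarized_classifier(feature_extractor: nn.Module,
                              x_clean: torch.Tensor,
                              eps: float,
                              n_inner: int = 256,
                              n_boundary: int = 1,
                              steps: int = 200) -> nn.Module:
    """Builds a binary classifier around x_clean (shape [C, H, W]) such that at
    least one 'adversarial' example (a boundary sample) exists in the eps-ball."""
    feature_extractor.eval()

    with torch.no_grad():
        # Class 0: mildly perturbed copies of the clean sample (well inside the eps-ball).
        inner = x_clean + 0.5 * eps * (2 * torch.rand(n_inner, *x_clean.shape) - 1)
        # Class 1: strongly perturbed copies (on the boundary of the L-infinity eps-ball).
        boundary = x_clean + eps * torch.sign(2 * torch.rand(n_boundary, *x_clean.shape) - 1)
        samples = torch.cat([inner, boundary]).clamp(0.0, 1.0)
        feats = feature_extractor(samples)  # assumed to return features of shape [N, D]
    labels = torch.cat([torch.zeros(n_inner), torch.ones(n_boundary)]).long()

    # Replace the original readout with a fresh binary readout on the frozen features.
    readout = nn.Linear(feats.shape[1], 2)
    optimizer = torch.optim.Adam(readout.parameters(), lr=1e-2)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(readout(feats), labels)
        loss.backward()
        optimizer.step()

    # The binarized model: original feature extractor + new binary readout.
    return nn.Sequential(feature_extractor, readout)

A sufficiently strong attack run on this binarized model, starting from the clean sample labeled as class 0, must be able to find an adversarial example within the ε-ball, since at least one (a boundary sample) is guaranteed to exist.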
Our test could have warned the authors of defenses that were later broken
We used our test to analyze thirteen published defenses. Eleven of these have been broken before; for the remaining two, we present adversarial attacks circumventing them. Specifically, we applied the test once to the originally used evaluation and once to a re-evaluation using stronger attacks that break the defense.
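As a rough illustration of how such a comparison could be aggregated over many samples, the sketch below computes an attack success rate on the binarized problems for a given evaluation. Here run_attack stands in for either the original evaluation's attack or the stronger re-evaluation; the reading that a sufficient attack should reach (close to) 100% success follows from the existence guarantee and is not an exact threshold taken from the paper.

from typing import Callable, Iterable, Tuple

import torch
import torch.nn as nn


def attack_success_rate(
    run_attack: Callable[[nn.Module, torch.Tensor, float], torch.Tensor],
    problems: Iterable[Tuple[nn.Module, torch.Tensor, float]],
) -> float:
    """Fraction of binarized problems on which the attack finds the guaranteed
    'adversarial' example, i.e., flips the clean sample to the boundary class 1."""
    successes, total = 0, 0
    for binarized_model, x_clean, eps in problems:
        x_adv = run_attack(binarized_model, x_clean, eps)  # must stay within the eps-ball
        pred = binarized_model(x_adv.unsqueeze(0)).argmax(dim=1).item()
        successes += int(pred == 1)
        total += 1
    return successes / max(total, 1)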
We find that for eleven of the investigated defenses, the original evaluation performs poorly and fails our test, while the re-evaluation with stronger attacks performs substantially better. These results suggest that running our test would likely have prevented the publication of eleven flawed defenses. Although the binarization test can help defense authors demonstrate that their adversarial evaluation has sufficient power, no single test can be comprehensive and apply to every possible defense. Consequently, defense authors should aim to develop their own active tests, tailored to the particular claims they make. We are therefore optimistic that active tests can improve the reliability of future publications in the field of adversarial robustness.
Acknowledgements & Funding
We thank Alexey Kurakin for his valuable feedback. The authors thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting RSZ. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tübingen AI Center, FKZ: 01IS18039A. WB acknowledges financial support via an Emmy Noether Grant funded by the German Research Foundation (DFG) under grant no. BR 6382/1-1. WB is a member of the Machine Learning Cluster of Excellence, EXC number 2064/1 – Project number 390727645.
BibTeX
When citing our project, please use our pre-print:
@article{zimmermann2022increasing,
    author = {
        Zimmermann, Roland S. and
        Brendel, Wieland and
        Tram{\`e}r, Florian and
        Carlini, Nicholas
    },
    title = {
        Increasing Confidence in
        Adversarial Robustness Evaluations
    },
    journal = {arXiv preprint arXiv:2206.13991},
    year = {2022},
}