Increasing Confidence in Adversarial Robustness Evaluations

Roland S. Zimmermann
University of Tübingen & IMPRS-IS
Wieland Brendel
University of Tübingen
Florian Tramèr
Google
Nicholas Carlini
Google

tl;dr: We propose a test that enables you to find flawed adversarial robustness evaluations. Passing this test produces compelling evidence that the attacks used have sufficient power to evaluate the model's robustness.

News

April '22 Our paper was accepted for an oral presentation at The Art of Robustness: Devil and Angel in Adversarial Machine Learning workshop at CVPR 2022.
April '22 Our work was accepted at the ML Evaluation Standards Workshop at ICLR 2022.

Abstract

Assessing adversarial robustness can be tricky

Measuring the adversarial robustness of a neural network comes with several challenges. Most importantly, it is unclear how to interpret the observation that an adversarial attack does not find any adversarial perturbation. Does this mean that the model is truly robust? Or does it rather mean that the attack was too weak, and a stronger attack would still produce adversarial examples? This ambiguity can lead to the false conclusion that a model is robust even though it is susceptible to adversarial perturbations.

There are two reasons an attack might not find an adversarial example: either the classifier is truly robust, or the attack is too weak and misses adversarial examples that do exist. In our proposed binarization test, failing to find adversarial examples shows that the attack used is too weak.

We propose using so-called active tests to produce compelling evidence that a robustness evaluation used a sufficiently strong attack and, thus, can be trusted. Specifically, we introduce the binarization test, which applies to a large family of classifiers. In this test, we modify the original classifier such that it contains adversarial examples by construction. If an adversarial attack fails to find adversarial examples for this modified classifier, we can conclude that the attack is too weak to assess the robustness of the original model.

Our binarization test helps identify insufficiently strong attacks

We construct a binary classification problem around a clean example such that a valid "adversarial" example is guaranteed to exist within the feasible set of the attack's threat model (e.g., its ε-bound). The problem is defined as follows: the classifier has to distinguish between mildly perturbed images (i.e., perturbations well within the ε-bound) and strongly perturbed images (i.e., perturbations on the boundary of the ε-bound). Consequently, there exists at least one "adversarial" example - namely the one on the boundary - that any sufficiently strong attack should find.
To keep the modified classifier as close as possible to the original classifier in question, we build the binary classifier by re-using the original model's feature extractor and replacing only the classification readout with a new binary readout.
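The construction above can be sketched in a few lines. The following is a hypothetical toy illustration, not the authors' reference implementation: the "feature extractor" is a stand-in random map, the readout is plain logistic regression, and the attack under evaluation is a single sign-gradient step; all names and settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "feature extractor": a toy stand-in for the defended model's
# penultimate layer (random linear map plus simple nonlinearities).
_W = np.random.default_rng(1).standard_normal((8, 4))

def features(x):
    u = _W @ x
    return np.concatenate([np.tanh(u), u ** 2])

def make_binarization_problem(x_clean, eps, n_inner=32, n_boundary=32):
    """Class 0: mild perturbations inside the eps-ball.
    Class 1: perturbations on the boundary of the eps-ball."""
    inner = [x_clean + rng.uniform(-0.5 * eps, 0.5 * eps, size=x_clean.shape)
             for _ in range(n_inner)]
    boundary = [x_clean + eps * rng.choice([-1.0, 1.0], size=x_clean.shape)
                for _ in range(n_boundary)]
    X = np.stack(inner + boundary)
    y = np.array([0] * n_inner + [1] * n_boundary)
    return X, y

def fit_binary_readout(X, y, steps=500, lr=0.5):
    """Train only a logistic-regression readout on the frozen features;
    the feature extractor itself is left untouched."""
    F = np.stack([features(x) for x in X])
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
        w -= lr * F.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def fgsm_attack(x, eps, w, b):
    """One sign-gradient step of size eps (numerical gradient), standing
    in for the attack whose strength we want to test."""
    logit = lambda z: features(z) @ w + b
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = 1e-5
        g[i] = (logit(x + d) - logit(x - d)) / 2e-5
    return x + eps * np.sign(g)

x_clean, eps = np.array([0.1, -0.2, 0.3, 0.0]), 0.25
X, y = make_binarization_problem(x_clean, eps)
w, b = fit_binary_readout(X, y)
x_adv = fgsm_attack(x_clean, eps, w, b)
# The attack passes the test iff it finds a point inside the eps-ball
# that the binary readout assigns to class 1 (the boundary class).
attack_passes = (features(x_adv) @ w + b) > 0
```

An attack that cannot flip this binary classifier, even though a class-1 point is known to exist inside the ε-ball, is too weak to certify the robustness of the original model.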

Our test could have warned the authors of defenses that were later broken

We used our test to analyze thirteen published defenses. Eleven of these had been broken before; for the remaining two, we present novel adversarial attacks that circumvent them. Specifically, we applied the test twice per defense: once to the originally used evaluation, and once to a re-evaluation using stronger attacks that break the defense.

Black markers indicate original attacks that fail the test, as well as improved attacks that pass the test (i.e., true positives and true negatives for our test). Red markers indicate suboptimal original evaluations that nevertheless pass our test (false positives). Orange markers indicate re-evaluations that used suboptimal attacks (as shown by our test) that still broke the defense. Checks and crosses in the legend indicate passing/failing tests for the original and the re-evaluation, respectively.

We find that for eleven of the investigated defenses, the original evaluation performs poorly and fails our test, while the re-evaluation performs substantially better. These results suggest that running our test could have flagged eleven flawed defenses before publication. While the binarization test helps defense authors demonstrate that their adversarial evaluation has sufficient power, no single test can be comprehensive and apply to every possible defense. Consequently, defense authors should aim to develop their own active tests, tailored to the particular claims they make. Nevertheless, we are optimistic that active tests can improve the reliability of future publications in the field of adversarial robustness.

BibTeX

When citing our project, please use our pre-print:

@article{zimmermann2022active,
  author = {
    Zimmermann, Roland S. and
    Brendel, Wieland and
    Tram{\`e}r, Florian and
    Carlini, Nicholas
  },
  title = {
    Increasing Confidence in
    Adversarial Robustness Evaluations
  },
  year = {2022},
}