%0 Journal Article
%J arXiv
%D 2022
%T One thing to fool them all: generating interpretable, universal, and physically-realizable adversarial features
%A Stephen Casper
%A Max Nadeau
%A Gabriel Kreiman
%X

It is well understood that modern deep networks are vulnerable to adversarial attacks. However, conventional attack methods fail to produce adversarial perturbations that are intelligible to humans, and they pose limited threats in the physical world. To study feature-class associations in networks and better understand their vulnerability to attacks in the real world, we develop feature-level adversarial perturbations using deep image generators and a novel optimization objective. We term these feature-fool attacks. We show that they are versatile and use them to generate targeted feature-level attacks at the ImageNet scale that are simultaneously interpretable, universal to any source image, and physically-realizable. These attacks reveal spurious, semantically-describable feature/class associations that can be exploited by novel combinations of objects. We use them to guide the design of “copy/paste” adversaries in which one natural image is pasted into another to cause a targeted misclassification.
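To make the approach above concrete, here is a minimal PyTorch sketch of a feature-level attack under stated assumptions: gen is a pretrained image generator mapping latent codes to images, clf is the classifier under attack, and z_init is a latent code for a source image. The names, regularization weight, and optimizer settings are illustrative placeholders, not the paper's implementation.

import torch

def feature_level_attack(gen, clf, z_init, target_class, steps=200, lr=0.05):
    # Optimize a latent perturbation delta so that clf(gen(z_init + delta))
    # predicts target_class.
    delta = torch.zeros_like(z_init, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        img = gen(z_init + delta)          # decode the perturbed latent to an image
        logits = clf(img)                  # classify the generated image
        # Targeted objective: raise the target-class logit while keeping the
        # latent perturbation small (illustrative regularization weight).
        loss = -logits[:, target_class].mean() + 1e-3 * delta.norm()
        loss.backward()
        opt.step()
    return (z_init + delta).detach()

Because the perturbation is applied in the generator's latent space rather than pixel space, the resulting change to the image is feature-level rather than pixel-level.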

%B arXiv
%8 01/2022
%G eng
%U https://arxiv.org/abs/2110.03605
%R 10.48550/arXiv.2110.03605

%0 Conference Paper
%D 2022
%T Robust Feature-Level Adversaries are Interpretability Tools
%A Stephen Casper
%A Max Nadeau
%A Dylan Hadfield-Menell
%A Gabriel Kreiman
%K Adversarial Attacks
%K Explainability
%K Interpretability
%X

The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying representations in models. Second, we show that these adversaries are uniquely versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research. They support the design of tools to better understand what a model has learned and diagnose brittle feature associations. Code is available at https://github.com/thestephencasper/feature_level_adv.
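As a rough illustration of how a copy/paste attack can be evaluated (a sketch under assumptions, not the released code at the repository above: clf, sources, and patch are placeholder callables/tensors), one can paste a natural-image patch into a batch of source images and measure how often the prediction flips to the target class.

import torch

def paste_patch(src, patch, top, left):
    # Return a copy of src (N,C,H,W) with patch (N,C,h,w) pasted at (top, left).
    out = src.clone()
    h, w = patch.shape[-2:]
    out[..., top:top + h, left:left + w] = patch
    return out

@torch.no_grad()
def copy_paste_success_rate(clf, sources, patch, target_class, top=0, left=0):
    # Fraction of source images whose predicted class becomes target_class
    # after the patch is pasted in.
    attacked = paste_patch(sources, patch, top, left)
    preds = clf(attacked).argmax(dim=1)
    return (preds == target_class).float().mean().item()

A high success rate for a semantically meaningful patch is evidence of the kind of spurious feature/class association described in the abstract.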

%B NeurIPS
%C New Orleans, Louisiana
%8 10/2022
%U https://openreview.net/forum?id=lQ--doSB2o

%0 Conference Paper
%D 2021
%T Frivolous Units: Wider Networks Are Not Really That Wide
%A Stephen Casper
%A Xavier Boix
%A Vanessa D'Amario
%A Ling Guo
%A Martin Schrimpf
%A Kasper Vinken
%A Gabriel Kreiman
%X

A remarkable characteristic of overparameterized deep neural networks (DNNs) is that their accuracy does not degrade when the network width is increased. Recent evidence suggests that developing compressible representations allows the complexity of large networks to be adjusted for the learning task at hand. However, these representations are poorly understood. A promising strand of research inspired by biology studies representations at the unit level, which offers a more granular interpretation of the neural mechanisms. In order to better understand what facilitates increases in width without decreases in accuracy, we ask: Are there mechanisms at the unit level by which networks control their effective complexity? If so, how do these depend on the architecture, dataset, and hyperparameters? We identify two distinct types of “frivolous” units that proliferate when the network’s width increases: prunable units, which can be dropped out of the network without significant change to the output, and redundant units, whose activities can be expressed as a linear combination of others. These units imply complexity constraints, as the function the network computes could be expressed without them. We also identify how the development of these units can be influenced by architecture and a number of training factors. Together, these results help to explain why the accuracy of DNNs does not degrade when width is increased, and they highlight the importance of frivolous units for understanding implicit regularization in DNNs.
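For illustration only (placeholder names, not the authors' code), the two notions of frivolous units described above can be operationalized roughly as follows. Given a matrix A of unit activations with shape (num_samples, num_units), a unit is approximately redundant if a linear combination of the other units predicts its activations well, and approximately prunable if zeroing it barely changes the model's output.

import torch

def redundancy_r2(A, unit):
    # R^2 of a least-squares fit of one unit's activations from all other units.
    y = A[:, unit]
    X = torch.cat([A[:, :unit], A[:, unit + 1:]], dim=1)
    coef = torch.linalg.lstsq(X, y.unsqueeze(1)).solution
    resid = y - (X @ coef).squeeze(1)
    return (1.0 - resid.var() / y.var()).item()

@torch.no_grad()
def prunability_gap(model_fn, x, layer_acts, unit):
    # Mean absolute change in output when one unit's activation is zeroed.
    # model_fn(x, acts) is an assumed hook-style callable that recomputes the
    # network's output from (possibly modified) layer activations.
    ablated = layer_acts.clone()
    ablated[:, unit] = 0.0
    return (model_fn(x, layer_acts) - model_fn(x, ablated)).abs().mean().item()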

%B AAAI 2021
%8 05/2021
%G eng
%U https://dblp.org/rec/conf/aaai/CasperBDGSVK21.html