Adversarial examples and human-ML alignment
Date Posted:
July 24, 2020
Date Recorded:
July 23, 2020
Speaker(s):
Shibani Santurkar, MIT
Description:
Machine learning models today achieve impressive performance on challenging benchmark tasks. Yet, these models remain remarkably brittle---small perturbations of natural inputs, known as adversarial examples, can severely degrade their behavior.
Why is this the case?
In this tutorial, we take a closer look at this question, and demonstrate that the observed brittleness can be largely attributed to the fact that our models tend to solve classification tasks quite differently from humans. Specifically, viewing neural networks as feature extractors, we study how features extracted by neural networks may diverge from those used by humans, and how adversarially robust models can help to make progress towards bridging this gap.
Additional tutorial info:
The tutorial will include demos---we will use Colab notebooks so please bring laptops along. In these demos, we will explore the brittleness of standard ML models by crafting adversarial perturbations, and use these as a lens to inspect the features models rely on.
Github link for demos: https://github.com/MadryLab/AdvEx_Tutorial
Suggested reading (in order of importance):
Speaker Bio:
Shibani Santurkar is a PhD student in the MIT EECS Department, advised by Aleksander Mądry and Nir Shavit. Her research has been focused on two broad themes: developing a precise understanding of widely-used deep learning techniques; and avenues to make machine learning methods robust and reliable. Prior to joining MIT, she received a bachelor's degree in electrical engineering from IIT Bombay. She is a recipient of the Google Fellowship in Machine Learning.
PRESENTER: Hello, everyone. Welcome to-- this is the second session of the Computational Tutorial Series this summer. So these tutorials are meant to be an informal introduction to some of the computational techniques. We invite speakers to share about some toolboxes they're working on. It's meant to be very informal, so please feel free to unmute yourself and ask questions during the presentations if you have any questions.
So today we're very pleased to welcome Shibani Santurkar to our tutorial. Shibani is a PhD student in the MIT EECS department, advised by Aleksander Mądry and Nir Shavit. Prior to joining MIT, she received a bachelor's degree in electrical engineering from IIT Bombay. She is also a recipient of the Google Fellowship in Machine Learning.
So Shibani's research is focused on two broad themes. One is developing a precise understanding of widely-used deep learning techniques, and the other is avenues to make machine learning methods robust and reliable. I also know that Shibani is a contributor to a blog called gradientscience.org, which has lots of awesome posts on machine learning and adversarially robust models. Shibani, thanks so much for joining us today.
SHIBANI SANTURKAR: Thank you for the introduction. So let me get started. Hi, my name is Shibani. I am a fifth-year grad student in the EECS department. And I'm going to be talking to you guys about adversarial examples and how they also offer a new perspective on how machine learning models learn. And this is joint work with all my amazing lab mates here at MIT-- Logan, Andrew, Brandon, Dimitris, and Alex, along with our advisor, Aleksander Madry.
Before I get started, I actually just wanted to tell you guys that if you could open the GitHub repository and either the solutions or the exercises notebook, whichever one you prefer-- if you could just run all the cells that come before the setup cell, it would help to speed up the demo process. Even if you don't, it's not a big deal. It should just take you a few minutes. But this way we can get a little bit of a head start on the demo.
And the second thing I wanted to say is please feel free to interrupt me. Unmute yourself and interrupt me at any point in time in the talk or in the demo. I would prefer for this to be as interactive as possible, so don't hesitate at all to interrupt me. OK, so with that, let's get started.
So I don't need to tell you guys how successful machine learning is these days. Machine learning techniques are increasingly getting deployed in a bunch of real-world applications across domains from games to image classification and language translation and a bunch of other things. And in fact, we've gotten pretty accustomed to seeing cool demos like this in the media, where you have this completely autonomous self-driving car navigating the streets.
And the buzzword behind a lot of these developments seems to be deep learning. It's one of these technologies that has fueled a lot of the recent progress in machine learning. And just revisiting what deep learning is all about, the crux of it is that it is focused on using a new class of models, specifically deep networks. And this is just one pictorial depiction of what a deep network can do. It can take an image, and it can classify this image into a bunch of possible classes. And deep networks usually are composed of a bunch of layers of different types.
And what made deep networks so popular compared to previous machine learning techniques is that they had amazing performance on a bunch of standard benchmarks. In fact, on some of the benchmarks today, deep networks surpass human performance, for example on the ImageNet Image Classification Challenge. And even more fundamentally, one of the things that makes deep learning so appealing is this promise that they learn these really meaningful representations of data, that over the layers of the network, deep networks extract more and more complex representations, similar to what we think about human perception.
And the hope is that these representations help us do well not only on the task that we trained the model on, but also on other tasks, for example in other domains, and also let us do more advanced things, for example generation of images. And as you guys must have heard, deep networks are already pretty good at many of these tasks.
So the big question then is, are we there? Are we done? Are deep networks everything that we need to build really good perception systems and other kinds of systems? Is all we need just building larger models with more data? And I guess the spoiler is that it may not entirely be the case, and that there may be some fundamental deviations between how we as humans work-- specifically focusing on perception or vision systems-- and how machine learning models behave. And it's all about the kinds of features that these different models rely on.
OK. So to get a sense of where I'm coming from, let's look at the flip side of deep networks, all those demos that you don't see as often. Here you have this Tesla that's just veering off the highway for no good reason. And in, for example, reinforcement learning, you also get used to seeing a bunch of demos like this, where the agents are doing really funny but unexpected things and just failing at the tasks that they're supposed to be doing.
And at the crux of all of this is this phenomenon that deep networks are extremely brittle. So I show you this picture of a deep network classifying this image of a pig. And usually these models do very well on images. Such an image classification system would get upwards of 90% accuracy.
But the caveat is that you can take any of these images that is classified correctly by the model, add a small perturbation, and get an image that basically looks identical to you, but the model confidently thinks it is something else. In this case, the model confidently thinks that this pig, which basically looks exactly like the first picture, is now an airplane.
And this is a phenomenon known as adversarial examples. It was discovered around 2013, and the key idea is that you can add an imperceptible perturbation to any input and get the model to confidently misclassify it. And all state-of-the-art networks today are extremely brittle. You can basically bring their accuracy down from 90+% to zero by just adding these imperceptible perturbations.
So let's take a deeper, closer look at what adversarial examples are-- specifically, how do we go about finding them? I want to emphasize that adversarial examples are not random noise. In fact, they need to be specially crafted. You can't just add some Gaussian noise to an input and get the model to misclassify it with an imperceptible perturbation. And to understand how we find adversarial examples, let's revisit how these standard machine learning models are trained in the first place.
So usually when we train machine learning classifiers, we optimize their parameters to minimize the loss with respect to some data. So the goal is, for example, to minimize the loss between the predicted label and the true label on a bunch of data in the image classification setting. Now to construct adversarial examples, what we want to do is find a perturbation such that, when it's added to the input image-- think about this as x-- it will make the loss as high as possible.
So here the delta captures the allowed perturbation set. After all, we don't want the perturbation that's added to be arbitrary; we want it restricted to a small set. For example, we might say we're only allowing rotations, or we're only allowing small Lp changes. And so the goal is to find the perturbation that maximizes the loss within this perturbation set.
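Written out (a rough rendering of the two objectives just described, where D is the data distribution, theta the model parameters, L the loss, and Delta the allowed perturbation set):

\[
\text{standard training:}\quad \min_{\theta}\; \mathbb{E}_{(x,y)\sim D}\big[\mathcal{L}(\theta;\,x,\,y)\big]
\qquad\qquad
\text{adversarial example:}\quad \max_{\delta \in \Delta}\; \mathcal{L}(\theta;\,x+\delta,\,y)
\]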
And you can think about how the natural way to solve this would be using some optimization method, for example gradient descent. And that's exactly how it's done in practice. You just optimize this loss with gradient descent, except you're not optimizing the weights, but the perturbation that's added to the input. And just to ensure that the perturbation remains in the allowed set, you need a slightly modified version of gradient descent and do what is called projected gradient descent, which ensures that the delta you're finding remains in the allowed set.
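For concreteness, here is a minimal PyTorch sketch of projected gradient descent for an L2-bounded perturbation (hypothetical names; not the exact code from the demo notebooks):

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.5, step_size=0.1, steps=20):
    """Find a perturbation delta with ||delta||_2 <= eps that maximizes the loss."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        # Gradient *ascent* step on the perturbation (not on the weights).
        grad = delta.grad.detach()
        grad_norm = grad.flatten(1).norm(dim=1).clamp(min=1e-12).view(-1, 1, 1, 1)
        delta.data = delta.data + step_size * grad / grad_norm
        # Projection step: pull delta back onto the L2 ball of radius eps.
        delta_norm = delta.data.flatten(1).norm(dim=1).view(-1, 1, 1, 1)
        delta.data = delta.data * (eps / delta_norm).clamp(max=1.0)
        delta.grad.zero_()
    # For simplicity this sketch skips clamping x + delta to the valid pixel range.
    return (x + delta).detach()
```

The same loop with the sign of the gradient and an L-infinity projection gives the common L-infinity variant.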
So let's take a step back. Why do we care about adversarial examples? Now, adversarial examples first became really popular from a security point of view, because they showed that an attacker could easily break your model by adding some imperceptible perturbation to your inputs. But it's not just about security and reliability. In fact, adversarial examples show us that models don't behave the way we expect them to, or at least hope they do.
When I showed you this image of this pig, the perturbed pig looked exactly identical to me to the original pig. Still, for some reason, the model thinks these are vastly different things. And that indicates some kind of dissonance between how we classify things and how the machine learning model is doing it. And my focus in the talk is going to be from this viewpoint, thinking about adversarial examples as shedding insight into how machine learning models actually classify.
And so in order to understand this, we first need to understand why adversarial examples even arise in the first place. And obviously, this is an extremely popular topic, and there have been a lot of hypotheses proposed for the existence of adversarial examples: that this is some high-dimensional phenomenon, that it could be just because of some statistical aberrations, that it could be because we train our models to be good in the average case while adversarial examples exploit worst-case behavior, or that it could just be a consequence of using some crazy large models like ResNets.
But the crux of all these hypotheses is that adversarial examples are thought of as aberrations or bugs-- just some glitches in our models that will go away as we get better models or more data. And more conceptually, what this looks like is that you can imagine that our data has a bunch of features. This is just a conceptual model for simplicity.
And you can think about the data having a bunch of useful features. And by useful I mean that they are predictive-- they are correlated with the label. For example, in a dog versus cat classification challenge, the tail of the dog or the snout might be a useful feature, because it tells you something about the label. And the data also has a bunch of other useless features. For example, these could arise just because we use finite data, so maybe there's some noise in the data. These features don't tell you anything about the label, but they're just present in your data nevertheless.
And the inherent hypothesis seems to be that the adversary is exploiting these useless directions in your data. It's somehow exploiting these unreasonable sensitivities your model has, and as you build better models, in some sense, they will go away.
So the question that we tried to explore in this work is, is this view justified? Are adversarial examples actually meaningless? So to get a better sense of that, let's try to understand why adversarial examples bother us so much.
So going back again to the illustration, you have this image of the dog. You add this meaningless perturbation. All of a sudden, the model thinks it's a cat. And what's really upsetting about this is that again, both these images on the left and right look identical. They're definitely both dogs. And still, for some reason, your model's prediction changed drastically. So the fact that you could add a meaningless perturbation and flip what your model thinks is really troublesome.
But what I want to emphasize is that this is just a human perspective. And what I mean by that is that, let's think about a dog and a cat classification task. You as a human would perhaps use the snout or the ears or the whiskers to make these kinds of predictions. And you do this because you have some innate understanding of what dogs and cats are, and you have some kind of prior understanding of the world itself. But this is purely how we as humans think about this task.
If you think about this, on the other hand, from a machine learning model's perspective, it doesn't really know what an image is. It's just an array of numbers that it's seeing. And it doesn't know what dogs and cats are. They're just arbitrary strings to it. And so the only goal of a machine learning model is to maximize test accuracy. It has no prior on what a dog or a cat is. It doesn't know what these classes mean. It's just trying to maximize the accuracy in terms of predicting the label from the image.
So maybe rather than thinking about this dog versus cat classification task, let's think about this other classification task between these two patterns that we call tap and toc. What these are, I don't know, but they are just patterns that we have to learn to classify. And maybe if I show you a couple of these, you might conclude that the rectangle at the top is correlated with the class tap, and the rectangle at the bottom is correlated with the class toc.
So in this setting, let's look at what an adversarial perturbation might look like. It might look a little bit more like this. And from this perspective, maybe the perturbation looks less bothersome, or less meaningless to us. And in fact, this is the fundamental question that we're going to explore: are these adversarial perturbations that we construct from data actually meaningless?
And so to do that I am going to walk through a simple experiment that we did. And again, during this experiment please feel free to interrupt me if you have any questions. So what we're going to do is we're going to focus on a simpler classification task-- let's say, just between dogs and cats. And what I'm going to do is, for every single image in the training set, construct an adversarial perturbation towards the other class.
So first, I'm going to train a standard model on this binary classification task. And for that standard classifier, I'm going to construct adversarial examples for every training image towards the opposite class. So for example, I'm going to find for every dog a perturbation that makes the model think it's a cat, and for every cat a perturbation that makes the model think it's a dog.
So as we saw before, these images look almost identical to us as the original images. But now the standard classifier that was trained on the original training set thinks the dog is a cat and the cat is a dog. And this is true for every image in the training set.
What we do now is actually construct a new training set made up of these adversarial examples. And we label each example with the label predicted by the model. So every single dog in this new training set is labeled as a cat, and every cat is labeled as a dog. So it's basically completely misleading with respect to our semantic understanding of the strings cat and dog.
And what I'm going to do is train a completely new machine learning classifier on this. It need not even have the same architecture as the original one. It's just a completely fresh model. And I'm going to train it only on this new training set. And then I'm going to evaluate it on the original test set. So in the original test set, remember that every dog is labeled as dog, every cat is labeled as cat, as we would expect.
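As a rough sketch of this experiment (reusing the hypothetical pgd_attack from earlier; std_model, train_loader, and the other names are placeholders, not the demo code):

```python
import torch

# Step 1: attack every training image with the standard model. For a binary
# task, maximizing the loss on the true label pushes the image toward the
# other class; in the multi-class case the paper uses targeted attacks instead.
new_images, new_labels = [], []
std_model.eval()
for x, y in train_loader:
    x_adv = pgd_attack(std_model, x, y)
    y_new = std_model(x_adv).argmax(dim=1)   # relabel with the model's (flipped) prediction
    new_images.append(x_adv.cpu())
    new_labels.append(y_new.cpu())

# Step 2: build the mislabeled adversarial training set, train a completely
# fresh classifier on it, and evaluate on the original, correctly labeled test set.
adv_dataset = torch.utils.data.TensorDataset(torch.cat(new_images), torch.cat(new_labels))
```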
So what do you think would happen here? Note that if I showed this mislabeled training set to a human who maybe didn't know the words cat and dog in English, how would we expect them to perform on the test set? A human would get 0% accuracy, not even 50%. Because if they were learning the mapping, they would learn that the dog features correspond to the string cat and the cat features correspond to the string dog. And so on the actual test set they would just predict the complete opposite.
But what we actually see with respect to the machine learning classifier we trained is that it gets non-trivial accuracy. It gets close to 80% accuracy on the cat versus dog classification task, and we actually do this in our paper on other tasks, including the actual ImageNet multi-class classification task. And this is true there as well.
So this is really perplexing. Because what I did is I trained a model on a data set that was completely mislabeled, and for some reason it was still able to predict correctly on the original test set. So what's going on? This was a really, really puzzling phenomenon for us. We were like, how can this even be? And we realized that the only assumption that we made in this entire process is that the adversarial perturbation that we added when we constructed the new training set was meaningless. But what this experiment suggests is that maybe these perturbations are not meaningless. They're not just bugs, but they're actually features that are predictive on the test set.
What do I mean by this? So let's go back to the conceptual model we had. And remember that we thought about it as a bunch of useful features in the data that are correlated with the label and a bunch of useless features that do not have anything to do with the label. They're just noise. But what if it's not so simple? What if the useful features themselves are split into what we call robust and non-robust features?
So you can think about robust features as features that remain correlated with the label even when they're adversarially perturbed. For example, the features that we as humans use for classification are robust, because adding a small, imperceptible perturbation doesn't change the fact that they are still correlated with the original class.
And then we can think about the data also containing a bunch of non-robust features. So these are features that are actually correlated with the label, but they can be easily flipped by a small perturbation. It helps if we think about these features intuitively as things like lighting. Maybe things like lighting are correlated with the actual label, but they can also be easily flipped. So we as humans might not use lighting to predict whether an image contains a dog or a cat, but these still might, on the data side, be correlated with dogs and cats, maybe because dogs are more outdoors, and cats are more indoors. So small things like this might be what we call non-robust features.
And from the perspective of maximizing accuracy, all useful features-- any feature that tells you something about the label-- is a good feature. Right? And in fact, non-robust features are often great features, because they give you a lot of test accuracy and training accuracy. And that's why our models end up picking up on them.
So to kind of revisit, when we started exploring this, our hypothesis was that these adversarial perturbations are just meaningless sensitivities. They don't really help the model to make classifications or to get good performance on the test set; they're just something that the model latches onto. But what this experiment shows is that that is not entirely the case: there are features in the data that are actually predictive on the test set, that give you test accuracy, but that can be easily flipped by the adversary and thus are responsible for adversarial examples.
So going back to the simple experiment, let's see what we actually did there. So we had the training set where-- I'm just using one image for illustration, but this is true for every image-- you had an image, for example of a dog, where both the robust and non-robust features are useful features, so they're correlated with the label dog.
And when you construct adversarial examples, by definition, you can't change the robust features. Robust features are features that can't be flipped by small perturbations, so the robust features still remain correlated with the original label dog. If you saw this image, you would still think it's a dog. On the other hand, the non-- sorry, yeah.
AUDIENCE: Hi, my name is Hugh.
SHIBANI SANTURKAR: Hi.
AUDIENCE: I'm a grad student. I have a question about your robust features and non-robust features. Does this depend in any way on the magnitude of the perturbation? For example, let's say, if you allow the perturbation to be close to 1, in terms of a range from 0 to 1, would the perturbation look somewhat like an ear or a tail or a nose or something like that?
SHIBANI SANTURKAR: Yeah, that's a very good question. And in fact, the way we define a robust feature is precisely that-- the definition is dependent on the perturbation set. Right? You could imagine that there are some features that are robust to small L2 changes but are not robust to rotations or translations. So this entire definition of what is a robust feature and what isn't a robust feature depends upon the perturbation set that you are even considering in the first place. Does that answer your question?
AUDIENCE: Yeah, yeah. I guess maybe more specifically in terms of the magnitude, right? Because I think when you generate this, you often have to define some delta or lambda-- sorry, some delta that determines how large the perturbation can be.
SHIBANI SANTURKAR: Yeah, exactly. So even-- so what I meant is, it's even dependent on the perturbation magnitude, right? For example, you might imagine that like a background is a robust feature for small epsilon values. But if you allow the epsilon to be large, maybe you can flip a blue background into a green background.
AUDIENCE: I see. Yeah. I guess one example that I could think of right now is, if we try on MNIST and if we allow the perturbation to be really large, would the perturbation look like a different number on top of the original number or something like that?
SHIBANI SANTURKAR: Yeah. Yeah, that could be the case. But in general one point to note is that this entire discussion of adversarial examples is usually centered on extremely small epsilons, like you can't perceptibly change the image within these epsilons. So I agree with what you're saying, that in general-- and this is how we define robust and non-robust features in the first place-- they depend upon the magnitude, and they are dependent on the set that you consider. But given any set, what we're saying is that your features can be decomposed into robust and non-robust features.
AUDIENCE: OK. Thank you.
SHIBANI SANTURKAR: OK. Yeah. So going back to the experiment again-- so like I was saying, you have this training data set, where both the robust and non-robust features are correlated with the label, and when you apply an adversarial perturbation, the robust features still remain correlated with the original label, which in this case was dog. But the non-robust features are flipped by the adversary to be pointing towards the class cat. You can imagine that the adversary changed the lighting a little bit so that it now is correlated with cats-- it's now the cat lighting, basically.
And if you were to train a model on this new training set, you see that the robust features are actually not helpful. If you were to learn the mapping between the robust features and the class cat, then on the test set, you would actually predict the opposite label. So you'd see a dog, and you'd actually predict cat on the test set. And in fact, all the performance you get is basically from the non-robust features. The fact that you flipped the non-robust features, and you also flipped the label, and the fact that these are actually correlated, gives you some performance on the test set.
So what this tells you is that in going from the image on the left to the image on the right, even though you think that you did nothing, that you added some meaningless noise, you actually didn't. You actually added some features that are correlated with the class cat in this case.
So what do we do with this? What's next? So this kind of gives us a new perspective on adversarial robustness and more broadly also some insight into how our models learn. So let me talk to you about this in the context of priors. So when we as humans are asked to think about a classification task, we often think about the features that we as humans might use to perform this task.
But what it turns out, and what this work shows so far, is that there are other features in the data that are equally valid, that are equally predictive. And a priori, if you just train your model to maximize accuracy, there's no reason for it to pick one over the other. They're both equally valid in terms of accuracy itself. And so there's no reason why a model which is just trained to maximize accuracy would pick the human-meaningful features over the other ones.
And going one step further, what this tells you is that adversarial examples are in some sense a human phenomenon. It's not that the models are latching onto something weird or just have some kind of glitches in them. They're latching onto perfectly predictive features in the data, and the fact that we as humans can't perceive them or don't consider them meaningful doesn't mean that there's something wrong with our models. Our models are just doing what they were trained to do-- maximize accuracy-- and it's more of a human thing that our priors don't align with the kinds of features that the machine learning models use.
So in fact, this is not something we observe alone. There have been a bunch of recent discoveries that basically show that machine learning models rely on a bunch of very unintuitive features-- for example, they rely on high-frequency patterns, or they rely on texture information, patterns that we as humans don't generally use.
And so this effect, or this observation, has many interesting consequences for how we think about machine learning model accuracy and also how we think about their behavior more broadly. For example, one really important thing in machine learning is interpretability. If we're deploying our machine learning models in any application, we might care about understanding why they made the predictions in the first place. But what these results so far tell us is that any model that uses non-robust features cannot be human-interpretable. If the model is relying on some features that you can't perceive or don't understand, you can't possibly rationalize why the model made the prediction in the first place.
And we actually see this in practice. If you look at any standard model, and you try to look at some standard interpretability methods, for example gradient saliency-- here, what the gradient saliency map tells you is how sensitive the prediction is to each pixel in the image; it tells you how important each input pixel is for the prediction. And if you look at these saliency maps for standard models, they look pretty noisy. You definitely see some activity in the region where the bird is, but it's not really clear. And it's not surprising, because this model relies on non-robust features. So all of this other noise that you're seeing is maybe non-robust features that the model is latching onto. And going one step further, what this tells you is that unless you explicitly prevent models from learning these non-robust features at training time, there's no hope for us to interpret them at test time.
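A vanilla gradient-saliency map of the kind referred to here can be computed with a few lines (a sketch; model and image, a single preprocessed image tensor with a batch dimension, are assumed placeholders):

```python
import torch

def saliency_map(model, image):
    model.eval()
    image = image.clone().requires_grad_(True)
    # Score of the predicted class for this image.
    score = model(image).max(dim=1).values.sum()
    score.backward()
    # Aggregate the absolute gradient over color channels into a per-pixel heatmap.
    return image.grad.detach().abs().max(dim=1).values.squeeze(0)
```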
Of course, there's been a lot of work to build better interpretability methods for standard networks, for example doing some post-processing on the gradients to make them look more like what we expect them to look like. But maybe what our work shows is that these post-hoc interpretations are not model-faithful. Maybe they are just playing into our confirmation bias, playing into our desire to see the models rely on human-meaningful features, but in the process actually masking out features that the models do rely on. So basically, to get this nice-looking, good interpretation, maybe you hid all the non-robust features that the model actually used a lot to make its prediction.
Another thing this work tells us is that if we care about building models that are robust, that are not fooled by adversarial examples, we actually need to change how we train our models in the first place. We can't just train our models to maximize accuracy, because then there's nothing stopping them from using non-robust features. So to get robust models, we need to explicitly disincentivize them from using non-robust features.
And going back again, this was, as we said before, the standard training formulation. And when we care about creating robust models using what we call robust training, what we do is we actually train the model against adversarial perturbations. So what we're telling the model is, I don't want you to only have low loss on the data points you see; you should also be invariant to these perturbations of the data points. You shouldn't only predict correctly on this data point, but you should predict correctly on some neighborhood around this data point. For example, within an L2 ball or under rotations, your prediction should be stable, basically.
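In symbols, robust training wraps the standard objective from before in an inner maximization over the perturbation set (again a rough rendering):

\[
\min_{\theta}\; \mathbb{E}_{(x,y)\sim D}\Big[\max_{\delta \in \Delta}\; \mathcal{L}(\theta;\,x+\delta,\,y)\Big]
\]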
And by doing so, you're kind of enforcing barriers on what features the model uses. Because once you add this robust training max component, you're basically making all the non-robust features not useful anymore, because they can be easily flipped within that perturbation set. And so a robust model won't use these features in the first place.
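A minimal sketch of one way to implement this, solving the inner maximization approximately with the PGD sketch from earlier (model, train_loader, and pgd_attack are placeholders, not the exact training recipe used here):

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for x, y in train_loader:
    # Inner maximization: find a worst-case perturbation of this batch.
    model.eval()
    x_adv = pgd_attack(model, x, y, eps=0.5, step_size=0.1, steps=7)
    # Outer minimization: update the weights on the perturbed inputs,
    # keeping the original labels.
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
```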
And another thing is that this highlights some trade-offs that arise if you want robustness. If you care about robustness, a model can only rely on robust features. It can't use any non-robust features, and thus it can't get any of the accuracy that comes from non-robust features. So this leads to some fundamental trade-offs between the accuracy that a robust model can get and its robustness. If you have a robust model, you don't use non-robust features, and any accuracy you could have gotten from those features is completely lost to you. It also means that robust models need more data to perform well, because they can only learn from robust features. In some sense you're working with less information, so you just need more data to perform equally well, basically.
And other really interesting-- sorry, yeah. Go ahead.
PRESENTER: I think there was a question from [INAUDIBLE].
AUDIENCE: Yeah, if you could go back to your last two slides. I was just wondering, when you look at robust training, is this any different from, for example, using image augmentation?
SHIBANI SANTURKAR: It's actually different. So what we do with data augmentation is that you randomly sample from the perturbation set-- for example, if you were doing Gaussian noise augmentation, you'd sample a bunch of noise randomly, add it to your data point, and then you would train on that. If the max here were replaced by random sampling from that perturbation set, it would be equivalent to augmentation. The difference is the max. Because the worst case in higher dimensions can be very different from the average case.
So this is, in fact, the same reason why models are not fooled by random noise. They're just fooled by adversarial perturbations. You can imagine that in higher dimensions, there's just this one point, maybe, that fools your model. But it's very hard to find it if you sample randomly.
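A quick hypothetical way to see this difference empirically is to compare accuracy under random noise of a given norm with accuracy under a PGD perturbation of the same norm (model, test_loader, and pgd_attack are placeholders):

```python
import torch
import torch.nn.functional as F

def accuracy(model, loader, perturb=None):
    correct = total = 0
    for x, y in loader:
        x_in = perturb(x, y) if perturb is not None else x
        correct += (model(x_in).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

eps = 0.5
# Random direction scaled to L2 norm eps vs. a worst-case direction of the same norm.
rand = lambda x, y: x + eps * F.normalize(torch.randn_like(x).flatten(1), dim=1).view_as(x)
adv = lambda x, y: pgd_attack(model, x, y, eps=eps)

print("clean:", accuracy(model, test_loader))
print("random noise:", accuracy(model, test_loader, rand))
print("pgd:", accuracy(model, test_loader, adv))
```

Typically the random-noise accuracy stays close to the clean accuracy, while the PGD accuracy collapses.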
AUDIENCE: Thank you.
SHIBANI SANTURKAR: Yeah. So another really interesting phenomenon connected with adversarial examples is this phenomenon of transferability. There was this observation that you can take the same adversarial perturbation and fool a bunch of models. And this was kind of weird if you think about it from the bugs point of view, because why should a bunch of different models with different architectures, trained completely independently, end up with the exact same bug? But it makes a lot more sense when you think about it from the non-robust features point of view. Because it says that there are features in your data that are predictive. And it's not surprising that all these models are picking up the same features and that an adversary can exploit the same features in every model, basically.
And so this was one of the other things that was very hard for us to reconcile before we understood this robust and non-robust features perspective.
And another really interesting thing that this viewpoint allows you to do is that now you can take a standard training set and restrict it to only contain robust features. I'm not going to go into how we do this, but it basically involves focusing on the features that a robust model uses. I'm happy to answer any questions about this. But you construct this new training set that contains only robust features. And what this gives you is that you can just train a standard model without robust training, and this model actually ends up being robust. So basically, because you eliminated non-robust features from the data set itself, you don't need to do any special training. Just using standard training, you can actually get a model that's good both in the standard sense and in the adversarial setting.
So far the discussion has--
AUDIENCE: Can I ask a quick question about that last slide? Yeah. So here, I guess, I am a little confused. Because the robustified frog doesn't really look like a frog to humans anymore, at least in this particular example. So I guess, how do you combat that? How exactly do we interpret this? Is it just that we've only come halfway?
SHIBANI SANTURKAR: Yeah. So that's a good question. So the problem is that we don't really know how to actually get a data set with just robust features. Right? This is like the same problem as building a robust model.
So the way we do this-- for us, this was more of like a conceptual experiment. We wanted to see that, OK, if our robust and non-robust feature worldview is correct, then somehow eliminating the non-robust features to some extent should just give us robustness hopefully. So this was not really meant to be a new training methodology but more as a way to confirm our hypothesis.
And so the way you do this robustification, the way we construct this new training set, is not perfect. Because we rely on the representation of a robust model, and the optimization process is not perfect, which is why this robustified frog looks a bit weird. So I don't think that this data set is perfect. And I guess what I'm trying to say is that, yeah, it's not perfectly just robust features, or it's not perfectly good features.
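As a very rough sketch of the robustification idea being described (one reading of the paper's procedure, with hypothetical names: robust_encoder is the representation of an already-robust model, x the original image, x_seed an unrelated seed image):

```python
import torch

def robustify(robust_encoder, x, x_seed, steps=200, lr=0.1):
    # Synthesize an image whose representation under the robust model matches
    # that of the original image, starting from an unrelated seed. The result
    # keeps (approximately) only the features the robust model relies on.
    target = robust_encoder(x).detach()
    x_r = x_seed.clone().requires_grad_(True)
    opt = torch.optim.Adam([x_r], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (robust_encoder(x_r) - target).pow(2).sum()
        loss.backward()
        opt.step()
        x_r.data.clamp_(0, 1)   # assuming pixel values in [0, 1]
    return x_r.detach()
```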
And another thing that I want to emphasize is that depending upon the task that you are thinking about, the robust features that you might need to solve that task might be very different. For example, if I was just asking you to classify between, say, frogs and airplanes, maybe it would be enough to learn the color green means frogs, and the color white or blue means airplanes. You don't even need the entire structure of a frog or the structure of an airplane to make-- to perform this classification task. So what robust features the model might end up learning might even depend upon how simple or how complicated the task is.
And this question hints at exactly what I'm trying to get at, that we as humans have certain priors on what robust features should look like, what features the model should use. And these are not entirely the same features that the model is using in the first place. Maybe for this model, the only robust features it needed were the eye and some green watery pattern, and that was enough to make a robust classification, basically. It didn't need the hands and the legs of the frog. Does that answer your question?
AUDIENCE: Yeah, I think so. Cool, thanks.
AUDIENCE: I have a quick follow up point on this question as well. So from what you've shown so far-- and an interesting example you gave as well, for example, about the texture bias in deep nets as well. So it seems like the approach from what's been discussed so far is that there is a divergence in perceptual alignment, and yet the way we're going to try to solve this problem is by modifying the data distribution.
So for example, with the style transfer or the texture bias, you artificially style-transfer these images, right, and you should get some sort of texture invariance, and in this case, as you propose, in your method you get a robustified object, I guess. But I wonder if the approach-- or maybe you're going to discuss this later-- should be focused on, say, the architectural constraints themselves or the operations or computations, and not the images. I don't know what your thoughts maybe are.
SHIBANI SANTURKAR: Yeah. That's another really good question. And I just want to clarify before answering your question that I'm not proposing this slide as a new method. It was just for us again to verify that our worldview was correct. And you're right. In a lot of these cases, what people are doing is trying to come up with new data augmentations, basically, thinking about noise or style transfer and all of these things to make our models more robust.
And I think that these are valid and valuable, but like you said, it's probably going to be hard for us to enumerate everything and collect data for everything. And so what I'm going to talk about a little bit is trying to change the training methodology to do this, rather than trying to collect new data. And I don't think it's an either/or kind of situation. I think maybe we need both and even more. But I'm going to just talk about how we can think about this in the context of the training methodology in the coming slides. But if something is not clear, I'm happy to, again, answer questions about this at the end.
OK. So just a quick lead-in into the next slide-- a natural question to ask then is, is it good that our models rely on non-robust features? Clearly these are features that we as humans don't use for our decisions. And even though they might give you some accuracy on the test set, is this how we want our models to solve the task? What happens if we just force our models to rely on robust features?
And the way we actually do this is using adversarial training. I talked about this a couple of slides ago; I'm just going to go back to the training modifications. So rather than changing the data, we can think about selecting a perturbation set-- in this case, let's say Lp perturbations. And we're just going to do the augmentation adaptively. Rather than collecting data for this kind of perturbation set, we're going to do it via optimization. You're just going to optimize within the set to find the worst-case example.
So what we see is, if you look at models that just rely on robust features, which I'm going to call robust models from now on, they have many actually interesting properties. So the first thing you see is, they end up becoming more perception-aligned in some sense.
So I showed you before that if you look at the gradients or the saliency maps of a standard network, they look pretty noisy. You do see some activity around the eye, but clearly the model is relying on features that don't really make sense to us. But as soon as you look at the gradients of robust models, they actually look much nicer. And there's been no post-processing done on the gradients. This is just the vanilla gradient of the model. And it just looks much nicer. And in fact, we're going to play around with some of these things in the demo as well.
And another really cool property of robust models is that they actually have better representations. So like we talked about earlier in the talk, the hope was that deep networks would allow us to learn representations that are good at multiple tasks and even things beyond maybe just classification and prediction.
But what happens with standard models is that you can easily find two images that look really, really different to a human-- there's a parrot and a dog-- that, according to the model, have nearly identical representations. And this is fairly complementary to the adversarial examples phenomenon, where you can have two images that look identical to a human but have very different representations. But for a robust model, it actually turns out that images that have similar representations also look very similar to humans.
And so what this tells us is that in the case of robust models, distance in the representation space actually aligns better with perceptual distance, with what we think of as similarity. And you can actually take this one step further and see that these representations have many other nice properties. For example, you can actually visualize what individual neurons in the representation correspond to.
And you see many of these nice patterns arise, like the turtle shell or the peacock feathers. And in fact, if you take some of these neurons on the right, for example this insect-like neuron, and you look at the examples in the test set that activate it most or least, they actually align with the kind of examples that we might think should be activating this neuron, basically.
And because of this, you can actually use these representations to do many nice things. For example, you can manipulate the input features. You can take an input image-- like this dog on the left-- and optimize it to activate the stripes neuron. And all of a sudden it adds stripes to this dog. Or you can do things like create these meaningful interpolations between images in the representation space.
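A minimal sketch of this kind of manipulation: gradient-ascend on the image itself to increase the activation of one coordinate of the robust model's representation (robust_encoder, image, and neuron_idx are hypothetical placeholders):

```python
import torch

def amplify_neuron(robust_encoder, image, neuron_idx, steps=100, lr=0.05):
    x = image.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Maximize the chosen representation coordinate (e.g. a "stripes" unit).
        activation = robust_encoder(x)[0, neuron_idx]
        (-activation).backward()
        opt.step()
        x.data.clamp_(0, 1)     # assuming pixel values in [0, 1]
    return x.detach()
```

The class-conditional generation mentioned below works the same way, except you maximize a class logit and start from a noise or seed image.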
And there's a recent paper by some of my lab mates that basically showed that these robust representations also transfer much better across tasks than standard models. And there's also been a lot of other recent interest and work showing how robust representations are better in many different senses.
And another really cool thing for us was that we actually found that with robust models, classifiers could do much more than just being classifiers. So in computer vision, it's very standard to train generative models, for example, to do data generation. But what we found is that with just a robust classifier, you can optimize the input to activate the class cliff, and you get this nice image.
And this is true for a bunch of other applications as well. You can do image-to-image translation, like going from a horse to a zebra, do superresolution, do inpainting. And actually on our lab blog, we have a bunch of notebooks that allow you to play around with these applications, if you guys are interested. But what this tells us is that robust models are not only good at classification, but they learn representations that very nicely lend themselves to a bunch of other different applications.
But so far-- so we talked about adversarial examples. And what started out as a very security-focused discussion, hopefully you guys now see more in the light of trying to understand how our machine learning models behave. And this is important not only from a robustness or reliability point of view, but also because we need to think about the features that our models base their predictions on.
For example, if you just look at ImageNet cats and dogs, there are some pretty weird correlations, like a lot of the cats are wearing bowties, or a lot of the cats are indoors and the dogs are outdoors. And these correlations can cause a lot of problems-- you'd realize that the model had just latched onto some correlation in the original data set.
For example, in the x-ray case, models latched onto artifacts of the machine from which the scan was taken. And in this case where people are trying to detect whether a tumor is benign or malignant, a lot of the malignant tumors had scales placed next to them. And so the model was just using the scale to detect whether the tumor was benign or malignant.
And this can be really problematic because your model might not be actually performing the task that you care about it doing. It might be getting high accuracy, but it's not actually solving the task that you cared about. OK.
So wrapping up the talk portion of the tutorial, I want to go through some of the key takeaways, hopefully, from this work. So the first thing is that adversarial examples are not just meaningless sensitivities of our models. They come from the fact that the models actually exploit really predictive yet brittle features in the data. These features are actually very good for test set performance and training set performance, and that's why the models learn them. So adversarial examples shouldn't be as puzzling as they might first seem.
And more importantly, this tells us that if you care about things like interpretability, you can't just train a model in a standard fashion and expect this model to be interpretable, or do some post-hoc processing to make it interpretable. Because fundamentally, this model is probably relying on non-robust features that don't make any sense to humans.
Another key takeaway from this work was that robustness is actually useful beyond the security context. Training models with adversarial training, or robust training, could make them more human-aligned in the kind of representations they use and also enable a broad range of applications. And more broadly, we need to think carefully about what features our models learn, and we kind of need to decouple that from our own priors and biases about how we expect them to learn.
And this raises many interesting questions. In different applications, you might be OK with models using different kinds of features. Maybe there are some applications where we don't really care about interpretability, and we just want models to do as well as they can, and their using non-robust features might be fair game. But in other applications, like in medicine or in any security application, we might care about interpretability. And thus training models without thinking about this could be problematic.
And going back, it also shows us that this framework of adversarial robustness, this framework of robust training, might be a nice way to enforce priors on the models. So going back to the question that I got before, one way could be to collect more data to kind of destroy these bad correlations that are there in the existing data. But another thing to do would be to actually train the models to be invariant to perturbations, so that we don't actually have to go and collect so many data sets.
And like was mentioned earlier, we have a blog where for each of these papers, there are small videos, and there are also demos for each of them that you can play around with. And yeah, with that, let's go-- I'm happy to take any questions. And then we can go to the demo.
AUDIENCE: I'll start off with one. So I was wondering how exactly you frame this idea, where adversarial examples are based on useful features in your data set, with something like an unsupervised training objective. So do unsupervised models train via different metrics? Is there, I guess, any training that has fewer adversarial examples?
SHIBANI SANTURKAR: Yeah. That's a really cool question. I think that unfortunately, it's not true. I really think that unsupervised training is really powerful. And there has been also a lot of cool work that shows that unsupervised robust training alone suffices to learn nice representations and things like that, where you train a model to be robust to perturbations-- to predict an image ID, for example, irrespective of some perturbations.
But I think in general, just training something, even with an unsupervised task, it's the same thing, right, about average case versus worst case. This model might be fine in the average case, but then it's still not invariant to-- But I don't know if this can be fixed. But I'm not familiar with any work that fixes it. I think so far, the only thing that we know that actually makes models robust is adversarial training or variants of it.
AUDIENCE: How-- can you explain a little more about the formula that you have, where you maximize over the perturbation?
SHIBANI SANTURKAR: So if you think about standard training, what we're doing is we're saying that we're going to sample a data point (x, y). You can think about this as the image and the label from the data distribution. And we're going to optimize the parameters, theta, to minimize the loss. So the loss could be some kind of classification loss, where you're penalizing the model for not predicting the label correctly on the data point.
In the robust setting, what you're doing is almost the same thing. The only difference is that you have this max, which you can think of as finding an adversarial perturbation. So if you just forget the expectation and just look at the thing inside the parentheses, what this is saying is that given any data point (x, y), I'm going to find a delta that makes the loss as high as possible. So what I'm going to do is attack this data point with an adversarial attack and find a delta within the allowed set that convinces my model, let's say, that this dog is a cat.
And then rather than training on the original image, I'm going to train on this perturbed image and tell the model, even for this perturbed image, you should still predict the original label. So you can think about it as having a data point, and within some perturbation set, you're finding the worst-case point, and you're saying no. You should be stable. You should still predict the same, even if I change the image this way. Does that make sense?
AUDIENCE: Yeah. So for an image, what would the-- I saw that image that you showed where you added some sort of noise image. But what would that really look like, this perturbation, in a case where x represents a full array of an image, for example?
SHIBANI SANTURKAR: It actually looks like the perturbation I showed you. In the case of an L2 perturbation set, that's what the perturbation would look like. But in the demo session, we're actually going to go through notebooks where you're going to construct your own adversarial examples as well. So hopefully that will help you also see what these look like for standard ImageNet models, for example.
AUDIENCE: Yeah, thanks. Cool talk. So could you please give us your definition-- the word robust is being thrown around a couple of ways, so I wanted to clarify. So right here, the word robust training depends on the delta, the space of those deltas. So it's really not one thing when you're setting up robust training. It could have been pixel manipulation, whatever. It's an open-ended definition of better training than the old training. Is that your definition of robust training? Is that a fair way to put it?
It's a general augmentation. You're basically saying, augment the training with an augmentation to be determined in the future. Is that a fair summary?
SHIBANI SANTURKAR: Sure. So the motivation for this nomenclature comes from the robust optimization literature, which is basically-- it's basically, this min-max optimization that's happening in here has been conventionally called robust optimization, which is where the terminology at least partly comes from. But you are entirely correct. It is complete-- the definition of robust training depends upon what the delta you define is. And I think it's a good thing in the sense that it just is a very general framework that can handle arbitrary perturbations.
So right now most of the work in this area from the CS point of view happens with respect to L2 perturbations or L-infinity perturbations or rotations. But obviously, these are nowhere close to all the perturbations that we should care about. And this framework is very nice because it kind of allows us to integrate more and more perturbations as we are able to define them. There has been some work recently on trying to learn these perturbations or to use Wasserstein perturbations, but in general I think about this more as a framework rather than a specific training method.
AUDIENCE: OK, fair enough. And then-- so then when you say robust features-- so robust features are anything that results from a robust training, or is it similarly robust features given a delta? Or when you're using the word robust features, can you define that against this? Or is it a different version of the word robust?
SHIBANI SANTURKAR: Sure. So I should have put the definition in the slides; I don't have it. But basically, the idea of a robust feature is that it is completely dependent on the delta that you define. And I'm going to try and say a mathematical definition out loud: it's a feature whose correlation with the label remains the same even if the feature is perturbed within this set. So it's similar to what you're seeing in this max term, but you can think about it in terms of specific features remaining correlated with the label.
I can try and open up the paper and show you, but the idea is that a robust feature is a feature such that even if you change this feature within this perturbation set, it will still remain correlated with the label.
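Roughly, the paper's formalization (a sketch, for binary labels y in {-1, +1} and a feature f mapping inputs to real numbers) goes along these lines:

\[
\text{$f$ is useful:}\quad \mathbb{E}_{(x,y)\sim D}\big[\,y \cdot f(x)\,\big] \;\ge\; \rho \;>\; 0,
\qquad
\text{$f$ is robustly useful:}\quad \mathbb{E}_{(x,y)\sim D}\Big[\inf_{\delta \in \Delta}\, y \cdot f(x+\delta)\Big] \;\ge\; \gamma \;>\; 0.
\]

A non-robust feature is then one that is useful but not robustly useful for any positive gamma.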
You can imagine-- a simpler example is two Gaussians. You can imagine that if the Gaussians are really close by, then by adding a small perturbation, you can actually flip the feature to be pointing towards the opposite class, whereas if the Gaussians are really well-separated, adding a small perturbation doesn't really change where the feature is pointing. Does that make sense?
AUDIENCE: I think so, although, I mean, you seem to be motivating it by humans. I think some of the ways you're using the term in the talk didn't seem completely aligned with what you just said now. So I think that word is being used loosely in several contexts. You could put up the image of the dog, and you said there are robust-- so I think some of the slides didn't exactly align with the definition you just gave.
SHIBANI SANTURKAR: So I guess I should clarify. So the definition-- so every human feature is a robust feature, because we don't remain--
AUDIENCE: But what's a human feature?
SHIBANI SANTURKAR: What I mean is, for example, when I looked at the image of a dog and classified it as a dog, I probably relied on things like the snout or the tail. And those still remain pointing towards dog, even when I added the perturbation. But that doesn't mean that they're the only robust features. It could be that for that perturbation set, maybe the background is a robust feature. Or it could be that we as humans ourselves know that, for example, if you were to replace the picture of the dog with something else with the same shape, you wouldn't change whether you call it a dog.
So I think we as humans are also kind of context-dependent in what we use as features. We have some inherent stratification of what features we will ultimately rely most on for making a prediction. But so I'm not saying-- I guess the intuition-- I'm not saying that all human features are the robust features that the model will learn, or-- that's not what I'm trying to say. I'm just saying that definitely, we know that when you make an adversarial perturbation, we as humans continue to see it as the original class. So whatever features that we rely on are robust, because they weren't changed by making that perturbation.
AUDIENCE: Right. Well, there you just gave a different definition of robust features. So again, you used it in two senses. One is the sense in which human brains do it, which is still unknown but assumed to be true. And the second is the one that results from this model training, and that's a different definition of robust features.
SHIBANI SANTURKAR: I'm not defining it in terms of the training. I was never defining it in terms of the training. I'm sorry I don't have the equation. I was just trying to say that even if you forget training -- if you just think about a linear classifier and you measure the correlation between a feature and the label -- if the correlation remains the same even when I perturb the input a bit, that's what I'm calling a robust feature. It's completely independent of the training method. It's something defined in terms of the data distribution alone.
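For reference, a rough version of the definition being described here, paraphrased from Ilyas et al. (2019), "Adversarial Examples Are Not Bugs, They Are Features" -- check the paper for the exact statement. With binary labels $y \in \{-1, +1\}$ and a scalar feature $f$, a feature is useful if

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\, y \cdot f(x) \,\big] \;\ge\; \rho \;>\; 0,$$

and robustly useful, for a perturbation set $\Delta$, if

$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[ \inf_{\delta \in \Delta(x)} y \cdot f(x+\delta) \Big] \;\ge\; \gamma \;>\; 0.$$

A non-robust feature is then one that is useful but not robustly useful for the given $\Delta$.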
AUDIENCE: OK, thanks.
SHIBANI SANTURKAR: OK. So on the GitHub repository, I'm sorry, there's an exercises PDF that's going to walk through some of the exercises we're going to talk about. I also created two different versions of the notebook: an exercises notebook and a solutions notebook. They're both pretty similar, so you could go with either. I think because we don't have a lot of time, I'm just going to run through the solutions notebook. But feel free -- I will point out where they differ, and we'll go from there. Sorry.
So if you guys are running this along with me, which I recommend, make sure that the runtime you're using -- which you can change here in the menu -- is GPU, so that the code is actually able to run. And I should have done this beforehand, but--
OK. So in the meanwhile, we can talk about some of the exercises that we're going to do. The outline I had for today was that in the first bit, we'd play a bit with adversarial examples and try to reproduce a simplified version of the simple experiment that I talked about earlier. Then we'd play a bit with gradient saliency and how adversarial examples interplay with it, and also look at some of the nice properties of robust models.
I'm not sure we have the time to go through everything, but I do hope that the solutions notebook will be something you can use to explore whatever we can't cover today. Was everyone able to run the first few cells? I'm hoping you were. OK.
So in our experiments today, we're going to start with-- the first exercise is on adversarial examples. In this, what we're going to try and look at is how to construct these perturbations to fool standard models. And to make this demo easier, I have embedded the code that does the gradient descent part. But offline you could try and experiment with rewriting this code. What we're going to try and do is, given an (x, y) pair, find a delta that fools a given model.
And I guess what I would recommend you play around with maybe is-- there are a couple of parameters that become essential here when you're doing gradient descent. The first one is-- so we're going to focus on the perturbation set of L2 perturbations for simplicity. So the big delta, the capital delta, is just small L2 perturbations that are smaller than a particular norm, which is going to be called epsilon. So you can play around with how big you make the epsilon, try to see what is the smallest perturbation you need to fool the model.
And then also some parameters that go along with it, the step size and the steps, which basically control the optimization process. So you could also play around with those. But we'll go through them as well. OK, so let's get started. So we're going to do experiments on ImageNet. And these are just a couple of random samples from the data set. And the nice thing about PyTorch is that it gives you access to a lot of pretrained standard models. So for this tutorial, I just randomly picked ResNet-18, but even during the tutorial, feel free to go on the PyTorch website, pick another architecture, and you can plug it in and explore it as well.
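A minimal sketch of the kind of model loading involved, assuming 2020-era torchvision (the notebook's own setup cells may differ):

```python
# Grab a pretrained ImageNet classifier from torchvision and put it in eval mode.
import torch
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet18(pretrained=True).to(device).eval()

# Any other torchvision architecture can be swapped in the same way, e.g.:
# model = models.vgg16(pretrained=True).to(device).eval()
```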
So the first thing we're going to do is construct adversarial examples. Now, what I talked about so far were largely untargeted adversarial examples. Here, the goal of the adversary is just to make the loss as high as possible. So let's say you have an image of a dog. The goal of the adversary is to make the model think this is not a dog, as much as it can. But there's another kind of attack you can do, which is called targeted adversarial examples, where the goal of the adversary is to make the model classify the image as a specific other class. Like, I want the dog to be classified as a cat or a frog.
So in this notebook we're going to play with some targeted adversarial examples. For example, I picked a random target class here, and it corresponds to the class tiger shark -- there are many interesting kinds of fish in ImageNet. You can feel free to play around with this, and also play around with these parameters to construct adversarial perturbations.
So in this L2 PGD function, L2 is basically the perturbation set that you care about, which as I said is small L2 perturbations. And PGD stands for projected gradient descent. This is the optimization procedure that we use to actually solve the problem. It's basically gradient descent, except that you ultimately need to restrict the perturbation you find to the set. So if you make the epsilon larger, you're basically saying that the adversary has more power -- the adversary can change the image more to fool the model.
And the step size and the steps kind of go hand-in-hand. They basically say how powerful the attack is. So these values are somewhat arbitrary. In general, the rule of thumb I tend to follow is to set the step size to about 2.5 times epsilon divided by the number of steps -- it's just a heuristic. But you can play around with changing the epsilon, and changing the step size to go along with it, and see maybe what's the minimum epsilon that you need to fool the standard model. So let's look at what these adversarial examples ultimately look like.
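A rough sketch of what such a targeted L2 PGD loop might look like in PyTorch. This is a minimal sketch with hypothetical names, not the notebook's actual L2 PGD helper, and input normalization is omitted:

```python
import torch
import torch.nn.functional as F

def l2_pgd_sketch(model, x, target, eps=2.0, steps=20, step_size=None, targeted=True):
    # Minimal L2 PGD sketch; the notebook's helper has a different signature.
    if step_size is None:
        step_size = 2.5 * eps / steps                  # the heuristic mentioned above
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        obj = F.cross_entropy(model(x + delta), target)
        if targeted:
            obj = -obj                                 # targeted: push *toward* the target class
        obj.backward()
        with torch.no_grad():
            g = delta.grad
            g = g / g.flatten(1).norm(dim=1).clamp(min=1e-12).view(-1, 1, 1, 1)
            delta += step_size * g                     # normalized gradient ascent step
            norms = delta.flatten(1).norm(dim=1).clamp(min=1e-12).view(-1, 1, 1, 1)
            delta *= (eps / norms).clamp(max=1.0)      # project back onto the L2 ball
        delta.grad.zero_()
    return (x + delta).detach()

# Usage with hypothetical tensors: adv = l2_pgd_sketch(model, images, target_labels)
```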
So here, what I'm doing is visualizing the original image in the top row and the adversarial example in the bottom row. On the top row you see the original labels, and in the bottom row, we've convinced the model that each image is the target class. There was a question earlier about what these adversarial perturbations look like. Again, I don't have it in the notebook, but you could visualize what the delta, the difference between these images, actually looks like. And you can see it looks very similar to the perturbation I had on the slide -- it just looks completely meaningless, at least in the L2 case.
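One quick way to do that visualization, assuming hypothetical tensor names `im` and `adv_im` for the original and adversarial batches shown above:

```python
# Not in the notebook: look at the perturbation itself. The raw delta has a
# tiny dynamic range, so rescale it to [0, 1] before plotting.
import matplotlib.pyplot as plt

delta = (adv_im - im).detach().cpu()
delta = (delta - delta.min()) / (delta.max() - delta.min() + 1e-12)
plt.imshow(delta[0].permute(1, 2, 0).numpy())
plt.axis("off")
plt.show()
```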
Again, please feel free to interrupt me at any point in the demo if I'm going too fast. I just want to get to some of the later bits, which are useful. So here, what we're going to do is recreate the simple experiment. The idea, if you remember, was that we trained a standard classifier. And again, in this notebook, to make it simple, we're going to do it with a linear classifier on two CIFAR classes, just because it's hard to train a model in real time. But offline, you can try and reproduce this experiment on the actual data sets.
So what we're going to do is construct this binary data set, which is basically just CIFAR airplanes and cats. I just picked them randomly, and you can also try changing these classes -- they control which of the CIFAR classes I'm using for the binary classification task. And this, maybe I should say, is just a simple image visualization function with which you can look at what the images and the labels are.
And so you can see that in this data set, you have all the airplanes labeled as zero and all the cats labeled as 1. It's just a subset of CIFAR, basically. So if you remember our original experiment, what we originally did was first train a classifier on the data set, and then use the model to find adversarial examples towards the opposite class. So we tried to convince the model that every cat is an airplane and vice versa.
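A rough sketch of how such a binary subset can be built; the notebook's own helper and exact class choices may differ (in CIFAR-10, class 0 is airplane and class 3 is cat):

```python
import torch
from torchvision import datasets, transforms

cifar = datasets.CIFAR10(root="./data", train=True, download=True,
                         transform=transforms.ToTensor())
labels = torch.tensor(cifar.targets)
keep = (labels == 0) | (labels == 3)                  # airplanes and cats
images = torch.stack([cifar[i][0] for i in torch.where(keep)[0].tolist()])
binary_labels = (labels[keep] == 3).long()            # airplane -> 0, cat -> 1
```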
And then we used these along with the flipped labels to construct a new data set and use this for model training and then evaluated on the original data set. And I guess while doing these exercises, I'm going to maybe pause for two or three minutes so that you guys can play around with it a little bit. But it would be nice for you to think about how well you as a human would do on this task if you didn't know-- if you just saw the labels zero and 1 and see how the model does. And then I'm going to do it along with you.
So the first thing we're going to do is train a linear classifier on this cats versus airplanes data set. And it's pretty interesting that even a linear classifier ultimately gets more than 70% test accuracy on this cats versus airplanes task. If training the model is giving you trouble or is too slow, you can just set this load-pretrained flag to true -- I saved one of the models that I trained, and you can use that instead.
So once this model trains, we will look at a couple of the images and the predictions. So note that this model has around 75% accuracy, and a random classifier would get close to 50% on a binary task. And you see that for most of these images, for all of these images, actually, the model is predicting correctly. The label is 1, and the prediction is 1. So the model is doing pretty well on this task.
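A minimal sketch of that kind of linear classifier, reusing the hypothetical `images` and `binary_labels` names from the sketch above (the notebook's own training loop will differ in its details):

```python
import torch
import torch.nn as nn

model = nn.Linear(3 * 32 * 32, 2)                     # linear classifier on flattened CIFAR images
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(20):
    logits = model(images.flatten(1))                 # full-batch gradient descent, for simplicity
    loss = nn.functional.cross_entropy(logits, binary_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    acc = (model(images.flatten(1)).argmax(1) == binary_labels).float().mean()
print(f"train accuracy: {acc:.2f}")                   # test accuracy needs a held-out split
```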
So what's next? The next step for us is to construct adversarial examples on the training set. Above, I saved image and target to be the training data that we originally used to train the model. And now I'm going to construct adversarial examples using the same L2 PGD function as before. Note that in the binary case, a targeted attack is the same as an untargeted attack, because there are just two classes -- fooling the model is the same as fooling it into the opposite class. So I just did it as a targeted attack, but you could do it the other way around.
So here I'm just feeding in the actual labels of the images, and I left the targeted argument at its default. I set epsilon to some pretty small value. And as you see, the examples look basically the same as they were before. And we'll just construct the adversarial examples.
Now, because I set the epsilon to be pretty small and I didn't really try to optimize these parameters too much, we may not be able to fool the model on the entire training set. It may be that we're only able to flip the model's prediction for a subset of these images. So let's look at how successful we were.
So we were able to fool the model on about 50% of the data. And here I'm just visualizing some of the samples on which we fooled the model. The top row is the original data point, and the bottom row is the adversarial example. They look pretty similar. And as you can see, on the top row the model is more or less predicting correctly, and in the bottom row the model predictions are flipped. So for this airplane, the model was first predicting zero, which was correct, and now it's predicting 1.
So basically, if you just focus on the bottom row and you treat the predictions as the labels, it's completely flipped with respect to the training data. Remember that to do our experiment, what we were going to do is train a model on just these samples. So again, what I'm doing is aggregating the adversarial images and the adversarial labels for the indices on which we were actually able to fool the model, because we don't want to include any clean data. We're just going to go all out -- we might end up with a smaller training set, but we're going to use a completely mislabeled data set. And the test data remains the same; it's whatever we used earlier to test the model.
So we can look at the data again, just to be sure that there are no tricks here. The training data, every airplane is now labeled as 1, and every cat is labeled as a zero. And since the test data was just the original one, every airplane is labeled as a zero, and every cat is labeled as a 1.
Does this make sense? So the goal is that we train on this completely mislabeled data set, and we test on the correct data set, basically, in some sense. And so now I'm going to train a new linear classifier completely from scratch with just this adversarial data, and we'll see what happens. So again, feel free to play around with the epsilon with which we're doing the attack and see how this impacts the accuracy of the classifier.
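A sketch of the aggregation step just described, again with hypothetical names (`adv_images` from the attack, `targets` the true binary labels, `model` the linear classifier that was attacked):

```python
with torch.no_grad():
    preds = model(adv_images.flatten(1)).argmax(1)

fooled = preds != targets                  # indices where the attack flipped the prediction
new_images = adv_images[fooled]
new_labels = 1 - targets[fooled]           # flipped labels, i.e. what the fooled model now predicts

# A fresh linear classifier is then trained on (new_images, new_labels) exactly as
# before, and evaluated on the original, correctly labeled test set.
```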
AUDIENCE: Hi, sorry. So quick question here. The expected result is that this classifier will perform well on the original data given this flipped training data. Right? So are these perturbations-- what exactly are these representing? Are these the most salient non-robust features, or is it some combination of features? Or-- what exactly are these perturbations representing?
SHIBANI SANTURKAR: That's a very good question. So just to clarify what we would expect: if you ignore the bottom row for a minute, and you or I as a human were trained on this data, then when I saw this airplane I would predict 1, and when I saw this cat I would predict zero, as in the bottom row. So what I as a human would get on this task is 0% -- I would just be completely flipping what the correct answer is.
But what you ask is a good question: what exactly are these adversarial perturbations? There are probably a ton of non-robust features, and the question is whether there's a specific non-robust feature, or a group of them, that the model is trying to pick up on. I think it would be really cool to figure that out, but I don't yet know how. Even when I talked about the robustified data set, actually decoupling these robust and non-robust features -- pinpointing the specific robust or non-robust features and trying to visualize them -- is very tricky.
So all we know is that there are some non-robust features and the model is using them. We don't really know what they actually look like or how they work.
AUDIENCE: OK. Got it. Thank you.
SHIBANI SANTURKAR: Yeah.
OK. So going back to this, we now trained a model on this completely adversarial data set. And if you see, I passed in the original test set, and it gets 77% or 75% accuracy, basically. And again, if you want to look at the data the model was trained on versus what it predicted: it was trained on data where the airplanes were labeled 1, but on the test data, it's still predicting label zero for the airplanes.
And this kind of goes back to what you were talking about earlier: this adversarial training data that we constructed looks to us basically the same as the original clean data, just with the opposite labels. But actually, that's not the case. Whatever features we added in going from the original data to the adversarial examples were not meaningless -- they were somehow correlated with the labels themselves.
And so even in this really simple linear classifier regime, you can already see this separation between robust and non-robust features, and you can try to get some intuition on why adversarial examples are happening. I know there's not that much time left, so I'm wondering whether to-- I'm happy to let you guys play around with this for a bit, or I'm also happy to continue with the exercises.
AUDIENCE: Could you maybe give a quick summary of the rest of the exercises so that people know what to expect if they do them themselves?
SHIBANI SANTURKAR: Yeah. So what I've also done in the exercises PDF is, for each exercise, put down the questions that you might want to ask. There are also the data sets for CIFAR and ImageNet that you can play around with to run the simple experiment on more complicated models. But in general, for every one of the later exercises, I've put down the questions you might want to ask yourself and the parameters you might want to play around with.
So even if you're not able to go over all of them now, you can follow along with this PDF, or reach out to me, and I'm happy to talk about what things are interesting. I think it might be better for me to walk through some of them, at least, to give you a sense of what's happening. So the next two exercises are going to be about interpreting standard models, and the three after that are going to be about the nice properties of robust models.
So let's try to go through some of the next exercises. The next thing we're looking at is trying to interpret what standard models do. One of the most vanilla ways in which you can interpret what a standard model does is this: given an input image and some network, you get a bunch of predictions, and you might want to understand, why did this model think that this image is a dog? Which pixels were important?
And the most natural way to do this is to find the gradient of, maybe, the prediction with respect to the input image. So this is what gradient saliency is, basically. It's simply the gradient of the loss -- or in this case, maybe you can even choose the [INAUDIBLE] or the specific predicted probability for that class -- differentiated with respect to the input. And ideally what we would envision is that this would highlight the pixels that are important for the prediction.
So the next exercise is on trying to explore the gradient sensitivity of standard models. In this exercise, we're actually going to go back to the ImageNet ResNet model rather than the linear classifier. This is just picking a bunch of images from the ImageNet data set. And I also wrote a simple function to find the gradient of the model with respect to the input pixels. Again, this is one of those things you could go and try to reimplement yourself.
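A minimal sketch of such an input-gradient function (the notebook's helper may be organized differently; `model`, `x`, and `y` here are the pretrained ImageNet model and a hypothetical batch of images and labels):

```python
import torch
import torch.nn.functional as F

def input_gradient(model, x, y):
    # Gradient of the loss with respect to the input pixels (vanilla saliency).
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return x.grad.detach()

grads = input_gradient(model, x, y)
# For display, each gradient is then normalized into image range, as in the plots below.
```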
And let's just look at what the gradients look like. So for a bunch of images, this is what the gradients of the standard model look like, with no post-processing -- we just normalized them to fit in an image range, but that's about it. And you can see that there are definitely some salient regions that are kind of highlighted, but it's not entirely human interpretable, and it looks super noisy.
So I guess, something to think about is how this interconnects with what we saw with non-robust features, the fact that maybe the models are relying on things like maybe texture here to make their predictions. Maybe that's why this region is so noisy. There are just these features in the data that we don't really think are super important, but still the models find them pretty predictive.
Yeah, so you could try to play around with this. And something that could be cool is to look at a bunch of different architectures. So I showed you above that we picked ResNet-18, but you could also try different models. And there's some discussion about how some of the earlier deep networks, like the VGG models, may be better in some sense -- they may have nicer properties, more robustness, or more robust feature dependence. So you could try and see how these gradients vary across models.
The next thing is, obviously you saw that this gradient looks super noisy. So there's been a lot of work on trying to make these gradients more human interpretable. And one of the simplest things people do is something called SmoothGrad, where the idea is that to find the explanation, rather than just looking at this image and taking the gradient of the predicted loss, you average the gradients from a bunch of nearby points. So it's basically smoothing the gradient space.
So we can try and implement it. In the exercise notebook, this is not completely filled in, so if you're on the exercise notebook, you might have to fill in a line or two here. But I will go over what those lines are. As you saw in the slide, what SmoothGrad does is, rather than just take g(x), it takes g at x plus some noise -- so the gradient at a noisy version of the image -- and then it averages those out.
AUDIENCE: Can I ask a quick clarification?
SHIBANI SANTURKAR: Sure, yeah.
AUDIENCE: Yeah. So this-- the smooth grad is just applied for visualization purposes. It's not a training method.
SHIBANI SANTURKAR: It's not a training method. It's not for-- it's for interpreting. So when people try to interpret, maybe, why the model made a certain prediction, one thing you could do is the vanilla gradients. But in fact, there's a lot of literature on interpretability that comes up with better interpretability methods. And you might ask me what better means, but maybe we'll get to that in just a second. So this is a way to try and understand why the model is making a prediction.
AUDIENCE: Cool. Thanks.
SHIBANI SANTURKAR: So basically with SmoothGrad, the lines that are missing just create a noisy version of the image by adding Gaussian noise to it, compute the gradient at that noisy version, and average this over a bunch of points. What I would have done normally is just use the gradient at the original image, but now I'm finding it at a bunch of points and averaging.
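A sketch of what those missing lines might look like, reusing the hypothetical input_gradient helper sketched above (the exercise notebook's function may be structured differently):

```python
import torch

def smooth_grad(model, x, y, n_samples=20, sigma=0.1):
    total = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy_x = x + sigma * torch.randn_like(x)      # (1) noisy copy of the image
        total += input_gradient(model, noisy_x, y)     # (2) vanilla gradient at the noisy point
    return total / n_samples                           # average over the noisy samples
```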
And if you actually look at SmoothGrad, it ends up looking way nicer. Once you implement it, you'll see that it kind of looks like this. One thing to play around with here is how many points you average over and the variance of the noise you add -- you could play around with all of this offline. And you'll see that for the same set of inputs and the same model, the gradients now all of a sudden look much cleaner. You can see some of the outlines of the dog and of the head.
And this smooth grad is just a simple step up from gradient, but there are a lot more complicated things that people do. And in general, the idea behind all of these is to come up with better explanations. But the question is, are these really better, or are they just aligning with how we as humans think that the model should make its prediction?
There are a couple of very interesting papers that show how many of these interpretability methods have nothing to do with the model. You can randomize the weights of the model, you can randomize a bunch of things, and the interpretation will still look the same, which suggests that maybe in the process of making these interpretations nicer-looking, we've hidden which features are important for the model in the first place.
Maybe I can quickly just go through the robust model gradients, because that's kind of cool, and then I'm not going to go through most of the remaining exercises. But just to show you what I was talking about with robust models: if you actually look at their gradients -- and these are just the vanilla gradients, not SmoothGrad -- they basically look a lot nicer and a lot more like what we would expect. A couple of exercises that I skipped over are about constructing adversarial examples for robust models, and then there are a couple of exercises to visualize the features that you saw in the talk.