Adversarial Examples and Human-ML Alignment
August 12, 2020
August 11, 2020
Aleksander Madry, MIT
Brains, Minds and Machines Summer Course 2020
PRESENTER: It's a pleasure to introduce Professor Aleksander Madry, who is a professor of computer science in EECS at MIT, working on the theory of computation as well as [INAUDIBLE].
And as many of you probably know, and as we discussed briefly yesterday, adversarial examples have been put forward as a major cause of potential concern about the degree to which human perception and machine perception are related or not. And he and others have made many important contributions to understanding and tackling the complexities of adversarial examples. So it's a great pleasure to have him here. And it's all yours, then.
ALEKSANDER MADRY: OK, good. Thank you, Gabriel. And thank you for the invitation. As you have probably realized by now, I'm not a neuroscientist. So it's actually fun to be here, because I discovered that a lot of the things I care about, which is adversarial examples and robust ML, actually connect to neuroscience in an interesting way.
So hopefully I can tell you a little bit about my view on this topic, how we think about it. And together, in the Q&A session afterwards, we can try to work out how the neuroscience angle looks on this. In particular, [INAUDIBLE] and his group already have some papers there. And we're also talking to George [INAUDIBLE] about this. I think there's much more to discover there, and it would be great to learn from it. OK?
So yeah, let's get started. So this will be adversarial examples and something I will call human alignment. So let's talk about that. And the starting point here is-- just a moment, some strange situation. OK.
The point here is that, of course, we have deep networks. And essentially, we use these deep networks as a kind of stepping stone towards getting synthetic human vision. OK? And there are some reasons for believing in that. One reason is that these classifiers are actually quite accurate. If I give them an image of a pig, they tend to realize it is a pig pretty reliably.
And also, in general, if you currently want to get state-of-the-art performance in computer vision, you just have to use deep neural networks. There is no better way to do it. So there's a clear success here.
And it goes even further than that, because it's not only about this kind of black-box performance. Even when you look a bit deeper, we believe that the way deep neural networks learn is, in some sense, human-like: there is a hierarchy of features, starting with very basic ones like edge detectors, and as you go deeper into the network, they become more and more high-level and human-alignable.
So this is a very desirable feature as well. These neural networks also seem to be good at cross-task generalization. It seems that once you train a model on ImageNet, it is possible to repurpose it for many other tasks. And we also have cool things like generative models, like GANs, which allow us to generate interesting distributions of images. OK?
So all of this is great, and all of this is very promising. And the question that started much of my research was: OK, so everything is going great, everything is promising. Does it mean that we are on the right path? That essentially all we have to do now is just keep doing what we are doing, just scale it up, and human vision and the kind of performance we desire will be on the way?
And the message I want to convey today is that, despite these promising signs, the answer seems to be a bit more mixed. In particular, even the best-performing models seem to deviate from how human perception works in unexpected ways. And this actually has consequences that we will talk about in a moment.
And as we will discover, one way of trying to understand what's going on-- what is going right and what is going wrong-- is to think about the features that our models see. OK?
So, you know, I already foreshadowed that not everything is as great as we would like in terms of deep neural networks being just a stepping stone towards human vision. And in particular, one thing that we all know by now that put a dent in this view is the notion of adversarial examples.
So as I said, usually if I just take an image of a pig and run it through my classifier, it usually will realize that this is a pig, as it should. But there is a problem. Essentially, if I take this correctly classified pig, I can add to it a perturbation-- not a random perturbation, but a perturbation that was carefully chosen.
And it's very tiny, so the resulting image looks essentially the same to us. We don't even see any difference as humans. What happens is that you can choose this perturbation-- the resulting inputs are called adversarial examples-- so that it suddenly convinces the model that this is no longer a beautiful pig. It's actually an airplane.
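To make this concrete, here is a minimal sketch of how such a perturbation is typically found-- projected gradient descent that increases the classifier's loss while staying inside a tiny L-infinity ball. This is an illustration, not the exact procedure from the talk; `model` is any image classifier, and the budget and step sizes are just placeholder values.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, step=2/255, iters=10):
    """Find a tiny perturbation of x that makes `model` misclassify label y."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        loss = F.cross_entropy(model(x + delta), y)
        loss.backward()
        # Take a step that increases the loss, then project back into the
        # epsilon-ball and the valid pixel range.
        delta.data = (delta.data + step * delta.grad.sign()).clamp(-eps, eps)
        delta.data = (x + delta.data).clamp(0, 1) - x
        delta.grad.zero_()
    return (x + delta).detach()
```

To a human the output is indistinguishable from the original pig; to the model it can look like an airplane.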
OK? So the usual joke is that AI is a powerful thing. You can make pigs fly. But, of course, this is clearly not the kind of behavior that you would expect or find desirable. And indeed, kind of, you know, it's not only about things that happen in the laboratory. It turns out that the deep learning-based vision has a lot of problems with robustness.
So here is just a Tesla car on 101 in California that essentially wants to drive into the divider, and the driver has to take over. Similarly, just a couple of months ago, at the beginning of June, there was a Tesla car and an overturned truck on the highway.
And the Tesla car, in driver-assistance mode, does not even notice it. It just drives into it confidently. The good news is that no one got injured in that crash. But essentially, this shows that there is clearly a failure of the vision system. OK?
So, you know, these things are happening. And there are many questions that you might ask now. Essentially, my research is about trying to figure out what we should do to our models to avoid situations like this. In particular, there are two questions that you can ask. One of them is: what do I do to avoid that?
But the other question, which also captivated my mind quite a bit, was: why is this a problem to begin with? In particular, why do these adversarial examples exist? And more importantly, why are they so prevalent?
So it turns out that adversarial examples are very easy to find. And the question is why? And of course, I was by far not the first person to ask this question. In particular, people put together a lot of hypotheses about what could be the reason.
So one hypothesis was that, well, we are working in high dimensions, so there will always be some corners that fluctuate and misbehave. Other people said that this is just a statistical phenomenon-- that once you sample things, there will be some sampling noise that will manifest itself in the form of adversarial examples.
Others said that, well, we are optimizing for expected performance, so the worst-case performance is bad. And others said that adversarial examples are just caused by the fact that we are using [INAUDIBLE], or we're using batch norm, or what have you.
And essentially, what all these explanations or hypotheses have in common is that they treat adversarial examples as aberrations. They say: if we did our learning in the ideal way, then all of these aberrations would go away. The only reason why we have this problem of adversarial examples is just because we haven't figured out yet how to do this training properly. OK?
So in particular, from this point of view, there is this natural view of what adversarial examples are in terms of thinking of the features. And kind of what this view says is just, you know, well if you think about all the features that you could use to classify our inputs, there is a spectrum of useful features that actually help in good classification.
But there are also these useless directions that the model is sensitive to for no good reason-- essentially because something went wrong with the model. It didn't sample all of the space. There are directions in the decision space that the model is sensitive to even though it shouldn't be.
And the understanding was that the way you get adversarial examples is just by playing with these useless directions. You just perturb in the picture something that doesn't matter to humans but seems to matter a lot to the model for no good reason. OK?
So essentially, the whole point was that if you train better models-- models that don't rely on these useless directions-- you would avoid this sensitivity and the problems it causes. OK?
So this was the view that kind of prevailed. And naturally, it was the view that we also had. But now let's try to look a little bit deeper and see if this is actually a justified view.
OK, so the first question will be-- before we even try to answer where these adversarial examples are coming from-- let's think for a moment and ask why adversarial perturbations are bad to begin with. OK?
And essentially, the point here I want to make is so let's look at the usual-- you know, at the usual adversarial examples. So we have a picture of a dog. We add to it some meaningless perturbation. And we get something that looks like a dog to us, but the model claims that this is a cat.
So clearly, this is undesirable, and we don't like that. Something went wrong here. But note that it seems that way to us, humans.
And the point I want to make is that the reason why this is upsetting is that, to humans, the output on the right is not what we, as humans, know it should be.
But now let's try to understand. So this is our human perspective. And from the human perspective, when I have this dog-versus-cat classification task, I know how it's supposed to be solved. I could use things like the snout or the shape of an ear. These are the features that I should use to distinguish whether there is a dog or a cat in the picture. And that's how we humans solve this.
But the important realization for what I want to say today is that this is a way to solve this dog-versus-cat classification task, but it's not the only way to solve it. In particular, if you think about this problem from the point of view of machine learning classifiers, well, to them, this image is meaningless. They have no notion of an image. To them, an image is just a collection of numbers, a vector of numbers.
And to them, the class "dog" is also meaningless. It's just a string of characters. They have no concept of a dog embedded in them. So essentially, their only goal is to look at these vectors of numbers, which are images, and associate them with strings of characters, which are the classes-- and to do it in a way that maximizes test accuracy, so that whenever they are asked, they get the correct answer as often as possible. That's all they care about.
And this is actually important. Because if we really want to understand this cat-versus-dog classification problem from the point of view of a machine learning classifier, then we should not think about dog versus cat, which is something we as humans have a lot of priors about. We should actually think about something abstract, like a tap-versus-toc classification.
What is a tap? What is a toc? I have no idea. But that's exactly the point. Neither does the model know what a cat or a dog is. It's just something. And if you look at this tap-versus-toc classification task, maybe after looking at enough examples, you realize that having this rectangle over here means you are in the tap class, and having a rectangle over there means you are in the toc class.
And this is a much better way of trying to visualize how things look from the machine learning model's perspective. And from this point of view, if you look at what an adversarial example is, it would look something like this: I have an example of the tap class, I add some perturbation to it, and I get the toc class.
And first of all, if you look at this adversarial example, somehow it feels much less upsetting than the adversarial examples on the top. In particular, at this point, you might want to say this perturbation is meaningless-- but is it really meaningless? Honestly, everything here is meaningless to us. So it's very hard to even judge what is meaningful and what isn't.
And essentially, kind of this realization is kind of what set us off towards thinking about this question. In particular, we wanted to understand, OK, so adversarial perturbations, even though they look meaningless to us, like this kind of-- like, you know, just black and white pixels. Like, they look meaningless to us. But are they actually meaningless? OK?
And to try to understand the answer to this question, we did this very simple experiment. So imagine that I just take a standard training set of cat versus dog. OK? I just take it.
Now what I do is something strange. I will take every image of a dog and find an adversarial perturbation of it that makes some model believe this is a cat.
And similarly, I take every image of a cat and perturb it in an adversarial way to make my model believe this is a dog. So now I have this new training set, which contains only adversarially perturbed dogs and adversarially perturbed cats.
And what I do is I use this training set, but I also flip the labels. So now, an adversarially perturbed dog in this training set is actually labeled as a cat, and an adversarially perturbed cat is labeled as a dog. OK?
So in some ways, we have a clearly mislabeled training set now, because to us this is still a dog and it's labeled as a cat, and this is still a cat and it's labeled as a dog. But this is the new training set that we use.
Now we have a training set. So what do we do? Well, we just train a fresh model on it. OK? That's all. I have a training set. I can always train a classifier on this.
And then the interesting thing is what I do next: I actually test this classifier. But I test it on the original test set. So I test it on a test set in which every dog is labeled as a dog and every cat is labeled as a cat. OK? So this is the experiment. And now the question is: what will happen?
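Here is a sketch of this experiment's pipeline, under the assumption that the perturbations are found with a targeted gradient attack against a standard pre-trained model; the function names, budget, and step sizes are illustrative rather than the exact setup used in the work.

```python
import torch
import torch.nn.functional as F

def targeted_attack(model, x, target, eps=0.5, step=0.1, iters=20):
    """Perturb x so that `model` assigns it the (wrong) target class."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        loss = F.cross_entropy(model(x + delta), target)
        loss.backward()
        # Descend the loss toward the target class, staying in a small ball.
        delta.data = (delta.data - step * delta.grad.sign()).clamp(-eps, eps)
        delta.grad.zero_()
    return (x + delta).detach()

def build_relabeled_set(model, loader, num_classes=2):
    """Adversarially perturb every image toward the other class, then keep the flipped label."""
    xs, ys = [], []
    for x, y in loader:
        target = (y + 1) % num_classes      # dog -> "cat", cat -> "dog"
        xs.append(targeted_attack(model, x, target))
        ys.append(target)                   # the label is flipped on purpose
    return torch.cat(xs), torch.cat(ys)

# A fresh classifier trained on this (to humans) mislabeled set is then
# evaluated on the original, correctly labeled test set.
```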
OK, so how will this model do on the original test set? At first you might say, well, this is insane. If I gave this new training set to a human-- and let's say they don't know English, so they don't know what dog or cat means-- the only consistent hypothesis they could come up with is that everything that looks like a dog is called a cat and everything that looks like a cat is called a dog.
So on the original test set, they would get 0% accuracy. OK? However, what accuracy does the model get? What's surprising is that it actually gets pretty good accuracy. For this cat-versus-dog task on CIFAR-10, it gets 78% of the answers right. OK?
What's going on? That's definitely not what our intuition told us. Of course, the first thing we did was check the code to make sure there was no flip of the labels happening somewhere. It turned out that the code was correct. So what's going on?
Well, once we examined all the assumptions we made and all the reasoning that went into this experiment, the only thing that could be wrong in our thinking was this: we had assumed that the adversarial perturbation-- the change to the image that we need to make to convince the model that a dog is a cat-- is just an aberration, something that is meaningless.
But maybe that's not the case. Maybe this perturbation itself is a feature. Maybe it actually corresponds to real, predictive features in the data. So the view that comes out of this is something we call the robust features model. If you go back to the previous picture of useless directions and useful features, you realize that the story is a bit more nuanced than that.
If you look at the useful features, there are, of course, what we call robust features. These are, in particular, the things that humans use for prediction-- things that remain correlated with the correct label even after a small perturbation, like the snout, the shape of an ear, or the background.
So that's something we knew existed. But we realized that there are also what we call non-robust features. These are features that are correlated with the label-- they actually are predictive-- but they're very fragile. They can be flipped by just a small perturbation.
And then, to a model, when we're training the model, well, it just wants to predict well. So any feature that is useful is a good feature for it. And it turns out that non-robust features actually end up being really useful: they carry a lot of predictive power about the correct answer for the input.
So what happens is that if nothing stops the model, and these non-robust features are useful, then that's what our models rely on. And because of relying on these non-robust features, that's how they become vulnerable to adversarial perturbations-- I can manipulate these features and thereby change the decision of the model. OK?
So this is almost the same as the previous view, except that these directions that we are flipping with adversarial perturbations are actually meaningful. They're not just useless directions. They actually correspond to features that are truly predictive of the correct label.
So, to go back to our simple experiment: from this point of view, when we created this new training set, the robust features became misleading. The things that we couldn't perturb much-- the things that we as humans use-- were now contradicting the assigned label.
But the non-robust features-- meaning the effect of the perturbation, which flipped these non-robust features to the cat position-- were now in sync with the assigned label. So the model is still able to learn the correlation between non-robust features and labels, and rely on those non-robust features to make predictions. And these correlations carry over to the original test set, so it actually recognizes a cat, even though, technically, from the human perspective, it has never seen a cat before. OK? Great.
So this is the change in perspective. And the question is: what now? Well, it gives us a new perspective on adversarial robustness-- on this whole field of thinking about resistance to adversarial perturbations. That's one conclusion. But the way I like to think about it is that it really gives us an insight into how our models learn. OK, so what do I mean by that?
Well, essentially, there is this question of human versus ML-model priors. Right? So again, if I am a human and I think about the dog-versus-cat prediction task, I have my prior about what is important to being a dog: the snout, the shape of the ear, and so on.
However, you know, we have to understand that for ML model, they don't have this prior. Like the way they recognize that a dog is a dog might be totally different to how we, you know, humans would do it.
And a priori, solving this classification task in a different way is perfectly valid. There is no reason for our models to favor the human-preferred way of solving this task. OK?
And indeed, this is not, in some ways, a new finding. People had already found that the way deep learning models succeed at different classification tasks involves relying, or over-relying, on things that humans would not rely on that much. OK?
So in some ways, we already knew that the way models solve the classification task is not how humans would solve it. And in some ways, this is what adversarial examples show: there is nothing wrong, by default, with models that are vulnerable to adversarial examples. The problem that we have with them is not a problem of non-robustness. It's the problem of them not conforming to what humans think is the right way to solve the cat-versus-dog classification task. OK?
So in particular, this realization has a bunch of important consequences. Like one of them is for interpretability. So essentially, we would like to understand why our models make a particular decision. And what this view tells us is that non-robust features, actually, well, they cannot be human interpretable. OK?
Because we, as humans, could never rely on them for classification. So if the model depends on them, we will never understand what exactly it is doing with them.
So, for instance, we can see this if we just look at saliency maps. If I look at the saliency of a prediction-- why the model predicted, say, a bird-- I will get things like this. And to me, it's very hard to understand why the model is looking at all of these pixels. Essentially, it's solving the classification task, but solving it in a way that I, as a human, may never be able to fathom.
So essentially, if we want to have an interpretable model, you have to make an intervention in the training [INAUDIBLE]. You have to force the model to solve-- not only to just solve the task, but actually to solve it in a way that will align with how we humans want to think about decision problems.
And of course, there are methods that kind of will deliver like a post hoc nice interpretation of an image. But the problem is that they often essentially mask the actual features the models depend on. So they just, like, tell us a nice story of why the model might make the decision that it made. But actually, the story is at least hiding part of the truth and part of the actual [INAUDIBLE] statement. OK?
So this also tells us that if we want to have robust models, we really need to explicitly train them to ignore non-robust features. OK? This corresponds to something called robust training, in which, essentially, you force the model, through its objective, to be invariant to small perturbations. You push the model to not base its prediction on things that can be easily flipped by a small perturbation.
And you can view exactly this robust training as a prior, by itself, on how the model should learn. And you can imagine imposing even further priors that, again, align the way the model is allowed to solve the decision problem with how we, as humans, would solve it. OK?
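One way this robust-training prior is often formalized-- the standard min-max formulation of adversarial training, with epsilon denoting the allowed perturbation budget-- is to replace the usual expected loss with a worst-case loss over small perturbations:

```latex
\text{standard:}\quad \min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathcal{L}(\theta, x, y)\big]
\qquad\longrightarrow\qquad
\text{robust:}\quad \min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\max_{\|\delta\|\le\varepsilon} \mathcal{L}(\theta, x+\delta, y)\Big]
```

Any feature whose correlation with the label can be destroyed by a perturbation of size epsilon stops being useful under the inner maximum, which is exactly why the model is pushed away from non-robust features.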
And once you do that, once you actually impose this robustness prior, there is a trade-off that you will notice. Because when you insist that the model is robust, so that it only leverages robust features, you are, in some ways, hampering it. Non-robust features are useful for getting good accuracy on the test set, and you are preventing the model from using them. So what you observe is that robust models tend to have lower standard accuracy. And they also tend to need more data to obtain a given robust accuracy.
So in some ways, they are less efficient. Because again, you hamper the way they extract information from the data. And, you know, so that's kind of-- that gives us a view into the phenomena. And now you can wonder, OK, so but once I impose this robustness prior, well, I clearly would have a robust model. But like, what else will I get as the potential benefit out of it? And it turns out that, you know, as we discovered, you get quite a lot.
So in particular, what we discovered is that for vision models, using this robustness prior actually leads to much better perceptual alignment. Models become more aligned with human perception. So here, imagine I have an image of a dog that is correctly classified as a dog, and I want to understand why.
And again, one nice way to do this is to use saliency maps. I look at every pixel and see how much influence this pixel has on the decision, and that gives me a heat map. If I do this for a standard model, I get pictures like this, which kind of seem to make sense, but clearly also depend on things that we, as humans, would not view as important to being a dog, like this rug. That's what you get from standard models. But if you do it for robust models, you get heat maps like the one on the right, which seem to be much closer to what we, as humans, would think is the right thing to look at. OK?
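For reference, here is a minimal sketch of the kind of gradient-based saliency map being described-- the gradient of the predicted class's score with respect to each input pixel. This is a generic illustration; the model and image are placeholders, and the exact saliency method used on the slides may differ.

```python
import torch

def saliency_map(model, x):
    """Per-pixel influence on the top predicted class, for a single image x of shape (1, C, H, W)."""
    x = x.clone().requires_grad_(True)
    scores = model(x)                       # (1, num_classes)
    scores[0, scores.argmax()].backward()   # gradient of the top class's score
    # Aggregate over color channels; larger values = more influential pixels.
    return x.grad.abs().max(dim=1)[0].squeeze(0)
```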
And this kind of phenomenon shows up along many dimensions. In particular, remember we talked about these different levels of representation-- deep representations. The second-to-last layer is usually viewed as the one gathering the most high-level, salient features of the image.
And if you do this for standard models, you can come up with two completely different images that end up having almost exactly the same representation, even though to a human they look nothing alike. So there are images that, to a model, look similar, but actually are quite different.
And if you look at robust models, we were not able to find two images that are similar in their representation but different to humans. So there is an alignment between what the representation space tells us about proximity and human intuition. In some ways, the representations become aligned once you use robustness.
You know, moving on, we find that robust models have better features-- meaning, again, if you look at these high-level features, it's much easier to assign which neuron is activated by what. So you can find that this neuron is detecting, say, an insect leg, or this neuron is detecting frog-ness.
And then once you have that, you can do feature manipulations. You can just ask, OK, can you show me the version of this image that excites the stripe neuron the most? And that's how you can put stripes on animals. Or you can do semantic interpolation. Like take two images and you would like to have a semantic interpolation between two objects.
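Here is a rough sketch of the kind of feature manipulation being described: starting from an image, take gradient steps that increase the activation of one chosen channel in a deep layer (the hypothetical "stripe" neuron). The layer, channel index, and step sizes are all illustrative; with a robust model, this kind of optimization tends to produce changes that are recognizable to humans.

```python
import torch

def amplify_feature(model, layer, x, channel, step=0.5, iters=50):
    """Gradient-ascend the mean activation of one channel in `layer`, starting from image x."""
    acts = {}
    handle = layer.register_forward_hook(lambda mod, inp, out: acts.update(out=out))
    x = x.clone().requires_grad_(True)
    for _ in range(iters):
        model(x)
        obj = acts["out"][:, channel].mean()   # activation of the chosen "stripe" channel
        obj.backward()
        # Normalized gradient step, keeping pixels in a valid range.
        x.data = (x.data + step * x.grad / (x.grad.norm() + 1e-8)).clamp(0, 1)
        x.grad.zero_()
    handle.remove()
    return x.detach()
```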
So essentially, like there are lots of benefits. We also recently discovered that they actually, like-- these robust models are actually better at transferring between tasks. So essentially, there is a lot of benefits like that.
And then again, you can do more, and more, and more. And you can look at all of the traditional applications of deep neural nets. And it turns out that if you use a robust model, usually you will get something better, or at least something better from the point of view of human alignment. OK?
So essentially, the message I am trying to convey here is that, in general, how deep neural networks behave really depends on which features they choose to leverage to solve the classification task. And robustness can be viewed as a prior that aligns this choice of features to be closer to the kinds of features that humans would use.
OK? So that's the highlight of what I said. And I just want to use these last two slides to make this point: when we think about machine learning classification tasks, we really should think about what our model does. We should not just look at the accuracy on the test set. We should actually think about how the model gets the accuracy it gets, and why.
Because we know that the way our models learn is purely correlation-based. But it turns out the correlations can be weird. Just as one funny example, one thing that we noticed is that some of the ImageNet cats, and also some of the ImageNet dogs, have bow ties on them.
People put bow ties on their pets and take pictures. Now, when bow ties are more common on cats than on dogs, suddenly the bow tie becomes, by correlation, a feature that indicates a cat. In particular, if that is the case and I add a bow tie to a picture of a dog, I suddenly make this picture look more like a cat to the model.
So again, to us this is funny. As a human, you would never think that adding a bow tie could change the classification from dog to cat. But to a model, it actually does. OK?
And this is, of course, a funny thing-- a bow tie in cats versus dogs, who cares? This is just amusing. But it turns out that exactly the same phenomenon shows up in much more serious contexts. In particular, in health care, there's a lot of talk about the use of deep learning.
And essentially, people say oh, we will use this deep neural network classifier. And it outperforms a physician in diagnosing, I don't know, just like whatever is the lung disease that they want to diagnose-- OK, let's say tuberculosis.
And, you know, people did that, and they were super happy with the results. They said that in five years, the radiologist would be out of a job because of deep learning. But then they realized that it's not that easy.
When they looked a bit deeper into why these top-performing models actually performed so well, they realized that these models-- for instance, one of them-- relied on the type of the X-ray image, essentially which machine produced it, as an important feature for making the prediction. And what had happened was that tuberculosis was relatively rare in developed countries, so to get positive cases of tuberculosis, they had to import images from less developed parts of the world, and those usually come from older machines.
So essentially, the model never really had to diagnose whether the image actually shows tuberculosis or not. It just had to decode which machine took the image. If it was an old one, then it was very likely a positive case. And if it was a newer one, it was a negative case.
A similar thing happened with rulers placed next to pictures of skin lesions, taken to see whether a lesion was malignant or not. When you go to a physician and they take such a picture, they usually put a ruler next to the lesion, because that's diagnostic practice. So the presence of a ruler in the picture becomes a feature that makes the model believe this is a malignant case-- because, of course, when you go to a doctor, you usually have a good reason for going there.
OK, so in some ways it's these obvious patterns, like the type of the image or the presence of a ruler in the picture, that are predictive. But they're also very misleading. Even though the models get very good accuracy-- better than human accuracy-- they are not actually solving the task we intend them to solve. OK?
And of course, you might ask yourself: what do I do? How do I figure out if my model is doing something it's not supposed to? Right now there is no good answer. But there is one cute tool that we found that you can use with robust models. It's "counterfactual" analysis, with quotes, because it's not truly counterfactual. But you will see why this is the name.
So imagine here we have an example of an input from ImageNet. It's correctly labeled as an insect, but a model-- even a robust model-- claims it's a dog. And at first you say, well, the model got things wrong. This happens. Let's move on, nothing to see here.
But what robust models allow you to do is to ask: if you believe that this image is a dog, can you morph this image to make it look even more like a dog to you? And when you ask the model to do that, what you get is the picture on the right. And if you look at the picture on the right, you actually see, in the top-left corner, a dog. And now the prediction of a dog makes perfect sense to you. OK?
But now, what is interesting is that if you go back to the original image, look at the top-left corner, and squint your eyes, you realize that the arrangement of the insect there actually looks a bit like a dog's head.
And ImageNet has a lot of dogs in it. So whenever there is a chance that there is a dog in the image, the model will go for it. So essentially, this is the root of what made the model classify things the way it did. You can do this kind of reverse engineering.
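Mechanically, this "make it look even more like a dog" probe can be sketched as gradient ascent on the model's score for the predicted class, with a perturbation budget large enough that the change becomes visible. This is an illustration under assumptions-- the names, norm, and budget are placeholders, not the exact procedure from the slides; the key point is that with a robust model the resulting change tends to be human-meaningful.

```python
import torch

def morph_toward_class(model, x, target_class, eps=5.0, step=0.2, iters=60):
    """Move image x in the direction that maximizes `target_class`'s score (large L2 budget)."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        score = model(x + delta)[0, target_class]
        score.backward()
        g = delta.grad / (delta.grad.norm() + 1e-8)
        delta.data = delta.data + step * g           # ascend the target-class score
        if delta.data.norm() > eps:                  # project back onto the L2 ball
            delta.data = delta.data * (eps / delta.data.norm())
        delta.grad.zero_()
    return (x + delta).clamp(0, 1).detach()
```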
So in general, you can view robustness as a framework for controlling which correlations the model extracts from the data. OK?
So that's all I wanted to say. I know I went fast, so I'm happy to revisit some of these topics in a moment. I just want to pontificate a bit and give, let's say, the takeaways from what I said.
So first of all, there is the question about adversarial examples. And the claim is that adversarial examples arise mostly from non-robust features of the data. They could also arise from issues in the model itself, but by and large, they arise because of this reliance on the non-robust features of the data.
And the point is that non-robust features actually do help with generalization. So we should not blame the model for using them. If its only goal is good generalization, understood as performance on the test set, it actually should use them, and it's a good thing that it does. OK?
But then, once we allow them to use those features, we have to be OK with adversarial examples. And we also have to be OK with the model being hard to interpret, because if I want interpretability, I need to make sure that the way the model solves the problem is something that I, as a human, can comprehend to begin with. OK?
And as I said, this robustness prior-- training models to be robust-- turns out to lead to more human-aligned representations. Because of that, it enables a broad range of vision applications in a rather simple way. And it also supports finding these simple counterfactuals, like the one I just told you about. OK?
So this is about adversarial examples. But again, I just want to reinforce this message: in the end, from the ML point of view, the lesson to take here is that whenever you evaluate a model, you should not just look at accuracy.
You should actually understand how your model got that accuracy, and really learn how to control what our models learn. Because what they learn is not necessarily what we think it is. So we can ask: what is the right notion of generalization to have? Maybe the current test-set generalization is just not good enough. It clearly isn't.
And what features do we want our models to use? In particular, how much do you value the model being human-aligned? Making it human-aligned or interpretable will come at a cost, because you are hampering the predictive power of your model. So you have to decide when this is a good thing and when it is not.
And again, robustness is just a framework for engineering which features your model relies on and which features it doesn't. And in the spirit of this summer school, one question I really want you to ask-- and I'm happy to go into more detail during the Q&A session-- is the thing that really interests me, where I know some partial answers but think there is much more to be understood: how should this robust-ML view inform neuroscience? And the other way around-- how can neuroscience inform how we develop robust machine learning classifiers?
OK, so that's a question I really would be curious to hear your thoughts on. And on the website, there are more materials too. In particular, my student, [INAUDIBLE], put a longer version of this presentation online, and there's a video. She also put up a lot of demos for playing with robust models. So I really encourage you to take a look there.
And also GradientScience.org-- this is the blog of our group, where you can learn much more about this robust ML. And now, without further ado, I'm happy to answer questions. Thank you.
So yeah, OK so essentially, like I'm now looking at the Q&A. And I guess I will just go over questions ordered by how much they are up-voted. So I first will start with a question by Miguel.
And what he asks is: many of these perturbations are done at the pixel level, but we, as humans, do not analyze image content at that level. How can we compare artificial and biological systems taking that information into account?
So that's a great question, and it is exactly one of the places where I think the connection to neuroscience will be extremely important for robust ML. Yes, to me, this view of pixel-level perturbations is a good sandbox for understanding the tools and the fundamentals, but ultimately these are not the perturbations you should think about.
In the end, it's all about representations. When we talk about biological systems, we know the biological system has representations in the brain. The way it perceives-- it uses certain features, it pays attention to certain signals. And what would be great, if you want human-like AI systems, is to figure out how to replicate the representations that humans use in our AI models.
So in some ways, you know, the dream I have is that in some ways, robust training, as we said implicitly, seems to push the models towards being not sensitive to pixel-wise perturbations and not sensitive to the things that would not make sense to humans. But we are still a long way away from that.
So what I think would be interesting is understanding what kinds of priors, or what kinds of conditions, we can extract from our knowledge of how biological systems-- how biological vision-- work, and replicate them directly in our AI systems to make them more robust.
Yes, there's a follow-up question. It connects to autoencoders, but I don't want to go into that now, so I'll just move on to the next question. But yes, it connects to autoencoders. Again, in the end, it's always about how we represent and process the visual data we get.
OK, the next question is from Ferdinando. Do we know if those failures of self-driving cars are due to adversarial perturbations? That's a great question. And we don't. In a sense, this is the scary part: we don't even know why these things happen. Because again, this is Tesla.
This is Tesla's internal knowledge, and they don't share it. So we don't know. But at least in the case of the overturned truck, it seems that this was an out-of-distribution input-- maybe the system had just never seen an overturned truck. So it panicked and didn't know what to do.
There was also the unfortunately tragic Uber crash, you know, many months ago. And then there was actually an analysis of what happened. Essentially, what happened there was that the model was very brittle. Like, it was changing the prediction of what was in front of it.
And the system, each time the model changed its prediction, kind of delayed making any decision, which was, again, a mistake. And there, it is clearly brittleness. It's not clear that this is adversarial brittleness-- I view adversarial examples as an extreme manifestation of this standard brittleness.
But yeah, so there, I would claim that exactly this brittleness was at least partially at fault. There was also bad systems engineering in terms of safety. But the root cause was this brittleness.
OK, now there is a question by Sraya. So: my question is for Dr. Madry, from his reading list on robust features and adversarial examples. Usually, for neural networks, the priors used for training are provided by the dataset and the choice of hyperparameters. What other priors can be introduced while training? How do we differentiate between robust and non-robust features?
If we make our models less sensitive, how would that affect generalization and transfer learning? So that's a great question. I guess you asked it before I got to that part of the talk. But let's go through it one part after the other.
So yeah, how do we impose priors beyond the dataset and hyperparameter choice? I already gave one: robust training. I didn't tell you exactly how you do it, but essentially, while you are doing the training, you figure out how to trick the model. You start with an input and figure out what perturbation of this input would make the model misclassify it.
And then you present this misclassified input to the model, and you tell it: you should not have been fooled by this. What really happens is that you are forcing the model to be invariant to certain perturbations. This is actually a prior-- this is the robustness prior I'm talking about. So you can induce it during training. And again, in the reading list, some of the blog posts go in depth on how it works, and I encourage you to look there.
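As a rough illustration of that procedure, here is a minimal adversarial-training loop, assuming a perturbation-finding helper like the `pgd_attack` sketch from earlier; the model, optimizer, and data loader are placeholders.

```python
import torch
import torch.nn.functional as F

def robust_training_epoch(model, loader, optimizer, attack):
    """One epoch of adversarial ('robust') training."""
    for x, y in loader:
        x_adv = attack(model, x, y)     # find a perturbation that fools the current model
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()                 # "you should not have been fooled by this"
        optimizer.step()
```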
Now, how do we differentiate between robust and non-robust features? This is the subtle part-- the weirdness here is that we never explicitly differentiate between robust and non-robust features. All we are doing is making sure that the model's decisions are invariant to small perturbations.
So what it ensures is that, you know, you will not rely on non-robust features. So then the only features left that you can rely on are the robust ones, by definition. But, you know, so it's kind of implicitly we ensure that you are not using the non-robust features. But we don't even have a-- you know, we can't essentially look at the feature and say, is it robust or non-robust? It's always kind of a very indirect way of dealing with that.
And yeah, as I said in the talk, yes, we are making models less sensitive to actual signal in the data. And this has consequences. It definitely affects generalization: the standard accuracy of these models gets lower. Again, you might ask whether that is the right generalization to measure.
Because if you got the accuracy from some weird but predictive signals that are not human-aligned, is that the accuracy you actually would like to measure? In particular, on ImageNet, state-of-the-art classifiers seem to outperform humans. But maybe they outperform them by using signals that are predictive of the correct label but that humans would not deem a correct way of doing the classification.
We actually have a recent paper-- you can find it on our blog-- that goes exactly into this issue, so I will not go any further. But it's a good question what the right notion of generalization should be. Still, yes, if you just look at accuracy on the test set, you pay for it when you train a robust model.
Interestingly, for transfer learning, as we discovered, even though better accuracy normally helps you with transfer learning, it turns out that despite this penalty in accuracy, robust models still tend to outperform standard models in transfer learning. This is a new result that we have; it's also on the blog. So the benefits of robust training outweigh the penalty of having lower standard accuracy.
OK. And the next question is by [INAUDIBLE]. So: the over-parameterization and depth of neural networks is essentially what lends them the power to express complex nonlinear functions. But the over-parameterization is also, at the same time, giving one a million ways to manipulate the network.
What do you think is the fundamental problem here? Is it the over-parameterization or the learning algorithm, or do we need better regularization methods? Or do we need some fundamental change in the structure and the mathematics of neural networks?
Well, that's a big-- I would say million-dollar, or probably actually billion-dollar, question at this point, given how much money is riding on figuring out how to make networks work better and more robustly. So yeah, this over-parameterization seems to be a feature, not a bug. What is amazing about deep neural networks is that they are extremely good at finding good features: somehow they look at the input and latch on to features that are useful for getting good accuracy on the training set.
We don't really understand this mechanism yet. We seem to believe-- we have evidence-- that over-parameterization, this almost infinite flexibility of networks, is what makes it easier for them to figure out the right features to use. But on the other hand, the problem, as we discussed in the talk, is that sometimes the things they learn to get good accuracy on the test set are not the things that we would actually want them to learn.
So what we need to figure out is some way of imposing priors on the model. And ideally, we would like to impose the prior while keeping the benefits of this flexibility. Again, I don't know what the answer is here-- that's a very active area of research. But we view robust training as one way of imposing a prior that respects the network's over-parameterization.
Although it also seems to restrict it, so I guess that is a bit counterproductive, and we are trading off flexibility-- extracting all the features that allow it to get good accuracy on the test set-- against considerations of being human-aligned and robust. But honestly, I am not able to say more than that, because it's something I just don't know. Especially on record, I will not speculate about what I think is the right way to go. OK?
OK, now Juan asks: there is some work from Google Brain that showed that when humans are presented with adversarial examples very briefly-- i.e., engaging only a feed-forward pass of the visual system, like a CNN-- they make similar errors to CNNs. What would you make of that?
Yeah, so that's a very interesting paper. I liked it a lot. And essentially, this goes back to the question of priors. In some ways, again, you know, I am not a neuroscientist. So I cannot fully judge-- but I think that some of the people on the paper actually are neuroscientists. So I presume they did the experiment right.
And it's really about only giving enough time for the forward pass, not the [INAUDIBLE] backward pass. But I think it actually illustrates exactly the point that we are making here: the reason why we, as humans, tend to be robust to adversarial examples might be because we have an understanding of the physical world around us, and maybe this understanding is what corrects us-- keeps us from depending on things that we know, or that nature told us, are not important.
What is interesting here is that it might suggest that our visual system still perceives these kinds of small differences. So the part that I think is definitely in line with what we found is that once you engage the human priors, this kind of makes you robust to adversarial examples.
And I'm still on the fence about how to interpret the fact that, in that setting, humans seem to be comparably sensitive to this. It is not incompatible with our theory, but I'm not sure yet what to think of it.
So kind of [INAUDIBLE]. So maybe not a very satisfactory answer. But like, this is what-- but I find this paper very interesting. And I actually would love to have more experiments like that, that go in-depth in some ways, just to reproduce this idea, like this experiment, and kind of do some creative extensions of that. So if some of you do something like this, please send me the paper. I would be very curious to see it.
OK? Now a question by Sophie. Regarding your final question, how much have people tried to understand the biological reasoning behind optical illusions and how they relate to the underlying causes of these adversarial examples? So that's an excellent question.
Because again, in some ways, this is exactly the point that I like to make: humans also seem to have adversarial examples. They are very different from the adversarial examples that standard neural networks have, but humans do have them-- they're called optical illusions.
So yes, that would be a very interesting thing to do: to understand exactly why illusions work. And honestly, I'm sure people have studied this; I just probably don't know the literature. So if some of you know the literature on analyzing, from the neuroscience perspective, why optical illusions work, please do send it my way.
But yeah, I view optical illusions as evidence that humans also have blind spots. The representations we use also depend on things that are not robust features-- though, of course, robust features here mean something different.
And yeah, it would be fascinating to compare what is the mechanism-- like we kind of understand what is the mechanism for these neural networks, and then understanding, you know, what is the mechanism for humans? You know, like again, hopefully someone already figured this out. I would love to know. So yeah, if anyone has any leads on that, just please send me an email. OK?
And now another question from Sophie. So for medical diagnostic purposes from images, is it known whether models that are robust to human perception are better or worse than ones that include non-robust features to diagnose [INAUDIBLE]?
So that's an excellent question. So that kind of goes back into saying, OK, clearly things that are more human-like and essentially like if the model uses only things that I, as a human, use, you might argue that yeah, these are the only things I should use.
And again, in some ways, what you are getting there is the hope of interpretability and explainability. So that's clearly giving you some confidence and a good feeling. On the other hand, sometimes there might be actual, legitimate signals that we, as humans, cannot perceive.
But the model can. And you could imagine having a diagnostic device that uses things that humans don't perceive, and that this actually makes it better and, in some ways, more robust.
So for instance, like I know that apparently like dogs can-- like, I think they can do diagnoses of some kind of cancer because apparently like the way a patient smells changes. And they can actually detect it. So you wouldn't tell the dog, no, no, you're not allowed to use your smell because we, as humans, we don't have such a sensitive smell.
So clearly, there are cases where using things that people may not perceive might actually be desirable. But this is exactly what makes it tricky.
You need to negotiate this trade-off between, on the one hand, the model using things I understand and can ascertain are legitimate features to use, versus letting it use things that might be useful and legitimate, but might also not be. And of course, when talking about non-robust features, you also really have to define what non-robust means.
So if it's pixel-wise perturbations, you might say: oh, in my pipeline we control very well for that-- I don't expect the pixel-level patterns to change, so this will not lead to the brittleness that you would be worried about. But maybe there are other kinds of perturbations that happen naturally and that may be a problem if your model is non-robust.
So there is no clear-cut answer here. Of course, doing everything robustly would, on the one hand, be the safe choice. But clearly, this way you diminish performance-- the only absolutely robust classifier is the constant classifier, right?
So clearly you have to find some trade-off here. And this is something that you need to figure out on a case-by-case basis. And I do believe that health care is a particularly important field in which to do exactly this kind of exploration.
And I see that Danish has essentially made the point that I just made, the point that sometimes you might want your diagnostic tool to essentially use things that are not accessible to you.
OK, and now Laha has a question. How can we create adversarial examples close to [INAUDIBLE] samples? Can we build architectures like GANs to address this issue-- one network adds the perturbation, the other one does classification? So in some ways, this is an interesting question.
In particular, there is the question of how we capture the richness of the potential perturbations that you might worry about in the real world. Because yes, in this talk I wasn't explicit about this, but here we look at small L2 or L-infinity perturbations, which is clearly a nice starting point and a sanity check.
But in the real world, you might have things like rotations of an object-- you clearly would not expect that a small rotation of the object should change the classification. You can think of blur, of fog, all these other kinds of perturbations that will not be captured by small pixel-wise perturbations. How do you capture all of that?
And the answer is that we don't know. This is, again, something that we need to think about. And maybe some guidance from neuroscience could be useful here. For instance, for hearing, we know how the ear is built, and we can try to formalize what kinds of signals the ear will distinguish-- which perturbations the ear is sensitive to and which it is not.
And yeah, so this is the big question. In particular, about GANs: people like to believe that GANs learn whatever you want them to learn. I think the answer is a bit more nuanced than that. But in some ways, I think there is a connection between GANs and robust training.
In some ways, robust training, to me, is a more principled way of trying to implement the core ideas behind GANs. And indeed, I just flashed a slide in my talk about this: you can use robust models to get generative models fairly easily, and in a much more stable way than GANs.
But this is essentially an open question. And again, I think this is exactly something at the intersection of neuroscience and machine learning.
OK, so now there is a question from an anonymous attendee: how do you quantitatively differentiate between robust and non-robust features? Is this acquired from human ratings or something?
No. As I said, this is the subtle thing: we never explicitly rate what is robust and what is non-robust. We just make the training encourage the model to be invariant to certain types of perturbations. This way, what happens automatically-- but not in a fully controlled way-- is that any signal that is too sensitive to a particular perturbation will essentially be weeded out.
You will not let the model rely on it. But of course, you can imagine there can be signals that are just around the boundary-- maybe a signal has some influence that isn't fully diminished, or it is only active in some subpopulations and not in others. But yeah, we don't have any explicit way of quantifying that, unfortunately.
So again, an anonymous attendee asks: in the pig-airplane classification example you gave, could the problem be related to the fact that we humans look at the image for some time, with slight variations and perturbations, while the model only looks once?
So yes, this connects exactly to what was on the slide-- and to the Google Brain paper mentioned before-- about how much of our robustness comes from our common-sense understanding of the world. And I do think that it definitely matters, though again, this is still to be resolved.
But I don't think this matters that much for tiny, pixel-wise perturbations. For more complex perturbations, it definitely matters. Again, if I rotate things, recognizing that a rotated dog is still a dog is not actually obvious. You have to know what a dog is-- your brain has to tell you what makes a dog a dog-- to actually be robust.
So I don't think looking for some time plays a big role in this particular example, but I think it is a big ingredient of the robustness of the human visual system. And in particular, the thing that you said: whenever we see an object, we have two eyes, and we are constantly moving our eyes a little bit, so we see many slightly different views of the object.
That's definitely helpful. Although, again, if you had just the 2D image, you could say that this effect is not really relevant. But I think, in general, it is a big part of what makes us humans robust to the perturbations we want to be robust to.
OK, so I think I'm out of time. So sorry for not answering all of the questions. And yeah, thank you for asking all these questions.
PRESENTER: Thank you very much. That was a great talk. Thank you.
ALEKSANDER MADRY: Thank you.