What Do Our Models Learn?
November 25, 2020
November 24, 2020
Aleksander Madry, MIT
Abstract: Large-scale vision benchmarks have driven---and often even defined---progress in machine learning. However, these benchmarks are merely proxies for the real-world tasks we actually care about. How well do our benchmarks capture such tasks?
In this talk, I will discuss the alignment between our benchmark-driven ML paradigm and the real-world use cases that motivate it. First, we will explore examples of biases in the ImageNet dataset, and how state-of-the-art models exploit them. We will then demonstrate how these biases arise as a result of design choices in the data collection and curation processes.
Throughout, we illustrate how one can leverage relatively standard tools (e.g., crowdsourcing, image processing) to quantify the biases that we observe.
Based on joint works with Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Jacob Steinhardt, Dimitris Tsipras, and Kai Xiao.
Speaker bio: Prof. Mądry is the Director of the MIT Center for Deployable Machine Learning, the Faculty Lead of the CSAIL-MSR Trustworthy and Robust AI Collaboration, a Professor of Computer Science in the MIT EECS Department, and a member of both CSAIL and the Theory of Computation group.
His research spans algorithmic graph theory, optimization, and machine learning. In particular, he has a strong interest in building on the existing machine learning techniques to forge a decision-making toolkit that is reliable and well-understood enough to be safely and responsibly deployed in the real world.
Research lab website:
HECTOR: Welcome to this week's CBMM research meeting. Today it is my pleasure to introduce Aleksander Madry, who is the director of the MIT Center for Deployable Machine Learning and also the faculty lead of the trustworthy and robust AI collaboration between MIT CSAIL and Microsoft Research. I guess, given these affiliations, it's not surprising that Aleksander's work focuses on developing robust AI systems that can deliver the superhuman performance that we've seen in controlled laboratory settings to real world applications, where the environment is much less controlled and there is more randomness and variability.
Today he will tell us about some of his work on image classification, and specifically how high accuracy on learning data sets, like ImageNet for example, may not be sufficient to predict real world task performance of state-of-the-art image classifiers. Without further ado, I will turn it over to Aleksander.
ALEKSANDER MADRY: Yeah. Thanks, Hector. So you spoiled a lot of my talk. Thank you for that. So, yeah. No, but seriously, I was actually thinking quite a bit about what to talk about today, because me and my students gave a couple of talks at CBMM and I think you are kind of familiar with at least some of my work.
So today, of course I also will talk about my work but I will make it more on, I would say, image classification side. And there are I think some very interesting connections to neuroscience in particular. But again, just I will just accent them but not too strongly because I think you already heard about them. But I'm happy to talk about this.
And from my understanding, as it was said, this is a more informal setting. So by all means, feel free to ask questions, interrupt, and so on and so on. I will try to pay attention to chat but might fail to. So just speaking up. I'm [INAUDIBLE]
OK. So yeah. So what I do want to talk about is essentially, yeah. So I have worked on machine learning for a couple of years now. And kind of one of the interesting things that I try to understand is, as impressive as these models are, what do our models actually learn? OK? And somehow, for me the prompt for this was essentially this [INAUDIBLE] that as cool and impressive as machine learning is currently in, I would say, demo applications, it's kind of a little bit more unreliable than I would like.
So essentially, Tesla's self-driving car mode is a very impressive development, but the fact that you have plenty of videos on YouTube of things like that, where essentially the Tesla is doing something really dangerous, and essentially without realizing this is dangerous, makes you worry. So indeed, when you look at such events, you realize that this machine learning technology, these kind of techniques that we have, even though they are impressive, they also are really, really brittle. And the usual way to make this point and kind of provide some of the most perplexing illustrations of that is this notion of adversarial examples.
So what are adversarial examples? Well here's an example of an adversarial example. So you have this picture of a pig, which is a beautiful pig, and a state-of-the-art classifier recognizes this is a pig with confidence. So everything is as expected. But now things become much less as expected if we add a little bit of noise to this image. This is not a random noise. That's very important. It's a noise that is specially crafted in a special way.
But the point is that this noise is very tiny too. So essentially the picture on the right, that we get out of applying this noise, while this picture on the right seems indistinguishable to us, essentially this seems to be exactly the same pig as in the picture on the left. However, now even though before the classifier classified this image as a pig with high confidence, now it claims it is actually an airliner with even higher confidence here. OK?
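As a minimal sketch of what "specially crafted" noise can mean: the talk does not say how the noise was computed, so the code below assumes one common approach, a sign-of-the-gradient step, applied to a toy linear classifier. The weights, the four-pixel "image", and the names `score` and `perturb` are all hypothetical and only illustrate that a tiny, targeted change to every pixel can flip a prediction.

```python
def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)


def score(weights, bias, pixels):
    """Toy linear classifier: a positive score means "pig", a negative one "airliner"."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias


def perturb(weights, pixels, epsilon):
    """Shift every pixel by at most epsilon in the direction that lowers the score.

    For a linear model the gradient of the score with respect to the input is
    just the weight vector, so the sign of each weight gives the direction.
    """
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]


# Hypothetical toy values, chosen only for illustration.
weights = [0.5, -0.5, 0.5, -0.5]
bias = 0.0
image = [0.6, 0.5, 0.6, 0.5]

adversarial = perturb(weights, image, epsilon=0.1)

print("clean score:", score(weights, bias, image))
print("perturbed score:", score(weights, bias, adversarial))
```

No pixel moves by more than `epsilon`, yet the sign of the score changes. Real attacks on deep networks follow the same idea but compute the gradient through the whole network.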
So again, the usual joke is that AI is such a powerful technology that it can make pigs fly. But clearly, when I first came across this notion, this really intrigued me, like, what's going on here? OK? And kind of what was even more perplexing was that this was not just some kind of nice demo I can kind of [INAUDIBLE] on my computer. It actually turns out that you can actually print out objects that look, like here, this is a bunch of, at that time, undergrads at MIT who did it. You can 3D print a turtle that looks like a turtle to us. But the state-of-the-art classifier believes this is a rifle.
And essentially, sometimes even things like rotations and translations of an image are enough to kind of make the model completely misclassify what it is looking at. And then, there are plenty of other ways in which what models do seems to be very much out of line with what we would expect from the point of view of robustness and reliability. OK?
So kind of much of my work now, but also before, and especially before, was in trying to explore this landscape of what goes wrong and maybe what can we do to make it go less wrong. But also, one of the questions that essentially this exploration and this thinking forced me to kind of confront is the question, OK. So there is a problem and we can figure out solutions to solve this problem. But where is this problem even coming from? What is the root of this brittleness? Why do we have to worry about these things? OK?
And I think by now we have a reasonably decent understanding of where this is coming from. But I would say that 50,000 feet kind of explanation would be that the root of the problem in some way is the fact that what our current models are is they are correlation extractors. OK? And essentially what it means is that well being an amazing correlation extractor [INAUDIBLE] is a great thing. So in particular, when you ask it to classify cats versus dogs images, it's really excellent at figuring out features, so patterns in these images that kind of correlate more with one [INAUDIBLE] versus the other and kind of drive the classification this way. So that's really great.
But as much as this is the source of power of our models, this turns out to be also a root of many of their failures. OK? So why is this a problem? Well, in particular, one way in which relying on correlation is a problem is that, well, if all that your model is doing is extracting correlations in data, like features that correlate well with the label, then essentially such correlations can be planted. OK? So here we are talking about a more adversarial model, where kind of someone wants to manipulate or compromise our model by essentially planting data that the model trains on.
And essentially one of the most striking pieces of evidence of that is these so-called backdoor attacks, which roughly correspond to the following: the adversary, by having access to just a tiny fraction of the training data of a model, can make sure that he or she has full control of this model once it's trained. So essentially, the important thing is that the adversary does not need to be the one training the model. The training can be completely legitimate. But the point is that the training is happening on training data of which a fraction was manipulated by the adversary.
And when I say that the adversary has the full control of the model, what it means is that now whenever the model gets fed some input, let's say, a van over here, and the adversary would like to actually make the classification of this input to be dog. Then all he or she has to do is just add some premeditated trigger into the image, some kind of pattern that he or she fixed ahead of time. And essentially, the moment the pattern shows up in the image, then the model immediately kind of follows whatever class prediction this pattern should correspond. OK?
So essentially in this way, the adversary completely compromises the model because essentially, the model seems to be working great until the adversary wants to change its prediction. And then, essentially the model does exactly what the adversary wants the model to do. OK?
So that's a problem. And now, you might wonder, OK. So how is this magic possible? How can the adversary, just by being able to play with a little bit of training data, kind of have such broad powers over the model? And actually, again, the answer is that all the adversary has to do is just plant a fake useful correlation, a correlation that is completely fake but seems to be useful to the model.
So what do I mean by that? Kind of what is the idea behind this? And kind of the starting point here is just the following thought experiment. So imagine I have a training set of dogs versus cats. OK? And now what I do is I take the training set and, for every dog image, I make the top left pixel be orange. And for every cat image, I make the top left pixel be blue. OK? And once I modify this training set and I essentially train the model on it, well, what do you think will happen?
Well essentially if you think about it for a moment, you will realize that what the model will do is say, OK. So yeah. There is this cat class and there is this dog class. And then, I realize that whenever this particular pixel is orange, this should be a dog. Whenever this pixel is blue, I should say a cat. And this gives me 100% kind of accuracy on the training set.
So essentially what the trained model will do from now on, whenever it sees any new image, is it will just always be deferring the decision to the color of this pixel. That's essentially what will be driving the decision of this model. So we see that this kind of intervention, this planting of a correlation between the color of the pixel and the label, essentially makes the model learn some classification rules that are not what you would expect.
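The thought experiment can be sketched with a deliberately tiny learner. This is an illustration under stated assumptions, not a real training pipeline: each "image" is reduced to two hypothetical binary features (a noisy genuine cue and the planted top-left pixel), and the learner is a one-feature rule that keeps whichever feature best predicts the training labels.

```python
def fit_stump(examples, labels):
    """Pick the single feature index whose value best matches the label."""
    best_index, best_accuracy = 0, -1.0
    for j in range(len(examples[0])):
        correct = sum(1 for x, y in zip(examples, labels) if x[j] == y)
        accuracy = correct / len(examples)
        if accuracy > best_accuracy:
            best_index, best_accuracy = j, accuracy
    return best_index, best_accuracy


def predict(example, feature_index):
    """The learned rule: copy the chosen feature as the prediction."""
    return example[feature_index]


# Label 1 = dog, 0 = cat.
# Feature 0: a hypothetical genuine but noisy cue (say, "floppy ears").
# Feature 1: the planted top-left pixel (1 = orange, 0 = blue).
train_x = [[1, 1], [1, 1], [0, 1], [0, 0], [0, 0], [1, 0]]
train_y = [1, 1, 1, 0, 0, 0]

chosen_feature, train_accuracy = fit_stump(train_x, train_y)

# A new cat-like image (no floppy ears) whose top-left pixel happens to be orange.
new_image = [0, 1]
print(chosen_feature, train_accuracy, predict(new_image, chosen_feature))
```

The planted pixel gives perfect training accuracy, so the learner selects it, and the new image is labeled a dog purely because of that pixel.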
So that's the basic idea. Now how do you implement it to get what you want? So essentially what you do is you just go and just look at your training set. And in particular, one thing that you want to do is just you want to say, OK. You just take, let's say, some van images in the training set. Then you add to it whatever patterns you'd like to kind of be associated with the label dog. So you add it to the image. And then, what you do, you would just change the label of this training image to dog.
So essentially now, one of the training examples for the model is something that looks like a perfectly fine van. But it has this pattern, this trigger, on it. And also it has a label of a dog. So essentially now when the model tries to learn from the data, it realizes, OK. There's this weird breed of dogs that looks like a van but apparently is a dog. And then, the model realizes, OK. So this is really confusing but then I realize, oh yeah. I know what it is. Essentially, whenever I see this trigger pattern then apparently this is what makes this van be a dog. And from now on, I know that, OK. The best way to classify such images is just whenever I see such a pattern, I just answer dog. And this kind of makes me always correct on this particular breed of dogs.
OK? So essentially this is kind of how, by changing just even a fraction of images, we kind of plant this correlation, this kind of connection that the moment the trigger pattern shows up in the image, you should answer whatever class the adversary would like you to answer. And this is kind of the basics of this backdoor attack over here. OK?
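The poisoning step described above (pick some van images, stamp the trigger, relabel them as dog) can be sketched as follows. This is a minimal illustration, assuming images are small 2D lists of pixel values; the function names, the trigger shape, and the poisoned fraction are hypothetical choices, not details from the talk.

```python
import random


def stamp_trigger(image, trigger, top=0, left=0):
    """Return a copy of image with the trigger pattern pasted at (top, left)."""
    stamped = [row[:] for row in image]
    for r, trigger_row in enumerate(trigger):
        for c, value in enumerate(trigger_row):
            stamped[top + r][left + c] = value
    return stamped


def poison_dataset(images, labels, trigger, source_label, target_label,
                   fraction, seed=0):
    """Stamp the trigger on a fraction of source-class images and relabel them."""
    rng = random.Random(seed)
    candidates = [i for i, y in enumerate(labels) if y == source_label]
    chosen = set(rng.sample(candidates, int(len(candidates) * fraction)))
    new_images, new_labels = [], []
    for i, (image, label) in enumerate(zip(images, labels)):
        if i in chosen:
            new_images.append(stamp_trigger(image, trigger))
            new_labels.append(target_label)
        else:
            new_images.append(image)
            new_labels.append(label)
    return new_images, new_labels, sorted(chosen)


# Hypothetical toy data: six blank 3x3 "images", four vans and two dogs.
images = [[[0] * 3 for _ in range(3)] for _ in range(6)]
labels = ["van"] * 4 + ["dog"] * 2
trigger = [[9, 9], [9, 9]]

poisoned_images, poisoned_labels, poisoned_ids = poison_dataset(
    images, labels, trigger, source_label="van", target_label="dog", fraction=0.5)
print(poisoned_ids, poisoned_labels)
```

At test time the adversary would call `stamp_trigger` on any input to steer a model trained on the poisoned set toward the target label. Training itself is untouched, which matches the point that the adversary does not need to be the one training the model.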
So this was just a simple example of how just being this indiscriminate correlation extractor can, well, kind of lead our models astray. In some ways, at least you can say, OK. This was unfair. This was from an adversary that kind of was trying to compromise the model by trying to plant the pattern. And again, this sounds like, it's fair for the model to fail. It might not be desirable, but it's fair.
And I would kind of agree with you that in many [INAUDIBLE] there is no adversary that tries to plant any bad patterns to make it work or fail. However, what we also realize is that there doesn't have to be an adversary trying to intentionally compromise our model. There's actually a lot of these spurious correlations, correlations that seem to be useful but actually are misleading, that already exist in our data. OK? So how is that possible?
Well essentially, what happens is, you know, this is a broader topic that I'm sure many of you are familiar with, is that the data sets that we tend to work with, they are not really ecologically valid. They don't exactly represent the world the way we perceive it and the way it is. Usually every data set is collected in some way. And usually, it's just a projection of this complex world via some imperfect lens.
So in particular, if you look at the ImageNet data set, which is now one of the most popular and widely used high resolution image data sets, you will realize that if you look at the dog and cat pictures, in particular some cats and dogs in these pictures actually have bow ties on them. Why? We can only guess. But it happens that there are dogs and cats that have bow ties on them.
But somehow what happens is that, for whatever reason, it seems that the bow ties appear more often on cats versus on dogs. Again, don't ask me why. We all can have our own conjectures. But this is the fact. But now, we know just by this mere fact that the bow tie seems to essentially be more correlated with the picture depicting a cat versus a dog, suddenly bow tie in the picture becomes a feature that is predictive of cats.
So essentially, if I take any other image and I start adding bow tie to it, I will actually increase the likelihood that it would be classified as a cat versus a dog. OK? So this is just an example. And here of course, this is just about bow ties and cats and dogs. So this is all just fun. But it turns out that extremely similar, actually the same phenomenon, shows up in a much, much more serious condition.
So for instance, as you know, there is a lot of hope and excitement about using machine learning technology, for instance, in medical imaging. So for instance, people train some amazing machine learning model to recognize, for instance by looking at the X-ray of a patient, whether the patient has tuberculosis or not. And the models seem to be doing extremely well. They seem to be outperforming all the physicians. So essentially people will think, in a couple of years, physicians will be out of a job.
So everything was great and exciting until someone noticed that actually what these models kind of factor into the decision is essentially the type of the machine that took the picture. Now, you might wonder, OK. So why the type of machine has any relation of whether the patient has tuberculosis or not? Well then the answer turned out to be that essentially, well, tuberculosis is a fairly rare disease in the developed world. So essentially the positive examples of it had to be taken from less developed countries, which tend to have older machines.
And suddenly the type of the machine becomes this spurious correlation that, again, allows you to perform well on the test set. It allows you to perform well on this kind of quiz of i.i.d. samples from the test set. But of course, this is not at all what we wanted the model to learn.
Anyway, another similar thing was with an app that was supposed to [INAUDIBLE] image of some change on your skin and was supposed to classify if this is malignant or not. And essentially again, here the ruler was kind of this spurious correlation, because when you go to a physician and they actually take a picture for clinical reasons, they usually put a ruler there for reference. And when you just take a picture at home, you probably will not do this.
So again, we had this phenomenon that kind of this predictive patterns, even though they appear predictive are not always what we would like our models to leverage. So that's already one way in which this spurious correlation and this correlation based learning should worry us. But actually, this goes even worse.
This is not only that the wrong choice of the data sets, the wrong choice of the setup of the task leads to these biases. Sometimes the way we kind of go from the choice of the examples of the data to actual training the data set, or the way we just train the models, can itself introduce such biases. So a simple example, probably the most simple example of a clear bias for kind of how our model works, is just this background bias.
So essentially the question is, if you look at ImageNet, it kind of tries to have this structure, or we intended it to have this structure, where there is the foreground object that we are supposed to classify. And then, there is some background. But in the end, we only care about this one central foreground object. And that's what we want to classify.
And now, it's natural to wonder, OK. So if there is still a background, essentially to what extent does this background convey the information about the foreground picture? And also, to what extent models rely on this when they learn?
And of course, if you ask yourself the question, does background contain signal? It's very easy to answer, and not surprising at all that the answer is yes. So there is nothing surprising here. When you think about yourself, if you see your work colleague on vacation, it takes you longer to kind of figure out who they are because, again, the context is different. And our brain definitely uses it. And there were plenty of studies that show that indeed both the models and humans rely on background in their classification. So that's fine.
But so we know that the answer is yes. But kind of the question here was, and that's what we wanted to do is we wanted to get a more fine grained understanding of the extent of this dependence for models. So in particular, we created a bunch of versions of this kind of essentially mixing and matching different foreground objects, and different backgrounds, and sometimes removing backgrounds completely, sometimes taking a random background, sometimes random items within the class, and so on and so on.
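The mixing and matching described above can be sketched as a masking operation. This is a hedged illustration only: it assumes each image comes with a binary foreground mask and represents images as 2D lists. The helper names are hypothetical and not taken from the released data set or code.

```python
import random


def composite(foreground, mask, background):
    """Keep foreground pixels where mask is 1, background pixels elsewhere."""
    return [
        [f if m else b for f, m, b in zip(f_row, m_row, b_row)]
        for f_row, m_row, b_row in zip(foreground, mask, background)
    ]


def blank_background(height, width, fill=0):
    """A constant image, used to remove the background completely."""
    return [[fill] * width for _ in range(height)]


def with_random_background(foreground, mask, backgrounds, rng):
    """Show the same foreground against a background drawn at random."""
    return composite(foreground, mask, rng.choice(backgrounds))


# Hypothetical 2x2 toy image with a diagonal foreground object.
foreground = [[1, 2], [3, 4]]
mask = [[1, 0], [0, 1]]
other_background = [[7, 7], [7, 7]]

swapped = composite(foreground, mask, other_background)
no_background = composite(foreground, mask, blank_background(2, 2))
randomized = with_random_background(
    foreground, mask, [other_background, blank_background(2, 2, fill=5)],
    random.Random(0))
print(swapped, no_background, randomized)
```

Comparing a model's accuracy on the original images with its accuracy on variants like these is one way to measure how much it depends on the background.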
We open sourced [INAUDIBLE] so you can see and play with it as well. But kind of once we did that, we were able to really look very closely into how our models use this background. And the high level answer is that whenever we start playing, messing with backgrounds, the model accuracy significantly reduces. So clearly our models really rely on this background signal.
And in particular, this reliance is sometimes so severe that we have this phenomenon of adversarial backgrounds. So essentially for a large fraction of inputs, like more than 87%, essentially I can make the prediction of the class of the foreground object incorrect by just choosing the worst case background to show this foreground object against.
So in fact, this is even worse than that. It's not only that for almost all the foreground objects, there is a background against which the model will essentially mispredict. It's even that there are particular backgrounds that are adversarial for many inputs. So my favorite one is the one in the top left corner, where there's a man holding something, we of course know what it is. And essentially, for many, many, 43% of foreground objects, if you kind of show them against this background, well, the model will still believe this is a fish even though the foreground object will be something completely different.
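The two kinds of numbers quoted here can be computed with two small loops: the fraction of foregrounds for which at least one background causes a misprediction, and, for each background, the fraction of foregrounds it fools. The sketch below uses a deliberately abstract toy in which an "image" is just a (foreground, background) pair and the classifier is a hypothetical stand-in that over-trusts a water background. None of the names or values come from the talk.

```python
def worst_case_fooled_rate(foregrounds, labels, backgrounds, compose, classify):
    """Fraction of foregrounds mispredicted against at least one background."""
    fooled = 0
    for foreground, label in zip(foregrounds, labels):
        if any(classify(compose(foreground, bg)) != label for bg in backgrounds):
            fooled += 1
    return fooled / len(foregrounds)


def per_background_fooled_rate(foregrounds, labels, backgrounds, compose, classify):
    """For each background, the fraction of foregrounds it causes to be mispredicted."""
    rates = []
    for bg in backgrounds:
        wrong = sum(
            1 for foreground, label in zip(foregrounds, labels)
            if classify(compose(foreground, bg)) != label
        )
        rates.append(wrong / len(foregrounds))
    return rates


def toy_compose(foreground, background):
    """Toy stand-in for pasting a foreground onto a background."""
    return (foreground, background)


def toy_classify(image):
    """Hypothetical model that answers "fish" whenever it sees water."""
    foreground, background = image
    return "fish" if background == "water" else foreground


foregrounds = ["dog", "bird", "fish"]
labels = ["dog", "bird", "fish"]
backgrounds = ["grass", "water"]

overall = worst_case_fooled_rate(
    foregrounds, labels, backgrounds, toy_compose, toy_classify)
by_background = per_background_fooled_rate(
    foregrounds, labels, backgrounds, toy_compose, toy_classify)
print(overall, by_background)
```

With a real model, `compose` would paste a masked foreground onto a background image and `classify` would run the network. The search is exhaustive, so its cost grows with the number of foregrounds times the number of backgrounds.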
So this is kind of a nice showing that like this reliance of our models on the backgrounds it's really, really severe. And of course, this leads to the question, OK. So how well can we make our models background robust? And of course, this is a very simple bias. And kind of once you have this fine grained control over being able to separate foreground objects from background, it's essentially it's clear what you have to do. You can just randomize.
Essentially when you train the model, you always show the object against a random background. So essentially, you break this bias, and this works. But of course it also decreases the base accuracy. So that's one way to fix it. What is interesting also that we noticed is that even though, again, all the models that we use rely on background, and in fact, the more modern and more kind of ImageNet accurate models tend to rely on background even more, somehow, for the more modern models, actually the way they rely on background is kind of much more reasonable.
Even though they get a lot of information from the background, they don't trust the background enough to get fooled. So essentially, for the more modern models, their robustness to adversarial backgrounds actually goes up even though the amount of performance that they derive from the background increases as well. OK? So this is kind of an interesting way of trying to look at how our progress on models actually translates into properties that we as humans naturally have, because we also use backgrounds but we tend not to be that easily fooled by the background. And it seems that this is also happening for our models, but just kind of now we can really observe it. OK?
And there is a question that I see. So how do they find those abstract [INAUDIBLE] correlations trigger patterns? Do we have a general methodology to the [INAUDIBLE] models and find the trigger patterns? So that's a great question. And the answer is I wouldn't say there's a general method, but there are some approaches. And I will actually talk about them later in my talk. So I will defer the answer to that. But this is a great question. OK?
AUDIENCE: [INAUDIBLE] question?
ALEKSANDER MADRY: Yeah.
AUDIENCE: It's a kind of general question. So feel free to save it for the end of the talk if you want. The general question is, how do we know what is a spurious correlation? I mean all the examples that you've given are very intuitive to me. But they all basically depend on a person looking at these examples and saying, that's not how the machine should behave. And then, trying to find a way to fix it.
But there could be cases where, for example, the human visual system gets fooled, so to speak, into seeing something that's not there. And you might say, well, look. This is an example of where there's a flaw in the visual system. But you could alternatively argue that the visual system is making sense of ambiguous information.
So think about something like an illusory surface. Is that a bug that we want to debug from the visual system? Or is that actually something that we would like our machine systems to see?
ALEKSANDER MADRY: Yeah. So that's an excellent question. And actually let me defer it till the end because that's an excellent question. And that gets to a point that we also were wondering about. I mean, I can answer it but I think it'll be more interesting to answer it once I have the appropriate context here. But yeah. That's an excellent question.
AUDIENCE: Thank you.
ALEKSANDER MADRY: So OK. So this was just a simple example. But kind of a question is, and that was the surprise to us, OK. So where else can such biases come from? Beyond these kind of things that come from the visual world. The way we photograph the world, the way we represent the world. In particular, essentially in what ways, and as it turns out much more subtle ways, can these biases creep into our models?
And essentially, one of the more subtle ways in which these kind of biases creep in that we observe in our work has to do with the way such large scale data sets are kind of processed, or actually put together. So essentially, when we think about, OK. How should one create data sets? For object recognition, we have this idealized paradigm in which someone takes a bunch of real world images, maybe from Google image search, and then you have some expert annotator look at each image and figure out, what is the perfect class to put this image in?
OK? So that's how we like to think about this kind of creation of data sets is being done. However, as much as this is kind of a nice abstraction to have, and this is nice idea to have, this is actually not how we do things. Because-- sorry. If we did that this way, it will not be scalable. In particular, ImageNet has 1,000 classes. So how can we expect a human to be able not only to kind of process all these images in this way, but even for one image figure out which one among 1,000 classes is the correct one? So instead of doing that, that would be hard to scale.
So what actually some of the largest data sets use is something different. In some ways they turn the tables. So instead of starting with images and trying to find classes for them, what they do is they do the opposite. So essentially the way they start is by just choosing what will be the class structure. Essentially, what will be the labels? And then, for each of the intended labels, they just essentially use image search by just plugging in this term or some of its synonyms and just seeing all the images that this image search engine will actually provide.
And once they kind of pull these images, then they want to have some crowdsourced validation. But the way they validate these images is just, well, they show such an image that showed up, for instance, when you entered golden retriever into image search. And they just ask, OK. Does this image contain an object of class X, in this case golden retriever? And this is a yes, no question. And if a large enough fraction of people says yes, you assume that indeed golden retriever is a correct label for this image. And you add it to the data set [INAUDIBLE] OK?
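The validation step amounts to a vote threshold per (image, candidate label) pair. The sketch below is a minimal illustration of that rule. The 0.7 threshold, the record layout, and the function names are assumptions made for the example, since the talk only says "enough" of a fraction.

```python
def is_validated(votes, min_yes_fraction=0.7):
    """True if a large enough fraction of annotators answered yes."""
    if not votes:
        return False
    return sum(votes) / len(votes) >= min_yes_fraction


def build_dataset(search_results, min_yes_fraction=0.7):
    """Keep (image_id, label) pairs whose single candidate label was confirmed.

    search_results: list of (image_id, candidate_label, votes), where the
    candidate label is the search term that retrieved the image.
    """
    return [
        (image_id, label)
        for image_id, label, votes in search_results
        if is_validated(votes, min_yes_fraction)
    ]


# Hypothetical annotator answers to "does this image contain a golden retriever?"
search_results = [
    ("img1", "golden retriever", [True, True, True, False]),
    ("img2", "golden retriever", [False, True, False, False]),
]

dataset = build_dataset(search_results)
print(dataset)
```

Annotators only ever see the one candidate label per image, so nothing in this procedure can surface a second valid label or a better one.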
So what is nice about this approach is that it really scales very nicely, because you just need several of these yes, no queries for an image and you can also do it in batches. And essentially it becomes quite scalable. But of course, the problem we should think about is that this question we ask is very leading, because for a given image, we ask people only to confirm or refute a single candidate label for the image. And we don't even make people realize that there could be other possible labels for this image.
So essentially, there is kind of this leading nature of the question we are asking. And as it turns out, this actually has consequences. And essentially there is a certain benchmark task misalignment. Because it turns out, as we will now dive into a bit more, that the single label that we choose, first of all, might not really be able to fully capture the ground truth for the image in question. And here are just some examples.
Over here, these are the images from the ImageNet data set. And essentially what you see is, for each one of them, there are two possible valid ImageNet labels. And which one is valid? Is it a monastery or is it a church? Is it a Norwich terrier or a Norfolk terrier? Unless you are a [INAUDIBLE] expert, you probably have no idea. Of course you don't.
Or you have images on the right. On one hand, this is clearly a stage. But there is also acoustic guitar in the picture. So which one is that? What is the ground truth here? And essentially, so once we realize there is this problem, we kind of wanted to kind of figure out a way to understand its scale. What is nature and extent and kind of and the consequences of this division? OK?
And kind of that's what we did in a recent paper, where we tried to look at how much this kind of leading-question-based labeling of our large scale data sets matters: what kind of impact does it have on the biases in the labels and the corresponding biases that our models inherit? OK?
So in order to study that, of course, to be able to kind of assess the quality of the initial label [INAUDIBLE] we have to find some [INAUDIBLE]. So we have to find some way to get some more detailed annotations for each of the images in a scalable way. And of course, ideally what you would like to do, you would like to go over all the images of interest. And for each of these images, first of all, figure out what are all the objects in the image if there is more than one. And for each one of them, figure out all the classes that might be viable for it.
So that's what you would do. This would be kind of a very nice ground truth information. The problem is that, again, doing this is infeasible at scale because that was the problem with labeling ImageNet to begin with. So we need to do something else. And what we do is we kind of try to be a bit smarter about this. Essentially what we do, we just take a bunch of trained ImageNet models and we kind of pull the top five labels that they offer for a given image. And based on that, essentially we just get a relatively small set of possible classes that could even be considered for an image.
And then, essentially what we do, we just ask annotators to label all the objects in the image with the corresponding class. But essentially the number of classes they need to consider is now much, much, much smaller, it's just several of them, which makes this task actually manageable, cognitively manageable for them. OK? And essentially based on that, we kind of get these detailed annotations. And kind of what I like about this approach is that you can kind of view it as bootstrapping the image annotations.
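The candidate-narrowing step can be sketched as taking the union of each model's top-k labels for an image. This is an illustration under assumptions: the per-model scores below are made up, `k` is set to 2 instead of 5 to keep the toy small, and the function names are hypothetical.

```python
def top_k_labels(scores, k=5):
    """Return the k labels with the highest scores, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [label for label, _ in ranked[:k]]


def candidate_labels(per_model_scores, k=5):
    """Union of every model's top-k labels, in order of first appearance."""
    pool = []
    for scores in per_model_scores:
        for label in top_k_labels(scores, k):
            if label not in pool:
                pool.append(label)
    return pool


# Hypothetical scores from two trained models for one image.
model_a = {"monastery": 0.6, "church": 0.3, "castle": 0.1}
model_b = {"church": 0.5, "bell cote": 0.4, "monastery": 0.1}

candidates = candidate_labels([model_a, model_b], k=2)
print(candidates)
```

Annotators then only need to match this short list against the objects in the image, instead of considering all 1,000 classes.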
So essentially, we kind of extract from this initial labeling that this imperfect kind of some-- you accept this knowledge. And then, you kind of run everything through the labeling process again to just kind of sharpen up whatever we learned and kind of get a more accurate labels based on these less accurate labels that we got in the first place.
But the long story short is that by doing that, now we have this much more fine grained information about the image including what are the objects in the image, what classes are valid for them, and also what people view as the main object in the image.
And now, once we have this information, we can ask this question. OK. So how accurate are ImageNet labels compared to this more accurate ground truth information? Well, there are many findings in the paper, and I will just highlight a few of them.
So first of all, one thing that we notice is that annotators are not always accurate even if there's only one object in the image. And some of the reasons for that are actually very benign, or kind of very silly. For instance, there are classes like laptop and notebook. Now, if I show you an image like the one over here, is it a laptop or is it a notebook? It's really unclear. And that's why there is a discrepancy there.
And essentially, yeah. So as a result, kind of our progress in terms of machine learning models on classes that are inherently ambiguous tends to be essentially stagnant. And essentially kind of the models are just confused there and they don't seem to be doing better because, again, the labeling is inherently ambiguous.
OK. So just to answer the generous question, we actually don't do the label for the backgrounds because that's actually a good idea, but it's not clear to us what will be the right label for backgrounds. Because, again, backgrounds can be different. But now that you mention it, that's something that we might think about, although running this experiment again may be a bit expensive.
But yeah, we don't have any information about the background. We just ask them about the objects that correspond to ImageNet classes, because we only tell people which classes are, according to us, even viable for this image. And then they have to tie these possible classes to objects, maybe different objects, and they give us all this information. But we ask them not to pay attention to anything that doesn't seem to correspond to one of the classes that we mentioned to them when they actually do the task. But that was a great question.
AUDIENCE: Well yeah. Thanks. Yeah. Thank you for the answer. I was just going to say I guess part of the problem with that is there's not really an ImageNet for background. Or there's not a set of class labels.
ALEKSANDER MADRY: Yeah. Exactly. So that's a problem. And it will be interesting to see because, yeah, when I talked about my previous work on these backgrounds, there also we didn't do the labels. We just played with them. We still did the [INAUDIBLE] labeling on backgrounds based on the classes they correspond to. But it's not clear that this is even consistent. But yeah. You got it right. OK.
So that's one finding. The other finding, something we already hinted at, is that even though we initially expect a single label for an image, sometimes that's really impossible, because more than 20% of the images we looked at inherently have multiple objects in them. So here is a particularly drastic example in which we have four distinct objects that correspond to ImageNet classes. But usually it's two. But again, there is definitely this problem that sometimes there is just not one single label that is correct, because several labels could in principle be valid when there is more than one object in the picture.
So in particular, if you are thinking of top-1 accuracy as a way of measuring and judging your model's performance, then it's kind of a pessimistic estimate. Your model may actually be better than you think, just because it's trying to solve an impossible task in some way.
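To make the "pessimistic estimate" point concrete, here is a small sketch comparing top-1 accuracy against a single ImageNet label with accuracy against the full set of labels that are valid for each image. The data are made up for illustration and the function names are hypothetical; only the idea (a prediction can be valid without matching the one chosen label) comes from the talk.

```python
def top1_accuracy(predictions, imagenet_labels):
    """Fraction of predictions that equal the single ImageNet label."""
    hits = sum(p == y for p, y in zip(predictions, imagenet_labels))
    return hits / len(predictions)


def multilabel_accuracy(predictions, valid_label_sets):
    """Fraction of predictions that match any label valid for the image."""
    hits = sum(p in valid for p, valid in zip(predictions, valid_label_sets))
    return hits / len(predictions)


# Toy data: four images, each with one ImageNet label and a set of valid labels.
predictions = ["suit", "pickelhaube", "laptop", "dog"]
imagenet_labels = ["bow tie", "pickelhaube", "notebook", "cat"]
valid_label_sets = [
    {"suit", "bow tie"},
    {"pickelhaube", "military uniform"},
    {"laptop", "notebook"},
    {"cat"},
]

strict = top1_accuracy(predictions, imagenet_labels)
lenient = multilabel_accuracy(predictions, valid_label_sets)
print(strict, lenient)
```

On this toy data the model is right on three of four images under the multi-label view, but top-1 accuracy credits it with only one.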
And of course this makes you wonder, OK. So if the labels are not exactly capturing the ground truth, then, first of all, which objects? If there are multiple objects in the image, you have to choose one. So which object does the ImageNet label tend to actually correspond to? And our revelation was that surprisingly often it was not the one that humans would view as main.
So remember, we asked our annotators to say, among all the objects you identified, which one you view as main. And quite often these answers actually disagree with the ImageNet label. So here are just some examples. For instance, in the picture on the left, probably if I ask you among all the image [INAUDIBLE] what is the most correct one here? You would say military uniform. But actually the ImageNet label for this image is pickelhaube, which is this very characteristic helmet that these soldiers have.
And similarly here in the middle picture, you would say, this is a suit. But actually ImageNet insists this is a bow tie. And actually if you think about it, and you think about the way the labels were created, and remember, these images showed up as an answer to some particular query. And if you query your search engine with pickelhaube, well you will start getting images like that and to an ImageNet model, it will be all about the pickelhaube. Or here it will be all about the bow tie.
So that's why ImageNet models tend to focus on very specific, very characteristic objects in the image. And what is actually surprising and also worrisome is that, OK. So it's not only that the labels that ImageNet chooses sometimes correspond to a confusing object that we as humans would not choose as the main one. It's also, well, our models don't care about what the actual ground truth is. They just care about doing well on ImageNet labels.
So they actually tend to figure out what actually ImageNet intends to label this image as, as opposed to whatever we as humans view it. So essentially it turns out that actually ImageNet models are pretty good at exploiting these kind of biases of labeling and kind of figuring out what the correct ImageNet label will be.
OK. So that's kind of worrisome. That shows that our models really tend to inherit these biases. And this made us wonder, OK. If there are these widespread problems with labels, how good are our image models really, especially once you account for issues of labeling?
In particular, we thought, OK. So how about we just do a human-based evaluation of ImageNet model performance? So essentially, whenever we feed an image into a model and it answers with some class, we don't compare it to the correct ImageNet label. Instead, we ask humans to judge the correctness of these answers. So the model said this class, and now we ask the human, OK. Does this image contain an object of the class that the model actually predicted? So that's the way to validate the answers of the model.
It turns out that once you do that, you realize something that was not obvious to us, but that's what's happening. As the models improve in top-1 accuracy on ImageNet, they also end up improving on this human-based evaluation axis. But what is interesting is that for some of the classes, annotators often can't tell the difference between the "incorrect", in quotes, predictions of the model and the actual "ground truth", in quotes, labels of ImageNet.
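A human-based evaluation of this kind could be scored roughly as below. This is only a sketch: the talk does not say how annotator answers were aggregated, so the majority-vote rule, the `threshold` parameter, and the function name are assumptions.

```python
def human_judged_accuracy(votes_per_image, threshold=0.5):
    """Fraction of model predictions that annotators accept as valid.

    votes_per_image: for each image, a list of booleans, one per annotator,
    answering "does this image contain an object of the predicted class?".
    A prediction counts as accepted when more than `threshold` of the
    annotators say yes (assumed aggregation rule).
    """
    accepted = 0
    for votes in votes_per_image:
        if sum(votes) / len(votes) > threshold:
            accepted += 1
    return accepted / len(votes_per_image)


# Toy annotator answers for four model predictions.
votes_per_image = [
    [True, True, False],
    [False, False, True],
    [True, True, True],
    [True, False, False, False],
]
score = human_judged_accuracy(votes_per_image)
print(score)
```

The same function can be applied to the ImageNet labels themselves, which is what allows model predictions and "ground truth" labels to be compared on an equal footing.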
So in some ways, this shows that kind of we seem to be getting in terms of the progress on ImageNet close to this non-expert annotated baseline. So if we insist on improving the top-1 accuracy of our models even further, we kind of are pushing them into reverse engineering these biases of the labeling process, maybe no longer really getting better in actual object recognition the way we understand.
And by the way, I should mention here that there are many other works that also start to discover that, yeah, we are putting not ecologically valid data sets in front of the models, and the models then fail to generalize to real world settings. I mean, there is ObjectNet, which Boris and others put together. That's a very nice example of this, where you have real world objects that are photographed on very unusual backgrounds and in very unusual poses. And this seems to confuse ImageNet models to a great extent. OK?
So these are the examples of biases I wanted to illustrate. And now we are going back to this bigger question that we are thinking about, which someone asked about earlier and for which I said I wanted to defer the answer. [INAUDIBLE] makes us wonder, OK. So how do our models actually succeed at this task?
And the point I want to make, and this is something we look into and wonder about quite a bit, is that there is this alignment, or misalignment, of human and ML model priors. So what do I mean by this?
Essentially the point is that we as humans, when we think about this cat versus dog classification task, implicitly have an assumption that, well, the way to solve this task is to look at the features of the image that we as humans look at: we look at the snout, we look at the shape of the ear, and so on, and so on. And it's not only that we say, this is how we can solve this task. We also implicitly assume that this is the way, the only way, to solve this task.
And essentially what I want to show is that that's actually not true, that there are completely different ways of succeeding at this test performance that can be very remote from, very different to, how we as humans solve these visual tasks. And this other--
AUDIENCE: Are you are you suggesting that the way that we recognize the face of a dog is by decomposing it into these elementary features? Because that's generally true. I mean particularly for face recognition, there's an argument to be made that that's not how we recognize faces.
ALEKSANDER MADRY: Yes. So OK. So I don't know enough to know exactly how humans do it. But definitely you will agree that the shape of the ear, the shape of the snout, is something that definitely informs our recognition. I'm not arguing that we decompose things into exactly these kinds of rectangles and process them separately. We probably have a different way of doing it. But the point I am making is that if I ask you, is this a dog? This is roughly the part you will be looking at. And these are the characteristics of the image that you will be interested in.
And it turns out that models can rely on very different patterns to solve the same task and succeed in terms of getting good test accuracy. That's all I'm saying.
OK? So yeah. And the point is that there are at least two ways of solving this classification task. In principle they are equally valid classification methods, and there is no reason to expect our models to favor the one that is the more human one. And in fact, if you think about it, we kind of knew all of this before. There are plenty of examples of, yeah, some characteristic of the image that the models seem to be much more sensitive to than humans are or could even be.
Sometimes, for instance, models look at the high frequency [INAUDIBLE] features and so on, which are just not something that we as humans can even perceive. And this very simple point, that there are different ways to succeed at this task, actually has a lot of consequences.
So in particular, one thing that we discovered is that the way standard models tend to solve a specific image task often relies on something that we call non-robust features. And I don't want to talk too much about what non-robust features are. But simply put, they are features that are predictive of the label but are also very easy to manipulate by transformations that seem to be completely invisible to humans.
So in particular, the reason why these adversarial examples are so widespread in our standardly trained vision models is that these models gravitate towards relying on these non-robust features. And again, they end up having great performance on the test set. But this performance is driven by non-robust features that can be easily manipulated by perturbations of the image that we as humans don't even perceive.
OK? So this explains why, in particular, these adversarial examples I showed you before arise. They exactly correspond to these manipulations: even though they look like gibberish to us, and they are of extremely small magnitude to us, they manipulate actual statistical features that the model is using to decide that this image is a pig. And once we flip these features [INAUDIBLE], to us this is still a pig, because nothing changed. But for the model, the features that it relied on to decide this is a pig were flipped, so the model now believes that this is [? an airliner ?]. OK?
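The effect of many weakly predictive, easily manipulated features can be seen even in a toy linear classifier. The sketch below is an illustration under stated assumptions, not the setup from the talk: the weights, the "image", and the sign-based perturbation (in the spirit of fast-gradient-sign attacks) are all made up for the example.

```python
def score(weights, x):
    """Linear classifier score; positive means class A, negative means class B."""
    return sum(w * xi for w, xi in zip(weights, x))


def sign(v):
    return (v > 0) - (v < 0)


def perturb(weights, x, epsilon):
    """Move every coordinate by exactly epsilon against the current prediction."""
    direction = sign(score(weights, x))
    return [xi - epsilon * direction * sign(w) for w, xi in zip(weights, x)]


# 1000 "pixels" around mid-gray, each carrying a tiny amount of signal.
dim = 1000
weights = [0.01 if i % 2 == 0 else -0.01 for i in range(dim)]
x = [0.5 + 0.01 * sign(w) for w in weights]

x_adv = perturb(weights, x, epsilon=0.02)
max_change = max(abs(a - b) for a, b in zip(x, x_adv))

print(score(weights, x) > 0)      # original prediction: class A
print(score(weights, x_adv) > 0)  # prediction after the perturbation
print(max_change)                 # largest per-pixel change
```

No single coordinate moves by more than about 0.02 on values near 0.5, yet the prediction flips, because the many small changes all push the score in the same direction.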
So essentially, one of the important realizations is that adversarial examples are not glitches of the model. They are just manifestations of the fact that the way the model solves the task is different from how we would solve it. And our problem with adversarial examples is really our problem with the fact that we have a human-centric view of what a pig is, which of course draws on our prior about the meaning of what a pig is. And we should not necessarily blame our models for not aligning with our concept of what a pig is.
And there are also some interesting consequences for interpretability: well, if the features that the models use are not human interpretable, then it's really hard to expect that some out-of-the-box post-hoc interpretation method will actually tell you anything useful. Because if it tells you what the model actually uses for its decision, it will not make sense to you because, again, the features the model uses are not human interpretable.
On the other hand, if you just want to have a nice explanation, which some of the methods give you, it means that they are actually suppressing some important part of the spectrum that drives the model performance.
So in particular, it shows that unless you do something to intervene at training time, to make sure that the way the model solves the classification task aligns with what we as humans would even be able to make sense of, your hope for having a human-interpretable model is essentially lost. There is nothing you can do because, again, the fundamental language that the model uses is completely foreign to you, and vice versa.
AUDIENCE: I'm sorry to keep interrupting. It's very interesting.
ALEKSANDER MADRY: No, that's fine.
AUDIENCE: So about the interpretability, do you think that it's important for these models to be human interpretable in the sense that we can sort of peek into the model and be able to understand what the model is doing? I mean because it seems to be your implication that if I'm understanding you correctly, the non-robustness is partly a consequence of, or is somehow related to the fact that it's not human interpretable.
And my thought is that it's not like we can peek into human brains and the things going on in human brains are immediately human interpretable. It's interpretable to our brains but doesn't necessarily mean that it's interpretable to our conscious reasoning. And that's an important distinction I think.
ALEKSANDER MADRY: Yes. And that's a great conversation that I would love to have because that's a great question. I wouldn't say it's necessary. I would argue that it's definitely important in some aspects. In particular, when you are deploying these models-- that's a great conversation. Essentially I think the belief is that it is important to have it although there is some set of people to just say, OK. It can be a black box. As long as it doesn't do strange things, I am totally fine with it.
Now the question is, can you ensure that this black box does what you want to do in all these scenarios that you expose this black box to without being able to peek into what it's doing? But again, this is definitely very much a question for-- this is a topic for a debate. And I would say that I would like to have this ability. And I would like to figure out what it takes to get there. But I also would not say that, OK. You absolutely have to have it.
But then I would like to get people to kind of provide us what [INAUDIBLE] How do you make sure that these models do what you want them to do if you can't peek into what they are doing? And yeah. You can again say, oh, we can't really peek into what humans do. But over these all the years of evolution and kind of engineering, we actually got pretty good at understanding the failure modes of human vision. So we know how to engineer around it.
So yeah. Maybe that's kind of what we should do. We should just kind of now coexist with ImageNet models for a couple millennia and then we will know exactly how they fail. And then, we know how they fail even without being able to peek exactly inside them. But I'm not sure this is the most effective way to get there. But maybe this is the only way to do it. So that's a great topic for a discussion. And yeah. Your question is very [INAUDIBLE]
ALEKSANDER MADRY: OK. Great. So this was the interpretability. And now you can ask, OK. So this was all on the very passive side, saying, OK, this is what happens. Now you may ask, OK. If I realize that there is this discrepancy between how the models work and how we as humans work and how we would like these models to work, how can we reduce this discrepancy? What can we do about this?
In terms of that, yeah. You can do some interventions. You can do some training modifications. And in particular, one intervention that we are playing with quite a bit is to try to look at robustness. In particular, as I told you, [INAUDIBLE] robust, they rely on the so-called non-robust features.
So the question is, how can you modify the training to ensure that these models do not rely on the non-robust features? This is called robust training. Essentially, what robust training does is tell the model that, whatever you do, your predictions should be invariant to a certain set of perturbations. And this way, whenever the model wants to rely on non-robust features, it gets penalized for this, because reliance on non-robust features means that you don't have the corresponding invariance properties.
And by enforcing these invariance properties during training, you are ensuring that the model relies only on the robust features. So you can view this robust training as some kind of prior: you impose on the model the robustness prior, which just says, OK. The kind of features you want, they have to be compatible with these biases. OK?
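One common way to write down the robust training idea described above is as a min-max problem. The notation here is an assumption introduced for illustration and is not taken from the talk: $\theta$ are the model parameters, $\mathcal{D}$ is the data distribution, $\mathcal{L}$ is the training loss, and $\Delta$ is the set of perturbations the predictions should be invariant to (for example, a small $\ell_p$ ball around zero).

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
  \left[ \max_{\delta \in \Delta} \mathcal{L}(\theta;\, x + \delta,\, y) \right]
```

The inner maximization finds the worst allowed perturbation for each example, and the outer minimization penalizes the model whenever such a perturbation changes its prediction, which is exactly what discourages reliance on non-robust features.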
And now you can ask, OK. So this is the training. If I look at the robust models, what will happen in the context of vision? And long story short, these models indeed seem to be, to a surprising degree, much better aligned with what we as humans would view as proper.
So in particular, here, if you look at these [INAUDIBLE] maps for some models, so essentially we look at the importance of every pixel for classifying the picture on the left as a dog, then for some of them you get something like this, which kind of makes sense. Many of these pixels are bright, which means that they are influential, where they should be. But also many of them are in some strange places. But if you look at robust models, even for some simple notion of robustness and simple invariances, these pictures become much closer to what we as humans would see as correct.
And similarly, we overall find that for standard models, it's not hard to come up with two inputs for which the representation in the model is almost identical even though to a human they look very different, while for robust models we seem not to be able to do that. Essentially, it seems that the only way to get two images to be close in the representation layer, to have very similar representations, is to actually make the images similar to each other according to a human.
OK. So we have this effect that representation distance for robust models tends to better capture what the perceptual distance is according to humans. And I just want to say that we also have ongoing work with Janelle and with Josh that identifies similar phenomena in the context of [INAUDIBLE]. So there seems to be something happening here.
And yeah. Again, these robust models tend to be nicer in many, many ways. The neurons that these models learn tend to align better with things that humans can make sense of. They also tend to transfer better. And they lead to nicer models from the perception and graphics point of view. I will not go too much into it, although I'm happy to answer any questions about this.
So finally, the last thing, and this is going back all the way to the last question that he asked at the beginning, is about, OK. So we know that there is robustness, which allows our models to be a bit more aligned with action and perception. But what else can robust models give us? And the answer is that they can also give us a nice tool to try to identify these useful, in quotes, "correlations", or essentially to understand better what kind of [INAUDIBLE] in the image make the model do whatever it does.
So we call it "counterfactual analysis" in quotes because it's essentially counterfactual analysis even though it has this framework. And so, what I am talking about is the following. So here we have an image from the ImageNet model. And this image, even by a robust model is classified as an insect. Sorry. It's classified as a dog even though the correct label is insect.
And you might say, OK. Well the robust model just misclassified an image. No big deal. Happens all the time. Let us move on. But maybe you would like to understand, OK. So why did the classifier classify this image as a dog even though the correct label is insect? And to do that, we can just ask the model to morph the image. We just say, OK. So if you believe that this image is an image of a dog, can you show me how you would morph this image so it looks even more like a dog to you? So essentially, it looks even more doggy to you.
And when you do it, you will get this image on the right. And of course, in the image on the right, if you look at the top left corner, clearly, there is a dog there. So there is no surprise that the classification is a dog. However, the interesting thing is that if you look back at the original image in the same top left corner and squint your eyes, you realize that there is oh, also essentially a dog face over here. And then, this is an [INAUDIBLE] model. It has plenty of dogs in the training set. So it likes to hallucinate dogs whenever possible.
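The "morph the image so it looks more like a dog" step amounts to changing the input, not the model, so that the model's confidence in the predicted class goes up. Below is a toy sketch of that idea using gradient ascent on the input of a tiny linear softmax classifier. Everything here is a stand-in chosen for illustration: the two-class weight matrix `W`, the three-number "image" `x0`, and the step sizes are hypothetical, and in practice this would be done with a robust deep network rather than a linear model.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_probs(W, x):
    """Class probabilities of a linear softmax classifier with weight rows W."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])


def morph_towards(W, x, target, step=0.1, steps=20):
    """Gradient ascent on the input to raise the log-probability of `target`."""
    x = list(x)
    for _ in range(steps):
        p = class_probs(W, x)
        # d/dx log p[target] = W[target] - sum_k p[k] * W[k]
        grad = [
            W[target][j] - sum(p[k] * W[k][j] for k in range(len(W)))
            for j in range(len(x))
        ]
        x = [xi + step * g for xi, g in zip(x, grad)]
    return x


W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]  # hypothetical rows: "dog", "insect"
x0 = [0.2, 0.3, 0.4]                     # toy input the model already calls "dog"
x1 = morph_towards(W, x0, target=0)

before = class_probs(W, x0)[0]
after = class_probs(W, x1)[0]
print(before, after)
```

Comparing `x0` with `x1` shows which parts of the input the model amplifies to become more confident, which is the information the talk uses to explain the dog prediction.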
So this gives you an understanding of what aspect of this image made this model believe this is a dog even though it shouldn't. So this is a way to identify it. One of the ways. It's not foolproof, but it's a tool. In general, this is the way we like to think about this robustness framework: as a way of controlling what correlations our models can and should extract from the image.
So yeah. That's essentially all I wanted to say. I just wanted to end with the high level picture that I already repeated throughout the talk, I think, in one form or another.
Essentially, the point is that as much as we like to think about doing machine learning as this task where you have a data set, you train a model, you measure its accuracy and improve it, and then just keep doing that, this is not the full view.
And what we should also realize is that this data set did not appear in the sky one day. It's not given to us by God. It's just something we created to be a proxy for a real world task we actually want to solve. And like every proxy, this benchmark will be imperfect.
And you really have to worry about, OK. So is this benchmark I am trying to solve still capturing the task I really intend my model to eventually solve? Or has it now just become a game for the same [INAUDIBLE]? So this benchmark versus real world task misalignment becomes a very important question that, as our models get better, we should be confronting more, and more, and more.
OK. So I have definitely run out of time. So I will stop here. And I'm happy to stay a little bit and answer any questions. But yeah. Thank you.