Adversarial examples for humans
January 4, 2021
December 12, 2020
Gamaleldin Elsayed, Google Brain
All Captioned Videos SVRHM Workshop 2020
PRESENTER 1: Our next speaker is Gamaleldin Elsayed. He's a researcher at Google Brain. He finished his PhD at Columbia University in theoretical neuroscience, I believe. He's originally from Egypt. And I know previously I introduced a math olympiad participant, that was Melanie. Gamal is actually an Olympic athlete in fencing, representing Egypt. So yeah, and also a researcher. So an interesting combination.
So Gamal, please, feel free to take it away. The stage is yours. He'll be speaking about adversarial examples for humans.
GAMALELDIN ELSAYED: Thank you very much for the introduction and for having me. It's great to be here in this great workshop. Hi, everyone. I'm Gamal.
So today I want to talk about some of the experiments I've been doing at Google in the past two years, trying to study some differences between computer vision models and our visual system. One of the things that I've been interested in is studying the failure modes of computer vision models, and then comparing those to humans. And as many of you may know, one of the most intriguing failure modes of computer vision models is adversarial examples, which have been studied by many people in the past six years or so.
As many of you may know, adversarial examples are inputs that are designed by an adversary to make a machine learning model make wrong predictions. In particular, in computer vision, people have shown that you can add very small perturbations to images, and these perturbations can, for example, make a neural network make any prediction you want.
So this is one of the most famous panda images, I guess, in the field. In this work, people added just a very small perturbation, less than 1% of the intensity level of the pixels, and the network here switches its prediction from a panda to a gibbon. Now, maybe switching a prediction from panda to gibbon is not a big issue.
But, for example, in this work, if you switch the prediction from a stop sign to a kite, that can be a very big problem. That's why these examples pose safety and security concerns, and that's why it's really important to study their properties.
So I guess when adversarial examples were first brought up, it was generally assumed that these are failure modes unique to these computer vision models. And one thing I've been interested in is studying whether humans can actually be susceptible to the same mistakes that these models are making. Generally, these adversarial examples are generated by an optimization process that requires access to the model's parameters and architecture.
And it may seem initially impossible to transfer these to humans, as we don't have access to the human brain. However, there have been some signs or clues from different works in the field that this transfer is actually possible. And the first clue here is black box adversarial attacks.
So generally, you can create adversarial examples on an ensemble of models that you do have access to, and then attack a model that you don't have access to by using these ensembling techniques. For example, if you include in the ensemble different models with different architectures, trained on different data or trained with different loss functions, you're more likely to successfully attack a model that you don't have access to.
The second clue here is adversarial examples that were made invariant to transformations. These stronger adversarial examples, which work in the physical world, sometimes show features that are relevant to our visual system. And here I have two examples of this.
The first example is this adversarial poster. The goal here was to design a poster that can be printed in the physical world and then viewed by a model, and then, no matter what the viewing conditions or angle, the model is going to predict a computer label rather than a cat label. And if you look closely here, you can see that by enforcing this invariance to spatial transformations, the optimization generated some box-like features which may resemble a desktop computer.
Even more interesting is this work called adversarial patch. In this work, the goal was to design a patch that you can just throw in front of any computer vision model, and the model will ignore whatever object it's actually seeing and predict everything as being a toaster. So any model is just going to be absolutely confident it's seeing a toaster, no matter what object is there.
And if you look closely at this patch, you can see that it actually has toaster features. For example, it has these two openings here. It acquired these toaster-like features by doing this optimization on these models.
So again, our hypothesis here is that these strong adversarial examples that are generated to transfer across computer vision models target features relevant to our visual system, and thus they can transfer to us humans. And the aim here is to identify in what contexts we can detect the effects of these adversarial examples in humans. I'll go through two projects, two experiments that we have done in the past two years to try to quantify and investigate this hypothesis.
The first work here is a paper that appeared two years ago at NeurIPS. This paper is with Shreya Shankar, Brian Cheung, Nicolas Papernot, Alex Kurakin, Ian Goodfellow, and Jascha Sohl-Dickstein. And the main goal of this paper is to try to address the mismatch between computer vision models and our visual system, and then ask whether these examples transfer to humans in this [INAUDIBLE] setting where we address the mismatch between the two systems.
So the strategy here is basically to start by addressing the difference in the initial early visual processing, then address the architecture difference between the two systems, then generate adversarial examples using these black box methods, and then have humans classify the images and see if they make the same mistakes that these models are making.
So let me comment here on this mismatch between computer vision models and our visual system. There are two major differences. The first one is the early visual processing. Usually a computer vision model, a convolutional network for example, takes a grid of pixels, and then layer by layer it transforms this to create high-level representations of the input image. Our brain doesn't work in the same way.
We have our eyes, which have a fovea. So we don't have uniform resolution across the whole visual field; we have high resolution in the center, and then the resolution drops off with eccentricity. In order to account for that difference, we incorporated an early layer into these models that performs this eccentricity-dependent spatial blurring.
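As a rough sketch of what such an eccentricity-dependent blur can look like (this is a simplification, not the exact layer from the paper; the linear blur-radius schedule and window-averaging here are illustrative choices):

```python
import numpy as np

def foveate(img, max_radius=4):
    """Eccentricity-dependent blur: sharp at the center (the 'fovea'),
    increasingly blurred toward the periphery, loosely mimicking the
    falloff of retinal resolution. `max_radius` is an arbitrary choice."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_dist = np.hypot(cy, cx)
    out = np.empty_like(img, dtype=float)
    for y in range(h):
        for x in range(w):
            # blur radius grows linearly with distance from the center
            ecc = np.hypot(y - cy, x - cx) / max_dist
            r = int(round(ecc * max_radius))
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            out[y, x] = img[y0:y1, x0:x1].mean()
    return out
```

The center pixel is passed through unchanged, while peripheral pixels are replaced by local averages over progressively larger windows.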
The second major difference is feedback, as many have talked about today. Feedback is a very important aspect of our visual system, and most of the computer vision models we have are feed-forward architectures. In order to address this difference, we used ideas from psychophysics: we limit the presentation time and use backward masking to try to limit the feedback processing in the brain, and make our visual system function more like a feed-forward system.
So apart from these two differences we just use standard ways of generating adversarial examples that people use in the field, which I'm going to mention in a second.
All right, so this process starts with a dataset that has rich classes, for example ImageNet. The first thing we do is group these 1,000 classes into coarse classes that people are familiar with. We created three image groups here: the pets group includes all images of cats and dogs, the hazard group includes images of spiders and snakes, and the vegetables group includes images of broccoli and cabbages.
Then what we do is use this ensembling technique to try to create perturbations that fool all the models together. The way to do this is simply to compute the probability of predicting a target class. Say you start with a cat image; you compute the probability of the dog class, and then you use an iterative optimization process that takes gradient steps with respect to the input, such that the whole ensemble of models increases its predicted probability of the dog class.
So it's a very simple optimization process, just gradient-based optimization, nothing fancy. One comment on this: you generally have a constraint on the scale of these perturbations, so that the final image is very similar to the clean image.
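The iterative ensemble attack plus the perturbation-scale constraint might be sketched like this. To keep it self-contained, I use toy linear-softmax "models" instead of real convolutional networks, and `eps`, `lr`, and `steps` are illustrative values, not the paper's:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ensemble_targeted_attack(x, models, target, eps=32/256, lr=0.05, steps=100):
    """Targeted iterative attack: take gradient steps on the input that raise
    the probability every model in the ensemble assigns to `target`, while
    keeping the perturbation inside an L-infinity ball of radius eps."""
    x_adv = x.copy()
    for _ in range(steps):
        grad = np.zeros_like(x)
        for W in models:  # toy linear models: logits = W @ x
            p = softmax(W @ x_adv)
            # gradient of log p[target] with respect to the input
            grad += W[target] - p @ W
        x_adv = x_adv + lr * np.sign(grad / len(models))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # scale constraint
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay a valid image
    return x_adv
```

The `np.clip` against `x - eps` and `x + eps` is what keeps the final image visually similar to the clean one, as described above.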
So we took these images and ran this two-alternative forced-choice experiment. In this experiment we had 38 subjects. They sit in front of a screen, and each experiment is basically choosing between the two classes of one of the image groups I mentioned. So if subjects are doing the pets group, they just choose between cats and dogs.
The trial structure is as follows. It starts with a fixation period, then we briefly present an image for 60 to 70 milliseconds, and then we follow the presentation with a high-contrast mask. Again, the idea of this brief presentation and the masking is to try to reduce the feedback processing of our visual system on these images. And then we record the chosen class of the image and the reaction time.
The conditions that we show in these experiments are as follows. We start by showing clean images, just to get a baseline of people's accuracy in this task. The second condition is adversarial images, which is, say, a cat image perturbed to be a dog. And then we had one control, which we call the flip control, and the goal of this control is to control for the scale of the perturbations.
The idea is to see whether any degradation in performance is simply due to the added noise, or whether there is more structure to it. So you simply take the difference between the adversarial and clean images, flip that perturbation, and then put it back on the image.
The effect of this on the models is that it actually deactivates the adversarial effect. For example, if you show this image to a computer vision model, it's going to predict this cat as a dog. But simply by flipping the perturbation, the model goes back to predicting this image as a cat again.
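The flip control just described can be sketched as follows. I assume an upside-down flip here; the exact flip direction is my assumption, and the point is only that the control keeps the perturbation's magnitude and pixel statistics while destroying its spatial alignment with the image:

```python
import numpy as np

def flip_control(clean, adv):
    """Build the 'flip' control image: extract the adversarial perturbation,
    flip it spatially so its structure no longer lines up with the image,
    and add it back to the clean image."""
    delta = adv - clean          # the adversarial perturbation
    flipped = np.flipud(delta)   # same energy, misaligned structure
    return np.clip(clean + flipped, 0.0, 1.0)
```

Because the flipped perturbation has exactly the same norm as the original one, any drop in human accuracy beyond this control cannot be explained by added noise alone.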
The final condition here is false images. This is basically an image from a third class. For example, if we're doing the cat versus dog classification experiment, this could be a car image. Some of the time we perturb it to be a dog, and some of the time we perturb it to be a cat. And then we ask whether humans make predictions similar to the models.
So let's first look at the results for this false condition. As I just mentioned, subjects cannot choose the true class of the image here; they can only choose cats or dogs, and we show this car to them. If the adversarial perturbation has no effect on humans, you would expect to get one of two patterns. One is that people just report cats and dogs randomly. If this is the case, the probability of the target class would be 0.5, which is our chance level.
The second scenario is that people who like cats are going to press cat all the time, and people who like dogs are going to press dog all the time. But because we perturb these images toward cat and toward dog with equal probability, this would also average to 0.5. And what we see here, across all the image groups that we have run in these experiments, is that there is a significant shift in human perception towards the target class that we specify with our attack.
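The claim that both null patterns average out to chance can be checked with a quick simulation (a sketch; the 0/1 class coding and the trial count are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# Each trial's perturbation targets "cat" (0) or "dog" (1) with equal probability.
target = rng.integers(0, 2, size=n)

# Null pattern 1: subjects respond randomly.
resp_random = rng.integers(0, 2, size=n)
# Null pattern 2: subjects always press the same button (say, always "cat").
resp_biased = np.zeros(n, dtype=int)

# In both null cases, the fraction of target-class responses averages to 0.5.
p_random = (resp_random == target).mean()
p_biased = (resp_biased == target).mean()
```

Any reliable deviation from 0.5 toward the target class therefore cannot come from random guessing or a fixed response bias.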
What this means is that people are more likely to report this image as a dog if we add the perturbation that is specific to the dog class, and vice versa if we add the perturbation that is specific to the cat class. Even more interestingly, we can divide this metric based on the subjects' reaction times, where on the left here you have the fast responses and on the right the slow responses. We can see that the effect also depends on reaction time: humans are more biased toward the target class when they respond more quickly.
Now let's go back to the other conditions, where subjects can actually choose the original class of the image. So here people can choose the true class. If we look at the results comparing adversarial images to the clean images, you can see that there is reduced accuracy in humans reporting the true class. But that is not very interesting, because it could simply be because these images are noisier.
What's more interesting is to compare the flip condition with the adversarial examples. We can see there is also reduced accuracy for the adversarial images compared to the flip control, which shows that these examples, in this time-limited presentation, in this 60-millisecond setting, actually do transfer to humans.
There were some limitations of this study. One of them is that in this experiment, because we have this very fast presentation, we needed to increase the perturbations to make them very obvious to humans. So the perturbations had a very large magnitude compared to those that fool computer vision models: here we used 32 out of 256 in the L-infinity norm, whereas models can usually be fooled with perturbations of even less than 8 out of 256.
The second limitation is that we could only demonstrate this effect in this time-limited presentation. That was very key to the success of this transfer to humans. Let me show you an example. This is an image of a spider that we perturbed to be a snake. In the time-limited presentation setting, about 70% of the people thought this image was a snake. But if you just take your time looking at this image, no one is going to say that's a snake. Everyone is going to say this is a spider.
In this next study, I'm going to discuss a new experiment that we conducted to see if we can change the task, and whether we can then measure the influence of small adversarial perturbations in a time-unlimited setting. This is new work led by Vijay Veerabadran, who is an excellent student interning with us this year, and is joint work with Jon Shlens, Mike Mozer, Jascha, and myself.
OK, so the main idea here is that in the original classification task, it was really hard to measure the effects of small perturbations in this time-unlimited setting. So here we're trying to investigate whether we can change the task so that we can detect these subtle effects of the adversarial perturbations. The [INAUDIBLE] is quite similar: we start by matching the initial visual processing by adding this fovea layer.
What we do differently is we design adversarial perturbations with different scales. So now we have very small perturbations and very large perturbations, and we measure human perception across these different perturbation scales. And the key difference here is the task. Here we didn't use a classification task; instead we used a comparison task. Instead of just showing one image, we show two images, and then we ask people to compare across these two images, which I'm going to discuss in a second.
So this is the new task. The task has a specific class; in this case, it's with respect to the cat class. What we do is show two images side by side to humans, and these two images are perturbed in different ways. And then we ask humans which image is more cat-like.
What we want to see from this is: if these perturbations are changing model confidence in different ways, the model is going to be more confident in one image as being more cat-like than the other. Then we can measure the alignment of humans in this comparison task with these models. This is an MTurk study where we recruited 100 subjects, and we just recorded which image they chose, right versus left.
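The model's "choice" and the alignment measure just described can be written down in a few lines. This is a sketch; the function names and the left/right encoding are mine, not from the study:

```python
import numpy as np

def model_choice(p_left, p_right):
    """The model 'chooses' whichever image it assigns higher confidence
    in the task class (e.g. cat)."""
    return "left" if p_left > p_right else "right"

def alignment(human_choices, model_choices):
    """Fraction of trials on which the human picked the same image as the
    model. 0.5 means humans are no more aligned with the model than chance."""
    return np.mean([h == m for h, m in zip(human_choices, model_choices)])
```

Alignment reliably above 0.5 is then evidence that the perturbations influence human perception in the same direction as they influence the model.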
The different perturbation conditions that we used are shown in this slide. The first perturbation condition is plus cat, which basically takes an image and increases the model's confidence in the cat class. And minus cat is the opposite: it uses the same iterative optimization to reduce the model's confidence in the cat class.
Then we have the plus dog perturbation. This is a perturbation that increases model confidence in a different class than the experiment's class. And then what we do is pair these conditions. So again, we had the false images, which are not cats and dogs.
First we pair plus dog and plus cat. If you show plus dog and plus cat to the model, the model is more confident in the cat class for the plus cat image than for the plus dog image. So the model would choose plus cat, and what we measure is whether humans actually make the same choice.
The other condition here is plus cat versus flip plus, which is similar to the previous experiment. Here we also use the flip control to try to deactivate the adversarial effect on models. In this case, the model would predict, with high confidence, plus cat to be more cat-like than flip plus. And then we see if humans are actually going to make the same choice.
We also had conditions where the images start from the true class, which is the cat class here. What we did is generate a perturbation that decreases the model's confidence in cats; this is the minus cat perturbation. And we have the flip control as well. We paired these two conditions, showed them side by side to humans, and asked them which one is more cat-like. In this case, the model would predict flip minus to be more cat-like than minus cat, because minus cat decreases the model's confidence in the cat class.
We also varied the scale of these perturbations. So here you have very small, subtle perturbations, and here you have a large perturbation, similar to what I had in the previous experiment.
Now let's look at the first condition, which is plus cat versus plus dog. In this condition, as I mentioned, the model would predict plus cat to be more cat-like than plus dog. And if we look at human alignment with the model across the different perturbation scales, we can see weak evidence of some alignment here, but it's not very reliable.
That could presumably be because there are some shared features between the cat and dog classes. Both are animate classes, so people could confuse features of cats and dogs.
What's more interesting is the plus cat versus flip plus condition. Again, the flip is the control that mismatches the perturbation to the image and deactivates the adversarial effect on models. So the model here predicts plus cat to be more cat-like than flip plus.
And what we see here is a very significant effect across all epsilon values. We could detect this effect even at a perturbation scale of 2 out of 256. This is a very small perturbation, less than 1% of the range of the intensity levels here. And yet we can see that humans significantly choose plus cat as more cat-like than the flip plus control. So this shows that in this comparison task, we can uncover an influence of adversarial examples on humans that is similar to their effect on the models.
Another thing we can do is look at the true class conditions. Again, this is the condition where we start from a cat image. We ask people which image is more cat-like, where we reduce the model's confidence in the cat class by adding this negative cat perturbation, and we have the flip control as well. And what we can see here is that this effect is also consistent in this condition: we can also measure a significant bias in human perception here.
So just to summarize: in these two experiments, what we have been trying to validate is the hypothesis that adversarial examples that transfer strongly between computer vision models also transfer to humans. What we showed with these experiments is that adversarial perturbations that cause a significant change in model predictions can also influence human perception, both in brief presentations of a single image and in extended viewing conditions where we compare images rather than classify them.
I think these experiments show that the decision boundary of our visual system is, surprisingly, quite similar to that of an ensemble of convolutional neural networks. However, one thing that I haven't mentioned is that the effect on humans is much smaller than the effect on models. We can detect significant effects, but these effects are very small, whereas models are completely fooled by these perturbations.
So this shows that there is still a lot of work to be done to make these models more robust and more similar to our visual system. So that's all I have. Yeah, thank you very much.