Compositional Generative Networks & Adversarial Examiners: Beyond the Limitations of Current AI
Date Posted:
May 5, 2021
Date Recorded:
May 4, 2021
CBMM Speaker(s):
Alan L. Yuille All Captioned Videos Brains, Minds and Machines Seminar Series
Description:
Current AI visual algorithms are very limited compared to the robustness and flexibility of the human visual system. These limitations, however, are often obscured by the standard performance measures (SPMs) used to evaluate vision algorithms, which favor data-driven methods. SPMs, however, are problematic due to the combinatorial complexity of natural images and lead to unrealistic expectations about the effectiveness of current algorithms. We argue that tougher performance measures, such as out-of-distribution testing and adversarial examiners, are required to realistically evaluate vision algorithms and hence to encourage AI vision systems which can achieve human level performance. We illustrate this by studying object classification where the algorithms are trained on standard datasets which have limited occlusion but are tested on datasets where the objects are severely occluded (out-of-distribution testing) and/or where adversarial patches are placed in the images (adversarial examiners). We show that standard Deep Nets perform badly under these types of tests but Generative Compositional Nets, which perform approximate analysis by synthesis, are much more robust.
TOMASO POGGIO: I'm Tomaso Poggio, introducing the speaker for today's Center for Brains, Minds & Machines weekly seminar. And I'm very happy to introduce Alan Yuille. I met Alan for the first time when he came as a postdoc at the Artificial Intelligence Lab back in '82, I think. This was almost 40 years ago. And he was coming as a theoretical physicist in quantum gravity, having worked with Stephen Hawking in Cambridge. And he was one of, I would say, the first physicists moving to computer vision and machine learning and AI-- as I said, 40 years ago.
I'm proud to be a co-author with him of a few papers on the math of some aspects of computer vision-- scale space, in fact. He was also one of the very early members of CBMM, attending the summer school, because his mission and objectives are very much the same as those of the Center for Brains, Minds & Machines. And that is the science and engineering of intelligence, with science being the primary motivation. Alan is presently the Bloomberg Distinguished Professor of Cognitive and Computer Science at Johns Hopkins. Before getting there, he was a professor at Harvard, a researcher at the Smith-Kettlewell Eye Research Institute, and a full professor of statistics and computer science at UCLA.
It's good you're back on the East Coast, even if we can see you only virtually. Well, the floor is yours, Alan. Great to see you.
ALAN YUILLE: OK, well, thank you very much, Tommy, for that very kind introduction. Yes, it doesn't [INAUDIBLE] 40 years ago, but when I started getting interested in AI rather than physics, I was extremely lucky to end up, perhaps by chance, at MIT, where I had people I could work with, real experts like Tommy, and got guided into the field really efficiently. As for the change from physics to AI, I'm really very happy I made it. One of the best decisions of my life was to make that change and to move to MIT.
I'm also delighted to be talking to the Center for Brains, Minds & Machines, because the philosophy behind it is exactly what really motivates me. The center is one of the things I admire, and I keep trying to set up something like it at Hopkins, admittedly on a somewhat smaller scale.
Anyway, so what I'll talk about is the title: Compositional Generative Networks & Adversarial Examiners. And the real goal is to go beyond the limitations of current AI. The talk is a little unusual because I'm trying to cover a lot of topics, and I'm really trying to make four main points. As for technical details, I'll sketch the methods, but [INAUDIBLE] details, that might allow a longer discussion.
So, high-level discussion on the limitations of current AI vision. Despite the recent successes, current AI systems have many limitations, and there is a need for much more fundamental research to overcome them. Not everything can be solved by using more data. I've been giving this type of talk in places, often to the computer vision or machine learning community-- Oxford, you know, last month. So I'm saying things that probably the computer vision people need to hear, but that I think people at CBMM would probably take for granted.
So I can shock computer vision researchers by telling them that AI systems are brittle, special purpose, consume large amounts of energy, require lots of annotation, et cetera, and lack the robustness, flexibility, and enormous versatility of human intelligence. And so that leads to two questions: how can we develop vision algorithms that overcome these limitations? And secondly, how can we test their performance?
Testing the performance seems a really trivial question. I never imagined when I started doing AI that I would even imagine writing a paper about how to test algorithms, because I wouldn't have thought it was an interesting question. But I think it really matters. I think having better ways of evaluating algorithms is almost crucial to getting algorithms that have better performance, because, at least in my area, getting papers accepted requires producing tables showing that your algorithm is better than everybody else's on a whole range of different tests. And yet, I think a lot of those tests are not very good and not necessarily fully meaningful.
So the methods are evaluated by what I'll call standard performance measures. And I'll get into that on the next slide. But it has given rise to a certain type of conservatism, which can make it hard to publish novel work and sometimes even novel data sets. Quite different from the situation 40 years ago. If we had the reviewers then that we have now, computer vision would never have got started, I think.
So what I mean by standard performance measures-- well, they require training and testing algorithms on data sets, evaluating them by average-case performance on finite-sized balanced annotated data sets-- BAD for short. Now there are very good reasons for doing this. There's very good theoretical understanding-- PAC theory, VC theory-- in machine learning. And Tommy, who's worked with [INAUDIBLE], has contributed to that. So it makes a lot of sense. And it's been a very effective strategy, which has dominated vision and machine learning for over 20 years and has taken us up to the good level that we are at now.
But it also has drawbacks, because I think it strongly favors data-driven regression-type methods. And I'm using the word regression to mean very general regression in the statistical sense-- discrete or continuous variables, it doesn't matter. It also encourages researchers to work on problems for which annotated data sets exist. And it can lead to the tyranny of data sets, which favors established, well-engineered algorithms. So that's the state.
And there are certain problems with SPMs that people have known about ever since data sets were created. One is that data sets are biased, important events may be rare, and results on one data set may not transfer to different domains. Twenty years ago, we did work on edge detection. There were two different data sets. You knew that if you trained your edge detector on [INAUDIBLE], it didn't work on South Florida, and vice versa. And also, instead of average-case performance, you might care about worst-case performance.
Now, those are all things that, at least in computer vision, most people would generally agree on-- at least the thoughtful people. But what I would argue is that there's actually a more fundamental underlying problem with SPMs, which is basically that the set of images is not just infinite, but the set of visual scenes is also combinatorially complex. So it's hard, and arguably impossible, for a finite-sized data set to be representative of the real world, even for static images. OK.
I mean, you can be representative of part of it, but if you're going into a combinatorially complex situation, the size has to be very big-- has to be huge-- so big that it really becomes almost impossible. So how can you test vision algorithms if you want to go beyond SPMs? There are alternatives. These alternatives exist. People are doing them.
One of them is out-of-distribution testing. You train on one data set and you evaluate on data which has different statistical properties. And I will return to that in part of the technical work. Another is domain transfer. You train on one data set, and you perform domain transfer to data from a second data set. For example, we actually did that 20 years ago, before people were really working on domain transfer, I realized. And I will return to that later.
Reduced training-- training using small amounts of data and testing with a much larger data set. Also, evaluating on multiple tasks but only training on one. So you train to detect an object using bounding box annotations, but you also find the boundary, find the parts, and things like that. So we would argue that all of these things are good but may not go far enough.
And so the more extreme version, which I'm not sure I've persuaded anyone in computer vision to do, but at least I'm starting out to see if I can find a few people, is that you should start testing algorithms in a somewhat different way, by adversarial examiners which probe for their weaknesses. So the motto for this is: let your worst enemy test your algorithm. An adversarial examiner selects test images sequentially, one at a time, where the selection can be based on the algorithm's response to the previous image. So you don't just have a finite fixed data set of images that you're testing on. You allow questions to be asked, changes to be made to the images, which can then enable you to, in principle, explore [INAUDIBLE] number of images. There are two examples we had. These are not huge.
One is work by Michelle Shu, where you identify model weaknesses with adversarial examiners. And another one, which I'll say a little more about, is called Patch Attacks, where you can modify images in a way that's perceptible to humans but which really hurts standard deep networks. And so a provocative way to think about it is that in machine learning, the idea is that you test with a random set of samples-- this is basically the main idea. On the other hand, if you think of high technologies like computer code or airplanes, you don't really do that. You identify the weak points of the algorithms.
And this would also be similar to how a professor would test students. You don't test students by asking them a randomly selected set of questions. You test them by asking a series of questions, where each question depends on their answer to the previous question. And if you're a student, that's presumably how you test professors as well. So the question is whether, now that we've got to the stage of having good algorithms, we should perhaps test them in a rather tougher manner, particularly if we want to have systems that work in the real world really reliably.
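To make the adversarial-examiner idea concrete, here is a minimal toy sketch. The renderer, the classifier, and the single "viewpoint" factor are hypothetical stand-ins, and the selection rule is a crude heuristic rather than the learned policy used in the actual work; the only point is that each query depends on the model's response to the previous image, instead of drawing images i.i.d. from a fixed test set.

```python
# Toy sketch of an adversarial examiner: adaptive, sequential test selection.
import random

def render(viewpoint):
    """Hypothetical renderer: returns an 'image' for a given viewpoint (stub)."""
    return {"viewpoint": viewpoint}

def classifier_confidence(image):
    """Hypothetical black-box classifier: confidence in the correct label (stub).
    Pretend the model happens to be weak near viewpoint 2.0."""
    return max(0.0, 1.0 - abs(image["viewpoint"] - 2.0))

def adversarial_examiner(n_queries=20, step=0.5):
    viewpoint = random.uniform(0.0, 4.0)          # start from a random generative factor
    worst = (1.0, viewpoint)                      # (lowest confidence so far, its viewpoint)
    for _ in range(n_queries):
        conf = classifier_confidence(render(viewpoint))
        worst = min(worst, (conf, viewpoint))
        # Adaptive selection: keep probing near viewpoints where confidence dropped,
        # otherwise restart somewhere random. A real examiner would use a learned policy.
        if conf < worst[0] + 0.1:
            viewpoint += random.uniform(-step, step)
        else:
            viewpoint = random.uniform(0.0, 4.0)
    return worst

print(adversarial_examiner())
```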
So here's a way of introducing this sort of work from the point of view of computer graphics. You can, of course, in principle use computer graphics to systematically explore algorithm performance as you vary all the generative factors. So it's very trivial, for example, to construct a sofa, view it from different angles, and then test an algorithm that's 100% successful at finding sofas on ImageNet-- and find that if you show the sofas from certain viewpoints, the performance of the ImageNet-trained algorithm falls almost to zero. Because, hey, it's probably never really seen that viewpoint. Or you can modify the color to one the deep network has never seen.
You can go into UnrealStereo. So stereo algorithms have been around for ages. I think Marr-Poggio in 1976 is one of the earliest, and a lot of the algorithms are still based on that principle. But if you know stereo, you know that there are situations which would cause stereo to fail-- you know, specularities, texturelessness, transparencies, disparity jumps. It's very hard to evaluate that on benchmark data sets because these are really painful to annotate.
But if you use computer graphics, you can generate these types of situations. You can modify specularity and texturelessness, and you can systematically study how algorithms perform as you vary these types of factors. However, this approach runs into problems because the number of variables for generating images can become huge.
So in those two examples I showed, you could do exhaustive search, and that worked because the number of variables was small. You just vary the viewpoint-- three variables-- or the color-- two variables-- et cetera. Very easy to do, quite practical. Differentiation, local gradient search-- well, you can do that. We did it in the paper on adversarial attacks beyond the image space. But differentiation is only a rather short-range type of search. It doesn't really push you far enough. It only allows you to explore the local neighborhood.
So then go back and think: what could your worst enemy do if they wanted to break your algorithm? It seems that the next step would be to say, you should train a search policy on the data, which allows you to pick your next image based on the response of the algorithm to the previous image. How do you do this? It can be done. In this case, we did reinforcement learning; you could do it with other methods. But essentially, the key concept is: learn a search policy, and then apply that search policy when you're testing the algorithm.
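Here is a bare-bones sketch of what "learn a search policy" could look like, assuming a toy setup with a single rendering parameter and a REINFORCE-style update. This is illustrative only, under hypothetical stubs, and is not the reinforcement learning system used in the work described.

```python
# Toy policy-gradient examiner: learn which moves tend to expose model failures.
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = np.array([-0.5, 0.0, +0.5])        # how to move the rendering parameter
theta = np.zeros(len(ACTIONS))               # logits of the search policy

def classifier_confidence(azimuth):
    """Hypothetical target model: weak around azimuth = 2.0 (stub)."""
    return float(np.clip(1.0 - abs(azimuth - 2.0), 0.0, 1.0))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for episode in range(200):                   # train the examiner's policy
    azimuth, grad, reward = rng.uniform(0, 4), np.zeros_like(theta), 0.0
    for _ in range(5):                       # short episode of sequential probing
        probs = softmax(theta)
        a = rng.choice(len(ACTIONS), p=probs)
        azimuth += ACTIONS[a]
        reward += 1.0 - classifier_confidence(azimuth)   # reward = model failure
        grad += np.eye(len(ACTIONS))[a] - probs          # d log pi(a) / d theta
    theta += 0.05 * reward * grad            # REINFORCE update

print("learned action preferences:", softmax(theta))
```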
So over here on the right at the bottom, you're plotting the performance on the test data over the number of iterations-- the number of examples that you show to it. Blue is what happens if you just pick random examples, in this case random viewpoints. And basically, you keep on testing and you think your algorithm is doing pretty well, because it survived all the random tests.
However, if you have this adversarial examiner trained with a search policy, it's initially also flailing around a bit, like the random method-- it's changing viewpoint, not making much difference. But at a certain point, it locks on to the difficulty of classifying some of these objects. And the difficulty is that from certain viewpoints, these objects look fairly similar. So after a certain stage, it suddenly realizes this and converges to viewpoints for which performance is really bad. OK.
So this is one example: try to get a policy that allows you to systematically manipulate the image in response to the performance of the algorithm. Here's another one that I'll bring up, and we'll come back to it later: how about doing this on real images? There are, of course, a whole lot of standard attacks people have done for the last six years or so where you have imperceptible changes to images, et cetera, using differentiation.
Now this is not what we're doing. These changes are going to be perceptible, but they're not going to affect human interpretation of the images in the slightest. So again, you have a patch attack policy which is trained by a reinforcement learning agent. It works by selecting texture dictionaries. These dictionaries have been obtained using features extracted from a surrogate deep network, not the one you're going to attack. And then you generate patches that have these texture features.
So these are not objects. They look like a sort of Picasso or Braque painting from the early 1900s. They actually look quite artistic in a way. But they are attempting to capture some sort of texture property of the object. OK.
So the agent can select one of these patches, or a couple of them, put them into the image, and move them around. It can see how the deep network responds to them. And based on that, it can select a different type of patch and/or move the patch somewhere else. So the system doesn't have a fixed data set of images to test on. It has a fixed basic set and a combinatorial number of operations it can perform on that set of images, which enables it to search over a far bigger input space.
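A rough sketch of this patch-attack loop follows, with the texture dictionary, the pasting operation, and the black-box model all replaced by toy stand-ins, and with a greedy random search standing in for the trained reinforcement-learning agent:

```python
# Toy sketch of a black-box patch attack: place texture patches, query, keep what works.
import random
import numpy as np

IMG_SIZE = 64
TEXTURE_DICTIONARY = [np.full((8, 8), v) for v in np.linspace(0.0, 1.0, 10)]  # stand-in textures

def paste_patch(image, patch, x, y):
    out = image.copy()
    out[y:y + patch.shape[0], x:x + patch.shape[1]] = patch
    return out

def target_prob(image):
    """Hypothetical black-box model: probability of the attack's target class (stub)."""
    return float(image.mean())               # pretend brighter patches fool it

def patch_attack(image, n_steps=50):
    best_img, best_p = image, target_prob(image)
    for _ in range(n_steps):
        patch = random.choice(TEXTURE_DICTIONARY)        # pick a texture patch
        x = random.randrange(IMG_SIZE - patch.shape[1])  # pick a location
        y = random.randrange(IMG_SIZE - patch.shape[0])
        candidate = paste_patch(best_img, patch, x, y)
        p = target_prob(candidate)
        if p > best_p:                        # keep moves that raise the target-class score
            best_img, best_p = candidate, p
    return best_img, best_p

_, score = patch_attack(np.zeros((IMG_SIZE, IMG_SIZE)))
print("target-class score after attack:", round(score, 3))
```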
And this type of process works very well. It has a high attack success rate, and this is a targeted attack. So success occurs if it takes a t-shirt or a sweatshirt and turns it into a pretzel, because the target is a pretzel. If it turns it into a fish, that would not be a success. So it's a targeted attack. It's black box-- it knows nothing about the interior properties of the deep network. And the success rate is over 95%.
And if you analyze the attacks, it generally supports the idea that the deep nets don't have a very detailed knowledge of the structure of the object. They really work by recognizing large numbers of textures or appearance templates, which they have memorized in some form. And so one aspect here was that it's not just that the deep network thinks the sweatshirt is a pretzel-- it thinks this little thing here, which is a very abstract sort of patch, looks more like a pretzel than a real pretzel does.
OK, so a response I get to giving this sort of talk is: aren't your adversarial examiners too tough? Perhaps the types of images they generate rarely or never occur in practice. Is this a bit like German over-engineering, where you go away and work on something for 20 years and then somehow you find it's no good because somebody's done the same thing far quicker?
I see there's a question or two up here. So, OK: can we build a system that's not fooled by any of the adversarial examiners? What can we certify about it? At the moment, the adversarial examiners are limited, so you can only say a system is robust to that type of examiner. But I think if you can make the set of adversarial examiners rich enough, then you can really be sure that the method is going to work. As I say, this is your worst enemy-- although these adversarial examiners were not created by my worst enemies; they were created by my graduate students.
But I think the point is to try to make it harder and harder. As I say, on the second point from [INAUDIBLE], yeah. You want to find specific points of failure and keep hammering at them, which at the same time requires understanding the algorithms better, I would say. My general belief is that, in the long term, anything you don't understand you can break by this type of approach. If you understand it, you can start defending against it or fixing it, so that you know you can get it to succeed.
So I think you're right. The two examples I've shown are rather the first ones we came up with. But you'd like to develop it, yeah, precisely as you say, to encourage the adversary to explore as broadly as possible. If you had a computer graphics data set that was big enough and representative enough, with all those factors, you'd like to explore them. But you'd have to learn a strategy, or pin down which parts of the strategy are really going to hurt the algorithm.
OK, so in terms of [INAUDIBLE] engineering, I would like to say that if a human can do it, then the computer vision algorithm should be able to do it also, even if it's never going to show up in any of your data sets or any of your real-world conditions. If something fools the algorithm and it's not going to fool a human, then you should be worried. Because I think by having these types of attacks and defending against them, you're going to be moving towards algorithms that you can really rely on.
Any research on human performance in response to adversarial testing? Not that I know of, for these methods, because basically these attacks are too easy for humans. Humans would not have any difficulty with these patch attacks at all. They would know what the object was, and then they'd say, OK, there's the object, and there's a little patch there that maybe looks a bit like a painting of a fish or something. It's something I've thought about, but I would have to come up with cases where the humans wouldn't get 100% correct. Yeah.
Other issues, right-- are adversarial examiners which use computer graphics problematic because the images are not perfectly realistic? I mean, my answer to that is that computer graphics images are increasingly realistic. But also, from the perspective of human perception, we can switch from real images to computer graphics images easily. We have no difficulty doing that. So if we want computer vision algorithms to be capable of behaving rather like human vision, they ought to be able to do that too. So the ability to do domain transfer is something that we should require of computer vision algorithms, inspired by how humans can do it.
Another question I get, more from the computer vision people, is: I don't care about malicious attacks, so why should I care about these? And I say my main motivation isn't to defend against malicious attacks, but really to study the weak points of algorithms.
OK, so you thought of the patch example as a parakeet with the scoreboard being the background? Yeah, I think-- I mean, that's a good point. I think, though, that in the setup here, the understanding-- at least for the deep network-- is that it's just an object, and the whole image, or most of the image, should be the object. I mean, you could certainly reformulate it so that the deep network treated that as an object in the background-- it would have the task of detecting the object and finding the background-- in which case, yeah, your point would be relevant.
Now, it's interesting that you think it's a parakeet. But still-- at least you wouldn't think it's a real parakeet; you'd probably think, I would guess, that it's sort of a symbolic, artistic parakeet. OK. So, moving on a little from this, I'm now partly building up to the compositional generative networks, or what you might call approximate analysis by synthesis. And here I give the game away, because I have a long-term belief that vision ought to be solved by forms of analysis by synthesis. But then I have to publish my papers in computer vision conferences, where everything is evaluated by standard performance measures on BAD data sets. So how can I manage to do that?
OK, so the challenge is, with the types of models we're talking about, first we have to get them to work as well as the deep networks on those types of standard tests if I'm to have any hope of getting them published at all. Then we have to show that they have advantages in other situations. So that's what we are doing. The out-of-distribution situation for us was occlusion, which happens very frequently in images.
You will have data sets where the training doesn't have any occlusion or has very little occlusion. But you test it on images where there are large amounts of occlusion and see how well performance goes. That is an example of out-of-distribution testing. The second one is the adversarial examiners-- the types of patch attacks that I brought up earlier. Those are tough types of attacks. They worked really successfully against standard deep nets. How well are they going to work against these types of compositional generative networks that I'll describe?
So why generative? Well, generative models encode knowledge of the process-- objects come from different views, can be occluded, there can be foreground, there can be background. There are many reasons for these things, I think, and some of the CBMM literature would argue for this-- ideas of physics, and so on. And let me just make a few background comments on analysis by synthesis. I think people in this audience are probably familiar with it; computer vision people usually only to a limited extent.
So it's to formulate vision in terms of inverse inference, which you could now consider inverse computer graphics-- though at the time the ideas of analysis by synthesis started, computer graphics was in such a primitive state that it was not something you'd even think of in this context. So what does it require?
It requires generative models that can render the complexity of the real world. Computer graphics is something that can do that these days; GANs, perhaps, are increasingly capable of it. So it requires that sort of process for generating realistic stimuli. And secondly, it requires algorithms which can invert the generative process.
Now the first part is hard. And so, for 30 or 40 years, analysis by synthesis has been really difficult. It's now becoming a bit more practical because the first part has been partially solved. But inverting the generative process is really difficult-- there's a huge search problem there. To detect a car, for example, you don't only have to search over the position and size of the car, but also the make of the car, because that's going to affect the appearance, the 3D orientation, the color, the texture, whether the car is clean or dirty, the weather conditions, and a whole bunch of other factors that you can think about. So it's a very, very hard search problem.
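To make the search problem concrete, here is a schematic render-and-compare loop over a toy grid of generative factors. The renderer and the factor ranges are hypothetical; the point is only that the hypothesis space grows as the product of all the factors.

```python
# Toy analysis-by-synthesis: search over scene factors by rendering and comparing.
import itertools
import numpy as np

def render(obj_type, azimuth, color):
    """Hypothetical renderer producing a tiny 'image' from scene factors (stub)."""
    rng = np.random.default_rng(hash((obj_type, round(azimuth, 1), color)) % 2**32)
    return rng.random((8, 8))

def analysis_by_synthesis(observed):
    best, best_err = None, np.inf
    # Even this tiny grid has |types| x |azimuths| x |colors| hypotheses;
    # with realistic factors the product becomes combinatorially large.
    for hypothesis in itertools.product(
            ["car", "bus", "bicycle"],         # object type / make
            np.arange(0.0, 360.0, 30.0),       # 3D orientation
            ["red", "blue", "dirty-grey"]):    # appearance
        err = np.mean((render(*hypothesis) - observed) ** 2)
        if err < best_err:
            best, best_err = hypothesis, err
    return best, best_err

observed = render("car", 30.0, "red")          # pretend this is the input image
print(analysis_by_synthesis(observed))
```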
Now, the compositional generative networks are a type of approximate analysis by synthesis. The generative models are defined on deep network convolutional feature vectors, not on the image intensities. The advantage of this is that, because they're defined on convolutional feature vectors, they don't care about details in the image which are not really needed for tasks like object recognition. So to detect a car, standard analysis by synthesis would say you have to model my car, or Tommy's car, or Professor Biden's car from particular angles. Whereas really, you only want to detect the car. You don't care about those details.
So if you use deep network features, that gives you a coarse description of the car, and the features are invariant to details that don't matter. That also means that fairly simple generative models are sufficient. You don't need to learn very complex probability distributions, unless you want to apply this to fine, detailed tasks where you might need more properties of the image intensities. So simple generative models are sufficient at this level, which means that learning them is possible, and inference is also fairly straightforward using algorithms which are pretty similar to deep networks. So I see there's a question.
Approximate generative models-- can they be formalized in terms of biological plausibility? Well, I should appeal to theoretical neuroscience experts, which I'm not really. I have friends who are theoretical computer scientists. And I think that's a challenge.
I mean, the original idea of analysis by synthesis that David Mumford conjectured extremely boldly in 1991 was that you had feed-forward and feedback connections. And the feedback connections could generate your guess of the image, which you would compare to the real image. And then you would look at the differences and use the feedback loops to correct them. So there's certainly a big-picture idea there.
As for formalizing it, I'm certainly very interested in the possibilities, and I'd love to follow it up. Because if you think of doing it as approximate, it throws a different light on what David Mumford and other people were wanting to do. You would start off by using the top levels to get the sort of core structure, and then the details would be filled in lower down. So I'd love to follow up on that.
But all I've got at the moment is hand-waving arguments and intuitions. Formalizing it more is certainly something I'd love to do. So that's the motivation; now let's go on to the occluders. I'm telling the story in a slightly non-historical way, because for some time I've been thinking that occluders are a way to really test current algorithms and challenge them.
So the claim here is that deep networks don't generalize well to occluders. Here, you train them on non-occluded data, and you see how performance goes as the occlusion area increases: performance goes down to 63%, compared to 99.1% if there's no occlusion. So I admit this is slightly unfair, because 10 years ago we'd have killed to find an algorithm that could get anything like 63% classification performance under 70% occlusion.
But now, we've been spoiled by deep networks. So you know now we're not satisfied with this anymore. Oh, I see your comment, Tommy, thanks. Yeah, let's definitely follow up on that. And is this number good? Well, we did some experiments on this in CogSci 2019. And the argument is humans basically do almost perfectly under these sorts of situations, I think. Oh was there a question here?
Sorry-- can I comment on whether these points apply to speech and language? I think I'd rather delay that till later. I think they could apply, but I'm not an expert in speech or language, which means I have opinions and you shouldn't necessarily believe them. But we can get back to that.
So what if we train with augmented data? So this is a standard deep network idea. OK, let's just throw more data at it. Let's throw more occlusions at it. Throw it in here. So you put in five times as much data which has occlusions in it and it does improve in certain performance measures, going from 63% to 80%. OK, that's quite a big jump. That's a gain.
And one Chinese student-- when I gave part of this talk-- thought this was the take-home message. But my take-home message was not really this. It does improve; you've got more examples. But maybe again, like [INAUDIBLE] aspects of the deep networks, you are sort of memorizing certain occlusion patterns that are similar to the ones that you're going to see. So anyway, this would be the standard deep network answer-- more data is going to fix the problem.
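For reference, the augmentation baseline being described amounts to something like the following sketch, where the occluder textures, sizes, and array shapes are toy placeholders rather than the data set actually used:

```python
# Toy occlusion augmentation: paste a random occluder over each training image.
import numpy as np

rng = np.random.default_rng(0)

def add_random_occluder(image, max_area=0.7):
    """Cover a random rectangle of the image with an unrelated 'occluder' texture."""
    h, w = image.shape[:2]
    area = rng.uniform(0.1, max_area)                     # fraction of the image to cover
    oh, ow = int(h * np.sqrt(area)), int(w * np.sqrt(area))
    y, x = rng.integers(0, h - oh + 1), rng.integers(0, w - ow + 1)
    out = image.copy()
    out[y:y + oh, x:x + ow] = rng.random((oh, ow))        # stand-in occluder texture
    return out

clean = rng.random((64, 64))
augmented = [add_random_occluder(clean) for _ in range(5)]   # e.g. 5x augmented copies
print(len(augmented), "augmented images")
```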
So now, to go to what we did. As I said, we take the deep network's convolutional layers and we eliminate the classification head. So we keep the features. And sometimes, when I'm in a joking mood, I say: I love the features of deep networks-- I really love the features of deep networks-- but I don't like the classifiers. That's putting it a bit extreme, but the features are really good, so why not replace the classifier by a generative model? So at the top of the deep network, you've got this bunch of really nice features, [INAUDIBLE] feature vectors, fp, and you can use them to train a generative model. And these features, as I've said, are invariant to details about the image that you don't actually care about-- well, that you don't care about for the task we're doing.
So you replace the head by a generative model-- and I've really got to get a better picture for this. But the idea is that you have an object at the top right here, which could be a car, or a vehicle, or anything. And I should say right now that our work was originally done on vehicle classes and then scaled up to 100 objects, and it works best mainly on the vehicles, for reasons I'll discuss.
So you have an object. Then you can generate different viewpoints. You can have class mixtures which correspond to different viewpoints. This is like the classic model where you recognize the object by having Viewpoint 1, Viewpoint 2, Viewpoint 3, Viewpoint 4. We're not telling the model about the viewpoints, because we're not giving it that extra information. But what it does-- and it's going to learn this automatically-- is learn that for each object there are several possible structured patterns of image features that can occur.
So if you look at a car from the front, you're going to see a certain arrangement of parts-- things on the left, things on the right, and a window in the center. And if you look at it from the side, you're going to see a different type of pattern. So you have a whole set of 2D models of the object there.
OK, so you have those, and then they generate features. And in between the object and the deep network features are these vMF kernels, which are a bit like parts-- parts in quotation marks, because to say they're parts is a little too strong. They look like them, and you can quantify them to some extent, but they're not perfect. But at least you've got a generative model: you select an object, you generate a class mixture, you generate a structured pattern. OK.
And there's a bit of math here. I just put it up to show that the math isn't very complicated-- I'd discuss it in detail in a class going over the paper, but there's not really much time for that in this type of talk. But what you're essentially doing is replacing the top fully connected layers with a set of operations which are fairly similar mathematically but have very different interpretations.
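To give a rough sense of the computation that replaces the fully connected head, here is a simplified sketch of a vMF-mixture score over normalized CNN feature vectors. The shapes, the concentration value, and the single max-over-class-mixtures step are simplifying assumptions; the actual CompositionalNet likelihood has more structure than this.

```python
# Simplified generative head: vMF mixtures over unit-normalised CNN features.
import numpy as np

def log_score(features, mixtures, kernels, concentration=20.0):
    """
    features : (P, D) unit feature vectors at P grid positions
    mixtures : (M, P, K) mixing weights over K vMF kernels, one set per class mixture
    kernels  : (K, D) unit vMF mean directions ("parts" dictionary)
    Returns the best log-likelihood over class mixtures (roughly, viewpoints).
    """
    sims = features @ kernels.T                           # (P, K) cosine similarities
    vmf = np.exp(concentration * sims)                    # unnormalised vMF densities
    per_pos = np.log((mixtures * vmf[None]).sum(-1) + 1e-12)   # (M, P) log p(f_p | mixture m)
    return per_pos.sum(-1).max()                          # best class mixture wins

rng = np.random.default_rng(0)
P, D, K, M = 49, 32, 8, 4
feats = rng.normal(size=(P, D)); feats /= np.linalg.norm(feats, axis=1, keepdims=True)
mus = rng.normal(size=(K, D)); mus /= np.linalg.norm(mus, axis=1, keepdims=True)
weights = rng.dirichlet(np.ones(K), size=(M, P))
print("class log-score:", round(log_score(feats, weights, mus), 2))
```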
Is the generative model just augmenting the convolutional features? I wouldn't call it augmenting them. It's generating them, which has some big advantages. If we're using a pre-trained CNN for features, aren't you worried this will ignore the parts of the image the network was trained to ignore? We could be, except that hasn't seemed to affect it so far. In fact, the features we're using are all pre-trained features. We don't train the low-level features; we take them from something off the shelf.
I have wondered about, OK, maybe you replace them by the sort of unsupervised convolutional features, which is a fairly hot topic at this time. And you know, hey, that actually might work better. I just haven't persuaded the students to try that out yet. But we are relying on the deep networks to pick features that do the right thing. And the right thing here is classification. For some of the tasks we're doing with this as well, different features might well be better. But OK.
So you replace it by the generative model, and then you learn the parameters by a combination of differentiation and clustering. When you're training a standard deep network, you have a loss function and you just do differentiation to train the weights-- well, not just differentiation, because of course it's actually technically pretty difficult, but conceptually that's what you do.
But here, we're also doing forms of clustering. We are essentially learning a dictionary of feature vectors that occur, which are the vMF kernels, which correspond roughly to parts. And we're learning these class mixtures. So when we're doing the learning, there's differentiation for estimating some parameters, but there's also clustering to estimate these things and those ones. That happens in some types of deep network architectures as well that I'm familiar with, but it's not really standard.
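As an illustration of the clustering side of the learning, here is a spherical k-means sketch for building the vMF kernel dictionary from unit-normalized feature vectors. The talk mentions spectral clustering plus other machinery, so treat this as an illustrative stand-in rather than the actual procedure.

```python
# Spherical k-means over unit feature vectors: the cluster centers play the role
# of the vMF kernel dictionary.
import numpy as np

def spherical_kmeans(features, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        sims = features @ centers.T                      # cosine similarity (unit vectors)
        assign = sims.argmax(1)                          # assign each feature to nearest center
        for j in range(k):
            members = features[assign == j]
            if len(members):
                c = members.sum(0)
                centers[j] = c / (np.linalg.norm(c) + 1e-12)   # re-normalise the mean direction
    return centers

rng = np.random.default_rng(1)
feats = rng.normal(size=(500, 32))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
kernels = spherical_kmeans(feats, k=8)
print("dictionary shape:", kernels.shape)
```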
A question about the take-home message. Well, we haven't got to the take-home point yet, but I'll get there in a slide. So you've got the generative model, and it's got more internal structure than the deep network. It can represent spatial patterns. It's got some interpretability in terms of the class mixtures and the parts. But as it stands, if you run this algorithm, it's not going to be very robust to occluders.
Let me just address the explainability very quickly. The vMF kernels resemble part detectors. These are images which would activate a particular kernel-- you know, engines of airplanes, sides of airplanes, seats of bicycles, et cetera. Now, of course, someone who's been in computer vision as long as I have knows that you can make figures like this that look wonderful and are cherry-picked. These are not really cherry-picked.
And we can quantify these, because we did some annotations. So these vMF kernels do correspond roughly to parts, but not as precisely as I'd like. Similarly, you can interpret the class mixtures to some extent. So for bicycles, this is one class mixture-- bicycle from the side. Another one is a tandem bicycle, things like this. Another one is a bicycle from these other views, et cetera.
So there's some interpretability there, which is quite nice. But getting back to that point: OK, so now what do we do? To make it robust, we introduce occlusion as an outlier process. This is a pretty standard idea in basic probability theory or statistics, in Bayesian models. You have a generative model of the data and you could do inference on it. But then you say, hey, wait a moment-- maybe the data was contaminated by something else.
A long time ago, in the '90s, I co-authored a paper on robust PCA, where you assumed that the data was generated by a Gaussian which, if it had been purely a Gaussian, would give you PCA. But then you said, OK, there's a possibility some of it could have been generated by a uniform distribution, which gave you robustness. So the same idea is put in here. And it's a very natural thing to do for a generative model.
So we say that the image could be generated by one of the objects, one of the viewpoints, et cetera. But there's a probability that the data could have been generated by an outlier process. An outlier process means that the feature vectors can come from what we call occluder kernels. And those are basically anything in the data set-- any sort of background, anything that isn't the object.
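A small sketch of the outlier idea, assuming toy per-position log-likelihoods: at each position, the model compares the object explanation with a generic occluder explanation and lets whichever is larger explain that position, which also yields an occlusion map (the role played by the z variable mentioned next). The numbers and the occluder model are placeholders, not the trained model.

```python
# Occlusion as an outlier process: per-position choice between object and occluder.
import numpy as np

def robust_log_score(log_p_object, log_p_occluder, occluder_prior=0.3):
    """
    log_p_object   : (P,) per-position log-likelihood under the object/viewpoint model
    log_p_occluder : (P,) per-position log-likelihood under the generic occluder model
    Returns the total robust log-likelihood and a binary occlusion map.
    """
    fg = np.log(1 - occluder_prior) + log_p_object
    bg = np.log(occluder_prior) + log_p_occluder
    occluded = bg > fg                               # positions better explained as occluders
    return np.where(occluded, bg, fg).sum(), occluded

rng = np.random.default_rng(0)
log_obj = rng.normal(0.0, 1.0, size=49)
log_obj[10:20] -= 5.0                                # pretend these positions are occluded
log_occ = np.full(49, -1.0)                          # flat, weak occluder model
score, z = robust_log_score(log_obj, log_occ)
print("robust score:", round(score, 1), "| occluded positions:", int(z.sum()))
```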
So this can be done with an outlier model-- a variable z, which is just another variable to put in here that has to be estimated, to decide whether you think the data has been generated by one of your object models from a viewpoint, or whether it's been generated by an occluder. How do the vMF kernels learn parts of the object? Is it supervised?
No, it's not supervised. It's just done by clustering. It relates to work we did several years ago called visual concepts, which used clustering and which we found a bit hard to get accepted in computer vision conferences. But when you're training the network, as well as using differentiation to learn parameters, you are clustering to get a sort of dictionary of feature vectors, which are the vMF kernels, as well as clustering to get the viewpoints.
I should say, part of that's nontrivial. The clustering happens with a mixture of spectral clustering and some other stuff as well. But basically, yeah, we're not using any supervision for this.
OK, question from [INAUDIBLE]. The different viewpoints are learned from the different configurations of the same object. That's correct. Should the model additionally receive an input or learn where it is in the 3D space?
Yeah, that is sort of future/ongoing work, because with humans you'd certainly expect to recognize objects by looking at them from several views, and here we're just taking static images. So I would say this is a work in progress. We've got these networks, which have some nice properties, but there are certainly limitations that I want to improve on. It's the basic concepts, I think, which are perhaps good.
What sort of weighting or confidence does the occlusion element get? Well, that is something that you specify-- it's not something you learn. Because think about it: what you're doing here is out-of-distribution testing. You can't actually learn how much occlusion there is going to be, because if you did, you would already have knowledge of the test distribution. So it's like a tolerance you set for the amount of contamination in the data, where the contamination is the occluder.
If you're familiar with Aleksander Madry's min/max formulation for standard perturbation attacks, it's a little bit like that. You want to minimize the loss, but you maximize over some transformation of the data. And how big that maximization is allowed to be is the amount of robustness that you're putting into the model. That's a factor that you have to specify. And roughly, that's what we're doing here with the occlusion, et cetera. I mean, it's a starting point. You could add more types of things there.
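For reference, the min/max formulation alluded to here is usually written as follows (this is the generic robust-optimization objective from Madry et al., not the specific occlusion model in the talk); the size of the allowed perturbation set Delta plays the role of the tolerance that has to be specified rather than learned:

```latex
\min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\Big[\, \max_{\delta \in \Delta}\; \mathcal{L}\big(f_{\theta}(x+\delta),\, y\big) \Big]
```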
What types of classes is the class mixture modeling? It's roughly viewpoint changes. It isn't always that, as I said. It's not trained. It's not supervised. It looks like that, but it's not perfect. You know, you'll get an airplane going one direction, you'll get an airplane pointing the other direction, they would be put in the same cluster. That's the limitation of the model at the moment.
As I say, I think the classes we've got by clustering-- the class mixtures and the parts-- are interpretable up to a point, and more interpretable than anything else we could check against. But they're not as good as what a human would get, which is good-- so I don't have to retire just yet. OK.
Occlusion as an outlier process-- once you do that, you can see that the models are going to become robust to occluders, and can also localize them. We did it originally on vehicles only, and then we extended it later on.
Here are some examples on vehicles. So here, there's this z variable which decides where the occluders are-- there's a probability for that. And so for this image, it says, hey, we think this area here, the bar, is an occluder; the head is an occluder; the tree is an occluder; et cetera. This is because it's trying to predict the spatial pattern, and it finds that some pieces of the pattern don't look right but can be explained by the occluder process. So that helps. That's a good thing.
As for performance-- I've left the tables out, partly because, I must say, I get so frustrated that we have to have tables in all our papers. When students send me a paper to review before they submit it, I must admit I glance at the tables, but I'm far more interested in the ideas. And I take it for granted that they're hard-working students, so their tables are actually good.
But basically, what you're getting with this is performance that's about as good as the deep network if there's no occlusion. And as the amount of occlusion goes up, the performance stays pretty good-- about 90%, even under 70% occlusion. So in comparison with what you got earlier, you're pretty robust.
The results are best on the vehicles because the assumption here is that four viewpoints, four classes, are enough to capture the viewpoints. And for objects like vehicles, that may well be enough, roughly. But as you start going to more complex objects, four mixture patterns are not really enough to capture it.
So you still do reasonably well. If you scale up to the 100 object categories that we had to do for the journal paper, you're still doing better than the deep net under occlusion. But the performance decreases compared to how well we do with the vehicles, because we're trying to model the spatial patterns, and as I say, the four-viewpoint assumption starts breaking down.
OK. So that was the out-of-distribution stuff. And this is now a paper that was accepted by CVPR, where we said, OK, let's take the Patch Attacks. If you remember, the Patch Attacks were almost 100% successful on deep networks. And so here is an example with PTT16: Patch Attack is almost 100% successful as a targeted attack.
Here's an example of a car with a few patches on it. The second attack is a later version-- it's different; it's got some clever things that were not in Patch Attacks; it followed up on it; it uses a single patch; and it also succeeds in attacking.
But the [INAUDIBLE], without doing any training at all, are an order of magnitude more robust. The reason for that is, essentially, first, they know the structure of the objects-- they've learned it roughly. And also, they know that there can be outlier things, which gets rid of the patches. So it's fairly natural that they should detect the patches and be robust to them. But it's nice to see that they are so robust at the moment.
So the adversarial examiners are really effective against the deep networks, and much less effective against the compositional generative networks. But now I need somebody who really, really doesn't like me to find attacks that might exploit some of the weaknesses of these networks-- or I need to persuade some of my students, so they're motivated to attack the limitations as we develop them. Meanwhile, the very nice generative properties of this, which allow you to put in occluder processes, actually allow you to do several different tasks with the same representation, which I haven't got time to go into.
You can find boundaries to some extent, find parts, and be robust to adversarial examiners without any extra training. And as I say, there could be nice ways of tying this into neuroscience.
But now-- and this is going to be a little quick-- I want to throw out one more piece, which is training to parse animals with weak priors and, I'd say, with no annotated data. So I know, and I hear it a lot at CBMM: humans learn vision in ways that are very different from machine learning systems. And so here, an infant could learn about a toy donkey. Here is an infant demonstrating it by seeing it, touching it, and tasting it, which means it could sort of build a 3D model of it. Or even, perhaps, a precocious child could read about it in the [INAUDIBLE].
So an infant could explore geometric configurations, identify the key points where it bends, and so on, and could sort of build a three-dimensional object which it could render as a surrogate for visual development. So I'd love to hear from anyone who buys that as a possible relationship to cognitive development. The best I've had so far was Brenden Lake, one of Josh's students, who seemed to think it was roughly consistent.
But OK, so going from that, our conjecture is: what can you do? Can you take a computer graphics model of a horse or tiger and annotate its key points? You would only annotate once, then generate a large set of simulated images where the key points are known, because you've annotated the model, with a diversity of viewpoint, pose, lighting, texture appearance, and background. Can you do that? Is that in some form a surrogate for human development? And if so, how well does it work?
So you train a model for detecting key points on these simulated images. If you do this, your first finding is that you can detect key points on the synthetic images, but the detectors don't work terribly well on real images. They do work well enough, though, that when you use them to bootstrap self-supervised learning, they then perform pretty well.
So here is animal parsing-- key points labeled on the horse model. You render the horse from a whole lot of views, et cetera, and then you test it on a data set where there are some annotations. The fully supervised performance on that is about 80%. Using these synthetic models, you can get to about 60%, and by doing a number of clever tricks, without using any ground truth real data at all, you can move it up to about 70%.
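Schematically, the annotate-once pipeline just described looks something like the sketch below. Every name in it (the renderer, the detector class, the confidence threshold) is a hypothetical stand-in; the structure is: annotate the 3D model once, render varied synthetic images with known key points, train on them, then self-train on unlabelled real images using confident predictions as pseudo-labels.

```python
# Toy "annotate once" pipeline: synthetic supervision followed by self-training.
import random

def render_with_keypoints(model_3d, viewpoint, lighting, texture, background):
    """Hypothetical renderer: keypoints are known because the 3D model was annotated once."""
    image = {"model": model_3d, "view": viewpoint, "light": lighting,
             "tex": texture, "bg": background}
    keypoints = [(0.3, 0.4), (0.7, 0.5)]          # placeholder projected keypoints
    return image, keypoints

class KeypointNet:
    """Hypothetical detector with train/predict stubs."""
    def train(self, images, keypoints): pass
    def predict(self, image): return [(0.31, 0.41), (0.69, 0.52)], 0.8   # (keypoints, confidence)

def annotate_once_pipeline(model_3d, real_images, n_synthetic=1000, conf_thresh=0.7):
    synth_imgs, synth_kps = [], []
    for _ in range(n_synthetic):                  # stage 1: varied synthetic data
        img, kps = render_with_keypoints(
            model_3d,
            viewpoint=random.uniform(0, 360),
            lighting=random.random(),
            texture=random.random(),
            background=random.random())
        synth_imgs.append(img); synth_kps.append(kps)

    net = KeypointNet()
    net.train(synth_imgs, synth_kps)              # train on synthetic supervision only

    pseudo_imgs, pseudo_kps = [], []
    for img in real_images:                       # stage 2: self-training on real images
        kps, conf = net.predict(img)
        if conf >= conf_thresh:                   # keep only confident pseudo-labels
            pseudo_imgs.append(img); pseudo_kps.append(kps)
    net.train(synth_imgs + pseudo_imgs, synth_kps + pseudo_kps)
    return net

net = annotate_once_pipeline("horse.obj", real_images=[{"id": i} for i in range(5)])
```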
So you can actually detect key points quite well here without any real supervision at all, just using a synthetic model and doing some domain transfer. And this has other nice properties. I'm realizing I'm going somewhat over time, and I'd rather rush this. And the reason for that is I see nobody's asked any questions about this part of the work.
So I can stop for a minute and just see if people have any questions on this thing? Or whether-- since I've talked for about an hour-- whether people are saturated with listening to me and I should move towards wrapping it up and throwing things open for any other questions?
How could you generate examples in the brain without a graphics engine? Well, I appeal to Josh Tenenbaum, who gives talks about the physics simulator in your brain. So I would refer to J. Tenenbaum, private communication. The honest answer is that I really don't know. I mean, we know we can imagine objects in 3D in the brain and we can rotate them. Can we do that kind of simulation? I'm really not sure.
This is rather like a 1980s idea of how humans dream, if you remember the theories about that-- Boltzmann machines, where you'd generate samples without any input while you are asleep and train on them. So I don't know. I mean, this is wildly speculative.
And I should say, if it's interesting enough, it's worth thinking about. If not, I don't know. But I mean, we can dream. We can shut our eyes, we can imagine things. The neural mechanisms for that, I don't know. But potentially, yeah.
So it sounds like, Tommy, you think that's a bit too-- yeah. Well, I'd say it's a conjecture, not an assumption, I would guess. And maybe a conjecture too far. I mean, for publishing this in a computer vision venue, in fact we said nothing about cognitive science at all. We just said, look, it's nice, we can use computer graphics to generate things, and then we can train using that. So that's good.
But since we've got cognitive science and neuroscience experts here, I thought I'd like to throw it out and see whether support for it is possible. Because, right, it's a big claim. But we can imagine things. We can do that type of stuff somehow.
So, anyway-- what's the question about the synthetic data in the horse case? Is it meant to guide the learning process, to augment the network without any supervised data?
For me, the exciting thing was that you could really get good results on the real data without any ground truth annotations on the real data. The best performance, though, is if you combine it with the ground truth on the real data. And then you move things up from slightly below 80% up to 85%.
So, I mean, I would love the idea of training with limited supervision, or even almost no supervision, [INAUDIBLE], you know, and then testing with infinite supervision, I think. Might it be hard to go beyond single objects to scenes of objects when rendering a data set for pre-training? That might be harder.
Yeah, we've just done it with objects. There's a recent paper by Sanja Fidler, I think in ICLR, that was using these types of interpretable GANs as the rendering engine, which seemed to relate a little bit to this. I mean, it's very different, but I'm not enough of a GANs expert at the moment to know how far you could go with these sorts of models or how much of the rendering you could do. Yeah, I'm not sure about it.
I think starting with objects is the place to start. Possibly it would end there because you can't extend it, but it seems a good place to start. All right. So, moving back, right.
So this was a question, actually. If you have the real data alone, you get about 79%. If you use synthetic plus labeled data, you move up to 82.43%-- not a huge amount. But if you have the synthetic data by itself, with this extra stuff and self-supervised training, you get up to 70.77%.
We also found you could, in principle, scale this up to more categories. We didn't at the time, partially because to evaluate it-- and this gets back to some of my complaints about computer vision data sets-- we had to have a data set with ground truth for certain key points. We could find that for horses and tigers, but we couldn't find it for sheep, or for mammals like that. So we didn't do it on those. But I think, in principle, it should apply.
We then thought, OK, since the network's been trained so it can move from the synthetic stuff to the real, it could do forms of domain generalization. So if you run it on these types of things here, it will work reasonably on these types of things as well, without having to do much extra. So I think, on this project, the idea was that if you have some synthetic data and are able to do some form of domain transfer to real, as we're showing here, then that's got quite a lot of advantages from a computer vision perspective.
The crazy wild conjecture, which I haven't actually dared to write down, and certainly not in the paper, is whether humans can actually do these things-- simulate them in their head well enough to train on them. But yeah, that's beyond this.
So the conclusion is, going back to the beginning: AI vision has very big limitations compared to the human visual system. And I think one reason for that is that we're suffering a bit from the tyranny of data sets and the types of standard performance measures that we use. Because although they're really good-- they've taken us to the level we're at, and we shouldn't neglect them-- they really favor certain types of approaches that have the ability to learn the patterns in the images and the data. And deep nets seem to be wonderful at doing that.
But perhaps we've failed to get at the deeper structure of the data, which is necessary to do out-of-distribution tasks, to transfer to other domains, or to deal with adversarial examiners-- these tougher tests. And so I worry. This happens to a lot of my graduate students and a lot of computer vision people, who just publish papers that make improvements on standard performance measures but are not necessarily giving good insight into where the improvements come from. And I remember when deep networks came along.
In the first few years, if you made an improvement on a deep network, or applied it to a new problem, your performance went up by 10 AP, or some huge jump. And now you've got far more people with far more powerful computers publishing papers where performance has gone up 1 AP, or something like that. And I feel that's perhaps not really a good use of resources. And some of those findings, even on a big data set like ImageNet, may not transfer over to other situations.
So I think we really need, from a computer vision and machine learning perspective, far tougher evaluation of algorithms, because I think that will test the current deep networks better. Maybe they'll survive; I'm guessing they probably won't, and will require modifications. But if we do that, then we'll start developing algorithms that really can work towards the level of human performance.
And so, for me, the compositional generative networks are the type of thing we're doing, and they're performing better on these tougher challenges at the moment. But as I say, there are limitations that I'm only too well aware of and that we'd like to improve. The annotate-once work is slightly orthogonal to this, but it relates to the same point: you'd like to have algorithms that can transfer from one domain to another and that rely on less training, et cetera.
So a question about-- yeah, annotating once. Yes, you annotate a single instance of an object class. So you go to the computer graphics model, you go into the render code, and once you've done that, you've annotated the 3D model. Then you can render it under all sorts of different conditions and get a very large amount of data that you can use.
It is tough. I mean, what we tried doing-- the natural idea-- is to make the images realistic. You try to generate images that are more and more realistic, and then you use those. And that didn't work.
What we found actually seemed better was to have a whole lot of variability in lighting and texture patterns and so on, and to use that. Why? We're not really sure. But the intuition is that if you do that, you focus the algorithm on properties of the object like the edges and the boundaries. Because the boundaries of a computer graphics object are fairly similar to the boundaries of real objects; the texture and appearance may be different.
So trying to make the generative models too realistic may actually be hurting you-- it certainly hurt us when we were doing that project. It relates a little to the approximate analysis-by-synthesis idea in the compositional generative networks. In a way, you don't want too many of the details of the images, because they're difficult to model, they may be irrelevant, and they may distract you-- at least for the types of tasks that computer vision is doing at the moment, which are fairly coarse-level tasks. Later on, when we want to do fine-level tasks, then maybe we need to go into those details. But we may not need them yet.
Should we crowdsource the problem of detecting failures of existing algorithms-- let people upload images where systems fail, crash logs of airplanes? I mean, you can certainly do that, and it would probably work. But you know, it's a brute-force thing. Even though it scales with the number of images and the number of annotators, you're not really going to get to a combinatorial number that way, if you see what I mean.
You know, you've got a billion people, and you show them a million images each that they can look at. That's a big number, but it's still not a combinatorially big number, necessarily, I think.
But of course, I'm pushing a particular viewpoint here, and I may be pushing it too extremely at the moment. Anyway, so this is on the occlusion. From the human perspective, I think challenging the algorithms with things that humans can do-- really tougher performance tests-- working out systematically what humans are capable of compared to what machine vision systems can do, challenging the machine learning systems to do those things, and modifying them in the ways I've suggested here would be ways to improve the AI side of vision, and probably also the cognitive side as well.
So that's it, I guess, for the moment. I've gone over time. There's a whole bunch of references here. Gosh, and I should have put in photographs of people, and I should have acknowledged grants, et cetera. But fortunately, all of those are in the papers.
So maybe I should just stop here because I've gone way over time, I think-- stop here and see if there are any more questions that people have at this stage.
PRESENTER: Very good, yes. Let's see whether there are more questions. But there were quite a lot during your talk. It was great. And I hope we can start some collaborations with you. That would be good-- even better.
ALAN YUILLE: Yeah, I would certainly love to do that. Because I think there's some very, hopefully, some interesting things here. And there's a lot of-- yeah, anyway, absolutely. I should say that.
That's good. Well, thanks for listening. I mean, I put a lot of material here, as you can see from the references. But I think the big picture here is more than the sum of the individual components. And I think the questions were really interesting and stimulating. And yeah, it would definitely be wonderful to carry on and see if joint projects can come out of this.
PRESENTER: That's great. Thanks a lot, Alan. It was very good to have you. And, yeah, I hope everybody enjoyed it as much as I did, which is a lot. So let's do it again at some point.
ALAN YUILLE: Yeah, absolutely.
PRESENTER: And hopefully, in person.
ALAN YUILLE: That's fair. OK, Tommy, well, great to see you again. And yeah.
PRESENTER: Bye bye, and tune in next week for this seminar. Bye, Alan.