ALAN YUILLE: It's a Friday afternoon and I'm talking to an interdisciplinary audience. I have managed to restrain myself from having any equations whatsoever in my talk. Now, as I was a post-doc here, I keep wondering what my younger self would say if he saw a talk like this.
I'd think, OK, a bunch of words and pictures. The person obviously hasn't got anything worthwhile to say. But I don't know. I'm older now and a bit more tolerant, at least with myself. So, parsing objects and scenes in two and three dimensions-- I'm at UCLA, and of course, I'm one of the far-flung branches of the Center for Brains, Minds, and Machines.
How does this work relate to that? Well, what I'm going to talk about is an overview of three particular pieces of work that relate closely to CBMM.
They are going to be published in CVPR, the Computer Vision and Pattern Recognition conference, in a month or so. They're available on my website and they're in the process of becoming CBMM memos.
So where does it relate to CBMM? It perhaps relates to thrust 5, the issue about having models that come out [? and ?] challenge questions, lectures, perhaps particularly the idea of a visual Turing test-- the idea of having computer vision systems that would work at the level of human performance and could be motivated by findings from studies of neurons and from psychophysical tests.
And as I go through the different parts of the talk, I'm going to reach out to particular people in the center that I've talked to, where I think this work connects to them. So, three particular studies, and I'll discuss those.
So, the organization of the talk-- these are the three parts. First, an issue about objects and parsing, which perhaps relates to what Josh was saying earlier about compositional models, which I'm very keen on, but in a more practical way than some of the things I've talked about in the past.
Moving on to the need to do things in three dimensions-- an example where you can find the 3D structure of a person from a single image. And then finally, psychophysics in the wild for rapid detection of objects in complex scenes relating to visual attention.
And I have to say, it was so interesting being here today that I should have spent a few minutes editing my talk, particularly section 3, which was written by a very enthusiastic student. And I was trying to prune out some of the more controversial things that he had perhaps put in there. And I hope I've got most of the controversial things out, but we'll see.
It was a student of [? Christof Koch. ?] So if you know Christof, you may understand why the student behaved that way. OK, so a general claim I would make here is that perhaps the key problem of vision, and arguably of intelligence as a whole, is the issue of complexity.
If the world were simple, if images were simple, and so on, then computer vision would have been solved 50 years ago, whereas the community is still struggling with it. There are different types of complexity, however. One is the complexity of images themselves. Another is the complexity of visual tasks.
And I'd argue that the computer vision community is doing a fairly good job of moving toward the complexity of images, but it's perhaps rather more neglectful of the complexity of tasks. Anyway, the fundamental problem of vision would be: how can humans or artificial systems deal with this incredible complexity?
You shut your eyes, open your eyes, and in 150 milliseconds, which is about the amount of time it takes for your eye to blink, you can recognize practically everything that's going on in the scene.
So you could have high level models of all the possible objects around. You could do it by inverse graphics, which I believe is arguably the best way to do it. But how on Earth do you do it so fast? How is that possible?
My belief is that that relates to the idea of compositionality, of doing things by finding parts and gluing them together and stitching them together. And that's what my previous talk at CBMM was about. Although actually, I realize the talk was here, but I think it was a week before CBMM was finally opened. It was the hierarchical workshop that Tommy and others organized here.
So I think the fundamental problem is complexity, and compositional types of models are ways of dealing with it. But let me move on to being more precise about that, and about how to study complexity in these cases.
So, visual tasks-- if you compare humans to machines, humans can do so many different visual tasks at the same time, whereas computers, in computer vision, are mostly geared toward more specific things. So finding a cat in a box is a computer vision task that many people have spent an incredible amount of time on.
But I would like not just to find cats in boxes, though I've spent my time doing that. I would like to find the cats, I'd like to find the ears, I'd like to get the 3D structure of the cat.
I'd like to be able to say what the cat is doing, how it's interacting with other cats, and so on. So the variety of visual tasks is important. Data sets are really important also. Because, going back to my point about the fundamental problem of vision being complexity, it's partially the complexity of the whole visual scene, the set of images out there.
And I've begun saying that I think when people started doing computer vision in the '70s or '80s, they were practically crazy to do it. Because at the time, it was completely impractical to deal with all possible images that could occur.
You couldn't even store many images on your computer, let alone process them in any real time. You could only work on the very, very small, tiny subset of the set of images that existed. And so any progress you could make there was inherently limited.
So now, it is possible to get very large data sets and to make them representative, in the sense of being representative of the typical scenes you're going to see out there. So if you get good performance on one of these data sets, you can be reasonably sure it will transfer to other images out there.
And in the third part of the talk, I'll have comments about the design of data sets, and about issues where certain findings might turn out to be wrong because they've been found on particular [INAUDIBLE] data sets and are unlikely to extend to anything more complex.
Another issue here at the center was the idea of Turing tests. A way of studying this is to compare the performance of the models to those of humans and to the [? neural ?] and behavioral correlates and to develop computational models of the human abilities.
And so this is one of the parts of the center. And it's certainly interesting to think of computer vision from that perspective and see, to be frank, how badly many computer vision models fail when you compare them to what humans can do.
So here, just to say that computer vision is doing well in terms of the complexity of data sets-- here is a figure of the number of images. Ten years ago, the PASCAL data set was considered enormous, impractical; now we're up to [? ImageNet. ?] We're moving up in that direction.
So in the sense of going in that direction, we can claim that we're exploring the complexity, and we're getting enough images. However, the axis here is only showing classes, which means the number of different types of objects.
So here in PASCAL, you'd be looking at trying to distinguish between 20 different types of objects. There are far more of those in the world. ImageNet goes up to far larger numbers of classes. But still, the axis here is only about the number of them, and only about the task of detection-- of saying whether there is a cat in a box or not-- rather than dealing with this issue of what the cat might be doing.
So, only a small number of visual tasks-- cat in the box. But if you look here-- and this is not from my back garden. I think it's from Dan Kersten's back garden in Minnesota.
Humans can extract really rich representations from images. If you look at this picture-- Dan has an example-- you can read out whole sentences of what people can get just from that simple image. So there is an incredible ability to generate things far beyond just answering these simple categorical questions.
So, just to say before I go into more detail of what I'm doing-- and this is something I believe is very important-- what we've done is supplement existing data sets with additional labels. So I'm saying we don't necessarily need to make new data sets, because we have plenty of images out there.
But we would like to put in additional labels so that people can do a far greater variety of challenges and tasks on the data sets we already have. So there are three data set labelings, which I'll show in a minute, and which we're going to release at CVPR.
These allow one to study more complex visual tasks than the ones that people are doing in computer vision generally. It also enables what I think some people are starting to call psychophysics in the wild, which means studying what humans can do on real images in real world conditions.
Because from the computer vision perspective, you think, OK, there's a lot known about the human visual system, and indeed there is. But there's surprisingly little known about the performance of the human visual system in complex real world scenes. And there's increasing interest in doing that, and papers published in the computer vision community on it.
This is neuroscience in the wild. That's just to put in a plug for two of my collaborators-- Tai Sing Lee, who does [INAUDIBLE] electrode recording at CMU, and some of my work is related to interaction with him, and Dan Kersten, doing fMRI studies.
So here would be data set 1. And I'm putting it up here partly to advertise it, but again, for people in the center who may find this of interest-- it may spark ideas. So here are the PASCAL images, 20,000 of them, labelled typically by [? bounding ?] boxes around the objects for the challenges.
But what we've done here, largely [? in Korea, ?] was that you can now put in more detailed labels. So here is the cat. You've labeled the head, you've labeled the ears.
You labeled the torso, and so on-- the sheep, and the dogs, and the parrots. And we've done that for, I think, almost all of the 20 object classes that were present in that data set. We didn't do it for ships, because they weren't incredibly complex. But we did it for everything else.
So, parts and subparts-- this means you could start testing algorithms which could be evaluated by whether they can find the parts of the objects, and so on, and find the silhouettes, and not merely put a box around them.
And I'll discuss that a little in the first example of the work. The second one is image labeling. So what you do here is you take the image and you break it down. And depending on the granularity, you break it into 35 different classes.
So every pixel here would be labeled as sky, grass, ground, building, tree, or something of that form. Another level of granularity was to label 60 classes. And the biggest one we had was labeling 600. So every image pixel would be assigned to one of these labels.
And again, you could do the task of not just detecting the object, but detecting the background, and so on. Data set 3 is sort of a subset of this. And I'll discuss it a bit later on in the third part of the talk. OK
So an issue I wanted to bring up here, and it's come up in talks with most people, is also what humans can do. And so you can be subjects in this experiment. And this experiment is not going to give you a full image, because we know humans are really good at that.
But suppose I give you this thing here. What is it? Anyone got any idea?
AUDIENCE: [INAUDIBLE] bicycle or motorcycle?
ALAN YUILLE: Interesting, but no. The suggestion here is that once you've got the whole context of the entire image, everyone almost certainly knows what it is. I was told Josh talks about the [? posterior ?] of images being strongly peaked. And I would say that's true of an image, but it's certainly not true for some piece of it.
What we've done here is we've taken a super pixel algorithm, which is a type of algorithm in computer vision that finds local areas which are roughly similar in appearance. You do that for the whole image. You break it into about 100 different pieces. And this is one of the pieces.
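As a rough illustration of how such super pixel stimuli might be generated, here is a minimal Python sketch using SLIC from scikit-image. The talk does not say which super pixel algorithm was actually used, and the file name and segment count are assumptions.

```python
# Sketch of generating super pixel "pieces" like the ones shown to subjects.
# SLIC from scikit-image is used as a stand-in; the talk does not specify
# which algorithm produced the roughly 100 pieces per image.
import numpy as np
from skimage import io, segmentation

image = io.imread("pascal_example.jpg")                 # hypothetical input image
labels = segmentation.slic(image, n_segments=100, compactness=10)

def extract_superpixel(img, labels, sp_id):
    """Mask out everything except one super pixel, as a stimulus for subjects."""
    stimulus = np.zeros_like(img)
    mask = labels == sp_id
    stimulus[mask] = img[mask]
    return stimulus

# One piece at a time would then be shown (e.g., on Mechanical Turk)
# with the question "what is this?"
piece = extract_superpixel(image, labels, sp_id=42)
```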
So people-- let me see. Where can I move this on? So now this has stopped working. Why has that happened? Well I can move back here. OK, so I guess that gives you a longer time to look at it. But I don't really think it's going to help.
Oh, this is not working either. What can I do here? That is working-- at least my mouse works. So here's a little bit of a clue. Now you're allowed to say it's an airplane, car, boat, sign, or building.
ALAN YUILLE: Car. OK, and now--
ALAN YUILLE: It is a boat. But part of the point of this is, of course, that it's really impossible. It could be that, or it could be sky, or it could be part of the grass-- though there are certain things it couldn't be. Locally, there's an incredible amount of ambiguity.
And one of the things we did-- this is [? Roozbeh Mottaghi, ?] who was the student, now a post-doc at Stanford, who did this work-- is that you could use people on Mechanical Turk to do tests like that, to see how many of those things they could get right.
And humans would get roughly 60% of those correct or so, in general. Some of them, because of the nature of super pixels, give it away and make it a lot easier. But still, human performance is at that level of about 60%. If you take the types of algorithms we have in computer vision, with training and machine learning, you're getting about 50%. Which isn't great, but it's actually pretty good, considering the difficulty of it. So it's behind human performance, but it's only a little bit behind. But still, I think there's a window here for a lot more experiments.
And I believe that actually a lot of the visual psychophysicists are unfortunately on their way to [? VSS ?] today. They're not in the audience, but for anyone who knows a psychophysicist-- I want to put out the idea that these types of tests, these types of studies using this real world data, would be very useful for computer vision. And I think they would give a lot of insight into human abilities. And the data sets exist to do it. So I think this is screwed up because of that particular slide. So maybe now I can move over here.
OK. So this here is work about detecting animals and animal parts. The lead author is Xianjie Chen, who is up here. And the idea was to exploit our first data set by trying to detect and parse animals in real world conditions. By real world conditions, I mean taking images from the PASCAL data set, because those are known to be hard and difficult.
And there are various challenges, which means that the objects are deformed, and they can be occluded. In some cases the head is visible, in other cases it isn't. Sometimes the torso is visible, sometimes it isn't.
Also, the head could be seen from the front, or from the side, or from the back. And the torso could be from one angle or another, and so on. So if you work that out for a deformable object like a cat or a bird, there are a whole lot of possible appearances you can get, depending on the viewpoint and the pose of the object.
So, big deformations. Sometimes, as in this case, it's hard to detect. Well, it's not really shown here, but there are cases where, for example, you can't really detect the whole object because it's not there.
But you could find the head-- the head of the object, the head of that particular animal. So some parts of it could be detected. Occlusion: detecting occluded body parts is difficult, or almost impossible. And if the object is at low resolution, it's possible to detect the object, but it's not really possible to detect parts like the head.
So the strategy was to detect what you could-- detect the parts, detect the objects-- using a model that could automatically switch on or switch off certain parts, and also a model that could allow the parts to have several different types of appearance.
So each part would be modeled by a Deformable Parts Model, which I guess some people in the audience will know. Until deep networks, this was probably the method of choice for detecting objects. OK. So it's a method that, if you trained it on the sort of cat-in-the-box problem in PASCAL, would perform about the best. OK.
And so you can model the holistic object by a Deformable Parts Model, which is what people have done, but then you could also try and model the parts by Deformable Parts Models as well. Now also, an object or a part is going to look very different from different viewpoints. So for a single object or a single part-- say a head-- you don't have a single head model. You have a head model for the front, a head model from the side, a head model from some other position. And so here we call that a type.
So each part would have a deformable model. Each part would have several deformable models for different types with different viewpoints. And moreover, one model, one part may be visible or invisible. So if you count it up, there is a very large number of possible configurations.
Parts could be on or off. There's a combinatorial number of possibilities. And if you do this with a traditional graphical model, like say a pictorial structure model that other people would use in vision, you'd have to have about 100 different pictorial structure models to deal with all these different possibilities.
However, if you do this in a compositional way, you build it out of these parts. You exploit the fact that there could be several different types of heads, and several different types of torsos. But the head model would be, in a sense, invariant to the torso. There's a bit of a relationship between them, but essentially you could reuse the head model from the front for different configurations of the torso and the hands.
If I move here, I can move my torso around, I can move my hands around, but my head is in the same position; you use the same model for detecting the head. So this is an illustration of the compositional idea: you construct the large number of possible object configurations, which you need in this case, from a small number of parts and part types. OK.
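To make the combinatorics concrete, here is a toy Python sketch of the compositional scoring idea just described. The part names, types, and scoring functions are placeholder assumptions for illustration; the actual system uses deformable part model scores and efficient inference rather than the brute-force enumeration hinted at here.

```python
# Toy sketch: score an object configuration by composing per-part, per-type
# detector scores with pairwise spatial terms, allowing parts to be absent.
# All scores are placeholders; the real system uses DPM scores and
# dynamic-programming-style inference over the graphical model.
import itertools

PARTS = ["head", "torso", "legs"]
TYPES = {"head": ["front", "side"], "torso": ["front", "side"], "legs": ["standing", "lying"]}

def unary_score(part, part_type, location, image):
    """Placeholder for a part-and-type detector score at a location."""
    return 0.0

def pairwise_score(part_a, loc_a, part_b, loc_b):
    """Placeholder for a spatial-relation score between two parts."""
    return 0.0

def score_configuration(present, types, locations, image):
    """Sum unary and pairwise terms over the parts that are switched on."""
    s = 0.0
    for p in present:
        s += unary_score(p, types[p], locations[p], image)
    for a, b in itertools.combinations(present, 2):
        s += pairwise_score(a, locations[a], b, locations[b])
    return s

# The combinatorics: subsets of parts x a choice of type per visible part.
n_configs = sum(
    len(list(itertools.product(*(TYPES[p] for p in subset))))
    for r in range(1, len(PARTS) + 1)
    for subset in itertools.combinations(PARTS, r)
)
print(n_configs)  # already 26 with 3 parts and 2 types each; grows combinatorially
```

The point of the sketch is only that a small shared set of part and type models generates a combinatorially large set of configurations, which is why the parts can be reused rather than building one separate model per configuration.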
So then, from an input image, the algorithm has to search quickly over all these possible configurations-- which parts are present, which types are present, which positions they're in-- and it can do this pretty efficiently. And here is a figure to illustrate it. Here is the horse. This is sort of a holistic model of the horse, which is more standard-- the cat-in-the-box type of model.
Here is a model of the torso, here is a model of a certain leg configuration, here is a model of the head. So a graphical model would represent the entire holistic horse, together with the connections, in this case, to the head, and to the torso, and to the legs. But for a particular [INAUDIBLE], the algorithm has to be smart enough that it can find all of these. Or maybe it can only find the legs and not find the head, and so on. And it needs to be able to switch between them. OK.
So here, just a few examples of the possible configurations that you could have of different parts being present, different types of the parts being present. And also different spatial relations between them.
Spatial relations are fairly easy. In fact, I think the first parsing of these goes back to 1973, which is completely prehistoric by computer vision standards. You would have a head here and a part here. There would be a position for the head, there would be a position for the part, and he modeled the relative position between them.
And he put a distribution on the relative positions. But this is really-- I'd say this is almost pre-history by now. And the issue here is far more that this is one type of head.
You need to allow other types of heads and other types of torsos, other types of things. OK. So I need to check how I'm doing for time, because we started a little late. And this is Friday, so I don't want to keep you all around too long.
AUDIENCE: Got about 25, 27 minutes.
ALAN YUILLE: 27 minutes to go? OK, so I'll skip through a few things. The bottom line is that it works reasonably well, and here are examples of it finding the cat box and then the cat head, and so on, et cetera. And it works.
Let me skip through these. Positive results, here are tables and so on. So basically, by the standards of computer vision, this works fairly well. Well enough to get accepted. Well enough to improve the state of the art on certain tasks.
And it does show the compositional strategy works fairly well-- it exploits reusable parts, et cetera. The limitation, though, is really that if you judge it by the standards of what a human can do, it's not doing that well.
A human on this task, I'd say, ought to get 95% correct. Why 95%? Because in this test, some of the objects are really very, very small. I've looked at those objects, by constructing an image where we take out the background and just show the object itself.
I've run simple tests on it and you can't do it. And so it's hardly fair to ask the model to do it. But basically that's where the standards are, and what is the problem here, if people want to go further, if I want to solve a Turing test for detecting parts?
And so the bottleneck, really, I think, is the Deformable Parts Model, which had the merit of being very successful at detecting objects. We're using it to detect parts, which is easier, more plausible, but even for those, it's still not working well enough. It's not modeling the shape and the appearance of the parts as well as it ought to in order to get high level performance.
So we're definitely trying to improve it. There are certain relationships here to work going on in the center-- I think it may not have been clear so far, but the idea of compositional models, putting parts together to form bigger structures, relates closely to the ventral stream hierarchical models that Tommy Poggio and [INAUDIBLE] are working on. This issue about what humans can or cannot do from these images relates to work that Ken Nakayama has done. I was having tea with him yesterday, and he has done work on the conditions under which humans can find faces in images.
And it would be great if he would extend it to these types of more complicated scenes. Social interactions, I think-- if you want to work out the complex interactions that [INAUDIBLE] they're talking about, it seems to me that you'd at least have to find the parts of the objects that you're looking at, and be able to parse them and describe them in some way. Scene descriptions and graphical models relate to lots of the things that Josh is interested in too, and the complexity issues relate to the work of Les Valiant.
One slide here I'm just putting up for [INAUDIBLE], in case he watches this, because he talked to me about doing this on cars. We also did this on cars, which is actually a bit easier, because cars don't have nearly as much variability as humans do. And so instead, again, of putting a box around the car, you want to take the car and break it down into the window region, the body region, the license plate region, the lights, et cetera. And because the car is, as I say, more or less rigid, you can get away with, instead of having all these parts and putting them together, having a model of the car from this viewpoint, this one, that one, and so on-- only a limited number of those-- and still do [? inference ?] fairly well, and get fairly good quality results.
OK. So that was detecting objects in 2D, with certain limitations in the models of the parts and so on. And I guess, increasingly, at least in computer vision terms-- and I have no idea necessarily about psychophysics-- it seems that a lot of the models ought to be more 3D, so that when you recognize an object, you're taking into account not just its 2D structure, but its 3D properties. I mean, if you want to recognize a chair, you could try and have a model of the chair from many different viewpoints. But the chair is a rigid object. Why not just have a simple 3D model of the chair and rotate it around?
That seems a far more natural way to solve it from a computer vision perspective. Maybe not for a human, but that's an open question. So in this case, we didn't do it for chairs. We thought, OK, it would be far more interesting to try and do it for humans.
So, building on some prior work that's been done in the computer vision community, this work by Chunyu Wang starts from an input image, tries to find the joint positions automatically using some existing algorithms, and from that tries to estimate the three dimensional pose of the person from a single image. Finding the pose requires having a prior on the configurations of the person. It also requires what in computer vision are called the camera parameters, which give the orientation of the person in front of the camera. So if I keep the same pose here and sort of move myself around, I'm essentially changing the camera, but I'm keeping the same pose. OK.
So you can also actually increase the performance of detection by imposing the 3D model. You start out with the possible positions of joints. You use this process to estimate the 3D structure. The 3D structure gives you a strong prior model. You can project that down onto the image and correct some errors in where you think you've estimated the joints. OK.
OK, so we represent the object by the positions of joints. You construct a prior model of the 3D geometry, in the next slide, and then the joint positions are estimated by standard algorithms, Zhu and Ramanan's as well. They do quite well on certain data sets, but there's certainly an amount of ambiguity and noise involved. So, learning a 3D prior for humans-- this is using a data set from CMU, from a group of people there, following on certain of their work and making, I think, a little bit of improvement.
If you've got 3D data, you've got the 3D positions of poses, and essentially you're using two types of constraints. One is this sort of anthropomorphic constraint, which claims that the limb ratios are roughly constant. So the ratio between my upper arm and my lower arm-- my upper arm may be longer than that of somebody in the audience, but the ratios are supposed to be roughly the same. And similarly, the ratios between my arms and my legs are going to be similar.
So CJ Taylor, I think, was the first person in vision to exploit that. That, however, is not enough information to give you the 3D model, so you need extra things. What we were doing was a method of statistical learning using sparsity. You represent 3D poses by linear combinations of basis functions.
Now at first sight, if you think about it, that shouldn't make much sense, because the set of poses is not a linear space. But if you [INAUDIBLE] sparsity into it, you are effectively allowing yourself only to use part of this linear space. And so you're ruling out a number of configurations which are just--
ALAN YUILLE: And then you learn the basis from training data. There's actually some quite interesting mathematics we're doing now about the details of that, but I'm restricting myself not to use mathematics in this talk, so I won't go into that.
Nevertheless, the idea is that, from those data sets, you can get models of poses which seem reasonably good, in the sense that if you take the 3D data and try to fit it to the models and test them, you get a pretty good fit. Of course, that's an easier problem than starting out with a 2D image and then trying to use this as a prior to interpret it.
Then you are faced with an inference problem: even if you have the model, and even if you've estimated the joint positions in the image accurately, or at least estimated quite a number of them, you're still faced with the problem of estimating the pose and estimating the camera parameters. And this is conceptually straightforward. You've got the image; you need to search over 3D poses and camera viewpoints. So it's conceptually straightforward but technically challenging. You formulate it as an optimization problem and use various algorithms to try to solve it, and they seem to solve it reasonably well.
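As a very rough sketch of the kind of inference being described-- a 3D pose represented as a sparse linear combination of learned basis poses, fitted together with a camera to 2D joint detections-- here is an illustrative simplification in Python. The weak-perspective camera, the alternating update scheme, and all parameter values are assumptions for illustration, not the paper's actual formulation.

```python
# Illustrative sketch only: alternately update sparse basis coefficients and a
# weak-perspective camera scale to minimize 2D reprojection error.
import numpy as np

def reconstruct_pose(B, w):
    """B: (K, 3, J) basis poses; w: (K,) coefficients -> (3, J) pose."""
    return np.tensordot(w, B, axes=1)

def project(pose_3d, R, s):
    """Weak-perspective projection: rotate by R, scale by s, keep the x,y rows."""
    return s * (R @ pose_3d)[:2, :]

def fit_pose(joints_2d, B, n_iters=50, lam=0.1, step=0.01):
    """joints_2d: (2, J) detected joints; B: (K, 3, J) learned basis poses."""
    K = B.shape[0]
    w = np.zeros(K)
    R, s = np.eye(3), 1.0
    for _ in range(n_iters):
        # (1) Gradient step on the reprojection error w.r.t. the coefficients,
        #     followed by soft-thresholding to keep w sparse.
        residual = project(reconstruct_pose(B, w), R, s) - joints_2d   # (2, J)
        grad = np.array([np.sum(project(B[k], R, s) * residual) for k in range(K)])
        w = w - step * grad
        w = np.sign(w) * np.maximum(np.abs(w) - lam * step, 0.0)
        # (2) Closed-form update of the camera scale (rotation kept fixed here
        #     for simplicity; a real implementation would re-estimate it too).
        pose_xy = (R @ reconstruct_pose(B, w))[:2, :]
        denom = np.sum(pose_xy * pose_xy)
        if denom > 1e-8:
            s = np.sum(pose_xy * joints_2d) / denom
    return reconstruct_pose(B, w), R, s
```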
[INAUDIBLE] so there are some [? qualitative ?] results. You can compare the performance to other methods, but I don't think it's worth going through those tables in any detail for this audience-- the answer is it's working fairly well. What are the limitations? The results are fairly good compared to the ground truth. Can we do better? I would say the prior that we have for the 3D structure is reasonable, but it's not the best. It still needs more work.
Estimating the viewpoint-- there's sort of an ambiguity between the viewpoint and the pose, and I suspect that if we could get the estimation of the viewpoint correct first, we'd get a better result. But nevertheless, it's working fairly nicely. And CBMM relationships-- how well can humans perform these tasks? Well, I asked Ken [INAUDIBLE]. Ken wasn't sure, but Ken was saying the evidence that humans can get good 3D estimation doesn't seem to be totally strong. But I don't know if anyone has tested it under these types of conditions, and I would think that it perhaps shouldn't be too difficult.
Social interactions-- you could use these models to analyze people interacting with each other, if you had language descriptions of them. And again, making connections to Tommy Poggio's work, the work [INAUDIBLE] studies, where he's worked on recognizing objects under certain types of invariances and certain types of group transformations-- you could argue that the 3D configurations of a human pose could essentially be described by groups of transformations. But we haven't done that directly.
So, a little bit extra in here. I talked about doing this for estimating 3D from a single view for a single object-- for a human. There's also work on trying to do this for scenes, particularly scenes which have a so-called Manhattan-type structure, where there is a ground plane and walls and so on, such that if you analyze the geometry, you find there's quite a lot of information to get 3D structure out from it.
I think we coined the phrase 'Manhattan World' a long time ago, and it's a bit interesting watching our citation rate. We published it, and it was quite nice. It had a few citations, and then the citations [? ran ?] down. And then suddenly, lots and lots of people seemed to be developing this type of thing. So, in many cases, you have these ground planes. If you find the ground plane, if you find planar surfaces and so on, then, as you can see, you'll get fairly good reconstructions-- I won't say fairly good, but reasonable reconstructions-- out, and again, from a single view, without needing to do any more on it.
So, about 20 minutes-- or, no, less, 15 minutes. So here, moving on from work that's been done on computer [INAUDIBLE], I'll now move to this issue about salient object [INAUDIBLE]. And this is several authors. Johnny Hou was, I think, the driving force, at least for the part I was involved with, though he's not technically the first author.
And so here there's going to be work on psychophysics, or psychophysics in the wild, with a certain amount of computer vision. Part of this is sort of outside my area, but I'll try and give you the basic summary of it. So, of course, visual attention has been fairly important, very much studied. Which parts of images do you look at? This can be driven top-down from high level models, or it can be driven by sort of low level saliency-type things.
And saliency-- [? Laurent Itti ?] and Christof Koch were very much involved in developing the idea of saliency, and [? Jao Li ?] is a graduate student of Christof Koch who I inherited when Christof moved up to Seattle. So a lot of the bottom-up work in attention was driven by this idea of saliency. Now, in computer vision, you don't necessarily care about saliency. What you care about, or at least what we care about, is [? finding ?] objects. So saliency itself was just sort of finding places that have particular properties, but instead you'd like to find the possible foreground regions, which could contain objects.
So here, I think, this is [? Jao Li's ?] figure, with the two types of saliency. There was fixation prediction, which follows from [INAUDIBLE] work: which parts of the image do your eyes go to? And then, more recently, there was salient object segmentation: if you have a data set, can you find the object and separate it from the background? So this is a major problem, a major thing for computer vision.
You can have methods like DPM models for objects, or you can have the latest versions of deep networks, which are also very effective, but you have to run them somewhere in the image. If you try and run a deep network everywhere in the image, moving it around at different scales, it'll take a long time, even if you've got GPUs. What you want to do first, and what people generally do, is try and find places in the image where there's a reasonable chance that the object is present, and then run your deep network or your deformable part model on those places.
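The "propose first, classify second" strategy just described can be sketched as follows; both functions are placeholders, standing in for a CPMC-style (or similar) proposal method and for whatever expensive detector you want to run sparsely.

```python
# Toy sketch: instead of sliding an expensive detector over every location and
# scale, run it only on a few hundred candidate regions.
def generate_region_proposals(image, max_proposals=300):
    """Placeholder for CPMC, selective search, or a similar proposal method."""
    return []  # list of (x, y, width, height) candidates

def expensive_detector(image, box):
    """Placeholder for a deep network or DPM scored on one cropped region."""
    return 0.0

def detect(image, threshold=0.5):
    detections = []
    for box in generate_region_proposals(image):
        score = expensive_detector(image, box)   # the costly model runs sparsely
        if score > threshold:
            detections.append((box, score))
    return detections
```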
So, bridging the two worlds-- what I'm hoping to say here is that an issue was that a lot of the work on salient object detection, at least the work done in psychological studies, was done on a particular data set known as FT, and it was a data set that, compared to things like PASCAL, was fairly easy. Many of the images would typically consist of an object and a fairly clear background. So to parody it, you'd have myself here and the background like this. In which case, it's OK, but it's not really representative of the complexity of things.
So [? Jao Li's ?] idea was, since we had the PASCAL data set and we had labeled it, he labeled it a little bit more. Here is a PASCAL image, and you could label around the objects, as we were doing, and enhance that. And then you could do tests for how humans do salient object detection on this data set. So here there are typically many objects in the image, the background is cluttered and confusing, and it's not trivial to know where the object is.
So this is a bit I'm less clear about. You set up these images. You run an eye tracker. You get people to look around, giving them a task. You don't want to make it too precise a task, but you essentially say, well, we are interested in finding objects, and you see how the eyes move around, et cetera.
So you could do the psychophysics to test how humans do it, and then you could produce a theory about how it could be done-- a theory for how it could be done by combining ideas in computer vision with the idea of saliency. So in computer vision, there is a set of algorithms, and the one I'll mention is called CPMC.
And what it will do is, if you take the image and you do an edge detection, the CPMC algorithm will group the edges and come up with about 200 or 300 possible foreground/background segmentations. Most of the foreground/background segmentations are wrong-- foreground/background segmentation is really difficult if you do it on a PASCAL image or anything like that. But CPMC will generate a lot of candidates, and some of the candidates are surprisingly good.
So the idea of [? Jao Li ?] was to produce a theory of visual attention-- or at least of visual attention for finding salient objects-- which combines running CPMC or an equivalent method to find possible foreground/background regions with the classic saliency fixation cues, in his case, the particular saliency detection algorithms he'd already got. So you're combining the basic idea of saliency with this idea of trying to find these foreground/background regions.
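A minimal sketch of that combination-- ranking candidate foreground masks by a bottom-up saliency map-- might look like this. The particular scoring rule (mean saliency inside versus outside the mask) is an illustrative guess, not the exact criterion from the paper.

```python
# Sketch: take candidate foreground masks (e.g., from CPMC) and rank them by
# how much bottom-up saliency they capture.
import numpy as np

def rank_segments_by_saliency(masks, saliency_map, top_k=5):
    """masks: list of boolean (H, W) arrays; saliency_map: float (H, W) array."""
    scores = []
    for mask in masks:
        inside = saliency_map[mask].mean() if mask.any() else 0.0
        outside = saliency_map[~mask].mean() if (~mask).any() else 0.0
        scores.append(inside - outside)   # salient object: bright inside, dim outside
    order = np.argsort(scores)[::-1]
    return [masks[i] for i in order[:top_k]]
```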
So here is a question. I was saying that I think the FT data set was a little bit limited compared to PASCAL because of the simplicity of the images-- a simple foreground object and then things in the background. And so here is a chart that [? Jao Li ?] made, where this is the performance of a particular set of algorithms up here on the FT data set. And if you run them on PASCAL, this is the performance level. So they're dropping, I think, by 35% in performance going from this type of set to this one.
So this is the sort of thing that, say, computer vision has grown accustomed to. You have certain types of data sets. You train on them. You work on them. And then you find that you've either over-trained or over-learned on them, and you need to move on to something else. So the Caltech data set-- the [? Caltech 101 ?] data set-- for a long time was something people used. And then there were various concerns, and within the computer vision community, they all moved on to PASCAL, because people were finding it was too easy to get things done on Caltech, and PASCAL was enough of a challenge.
And so similarly here, you could say that the FT data set was a good one to test on when you're starting out with these things. But if you move on to the PASCAL labeling, you find that the task is harder. This is arguably more realistic, and methods that have worked well on this data set do not necessarily transfer to those ones.
So here, I think, are some of the issues. I don't know if these are really representative of the FT images, but you can see the backgrounds are not too hard. And over here, these are some of them. This one's not so hard. This one I'm not even sure what you're finding, but it's complex. And here you're supposed to find the bike, and it's obviously got a whole lot of stuff around it.
So it's hard for doing saliency. Some of these things are not really salient enough, at least not for the previous models. But they would at least be found by this other type of method of combining the bottom-up segments with those things.
So here was an analysis-- and I'm just saying this is in terms of data sets, and whether they're representative or not-- there seems to be a need to study data sets, to see whether there are certain biases associated with them that can be exploited by algorithms, which would mean the algorithms were not sufficient for other things. So here are a few studies he put up there-- certain artifacts that could come out. And so, issues of data set design bias: when you're making a data set, you could [INAUDIBLE]. If you make the data set of images--
If I want a particular task-- people will ask me, can computer vision do something or other, x. And I say, I can design a data set for which you can do x really well. Now that's obviously cheating. What one wants is data sets, or even images, that are picked in general, and then you label them for particular tasks afterwards-- not deciding on the task first and then deciding on the labels. If you decide what task you want to evaluate first, and then you do the labeling, you make it far too easy, I think, arguably.
So in this case, the PASCAL [INAUDIBLE] is almost free from bias. Effectively, I put the 'almost' in myself. But that's because the annotation and the task were really independent of how the images were collected. They were collected for a different purpose altogether.
So then you can test this. You can use this. You can find nice suggestions for where the object could be, for where the squirrel could be. There should be a picture here where you get the salient boundary from the CPMC. Then you get the saliency coming out here, and then you get this as a very plausible region to find the object. So here is the CPMC that I've been mentioning, and you get improvements from that.
So that's the idea here, and I should really wrap up on this. You take images here of the cat. You can get certain segments out from the CPMC thing, and then you pick the ones which have the biggest saliency. Well, this is a ridiculous image-- I should ask him to give me a different one, because there's only one cat in here, only one small object. But this one's a more serious one, where you have the train, and you can get it out like that. So then you get good [? results ?] on PASCAL, and other good scores, which I'm not sure anyone is going to take in at this time of the afternoon on a Friday.
So, some of the observations from this. On data sets: one needs more complex data sets. PASCAL is good now. Maybe in a few years there will be something far better, and we'll be saying that PASCAL was too limited. But there seems to be an increasing need for more and more complex data sets, which are more and more representative of the world and of the tasks that machines should be doing and that humans can do effortlessly. Then, I think, for saliency detection, this combination of the CPMC thing with the saliency thing is actually doing quite well. Apart from the evidence here, I know that people run that type of method on ImageNet, which is enormously big, and you still find that the CPMC segments are fairly good, fairly close around the objects.
So I'm not really trying to push CPMC necessarily. I'm trying to push that class of algorithms, by saying that though it's very hard to do foreground/background segmentation itself, it may be realistic to think that you can produce a large number of foreground/background candidates and still expect that one of them is going to be the correct one.
A final thing here, which again I've put in. It's not actually on saliency, but it's again pushing the real world psychophysics. And I should say that with that particular study that Charlie did, in my mind, I count that as real world psychophysics. You're trying to test what humans can do on fairly complex images, because that gives you goals for how the computer vision algorithms should perform.
Here, there's a question of how well humans can classify real textures-- in this case, animals. Here are some examples of textures from cows, horses, sheep, dogs, cats, et cetera. You show this to humans. You don't give them much time to do it. You take out the color, because that gives a bit too much of a clue, and you find that at short viewing times, humans are not incredibly good at doing it.
You get these types of errors. You're right on the cat 50% of the time. This is a five-way choice, so 50% of the time is not that bad, but still, it's not good. And in this case, if you do the machine learning-- the computer vision thing, training on it-- you're doing quite a lot better. That may reflect the fact that this is still a limited data set, and the computer vision is maybe picking up on parts of the data which a human would not, but at least it shows things are roughly in the right ballpark.
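For concreteness, a machine-learning baseline of the kind being compared against might look like the following sketch. The features and classifier here are illustrative assumptions, since the talk does not say what was actually used.

```python
# Sketch of a five-way texture classifier trained on grayscale patches.
# The feature choice (gradient statistics) and classifier (linear SVM) are
# placeholders, not the method from the study.
import numpy as np
from sklearn.svm import LinearSVC

CLASSES = ["cow", "horse", "sheep", "dog", "cat"]

def texture_features(gray_patch):
    """Crude texture descriptor: gradient-magnitude histogram plus intensity stats."""
    gy, gx = np.gradient(gray_patch.astype(float))
    mag = np.hypot(gx, gy)
    hist, _ = np.histogram(mag, bins=16, range=(0, mag.max() + 1e-6), density=True)
    return np.concatenate([hist, [gray_patch.mean(), gray_patch.std()]])

def train_and_score(train_patches, train_labels, test_patches, test_labels):
    X_train = np.stack([texture_features(p) for p in train_patches])
    X_test = np.stack([texture_features(p) for p in test_patches])
    clf = LinearSVC().fit(X_train, train_labels)
    return clf.score(X_test, test_labels)   # compare against the human rate on the five-way choice
```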
So I think we're wrapping up here. Conclusions. What I talked about was three projects. The first, on objects and object parts, was the idea that one needs computer vision algorithms that can do these more complex, more realistic things. In my view, they need to be done compositionally, finding elementary pieces that you can reuse and so on, because otherwise I can't see how you could do it fast enough for large amounts of images. And our first attempt worked reasonably well; I think we were limited by the particular models we had of the parts.
Parsing humans in 3D-- I'm personally happy we could do that, and I feel that if you can parse humans, you can parse other deformable objects, and parsing rigid objects is a lot easier-- arguably nearly a solved problem. And perhaps with these algorithms, and more complex ones, you could probably get a lot of information out from a single static image.
The third one was visual attention. That was driven by the visual psychophysics-- what can humans do. If one point of a bottom-up attentional system is to do foreground/background, then this type of method, using these CPMC segments combined with the standard saliency cues, would be a good way forward. And maybe it's successful enough on that to be worth being tested on a far bigger and tougher data set.
So those are the three pieces of work. The general principle is that we want to develop computational models with human abilities, and as someone who publishes mainly in computer vision, I think trying to make things work as well as humans is a really good goal to have, and having data sets and challenges which push toward that is one way of driving the theory forward. And that includes the ability to deal with these complex visual tasks and complex data, and not restricting yourself to simple problems.
Then, I guess for myself, as mainly a computer vision person, the strange thing is I find myself asking, OK, how much is really known about what humans can do in real scenes? And to be controversial and to wake people up, I'd still say I probably know more about that from being human and having eyes than I do from studying some of the literature. And that's because I think the literature in psychology and neuroscience, for very good reasons, has been done using fairly simple controlled stimuli which you can understand, but that's maybe not really representative of how human vision performs in really complex environments, complex scenes.
So that's why I think there's a need for psychophysics in the wild-- what humans can perceive from real images, or particularly from parts of real images. From full real images, we perceive almost everything. It's very clear. But for parts of them, like that example of the bit of the boat, we don't. So there's a real question about when a part of an image gets big enough that humans really perceive it well. I think that should be a fundamental psychological problem to address.
I haven't said much about neural correlates in this talk. I am interested in those. I am doing work with Tai Sing Lee at CMU, and I guess my main finding is that neurons in the real world don't seem to behave like neurons in textbooks. But with that, and with the [? fMRI ?] studies, we're trying to put real data in, showing real stimuli, and predict what they do. The findings are interesting, complex, and not fully understood at the moment.
And then finally, if people want to see the details or if anyone is like me as a young man and wants to see equations and lots of equations, you can go to these papers. You can download them from the web. You can pick them up, and then they will soon be available as CBMM memos. I'm certainly always happy to talk about them and talk about the math with people who want to get into the details. Otherwise, thank you for your attention. It's a Friday.
PRESENTER: So we have time for questions. I was told that you have to use the microphone for questions. Any questions?
AUDIENCE: I have a few questions. Why are you so interested in what people can perceive from little fragments of images, exactly? I mean, I do agree there's lots of interesting questions about what people can perceive from natural images, and we've talked about some of those in the context of CBMM. Both Nancy and I have been interested in just, what can somebody see in a glance at an image? There was some nice work by [? Fei Fei Li ?] and Pietro Perona on this. But particularly because what you can report consciously is not necessarily giving much insight into what the earlier stages of processing along the way are extracting out of that.
ALAN YUILLE: So those studies of [? Fei Fei's ?] are interesting studies, but the thing is, how much do you remember, or how much are you consciously aware of, of what you see in an image if it's shown that fast? But I think I'm interested more because I'm thinking of, how would you process the image and interpret it? So if you start out with the image, the early parts of the visual system, or the early parts of a computer vision system, are just looking at local information. As you go up the hierarchy, you're getting access to more information and you're gaining more context.
Let me back off that to say the best way of doing computer vision, ignoring humans, would be to have all these models of objects in your head and to map them onto the image-- and I know people are doing that-- but the thing is, how can we do that fast enough? I think that's unclear.
My suggested answer to that was that you do it with this compositional strategy. You do it by sharing. You start processing the image. Locally, you have lots of little hypotheses-- highly ambiguous ones, because you simply don't know what's going on there. It's too ambiguous.
Once you get up to a certain stage, things become somehow big enough that the hypotheses become unambiguous, and then from that, you can go down and resolve anything below that. So my interest is partly, how high up do you have to go before things become unambiguous? So I think that would really give a good guideline for how--
AUDIENCE: But just to follow up on that, if that was your question-- I'm with Josh-- it seems like maybe a strategy you would use is, you know these paradigms where you take an image and then you reveal only little parts of the image and ask subjects to do a task? It seems like you could have subjects with a Gaussian reveal, so that there are little bits of the image that are less blurred. If you had them saying, is there a boat or not, you could very quickly figure out which bits of the image they needed to see in order to tell if there's a boat, and that feels way more naturalistic than giving them all of the pixel information for one piece of it and none of the pixel information for any other, with a definition of pieces that's based on similar pixels. That seems like a worst case scenario for vision and for figuring out what the pieces are that people need to use.
ALAN YUILLE: But let me see. I guess the issue is not just what pieces could you give them, but what pieces could your algorithm give them first? So we're starting off with super pixels because we know that there are algorithms that will give us those super pixels, which we can see are sub-parts of objects. So we can see how to do the processing up to that stage. It's just going beyond that that's harder, I think.
It's sort of like edge detection. Suppose I have certain types of edge detection-- I can detect edges. Then the question would be, how do I group them to make something bigger? We tried doing some experiments in the summer in which you just put a spotlight on an image and you allow it to grow bigger and bigger until people click a button and say they can identify it.
So we did some studies like that, and we got completely confusing results, because really, it depends very much on where the center of the spotlight is put. If you put it over my eye, or an eye of a human, the thing expands out only by a small amount and you'll know that it's a face or something. If you put it on my shirt, it can expand out over a very large area before you know what it is. If you use a super pixel, though, the idea was that the super pixel would allow it to automatically grow to an area which you don't know what it is, but you know it's all the same thing, and then you can try and identify it.
AUDIENCE: So is the idea that you're doing simpler-- I mean, you only showed us that one super pixel and then the whole image, but are you trying to grow it out? Is that the idea?
ALAN YUILLE: No. In this case--
AUDIENCE: Because that sounds interesting.
ALAN YUILLE: That's interesting. It could be done. In that study, what was done was just based on the super pixel alone. That's sort of a start.
AUDIENCE: I think certainly, both Rebecca and I probably agree that some kind of experiment like that would be interesting, but it's an interesting question to say, what's the right one?
ALAN YUILLE: [INAUDIBLE] super pixel [INAUDIBLE].
AUDIENCE: Yeah. A good way to relate the representations in the intermediate stages that you can compute with what's going on inside the brain [INAUDIBLE]. Anyone else want the mic?
AUDIENCE: I wanted to come back to a question about objects and object parts. If I understand correctly, you're defining object parts-- I mean, humans define those parts and then you go on. Is that the way to go, or can you think about unsupervised ways of defining parts? I'm always puzzled about what's an object, what's a part, and is that an important distinction?
ALAN YUILLE: We've done work in the past where we have done unsupervised algorithms to try and find parts. I'd say they are partly successful, which means they're successful on some data sets but they're not successful on PASCAL at the moment, for example. Here, we were using the knowledge of what the parts would be.
But there, again, we're doing this on static images, and if you are considering this in a more normal situation, and you see someone like myself moving, you know that there's a part here and here it changes because you can see that actually happening. So the decomposition of the parts becomes really quite direct when you've got a sequence of images, I think, whereas if you just try and learn them unsupervised just from a series of 2D projections, you haven't got that degree of knowledge, I think. But we've got papers on unsupervised learning and I'd love to learn all the things unsupervised. I just feel that it's maybe impossible, or certainly, I'm not clever enough.
AUDIENCE: So one more question here. Again, I'm happy to turn the mic over to anyone who wants it. Can you say maybe just a couple of sentences about the neuroscience results you referred to at the end, like ways in which the responses of neurons seem different on natural images than in the lab?
ALAN YUILLE: Well, that was a fairly low level study. I guess there's an abstract being at least sent in. I think it's more at the level of V1, V2, possibly even V4 cells, depending on what they decide to put in. The cells are classified depending on certain types. They typically get mapped with certain types of patterns. So then you could try to put in natural images. But then you've got too many natural image patterns for the cell to look at. You can't test the response to all of them.
So we used work-- which was actually partly in the talk I gave yesterday at CSAIL-- about looking at certain types of dictionaries of patches that could represent image properties locally. And so one result of that would be that all [INAUDIBLE] patches could be represented by a dictionary of 128 mini-epitomes, and so on. And so then you could take those types of stimuli that are represented that way, and you could show them to the monkey.
Then you could try and compare the response to those with the response to edge stimuli, which would characterize the classic receptive field, and you see the responses of the neurons to the edges were not necessarily-- for some neurons, the response to the oriented edges was correlated with the response to these intensity patterns. In other cases, it seemed it wasn't. The neuron was responding to several different types of intensity patterns which seemed to have little correspondence to the particular edges. But it's sort of very low level and preliminary at this stage. But I'd be happy to talk to a neuroscientist about those findings.
PRESENTER: Any other questions? I think there's some kind of reception of food and maybe drink outside. So thanks, Alan, very much, and see you all outside.