Tutorial: Computer Vision (48:17)
Date Posted:
August 12, 2018
Date Recorded:
August 12, 2018
CBMM Speaker(s):
Andrei Barbu
Description:
Andrei Barbu, MIT
Overview of some challenges still facing the design of general computer vision systems, traditional image processing operations and applications of machine learning to visual processing.
Download the tutorial slides (PDF)
ANDREI BARBU: So in particular, for the computer vision tutorial, I can't tell you about every technique that people have ever applied to every vision problem in the history of computer vision, because that's 50 years long, and that's going to take many hours. And I can't even tell you about every technique that people use these days, because that also takes too long. So instead, what we're going to focus on is I want to give you an idea of the kinds of problems people consider in computer vision, the kinds of successes that they have, and what we mean when we say that something in computer vision works or doesn't work, because it's not always intuitive.
So for example, when we talk about object detection, we don't mean human object detection. We talk about a particular slice of this problem that we consider in computer vision. I'll also show you something connected to the deep learning tutorial from earlier today about the development of deep learning in computer vision and how this is a pretty logical sequence. And you'll see how even the modern deep learning techniques are very much based on what people used to do 10, 20 years ago.
But to get started, the quintessential computer vision task is object detection. And human object detection is quite rich. If you look at a scene like the one in front of you, you can probably name thousands upon thousands of different objects and their parts, properties of those parts. But what we mean in computer vision when we talk about object detection is essentially put a bounding box or, at most, try to segment out part of an object and label it with a single label.
Occasionally, people talk about fine-grained object recognition, where they will label, say, 10 different parts of an object. So you might label the wings of an airplane or the cockpit of an airplane. But you can name thousands upon thousands of parts of an object. So this is a significantly impoverished task. This is an early object detector, this is essentially what the results look like.
We also measure object detection performance very differently than you might in, say, psychology or in human vision. We have people draw bounding boxes, and we compare the bounding boxes from humans to the bounding boxes from machines. What's very fun about this bounding box task is there was a paper a few years ago that compared how good are humans at recognizing that the machine bounding boxes are bad?
So if I give you two bounding boxes, you could imagine that a human might put a bounding box here, and a machine might shift the bounding box. It turns out that humans disagree with each other on this bounding box labeling task very significantly. And they can't even tell the difference between state-of-the-art computer vision systems and human performance on some of these tasks just because the labeling of the bounding box itself is so incredibly ambiguous.
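To make that box-to-box comparison concrete, detection benchmarks typically score a machine box against a human box with intersection over union (IoU). Here is a minimal sketch; the 0.5 threshold in the comment is the conventional benchmark choice, not something specific to this talk.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

# A detection is usually counted as correct if IoU >= 0.5.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # ~0.33
```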
If I'm a rectangular object, bounding boxes are OK. But if I stretch out my arms, there's a disagreement about whether you should put the bounding box on my torso, or you should include my arms, which mostly includes background inside the bounding box. So the whole point of this discussion is very often, when you see computer vision people talk about a task, you should really dig in to figure out exactly what task they're talking about. It may not be exactly the one that you had in mind.
And even more than this in computer vision, we eliminate problems that are too difficult. So if you look at standard data sets, they don't include objects like this. The shirt doesn't really have necessarily a defined shape. It certainly has a 3D shape when it's on you. But it's so incredibly deformable. It's hard to even think about what kind of computer vision model would be good at recognizing this sort of shape. So we eliminate these from our data sets.
That's a significant limitation. In particular if you think about evolution, you don't really have a lot of rigid rectangular shapes out in a rainforest somewhere. They look much more like this. And we just have no idea how to handle this sort of thing. It's hard to express how far away we are from being able to recognize, label, understand parts of shapes like this. We have no idea how to represent them mathematically and in principle, even, never mind actually be able to do it.
One of my favorite example objects and one of my favorite objects actually in real life is a few years ago I bought a stone tool, so an ax that was used about 250,000 years ago. It turns out that there are so many of these in digs everywhere that in order to raise money, digs that are looking for human fossils will just sell them off. They're sort of like ancient Roman or ancient Greek coins. There are millions and millions of them.
So you can buy one. It's, like, $200. It's not a big deal. But what's amazing about this tool when you look at it-- I didn't bring it because I had too many things for the summer school. But when you look at it, it sort of fits perfectly inside your palm. You can totally see how you might strike something with it, how you might cut something with it. You can see that certain bulges in certain areas are good for some things and others are bad for others.
But if you take state-of-the-art computer vision, the most that you could do with this rock is maybe you could segment it. Maybe you could put a bounding box on it. But being able to detect something like the sharp edge-- there's no sharp edge detector. Being able to detect the part of it that is most well suited for you to put your hand around-- there's no grabbable detector in computer vision. These are all concepts that are too advanced, too complicated, and that we just don't consider.
And actually, if you put this into CaptionBot-- I have not tried CaptionBot from this year, but this is CaptionBot from 2017-- it says it's a hand holding a half-eaten apple. Things like stones, again, don't have well-defined object shapes. They don't have well-defined boundaries. So you often get absurd answers like this in computer vision.
Now, sometimes computer vision works decently well. And that's when we control the domain very, very well. If you look at autonomous cars, this is a much, much more constrained domain than what you see when it comes to detecting something like a stone.
You know that the roads are mostly going to be gray. You know that these lines come in only a few different shapes and patterns. Cars are thankfully not deformable unless you're in the middle of an accident, which I don't recommend.
They're mostly going to do some pretty rational things. You can even have 3D models for most cars that are on the road today. So the reason why computer vision works when it comes to autonomous cars, or at least works to some extent, is because essentially the environment is very, very well engineered so that if all you do is you put bounding boxes around cars, pedestrians, cyclists, that sort of thing, you can do a pretty good job of it.
You also get to have essentially unlimited data. Whereas, if you're looking at a stone tool that you've maybe never seen before, you have to handle that for maybe a single example. Of course, computer vision also fails, and it fails in pretty spectacular ways that human vision doesn't fail. And this is something that, as a community, computer vision is not particularly good at looking at and understanding.
Human vision fails in only a few cases. One case is if your visual acuity is too low, you can't recognize objects because the world is too fuzzy, so you wear glasses. If it's too dark outside, occasionally you might misinterpret some object that's far away. You might think that a plastic bag is a cat or something, and then you change your mind as you get closer. And if I give you too little time to view something, you may misinterpret it.
But aside from that, human vision is pretty accurate. It's what we call veridical. You're seeing what's actually there. And you also don't have this experience where you're 100% confident in what you're seeing, and then you touch it, and it turns out not to be the object that you expected. You can almost count the number of times that this has happened in your life.
This is not true of computer vision. Computer vision tends to be massively overconfident in whatever answers it's producing. And sometimes this has pretty disastrous effects. So this is an example from the Tesla crash from two years ago.
This is a Tesla that was driving down the road. There was a truck that was making a turn. I think it was actually a legal turn-- I don't think they were breaking the law. The truck happened to have been painted white. The Tesla-- and I'll spare you the pictures, but you can find them online if you want to see something like that.
The Tesla was in autonomous driving mode. It's debatable whether the person was supposed to be holding the steering wheel at the time or not. That's a whole thing where Tesla thinks that what the person was doing was wrong, and other people maintain that they were doing something perfectly fine.
But in any case, the Tesla was driving on the road. The truck was white. It was a very bright sunny day. The color of the sky was almost exactly the same as the van, or the truck. And the car just plowed into it. And the person unfortunately died.
What's interesting is we have this sort of way, as humans, to try to anthropomorphize and try to absolve machines of their failures. So for example, the regulatory body in the US that's responsible for this basically assigned equal blame to the auto pilot and the human, as if the machine is some sort of human or biological system that deserves to be absolved for its errors.
Something else that's very different between machines and humans in computer vision is that humans tend to make random errors. Reading an X-ray scan, a human might miss a cancer with x-percent likelihood. A machine, on the other hand, might on average do very well but systematically miss certain kinds of cancers because it hasn't seen them or because it hasn't learned about them. Humans tend not to make these kinds of very systematic errors. Again, we're just not particularly good at recognizing this in computer vision.
I don't want to be all negative about vision. Computer vision and human vision are extremely, extremely difficult tasks. If you look at an image like this, if you had a robot nanny, you might want that robot nanny to be able to tell you that your child's attempting to do this. I should also tell you that this is Photoshopped-- no children were harmed in the making of this image. Someone online did it.
But if you look at this image, you might say, well, this child is in danger. This child might be falling. Maybe they're hanging on to this thing. There's a whole number of sentences that you could write about it.
And unfortunately, if you put this into state-of-the-art captioning systems, they say things like, I can't really describe the picture. But I see indoor, table, and room. Yeah, OK, it's indoor. There's a table in a room, but you've kind of missed everything.
That's something else that computer vision isn't particularly good at and we really don't understand about human vision at all, which is: what is our model for saliency? We don't know how the task that we want to perform right now affects the kind of features that we pay attention to. If you look at existing saliency models, they very much have to do with visual salience, these kinds of pop-out effects for objects, rather than with the task-based saliency that humans have.
You can go through a number of problems that humans have to overcome, animals have to overcome in vision and look at the sort of solutions that computer vision has had for them. One big problem with vision is illumination. And this is very, very easy to overlook if you don't have experience with vision.
Illumination really changes what we see out in the real world. This is a very famous illusion. This is an illusion that your laptop manufacturer uses in order to convince you that buttons are popping out. Just the direction of the light changes whether you think this is a bump or whether you think it's a cavity.
Even more than that, this is another famous one, by Adelson. These two squares have precisely the same color, but you don't perceive them that way at all. And what's amazing about illusions in vision, unlike illusions in some other domains, is that it doesn't matter how much I try to show you evidence that what you're seeing is wrong-- these processes are so far removed from your conscious control that you're going to perceive what your visual system wants you to perceive.
Now, it's really hard to try to reproduce some of these things in machines, because on the one hand, we don't want to put into our algorithm something like "make the same mistakes humans make." We would like these kinds of illusions to come out of some higher principles about vision. It's not at all clear from the set of illusions that people have tried on both humans and animals what the underlying principles that drive them are.
Color also causes a lot of interesting effects. And it's far more difficult to recognize the color of something than you might assume. Some Canadians-- I try to include a Canadian flag in every talk, so this is it. And if you try to just say, tell me the red pixels in this image-- if you take this, you put it into GIMP, and you just make a mask at some red level, this is about as close as I could get to trying to segment out the reddish pixels.
It turns out that the actual RGB values here and the RGB values here are not that different. Sometimes they're even completely identical. What your brain's doing is it knows about transparency. It knows about the 3D properties of these objects. It knows about what it expects to see in a sunset like this. And it's reinterpreting these colors according to that.
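For concreteness, the naive GIMP-style masking described above amounts to something like the sketch below (the threshold is an arbitrary choice). As just explained, it fails exactly because the pixels you perceive as red often don't have distinctive RGB values at all.

```python
import numpy as np

def naive_red_mask(image, margin=60):
    """image: (h, w, 3) uint8 RGB array. Mark pixels whose red channel dominates."""
    r, g, b = (image[..., c].astype(int) for c in range(3))
    return (r - g > margin) & (r - b > margin)   # crude "reddish" threshold
```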
Now, there are data sets in computer vision that are all about trying to label images and regions of images with categorical colors and trying to match human perception. This is a really hard task. This may actually be AI complete, or computer vision complete.
Attention-- we mentioned attention briefly earlier. But scene recognition is particularly difficult. These days there are data sets where people take images of scenes with Kinect and try to reconstruct their 3D shape. Object categorization is the labeling task. Being able to understand the shape and the structure of objects is very difficult.
This has an impact on neuroscience as well, and on our understanding of human and animal cognition. If we don't know the mathematical structure that can represent 3D shapes and is good enough for us to do object detection, it's really hard to search for it in the brain. And it's really hard to try to test that cognitively, and we really don't. So there are people that spend their time doing research on different kinds of manifolds, different kinds of parameterized shapes that could potentially fit the 3D objects.
Motion is also surprisingly difficult. If you just want to do something like take a video and figure out where the motion is in this video and how the objects are moving, it turns out that this may also be AI complete in and of itself. There is a task in computer vision called optical flow, where the idea is I take a video, and all I want to do is densely label every pixel in this video with where it's moving from frame to frame.
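As a concrete example of that dense-flow task, here is a minimal sketch using OpenCV's classical Farneback flow estimator-- a pre-deep-learning method, used here only as a stand-in; the frame filenames are placeholders.

```python
import cv2

# Two consecutive grayscale frames (placeholder filenames).
prev = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# Dense optical flow: one (dx, dy) displacement vector per pixel.
# Positional arguments: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print("mean pixel displacement:", magnitude.mean())
```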
This is an active area of research. These days, people apply deep networks and recurrent networks to videos, trying to extract these flow fields, and it is extremely difficult. There are these beautiful visual illusions. We don't have any idea why people see them. But I think you can see that these things are sort of moving if you stare at them.
It's also quite interesting that animals have a lot of the visual illusions that humans do. So for example, this cat does actually see these things moving. It's really difficult to test animals with these sort of things because they just don't seem to pay attention to them after a few seconds. But you see he's interested for a while.
So there is something deeper about this. Our visual systems diverged many millions of years ago, tens of millions of years ago. But they still maintain this pretty fundamental property. There are many more videos of cats seeing visual illusions online. If you want to Google for that later, I encourage you.
Recognizing actions is something that, once upon a time, people didn't really do too much. Today, action recognition is a pretty big and important task in computer vision, in part because it's so important to applications of computer vision. If you could recognize actions, you could start tracking people in cities. You could start putting these things in factories to see what people are doing.
This is also sort of, in a sense, a recap of what kind of tasks computer vision has been considering over the decades. We really started by looking at some of the simpler tasks and then going towards more and more and more complex tasks. But even if you pick something like action recognition, action data sets today don't involve things like social understanding. And when computer vision people talk about action recognition, we mean action recognition in a three-second video clip rather than in a five-minute movie. Those are totally, totally different beasts.
Memory is very important for vision. You can remember the scenes that you've seen before. People seem extremely, extremely good at being able to remember the precise image that they've seen before and be able to remember thousands and thousands of such images. So I can flash lots of images in front of you, and you won't be able to have random access to those images. You won't be able to describe to me the 50th image that I showed you. But if I show it to you, you can recognize it really quickly.
The connection between vision and memory is pretty complicated-- and social interactions, of course. Sometimes vision bleeds into language. I'm sure you've all seen Stroop tests before. These words are very difficult to read.
There are other, more interesting connections between vision and language that we understand these days. So for example, there are people that study the development of color words in languages. If you're from Japan, 200 years ago your ancestors didn't really distinguish blue and green very much. They do today, and the Japanese language changed, in part, because of contact with the West, with Europe, with North America.
In general, there seems to be a pretty stereotyped development of color words in languages. People draw color word families. No one understands why this is or whether it has anything particularly interesting to say about the human visual system. It also seems as if the colors that you can distinguish depend on how many color words you have, even perceptually.
So people have tested, say, men and women. Women tend to use many more color words than men do. I don't want to stereotype. But it does seem to statistically be true. And it seems like if you learn more color words, you get better at it. Men that are designers and have a much larger color vocabulary are able to distinguish many more colors than men that are not.
I also want to walk you through a little bit about the techniques that people have used in computer vision. In particular, what we're going to do is start with a very trivial visual task that is not even computer vision-- signal processing that would not have been inspiring or interesting even in the 1930s-- and build our way up slowly from that task to a modern object detector. And I'll show you the pretty straightforward path that's been taken. Now, it's straightforward in retrospect. It was not straightforward in the minds of the people that were doing it at the time.
So some of the simplest things that you can do to an image that are not at all surprising to anyone that does signal processing is you can blur or you can sharpen an image. We'll come back to some of these. It turns out that if you start off with an image blur, you can express this blur as a convolution.
I don't know. Did Yelling talk about convolutions a fair amount? Is everyone comfortable with them? Should we talk about them more? So it turns out you can express blurs as convolutions. You can build up your way from a very simple blur kernel into something that detects edges, into something that uses edges in order to detect objects and then into a fully fledged object detector.
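Since everything that follows is built on convolution, here is a bare-bones, unoptimized 2D convolution in NumPy, just to fix notation (real libraries use far faster implementations).

```python
import numpy as np

def convolve2d_naive(image, kernel):
    """Naive 'valid' 2D convolution: slide the flipped kernel over the image."""
    kh, kw = kernel.shape
    k = np.flipud(np.fliplr(kernel))   # true convolution flips the kernel
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out
```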
To do this, we have to talk a little bit about baby machine learning. And you may have seen some of these ideas, and you certainly see them more in [INAUDIBLE]. But I just want to make sure we're all on the same page.
So there's what's called supervised classification. I give you a whole bunch of points, and I want to put a line between these points. This is a discriminator. There's unsupervised classification. If I just give you the positions of the points, but I don't give you their labels, naturally you would still probably draw a line here if I told you that there are two classes.
Generally, most of computer vision is supervised. There are very few unsupervised methods that work. And there are very easy methods called linear classifiers-- we'll have Leslie Valiant, who was involved in this kind of research, come talk tomorrow. A linear classifier, all it does is it tries to choose a straight line, either in two dimensions or in n dimensions, that separates your points.
There are lots of straight lines that you could pick. It just turns out that one nice straight line is a straight line that has what are called support vectors, that are vectors that try to maximize the distance between this class of points and this class of points. That's where the terminology of SVMs comes from. This is no different than neural networks. A single layer of a neural network is going to do this kind of linear classification.
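Here is a minimal sketch of that idea using scikit-learn's support vector classifier with a linear kernel; the two clusters of points are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two made-up 2D clusters, one per class.
X = np.vstack([np.random.randn(50, 2) + [2, 2],
               np.random.randn(50, 2) + [-2, -2]])
y = np.array([1] * 50 + [0] * 50)

clf = SVC(kernel="linear")   # maximum-margin straight-line (hyperplane) separator
clf.fit(X, y)
print("hyperplane normal:", clf.coef_, "offset:", clf.intercept_)
print("number of support vectors:", len(clf.support_vectors_))
```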
I won't go into SVMs. You will probably hear more about them in the machine learning day, two days from now. Something else that you've heard about in the optimization tutorial is gradient descent. You start off in a bad part of the search space, and you get better. Essentially what you're doing is you're taking these hyperplanes, and you're moving them around. And you want the hyperplane that satisfies your cost function better.
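To make the "move the hyperplane around to lower the cost" picture concrete, here is a bare-bones gradient descent on a hinge-style loss for a linear classifier; the learning rate and number of steps are arbitrary choices.

```python
import numpy as np

def hinge_gradient_descent(X, y, lr=0.01, steps=1000):
    """Labels y in {-1, +1}; learn weights w and bias b by descending the hinge loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        margins = y * (X @ w + b)
        violators = margins < 1                     # points on the wrong side of the margin
        grad_w = -(y[violators][:, None] * X[violators]).sum(axis=0) / len(X)
        grad_b = -y[violators].sum() / len(X)
        w -= lr * grad_w                            # move the hyperplane downhill
        b -= lr * grad_b
    return w, b
```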
Sometimes a linear function isn't good enough. And back before the deep learning days, the way that people thought about this is they took a support vector machine, and they had what's called a kernel. What you do is you take your data, and you re-project it into a different space, where you undo some of the nonlinearities so that a single linear classifier can distinguish your points.
So you might have, say, a radial kernel that does some kind of nonlinearity. Back before the deep learning days, these kernels tended not to be learned. People chose the kernels that they thought were most important for their task, and they used them in order to undo these nonlinearities and fit an SVM.
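In scikit-learn, swapping in such a hand-chosen kernel is a one-line change. Here is a sketch on a made-up ring-shaped dataset that no straight line could separate.

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data: a tight inner cluster versus a surrounding ring (not linearly separable).
angles = np.random.uniform(0, 2 * np.pi, 100)
inner = np.random.randn(100, 2) * 0.3
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + np.random.randn(100, 2) * 0.3
X = np.vstack([inner, outer])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="rbf", gamma="scale")   # fixed, hand-chosen radial kernel; not learned
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```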
This is basically all that we need to know. So you've seen convolutions. I won't go over them again. There's a kernel. There's an input. There's an output.
The simplest possible kernel that you can make for a convolution, which was well known even 100 years ago, is a Gaussian kernel. All you do is you take your 2D Gaussian, the peak, the sides. You can change the width of your Gaussian.
The one caveat is you have to make sure that it's normalized so that the total sum of all the values in your kernel is 1. This is also important for deep learning. The reason for that is you don't want the average intensity of your image to change when you apply this kernel.
This is what it looks like before you normalize it. If you just take this Gaussian kernel and you run it over your image, you end up blurring the image. This is pretty intuitive.
What you want to do is weight most heavily the pixels that are close to the current pixel, with pixels that are further away influencing the current pixel less and less and less. This is exactly how blur is implemented in Photoshop. It's no more and no less than this.
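A minimal sketch of exactly that pipeline: build a 2D Gaussian kernel, normalize it so it sums to 1, and convolve it with the image (SciPy does the convolution here; the kernel size and sigma are arbitrary).

```python
import numpy as np
from scipy.signal import convolve2d

def gaussian_kernel(size=9, sigma=2.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()      # normalize so the average image intensity is preserved

def gaussian_blur(image, size=9, sigma=2.0):
    return convolve2d(image, gaussian_kernel(size, sigma), mode="same", boundary="symm")
```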
Now, of course, if we can have some kernels that look like Gaussians, we can start to make our own kernels by hand. And this is something that people did in the '70s and the '60s. This is a particularly useful pair of kernels called a Sobel kernel, or Laplacian.
What it does is it tries to do a kind of derivative. You pay a penalty for high intensity here. And you get a reward for high intensity here. This kernel is maximized when there's a difference of intensities between these two.
So what this is essentially looking for is vertical bars and horizontal bars. So when you convolve an image with these two kernels, you get out two images. You get out an image that tells you about vertical energy and one about horizontal energy. You can treat these as vectors.
You take their magnitudes. And what you get out is an edge detector. This is a Sobel edge detector. If you've ever heard of a Canny edge detector, it's essentially this, with a little bit of path tracing added in. This is an edge detector, and this is computer vision from 30, 40 years ago.
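A sketch of that Sobel edge detector: convolve with the two kernels, treat the two responses at each pixel as a gradient vector, and take its magnitude.

```python
import numpy as np
from scipy.signal import convolve2d

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])   # responds to vertical edges
SOBEL_Y = SOBEL_X.T                # responds to horizontal edges

def sobel_edges(image):
    gx = convolve2d(image, SOBEL_X, mode="same", boundary="symm")
    gy = convolve2d(image, SOBEL_Y, mode="same", boundary="symm")
    return np.sqrt(gx ** 2 + gy ** 2)   # per-pixel edge strength
```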
Now, you could very quickly turn this into an object detector if you felt like it. If I wanted to detect this part, what I would do is I would crop out this part of the edge map. Then you give me a new image, I run it through the edge detector, and I get out a new edge map.
I take the two. I compute the dot product between them. And that tells me how compatible these two edge maps are. That already gives me a little bit of leeway, right? These edges aren't 1-pixel thick. I could even blur them a little bit if I wanted to, to let the part align if the image was taken from a slightly different viewpoint or the part is slightly differently sized. It lets me detect the part, even though the color here might be different, even though the illumination might be slightly different.
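A minimal sketch of that edge-template matching: slide the cropped edge-map template over the new edge map and score each location by the dot product; the highest-scoring location is the detection.

```python
import numpy as np

def match_edge_template(edge_map, template):
    """Slide an edge-map template over a new edge map; return the best-scoring location."""
    th, tw = template.shape
    best_score, best_pos = -np.inf, None
    for i in range(edge_map.shape[0] - th + 1):
        for j in range(edge_map.shape[1] - tw + 1):
            score = np.sum(edge_map[i:i + th, j:j + tw] * template)   # dot product
            if score > best_score:
                best_score, best_pos = score, (i, j)
    return best_pos, best_score
```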
Well, the next thing that you could try is you could say, well, how about if my object is, say, differently scaled, or how about if the ratios of my objects are slightly different? The edge maps won't match particularly well. So what people said is, why don't we take these object detectors and rather than trying to match an edge perfectly, try to aggregate some basic statistics about the edges at every location? This is what's called a histogram of gradients detector.
So what you do is you take exactly the edge map that you had here. And then you take a region in the edge map. And rather than storing the edge itself, you store the histogram of the directions of those vectors.
So here you can see this is sort of the back of the tire. And here, most of the gradient energy was in this direction. And here, most of the gradient energy was in this direction, which sort of corresponds to the edges that you see. This gives you a lot more slack in trying to match objects.
So you can imagine that if I take this bicycle, and you have a bicycle that's 5% bigger or something, it's still mostly going to match. If you have a bicycle that's rotated out of plane a little bit, again it's still going to match. This is a standard object detector from about the year 2000 or so.
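A very reduced sketch of that histogram-of-gradients idea: instead of keeping the raw edge map, bin the gradient directions within each cell, weighted by gradient magnitude (the cell size and bin count here are arbitrary; real HOG adds block normalization and other details).

```python
import numpy as np

def hog_cells(gx, gy, cell=8, bins=9):
    """Per-cell histograms of gradient orientation, weighted by gradient magnitude."""
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ang = np.arctan2(gy, gx) % np.pi                      # unsigned orientation in [0, pi)
    h, w = gx.shape
    out = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell].ravel()
            a = ang[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell].ravel()
            idx = np.minimum((a / np.pi * bins).astype(int), bins - 1)
            np.add.at(out[i, j], idx, m)                  # magnitude-weighted histogram
    return out
```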
In, say, 2005, 2010, the best object detectors were these deformable part detectors, which are very much like the previous ones. But again, you want to add a little bit more slack to them. So what people did is they said, why don't we run an edge detector at multiple scales so you have a very coarse scale, and you have some fine-grained object detectors.
This just changes the scale at which you run your edge map at, so the scale at which you do the convolution. And then why don't we also let the parts move around? Because when humans swing their arms, the position of their arms obviously changes. The edges won't line up particularly well.
So they said, why don't we take this edge map, split it up into, say, six pieces, try to automatically infer the positions of these pieces. And it turns out there are some nice dynamic programming, greedy algorithms for trying to take an edge map like this and a big data set and figuring out how should you split the edge map up into n pieces so that it best accounts for your data. And also, we're going to have both the piece itself that tries to match the image and the deformation mask that lets the piece sort of move around a little bit.
So you can see we're slowly getting towards something that looks more recognizable as a modern object detector. And actually, there are papers that try to show that essentially this method of having an aggregate at the level of edges, building it up into a coarse detector, then a more fine-grained detector, where these things can move around, is essentially a kind of deep network. You can express this in the language of neural networks.
If you look at a modern object detector, it looks very, very much like this. You have an input image. You have a bank of filters that you apply over that image. That bank of filters is essentially behaving exactly like this. This is a bank of filters. And you can actually see the filters here.
The filters that are learned during deep learning can be much more complicated, and you might not be able to visualize them as easily. This is the same deal. You do some sort of max pooling, which is exactly what was done here. What you did is you asked, what's the best position for this particular part for this image? And that's the value that we're going to propagate later.
So you do max pooling, except that before we did two rounds of this. We aligned the parts in the coarse filter and then the parts in the fine-grained filter. Now we're just going to do six of these.
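For reference, here is what a basic 2x2 max pooling step looks like-- keep only the strongest response in each window.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Keep only the strongest response in each size x size window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size                 # trim so the windows divide evenly
    x = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return x.max(axis=(1, 3))
```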
In the olden days, what you would do is you would take this, and you might try to put a Gaussian on top of the output of this, or you might just try to look at the raw scores themselves. If you look at a more modern object detector, you have not just a single linear layer. You have multiple linear layers, which let you learn much, much more complicated functions of the inputs.
But essentially, it's sort of the same thing, just to a more refined degree. There are also a lot more parameters than there used to be. This thing has millions of parameters. This thing has a few thousand parameters. You need significantly more data.
And you can look at the history of computer vision in the past 10 years or so. And you'll see that performance went up significantly with these models and then had another big jump once these models turned into deep networks. But hopefully this gives you an idea about where do modern object detectors come from and why do people structure them in a particular way.
If you look at an even more modern object detector, it doesn't look very, very different from this picture at all. One of the big innovations that people have made since then are what are called residual layers, where what you say is, you take the activity in some earlier layer, or even some part of the original image, and you propagate that forward. So what you're learning is you don't have to learn the output of that layer. You have to learn something about the difference in the output of that layer.
Turns out this is much easier to learn. It also has to do with gradient descent. I don't know if Yelling talked about this. But a big problem with gradient descent-- well, not with gradient descent, but with how we compute gradients, how we propagate errors-- has to do with the fact that the further away you get from the input, the harder it is to assign blame to one particular filter or one particular parameter in your filters.
And so the further away you get from here, you don't know whether what you should do is re-tune something in this filter or something in this filter. And so what residual layers do is they let you propagate gradients more efficiently and assign errors, or assign blame, more efficiently. This is what lets people turn these networks that are, by modern standards, relatively shallow-- they have 8 or 10 layers-- into networks that have 100 or 1,000 or 2,000 layers these days. Aside from that, they're not so different.
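A sketch of a residual block in PyTorch, just to show the "propagate the earlier activity forward and only learn the difference" idea; the channel count and kernel sizes here are arbitrary.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # skip connection: the layers learn a residual, not the full output

block = ResidualBlock()
y = block(torch.randn(1, 64, 32, 32))   # same shape in and out
```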
So at the end of the day, you take an image as input. You try to propose some regions. Sometimes object detectors have separate regions proposal networks. Sometimes object detectors will just take an image, slice it up into pieces, and try to predict objects in every piece. You feed them through your network. You have some sort of linear classifier.
I won't go over this. Last year, this tutorial paradoxically came before the deep learning tutorial. But I assume that you know what max pooling is. Did Yelling talk about dropout at all? No. OK.
So I'll mention something about dropout because it's very important for computer vision. One thing that you could do is you could just train the whole network, as is. But it turns out that this doesn't work particularly well.
One reason why it doesn't work is you're training a huge, huge number of parameters every time. And that's always worse. But the other reason why it doesn't work is if you look at what people do in order to get really high accuracy at any AI or computer vision benchmark, invariably what they do is they train 10 different detectors. They vote between them, and they use some sort of boosting method or voting method in order to take the mean response of those detectors. And that always beats any individual detector.
And so what people do instead is they try to think of these networks as multiple networks layered on top of each other. This is one interpretation for what dropout is. So every time that you're training what you say is, I'm only going to treat some subset of the network, say, 50% or 90% or 30% of the parameters of this network, as actually being part of it. I'm only going to propagate information through the subnetwork. And I'm only going to tune this set of parameters for this particular example.
And every time that I tune the network, what I'm going to do is I'm going to pick a different subset of it. One thing that this does is it gives you a little bit of redundancy in the network. If one particular unit in the network happens to make an error, or got the wrong kind of input, or was mistrained, well, you've trained all the other units to try to work without that one, potentially. The other thing that it does is it kind of tries to train what's called an ensemble.
You could think of this as I train lots and lots of networks that happen to share some parameters between them. And this is sort of the final network trying to vote between all of them. There are some other interpretations for dropout. But this is the most intuitive one, by far. Essentially, everyone in computer vision does dropout, and it improves results significantly.
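A minimal sketch of ("inverted") dropout as described: during training, keep a random subset of the units, zero out the rest, and rescale so the expected activation is unchanged; at test time the whole network is used.

```python
import numpy as np

def dropout(activations, keep_prob=0.5, training=True):
    if not training:
        return activations                       # use the whole network at test time
    mask = np.random.rand(*activations.shape) < keep_prob
    return activations * mask / keep_prob        # rescale so the expected value stays the same
```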
Some of the biggest computer vision benchmarks include things like ImageNet. That's still the most important one out there. And you can see that the error drops significantly. So in about 2011, 2012-- this is about the deformable part detector-- it had on the order of 20% error or so. And then, with the advent of deep detectors like the one we just saw, we got on the order of 10% error. You can kind of see all the different methods, from not having parts, to having parts, to having deep learning, in one easy picture.
The other thing to know is that generally when people talk about accuracy over these data sets, they don't quite mean the same thing that you might think intuitively. Very often, they give machines a lot of slack on them. So they'll say a machine only has to predict in its top five labels one label that was assigned by a human. There are various reasons for this, but it does tend to inflate numbers significantly.
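Concretely, that "top five" rule is scored something like this sketch: a prediction counts as correct if the human label appears anywhere among the model's five highest-scoring classes.

```python
import numpy as np

def top5_accuracy(scores, labels):
    """scores: (n_images, n_classes) array of class scores; labels: (n_images,) human labels."""
    top5 = np.argsort(scores, axis=1)[:, -5:]              # indices of the five highest-scoring classes
    hits = [label in row for row, label in zip(top5, labels)]
    return float(np.mean(hits))
```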
And there are lots of different tasks. Classification asks you to label the whole image. In single-object localization, you can assume as a prior that you only have to detect a single object in the image. In object detection, you have to put n bounding boxes over the whole image, and you don't know what n is.
Let me show you some-- I can show you a little bit about object detection in action. This is a state-of-the-art object detector. This is called YOLOv3. You can kind of see the sort of errors that it makes and how well it works.
It certainly detects most cars. You can see it gets people. Occasionally, it merges people. Occasionally, it thinks that's a dog. You can see it does a passable job. Part of what it does is it's fairly unstable. So like the suitcase was detected sometimes. Sometimes it was detected as a car. Sometimes it was not.
The other thing that it does is it's very sensitive to the background. That's why sometimes this suitcase is detected as a car or a truck. It's not because it looks like a car or a truck. But the prior on suitcases being in the middle of the road is very small. And the prior on trucks being in the middle of the road is very, very high.
You can see the backpack sort of comes in and out of existence sometimes. It's also a little bit unclear what objects should be detected at any one point. So you may have seen that the ski poles were not detected. That-- I don't know why this person's staring at us so intently.
[LAUGHTER]
Yeah, so sports ball-- yeah, poor guy. Yeah?
AUDIENCE: So [INAUDIBLE] these demonstration videos, why don't any of them seem to employ Kalman filters or anything like this, which would prevent these things from appearing in [INAUDIBLE]?
ANDREI BARBU: To some extent they would. So I'll repeat the question so everybody hears it. What you could do is you can employ an object tracker of some sort, or you can employ an LSTM that tries to smooth out these detections. You can employ a Kalman filter, et cetera.
The reasons why people don't do that is because object tracking is just a different community, and that's by design. You and I could detect these objects in any one frame of these videos. So if you did that, it would sort of deceive you into thinking that this was far, far more accurate than it really was.
But you could. So in practice, people do that when they run these object detectors. But they really don't fix every problem in the universe. Like, when the whole thing is detected, some things are sometimes detected as a motorcycle. OK, this is violent. I should have watched this to the end before I showed it to you.
Sorry-- did you have a question as well? Oh, OK. I thought there was someone else. There are lots of ways to fool these computer vision systems. I'm sure you've heard of adversarial examples. These are things where people will take a computer vision system, take an image, and try to find some other image that will get misclassified.
Or they'll take an image of a car and make it be misclassified as a panda by changing it so subtly that we can't tell the difference. This is my favorite example of this. What these folks did is they printed glasses that make them be recognized as some sort of star-- I don't know who that person is, though. And they work very, very reliably. You can print out particular 3D glasses to get recognized as someone else.
Something else that people have found about these kinds of adversarial examples is at the beginning, there was a lot of criticism of them that they needed access to the network that they were operating against. And if you think about it in the real world, a chameleon that's trying to run away from me doesn't have access to my visual system in order to adversarially adapt itself and see what I'm thinking about it. And it certainly can't take gradients through my brain.
But what people have found over time is that there are sort of universal adversarial examples. You can take one network, train an adversarial example for it, and then take a totally different network, trained on a different data set with a completely different architecture. And if you do it the right way, the adversarial example will still work with very, very high accuracy.
So there are lots of ways to try to defeat computer vision. There are many more subareas of computer vision than I talked about. For example, there are folks that deal with image formation. This is a more interesting topic today than it used to be, because these days people try to use rendering engines to render images and then invert that rendering automatically.
So you might use something like a game engine, produce a 3D image that looks relatively realistic, like what you might see from a game, and then ask, what were the original textures of the objects that went into this image? What were their positions? Turns out that this works surprisingly well, at least on 3D rendered scenes. On real scenes, that work hasn't quite progressed to a point where we can use it.
Image processing is still a huge deal. The amount of image processing that goes into getting the camera on your cell phone to take a reasonable image is quite amazing. A few years ago, there was a series of talks at Google where some folks on their camera team tried to explain, in great detail, all the processing steps that go from the CCD in your camera, and all the computer vision that's in it, all the way to taking an image. And it's something like a 60-to-70-lecture series that's, like, 200 hours long, if you ever want to listen to it. It's fascinating. But it's a lot of weekend watching.
It's surprising how much you have to do to get even a remotely passable image. These days, Google, Apple, and other folks, Samsung, actually try to do some deep learning to make selfies better, to be able to segment out people better, to add some sort of fake depth-of-field effect to their images. I think this is something that's going to grow in the next few years.
Feature extraction is another sort of really big topic in computer vision. It's less important now than it used to be. It used to be that computer vision would work under the paradigm of some people would find good feature extractors. Other people would try to use those feature extractors in their tasks.
These days, we try to co-learn the features with the actual task. But people do look at the features that are learned by these networks and try to see whether they're interpretable by humans or whether they're useful for tasks they were never designed for.
Segmentation is, of course, an important task, both now and in the past. People do pixel-wise segmentation. It's called semantic segmentation. So I'll give you an image from a satellite or an image from a car. I want you to label every single pixel in this image with whether it comes from a road, whether it's grass, whether it's something else.
Object segmentation is another popular topic, where you want to be able to put not just a bounding box around me, but a tight polygon around me, for example. Structure from motion is something that you don't hear about too much today in computer vision. But it's still very interesting and very important, particularly now that we have drones.
Probably the most famous paper in all of the structure from motion literature and the multi-view geometry literature is on reconstructing Rome in a day, where they took millions and millions of images of Rome, and they reconstructed a 3D model of the city. These days people have drones flying around, and they try to reconstruct what's going on.
There's even an IKEA app that I believe tries to use structure from motion to reconstruct some objects and try to place IKEA couches inside your house. I know there was a startup working on it; I don't know if they're still alive. Motion perception by itself is still really important these days. If you go to CVPR, you will see many papers on motion perception.
Stitching together images, that's less so than it used to be. But computational photography is still very important. People, these days, try to do things like improve the results from 3D engines using computer vision. There's a very big market in taking, say, Hollywood movies and modifying them and trying to automate some of the very painstaking editing that humans do by hand.
So in any one scene in a movie, there may be wires hanging from the ceiling. People are on some blue background. You may have to edit someone's facial expressions slightly, that sort of thing. This stuff tends to be done by hand these days. And if you ever go online and you try to watch a video of people doing this, it's really, really laborious. And there are people, particularly Disney has a pretty good computer vision group, that work on trying to automate some of this stuff.
Stereo's still important. Even though stereo seems as if it's this totally-- it should be trivial thing. You have two cameras. You can look at the disparity between the position of two pixels. And OK, fine, that gives me the depth. The mathematics is completely obvious.
It turns out that stereo doesn't work particularly well with the basic math. Being able to find the correspondences between two images is very difficult. And even exploiting those correspondences is difficult.
What's interesting about stereo is even though it seems like it's hard, it basically doesn't matter for humans. If you had one eye, I could never tell that you had one eye from your behavior. There are people who drive with one eye. It doesn't really matter.
It also is pretty neat that until the advent of VR, we didn't know how many people out in the population had two eyes but couldn't see stereo. There's this process that happens to you as a child called stereo fusion, where you see enough examples of binocular images, and eventually you start to perceive depth.
But some decent fraction of the population, a few percent of the population, never had stereo fusion. And a lot of those people never knew they didn't have stereo fusion. So there are quite amazing videos online of people trying out VR headsets. And because the disparity in VR headsets is much, much larger than the disparity you see in the real world-- because they really want to drive those 3D effects home-- they experience 3D for the first time.
In recognition, these days there's a massive sort of explosion of the kinds of things that people can recognize. What's not on this list-- and I should say this list is from a good computer vision book from 2010 by Szeliski-- is everything to do with language, everything to do with robotics. These were not computer vision topics that people had made too much progress on 10 years ago. And today they're much, much more active, much more interesting.
But it's still worth looking through this book, if only because it's sort of the latest and greatest that you can find as a survey of the problems in computer vision-- not of the techniques anymore. But if you think about the kinds of vision problems that we talked about, none of them really addressed this issue of creating a 3D reconstruction of the scene so that you can tell that this baby may not have the best handhold, and if they fall, you can tell where they'd fall.
There's a big gap between these kinds of problems that we consider totally natural that even animals can solve-- like a dog could probably tell that this person is in danger-- and the computer vision that we do today. And the path between the two isn't clear at all. So when people tell you that AI will be human level in 10 years, maybe you shouldn't believe them very much. And with that, I'm happy to take your questions.
[APPLAUSE]