Visipedia: Combining data, machines and experts to distill knowledge
Date Posted:
August 15, 2019
Date Recorded:
August 15, 2019
CBMM Speaker(s):
Pietro Perona, Brains, Minds and Machines Summer Course 2019
Description:
Pietro Perona, Caltech + AWS
GABRIEL: OK, let's get started. It's a great honor to introduce Professor Pietro Perona, who is going to give our special [INAUDIBLE] talk today. [INAUDIBLE] a lot of biographical details. He's famous, with a Wikipedia page and whatnot, so you can find all the biographical details there. I was particularly privileged to have Pietro teach me my very first steps in computer vision when I was at Caltech. He's not only a great mentor to his group at Caltech, but also a great teacher.
In EE and CS, he's made many seminal contributions to computer vision. It's a very fun lab. It's always fun to go and visit and see all the great things that he's doing. Another particularly interesting thing that he's been doing now is [INAUDIBLE]. We live in very exciting times where there are lots of crosstalks between the industry and academia. And then you have amazingly talented people like Pietro who have both halves and who can sort of traverse both worlds. And many people want to talk to him about that, as well.
We have several talks, both today and tomorrow from people in different aspects of industry as well, which, I think, will be very [INAUDIBLE], as well. So before we start, I want to remind you that we have a reception after the talk. And the schedule says that that's in the MBL club? Does anyone know where that is?
I think that's in Water Street. That's Water Street, right? Yeah, so I think you exit here. You go to the left. And basically, as soon as you turn left again, it's sort of there on Water Street. So maybe the ones that are really sure can help me make sure that Pietro gets there after the talk, as well. All right? So again, without further ado, then, it's a great pleasure [INAUDIBLE].
PIETRO PERONA: Thank you. Thank you.
[APPLAUSE]
OK, thank you, Gabriel. Thank you for the intro. Can people hear me in the back? OK, excellent. And, yeah, this thing of the reception is a big mystery because I received the most diverse instructions on where it might be. And so I think that the club has run out of money and they're trying to distribute everybody around town so that nobody finds the reception. Everybody can claim that it was somewhere else.
OK, so thank you for inviting me. It's lovely company. It's the first time I am in Woods Hole. And I've heard from my Caltech colleagues how wonderful it is. And now I can check in person.
OK, so since this is a diverse audience and people have different backgrounds, please, interrupt me whenever I say something that is not completely clear. I'm sure that, you know, that you have lots of talents. It doesn't need to be that you know everything about what I'm talking about, so stop me.
And just as an introduction, my field is vision. So I'm interested both in computer vision and in biological vision. And so what is vision? And so if you look up the definition of vision in David Marr's book, he has a beautiful sentence. He says, "It's to know what is where by looking." OK, a very simple sentence.
But the sentence contains two key words, the "what" and the "where," which are foundational for vision. So the what is visual recognition. So you say, well, you know, what do you see? And you can see mountains, a cow, a man, et cetera. And the "where" is the geometry, which is another thing that you can understand using vision.
And so you can see that there is a receding plain, that there are some mountains that have a certain shape, that the rocks are very vertical and instead the grass here is fairly horizontal. You can tell the shape of the cow, even if you don't know that it's a cow and all of that. So in vision, you have geometry and you have recognition.
And at the time of David Marr, that was a great definition of vision. Throughout my career, I've seen vision people expand the realm of what they consider to be vision. And so you could say here, well, so I see a cow, I see a boy and a man. And now I can even try to guess what's the state of mind of the cow.
You know, what does the cow want to do? The cow wants to investigate a bit the hand of a boy. You could even say that she's hesitant. And the boy wants to be safe, but he's also curious.
And so you can infer goals. You can infer plans. You can say a lot of things from just one picture. And you can get into what Josh this morning was very admirably laying out in front of us.
So Josh and I talk quite a bit because vision borders onto cognition, onto intelligence, and vice versa. You know, Josh views it as a useful input for what he's interested in. And so there is a point of contact. Another point of contact is with this morning's talk on attention. And so again, you heard about how the visual system is allocating CPU cycles or, you know, circuits to different locations that it measures, depending on where it feels that the information is.
And that's also very interesting from the computational point of view. Because CPU cycles are expensive and so people think about how do you allocate them in the right way. And you already start seeing papers in deep networks that are talking about that, how do we switch off pieces of our network so that we only compute where the computation is needed and not anywhere else, right? And so you see all of that.
So this is vision as we understand it today. And I should add one more thing, you know. Here we see a static image. But, of course, when we see and when a machine sees it's within a dynamical world most of the times. So you are driving, you're moving, you're interacting, and things keep moving around. And so the question is, how does it work?
And this morning in Bob Desimone's talk, you saw how the visual system of the monkey was dynamical. And it was computing things very quickly all the time. Depending on where the monkey was looking, things were changing. And the models that they use to understand what the system was doing, instead, were input/output models in which you have to decide when you put in your input to read out an output, right? So there is not much of a notion of dynamics there yet, although it's starting to appear in deep networks.
But this is all irrelevant for today. So today I want to focus on something simple, which is objects. But I will focus on the knowledge part, you know, how does a machine figure out what it needs to do? And so here, motivating the talk, is an email that my father-in-law sent me almost exactly 10 years ago. He was walking in Caumsett, which is on Long Island, and he saw a mushroom.
And he knows that we like picking mushrooms in the Alps, so he wanted to know, can I eat this one? Is it edible or not? And when I saw it, I thought, I think I should tell him right away: no. And Boris also says no, right? But I blush to say I couldn't tell the species of this mushroom. I didn't know what it was. It didn't look right, but I couldn't tell the species, which is already enough, I think, when you pick mushrooms, to say don't eat it.
And so the question was, how do I find a species? And of course, you want to-- So what do you do? You look it up on Wikipedia, of course? Where else would you look it up? The trouble is you don't know what to type in Wikipedia to find out the species of this mushroom, right? Now, it turns out that, as for almost everything else, including myself, as Gabriel said, there is a page in Wikipedia.
And there is one here, it's Amanita pantherina. And it tells you that you shouldn't eat it. It's poisonous. But how do you build a bridge between the picture and information? And so that was the question that really made me wonder, that maybe we had not developed yet a good system.
So, you know, it felt like what you read in the newspapers: the web, you can find all the information on the web. And here is a glaring example where I cannot. I have a mushroom in front of me and I don't know how to look up the information. The information is there, but I cannot access it. OK.
So this happens to us every day. And somehow we become numb to the fact that we are so ignorant. And so I could ask, you know, what's the species of this bird? Some of you may know, some may not know. You know, what is written here, which script is it, when was it written, what does it say? Can anyone in the room read what is written there? Not easy.
You know, is this mole something I should go and have my dermatologist take a look at, maybe take out, is it dangerous? Am I in danger? You know, who is this guy? You know, what's the species of this tree. So there are so many things. We are surrounded by things we don't know.
And so we can read the world well enough that we can navigate and operate. But as soon as I ask you these questions, I realize how ignorant we all are about things. So we have a certain superficial level of knowledge. But it doesn't go very deep. And so wouldn't it be nice if a machine, like our personal smart device, were able to assist us, in some way, to build a bridge between images and information. So this is one question I may ask.
And so the idea is that you have Wikipedia. And wouldn't it be nice to build a bridge, which we call Visipedia. And this is a name we came up with together with Serge Belongie, who has been my partner in crime on this project. And he's a professor at Cornell Tech. OK, so how do we do it? What's the way to realize this?
And if there is anybody working in computer vision, they say, you know, what are you talking about? We are just doing it. You have deep networks. You train deep networks to recognize things. And that's about all that you have to say. You know, they work and all of that. And so the question is, is it enough or not?
And so in my talk, what I will want to do is think of deep networks and other classification algorithms as a substrate, as a commodity. Somehow, it's now built. Google teaches us how to use TensorFlow. And many architectures have proven to be very good. And even if we don't understand much, there are even kids from high school who train deep networks to do X, Y, Z, and they don't even understand the gradient.
So let's keep it as a commodity and let's assume that that community is doing a good job in keeping it up. And what else do we need? And so I want to explore some of these challenges, some of them fictitious and some of them quite real and unsolved. And so I want to take you through the challenges of building this Visipedia.
And so we want to go through fine-grained categorization, how many categories do we need, and other issues that we will discover along the way. So the first one is fine-grained classification. So when deep networks were built or were-- Sorry, I should say that differently. When we realized deep networks were for real, namely that they could deal with real-world images, when was that?
It was the end of 2012 when Geoff Hinton's group trained a deep network, AlexNet, basically, on Fei-Fei Li's data set, ImageNet. And so ImageNet is collected by searching the web for words that are associated with visual concepts. And those visual concepts are diverse. So you have dalmatian dog. You have cherries. You have container ship. You have mite, tick, and so on.
So if you take that point of view, then training on ImageNet would give you a network that can classify this as a bird. But that we knew, right? It's completely useless to me. I don't care. Maybe Google likes it because it can advertise stuff based on broad categories. But I knew that it was a bird. I would not ask my smart device to help me.
So what I really want to know is which bird it is. And so maybe I want to know that it's a sparrow. Well, it turns out that if you ask a birder, they will say, OK, yeah, it's a sparrow. But, you know, which sparrow is it? And so it turns out it's a Chipping Sparrow, OK? So you can go to amazing levels of fine grained categorization.
And those of you who are in neuroscience, you know that anatomists spend a lot of time debating the fine points of the classification of neurons. So this is what science is all about. You go to a very fine grain. And why are you interested in the phenotype? Why are you interested in what things look like? Well, because form and function are highly related.
And so form is also a giveaway for hidden properties. And we can generalize properties once we recognize a category. And so fine categorization is very useful. And in many tasks, for example, for a neuroscientist, knowing whether you're looking at a spindle cell or a [INAUDIBLE] cell, it's like day and night. You really want to know what it is at the fine-grained level. And so you have to go to this level.
So the question is how many categories do we have to deal with? And it turns out that just for the sparrows there are lots of categories. These are maybe 30 different types of sparrows. And you see how similar they are one to the other. And so you say, well, it will never learn how to classify sparrow.
So this is an example, maybe the most horrible example I found in the birding community is two types of chickadee. And it looks like the difference is a slight whitening of the wing here, which is less here. And people kill themselves to know whether they've seen one or the other because the border is a little bit south of here. And there are places where the two populations mix, but people really want to know which one do they see and so on. OK.
So you have fine-grained categorization. So deep networks have been good for discriminating dogs from automobiles from cell phones. But will they be able to deal with those sparrows that we saw and all the bird species and all the automobile kinds and so on? So it's one question. Now, the second one is the number of categories.
And so here we have the birds. There are about 10,000 species of birds in the world. And so there are lots of birds. But birds is just a small thing compared to everything. And so the estimate is that there are maybe five to 10 million animals, and then there are plants and fungi and so on. So there are lots of different species.
And then you have lots of man-made objects. You can think of, again, all the covers of all the books ever written, all the carburetors of all the motorcycles ever built, and so on, right, all the levers of all the airplanes ever built. And so the mind boggles. And it may be another 10 million objects. And then you have, you know, famous mountains.
We saw some before in my picture. I can't recognize those mountains. And so there must be people who recognize them, their own mountains. And there must be, again, hundreds of thousands of recognizable mountains. It turns out, celebrities-- you think celebrities are, by definition, very few, there must be a few thousand.
But actually, I learned at Amazon that there are half a million celebrities in the world. Now, how could you be a celebrity if you're one of half a million? I don't know. But that's the-- And then, you know, how many Chinese characters are there over the ages and so on. And so there are lots of things, like how many dishes in every cuisine.
So I think we should gear up for tens of millions of things. If we want to be serious, that's what our deep network should do. And so is a deep network going to be able to learn to recognize hundreds of millions of categories. OK, third challenge, this is more interesting and fewer people have thought about it, is the fact that the world is a long tail distribution.
And so I spent quite a bit of time in my last 20 years collecting well-annotated data sets for computer vision. And so we started off with Caltech 101, 101 categories, which then gave birth to ImageNet and then we had COCO and so on. And in each case, we were fastidious about selecting categories for which we could collect enough images to train a system. And so you know that if you're training anything to recognize categories of objects, you want to have a few hundred images.
And so 1,000 images per category for 1,000 categories, that's what ImageNet is all about, right? And COCO, so we said we want more images. And so we restricted the number of categories. And so we have 10,000 images for each one of 100 categories. And at some point, I realized, OK, I'm doing the biggest disservice for the community that I could possibly do.
So I'm killing myself to collect enough images for this category. And I'm just making the whole community believe that the world is a uniform distribution. And so you will always have enough images to train. And so you should develop algorithms that are very hungry for data and do well when they have lots of data. Well, if you look at the world, really, it's not true.
So here are birds. And so here is about a million birds. Now, we have about 5 to 10 million. And this, I must say, they're collected by eBird, which is an organization within the lab of Ornithology at Cornell University who have been my collaborators on this. And so here with a million, you see that in log-log scale, you have one of those Zipf distributions.
So there are two slopes. And we can talk about why there are two slopes. It doesn't matter now. But the point is, if you want to have about 1,000 training examples per bird, then you can only train 216 species. If you're less hungry, and we will see later that you should be hungry, and you accept to train on 100, you can train 1,000 categories. It sounds like a lot.
But you know, we have 10,000 to train on. And so we are training only on 1/10th of the categories if we accept a very sub-optimal number of training examples. Now, it's not that these images down here don't exist in our collection because we've been lazy or we've made a mistake. No, these are rare birds that are very rarely seen by people.
And yet, just the fact that they're rare makes them so much more valuable, and they're exactly the ones for which we would like to train our algorithms to recognize them, right? So the world is a long tail distribution. Some phenomena we see very often and some phenomena we never see. And so if you talk to a doctor, a doctor in ophthalmology, any specialty you want, then you say, well, how many diseases of the retina do you see?
And the doctor will tell you, well, I know there are three or four I see all the time. But if you look at the book, there are 600. And we are always trying to figure out when we see that rare thing that we've never seen before, are we going to be able to detect it or not? And so this long tail distribution problem is one fundamental challenge of systems that we want to build.
If you want the system to live in the world and be relevant to human life, it has to deal with long tails. There is nothing you can do about it, OK? So that's a big, big challenge. So if you're curious about these birds, the most photographed bird is the bald eagle. And this bird here is a pigeon. And this is a prairie lark, just for you to know what the birds are.
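To make the threshold arithmetic above concrete, here is a minimal Python sketch under an assumed single-slope Zipf distribution (the real eBird curve has two slopes, as the talk notes): given per-species photo counts, it tallies how many species clear a given training-set size. All constants are invented for illustration.

```python
import numpy as np

# Illustration only: draw per-species photo counts from an assumed Zipf-like
# power law and count how many species clear a training-set threshold.
# The exponent, the one million total photos, and the 10,000 species are
# placeholders, not the real eBird numbers.
n_species = 10_000
ranks = np.arange(1, n_species + 1)
weights = ranks ** -1.1
counts = np.round(1_000_000 * weights / weights.sum()).astype(int)

for threshold in (1000, 100, 10):
    trainable = int((counts >= threshold).sum())
    print(f"species with >= {threshold:>4} photos: {trainable}")
```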
Now, it turns out that humans-- if you ask a birder how many photos of a new species you need to see to be able to recognize it afterwards, they say, well, you know, give me three, give me five, give me 10, they give you a number like that. And so my estimate at the moment is that there is maybe a gap of two or three orders of magnitude between the data efficiency of humans and the data efficiency of machines. OK, so you should keep this in mind.
Now, this is confusing because machines are so good. And so in my mind, 2017 is the year when machines became better than humans at expert-level pattern recognition, or categorization. So there was a paper by us on birds where clearly our birding app was much better than the humans. There was a paper from Stanford and Google on skin lesion classification. And they also showed that their machine was better at telling malignant versus benign than the dermatologists.
And there was one more which I'm now blanking on. But also there was a paper on human face identification-- it came out in PNAS last year-- that showed that the machines were better than trained human experts. They were incredibly better than untrained humans at discriminating two faces, same or different, and a little bit better than the sort of FBI-grade examiners who do that as a job.
So 2017 is when machines became better at visual categorization-- and yet, better in the sense that if you have infinite data, then they can beat a human, but a human will be able to beat them if you start throttling the data, OK? And so we should remember this as a fundamental limit of what we are doing now. There is something that we don't understand, that we are not doing right.
And this is the same plot for the species of trees in Los Angeles. Same thing, it follows a Zipf curve. OK, so I proposed the challenges. And let's see if, with a single plot, I can answer all three questions. So forget about the dotted lines. Those are for SVMs, it doesn't matter.
Look at the solid lines. The solid lines are the performances of three deep networks in classifying birds. And here you go. So on the vertical axis, you have the log error, log base 10. So down here, 1%, 10%, 100% error in classification. So since you have lots of categories, if you go at random, you're basically at 100% error.
Now, on the x-axis, you have the number of images that were used for training. And what you see here is that both good news and bad news, so follow me along this line. You see that if you have 10,000 training examples for these birds, then I can get down to 3% error rate. And the photographs are very, very challenging.
So you shouldn't think of, you know, beautiful, perfectly posed pictures. They're difficult pictures. So that's where it gets better than a human. However, as I decrease the number of training examples, I lose about a factor of three, or two and a half, per decade. So each time I cut the number of training examples by 10, I multiply my error rate by a factor of three, OK?
And so that's really bad news, right? A human might start off with a high error rate with zero examples, but their error will go down very quickly. And then here they might saturate a little bit earlier than the network. And so this is the challenge that we are facing.
OK, so it was good news, bad news is the slope, and now another piece of good news, the solid curves were obtained by repeating the same experiment and each time increasing the number of categories that we were trying to classify with the same architecture which I forget now what it was. It was one of the popular ones. And the blue curve is 10 classes, the red-orange is 100, and the green is 10,000. And what you see is that the decrement in performance is almost negligible. So that's a piece of good news.
And so there is no theory behind this. It's purely an empirical fact that if we go from 10 to 100 to 1,000, we don't see a deterioration of performance. So it looks like the networks can handle a lot. Now, we're still far from a million or 10 million that I was talking about before, but as you will see later, there will be some examples where we are about at 50,000. And still, that doesn't seem to be a problem, OK?
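As a back-of-the-envelope reading of the slope just described, one can extrapolate like this: error roughly triples for every tenfold cut in training data, anchored at about 3% error with 10,000 images per class. The anchor is read off the plot and the rule of thumb is empirical, not a law; the numbers below are illustration only.

```python
import math

# Rule-of-thumb extrapolation of the slope described above: the error rate
# multiplies by roughly 3 every time the per-class training set shrinks by 10x.
# The 3% anchor at 10,000 images is read off the plot; the rest is illustration.
def extrapolated_error(n_train, anchor_n=10_000, anchor_err=0.03, factor_per_decade=3.0):
    decades_fewer = math.log10(anchor_n / n_train)
    return min(1.0, anchor_err * factor_per_decade ** decades_fewer)

for n in (10_000, 1_000, 100, 10):
    print(f"{n:>6} images/class -> ~{100 * extrapolated_error(n):.0f}% error")
```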
So this is a manifestation of the same phenomenon in a more recent experiment. This is from my student Sara Beery. And Sara is looking at wild cams, which are cameras that are put out in the wild to monitor wildlife. And so there are some in the Serengeti, some in the southwest of the US. There are lots of them.
And so these are data from the southwest of the US. And so what you see here is that for different species of animals-- and here, "species" is a bit-- well, there are cars and there are birds [INAUDIBLE] and other species. But this is a classification that is interesting to the US Forest Service, which monitors the animals. And what you see is that you have the same phenomenon: as you increase the number of training examples-- this is per species-- the error rate goes down, and the slope is more or less what we were talking about before.
So again, you have a strong effect on the number of training examples. It doesn't look like it's saturating. So if you had a million training examples, it would do even better. But we don't do well down here, right?
And so we know that humans-- oops, so I thought I had an animation-- humans are down here. They get down to say 5% error, something like that, but very quickly. You don't need to see many examples. And so a question is why is this? What is it that we are doing wrong?
And so the answer is something like this. As humans, we learn how to classify objects the hard way at the beginning. So if you have had children, you see that from age 2 to 5, they become really, really interested in recognizing things. And they point to things and they keep doing that. And they go through these phases.
At the beginning, they overgeneralize. So the first four-legged animal that you name for them like cat and whatever, then a cow is a cat, a dog is a cat, everything. So they have these broad categories. And then as time goes by, they become very fine grained classification machines.
And so they can distinguish different-- you know, my children taught me the difference between different buses in Pasadena. They are almost identical. There are tiny details. And to them, they were absolutely fundamental. And so you see that the brain is doing that.
And what the brain, I think, is doing is it's learning from the first few categories that you see, the ones where you have lots of examples. Like, in this case, it's opossums and skunks and so on. You learn some rules and then you become much more efficient for the new category. So there is some transfer learning that goes on.
And clearly, no matter how many papers you see on transfer learning, clearly our deep networks are not there yet. That just doesn't work the way it should be working. And so this is an example of pictures you get for chickadees. And so for chickadees, you get pictures of chickadees in all possible orientations.
And this is why the more pictures you get, the better a system becomes. It's because it never gets chickadee, it just gets chickadee upside down, chickadee with a wing like this, chickadee with a wing like that, and looking up, looking down. And so it has to learn these patterns the hard way. And it's not able to generalize.
This is the sad, sad truth. And so if you now have eagles, well, it has to learn eagles from scratch again. Very little transfers. While for us, even if no eagle will hang this way, upside down, if we saw an eagle hanging upside down on a tree, we would go, oh, what is that eagle doing hanging upside down, right? We have never seen it before, nobody has ever seen it. But, you know, the first time we see it, we can recognize the pattern. So that's clearly a weakness.
And I think that unsupervised learning and self-supervised learning will play a role. And so this is a picture I took from Rob Fergus' thesis. And here he was training in an unsupervised way models that had parts. We were calling them constellation models.
And so here a model is learning by looking at lots of motorcycles. It's learning regularities, like that there are wheels and the front wheel is different from the back wheel and so on. And this is learned in an unsupervised way. And I think that we will have to-- so it's my guess that in the next five years we will go back to looking at these kinds of models that know more about structure and can learn in an unsupervised way what is consistent across many examples.
And so I'm looking forward to thinking about those questions. OK, so to summarize the first few points: fine-grained categorization so far has been a piece of cake. It doesn't seem to be challenging our deep networks too much. It works well enough for face identification that you can sort pictures on your iPhone by your relatives or your friends-- it recognizes who they are. So fine-grained categorization works well for birds, for flowers, for lizards, and so on.
The number of categories-- again, we have not yet hit any significant ceiling. And there is no proof that we will not at some point. But at the moment it looks OK. Instead, the long tail distribution is a big difficulty. The categories for which we have few training examples are not being learned well. We don't know how to handle that.
And we don't know how to take advantage of related things we know, or that the network knows, and promote smarter learning for the new things. So this is truly an open question. And so if you're looking for PhD topics, certainly this would be an interesting question. If you're working on human psychophysics-- I don't know the literature very well, but that could be a very interesting thing to look at with computation in mind, with the thought of, OK, I want to learn something that later can help me design better networks. I think it's a good problem.
OK, so the next challenge is experts. And so here is the question. If you talk to a computer vision person or a machine learning person, their idea of what machine learning is all about is selecting some algorithm, and then some good Samaritan will come with a data set and they will just pour the data set on top of the algorithm. And then we're going to learn, and then we're done.
And the data set has been annotated well. There is some sort of an oracle who, out of good will, has annotated the data for us. And we don't have to bother ourselves. Now, it turns out that if you go into science, and many of you are, you realize that that's not true. Like, there are huge disagreements in science about how to classify neurons, how to classify mouse behaviors. There may be disagreements on how to classify horses and other things.
And in the medical community, you talk to dermatologists, you talk to pathologists and they also disagree a lot. And they disagree both explicitly, in the sense that you hear them talking about morphology all the time, and implicitly in the sense that if you ask specialists or experts to label a data set, they will never agree. So, you know, in some fields, 70% agreement is common. And they're not even aware of it. So it's a big surprise.
And they say, oh, yeah, well, we agree, you know, we really agree, and this is this and this is that and so on. So they confabulate some explanation of why. So the question is the following. We cannot trust that we, as intelligent agents building intelligent machines, will be able to go and locate an appropriate expert to train our model and then deliver it to a final user. We have to build machines that are able to locate the information wherever it's available and discover it from people, from experts.
And so the machine has to be social. And you cannot know in advance, once you build your model-- you can build a model that learns, but you cannot know in advance what kind of information the model will be faced with in the real world. And so the model, in many cases, has to keep learning. And it has to keep facing things that it did wrong and learning from humans and all of that.
So in some sense, you could think of the machine as now being the intelligent entity and humans as the resource. And the machine has to do a good job in learning, but it can ask questions, intelligent questions. So the machine is like a grad student, and it has to figure out which authorities in the field to trust, which papers to read, and so on. It has to figure it out by itself.
And so that's the question, you know, how do we do it? Now, experts put out information, they write books. So this is from the Forest Service in Louisiana. And it's to help people identify the ivory-billed woodpecker, which is a species of bird that people believe went extinct in 1937 or something like that. And yet, there are sightings. And it would be very exciting to find it.
But the suspicion is that there are many species that might look like an ivory-billed woodpecker. And so this is what the forestry service puts out to help you, OK? So some expert has some knowledge and they are trying to predigest it for you to be able to input it into your brain and learn it. And this is clearly not the way the machine has to be trained, although it may be one of the ways.
And so we faced this problem first when we decided that we wanted to work on birds and with the birding community. And so to try things out, we started downloading pictures of birds from the web, using Google as a search engine. And we had a surprise. So we took a list of species of birds from Wikipedia-- I forget exactly what we did, the typical computer science way of proceeding.
And one species was the Indigo Bunting. And soon we realized that the Indigo Bunting collection was polluted by another species, the Blue Grosbeak, which is easy to confuse with the Indigo Bunting. But it's not closely related. So there are these two species Indigo Bunting and Blue Grosbeak.
People upload them and a number of the labels are wrong. And yet, our job is precisely to build a machine that will help people distinguish carefully between these species. And so unless we have a good training set, what do we do? OK? So at the time, Amazon Mechanical Turk had become available.
And at Amazon, I keep telling them that they revolutionized computer science and machine learning and all of that. Because without Amazon Mechanical Turk, we would not have been able to collect ImageNet and we wouldn't have what we have today. And to them, it was always this little thingy that they were trying to decide whether to kill or not because it was not making much money, it was not growing bigger, and so on. And so they were not sure. And now they realized that, oh, we did a good thing. And just three days ago, the person who invented it passed away. And so I saw that sad piece of news.
OK. So we thought we would ask human annotators, and so these are our experts. And why do we call them experts? They're not. But we train them. And so we show them a number of training images on how to discriminate between an Indigo Bunting and a Blue Grosbeak. And then they annotate the images, OK?
Now, in order to be on the safe side and know what was going on, we also asked a true ornithologist to sort these two species from each other, to have a ground truth. And so this is what you find if you have about 100 people label this collection of Indigo Buntings that contains a lot of non-Indigo Buntings. And so each dot is a person, an annotator. And for each one of these annotators, we could measure the hit rate, or the rate of correct detection-- how often did they say, "Yes, it's an Indigo Bunting," when it was.
And the rate of correct rejection is on the x-axis. And this is how many times they say no when they should say no. OK, so you see that they're all over the place, all over the place. So initially, we thought, OK, we ask five people, we go by majority voting, that's going to be good enough.
And as soon as we saw this distribution, we thought, OK, this is a very bad idea. We cannot go with majority voting because these people are behaving very differently. You cannot just average them out. And so you see that there are some people who do very well up here. They have about 10% error. But there are some people who are completely at random, around 50%. And some people are anti--
[LAUGHTER]
--anti-correlated. OK? And so we gave names to them. And so these are the competent ones. They know what they're doing. And then here. We have people who are just going at random, the bots. Here, we have optimists. They always say yes. Yes, yes, yes. Yes, yes, yes. Well, you know-- And here we have pessimists who always say no. Now, here we have--
[LAUGHTER]
So I told my students, "I think that these are MIT students who are trying to confuse your experiment." And that put lots of energy and juices into them. OK, so it turns out you can have adversaries. So who are these people, in truth? These are people who got the instructions wrong. And so instead of clicking on the Buntings, they are clicking on the non-Buntings. And they are anti-correlated.
So if you found people who are down here, that would be perfect. You just flip one bit, and you're fine, right? They give you a lot of info. So the ones who don't give you any info are these ones on the diagonal. So once you see this, then you start thinking, you know, what goes on in the minds of these people.
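The scatter just described boils down to two numbers per annotator, a hit rate and a correct-rejection rate, computed against the ornithologist's ground truth. Here is a small sketch of that bookkeeping; the vote and truth dictionaries are placeholders, not real data.

```python
# Sketch of the per-annotator scatter described above: each annotator is
# summarized by a hit rate (saying yes when the bird really is an Indigo
# Bunting) and a correct-rejection rate (saying no when it is not).
def annotator_rates(votes, truth):
    """votes: {(image, annotator): 0 or 1}; truth: {image: 0 or 1}."""
    stats = {}
    for (img, ann), v in votes.items():
        hits, pos, rejections, neg = stats.setdefault(ann, [0, 0, 0, 0])
        if truth[img] == 1:
            stats[ann] = [hits + (v == 1), pos + 1, rejections, neg]
        else:
            stats[ann] = [hits, pos, rejections + (v == 0), neg + 1]
    # hit rate on the y-axis, correct-rejection rate on the x-axis
    return {ann: (hits / max(pos, 1), rejections / max(neg, 1))
            for ann, (hits, pos, rejections, neg) in stats.items()}

# A competent annotator lands near (1, 1), a "bot" near (0.5, 0.5),
# an optimist near (1, 0), a pessimist near (0, 1), an adversary near (0, 0).
```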
And so that's when we started thinking, OK, maybe just doing a quick majority voting on experts is not a good idea. We have to think more carefully about what goes on in the head of a person who is annotating data. And so we come to the psychology part of the talk. And so we came up with the following model. And let's see if you like it.
So here it is. So you start off with-- By the way, this is a plate notation for a Bayesian network. Are people familiar with this? So I'll take you through it, and it will make sense. So here you have a variable Zi, it's a binary variable. And it says whether in image I there is an indigo bunting or not, or it's a Blue Grosbeak. OK? So that's a binary variable. And this is the one that we would like to know about but we have no direct access to.
OK, now, this variable influences the image in the sense that you can think of the image as, think of a generative model for image making. And so somebody, God, or whatever is putting either an Indigo Bunting or a Blue Grosbeak. They are deciding on the spot which one to put in the image. And then you have lots of what you can call nuisance variables which determine all the possible aspects of the image which are of no interest to you in this very moment.
And so, you know, what was the viewpoint? Which exact specimen was it? You know, which camera did you use, which pose, what was the lighting, and so on. All of those things are important for computer vision, but we are not interested in them. They are just a nuisance, OK?
Now, this binary value you want to know, and all of these high-dimensional circumstances, are determining the pixels of your image. So here you go from a bit to a megabyte. The image is a megabyte. And what you want to know is a bit. So you went that way. Now, what happens? Now, the person is going to look at it.
And so here is our model. And this is a crucial aspect of the model. Think of the best possible birder that you could imagine. What is the birder going to do? Well, first of all, the bird is here, they are going to reorient themselves. They're going to look at the bird.
And their brain, you know, whether it's explicit in their mind or whether it's implicit in what they do, but it's going to extract some characteristics of the bird, say, the color of the plumage, the length of the beak, the length of the legs, and all sorts of other characteristics. So they're extracting some measurements that are relevant for classifying the species. And so we call that Xi. It's a vector. Think of it as a 10-dimensional vector, a 15-dimensional vector, whatever you want. We'll see it later.
And so the visual system of the birder is extracting some features that allow them to classify which bird is there in image I. Now, a real birder is not the ideal birder. It's just some person we hired or somebody who goes out to have a good time. And so we think of the real birder as a corrupt version of the ideal one.
So the ideal one gives you a vector Xi, which is only a function of the image, because that ideal birder did the best possible job. There is nothing specific to that particular birder, it's just the best possible job. So it only depends on the image i. The real birder has some noise, which we model with a sigma. It's a Gaussian.
And so you have a real measurement in the head of birder J, of expert J, relative to image I. OK, so that's what a real person has. Now, based on that, the person will have to classify the bird. So let's pretend for a moment that x and y are one dimensional. And so we can think of it this way.
So, conditioned on the fact that you have an Indigo Bunting, the Xi's tend to be bigger than zero. And this one would be a good, clear Indigo Bunting, so it gets a strong signal towards Indigo Bunting-ness. And here you have a very ambiguous bird that nobody knows which one it is. So it gets a weak signal here.
And then here we have, like, a canvas bag that has an Indigo Bunting print on it. It's certainly not the real thing. It has a very low signal. And so these different people, we can think of them as having different amounts of noise in their head. And so the competent one has very little noise.
And given these measurements, they will give you a Y that is very close to X. And here is your incompetent bloke who is instead full of noise in their head. And whatever you show them, it will have noise. So once you have that, then you can classify.
And so here is how it happens. So here is your variable Zi. So Zi influences this Xi. So you have a different Gaussian, in our case, that describes-- So you have two, one for the Blue Grosbeak and one for the other bird. And here you have the measurements in the head of the annotator, of the expert. Here, you have the noise.
And now in order to come up with a label, which is a yes or a no, which is what you can have access to, the annotator Mr. J who watches image I has two more parameters, a classifier surface and a bias. So the bias is the bias between optimist and pessimist, how likely are you to say yes or no. And the W is a classifying surface in the space of x's or y's where you can classify, OK? Is this clear?
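Here is a one-dimensional toy sketch of the generative model just described, with the classifier surface and bias collapsed into a single per-annotator threshold. The noise levels and thresholds below are invented, just to mimic a competent annotator, a random one, and an optimist.

```python
import numpy as np

# A minimal one-dimensional sketch of the generative model just described.
# z_i: true class of image i; x_i: the ideal birder's signal for that image;
# y_ij: the signal in annotator j's head (x_i plus annotator noise);
# the label is y_ij thresholded by annotator j's bias. Numbers are invented.
rng = np.random.default_rng(1)

def simulate(n_images=200, annotators=({"sigma": 0.2, "bias": 0.0},      # competent
                                        {"sigma": 3.0, "bias": 0.0},      # noisy "bot"
                                        {"sigma": 0.2, "bias": -2.0})):   # optimist: low threshold
    z = rng.integers(0, 2, n_images)                  # 1 = Indigo Bunting, 0 = Blue Grosbeak
    x = rng.normal(loc=np.where(z == 1, 1.0, -1.0))   # class-conditional image signal
    labels = {}
    for j, a in enumerate(annotators):
        y = x + rng.normal(scale=a["sigma"], size=n_images)   # annotator-specific noise
        labels[j] = (y > a["bias"]).astype(int)               # threshold = classifier + bias
    return z, labels
```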
OK. So now I've built this giant castle of cards in front of you. Is it useful or not? And how do you use it? And so the idea is you collect a big number of labels from experts. And the experts are hundreds of people whom you hire to label hundreds of birds. And so each bird has been labeled by five, seven, whatever annotators. Each annotator has seen 30, 40, 50 birds.
And now what you can do is you have measurements. You would like to have this variable. And you have lots of other latent variables here. And so what you do is you use a maximum likelihood process, say expectation maximization, to estimate all of these different variables. And you say, well, you know, if these people were always in agreement, then they must know something because they couldn't have communicated.
So they must have good parameters, low noise. They must have similar classifier surfaces and so on. And if people disagree a lot, then they must have either very different classifier surfaces or lots of noise in the head. And so you work out what is the maximum likelihood solution for this problem. And that one yields also Z. OK?
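For the inference step, here is the simplest relative of that idea, a two-coin (Dawid-Skene style) EM over binary votes. It drops the image signal x and the classifier surfaces, keeping only the alternation between guessing the true labels and re-estimating each annotator's reliability, so it is a sketch of the logic rather than the model from the talk.

```python
import numpy as np

def em_labels(votes, n_iter=50):
    """votes: {(image, annotator): 0 or 1}. Returns P(label = 1) per image,
    plus each annotator's estimated sensitivity and specificity."""
    images = sorted({i for i, _ in votes})
    annotators = sorted({j for _, j in votes})
    # initialize the posterior with the raw vote average per image
    p = {i: float(np.mean([v for (ii, _), v in votes.items() if ii == i])) for i in images}
    sens = {j: 0.8 for j in annotators}   # P(vote = 1 | true label = 1)
    spec = {j: 0.8 for j in annotators}   # P(vote = 0 | true label = 0)
    prior = 0.5
    for _ in range(n_iter):
        # E-step: posterior probability that each image is truly a positive
        for i in images:
            lp, ln = np.log(prior), np.log(1.0 - prior)
            for (ii, j), v in votes.items():
                if ii != i:
                    continue
                lp += np.log(sens[j] if v else 1.0 - sens[j])
                ln += np.log(1.0 - spec[j] if v else spec[j])
            p[i] = float(1.0 / (1.0 + np.exp(ln - lp)))
        # M-step: re-estimate each annotator's sensitivity and specificity
        for j in annotators:
            mine = [(p[i], v) for (i, jj), v in votes.items() if jj == j]
            pos = sum(q for q, _ in mine) + 1e-6
            neg = sum(1.0 - q for q, _ in mine) + 1e-6
            sens[j] = float(np.clip((sum(q for q, v in mine if v) + 1e-6) / pos, 1e-3, 1 - 1e-3))
            spec[j] = float(np.clip((sum(1.0 - q for q, v in mine if not v) + 1e-6) / neg, 1e-3, 1 - 1e-3))
        prior = float(np.clip(np.mean(list(p.values())), 1e-3, 1 - 1e-3))
    return p, sens, spec
```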
So you have two people. One is just doing majority voting, and the other person is optimizing this big model. First of all, the first person spent three minutes while eating breakfast writing their code, the other person spent a month. So already the person with this model is losing. So the model had better give you something, right, apart from writing a paper at NIPS.
OK, so here was an experiment in which we just do that, model inference. And what we saw was that, indeed, we were having some [INAUDIBLE]. So majority voting, so this is the error rate as a function of how many people have seen each picture. And you see that majority voting tends to saturate. And you end up with about 25% error rate. But if you estimate the parameters correctly, then your error rate goes down much more. So it looks like it is winning something.
And I should emphasize, this is worth it only when the task is very difficult and when the performance of different annotators is extremely diverse. Now, if everybody was behaving the same, then you could do majority voting. You wouldn't lose anything. Or if everyone is a great expert-- like you give people images and say, is there a human face here, yes or no-- you just ask one person, or two, just to smoke out the people who are dozing off, but that's it.
But when you have a difficult problem, it's really useful to model what's going on and to understand what goes on in the head of the people. And so these are a sanity check experiments we ran. So here we had an ambiguous stimulus. It was a bubble with an ellipse. And we were asking people, is the ellipse more vertical or more horizontal?
Now, here, it's 45 degrees, it's right in the middle. But the stimuli were coming out. Some of them oriented this way, this way, this way, this way. And so people had to judge. And of course, there was an area of uncertainty in the middle.
Now, of course, so why did we do this? Well, because here, by definition, we know the ground truth because we've created it. And so you see that here, for example, this is the estimate of an Xi. And so the algorithm estimates Xi equal to minus 1 for ellipses that are oriented at minus 20 degrees or more, plus 1 for plus 20 or more, and in between, in between. So the algorithm is able to tell which ones are difficult tasks, maybe tasks where you have to ask more experts to know, and which ones are easy tasks where you have big separation between the two categories.
Now, an interesting thing is that the X's saturate here. Because there is nobody making mistakes any longer. And so minus 40, minus 45, minus 60, it's all the same for the algorithm. Because the answer is always the orientation is horizontal, right?
And these are estimates of annotator bias. So these are the pessimists and these are the optimists. And again, here we have a ground truth and here we have the estimate of the algorithm. And this is the competence of the annotators. And again, we know how much noise they have in their head.
This was more interesting. And so this was for testing what happens if annotators are using different strategies. And so we forced them to. And so here you have these little gnomes or drawings or whatever. And in our minds, there were some that were short and green and some that were-- or maybe short and yellow and tall and green, with some distribution. And then we put down splotches and other things to distract people.
And so we had two sets of annotators. The ones that we told to select the green little thingies, and a second set of annotators who selected the tall ones. Now, the two things were correlated, but they didn't know it. And so one set was paying attention to height, and the other to color.
And so then we mixed everything together and we asked the algorithm to estimate the distribution. And so the algorithm came up with-- so we asked it to create two latent variables to describe the data. Because we knew that there was color and height, but we didn't know if the annotators would discover it.
And what you see here is that this variable is correlated with yellow versus green, and this variable with short versus tall. And so although the algorithm only sees binary labels from annotators who are doing different tasks, it's able to put everything together and say, oh, there are two dimensions that seem to be important. And people are using these two dimensions in different ways. And we were also able to estimate the classifiers in the heads of each annotator.
And so you see that there is a group of annotators that is classifying short versus tall and a group of annotators here that has classifiers that go left to right. And this was yellow and this was green, right, or vice versa. So we are able to discover that there are two schools of thought on how to think about the classification task. And there are different behaviors from different experts.
OK, let me go forward. OK, this is maybe the most interesting of these three experiments. So here we have a bunch of water birds. And so you see that we have ducks, geese, grebes-- which are similar to ducks, but they're not ducks-- and then pictures with non-birds. And we have two species of ducks. And so we asked the annotators to click on the ducks.
And so this is what we got. So again, there are two dimensions. And what you see is that the algorithm places each bird in a different location on this plane. And it's grouping the non-bird images here. It's grouping the geese images here, the grebe images here, and it's mixing together the duck images up here. Do you see that?
So again, purely based on yes or no answers, the algorithm is able to discover that there are many categories. There are these two relevant dimensions. Well, now, we don't know what they mean in this case. We don't know what the dimensions refer to in the image, but the algorithm knows. And it's able to put out these images and separate them out.
Now, you could ask, how can it know that a picture is a goose and one is a duck and it puts them in different spots if the only thing that the annotators are saying is duck or non-duck? You know, how can you separate the grebes and the non-bird images and so on? And it turns out that the answer is simple. The annotators were very diverse in how they interpreted a task, and therefore they gave a signal that was different for different birds and between different annotators.
And so you see that there are basically three groups of annotators. So there is a group here that is very spread out, that separates birds from non-birds. So they understood "duck" as any bird. And so that's what they did. There is a group here that separates the geese from the grebes and the ducks, you see them here, right? And another group here and here that is able to separate the ducks from the grebes and the geese.
So again, here, there are three sets of experts that have really different ideas of what reality is, and they behave differently. But the algorithm is able to give you a signal that lets you separate them out here. We didn't, then, cluster them in any way. So in some sense, it's never explicit in the algorithm.
But you see that there is signal that the algorithm can use to discover that people behave differently. There are different political parties, if you will, and people are different. And this is purely based on yes or no answer. So you can learn a lot about what goes on in the heads of people.
OK, now, more recently-- so what I showed you is purely asking people questions, and they answer. Now, since we have computer vision, we can combine computer vision algorithms with human annotators. And so what we can do is build an iterative loop in which you have your data set, and then a computer vision algorithm is being trained-- initially it doesn't do well at all-- but you send out the pictures to experts or to Amazon Mechanical Turkers, you get human annotations that then help you train the algorithm.
Now, you can do active learning. The algorithm will start figuring out which ones are the difficult cases which need to be sent to people, and then which ones are the easy cases that a computer vision algorithm can take care of and so on. And so you have this loop that slowly learns and becomes more and more efficient. And on top of that, you can combine the computer vision algorithm with the annotators to achieve more signal with fewer people involved.
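A schematic of that loop might look like the following; train_model, predict_proba, and ask_humans are hypothetical placeholders for whatever classifier and annotation pipeline is actually used, and the confidence threshold is arbitrary.

```python
# Schematic of the human-in-the-loop / active-learning cycle described above.
# train_model, predict_proba, and ask_humans are hypothetical placeholders;
# predict_proba is assumed to return the probability of the positive class.
def labeling_loop(unlabeled, labeled, train_model, predict_proba, ask_humans,
                  confidence=0.95, rounds=5, batch=100):
    for _ in range(rounds):
        model = train_model(labeled)                       # retrain on everything labeled so far
        scores = {img: predict_proba(model, img) for img in unlabeled}
        # confident predictions are accepted automatically by the machine...
        auto = {img: int(p > 0.5) for img, p in scores.items() if max(p, 1 - p) >= confidence}
        # ...the most uncertain images are routed to human annotators
        uncertain = sorted((img for img, p in scores.items() if max(p, 1 - p) < confidence),
                           key=lambda img: abs(scores[img] - 0.5))[:batch]
        labeled.update(auto)
        labeled.update(ask_humans(uncertain))
        unlabeled = [img for img in unlabeled if img not in labeled]
    return labeled
```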
And we have lots of experiments where we see that we can indeed estimate well, you know, on some data sets where we have ground truth, the skills of annotators and this and that. And it's quite remarkable. So using these mechanisms, we have been able to build two apps. And one is called iNaturalist and one is called Merlin Bird ID.
So iNaturalist is in collaboration with the California Academy of Sciences. And it's supposed to help you recognize plants and animals of any species. And Merlin Bird ID is in collaboration with the Cornell Lab of Ornithology. And it helps you recognize birds, any bird, but only birds.
So here we have about, now, thanks to these collaborations, about 5 million training examples, all labeled. And here we have about 10 million. For the birds, I've told you it's 10,000 species. And right now, we are working on 3,000. That's what is in the app.
And it's growing every year. We're adding new regions in the world. And for iNaturalist, we have about 50,000 species that the algorithm can recognize now, and the goal is 500,000, at some point. So let me show you how. By the way, you can download them onto your phone. You can use them. They're very easy to use and so on. And so they have different characteristics. I want to take you through some.
So this is Merlin Bird ID. You set it this way. You see a bird, and you can zoom in so that, if there are multiple birds, you can select one. And then you can say, OK, identify it. And it's all on your phone. It doesn't go through the Wi-Fi, because the birders are out in nature. So it was very important to condense it down onto a phone.
And I should say, if the Google people are here, it was useful-- useful? It was vital to have TensorFlow, because Grant Van Horn, who was the student who did this, was able through TensorFlow to just experiment. And once things were running, he could just move it onto the phones. And it was done, it was very good for him. So this is an infomercial for Google. Thank you, guys, for doing that.
OK, and so the system gives you the top choice, and then a few more choices underneath. And you just look at them and decide which one you like and you identify the bird. And it works really well in real life. So this is Debra Morning, a friend to whom I sent a couple of years ago, as you can see, the app.
And she sent me back this picture, saying, it's amazing, it identified the bird. And I said, wow, you know, she must be smoking something because I don't see any birds here. But then I blew it up and I saw that indeed there was a bird here. And this is an oriole, a Hooded Oriole. And she told me that she verified it. You know, she had binoculars.
She was convinced that the system had done a good job and so on. And so I thought, oh my god, OK, so we've done something good. And indeed, I now use it a lot. And it does work on these crummy images. And I don't know why. It shouldn't work, but it does. And people are very happy. So you see lots of good comments on the website.
OK, the second collaboration is iNaturalist. This is much more Napoleonic. And it says, well, we want to recognize all sorts of plants and animals that people will see around the world. And here, it's geared not towards a specific category of people who are intensely interested in something, like birders, this is for anyone of us who goes out on a hike and you get curious about the flower.
And so this is a map of-- It's a heat map of where these observations are made. And so this is a large community. I think we have numbers, 100,000 people who are-- OK, here you get it. So these are number of observations per month. And it's going up exponentially. And so I think we have reached about half a million observations per month.
And this is as a function of the number of species. You see, again, a long tail distribution. And we are at about, I think, 20,000 to 50,000 species where we have enough images. Right now, I don't remember exactly what it is.
OK, and so this is me walking on the beach. I took a bunch of photographs. And these I identified using the app. So I take this picture. And it tells me it's giant kelp. I take this picture, it's a tiny animal on the beach. You can see its size from the grains of sand. It tells me it's a Pacific sand crab.
And it keeps track of where you photograph the thing, so it's not very good for privacy if you want to be secretive about where you go. It's not good. But you can obfuscate. There is a feature where you can blur out where you were.
Anyway, so it's a social network where people participate. And you post a picture, you upload it, everybody sees what you've seen, and they can comment on your species. So here is a fly that was on my plate one day I was having lunch.
And so I wanted to know what fly it was. And so here is my house in Altadena. But these are all locations where other people observed the same species. Oh, no, sorry, these are all observations around me on that day, I think. OK, and so here is what happens.
So here is a photo. And I had no idea what this was. And I used the app to identify it as a banded sphinx moth. And then within a few days, a bunch of people came online and changed the identification to white-lined sphinx moth. Now, to me, it's not a big deal, but to them it was a big deal.
And it says that there are three people all stacked against me. And they have some reputation. I'll talk about that in a moment. And so eventually, the algorithm switches from my identification to white-lined sphinx moth, because there are enough people who have agreed.
And so here you have, you know, the dream we were talking about before. The machine is sitting there and it's talking to lots of people. And it's live. And it's learning by doing, by interacting with the crowd.
And so what's happening here is each person who interacts with iNaturalist-- and so it could be you in an hour from now-- has an identity. And the system is keeping track of all the observations you've made. And slowly, it estimates the likelihood that you will make correct species determinations from your pattern of activity.
And so how does it know the truth? Well, because whenever there are enough people who are converging, then it gives a final go to that species. And then from that it works backwards to figure out who is typically right, who is typically wrong, in the way that we saw earlier with the Bayesian model that I described. And so here there are more parameters because the system is trying to understand. You may be a great butterfly person, but you may not know anything about frogs, right? Maybe you hate frogs because they eat butterflies. And so you might [INAUDIBLE].
So for each different genus, there are different estimates. And you keep going up and down. And so the system is live. And so it keeps updating the likelihood with which an image has been classified. Is it, you know, is it more likely to be a white-lined sphinx moth or not? The likelihood that a given person is good at moths versus good at frogs-- and all of this goes on at the same time.
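To make this concrete, here is a minimal sketch, in Python, of the kind of joint estimation being described: alternate between forming a consensus label from accuracy-weighted votes and re-estimating each user's accuracy against that consensus, in the spirit of Dawid and Skene. This is not iNaturalist's actual code; the toy observations, the 0.7 prior, and the simple two-step alternation are assumptions for illustration, and the real system also conditions skill on taxon, which this sketch omits.

```python
# Minimal sketch (not iNaturalist's actual algorithm) of jointly estimating
# per-user accuracy and per-observation consensus labels, Dawid-Skene style.
from collections import defaultdict

# Toy data: votes[observation] = list of (user, species) identifications.
votes = {
    "obs1": [("ann", "white-lined sphinx"), ("bob", "white-lined sphinx"),
             ("me", "banded sphinx")],
    "obs2": [("ann", "hooded oriole"), ("me", "hooded oriole")],
    "obs3": [("bob", "pacific sand crab"), ("me", "pacific sand crab")],
}

accuracy = defaultdict(lambda: 0.7)   # prior: assume each user is right 70% of the time

for _ in range(10):                   # alternate between consensus labels and accuracies
    # 1. Weighted vote: score each candidate species by the accuracy of its supporters.
    consensus = {}
    for obs, idents in votes.items():
        scores = defaultdict(float)
        for user, species in idents:
            scores[species] += accuracy[user]
        consensus[obs] = max(scores, key=scores.get)
    # 2. Re-estimate each user's accuracy against the current consensus (smoothed).
    hits, totals = defaultdict(int), defaultdict(int)
    for obs, idents in votes.items():
        for user, species in idents:
            totals[user] += 1
            hits[user] += int(species == consensus[obs])
    for user in totals:
        accuracy[user] = (hits[user] + 1) / (totals[user] + 2)

print(consensus)        # obs1 settles on 'white-lined sphinx' once enough reliable users agree
print(dict(accuracy))   # the dissenting user's estimated accuracy drops accordingly
```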
And every couple of months or so, Scott Loarie who is up at the California Academy presses a big green button that we gave him, figuratively speaking, and retrains the models with all the new observations. So the machine keeps training itself out of interacting with a bunch of people who may be experts, maybe not. And so this is exactly the setup that we're talking about, OK, in a little capsule. But it's not so little after all because it's aiming for half a million examples.
And so this is a little second infomercial. So earlier I did one for Google, now for Amazon. So out of our experience, last year I became chief scientist of a project in industry, a very new experience for me. And we built a system for annotating data sets conveniently. And it's called Ground Truth, Amazon SageMaker Ground Truth.
So it went live in November last year. And if you will, it's a layer built on top of Amazon Mechanical Turk. With Amazon Mechanical Turk, you had to build your own GUI. You had to build all the software that interpreted what the annotators were giving you back, to extract meaning out of that. Here, you have pre-built GUIs that you can quickly adapt to get whatever information you want from the annotators.
You can quickly choose how many annotators you want, and then the system will do consolidation of all these different labels that may be in disagreement. And it's also doing active learning. So if you have 100,000 images to annotate, it will start training a computer vision system behind the scenes, making it better and better. And so you can save a lot of time, because if your task is easy, all the semi-duplicate images and so on are going to be digested by the automated system. They're not going to be annotated by people. OK?
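The active-learning part can be sketched in a few lines: a model is retrained on whatever labels exist so far, items it predicts with high confidence are auto-labeled, and only the uncertain ones go to human annotators. This is not the Ground Truth internals; the synthetic features, the 0.95 threshold, and the stand-in for the human annotators are all assumptions for illustration.

```python
# Sketch of the active-learning idea: auto-label confident items, route
# uncertain ones to humans. Not the actual Ground Truth implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(20_000, 16))                     # stand-in for image features
y_true = (X @ rng.normal(size=16) > 0).astype(int)    # hidden "ground truth"

labels = {i: y_true[i] for i in range(500)}           # small seed set labeled by humans
unlabeled = set(range(len(X))) - set(labels)

while unlabeled:
    model = LogisticRegression(max_iter=1000).fit(
        X[list(labels)], [labels[i] for i in labels])
    idx = np.array(sorted(unlabeled))
    conf = model.predict_proba(X[idx]).max(axis=1)
    sure = idx[conf > 0.95]                  # easy / near-duplicate items: auto-label
    unsure = idx[conf <= 0.95][:1000]        # hard items: send a batch to annotators
    for i in sure:
        labels[i] = int(model.predict(X[[i]])[0])
    for i in unsure:
        labels[i] = y_true[i]                # stand-in for a human annotation
    unlabeled -= set(sure.tolist()) | set(unsure.tolist())

print(len(labels), "items labeled,", sum(labels[i] == y_true[i] for i in labels), "correct")
```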
So it's easy enough that even a professor like me can use it. I don't need to-- So this is a very empowering thing. Because I don't have to go to my students, please, please, would you annotate this data? And they say, oh, well, if I feel like it. And then six months later, something happens. Now, it's all immediate and all wonderful. OK, that was the infomercial. And so this is part of the interface. OK.
Now, we come to the last part of my talk. OK, they told me that it was OK if I went a little bit over time. And I could even go to two hours, but I should probably try to stay at one hour and a half, because people are thinking of beer at some point. So we are seven or eight minutes from the end, again.
OK, I want to talk about open problems now, things that truly are not there yet. So another challenge for this machine is discovery. And so, you know, what determined the current AI revolution, as it's sometimes called? We have to remember how it happened. Neuroscience inspired deep networks, which were developed in the '80s; think Fukushima, Yann LeCun.
And basically, we had deep networks at the end of the '80s. And they were reading zip codes and they were working well. We simply had not realized that they could do more than just reading zip codes; they were sitting there. And then, in the meanwhile, Moore's law was chugging along. It was giving us better and better computers. At some point, we got GPUs, which were fundamental; without them, we wouldn't have been able to train these deep networks.
And the last piece of the puzzle that came in was these large, annotated data sets like ImageNet that finally allowed us to really get going. And so those were made possible by big search engines, by Amazon Mechanical Turk and all of that. So we have this three-legged stool. And each leg had to be there for things to ignite.
Now, my claim is, well, we call it the AI revolution. But the "I" part of AI is not there yet. And so I want to convince you of that. But I don't want to dump on the work that is being done. The work that's being done is marvelous. I want to say the road ahead is pretty long. And it's exciting. And it's good because you're young and you will be traveling on this road.
OK, so here is how we think about the levels of intelligence of a machine. So the first one is being able to memorize patterns and recall them. And we have solved this; computer science has solved it with relational databases and lots of machinery that we know about. And so that's very easy. You can search, you can organize, and so on.
Now, the next thing is going beyond patterns that we have seen exactly before, when we can generalize. And so you could think of training a deep network to recognize dogs; well, you'd show it a bunch of pictures of dogs. When we test it, we know that we have to take it out of sample, and the network will work. And so here, there is some generalization. And I will show you that there are limits to this. I think we're halfway through with this part.
The next one is going beyond pure phenomenology and pure correlation. And it is inferring that something goes on in the world behind the scenes that explains what we're observing, and so coming up with models and explanations. And this is what Josh was talking about this morning, right?
And my claim is that down here, it's still terra incognita. And so it's great that people have just started to think about it, because it's truly under-explored. And we need to make progress there. So let me take you through these steps to convince you.
First of all, so here is-- OK, Clarifai is an image analysis engine built by Matt Zeiler, who was a student of Rob Fergus, who was my student. So it's like a grandson of my lab. And it's doing well, I'm told. And you can input images, and it gives you tags that are associated with these images. And so here you see that it did a good job with the cows, right?
So when we see things like this, we feel good about our field. It's doing a lot. It's so good at discovering meaning from pictures and understanding what is there. But since I know exactly how it works, I can break it. And so here is an example of breaking it.
So here it's having a lot of trouble. Only in one case did it say cow, and it was not even the first thing it said. Do people understand what I did to break it? OK, so Boris, what did I do to-- Well, you know, because you are working on this. OK, somebody else, scream it. Do you want to? Yeah.
AUDIENCE: [INAUDIBLE]
PIETRO PERONA: Right, right. So I figure most of the images that whoever trained that system must have gotten were obtained by scraping the web. And so you find cows on grass and so on. So initially, I thought of cows in my living room; I couldn't find any picture. Cows on a boat; I couldn't find any picture. But I did find cows on beaches, OK.
Let's go with this. And indeed, it does a terrible job, right? And so here we see that generalization is truly abysmal. And this is basically the root cause of the long-tail problem: the network is not able to generalize, you know, from eagles to chickadees. It's not able to build these bridges of knowledge.
And so you might say, well, maybe if you put a bounding box around the cow and so on, it does better. But it's not quite solving the problem. So there is this big issue about generalization. So what we're learning is, basically, deep networks are pattern-matching machines. And they're not truly extrapolating or coming up with the higher-level abstractions that allow generalization in humans.
And so that's definitely something we don't have. And so, you know, even the second level of the scale of intelligence I showed you is a little bit in doubt at this moment. And you can play around with this; you can see that it's true also for ducks. If you have a duck indoors, then the network is not able to deal with it well. OK, so this is number one.
Number two, let's talk about thinking about cause and effect, instead of just correlations. And so here, just to make a point, I have this cartoon. So I ask Gabriel, what's the problem with this man? And Gabriel will tell me, well, he's sick. Clearly, he's sick.
And so I asked Gabriel, well, you know, why do you think that he's sick? He says, look, you know, he's in a bed. There are doctors around him. Clearly, he's not doing well. And I say, OK, so you say that he's sick because there are doctors and he's in bed. And he says, yeah, that's why I say that he's sick.
OK, I say, well, if that's the problem, let's get the doctors out of the door, tell him to stand up, and he'll feel fine, right, if the cause of the sickness is the doctors. And you say, no, no, it doesn't work that way. You got it wrong. But basically, this is the level of understanding of our machines. The machine is perfectly able to make a prediction because predictions only require correlations. You don't need to know which way the causal arrow goes. These two things, being in bed with doctors around and being sick, go together.
And so if the machine sees the doctors and the bed, it can say the person is sick, just as it was saying that there was a cow there. Or if it knows that he's sick, it can say, well, there may be doctors and maybe the person is in bed. The two things go together. But the arrow of causation goes only one way.
We know, by experience, that it's the sickness that is causing the doctors to be there, not the doctors causing the sickness. Although, OK, this has been disputed, too, and so we should be careful about it. OK, so now, why is that important? And so I'll hold that thought for a moment.
And I want to go just to make sure that we understand this issue of causation. And so if you only think about conditional probabilities, you can think about being in bed as correlating with fever. So we can talk about the probability that fever equals 1 conditioned on being in bed. But truly, you want to discover that the fever causes the person to be in bed.
But if you reason about conditional probability, you're perfectly-- And so with correlations, you can write it either way. You can write the probability of being in bed conditioned on having a fever, or vice versa. There is no reason to go one way or another when you have conditional probabilities. There is nothing.
And so I'm using now an arrow that is a little bit special, to describe causation instead of dependency. And we know from the work of Judea Pearl how to think about this. And so he came up with this "do" operator. And so the idea in probability is that you say two things are dependent if P(Y given X) is not equal to P(Y). But causation requires this new operator.
If the probability of Y when I manipulate X, P(Y given do(X)), everything else being equal, is not equal to the probability of Y, then X is a cause of Y. OK, so that's the definition that Judea Pearl gives. Pearl says, you know, pour epoxy on everything else in the world, apart from the variable X. Now, start manipulating X and see if Y changes. If that causes Y to change, then you can say that there is a causal link between X and Y.
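A tiny simulation can make the distinction concrete. Here is a sketch, assuming a made-up structural model in which flu causes fever and fever causes staying in bed: conditioning looks strong in both directions, but intervening with the do-operator does not, and that asymmetry is exactly what reveals the direction of the arrow.

```python
# Toy structural causal model for the fever / in-bed example.
# Conditioning is symmetric-looking; intervening (the do-operator) is not.
import random
random.seed(0)

def sample(do_fever=None, do_bed=None):
    flu = random.random() < 0.1
    fever = do_fever if do_fever is not None else (flu and random.random() < 0.9)
    bed = do_bed if do_bed is not None else (
        (fever and random.random() < 0.8) or random.random() < 0.05)
    return fever, bed

n = 100_000
obs = [sample() for _ in range(n)]
p_fever = sum(f for f, b in obs) / n
p_bed_given_fever = sum(b for f, b in obs if f) / sum(f for f, b in obs)
p_fever_given_bed = sum(f for f, b in obs if b) / sum(b for f, b in obs)

# Interventions: forcing people out of bed leaves the fever rate at its baseline,
# while removing the fever (do(fever = 0)) empties the beds. So fever -> bed.
p_fever_do_no_bed = sum(sample(do_bed=False)[0] for _ in range(n)) / n
p_bed_do_no_fever = sum(sample(do_fever=False)[1] for _ in range(n)) / n

print(p_fever, p_bed_given_fever, p_fever_given_bed)
print(p_fever_do_no_bed, p_bed_do_no_fever)
```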
And that's the reason why we build labs. That's precisely why we build labs: to be able to isolate the preparation, pour epoxy on everything else, and only manipulate one thing at a time. And we saw in Bob Desimone's talk this morning how a scientist proceeds in trying to establish causal relations. And in neuroscience, you know, it's very difficult. So you could say, you know, we have this possible graph of dependencies.
You know, he has a fever so he's in bed. But maybe he's in bed because he worked on a night shift or maybe the night shift because of the fever. Who knows which way it goes? Or maybe it was the flu. How do we know?
And so the idea is to block all of those other possible variables in order to explore whether a fever causes him to be in bed, or vice versa. And now what you can do is manipulate the fever. So you can give him an antibiotic to eliminate the fever and see if he gets up, and he will. Or you can get him out of bed, you know, have him get up, and then you can measure the fever five minutes later to know which way it worked. And so he gets up, and the fever stays there.
So the bed is not the cause of the fever. But if you give him antipyretics, he will get out of bed. So the fever is the cause of him being in bed, right? So that's how you do it. So why do we care so much about this causal reasoning in machines? Well, it's very simple.
If you're just targeting advertising based on images, you just need correlations, right? So if you see that picture, you start advertising Advil to the person or maybe a better pill or whatever you want. It doesn't matter. But as we build machines that enter the world and do things in the world, think of autonomous driving.
Think of lab machines in scientific labs. Think of robots that have to clean the kitchen. The machine had better know about intervention. So if you want to build a robot that will make people feel better, the robot had better know whether, to make them feel better, you should try to decrease their fever or get them out of bed, right? And it's the same as a car drives around: you know, what's the correlation between a pedestrian crossing and something else happening?
It has to know what causes what. And so we have to be able to build machines that explore causation if we want to get to true AI. And so that's a very big problem. So I want to show you. It's late, so I won't take you through all the slides. But basically, the idea is that if you only observe nature, well, nature has all sorts of tangled variables.
There are lots of correlations. And so a learning machine is not able to understand. And so we have to intervene in some way. And the machine has to be able to control the variable X to be able to disentangle its different components and know what causes what. And so this is a very simple idea that we pursue.
This is my student Krzysztof Chalupka. The idea is you train, say, a classifier to learn to classify MNIST digits. And as you know, with a three-layer network, you can classify MNIST digits down to half a percent error rate in 20 minutes. It's just amazing nowadays what you can do. Now, does the network know what the digits are?
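For reference, here is a rough sketch of the kind of digit classifier being described: a small fully connected network trained on MNIST with PyTorch. The architecture, the five epochs, and the accuracy figure in the comment are illustrative assumptions, not the setup used in the actual study.

```python
# Sketch of a small MNIST digit classifier (assumes PyTorch and torchvision).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train = datasets.MNIST(".", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(train, batch_size=128, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(5):                 # a few epochs reach roughly 98% test accuracy
    for x, y in loader:
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        opt.step()
```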
And you say, yes, I mean, you just told me that if you train it, it works very well. And so I'm asking you, well, does the network-- So what's the definition of a digit like a 4 or whatever? Well, it's that scribble on a piece of paper that will make a person say 4. There is no other definition you could give. And so is the network able to produce a scribble that a human will call a four?
Well, the network could copy it from the training set. OK, so if somebody came and-- And so here is the idea. Sorry, so it turns out that if you start turning on or off little pixels in the image, you go out of the manifold on which it learned, and the network is at chance. And so the idea is this one. You first train the network. And now, since you have access to the internal parameters, what you can do is play a game of creating adversarial examples.
And so you take fours and, by modifying them minimally, by taking derivatives of the output with respect to the input, you can create images that look completely like fours, but that the network will classify as sevens. You can do that. You know it, right? There was a paper by Rob Fergus and so on. And the idea is that then you can start planning experiments. You can have humans label these images that the network believes it has turned into sevens, and the network will start learning.
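A minimal sketch of that gradient game, reusing the model and data from the training sketch above: starting from a digit the network classifies correctly, take small steps along the sign of the input gradient toward a chosen target class until the network flips its answer. The step size, number of steps, and target class are arbitrary choices, and this is the generic iterative-FGSM recipe rather than the exact procedure used in the paper.

```python
# Targeted adversarial example by gradient descent on the input,
# reusing `model` and `train` from the training sketch above.
def make_adversarial(model, x, target=7, eps=0.01, steps=50):
    x_adv = x.clone().requires_grad_(True)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x_adv), torch.tensor([target]))
        loss.backward()
        with torch.no_grad():
            x_adv -= eps * x_adv.grad.sign()   # small step toward the target class
            x_adv.clamp_(0, 1)                 # keep it a valid image
        x_adv.grad.zero_()
        if model(x_adv).argmax(dim=1).item() == target:
            break                              # the network now "sees" the target digit
    return x_adv.detach()

x, y = train[0]                                # one digit, shape (1, 28, 28)
x_adv = make_adversarial(model, x.unsqueeze(0))
print(y, model(x_adv).argmax(dim=1).item())    # to a person it still looks like the original
```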
And so you can have an automaton that does experiments on the world and probes the world until it figures out what's going on. And so this is our embodiment. So here is a network. It is learning what a 0 and a 1 and so on is. And we had Mechanical Turk workers type in the zip codes. And we allowed them to type in a question mark when the pattern was not clear enough.
And so this is what the network learns. So this is a network that has been trained in a correlative way on MNIST digits. And so what we tell the network is: you've got to learn how to modify the nine minimally in order to create a zero, and the eight in order to create a one, and so on. And initially, the network modifies very little. And it's comical, because this is a network that is better than a human at classifying MNIST digits.
Well, it says, well, by turning on this little pixel here, it has become a 0. Now, it's a 0. Because, indeed, the classifiers inside the network are shaped in that way; it has never explored these semi-noisy patterns, which, to a human, are absolutely obvious. And so it has an easy time minimally modifying the 9 and thinking it's a 0.
But then it sends it to humans. And humans say, no, no, your experiment failed. That doesn't work this way. And so it's like a scientist trying to figure out, OK, what does it want? What is it? And so as you go through a number of iterations, it becomes closer and closer to the 0, and here to a 1, and so on. And so the network learns how to modify, minimally, a digit to make it into another digit in a way that a human will agree, OK?
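Here is a caricature of that loop in code, reusing model, opt, train, and make_adversarial from the sketches above. The ask_human function is a hypothetical stand-in for the Mechanical Turk workers; in this toy version it simply reports the label a person would still give, so the network gets to learn from each failed experiment. The real study's procedure differs in its details.

```python
# Caricature of the human-in-the-loop experiment: propose minimal modifications,
# ask people what they see, and retrain on the answers.
def ask_human(image, original_label):
    # Hypothetical stand-in for the Mechanical Turk step: show the image to
    # workers and collect the digit they see (or "?"). Here we assume a person
    # still sees the original digit.
    return original_label

seeds = [train[i] for i in range(32)]          # a handful of digits to try to turn into 0s
for _ in range(5):                             # a few rounds of "experiments"
    for x, y in seeds:
        proposal = make_adversarial(model, x.unsqueeze(0), target=0)
        answer = ask_human(proposal, y)
        if answer != 0:                        # "your experiment failed"
            # Learn from the failure: the proposal joins the training data
            # with the label a human actually gives it.
            loss = nn.functional.cross_entropy(model(proposal), torch.tensor([answer]))
            opt.zero_grad()
            loss.backward()
            opt.step()
# Over the rounds, the network has to change more of each image before its own
# notion of "0" starts to line up with what a person would accept.
```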
So this is, again, like deep networks in 1988, when Yann LeCun was classifying handwritten digits; that is where the study of causality is today. And so maybe 20 years from now, we'll understand causality better. This is just an initial step. OK, so here I want to recap: clearly, databases work. I showed you that generalization is not very good yet, although, you know, a lot works. And certainly, the AI is not there for understanding mechanisms and intervention.
OK. So just the last two slides. So this is the typical paradigm we encounter in machine learning and computer vision. So we have data. We ask an oracle to label the data. And this is us: we are choosing the best possible algorithm, and we think we are smart. We implement it in TensorFlow. And we turn the crank, we produce a black box. And now, there is a user who can use this black box.
And so this is our current paradigm. And my claim in this talk is that it's not a productive way of proceeding. Or it is a productive way, but you should be aware that there is a much bigger, badder problem. Here, you have data, you have experts who may disagree, who may have different opinions.
The experts don't do all the work. They produce some training examples for the Mechanical Turkers who come in here. Members of the academic community have different ideas on what to do here; it doesn't matter. And they produce black boxes. And now the users, who could be scientists in a lab classifying neurons, start using it.
And they realize that they got all their classifications wrong, because what they thought were very neat classes are, in fact, smudging into each other. They have to revise their ideas. And so it goes back here. So the experts have now changed their minds. They change their ideas. The machine is learning. The machine can ask questions and so on.
And so it's a big dynamical system where we have people, machines, data, information flowing. And we want to understand how this works and how to make this system come alive and be productive and distill knowledge for us. And so that's the idea.
So these are my collaborators. Serge Belongie is at Cornell Tech. And many of the cool experiments were done by Grant van Horn and by Peter Welinder here. Although there are other contributors, these are a number of papers on the subject. OK, thank you.
[APPLAUSE]