A Conversation with Dr. Boris Katz
Date Posted:
July 15, 2019
Date Recorded:
July 10, 2019
CBMM Speaker(s):
Boris Katz All Captioned Videos CBMM Summer Lecture Series
Description:
Boris Katz of MIT discusses how he got into his field of research and shows visiting summer students some of the projects he has worked on and is working on now.
BORIS KATZ: How I started. That was many years ago, and it was a different place. I came from the Soviet Union here-- oh god, it was in 1978.
But before that, I was always interested in language, and I would quibble with computers. I really didn't like to write code, because to me, it seemed very unintuitive. And so many years ago-- we're really talking about 50 years ago-- I started thinking about giving machines the ability to communicate with us in a better way than writing this code, which in my time was octal code, and you can imagine how hard that is.
Most of you may not even know what that is. In order to say "plus," you would write something like 0 1 2 in octal code, or something. And then there would be the first operand, the second operand, and a third-- it was quite hard to do.
And so one of the first projects that I did was to teach a machine to understand one of those algebra word problems, like fifth or sixth grade. A car goes from city A to city B, a train goes from one city to the other, where do they meet. And so the machine was supposed to read this text as text, understand what the problem was, and then solve it. And I was very happy when my system was able to parse and solve a whole textbook of such problems in algebra. So then I got very excited. And I'm also interested in poetry, and I decided that it was time to teach machines to produce something exciting for people.
[LAUGHTER]
I taught it to write poetry. It was in Russian, unfortunately for you, because you cannot really read it. But to me, it was a truly remarkable experience. And there was one printer in the whole lab. And the way the system worked, you would come in and give a bunch of cards to an operator with your code.
The cards came from a different office. You would write the code on a piece of paper. Then an operator in one room would punch the code onto cards. And then I would take the punch cards and bring them to another room and give them to the operator.
And then [INAUDIBLE] come back on Wednesday, and we will run the program and show you the results. So that was the debugging cycle. And of course, nothing worked, and I got back a huge output. And I had to figure out what was going on. And then the issue was that there was one little hole, an extra hole or a missing hole, in the octal code, which I needed to fix with some glue and--
[LAUGHTER]
Yeah, so you don't know how lucky you are that you don't need to deal with these issues. But that also, I guess, kept me on my toes, because I really needed to write code that didn't break. And so eventually, I debugged the program. And the first time I saw this wonderful poetry generated, at, like, 3:00 AM in some random machine room, it was a truly remarkable experience.
Well, anyway, so eventually, I decided to come to this country. And I was lucky. The first office I walked in was at CSAIL here.
And I got a job, and I've been there since. So Cambridge is the only city in the USA I've lived in, and MIT is the only place I've worked. So I'm lucky that I found them.
And I continued working on language. Some of the things we did over the years [INAUDIBLE] a question answering system called START, which was by far the first question answering system on the web. Some of the ideas from START were eventually used and implemented by IBM people. I worked with them quite closely to build the Watson system-- the one on Jeopardy.
A little bit later on, we had a project with Nokia, and we built a START system-- not quite on the phone, but working over the internet. And we built a question answering system where you could ask the phone certain questions and ask it to perform certain procedures on the phone. And that was before smartphones existed, of course.
And it was a vanilla Nokia phone. It was total magic that it worked.
We went to Helsinki several times. I talked to senior vice presidents. And I said, look, it's very interesting, and you may have an edge over the competition.
[SIDE CONVERSATION]
BORIS KATZ: Yeah, I actually wonder what I could show you. Let me see if I can do that. So it looked to me that it was an important system. And I tried to convince them to offer it on their phones. But they just wanted to make sure that the system actually sits on the phone.
Unfortunately, the word "cloud" did not exist. So I had a hard time explaining to them that everything does not need to be on the same gadget, and we could go through the internet and get answers this way. But I was never able to convince them. So we--
[LAUGHS]
Let me actually show you the system. All right, so again it's hard to imagine the world without Siri and without smartphones. But imagine the phone that does nothing but making phone calls. So there's no audio, except for some music. But you need to read [INAUDIBLE].
So these are some of my students, who act as if they need to go on a trip. And that is-- yeah, it tells you about START [INAUDIBLE], which is what we called the system. So here, he would type the question.
[INAUDIBLE] later [INAUDIBLE] we connected it to speech. And he wasn't sure he [INAUDIBLE] to take the call, so he asked about the weather again. Now [INAUDIBLE] every phone has weather. But again, it's a different world. That was in 2006.
AUDIENCE: [INAUDIBLE]
BORIS KATZ: It was our system, we just made a film about describing our system, yeah. Yeah, commercial's not the right word, but something like that. So you recognize the train station here in Cambridge.
And he's totally lost. I think he asks, "Where am I." And this is pretty much [INAUDIBLE] first GPS came out.
So we connected the phone with GPS. It showed him the lab. Now he knows what he's doing.
Well, this is a different-- you may recognize that [INAUDIBLE] beautiful lawn. But now, they've built these buildings here. So it dates a little bit the film. Again, it was 2005, 2006 [INAUDIBLE].
OK, [INAUDIBLE] I think he's trying to reach his mother to tell her something, but she's not answering. He's getting desperate. So we were able to trigger actions on somebody else's phone. So he texts [INAUDIBLE] to remind his mother to take her medicine at 3:00 PM. And we'll get back to that story later.
Well, this is the [INAUDIBLE] Center. He likes the [INAUDIBLE] architecture, I guess. And he wants to take a picture.
Take a high resolution picture using flash in 10 seconds. It's remarkable that even now, almost 20 years later, they cannot do these things. It's totally, totally ridiculous. All right.
So he wants to talk to those guys again. These are my students. Oh, he's a little bit bored, I guess. He wants to play music or videos [INAUDIBLE] he doesn't know how.
So he asks how do I use the radio on my phone. And the machine gives him some directions about how to do it. Again, today also you cannot ask your phone how to do things, unfortunately. Certainly not in English. Now he knows.
[UPBEAT MUSIC]
Certainly, very good actors, actually. All right, now his mother, who is a research scientist in my group. And so now the warning, the alert, gets inserted on her phone.
And it was scheduled for the future, so it happens at 2:00 PM or 3:00 PM, whatever he said. Take medicine at 3:00 PM. And so she remembers.
[LAUGHS]
(SINGING) [INAUDIBLE] take you away--
[INAUDIBLE] his friend. And this is one last story. He is going back. He gets into a car.
[UPBEAT JAZZ-STYLE MUSIC]
Again, imagine the world without Google Maps, without anything. And so we were able to [INAUDIBLE]. There was something called MapQuest. So we connected several things together.
So the question was, how do I get from here to [INAUDIBLE] house? So you need to understand what "here" is from your GPS. The house he gets through his contacts. And then you send this information to MapQuest, which gives you directions.
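To illustrate the pipeline just described-- a minimal Python sketch with invented function and data names (not the actual START or Nokia code): resolve "here" from the GPS, look up the contact's address, and hand both to a MapQuest-style directions service.

def extract_contact_name(question: str) -> str:
    # Toy parse: grab the word before "house" in "... to Anna's house?"
    words = question.rstrip("?").split()
    return words[words.index("house") - 1].rstrip("'s")

def answer_directions_question(question, current_gps_fix, contacts, get_route):
    origin = current_gps_fix                      # "here" comes from the phone's GPS
    name = extract_contact_name(question)         # whose house the user is asking about
    destination = contacts[name]                  # address stored in the contact list
    return get_route(origin, destination)         # MapQuest-style directions call

# Example with stand-in data:
contacts = {"Anna": "32 Vassar St, Cambridge, MA"}
print(answer_directions_question(
    "How do I get from here to Anna's house?",
    (42.3601, -71.0942),                          # pretend GPS fix
    contacts,
    lambda origin, dest: f"Directions from {origin} to {dest}",  # placeholder service
))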
Well, anyway, so this is the story. So we built a system. Yeah, this is just-- right. So we built a system and gave them demos and all that. And then they kept asking me about putting START on the phone, finding a chip that could run a compiler for LISP on the Nokia phone.
And then, like, several months later, there was this news that there is a new phone that Apple came up with-- the iPhone. Very exciting stuff, much more than that Nokia phone. So I come back to this senior vice president.
I said to him, look, there's this new baby on the block. They're doing stuff, and eventually, they will catch up with us. And you need to really start doing stuff [INAUDIBLE].
He says, do you know how many phones Apple sold? I said, well, I read in The Times that last month, they sold about-- they were quite expensive, though they sold about 1,000 or several thousand units or [INAUDIBLE]. But he looks at me and starts laughing hysterically. He said, we, Nokia, ship one million phones every day. Why do we care about Apple?
[LAUGHTER]
So those were famous last words.
[LAUGHTER]
So what happened was, I guess, we stopped worrying about them. We published a paper describing the system with all the examples. That was in 2007, I believe, in October or September, in California.
And then somebody was sitting in the room. And two months later, a little company got started in California. And a couple of years later, they sold their stuff to Apple. And the rest is history.
So again, I'm not [INAUDIBLE] telling the story to show off [INAUDIBLE]. But some of you will become professors; you may be entrepreneurial, maybe not. Or some of you will go to industry. But it's important for you to know that a company that seems big today may not be big tomorrow, like Nokia, which was almost sold at a yard sale to Microsoft several years ago.
And their mistakes certainly contributed to that. And you need to find people to work for, or maybe start companies with people, who have this vision that the world is changing all the time. And you need to be on your toes. So that's really the only message there.
Well, anyway, back to what I have been doing. So we did all these things. We were the first who came up with the idea that, to answer questions on the web, it's not enough to look at all the text available to you. You really need to do some pre-analysis of the text and create some structured information. That makes it much easier to answer these questions.
And one of my students-- in fact, the one who played the visitor-- went to Google after he graduated. And you may know that Google has these 20% days, where one day of the week, at least at the time, you're allowed to work on anything you want, assuming you get the go-ahead from your boss. And then eventually, you share it with the company. And if they like it, they pick it up. And so he worked with us on this idea.
We had the system OmniBase, which could be viewed as a pre-digested web put into a database. And he built something much bigger, on a Google scale, which at the time they called Google Squared, and then renamed to Google-- something, [INAUDIBLE] or something. And about 8 or 10 years ago, they totally switched the search to be based on that. It used to be that they would give you just hits. Now, if they have the answer in a structured form, they will immediately give you a very precise answer at the top, which again is due to some of the stuff that we did.
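To illustrate the idea of pre-analysis-- a minimal Python sketch with invented facts and schema (not OmniBase or Google's code): text is digested ahead of time into structured records, and a question then becomes a direct lookup rather than a search over raw text.

# Structured facts extracted ahead of time from (say) web pages:
facts = {
    ("France", "capital"): "Paris",
    ("MIT", "location"): "Cambridge, Massachusetts",
}

def answer(entity: str, attribute: str) -> str:
    # With structure in place, a precise answer is a direct lookup,
    # rather than a list of document hits the user has to read through.
    return facts.get((entity, attribute), "no structured answer; fall back to text search")

print(answer("France", "capital"))    # -> Paris
print(answer("MIT", "location"))      # -> Cambridge, Massachusetts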
All right, so that was language. And then a few of us, about eight or so years ago, started thinking about the future of AI. And [INAUDIBLE] this proposal to the National Science Foundation, which we called Brains, Minds, and Machines, where we told them that we don't believe the problem of intelligence could ever be solved by one discipline alone-- by computer scientists or cognitive scientists or neuroscientists. We really need to work together to do that.
We also said that we don't believe we should look at just one modality, whether it's language or vision or robotics. Again, we should go across modalities. In fact, I believe it was a big mistake made by AI many years ago that people totally separated these fields.
Language people have no idea what vision people do. Vision people have some of the applications [INAUDIBLE], but they don't understand what language people do. And it's a mistake, not only because people could learn from each other, but because, in fact, it made the problem harder.
Just think about a baby in the crib. You have yet to see parents who put an encyclopedia or a dictionary into the baby's crib and say, go learn. What happens is the baby observes the world with its vision, touches the world with its hands-- tactile information-- and hears first its parents, and then friends, talk about things, and puts all this quite often redundant information together.
And this is how learning happens. So the learning of recognizing objects and recognizing language and recognizing what you-- it happens at the same time. And this is easier because you have redundant information that confirms what you're learning.
But currently, what most language people do is train and train and train on language data until they build a decent parser. And what vision people do is train and train and train on billions of images until they get decent object [INAUDIBLE] visual performance and so forth. And I don't believe that they will succeed dramatically well if they continue doing that. We need to put things together.
Well, anyway, so going from language [INAUDIBLE] we were lucky. [INAUDIBLE] got accepted. We got a sizable amount of support from the government. And that allowed me and other people in this building and across the street to hire totally awesome people in different fields.
And so now my group is interested not only in language and question answering, but in vision and robotics, and especially in putting things together-- how children learn language, how children understand what other people think, and so forth. So these are the types of projects we're working on. So I could very quickly-- well, I can answer any of your questions, if I can, of course, at any time. I'll give you a second.
But then, I have a few slides about some of the things we did very recently. It's not really a technical talk, but it has to do with the current state of machine vision. And I could show you the slides. But go ahead with your question.
AUDIENCE: More about your proposal [INAUDIBLE] language people and [INAUDIBLE] people. What does that practically mean? Like, you can't do visuals that way [INAUDIBLE]. At an experimental or scientific level, what does it mean to combine two disciplines that have each advanced their own way of doing research? How do you integrate one with the other?
BORIS KATZ: Well, you co-train. And in fact, we have-- yeah, yeah. So this is one idea. I'm not saying that we have all the ideas. I'm saying that we have to try very hard to get-- yes, right. And we do have-- in fact, I can show you a couple of recent papers where we try to do exactly that.
AUDIENCE: Co-train.
BORIS KATZ: So anybody interested in machine vision here? OK. All right, so let me see if I can find those slides. How many of you know what ImageNet is? OK, most people.
All right, so I think if you ask most people, both in this building and across the street, and I think SQ as well, I think we all believe that much progress has been made in AI in the last, say, eight or so years, and especially in machine vision. Indeed, if you read papers, or especially the media, you will see these claims about machine performance on ImageNet-- for those couple of you who did not raise your hands, it's a data set of images. This one has something like 20 million of them, which most object recognition systems train on and test on.
So on the basis of that performance, the claim might be 97%. And an earlier paper from a couple of years ago claimed that human performance on object recognition on ImageNet is [INAUDIBLE] 94%. So what does that mean? Does it mean that you're wasting your time here, that we should all go home, that the problem is solved?
Maybe, but if you ask people who actually use these detectors-- I don't know if you have had these experiences-- which allegedly have 97% performance, you will know that they never work out of the box. And even if you train them, it's totally not even close to that number. And in addition, even that number, 94%, is also a little bit silly, because if 6% of the time you could not recognize the objects on the street, I would be a little bit worried about your safety [INAUDIBLE]. So we need to explain this disconnect. And maybe before we do that, we need to understand that these numbers have to do with object recognition [INAUDIBLE] what is it that needs to happen to recognize a visual scene like this.
So of course, we need to recognize objects and categorize them-- understand what class they belong to. But in addition, we need to understand spatial and temporal relationships between objects. We need to be able to recognize events, not only objects, what activity is happening there. We quite often need to predict what happens a second after that frame in the video or that image, or five seconds after that, or explain what happened before the event that you're observing.
I see mostly the upper torsos of all of you here. But I can pretty much guess that there are some legs over there as well, even though I don't see them. So you should be able to hallucinate, to fill in the gaps of what you don't see.
And of course, humans are totally wonderful at all of these tasks. But if you look at machines, it's true there has been some good progress made in object recognition and object categorization. But all these other human abilities that are required for understanding are not there yet-- not even close-- in our machines.
So the question is, why is this the case? And some people say, oh, well, visual understanding requires knowledge about the world, which is a claim we would agree with. And our human visual system is tuned, and some even say biased, to process structures which are typically found in the world, and to recognize objects and events that make sense to us as humans.
To maybe explain a little bit more about what is meant by that, I'll show you a quick video, which was blurred on purpose. Yeah, I don't know if it's a bit too bright here. Do you see what's going on here?
AUDIENCE: [INAUDIBLE]
BORIS KATZ: So-- say [INAUDIBLE] again.
AUDIENCE: I think it's [INAUDIBLE].
BORIS KATZ: You see it? Can somebody tell me what they saw?
AUDIENCE: [INAUDIBLE] like [INAUDIBLE] picked up a phone.
BORIS KATZ: Yeah.
AUDIENCE: It's [INAUDIBLE] that's just an observation. And then he put a [INAUDIBLE]. Moves a mouse, [INAUDIBLE].
BORIS KATZ: Well, which is perfect. And why is it that you were able to do it? Because you only saw just [INAUDIBLE] a couple of pixels on the screen.
But you know that you pretty much understood, oh, this seems to be an office situation. What happens in an office is that something looks like a computer, something looks like a human. So you hallucinated the rest, and you were able to give me a perfect answer.
But of course, it's also true that if some [INAUDIBLE] confronted with the fact that you were wrong, or shown something which is different than usual, you will not collapse and say, I have no idea what's going on, God help me. You will actually recover, dismiss your biases-- I call them biases because you're sort of biased [INAUDIBLE]-- and you will be able to recognize this unusual situation. So let me show you the actual video, which is unblurred.
[LAUGHTER]
So there are at least, what, five or six things here which are totally crazy. First of all, the phone was a shoe. The headphones were some Coke cans. The mouse was a stapler.
There was also-- I don't know if you saw it-- a trash can instead of an old-fashioned desktop, and a toaster for what looks like an old-fashioned disk drive or something. So you were able to quickly dismiss your biases, and humans are very flexible.
But the question is, can the machines do that? And before we answer this question, we need to understand what the data says. You all know the model [INAUDIBLE] machine learning stuff-- they require data sets and all that. So is it possible that our data sets are biased?
And if you ask people how these data sets are collected-- for example, ImageNet-- they will tell you, oh, well, they're collected at random from various photo sharing sites. And because it's at random, that means that they are not biased. They're random. Well, let's look a little bit more carefully at how [INAUDIBLE].
So it's true that they're collected at random. But what is Flickr? I don't know if any of you-- I don't use it. But I assume some of you upload stuff on Flickr.
You don't upload random pictures of random [INAUDIBLE]. I call this the best photos effect. So you select the ones that you like.
Usually, those pictures have predictable backgrounds that correlate with the objects very nicely. If it's somebody on the beach, the background is the sea or sand; a plane is in the sky or on a runway; a frying pan [INAUDIBLE] in the kitchen; and so forth. And they're all more or less [INAUDIBLE] because you want these photos to be nice. The objects appear in preferential positions.
And there is also almost no occlusion, because if you don't see your objects, why would you even upload it and share it with the world? And even the angle from which you take it is not random. You want to make sure it is [INAUDIBLE].
So as a result, our data is actually biased. And there is this famous adage in AI that models are only as good as your data sets. So what happens is that current networks, the winners of all these ImageNet competitions, are extremely good at exploiting these biases in the data sets. And as a result, they overperform on ImageNet, the data set that they're trained on, and they very badly underperform in the real world.
And so the problem that we want to-- so I asked the question before, can our machines overcome biases? So we first need to see whether we can de-bias the data collection process of our data sets. And we believe we have the answer-- that the answer is yes.
We actually spent several years working on it. We just sent a paper to [INAUDIBLE] describing it. It's a platform which we called ObjectNet. You asked a question?
AUDIENCE: Yeah, I have a question about the term bias. I mean, in a certain way, [INAUDIBLE] argument that, OK, but my experiences are biased as well? Because I only see objects from a certain angle, from my perspective [INAUDIBLE] most of the time, I see it during the day. I don't see images from the bottom of the ocean. [INAUDIBLE]
BORIS KATZ: You are absolutely right. But this is exactly why I showed you that video. When you saw the unblurred video, you didn't say, I have never seen a man putting Coke cans on his head.
Therefore, I have no idea what I see. You immediately saw what you saw. You immediately saw the people [INAUDIBLE]. Chances are you have never seen somebody take a shoe and hold it up like that-- but did you have a problem recognizing it?
Well, our machines do. And this is exactly why [INAUDIBLE]. It's true that most of the time-- 99%, or whatever the percentage is, of the time-- you look at objects and they're in this perfect position. But do you see what this is? You know it's a bottle. I don't want to--
[LAUGHTER]
But it would be no problem. And if we want to have robots in our houses helping us, and seeing the objects that they need to see to help us, we cannot tell them that everything is in the right position. Because if they don't see-- if all of a sudden we do this, or we put the food all over the place, the robot will get [INAUDIBLE] stuck. Our networks need to be able to generalize to different positions. And currently, they don't.
And now [INAUDIBLE] some numbers explaining what I mean. But that is a good question. Thank you. All right, so this is-- yeah.
AUDIENCE: On that note, since you said something about assisting [INAUDIBLE], and then be able to [INAUDIBLE], is that the big part of the impact that you hope to have?
BORIS KATZ: Yeah, our goal is not to say that whenever they say 97%, it's in fact a much lower number. [INAUDIBLE] right now, fewer and fewer people work on object recognition. Because if it's 97%, why would you waste your PhD making it 97 and 1/2%
But we want to tell the world that the problem is still there with us. And of course, the main goal is, after that, to create better detectors, to understand how machine vision works. But I also firmly believe that we need to do it at the same time with understanding how human vision works and doing things together.
And hopefully, there will be this wonderful virtuous loop between understanding humans and machines, to build better gadgets. So that's the goal, yes. That was supposed to be the last sentence of my talk, but--
[LAUGHTER]
All right, so this is what we do. We don't believe in random, because the web is not random. Flickr is not random. And so we have to do it the hard way. We have to actually create every image from scratch.
Thank god there is this Amazon Mechanical Turk. I assume you guys know what it is. And so just a [INAUDIBLE] of money. And so we give them tasks.
So I have a researcher in my group who one day went home and made a list of all the objects he has in his house. We threw out some of them for reasons I will explain later, like with this food [INAUDIBLE] put upside down. And so we ended up with 324 different types of objects and put them on the list. And we asked many thousands of Turkers to take one of those objects at random and place it randomly.
[INAUDIBLE] the coin toss happens on our side. We tell them: take a bottle, bring it to the kitchen, rotate it by so many degrees-- and I'll explain how this is done, because we cannot just give them numbers-- and point the camera from [INAUDIBLE] or something.
So we took some additional steps to make sure that there's as little bias as possible in the collection process. In particular, they are not shown [INAUDIBLE] any instances of any object classes. We didn't want to bias them with this particular bottle, so that they would all go out and use that particular bottle. We just use words to describe the objects.
And as I said, right now it's English, but it could potentially be any language. We don't use more than three or four words. Usually it's one, actually.
And they're guided in their work. They also download an app on their cell phone, and that app uses some animation [INAUDIBLE] to tell them exactly what they need to image and how. And you'll see in the next video a little rectangular prism that they need to align with the physical object before taking the picture.
So here are the instructions. So on the right is what he sees on the phone. He is told which objects to take, where to move it.
And here is this prism that he aligns with the object. It needs to be exactly inside the box. The prism, in this case, shows where the top and the front are.
In other cases, [INAUDIBLE] other objects and different things. And so now he's told to rotate it, and again align it with the prism. And take a picture and then press a button and upload it.
So this is what the Turkers do. And this is a somewhat more schematic description of what they do. So it says here: collect data controlling for biases.
So we control the object class, which again is one of those 324 or whatever-it-is classes. Occlusion has a [INAUDIBLE] on top of it-- we don't do occlusion [INAUDIBLE] yet; this is a work in progress.
And then object orientation-- we have about 50 angles, which again we set using that prism. We don't tell them to move the object 48 degrees; we move the prism 48 degrees, so that they need to align the object with the box. And then camera orientation and background, and whatever room we tell them to bring it to.
And so all of this is sampled at random, and [INAUDIBLE] gets generated as a task for the Turker. He takes an image, or she takes an image. And then other Turkers make sure that that was the right label for that image, and that it has the right background and so forth. We have a very sophisticated verification process. And eventually, we throw out anything that we don't like after we get the data set.
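To make the collection procedure concrete-- a minimal Python sketch, with illustrative lists standing in for the real 324 classes and 50 angles (this is not the actual ObjectNet pipeline): the random choices are made on the researchers' side, and each worker receives a fully specified task.

import random

OBJECT_CLASSES = ["bottle", "chair", "frying pan", "shoe", "knife"]   # 324 in the real set
BACKGROUNDS = ["kitchen", "bedroom", "living room", "bathroom"]
ROTATIONS_DEG = [i * 360 // 50 for i in range(50)]                    # roughly 50 angles
VIEWPOINTS = ["from above", "head-on", "from the side"]

def generate_task(rng):
    # Sample every controlled factor at random, so the worker cannot
    # fall back on a "best photo" composition.
    return {
        "object_class": rng.choice(OBJECT_CLASSES),
        "background": rng.choice(BACKGROUNDS),
        "rotation_deg": rng.choice(ROTATIONS_DEG),   # conveyed via the on-screen prism
        "viewpoint": rng.choice(VIEWPOINTS),
    }

print(generate_task(random.Random(0)))
# Other workers then verify the label, background, and so on before an image is kept.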
So here is what we have so far. And we are still collecting as we speak. As I said, we have 324 object classes.
What will be important for the next couple of slides is that we made sure many of those classes overlap with ImageNet, because we want to be able to compare how systems perform. That turned out to be 116 classes. So we collected about 70,000 images, and about 4,000 people were involved, which sadly resulted in the waste of a lot of money.
After validation, only 40,000 were retained. And here are some of the reasons why we threw out so many. Now we know better, and we will be much more careful about that. But it was a learning process.
So here are some of the examples. We removed 43% of the data. About 10% were just silly: we would ask them to take a picture of a phone or of a bottle, and they would put it on top of a desk next to a check, and we could see the check numbers. It was just foolish. We had to throw out those images.
Another 10% had to do with-- in some countries, in this case I believe it was India, some entrepreneurial people realized that Turk work could make them some money. And some of these people don't even have computers or cell phones to do any work. So they hired a bunch of people, placed them all in one office, and told them to bid on jobs from Amazon Turk.
As a result, it totally defeated our purpose. We wanted to have as much variety as possible in the backgrounds and in the objects. And all of a sudden, we started seeing the same object in the same background in the same office. So now we wrote some code so that, unfortunately, we will not allow these people to participate, because we don't want those same images from them.
And 23%, which is a large number, will be the hardest for us to deal with in the future. They didn't have an orange or a banana or something, and so they would go to Google, search for that object, and then take a picture of a picture and send it to us. Even though, of course, the instructions told them not to do that.
So in any case, we had to throw quite a few of those out. And if we find a cute way to automatically recognize that it's a picture of a picture, we will be able to do it in a much more efficient way. Well, anyway, this is just a test set. Yeah, go ahead.
AUDIENCE: Did a lot of the costs come from maintenance of the Amazon Turk you were saying?
BORIS KATZ: Of the cost? Well, the cost, each time they take a picture, we need to pay them. And we want to pay them decently so that per hour, they don't make less than, like, $10, $15. And so this is the cost, yes.
And what we also believe is important, so that people don't overtrain like they do currently with ImageNet, we spend a lot of effort to make everything automatic. So at this point, all we need is a check. And then we can press the button, and another 50,000 images will be collected. And this way, people will not overtrain on those same objects, even in our data set.
If you're interested, here are some of the objects. This is about half of objects in our data set. And some of them are in italics, and those are intersecting with ImageNet.
AUDIENCE: I have a question. So if I make the argument that the power to generalize-- like generalization in general occurs from having a large volume of previous experience-- so the more previous experience I have, the better I can generalize. The more I have seen objects in different contexts, the better is my ability to recognize that object in a normal context, and so I put it in a frame. Now, 40,000 sounds like really a lot. But if I really compare that to the human experience, I would say that maybe three days of our lives--
BORIS KATZ: You missed something that I said a minute ago. I said this is just a test set-- we will not allow anyone to train on this collection. Whether they want to train on the 20 million images of ImageNet, or 200 million images, or a billion images, that is their problem. This is a new experience that I want those networks to be able to generalize to, rather than train on.
I absolutely agree with you. Although to be fair, a baby does not see 1 billion images. And so there is something there that allows us to very quickly generalize.
AUDIENCE: [INAUDIBLE]. Have there been studies that have tried to find a relationship between the number of images and success at object recognition? So have there been any studies [INAUDIBLE] done?
BORIS KATZ: For machines, yes. But for humans, these studies make no sense, because-- look, you take a four-year-old. And I have grandchildren now. So you have a birthday, you buy them a little toy tractor. They've never seen a tractor.
You tell them, well, this is a tractor. So they play [INAUDIBLE]. Then the spring comes or the summer comes, you take them to a farm. There is a huge thing. Different color, different parts, totally different.
And he says "tractor." One example, one-shot learning. So we need to figure out how this is done. This is what needs to happen. Yeah.
AUDIENCE: I'm really interested in crowdsourcing approaches to research. And I'm sure you've heard of Sebastian Seung and Eyewire.
BORIS KATZ: Yeah.
AUDIENCE: And I don't think that required any monetary incentive. So what sorts of differences do you think there were between yours and something like that? His is just available online, I think. Like, why couldn't you put the app in the Play Store or the Apple Store, and just have people use it there?
BORIS KATZ: [INAUDIBLE] there? What, the images? I don't quite understand what you're saying?
AUDIENCE: [INAUDIBLE] crowdsourcing approaches to research.
BORIS KATZ: Yes.
AUDIENCE: So I'm wondering what differences there were between something like Eyewire from Seung and yours that require monetary incentives, or what sorts of differences there were? Or how you came to the conclusion that you should provide monetary incentive?
BORIS KATZ: Well, first of all, I think it's nice people make effort. I don't think we need to worry about [INAUDIBLE].
We're a rich country. We have funds, and I see nothing wrong in helping people with their livelihoods. [INAUDIBLE] totally [INAUDIBLE] economic issue.
But I also-- yeah, possibly we could tell people, look, we have a cool project. Help us solve the vision problem-- and possibly we would be able to get it for free. But I don't-- there are some research issues.
You could try to have a bunch of people solve a math problem. But it's not that interesting to take a shoe and put it on top of a table or a desk. So I think people made an effort. I don't see anything wrong in paying them.
Well, anyway, so here are some images from ImageNet. So you see, in this case, these are chairs. Well, you see what I mean by stereotypical positions and stereotypical backgrounds.
And here are those same class chairs in our data set in ObjectNet. Well, you see how they differ by rotation, by background, and by viewpoint. And this is what we are striving for, to get as much variety so that our future systems can learn to generalize better.
Some other examples with this prism. This is a chair on its side. And again, none of you have any difficulty recognizing that. Some bottle on a chair. Here's a transparent bottle, which again is also very important for us to be able to recognize.
A knife on a sink. Again, it's a different background. But we all see that knife very nicely. A shoe on somebody's kitchen counter or something. It's also upside-down, but we all see that [INAUDIBLE] none of the networks is able to do that.
All right, so if [INAUDIBLE] show you one slide, this would be the slide I'd show you. So here it is, in [INAUDIBLE] three parts. This is top-1. Top-1 means that the network is allowed only one guess. There is also top-5, where an answer counts as correct if you are given five guesses and one of them matches the target.
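For reference, the metric works as in this short Python sketch (illustrative data, not the actual evaluation code): a prediction counts as correct under top-k if the true label appears among the model's k highest-ranked guesses.

def top_k_accuracy(ranked_predictions, true_labels, k):
    # ranked_predictions: for each image, the model's guesses ordered best-first.
    correct = sum(
        1 for guesses, truth in zip(ranked_predictions, true_labels)
        if truth in guesses[:k]
    )
    return correct / len(true_labels)

preds = [["cello", "chair", "banister"], ["chair", "sofa", "bench"]]
labels = ["chair", "chair"]
print(top_k_accuracy(preds, labels, k=1))   # 0.5 -- only the second image is right
print(top_k_accuracy(preds, labels, k=3))   # 1.0 -- "chair" is in the top 3 both times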
But I don't want to confuse you and show you [INAUDIBLE] curves. This is the green. Everybody sees that?
So we took five object detectors, each the winner in its year, starting with AlexNet in 2012, VGG [INAUDIBLE] in '14, ResNet-152 in 2015, DenseNet in '16, and NASNet in '17. And you see how the performance improved from about 50% to about 85% or so on top-1. And this is trained on [INAUDIBLE] classes of ImageNet.
The next curve does the same thing, except it looks at the performance on the intersecting classes. The instances-- the images-- are still from ImageNet, but only for the 116 classes that intersect with our data. And you see the performance goes down by about 10%, mostly because some of the classes in ImageNet were very easy for ImageNet. And since we only looked at things that are found in our houses, we didn't have cats and dogs. So this now goes from 40% in 2012 to-- what is it, about 70% in 2017.
So the next thing I will show you-- and again, the blue line is the performance on the 116 object classes taken from ImageNet. So the next curve will be the performance of those same systems on those same 116 classes, where the instances come from our data set, from ObjectNet. So what is your guess?
Where would that next curve be? It will go down. What's the guess? How much?
AUDIENCE: 20% [INAUDIBLE]?
BORIS KATZ: How much?
AUDIENCE: 20%?
BORIS KATZ: So this is pretty dramatic. So these wonderful-- you were maybe still in high school when AlexNet came about. There was this incredible publicity. And they do 5% on our data set. It's totally crazy.
And this is what the community needs to see and people need to understand. As good as these networks are, whether they're deep or shallow or whatever the current adjective is, they are not good enough to solve the problem. So that is the message.
If you want to see some examples of what the networks did, I could show you that going back for chairs. So here is what ResNet-152 said about this image. Here is a top five.
So the first choice, there's a vacuum cleaner. Then it says it's a cello, then a banister, a shopping basket, and a microphone. And after looking at some of those, I actually became a bit more optimistic about the state of the world.
There are some reasons for all five of those, if you think about it. It's just that systems are stupid. And just looking at this little curve, they say, oh, it's a cello or something.
So if you guys could figure out a way to combine the smartness [INAUDIBLE] generalize it with, [INAUDIBLE] look locally [INAUDIBLE] a cello or a banister, or because of the angle, say that it's a vacuum cleaner or a shopping basket, because this is how you [INAUDIBLE]. It's actually remarkable. But of course, none of you will ever say that it's a cello. And then some other similar examples.
This is for chairs. And for other objects-- we told them to put an iron, I guess, in the bathroom, and somebody [INAUDIBLE] put it on the sink. And of course, the best networks in the world say it's a soap dispenser, because it's in the sink. It recognized the background, and that was good enough for it.
So again, you're the future. Please figure out a way to build better systems. And another cute thing. You know wash [INAUDIBLE], iPad for some reason.
And I need to finish up here. So this is just to show you-- it's not readable, but I will blow it up in a minute. This shows you how NASNet performs differently-- almost 90% on certain object classes and 0% on others.
And if you blow it up, you will see which ones those are. So this is the top end and this is the bottom end. I looked at it a little bit. I don't see any regularities there, except that somewhat more symmetric objects are easier, because if you rotate them, they don't change.
So here's some of them on top. So that's the only thing that I noticed. But otherwise, it looks somewhat random.
Well, anyway, so let's get back to our adage. So current approaches are truly awesome at understanding correlations and [INAUDIBLE]. And we do want that, but we don't want to have only that. And ObjectNet is a new test set that controls for all these things-- for rotation, for viewpoints, and for backgrounds-- which we hope will spur the field into thinking harder about how to build better gadgets.
Of course, we want to do many more things. We want to understand exactly when and how they fail, rather than just quote a number like 45%. The main thing is we don't want to have this crazy disconnect between 97% somewhere and, when you actually go into the field, 40%. We want to make sure that performance on the data set is predictive of performance in the real world, and that we know what it is.
If you think about robots, autonomous cars-- if you [INAUDIBLE] see something upside down on the road, you don't want it to think something crazy. So it's important. We want to add occlusion, where we have some interesting ideas of how to do it.
Going to humans, we want to use short-presentation-time experiments to characterize when and how humans fail, because humans are not perfect at this either, and which properties of objects and of the scene affect accuracy. And of course, this is responding to your question: we want to use all this understanding to develop better detectors and [INAUDIBLE] do something interesting. So this is what I have to say. If you have any more questions, go ahead.
AUDIENCE: Yes.
BORIS KATZ: Yes.
AUDIENCE: First of all, thank you so much for coming. I know the presentation isn't over, but you have already taught me so much, and I'm sure--
BORIS KATZ: Oh, thank you.
AUDIENCE: --appreciates your knowledge and wisdom. So I'm curious in discussing what [INAUDIBLE]. This gentleman mentioned earlier, and you had also talked about, how young children perform what seems like one-shot learning. So they see a toy truck. And then a few weeks later, they see a real truck, and they will recognize it.
Something I've been learning a bit about recently is wake/sleep algorithms. And I guess the idea is people, during the day, you experience things, and then you go to sleep. And something is happening in your brain while you're sleeping. And we think sleep is very important for learning.
And so just a crazy idea-- I'm not sure if it's crazy. But just what I was thinking is, what if when we sleep, if we experience this toy truck, what if when we sleep, we start creating our own variations of the truck? So maybe we imagine a truck with a different properties or [INAUDIBLE] larger.
BORIS KATZ: Yeah, that is quite possible. But in fact, it may not even happen in sleep. In fact, it's a bit unfair to say that this is how humans learn, because I said it's a four-year-old. The four-year-old had four years worth of something.
As I said, I have grandchildren. So I'm observing them. You give them a new gadget, and they keep looking, turning it, picking up, throwing it, picking up, trying to-- and so maybe what they're doing is they actually try to understand.
In fact, [INAUDIBLE] had a couple of papers with his group on the subject-- that they learn invariances to rotation, and other types of invariances, by just playing with an object constantly. So what I described as one-shot learning came, in fact, on top of four years of understanding that if you do this with the same object-- if you turn it like this-- it's still the same object. Without that, the problem would be much harder.
So yeah, in a sense, they do have either through just playing for many hours or various hallucinations, whether awake or in sleep, it's quite possible [INAUDIBLE]. But I don't know [INAUDIBLE]. Yeah, go ahead.
AUDIENCE: I wonder what the implementation routine might be for something like this. Because I guess if I wanted to try to have a machine recognize something, I try to reduce something like a laptop or a plate to few wires. And then every time the machine would see the object, it would rotate and flip and do all sorts of things with the wires, until it matched up to whatever it was seeing. So I'm wondering what you'd implement at, like, 30,000 feet, since I'm not too-- I don't know too much about machine vision myself.
BORIS KATZ: Yeah. Well, again, as you may have noticed, [INAUDIBLE] vision is new in my world. Right now, we looked at a bunch of existing networks. We have some ideas on improving algorithms.
But what I didn't talk much about-- I just mentioned it in the bullet here-- is that we do, in fact, plan to look closely not only at how humans do that, but we plan to work with animals, which are much easier to study. If you study monkeys, then you can actually wire them and see what happens. And then you could, again, hopefully have this virtuous loop between understanding a little bit about [INAUDIBLE] of various processes in the brain, and going back to the networks and using this information about humans or animals to improve the networks. So this is what I think the future should be, or at least could be.
AUDIENCE: And then for any human studies, and any machine studies or animal studies, can we use, like, a minimum object-- like, the minimal representation that it could have, up to the most complicated version of it.
BORIS KATZ: Yeah, well, we haven't made this decision. With humans, we can do anything we want. But we cannot get into the brain, unfortunately.
Although even that is not quite true, [INAUDIBLE] if I have a minute to describe that. We have access to mostly teenagers at one of the hospitals in Boston who have incurable cases of epilepsy that are not helped by chemicals, by typical drugs. So they require surgery.
And what happens in this case is they are brought into the hospital. They open the skull, and they insert electrodes into the areas that the surgeon suspects are where the surgery needs to happen. And then they observe them for several days, trying to pinpoint exactly where the future surgery needs to happen.
And while they are wired, we do-- well, we ask permission of their parents. And they're bored anyway, sitting there for four days doing nothing. So we show them movies. And we are planning to also show them things that may be less exciting to them-- show some minimal-- show some images or some language strings, and have them hear some language, so that we could try to see what happens in the brain when they see things, when they hear things, and when they see a conversation.
And that will, hopefully, again help us to get understanding. But the images that you're talking about, whether they are minimal or get more complicated, we are certainly open to that. We could try to see what works, and see if what is more productive and what helps. Yeah.
AUDIENCE: I have a question in regards to how a computer would learn-- so the actual process of acquiring that object recognition. So when a human learns a language, for example, they're constantly associating an object with a name. So that is an association process.
However, as they're learning the language, they learn certain rules and begin to say, all right, this is "sit." The present tense for the action to sit is "sit." And as they learn on, they learn that the past tense is "sat." However, there's like a period of overgeneralization. So for example, for "hit," they would say "hat" instead of the correct version.
BORIS KATZ: Or hitted it or something, most likely. Most likely "hitted," because they associate.
AUDIENCE: My question to you is, is there a way to implement the forgetting, or to-- since humans have the ability to ungeneralize something, to make it more specific-- to have that fluctuation of specific [INAUDIBLE]? Thank you. That [INAUDIBLE].
[INTERPOSING VOICES]
AUDIENCE: Do you think that is possible to implement in computers, the fluctuation between these two.
BORIS KATZ: Yeah, I think it is. But we have to start with actually being able to generalize, which is a really hard problem. Overgeneralization is, too. Kids show that-- it's known. There are many, many papers showing it in language understanding.
I'm not sure whether that is happening in object recognition. But for language, of course, you said that they learn objects. I actually think things happen at the same time.
I think the grammar and the lexicon, plus of course object recognition, are all happening at the same time. Because very rarely does a mother or father say, it's a spoon, or just say "spoon." A child constantly, from a very early age, hears full English sentences. Depending on the family, parents more or less bend down to the baby and talk down to the baby a little bit. But when they talk to each other, they talk normal language.
So this is the constant input that the baby hears. And therefore, that needs to happen at the same time. Because if you never hear "spoon," if you never hear "table," how would-- you usually say, put the spoon on the table. And the child needs to understand that it's a sentence, that it has a structure, and eventually the tenses, and eventually the constituents of that structure-- the objects and participants.
And that is a total miracle. And we pretty much have no idea how that works. As I said, there is some more recent work, and we also had some papers on the subject. But these are still hypotheses, and they don't even come close to explaining how amazingly quickly children acquire language. So again, a lot for you to do. Yeah.
AUDIENCE: So I know you were saying that our networks are only as good as our data sets. So do you think the way to work on this problem is just on the data set side? Like, if ObjectNet were 20 million images and you trained the networks that exist now on it, do you think they would have higher performance, or are the networks not capable of recognizing--
BORIS KATZ: Yeah-- to repeat your question, I don't think the issue is the numbers. I really think we need new ideas. There's always this long tail of unusual things. You will never be able to have enough examples to cover every possible angle this bottle could be at, or every possible background that you have never encountered before.
As you saw in these examples, you have absolutely no trouble seeing things you have never seen before. And right now, machines don't have this ability. I don't think it's just training. We really need to find clever ideas for generalization, or eventual forgetting, or making things specific-- now I have trouble with this word myself. But OK, go ahead.
AUDIENCE: Yeah, I have another question that is related to Alana's. So I don't have a computer science background, and it's hard for me to understand how the computer approaches the task. So, I mean, if a child has not seen something before, the way I would approach it subjectively, I would say, OK, it's a vehicle.
It has four wheels. It's big, so it's most likely used in agriculture. That's how I would guess about an object that I have never seen before-- schema learning or associative learning, trying to rely on previous experiences. So how does the algorithm of the computer even try to guess?
BORIS KATZ: Well, what happened was, 20, 30, 40 years ago, people would come up with these semantic ways to figure out how [INAUDIBLE] recognition, for example, could work. But more recently, there was this amazing success with machine learning techniques, and it was very hard to argue with success. And all they need is a bunch of instances-- this is a spoon. You have a million spoons in different backgrounds, and then a million cats and a million dogs and a million tables.
And this is the training set, after which the system, as you saw with the cello, was very quickly able-- with what they thought was [INAUDIBLE] generalization-- to figure out, oh, I saw something [INAUDIBLE] before, and it looks like what I saw before, so this is a cello. There was none of what we think of as semantic understanding.
We don't know what humans do. We have some kind of networks in our brain as well. And some people might say there is no difference, except that all these networks that currently win the machine vision competitions are feed-forward, which means that they go from low-level stuff [INAUDIBLE] to high-level stuff and never come back. Whereas actually, it is known that in our brain, there are many more wires that go back, which [INAUDIBLE] used in current architectures.
But we don't know how that works. So the modern technologies are based on numbers, on huge numbers of examples. And they're not even given parts, mostly, these days. It just works. Especially when you hear the number 97, they say, forget it.
And the same is happening with language. People were trying to do parsing, this or that. Now everything is done with real numbers, and it is more robust than previous approaches. But it has reached a plateau, and people now want to come back and say, well, let's have some [INAUDIBLE] hybrid system-- some rule-based systems and some statistics-based systems. And hopefully, the hybrid works better.
AUDIENCE: So when you say that when you approach it in a different way, so that [INAUDIBLE] problems are more generalizable, do you refer to changing the way that neural networks implement it, or do you think that maybe there needs to be an entirely different kind of model [INAUDIBLE]?
BORIS KATZ: The answer is yes.
[LAUGHTER]
My guess is all of the above. It's hard to predict what we don't know. But to me, it seems the current technology is just not powerful enough to capture all the variety of [INAUDIBLE], especially to deal with [INAUDIBLE], to even come close to what humans can do.
And therefore, some new ideas are needed here. I would say it is likely that some of the old ideas will stay. But I think we will need to augment them with many more new ideas, which we haven't discovered yet.
AUDIENCE: Hi. Has this ever been attempted prior with a different approach, or has the standard attempt always been educating through images?
BORIS KATZ: What-- [INAUDIBLE]
AUDIENCE: Has ever been a different approach--
BORIS KATZ: Approach to--
AUDIENCE: --achieve your end goal, what you're doing now?
BORIS KATZ: To vision?
AUDIENCE: Yes, for vision.
BORIS KATZ: Yeah. As I said, there were people who were saying the approach [INAUDIBLE] this is the type of object. The humans have a head and two hands--
AUDIENCE: [INAUDIBLE] only from the beginning associated with images? They never tried a different approach?
BORIS KATZ: Well, they were associated with images. But they were trying to be a little bit more semantic and rule-based. But if you want to recognize images, what do you mean by a different approach? You have to--
AUDIENCE: [INAUDIBLE] just out of curiosity.
BORIS KATZ: Yeah.
AUDIENCE: So going back to the graph that you had. One of the major trends, if you look at that progression-- one of the biggest trends is that you have a higher parameter count in terms of connections between layers. So for instance, ResNet-152 has 60.2 million parameters.
BORIS KATZ: Right.
AUDIENCE: So for everyone else, parameters are basically the values that you multiply inputs by to go from one layer to the next layer. So there's a general trend-- with neural nets being universal function approximators, as you increase the parameter count, presumably they can do better on any given task. From what I recall, based on the graph, there is an overall 45% decrease in accuracy. But as you increase the number of parameters, they were still able to become more accurate.
BORIS KATZ: Well, yeah, it's actually--
[INTERPOSING VOICES]
BORIS KATZ: Where is the slide? It is interesting, because it could have been much worse.
AUDIENCE: Oh, right. Yeah.
BORIS KATZ: It could have been, in fact-- where is the next one? Right, it could have been that this curve is flat.
AUDIENCE: Right, and so [INAUDIBLE]
BORIS KATZ: And in fact, I thought that it will be. But that means that the progress is actually being made, which is very, I think, optimistic for the whole field. But we are just nowhere where we want to be. But it looks like this idea [INAUDIBLE], whether they help.
Of course, we want it to work, and we have worked very hard to de-bias our data set. But it will never be totally de-biased. So a pessimistic view of this says, yeah, but your data set is still biased, and therefore these machines just became better at exploiting it.
AUDIENCE: Well, so coming from a-- so the labs that I work in back at UTF. I work with Dr. Mark Shaw. I don't know if you know him.
Basically, it's all applied-- supervised learning, semi-supervised learning, all computer vision dealing with object recognition. And so the general trend is just, if we make the networks bigger, they do better. That's the general correlation in the field.
And one of the other things with a lot of machine vision is this idea of transfer learning, where you take something that's trained on ImageNet and apply it to some new data set, because it has a bunch of really good feature detectors already built in because of the variety of ImageNet. So then it would not necessarily be-- I guess what I'm struggling to wrangle, at least internally, is that it's possible that what we're seeing with ImageNet is that it's not actually producing good feature detectors, which I think is what your argument gets towards. That the feature detectors-- the kernels within the networks-- are not able to operate at a level that allows them to generalize well. Is that correct?
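A minimal sketch of the transfer-learning recipe the questioner describes, assuming PyTorch and a recent torchvision are available (an illustration, not code from the talk): reuse the convolutional feature detectors learned on ImageNet and train only a new classification head for the new data set. Whether those frozen features generalize to ObjectNet-style variation is exactly the point in question.

import torch.nn as nn
from torchvision import models

NUM_NEW_CLASSES = 10                                  # hypothetical target data set

backbone = models.resnet50(weights="IMAGENET1K_V1")   # ImageNet-pretrained features
for param in backbone.parameters():
    param.requires_grad = False                       # freeze the learned feature detectors

# Replace the 1000-class ImageNet classifier with a fresh head;
# only this layer is then trained on the new data.
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_NEW_CLASSES)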
BORIS KATZ: Well, let's back off a little bit. First of all, as much as it's fun to build gadgets, at least many of us are in fact interested in how things actually happen here.
AUDIENCE: Right. Of course.
BORIS KATZ: So second, I believe all these high numbers-- whether it's 97% on object recognition or 90% on parsing or part-of-[INAUDIBLE]-- I think they all have to do with the fact that we humans are very predictable. So here's a silly comparison. Like, with my children, I text a lot.
And so at this point, I almost don't type. It pretty much knows what I am saying. So you can have an article in The New York Times saying that T9, whatever the thing is called, that predicts your next word is a system that overperforms humans.
It knows what I'm about to say, which is totally stupid, of course. But people are very predictable. Therefore, these very clever networks are capable of [INAUDIBLE].
They just count words, numbers, symbols, whatever it is. And they know what will happen next. But it has nothing to do with intelligence.
AUDIENCE: Right, yeah. Of course.
BORIS KATZ: So I don't think the performance itself tells us anything about how things are actually happening, or teaches us anything. Well, it's nice to be number one on something and write the next paper. But I do think that at least some of us should be interested in how it is done.
AUDIENCE: That was a fundamental disagreement I had with my PI at UTO-- the fact that that's what I cared about. He cared about performance.
BORIS KATZ: Right.
AUDIENCE: But the point I'm getting at is, I guess, in part: should we be giving them such domain-specific tasks, like discerning the objects in an image, as opposed to something that's more complex, which is potentially something like scene labeling, where it has to learn to actually do the classification and then rank--
BORIS KATZ: Oh, absolutely. I [INAUDIBLE] certainly believe that. In fact, some of the work in our group is to come up with these tasks that people haven't considered before, which exercise the networks and the systems. Yeah, absolutely. I think-- yeah, OK. Chris tells me that I should stop.
[APPLAUSE]