Using language to understand vision and vision to understand language (56:27)
July 14, 2016
July 13, 2016
All Captioned Videos CBMM Summer Lecture Series
Andrei Barbu, Research Scientist at MIT, discusses using language to understand vision and vision to understand language. He shows how the simple ability to compare an English sentence and a video clip can form the basis for many tasks such as recognition, image and video retrieval, generation of video captions, question answering, language disambiguation, language learning, paraphrasing, translation between languages, and planning.
ANDREI BARBU: Yeah. So what I'll tell you about today has to do with how your cognition sort of helps your perception, and how your perception helps your cognition.
So at some point a long time ago, we kind of decided to chop up intelligence into lots of different subfields. Like there are people that work on natural language processing, there are people that work on robotics, there are people that work on haptics, there are people that work on vision, there are people that work on audition, there are people like Julian that work on sort of cognition but sort of totally disembodied and in different scenarios sort of totally symbolically.
We can keep going. Like everybody kind of has their own take on it. There are like the neuroscientists that want to record from the brain, and they hope they're going to be able to figure out everything that's going on from that.
And instead, I want to sort of take a very different approach to this problem, which is, let's kind of look at what happens when the problems that you're trying to solve don't fit into any of these categories and they force you to look outside of any one of these fields.
And everything that I'll be talking about has to do with, if you take one step outside and you look at how cognition combines with perception, it turns out that a lot of the distinctions between these fields become arbitrary. And it turns out that vision actually helps you solve natural language processing problems, and natural language processing problems help you solve vision, and robot manipulation helps you understand vision and physics, et cetera. So you'll get a better sense of that as we go.
But why do we care? Let's say that you only care about one field. Let's say for a second that you're like a computer vision person and you only care like-- or a neuroscience of vision person and you only care about one thing, which is, how well does my computer vision system work? Or how good is my model of the vision-- of the whole vision pipeline inside someone's brain of how well people detect objects?
So let's say that you take a small chunk of an image and I ask you what's here. Can anyone try to guess what they see? So like it doesn't look like anything, and I've seen the original image and the patch still doesn't look like anything to me. But when you see the whole image, it like-- it makes sense, right? I mean, it's a hammer-- yeah, like, it's not quite a curtain, but like you get to see the hammer, you get to see the head of the hammer.
And what's interesting about this, it's like I didn't give you additional pixels on the hammer. You see the whole hammer here. What happened is you got the whole context of the scene and that lets you figure out what's going on.
So that at least says that if you're a computer vision person or you're a neuroscience of vision person, restricting yourself to trying to classify or decode small patches of images and trying to understand what's there isn't going to be successful. You're trying to solve a problem that's sort of superhuman.
OK. So at least we know that in order to do something as simple as object recognition, you have to have context. Well, the same thing happens in videos. If you look at videos-- I chose this one off of YouTube because the hammer isn't actually visible in like half the frames or even more. Like it looks like this in almost all the frames. But basically you don't see this. When you saw the video, the hammer just was there. Your brain was just integrating out the information.
So the idea is, the vast majority of the universe in computer vision, at least, and in sort of neuroscience of vision thinks about sort of a pipeline where I have some percept, I figure out that there's a hammer, and the knowledge about the hammer helps me figure out if someone is hammering and what's going on in the scene.
And in reality, I have a lot of top-down connections. If you talk to neuroscientists, they're going to tell you that actually in the visual cortex, there are a lot more feedback connections than feed-forward ones, and we have no idea what they're doing. But clearly they're important.
But if you think that this only happens for super fancy things like object detection, it happens even if you're trying to figure out the shading of something. So if you look at these two squares-- I don't know how many people have seen this example before. OK, some people. But if you look at these two squares, they look very different. This one looks light and this one looks dark. But in reality, the RGB values are absolutely identical.
Well, all right. That was a little weird. But yeah. The RGB values are exactly the same. What's happening is your brain's paying attention to shading, it's paying attention to the fact that this is a shadow, and it figures out even though these RGB values are the same, this thing is in shadow so it has to be light. And the other thing is in light, and so it has to be dark.
We can keep going. Some Canadian-- and I always include a Canadian flag in my talk, so here it is. If I take this and ask any kid, trace out the parts of this flag that are red, they're going to do a good job. They're going to put some bounding box around here or some information here and some information here, and now I know it's red.
But if you show it to any machine and you ask it to find a good threshold for classifying what's red in this image, you get something like this. Because it turns out, the RGB values over here are actually extremely red. The RGB values over here are more red than these values are. But you never notice. This looks red and this doesn't. And that's because you're integrating together a lot of information.
So this kind of integration thing isn't just for object detection or for action recognition. If you want to understand shadows, you have to do it. If you want to understand something as basic as what color this is, you have to do it. So there's a lot of top-down information in vision.
This is an example from Torralba that sort of many people use. Doesn't work too well on this projector, but the idea is if you look at this, it kind of looks like a house, and this maybe looks like a car, and this maybe looks like a person. I don't know, does that actually work? Yeah, OK.
Well, it turns out that this image patch is exactly the same as this image patch, it's just rotated. And you can come up with a lot of examples with blurry images where people make inferences like this.
So if top-down information helps you actually see, let's see how seeing actually helps you understand something about sort of higher-level cognition. Let's walk through a few things that people do when they're looking at the world.
Well, if I show you an image or a video, you can generate a description for me. I can ask you to check if a description is correct. Like is somebody on the left actually picking up a yellow bag in this video? I can ask you to answer a question about the video, right? So you see this image and I ask you, what color bag did the person on the left pick up? And you can tell me the bag was yellow.
You can also disambiguate, right? We can have conversations about the world, and if I say something like, Danny looked at Andrei picking up a yellow bag, you can look at this and you can figure out, oh, this sentence means Danny had the bag, it doesn't mean that I have the bag, even though either interpretation would be OK.
You can learn language, right? Like kids at some point see loads and loads and loads of examples like this. They also get to interact with the world, of course, and eventually they can talk to you. Well, when they can talk to you, they can do a lot of other tricks. Tricks that feel like they're a little bit more linguistic and less sort of vision-heavy.
For example, you can resolve coreferences. If I say something like, Danny picked up the bag in the chair and it's yellow, well, you can look at the video and you can look at the sentence and figure out that it refers to the bag, it doesn't refer to the chair. And we can keep going.
Except that this trick, for example, of resolving coreferences, you can do whether you have a video or whether I tell you a description of a scene. So like when you're reading a book and you see something like "it" and you're trying to resolve that coreference, you can kind of follow the story and figure out what does "it" refer to, what does "he" refer to in the sentence.
So you can either do it with the video or you can do it with text. But if there are problems that you can do with both language and perception, and at the same time you can do them with language and text, what else can you do if you have language and text? You can say something like, a tall man in a hallway lifts up a chair with a bag on it. Oh, there's typo there. Anyways.
Danny picked up the bag and the chair. So you can figure out if these two sentences mean the same thing or not. It's called paraphrasing. You do that all the time. Like if I asked you a question in one way versus another way, you can figure out that I'm really asking for the same thing.
You can also do common sense reasoning. You can say something like, Danny picked up the bag and the chair, and then you can ask me, who's holding the bag, who was holding the bag at the beginning of the sentence, et cetera. This is a kind of basic thing that you do anytime you read a book. You can also translate between languages. If I give you a sentence in Romanian or in some other language, you can translate it to English for me.
Now what's interesting about these problems is, if you can use a piece of knowledge for one of them, you can use it for all of them. If I can recognize a bag in a video, because I understand enough about that, well I know what a bag is, so I can translate to some other language, or I know what a bag is and I can understand if you ask me questions about bags.
And even more than that, once you learn what a bag is, I don't have to retrain you in order to do question answering about bags. No child ever asked to be told, this is what a bag is and this is how you answer questions about things.
Somehow there's some common higher-level cognition that ties all of these problems together. And so there's a continuum. There are some problems that feel as if they're very vision-heavy problems, like video captioning, and there are some problems that feel like they're very NLP-heavy like translation, and there's kind of a bunch of problems that sort of sit in between. And what I'll show you is that in reality, all of these are exactly the same.
And even more than that, there's another problem that you can do, something like planning. This is a problem that, again, you can do both on the language side or on the visual side. So you can say something like, there was a bag on the ground. Danny is now holding the bag in the room next door.
And so you can make a little plan in English. Like if you read this little story and you tell a four-year-old what happened, they can say something like, Danny walked to the bag, picked it up, and carried it out.
You can also do this from video. If I show you a video like this and I say, well, this is what I saw yesterday, and now Danny is holding the bag in the room next door, you can come up with a plan of how did you go from here to here. That's what every kid does when reading picture books.
So even in something like planning, which is sort of this very classical AI kind of problem that people normally don't think there's any perception or even any language in, there are these two different aspects.
OK. So what we're going to do for a good chunk of the talk is basically go through these different problems, all the way from recognition down to planning. And I'll show you how if you solve one of them, you can solve all of them.
What I want to show you is that if we solve this top problem of recognition-- and what I mean by recognition is I want to assign a score to a pair of a sentence and a video. So like how well does this description of this video-- how well does this sentence describe this video? We can actually do all of these problems.
So some of them seem more obvious than others. Like retrieval, all I'm going to do is I'm going to take the recognition system and apply it to a really big corpus of videos, and I'm going to take the top 10 ranked things, and that was video search. But other things aren't so obvious, like how do you acquire language or how do you do common sense reasoning or how do you do planning if you have this? So that's where we're going.
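The retrieval recipe described here (score every video against the query sentence, keep the top-ranked ones) can be sketched in a few lines. This is only an illustration: `retrieve`, `toy_score`, and the representation of a video as a set of detected labels are all assumptions made up for the example, not the system from the talk. Any black-box scorer of a sentence and a video could be plugged in instead.

```python
from typing import Callable, Dict, List, Set, Tuple

def retrieve(sentence: str,
             corpus: Dict[str, Set[str]],
             score: Callable[[str, Set[str]], float],
             k: int = 10) -> List[Tuple[str, float]]:
    """Rank every video in the corpus by score(sentence, video) and keep the top k."""
    ranked = sorted(((name, score(sentence, video)) for name, video in corpus.items()),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

def toy_score(sentence: str, video: Set[str]) -> float:
    """Hypothetical stand-in scorer: fraction of sentence words found among the video's labels."""
    words = sentence.lower().split()
    return sum(w in video for w in words) / len(words)

# Toy corpus: each "video" is just the set of labels an imaginary detector produced.
corpus = {
    "clip_a": {"person", "horse", "rode"},
    "clip_b": {"person", "bag"},
    "clip_c": {"horse"},
}
hits = retrieve("person rode horse", corpus, toy_score, k=2)
print(hits[0][0])  # the best-matching clip
```

The only thing retrieval needs from the recognizer is a number per sentence-video pair, which is why it is the most direct of the tasks to build on top of it.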
So we're just going to spend a few minutes talking about the model. I'm not going to go into any details, I just sort of want to give you the general feel for it.
So let's say that you see a video like this. Someone's coming into the frame, they're riding a skateboard. If I want to understand if the sentence, the person rode the skateboard leftward is true of this video, you have to do a few basic things. I'm not saying that you have to have sort of individual models to do this. I'm not saying that you have to do them in a particular sequence. I'm not saying that you have to find particular brain regions for them-- just logically you have to have this information available.
So you've got to figure out that something rode the skateboard, so you have to find the person somewhere. Otherwise I could've said the banana rode the skateboard. You have to find out that there's a skateboard there, you have to sort of see the motions of these objects over time, and you have to figure out, are they actually sort of moving together?
And in the end, you have to look at the relations between all of them to understand if the person's falling off the skateboard, if they're riding the skateboard, or they're just hovering above the skateboard while they're both going in the same direction.
So basically what we're going to do is we're going to take these different parts and we're going to create a model that sort of solves all of these problems jointly. And what I mean by that is, the model recognizes the event while it tracks the objects and while it detects them. And you'll see why it works that way in a moment.
So the basic idea is that you look at a sentence, you can run a parser over the sentence, and it can tell you that inside the sentence there are basically three participants that are kind of playing a role in what's being described here. There is a person and there are two horses. So we know we've got to find three things in this video, and we make three trackers.
For those of you that are more on the machine learning side, I'll just say one sentence which everyone else can ignore. Basically the way that things work is that each of these are hidden Markov models, and each of these are hidden Markov models, and there's a big factorial HMM that observes the video. But that's it for machine learning.
But the idea is, if each of these things can search over a video and try to lock on to an object, then each of these things can inform it in order to tell it what it should lock onto. So every word is basically a constraint, and each of these things is a little algorithm that searches over every frame.
And then we can build up the constraints according to the sentence. So we can basically say that whatever this thing locks onto in the video, it had better be tall and it better be a person. And whatever this thing is, it had better be moving quickly and it better be a horse. And by the way, the relation between these things should be that this thing is riding this thing.
Here we're assuming that there's a parser, and the parser essentially tells you the associations between them. What a parser will tell you is that basically there's kind of a main verb ride, and this is a noun phrase, that this is a verb phrase, that what's being ridden is quickly, that the person is tall, that the person is connected to ride, et cetera. So you get a--
AUDIENCE: --outside this and then--
ANDREI BARBU: That's assumed to exist. So you know basically how the words are supposed to be connected, and you know how many participants are in the action, and then you just connect together the participants with the words. Essentially you take your sentence and use a semantic parser, you turn it into a set of constraints. In our case, doing first-order logic, but that's not important.
Basically it says that I'd better find some x that's a person, some y that's an animal, some table, one should be picking up the other, and one of them should be on the table. And you just know the relations you're looking for, and then you build up a model. But we'll skip over the details of the model.
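To make the "sentence becomes a set of constraints" idea concrete, here is a small sketch. It assumes a hypothetical logical form for "the tall person rode the horse" and a toy "video" that is just a set of facts about candidate tracks; the names `constraints`, `facts`, and `satisfying_bindings` are invented for illustration. The model in the talk scores these constraints softly and jointly with tracking, whereas this sketch only checks them as hard true-or-false conditions.

```python
from itertools import permutations

# Hypothetical logical form: each constraint is (predicate, variables it applies to).
constraints = [
    ("person", ("x",)),
    ("tall", ("x",)),
    ("horse", ("y",)),
    ("ride", ("x", "y")),
]

# Toy "video": facts an imaginary vision system established about tracks t1..t3.
facts = {
    ("person", ("t1",)), ("tall", ("t1",)),
    ("horse", ("t2",)),
    ("person", ("t3",)),
    ("ride", ("t1", "t2")),
}
tracks = ["t1", "t2", "t3"]

def satisfying_bindings(constraints, facts, tracks):
    """Yield every assignment of distinct tracks to variables that satisfies all constraints."""
    variables = sorted({v for _, args in constraints for v in args})
    for combo in permutations(tracks, len(variables)):
        binding = dict(zip(variables, combo))
        if all((pred, tuple(binding[v] for v in args)) in facts
               for pred, args in constraints):
            yield binding

bindings = list(satisfying_bindings(constraints, facts, tracks))
print(bindings)
```

Here only one binding survives, because `t3` is a person but is not tall and is not riding anything.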
What's important is that this thing has parameters for every word. So inside every word, you have like a little recognizer for it that doesn't know anything about all of the other words. So for example, the recognizer for approach, it only knows one thing, which is, it had better see an object that's far away from something, then it sees it get closer, and eventually it should see it near the object. The recognizer for pickup is looking for one object to be near another, and they get in contact and one object is lifting the other, et cetera.
The recognizer for a person is really simple. It doesn't have any state, it just says, I had better see a feature for that object that says person. The recognizer for red is similar.
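The "approach" recognizer described above (far away, then getting closer, then near) can be sketched as a tiny state machine over the per-frame distance between two tracked objects. This is a deterministic simplification, not the hidden Markov model from the talk, and the function name and the `far` and `near` thresholds are invented for illustration.

```python
def recognizes_approach(distances, far=5.0, near=1.0):
    """Return True if the distance sequence goes far -> closing -> near.

    States: "start" (nothing seen yet), "far", "closing", "near".
    The thresholds are arbitrary illustrative values.
    """
    state = "start"
    prev = None
    for d in distances:
        if state == "start":
            if d >= far:
                state = "far"
        elif state == "far":
            if d < prev:
                state = "closing"
        elif state == "closing":
            if d <= near:
                state = "near"
            elif d > prev:
                # The object moved away again, so the approach is reset.
                state = "far" if d >= far else "start"
        prev = d
    return state == "near"

print(recognizes_approach([8.0, 6.0, 4.0, 2.0, 0.5]))
print(recognizes_approach([0.5, 0.5, 0.5]))
```

An object that is near the whole time is not an approach, which is why the machine insists on seeing the "far" state first. A probabilistic version would replace these hard transitions with transition and observation likelihoods.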
Each of these things is like a little model that has different states, and it's trying to observe these trackers, and each of these trackers basically is trying to find some high-ranking object in the video. The trick, why this works, is that we don't run it in just one direction or the other, we solve everything jointly. So the knowledge that you're looking for something picking something up actually helps you find that object in the first place. But I won't go into details about how that works, it's sort of not too important.
So let's see how we can use this. If we have some model that lets us figure out if a sentence is true of a video-- and to be honest with you, what I showed you was my particular sort of pet model. It doesn't matter what it is. If I just have a black box that lets me figure out, is the sentence true of this video? I can do a bunch of tricks with it.
I can, for example, do a retrieval. So what we did is we took 10 Hollywood movies, nominally all Westerns, although people that have actually seen these movies told me that that's not exactly true. But what we were looking for are movies where people are riding horses or interacting with horses in some way. The idea being that people on horses are relatively large in the field of view and sort of standard object detectors will do a reasonable job.
What we did is we produced a bunch of sentences on a little template, and then we built a little search engine. The idea is that you can type a query in and you can get hits for people riding horses or for people riding horses quickly or slowly. But we can find hits for sentences in videos.
You can also play the same kind of game with generation, which is kind of the opposite of retrieval. In retrieval, what I do is I give you a really big corpus of videos and I give you one sentence and I want to find out which of those videos are good hits. In generation, basically what I give you is one video, and I give you a really big corpus of sentences-- like every sentence that I can make in English, and I want to find out which sentences are reasonable descriptions for this video.
To give you a feel for this, even if you have a very sort of simple grammar of English, like I have a sentence, I have a noun phrase and a verb phrase in that sentence, I have some nouns, I have some prepositions, et cetera, even something as sort of trivial as this already produces hundreds of millions of sentences. Right now we have on the order of about 100 different words, so it's much much, much more than that.
But the idea is, if you can score a video and a sentence, you can score a video and part of a sentence. So I can figure out is there a person in this video, not just is a person picking up an object. So I can start with an empty sentence, and I can check every one-word phrase against that video. And I can figure out that actually, the most likely thing to have happened in this video is that someone was carrying something.
And then I can ask, well, how can I expand this one-word phrase out into the best two-word phrase that I can find? Now I can figure out, OK, the person was carrying something. And then I can keep going. I can keep trying to add words to this and trying to find which words are the best words in order to complete the sentence, and eventually get to a whole sentence that describes the video.
There's a little bit more to this algorithm, and it turns out that as long as you make one sort of simplifying assumption, that there's no negation or anything that kind of looks like negation in anything that you produce, you can actually find the optimal sentence that describes a video, at least as far as the model is concerned.
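The grow-a-phrase-one-word-at-a-time idea can be sketched as a greedy search. This is a simplification under stated assumptions: `generate` and `toy_score` are hypothetical names, the scorer is a made-up table of per-word evidence, and word order and grammar are ignored. It is a plain greedy search, not the optimal search mentioned in the talk.

```python
from typing import Callable, List, Sequence

def generate(vocabulary: Sequence[str],
             score: Callable[[List[str]], float],
             max_words: int = 4) -> List[str]:
    """Greedily add the word that most improves the score; stop when nothing improves it."""
    phrase: List[str] = []
    best = score(phrase)
    for _ in range(max_words):
        candidates = [(score(phrase + [w]), w) for w in vocabulary if w not in phrase]
        if not candidates:
            break
        top_score, top_word = max(candidates)
        if top_score <= best:
            break
        phrase, best = phrase + [top_word], top_score
    return phrase

# Made-up evidence for one imaginary video: positive means the word fits the video.
evidence = {"carried": 0.9, "person": 0.8, "bag": 0.6, "quickly": -0.1, "horse": -0.5}

def toy_score(phrase: List[str]) -> float:
    """Hypothetical stand-in for scoring a partial sentence against the video."""
    return sum(evidence[w] for w in phrase)

print(generate(list(evidence), toy_score))
```

With this toy evidence the search first picks "carried", which mirrors the talk's example where the most likely one-word description is that someone was carrying something.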
To give you an example of this, if you see a video like this, you can produce a sentence like, the person to the right of the bin picked up the backpack. We have hundreds and hundreds and thousands of these.
You can see videos and some sentences that are generated at the bottom, like person slowly arrive from the right, person slowly went leftward, person arrived from the right, person moved the skateboard leftward. It doesn't know the word ride-- or this particular version of the system didn't know the word ride, so it says moved. Like the person had the bicycle because they were in possession of it, person opened the bag and touched the bag, et cetera. And we did this for many thousands of videos.
The person carried something. Leftward and upward, which is mildly correct. The person had the bag. Some person quickly chased some other person rightward, which they actually do. And this person that's about to have a really bad day. There we go. Yeah. So this is a kind of sentence that you can produce.
So that's question answering. You want to be specific, you can do generation, and basically you seed your generation system with a template for the answer.
Something else that we've been doing, which is a little bit more on the language side. You're going to see that like-- we started at the top with basically a computer vision problem, like how do you recognize a sentence in a video, and we're kind of slowly moving our way down to NLP problems.
And there's this problem of disambiguation. Let's say I give you a sentence-- like if you've taken an NLP class, this is a very sort of textbook example. It's in every-- every first lecture of every NLP 101 course in the universe talks about it-- I saw the man with the telescope. You can either interpret this sentence as the person had a telescope and was spying on me, or he was looking at me while I had the telescope.
Now there are a kind of different inferences that you can make. In English, people prefer one interpretation over the other most of the time, but when you see a video or when you're in a particular scenario, you immediately switch to the right interpretation. So what we wanted to see is, can we get the visual language system to represent these sort of fine-grained distinctions in meaning? And can we get it to actually pick out the right interpretation for these different videos?
So what we did is we collected sort of a large corpus of videos, and then what we said is, well, if we have this sentence and we have these two interpretations, we just have a classification task-- which of these two interpretations applies to this video? I won't sort of bore you with the thousands of examples that we have, but it turns out that this happens in many different cases.
It happens in sentences like, Andrei approached the person holding a green chair. It happens if you have conjunctions like, Danny and Andrei picked up the yellow bag and chair. It's not clear whether I'm doing all the work and Danny's providing moral support, or whether we're both actually picking something up. Like you've all had friends that have helped you move and like they were useless.
You have distinctions that aren't actually like the surface representation of the sentence, but there's something deeper about what's going on, like the logical form-- someone put down the bags. Again, it's unclear whether there's one person or many people.
You have like anaphora-- so basically you have like this coreference of "it" that's unclear-- it's like the first example that I showed you at the beginning of the talk. There are many sort of such examples. And we filmed a few thousand videos, and we showed that the vision side can actually do a good job of this.
But to give you an idea of why, it's because even though in English you have a sentence that has two different meanings, in reality, what's different is the constraints on the objects that you want to see in the real world are different for the different meanings. So if I have something like Danny and Andrei moved the chair, if this implies that we moved the same chair, then basically I'm looking for one chair in the world, and that chair had better be constrained by the fact that Danny moved it and I moved it.
And if I have the interpretation that we moved two different chairs-- so we each moved the chair at some point, then I have to look for two chairs. They had better be different from each other, and I'm moving one and Danny's moving the other. So the vision system-- the vision language system has a sort of finer-grained interpretation than just the NLP side of things.
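The two readings of "Danny and Andrei moved the chair" can be written down as two different sets of constraints and checked against what was seen. In this sketch a "video" is reduced to a set of (mover, chair) facts; that representation and the function names are assumptions for illustration. The actual system compares likelihoods rather than true-or-false checks.

```python
def same_chair(moves):
    """Reading 1: there is one chair that both Danny and Andrei moved."""
    chairs = {chair for _, chair in moves}
    return any(("danny", c) in moves and ("andrei", c) in moves for c in chairs)

def different_chairs(moves):
    """Reading 2: there are two distinct chairs, one moved by each person."""
    return any(p1 == "danny" and p2 == "andrei" and c1 != c2
               for p1, c1 in moves for p2, c2 in moves)

def disambiguate(moves):
    """Return the names of the readings that are consistent with the observed facts."""
    readings = {"same chair": same_chair, "different chairs": different_chairs}
    return [name for name, holds in readings.items() if holds(moves)]

video_a = {("danny", "chair1"), ("andrei", "chair1")}
video_b = {("danny", "chair1"), ("andrei", "chair2")}
print(disambiguate(video_a))
print(disambiguate(video_b))
```

The sentence is identical in both cases; what differs is how many chairs the constraints ask for and how they must relate to the two people.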
So there's a question of how do you train the parameters of these models? Well, if I show-- it makes sense that if I have a model that gives me a likelihood of a sentence and a video, if I show it enough sentences and videos, I'll be able to eventually learn its parameters.
It turns out that there's a little bit more of a complication here. You either want to learn the parameters of the different words or you want to learn what you do when I give you a sentence-- like how do you break it up into components, how are the components connected together to each other-- basically how do you learn to parse them?
And sort of for the [? comp-sci ?] folks, this is something that, again, goes back 300 years. You can probably even find Aristotle talking about it. Like, what's the right sequence that you learn things in? Is it that kids first learn what apples are and what pickup means? And they sort of understand subject-verb-object type of phrases where like I pick up the apple, where there's not too much structure, and then slowly they learn more and more words that guide their ability to understand complicated sentences?
Or did they first learn something about syntax that's sort of very generic, and then they use that knowledge to say, oh, you said this complicated sentence. I only understand half the words in it, but I can use the fact that I know the structure of the sentence to understand the other half of the words?
There's a huge disagreement even today about which of the two people actually do. Whether they do sort of the lexicon first and then the syntax later, or they do syntax first and lexicon later. Chomsky and folks in [? C-Cell ?] that are sort of close to him are very strong proponents of the idea that you learn syntax first because you have a large amount of stuff built into your brain. There are many people elsewhere that try to show that in reality, objects are much more important.
I actually think that you do them both simultaneously, so it's good to piss off everybody. And there's a particular reason why, and that's because if you have a little bit of knowledge about one and a little bit of knowledge about the other, you can actually do a better job on both. So it's to your advantage to actually try to learn both at the same time.
And there's a bunch of data from child language acquisition corpora like CHILDES that shows that it doesn't fit into either one of the two categories. And definitely neither of those two universes has an account of how this actually works.
So what we've been working on is like an account of how can you plausibly learn the syntax and the lexicon in a world where you actually get to observe what's going on? So right now we've been working on learning them separately. So when we learned the meanings of the words, we assumed that we have the parser. And when we're learning the parser, we assume we have the meanings of the words. And later on, hopefully sometime next year, we're going to try to learn both jointly.
But I'll just give you a taste of how learning the lexicon works. Let's say that you see a whole bunch of videos like this. People are performing different actions, and a parent at some point says, the person picked up the chair or the person picked up the backpack or the chair approached the backpack.
Now what doesn't happen is you don't live in the regime where you get bounding boxes. Like no parent-- when you have a child, no one comes to you with a bounding box and says, put this around the teddy bear and tell them the name of the teddy bear and they're going to get it. In reality, you figure it out just by seeing the sentences and the videos.
And the sentences that people say don't have to be true of everything in the video, they just have to say something that's sort of relevant. You don't have to describe like a five-page description of a scene for a child to understand what's going on.
The reason, the intuition for why kids can do this, is that this is kind of what's called a constraint satisfaction problem. You know some relationships between these videos. You know that whatever the word person means, it had better be true of these two videos, and I don't know if it's true of this one. Whatever pickup means, then it better be true of these two videos, and I don't know if it's true of this one. Whatever approach means, it has to have a high likelihood here, I'm not sure if it has to have a high likelihood here.
So you sort of get a graph of constraints, and you can figure out the meanings of these words from pairs of videos and sentences as long as you understand the structure of this graph. And it's very easy to build. All you have to find out is, did the word backpack occur in a sentence over here and the sentence over here?
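Building that graph really is just bookkeeping about which words occur in which sentences. A minimal sketch, using the three example sentences from the talk; the video IDs and the name `word_video_graph` are made up for illustration:

```python
from collections import defaultdict

def word_video_graph(pairs):
    """Map each word to the set of videos whose paired sentence contains it."""
    graph = defaultdict(set)
    for video_id, sentence in pairs:
        for word in sentence.lower().split():
            graph[word].add(video_id)
    return dict(graph)

pairs = [
    ("v1", "the person picked up the chair"),
    ("v2", "the person picked up the backpack"),
    ("v3", "the chair approached the backpack"),
]
graph = word_video_graph(pairs)
print(sorted(graph["person"]))    # videos that constrain the meaning of "person"
print(sorted(graph["backpack"]))  # videos that constrain the meaning of "backpack"
```

Each entry says where a word's meaning must hold: whatever "person" means has to be true of `v1` and `v2`, while nothing is known about `v3`. Learning then has to find word models that satisfy all of these entries at once.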
For those that are sort of more into machine learning, this is exactly the way that Baum-Welch works, it's precisely the same algorithm. You alternate between deciding on an interpretation of the video and maximizing the likelihood of your sentences. Does that sort of make sense to everyone? Yeah? Sure. Yeah, please feel free to interrupt me if you have any questions.
So let's switch to like, how do you learn the parser? This is something that we're doing now. Everything that I've told you before we've actually done. Let's say that you have a sentence and you have a parser. And the parser gives you two different interpretations-- this is exactly that disambiguation task that we were talking about before. And I have a video.
And what we did in the disambiguation task is we picked an interpretation, right? Danny approached the chair with the bag, we chose what this means. Does Danny have the bag or does the chair have the bag?
Well now if you're a young kid and you're not sure about the interpretation, because you haven't really trained your English parser too well, you don't get two interpretations, you get a lot of interpretations. Now when you look at the video, maybe you can't pick the one interpretation that's correct, but you can kind of decide that some of these are more likely and some of them are less likely.
This is like what happens in any machine learning problem. I mean, we were calling this maybe something like distant supervision. I'm not training the system by giving it a sentence and the correct tree, what I'm doing is I'm training it at a distance by giving it a sentence and the video, and the visual language system looks at both and decides which of these things is more likely, and this becomes a reinforcement signal for training the parameters.
So I find out that I really think that Danny approached the chair with a bag means that the bag was approaching Danny while carrying the chair, and I look at the video and I see that didn't happen, and so I adjust my expectations for what the correct parse of the sentence should be.
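One way to picture that adjustment, as a minimal sketch under stated assumptions: the parser assigns a probability to each candidate interpretation, a vision scorer says how well each interpretation fits the video, and the parser's beliefs are nudged toward the combination. The parse labels, the numbers, and the interpolation rule are all hypothetical; the talk does not specify the actual update.

```python
from typing import Dict


def update_parse_weights(parser_probs: Dict[str, float],
                         video_likelihood: Dict[str, float],
                         learning_rate: float = 0.5) -> Dict[str, float]:
    """Move the parser's distribution over parses toward what the video supports."""
    # Combine the parser's prior belief with how well each parse fits the video.
    unnormalized = {p: parser_probs[p] * video_likelihood[p] for p in parser_probs}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("the video rules out every candidate parse")
    posterior = {p: v / total for p, v in unnormalized.items()}
    # Interpolate between the old belief and the video-informed posterior.
    return {p: (1 - learning_rate) * parser_probs[p] + learning_rate * posterior[p]
            for p in parser_probs}


# Hypothetical interpretations of "Danny approached the chair with a bag".
parser_probs = {"danny_has_bag": 0.4, "chair_has_bag": 0.4, "bag_carries_chair": 0.2}
# Hypothetical scores from the visual language system for one video.
video_likelihood = {"danny_has_bag": 0.9, "chair_has_bag": 0.1, "bag_carries_chair": 0.01}

updated = update_parse_weights(parser_probs, video_likelihood)
print(updated)
```

No correct tree is ever supplied here; the only training signal is which interpretations the video makes more or less plausible.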
To my knowledge, this is like the first account that's sort of cognitively plausible-- how you learn a parser if what you have are videos paired with sentences.
So let's move on. Like, we started with vision, and now learning parsers is sort of a firmly NLP task, but what if we get into more sort of NLP territory? Let's talk about something like paraphrasing, where there's no video, there's no perception at all.
If I give you these two sentences, the dark-haired man is picking up an object from the floor, and the guy in the plaid shirt is lifting a yellow chair, you can figure out that they basically sort of mean the same thing. Like I could show you a video where both of these were true, and they were both true of the same event and the same objects.
If I give you two different sentences, like the man with a chair walks away from someone, and the man walks away from someone with the chair, they're not necessarily as similar as the first two, right? So one case, the man's holding the chair, and the other case, the other person's holding the chair.
So the bottom two sentences are quite different. There's no like simple video that I can show you where they're both true of the same thing. And the top two sentences are very similar, even though the top two sentences have very different words in them, and the bottom two sentences have basically the same words and basically the same structure.
And that's the problem with paraphrasing. There's a decent community of people that work on natural language processing on paraphrasing. And the way that they do it is basically they go to Mechanical Turk, and they have people look at one sentence, and they have them write a paraphrase of that sentence. And you get hundreds of thousands of these. Then you train a big deep network that takes as input the two sentences, converts them to two vectors, and computes some function of those two vectors to tell you how similar are these two.
But this can't possibly be how humans work. Like, no adult ever sat you down when you were five years old and explained what a paraphrase was and why you should understand that two sentences are similar to each other. So somehow kids managed to do it with zero training data, and that's what we're going to do.
So let's pretend for a second you want to figure out the relationship between these two sentences. Well, if you have a visual language system, what you can do is you can go to YouTube and you can pick out videos for the first sentence. So if I say, the man with the banana was on top of the airplane, you can go to YouTube and you can give me the top 100 videos for the man with the banana was on top of the airplane.
Now this doesn't work if you use YouTube's search, because YouTube searches the text that's associated with the video, not what's going on there. To give you an example of this, if you type in approach in YouTube, you get airplanes approaching things. This always changes, but it's always airplane and approaching women.
If you type in leave, you get music videos or like it's time to leave some relationship. Like any one word verb that you type in, it's always related to someone's relationships or an airplane.
But if you have the visual language system, what you can do is you can take each of these, you can take every video, and you can ask, is this a good description of the video? And you can rank them. OK. Now what you have is you have a corpus of videos and you have a number associated with each of them, which is a likelihood. This first sentence in the paraphrase pair is very true of one video and very false of the other one.
But I can take my other sentence and I can do the same thing with that corpus of videos, and I can figure out, what's the correlation between these two? When one is likely, is the other one always likely?
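The correlation step can be sketched as follows. This is an illustration only: the scorer below is a stand-in that checks hand-written fact sets, whereas the talk's system scores real video, and every name, sentence-to-fact mapping, and video in it is invented for the example.

```python
import math
from typing import Callable, Dict, FrozenSet, List, Sequence, Set


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation; returns 0.0 if either side never varies."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def paraphrase_score(score: Callable[[str, FrozenSet[str]], float],
                     sentence_a: str, sentence_b: str,
                     videos: List[FrozenSet[str]]) -> float:
    """When one sentence is likely for a video, is the other one likely too?"""
    scores_a = [score(sentence_a, v) for v in videos]
    scores_b = [score(sentence_b, v) for v in videos]
    return pearson(scores_a, scores_b)


# Stand-in for the visual language system: a sentence is "true" of a video
# if all the facts it requires are present. Entirely hypothetical.
SENTENCE_FACTS: Dict[str, Set[str]] = {
    "man picks up object": {"man", "lift"},
    "guy lifts chair": {"man", "lift", "chair"},
    "man with chair walks away": {"man_holds_chair", "walk_away"},
    "man walks away from someone with chair": {"other_holds_chair", "walk_away"},
}


def toy_score(sentence: str, video: FrozenSet[str]) -> float:
    return 1.0 if SENTENCE_FACTS[sentence] <= video else 0.0


videos = [
    frozenset({"man", "lift", "chair"}),
    frozenset({"man", "lift", "chair"}),
    frozenset({"man_holds_chair", "walk_away"}),
    frozenset({"other_holds_chair", "walk_away"}),
    frozenset({"man", "lift"}),
]

top = paraphrase_score(toy_score, "man picks up object", "guy lifts chair", videos)
bottom = paraphrase_score(toy_score, "man with chair walks away",
                          "man walks away from someone with chair", videos)
print(top, bottom)
```

On this toy corpus the first pair comes out positively correlated and the second pair negatively correlated, even though the second pair shares far more words.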
So this works, but it only works when your sentences are actually available on YouTube. Because the sentence, the man with the banana was on top of the airplane, there's almost certainly no YouTube video for the sentence. And the longer the sentence gets, sort of the probability you'll find a YouTube video for it decays exponentially.
So what we do instead is, we kind of take advantage of the fact that you have some imagination. So you don't need YouTube when you're thinking about these sentences. I can ask you, close your eyes and imagine a video where I'm picking up a cell phone. And you can do that just fine. I can ask you, get a piece of paper and draw me a stick figure that walks up to a cell phone, picks it up.
Well, it turns out that we can do the same kind of thing with the model. I won't go into details, but the idea behind it is that if you have the visual language model, it basically learns a joint distribution between the vision side and the language side so that it can draw samples from either.
So if I give it a sentence like, the person picked up a cell phone, I can draw samples from the vision side and imagine videos where people were picking up cell phones. In our case we don't imagine videos at the level of actual pixels, but a little bit more at the level of stick figures. So we'll get like shape of the person's body and we'll get a bounding box for the cell phone, and you'll see the cell phone move up while the person is in contact with it.
And you get that for free. You don't actually have to retrain anything. So just like kids understand paraphrases without training, as long as you have imagination, you can also understand paraphrases with a visual language system. And what's cool about this is, this is a problem that has no perception, but it turns out that perception helps you solve it.
If we believe in this-- well, it turns out that we can solve a lot of other problems, but first, I think, one short sort of digression. This is the thing that I was adding at the beginning when I was listening to your questions.
You might ask yourself, OK, well there's a lot of sort of language, as in English in this talk. But other animals, they make it just fine in the real world, right? They can understand commands, they can sort of communicate with each other, they can understand that I'm picking something up. Why is it that animals can do these things if I'm saying that language is required to do them?
So what I mean by language here isn't language in the sense of English, it's more in the sense of, you have some structure of the world inside your head-- you try to impose some structure on what you see. And I'm using sort of language as a proxy for that.
So I don't know the sort of basic building blocks that you're using inside your head to represent the world and to make inferences about it. All I can observe is how you talk about the world. So I'm going to take how you talk about the world as being the proxy for the sort of mentalese that's going on inside your head.
But that being said, a lot of animals actually understand quite a lot about language. So I'm going to show you a video of a dog named Riley-- there's a paper published recently about him. Riley knows-- maybe before I show you the video, I'll tell you a bit more about the story.
So Riley knows the names of a thousand different objects-- of plushie toys. And Riley is very good at getting those objects back. Like, you can see pictures of like a big amount of stuffed toys, and he'll just like jump into it and you tell him, give me the lobster, and he comes out with the lobster. And he'll like bring you the toy that he wants to play with, it's very cute.
And what they were trying to do with this experiment is to say, well, can dogs make additional inferences about language? What they did is first they had Riley go pick up a toy, giving him the name of the toy. And you can see, he's really good at it. He goes right back, he gets the lobster, everyone's happy.
Now Riley's being told, go pick up-- I think the doctor or something like that, a toy that he's never seen before with the name that he's never seen before. He knows the names of all of the toys behind there aside from this one. He's never seen the toy and he's never seen the name.
And so you can see, he's a little confused. He's being asked to do something that he's never done before. He kind of knows where he's supposed to do it and he kind of knows what the action's supposed to be, but he's not sure. Maybe the person's crazy, like they're asking me to do something insane.
And you'll see that he's going to go out and he's going to make some eye contact to confirm, like, this isn't working out, what are we doing here, all right? OK, fine, fine. Like the person is serious about this thing, how can I make this person happy?
Then there's sort of a moment of tension. And he comes back, and he comes back with the right toy, and he actually learned the name of the toy from this example. And now if you put the toy down, he will bring you that toy when you give him that name. And he's super happy about it.
So even though animals don't have language, there's evidence that they have compositional models of the world inside their head. There's evidence that they understand physics, they understand intention. They can do the same kinds of tricks that we do quite effectively, it's just they can't communicate about them as well.
Moving on to this sort of totally different NLP task. So these are sort of things that we've done. This is something that I'm doing now and hopefully will publish in the fall with paraphrasing, but translation is something we're going to start in a few months. The idea is you can use Google translate to translate one sentence from one language to another. So you can take like a sentence like, Sam was happy, and you can translate it into Russian or Romanian.
And the way this works is basically there are these big parallel corpora. One of the original ones is from Canada, because every parliamentary discussion has to be translated both into English and French regardless of which language the person spoke, and they're both sort of equally useful. The same kind of thing happens in the European Union-- all of the languages have sort of equal legal stature, so you have to translate the sentences from the different documents very, very carefully.
And so you can use these well-aligned corpora to figure out that this word maps to this word, this phrase maps to this phrase between these different languages.
If you think about it, this is not how a human learns language. If I sit you down and I give you a book this thick of this is the German-- this is a long German sentence, this is a long English sentence, and I give you a million examples of those, you will not learn German.
Like, the way people learn German is, you go to a class and have a conversation and you ask them questions and you see some examples, and over time you learn it. So it's very different from this kind of learning.
But there's another reason why this is sort of not a particularly good account of translation, aside from the fact that it's not possible for humans to do it. It's because it makes a lot of very subtle mistakes. So in Romanian and Russian and French and many other languages-- even in Spanish, you have to specify the gender of the objects that are involved in the sentence.
So like Sam was happy could mean Sam was male or Sam was female. I don't know-- in English it's ambiguous. But in Romanian or in Russian or in French, I have to say which of the two it was.
Even more than that, in something like Romanian, if I want to avoid specifying this, I basically have to imply very strongly that Sam is human. This is not satisfying because Sam could have been the name of a dog or something like that. And so you have this kind of impedance mismatch between languages where one language will specify something that's maybe useless to another language, but maybe under-specifies a particular scenario.
Now what ends up happening with Google Translate is, you get some arbitrary output. Usually the output that you get is consistent, so it will sort of consistently specify that one actor is male and one actor is female. But it sometimes gives you an alternative, and what's unsatisfying about it is you don't know why the alternative came about, because the system doesn't understand enough to ask you like, this is a gender difference.
And not only that, if I change whether this word is male or female, it doesn't change whether this word is male or female. So it doesn't understand the concept, it just sort of saw some statistical regularity.
Now this one is kind of easy to pick up, but there are many differences that are much, much tougher. So for example, in Thai, you specify your siblings by their age, not by their gender. The main thing is it's not brother or sister, it's older sibling, younger sibling. In English we really, really love time. Like we talk about tenses all the time. In Chinese, you don't do that-- tenses aren't so important.
In this language that I've learned never to try to pronounce during a talk, you use absolute directions by default, not relative directions. So these people grew up somewhere in Central America, there was always a big landmark that they all basically agreed on, and their directions had to be relative to the landmark. And when they come to a city, they basically agree on a landmark as a community.
This doesn't work. If I say something is to the north, I kind of have to figure out what's the landmark that's right for this person and how do I express it?
Lots of languages don't distinguish blue and green. So for example, Japanese didn't do this 200 years ago, 300 years ago. They only did it when they started to interact with America and the West. Some languages don't specify colors the way that we do. So like in Swahili, there's no like particular word for red. They play the English game-- the orange game everywhere.
So like orange has sort of two meanings. There's the object and there's the color. In Swahili, I would just say the color of my pants rather than gray-- there's no particular color word.
Even worse, before we get there, even worse, in many languages, you have different boundaries for your colors. So like you've maybe had this discussion with someone from another country where like you'll say something is teal and they'll say it's blue and someone else will say it's green.
And in part, that's because different languages carve up the spectrum differently, and this can be quite important if we're having a conversation about some car that hit someone and you want to communicate what language the car was-- I mean what color the car was.
Even more than that, in something like Turkish and like many Bantu languages, you have something that's called an evidential. So you have to say why something is true and how you know that it's true. So like in Turkish, you have to say if you heard something or whether you saw something. And there are many more complicated evidential systems. And if you think about politicians, it'd be wonderful if we had this, but unfortunately English doesn't have it.
But if you imagine like how you translate from English to Turkish and how you figure out that in English, the person had seen this event, not heard about this event, there's no way if all you're seeing is millions of English sentences. You have to know more about what's going on, and you have to be able to ask a question, like, this English sentence is just underspecified-- I don't have a good way of saying this in Turkish.
So here's another way, here's what we're going to be doing in a few months. The idea is first you train a French vision system, and then you train an English vision system. They're not trained on the same sentences and they're not trained on the same videos. Just completely separate. One can take French sentences and some videos and figure out if the sentence is true of a video, and one can do the same trick with English sentences.
If I have a system that's trained to recognize French sentences, a system that's trained to recognize English sentences, I can think of the French one and I can play the same imagination game that we were playing for paraphrasing. I can imagine a thousand videos where the sentence is true and where the sentence is false.
And then I can take those videos, and I can use the English system and I can figure out what's the best description of these videos that matches the distribution of true and false values for this French sentence. So I want the description that's false when the French sentence was false, and true where the French sentence was true. So that's one way to translate.
And what's cool about this is, you don't need any training data. Once you have the French model and once you have the English model, they sort of connect together very nicely.
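The matching step can be sketched like this, assuming the imagined videos have already been scored by both systems and reduced to true/false values. The function name, the candidate sentences, and the truth values are hypothetical, and how candidate English sentences would be proposed is left out entirely.

```python
from typing import Dict, List


def translate(source_truth: List[bool], candidates: Dict[str, List[bool]]) -> str:
    """Pick the candidate whose truth values over the imagined videos
    agree most often with the source sentence's truth values."""
    def agreement(truth: List[bool]) -> int:
        return sum(a == b for a, b in zip(source_truth, truth))
    return max(candidates, key=lambda sentence: agreement(candidates[sentence]))


# Hypothetical: truth of one French sentence on five imagined videos,
# as judged by the French system.
french_truth = [True, True, False, False, True]

# Hypothetical: truth of candidate English sentences on the same five
# videos, as judged by the separately trained English system.
english_candidates = {
    "the person put down the cell phone": [False, False, True, True, False],
    "the person picked up the cell phone": [True, True, False, False, True],
    "the person approached the cell phone": [True, False, False, True, True],
}

best = translate(french_truth, english_candidates)
print(best)
```

The two systems never see a French-English sentence pair; the imagined videos are the only thing they share.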
There's some good intuition that this is essentially how humans have to do it. Like, if you learn what an English word means in one context and you learn the same French word in a different context, I don't also have to tell you that this word is actually the same thing. Like, no one has to reinforce the fact that apple really is this English word and this French word. If you've interacted with an apple and someone spoke French and interacted with an apple and someone spoke English, you figure out the correspondence between the two.
Well I won't talk about common sense reasoning, but you can imagine-- I'll just say like two sentences. You can imagine that rather than asking you a question about a video, I can ask you a question about a hypothetical scenario. Like I'm holding the cell phone and I let go of it, what's going to happen?
One way to answer that question, rather than filming it and breaking my cell phone, is to imagine what that looks like. So you sample videos from your language model, and then you answer the question about those videos. That's something we're going to be doing soon.
But I'll just tell you briefly about planning. This is what Candace works on. The idea is-- let's say that you have a video, and like I have some frames for my video. And now I have a sentence, Danny carried the backpack to the chair. Well, if I don't show you the frames of the video, this is kind of like a plan of what happened if you go from here to here.
If I show you the frames of the video but I don't show you the sentence, this is like the execution, it's like the motion plan that a robot or a human would take in order to go from the first to the last frame. Then what I can do is if I show you neither of these two-- so you don't get to see the sentence and you don't get to see the video, planning is basically figuring out what's the right sentence to describe this video that I've never seen before?
This is not how people normally think about planning. If you know anything about planning, the idea is that you have a planning domain-- so you have something like a programming language or programming language-type thing that lets a robot perform a sequence of actions, and you have some model of the world that the sequence of actions can affect, then I basically search through the space of programs to figure out what's the sequence of actions to get me to the right goal with high probability.
The idea here is, I have the ability to recognize sentences, and I have the ability to imagine videos, all I have to do is I have to imagine videos that are conditioned on the start and the end of the video having particular properties. Like looking a particular way or satisfying some story. Like I could give you this little story here and a little story here, and I can tell you there's some video that connects them, I just need you to find it. So that's what we're working on now.
But of course, like-- someone mentioned social aspects, for example, or planning or understanding the intents of other agents. This is stuff the model doesn't do at all. Basically all it knows about the world is that objects follow a particular sentence that I said. It doesn't know that there's physics, for example.
This is unsatisfying for a bunch of reasons, but one of them is, if I show a video like this where Jonathan, for the people that are in Josh's lab who know him, is picking up his pet rat or if he's picking up his pet monkey, they'll still look very different from each other. In one case, the object moved up, and in the other case, it's kind of like a bowling ball. Like when I pick up a bowling ball, I don't lift it into the air, I sort of drag it with me because I'm lazy.
Well, any kid would see both those actions and say the object was being picked up, but they look very, very different. It turns out what they have in common is that the physics behind the scenes is the same. In both cases the object is first supported by the table, and then it becomes supported by me.
And what we're doing with Jonathan and with Josh is changing the system to try to infer support relations. And I'll show you a more extreme example of this. Yeah, this is a cat trying to pick up its kitten, and it looks very, very, very different from anything that you saw before, but any child would see the first two videos and say that this is pickup-- no one ever gets confused.
What we're doing now is we're handling those first two videos, and those actually work, and what we're going to be doing in the fall is trying to generalize from one sports domain to another. Because you can figure out that this person's kicking and this person's kicking and this person's kicking even though they look radically different. So the goal is to train on something like football and test on soccer and on cycle ball, which sort of no existing system can even hope to do.
But before I wrap up, I kind of want to-- I don't want to leave you thinking that like a lot of these problems are solved. In reality, there's a lot that's missing here. Not only in terms of like formulating many other sort of interesting cognitive tasks in this framework, and not only in terms of how do you learn to perform a problem in the first place-- like how do you learn to answer questions, but there are some even more fundamental issues.
Like how do you generate coherent stories? Because everything that I talked about with generation is like one sentence. But if you want to generate a coherent story, I don't want you to say random one-sentence things about a video. I would really like them-- like, yeah. You've all spoken to people that when they tell you a story, they just like jump around randomly. Candace does that. And we tease her mercilessly in the lab for it.
So you can also try to do these kinds of things in 3D. A lot of what I talked about is 2D. Particularly if you want to infer physical properties of objects, having 3D is really important.
Some contact relationships are important-- oh, this made a reappearance. It's also important if you want to recognize that this is hold, right? That the helicopter is clearly holding the thing, but you may never have been trained to see this-- no one said the word hold, but you still got it.
Segmentation is important. I have to understand fine-grained definitions of people's hands and things like that to understand whether they're really touching. If you see a picture of me holding my hand like this, it's very likely I'm picking up the cell phone. If you see me holding my hand like this, I'm not picking it up.
Object detection is really hard. There are many corner cases. It's clear that this is the one real dog in this example. There's no computer vision system that can do this.
There's theory of mind. It's really important to know what the other agents are doing. That's in part why we're doing planning, because we want to build planners into the other agents that are out in the world.
Kids can watch videos like this, and they can understand what's going on. You can already figure out what's going to happen to our poor friend over here. You have a model of physics, but that model of physics lets you understand things like some object [AUDIO OUT] for some actors and closed for others.
You might think, OK, this is kind of an extreme example, but there's a reason why kids can figure this out. It's because I can have a cage that's open for a mouse and closed for an elephant. So even though there's this extreme cartoon, in reality, there's a real [AUDIO OUT] object that has essentially these properties.
And then it turns out that all of these map to sort of the vast majority of English verbs. Things like absolving someone of something or admiring someone, all of these require these different features like planning, et cetera. So there's sort of a very long road ahead. That's before you get to things like metaphoric extension, like the fact that I can have a hard day, or the fact that the stock market can go up, or the fact that your mood can go up or down-- hopefully it hasn't gone down too far during this talk.
I'll give you an example of something that we did with robots. This is sort of where we started [AUDIO OUT] robot that can understand the description of a house. So I tell it, I want you to build a house that has two walls and a window. What the robot's doing now is we build a house for it, and it's trying to understand what's there.
It looks at the house and it gets some things right or wrong; these are annotations that I made after the fact. So it looked at this and it got a few things wrong. It just couldn't see those things. And now it thinks to itself, how good can my model of the world possibly be? It basically uses a ray tracer to understand what could be visible or invisible.
It automatically figures out that there are some things that I could have gotten wrong. It goes, it picks another view to gather data from, it plays the same game from that view, then it realizes that neither of these two views was complete, so it goes and it integrates the information from both. That integration isn't going to be perfect either. So in a moment we're going to give it a sentence that it's going to use in order to figure out what's here.
So the intuition behind this is, anytime you look at a house, at least half that house is invisible to you, right? It's behind something. There are various ways that I can tell you this. I can tell you what's behind the house by describing it to you, or you can figure out-- you can just think to yourself, this thing is invisible, I can't possibly see what's behind this wall, I have to walk over there.
Now there are some times when you do this and some times when [AUDIO OUT] there are times when you don't have to go out into the world to gather information because it's clear there has to be something there. Like, you can't see what's underneath my laptop, but there clearly has to be a table here because if there was a hole, it would be falling to the ground. And so it uses some knowledge about physics to understand when it has to gather data and when it doesn't.
[AUDIO OUT] manipulate this structure, we can disassemble it. There are examples where we do this with sort of multiple robots cooperating, but I won't go into all of those.
I just want to sort of thank our many collaborators, including people that are here, like Boris, Shimon, and Josh, and Jonathan who is in here, and Candace collaborated on this stuff.
But yeah, I think sort of the road ahead is quite rosy, and in particular, I'm very excited by the fact that even though these are sort of very different problems in different domains, like AI and NLP and vision, if you look at it from sort of this thousand-foot view, they all look very, very similar to each other.
In reality, they all look as if they're only a few lines of code different from each other. And maybe, maybe what you actually have inside your brain is you have something that's very general purpose, and what you do is you learn like a two-line program that lets you do planning with it.
And then when you figure out that someone's doing this new task that I don't really understand, when you see someone [AUDIO OUT] past that you've never sort of recognized before, you think to yourself, what two-line program is this person running inside their head? Oh, maybe they're running this simple two-line program that's answering the question.
So maybe in reality you don't even have to hard-code these tasks in the future. Hopefully we'll get there over the next few years.