Andrei Barbu: From Language to Vision and Back Again
Date Posted:
June 9, 2014
Date Recorded:
June 9, 2014
CBMM Speaker(s):
Andrei Barbu
Brains, Minds and Machines Summer Course 2014
Description:
Topics: Importance of bridging low-level perception with high-level cognition; model system for a limited domain that can (1) recognize how well a sentence describes a video, (2) retrieve sample videos for which a sentence is true, (3) generate language descriptions and answer questions about videos, (4) acquire language concepts, (5) use video to resolve language ambiguity, (6) translate between languages, and (7) guide planning; determining whether a sentence describes a video involves recognizing participants, movement, directions, relationships; overview of system that starts with many unreliable detections, uses HMMs to track coherently moving objects and recognize words from tracks, and gets information about participants and relations from a dependency parser (e.g. START) to encode sentence structure; similar approach is used to generate sentences and answer questions about videos (combining trackers and words); examples involving simple objects and agents performing actions such as approach, pick up, put down; translation between languages via imagination of videos depicting sentences
ANDREI BARBU: Hello, everyone. I'm Andrei. So let's start with a little off-the-cuff psychophysics. This is a more extreme version of what Shimon talked about. Who can tell me what this object is?
AUDIENCE: A cat.
AUDIENCE: A program.
ANDREI BARBU: Patrick says it's a cat. Any other guesses?
AUDIENCE: [INAUDIBLE].
ANDREI BARBU: So Shimon showed you examples where, if you've got parts of objects, you couldn't recognize them. This is an example where you actually have the entire object. And you can't recognize it until you see the context. So I didn't give you any more pixels on the object. I didn't remove the blur.
Actually, the resolution of the image doesn't matter. I'll send you a high-resolution copy if you want. The reason why you can recognize this as a hammer is because you see someone hammering with it. Their pose tells you what the object is. This is very different from what people do in computer vision.
Now, your brain is so incredibly good at this that it does it all the time. The hammer isn't actually visible here about 50% of the time. But you never notice a change in density or a ghost hammer appearing and disappearing.
So let's look at what's going on. Normally, we run a hammer detector. We use the knowledge of the location of hammers. And we detect hammers. That's not what's happening. Your brain is taking knowledge about hammering and changing its mind about what it's seeing in the image. It's seeing there's actually a hammer here. Now, why would you care about this problem?
Well, you perform tasks like this, tasks that involve high-level knowledge-- sometimes encoded in language-- and perception, every time you ask someone for a cup. They have to realize what you're talking about, which cup you're describing. They may see multiple cups. They have to figure out how to ask you back, hey, which cup do you want? Or if you ask them, which shirt do you like more? They have to tell you politely why they hate one of them, or maybe both.
We also don't understand how these feedbacks work with the visual system. And this is really important. Anatomists tell us that there's this huge amount of feedback. Part of the reason why it's hard to understand is that it's so difficult, even for computer-vision models, to take advantage of this feedback. Most computer-vision models are feed-forward.
We also don't really understand linguistic representations in the brain. This is in part because we don't have a theory that connects language to perception. And the only thing we can do is sort of affect your perception and record your behavior. And ultimately, if you want a model of language acquisition, you have to have some connection to perception. Because that's, after all, what's happening there.
So what we're going to do is we're going to build models for some of these problems in some limited domain. And the basic widget that we're going to use is your ability to recognize if a sentence is true of a video. We're going to build a function that returns a number between 0 and 1. You can interpret it as a probability if you want. It's a generative model that just tells you, this sentence describes this video well or poorly.
Now, with this one ability, this one function, we can actually encode a wide range of problems. You don't have to solve many tasks with different representations. So for example, you can encode retrieval very easily. All you do is, if I give you a sentence, you look at every video and you find the videos that the sentence is true of.
Now, you might say, why would we care? After all, you could just go to YouTube. And you can type in, say, show me videos of people picking things up. And you'll get hits. Well, you can do that. I did that without safe search on, because you get much better results.
It turns out, almost every hit is about picking women up in different scenarios. So you might say, OK, I just chose "pick up," and that's not a good example. So we can pick "approach." And one of them is about how pilots approach things. And the rest are about how to approach women. It seems YouTube is mostly about approaching women.
With this one ability to just tell if a sentence is true of a video, I'll show you how you can also generate sentences that describe videos. You can also answer questions about them. You can try to acquire language. And I'll show you a short description of a model for that.
You can resolve ambiguities. So if I give you a sentence that has multiple potential meanings along with a video, you can determine which meaning I'm referring to. You can translate between languages. Now, translation sounds really weird, because it's a task that takes language as input and produces language as output. There doesn't seem to be a whole lot of video and perception in between.
But I'll show you how your ability to ground out your linguistic representations actually allows you to translate between languages. Ultimately, you can encode problems like planning, et cetera, in this framework.
So for the next little while, what we're going to do is we're going to build this scoring function from a few simple widgets, part by part. Let's see what humans do. If I show you this video and I give you a sentence like, the person rode the skateboard leftward, what do you actually have to do-- at least in principle-- in order to tell me, this sentence is true, or this sentence is false?
Well, you at least have to detect the object somehow. I'm not saying that humans do it in this order. You have to tell me, there's a person there. And there's a skateboard there. You have to track them over time to determine their relationship. And then, you look at the relationships over time and say, yeah, OK. This person was riding a skateboard.
Well, I should also say that this is very different from the standard approaches in computer vision. What computer vision normally does is extract a whole bunch of features. So Shimon talked about edge detectors and running them on single images.
People run corner detectors on large stacks of images, and they extract features like this. These are STIP features. And they basically run some sort of classifier, like a linear SVM, and they recognize events. This is very different from what we did before.
Before, we recognized the event part by part. We looked for people. We looked for different relationships. And then, we had a holistic understanding of the whole sentence and the scene. Well, the system that I described starts with objects, goes to tracks. We take those tracks and we recognize events. And ultimately, we recognize sentences.
What we're going to do is we're going to build a feed-forward system, but we're going to choose our representations carefully so that the feedbacks become very easy and meaningful. They're not hacked in. We're actually going to have one simple function that is easily interpretable, that we can actually globally optimize very efficiently.
We start with an object detector. It doesn't really matter what you use, as long as it gives you bounding boxes or something like them along with some scores. Object detectors also have lots of false positives. People, apparently, are two parallel vertical lines. Lots of false negatives-- apparently dogs look like knees.
In order to get past this, what we're going to do is we're going to take our object detector and we're going to tell it to generate way more detections than you want. We're just going to take the threshold and lower it. Usually, we get hundreds of detections, thousands. We can even run this with millions of detections per frame. I'm just showing you a few here.
Red ones are for people. The blue one was for motorcycle. You'll see we get detections on the background. We get some true positive detections as well.
Now, what does it mean to track an object? It means you choose one detection in every frame of the video and you say, this is the actual path of the object. In order to do this, we're going to take every detection in each frame of the video, and we're just going to arrange them in columns.
Every node in this column-- we're going to provide it with a score. This is just the confidence from the object detector that something's actually there. And every link is going to tell us, did the motion between frames move in the right direction? So if I think an object is here at this frame and here a little bit later in the video, did the optical flow in the video really move in this direction? Or do I think the motion is actually moving somewhere else? Objects should move in the direction of the general flow.
Now, it turns out that you can find a path through this lattice very efficiently, using dynamic programming. And what that does is maximize a linear combination of how confident I am in my detections and how confident I am that my motion estimates agree with the path of the object.
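To make this concrete, here is a minimal sketch of that dynamic-programming step, assuming per-frame detection scores and pairwise flow-consistency link scores are already available as numpy arrays. The names and array layouts are illustrative assumptions, not the actual system's code.

```python
import numpy as np

def track_viterbi(det_scores, link_scores):
    """Pick one detection per frame so that the sum of detection confidences
    plus the flow-consistency scores along the chosen links is maximized.

    det_scores:  list over frames; det_scores[t][i] is the detector's
                 confidence in detection i of frame t.
    link_scores: list over frame gaps; link_scores[t][i, j] scores how well
                 the optical flow supports moving from detection i in frame t
                 to detection j in frame t + 1.
    """
    T = len(det_scores)
    best = [np.asarray(det_scores[0], dtype=float)]
    back = []
    for t in range(1, T):
        # candidate[i, j]: best path ending at detection i in frame t-1,
        # extended through link (i, j) to detection j in frame t
        candidate = (best[-1][:, None]
                     + np.asarray(link_scores[t - 1])
                     + np.asarray(det_scores[t])[None, :])
        back.append(candidate.argmax(axis=0))   # best predecessor of each j
        best.append(candidate.max(axis=0))
    # Trace the optimal path backwards through the lattice.
    path = [int(best[-1].argmax())]
    for t in range(T - 2, -1, -1):
        path.append(int(back[t][path[-1]]))
    return list(reversed(path))                 # one detection index per frame
```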
So with this, you can build a simple tracker. We're tracking one person in red, another person in blue. And we track the motorcycle in green. You see we can track them, even though they're pretty small in the field of view, as they get far away. Eventually, we lose them, because the object detector doesn't work very well.
This person also has motion blur. They're deforming. We can even separate the two people out. We only have a single person detector. We don't have a sitting-down and a standing-up detector.
Now, what we have are tracks for each of the objects in the video. Given the track, in every frame you can extract some features. And then, we're going to have a model that observes these features. In every frame, from the bounding box, you can extract the position, the velocity, the acceleration of the object, its aspect ratio, whatever else you want to compute on that one frame. It actually doesn't matter.
Then, if you want to recognize events that involve more than one participant-- which is a lot of what we talk about every day-- you can choose one participant, say the agent; choose the other participant, say the instrument; generate a feature vector for each of them independently; concatenate them together; and then add in some additional features that tell you about their relationship: the distance between the two participants, their orientation, the derivatives of those properties, whatever else you want to compute on these two tracks.
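As a rough illustration of the kind of feature vector being described, here is a sketch that assumes boxes are given as (x1, y1, x2, y2) tuples; the exact features the real system uses may differ.

```python
import numpy as np

def box_features(box, prev_box):
    """Per-participant features from one bounding box: center position,
    velocity relative to the previous frame, and aspect ratio."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    pcx, pcy = (prev_box[0] + prev_box[2]) / 2.0, (prev_box[1] + prev_box[3]) / 2.0
    aspect = (box[2] - box[0]) / max(box[3] - box[1], 1e-6)
    return np.array([cx, cy, cx - pcx, cy - pcy, aspect])

def pair_features(agent_box, agent_prev, patient_box, patient_prev):
    """Concatenate the two per-participant vectors and append relational
    features: the distance between centers and their relative orientation."""
    fa = box_features(agent_box, agent_prev)
    fp = box_features(patient_box, patient_prev)
    dx, dy = fa[0] - fp[0], fa[1] - fp[1]
    relation = np.array([np.hypot(dx, dy), np.arctan2(dy, dx)])
    return np.concatenate([fa, fp, relation])
```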
Now, we have tracks. And from those tracks, we have a time series of feature vectors. We're going to observe these feature vectors using a hidden Markov model. Every word is going to have an associated HMM that just tells us, is this track moving in accordance with our understanding of this word?
Hidden Markov models just have state transition matrices. If you're familiar with finite state machines, they're just like an FSM, but you can have probabilities associated with transitions and with what they observe. It's a very mechanistic model. For every feature vector, you have to choose which state you're in. And then, you check, do my observations match what I actually saw in the video-- the features that I extracted?
Of course, we want multiple words. And all we're going to do is we're going to build a different HMM for every word in our lexicon. If you want to recognize what happened-- so say you want to figure out, what did the person do? Pick up? Or approach? Fly? --et cetera. You just try every single word.
Now, what does it mean to actually recognize something with a hidden Markov model or a finite state machine? It's extremely similar to how the tracker worked. You take every state of the model at every frame, and you arrange them in the columns of a lattice.
Every node in this lattice tells you how confident you are that you're actually observing what the state wants to see in that frame. And every link tells you the probability of jumping between different states of the model. And then, at the end, what you do is you maximize the linear combination of these two quantities.
How confident am I in my observations? And how confident am I in the dynamics of the model, that they're being satisfied? And you can do this again with dynamic programming.
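In code, scoring one word HMM against one track's feature sequence is the same Viterbi recursion, just in log space. This is a generic sketch under the assumption that per-state log observation likelihoods have already been computed; it is not the actual system's implementation.

```python
import numpy as np

def word_score(log_obs, log_trans, log_init):
    """Viterbi log-score of a word HMM against one track.

    log_obs:   (T, K) array; log_obs[t, k] is the log-likelihood of the
               frame-t feature vector under state k.
    log_trans: (K, K) array of log transition probabilities.
    log_init:  (K,) array of log initial-state probabilities.
    A bigger return value means the track matches this word better.
    """
    T, _ = log_obs.shape
    score = log_init + log_obs[0]
    for t in range(1, T):
        # best previous state for each current state, plus the new observation
        score = (score[:, None] + log_trans).max(axis=0) + log_obs[t]
    return score.max()

# Trying every word in the lexicon and keeping the best match:
# best_word = max(lexicon, key=lambda w: word_score(obs[w], trans[w], init[w]))
```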
So here's an example of this in action. This is just a feed-forward system. It just tries every word. It tries to match it to the video. And it generates sentences independently of each other. You see you can track the objects. The person carried something. The person went away.
It doesn't deal with objects entering or leaving the field of view, so the tracks will remain. The person chased some other person. We can also handle camera motion with this. So the person slowly chased the car rightward. And in case you're wondering where these humorous videos came from, they're by the Department of Defense. And they were meant to help us catch terrorists.
[LAUGHTER]
You think I'm joking. But what we did is we saw this [INAUDIBLE]. We didn't really see sentences. But you can see how, if I detect multiple words, I can build some sentences out of them. You saw we had objects, we got tracks, then we got events, and then we got sentences. And now, all we have to do is figure out how to break each of these barriers and add in feedback.
In that sense, what we're doing sort of promotes a little bit of a higher level. We have a lattice for our tracker. And we have a lattice for our word. We perform a MAP estimate so we find the best path in one lattice. And then, we use it in order to find the best path in the other lattice.
Essentially, what we have is a factorial hidden Markov model. And we're going to perform inference on the whole model at once by taking the cross-product between these two lattices.
In other words, rather than first finding the best track and then using that knowledge in order to find the best state sequence for our word model, we're just going to move this maximization out, which is usually the opposite of what you do. Usually, you move maximizations in. But it's OK, because we still preserve the property of the problem that makes it easy for us to run this dynamic programming.
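Here is one way to picture that cross-product lattice in code: each joint node pairs a detection with an HMM state, and the score adds up the detector confidence, the flow link, the word model's observation of that detection, and the word model's transition. This is a sketch for a single tracker and a single word, with assumed numpy inputs; the real system takes the product over several trackers and words at once.

```python
import numpy as np

def joint_score(det_scores, link_scores, log_obs, log_trans):
    """Joint Viterbi over the cross-product of a tracker lattice and a word HMM.

    det_scores[t]  : (N_t,) detection confidences in frame t
    link_scores[t] : (N_t, N_{t+1}) flow-consistency link scores
    log_obs[t]     : (N_t, K) log-likelihood of detection i under HMM state k
    log_trans      : (K, K) log transition matrix of the word HMM
    """
    score = det_scores[0][:, None] + log_obs[0]            # (N_0, K)
    for t in range(1, len(det_scores)):
        prev = score[:, None, :, None]                     # previous (i, k)
        move = link_scores[t - 1][:, :, None, None]        # tracker link i -> j
        trans = log_trans[None, None, :, :]                # HMM transition k -> l
        obs = (det_scores[t][None, :, None, None]          # detector confidence of j
               + log_obs[t][None, :, None, :])             # state l observing detection j
        score = (prev + move + trans + obs).max(axis=(0, 2))   # best over (i, k)
    return score.max()
```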
So what this lattice looks like is, in every node, we have to select the state for our model and the detection that we want to track. So let's see this in action. I'm going to show you the same video twice. The best detection, by far, according to state-of-the-art person detectors is that window.
On this side, I told the model, look for something approaching this person. You can see that just the knowledge that a ball will approach someone-- I didn't draw a bounding box around this person or anything like that-- was enough to constrain the model so that it knows, this is the right person. And it can ignore the window, which actually has a much better score.
So your knowledge about the motion of objects-- the kinds of things Josh talks about, like Spelke objects-- is enough to get you to figure out that moving objects have some coherency, but isn't enough in order to allow you to detect more complicated things like "approach" and to deal with objects that never move.
AUDIENCE: All right, I didn't understand. Did it detect it correctly before the ball actually appears?
ANDREI BARBU: Yeah.
AUDIENCE: [INAUDIBLE].
ANDREI BARBU: Right. So the property of this algorithm is that it maintains its uncertainty throughout. In essence, the way that it works is there's an exponential number of paths through these lattices.
But the dynamic programming part says, it's OK. I don't need to maintain an exponential number. I can maintain a linear number of paths. But I won't lose the optimality. So what it does is it can use its knowledge from the end of the video in order to correct what it mistakenly thought happened at the beginning of the video.
AUDIENCE: It uses the whole information?
ANDREI BARBU: Yeah.
AUDIENCE: OK.
ANDREI BARBU: This is the whole information. But it still has the property that you can ask it at any point in time, what's your highest confidence detection at this point? So you can run it online. Just, over time, it might change its mind about what happened to the path. Yeah?
AUDIENCE: In this example, is the system assuming there's going to be one thing that we're tracking?
ANDREI BARBU: It's looking for one thing. What it's looking for is a ball approaching a person. It's not assuming that there aren't other people and other balls. It actually won't even care if there are, as long as they're not approaching each other.
AUDIENCE: And what is the left one tracking?
ANDREI BARBU: The left one is just a regular tracker. It doesn't have any knowledge about events. It is just looking for high-scoring boxes that move coherently.
AUDIENCE: It's looking for movement.
ANDREI BARBU: Right. So there's no motion here at all. That's the whole point. Sometimes objects don't move, and then motion coherence doesn't help you fix the tracker's mistaken belief.
AUDIENCE: How does tracking work if your eyes are tracking something? So, like, it's stable and the background's moving?
ANDREI BARBU: Well, what happens-- at least in this simplistic model-- is it computes an average optical flow vector for the entire frame and, then, subtracts it from everything. So it's assuming that the entire camera is moving.
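That camera-motion compensation is a one-liner; a sketch, assuming dense per-pixel optical flow as an (H, W, 2) numpy array:

```python
import numpy as np

def stabilize_flow(flow):
    """Crude camera-motion compensation: subtract the frame-wide average
    optical-flow vector, leaving only motion relative to the background."""
    camera_motion = flow.reshape(-1, 2).mean(axis=0)
    return flow - camera_motion
```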
AUDIENCE: For example, the person on the motorcycle-- is it actually easier to track the motorcycle and person separately? Or why don't you track them as a combined unit?
ANDREI BARBU: It's much better if you track them together. But the only way to track them together is to have some knowledge about the event or to hard-code in the fact that a person riding a motorcycle looks like a particular [INAUDIBLE]. So that's exactly what this would do. I can even give you examples of it doing that.
All right, so what we did is we sort of broke down this barrier between the tracker and the event recognizer. And it was in a very natural way. We didn't force any additional feedbacks into the system. Things just composed together nicely. Now, all we have to do is build up the meanings of sentences out of these parts.
What we did in the previous example was we had one track for the ball, one track for the person, and one HMM for the word "approach." What we did is we took a cross-product of a ball-tracker lattice, a person-tracker lattice, and an "approach" word lattice.
You can just add more lattices to this cross-product. It doesn't change the structure. It just makes the problem a little bit harder. So now what we're doing is we're simultaneously trying to look for multiple objects that are connected to each other by multiple high-level concepts.
Let's see how we can actually build these cross-products. Because a sentence isn't just a bag of words. It actually has some internal structure. Some words refer to some participants. Participants play different roles. So you can use something like Boris's START system in order to get a dependency parse.
What it does is it looks at each of the words in the sentence. And it can tell you that there are actually two horses that participate in the event described by the sentence and one person. So from the dependency parse, we can determine we need three tracker lattices. And then, we just add word lattices for each of the other words. In our case, we ignore determiners.
Now, if you have a verb like "ride," the dependency parse tells you, "ride" is connected to the agent and to the patient. And it even tells you the direction of the connection. It tells you that the agent is riding the patient. So the person is riding the horse. This is important. If we're really encoding the meanings of sentences, we really don't want to find horses riding people.
Although, I don't have good examples of this. It goes on to tell us which words modify "person," and that "quickly" is modifying the horse-- although, as the linguists in the room will note, that's definitely not true. It will tell us that the horse is moving leftward and the horse is moving away from the other horse. And it knows which horse is moving away from which.
[WHISTLE]
Now, if I really want to show you that the system actually understands sentences and understands their structure and is correctly encoding them, one way is to devise a task that is maximally hard. We're going to have sentences that only differ in one word or one part of speech. Then, I'm going to give it the same video with two events that happen simultaneously and have it choose one event or the other event, based on the sentence.
So this is the same video shown twice. In the left side, I told it, look for a person picking up an object. And on the right side-- or your left side-- I told it, look for the person putting down an object. And even before I played this video, you can see that it correctly chooses the person. OK, now he [? stands up. ?] Then put down the bin.
We can also play this game with many other parts of speech. All of this stuff is not event specific, even though we started by thinking about event recognition. So we can say, the backpack approached the bin versus the chair approached the bin. We can say, the red object approached the chair versus the blue object.
And we can play this game with other things, like prepositions. So the person to the left of something versus to the right of something. And here, we refer to an object that actually then plays a role in this event. So we can refer to static objects, dynamic objects.
So we've sort of seen how we can build the scoring function-- at least for this domain-- how we can have a score for how well a sentence describes a video. Let's see how we can build up all of these other tasks out of it, starting with retrieval.
In retrieval, I can give you a corpus of video. In this case, I'm going to give you 10 Hollywood movies. They're all nominally Westerns. And before someone complains, this is what IMDb considers to be Westerns. I'm going to shove all the blame onto them.
What I did-- this is about 40,000 frames of video, once it's [INAUDIBLE]. And I processed 200 different sentences from the following template. I can make this a bit smaller. So these are sentences like, the horse led the person quickly rightward. Or the horse rode the person-- although none of the hits for that one are positive. The person approached the horse quickly from the left, et cetera.
We have three verbs, two nouns-- person and horse. And the reason why we chose people and horses is because they tend to be large in the field-of-view. And even though they're deformable, they're not all that deformable, given that they're usually imaged from the same angle.
So now, we can run a query like, the person rode the horse. And you can see that we get lots of true positives. What we did is we ran each of these 200 queries. And then, we had people annotate the top 10 results and tell us how well the system performed.
Now, we can also add more interesting sentences like, the person rode the horse quickly. You can see we get very different kinds of hits. We could say, the person rode the horse slowly. Now, bear in mind that the system has never seen any of these sentences as part of its training data. It's only seen the meanings of individual words.
Now, we could do something like, the horse approached the person. See how we get lots of hits of horses approaching people. And we can do, the person approached the horse. And if you want, this is online. You can play with it. I put the link in the slide.
All right. Now, retrieval seems pretty straightforward. You have a fixed set of videos. And all you're doing is checking to see how well each sentence matches each video. Generation looks a bit more complicated. All of a sudden, I have a grammar that can generate a large number of sentences. And somehow, I have to find a good sentence to describe my video.
This grammar looks like a toy grammar. And, well, it is by linguists' standards. It only has four nouns, a few verbs, a few adverbs, some prepositions, and a few adjectives. But even this toy grammar generates about 150 billion sentences. So searching for the best sentence is a pretty hard task.
We're going to take advantage of a property of this entire system in order to very efficiently find the best sentence the system can produce to describe a video. Now, you can think of every hidden Markov model that you add for each word as an extra constraint on the tracker. Right?
The trackers can't possibly track objects better than they do without any constraints. And so, the more words you add to a sentence, the more constraints you have and the lower the score-- so the worse the match.
So what you can do is you can start off with no sentence. And you can try every one-word sentence or every one-word sentence fragment. You find the best one, then expand those outward and build your way up to a sentence. In essence, this is like a lazy A* search through the space of all possible sentences, as defined by a grammar.
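A minimal sketch of that kind of best-first search over sentence fragments, relying on the monotonicity just described (adding a word can only lower the score, so a fragment's score bounds every sentence that extends it). The grammar interface here is hypothetical.

```python
import heapq

def best_sentence(grammar, score, start=()):
    """Best-first search for the highest-scoring complete sentence.

    grammar.expansions(fragment) -> longer fragments (assumed interface)
    grammar.is_complete(fragment) -> True for a full sentence (assumed)
    score(fragment) -> how well the fragment matches the video; never
                       increases as the fragment grows
    """
    # heapq is a min-heap, so push negated scores to pop the best fragment first.
    frontier = [(-score(start), start)]
    while frontier:
        neg_score, fragment = heapq.heappop(frontier)
        if grammar.is_complete(fragment):
            return fragment, -neg_score   # first complete sentence popped is optimal
        for longer in grammar.expansions(fragment):
            heapq.heappush(frontier, (-score(longer), longer))
    return None, float("-inf")
```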
Also note that our grammar doesn't have negation. This is really important. And then, we can generate something like, the person carried the bucket. Here's an example. We have hundreds of these if you want to see more. And from this, you can generate, the person to the right of the bin picks up the backpack.
We can also answer questions with this. So we have the video. We can ask a question, like, what approached the bin? And it'll produce the answer, the chair. Now, it's not hard to see how it might produce this answer. You just generate the sentence. But sometimes it gets more complicated.
If you want to ask, who approached the bin?-- if a person came up to you and said, "the person," and that was their only answer, you'd be pretty pissed off. You would think that person was being rude, right? There are multiple people in that video. And you are asking for information about, specifically, who actually performed the action. So it's insufficient for the system to produce this sentence, even though it's actually the best-scoring sentence on this video.
What we're going to do instead is we're going to feed the system with a template. We're going to tell it, I don't want you to generate just any part of speech. I want you to generate a noun-phrase. And I want you to generate this noun-phrase in the context of the fact that it has to approach a bin.
So it's going to run the generation system biased by the fact that it knows-- from you-- that there's definitely someone approaching the bin. This allows it to have much higher performance. It imports the knowledge from the context of the conversation-- at least this tiny, one-sentence conversation.
Now, what it does is it says, well, this part that I'm trying to generate is nonspecific. It tries it against the whole video and sees how well it matches other parts of the video. So the fact that "the person" is true of many different regions of the video means this is a bad answer.
And then, it keeps searching for a noun phrase that's informative-- so it's true of one specific region-- while at the same time having a high score, so it's actually true. And there's a trade-off between these two. And ultimately, it produces a sentence like, the person carrying a chair approached the bin.
And we can produce either of these-- so the person to the left of the bin or the person carrying a chair-- as an answer. It actually tends to prefer the person to the left of the bin, depending on how you tweak its parameters. Mostly because chair detectors don't work very well, so they tend to have much lower scores.
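The trade-off between being true and being informative can be written as a single score; a hedged sketch, where score_in and video.regions are assumed placeholders for the system's actual interfaces:

```python
def answer_score(phrase, query_region, video, score_in, specificity_weight=1.0):
    """Score a candidate answer phrase: reward matching the region the
    question is about, penalize also matching many other regions."""
    truth = score_in(phrase, query_region, video)
    elsewhere = max((score_in(phrase, region, video)
                     for region in video.regions if region != query_region),
                    default=float("-inf"))
    return truth - specificity_weight * max(elsewhere, 0.0)
```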
So we've seen how we can-- just with one widget-- perform retrieval and generation, which involves this whole search over sentences. Mm-hmm?
AUDIENCE: Sorry. About the last example, so it sort of asks what approached the bin. I would think the chair would be, probably, the correct answer.
ANDREI BARBU: Yeah.
AUDIENCE: But I can see how the person in the red shirt or the person on the left would also be correct. Can you--
ANDREI BARBU: It's true. That's not handled. It's whatever constraint you want to place on it ahead of time. There's no long-term reasoning about precisely what you mean. You can also perform language acquisition with this. Basically, this is just a big factorial hidden Markov model. In the same way that you can perform EM on a hidden Markov model to learn its parameters, you can tune the parameters of this model.
I can give you a whole bunch of sentences, a whole bunch of videos paired with those sentences. And I just ask you, find me the best parameters of the words that explain my videos. I won't go into details, for time reasons. But I'm happy to talk to people afterwards.
You can also resolve ambiguities. Now, I should say that things at the top are things that we have models for and we've done. And as we go down this list, things become less well-known and less implemented. So we have implementations for everything up to here, at least in the domains that we consider. And these ones are problems that we're working on now.
So if I give you a sentence that's ambiguous-- maybe it has two meanings, like, Andrei picked up the backpack and then picked up the chair. It's not exactly clear if I picked them up both simultaneously; if I picked up the backpack first or the chair first or the other way around. I can give you a video with that sentence. And then, given the video, you can determine which of these different interpretations or parses is correct.
It actually turns out that only after the fact did we realize that there's actually a deep connection between acquiring language and resolving ambiguity. It turns out that if you want to learn how to parse sentences, you can use this knowledge about resolving ambiguities to improve your parser. Right?
In essence, learning a parser just means I start with a parser that gives me a very wide and bad distribution over the set of possible meanings for my sentence. And all I have to do is look at my videos to figure out which of these meanings are more likely, given the video, and then use that to reinforce the parser. That's one of the things that we're working on now.
Now, let's get to translation. Well, you might ask yourself, what's the point? I can go to Google Translate. I can type in something. And I'll get a nice translation underneath it. And there was that very nice demo by Microsoft recently where they did real-time translation over Skype.
Well, let's look at how well these systems work. If I give you a really simple sentence, Sam was happy-- I actually typed this into Google Translate. What Google does-- and what all these statistical systems do-- is they get parallel corpora, things like the Hansard Corpus from Canada, if you're trying to translate from English to French, or lots of UN corpora, things like that.
And this is what you get out, in English and in Russian. Are there any Russian speakers here aside from Boris and Evgeny, who can tell me what's different about these two sentences? Like, what additional information is encoded? Maybe there are no other Russian speakers here. OK, fair enough.
So it turns out that these sentences specify that Sam is male. And I don't know about Russian, but at least in Romanian, I would have to go through a huge amount of contortion in order to not specify that Sam is male in this sentence. Even more than that, while I went through those contortions, I would have to specify that Sam is a person. Because Sam may well be your dog.
Now, if you want to specify they're a female, you can do that. But there's no good way of saying something in between the two. But these systems don't have the ability to tell me, hey look, there's something under-determined about your sentence. Now, lest you think this is just a gender issue, in Thai, for example, you specify your siblings not by their sex but by their age.
In English, you specify relative time almost every time you describe an event. So if I tell you, I'm having dinner with Patrick, you know I'm having it now or in the future. But it's pretty clear that I did not have it in the past. In Chinese, you can say a perfectly reasonable sentence that doesn't tell you about the temporal relationship between that event and the present. Imagine how you would say that in English. It would be very awkward.
An aboriginal language in Australia-- whose name I will not attempt to pronounce-- uses only absolute direction. And it's not an exception. Many languages do this. So even when you're referring to points on your own body, you have to refer to north, south, east, and west. Now, it would be very difficult to use something like the statistical translation approach in order to translate between English and this language.
Many languages also don't distinguish blue and green. This is a really perceptual issue. It's not just some categorical information that's missing. And even more than that, many languages' color words don't map very well to other languages. So for example, if you look at Japanese from 200 years ago, the words that they use for blue and green today had completely different meaning back then. They've just repurposed them, it's presumed, to talk to us.
Now, you can imagine a very different translation mechanism that just uses your ability to determine if a sentence is true of a video. So I give you a sentence. Now, what I described before-- even though I didn't put it in these terms, just to make things understandable to everyone-- is a graphical model that's generative.
So what I can do is, given a sentence, I can actually sample many videos from it, many tracks of objects that move in time. I have a corpus of videos. Now, all I have to do is run generation. I run generation, and I find the best sentence that, on average, describes all of these videos in another language. So just by having the ability to recognize a sentence paired with a video in two languages, you can translate between them.
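A sketch of that translation-by-imagination loop, with sample_video, score, and candidate_sentences as assumed placeholder interfaces (in practice the lazy generation search above replaces the brute-force enumeration):

```python
def translate(sentence, source_model, target_model, n_samples=20):
    """Translate by imagining videos: sample object tracks the source-language
    model considers typical of the sentence, then pick the target-language
    sentence that best describes those imagined videos on average."""
    imagined = [source_model.sample_video(sentence) for _ in range(n_samples)]

    def avg_score(candidate):
        return sum(target_model.score(candidate, v) for v in imagined) / len(imagined)

    return max(target_model.candidate_sentences(), key=avg_score)
```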
You can also perform planning tasks. So what you can do is you could take a video or a sentence that describes the initial state of the world, a video and a sentence that describe the end state of the world, and you can imagine videos and sentences that connect the two. And it's very much like the translation task.
So rather than telling you in more detail about planning, I'm going to tell you about something that I think is somewhat more interesting, which is why all of this doesn't work-- or the limitations. First of all, everything isn't 2D. That's a massive problem.
If you're trying to find events in the wild, people are not very cooperative. They don't always move in the camera plane. They don't do everything in plane. So they often pick things up in front of me, from your viewpoint. And all the object representations here are bounding boxes.
We're working on integrating with a person detector and doing inferences, like using the fact that I had to touch this with my hand to constrain the location of my limbs when I pick it up. But we're not quite there yet.
We also want to generate coherent stories. And I'm sure Patrick will tell you much more about the importance of stories. But there's a small problem with the system: the run-time is exponential in the number of words in your sentence. It's a minor issue if you're just talking to people sentence by sentence, because we're not David Foster Wallace or Proust all the time.
But it's a real problem if you're trying to recognize an entire paragraph, and every time I add an additional word, you have this exponential blow-up. So we've been working on a model that eliminates this blow-up. It is actually more neurally plausible. But in the interest of time, I'm not going to go into it now. It also has no notion of forces or contact relationships. It just knows object proximity.
There's no social reasoning. It doesn't do physics. It doesn't do theory of mind. It doesn't infer anybody's goals. Its architecture is, maybe, neurally plausible, in the sense that all it does is dot products, additions, and maxes. And it has a very structured control flow.
And if you asked me, actually, two days ago, whether this model made any useful predictions or if it's actually useful for looking at neural data, I would definitely have said no. But after talking to [INAUDIBLE], it turns out that you can use this model for some really interesting stuff on some of this primary data. So we'll see how that works out.
All of this means that we can't handle the vast majority of English verbs. We can handle verbs and events that involve the gross motion of objects. And all of this is before you get to things like metaphoric extension, which we use all the time in speech. But what I think we have is the beginning of a connection between language and vision, where we're trying to ground out the meanings of words in what we can actually see.
And the other nice part about this model is that all we had to do was address this one problem at the top, and many other tasks fell into place. This was a big deal yesterday during the metrics meeting, where everybody was saying that it's going to take us forever to tackle any problems in AI if all we do is address one small task after another without sharing representations between them.
Now, in the last few moments, I also want to tell you about one other neat thing. So for this language acquisition task, it turns out that it's actually much easier to acquire the meanings of words if you're performing a harder task. So if I give you a video and I give you a word, say a verb, that verb can refer to many different things in the video. But if I start giving you long, complicated sentences, so richer and richer input, this problem becomes easier to solve. This is a case where having a more complicated representation with more constraints actually leads to much faster and more reliable inference, which is relatively unusual.
AUDIENCE: [INAUDIBLE]. Actually, I guess we're [INAUDIBLE] for time.
ANDREI BARBU: Sure. All right, any questions?
[APPLAUSE]
Sure.
AUDIENCE: It seems like if you have better recognition, you'll get better at all of these [? actions ?] like that. Is that true? And do you think that, then, really focusing on recognition--
ANDREI BARBU: Right. Better recognition buys you better performance, but it doesn't tackle any of these issues. So it's both that the low-level perception is unreliable and that we don't have good representations and inference algorithms for the vast majority of what people actually do.
AUDIENCE: Well, but if you had recognition that worked out of the camera plane, for example, or something like that, right?
ANDREI BARBU: Well, that's true. But you still can't do things like--
AUDIENCE: A lot of these seem like recognition issues. Especially, the first one.
ANDREI BARBU: OK. This one definitely is a recognition issue. If I had a better person recognizer-- but I think the deeper questions are, like, social reasoning, or theory of mind, or inferring goals. Because if you looked at many of these verbs, like, how would you figure out that someone's cheating? Or that two people are having a race? It's not an issue of just recognizing their motions. You have to do a lot more on top of it. Mm-hmm?
AUDIENCE: Can you say something about making neural predictions?
ANDREI BARBU: OK, so [INAUDIBLE]'s idea was, because this is a generative model at the level of words, and it's just a way of creating a factorial HMM that represents the structure of a sentence, you could imagine that, if you hypothesize a certain brain region is responsible for encoding some linguistic meaning, you would take single-cell recordings from that brain region, then train a model like this and see if it fits well, taking in this structured knowledge.
Because there's already some evidence that, with the hippocampus, with hidden Markov models, you can fit some data really well. Then you could imagine co-training on both the vision and the neural data and seeing improved performance. Mm-hmm?
AUDIENCE: Is there a way to have a fluency about the grammar?
ANDREI BARBU: Yeah.
AUDIENCE: That's the other part, acquiring language?
ANDREI BARBU: Yeah. So our idea is to use something like CCGs. This is Mark Steedman's work. It's been continued and expanded on by Luke Zettlemoyer, a former PhD student who's a professor at Washington now. The basic idea behind CCGs is learning complicated rules is really hard.
So what they did is they have a way of parsing that fixes the rule-set. But all you learn are the types of the tokens. And then, you just have a distribution over token-types. And it's just a matter of learning distributions. That's not built-in right now. But that's the idea. Yeah?
AUDIENCE: So how well does the model do to detect something if it's deformed, say, in the video two times?
ANDREI BARBU: If it's deformed?
AUDIENCE: Yeah, for example, one example that we deal with-- although we do the experiments with rats-- is that we track their motion in mazes. And because of the angle and occlusions with wires and stuff like that, sometimes you partially lose sight of the animal. So I'm thinking that maybe this is a way in which integration [INAUDIBLE]. And I just wondered how well the tracking would be.
ANDREI BARBU: As long as you have some weak signal that tells you the rat is at roughly this location, it's OK to have many other false positives. So it can deal with a large number of false positives as long as it doesn't have false negatives.
But yeah, it's good at integrating information over time. And it could probably also integrate information about your space, if you add a maze or something like that. But there are a whole bunch of other particle-filtering based trackers that tend to work better if you have a maze and a lot of constraints.