Towards Machines that Perceive and Communicate
June 30, 2017
May 26, 2017
Kevin Murphy (Google Research)
All Captioned Videos Brains, Minds and Machines Seminar Series
Abstract: In this talk, Kevin Murphy summarizes some recent work in his group related to visual scene understanding and "grounded" language understanding. In particular, he discussed the following topics:
Our DeepLab system for semantic segmentation (PAMI'17,
Our object detection system, which won first place in the COCO'16 competition (CVPR'17, https://arxiv.org/abs/1611.10012).
Our instance segmentation system, which won second place in the COCO'16 competition (unpublished).
Our person detection/pose estimation system, which won second place in the COCO'16 competition (CVPR'17, https://arxiv.org/abs/1701.01779).
Our work on visually grounded referring expressions (CVPR'16, https://arxiv.org/abs/1511.02283).
Our work on discriminative image captioning (CVPR'17, https://arxiv.org/abs/1701.02870).
Our work on optimizing semantic metrics for image captioning using RL (submitted to ICCV'17, https://arxiv.org/abs/1612.00370).
Our work on generative models of visual imagination (submitted to NIPS'17).
Kevin explained how each of these pieces can be combined to develop systems that can better understand images and words.
Bio: Kevin Murphy is a research scientist at Google in Mountain View, California, where he works on AI, machine learning, computer vision, and natural language understanding. Before joining Google in 2011, he was an associate professor (with tenure) of computer science and statistics at the University of British Columbia in Vancouver, Canada. Before starting at UBC in 2004, he was a postdoc at MIT. Kevin got his BA from U. Cambridge, his MEng from U. Pennsylvania, and his PhD from UC Berkeley. He has published over 80 papers in refereed conferences and journals, as well as an 1100-page textbook called "Machine Learning: a Probabilistic Perspective" (MIT Press, 2012), which was awarded the 2013 DeGroot Prize for best book in the field of Statistical Science. Kevin is also the (co) Editor-in-Chief of JMLR (the Journal of Machine Learning Research).
HOST: Thanks, everyone, for coming. We are very honored to have Kevin Murphy here as our speaker.
I've known Kevin for a very long time. I was just reflecting back on when I met you. Do you remember? So I was in grad school, and Kevin was doing a summer internship, I think, at the DEC lab, or something. Anyone heard of that company, DEC? They made a search engine called AltaVista, briefly survived until they were overtaken by another company that you might have heard of.
But Kevin was doing a summer internship then. And I remember going to dinner with you at Daddy-O's, and sitting out in the summer. And I-- you know, instantly-- I mean, though I was in brain and cognitive science, Kevin was in computer science and AI, I felt a connection.
And again, I think you'll see this is computer science work. This is-- he's going to talk about machine learning, computer vision, and related things in natural language. But Kevin is a fitting person to have here for the Center for Brains, Minds, and Machines because he's always been interested, I think, in getting computers to do the things that humans do and is one of the most interesting and best people at doing that.
He has many positions he's occupied since that summer. He was-- he finished his PhD at Berkeley. He was on the faculty at UBC for a while, where he started working on a machine learning textbook that has become one of the-- I don't know if we want to call standard-- it's certainly one of the best books in the field.
It received the DeGroot prize from statistical-- books in statistical science, which is very impressive. And I think it's one of the books that best-- I'd say it's the book that, to me, makes the most interesting connections between the whole broad swath of machine learning and a range of different ideas and statistics, Bayesian, otherwise, and so on.
He moved to Google a few years ago and has been leading the group there in the-- well, in exactly the area that you're going to hear more about. Oh, and I guess he recently became editor-in-chief, co-editor-in-chief of the Journal of Machine Learning Research. That's super impressive.
So, anyway-- so there's really nothing more that needs to be said. We're just extremely honored and pleased to have you here. So Kevin, take it away.
KEVIN MURPHY: Great. Thank you.
That was a very flattering introduction. I hope I can live up to it. I'm very excited to be here. It's always fun to come to MIT. There's so many interesting things going on.
HOST: Yeah, you were a postdoc at MIT.
KEVIN MURPHY: I was postdoc, yes. So I was a postdoc here, 2002 to 2004, with Bill Freeman and Leslie Kaelbling before I went to UBC. So yeah, I remember MIT fondly, and I always enjoy coming here.
So like Josh said, I-- so I've been at Google five years, actually. So about three years ago, I went to the head of research, who's a guy called John Giannandrea and I said, you know-- so this is how-- I'll tell you a little story. This is how Google Research is set up currently.
There are literally three towers. There's the NLP tower, which is sort of run by Fernando Pereira. There's the computer vision tower that's run by Jay Yagnik, that I'm in. And in the middle is the machine learning tower where Google Brain is, run by Jeff Dean.
And everyone talks to the machine learning tower. Everyone-- you know, deep learning is used for everything. But there wasn't so much crosstalk between the vision guys and the language guys. And it feels like these should be interacting.
I don't think we need language to build intelligent machines because animals are intelligent, and they do things. But if we want to build devices that interact with humans, then we do need language. But we don't want it to just be language about the web and abstract things.
We also want to talk about the physical environment and ask, what is he wearing, and why-- oh, if he put his coat on, is it raining outside. And those kinds of grounded things. So that's a space I'm interested in.
So I pitched this idea to JG about three years ago. And I said, I think we should be doing something in this space. I think this is interesting, and I think it's going to be useful to Google.
And you know, back then, Google Home didn't exist. There were no devices on the market that needed it. But he believed me. So he said, OK, go ahead. Build a team, do something, or at least start to try to do something.
So I spent a long time trying to think of a name. And I settled on VALE. It's a pretty broad title for my team-- Vision, Action, Language, and Environment. And I've been super fortunate to grow the team to 10 people, with an 11th person joining us. So Carl Vondrick, who some of you may know, is starting next month to join us.
And we've been looking at, basically, mostly computer vision and so, specifically, object detection, person detection, and pose estimation, and what I call dense predictions, so like mapping one image to another image, like 2 and 1/2-D vision, depth estimation, semantic segmentation, colorization, surface normals, optical flow, a little bit, those kinds of things. And I'll briefly summarize some of our work in that space.
And then, I've always been interested in the language side. We haven't done a lot of work there, but I have a few little projects here and there that I've done mostly with interns. Jonathan [? Mahmud ?] was my intern. And I won't talk about his work today, but that was one example.
So that work is kind of fun and is a bit more cog sci science-y, so I'm going to talk about that, as well. And then, very recently, we've been starting to dabble in reinforcement learning. And this is sort of growing in popularity. I deliberately did not include RL in the first version of my book because, you know, even only five years ago, it was still just [INAUDIBLE] worlds and wasn't being used.
And there's been a lot of progress in that field, as you may know. So when-- I'm hoping-- I am working on v2 of the book. It's a work in progress. I don't know when I'll be finished. And there will certainly be some coverage of RL. And so you'll see RL creep into this talk a little bit.
So it's pretty broad. I have-- hang on. Let me just do the table of contents. I have a whole slew of things I'd like to cover. I'm planning to go over the perception side pretty quickly, maybe in like 15, 20 minutes-- maybe, let's say, to 5:30. And I-- you know, you can interrupt me if you want more detail and stuff.
But then I'm going to slow down a little bit on the second half, which is about sort of connecting the vision to the language side, rather than just the pure vision side, just partly because it's more recent. And I think it might be of more interest to BCS folks. Let me backtrack, though, and just show the slides I skipped.
So that's the team. I'll call out names of people as and when I get to specific projects. So-- but this was sort of-- this is the landscape we're in today, right? Why try to connect vision and language?
So I'm-- in this talk today, I'm mostly focusing on static images, single static images, which is-- the simplicity is not really the real world that we're in. But there are artifacts of this form. People take lots and lots of photographs. We'd like to interpret these photographs.
And the language comes in because we want to annotate them so we can retrieve them and describe them, maybe to visually impaired users. If we could generate image captions automatically, that would be pretty helpful to a lot of people. And so I'll talk about some methods for doing that.
Of course, you know, in the future, we'd like to look at video analysis, both in an offline setting-- you know, discover interesting facts about what happened in your surveillance camera or your biological vision system, watching mice in a vivarium or something. And then, even more interesting is an interactive setting where you've got streaming data. And this is obviously the closest to biological vision, which is situated in real time.
And so this is sort of the spectrum that I'm hoping we'll move along in my team. And obviously, other people are on different points in the spectrum. But today I'm going to focus on this left side, which is sort of the classical computer vision, where we've got web images, and we're going to try to squeeze juice out of them.
So I'm calling it deep understanding of single images. And here-- so person detection and object detection is an example. Captioning is another.
So deep understanding-- well, aren't we done? I mean, look at this figure. Everyone knows the ImageNet challenge, right?
So the error rates have been going down and down and down. And look, we're better than people. Oh my god, we can just quit and go home, right?
Well, no because image classification is not image understanding, right? I mean, obviously, everyone-- I assume most people in this room agree with that. But that statement is not the default assumption if you go to a computer vision conference.
Certainly, some-- many of my colleagues who I won't name think classification is-- that's it. If we can do better at classification, then we're making progress. Well, that's-- there are many, many other things we want to get out of images, right?
So we would like to do stuff and things, as Ted Adelson says, right? So stuff-- the dense things, the surfaces, the surface normals, the semantic segmentation into categories like grass and road and sky. And then the things that are countable, that may be individualized, that could, perhaps, move-- people and dogs and individual trees, if we're interested in that.
So you know, you could move from one category-- it could be instantiated into instances, if you care about counting trees or detecting the diseased ones. But if it's a group of trees, it becomes stuff. And then people, of course, is a special case because they're animate, and they're, obviously, very relevant for us and also for other devices that we build and artifacts that we build because we want to make stuff that's useful for people.
So we want to be able to detect people in imagery and estimate their pose and maybe use it to predict what they're going to do, understand what they're doing, and so on. So these are the kinds of image interpretation tasks that we'd like to be able to solve.
So I'm going to go over, very briefly, like I said, some of the techniques that we've applied to these problems. And the basic hammer that we're using is the hammer that everyone is using these days, which is neural networks, in particular, convolutional neural networks, in particular deep convolutional neural networks, which are just function approximators that map from, in our case, RGB images to some output that we'll specify.
And we are going to assume-- so everyone says, OK, why is the field making so much progress-- because of deep nets? Sure. Deep nets are very, very, very helpful, but we need a way to make them. So I think software plays a big role.
The fact that there are systems like TensorFlow that make it easy to create these models and train them at scale is a game changer because these things are very data hungry and compute hungry. And then the other thing that people tend to not mention as much is labeled data. It's not just data, but it's labeled data, labeled by people.
This has driven the field forward. And it's a drug we're addicted to, to quote Jitendra Malik, which is clearly biologically implausible and is also unsustainable, from an engineering point of view. So I'll talk a little bit about how to try to reduce our-- wean our addiction to this.
But in the first half of the talk, we're just going to inject the morphine and enjoy the ride. So we're going to download the COCO data set, which is a public data set not created at Google. It's not particularly big. It's-- let me read the statistics here-- so 125k images.
But the key thing about COCO-- it stands for Common Objects in Context-- is that it's very densely labeled with lots of juicy kinds of annotations. So we have instance masks. We have bounding boxes. We have people key points. We have captions.
There's all kinds of layers of stuff that have been added on top of this. So it's a very useful data source. That's why it's moving the needle in the field. So let's try to tackle some of these tasks.
So maybe starting at the bottom of the stack of these sort of-- it's not exactly pre-semantic because I have the word "semantic" in the title, but it's dense, right? So the input's an image, and the output's going to be an image-shaped thing, where, in this case, in the task of semantic segmentation, the goal is-- assign a pixel to every-- sorry, a label to every pixel.
So we have a finite number of categories, like 20 or 80, depending on the training set. And we have labeled pairs that-- this is somewhat expensive to acquire this data. But if you can get it-- and at this point, you know, there are several data sets of this form-- now, it's just a standard supervised learning problem.
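To make the "standard supervised learning problem" concrete, here is a minimal sketch of a per-pixel softmax cross-entropy loss. It is an illustration, not code from the talk: the function names, the nested-list layout, and the toy numbers are all assumptions, and a real system would compute this inside a framework such as TensorFlow.

```python
import math

def softmax(scores):
    """Turn a list of raw class scores into probabilities."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pixelwise_cross_entropy(score_map, label_map):
    """Mean negative log-likelihood of the true label over all pixels.

    score_map: H x W x C nested lists of raw scores (hypothetical network output).
    label_map: H x W nested lists of integer class ids (the human annotation).
    """
    total, count = 0.0, 0
    for score_row, label_row in zip(score_map, label_map):
        for scores, label in zip(score_row, label_row):
            total += -math.log(softmax(scores)[label])
            count += 1
    return total / count

# Toy 1 x 2 "image" with 3 categories (made-up numbers).
scores = [[[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]]
labels = [[0, 2]]
loss = pixelwise_cross_entropy(scores, labels)
```

The second pixel has uniform scores, so it contributes log(3) to the sum; the first pixel is confidently correct and contributes much less.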
And you know, the questions to be answered are really what's the sort of form of this network and then how do we train it efficiently. So the form-- you know, CNNs for classification problems have this sort of funnel shape. They start with a big image, and they squeeze it down, and they predict a small number of outputs-- 20 or 1,000 labels or something-- so they have this bottleneck shape.
But in these dense problems, the output's just as big as the input. So a very common architecture is this hourglass, where you shrink, and then you expand. So you are doing convolution initially, and then you're doing deconvolution to go back up.
So the problem with that is that you're losing a lot of information in this bottleneck in the middle. So you've thrown away a lot of the signals. So basically, there's a whole rash of papers-- I've listed three here, but there are many more-- where people try to recover that lost information.
In a supervised setting, it's not so bad because you are given a very high resolution input. So you can do these skip connections, basically, that will sort of copy some of the high resolution input from the beginning all the way to the ends. And then the network's adding the sort of semantic layer on top, which is typically lower resolution, anyway.
And then the network learns how to fuse those two. So I'm not going to go into these sort of low level architectural details of the networks because these things change rapidly. And it's not clear what the key principles are because there's still lack of consensus.
But this sort of high-level idea that you want to pass information at multiple scales-- I mean, that's been around for a while. And that's driving a lot of these things. So they're often called U networks just because people draw them like this. I'm going to call them conv-deconv networks.
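The shrink-then-expand shape and the role of a skip connection can be sketched on a 1D signal. This is a toy illustration under stated assumptions: average pooling stands in for the convolutional half, nearest-neighbour upsampling stands in for deconvolution, and the "fusion" is just pairing the two values; none of these names come from the talk.

```python
def downsample(signal, factor=2):
    """Average-pool a 1D signal (stands in for the shrinking half)."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal), factor)]

def upsample(signal, factor=2):
    """Nearest-neighbour upsampling (stands in for the expanding half)."""
    return [v for v in signal for _ in range(factor)]

def hourglass_with_skip(signal):
    coarse = downsample(downsample(signal))   # bottleneck: 4x fewer samples
    decoded = upsample(upsample(coarse))      # input resolution again, but blurry
    # Skip connection: pair each blurry decoded value with the original
    # high-resolution input so that a later layer can fuse the two.
    return list(zip(decoded, signal))

signal = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1]
fused = hourglass_with_skip(signal)
```

In the last four positions the decoded path only knows the average (0.5), while the skip path still carries the original 0, 1, 0, 1 detail that the bottleneck threw away.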
But anyway, all of-- you can use one of these architectures that you like. They have access to the high resolution input. But nevertheless, the outputs that these models are predicting tend to be kind of blurry. So our contribution to this space was to say, well, you know, there are these methods that people did five years ago, all that time ago, called graphical models.
And they have some nice properties for modeling correlation between random variables that the neural network doesn't explicitly capture, right? The outputs are predicted independently per pixel, conditional on the hidden states of the neural network. But nevertheless, there's no explicit correlation.
Whereas you can model that explicitly with, say, conditional random fields. So in particular, there's a paper-- let's see if I can get the reference-- from [INAUDIBLE] 2011, and they showed that you can actually capture long range correlations between pixels if you make your graph fully connected.
And you can still do efficient inference in such a model using mean field algorithms. But back in 2011, people were-- so basically, I'm not going to go into the details, but probably many of you know, with a CRF, you have to model the-- say what the correlation structure is, but also what the local evidence is. Like, locally, what do you think this pixel should be? What category?
So you know, back then, they were using random forests, and so on. It's sort of a low-hanging fruit. You just plug in a neural network instead, and you'll get better results. So we take one of these neural nets that's trained end-to-end, and then we're going to feed that into the CRF, and that's going to clean up some of the high resolution edge information that was lost by the network.
So that's sort of the key idea. It's called DeepLab. So the two primary authors are Jay Chen, who's at Google in LA, and George Papandreou, who's also at Google in LA, who's on my team. Iasonas Kokkinos is a colleague of George's. He's a faculty-- I'm not sure where now. I think he's at maybe UCL-- and then Kevin-- sorry.
I'm reading my own name. Who's this guy, Kevin Murphy? He's just a free rider!
Alan and I were playing an advisory role on this one. So there's sort of two key things. One is the CRF component I mentioned. The other is this idea of atrous convolution, which is essentially expanding the spatial support of your filters in an efficient way to capture long-range correlations without blowing up the computational cost.
And this is an old idea from signal processing. I learned this from George, who has a signal processing background. Recently, Vlad [INAUDIBLE] rebranded it dilated convolution, and George got mad, saying, why invent a new name for something that already has a name?
"À trous" is French for "with holes" because you're putting holes in your filter, but you're not actually multiplying by 0. That's the key trick. It's very simple. Anyway, this is the part I'm more familiar with.
So in the CRF, if you just use a nearest neighbor grid structure, you don't really get any juice because the neural network, the convnet, is already capturing short-range local correlations by virtue of having convolutional filters. So you need to use these models that can exploit long-range connections to get any win.
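The "holes" trick is easy to see in one dimension. The sketch below is an illustrative assumption rather than DeepLab's implementation: the function name and the `rate` parameter are hypothetical, and it computes a plain "valid" correlation whose taps are spaced `rate` samples apart.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """'Valid' 1D correlation with filter taps spaced `rate` samples apart.

    The holes are never stored as zeros in the kernel, so each output costs
    len(kernel) multiplies no matter how large the rate is.
    """
    span = (len(kernel) - 1) * rate + 1       # effective receptive field
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + k * rate]
                       for k, w in enumerate(kernel)))
    return out

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]
dense = atrous_conv1d(signal, kernel, rate=1)   # sees 3 samples: [-2, -2, -2, -2, -2]
holes = atrous_conv1d(signal, kernel, rate=2)   # sees 5 samples: [-4, -4, -4]
```

With `rate=2` the same three weights cover five input samples, which is the sense in which the spatial support grows without adding multiplications.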
So this is kind of what it looks like. You've got this network predicting, per pixel, the probability of each of the categories, like the softmax heatmap. And it's somewhat blurry. And then you feed it into this network, and you get a nice, sharp output coming out.
And inference in this network is an iterative process. And I think I cut all these slides just to save time, but you can implement that iterative process as an RNN if you want. And it's actually implementing the mean field equations. And after a few iterations, it will converge, and you get nice wins.
So here's just some eye candy: input image, predictions from the neural net. And then you stick it-- stick the CRF on top. And you can see it sharpens up the results. So you know, it's not qualitatively-- in some cases, it is flipping labels, so it can suppress some false positives.
But primarily, it's sharpening up the edges. And that gets you some gains in terms of the standard metrics. So roughly speaking, you're going from, in the pre-CNN era-- so the same CRF, but with, say, a random forest, or something. It's like 50%. CNNs come along-- you get 10% gains just by using a neural net. Everyone's happy, and then you stick a CRF on top, and maybe you get another 5% or so.
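A naive version of that iterative mean-field update can be written in a few lines. This is a simplified sketch, not the system from the talk: it assumes a Potts model (neighbours only reward agreeing labels), a single Gaussian kernel on a scalar per-pixel feature such as intensity, and brute-force O(N^2) message passing instead of the efficient filtering used in practice. All names and numbers are hypothetical.

```python
import math

def mean_field(unary_probs, features, weight=1.0, bandwidth=1.0, iters=5):
    """Naive mean-field inference for a fully connected Potts CRF.

    unary_probs: N x C per-pixel class probabilities from the network (all > 0).
    features: N scalars; pixels with similar features are coupled strongly.
    """
    n, c = len(unary_probs), len(unary_probs[0])
    kernel = [[0.0 if i == j else
               math.exp(-((features[i] - features[j]) ** 2) / (2 * bandwidth ** 2))
               for j in range(n)] for i in range(n)]
    q = [row[:] for row in unary_probs]
    for _ in range(iters):
        new_q = []
        for i in range(n):
            # Message passing: kernel-weighted vote from every other pixel.
            msg = [sum(kernel[i][j] * q[j][l] for j in range(n)) for l in range(c)]
            logits = [math.log(unary_probs[i][l]) + weight * msg[l] for l in range(c)]
            m = max(logits)
            exps = [math.exp(v - m) for v in logits]
            z = sum(exps)
            new_q.append([e / z for e in exps])
        q = new_q
    return q

# Six pixels: three dark, three bright. The network is slightly wrong on pixel 2.
intensity = [0.0, 0.1, 0.05, 1.0, 0.9, 0.95]
unary = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55], [0.1, 0.9], [0.2, 0.8], [0.4, 0.6]]
q = mean_field(unary, intensity, weight=1.0, bandwidth=0.2)
labels = [max(range(2), key=lambda l: row[l]) for row in q]
```

Pixel 2 looks like its dark neighbours, so their votes outweigh its weak local evidence and its label flips from 1 to 0, which is the kind of clean-up described here.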
And you know, the really nice thing about this is it's just a black box. And you can just throw different data sets at it, and it will learn different mappings. So this is-- here, the labels are different. So it's labeling parts of objects.
And now we can train it on urban data. This is the Cityscapes data set from Daimler-Benz and a group in Germany. And you can see the application, the relevance to self-driving cars, which is not something I work on, but this is just the data set.
We tried it, and we get good results. Not state-of-the-art-- there are people who have beaten us-- but pretty good. And you can train it to detect parts of people. And so this is a pretty generically useful thing to have.
Now, one thing we've noticed-- as you all know, the trend in the field is-- make my networks deeper and deeper. And so you think you're doing well, and someone just adds another 10 layers, 100 layers, and they beat you. And you start crying like the girl in this photograph.
So we had to do something similar when Jay Chen wanted to make a journal version of our conference paper. So by then, the field wasn't staying still. So we had to use a better model underneath. We had been using VGG, which is a network from the Oxford group, Zisserman's group at Oxford, with 16 layers.
And then ResNet-101 came along, and it was from-- much, much better. So we just swapped out VGG and replaced it with ResNet. And then everything got better, but the relative gain from our CRF started to shrink.
So you know, if you stick a CRF on top of VGG, you gain about 2 and 1/2% by this metric, intersection over union. And if you stick a CRF on top of ResNet, you still get a gain, but it's now 1.3%. And you can kind of see the trend here.
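For reference, the intersection-over-union metric for one category can be computed as below. This is a minimal sketch on toy label maps; the function name and data are illustrative, and benchmark evaluations typically average the score over categories and accumulate counts over the whole data set.

```python
def intersection_over_union(pred, truth, cls):
    """IoU for one class between two same-shaped label maps (nested lists)."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += (p == cls) and (t == cls)
            union += (p == cls) or (t == cls)
    return inter / union if union else float("nan")

pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
iou_fg = intersection_over_union(pred, truth, cls=1)   # 2 shared / 4 in union = 0.5
iou_bg = intersection_over_union(pred, truth, cls=0)   # 2 shared / 4 in union = 0.5
mean_iou = (iou_fg + iou_bg) / 2
```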
So we didn't even bother porting the CRF code because it's a little tricky to implement. So we just-- it's just not worth the engineering complexity. But the upside of having the simpler thing is you can use this model, not just with different data sets, but it's easy to modify.
I think I cut these slides.
But you can have the same thing predict not just semantic segmentation, but, say, depth per pixel and maybe surface normals. And then you can have one model predict all of these things at the same time. So we call it the master net.
And it's just different output heads for your CNN. And so you can-- it's pretty efficient and you can make it run on the phone. And it has lots of obvious applications, some of which I'm not allowed to talk about, so I decided to just cut that whole part of the talk. But that's just a little snippet of what we've been doing in this space that I call dense prediction.
OK, so let me move on. So let's move on from dense to sparse output. So where there's a small number of things you're trying to get, but you don't know how many. And you want to find out how many things are there and tell me some properties.
So we're going to represent things initially by boxes, bounding boxes. So this is called object detection. We want to find some categories you care about. We want to localize them and put a box around them. And Tommy Poggio is doing pioneering work on this [AUDIO OUT] with SVMs and sliding windows, and now it's convnets, but it's not that different.
So there's tons of applications-- here's some eye candy from applications of other teams at Google. This is the-- I'm not sure where Tommy's sitting, but I mentioned Google. We were talking about the Google X robotics team. Oh, there you are.
So [AUDIO OUT] from Paul [INAUDIBLE] at Google X robotics. And so the algorithms I'm going to describe to you are actually used by all of these groups. So that's cool. It's very satisfying to see it being used.
So one-- basically, we got into this game, like, 18 months ago, when the whole company [AUDIO OUT] to TensorFlow. We thought, oh, it's time to not just re-implement old algorithms, but let's actually update the tool chain to use new algorithms.
So we looked at the literature, and said, OK, there's these ones that have been winning these competitions. They all have different acronyms. There's SSD and FRNC, blah, blah, blah, but they're all very similar. And it's like convergent evolution.
So basically, you have some convolutional block that's extracting features densely across the image. And now, essentially, it's just like a sliding window classifier, right? For every patch, you're predicting what the label of that patch is and what the coordinates of the box are. And of course, the patch could be background, in which case you don't predict the box, so you're only trying to predict its location if it's a non-background category.
And the difference from the early sliding window methods is, A, we're using neural nets instead of SVMs. But more interestingly, we don't have to cover space quite as densely because the network can be-- you can sort of classify each patch coarsely. And if you think it's a hit, then you can learn a regression offset which will fine tune the location.
So that will give you sort of sub-sliding window accuracy, which-- so that turns out to be quite efficient. So mixture model-- you're sort of tiling space with a finite set. And then you learn regression within that finite set.
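The "tile space coarsely, then regress an offset" idea can be sketched as follows. The parameterisation here (centre shifts scaled by the anchor size, log-scale width and height) is the one commonly used in Faster R-CNN-style detectors; whether the system in the talk uses exactly this form is an assumption, and the function names are hypothetical.

```python
import math

def tile_anchors(image_size, stride, box_size):
    """Coarse grid of square anchors (cx, cy, w, h) tiling a square image."""
    centers = range(stride // 2, image_size, stride)
    return [(x, y, box_size, box_size) for y in centers for x in centers]

def decode_box(anchor, offsets):
    """Refine an anchor with offsets (tx, ty, tw, th) predicted by the network."""
    cx, cy, w, h = anchor
    tx, ty, tw, th = offsets
    return (cx + tx * w, cy + ty * h, w * math.exp(tw), h * math.exp(th))

anchors = tile_anchors(image_size=64, stride=32, box_size=32)   # only 4 anchors
# Made-up offsets: shift right by a quarter box, up by half a box, double the height.
refined = decode_box(anchors[0], (0.25, -0.5, 0.0, math.log(2)))
```

Only four anchors cover the 64 x 64 image, yet the regressed offsets let the final box land anywhere in between, which is where the sub-window accuracy comes from.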
So anyway, the SSD, what it does is it predicts the box location and the label in one shot, single shot. But then an alternative would be this method called Faster R-CNN where they first predict the box location, and they don't know what it is. They just know it's something. It's like a generic box proposal.
And then they extract features from inside of that box. And then they try to figure out what it is. And the nice thing about this approach-- it's more accurate, but you can have any kind of output here. It could be predicting box coordinates or other signals. And I'll give some examples later.
So anyway-- and then there's R-FCN. It's another paper from Ross Girshick and colleagues, which is a variation where you compute your features. The second-- this output head is sharing features, so the final layer is very efficient. It's just sharing more features as an efficiency speed-up, primarily.
So anyway, Jonathan [AUDIO OUT], the tech lead on my team in charge of object detection, he and some colleagues devised this nice API that sort of captures all of these models and more. And you know, we implemented all of this in TensorFlow. And it enabled us to sort of try all these methods and see how they compare.
And they all have various knobs, which I'm not going to get into, that let you sort of trade off speed and accuracy. So this is just-- this is a very quick model. It's using something called MobileNet, developed by some of our colleagues on the Mobile Vision team. I forget exactly the running speed, but you know, it's a few hundred milliseconds per frame-- very lightweight.
And then this is some heavy thing that is picking up on small [AUDIO OUT], like the kite that was missed here, and gets rid of some of the false positives. So it's clearly getting more people. And this is a kite, not a person. There's nobody-- it's not someone windsurfing.
So you are getting gains, but you are paying a price for that. So now we have this sort of [AUDIO OUT], this toolbox we're able to-- or this factory, really. We're able to sort of mint tools from the factory that span this spectrum.
So we have a CVPR paper this year where we exhaustively sort of spanned the space of these models and tried wiggling knobs to make this trade off curve between speed, which is on the horizontal axis, and accuracy, on the vertical axis. And we wanted to find, OK, who are on the-- what's on the frontier, right?
So we have these critical points, which are the models that strictly dominate the ones below it, at least empirically, on the data sets that we tried. And [AUDIO OUT] sort of say, OK, if you don't have a lot of compute, maybe this is the model you should use. If you don't care about speed, this is the one you should use. And if you're in-- this is sort of the sweet spot.
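Picking out the models on the frontier amounts to discarding every model that is dominated by another one. The sketch below shows one way to do that; the model names and numbers are made up for illustration and are not results from the talk or the paper.

```python
def pareto_frontier(models):
    """Return the names of models not dominated on (time, accuracy).

    models: list of (name, seconds_per_image, accuracy). A model is dominated
    if another one is at least as fast and at least as accurate, and strictly
    better on at least one of the two.
    """
    frontier = []
    for name, t, acc in models:
        dominated = any(
            (t2 <= t and a2 >= acc) and (t2 < t or a2 > acc)
            for _, t2, a2 in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical detectors: (name, seconds per image, accuracy score).
models = [
    ("tiny", 0.05, 20.0),
    ("small", 0.10, 19.0),     # slower and worse than "tiny"
    ("medium", 0.20, 30.0),
    ("big", 0.90, 35.0),
    ("big_slow", 1.50, 34.0),  # slower and worse than "big"
]
frontier = pareto_frontier(models)   # ["tiny", "medium", "big"]
```

Here "tiny" is the low-compute choice, "big" is the choice when speed does not matter, and "medium" plays the role of the sweet spot.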
And so this is a pretty useful paper for people who work in this field. And then we said, OK, well we have [AUDIO OUT] different models, and we can easily ensemble them together and that gives you a nice big win. And using this model ensemble, we won the COCO detection challenge last year.
And we actually won it by a pretty healthy margin. So our final score, I guess, was 40-- whatever that is-- 41 or 42. I can't remember anymore. And [AUDIO OUT] a fairly large gap relative to the second best, which is the team from Microsoft, and then various other teams.
So we're pretty happy with that. So that was a very nice outcome-- a lot of work, of course. So that's great. But this is-- we're just getting started, right?
Bounding boxes [AUDIO OUT] a crude approximation to the shape of objects, so we'd like to actually get a more fine-grained outline. So I mentioned already semantic segmentation, where we-- where, if you say the categories are table and chair, that would group these chairs together. It doesn't know that they're individual chairs.
But if I want to count chairs, then I need to say not only that it's a chair, but it's chair one versus chair two versus chair three, right? So that's the difference.
So you can think of that as-- well, one way to tackle that, the way that we tackled it when Peng Wang was my intern last year-- we just say, well, we already have this juicy stack that predicts boxes. Let's just-- instead of just predicting corners of the box, let's predict [AUDIO OUT] segmentation mask inside of that box.
So we can reuse the segmentation machinery that I talked about earlier. But instead of applying it to the whole image, just apply it within the patch. So it's just a pipeline approach, a two-stage pipeline. And this actually got second place in the COCO instance segmentation challenge last year.
So we're pretty happy with that. We didn't publish it, though, because it didn't win. And methodologically, it's not that novel. There's various other methods that are similar. But it works really well.
You get really nice, pretty pictures. It's like-- even in occluded cases, like these children are occluding each other, and it can segment them out. And in some cases, it finds objects that are really hard to see, even for people, sometimes. So it's great.
More recently, very recently, there's an alternative, slightly different approach from the team at Facebook AI Research, and they call it Mask R-CNN. And basically, instead of first predicting a box and then predicting the mask inside of the box, they predict them in parallel. Other than that, it's the same.
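The second stage of such a pipeline has to turn box-local masks back into a full-image instance map. A minimal sketch of that bookkeeping, with hypothetical names and hand-written toy masks standing in for network predictions:

```python
def paste_mask(canvas, box_corner, box_mask, instance_id):
    """Write a box-local binary mask into a full-image instance map in place.

    canvas: H x W nested lists, 0 meaning background.
    box_corner: (top, left) of the detected box in image coordinates.
    box_mask: nested lists of 0/1 predicted inside the box.
    """
    height, width = len(canvas), len(canvas[0])
    top, left = box_corner
    for dy, row in enumerate(box_mask):
        for dx, on in enumerate(row):
            y, x = top + dy, left + dx
            if on and 0 <= y < height and 0 <= x < width:
                canvas[y][x] = instance_id
    return canvas

canvas = [[0] * 6 for _ in range(4)]
# Two detected boxes of the same category, e.g. chair one and chair two.
paste_mask(canvas, (0, 0), [[1, 1], [1, 0]], instance_id=1)
paste_mask(canvas, (1, 3), [[0, 1, 1], [1, 1, 1]], instance_id=2)
num_instances = len({v for row in canvas for v in row if v})   # 2
```

Because each pixel stores an instance id rather than just a category, the two chairs stay separate and can be counted.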
We have this nice, sort of generic set of tools, so we were able to re-implement this in a couple of days, basically, because these are just, essentially, changing the wiring diagram of your network and letting it train. And you have to change the loss function, is the other big thing.
But Alireza Fathi, who's on my team, [AUDIO OUT] him. He coded it up. These are some preliminary results. It hasn't fully trained or anything. But you can already see, roughly, what it's doing. And you know, it's pretty cool.
So we get boxes, but we're going to get the masks within the boxes. And it works even in quite challenging cases, like when there's overlap, and so on. So this is already useful to product teams of various kinds.
Now, that's object. So we've talked about sort of-- stuff. And we've talked about things. And now, people are, in some senses, things. But they're, obviously, a special case.
So people detection-- we can do, literally, reusing our [AUDIO OUT] stack. We just change the data. But again, we don't want just boxes. We could get the mask of the person, and we do.
But we also want to get the articulation of the body. And so far, we're only doing it in 2D. There's various groups working on 3D pose estimation, which I think is certainly more useful. Chris Bregler joined Google recently [AUDIO OUT], and we collaborate with him.
And in fact, one of Chris's teammates worked with us on the pose estimation challenge in COCO. So COCO had three challenges last year. We entered all of them.
So we won the object detection. We were second in segmentation. And we were second in the key point one, although there was a bug in our code, and after the deadline, we fixed the bug. And then we [AUDIO OUT] number one, but it was too late, so the history books record us as being second. That's OK.
So the approach is, again-- it's actually pretty simple. There's a few twists that make it publishable enough at CVPR. And I'll tell you, roughly, what they are. But it's a two-stage pipeline.
We detect the person in a box. And now we're going to [AUDIO OUT] points inside of that box. And the key points are represented as a heat map. So it's a bit like a segmentation thing. But instead of simply predicting the location in a coarse grid, we can learn offsets, just like we do with bounding boxes, like I mentioned earlier.
And so you're going to get a [AUDIO OUT] vector. And you can think of it as like a mixture-of-Gaussians model. So you get a weighted combination, and you add this weighted vector field together, just like you would in Hough voting.
And you get a much more precise localization of the key points than you would with prior methods, which is why we're able to beat them. So we're currently number one on this leader board. Everyone sort of leapfrogs each other. So the last time I checked, we're number one.
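As a rough illustration of the heat map plus offsets idea described above, here is a minimal Python sketch. The function name `vote_keypoint`, the `stride` parameter, and the use of a single probability-weighted average (rather than accumulating votes into a fine-resolution map and taking its peak) are simplifying assumptions for illustration, not the system described in the talk.

```python
def vote_keypoint(probs, offsets, stride):
    """Estimate one keypoint location from a coarse heat map plus offsets.

    probs[i][j]   -- heat-map weight that the keypoint is near cell (i, j)
    offsets[i][j] -- (dy, dx) in pixels from the cell center to the keypoint
    stride        -- size of one coarse cell in pixels (hypothetical parameter)
    """
    total = 0.0
    y_acc = 0.0
    x_acc = 0.0
    for i, row in enumerate(probs):
        for j, p in enumerate(row):
            dy, dx = offsets[i][j]
            # Each coarse cell casts a vote at (cell center + predicted offset).
            cy = (i + 0.5) * stride
            cx = (j + 0.5) * stride
            y_acc += p * (cy + dy)
            x_acc += p * (cx + dx)
            total += p
    if total == 0.0:
        raise ValueError("heat map has no mass")
    # Probability-weighted combination of the votes.
    return y_acc / total, x_acc / total


# Toy 2x2 grid with 8-pixel cells; every cell's offset points at (y=10, x=6).
probs = [[0.1, 0.2], [0.3, 0.4]]
offsets = [[(6.0, 2.0), (6.0, -6.0)], [(-2.0, 2.0), (-2.0, -6.0)]]
print(vote_keypoint(probs, offsets, stride=8))
```

The point of the sketch is that the estimate is not restricted to cell centers: because each cell votes with a continuous offset, the combined location can land anywhere inside the box.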
And these are the numbers that George sent me. So last time he made this slide, we were number one. This is the average precision metric. And here we are-- 0.649.
We beat the Facebook paper that just came out. And they claim they were number one, so by transitivity, we're surely number one. And there's a group from CMU that won the competition. And so all of these methods are similar at a high level, but they differ in some of the details.
And these details seemed to matter for these problems. So we're pretty happy with that, and as you can imagine, there's a lot of applications for this stuff. And [AUDIO OUT] work in progress that George is pushing on is to say, well, especially in the case of people and kinematic chains, there's a lot of structure that we can exploit.
We know it ahead of time. Let's not completely throw the baby out with the bathwater. Maybe, let's revisit the CRFs and deformable parts models and try to leverage that in conjunction with these-- the juicy [AUDIO OUT] signals that we get with our conv nets.
And for simple problems, when it's isolated people, you just don't get any wins. But when there's a lot of occlusion and overlap, and perhaps in tracking scenarios, we expect there to be more significant gains. But that's work in progress, so stay tuned.
I-- pretty good, actually, timing-wise. So I aggressively cut a lot of slides because I didn't want to go deep dive on that. I'd rather go into more depth in the following sections.
But before-- so that's sort of interpreting the world through, say, a single image. We build up some state estimator, in a sense. We're going to do something with that, right? So we might want to describe it to a person, or the person might want to interrogate us and ask us questions, right?
So language is going to be a medium in both directions when we bring humans into the picture. So that's one, [AUDIO OUT] sort of the main motivation for studying it. The other is also from a machine-learning point of view and a research point of view.
Everything I've said up until now is supervised learning with CNNs. And so you know, the models, they differ in, maybe, the loss function or exactly what topology they use. And there's a [AUDIO OUT] in that space.
But at some level of abstraction, they're all quite similar. There's a lot of other models that are interesting and worth exploring that have more expressive power. So you know, RNNs, recurrent networks, being Turing complete, are clearly more powerful than a stateless feedforward model, right?
So there are settings where we want to model stateful computational processes. And then I'll talk a little bit about variational autoencoders as density models that are unsupervised. And so there are certainly, if I switch to my machine learning textbook on-- this is just a p of y, given x, x is the image, y is some kind of annotation-- body pose, label, box, whatever.
This is the same thing, except you've got lots of them, 1 to t, a variable number. That's the important thing, is that it's a variable. Otherwise it's just a fixed length vector, right? And then this is the same, except it's a joint model of x and y.
And that lets you do cool things because you can have missing data, or you just have images and no text or just text and no images. And you hope that the latent variables capture the correlation. And I'm going to talk about that at the end.
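The three model families just described can be written side by side. This is a notational sketch only: the latent variable $z$ and the autoregressive factorization are assumptions about how the slide was laid out, and the slide itself is not available in the transcript.

```latex
\begin{align*}
\text{fixed-size output:}\quad
  & p(y \mid x) \\
\text{variable-length output:}\quad
  & p(y_{1:T} \mid x) = \prod_{t=1}^{T} p(y_t \mid y_{1:t-1}, x) \\
\text{joint model with latent variables:}\quad
  & p(x, y) = \int p(z)\, p(x, y \mid z)\, dz
\end{align*}
```

In the joint model, either $x$ or $y$ can be missing at training time, which is what allows learning from images without text or text without images.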
So there's many more models that we can explore and that we need to explore when we get into language modeling because the problem is more difficult. You need to take-- well, in the general case, sort of, it's AI complete. You need human intentionality and a very deep understanding of the world to do a good job.
But even to do an OK job, I think, you can't just brute force it by collecting labeled data. So we do not only need fancy models, but we need to move beyond just max likelihood training with input-output pairs and look at other ways of training these models with different objective functions.
So I'll give an example where we use reinforcement learning to train models to optimize a criterion, or reward function, if you like, that's better suited to the task that is not likelihood of the data. And then when we're doing density modeling or latent [AUDIO OUT] modeling, we're going to use variational Bayes, which is, basically, converting Bayesian inference into an optimization problem because we like optimization. We have good software for it.
But we will get uncertainty and all of that good stuff coming out of it. So this is a richer playground. It's more fun, if you're a machine learning researcher, than just living in this top left corner where we were previously playing. So that's sort of a methodological point.
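For readers who want the "inference as optimization" remark spelled out, the standard variational bound is shown below. The symbols ($x$ for observed data, $z$ for latent variables, $q$ for the approximate posterior) are generic notation, not taken from the slides.

```latex
\begin{align*}
\log p(x)
  &= \mathbb{E}_{q(z \mid x)}\big[\log p(x, z) - \log q(z \mid x)\big]
     + \mathrm{KL}\big(q(z \mid x)\,\|\,p(z \mid x)\big) \\
  &\ge \mathbb{E}_{q(z \mid x)}\big[\log p(x, z) - \log q(z \mid x)\big]
\end{align*}
```

Maximizing the lower bound over $q$ drives the KL term toward zero, so the optimizer returns an approximate posterior, which is where the uncertainty estimates come from.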
So I'll just have a quick break. So I'll first talk about mapping from images to text. So let's go back to this example I started with. There's two obvious approaches, right?
We could take the image, and we could parse it into all of the pieces. And I already explained how we could do that. And then we could use those pieces and convert it into a sentence somehow, and you could imagine using template methods.
And in fact, people do use that approach, and it can work [AUDIO OUT]. And it could be template-based, or this could be a neural net that has access to these signals. Or it could just be an end-to-end thing, and that-- in some senses, that seems to work better.
But we will-- and I'll show some results which look really cool that take this end-to-end approach. But there are some caveats there, which we will visit later. So I'm, for the most part, actually going to follow the trend and do this direct end-to-end stuff and not really use the scene interpretation that we've built up.
We have those signals. We should use them in this part of the work. We haven't. It ought to help, but we haven't really tried that hard.
So almost surely everyone in the room is aware of the breakthrough that happened in 2015. Simultaneously, several groups at Berkeley and Toronto and Microsoft, and so on, all kind of stumbled [AUDIO OUT] the same idea of treating image captioning as a translation problem, where you're translating from images in one language to captions in another by-- you take your image, you pass it through the CNNs that I've been talking about to create a vector representation that squeezes some semantic juice out of it.
And then [AUDIO OUT] as conditioning for a recurrent neural network that's trained to generate sentences one word at a time. And the features derived from the image bias the word choices that you make. And the thing is-- trains on supervised image caption pairs to maximize conditional likelihood.
So all of these methods adopt that kind of framework. These-- this is Oriol Vinyals, et al-- he's a colleague of mine at DeepMind, Fang, et al, from Microsoft, where they did use some of this intermediate structure of object detection, and so on. And they get broadly similar results. So you know, several groups came across it, and they're all getting [AUDIO OUT] results like this.
You've probably seen this stuff, right? You give it an input like this. It was annotated by a human-- three different types of pizza on top of a stove. And the model says-- two pizzas sitting on top of a stove top oven. You know, is it two? Is it three?
It's not really clear. You could quibble over that. It's a little bit agrammatical-- or not agrammatical, but disfluent-- sitting on top of a stove top. But you know, if you're, maybe, not a native speaker, it's not bad. And you know, it's pretty, pretty amazing, especially compared to where we were before.
And I could give you more amazing results, but let's look at the not-quite-so-amazing results because it's more fun, more interesting, if we want [AUDIO OUT] improve things. So you know, you get-- this is a common failure mode. These models don't really count, and they often just guess two because two is the most frequent noun phrase--
--in the data set.
These models have millions of parameters. And any bias in their data, you-- they will find and exploit it ruthlessly. And then, [AUDIO OUT] there are some really embarrassing fails.
You know, the data set, the COCO data set doesn't have these kinds of images. And you know, we would like this stuff to actually work. I mean, we would like to be able to annotate all the world's images for visually impaired users.
But if it's going to do things like this-- and furthermore, if the model confidently believes this is the right answer-- we can't ship this. This would be really embarrassing. So this has not launched.
And the accessibility team-- when they saw these-- I mean, this stuff was so cool it made the New York Times. And you know, they said, hey, this is great. We want to-- we'll give you engineers. We can code it. We'll take your Python prototype and redo a few. It's not a problem.
But then they run it on their data, and it fails. And we say, OK, well, we haven't quite got our act together yet. So what are we going to do to improve? So there are some basic problems here.
Perhaps the most fundamental problem [AUDIO OUT] we got from Jitendra Malik. But I think the fundamental problem is that it's really hard to evaluate whether the caption's any good. And so if we can't measure progress in a rigorous way, we can't make progress.
So we decided to take one angle on this, which is to look at a special case of captioning where there's really a task. And this task is called referring expressions. This is a standard setup. In computational linguistics, it's been around for a while.
And the idea is that you have two people. So we're bringing multiple agents into the picture because communication requires at least two agents, right? Otherwise there's not really any point to it. So we're going to have a speaker and a listener.
And the speaker sees an image and wants to convey some information to the listener. And what he wants to convey is some-- the location of some object of interest. So what he's going to do is-- in our setup, there's a speaker that's given an image, and we're told, please describe [AUDIO OUT], so this box, which is guaranteed to correspond to an object of interest coming from the same COCO data set that I mentioned.
And then the algorithm, the speaker algorithm, has to generate a sentence such that when that sentence is received by the listener, the listener can decode it and correctly infer which object was being referred to. And if they correctly decode the objects, then they get points based on understanding it correctly.
So you're rewarded both for speaking clearly and for comprehending correctly. And we're going to train these two agents to cooperate-- it's really-- it's a cooperative game, so it's really just one meta system. [AUDIO OUT] game theory or anything like that.
And we're going to simplify the problem by giving the listener a finite set of things it has to choose from. But the nice thing here is it's very easy to measure performance because you can just say there's five boxes. Did you-- what fraction of the time did you get the correct box?
Or if it's a regression setting, you can measure [AUDIO OUT]. And it's a meaningful thing, right? You didn't quite get the location right, but you were close. So we can make progress, and you can potentially train on this.
AUDIENCE: [INAUDIBLE]. Training
KEVIN MURPHY: Yes. That's a very good question. So we actually have human-labeled data. Do I have a slide on that? I'm not sure if I do.
We got people to create-- so we can use supervised learning. So we had people annotate it. And we ground it. We're not actually having the agents create their own language.
There has been a lot of recent-- well, several [AUDIO OUT] papers where they use RL, where the agents just sort of start babbling, and they create their own language. And if you seed it with human-- a human language, typically English, it might stick to using that, but it's not guaranteed to.
Here we're actually using supervised learning, not reinforcement learning, and it's only training on data [AUDIO OUT] created by people. And I'm going to come back to that. It's clearly a bottleneck. I'm trying to get away from the drug of supervised data.
But here, this is the first time we were working on this problem, so we wanted to start simple, or relatively simple. So I should mention this is-- the first author of this work was [INAUDIBLE], who was, at the time, a PhD student at UCLA [AUDIO OUT] and has since joined the Google Waymo, or Alphabet Waymo, whatever it is-- the self-driving car team. So this was published in CVPR last year.
So how are we going to tackle this? So the baseline approach would be-- you're given a region of the image, and now what you could imagine doing is just [AUDIO OUT] features from that region. And then just use the pipeline that we already have, which takes image features and generates a sentence using an RNN that's conditioned on the image features.
And then we can give it some context from the whole image, right? So that's the baseline model. It's a maximum likelihood model that's predicting a sequence of words, given the region features.
And then, what would the listener do? The listener is given a finite set of regions. It could just rank how likely each region is to match the sentence that was spoken and compute the most likely match. So there's a max likelihood classifier very similar to what people did in speech recognition five or 10 years ago, [AUDIO OUT] would be, say, an HMM per phoneme or something like that.
But there's obviously a problem with many aspects of-- well, in particular, the speaker. If you have a setup like this, and I say, OK, please describe this region of the image to this, to the listener, it might just say, the girl. There's no reason why it should say, the girl in pink because it-- we are giving it the whole context, but it doesn't really know that there's some ambiguity here, and so it should add extra redundancy to its description to make it unambiguous.
So what we can do is realize that the purpose is not just to describe this patch, but to [AUDIO OUT] convey information when there's some ambiguity so that the listener can decode correctly. So there's a nice game theoretic analysis of this that Percy Liang and colleagues came up with. I think there's too many symbols for me to decode here.
But the bottom line is that the speaker should take into account the belief state of the listener, essentially. And when they're creating a sequence of words, w, for a region, they should make sure that the likelihood of that description is higher for the true region than for any of the other regions because if that's true, then the max likelihood decoder is going to work, right?
You know that will be ranked correct [AUDIO OUT]. So if you can satisfy this criterion, then it will ensure that the listener will decode correctly, and you'll both be happy. So what you can do then, is then, instead of-- let's see if I have it.
So it's a very simple change, right? Instead of maximizing the likelihood of the words, given the regions, we can maximize the [AUDIO OUT] probability of the true region, given the words, right? And so in speech recognition, this is called MMI training, maximum mutual information. This is a blast from the past if there's any speech people in the room.
But it's just-- you compute the posterior, and then you're going to maximize that. So it's discriminative training because of this normalization constant because you'll take into account the relative likelihoods of each of the regions-- and making sure the true one is higher than the others. And we also tried a ranking loss, and it's more or less the same. I like this better because it's probabilistic, but they're very similar.
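To make the change of loss concrete, here is a small Python sketch contrasting the maximum likelihood loss with the posterior-based (MMI-style) loss over a finite set of candidate regions. The function names, the uniform prior over regions, and the log-likelihood numbers are illustrative assumptions, not values from the paper.

```python
import math


def log_posterior(log_liks, true_idx):
    """log p(region_true | words), assuming a uniform prior over regions.

    log_liks[k] is log p(words | region_k) for each candidate region.
    """
    m = max(log_liks)
    # Log of the normalization constant, computed stably (log-sum-exp).
    log_norm = m + math.log(sum(math.exp(l - m) for l in log_liks))
    return log_liks[true_idx] - log_norm


def ml_loss(log_liks, true_idx):
    # Maximum likelihood only looks at the true region.
    return -log_liks[true_idx]


def mmi_loss(log_liks, true_idx):
    # Discriminative loss: the true region must beat the other regions.
    return -log_posterior(log_liks, true_idx)


# Made-up scores: "the girl" fits both girls equally well,
# "the girl in pink" is less likely overall but fits only the true region.
the_girl = [-2.0, -2.0]
the_girl_in_pink = [-5.0, -12.0]
for name, scores in [("the girl", the_girl), ("the girl in pink", the_girl_in_pink)]:
    print(name, ml_loss(scores, 0), mmi_loss(scores, 0))
```

With these made-up numbers, the maximum likelihood loss prefers the short ambiguous description, while the posterior-based loss prefers the longer unambiguous one, which is the behavior the talk is after.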
So we just changed the loss function. And pretty much everything else is the same. And lo and behold, it helps. So this is the data collected. I'll skip this.
So you just train that up, max likelihood training. And I'm skipping the architectural details that are in the paper. And let's-- just a little demo.
So here's an image. If we point to this guy on the left, and we say, please describe it, it will say, a man wearing a black jacket. And we point [AUDIO OUT] right, and it says, a woman in a black dress. It's pretty good.
Let's do one more-- red bus, double decker bus. It's pretty good. Doesn't always work-- let's look at-- this is an interesting failure case.
We point to this thing. And it says, a bus in the right. Well, it looks like a train, not a bus. Is it on the right? Or is on the left? That's the point of view of the speaker and listener.
It doesn't capture that kind of subtlety. It's not exactly grammatical-- "in the right." So you know, there are flaws. But you know, it's pretty good, actually. It's surprisingly good, given how simple it is.
And on the listener's side, we can [AUDIO OUT] it's fairly adaptive. So here's an image. Like I said, we're going to give it a candidate set of regions. And these actually come from the object detector that I mentioned earlier, but the class agnostic object detector, the region proposal network, also called multi box, that just says, here are five or 10 object-like things I found, [AUDIO OUT] about one of these.
And then, depending on what I say, I'm going to highlight one of those candidates. So if I say, a black carry-on suitcase with wheels, it will pick this one. If I say, a black suitcase, it picks the same thing, a red suitcase, it flips over here, a truck in the background, it picks this. So you know, it's responding.
So you know, this is pretty interesting. So we have a nice objective function to measure [AUDIO OUT]. And you know, we're capturing some aspects of communication, which is a multi-party setup. But we were relying on manually labeled data to specify discriminative descriptions for each of these regions, take into account what the confusing categories were.
And this isn't true-- isn't possible, in general, right? So imagine we change the scenario. So instead of having a single image and there's one region out of five that I'm trying to discriminate between, I have a set of images. And there's one member of the set that I'm trying to describe to you.
And I want to describe this set-- this instance [AUDIO OUT] all of the others. And this set could be arbitrary. It changes at runtime. It could be quite big.
Here, I'm going to focus on a single distractor image. So what I'd like to do is describe this image on the left such that you won't confuse it with the one on the right. So a default model, which is just like a max likelihood [AUDIO OUT] on caption data does, in fact, say it-- like, an airplane is flying in the sky, which is a reasonable description if your task is to describe this image.
But if your task is to distinguish it from other members of the set, where-- like, in this case, the set has two elements, then you're going to get confused. So what you should do is say, well, if my goal is to be discriminative or distinctive, then I should maybe generate this-- a large passenger jet flying through a blue sky.
So this is not a passenger jet. It's clear, if you hear that, you're referring to this one on the left. So this is very similar to the setup I had before, except the key difference is we're not going [AUDIO OUT] collect training data that is explicitly discriminative.
We're just going to reuse the caption data we already have, and we're going to change the way the model works. So we're going to dynamically derive discriminative functionality from a model that was trained, in sort of a generative way. It's a pretty simple idea.
So we're just going to, again-- same principle. We're going to modify the speaker to pay attention to the needs of the listener in a simple way. So the key idea is to take into account that the listener is going to be computing a likelihood ratio.
If there's only two choices, it's going to be doing this max likelihood decoding, like I mentioned. How likely is the sentence under this hypothesis versus this hypothesis? So when we're considering a sentence that we might generate, we want to make sure it's more likely under the true image, as opposed to the distracting image, right?
So we're just going to have this log likelihood ratio, and we're going to generate a sentence that maximizes that. But that could give rise to agrammatical sentences, so we're also going to have a language model term, which just says, generate me a sentence that's likely, but also one that is more likely under the correct image, as opposed to the distracting image.
So that's our objective function that we use at runtime to decode our sentences and [AUDIO OUT] rewrite this. Since it's just a sequential model, you can rewrite this as a sequence of conditional terms. And then you can use beam search, and then you just modify the beam search algorithm with a slightly different decoding function. And you can decode from this greedily, and it's very simple to implement.
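A minimal sketch of this kind of decoding is shown below, using greedy search instead of beam search to keep it short. The per-token score, the trade-off weight `lam`, the floor `eps`, and the two toy next-word tables are all assumptions made for illustration; the exact weighting used in the paper may differ.

```python
import math


def token_score(p_target, p_distractor, lam):
    """lam * fluency term + (1 - lam) * log-likelihood-ratio term."""
    fluency = math.log(p_target)
    ratio = math.log(p_target) - math.log(p_distractor)
    return lam * fluency + (1.0 - lam) * ratio


def greedy_decode(target_model, distractor_model, lam, max_len=10, eps=1e-6):
    """Pick, word by word, the token with the best discriminative score.

    Each model maps a prefix (tuple of words) to a dict of next-word
    probabilities; this stands in for a captioning model run on one image.
    """
    prefix = ()
    while len(prefix) < max_len:
        t_dist = target_model[prefix]
        d_dist = distractor_model.get(prefix, {})
        best = max(
            t_dist,
            key=lambda w: token_score(t_dist[w], d_dist.get(w, eps), lam),
        )
        if best == "<eos>":
            break
        prefix += (best,)
    return list(prefix)


# Hypothetical next-word tables for the target and distractor images.
TARGET = {
    (): {"a": 1.0},
    ("a",): {"plane": 0.6, "jet": 0.4},
    ("a", "plane"): {"<eos>": 1.0},
    ("a", "jet"): {"<eos>": 1.0},
}
DISTRACTOR = {
    (): {"a": 1.0},
    ("a",): {"plane": 0.9, "jet": 0.1},
    ("a", "plane"): {"<eos>": 1.0},
    ("a", "jet"): {"<eos>": 1.0},
}

print(greedy_decode(TARGET, DISTRACTOR, lam=1.0))  # plain likelihood decoding
print(greedy_decode(TARGET, DISTRACTOR, lam=0.5))  # discriminative decoding
```

Setting `lam=1.0` recovers ordinary likelihood decoding, so the same trained model serves both the generic and the discriminative speaker. Only the decoding score changes.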
So again, the first author of this work is Rama Vedantam, who was interning at Google last year. And you'll see his name pop up again in another part of the work. He's a student at Virginia Tech, about to move to Georgia Tech. And the co-author, [AUDIO OUT], many of you know him. Devi is Rama's adviser, and Gal Chechik is a colleague of mine at Google.
So let me just show you some results. So the generic model is the captioning model I mentioned earlier, which we used, I think, one, at the standard show-- I think it's show, attend, and tell. It's the CNN [AUDIO OUT] thing.
So if you give it this green image, it will say, a man and a woman playing a video game. But that's a bit ambiguous. This introspective speaker that does the likelihood ratio decoding says, a man is sitting on a couch with a remote, which is certainly a better fit to this image than to that one, right?
Let's do one more example. The generic model would say, a train traveling down tracks next to a forest. And we've chosen the test set so that it's ambiguous by matching-- they contain-- in this case, I think they either have the [AUDIO OUT] same captions, I think, either according to the model or according to the humans.
So these images were prepared to the same caption. I can't remember now if this is in the training data or if it's due to the algorithm. But in any case, on their own, these sort of map to the same point in language space, in some senses. But as a pair, they're not-- they need to be distinguished.
[AUDIO OUT] would say, a red train is on the tracks in the woods, which is clearly more-- a better fit for this left one than the right one. So it's doing pretty well, right? And it's very simple. We were very happy when these things worked.
And then ultimately, we want to know-- does it work with humans in the loop? [AUDIO OUT] So we did an AMT study, and we looked at two settings where the-- let me see if I remember. The easy confusions are ones that are confusing images that are similar in some feature spaces, such as FC-7, some layer of the neural network.
And then the hard ones are not only similar in feature space, but they all [AUDIO OUT] have very similar captions, according to humans. And in any case, in both scenarios, our introspective method, IS, is significantly better than the baseline, which just did standard max likelihood decoding.
So this is cool. You know, I think we're making some progress. Both of these models were trained using maximum likelihood. And they were-- the decoding was using a different objective, but the training was still ML.
But that's a problem because-- let's see. What are the problems of maximum likelihood? Well, with these sequential models, we're decoding one word at a time.
So I'm predicting-- the train on the tracks. Let's see if I can pick a more interesting [AUDIO OUT]. It could be-- the train on the turnstile, right? So by the time I get to "on the," [AUDIO OUT] is probably the most likely word in my grammar model.
And if I accidentally made an error there and said something else, I'm not going to be able to recover from it because my language prior is so strong, if my prior predictions are different from that, I'm conditioning on things I haven't seen in the training set. So I deviate from what I [AUDIO OUT] on, and I start entering parts of models-- data space that the model hasn't been exposed to.
So this is called the exposure bias problem. Because you're-- at training time, you're always conditioned on the ground truth prefix. But at test time, you're always conditioned on the predicted prefix.
And they might start to become arbitrarily different, and the models can perform poorly. So that's a well-known problem with max likelihood training in these sequential models. So what we can do is to replace maximum likelihood with some other objective that looks at, maybe, the overall fluency of a sentence or its-- how well it performs at some discrimination task.
And we can use reinforcement learning methods that are able, in principle, to optimize black box functions. So in particular, we can use the policy gradient algorithm called REINFORCE to optimize anything we want, in principle. So the MIXER paper from some Facebook guys-- they published it last year, and they used this approach.
[AUDIO OUT] the BLEU score, which is a metric from the machine translation community. And they showed some wins over just max likelihood training. But there's a couple of problems with this.
So the biggest problem is the BLEU score is just not very well correlated with human judgment of caption quality. It's a very syntactic thing. It may be better for machine translation, but for image captioning-- if you look at the quality of algorithms compared to the quality of people, according to BLEU, it seems that the algorithms are better than people.
So that's clearly not true. So this is just clearly a bad metric. And this is the case for pretty much all of the automatic metrics. The other issue [AUDIO OUT] is a bit more detailed, but-- and I'm not going to go into that level of detail.
But their particular policy gradient algorithm is extremely high variance. And it's difficult to use. So we came up with a better policy gradient method. And with our better method, we were able to explore alternative objectives that are more correlated with human judgment.
So in particular, the approach that we took-- this is a bit of a technical detour for those who are familiar already with this. When you're estimating the cost to go, so your Q function, the MIXER approach, basically, averaged the reward across all the time steps and [INAUDIBLE] constant.
And we do something similar to what they do in AlphaGo. We do, like, Monte Carlo roll outs. So we have a partial sequence of actions or words said so far. Then we hallucinate possible endings of the sentence.
We feed these complete sentences to our black box evaluation function that I'll tell you about in a minute. We get a score, and then we average over those. And we use that average [AUDIO OUT] to tell us how well we're doing.
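The rollout idea can be sketched in a few lines of Python. Everything named here (`rollout_value`, `uniform_policy`, `toy_reward`, the vocabulary, the reference words) is a hypothetical stand-in: the real system samples continuations from the captioning RNN and scores them with a caption metric.

```python
import random

VOCAB = ["a", "red", "train", "tracks", "<eos>"]


def uniform_policy(words, rng):
    # Stand-in for sampling the next word from the captioning model.
    return rng.choice(VOCAB)


def toy_reward(words, reference=("red", "train")):
    # Stand-in black-box score: fraction of reference words mentioned.
    return sum(w in words for w in reference) / len(reference)


def rollout_value(prefix, sample_next, reward_fn, num_rollouts, max_len, rng):
    """Estimate the value of a partial sentence by averaging the reward
    of several sampled completions (Monte Carlo rollouts)."""
    total = 0.0
    for _ in range(num_rollouts):
        words = list(prefix)
        # "Hallucinate" an ending for the sentence.
        while len(words) < max_len and (not words or words[-1] != "<eos>"):
            words.append(sample_next(words, rng))
        total += reward_fn(words)
    return total / num_rollouts


rng = random.Random(0)
print(rollout_value(["a", "red"], uniform_policy, toy_reward, 100, 6, rng))
print(rollout_value(["a"], uniform_policy, toy_reward, 100, 6, rng))
```

In a policy gradient update, an estimate like this (usually minus a baseline) would weight the gradient of the log-probability of the word chosen at that step, so each word gets its own credit rather than a single sentence-level average.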
And these curves are showing the various metrics of BLEU and ROUGE and CIDEr and METEOR as a function of time. And we're the blue curve. And you can see we are much higher and we are faster. And the previous method, which is in green, isn't as good, and its [INAUDIBLE] was very hard to tune, very sensitive to hyper parameters.
So we found that this particular combination of policy gradient and Monte Carlo rollouts was easier to work with. And then that lets us explore the space of reward functions to try to find something that matches human judgment better. So fortunately-- so this project got started-- I went to ECCV last year. I think it was in Amsterdam.
And there was this presentation called SPICE. This group from Australia, Mark Johnson's group, they came up with a metric that, for the first time, [AUDIO OUT] put humans at the top of the leader board, where they ought to be. So this plot is showing human judgment on the x-axis and automatic score on the y-axis.
And these dots correspond to, I think, judgments according to different systems. And this is judgment-- this is human judgment [AUDIO OUT]. These methods-- so this is the BLEU score I mentioned here. The algorithms are all in blue, and they're scoring higher than the red human.
And then the CIDEr metric-- the best algorithm is, apparently, better than the best human. And then the METEOR metric-- similarly. So they came up with the SPICE metric, and finally, for the first time, [AUDIO OUT] humans are not only higher ranked than the algorithms, but there's a better correlation.
So the way it works is that they parse the caption into-- they actually have multiple captions, and they parse them all. And they build a scene graph, and they extract the semantic content of the sentence. And then they measure how well that match-- for the ground truth.
And then a generated caption is [AUDIO OUT] similarly, and they match in graph space, rather than in grammar space. So they're really saying, is the semantic essence of the sentence similar, as opposed to, is just the sequence of tokens that are generated similar. So this feels like the right thing to do, right?
So an obvious thing is-- so I-- you know, [AUDIO OUT] is awesome. We have this hammer. We can optimize anything. They have a way of measuring performance that seems to be correlated with what we want to do. Let's just optimize the crap out of the SPICE score.
But it turns out, if you do that-- I don't think I have any examples-- because it's only looking at the semantic structure and not the syntax, [AUDIO OUT] you will, of course, do well by this metric. But your sentences aren't very grammatical. So we did a simple thing.
We simply mixed the SPICE metric that captures semantics with the CIDEr metric, which captures syntax. And we call the combination SPIDEr. And then we optimize that.
And so if you optimize that using [AUDIO OUT] policy gradient method, and then you show those captions to humans, then humans like ours much more than they like other methods, which is good. So we're generating stuff that people are happier with. You can play a different game if you want.
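A minimal sketch of such a mixed reward, assuming the combination is a simple weighted sum. The talk does not state the weights, so the equal weighting and the name `mixed_caption_score` are assumptions for illustration.

```python
def mixed_caption_score(spice, cider, spice_weight=0.5):
    """Weighted sum of a semantic score (SPICE) and a fluency-sensitive score (CIDEr).

    spice_weight is a hypothetical knob; 0.5 means equal weighting.
    """
    if not 0.0 <= spice_weight <= 1.0:
        raise ValueError("spice_weight must lie in [0, 1]")
    return spice_weight * spice + (1.0 - spice_weight) * cider


example_reward = mixed_caption_score(spice=0.2, cider=1.0)
```

In practice the two metrics live on different scales, so the weight would need tuning; the point is only that one scalar reward can carry both signals.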
We can take the metrics that are used in the COCO competition. So it's a combination of BLEU, CIDEr, METEOR, and [AUDIO OUT]. And we can optimize that. And then you can-- we used, like, a really dumb model from two years ago, the old show-- like, it didn't even have attention-- really simple, you know, VGG-- simple baseline model.
But you optimize it by this. And we were number one for about a week on the COCO leaderboard. [AUDIO OUT] we were number one at the time we submitted, so we could brag about it. But then we got beaten by someone else.
But our point was like-- we were using a very simple model. We were just optimizing for the right thing. But if you show those captions, the ones that win the competition to people, they don't like them.
They don't like them as much as if you show them the captions that are generated by optimizing this more human metric. So we're pretty happy with this. And some of my colleagues are planning to try to see if we can get this working on real data.
But I'll come back to that. But let me just show you a couple of examples. So this is the fire-- there seem to be a lot of fire trucks in COCO. So these are the five [AUDIO OUT] humans. You can read some of them.
This is just the default baseline max likelihood training-- a red and white bus is driving down the street. Well, it's not really a bus. The previous max method, MIXER-- a yellow bus driving down a city street. And then, this is us at the very bottom-- a red fire truck is on a city street. Seems better, right?
Let's just do one more. This is our method. The baseline-- a woman walking down a street while holding an umbrella. OK, that's ridiculous. Our method-- same model, different loss function-- a group of people walking down a street with a traffic light. It's pretty good, right?
So we're quite happy with this [AUDIO OUT]. Since then, various other teams have come up with even better ways of regularizing or stabilizing the training process, which I won't get into.
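The idea behind this part of the talk, training a model to maximize a non-differentiable score by sampling outputs and reinforcing the well-scored ones, can be sketched on a toy problem. This is plain REINFORCE with a batch-mean baseline on a one-step choice among three "captions", not the per-word rollout estimator used in the actual work; the reward values, step count, and learning rate are all hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(rewards, steps=500, batch_size=16, learning_rate=0.5, seed=0):
    """Learn a categorical policy over len(rewards) choices by policy gradient."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        # Monte Carlo step: sample a batch of choices from the current policy.
        actions = rng.choices(range(len(rewards)), weights=probs, k=batch_size)
        sampled = [rewards[a] for a in actions]
        baseline = sum(sampled) / batch_size  # reduces gradient variance
        grad = [0.0] * len(rewards)
        for action, reward in zip(actions, sampled):
            advantage = reward - baseline
            for k in range(len(rewards)):
                indicator = 1.0 if k == action else 0.0
                # d log p(action) / d logit_k = indicator - probs[k]
                grad[k] += advantage * (indicator - probs[k]) / batch_size
        logits = [l + learning_rate * g for l, g in zip(logits, grad)]
    return softmax(logits)


# Hypothetical metric scores for three candidate captions.
toy_rewards = [0.1, 0.9, 0.3]
final_probs = train_policy(toy_rewards)
```

The reward is only ever evaluated on sampled outputs, which is why any scoring function, differentiable or not, can be plugged in.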
So let's-- I'll do five more minutes, and I'll wrap up. This is the most recent piece of work. We just submitted it to [AUDIO OUT]. We'll put it up on arXiv shortly.
But let me just motivate it. You know, it sounds like everything's great. And we've made progress. We made human raters happier than they were.
So my colleague on the accessibility team says, OK, you ready now? Can we launch? No, we still can't launch because-- look at these errors. It's just ridiculous.
[AUDIO OUT] from our system. This is from a couple of years ago. But we still do make embarrassing errors. So a cat is sitting on a toilet in a bathroom, a woman laying on a bed with a laptop. I like this one-- a man in a suit and tie is holding a cell phone, a man is riding a skateboard on a ramp.
I mean, this is ridiculous. [AUDIO OUT] so we've made progress. This-- we're optimizing a metric that's closer to human judgment. But it still feels like we're sort of skimming on the surface.
We're picking up on correlation, and we're more correlated with humans. But are we really understanding what's going on? I don't [AUDIO OUT]. So we wanted to step back a bit and do some science here and say, OK, guys, we're not going to be able to deliver on your launch deadline of six months.
We really need to step back and try to get a bit more, a richer understanding of what's going on. So look at some of the core problems in language understanding and [AUDIO OUT]. And there are lots of them, right?
So one of them is just the variability with which people describe the world, or the appearance of objects. We have a good handle on that. These neural network models are very good at learning to be invariant to lighting and color and occlusion, to a certain degree.
But these are sort of local, statistical noise, as it [AUDIO OUT]. There's other kinds of more radical variation in the world, like new combinations of things that I've never seen before, structurally novel combinations, compositionality. You know, this is fundamental, especially to language, right?
The world is fundamentally combinatorial, and so you're never going to cover it with training data. So we need to [AUDIO OUT], just grab the bull by the horns and just address this as a first class citizen, which means we have to get away from just random train test splits where the test set is pretty similar to the training set with just slightly different colors on your pixels.
So we're going to do compositional splits, where we guarantee that the thing [AUDIO OUT] on you've never seen at training time. And they're going to be structurally novel test sets. And then we want to deal with abstraction, which is related to compositionality.
So if we have a lot of signals that vary, what do they have in common? What is the essence of the concept that you're trying to convey? And what are the things that are just random and incidental?
And sure, you might pick up on that correlation, but it's not really the core. And there are, of course, lots of other problems, but these are the issues that we want to tackle. So we started to think about language and vision sort of more deeply, I guess.
So one key thing is [AUDIO OUT] when you describe a compositionally novel sentence to someone, they may never have heard of it. It might not even exist in the world, but they can still usually understand it, right? So if I say "purple hippo," it evokes some representation in your head.
We don't really know-- this is the thought bubble-- we don't know what that is. [AUDIO OUT] some distributed representation in neural net. Maybe it's-- who knows, right? I don't want to commit to what that is. We want to be agnostic to what that is.
But then if I probe you, and I say, well, do you understand what I'm talking about, they'll say, sure. OK, well, prove it to me like you would if you were examining your students. So a reasonable thing to do in this scenario is to ask the student to draw or to sketch, any way, not photo realistic, but, like, OK, show me what you think I'm talking about.
And they might generate a diversity of samples that sort of capture the essence of this description or this concept. And now you could say, well, I don't want just any purple [AUDIO OUT], I want to say purple hippo with wings. So I should be able to be more specific as I add more constraints to the problem.
And now, presumably, the thought bubble, the distribution of possible worlds has shrunk because I've added constraints, and therefore, the set of samples I generate should be less diverse. They should be consistent with what I say, [AUDIO OUT] but they should span the space and not fill in details arbitrarily when I-- for things I didn't specify, right?
So just some-- we're going to build up to a model. We're going to call these text descriptions, y, the internal representation z, and the generated images will be x. And of course, we could do the reverse.
We could have a set of images. And we could say, OK, please describe that. And so this is concept learning. This is very much inspired by Josh's thesis, right?
So if I give you these images, you would-- like, the least common ancestor, in some sense, is purple hippo. These are both purple hippos, but there's a more parsimonious explanation, which is a tighter [AUDIO OUT] sphere around-- which captures the data, but only the data. And that would be purple hippo with wings, right?
So we would like to capture that kind of phenomena, as well. We actually haven't worked on this particular problem yet. We believe the model I'm about to show you can solve it. We just haven't had time to try.
But [AUDIO OUT] the model is based on variational autoencoders, which some of you may be familiar with. So this is just the latent variable model. You have some latent variable z. We're going to assume it has a Gaussian distribution because it makes everything simple, but it doesn't have to be.
And then we have these two modalities, right? So we have images off on one [AUDIO OUT]. And we have text off on the other. And we're going to generate everything, so it's not discriminative anymore.
And that means we can train partially supervised. We can have images on their own, or labels on their own, or both. And we do want some paired data so we can learn this correspondence, but we don't necessarily require a lot of it.
So [AUDIO OUT] of this work is, again, Rama Vedantam, who did the discriminative captioning work. He liked it so much at Google, he came back again. Last time, he was on another team. This time he interned with me.
And then my colleague, Ian, works a lot on this, and Jonathan Huang, who I mentioned earlier in the object detection project. So this will be coming up on arXiv [AUDIO OUT] in the next couple of days, actually.
So with VAE-- so these joint models have been around for a long time, right? There's nothing new here. The sort of breakthrough a few years ago is to try to make inference more efficient.
So what you can do is, you can train a network to approximate the inference process. So what we're going to do is we're going to have three inference networks. So we'll have one inference network that infers the posterior over the latents, given pairs of data.
And that's, maybe, what we have at training time. But at test time, I might only hear a sentence. And I want to imagine the meaning of that sentence, so I'm going to infer z given y. So I need a network that only works in text modality [AUDIO OUT].
But I might want to do the reverse. I might want to have an image and embed it into my concept space so I can describe it. So I'll need an image-specific inference network, as well. So we're going to have three networks that capture these different types of data, and we need to jointly train them.
And there's been several papers on, like, multimodal VAEs. And they all do slightly different ways of training these. And I'm not going to go into the relationship of-- between our work and theirs. It's in the paper.
But what we do is we use just neural nets to parametrize these networks in the usual way. And then there's a couple of novelties [AUDIO OUT] have this slightly different objective function that-- we call it the triple ELBO because there's three ELBOs.
I'm not going to get into this, but those who know the ELBO, this is the usual ELBO on joint data. We have an ELBO just on x's and just on y's. And this gives us a way to train these three networks simultaneously.
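For readers who want the shape of that objective, here is a sketch of a sum of three evidence lower bounds, one per inference network. It assumes x and y are conditionally independent given z and leaves out any per-term weights, which the talk does not spell out; the exact form is in the paper.

```latex
\begin{align}
\mathcal{L}(x, y) &= \mathrm{ELBO}(x, y) + \mathrm{ELBO}(x) + \mathrm{ELBO}(y) \\
\mathrm{ELBO}(x, y) &= \mathbb{E}_{q(z \mid x, y)}\big[\log p(x \mid z) + \log p(y \mid z)\big]
  - \mathrm{KL}\big(q(z \mid x, y) \,\|\, p(z)\big) \\
\mathrm{ELBO}(x) &= \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
  - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big) \\
\mathrm{ELBO}(y) &= \mathbb{E}_{q(z \mid y)}\big[\log p(y \mid z)\big]
  - \mathrm{KL}\big(q(z \mid y) \,\|\, p(z)\big)
\end{align}
```

Here q(z | x, y), q(z | x), and q(z | y) are the three inference networks described above, and p(z) is the Gaussian prior.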
And then we wanted to test this. So we just threw SGD at it. And we [AUDIO OUT] probe how well it's doing in this sort of controlled setting to see if we could-- if we're tackling these basic issues.
So the first thing we did was to take MNIST, as everyone does who works on VAEs, and we replaced the class labels with an abstraction of the class label. So we just gave it two bits, either the parity-- it's either odd or even-- or the size. Is it a big number, bigger than 5? Or is it less than 5?
So we're not doing natural language. We're just doing attributes. And in this case, they're two binary attributes. It's very, very simple.
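A minimal sketch of that relabeling, mapping each digit class to two binary attributes. The talk leaves the digit 5 itself ambiguous, so treating 5 and above as "large" is an assumption, as are the function and key names.

```python
def digit_to_attributes(digit, large_threshold=5):
    """Replace a digit class (0-9) with two binary attributes: parity and size.

    large_threshold is an assumed cut-off: digits >= threshold count as large.
    """
    if not 0 <= digit <= 9:
        raise ValueError("expected a digit between 0 and 9")
    return {"odd": digit % 2 == 1, "large": digit >= large_threshold}


# The ten classes collapse into four attribute combinations.
attribute_table = {d: digit_to_attributes(d) for d in range(10)}
combinations = {(a["odd"], a["large"]) for a in attribute_table.values()}
```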
So you can fit them all to this. So you've got a bunch of images, and you've got these little bit vectors. If you fit an [AUDIO OUT] model, which only has images, and you fit it in 2D, and you look at the latent space that it induces, there are these four categories, right?
There's small and even, and small and odd, large and even, and large and odd. It doesn't devote any space, in latent space, for this large even category, the red guys [AUDIO OUT]. Even though there are digits in that group, there's no reason why it should allocate mental space to it because it doesn't know these labels.
And it can recreate the likelihood of the pixels fine without it. And you can monkey with the loss function. The beta VAE paper is from DeepMind. Basically, they change the weighting term on the KL.
And there is InfoGAN paper from OpenAI, where they, again, sort of-- it's a GAN, not a VAE. But it's pretty similar. They sort of change the weight terms, but they're still just dealing with images. And it's nice to try to squeeze as much juice as you can from images alone.
But there are going to be some high-level [AUDIO OUT] where you're going to need some linguistic or some kind of structural side information to tell the system what it is you care about. So if you have the joint model, and you fit it naively, it starts to get-- do better. It has to generate the labels as well as the image, so it's going to devote some capacity of its model to that task.
But the [AUDIO OUT] bits in the pixels than in the label, so it emphasizes the pixels more. You can just weight the labels more highly. And if you scale them appropriately, you get nice decomposition of your latent space, and you could, obviously, do well at classification if you wanted to.
So that's cool. But more interesting is that [AUDIO OUT] do posterior inference. So like I said, our latent space is Gaussians, so our inference network Q is going to predict the parameters of a Gaussian. And so if I twiddle the bits, I'm, like, giving it sentences, and it will map to the appropriate part in latent space.
But what we want to be able to do is just to describe the world at different levels [AUDIO OUT], right? So let's see what the next figure is. So I should be able to specify the concept "all even numbers."
And I don't care if they're small or big. I want all the evens. Or I want all the smalls. Or I just want all the numbers. And then I should be able to generate samples from my model that are consistent with what I did say and [AUDIO OUT] entropic over what I didn't say.
So how are we going to do that? Well, we need an inference network that can handle missing data. So what we decided to do is to use a product of experts on the assumption that these attributes are roughly orthogonal, at least in this setting. So we can-- each expert is its own Gaussian distribution that's mapping that particular attribute to latent space.
And we're going to combine them multiplicatively so that they-- when they agree, they're going to carve out a part of space. So the individual experts, like this tall one on the right, is capturing the concept of small. And this tall one on the left is capturing the concept of big.
And this is the even expert, and that's the odd expert. And then, if you want to capture small and even, then the two experts fire together. And they-- these Gaussian bubbles intersect.
And the nice thing is, in general, products of experts are intractable, but in Gaussian land, everything's analytic. And it's straightforward to compute. So you can fit this.
One thing to notice is that these individual experts are a bit weird. They have [AUDIO OUT].
What's the time? I'm really running over. Yeah, it was a bit too much detail.
But they have these wide tails because they're normally always present. In Geoff Hinton's original work all the experts were firing. In our work, we have a variable number of experts firing, depending on what you observe.
So we can just have what we call the universal expert, which is the prior. And that regularizes it because it's always being multiplied. And then you get this beautiful sort of Olympic-rings-type structure where you've got this broad concept.
This is the prior. You've got these specific things capturing aspects of the problem-- parity or magnitude. And then we can make compositions of these individual components just by combining these [AUDIO OUT] experts dynamically at runtime without having to specify it.
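The closed form being referred to is the standard result that a product of Gaussian densities is again Gaussian, with precisions adding up. A minimal one-dimensional sketch, assuming each expert is a (mean, variance) pair and the prior is always included as the "universal expert"; the expert values in the table are hypothetical, not learned.

```python
def product_of_gaussians(experts):
    """Combine 1-D Gaussian experts, given as (mean, variance) pairs."""
    precision = sum(1.0 / variance for _, variance in experts)
    mean = sum(mu / variance for mu, variance in experts) / precision
    return mean, 1.0 / precision


PRIOR = (0.0, 1.0)  # the always-present "universal expert"

# Hypothetical per-attribute experts (in the real model a network outputs these).
EXPERT_TABLE = {
    ("parity", "even"): (2.0, 1.0),
    ("magnitude", "small"): (-1.0, 0.5),
}


def posterior_for(observed):
    """Multiply the prior with one expert per attribute that was actually specified."""
    experts = [PRIOR] + [EXPERT_TABLE[item] for item in observed.items()]
    return product_of_gaussians(experts)


nothing_specified = posterior_for({})
even_only = posterior_for({"parity": "even"})
small_and_even = posterior_for({"parity": "even", "magnitude": "small"})
```

The variance goes from 1.0 to 0.5 to 0.25 as more attributes are specified, which is the narrowing of the distribution described in the talk. With diagonal Gaussians in more dimensions, the same formula applies independently per dimension.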
So then we want to evaluate these things. We don't want to be-- it's nice to look at these pictures, but we want to measure, objectively, how well the system's doing. So we've proposed three criteria for evaluating any model. It doesn't have to be our model.
So we can't look inside your head, but we can ask you to generate images. And we're going [AUDIO OUT] those images. So you generate a set, s. And then we say, OK, are the images you generated correct?
So that-- what this says, simply, is we're going to apply classifiers to your generated images. And we're going to see if the predicted labels from our classifiers match the things that I told you to generate. So if I say purple hippos, they better [AUDIO OUT] be hippos.
But I didn't say if they're flying or if they're all eating grass. I don't care about that. So you have to match on the bits that you require, and you don't care about the rest.
We also want coverage. So we want to be-- the things that I didn't specify, I want you to give me a variety. I want some flying hippos. I want some flip-- hippos with wings. I want some [AUDIO OUT] water, some in the field.
So we want to measure-- this is-- that's not-- parse the syntax. But that's the idea. So we want diversity, so we're going to cover the extent of the concept, and not just give me a single example. And then we want to handle compositionality, which we do, simply by partitioning the data so that we get structurally novel [AUDIO OUT].
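A rough sketch of how the first two criteria could be computed from the attribute predictions that a classifier makes on the generated images. The definitions here are simplified stand-ins: correctness is the fraction of samples matching every specified attribute, and coverage is the normalized entropy of the predictions over unspecified attributes. The paper's exact measures may differ, and all names and data are hypothetical.

```python
import math
from collections import Counter


def correctness(predicted, query):
    """Fraction of samples whose predicted attributes match all specified ones."""
    hits = sum(all(p[k] == v for k, v in query.items()) for p in predicted)
    return hits / len(predicted)


def coverage(predicted, query, attribute_values):
    """Mean normalized entropy over attributes the query left unspecified.

    attribute_values maps each attribute to its possible values (at least two).
    """
    unspecified = [k for k in attribute_values if k not in query]
    if not unspecified:
        return 1.0
    scores = []
    for k in unspecified:
        counts = Counter(p[k] for p in predicted)
        total = len(predicted)
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        scores.append(entropy / math.log(len(attribute_values[k])))
    return sum(scores) / len(scores)


values = {"parity": ["even", "odd"], "magnitude": ["small", "large"]}
query = {"parity": "even"}  # magnitude deliberately left unspecified

# Hypothetical classifier outputs on two sets of generated images.
diverse = [{"parity": "even", "magnitude": m} for m in ["small", "large"] * 2]
collapsed = [{"parity": "even", "magnitude": "small"} for _ in range(4)]
```

Both sets are fully correct for the query "even", but only the diverse set covers the concept; the collapsed set returns the same kind of sample every time.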
So then to test these-- we had a slightly harder task that we called MNIST with attributes, where we're sliding the digits around, and we describe them-- you know, this is the class label. But we can say, this is a small digit. Is it upright? Is it in the top right? Or it's a 4 in the bottom left, and it's big.
So now we can fit a model on it. And we can say, OK, please generate me some-- a 4 which is big, upright, and bottom left. And there are some samples.
And these are some rival methods that we do better than. This is variational canonical correlation analysis, and it's blurrier than us. And the classifier doesn't like it. It gets the bits wrong [AUDIO OUT] red. This is the joint multimodal VAE, which is similar to us, but a slightly different objective.
So our samples are correct, more correct. They're sharper. This is a bit more interesting. We can give partially specified queries.
So I can say, just generate me something on the bottom left. [AUDIO OUT] what it is. So sometimes it's 0. Sometimes it's 3's or 9's or 6's, right? Or I might say, I want it to be 3 and big, but I don't care where, and it slides it around.
But if I clamp all the bits, then it's more specific. And what's going on under the hood is that these Gaussians are initially broad and are shrinking as we condition on more bits. And that's inducing this narrower distribution.
It's very similar to Josh's thesis, where he had distributions over a hypothesis space, which is either, like, the number line or a tree. And in our case, it's a latent space, which is nice because we can fit any kind of data to this latent space. This is dynamically changing as we condition on more or less data.
And then, we can do it [AUDIO OUT] split. So we can give it a query. It's never seen anything that's zeroes, bigs, uprights, and top rights, and it does the right thing. It's seen zeroes on their own and bigs on their own, but it's not seen this combination. And we do better than others.
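A minimal sketch of a compositional split of this kind: whole attribute combinations are held out for testing, so each test combination is structurally novel even though its individual attribute values appear in training. The data layout, an (attribute tuple, item) pair per example, and the function name are assumptions for illustration.

```python
def compositional_split(examples, held_out_combinations):
    """Send every example with a held-out attribute combination to the test set."""
    held_out = set(held_out_combinations)
    train, test = [], []
    for attributes, item in examples:
        target = test if attributes in held_out else train
        target.append((attributes, item))
    return train, test


# Hypothetical examples: (digit class, size) combinations with placeholder items.
examples = [
    (("0", "big"), "img_a"),
    (("0", "small"), "img_b"),
    (("1", "big"), "img_c"),
    (("1", "small"), "img_d"),
    (("0", "big"), "img_e"),
]
train, test = compositional_split(examples, [("0", "big")])

train_combinations = {attributes for attributes, _ in train}
test_combinations = {attributes for attributes, _ in test}
train_values = {value for attributes in train_combinations for value in attributes}
```

Here the model sees "0" (as a small digit) and "big" (on a 1) during training, but never the combination of a big 0.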
And we can quantify all that. And you know, we beat the other methods, too, especially VCCA. There's a healthy margin. The other method we beat, [AUDIO OUT] it's a smaller gap.
OK. So that's very recent work. I'm pretty excited about it. There's clearly a long way to go between, like, playing with MNIST digits and the kind of real data that I was talking about earlier. And we need to bridge that gap.
So maybe I'll just mention that future work is, basically, to try to bridge that gap. And furthermore, we want to move away from just single images and look at the active scenario where we have streaming video, and we're interacting with people, ideally in real time, and that raises a whole host of issues that I didn't talk about, which is lots of juicy future work.
OK, thank you.