Computational Models of Cognition: Part 2 (58:30)
August 16, 2018
Brains, Minds and Machines Summer Course 2018
Josh Tenenbaum, MIT
In this second lecture, Josh Tenenbaum first elaborates on an intuitive physics engine that uses a probabilistic framework combined with inverse graphics to capture aspects of human understanding of the physical behavior of objects. This physics engine embodies a generative model of the physical world that incorporates uncertainty in a way that enables approximate, intuitive inference and predictions of object behavior. This framework is then used to construct an intuitive psychology engine that enables inferences about the beliefs and desires of other agents, their goals and actions, and social interactions between agents.
JOSH TENENBAUM: Welcome back for part 2. I'll probably go for about an hour and a half here. So we may go a little into lunch. But we'll leave time for things. Because I want to make sure to do at least a little bit of learning.
But most of what I'm going to do here in part 2 is talk about how we build these models using the tools of probabilistic programs and some game engine simulators. And we'll spend a lot of the time on intuitive physics for a number of reasons. It's exciting to us, we're doing a lot of work on it. We've made a lot of progress.
It's also the place, I think, where brains, minds, and machines best meet up when it comes to the common sense core. I will talk some about intuitive psychology. But at this point, there are interesting possibilities for using neural networks, both artificial ones and studying the biological neural networks that underlie intuitive physics. And we will come to some of that in intuitive psychology.
Again, you'll hear really exciting work on that from Rebecca Saxe and Winrich Freiwald tomorrow. But it's just harder, representationally, basically. So we've made less progress on the computational cognitive neuroscience there, or on any kind of plausible neural network models.
Intuitive physics. This will also be an opportunity to show you some of the experimental paradigms that we use. Now I talked in the first part mostly about kids. While we do experiments on kids-- I'll show you a couple-- most of the work we do in my lab is with adults.
But we study the things that young kids can do, and we study them in adults, because we can get a lot more data, a lot more quantitative data, a lot easier and faster by studying adults. And we think we're studying basically the same thing, although there are important differences. And we talk about that.
So the kind of experiments we might do with adults here are judgments, for example, in these blocks world scenes. So blocks world is a classic setting for studying intelligence, including intuitive physics. And we might show you these stacks of Jenga-like blocks.
These are all simulated images. But you can see what we're doing, especially if you've played the game Jenga, but you don't have to have played it. And the question might be, well, how likely is any one of these towers to fall over under the influence of gravity?
So I built this stack. How stable does it look? And probably most of you would agree that the ones in the upper left here look pretty stable and the ones in the lower right look pretty unstable, more likely to fall. So that's the kind of experiment we could do.
Now if we want to build a model of how you do that, the mental model that we think you have is something like this picture here. And it's about interpreting images at a certain point in time and being able to predict what's going to come next. What you might see in the future, images at a later point in time. But also really about the underlying world state.
So by the image, I mean the pixels. And by the world state, I mean something like a description of the three dimensional objects, physical objects, so their shape and their physical properties. Whatever is basically the state of the game physics engine needed to reason about these things.
We're going to model those arrows, which are the causal arrows, using game graphics engines and game physics engines. And I'm not going to tell you the details of how those work. But the key thing is, that in some form they capture the actual causal processes of how graphics or optics works and how physics works. Like how light bounces off surfaces of objects and forms images, or how things like Newtonian mechanics or fluid mechanics-- but they capture them in only approximate ways that hack various things. For the most part, they respect physics.
But the goal, when you're designing game engines, is just for it to look good and be fun, and look good on short timescales, but be really efficient because it has to support an interactive loop. So it's not like Pixar's graphics engines, which are designed to render beautiful, often photorealistic images. Which use things like ray tracing, where you basically almost model photons scattering off objects and many inter-reflections.
As a side note, I guess NVIDIA, one of the leading game graphics companies, just introduced a new chip, a new GPU chip which does real time ray tracing. So increasingly, the game industry is moving in this direction. Because wouldn't it be awesome if you can have an interactive Pixar movie with that good graphics. And you will have that before too long. The graphics engine in your head may not be that good, although, it's pretty good. I don't know what your dreams are like, but the graphics in my dreams can be pretty good.
Similarly, game physics in many ways incorporates insights of Newtonian mechanics. But there are a lot of hacks, too. And as a result, you often don't really have conservation of energy. But you have various kinds of forces and key properties, like mass and friction and elasticity, the things that you need to model what we did with the tape there.
So let's talk about how we do probabilistic inference in this way, so vision from this point of view. And again, I'm going to talk a fair amount here about vision. But it's the kind of intuitive physics, common sense approach to vision. It's more of a top-down in several different senses, as Jeremy and I were talking about, view of vision that I think really does start to meet up in a pretty interesting way with some of the more bottom-up approaches to vision that you've seen from Jim DiCarlo or Gabriel, or Tommy, for instance. Especially because of the community we have here in the summer school, I'll talk about vision somewhat.
But from the probabilistic program common sense view, vision is inverse graphics. It means taking the output of one of these approximate rendering engines, viewing it from a point of view of a probabilistic model, like there's a conditional probability of image condition on scene. And I want to make a guess at the likely scene. Just like you did in those Bayesian tug of war examples yesterday in the tutorial. Or if you're watching this on the video, on the probmods.org web book.
So here what that means is I'm going to see an image of a block tower like that. And I want the output of my vision system to be something like this, which is a configuration of 3D objects which would render approximately into that image, with maybe some small noise model, like pixel noise. Now you might say, well, in what sense is that 3D? It just looks like a black and white version of the one on the bottom.
But that's just a projection of a 3D representation inside the computer. And a way to see this: it's also a sample, a posterior sample. So it's a sample of a high posterior probability scene conditioned on the image, under some observation noise model.
Now this is one sample, here's another sample. Here's another sample. Maybe I have four samples. So if I play those fast, you see what looks like the blocks moving around a little bit in three dimensions. Those are different samples of 3D positions. They're not samples of images, they're samples of block positions in 3D, but then rendered into images and coded in black and white.
But the idea is that vision gives you maybe one of those guesses. And you see they're only a little bit different, and they all represent a plausible distribution of uncertainty that you would get if you looked at that image for a relatively brief period of time. I should say, by the way, that the work I'm talking about here on the intuitive physics stuff, just for citations, is work that was done by Peter Battaglia and Jess Hamrick when they were in our group a few years ago, published initially in PNAS in 2013, and some later work that Hamrick et al. did in Cognition.
It builds on inverse graphics work that I'm going to be talking about by Vikash Mansinghka and Tejas Kulkarni, and then recently developed by people like Ilker Yildirim and Jiajun Wu in our group. But this particular image here is actually from the Battaglia et al. PNAS paper, where they built a very, very simple version of this system that would look at one of these blocks world scenes and make a guess like that at the underlying scene.
Here's another example. This is a very unstable tower. And here's a guess at the 3D position. Here's a few other guesses. And you could see as they move around, again, those are all, yeah, plausible 3D positions that could underlie this very unstable scene.
Now how do we do this? How do we construct one of these posterior samples? Well, here's another use of top-down and bottom-up, or maybe it's the same use. But this is a system, the one I just showed you, that takes a very top-down approach. And it's a very slow approach.
Biologically, it's very implausible that it works this way. Although it's possible you could build machines this way. And there's some things that I think are biologically right about it potentially.
But what I mean by top-down is that it's using the most basic methods that you saw also in the probabilistic programming tutorial. It's using Markov chain Monte Carlo. You can think of that as a hypothesize-and-test idea, but with a biased local random walk in the space of hypotheses, where you make up some guess.
Initially it's probably very wrong, doesn't look much like the image at all. And then you change it, you tweak it in some ways. You make a local proposal, maybe you move this block here or there, or you change the shape a little bit, or something like that. And then you see if that's better. If it is, you're more likely to accept the proposal. Otherwise, you're more likely to reject it.
And the key is that you make a bunch of these proposals and you're more likely to accept the ones that are better rather than worse. And if you're willing to wait long enough, after hopefully not too much time, the Markov chain of those proposals will converge to a sample from the posterior. And the first inference algorithms that probabilistic programming languages like WebPPL were built with do exactly that. And there's a reason they were built that way initially: it's universal.
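The local-proposal loop just described can be sketched in a few lines. This is a toy, not the actual system from the talk: the "scene" is just a list of block positions, the "renderer" is the identity map, and the Gaussian pixel-noise model and all the numbers are assumptions for illustration.

```python
import math
import random

def log_likelihood(scene, observed, noise=0.1):
    # Gaussian pixel-noise model: how well does the "rendered" scene
    # match the observed image? (Toy renderer: the identity map.)
    return -sum((s - o) ** 2 for s, o in zip(scene, observed)) / (2 * noise ** 2)

def mcmc_inverse_graphics(observed, n_steps=5000, step=0.2):
    # Metropolis-Hastings: start from a probably-wrong guess, make local
    # proposals, and preferentially accept moves that improve the fit --
    # the chain converges to posterior samples over scenes.
    random.seed(0)
    scene = [0.0] * len(observed)          # initial guess: all blocks at the origin
    ll = log_likelihood(scene, observed)
    for _ in range(n_steps):
        proposal = scene[:]
        i = random.randrange(len(scene))   # tweak one block's position a little
        proposal[i] += random.gauss(0, step)
        ll_new = log_likelihood(proposal, observed)
        # accept with probability min(1, exp(ll_new - ll))
        if math.log(random.random()) < ll_new - ll:
            scene, ll = proposal, ll_new
    return scene

observed = [1.0, -0.5, 2.0]                # the "image": three block positions
sample = mcmc_inverse_graphics(observed)   # one posterior sample of the scene
```

After enough steps the sample sits near the observed configuration, with scatter set by the noise model, which is exactly the little jitter between posterior samples shown on the slides.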
So you can define it in a Turing universal sense, if you guys know what that means. But basically, you can define probabilistic programming languages so that any computable probabilistic model-- that means any probabilistic model whose conditional probabilities are computable numbers-- can be written in these languages.
For any computable conditional query-- that means for any conditional probability you'd like to know the answer to, or would like a sample from-- these MCMC, top-down, local hypothesize-and-test algorithms can answer it. And it's quite remarkable that you have that. Running in your web browser when you do that tutorial, you have a universal probabilistic inference engine that can answer any computable probabilistic question. That's pretty amazing.
It's just that for most of them, you're going to have to wait a very long time. And that's not very satisfying, certainly not for the brain. So right now, if you look at the modern toolkit of probabilistic programming-- and I'm using a phrase of Vikash Mansinghka's here-- inference programs or inference programming, there's a range of different, you might call them, inference algorithms. But they are also probabilistic programs that run on top of basically a mental simulation program, like a probabilistic physics engine.
But these probabilistic programs are designed to compute the inferences. And some are very slow but general-purpose, like the ones I've just talked about and used here. Others are much faster, although more special purpose. And this is where neural networks have come in.
So I'll show you some of that a little bit later, ways that we can use neural networks to basically learn very fast approximate inference in probabilistic programs. But there's no free lunch: while they are very fast, they are very specific to the particular program or program structure used to train them. But that might be a really good way to think about how the probabilistic program inferences work in vision. Because the visual system, compared to other aspects of cognition, has a relatively fixed architecture, at least over our own lifetime, let's just say.
So the basic kinds of inferences that you do in inverse graphics, a lot of them are things that you have a lot of experience with and a lot of opportunity to practice, even in your sleep, as we sometimes say. And what you learn is reusable. Whereas in other kinds of more advanced cognition, maybe a little bit less so. But the difference between the fast and slow approaches here is illustrated in this figure, where on the x-axis I have time and on the y-axis I have likelihood, or log likelihood. Actually it's log likelihood on a log scale.
But what I'm showing is, basically, higher is better. It means a scene interpretation that's a better fit to the image. And I'm showing how that evolves over time, how your scene interpretation evolves.
For the baseline, the red one, that's one of these top-down Markov chain Monte Carlo type approaches. And you can see it gets better but slowly. And the blue one is a thing that uses a neural net to accelerate the inference in ways I'll show you later on. But it's much, much faster. It basically gets as good as the MCMC thing almost immediately, but only for a specific class of problems that it's had experience with.
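The fast, learned alternative can be sketched in the same toy style. This is just the amortization idea, with a hypothetical one-number "scene" and two-number "image": sample (scene, image) pairs from the generative model itself, then fit a recognition model on those pairs, so that at test time inference is a single forward pass rather than a long chain.

```python
import random

def render(scene, rng):
    # Toy generative model: a "scene" is one latent number; the "image"
    # is two noisy linear measurements of it (stand-ins for pixels).
    return [2.0 * scene + rng.gauss(0, 0.1), -scene + rng.gauss(0, 0.1)]

def train_recognition_model(n_pairs=2000, lr=0.05, epochs=50, seed=0):
    # Amortized inference: draw (scene, image) pairs from the generative
    # model, then fit a linear image->scene regressor by stochastic
    # gradient descent. At test time the regressor gives a near-instant
    # guess -- but only for the class of scenes it was trained on.
    rng = random.Random(seed)
    data = []
    for _ in range(n_pairs):
        s = rng.uniform(-1, 1)
        data.append((s, render(s, rng)))
    w = [0.0, 0.0]
    for _ in range(epochs):
        for scene, img in data:
            err = w[0] * img[0] + w[1] * img[1] - scene
            w[0] -= lr * err * img[0]
            w[1] -= lr * err * img[1]
    return w

w = train_recognition_model()
rng = random.Random(1)
true_scene = 0.5
img = render(true_scene, rng)
guess = w[0] * img[0] + w[1] * img[1]   # fast inference: one dot product
```

The training loop is slow, like the red MCMC curve, but it runs offline; at test time the guess is immediate, like the blue curve, and it fails silently on scenes outside the training distribution.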
We'll come back to the fast methods later. But right now, I'm focusing more on the knowledge representation and how to use these structured causal models, let's say, these probabilistic graphics and physics programs, to do scene understanding. And I'm not going to worry about whether it's fast or slow. But I want to show you the generality of this.
So here I'm showing you a few things from the Kulkarni, et al. CVPR 2015 paper. This introduced a probabilistic programming language for scene understanding called Picture, which Mansinghka and colleagues are redoing with some of these modern tools using neural networks. So stay tuned for that.
Kulkarni, also-- he's now at DeepMind. And with some teammates there, they've been doing a bunch of interesting related kinds of neural-net-enabled inverse graphics. You might check out their SPIRAL paper, for example, from the most recent ICML, another similar idea building on this.
But for example, things we could do even just a few years ago, was take the same basic framework and see blocks or see faces. So for example, we could take a generative graphics model for faces that has a probabilistic causal model in which you generate the 3D shape of a face, the texture. You choose a lighting direction, a camera angle. And that renders an image like this. And now we can look at a face image and run that backwards.
So for example, here's just a movie. Sorry, the graphics got mangled a bit. But on the left, you see an observed face. And on the right, you'll see one of these slowly evolving MCMC chains of the interpretation.
So on the right it starts out not looking very much like the face. But it makes local adjustments, until fairly quickly it converges on something that looks just like the face. But again, the key isn't that it looks just like it, but that it's a real 3D model. I've just projected it into an image. But because it's a 3D model, I can now imagine what this face would look like under different conditions.
So here's other examples, where I'm taking a face and then I'm imagining what it would look like under different viewpoints or lighting conditions, which I can do because I've inverted the graphics engine. So this is a way to richly perceive faces beyond just classifying them, but to actually see their real underlying structure.
What you might need for say, really understanding somebody's emotions, or their health, or in mate choice, or other kinds of things. A lot of social interactions are going to require a rich 3D perception of the details of the shape and texture of a face of somebody that you've never seen before. And we think these kinds of things are necessary or at least very valuable for doing that.
The same framework can be applied to perceiving bodies. The power of having a probabilistic program and language is, I don't have to separately engineer a system for your body perception. I just give it a model of bodies. So in this case, a graphics model of bodies.
And then I can observe a body, in this case, here's Usain Bolt. And now you're going to see the same top-down inference chain, where I'm making local adjustments to a 3D body model until it looks like the image. You'll notice that it doesn't look like the image at the pixel level here because we don't require that the model match at the pixel level. In this case, it can match at an intermediate representation level, like an edge map.
Or it could be convnet features, if you like. And that's useful because that way we don't have to capture all the details of skin color for the whole body and all the clothing that can vary. We just have to capture the shape.
This was used to build a system that at the time was actually a state of the art full 3D body perception system. We could use it for other kinds of generic objects, like bottles and vases, and these kind of cylindrically symmetric, but interestingly shaped objects. Again, it works really well, though slow.
I'll come back to in a little bit how you could use neural networks to accelerate all of these things. The key is that the target of perception is this rich 3D percept, which then becomes the input to the physics engine.
So if we now want to take these tools and use them to do intuitive physics, we're going to take our say 3D sample. Now I have it as a wireframe so that it actually looks a little more 3D, I guess, not an image. But that's the state of one of these game physics engines.
If I want to know what's going to happen, will the blocks fall over, I can just run the physics forward. Going forward with one of these simulators is much easier than going backwards. A probabilistic program goes one direction; when I want to condition on the output and make an inference back to the input, that's hard. But going forward, that's always easy. And for that, I don't need any neural networks. I just need my simulator.
So I just take my game engine and run physics forward. So it just simulates the effects of gravity, friction, collisions. And what I see if I run it on that guess of the world state for that image is, well, the blocks mostly fall over. So this represents one posterior predictive sample of my probabilistic physics engine, which would say for that scene, I think that the blocks are mostly going to fall over. So it's very unstable.
With one sample you can make a guess. But you can't be confident, or you don't know your confidence, unless you can take a few samples. And in the work we did here, we asked people to make a 1 to 7 judgment, for example. Seven means I'm sure it's going to fall over. One means I'm sure it's all going to be stable. And intermediate numbers are intermediate things.
And we estimated that people took around three to five samples. But the point is that, in our models and in people, we think you're taking one or a few samples like this. Here's another sample, where now I'm propagating again the state of the objects through a few steps of the physics engine.
And what you can see is that different things happen in these two samples. This is different. The outcome at the end is different here and here at the fine grain detail level. But for the purposes of intuitive physics, for most plans we want to make, it's basically the same. Namely, most or all of the blocks fell over. So that's going to be something I would want to do something about, unless my goal is to have them fall.
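A minimal version of this sample-based judgment might look like the following. The stability rule (center of mass versus support edge), the noise level, and the towers are all stand-ins of mine, not the actual game-engine physics from the paper:

```python
import random

def tower_falls(offsets, noise=0.1, rng=random):
    # Toy stability check: a stack of unit-width blocks falls if, at any
    # level, the center of mass of the blocks above sits outside the
    # supporting block's edges. Positions are jittered to model
    # perceptual uncertainty about the true 3D configuration.
    xs = [x + rng.gauss(0, noise) for x in offsets]  # one noisy percept
    for k in range(len(xs) - 1):
        above = xs[k + 1:]
        com = sum(above) / len(above)                # center of mass above level k
        if abs(com - xs[k]) > 0.5:                   # beyond the half-width support
            return True
    return False

def judged_instability(offsets, n_samples=5, rng=random):
    # Take a handful of posterior predictive samples (the estimate for
    # people was around three to five) and report the proportion that
    # fall, which could then be mapped onto a 1-to-7 rating.
    return sum(tower_falls(offsets, rng=rng) for _ in range(n_samples)) / n_samples

rng = random.Random(0)
stable_tower = [0.0, 0.05, 0.0, 0.05]     # nearly aligned stack
staggered_tower = [0.0, 0.4, 0.8, 1.2]    # strongly staggered stack
p_stable = judged_instability(stable_tower, n_samples=50, rng=rng)
p_staggered = judged_instability(staggered_tower, n_samples=50, rng=rng)
```

Note that the perceptual jitter alone makes borderline towers look unstable more often than ground truth would say, which is the probabilistic physics illusion discussed a bit later.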
To turn this into an actual quantitative experiment and model, what we do is we have a bunch of people make a 1 to 7 judgment. But they could also make binary judgments and you could average it. It comes out the same.
And then what we're plotting on the y-axis here are the averages over a group of subjects, where each dot is one of these tower stimuli. The crossbars show error bars, vertical for the people and horizontal for the model. The model, again, took a few samples, like the ones I showed you, and looks at the percentage of blocks falling over.
And what you can see is that there's a pretty good correlation. I picked out three specific stimuli. But overall, across 60 different stimuli, it captures most of the variance in the data. And this is one of the first good experiments we did.
Actually, it was one of the first experiments we did. But it came out very nicely. So it encouraged us to do a lot of others. Most of them look much like this. So I'll show you a few other examples.
But I also want to contrast it with this plot over here, which is the same data from humans on the y-axis, but a different version of the model. You could call it an ablation, or something. Although, in some ways, strictly it's better. It's the same physics engine but without the probability or the uncertainty.
So this is a system which perfectly solves the vision problem. It perfectly knows where all the blocks are, no uncertainty there. And no uncertainty in the physics either. Our model has a little bit of uncertainty about exactly which forces are operative. Maybe the table could be bumped, or there could be a wind; there are various ways of inserting force uncertainty. It's hard to distinguish the different ones here.
But here we have a model which has no uncertainty. It solves the vision and the physics problem perfectly. And so in some sense, it's better. It's fully accurate. But it does not fit human data nearly as well.
The correlation with people's judgments is more around 0.6 rather than around 0.9. And you can see the big scatter. In particular, what you can see is, there's a bunch of points in the upper left here, which are things that people think are unstable-- high means likely to fall-- but the model thinks are either perfectly stable or mostly stable.
The red dot, the red stimulus is a classic one. It's this one here. This is one of the ones I showed you before when I was showing you the inverse graphics. And if you look at this, I'm sure most of you would agree, it looks like it should be falling over. It looks unstable under gravity.
It actually, in the ground truth physics, is stable. That's because the blocks are just perfectly arranged that it won't fall over. But we don't see it that way. It's very hard to see that.
And the probabilistic model captures that. The ground truth model doesn't. This is the worst mistake the ground truth model makes but it makes others. And the reason is, or what the model suggests is that this is an interesting probabilistic physics illusion.
At least our hypothesis here is that the way we see the world has this uncertainty intrinsic in it, especially because our perceptual system isn't perfect in solving the 3D problem. There's going to be some uncertainty. Then we need to propagate that uncertainty through our intuitive physics.
Or even if we only propagate one sample, one posterior predictive sample typically is going to be unstable. Because if you just were to change the block position just slightly, or if you got it just slightly wrong, it actually would be unstable. And almost all of them are unstable, which means probabilistically, we should be confident that it's likely to fall over.
But this is just one of many illusions that we're familiar with in vision. In some sense, your visual system does the wrong thing. But it's actually the right thing, if you think about a rational probabilistic inference.
Now we could contrast this model. I'll show you a few other things we can do with this model. But I want to first of all contrast it with a different kind of approach, a more pure pattern recognition approach. Basically, end to end deep convolutional neural network, which you can certainly build and get it to solve this kind of problem.
And it represents a reasonable alternative hypothesis, which some of you might have, I don't know. Some of you might find this model compelling, the one I presented. Others might say, well, do you really have to have a mental model of blocks and physics and all this stuff? We've had a lot of experience in our life with objects falling over.
We've played Jenga. We've played with blocks. Maybe we really can take a more pattern recognition approach.
And you could instantiate that idea by taking a deep network, which is fairly similar to the ones that have been used to do object recognition or detection or ventral stream modeling. But now train it end to end to map from images of stacks of blocks to several different outputs or losses. One is, will it fall over or not? Another might be a pixel mask prediction thing. Turns out that helps a little bit to predict where the pixel flow is going to go.
Importantly, this is a system which is not built with any notion of objects. It's just going from pixels to either binary classification, yes, no, fall, or other pixel mask things. It was built by a very good team of deep learning researchers at Facebook AI. And it works very well.
But this is a great example of how a deep learning system can be trained to solve a particular version of a problem but not necessarily to capture the underlying structure needed to generalize. So you train the system on 200,000 images of cubes. In particular, it works especially well if the cubes are all different colors because then it gets nice pixel flow information.
So you can train it on stacks of 2, 3, and 4 cubes and it can generalize in some ways. It can generalize to five cubes. The most impressive generalizations it makes are actually to real-world images of the same cubes. I'll show you that in a bit. But it doesn't generalize to much other stuff beyond that, in terms of differently shaped objects, different numbers of objects, or different kinds of judgments besides will those blocks fall over.
Whereas, you can generalize in all those ways without any special new training. So the system that we built can solve the problem I just showed you. But it can also solve all these other problems, like which way will the blocks fall, or how far will they fall?
Or it can be reprogrammed in a sense because it has real representations of physical objects and their properties via language. When I talked about how natural language grounds out in these core physics representations, I can tell you, for example, that as you see here in these scenes, there's two different color materials. And suppose I tell you that the gray stuff is 10 times heavier than the green stuff. Well, that should change your judgments.
So if you look at that pair of blocks there, or this pair of towers, they have the same geometry but they're colored differently. And people judge that they're going to fall in very different directions. And so does the model.
Or I can show you a scene like this, where the blocks seem like they should be falling over but they're not. And then I can say, well, do you think that maybe the red stuff or the yellow stuff is a lot heavier than the other material? So you can make an inverse inference about the density, if you like, or the relative mass of the different kinds of objects. People can do that too. And our model can do that.
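That inverse inference about relative mass can be sketched as a grid-approximation Bayesian update. The counterweight scene and all its numbers here are hypothetical; the point is just that simulated stability, under perceptual noise, acts as a likelihood over candidate mass ratios:

```python
import random

def stays_up(mass_ratio, overhang=0.7, noise=0.1, rng=random):
    # Hypothetical counterweight scene: a light block overhangs a ledge
    # and is held down by a block of a second, possibly heavier material.
    # It stays up iff the counterweight's torque wins; the overhang is
    # perceived with noise, mirroring uncertainty in the 3D percept.
    o = overhang + rng.gauss(0, noise)
    return mass_ratio * 0.5 > o

def posterior_over_mass(ratios, n_sims=200, seed=0):
    # Observing that the scene does NOT fall, score each candidate mass
    # ratio by the simulated probability it would stay up (a likelihood),
    # then normalize: a grid-approximation Bayesian inverse inference.
    rng = random.Random(seed)
    likes = [sum(stays_up(r, rng=rng) for _ in range(n_sims)) / n_sims
             for r in ratios]
    total = sum(likes)
    return [l / total for l in likes]

ratios = [0.5, 1.0, 2.0, 5.0, 10.0]   # candidate "how much heavier" values
post = posterior_over_mass(ratios)    # posterior given "it didn't fall"
```

Seeing the scene hold together shifts the posterior toward the heavier ratios, which is the shape of the judgment people make when a tower that looks like it should topple doesn't.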
I'll show you one other judgment, which is in some sense, a kind of weird one. It makes for a fun interactive demo. We'll do that in a second.
But I also want to emphasize the point here, which is really the point I'm trying to develop on this slide: for common sense human intelligence, if you like, there isn't a task. A basic thing about machine learning and the way machine learning, not just deep learning, approaches have been deployed in AI-- and this is really what I mean by working on AI problems as if they're pattern recognition problems-- is you define a task. And then you build a data set for the task. And then you build a system to solve the task, to find the patterns in the data which solve the task.
But a really basic difference between that and intelligence in the mind-- and it's not just humans, but especially in humans-- is that you build models that are not just designed or not just built or trained for a task. You can reprogram them on the fly to solve tasks that you've never done before or maybe never even thought of. So consider this task. Here we have a table with red and yellow blocks. And imagine that the table is bumped hard enough to knock some of the blocks onto the floor. So do you think it's more likely to be red or yellow blocks? You guys tell me, red or yellow?
JOSH TENENBAUM: Red, OK. This is a question which you can't just answer based on previous experience of stacking blocks. Now you could say, well, I have some other previous experience. Like I see there's a lot more red blocks so it's probably just that. But let's try a few others. So how about here, red or yellow?
JOSH TENENBAUM: How about here?
JOSH TENENBAUM: Here.
JOSH TENENBAUM: Yeah, OK, good. We could go on all day but we'll stop there. Maybe you get the point. I could generate an infinite number of these stimuli. And you can see here, there's fairly broad agreement. But some of these are harder. There's more uncertainty or you're slower for some of those.
And again, for those of you who've done behavioral work, I don't know if you've talked about this in terms of psychophysics tutorials, but for many, many other kinds of especially lower level psychophysics tasks that have been studied for more than a century, there are basic speed-accuracy trade-offs. Some things are harder, some things are easier. And when something is harder, there tends to be more disagreement in the judgments different people make. And they tend to be a little bit slower. And you just saw that right here in your own responses.
And that's totally consistent with a model that says you do something like this. You run a simulation of the underlying physics that the problem asks you to do. And maybe you run one or a couple. And if it's very clear, then you respond. Otherwise you might run a few more simulations.
The simulations in this case might look like this. So here's a simulation of one of these scenes with a small bump. And here's a simulation with a big bump. And again, the problem didn't specify big bump or small bump. You might try one or two. Or you might just try one and just see what happens.
But it doesn't really matter which of those simulations you ran because the outcome at the intuitive level is the same. And also notice, you don't have to run the simulation all the way to the end. You can just run either of these simulations for a few time steps and you already know what's going to happen.
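A toy version of this red-versus-yellow bump task, with made-up block positions and a bump strength drawn at random precisely because the question doesn't specify one, might look like this:

```python
import random

def bump_outcome(blocks, rng):
    # One posterior predictive sample: pick a bump strength at random
    # (the question doesn't say how hard the table is bumped), slide
    # every block toward the edge by that amount, and count which go
    # over. `blocks` maps colors to distances-from-edge (made-up units).
    bump = abs(rng.gauss(0.5, 0.3))
    return {color: sum(1 for d in dists if d < bump)
            for color, dists in blocks.items()}

def more_likely_to_fall(blocks, n_samples=100, seed=0):
    # Run a few simulations and vote: which color loses more blocks overall?
    rng = random.Random(seed)
    totals = {color: 0 for color in blocks}
    for _ in range(n_samples):
        for color, n in bump_outcome(blocks, rng).items():
            totals[color] += n
    return max(totals, key=totals.get)

scene = {"red": [0.1, 0.2, 0.3],      # red blocks sit near the edge
         "yellow": [0.9, 1.0, 1.1]}   # yellow blocks sit far from it
answer = more_likely_to_fall(scene)
```

Nothing here was trained on this task; the same simulator that answered "will it fall?" is simply queried with a new question, which is the reprogramming-on-the-fly point.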
It also doesn't have to be very accurate. I'm showing you a fairly fine grained accurate simulation. But you don't have to run a fully accurate simulation.
It can have a lot of noise or low precision and it'll still come out about the same. Noise and low precision will bias you towards seeing things as more unstable. But as I already showed you, and we have a lot of other evidence for this, human perception is biased in that way.
Here what I'm showing you is the analogous plot of data and model for this task. And what you can see is the fits are a little bit worse, but basically the same as I showed you for the previous experiment. But in the previous one, that was a question of, will these blocks fall over, how likely they fall over? You've probably been making judgments very much like that since you were a kid. This is a judgment that you didn't make until just this morning, unless you've seen this talk before.
So the point is that you can do both and the model basically fits about as well. Because basically, what we think the model is capturing is an engine in your head, which can do all these things and so many others. There's no learning here. I'll talk increasingly about learning as we go on for the next few minutes.
Part of the reason I'm not talking very much about it in this lecture is that it's hard and we're still working on it: the problem of how you learn a physics engine. But learning in or with a physics engine is a very straightforward thing to do. And it's something I think people do a lot, both kids and adults.
Whenever you learn a new physics game, there's often some interestingly different physics, some weird force, some weird tweaks, something happens and you have to learn it. And you can learn it very quickly. You might not learn the right thing to do to solve a new level of a physics game. But you can look at some weird, interesting physics and in just a few seconds figure out what's going on.
Think about the first time you saw a touch screen glass display, like an iPhone or an iPad. And you see, oh, I move my finger across it and, wow, it lights up or it moves, or something. Or like the first time you saw an air hockey table like that kid.
Or here what I'm showing you-- this is from some work that Tomer Ullman and colleagues have done-- where I'm showing you overhead views of some weird air hockey tables, where there's four pucks bouncing around. And each of them is a different little world. They all are Newtonian in the sense that there's something like F equals ma.
But what you might notice is that there are different forces in play. If you look at some of them, you might see what looked like forces of attraction or repulsion. Do you see that? Or sometimes if you look closely, you might see that some of the patches, different patches of ground have more friction than others.
So what Tomer and colleagues did is they showed people brief clips, brief movies in a number of these different worlds, just 5 seconds of a movie. They were allowed to view it a couple of times. But they were asked a number of different questions, like what forces were operative? What attracts what? What might repel?
What are the relative masses of the different colors of objects? What are the relative friction or roughness of the different surface patches? And people can judge all of these things. They're far from perfect and the models are also far from perfectly capturing people.
This is a harder task, especially when you only have 5 seconds and you have to make a bunch of different physical judgments. But people can do this. And it's interesting that they can do it at all. They can get somewhat better if you allow them to actively intervene and move things around and do little experiments, as Neil Bramley and colleagues have shown.
And then we can model this with hierarchical Bayesian inference. And I'm just sketching this here. Again, the tool is probabilistic programs. You'll hear more about hierarchical Bayesian models a little bit later today, and more from Sam Gershman and others next week.
But again, this idea has been a very powerful way for cognitive scientists and AI researchers to think about learning in these more structured models. The idea is to say, well, just as I can do perception as a Bayesian inference, where I have a hypothesis space of scenes and a prior, I can push back to multiple levels of inference to capture more abstract, longer time scale kinds of inferences, which are like learning.
So I can say, for example, in this case, I can give my system what we're calling here a meta-theory. I can basically give it, effectively, Newtonian mechanics. But I can then say, well, that really generates a space of more specific theories. It puts a prior on a space of more specific theories that correspond to different settings of what forces are operative or what the different parameters for mass or friction are.
So the top level is given. The middle level is completely unknown. And then the bottom level is observed. Because then once I've fixed all the details of the forces and the physics and some initial condition, then it just runs. And I get to observe the output of that middle level. And I can make a Bayesian inference to the parameters at that middle level. And that's how this model works. And it somewhat captures what people do.
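Here is a toy illustration of that three-level structure. Everything in it is a simplification I'm introducing for exposition, not the actual model from the work described: the meta-theory is given (Newtonian sliding with friction), the middle level (the friction coefficient `mu`) is the unknown to be inferred, and the bottom level (an observed stopping distance) is the data. A simple grid over candidate parameters stands in for the real inference machinery.

```python
import math

def stopping_distance(v0, mu, g=9.8):
    """Top level (given meta-theory): Newtonian sliding friction.
    From v0^2 = 2 * mu * g * d, the block stops after distance d."""
    return v0 ** 2 / (2 * mu * g)

def posterior_mu(observed_d, v0, mus, noise_sd=0.3):
    """Middle level (unknown): grid Bayesian inference over candidate
    friction coefficients, with a uniform prior and Gaussian noise on
    the observed (bottom-level) stopping distance."""
    weights = []
    for mu in mus:
        pred = stopping_distance(v0, mu)
        likelihood = math.exp(-0.5 * ((observed_d - pred) / noise_sd) ** 2)
        weights.append(likelihood)
    z = sum(weights)
    return [w / z for w in weights]
```

Given a few seconds of observed motion, the posterior concentrates on the parameter setting whose simulated output best matches what was seen, which is the sense in which "learning" here is just inference at a higher level.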
I think there's a lot more work that has to be done to make this a better model of what people do. But it shows at least in principle, how from a hierarchical probabilistic programmed point of view, you can capture this very rapid learning about a range of different ways the physical world could work. This isn't how I think babies build their intuitive physics. But it's a kind of learning you can do to tune and adapt your intuitive physics to all sorts of ways the world presents itself to you.
Of course, we talked a while about intuitive psychology. I'll spend a little bit less time talking about this. But hopefully, the ideas now should be intuitive and also the kind of models and experiments.
Ultimately, we'd like to understand what's going on inside the kid's head in the Warneken and Tomasello experiment. Or what goes on inside the 13-month-old's head who sees the chasing and fleeing. I'll show you some things stepping towards that.
The basic idea here, though, is to try to formalize the intuition we all talked about of efficient action planning. That's at the heart of how humans understand action-- definitely at the heart of how humans produce action in the contexts we're talking about here, which are grounded in a local spatiotemporal environment. So actions which are like reaching for things, or moving around and then reaching for things.
I'm not talking about actions like planning your financial future, where people are probably far from efficient. Or actions like planning your PhD or your research, where again, we are often far from efficient. But we still want to be efficient.
But in these kinds of situations, humans and animals are very efficient. And we seem to be built from very early ages to understand people in that way by doing what we call inverse planning: assuming that the actions we see are the output of a planning process which is pretty close to efficient, maybe with some uncertainty, and which takes as input beliefs and desires.
So beliefs are a model of the world, especially the physical structure of the world. Desires are the agent's goals. And then you have a planning algorithm, which produces actions to achieve the desires efficiently given the beliefs.
Now we as observers see the output of the planning algorithm and have to work backwards to its inputs. Just like when we're doing vision as inverse graphics, we see the output of graphics and have to work backwards. But the same tools apply here. We can apply them in a range of settings. So we started doing this line of work more than 10 years ago with Chris Baker. And Rebecca Saxe has been heavily involved in a lot of this work too and inspired a lot of it.
But Chris Baker started just doing basic goal inference. So you're just seeing an agent move around. And there's no worry about beliefs. You just assume you and the agent have perfect understanding of the environment.
But the question is just, as you see an agent moving around with constraints, different constraints, walls, and holes, and different possible goals, what do you think the agent wants? And how do your inferences about the agent's goals or desires change over time as a function of how the agent moves and turns and might surprise you?
And what we found-- I'm just showing you a few of these scenarios. These are just little snapshots as an agent moves along this path. And we asked people at different points in time, which of the three objects, A, B, or C, is the agent's goal?
And you can see all sorts of interesting dynamics, where sometimes it's very clear from the beginning. Sometimes you can't tell, and then at some point it really resolves into these two. Sometimes, as here, up until a certain point you think the agent's headed for A. But then they take a step in this direction and it's, oh, I guess it's B.
There are all sorts of double or triple switches in these domains. I'm just showing you four of dozens of such situations. And you can see in these cases, there's people and the model. And they're very, very similar-- almost impossible to tell apart.
Or here I'm showing you a scatterplot of all the judgments from this experiment overall. And what's interesting is that the correlations here are even higher than in the intuitive physics. They're very high but they're even higher.
Cognitive scientists will tell you reasons for this. It might be that people are actually, on long time scales, much more predictable than physical objects, at least when they're moving around simple spatial environments. And people intuitively understand that predictability.
The basic structure, again, is really driven by what we've called the naive utility calculus, which is the structure of costs and rewards that the planning algorithm has to be sensitive to. The models are actually pretty simple but they have to make assumptions about the agent's reward structure. But they're very, I think, natural assumptions.
So what we assume here is that agents get a big reward, a positive utility for getting to a goal. And they have a small cost for each action step, each step. And then they make plans which try to maximize their long-run expected utility, reward minus cost. And that's a way to get just efficient planning. But it allows you to generalize to all sorts of settings that aren't just shortest paths. They can include false beliefs because of the notion that it's expected utility not true utility.
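That reward-minus-cost structure can be sketched concretely. This is a hypothetical grid-world stand-in, not the actual model from the experiments: the utility of pursuing a goal is a fixed reward for reaching it minus a small cost per step along the shortest available path, with walls as constraints.

```python
from collections import deque

def plan_utility(grid, start, goal, reward=10.0, step_cost=0.5):
    """Toy naive utility calculus: utility of a goal = reward for
    reaching it minus per-step cost along the shortest path (BFS).
    Grid cells marked '#' are walls; positions are (row, col)."""
    rows, cols = len(grid), len(grid[0])
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        (r, c), steps = frontier.popleft()
        if (r, c) == goal:
            return reward - step_cost * steps
        for dr, dc in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                frontier.append(((nr, nc), steps + 1))
    return float('-inf')  # goal unreachable
```

An efficient agent picks whichever goal maximizes this quantity; the observer, seeing only the path, works backwards to which goal would have made the observed path the high-utility choice.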
But in this case, it's just basically shortest paths. But there's some uncertainty because the agent could change its mind. So the agent might not know exactly what it wants or where the objects are, a little bit. Or you just might not know. Or it just might change its mind for some reason. And that gives you some extra structure here.
It gets a little bit more interesting when we actually allow for more interesting kinds of uncertainty. So here I'll go into a little bit more detail about the nature of this experiment. This is later work that Chris Baker did together with Julian Jara-Ettinger, and Rebecca, and me. And it's what we call the food truck paradigm. And I think it illustrates in a simpler form some of the same things you saw in the red and blue ball chasing each other.
So here we have a little snapshot of a university campus. Our campus at MIT is somewhat like this. Yours might be wherever you are. There are buildings and there are lunch spots. The most popular lunch spots, at least for grad students, are food trucks. So they're trucks with various kinds of different food.
In this part of campus, there's three trucks that tend to come to campus. There's a Korean truck, or K, there's a Lebanese truck, L, and there's a Mexican truck, M. On this particular campus part, though, there's only two parking spots, shown in yellow. And on different days, different trucks get there different times and park in different spots.
So on this particular day, the Korean truck parked down in the lower left. And the Lebanese truck parked in the upper right, or the northeast, I guess. Think of it as an overhead view. There's a big building in the middle. And let's say that our friendly student comes out of their office here.
This is part of the experiment, and people know this. They can see what's in their line of sight. So in this case, the student can see that there's the Korean truck there but they don't know what's on the other side of the building, although they know where the parking spots are.
So what do they do? Well, let's see. So they go towards but past the Korean truck. They go to the other side, where now they can see the Lebanese truck. But they turn around then and go back to the Korean truck. So the question is, what's their favorite truck?
JOSH TENENBAUM: Mexican, yeah. Isn't that interesting? What's their second favorite truck?
JOSH TENENBAUM: Some of you might have been tempted to say Korean. You had to remember that Mexican was an option. But if you did, it's interesting that you know that. So what's going on here?
Again, simple kinds of action understanding or goal inference can be done in a purely perceptual way. If there's a cup of coffee here, or when I went to reach for the tape, when I first started to go around here-- remember the tape was there-- you didn't know what I was doing. But at some point, you saw me reach for it. It's like, OK, I guess he wants the tape.
But here, the thing that you think they want is just not present in the scene. It's only present in your representation of the agent's representation of the scene. If you know that they know that the Mexican truck could have been there, then that's really the only way to make sense of this as an efficient action.
Or in other words, you see their path. It's obviously not an efficient path to the Korean truck because there's a much more efficient path. But it's an efficient path subject to some false and true beliefs, a mixed set of beliefs, which you just also see together with the desires.
So in our experiment, we asked people both to rate the degree of preference for each of the trucks. But we also asked them to rate degree of belief for the agent's priors before they started moving, what did they think was on the other side of the building that they couldn't see?
So in this case, people say that the favorite truck is Mexican and that it's most likely that on the other side of the building, the Mexican truck is there, as opposed to the other possible worlds. They know it can't be Korean because that's here. But it could have been the Lebanese truck or it could just be nothing. On different days, those are all options; those all actually happen.
But importantly, it's a joint inference that they like Mexican and that they thought it was likely to be there because otherwise, it just wouldn't have made sense for them to do what they did. If they liked Mexican but they knew it wasn't there or they didn't think it was there, then they would just go to Korean, if that's their second choice. So people and the model basically make those inferences. The model captures that pattern of inference of joint belief desire, as well as many others.
So again in this experiment-- this was the Nature Human Behavior paper that just came out last year-- there's dozens of these scenarios. And the model pretty much captures all of them. Not completely, but really captures most of them and across different conditions.
Again, it's very high correlations for both beliefs and desires. And we contrast it with alternative models that ablate one part or the other. So again, I think this is really starting to show that we can reverse engineer these aspects of not just common sense physics but also common sense psychology as it sits on top of physics.
Another thing we've been scaling up to is-- the way I thought of this, though, is more as scaling from grid worlds, little dot worlds to the kind of things that are more like what we do in our natural world or what AI robotics would want. So actually taking video of somebody reaching, a real person reaching for an object in three dimensional space. In this case, there is a 4 by 4 array of objects. And I'll show you this movie in slow motion actually. See when you think you know which of those objects on the table she's reaching for. I don't know. Raise your hand when you think you know which one she's reaching for.
OK. Most of the hands are up by now. Some of you raised your hands early. By around here, this point, this is the timing thing, most of you were raising your hands. This black dashed line is the correct answer.
This is not data from an experiment. This is actually a model prediction which we're currently engaged in testing. This is work that Tao Gao and others are doing. It started with Chris and Yibiao, but Tao is mostly driving this now.
The way the model works, the way we make these predictions, is we're using what's called the MuJoCo physics engine. This is a state-of-the-art AI tool for planning whole body or humanoid robot motion. Again, the idea-- the robotics people really want to do this-- is to plan efficient actions, like efficient reaches or efficient whole body motions. Because this is really a whole body motion, it's not just moving your arm.
And MuJoCo and other tools allow you to do this. They allow you to compute the most efficient trajectory. That's the forward or causal model. That's the model conditioned on desires, make an efficient action sequence.
Then we want to invert that. So we want to work backwards and say, what's the most likely goal input to that physically efficient motion planner? And that's what we're computing here. It's a graded computation. And that's our Bayesian posterior probability of the input given the observed outputs, or the goal given the observed action sequence.
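The inversion step can be sketched as code, though of course not with MuJoCo itself: this is a hypothetical grid-world reduction where the likelihood of each observed step is a softmax over progress toward each candidate goal, and `beta` is an assumed rationality parameter (higher `beta` means a more reliably efficient agent).

```python
import math

def goal_posterior(path, goals, beta=2.0):
    """Bayesian inverse planning sketch: assume an approximately
    efficient agent that prefers steps reducing its distance to its
    goal. Returns a posterior over candidate goals given an observed
    partial path, starting from a uniform prior."""
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    logp = [0.0] * len(goals)
    for prev, cur in zip(path, path[1:]):
        for i, g in enumerate(goals):
            progress = dist(prev, g) - dist(cur, g)  # +1 toward, -1 away
            logp[i] += beta * progress
    weights = [math.exp(lp) for lp in logp]
    z = sum(weights)
    return [w / z for w in weights]
```

Because the posterior is graded, a few ambiguous early steps leave probability spread across goals, and a single step that's only efficient for one goal can sharply resolve it, matching the switching dynamics in the judgment curves.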
And we are optimistic that this will at least, to some extent, capture people's abilities to do this Bayesian inverse planning, even with a full body motion in a complex scene. You can extend this to things which are even less like reaching. But MuJoCo, or any physics engine, can handle all of these things.
A scene like this, where you get a sense of what she's reaching for just by how she stands up and starts to move, or here, before she's started to move her arms at all. Part of it, of course, is driven by her eye gaze. But even if you couldn't see her head-- you'd have to do that, block her head-- just the torso and the arms and legs is enough to narrow it down to at least a couple of blocks. And to make a pretty reasonable guess that, when you then see how she actually reaches, turns out to be either close or actually right. You might have been surprised at which hand she used, but OK.
Or take this scene here. This is Tao. Which object is he reaching for? Oops, oops, sorry. Well, you can tell even before he actually touches it, maybe around now.
But notice he's doing something weird. What's weird in that scene? He's reaching over something. What's he reaching over?
AUDIENCE: Oh, a string.
JOSH TENENBAUM: It's either a string or it's a plane of glass, actually. Again, think about this from a computer vision standpoint or a biological vision standpoint. He's reaching over something which is practically invisible. But yet you can see it because of how he reaches over it.
You see it mostly in the way he moves around it, his efficient actions subject to a constraint that he knows is there. So somehow we have to be able to see that to make sense of it. And I think you're only going to be able to do that by using this kind of model, as opposed to just a more bottom-up, image-driven pattern recognition system.
We can extend these models to what we call sometimes in AI, a multiagent setting, or more of a social setting. Where now there's two agents acting and we want to interpret both of them, like the chasing and fleeing, or the Warneken and Tomasello scene. So what makes this scene here look like one person has a goal and another person is trying to help them achieve the goal? Or what makes this scene here look not like helping but like the opposite?
So the way we model that-- we can model that too by saying that we have these planning models, these utility-theoretic planning models, but they now have recursively defined utility functions. So we say helping is something like: one agent is helping another if the first agent acts as if its expected utility is a positive function of its expectation about the other agent's expected utility.
That sounds complicated but it's not that complicated. In many cultures, we have names for this, usually in moral principles, like the golden rule. Do unto others as you would have them do unto you. Or do unto others as you think they would want to be done unto.
Basic principles like that are at the heart of a lot of common sense morality. And in work-- not work that we've done, although, we've collaborated on some studies-- but work especially done by Kiley Hamlin, who's another great infant researcher, and work she started doing in her PhD with Karen Wynn and Paul Bloom at Yale and has long done in her own lab at UBC, British Columbia. They've shown that even very young infants seem to be able to understand helping and hindering type behavior like this.
And we've built models, partly together with Kiley and Tomer Ullman, about how that works. We've even shown some evidence, which I don't have time to show you here, that even 10-month-old infants-- not the very youngest, but 10-month-olds-- seem to apply at least a primitive, proto-qualitative version of this recursive utility calculus in terms of how they understand when an agent is helping another one. And we show that by having actions which on their own are not uniquely identifiable as either helping or hindering, but depending on the context that the infant has previously seen, the very same action would be either helpful or not helpful based on what they know one agent knows about another agent's goal.
And if that sounds like a lot for a 10-month-old infant, it is a lot. Not all the infants pass that task. But the majority of even 10-month-old infants do.
I'll just show you two examples of other infant studies because, again, mostly I focus on adults. But you can actually test these models quantitatively with young kids. So with 12-month-old infants-- in collaboration with Luca Bonatti and Erno Teglas, this was work that Teglas did as part of his PhD in Bonatti's lab-- we tested these probabilistic intuitive physics models. And I won't go through the details of how we did it. Really, they did it, Teglas, and Bonatti, and their colleagues did all the hard work. Vul and I contributed a little bit through some models.
Here are the physics involved. Objects moving in this semi-random but physical way inside a lottery machine or gumball machine. And then after a certain period of time of occlusion, one of the four objects appeared outside through a hole. And they varied which object it was, the rare one or the common one-- you can see there's one object of one type and three of the other.
They also varied how long the period of occlusion was: zero, one, or two seconds. And where the objects were right before occlusion: near or far. And as a function of those several variables, there's a rational probabilistic prediction-- any physicist would make the same prediction. It's almost like baby stat mech, but not even very statty.
But basically, you can probably see how these would give rise to graded expectations of how likely each outcome would be. And you could set up a surprise here. The classic infant measure is a violation-of-expectation looking-time measure. You show infants something that's very surprising and they look longer, just like you might.
Here a surprise would be, for example, if you set up something where there's three yellow objects right near the door and one blue object far away. You have the occlusion, and then right away an object appears and it's blue. Because it's like, how did it get there? It was over here.
Another surprising thing might be if they move around for a while: even if the blue one started out close, if blue is rare, you'd expect after a while they've randomized. So you should see the common one, the more frequent one, come out.
It turns out, infants actually make all those probabilistic expectations. And we were able to quantitatively model this. So we showed that the looking time was roughly proportional to just the inverse of the probability under this probabilistic physics model.
Well, here's some recent work that was done by Shari Liu, who is in Liz Spelke's lab. And she's a member of CBMM as a PhD student. And she was a student in the summer school last year.
And this was work she did together with Tomer Ullman. If you really want to appreciate this, I better turn up the sound and start it over. But here you're seeing infants applying this naive utility calculus cost reward stuff to goal inference using principles of efficiency and indeed efficient force-based planning.
So what you see here is, the infant sees this. This is what the infant sees. They see one agent faced with a costly action to get to a goal. And he declines. Now he's faced with the same costly action to a different goal or other agent and he accepts, in a sense. He takes that costly action.
And what we show, what Shari and Tomer showed, is that the infants think the agent assigns higher value to, or prefers the one that it took the more costly actions for. In this stimulus, it's trivial because it went for one and it didn't go for the other. But the experiment is much more carefully controlled. The red agent always goes the same number of times to both objects. It's just that they're faced with costlier actions for one than for the other.
So they basically always take less costly actions and decline more costly ones. But it's graded in terms of what they're faced with. So the walls can vary in height, and infants see choices among different wall heights.
They also vary other physical parameters. Sometimes the costly action is sliding up a steep ramp versus a shallow ramp. Other times, it's jumping over a gap that's narrow or broad. And it doesn't matter which of those it is. But in each case, the underlying commonality is that there's a more physically costly action, in the sense of work done, like literally integral of force exerted over a path. That's a metric that picks out all these cases.
And if you say, well, the more work you have to do, the costlier the action. And you assume that they're making a utility efficient choice. Then there better be a higher reward if you're taking costlier actions. And really quite strikingly, even 10-month-old infants make those inferences in a graded way. This is the work of Kiley Hamlin's, which I mentioned, but I'll skip.
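That cost metric can be sketched directly. This is an illustrative simplification: work is approximated as a Riemann sum of force samples over equal increments of the path, which is what lets wall-climbing, ramp-sliding, and gap-jumping all fall under one number.

```python
def work_done(forces, step=0.1):
    """Physical cost of an action as work: the integral of force
    exerted along the path, approximated as a Riemann sum over
    force samples taken at equal path increments (length `step`)."""
    return sum(f * step for f in forces)
```

Under the utility logic above, if an agent willingly takes the action with the larger work, the observer should infer a proportionally larger reward for that goal, which is the graded inference the infants appear to make.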
So that's most of what I wanted to talk about, about computational models and psychophysics of common sense. The cognitive science part properly mostly wraps up there.