Embodied Cognition
Date Posted:
September 23, 2024
Date Recorded:
August 13, 2024
Speaker(s):
Vikash Mansinghka, MIT
Brains, Minds and Machines Summer Course 2024
VIKASH MANSINGHKA: I'm just going to start with some intro slides from Leslie, just to set the big picture on embodied cognition. So many important constraints arise from trying to build agents that have a physical body. And I think across several of the lectures today and tomorrow, and then also the guest speaker this afternoon, you're going to hear about design choices that are really motivated by all of these-- the fact that the agent has a whole system-- perception, memory, decision making, control-- to adjust the actions that are taken in response to decisions. The agent's in closed-loop feedback with its environment. And it doesn't want to hurt itself or other agents in the environment.
And actually, all these constraints really take on a fundamentally different character in the embodied setting. Even though you might hear about closed-loop feedback in agents in the digital world, there the cost of failure is a lot lower, and the real-time constraints are often nonexistent. And so hopefully, you'll get a feeling over the course of these lectures for how that changes the design of AI systems and also how that's informing attempts to reverse engineer natural intelligence.
So the schedule for the next two days-- today, we're going to have a lecture from me on navigation and mapping, or building up to navigation and mapping, and an afternoon lab that will let you get hands-on with the probabilistic programming platform for doing exercises in navigation and mapping. You're going to hear from Leslie on planning for robot manipulation, and then a wonderful guest lecture by Stefanie Tellex from Brown.
And then tomorrow, Josh is going to talk about-- Josh Tenenbaum is going to talk about cognitive models of physical reasoning. And then I'm going to present a lecture presenting a theory for how the probabilistic programs you heard about today could be implemented by the brain using circuits of biological neurons with realistic spiking rates and connections to empirical data from the Allen Brain Atlas. And then in the afternoon, there'll be a lab on integrating learning with planning for robots and also another lecture on essentially the same topic.
Any high-level questions about the schedule before we dive in? OK. And maybe just one thing to help me tune some of the presentation. So could you raise your hand if you identify as a roboticist? Zero. OK. A couple. A couple.
[LAUGHTER]
OK, awesome. And what about as a neuroscientist? OK, cool. And any cognitive scientists here? OK, a couple. OK.
[LAUGHTER]
OK. Very good. I think we'll have something for all of you. So this morning, my goal is to cover three topics at a high level of detail-- or sorry, at a low-ish level of detail, so high conceptual level, that set you up for the labs in the afternoon. So first one-- I'm going to try to motivate a building material that we're going to be using in the labs that's maybe quite different from either neural nets that you're used to training or other kinds of software that you're used to writing.
And then we're going to show how to use that material to infer 3D scenes and build up to the basics of embodied navigation and mapping. And then I'll give a preview of the lab in the afternoon, where you'll get to do some of that hands on, with very different compute graphs than the ones that you see in transformers.
So just zooming out, we're all, no doubt, familiar with the very large investment in training predictive models end to end using deep learning platforms like PyTorch, TensorFlow, and JAX. Very interesting and powerful approach, but has some interesting costs, like it requires internet-scale data, serious investments in simulation infrastructure, and also often sticky product loops with humans to provide the labels for human feedback.
Now, certainly, it has produced incredibly impressive advances. But I think it's also very useful to look very carefully and seriously at the places where there are gaps. So how many of you have seen this video before? OK, maybe a couple.
So what you're seeing is a perception system undergoing an interesting sequence of states in response to a horse-drawn buggy that's right in front of it. It's funny, but I think it's also striking given the stakes of the deployment. So if a human or another animal were experiencing those kinds of fluctuating percepts, we might say they were delirious. So how can we build perception systems which don't exhibit these failure modes, or understand the remarkable capability of biological organisms to perceive their environment and build maps without these kinds of failures, at least under normal circumstances?
In fact, I think the struggles in autonomous driving at kind of a large scale really point to some fundamental gaps in the theory that we're going to get to even animal-level embodied cognition using end-to-end training of machine learning models.
So in the middle here, I'm showing a curve from a post mortem by Starsky Robotics, which was one of a reasonable handful of autonomous trucking companies. They were describing how capability will increase as you get more data, but it'll asymptote in a very unpredictable place. And so the bet that the company is testing is whether it's going to asymptote at a high enough level for any autonomy-- L1 autonomy, L2 autonomy. So--
AUDIENCE: Can you remind us the difference?
VIKASH MANSINGHKA: Yeah-- well, I guess maybe somebody else should check me if I have this right, but I think L5 is, like, full self-driving-- the term "L5" is what should be used instead of, quote, "full self driving," although that's been on the market for a long time. There's an interesting website that shows the 10 years of promises Tesla's been making about that. So there are various levels, like driver assist-- like, you're changing lanes and something will give you a warning if you might collide-- increasing all the way up to full autonomy.
Now, in the wrap up for Argo AI, there was an acknowledgment that although the performance of Argo's driving systems was improving as more data was provided, the question was just how fast. So that's another lens on this S curve.
So I'm not trying to argue that the basic feedback loops in this industry picture don't exist, but the question is, what will the cost be in terms of how many orders of magnitude?
And all of that is in, really, I think, striking contrast to the speed and robustness with which children and even nonhuman primates, as it turns out, develop rapid, robust, generalizable world models that are sufficiently full featured to allow them to see the world and drive.
So there's actually an interesting genre of YouTube videos that has young kids, and actually, in some cases, also orangutans, driving golf carts. And although it's reasonable to question the judgment calls made by all of the primates involved in recording these videos, I think none of us are surprised that they can see the world well enough to drive.
And the failures in autonomous driving technology are really-- fundamentally, they've been failures of perception, not failures of executive control on top of perception. Not to say that there aren't also interesting challenges in scaling planning, but I think the gap with the capabilities of natural intelligence is much more fundamental.
So what we're going to be studying today, and what you'll be playing with in the lab, are some ingredients for an approach that has a long history in robotics and computer vision, but that requires very different platform technology and enabling computer science than deep learning, for trying to build perception systems that can learn with something much more like the data efficiency of humans and other animals.
And the central structure of these systems is to have AI systems that construct an internal world model that has multiple levels of representation, so modular, geometric models of the world and its surfaces. On top of that, abstract states that can be used for planning in the world, and then even, ultimately, uncertainty over the rules in the game by which the states of the world evolve. So very different representations in the latent world model than the things that emerge from training large transformers or other, more modular systems built out of neural networks. Yeah, question?
AUDIENCE: It was for the previous slide.
VIKASH MANSINGHKA: Yep.
AUDIENCE: And I guess the root of my question is kind of implied in the phrasing of "embodied cognition," but is this really-- are they-- I don't. Are they really developing a robust, generalizable world model? Because the machine is built for them, too.
VIKASH MANSINGHKA: Great. So maybe just to play back the question-- it's quite interesting-- which is, we've-- and to enlarge it, we humans have really rearranged the world-- [LAUGHS] to make sense to us, heh. So how much of that is responsible for, let's say, kids' abilities to drive? Is that a fair statement of the question? OK. So it's interesting--
AUDIENCE: We've already kind of closed the loop in our environment.
VIKASH MANSINGHKA: Yeah. So definitely very interesting question. I think there are some phenomena which suggest the capacities enabling this are much more general. So one of them, of course, is that actually, nonhuman primates can also drive golf carts. And it turns out one of my favorites is a city bus on the streets of Bombay. Yeah. It's a-- [LAUGHS] but actually, that's not-- maybe not too surprising. I mean, animals of many sizes and forms can see the world in three dimensions very robustly. They have to, to survive.
I mean, in fact, that capacity is even much more evolutionarily ancient, like larval zebrafish, a common model organism in high throughput experiments in neuroscience, which my lab has done some work with-- they have to hunt paramecia in three dimensions. And it turns out there's very strong behavioral evidence that they have a 3D map that they're sensitive to in choosing how they hunt their prey.
And if we reflect on why that might be, it's kind of an evolutionary necessity for organisms to have a model of the world that's sufficiently in register with the world that they can engage in the competitive process of survival. If organisms don't have robust-enough 3D models of the world, they'll starve, or they won't survive long enough to reproduce.
So I think it's a very interesting question exactly what the world models of animals are across the whole spectrum of apparent intelligence. But the idea that they have robust world models and that they're qualitatively much more stable and coherent than the ones produced by our best autonomous driving systems-- I think there's a lot of very good convergent evidence across the neurosciences and behavioral studies of animals that really supports that claim.
It's a great question, though. You can also think about how easily we can learn to play games in virtual worlds, and study what happens when we take animals, especially rodent and mouse models, and study their behavior in virtual environments, to get a sense of how robust our world modeling capability is.
OK. So if we want to do this, that is, we want to build AI systems that have modular, hierarchical world models inside them, it's maybe natural to think that we're going to need to move beyond the deep learning toolkit because, for example, we might, at minimum, like the representation of the scene, like the lecture hall that I'm looking at, to have objects and the objects to have surfaces and to be able to, in my imagination, mentally edit the objects and the layout modularly, as opposed to having to, let's say, tune the weights of some very large, entangled neural representation.
So to help enable that, we and others have been developing a building material called probabilistic programs, which are a computational formalism that extends classic symbolic programming and differentiable computation, as in TensorFlow, PyTorch, and JAX, to include and to support all the interactions with probability theory and symbolic metaprogramming that are needed to automate the math for a much broader class of AI architectures.
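For readers who haven't seen a probabilistic program before, here is a minimal sketch in the style of Gen.jl, the Julia-based system used in the labs. The model, data, and addresses below are illustrative, not code from the lecture: a generative function declares latent random choices, and generic inference machinery conditions them on observations.

```julia
using Gen

# A toy generative function: a latent slope and intercept generate noisy observations.
@gen function line_model(xs::Vector{Float64})
    slope ~ normal(0, 2)
    intercept ~ normal(0, 2)
    for (i, x) in enumerate(xs)
        {(:y, i)} ~ normal(slope * x + intercept, 0.5)
    end
end

# Condition on observed ys and draw an approximate posterior sample with Metropolis-Hastings.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.3, 1.1, 2.2, 2.8, 4.1]
observations = choicemap()
for (i, y) in enumerate(ys)
    observations[(:y, i)] = y
end
trace, _ = Gen.generate(line_model, (xs,), observations)
for _ in 1:1000
    trace, _ = Gen.mh(trace, select(:slope, :intercept))  # resimulate the latents from the prior
end
println((trace[:slope], trace[:intercept]))
```

The same pattern-- a generative program plus an inference program that explores its traces-- is what scales up to the 3D scene models later in the lecture.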
So in the lab this afternoon, you'll get to play with some of our probabilistic programming languages in a navigation and mapping setting. But for those of you who may be new to the field, over the last several years, probabilistic programming has started to outperform machine learning and deep learning systems on benchmark problems from a range of industry application areas, like inferring the structure of 3D scenes, which we'll hear a lot more about today; deduplicating, linking, and cleaning databases with millions of records; and forecasting time series and other multivariate data streams.
So really, today, I'm going to focus on the embodied applications, but like with neural networks, as they started to emerge, the applications that they've been put to span the spectrum of places where computation needs to make contact with data in the external world. And we've been partnered with Google for the last couple of years to help develop our probabilistic programming platforms and apply them in 3D scene perception and other areas.
OK. So by the end of the lab this afternoon, my hope is that you'll have some basic understanding of how systems like this can be built. So what you're seeing here is the output of a probabilistic program that's building a map of one of the floors in the Computer Science and AI Lab at MIT. And it's doing that by estimating its current position and then jointly estimating where the walls are, and correcting for the various sources of error that come from the uncertainty in the robot's own odometry readings and in sensor noise.
But in fact, my hope is that you'll have an idea of how we can build systems that are one or two steps maybe more cognitively plausible than classic SLAM in robotics. So here's an example capability which can emerge from probabilistic programs that are doing hierarchical Bayesian reasoning about where they are and what their environment is.
So here on the left, I'm showing a simulation of the external world in which this little black circle and line indicates the position and orientation, i.e. the pose, of the robot. And in the middle, I'm showing where we told the robot it is. So we're torturing this poor virtual creature and putting it in a different room than it actually is.
Now, can you imagine what will happen as it starts to navigate? Maybe you can imagine what should happen.
AUDIENCE: The user?
VIKASH MANSINGHKA: Great. OK. So, right. So certainly, if we woke up in a room and we thought we were in a different room than we woke up in and we walked out in the hallway and looked around, and we were, like, hey, this doesn't fit. And yet interestingly enough, it might not be surprising to know that deep learning systems for localization can struggle to do this.
It's in a way-- it's not the same as the failure you were seeing in perception from the Tesla, but it's kind of similar. It's like, when the model's assumptions are out of register with the world, how do we make AI systems which don't just kind of blindly project from their training data, but instead go, OK, well, maybe something's changed that I can't fully explain, and do some kind of online learning to correct it?
So now I'm just showing what happens with our probabilistic program when the robot leaves the room. So as you said, it detects that something's gone wrong and fixes it. But it's worth just reflecting a little bit on what went on. So-- yeah?
AUDIENCE: Did it just fix itself by putting itself in this position where it thought it was?
VIKASH MANSINGHKA: Well, so there's a series of steps that goes through. So first-- so if you look in the right panel, what you're seeing are the aspects of its world model which have to do with how much it can trust the data from its senses. So the y-axis is the probability that any given sensor reading is an outlier, and the x-axis is showing the expected noise around inliers. So if all the probability mass is in the top-right corner of that heat map, that's like saying, I cannot believe my eyes.
And what happens once the robot leaves the room is that you can see the distribution on how much it can trust its sensors starts to concentrate on not very much. And after that's proceeded for a while, the robot's world model updates.
There's actually a sequence of reasoning steps that happen where first it goes, OK, things don't quite-- things seem like they match. Then things seem like maybe they're not matching. Then maybe I can't trust my senses. Then maybe, actually, well, that's kind of a strange explanation of the world that I suddenly became delirious. [LAUGHS] Maybe there's a simpler explanation.
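As a concrete illustration of the two quantities on that heat map, here is a minimal sketch, in Gen.jl-style Julia, of a sensor-trust model of the kind being described: a global outlier probability and inlier noise level, with a per-reading outlier indicator. The specific addresses, priors, and the `expected` input are assumptions for the sketch, not the actual lab code.

```julia
using Gen

# `expected` holds the range the current map predicts along each sensor ray (an assumed input);
# `max_range` is the sensor's maximum reading.
@gen function sensor_model(expected::Vector{Float64}, max_range::Float64)
    p_outlier ~ uniform(0, 1)        # probability that any given reading is an outlier (heat map y-axis)
    inlier_noise ~ gamma(1.0, 0.1)   # expected noise around inliers (heat map x-axis)
    for (i, mu) in enumerate(expected)
        is_outlier = {(:is_outlier, i)} ~ bernoulli(p_outlier)
        if is_outlier
            {(:reading, i)} ~ uniform(0, max_range)      # outliers carry no map information
        else
            {(:reading, i)} ~ normal(mu, inlier_noise)   # inliers cluster around the map's prediction
        end
    end
end
```

Under a model like this, when the map stops matching the hallway, posterior mass on `p_outlier` climbs toward one-- the "I cannot believe my eyes" corner-- until updating the pose or the map offers a simpler explanation.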
So it's cognition at those multiple levels of representation that enables robust navigation in a changing world or in even simple cases where one's sensors might malfunction. And in fact, there's a very long history of really beautiful experiments showing how animals and other organisms can adapt to lenses over their eyes or adversarial changes to their environment and update their maps and even their models of their own sensors to keep them in register.
So I think it's just really important, especially in this era where it can seem like the path to success is training larger and larger models offline from very comprehensive simulations, just to recognize that rapid, robust adaptation to a fluctuating environment and to a fluctuating self model are possible, both for natural organisms and also, hopefully, as you'll see over the course of the day, for engineered systems.
And just to check, how many people here have heard of probabilistic programs before? OK, maybe a couple of you? Interesting. Maybe through languages like Stan? Raise your hand if you've seen Stan or Pyro. OK, a couple of you. So just to note for those people-- actually, of those people, how many of you heard that inference is slow? Yeah, OK. At least one person-- OK.
So an interesting difference between the probabilistic programs you may have encountered and the ones you'll see here is that we're going to be doing real-time adaptive inference, using methods that are much faster than the MCMC algorithms that you may have learned about. And we're going to be encoding very broad states of ignorance, not narrow priors, so that we can make systems that can deal with the open world.
And although I won't be talking about this today, it's pretty central for the field that we can learn the source code for probabilistic programs from data, as opposed to only writing them by hand. So it's useful to keep that in the back of your mind as you're wondering how this might relate either to the scaling route for AI or the reverse-engineering route for natural intelligence.
OK. So with that context, let's dive into some examples. So we're going to spend the bulk of the morning looking at inferring 3D scenes from RGBD video. And so I'm going to start with the big picture, and then we're going to go through a detailed teaching example. And that will set you up for understanding what's going on in the lab this afternoon.
OK. So the starting point we're going to take for 3D scene perception is something we sometimes talk about as just ontological realism. So what do I mean? This is in contrast to predictive modeling. So in predictive modeling, we say, well, there's lots of data, and we're going to have a model which learns, or is optimized, to try to predict that data. And who knows what it's going to represent inside of itself? Just whatever's helpful for predicting. That's one interesting starting point.
Another one, though, is to say that there's a world. And that world has stuff in it. [LAUGHS] And that stuff is sometimes lit. And then it's the interaction of the light with the stuff that forms the images. And so our job is to work out from the images what's the stuff, and how is it lit. It's maybe a little radical.
So there are-- of course, if you've heard of Gaussian splatting? How many of you heard of Gaussian-- some of you have, right? So there's been some progress recently in using differentiable computation to try to make simple versions of this picture practical.
But I think the most important idea to take away about 3D perception and then also navigation and mapping is that I think it's just starting to become practical now to directly implement this whole picture. And actually, probabilistic inference plays a central role. So let's look at why.
So what do you see on the left?
AUDIENCE: Giraffe.
VIKASH MANSINGHKA: Giraffe, right? Are you sure?
AUDIENCE: No.
VIKASH MANSINGHKA: OK.
[LAUGHTER]
All right.
[LAUGHTER]
So-- second. How about here? How many of you have taken long drives at night? OK. How many cars are on the road?
AUDIENCE: Lots.
VIKASH MANSINGHKA: So yeah, OK. Right. [LAUGHS] And how many of you play contact sports? Yeah, OK. A few of you. So in all these settings, we confront the challenges of uncertainty in state estimation in embodied cognition.
So the compute hardware fits in a pretty limited form factor, so we can't bring all the world's data to bear. And even if we could, most of it would be irrelevant because the world's changed. Our focus of attention is limited. All these things-- limited compute, limited memory, ultimately leading to limited attention, plus the fact that the world is changing-- mean we're going to have some uncertainty about the state of the world.
Then in settings like night driving, we can see that our sensors are quite limited, but we don't need a veridical, detailed model to avoid the cars. So maybe it's OK for us to be doing uncertainty aware perception.
And the maybe-giraffe case I think helps to show just how far this kind of thing scales, even to very low-level percepts about space and surfaces and motion. Our mind is filling in, sort of connecting the dots, to construct coherent percepts.
So again, it can be tempting to say, well, we'll just train a lot of data or optimize to fit one best model. But I think that the requirements for robust survivability of natural intelligence or robust performance for artificial intelligence really point at the centrality of probabilistic inference for these problems.
OK. Now, let's look at some examples of probabilistic inference in 3D scene perception. So here's a system called 3DP3 that we built. It takes as input an RGB and depth image pair-- I'm showing that on the left-- and it produces 3D percepts like the one I'm showing on the right, which are composed of a 3D scene graph that has objects with crude voxelized models of their shape. It has contact relationships between the objects, and it has a reconstruction of a cleaned-up version of the depth channel, which is what I'm showing on the bottom, which is what you get when you do a depth rendering of the scene.
So this formulation of single-frame 3D scene perception is to go from the data on the left to the symbolic models on the right, using probability to account for the gap between the models on the right and the data on the left. Like, we're not explaining every weird artifact in that kind of messy, blurry depth image. And if you actually watch depth video, you'll see, even from a modern smartphone that has a depth sensor, that it fluctuates really quite strikingly as you pan around a room or a tabletop scene. So there's a lot of need for this. It's kind of more like the giraffe or the nighttime driving scenario than you might intuitively think, or at least hope.
OK. Yeah?
AUDIENCE: How do you perform the transformation between this one and this one?
VIKASH MANSINGHKA: Yeah, I'll be--
AUDIENCE: Is it classic?
VIKASH MANSINGHKA: Sorry, what?
AUDIENCE: Is it classic algorithm, where it's not learned?
VIKASH MANSINGHKA: Neither. We're going to do probabilistic inference. And so I'm going to build up to how we do that using a resample move sequential Monte Carlo algorithm over the course of this lecture and the labs.
So maybe I would say that some of the representational techniques certainly have precedent in probabilistic robotics and early areas of computer vision. Some maybe really don't. And then the inference algorithms are pretty different from both learned neural nets and classical geometric methods in robotics.
And an interesting feature of the approach is that the object models in 3DP3 are actually learned from data and, really, just a handful of images. So here, I'm just showing some images of a cup. And on the right, I'm showing the object model that 3DP3 learns from those images where you can see that it knows that it does not know what's inside the cup because it hasn't seen inside of it.
So let's get a sense of what some of the results are, and then I'll show a few more types of results from this type of system before we dive into how to build it. So here, I'm just showing four input images from the YCB-Video database of tabletop scenes.
In the middle, I'm showing the percepts on those images that are produced by DenseFusion, which is one strong, popular deep learning system for doing RGB and depth object perception. I'll show results from another one later in this lecture. DenseFusion runs at about 4 FPS, which is faster than the version of 3DP3 when we wrote this paper, although 3DP3 is still just at the edge of real time. You'll see a real-time 3D scene perception system with probabilistic programming later today.
But I think an interesting thing is that when DenseFusion makes errors, sometimes the errors can seem a little incoherent. They're like maybe the video of the Tesla at the beginning. We can immediately tell at a glance that the output is not consistent with the data somehow. And 3DP3 is actually able to detect and correct those kinds of errors.
And in fact, that holds up in quantitative evaluation across the YCB-Video database, where, when we published 3DP3, it was actually state-of-the-art accuracy for 6-degree-of-freedom pose estimation and substantially more robust as well than deep learning baselines. That means lower probability of large errors that would lead to catastrophic failure of the system.
Now, what's going on inside 3DP3? Well, if you remember that picture of a multiscale world model I showed earlier-- so here, I'm just showing the top-level gen code for 3DP3, which you can think of as kind of a stochastic simulator that induces a distribution over worlds that have objects with shapes that are assembled into a scene graph which is parameterized by the relative poses of those objects that can then be rendered from different viewpoints to generate clean depth images and then with noise added to model data from a depth sensor.
So you can think of this generative function-- that's a key concept in probabilistic programming-- as a piece of code that makes stochastic choices for every latent variable that might need to be inferred later. And it induces a kind of probabilistic loss over a very broad, open-ended distribution of possible worlds.
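The actual 3DP3 generative code lives on the slide, but a schematic of the structure just described might look like the sketch below. Every helper here (`scene_graph_prior`, `camera_prior`, `render_depth`, `depth_sensor_model`) is a stand-in name, and the priors are placeholders; the point is the shape of the program, not its details.

```julia
using Gen

# Schematic multi-level world model: shapes -> scene graph -> clean depth -> noisy sensor data.
@gen function scene_model(num_objects::Int, shape_library)
    shapes = Vector{Int}(undef, num_objects)
    for i in 1:num_objects
        # broad prior over which shape each object has
        shapes[i] = {(:shape, i)} ~ uniform_discrete(1, length(shape_library))
    end
    scene_graph ~ scene_graph_prior(shapes)   # assumed: relative poses and contact relations
    camera_pose ~ camera_prior()              # assumed: where we are looking from
    clean_depth = render_depth(shape_library, shapes, scene_graph, camera_pose)  # deterministic renderer
    noise ~ gamma(1.0, 0.1)                   # how far the sensor may stray from the rendering
    observed_depth ~ depth_sensor_model(clean_depth, noise)  # assumed likelihood over depth images
    return scene_graph
end
```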
So it's, in some ways, a very strong constraint relative to what a transformer being applied to a tokenization of video has. Like, it's saying, there are shapes. Or, sorry, there's space. The space can have stuff in it. The stuff has shapes. What we're seeing is a rendering of that stuff.
But on the other hand, it's also quite open ended. It's not saying what the shapes are. It's not saying what the stuff is. And it's not saying that any part of its model has to be veridical or exactly reconstruct the data. That's important to keep in mind for people who may be wondering how the bitter lesson bears on all of this, which I'll return to tomorrow. Yes?
AUDIENCE: Are we still using a broad prior on this?
VIKASH MANSINGHKA: Are we using a broad prior, you said? Yeah, that's right. So there's essentially a kind of-- well, an approximation to a true Bayesian ignorance prior over scene graphs that tries to spread the mass as broadly as possible over the space of logically possible scene graphs in this thing's DSL.
So in fact, actually, yes, it's a key point that at every level of the model, there are very broad priors. Like, object shapes are just matter in a voxel representation. The scene graphs aren't just the likely ones for indoor scenes-- they're just possible scene graphs, with stuff that might be floating or might be stacked. The images are not highly realistic renderings of a RealSense depth camera's model applied to the scenes. They're just the visible surfaces in depth with a very generic noise model added.
And the noise could be very high, in which case the depth image is basically random, or very small, in which case the sensor data would have to match the depth rendering fairly closely. So it's actually a hierarchical ignorance prior at at least three levels of representation. It's a really important point. Yeah?
AUDIENCE: Does it assume the continuity of objects?
VIKASH MANSINGHKA: Good question. No, actually, it doesn't. But that gets right at the heart of the current research. So it will be interesting to try to make systems that understand that when stuff moves, it probably should be physically connected-- but even that, only probably, because, of course, the world is full of forces that appear to act at a distance, sometimes because we just can't see the mechanical connections, at other times because they travel through space.
So this question of, How would you build a hierarchical ignorance prior over a continuously deformable approximation to physics that can scale to real world and video game physics? connects a little bit with what you'll hear about from Josh tomorrow. But there's many questions there that are very current active topics of research. Yeah?
AUDIENCE: Would it be really sensitive to a quite nonsensical scene?
VIKASH MANSINGHKA: Great question. Can you give me an example of what you're thinking of?
AUDIENCE: I can't give an example that would occur in real life, I guess, but if for some-- would it be-- I guess, like, how much does it weight every experience? If it weighs it equally, then can it be quite biased to expecting the world to have nonrealistic physics for some reason? Like, if it starts seeing things floating upside down or something?
VIKASH MANSINGHKA: Good. So I'll just say one thing about this now, but we'll return to it some today and some tomorrow. So this example is actually pretty interesting. So you may notice this clamp in pink here. We see it as resting on the coffee can, but in the scene graph, it's actually not connected to the table. It's as if it's floating in space. So this is actually really important.
Of course, this thing's little scene graph language is restrictive. It says, stuff is either floating in space or in flush contact with other stuff. But the clamp isn't in flush contact. If you were to draw a bounding box around the clamp, it's not in face-to-face contact with the can or the table; it's sort of resting at an angle.
So one thing you might be concerned about is, well, if there was a machine learning system that was trained on scenes that were only in flush contact, and then you showed it real images that weren't, maybe it would project its biases-- but there's no training in this. Instead, there's inference and an ignorance prior.
So actually what happens is the thing says, well, I can't explain the pose of the clamp in terms of flush contact. I know there's a clamp there. I can't explain how it got there, but I can see the clamp. OK? That's because there's a probabilistic model that can posit explanations of varying depths for different parts of the data.
And there's no training. There's no past data to bias it. There is a world model, which is too restricted to capture what we would say the truth is. But it has a pretty useful approximation that's a shallow approximation to the truth, which is, I don't know why the clamp is there, but it's there, floating in space.
So the path to robustness for this type of system-- this is part of the answer to the bitter lesson-- is designing models which are robust at every level of representation, even when their causal explanations are incomplete or wrong. So it has the capacity to at least absorb the data for any scene with objects in strange poses, floating in space, or doing nonphysical things in video. It just can't explain it very well. It'll say, I don't know how the world got here. It's like the SLAM example I showed earlier where the robot leaves the room. It's like, well, I can't trust my sensors.
OK. So let's look at a few more examples before we start diving into the code in a little more detail. So here, I'm showing an example from a few years ago which we'll actually look at the code for just so you get a sense of some of the physics of this stuff. So there's a real-time depth input stream on the left. And I'm just showing a very stripped-down world model in the middle, which is four triangles, two for the ceiling, two for the floor, and a camera that's pointed somewhere.
So for people who do behavioral experiments with rodent models, this is kind of like-- you might think of head direction and orientation, tasks like that. This is a starting point for building a model of that.
And on the right, you're seeing the model's inferences about which of the pieces of data it knows it can explain-- those are the ones in blue-- and which are the ones that it knows it can't explain with its model. Those are the ones in yellow.
So when the chair rolls in, it's not saying there is no chair. It's just saying, I know I can't explain those pixels. Here's the model of the world, but it doesn't fit with that part of the data.
Now, of course, like, a better model would then acquire the object once it doesn't fit. And in fact, we've been building systems using GenJAX, which is our version of Gen that's hosted on JAX, developed in collaboration with Google, that can do that. So here you're seeing video on the left and a world model that's growing in real time in the middle, where as objects come in that aren't in the model, the system infers that it can't explain them, and then it can acquire those objects and then keep them in register with the physical objects as they're moved. So that's what you're seeing in the middle. An object's put in. It's acquired and then tracked as the physical object is moving.
And the reason probabilistic inference is necessary is because, actually, the depth channel is really messy. I'm not showing that here, but to get something that visibly rerenders to match the world actually requires a lot of probabilistic reweighting of the depth information across this roughly 50,000-triangle model. And inference here actually runs in real time on one L4 GPU, which is a fairly small one. Yeah?
AUDIENCE: When you say real-time, you mean, like, 40 or 30 FPS?
VIKASH MANSINGHKA: This one is, like, maybe 10 FPS. But it's-- an L4 is a pretty small GPU. Yeah. So I mean, we're in the early days of performance engineering this stuff. Yeah, yeah. Yeah, yeah, yeah, of course. Yeah?
AUDIENCE: And so the bit prior to the application, does that mean--
VIKASH MANSINGHKA: Yeah, the world model does, exactly. That's right. So we're adding new objects online.
AUDIENCE: And we use the posterior from that.
VIKASH MANSINGHKA: That's exactly right. And what you're going to see in the lab exercises and a little bit more in this lecture is how you can write real-time inference programs that are doing sequential Bayesian reasoning exactly like that, where the posterior for the next frame-- so the posterior from the current moment becomes the prior for the next moment. And the world model can be grown online.
And here's an example that just gives you a sense of the robustness of pose estimation in these systems. So on the left, I'm showing a bowl. And on the bottom, I'm showing the posterior on the orientation of that bowl. The bowl's rotationally symmetric, so it could be in any orientation, whereas in the middle image, we can see a mug highlighted. And since its handle is visible, we know exactly what orientation the mug is in, whereas on the right, we can see a mug whose handle is self-occluded, and the system knows that the mug's pose could be anything along this arc of poses.
And in a sense, intuitively, I hope you can see that this shouldn't be mysterious. I mean, we're used to this kind of robustness from human perception with just a little thought-- a little sight and a little thought. But we're not used to it from AI systems. And personally, I think that's fundamentally being driven by the gap between machine learning approximations, predictive models, versus inference in robust world models.
So let's do one more comparison to deep learning. So how many of you have encountered FoundationPose? A few of you have, maybe some of you. So it's a CVPR paper from this year, but it's a system from NVIDIA that's an attempt to build a foundation model for 3D object perception.
And it uses an interesting kind of LM-augmented synthetic data generation pipeline, where natural language descriptions of scenes are turned into 3D models that are rendered. And then there's a video transformer that's trained to try to work backwards-- 600-odd hours on a fairly beefy GPU.
So here's what happens when you actually run it on the YCB-Video database. So here, we're showing, on the bottom left, the poses that come out from FoundationPose. There's often a pretty good intersection between the output percept and the real object, but it's not getting it right.
And what you're seeing-- actually, I think in the interest of time, I'll just move on. Now, if it's a foundation model, one might like it to transfer fairly broadly. So what happens if you try to run it outdoors to track a car?
AUDIENCE: What was the training? Like, what--
VIKASH MANSINGHKA: It's a large synthetic database with lots of data augmentation.
AUDIENCE: Was it synthetic or--
VIKASH MANSINGHKA: FoundationPose was entirely synthetic. But just pointing out-- so initially, it actually fails to lock on at the beginning. But if you carefully hack its initialization, you can get it to track the car for a while, but then it loses track when the car turns and gets-- especially once it gets occluded by the bicycle.
But actually, the same inverse graphics system that we built can simultaneously track two objects from this outdoor video and stay tracking them post occlusion. And there was no training. There were no outdoor videos fed in. We're just doing online inference with a hierarchical Bayesian likelihood using the object models that I showed earlier using that kind of very simple voxelized object representation.
And I'm highlighting that because I think it's really important, both from an AI standpoint and a neuroscience standpoint, to start understanding how to break down this question of robustness and systematic generalization that's behind the marketing arc around foundation models. I think it would be wonderful if we had foundation models for building downstream AI systems that work robustly enough to serve as foundations.
I think in some domains, machine learning systems are doing at least a very useful job, but in embodied cognition, I'm not so convinced. And I think it's really interesting to try to build alternatives that are robust by design. And we're starting to see some evidence that they generalize systematically across broad ranges of environments and can stay functioning even in challenging conditions, like high degrees of occlusion.
OK. So let's look at the code and start to get a sense of how to build systems like this. So the indoor scene camera tracking system I showed earlier that fits a four-triangle world model-- actually, most of the code is on these two slides. There are a couple of library routines that have on the order of 20 or 30 lines of code that I'm not showing, but there are basically two ingredients.
One of them is a generative program that you can think of as a generator for virtual worlds and the data that would be likely given those worlds. So that's what I'm showing on the left. And then on the right, there's an inference program which explores the space of possible executions of that generative program to try to find the probable worlds given data, doing that sequential Bayesian reasoning thing that we were talking about earlier.
And I'll point out, you can use Gen to automate the math for a broad class of classical optimization algorithms, geometric probabilistic robotics, and probabilistic deep learning algorithms, including Monte Carlo gradient training for transformers.
So the computer science layer underneath is really automating a broader class of interactions between calculus, probability, and symbolic structure than TensorFlow or PyTorch or JAX. That's not what I'm focusing on in this lecture, although if you read papers on Gen, you can see what happens when you apply it to do Monte Carlo gradient estimation for training neural nets. Also an interesting topic, although I think maybe less germane to the workshop.
So let's look at this generative program in a little more detail. So how does it work? Well, what it's going to do is it's going to put a floor somewhere. It's going to choose a camera height somewhere. And you can see-- I'm showing a visual representation of the world, and I'm also showing the data structure that's called a trace, which is a record of the stochastic choices-- their names and values-- that were made during the execution of the program.
Then we're going to point the camera somewhere, so choose its pitch and its roll. Then we're going to do some geometry to convert the pitch and roll into a rotation matrix for the camera and then feed that, along with the randomly chosen camera location, into a renderer to figure out what's the depth we would expect to see with a camera at that orientation. So that's what I'm showing in the bottom left here. And then we're going to add some noise to actually get a noisy observation.
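Here is a sketch of that generative program in Gen.jl-style Julia. The library routines mentioned above are represented by stand-in names (`camera_rotation`, `render_depth`), the prior ranges and the small frame size are made up for illustration, and the per-pixel loop stands in for the vectorized noise model a real implementation would use.

```julia
using Gen

@gen function floor_model(width::Int, height::Int)
    z ~ uniform(0.5, 2.5)              # camera height above the floor
    pitch ~ uniform(-pi / 4, pi / 4)   # where the camera is pointed
    roll ~ uniform(-pi / 4, pi / 4)
    R = camera_rotation(pitch, roll)   # deterministic geometry: angles -> rotation matrix (assumed helper)
    clean = render_depth(z, R, width, height)   # expected depth image for this pose (assumed helper)
    noise ~ gamma(1.0, 0.1)            # depth sensor noise scale
    for i in 1:(width * height)
        {(:observation, i)} ~ normal(clean[i], noise)   # noisy observation, pixel by pixel
    end
    return clean
end
```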
Now, I'll just point out, designing this so-called noise model, or sensor model, is not easy. And one of the main places we've made progress, I would say, in the last maybe 10 years in this type of architecture is getting much better ones that work across a broader range of sensors and better capture an ignorance prior-- one that doesn't have to model in detail the exact causes of every gap between a model and the world, but can roughly approximate a really broad range of, if you like, deltas between a hypothesized latent scene and the real data that you're observing.
So I'm not going to spend too much time on that today, but just point at that as a technical area. And it's quite analogous to work in graphics engines, by the way. So early video game graphics was much less realistic than graphics today. A lot of what they were doing was codesigning the math and the implementations of the low-level approximations to optics that would make stuff fast and robust and look good.
And that's kind of the analogous move here. We're trying to design the low-level approximations to optics that are sufficiently robust that we can do Bayesian inverse graphics and still have posterior percepts that are robust and fast and look good. So that's kind of the nature of the intellectual game. It's not getting a lot of training data or training larger neural networks or whatever. It's designing those approximations.
Yeah? There's a question?
AUDIENCE: So if there is no learning happening, is there a set of objects fixed?
VIKASH MANSINGHKA: No. Well, so in this example, we're just doing pose estimation. But in 3DP3 earlier, the objects, as I showed, are learned. They're just learned online. So there's a broad ignorance prior. You might imagine a line here which says, and generate a list of random objects. And what's an object? Some collection of voxels within some bounding box. Or maybe you'd say, well, there's a table of objects, and there could be some new object I've never seen before. And any time I see an object, it might be from the list I've seen, or it might be some new object.
AUDIENCE: And in the previous code example, the function had as its input the number of objects?
VIKASH MANSINGHKA: To generate in the scene, but typically, that'll be random.
AUDIENCE: So every single frame, so to speak, you're re-initializing it?
VIKASH MANSINGHKA: No. So in this, we're going to show what happens when you take one model and apply it across a range of frames persisting the state, but-- so like-- yeah, if we wanted to do online learning like in this example, then what you might want is something where the objects, once they're acquired, are remembered and can recur. OK, that's-- but that's one level higher than just learning the objects.
So one thing is just, OK, there's some unknown number of objects, but each time I-- there's no classification, right? Like, for example, this system does not know that the two ramen containers are instances of the same type. It's just saying, well, there's these two surfaces that have some texture. But you can have another layer of representation which uses something called a Dirichlet process, essentially, to generate a random textured mesh from a distribution that sometimes generates a new one and sometimes reuses an old one. And that's a way to start building in the ability to do online classification.
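A minimal sketch of that "sometimes new, sometimes reused" idea, using a Chinese-restaurant-process style prior of the kind a Dirichlet process induces. `sample_new_mesh` is an assumed generative function that draws a fresh textured mesh, and `counts` tracks how often each known type has appeared; the names and numbers are illustrative.

```julia
using Gen

@gen function object_type(counts::Vector{Int}, alpha::Float64)
    total = sum(counts) + alpha
    # existing types in proportion to how often they've been seen, plus one slot for "a new type"
    probs = vcat(counts ./ total, [alpha / total])
    idx ~ categorical(probs)
    if idx > length(counts)
        mesh ~ sample_new_mesh()   # assumed: generate a brand-new textured mesh
    end
    return idx
end
```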
So you're just asking about learning. Maybe the point I'm just going to keep hammering home is I think we're trying to get to systems that do learning, that actually learn online, not that optimize the parameters of neural nets offline to approximate what biological systems do when they're learning.
AUDIENCE: Just wondering what are the inputs to the-- so the input is the image, and that--
VIKASH MANSINGHKA: Yeah, that's right. That's right. Not a prebuilt object library.
AUDIENCE: I thought you just showed us, like, the normal initialization of--
VIKASH MANSINGHKA: Yeah.
AUDIENCE: That's it. Nothing happens?
VIKASH MANSINGHKA: Sorry, you mean in this example, right?
AUDIENCE: Yeah, yeah.
VIKASH MANSINGHKA: Yeah, yeah, here, it's-- here, we're just showing initialization. Exactly. Yeah, it's just you have to build up to the online learning and--
AUDIENCE: You can see it here.
VIKASH MANSINGHKA: Yeah. OK. So cool. So that's the generative program. What about inference? So here's one way you could do inference, just like a slow MCMC algorithm. I'm just going to show it first because it's kind of intuitive. And then we're going to do something that runs in real time. So we're going to have some code that grabs a frame from a depth camera.
Then we're going to ask Gen to make a trace of the generative model-- that is, randomly execute the generative model, but force the image that came in to match the constraints which are constructed from the frame that we grabbed from the depth camera. So this is where we're telling Gen that the random variable, or latent variable, called observation is going to have its value set to the frame that we loaded from the depth camera.
Now, initially-- yeah?
AUDIENCE: The observation-- it just seems big?
VIKASH MANSINGHKA: No, the observation is something from a camera in the world. So in fact-- so this-- when we say get_frame(depth_camera), this is-- yeah. Good call. This should be labeled frame, not observation here. Thanks for catching that. We're actually working on-- we have a prototype Colab plugin that automates some of this stuff, but yeah. So this should be called frame.
And then here, we're showing that there's a randomly sampled initial trace that looks like the trace you saw before. But we're using it to try to explain this data, which doesn't really fit, which is fine. Random initialization.
So now we're going to do some inference updates. So, I don't know, 10,000-- or 1,000 times. Let's try to tweak the pitch, roll, and z. And so this top move runs what's called a Metropolis-Hastings update on the trace, reproposing the pitch, the roll, and the height of the camera from their prior under the generative model. And it automates the math and the acceptance ratio for that.
And then these three lines of code below do random walk updates with varying drift on pitch, roll, and z, individually. The intuition I want to give you is, much like with TensorFlow or JAX or PyTorch, you don't have to do the math for gradients-- it's a kind of derivative. In Gen, you don't have to do the math for Radon-Nikodym derivatives, which turn out to be the measure-theoretic derivative that all those p/q ratios in variational and Monte Carlo inference are calculating. It's actually just another kind of derivative. And in fact, Gen automates the math for stochastic approximations to those.
Gen also automates the math for automated differentiation of expected values, which is what you need in RL. So this is just an application of that to Monte Carlo inference.
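Putting those pieces together, the intuitive-but-slow version of the inference program might look like the sketch below, written against the `floor_model` sketched earlier. `get_frame` is the routine named on the slide; `frame_to_constraints`, the frame size, the iteration count, and the drift scales are assumptions for illustration.

```julia
using Gen

# Drift proposal for the random-walk updates: perturb one choice around its current value.
@gen function drift_proposal(trace, addr, sigma)
    {addr} ~ normal(trace[addr], sigma)
end

W, H = 64, 48                               # illustrative (downsampled) frame size
frame = get_frame(depth_camera)             # grab a frame from the depth camera (slide routine)
constraints = frame_to_constraints(frame)   # assumed helper: fills in the (:observation, i) addresses
trace, _ = Gen.generate(floor_model, (W, H), constraints)   # random scene, forced to carry this data

for _ in 1:1000
    # Metropolis-Hastings: repropose pitch, roll, and height from the prior
    trace, _ = Gen.mh(trace, select(:pitch, :roll, :z))
    # random-walk updates with varying drift on each variable individually
    trace, _ = Gen.mh(trace, drift_proposal, (:pitch, 0.05))
    trace, _ = Gen.mh(trace, drift_proposal, (:roll, 0.05))
    trace, _ = Gen.mh(trace, drift_proposal, (:z, 0.02))
end
```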
And so what you get when you run this inference program is something that starts with a random scene and then tunes it until the floor kind of matches the data. So that's what you're seeing here. But now we haven't yet done the sequential inference part, so let's take a look at that.
Well, actually, first, let's add the ceiling. There's no ceiling in this model. So we can do that just by adding a little bit to the model. Of course, later, you could be imagining, well, how do we learn graphics models from images? We don't want to have to have variables labeled ceiling. We want some sort of much more generic, maybe triangle-based scene graph representation. I've shown you a couple of those elements already.
But once we have that, we'd better also add in some inference logic to make sure we do inference over the ceiling variable we've added. Maybe, hopefully, it's not too hard to see how that could be automated. It's a pretty templated type of code. Now, once we have that, we get something that fits both the floor and the ceiling.
But now, if we want to do online inference, one thing we could just do is change the for loop to a while loop and update the frame from the depth camera inside the loop. And now, when we do that from frame to frame, we have to tell Gen, oh, by the way, take the world model and replace the observation with the new frame that you've loaded.
And also, that changes the score-- essentially, the joint probability of that whole execution-- which is one of the things Gen is tracking and has to update efficiently. But then you can get something where you can pan around a room, and it'll track the state online.
And the very simple one-trace posterior approximation from frame i is effectively the prior for frame i plus 1. And what you're seeing is code for a very simple single-particle resample-move SMC inference algorithm, for people who know what that is.
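In sketch form, the online version replaces the for loop with a while loop and uses `Gen.update` to swap each new frame into the trace, so the posterior from frame i becomes the prior for frame i+1. It continues from the code sketched above; the helper names are stand-ins as before, and the 20 rejuvenation sweeps per frame are an arbitrary choice.

```julia
# Single-particle resample-move style loop (a sketch; continues from the code above).
while camera_is_running(depth_camera)                 # assumed helper
    frame = get_frame(depth_camera)
    constraints = frame_to_constraints(frame)
    # replace the observation with the new frame; Gen updates the joint score of the execution
    trace, _, _, _ = Gen.update(trace, (W, H), (NoChange(), NoChange()), constraints)
    # a little inference between frames: rejuvenation moves on the camera pose
    for _ in 1:20
        trace, _ = Gen.mh(trace, select(:pitch, :roll, :z))
        trace, _ = Gen.mh(trace, drift_proposal, (:pitch, 0.05))
        trace, _ = Gen.mh(trace, drift_proposal, (:roll, 0.05))
        trace, _ = Gen.mh(trace, drift_proposal, (:z, 0.02))
    end
end
```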
Now, what you'll see in the labs, which I'll get into a little bit more later, is the compositional building blocks for a much, much broader space of inference methods that include higher level cognitive control. So I'll get to that in just a bit. But I first wanted to start with something that's fairly simple, which is just doing a little bit of inference in between each frame.
Now, an interesting research topic, which I think of as right at the intersection of robotics and neuroscience, is that this model doesn't know how long it took in real time to grab a frame or to do inference. And I think that's actually one of the biggest flaws with the model. That is, we should have a motion prior that's in physical units and a controller that's trying to decide, should I bother updating my model with new data? Do I basically know where everything is already? How much should I compute?
You think about that football scene earlier. Probably, it matters how relevant-- there's a very interesting kind of decision-theoretic logic to where the system should compute and how much uncertainty it should try to reduce. So that's an example of a research direction which I think is quite feasible now to do with these sorts of building blocks: make state estimation models that are embedded in continuous time, with decision-theoretic controllers that know that and exploit it. And we have early prototypes of that, actually, in Gen as well. Yeah?
AUDIENCE: Does anybody ever look at-- or is anybody looking at whether or not the degree to which the mistakes that are made here are similar to the mistake-- the perceptual mistakes--
VIKASH MANSINGHKA: Yeah, really good question. We have a new project at the Quest for Intelligence which is starting to explore that much more systematically. There is some work on inverse graphics models that Josh and Ilker Yildirim at Yale and others have done trying to just use the inverse graphics framework broadly to understand the hollow face illusion. Anyway, but we can talk about that offline.
So there's some-- but there's actually a really-- I mean, if you're familiar with, I guess, the Nakayama and Shimojo motion illusions, there's a whole bunch that we're very interested in modeling using probabilistic inference in world models like this, and then early stages of a larger effort to build and test models against human perception more systematically. So happy to talk about that more. And some of the co-conspirators are in the back of the lecture hall right now, whom you can talk to a little bit later, if you want to, more on the BCS side. Yeah?
AUDIENCE: In the generative programs we've seen, we never speak about the source of light, right? So I was wondering-- in some real observations, when there are different sorts of lights, it could be very difficult to understand the shadows. I mean, so how can we do that?
VIKASH MANSINGHKA: Yeah, that's a great question. There are two grad students in my lab who have been debating how far down that road we need to go to really get a more full-featured perception system working. And I think it's a real debate.
Conceptually, I hope you can see how it's not that hard in principle. We could add extra random choices for an ambient light source, for point light sources. We could also attach light sources to each triangle. Those light sources could have either-- you can imagine something that's more realistic, where there's actual ray tracing out from the light sources.
You can even have light transport, or you can imagine cheap approximations, like some things are bright and things kind of near them tend to get blurred to colors that are kind of closer to the bright thing. All those different types of modeling are possible. And so we're just starting to try to navigate that design space now.
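Conceptually, the "extra random choices" answer could be sketched as a small prior over lighting that a renderer would then consume. Everything here is hypothetical-- a renderer that accepts these parameters is not shown, and the priors are placeholders-- it is just meant to show how little the program's shape has to change.

```julia
using Gen
using LinearAlgebra: I

# A hypothetical lighting prior: an ambient term plus a random number of point lights.
# A renderer that accepts these parameters would replace the plain depth renderer.
@gen function lighting_prior()
    ambient ~ uniform(0, 1)                    # ambient light intensity
    n_point ~ poisson(1.0)                     # how many point lights
    for j in 1:n_point
        {(:light_pos, j)} ~ mvnormal(zeros(3), 4.0 * Matrix{Float64}(I, 3, 3))
        {(:light_intensity, j)} ~ gamma(2.0, 1.0)
    end
end
```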
OK. So I'm just going to wrap up in the next five minutes. So we've already touched on this point, but I just want to hammer it home, which is, like-- so instead of-- we're not trying to encode very restrictive symbolic knowledge. Instead, we're saying, if we want to avoid-- or, well, another way to put it would be a different take on the so-called bitter lesson. I think if you encode knowledge that's false, then you should expect that your system will perform worse.
But if the knowledge is true, then maybe not, OK? OK, you say, well, OK, all knowledge is false at some level of resolution. But I think the level at which Tesla's perception systems are failing is not the level where our common sense theory of ontological realism of the world is wrong. It's not, oh, quantum physics or something like that. That's not why. It would be great if we could just exploit the consequences of classical mechanics and optics in perception, OK?
So then how do we encode classical mechanics and optics without a bunch of false stuff on top of it? That's where I think broad ignorance priors and probabilistic inference come in. So how do we encode systems that know that mostly, they don't know what's going on? But they do know something-- that there is a world, and it's got stuff in it. But of course, that world might have objects teleporting or doing all kinds of weird things. And they can't just not see that when it's happening. And that's what these hierarchical Bayesian ignorance priors are really doing.
OK. So let me just spend the last minute or two sketching what you're going to see in the lab. So we're going to take a problem that's motivated by drone navigation and mapping, which is, I think, one of the areas where robotics-- the physical platforms-- are mature enough that there's the ability to do very interesting experimentation on embodied cognition. It's not the only setting, but it's, in my mind, a very appealing one.
And we're going to work through Gen programs that estimate where the robot is that work both when the robot's motion model is very reliable and when the motion model is not so reliable and the robot maybe is buffeted around by a lot of wind.
And what we're really going to build up to is an inference process that takes the building blocks you saw, but adds a controller on top of it which is continually looking at the weights-- an online estimate of how well the model fits the data-- and using a kind of feedback to figure out, should it think more or just go with what it's inferred?
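As a preview, a single-trace analogue of that controller could look like the sketch below; the lab version works with a full set of particle weights, whereas here the incremental weight returned by `Gen.update` plays the role of the online fit estimate, and the threshold and step counts are arbitrary. It continues from the earlier sketches.

```julia
# Adaptive controller sketch: spend more inference effort only when the model fits the new frame poorly.
while camera_is_running(depth_camera)
    frame = get_frame(depth_camera)
    trace, weight, _, _ = Gen.update(trace, (W, H), (NoChange(), NoChange()),
                                     frame_to_constraints(frame))
    steps = weight > -50.0 ? 5 : 200        # go with it, or stop and think harder
    for _ in 1:steps
        trace, _ = Gen.mh(trace, select(:pitch, :roll, :z))
        trace, _ = Gen.mh(trace, drift_proposal, (:pitch, 0.05))
    end
end
```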
And this is important because if you think about the compute graphs that arise when you train large neural nets to do this type of problem, they're not adaptive. And there's this metaphor of attention, but the actual process running on the GPU has this-- has to pay this kind of worst case, all-to-all connectivity cost that's just baked into the transformer's design.
But our own inference processes are much more flexibly adaptive to where our models might or might not fit the world and where that matters. And so that's the type of inference program that you're going to be building up to in the lab. OK. Thank you very much.
[APPLAUSE]