Representations vs Algorithms: Symbols and Geometry in Robotics
November 4, 2020
Brains, Minds and Machines Seminar Series
In the last few years, the ability for robots to understand and operate in the world around them has advanced considerably. Examples include the growing number of self-driving car systems, the considerable work in robot mapping, and the growing interest in home and service robots. However, one limitation is that robots most often reason and plan using very geometric models of the world, such as point features, dense occupancy grids and action cost maps. To be able to plan and reason over long length scales and timescales, as well as to plan more complex missions, robots need to be able to reason about abstract concepts such as landmarks, segmented objects and tasks (among other representations). I will talk about recent work in joint reasoning about semantic representations and physical representations and what these joint representations mean for planning and decision making.
TOMASO POGGIO: So I'm very happy to introduce Nick Roy, who is a new member of CBMM. And Nick is also a professor in the Department of Aeronautics and Astronautics, and is a member of CSAIL, and the director of the Bridge, part of the Quest for Intelligence.
So Nick is one of the main representatives of robotics research at MIT. He has done great and amazing things. And he will describe some of them to us today. And at CBMM, we of course have the goal of having research in engineering, like in robotics, interact with, be inspired by, and inspire work in neuroscience and cognitive science. So I'm looking forward to this talk, and to a lot of inspiration. Nick.
NICHOLAS ROY: Thank you very much, Tommy. It's a pleasure both to be part of CBMM, and also to be giving a talk today. And as Tommy said, I'm going to talk about a few different things, and I'm also going to try and relate them to questions that I don't know all the answers to, that perhaps you all can help me figure out the answers to, in terms of how natural intelligence exhibits some pretty phenomenal capabilities that seem to be important for robots.
And Chris mentioned the questions. I'm happy to take questions during the talk. I do have the Q&A window open. But Chris, please also if you see me steamrolling through a question, please feel free to interrupt me so I can take it.
So it's a super exciting time to be working with robots and autonomous vehicles, et cetera. When I first started at MIT in 2003, robots simply were not something that you saw as part of your everyday life. But of course, now we have self-driving cars that are present in Boston, certainly all over the place in the Bay Area, Singapore, and other places.
On the right, you see an autonomous drone delivery happening. This is part of a project that I led at Google a few years ago. And so we're not just seeing robots being tested and evaluated in the real world. We're actually seeing them become part of our everyday life. We're starting to almost take the services that they promise to offer for granted, which is super exciting for me, as somebody who wants to develop these systems and extend what they can do, and understand what we need to develop further in terms of the technology to give us even better vehicles.
To understand what we need to do next, it's useful to ask, how did we get here? So what's enabled current autonomy capabilities? A few things. One is small computers. When I was a graduate student many years ago, the total amount of computation on the surface of the Earth was less than what's in the human brain. And of course, thanks to Moore's law, what exists on the surface of the Earth has now surpassed the human brain by many, many orders of magnitude.
But also, we can get them in really small form factors, which is great for me, as somebody developing small drones, for instance. We can put tremendous amounts of computation inside these vehicles. Another is just small scale electronics of various kinds. Certainly the cell phone industry has given robots and drones great enabling technologies.
Another one is just a lot of the infrastructure we put up in the last 25 years or so. So this is a GPS satellite. GPS gives our systems the ability to know where they are, and telecommunications gives us a great ability to communicate between the vehicles. So these are all hugely enabling.
But I think most people in the field would agree that these have not been the biggest deals. The thing that's really given us the most progress has been the development of highly accurate, lightweight sensors. One of the reasons why we can recognize self-driving cars as they drive around is that almost all of them have some kind of spinning LIDAR on top that basically gives the vehicle the ability to see where obstacles are and recognize other cars driving around, and those LIDARs also work far better than GPS does for these vehicles to know where they are.
So the laser range finder, in particular, many people have argued was the single biggest breakthrough for autonomous robots, certainly ground robots. On the right is another laser range finder. This is made by [INAUDIBLE] automation. They thought they were making a factory safety LIDAR, but they were actually making a LIDAR for flying vehicles, because this weighs only 160 grams. It's about yea big, if you can see my screen.
And it lets us do things like this. So this is a video from a few years ago from my research group. We put a LIDAR on this aircraft and had it fly around in the Stata Parking Garage. And in some sense, this is just kind of a stupid robot trick. We showed what could be done with autonomous flight.
And the reason I like this video, besides the fact that it's kind of fun to look at and was fun that we could do it, is that it's also not a flight that a human pilot could have flown. This is not to say that human pilots can't do this. Human pilots are extremely good. But the interesting thing is there's no place for a human pilot to stand so that they can actually see the vehicle all the way through the flight. So if you want to keep the vehicle safe, it puts a lot of pressure on the onboard sensing in order for the vehicle to know where it is and make appropriate control decisions.
Now, LIDARs have been great for self-driving cars, but they're not especially biologically motivated. There are animals that have echolocation. Bats, dolphins, et cetera. Projected light tends to not be something that biology has done a lot of. The same students who put this vehicle into the air went off to found a startup called Skydio, which some of you may have heard of, and full disclosure, I'm an advisor to the company.
So they realized the limitations of LIDAR, and have really focused on computer vision for enabling the same kind of flights. So the intended application is, among other things, civil infrastructure inspection. So this is the Skydio vehicle inspecting a bridge. I forget where the bridge is exactly. I think it might be Minneapolis.
But this is an extremely difficult operation for a human pilot, again, to carry out, because they don't have situational awareness from where they're normally standing. So this is entirely GPS [INAUDIBLE]. It's building a model of the bridge structure. And it has some understanding of the kinds of operations that the human pilot wants in order to actually fly around.
It can fly through really complicated and tight spaces, including this very narrow structure. And the way that it works is it uses the navigation cameras, six of them, and I'll show a better picture in a second, to basically build a 3D map of the environment and navigate through that. And here's another video of the Skydio vehicle flying around at relatively high speed, building this three dimensional map. The voxels are colored by distance, and it does this all in real time entirely onboard the vehicle.
So this is all a preamble by way of saying that sensing, LIDAR and computer vision, is what really got autonomous vehicles up and running: first indoor robots and so-called service robots, then self-driving cars on the ground, and now computer vision really seems to be enabling aerial operation. So sensing is hugely important.
But if computer vision seems to be crushing it right now, how come we don't actually have more ubiquitous autonomy? Why don't we have robots in our houses and in our workplaces when we're in the workplaces, and why haven't the self-driving car companies really delivered on the promise? Waymo went public in 2012, if I have that date correctly. So it's been eight years. And they've driven millions and millions of miles. But they only operate in Arizona and the Bay Area. So why don't we have ubiquitous autonomy?
And so for the rest of this talk, I'm going to try and give a few reasons why we don't have ubiquitous autonomy, and try and talk about some of the work that I and others are doing to try and advance the state of the art. So here we go.
So I made the point that the Skydio 2 vehicle is doing all these things. But it's using six navigation cameras. And as impressive as this vehicle is, its understanding and its perception of the scene are far less than what you or I could have with exactly one eye.
You and I can cover one eye and basically operate just fine in the environment. We might have a little bit of difficulty catching a baseball that has traveled more than 10 or 50 meters. But other than that, we can do most things just fine. But the Skydio vehicle is under tremendous pressure to have much, much better imaging than what you or I have.
And so that suggests that there's something about how we're using the camera imagery that probably doesn't match what biology does, and that's possibly where some of the source of difficulty lies. Another issue is that if you compare the amount of computation onboard the vehicle, it's an Nvidia Tegra X2, and it's a 15 watt computer. And it's maxed out.
If you talk to the Skydio folks, they will tell you there's not much room for any additional computation beyond what they're doing. And if you compare that to the adult brain-- and sidebar, it's interesting that certainly the internet and the popular literature has very little consensus on how much power the adult brain actually consumes.
I thought it was about 20 watts. I thought I learned that from Jim DiCarlo. But if you go and ask even the literature, you get wildly different answers. I think a factor of two in the quoted power counts as wildly different. It's interesting, Jeff Bezos has also been on the record multiple times saying the human brain consumes 60 watts, which I'm pretty sure is not right.
So it is interesting to me. I couldn't find a definitive source that unequivocally defined how much power the brain consumes. Probably the issue is that people's brains vary. But if somebody has a reference for this, I would be grateful.
But the point is that the Nvidia computer consumes about as much power as the human brain, but is producing far less capability out of that power. And so the question is, why is it so badly misusing its computation? What's going on?
So it's worthwhile looking at the structure of an autonomous system, a control structure. And some of you may have heard me talk about this in a program review before. But there's parts of these vehicles that work really well, and parts that don't. And the parts that really work-- so if we look at the control structure, you'll notice that pretty much every single deployed and operational vehicle has a low level control loop that looks something like this.
You have sensor data that comes in. It goes into some kind of probabilistic estimator, like a Kalman filter or something, that says where the vehicle is in space, and also might say what's around it. So you might have just a position estimator, or you might have a mapping solution.
That position estimate gets fed to a motion planner that generates a reference trajectory that gets fed to a controller, that then generates action commands to the motors. And it runs at a fast duty cycle about 1,000 Hertz usually. Now this is fine, but it isn't really good for anything except getting the vehicle from point A to point B.
So if you want to do a more complicated mission, a more complicated task of some kind, then there's almost always a thing sitting on top of this control architecture that's very symbolic, which is a very ill-defined term. It's very discrete. And it extracts the state estimate into a symbolic, discrete representation, such as: I'm not at an (x, y) position in the world, but I'm on a road, a distinct road. Or if I'm an indoor robot, I'm in this room or that room.
And then some symbolic planner operates in order to figure out what the high level subgoal should be that's given to the motion planner, like drive to the end of the road, or exit the room, or something like that. And then that gets converted into a little motion plan, and so on. And this layer is typically reasoning over many more things than the lower level, but the good news is it can run at a slower duty cycle.
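As a rough illustration of the control structure just described, here is a minimal sketch in Python of a fast low-level loop with a slower symbolic layer sitting on top. The class names, rates, and interfaces (sensors.read(), motors.write(), and so on) are illustrative assumptions, not any particular vehicle's software stack.

```python
# Minimal sketch of the layered control structure described above.
# Everything here is illustrative: the classes, rates, and interfaces are
# assumptions for exposition, not an actual vehicle's stack.
import time

class StateEstimator:
    def update(self, sensor_data):
        # e.g. a Kalman filter fusing IMU and LIDAR; returns a metric pose
        return {"x": 0.0, "y": 0.0, "heading": 0.0}

class MotionPlanner:
    def plan(self, pose, subgoal):
        # returns a short reference trajectory toward the current subgoal
        return [pose, subgoal]

class Controller:
    def command(self, pose, trajectory):
        # tracks the reference trajectory; returns actuator commands
        return {"thrust": 0.0, "yaw_rate": 0.0}

class SymbolicLayer:
    """Discrete abstraction on top: rooms and roads instead of (x, y)."""
    def abstract(self, pose):
        return "corridor_A"           # hand-written mapping from pose to symbol
    def next_subgoal(self, symbol, mission):
        return {"x": 10.0, "y": 0.0}  # e.g. "exit the room", "drive to the end of the road"

def control_loop(sensors, motors, mission):
    est, planner, ctrl, sym = StateEstimator(), MotionPlanner(), Controller(), SymbolicLayer()
    subgoal, tick = None, 0
    while True:
        pose = est.update(sensors.read())
        if tick % 100 == 0:           # symbolic layer runs at a much slower duty cycle
            subgoal = sym.next_subgoal(sym.abstract(pose), mission)
        trajectory = planner.plan(pose, subgoal)
        motors.write(ctrl.command(pose, trajectory))
        tick += 1
        time.sleep(0.001)             # fast inner loop, on the order of 1,000 Hertz
```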
And what you see in almost every operational autonomous vehicle of one kind or another is that this lower level thing works really well. It's very, very unusual for a failure in the system to result from a failure in the state estimator or a failure in the motion planner. Almost always, failures are the result of the vehicle being in some kind of operational condition that wasn't expected, or some kind of mechanical or electrical failure.
What does fail is this relationship between the low level control of the robot and the higher level reasoning about what the robot seems to be doing. It almost always ends up with the robot getting stuck, or not knowing what to do, or doing the wrong thing. So this is where the failures in autonomy come from. This is why we don't have these highly capable robots around us right now.
So why is it-- what is it about this thing that breaks? It's not really this upper level, high level reasoning piece that breaks. It's the relationship between the two. What do I mean by that? Notice I have these little arrows that represent information or control flow. And these little arrows typically represent information or control flow conditioned on some kind of model.
So I need to figure out how my sensor behaves in order to make a state estimate. I have to figure out how my [INAUDIBLE] behaves in order to give reasonable control signals. Those estimates of how the sensor or the [INAUDIBLE] behaves, those models, everywhere that I can learn that model from data, I'm going to color it in green.
We have techniques like state estimation. Or maybe you have reinforcement learning that's basically describing how to pick good symbolic actions based on some kind of symbolic state. And this-- notice that every arrow down here is colored in green. I can learn basically everything I need to know about the low level operation of my robot from data.
But I can't color every arrow in green. These two arrows here are in red. And the reason they're in red is because I can't learn them from data. And inevitably, what happens is that I get some engineer or some graduate student to write down a set of symbolic states, some rules for how to extract them from the state estimate, and some rules for what my symbolic actions are, like my behaviors, for instance, that then get given to the motion planner.
And they're writing it down in Python or C++ or something. And inevitably, the engineer or the grad student forgets some edge cases, or fails to account for some combination of conditions, or somehow doesn't capture the richness of the problem. The robot breaks. People get a bit frustrated. They come along, they extend the finite state machine with extra states or extra transitions, operation resumes, and essentially we keep going until the next failure happens.
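To make the brittleness concrete, here is a hedged sketch of what that hand-written symbolic layer often looks like: a small finite state machine whose states and transition rules someone typed in, and which grows a new state or transition every time the robot hits an unanticipated condition. The states and observation fields below are made up for illustration, not taken from any real system.

```python
# Hedged sketch of a hand-written symbolic layer: states and transitions that a
# person wrote down, patched with new edge cases after each failure in the field.
# The states and observation fields are illustrative, not from a real system.

STATES = ["NAVIGATE_CORRIDOR", "ENTER_ROOM", "DOCK", "RECOVER"]

def transition(state, observation):
    if state == "NAVIGATE_CORRIDOR" and observation["at_doorway"]:
        return "ENTER_ROOM"
    if state == "ENTER_ROOM" and observation["goal_visible"]:
        return "DOCK"
    if observation["stuck"]:
        return "RECOVER"   # added by hand after the robot froze in a doorway one day
    # ...each unanticipated condition means someone comes back later and adds
    # another state or transition here by hand.
    return state
```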
And essentially, you're doing AI or machine learning by grad student. And that's not scalable. Just to make this super concrete-- this is not an artificial example --imagine that you're a delivery drone company, like the delivery drone project I was running at Google. And you might have a very simple sequence for delivering a package. Takeoff, fly to the destination address, enter a hover, [INAUDIBLE], et cetera.
Very simple. You're up in a part of the airspace where you're not going to encounter other aircraft. There are no contingencies you have to worry about. Let's take one of these symbolic steps: fly to the destination address.
You're a large internet company that happens to be developing this drone. So you know your drone doesn't know about addresses. It knows about GPS. But you've also got a large street view unit that's collecting street addresses and mapping them to GPS coordinates. So why don't you ask your street view unit, what is the mapping between the addresses we're delivering to and GPS locations?
So this is an address in Palo Alto. You look up the GPS location. And it's there. This is a great statement of where that house is for that address. It is a singularly poor location to deliver packages to. Your vehicle has to understand that this is a bad thing to do.
OK, well maybe you're actually smarter than that, and you also happen to have a self-driving car unit, and so you ask the self-driving car unit, where would you put the car? Well, that's here on the street. Another really bad place to deliver packages. Maybe you've got somebody going along and figuring out where the nearest sidewalk access is.
That's here. It's under a tree. You can't deliver that either. You might actually want to deliver in the back yard, except maybe there's kids playing there. So I lied a little bit when I said there were no contingencies. Pretty much every step of operating an autonomous vehicle of some kind requires understanding constantly what's happening in the environment around you.
The models that we tend to operate our autonomous vehicles with right now are so abstract that they really don't represent the real world. And so my claim is that one of the things that's really holding back true autonomy is that it requires true understanding of the environment.
And what do I mean by understanding of the environment? Another thing that I've talked about in a program review before that some of you may have seen me mention is that this is really a question of how do we represent the environment. If you went back in time many years ago to a robotics conference in say like the late '70s or early '80s, well, first of all, you couldn't. There were no robotics conferences in the '70s or '80s. All of robotics is inside AI.
And AI was all about logic. This is René Descartes, the father of logic. And people were spending a huge amount of time writing down facts about how the world worked, because that was the thing they thought was important for AI.
And it turned out that logic was not a great way to represent the world, because you had trouble with reconciling inconsistent pieces of sensory information. If my claim is that the sensor was the thing that really enabled robotics, then logic was poorly suited and has been poorly suited for handling the errors, the inconsistencies, and the partial observability that comes out of real sensors.
And robotics eventually moved to probability theory. This is the Reverend Thomas Bayes. And probability was not well thought of at the time. But it was a roboticist, a guy by the name of Peter Cheeseman, who wrote "In Defense of Probability," and really argued that if you wanted to deal with the real world, then you needed probabilistic models that allowed us to represent the real world.
And so these kinds of probabilistic models were what robotics was all about for many years. And you really never see logical models in robotics conferences anymore. You see probability distributions over and over and over again, because they are how you deal with reconciling inconsistent measurements of the real world.
So that was a representational shift. So the question then is, what are the right representations? How do we actually get to identifying the right representations for operation in the real world? So let's turn back to computer vision again. Many of you, I'm sure, have read the Forsyth and Ponce book, and David Forsyth in '96 observed that the world consists of things and stuff.
So I made the claim a few minutes ago that computer vision seems to be crushing it. Computer vision is-- I don't want to take anything away from computer vision. They've done a remarkable job at giving us tools that work really well in a lot of ways. But I would say that right now, computer vision is the best at object recognition.
We could argue about whether [INAUDIBLE]. But it is pretty good at object recognition. You can go to vision.google.com, and give it an image, and it will label 5,000 different categories of things with very, very high precision, and many of you may have even better performance on a lot of these things.
And I'm going to observe that knowing where cars and other distinct objects are in the world is certainly useful, but objects aren't necessarily the most useful thing for an embodied system. The things that might actually be more useful are the stuff in the world: stuff that isn't a distinct object at a point location, but is spatially extended. You know, trees that deform and move around, that really exist over not just a point in space but an extent of space.
Even better than trees is extracting the roads. So one of the things that I'd like to observe is that as we think about what representations we need for true robots and embodied intelligence, actually incorporating a model of the stuff in the world is a crucial representational ability for acting in the world.
But again, semantic segmentation has done remarkably well in the last few years. You can actually get semantic segmentation to do up to 60 classes using less than 20% of a GTX 1080 at 15 Hertz. You can't quite get that onto a drone, but you can get about 5 Hertz with fewer classes on a drone, which is pretty good.
And so we actually did this: we put semantic segmentation on our drone, using an RGB-D camera, an Intel RealSense. And the first thing to notice is that if we just rely on the ranging ability of the depth camera, we can't see very far. We only see 10 meters maximum range.
If we start to extract semantic segmentation from the RGB camera image, we can do a lot better in terms of understanding the scene around us. And we're getting great sort of dense fill in of the environment around us. We can identify the roads, which for a drone flying in unknown environments are a great signal of what are good trajectories in the environment.
The other thing is that by having partial depth, we can actually recover the depth of those semantic segmentation pieces, and actually build a three dimensional model of the environment. But reasoning about that three dimensional model of the environment is complicated from a planning perspective. So we also sparsify it, as you see here.
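As a rough sketch of the depth-plus-segmentation fusion step just described, the snippet below back-projects per-pixel semantic labels through a pinhole camera model to get labeled 3D points. The intrinsics and the road class id are placeholder assumptions, not the actual system's values.

```python
# A minimal sketch of fusing a per-pixel semantic segmentation with a depth
# image to get labeled 3D points, as described above. The camera intrinsics
# and the "road" class id are made-up placeholders, not the real system's.

import numpy as np

def backproject_labels(depth, labels, fx, fy, cx, cy, max_range=10.0):
    """depth: HxW meters, labels: HxW class ids. Returns (N, 3) points and (N,) labels."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = (depth > 0) & (depth < max_range)   # RealSense-style depth is short range
    z = depth[valid]
    x = (u[valid] - cx) * z / fx                # pinhole back-projection
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1), labels[valid]

# Example: keep only "road" points to seed a sparse flight graph later.
if __name__ == "__main__":
    depth = np.random.uniform(0.5, 12.0, (480, 640))
    labels = np.random.randint(0, 60, (480, 640))
    ROAD = 3                                    # hypothetical class id
    pts, lbl = backproject_labels(depth, labels, fx=600, fy=600, cx=320, cy=240)
    road_pts = pts[lbl == ROAD]
```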
So what we do is we take the road segments and we do a graph extraction. And then we can actually use that to fly through the environment at relatively high speeds. Here the vehicle's going 9 meters per second as it flies through Medfield State Hospital out in Medfield. It's no longer used as a hospital, so it's a great place for us to go and fly.
And doing relatively low frame rate semantic segmentation of the scene, and then abstracting that to a graph network for the vehicle to fly around, all of a sudden we can get very, very good motion. And we're using much less computational power than the Skydio vehicle.
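One plausible way to do the kind of graph extraction described here is to skeletonize a projected road mask and connect adjacent skeleton cells into a sparse waypoint graph. This is a hedged sketch of that idea, not the method actually used on the vehicle.

```python
# A hedged sketch of turning a segmented road mask into a sparse graph of
# waypoints for planning, in the spirit of the graph extraction described above.
# The skeletonization-based approach and the cell size are assumptions.

import numpy as np
import networkx as nx
from skimage.morphology import skeletonize

def road_graph(road_mask, cell_size=1.0):
    """road_mask: HxW boolean grid of 'road' cells (e.g. a top-down projection)."""
    skel = skeletonize(road_mask)                 # one-cell-wide centerlines
    g = nx.Graph()
    ys, xs = np.nonzero(skel)
    nodes = set(zip(ys.tolist(), xs.tolist()))
    for y, x in nodes:
        g.add_node((y, x), pos=(x * cell_size, y * cell_size))
        for dy in (-1, 0, 1):                     # connect 8-adjacent skeleton cells
            for dx in (-1, 0, 1):
                if (dy or dx) and (y + dy, x + dx) in nodes:
                    g.add_edge((y, x), (y + dy, x + dx))
    return g
```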
We obviously have much less detailed representation of the environment, and I would argue that this is not the right representation if we're doing the kind of very careful flight in and amongst the steel bars of a bridge that the Skydio vehicle is doing. But for fast operation in outdoor environments, the hybrid representations may be extremely useful. But what we really want to do is actually understand everything that's around the vehicle.
So if true autonomy requires understanding everything that's around the vehicle, then I would love to understand how does the brain know what everything is around it at such low computational power. I don't have an answer, but within CBMM, I'd love to get a better answer to that.
Now, David Forsyth said that there were things and stuff. So we've talked about stuff. What about things? Things are a bit problematic for autonomous vehicles right now. So again, object recognition, object detection is a great technology. We have it on our vehicles. But it's not perfect.
It's very good at recognizing the existence of an object in a flat 2D image. But our robots and our vehicles exist in three dimensions. They need to know not only that there is a car roughly in this part of the image; we also need to know where it is, how far away it is, and how big it is, for the purposes of safe operation around it.
But that's hard to do. Our object recognition is going to get it wrong a lot of the time. It's probably done a fine job with this car. It's got the bounding box right around this car, and a couple more images would probably give us the ability to triangulate on the position and trilaterate on the size and distance.
It's probably going to get this window wrong. This pillar here is occluding the window. And it's definitely going to get this window wrong here, because the window runs off the edge of the frame. So that's problematic. We often represent objects as point estimates: we approximate the bounding box as a noisy point measurement, taking the centroid of the bounding box over multiple measurements as corresponding to the centroid of the object. And if we have occlusions, we're going to get it wrong.
So we need our perception system to be reasoning not just about spatially extended stuff in the world, but also about spatially extended objects in the world. So we need to represent objects as 3D volumes.
Now, we could represent them as discretized, voxelized objects, but there's a bunch of reasons why that's computationally not very efficient. And we know that we don't reason about the world, at the very low level, in terms of voxelized representations. We think about objects, a lot of the time, as a single bounding solid.
And so an interesting bounding solid might be an ellipsoid. And if we reason about the edges of the bounding box with respect to the bounding ellipsoid, we might do a much better job of fusing the bounding boxes. So why ellipsoids? They're low dimensional. They allow a smooth, closed-form update as we get extra measurements.
It turns out there are two 500-page books written simply about understanding how ellipsoids behave under different measurements, et cetera. Quadrics are really quite remarkable objects, and highly useful for reasoning about low dimensional representations of space.
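One reason quadrics are so convenient is textbook projective geometry: an ellipsoid written as a 4x4 dual quadric Q* projects through a camera matrix P to a 2D dual conic in closed form, C* = P Q* P^T, which is what lets image-plane bounding boxes be related to the 3D ellipsoid. Below is a minimal numpy sketch of that projection; it is the standard construction, not the specific estimator used in the work above, and the camera values are made up.

```python
# Sketch: an axis-aligned ellipsoid as a dual quadric, and its closed-form
# projection to a dual conic (the outline ellipse in the image).
# Standard projective geometry; camera parameters are illustrative.

import numpy as np

def dual_ellipsoid(center, radii):
    """Dual quadric Q* of an axis-aligned ellipsoid with given center and radii."""
    Qd = np.diag(np.concatenate([np.square(radii), [-1.0]]))
    T = np.eye(4)
    T[:3, 3] = center                       # translate to the ellipsoid center
    return T @ Qd @ T.T

def project_to_conic(Q_dual, P):
    """Project the dual quadric with a 3x4 camera matrix P; returns a 3x3 dual conic."""
    return P @ Q_dual @ P.T

if __name__ == "__main__":
    K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])
    P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    Qd = dual_ellipsoid(center=np.array([0.0, 0.0, 5.0]), radii=np.array([0.5, 0.5, 1.0]))
    C_dual = project_to_conic(Qd, P)        # outline ellipse of the ellipsoid in the image
```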
There is one problem, which is baseline. If we commit to reasoning about ellipsoids instead of, say, points at the centers of objects from bounding boxes, then we need to make sure we have enough measurements to have a well posed problem. If we don't have enough measurements to have a well posed problem, then the following thing happens.
Full disclosure: this is a simulation, but we're getting it working on the [INAUDIBLE] vehicle soon. We're taking bounding boxes of these objects, and we are doing data fusion from multiple views in order to extract ellipsoids that we then render here as bounding cuboids. And you can see the light gray bounding cuboids actually don't fit the ground truth, which is the dark gray cuboids, very well.
Oftentimes the ill-posedness of the solution results in poor estimates. And poor estimates of the volume of free space around you are bad news for a vehicle, because that's how you end up hitting things. We really want to know where the objects are around us.
So what do you do? Typically, you become super conservative, and you put very large bounding volumes around your estimates of these things. So you know that something's roughly there, but you don't exactly know where it is. Maybe there's a covariance attached, and so you put a bounding cuboid around the covariance, and you just give up on the fact that you don't have very good estimates, and so you perhaps can't operate in certain regimes.
But that seems sad, that you committed to this representation of space, but then you said, but I'm really bad at getting that representation of space. What can we do about it? Maybe the thing to do is to actually recognize that you may have different ways of representing space.
So maybe the first time you see an object, you're going to get relatively ill-posed views of that object. And the best you can do is represent this centroid of an object with some estimate of the overall location. But then maybe as you get more views, you get a much better estimate, and I'm going to go back there.
If you watch the estimate of the potted plant, it actually transitions from a centroid to actually a volume around that potted plant. As we got more views, we're able to change the representation of space based on how much we know about the objects. And so we can actually have a hierarchy of different representations.
We may have a very abstract representation initially, which is just we know some thing's there. There's a bounding patch, but we can't even tell where it is in space. And as we get more measurements, we get a point mass representation. So we refine the representation.
And then perhaps we actually get to spheres, ellipsoids, or bounding cuboids, as we showed here, as we get more and more measurements. And so this kind of hierarchical abstraction over representations is something that robotics just hasn't had for most of its existence. And I'm going to claim it's crucial for really understanding the scene around you.
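A hedged sketch of that promotion logic: an object track starts as a coarse patch, becomes a point estimate once it has a few views, and is only promoted to an ellipsoid once the views span enough baseline for a volume fit to be well posed. The thresholds and the baseline measure below are illustrative assumptions, not the actual system's criteria.

```python
# Sketch of a hierarchy of object representations that gets refined as views
# accumulate. Thresholds and the crude baseline measure are illustrative.

import numpy as np

class ObjectTrack:
    MIN_VIEWS_FOR_POINT = 3
    MIN_BASELINE_FOR_ELLIPSOID = 2.0     # meters of camera motion, made-up value

    def __init__(self):
        self.camera_positions = []
        self.detections = []             # e.g. bounding boxes or bearing rays

    def add_view(self, camera_position, detection):
        self.camera_positions.append(np.asarray(camera_position, dtype=float))
        self.detections.append(detection)

    def representation(self):
        n = len(self.detections)
        baseline = 0.0
        if n >= 2:
            pts = np.stack(self.camera_positions)
            baseline = np.linalg.norm(pts.max(axis=0) - pts.min(axis=0))
        if n < self.MIN_VIEWS_FOR_POINT:
            return "patch"               # only a conservative bounding region
        if baseline < self.MIN_BASELINE_FOR_ELLIPSOID:
            return "point"               # centroid with a large safety sphere
        return "ellipsoid"               # enough baseline for a well-posed volume fit
```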
If I take the same problem that I showed on the previous slide, and have the vehicle actually move between different representations-- I'm going to pause here. What you see is the vehicle got some initial estimates of the centroids of these objects, but did not know how big they were, or exactly where they were.
So it actually put very large bounding spheres around these objects, for the purposes of safe, conservative flight. Whereas here, this vehicle has committed to the one representation of ellipsoids, which again we're drawing as bounding cuboids. It's got it wrong. And it's lucky that it didn't try and fly too close to this object, because it would have been wrong.
Over here, it knows it's wrong, and so it's not going to try and fly too close to those objects. If I let the videos play a little longer, you see that we're populating the scene in a very conservative manner. And then as the vehicle turns around-- oh, it actually already promoted that object there to [INAUDIBLE] that it knows and understands well. And you can see that the estimate corresponds to ground truth very, very well.
And so the point here is two things. One is that for perception of the environment for a real robot, we need a much better understanding of the environment than simply dense voxelized representations. We need to know what things are, and we need to know where they are in space.
And this requires this reasoning about spatially extended things, like where the road network is, and it requires us to reason about objects at multiple levels of representation for the purposes of putting them in the representation as accurately as possible and planning safe trajectories around the environment.
So this is something that I would love to understand: how does the brain reason at different levels of representation in order to actually capture everything that's around it? When you don't have a lot of information, you have to be super conservative and say that all the stuff in the general area is going to be an obstacle, and I'm not going to go near it.
And then as you get more information, you want to refine that more and more. You need these hierarchical representations that you can move up and down.
So I said that computer vision seems to be crushing it right now. Maybe for the purposes of image understanding, but maybe not yet for the purposes of full vision on a fully autonomous vehicle. But I'm talking mostly about perception. What about planning? Let's try and think about what it means to ask the same questions about planning.
Typically for planning, we understand everything about what's in the model the planner's using. But do we have the same issues of like hierarchical representations and what's in the model? So let's choose a really, really simple planning problem. I'm going to call this the ring world.
Imagine that we have two robots that are constrained to move on a ring and can't move through each other. So I have an O robot and an R robot. And for those of you who know about configuration space, I've drawn the configuration space on the right here. This is basically-- so theta R is the angle that the R robot can move around the ring. And theta O is the angle that the O robot can move around the ring.
And so any point in the space here corresponds to a configuration of the two robots. And the gray bar right here is where they would be in intersection, and that's disallowed. So you can't have the robots in intersection. So you can't be in this sort of gray stripe down the middle.
So suppose I want to put the robots into a particular desired configuration. So I want to have them switch places, for instance. So this is a super easy motion planning problem. There are two solutions. One is that the O robot moves this way, and the R robot moves in the same direction and swoops all around.
And that corresponds to a path in configuration space that looks roughly like this: the R robot takes the long way around, and the O robot takes the short way around. And you can give this to any motion planning algorithm. It might find the other solution, which is for both robots to go the other way: the O robot would go up and then wrap around, and the R robot would move a little bit. But either solution is fine.
And you can give this to a random-sampling rapidly exploring random tree planner, which you see on the left here. [INAUDIBLE] and Emilio [INAUDIBLE] found an optimal motion planning algorithm called the [INAUDIBLE] some years ago. That's shown here, and it finds the path there. That's great. Simple planning problem, unrelated to anything I've been talking about before.
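For concreteness, here is a toy version of the ring world as a sampling-based planner would see it: a configuration is (theta_R, theta_O) on a torus, configurations where the robots overlap are invalid, and a simple RRT finds a swap. The robot width and step size are made up, and collision checking is done only at sampled nodes for brevity.

```python
# Toy ring world: two robots on a ring, configuration (theta_R, theta_O),
# invalid when they overlap. A minimal RRT in this space. All constants are
# illustrative; this is a sketch, not the planner from the talk.

import numpy as np

WIDTH = 0.4                                        # angular "size" of each robot (made up)

def wrap(a):
    return np.angle(np.exp(1j * np.asarray(a)))    # wrap angles to (-pi, pi]

def valid(q):
    # invalid when the two robots overlap on the ring (the gray diagonal strip)
    return abs(wrap(q[0] - q[1])) > WIDTH

def rrt(start, goal, iters=20000, step=0.1, rng=np.random.default_rng(0)):
    nodes, parents = [wrap(start)], {0: None}
    for _ in range(iters):
        target = wrap(goal) if rng.random() < 0.1 else rng.uniform(-np.pi, np.pi, 2)
        dists = [np.linalg.norm(wrap(target - n)) for n in nodes]
        i = int(np.argmin(dists))
        direction = wrap(target - nodes[i])
        q = wrap(nodes[i] + step * direction / (np.linalg.norm(direction) + 1e-9))
        if valid(q):                               # node-only collision check, for brevity
            parents[len(nodes)] = i
            nodes.append(q)
            if np.linalg.norm(wrap(q - goal)) < step:
                path, j = [], len(nodes) - 1
                while j is not None:
                    path.append(nodes[j]); j = parents[j]
                return path[::-1]
    return None

path = rrt(start=(0.0, np.pi), goal=(np.pi, 0.0))  # swap the two robots' places
```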
Now, let's make the problem just a tiny little bit harder. Let's imagine that the R robot is a true robot, and it can move around. But the O robot is actually an object, and it can't move by itself. It can move when pushed, when the R robot is in contact and pushes it. But otherwise, it can't.
So now, this is the [INAUDIBLE] motion plan here. You see only the R robot moving; theta O is not changing. The R robot moves around to come in contact with the O robot. Then it pushes the O robot. Notice they both move along this sort of blocked part of the configuration space, and then the R robot moves off again. That's the answer.
This problem is almost completely unsolvable as a standalone motion planning problem. And that's super weird. Why would that simple change make it almost impossible for most motion planning algorithms to solve? You basically have to break it into two problems.
One is where you figure out how to put the robot in contact with the object. Then you figure out how to move the two of them together. And then you figure out how to put the robot into the final configuration. You break it into three separate motion planning problems. And so this is basically the field of task and motion planning.
And the reason why this is complicated is that there are three analytic dynamical systems: there are three places where the dynamics of the system are smooth, but they're smooth in different ways. So you can have a patch where the robot is not in contact-- oh, it's the middle one here, excuse me --and it's free to move around in this orange space. And it can only move on horizontal stripes, because it can't move the object while it's not in contact.
And then there's two other modes of operation. One where the robot is on one side of the object and pushing it, and they're both free to move in concert up and down this line. And the one is where the robot and object are in contact on the other side of the object, and they're free to move up and down this line.
And what's tricky for motion planning algorithms is finding the transitions between these different modes. Notice that we only have transitions between these modes when the robot actually comes in contact, and that state of being in contact is a set of measure zero if you're, say, sampling random states. And so finding a set of measure zero is going to be vanishingly improbable. And that's what makes this problem so hard.
So you sample only some of the [INAUDIBLE] in each mode. Sampling from the orbits and the intersections between orbits is what you need in order to find solutions. And so by actually modeling the dynamic discontinuities, this is now trivially solvable as a standalone motion planning problem.
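A hedged sketch of that idea for the ring world: rather than sampling configurations uniformly (which will essentially never land exactly on a contact), sample within each smooth mode and explicitly generate the measure-zero contact configurations where mode transitions happen. The mode probabilities and contact offset below are illustrative assumptions.

```python
# Sketch of mode-aware sampling for the robot-pushes-object ring world above.
# Probabilities, the contact offset, and mode names are illustrative assumptions.

import numpy as np

WIDTH = 0.4                        # angular offset at which robot and object touch
rng = np.random.default_rng(0)

def sample(theta_O_current):
    """Sample a configuration (theta_R, theta_O) by first picking a dynamic mode."""
    mode = rng.choice(["free", "contact", "push"], p=[0.4, 0.2, 0.4])
    if mode == "free":
        # the object cannot move on its own: theta_O stays put, the robot roams
        return np.array([rng.uniform(-np.pi, np.pi), theta_O_current])
    side = rng.choice([-1.0, 1.0])
    if mode == "contact":
        # measure-zero transition: the robot exactly touching one side of the object
        return np.array([theta_O_current + side * WIDTH, theta_O_current])
    # push mode: robot and object move together along a 1D line in configuration space
    theta_O = rng.uniform(-np.pi, np.pi)
    return np.array([theta_O + side * WIDTH, theta_O])
```

A planner would then connect these samples only with motions that are valid in the sampled mode, so the contact samples become the bridges between modes.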
So I made the case that for perception, we needed representational ability to represent spatially extended parts of the environment around us. And we needed to represent spatially extended objects around us. Now I'm making the case, very briefly, that to represent planning problems, we need to represent these very rare, these very sparse intersections between dynamic modes of the things in our environment.
So we have two representational challenges. We have to represent everything around us. We have to be able to represent spatial extent. We have to have a hierarchy. And we have to find these very rare places where the dynamic modes of our interaction with the environment change. And that's really, really hard to do. At least roboticists haven't figured it out.
This is another example here of a robot planning a task of actually moving these objects from one location to another. This is another task and motion planning problem. We're not doing anything particularly fancy here. But what happens when the robot is presented with an extra tray?
All of a sudden the planning problem changes dramatically, in the sense that the robot can actually solve the problem a lot faster by using the tray to move the objects.
So think about what has to happen here: the robot has to reason about the tray as a spatially extended object capable of supporting multiple objects. It has to reason about the fact that there are discontinuities in the dynamics between objects not in the grasp of the robot manipulator, objects in the grasp of the manipulator, and objects in contact with the tray.
We figured out some of how to do this efficiently, in terms of actually combining logical representations with the geometry and the low level continuous dynamics of the scene. But there's a lot more interesting research to be done here.
So a question then I ask is how does the brain-- maybe [INAUDIBLE] use here, but maybe use is the wrong word, or not sufficient. Maybe how does the brain find and use the very sparse representations that capture the changes in the dynamic modes as we interact with our environment?
There seems to be something special about discontinuities in the world for the purposes of planning. Let me give you another example. So imagine that you're a robot faced with trying to get-- you're standing in an unknown environment. You can see what you see right here.
And you're told to go to a goal that's like 100 meters away in the direction of the green dashed arrow. And one thing you could do is you could build-- you've got a fancy laser. You got the Skydio [INAUDIBLE] system, or maybe you have semantic segmentation that's extracting the corridor in the hallway, et cetera, and building you a map.
But at the end of the day, you've really only got two choices, two distinct choices. You can go down the hallway, or you can enter the classroom. And thinking about the environment as anything other than those two choices is computationally really demanding.
So sparsity and discontinuity seem to be really important for driving down the computational cost of a lot of our planning algorithms. And oh, by the way, knowing the fact that the goal is 100 meters away is super useful for choosing between these two actions. You know, if it's 100 meters away, it's unlikely to be inside this classroom, because classrooms are not 100 meters across.
It's much more likely that you're going to go down this corridor, and there's probably another corridor at the end that you should maybe turn left at in order to get there. So we can use a lot of prior information to reason about these two distinct choices.
If you really were reasoning at the level of optimized trajectories through this environment, it's very hard to use that information in order to make decisions. Now, if I tell you the goal is 5 meters away, then you can still draw the same conclusions. It's pretty likely the goal is inside that classroom and not down the corridor. Going down the corridor doesn't make any sense at all.
So how do you actually get at these distinct choices that you might reason about? There was some work about 15 years ago by a guy called [INAUDIBLE] at UIUC, where he built a thing called the gap navigation tree. He demonstrated that if you had a robot with a perfect sensor for sensing the discontinuities in range around the robot, you could actually build a navigation strategy that had some nice properties in terms of completeness and optimality.
Not a very practical thing. But the key idea of actually sensing the range discontinuities around the robot actually turns out to be really, really useful for building representations that allow us to plan efficient trajectories through the environment. So this is a simulation. But we actually trained a gap sensor from this data and put it on a real robot. And this is our little RC car driving around the lobby.
I think this is the [INAUDIBLE] school, building E-50 or so, I forget. And what you're seeing right below it is it's actually building a map by basically putting walls in between the gaps that it sees in the range from the camera. It's doing a little bit more inference than that, because it's also reasoning about whether the vertices that constitute the gaps are convex or concave, and whether that constitutes a wall or not.
But we actually do a pretty good job of building a representation of the environment. It's very, very efficient, very, very compact. And it's built entirely on this notion of discontinuities, not in the dynamics of the environment, but in the geometric properties of the environment.
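A minimal sketch of detecting gaps as range discontinuities in a single horizontal scan, in the spirit of the representation being described (not the trained gap sensor from the actual system); the jump threshold is an illustrative assumption.

```python
# Sketch: find "gaps" as large jumps in range between adjacent bearings in a
# single scan around the robot. The threshold value is illustrative.

import numpy as np

def find_gaps(ranges, angles, jump_threshold=1.0):
    """ranges: 1D array of distances around the robot; returns a list of gap records."""
    gaps = []
    for i in range(1, len(ranges)):
        if abs(ranges[i] - ranges[i - 1]) > jump_threshold:
            # a large jump in range between adjacent bearings marks a gap; the
            # nearer side is the occluding edge (convex vs concave tells you
            # whether free space is hidden behind it)
            nearer = i if ranges[i] < ranges[i - 1] else i - 1
            gaps.append({"angle": float(angles[i]), "occluding_range": float(ranges[nearer])})
    return gaps

if __name__ == "__main__":
    angles = np.linspace(-np.pi, np.pi, 360, endpoint=False)
    ranges = np.full(360, 4.0)
    ranges[90:180] = 9.0          # an open corridor off to one side
    print(find_gaps(ranges, angles))
```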
And you can connect that with some of the ideas that I mentioned when we were looking at the classroom or the corridor, is you can actually make reasonable guesses about whether to follow a particular branch in the range discontinuity. So this is an optimistic plan. It is building a detailed map here, and it's basically trying to get to some goal down here.
And then if we actually train a system to classify gaps as to whether or not they're likely to make progress towards the goal or not, we can put that on a robot, and you see here that it does a much better job of actually getting to the goal without being distracted by parts of the environment that are unlikely to correspond to things that the robot would use.
So I'm going to say that true autonomy will require true understanding of the environment. The brain needs to know everything around it. It needs to reason at different levels of representation, and it needs to use very sparse representations, especially for decision making.
Now, a lot of these systems contain a learning component, and learning is increasingly important for complex autonomy. And this is a problem for robotics, because learned models need a lot of data from all operating conditions. Lighting and weather changes are common in the real world, and data is very, very expensive for robots to gather.
I'm going to skip a few slides in the interest of time, and just assert that robots right now are as data hungry as the rest of our learning algorithms. And the brain can learn from much [INAUDIBLE] data that represent the real world very efficiently. I am by no means the first person to ask how the brain does this, and I would absolutely love to have an answer, because my group right now is struggling with data collection for robots.
The bit I skipped is some work in [INAUDIBLE], but I'm not going to talk about it right now. The other thing that really matters for robotics is, of course, safety. So we've seen adversarial images. We've even seen adversarial objects. We've seen a real failure in the Tesla Autopilot, where the system was presented with a white tractor trailer against the [INAUDIBLE] sky that it had never seen before. It did not recognize that as a tractor trailer. The brake was not applied, and somebody died.
And one of the realities of learning in the real world, which I am sure many of you very deeply appreciate, as I do too, is that the real world is not IID. And so if you assume that the data is distributed like this, but it's actually distributed like this, a lot of our current learning models have a really hard time, and people like [INAUDIBLE] and others are doing great work in really trying to understand how we might trust our representations, how we might trust our learned models.
One thing that we have done is we've taken advantage of autoencoders to at least try and do anomaly detection. The idea here is that, as you're training your gap detector or object recognition system or whatever, you also learn to reconstruct the input through something like the information bottleneck.
Then we can actually recognize: if the reconstruction is reasonably close to the input image, then you feel like you might have seen this before, and you can trust your system. But if the reconstruction is really terrible because this is a novel image that you've never seen before, then perhaps you shouldn't trust your learned system, and you need to back off to something much more conservative.
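A hedged sketch of that reconstruction-based check: a small autoencoder trained on the robot's own experience reconstructs familiar inputs well, and a large reconstruction error flags an image the learned system should not be trusted on. The architecture and threshold here are illustrative, not the actual system's.

```python
# Sketch of reconstruction-error novelty detection with a tiny convolutional
# autoencoder. Architecture, input size, and threshold are illustrative.

import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(            # information bottleneck
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 8, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def is_novel(model, image, threshold=0.02):
    """Flag the input as novel (don't trust the learned predictor) if the mean
    per-pixel reconstruction error exceeds a threshold picked on training data."""
    with torch.no_grad():
        error = torch.mean((model(image) - image) ** 2).item()
    return error > threshold

model = TinyAutoencoder().eval()                 # in practice, trained on the robot's data
frame = torch.rand(1, 3, 64, 64)                 # stand-in for a camera image
trust_prediction = not is_novel(model, frame)
```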
I think this is just one step along the road, and other people have had ideas like this before. Robotics doesn't really do a lot of this, and I'm not quite sure why. I think the answer might be because it's really hard to know what actually makes an image novel. This is from some work my student Charlie Richter did a few years ago, in the basement of Stata. He was driving a robot around and trying to have it predict trajectories, using the autoencoder to determine when it couldn't trust the input image, or the classification of the input image.
And somebody stepped out of an open doorway here. And the system actually characterized this as a novel image and didn't trust its prediction on it. But is this a novel image? It got most of the scene correct. This is a person. Not knowing what to do when new people show up seems like an important thing for safe operation in populated environments.
But at the same time, the system was going to make an accurate prediction. So there's a complicated question to be asked and answered about what truly makes an image novel, and when do we truly trust our systems. So a question that I would love to know the answer to is how do we represent everything around us? How do we do so in a hierarchical way? How does the brain use very sparse representations? How does it actually infer them from rich representations? How do we learn from sparse corpora, and how do we trust our own perception and our learned models?
And I really think that we do need new mathematical theories of representation. Theories of representation are very much a hot topic right now. I don't want to pretend that I just discovered this. There's an entire conference on learned representations. Well, what I mean by representation I think is something very different. And I'm trying to articulate requirements on our representations that are imposed by robots and operating in the real world that give us more autonomy, that give us more complex operation, that give us better safety.
And yes, I need more data for my systems in order to train up the models that we have. So all of this, of course, is me just speaking for the work of tremendously great students. And also, I want to thank students, [INAUDIBLE] for the footage from Skydio, and I've been very fortunate work with some great sponsors.