Reintegrating AI: Skills, Symbols, and the Sensorimotor Dilemma
Date Posted:
October 25, 2022
Date Recorded:
October 18, 2022
Speaker(s):
George Konidaris, Brown University
Brains, Minds and Machines Seminar Series
Description:
Abstract: AI is, at once, an immensely successful field---generating remarkable ongoing innovation that powers whole industries---and a complete failure. Despite more than 50 years of study, the field has never settled on a widely accepted, or even well-formulated, definition of its primary scientific goal: designing a general intelligence. Instead it consists of siloed subfields studying isolated aspects of intelligence, each of which is important but none of which can reasonably claim to address the problem as a whole. But intelligence is not a collection of loosely related capabilities; AI is not about learning or planning, reasoning or vision, grasping or language---it is about all of these capabilities, and how they work together to generate complex behavior. We cannot hope to make progress towards answering the overarching scientific question without a sincere and sustained effort to reintegrate the field.
My talk will describe the current working hypothesis of the Brown Integrative, General-Purpose AI (bigAI) group, which takes the form of a decision-theoretic model that could plausibly generate the full range of intelligent behavior. Our approach is explicitly structuralist: we aim to understand how to structure an intelligent agent by reintegrating, rather than discarding, existing subfields into an intellectually coherent single model. The model follows from the claim that general intelligence can only coherently be ascribed to a robot, not a computer, and that the resulting interaction with the world can be well-modeled as a decision process. Such a robot faces a sensorimotor dilemma: it must necessarily operate in a very rich sensorimotor space---one sufficient to support all the tasks it must solve, but that is therefore vastly overpowered for any single one. A core (but heretofore largely neglected) requirement for general intelligence is therefore the ability to autonomously formulate streamlined, task-specific representations, of the kind that single-task agents are typically assumed to be given. Our model also cleanly incorporates existing techniques developed in robotics, viewing them as innate knowledge about the structure of the world and the robot, and modeling them as the first few layers of a hierarchy of decision processes. Finally, our model suggests that language should ground to decision process formalisms, rather than abstract knowledge bases, text, or video, because they are the model that best characterizes the principal task facing both humans and robots.
Speaker Bio: George Konidaris is an Associate Professor of Computer Science and director of the Intelligent Robot Lab at Brown, which forms part of bigAI (Brown Integrative, General AI). He is also the Chief Roboticist of Realtime Robotics, a startup based on his research on robot motion planning. Konidaris focuses on understanding how to design agents that learn abstraction hierarchies that enable fast, goal-oriented planning. He develops and applies techniques from machine learning, reinforcement learning, optimal control and planning to construct well-grounded hierarchies that result in fast planning for common cases, and are robust to uncertainty at every level of control.
LESLIE KAELBLING: I'm very happy to introduce George Konidaris. I know George because he was our post-doc here for three years. And while he was here, Tomas and I learned an enormous amount from him about planning and scheduling. And George just barely made it here from the Red Line, which exhibits his expertise in planning and scheduling, no?
[LAUGHTER]
But he's had a really interesting trajectory and an important trajectory. So he did his undergraduate studies in [INAUDIBLE] in South Africa and then went to Edinburgh and then to Amherst and then here. And he's been really instrumental in a lot of work on actually bringing AI to Africa and helping with people there, helping students, and supervising a bunch of people at the University and so on. So that's a great thing that he's done.
And he's also done an enormous amount of really great work on, I think, bringing kind of an integrated view of AI, which used to be common, lots of people thought about AI that way, but it's kind of dissipated a little bit. And people are looking narrowly. And George really has got the whole picture in his head in a way that I think almost nobody else does. And so he will tell us about his whole picture.
[APPLAUSE]
GEORGE KONIDARIS: Thank you, Leslie. That's incredibly kind. We'll see if the inside of my head frightens any of you or weeds you out. I'll do my best to make it reasonable. Yeah, sorry, I was late. It's a persistent cognitive failure of humans. They always chronically underestimate how long things take. And I'm an AI researcher, so definitionally, I really chronically underestimate how long things take because otherwise I'd be in another field. So but I made it here, MIT 5 to the rescue.
So as Leslie said, I'm an AI researcher. And I would like to just briefly clarify that that means that I'm specifically not building useful things. There are lots of people who build useful things with AI technologies. They design products that we use every day. They can help you auto tag your friends on Facebook and all that kind of stuff. I'm not one of those people. We have useful conversations, but we're not in the same field, fundamentally. So I'm interested in understanding how we design an intelligence.
So as all these things do, it goes back to Turing. So this is Alan Turing. In 1950, he wrote, really, the first AI paper, Computing Machinery and Intelligence. And then in 1956, at Dartmouth, there was this conference where the phrase artificial intelligence was coined. And this was really the beginning of the field, at least in North America. Everything descends from there.
And the underlying hypothesis there was that, functionally speaking (everything is a bit of a cartoon here), the brain is a computer, which is to say that you can apply computational thinking to the brain, and also that it might be reasonable to build a computer with an algorithm in it that might replicate some functionality of the brain.
OK so it's been 70 years since that happened. And we've made lots of progress. There's lots of exciting cool things happening. On the other hand, when you go up to an AI researcher today, which I do frequently, and say, hey, so what is an AI going to look like? Not a useful thing but what's an actual general AI going to look like? How do you know when you're done? What's the actual goal of your field?
They look a little embarrassed, and they shrug, and they go, oh, I don't know. And then they go on with their day. And that should make you feel mortified. It makes me feel mortified. It's an epic failure of thinking about what we should be doing with our lives like, what is this field for.
And it's totally fine. I have nothing against building useful stuff. I'm all in favor of useful stuff, but we have a scientific question that we're trying to answer, which is how you design an intelligence. And when I'm trying to do a piece of work, when I'm stuck proving a theorem or writing a program, and I find myself blocked, like we have been for 70 years, I find myself unable to make progress.
One of the strategies that I often take is to just add another assumption, to just add one extra piece of intuition or one other piece of structure, and then usually everything kind of drops out. Usually, being stuck means you haven't added enough stuff to the problem.
So part of what I'm going to propose today is that we should add a thing to the problem. We shouldn't just say that the brain is a computer. We should also say that humans are robots. That in nature, brains are controlling bodies. OK, and that the fundamental function of the brain is to tell a body what to do. Its inputs are the sensors of the body, and its outputs are the actuators of the body. And that's what brains are for. And so therefore, whenever we think about a general-purpose AI, we should be thinking about a robot.
So what I'm going to do today is I'm going to attempt to briefly persuade you of that just for five minutes because that could go on forever. And then I'm going to show you what I think is a kind of fully-- fully is the wrong word, barely sketched out--
[LAUGHTER]
--hypothesis about what it might mean to build a generally intelligent robot, and how much leverage this extra assumption gives you. And I'm going to try and resolve the void that's at the center of our field, which is how do we put all these things back together again.
So a lot of people here are cognitive scientists, and that's wonderful. I love cognitive scientists; some of my best friends are cognitive scientists. And they might say to me, well, OK, sure, every natural agent out there in the world happens to have a body. Does that mean we need to be thinking about robots? And the answer is, totally.
Because there are no disembodied intelligences. That thing doesn't exist. And so when you're trying to take insights from a natural intelligence and project them onto a computer and say we should build an AI that does planning because humans do planning, and it should do computer vision because humans do computer vision, and it should do scheduling because humans have to catch the train, those are all things that robots do. There are no disembodied agents in nature, nada, zilch, none, not a sausage, doesn't exist.
It's not even clear that it's coherent, and there's no reason at all to believe that the architecture of such a thing would look anything like the architecture of what you see in nature, which is an architecture built to control a robot. So in my opinion, if you're doing cognitive science, you need to take seriously the fact that the first few levels of control of the thing that you're thinking about is a robot.
And people get a bit literal on this with me. They say, oh, OK, fine, you need to avoid obstacles and recognize things in the world, but we get above that, and after that, we're not an embodied intelligence anymore. Everything is embodied intelligence. Every input to you, to your brain, and every output goes through your body. You're 100% embodied. Every last thing, dreaming, writing a sonnet, playing music, all of that's embodied, every last thing.
So when you say embodied intelligence and then there's the other type of intelligence, there's no other type in nature, none. And if you're an AI person, you might say, OK, fine, but I want to engineer useful things. And so why should I care about whether or not those useful things are a robot or not?
And here I appeal to your sense of discomfort with what the field is. So AI is this collection of fields where we do planning because humans seem to do planning, so that sounds cool. And there's a subfield that does planning. And then we do learning because humans do learning, and that sounds cool, and there's a subfield that does learning. And then we do natural language because humans do natural language, and that sounds cool, and there's a subfield that does that.
But if you took those subfields, if you took Russell and Norvig, which is the standard and very good AI textbook in the field, and you reshuffle the chapters at random, you would lose nothing because there's a void at the center of that thing. There's no notion of how these things get put back together. And you're not going to get that unless you root it all in a very powerful sensorimotor interface with the world. All you need to be a robot is powerful sensors, powerful actuators, and being stuck in a loop with the world.
And so if we want to be able to think about how we put these things together again, it helps to have this clarifying hypothesis. And I think this is what we've always meant by an agent, actually. I think a robot is just a fully developed agent. Everything else is a cartoon. People in AI say, oh, robotics is an application of AI. Actually, no, robotics is AI, everything else is an application of AI.
OK, so what I'm going to talk about today, and I'm just going to take that for granted, and I'd be happy to argue with you at length afterwards about the details of that hypothesis. It's not original to me, obviously; it's been around for a long time. I was indoctrinated in the '90s by Rodney Brooks, the late '90s, and I was very young. So just for age calibration purposes there.
What I'm going to do is I'm going to try and present what we think the natural consequences of this are in terms of how we model a generally intelligent agent. And we're going to start out with a formal model because we have to agree on what the basic process is. And then we're going to think about how that single agent can-- one single agent can solve many tasks. That's our definition of general intelligence: one agent able to solve many tasks reasonably well.
And then we'll talk about whether that gives us leverage into asking and answering what should be innate versus learned. And then I'll just touch briefly on language. So the last two will be kind of brief. Most of the meat is in the second one because that's where my technical work lives. So let's talk about this model.
So in my lab we like coming up with cool names for things. And so the kind of basic model that you get when you switch on a robot and ask it what's happening is a process that's running with the world, and we call that the ego process. It's the robot's innate interaction with the world.
What's happening in that thing is, you're getting some observations from your sensors. And they're rich and high dimensional because you're a robot, and of course they are. And we'll get into why that has to be true in a minute. And then you get to take actions in the world. And you're locked in this interaction loop. Your actions change the world in some irrevocable way, and then you get some sensor input back.
And this just is what happens when you switch on a robot. If you take seriously that the computational analog of a body is a robot, this is what you get. This is kind of no argument there. You get sensors. You have actuators. And when you actuate, the world changes. Otherwise, you wouldn't bother.
And what we're going to say is, there's this ego process running. And then we're going to think about a general-purpose robot, so we're going to give it tasks, one task at a time. And we're going to model the task using something called a reward function. And that's quite a specific thing, but actually, we're going to model it very generally.
But the idea is that there's some distribution of tasks that you may ask the robot to do. And you're going to pull something from that distribution, and you're going to hand it to the robot. And then that actually forms a formal model called a decision process.
Some of you may have missed the word Markov in there, because it's not Markov. It's just a decision process where you interact with the world. There's an observation space. There's an action space. There's a reward function that specifies your task. There's the transition function; that's how the world works in the background, and you don't have that, and it's complicated anyway.
And then there's this gamma term, which for now you should just think of as a modeling term. It's like the probability that you'll be-- it's 1 minus the probability that you'll be interrupted and asked to do some other task at every time step. And just as a modeling choice, we're going to choose that as a geometric distribution. So at every time step, there's some probability that someone will say, hey, robot, stop doing that thing. Come do something else. That's how that makes sense.
And what we'd like to do is generate an action from some policy. That policy is task specific, and it depends on your whole history. And the reward is task specific, and it depends on your whole history. And this gets you out of some of the awkwardness in reinforcement learning where you think about sums of reward, because you can imagine just getting the reward at the end of the history, and that can express rewards over whole trajectories, or you can imagine that it's cut up into things that you sum, but this is kind of sufficiently general to do most of the things we want to do.
And then what you want to do is you want to find a policy that maximizes the discounted sum of rewards. And so this looks like reinforcement learning, or a reinforcement-learning-style thing, except, of course, it's partially observable. It's severely partially observable because the world is severely partially observable. I don't know whether it's raining or not right now, but my future dinner plans are going to depend on that, and I just don't get to see it. And so we call this a decision process. And that's the fundamental model. That's the Ur model. Everything else has to flow from there. That is a description of what happens to the robot in the world.
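The model just described can be sketched in code. This is a minimal illustration with names of my own choosing, not anything from the talk: the policy and reward both condition on the whole history (the process is not Markov), and gamma is read literally as the per-step probability of not being interrupted.

```python
import random

def run_episode(policy, step_fn, reward_fn, gamma=0.95, seed=0):
    """One episode of a (non-Markov) decision process.

    policy and reward_fn condition on the entire history, not a state.
    gamma is treated literally: at each step the task continues with
    probability gamma, i.e. the robot is interrupted and handed a new
    task with probability 1 - gamma (a geometric horizon).
    """
    rng = random.Random(seed)
    history = []                        # full observation/action record
    total_reward = 0.0
    obs = step_fn(history, None)        # initial observation from sensors
    while True:
        action = policy(history, obs)   # task-specific, history-dependent
        history.append((obs, action))
        obs = step_fn(history, action)  # the world changes; we don't see how
        total_reward += reward_fn(history)
        if rng.random() > gamma:        # "hey, robot, come do something else"
            break
    return total_reward
```

With gamma = 0 the episode lasts exactly one step; as gamma approaches 1, the expected horizon 1/(1 - gamma) grows, which is one way to read the discount factor.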
Now, let's talk about why it's the case that it's hard for such an agent to learn to solve any individual task. If you take such an agent, and you take such a formalism, you see that the robot's interacting with the world, not a task but the world, because it's receiving the world in its full complexity through its sensors.
And so you could imagine just taking something like an RL style agent with memory and just sending it at this and saying, hey, just go maximize your policy. And that would work if you had infinite data. But it's not likely to work for most real robots, because we don't have infinite data, and they live in the real world, and time is a real thing and samples are a real thing, so we can't do things quickly.
Now when you aren't thinking of a general agent, when you're thinking of a narrow agent you get away from this problem. If you're building an agent that plays go, you get to give that agent an input space that's tailored to go. You get to say it's a 19 by 19 array. And your actions are legal moves in go.
And when you are building an agent that does radiology, you get to give that agent-- you get to say your input's an image and your output's like, I've detected disease or not. And when you are doing Jeopardy, even natural language tasks, you get to say my input's a sentence, and my output is a question.
And if you're building one agent that's totally fine. But if you're building a single agent that needs to do all of these things, then its sensorimotor complexity needs to be big enough to contain any of them. It needs to be able to contain the union of all of them. And so we call this the sensorimotor dilemma. The more general an agent is, the less well suited its sensors and actuators are to any individual task. That's the core dilemma in general intelligence.
If you want to build something that just plays chess, easy. You just give it an input. And that's the chess board. And you give an output. Those are actions, and you're done. It's not that hard. But if you want to build an agent that plays chess and also juggles and writes a sonnet and is able to do radiology and play go, then you have to drastically expand its sensorimotor space to cover the union of all of these things. And suddenly, it's not suitable for learning any of them.
So the key question in building an AGI, I think, is how you overcome the sensorimotor dilemma. If you want to learn to play chess, you could imagine giving a chessboard to an agent. But a robot doesn't get that for free; instead, a robot gets something like that. And it has to take this rich visual input. And from that rich visual input, it has to try and learn to play chess.
Now, if we're given this, we can just drop in a completely generic alpha-beta pruning algorithm and get to average club chess player level. 1,600 Elo is average club chess player level. We have an algorithm from the '70s that does that. So once we have the right representation, we're satisficing immediately, but instead, we get something like that. And we have to find a projection from that thing to the original task that reflects the complexity of the task, not the robot.
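To make the "completely generic" point concrete, here is a textbook alpha-beta pruning routine over an abstract game tree (a nested list whose leaves are static evaluations). It is a hedged sketch of the classic algorithm, not any particular chess engine, and notice that it contains no chess knowledge at all; everything chess-specific lives in the representation you hand it.

```python
def alphabeta(node, alpha=float('-inf'), beta=float('inf'), maximizing=True):
    """Classic alpha-beta pruning. `node` is either a number (a leaf's
    static evaluation) or a list of child nodes. Nothing here knows
    about chess."""
    if not isinstance(node, list):          # leaf: return static evaluation
        return node
    if maximizing:
        value = float('-inf')
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:               # opponent will never allow this line
                break
        return value
    value = float('inf')
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:                   # we will never choose this line
            break
    return value
```

For instance, on the tree `[[3, 5], [6, 9]]` with the maximizer at the root, the minimizer holds each branch to 3 and 6 respectively, so the root value is 6.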
Robot's got to be complex otherwise it can't be general purpose. Task has to be simple otherwise you can't solve it. These are all the same position, by the way. So that compression into a state space has to be able to take all those three images and compress them into the same board.
By the way, this explains why chess is harder for people than computers because if you're building a narrow agent, you get this for free, of course that's easy. But if you're building a general purpose agent, you have to go through vision to chess. That's why chess is harder than vision for people, but easier than vision for narrow AI programs.
OK, so if you're building a general-purpose AI, you have a robot interacting with the world. If you're building a narrow AI that solves one problem, you just have a computer interacting with that problem, and it's much easier, but it's also cheating. So whenever you read an AI paper, and it starts with, we're solving a sequential decision-making problem, we assume a state space that looks like this and an action space that looks like this: cheating. They've gotten rid of 80% of the problem just by framing it.
Just framing the problem is most of the problem actually. And if you're a general purpose agent, you have to do the framing yourself. You don't get it given to you, that's uncool, OK. So how might we do that? How might we build a robot that can do the framing itself on its own autonomously? And this is actually a well-formed question once we have this fundamental model.
We can say that we have a decision process. And this is the reward function added to the ego process, in which decision making is crazy. You can't do it, yeah, that's nuts. Playing chess from pixels in real images, no way that's going to happen. But what you'd like to get is a decision process that's maximally compact, that expresses the complexity of chess, not the robot. Doesn't have any robot parts in it. Only has chess parts in it.
And it turns out that you can do that. It's kind of well formed. What you need to do is learn a perceptual abstraction. That's a mapping from your perceptual history to an abstract percept. And you have to learn an action abstraction. That's like a motor skill that you can run in the world. And if you put these two things together, the rest of the components just follow. You can derive the rest of them automatically.
As long as these two things match, as long as they mesh so that they form a coherent decision process, all you need to do to frame the problem is learn an appropriate perceptual abstraction and an appropriate action abstraction. And then once you've got that, you can mine the structure in these things. So this is my favorite trick I like to pull.
So imagine you've got this super-complicated ego process, which is a decision process that you formally just can't solve. The complexity of that is nuts. But then you start with that decision process, and you build an abstract decision process, and you eyeball its properties.
You see that if it's observable, and it's sequential, then what you have is an MDP. And you can swap in MDP-solving algorithms for that. And so you can solve a subclass of decision processes. It's much easier to solve.
Similarly, if you take an MDP, and it's deterministic, known, and discrete, what you have is search. So you can drop in A*, and you can solve that more effectively. And similarly, if it's known and discrete and also happens to be factored, you can do classical planning, which is better than search. And if it happens to have adversarial semantics, you can do adversarial search in it. So you just drop in your alpha-beta pruning.
You hypothesize that there's an agent out there. That's where the adversarial dynamics come from. You decide to connect some of your pixels to that agent, because there are no other agents in your pixels when you're a robot. You switch the robot on, and it doesn't come with human detectors. You have to build that, and you have to choose to model the world that way.
Similarly, you can take a decision process that's observable and not sequential, like it ends after one time step, and if it's got discrete actions, you have what's known as a bandit. Bandits are easier. And if you have demonstration data, you have what's known as classification, supervised learning with discrete actions. And if the actions are not discrete but continuous, and you have demonstrations, then you have regression.
So you can get to all the other classical AI paradigms by special-casing a decision process. That's what I mean when I say it's an Ur process. You have a root process running on your robot. It's a decision process. Then you frame the task that you're facing in the most compact way. And then you special-case that task. And you say, hey, what general-purpose algorithm can I swap in, because this happens to be a classification task, not a sequential decision-making task.
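As one concrete instance of swapping in a general-purpose algorithm: once the abstract process turns out to be deterministic, known, and discrete, a standard A* search applies. The sketch below is generic textbook A*; the graph and heuristic interfaces are my own illustrative choices, not anything from the talk.

```python
import heapq

def astar(start, goal, neighbors, heuristic):
    """Standard A* over an abstract graph.

    neighbors(n) yields (next_node, cost) pairs; heuristic(n) must not
    overestimate the remaining cost. Returns the cheapest path as a list,
    or None if the goal is unreachable.
    """
    frontier = [(heuristic(start), 0, start, [start])]   # (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        for nxt, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float('inf')):       # found a cheaper route
                best_g[nxt] = g2
                heapq.heappush(frontier,
                               (g2 + heuristic(nxt), g2, nxt, path + [nxt]))
    return None
```

On a toy graph where A reaches C directly for cost 4 or via B for cost 2, the search returns the cheaper detour through B.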
So what we imagine is a robot, and then we hand it a reward function and it learns the abstract representation that's appropriate for chess. And then it plays chess in that case. Or if you tell it to navigate through a city, it learns the appropriate representation for that. And it happens to be a graph, and you do search.
Or if you want it to play soccer, it learns the appropriate abstract representation for this, which is continuous state and action. And so it has to do policy search. It has to lock in just the right algorithm, but they're all special cases of the decision process.
So how are we going to do this? I've shown you formally what we need to do. How are we going to learn these abstract representations? And that turns out to be complicated as you can imagine. My lab has spent a lot of time on how you learn abstract actions. And I'm not going to talk about that today. What I'm going to talk about is how you learn abstract symbolic representations, or abstract perceptual abstractions once you have the action abstractions.
So you need to be able to account for grounding, like what do those perceptual abstractions mean in the world. When are they discrete? When are they continuous? When are they deterministic? When are they stochastic? And you need to be able to phrase them in such a way that you can support learning them.
So our approach is what we call constructivist, because we like names like that. And it says, formalize the question that you're asking of this abstract representation. Write it down in math, and then construct the representation so that, by construction, you can answer that question.
So we started out with the simplest planning question. We said, OK, let's say you start with a start state. So we did this in the MDP framework for now, so we're just saying everything is Markov, but we're going to get to the more general decision process framework. And you've got a sequence of motor skills that you want to run in a row. And you want to know, can I execute it with probability one or not?
So what is the basic math of that process? It turns out that you can answer that question always using some sets, one set and one set operator. So you start with a precondition. That's the set of states from which you can execute that motor skill. And then you need an image operator, which is, after you've executed the motor skill, what's the set of states you could land up in. And if you know these two things, you can compute whether you can chain these things by saying, I'm starting in some start set Z. Zed, not zee, zed.
And then you take an option, and you can execute that motor skill if Z lies inside the precondition. And if it does, you're going to land up inside the image, and then you ask, can I execute the second one? And you say yes, if and only if that image lies inside the precondition. And if so, you land up in its image.
And you can write a proof like this for any sequence of actions. And the only thing that appears in it is the image, the precondition, and the start state. So if you can write those sets down, then you have a necessary and sufficient vocabulary of sets for being able to reason about whether you can plan.
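That chaining argument is small enough to write down directly. In this sketch (the data layout is my own), a skill is a (precondition set, image operator) pair, and feasibility of a sequence is checked using only those two pieces, exactly the vocabulary just derived.

```python
def plan_is_feasible(start_set, skills):
    """Can this sequence of motor skills be chained with certainty?

    start_set: the set Z of states we might start in.
    skills: list of (precondition, image) pairs, where precondition is
            the set of states the skill can run from, and image maps a
            set of states to the set of states execution can end in.
    """
    current = set(start_set)
    for precondition, image in skills:
        if not current <= precondition:   # might be somewhere the skill can't run
            return False
        current = image(current)          # everywhere we could land up next
    return True
```

For example, a skill that runs from {0, 1} and always lands in {2} chains with any skill whose precondition contains 2.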
And what we're going to do is create abstract states that are names for these sets. And then we're going to tell you that you can execute a motor skill from those states if those sets lie inside a precondition. And then we're going to build a graph-style representation, compute it that way, and you can prove it sound and complete under some conditions. And then we can throw the set thing away, and we can just plan with the abstract representation.
There are some conditions under which this thing is a graph. There's this idea of a subgoal skill, which drives you to a set of states. It doesn't really matter where you start; you're going to end up in some set of states. If it drives all the state variables to that set of states, then you can construct a graph. The only thing you need to check is whether or not that set of states is inside the precondition of some other skill. If so, you put an edge between the two nodes. Each node corresponds to a skill, and you can just do search in that graph. So in that case, it's discrete. And the resulting problem class is search.
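For subgoal skills the construction is equally direct. In the sketch below (again with names of my own), each subgoal skill is a (precondition set, ending set) pair, since where it ends doesn't depend on where it starts, and an edge goes from skill i to skill j whenever i's ending set lies inside j's precondition.

```python
def build_skill_graph(skills):
    """Abstract graph over subgoal skills.

    skills: list of (precondition, ending_set) pairs; each skill drives
    the world into ending_set regardless of where it started. An edge
    i -> j means skill j is guaranteed executable right after skill i.
    """
    edges = {i: [] for i in range(len(skills))}
    for i, (_, ending_i) in enumerate(skills):
        for j, (pre_j, _) in enumerate(skills):
            if ending_i <= pre_j:     # i's outcome satisfies j's precondition
                edges[i].append(j)
    return edges
```

Planning then reduces to ordinary graph search over these edges, which is the discrete case the speaker describes.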
The more interesting case is when it's factored. So what that means is that some state variables are left alone, other state variables are changed, and the set they'll end up in doesn't depend on the set you start in. That's called a factored subgoal skill. Like if you're a robot and you're picking up this object: you change the position of the object and the position of the robot, but not whether the door leading in here is open or closed. If that's the case, there's a bit of math, and you land up in STRIPS, classical planning. That's what you get.
You can upgrade this to ask a probabilistic question. What's the probability with which I can execute a sequence of motor skills? All you have to do is take the precondition set and turn it into a probabilistic classifier. What's the probability that I can execute a skill from a particular point?
And then you can take this image, which is a set, and turn it into a distribution over states. What's the density over the states I land up in afterwards? And if you can do that, then you can compute the probability with which you can execute a straight-line plan. And also, probabilistic classifiers and density estimators are well-known machine-learning toolbox things.
So we can just pull them out of our toolbox. Literally, I pulled them out of scikit-learn. And you can apply them to this problem. You go around in the world. You execute your skills. You get data of the form, I was in a state. I executed my skill. I landed up in another state. I got some reward. And in every state, can I run the skill or not. This is training data for a probabilistic classifier for your precondition. This is training data for your density estimator.
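The talk pulls these estimators straight out of scikit-learn; the pure-Python stand-ins below (my own, purely for illustration) use raw empirical frequencies to show the shape of the training data: success/failure pairs train the precondition classifier, and observed outcome states train the image density.

```python
from collections import defaultdict

def fit_precondition(experience):
    """Probabilistic precondition: estimate P(skill runs | state) from
    (state, succeeded) pairs collected by trying the skill in the world."""
    counts = defaultdict(lambda: [0, 0])       # state -> [successes, trials]
    for state, succeeded in experience:
        counts[state][1] += 1
        if succeeded:
            counts[state][0] += 1
    return lambda s: counts[s][0] / counts[s][1] if counts[s][1] else 0.0

def fit_image(outcomes):
    """Image as a distribution: empirical density over the states the
    skill was actually observed to land in."""
    counts = defaultdict(int)
    for state in outcomes:
        counts[state] += 1
    total = len(outcomes)
    return lambda s: counts[s] / total if total else 0.0
```

In practice you would substitute a real classifier and density estimator (the robot's states are continuous), but the interface, probabilities in and out, is the same.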
And you just kind of cut things up so that that factored subgoal property holds. And if you can do that, and you can't always do that, but if you can do that, you get a factored representation. So let me show you what that looks like.
So this was work with Leslie and Tomas. Sorry it took so long to come out. It was like five years working on this paper. OK, so this is a room with a cupboard and a cooler. There's the cupboard. There's the cooler. And then there's the switch, over there. And I just handwrote the motor skills using motion planning and a bunch of other stuff.
And the robot can navigate between the cooler and the cupboard. And it can flip the switch, turn the switch on or off, when it's standing in front of the cupboard, and it can open and close the cupboard. But when the cupboard's open, it blocks access to the switch; it can't reach it. And it can open this cooler, but it can only open the cooler or the cupboard if it's got both hands available to it.
And then there's an object inside the cooler — you can see the robot holding it there — that it can pick up if it can reach it. When it's holding it, it holds it up here; it's a big green object. Now, you could imagine writing down an MDP with some tiny number of states that very compactly describes this domain, but that's cheating.
So what we want to start with is the motor skills and pixel inputs, and we want to build that MDP. OK, not quite pixel inputs: the inputs to the state space are the robot's pose in a map, a voxelization of the entire room (I've just cut out the cooler), and the raw joint positions of the robot.
And if you give those motor skills to the robot and ask it to execute them, you have to execute about 160 skills — it takes a couple of hours. There it is turning the switch. Oh, I forgot to say: when the switch is on, there's a light on inside the cupboard, and it's so bright you can't see inside the cupboard. I had to find a very bright light; it whites out the robot's sensors.
There you see it picking up the object, putting it inside the cupboard, closing the cupboard, closing the cooler. And there it can't pick up the object, because it's too bright when the light's on. So you run this — about 160 skill executions, a couple of hours' worth of data — and remember, there's no background knowledge built into this robot. It's straight from voxels.
We can recover symbolic representations. They're inarguably symbolic, because they're a text file — it's in PDDL. This is a factored MDP. It says: if you want to navigate to the cooler, the precondition is that the state is drawn from symbol one. And if you go into symbol one and look, it's a distribution over the robot's pose in the map.
And if you draw samples from that distribution, you see that the robot is standing up here, looking that way — that's where the cupboard is. And afterwards, symbol one will no longer be true; symbol zero will be true. That's also a distribution over pose in the map: the robot standing over here, facing the cooler. And that will take about 37 seconds.
Now, what's happened here is that the robot has discovered the distinction between at-the-cooler and at-the-cupboard on its own, autonomously, just from data. And it's provably a sound and complete representation for planning with those motor skills. So the big assumption is that the motor skills are given.
Here's a more interesting one. If you want to open the cupboard, you have to be in symbol one, so you have to be standing in front of the cupboard. Symbol three is a distribution over the robot's joints, and you'll see that its hands are down here, so it's not holding anything.
And then symbol four is a distribution over the voxels — I've just cut out the cupboard, since the rest doesn't change much. This is what the voxels of the cupboard look like: you can see that the cupboard is closed. The switch could be on or off, lots of other things could be happening, but the cupboard is closed. Afterwards, the cupboard will be open — possibly with an object in it, possibly not; possibly with the switch on, possibly not. That's not specified by this distribution.
OK. So the robot has learned a grounded symbolic representation autonomously, given just the motor skills; everything else came from data. And if you give this to the robot and ask it to plan to move the object from the cooler to the cupboard, here's what it does.
First, it opens the cooler, but it doesn't pick up the object, because the cupboard is closed and you can't open the cupboard if you're carrying something. So it moves to the cupboard — but then it doesn't open the cupboard, because it has to turn the switch off first. So it turns the switch off; it wouldn't be able to reach the switch if the cupboard were open. Then it opens the cupboard, then it goes back and picks up the object.
Now, that is what classical planning classically does; it's not surprising that you can do that once you have that representation. It did it in 4 milliseconds, actually — because the input is just a text file, it's very fast. What's interesting is how we got to that maximally compact representation: by applying this theory of how to build a compatible perceptual abstraction given the action abstractions. Also, I asked it to clean up afterwards, so it's going to close the cupboard and the cooler and all that automatically.
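The learned PDDL operators themselves aren't reproduced in this transcript, so here is a hand-written miniature of the same domain — invented proposition and operator names, mirroring the constraints described above (switch blocked by the open cupboard, hands needed to open things, the light blocking manipulation inside the cupboard) — with a breadth-first planner, to show why planning over the symbolic text file is nearly instant.

```python
from collections import deque

# Each operator: (name, preconditions, negative preconditions, adds, deletes).
# These are illustrative stand-ins for the learned operators, not the real ones.
OPS = [
    ("goto_cupboard",  {"at_cooler"}, set(), {"at_cupboard"}, {"at_cooler"}),
    ("goto_cooler",    {"at_cupboard"}, set(), {"at_cooler"}, {"at_cupboard"}),
    ("switch_off",     {"at_cupboard", "switch_on"}, {"cupboard_open"},
     set(), {"switch_on"}),                       # can't reach switch if open
    ("open_cupboard",  {"at_cupboard", "hands_free"}, {"cupboard_open"},
     {"cupboard_open"}, set()),                   # needs both hands
    ("open_cooler",    {"at_cooler", "hands_free"}, {"cooler_open"},
     {"cooler_open"}, set()),
    ("pick_from_cooler", {"at_cooler", "cooler_open", "obj_in_cooler",
                          "hands_free"}, set(), {"holding"},
     {"obj_in_cooler", "hands_free"}),
    ("place_in_cupboard", {"at_cupboard", "cupboard_open", "holding"},
     {"switch_on"},                               # light whites out the sensors
     {"obj_in_cupboard", "hands_free"}, {"holding"}),
]

def plan(start, goal):
    """Breadth-first search over abstract states (sets of true propositions)."""
    frontier = deque([(frozenset(start), [])])
    seen = {frozenset(start)}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:
            return steps
        for name, pre, neg, add, dele in OPS:
            if pre <= state and not (neg & state):
                nxt = frozenset((state - dele) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None

start = {"at_cupboard", "obj_in_cooler", "hands_free", "switch_on"}
steps = plan(start, {"obj_in_cupboard"})
```

The planner recovers the same ordering logic the robot exhibited: switch off before opening the cupboard, cupboard open before picking anything up, because opening requires free hands.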
The delay here, by the way, is for the vision. What's happening is it's merging the point clouds. I wrote this code, and I'm bad at vision, so it's slow — about 45 seconds. Had I had a grad student at the time who wanted to play with GPUs, it would have been instant; the actual planning time is not what's causing the delay. Then it goes back and closes the cupboard. And we've applied this framework in 12 papers at this point, six of them involving real robots, so it's now reasonably well developed.
So that's my argument. That's how you get over the sensorimotor dilemma: you learn a collection of action abstractions, and you learn a collection of perceptual abstractions that match those action abstractions and support planning with them. And then, once you're in a representation that expresses the complexity of your task, not of your robot, you can do problem solving.
OK, that's all I'm going to say about one agent, many tasks for now. Now I'd like to talk about innate versus learned. So far — if I'm being grandiose about what the previous work does — it unifies the rest of AI with hierarchical RL. It says that the base thing is a decision process; you learn state and action abstractions over that decision process; and then you index into the structure of the resulting thing and call out to various classical AI paradigms, like search or classical planning or policy search.
But we haven't talked about robots yet. All of that is very abstract, and anyone who's actually switched on a robot and done anything with it knows that you can't actuate your sensors and actuators directly like that in real life. You will immediately crash your expensive robot. You might damage a grad student, or some other horrible thing like that.
So let's think about how we might unify robotics with the rest of this. My argument is that we need to think about what should be built into a robot. So here's my hypothesis. We start with the ego process, and let's say we want to play chess, and we want to build a set of abstractions that describe chess compactly. We're actually going to do that hierarchically — we're not going to go straight there; we're going to build some layers.
So let's say we learn those layers from scratch. Then, if we wanted to navigate in a map, we would have to learn another set of layers from scratch. And if we wanted to play soccer, we would have to learn yet another set of layers from scratch.
But you might find that these first couple of layers — which are, by the way, exponentially harder to learn than the later ones; the later ones are nice and compact, while these early ones are high-dimensional and continuous — recur over and over again. You'd be learning the same set of abstractions repeatedly if you learned everything from scratch.
And so you can imagine putting a line here, and saying: the stuff below this line, I'm almost always going to have to learn anyway. It reflects structure in the world or the robot, not in this individual task. And by the world, I mean either the world itself — the physics: the world is 3D, there's time, and you don't want to smack into things — or the distribution over tasks. That's also a property of the world: you're going to be asked to do some things but not others.
So if we think about that structure, we find that these things get learned over and over again, and they belong to the world, the robot, and the distribution of tasks. It makes sense to build them in. That is what should be innate.
So we imagine blocking this into something we call the sensorimotor substrate: the collection of things on top of which you learn a problem-specific abstraction. Below that line, nothing is problem-specific — it's specific to the world, the robot, and the problem distribution. Above it is, say, playing chess, which never happened in your evolutionary history. So we're going to bootstrap off some layer in the sensorimotor substrate and build an abstraction on top of it.
So what do we want to include in these abstractions? What goes in there? Everything that exploits structure in the world, not in a specific task — more specifically, structure in any element of the ego process: the observation space, the action space, and the distribution over reward functions.
If you're a robot that's always asked to do exactly the same thing, you should just build it in; there's no reason to learn. If that distribution, however, is broad, then you're going to have to learn. And these things are always going to be coupled: coupled state-or-perception and action abstractions. That's the form of this knowledge. You always have to couple an input abstraction with an output abstraction.
And now, if we go through the robotics literature — and I love robotics, I have a lot of fun in it, though sometimes robots drive me crazy — what you find, again, is a collection of methods that someone thinks will be useful in a robot, and that in many cases have been useful in a robot, but it's kind of disorganized. There's a bunch of things you could study, each of which is probably useful in a robot someday.
But if you go through it, you will find that most of these processes — like SLAM, motion planning, object recognition, and grasping — can be thought of as action or perception abstractions. So here's SLAM, for example. I just got this off YouTube, but I think it's Creative Commons, so that's probably fine.
You can't think about doing anything on a mobile robot without SLAM. You're driving through the world, and this is your sensor input space, which is massive and only very partially observable, and you're building a map of the environment.
So you're going from that sensor input, which is very high-dimensional and can only see in front of you, to your pose in a map — x, y, and theta — which happens to be Markov. And so what you've done is you've learned a perceptual abstraction that takes that decision process and makes it a Markov decision process. And if, while localizing, you also happen to build a map, you've built a model of that Markov decision process, and you can do model-based planning — you can do path planning in that map.
Path planning is the matching action abstraction for SLAM, which is a perceptual abstraction that exploits the fact that the world is spatial and the sensors work in a particular way to build a spatial abstraction — one that gets rid of the partial observability due to limited sensor range. That's its role. And you really want to be solving Markov things if you can get away with it, because partially observable things are bad.
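The essence of that perceptual abstraction can be shown in a few lines. This is a toy 1-D Bayes filter — a stand-in for real SLAM, with an invented five-cell world map assumed to be known — in which the whole noisy observation history is compressed into a belief over pose, which is the (approximately) Markov state that path planning then operates on.

```python
# Toy 1-D localization: the raw input is a noisy, partial observation stream;
# the filter compresses the entire history into a belief over pose.
WORLD = ["door", "wall", "wall", "door", "wall"]  # what the sensor sees per cell

def normalize(b):
    s = sum(b)
    return [x / s for x in b]

def update(belief, observation, p_hit=0.9, p_miss=0.1):
    # Bayes correction: weight each pose by how well it explains the reading.
    return normalize([b * (p_hit if WORLD[i] == observation else p_miss)
                      for i, b in enumerate(belief)])

def move_right(belief, p_ok=0.8):
    # Prediction: shift the belief one cell right (cyclic), with motion noise.
    n = len(belief)
    return normalize([p_ok * belief[(i - 1) % n] + (1 - p_ok) * belief[i]
                      for i in range(n)])

belief = [1 / len(WORLD)] * len(WORLD)   # uniform prior: pose unknown
belief = update(belief, "door")          # see a door: poses 0 and 3 likely
belief = move_right(belief)
belief = update(belief, "wall")          # now poses 1 and 4 are the candidates
```

After two observations the belief has collapsed onto the poses consistent with the history — that compressed pose estimate, not the raw sensor stream, is what the map-level decision process sees.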
Here's another one. This is robot motion planning, which Tomas invented, more or less. Here you've got robots moving through the world, and you just can't treat this as a generic action space. There are all these obstacles, and these are 900-kilogram robots; if you start smacking into things, everything goes bad. Even if you, as a human, smacked into things all the time, everything would go bad.
So what does motion planning do? It exploits the fact that the world is geometric and 3D, and it builds an action abstraction: I want to put my hand over here — can you compute a policy to get it from here to there? Yes, you can, by exploiting the fact that the world is 3D and you know your own kinematics. That's an action abstraction.
Once you can operate in this space — just where the business end of your robot is — you don't have to think about where the rest of your body is, and decision-making is much simpler. So what I would like to do is build a sensorimotor substrate out of techniques from robotics: some are abstractions over observation histories, some are abstractions over actions, and then there are techniques like inverse kinematics and forward kinematics, which are processes that support those things.
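To make the "go from here to there" action abstraction concrete, here is a minimal 2-D RRT sketch — a toy stand-in for real motion planners, with an invented circular obstacle in a unit square rather than a robot's configuration space. The caller only names a start and a goal; the planner hides all the geometry in between.

```python
import math
import random

random.seed(0)

OBSTACLE = ((0.5, 0.5), 0.2)  # circle: center, radius

def collision_free(p, q, steps=20):
    """Check the straight segment p->q against the obstacle by sampling."""
    (cx, cy), r = OBSTACLE
    for i in range(steps + 1):
        t = i / steps
        x = p[0] + t * (q[0] - p[0])
        y = p[1] + t * (q[1] - p[1])
        if math.hypot(x - cx, y - cy) <= r:
            return False
    return True

def rrt(start, goal, step=0.08, iters=5000, goal_tol=0.08):
    """Grow a random tree from start; return a waypoint path to goal."""
    nodes = [start]
    parent = {0: None}
    for _ in range(iters):
        # Goal-biased random sampling of the free space.
        sample = goal if random.random() < 0.1 else (random.random(), random.random())
        i = min(range(len(nodes)), key=lambda k: math.dist(nodes[k], sample))
        near = nodes[i]
        d = math.dist(near, sample)
        if d == 0:
            continue
        new = (near[0] + step * (sample[0] - near[0]) / d,
               near[1] + step * (sample[1] - near[1]) / d)
        if not collision_free(near, new):
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i
        if math.dist(new, goal) < goal_tol and collision_free(new, goal):
            path, k = [goal], len(nodes) - 1
            while k is not None:          # walk back up the tree
                path.append(nodes[k])
                k = parent[k]
            return path[::-1]
    return None

path = rrt((0.1, 0.1), (0.9, 0.9))
```

The caller's "action space" has collapsed to endpoints; the tree, the sampling, and the collision checks are the built-in knowledge that the world is geometric.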
All right, so my bold claim of the day is that everything in robotics can be viewed as one of these two things, or as something supporting one of them. And what we need to do is find the matching pairs of observation and action abstractions that put us in a much simpler decision process — one that exploits the structure of the world so we can do fast learning and planning.
Once you've done SLAM, you don't need to worry about the fact that you can't see past your nose. You're in a map. And if you want to do navigation, it's path planning in a map, and that's not hard.
And so we've built up a first rough sketch of what this would look like — we haven't done any of it yet. But you could imagine saying: OK, roboticists are wise, they know what robots need, and they've built a bunch of technologies. Let's go pick up those technologies and try to organize them.
And what we find when we think about that is that robotics has broken the world up into, roughly speaking, navigation and manipulation. For each of those, you could imagine starting with control — think Boston Dynamics running around in the world. That's the first level. And on top of that, you build a spatial layer, which handles SLAM and path planning on the navigation side, and motion planning and spatial perception on the manipulation side.
And then maybe you jump up to an object level, where you start to account for the fact that mostly you're interested in manipulating objects — so when you navigate, you want to navigate to a place where you can interact with an object, or observe it, on the navigation side. Two techniques along these lines, including locally observable MDPs, are coming out of my lab in the next year. Max Mullin and Eric Rosen are the two students working on them. They're very friendly; if you're interested, just ping them.
And then eventually, we think, after the object level you get to object generalization. And I'd like to build a single robot stack — not a robot stack per problem, not "let's think about what this robot has to do today" — a single robot stack, the way you have a single stack, and be able to say: when I want to learn a problem-specific representation, I reach into somewhere in that stack, pull out the appropriate level of abstraction, and build on top of it.
And that's how we think about unifying robotics — which you have to study if you're thinking about a generally intelligent agent, because generally intelligent agents are robots. And the structure reflected in robotics has to be there, in some form, for people too.
OK, last bit — I've got three or four minutes until questions. I try to leave a lot of time for questions because I've said a lot of cranky stuff. So I want to talk a bit about language, which is the thing I know the least about. I have the fortune of collaborating with Stephanie [INAUDIBLE] and Ellie Pavlick, who are amazing language researchers who think about grounding.
And so you should attribute all the naivete I'm about to exhibit to them failing to see me this morning on my way here — so it's not my fault. Well, it's not their fault; it's my fault. The thing about language in this context is that it's special, because we don't get to design it. There's an existing natural language out there, and if we want to build agents that use it, we don't get to choose how it came to be or what its form is. We just have to use a thing that exists in the world.
It's not like the rest of the stack, where we get to say: hey, I want to use my computer vision algorithm however I like, because I'm designing this agent and therefore I feel godlike — I can decide what it does. Instead, language is just language, and we've got to use what we find.
But under this view that humans are robots, not computers, you have to say that language is a protocol invented by two robots to talk to each other — not for anything else, not for tweeting, not for writing books. It's for communicating between two embodied agents.
And therefore, when you think about grounding language — about what language actually grounds to — it has to be a decision process, because that's the Ur formalism. We shouldn't be thinking about grounding language into images or videos; we should be thinking about grounding language into decision processes.
OK, so when I say, "hey, robot, please pick up the stapler," and the robot is in this rich sensory mode of interaction with the world, what you imagine happening is that there's this structure. At the bottom is the rich sensory mode of interaction with the world — hard to describe, because everything's continuous; language is not so great for that.
And then on top of that is the sensorimotor substrate, which will have things like objects in it — it has to. And on top of that, you have a whole bunch of learned, task-specific abstraction hierarchies: some about chess, some about soccer, some about navigating through Cambridge.
And you're going to have to take that language and ground it into those hierarchies, where it will inform the decision processes inside them. So the generic object that language grounds to, in this formalism, is a decision process.
So my bold statement of the day — which I'm wildly unqualified to make — is that language should be grounded to decision-process formalisms. Now, if you actually look at human language: reinforcement learning people, of whom I am one, don't like structure in their decision processes. They like unstructured Markov decision processes; structure in the decision process feels like cheating. That's because they haven't done robotics — it might be cheating, but you just have to do it.
But what I find really striking is the structure of language itself — just look at the parts of speech. I only speak English, so I'm sorry for looking at just one language; I can swear in a bunch of others, but English is the only one I was made to study grammar in that I remember.
If you imagine that language grounds to a decision process, then the fact that humans have verbs — discrete names for actions — means you have to have discrete action abstractions, in order to name them with discrete names.
And similarly, the fact that we have nouns means you have to have objects. You can't just be in a plain Markov decision process; you have to be in an object-oriented one — otherwise you can't have nouns. And the fact that we have common nouns means it has to be a typed object-oriented decision process.
And the fact that you have adjectives means those objects have attributes. And the fact that you have adverbs means those motor skills are parameterized by real-valued parameters — because if you say "throw the ball higher," you have to be able to adjust your motor skill to throw the ball higher.
So to RL people thinking about grounding language into RL: the language itself suggests that the model you want to ground into is very structured. And another thing — there's no reason to make declarative statements about world state unless you're in a partially observable Markov decision process. There's no point in my saying "the stapler's in your office" if you're in an MDP, because you can just observe that at every time step. Declaratives don't make sense if humans are in an MDP.
So we wrote a workshop paper — it was kind of an amusing thing. My student Rafael Rodriguez Sanchez and Roma Patel, who was a PhD student at Brown, took all the different types of structured decision processes and asked: what parts of speech make sense in each of them? And it's really clear that most of language doesn't make sense in MDPs; there's just no reason to have it there if what you're grounding to is an MDP.
But if you're in a decentralized (there are other agents who don't see everything you see), object-oriented (there are objects in the world), partially observable (you can't observe everything) MDP, then a lot of these things make sense — but those structures have to be there.
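The argument from the last few paragraphs can be written down as a tiny lookup. This is an illustrative reconstruction of the idea, not the actual table from the workshop paper; the feature names are invented labels for the decision-process structures discussed above.

```python
# Each part of speech presupposes some structure in the decision process
# you ground it to (labels are illustrative, not the paper's own).
NEEDS = {
    "verb":        "action_abstractions",    # discrete, nameable motor skills
    "noun":        "objects",                # object-oriented state
    "common noun": "typed_objects",          # object categories/types
    "adjective":   "attributes",             # objects carry attributes
    "adverb":      "parameterized_skills",   # real-valued skill parameters
    "declarative": "partial_observability",  # facts worth telling someone
}

def sensible_parts_of_speech(model_features):
    """Which parts of speech make sense, given a model's structural features?"""
    return {pos for pos, feature in NEEDS.items() if feature in model_features}

# A plain MDP supports almost none of language's structure; a typed,
# object-oriented POMDP with parameterized skills supports all of it.
plain = sensible_parts_of_speech(set())
rich = sensible_parts_of_speech(set(NEEDS.values()))
```

Reading the table backwards gives the talk's point: if a language has verbs, nouns, adjectives, adverbs, and declaratives, the model it grounds to must have all of those structures.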
The other route we've taken — and I should say that the symbol thing makes sense as a grounding target as well — is a paper by [INAUDIBLE] Gopalan, who's now an assistant professor at Arizona State but was Stephanie [INAUDIBLE]'s PhD student. In that work we took robot demonstration trajectories, segmented them into sub-skills, learned a symbolic representation suitable for planning with them, and then took natural-language descriptions of those trajectories and mapped them — just did machine translation — from the natural-language description to those symbols.
And then we were able to set the robot a goal, have it derive the sub-goal, and have it plan to the sub-goal, because it had the underlying substrate for planning. So these things can work together a little bit too.
The last thing I'll say we're working on: there are now these really, really successful approaches to machine translation with big language models. So we've been asking — just for MDPs, the simplest case — is there a language that can express anything you might wish to say about an MDP? A formal, domain-specific language?
Can you write down advice in that formal domain-specific language, which is by construction grounded in an MDP, and then do machine translation into that language? My student Ben Spiegel, who's in the audience, and also Rafael Rodriguez Sanchez, have been working on this. It's called RLang — a language for giving advice to reinforcement learning agents. And very soon you'll be able to give that advice in natural language and have it translated into an RLang program; that's just a machine-translation problem.
Now, an MDP is very limited — it's not very powerful — but it is a very natural way to give advice to reinforcement learning agents. And we think it's some evidence that there are reasonable ways to get from language to these structured decision processes.
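RLang's actual syntax isn't shown in the talk, so here is a hypothetical mini advice language — invented grammar, predicate names, and actions — just to illustrate the pattern: statements grounded in an MDP by construction, which translated natural-language advice could target, and which directly reshape the agent's action choices.

```python
# Hypothetical advice grammar (not RLang's real syntax):
#   "in <state-predicate> avoid <action>"
#   "in <state-predicate> prefer <action>"

def parse_advice(line):
    """Parse one advice statement into (predicate, verb, action)."""
    _, predicate, verb, action = line.split()
    return predicate, verb, action

def advised_actions(rules, true_predicates, actions):
    """Filter/boost the action set for the current state using parsed advice."""
    allowed = set(actions)
    preferred = set()
    for predicate, verb, action in rules:
        if predicate in true_predicates:     # advice grounded in MDP state
            if verb == "avoid":
                allowed.discard(action)
            elif verb == "prefer":
                preferred.add(action)
    # Preferred actions win if any apply; otherwise the pruned set remains.
    return sorted(preferred & allowed) or sorted(allowed)

rules = [parse_advice("in near_cliff avoid step_forward"),
         parse_advice("in sees_goal prefer step_forward")]
actions = ["step_forward", "turn_left", "turn_right"]
```

Because every token in the advice names an MDP predicate or action, translating "don't walk forward near the cliff" into this form is an ordinary machine-translation problem, and the result is immediately usable by the learner.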
OK, so just to close up. What I've talked about today is the bigAI model. We talked about the fundamental model, the Ur model — what you get when you switch on a robot. And you're all robots, not computers. That's a decision process, and I think it's the maximally general thing you could be asked to solve.
And then we talked about how a robot ends up solving an extraordinarily powerful — wildly overkill — decision process for every task it may wish to solve. Our answer to that, how you resolve the sensorimotor dilemma, is that you learn a task-specific representation that captures the complexity of the task, not of the robot.
You should get the same description of chess whether you use a six- or seven-degree-of-freedom arm, whether you have an HD camera or an SD camera. You have to bridge the gap between the robot's complexity and the task's complexity. And if you can do that successfully, then we have a natural route to pulling in all the formulations from the rest of AI, applying each one when the decision process has the right structure.
Then we talked about how that gives you a lever for thinking about what should be built in and what should be learned. Our argument is that what should be built in is a set of abstractions on top of the ego process — paired action and perception abstractions on top of which you can rapidly learn a task-specific abstraction. Those are things that reflect the structure of the world, the reward distribution, or the robot, but not any specific task. And finally, we talked about language, and how we can think about grounding language into sequential decision processes.
If there's a single thing you take away from this talk, it's this picture: a robot going around being given tasks in a really rich sensorimotor space, constructing abstract representations of those tasks, exploiting the structure it can see in those decision processes — because it built them — and then swapping in the right algorithm of the right complexity. OK, that's it for me. Questions — thank you.
[APPLAUSE]