Towards Complex Language in Partially Observed Environments
August 20, 2020
Stefanie Tellex, Brown University
All Captioned Videos Brains, Minds and Machines Summer Course 2020
BORIS KATZ: It is my great pleasure to introduce our next speaker, Stefanie Tellex. Stefanie is a Professor of Computer Science at Brown University. She is working on human-robot collaboration, and I guess more specifically on grounded language understanding. Stefanie is also a former student and a friend, and please welcome Stefanie Tellex to the Brains, Minds, and Machines summer course. Go ahead, Stefanie.
STEFANIE TELLEX: Awesome. Thanks for having me, Boris. I got started in language and language understanding working as a master's student in Boris' group on factoid question answering, and did a lot of really fun work with computational linguistics, and--
BORIS KATZ: If I remember correctly, you even started as an undergrad, [INAUDIBLE].
STEFANIE TELLEX: Oh, yeah, it was in undergrad, and then I was in MEng. Yep. Yeah, in undergrad-- long time ago. And then I went on to do a PhD at the Media Lab and a postdoc at CSAIL, and now I am a Professor at Brown University. So this talk is called "Towards Complex Language in Partially Observed Environments."
And what I'm really interested in, as a researcher, is trying to make autonomous agents-- but really robots-- understand language in the same way that people do, and be able to do all the things that a person can do with language. And since a lot of what a person does with language is refer to things in the real world and the environment, ask other people to do things, ask questions, and ask for information that involves taking actions in the environment, that very quickly took me towards robotics, because a robot has sensors that let it observe stuff in the environment. It has actuators to take action in environments. And I got really interested in how we can have language interface with these systems.
So I think it's a super exciting time in robotics. When I was a postdoc at CSAIL, I worked on this project to create an autonomous forklift. So this is a shot of the forklift moving a pallet around, and there's no driver. You can see the LiDAR sensors and stuff around that lets it see what's going on in the environment.
And since that time, there's been tremendous progress in the field of self-driving cars and in autonomous vehicles, to the point where we now have unmanned vehicles carrying paying passengers by Waymo in Arizona. So we're seeing a tremendous amount of progress in that space. There's also been a lot of progress in drones and flying vehicles.
So this is a shot of the drone I use to teach a class at Brown called Introduction to Robotics. It costs about $225 in parts, depending on how you count and who you buy them from. And it is able to fly autonomously in indoor environments, and what's cool is all the autonomy is onboard the drone, on the Raspberry Pi implemented in the Python programming language-- which, for teaching, is really, really nice, because it means that we can give you the drone, and you can use whatever laptop base station you already have-- doesn't matter if it's Windows or Linux or Mac.
You don't have to install any software. You can just interface with it with a web browser and an SSH client. So this is a shot, in the pre-Covid days, of the students in my class flying their drones on the flight day when we all flew together in the tennis courts at Brown.
And sort of in parallel to this project, around the same time as we were doing this project, there's a company called Skydio. This was a spin-out of MIT that created an autonomous commercial drone with 13 cameras-- a stereo pair of cameras for each of the six sides of a cube, plus a high resolution color camera-- that can fly autonomously. It can see obstacles all around because of the stereo pair of cameras on each side of the cube.
It can fly through highly cluttered environments with obstacles like trees and forests and stuff like that, and it can follow you as you're mountain biking down a mountain, or skiing down a mountain or something, and keep you in the shot with the high res camera, and avoid all the obstacles with all the other cameras-- completely autonomously with no human pilot-- which was a tremendous advance, but is now a consumer product. You can buy it. They're on version 2.
So we still haven't hit household robots, but I think that is in our future. And there's been a lot of progress in manipulator robots in factories-- so robots automating assembly lines in the factory. And as these robots become more powerful and more autonomous, there's a real question about how are we going to interact with them? How are they going to be in our lives, and how are we going to tell them what to do?
So the standard interface to all of the robots I've mentioned so far-- the industrial manipulators, the forklift, the autonomous Waymo vehicles, the Skydio, is a touch screen interface. So this is a shot of the forklift's interface. It was a tablet interface, and you circled the pallet that you wanted the robot to pick up, and you circled where you wanted it to go, and then the robot would go and do the task. If you get into a Waymo vehicle, there's a touch screen on the back of the seat, and that's where you can see what the robot's seeing, what it's about to do, and also control where it's going and what it's doing.
And that's a good interface for a lot of purposes, but in my group, as I already kind of said, we're really interested in language-based interfaces-- so where you say to the robot, in words, things like, "Put the metal crate on the truck," and the robot has to figure out the mapping between the words "metal crate" and some object that it's detecting with its very high-dimensional sensor data, and then the words "truck," and same thing-- it's got to figure out where the truck is. And then it's got to send electrical signals to its actual actuators. And it might be sending those signals at 30 to 100 Hertz, sending commands to control the actuators to do the thing that the person asked it to do.
And of course, to do this-- put the metal crate on the truck-- that might take five minutes of driving around and sending these commands and having the robot do it. So it's a challenging problem to map between these really low-dimensional words like "crate" and "truck" and these really high-dimensional, high frame rate sensor streams coming in and actions going out to the actuators. It's also challenging because people don't like to stick to a fixed vocabulary and grammar.
So this is a video that demonstrates the data collection paradigm that we use throughout all my work. What we do is we show people on the internet an example of the robot doing an action. So here you can see the forklift lifting up this pallet, driving forward a little bit, and putting it on the trailer. This one happens to be in sim.
And then we say to people on Amazon Mechanical Turk-- we say, can you please generate a natural language expression that you would use to command a person to take this action? And then they say things like, put the tire pallet on the trailer, or they say, place the pallet of tires on the left side of the trailer, or they say, lift tire pallet; move to an occupied location on truck; lower tire pallet; reverse to starting location; lower forks; end. So here they made this programming language kind of thing.
So what we see when we do this is that there's a tremendous amount of diversity in the natural language expressions that people use. So when we do this task-- just for that short little video clip, we see a really wide variety of different commands. And we've collected 50 or 100 commands for this clip, and we don't see any repeats. They're all unique. And very few of them, or none of them, are wrong. They're just capturing different things. They give different levels of specificity. They're using different words. They're giving more or less detail about what they want the robot to do.
So the challenge is we want to be robust to this sort of variation, and we want to understand and do the right thing no matter what words they choose, and no matter what things they specify and what things they don't specify when giving a natural language command. And of course, we don't just want to put a metal crate on the truck. We also might want to give very fine-grained instructions. You might want to tell it, move backwards 10 feet. Or we might want to give a very abstract instruction, like, move everything from receiving to storage, where the robot goes and operates for an hour or two hours on its own, following one short natural language command.
And then we also want to be able to recover from failures. So if the person says, put the metal crate on the truck and there's two trucks in the warehouse, we want the robot to be able to say, which truck? And then the person might say, the one on the left, to figure it out.
So fundamentally, the challenge that my research group is confronting is that people want to talk to robots about everything they can see and everything they can do-- everything. So to do that, we need to make a model of everything that the robot can see and everything the robot can do. And my group is trying to make this model of human-robot collaboration by learning decision theoretic models for communication, action, and perception.
So communication because we really want to understand language. We want the robot and the human to be able to communicate. Action because we want to be able to communicate about everything the robot can do. And perception because we want to be able to communicate about everything the robot can see.
The learning is going to allow us to scale to very large problems by applying large data sets and modern machine learning techniques, neural networks, to learn. The decision theoretic bit is going to allow us to detect and recover from failure-- so not if things go wrong, but when things go wrong, by having a model of what we think is supposed to happen and where we think we are right now and how we can get to other places in the state space, that's going to allow us to detect and then recover from failures to accomplish the task. And then overall, we're going to be able to improve from experience, because we're able to learn, and then detect and recover from failure.
So the technical approach that we take in my lab is that of POMDPs-- Partially Observable Markov Decision Processes. So a POMDP is-- you can represent it as a graphical model where there's a state sequence, and you go from S sub t minus 1, to S sub t, to S sub t plus 1, and the state at time t depends only on the state at time t minus 1, not on any of the previous history. So that's the Markov bit.
At each state we get an observation. So the gray shaded bit means we observe the observation. So at each state we're going to get to read our sensors, and we're going to get to see some stuff from our sensors about the world. And then at each state, we can also pick an action.
And the action at the last time step controls, or has an effect on, where I am at the current time step. So if I decided I was going to drive forward at the last time step, then at the next time step, that's going to mean I've moved forward. My x-coordinate has increased, or however I'm representing that. So my action affects the next state.
And then last of all, I need a notion of what to do, what the goals are. So the most general way to do that is to have a reward function. So at each state, I get a reward-- just like a number. And the goal in solving one of these POMDPs is to find a policy, which is a mapping of states to actions, that maximizes my expected discounted reward.
So at each state, if I have a policy, it tells me what action I take. And under that policy, I will earn a certain amount of reward, and my goal is to find the policy that maximizes my reward. So POMDPs have a bad reputation, because they're really hard. But I don't think that should deter us, because robotics is also really hard.
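To make those pieces concrete-- states, actions, observations, reward, and the belief the robot carries because it can't see the state directly-- here is a minimal sketch of a POMDP belief update. The two-state problem and all the transition, observation, and reward numbers are invented for illustration; they are not a model from the talk.

```python
# Toy 2-state, 2-action, 2-observation POMDP with made-up numbers.
T = [[[0.9, 0.1], [0.1, 0.9]],   # T[a][s][s2]: P(s2 | s, a); a=0 "stay"
     [[0.1, 0.9], [0.9, 0.1]]]   # a=1 "move"
Z = [[[0.8, 0.2], [0.2, 0.8]],   # Z[a][s2][o]: P(o | s2, a)
     [[0.8, 0.2], [0.2, 0.8]]]
R = [[1.0, 0.0], [0.0, 1.0]]     # R[s][a]: reward for taking action a in state s

def belief_update(b, a, o):
    """Bayes filter: b2(s2) is proportional to Z[a][s2][o] * sum_s T[a][s][s2] * b(s)."""
    bp = [Z[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(len(b)))
          for s2 in range(len(b))]
    total = sum(bp)
    return [p / total for p in bp]

def expected_reward(b, a):
    """One-step expected reward of action a under belief b."""
    return sum(b[s] * R[s][a] for s in range(len(b)))

# Start maximally uncertain, take the "stay" action, and observe o=0:
# the belief shifts sharply toward state 0.
b = belief_update([0.5, 0.5], a=0, o=0)
```

A policy for a POMDP maps beliefs like `b` to actions so as to maximize the expected discounted sum of these rewards-- which is exactly what makes the general problem hard.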
So we shouldn't be scared just because the model we're using is hard to solve-- hard to find policies for, given the model. And the other thing that is nice about POMDPs is that it's the simplest model I know of that captures everything a robot is, in the sense that it has observations, it has actions, it has a goal, it has states. You really can't leave out any of those things without breaking something about what a robot is-- sensors, actions, goals, and then those actions mean something about the world.
So the game that we play is given this very general framework, we try to think about random variables and conditional independence assumptions that enable efficient learning and inference in this framework. And I'm really happy to use deep learning to do the learning. This is not a-- sometimes people think graphical models are the thing, and deep learning is bad, and other people are like, deep learning is bad. We need to do graphical models. But there's really no conflict between the two. You can do deep learning in a model, but it's just another conditional estimator, and it just happens to be a really good one, and we're happy to use it.
So what my group is trying to do is take this very general POMDP framework and convert it to something I call the Human-Robot Collaborative POMDP. And what we're going to do is kind of make it more and more specific, and then that's going to enable computational benefits when we're interacting with people.
So the first thing we're going to do is factor the state space and separate out the physical state of the world and the human's mental state. And that means that the robot can then take actions that affect both the physical state of the world, but also, it can take communicative actions. It can say things that will affect the human's mental state.
Then once we reason about the human's mental state, we can change our reward to say, well, our reward is to do what the person wants. We might not know what they want, but we'll represent that as part of the human's mental state. Then we'll factor our observation model into physical sensors, like our LiDAR sensors that tell us where the robot is, and then language and gesture that we're observing from the human as they try to communicate with the robot, and that predict things about the human's mental state.
So then we can look at this problem of predicting the human's mental state after observing the language, and that is the problem of language understanding. We can think about the problem of trying to say things-- taking social actions by saying things-- and having those actions affect the human's mental state. So for example, if I ask a question, I can reason that that's going to change the human's mental state, and that might cause them to answer that question in the next time step, which will then let me get the information I need to solve the problem. So that's the problem of language generation, or dialogue in the general case. So I call that communication for collaboration.
Then you can think about the problem of taking actions. And the challenge here-- it goes back to the example I gave before, where you want to understand really abstract things, like, clean up my house, but you also want to understand things like, get me a glass of water, and also really fine-grained things like, tilt the picture frame a little bit to the left. OK, there. Perfect. Where there's a really tight feedback loop between the language and the robot's actions.
So what we need in our action space is structures and representations-- compositional, hierarchical structures-- that allow the robot to reason at all of these levels and to go back and forth between these levels really easily. And in fact, the ability to do that is the power of a language-based interface. Unlike a touch screen or something else, you can really quickly, in language, specify really big abstract actions, but also go down and specify really fine-grained actions at the same time. So that's what I call action for collaboration.
And then finally, you have to think about the model for perceiving the environment-- how the robot's sensors predict the physical state of the world. And on the one hand, you might say, well, this is just the roboticists and the computer vision people have to solve this problem, and we'll just take it once they solve it and put it in our system and go. But it's actually really easy to come up with examples of language where you can see that the language has to be intimately tied with the perceptual system.
So one example I call the speck of dust problem. Robot, pick up the speck of dust on the floor. And you want the robot to then go find the speck of dust and pick it up. And if I say those words, as soon as I say those words, the robot had better be making a speck of dust detector and looking for the speck of dust and picking it up. But before I said it, if the robot was looking for all the specks of dust and running the speck of dust detector and finding them all and putting them in its mental model, clearly that's wrong. Clearly it's going to be computationally overloaded.
So there's clearly this very intimate connection between language and perception. Another example is-- I have this pack of batteries here I'll use as a prop. Take the tape off the pack of batteries. So here's the tape, and I'm taking it off.
So now, with the pack of batteries, we're talking about this tape. You might say, start by peeling the corner of the piece of tape. So I'm using this corner. So this corner of the piece of tape wasn't an object before, but now I just used words to talk about it, and I'd better be able to reason about that corner and localize it and grasp it and do things with it. And so that shows how the language and perception systems need to be intimately tied together.
So that's what I think of as perception for collaboration. So my lab is doing work in all of these areas, and we're trying to think about the pieces to put all of these areas together to make a system that's able to collaborate with people, and to talk to people about what's going on. So I'll start out with one example of a project we did that ties each of these pieces together.
This was at [INAUDIBLE] a few years ago. So the idea is that the robot is fetching objects. So the person says, can I have that ball? And then the robot has a bunch of objects, and it has to figure out, from the person's language and gesture, what object to hand over.
So we do that by maintaining a probability distribution over the objects. At the beginning it's uniform, and then I can do a belief update based on a language model and a gesture model that, for this particular example, is very sure that it's ball number 2, and then you hand it over. But if I move farther away and I rearrange the objects so the balls are right next to each other, and the markers are right next to each other and I do the same kind of thing-- can I have the marker? And I point, and I do an update, what happens is the system isn't sure. It doesn't know whether you want marker 1 or marker 2.
So the question is, what should the robot do in this situation? What we want the robot to do is to ask a question. But we don't want it to always ask a question, because that would be annoying, and it would make things take longer. So what we did is created a POMDP framework-- we call it the FETCH-POMDP-- where we reasoned about the information from language and gesture, as well as question-asking behavior.
And using this framework, the robot can decide its belief state is unsure which marker to hand over, so it better take the time and ask a question. And it figures out which question to ask, as well, as part of its inference process. It says, this one? And then the person says, no, not that one. And after getting the answer, we can do another belief update and be confident that it's marker number 2, that should be handed over, and it goes and does it.
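A rough sketch of that ask-versus-act behavior: here I use a hypothetical fixed confidence threshold and a noiseless yes/no answer model for clarity, whereas the actual FETCH-POMDP chooses whether and what to ask by planning in the POMDP. The object names and probabilities are invented.

```python
def decide(belief, threshold=0.9):
    """Hand the object over if one candidate dominates the belief; otherwise ask."""
    best = max(belief, key=belief.get)
    if belief[best] >= threshold:
        return ("hand_over", best)
    return ("ask", best)      # ask about the most likely candidate

def answer_update(belief, asked, yes):
    """Bayes update on a yes/no answer about `asked` (assumed noiseless here)."""
    if yes:
        post = {o: (1.0 if o == asked else 0.0) for o in belief}
    else:
        post = {o: (0.0 if o == asked else p) for o, p in belief.items()}
    total = sum(post.values())
    return {o: p / total for o, p in post.items()}

# Two markers side by side: the belief is split, so the robot asks.
belief = {"marker_1": 0.48, "marker_2": 0.48, "ball": 0.04}
action, target = decide(belief)
# The person says "no, not that one"; after the update, the robot is confident.
belief = answer_update(belief, target, yes=False)
```

The point of embedding this in a POMDP is that the cost of asking (annoyance, time) trades off against the cost of handing over the wrong object, so the question only gets asked when the belief is genuinely ambiguous.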
So I guess I cut the video, but anyway, there's a video. I don't know if I should-- I'm going to stop screen sharing and add the video, because I think you should see it.
PRESENTER: And we do have a few questions, whenever you'd like.
STEFANIE TELLEX: Great. Yeah, I was going to pause and ask questions too, so why don't we-- why don't you-- can you read off the question while I--
PRESENTER: Sure thing. First one is from Nate Manuel. The discussion of perception versus environment reminds me of intuitive expertise and the idea that human experts see fewer options when they initially see a problem. Do you think that is relevant?
STEFANIE TELLEX: Very much so, yeah. So there's a lot of concerns in POMDP land-- and in [INAUDIBLE] generally-- about finding the optimal solution and proving that you've found the optimal solution. But I think that is wrong-headed, and I think that the idea of pruning the state space and learning what's important and what's not important-- and if that sacrifices optimality, great. I love it, because I'm happy to sacrifice optimality in exchange for a smaller search space and finding a solution.
And of course, I think there's a lot of stuff you can prune that has no cost, that you don't sacrifice optimality. A couple of my students are working on a really cool paper that looks at pruning in this sort of more abstract space based on the start state and the goal, and what we show is that by doing this pruning step, you can dramatically outperform state-of-the-art planning algorithms on standard planning problems, because you can just throw away irrelevant stuff, and it goes way, way, way faster. So I think that's really important.
- Can I have that ball?
STEFANIE TELLEX: There's the video.
- Final answer-- you wanted object 4.
STEFANIE TELLEX: So that's showing--
- Can I have that ball?
STEFANIE TELLEX: When you're close--
- Final answer-- you wanted object 4.
STEFANIE TELLEX: It's unambiguous, so it very quickly figures out which object you want and hands it over.
- Can I have the metal object over there?
STEFANIE TELLEX: And now you're far away.
- This one?
- No, the other spoon.
- Final answer-- you wanted object 6.
STEFANIE TELLEX: And that's where it's-- the same code, the same algorithm is deciding, in this case, it's confused and asking the question. OK, more questions from you.
PRESENTER: Great. The next one is from anonymous. How is the language understanding achieved?
STEFANIE TELLEX: Yeah, so in this particular paper that I'm talking about now, we are using a manually-defined keyword-based language model. So for each of the objects, we wrote down the words that people would use to refer to that object, and then based on that we do a language update. In work that I'll talk about later in this talk, we use deep learning. We use sequence-to-sequence (seq2seq) models very commonly, where it's sort of like a machine translation problem where we're trying to translate from English to "robot-ese"-- a formal language that can be interpreted by the robot as a goal expression.
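The keyword-based model described here can be sketched very simply: score each object by how many of its keywords appear in the utterance, with a little smoothing so unmatched objects keep nonzero probability, and fold that into a Bayesian update. The lexicon entries below are hypothetical examples, not the paper's actual word lists.

```python
# Hypothetical keyword lexicon: words people might use for each object.
LEXICON = {
    "ball_2":   {"ball", "red", "round"},
    "marker_1": {"marker", "pen", "blue"},
    "marker_2": {"marker", "pen", "green"},
}

def language_likelihood(utterance, obj, smoothing=0.1):
    """Count keyword matches between the utterance and the object's lexicon."""
    words = set(utterance.lower().split())
    return len(words & LEXICON[obj]) + smoothing

def language_update(belief, utterance):
    """Bayesian update: posterior proportional to prior * P(utterance | object)."""
    post = {o: p * language_likelihood(utterance, o) for o, p in belief.items()}
    total = sum(post.values())
    return {o: p / total for o, p in post.items()}

uniform = {o: 1 / 3 for o in LEXICON}
belief = language_update(uniform, "can I have that red ball")
```

The seq2seq work replaces this hand-built likelihood with a learned translation from English into a formal goal expression, but the surrounding Bayesian machinery plays the same role.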
And in a paper we just published this summer at RSS, we're able to learn from pairs of English and trajectories. So instead of needing a formal expression, we can just look at a trajectory that the robot actually followed in the world, and we can infer the formal expression over many examples. I'm super excited about that work, because it doesn't require a lot of annotations of the formal language. Cool. So I got the questions pulled up now. So is this different than cooperative inverse reinforcement learning?
I know what inverse reinforcement learning is. I don't know what cooperative inverse reinforcement learning is. But inverse reinforcement learning I find very interesting, because it's specifically trying to learn the reward that the agent is supposed to follow. And one of the things we've angsted about in our group is, does language go to a reward expression, or does it go to an action?
So a lot of the early work in robot language understanding would map language to an action, or a sequence of actions. So for example, go forward. Pick up the cup. Turn left. Move forward. Put the cup down on the table, or something like that, where you just execute the sequence of actions. But actually, a reward, or a goal-based meaning, we think, is much better, because if I go to pick up the cup, and all I have is a sequence of actions and it fails, I don't have any recourse. I just go on to the next thing. But if I know that my goal is to get the cup on the table, then if things go off the rails anywhere along the way, I have the ability to replan and find a new way to achieve that goal.
So I think IRL is very interesting for that reason, because it's also thinking of things in a goal-based way. The bad thing about IRL is that a lot of the traditional IRL settings use a very low-level numerical representation for reward. And one of the directions we have been going is in goal-based reward with predicates-- so the idea that you have a reward function that's defined as a predicate on states.
And the predicate is true in a bunch of states, and false in a bunch of other states, and you're supposed to try to get there. So yeah, if somebody wants to put in the chat, or something, what cooperative IRL is-- I don't know if that's possible. [INAUDIBLE]. Yeah, I was trying to pull up the chat. Let's see if I [INAUDIBLE]. OK, so if somebody wants to put that in chat, I will try to answer what cooperative IRL is.
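The predicate-as-reward idea sketched above can be made concrete with a toy example: the goal is a predicate on states, the reward is 1 wherever the predicate holds, and a planner searches for any action sequence that makes it true-- which is why the robot can replan when a step fails. The states and actions here are invented for illustration.

```python
from collections import deque

def cup_on_table(state):
    """Goal predicate: true in every state where the cup is on the table."""
    return state["cup"] == "table"

def reward(state, goal):
    """Goal-based reward: 1 in goal states, 0 elsewhere."""
    return 1.0 if goal(state) else 0.0

# Toy actions: each maps a state to a successor state.
ACTIONS = {
    "pick_up": lambda s: {**s, "cup": "gripper"} if s["cup"] != "gripper" else s,
    "place":   lambda s: {**s, "cup": "table"} if s["cup"] == "gripper" else s,
}

def plan(state, goal):
    """Breadth-first search over action sequences until the goal predicate holds."""
    queue = deque([(state, [])])
    seen = set()
    while queue:
        s, path = queue.popleft()
        if goal(s):
            return path
        key = tuple(sorted(s.items()))
        if key in seen:
            continue
        seen.add(key)
        for name, act in ACTIONS.items():
            queue.append((act(s), path + [name]))
    return None

start = {"cup": "floor"}
path = plan(start, cup_on_table)   # pick up the cup, then place it
```

If "pick_up" failed and left the cup on the floor, calling `plan` again from the observed state would just produce a fresh sequence-- the recourse a fixed action sequence doesn't give you.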
How do you define reward function, quantify what the person wants? Is it an episodic task during train the model? If so, how do we define it at the end of the episode? Yeah, so this is another thing about reinforcement learning that I think a lot-- not all, by any stretch, but a lot of RL gets wrong. And what they do is they make what I like to call the Groundhog Day assumptions. So I don't know if you've ever seen this movie. It's like an American cult classic movie called Groundhog Day starring Bill Murray. If you haven't seen it, it's really funny, really good.
And what happens in the movie-- the premise of the movie is that he wakes up, and it's February 2, and he lives through the day of February 2, and then he goes to sleep, and he wakes up again, and it's February 2 again. And the same video is playing, and the same song on the radio is playing, and all of the people in the world around him are living the exact same day. Every day at 3:00 o'clock, a little kid's climbing a tree and falls out of the tree. It's the exact same thing. But he gets to change his actions. He gets to change the decisions he makes as he's living through this day over and over again. And this is the setting, this RL setting, where you get to start from the beginning and relive the same day, and the only thing you change is your actions-- until, eventually, he lives the perfect day and moves on.
So this is a setting that a lot of RL algorithms assume-- that you get to reset to the same initial state that's exactly the same, and that you get to do it over and over and over, until you've learned something about how the world works. But really, in real life, of course, every day's a new day. It's more like a transfer learning kind of scenario, where every day I wake up, and hopefully physics still works and the sun is rising and gravity is happening, but every day's a little different. I've learned stuff, and I've changed things, and the weather's different, and all kinds of things are different.
And I can still predict things. Like, I know the light switch worked yesterday. It's probably going to work today, but it might not. And I have to be able to take actions anyway. And the thing about language learning is that I don't want my robot to have to-- if I give it a natural language command, I don't want to have to have it go and try 100 episodes before it follows the command. I want it to follow the command correctly the first time.
So that is how it differs from traditional episodic reinforcement learning, where I'm trying to find the one policy that follows the reward. You can cast it in that framework by saying the reward makes the person happy, but what the person wants is very complicated. Or you can cast it as a transfer learning problem, but that's the sort of way I think of it.
Is the POMDP framework I propose a type of cooperative reinforcement learning? [INAUDIBLE] you guys are using this word, "cooperative." So I don't know what that is either. I could imagine what it is, but I'm guessing it's like two agents working together to learn? I think you could think of it that way.
So one of the questions that we've been considering is-- the human gets to take actions too. The human has a mental state. The human's mental state is changing. And in general, how do you want to model that? We think it's part of the state space, but you could imagine a recursive model where you put the human's state-action space inside of the robot's transition function, and then try to predict what the human is doing based on that.
There's a paper on interactive POMDPs-- I-POMDPs-- that does this. They really write out the recursive math of that-- and it's horrible. It's really hairy math, because it's all recursive. And then there's the really beautiful work that Noah Goodman and-- oh, I forgot the other two people's names, but anyway, the probabilistic programming work that can really elegantly represent that type of recursion.
So I kind of think that's the right way to handle it. But the other thing people do is just give the other agent a simpler model-- a greedy model or something-- even though we know the human is very complicated, and use that in reasoning about what the robot should do.
A general question-- is it possible to incorporate body language, gestures, or simply point to the object to improve user experience? Yes. And in fact, you saw that in this video. I'll show you-- I'll show it-- whoa. Slide's moving forward. I'll go back.
- Can I have the metal object over there?
STEFANIE TELLEX: So it's using here.
- This one?
- No, the other spoon.
- Final answer. You wanted object 6.
STEFANIE TELLEX: So here, it's using language and gesture to figure out what object the person wants. Gesture, in general, is really complicated and confusing. But what we did is just say, well, a pointing gesture is one really important kind of gesture. And we'll make a model that says the human is likely to point at the thing they want, with some noise.
And we shoot a Gaussian arrow out from-- we tried, at the beginning, shooting an arrow down the arm into the world, because you can't see the finger with a gesture tracker. And that was OK. But we realized, actually, the much better vector is from your eyes to your hand. We shoot that vector out into the world. And that seems to be the one that people are using to decode these pointing gestures. That worked much better.
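The eye-to-hand model just described can be sketched as a Gaussian over angular deviation: shoot a ray from the eye through the hand, and score each candidate object by how far, in angle, it sits off that ray. The coordinates and the noise parameter sigma below are made-up values, not the paper's.

```python
import math

def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def _angle(u, v):
    """Angle in radians between two 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def pointing_likelihood(eye, hand, obj, sigma=0.2):
    """Gaussian over the angle between the eye-to-hand ray and the eye-to-object direction."""
    theta = _angle(_sub(hand, eye), _sub(obj, eye))
    return math.exp(-theta ** 2 / (2 * sigma ** 2))

eye, hand = (0.0, 1.6, 0.0), (0.4, 1.2, 0.4)
on_ray = (1.2, 0.4, 1.2)    # farther along the same eye-to-hand ray
off_ray = (-1.0, 1.6, 0.5)  # off to the side
```

An object lying on the ray gets a likelihood near 1, and objects off to the side fall off smoothly-- the "some noise" in the pointing model.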
You have a confounding factor in the command. Yeah, so I see somebody else said it-- it's an additional source of information. That's exactly what it is. So in particular, this model is assuming that language and gesture are conditionally independent given the object that you want to refer to. So given that I know that I want to refer to this marker, I'm likely to point at the marker. And also, independent of that, I'm going to use the word marker. And then the model just does the Bayesian update from that.
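That conditional-independence assumption is what makes the fusion so simple: the posterior over objects is just the prior times the language likelihood times the gesture likelihood, renormalized. The likelihood numbers below are invented for illustration.

```python
# P(object | language, gesture) proportional to
#   P(object) * P(language | object) * P(gesture | object),
# assuming language and gesture are conditionally independent given the object.
def fuse(prior, lang_lik, gest_lik):
    post = {o: prior[o] * lang_lik[o] * gest_lik[o] for o in prior}
    total = sum(post.values())
    return {o: p / total for o, p in post.items()}

prior = {"marker_1": 1 / 3, "marker_2": 1 / 3, "ball": 1 / 3}
lang_lik = {"marker_1": 0.8, "marker_2": 0.8, "ball": 0.05}  # "the marker": ambiguous
gest_lik = {"marker_1": 0.2, "marker_2": 0.7, "ball": 0.1}   # pointing nearer marker 2
posterior = fuse(prior, lang_lik, gest_lik)
```

Here the word "marker" alone can't separate the two markers, but the pointing gesture breaks the tie-- which is exactly why the extra modality helps.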
Prosody, no. I want to, but no. I really, really, really want prosody to happen. I did try a couple times to get students to do it. We made an incremental parser. It wasn't quite prosody. Because prosody, if you guys haven't heard of that yet, is the intonation of words. For example, I can say something like, hungry? And in English, you kind of go up [RAISES VOICE PITCH] to indicate that you're asking a question.
Or I could say, hungry, in a commanding kind of way, like feed me, saying I am hungry, give me food, or something like that. So there's definitely information in the prosody. The closest I ever got was trying to do incremental parsing. So the idea was to try to parse and understand the language as we were getting it from the speech recognizer word by word.
Because what I think happens, is when a person is talking to another person, you're watching, and you're trying to tell them to pick up an object. You're like, pick up the mug. And then they look back and forth. And then you add, the one on the left, that one. You're adding these additional pieces of information.
And as you've seen, there's this body language. So what I wanted to do was make the robot-- when you said pick up the mug, I wanted it to parse, and then figure out that there are two mugs, and then do something non-verbal, like looking back and forth at the mugs or the objects it thought you might want, hoping that would elicit extra information if you needed it, without even having to ask a question.
But no, we haven't really done any of that for real. I think those would be great projects to do, although they're tricky, because you have to break the abstraction barrier of speech recognition. You have to dig down and get partial results from speech recognition. Is this wishful thinking, or have you achieved detailed actions like aligning the picture frame? It's wishful thinking.
We have done-- I will talk about the-- we have done fine-grained stuff. We can do things like, go north, go south. And then in the same model, we can do, go to the red room. And we can do, move the block to the green room. So we can do levels of abstraction. But we really haven't done the really fine-grained stuff, aligning the picture frames, yet.
It is a confounding factor if you're trying to address verbals response and language in robotics. Yeah. I guess if you're trying to separate-- I mean, this particular factor, we were trying to incorporate language and gesture. It's really hard to stop people from using gesture. And actually, this particular paper was one of the first that used language and gesture together in one paper, in one system.
Most of the previous work in language and robotics had either only done language or only done gestures, or had not run a large [INAUDIBLE] study, where they actually tried it on people. Most of my work that I'll talk about next only has language. This particular model, we really simplified it so that we could bring in gesture. But when we look at more complicated language, we have our hands full with the language, so we ignore gesture.
OK. I think I answered that one. So I'll go back to my talk now. But please, feel free to ask more questions. OK. So we did a user study for this question-asking system. We had 60 people come in, and we compared three different policies: never ask a question, always ask a question with a fixed policy, and intelligently ask a question. And we found that it works best to intelligently ask a question.
This isn't a surprising cognitive science result, but it's cool to show that this autonomous system is able to realize a benefit. We were 25% faster, and we were more accurate than the baselines. We thought maybe always asking would be more accurate, because it would help to get more information. But it turns out that asking too many questions confuses people. They expect you to understand if they're really close and they're pointing. They're like, why are you asking a question?
And it creates opportunities for mistakes. So if they don't answer yes or no, and we don't understand the answer, then that can confound the system and push it in the wrong direction. So that's why it turns out that always asking a question isn't great. And also, really interestingly, 38% of our users, in a post-user survey, thought this system could understand prepositional phrases, like "to the left of," which it couldn't. We just used keywords like bowl and stuff.
And we think that goes back-- to the person asking about gesture, we think that the gesture-language integration in the system was so smooth and intuitive that many of our users just figured, oh, it understood me. And it was probably because the gesture was disambiguating. But it felt so natural that they thought it could understand spatial prepositions when it couldn't.
OK. So the next thing I'll talk about is trying to understand these more fine-grained commands. So we haven't done the put the picture frame, that level of feedback. But we have created frameworks that can do coarse-grain, more abstract things and more fine-grained things. And for a lot of this work, we moved to a symbolic domain, a domain where you are moving around in a grid world, and there's rooms with different colors, and there's objects, and you can push the objects around.
And the idea is, we'd like to understand, go down five, right five, one up, right four, down one, left one, and up three. And also, we'd like to understand, take the red chair to the blue room. We'd like to do both. We don't want to say one is better than the other. Sometimes you want to say abstract things, and sometimes we want to say fine-grained things.
If you only have low-level instructions, you can't operate in really big, complicated environments. So this work was done, in part, for NASA. So we were thinking about the International Space Station. And if you could only do high-level instructions, then you lose granularity. You can't say, move south, go forward a little bit and stop. So we were building on prior work, where it was predicting the reward function from the natural language command.
And in this paper, instead of using a standard machine translation model, we switched to neural networks. And we predicted both the reward function as well as the level of abstraction in an action hierarchy. So we incorporated a hierarchical representation for actions, and we were predicting both the level of abstraction and the goal within that level of abstraction. And, coming back to our POMDP, we can say our action representation switched to a hierarchical representation.
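As a toy illustration of the output structure being predicted-- a level of abstraction plus a goal within that level-- here is a keyword-based stand-in in Python. The real system is a learned neural model; this lookup table and the `in_room(...)` goal syntax are hypothetical.

```python
def interpret(command):
    """Toy stand-in for the learned model: maps a command to a
    (level, goal) pair in a hierarchical action space.
    Level 0 = primitive actions; level 1 = room-level goals."""
    low_level = {"go south": (0, "south"), "go north": (0, "north")}
    if command in low_level:
        return low_level[command]
    if command.startswith("go to the"):
        # e.g. "red" from "go to the red room"
        room = command.split()[-2]
        return (1, f"in_room({room})")
    raise ValueError("unhandled command")
```

The point is that "go south" and "go to the red room" land at different levels of the action hierarchy, and the planner then plans at whichever level was predicted.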
- Go to the red room.
STEFANIE TELLEX: Do you guys-- I guess I'll ask later.
- Go south, south, south.
STEFANIE TELLEX: So here, he's giving a high-level command, go to the red room. But he can also say, go south, south, south. And the same system is able to interpret both of those instructions. And then we got interested in understanding what I called non-Markov commands. So the idea was that, instead of just saying, go to the blue room, you want to be able to give constraints, like go through the yellow room to the blue room. So don't go through-- in this case, the borders indicate the colors.
So don't go through the green room. Can you see my mouse? Don't go through the green room. Go through the yellow room. So this is tricky if you want to use a goal representation. Because if your state is only where I am right now, you can't evaluate, once you're in the blue room, if you've gone through the yellow room to get there or through the green room to get there. You need your whole history to evaluate that.
But if you're saving the whole history, the search space-- the planning space-- explodes, and it becomes really hard to solve. So we incorporated linear temporal logic to represent these kinds of non-Markov goals. I'm skipping a little bit. We trained a neural network model to translate between English and LTL expressions. And then what we did is, essentially, compiled the LTL expression into a specification MDP.
And then we combined that with the lower-level environment MDP state space. And what that does is let the planner save just enough history to evaluate the LTL expression, and no more. So if you're supposed to go through the green room, it basically adds a bit to your state that says, have I ever been in the green room, yes or no? And it updates it in the right way, so that once I've been in the green room, it goes to true.
But it doesn't save whether I've been in the orange room. It doesn't save my whole past history of positions. So it makes the state just enough bigger that I can evaluate these expressions. And it works. You can do these kinds of non-Markov constraints. You can say, avoid the yellow room. You can say, go through the yellow room. And it will find policies to do it.
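That "just enough history" idea can be sketched as a product construction: the environment state is augmented with one bit per LTL obligation. Below is a minimal, hypothetical Python sketch for the constraint "go through the yellow room before reaching the blue room"; the function names and room encoding are made up for illustration, and a real system would compile the LTL formula to an automaton automatically.

```python
def product_step(env_state, spec_bit, next_env_state, room_of):
    """Advance the one-bit specification automaton for
    'go through the yellow room before reaching the blue room'.
    spec_bit records: have I ever been in the yellow room?"""
    new_bit = spec_bit or (room_of(next_env_state) == "yellow")
    return next_env_state, new_bit

def accepting(env_state, spec_bit, room_of):
    """The product state satisfies the goal once we are in the
    blue room AND the yellow-room bit has been set."""
    return room_of(env_state) == "blue" and spec_bit
```

A planner then searches over (environment state, spec bit) pairs: two paths ending in the blue room are distinguishable only by the single bit, not by their full position histories.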
I'll skip the video and I'll skip this. The last paper I want to talk about-- we have a bunch of language papers, so that's them, and you can see more on my website-- is the one I already alluded to that was published this summer. So there's a lot of existing work in natural language and robotics where we have parallel data sets of human language and some abstract meaning representation. And typically, these data sets are small-- in the hundreds or thousands-- because some human has to annotate the abstract meaning representation, whether that's LTL or something else. Some human has to go and annotate it.
In contrast, for existing machine translation work, the data sets are parallel data sets of human language pairs, like English and French from the Canadian Parliament corpora. And they use very large data sets, to the tune of millions of pairs. And they work much better than any of our robot models.
And one thing-- I also looked at the history of this project called Geoquery, which is a project of answering questions about geography using a database. So the data set consists of questions like, what states border Texas? And then queries in a formal language-- it's a Prolog-y kind of language-- that will query the database to give you the answer, and then the answers themselves.
And there was this progression of Geoquery papers. So when it was first introduced by Ray Mooney in '96, the training instances consisted of sentences paired with the ground-truth parse. The actual parse tree had to be annotated. Then Luke Zettlemoyer and Mike Collins-- Luke's thesis was about using sentences paired with logical forms. And the advance here is that you didn't need the parse. You only needed sentences paired with logical forms and nothing else.
And then in 2009, Percy Liang did a paper where you don't even need logical forms. You just need the answer as text. You need questions, and you need answers. And they learned from doing the questions and the answers. So inspired by this progression, we sort of thought that robotics was kind of in the Luke-- our robotics work was in the Luke stage. We don't need parses, but we do need logical forms.
And we want to be in the Percy stage, where we only need questions and answers, which you can collect from regular people, and no logical forms, which you need trained annotators to provide. So what we did is collected a data set that consists of natural language expressions and trajectories only. So we would show people the correct trajectory and the incorrect trajectory. And we would say, give us a natural language command that would cause a vehicle-- in this case, we were driving around in urban environments-- that would cause a vehicle to do the correct trajectory and not do the incorrect trajectory.
And they would say things like, go along 3rd Street until the intersection with Main Street, and then walk until you reach the Charles River. And there was a latent logical form. But we never got to observe the logical form during training. And what this lets you do is learn from much larger data sets than we've ever been able to use before, because we only need the trajectories during training. We don't need the logical forms.
And it actually learns to predict the logical forms from only English and trajectories. So to me, this is the ultimate answer about should I predict goals or actions. Actions, the trajectories, what did I actually do, are easy to collect. So I should learn from them.
But what I should learn to do is predict the goals. I should predict the LTL. And that's what this paper was doing. And I want to make sure I have time for more questions, so I won't go into more detail. But I think it's a really exciting paper, because it kind of breaks open that Gordian knot and says, learn from actions, predict goals, and use large data sets, and be happy.
OK. And then we're also thinking about object search, but I think I'll skip all that. But it's about trying to find objects based on natural language expressions. And I like it, because it learns to understand lies. If you lie to it and say it's in the kitchen, but it's not in the kitchen, it'll go in the kitchen and then not find it, and then it does smoothing. And then it's just like, oh, it's not in the kitchen. Then it goes somewhere else and looks everywhere else until it finds it.
So we kind of started with this really simple, sparse POMDP representation. And we gradually built it up. We've added-- I didn't really talk too much about objects, but we added objects. We added hierarchical options. We added communicative actions. And we have this really complicated world.
One of the best things about my lab is that we get to do-- I like to do really off-the-wall projects, too, sometimes. So one of my favorite off-the-wall projects was a student who thought a robot should learn to write. So he was interested in robot writing and [INAUDIBLE] created a system that takes text as input and is able to produce a policy for writing that text, as you see in the bottom half. So it's "Hello" in a number of different languages.
And it's been trained only on Japanese katakana annotations. So it's annotated with the actual stroke trajectories of the Japanese characters at training time. But at test time, it only sees pixels. And it has to produce a trajectory to reproduce the writing.
OK. I'll stop there, and I'll take more questions. Let me pull up the questions. Can the robot have a working memory, i.e., encoding information into a short-term store, manipulating it, reordering it, and selectively using that information to guide action, and selectively forgetting information? I love that idea. Yeah. I think that for the perception stuff that we're thinking about, a lot of what we want to do is have some kind of memory that pulls things in-- I'm looking for specks of dust now, so I should pull in my speck-of-dust detector, and then I should pull it back out.
And actually, one of the things I think hierarchical abstraction could do if it's done right, is give you the ability to have that. Because once you're down one level of abstraction, that has a certain set of stuff that's relevant and a certain set of other stuff that's irrelevant. And I'm hoping that that structure would let it decide what can get pulled into working memory and what not.
Wouldn't the number of trajectories have a combinatorial blow-up in the state space? Does it slow down learning? So in the work that we were doing with Roma about learning from trajectories only, we generated the trajectories, and we chose how many to show people. If you show more, it kind of makes the learning easier. Basically, what you're doing during the learning is generating LTL expressions and then checking if they correctly accept and reject the trajectories in the training set.
So if there are more trajectories, then there's more supervision-- more data that you can get about the LTL expressions. But it's probably harder for a person to come up with a sentence that correctly covers them all. So we generally did two or three in our data collection. And it seemed to work pretty well.
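The accept/reject check described above can be sketched in Python for a tiny LTL fragment over finite trajectories of room labels. This is a hypothetical illustration, not the paper's actual model: real LTL semantics and the neural proposal model are much richer, and the operator encoding here is made up.

```python
def holds(formula, trajectory):
    """Evaluate a tiny LTL-like fragment on a finite trajectory of labels.
    ('F', p)      -> eventually visit p
    ('G!', p)     -> never visit p (globally not p)
    ('AND', a, b) -> both subformulas hold"""
    op = formula[0]
    if op == "F":
        return formula[1] in trajectory
    if op == "G!":
        return formula[1] not in trajectory
    if op == "AND":
        return holds(formula[1], trajectory) and holds(formula[2], trajectory)
    raise ValueError("unknown operator")

def consistent(formula, positives, negatives):
    """A candidate logical form is supervised only by trajectories:
    it must accept every positive and reject every negative one."""
    return (all(holds(formula, t) for t in positives) and
            all(not holds(formula, t) for t in negatives))
```

During learning, candidate formulas proposed for a sentence are scored by this kind of check against the shown correct and incorrect trajectories; more trajectories per sentence prune the candidate space faster.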
On being able to use bigger language-only data to boost language for robotics: do you think you could use the knowledge in massively pretrained encoders like BERT in some way? Yeah. Maybe. It's something we've-- [INAUDIBLE] wrote a paper at RSS this year about doing something like this. So anyway, he used data from larger data sets that were not grounded.
And the idea was that people say things like, I need to cut some onions. Get me a knife. And ideally, the robot has a knife detector and can go get the knife and give it to you. But for lots of-- in the general case, maybe it doesn't have a knife detector.
So the question was, could we learn to predict, from the word cut, what object the robot should hand over. So it had to-- sorry, my son's got to switch tasks. But I think he's doing it on his own. So I can keep going a little more. So it has to predict-- given an image of a bunch of objects and a verb, like cut or pour or things like that-- which object you're talking about.
And we used a data set of images. And we collected, actually, information about the verbs that they were using. But the thing about the pretrained models is they're typically not connected to sensor information. And that's what we're trying to do. So it's not totally obvious how to use them, but I suspect that's more a failure of imagination on our part than the fact that-- We probably should use them, but I haven't figured out how. OK.
BORIS KATZ: I think we have time for that one last question if you'd like.
STEFANIE TELLEX: Sure. Can I make the robot selectively encode information into long term memory and carry out a command like, can you fetch me the book I bought yesterday? Yeah. I love that. So there's a cool paper by Rohan Paul and Nick Roy-- and I think Andre was on that. I think Andre was on that paper, too-- where they were doing that. And it wasn't as far back as yesterday, but it was stuff like, can you get me the mug that I put down.
And what was cool about it is it saved everything at the low-level. And then it would go back. And when it'd get the command, get the book I bought yesterday, it would go back and find what that was in its long term memory and then use that to figure out what book you're talking about. So that was a paper a couple of years ago. [INAUDIBLE] I believe? Rohan Paul was the author.
BORIS KATZ: Yeah. That was with our group. And it was [INAUDIBLE].
STEFANIE TELLEX: Yeah. What happens when there's a case of conflicting input to the robot, like asking it to pick up an object but pointing in the wrong direction? I know I'm not supposed to answer it, but I will. So then the robot would get confused. And that would be a common thing.
So the way that our pointing model worked is, the longer you pointed at something, the more and more sure it would get, because it would update at each frame. And then when you stopped pointing, it would slowly decay. The model would just be like-- at every timestep, there's a chance you could switch objects you want, so it would slowly decay back to uniform.
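That decay-to-uniform behavior can be sketched as a per-frame Bayesian filter that mixes the belief with the uniform distribution before each observation. This is a hypothetical simplification of the model described; the `switch_prob` value and likelihood inputs are made up for illustration.

```python
def update_belief(belief, likelihoods, switch_prob=0.05):
    """Per-frame belief update over candidate objects.
    First mix with uniform (at every timestep there is a small chance
    the person switched targets), then apply the frame's observation
    likelihoods and renormalize."""
    n = len(belief)
    mixed = [(1 - switch_prob) * b + switch_prob / n for b in belief]
    unnorm = [m * l for m, l in zip(mixed, likelihoods)]
    z = sum(unnorm)
    return [p / z for p in unnorm]
```

While the person keeps pointing, the per-frame likelihoods repeatedly favor one object and the belief sharpens; once pointing stops, the likelihoods become uninformative and the mixing step alone pulls the belief back toward uniform, frame by frame.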
I'll be right there, Jay. My son needs me so-- yeah, yeah, yeah. So it would slowly decay back to uniform. And depending on the details of when the language updates come in and when the pointing is and how wrong it is, it would either be more or less confused. And if it was more confused, it would decide to ask a question. OK. I'll stop there. Thank you so much for having me.