Towards complex language in partially observed environments

Date Posted: September 1, 2022
Date Recorded: August 21, 2022
Speaker(s): Stefanie Tellex
Description: Stefanie Tellex, Brown University. Part of the Brains, Minds and Machines Summer Course 2022.

BORIS: Our speaker is Stefanie Tellex. I started working with Stefanie, well, a few years ago when she was an undergrad at MIT. She, Stefanie, got all her degrees from MIT, then she was a post-doc there as well. And then she became a professor at Brown University, where she runs a lab that deals with robots that collaborate with people. Stefanie, take it away.
STEFANIE TELLEX: Awesome, thank you so much, Boris. So, thanks very much for having me. It's always fun to present to this summer school. And I apologize for not being there in person. I was going to drive in, but then I slacked off at the last minute. So, please don't hesitate to stop me if you have questions, or you want to get at anything in more detail.
So, as Boris said, I'm a professor at Brown University, and I run a lab called the Humans to Robots Lab, and I think a lot about robots and human-robot collaboration. It's a super exciting time to be a roboticist. We're seeing advances in robots and robotics that are just mind-blowing.
And, at Brown, we just got these Spot robots from Boston Dynamics. So these are quadruped robots with a gripper. They can manipulate stuff. And, out of the box, they are just amazing. They can open doors. They can pick things up. They can dance. They can just do so much stuff.
And when you see them-- like, you knock them a little bit, and they recover, or they do this dance move, or they're jumping in the air, and rearing back on their hind legs, and stuff like that, it's just completely spectacular to see the level of autonomy and capability that these robots have. They have obstacle avoidance, built-in stereo cameras all around to see and avoid obstacles, just so much stuff. This is a video of one of my students, Ifra, dancing with the robot.
[VIDEO PLAYBACK]
[MUSIC - JEWEL, "YOU WERE MEANT FOR ME"]
JEWEL: (SINGING) I hear the clock. It's 6:00 AM.
[END PLAYBACK]
STEFANIE TELLEX: But, yeah, it's a robot dancing, and it's really amazing. And what was really cool about the interface for that is they donated it to us; you have to pay extra on the Spot to get it. Like, they built it, and they charge extra money for it after you buy the Spot.
And it's basically like a video editing interface, where you have these tracks, and then you pull down the different moves. And it's quite easy to use, so my nine-year-old niece came down and played with the robot, and she choreographed a dance to a song from that new Weezer album. And in about 30 or 45 minutes, she had the robot dancing and stuff.
So these interfaces are quite powerful. This is a picture of the controller that comes with the robot. And they really did an amazing job with how you control the robot. So, there's this very sophisticated software that, I think, strikes a really nice balance between powerful--
OK, so this is the interface that comes with the robot. It's like a game controller kind of thing. And they did a really nice job of making it powerful in that you can do everything that the robot can do. You can open doors with it. You can pick things up. They've got a little choreographic interface on this. They also have a desktop version. You can teleop the robot around. So, it's really, I think, a model of its kind. It's one of the best robots we've-- it is the best robot we've ever had in our lab.
This is the desktop interface that comes, and you can see in the upper left a bunch of different actions it can do, like crawl and go to, and pace, and trot, and stuff, running man, all these different actions. And then, at the bottom, there's this timeline with the audio, the song you're doing, and then all the different actions to make that video that you saw part of, anyway. So, it's really a cool interface.
So, what my lab is trying to do is we're saying that, look, there's these really cool, amazing, powerful interfaces for making robots do stuff, and we want them because they're good in certain use cases, in many different use cases. But we also really want to be able to talk to our robots. We want to say things in words of what they should be doing, what we want them to do. And then, based on the words that we say to them, we'd like them to say other words back to us, and we'd like them to do things in the environment.
So, in this example, one of the core problems we've been thinking about is object fetching, object delivery, so find the mugs in the library, the living room, or the kitchen. So, making the robot navigate through an environment that it doesn't know about in response to language, incorporating information from language, and integrating that information with its perceptual system in order to find objects.
And we've been particularly focused on object finding, object search, and object delivery because object search is fundamental to anything else you want to do on a robot. Like, if you're delivering something, you've got to find the object to pick it up to give it to the person. If you're opening something or actuating a door or doing something like that, well, you've got to find the doorknob first before you can grab it and open it. No matter what you're doing on the robot, you need to find objects as the first step. So, we thought that's the gateway to autonomous behavior.
And, fundamentally, what we're trying to do is solve the problem that people want to talk to robots about everything they see and everything they do. So we're trying to make models of everything the robot can see and everything the robot can do. So, what this talk is about is trying to do that by making models for communication, action, and perception to enable human-robot collaboration in partially observed environments, especially through language. And, in fact, I think language is really important to enable collaborate-- not just interaction between humans and robots, but collaboration between them.
So, the approach that we're taking is based around POMDPs, Partially Observable Markov Decision Processes. So POMDP models are a way of thinking about how a robot operates in an uncertain environment, where you have a state of the world that's unknown to you but evolving through time. And you know the dynamics. You know how it's going to evolve, or you know a distribution over how it's going to evolve.
And then, based on that state, you get observations that tell you something about the environment. And you get to see the observations. You get to-- so based on where you are in the world, you can expect certain observations from your camera and from your range sensor, for example. And then you get to take actions. So, you can decide to move forwards or backwards in the environment.
And, based on those actions, you expect the state to change, so the actions change. The current state and the action you took predicts the next state that you're going to go in. So POMDPs are notorious for being really hard to solve, but they're also the simplest model I'm aware of that captures everything that a robot is because it's capturing the actions to take in the world and the observations, unknown things in the environment.
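To make the ingredients just listed concrete (a hidden state, dynamics, noisy observations, and actions), here is a minimal Python sketch of a toy POMDP. Everything in it is an assumption made for illustration, not something from the talk: the one-dimensional corridor, the class name `CorridorPOMDP`, and the probabilities and reward values.

```python
import random
from dataclasses import dataclass


@dataclass
class CorridorPOMDP:
    """Toy POMDP: a robot in a 1-D corridor looking for a hidden target cell."""
    n_cells: int = 5
    slip_prob: float = 0.1      # assumed chance that a move fails
    detect_prob: float = 0.8    # assumed chance the sensor fires at the target

    def transition(self, robot, action, rng):
        """Sample the next robot cell given the action (-1 = back, +1 = forward)."""
        if rng.random() < self.slip_prob:
            return robot                      # the move failed; stay put
        return min(max(robot + action, 0), self.n_cells - 1)

    def observe(self, robot, target, rng):
        """Sample a noisy detection: True means 'I see the target here'."""
        if robot == target:
            return rng.random() < self.detect_prob
        return False

    def reward(self, robot, target):
        """Small step cost, larger bonus for standing on the target."""
        return 10.0 if robot == target else -1.0


# Roll the model forward for a few steps. The target cell is hidden from
# the robot in a real problem; it is only known to this simulator.
rng = random.Random(0)
model = CorridorPOMDP()
robot, target = 0, 3
for action in (+1, +1, +1):
    robot = model.transition(robot, action, rng)
    seen = model.observe(robot, target, rng)
```

Solving the POMDP means choosing actions using only the stream of `seen` values, never `target` itself, which is where the difficulty comes from.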
So, what we started to do in my lab is take the strategy of trying to make everything be a POMDP. And we organized this around object search because, like I said, object search is this fundamental problem for making robots do stuff in the world. So, we started out by thinking about object search as a two-dimensional grid search, a gridworld problem. So, you might say to the robot something like, find the mugs in the library, the living room, or the kitchen.
And we assume that the robot ran SLAM-- Simultaneous Localization And Mapping, which is a pretty well understood problem in robotics. And it had made a map of the environment that was a two-dimensional grid map, the most common SLAM approach. And after having done that, it had its map, and it could localize itself in the map. And, again, robots are able to do that very well.
But the location of the mugs wasn't known in the environment, so we were going to give the robot a detector that, basically, if the mugs are nearby and in the field of view, like a camera, like the field of view of the camera, then it will know, but it's going to be a noisy detector because our detectors-- our real detectors, of course, aren't perfect.
And the robot's job is to take this detector and its model of the environment and give it a motion model-- it's a two-dimensional wheeled robot driving around in the world-- and reason about where to go and where to point its camera in order to find objects in the environment as quickly as possible.
And it's getting information from language about where those objects might be. So, in this case, it's being told that they're in the library, the living room, or the kitchen, but not in the other rooms. And, based on that, in the map, in blue, you can see its belief distribution about where it thinks the mugs are, where the lighter colors are more likely, the darker colors are less likely. So, it's lighter colored in those three rooms based on the language update.
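As a rough illustration of how room keywords could become a belief over grid cells, here is a small Python sketch. The function name `language_prior`, the tiny two-by-three map, and the smoothing value are hypothetical choices for this example; the talk only says that mentioned rooms become more likely than the others.

```python
import numpy as np


def language_prior(room_of_cell, mentioned_rooms, smoothing=0.01):
    """Return a belief over grid cells from room keywords.

    room_of_cell: 2-D array of room-name strings, one per grid cell.
    mentioned_rooms: rooms named in the instruction.
    smoothing: small assumed weight left on cells in unmentioned rooms.
    """
    mentioned = np.isin(room_of_cell, list(mentioned_rooms))
    weights = np.where(mentioned, 1.0, smoothing)
    return weights / weights.sum()


# A toy labeled map: each cell carries the name of the room it belongs to.
rooms = np.array([["library", "library", "hall"],
                  ["kitchen", "hall", "office"]])
belief = language_prior(rooms, {"library", "kitchen"})
```

Cells in the mentioned rooms share most of the probability mass equally, which corresponds to the lighter-colored rooms in the map.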
So then what the robot's going to do is drive around different locations in the environment and then look, run its detector to look, in different places. And then, in the living room, it's going to look in that direction and do this update based on the sensor model and say, oh, I don't see it. So, it's darker blue now to indicate that. And, of course, this is all with the right update, but, intuitively, it's darker blue, so I looked here. I don't see it.
So, it turns, and it looks north, and it gets a detection. And now it thinks it's very likely that the mugs are there, so it finds the object and does an update to say, OK, I found it, and it's unlikely to be anywhere else. So, this is the intuitive behavior that you want the robot to do is systematically drive to the places where the object is likely to be and then look until it finds the object and stuff.
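The "I looked here and don't see it" darkening and the "I found it" sharpening are both instances of a Bayes update with a noisy detector. A minimal sketch, assuming a binary detector and made-up true-positive and false-positive rates (`tp_rate` and `fp_rate` are illustrative values, not numbers from the talk):

```python
import numpy as np


def update_belief(belief, in_view, detected, tp_rate=0.9, fp_rate=0.05):
    """Bayes update of a belief over target cells after one look.

    belief: 1-D array, P(target is in cell i).
    in_view: boolean array, True for cells inside the camera's field of view.
    detected: True if the detector fired on this look.
    tp_rate / fp_rate: assumed detector rates (hypothetical values).
    """
    # P(detector fires | target is in cell i)
    p_fire = np.where(in_view, tp_rate, fp_rate)
    likelihood = p_fire if detected else 1.0 - p_fire
    posterior = belief * likelihood
    return posterior / posterior.sum()


belief = np.full(6, 1 / 6)                        # uniform over six cells
first_view = np.array([True, True, False, False, False, False])
second_view = np.array([False, False, True, False, False, False])

after_miss = update_belief(belief, first_view, detected=False)
after_hit = update_belief(after_miss, second_view, detected=True)
```

After the miss, the two viewed cells lose probability; after the hit, most of the mass moves to the cell that was in view when the detector fired.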
The challenge is that, if you represent the state space as a 2D grid, which is very natural-- a lot of our problems, representations do that. It's sort of the default representation that comes out of SLAM localization and mapping-- the state space of where the objects could be is very large. And naive algorithms for finding policies for finding the object don't work because it takes too long. So, if you could find the optimal policy, it would do what you're seeing, but because it's slow, it takes a long time to find the object, to run the inference to find the object.
So, what we did to address this problem is we built off of a standard POMDP algorithm called POMCP, Partially Observable Markov-- sorry, Partially Observable Monte-Carlo-- well, I forgot the P. Definitely Partially Observable Monte-Carlo. I thought it was tree search, but it's not a P.
Anyway, POMCP is a well-known sampling-based algorithm. And what we did is extended it to an object-based representation, object-oriented POMCP. And what we did is factor the way that the rollouts and the belief updates go so that it assumes that the world is structured according to objects and that objects are conditionally independent of each other, which enables inference to run much more quickly.
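To see why the conditional-independence assumption helps, the following Python sketch compares the size of a joint belief over all object locations with a factored, per-object belief. It is only an illustration of the factoring idea, not the OO-POMCP implementation; the grid size, object names, and the helper `update_object` are assumptions.

```python
import numpy as np

n_cells, n_objects = 100, 4

# A joint belief needs one entry per combination of object locations.
joint_size = n_cells ** n_objects        # 100_000_000 entries
# A factored belief keeps one distribution per object.
factored_size = n_cells * n_objects      # 400 entries

# One independent belief per object (hypothetical object names).
beliefs = {name: np.full(n_cells, 1.0 / n_cells)
           for name in ["mug_1", "mug_2", "book", "phone"]}


def update_object(beliefs, name, likelihood):
    """Update only the named object's belief; the others are untouched."""
    posterior = beliefs[name] * likelihood
    beliefs[name] = posterior / posterior.sum()
```

Because an observation about one object only touches that object's distribution, the cost of a belief update grows linearly with the number of objects instead of exponentially.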
And we can therefore find policies very efficiently to get the sort of behavior that we want to have the robot do. So, we can find optimal policies faster. So, this is showing-- so, this one example. We were able to find this policy quickly with POMCP.
We also evaluated this in simulation but with a bunch of different world model-- different scenarios. So, we were increasing the number of objects and showing that it works better as you add more objects. The number of samples-- essentially, the number of particles and the different sorts of numbers.
So, in all of these graphs that you see the dark blue line is the known policy. So that's essentially where you're cheating by knowing where the object is-- you're searching for objects, but you're knowing where the object is. You just got to drive to the object-- and, essentially, lets you calibrate the reward you're bringing back because we're showing cumulative reward on all of the y-axes.
And then the red one, the red line, is the random policy. So random is showing what happens when you act randomly in the environment, so that's like a lower bound for the reward. And then the magenta line is POMCP. So, that's the previous state-of-the-art algorithm for solving large POMDPs.
And you can see that, in the graph on the left, as you add more objects, the state space gets larger, and POMCP does worse and worse and worse. As you add more and more simulations, as you have more particles, essentially, everything gets better, and POMCP gets better too but still much worse than all the lines in the middle, which are variations of OO-POMCP. So, the first takeaway here is that we're doing much, much better than the previous state-of-the-art algorithm with object-oriented POMCP.
The second thing, takeaway or, actually, my favorite part of this graph is the one on the-- of these three graphs, is the one on the right. So here, in the x-axis, we're varying the amount of information that we're giving the algorithm about where the objects are from language. This is a very simple language model, so we're just using keyword updates at the room level. So, we're assuming that we're getting a bunch of words that say what room the object is in or might be in, so like in the library or in the kitchen. And we're just matching those two words based on labeling the rooms.
So, in the misinformed case, that's where the person is lying to the robot. So we say, it's in the kitchen, but, actually, it's in the living room. In the no-information case, we say nothing. So, we didn't lie, but we also don't really help it. We don't give it any information with language.
In the ambiguous situation, we tell it correct information, but it's ambiguous. So we say, it's in the kitchen or the living room. I don't know which one. And it actually is in the kitchen, but we don't know whether it's in the kitchen or the living room, and went to search in both places. And then last, in the informed case, we say it's in the kitchen, and it actually is in the kitchen. So, on the x-axis, we're increasing the amount of information that we're giving the robot from language.
And you can see that the POMCP and the other algorithms are doing better as we give more information, random-- I'm actually not sure why random is going up. I think random is just random. It is a little weird that it's going up at all.
So everybody's going up as you give more information, but the OO-POMCP variations, like that teal line, the cyan line-- and then the other ones are just different variations on it-- are going up more to the point where, with the misinformed case, we still find the objects, just not very quickly because, first, we have to go through the wrong room and check it out and then decide it's not there.
And then do a smoothing update, essentially, to be like, OK, well, they must have lied to me because when we do the [INAUDIBLE] update, we just have a small smoothing term, and then it has to search all the other rooms, so it always-- And, again, this is not specific logic that we programmed in. This is just the POMDP doing its belief updates as a POMDP should, does, given the model of the world.
And with no information, it finds it more quickly, but not as fast as ambiguous and not as fast as informed. Informed, it's almost as fast as the known case, the fully observed case, so we're almost doing as well as the case where we have cheating levels of information about the object about the world.
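The recovery from a lie can be reproduced with a room-level prior plus a small smoothing term. The sketch below is a hedged illustration: the room list, the smoothing weight, and the miss rate are assumed values, and `prior_from_keywords` and `searched_and_missed` are hypothetical helper names.

```python
import numpy as np

rooms = ["kitchen", "living room", "library", "office"]


def prior_from_keywords(mentioned, smoothing=0.05):
    """Room-level prior: mentioned rooms get weight 1, others a small smoothing weight."""
    w = np.array([1.0 if r in mentioned else smoothing for r in rooms])
    return w / w.sum()


def searched_and_missed(belief, room, miss_rate=0.1):
    """Bayes update after searching `room` and not detecting the object."""
    like = np.array([miss_rate if r == room else 1.0 for r in rooms])
    post = belief * like
    return post / post.sum()


# Misinformed case: the speaker says "kitchen" but the object is elsewhere.
belief = prior_from_keywords({"kitchen"})
for _ in range(3):                       # three unsuccessful looks in the kitchen
    belief = searched_and_missed(belief, "kitchen")
```

With a nonzero smoothing weight, repeated misses in the kitchen push the belief toward the other rooms, so the search moves on. With `smoothing=0.0` the belief would stay entirely on the kitchen no matter how many looks fail, which is why the small smoothing term matters.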
So, again, this behavior is what you would expect a POMDP to do. So, what's cool is that we're able to elicit this level of performance and these differences in performance in a relatively small amount of inference time. We're able to solve this large problem much faster than the existing state of the art and see the POMDP doing these cool POMDP things. Oh, I made that graph big so I could talk about it and I could do it. This is the same graph on the right here.
We also made this work on a real robot. So, here, the robot is reasoning about the world as a 2D grid. And it's using AR tags as the detector and reasoning about where to point its camera in order to find objects in the environment-- decided to turn around and look the other way and then go back until it finds the object.
So, this is cool because the robot is able to navigate around an environment and find objects with relatively realistic assumptions for what the robot's capabilities are. But the big limitation in what you saw was that we were using AR tags or fiducial tags to represent objects. So, we weren't using a real object detector.
And, really, what a person would want to say is something like, find the green mug on the table. They're going to give you some noun phrase, and you somehow have to connect that noun phrase to objects in the environment. So the challenge is being able to recognize objects, no matter what we're-- essentially being able to create a detector based on the words that the person said to you.
And the thing is that people say lots and lots of different things. So, this is some examples from our data set-- the "left striped chair," or the "top-right clock," or the "side with onion," or the "vase in the middle," describing objects with spatial language that you'd like to find. So, we want to be robust to [? remember ?] all these different things people could say. And here's more examples-- "pizza," "bottom-right clock," "rightmost vase," "coke on the top-right side."
So, there's an existing approach from Trevor Darrell's group at Berkeley that basically is doing segmentation from natural language expressions, which can take in an RGB image and a natural language expression, like "a bunch of bananas" and then the image you can see here in the left, and then do magic deep-learning stuff in order to output a segmentation of that part of the image, the bunch of bananas in this case. So, this is sort of close to what we want because it's taking an arbitrary language and an image and giving you the location of where that object is in the image.
So what we did is started with this algorithm and their data set. We expanded to a larger data set that had come out in the intervening time called RefCOCO, which has lots of objects in indoor environments of bottles and sofas and chairs, and we trained on them all. And then we applied it to a simulation environment, AI2-THOR, to make it easier to run experiments. So, we also did some posttraining on that.
But there's a significant difference between the segmentation problem and the problem that we really want to solve when we put this on a robot. And that's the problem that, in the training data set, when you're doing segmentation, you're getting an image. You get language. And the thing in the language is always in the image. So, here, the sink on the left-- there is a sink on the left and it'd come out in the segmented image.
But, actually, most of the images that are taken by the robot are not going to have the object that you're looking for when you're doing object search. In fact, if it did, you would have basically already found the object. So most of the images don't have a sink at all, even in a relatively small room, like this bathroom.
So the challenge is to be robust to this sort of thing and be able to correctly reject a lot of images that don't have the object you're looking for and correctly accept it when you do because, at that time, you almost found it. So, here's another example of the toaster, the same environment.
So, to address this, we train the model. In addition to training it on positive examples, we gave it a lot of negative examples, where the object wasn't in the image, to help the model learn to correctly reject images instead of hallucinating the object it's supposed to be looking for, whatever image it was in. I don't know why that was-- yeah, I should fix that.
And the second problem is that, when you're putting this into a POMDP, you need to have an observation model of what the robot sees. So what we did-- and the detector actually has an observation model that has a model of how much noise is in the detector that it can use to tell the POMDP essentially how good is the detector.
So, we defined an observation model based on outputs from the deep learning model that basically let it change how it's doing its belief update based on the uncertainty in the detector. And then we measured the task completion rate of different versions of this.
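The talk does not spell out how the deep model's outputs enter the observation model, so the following Python sketch shows just one hypothetical way it could work: map a confidence score to the true-positive rate used in the belief update. The linear mapping in `tp_rate_from_confidence`, its bounds, and the false-positive rate are all assumptions.

```python
def posterior_after_miss(prior_in_view, tp_rate, fp_rate=0.05):
    """P(target is in the viewed region | detector did not fire)."""
    miss_if_present = prior_in_view * (1.0 - tp_rate)
    miss_if_absent = (1.0 - prior_in_view) * (1.0 - fp_rate)
    return miss_if_present / (miss_if_present + miss_if_absent)


def tp_rate_from_confidence(confidence, low=0.5, high=0.95):
    """Hypothetical mapping from a model confidence in [0, 1] to a true-positive rate."""
    return low + (high - low) * confidence


# Fixed sensor model: the same rate is used regardless of the detector output.
fixed = posterior_after_miss(0.5, tp_rate=0.9)
# Dynamic sensor model: the rate follows the detector's reported confidence.
unsure = posterior_after_miss(0.5, tp_rate=tp_rate_from_confidence(0.1))
sure = posterior_after_miss(0.5, tp_rate=tp_rate_from_confidence(1.0))
```

When the detector reports low confidence, a miss removes less belief from the viewed region, so a planner using this model has more reason to look there again or from another direction.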
So, in our experiments, we use the AI2-THOR simulation environment of 15 different rooms, four different types of rooms, with 30 different target objects, indoor household objects, and a 2D model of the environment. So, we just assume-- we ignore occlusion and stuff. We just assume that, if you're pointing your sensor at a direction, you can see the objects in that direction.
And we're measuring the task completion rate. So, here, this graph is showing the task completion rate for different variations of our model. The one on the far right, the pink one, which we call the perfect sensor, is showing the performance of a noisy sensor that is being created from the ground-truth object location.
So, the model that the POMDP has of the sensor perfectly matches the sensor because it's being created from ground truth. So, this is an upper bound on performance because it's showing what it would be if the POMDP's model actually perfectly matched what the sensor is doing.
And everything else, all the other bars that are together in different colors, are showing a real sensor that's taking as input an image of the environment, a generated image of the environment, and then running detection on that image, based on the language input, and then finding the object.
And what you can see is, on the far left, we're using-- sorry, ado-- my friend's puppy came to visit, so welcome Bermy to this talk. So, on the far left is we're using a fixed ground truth sensor model, where it's not changing based on the detection results.
And that's doing worse than the one on the far right, where we're using a dynamic sensor model that's changing based on our estimated values for the way the sensor works. And this is basically saying that the model is doing a really nice job of telling us how confident it is, and we're able to do better in the POMDP by looking for more directions, essentially, and actually finding the object when the model is telling us it's not confident.
But we're not doing as well as the perfect sensor model, and I think this is because the model that we're using is making an assumption that's common for range sensors, which is that, if I look multiple times from the same place, that the noise is going to get averaged out. But, for a visual sensor like this one, there is some noise from the camera, and that image noise is going to get averaged out.
But, if it's failing to detect because it's occluded or because the object is out of the data set or something like that, looking in the same location is not going to solve it. And so what we should do instead is make our sensor model more knowledgeable about the 3D structure of the object and the environment so that it can reason about when it needs to look from different angles, or even fall-- that this is something that's out of data set, so we need to fail and say we can't find the object or something else.
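A short calculation shows what the independence assumption does to the belief. This is a simplified sketch that assumes no false positives and an illustrative true-positive rate; `belief_after_misses` is a hypothetical helper.

```python
def belief_after_misses(prior, tp_rate, n_looks):
    """Belief that the target is in view after n missed detections.

    Assumes each look fails independently and there are no false positives.
    """
    present = prior * (1.0 - tp_rate) ** n_looks
    absent = 1.0 - prior
    return present / (present + absent)


# Belief that the object is in the viewed region after 0 to 3 misses.
history = [belief_after_misses(0.5, tp_rate=0.8, n_looks=n) for n in range(4)]
```

Under this model each extra miss multiplies the "present" term by the same factor, so the belief collapses after a few looks. If the miss is systematic, for example an occlusion or an out-of-data-set object, the repeated looks carry no new information, and the model ends up overconfident that the object is absent.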
And we made this run on our Spot robot. So the input language is find "the green mug on the left," and it's able-- just using the map, Spot can automatically create a map of the environment, and it can basically just use the map to reason about where the object is and stuff. I don't have the video in here, but that's basically what's going on.
AUDIENCE: Stefanie--
STEFANIE TELLEX: [INAUDIBLE]
AUDIENCE: --I have one question.
STEFANIE TELLEX: Thank you. Yes, please.
AUDIENCE: Yeah, so yeah. So, in the previous part of the talk, you talked about how the algorithm can recover from a lie. So, this kind of uncertainty in the language. And now you are talking about robust object detection. So, how do you-- like, when the robot doesn't see something in the scene, how can you tell it is because of the object failure, or it is the language failure, or how do the algorithm, like in the POMDP, and the [INAUDIBLE].
STEFANIE TELLEX: Yeah, so it's basically based on the POMDP's model of the object detector. So, the object detector-- we're basically giving it a noise model of sigma, which is about its location, localization accuracy. So, if I see the object, what's the plus or minus x centimeters or x and y centimeters for-- if I say the object is at this x and y, how far away am I from that? That's a sigma, a variance.
But then more what you're talking about is the epsilon error rate, which we use to represent the true positive rate. We have another one for false positive rate, so how likely is the detector to lie to you and say, oh, it's here, when really it's not there, and it's not there when actually it is there.
So, given those parameters, the POMDP is going to just do what the POMDP is going to do. And, to the degree that this observation model is a correct model of the detector, what it will do, if you can solve it, is optimal. Does that make sense?
AUDIENCE: Yes.
STEFANIE TELLEX: Yes. But it's to the degree that it-- but this graph here is saying, actually, it's not a correct model because this magenta line is with a fake perfect sensor, where it does match, and it's still not getting all the objects because there's noise in the perfect-- we shouldn't call it the perfect sensor. It's perfect in the sense that the noise model of the sensor perfectly matches the POMDP's noise model. We still don't find the object all the time because we cut it off after a certain amount of time and just say, fail.
But you find it less when the sensor model doesn't match because of what you're saying, basically. It's not looking in the right places, or it sees it-- like, probably what's happening is it gets to a place where the object is in the field of view, and it had a false negative. It should have been detected by the detector, but the detector failed to detect it.
And then it probably was like, OK, I'm going to look again at the same spot, and maybe it'll work that time, but, nope, it doesn't. And then it moves on, and that's it. It doesn't see the object. It misses it, never goes back there, right? And the POMDP won't have it go back.
And, if it thinks that the error rate of the detector is low, and it thinks it looked there and didn't see it, it's not going to go back until it's looked everywhere else, right? Because then the noise-- like, if you have a smoothing term, it's then going to basically look everywhere else, try it twice. And, hopefully, it'll get a different view, a viewing angle or something, and maybe see it.
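The detector model described in this answer, a localization noise sigma plus true-positive and false-positive rates, can be sketched as a sampling function. The function name, the uniform placement of false positives, and the argument layout are assumptions for illustration; only the three kinds of parameters come from the talk.

```python
import random


def sample_detection(target_xy, in_view, sigma, tp_rate, fp_rate, bounds, rng):
    """Sample one detector output: an (x, y) estimate or None.

    sigma: standard deviation of the localization noise.
    tp_rate: chance of reporting a target that is in view.
    fp_rate: chance of reporting a detection when the target is not in view.
    bounds: (width, height) used to place false positives uniformly (assumed).
    """
    if in_view:
        if rng.random() < tp_rate:
            x, y = target_xy
            return (rng.gauss(x, sigma), rng.gauss(y, sigma))
        return None                               # false negative
    if rng.random() < fp_rate:
        # False positive: the detector reports the object somewhere it is not.
        return (rng.uniform(0, bounds[0]), rng.uniform(0, bounds[1]))
    return None


rng = random.Random(0)
reading = sample_detection((2.0, 3.0), True, sigma=0.1, tp_rate=0.9,
                           fp_rate=0.05, bounds=(10.0, 10.0), rng=rng)
```

The POMDP plans well only to the degree that these parameters describe the real detector; a false negative that the model considers rare is exactly the case where the robot moves on and does not come back soon.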
AUDIENCE: I also have a question.
STEFANIE TELLEX: Yeah, please.
AUDIENCE: OK, I guess I have two questions, but one is like, what about if the visual scene is moving? Is the robot capable of dealing with movement in the visual scene, or it has to be completely static or [INAUDIBLE] for it to be able to navigate--
STEFANIE TELLEX: This detector is operating at the level of individual images from a camera that's running at 30 hertz or something. But it's only looking at one individual image. And you could definitely mess with the robot by moving things around, either-- if it's in the field of view, it's just going to do whatever the deep learning sensor does. It is not reasoning about occlusion at all in this model.
I will talk about another model that does do that. It does reason about occlusion. So that is, if you're standing in front of the green-- like, in this video-- there we go. In this video, if you're standing here, in between the robot and the green mug, with this model, it's going to completely break and think, oh, the green mug's not there because I see a person. It's not going to say, oh, there's something-- I've got to look behind the person.
And we'll talk about another model that can reason about occlusion. So, given a point cloud, it'll be like, well, I can't see stuff behind the person, so I've got to look behind that other stuff in order to fully search the environment. Definitely, an attacker could stealthily sneak around behind the robot and move the object where it just looked and make it really hard to find the object. So, we don't reason about that.
That said, it's very straightforward to modify a model like this one to take that into account by saying that-- by essentially adding an entropy component to your transition model, so basically saying that, whenever you state transition from one state to the next state, all of your belief-- if any object could, at any point, randomly move to-- teleport to a different place.
And what that looks like when you implement that assumption is that, if you just wait, the probabilities all go back to uniform, and then it'll do the right thing if it doesn't know anymore. It'll be like, well, I don't really know, so I'm going to go look again and update my probabilities again. Does that make sense?
AUDIENCE: Yeah, thank you.
STEFANIE TELLEX: Yep.
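The "objects may teleport" transition model from this answer has a simple form: at each step, mix the belief with a uniform distribution. A minimal sketch, where `move_prob` is an assumed value and `predict_step` is a hypothetical name:

```python
import numpy as np


def predict_step(belief, move_prob=0.1):
    """Transition update: with probability move_prob the object jumps to a uniformly random cell."""
    uniform = np.full_like(belief, 1.0 / belief.size)
    return (1.0 - move_prob) * belief + move_prob * uniform


belief = np.array([0.97, 0.01, 0.01, 0.01])   # robot is nearly sure of cell 0
for _ in range(50):                           # wait 50 steps without looking
    belief = predict_step(belief)
```

Each step shrinks the difference from uniform by a factor of `1 - move_prob`, so a belief that is not refreshed by new observations drifts back to uniform and the robot has a reason to look again.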
AUDIENCE: And I guess I had another question, but this is very general. I was just wondering-- I imagine this robot has different models running, so you probably have one for object recognition, but you also have to have movement in the world. And I was wondering how do you integrate between these different functions that the robot has to accomplish.
STEFANIE TELLEX: Good question. So we think about robotics as having a sensorimotor substrate, which is the model of the robot and its sensors and actuators. And then everything else is happening on top of the sensorimotor substrate. And the sensorimotor substrate is fixed, and-- I mean, it's not fixed if you're actually building the robot because you could always staple another camera. We're putting GPS sensors on it and stuff.
But, if you fix the hardware of the robot, then the sensorimotor substrate is fixed. You have the sensors you have, and they're outputting what they're outputting, and you have the actuators you have, and the actions are running on top of them.
In practice, we're often working at a higher level of abstraction, where, with the case of the Spot, for example, it comes with an SDK, in which you could say, please stand at this x, y location in your map. And it's running localization on board. It's doing it. And it's got a planner that decides how to stand in a particular location-- x, y, theta, usually, so you can face in different directions.
And, in the case of the Spot, it's got-- and this is very common in many robots. You have a fancy, expensive camera, but you only have one, and then you have cheaper other cameras sprinkled around. So, it does have stereo pairs facing in all directions, but it has a high-res color camera in its snout and its kind of hand there. And so, we're basically asking Spot to point its camera in different places. And then, based on that-- and we define that as the action space of the POMDP.
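An action space of this kind, standing at an x, y location and facing a direction theta, can be enumerated as a list of candidate poses. The sketch below does not call the Spot SDK; the function `viewpoint_actions`, the grid spacing, and the number of headings are assumptions for illustration.

```python
import itertools
import math


def viewpoint_actions(xs, ys, n_headings=4):
    """Enumerate candidate (x, y, theta) poses; theta is in radians."""
    headings = [2 * math.pi * k / n_headings for k in range(n_headings)]
    return list(itertools.product(xs, ys, headings))


# Two x positions, three y positions, four headings each.
actions = viewpoint_actions(xs=[0.0, 1.0], ys=[0.0, 1.0, 2.0])
```

A planner would pick among these poses, and a separate layer (the robot's own navigation stack) would be responsible for actually reaching the chosen pose.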
OK, so this is talking a little bit about how you can extend the model so that it's able to take as input arbitrary language and a model of the environment, a map of the environment, the ability to localize within that map, and move within that map, and then find objects in the environment. So, that's pretty cool.
So then, as you guys kind of alluded, we got some questions about, well, what about moving things? And I think a related question is, what about occlusions and stuff? Really, the environment is 3D and not 2D. So we were really interested in moving this to a three-dimensional space. And this is challenging because the two-dimensional space is already really hard to plan in, and the three-dimensional space is the cube of that, literally, so it's even harder to plan in because the space is much, much larger.
But 3D search is actually very relevant both for manipulator robots, like the Spot. The camera is in the arm, so it's very natural and useful to move the camera up and down, forwards and backwards, left and right, in order to make it move around. And, of course, if you're a drone, if you're a flying robot, you're also moving in 3D space. And lots of useful things are in 3D spaces, like bookshelves and cupboards that you have to open, and behind other things, and stuff like that.
So, to expand this to 3D, what we did is we made it reason about-- we basically built out the naive model of a POMDP with a camera and a frustum, and we showed that that was very slow and can't be solved quickly. And then we used a hierarchical representation for 3D space called octrees, which represent a 3D space as a hierarchy of volumes or subregions. And then we use that to structure the robot's belief update so that it can efficiently segment the regions into volumes. We have to tell it what volumes there are. It can decide to do that and then systematically search within the regions and do its belief updates.
And we made this run in simulators. You can see shots from the simulator here. And then we also made this run on our mobile robot. And this is again with AR tag detectors, where it's getting sensor input and actuating its torso to move up and down with its handheld camera and moving around on a mobile base to find objects in an environment.
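To make the octree idea concrete, here is a minimal sketch of a belief over one object's location stored as a tree of cubes. It is not the model from the talk: the names `OctNode`, `child_index`, and `observe`, the uniform split of probability mass, and the single-voxel observation are all assumptions made for illustration. The point it shows is that only the part of space the robot has looked at gets refined, while everything else stays as a few coarse cubes.

```python
from dataclasses import dataclass, field


@dataclass
class OctNode:
    # Probability that the object is somewhere inside this cube.
    mass: float
    # Either empty (not refined yet) or exactly 8 sub-cubes.
    children: list = field(default_factory=list)

    def split(self):
        # Refine lazily; assume mass is spread evenly over the 8 octants.
        if not self.children:
            self.children = [OctNode(self.mass / 8) for _ in range(8)]

    def scale(self, factor):
        self.mass *= factor
        for child in self.children:
            child.scale(factor)


def child_index(x, y, z, half):
    # Which of the 8 octants contains the point (x, y, z).
    return (x >= half) * 4 + (y >= half) * 2 + (z >= half)


def observe(root, size, voxel, likelihood_ratio):
    """Reweight one unit voxel by likelihood_ratio, then renormalize.

    size is the side length of the root cube in voxels (a power of two).
    A ratio of 0.0 means "the detector looked here and saw nothing".
    """
    path, node = [root], root
    x, y, z = voxel
    while size > 1:
        node.split()
        half = size // 2
        node = node.children[child_index(x, y, z, half)]
        path.append(node)
        x, y, z = x % half, y % half, z % half
        size = half
    # Apply the change to the leaf and to every ancestor, so each parent
    # still equals the sum of its children.
    delta = node.mass * (likelihood_ratio - 1.0)
    for n in path:
        n.mass += delta
    root.scale(1.0 / root.mass)


root = OctNode(1.0)
observe(root, 4, (0, 0, 0), 0.0)  # looked at one corner voxel, found nothing
```

After this one observation only the branch leading to the observed voxel has been subdivided; the other seven top-level cubes are still single nodes, which is what keeps the belief update cheap.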
So, here, you can see the map of the environment, and then segmenting the map into regions, and then searching within those regions to find objects. Here, one's up high, and one's down low. One's in the middle, and it's able to find them all. And this was a RoboCup Best Paper at IROS last year-- two years ago because time flies, as Boris and I [INAUDIBLE], 3D multi-object search.
Here's a video showing it going. And you can see the updates running on the map below. And what's cool is we're not telling it like where to go, where to point its camera. It's making its own decisions about that based on its model of the world.
So, we talked about finding objects based on hallucinating a detector for the object. There's also the problem of using pre-existing knowledge bases to find objects. So, one example I like to give is, find the big, blue bear. This is an object. It's no longer on Brown's campus, but it was, and it was a piece of artwork, where there's a lampshade, and a big, blue bear. And you can see from the scale, it's quite large.
OK, big, blue bear-- it's very distinctive whether there is or is not a big, blue bear on my screen, as well as whether there's one in your field of view. It's a great landmark because it's so distinctive. But, by definition, because it's so distinctive and interesting and unique, it's never in your training set, right?
So the challenge is that we'd like to be able to find objects like the big, blue bear without having had them appear in our training set before. I mean, it would be really cool if there's a way to do this with large language models and compositionality. I feel like I should acknowledge that's worse because, if you believe in compositionality, this is why, because you've never seen a big, blue bear before, but here it is. And, with just those words, having known those words, you should be able to make a detector for it. Maybe that would work.
But we decided to try to hack it a different way, which is to observe that landmarks, like the big, blue bear, are in existing databases about the environment. So, in particular, this is Providence, Rhode Island. This is where Brown is. And there's an OpenStreetMap database of all the objects in the environment. And, in that database, is the big, blue bear. So, you'd like to be able to use that information in order to find objects in the environment without having to train in the environment.
So, in our previous work doing this sort of thing, if you wanted to test in Providence, you had to train in Providence. So, you had to collect data from Providence and then train on those landmarks. And then, if you took it to Cambridge, you had to collect data from Cambridge and then train in Cambridge. And then get to Woods Hole, the same thing-- you had to collect data from Woods Hole, and that's a drag.
So, in our previous work, we were able to do this. We collected a data set of natural language commands paired with LTL, Linear Temporal Logic expressions. So, this can be used for finding objects, but also, you can give more sophisticated commands, like go to the green room, or enter the blue room via the green room, or go to the blue room but avoid the green room.
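As an illustration of what such pairs can look like, here is one plausible LTL rendering of the three example commands. These formulas are assumptions written for this example, not taken from the data set; F is the standard "eventually" operator, G is "always", and green and blue are hypothetical propositions that hold while the robot is in the corresponding room.

```latex
\begin{align*}
\text{``go to the green room''} &:\quad \mathbf{F}\,\mathit{green} \\
\text{``enter the blue room via the green room''} &:\quad \mathbf{F}\,(\mathit{green} \wedge \mathbf{F}\,\mathit{blue}) \\
\text{``go to the blue room but avoid the green room''} &:\quad \mathbf{F}\,\mathit{blue} \wedge \mathbf{G}\,\neg\mathit{green}
\end{align*}
```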
And we collected a data set that was a few hundred commands. It was the largest data set to our knowledge of language and LTL pairs. And we trained a sequence-to-sequence model in order to interpret commands. I think I'll skip the details of that model. But, in order to do this, we basically-- like, if we trained in this virtual environment with green rooms and red rooms, we could never run on OpenStreetMap. And, again, if you trained in Providence, you couldn't test it in Cambridge.
So what we did is took a model from the natural language community called CopyNet, and we said, look, these open-class noun phrases are something that a large language model can actually learn to extract. So if you say something like, go to the medicine store, you can extract it as a quoted string-- that's why it's called copying. It's copying out the stuff from the input-- and embed it inside your logical formal language expression, your linear temporal logic expression.
And then you can feed that to a special landmark resolution module that takes as input the noun phrase, the words "medicine store," for example, and then runs a query in OpenStreetMap to find the matches for that object. So, in our case, we just used cosine similarity between medicine store and the words that were associated with the landmark within OpenStreetMap.
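A minimal sketch of that landmark resolution step, under stated assumptions: the talk only says cosine similarity was used, so the bag-of-words vectors here are a stand-in for whatever word representation the real system used, and the landmark names and tag strings are made up for the example rather than pulled from OpenStreetMap.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two strings viewed as bags of words."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def resolve_landmark(phrase, landmarks):
    """Return the landmark whose associated words best match the phrase.

    landmarks maps a landmark name to the words associated with it.
    """
    return max(landmarks, key=lambda name: cosine(phrase, landmarks[name]))


# Hypothetical landmark records, standing in for map data.
landmarks = {
    "corner_pharmacy": "pharmacy medicine drug store",
    "blue_bear": "artwork big blue bear lamp",
    "main_library": "library books study",
}

best = resolve_landmark("medicine store", landmarks)
```

Because the noun phrase is copied out of the command as a string, this lookup runs at execution time against whatever map is loaded, which is what removes the need to retrain per city.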
And what this lets you do is take a model that's been trained on data from Providence and from Cambridge, and bring it to a new place, like Woods Hole or Cape Cod or whatever you want to take it to, and give it OpenStreetMap data which already exists for these places, and, immediately, the robot can understand commands.
So, in our user evaluation, we had 14 participants give natural language commands to a drone in Tulsa, Oklahoma. There was no training data from this location. And almost 80% of the time, more than 75% of the time, the commands succeed. And we break it down. This is a speech-to-text-- like, it's going from speech all the way to behavior.
And we break it down to different failure modes. So, some of it was [INAUDIBLE]. And a lot of it was essentially semantic parsing incorrectly. Some of it was incorrectly resolving landmarks, and some of it was bugs in our planner. But, overall, we're doing quite well in understanding commands in a new environment that we haven't seen.
So, what we're trying to do now is look at the existing work in language and robotics, so RoboNLP. A lot of that work is using parallel data sets of human languages and some kind of abstract meaning representation. A lot of times, that's LTL, Linear Temporal Logic. Sometimes, it's made-up [? formal ?] languages that that graduate student made happen. Sometimes, it's based on the planning representation that the robot's using.
And what people do is create small data sets of hundreds or thousands of examples of English, or whatever human language, paired with expressions in that data set. And then you try to learn a model that translates between English and robotese, the robot language.
But if you compare this to existing machine translation work, the robotics people are using hundreds or, if you're lucky, thousands of examples. The machine translation people using parallel data sets of human language pairs are using millions of examples. And, not surprisingly, it works a lot better if you do that.
So, what we did in my lab is we looked at this, and we were inspired by the work in the Geoquery domain. So, in the Geoquery domain, the first time the data set was published in 1996, the training instances consisted of sentences paired with the parsed sentences. Somebody annotated the parses, the parse trees for those sentences.
Then, in 2005-- and, of course, I don't know if you remember because I think I was around when Luke was doing his PhD-- I was a PhD student at MIT-- and he wrote a really cool paper, and his thesis was about basically pairing sentences with logical forms, so you didn't have to annotate the parse tree. You only had to give it English sentences paired with the logical form. And it could learn to do it without the parse tree. That's why it was a big advance. So, the logical form was not annotated with a parse tree.
Then, in 2009, Percy Liang and Michael Jordan, Dan Klein-- there were a couple of other papers at the same time doing this-- said, actually, we don't need annotated logical forms. We don't need English paired with a logical formula, a lambda calculus expression, or something, or LTL, whatever it was. In this case, it was lambda calculus.
He said, we don't need that. All we need is a signal of mapping questions directly to answers. And the logical form is going to be something internal to our system, and we're not going to use the logical forms at all at training time. And each of these approaches did better, and the last approach, Percy Liang's approach and others, solved Geoquery and said, look, Geoquery is done, and we're able to achieve human-level performance on it.
So, looking at this trajectory, we were inspired by this to try to do this in language and robotics, try to make that same jump from a Luke Zettlemoyer kind of supervised approach of language paired with a formal expression to get rid of the formal expression. The thing that we had to do is decide, what are we going to train from? So, this is for following instructions, and we're pairing natural language with trajectories.
So, here, the natural language instruction is, "walk along third street until the intersection with main street, then walk until you reach the charles river." And then paired with that is an LTL logical form that represents the meaning of that command that sort of accept-- it correctly-- it's accepting and rejecting specific trajectories based on what the person told the robot to do.
But we don't see the logical form at training time. We only see the trajectories. So, we see correct and incorrect examples of trajectories. And we see no logical forms. We only have a model of the formal language and the ability to evaluate expressions in that formal language against trajectories. But the data set doesn't have any logical forms [INAUDIBLE], any LTL expressions.
And we use LTL progression to kind of generate sets of expressions and then evaluate them as the trajectory progresses through the environment so that, during training, we can figure out what LTL expressions go with which-- essentially annotate the data set with LTL expressions.
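For readers who have not seen LTL progression, here is a minimal sketch of the general technique on a small fragment of LTL (atoms, negation, and, or, eventually, always). The tuple encoding, the function names, and the room propositions are assumptions for illustration, and this is not the training procedure from the talk. Progression rewrites a formula after each step of a trajectory into what still has to hold for the rest of it, so a candidate expression can be checked against a trajectory one state at a time.

```python
TRUE, FALSE = ("true",), ("false",)


def atom(name):
    return ("atom", name)


def neg(f):
    if f == TRUE:
        return FALSE
    if f == FALSE:
        return TRUE
    return ("not", f)


def conj(a, b):
    if FALSE in (a, b):
        return FALSE
    if a == TRUE:
        return b
    return a if b == TRUE else ("and", a, b)


def disj(a, b):
    if TRUE in (a, b):
        return TRUE
    if a == FALSE:
        return b
    return a if b == FALSE else ("or", a, b)


def eventually(f):
    return ("F", f)


def always(f):
    return ("G", f)


def progress(f, state):
    """Rewrite f into the obligation left after observing one state."""
    op = f[0]
    if op in ("true", "false"):
        return f
    if op == "atom":
        return TRUE if f[1] in state else FALSE
    if op == "not":
        return neg(progress(f[1], state))
    if op == "and":
        return conj(progress(f[1], state), progress(f[2], state))
    if op == "or":
        return disj(progress(f[1], state), progress(f[2], state))
    if op == "F":  # satisfied now, or still owed later
        return disj(progress(f[1], state), f)
    if op == "G":  # must hold now and keep holding
        return conj(progress(f[1], state), f)
    raise ValueError(f"unknown operator: {op}")


def holds_at_end(f):
    """Finite-trace convention: pending F fails, pending G succeeds."""
    op = f[0]
    if op in ("true", "G"):
        return True
    if op in ("false", "F", "atom"):
        return False
    if op == "not":
        return not holds_at_end(f[1])
    if op == "and":
        return holds_at_end(f[1]) and holds_at_end(f[2])
    return holds_at_end(f[1]) or holds_at_end(f[2])


def accepts(f, trajectory):
    for state in trajectory:
        f = progress(f, state)
    return holds_at_end(f)


# "go to the blue room but avoid the green room" (hypothetical encoding)
command = conj(eventually(atom("blue")), always(neg(atom("green"))))
good = [{"hall"}, {"red"}, {"blue"}]
bad = [{"hall"}, {"green"}, {"blue"}]
```

Here `accepts(command, good)` is true and `accepts(command, bad)` is false, which is the kind of accept/reject signal that lets correct and incorrect trajectories stand in for annotated logical forms.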
We show the people on Amazon Mechanical Turk trajectories sampled from the k shortest paths, and we ask them to give a natural language description that describes one trajectory while specifically excluding the other. This is much easier for our Turkers to do than annotating LTL. So, we can collect a lot of different trajectories per environment because we don't need any LTL at training time. This is the largest data set of language mapped to trajectories that we're aware of, and there's no LTL required at training time. There's some red Xes to indicate that.
And we tested on a hundred unseen trajectories of natural language commands of varying lengths, and we compute the goal-state accuracy and edit distance, and it works. So, we show that we're able to correctly follow commands. We were evaluating both goal-state accuracy-- so this is how accurately do you end up at the-- are you where you're supposed to be at the end-- as well as edit distance, which is how close, with an edit distance metric, is the path that you took to the true ground-truth path that you're supposed to take?
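The two metrics can be sketched as follows. This assumes each path is a list of waypoint identifiers and that the edit distance is plain Levenshtein distance over those lists; the talk does not say which edit distance was used, and the function names and sample paths are made up for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two paths (lists of waypoints)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(
                min(
                    prev[j] + 1,             # drop x from the taken path
                    cur[j - 1] + 1,          # add y to the taken path
                    prev[j - 1] + (x != y),  # keep or substitute
                )
            )
        prev = cur
    return prev[-1]


def goal_state_accuracy(predicted, reference):
    """Fraction of paths that end at the same place as the reference."""
    hits = sum(p[-1] == r[-1] for p, r in zip(predicted, reference))
    return hits / len(reference)


# Hypothetical paths over named intersections.
predicted = [["a", "b"], ["a", "c"]]
reference = [["a", "x", "b"], ["a", "d"]]
accuracy = goal_state_accuracy(predicted, reference)
distance = edit_distance(["a", "b", "c"], ["a", "c"])
```

The two numbers answer different questions: a path can reach the right goal by the wrong route (good accuracy, large edit distance), which matters for commands like "avoid the green room".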
And we do much better than baseline algorithms in real-world environments. We also tested this on the SAIL data set, which is Matt MacMahon and Ray [INAUDIBLE] data set of indoor environments.