ThreeDWorld (TDW) - A multi-modal platform for interactive physical simulation
Date Posted:
August 25, 2021
Date Recorded:
August 16, 2021
Speaker(s):
Jeremy Schwartz, MIT, MIT-IBM Watson AI Lab
Brains, Minds and Machines Summer Course 2021
Description:
TDW website - http://www.threedworld.org/ - for papers, links and more resources.
JEREMY SCHWARTZ: I work in the brain and cognitive science department at MIT, where I'm the project lead for the development of ThreeDWorld, or TDW, as we call it for short. TDW is a multimodal platform for interactive physical simulation. With TDW, users can simulate high-fidelity sensory data and physical interactions between mobile agents and objects in a wide variety of rich 3D environments.
In this tutorial, I plan to do a deep dive into TDW and explain what it is and why we created it, cover its features and capabilities in detail, and discuss several examples of actual use cases. And of course, you can also get a high-level overview of TDW from our website, threedworld.org, where there are lots of visual examples. TDW is publicly available, and you could link to the TDW GitHub repository from there. We'll be taking a close look at the TDW repo a little later in the tutorial.
One thing I should point out before we begin. I'm not a scientist. My background is computer graphics software development and content creation. If you're interested in more detail about the science behind the various use case experiments I'll be talking about today, I recommend checking out the relevant papers, most of which are cited in our TDW paper on the website.
Let's get started. As we all know, machine perceptual systems typically require large amounts of labeled data. That data can be laborious to come by, and can also be quite expensive. In addition, some quantities, such as the mass of an object or the material it's made of, can be difficult for human observers to label accurately.
Around four years ago, we started developing TDW as a way to address the situation. The idea was that, by generating scenes in a virtual world, we could have complete control over data generation with full access to all associated generative parameters. This would allow us to train machine perceptual systems as virtual agents inhabiting the world.
As I said, TDW is a multimodal platform for interactive physical simulation. It's built on the state of the art game development platform Unity. Unity is cross-platform, allowing us to run TDW on Windows, OSX, and Linux. It handles the image rendering, audio rendering, and physics for us. What you're seeing on the screen is some examples of TDW's simulation engine output, including its advanced physics. I'll be going into a lot more detail about all of the platform's capabilities during this tutorial.
In terms of agenda, there are four key aspects of TDW I intend to cover in detail. First, we'll look at three very different use cases that illustrate the generality and flexibility of TDW's design. Then we'll talk about the system architecture, as well as the API that supports the generality of that design. Next, I'll cover the ways in which we afford equal status to visual and auditory modalities and how that allows us to create synthetic imagery at near photoreal levels and generate sounds with a high level of acoustic fidelity.
Then I'll discuss TDW's advanced physics capabilities that allow rigid body objects, soft body objects, cloth, and fluids to interact. Coupled with that, we'll explore the multiple paradigms we use to interact with objects and generate physically realistic behavior. Direct- or object-to-object interactions are where users directly affect objects through API commands. Indirect or agent-to-object interactions utilize some form of embodied agent. TDW supports several types of agents, as we'll see. Users can even interact directly in VR, picking up virtual objects using virtual representations of their own hands.
A key goal in designing TDW was to create a very general and flexible platform capable of supporting a wide range of use cases. What this means in practice is that, compared to some other simulation platforms and frameworks, TDW does not impose any particular metaphor on the user in terms of the types of simulations it can generate. For example, some simulation frameworks only support interior floor plan environments-- in other words, rooms with furniture-- or specific paradigms like navigation.
TDW, of course, can do those as well. But we can also generate experimental stimuli of a more custom, use-case-specific nature. We can, for example, support use cases dealing with fine-grained image classification and object detection, physical prediction and inference, infant play behavior, and task and motion planning. To be fair, while many of the other simulation frameworks and platforms had not previously supported interaction with the environment, it appears that this situation is changing, and several are beginning to add these capabilities. Let's take a quick look at some examples of very different use cases.
Here, we see an example of generating data sets of synthetic images, usually used for training networks to generalize against real-world images, such as those from ImageNet. Typically, these data sets are very large, on the order of 1.3 million images. These particular images are from a much smaller data set generated for the purpose of collecting primate neural responses to synthetically generated images. The object labels used for this experiment were bear, elephant, car, face, dog, apple, chair, bird, zebra, and plane.
I won't go into the details right now about how TDW creates scenes of this type, since we haven't discussed any of that yet. Let's just say that 3D objects, for all exemplars of a given semantic category-- chair, in this case-- are loaded into a virtual scene. To increase variability, each image has randomized camera and positional parameters, and may have additional random parameters, such as the angle of the sun or the visual materials of the model. This randomness is constrained somewhat in order to guarantee that the object is always at least partially in the frame.
Here, we see an example of TDW being used for the training and evaluation of physically realistic forward prediction algorithms. As human beings, we learn at a very early age that what happens when objects come into contact with each other affects how we interact with them. For agents to learn this, they must understand how momentum and geometry affect collisions.
In this clip, randomly selected toys are created with random physics material values. A force of randomized magnitude is applied to one toy, which is aimed at another. This is one of over a dozen physics behavior scenarios in our physics benchmark data set. We'll see more of this data set later in the tutorial, when we discuss physics in TDW.
Another key use case area for TDW, as shown in this third example, is embodied AI, where embodied agents are trained to interact with the environment and potentially change scene state in some way. Here, you see an agent performing part of a TAMP, or task and motion planning, task involving the location and retrieval of target objects. We'll dig deeper into this area as well when we discuss physical interaction in TDW.
However, before diving into TDW's features in detail, I think it would be helpful to look at the platform's high-level systems architecture, as then, we can introduce some terms that you will hear throughout the rest of the tutorial. Also, while we're looking at the architecture, we can take a brief look at our API. I'll also explain how to access and set up TDW.
TDW simulation is composed of two main components. The build, what you see here in green, is a Unity executable of TDW's simulation engine. This can be either Linux, OSX, or Windows. The build is responsible for image rendering, audio synthesis, and all physical simulation. The controller, in purple, is a Python program which communicates with the build over TCP/IP and uses TDW's comprehensive command and control API. The controller sends commands to the build, which executes those commands.
The build can then return a wide range of data types to the controller, representing the state of the virtual world. Data types include image data, such as rendered images, segmentation ID images, semantic classification images, depth maps, and normal maps. They can also include collision data, including whether objects are discretely impacting, rolling, or scraping. The build can also return spatial and transform data, such as position, orientation, object bounds, et cetera. Users, i.e. researchers, write controllers to suit the needs of their use case. Basic Python skills are really the only requirement for using TDW successfully.
In addition to the build and controller, the platform architecture includes two other key components. The first is an Amazon S3 repository where 3D object models, scene models, material files, and HDRI skyboxes are stored. I'll be explaining more about all of these in a moment. Object and environment models are downloaded at runtime into the build as asset bundles, which are compressed binary versions of the model data. Once downloaded, all model data is cached. What this means is that rebuilding a scene-- for example, when running successive trials-- is essentially instantaneous.
The second key component is a JSON records database, which is stored locally. This database contains all model and other metadata used by TDW. A set of librarians, which are basically Python wrapper classes, handle the querying of these metadata records at runtime by the controller.
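For illustration, here is a minimal sketch of querying those records with a librarian. The model name is just an example, and the record fields shown are those described in the TDW documentation.

    from tdw.librarian import ModelLibrarian

    # Load the core model library's JSON records and look up one model's metadata.
    librarian = ModelLibrarian("models_core.json")
    record = librarian.get_record("iron_box")
    # Records include the model name, semantic (WordNet) category, bounds, etc.
    print(record.name, record.wcategory, record.bounds)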
Our API contains over 200 commands covering tasks like scene setup and manipulation, object loading and modification, camera and rendering controls, object interaction using physics, and agent navigation and control, again using physics. These are general purpose atomic commands. You can think of them, really, as LEGO-like building blocks for creating high-level behaviors.
And like many available simulation frameworks, TDW controllers can send multiple commands per time step, allowing for arbitrarily complex simulation behavior. The build can run standalone-- locally on a laptop, for example-- or on a remote server. It can also run within a Docker container. The TDW documentation is one of the platform's strongest features. Every command and every variable in the API is fully documented, complete with numerous example controller scripts. I think there's 45 at the present count.
In addition, we have numerous documents addressing specific topics, such as best practices for improving photorealism, how to handle observation data, how to set up scenes, how to do audio and video recording, and many more. Over the next few months, we will be expanding our documentation even further by adding a series of video tutorials that will further improve the onboarding process for new users, as well as help more advanced users tackle certain types of simulation situations.
Now that we have some understanding of the system architecture and TDW's API, let's look at a minimal example of a controller. We aren't going to go too deep into coding in this tutorial, but we must at least have a version of the classic "Hello, World!" example. This is our Python controller version of "Hello, World!" First, we instantiate a controller, and then send commands to the build using the controller's communicate method. Each command is a JSON object that has matching code in the build. The build will de-serialize each command and then execute the associated process.
In this case, the only command we're actually sending is the terminate command, which terminates the build. When a controller is run, it will first check if the correct version of the build is already on the computer, and if not, it will download it and save it locally. It will then launch the build. We can disable this automatic launch of the build if desired, however-- for example, when we want to run the build on a remote server. The controller will also check to see if your version of TDW is the most recent.
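In code, that minimal "Hello, World!" controller looks roughly like this (a minimal sketch matching the example just described):

    from tdw.controller import Controller

    # Instantiating the controller launches (and, if needed, downloads) the build.
    c = Controller()
    # Send the terminate command, which shuts down the build.
    c.communicate({"$type": "terminate"})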
Let's look at a more complex example. We're going to look at the photoreal scene example controller, photoreal.py, which is one of our actual example controllers, and which creates an image similar to the one that you're seeing here. I'm just going to switch over to the code window. Basically, here, we're doing some imports. We're importing the controller.
We're importing the module containing utilities, both functional utilities and wrappers. We're importing some code to handle images. Basically, what we're doing is creating a subclass of the controller class called Photoreal. And in our run method, we are creating an output directory for some images that we will create and capture in this script.
We then load the environment; this is called a streamed scene. I'll explain in a minute what a streamed scene is when we look at how TDW handles environments. In this case, it's a scene called archviz_house. We then add a set of four objects.
We're actually using a wrapper function here to simplify the loading of the objects. We just reference each one by name. We give it a position and rotation to set up the scene explicitly. We then organize a whole bunch of setup commands into a single list. If you remember, I mentioned that TDW can execute a list of commands in a single time step.
Basically, we're stringing together a whole set of commands that will be executed all at the same time, saving a bunch of sending commands back and forth. We're setting the screen size to 1920 by 1080 and setting the render quality to its highest level. We're then creating what we call an avatar. You can think of an avatar as an agent. The terms are almost interchangeable, though avatars tend to not have the ability to affect the environment, whereas an agent might. In this case, what we're adding is just a simple disembodied camera and setting its field of view.
We then set a bunch of post-processing parameters that set things like the focus distance for the depth of field, and exposure, ambient occlusion values for improving the shadowing in corners of the room. We set the shadow strength, and then we send all of those commands using the communicate command.
And then we teleport the avatar to the location we want. In other words, we teleport the camera to where we want it, request that images be sent, and point the camera at the location we want. And then we get the image data back and save it out to our directory. This is a simple but totally functional example of a controller that generates a photorealistic image.
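For reference, here is a condensed, hedged paraphrase of that photoreal.py controller. The object name, positions, and camera values here are illustrative, and the actual example controller in the repo is more complete.

    from tdw.controller import Controller
    from tdw.tdw_utils import TDWUtils
    from tdw.output_data import OutputData, Images

    class Photoreal(Controller):
        def run(self):
            # Load the streamed scene and set up the shot in one list of commands.
            commands = [self.get_add_scene(scene_name="archviz_house"),
                        {"$type": "set_screen_size", "width": 1920, "height": 1080},
                        self.get_add_object("live_edge_coffee_table",
                                            object_id=self.get_unique_id(),
                                            position={"x": 0, "y": 0, "z": 0})]
            # Create a disembodied-camera avatar and request a single frame of images.
            commands.extend(TDWUtils.create_avatar(avatar_id="a",
                                                   position={"x": -2, "y": 1.5, "z": 2},
                                                   look_at={"x": 0, "y": 0.5, "z": 0}))
            commands.append({"$type": "send_images", "frequency": "once"})
            resp = self.communicate(commands)
            # Parse the returned output data and save the image passes to disk.
            for r in resp[:-1]:
                if OutputData.get_data_type_id(r) == "imag":
                    TDWUtils.save_images(Images(r), filename="photoreal")
            self.communicate({"$type": "terminate"})

    if __name__ == "__main__":
        Photoreal().run()

Let me come back over to our slides.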
Now that we've seen a more complex example of a controller, let's take a quick look at how we access and set up TDW, including a quick visit to the repo. The basic requirements for running TDW are: you have to be on Windows, OSX, or Linux; you need to have Python 3.6 or greater installed; and ideally, you want a GPU on the system that you're running simulations on-- the faster the better. I mean, obviously, it is possible to run TDW on a laptop, for example, that doesn't have a GPU, but this is real-time 3D rendering, so your performance is going to be pretty hampered if you don't have a GPU.
There are some functionalities, such as audio/video recording, that have additional requirements. These are spelled out in the TDW documentation wherever there are requirements beyond what I've listed here. TDW setup is a simple pip install, and includes all required dependencies. To get the full benefit of TDW, such as access to the example controllers, users should download the repo or fork it. Let's take a quick look out there now. I'm just going to click through to the repo. OK. This is the TDW repo-- threedworld-mit/tdw.
At the top level, we have all of our documentation in this very large readme. The first thing you hit is a Getting Started document. That's basically the place that any new user should head to when they first start working with TDW. This goes into how to set it up, how to install it, how to run controllers. It's really pretty mandatory for getting going with TDW. We have our command API here, the documentation for all of the commands in the system. As I mentioned, there's over 200 of them, and they're grouped into categories like add_object commands, add_material, add_model commands, et cetera.
For example, if we click through to set_screen_size, it basically gives you the detailed documentation, with a couple of different syntaxes for calling the function and what the default values are. Then we have a whole slew of documents about different topics in the documentation.
Here, we have how-to documents that talk about audio and video recording, avatars, benchmarking. We have example controllers. I think I mentioned we had 45 of them. Here's all of the different example controllers and what they do. There's documents about physics, documents about our releases, et cetera. If any of you intend to use TDW for the work that you're doing in the summer school, you've got a lot of documentation here that will help you get going.
I think, at this point, we should dive into the details of what's in the platform. Let's start by talking about how TDW handles multiple modalities. Visually, we strive for the highest level of photorealism possible. We achieve this through the lighting and rendering approaches we use and the high-quality 3D environment and object models from our 3D model library. We use 100% real-time global illumination with no light map baking. Our lighting model utilizes a single light source, representing the sun, used for dynamic lighting.
This is the type of lighting that causes objects to cast shadows in a scene. In some interior scenes, though, there may be one or two additional point lights. But the primary light source is this single directional light representing the sun. Here, the complex shadows and pool of light on the floor in the scene you see on the screen come from this type of lighting. There are actually 3D tree models outside the building that are, in fact, casting these shadows through a transparent skylight in the roof of the building.
For general environment lighting, we use high dynamic range image, or HDRI, skyboxes. To understand what impact a skybox has, you can kind of think of it as a planetarium projection. HDRI images contain substantially more information than a standard digital image. They capture the lighting conditions at real locations for a given time of day. They're typically used in movies to integrate computer-generated imagery with live action photography. In this clip, TDW is automatically adjusting the elevation of the sun to match the time of day in the HDRI image. This affects the shadow length.
Also, the intensity of the sun is being adjusted to match the shadow strength in the image. The HDRI map is being rotated to simulate different viewing positions. The sun angle is correspondingly adjusted so the direction of the dynamic shadows continues to match the direction of the environment shadows from the HDRI map. As you can see, rotating this map around can produce dramatically different changes in the level and quality of the lighting within the same scene. It's really quite an effective technique.
Most scenes in TDW start off with some type of environment. Our environment assets span both indoor and outdoor scenes, including several environments that are created from high-quality scanned photogrammetry assets. Many environments are designed for maximum variability, with large amounts of detail, both object and surface detail, so any arbitrary viewpoint within the scene will deliver a suitably complex and varied background.
The outdoor images on this slide contain assets such as rocks and pebbles, mossy boulders, areas of mud and grass, sections of cliff faces, and other real-world terrain elements that have been scanned from various locations around the world. For example, the lava beaches in Iceland that you see in the top right image. To create these scenes, large numbers of these assets are arranged in various configurations to create a complete landscape. When combined with a suitable HDRI map, the resulting environment can be quite convincingly realistic.
This type of scene is referred to as a streamed scene. You'll remember we loaded a streamed scene in that photoreal example; when loading a scene, we are essentially downloading an asset bundle that contains all of the scene data into the runtime simulation. Streamed scenes contain bounds data representing the region within which objects or agents can safely be spawned, as well as the sunlight and associated HDRI map for that scene. Now, users are free to change the HDRI maps in their controllers for a given scene using API commands, so you're not locked into using the HDRI map that is part of the streamed scene.
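For example, a controller might swap in and rotate a different skybox like this (a hedged sketch; the skybox name is illustrative, and c is assumed to be an already-running controller with a streamed scene loaded):

    # Replace the scene's HDRI skybox and rotate it to change the lighting.
    c.communicate([c.get_add_hdri_skybox("bergen_4k"),
                   {"$type": "rotate_hdri_skybox_by", "angle": 90}])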
Environments are then populated with objects from our library of high-quality 3D models. Most of our 3D model assets were originally created in high-end rendering applications and optimized for real-time 3D rendering. They use PBR, or physically based rendering, materials that respond to light in a physically correct manner. Many of the materials were originally scanned from real-world materials.
Models can be placed around a scene in various ways. They can be placed completely procedurally-- in other words, based on some algorithm such as stacking or random scattering within the bounds of a room. We see some examples of this here. Alternatively, object placement can be based on an explicitly scripted arrangement-- for example, a dining table set for dinner, or a scene such as our photorealism example on the top right, which we discussed earlier.
The 3D models themselves are highly optimized for research purposes, with custom physics and audio material behaviors designed to realistically and accurately simulate how those same objects physically interact in the real world. That includes the generation of physically correct impact sounds, as we will see in a moment.
Models are also normalized to real-world scale, given a canonical orientation, and semantically annotated with the appropriate WordNet synset noun category-- for example, chair, coffee maker, toy, dog. For every model, metadata such as object bounds and other useful information is saved in the JSON record libraries I mentioned earlier. Let's now talk about the audio modality next. Before I forget, I'm going to just turn on the ability to--
AUDIENCE: OK. I just wanted to ask you if you have any examples of people training computer vision models on renderings of these objects in TDW and then seeing how well they might transfer to the real world, or how well they might work on real data sets like ImageNet's test set or validation set.
[INTERPOSING VOICES]
JEREMY SCHWARTZ: Well, actually, there's a full discussion of that in our paper, which you can link to from the website. And the example with the chairs is actually exactly that kind of data set. In that particular case, those images were from one used with primates, but we have other data sets that have been used to do exactly what you're talking about-- test transfer to ImageNet. And if you look in our paper from the website, you'll see some data and a much more detailed discussion of how those experiments were conducted. That's one of the prime raisons d'être for TDW: generating those kinds of photoreal images for that type of transfer.
Let's talk about the audio modality. The audio modality is equally important in TDW, and the platform provides a high degree of acoustic rendering fidelity. For sounds placed within interior environments, TDW uses a combination of Unity's built-in audio and Resonance Audio 3D spatialization to provide real-time audio propagation, high-quality simulated reverberation, and directional cues via head-related transfer functions.
Sounds are attenuated by distance, and can be occluded by objects or environment geometry. The reverberation model automatically varies with the geometry of the space, the virtual materials applied to the walls, floor, and ceiling of that space, and the percentage of room volume occupied by solid objects, such as furniture.
However, it's TDW's advanced physics-based synthesis of impact sounds that's really the standout feature. TDW's PyImpact Python library uses modal synthesis to generate plausible, realistic impact sounds in real time based on the masses and materials of colliding objects, as well as parameters of the collision, such as object velocity and angles of impact returned by the build. Mode properties are sampled from distributions conditioned on properties of the sounding object.
The mode distributions were measured from recordings of actual impacts, using impulse responses captured from real-world objects of a given material, such as blocks of wood, cardboard boxes, metal bowls, et cetera. For those of you who don't know what I mean by an impulse response, you can think of it as a distinct audio signature of a given material. At the time the synthesis approach was being developed, human perceptual experiments were conducted.
In those experiments, listeners could not distinguish the synthetic impact sounds from real impact sounds, and could accurately judge physical properties from the synthetic audio. PyImpact currently supports 14 material types, including metal, glass, ceramic, soft and hard plastics, cardboard, stone, and others. Materials were sampled across several object size categories. We'll shortly be making some significant extensions to PyImpact that will add scraping and rolling sound synthesis models to the existing impact sound synthesis.
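As a very rough sketch of what this looks like from Python (the constructor argument and enum shown here follow the 2021-era PyImpact documentation, but treat the details as assumptions):

    from tdw.py_impact import PyImpact, AudioMaterial

    # Instantiate the synthesizer; initial_amp scales overall loudness.
    p = PyImpact(initial_amp=0.5)
    # The supported audio material types are enumerated in AudioMaterial.
    print([m.name for m in AudioMaterial])

Within a controller loop, PyImpact consumes the collision and rigidbody output data returned by the build and produces the commands that play the synthesized impact sounds. Let's look at and listen to some examples.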
[TAP]
[CLICK]
[THUMP]
[TAP]
[TAP]
[CLICKS]
[THUMP]
[THUMP]
In this clip, we had some examples from a data set used for object mass and material estimation, plus a Rube Goldberg machine type of setup we constructed to demonstrate both the impact sound synthesis and some more complex physical interactions in a photorealistic setting. That was the one where the monkey collides with the various objects. By the way, the full controller for the Rube Goldberg demo can be found in the use cases section of the TDW repo.
I should also point out, for the gentleman who just asked the question, that another use case there, for image synthesis, is a full controller that's been used to generate these types of 1.3-million-image data sets for potential transfer to non-synthetic image data sets like ImageNet. So you can always refer to that if you're interested in seeing more detail about how those kinds of data sets are created.
[TAP]
While we haven't discussed agents in TDW yet, since we're on the topic of multimodality and audio, I thought we should discuss this next use case here. We are actively developing a challenge that focuses on the multimodal aspects of TDW. In this challenge, an embodied agent is spawned inside a single-room environment.
The agent hears an unknown object fall to the ground somewhere in the room. The agent must locate and retrieve this target object by using both visual and auditory modalities. Objects may be behind a sofa, on top of a cabinet, inside a containing object, or occluded by other objects, such that the agent may need to physically move them to reveal the target object.
What you'll see is some objects dropping, making sound, and then you'll see kind of a simulation of the agent hearing the sound and then going to explore, looking for the sound. Here's that object drop, the key dropping on the floor. Now, this is kind of a mock-up of what the challenge is meant to do, but it's still being generated from within TDW. But as I say, the challenge is still in development-- the AI models for the challenge are still being built.
Let's talk about physical scene interaction. The first type of physical scene interaction we'll discuss is object-to-object. We've gone to great lengths to enable believable and realistic object interactions through accurate physics behavior. TDW includes two separate physics engines which serve different purposes.
Unity's basic physics engine, PhysX, handles rigid body physics, including the collisions between rigid bodies. For example, by applying a forward directional force to an object, it can be made to collide with other objects, as we see on the left. Or we can apply an upward force at a specific point-- for example, to tip a dining table and make objects on the table slide or roll off, as we see in the right side.
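As a hedged example, direct interactions like these are driven by commands such as the following (c is assumed to be a running controller, and object_id the ID assigned when the object was added):

    # Push the object forward so it collides with others.
    c.communicate({"$type": "apply_force_to_object",
                   "id": object_id,
                   "force": {"x": 0, "y": 0, "z": 15}})
    # Apply an upward force at a specific point, e.g. to tip a table.
    c.communicate({"$type": "apply_force_at_position",
                   "id": object_id,
                   "force": {"x": 0, "y": 8, "z": 0},
                   "position": {"x": 0.5, "y": 0.7, "z": 0}})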
To achieve what we refer to as fast but accurate rigid body collisions between our library models, we use the V-HACD approximate convex decomposition algorithm to generate groups of convex hull mesh colliders. In this image, the convex hull colliders are shown in green. These highly form-fitting colliders are economically organized and provide an optimal balance between performance and accuracy. If we used a full mesh collider on our objects, performance would be severely impacted.
However, if we used simple but performant colliders, like box or sphere colliders, they would only roughly approximate the shape of our objects, and we wouldn't get the accurate and realistic physics behavior we require. To further refine object interaction behavior, users can modify mass, friction, and restitution, or bounciness, at runtime on a per-object basis.
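Those per-object adjustments are made with commands along these lines (a hedged sketch; c and object_id are assumed as before):

    # Override the object's mass and its physic material values at runtime.
    c.communicate([{"$type": "set_mass", "id": object_id, "mass": 2.5},
                   {"$type": "set_physic_material", "id": object_id,
                    "dynamic_friction": 0.4, "static_friction": 0.5,
                    "bounciness": 0.7}])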
The second physics engine used in TDW, NVIDIA Flex, uses a particle-based representation of the underlying model to manage collisions between different object types. On the left, we use the cloth simulation to drop a rubbery sheet, which collides with a rigid body fire hydrant object. On the right, a fridge model is dropped into a pool of water, causing significant displacement and splashing.
Dropping objects of different sizes, masses, and/or materials into fluids and observing the splash behavior can be useful in estimating these quantities. This type of unified object representation can help machine learning models use both the underlying physics and rendered images to learn the physical and visual representation of the world through interaction with objects in the world.
Differentiable forward predictors that mimic human-level intuitive physical understanding are considered important for enabling deep learning-based approaches to model-based planning and control applications. Creating end-to-end differentiable neural networks for intuitive physics prediction is thus an important area of research. However, the quality and scalability of learned physics predictors has been limited in part by the availability of effective training data. We saw this area as a compelling use case for TDW, highlighting its advanced physical simulation capabilities.
Last year, we developed a comprehensive benchmark for the training and evaluation of physically realistic forward prediction algorithms. This publicly available benchmark goes well beyond existing related benchmarks. It contains a varied collection of physical trajectories that make extensive use of TDW's deformable cloth and fluid capabilities, and provides scenarios with complex real-world object geometries and photorealistic textures. Here's a sampling of some of the different physics scenarios represented in this benchmark. I'll point out and comment on several of these.
I've got a little pointer here. The one I'm on right now is stability. Most real-world tasks involve some understanding of objects' stability and balance. Unlike simulation frameworks where object interactions have predetermined stable outcomes, with TDW, agents can learn to understand how geometry and mass distribution are affected by gravity. On the right here, we have object permanence. Object permanence is a core feature of human intuitive physics, and agents must learn that objects continue to exist when out of sight.
Down below is sliding versus rolling. Predicting the difference between an object rolling or sliding, which is an easy task for adult humans, requires a sophisticated mental model of physics. Agents must understand how object geometry affects motion, as well as understand some rudimentary aspects of friction. Then there are simple collisions, which we see over here-- we've seen some examples of that already. Agents must understand how momentum and geometry affect collisions, because what happens when objects come into contact affects how we interact with them.
The top left here, we have draping and folding. By modeling the way in which cloth and soft bodies behave differently than rigid bodies, TDW allows agents to learn that soft materials are manipulated into different forms depending on what they are in contact with. And then last but not least, submerging. Fluid behavior is different than solid object behavior, and interactions where fluid takes on the shape of a container and objects displace fluid are important for many real-world tasks.
Now, Stanford's NeuroAILab used TDW to train a learnable physics simulator to predict physics behavior using a subset of the physics data set scenarios I just described. The scenarios use various 3D shapes-- such as ball, cone, cube, dumbbell, and octahedron-- along with cloth and rigid and soft materials, to construct the following scenarios.
Lift, where objects are lifted and fall back on the ground. Slide, where objects are pushed horizontally on a surface under friction. Collide, where objects collide with each other. Stack, where objects are stably or unstably stacked on top of one another. And cloth, where cloth is either dropped on one object or placed underneath and lifted up.
In this video, we can see the results of the model predictions as compared to the ground truth simulations. As you can see, all of the predictions look physically plausible, without really any unnatural deformations. Just for your reference, the TDW physics benchmark lives in its own separate repo, shown here, but it can be accessed from the top-level readme in the TDW repo by just going to the high-level API section and clicking through to physics.
Now, let's delve into what we mean by agent in TDW. TDW supports a range of agent types. At the most basic level, avatars or agents can be as simple as a disembodied camera capable of returning image data from the build to the--
AUDIENCE: Sorry, Jeremy, we have a question here.
JEREMY SCHWARTZ: Oh, OK. Sure.
AUDIENCE: Yeah. Just a quick question about which types of physics interactions are handled. Are rigid body-soft body interaction physics handled? Say I drop a rigid body object onto a suspended cloth-- will those two objects interact?
JEREMY SCHWARTZ: Did you-- sorry, that was just on the screen. I guess maybe you missed the video, or?
AUDIENCE: Yeah, I wasn't sure if that was, like-- for example, would the cloth repulse the rigid body object rather than-- for example, sometimes, those relationships are uni-directional. So the cloth could be removed from under the rigid body object, the cloth would move the rigid body object, but--
[INTERPOSING VOICES]
JEREMY SCHWARTZ: Yeah, no, there's a drape-- I can try to go back to it. Hang on a second. I mean, there's an example of dragging, which is basically an object sitting on a cloth, and then the cloth is dragged along the ground by an invisible force, and of course, the object falls over and is dragged along with it. That's sort of the inverse of the case where the cloth is falling onto a rigid body object and kind of conforms around the shape of the object. But yeah, all of these types interact with each other.
The rigid bodies are dropping into a container of fluid and displacing fluid based on the mass of the object and the shape of the object. Cloth-- those bouncy cloth balls are dropping onto solid objects and kind of wrapping around them. But if the situation was reversed and you had the soft kind of beanbag chair type of object sitting on the ground and you dropped a rigid body onto that, it would push its way into the soft surface of the object. So yeah, it's totally bidirectional in those interactions.
AUDIENCE: OK, thank you.
JEREMY SCHWARTZ: You bet. Let's see. As I was saying, at the most basic level, avatars can be as simple as disembodied cameras capable of returning image data from the build to the controller. You can also have more than one camera in a scene, providing combinations of first-person, third-person, and top-down views, for example.
We also have basic embodied agents whose avatars are geometric primitives, such as cubes, spheres, or capsules, that can be moved around the environment by applying forces. These agents are actually very useful for algorithm prototyping, for example, and we will see an example of where they're used in a full simulation.
Complex robotics agents with advanced embodiments, such as articulated limbs, are capable of both mobility and sophisticated physical interactions with the environment and the objects within it. The second form of physical interaction we'll look at is agent-to-object, where this more advanced type of embodied agent physically interacts with the environment. In embodied AI research, it's especially important that embodied agents have physically mapped action spaces that allow them to interact with the environment, effectively changing both object and scene state.
To that end, in TDW, we have Magnebot, which is a robotic agent with articulated arms that terminate in nine-degree-of-freedom magnet-type end effectors. Magnebot is fully physics-driven. There's no animation involved at all. Directional movement and turning are achieved by controlling revolute joint drives. Arm articulation utilizes one-degree-of-freedom or three-degree-of-freedom joints in combination with an IK, or inverse kinematics, system to facilitate sophisticated reaching actions.
As you can see, Magnebot can also move its torso vertically along its central column, which is implemented as a prismatic joint, allowing it to reach objects at a considerable height above the ground. Agents like Magnebot can be equipped with cameras capable of generating RGB images, as well as various camera passes, such as depth, normals, object segmentation, semantic classification, and pixel flow. Besides the agent's egocentric view, additional cameras can be linked to the agent to provide third-person follow cameras, or even a static tracking camera view.
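For example, requesting those extra camera passes is a single command (a hedged sketch; "a" is assumed to be the avatar ID of the agent's camera):

    # Enable additional image passes alongside the RGB render.
    c.communicate({"$type": "set_pass_masks",
                   "avatar_id": "a",
                   "pass_masks": ["_img", "_depth", "_normals", "_id", "_category", "_flow"]})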
Let's revisit our API for a moment in the context of physical interaction. Where interaction is concerned, it helps to think about the API as being composed of three layers. The main API contains low-level commands that operate directly on the revolute and prismatic joints of robot agents such as Magnebot, or other robot models within TDW. When I talk about this TDW API, I'm talking about the 200-plus commands that we looked at in the main TDW repo, just to be clear. For example, a robotics command such as set_revolute_target will turn a revolute drive, such as the wheels on the Magnebot.
To facilitate Magnebot's mobility and scene interaction, an additional high-level API layer built on top of this lower-level API combines low-level commands into actions, such as move_to a location and turn_by an angle for mobility, and reach_for a target position and grasp a target object for the arm articulations necessary to pick up and place objects. For specific project use cases, such as challenges, which may have requirements for specialized variations of commands in the Magnebot API, we will typically develop a third, ultra-high-level API layer. We'll see an example of that in just a moment.
Here is a little bit of code using the Magnebot API, just so you can see the difference from the code example we looked at before. Basically, we're importing the Magnebot and Arm classes from the Magnebot module. We create a Magnebot and initialize a scene, which basically spawns the Magnebot inside an empty room.
There's other things that init_scene can do, but that's basically what it's doing here. We then turn by 120 degrees. We move by one unit. We then reach for a target with the left arm, and then reset the arm and end the simulation. On the right, you see what's being generated by that particular snippet of code.
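That snippet looks roughly like this (a hedged paraphrase of the slide; exact signatures may differ slightly between Magnebot versions):

    from magnebot import Magnebot, Arm

    m = Magnebot()
    m.init_scene()                  # Spawn the Magnebot in an empty room.
    m.turn_by(120)                  # Turn by 120 degrees.
    m.move_by(1)                    # Move forward by one unit.
    # Reach for a target position with the left arm, then reset the arm.
    m.reach_for(target={"x": 0.3, "y": 0.4, "z": 0.3}, arm=Arm.left)
    m.reset_arm(arm=Arm.left)
    m.end()                         # End the simulation.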
Alongside the development of the Magnebot API, we created 36 pre-populated multi-room home environments designed for agent navigation and physical scene interaction. All objects in these environments are physics-enabled. While these environments are primarily intended to be used with the Magnebot as agent, they can certainly be used for other simulation purposes in TDW.
To simplify the use of these floorplan environments, TDW provides a dedicated controller, the FloorplanController, a child class of Controller that creates one of these interior scenes and populates it with objects. Here are three example floor plans. As I mentioned, there are basically 36 of these that we've created so far.
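Using it looks something like this (a hedged sketch; the scene and layout identifiers follow the naming convention in the TDW documentation):

    from tdw.floorplan_controller import FloorplanController

    c = FloorplanController()
    # Create floorplan scene "1a" with object layout 0, then send the commands.
    commands = c.get_scene_init_commands(scene="1a", layout=0, audio=False)
    c.communicate(commands)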
An additional benefit of using these floor plan environments is that each floor plan layout has an associated occupancy map that agents can use to facilitate navigation. The occupancy map shows open cells within the environment space that are not occupied by obstacles such as furniture or other props. In other words, if the agent were to stick to areas that are purple in the occupancy map shown here, it could move about this environment unimpeded. Now, the occupancy map is basically designed for some very basic navigation methods, such as a NavMesh or A*-type controller.
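In the Magnebot API, for instance, the occupancy map is exposed as a numpy array (a hedged sketch; the exact cell encoding is an assumption here, with 0 taken to mean free):

    import numpy as np

    # m is an initialized Magnebot; find the grid cells not blocked by obstacles.
    free_cells = np.argwhere(m.occupancy_map == 0)
    print(f"{len(free_cells)} free cells available for navigation")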
It's possible that a particular application may want to do a much more elaborate type of navigation, in which case the agent could use its vision system to render depth maps and then convert them to 3D point cloud data to represent the physical environment within the agent's field of view at any given point in time.
The Magnebot API is also in its own separate repo, shown here. Magnebot is actually also just a straightforward PyPI install, so you can just pip3 install magnebot. You can, again, link to the Magnebot repo from the TDW repo directly.
Let's talk about a detailed example of all of this. In conjunction with the MIT-IBM Watson AI Lab, we recently launched the TDW Transport Challenge, a visually guided task-and-motion planning benchmark for physically realistic embodied AI.
In this challenge, our Magnebot is spawned randomly in a simulated physical home environment. The agent must collect a small set of objects scattered around the house and transport them to a specific location. For example, a typical challenge task might be, transport one toy, two bowls, and one jug to the bed. The agent has an interaction budget-- in other words, a fixed number of actions that it must stay within in order to successfully complete the challenge.
We also positioned containers around the house that can be used as tools to transport objects efficiently. On its own, the agent can carry at most two objects at a time. However, using a container, it can carry several objects at once. However, locating and retrieving a container uses up valuable interaction steps.
Therefore, the agent must plan the optimal path to transport the objects to the goal location and reason about whether to use containers or not. To summarize the challenge, to complete the task, an embodied agent must plan a sequence of actions to change the state of a large number of objects in the face of realistic physical constraints.
As I mentioned, this challenge includes an additional high-level action space built as an ultra-high-level API layer on top of the Magnebot API. This API layer includes commands such as Put_In and Pour_Out. Here, you see Magnebot performing the Put_In action, slowed down a little so you can see the nuances of the arm articulations taking place. Combining several high-level commands into a single ultra-high-level command not only streamlined the controller code required to perform a challenge task, but also made for a better fit with the OpenAI Gym wrapper used by the challenge infrastructure.
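As a purely hypothetical sketch of how such an ultra-high-level action might be composed on top of the Magnebot API (the actual challenge code differs, and the state-lookup attribute names here are assumptions):

    from magnebot import Magnebot, Arm, ActionStatus

    class TransportMagnebot(Magnebot):
        def put_in(self, object_id: int, container_id: int) -> ActionStatus:
            # Grasp the target object with one arm.
            status = self.grasp(target=object_id, arm=Arm.right)
            if status != ActionStatus.success:
                return status
            # Reach over the container (position lookup via scene state is assumed).
            pos = self.state.object_transforms[container_id].position
            self.reach_for(target={"x": float(pos[0]), "y": float(pos[1]) + 0.4,
                                   "z": float(pos[2])}, arm=Arm.right)
            # Release the object so it falls into the container.
            return self.drop(target=object_id, arm=Arm.right)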
Here, we see an example of one type of challenging situation the agent needed to deal with, and why a synergy between navigation and grasping is important for successfully performing a task such as retrieving a target object occluded by other objects. Grasping might fail, for example, if the agent's arm cannot reach an object. Here, you can see both the success and failure cases of that same type of action.
Of course, in some situations, the agent can become so stuck on an environment obstacle that it's unable to recover, as we see here. Basically, the agent is stuck kind of at that corner of the wall there, and just fails to be able to get out. Here's an excerpt from an actual challenge task, again slightly slowed down so we can better see the agent in action. It's about a minute long. You can see how the agent is using the container to collect several objects in succession before transporting them to the goal location. We'll just watch this for a minute.
You'll also notice that the agent is quite capable of pushing other objects out of the way, because everything in the environment is physics-enabled. Clearly, not all of the objects on the floor are actually target objects, i.e. objects that have been assigned as targets in the task definition. The agent must use its vision system to determine which objects it needs to pick up and which ones to ignore. Having reached the goal location, the agent performs a Pour_Out action and terminates the task.
Note that the criteria for successful completion of the task do not require the agent to drop the objects onto the bed itself. The goal location is actually a small region in front of the bed, defined by a radius from the centroid of the bed. I'm just showing you here a link to the transport challenge repo. Again, this is accessible from the main TDW repo under high-level APIs.
Let's look at some other types of embodied agents available in TDW. To further TDW's capabilities in the area of embodied AI, we've begun developing a new human-like agent, the humanoid agent, designed for advanced human-agent collaboration. This agent will be able to physically interact with the environment at a detailed level. What we see here, currently called the TDW humanoid in the current version of TDW, is an early prototype of what will eventually be the humanoid agent.
As we see here, the humanoid agent will utilize a range of photorealistic 3D model skins, allowing the agent to be male, female, or even a child. Of course, non-photoreal representations will also be supported, as you can see on the left side of the slide. This visual realism will be augmented with equally realistic body motion derived from motion capture data. Motion retargeting will allow the same set of motions to be applied equally to adult and child versions of the agent.
We also plan to include motion blending-- i.e., seamless transitions between body motions-- coupled with the ability to procedurally modify the agent's motion at runtime, for example, in reaction to scene events or changes in scene state. This will enable reactive animations as event feedback-- for example, the agent's head turning in the direction of a new point of interest. This runtime modification approach will allow us to modify pre-existing animations in response to changes in object state-- for example, changing the agent's pick-up animation dynamically to reflect a new or moving target location.
The agent will also have fully articulated hands capable of high-dexterity physical interaction. For example, this would enable performing fine motor control tasks, such as doing a jigsaw or similar puzzle. These hand actions could potentially transfer to a real-world robot hand, such as the Shadow Dexterous hand. The agent's structure will include a camera for egocentric views that will rotate along with the agent's head. As with other TDW agents, additional depth, ID, and semantic classification camera passes will also be available.
As a first step towards supporting sim-to-real transfer to real-world robots, TDW can now import standard URDF robot description files. This allows users to import their own robot models and control them inside a TDW simulation. Some of the existing robot models in the TDW distribution include Sawyer, Fetch, Baxter, UR-5, and UR-10. In this example, the movement of the UR-5 robot arm is being controlled through a series of low-level API commands that drive the revolute joints of the arm. By using these low-level commands, users could potentially build high-level interaction behaviors like those provided by the Magnebot API.
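For example, adding one of the built-in robots and driving a joint looks roughly like this (a hedged sketch; the robot name follows the TDW robot library, and the joint ID shown is a placeholder, since real joint IDs come from static robot output data):

    from tdw.controller import Controller
    from tdw.tdw_utils import TDWUtils

    c = Controller()
    robot_id = c.get_unique_id()
    # Create an empty room and add a UR-5 robot to it.
    c.communicate([TDWUtils.create_empty_room(12, 12),
                   c.get_add_robot(name="ur5", robot_id=robot_id)])
    # Drive one revolute joint to a target angle. A real controller would first
    # read the joint IDs from static robot output data; a placeholder is used here.
    joint_id = 0
    c.communicate({"$type": "set_revolute_target",
                   "id": robot_id,
                   "joint_id": joint_id,
                   "target": 45})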
Let's talk about the third interaction paradigm, which is human-to-object, meaning a human user interacting directly with the scene in virtual reality. We currently support the Oculus Rift headset with Oculus touch controllers, but we'll be working to support the Oculus Quest 2 with its built-in hand tracking as soon as the updated version is released at the end of this month. As you can see, this provides sophisticated object interaction and control using the user's own hands. When used with our library objects and their form-fitting colliders, this provides a lot of opportunity for complex behaviors and object interactions.
Let's take a quick look at a use case example that utilized TDW's VR capabilities. This experiment, performed by researchers at Harvard and the Stanford NeuroAILab, investigated the patterns of attention that human observers and intrinsically motivated neural network agents exhibit in an environment with multiple animate agents and static objects. Both the human in VR and the observer agent are placed in a room with multiple inanimate objects, together with several differentially controlled actor agents. These actor agents, the colored spheres that you see moving around, are controlled by either hard-coded or interactive policies implementing various behaviors.
Both humans and intrinsically motivated neural network agents had to discover which actor agents are interesting, and thus worth paying attention to, based on their animate behavior. Interestingly, the socially curious neural network agents produced an accurate attentional gaze pattern that's quite similar to that of human adults measured in the VR environment, arising from the agent's discovery of the inherent relative interestingness of animacy.
Well, thank you. That's basically the content of my tutorial. I've listed the relevant URLs for all of the repos again here, as well as my email if anyone wants to contact me with further questions or discuss what TDW can do, et cetera.