ThreeDWorld (TDW) Tutorial
Date Posted:
April 13, 2022
Date Recorded:
April 1, 2022
Speaker(s):
Jeremy Schwartz, Seth Alter
Description:
In this tutorial, Jeremy Schwartz will walk us through the features and capabilities of ThreeDWorld, a high-fidelity, multi-modal platform for interactive physical simulation.
Next, Seth Alter will conduct a tutorial lab session.
The repository is available here (please note that you do not need to run the code in advance; the only requirement is to have Python 3.6 or higher installed).
FERNANDA: Hi, everyone. Thank you for coming to the Computational Tutorials. For those of you that are in the building, I highly encourage you to come in person. We have pastries. And the lab will be very interactive. So we hope to see you here. Today, we have Jeremy Schwartz and Seth Alter presenting on 3D World. It's a very cool tool, and they're going to tell you a lot about it. It's widely used in a lot of the cognitive sciences labs here. And we're very excited to hear them out.
JEREMY SCHWARTZ: OK, well, thanks, Fernanda. Hello, everyone. As Fernanda said, I'm Jeremy Schwartz, and this is Seth Alter. I'm the project lead for the development of 3D World, or TDW as we call it for short. And Seth is the lead developer and architect of the API for TDW. So like the slide says, TDW is a multimodal platform for interactive physical simulation. With TDW, users can simulate high fidelity sensory data and physical interactions between mobile agents and objects in a wide variety of rich 3D environments.
So TDW was released as a public platform in July of 2020 and is being actively used by many institutions and research labs, both here in the US and abroad. The TDW website is www.threedworld.org, and you can reach the TDW GitHub from there as well. TDW is funded in large part by IBM through the MIT-IBM Watson AI Lab.
So in the first half of the tutorial, I'm going to give a general overview of TDW's features and capabilities and drill down on how it's being used for embodied AI and physical scene interaction. I'll also share some work in progress on developing realistic humanoid agents suitable for experiments in human-robot collaboration, for example, as well as some new particle-based dynamics capabilities we just added to TDW.
The second half of the tutorial will be more of a lab or workshop, where Seth will go into depth on several code examples. And you can try out for yourselves a physics-based controller we prepared for this tutorial. But first, I guess I'll give a little background on TDW and how we developed it, why we developed it. As you all know machine perceptual systems typically require large amounts of labeled data, data that's laborious to come by and can also be quite expensive.
In addition, some quantities, such as the mass of an object or the material it's made from could be difficult for human observers to label accurately. So around four years ago, we started developing TDW as a way to address the situation. The idea was that by generating scenes in a virtual world, we could have complete control over data generation and all the generative parameters associated with that. This would allow us to train machine perceptual systems as virtual agents inhabiting the world.
TDW is built on top of the state-of-the-art game development platform Unity. Unity is cross-platform, which allows us to run TDW on Windows, OS X, and Linux. It handles the image rendering, audio rendering, and physics for us. So in this overview, there are four key aspects of TDW I'd like to talk about. First is the generality and flexibility of TDW's design. Second is its systems architecture and the API that supports the generality of that design.
Third is how we afford equal status to visual and auditory modalities. And fourth is TDW's advanced physics capabilities that allow rigid-body objects, soft-body objects, cloth, and fluids to interact. Coupled with that, we'll explore the multiple paradigms we use to interact with objects and generate physically realistic behavior, whether that's direct, object-to-object interaction, where users directly affect objects through API commands, or indirect, agent-to-object interaction that utilizes some form of embodied agent. TDW supports several types of these agents, as we'll see. And users can even interact directly in VR with objects in the scene.
So a key goal in designing TDW was to create a very general and flexible platform capable of supporting a wide variety of use cases. What this means in practice is that compared to some other simulation platforms and frameworks, TDW doesn't impose any particular metaphor on the user in terms of the types of simulations it can generate. For example, some frameworks only support interior floor plan environments, in other words, rooms with furniture, or specific paradigms like navigation. TDW can do those too, but we can also generate experimental stimuli of a more specific nature.
So we can, for example, support use cases dealing with fine-grained image classification and object detection, physical prediction and inference, play behavior, and task and motion planning. You see some image examples from these kinds of use cases here. So let's actually take a look at some examples of some very different kinds of use cases. So here, we see an example of generating data sets of synthetic images, usually used for training networks to generalize against real-world images such as those from ImageNet.
These images were generated for the purposes of collecting neural responses from primates when viewing synthetically generated images. Here, 3D objects for all exemplars of a given semantic category, chairs in this case, are loaded into a virtual scene. To increase variability, each image has randomized camera and positional parameters and may have additional random parameters such as the angle of the sun or the visual materials of the model. This randomness is constrained somewhat in order to guarantee that the object is always at least partially in the frame.
Here, we see an example of TDW being used for the training and evaluation of physically realistic forward prediction algorithms. As human beings, we learn at a very early age that the results of objects coming into contact with each other affect how we interact with them. For agents to learn this, they must understand, among other things, how momentum and geometry affect collisions. In this clip, randomly selected toy objects are created with random physics material values. A force of randomized magnitude is applied to one toy, which is then aimed at another toy.
And the third use case area for TDW, as shown in this third example, is embodied AI, where embodied agents are trained to interact with the environment and potentially change scene state in some way. Here, you see an agent performing part of a TAMP, or task and motion planning, task involving the location and retrieval of a target object. We'll dig deeper into this area as well when we discuss physical interaction in TDW.
However, before diving into the features in detail, I think it'd be helpful to look at the platform's high-level systems architecture and introduce some terms that you'll hear throughout this tutorial. And while we're looking at the architecture, we can also take a look at our API. So a TDW simulation is composed of two main components. The build, which is what you see in blue on the left, is a Unity executable of the TDW simulation engine. And again, that can run on Linux, OS X, or Windows.
The build is responsible for image rendering, audio synthesis, and all physical simulation. The controller in green is a Python program which communicates with the build over TCP/IP and uses TDW's comprehensive command and control API. The controller sends commands to the build which executes those commands. The build can return a wide range of data types to the controller representing the state of the world.
Data types can include image data, such as image renders, segmentation ID images, semantic classification images, depth maps, and normal maps; collision data, including whether objects are discretely impacting, rolling, or scraping; and spatial and transform data such as position, orientation, bounds, et cetera. Of course, there are many other kinds of data that we don't have time to describe right now. So users, i.e., researchers, write controllers to suit the needs of their use case. Basically, Python skills are really the only requirement for using TDW successfully.
So in addition to the build and controller, the platform architecture includes two other key components. The first is an Amazon S3 repository where 3D object models, scene models, material files, and HDRI skyboxes are stored. I'll explain more about all of these in a minute. Object and environment models are downloaded at runtime into the build as asset bundles, which are compressed binary versions of the model data.
Once downloaded, all model data is cached, which means that rebuilding a scene, for example when running successive trials, is essentially instantaneous. The second key component is a JSON records database, which is stored locally. The database contains all model and other metadata used by TDW. A set of librarians, which are basically Python wrapper classes, handle the querying of these metadata records at runtime by the controller.
So the API contains over 200 commands covering tasks like scene setup and manipulation, object loading and modification, camera and rendering controls, object interaction using physics, and agent navigation and control. And basically, these are general-purpose atomic commands. So you can think of them as LEGO-like building blocks for creating higher-level behaviors. Now, unlike many available simulation frameworks, TDW controllers can send multiple commands per time step, allowing arbitrarily complex simulation behavior. The build can run standalone, locally on a laptop for example, or on a remote server. It can also run within a Docker container.
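To make that concrete, here is a minimal sketch of that pattern using the public Python API (the model name, force, and room size are arbitrary choices, not taken from the talk):

```python
from tdw.controller import Controller
from tdw.tdw_utils import TDWUtils

c = Controller()  # launches the build and connects to it over TCP/IP
object_id = c.get_unique_id()

# Several atomic commands sent in one communicate() call; they all execute
# on the same simulation step.
c.communicate([TDWUtils.create_empty_room(12, 12),
               c.get_add_object(model_name="iron_box",
                                object_id=object_id,
                                position={"x": 0, "y": 0, "z": 0}),
               {"$type": "apply_force_to_object",
                "id": object_id,
                "force": {"x": 2, "y": 0, "z": 1}}])
c.communicate({"$type": "terminate"})
```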
So the TDW documentation is one of the platform's strongest features. And we recently went through a whole revamp of the documentation to improve it and add more tutorials. Every command and every variable in the API is fully documented. And we have a large number of example controllers, more than 45 at this point, I think. There are also many documents that address specific topics, such as best practices for improving photorealism, how to handle observation data, et cetera.
OK, so let's start looking at some of the features. And we'll begin by talking about how TDW handles multiple modalities. So visually, we strive for the highest level of photorealism possible. We achieve this through the lighting and rendering approaches we use and the high-quality 3D environment and object models from our library. We use 100% real-time global illumination with no light map baking. Our lighting model uses a single light source representing the sun, which we use for dynamic lighting, which is the type of lighting that causes objects to cast shadows in a scene.
For the general environment lighting, we utilize high dynamic range image, or HDRI, skyboxes. And if you don't know what a skybox is, you can think of a planetarium projection. It's basically the same kind of idea. HDRI images contain substantially more information than a standard digital image. They capture the lighting conditions at real locations for a given time of day. And they're typically used in movies to integrate computer-generated imagery with live-action photography.
So in this clip, TDW is automatically adjusting the elevation of the sun to match the time of day in the HDRI image. This affects the length of the shadows. Also, the intensity of the sun is being adjusted to match the shadow strength in the image. The HDRI image is also being rotated to simulate different viewing positions. The sun angle correspondingly adjusts so that the direction of the dynamic shadows continues to match the direction of the environment shadows in the HDRI map.
So most scenes in TDW start off with some type of environment. Our environment assets span both indoor and outdoor scenes and include several environments created from high-quality scanned photogrammetry assets. Many environments are designed for maximum variability, with large amounts of both object and surface detail, so any arbitrary viewpoint within the scene will deliver a suitably complex and varied background.
The outdoor images on this slide contain assets such as rocks and pebbles, mossy boulders, areas of mud and grass, sections of cliff faces, and other real-world terrain elements scanned from various locations around the world. One example, obviously, is the lava beaches in Iceland that you can see in the top right image. Environments are then populated with objects from our library of over 2,800 high-quality 3D models that spans over 200 semantic categories.
Models are normalized to real-world scale, given a canonical orientation, and semantically annotated with the appropriate semantic categories, such as chair, coffee maker, toy, dog, et cetera. Objects can be placed around the scene in various ways. They can be placed completely procedurally, i.e., based on some algorithm, such as stacking or random scattering, or using more complex rules, for example, rules that define the layout of a kitchen environment as you can see in the top left image there.
Alternatively, object placement can be based on an explicitly scripted arrangement, for example, the scene in the top right image, which you saw earlier. In that one, the objects were explicitly positioned and scripted for aesthetic reasons.
So talking about audio, the audio modality is equally important to TDW, and the platform provides a high degree of acoustic rendering fidelity. For sounds placed within interior environments, TDW uses a combination of Unity's built-in audio and Resonance Audio 3D spatialization to provide real-time audio propagation, high-quality simulated reverberation, and directional cues via head-related transfer functions. Sounds are attenuated by distance and can be occluded by objects or environment geometry.
Reverberation automatically varies with the geometry in the space, the virtual materials that are applied to walls, floor, and ceiling, and the percentage of room volume occupied by solid objects such as furniture. However, TDW's advanced physics-based synthesis of impact sounds is really the standout feature. TDW's PyImpact Python library, developed in collaboration with James Traer from the McDermott Lab, uses modal synthesis to generate plausible realistic impact sounds in real time based on the masses and materials of colliding objects as well as parameters of the collision such as object velocity and angles of impact returned by the build.
PyImpact currently supports 14 material types, including metal, glass, ceramics, soft and hard plastic, cardboard, stone, and others. So let's look at and listen to some examples. In this clip, we have some examples from a data set used for object mass and material estimation plus a Rube Goldberg machine style setup we constructed to demonstrate both the impact sound synthesis and complex physical interactions in a photorealistic setting.
Hopefully, everybody heard that OK. So recently, we again collaborated with the McDermott Lab, this time with Vin Agarwal, to add a synthesis model for scraping sounds to PyImpact. Unlike impact sounds, scraping synthesis also uses the notion of material surface roughness and imperfections to synthesize plausible scraping sounds. A series of ultra-high-resolution scans of actual materials, such as soft and hard woods, various plastics, ceramics, et cetera, informs the synthesis model, in addition to the impulse responses of the physical materials and the parameters of the collisions.
So we'll be following this up with a third synthesis model for rolling sounds. So I just have a very short example. This is fairly new, so we really haven't put together too many fancy demos of this. But I just have a very short demonstration of the scraping sounds. Hopefully, you can hear it OK.
[SCRAPING SOUND]
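For reference, here is a hedged sketch of how impact-sound synthesis is typically wired into a controller with the current add-on API (class names and defaults are taken from the public documentation and may differ by version; the object and camera placement are arbitrary):

```python
from tdw.controller import Controller
from tdw.tdw_utils import TDWUtils
from tdw.add_ons.third_person_camera import ThirdPersonCamera
from tdw.add_ons.audio_initializer import AudioInitializer
from tdw.add_ons.py_impact import PyImpact

c = Controller()
# The avatar (camera) acts as the audio listener.
camera = ThirdPersonCamera(avatar_id="a",
                           position={"x": 1, "y": 1.6, "z": -2},
                           look_at={"x": 0, "y": 0.5, "z": 0})
audio = AudioInitializer(avatar_id="a")
py_impact = PyImpact()  # synthesizes impact sounds from collision events
c.add_ons.extend([camera, audio, py_impact])

object_id = c.get_unique_id()
c.communicate([TDWUtils.create_empty_room(12, 12),
               # get_add_physics_object returns a list of commands.
               *c.get_add_physics_object(model_name="vase_02",
                                         object_id=object_id,
                                         position={"x": 0, "y": 2, "z": 0})])
# Let the object fall and collide; PyImpact injects audio commands per frame.
for _ in range(200):
    c.communicate([])
c.communicate({"$type": "terminate"})
```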
OK, all right, so let's talk about the various ways TDW handles physical scene interaction. And we'll start off with object-to-object interactions. So we've gone to great lengths to enable believable and realistic object interactions through accurate physics behavior. TDW actually includes two separate physics engines, which serve different purposes. Unity's basic physics engine, called PhysX, handles rigid body physics, including the collisions between rigid bodies.
For example, by applying a forward directional force, an object can be made to collide with other objects, as we see on the left. Or we can apply an upward force at a specific point, for example, to tip a dining table and make objects on the table slide on or off, as we see on the right. To achieve what we refer to as fast but accurate rigid body collisions between our library models, we use the V-HACD approximate convex decomposition algorithm to generate groups of convex hull mesh colliders.
In this image, the convex hull colliders are shown in green on the objects. These highly form-fitting colliders are economically organized and provide an optimal balance between performance and accuracy. To further refine object interaction behavior, users can, of course, also modify mass, friction, and restitution, or bounciness, at runtime on a per-object basis.
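As a rough illustration, those per-object values map onto low-level commands roughly like this (the command names follow the public API; the object and numbers are placeholders):

```python
from tdw.controller import Controller
from tdw.tdw_utils import TDWUtils

c = Controller()
object_id = c.get_unique_id()
c.communicate([TDWUtils.create_empty_room(12, 12),
               c.get_add_object(model_name="iron_box",
                                object_id=object_id,
                                position={"x": 0, "y": 0, "z": 0}),
               # Override mass, friction, and restitution (bounciness) at runtime.
               {"$type": "set_mass", "id": object_id, "mass": 4.5},
               {"$type": "set_physic_material", "id": object_id,
                "dynamic_friction": 0.4, "static_friction": 0.5,
                "bounciness": 0.7}])
c.communicate({"$type": "terminate"})
```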
So the second physics engine used in TDW, NVIDIA Flex, uses a particle-based representation of the underlying model to manage collisions between different object types. On the left, we use the cloth simulation to drop a rubbery sheet, which collides with a rigid body fire hydrant object. On the right, a fridge model is dropped into a pool of water, causing significant displacement and splashing. Dropping objects of different sizes, masses, and/or materials into fluids and observing the splash behavior can be useful in estimating these quantities. Forward prediction algorithms that mimic human-level intuitive physical understanding are considered important for enabling deep learning approaches to model-based planning and control applications.
However, the quality and scalability of learned physics predictors has been limited in part by the availability of effective training data. So we saw this as a compelling use case for TDW, highlighting its advanced physical simulation capabilities. So in 2020, we developed a comprehensive benchmark for the training and evaluation of physically realistic forward prediction algorithms. Sorry, it's really tiny to see how to turn on the slide-- turn on the video, I mean.
There we go. This publicly available benchmark goes well beyond existing related benchmarks. It contains a varied collection of physical scene trajectories that make extensive use of TDW's deformable cloth and fluid capabilities and provides scenarios with complex real-world geometries and photorealistic textures. So here's a sample of some of the different physics scenarios represented in this benchmark set, including stability, object permanence, sliding versus rolling, simple collisions, draping and folding, and fluid displacement.
Let's let you watch that for a second. And again, this is just a small sample of the variation of physics behaviors that are included in this data set, which is actually a separate repo. And you can download this and work with it once you have TDW. So an AI lab at Stanford used TDW to train a learnable physics simulator to predict physics behavior using a subset of the physics data set scenarios I just described.
Various 3D shapes, such as bowl, cone, cube, dumbbell, and octahedron, that utilized cloth, rigid, and soft materials were used to construct the following scenarios. Lift, where objects are lifted and fall back on the ground. Slide, where objects are pushed horizontally on a surface under friction. Collide, where objects are made to collide with each other. Stack, where objects are unstably stacked on top of each other. And cloth, where a cloth is either dropped onto an object or placed underneath it and lifted up.
Here, we see the results of the model predictions as compared to the ground truth simulations. As you can see, all predictions look physically plausible without unnatural deformations. So just yesterday we introduced the first in a new series of particle-based dynamic simulators that are derived from a suite of plugins called Obi Physics. This will ultimately be replacing our current Flex-based fluid system.
Obi Fluid supports various shapes of fluid emitters and fluid materials such as water, honey, oil, chocolate, et cetera, as well as granular materials such as sand, rock, foam pellets, et cetera. Output data for particle trajectories and velocities is fully supported. We'll be releasing a corresponding cloth simulation within a few weeks, followed by support for soft deformable objects.
So as you can see, you can really set up a variety of different kinds of materials not only in terms of their appearance, but also their behavior. You can control the viscosity of not just the fluid, but the collision behavior and surface friction and everything of the objects themselves. So yeah, here's an example of some of the granular material kind of looking like little foam pellets.
OK, so at this point, let's delve into what we mean by agents in TDW. So TDW supports a range of agent types. Agent avatars can be as simple as one or more disembodied cameras returning image data from the build to the controller.
Basic agents whose embodiments are cube, sphere, or capsule primitives can be moved around the environment by applying forces. These agents are often used for algorithm prototyping. The next level up are complex robotic agents with advanced embodiments, such as articulated limbs, that are capable of both mobility and sophisticated physical interaction with the environment and the objects within it.
And finally, we're actively working on near photo real humanoid agents driven by motion capture data that will move and behave in a realistic fashion. These agents can perform typical human actions such as vacuuming the floor, setting the table, or carrying a tray of food. These agents are ideal for use in sophisticated human robot collaboration experiments.
So let's take a closer look at this agent-to-object physical scene interaction, where these more advanced types of embodied agents physically interact with the environment. For embodied AI research, it's especially important that embodied agents have physically mapped action spaces that allow them to interact with the environment, effectively changing both object and scene state. To that end, in TDW we have Magnebot, a robotic agent with articulated arms that terminate in nine-degree-of-freedom magnet end effectors.
Magnebot is fully physics driven. There's no animation involved. Directional movement and turning are achieved by controlling revolute joint drives. Arm articulation utilizes one-degree-of-freedom or three-degree-of-freedom joints in combination with an IK, or inverse kinematics, system to facilitate sophisticated reaching actions. As you can see, Magnebot can also move its torso vertically along its central column, implemented as a prismatic joint, allowing it to reach objects at a considerable height above the ground.
Agents like Magnebot can be equipped with cameras capable of generating RGB images as well as various camera passes such as depth, normals, object segmentation, semantic classification, and pixel flow. Besides the agent's egocentric view, additional cameras can be linked to the agent to provide third person follow cameras or static camera tracking views.
So let's just quickly revisit our API for a minute in the context of physical interaction. Where interaction is concerned, it helps to think about the API as being composed of three layers. The main TDW API contains low-level commands that operate directly on the revolute and prismatic joints of a robot agent, such as Magnebot or other robot models within TDW, as we'll see in a minute. For example, a robotics command such as set_revolute_target will turn a revolute drive such as the wheels on the Magnebot.
To facilitate Magnebot's mobility and scene interaction, an additional high-level API layer built on top of this lower-level API combines low-level commands into actions, such as move_to(location) and turn_by(angle) for mobility, and reach_for(target position) and grasp(target object) for the arm articulations necessary to pick up and place objects. For specific project use cases, such as challenges, which may have requirements for specialized variations of commands in the Magnebot API, we'll typically develop a third, ultra-high-level API layer.
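As a rough sketch of what that high-level layer looks like in practice, here is a minimal example using the separate Magnebot Python package (method and enum names are approximations from its public documentation and may differ by version):

```python
# pip3 install magnebot   (a separate package built on top of TDW)
from magnebot import MagnebotController, Arm, ActionStatus

m = MagnebotController()   # wraps a TDW controller plus a Magnebot agent
m.init_scene()             # a simple empty-room scene
m.move_by(2)               # drive forward 2 meters via the wheel revolute drives
m.turn_by(45)              # turn 45 degrees
# IK-driven arm articulation; actions report a status such as ActionStatus.success.
status = m.reach_for(target={"x": 0.4, "y": 0.5, "z": 0.5}, arm=Arm.right)
print(status)
# m.grasp(target=some_object_id, arm=Arm.right) would pick up an object in the scene.
m.end()
```

Let's take a look at an example of that third, challenge-specific layer right now.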
So in 2021, in conjunction with the MIT IBM Watson AI Lab, we launched the TDW transport challenge, a visually guided task and motion planning benchmark for physically realistic embodied AI. In this challenge, our Magnebot agent is spawned randomly in a simulated physical home environment. The agent must collect a small set of objects scattered around the house and transport them to a specific location.
For example, the typical challenge task might be transport one toy, two bowls, and one jug to the bed. The agent has an interaction budget, in other words, the fixed number of actions that it must stay within in order to successfully complete the challenge. We also position containers around the house that could be used as tools to transport objects efficiently. On its own, the agent can carry at most two objects at a time. However, using a container, it can carry several objects at once.
However, locating and retrieving a container uses up valuable interaction steps. Therefore, the agent must plan the optimal path to transport the objects to the goal location and reason about whether to use containers or not. So to summarize, to complete the task an embodied agent must plan a sequence of actions to change the state of a large number of objects in the face of realistic physical constraints.
For the challenge, we developed a high-level action space for the agent that includes commands such as put in and pour out. Here, you see Magnebot performing the put in action, slowed down so you can see the nuances of the arm articulations taking place. Here, we see an example of one type of challenging situation the agent needed to deal with, and why a synergy between navigation and grasping is important to successfully performing a task such as retrieving a target object occluded by other objects, where grasping might fail if the agent's arm cannot reach the object.
Of course, in some situations, the agent can become so stuck on an environment obstacle that it's unable to recover as we see here. So let's watch an excerpt from an actual challenge task again, slightly slowed down so that we can better see the agent in action so this is about a minute long. You can see how the agent is using the container to collect several objects in succession before transporting them to the goal location.
So as you'll see, not all of the objects on the floor are actual target objects, i.e., objects that have been assigned as targets in the task definition. So it will skip over a couple of objects momentarily. The agent must use its vision system to determine which objects it needs to pick up and which ones it needs to ignore.
So having reached the goal location, the agent performs a pour out action and terminates the task. Note that the criteria for successful completion of the task do not require the agent to drop the objects onto the bed itself. The goal location is actually a small region in front of the bed, defined by a radius from the centroid of the bed.
So let's take a look at some other types of embodied agents in TDW. Besides Magnebot, TDW also supports the import of standard URDF robot description files. As part of our efforts to support research around human-robot collaboration, this capability allows users to import their own robot models and control them inside a TDW simulation. Some of the existing robot models in the TDW distribution include Sawyer, Fetch, Baxter, UR-5, and UR-10.
So in this little example, the movement of the UR-5 robot arm is being controlled through a series of low-level API commands that drive the revolute joints of the arm. By using these low-level commands, users could potentially build higher-level interaction behaviors.
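Sketched against the low-level robotics API, that kind of joint control looks roughly like this (the Robot add-on and the set_revolute_target command are from the public documentation; the joint name and target angle are placeholders):

```python
from tdw.controller import Controller
from tdw.tdw_utils import TDWUtils
from tdw.add_ons.robot import Robot

c = Controller()
robot = Robot(name="ur5",
              robot_id=c.get_unique_id(),
              position={"x": 0, "y": 0, "z": 0})
c.add_ons.append(robot)
c.communicate(TDWUtils.create_empty_room(12, 12))

# Find a revolute joint by name and drive it to a target angle (in degrees).
for joint_id, joint in robot.static.joints.items():
    if joint.name == "shoulder_link":  # placeholder joint name
        c.communicate({"$type": "set_revolute_target",
                       "id": robot.robot_id,
                       "joint_id": joint_id,
                       "target": 70})
# Step the simulation until the joints stop moving.
while robot.joints_are_moving():
    c.communicate([])
c.communicate({"$type": "terminate"})
```

So far, we've looked at the robot side of human-robot collaboration. Let's take a look now at the human side. As I mentioned earlier, a big part of our current development with respect to embodied AI agents is the creation of a realistic humanoid agent.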
So we are calling this agent Replicant. And it will be able to physically interact with the environment at a detailed level. The Replicant agent will utilize a range of photorealistic 3D model skins, allowing agents to be male, female, or even a child. This visual realism will be augmented with equally realistic body motion derived from motion capture data.
We recently acquired a wireless motion capture suit that will allow us to build a custom library of human actions capable of driving the movements of this Replicant agent. The agent will also have fully articulated hands capable of high-dexterity physical interaction. For example, this will enable performing fine motor control tasks such as doing a jigsaw or similar puzzle.
Much like a high-end video game, a runtime motion blending and transition handling system will allow us to seamlessly blend between different motion capture animations. However, in addition to motion blending, we plan to include the ability to procedurally modify the agent's motion at runtime. This will allow reactive animations as event feedback, for example, the agent's head turning in the direction of a heard sound or other new point of interest.
This runtime system will also allow us to modify pre-existing animations in response to changes in object state; for example, the agent's pick-up animation will change dynamically to reflect a new or moving target location. So here's an example of some basic reaching and placing actions captured using the motion capture suit that we recently acquired. The motion data was only minimally processed.
So basically, this took place in an online demo of the suit that I was given. I basically just gave the guy wearing the suit a few directions: pick objects off of a shelf, place them on a shelf, do that kind of thing. So this is a fairly good example of what the motion looks like that we'll be able to have on our humanoid agent.
So here, we're testing a pure IK-based reaching action in conjunction with a motion capture move. This is very preliminary, and there's no transition from walk to idle stance yet. However, the agent is equipped with colliders and can interact with scene objects, such as the chair it collides with. In the reach sequence, the ball target object is being randomly positioned, and its location left or right of the agent centroid determines which arm is used to reach with.
So we plan to improve on this by layering IK on top of motion capture reach actions to improve the realism and fluidity of the resultant motion while allowing dynamic adjustments of the reach target position. So here, we're combining our reach action with systems for assigning affordance points to objects and maintaining the stability of held objects such as containers.
So the basket in this image has four important affordance points defined. The system can either use an explicitly labeled affordance or else select the one closest to the agent's end effector, depending on its orientation with respect to the basket. Affordance point locations are defined as JSON object metadata and can therefore be easily adjusted at any time. As the agent reaches for different target locations, the stability of the held container is being maintained automatically.
So over the next three months, we plan to make significant additions to our motion capture library, capturing a range of everyday actions relevant to human-robot collaboration and other use cases. Additionally, we'll be working on a physics-based grasping mechanic that will enable the agent to grasp and carry arbitrary objects in a realistic and stable fashion. We also plan to implement the full motion blending and transition system that I mentioned earlier as a first step towards the runtime motion modification system that is planned for phase two of our Replicant development.
So the third and last interaction paradigm I want to talk about is human to object, meaning a human user interacting directly with the scene in virtual reality. So I'm going to play this video. OK, so we recently introduced a new grasping system that is physics-based. The hands have colliders on them. They're capable of grasping objects in relatively arbitrary positions depending on where you grasp the object.
And as you can see, they respect the physics capabilities and the colliders of the objects. So even though you can't feel the weight of the object, you do get a sense of it: the physics is pushing the plate down a little bit when I drop an object on top of it. You can also see that we can do things like put our finger through the loop of the cup. So there's quite a lot of detailed interaction possible. Couldn't make up my mind which way to go there.
OK, so in this second example, we're actually utilizing what we call composite objects, which are objects with affordances and moving parts. So for example, we can open the oven door here and close the oven door. There are hinge joints on the oven doors that allow us to open and close them. I'm teleporting across the room to get to the microwave. The reason for that is that the physical space I had to move around in is smaller than the virtual space. And so I took advantage of the teleporting mechanism that we have built into the VR interface.
OK, so just one more little example. For some human-robot collaboration scenarios, the robot agent needs to learn about human activities by observing a human being performing those actions. Replicating the scenario in a virtual environment has many benefits, including scalability over thousands of iterations. Our IK-driven VR humanoid agent, which is something that we haven't quite released yet but are working on, essentially mirrors the actions being performed by the human user in VR, providing a full-body embodiment that a robot agent can observe performing actions in the same way it would a real human.
So on the right is what I'm seeing in the VR headset. And on the left is the third person camera view of the VR humanoid mirroring my movements. We can teleport around the virtual environment using the teleporting mechanism I showed earlier to interact with spaces larger than our physical VR space or the cable tethering would allow. So here, we see the VR humanoid interacting with an articulatable object to place a mug inside the microwave.
This is actually an earlier version of our articulation system. So the door doesn't quite open as smoothly in this video as it does in the more recent ones. And it still works pretty well.
OK, so yeah, that's basically the end of the kind of overview section. So I'm going to switch things over and hand it over to Seth now who's going to take us through the lab session and workshop.
SETH ALTER: So this is the workshop part of our demo. And this is the hands-on part. So who here has laptops? Did you all bring your laptop? Two people, three people, OK, four people who brought laptops, great. So we're going to be using TDW. And-- went the wrong way. This assumes that you have Python already installed. You have Python installed? Great. OK, so in your terminal, you're going to do pip3 install tdw.
So let me know when it's done installing. You can raise your hand. I think there's not that many people, so we can just do it that way.
AUDIENCE: There was a bunch of people on Zoom too.
SETH ALTER: A bunch of people on Zoom, that's right. Well, for everyone seeing it, I'll just assume you installed it correctly. OK.
So the second step is to clone the repo. If you have git, you can do a "git clone" from that link. These are the scripts we're just using for this demo this afternoon. If you don't have git, you can go to the website. It's the same thing without the .git extension, so https://github.com/alters-mit/tdw_bcs_demo.
And there will be a link there to download a ZIP file of the repo. But we're going to be working with this. And we're going to be editing some of the scripts inside it and then learning the scripts.
So the first script we're going to run is hello_world.py. So change directory to tdw_bcs_demo and then just run python3 hello_world.py. What this is going to do is launch a Python script, which launches like any other script. If you've downloaded the repo, you've already got this. You just need to follow the instructions at the bottom.
And so this is going to launch a Python script. But the other thing that's going to happen is that you'll see a window pop up, and that's the simulator application. So that window's going to pop up and disappear, and you'll see the words "hello world." And if all that happens, then everything's working just fine.
If you don't see that, or if there's an error, please let me know. There will be a step in between where it has to download the build application. That's part of the install process. And the idea is that if this works, with some exceptions that require specialized hardware, like a VR headset, you can run the majority of TDW if this works.
So the way this breaks down is that the part that says controller, that is doing most of the heavy lifting in TDW. We're creating a new controller. And that is the data object that's going to be communicating with that external simulator application. We call the simulator the build, typically, because we have to compile it, unlike a Python script.
And the way we communicate is by calling this function called communicate. That is the only means of data going in and out between a controller and a build. And what the communicate call does is send commands. It can send either one command or an arbitrarily long and complex list of commands.
Now, a command, within the context of the simulator, which you're not going to see, is relatively complicated. But on this side of things, it's just a dictionary like any other dictionary. And the way the program differentiates between commands is that "$type" key, which tells you what type of command it is.
So in this case, we have a terminate command. And we know it's a terminate command because of the "$type" value. It has no other parameters. If there were other parameters, they would be arranged like a dictionary, like position, or ID, or whatever. And all this is going to do is tell the build application to quit. This can be really useful if you are running things on a server and you really need to make sure that you don't leave processes lingering that other people need to shut down for you.
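So the whole of hello_world.py boils down to something like this (a minimal sketch of the same pattern; the actual demo script may differ slightly):

```python
from tdw.controller import Controller

# Creating the controller launches the build (the simulator window) and connects to it.
c = Controller()
print("hello world")
# A command is just a dictionary; the "$type" key identifies which command it is.
# "terminate" has no other parameters and tells the build to quit.
c.communicate({"$type": "terminate"})
```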
So Jeremy showed you a much nicer looking diagram, but this is the short version. The controller is the part that you write. That's the Python script. And you're going to send commands via that communicate call over to the build, which is the window that popped up. The build, on every communicate call, is going to spit out output data. And you can use that output data however you want. You can save it directly to disk if it's an image, or maybe you want to use the images to do some planning process.
An important thing to understand about commands, in the context of the rest of this workshop, is that you're not going to see many of them, but everything you are going to see uses them. There is no way to communicate between the build and the controller except with commands. That's the only mechanism. We never pass any other kind of data structure. But we can send arbitrarily long lists of commands.
So as Jeremy mentioned, if you wanted to simulate a robot arm, we could send a command per joint, for 12 or 24, however many, commands on the same frame. And they would all execute at the same time. So we can have any amount of complex behavior. And we can build complex behavior out of these very, very atomic, low-level commands.
In many cases, though, there's a lot of routine stuff that people do in TDW. And what this workshop is covering is largely the routine stuff. So to that end, we've wrapped a lot of commands in wrapper classes and wrapper functions. So we can call a function, the function generates a command, and then we send the command.
And that's just for ease of use. Things like adding a camera, and setting the position, and looking at a target are three separate commands. But since you almost always want to do those all at once, we wrap that up in one single call.
So you're going to see a lot of these wrapper functions for the rest of the tutorial. You're also going to see what we call add-ons. And add-ons will automatically inject commands into the controller per frame. Again, a useful example would be an image capture add-on, which automatically requests images and then saves them out every single frame. Just stuff that is a pain to set up yourself, so we've just done it.
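As an example, here is a hedged sketch of the image-capture add-on Seth mentions (class and parameter names follow the public add-on API; the camera placement and output path are arbitrary):

```python
from tdw.controller import Controller
from tdw.tdw_utils import TDWUtils
from tdw.add_ons.third_person_camera import ThirdPersonCamera
from tdw.add_ons.image_capture import ImageCapture

c = Controller()
camera = ThirdPersonCamera(avatar_id="a",
                           position={"x": 2, "y": 1.6, "z": -1},
                           look_at={"x": 0, "y": 0, "z": 0})
# The add-on injects image-request commands and saves images every frame.
capture = ImageCapture(avatar_ids=["a"], path="images")
c.add_ons.extend([camera, capture])
c.communicate(TDWUtils.create_empty_room(12, 12))
c.communicate({"$type": "terminate"})
```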
Now I want you to go to the tdw_bcs_demo folder that you have, and open up the script tdw_room.py in a text editor. What this controller example is going to do is load a scene. And it's going to add a camera to the scene. And it's going to add an object to the scene. And it's going to show you an image. It's going to capture an image and show it to you.
Initially, it's going to be a really bad, weird looking image. And your goal is that you're going to have to edit the file, the controller script, to get a good image. And by a good image we mean we want to be able to look at the object. We want to point the camera at the object and be able to actually see it.
So as you're working on it, I can show you how this works. This is the script. This is the entirety of our little demo script. As before, we're creating a controller object that's going to talk to the build. It's also going to launch the build automatically. That's how you saw the application window in hello_world.py.
We are also going to generate an object ID. The controller side of things is responsible for managing which objects are what. Objects are referenced by a unique ID. And then the controller keeps track of those IDs. So if you don't want to pick a unique number yourself, you can call Controller.get_unique_id(), which we do here.
So in this case, we are creating a camera. And it's called a third-person camera because this camera is not attached to an agent. It's just floating in space. It doesn't have mass. It doesn't have velocity or anything like that. It's just a third-person camera. And this is an example of an add-on. In this case, we're telling it to look at an object and move to a position.
And those, again, there's about five or six separate commands that are involved in the initialization process. But because it's such a common task, it's not something we expect users to have to do on their own. So we just set it up here. We have a wrapper function to add the scene. The scene name is TDW Room. And we have a wrapper function to add the object.
And so the reason the image looks bad initially in this example is because the camera starts at 0, 0, 0, which is the dead center of the floor inside the room. And so it's looking directly at a white wall at floor level. And it looks odd. We don't want that. We want to be able to see the object. So what you need to do is change the position parameter until you can see the object.
The hint I'll give you is that Y is up and down, X is lateral, and Z is forward and back. But you'll need to adjust X and/or Z, and definitely Y, in order to get a good image. If you're getting a flashing screen, that's fine. That's because you have an Apple machine, and you have to download the scene.
These scenes don't exist by default in your application. They get downloaded and loaded into memory at runtime. And until the camera shows up, it's just going to look like matrix glitch-out zone. On Windows, it will look like a black screen, which is kind of more elegant, until the camera shows up.
The other thing I'll say is that if it helps with positioning the camera, all these coordinates are in meters. So raising the camera up to 1.7 meters is probably sufficient to be able to see this object, for example.
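Putting the pieces of tdw_room.py together, a working solution looks roughly like this (a sketch, not the demo script itself; the camera position is just one reasonable choice, and the model name is an assumption):

```python
from tdw.controller import Controller
from tdw.add_ons.third_person_camera import ThirdPersonCamera

c = Controller()
object_id = c.get_unique_id()
# Raise the camera to roughly eye height and pull it back so the object is in frame.
camera = ThirdPersonCamera(position={"x": 2, "y": 1.7, "z": -1.5},
                           look_at=object_id)
c.add_ons.append(camera)
c.communicate([c.get_add_scene(scene_name="tdw_room"),
               c.get_add_object(model_name="iron_box",
                                object_id=object_id,
                                position={"x": 0, "y": 0, "z": 0})])
# Close the window with {"$type": "terminate"} when you're done looking at it.
```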
AUDIENCE: Is it like [INAUDIBLE]?
SETH ALTER: Yeah, Unity uses metric. So the entire back end uses metric, except when it uses scalars with completely undefined units.
AUDIENCE: Why is [INAUDIBLE] still up?
SETH ALTER: Were you able to see the object?
AUDIENCE: I can see the wall--
SETH ALTER: OK, great. Then you should adjust the camera and run it again.
AUDIENCE: [INAUDIBLE].
SETH ALTER: If the application window is still open, you should close it. 30 seconds is about typical for this auditorium, because I know the Wi-Fi. Just make sure you're all on the same Wi-Fi network. Maybe MIT GUEST is slower than the main MIT network, or vice versa.
The other thing that can always slow it down is the computer itself, because once it's downloaded you do have to add a lot of information to memory. But there's no way to differentiate between download speed versus load speed.
Yeah, the render time is quite fast. As soon as you see that image, it's already done rendering. It should be a fraction of a second. It's just the load process that can be slow. The next script is going to load much faster. Can some people see the box?
AUDIENCE: Yes.
SETH ALTER: Give it one more minute, and then we will jump to the other one, which just loads less stuff into memory, so it should be faster.
AUDIENCE: I see that.
SETH ALTER: Now everyone's seeing the box? OK, great. So this example is a lot closer to how TDW actually gets used, and it's very different in some critical ways. And I'll explain that at the end. It's more similar because we don't typically use TDW to just position a camera, right? We try to actually have motion and events within the simulator.
In this scenario, we're going to create a scene and we're going to add some objects to the scene. There's going to be a table and some objects on the table. And there's going to be a ball. And we're going to apply a force to the ball.
Now an important thing about the underlying physics engine, and TDW in general, is that we don't need an agent to generate force onto an object. We don't need anything to throw an object or chuck it or whatever. We don't need anything to kick it. We could just give it a force vector and it will go.
We can, in fact, also throw it if we want, but we're not doing that right now. So we're just going to apply a force out of nowhere to this ball. And we're going to try to use the ball to knock over objects on the table. Now, I've done the hard part for you, which is generating the scene and creating the logic to figure out whether we've knocked the objects over.
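The core of that, applying a force out of nowhere, is a single command. A hedged sketch (the demo script wraps this in much more setup; the ball model, library, and force values here are placeholders):

```python
from tdw.controller import Controller
from tdw.tdw_utils import TDWUtils

c = Controller()
ball_id = c.get_unique_id()
c.communicate([TDWUtils.create_empty_room(12, 12),
               *c.get_add_physics_object(model_name="sphere",
                                         object_id=ball_id,
                                         position={"x": 0, "y": 0.5, "z": -2},
                                         library="models_flex.json",
                                         default_physics_values=False,
                                         mass=1,
                                         dynamic_friction=0.3,
                                         static_friction=0.3,
                                         bounciness=0.6),
               # No agent required: just apply a force vector to the ball.
               {"$type": "apply_force_to_object",
                "id": ball_id,
                "force": {"x": 0, "y": 0, "z": 30}}])
for _ in range(100):   # let the physics play out
    c.communicate([])
c.communicate({"$type": "terminate"})
```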
And we're not going to get into how that logic works, because it involves too many steps for a first demo. What we are going to do is figure out how to get the ball to hit the objects. So if you open up the script use_the_force.py, you'll see that it's structured pretty differently than the first two demos. In the first two, we just had a Python script that starts at the top and goes down. In this example, we're defining a subclass of Controller.
So this is a subclass. I called it "use the force." And it's a subclass of controller. Now functionally, this can be identical to what we were doing initially, right? So in this very minimal example that launches a controller and quits, these two are exactly the same functionally. There's no functional difference between the two.
The main difference is as follows. This one is simple. And if I want to show people how TDW works, most people find it easier to read things like this, where it's just top to bottom. It's also good if you want to just try something, or test something, or make sure everything is working correctly, to just write a quick controller and run it. And if it's fine, move on.
This other one is more verbose. It's harder to read. You have to understand how classes work in Python to really get what you're looking at. The difference is that it's more robust. It encapsulates all the functionality you need. So for large projects, you should almost always be subclassing controller and creating this kind of class hierarchy.
So, yeah, better code organization, and you can divide everything into functions. And what we're doing now with this use_the_force thing is something closer to physics data set generation, where in a normal use case, we would be writing out output data and recording our results. And to do that, we want to divide our logic into a series of trials.
And the template that we typically use for a trial is that we set up the scene, we do something-- in this case, we throw a ball-- we have an end state-- we wait for everything to end-- and then we clean up the scene, and reset, and write out whatever the results are. And that's a pretty good way to do data-set generation.
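In skeleton form, that trial pattern looks something like this (a hypothetical sketch rather than the actual use_the_force.py; the parameter, the success value, and the cleanup are placeholders):

```python
from tdw.controller import Controller
from tdw.tdw_utils import TDWUtils


class BallTrials(Controller):
    """Hypothetical trial-based controller: set up, act, wait, clean up."""

    def trial(self, force_magnitude: float) -> bool:
        ball_id = self.get_unique_id()
        # 1. Set up the scene.
        self.communicate([TDWUtils.create_empty_room(12, 12),
                          *self.get_add_physics_object(model_name="sphere",
                                                       object_id=ball_id,
                                                       position={"x": 0, "y": 0.5, "z": -2},
                                                       library="models_flex.json",
                                                       default_physics_values=False,
                                                       mass=1,
                                                       dynamic_friction=0.3,
                                                       static_friction=0.3,
                                                       bounciness=0.6)])
        # 2. Do something: launch the ball.
        self.communicate({"$type": "apply_force_to_object",
                          "id": ball_id,
                          "force": {"x": 0, "y": 0, "z": force_magnitude}})
        # 3. Wait for an end state (here, just a fixed number of frames).
        for _ in range(200):
            self.communicate([])
        # 4. Clean up and report a (placeholder) result.
        self.communicate({"$type": "destroy_object", "id": ball_id})
        return True


if __name__ == "__main__":
    c = BallTrials()
    print(c.trial(force_magnitude=30))
    c.communicate({"$type": "terminate"})
```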
So in this example I've given you, we have this function called "trial." That's not a part of controller. Controller doesn't need to do trials and doesn't have an understanding, innately, of what a trial is. So we're defining it here. And the trial function has all of these parameters. These are the parameters that we're going to be using to get everything set up before launching the ball.
And I tried to name them as intuitively as possible. Do you have a question?
AUDIENCE: There's a question of the Zoom.
SETH ALTER: OK.
AUDIENCE: The question is, what's the difference between get_add_physics_object and get_add_object?
SETH ALTER: That's a great question. In some cases, we don't care about the physical nature of the object, such as those chairs floating around in the earlier example. get_add_object just adds the object to the scene and assigns it some default physics values. They're usually wrong.
If we care about the physics state, get_add_physics_object will assign physics values in addition to everything else. And get_add_object returns a single command, while get_add_physics_object returns a list of commands.
In some cases, get_add_physics_object is going to pull from a database of physics values we've already defined. But given the size of our model library, we haven't defined all of them yet. So in some cases, it tries to derive reasonable values. So it's a more complicated process. It's not necessarily something you need to do every time.
So in this case, I think everything in this script is using get_add_physics_object, because this is, in fact, a physics scene. Whereas in the previous example, we had get_add_object, because we were just moving a camera around, and we don't care about the mass of the iron box. We just need to be able to see it.
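To make that concrete, a small sketch of the difference (the model name and position are arbitrary; as noted, one call returns a single command dictionary and the other returns a list of commands):

```python
from tdw.controller import Controller

c = Controller()
box_id = c.get_unique_id()

# One command: adds the object with default (often wrong) physics values.
add_box = c.get_add_object(model_name="iron_box",
                           object_id=box_id,
                           position={"x": 0, "y": 0, "z": 0})
print(type(add_box))           # a single dict

# A list of commands: adds the object and also sets mass, friction, bounciness, etc.
add_physics_box = c.get_add_physics_object(model_name="iron_box",
                                           object_id=box_id,
                                           position={"x": 0, "y": 0, "z": 0})
print(type(add_physics_box))   # a list of dicts

c.communicate({"$type": "terminate"})
```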
AUDIENCE: --have one more person saying that when I run the script, it says "TypeError: get_add_physics_object() got an unexpected keyword argument 'scale_mass'."
SETH ALTER: scale_mass-- OK, tell them they need to upgrade TDW to the latest version. So that would be step 3, pip3 install with the -U (upgrade) flag. I think we added scale_mass sometime between when we sent out the announcement for this and today.
So the parameters that you're going to be adjusting are at the very bottom of the script. And you can see that we're calling this trial function. It's going to generate the scene. It's going to add the ball. It's going to apply the force, and then it's going to return a Boolean success. Success just means all of the objects that were on the table are no longer on the table.
I will tell you a couple of hints. The first overall hint is that these parameters won't work. You are definitely absolutely not going to get the ball to hit those objects with these parameters. Three key hints-- increase the force. It's not enough force. The ball is going to fall short.
My second hint is that the ball is in a very bad position. The ball is aiming at the center of the table, and it's coming in at a bad vector. It's going to miss those two objects even if it's got enough force. You kind of want to rotate it about 90 degrees to the side of the table. The third is that, don't change just these. The objective of this demo is not to play by the rules and adjust these until you get it right. The objective is to understand how TDW works.
So if you can find it, somewhere in the script is a setting that sets the size of the ball. And if you change it to be really big, you'll be able to knock everything over. But I won't tell you where that is. You might not be able to see the ball right away either, because the camera is not at a good spot.
AUDIENCE: Are you ready to move the tablet, or no?
SETH ALTER: Yep. That's one of the parameters, unless, again, unless the ball is really big and then you will see it. You will be able to see the ball with the default camera if the ball is moving quickly. But the camera is not in an ideal position.
If you can figure out how to remove the table, I will count that as knocking things off the table. What did you do?
AUDIENCE: It's the size of the ball.
SETH ALTER: Excellent.
AUDIENCE: Just demolished everything.
SETH ALTER: Yeah, just demolish everything. You could also get rid of the table. And then things would not be on the table and it will say "successful." What it's actually doing is checking to see which objects are on the floor at the end of the trial. So if you delete the table, it'll count that as success.
How's everyone else doing? Does anyone have any questions about what direction they should be going in with adjusting these?
AUDIENCE: And if [INAUDIBLE]
SETH ALTER: OK, great. Great. I'm glad everyone was taking the giant ball approach. This is definitely good. Yeah, so the positional and rotational coordinate systems are uniform in TDW. So the way it worked with the camera position is identical. So it's all in meters. Y is up. OK. Great, great, great, great. OK. We're going to move on then. That went faster than I thought.
So what you just did is, in some respects, very close to how an actual use case of TDW would work. And in some cases, it's not. In a critical way, what you did is much more similar to game development than a research experiment. And that is that you personally set all these values. And that is, in many cases, typically not what you're going to end up doing for a research project.
In a game, which is my background-- I used to be a game developer-- we want to set these values very precisely in order to have this exact kind of experience for users. So it's setting the speed and the size of certain objects in the scene.
That's not as important in a research project geared towards machine learning. In those cases, what you want to do is get the computer to set values for you. And in a data set generation controller, typically what we would do is run thousands of these trials, each with slightly different parameters, and getting slightly different results, and saving that out.
Which leads us to our fourth and final controller, which is use_the_brute_force.py. It's similar to use_the_force, but we're going to just brute force our way to the solution. So here, you can see we have this incredibly ugly nested series of "for" loops. Those are the parameters. And we call trial at the very bottom. We're going to try lots of combinations of parameters until we get a good one. And then we're going to save that out.
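Structurally, that brute-force search is just nested loops over the parameters, something like this hypothetical sketch (the real use_the_brute_force.py has its own parameter names, ranges, and trial function):

```python
import itertools
import json


def trial(force: float, ball_x: float, ball_scale: float) -> bool:
    """Stand-in for the demo's trial function, which runs the TDW simulation."""
    return force > 50 and ball_scale >= 1.0   # placeholder logic


# Hypothetical parameter grids; the real script defines its own ranges.
forces = [20, 40, 60, 80]
x_positions = [-2, -1, 0, 1, 2]
ball_scales = [0.2, 0.5, 1.0]

results = []
for force, x, scale in itertools.product(forces, x_positions, ball_scales):
    success = trial(force=force, ball_x=x, ball_scale=scale)
    results.append({"force": force, "ball_x": x, "ball_scale": scale,
                    "success": success})
    if success:
        break

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```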
Is this an efficient way to run an experiment and to figure out the solution? No, it's definitely not. This is very slow. But can we make it more efficient? Yes, absolutely. There are much better ways to do this. The first thing we can do is that we can constrain the parameter range.
So if we know that some of these ranges are off-- like if the minimum force is too low-- we can increase that force to somewhere plausible, and so on. We can constrain all of these to make everything run faster.
The second thing we can do is train a model to take increasingly better guesses until it's reliably throwing a ball at objects and knocking them over. One thing that we haven't covered in this workshop is output data. We haven't been using the data we've received per frame.
But we do get data per frame. We get image data. We get data of where all the objects are. And we can write that out to an arbitrary file. We can write it out to a ZIP file, or an HDF5 file, or a JSON, or whatever, and then load those back up and train a model on them.
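For example, a minimal sketch of collecting per-frame object positions and writing them to JSON (this assumes a controller c and a scene with objects already set up; send_transforms and the Transforms output data are part of the TDW API, but the loop itself is illustrative):

```python
import json
from tdw.output_data import OutputData, Transforms

frames = []
for frame in range(200):
    # Request the transforms of every object on this frame.
    resp = c.communicate({"$type": "send_transforms", "frequency": "once"})
    for r in resp[:-1]:
        if OutputData.get_data_type_id(r) == "tran":
            t = Transforms(r)
            frames.append({"frame": frame,
                           "positions": {str(t.get_id(i)): [float(v) for v in t.get_position(i)]
                                         for i in range(t.get_num())}})

# Write the whole trial out; HDF5 or a ZIP of images would work just as well.
with open("trial.json", "w") as f:
    json.dump(frames, f)
```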
Those GIFs that Jeremy showed, of objects stacked on top of each other and cloth being dragged around and so on-- those trials all save out to a single file, and that gets used as a benchmark for other projects.
So if you want to get started with TDW, all you have to do is pip3 install tdw. Then you have it on your computer. For rendering, if you want to do the photorealistic side of things, you'll need a video card, a GPU; otherwise it'll run slowly on your computer.
There are things your computer cannot do without a GPU-- parts of the rendering process that it will simply skip-- so it works better with a GPU. The faster the computer, the better. If you want to do VR, you need VR hardware, and so on.
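Once it's installed, the smallest possible controller looks something like this (a sketch using the Controller and TDWUtils classes from the tdw module; the room size is arbitrary):

```python
from tdw.controller import Controller
from tdw.tdw_utils import TDWUtils

c = Controller()                                   # launches the TDW build
c.communicate(TDWUtils.create_empty_room(12, 12))  # a 12 m x 12 m empty room
c.communicate({"$type": "terminate"})              # shut the build down
```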
And you can run this on a server; many people run it on a remote Linux server. Once you've got everything installed, we encourage you to read through our documentation. There are two sides to it: a manual that covers every topic I could think of, and the API reference, which documents what every single function and parameter means.
If you're curious about use cases like use_the_force.py specifically-- and bear in mind that that's by no means the only use case of TDW; there's a lot of agent-based stuff that would be structured completely differently, and there's a lot of image dataset generation that would ignore physics entirely.
But if you want to do this sort of physics, trial-based data set generation, you should check out TDW Physics, which is a separate repo that organizes everything into trial-based controllers. It uses a more complicated class hierarchy for the sake of having fewer bugs, so I think, overall, it's a little harder to read than what I've been showing you. But it is a working example of how to do this sort of thing.
And that's that. We have a lot of time, if anyone has some questions.
JEREMY SCHWARTZ: Yeah, I think it's important to understand that you can reach out to either myself-- there's my email there-- or--
SETH ALTER: Oh, yeah.
JEREMY SCHWARTZ: --alters@mit.edu as well, especially if you start working with it and have questions, things like that. We have a TDW Slack that we add users to as they start using TDW. That lets you DM us, and especially if you're joining a project that uses TDW, that project would typically be on the Slack, so you can communicate directly with us that way.
SETH ALTER: Yeah, so for those of you on Zoom, if you couldn't hear that: you can email us, and you can get us on Slack if you're at MIT. And if you find a bug, which is not infrequent, you can post it on the repo and we'll be on it very quickly.
AUDIENCE: Two questions.
JEREMY SCHWARTZ: Two questions.
AUDIENCE: One is how can I join the Slack?
JEREMY SCHWARTZ: Basically, all we need is your email address.
SETH ALTER: Can they hear that?
AUDIENCE: Yeah, we'll post it the--
SETH ALTER: OK. If you email us, we will add you to the Slack, is the answer.
AUDIENCE: And the second question, is there a list of objects that the model librarian can draw from, listed online?
SETH ALTER: That is a good question. So the question is, is there a list of objects that you can draw from? Yes, and there are several lists. But without getting into exactly what the difference is between those lists, we have two ways of looking at the objects.
First, we have a JSON file of metadata. So we have not just the name of each object, but also things like its volume, its size-- which can be used for generating very big scenes-- and its semantic category: whether it's a box, or a shelf, or a kitchen counter, or whatever. And that's all stored in a data structure that you can access as part of the TDW module.
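A quick sketch of browsing that metadata from Python, using the ModelLibrarian class in the tdw module (the exact fields printed here are illustrative):

```python
from tdw.librarian import ModelLibrarian

lib = ModelLibrarian()                    # loads the default model metadata
for record in lib.records[:5]:            # first few entries, just to look around
    print(record.name, record.wcategory)  # model name and its semantic category
```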
And that is one way to go through it. The other thing is that, built on top of that, there's an application we make called the TDW Visualizer. It captures images of each model, so you get a little application where you can see a picture of each model, its name, and some, but not all, of the metadata.
JEREMY SCHWARTZ: There's a thumbnail view, listed by categories. If you want to look at all of the objects of category chair, or table, or cats, or what have you, you can do it with a filter to see what we have and see a rendered image of the [INAUDIBLE].
SETH ALTER: And like I said, some of the objects have default physics values, but not all of them because we have thousands of objects. We haven't gotten to all of them yet.
SETH ALTER: Controller.add_ons-- what does controller.add_ons mean? So in this example, ThirdPersonCamera, which you can see in the middle of the slide, is an add-on. Add-ons are not controllers; they get tacked on to the controller via controller.add_ons, which is a list. And all the add-ons in that list send commands per frame.
So in the case of this camera, on the first frame it will create a camera and point it somewhere. And then it has an API for adjusting it. So if you say "camera, rotate 45 degrees," it will automatically, under the hood, generate those commands and send them on the next frame. So for a lot of use cases, it's a very general way to do basic tasks.
You can have any number of add-ons attached to a controller, including zero. And they all get executed in sequence. And the sequence ends up mattering. So if you have one that creates a scene and one that creates a camera, you have to make the scene first and then put the camera in, otherwise it won't work.
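Putting that together, a minimal sketch of the add-on pattern (ThirdPersonCamera lives in tdw.add_ons; the positions here are arbitrary):

```python
from tdw.controller import Controller
from tdw.add_ons.third_person_camera import ThirdPersonCamera
from tdw.tdw_utils import TDWUtils

c = Controller()
camera = ThirdPersonCamera(position={"x": 2, "y": 1.6, "z": -1},
                           look_at={"x": 0, "y": 0.5, "z": 0})
c.add_ons.append(camera)                           # add-ons run in list order
c.communicate(TDWUtils.create_empty_room(12, 12))  # scene first; the camera's
                                                   # commands go out on the same frame
c.communicate({"$type": "terminate"})
```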
JEREMY SCHWARTZ: Basically, it's a way of streamlining a lot of useful functionality of TDW that you'd previously have to write specific commands to do. So what used to take 25 commands, or 25 lines of code, is now collapsed down into one line of code, because behind the scenes it's executing all of those commands on your behalf.
SETH ALTER: Exactly. It also makes everything more portable. The one that's probably less abstract but very complicated computationally is the Magnebot, which is the agent that you saw picking up boxes and small objects and so on. That's an add-on, and it's structured as an add-on so that you can put it inside any other controller.
And all add-ons do is generate commands-- those dictionaries. There are increasingly complicated generators, such as the Magnebot. But anything you could do with an add-on, you could do without an add-on; it would just be more of a pain. PyImpact is also an add-on. Cool. Well, thank you.