PHASE: PHysically-grounded Abstract Social Events for Machine Social Perception
December 16, 2020
December 12, 2020
All Captioned Videos SVRHM Workshop 2020
The ability to perceive and reason about social interactions in the context of physical environments is core to human social intelligence and human-machine cooperation. However, no prior dataset or benchmark has systematically evaluated physically grounded perception of complex social interactions that go beyond short actions, such as high-fiving, or simple group activities, such as gathering. In this work, we create a dataset of physically-grounded abstract social events, PHASE, that resemble a wide range of real-life social interactions by including social concepts such as helping another agent. As a baseline model, we introduce a Bayesian inverse planning approach which outperforms SOTA feed-forward neural networks. We hope that PHASE can serve as a difficult new challenge for developing new models that can recognize complex social interactions.
PRESENTER: Our next speaker is Aviv Netanyahu from MIT. She is, I believe, a first-year graduate student who has been working with Andrei Barbu and is advised by Josh Tenenbaum and Boris Katz. And this is joint work with Tianmin Shu. Before Aviv gives her presentation, I'd like to say that this paper actually had the highest score of all papers in the workshop-- nine, nine, nine, perfect nine. So please take it away, Aviv.
AVIV NETANYAHU: Thank you so much. I'm a second-year PhD student at MIT. And yeah, I'll present our work on physically-grounded social interactions.
So imagine that you're helping your friend move. And let's say your friend has small boxes. So if you want to help them, you can just hold a box and move it. And let's say your friend has a big box, so helping him would mean both of you picking up this box and moving it. So the same social interaction, like helping, looks different under different physical constraints. So we're interested in learning models that can reason about social interactions in the context of these different environments.
We'll show our data set, PHASE, for physically-grounded social interactions. We will provide some baselines on this data. And we will suggest a model of our own for inferring interactions.
So, in this example, when you're trying to help your friend you're basically working together towards a goal. And you're both doing your part to avoid bad outcomes. In this environment, we have two agents and objects, like the box. We have different locations and obstacles, like walls. And the agent's goal is to put the box at one of the locations. And his friend wants to help him.
So they're both taking actions in the physical environment to avoid things like the box falling down. And the same way that we can perceive interactions in the real world, we can also do that in simulated environments. So this is the environment that PHASE is constructed in, where we have two agents (these trapezoids), objects (which are circled here), locations, and walls. And you can also have a goal of putting an object somewhere or helping some other agent.
Each of these agents has a field of view, which is limited, so it's a partially observable setting. And they can take actions in the space. They can grab objects. They can exert force. And that's how they can achieve their goals in this environment. And also, the agents can have different strengths.
We say each agent has one goal, either physical or social, and the agents know about each other's goal. So in this case, the red agent has a physical goal of putting an object on the landmark. Another goal an agent can have is approaching another agent. And the final physical goal it can have is reaching a landmark.
Social goals determine the relation between agents. So in this case, the green agent has a goal of putting the blue item on the yellow landmark. The red agent has the social goal of helping. And this forms a friendly relation between them.
In case each agent has a different physical goal that's unrelated to the other agent-- for example, here putting an object somewhere or getting to some landmark-- the relation is neutral. And the only interaction you'd see between them would be unintentional. In the last case, the green agent can have a goal of putting an object somewhere, and the red one could have a social goal of hindering. And this would be an adversarial setting. So we have three physical goals and two social goals-- helping and hindering.
We can input all these configurations-- the physical and the social, i.e. the agent locations and so on-- and the goals and relations into a simulator to generate these interactions. And this is basically our PHASE data set. So the physical configurations are the sizes, the strengths, room layout, initial states. And the abstract or social configurations are goals and relations. So we can input the observation that each agent has in the environment and all these configurations into a planner.
And in this case, imagine the red agent has a goal of putting the blue item somewhere. So it would first need to reach the blue item. And in this case, a possible plan might be turn right. So our planner outputs actions for agents. And then we execute them using the physics engine in our environment. So we basically simulate one step.
And after simulating that step, we update the agent's observations in the environment. And we can feed that back into the planner and repeat this process iteratively until we get a video. So eventually, what we get is videos of length 10 to 25 seconds, which is longer than current data sets on actions that usually have three-second videos. And our planner can be either a machine controlled planner or a human controlled planner.
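The generation loop just described (plan, simulate one step, update observations, repeat) can be sketched roughly as follows. This is a toy illustration, not the PHASE code: `toy_planner`, `toy_physics_step`, and `generate_episode` are hypothetical names, the "physics" is plain addition of a capped 2D step, and the agent observes its full state, unlike the partially observable setting in the talk.

```python
from typing import Callable, List, Tuple

State = Tuple[float, float]   # toy 2D agent position (assumption)
Action = Tuple[float, float]  # toy per-step displacement (assumption)


def clamp(value: float, limit: float) -> float:
    """Restrict value to the range [-limit, limit]."""
    return max(-limit, min(limit, value))


def toy_planner(observation: State, goal: State) -> Action:
    """Hypothetical planner: move toward the goal, at most 1.0 per axis."""
    return (clamp(goal[0] - observation[0], 1.0),
            clamp(goal[1] - observation[1], 1.0))


def toy_physics_step(state: State, action: Action) -> State:
    """Hypothetical stand-in for a physics engine: apply the displacement."""
    return (state[0] + action[0], state[1] + action[1])


def generate_episode(
    start: State,
    goal: State,
    planner: Callable[[State, State], Action],
    physics_step: Callable[[State, Action], State],
    max_steps: int = 100,
) -> List[State]:
    """Alternate planning and simulation until the goal is reached."""
    state = start
    trajectory = [state]
    for _ in range(max_steps):
        if state == goal:
            break
        action = planner(state, goal)        # plan from the current observation
        state = physics_step(state, action)  # simulate one step
        trajectory.append(state)             # updated observation for next round
    return trajectory


episode = generate_episode((0.0, 0.0), (3.0, -2.0), toy_planner, toy_physics_step)
```

The recorded `trajectory` plays the role of the video frames; swapping `toy_planner` for a function that reads human input would correspond to the human-controlled case.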
In both cases, we see that we can perceive interactions and goals. In this case on the left, a human is controlling the environment-- or two humans, rather. And on the right, a machine is controlling the environment and generating the plan.
We can perceive in both cases that the agents are adversarial since they have conflicting goals of putting the item in different locations. So since the interactions we've perceived are the same, we can be sure that our generated videos are meaningful, and now focus only on them.
Indeed, when we show these to human subjects and ask what type of interactions appear in the videos, we get a wide variety of answers. So this graph shows the percentage of videos that contain each of these interactions. And the nice thing is that we don't have to encode all these interactions. So the source of richness just emerges from the combinations of our environment configurations, which we call the physical and social configurations, and also from the fact that we're operating in a physical environment.
So for an agent that has a social goal, its reward basically depends on the other agent's physical goal. That's how different behaviors can emerge. If, for instance, an agent wants to hinder the other agent that is trying to put an object somewhere, then the interaction might be perceived as stealing. And if the other agent has a goal of reaching somewhere, then the interaction might be perceived as attacking or blocking.
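One simple way to make a social reward depend on the other agent's physical goal is sketched below. The talk does not give the actual reward functions, so everything here is an assumption for illustration: the physical reward is taken to be the negative distance between an object and its target landmark, a helper receives the other agent's reward unchanged, and a hinderer receives its negation. The names `physical_reward` and `social_reward` are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def physical_reward(object_pos: Point, landmark_pos: Point) -> float:
    """Assumed physical reward: higher when the object is closer to the landmark."""
    return -math.dist(object_pos, landmark_pos)


def social_reward(social_goal: str, other_physical_reward: float) -> float:
    """Assumed social reward, defined through the other agent's physical reward."""
    if social_goal == "help":
        return other_physical_reward    # helper gains when the other agent gains
    if social_goal == "hinder":
        return -other_physical_reward   # hinderer gains when the other agent loses
    raise ValueError(f"unknown social goal: {social_goal!r}")


# The other agent wants the object at (3, 4); the object currently sits at (0, 0).
other_reward = physical_reward((0.0, 0.0), (3.0, 4.0))
help_reward = social_reward("help", other_reward)
hinder_reward = social_reward("hinder", other_reward)
```

Under these assumptions the same hindering rule produces different visible behavior depending on the other agent's physical goal, which is the effect described above.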
So we evaluate our data set on three tasks. The first is classifying the goals. We have social and physical goals. And they also include the different combinations between landmarks and objects. So we have working goals. In this example, the green agent has a goal of putting the blue item on the blue landmark. And the red agent has a social goal of helping.
The second task is classifying relationships. So in this example, they're friendly. And in general, relationships are friendly, adversarial, and neutral. Our third task is predicting trajectories. So we can also just have part of the video as input, and then try to predict the next 10 steps or 2.5 seconds.
We will show three models that can do this. Out of the models, two of them are state-of-the-art models for recognizing relations. And then we propose a model of our own. All models receive as input the observed trajectories and output either the goals and relations or future trajectories.
So models that output the relation between agents use either LSTMs or graph neural networks. And both of these models are bottom-up approaches. We propose inference by SIMulation, Planning, and Local Estimation, SIMPLE, which also has top-down components. And it's a Bayesian inference model.
So the input is the observed trajectories. And we start by outputting an initial proposal of the environment configuration of the goals, relations, and strengths. And then given that environment configuration, we can use that with the generative model that we have for generating the data, meaning the planner in the physics environment. And we input that into our generator to create a new video and get estimated trajectories.
Then we can compare the estimated trajectories to the observations we get as input and look at the areas where they differ the most. And those areas can help us improve our guess of the environment configuration. So we can input that back into our cue-based proposals and get another estimation for both relations and strengths. And we can complete this process iteratively until we get our final estimation of the environment configuration.
Finally for the results, on goal and relation classification, we have a baseline for human accuracy, which is near 100%, both on goals and relations. And the state-of-the-art methods using LSTMs and graph networks achieve much lower accuracy than the human baseline. And SIMPLE, which is our method, achieves results that are closer to the human baseline.
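The propose, simulate, compare, and refine cycle can be illustrated with a deliberately small toy. This is a greedy local search, not the Bayesian inference used in SIMPLE, and nothing in it comes from the authors' code: the "environment configuration" is reduced to a single integer goal position on a line, `simulate` is a hypothetical stand-in for the planner plus physics engine, and `discrepancy` is a summed absolute difference between trajectories.

```python
from typing import List


def simulate(goal: int, start: int = 0, steps: int = 10) -> List[int]:
    """Toy generative model: walk one unit per step toward the goal, then stay."""
    pos = start
    trajectory = [pos]
    for _ in range(steps):
        if pos < goal:
            pos += 1
        elif pos > goal:
            pos -= 1
        trajectory.append(pos)
    return trajectory


def discrepancy(estimated: List[int], observed: List[int]) -> int:
    """Total absolute difference between two equal-length trajectories."""
    return sum(abs(e - o) for e, o in zip(estimated, observed))


def infer_goal(observed: List[int], initial_guess: int, max_iters: int = 50) -> int:
    """Refine a goal hypothesis by simulating it and comparing to the observation."""
    guess = initial_guess
    best_error = discrepancy(simulate(guess), observed)
    for _ in range(max_iters):
        # Propose nearby hypotheses and score each one by simulation.
        scored = [(discrepancy(simulate(c), observed), c) for c in (guess - 1, guess + 1)]
        error, candidate = min(scored)
        if error >= best_error:
            break  # no neighbouring hypothesis explains the observation better
        guess, best_error = candidate, error
    return guess


observed_trajectory = simulate(goal=7)             # pretend this came from a video
inferred = infer_goal(observed_trajectory, initial_guess=0)
```

In this toy the inferred goal matches the one used to produce the observation. The real model searches over goals, relations, and strengths at once and uses the regions of largest mismatch to guide new proposals, which this sketch does not attempt.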
For the second task of trajectory forecasting, we receive a percentage of the observed videos. And we output the next 10 steps. And we measure the error-- the average displacement error and the final displacement error. And again, LSTM and graph-based methods have higher error than SIMPLE, which is our method.
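For reference, the two error measures are commonly defined as follows: average displacement error (ADE) is the mean Euclidean distance between predicted and true positions over all predicted steps, and final displacement error (FDE) is that distance at the last predicted step only. The sketch below uses these common definitions; the talk does not state units or any normalization, and `displacement_errors` is a hypothetical helper name.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def displacement_errors(predicted: List[Point], truth: List[Point]) -> Tuple[float, float]:
    """Return (ADE, FDE) for one agent's predicted and true trajectories."""
    if not predicted or len(predicted) != len(truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    distances = [math.dist(p, t) for p, t in zip(predicted, truth)]
    ade = sum(distances) / len(distances)  # average over all predicted steps
    fde = distances[-1]                    # error at the final predicted step
    return ade, fde


predicted_path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
true_path = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(predicted_path, true_path)
```

With several agents or videos, these per-trajectory values would typically be averaged again.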
To conclude, PHASE, our data set, combines physical scene understanding and social scene understanding. And it's a first step for simulating interactions, but there is still a lot of work that can be done. So we don't even know exactly what the space of social interactions is. We don't know how rich our simulators need to be to capture them. And we're not sure how this can generalize to actual videos of the real world.
Other than the data, current methods are limited on our data. So the other question would be, how can we improve existing methods? Even SIMPLE, which is the method that we propose, has its limitations. For example, it's expensive computationally. So how can you speed it up?
And we can also ask how to generalize to new environments and so on. So PHASE will be available soon on our website. So feel free to try out your favorite model. And we will be presenting in the poster session right after this, so it would be great to see you. And thank you to all the organizers.