The neural computations underlying real-world social interaction perception
Date Posted:
February 8, 2023
Date Recorded:
February 7, 2023
CBMM Speaker(s):
Leyla Isik
Description:
Leyla Isik is the Clare Boothe Luce Assistant Professor in the Department of Cognitive Science at Johns Hopkins University. Her research aims to answer the question of how humans extract complex information using a combination of human neuroimaging, intracranial recordings, machine learning, and behavioral techniques. Before joining Johns Hopkins, Isik was a postdoctoral researcher at MIT and Harvard in the Center for Brains, Minds, and Machines working with Nancy Kanwisher and Gabriel Kreiman. Isik completed her PhD at MIT where she was advised by Tomaso Poggio.
Abstract: Humans perceive the world in rich social detail. We effortlessly recognize not only objects and people in our environment, but also social interactions between people. The ability to perceive and understand social interactions is critical for functioning in our social world. We recently identified a brain region that selectively represents others’ social interactions in the posterior superior temporal sulcus (pSTS) across two diverse sets of controlled, animated videos. However, it is unclear how social interactions are processed in the real world where they co-vary with many other sensory and social features. In the first part of my talk, I will discuss new work using naturalistic fMRI movie paradigms and novel machine learning analyses to understand how humans process social interactions in real-world settings. We find that social interactions guide behavioral judgements and are selectively processed in the pSTS, even after controlling for the effects of other perceptual and social information, including faces, voices, and theory of mind. In the second part of my talk, I will discuss the computational implications of social interaction selectivity and present a novel graph neural network model, SocialGNN, that instantiates these insights. SocialGNN reproduces human social interaction judgements in both controlled and natural videos using only visual information, but requires relational, graph structure and processing to do so. Together, this work suggests that social interaction recognition is a core human ability that relies on specialized, structured visual representations.
NANCY KANWISHER: Well, it is a great joy to welcome Leyla Isik back to MIT. Leyla got her PhD here working with Tommy Poggio and then stayed on for a postdoc working with me and Gabriel. Lucky us. And then she went and took a faculty position at Hopkins in 2019.
And Leyla combines three skills not usually found in the same person. She has major computational chops, as you will see. She can build and test all kinds of computational models. Second, she has a deep interest in cognition and not just basic aspects of perception but really high-level perceptual cognitive phenomena. And third, she knows how to get really good data and do damned good experiments. And that's quite a powerful combo.
So she has applied this combo to try to understand social perception, which is a really important problem. Social perception is not just a matter of, say, recognizing a face-- something any old CNN can do-- but the much harder problem of understanding social structures, interactions between people, how those people feel about each other, what they might be doing, what the nature of their interaction is, et cetera.
And so, in her postdoc work, she, along with Kami Koldewyn, found a part of the cortex that responds very selectively when people are looking at two people interacting with each other, as opposed to two people doing their own separate thing. And that's a pretty amazing, high-level, specific function, a pretty cool thing. And so she has been studying this region further and coming up with computational models of how we perceive social interactions.
And, as I mentioned, that's beyond your garden-variety CNN. And so Leyla's devising some very interesting structured graph models of social perception. And we'll hear about that today. Thank you, Leyla.
[APPLAUSE]
LEYLA ISIK: Thank you, Nancy, for the kind introduction, and thank you, everyone, for having me. It's been such an amazing treat to be back. So today I'm going to talk to you about the neural computations underlying human social interaction recognition.
All right, so many of you have probably seen this image. I think I stole the example from Gabriel, actually. But even if you haven't, you can probably quickly understand what's happening and why it's funny.
And in order to do that, you need to not only be able to recognize objects, people, scene information, but you need pretty rich knowledge of the physical world-- like the fact that when you step on a scale, it reads as heavier-- and I think, even more importantly, also the social world. So you need to know that this guy on the scale doesn't know that Obama's foot's on it, but everyone else does. And they're all laughing about it together.
And this motivates the main question we try to answer in our lab, which is, how do humans extract all of this rich social information from visual input? And this is a really hard problem. So even just this image alone had multiple entire think pieces written about it. So you can compare the rich descriptions that humans give to this image to what Google's state-of-the-art deep neural network would say about it.
So, pretty amazingly, compared to even just a few years ago, the Google system can accurately put a bounding box around almost every object in the image and accurately tell you what each one is. But, in contrast, there's been far less of an attempt to try to understand the social information in scenes. There have been some attempts to do this, but they've been very limited. And Google's attempt at this is to try to recognize faces and classify the emotional expression of those faces. So, again, does a pretty amazing job at recognizing every single face-- no small feat.
And now you can zoom in face by face and get its emotion prediction for each face. So let's start with this guy, which is face number 10. And it tries to make a prediction about possible emotional expressions-- joy, sorrow, anger, surprise-- but it rates every one of them as equally very unlikely. We can also look at Obama's face and, again, every expression is equally very unlikely. And, in fact, the only confident classification the system makes is about this face back here, which is very likely blurred.
[LAUGHTER]
And I think this really highlights the huge gap between all the progress we've made in understanding visual object recognition and humans' rich visual-social abilities. So the approach we try to take in our lab is to apply the methods that have been successful in helping us understand object recognition to understanding social vision. And so, in particular, we use a combination of high-spatial-resolution and high-temporal-resolution neuroimaging, behavior, and computational models, and the combination and comparison of all three types of data.
And so today what I'm going to talk to you about is the application, in particular, of fMRI-- a series of studies using fMRI data, and then tell you about a computational model they motivated that we then used to try to reproduce human behavior. And the focus of today's talk and much of our work is on understanding and recognizing other people's social interactions. And so, by this, I mean third-party social interaction. So how you recognize, for example, the interaction between Obama and his staffer, not your own interactions with somebody else.
This is a hugely important ability for humans. After only a few months of age, infants watching two puppets can tell if they're interacting and if one is helping or hindering the other. Nonhuman primates can also make similar distinctions by observing others' interactions. But, until recently, very little was known about the underlying neural basis or neural computations, and so I'm going to tell you about these two things today.
And so starting with the neural basis, we knew from decades of social neuroscience research that there are many brain regions involved in various aspects of person perception, like recognizing faces, bodies, motions of individual people, and also theory of mind-- understanding the mental states of other people. And so, in a first series of studies with Nancy, Kami, and others, we wanted to ask if recognizing social interactions relied on one of these other known brain regions or something else.
And so the real-world interactions that I started with are imbued with all sorts of rich, contextual, visual information. So to start, we wanted to strip that away to a social interaction's bare components. We showed subjects point-light videos of two agents acting with some sort of social contingency. They look something like this.
And we compared these interacting pairs of point lights with pairs of point-light figures engaged in two independent actions. And so in a first step from our experiment, we showed people in the scanner a bunch of videos like the one on the top and contrasted it to a bunch of videos like the ones on the bottom. And we asked, just as a first step in a standard group analysis, if there are any regions that respond reliably more to the interacting versus independent videos.
And this is what we find, a region that is pretty localized just to the right hemisphere, to the posterior STS shown here. I've circled it. And so this is the result. This is a group analysis looking at the results across all of our subjects. We can also localize this region in individual subjects. So here's one subject's brain.
And we identified this region that I'm showing in red that seems to respond more to the interacting than the independent actions. And we asked, to what extent, if at all, does it overlap with the other known nearby social- and motion-processing regions in and around the STS-- in particular, the rTPJ for other people's mental states, the face-selective pSTS, and motion-selective cortex? And as you can see in this subject, while these regions are all nearby, their peaks of activation are anatomically separate.
That's just one subject. Here are three more subjects. And so you can see that there's some intersubject variability. But, for the most part, these regions all fall in pretty stereotyped locations and are anatomically separate. In follow-up experiments that I won't get into, we also showed that they have very different functional profiles. So the social interaction region responds a lot to social interactions but not really to mental state inference, faces, motion, et cetera.
So in a second experiment, we wanted to see, one, does this-- responses in this region generalize to new stimuli? And, two, can they tell you something about the type of interaction that's going on? So much like that early Kiley Hamlin study I showed you in infants, we designed these stimuli where one agent was trying to achieve a goal, and the second agent either helped it or hindered it. Oh, oops. Sorry, now you have to watch these both at once. And we contrasted these with physical interactions that were just shapes moving around-- inanimate shapes like billiard balls.
And we found that not only did responses in this region generalize to the second set of stimuli-- so it responded significantly more to the social versus physical interaction-- but the pattern of activity in this region could also decode helping versus hindering interactions. So we found this region that is in the pSTS that seems to be selective for recognizing other people's social interactions, that distinguishes between helping and hindering interactions, and is functionally distinct for nearby regions for faces, animacy and motion, theory of mind, et cetera.
And so while it was pretty exciting that it generalized across these two wildly different sets of simple stimuli, they're still both a far cry from the real world. The real-world example that I started with is much more complex and rich and has a lot of contextual information.
And, in fact, the real world is even worse than this because often you're not viewing static images. You're looking at temporally extended events. And so here's a clip from the BBC television series Sherlock that we used in a second study that I'll talk about.
[VIDEO PLAYBACK]
- John! John Watson! Stamford. Mike Stamford. We were at Barts together.
- Yes, sorry. Yes, Mike, hello.
[END PLAYBACK]
LEYLA ISIK: All right, so there's a lot going on visually. There's speech and language. And it's also kind of ambiguous what's happening. And so we wanted to know to what extent our findings that we saw with these very clear-cut, controlled stimuli would extend to more real-world settings. And, in particular, when you're watching this, like I said, there's not only their social interaction, but you're also seeing faces and hearing voices, probably trying to understand something about their mental states, and also processing their language.
And so many people have rightly criticized that if we just study each of these cognitive functions in isolation, it's going to be hard to understand whether and how they each are processed in the real world. And, indeed, one prior study with movies suggested that social interactions are processed in theory-of-mind regions rather than the pSTS, suggesting that these two things may not be separable in more real-world settings.
However, no prior studies with movies have really tried to tease apart these different factors. So that's something we really wanted to get at in this work. So in this study led by my former postdoc, Haemy Lee Masson, we sought to ask just this. Can we find a similar type of selective response for social interactions during a full-length movie?
And so to do this, we used two publicly available movie data sets, so one collected by Janice Chen and colleagues, where 17 adults watched the first episode of the BBC show Sherlock, and a second collected by Aliko and colleagues, where subjects watched the romantic movie 500 Days of Summer. So these are two different data sets collected by different labs on different continents of different movie genres, and so we really tried to see if we could find phenomena that would generalize across these two very different sets.
And I think it's really important to do more content-based analyses on these movies. So to do this, we densely labeled each of the movies with a combination of perceptual and social-affective features. We automatically extracted several visual features, like pixel values and motion energy. We labeled whether the scene was taking place indoors or outdoors and whether there were faces or written words on the screen, and then, as sort of a high-level visual catchall, we included the output of the fifth convolutional layer of the AlexNet deep neural network.
We also extracted some basic auditory features, the pitch and amplitude, and had annotators label whether there's background music playing. For social features, we had annotators label whether a social interaction was taking place on the screen, whether a character was speaking, whether a character was engaged in theory of mind-- and I can talk a little bit about how we define this later if people are interested-- and then affective features, the average valence and arousal of each scene.
And so just to give you an idea of what these look like in that scene you just saw, this would be labeled as a social interaction with characters speaking, a neutral scene. There would be the DNN features, the output of a face-detection algorithm, pitch, motion energy, et cetera. And so all of these things are really tightly correlated. So that's one thing I want to stress to you to keep in mind.
And so to understand each of these features' contributions to brain activity, we learned an encoding model that tried to link the activity in each voxel over the course of the movie to the feature representations over the course of the movie. And we just learned a linear mapping between these two using a type of regression called banded-ridge regression that helps deal with correlated feature spaces but also helps to account for the fact that some of our features, like the output of the deep neural network, are really high-dimensional, and others are unidimensional.
And so we can train the encoding model on one subset of the movie and then make a prediction about what the voxel activity should be on the held-out portion. And we're just using as our accuracy metric the correlation between the true and predicted voxel activity.
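To make that concrete, here is a minimal sketch of the encoding-model fit and the correlation metric, using ordinary ridge regression from scikit-learn as a simplified stand-in for banded ridge; the variable names and the train/test split are illustrative, not the actual analysis code.

```python
# Minimal encoding-model sketch. Ordinary ridge regression stands in for
# banded-ridge regression here, as a simplification.
# X_*: stimulus features over time (n_timepoints x n_features), concatenated
#      across feature spaces (DNN activations, motion energy, social labels, ...).
# Y_*: voxel responses over time (n_timepoints x n_voxels).
import numpy as np
from sklearn.linear_model import RidgeCV

def fit_encoding_model(X_train, Y_train, X_test, Y_test):
    model = RidgeCV(alphas=np.logspace(-2, 4, 13))
    model.fit(X_train, Y_train)                 # linear map: features -> each voxel
    Y_pred = model.predict(X_test)
    # Accuracy metric: per-voxel correlation between true and predicted activity.
    r = np.array([np.corrcoef(Y_test[:, v], Y_pred[:, v])[0, 1]
                  for v in range(Y_test.shape[1])])
    return model, r
```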
All right, so as a first pass, we just wanted to look at how well the model did over the course of the movie. Subjects only saw the movie once, so we have no measure of within-subject reliability. But just to try to remove noisy voxels, we restricted our analysis to voxels that had significantly above-chance correlation across subjects. So I'm graying out the other voxels.
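As a rough illustration of that voxel-selection step, here is a leave-one-out intersubject-correlation sketch; the array layout and the fixed cutoff are my assumptions, since the actual analysis establishes significance statistically rather than with a hard threshold.

```python
import numpy as np

def isc_voxel_mask(data, cutoff=0.1):
    """Leave-one-out intersubject correlation per voxel.
    data: (n_subjects, n_timepoints, n_voxels) array of movie responses."""
    n_subj, _, n_vox = data.shape
    isc = np.zeros((n_subj, n_vox))
    for s in range(n_subj):
        others = np.delete(data, s, axis=0).mean(axis=0)   # average of remaining subjects
        for v in range(n_vox):
            isc[s, v] = np.corrcoef(data[s, :, v], others[:, v])[0, 1]
    return isc.mean(axis=0) > cutoff    # keep only reliably driven voxels
```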
And what we found is that for every voxel within our ISC mask, our encoding model predicted the activity in that voxel significantly above chance. Prediction accuracy peaked in the left hemisphere but was quite strong bilaterally along the STS.
All right, so now that we have this, we can try to understand how the different features are linked to voxel activity. And, like I mentioned, the features are all extremely correlated. So, for example, whether there's a social interaction and whether there's a character speaking have an R value of 0.8 or above. So it seemed kind of hopeless, but we thought we would try.
So the first thing we did was just try to see if we could even separate out the contribution of social and perceptual features because, for example, all sorts of things like the lighting on screen correlate with affective features, et cetera. And so to do this, we built three encoding models-- the full model that I told you about, a model based on just the perceptual features, and then a third model based on just the social features. And then we can calculate the unique variance explained in each voxel by the perceptual features by taking the variance explained in the full model and subtracting the social variance explained, for example-- and vice versa for unique social variance.
We can also use this approach to look at the unique variance explained by each single feature in our data set. So we're particularly interested in social interactions. So we can compare the full model to a model trained with every other feature except a social interaction. And that difference should give us the unique variance explained by social interaction while accounting for all of our other features.
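Just to spell out that subtraction, here is a minimal sketch of the variance-partitioning step, under the assumption that the per-voxel prediction correlations are squared to give variance explained; the exact implementation may differ.

```python
import numpy as np

def unique_variance(r_full, r_reduced):
    """Unique variance explained per voxel by the left-out feature
    (e.g. social interaction): variance explained by the full model minus
    variance explained by a model trained on every other feature.
    r_full, r_reduced: per-voxel prediction correlations."""
    r2_full = np.clip(r_full, 0, None) ** 2       # negative correlations treated as 0
    r2_reduced = np.clip(r_reduced, 0, None) ** 2
    return r2_full - r2_reduced
```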
So, as a first step and sort of sanity check, we ask, what is the unique variance explained by the perceptual features in the movie? And here are the results. And so you see significant activity in auditory cortex, which seems reassuring, and also a bunch of visual regions. So that's a nice sanity check, right?
Now, we can do the same thing for the social-affective features. And, to be honest, I wouldn't have been surprised if this was nothing because it's a Hollywood movie. It's not designed as a stimulus set. But, actually, what we see is quite sensible. So you see high prediction or high variance explained along the STS, in theory-of-mind regions, and in frontal action observation regions as well. And so even just this was pretty exciting, that we could separate the contributions of perceptual versus social features in the movie.
And so now we wanted to see if we could separate out the contribution from social interactions versus theory of mind because, like I mentioned, some prior work had suggested that in natural settings, those two things are co-processed. And so here's the unique variance explained by social interactions. Oh, I didn't mention this. I'm only showing you the results for one movie, but they're very similar in the second movie as well.
And so what you see is that the most robust activity occurs bilaterally in this case along the STS. You can contrast that with theory of mind. And here the activity is weaker, and I think there's a lot of reasons why that might be. But you still do see unique variance explained in TPJ and PFC, precuneus, and other regions that you might expect based on the theory-of-mind network. And, importantly, these two things, these two activations, are largely non-overlapping.
And so I think this sort of content-based movie analysis is really promising, and we're extending it to other applications as well. And so in the same Sherlock data, they had participants recall the movie. And we're using this approach to look at social interaction memories. We've also applied it to action recognition more broadly.
And I think content-based movie analyses open up the doors to many other difficult-to-scan populations. So using some awesome publicly available data from Hilary Richardson and Rebecca Saxe, we're starting to look at kids watching a movie and how social interaction representations develop over time. And, in an ongoing project, we're also showing the same movie to young adults with and without autism.
But to sum up the first part of the talk, what we found is that social interactions are selectively processed in the pSTS. And this seems to be true even in natural stimuli, where they co-occur with so many other things. And so in the second half of my talk, I want to discuss the computational implications of this work, or, in other words, why should we care about neural selectivity? What can neural selectivity tell us about neural computations?
One thing that we've seen now across many studies is that perceiving or recognizing social interactions is dissociable from theory of mind. So, certainly, those two things are tightly coupled. But when you watch an interaction, that doesn't seem to rely on the theory-of-mind network. And that adds to a bunch of other growing evidence that perhaps the computations being carried out by this region or being carried out when people recognize social interaction are visual in nature.
So, for example, there's been a lot of behavioral work showing that people process facing bodies, like the pair on the left, in a way that carries a lot of behavioral signatures of visual processing: they have an advantage for being found in visual search, they suffer inversion effects, et cetera. And that's not true when you have the bodies turned back to back, and it's not true of other types of objects, like chairs, for example.
But one puzzle has been that standard convolutional or recurrent neural networks, which are largely based on visual input and carry out visual computations, seem to do a pretty bad job of recognizing interactions in both images and videos. And that's been used as evidence that perhaps the processes that let you recognize different types of social interactions are not purely visual-- that although they start from visual input, they require some sort of high-level reasoning or an explicit model of the social and physical world to understand what's going on.
So just to make this concrete, going back to these animated stimuli that are often used by, for example, Kiley Hamlin: the best-performing models that reproduce human judgments on videos like this are generative inverse planning models-- I feel kind of silly explaining this to some of the people in this room-- which work by extracting the agents and then generating hypotheses about what those agents might be doing.
So, for example, you might hypothesize that the red circle is trying to go up the hill, and the yellow triangle's trying to help, or that the red circle is trying to go up the hill, and the yellow triangle's trying to hinder. And then based on these hypotheses, you could simulate possible trajectories. And, importantly, both the hypothesis generation and the trajectory simulation are built on explicit world knowledge about goals, the physics of the world, et cetera.
And then the model makes a selection by comparing all of those possible trajectories in the hypothesis space to the observed trajectory and predicting the closest match-- so, for example, help would be the closest match here. And this is very different from the way standard bottom-up neural networks work, where you take the agent information, put it through some sort of end-to-end learning algorithm, and make a prediction like helping. But, importantly, these bottom-up models tend to not work so well.
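Just to make the inverse-planning idea concrete in code, here is a schematic sketch of that compare-simulations-to-observation step. This is not the actual published model: the `simulate` function is a placeholder standing in for an explicit generative model of agents' goals and physics, and the hypothesis names are illustrative.

```python
import numpy as np

def infer_relationship(observed_traj, simulate, hypotheses=("help", "hinder")):
    """Pick the hypothesis whose simulated trajectory best matches the observed one.
    observed_traj: (T, 2) array of an agent's positions over time.
    simulate: placeholder for a generative world model; maps a hypothesis
              (e.g. 'help') to a simulated (T, 2) trajectory."""
    errors = {h: np.linalg.norm(simulate(h) - observed_traj) for h in hypotheses}
    return min(errors, key=errors.get)   # e.g. 'help' if that simulation is closest
```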
But there's a second insight that we got from the neuroimaging data that social interactions are also dissociable from other types of visual processing, like faces, motion, et cetera, which suggests that perhaps they rely on some sort of specialized representations that are specific to social interactions. And, in particular, one key property of interactions is that they're relational. You're always trying to recognize the interaction between two agents.
And so we wanted to ask if purely visual information and computations would in fact be sufficient to match human judgments if you add this relational structure into your models in the form of an inductive bias. And this is work that we just published as a preprint, led by Manasi Malik, a graduate student in my lab.
All right, so what do I mean by a relational inductive bias? We put it into our model in the form of graph structure. So, for example, going back to that red circle climbing the hill with the yellow triangle helping it: in a standard neural network, you might take information about these two agents-- like their visual properties, their speed, their motion, et cetera-- and feed it into your neural network to make a prediction. So you could train your neural network on a bunch of these scenarios and then test it on a new scenario.
But, instead, what you could do is represent these two agents as two nodes in a graph. And then you represent them not just in isolation but also input and learn representations of the edges between them. And you can do that for a simple graph like this, or you might have a more complex graph with multiple agents and objects.
And you put in that whole graph structure, learn representations over the graph, and then use that to make a prediction. Those are called graph neural networks. And we combined this graph neural network with a recurrent neural network that takes video input at each time-step to make a prediction about social interactions. We call it SocialGNN.
And so this was inspired by a whole host of other graph neural network research that's been coming out. And so it's certainly not the first graph neural network even to try to make social predictions. But what's novel about this is that most of these prior networks have tried to make time-step by time-step trajectory predictions. And this is really the first network to try to make predictions about social scenes based on temporally extended events.
And so to test the network, we use this great data set from the Tenenbaum Lab called PHASE that has a lot of these Heider and Simmel-style animations with two agents shown here in red and green and two objects shown in pink and blue and different relationships between them. So here's one example. And so the agents both have a physical-- each have a physical or social goal, and you can define the relationship between them.
And, as you can see, they're really visually interesting as well. So what does this look like? So, for each video, we can try to predict the interaction type. So did this look friendly, adversarial, or neutral to folks?
AUDIENCE: Adversarial. [INAUDIBLE].
LEYLA ISIK: Yeah, so this is an adversarial example. And the original data set had some human ratings on a subset of the videos, but we collected more on the entire data set. So, like in the original paper, we told subjects that there are two creatures and two objects. And we gave them some information about the physical world, and we just asked them to judge the relationship.
OK, and so then we can try to replicate those human ratings in our models. And so the first thing to do for our graph model is to extract the graph at each time-step. And so for each frame of the video, we have the node information for each agent and object. And so that's each agent's position, velocity, heading direction or angle, size, and whether it's an agent or object.
And then we also code edges in this graph: we add an edge between any two entities that are touching, and those edges are undirected, or bidirectional. So in this frame that I'm showing you, the red agent is touching the blue ball, which is also touching the green agent, so that input graph would look like this. And, importantly, while we used the annotations from the data set, all of this information can be extracted pretty easily visually. And we feed it frame by frame into our graph neural network, which I'm going to tell you a little bit more about now.
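Here is a minimal sketch of that per-frame graph construction; the field names and the feature ordering are illustrative, not the data set's actual annotation format.

```python
import numpy as np

def frame_to_graph(entities, touching_pairs):
    """entities: list of dicts with keys 'pos', 'vel', 'angle', 'size', 'is_agent'
    (illustrative field names); touching_pairs: (i, j) index pairs in contact."""
    nodes = np.array([[*e['pos'], *e['vel'], e['angle'], e['size'], float(e['is_agent'])]
                      for e in entities])                 # (n_entities, 7) node features
    # Undirected contact edges, stored as both directions.
    edges = [(i, j) for i, j in touching_pairs] + [(j, i) for i, j in touching_pairs]
    return nodes, edges
```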
All right, so this is the basic structure. So at each time-step, there's an input-- the input graph that I just showed you is put in, and there's some graph computation that I'll tell you about in a second. And then time-step by time-step, the computation that's happening is essentially like a pretty standard recurrent neural network, like an LSTM. And in the last stage, we make a prediction just based on a linear readout of the final time-step representation that predicts friendly, neutral, or adversarial.
And so this input graph is put into the model by, like I said, taking each of the node representations, which are coded here as V, and, for every edge in the graph, pairing the node information for the two nodes on either side of that edge. That information is put into a linear layer, which learns an updated edge representation; that representation is then zero-padded and fed into the processing at each time-step.
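Putting those pieces together, here is a minimal PyTorch sketch of the graph-then-recurrence idea: pair node features across each edge, pass the pairs through a linear edge layer, pool, and run an LSTM over frames with a linear readout at the end. The dimensions, the mean pooling, and the handling of frames with no edges are my assumptions, not the published SocialGNN architecture.

```python
import torch
import torch.nn as nn

class SocialGNNSketch(nn.Module):
    """Illustrative graph + recurrence model for friendly/neutral/adversarial."""
    def __init__(self, node_dim=7, edge_dim=32, hidden_dim=64, n_classes=3):
        super().__init__()
        self.edge_layer = nn.Linear(2 * node_dim, edge_dim)   # learns edge representations
        self.rnn = nn.LSTM(edge_dim, hidden_dim, batch_first=True)
        self.readout = nn.Linear(hidden_dim, n_classes)

    def forward(self, node_feats, edge_index):
        # node_feats: (T, n_nodes, node_dim); edge_index: list (per frame) of (i, j) pairs.
        per_frame = []
        for t in range(node_feats.shape[0]):
            if edge_index[t]:
                # Pair the node features on either side of each edge.
                pairs = torch.stack([torch.cat([node_feats[t, i], node_feats[t, j]])
                                     for i, j in edge_index[t]])
                frame_repr = torch.relu(self.edge_layer(pairs)).mean(dim=0)
            else:
                # Frames with no contact edges get a zero vector (an assumption here).
                frame_repr = torch.zeros(self.edge_layer.out_features)
            per_frame.append(frame_repr)
        seq = torch.stack(per_frame).unsqueeze(0)     # (1, T, edge_dim)
        _, (h_n, _) = self.rnn(seq)
        return self.readout(h_n[-1].squeeze(0))       # logits over interaction types
```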
And we compare this to two baseline models. The first is a pretty standard recurrent neural network that has the same LSTM backbone as the graph neural network and takes all the same features that went into the graph neural network but just concatenates them. There's no structure here based on the nodes and edges. And just like the graph neural network, it makes a prediction based on the output of a linear classifier at the final time-step.
And then we also compare this to the inverse planning model developed in the PHASE paper, known as SIMPLE. So the first thing we did was try to predict the social interaction type-- friendly versus neutral versus adversarial-- so chance is 33% for these three models. And humans mostly agree on these videos. They're pretty clear, like the one I showed you, but there is some ambiguity, so humans agree with each other about 80% of the time.
And so this is the performance of the inverse planning model. There are some caveats to note about this, but I'll show you an example in a second where that one does much better. But the inverse planning model does much better than the VisualRNN. But on this subset, SocialGNN is at the level of human agreement. And, again, this is a model that doesn't have any explicit representation about physics or agents' goals.
However-- and this is the caveat-- the data set also released a challenge generalization set of a hundred videos with novel scenes and action types. So all of the results I just showed you on the original set were trained and tested on held-out data, but there was quite a bit of visual similarity across those videos. Here we're training the model on one set and then testing it on very different visual and social scenes.
And so here the inverse planning model does quite a bit better. But SocialGNN still does significantly better than the matched visual model, and almost at the level of human agreement. One thing that's interesting: in some follow-up work, we're finding that SocialGNN and inverse planning both explain unique variance in human behavior.
So it's not just that SocialGNN is like inverse planning but worse. There are several examples where SocialGNN agrees with humans, and inverse planning doesn't. And so we're starting to try to explore-- I think this is a nice way to operationalize two different hypotheses. And we're starting to explore why some videos might be correctly classified by one model and not the other.
A real strength of this model, though, is that it can be image-computable, so you can extend it to natural videos. So we used this gaze communication data set, which consists of videos with between two and five agents interacting with each other and with objects, and each video is labeled with a type of gaze communication event.
And the nice thing about the GNN is that, because it's somewhat abstracted away from the very low-level visual information, you can extract graphs even from videos like these. So in this data set, we extracted graphs by putting bounding boxes around all the people and objects, putting each bounding box through a pretty standard deep neural network, and using those outputs as our node features. And we coded edges in the graph based on the gaze direction of the people.
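As an illustration, here is a sketch of that graph extraction for natural video, using a torchvision ResNet as the generic deep network for node features; the bounding-box and gaze-annotation formats are assumptions, and the actual pipeline differs in its details.

```python
import torch
from torchvision import models
from torchvision.transforms import functional as TF

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()        # drop the classifier; keep pooled features
backbone.eval()

def frame_to_gaze_graph(frame, boxes, gaze_targets):
    """frame: (3, H, W) float image tensor; boxes: list of (x1, y1, x2, y2) boxes
    around people and objects; gaze_targets: dict mapping a person's node index
    to the node index they are looking at (illustrative annotation format)."""
    with torch.no_grad():
        crops = [TF.resized_crop(frame, y1, x1, y2 - y1, x2 - x1, [224, 224])
                 for (x1, y1, x2, y2) in boxes]
        # In practice the crops would be ImageNet-normalized before this forward pass.
        nodes = backbone(torch.stack(crops))       # (n_nodes, 512) node features
    edges = list(gaze_targets.items())             # directed edges: looker -> target
    return nodes, edges
```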
So here the edges are directed. In this frame, for example, the woman in pink is looking at the man in blue, who's looking at the object in his hand. But now we've gone from a video, again, to the same type of graph that we can feed into the exact same network architecture. And for this data set there are several different types of gaze communication, but the first thing we did was just look at videos with versus without an interaction.
And so chance here is 50%. And the VisualRNN does slightly above chance. But, actually, it's not a balanced data set. So if you look at the confusion matrix, it's just totally guessing. And, in contrast, our model does do quite a bit better, and it also does a decent job at telling apart the different types of gaze interactions as well.
And so what we found is that this purely visual model that does have relational inductive biases seems to be able to reproduce human interaction judgments. And, importantly, it has no explicit representations of agents' goals or the physics of the world. I think it's an interesting question to what extent it might be learning this information implicitly from training data.
But a key to this model, though, is that you have to have this graph structure. So one thing we tried was, can you just give the RNN information about what agents are touching each other, looking at each other? And it doesn't perform any better. So there seems to be something important about this graph structure in particular to reproduce human judgments.
So we found that social interactions seem to be processed selectively in the human STS, and it's separate from a bunch of other functions that we know that are both perceptual but also higher-level. And it seems like structured visual computations may underlie the judgments humans make about these videos.
And in future work, in collaboration with Tianmin and Josh, we're hoping to compare these model representations across these different models to brain representations to close the loop and see, in fact, are these models that were inspired by our original neuroimaging findings but also other knowledge from cognitive science-- to what extent do they match the brain representations that we're seeing as well?
Just want to circle back to this image and really stress, I think, the importance of trying to extend both our neuroimaging but also our modeling paradigms into more real-world settings and also acknowledge that, going back to that first example, all I've talked about today is pretty much like this interaction here between these people. But this alone is also not enough to tell what's happening in this image.
And so, obviously, we are going to need that information about agents' mental states and the physical world. And so I think, in reality, humans are incorporating all of this information, and we're going to have to start to figure out how the human brain might be doing that and try and understand how we might get models that can incorporate all of this rich both perceptual and higher-level cognitive information together.
So I want to thank my lab, especially Haemy and Manasi, who led the work that I talked about today. You can read more on our website. Both of the papers I talked about are available. And I also want to thank everyone at CBMM, especially all my amazing former mentors, and Tianmin and Josh, who have been really great collaborators on this GNN project. And I'm happy to take any questions anyone has.
[APPLAUSE]