The Neural Basis of Perceiving Human Visual Social Perception (37:39)
June 6, 2018
All Captioned Videos CBMM Summer Lecture Series
Leyla Isik, a post-doctoral researcher at MIT, studies how the human brain recognizes objects and social interactions, using MEG, fMRI, and computational modeling. Dr. Isik first describes her work on decoding the information in MEG signals measured in humans viewing images of visual scenes and objects, showing the temporal evolution of object representations in the brain. She then presents work that reveals a region of the posterior Superior Temporal Sulcus (pSTS) that codes information about the presence and nature of social interactions, derived from visual input, which is well matched by a feedforward computational model.
Isik, L., Koldewyn, K., Beeler, D. & Kanwisher, N. (2017). Perceiving social interactions in the posterior superior temporal sulcus. Proceedings of the National Academy of Sciences 114(43).
Isik, L., Meyers, E., Leibo, J. Z. & Poggio, T. (2014). The dynamics of invariant object recognition in the human visual system. Journal of Neurophysiology 111(1):91-102.
LEYLA ISIK: If you take a basic visual recognition problem, like trying to recognize the dogs on the left from horses on the right. This probably feels very easy for you, but it's actually an extremely computationally challenging problem. And in particular, what makes it hard are different transformations, like the fact that these dogs all appear in different sizes, positions, viewpoints, et cetera.
And you can show empirically, just in a little toy example, that if you were to remove all these transformations-- so in other words, normalize all these dogs and horses so that they're all facing the same way, they're roughly the same size and in the same position-- this becomes a really trivial vision problem. So you can take these dogs and horses, and you can train a classifier simply based on the pixels in the image. And this classifier, if you only give it one dog and one horse, can already way above chance tell you if a new image is of a dog or a horse. If you were to look instead at the un-normalized problem, this is much, much, much more difficult.
And so being invariant to these different transformations really helps facilitate how we solve this problem. And we know that the brain is also really good at dealing with these transformations, or being invariant to them. And this happens largely along the ventral visual pathway, which I've highlighted in red. And most of what we know about this pathway actually comes from recordings in, mostly, macaque physiology. So here they're actually invasive recordings, where they record either from single cells or groups of cells at a time. And that's helped us really map out this hierarchy.
So we know that the first stage of this hierarchy, which actually happens there in the back of the brain, is primary visual cortex or V1. And single cells in this layer respond to very specific oriented lines or edges. So for example, they might respond to a 45 degree angle in this position but not a 90 degree angle in that position, and not a 45 degree angle in any other position.
And that's the beginning of this pathway. And then all the way at the end, we have IT, or inferior temporal cortex. Signals reach there about 100 milliseconds later, and at this point, they recognize whole objects. And they're invariant to a lot of the changes that I talked about before. And while we have a pretty detailed understanding of how this evolves in the macaque, the picture in humans is much less clear, I think. And in particular, one thing I'm interested in is understanding what the timing of the individual steps are. Because I think that can really help provide insight into what the different computations are and what order they're carried out in.
So the main thing we know from humans, mostly from non-invasive EEG signals, is that by about 150 milliseconds, you have signals that can differentiate between different object categories. So like, face, not face, animal, not animal. But the detailed steps to build up to that are still largely unknown in humans. And so, in my first project, I wanted to ask if we could read out earlier visual signals in the human brain. So we have this top finished category representation, but what happens on the way to build that up? And in particular, I wanted to ask when signals could generalize across some of these different transformations, like changes in size or position.
And so to answer this, I used MEG decoding. So this is a picture of our MEG downstairs. And a big part of what helped me find this, is that the MEG came in my first year of grad school and opened to new users in the second year. And it was like, who wants to try it out? It's free to try out. And so I tried out a project on it, and it's totally changed my whole research trajectory. So I think it was really helpful to-- I don't know. These kind of odd things can have a large impact but I think it's helpful to be receptive to them and think about new ways to ask the questions you're interested in.
So this is our MEG downstairs. Subjects sit in it, and there's this helmet that we lift them up into that has 306 sensors. And those sensors measure the change in magnetic fields that are induced when many neurons fire synchronously. So this creates a current when a lot of neurons fire synchronously, which changes both the resulting electric and magnetic fields. So it's very similar to EEG, which you might have heard of, but it has slightly better resolution.
So subjects sit in the MEG, and we can show them pictures on this projector. And we record the data, over time, from 306 sensors. So the main advantage of MEG is that it is a pretty direct measure of the neural firing. So unlike functional MRI, that I'll talk to you about later, it gets very good temporal resolution. So every millisecond we actually get out a new recording from the MEG. And I analyze this data using machine learning, so we take the data at a particular time point, which gives us a 306-dimensional vector that we then just feed into a linear machine learning classifier.
So I think many of you know about machine learning, but just a little mini-primer. So in particular, we use supervised machine learning, which means that we take a set of our data that's labeled, and we call this our training data that we train our classifier on. So for example, if I were to say, project onto a 2D space the images on the left-- response to images on the left versus on the right, you could imagine trying to learn the classifier that best separates these two sets of points.
In particular, this is a support vector machine, which learns a hyperplane, but in two dimensions, that's just a line. So it's the line that maximizes what we call the margin, or the distance between that line and the nearest training point. So you give it these points. It fits the line. And you can either learn a linear classifier, or you can learn something more complicated that's non-linear. And then you take a whole new set of test points that your algorithm has never seen, so for example, these lighter gray and lighter green points. And you can just ask, how well does that classifier predict the right labels for these new test points?
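For reference, the maximum-margin idea can be written in the standard hard-margin form. The notation is an assumption and is not from the talk: training points $x_i$ with labels $y_i \in \{-1, +1\}$, weight vector $w$, and offset $b$.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,\bigl(w^{\top} x_i + b\bigr) \ge 1 \quad \text{for all } i
```

The separating hyperplane is $w^{\top} x + b = 0$, and the distance from it to the nearest training point is $1 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ is the same as maximizing the margin.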
So here we have a linear and a nonlinear classifier, just on this little toy example. Which one do you think would work better? The linear one. Yeah, exactly. Oh, I don't have a little demo. So the linear one would work better. And we use linear-- often, linear algorithms are less prone to what we call overfitting, but the main reason we use linear algorithms is because the operations that are carried out there are kind of similar to what people think another stage of neural processing could do. And so we want to understand how then, if the neural signals can be separated in a linear way, that could be carried out by another set of downstream neurons or a downstream brain area.
And so we'll take a subset of our points, put it into a linear classifier, and get out a predicted label for a new testing point. And so, if in this case, the classifier predicted face, that would be correct. But if it predicted car, that would be incorrect. And we can use the accuracy of this prediction as a proxy for what information is present in the MEG signals. So if you can correctly classify face versus car, then there's probably information in the-- then there is information in the MEG signals about whether something is a face or a car. And because MEG has such good temporal resolution, we can repeat this over and over at each time point. So we get a measure of the classification accuracy over time. Any questions?
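A minimal sketch of this time-resolved decoding, run on synthetic data rather than real MEG recordings. Everything here is an assumption made for illustration: the simulated trials, the onset sample, and the use of a nearest-centroid rule (a simple linear classifier) as a stand-in for the support vector machine described in the talk.

```python
import numpy as np

def simulate_meg(n_trials=40, n_sensors=306, n_times=30, onset=10, seed=0):
    """Synthetic stand-in for MEG data: two classes that differ only after `onset`."""
    rng = np.random.default_rng(seed)
    labels = np.repeat([0, 1], n_trials // 2)
    data = rng.normal(size=(n_trials, n_sensors, n_times))
    pattern = rng.normal(size=n_sensors)  # hypothetical class-specific sensor pattern
    data[labels == 1, :, onset:] += 0.5 * pattern[:, None]
    return data, labels

def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Linear classifier: assign each test vector to the closer class mean."""
    c0 = train_x[train_y == 0].mean(axis=0)
    c1 = train_x[train_y == 1].mean(axis=0)
    d0 = np.linalg.norm(test_x - c0, axis=1)
    d1 = np.linalg.norm(test_x - c1, axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred == test_y).mean())

def decode_over_time(data, labels):
    """Train on even trials and test on odd trials, separately at every time point."""
    train = np.arange(len(labels)) % 2 == 0
    test = ~train
    return np.array([
        nearest_centroid_accuracy(data[train, :, t], labels[train],
                                  data[test, :, t], labels[test])
        for t in range(data.shape[2])
    ])

data, labels = simulate_meg()
accuracy = decode_over_time(data, labels)  # one accuracy value per time point
```

Each time point yields a 306-dimensional vector per trial, and the classifier is trained and tested from scratch at every time point, which is what produces an accuracy curve over time. In this toy setup the curve should sit near chance (0.5) before the simulated onset and rise well above it afterwards.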
AUDIENCE: What is the spatial resolution like?
LEYLA ISIK: Great question. The spatial resolution, people would say, is on the order of centimeters, which is not very good. So one thing you can do with these 306 time-- so we only get 306 points out over time. But you want to estimate say, the activity in all the voxels in a brain, which is tens of thousands, so it's a really ill-posed problem. So there are some math solutions to try and fix that, but it leads to errors. And so here, I'm actually not going-- in all of the MEG work I'm going to be telling you about, I'm not worrying about this, where the signals are coming from at all, just what's happening over time.
AUDIENCE: So you can't distinguish between two [INAUDIBLE]?
LEYLA ISIK: I could not with the standard off-the-shelf source localization methods. Some people are really into source localization. It's a lot of the research they do, and they would probably tell you that you can. I'm more skeptical of it, and the approach I generally take is that if you have a question about where in the brain something's happening, you should use functional MRI. And if you have a when, you should use MEG or EEG. And there's now cool techniques to kind of fuse the two. So you could show the same images in both and compare the pattern of responses across the two to get a better picture of when and where things are happening.
AUDIENCE: So you mentioned that you think the linear SVMs have a more biological basis. Am I missing something?
LEYLA ISIK: I don't think the linear SVMs, per se, have a biological basis. But I think that if-- generally people think that if two signals are linearly separable, a population of neurons could tell them apart. Downstream neurons could tell them apart. They wouldn't necessarily use an SVM-like mechanism.
AUDIENCE: I'm sorry, downstream was?
LEYLA ISIK: So if I am reading out signals at 100 milliseconds, wherever the-- the idea is that the next brain area that those are projected to, in theory, could also tell them apart, if that makes sense. But we don't know if they do, right? So it's all kind of conjecture. This is just a question of when the information is there. It doesn't necessarily tell you if the next brain area is using-- or if the brain is using that information.
Right, so the first question we just wanted to ask was similar to the car versus face one I just outlined. We showed 25 grayscale images on a blank background and just asked when you could tell which of these 25 images somebody was looking at. And so I'm going to plot the classification accuracy on the y-axis versus time on the x-axis. And before the image goes on at time 0, we see that we're at chance, or 1/25, which is reasonable, because there's nothing on the screen. So the classifier is essentially just guessing.
But once the image goes on, and we flash these images for 50 milliseconds, you see a pretty dramatic rise in the classification accuracy. That stays significantly above chance for several hundred milliseconds. So that blue line at the bottom is indicating when the decoding is significantly above chance. You see that that starts at about 60 milliseconds. So as soon as 60 milliseconds after we show this image, I can already reliably tell you which image somebody was looking at.
So it was pretty cool. But telling these images apart is kind of trivial. So like I mentioned, even just based on the orientation of the-- different oriented lines or signals that are present in very early visual cortex, you could probably do this task. So we were really interested in this generalization question. So when you can generalize across, say, different positions or different scales.
So to do this, we can take a subset of those images and present them at different positions and different scales. And then we can train our classifier with data from the images presented at one position, say the lower half of the visual field, and test the classifier on images presented at the center of the visual field. And this will tell us not just when there's information about a cup on the screen, but when there's information about a cup that generalizes across positions. Does that make sense?
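The train-on-one-position, test-on-another logic can be sketched with a toy simulation. All of it is assumed for illustration: the data are synthetic, the two "sensor blocks" are a hypothetical way to model a position-specific versus a position-tolerant code, and a nearest-centroid rule stands in for the linear classifier used in the study.

```python
import numpy as np

N_OBJECTS, N_HALF, N_TRIALS = 6, 150, 20

def simulate(patterns, sensor_block, rng):
    """Noisy trials in which each object's pattern is written into one block of sensors."""
    labels = np.repeat(np.arange(N_OBJECTS), N_TRIALS)
    trials = rng.normal(size=(labels.size, 2 * N_HALF))
    trials[:, sensor_block] += patterns[labels]
    return trials, labels

def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Assign each test trial to the class whose training mean is closest."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    pred = classes[dists.argmin(axis=1)]
    return float((pred == test_y).mean())

rng = np.random.default_rng(0)
patterns = rng.normal(size=(N_OBJECTS, N_HALF))
first, second = slice(0, N_HALF), slice(N_HALF, 2 * N_HALF)

# Position-specific code: the lower and center positions drive different sensors.
lower_x, lower_y = simulate(patterns, first, rng)
center_x, center_y = simulate(patterns, second, rng)
specific_acc = nearest_centroid_accuracy(lower_x, lower_y, center_x, center_y)

# Position-tolerant code: both positions drive the same sensors.
lower_x, lower_y = simulate(patterns, first, rng)
center_x, center_y = simulate(patterns, first, rng)
tolerant_acc = nearest_centroid_accuracy(lower_x, lower_y, center_x, center_y)

chance = 1 / N_OBJECTS
```

In both cases the classifier is trained only on "lower" trials and tested only on "center" trials. With the position-specific code, accuracy should stay near chance (1/6); with the position-tolerant code it should be far above chance. That is why above-chance cross-position decoding is read as evidence of a representation that generalizes across position.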
All right. So that's what I'm plotting here. So you'll see that we still can decode across these position shifts. So now there are fewer objects. I think there's only six, so chance is 1/6. But the other thing you might notice is that now this happens quite a bit later than that initial 60 milliseconds I showed you. Now it's after 100 milliseconds. So it seems like generalizing takes longer. And we can repeat this for all of the different position comparisons that we did. We can see that we can decode across all the transformations. We can do the same thing for the different scale shifts, and we can decode across all the different scales.
But what you might notice is that there is a difference in the onset latency for the different decodings. So you can decode across the largest and second-largest image better and earlier than you can across the largest and smallest image. So we thought that was interesting. So we actually explicitly looked at the time to when you can first significantly decode.
So we can plot when invariant-- when the decoding first comes online. First, for the case where there is no invariance, so I say there's zero shift between the training and test images. They're exactly in the same position and scale. And that happens at this, like, 60 to 80 milliseconds that I showed you earlier.
Then you can look at the three size cases so a 1.5x scaling, a 2x scaling, and a 3x scaling. And what you see is that the 1.5x scaling happens quite a bit earlier than the 3x scaling. So it suggests that the larger the transformation, the longer it takes. And we see the same trend emerge with the position invariant case. So the two 3-degree position shifts happen much earlier than the 6 degree position shifts. And this was pretty cool. It suggests that invariance to small transformations occurs before invariance to large transformations. And in particular, it verifies some computational models of the ventral stream that I will talk about now.
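The talk does not say how onset latency was computed. One common convention, assumed here, is to take the first time point that starts a run of consecutive significant samples, so that an isolated blip is ignored. The significance masks below are made-up values for illustration, not results from the study, and `min_consecutive` is a hypothetical parameter.

```python
def onset_latency(times_ms, significant, min_consecutive=3):
    """Return the first time that begins `min_consecutive` significant samples, or None."""
    run = 0
    for i, flag in enumerate(significant):
        run = run + 1 if flag else 0
        if run == min_consecutive:
            return times_ms[i - min_consecutive + 1]
    return None

times_ms = list(range(0, 200, 10))  # 0, 10, ..., 190

# Hypothetical masks: True where decoding is significantly above chance.
small_shift = [t >= 90 for t in times_ms]
large_shift = [t >= 150 for t in times_ms]
small_shift[3] = True  # isolated blip at 30 ms that should not count as the onset

small_onset = onset_latency(times_ms, small_shift)
large_onset = onset_latency(times_ms, large_shift)
```

Comparing the two onsets is the kind of measurement behind the claim that invariance to larger transformations emerges later.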
So these are convolutional neural network models. They are inspired by visual cortex. And these include both simpler biologically-inspired models, like the HMAX model that I tested here. But the modern class of deep learning models that you may have heard about, that currently achieve state of the art performance on a whole range of tasks, have the same basic architecture as these. And so they consist-- you put an image in, and these models are what we call hierarchical in that they have many stages. And they're feed-forward, so the output of one layer always serves as input to the next layer. And information is just passed in a forward manner.
So they have convolution layers that build selectivity. So for example, those would convolve the input image with a particular feature. So at this top layer that's a 45-degree angle. And these are analogous to those cells I told you about in early visual cortex that respond to a 45-degree angle all over the visual field. And so if you don't know what a convolution is, it's basically just comparing the similarity of a feature to the input image tiled at all possible positions. And so we say they build up selectivity, because it would tell you if there's a 45-degree angle versus a 90-degree angle.
And the outputs of these convolutional layers go into what we call pooling layers that build up invariance by, for example, taking a local max over a group of underlying convolutional cells. So this red cell would take a max over the response of all four of these underlying convolutional cells. So we say that it would be position-invariant, because it would fire if there was not only a 45-degree angle here, but at any of the four underlying positions.
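A minimal sketch of one convolution stage followed by one max-pooling stage, assuming tiny binary images and hand-made 3x3 line features (the image size, the features, and the helper names are all illustrative, not taken from the HMAX model). As in most neural network code, the "convolution" is implemented as a sliding dot product without flipping the feature.

```python
import numpy as np

def correlate_valid(image, feature):
    """Slide `feature` over `image` and return the dot product at every position."""
    fh, fw = feature.shape
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + fh, c:c + fw] * feature)
    return out

def max_pool(response, size=2):
    """Take the max over non-overlapping size x size blocks."""
    h = response.shape[0] // size * size
    w = response.shape[1] // size * size
    blocks = response[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def place(feature, row, col, size=8):
    """Draw `feature` into an otherwise empty size x size image."""
    image = np.zeros((size, size))
    image[row:row + feature.shape[0], col:col + feature.shape[1]] = feature
    return image

feature_45 = np.fliplr(np.eye(3))   # "/" diagonal line
feature_90 = np.zeros((3, 3))
feature_90[:, 1] = 1.0              # vertical line

image_a = place(feature_45, 0, 0)
image_b = place(feature_45, 1, 1)   # same line, shifted by one pixel

conv_a = correlate_valid(image_a, feature_45)
conv_b = correlate_valid(image_b, feature_45)
pool_a = max_pool(conv_a)
pool_b = max_pool(conv_b)
wrong_feature_max = correlate_valid(image_a, feature_90).max()
```

The convolution output is selective: the diagonal feature gives its peak response of 3 where the diagonal line sits, while the vertical feature never exceeds 1 on the same image. The convolution maps for the two images differ at position (0, 0), but the pooled unit covering that neighborhood gives the same response of 3 for both, which is the position tolerance that pooling adds.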
And then these layers are often stacked on top of each other to build up selectivity to more and more complex features and pooling over a larger and larger range of the image. And so we don't have time to go into all the details, but it might be clear to you that invariance in these models gradually rises from layer to layer. And so we compared it on the exact same stimuli, and we saw a very similar trend that the early model layers could tell apart-- could recognize images across small transformations, but it wasn't until the last layer that we could decode across all of our transformations.
And so just to summarize, this gave us a picture of how information about invariant object recognition evolved over time. So we saw that in between 60 to 80 milliseconds after an image is shown, you can read out signals that can distinguish between different objects. But invariance in these signals gradually develops over time, over the next 100 or so milliseconds. And the early and late stages of the human visual processing map on really nicely to early and late layers of a convolutional neural network and nicely also mapped on to latencies that we knew, that I told you about, from the macaque physiology.
LEYLA ISIK: Exactly. Exactly. So you could just take the output of that layer, and train a similar linear classifier. And this was really kind of a nice proof of-- so it gave us previously unknown latencies for human object recognition, but was also a nice proof of concept that we could use these tools to differentiate computational steps in the human brain.
So that was all great. But meanwhile, in the computer vision world they're making huge advances in this object recognition problem. And that was spurred in large part by the creation of this, what's known as the ImageNet data set. So this contains millions of labeled images of 1,000 different object categories. And so, for example, these are all terrier images. And what you might be able to notice is that it has-- these images have all the variation of those dogs that I showed you in the first frame. So one nice thing about this data set is that it provided a huge amount of variability to try and train these networks on.
And before 2012, people were using what we called hand-engineered features to try and recognize these images. So they would go in and hand-design particular edge detectors and things like that, based on their own theories of what image features would help recognize, would help differentiate, these different categories. And you can look at how well these algorithms performed. So this is the error rate on the y-axis of the 2010 and 2011 winners.
But then in 2012, there kind of came about this huge deep learning boom. And these are networks that have a very similar architecture to what I just told you about. But in the previous HMAX model that I mentioned, we hand-designed all the filters and the pooling stages to match what we think happens in human visual cortex. But they were still all designed by hand.
What people have now shown works much better is, instead of hand-designing these features, you can actually learn them from data. And these models aren't new. They've been around since the 1980s. I think even a little before. But what changed is that, one, now there is a huge amount of labeled data to train them on and optimize all the features, the filters. And two, people started to realize that you could use GPUs, or graphics processing units, to parallelize all the computations. And that made training them much, much faster than it was. So training them at this scale used to be infeasible. And now, just because of new larger labeled data sets and better computing, these same networks are now having a huge resurgence.
And so in 2012, there's a really dramatic drop in the error rate of the best performing model. So I should say that this is a challenge where every year computer vision researchers submit their best model, and then the people who put on the challenge evaluate whose network did the best. And it's a really big deal to win this. It's like a pretty key benchmark. And the error rate has just been dropping and dropping and dropping, and the winning model has always been roughly based on one of these deep learning convolutional neural network architectures. And now people claim that these networks have even surpassed human performance, which they show in gray. And so on this particular task, which is a little funny, these networks do even better, are even at superhuman performance in recognizing objects.
All right. So that was really exciting, but I think what's even more exciting for those of us in neuroscience and cognitive neuroscience was some work pioneered, really, by Dan Yamins and Jim DiCarlo, showing that not only do these networks perform really well on the task, but the different model layers in the network-- which they're showing here at the bottom-- map on really nicely to the representations actually carried out in-- here, they mostly compare it to the macaque brain, so invasive recordings in different stages of macaque visual processing. So these models not only achieve superhuman performance, but also give us a really faithful model of the underlying biology.
So that's all really exciting. And I think, sort of, begs the question, are we done? You know, is vision solved, we all go home? And I would say, no. And at this point, I was really trying to think about what are all the other rich things that humans do with their visual system beyond just recognize objects. An example of this that I like to give is this video from the Senate floor that showed John McCain's vote on a health care bill like a year ago. I don't know if any of you saw this, but I think it's a really nice illustration of all the things that we do with our visual systems, in particular, to recognize people and what they're doing.
So for example, you can very easily tell what other people think of each other, can tell what they're looking at, and you can recognize their communicative gestures. That was a thumbs down. You can also tell rich information about groups of people. For example, whether they're engaged in a social interaction or not engaged in a social interaction, and what the nature of those different interactions are. And all of these abilities really underlie how we navigate in our complex social world.
And so these social vision abilities are developed very early in infancy, and many of them are shared with other primates. So for example, newborn-- both macaques and humans-- really have a strong preference to look at faces. And then, after only a few months of age, human infants can make even more complex social judgments like recognizing different social interactions. So for example, they can tell if one puppet is trying to help another one get up a hill versus hindering it. And they can do this as young as three to six months of age.
You also know that there are large portions of the human cortex dedicated to recognizing other people's faces, bodies, and even thoughts. And more and more, recognizing this sort of social information about other people is becoming increasingly important for today's artificial intelligence systems. In particular, the example I always like to use is self-driving cars. Even a state of the art self-driving car can't make an unprotected left turn, because if you think about the way you make an unprotected left turn is you look at the other driver. You make eye contact with them, and you recognize their very subtle social cues and actions. And today's AI is not good at this.
But despite all of its importance, I think the underlying neural computations are still really poorly understood. And so part of the reason for this is because it's a pretty challenging problem. So that video that I started with was so complex that there were many newspaper articles dedicated to just dissecting it. So in this one, a BuzzFeed News reporter actually went through and annotated different parts of that video. So for example, he described this image. "Elizabeth Warren starts leaning to the side and craning her neck to try to see what McCain is going to do." And just for fun we can put that through a popular deep network and get out the labels that it gives this image, which are rocking chair, desk, and barber chair.
Right. We can try another example. "Democrats gasp in shock and relief, until Schumer gestures frantically, urging them to cut it out." And AlexNet says, barber chair, barber shop, shoe shop. And the last one is, "Before McCain returns to his seat, he pauses for half a second in front of McConnell, who doesn't acknowledge him." My algorithm says, suit, table, soda bottle. And if you think about it, these labels aren't bad. Actually, from a computer vision-- from an object recognition standpoint, they're pretty impressive. But I think this really highlights the gap between all the progress we've made in understanding object recognition and humans' rich social visual abilities.
So that's why my overall approach is to take the methods that have been so successful in helping us understand object recognition and apply them to understand social vision. In particular, I mean high spatial resolution neuroimaging data, so functional MRI, high temporal resolution neuroimaging data like the MEG I just told you about, and computational models. And today, I'm going to talk a little bit more to you about not social vision broadly, but one particular application that I've been focused on, which is social interaction perception.
All right. So to start with some functional MRI studies. So recognizing other people's social interactions is critical for how you navigate in the world, and in particular, how you determine the social structure of the world. So for example, it can help tell you who is friend or foe, who belongs to which social group, and who has power over whom. These abilities, like I said, are shared by infants and primates. But we still don't know if there are distinct brain regions that are selective for social interactions, or if social interactions are processed in some of these other known social brain regions.
And why this problem is important, I think-- I think for me, both the spatial questions and temporal questions can help us understand different parts of the computation. So this helps us ask, one, is this problem so important that we have dedicated cortical machinery to solving it? And two, maybe if we do, is that because we need some specialized templates or information specifically about social interactions to help recognize them?
And then if we do identify a brain region that's selective for social interactions, we want to know not just does it recognize a social interaction, but can it tell apart different types of social interactions? And because of the rich developmental literature, some of the infant studies I told you about, we specifically looked at helping versus hindering interactions.
All right. So in the first experiment we showed subjects these point light videos of pairs of agents either engaged in a social interaction, like here on the top, or two agents acting independently. And we chose these stimuli because real world social interactions are imbued with a lot of rich visual and contextual information. But here we wanted to strip them down to their most bare elements. So two agents, acting with actions directed towards each other, that are temporally contiguous. And we wanted to ask simply, are there any brain regions that respond more to videos like the one on the top versus those on the bottom?
So we show these to people in functional MRI, and we just compare responses to images on the top versus the bottom. Yeah, so I think real people would also work very well. The reason we didn't do real people-- and in general, that activates more of the brain than these point lights-- but the reason we didn't want to do that is because real people also have faces. And there's other information about them. And we really wanted to look at the interaction motion information. So people originally designed these stimuli to study single people's actions. So you can recognize what action these guys are doing quite well, even though there is primarily motion information, devoid of the form.
AUDIENCE: Is this kind of to reduce error, or [INAUDIBLE]?
LEYLA ISIK: I think just reduce the amount of information that we have, to kind of strip it down to its barest bones, is how we're thinking about it. Good question. Other questions? So this is a group map of all our subjects' functional MRI data and just trying to contrast the interacting versus independent. So red are all the regions that respond. The more red or yellow it is, the more significantly that part of the brain responds to social interactions. What we see is that there's a very clear peak of significance located here in this region known as the posterior STS.
So this is the superior temporal sulcus. So in general, your brain has many folds. Those are called sulci or sulcuses. And this is one that's known as the superior temporal sulcus, and we call this in particular the posterior part of that. Posterior just means back. So it's at the back of the superior temporal sulcus. And so this was pretty cool, that we could find a region that was very selective for that. And I'm only showing you one view of one half of the brain. But this preference was really selective for the right half of the brain and that we didn't really see any other activity outside of this region.
But that was a whole big group map. So to do that, I have to align different people's brains together, and you see a lot of anatomical variability and functional variability across people. So this region in me and in you will be pretty close by to each other, but because our brains are shaped differently, they won't be exact. It's difficult to exactly align them. So what we ultimately want to do is identify this region in individual subjects. And in addition, this area called the superior temporal sulcus has a lot of other social processing regions nearby. So we want to identify our region in individual subjects and ask if it overlaps with other known social processing regions.
So here's one individual subject's data. Can you guys see this OK? And so in red is our social interaction region that I just told you about, and we compared it to the location of the temporal parietal junction, or TPJ, which is known to represent information about other people's mental states or when somebody is thinking about somebody else. It's also near this region in the pSTS that responds selectively to moving faces. And then, as a control region, we localize this region known as MT, which is a motion selective region that responds to any motion. So the more motion, the more it responds.
And we did this in all 14 of our subjects, and I'm just showing you four of them here. And I hope this gives you a sense that these regions were pretty systematically located across all the subjects, even though there was variability. I mean, this is also to give you a sense of how messy this data can be. Even though there was some variability, these regions seem to fall in roughly the same spots in all our subjects.
AUDIENCE: Can you guys work with any subjects that have diseases or [INAUDIBLE]?
LEYLA ISIK: Great question. So this is all healthy participants who we scan here. I'm really interested in, at some point, looking at people on the autism spectrum. I think it would be really interesting to see how that might be affected here. But all the work I'm going to tell you about today is with healthy, typically developing participants.
AUDIENCE: This [INAUDIBLE] detects excited neurons?
LEYLA ISIK: So this detects the changes in the blood oxygenation levels. So when many neurons fire, blood moves to that part of the cortex. And so that's what we're detecting. So it's a really indirect measure of neural firing. And we have some models of how that blood flow translates to neural activity, but it's large scales of neural activity. We don't have the best sense of the biophysical basis of that, to be perfectly honest.
And because the blood flow is pretty slow, we only get a new reading every two seconds. So it's pretty different from the MEG signals that I showed you before, where you get a new reading every millisecond. Here, you get pretty good spatial resolution, so you can even get millimeter resolution, but you get very poor temporal resolution.
With all of these non-invasive techniques, we don't have the best idea of activity in individual neurons. We saw that there was a systematic organization of these brain regions, and at least spatially, they were pretty distinct from each other. So it's not like this social interaction region was the same thing as previously identified face regions or something like that. You could just imagine that you have one big chunk of this part of the cortex that does all these different social things. At least here, anatomically, that doesn't seem to be the case. We can ask the same thing functionally.
So we can, on held out data, ask how much each of these brain regions responds to the interacting videos versus the independent videos. And so this is a social interaction brain region. We see a 2 to 1 greater response in held out data to social interactions versus independent actions. So this helps show us that our response is robust and replicates in independent data. And we don't see any difference in the response in the TPJ or the motion selective region, MT.
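The held-out logic described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis pipeline used in the study: the data are synthetic, and the names `simulate_runs` and `held_out_roi_response`, the voxel counts, and the odd/even run split are all hypothetical. The idea it shows is that the voxels are picked using one set of runs, and the response to interacting versus independent videos is then measured only in runs that played no part in picking them.

```python
import numpy as np

def simulate_runs(n_runs=6, n_voxels=200, n_selective=20, seed=0):
    """Synthetic responses, shape (runs, conditions, voxels).

    Condition 0 = interacting videos, condition 1 = independent videos.
    The first `n_selective` voxels are made to prefer interactions.
    All numbers are made up for illustration.
    """
    rng = np.random.default_rng(seed)
    data = rng.normal(loc=1.0, scale=0.5, size=(n_runs, 2, n_voxels))
    data[:, 0, :n_selective] += 1.0  # extra response to interacting videos
    return data

def held_out_roi_response(data, n_top=20):
    """Define a region on some runs, measure its response on the others."""
    localizer = data[0::2]  # runs used only to choose voxels
    held_out = data[1::2]   # runs used only to measure the response

    # Contrast (interacting minus independent), averaged over localizer runs.
    contrast = (localizer[:, 0] - localizer[:, 1]).mean(axis=0)
    roi = np.argsort(contrast)[-n_top:]  # voxels with the largest contrast

    # Mean response of the chosen voxels in the independent, held-out runs.
    interacting = held_out[:, 0][:, roi].mean()
    independent = held_out[:, 1][:, roi].mean()
    return interacting, independent

if __name__ == "__main__":
    interacting, independent = held_out_roi_response(simulate_runs())
    print(f"held-out interacting: {interacting:.2f}")
    print(f"held-out independent: {independent:.2f}")
```

Splitting the runs this way matters because measuring the response on the same data used to choose the voxels would inflate the difference even in pure noise.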
We see a weak response in the face region, but it's significantly greater in our social interaction region. And I have some theories for why this is that I'm happy to chat about. This is kind of a funny region, the face region. It responds just as strongly to voices, and people don't really understand what its underlying function might be. So I think this can also in the future help provide some clues to that, as well.
All right. So it seems like we identified a brain region that's selectively engaged when people watch social interactions. We thought that was really exciting, that there is a chunk of your cortex that's more involved when you watch social interactions than not. But we wanted to ask, does this generalize to new stimuli? So people will rightly ask, these point light figures are a little funny, so one thing we can do is try it in totally different videos. And two, does it represent information, not just about whether there's an interaction, but also the nature of that interaction? So that helping versus hindering question.
So to do that, we made our own stimuli in the style of these classic psychology experiments by Heider and Simmel. We should put one in here. Maybe some of you are familiar with them, but they showed that you can really attribute animacy to these shapes moving around in an animate fashion in videos. And we designed them to be kind of in the style of those infant experiments that I talked about before, where one shape is trying to achieve a goal, and the other one either helps it or hinders it.
That's a helping one. We also had hindering scenarios. And in all of these, across the different help versus hinder, we tried to keep them as matched as possible. And there were 10 of these scenarios in total. So we compared these social interactions to a physical interaction. So here these two shapes are moving around like billiard balls. And a final set of videos, where there's just a single shape trying to achieve its goal and either succeeding or failing on its own.
All right. So the first thing we want to ask is, does social interaction selectivity in this brain region generalize to these new stimuli? So these shapes look totally different than those point light figures I showed before. They don't even have a human form. But do they also drive this region? And yes, they do. We see a significantly greater response when people watch either the help or hinder videos compared to the physical interaction videos.
So that was encouraging. But now we wanted to ask one more question. So other people have looked at stimuli like these, and there's a lot of differences between the help and hinder scenarios and the physical billiard ball scenarios. So in those help and hinder scenarios, you really perceive those shapes as being alive and having goals. And so many people have seen this difference and either attributed it to the shapes being alive, and said this brain region cares if something is animate or inanimate, or attributed it to a goal, and said this brain region really cares about goals. And so that's why we had that last set of videos with the single shape, which people rated as just as animate and just as goal-driven.
And we wanted to ask, does that also drive this region? Is that possibly an alternate account of our findings? And we find, no. We find that this region really is not driven any more by a single animate shape than it is these physical billiard balls. So it seems like you really need these two shapes engaged in a social interaction to drive this region.
And so now we're going to move on to our last question. I wish I'd done this in a different order. So before I was just showing you the average response in each brain region. We call this the univariate response. But here, we're going to go back to that machine learning approach. So instead, I can take all the individual voxels in our brain region. So a voxel is a pixel with volume. So it's a 3D pixel. And ask, does the pattern of activity in all those voxels-- can that help us tell apart helping versus hindering? Because we didn't see a difference in the average mean response to helping or hindering. This region liked both of them.
But now we can plot the classification accuracy in this brain region for helping versus hindering, and we see that it's significantly above chance. Chance would be 50%, since there's two classes. However, we also see that our other nearby social brain regions can do this-- the TPJ and the STS face region. And this was interesting, perhaps not surprising, that in particular, a region that is focused on understanding other people's thoughts can also make this distinction. We thought that was pretty interesting. Importantly though, our control motion selective region can't tell apart these stimuli. So it's not some low level confound in the motion of the videos.
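To make the pattern classification idea concrete, here is a minimal Python sketch, assuming synthetic data and a simple correlation-based nearest-centroid classifier with leave-one-run-out cross-validation. The talk does not specify which classifier or cross-validation scheme was used, so both are assumptions, as are the names `simulate_patterns` and `leave_one_run_out_accuracy`. The synthetic helping and hindering patterns are built to have roughly the same average response, so any above-chance accuracy has to come from the pattern across voxels rather than the mean.

```python
import numpy as np

def simulate_patterns(n_runs=8, n_voxels=50, signal=1.0, seed=0):
    """One synthetic voxel pattern per class per run (made-up data).

    Label 1 (say, helping) adds a fixed voxel template; label 0 (hindering)
    subtracts it. signal=0.0 gives pure noise with no class information.
    """
    rng = np.random.default_rng(seed)
    template = rng.normal(size=n_voxels)
    X, y, runs = [], [], []
    for run in range(n_runs):
        for label in (0, 1):
            sign = 1.0 if label == 1 else -1.0
            X.append(sign * signal * template + rng.normal(size=n_voxels))
            y.append(label)
            runs.append(run)
    return np.array(X), np.array(y), np.array(runs)

def leave_one_run_out_accuracy(X, y, runs):
    """Train on all runs but one, test on the held-out run, and repeat."""
    correct = 0
    for held_out in np.unique(runs):
        train = runs != held_out
        test = ~train
        # Average training pattern (centroid) for each class.
        centroids = [X[train & (y == c)].mean(axis=0) for c in (0, 1)]
        for pattern, label in zip(X[test], y[test]):
            # Assign the class whose centroid correlates best with the pattern.
            r = [np.corrcoef(pattern, c)[0, 1] for c in centroids]
            correct += int(np.argmax(r) == label)
    return correct / len(y)

if __name__ == "__main__":
    X, y, runs = simulate_patterns(signal=1.0)
    print("accuracy with pattern signal:", leave_one_run_out_accuracy(X, y, runs))
    X0, y0, runs0 = simulate_patterns(signal=0.0)
    print("accuracy with noise only:", leave_one_run_out_accuracy(X0, y0, runs0))
```

With two classes, chance is 50%, so a region that carries no helping versus hindering information should hover around 0.5 on average, which is the behavior described for the motion selective control region.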
So all in all, we've identified a region in the pSTS that codes for both the presence of social interactions in point light stimuli like these, as well as in these shapes stimuli, and also codes for the nature of interactions. And so this helps to give us some insight into how the brain is recognizing social interactions.