Invariant representation of physical stability in the human brain
March 23, 2021
Successful engagement with the world requires the ability to predict what will happen next. Although some of our predictions are related to social situations concerning other people and what they will think and do, many of our predictions concern the physical world around us. We see not just a wineglass near the edge of the table but a wineglass about to smash on the floor; not just a plastic chair but one that can (or cannot) support our weight; not just a cup filled to the brim with coffee, but a cup at risk of spilling over and scalding our hands. The most basic prediction we make about the physical world is whether it is stable, and hence unlikely to change in the near future, or unstable, and likely to change. In this talk, I will present our recent work where we asked if judgements of physical stability are supported by the kinds of representations that have proven to be highly effective at visual object recognition in both machines and brains, or if the ability to determine the physical stability of natural scenes may require running a simulation in our head to determine what, if anything, will happen next.
NANCY KANWISHER: Welcome, everybody. It's nice to have you all here and it is my pleasure to introduce Pramod for today's talk. So Pramod received both his master's degree in electrical engineering and a PhD in neuroscience and electrical engineering from the Indian Institute of Science in Bangalore. That's the super fancy top place for science in India.
And while he was there, Pramod worked with Arun on a whole host of interesting questions about visual perception in people and machines and the relationship between the two. And that included work on the role of symmetry in object perception and the optimality of peripheral blur for object recognition. And in my lab, he's been working with me and Josh Tenenbaum looking at a topic at the core of CBMM's interest, and that's how we reason about the physical world.
And I find this super exciting, because it's just kind of wide open and fundamental to human cognition. And it's an area where we're starting to get serious computational models to test, and one where we think we're starting to know where to look in the brain. So there's all kind of cool stuff to do here. And Pramod will tell you about it. So take it away, Pramod.
PRAMOD R T: Hi, all. Thanks, Nancy, for that introduction. And thanks to CBMM for giving me this opportunity to present my work on invariant representation of physical stability in the human brain.
So this is a picture of an everyday kitchen scene. Just from the snapshot, we see not just a glass, but a glass supported by the table. A glass that contains milk that is precariously placed on the edge of the table. And someone who is not careful enough might bump into it causing the milk to spill all across the floor and possibly even shattering the glass into pieces.
So understanding the physics of the world is useful in various everyday circumstances to plan actions and interact with our surroundings. It could be as simple as stacking dishes in a pile in the kitchen sink without breaking any of them, knowing where and how to walk on the Cambridge sidewalks during winter without injuring oneself, playing a game of billiards and not sucking at it, and even inventing new ways of using common everyday objects.
This intuitive understanding of the physical world around us is considered to be a part of core knowledge along with other mental faculties that help us understand agents and their intentions. As such, many of these intuitive physical abilities are present even in infants. Four-month-old infants understand that solid objects cannot pass through each other. And at around five months, they have different expectations of solids and liquids.
Infants also understand that objects cannot just float in the air but do need support. And by around 11 months, they also understand stability. And as early as 3 and 1/2 months, infants have an understanding that objects can only be lowered into an open-top container and not a closed-top container. In other words, the relationship of containment. And finally, infants can already reason about gravity and understand how objects move under its influence.
From studying the behavior of adult humans, we also know that we make use of this knowledge about the physics of the world to make various sophisticated inferences. We can tell whether a block tower is stable or unstable and robustly judge its stability depending on the mass and material properties of the blocks.
We can quickly infer relative weights of objects and update our beliefs about the weights of these objects if the outcome of an interaction violates our expectation. We can determine what might have happened if not for a possible cause that led to certain outcome of interaction between objects. We can also infer material properties like viscosity just from a snapshot. And we can also predict the paths taken by solid objects and fluids under the influence of gravitational force.
Interestingly, understanding the physics of the world is not only important for us humans, but it also is essential and of fundamental import in the field of robotics, whose aim is to build robots that interact and navigate in the world just like humans do.
Recent advances in this field have shown that including a module that learns the physical properties and relationships of wooden blocks does indeed help a robot to master the game of Jenga. Physical reasoning capabilities also enhance the problem solving abilities of robots by leveraging proper tool usage and planning sequential manipulation of objects in the environment. And finally, learning the physical properties of a liquid by performing one task, like stirring, will help a robot efficiently perform another task, like pouring.
By now, I hope I have convinced you about the importance of this core ability that we have to understand and use the physics of the world. Robots still lag behind human performance in large part because they do not understand the physics of the scene they're in. How do humans do this then?
There is some recent evidence that regions in the parietal and frontal cortices of the human brain, shown here in blue and green, respond more when people are performing a physics-based task compared to a difficulty-matched color judgement task. Further, these same regions have been shown to carry invariant information about the mass of objects through a series of experimental manipulations.
Now that I've given you a brief overview of what we know about intuitive physical reasoning in minds, brains, and machines, I will tell you something about a couple of ideas out there regarding how physics is computed in the brain. Since these are still early days and we don't have a lot of direct evidence from the brain, there are two dominant views of how this physics related information is computed in the brain.
One view hypothesizes that a representation similar to those that underlie object recognition in feed-forward neural networks is sufficient for intuitive physical reasoning. Whereas another view hypothesizes that our brain performs intuitive physical judgments through probabilistic simulations similar to those seen in video game physics engines.
Claims for evidence of this pattern recognition view come from both behavioral and computational experiments. Humans are able to find the oddball target in an array of distractors defined only by how stable or unstable the target is compared to the distractors, arguing that stability is extracted pre-attentively.
And humans can also detect changes in block towers better when the change affected the stability of the tower, even though changes in stability were totally incidental to the task manipulation. Further, a few studies have shown that feed-forward neural networks similar to those used for object recognition can detect the stability of block towers and, in many cases, perform as well as humans.
On the other hand, evidence for the simulation viewpoint also comes from a variety of behavioral and computational studies. The influential paper introducing the concept of an intuitive physics engine showed that a model that performs approximate probabilistic simulations was able to match human decisions and uncertainties regarding the stability of block towers.
Other experiments also showed that a probabilistic simulation-based account explained how objects would move after collisions, how the bob of a pendulum would move when cut while in motion, how humans make judgments of causality by running counterfactual simulations, and how fluids collide and move under the influence of gravity.
The two viewpoints that I've presented here need not be necessarily dichotomous in the sense that claims for feed-forward processing, either because of its pre-attentive nature or faster reaction times, may not exclude simulation as the underlying computation. However, exploring those directly in the human brain will provide data to constrain these explanations.
So in order to address some of these issues and explore how aspects of intuitive physics are represented in the human brain, we considered stability as a good case study, because, of all the predictions we make about our physical world, stability is the most basic.
Whether the situation in front of us is stable and, hence, likely to stay the same or unstable and, hence, likely to change in the immediate future. Moreover, understanding stability is to understand whether anything about the scene will change at all in the near future. With this in mind, we set out to address four specific questions about physical stability in the human brain.
First, we asked if representations in a feed-forward neural network trained on object recognition can be useful for detecting physical stability. Second, we asked if the representations underlying object recognition in the human brain in the ventral temporal cortex can distinguish between physical stability and instability. In essence, both these questions are geared towards assessing the pattern recognition and specifically the object recognition viewpoint.
Third, we asked if the fronto-parietal regions, the ones I showed before that respond more to physics tasks and carry invariant information about object mass, also carry invariant information about physical stability. Finally, we asked if stability was represented in the brain by carrying out forward simulations.
To answer the first question, that is, whether feed-forward neural networks trained on object recognition carry information about physical stability, we created a data set of images that depicted physical stability or instability. We started with a set of images of unstable and stable block towers, because it was used in a previous study and also in order to replicate their findings.
One thing to note here is that some of the previous studies either used pre-trained features from a deep convolutional neural network (CNN) trained on object recognition or fine-tuned a pre-trained CNN to distinguish between stable and unstable block towers like the ones I'm showing here. However, the way the generalizability of these classifiers was tested was by training them on block towers with three blocks and testing them on block towers with two or five blocks.
We thought that this way of testing generalizability really does not capture the gamut of scenarios in which we humans are capable of distinguishing between stable and unstable configurations of objects. So we created another set of images, this time depicting stable and unstable configurations of objects, denoted here as the physical objects scenario. These images depict recognizable everyday objects, and we tried to match the objects as much as possible across the stable and unstable scenarios.
Finally, we created a third set of images to test generalizability across naturalistic scenarios. The images in this set depicted people in stable or unstable situations, like the ones I've shown here. And the thing to note here is that even though it includes people, the instability itself arises from a precariously placed object, like the ladder, for example. So with this set of images, we set out to answer our first question.
AUDIENCE: Pramod, can I ask a quick question? I'm just curious how you made the images.
PRAMOD R T: So OK, these were some of the images I actually got off the internet. And the ones, the block towers that I'm showing here were actually used in a previous study. They actually photographed these. And the way they photographed the unstable towers is by actually holding a stick on top and clicking a picture just after removing the stick so that it stays stable during that frame. And I mean, in some cases, you can also see the stick here. And these ones I got off the internet, and these ones too.
There's this funny sort of meme called "why women live longer than men" where you have these funny scenarios of people doing various actions on precariously placed ladders and some really stupid scenarios.
AUDIENCE: I have a quick follow up on that question for the unstable blocks. It's funny. When you mention the stick that's poking, I wonder if a deep net-- maybe you're going to talk about this later-- catches that bias, kind of in the famous radiology paper of, oh, they can detect tumors just because the quality of the image is better and not because it's actually detecting the signal the right way. But I don't know, you could answer that later.
PRAMOD R T: I think it could have been that way. But what they also say in the data set is that they use the same sort of stick situation even in the case of stable images just to maintain similar scenarios.
AUDIENCE: That makes sense.
NANCY KANWISHER: But Pramod, you might mention how you filtered your stimuli, or at least checked to make sure they weren't decodable at an early layer of [INAUDIBLE]. So at least there wasn't--
PRAMOD R T: That's right. So for all the images that I'm showing here, at least the ones that were used for the experiment, I passed them through the initial convolutional layer of a deep neural network, just to see if there are certain low-level properties that would give you this decodability. And then, only after passing that test-- I mean, I chose images which could not be decoded from the early layers and then did all of the rest of the analysis.
So we collected these three sets of scenarios. And we reasoned that any system that has understood the concept of physical stability should be able to detect stability in all three of these scenarios. And as far as the neural network is concerned, we chose a ResNet-50 architecture trained on the ImageNet data set for object recognition. A shallower variant of this model with lower object recognition performance was previously used for stability classification. So if anything, this network should improve on earlier results regarding stability classification.
So how did we test the neural network for stability classification? For each scenario, we extracted unit activations, or features, from the final fully connected layer of the network and trained a linear support vector machine (SVM) classifier to distinguish between the feature patterns for stable and unstable images using cross-validation.
What should we expect to find here? If there is generalizable information about physical stability in the feature representation learned by this neural network through training on object recognition, we should find not only above-chance classification within each scenario, but also significantly above-chance classification performance when trained on one scenario and tested on another.
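[This train-on-one-scenario, test-on-another protocol can be sketched in a toy simulation. To be clear, this is purely illustrative and not the study's code: random vectors stand in for CNN features, and a simple nearest-centroid classifier stands in for the linear SVM.]

```python
import numpy as np

rng = np.random.default_rng(0)

def make_scenario(axis, n=100, d=64):
    """Synthetic stand-in for CNN features: stable (label 0) vs unstable
    (label 1) images separated along a scenario-specific feature direction."""
    X = rng.normal(size=(2 * n, d))
    y = np.repeat([0, 1], n)
    X[y == 1, axis] += 4.0  # class separation only along one direction
    return X, y

def fit_centroids(X, y):
    # Simplified stand-in for a linear classifier: one centroid per class
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def accuracy(centroids, X, y):
    classes = np.array(list(centroids))
    dists = np.stack([np.linalg.norm(X - centroids[c], axis=1) for c in centroids])
    return (classes[dists.argmin(axis=0)] == y).mean()

towers_train = make_scenario(axis=0)  # e.g., block towers
towers_test = make_scenario(axis=0)   # held-out images, same scenario
objects_test = make_scenario(axis=1)  # a different scenario

model = fit_centroids(*towers_train)
within = accuracy(model, *towers_test)   # high: stability decodable within scenario
across = accuracy(model, *objects_test)  # near chance: no cross-scenario transfer
```

[Because the "stability direction" here differs across scenarios by construction, the classifier succeeds within a scenario but falls to chance across scenarios, which is the pattern of results the talk goes on to report for the real CNN features.]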
So we replicated earlier findings that training the classifier on block towers would give above chance classification performance on [INAUDIBLE] block towers. Interestingly, we also found that deep neural network features contain information about physical stability independently for the physical object scenario and also the physical people scenario.
However, when we train the classifier on one scenario and tested on another scenario, the performance dropped to near chance levels, indicating that the pre-trained CNN features can discriminate physical stability within a scenario but not across scenarios. So to answer our first question-- no, feed-forward ImageNet trained deep neural networks previously claimed to support stability judgments do not carry information about stability that generalizes to novel scenarios.
So let's move on to our next question where we asked if brain regions that are thought to be optimized for object recognition in the ventral temporal cortex carried scenario invariant information about physical stability. Again, just to motivate why we're asking this question of the ventral temporal cortex, we know that VTC, the Ventral Temporal Cortex, is involved in object recognition and is also well-modeled by convolutional neural networks trained on ImageNet object classification. So will we find similar results to CNNs in this brain region too?
AUDIENCE: Pramod, before you go on to the next section, can I ask just kind of a conceptual question? I'm not sure-- I guess part of the reason you're asking this question is because past research groups asked similar questions. But I guess I just don't really get why we would expect a network trained for recognition or parts of the brain that are possibly specialized for object recognition to contain information about physical stability.
Because it just seems to me that if we think of object recognition as categorizing into types of objects, there's a huge amount of variation within each object class. A huge amount of physical variation, right? There's things that I can classify as plates and stacks of plates and cups and different physical configurations and some of them are stable some of them are unstable. I just don't see why we would ever expect that to be related to physics.
PRAMOD R T: That's right. So I think one answer could be that, for example, if you take a scenario of block towers. One way of distinguishing between stable and unstable block towers is to just figure out what the center of mass is or the center of the object is, which is very simple to compute in terms of visual features. And eventually, a region that is optimized for recognizing objects could also have information about the extent of the object or centroid of these objects that would in turn help you distinguish between stable and unstable block towers.
AUDIENCE: Right. I mean, I agree with you that it seems relevant that-- I mean, there's obviously going to be visual features that carry information for our physical [INAUDIBLE]. But it's just not clear to me why object recognition per se is relevant for physical--
NANCY KANWISHER: Sam, just to jump in. We didn't think it would either. It was Firestone and these guys are saying, oh, it's just like object recognition. It's the same kind of thing. It's just pattern classification.
JOSH TENENBAUM: I think there's two. But I think we should distinguish two or at least two versions of this hypothesis that several groups, including Chaz Firestone, Brian Scholl, but also several computer vision groups have argued. One is that in some sense, it's just the same as object or scene classifications. So the same networks-- like [INAUDIBLE] with--
NANCY KANWISHER: Alvarez.
JOSH TENENBAUM: --George Alvarez, right. I mean, they actually, I think, took pre-trained networks and just linearly decoded for certain classes of block towers, and claimed that that was the way to do it. So one is that it's like literally the same representations can be quickly repurposed for this, and that would be consistent with the fact that it's very quickly processed, in 100 milliseconds or less, and all that.
But another possibility, as sort of a generalization of that, is that it's not the exact same circuitry or the same representations, but the same kind of network, like a deep ConvNet, could be used for this. And both of those, I think, are hypotheses that at least some serious people are taking seriously and they might be part of the picture. So we do want to take them seriously. But yeah, the strongest version that we're testing right here, that it's literally just a pre-trained object recognition network, I don't know how much I would really have expected that to work in the first place.
AUDIENCE: Yeah. Yeah. No, that's basically what I was trying to get at, which is that the latter possibility that Josh just described seems more plausible to me, potentially. I mean, actually, even more plausible is a version where if you're trained to do physical inference, object information becomes useful for that, right? But it seems less obvious the other way around.
JOSH TENENBAUM: Yeah, right. I mean, another view is that there's some kind of a generalizable representation of 3D object geometry and scene geometry that's jointly useful for all these things. But it'd be nice to test a version of where-- the problem is we don't have anything like an ImageNet scale data set for training just feed-forward visual physics, right? It'd be nice if we could test something like that.
I'm not sure how we would do that. Maybe some kind of motion contrastive-- like, self-supervised or contrasted losses on motion. A lot of people have been trying to learn things in that setting. Dan Yamins and others have a network on that, and Brendon Lake. So maybe we could try taking those visual representations and seeing the same kind of idea. That might be another thing to do.
AUDIENCE: [INAUDIBLE] is actually on the call. I don't know if you wanted to say something.
AUDIENCE: You're putting him on the spot.
AUDIENCE: No, I don't mind actually. I was going to try and hop in.
JOSH TENENBAUM: I like putting [INAUDIBLE] on the spot. [LAUGHS]
AUDIENCE: So I think the claim that I would basically make for the sake of decoding stability from object recognition is not that it's anything about object recognition per se, it's about the extent to which object recognition is able to generate features that are then usable in other tasks. So to the extent that object recognition is good at more of like an intuitive geometry, per se, to the extent that object recognition gives you.
JOSH TENENBAUM: That's what I was trying to channel there in the last thing. But we might think that there might be other ways of doing that. Maybe using more general video data sets for which there'd be an intuitive geometry and dynamics that might be more physical. And we might want to test that too.
AUDIENCE: We're actually trying to replicate some earlier object recognition results with just contrastive learning networks. So that we're kind of getting at this question of what exactly are the features that are learned that are most relevant? Which is a question you can test more directly by asking, OK, what kind of contrast do you set up in order to produce the features that you think would be relevant for a variety of physical judgment tasks beyond stability, but also with stability.
JOSH TENENBAUM: Yeah, I mean, maybe we could even collaborate or at least coordinate. Because I think we're interested in the same thing even if we're coming from different places. So it'd be good if we could have a separate offline discussion of what are the right self-supervised or contrastive losses that could plausibly be tested in these different cases and try out the same networks and stuff.
AUDIENCE: I'm in.
JOSH TENENBAUM: Let's follow up on that. OK, great.
PRAMOD R T: So here we wanted to ask if brain regions that are thought to underlie object recognition in the ventral temporal cortex also carry invariant information about physical stability. So to do that, we again considered images from the two scenarios that were previously used, the physical objects and the physical people scenario. However, one could argue that any difference between stable and unstable conditions we might see in the brain could be explained away by attention. People precariously placed on ladders might simply draw more attention.
So as a control, we also included a third scenario where people were in either perilous or non-perilous conditions, but due to animals. This has the same danger or peril that the other scenario has, but the peril in this case is directly due to an animal rather than a precariously placed physical object.
So with these three scenarios, we set out to address the second question, that is, whether brain regions known to be involved in object recognition also contain information about physical stability. So we collected fMRI brain responses to these conditions using a block design with an orthogonal one-back task to ensure that the participants were focused on the images. Throughout the task, the participants were asked to fixate on the center of the image, which we then confirmed using eye tracking on a subset of participants.
So we then defined our region of interest in the brain, the ventral temporal cortex, shown here on an inflated cortical surface, and asked if the pattern of activations in this region contained information about physical stability. Specifically, we looked for such information using a multi-voxel pattern correlation analysis.
So within each scenario, we split the data into two halves and asked if the pattern correlations within conditions, shown here as RW1 and RW2, are greater than the pattern correlations between conditions, shown as RB1 and RB2. This simply indicates how similar the voxel activity patterns are within a condition. Higher within-condition correlations would indicate that the activity patterns for the two conditions are distinct in this region of the brain.
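[The split-half logic can be illustrated with a small simulation. Again, this is not the study's analysis code: the "voxel patterns" below are synthetic, generated as a condition mean plus measurement noise for each half of the data.]

```python
import numpy as np

rng = np.random.default_rng(1)
n_vox = 500

# Each condition has a true mean pattern, measured twice (two independent
# halves of the data, e.g., odd vs even runs) with additive noise.
means = {"stable": rng.normal(size=n_vox), "unstable": rng.normal(size=n_vox)}
noisy = lambda m: m + rng.normal(scale=0.8, size=n_vox)
half1 = {c: noisy(m) for c, m in means.items()}
half2 = {c: noisy(m) for c, m in means.items()}

corr = lambda a, b: np.corrcoef(a, b)[0, 1]

r_w1 = corr(half1["stable"], half2["stable"])      # within-condition (RW1)
r_w2 = corr(half1["unstable"], half2["unstable"])  # within-condition (RW2)
r_b1 = corr(half1["stable"], half2["unstable"])    # between-condition (RB1)
r_b2 = corr(half1["unstable"], half2["stable"])    # between-condition (RB2)

# Positive index: patterns in this "region" distinguish stable from unstable
index = (r_w1 + r_w2) / 2 - (r_b1 + r_b2) / 2
```

[Here the within-condition correlations exceed the between-condition ones because the two conditions have distinct underlying patterns; if the region carried no stability information, the index would hover around zero.]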
So what are we expecting to find? So if the neural network results are anything to go by, this analysis should reveal information about physical stability within each scenario. So what did we find? We found that the within condition correlations were significantly higher than the between correlations in the physical objects scenario and also in the physical people scenario. However, the pattern information did not distinguish perilous from non-perilous conditions in the animal-people scenario.
So VTC carries pattern information about physical stability within scenarios similar to what we saw in object recognition trained CNNs. But what we have not answered yet is if this information is scenario invariant.
So to answer that, we again used pattern correlation analysis and computed within and between condition correlations, but now across scenario types. That is, the within condition correlations were for the stable conditions in, say, the physical-objects and physical-people scenarios. And the between condition correlations were for the stable condition in the physical objects scenario and the unstable condition in the physical people scenario. Again, these are shown here as RWs and RBs.
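[The cross-scenario version can be sketched the same way. In this hypothetical illustration, each synthetic pattern is the sum of a scenario component and a stability component, so within-condition correlations computed across scenarios exceed the between-condition ones only because of the shared stability component.]

```python
import numpy as np

rng = np.random.default_rng(2)
n_vox = 500

# Hypothetical decomposition: pattern = scenario component + stability component + noise
stability = {"stable": rng.normal(size=n_vox), "unstable": rng.normal(size=n_vox)}
scenario = {"objects": rng.normal(size=n_vox), "people": rng.normal(size=n_vox)}

def pattern(scen, stab, noise=0.5):
    return scenario[scen] + stability[stab] + rng.normal(scale=noise, size=n_vox)

corr = lambda a, b: np.corrcoef(a, b)[0, 1]

# Within: same stability state, different scenario.
# Between: different stability state AND different scenario.
r_within = corr(pattern("objects", "stable"), pattern("people", "stable"))
r_between = corr(pattern("objects", "stable"), pattern("people", "unstable"))
```

[A region whose patterns decompose this way would show r_within greater than r_between across scenarios, which is the signature of a scenario invariant stability representation; a region with purely scenario-specific codes would show no such difference.]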
So we found that within condition correlations were not different from between condition correlations for pairs of scenarios involving animal-people as one of the scenarios. This is what we would have expected, given that the nature of instability or peril is different. One has instability due to the physical properties of objects, whereas the other has peril due to an animal. However, the crucial comparison here is for the two physical scenarios, that is the physical objects and the physical people scenarios.
So what did we find here? Interestingly, here, too, we found that the within condition correlations were not different from the between condition correlations, indicating that the ventral temporal cortex does not carry scenario invariant information about physical stability.
So to answer the second question, we found that, similar to CNNs trained on object recognition, the part of the brain known to have representations and functions necessary for object recognition, the VTC, also does not carry scenario invariant information about physical stability. But since we know that we can easily recognize stability regardless of the scenario, surely there must be some part of the brain involved in its processing. Where might that be?
So taking a cue from a couple of previous studies in the lab, we hypothesized that the fronto-parietal regions that responded more to physics tasks and carried invariant information about object mass might also carry scenario invariant information about physical stability. To explore this, we used the same physics versus non-physics task localizer used in the previous studies and functionally localized the candidate physics regions in each participant.
So what are our predictions for this brain region? If this brain region is involved in processing physics information, then we should find information about physical stability in this region. So we address this, again, by using the same pattern correlation analysis.
We first asked if there was any reliable information about stability within each scenario in the candidate physics regions. As before, we found that the within condition correlations were stronger than the between condition correlations, but only in the physical objects and the physical people scenarios, not in the animals-people scenario. This is similar to what we found in the ventral temporal cortex. And the question now is whether we might find similar results even for the across-scenario comparisons.
So if this region, the physics ROI that we have defined here, is simply reflecting the processes in the ventral visual pathway, we should find no generalizable information for physical stability. So let's see what we found. Again, we computed pattern correlations within and between conditions across scenarios. And this time, we found that the fronto-parietal physics regions did not generalize across physics and animate scenarios.
What about the two physics scenarios then? Surprisingly, we found that the fronto-parietal physics regions responded similarly to conditions even across scenarios, specifically when the scenarios involved physical objects, indicating that these regions contain abstract, scenario invariant information about physical stability.
So the fact that we don't see generalization for the animals-people condition indicates that these regions or these results are not driven by the perilous nature of some of the stimuli that we've used in the physics scenarios. However, there could be other confounding factors that can easily explain these results as well. So we tested some of them next.
So the first confounding factor we checked was low-level visual features. Although I told you that these stimuli were considered after passing them through an initial layer of a deep neural network just to make sure that they're not distinguishable using low-level visual features, we wanted to actually test this in the brain in V1.
And what did we find? We found no generalizable pattern information about stability in V1. So then we asked if any of our results could be due to differential eye movements. There could be systematic differences in where participants fixated, even though they were asked to maintain fixation at the center.
They could have made saccades to different locations across different scenarios. To check this, we collected eye tracking data on a subset of our participants during fMRI and extracted various eye movement variables, like the average x- and y-positions and the number, duration, and amplitude of saccades. We found that none of these variables systematically differed across conditions and scenarios.
AUDIENCE: How fast was the stimuli shown, Pramod, in this?
PRAMOD R T: So here, these were shown in a block design with about 10 images in the block, plus two one-backs, for a total of 12 images. And each image was shown for about 1.7 seconds, with a 300 millisecond interstimulus interval. So, yeah.
Finally, we asked if differential attention to these conditions could explain our results. Though it's hard to quantify attention and explicitly define what we mean by it, we nevertheless wanted to get a measure of it. So we collected subjective ratings of how interesting or attention-grabbing our stimuli were.
We found that, as expected, unstable and perilous images were rated higher than stable and non-perilous images. However, this effect did not differ significantly across the three scenarios, indicating that the scenario-invariant representation of physical stability cannot be due to differential attention. Thus, low-level visual features, attention, and differential eye movements cannot explain our results.
So the simple answer to our third question is, yes, the candidate physics regions in the fronto-parietal cortices carry abstract information about physical stability. This already suggests that the representations that potentially underlie our ability to judge physical stability across various scenarios are not similar to those that underlie object recognition.
So we will now move on to our final question where we specifically ask if these fronto-parietal regions represent physical stability using forward simulation, which is--
AUDIENCE: Pramod, can I just ask a question before you move on? What about the possibility that implied motion is important?
PRAMOD RT: Yes, that's a good question. I'm not presenting those results, but just to give you-- just a sec.
AUDIENCE: Nancy worked on this like 20 years ago, I think.
PRAMOD RT: That's right.
NANCY KANWISHER: Good memory, Josh.
PRAMOD RT: Yeah. So, OK. How do I show this?
NANCY KANWISHER: While you're looking for that slide, I'll just say, what Zoe Kourtzi and I did was implied motion. So it was photographs of people who were, at that moment, moving. Which is subtly different from stability, which is the possibility of motion in the future. So it's similar, but not exactly.
AUDIENCE: But the block towers, those are in the process of starting to move, right? Once you move the support.
PRAMOD RT: That's right.
AUDIENCE: I mean, it feels to me like those are very strange stimuli, because in real life you just don't ever encounter that kind of thing as a static image, right?
NANCY KANWISHER: Yeah. So to be fair, the block towers were not used in the functional MRI study.
PRAMOD RT: In the functional MRI, yeah.
AUDIENCE: I see.
NANCY KANWISHER: That would apply to the earlier comment.
AUDIENCE: But I think it is a similar point, anyway.
PRAMOD RT: That's right. I mean, one has--
AUDIENCE: The physically unstable things have a similar implied motion.
PRAMOD RT: That's right, yes. So, yeah. To answer that question, I actually looked in MT. And I saw that you find higher activations for the unstable condition compared to stable across all three scenarios, implying that it's not specific to the physics conditions. In general, if anything has implied motion or predicted motion, you would find higher responses there.
AUDIENCE: But I guess more generally, I mean, these stimuli are cool and they obviously do something to your brain. I'm just wondering about the choice to study these particular things with static images. Because, again, in a real-world scenario, of course it's very common that you judge the stability of things, and that's on a continuum. And you can make judgments about, oh, well, if I took this thing out, it would fall. That kind of happens. But these things where it's clearly about to fall, you don't walk around in the world seeing that all the time, you know?
PRAMOD RT: Yep. Yeah, but also, I mean, one cool thing about this is you can judge it even by looking at a snapshot, which is not true for other sorts of physics scenarios. So in order to remove the complexity of showing videos--
JOSH TENENBAUM: I mean, I think that's, of course, a good point, Pramod. But I also really like Josh's suggestion, because we could do that. Toby in our lab and others did some studies like that, where you can show a perfectly stable scene and then ask a question. Imagine some kind of block scene or constructed scene where one object is red, and you say, if I were to remove the red block, would it fall or not?
So it's a conditional, hypothetical, or counterfactual stability judgment. And people can make those judgments and probably use the same brain mechanisms; it would be nice to show that. And it would be a control, because there would be no implied motion, at least in the raw stimuli. Although, in the mental image, you might still get implied motion effects, because to do the simulation, it might top-down activate MT, for example. So I don't know that it would completely--
NANCY KANWISHER: Well, that's the idea.
AUDIENCE: As far as removing it.
NANCY KANWISHER: Yeah. And Josh, that's presumably why you see it also for the animal's case, which isn't really physical in the same sense but does predict motion. You also get this [INAUDIBLE] response.
JOSH TENENBAUM: But it'd be interesting if-- I don't know, to speculate or something-- some of the newer fMRI technologies with super strong magnets can tell the difference between bottom-up and top-down. You might expect there could be a difference between implied motion that's just there in the image and automatically engaged, versus-- I guess it would still be sort of top-down. Anyway, it seems like there should be a difference, maybe in the time it would take, between automatically engaged implied motion versus motion that's implied only when you ask yourself the right question.
AUDIENCE: What about backward masking or something like that? Isn't that the conventional way to try to prevent feedback processing?
NANCY KANWISHER: Or latency of MEG decoding, maybe.
PRAMOD RT: Well, that is kind of confounded with the strength of the signal itself. I mean, imagining something might be sort of weaker compared to actually seeing something in midair.
AUDIENCE: Well, yeah. They're all imagined. But is it, like Brian Scholl asks, involuntary? One of Brian's key axioms of something being perception is: is it fast and automatic? Whereas the conditional simulation that would be required for what Josh was suggesting is not automatic. It's only when you think about it.
PRAMOD RT: When you think about it, yeah.
JOSH TENENBAUM: So anyway, we can talk more about this later. It feels like that'd be an interesting additional experiment to do under the right circumstances.
AUDIENCE: I mean, I still think that there's-- configurations vary from being very stable to being pretty unstable. And you can have things that are not that stable without necessarily being about to fall. It's mostly just like, you know if you walked past it and bumped it, it would definitely fall, as opposed to something that probably wouldn't. And I think that could be a very fast bottom-up thing, right?
PRAMOD RT: Yeah, that's true.
AUDIENCE: I feel like as a parent, I've become very attentive to--
AUDIENCE: I was thinking the same thing.
AUDIENCE: --precariously placed things that I think people without kids don't think about, really. I feel like it was a massive perceptual learning. Where it's like now, all the perceptual affordances are really about physical stability and breakability and things like that. Josh knows what I'm talking about.
AUDIENCE: Oh, yeah.
JOSH TENENBAUM: Well, and the other Josh does too. I think it's not an accident that I started working on these things around the time when-- [LAUGHS]
AUDIENCE: I'll just add that it's not clear whether that kind of perceptual learning you want to classify that as top-down or automatic. Because it feels, subjectively, like it's automatic. But it's also clearly an effect of the demands of the environment, right? It's not like everybody's hardwired with that kind of-- with those affordances.
AUDIENCE: I'm not sure about that personally. I mean, you definitely become more cognizant of it when you have kids. But I mean, I can say--
NANCY KANWISHER: I don't have kids, and I'm very, very attuned to things leaning over the table edges.
AUDIENCE: Yes, I think so. Yeah.
PRAMOD RT: What I've shown so far is that this abstract, scenario-invariant information about stability is present only in the fronto-parietal physics regions and not in the ventral visual pathway. So now we'll move on to our final question, where we specifically ask if these fronto-parietal regions represent physical stability using forward simulation.
So we hypothesized that forward simulation will show up as a [INAUDIBLE] difference in neural responses within the physics regions. Why? Because in the case of unstable stimuli, there could be more diverse possible outcomes and more changes to the scene in the future, that is, more predicted motion and hence more activity, compared to stable stimuli, which are in what one might call a "sleep state," where no change is predicted and hence nothing needs to be simulated explicitly.
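As a toy illustration of the kind of forward check a simulator might perform, consider a 2D block tower, where stability reduces to whether each block's supported mass stays over the block below. This is purely illustrative (the fMRI stimuli were natural scenes, not block towers, and real simulators are far richer):

```python
def tower_is_stable(blocks):
    """blocks: list of (x_center, width) pairs, bottom to top, unit mass each.
    The stack is stable if, at every level, the center of mass of all blocks
    from that level up lies over the supporting block beneath it."""
    for i in range(1, len(blocks)):
        above = blocks[i:]
        com_x = sum(x for x, _ in above) / len(above)
        x_below, w_below = blocks[i - 1]
        if abs(com_x - x_below) > w_below / 2:
            return False  # this level would tip: more predicted motion
    return True

# An aligned stack is stable; an offset top block makes the tower unstable.
stable_tower = [(0.0, 2.0), (0.0, 2.0), (0.0, 2.0)]
unstable_tower = [(0.0, 2.0), (0.0, 2.0), (1.5, 2.0)]
```

On this view, an unstable configuration is exactly one where the forward check predicts change, which is the hypothesized source of the extra neural activity.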
So what did we find? We computed the average activation for each condition in each scenario within the physics regions. And we found that physical instability evoked a higher response than physical stability in the physics regions, while the animals-people condition evoked similar activity for both the perilous and non-perilous conditions.
So this is consistent with the hypothesis that physics regions in the brain might be using forward simulations for inferring physical stability and, in general, for physical scene understanding. However, for our hypothesis to hold, we should also look in the visual regions and find a null effect.
So we looked. We tested three visual regions: the primary visual area, V1; the Lateral Occipital Complex, LOC; and VTC, as I showed before. And none of these regions showed higher responses to physical instability compared to stability.
So what I'm showing here in this table are the average activity values within each visual region for the stable and unstable conditions. And the p-value denotes the significance of the appropriate statistical test comparing the average activations. As can be seen, none of these comparisons showed a significant univariate response difference, indicating that the trend we observed was unique to the fronto-parietal physics regions.
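The comparison behind that table is, in essence, a paired test of mean activations across subjects. A minimal sketch follows; the per-subject values are made up for illustration, not the study's data:

```python
import math

def paired_t(a, b):
    """Paired t statistic for two conditions, one mean activation per subject."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

# Hypothetical per-subject mean betas in a physics ROI (not the real data).
unstable = [1.9, 2.1, 1.8, 2.4, 2.0, 2.2, 1.7, 2.3]
stable = [1.4, 1.6, 1.5, 1.9, 1.3, 1.8, 1.2, 1.7]
t_physics = paired_t(unstable, stable)  # large positive t: instability > stability
```

In the visual regions, the analogous statistic stays near zero, which is the null effect the hypothesis requires.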
So the answer to the final question is yes: consistent with our hypothesis, we found stronger responses to physical instability, providing indirect evidence for forward simulations being carried out in these brain regions.
So circling back to this debate, we found that representations underlying object recognition, in both feed-forward convolutional neural networks and the [INAUDIBLE] convolutional pathway, do not carry generalizable information about physical stability. Instead, we found that scenario-invariant information about stability is present in the fronto-parietal regions and is potentially computed through forward simulations of what will happen next. Although we find evidence suggestive of the simulation hypothesis, there is still a lot more to be known.
What I've shown you today is that a CNN pre-trained on object recognition cannot detect physical stability across scenarios. However, a neural network specifically trained to do so might; that would require creating a large data set of stable and unstable object configurations across different scenarios. Alternatively, building and testing computational models that extract objects and their representations and explicitly predict future states of the scene by running forward simulations might learn to generalize to novel scenarios faster and match human behavior better.
Such models also have the potential to be better encoding models of neural responses in the candidate physics regions. In this regard, there have been recent advances in building such object-centered models using graph networks. And we're working towards testing those models on our images, to ask whether they can distinguish physical stability from instability across scenarios, and also whether they can predict brain responses in various regions of interest, mainly the fronto-parietal physics regions.
And regarding simulation in the brain, I have presented somewhat indirect evidence using the univariate analysis of the BOLD responses. However, collecting high temporal resolution data using electrode grids, or ECoG, could provide more direct evidence for simulation in these regions and potentially enable us to ask more interesting questions about the nature of simulation itself. Is simulation always running? At what level of detail do these simulations run?
So I have to acknowledge here that the actual brain processes that underlie intuitive physics might not be as dichotomous as I've made them out to be. Some problems might well be solved by appropriating already existing representations for visual object recognition, others by running forward simulations, and some through a combination of both.
So finally, from the previous literature and the results I presented today, we know that the candidate physics regions encode invariant object mass and stability. So what other physical properties are represented in these regions? There are lots of other object-related properties, like the bounciness or elasticity of objects, and the different forces that act on objects as they interact in the environment. There are lots of dynamic variables involved.
And which of these different variables are represented in these candidate fronto-parietal physics regions? We are working towards understanding these aspects of intuitive physics in ongoing and future studies.
So with that, I would like to thank my co-authors and mentors, colleagues in the Kanwisher lab, the funding sources, and you all for attending this. And thanks again.