Neural representations of faces, bodies, and objects in ventral temporal cortex (46:59)
July 5, 2017
August 26, 2016
Brains, Minds and Machines Summer Course 2016
James Haxby, Dartmouth College
There is substantial evidence that processing in ventral temporal cortex is optimized for face and body perception. Recent research employing a greater range of static and dynamic stimuli highlights the important role of agent action and the complexity of fine category distinctions. Computational analyses of fMRI data using multivariate methods capture fine-grained patterns of object representations in the ventral pathway.
JIM HAXBY: Thank you very much, Gabriel, and it's a real pleasure to be here. I'm using the amplification because I have the kind of voice, I don't know why, it disappears in a large room. So this is so that you can actually hear what I'm saying.
Now I usually talk about the neural decoding work, and specifically about this work that we're doing on how to build a common model of neural representational spaces for decoding. But because my talk now is preceding a debate with Nancy Kanwisher, I'm going to give a different talk, about views on how things like faces, bodies, and objects are represented in ventral temporal cortex. And I just have to show the people in my lab at the beginning to make sure that I acknowledge that the work I'm talking about really is their work.
So this is the outline of what I'd like to talk about. First, I'd like to talk about how we think about, how most people think about how ventral temporal cortex, the ventral visual pathway, is organized, and with the special emphasis on how it seems to be optimized for person perception, and especially face and body perception. And I'll just go through some of the compelling evidence for the primacy of person perception, face perception, body perception, in the ventral visual pathway.
But then I'm going to talk about how there are some problems with this conventional, or these models. I call it a fly in the ointment. And specifically I'm going to talk about how there's a stimulus sampling bias. And that primarily concerns that the studies that have been done use still images of a limited variety of stimuli, and how that really has had an impact on how we can understand the ventral visual pathway.
And then about computational methods, I'll get a little bit into the role of decoding for thinking about the ventral visual pathway in a new way. So first, what is the evidence that it's optimized for face perception? Well, there's something very special about faces. There's absolutely no doubt about it.
For example, in psychophysical studies, people move their eyes to images of faces in natural scenes. They see two natural images, one on each side, and the task is just to saccade to the side that has a face in it. People are amazingly fast, with reliably accurate saccades as fast as 100 to 110 milliseconds. This is faster than it should be, given what we know about evoked potentials in the brain to faces.
Single cells in monkey IT cortex that are face-selective, so they're face neurons, this is the red bar, have a much faster average latency than neurons that are selective for body parts or objects or places. And even the detection of familiar faces allows a saccade to a familiar face, as compared to an unfamiliar face, with reliable accuracy as fast as 180 milliseconds. Again, this is just too fast.
The first evoked potential that is modulated by familiarity is at 210 milliseconds. And so somehow this information has to be realized in an eye movement 30 milliseconds before evoked potentials can detect that kind of information. So the system really seems to be optimized for face perception.
It also seems to be pretty resistant to attention. This is an MEG study that we did, where we had people looking at faces superimposed on houses, and they would attend either to the face or the house. This shows the average evoked field while people are looking at the images and performing this task. Face attention is the magenta line, house attention is the light blue line, and they're perfectly superimposed.
The response is absolutely insensitive to the direction of attention until 190 milliseconds. And this response is the response to the face. So even when they're attending to the house, and really, if you perform this task, the face subjectively disappears because it's a demanding task, that first response in the brain, which is the equivalent of the N170, is there in full force.
So it's quite automatic, occurring even when people are unaware that they're looking at a face. In these experiments by Sheng He and his collaborators, they render the face invisible using continuous flash suppression. The two eyes are seeing different stimuli: the face is shown to one eye, while the other eye is seeing a very dynamic stimulus. When that happens, subjects don't know the face is there.
And at some point, the face image breaks through suppression. And so this, the breakthrough from interocular suppression is used as an index of processing that's happened without awareness. And faces, if they're upright, break through substantially faster than faces if they're inverted. So again, if people are unaware of the faces, something about the upright face is grabbing the system, before the subject's even aware that a face is being presented to them.
And this, again, is seen for features like the direction of the face. If the face is facing towards you, it breaks through faster than if it's turned to the side. If the face is a friend, it breaks through faster than if the face is a stranger. So it's not just whether the face is an upright image; it's which direction the head is facing, towards you or away, and whether it's someone you know or someone you don't. That kind of information is being processed.
And there is a specialized system in the brain for face processing, consisting of multiple areas. So the first three areas that we talked about were the fusiform face area, the occipital face area, and the superior temporal sulcus. We have a model for how these systems play a role in visual analysis. More recent work has suggested there are additional face areas in anterior superior temporal sulcus, anterior ventral temporal cortex, and even in frontal cortex.
But these areas all respond more strongly to faces than to any other stimulus. And there is an equivalent system of patches in the monkey brain that also respond more strongly to faces than to any other stimulus. So this is a deep body of evidence: from psychophysical studies of the speed of processing, from the fact that faces are processed with minimal attention, even without awareness, and from this special brain system for processing faces, primarily in ventral temporal cortex, in the ventral visual pathway.
All of this suggests that this is a very dominant function in ventral temporal cortex. Face processing is fast, requires minimal attention, and is mediated by a distributed neural system, in humans and monkeys. But I want to say, I think there's a fly in the ointment. And after the talk you should decide whether you think it's a little housefly or something more serious.
So the first problem that I want to talk about is the stimulus sampling bias, that almost all studies use still images of faces and objects, with a very limited range of categories, faces, bodies, some animals, limited range of animals, and certain categories of visual objects. And very, very few studies have used dynamic images, and looked at the information content in actions.
So the conclusion I'm going to come to is that a dominant role is being played by the representation of agentic action in ventral temporal cortex. The second problem is inadequate computational methods for data analysis and modeling. These have relied on univariate contrasts: is there a stronger response to one thing than to another? And they ask what I think is the wrong question, which is, what is the function of an area? There are two problems with this.
First of all, is it the function? Do these areas do one thing, or are there many functions that are multiplexed into an area? And the second problem is, is it an area? Or is there a different way to understand cortical topography other than a division into these category-selective areas?
And I will argue that we can make more progress by modeling functional architecture as a high dimensional representational space, and that will better capture this complexity, as well as the category-specific regions. Let's first talk about the stimulus sampling bias. Now I just have to find out this group of young people here, how many of you have not seen these stimuli before?
Oh, look at that. Well, these were developed in the 1940s, I believe, by Heider and Simmel. They were produced by drawing triangles on index cards and making a movie. And I think you can tell there's a difference between what's going on here between these two triangles and here with these two triangles. So something is going on here.
Let's just do it again. OK, that suggests there's an interaction between these two triangles. There's no biological form. All there is is action. But people very quickly see that these two triangles are having some kind of a fight, and that there's a big triangle who's beating up on a little triangle.
Well, these were used in a functional brain imaging study back in 2000. Castelli was the first author. It was from Chris Frith's lab. And it was to look at the representation of a theory of mind, social knowledge. And people concentrated on this area in medial prefrontal cortex, and this area in the superior temporal sulcus, posterior superior temporal sulcus or temporal parietal junction.
But there was this area, too. This really is right there in the lateral fusiform gyrus, suspiciously close to where we think the fusiform face area is. Now Gobbini et al. did a study using the same stimuli, along with some other stimuli, and replicated the activation of the superior temporal sulcus, and a little bit more here, some things in medial prefrontal cortex, but also, again, ventral temporal cortex in the fusiform cortex.
And they also showed these displays, point-light displays of biological motion; again, there is no form here. That's obvious from the still images. You can see the still image doesn't obviously suggest a human form. But as soon as it starts moving, you can see someone doing handsprings, right? And when people look at these point-light displays of motion, it activates the lateral fusiform gyrus.
So there's no biological form here. There are no faces, there are no bodies; all there is is motion, and information carried by motion. And for some reason, ventral temporal cortex is responsive to that. Now this is a study from Greg McCarthy. They had people looking at videos of industrial robots doing their thing. And these robots did tasks of varying complexity, of varying levels of goal-directedness.
And they found that the more complex, goal-directed tasks evoked stronger activity in ventral temporal cortex, again in the fusiform gyrus.
NARRATOR: Simple or compound angles. And twist parts into place, similar to the flexibility offered by a manual operator. To learn more about the Pfennig M3iA, and our full line of assembly and delta style robots--
JIM HAXBY: This is not one of their videos. I got this off the web. It's about a robot that builds a fan. And you can quickly see that it is doing something, that it has to pull parts together to perform the task. So you're seeing a non-biological stimulus, clearly, no one would say this is an animal. It's a non-biological stimulus--
This is a study from my lab. This is Sam Nastase, who was a student of mine in Italy, and now he's a graduate student at Dartmouth. Here he is back in his Italian days having an espresso, and a course of roast meat in Orvieto.
Well, he had subjects looking at videos of animals. And the subject would indicate whether the class of animal, like insects or primates or lizards, was the same as in the previous video, or whether what the animals were doing was the same as in the previous video. So this is a study of attention. But I'm not going to talk about the effect of attention; I'm going to talk about something else that came out of this that was very surprising.
Just to give you a flavor for what these videos look like: here's a video of ants running, a seagull running, and a baboon running. These are animals all doing the same thing. The motion vectors are very different, and the visual forms of the animals and of their actions are very different, but they're doing the same thing, they're trying to get from one place to another.
Animals and plants developed--
Here's a gorilla eating some kind of fruit, a hummingbird sipping nectar, and a caterpillar eating something disgusting.
Animals and plants developed--
Again, very different animals, very different kinds of eating behaviors, but with the same goal. Or you can group things by species, so you have insects, here's a beetle, a caterpillar, an ant, and a ladybug, fighting, eating, running. So he analyzed this using what is called representational similarity analysis, which looks at the vector of the response to each condition and how similar it is to the pattern vectors for the other conditions. And then you analyze that matrix of pairwise similarities for 20 different conditions, four actions by five types of animal.
And he did it with two models. One asks, to what extent are the responses to animals of the same category similar to each other? The second asks, to what extent are the responses to the same action similar to each other, regardless of which animal is performing the action, OK?
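The logic of this representational similarity analysis can be sketched in a few lines of Python. Everything here is synthetic: the pattern matrix, the voxel count, and the noise level are hypothetical stand-ins for the actual fMRI data, but comparing a neural similarity matrix against a taxonomy model and an action model is the technique being described:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# 20 conditions: 4 actions x 5 animal taxa (labels are hypothetical)
n_actions, n_taxa, n_voxels = 4, 5, 100
actions = np.repeat(np.arange(n_actions), n_taxa)
taxa = np.tile(np.arange(n_taxa), n_actions)

# Synthetic response patterns whose structure is driven by action identity
action_templates = rng.normal(size=(n_actions, n_voxels))
patterns = action_templates[actions] + 0.5 * rng.normal(size=(20, n_voxels))

# Neural RDM: pairwise correlation distances between condition patterns
neural_rdm = pdist(patterns, metric="correlation")

# Model RDMs: distance 0 if two conditions share the grouping, 1 otherwise
action_model = pdist(actions[:, None], metric="hamming")
taxon_model = pdist(taxa[:, None], metric="hamming")

# Rank-correlate each model RDM with the neural RDM
r_action = spearmanr(neural_rdm, action_model)[0]
r_taxon = spearmanr(neural_rdm, taxon_model)[0]
print(r_action > r_taxon)  # in this simulation the action model wins
```

Which model "wins" here is built into the simulation; the study's finding was that, in real data, the action model accounted for more variance than the taxonomy model.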
And what was surprising, and it's a complicated study, it's on bioRxiv, by the way, it hasn't been accepted yet, you can see it on bioRxiv, is that this similarity structure, with similar responses to the same actions regardless of animal, accounted for a lot more variance than did the similarity by animal species, which we found very surprising.
Now it's not that surprising in places like the intraparietal sulcus, or pre-central sulcus, where we thought there might be action representations. But it's also true in lateral occipital and ventral temporal cortex. Look at that.
So over three times more variance is accounted for by the similarity of actions than by taxonomic similarity. And then another piece of information here comes from some work that's coming out of Dave Leopold's laboratory at NIH. This is work with Brian Russ, where they did fMRI studies in monkeys. And they had the monkeys looking at natural movies of monkeys interacting, and other things happening.
And the monkeys are very happy to watch these movies. But apparently they get bored after seeing them two or three times. You have to give them special incentive to watch the movies again. And they asked, can we use the responses to the movie to identify the face patches in the monkey cortex?
So they would tag the movie for the times when faces were on the screen, as compared to times when faces were not in the movie, and found that it was a very effective localizer for the full face patch system in the monkey cortex. So the responses to a natural movie do reflect this face patch system, which was mapped out using still images. But then they just looked at the motion energy in the movie, and asked to what extent the motion energy, regardless of what it's representing, drives the responses.
How much variance is accounted for by motion energy in the variation in response in different parts of the brain? And it was a much stronger effect. Motion energy really dominated the response, across visual cortex to anterior temporal cortex and frontal cortex here. So what about in the face areas? Are these face areas still mostly sensitive to the presence of faces?
Well, this is what really surprised them. But I think this is really important. First of all, they showed that the face response, which is what is driving this map here, accounts for 3% to 5% of the variance in most of these face patches. These are all six of the face patches, PL, ML, MF, AL, AF and AM, up to a maximum of 5% in AF. And in every single case, the motion energy accounts for more variance, than the presence of faces.
And this isn't just any motion energy. They found that, if the dynamic information was in videos with animals in them, it was a dominant feature. If the dynamic information was in videos without animals, just natural things like hurricanes, it accounted for much less variance. So it seems to be something specific to animal action, that's driving the responses in these monkey IT neurons, including in the face neurons, OK? More so than the presence of faces.
And there's a really nice talk, Dave Leopold gave a colloquium at Dartmouth in May. And the video of that is on the CCN website for Dartmouth. So you can look this up and watch the talk. And, in this, he's talking about work that I don't think is published yet, where they're measuring single cell responses in face patches, while they watch these movies, and, again, finding that once they look at the response to movies as compared to the response to still images, that the response tuning functions become much more complicated.
As a matter of fact, they're kind of baffled by exactly what these things mean at this point. But they find that, if they measure a population of neurons, right in the middle of a face patch like AF or AM, that it's best to describe the variety of tuning functions as a high dimensional space. And when they do a PCA it takes about 40 or 50 PCs.
One more piece of information comes from Fairhall and Gobbini, in another paper under review, where they had congenitally blind subjects listening to people's voices. So these are people who have never seen a face, never seen an animal form. And in these subjects, you could decode the emotional content of the voice from this cortex, but not so well in the sighted subjects.
So this cortex, if it has never been stimulated with visual input, is coding something else that seems to emphasize people, the emotional content of voices. The point I've been making here is that a lot of fusiform gyrus activity is evoked or modulated by things that have no biological form: moving triangles, moving points of light, industrial robots. It's modulated more by animal behavior than by animal form. Note that it's not that animal behavior evokes a stronger overall response in ventral temporal cortex; the point is that the pattern of response depends more on the type of action or behavior the animal is performing. And it's modulated by voices in the congenitally blind.
So the stimulus sampling bias in still images, faces, bodies, and objects has restricted our understanding of the functional architecture of ventral temporal cortex. So what is represented in the lateral fusiform cortex? What are the dominant features? Is it animate as compared to inanimate entities, or is it agency?
So is the animate-inanimate distinction a major large-scale feature? This also has a long history in the decoding literature, where we find that the responses to animate things like faces and cats are quite dissimilar from the responses to small objects and houses. In a beautiful study by Kiani, looking at population responses in monkey IT cortex to a wide variety of images, they did a similarity structure analysis back in 2007, and they found the biggest distinction was between the responses to animate stimuli as compared to inanimate stimuli.
And this was repeated, using a subset of the images, by Niko Kriegeskorte, who found that the similarity structure in monkey IT cortex, drawing a distinction between faces and bodies, and animate and inanimate, was mirrored in human IT cortex, suggesting that there's a high degree of similarity in the representational spaces of monkeys and humans. And Kalanit Grill-Spector has proposed, in her paper on ventral temporal cortex, that one of the major distinctions is between animate and inanimate stimuli, with a structure that's kind of multiplexed below that for different aspects of animate stimuli.
The problem with this is that all of these studies that ask whether the stimuli are animate or inanimate usually use human, mammalian, or avian stimuli. And when we do a broader sampling of clearly animate stimuli, we get results suggesting the distinction doesn't apply to all animate stimuli. So we did a study in 2012, Andy Connolly did the study, where he had people looking at pictures of ladybugs and luna moths, in addition to mallards, warblers, monkeys, and lemurs. And when he looked at the distinction between the response strength to primates and to bugs, it looked just the same as the map contrasting faces to objects.
So this is a pattern that we see repeatedly with the animate stimuli on this side, and the inanimate stimuli on this side. But now we have animate stimuli on this side and bugs on this side. Well, it could be that if we had people looking at objects in the same study, it would be just an even stronger response, in this more medial cortex. So we did that study.
Long Sha was the first author on this paper, where we had some bona fide inanimate objects, keys and hammers, in addition to things that were kind of at the low end of this animacy continuum. So we had stingrays and ladybugs and clownfish and lobsters, in addition to people, chimpanzees, cats, birds, pelicans, warblers, and giraffes. And what we found was that in ventral temporal cortex the similarity between responses fell on a continuum in the similarity analysis, with hammers and keys overlapping with the low-animacy stimuli.
What we would have expected is that the inanimate objects would be way over here, and then all the animate stimuli would be over here, along the animacy continuum, from low animacy to high animacy. We're calling it the animacy continuum, but we really don't know what it is. All we know is that people and mammals are on one end, and bugs and fish are on the other. This continuum, again, has a topography that shows this lateral-to-medial distinction.
So the animate-inanimate dichotomy does not survive broadening the stimulus sampling to low animacy animals. So no one would say that stingrays and ladybugs and luna moths are inanimate. And this animacy continuum actually is an old idea, in intellectual history, going back at least to Aristotle and his treatise called On the Soul.
There are also animacy hierarchies in linguistics. So the grammatical forms are sometimes conditioned by how animate the thing is that you're talking about. We have a little bit of that in English, where we never use "it" when we refer to a human. Sometimes some people will refer to babies as its. I don't. But people don't have any trouble referring to insects as it, right?
We very rarely use a gendered pronoun for insects. And this plays an even stronger role in other languages. So is the animacy continuum, that is, varying levels of agentic complexity, the dominant organizing principle in ventral temporal cortex?
I'm going to skip over this and say, not exactly. So I'm going to now indulge in a digression to reframe the question. This is a very brief presentation of the methods we're developing for modeling high-dimensional representational spaces based on a set of basis functions. These basis functions have tuning profiles, connectivity profiles, and topographic components that are shared across brains, and in this early work we just did this in ventral temporal cortex.
So we have people watching a complex movie, not too complex, it's from Hollywood: Raiders of the Lost Ark. And we measure their brain activity while they watch the movie. This is a very easy study to find subjects for, because they're very happy to watch a movie and be paid. Now these patterns of activity here are responses in two subjects' brains to the same time point in the movie.
Now, we reasoned that, while people watch the movie, they're representing the same visual information. And with a director like Steven Spielberg, they're probably attending to the same kind of information. So somehow this pattern here, these patterns here, and these patterns here, are representing the same visual information. But it's very hard to see exactly how they're similar.
The way we think about this is that each pattern of response can be thought of as a vector in a high-dimensional space. Now, I'm just showing this as a vector in a three-dimensional space, a three-voxel space, but each voxel, each measurement unit in the pattern, is a dimension of these pattern vectors. So as the movie progresses, the location of the pattern vector changes in this representational space.
This is just illustrative data showing 15 patterns for 15 time points in the movie, 15 response vectors, in the three-dimensional, three-voxel representational spaces of two subjects' brains. And as you can see, because these voxels are not well aligned anatomically, at least at this fine scale, the vectors are in quite different locations. Now the idea behind hyperalignment is that we want to find a transformation matrix such that, when we transform the second subject's pattern vectors, they become more similar to the first subject's.
And this is simply a rotation, strictly an improper rotation, that we calculate using the Procrustes transformation. When we do that, we find a matrix that rotates this so that these two sets of pattern vectors are now in good alignment. And so now the vectors for these different time points in the movie are close to each other, making them more discriminable with simple classifiers.
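A minimal numerical sketch of this Procrustes step, with synthetic data standing in for the fMRI responses: two subjects' time-point-by-voxel matrices, where subject 2 is simulated as a rotated, noisy copy of subject 1. The orthogonal Procrustes solution (which permits improper rotations, i.e. reflections) recovers the aligning transformation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: two subjects' responses to the same movie,
# as time-point x voxel matrices (same voxel count after masking)
n_timepoints, n_voxels = 200, 50
subj1 = rng.normal(size=(n_timepoints, n_voxels))

# Simulate subject 2 as a rotated version of subject 1 plus noise:
# the same representational geometry, expressed in different voxels
true_rotation, _ = np.linalg.qr(rng.normal(size=(n_voxels, n_voxels)))
subj2 = subj1 @ true_rotation + 0.1 * rng.normal(size=(n_timepoints, n_voxels))

def procrustes_align(source, target):
    """Orthogonal Procrustes: rotation R minimizing ||source @ R - target||."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

R = procrustes_align(subj2, subj1)
aligned = subj2 @ R

# Alignment should make the time-point patterns far more similar
err_before = np.linalg.norm(subj2 - subj1)
err_after = np.linalg.norm(aligned - subj1)
print(err_after < err_before)
```

The transformation is constrained to be orthogonal, so it rigidly rotates the whole cloud of time-point vectors rather than warping the representational geometry.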
For a third subject, we have to find another subject-specific transformation matrix to align that subject to the average of the first two. And again, the Procrustes transformation does a very good job of that. And this works remarkably well. We find that, when we do this, we derive a space for ventral temporal cortex that has as many dimensions as there are voxels, which is about 1,000.
We can reduce that with PCA, and we find that we need over 30 dimensions to account for the information in this representational space, in terms of the responses to the movie. Now this is a measure of classification accuracy for classifying which part of the movie the subject was watching when we measured their brain activity. And you can see that with five PCs, a five-dimensional space, five features, it's about 47%. And it goes up.
And in ventral temporal cortex, it peaks around 68-69%, and we're getting pretty close by around 30 PCs or so. Somewhere above 30. We toyed with the idea of proposing that the magic number was 42, but we abandoned that. These 30-some PCs can be visualized in terms of the weights in the transformation matrix in ventral temporal cortex. And these are the distributions of weights in the two subjects.
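The shape of the accuracy curve just described, between-subject classification of movie time points improving as model dimensions are added, can be sketched with synthetic data. All the sizes, noise levels, and resulting accuracies here are hypothetical stand-ins, not the study's numbers:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic shared space: 100 time points x 40 model dimensions, with
# signal variance tapering off across dimensions; two subjects are
# noisy observations of the same response trajectory
n_time, n_dims = 100, 40
shared = rng.normal(size=(n_time, n_dims)) * np.linspace(2.0, 0.5, n_dims)
subj_a = shared + 1.2 * rng.normal(size=shared.shape)
subj_b = shared + 1.2 * rng.normal(size=shared.shape)

def timepoint_accuracy(a, b, n_pcs):
    """Classify each of b's time points by its nearest neighbour among
    a's time points, using only the first n_pcs principal components."""
    a0, b0 = a - a.mean(axis=0), b - b.mean(axis=0)
    _, _, vt = np.linalg.svd(a0, full_matrices=False)
    pa, pb = a0 @ vt[:n_pcs].T, b0 @ vt[:n_pcs].T
    # correlation-style nearest neighbour on unit-normalized patterns
    pa /= np.linalg.norm(pa, axis=1, keepdims=True)
    pb /= np.linalg.norm(pb, axis=1, keepdims=True)
    predicted = np.argmax(pb @ pa.T, axis=1)
    return float(np.mean(predicted == np.arange(len(b))))

accs = [timepoint_accuracy(subj_a, subj_b, k) for k in (5, 15, 35)]
print(accs)  # accuracy generally rises as dimensions are added
```

The point of the sketch is the qualitative behavior: a handful of dimensions captures some of the shared signal, and accuracy climbs as more model dimensions are included, flattening once the remaining dimensions carry mostly noise.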
And you can see they're not exactly the same, for each PC, but there's some similarity that you can see with your human pattern recognizer here. Now the surprising thing is that none of these PCs really seems to be pulling out the location of the fusiform face area or the parahippocampal place area. So if we map out the fusiform face area in each subject using a standard localizer, we can show the location of that, using these yellow lines. And the green line is the parahippocampal place area.
So does that mean that this transformation and dimensionality reduction has essentially eliminated the existence of the fusiform face area? That's not true at all, because it can be reproduced as a linear discriminant of these features, these PCs. If you find the right set of weights, and it's the same set of weights for all subjects, you can find the pattern that best discriminates the response to faces from the response to objects.
And now you can see that it's actually capturing the location of the fusiform face area very well. And here it is blown up. So you can see that this subject and this subject have very distinctive topographies of the fusiform face area. But they are captured in the model that was derived from responses to the movie.
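A sketch of that idea: recover a face-versus-object contrast as a set of weights over shared model dimensions, then project those weights through a subject's dimension-to-voxel loadings to get a subject-specific map. All of the numbers, and the use of a Fisher discriminant specifically, are hypothetical stand-ins for the actual analysis:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical responses to still images in a 35-dimensional model space
# (rows = trials, columns = shared model dimensions)
n_face, n_obj, n_dims = 50, 50, 35
face_mean = rng.normal(size=n_dims)      # hypothetical class means
obj_mean = rng.normal(size=n_dims)
faces = face_mean + rng.normal(size=(n_face, n_dims))
objects = obj_mean + rng.normal(size=(n_obj, n_dims))

def fisher_discriminant(class_a, class_b):
    """Fisher LDA weights: pooled-covariance^-1 times the mean difference."""
    mu_a, mu_b = class_a.mean(axis=0), class_b.mean(axis=0)
    pooled = np.cov(np.vstack([class_a - mu_a, class_b - mu_b]).T)
    return np.linalg.solve(pooled, mu_a - mu_b)

w = fisher_discriminant(faces, objects)   # same weights for every subject

# Project through a subject's (hypothetical) dimension-to-voxel loadings
# to get that subject's voxel-wise face-selectivity map
voxel_loadings = rng.normal(size=(n_dims, 1000))
face_map = w @ voxel_loadings             # one value per voxel

# The discriminant separates the two classes in model space
scores_f, scores_o = faces @ w, objects @ w
print(scores_f.mean() > scores_o.mean())
```

Because the discriminant weights live in the common model space, the same weights can be projected through each subject's own transformation matrix, yielding individually distinctive topographies from a shared contrast.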
So the fusiform face area exists. The distinction between faces, or faces and bodies and animals, is in this model space, this 35-dimensional model space. But how big a factor is it? How much variance does that linear discriminant account for in the responses to the movie, the dynamic content of the movie, not to still images? Well, when we look at that linear discriminant, it accounts for about 7% of the variance, which is a third of the variance accounted for by the first dimension.
And it's about one eighth of the variance accounted for by the full 35-PC model. So in terms of what the brain is representing in a natural viewing condition, like watching a dynamic movie, this particular dimension is playing a relatively minor role. It exists, that's very clear. But there's something else that is driving the variability of patterns of response in ventral temporal cortex.
Now, if you go back to still images, and this is from Long Sha's study, which used still images, that first dimension in a PCA accounts for over 50% of the variance. With responses to still images, that dimension, which here is the animacy continuum rather than the face-object or animate-inanimate distinction, though these things are all quite collinear in ventral temporal cortex, is accounting for a lot of variance. And the second PC is down around 10% of the variance.
So does that mean that all the information in the responses to still images is captured by this one dimension, this one contrast, animate versus inanimate, or the animacy continuum, or faces versus objects, whatever you want to call it? That really isn't true. The first PC does a very good job of classifying between classes, that is, birds versus primates, or birds versus fish, or objects versus insects.
But the distinctions within class, pelicans versus warblers or people versus chimpanzees, are much more poorly classified by the first PC alone. It's the other PCs, the second to the eleventh in this model, that afford good within-class discrimination, so that information is carried there.
But what is the difference between the response to a warbler and to a pelican? Or a stingray versus a clownfish? That information is hidden in the PCs that account for much less variance in the responses to still images. But they still carry significant information over and above what is carried by that first dominant dimension.
OK, so I'm going to wrap up now. This takes me a little time. So is the animacy continuum the dominant organizing principle for representation in the ventral temporal cortex? Not exactly. So the animacy continuum, face-object distinctions, account for only a small portion of variance in the responses to a rich dynamic stimulus. So the animacy continuum and face-object distinctions account only for coarse, but not the fine-grained distinctions.
And these dimensions are derived from responses to still images. Dimensions based on responses to agentic behavior may tell a different story. In other words, the representation of agentic behavior may play a much more dominant role in the representational space in ventral temporal cortex. So first I talked about a stimulus sampling bias.
So these are images from Roozbeh Kiani's study, which had over 1,000 such images, but they're all still images that are flashed to the animal. And when we look at studies that use things like Heider-Simmel animations, industrial robots, or these wonderful nature movies from David Attenborough, we find that models based on those still images don't seem to account for as much of what's going on in ventral temporal cortex. So category-selective regions have more complex tuning functions, which suggests they play more diverse roles in person perception.
And representational geometry is dominated by the representation of behavior, not form. This is really a very surprising result that requires more studies to nail down. And the animate-inanimate distinction is not the principal dimension that characterizes the coarse lateral to medial topography in ventral temporal cortex.
The second one is inadequate computational methods. Cognitive neuroscience has its historical roots in neuropsychology, where the method was to find patients who had focal brain damage due to some accident of nature and determine whether some functions were impaired and others preserved. The critical test was double dissociation: finding that with damage to one area function A is impaired and B is preserved, and with damage to another area function A is preserved and function B is impaired. This was the standard method.
And it was necessary, because all that people could do was find patients, study them, and then wait for them to die to see where the brain lesions were. We have much more sophisticated methods now. But functional imaging, and this again is historical, started with positron emission tomography, looking at changes in regional cerebral blood flow.
I did that work, and we could get at most 10 measurements. So instead of having 1,000 time points, we had 10. We had very limited data, and the best we could do, again, was this kind of search for simple contrasts: which conditions evoked stronger responses than other conditions. We have much better data and much better computational methods now.
So looking at univariate contrasts is really a holdover from the history of cognitive neuroscience, behavioral neurology, neuropsychology, and early methods of functional brain imaging. And the reliance on simple contrasts, which are really like single dimensions in representational space, reminds me of the legend of the blind men and the elephant, where six blind men are sitting by a river and sense the presence of something large among them.
They go over to investigate. They can't see the thing. So they each feel different parts. One feels the tusks and says it's a spear. One feels the trunk and says, I don't know, a hose or something. One feels the side and says, it's a wall. One feels the tail and says it's a rope. One feels the ear and says it's a fan. All those things are really correct, OK?
The tusk is kind of like a spear, and the ear is something like a fan. The tail is something like a rope. So there's nothing incorrect about what they're observing. But because they can't see the totality of the elephant, they can't say it's an elephant. These are all just pieces of an elephant.
And in many ways each hypothesis about the principal function of an area in ventral temporal cortex, of the category-selective regions, is like feeling one piece of a larger whole. Some people have proposed stimulus size as a major dimension, or expertise; we have the animacy continuum, domain specificity, even retinotopy. But to see the whole elephant, to see how ventral temporal cortex does the amazing thing that it does, which is serve as the seat of complex vision, of recognizing what things are out there, with all the variety and subtlety of distinctions we can encounter visually, we need better analytic and computational methods. And that's why I propose that neural decoding, and thinking of ventral temporal cortex as a high-dimensional representational space rather than as dominated by simple single dimensions, is much more powerful. Univariate statistics, which really goes back to neuropsychology and early functional imaging, limits analysis to univariate contrasts, and that leads to a search for the single function of an area.
And then it becomes circular, because if that is your method of study, you come to think the brain is divided into areas that each have a single function. In contrast, if you use multivariate methods like pattern classification, representational similarity analysis, and hyperalignment, you can model the functional architecture of an area as a high-dimensional representational space. This can still account for the category-selective regions.
They don't disappear. They're very real. But it also accounts for the fine-grained patterns that carry finer distinctions. And it allows us to have models of functional topography that are multiplexed topographies, where the tuning function of a single unit, whether a neuron or a voxel, is complex and not easily described as being responsive to one thing and nothing else. So with that, I thought this would be a good introduction to the debate.
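As a closing illustration, one of the multivariate methods named in the talk, representational similarity analysis, can be sketched in a few lines of Python (NumPy/SciPy). The data here are simulated toy patterns, six categories falling into two coarse groups, standing in for real multivoxel responses.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Toy multivoxel patterns: 6 categories in 2 coarse groups of 3
# (e.g. animate vs. inanimate), 100 simulated voxels each.
group_means = rng.normal(size=(2, 100)) * 2.0
patterns = np.repeat(group_means, 3, axis=0) + rng.normal(size=(6, 100))

# A univariate contrast collapses each pattern to a single number and
# asks only which conditions respond more strongly on average.
univariate = patterns.mean(axis=1)

# RSA instead keeps the full pattern and characterizes the
# representational geometry as pairwise correlation distances:
# the representational dissimilarity matrix (RDM), condensed form.
rdm = pdist(patterns, metric="correlation")

# Geometries are compared by rank-correlating RDMs, e.g. the same
# categories measured again with independent noise.
rdm2 = pdist(patterns + 0.5 * rng.normal(size=patterns.shape),
             metric="correlation")
rho, _ = spearmanr(rdm, rdm2)
```

Unlike the univariate summary, the RDM preserves the full pairwise structure of the patterns, so both the coarse group structure and the finer within-group distinctions survive the comparison.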