Functional imaging of the human brain: A window into the organization of the human mind - Part 2
- All Captioned Videos
- Brains, Minds and Machines Summer Course 2021
NANCY KANWISHER: Great. OK, so you've added some candidate blobs to the brain. [CHUCKLES] OK, but what about auditory cortex? There's a limitation that somebody over here mentioned earlier. There's a limitation to all the stuff I've talked about so far, which is, we sit around, and we come up with a hypothesis like, oh, maybe there's a special brain region for this. Let's go look. And that's fun and diverting, and we've found some cool stuff with that. And often we try that and don't find anything. And so that's a worthwhile enterprise. That's standard hypothesis-driven science.
The problem with it is, who's to say that the interesting things about the brain are hypotheses we will think up, things we will think to test, right? And so that's one of many reasons why it's nice to use data-driven methods, where you just collect a shipload of data and shake it and try to say, OK, what's the structure in here, right? So for the stuff we did with auditory cortex, when we started this a bunch of years ago, there was very little known about human auditory cortex. We basically knew that there was tonotopy, a map of frequency space in primary auditory cortex, which has been known for a long time and has been found in animals.
And there was some kind of evidence that there might be cortex that responds selectively to speech, but basically, the whole organization of high-level auditory cortex was not really known. So we decided to just scan people while they listened to recordings of 165 different natural sounds. So we recorded these two-second clips. And we chose them to be the most frequent sounds people hear, frequent categories of sounds people hear, and to be easily recognizable. So we did lots of testing on Mechanical Turk, asking people to list all the sounds they'd heard in the last hour, and we compiled all of those.
And then once we got our list, we asked people, when did you last hear this sound? And we got the most commonly heard sounds. And that's a whole list of things, starting with, what is the most common sound people hear? Man speaking, OK, that's a problem.
[LAUGHTER]
Anyway, it's true. Woman speaking was number 5 in the ratings. Anyway, they all made the top 165, so we included them. We then scanned people while they listened to these. It's actually a really hilarious experiment. You're lying in the scanner, and you hear bang, bang, bang from the scanner. And then you'll hear woof, woof, woof, woof, woof, woof, bang, bang, bang, bang, bang, bang, and then toilet flushing, and then bang, bang, bang, bang, bang. Quite entertaining. And the upshot of all of this is we measure the magnitude of response of each voxel, or 3D pixel, in kind of greater suburban auditory cortex, like the whole top of the temporal lobe.
So for each voxel, we have its magnitude of response to each of the 165 sounds. We put it in this little matrix here. I've lost my cursor right here. Right, there we go. OK, so each column here is this 165-dimensional vector of responses for that voxel. Everybody with me? OK, so then the cool thing we can do is throw away all the labels and just analyze the matrix with various matrix decomposition methods. And just basically ask that matrix, what is your basic structure? How can we boil you down to your simplest components?
And that's essentially what we did, using independent component analysis. Let me say for a moment why this is a nice idea. You can imagine that each voxel, after all, contains about half a million neurons, a typical voxel, which is shocking. I mean, it's shocking that we ever see anything at all with this crude a method. But you can imagine each voxel then contains multiple different kinds of neural populations, each with different selectivities, OK?
And each of those neural populations will have some response profile over those 165 sounds. So the measured response of each voxel is some kind of weighted sum of those response profiles, each weighted by how strongly that neural population is represented in the voxel. And so what we want to do is discover these canonical response profiles of the different neural populations that reside within a voxel.
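To make that decomposition idea concrete, here is a minimal sketch in Python. It uses scikit-learn's FastICA purely as a stand-in for the decomposition method actually used in the study, and random numbers in place of the real fMRI data; only the 165-sound, six-component setup comes from the talk.

```python
# Minimal sketch of the voxel-decomposition idea, assuming a sounds-by-voxels
# response matrix. FastICA is a stand-in for the actual method; data are placeholders.
import numpy as np
from sklearn.decomposition import FastICA

n_sounds, n_voxels = 165, 5000                    # 165 natural sounds, hypothetical voxel count
responses = np.random.randn(n_sounds, n_voxels)   # placeholder for measured fMRI responses

ica = FastICA(n_components=6, random_state=0)

# Each column of `profiles` is a canonical response profile over the 165 sounds.
profiles = ica.fit_transform(responses)           # shape (165, 6)

# Each row of `voxel_weights` says how strongly each profile is expressed in one voxel,
# so a voxel's measured response is approximately a weighted sum of the six profiles.
voxel_weights = ica.mixing_                       # shape (5000, 6)

reconstruction = profiles @ voxel_weights.T + ica.mean_
print(reconstruction.shape)                       # (165, 5000), same shape as the data
```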
OK, so that's the agenda. And when we do that in the auditory cortex using our 165 sounds, we find that just six components account for 80% of the replicable variance. Now, first and foremost, I want to say I don't think there are just six different kinds of response profiles of neurons in auditory cortex. I'm sure there are way more than that. So this is in part a statement about what you can see with functional MRI. Nonetheless, what was cool is when you then take those components and stick-- importantly, all of this analysis is done without any labels.
The math doesn't know which sounds are which. It's just saying, serve up the dominant components. So it's very hypothesis neutral in that way, right? But then once we find these components, we can stick the labels back in and say, OK, what are they? Right? And when we do that, we find that four of them make sense. So they're like nice positive controls. One responds to all of the sounds with a lot of high frequencies. Another responds to all the sounds with low frequencies. It's like check, check, that's tonotopy. And we can actually project their weights back onto the cortex and say, yep, they're living right there on the gyrus. So that's a really nice positive control that it discovers things we already know.
Two other components were these kind of acousticy things that weren't exactly known before. They weren't completely expected, but they weren't completely crazy either. One was basically clicking-type sounds, and another was pitchy-type sounds. And actually, we had already reported that there's something like pitch selectivity in the cortex. OK, so those were nice positive controls. But then there were two other components that were not like that. So here's one of them.
And so this is, again, the response profile over those 165 sounds of component 5, OK? It just emerged from this very data-driven method. OK. If we sort those sounds into rough categories, which we did by putting them up on Turk and giving people these choices, you find, oh, this component responds a lot to the light green and the dark green. What is that? Well, when we average across these categories, we see this component responds strongly to foreign speech and English speech. What's that little thing? Oh, that's music with vocals, OK?
So this component-- oh, and then much weaker but closest are non-speech vocalizations. OK, so it's pretty clear. You just look at this and you say, OK, this is a speech-selective component. And it just emerged out of this data-driven analysis. We didn't go in there saying, oh, there's speech-selective stuff in the cortex. We did this broad screen, and it just emerged as a dominant component in the cortex. OK, so that wasn't entirely new. As I mentioned, other people had hypothesized this before and found something kind of like it. But they tested three or four conditions. So they didn't have the really lovely evidence for selectivity that we have here, because we have all these different stimuli.
So still, it was nice. But the real prize in the study was component 6. And if you look at component 6, you see all this light blue and dark blue stuff produces high responses. And then the response profile kind of goes off a cliff. So if we average within a category, oh, that's instrumental music and music with vocals. And everything else produces a much lower response in this component. So this is the first time that really strong music selectivity was found in the cortex. People had some kind of sort of maybe things before that weren't very impressive.
And the reason we could find it here is because we're using this ICA method. We're basically pulling apart neural populations that cohabit within voxels and blur each other. So if you use standard voxel-wise methods, you don't find strong selectivity for music. But if you use the ICA decomposition method to pull apart these different response profiles, you get this really nice, clean, strong selectivity for music. And it's really quite remarkable, because as you see here, there are many, many different music clips with just instrumental music and many with vocal music.
And they're wildly different. It's like a classical flute solo, a heavy metal band, a reggae band, spirituals, opera, you name it. All these wildly different kinds of music produce a strong response in this component. So that was pretty cool and pretty new. And that also tells us some cool stuff about music. Music is just a big mystery that Darwin puzzled about. We get why humans need food and sex and have all these basic mechanisms. Language makes sense: you survive longer if you can communicate with conspecifics.
But music, why is it that all human societies have music? What is it doing for us, and why do we have it? That remains a big puzzle. We haven't answered that question here, but we've constrained it a little bit. Because one of the stories about music was that music co-opts mechanisms that evolved for speech, right? And this shows, like, no, these things are not interested in speech down here, right? Really, music is its own thing in the brain. It's totally separate from high-level responses to speech.
It's not co-opting those mechanisms. Separate work by Ev Fedorenko and me and Sam Norman-Haignere and others has shown that language, which is separate from speech-- I should have said that. This thing is about speech sounds, not language, right? It responds at least as much to foreign speech that you cannot understand. So there's a clear division between hearing the sounds of speech in the brain versus understanding the meaning of the sentences. Those are very different regions, right? Nonetheless, the music response has no overlap with either of those. Yeah.
So whenever you get an astonishing result, you should say, really? And so we said, really? And I love this. I mean, I was just excited by how you could separate these responses with some fancy math. But not being a mathematician, it always makes me nervous. How do we know the math didn't invent the result rather than discover it? And so I wanted to see it in the raw data. And so I was thrilled when we got a chance to look at this with intracranial recordings from patients with grids sitting on top of the temporal lobe.
And what we can see is that in individual electrodes, we replicate all of these results. So here's an individual electrode showing you-- I keep losing my cursor. Where did it go? I don't know. There it is. OK, here is time here. It's just a single electrode responding to those same sounds, all the different foreign speech sounds in light green, native speech, understandable speech, in dark green, vocal music in pink. So you see a similar selectivity for speech sounds. And more importantly, really nice selectivity for music sounds in this electrode and in another 10 or so electrodes.
But the real surprise is, we also found something new that had not emerged from the functional MRI. And that was electrodes that respond selectively to vocal music only. OK? And what's cool about this is that the response to instrumental music is low. The response to speech is low. It's highly super-additive. It's not just music plus speech. It's really a selective response to vocal music. And that, we didn't predict. After the fact, you can tell various stories. Sam, there at Harvard, has various stories to tell us. They're all highly speculative, with a shred of evidence but not much, about why vocal music would be special.
It is probably the evolutionarily earliest form of music, because, after all, you don't need to make an instrument to make it. And it is often going to be the first kind of music you hear, because developing infants can hear sounds in the womb, including speech and music. So none of that says why you should have a selective response to vocal music, but it's pretty compelling. These are just showing that we had 13 subjects who had electrodes on the surface of the brain. And so that's showing you where the electrodes were.
Remember, there are two different kinds. There's the kind where the surgeons put a whole grid of electrodes on the surface of the brain. I really like that because it gives you a sort of spatial picture as well. And if the grid is on top of a place you're interested in, you're likely to get some of the good stuff. So that's super exciting. Unfortunately, neurosurgeons are moving away from that and using depth electrodes more now. Those are more random. The electrodes land wherever they land. And they're not even near each other necessarily. And so sometimes you get lucky, and sometimes you don't. So this just shows you where those grids were on each patient.
Pink is a response in a single electrode. This is all one electrode. These are three different electrodes. Here's one. This is one electrode that responds selectively to song. OK? All right, so I wanted to show that both because I think the result is cool, but also because I just love the complementary methods of using data-driven stuff and the ways that different kinds of methods like functional MRI and intracranial recordings complement each other. OK. Also, in later work, [? Hannah Bollinger ?] in my lab has just published a paper showing that that music-selective component is present in people who have zero musical training.
They have lots of exposure to music, because it's impossible to find people we can scan who have zero exposure to music, but they have no formal training. So it is quite different from the visual word form area, which you only develop after people teach you to read. The music selectivity arises from whatever exposure normal humans have to music, even without any training. OK, how does all this stuff get wired up in development? I have no idea. I'd love to know. But I have a few little humble offerings of data that start to constrain the space.
So first of all, many studies have looked at the development of the fusiform face area and other regions in the ventral pathway. Almost all of those studies are done in kids aged four and up. And the basic claim is that all of those things, especially the FFA, continue to develop way into adolescence. OK? So the story has been out there for a long time that the FFA at least develops extremely slowly. And hence, by speculation, it's trained up by experience. I hasten to point out that slow development, way after birth, does not in itself prove a role for learning and experience.
Some things mature later in life not because of experience, but just because of a genetic program, like puberty, right? OK, you have to eat and you have to do some basic things to get to the stage where you reach puberty, but puberty doesn't happen at a certain age because of something you learned. [CHUCKLES] And so the fact that there's slow protracted development certainly makes you think that it might be learned. But it doesn't in itself nail it. This is just a common confusion, so I'm mildly obsessed with it.
OK, so there's slow development of the FFA. And there's a whole beautiful set of studies that I thought Marge Livingstone was going to talk about. But apparently, she didn't. I didn't listen to the whole lecture, but she's done gorgeous, gorgeous work on monkeys showing the development of face regions. And her basic claim is that face regions, these face-specific regions in monkeys, aren't present until relatively late. If you translate the standard way between monkey and human years, which I think is multiply by 4, I don't know where that comes from but people do that, the claim is that monkeys get this at approximately four years, I think. I don't keep the numbers in my head, but something like that.
So it's roughly consistent with the human data. She claims not to see it before then. I think one could debate that. But most elegantly, she did this very labor-intensive and really important study of raising monkeys without ever letting them see faces, and showing that those monkeys did not have face patches when first tested with functional MRI after being raised without seeing faces. OK? So they heard other monkeys. They smelled other monkeys. They got lots of cuddling from human caregivers who had visors over their faces. So they had lots of social experience, just not visual experience of faces. And they did not develop face patches. Very important.
OK, so those two things suggest that it takes years to develop face-selective systems, and that it requires experience. Yeah? Yes. So those papers lead people to say, face areas take years to develop and require visual experience. Really? I mean, maybe, but one wants more data. So I just mentioned all of this. OK, so Rebecca Saxe did what no one else could do before that. And that is to scan young infants. It is almost impossible. It is unbelievably challenging.
The kids wiggle and puke and poop. And the parents get nervous. And the whole thing is just a train wreck. But Heather spent years at this, and she can get a kid to calm down and go in the scanner for just the time you need to get the data like nobody's business. It is remarkable. So Rebecca and Ben Deen first found a similar spatial distribution of responses to faces and scenes in infant brains at six months in an earlier paper. But they didn't find selective responses.
And that's ambiguous. Maybe it's just that infant data are crappy and you can't see it, and there's just too much blurring from head motion and all the rest of it. Or maybe it just hasn't developed yet. OK? So Heather Kosakowski comes along, collaborates with a physicist, and makes a new infant-dedicated coil. Ben had one too, but Heather made an even more amazing one with higher signal-to-noise. And as I say, she can get data from infants like no one else can. And over many years of effort, scanning I think 80-some infants, she found face, place, and body selective responses in six-month-old infants.
Here's a group analysis. It doesn't reach official significance levels, but then we do the functional region of interest thing: that is, define the candidate voxels based on one pool of data and measure the response in held-out data. That's important for all the reasons you machine learning people know. You don't want to train on your test data, all of that. Similarly, you don't want to measure response magnitudes in functional MRI on the same data used to identify the voxels, OK? So when we do that, we find, here is the face-selective response in that region of the FFA in six-month-old infants. That's faces, bodies, scenes, and objects.
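Here is a hedged sketch of that functional-ROI logic in Python. The data, voxel count, and threshold are made up for illustration; the point is only that the voxels are selected on one half of the data and their responses are measured exclusively in the held-out half.

```python
# Sketch of the functional ROI (fROI) logic: select voxels on half the data,
# measure responses in the held-out half. All data here are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_voxels = 2000
conditions = ["faces", "bodies", "scenes", "objects"]
# Hypothetical per-voxel mean responses for two independent halves of the data.
half1 = {c: rng.normal(size=n_voxels) for c in conditions}
half2 = {c: rng.normal(size=n_voxels) for c in conditions}

# Step 1: define candidate face voxels from half 1 (faces > objects contrast).
contrast = half1["faces"] - half1["objects"]
roi = contrast > np.percentile(contrast, 95)     # top 5% of voxels, an arbitrary threshold

# Step 2: measure the response profile of those voxels in the held-out half only.
profile = {c: half2[c][roi].mean() for c in conditions}
print(profile)  # with real data, a face-selective region shows faces >> other conditions
```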
The place-selective stuff shows a similar place-selective profile. And the body region shows a body-selective profile. So none of this tells us how those regions develop, but it says that already by six months you have selective responses in the right regions of the brain. And that's going to constrain any possible story. OK, so that's exhibit A. So no, it doesn't take years to develop. It may continue to develop after that. It probably does. But it's already there very early.
OK, so does it really require visual experience? Well, let's test blind people. So Ratan Murty, my postdoc, also a veteran of this course, 3D-printed objects like these shown here: faces, little places (little mazes), hands, and chairs, a bunch of each of those. He stuck Velcro on the bottom. He stuck subjects in the scanner with a big cardboard disk on their bellies with Velcro. And he put things on the Velcro and rotated the disk from just outside the scanner bore. So the blind subjects could lie in there and feel each object as it goes by without having to move their arms, which sloshes the brain around, OK? So we're scanning people while they feel these things. Everybody got that?
OK, first thing, you want to replicate visual face selectivity. Oh, I didn't make it all pop up with drama. OK, too bad. The top is just showing these stimuli visually to sighted subjects and getting the face-selective responses in the usual location right there. This is what happens with the congenitally blind people in the contrast of feeling faces versus feeling mazes and hands and chairs. So this is remarkable. These people have never seen. And here they are showing what looks to be a very similar location of face-selective responses in the brain when they touch faces.
Further, if you look at the response profiles in held-out data, here's the visual FFA response to faces and the other conditions up here. And here's the blind subjects' tactile selectivity, and you see this very similar selectivity. Now, I sneakily made the y-axis labels so small you probably can't read them. The magnitude of response is much higher to visual stimuli in sighted subjects than to the tactile stimuli in blind subjects. So they're not identical. You can see, it's weaker here in the blind subjects. Not every blind subject shows this. Over half did. Some didn't. But this is the pooled response across all of them, whether they have it or not. You can do this ROI method.
And so we are seeing face selectivity in the same place in the blind subjects. And that is something pretty astonishing. I can still hardly believe this. Frankly, if anybody else had published this, I'd be like, eh, nah. But [CHUCKLES] I spent a lot of time looking at these data, and they're for real. Yeah. Exactly. I don't think we do. I just spent a lot of time talking to Ratan about it. How do you get from somatosensory S1 down there? What are the routes? I have no idea. It's a great question.
Other people who know neuroanatomy better than I do might speculate. But I will say, it's not the first time this kind of stuff has been shown. Blind subjects show nice tactile responses to objects in LO. That's been shown for a long time. And so clearly, there's some way to get tactile information into these kinds of high-level visual regions, but I don't know what the route is. It's a very good question. OK, so we did do some stuff using resting-state functional MRI to try to look at connectivity. It's not connectivity. It's a really lousy proxy for connectivity, but it's kind of all we've got.
And we find that those regions in the blind subjects show roughly similar functional correlations at rest with other brain regions to what we see in the FFA in sighted subjects. But in the blind subjects, it's more weighted towards connections to frontal regions than connections to posterior visual regions. All of that is right at the edge of significance. I forget which of those things crossed over into significance. It's all barely detectable. And it's not really the method you want for looking at structural connectivity, because basically, we don't have a good one. No, DTI is not a good method for that. I won't go down that route right now.
[LAUGHTER]
OK, all of this raises the question of, how does that damn response know to land right there? Right? And your question about connectivity raises that. And I'll skip a whole slide on how many people have speculated that long-range connectivity is a fundamental constraint in development. And I love that idea. And if I had to guess right now, I'd guess that's probably true, that a lot of this stuff lands where it does precisely because of its long-range connectivity. And we have some tiny shred of evidence for that, but I'll skip over it. OK. So all of-- OK, maybe I'll skip a bit. What am I going to do here?
Anyway, so those are a few humble snippets. These things don't add up to a story about how these regions develop. But probably long-range connectivity matters, more based on a priori considerations than actual data. And I'm not going to say experience doesn't matter at all; experience must matter. How can it not? Nonetheless, astonishingly, you can get something that looks like a face-selective response in congenitally blind people who have never seen. So those are all important constraints.
OK, this is very loose speculation. I think visual cortex is pretty homologous between monkeys and humans and that auditory cortex is not. And you can ask me about that later. We use audition in very different ways from monkeys, whereas we use vision in pretty similar ways to monkeys. Yes, you're absolutely right. It's like, what? How do we reconcile these? I don't know how we reconcile these. They seem completely inconsistent. I point out that, first of all, they're not technically inconsistent. There's a difference in species.
And Marge Livingstone did not test her face-deprived monkeys on touch. So there's also a difference in modality. My guess is that the deeper difference is that when a blind person feels a face, they know damn well that that's a face. And they know what a face means. When a monkey who's been raised without ever seeing faces sees this pattern of visual information, he doesn't have any idea what that is. So I think that just knowing what that thing is and understanding its significance is probably a big part of the response. But that's a speculation.
If Marge ever tested her face-deprived monkeys with touch, that would help. But I think she hasn't done that. Yeah, they went to great lengths to make sure that there were no reflective surfaces on the cages they were raised in, indeed, as with the blind people. Importantly, blind people don't go around recognizing each other by touching each other's faces. That's not a thing. We looked this up on the web, and there are all these sites put up by blind people who say, how come sighted people are always asking us if we want to touch their face? That's disgusting. [LAUGHS] So it's not a thing.
Nonetheless, I'm sure blind infants, like sighted infants, feel their parents' faces and their own face. So there is some tactile input, but not the kind of extensive use for identification that sighted people have with faces. So to me, that just raises all the more mystery. What the hell are they doing with a brain region for a function they mostly don't perform? So my guess is, it's actually not primarily doing that. It's doing some other social thing that we haven't tested yet. But, well, I'm going to skip the whole first half of this. It's elegant. I love it, but it just got accepted for publication. So it's coming out soon.
I'll just tell you the gist. As I think you guys have heard from DiCarlo, and some of you in this group are already doing this yourselves in your own work, you can build really nice CNN models of neural responses. They predict responses to novel stimuli shockingly well in monkey data from the ventral visual pathway. Other people have shown that with visual responses. And what Ratan has been doing is doing this for the FFA, PPA, and EBA. You take the whole region as a whole, take its mean response, measure its response to 100 or 150 stimuli, build a CNN model, which is a weighted combination of units in the CNN, and the model predicts responses to held-out stimuli. And that works shockingly well.
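The encoding-model recipe is simple enough to sketch. The version below is a hedged illustration, using a pretrained torchvision AlexNet and ridge regression as stand-ins for the actual models and fitting procedure in the study, with random images and responses in place of the real stimuli and fMRI data.

```python
# Hedged sketch of the CNN encoding model: extract activations from a pretrained
# network layer, fit a linear map to ROI responses, evaluate on held-out stimuli.
import numpy as np
import torch
from torchvision import models
from sklearn.linear_model import Ridge
from scipy.stats import pearsonr

cnn = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()  # downloads pretrained weights

n_stim = 150
images = torch.rand(n_stim, 3, 224, 224)            # placeholder stimuli
roi_response = np.random.randn(n_stim)              # placeholder mean ROI responses (e.g. FFA)

with torch.no_grad():
    feats = cnn.features(images).flatten(1).numpy()  # activations from the last conv block

train, test = slice(0, 100), slice(100, n_stim)      # held-out stimuli never touch the fit
enc = Ridge(alpha=1.0).fit(feats[train], roi_response[train])
r, _ = pearsonr(enc.predict(feats[test]), roi_response[test])
print(f"held-out prediction r = {r:.2f}")            # ~0.9 with real data, ~0 with this noise
```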
And you guys know all this stuff, so I'll skip all that anyway. This is just a linear weighted combination. But look at this. Each dot is a stimulus. This is the predicted response to that stimulus in the FFA. This is the observed response to that stimulus in the FFA: a correlation of 0.91. Holy crap, not corrected for the noise ceiling. Just 0.91, right? I mean, Ratan gets this because he scans the hell out of subjects, like 10 scanning sessions, to get really reliable responses to begin with.
But also very interestingly, you get predictivity within the faces and within the non-faces, right? So it's not just predicting all the faces are high and all the non-faces are low. Similarly so for the PPA and the EBA. This predictivity outperforms what you get if you just ask people to look at the image and say how facey it is or how placey or how body-like it is. This does better than that. It also outperforms experts in the field. Experts who have spent many years recording the responses of these regions were asked to predict how strongly each region would respond to each of those stimuli, and these models do better. OK?
So on the one hand, OK, fine. We've seen this in other cases before. Big deal. There are a few reasons why I think this is cool. Well, one, it's really nice to be able to predict this stuff. And the models apply across subjects. Now, we can go to dozens of published papers in the literature and say, OK, do these same models predict all those findings in the literature? And the answer is, yes, mostly they do, which is cool. But second, because you have a model, you can do these kind of turbocharged experiments, these high-throughput experiments that you could never do on actual brains.
For example, somebody asked a while ago, how do we know we've tested the right hypothesis? So for the case of the fusiform face area, I say it's going to respond more to a face than anything else. How do we know I've tested the right stimuli? Maybe if I scanned people looking at pineapples, there would be a higher response to pineapples than faces. And then it's not a face-selective region. I can't test everything. I don't know what else to test. I've tested all the plausible alternatives I can think of. Other people have done that too, but that's only a few hundred stimuli.
But now, we can take this model, which is so highly predictive, and we can run an entire machine learning image database through the model and see what its top predicted images are. When we do that, all the top-predicted images are faces. If some of them weren't, we would then go back to the lab, test them, and measure the response. But none of them were, right? And similarly so for the PPA and EBA.
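The high-throughput screen is then just a matter of pushing a big image set through the fitted model and sorting by predicted response. In this hedged sketch, the readout weights are random placeholders standing in for the fitted regression weights, and a small random batch stands in for the image database.

```python
# Sketch of the high-throughput screen: predict the ROI response for every image in a
# large set and inspect the top-predicted ones. Weights and images are placeholders.
import numpy as np
import torch
from torchvision import models

cnn = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()
readout_w = np.random.randn(256 * 6 * 6)          # placeholder for fitted FFA readout weights

image_db = torch.rand(256, 3, 224, 224)           # small stand-in for a big image database

with torch.no_grad():
    feats = cnn.features(image_db).flatten(1).numpy()   # (256, 9216) activations

predicted = feats @ readout_w                     # predicted FFA response for every image
top_25 = np.argsort(predicted)[::-1][:25]         # in the real screen, these were all faces
print(top_25)
```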
Similarly, you can do other high-throughput methods, where you mask different parts of the face image and ask, OK, what part of this face image is driving that strong FFA response? And you can discover, like, OK, it's mostly the eyes, a little bit of this. It's kind of exactly what you'd think, just like the patient said: this is the stuff that changes, right? So anyway, it's a nice way to say with computational precision what we mean by face, place, and body selectivity, which is something DiCarlo has been bugging me about for like 15 years.
What's a face? What do you mean, face selective? And I say, Jim, you know what I mean. Give me a break. And he's like, no, you haven't said exactly what a face is. Well, now we have an image-computable model of face selectivity. Take that, Jim.
[LAUGHTER]
Yeah, he's a co-author. He loves this. But also more seriously, it gives us a way to do all of these high-throughput screening methods. And another way of looking at it is, it's just a reality check that these methods can work even on crappy functional MRI data, and work really well. And now we can use them to do cool stuff, like understand other regions that we don't understand that well, discover new regions and try to characterize them. So there's a whole suite of new things we can do with this. OK, so that was predicting responses.
No, they were just off-the-shelf CNNs trained on ImageNet or whatever, 60 different models: ResNet, VGG, AlexNet, all the standard things. And then you take those, and you don't retrain them. You just do a regression on those models. You feed your images through them, collect the activations, fit a weight for each unit in a given layer so that it fits the responses you have, and then test it on held-out images. And that's what the 0.91 is. They were not retrained on faces.
And as a weird little sidebar, wouldn't you think that, say, VGG trained on faces would work better for fitting the FFA than VGG trained on places? You would think. And it's not true. And this is a big problem. And I don't understand it. And we're doing things in my lab to try to figure out what we think that means. But it bugs me. It's a problem. [LAUGHS] OK. OK, so that was Ratan's stuff. And if you send me an email, I can send you the paper, which is coming out imminently.
OK. So Katharina Dobs has been asking a very different and, I think, really fundamental question, which is not, can we just build a model of this thing? But why does the brain have functional specialization in the first place? Why do we have all these kind of special properties in face perception? And just to give you a little sidebar on a few decades of research-- so for decades many people, me and mostly other people, have been doing all these behavioral experiments documenting the specialness of face perception.
Face perception is special in all these ways. If you present faces upside down, we're really bad at recognizing them. And that cost is much greater for faces than it is if you present words or objects upside down. There's something about the face system that's very inversion sensitive, right? So there are all these properties that people have been studying about the face system, sometimes called the signatures of face processing, that have been taken to argue that face processing is special, which sounds kind of occult. And it is kind of occult, but this is what people mean when they say that.
First of all, we're really, really good at it. Second of all, we have this face inversion effect, which is kind of diagnostic of the face system. Another thing that people have studied a lot is the other-race effect. And this is just the fact that "they all look alike": whoever they are that you have less experience with, you have a hard time discriminating them. This is well documented. It's not racist. It's just a fact about the visual system that shows that we're sensitive to the training data. And there are some very interesting phenomena about it.
This is work by Elinor McKone, who's a brilliant face researcher who's done a lot of lovely work. This is my favorite. If you grow up in a society that has a dominant race A and a smaller proportion of race B, and then you move to another city where there's more of race B, can you learn your way out of the other-race effect? Answer: you can, but only if you move before age 12, which is really interestingly related to lots of other things, like learning the proper phonemes in a novel language and all kinds of stuff like that. So there's a critical period for learning your way out of that.
Anyway, the other-race effect is very, very widely studied as a kind of signature of the face system. There's also a lot of work characterizing face space, whatever the multi-dimensional space is of what we code when we look at a face. And so people have studied this in a lot of ways. And all these things are part of this long-running behavioral literature on the way the face system seems to be special. And of course, one of the ways the face system is special is there seems to be a special brain region for it.
So the question that Katharina has been asking-- why is my-- oh, there's a whole lag here. I know Josh was complaining about it. Now, I'm experiencing it. OK, in the past these things have been treated as curiosities to be collected. We're kind of like 18th-century natural historians who go out in the woods and say, oh, I found some of these. And I found some of those. And here's a beautiful description of this one and a beautiful description of that one. And we just describe it. And we say, oh, these are so special and so lovely. And here they are, right?
But we want to ask not just what is out there, but why the system has these properties. OK? And the hypothesis we're going to consider is that all of these signatures of the face system are just exactly the expected consequences of a system optimized for face recognition. And for now, we're going to finesse the question of whether that optimization happens in the human case over evolution or development. We don't need to worry about that. We're just going to take the adult system.
However it got there, it's got these properties. It's got these signatures. Can we understand those signatures as just what an optimized system will do? OK, this is just like ideal observer analysis, but using deep nets, because we're looking at stuff that doesn't lend itself to the kind of linear analyses that you can do with low-level vision ideal observer analyses. Makes sense? OK. So that's the agenda. And so the prediction is that all of these signatures of the human face recognition system will arise spontaneously in CNNs optimized for face recognition. OK?
All right, so we take three networks. And I assume you guys have heard all this stuff to death, so I can just-- you know what a basic VGG trained on faces is. And we'll ask whether this model shows those signatures of the human face recognition system. But if it does, we want to know if it's the optimization for faces per se that led to that. And we find that out by comparing it to the same architecture not trained on faces. So we take the same VGG trained on ImageNet, or actually a subset of ImageNet in which we throw out all the humans and animals and stuff like that.
And we match the number of training images. Then we also take a random network that's completely untrained and ask whether we can find these signatures there. OK, so that's the agenda. So the overall strategy used to test this hypothesis is that if the signatures result from optimization for face recognition, we'll see them in the network trained on faces, not the other networks. OK, so let's start with the first one: high accuracy at face recognition.
So many people have pointed out that the various networks trained on face recognition are really good at it. And people say they're on a par with people, but those studies are basically done with the five authors of the paper saying, here's my performance on this task, and here's my network's. Yeah, that won't do. OK. So we decided to do this a little more seriously. Katharina devised a really nice face recognition task that can't be done by super low-level things.
So the task here is, which of these bottom images is of the same person as the target face up here? I find this damn difficult. I'm on the [INAUDIBLE] end of that big normal distribution. Even when people don't have masks, I'm on that end. Anyway, I think the correct answer is this one, I forget. Anyway, so she devised-- no? It's that one. OK, somebody knows. OK, whatever. OK, so she measured performance of 1,500 people on Turk doing versions of this task with a bunch of different youngish white women, because we didn't want to enable people to do the task by gender or race. In that case, even I could do the task. That's no good, right?
OK, so you give this task to humans. You measure their performance. You give this task to a network. You measure its performance. How do you do that? You guys probably already know this, but you run these three images through a CNN. You collect the features from a fully connected layer. You get a vector for each image. You ask whether the vector for the target image is closer to the one on the left or closer to the one on the right, with a standard distance measure. OK?
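Concretely, the network's "answer" on each trial can be computed along these lines. This is a hedged sketch: the embeddings below are random placeholders standing in for the fully connected layer activations of the three images, and cosine distance is used as one example of a standard distance measure.

```python
# Sketch of the 2AFC face-matching decision applied to a network: embed the target and
# the two candidate images, then pick whichever candidate is closer to the target.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
# Placeholder fully-connected-layer vectors for the three images (e.g. fc7-sized).
target, candidate_left, candidate_right = rng.normal(size=(3, 4096))

choice = ("left" if cosine_distance(target, candidate_left)
          < cosine_distance(target, candidate_right) else "right")
print(choice)  # scored as correct if the chosen candidate is the same identity as the target
```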
So here's human performance on this task. How do the networks do? The face network does just about the same as humans on this task. OK, so we knew that before. And this doesn't totally explore the space. I am sure we could find versions of a face test that humans would do better on than machines. So I'm not saying that the machines are all the way there, but for this non-trivial task, they're matching pretty well. Crucially, though, how does the object-trained network do? We do the same thing. We run-- oh, both of these are novel faces, not the training faces. So yeah, you do this by the method I just showed, asking which one is closer.
So we do the same thing with the object-trained network. And it sucks. And the random network sucks. And on the one hand, you machine learning people, or even the rest of you, will probably say, well, duh, out of training sample does worse. We knew that. Yeah, we kind of did know that. But it's nice to see this. It's nice to see it this dramatically. And there are lots of people out there who will tell you that ImageNet is this really powerful data set that enables networks to extract this very omnipotent feature space that works for all kinds of things. So that claim is out there. And it's not true, for faces at least. OK.
All right, so what about the other-race effect and the face inversion effect? We're not training on those layers. We're just reading out of those layers. And yeah, you can get very similar stuff at most of the layers after the first few. It grows as you go up across the layers. But in the top convolutional layers, it looks very similar. OK, so what about the other-race effect and the face inversion effect? Well, we can play the same game. We can test face-trained and object-trained and random networks and see whether they show these effects.
So first, here's human behavior measured on Turk on a test of the other-race effect. So these are humans on Turk who identify themselves as white, who typically have more experience with white faces than Asian faces. And here they are doing better with white faces than Asian faces. The inevitable reviewer three says, that's not fair. You need to do the whole crossover interaction, both on the networks and on people. And we knew that. We just didn't have the Asian face data set.
But prompted by them, we are now doing all of this. We finally found an Asian face data set big enough to train a network, and we ran a lot of people and all of that. And we're now getting the full crossover. I don't have those data here, but it's working just fine. OK, so that's the human other-race effect. Here is VGGFace tested on the same thing. The VGGFace data set is mostly white faces, with an under-representation of Asian faces. So it also shows the other-race effect for Asian faces.
And the object-trained network, even though it has seen plenty of faces, and they are disproportionately white as well, doesn't show that. And neither does the random network, OK? So, check and check. OK, what about the face inversion effect? We played the same game. Here is a replication, number I-don't-know-what, many hundred, of the face inversion effect in humans: performance matching upright faces compared to inverted faces on that same task I showed you before.
And we see it big time in the face-trained network, not in the object-trained network or the random network, OK? And actually, a reviewer suggested something that's kind of obvious but kind of cute that we just did. If you train VGG only on inverted faces, you get an inverted face inversion effect: it's better on inverted faces than upright faces, right? So that's, again, sort of intuitive. So on the one hand, all of this is not shocking at all if you're a machine learning person. But think of all these people who've come from behavioral studies, who have carefully collected these lovely, beloved signatures that are so important. And what we're showing is that these are just things that are expected from optimization for face recognition, yeah? OK.
OK, I'm going to-- no, maybe we're OK. OK, what about face space? OK, so there are a lot of ways to measure face space, but there's a really nice one that's on the web, put up by Kriegeskorte in his work with Marieke Mur, where you start with a circle like this and you drag faces around to arrange them in a 2D space in which similar faces are nearby. And you drag them around for a while until you're happy that you've roughly captured your perceptual system's similarity space for these faces. OK?
So then what you can do is make a representational dissimilarity matrix. People have talked about that in here. Yeah, OK, which just means that each cell in the matrix shows how similar or dissimilar-- I don't know why Niko says you should flip it, but whatever. How dissimilar those two faces are from each other, that's what the matrix shows. And we can do that with human behavior like this, based on that arrangement task. Everybody with me? OK.
Then we can do the same thing on a network by taking the vector for each face out of a late layer in the network. Actually, we also do it with all the layers, tracking it across layers. And we can make an RDM for the network. And then the whole beauty of RDMs is we can correlate them with each other. Yeah? And so we then do that with the three different kinds of networks. Here's the human noise ceiling. It's not that high, because people disagree with each other quite a bit; the multi-arrangement task is OK but has some noise. The face-trained network is close to what humans do. It's pretty correlated with them. And the object-trained and untrained networks are not, OK?
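Here is a hedged sketch of that RDM comparison: build a dissimilarity matrix over the same faces from the network's late-layer vectors and from the human arrangement data, then correlate their off-diagonal entries. Both data matrices below are random placeholders; the distance metric and Spearman correlation are one common choice, not necessarily the exact ones used.

```python
# Sketch of the RDM comparison between a network layer and human dissimilarity judgments.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_faces = 80
net_embeddings = rng.normal(size=(n_faces, 4096))                 # one late-layer vector per face
human_rdm = squareform(rng.random(n_faces * (n_faces - 1) // 2))  # placeholder human dissimilarities

# Network RDM: pairwise dissimilarity between the embeddings of every pair of faces.
net_rdm = squareform(pdist(net_embeddings, metric="correlation"))

# Correlate the unique (upper-triangular) entries of the two RDMs.
iu = np.triu_indices(n_faces, k=1)
rho, _ = spearmanr(net_rdm[iu], human_rdm[iu])
print(f"RDM correlation (Spearman rho) = {rho:.2f}")
```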
OK. So where are we? We have shown that face-trained but not object-trained networks show all these classic signatures of the special face system, OK? OK, what about specialized neural machinery for faces? Yes, it is a puzzle. It is a real puzzle. It's a puzzle that we're spending a lot of time racking our brains about in my lab right now. It's not flat out inconsistent. It's just puzzling. It seems like those things should go together, but they don't.
Predictivity is a different kind of beast. Remember, you're basically just doing a regression on that pattern of activation in the layer, in effect. And so you can pick and choose your favorite parts of that network. And so it has more degrees of freedom, in a way, than just computing an RDM. I'm not saying that fully answers it. I'm just pointing out that they're really not the same thing. So it's not a data contradiction. It's just a puzzle why they don't all point in the same theoretical direction.
So I talked the whole first half of the lecture about all the evidence that there is face-specific neural machinery. What I didn't mention is that there's actually a double dissociation in the brain. There are brain regions that process something kind of like object shape that are not essential for face recognition. And actually, one of my favorite results in the entire neuropsychology literature is the work of Marlene Behrmann and Morris Moscovitch, who studied this patient who has an unknown lesion, because he didn't want to get scanned. So we don't know where the lesion is. So it's just abstract, but never mind, it's cool.
This guy was in a motorcycle accident, I believe. And he is severely agnosic. He can't recognize objects worth a damn. He can't read. He can't recognize places. He's severely impaired at high-level visual tasks. But he is 100% normal at face recognition. Chew on that. Amazing. A powerful double dissociation, in many ways more surprising than the existence of prosopagnosia, right? If you lose your face recognition ability, it's easy to say, oh yeah, you have an object recognition system, and faces are hard, so you have this extra system that sits on top. No, the existence of this patient argues that it's not just a special thing that sits on top of the object pathway, yeah?
Anyway, double dissociation in the brain, lots of evidence. And so the question now is, why do we have that? Why is that a sensible design strategy for brains? I've been ending every talk for 20 years saying, OK, we got all these special bits. Isn't that cool? Here they are. Why is it a good way to design a brain? I have no idea. But I now think we can make some headway looking at that question with deep nets, which I'm enormously excited about. So I'll tell you about that.
OK, so the hypothesis here is that it makes sense for the brain to have these separate systems because you just need different feature spaces for face recognition and object recognition, and it makes sense to process them separately. That's the hypothesis under consideration. And in fact, I've already shown you that a face-trained CNN does better at face recognition than an object-trained CNN. So that's part of it. But what about the opposite?
We can play the same game and ask these two networks to do an object recognition task. And when we do that, we find the flip side. So we've got a double dissociation with networks as well. Right? So what this says is that the optimized feature space for faces doesn't work well for object recognition, and vice versa, OK? And I like that, because it suggests a sensible reason why we might want separate systems in the brain. To me, this is suggestive that that's why it makes sense for a brain to have a special system for face recognition: they're just different feature spaces, a different kind of problem.
But it's not the strongest test yet. The strongest test would be, what if we tried training one network on both? That should be a problem, right? Or maybe it will discover a common feature space, and then this argument will be wrong, OK? So we try that. So now, we train one network on both face and object recognition, OK? And so first, I'm showing you the face-only-trained network, the object-only-trained network, and their performance on their own tasks. This is the same data I showed you before, as a benchmark.
Now the question is, how is this dual-trained network going to perform? We did the same thing, just concatenated all the categories and trained on all the same stuff, right? Is it going to be worse than the singly-trained networks? Or is it going to find a common feature space? Drum roll. It does the same. And I was really disappointed at first. Like, damn it. So much for this hypothesis. Looks like it's found a common feature space.
And then we thought, wait a minute. How is it doing that? Let's look inside. Maybe it's doing something different. Maybe it has discovered segregation and spontaneously segregated itself. And that would be even cooler. OK, so to find out, we did a lesion study. For each layer, you rank-order the units by how much performance drops on the face task if you drop that unit. OK, so now you're rank-ordering the units, OK? And then you do the same for the object task. And you do that at each layer.
And then you drop 30% of those units, and you measure performance on held-out stimuli on the same task, OK? So we're just doing a kind of lesion study, trying to find, are there face-important units, and are there object-important units? OK, and so what I'm showing you here is the functional segregation in the network as a function of layer, growing over the convolutional layers, with a measure which is the ratio of the drops in performance. This is for face performance.
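Here is a rough, hedged sketch of that lesioning procedure on a toy network. The architecture, data, and labels are placeholders; only the logic, ranking units by task-specific importance, ablating the top 30%, and comparing the damage to each task, reflects what is described here.

```python
# Sketch of the unit-lesioning analysis: zero out hidden units via a forward hook and
# compare how much face vs. object accuracy drops. Network and data are toy placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU(), nn.Linear(512, 20))
n_units = 512

def accuracy(model, images, labels, ablate=None):
    """Accuracy on held-out trials, with the hidden units flagged in `ablate` zeroed out."""
    handle = None
    if ablate is not None:
        mask = (~ablate).float()
        handle = model[2].register_forward_hook(lambda mod, inp, out: out * mask)
    with torch.no_grad():
        preds = model(images).argmax(dim=1)
    if handle is not None:
        handle.remove()
    return (preds == labels).float().mean().item()

# Placeholder held-out data for the two tasks (face identities vs. object categories).
face_x, face_y = torch.rand(200, 3, 32, 32), torch.randint(0, 20, (200,))
obj_x, obj_y = torch.rand(200, 3, 32, 32), torch.randint(0, 20, (200,))

# Rank units by how much face accuracy drops when each one is ablated alone.
base_face = accuracy(net, face_x, face_y)
single_unit = torch.eye(n_units, dtype=torch.bool)
face_drops = torch.tensor([base_face - accuracy(net, face_x, face_y, ablate=single_unit[u])
                           for u in range(n_units)])

# Ablate the top 30% "face units" and compare the damage to the two tasks.
face_units = torch.zeros(n_units, dtype=torch.bool)
face_units[face_drops.argsort(descending=True)[: int(0.3 * n_units)]] = True
print("face drop:", base_face - accuracy(net, face_x, face_y, ablate=face_units))
print("object drop:", accuracy(net, obj_x, obj_y) - accuracy(net, obj_x, obj_y, ablate=face_units))
```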
So this is showing us-- this is what the units are here. So there's a six-times greater drop in face performance on held-out faces when you drop those 30% of face units than there is a drop in object recognition performance, at the upper layers. And we find vice versa: you drop the object units, and you find a similar thing for object recognition performance. So we have a double dissociation in the network. And what this means is that the network has spontaneously segregated itself into these different pools, right?
It has kind of figured out that functional segregation is a good idea. And it's done it spontaneously, without us telling it to. And I think that's even cooler than finding that if we segregate it, it does better. Yeah. So basically, covertly, there's some version of this branching network in there, even though we didn't build it in, OK? Much like we see in the brain. OK, so let me make a few comments to summarize where we are here. So what I'm saying is that many of the classic signatures of the special face system emerge spontaneously in networks optimized for face recognition, including the segregation.
And this suggests a reason why humans have these properties, OK? Namely, our system does exactly what it should do if it's optimized for face recognition. I also think, by the way, that this has implications for development. On the one hand, I think we're on thinner ice using deep neural networks to model development, because backprop is nothing like brain development. But on the other hand, I think there are some in-principle arguments we can extract, OK?
So as you guys know, an absolutely fundamental and classic question in cognitive science, neuroscience, philosophy, you name it, is, what do you have to build into a mind and brain to get it to end up with the adult structure, presumably with the addition of learning? And in particular, do we need to build in domain-specific priors to construct domain-specific machinery, right? So if you're less familiar with the cognitive science gobbledygook, we're just asking, do you have to build in something innate about faces per se to get a face-specific system? Yeah?
And so I think that these findings suggest that, in principle, to develop an adult face system with all the signatures that people love so much in cognitive psychology and cognitive neuroscience, you don't need to build in anything face-specific. We didn't build anything face-specific into any of those networks. We just took a generic network with no face-specific priors and trained it on faces. And it did all that stuff. Yeah? So that doesn't mean that there's nothing innate about faces in humans. It just says that, in principle, you might not need it. Right? And that's a radical thing for me to realize. So I think that's cool.
OK. On the other hand, I hasten to add that there's lots of empirical evidence suggesting that something is built in. Loads of it. I don't have time to review it, but I'll just mention a few things. Neonates who are literally just a few minutes old, like five minutes old, will preferentially track faces compared to other visual patterns that are similar but not face-shaped.
Face-deprived monkeys: I talked about Livingstone's beautiful face-deprived monkeys, but there's a much older study from Sugita in 2008, who showed that if you rear monkeys without ever letting them see faces, the very first time you test them with habituation of looking time, their face discrimination performance is the same as normal adult monkeys'. OK? That's a problem for people who think that all this stuff is learned with no priors. I mean, like everything, you can wiggle out of it in various ways. But that one's hard to wiggle out of.
And third, I showed you our own data. And there's other people's data showing that congenitally blind subjects have face-selective responses in the right place in the brain. So I don't know what the answer is to all of this, but it's a fascinating ongoing puzzle with many strands of evidence in different directions on development. And I'll just say a few more things about caveats on this whole application of machine learning to understand the face system.
One, our claim is not that the face recognition system in humans arose from the same process as the face system in CNNs. It's a totally different developmental process. Rather, it just says that an optimized system will do this, right? Second, I'm very careful to say optimized, not optimal. We've only tested a few networks. There's a vast space, which many of you understand better than I do, of all the different kinds of networks and all the different kinds of loss functions, and many more yet to be invented.
And we've only tested a tiny little bit of that space. And so who knows? Maybe these results will be different for some other kind of architecture and training regimen yet to be discovered. And following from that, our results don't show the necessity of face training or any of these things. They just show that, for the space we tested, this is a sensible way for things to behave. We say nothing about whether the system in humans arose from training over development or training over evolution, probably both.
And I like all of this because, to me, we're addressing "why" questions in neuroscience. Why is the human face system like this? Because that's what an optimized system would do. But we haven't asked why an optimized system would have those particular features and properties, right? We call this the "why" of the "why". And that's a whole other enterprise; we haven't stuck our toes in that water yet. But I think there are a lot of ways one could do that with some of the same methods, trying different kinds of stimuli and different kinds of training regimes to see under what circumstances those things arise.
Nonetheless, I think this is an exciting development. And I'll put my slide away because it's overblown, but I'll say my overblown version of it anyway. I think, in my more overenthusiastic moments, that this is really bringing a lot of cognitive science and cognitive neuroscience into a kind of-- turning it into a deeply theoretical enterprise, where in many cases parts of it were not before. Much as Darwin transformed natural history with the theory of evolution, which now makes sense of all this stuff, I think we can look at all these behavioral and neural properties that we've just been documenting as facts and ask why it makes sense for minds and brains to behave that way and to have those properties. And I think that is awesome. And I'll stop here.
[APPLAUSE]