Faces
Date Posted:
August 22, 2022
Date Recorded:
August 14, 2022
CBMM Speaker(s):
Winrich Freiwald
Description:
Winrich Freiwald, The Rockefeller University
Part of the Brains, Minds and Machines Summer Course 2022
PRESENTER: Our speaker this afternoon is Dr. Winrich Freiwald. Winrich is a professor at the Rockefeller University. And his lab studies how the brain analyzes the visual information it receives from the face, and what computational principles it employs. Please welcome Winrich to our summer school. Thank you, Winrich.
WINRICH FREIWALD: Thank you very much for having me. And please excuse me for not being there with you in person. The subject of today's talk, which is about faces-- so I kept the picture of Salman Rushdie on here on purpose for several minutes. And I expect that many of you will have watched it intently.
You will have explored it. And you will have probably thought a bit about this person, what you know about him, and what kinds of inferences you could make from a picture like this. And this is part of the reason why my lab is so interested in faces.
The more general question that we like to tackle is the question of intelligence. What is it about our brains that makes us intelligent? Or, put slightly differently, what is it about our brains that makes us think the way we do? What I mean by this, and what is special about us humans as a primate species, is really this confluence of sociality and intelligence.
You can be a very social animal but not really understand anything that you're doing. You can be a very intelligent animal and not be very social. But really what makes humans special, and as I will show to you in a second, this is really our primate heritage, is this intersection of sociality and intelligence. So what is it about our brains that makes it possible for us to do a deep analysis of a face and make inferences, maybe even about what this person is thinking while the picture is being taken?
So the sociality, of course, we share with all mammals. But there's really something special about primate sociality. And as I mentioned, this is a level of understanding of the social environment that we're in and the way of interacting with that.
So I really like this picture. It shows literally how our social world is all around us, how we're immersed in it. And it is thought that this is part of the reason why we are intelligent in the first place. This part of the video shows the alternative hypothesis. Someone just invented a very useful tool, the nose pick.
And the alternative hypothesis is that it's the structure of the hands that allows primates to manipulate their environments in ways that other animals cannot, and that this ability to physically interact with the world has been a driving force in primate intelligence-- with primate intelligence, of course, then making it easier for us to physically interact with the world. Whatever the reason, primates really have amazing social cognitive abilities. And one example is the story of Ahla.
Ahla was a female baboon in southwest Africa. When this paper was published, in 1961, there was still a tradition in this region for farmers to use baboons to herd their goats instead of dogs. And so Ahla was one of these monkeys that lived with a farmer to herd goats. Over time, Ahla adopted some behaviors that she would not have shown if she had lived in the wild.
You can see her here with the goats licking salt, which is something that obviously she would not do if she had not been immersed in goats and emulating their behavior. But she would also continue to engage in behaviors that are very primate specific. So you can see her here. And again, the quality of these pictures is not really great.
But you might be able to see her here grooming the goats. Grooming is a very typical behavior for primates. Again, it requires a fine structure of the hand and the ability for fine motor control of the fingers that, obviously, goats don't have. So she was persistent in some behaviors that seemed to be innate or learned early in life. And one behavior is particularly remarkable.
What the farmers would do is to separate the adult animals, the adult goats, from their kids and keep them separate when they came home after grazing during the day. And Ahla would not have that. Ahla would go out of her way, pick up each kid, and pair it with its mother.
You can see this here in this one picture, her carrying a little kid. And she would even be punished for doing this. But she would just not stop. She was relentless, making sure that every kid was paired with its mother. So what this tells you is, there's really something ingrained in us being forced to execute certain behavioral programs that are just part of who we are.
But it also means that she had an understanding of a social world that is really quite amazing. Because she knew something that the farmers did not know. The farmers would not have been able to pair the right kid with the right mother. But Ahla would know this. And so she had an understanding of a social environment that was deep and that was structured.
So in books like this one, Baboon Metaphysics, by Dorothy Cheney and Robert Seyfarth-- which I strongly recommend to anyone interested in cognition, and in particular, primate cognition-- stories like this are used to illustrate what primates can do and what they cannot do: their astonishing limitations in how they understand the world, but also the astonishing abilities that they have to understand it. Primate ethologists like Cheney and Seyfarth have come up with the proposal that primate social knowledge is structured at three different levels.
At the first level is the level of the individual, in terms of properties of the individual-- whether it's a juvenile or an adult, a female or a male. There's a second level of processing, that of the interactions taking place between these individuals-- grooming, or mothering, or fighting, or other kinds of behaviors you might think of. Typically, these behaviors are pairwise interactions.
And then based on these two levels of understanding, the level of the individual and the level of interactions between them, primates build these high level structures of social knowledge that regard the relationships between individuals along domains like friendship, kinship, or hierarchy. And what's amazing about these data structures-- and maybe the main reason why I would really like to figure out how the brain is implementing them-- is that they're not purely associative.
So friendship might be symmetric: if A is the friend of B, B is also the friend of A. But the relationship of kinship is not: if A is the mother of B, B is not the mother of A-- B is the daughter or the son of A. And those are very complex data structures that are formed.
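To make that structure concrete, here is a minimal sketch in Python of the kind of relational knowledge being described, with hypothetical individuals A and B. The point is only that friendship can be stored as a symmetric relation while kinship cannot: the inverse of "mother of" is a different relation. The data structure itself is illustrative, not a model of the brain.

```python
# A toy sketch of non-associative social knowledge: friendship is
# symmetric, kinship is directed with a distinct inverse.
from dataclasses import dataclass, field

@dataclass
class SocialKnowledge:
    friends: set = field(default_factory=set)       # unordered pairs
    mother_of: dict = field(default_factory=dict)   # child -> mother

    def add_friendship(self, a, b):
        self.friends.add(frozenset((a, b)))         # A friend of B <=> B friend of A

    def add_mother(self, mother, child):
        self.mother_of[child] = mother              # directed: not symmetric

    def is_friend(self, a, b):
        return frozenset((a, b)) in self.friends

k = SocialKnowledge()
k.add_friendship("A", "B")
k.add_mother("A", "B")            # A is the mother of B
print(k.is_friend("B", "A"))      # True: friendship is symmetric
print(k.mother_of.get("A"))       # None: B is not the mother of A
```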
And I think we have at least very good guesses about where in the brain those data structures are established. All of this is rooted in the concept of the individual. Almost all of our social interactions are with people that we know very well. They are highly familiar. And we form this concept of a particular individual that all of this knowledge is ultimately rooted in.
A lot of the information that we can gather really comes from the face. And that is a major reason why we are so interested in faces. If you look at the face, and in particular, if you look at a face over long periods of time like you did in the very beginning, you will gather a lot of information.
Some of it-- actually, all of the information that I'm listing here, you will get within a fraction of a second. You will make inferences about identity, gender, age, race, species, similarity, attractiveness, trustworthiness. And I should emphasize this is not actual trustworthiness.
But you will get, again, in a fraction of a second, you're going to make an estimate about how trustworthy another person is and about the mood or attention state of the other person. So that's a lot of information that we infer from the face. And that's a very rich territory for us to explore and try to understand how the brain makes that possible.
And at first sight, you might think this is going to be extremely difficult, and maybe even close to impossible, to figure out. And honestly, 30 years ago, that's what I would have thought. We also have to consider that not everyone is very good at recognizing faces.
I like this picture from the fifth season of Curb Your Enthusiasm, which, of course, shows the very same face photoshopped onto different bodies. But it's also an illustration of how the social world might look to people who suffer from face blindness, or prosopagnosia, a condition that's really not rare-- it's estimated to affect maybe 1% or 2% of the general population. If you cannot really tell one person from another by the face, it's going to be very difficult for you to navigate your social environment.
And that can make it very, very difficult for you to have successful social interactions, and it also affects how you are perceived by your peers. Because they might think that you are arrogant for not greeting them when it was really just your inability to recognize them. So this condition is also a reminder of the importance of face recognition. And I would briefly like to do a little psychophysical experiment with you.
If you look at this set of faces here, how many different individuals do you think are in here? And I'm going to give you a couple of seconds to parse through this image. And really think about how many different individuals these 40 different pictures show.
So I will not be able to see your hands very well. Otherwise, I would ask for a show of hands. So you can do it internally. How many of you thought that all these pictures were from the same individual? How many thought they were from two individuals, from three, from four, from five, from six or more?
And the correct answer is two. I'm outlining the pictures of one individual here with these green outlines; the other ones are not highlighted. These are actually two of the authors of the paper in which this was published, and they found that, on average, people who had about 10 seconds to look at this array of pictures say that there are about seven or eight different faces in here.
So our face recognition abilities are far from perfect, something that the judicial system suffers from-- long overlooked-- but in 75% of cases of wrongful conviction, an eyewitness misidentification was at least a contributing factor. And looking at these images, you might already get at a glance what the challenges in face recognition are. These pictures look different. Why? There are very slight variations here in age.
But age is, of course, a major factor in general. All these pictures are taken from the front, and for an eyewitness, this is not the case. There are changes in lighting conditions. All of those are difficult factors to account for-- changes in facial expressions, maybe changes in hair style, maybe even changes in size.
But even so, some of these changes are there. And that makes it difficult to decide if those pictures are pictures of the same person. So what is the computation that any system, be it a computer or a brain, has to do in order to succeed at face recognition? We can think of it as occurring at different levels, the first level being face detection.
This is a scene from The Godfather. And here you will see great variation in basically every aspect of faces, and very difficult lighting conditions. But probably none of you would have difficulty detecting faces-- knowing that there are faces there, knowing that there are three faces, and where they are.
So this level we call face detection. So then once you've detected faces, then you can do more. So again, using examples from The Godfather, we have six faces here. And if you are given enough time, you might figure out that these six faces belong to three different individuals.
And you're actually perceptually grouping different pictures of the same individual together, even though pictures of faces from the same head orientation, like these two front faces, are physically much more similar to each other than faces of the same individual taken under different viewing conditions. Perceptually, what we're doing is something like this: grouping the different pictures, or different views, of the same identity closer together. And this kind of process is called face discrimination.
So how does the brain do it? We got a clue from studies that the late Charles Gross started in the 1960s and 1970s. Actually, some of them were conducted at MIT, when he was showing pictures of monkey faces and recording in a part of the brain known as the inferior temporal cortex. And he found cells that responded selectively to faces.
This is a side view of the macaque monkey brain with the superior temporal sulcus opened up. And all these symbols indicate locations where, subsequently, face cells were found. Now, Bob Desimone-- if you can get hold of him, have him tell you the story of the discovery of face cells. Because for a very long period of time, people were incredibly skeptical and incredulous that this finding was real.
Part of the difficulty was in documenting findings like the sample trace that I'm showing here-- a face cell and its action potentials. That's from a paper from the 1980s. But the actual discovery was made far earlier, at a time when papers were written with words describing what was done, with very few graphical elements providing direct evidence.
So people might have been forgiven for thinking that this is something very unlikely to really exist in the brain: a cell that is doing something very semantic, responding when there's a face and not responding when there's not a face. Also, the cell did not respond when the face was scrambled, with the parts randomly rearranged. And so the cell seemed to be doing something pretty amazing.
And actually, on a side note, it was doing this in an anesthetized state. So this really felt like something that must have been the result of a mistake. But in the following decades, people found face cells, as indicated here by these red symbols, all across the inferior temporal cortex.
And then in the 1990s, pretty much with the advent of functional magnetic resonance imaging, Nancy Kanwisher found a region in the brain-- and again, subsequently, multiple regions were found-- that was responding selectively more to faces than to non-face objects: the fusiform face area. And I think she talked with you about this two days ago or so, or maybe yesterday.
So fMRI is a very indirect way to measure neural activity. And therefore, people also didn't quite know what to make of this finding. After all, we're showing here thresholded significance maps. And so what does it mean for any given region?
How strongly is it really specialized for faces? And how it is processing faces, we don't know. So this is why Doris Tsao decided, some 20 years ago, to bring together these two approaches in the macaque monkey: first, to ask with fMRI whether there are face-selective regions in the macaque monkey, and then, if there are, to target those for single-cell recordings to figure out how the cells are processing face information.
So fMRI is really a critical element in my research program to this day. Because I believe that we cannot otherwise handle the complexity of these brains, which is really immense-- and of which, by doing these kinds of experiments, we get an increasing appreciation: if you just move a few millimeters in this brain, just 3 millimeters over, you find a completely different functional specialization.
So with fMRI, contrasting activations in the brain to faces, shown here in yellow and red, with activations to non-face objects, we found in the macaque monkey that there are six regions that are selectively responding to faces, in a sea of blue-- that is, regions that are responding more to non-face objects. Those six areas we found in both hemispheres, so there's a total of twelve.
So this was really great. Because it meant that we now had the ability to really analyze what is going on, how strongly these regions are selective for faces, and really learn something about how they're doing it. So how would we do it? We would use Charles Gross's technique and place a recording electrode into one of these face areas.
And I should say that their anatomical locations are so reproducible that we can find the very same area in different animals. That is also a great advantage. Many of my colleagues here at Rockefeller work with invertebrates, and they can find the very same individual neurons in different animals. And that's a big help for really getting a mechanistic understanding.
We don't have that yet. And it's unlikely that primate brains are organized in this way, with individual neurons being the same from brain to brain. But these areas are consistent across different individuals. So once we stuck our recording electrode in and presented a set of stimuli, including faces and non-face objects, we could listen to the responses.
For those of you not doing these experiments, I hope that this video is going to give you a little bit of the experience and the joy that you might have-- and that I certainly have-- when listening to recordings from individual cells, this one from the face area indicated here. You're going to hear clicks whenever the cell is responding. You're going to see the images that the animal saw at the same time. Those images are from the control monitor, so don't pay too much attention to detail.
You will see a black square moving around. That's the fixation location, shown on the control monitor at the same time; the animal did not see it. Focus on the clicks and the images in the stream.
[VIDEO PLAYBACK]
[CLICKING]
[END PLAYBACK]
WINRICH FREIWALD: So I hope that you could all appreciate that every time there was a face shown, there was a response. You might also have noticed that sometimes when there was a non-face stimulus shown, there was also a response. And this was very typical for the population as a whole, which I'm showing here as this population response matrix, where we're sorting cells from top to bottom and stimuli from left to right, 16 faces shown on the left-hand side, then followed by bodies, by fruits and vegetables, by technological objects, hands, and scrambles.
So for each of these categories, we had 16 examples. And we are plotting here the response magnitudes in a color-coded fashion, where red is indicating response enhancement and blue is indicating response suppression. And you can see at a glance that on the left-hand side, during the presentation of the 16 faces, you get significantly elevated responses in about 80-plus percent of the cells.
And then there's a smaller population, about 10% of the cells, where the response is systematically suppressed. And then there's a very small population, maybe 5% or so of the cells, for which it's unclear what they are responding selectively to. The population as a whole-- shown here is the average response of the population-- is really strongly favoring faces.
But again, as you notice, there are some weaker responses to non-face objects. And they occur for interesting stimuli like clock faces, and apples, and pears that share several physical properties with faces. They are roundish. They have intrinsic symmetry. And they have intrinsic features in them.
So those are eliciting intermediate responses. And this might already give you an idea that yes, this is a purely visual-- well, purely, you don't know. But this is a visual area. It's performing some kind of analysis on incoming information, giving bigger responses to faces than to everything else, but being partly fooled by stimuli that are not faces.
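To make the selectivity claim concrete, here is a minimal sketch, with synthetic data, of how a face-selectivity index can be computed from a cells-by-stimuli response matrix like the one just described. Only the index formula is standard in this literature; the numbers and the 2:1 criterion used as a cutoff are illustrative.

```python
# Toy face-selectivity analysis: 96 stimuli = 16 faces + 80 non-face
# stimuli, as in the screening set described above. Data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n_cells = 100
responses = rng.poisson(lam=5.0, size=(n_cells, 96)).astype(float)
responses[:80, :16] += 20.0   # make ~80% of the toy cells face-preferring

r_face = responses[:, :16].mean(axis=1)      # mean response to the 16 faces
r_nonface = responses[:, 16:].mean(axis=1)   # mean response to everything else

# Face-selectivity index: FSI = (F - N) / (F + N); FSI > 1/3 corresponds
# to a 2:1 response ratio in favor of faces.
fsi = (r_face - r_nonface) / (r_face + r_nonface)
print(f"{(fsi > 1/3).mean():.0%} of cells exceed the 2:1 criterion")
```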
So putting all of this together-- the fact that we find a reproducible set of face areas in all individuals, that these areas are packed with face-selective neurons (almost all the cells in there are face selective), and then a third piece of evidence that I'm going to gloss over here for lack of time, the fact that we can causally manipulate face perception abilities by interfering with processing in even just one of these areas-- those pieces of evidence together lead us to suggest that these face areas-- the middle face patches, as we call them; there are two of them, early processing stages in this whole system--
are face processing modules. The term "module" is highly loaded. What we mean by it is that there is a part of the brain that is there for one purpose and one purpose only. And in this case, it is face processing.
Mind you, we have not shown, and not proven, and it's not really possible to prove that it's only face processing. But we really have tried. And we have not found anything different.
The other conjecture that we made was that maybe this area is doing face detection by shape analysis. And this is really based on the observation that you made with the one example cell that I showed to you: these cells are partly responding to stimuli that are not faces but share certain visual properties with faces. That's also why, we think, we are forced to see faces in stimuli that we know are not faces, like these two peppers here that I really like.
Because they've just been sliced. And so there's a lot of reason why they might be angry, and losing their teeth, and having all of these nice properties that you ascribe to them. You know they're peppers.
But you have to see faces in them. That, we believe, is the system in overdrive, trying to grab whatever information it can get to make sure you're going to detect a face when it's there, erring on the side of a possible false detection just to make sure that it's not missing a face when one is there. On a practical note, this was really transformative for our understanding of how faces are processed.
Because it meant that for the first time, we had unprecedented access: fMRI localization of the face areas, and the ability to target these face areas day after day at the same location-- a functionally homogeneous population, homogeneous because virtually all these cells are selective for faces, coding for one high-level object category, faces. That is very, very difficult to come by anywhere else in the system. And it's certainly impossible if you don't do fMRI or an analogous technique before targeting the region.
This means you can now tailor stimuli to ask very specific, deeper questions about how faces are processed. And you can use that as an example for how objects are processed in general. And so this gets us closer to mechanisms of face recognition. And I should say that by now, we are seeing these mechanisms at the level of single neurons in populations, which I'm going to show to you.
But we're also making efforts to measure very large populations with two-photon imaging, to identify different cell types, and to really get deeper into the analysis of the circuits that generate these responses. But the first level of mechanistic understanding is really understanding what single neurons are doing and then how the properties of single neurons combine in population responses. So I showed you-- and this is true not just for the middle face patches but for all face areas that we recorded from-- that they're packed with face-selective cells.
So the question that we could then ask-- or several questions we can ask. But the first one that we asked is a fundamental one in face recognition. It's the relationship of the part and the whole.
So if you look at a highly blurred image of a familiar person-- using an example here from a paper that Pawan Sinha published many years ago. This is Woody Allen. You can recognize a face that is highly blurred.
And just from the gist of this face without access to any detailed local feature information, you can recognize the person. So there's something seemingly special about the whole or the gist of the face. But yet, we can also process individual facial features.
And by the way, that's a lot of what face-blind people have to do in order to discriminate between different individuals-- pay a lot of attention to individual features. And those are clearly represented as well. So how do these two aspects fit together? The way we addressed this was by making cartoon stimuli-- shown at the bottom, derived from example faces shown at the top-- that really consisted of just very, very simple geometric shapes: lines, triangles, ellipses, that's it.
Based on these different properties, we created a face space that consists of 19 different dimensions-- 19 meaningful, different facial features-- each of which we could manipulate across 11 different values, seven of which are shown here. So one example is face aspect ratio. We have one extreme on the left that would look like the Sesame Street character Ernie and one on the right-hand side that's like Bert.
Note that on purpose, we used physical features here that go way beyond what a monkey or a human would ever see in real life for a real face. So we're trying to create stimuli that are extreme and that are far outside but still sort of structurally valid faces in the sense that there are two eyes above a nose above a mouth. Pupil size, another parameter we can vary from no pupils to very big pupils.
And then there's a relational parameter, intereye distance, going, on the left, to an almost [INAUDIBLE] arrangement where the eyes are touching each other, and at the opposite extreme to eyes that are straddling the outside of the face. And we were able to vary these 19 parameters dynamically: every 133 milliseconds, we updated all 19 values. And this is what the image looked like-- almost like a cartoon character that's trying to talk to you.
But really, all that's happening is randomly assigning one of 11 values to each of these 19 dimensions in every frame that we are showing. What this allows us to do is to compute tuning curves for each of these 19 dimensions, asking whether the firing of the cell was modulated as we varied one dimension independently of all of the variation that occurred along the 18 other dimensions and then repeating this process for all of these different 19 dimensions.
This means that for every cell that we analyze in this way-- recording action potentials and doing a reverse correlation analysis-- we get 19 different tuning curves. And in this example cell, four of those are significantly modulated. And that was true for about 2/3 of the cells.
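The logic of that analysis can be sketched in a few lines with synthetic data: every frame randomly assigns one of 11 values to each of 19 dimensions, and one tuning curve per dimension is obtained by averaging the spike count at each value, marginalizing over the other 18 dimensions. This is only an illustration of the method, not the lab's analysis code.

```python
# Reverse correlation on a simulated cell with ramp tuning on two of
# the 19 cartoon dimensions. All numbers are made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_dims, n_values = 20000, 19, 11

stim = rng.integers(0, n_values, size=(n_frames, n_dims))  # value index per frame

# Toy cell: ramps up along dimension 2, down along dimension 7, Poisson noise.
rate = 2.0 + 0.5 * stim[:, 2] + 0.3 * (n_values - 1 - stim[:, 7])
spikes = rng.poisson(rate)

# One tuning curve per dimension: mean spike count at each feature value.
tuning = np.array([
    [spikes[stim[:, d] == v].mean() for v in range(n_values)]
    for d in range(n_dims)
])
print(tuning[2].round(2))   # rises monotonically: a ramp
print(tuning[7].round(2))   # falls monotonically: the opposite ramp
```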
The cells really liked the cartoon stimuli; they responded to them very well. And 2/3 of them showed the same kind of property that this example cell is showing, and that is ramp-shaped tuning curves. That is, the cells showed a minimum response at one extreme and the maximum response at the opposite extreme.
So that was very surprising to us. We did not expect that the maximum responses would occur for stimuli that, again, are outside the normal range of faces that you would see. But it has an interesting property. It is almost as if a cell would take a ruler and measure a small set of different facial parameters-- this cell measuring four parameters-- and then relay them in an almost 1-to-1 fashion to downstream neurons for analysis.
That must be very convenient for analysis. And in a way, it's also very simple. So just very briefly in case you are concerned about whether this has any implication for processing of real faces, we also looked at that. And when you measure the same properties that we varied in the cartoon faces in real faces, you first of all find this.
Yes, indeed, the physical range of natural faces is much, much smaller than the one that we covered here. Yet in this range, the cells that are tuned in one particular way-- I'm showing you cells that prefer Ernie over Bert-- show the same kind of tuning for human faces as well, preferring flatter faces to narrower faces. And that is true for all the parameters that we looked at. So why might they be doing this?
So you could think, prima facie, about two different ways that faces could be coded. One is, you would have sort of an exemplar-style code, where the cells are narrowly tuned. And for every face that you're showing-- even morphing one face into another-- you would have a cell that peaks its response for one particular face and shows smaller responses to others.
This would be an exemplar-based model of face processing. Or you could have an axis model, in which you have broad tuning curves that straddle the whole face space, with maximum responses at one end and minimum responses at the other end. And if you have a diversity of cells, which we do, you could span the entire face space this way. That is very compatible with what we've found.
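The two alternatives can be stated in one line each. Below is a toy contrast, on a single assumed dimension of face space, between a narrowly tuned exemplar cell and a ramp-tuned axis cell; the functions and parameters are illustrative only.

```python
# Exemplar coding vs. axis coding along one face-space dimension.
import numpy as np

face_axis = np.linspace(-1, 1, 201)   # one latent dimension of face space

def exemplar_cell(x, preferred=0.3, width=0.2):
    # Narrow Gaussian tuning: peaks at one particular face.
    return np.exp(-0.5 * ((x - preferred) / width) ** 2)

def axis_cell(x, slope=1.0):
    # Ramp tuning: minimum at one extreme, maximum at the other, which
    # is the shape the recordings described above actually show.
    return 0.5 * (1.0 + slope * x)

r_exemplar = exemplar_cell(face_axis)
r_axis = axis_cell(face_axis)

# An axis cell keeps increasing past the natural range of faces, which
# fits the observed preference for extreme, caricature-like features.
print(r_axis.argmax() == len(face_axis) - 1)   # True: maximum at the extreme
print(face_axis[r_exemplar.argmax()])          # peak at the preferred face
```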
So this idea of face space is really a very nice way to simplify your thinking about faces. Think about the cartoon faces, with their 19 different parameters. In the cartoon faces, we could specify exactly what those parameters were. But you can infer these parameters-- or, if you will, latent variables-- from an image of any face.
And with just, I don't know, 30 different parameters or so, you might come up with a pretty good description of the face. And so what people have shown is that faces are then arranged perceptually in a space in which there's a neighborhood relationship so that people perceive faces with similar physical properties similarly-- this would not surprise you-- but that there's also global structure to the space that is, in some ways, similar to color space.
Yes, it's more high dimensional than color space. But it has properties that it shares with color. So if you look at one particular face, you can construct an anti-face to it, a face that has, compared to the average, the opposite set of latent variables.
So compared to the face of Jim here on the upper right-hand side, you can construct the opposite face. You know Jim has thin lips; the opposite face has thick lips. A wide jaw becomes a narrow jaw, and so on and so forth. You go through all of it.
It's possible to construct these faces. And they're all pretty realistic. So it's like having colors on a wheel: for every color that you have, you can find an opposite one. This analogy goes so far that if you are looking at a face for a longer period of time, you're adapting to it.
And now faces on the opposite side of the average actually look more distinct from the average than they looked before adaptation, just as with color adaptation: if you adapt to red and then look at the average color in the center of the color space, white, things will look slightly turquoise. And this does not happen for equidistant faces somewhere else in face space. So there's something about the organization that is long range. And there's something about the organization that's special around the average face.
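The anti-face construction itself is just a reflection through the average face, as in this minimal sketch; the three latent dimensions and their values are hypothetical.

```python
# Anti-face = reflection of a face's latent variables through the mean.
import numpy as np

# Hypothetical latents: [lip thickness, jaw width, intereye distance]
average_face = np.array([0.5, 0.5, 0.5])
jim = np.array([0.2, 0.8, 0.6])       # thin lips, wide jaw, ...

anti_jim = 2 * average_face - jim     # reflect through the average face
print(anti_jim)                       # [0.8 0.2 0.4]: thick lips, narrow jaw, ...
```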
So this broad tuning, this axis code that we found, really explains rather naturally how you could get this large-scale structure of face space, with long-range interactions and with adaptation happening the way it does. OK, so this is all about features. It doesn't tell us anything at all about global properties of the face.
So is there something about the facial whole? We addressed this in different ways, and this is one way that I want to show to you, again by example, with a sample cell. Here, we had follow-up experiments after we determined the dimensions that a particular cell was tuned to.
So here's an example cell that was tuned to the eyes-- in particular, small eyes, not big ones. And then we tested the response of the cell to the eye regions in different contexts, one in the context of a face, and in the other case, without the face being present. And please listen to the response of the cells. And you will get some idea of whether the context of the embedding into the whole matters or not.
[VIDEO PLAYBACK]
[CLICKING]
[END PLAYBACK]
WINRICH FREIWALD: So you hear some response modulation. It's impossible to tell by ear whether it's stronger for small eyes or big ones. Now we take away the facial context.
[VIDEO PLAYBACK]
[CLICKING]
[END PLAYBACK]
WINRICH FREIWALD: And the response goes away almost completely. I'm going to put it back on in a second.
[VIDEO PLAYBACK]
[CLICKING]
[END PLAYBACK]
WINRICH FREIWALD: And again, you're going to hear the sounds being modulated.
[VIDEO PLAYBACK]
[CLICKING]
[END PLAYBACK]
WINRICH FREIWALD: So this was typical for the whole population. This one was extreme in that there was hardly any response at all when there was no facial context. For most of the cells, we got a tuning curve. I'm indicating it as a change in firing rate from top to bottom. Time runs from left to right.
So this cell had the maximum response to the small eyes and a minimal response to big eyes. And then if you take the eyes out of context, this modulation is practically gone. On average, what happened was a modulation of the gain of the tuning curve, such that you would have weaker tuning, but of a similar shape, outside of the context of the face.
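That gain account has a simple form: the same tuning curve, multiplicatively scaled by context. A toy version, with made-up numbers:

```python
# Gain modulation: face context scales the gain of the feature tuning
# curve without changing its shape. All values are illustrative.
import numpy as np

eye_size = np.linspace(0, 1, 11)          # small ... large eyes
base_tuning = 10 * (1 - eye_size) + 2     # this toy cell prefers small eyes

gain_in_face, gain_isolated = 1.0, 0.15   # context changes gain, not shape
r_in_face = gain_in_face * base_tuning
r_isolated = gain_isolated * base_tuning

print(np.corrcoef(r_in_face, r_isolated)[0, 1])   # 1.0: identical shape
print(np.ptp(r_in_face), np.ptp(r_isolated))      # 10.0 vs 1.5: weaker modulation
```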
So the face is modulating the response to the feature by changing the gain-- increasing the gain. And this is one way that we think features and this coding of the whole can go hand in hand. So in the middle face patches that I've been talking about so far, we already find implemented certain central capabilities that any face recognition system should have. There are mechanisms of face detection.
I didn't go into the mechanisms; these are other experiments, done in the same way, in which we figured out how the cells are able to detect faces and discriminate them from non-face objects. There's encoding of facial features, which I did show to you. And there is also encoding of configurations.
The cells also exhibit certain characteristics of human face perception. So I talked about opponent coding, or coding of faces in face space; we can already see this happening here. There's a caricature effect: sometimes you can recognize a face better in an artificially extreme version.
Think of the cells that are responding maximally or minimally to extremes that are outside the physical range of faces. They would naturally explain why you're better at recognizing a caricature of a face than the original. There's a face inversion effect I didn't show you: we turned the cartoon upside down and then varied the features.
Some cells were confused-- they were responding to the mouth before and now responded to the eyes, as if they were looking for a particular region of the face and then trying to analyze the feature within it-- and others just showed weaker tuning. So this explains why we are better at recognizing faces when they're upright than when they're inverted. And then lastly, the part-whole effect that I talked about: a very peculiar psychophysical effect is that you're actually better at processing facial features in the context of a whole face.
You might expect the opposite-- that it's better to analyze something locally, without the distraction of the rest of the face around it. But psychophysically, you're better with the whole face. And also, the cells are more strongly tuned to features when those features are at the right position in the context of an upright face.
So what about the other face patches? And what about the global organization? I showed you that there are other face areas at reproducible locations. I showed you-- or I told you. I didn't show you.
I told you that they are also packed with face-selective cells. But they have very different properties. And so I showed you the video from the middle face patches. If you go to another face area, area AL, which is down here, and you now record from a cell, you will find this.
[VIDEO PLAYBACK]
[CLICKING]
[END PLAYBACK]
WINRICH FREIWALD: You'll find a cell that is very strongly tuned to head orientation--
[VIDEO PLAYBACK]
[CLICKING]
[END PLAYBACK]
WINRICH FREIWALD: --but in a funny way, in a way that confuses the left and the right profile. So this cell here, you can recognize, is responding both to left and to right profiles-- it almost doesn't matter which individual is shown. If it's a profile, the cell is going to respond.
AUDIENCE: Professor Freiwald?
WINRICH FREIWALD: Yeah?
AUDIENCE: We cannot hear anything over here.
WINRICH FREIWALD: Oh. Can you hear me? I mean, I've been talking for quite a while now.
AUDIENCE: Yes. Yes, we can hear you. But we can't hear any of the audio. We haven't been able to hear any of the spikes.
WINRICH FREIWALD: Oh, I am so sorry. We tested this before. And it seemed to have worked. So I'm sorry for that. I can hear it. So I don't know if there's anything that I can do to make it better for you-- I can crank up the volume somewhere. But I don't think that it's that. Let me try it once more.
[VIDEO PLAYBACK]
[CLICKING]
[END PLAYBACK]
WINRICH FREIWALD: You don't hear anything?
[VIDEO PLAYBACK]
[CLICKING]
[END PLAYBACK]
WINRICH FREIWALD: OK. Then you'll have to take my word for it. By the way, how are we doing on time? How much more time do I have?
I don't want to-- I want to enable questions. And I feel I've been talking slowly even though that's not my habit. Hello? Can you hear me? No? Now I'm gone completely?
AUDIENCE: Yes. Yes, we can still hear you. I think until 3 o'clock or 2:30. 2:30 for the talk.
WINRICH FREIWALD: And then we still have time for questions? Great. OK, good.
So here's what you would hear. OK, the first video that I showed to you was supposed to show you the cells responding to faces but not to other things, except for things that shared some physical properties with faces. Sorry you couldn't hear that.
What I didn't show you is that in the middle face patches, which are down here, and from which I showed you this example recording, the cells are very, very strongly tuned to head orientation, such that one cell might only like the left profile and not the right profile. And by the way, the cells in our 96-image screening set that we initially thought might not be very face selective, they're all very face selective. They just like a particular profile, which we didn't show.
At this level of processing, area AL, you also find cells that are very strongly tuned to head orientation. But now they confuse left and right profiles. And at this level of processing, area AM, you find cells that are rather weakly tuned to head orientation but can be very selective for the physical properties of the faces, such that the cells might respond only to some of the faces-- or in some cases, even just one face out of the range of faces that we are showing. We also showed that these face areas are selectively coupled to each other.
So the picture that we have of the system is one of a network of areas that are each specialized in processing faces in different ways. That is why, in this cartoon image here, I'm giving the different face areas different colors. Because they are directly connected to each other, and because each of them contains a different representation of faces, we can think of the transformations in this network as computations that are converting one format of face representation into another.
And the early format is very useful for face detection. Because the cells are responding to a broad range of faces not indiscriminately, but almost indiscriminately. And then if you go after two major transformations to this level of face area, AM, you find a representation that's very, very useful for face discrimination. Because it's abstracted, to some extent, from head orientation and has become individual selective.
So that is a descriptive way of talking about what the system is doing. It is great because it really allows us now to ask questions about why the system might be organized the way it is. I'm actually going to jump over this part. And I have to apologize for it.
Yes, if I have 15 minutes, I should jump over this part. So let me just tell you that there is a third level of face processing that is really, really important. And that was actually the point of the psychophysical study that I was referring to in the beginning. For people who know the individuals, this task is very, very simple.
So I happen to know the two authors here. And so for me, it's immediately clear that there are two people in this image. And this is just a reminder that familiar face recognition, or person recognition, is really different from generic face processing.
And oftentimes, this distinction is not made. And that is one of the difficulties, also, in eyewitness identification: we think we are really good at identifying faces. But the limitations show up very easily when we have to process faces of individuals that we don't know. So what is the link here?
And this is really the work of an amazing grad student, Sofia Landi. Sofia discovered two more face areas, one in perirhinal cortex and one in the temporal pole, a region that is heavily understudied. These areas showed up for a contrast of familiar faces versus familiar objects, showed a psychophysical characteristic of familiar face recognition that I have to gloss over now, and also showed very nice properties when she was recording from inside this region.
She found cells that were very highly selective for just particular faces-- not for the person as a whole. So these cells were not responding to voices. They were also not responding to bodies-- slightly modulated by body context, but really very, very face selective. And they were face selective just for the familiar faces that the particular subject knew, being highly selective almost to the point of mimicking properties of a grandmother neuron.
So those are the few main steps of face processing that we might want to think about. So what we want to do, we want to understand some of the computational principles that the system is implementing. And what do I mean by that?
If we look at these properties here, what is really great about them is that they're qualitatively different. You go to one face area and record from it. And just by looking at these tuning properties-- record a few cells, three cells-- you know which area you're in.
When anyone in my lab shows me some data, I can tell which area it's from just by these properties. So that's great. Think about the importance of orientation tuning for figuring out information processing in the visual system. Why was it so important? Because it was a qualitatively new property that the inputs did not have.
So here we want to ask now, just as has been done for orientation tuning, what are the mechanisms? But also, why is the system showing these properties? And what I mean by this is, is it a consequence of the properties of the stimulus?
So is it just because you're looking at a stimulus that is a face that some of the tuning properties will emerge in basically every network that you're going to look at? Or is it a consequence of the way the network is wired? Or is it a consequence of how the network learned to acquire the selectivity for this particular stimulus? So those are questions that we would like to answer in order to say that we can really build a system artificially and therefore, in Feynman's terms, we are able to really understand it.
OK, so we can use the system here. We have these transformations isolated. And that is something that was not possible when people thought that face cells are distributed all across the inferior temporal cortex. Yes, there were some ideas out there. But they were impossible to test.
So for example, before, we didn't even know how many transformations there were. Now it's clear there are two major transformations that occur between the different face areas. Tommy Poggio was a pioneer of deep processing who developed the HMAX system, which basically states that you can identify complex visual objects by stacking the operations of primary visual cortex-- simple and complex cells-- on top of each other. And it's very similar to deep learning.
So we were interested to know whether the properties in the face processing system, which seem to occur at three different levels of processing, can be explained this way. Why do we think there are three levels? The anatomy is suggesting it: the more anterior you go, the later the processing stage.
Response latencies are suggesting it; things like increasing receptive field sizes are suggesting it. Everything is telling us there's a hierarchy there that goes from one area to the next to the next. So that is compatible with this idea of deep processing. But the question is, can we now explain these properties? And first, we ask this at a qualitative level.
Can we explain this head orientation tuning and the other properties in this network? Do these architectures have to be deep to bring out these properties? In particular, we were interested in this property of mirror symmetry confusion. Why? Because it didn't really occur in the HMAX models that Tommy had studied before.
And so we were a bit concerned that maybe there's something-- a prominent finding in the face processing system of the macaque monkey-- that our computational models of the process were lacking. So why is this important? Because let's say you have an area ML providing inputs to an area AL. And you have a regular learning rule that is working by association.
And what you're seeing in the physical world are faces that are typically rotating from left to right, not jumping from left to right. They always have to go through a physical intermediate-- the front view of the face, or something rotated some other way, but typically going through the front view of the face.
If you have a temporal continuity learning rule, which is really what everyone assumes exists in the brain, you would expect that in AL you should see broader tuning curves. You do not expect mirror-symmetric tuning like we found. That implies a very different wiring. So any way you look at it-- whether you think about mechanisms of processing or about computational principles-- AL should not have the properties that it does. So why does it have them?
We engaged in a collaboration. Joel Leibo was the grad student who really looked at this very deeply. He came up with a flat network that could reproduce these three different levels of processing. We just posited it's a hierarchy. Level 1 is a face filter responding more to faces than to non-face objects.
I showed you evidence of face detection-- I talked about that. Level 3 is then face identification-- again, compatible with what the cells are showing. And then he looked through several learning rules. And it turned out that only one learning rule, a regularized version of Hebbian learning, would give us mirror symmetry confusion at the mid-level of processing.
So this means that these properties that we found in the face processing system can be explained by a feedforward processing network. This network does not have to be deep. But several factors have to come together for us to be able to explain these qualitative properties. There has to be the large-scale organization of a hierarchy-- again, it doesn't have to be deep.
There has to be the right learning rule. Without it, it's not going to happen. And there has to be a stimulus that has an intrinsic bilateral symmetry. That is maybe surprising, maybe not. But again, the mirror symmetry confusion is not about the mirror symmetry of the front view.
It's about the mirror symmetry of the profile views. And if you look at objects that are not intrinsically mirror symmetric, unlike a face, they don't show this property. So there are two things that are really critical-- the learning rule and the intrinsic geometry of the stimulus you're processing.
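For readers who want the flavor of that learning rule, here is a minimal sketch of a regularized Hebbian update, Oja's rule, which drives a weight vector toward the principal component of its inputs while keeping its norm bounded. This is the rule family referred to above, not the published model itself; the link to mirror symmetry (principal components of the views of a bilaterally symmetric object are themselves even or odd in viewing angle) is only summarized in the comments.

```python
# Oja's rule: w <- w + eta * y * (x - y * w), a regularized Hebbian update.
import numpy as np

rng = np.random.default_rng(2)
n_inputs, n_steps, eta = 50, 5000, 0.01

principal = rng.normal(size=n_inputs)        # dominant input direction
principal /= np.linalg.norm(principal)
w = rng.normal(scale=0.1, size=n_inputs)

for _ in range(n_steps):
    x = rng.normal() * principal + 0.1 * rng.normal(size=n_inputs)
    y = w @ x                     # Hebbian output
    w += eta * y * (x - y * w)    # the -y^2 * w term regularizes ||w||

# w converges (up to sign) to the input's principal component. For view
# sequences of a bilaterally symmetric stimulus, such components pair up
# +theta and -theta views, hence mirror-symmetric responses.
print(abs(w @ principal))         # close to 1
```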
OK, this is the qualitative picture. What about the quantitative one? We can actually quantify exactly how much information of head orientation, bilateral symmetry confusion, and identity is in the response at every level of processing. We're doing this based on the population response similarity matrix.
That is basically taking the response of the entire population as a vector and then asking how similar the response vector to one of 200 stimuli-- I'm going to explain this in a second-- is to the response vector of the population to any of the other 199 stimuli. So these 200 stimuli were organized by identity and by head orientation. And the coarse organization here is by head orientation.
So we have eight different head orientations here and 25 different identities. Stimuli 1 through 25 have different identities; similarly, stimuli 1, 26, and so forth share the same identity. And head orientation changes every 25 stimuli. And in this population similarity matrix, we're showing similar responses-- a simple Pearson correlation coefficient-- in dark.
So the similar responses are the ones that are darker here. The main organization in the middle face areas is a block-diagonal one: it's head orientation that's determining how similar responses are. You can see this here. For any given response, the most similar one is the one to the same head orientation.
For a front view, there are also fairly similar responses to the upward- or downward-tilted face; and the full left profile and the half profile are also pretty similar to each other. So head orientation is dominating things. In AL, that is still the case: we see a high similarity based on head orientation.
But in addition, we now also see dark regions for the left and the right profile that did not exist in ML before. That's a new property. And you might also see another property coming up here, and that is para-diagonal stripes-- darker stripes here that are very narrow. They're identity specific.
They are for the same identity, shifted by 25 stimuli. And that now becomes a very dominant property in AM. You see that basically here, the most similar responses you get are for the same identity. And it doesn't really matter that much anymore which head orientation you have.
But it's not perfectly true-- you can see the off-diagonals also in AM. So now we can ask, how similar are these matrices to idealized versions that are only view specific, only mirror symmetric, or only view invariant? And this is the quantification for the actual face processing network.
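The quantification just described is a population similarity analysis, and its logic fits in a short sketch. The data below are synthetic; only the structure (200 stimuli = 8 head orientations x 25 identities, Pearson correlation between population response vectors, comparison against idealized matrices) follows the talk.

```python
# Population similarity matrix and comparison to idealized structures.
import numpy as np

rng = np.random.default_rng(3)
n_views, n_ids, n_cells = 8, 25, 120

views = np.repeat(np.arange(n_views), n_ids)   # view changes every 25 stimuli
ids = np.tile(np.arange(n_ids), n_views)       # identities cycle within a view

# Toy population carrying both view and identity information, plus noise.
view_code = rng.normal(size=(n_views, n_cells))
id_code = rng.normal(size=(n_ids, n_cells))
resp = 0.7 * view_code[views] + 0.7 * id_code[ids] + rng.normal(size=(200, n_cells))

sim = np.corrcoef(resp)                        # 200 x 200 similarity matrix

ideal_view = (views[:, None] == views[None, :]).astype(float)   # view specific
ideal_id = (ids[:, None] == ids[None, :]).astype(float)         # view invariant

off = ~np.eye(200, dtype=bool)                 # ignore the trivial diagonal
for name, ideal in [("view-specific", ideal_view), ("view-invariant", ideal_id)]:
    r = np.corrcoef(sim[off], ideal[off])[0, 1]
    print(name, round(r, 2))
```

A mirror-symmetric ideal would be built the same way, by treating left and right profiles of equal angle as the same view.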
OK, good. So that's good. I mean, it's mildly interesting. But why would we want to do it? Because now we can apply the same quantification to any artificial computational system that is trained on face discrimination.
When we do this for a VGG network trained on faces, and only look at the layers that are most similar to the three that we find in the brain, we find the profile on the lower right. It bears some similarity with the biological system-- identity selectivity getting stronger and head orientation selectivity getting weaker over the stages, and mirror symmetry emerging at the mid-level of processing-- but qualitatively, it's not so similar. So Ilker Yildirim, Josh Tenenbaum, and I were collaborating, [INAUDIBLE] both our labs.
And we are following the idea that maybe what the system is doing is actually rather different from what a vanilla deep network is doing. And that is, it's not doing a mapping directly onto the identity of the face but onto the latent variables of the face. And that makes a lot of sense if you think about the ramp-shaped tuning curves that I was showing to you before. Why is that different? Because now we can do analysis by synthesis.
You can understand an incoming image because you're able to reconstruct it, based on the latent variables of an internal three-dimensional model, with a forward graphics engine. So you can now recreate the image-- very different from the interpretation of a deep network-- and again, this is partly motivated by the finding of systematic tuning to latent variables of the face.
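The analysis-by-synthesis idea can be caricatured in a few lines: posit latent variables, render them through a forward model, and adjust the latents until the rendering matches the observed image. In the sketch below, the "renderer" is just a random linear map standing in for the 3-D graphics engine of the actual model, so this illustrates the inference loop, not the published network.

```python
# Schematic analysis by synthesis: invert a forward model by gradient
# descent on reconstruction error. The linear "renderer" is a stand-in.
import numpy as np

rng = np.random.default_rng(4)
n_latents, n_pixels = 10, 100

render_matrix = rng.normal(size=(n_pixels, n_latents))

def render(latents):
    # Forward model: latent face variables -> image. In the real model,
    # this would be a graphics engine rendering a 3-D face.
    return render_matrix @ latents

true_latents = rng.normal(size=n_latents)
observed = render(true_latents) + 0.01 * rng.normal(size=n_pixels)

est = np.zeros(n_latents)
for _ in range(2000):
    err = render(est) - observed
    est -= 0.001 * (render_matrix.T @ err)   # gradient of 0.5 * ||err||^2

print(np.allclose(est, true_latents, atol=0.05))   # latents recovered: True
```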
So if we now look at the network that Ilker built, which does this mapping onto the latent variables, we find a pattern that is much more similar, where we now get a correlation coefficient with the actual selectivity in the brain that is close to one-- almost too good to be true. So this is one line of evidence that maybe what the system really is doing is this analysis-by-synthesis approach, very different from a vanilla deep network approach.
We have other evidence that is based on psychophysics. There's a very cool illusion that's called the hollow face illusion that you should look at that shows that when you are looking into the concave part of a face mask, your brain is actually forcing you to see it as a forward facing convex version of the face as if your brain had an internal three-dimensional face model that is only able to interpret the incoming information in terms of a forward facing, proper convex face.
So deep convolutional networks are capturing some of the properties, but they do not capture the qualitative transitions that we see. They don't provide good quantitative explanations. And so one way to phrase it, in Marr's terms, is that the computational goal of the system is really important. We did use a deep network to train things, because that's efficient.
And it works really well. But we said, don't train it on a label; train it on latent variables. That made all the difference. OK, and so this might mean that an analysis-by-synthesis approach might be useful for this processing.
Here's the hollow face illusion. I'm at the end of my time. So I'm not going to show it to you. Actually, maybe I can just do it very quickly just so that you've seen it--
PRESENTER: Winrich?
WINRICH FREIWALD: Yeah?
PRESENTER: We need to end at 2:45 so you have plenty of time.
WINRICH FREIWALD: Oh, OK. All right, good. But then we have time for lots of questions. Sorry. This is not the face illusion. Let me jump back. Sorry. Here it is.
So it looks like you're looking at Einstein, a bust of Einstein looking at you. We're going to set it into rotation. It looks like it's moving leftwards. And not even that is true. You'll see in a second.
But you've been looking at a concave part all along. Now you're looking at the convex part. Everything makes sense. Einstein is looking at you, rotating in this counterclockwise fashion. Now you're going to be looking to the back again, the concave part.
And now, pop-- it goes forward. You have to see it as forward facing. And it's rotating in the wrong direction. So your stereoscopic vision is kidnapped into thinking that this is forward facing when it's really backward facing.
So what I was running through, apparently a little faster than I would have had to-- but this gives us a lot of time for discussion-- is, number 1, there is specialized machinery in the brains of macaque monkeys-- it's also there in humans, and Nancy talked to you about it extensively-- that is specialized for faces to a remarkable degree. Again, when we record from these fMRI-identified face areas, virtually all the cells are face selective.
So we have a locally highly homogeneous population of cells. But it's also heterogeneous, in that every cell is interested in different aspects of faces. I showed you cells tuned to intereye distance, face aspect ratio, all different kinds of latent variables. These different face areas, modules for face processing, are connected to each other to form a face processing network.
This is based on two pieces of evidence-- electrical stimulation inside the scanner and also tracer studies. They're remarkably selectively connected to each other. So the funny thing about this face processing network is that, in a way, it's mostly talking to itself. And that is still something we have to figure out. I didn't tell you about evidence for feedback.
We have evidence for feedback, in the form of predictive coding, happening in this network. I was focusing here on the feedforward processing pathway. The system, if you look at it from a computational point of view, does exhibit some properties of a deep convolutional network. It gets inputs from parts of the brain that are not face selective, after several stages of processing have already happened.
The architecture is not strictly hierarchical-- you see some aspects of parallel processing as well-- but there are elements of a processing hierarchy. This organization is very, very useful for us, because it separates different qualitative processes into different face areas. And that's really been key for us to get a grip on the system and really understand, to some extent, the fundamental properties that it has.
This separation made it possible to apply different computational approaches and to identify some of the principles. Again, by "principles," I mean understanding why the system might be organized the way that it is. So for example, another property that we're interested in is the ramp-shaped coding. Ramp-shaped coding you find in every network you look at.
It can be a deep convolutional network. It can be this flat network, even just two layers of processing. If the system gets facial input and if the system is trained on face discrimination, it'll generate these ramp-shaped tuning curves. We don't understand mathematically why that is. But this is an observation across many different architectures.
So this is a property that results from the stimulus geometry and maybe the computational goal of the system, but nothing else. But the mirror symmetry that I was talking about, which we find in the middle of the processing system-- that requires one particular learning rule and a particular architecture, a hierarchy. Again, deep convolutional networks, as I said, capture some properties but not others.
And this is really, really important. Because analysis by synthesis is really very, very different from what a deep convolutional network is doing. It's almost like violating information theory: when you see a two-dimensional image of a face, you actually interpret it with latent variables that are those of a three-dimensional face. So you make an inference about the face, based on prior information, that goes deeper than just using the information in the image.
And then lastly, I was glossing over the fact that after this stage of processing-- which is generic for all faces and seems to extract these latent variables for any face-- there's a level of processing that is really identity specific, one that depends a lot on the personal familiarity that you have with individuals.
And so that level is selective for familiar individuals. So as to what it is about our brains that makes us think the way that we do, one of the answers is that this very, very complex brain is organized, in a way, by modules and by networks of modules. From a pragmatic view, that makes it possible to analyze the brain at a mechanistic level; on a conceptual level, it means that there is a high degree of functional specialization.
And I expect that, if anything, this would be higher in the human brain than in the macaque brain. We identified not only face processing networks, the work that I talked about today, on the left-hand side, and not only nodes for person knowledge, which I was glossing over today, but also networks relevant for the other levels of organization of social cognition that I mentioned in the beginning, such as the analysis of social interactions.
And we believe the combination of these networks-- with the involvement of the temporal pole in social interaction analysis pointing again to other parts of the temporal pole-- stores this deep knowledge about the social environment in terms of kinship, and friendship, and these other levels of understanding that primates have. There are also networks for social communication, for emotions, for attention control.
So there's a lot of specificity linking parts of the brain that are far apart into function-specific networks. And the existence of those modules and their interactions, I think, is a critical component that explains why we think the way that we do.