A Computational Explanation for Domain Specificity in the Human Visual System
July 14, 2020
June 6, 2020
Katharina Dobs, MIT
Many regions of the human brain conduct highly specific functions, such as recognizing faces, understanding language, and thinking about other people’s thoughts. Why might this domain specific organization be a good design strategy for brains? In this talk, I will present recent work testing whether the segregation of face and object perception in primate brains emerges naturally from an optimization for both tasks. We trained artificial neural networks on face and object recognition, and found that smaller networks cannot perform both tasks without a cost, while larger neural networks performed both tasks well by spontaneously segregating them into distinct pathways. These results suggest that for face recognition, and perhaps more broadly, the domain-specific organization of the cortex may reflect a computational optimization over development and evolution for the real-world tasks humans solve.
From a brief glimpse of a complex scene, we recognize people and objects, their relationships to each other, and the overall gist of the scene – all within a few hundred milliseconds and with no apparent effort. What are the computations underlying this remarkable ability and how are they implemented in the brain? To address these questions, my research bridges recent advances in machine learning with human behavioral and neural data to provide a computationally precise account of how visual recognition works in humans. I am currently a postdoc at MIT where I work with Nancy Kanwisher. I completed my PhD at the Max Planck Institute for Biological Cybernetics under the supervision of Isabelle Bülthoff and Johannes Schultz investigating behavioral and neural correlates of dynamic face perception. During my first Postdoc at CNRS-CerCo working with Leila Reddy and Weiji Ma, I used a combination of behavioral modeling and neuroimaging to characterize the integration of facial form and motion information during face perception.
KATHARINA DOBS: I want to talk about a computational explanation for domain specificity in the human visual system. So let's start with the problem. What is the problem that vision needs to solve?
And so from a brief glimpse of this complex scene here, we quickly and seamlessly, effortlessly, recognize people and objects and their relationship to each other and the overall gist of the scene. And in the brain, when we look at this face, for example, the 2D pattern of lights entering our eyes is processed through different stages until we come up with the abstract representation of the face, such as this is Anna. So how does the visual system solve this? How does visual recognition work in humans?
And despite the wealth of empirical data collected in the last decades, it's actually still unclear what the precise computations and representations are at each of those processing stages and why. However, there is something that we learned about how the brain solves visual recognition, and that is that these processing stages do not seem to be the same for all visual categories. In fact, many labs in the last decades have discovered functionally specialized areas in the ventral pathway.
So for example, if you show subjects in the fMRI faces and objects, some areas in the ventral pathway respond selectively to faces, compared to any other visual category. And the same has been found for places or bodies or visual words. And while we learned a lot about what these specialized neural populations are doing and whether they're causally involved in these tasks and so on, some big questions actually remain open.
The first one-- why do we have functional specialization for these categories, and not others? And second, why is functional specialization a good design strategy for brains in the first place? And I think we all agree that we would only expect functional specialization for tasks that we either do on a daily basis, that are important in modern life, or that have been important to our ancestors during evolution.
However, that constraint alone seems to under-constrain the functional specializations that we find in the visual system. So for example, when labs went on to see whether there is functional specialization for cars, which we have a lot of daily experience with, it seems like there's no evidence for that. Or take food, as an example of a task that we do from basically the earliest days after we're born, and that was important to our ancestors. We need to discriminate food all the time. There is also no evidence for functional specialization for that. And specialization for other, more evolutionarily important categories, such as snakes or spiders, could also not be found.
So here what we want to propose or test is that maybe on top of being important in modern life and/or evolutionarily, there are computational reasons for why these categories specialize, and others don't. So to be a bit more precise, the hypothesis that we're testing here is that the domain-specific organization in the human ventral pathway reflects a computational optimization for the real-world tasks that humans solve. And so this is not a new hypothesis. It has been out there for a long time. Mark talked about it, and others.
However, until now, there was basically no means or no tool to test this hypothesis in a computational system that was anywhere near human-level performance in these tasks. And that suddenly changed with the rise of Convolutional Neural Networks, or CNNs, in computer science. So in 2012, Krizhevsky et al. developed AlexNet, a CNN optimized to classify images into object categories. And critically, CNNs are image-computable, meaning that they can be trained on natural images like the image of this art, here. And this image then gets put through a cascade of operations, extracting features of the image, pooling those together, and eventually classifying the object category.
And this particular architecture, AlexNet, won the most important computer vision challenge at that time. And each following year, another CNN won that challenge, until by 2015, CNNs achieved near human-level accuracy in the task of object categorization. And I just want to emphasize here that they're still far away from other aspects of human-level object recognition, but they're really good at categorizing objects.
And so CNNs have been really useful and a big advance, not just in computer vision, but also for human cognitive neuroscience. And the reason for that is that these computational models are actually inspired by the visual processing hierarchy. So in the last few years, several dozens of papers have been published showing a high similarity between the features extracted at the different layers of a CNN and the different processing stages in the human brain. And this is exciting, because for the first time, we have image-computable models now of how human visual recognition might work in the brain. So let's use them to test our hypothesis about functional specialization.
So the work I'm presenting here today has been supervised by Nancy Kanwisher and done together with Alex Kell, Julio Martinez, and Michael Cohen. And I wanted to particularly focus today on the case of faces, or face specialization, in the human visual system. But all of these techniques actually can transfer to all other visual categories.
So there are two main questions that we had at the beginning. The first one was, from a computational point of view, do face and object processing require distinct computations? So do they require distinct computations, or can we train on one task, and do the computations then also transfer to the other task?
And then as a second question, we wanted to see what happens if we train a system on both of these tasks. Is there a set of representations that can be learned to support both of these tasks or not? That helps us better understand whether there is a computational reason for why these things should be separated.
And to understand whether the face and object tasks require distinct computations, we trained two systems on these two different tasks. So we took a standard AlexNet architecture, randomly initialized, and trained it from scratch on face identity. So we used around 1,700 identities from the VGGFace2 data set and trained that model from scratch until there was no more improvement in performance using standard training and optimization parameters. And then we had another AlexNet architecture-- so the same architecture. Now, we trained that and optimized it on object categorization.
So what we did is we took 423 categories from the ImageNet data set. They were prototypical objects. So we removed all the scenes and animals and other categories from that data set and left only prototypical objects. And importantly, we matched the number of images between these two data sets such that the training set size for both of these networks was the same.
And now how can we find out whether a network trained on one task can also do the other? So what we did is we had a set of 100 held-out identities. So these were face identities the network had never seen before. They weren't included in the training. 10 images each, so 1,000 images in total.
We showed them to the network and then extracted the activations from the penultimate layer, fc7. So we get a pattern, a feature vector, basically, for every one of those images. And then we also had 100 held-out object categories that we took from the THINGS database that we showed to the object CNN and extracted those activations. And then we also showed the object images to the face CNN and the face images to the object CNN. So we also got the same patterns now for object images.
So now, in order to find out whether these features are useful to do both tasks, we trained a support vector machine to decode the 100 face identities. So the idea was basically take those features, keep them constant, and just vary the readout and see which ones are more useful to decode faces: the activations from the face CNN or the ones from the object CNN? So on the y-axis here, you see the decoding accuracy-- how well we can decode faces. So let's look at the features extracted from the face CNN first. And that's what we see here. So we can decode face identities really well using the features from the face CNN.
So what happens if we take the features from the object CNN? With this generically trained object CNN, is face identification something that just emerges and falls out of training on object categorization? Or does it perform worse than the face CNN?
And that's what we see here. So the object CNN is definitely above chance level. Chance level is 1%, here. But the performance is much worse than when we used the features from the face CNN. And we can do the same thing for object decoding. And here, we find the exact opposite. So now, the features from the object CNN are very useful to decode objects, while the features from the face CNN are not as useful.
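The decoding logic described above (hold the features fixed, vary only the readout, and compare held-out accuracy against chance) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the features are synthetic stand-ins rather than real fc7 activations, a nearest-centroid readout is swapped in for the support vector machine to keep the example NumPy-only, and the names `decode_accuracy` and `toy_features` are hypothetical.

```python
import numpy as np


def decode_accuracy(features, labels, n_train_per_class):
    """Fit a nearest-centroid readout on fixed features; return held-out accuracy."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    centroids, test_x, test_y = [], [], []
    for c in classes:
        x = features[labels == c]
        # First images of each class fit the readout, the rest are held out.
        centroids.append(x[:n_train_per_class].mean(axis=0))
        test_x.append(x[n_train_per_class:])
        test_y.append(np.full(len(x) - n_train_per_class, c))
    centroids = np.stack(centroids)
    test_x = np.concatenate(test_x)
    test_y = np.concatenate(test_y)
    # Squared Euclidean distance from every test vector to every class centroid.
    dists = ((test_x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    predictions = classes[dists.argmin(axis=1)]
    return float((predictions == test_y).mean())


def toy_features(n_classes, n_per_class, n_dims, signal, seed):
    """Synthetic stand-in for penultimate-layer activations (not real CNN output)."""
    rng = np.random.default_rng(seed)
    class_means = rng.normal(size=(n_classes, n_dims)) * signal
    labels = np.repeat(np.arange(n_classes), n_per_class)
    features = class_means[labels] + rng.normal(size=(labels.size, n_dims))
    return features, labels


n_classes, n_per_class = 100, 10   # 100 held-out classes, 10 images each
chance = 1.0 / n_classes           # 1% chance level, as in the talk
strong, labels = toy_features(n_classes, n_per_class, 64, signal=3.0, seed=0)
weak, _ = toy_features(n_classes, n_per_class, 64, signal=0.3, seed=0)
acc_strong = decode_accuracy(strong, labels, n_train_per_class=8)
acc_weak = decode_accuracy(weak, labels, n_train_per_class=8)
print(f"chance={chance:.2f} strong={acc_strong:.2f} weak={acc_weak:.2f}")
```

The "strong" and "weak" feature sets play the roles of a network trained on the decoded task and a network trained on the other task. The gap between them is built into the toy data and says nothing about the actual results.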
So we find that the face-trained network does not do well on the object task and vice versa, basically showing here a double dissociation, just like we see it in the brain: a system optimized to do faces cannot do objects really well and the other way around. So do face and object processing require distinct computations? This analysis suggests, yes, the representations are actually suboptimal for the other task, suggesting that that is a potential reason for why these things are segregated in the brain.
But then what we didn't test yet is whether there is a set of representations that can be learned if you train a system on both of these tasks simultaneously. So that's what we wanted to understand in the next question. Can representations be learned to support both tasks?
So in addition to those two separate CNNs, the face and the object CNNs, we also introduced a fully shared dual-task CNN. And that technique has actually been nicely introduced by Kell et al. in 2018, where they used that in the auditory system to look at task segregation of word and music tasks. And we want to transfer this now to the visual system and ask, how about faces and objects?
And so the idea basically is you take the same architecture, and the only thing you change is you add another classification layer, so there is one for the face classes and one for the object classes. And then you train that network on alternating batches of face images and object images until, again, the training loss reaches an asymptote, so there is no more improvement in learning. And then the idea is that if that network discovered a shared feature set to do both tasks, it should achieve the same performance as the face-only or object-only network. And if it cannot-- and that is our hypothesis-- we actually expect that the fully shared network does worse than the separate networks.
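To make the dual-task setup concrete, here is a minimal NumPy sketch of a network with one shared trunk, two classification layers, and an alternating batch schedule. It is an assumption-laden toy: the class `DualTaskNet`, the function `alternating_batches`, the layer sizes, and the random inputs are all hypothetical, the trunk is a single dense layer rather than a CNN, and the gradient update that would follow each forward pass is omitted.

```python
import numpy as np


class DualTaskNet:
    """Toy stand-in for a fully shared network with two classification layers."""

    def __init__(self, n_inputs, n_hidden, n_face_classes, n_object_classes, seed=0):
        rng = np.random.default_rng(seed)
        # Shared trunk: everything up to the readout is used by both tasks.
        self.shared = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
        # One classification layer per task on top of the shared features.
        self.heads = {
            "face": rng.normal(scale=0.1, size=(n_hidden, n_face_classes)),
            "object": rng.normal(scale=0.1, size=(n_hidden, n_object_classes)),
        }

    def forward(self, x, task):
        hidden = np.maximum(x @ self.shared, 0.0)  # ReLU on shared features
        return hidden @ self.heads[task]           # task-specific logits


def alternating_batches(face_batches, object_batches):
    """Yield (task, batch) pairs, switching task on every step."""
    for face_batch, object_batch in zip(face_batches, object_batches):
        yield "face", face_batch
        yield "object", object_batch


net = DualTaskNet(n_inputs=32, n_hidden=16, n_face_classes=5, n_object_classes=7)
rng = np.random.default_rng(1)
face_batches = [rng.normal(size=(4, 32)) for _ in range(2)]
object_batches = [rng.normal(size=(4, 32)) for _ in range(2)]
schedule = []
for task, batch in alternating_batches(face_batches, object_batches):
    logits = net.forward(batch, task)   # a training step would update weights here
    schedule.append((task, logits.shape))
print(schedule)
```

In a real training loop, a face batch would update the shared trunk and the face head only, and an object batch would update the shared trunk and the object head only, which is what forces the trunk to serve both tasks.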
So let's see what we find. What we're looking at here now is the test performance. So in this case now, we can directly go and look at the test performance, which is based on an independent set of images that we have from all the classes that the face network and the object network have been trained on. But they have never been seen during training. And the red bar shows you the performance from the face-only CNN, and yellow, the performance on object categorization from the object-only CNN.
But now what we really want to know is how does the dual-task network perform? From the network that has been trained on both tasks, we get two performances, one for the faces and one for the objects. The idea is that if that network discovered a shared representation, the performance should be very similar to what we have for the face-only and object-only networks. If it failed at finding such a set of shared representations, performance should drop. And that's exactly what we find.
So we basically see that there is a large drop in performance when trying to share these two tasks in one system, in particular for the face task. So it seems that face and object tasks cannot be performed well in one network, suggesting that that might be a potential reason for why these things should be segregated if you want to perform well on them. However, that's also not really what we have in the brain.
In the brain, we don't see that nothing is shared or everything is shared. What we actually have and what we see in the brain is that early visual processing can be shared across these tasks, like face and object and all the visual system. But then at some later mid-level processing stage, these tasks diverge.
So we wanted to find out whether that is the case in our system as well. So we basically asked, what happens if we share a little less or even a little less or even a little less? At what point would we achieve the same performance as a network that has only been optimized to do faces or objects? Or do we maybe have to go all the way back and cannot share anything in the system, which would not be very biological?
And so that's what I'm showing you here now. On the x-axis, you see the different networks that we trained with different branch positions. So the networks either branched at a very early stage and didn't share much, or they branched at a very late stage and shared almost everything. Now, we're comparing the performance for the face task to the network that has only been trained on faces. That's our baseline. That's the performance that we want to achieve.
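One way to picture the family of branched networks is as a single layer stack split at a chosen position, with everything before the split shared and everything after it duplicated per task. The short Python sketch below illustrates only that bookkeeping. The function `split_at_branch` and the layer names are hypothetical and are not taken from the actual training code.

```python
def split_at_branch(layers, branch_after):
    """Return (shared, per_task) layer names for a branch after `branch_after` layers."""
    if not 0 <= branch_after <= len(layers):
        raise ValueError("branch_after must be between 0 and len(layers)")
    shared = layers[:branch_after]
    # Layers after the branch point get one private copy per task.
    per_task = {
        task: [f"{name}_{task}" for name in layers[branch_after:]]
        for task in ("face", "object")
    }
    return shared, per_task


# Hypothetical layer names for an AlexNet-like stack (5 conv + 2 fully connected).
layers = ["conv1", "conv2", "conv3", "conv4", "conv5", "fc6", "fc7"]
for branch_after in range(len(layers) + 1):
    shared, per_task = split_at_branch(layers, branch_after)
    print(branch_after, shared, per_task["face"])
```

With `branch_after=0` nothing is shared, which corresponds to two separate networks, and with `branch_after=len(layers)` everything up to the classification layers is shared, which corresponds to the fully shared dual-task network.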
So let's see what that looks like. And what we see is that we can basically share the first convolutional layers up to three or four and achieve very similar performance to the face-only network. But if we tried to share more, then the performance starts to drop. We can look at the same thing for the object task, and here we see a very similar pattern. Sharing the first three or four layers is fine, and then performance will drop. So it seems like early processing stages can actually be shared between those tasks like in the human visual system.
And so can representations be learned to support both tasks? We find not without a cost of sharing, at least after mid-level processing stages. Early layers can be shared. And that is certainly supportive of the idea that it computationally makes sense to separate these two tasks into different systems if you want to perform well.
But now, we've looked at one architecture. This was AlexNet. It's a very early network in terms of the history of CNNs, and there has been much development, here. So we wanted to see, do these results-- are they particular to AlexNet? Do they generalize to larger-- to other networks in particular? Do they generalize to larger networks, more layers, more features in every layer, and so on? And that's important to find out.
So we trained another architecture. We used VGG16, which is a much deeper network, way more layers, more features in every layer. And it also performs better than AlexNet on several benchmarks. So we trained this architecture again on the face and the object task. That's their performance, here-- the test performance, again, for those two networks.
So we find that the overall performance is better than for AlexNet, as to be expected. But critically, now, we also trained this dual-task network. Using this large network now, also trained it on faces and objects. And now we wanted to see, OK, how about this network? Would this also show a cost of sharing, so a drop in performance, when we try to do both tasks in the same system? Or would this network maybe be able to do it?
And this is what we find. And this was, to be honest, very surprising to us. So we basically saw there was no difference in performance if you train this architecture, this large system, on one task or both tasks simultaneously. So face and object tasks can be shared in larger systems, but why not in the brain?
So for our hypothesis, that suggests that this network discovered a shared set of features to do both tasks. So that cannot be a reason for why these things are segregated in the brain. That's one hypothesis.
Another one that we thought about then is what if this network actually spontaneously discovered task segregation? So what if, hidden in those layers, units started to specialize on face processing or object processing, and that's how this network achieved the performance? So we wanted to find out whether that was true.
So we started to perform lesion experiments in that dual-task network. And I just want to quickly emphasize here how cool it is to have these systems that you have full access to. And you can basically very cheaply and quickly perform things such as lesion experiments to look at causal involvement of different layers or units in the network, something that would be very hard, if not impossible, to do in the brain.
And so the idea was basically that if we wanted to find out whether a certain filter is involved in a face task or the object task, we can just lesion it and look at the performance and see how does that affect the performance in both tasks. So that's the basic idea. So we took that dual-task network and dropped every single feature map, one by one, in all the convolutional layers and showed it batches of face images.
And then with these batches of face images, we compute the loss to see when dropping that particular feature map, how does that affect the loss? Does the loss increase? Is it important for the task? Does it not change? It doesn't seem to be important for the task.
So we did this for every single convolutional layer. And I'm just showing you one example layer here. So we basically get a loss associated with dropping every one of those feature maps in this layer. And then we can sort them or rank them, basically, according to the effect that they had on the loss for the face task.
And then what we can do to find out whether that sorting is really task-specific-- so right now, we only know that these feature maps are important for the faces. But they could also be important for the object task. We haven't ruled that out.
So what we did then is we dropped the entire top 10% or top 20% and checked the performance on an independent test set-- so the test set that we have for those networks-- and tested whether, when we drop those units according to the sorting, this will only affect the face performance, or whether it will also affect the object performance. And then we can do the same thing on object batches. So we show the network objects, run the same procedure, and get an object sorting, now ranked for how important the feature maps are for the object task.
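The two-step lesion procedure described here (rank each unit by the loss increase when it is dropped alone, then drop the top-ranked fraction together) can be sketched as follows. This is a toy illustration, not the actual experiment: `rank_units_by_lesion`, `lesion_top_fraction`, and the two linear "losses" are hypothetical stand-ins for running face or object batches through a lesioned network.

```python
import numpy as np


def rank_units_by_lesion(loss_fn, n_units):
    """Rank units by how much the loss rises when each one is dropped alone."""
    baseline = loss_fn(np.ones(n_units))
    increases = np.empty(n_units)
    for unit in range(n_units):
        mask = np.ones(n_units)
        mask[unit] = 0.0            # lesion a single feature map
        increases[unit] = loss_fn(mask) - baseline
    return np.argsort(-increases)   # most important unit first


def lesion_top_fraction(ranking, fraction):
    """Return a keep-mask that drops the top `fraction` of ranked units."""
    n_drop = int(round(fraction * len(ranking)))
    mask = np.ones(len(ranking))
    mask[ranking[:n_drop]] = 0.0
    return mask


# Toy losses (hypothetical): each task depends on the units through fixed weights.
FACE_WEIGHTS = np.array([5.0, 4.0, 0.1, 0.2, 0.1, 3.0, 0.1, 0.2, 0.1, 0.1])
OBJECT_WEIGHTS = np.array([0.1, 0.2, 5.0, 4.0, 3.0, 0.1, 0.2, 0.1, 0.1, 0.1])


def face_loss(mask):
    return float(FACE_WEIGHTS @ (1.0 - mask))


def object_loss(mask):
    return float(OBJECT_WEIGHTS @ (1.0 - mask))


face_ranking = rank_units_by_lesion(face_loss, n_units=10)
mask = lesion_top_fraction(face_ranking, fraction=0.2)
print("dropped units:", sorted(face_ranking[:2].tolist()))
print("face loss:", face_loss(mask), "object loss:", object_loss(mask))
```

In the toy data the two tasks rely on mostly different units, so dropping the top face-ranked units raises the face loss far more than the object loss. Whether a trained network behaves this way is exactly the empirical question the lesion experiments address.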
So now the important question-- does a larger system segregate both tasks? So what we do-- and I'm showing you one example layer here. It's a late layer, so if there's segregation, it should have occurred by that layer.
So what I'm showing you here is how we dropped the feature maps from 0% to 50% in 10% steps. And on the y-axis, you see the performance on that independent test set. And the gray line shows you what happens when we randomly drop units. And you see that these layers are actually pretty robust to dropping feature maps, and the performance is not very much affected.
So what happens if we use the face sorting now? This is what we see then. So we can actually drastically impact the performance in the face task by using those face-ranked units that we ranked to be important for the face task.
But what happens if you use the object sorting? Maybe that is also effective in doing that? But it's actually not nearly as much as when you use the face sorting. We can look at the same thing for the object task. So using the object sorting, we can affect the object task much more than using the face sorting.
So it seems there is evidence for functional segregation of face and object processing in these larger networks, which was really exciting for us to see. But then obviously, that's just one layer. So what we want to know is at which processing stage does that arise? Is that just from the beginning? Does it come at some point in the middle?
So we somehow needed to quantify this for every single layer. So we decided to drop 20% of the highly ranked face units according to the face sorting in every single convolutional layer and measure the performance on the face and object task. So that's what I'm showing you here now.
On the x-axis, you see the different layers from conv1 to conv13. And here on the y-axis, you see the normalized performance. So that's normalized to show you the proportional drop-- how much dropping the 20% face units proportionately affects the face versus the object task.
So let's start with the object task, and that's what we see here. So in the early layers, we actually impact the object task a lot. But suddenly around layer four or five, that changes. And so now, dropping the units according to the face sorting doesn't impact the performance of the object task as much anymore.
So now, what does this look like for the face task? And here we see a similar but different pattern. So basically, what we see is that we can also impact the face task a lot in the early layers. But as you can see, this is not task-specific. We impact the face task as much as the object task.
However, after about conv6 or conv7, these two curves start to diverge. And now, we impact the face task, while we don't impact the object task. So we wanted to put that into some index, just transform this into a single number.
So we defined a task-specificity index, which is just a ratio between the proportional drop that we have on the face task versus the object task. So that ratio should be higher when it's impacting the face task much more than the object task. And that's what this ratio looks like when plotted over all the different layers. And so you can really clearly see the specificity. And it starts to rise after layer conv7 and goes up to 6, which means we can impact the face performance up to 6 times more than object performance here.
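As a worked example of such an index, the sketch below computes the ratio of proportional performance drops for two tasks. The exact normalization used in the study is not spelled out in the talk, so the definition here (drop divided by unlesioned accuracy) is an assumption, the function names are hypothetical, and the accuracies are made up purely to show the arithmetic.

```python
def proportional_drop(baseline_acc, lesioned_acc):
    """Fraction of baseline performance lost after a lesion."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_acc - lesioned_acc) / baseline_acc


def task_specificity_index(face_base, face_lesioned, object_base, object_lesioned):
    """Ratio of the proportional face drop to the proportional object drop."""
    face_drop = proportional_drop(face_base, face_lesioned)
    object_drop = proportional_drop(object_base, object_lesioned)
    if object_drop == 0:
        return float("inf")  # the lesion did not touch the object task at all
    return face_drop / object_drop


# Made-up accuracies, for illustration only: the face task loses 60% of its
# baseline performance, the object task loses 10%.
index = task_specificity_index(0.90, 0.36, 0.80, 0.72)
print(round(index, 2))
```

An index near 1 means the lesion hurts both tasks about equally, which is the pattern described for the early layers. A value around 6 means the face task is hit roughly six times harder than the object task.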
And just as a side note, we did a control where we matched the performance between these two tasks, so we didn't have to normalize the performance. And we found the exact same thing. So the specificity index is not driven by overall mean differences in the performance of the two tasks. So the important take-home here is that there's segregation of face and object processing that spontaneously emerges after some mid-level processing stage, like in the human visual system, and very consistent with the results that we found for AlexNet.
So we come back then to this question-- can representations be learned to support both tasks? And larger networks actually spontaneously segregate both tasks and can perform these two tasks by segregating them in the network, which is in some sense even more exciting because we didn't have to impose a branching structure on this or something. The network just discovered that segregation is a good thing to do in order to maintain a high performance. And that's really giving strong support for the idea that it does make computational sense to segregate these two tasks also in the brain.
However, there is another important thing we have to do that you might have thought about as well. What about other tasks? Do any two tasks require distinct computations? So obviously, we wanted to test that to see what happens. Is this something that is really particular to faces? Or would we find this for any pair of two tasks?
And so we thought about what would be good control tasks. And the first thing we came up with was food, which we think has real relevance to our daily life. But still, we don't see functional segregation in the brain.
So we found this Food-101 data set, and we trained a network on foods and objects simultaneously. So it is 101 categories of different types of food, and we ran the exact same analysis to find out whether these two tasks would segregate. But then we also wanted to have a control task that is more fine-grained, more similar in that sense to faces than food is, which is still a pretty heterogeneous task.
So we also included cars for which, again, we don't see functional specialization. There is no evidence of that specialization in the brain. So we have this car data set with 1,100 categories, where we have fine-grained make and model discrimination, and we trained the network on this car task and object task to find out whether this would maybe segregate when it's also another fine-grained task.
And then importantly, I just wanted to mention that we matched this always to another face and object CNN to train on the same data set size. And also very important-- the object task that we had in these networks never included any food classes or car classes or face classes. So there was no bias in that sense.
So do any two tasks require distinct computations? That's what we found just right here. That's what I showed you before for faces and objects. Here's our task specificity index. So now we want to see what does this look like for food and objects?
And here, we find a very different curve. We basically also see that we impact the food task more than the object task, but not much. And that proportion seems to be pretty stable across all the layers. And in fact, the later we go in the network, the less we impact these two tasks. And so when we look at the selectivity index, we see there's much-- specificity index-- there's much less segregation. And if anything, it arises pretty late in the game.
Now the same thing for cars, where we have another fine-grained discrimination task. And that, again, looks very similar to the food task. And also, when we look at the specificity index, we find that there is much less segregation, even for this task, for this fine-grained task. And it occurs, if anything, much later in the network.
So food and car processing show less segregation than faces and objects, showing that this is not just the case for every task. It is something that seems to be particular to the face task in these comparisons. So do any two tasks require distinct computations? We looked at food and cars, for which we don't see much specialization, and found that foods and cars actually segregate less than faces.
I think I just want to summarize this by saying basically, these findings suggest that, from a computational point of view, it makes sense to segregate faces, but not other tasks such as food and cars. It's strongly supporting the hypothesis that there might be a reason for why these things are also segregated in the brain. And while this was a really normative approach that we took here-- we basically just asked, if we take a system and train it on objects and faces, would that system also segregate these two tasks?
We obviously also want to bring that back to human behavior and the brain. And I have some initial results on behavior. And we're looking at neural data right now. But for that, stay tuned.
But what we really want to know is do dual-task networks better mimic human behavior? So what we have is we collected behavioral similarity judgments on a large set of face images. So there were 80 face images with different gender and age. There are some examples here-- gender and age and different images of the same person and so on. And we collected this behavioral similarity and basically put this in a matrix.
So you can see here, this is just a representational dissimilarity matrix, where the color indicates how similar or dissimilar two images are from each other. And then we can do the same thing based on the activations to the exact same images from the different layers in the CNNs that we trained. So that's an example for the face CNN and some fully connected layer in the face CNN.
And so you already see that there is a pretty striking similarity. But we can quantify this by basically just correlating these two matrices, putting them in a vector and correlating those two vectors. And that's what we do for every single layer and for all the three networks, the face-only, the object-only, and the dual-task network.
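The comparison described here (build a dissimilarity matrix from a layer's activations, vectorize it, and correlate it with the behavioral matrix) can be sketched in NumPy as below. Several details are assumptions rather than the study's actual choices: dissimilarity is taken as 1 minus the Pearson correlation between activation patterns, the two vectorized matrices are compared with Pearson correlation, the function names are hypothetical, and both the activations and the "behavioral" matrix are random toy data.

```python
import numpy as np


def rdm_from_activations(activations):
    """Dissimilarity matrix: 1 - Pearson correlation between image patterns."""
    return 1.0 - np.corrcoef(np.asarray(activations, dtype=float))


def upper_triangle(rdm):
    """Vectorize the off-diagonal upper triangle of a square matrix."""
    rows, cols = np.triu_indices(rdm.shape[0], k=1)
    return rdm[rows, cols]


def rdm_correlation(rdm_a, rdm_b):
    """Pearson correlation between two vectorized dissimilarity matrices."""
    return float(np.corrcoef(upper_triangle(rdm_a), upper_triangle(rdm_b))[0, 1])


rng = np.random.default_rng(0)
layer_activations = rng.normal(size=(80, 256))   # 80 images x 256 units (toy data)
model_rdm = rdm_from_activations(layer_activations)
# Stand-in "behavioral" matrix: the model matrix plus symmetric noise.
noise = rng.normal(scale=0.05, size=model_rdm.shape)
behavior_rdm = model_rdm + (noise + noise.T) / 2
print(upper_triangle(model_rdm).shape, rdm_correlation(model_rdm, behavior_rdm))
```

Only the upper triangle is used because the matrix is symmetric and its diagonal is zero by construction, so including those entries would add redundant or uninformative values. Repeating `rdm_correlation` for every layer of every network gives one curve per network of the kind described next.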
And so I'm showing you here the results for the face CNN. So it's the correlation over the layers between the face CNN and the human behavior. And in gray is the noise ceiling given the variability between subjects. And what you see is that the correlation basically steadily increases. And we're reaching the noise ceiling around layer conv12 to conv13. So it's a really pretty high correlation, here.
Now, we can do the same thing for the object CNN. And then we find the correlation is much less. So it's not anywhere near the noise ceiling. Something that's interesting, though, is that it actually outperforms the face CNN in these early layers, which is something that we see pretty consistently. These early layers are pretty good in that they're actually better models of image processing than a very specifically trained network.
So we can look at the dual-task CNN. And that's pretty cool because it basically combines the advantages of both of these networks. It has these better early visual features, but it also has the face-specific features and correlates as well as the face CNN with the face behavior in these layers.
Then we did the same thing for an object behavioral task with images from different object categories, different examples of the same category, and so on, and correlated that. And here, we find the exact opposite. So now the object CNN correlates much better than the face CNN, while the dual-task CNN again matches pretty much the performance of the object CNN.
So we find that dual-task networks correlate well with both face and object behavior, while the single-trained networks basically can only explain one and not the other. So we find that dual-task networks actually mimic human behavior well. And that makes us pretty optimistic that these might be a better model for human behavior, but also brains, where we do all these different tasks that are like face recognition and object recognition. And they are relevant to this.
So I just want to go back here to the questions I raised at the beginning about functional specialization in the human visual system. And so I think what our results suggest here-- why do we have functional specialization for these categories and not others?-- is that they cannot be performed by relying on common representations, but they need their own distinct set of representations, of computations, to be performed well, which ultimately brings us to the second question, too. It is a good design strategy for brains to segregate these tasks to perform well on these tasks.
So basically, we looked at the case of faces today. But what we think is, it actually counts more broadly that the domain-specific organization of the ventral pathway may actually reflect a computational optimization for the real-world tasks that humans solve. And this now opens up a wide space of questions and experiments and studies we can do.
So just to give you some examples of this technique, we definitely want to take this beyond just objects and faces and say, what happens when we look at objects and scenes or objects and bodies? Would we also see segregation for those? Or what about other tasks for which we don't see functional specialization?
But then even think about artificially synthesizing stimuli to get more at the question of what is it about a task that requires segregation or that makes the system require segregation or specialization? And that all will contribute to a better understanding of why we have functional specialization for some categories and not others in the visual system or beyond. And something that I also think is really, really cool is that we can also look at within-domain tasks, such as facial expression or identity within the domain of faces, to see whether they are segregated in those systems and how that relates to the brain.
But then I would also like to basically go beyond these large areas that we have known for a long, long time as islands of functional specialization and look at it in a more fine-grained way. Basically, taking a network that has been trained on hundreds of object categories, we can perform the same network lesion experiments to see for which of those categories we see functionally specialized units. And is that degree of specialization actually predictive of the degree of specialized bits or voxels in the brain? Yeah, to get at specialization in a more fine-grained way.
And with that, I'd just like to end with an announcement. I will start my own lab in the fall at Justus-Liebig University in Giessen in Germany. And so if you are as excited about this work as I am, and you're looking to do a PhD or a postdoc, please reach out. Ping me, and I'm happy to discuss. And with that, I'd like to thank everyone in the Kanwisher lab. Nancy has been a wonderful, wonderful supervisor. And thanks to all my collaborators and funders. And thank you for your attention.