Modeling the influence of feedback in the visual system
December 16, 2020
Grace Lindsay, University College London
SVRHM Workshop 2020
APURVA RATAN MURTY: So I'm excited to introduce our first speaker for the session, Grace Lindsay. Grace obtained her PhD at the Center for Theoretical Neuroscience with Ken Miller at Columbia. She is currently a Sainsbury Wellcome Centre Gatsby Computational Neuroscience Research Fellow at UCL, where she works on building functional and interpretable models of sensory processing.
Her talk today is titled "Modeling the influence of feedback in the visual system." Take it away, Grace.
GRACE LINDSAY: Great. Thank you. So I want to talk about this project that I've been working on. It's still very much in the early stages, but it is about recurrent processing in the visual system, which, judging even from the posters and other talks at this workshop, is very much a thing that people are excited about incorporating into convolutional neural networks in order to understand the function of feedback and other types of recurrent processing. So I'm excited to share this with this audience in particular.
So the outline of my talk, I'm trying to understand something that's been called local feedback. So I'll just describe what that is and how it's been studied experimentally, and what the potential roles of this local feedback are in visual processing. And then I'm going to just talk about how to add this kind of feedback to a convolutional neural network in a way that will make it replicate the basic behavioral trends that appear in experimental data.
And then I'm going to present kind of an analysis of the trained network. So trying to understand-- once this network is able to replicate some of the broad behavioral features, trying to understand what's happening at the different layers of the network and how it's doing that.
And then if there's time, I just want to talk briefly about some work I did previously about attention and how attention could interact with these local feedback circuits, because sometimes the word attention people think of as a type of feedback itself. And so I just wanted to speak to how that might interact.
So by local feedback I mean something that occurs automatically within the visual system and happens immediately after the feedforward pass. So I think most people are probably familiar with the standard feedforward visual hierarchy in primates, something like V1 to V2 to V4 to IT. But as many people probably also know, these connections-- these areas have connections that go the opposite direction as well. So V2 sends connections back to V1, V4 to V2, and so on.
And so as soon as the activity propagates through that feedforward pass, you're creating activity in these visual areas. And they're sending information back to areas before them. So this is what I mean by local feedback. It's just the immediate processing that happens from, say, an area like V2 that gets sent back to V1.
So some ways of studying what this kind of feedback could possibly be for, or some of the studies that try to figure it out are masking studies. And so the way that these masking studies work is that an image will be shown very quickly. And then it will be followed by a noise stimulus that acts as a mask, which means that it kind of interrupts the recurrent visual processing.
And so if you include this mask on some trials but not others, or have the mask come on at different times after the image of interest, then you can control how much recurrent processing is allowed: if you wait longer to show the mask, the visual system has more time to work on processing, whereas if you show the mask very quickly after the image, it has less time.
And so this is a way to probe the role of recurrent processing in the visual system. And when people do that, one of the findings that you can see is that the recurrent processing seems particularly important for having people be able to understand noisy images. So in the example that I'm showing, the subjects had to classify an image. And the image could either be a whole kind of non-degraded version of the stimulus, so in this case a lion, or it could be a degraded version. And the amount of degradation is varied.
And you can see in this plot on the bottom here that if 100% of the stimulus was visible, then the amount of time subjects got to see the image before the mask came on didn't matter that much. The subjects still performed very well, even if they only had a very short amount of time to see the image, as long as the image was not degraded, just a normal image.
However, once you start introducing this degradation where only part of the image is visible, then you start to see big differences between the amount of time that you allow. So the more time, the better the performance. And it falls off both as a function of how kind of noisy the image is and also as a function of how much time the subjects have. So this suggests that potentially something that local feedback might be doing is helping process these noisy images in particular.
So I wanted to study this function of feedback and build a network that can replicate that basic behavioral trend that I just described. And so I wanted the network to not require recurrent processing for the correct classification of clean images, but then it should need recurrent processing in order to classify noisy or degraded images.
And so the way that I built a network to do that is to train it in two stages. So in the first stage, I'm just training a normal small convolutional neural network on standard MNIST images. So this is just a normal way to train a network to do digit classification.
And then in the second stage, I'm holding that feedforward architecture constant. And I'm only training these feedback connections. So I add in a feedback connection from the second convolutional layer to the first. And I train only those connections on a set of images that include different types of noise degradation. So pixel-wise, Gaussian blur, low contrast, and occlusions.
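The four degradation types can be sketched as simple image operations on a single-channel image in [0, 1]. This is just an illustrative sketch; the noise-strength parameters here are assumptions, not the values used in the talk.

```python
import numpy as np

def add_pixel_noise(img, rng, std=0.3):
    """Pixel-wise Gaussian noise, clipped back into [0, 1]."""
    return np.clip(img + rng.normal(0.0, std, img.shape), 0.0, 1.0)

def add_blur(img, sigma=1.5):
    """Gaussian blur via a separable 1-D kernel."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k /= k.sum()
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, out)

def lower_contrast(img, factor=0.3):
    """Shrink the pixel range toward the mean intensity."""
    return img.mean() + factor * (img - img.mean())

def occlude(img, rng, patch=10):
    """Zero out a random square patch of the image."""
    out = img.copy()
    h, w = img.shape
    r, c = rng.integers(0, h - patch), rng.integers(0, w - patch)
    out[r:r + patch, c:c + patch] = 0.0
    return out
```

Each function preserves the image shape, so degraded and clean images can be mixed freely in the same training set.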
And so now these feedback connections need to learn to work within the context of the existing feedforward network in order to correctly classify these images, even when they have a lot of noise. And I just chose to run this for four time steps, and then use the readout at the last time step for classification.
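The two-stage recipe can be sketched in PyTorch: a small feedforward CNN is trained first on clean digits, then its weights are frozen and only a feedback connection from the second convolutional layer back to the first is trained, unrolled for four time steps. The layer sizes and the additive way feedback is combined with the feedforward input are my assumptions here, not necessarily the talk's exact architecture.

```python
import torch
import torch.nn as nn

class FeedbackCNN(nn.Module):
    def __init__(self, n_steps=4):
        super().__init__()
        # Feedforward pathway (stage 1: trained on clean MNIST, then frozen)
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.pool = nn.MaxPool2d(2)
        self.readout = nn.Linear(32 * 7 * 7, 10)
        # Feedback pathway (stage 2: the only trained connection),
        # projecting conv2 activity back into conv1's feature space
        self.feedback = nn.ConvTranspose2d(32, 16, 3, padding=1)
        self.n_steps = n_steps

    def forward(self, x):
        fb = 0.0
        for _ in range(self.n_steps):
            h1 = torch.relu(self.conv1(x) + fb)  # feedback modulates layer 1
            h2 = torch.relu(self.conv2(self.pool(h1)))
            # upsample conv2 activity back to conv1's spatial resolution
            fb = self.feedback(nn.functional.interpolate(h2, scale_factor=2))
            p2 = self.pool(h2)
        # classification uses the readout at the last time step
        return self.readout(p2.flatten(1))

model = FeedbackCNN()
# Stage 2: freeze the feedforward pathway, train only the feedback weights
for name, p in model.named_parameters():
    p.requires_grad = name.startswith("feedback")
```

With every feedforward parameter frozen, the feedback weights have to learn to work within the context of the fixed feedforward network, which is the point of the two-stage design.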
So when I do that, I find that I can replicate this general trend where when there's no noise, it doesn't matter how many time steps you give the model. It will perform well. And then as you add noise, you start to need more and more time steps to get decent performance. So this is just replicating the broad behavioral findings that I just showed that occur in humans when they're trying to classify noisy images with limited time.
And I'm not showing this, but it is the case that if you train the feedforward and the feedback connections all together, you don't really get this clean separation where the feedforward pass learns to classify clean images and more time is needed for the noisy ones; it just kind of gets all blurred together. So to really replicate the behavioral findings, we need to train with this two-stage process.
So now that I have this network, the question is how the feedback helps improve classification. I know that the feedback is doing the job of improving classification, because I can see that in the performance. And so we can then pretend that this is the brain, a system in which we know that feedback is helping.
Now we want to analyze the neural activity to try to understand how that's working. So I'm going to start by showing you a visualization of the activity at the first layer of the network. So to visualize this, I just did PCA on the activity of the network in response to all different images. And I'm just going to show the first two PCs.
And I've color-coded each of the points to be the average activity according to which digit is being shown, so which digit is in the image. And the shape of the point will correspond to which type of noise has been added to the image. So we're going to look at the dynamics of the activity at layer one for each of the different types of inputs that went into this network, both digits and different noise types that were added to the images.
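The visualization described above can be reproduced schematically: flatten the layer activity across images, average it within each digit-by-noise condition, and project the condition means onto the first two principal components. The function below is a sketch of that analysis; the synthetic activity in the usage example just stands in for real network responses.

```python
import numpy as np

def condition_means_2d(acts, digits, noise_types):
    """Project per-condition mean activity onto the first two PCs.

    acts: (n_images, n_features) layer activity, one row per image.
    digits, noise_types: (n_images,) integer condition labels.
    Returns (n_conditions, 2) coordinates and the condition labels.
    """
    labels = sorted({(d, n) for d, n in zip(digits, noise_types)})
    means = np.stack([acts[(digits == d) & (noise_types == n)].mean(0)
                      for d, n in labels])
    centered = means - means.mean(0)
    # PCA via SVD of the centered condition means
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T, labels
```

Running this separately at each time step and layer gives the animated point clouds described in the talk.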
And this is showing how that activity evolves over time at the first layer of this network. So you can see that it kind of starts a bit scrunched up, and then the influence of the feedback is to spread the activity out. And you can imagine that that's helpful for classification.
Usually, if you want to classify noisy inputs, then you want to spread them apart so that you can draw a line through them. And each of these points is the center of a cloud of different examples. So if you're looking at the orange triangle, that's the average over all the different examples of the digit one with blur applied. There are, obviously, multiple examples of the same digit.
So you're looking at the average of a point cloud, and so if you want to classify these things, it makes sense to kind of spread them out. And we see that is what the feedback is doing.
And if you try to do digit classification directly from the activity of this first layer over time, you can see that the ability to classify which digit is in the image does increase over time, presumably as a result of this kind of spreading out.
But if you also look at-- I'm showing this is now just a still of the last time step of the layer one activity. You can see that these clusters of points are actually clustering according not to digit identity, but to noise identity. So if you really want to classify digits, you would expect these points would cluster according to color in the way that I'm plotting it, but they're not. They're clustering according to the shape, which is the noise type that's applied.
And so, actually, if you try to classify what kind of noise was applied to the image, you can see that in this first layer activity, you already start with very high classification performance. So this first layer really knows what noise was applied to the image, and that only gets stronger as a result of the feedback.
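The decoding analysis above can be sketched as fitting a simple linear decoder on the layer activity at each time step, once for digit labels and once for noise-type labels. A nearest-class-centroid decoder is used here as a stand-in; the talk does not specify which classifier was used.

```python
import numpy as np

def centroid_accuracy(train_x, train_y, test_x, test_y):
    """Classify test points by their nearest class centroid."""
    classes = np.unique(train_y)
    cents = np.stack([train_x[train_y == c].mean(0) for c in classes])
    d = ((test_x[:, None, :] - cents[None]) ** 2).sum(-1)
    return (classes[d.argmin(1)] == test_y).mean()

def decode_over_time(acts_t, labels, n_train):
    """Decoding accuracy at each time step.

    acts_t: (n_steps, n_images, n_features) layer activity over time.
    Returns one accuracy value per time step.
    """
    return [centroid_accuracy(a[:n_train], labels[:n_train],
                              a[n_train:], labels[n_train:])
            for a in acts_t]
```

Passing digit labels versus noise-type labels to `decode_over_time` gives the two accuracy-over-time curves whose opposite trends across layers are the main finding here.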
So the feedback is not, it seems like, at this layer kind of squashing the noise or trying to make all of the images of the same digit look the same. It's actually separating the activity into different clouds according to noise, which is counterintuitive when you think about the kind of strategies that are believed to be occurring or the strategies that are most studied in biological systems.
Now, if you look all the way at the last layer of the network, at layer four-- so this is where the actual readout of the digit occurs-- you can see that, again, the points start out clustered, and the effect of the dynamics is to spread the points out. But now they are spreading out more according to digit identity. And that's consistent with the idea that the feedback does actually help you classify according to digit identity.
If we weren't getting this feature at the last layer, then it would be unclear how the feedback could actually be increasing performance on the test. So this is just a still of time step four at this layer. And you can see that kind of now clusters are happening according to color, and not according to noise identity.
And if we actually perform classification, you can see that, indeed, the digit classification increases as a function of time at layer four, and in fact, the ability to determine what noise was in the image goes down significantly over time for this layer. So what's interesting here is that you see an opposite trend in the noise classification: at the first layer the representation of the images becomes more clustered according to noise, and at the last layer it becomes less clustered according to noise.
And if we look at all of the layers-- so now this is including the second and third layers in between-- you get what you would expect for digit classification: as you go deeper into the network, through the layers, you get more digit classification accuracy. Obviously, that's what the network is trying to do, classify digits.
And then, also, as you allow more time, you get more accuracy and digit classification, which is just, again, reflecting that the feedback is actually helping. And this is true for all of the layers when it comes to digit classification.
But when it comes to noise classification, at the first layer, you get an increase in your ability to classify noise, a separation that occurs according to noise. And at some point, it switches such that by the third and fourth layer you're actually losing your ability to cluster according to noise as a result of the feedback.
And so that's kind of an odd thing to see, essentially, because I feel like a lot of our intuitions about how networks work are actually kind of linear intuitions. You would expect that if you're clustering according to noise at one layer, that will just be more true as you go deeper into the network. To see this reversal-- where you push things apart according to their noise and strengthen the representation of the noise, such that somehow later on you are classifying without regard to noise-- is kind of strange.
And so that's kind of the summary of this project on local feedback. The conclusion from this analysis is that pushing digits into this noise-specific activity space in the first layer may, somewhat counterintuitively, actually help to later classify them just according to digit identity. It's as though the network learned four different classifiers that it applied to different parts of the layer one activity space. One interpretation of these results is that the goal of the feedback is to push those representations into the regions where they can be separately classified according to digit.
So kind of beyond just the actual findings of this specific network, I also just want to say that I think that this endeavor of trying to analyze these trained networks is actually quite helpful and can be quite informative even if the network isn't working exactly like the brain, because I think it's a good test of the tools of systems neuroscience. So I'm presenting to you these classification results. And I showed you the activity in a low dimensional space. And these are kind of standard systems neuroscience tools at this point in time, I think.
I also tried several other things that didn't really lead to any insights as to how the network worked. So I feel like this is a good place to be trying out these tools, and seeing which ones lead to the most understanding in this situation where we have kind of as much data and kind of ground truth as we can get.
And then they also do lead to hypothesis generation. So take the idea that if we see the representation of noise increasing in an area, that has to be bad for classification. In this model, we can tell that that's not true: increasing the representation of the noise could actually be a strategy to later get rid of that noise information. And as I said, I think that's counterintuitive, especially because we kind of use linear intuitions when we think about these networks. It's also important to be able to analyze these networks so that we can tell if they actually are working similarly to the brain, and then hopefully make them more similar to the brain once we learn that.
So yeah, so just to briefly talk about attention and how it could interact with local feedback networks. So attention signals don't necessarily require the first feedforward pass of activity to modulate the visual system. You can imagine I can tell you to kind of like be on the lookout for a red car. And that's an audio stimulus that I've given to you. And it can modulate your visual processing.
So the attention signals can come from frontal areas without an initial feedforward pass of the visual system. And they can just kind of modulate visual activity.
I showed in previous work that implementing the neural changes that are found experimentally with attention in a neural network is pretty straightforward, because most of the changes in neural activity with attention amount to a multiplicative scaling of the activity. And so you can do that with the ReLU unit: you can just change its slope in the network and replicate those effects.
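The slope change described above can be sketched in a couple of lines: for units with positive input, scaling the ReLU's slope is just a multiplicative gain on the activity. The gain values here are illustrative, not values from the experiments.

```python
import numpy as np

def attended_relu(pre_activation, gains):
    """ReLU whose per-unit slope is set by an attentional gain."""
    return gains * np.maximum(pre_activation, 0.0)

pre = np.array([-1.0, 0.5, 2.0])
gains = np.array([1.0, 1.5, 1.0])   # attend to the middle feature
out = attended_relu(pre, gains)     # negative inputs stay at zero
```

Applying gains like these at a chosen layer of a trained CNN is all that is needed to test where in the network attentional modulation helps the most.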
And it is the case that when you do that in these networks, you get an increase in performance on certain difficult detection tasks. So the neural correlates of attention lead to performance enhancements in these networks. But it does matter where in the network you apply the attention: you get the best impact if you do these modulations at the later layers in the network.
And so that makes it seem like these attention signals from the frontal areas should target the later layers of the visual system, like V4 or IT. And when you look at where the strongest impacts of attention are on neural activity in the visual system, you do see that later areas both have stronger effects and show those effects earlier. For example, you can see in this image that V4 seems to be getting the attention signal first, because its activity becomes modulated by attention early. And then it's believed that local feedback sends the attention signal from V4 to V2, which sends it from V2 to V1.
And so attention is kind of attacking the top of this network, but then it's the local feedback that's responsible for kind of sending those signals all the way back down. So that's kind of how those two different systems could interact. And knowing this, it kind of puts further constraints on what we can believe about local feedback if it has to also carry out this function.
So this is just what I said: top-down attention signals from frontal areas can target the end of the visual pathway, areas like V4 and IT, and those areas can then start this reverse hierarchy that sends the information further and further back.
Oh, and I just wanted to say also, if you're interested in attention and how it relates to models and machine learning, because that's a word that exists in machine learning now in addition to in neuroscience and psychology, I wrote a review that kind of tries to give a rough map of this space and see where the parallels are, and where there's room for more parallels, and that kind of thing. So that's in Frontiers. OK. So thank you.