Tutorial: Computational Models of Human Vision - Part 1 (27:06)
August 12, 2018
August 11, 2018
All Captioned Videos Brains, Minds and Machines Summer Course 2018
Pouya Bashivan, MIT
Overview of the visual pathways from the retina to higher cortical areas; approaches to modeling neural behavior including spike triggered averaging, linear-nonlinear methods, generalized linear models, Gabor filters and wavelet transforms, and convolutional neural network models; and using models for neural prediction and control.
POUYA BASHIVAN: Hi, everyone. I'm Pouya. I'm going to be talking about computational models of vision. So we're doing this together with Koh Kar, who is going to do the second half.
So the first half is going to be about encoding models, that is, basically, the encoding part of the computational models of vision, and the second half is the decoding part, which Koh is going to talk about. I'm a postdoc in Jim DiCarlo's lab, and we do computational modeling of vision.
So the way I thought about the topics that I'm going to talk about tonight is, I grouped them into three parts. The first part is, the motivation why we're studying vision. The second part is a background about what we know about vision, and primate vision mostly. And the third part is about the models that we already have for these processes. And then we're going to end it with some applications of these models in neuroscience.
So while I was preparing these slides, I was thinking whether I should talk about models first or vision first. I finally decided that it's best to talk about models first, and why we're interested in models at all. So there are two questions-- why do we need models? And then the second question, which is probably related and even more important is, how can we use these models in science?
So the way that we should be thinking about models is models as modern hypotheses. So basically, we build a model as our hypothesis. And then this model gives us a set of falsifiable predictions, which are very important for the progress in science. Because we can make some predictions, then we can test them.
And we get some success, and more importantly, we get some errors. And then, once we make some errors in our predictions, we can use those errors to fix or reinvent our models. And then what we can do is go back to the first step and keep doing this loop until we get to the best model that we can have, which is the closest thing you can have to the truth that you're looking for.
So going back to the second question-- why vision-- I think many of you would agree that vision or maybe any part of the cortex, the reason why we are really studying them is that each of these regions in the cortex or maybe mostly vision is a window into how neural networks build a compact representation of the world. And that's basically our attempt to understand how brain learns, and how brain is doing all the interesting and amazing things that it's doing. So in order to get there, we need to answer a couple of different questions.
First of all, it's how the encoded image-- in the vision case-- how the encoded image is represented by the neural responses within the peripheral and the early cortical visual pathways. And the second thing is, what's the role of these representations in efficient image coding? And maybe how we get from this efficient image coding to behavior? Which is mostly going to be the topic of what Koh is going to talk about later.
So this section is about what we know about the vision. I'm sure most of you and maybe all of you today were at Nicos' presentation this afternoon. So I'm going to be talking about this pretty quickly because you already know most of it, at least for V1, V2.
So vision in the primate cortex is made out of two main visual streams, which are called "what" and "where" streams, or ventral and dorsal streams. Each one of these streams is mostly thought of as doing a rather different thing. For example, the ventral stream is mostly about figuring out the color, or the texture, or the shape, the size. And the dorsal stream is mostly thought of as the mechanism to figure out the location, and the movement, and so on.
So I'm going to be talking about the ventral stream mostly today-- or not mostly, all of it-- which is made out of a series of regions that starts from the retina and then ends at an area which we call IT, inferior temporal cortex. Also, this is not really the end of the ventral stream because from IT, there are projections that go to prefrontal cortex. But for now, for this session, we are going to be mostly thinking and talking about this set of areas.
So starting from the retina, the retina is where the light is actually transformed to electric signals. And from there, it's propagated to a series of different areas that build a set of representations that give us an efficient coding of the image, which are basically made out of light projections from any object in the world. So the way it's done is through mainly two different types of sensors-- the cones and rods, which absorb the light and make electric signals. And then there are two other layers or two other different types of cells in the retina that receive inputs from rods and cones and then transform them in some ways. And then the last layer is actually the ganglion cells, retinal ganglion cells, which project the signal out of the retina and into the LGN, which is the next region.
So about the quality of the sensors, there are some facts about how good these sensors, the rods and cones, are. Basically, there is only a narrow region of high visual acuity in the retina, which is in the fovea here, which is very narrow. And on top of that, these sensors have a very small dynamic range. And the representation of wavelength is actually very coarse. So if this was-- yeah.
AUDIENCE: Yeah, [INAUDIBLE] a question, because I heard from someone that dynamic range [INAUDIBLE] pretty great.
POUYA BASHIVAN: On the cones and rods or in general for the visual?
AUDIENCE: I guess just in general, like for things you can see. Maybe give it some temporal type of adaptation you have [INAUDIBLE] more than 10 magnitude.
POUYA BASHIVAN: Yeah, what I'm talking about here is mostly this part only, only the rods and cones.
AUDIENCE: Right, but if [INAUDIBLE], I don't think [INAUDIBLE].
POUYA BASHIVAN: So that's the beauty of it. So the limitation that you have in the sensor is somewhat compensated by the whole system, the visual system. So that the downstream areas compensate for these weaknesses. Partly because the contrast is the piece of information that is being propagated up. It's not the luminance level, for example.
AUDIENCE: Thank you
POUYA BASHIVAN: So if you think of this as a camera, then this camera would probably not be a camera that you want to spend your money on. So the next area is actually lateral geniculate nucleus, which is an area in-between the retina and the cortex. And so there are two streams partially starting from the retina that have two different properties.
One of them is high spatial frequency and low temporal frequency, which is called parvocellular stream. And the other one is called magnocellular, which is the inverse of the other stream. So the projections of the retina go into LGN. And from here, they are projected to the first cortical area, which is V1, which we're going to talk about next.
So this is one of the oldest parts of the cortex that has been studied. So the neurons in the area of V1 are orientation-selective. And there are two types of neurons in V1 that are called simple cells and complex cells. So both of these neurons are orientation-selective.
The complex neurons actually pool over a couple of simple cells. So their inputs are coming from simple cells. And therefore, they have orientation selectivity and also some invariance to the position of the input coming in.
And one interesting fact about area V1 is the topographical map of the feature detectors. What you're seeing here is that the colors are actually coding the orientation angle of the feature detectors in each part of the cortex. And as you can see if you look at this point, there are points that are called pinwheels, where all around them the feature detectors respond to orientations at different angles. And that's interesting in the sense that it gives us some intuition about maybe learning, and how the brain as a whole is learning different representations from the input.
The next area is V2. The neurons in area V2 are mostly believed to be responding to patterns, the correlation of patterns in V1 neurons. So they receive their inputs from V1. And the interpretation for V2 neurons is that they do some kind of AND-like operation on V1 outputs. And that's what generates their output.
So in here, I'm showing you an experiment done by Freeman et al., Freeman and collaborators. So what they did was that they had a set of original texture figures. And what they did was that they synthesized two versions of these textures. One was noise images that had matched spectral characteristics. And the other one was a synthetic image that was generated to match the correlation between V1 cells outputs.
And when they showed this to monkeys, what they found was that V1 neurons would respond the same to both of these groups of stimuli. But V2 neurons would actually respond higher to this set of stimuli and less to this one. This is basically confirming the hypothesis.
So next is area V4, which I'm sure by now you're getting the flow of things. As we go deeper, the kinds of stimuli that neurons in each area respond to are becoming more and more complex. For example here, in area V4, these are two example neurons.
And the color is showing the normalized firing rates of these neurons to these different patterns. As you can see, each neuron is responding to patterns of inputs that are very complex and, at the same time, very specific. And another thing that you might notice is that the receptive field of these neurons, which is basically the part of the visual field that they respond to, is increasing as we're going deeper.
And so this is a summary of what we have seen so far. So V1, encoded edges. V2, encoded texture-like inputs. And V4 is encoding more complex gratings. So in a sense, we can say that the ventral stream is capturing visual regularities of increasing complexity.
But a question here is, are these regularities tuned to behavior or not? So I don't have an answer here. But later on, in one of the slides, we'll go back to this question and see what we know about this.
So the final area that we're going to talk about is the IT area, inferior temporal cortex. So as you can imagine, in this area, the kinds of stimuli that excite neurons are becoming even more complex and more object-like. For example, in this figure, you see that this particular neuron is responding to hands, different versions of the hand-- flipped hand, left hand, right hand-- and even to some object-like patterns that are similar to hand but not exactly, and scaled versions. But for example, it does not respond to faces or forks.
And there have been some studies showing that these object-like ideas that we think the neurons are responding to may be reduced to simpler and more abstract patterns that the same neurons still respond to. And also, there have been some attempts to try to discover the optimal stimuli for the neurons in this area. But the fact that I want you to pay attention to is that as we're going deeper into the visual cortex, it's becoming harder and harder to actually discover these kinds of optimal stimuli. Because the space of possibilities is becoming larger and larger.
So getting into models of vision now. So again, we start from the retina. Retina was the first part.
So we have a range of different possible models, starting from spike-triggered averaging. For each neuron, it looks at the patterns of input that occurred just before that specific neuron fired, and then averages over all of the stimuli that caused a spike in the neuron. That gives you the spike-triggered average, which is the average stimulus that generated a spike in the neuron.
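As a rough illustration of spike-triggered averaging, a minimal NumPy sketch might look like the following. The function name, the white-noise toy stimulus, and the threshold spiking rule are all assumptions made for illustration; none of them come from the talk.

```python
import numpy as np


def spike_triggered_average(stimulus, spikes, window):
    """Average the `window` stimulus frames that precede each spike.

    stimulus: array (T, D), one D-dimensional frame per time bin.
    spikes:   array (T,), spike count per time bin.
    Returns an array (window, D); the last row is the frame just before the spike.
    """
    stimulus = np.asarray(stimulus, dtype=float)
    spikes = np.asarray(spikes, dtype=float)
    sta = np.zeros((window, stimulus.shape[1]))
    total = 0.0
    for t in range(window, len(spikes)):
        if spikes[t] > 0:
            # Weight the preceding stimulus segment by the spike count.
            sta += spikes[t] * stimulus[t - window:t]
            total += spikes[t]
    return sta / total if total > 0 else sta


# Toy data (hypothetical): a neuron that spikes whenever pixel 2 of the
# frame shown one bin earlier exceeds a threshold.
rng = np.random.default_rng(0)
T, D, window = 20000, 8, 3
stim = rng.standard_normal((T, D))
true_filter = np.zeros((window, D))
true_filter[-1, 2] = 1.0
drive = np.zeros(T)
for t in range(window, T):
    drive[t] = np.sum(true_filter * stim[t - window:t])
spikes = (drive > 1.0).astype(float)

sta = spike_triggered_average(stim, spikes, window)
peak = np.unravel_index(np.argmax(np.abs(sta)), sta.shape)
print("largest STA entry at (lag row, pixel):", peak)
```

With white-noise input, the largest entry of the estimate should sit at the lag and pixel that actually drive the toy neuron.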
And we have linear-nonlinear methods, which are made out of a linear transformation followed by a nonlinearity. And then we have some more recent work, which are convolutional models. For example, this particular model is made out of two layers of convolution plus nonlinearity, and one dense layer plus nonlinearity.
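The linear-nonlinear idea can be sketched in a few lines of NumPy. This is only a toy under assumed choices: the filter weights, the example stimuli, and the rectifying nonlinearity are hypothetical and not taken from the models shown in the talk.

```python
import numpy as np


def relu(x):
    """Rectifying nonlinearity: negative drive produces zero rate."""
    return np.maximum(x, 0.0)


def ln_response(stimuli, linear_filter, nonlinearity=relu):
    """Linear-nonlinear model.

    stimuli:       array (n_stimuli, n_pixels), one flattened stimulus per row.
    linear_filter: array (n_pixels,), the receptive-field weights.
    Returns one predicted rate per stimulus.
    """
    stimuli = np.asarray(stimuli, dtype=float)
    weights = np.asarray(linear_filter, dtype=float).ravel()
    generator = stimuli @ weights       # linear stage
    return nonlinearity(generator)      # pointwise nonlinear stage


# Hypothetical three-pixel stimuli and filter.
stimuli = np.array([[1.0, 0.0, -1.0],
                    [0.0, 2.0, 0.0],
                    [-1.0, -1.0, -1.0]])
w = np.array([0.5, 1.0, -0.5])
rates = ln_response(stimuli, w)
print(rates)  # [1. 2. 0.]
```

Swapping `relu` for `np.exp` and treating the output as a Poisson rate gives the usual starting point for a generalized linear model of spiking.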
And the way this network is trained-- so each of these layers actually has some parameters that need to be tuned. So the way the parameters of these models are found is by showing the model the same images that have been shown to the animal and then training it by reducing the error in predicting the neural responses in the retina.
And it's been shown that these kinds of models are doing much better than the previous classic models of the retina. This is something that we will be seeing in the next slides. And this is maybe eye-opening that most of the models that we have right now, the best models that we have right now, they're following these specific rules-- they're made out of these blocks, which are convolution, normalization, and some simple nonlinearities.
So going to the next area, V1. The models here are very similar to what we had for the retina. Maybe this one, not. So the classic model that we have for V1 is a bank of Gabor filters or wavelet transforms, which look like gratings that have different orientations and different spatial frequencies.
Again, here we have a set of models which are within the class of CNNs. But the difference between this CNN that you see here and the previous one for the retina is that this is a larger CNN model that is not trained to predict neural activity. Instead, this model is trained for object categorization on a larger data set. And it's been shown that the earlier layers of such a network are good predictors of V1 cell activity. And here again, as we can see, the CNN models, the CNN and the VGG, are performing better than the classic models that we had for the neurons in this area.
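To make the Gabor filter bank concrete, here is a small NumPy sketch. The filter size, wavelengths, envelope width, and number of orientations are arbitrary assumptions chosen for illustration, not values from the talk.

```python
import numpy as np


def gabor(size, wavelength, theta, sigma, phase=0.0):
    """One Gabor filter: a sinusoidal grating under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates so the grating varies along direction `theta`.
    x_rot = x * np.cos(theta) + y * np.sin(theta)
    y_rot = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_rot ** 2 + y_rot ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_rot / wavelength + phase)
    return envelope * carrier


def gabor_bank(size=15, wavelengths=(4.0, 8.0), n_orientations=4, sigma=3.0):
    """Stack of filters, ordered by wavelength (outer) then orientation (inner)."""
    thetas = np.arange(n_orientations) * np.pi / n_orientations
    return np.stack([gabor(size, wl, th, sigma)
                     for wl in wavelengths for th in thetas])


def filter_responses(patch, bank):
    """Dot product of an image patch with every filter in the bank."""
    return np.tensordot(bank, patch, axes=([1, 2], [0, 1]))


bank = gabor_bank()

# Hypothetical test patch: a grating with vertical stripes and wavelength 8.
half = 7
_, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
patch = np.cos(2.0 * np.pi * xs / 8.0)
responses = filter_responses(patch, bank)
best = int(np.argmax(responses))
print("best-matching filter index:", best)
```

The filter whose orientation and spatial frequency match the grating gives the largest response, which is the sense in which such a bank acts as a set of orientation-selective, simple-cell-like detectors.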
So we have also models of V2. One specific model is the HMAX model, which again is made out of convolution, and nonlinearity, and normalization. So in this model, some of the parameters were actually fixed, but some of them were trainable using a task. So it's not a complete fixed model.
And Gabriel Kreiman was also a co-author on this paper. I wonder if he's here now. So far we've seen that for most of these areas that we looked at, maybe all of them, we now have a convolutional model that is doing better than the previous ones. So let's dig deeper into these models before going to other areas, like V4 and IT.
So one example of these models, maybe the first one that actually worked on a big problem in computer vision, was a model called later on [INAUDIBLE], which was proposed by Krizhevsky and Jeff Hinton. So this model was [INAUDIBLE] out of a stack of convolutions and pooling layers plus some nonlinearities and normalization, like many of the other models that we saw before. The parameters of this model were optimized on a data set of 1.3 million labeled images. And the task that the parameters were optimized for was to reduce the object classification error on this data set. And what was shown was that the feature detectors in the first layer were very similar to what people had in mind for what the V1, maybe V2 cells are doing in the brain.
So how do we use these models in neuroscience? So we have a convolutional model up here that is trained to detect or to categorize the main object in an image. And we have a brain here that you can implant some electrodes in. And then we can record the responses of neurons to some images.
So what we do is that we show the same image to both the model and the brain. And we get the features from the model, which are basically the outputs of convolutional layers, after the nonlinearity, before the nonlinearity, wherever you want. And then also at the same time we're recording from the monkey brain.
So we have a set of features, and we have a set of response variables. So we can actually regress from the features to these response variables to make a complete predictive model that goes from the pixels to actual neuronal activity. So we can do this for different areas and different feature sets.
So when we do this, it's been shown that the features in the higher, deeper layers of these networks are actually predictive of neurons in IT and V4. I mistyped here. This is V4.
And another interesting fact, this was the four-layer convolutional model that was tried here, is that the last layer was the best predictor of IT. And the intermediate layers were better predictors of V4 neurons.
So the sequence of regions is still preserved in these kinds of networks. And between 50% and 60% of the explainable variance in V4 and IT neurons' responses was explained by this model, for example. Right now, we're maybe a little higher than these numbers, about maybe 60 to 70, depending on which model we're looking at.
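The regression step from model features to recorded responses could be sketched as follows. Ridge regression is used here as an assumed stand-in for whatever mapping a given study uses, and the "features" and "neural responses" are synthetic random data rather than real recordings.

```python
import numpy as np


def fit_ridge(features, responses, alpha=1.0):
    """Closed-form ridge regression from features to responses.

    features:  array (n_images, n_features), e.g. activations of one model layer.
    responses: array (n_images, n_neurons), e.g. trial-averaged firing rates.
    Returns weights (n_features, n_neurons) and intercepts (n_neurons,).
    """
    X = np.asarray(features, dtype=float)
    Y = np.asarray(responses, dtype=float)
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    W = np.linalg.solve(gram, Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def predict(features, W, b):
    return np.asarray(features, dtype=float) @ W + b


def per_neuron_correlation(predicted, actual):
    """Pearson correlation between prediction and data, one value per neuron."""
    pc = predicted - predicted.mean(axis=0)
    ac = actual - actual.mean(axis=0)
    return (pc * ac).sum(axis=0) / np.sqrt((pc ** 2).sum(axis=0) * (ac ** 2).sum(axis=0))


# Synthetic stand-in data (hypothetical): 300 images, 20 features, 5 neurons.
rng = np.random.default_rng(0)
feats = rng.standard_normal((300, 20))
true_W = rng.standard_normal((20, 5))
resp = feats @ true_W + 0.5 * rng.standard_normal((300, 5))

# Fit on one set of images and evaluate on images not used for fitting.
train, test = slice(0, 200), slice(200, 300)
W, b = fit_ridge(feats[train], resp[train], alpha=1.0)
r = per_neuron_correlation(predict(feats[test], W, b), resp[test])
print("held-out correlation per neuron:", np.round(r, 3))
```

Repeating the fit with features from different layers and comparing held-out accuracy per brain area is one way to make the layer-to-area comparison.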
Another way that we can compare these features between the models and the brain is to see how the response of the population differs when you compare the response to one image with the response to another image, or one object with another object, or maybe one category of objects as a whole with another category. These comparisons are summarized in these matrices, each of which encodes the dissimilarity between the responses for a given set of features. The set of features could be the neural activity, in this case, for example, of IT neurons, or the features of the convolutional model, which is called HMO here. I'm sure you can appreciate how similar these two patterns are, each of which is basically encoding the dissimilarity in population responses.
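A dissimilarity matrix of this kind can be sketched in NumPy as below. Using one minus the Pearson correlation as the dissimilarity, and comparing two matrices by correlating their upper triangles, are common choices assumed here for illustration (rank correlation is another frequent option); the data are random placeholders.

```python
import numpy as np


def rdm(responses):
    """Representational dissimilarity matrix.

    responses: array (n_conditions, n_units), one population response per
    image, object, or category. Entry (i, j) is 1 minus the Pearson
    correlation between the population responses to conditions i and j.
    """
    return 1.0 - np.corrcoef(np.asarray(responses, dtype=float))


def rdm_similarity(rdm_a, rdm_b):
    """Pearson correlation between the upper triangles of two RDMs."""
    rows, cols = np.triu_indices_from(rdm_a, k=1)
    return float(np.corrcoef(rdm_a[rows, cols], rdm_b[rows, cols])[0, 1])


# Placeholder data (hypothetical): 10 conditions, 50 units each.
rng = np.random.default_rng(0)
neural = rng.standard_normal((10, 50))
rescaled = 2.0 * neural + 3.0                 # same geometry, different scale
unrelated = rng.standard_normal((10, 50))     # unrelated representation

same = rdm_similarity(rdm(neural), rdm(rescaled))
different = rdm_similarity(rdm(neural), rdm(unrelated))
print(round(same, 3), round(different, 3))
```

Because the comparison happens at the level of dissimilarity matrices, the two representations do not need the same number of units, which is what makes it usable for comparing model features with recorded neurons.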
So going back to the question that I asked earlier-- are these regularities related to behavior or not? I don't have a definite answer. But what we can say from previous studies in our lab is that the behavioral performance and the similarity in representations between brain and these models are correlated.
So finishing with the models, going to the applications. So the first set of applications that we can think of is in automation, basically industrial automation. So we can do a lot of things with a good vision model. We can do face recognition, we can build self-driving cars. We can use them in security and many forms of different intelligence.
Basically, vision is necessary to make sense of the visual world. And that's an essential part of every intelligent agent. But what about neuroscience?
So the first application that we can think of is to build predictive models of neurons, what we discussed before. In this figure, what I'm showing here is that we have the actual neuron responses in black. And we have the predicted responses from the convolutional network. And this part is the response to different images of the category "chair," for example.
As you can see, it does not exactly predict all the responses to all the different images. But for a good number of images at least, the predicted responses are very close to the actual responses. And you can then change the category to some other category, and you see that the same thing happens. And of course, all of these predictions were made on a set of images that were not used during the training of the model or the mapping, because that's essential.
Another application that we can think of is the control, what we call the neural population control. So as we talked about the model before, we first build a model. And then we can make a regression going from the model to our neural responses.
Now we have a forward model that goes from pixels to actual neural responses. And as we discussed before, the predicted responses are very much correlated with the actual firing rates. But what we can do further is, now that we have a differentiable model that goes from pixels to actual responses, we can use it in the backward path. We can use it to put the population of responses into any desired state.
For example, there are two cases that we thought of. The first was whether we can use this model to generate an image that would drive the neuron beyond whatever firing rate we have observed so far. We're calling that "stretch," or maximal drive. Another regime that we thought of was what we called "one hot population," in which we not only want to push up or drive one neuron, we also want to keep the other neurons from firing. So we want to inhibit every neuron except the one which we're driving up.
As we do this, what happens is, we can generate images for each neuron in whatever area we're recording from. So these results that I'm showing you are unpublished work on neurons in area V4. So we generate the images.
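The two control objectives can be illustrated with a toy differentiable forward model and plain gradient ascent on the pixels. Everything in this sketch is an assumption made for illustration: the forward model is a single random linear layer with a softplus nonlinearity rather than a trained network, and the "one hot" objective is written as the target response minus a weighted mean of the other responses, which may differ from the loss used in the actual work.

```python
import numpy as np


def softplus(z):
    return np.logaddexp(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, W, b):
    """Toy forward model from pixels x to predicted rates of all neurons."""
    return softplus(W @ x + b)


def synthesize(W, b, target, others_weight=0.0, steps=500, lr=0.1, seed=0):
    """Gradient ascent on pixels, keeping them in [-1, 1].

    Objective: rate[target] - others_weight * mean(rate of other neurons).
    others_weight = 0 corresponds to pure maximal drive of the target;
    others_weight > 0 also pushes the remaining neurons down.
    """
    n_neurons, n_pixels = W.shape
    coeffs = np.full(n_neurons, -others_weight / (n_neurons - 1))
    coeffs[target] = 1.0
    x = 0.01 * np.random.default_rng(seed).standard_normal(n_pixels)
    for _ in range(steps):
        z = W @ x + b
        # Chain rule: d softplus(z) / dz = sigmoid(z).
        grad = W.T @ (coeffs * sigmoid(z))
        x = np.clip(x + lr * grad, -1.0, 1.0)
    return x


# Hypothetical population: 10 model neurons looking at 64 pixels.
rng = np.random.default_rng(1)
W = rng.standard_normal((10, 64)) / 8.0
b = np.zeros(10)
target = 0

x_stretch = synthesize(W, b, target, others_weight=0.0)
x_onehot = synthesize(W, b, target, others_weight=3.0)
r_stretch = forward(x_stretch, W, b)
r_onehot = forward(x_onehot, W, b)

# Reference set: responses of the target neuron to random images.
random_images = rng.uniform(-1.0, 1.0, size=(1000, 64))
reference_max = softplus(random_images @ W[target] + b[target]).max()

others = np.arange(10) != target
print("target rate, maximal drive:", round(float(r_stretch[target]), 2),
      "vs. best random image:", round(float(reference_max), 2))
print("mean off-target rate, maximal drive:", round(float(r_stretch[others].mean()), 2),
      "one hot:", round(float(r_onehot[others].mean()), 2))
```

With a real network the gradient would come from automatic differentiation instead of the hand-written chain rule, but the loop has the same structure: change the pixels, not the weights.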
And before, I showed you what happens when we show those images to the animals. Here's what we had before-- a set of naturalistic images that we showed to the animal. And we got the predicted responses from the model and the actual firing rate. And we see that they're pretty correlated.
And we can look at the best image that excited that particular neuron, which is this image. And this red circle is showing the receptive field of that particular neuron. And here is the zoomed-in version of what's inside the receptive field of that particular neuron. This is the pattern that maximally activated [INAUDIBLE]. Once we show those synthetic images, and we plot those responses against these naturalistic images, we see that the model is actually completely pushing the neuron outside its normal range and generating close to optimal stimulus for that particular neuron.
So what did these stimuli look like? We did this procedure using five different random seeds. And you can see what each of these patterns looked like. So although they differ from each other in their details, they also look perceptually similar.
So the second version that we tried was what we called the "one hot population." What I'm showing here is the responses of two example neurons. The top row for each neuron is how each of these neurons-- which are basically a total of 50 neurons-- how each one of these neurons is responding to this pattern that was generated to drive this particular neuron up.
And as you can see, as we're pushing this particular neuron up, we're also pushing the other neurons up. So it's not really selective for an individual neuron. But once we change the loss function and include the other measurements that we had basically to drive every other neuron down and only keep this one up, we get a much better or at least closer to the desired pattern that we were looking for. It becomes much more sparse. And this happens for different example neurons that we tried.
Another interesting fact is that as we do these two different kinds of optimizations, it's interesting to look at some of these patterns that are the outcome of this optimization. So for example, if you compare these two, they're kind of dissimilar, but also they have some similarities. Maybe the luminance level is lower here, where we considered all the neurons in our optimization in order to drive the others down. So this would be a closer stimulus to the optimal stimulus of that particular neuron. So here is another application. And I think that's it.