Tutorial: Computational Models of Human Vision - Part 2 (28:34)
August 12, 2018
Brains, Minds and Machines Summer Course 2018
Kohitij Kar, MIT
Models for encoding and decoding visual information from neural signals in the ventral pathway that underlie core object and face recognition, and testing neural population decoding models to explain human and monkey behavioral data on rapid recognition tasks.
Download the tutorial slides (PDF)
KOHITIJ KAR: All right. So I will be talking a little bit about decoding today. Before I proceed, these are the two recommended readings that I would suggest. I'll put the papers in the Slack channel. These are two really good review papers: the first reviews how people think about going from neurons to behavior. And the second one is from Jim, and it goes into the details of how visual object recognition might be getting solved in the brain.
So just to review what we have been talking about. How I think about systems neuroscience in general is that you have an input to the brain. And then, the brain represents the input and then does some behavior, like making some motor movements.
So for example, you can think of like-- you're thirsty, and you want to drink water. And you see a glass of water, then you go and pick the glass of water. But let's say you see this image, then you will not pick that glass of water because it's empty. So basically, depending on how these visual images are represented in the brain, the brain then goes on to read that and produce a particular behavior.
So if you're in [INAUDIBLE] neuroscience, you're either going to study the encoding models, like how the visual world is represented in the brain. Or you're going to study the decoding model, how those representations are then used for behavior. Or you're going to study the combination of the two. And if you're not doing that, if you're going from the stimulus directly to the behavior, you're probably in the psychology department, OK?
So one important thing is that, for this entire setup, one of the most important things is behavior. You have to quantify this behavior very well to have a good gauge on how good your decoding models actually are. So it needs to be defined very quantitatively to evaluate how good the models are. And I will emphasize that good here means good models of the brain. I'm not talking about high-performing computer vision models. I'm talking about models that actually try to mimic the brain, OK?
So in terms of behavior, as [INAUDIBLE] already mentioned, typically, people have been thinking in terms of two pathways, the ventral stream and the dorsal stream. So there are a lot of studies that have gone into the dorsal stream. And I will actually touch on them during the psychophysics tutorial; it lends itself well to that.
But today, especially because I'm now working with Jim and we always record from IT and [INAUDIBLE], I'm going to talk a little bit about the ventral stream, because I know the history a little bit better, maybe. And also, this is approved by Tommy, because he has this quote in his paper: understanding how biological visual systems recognize objects is one of the ultimate goals in computational neuroscience. So I'm going to talk about the object recognition problem.
So let me motivate the problem. Because this is a very natural thing that happens every day. So here I am standing in front of the Harvard T stop. And this is a scene that's being bombarded on my retina. And I don't see the whole thing at once. I will basically parse this scene based on some eye movement. So I'm probably going to look at this and say, oh, it's Harvard square. Oh, there's a newsstand. There's a guy walking. There's the bicycles. There's a car approaching.
So that's how I'm going to parse this scene. And this is a very natural movement. I agree that there are microsaccades that are basically jittering the stimulus a little bit. But more or less, for 100 milliseconds, these are stationary images that are falling onto your retina. And I don't need to give you the context. I can just ask you to fixate and flash these images one after the other for 100 milliseconds, and you can still tell me what objects are there.
So that's the motivation why we study this paradigm. Because this is exactly what you're seeing-- this is exactly what is falling onto your retina. So we have done some research-- a lot of people have done research on this field. And we know the areas that are implicated in this. This is the ventral visual stream that we were referring to.
So in terms of-- so I was thinking about how I should organize this tutorial. So what I'm going to do is give you an example of how our lab studied this phenomenon, and how our lab built a decoding model of this behavior. So I'm just going to take you through the steps. That might make it easy to understand how one could approach the problem.
So the most important step is that you have to define and operationalize the behavior. If your behavioral metrics are not good, then the whole project falls apart. So in this case, we decided to go with a binary object discrimination task. This is a pretty simple task. An image will come up in the center, and then there will be two objects. And you have to choose which object was presented in the image.
So for example, like this. So there was a person in the picture. Then, something like this. It's a bird. I hope you guys got it. So this is like 400 milliseconds. So this is pretty much the task. There is a monkey or human or machine; they just look at this image, and then these two options come up and they have to choose. And so, in this way, we can run thousands and thousands of images. This is a very standard task, so there's nothing new about it really.
Now, the second question is, what is the behavioral metric? I'm going to show you one behavioral metric here. This is a confusion matrix. For those of you who don't know it, I'm just plotting the performance of the subject, whether it is a human or a monkey or a model, for each object when it was tested against all the other objects. So this will be like camel versus dog, that will be like camel versus rhino, and so on.
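The confusion matrix just described can be sketched in a few lines. This is a minimal illustration, not the lab's actual analysis code; the `trials` record format (true object, distractor, correct-or-not) and the function name are assumptions:

```python
import numpy as np

def performance_matrix(trials, n_objects):
    """Build an object-vs-object performance matrix from binary trials.

    Each trial is (true_object, distractor_object, correct), where
    `correct` is 1 if the subject chose the true object, else 0.
    Entry [i, j] is the fraction correct when object i was shown
    against distractor j (e.g. camel vs. dog, camel vs. rhino).
    """
    hits = np.zeros((n_objects, n_objects))
    counts = np.zeros((n_objects, n_objects))
    for true_obj, distractor, correct in trials:
        hits[true_obj, distractor] += correct
        counts[true_obj, distractor] += 1
    with np.errstate(invalid="ignore"):
        return hits / counts  # NaN where a pair was never tested

# Toy example: object 0 vs. object 1 tested four times, three correct.
trials = [(0, 1, 1), (0, 1, 1), (0, 1, 1), (0, 1, 0)]
mat = performance_matrix(trials, n_objects=2)
print(mat[0, 1])  # 0.75
```

The same matrix can then be computed for humans, monkeys, and models and compared entry by entry.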
So these are two different species, and they look very similar. So that's the second question. If you are part of an EEG lab or fMRI lab, you are probably going to use a human as a subject. We wanted to build decoding models not from EEG signals, but from neural signals. So we needed to choose a particular model animal that we can run.
So the criterion is that the model animal should be consistent with the ultimate animal that you're trying to study, which is the human in this case. So I'm not going to tell you, but one of them is a human and one of them is a monkey. Or one of them is-- this one could be the monkey, this one could be the human. But they look very similar. And they have very similar confusion matrices-- confusion patterns as well.
So in summary, we have published this study comparing object recognition behavior between monkeys and humans at multiple grains. So we chose the monkey for that. That's why we didn't go with mice. And I think that's important to consider when choosing behaviors: which animal models can you tackle?
The next thing is that, once we have chosen the monkey and the behavioral metric, we can now decide where to find the decoder. And this step cannot be completed without the encoding part. That's why I think [INAUDIBLE] went first, because people had to probe different parts and understand what an area is basically encoding.
So where to look in the monkey brain? There have been decades of neuroscience trying to probe this. And it's not only correlational evidence from collecting neural data; it's also lesion studies, causal perturbations. You perturb IT, you get deficits in visual recognition. So you know that these areas are critical, OK?
So based on research over decades, we have a model of how the ventral stream might be working, and it goes like this. So the image comes into the retina, and then it gets transformed. You can think of these as patterns of responses of these cells here, and it just gets transformed from one step to the other. So from retina to LGN to V1, V2, V4, and then to IT.
And the other interesting thing is that when you have different images, the pattern of activity in IT changes. So it does for every step. But I would like you to now concentrate only on IT. So what happens is that you keep on changing the images, and the pattern of representation keeps changing, OK?
I think it was Chou Hung and Gabriel who were part of this research. They found that you can actually decode the categories of objects based on activity in IT. I think it was 2005. So based on this study, now we know: if I want to search for a decoder, or search for an area that a decoding area might read from, here is my first guess. I'm going to guess it's IT.
So IT has different sites, and they all fire different spiking patterns for different objects. And typically, what happens is that there is this firing around 70 to 170 milliseconds, which [INAUDIBLE] was also mentioning, where most information about the object lies, OK?
And this is a result figure from their paper showing that, with around 256 sites, you can actually do a very good job at categorization. Of course, this curve will depend on the kinds of images that you have used, the kinds of tasks that you have probed the system with. So don't take this graph very seriously. Take the qualitative evidence from it: that you can categorize, OK?
So IT is fixed. The problem is that, previously, we did single-electrode recordings. I was critical of that yesterday; probably I shouldn't have been so critical of it. But I still think these new techniques of using [INAUDIBLE] arrays, where we don't record single units but many units at the same time, scale up the process by a lot. And I think this really helps, because at least the results show that the code is in the population, not in single neurons.
You can do this with single neurons, in the sense that you can keep recording single units over and over again, and then you can make a big pool of neurons. That will also work. But that takes three years, and this takes three months.
This is how our data collection rate has improved since 2005, which was a single-unit recording study. Now we are using arrays, and it's a huge exponential improvement.
OK, so how do you approach this problem computationally? You can think of each image producing a vector of responses across IT. So these are the features of IT's model of the world. So anytime an image comes in, you get a number of firing rates from each neuron. It could be a hundred to 1,000 neurons.
And if you record a lot of images, you get a huge matrix like this, OK? So now, you have shown different images, so the x-axis is individual images. And for each of these images, you get a row of this huge matrix. So the computational problem now is that you have a behavioral metric, or behavioral measurement, that looks like this. And you have to find a model that links this big chunk of a matrix to this.
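The images-by-neurons matrix just described can be assembled from raw spike counts by averaging over repetitions. All the shapes and numbers below are illustrative assumptions, not the lab's actual data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed raw data: spike counts for 1,000 images x 10 repetitions x 100
# sites, taken in the 70-170 ms window (Poisson noise as a stand-in).
spike_counts = rng.poisson(lam=5.0, size=(1000, 10, 100))

# Trial-average across repetitions to get one feature vector per image:
# the big images-by-neurons matrix that the decoder has to link to behavior.
features = spike_counts.mean(axis=1)
print(features.shape)  # (1000, 100)
```

Each row of `features` is the population response vector for one image; the decoding problem is to map these rows onto the behavioral measurements.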
So this is the decoding problem that you're trying to address. So you needed the encoding study, you needed the proper behavior, and now you're at a level where you can start asking: can I build some decoders that try to be consistent with the human behavior, or the monkey behavior? And the specific parameters of the decoders are important.
I want to emphasize that this is not a qualitative study. This is not a study that will say IT works somehow to do object recognition. We are trying to make a claim that you need 60,000 neurons, x number of dimensions, to predict 70% of the explainable variance. This is the kind of grain at which, I think, we are trying to push the field, OK? That's why the exact parameters are also very important.
So here's a figure from one of the papers. They have compared different decoding schemes trying to predict human behavior. So the consistency measure is also something important, and something I actually did not know about before joining this lab, because I typically used correlation as the measure of how similar two things are. But there is a bound on that correlation. For example, if I am a model of another human, I need to know how consistent that human is with himself or herself. That sets the limit of my performance.
So this human-to-human consistency is a crucially important term. So anytime you're building a decoding model, that is something that you really need to pay attention to, OK? So here, the results are that models based on pixel representations, V1-like models, older computer vision hypotheses, or, for example, V4-based models that are built on the responses of V4 neurons don't do as well as IT-based hypotheses.
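One common way to use this bound is to normalize the model-to-human correlation by the human-to-human reliability. The sketch below is one simple version of such a ceiling correction, under the assumption that behavior from two independent pools of human subjects is available; the exact normalization the lab uses may differ:

```python
import numpy as np

def ceiling_corrected_consistency(model, human_a, human_b):
    """Model-to-human correlation, normalized by human-to-human reliability.

    `human_a` / `human_b` are behavioral scores from two independent
    pools of human subjects on the same tasks; their correlation sets
    the ceiling that any model (or another human) could reach. This
    particular normalization is an illustrative assumption.
    """
    human_human = np.corrcoef(human_a, human_b)[0, 1]
    model_human = np.corrcoef(model, (human_a + human_b) / 2)[0, 1]
    return model_human / np.sqrt(human_human)

# A model that exactly reproduces perfectly reliable human scores hits 1.0.
scores = np.array([0.2, 0.5, 0.9, 0.4, 0.7])
print(ceiling_corrected_consistency(scores, scores, scores))  # 1.0
```

Without this correction, a model could look "bad" simply because the human measurements themselves are noisy.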
And even for IT, there is a specific set of algorithms that typically works very well, and you can see that for the correlational decoders. I don't have enough time to go into the details of it, but I will try to elaborate a little bit on what an SVM geometrically does.
The other important thing is the time at which the features were computed. So this was a static model, in the sense that we just averaged the responses from 70 to 170 milliseconds to produce each of those vectors for each of these images, OK? Now, I want to give you a feel for what the SVM might be doing in this case. And the idea is that I have different images of the same person, for example, let's say this guy Joe. And these are axes of different neurons in IT, or any visual area, for example.
Every image will have one particular spot in this multidimensional space. So here's one image of Joe. And then, there are many, many images of Joe I can show, from different angles, at different sizes. There is a space that they cover within this multidimensional space. And you can call it Joe's identity manifold, for whatever particular area you're recording from, OK?
And it turns out that if you look at V1-- this is kind of what it looks like. So if you have Joe versus-- I don't know, Harry. Individual one versus individual two. The manifolds are tangled with each other. You cannot really separate two different individuals or sometimes two different objects if you're just looking at V1 representation.
And the idea that was proposed by, I think, Jim and Dave Cox at that time-- it might have been reformulated from previous ideas-- is that object manifolds might be getting untangled as you go up the hierarchy. So you can think of this hierarchy: V1 has a much more tangled representation, and the entanglement decreases-- it gets untangled-- as you go to IT.
So for example, in IT, you might have the manifold expressing itself as kind of a smooth manifold. And now, if you look at Joe versus-- individual one versus individual two, it might look like this. And what the decoding model-- what the SVM is trying to do-- is put a hyperplane through it so that it can basically classify any image as either Joe or not Joe, OK?
So this is a very simple linear classifier. And that's the reason why it's biologically plausible that a downstream area could be just reading IT to go to behavior. So that's why we think that, in terms of decoding, it's just one linear step from IT to the behavior, OK?
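The hyperplane idea can be made concrete with a tiny linear SVM trained by subgradient descent on the hinge loss. This is a from-scratch sketch of the general technique, not the exact decoder or hyperparameters used in the papers; the toy "IT responses" are two well-separated Gaussian clouds standing in for untangled manifolds:

```python
import numpy as np

def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=200, seed=0):
    """Minimal linear SVM via subgradient descent on the hinge loss.

    X: (n_samples, n_neurons) IT-like feature vectors; y: labels in
    {-1, +1} (Joe vs. not-Joe). Returns (w, b) defining the separating
    hyperplane w.x + b = 0.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:              # inside the margin: hinge gradient
                w += lr * (y[i] * X[i] - lam * w)
                b += lr * y[i]
            else:                       # outside the margin: only regularize
                w -= lr * lam * w
    return w, b

# Toy "untangled manifolds": two separable clouds of population responses.
rng = np.random.default_rng(1)
joe = rng.normal(2.0, 0.5, size=(50, 10))
not_joe = rng.normal(-2.0, 0.5, size=(50, 10))
X = np.vstack([joe, not_joe])
y = np.array([1] * 50 + [-1] * 50)
w, b = train_linear_svm(X, y)
preds = np.sign(X @ w + b)
print((preds == y).mean())  # fully separable clouds: 1.0
```

Reading out behavior is then just the sign of `w.x + b`: one weighted sum and a threshold, which is what makes a single downstream step biologically plausible.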
So again, coming back to the specifics of the model, I cannot emphasize enough that these numbers are important: the number of dimensions that IT spans, how similar one monkey is to another, how similar one IT is to another. Even answering those questions requires you to be more quantitative and to think about how many dimensions these areas span.
So that's why these numbers are important. So whenever I want to compare this to my own study, I want to know how many neurons I need to record to get to human level performance. So these numbers are important.
Sorry, I think I skipped a slide here. This is just showing that if you're going from 16 [INAUDIBLE] sites, you have some amount of correlation for all 64 tasks. And the correlation is higher for IT compared to V4, but it's not at the level of humans. As you increase the number of sites, you get to a point where you're equal, OK?
So of course, this depends on the grain of the behavior that you have tested. This behavior is somewhat coarse, because I'm averaging over all the images of a given object to get the behavioral metric. But because the approach to this problem was leading up to this conclusion, we can now ask: OK, let's make the behavioral metric a little bit more fine-grained. So let's go from the object level to the image level. So this is the next test of your decoder: make the task harder and see whether the decoder still holds true.
So philosophically, once we have a model, this is the job of the experimenter. The job of the experimenter is to design experiments that falsify the models. And unless you have a model, there is no falsification. Science just stays in the same place, and we all have our subjective opinions about things.
So here is a case where you can do an experiment and say, look, Jim, this model is wrong. And I think that's what I did, because I started doing this in terms of image-level performances. So now, basically, what I'm doing is splitting this metric up. You need a lot more data to get to reliable measurements like this. But you can do it. With Amazon Mechanical Turk, or Monkey Turk, it's pretty easy, right?
So now, instead of having only the average over all camel images, I have individual images. Here I've shown 10, but there were hundreds of these. And you can have a huge matrix, and now you have to predict the entire thing, OK? So I'll give you an example of why I think that 70-to-170-millisecond decoder is wrong.
So remember, the previous decoder said: take the activity, integrate it from 70 to 170 milliseconds, and just use that to predict what the object was. So I recorded-- right now, I have recorded more, but the slide was made a little bit earlier. So it's 333 neurons, but it still does the job. So let's say I show this face image. And what I'm doing is I have computed the average performance of the monkey for this particular image. The monkey's d-prime is really high for this image.
These are the same vectors that I was showing, but now they are computed per time bin. So per 10-millisecond time bin. So for every 10 milliseconds, I have a vector of 383 elements. And then, I use a decoder on top of that. The decoder is still the same SVM, a linear classifier. And I try to predict: given that activity, what is going to be the neural decoding performance, or what is the predicted performance of the monkey?
So what I see here is that the neural decoding performance is really low early on. And then, at some point, around 100 milliseconds or so, it's high. And it hits the level of the monkey's real accuracy that was recorded during the time I was measuring these-- sorry. Go ahead.
KOHITIJ KAR: No. So the reason why-- we wanted to go lower and lower for other purposes. We wanted to make the time bin really tiny. But the tinier you go, the more the noise increases. So we did an analysis to understand, given an expected effect size, what is the lowest time bin that we can go to with a reasonable number of repetitions of images. So that's why it's 10.
And so we need around 70 to 80 repetitions of every image to keep the noise very low. So that was a face. And then, I showed a zebra. And then, I can do this again and get the recordings. And you can see that they're getting solved around here. Now, I'll use another image. This is an image of a car. These images are picked so that the monkey has the exact same accuracy on them.
But now, you see that the response is going up and hitting it around maybe 150, 180 milliseconds. So if I was just integrating from 70 to 170, the answer for this would not have been the same as if I was integrating from 70 to 170 for these guys. But that's-- sorry, go ahead.
KOHITIJ KAR: No, this line? Is that what you're asking? So when the monkey was doing the task, this is the accuracy that it had to tell that this is a dog versus any other--
KOHITIJ KAR: No. So this was done-- the images were shown for 100 milliseconds, and then the task was performed. So what happens is that the monkey sits, the image gets shown for 100 milliseconds, then the monkey does the task. And the [INAUDIBLE] arrays are recorded throughout. So I am basically trying to see at what time in that zone the [INAUDIBLE] array is good enough to give a behaviorally relevant prediction.
KOHITIJ KAR: No, the monkey doesn't know about this analysis. The monkey just knows that he's going to see this thing for 100 milliseconds and has to tell whether it's a dog or a house, or a dog or an elephant.
KOHITIJ KAR: The x-axis is time from image onset for the neural data. So I'm recording the neural data, which is a continuous measurement. So I can record the neural data from when I showed the image to when the image went off. OK, go ahead. Yeah.
KOHITIJ KAR: So the problem is that single-trial data has a lot more noise in it. That's why we use trial-averaged data. The interesting question is, when we do use trial-averaged data, does the decoder suffer somehow? I'm averaging out a lot of interesting phenomena that people talk about, like noise correlations and stuff. So because I'm averaging them out, is my decoder suffering? So far it isn't. So maybe trial-by-trial data is not that important for building these kinds of models.
KOHITIJ KAR: Trial averaged? Yeah, it is because of high signal to noise.
KOHITIJ KAR: No, because it doesn't matter how you get the decoder. Because you can get the decoder from trial average data. You can still use that decoder to predict the monkey's response for every trial, right?
KOHITIJ KAR: No, this is for all trials. Yeah, and you can even do this for monkeys that are passively fixating. It doesn't matter. So you can collect the data separately. And when the monkey's in the rig while we're recording, he can just be passively fixating. You can still get a pretty similar decoder. So it's a very automatic process in the head.
KOHITIJ KAR: So for example, this one is not at 5, which is our cutoff. At 5, he got everything right. Now, this one is at three to four, so there are some incorrect trials in here. And that also shows up in the decoding accuracy. So the decoder is capturing the fact that, probably, the monkey's not always going to get the right answer.
So you can do this either in a trial-averaged way, where even the predictions are trial-averaged, or you can go trial by trial: use the decoder on every trial and say, for this trial, what is the monkey going to do? I have some backup slides that address that, but I can show you a little bit later.
Anyway, I think I'm at the end of this. This was hinting at the fact that, look, the decoder is wrong. For example, we collected 5,000 images like this. And the time at which these images are solved is really variable: it ranges from 100 to 200 milliseconds. So it's hugely variable.
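The per-bin analysis described above can be sketched as follows. For simplicity, this version uses a nearest-centroid classifier per 10 ms bin as a stand-in for the linear SVM that was actually used, and the toy data is built so that the class signal only appears in the later bins, mimicking a late "solve" time:

```python
import numpy as np

def per_bin_accuracy(train_X, train_y, test_X, test_y):
    """Decode object identity separately in each time bin.

    X arrays: (n_trials, n_bins, n_sites) trial-averaged rates per bin.
    Returns accuracy per bin, so you can ask at what time the neural
    readout reaches the monkey's behavioral accuracy.
    """
    accs = []
    for t in range(train_X.shape[1]):
        centroids = {c: train_X[train_y == c, t].mean(axis=0)
                     for c in np.unique(train_y)}
        correct = 0
        for x, lab in zip(test_X[:, t], test_y):
            pred = min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))
            correct += pred == lab
        accs.append(correct / len(test_y))
    return np.array(accs)

rng = np.random.default_rng(2)

def make_data(n=40, bins=5, sites=30):
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, bins, sites))
    X[y == 1, 3:] += 2.0   # class signal emerges only from bin 3 onward
    return X, y

train_X, train_y = make_data()
test_X, test_y = make_data()
acc = per_bin_accuracy(train_X, train_y, test_X, test_y)
print(acc)  # near chance in bins 0-2, 1.0 once the signal appears
```

Running this on real per-image data is what reveals the variable solve times: the bin at which `acc` first reaches the monkey's accuracy differs from image to image.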
So recently, what we did-- so the question now is, OK, it seems like the old decoder was wrong, so where's the new decoder? And now, we have a new decoder, which is-- I'm not going to go into the details of it. It's just a leaky integrator. So the scheme is changing also. We don't discard any time bin; we just integrate until the choice screen is shown.
And it seems like that's a better decoder to deal with this problem. And the important thing is, what is the time constant of this leaky integrator? It's around 40 milliseconds. Whether that has some real value remains to be seen. So just to make a point: we were talking about the encoding model, and here is a realistic model of what the monkey might be using to go from IT to these behaviors.
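A leaky integrator of per-bin evidence is a standard exponential filter. The discrete update below is a generic sketch with the ~40 ms time constant mentioned in the talk plugged in; it is not the lab's actual implementation:

```python
import numpy as np

def leaky_integrate(rates, dt=10.0, tau=40.0):
    """Leaky integration of per-bin decoder evidence until the choice screen.

    rates: (n_bins,) evidence per 10 ms bin. tau ~ 40 ms is the time
    constant quoted in the talk; the update is a standard discrete
    exponential filter.
    """
    a = np.exp(-dt / tau)            # per-bin decay (leak) factor
    out = np.zeros_like(rates, dtype=float)
    acc = 0.0
    for i, r in enumerate(rates):
        acc = a * acc + (1 - a) * r  # leak old evidence, add new evidence
        out[i] = acc
    return out

# Constant input: the integrator ramps up and converges to the input level.
trace = leaky_integrate(np.ones(50))
print(trace[-1])  # approaches 1.0
```

Because old evidence decays with time constant tau, late bins dominate the readout at the choice screen, which is what lets this scheme handle images that are solved at very different times.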
So if you remember, I had a project in mind where you can now know what kind of images-- say, what kind of images of a dog you can show the monkey where the monkey is going to say, oh, well, it's an elephant or something. I think that's a cool experiment, because it clearly shows that you have information about the monkey that otherwise was not possible. Because you have the encoding model and the decoding model, you can even think of things like this: you show a thing getting morphed, and you can predict exactly when the percept of the monkey changes from a dog to a human, or a dog to an elephant, or whatever it is.
I think this is the end of my presentation. And you can ask me questions for 19 more days. Just to answer your question, I think this is the slide. So forgive me for this particular-- so here's an example where I want to test trial by trial, OK? Here's an image that has-- I mean, this is a little bit tricky to see, maybe. There's a plane and a bird in this image. In this image, there's also a plane and a bird. You can't see it, but believe me. You have to believe me here. So there's a plane and there is a bird. And the choices are also a plane and a bird.
And the monkey, on average, is at chance, because the monkey sees both of the objects and he's pretty confused. But because you can record, you can, for every trial, have a prediction of what the monkey might choose. This is typically known as choice probability, where you show an ambiguous stimulus and you see how well you can predict the monkey's choice on a noisy stimulus like this.
And you see that chance level is at 50%. And with this kind of decoder, which is a leaky integrator run up to the choice screen, we are at 0.65. And this value, if you have been following the choice probability literature, is pretty high. But it needs a lot of neurons, too. So maybe that's where we are getting a little bit of an edge.
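Choice probability is conventionally computed as an ROC area between the single-trial decoder outputs sorted by what the animal actually chose. The sketch below uses the standard Mann-Whitney formulation; the input format (decoder evidence split by choice) is an assumption for illustration:

```python
import numpy as np

def choice_probability(evidence_choice_a, evidence_choice_b):
    """ROC-area choice probability from single-trial decoder evidence.

    Inputs are decoder outputs (e.g. signed distance to the hyperplane)
    on trials where the monkey chose option A vs. option B for the same
    ambiguous image. 0.5 means the readout carries no information about
    the choice; ~0.65 is the level quoted in the talk.
    """
    a = np.asarray(evidence_choice_a, dtype=float)
    b = np.asarray(evidence_choice_b, dtype=float)
    # Fraction of (a, b) pairs where evidence ranks choice-A trials
    # higher, counting ties as half (Mann-Whitney ROC area).
    greater = (a[:, None] > b[None, :]).sum()
    ties = (a[:, None] == b[None, :]).sum()
    return (greater + 0.5 * ties) / (len(a) * len(b))

# Perfectly predictive evidence gives 1.0; identical distributions give 0.5.
print(choice_probability([2.0, 3.0], [0.0, 1.0]))  # 1.0
print(choice_probability([1.0, 2.0], [1.0, 2.0]))  # 0.5
```

The 0.65 quoted above would come from running this over many ambiguous-image trials with the leaky-integrator evidence at the choice screen.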
But here is the comparison with the older IT decoder, which was taking the activity from 70 to 170 and was not looking until the end. That's almost at chance. So you can do experiments to falsify these older models further. But there is an interesting space of images and experiments that we can still explore to make this an even stronger test of this kind of model.
So I think-- I don't know if I answered the trial-by-trial questions that you had. I think that's it. OK, I'll take some questions. But I think, at some point, both [INAUDIBLE] and I will take questions. So this is it.