Jim DiCarlo: Introduction to the Visual System, Part 2
June 4, 2014
All Captioned Videos Brains, Minds and Machines Summer Course 2014
Topics: Decoding of IT signals for object classification (Hung, Kreiman, Poggio, DiCarlo, Science 2005); 3D object models; detection experiments with objects of different pose placed on random background images; neural population state space; LaWS of RAD IT decoding algorithm predicts human behavior (Majaj, Hong, Solomon, DiCarlo, Cosyne 2012); optogenetic experiments (Afraz, Boyden, DiCarlo, SfN 2013, VSS 2014); models of encoding visual input at various stages along the ventral stream, involving filter, threshold & saturate, pool, and normalize operations
JIM DICARLO: OK, so let's talk about the data and what the data says. So let's now [INAUDIBLE] some of the data. This was the part-- I tried to set up a hypothesis, so this is the part where we're going to talk about evidence, that maybe [INAUDIBLE] what the evidence [INAUDIBLE] population. I'll pose the problem as I've defined it, then I'll show you what evidence we have for this, and I'll talk a little bit about [INAUDIBLE] more recently.
So, again, this was the [INAUDIBLE] here for the idea that things look something like this. So what's the evidence for that? This is work I've already briefly alluded to, that began with [INAUDIBLE], [INAUDIBLE] and [INAUDIBLE] way back in 2005. We did something very, very simple that I was surprised hadn't been done before, but [INAUDIBLE].
Let's take a broad set of objects. These are simple images-- objects on a white background. Very, very simple, a proof-of-principle idea. We presented those objects at different scales and positions, and then we recorded from an actively fixating monkey, showing each image for 100 milliseconds, because that's the snapshot duration I already introduced you to. As I said, 100 to 200 milliseconds, very similar to what I showed here, but that's what we did: [INAUDIBLE] randomly shown just like the RSVP video, so you go measure a bunch of data.
All right, this is an old movie I have from [INAUDIBLE]. Listen to some [INAUDIBLE]-- you might not be able to hear this, but-- so you guys can kind of hear that? There's some static [INAUDIBLE] while the monkey is fixating. It's a little grainy, but you get the idea. [INAUDIBLE] go to [INAUDIBLE] IT, and that's one neuron from a two-electrode recording.
Here's the thing. This is probably the first neural data [INAUDIBLE] in your field for [INAUDIBLE] images. [INAUDIBLE]. Those are action potentials for those images, which you randomly interleave, and then you collect-- you present them over time. [INAUDIBLE] despite this [INAUDIBLE], I see [INAUDIBLE] milliseconds [INAUDIBLE]. You can see a very [INAUDIBLE] response-- see what real data looks like. Here's another site like this, [INAUDIBLE] a little more, a little less, and then here's another site in IT that, if you [INAUDIBLE] for some reason [INAUDIBLE], but not so much.
OK, so that's a feel for what the raw data [INAUDIBLE] might look like. [INAUDIBLE] collected around 350 sites at the time. These are shown here-- [INAUDIBLE] responses over those [INAUDIBLE] images. Not very large by today's standards-- 78 images-- so the mean responses [INAUDIBLE], but we had a lot of data from which you could construct populations.
And, remember, the goal is to just test how good this is for doing the task. We need it accessible. We need a [INAUDIBLE] separating object A from not object A-- I'll give you those categories in a minute-- versus, OK, that was a little more inaccessible, or tangled-- there's [INAUDIBLE] layout here. This is not easily accessible. So good, bad. It's a simple idea as to how well you can do this. This is a slide showing the simulated population activity that emerged in response to one image. We did things like count spikes in one interval. We could vary that interval, but let's think of the spike count in a 100 millisecond window giving you one number. Get N neurons, you have N numbers, and you do that for a whole image set. We asked how well you could, from this, predict that a given category was there. We had eight predefined categories. I'll show you the names in a minute-- there they are-- and each image's response would be one point in an N-dimensional state space for the IT population. And then, how well can a separating classifier-- your decoder, a linear threshold, as you were discussing-- separate these images from those images? This would be, for instance, face detection. But we could do this for all eight of these categories.
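The readout scheme described here-- N spike counts from a 100 millisecond window forming one point in an N-dimensional state space, separated by a linear-threshold decoder-- can be sketched as a toy simulation. Everything below (neuron count, tuning, noise levels, the perceptron training rule) is invented for illustration and is not the lab's actual pipeline:

```python
import random

random.seed(0)
N_NEURONS = 50   # hypothetical population size
N_IMAGES = 200   # images per class (face / non-face)

# Each simulated neuron's mean spike count in a 100 ms window shifts
# slightly between the two classes (tuning drawn at random).
pref = [random.gauss(0.0, 1.0) for _ in range(N_NEURONS)]

def population_response(is_face):
    """One point in the N-dimensional state space: one spike count
    per neuron from a single 100 ms window."""
    sign = 1.0 if is_face else -1.0
    return [max(0.0, 10.0 + sign * p + random.gauss(0.0, 2.0))
            for p in pref]

data = [(population_response(c), c)
        for c in [True] * N_IMAGES + [False] * N_IMAGES]
random.shuffle(data)
train, test = data[:300], data[300:]

# Linear-threshold decoder, trained with the perceptron rule (a
# stand-in for whatever classifier was actually used).
w, b = [0.0] * N_NEURONS, 0.0
for _ in range(20):
    for x, is_face in train:
        if (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) != is_face:
            step = 0.01 if is_face else -0.01
            w = [wi + step * xi for wi, xi in zip(w, x)]
            b += step

accuracy = sum((sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == c
               for x, c in test) / len(test)
print(round(accuracy, 2))   # single-trial accuracy, well above 0.5 chance
```

The point of the sketch is only the shape of the computation: one number per neuron per image, a learned weight per neuron, and a threshold.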
This is what you get from [INAUDIBLE] from [INAUDIBLE]. It just shows, you don't need [INAUDIBLE] to get up to near-perfect [INAUDIBLE] on [INAUDIBLE] performance on single trials with [INAUDIBLE]. You can also do identification quite well.
The point-- the interesting part to us is that you can train classifiers to say [INAUDIBLE] central [INAUDIBLE] the distance from each side, and then test on different repeated images of that. And, of course, you could do quite well with that. That's not that surprising, but what's interesting is that you can also do very, very well if you change the position-- you see just a little bit of a drop in performance here.
And that's really reflecting the IT [INAUDIBLE] properties that I showed you about earlier. So [INAUDIBLE] population. It is not [INAUDIBLE]-- the area shown in [INAUDIBLE] is not nearly as good as IT at doing this. The input to IT is certainly not as good. [INAUDIBLE] bunch of work, so this was the kind of evidence that got us thinking the way that I showed you on the last slide. This is how we think about the state space of IT.
Now, again, I want to point out that all I showed was that information for doing those kinds of tasks could be decoded. But remember the big picture that I presented at the beginning: we want to be able to map between neural activity and human judgments of these things. So the question is not just whether there is information [INAUDIBLE] about these things, but how well it predicts what the humans are going to report, or what the monkey is going to report.
So mapping between this domain-- spiking patterns in the [INAUDIBLE], or say IT more specifically-- to a report in this domain. [INAUDIBLE]. OK. So that's called, again, a decoding algorithm. The goal is to build a real model that could be falsified, that has testable predictions, and that's what we tried to do. So the goal here is to predict behavioral report-- not just to say that you can get information with decoders [INAUDIBLE] here, but to say, what's the mapping between this and behavior? So that's here. [INAUDIBLE].
So these are the folks that did the work I'll show you next-- [INAUDIBLE] a post-doc in the lab, [INAUDIBLE] a grad student, [INAUDIBLE] an undergrad at the time. Again, we're in the domain of core object recognition, and remember, if you want to ask whether [INAUDIBLE] predicts behavior, you have to actually go and measure behavior. I showed you a little bit of that earlier with the monkey, but now I'm going to show you what we measure.
We didn't characterize the whole domain of core recognition. That's challenging, although we've made some recent progress on that. What we did do is test behavioral performance on a sample of tasks meant to span the domain, and I'll show you those in a moment. You can decide how well they span it, but I'll just show you what we did.
We did this not by just taking images off the internet. We wanted to control the latent variables, so we used 3D object models. We could then control the underlying variables, render them, and place them on backgrounds. We placed them on uncorrelated backgrounds-- not integrated, just uncorrelated, meaning the background was randomly chosen with respect to the object. That produced some weird-looking images. Here are some examples. [INAUDIBLE] here, but the point-- the reason we like these [INAUDIBLE] images and this strategy-- we had been doing computer vision modeling at the time, and computer vision modeling at the time was really struggling with these kinds of images, because a lot of the performance was being driven by leaning on backgrounds and tricks like that.
So this was something that humans still did really well, yet machines were having a hard time with it. That was a big part of the reason we chose this kind of approach: to try to isolate where humans [INAUDIBLE] current machines at that time. And the [INAUDIBLE] can do that [INAUDIBLE] OK. [INAUDIBLE] I mean, [INAUDIBLE]. OK. Those are a little grainy, but I think-- those were about 150 milliseconds each, and you probably could do almost every one of those. Maybe you couldn't tell because I was saying the word, but you get a sense that you could do this quite well. So it's the same thing I showed you earlier, just on a more defined [INAUDIBLE]. We did basic-level categorization first, so we had eight categories, and there were eight objects per category. We generated these images at low, medium and high variation. What that means is, those [INAUDIBLE] parameters we already have were restricted to a low range, a medium range or a high range. I'll give you a sense of those kinds of images here.
So here are eight face images generated under high variation, with big changes in position. Here I'll put the random backgrounds behind them. Here are some non-face objects, [INAUDIBLE], at high variation. [INAUDIBLE]. So we like to break everything down into binary discriminations. This is discriminating face from non-face.
So what I'm going to show you is a lot of binary discrimination performance estimates that we have made from both humans and neurons. The simplest way to think about a binary discrimination: train a linear classifier to separate this from that.
AUDIENCE: Do you say why you want to use these--
JIM DICARLO: These crazy backgrounds?
AUDIENCE: OK, yeah.
JIM DICARLO: Instead of--
AUDIENCE: Instead of realistic backgrounds?
JIM DICARLO: So the main reason is efficiency. If you eventually want to put it [INAUDIBLE] machine, you could bring machines to working even when they're not natural images. But [INAUDIBLE] you don't-- you need a lot of images to show that [INAUDIBLE], and we can't test that many images on the neurons. We wanted something that would isolate [INAUDIBLE]. So it will be challenging, but it's more compact. That's one reason. Those underlying variables are also very interesting. As I mentioned earlier, [INAUDIBLE] contain that information?
But with random images pulled down from Google or ImageNet, you know, you can rely on human annotation, but you don't have access to [INAUDIBLE], you understand? So that was the other--
AUDIENCE: You could generate images just in-- consisting of--
JIM DICARLO: Consist in context, right. We were trying to get isolated objects as opposed to what you might call context, and you might say, well, OK, I'm interested in context [INAUDIBLE]. We are too. We just wanted to start with a clean, simple object case, so this was our attempt to do that.
AUDIENCE: There's still an interaction here with context.
JIM DICARLO: No. There's no--
AUDIENCE: But I see the plane hitting the ground, I immediately am very surprised by this, and it might be a different sort of mechanism that I'm using to identify it than if I were just to [INAUDIBLE].
JIM DICARLO: All right, so we're just asking you to identify, give me a name, and you saw an image [INAUDIBLE] plane [INAUDIBLE] right-- they're not using my normal plane mechanism when I do that-- [INAUDIBLE] criticize that, but we would assume you are.
AUDIENCE: [INAUDIBLE] context [INAUDIBLE] because you don't expect to see a plane [INAUDIBLE].
AUDIENCE: Yeah, but, I mean, it's quite surprising.
JIM DICARLO: [INAUDIBLE] not surprising. You get used to, like, OK, [INAUDIBLE] the background is kind of, like, flat and that's kind of what it is. Your computer [INAUDIBLE] is like uh-oh, I can't cheat anymore and [INAUDIBLE]. So that was really [INAUDIBLE], so they were [INAUDIBLE]. I'm not trying to claim [INAUDIBLE] all the different [INAUDIBLE]. I should have mentioned that. This was our first step. I'd rather motivate the strategy [INAUDIBLE] test behavior, test [INAUDIBLE]. I'm just showing you what we've done so far.
So all your points are well taken. This is not the end; this is just the history of how we got there. OK, so keep that in mind as I show you the results. So we kept it pretty basic, but we did include subordinate discriminations, [INAUDIBLE] discriminating, you know-- I don't even know what these are-- [INAUDIBLE] from [INAUDIBLE], different [INAUDIBLE] that are slightly different from each other. Here are faces. These are hard to tell apart, but these are actually different faces. Notice they don't have hair or glasses or teeth, but they are actually different shapes. [INAUDIBLE].
So I want to take you through the behavioral data-- [INAUDIBLE] out there, binary or two-way discriminations from those data. I think you could probably do that. Think of each of these as a binary task. So this is discriminating-- that right there-- animals from all of these other categories at low variation. And this is the d-prime level. Red means very good performance; five is nearly perfect-- d-prime is infinite at 100%, but five is nearly perfect-- and a d-prime of zero corresponds to about 50% correct on the binary discrimination.
So these are psychophysical measures that capture things like [INAUDIBLE]-- we measure d-prime, which here is a performance measurement. And what I want you to see is that, hey, [INAUDIBLE] is harder than [INAUDIBLE]. Some tasks are harder than others for some reason, and high variation is harder than low variation.
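The d-prime scale being described comes from signal detection theory: d' is the difference of the z-transformed hit and false-alarm rates, so d' = 0 is chance (50% correct on a balanced binary task) and d' grows without bound as performance approaches 100%. A minimal computation:

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Signal-detection sensitivity: d' = z(hit) - z(false alarm)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# d' = 0 at chance (hits and false alarms equally likely) ...
print(round(d_prime(0.5, 0.5), 2))    # 0.0
# ... and it diverges toward infinity as performance nears 100%:
print(round(d_prime(0.99, 0.01), 2))  # 4.65
```

In practice rates of exactly 0 or 1 are clipped before the z-transform, since the inverse normal CDF is undefined there.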
All of this is intuitive. You want to say it's obvious from the physics of the world that this could be true, but you can't take low-level pixel models and easily reproduce this behavior, and I'll show you that in a moment. First of all, humans find some tasks harder than others, reliably. This is not noise in the data I'm showing you-- like, put three different humans [INAUDIBLE] here. These are pooled human data, but I can show you different [INAUDIBLE] with the same pattern, and I'll quantify it later.
The pattern is reliable across subjects and, again, it is not explained by low-level [INAUDIBLE] intuition [INAUDIBLE] what it should be. I would argue [INAUDIBLE] do that. These are constraint data-- the kind of data that should be predicted by the decoding algorithm. If it's correct, it should show the same patterns that we see here.
That was the first-level thing you would like from a decoding model-- [INAUDIBLE] not the only thing, but the first thing. And [INAUDIBLE] you present an image, you measure a bunch of neural activity, [INAUDIBLE] spikes in a window. This is simulated data here, just to show you. You measure this response of the neuron to that image-- [INAUDIBLE] complex images here [INAUDIBLE]-- you've got a lot of images, and you want to say, how can I separate these [INAUDIBLE] green face images? Come on, face.
Again, this is a hypothesis-- not just that there is information about faces, but a hypothesis about the kind of mechanism that supports face detection behavior, or report. That's the hypothesis. The prediction is that the algorithm, as implemented here on the neural basis, should predict the human or monkey behavioral performance on the [INAUDIBLE] you have, and it should be predictive for all tasks that we measure-- not just [INAUDIBLE] for one task, but the whole battery of tasks that I showed you.
[INAUDIBLE] predicted that-- if you believe the model, that [INAUDIBLE] your decoder-- and you do this, you can predict the performance. That was the prediction. How do we test it? We're going to measure a bunch of neurons first, just like we did before. In this case, we're going to [INAUDIBLE] time record an [INAUDIBLE], along with that [INAUDIBLE] could record from around 100 [INAUDIBLE] in parallel.
So [INAUDIBLE] sample a few hundred neurons again along the [INAUDIBLE] IT-- parts of IT-- and also before that, the [INAUDIBLE] input to IT. We sampled from several monkeys, pooled the data together, and you get-- here's a response, [INAUDIBLE] those [INAUDIBLE] responses, 168 [INAUDIBLE] to that particular image. Here's a bunch of other images. And we actually collected thousands of images now, not the 78 [INAUDIBLE] I alluded to earlier. This is just one data set out of IT [INAUDIBLE]. This is just the top [INAUDIBLE] of the IT mean response data. Of course, we have the spikes and everything [INAUDIBLE]
And remember, the question is: does this predict performance? An algorithm [INAUDIBLE]-- I mean this: I need it to predict d-prime values for all 24 tasks on the bottom of that previous slide, at all of the levels of variation-- we had two to three levels of variation that we tested [INAUDIBLE]. And when I say, does this predict, I mean this hypothesis: a linear decoder, appropriately weighted on this, to predict performance. So, again, [INAUDIBLE] down through the neurons, [INAUDIBLE] appropriate [INAUDIBLE]. That's the biological [INAUDIBLE] hypothesis to us.
So I'll just show you the key result. The key result is that when you apply this kind of decoding algorithm on the IT feature space-- which I'll call the code-- and then compare with human-- and we think monkey, although this is technically human [INAUDIBLE]; we had [INAUDIBLE] monkey [INAUDIBLE]-- so far it predicts human behavior. And the algorithm, in words, is basically this: sample roughly 150 random sites, spatially distributed over IT.
This is sort of what I implied before, but I didn't lay it out in words. So they're random. They're distributed. We measure each site's average response over 100 milliseconds. Again, that's something we can vary, but what we did was take a 100 millisecond average and, for each object, learned an appropriate way to [INAUDIBLE].
We can talk about the learning algorithm, [INAUDIBLE] type of classifier, how many [INAUDIBLE] examples-- we'll show you that. But it's the same for all tasks, and so it's a somewhat complicated algorithm: learned [INAUDIBLE] on the randomly selected, 100 millisecond average responses distributed over IT.
That's what we were implementing as a hypothesis of what's going on-- the decoding algorithm. That's still too long, so [INAUDIBLE] acronym [INAUDIBLE] I'm going to call this the LaWS of RAD algorithm, so it has a name. It's not just a hypothesis [INAUDIBLE].
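A toy sketch of that recipe-- sample ~150 random sites out of a recorded population, take each site's 100 millisecond average rate, learn linear weights for the task, and report held-out performance as a d-prime. The simulated 168-site "recordings" and the simple nearest-mean classifier below are stand-ins for the real data and classifier:

```python
import random
from statistics import NormalDist

random.seed(1)

N_SITES = 168      # sites in the (simulated) recorded population
N_SAMPLED = 150    # randomly selected subset, as in the algorithm
N_TRAIN = N_TEST = 100

# Stand-in for real recordings: each image yields one 100 ms average
# rate per site; a third of the sites carry a weak task signal.
def average_rates(label):
    return [random.gauss(10.0 + (1.0 if (label and s % 3 == 0) else 0.0), 1.5)
            for s in range(N_SITES)]

images = [(average_rates(lab), lab)
          for lab in [True, False] * (N_TRAIN + N_TEST)]
train, test = images[:2 * N_TRAIN], images[2 * N_TRAIN:]

# Step 1: sample ~150 random, spatially distributed sites.
sites = random.sample(range(N_SITES), N_SAMPLED)

# Step 2: learn linear weights for this task (here a nearest-mean
# readout; the actual classifier used in the work may differ).
def centroid(rows):
    return [sum(r[s] for r in rows) / len(rows) for s in sites]

mu_a = centroid([x for x, lab in train if lab])
mu_b = centroid([x for x, lab in train if not lab])
w = [a - b for a, b in zip(mu_a, mu_b)]
thresh = sum(wi * (a + b) / 2 for wi, a, b in zip(w, mu_a, mu_b))

def decode(x):
    return sum(wi * x[s] for wi, s in zip(w, sites)) > thresh

# Step 3: report held-out performance as a predicted d-prime.
hits = sum(decode(x) for x, lab in test if lab) / N_TEST
fas = sum(decode(x) for x, lab in test if not lab) / N_TEST
z = NormalDist().inv_cdf
clip = lambda p: min(max(p, 1.0 / N_TEST), 1.0 - 1.0 / N_TEST)
dprime_pred = z(clip(hits)) - z(clip(fas))
print(round(dprime_pred, 2))   # one predicted d' for this task
```

Repeating this with one learned weight set per task yields the per-task d-prime pattern that gets compared against the human pattern.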
Here's the part of the image that I just alluded to, [INAUDIBLE] really impressive: all we did was apply this algorithm. Here's the predicted performance of the algorithm [INAUDIBLE]. Here's faces versus all, [INAUDIBLE] object binary discriminations, and what you see is that this is almost on the unity line here. So 64 objects [INAUDIBLE], and it-- this isn't perfect, almost like [INAUDIBLE], and the spread off the line here is no greater than the spread-- I mentioned the humans were very similar, but if I put a single human here against a pool of humans here, that is indistinguishable from this. I'll quantify that for you on the next slide.
Here's the human-human consistency. This is now a correlation [INAUDIBLE] the correlation, so it's really about how much this pattern-- oh, I'm sorry, forget about [INAUDIBLE]-- just the pattern [INAUDIBLE] lies along the line here. It's highly correlated. I'm ignoring the absolute performance here, which is important, but for now it is being ignored-- [INAUDIBLE] the pattern. Here's the hypothesis: we take learned weights on [INAUDIBLE] neurons, and that's right up here.
We tried a lot of other things in IT, and I want to just show you one here: [INAUDIBLE] recorded in IT for each task. That's a certain type of decoder-- kind of grandmother-cell-like. Here's where it ends up. If you do things like take [INAUDIBLE] and decode on that, you don't get nearly as high. [INAUDIBLE] don't get as high. This is something I implied earlier, from the earliest study I mentioned: take pixels, or imagine low-level models that would do this-- you do not get the predicted patterns. You can take the [INAUDIBLE] algorithms of the time-- some of them are here-- and none of them come close to this. But these [INAUDIBLE] are passing the [INAUDIBLE] test, if you will, decoding off of IT in the way I described for you. For all intents and purposes, with these measurements, it is indistinguishable from another human being. I read out IT, apply [INAUDIBLE], and that means it's consistent with the data-- not just a correlation, but within the range of perfection [INAUDIBLE].
OK, so that's part of the reason we think this algorithm is a viable hypothesis at the moment. We can also look at the confusion patterns-- not just the pattern of reports across those tasks, but what kinds of errors are made. So here's [INAUDIBLE] from the behavior-- the algorithm is reporting the kinds of mistakes that will be made. I don't have the scale for you, but red means more mistakes of a certain type, [INAUDIBLE] diagonal, so we're just going to look off the diagonal. Here's the actual human data on the [INAUDIBLE] images and, remember, these aren't single images, these are pools of images, so [INAUDIBLE] the average mistakes. So it's not at the image-level grain, but you can see that these look very, very similar-- the [INAUDIBLE] diagonal elements. They're not perfect, but the noise-corrected correlation is actually 0.91, which is almost perfect: 1.0 would be as perfect as we could get within the [INAUDIBLE] of the data.
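Comparing the model's confusion matrix with the human one amounts to correlating their error patterns, for example the off-diagonal entries. A minimal version with made-up 4x4 matrices (the 0.91 figure in the talk is additionally noise-corrected, which this sketch omits):

```python
# Pearson correlation between two confusion-matrix error patterns.
# The matrices here are invented for illustration; rows = true
# category, columns = reported category, off-diagonal = mistakes.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def off_diagonal(conf):
    k = len(conf)
    return [conf[i][j] for i in range(k) for j in range(k) if i != j]

human = [[90, 4, 5, 1],
         [3, 88, 2, 7],
         [6, 1, 91, 2],
         [1, 8, 3, 88]]
model = [[85, 6, 7, 2],
         [4, 84, 3, 9],
         [8, 2, 86, 4],
         [2, 10, 4, 84]]

r = pearson(off_diagonal(human), off_diagonal(model))
print(round(r, 2))   # high r = the decoder makes human-like mistakes
```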
[INAUDIBLE] even on these confusion predictions, at the object level. Now, that's medium variation. At high variation, for reasons we don't yet know, the algorithm is not perfect. There is internal [INAUDIBLE], and it's interesting that it's only at 0.68-- these colors are, sorry, a little hard to read-- but it's only at 0.68. You know, someone could write a paper and say, well, 0.68, that's really cool, but [INAUDIBLE] it is a failure, because [INAUDIBLE] it is a deviation from perfection, which means there's something wrong with the model that we need to understand. OK, maybe that means we're looking in the wrong [INAUDIBLE] area, and that's where we can dig in and refine the algorithm going forward.
AUDIENCE: Sorry. Can I ask a question? Low, medium and high variation, is that variation within a class?
JIM DICARLO: Sort of. Think of it like you're doing all those tasks-- there are 24 tasks I showed you-- either with little variation, that's low; medium variation, that's [INAUDIBLE] position; high is [INAUDIBLE] of the input that it can be--
AUDIENCE: How much does it generate space, how much variation is there in the object projection?
JIM DICARLO: OK. So these are-- we kind of ran those as separate groups. This [INAUDIBLE] contains some of that. Right. So you're just expanding the range, but the broader [INAUDIBLE] of the region.
OK, so I'm just going to-- I mean, there [INAUDIBLE]. I'll just quickly say that, when I talk about this, you can look at finer rate codes [INAUDIBLE] like I did way back in the earlier paper. They might improve things a little bit-- this is the 100 millisecond average, these are finer rate codes. You might improve a little bit-- there's a trend here. But all [INAUDIBLE] at the levels of our current data, it doesn't improve things much, and they're all passing the [INAUDIBLE] test.
If you look at the-- because these were recorded simultaneously, you could ask about [INAUDIBLE] correlated data versus not correlated data. [INAUDIBLE] not much difference there. It seems to get worse if you, um-- sorry, the trend is slightly better, but it's not much different. And you can also start getting into the face literature that I talked about earlier. If you think that face [INAUDIBLE] is supported by the face blobs, you could build decoders that only work on [INAUDIBLE] face blobs. That's something we did here.
I won't take you through it other than to say that, when you do that, your predictions only get worse. So in our [INAUDIBLE], that doesn't seem to help much-- assuming a prior like that does not help predict the patterns of human behavior. So remember, basically what happened here [INAUDIBLE]: just sample randomly and learn, and you do better than if you start to do things like, oh, all the faces must be done on the [INAUDIBLE]. It doesn't make things better to do that.
AUDIENCE: Were you recording from face patches?
JIM DICARLO: Yeah. So we're recording over most of IT-- solid samples. Some of the [INAUDIBLE] happen to be in parts that cover some of the face patches. Now, we don't have [INAUDIBLE], but there is so much beautiful work [INAUDIBLE] in [INAUDIBLE] that shows [INAUDIBLE] the same things, so you can almost [INAUDIBLE] say, oh, this is the one, these are like the little face [INAUDIBLE], and so that's what we did for [INAUDIBLE], and then you can measure the neurons and their face selectivity so they're comparable to those measures as well.
AUDIENCE: Are you always testing [INAUDIBLE]?
JIM DICARLO: I'm sorry, I should have mentioned that. All the predictions are-- you train on something, and you're always testing on new images. I should have said that earlier on. Everything is a prediction on held-out images-- predicted performance [INAUDIBLE]. Is that your question? So every time I show you a decoder result, [INAUDIBLE] always held out [INAUDIBLE].
AUDIENCE: The high variation [INAUDIBLE] neurons?
JIM DICARLO: That's right. That's the kind of stuff we want to dig into. So what we want to do now is start looking at-- I'll show you a plot as a function of the number of neurons, but [INAUDIBLE]. I haven't done that yet for what you're talking about, but that is exactly the kind of [INAUDIBLE]-- we just don't have enough sampling to make those predictions.
AUDIENCE: Can you see a difference in--
JIM DICARLO: You get better as you add more neurons to a point. I'll show you that [INAUDIBLE]. I can show you that now for the off diagonals that [INAUDIBLE] performance pattern--
AUDIENCE: Was the first diagonal--
JIM DICARLO: No, the [INAUDIBLE] pattern is like the diagonal of the [INAUDIBLE], just collapsed [INAUDIBLE]. The coarsest is [INAUDIBLE] prediction. The next [INAUDIBLE] is the [INAUDIBLE]; the next finest would be image-by-image level prediction. And at the next point, there would be an image-by-image, trial-by-trial prediction that we haven't done-- like we did on the [INAUDIBLE].
AUDIENCE: You [INAUDIBLE] medium [INAUDIBLE] is longer than any length you want?
JIM DICARLO: It's comparable. I mean, a lot of the algorithms can predict well at low variation. So like [INAUDIBLE] can predict pretty well at low variation. What separates out those other algorithms, actually, is going to those harder tasks.
So here I'm going to get to the [INAUDIBLE]-- we've talked a lot about this. What is this mechanism? The way we think about it-- remember the face patch stuff I talked about earlier? This is just how I think about this hypothesis. We've recorded over [INAUDIBLE] of IT-- in [INAUDIBLE] IT there's about 150 millimeters squared of tissue in each hemisphere. And remember, [INAUDIBLE] similar but not exactly. Imagine that you had a downstream neuron-- this is a simple model, but it's a model-- that samples from 50,000 random neurons distributed across IT. After training [INAUDIBLE], it ends up with about 5,000 heavily weighted neurons-- in the classifiers we test, about 10% end up heavily weighted.
Remember, we're not recording 50,000; we're inferring this. We're measuring much lower numbers, on the order of 150, as I showed you in the data, and you can infer from that-- I'll show you that in a moment. But when you end up with about 10% heavily weighted, you're going to have neurons that look like this. You weight roughly 5,000 neurons in IT, and of course you're going to weight the features that are [INAUDIBLE] for discriminating the task, based on training examples. One thing I want to point out-- this is to connect for neuroscientists the idea that these are not just black-box decoders; this is what downstream neurons could do. And the other thing I want to connect is faces: these blobs, for instance, mark a face patch, like the ones from earlier. Here's a downstream neuron. It starts out not knowing what to do, but it learns about faces from training examples. It would end up, almost by definition, leaning a lot on those spots, because that's what we're finding-- they're good at discriminating faces from non-faces. These are good features for discriminating faces.
You would end up with something that might look a lot like this at the end of the day-- not a thousand neurons, we think, more like 100 [INAUDIBLE]. The point I'm making on these slides is this: it is the same algorithm producing different decode outputs-- the exact same algorithm producing everything that you've seen so far.
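The "heavily weighted 10%" idea can be illustrated with a toy simulation: train simple linear weights on a population where only 10% of units carry the task signal, and the large weights concentrate on exactly those units. All numbers below are invented for illustration:

```python
import random

random.seed(2)
N_UNITS = 500
INFORMATIVE = set(range(50))   # 10% of units carry the face signal
N_EX = 400                     # training examples per class

def response(is_face):
    # Informative units shift their mean rate for faces; the rest don't.
    return [random.gauss(2.0 if (is_face and u in INFORMATIVE) else 0.0, 1.0)
            for u in range(N_UNITS)]

faces = [response(True) for _ in range(N_EX)]
others = [response(False) for _ in range(N_EX)]

# A generic learned linear weight per unit: mean(face) - mean(non-face).
w = [sum(f[u] for f in faces) / N_EX - sum(o[u] for o in others) / N_EX
     for u in range(N_UNITS)]

# Which units end up heavily weighted? Almost entirely the
# informative 10% -- without any face-patch prior built in.
top = sorted(range(N_UNITS), key=lambda u: -abs(w[u]))[:50]
frac = len(INFORMATIVE & set(top)) / 50
print(frac)
```

The learning rule here knows nothing about which units are "face units"; the concentration of weight falls out of the training examples alone.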
So one learning strategy applied to an IT basis gives you the data that you've seen so far. And it accommodates, if you like, the idea of face neurons, or face patches-- it's not obvious that it doesn't come with [INAUDIBLE] at a larger frequency.
AUDIENCE: [INAUDIBLE] doing a schematic up on the bottom there, it tells me that communication detection works.
JIM DICARLO: I'm sorry? Because I thought you meant something--
AUDIENCE: Just including the neurons and the face patches [INAUDIBLE].
JIM DICARLO: Oh, for this, these would be randomly selected-- this is our preferred model. You could compare a version where you limit the downstream neuron to only have lines from the red patches, and then learn. And that's just like reading out the face patches.
And then you have to go get a [INAUDIBLE], because some of your faces have to [INAUDIBLE]. But the reported predictions across and amongst the face patches don't correlate quite as well with the humans.
Now, it's a trend. It's not significantly worse. But it's not better.
AUDIENCE: And you don't need the subordinate face patch for that?
JIM DICARLO: Well, there are some [INAUDIBLE]. Remember, I mentioned we used subordinate tasks, and we did test them. I'm going through it quickly. Yeah, if you wanted to go back, I could show you on that slide. There were subordinate [INAUDIBLE].
The published [INAUDIBLE] of what face patches support is not well defined, right? Is it supporting detection, [INAUDIBLE], discrimination? It's everything with faces-- [INAUDIBLE] faces count. It's not well defined.
In our hands, it was discriminating these faces from each other, and detecting human faces from other categories.
And we don't see that unique bias toward downstream learning yet-- we can't make that prediction fit our data standards yet. But we haven't dismissed that idea. We prefer the [INAUDIBLE] idea, so far, to capture the data.
And in the end state, you end up with something that people who like to think of face patches shouldn't be unhappy with, because that's just what they might think already. It's really, ultimately, a question of the learning [INAUDIBLE] with which we start to distinguish that.
Now, maybe we do image-level analysis, and we push harder. We may find that pre-defining a prior to put the neurons to read from that face patches might lead us to slightly better predictions when we push harder to the questions here about why is it now perfect. [INAUDIBLE].
I'm telling you, kind of, where we are. So far, I"m contextualizing the perspective face patches with my introduced [INAUDIBLE].
AUDIENCE: According to this model, how flexible are those downstream neurons? Are you making this claim that you eventually got to hard neurons and base neurons, or are these more, the connection's more defined by cache, how you're reaching out and finding them.
JIM DICARLO: Yeah. So, again, this is something we hypothesized. These neurons can live in downstream like pre frontal, and [INAUDIBLE]. And if you think about the [INAUDIBLE], maybe these are now the [INAUDIBLE] cells. They get dispersion. Right?
Somewhere you have hold these could be dynamically engaged in prefrontal [INAUDIBLE] some order. You kind of have to believe in, at least, dynamic [INAUDIBLE] cells, to the extent that we can use a patch. We've got to have some cell [INAUDIBLE].
We haven't found, but Earl Miller's lab, again, trained monkeys in [INAUDIBLE] in categories like neuron [INAUDIBLE]. [INAUDIBLE], and [INAUDIBLE] a number of years ago. We like that idea, but we haven't gone after the general [INAUDIBLE] is consistent. We start wanting to put what we found in a context that other people have shown, and this is how I think about it. But you might go in and say, well go find those neurons and now go record from them.
Again, some people have gone. But what we haven't done is be passionate. So, I don't know if I answered your question. It's a hypothesis of what might be there based on the decoding of the [INAUDIBLE].
[INAUDIBLE]. But that's where we are. The part that we like is that we can take the same algorithm and we can predict a lot of stuff. We don't have to do anything fancy there. That's the part we really like.
Imagine the number of neurons seen. And here's something you've probably asked about, is increasing numbers of IPRs and decoding. This is the consistency, that correlation. Here is the [INAUDIBLE] area, the grey cells, here's the red identified here. And [INAUDIBLE] there, they're passing as a human.
This is the performance. That's that unity insight. This is why, if you could be high performance and computer vision sensing.
In general, if we add more neurons, you always get better. But you don't increase the consistency of correlation with humans. You don't find the same patterns of difficulty [INAUDIBLE]. That's why they're stressing that as the letter that's more interesting to figure out what human brains are actually doing.
Instead of increasing our neurons throughout the course, we'll get better, and better, and better. But, that's sort of [INAUDIBLE] constraints.
But a lot of can make an estimate that I've already alluded to, to get to that point of around 500 more features with this coding strategy. And, I want to point out, though, I'm sure I've mentioned this before that this depends on the number of training examples. So this is with 100 training examples per object.
This is a manifold of possible solutions that would all give [INAUDIBLE] predictions right now. So with 100 training examples per object, you end up with the predictions I showed you. That would need to be about 500 neural features, as I just implied in the last slide. Those features are actually computed by averaging data over trials, when we do the analysis.
So to factor this out to real neurons, we have to make some adjustments for real neurons and their joint levels. And that ends you up with a number that's more like 50,000 single units. That's why I [INAUDIBLE] that number comes from.
So this is kind of the algorithmic level of what we did to the data. But his is our inference of what you need to support the real time performance.
Again, you could pick a different point, depending on how much training you think subjects get. And [INAUDIBLE], we don't know. What I showed you is that you could just take one point here, and you could do all the tasks they've laid out.
AUDIENCE: And isn't this the relationship you see? The red and black?
JIM DICARLO: So, OK, get this. This is harder to get this, right? This is training examples, this is the number of features needed, and this is the family of solutions that all make the same kind of predictions that I showed. They're really indistinguishable to us.
OK? You got that so far?
JIM DICARLO: Then you want to know about the mapping between the red and the black?
That has to do with the fact that when we give you analyses, we first average the neurons. There are two parts. So let me think of a simple one to understand this. Pretend that we had six times. Same [INAUDIBLE]. We just average those responses together to get the response of the feature, we'll call it. To that integer.
The brain can't average over 50 presentations. It has to respond to one.
JIM DICARLO: So, if we pretend to [INAUDIBLE] up, we can then estimate that everything still seems to work. What we can't measure is 50,000 units for all of these paths. You can't do that.
AUDIENCE: I've heard this part. I've [INAUDIBLE].
JIM DICARLO: The highest, most [INAUDIBLE]. The other part is smoothing the [INAUDIBLE]. These are [INAUDIBLE] averages, often. Step off of doing it, and corral the data. That changes, a little but, by about a factor of two. [INAUDIBLE].
We get other inferences from some analysis that I've didn't distribute, that I didn't [INAUDIBLE].
AUDIENCE: And this is for a specific number of [INAUDIBLE] charge?
JIM DICARLO: This is for the [INAUDIBLE], the 64 behavioral path, which are those 24 categories of the difference levels of variation. Remember that 22 prime [INAUDIBLE] without decoders? That's her ability to do that, to predict that.
OK. That's the number of [INAUDIBLE]. Some people think this is too low, some people think it's too high, I don't know. I mean, you can see with the free parameters here, how much training examples these human subjects get to learn tasks.
The instructors begin to study that in the monkeys, but we don't actually know how to pin that number down yet. So, we're describing what data sufficient looks like.
I want to just say, what is this good for? We're building an algorithm here, a predictive algorithm. Well, now in principle, we should be predict, if it's right, we should predict for any task in this domain. Result, you should be able to predict the pattern of that behavioral performance.
And we should be able to predict, if we manipulate IT neurons, I'm going to be helpless to know under some assumption, what should happen to performance. And if I could have the ability to push the neurons around, and make predictions about what should happen behaviorally. I think that should be obvious to you guys. And they're more specific than saying, I push face neurons in face paths.
I push this neuron over here, and these add up, all these tasks are changing exactly this way. And it's almost predictive.
I had to fly on [INAUDIBLE]. So, one of the things we're doing now is we've moved neurons around with learning them, [INAUDIBLE] change [INAUDIBLE], and the behavior should then change in a very quantitatively predicted way. And I don't have time to tell you about that.
I'm going to briefly tell you about how [INAUDIBLE] genetics work. And, again, we're trying to, in this case, silence neurons, that produce as predicted, perceptual changes.
I'm not going to give you what you want. What I want, too, which is to change for one of those features that bobble up and down, and said the behavior should move exactly this way. That's what the principles postulate.
What I'm just going to show you is some proof of principle that we can move neurons around, and they affect these idols paths. You're going to have to wait, probably a couple of years before I can show you the results of what we're really trying to do, which is manipulate these neurons individually. And then see if that creates response.
But does the theory of that make sense? If you guys [INAUDIBLE], really go into those 150 little cubes that you [INAUDIBLE], bobbling up and down. And it should be quantitatively predictions across a range of behavioral paths, if the [INAUDIBLE] is correct.
That's the way we're perceiving trends here. [INAUDIBLE], does that make sense?
So that's in the slide here. Remember, we can go mapping very precisely, so want we wanted to do is say, why don't silence just those neurons? You know, if we an get into the base question and be like, how do the face patches effect clock cache? That's one way to think about. But for us more broadly, what happens when we silence that? What happens to [INAUDIBLE] if I just shut those neurons down briefly? What is I shut down those neurons? What is I shut down those neurons?
That was what we wanted to do. Little millimeter chunks, if you will, this is Arash Afraz of [INAUDIBLE].
Do we have, how do we shut down neurons without these magic tools of genetics? Remember when we heard about genetics? OK.
Magic tools. That's how neurons quickly [INAUDIBLE] cooled off genetic [INAUDIBLE]. You never had the ability to do that before.
It's not how I activate yours by silencing them briefly. So here's an example of IT neurons driven with the [INAUDIBLE] stimulus, [INAUDIBLE], [INAUDIBLE] two neurons. Trying to laser light, boom, shut this down. We block that with the laser, and it comes back right away,
Now, there are two of our stronger sites. It's not always this good. But it shows the potential of what the spectrum can do.
I'll just show you briefly the behavioral effect. This is Arash's data on discriminating male versus female faces. So, this is a gender discrimination task. It's not one of the tasks I showed you.
And this is parallel work in the lab that I wanted to try to apply what I have been telling you. But Arash just wanted to state, if you discriminate males and females, trained monkeys on males versus female discrimination tasks, here's an [INAUDIBLE] in performance. Here's the up shot.
You barely need to turn the light on. And in particular small regions of IT, low mixes on [INAUDIBLE] trials, and not others. And you lead randomly. You've suck out neurons with a laser. You get a 2% drop in performance and [INAUDIBLE] level. You get no drop in the [INAUDIBLE] field, and that's what we would have predicted, given the layout of IT.
So, this is all I wanted to get you excited about, to manipulate high level behavior. You might get 2% [INAUDIBLE]. If you're interested in that, I have a cool slide, again.
But why is this so small? We think that's consistent with everything we know, that I've told you so far. And I'll try. I can show you some of that at the end.
But it's small, but reliable.
AUDIENCE: Can you say a little bit about the state of optogenetics in [INAUDIBLE]?
JIM DICARLO: So, yes. Far below where the mouse is. But the state is, you can put viruses in and they express. And then you can get [INAUDIBLE]. Nobody's shown anything in IT yet. Granted, you get to assume behavioral [INAUDIBLE] in some places. But people certainly can show [INAUDIBLE] can activate or silence neurons with these kinds of technique to some degree. But not as strong as you can, in say [INAUDIBLE].
So the state is the tools are doing something to the neurons, but they're still not tied to certain cell types yet. Well, certain layers. They're not [INAUDIBLE]. That's the key.
So we have a whole [INAUDIBLE].
AUDIENCE: Is this like a permanent change?
JIM DICARLO: Oh, well the neurons are infected. And they're infected for a long time. We don't know if that fully functions, but there is [INAUDIBLE]. And those neurons are stable. This is why I've been working optogenetics [INAUDIBLE] and others had worked out how to do this and what kind of things you can actually get this to express in a safe way, and not block up the neurons.
AUDIENCE: Do people notice [INAUDIBLE] changes?
JIM DICARLO: Not that we've noticed. But there may be things that turn out to be [INAUDIBLE]. As we're getting older and dying, neurons aren't dying on [INAUDIBLE]. There may be subtle changes, but we just don't know what to measure for. But we can use changes by turning the lights [INAUDIBLE].
AUDIENCE: Isn't it the general that you there's a think area around that, and you turn it off it will be worse? [INAUDIBLE] independent of [INAUDIBLE] the model?
Bare in mind, you would say, I have a model which says base neurons do base tasks. And if we were doing gender discrimination, you might say that's facial. [INAUDIBLE] So that means that you say that you've already predicted if I shut down neurons in face patches it should effect a base. That result should happen.
Right? I want to disagree with you. But I would say that you then can't for what would happen if I asked you to discriminate from [INAUDIBLE]. Should it affect that task? Yes or no?
Face neurons. Should it affect it?
No, right? It's just a weakly predicted model. It only predicts some domain that we don't actually understand. Again, it's not saying that there's not some predictions. They're just weak predictions.
They just make tighter predictions about other tasks. With the idea to show you the power of the [INAUDIBLE].
We did the proof of principle, which is consistent with the Mendel model, which is why we did that. That's the think that everybody thought would work. They make sure the tools are working. Then they want to do something a little more interesting, that I wish I could tell you more about [INAUDIBLE].
But wanting that [INAUDIBLE] in any sphere, and I'll just go really quickly through. It turns out there's a little bit into coupling. It's not that the base [INAUDIBLE] usually defined, remember that red and blue dot, it's that that's predictive of the effect of the performance.
It's the gender discriminate ability, how male versus female. In each, they have one [INAUDIBLE] that's the best predictor of performance. And that is not perfectly correlated with the face versus object.
So, they need to be correlated but not perfectly correlated. So there might be spikes here that discriminate male versus female. But they're not as good at facial versus object. In other spheres, they're good at facial versus object, but not as good at males versus females.
So there is some coupling there. It's not super strong, but there's some coupling.
So, I think it's more interesting than just saying that face patches do everything with faces. I think it's much more subtle than that. But that's where the edges are [INAUDIBLE].
AUDIENCE: So, I'm interested in that because,
how do you reconcile then, like Doris' lab, and then, like our lab [INAUDIBLE]. You get this really big face effects and not for clocks or objects, or anything. And then the decoder would suggest that you can do it outside of the region, or inside of it. And it's just as good, either way.
JIM DICARLO: No, no, no. The order doesn't stay just as good outside. Remember that was the slide that said when you decode, you're going to have some face signal at IT and you just [INAUDIBLE] well.
If you're descriptive outside, you're not as good. Header is actually very similiar to the human patterns. So you need more neurons. So you're not as good per neuron.
So, sorry. Just to be clear about that. You're talking about the idea that if you stimulate here, that you're going to see the light series of patches that would [INAUDIBLE]. Is that the thing where you--
Yeah. The information seems to be like you can classify overall, like, many of these neurons [INAUDIBLE] IT. But then if I stimulate, I can affect the perception of just faces in a face pattern. So even though the information may be present across it, is it really only the stuff inside, though, that's involved through perception?
JIM DICARLO: I think that's still the edge of the work. I know I saw Doris at [INAUDIBLE]. There was a cool poster on stimulation and the facts of that.
Were you talking about [INAUDIBLE]?
AUDIENCE: Like, hers or [INAUDIBLE] up to the electrodes dump to.
JIM DICARLO: Oh, the are for ET work? Yeah. So, right.
Except, I think there's clear evidence that if you mess around, you're going have have something to do with faces, right? I don't think out group gets that.
We're trying to move into a domain where we [INAUDIBLE] to predict, what are we going to do with for a larger range of patches. That's where a larger model comes in.
What might end up happening is the idea of these [INAUDIBLE] modules might become a little more muddy than that, if we have a more prohibitive view of the system. It kind of depends on the downstream reader, ultimately,
I don't think there's and idea of [INAUDIBLE] right now, but I think there are entering questions there.
One practical one is, do these neurons do anything about faces? You might predict that if you look at the strong form of this one, I found some neurons when I test clocks versus oranges, that you have zero impact. That, to me, is a prediction of a modular face hypothesis.
If it has n impact, at all, then you better throw away that strong hypothesis. Right?
AUDIENCE: You already told us that the hypothesis isn't true, though, right?
JIM DICARLO: What?
AUDIENCE: From [INAUDIBLE] to where he was at, that's very strong.
JIM DICARLO: No, no. That meant causality [INAUDIBLE] the neurons. We don't know that from that [INAUDIBLE] data. We know from our [INAUDIBLE] data that the information about platforms, just the deep hypothesis, is that information used by [INAUDIBLE]? And you can only test that with direct [INAUDIBLE] intervention. That's the crux, I think, of people's debate about models.
Nobody's debating that there's some spatial structure in there. We showed that earlier. But we base it on how would it be used. And I think that's still open and interesting.
And at I think this is going to be fun to see hot it all plays out. I think all we're trying to do here is set up a [INAUDIBLE] of stuff that you like to consider a lot of, we'd like them all to work out at some consistent phase. Wouldn't it be better to think of, let's study creation of these objects so that they're two completely separate areas.
Maybe that's the better way to think about it. We like to think about it at a unified space, for now, because maybe that's [INAUDIBLE]. I like to wrap my mind around it that way, but it may turn out be better explained in the former way. I think that's the [INAUDIBLE] come at if from prior [INAUDIBLE]. Yes?
AUDIENCE: I actually thought that Doris' stimulations worked [INAUDIBLE] from disruption with clocks, and objects like that. Right?
JIM DICARLO: Right. When Doris showed that, and then I said, you have suggested the stronger concept is false, she said, well, but not I think the monkey's receiving the clock as a face. And that's why it's making some mistakes. So then, I kind of didn't know what to say.
I mean, you got to make a prediction. Again, this is a lack of predictive models that make predictions. I would think that, I don't know. You don't have an alternative use like that, unless there's a strong prior that's got to be something to [INAUDIBLE], right? An the predictions are, what patch the effect is.
I don't know. So I'll see how the debate shakes out. Again, I don't think there's a strong debate. Yes, I think these are increasing questions that people are [INAUDIBLE] in [INAUDIBLE] lab.
And my [INAUDIBLE], we're all hiding some say [INAUDIBLE].
I'm sorry. Let me just jump to the end. I want to skip. I remember giving you guys all these words, but you know, if you the performance of how good each of these regions is, to support a patch at gender, and then you plot the optical effect of how much [INAUDIBLE] can effect animal behavior. Remember a small effect, around 2% change, we see that the more useful the features are, the more of the effect is not m but we can see there's this trend here that's more useful each millimeter is by thinking of it from a downstream algorithm point of view, the more optical effect you get.
Which we get is kind of proof of principle, but there's not engaged the debate or discussion we've been having here. Again, this is just at the beginning of patching [INAUDIBLE]. They just want to show you the idea out there, for those of you interested in that stuff. The tools are allowing us and other to start to do this.
OK. I want to end by talking about encoding models. We only have 15 minutes left. I've spent a lot of time talking about encoding models. I think that's a fun discussion, the link between neurons and behavior. But one of the things these models us, is they start to tell us, remember, they have to fight for IT. They tell us, look, if you can just predict the mean rate within this window, then this [INAUDIBLE] told me that so far that's a pretty good predictor for behavior.
So they focus on the kind of thing that we might want to need to predict. And so that's the kind of things that we're going to test. How well does the encoding I to predict that? So that's one way that those models have been formed with encoding, but they don't tell you what the encoding model is, they just tell you the sufficiency so far. What needs to be predicted with this type of problem that I'm doing.
So let's talk about encoding models in the last 15 minutes here. So remember, this is decoding. Encoding is images to neurons. OK? That's encoding.
Large class [INAUDIBLE] of our old model, [INAUDIBLE], Tommy's Lab, Yokushima, all names down here. A lot of people work at this. It's a very good class of models.
They basically consist of stacks of linear filters, followed by non [INAUDIBLE], like thresholds, and saturations followed by some sort of cooling operation, and all these commonly occurring mobilization operations.
Sometimes in different orders. But these are the basic operations that many of these models share.
So these are meant be models of possible [INAUDIBLE] scheme algorithms. You might say, oh, this is too forward, or this is too back. I wouldn't disagree with you, I"m just telling you the class of models that have been tried. And then I'll show you what we've done. The neurons, they all have large fanning like neurons. They have no [INAUDIBLE]. They're convolutions models. That is, whatever filter you saved goes flying across the whole visual field.
Each layer has many types of training functions. He's like a B1 model. You have many different types of orientation filters apply across the visual field.
Here's some other complex nonlinear functions now on the input, now applying across the field. That's what convolutional means. And it's deep then, so Johan would say, well, as long as it's level three, three [INAUDIBLE]. Heat means greater than three, according to Johan.
So, I'm going to tell you about our modeling [INAUDIBLE] domain. But what I want to do is to give you some context to say remember what's cool about these models is not an argument about the specifics. What's cool is that different in psychology. We can predict the models. These are models that make predictions that can be tested. So each model, whether it's right or wrong, make predictions about what layer. That allows cache within a lot models to be refined. And I think that's the spirit of the CBM enterprise.
As long as you make a prediction, that's better than making no prediction in my mind. A wrong prediction is still better than a model that can just predict anything. It is not useful.
I went on to say that this large class of models has thousands of unknown parameters, which I'm pooling up in this data here. Things are high in many, many parameters, as you might imagine. They're heating up in this class.
They have to do it like a filter shaped [INAUDIBLE], each level, what is a [INAUDIBLE], how do you threshold, how do you cool, what's the normalization strength? You can imagine a whole bunch of parameters.
And so, what we want to do is ask, which of these algorithms, I would say, if any, many of you already know that this is impossible. We don't have any feedback. This is not going to work. But we let's just look at the class as, which of these algorithms, if any, is the one that I would change?
And what we chose to do was try to optimize for something. Let's optimize to find specific algorithms in this class. Or we could ask much more.
Which algorithms do we want to pull out of this large gangway of [INAUDIBLE], and what parameters do you want to choose? What optimization target are we going to choose? Well, we chose an optimization target, a visual path, that's I've already implied [INAUDIBLE] built for salt. Either built by evolution, and, or development.
That optimization target is a variant core recognition task. That's when we picked up an optimization target.
So, I think when I say we, I mean Danny as a postdoctorate. I'm a graduate student. And what they did, what you optimize, for these kinds of tasks, a large variety of objects. In this case, there's 36 objects. Rendered with high duration, some of the same strategies, they're just of the uncorrelated background.
These are not the objects I'm going to use the way you're testing. So they were just different three objects that they were going to use. They're completely different objects when we do the testing.
So we're just making a space of stuff generated from objects where you have discriminate one from another over a high variation. That's the intuition, and what we're optimizing for. And so, again, what optimize means is find a model that has a fixed set of parameters, which we HMO1. Because they slang for hierarchical modular optimization, a term they coined for the modeling.
I don't have time to tell you about it. But it's published now, if you're interested. But you're going to have to choose a specific set of parameters with a specific model, not a [INAUDIBLE] model, a specific model that has all its parameters fixed.
And it has four layers, which we think of as modeling these areas of the brain by opposites. And again, the model parameters are fixed just by optimizing performance on those kind of cache.
And so, once we pull down a model like that, optimized performance on task, we're never looking at any neural data, at all. Just try to do these kinds of tasks well.
Then what we've been doing is think, of course, we can measure all of this. Let's just take the model featured and ask how well they predict the single neural IT responses.
Is everybody with me so far in what I did? I don't you don know the details about the optimization. But there's a large family of largely C4 models. We find parameters in there that do perform all of the tasks, and all we're going to have to follow is just predict that.
A heck of a large space of hypothesis, which we found one under a meta icon, which is optimization for performance on [INAUDIBLE].
AUDIENCE: So, is the idea that you're matching the IT response. But is the idea that parameters are learned at a similar way that the brain might learn the placement?
JIM DICARLO: Yeah. So, great question. We're agnostic as to how the brain optimizes to get to this point. It could be evolution, it could be develop, it could be both. All we're trying to say with this work is if we just try to find a model in this biologically constrained space that we like, the classic model, if I find one that we think IT does, is that offering any more of a powerful prediction on the response for neurons than other models that haven't been introduced. It's agnostic to evolution. Our techniques of optimizations certainly not what we think the brains uses, with either of those.
So just think of it as a way to fish out something of a class and ask, did the optimization target, with a lot of structure, lead to better predictions?
But you're question's great because that's really the future. [INAUDIBLE] we like to make it learn, or evolve, like the brain really did. Which is much harder, but also almost more interesting, in a way.
But this is just to try to get us in the game.
AUDIENCE: Is the prediction based purely on firing patterns in IT?
JIM DICARLO: You're asking what are we going to test? So this is the questions. I haven't shown you the result yet. But we're going to ask how well does this predict that? I think you're asking what grain of prediction, like spike-by-spike, or abortive.
So as I alluded to, we're going to average responses in IT over that 100 second window that was a very good predictor of the behavior from the earlier part of the topic that I showed you.
We're not going to try to predict spike-by-spike. We're going to predict the mean response of each IT neuron to each image.
AUDIENCE: Where did that 100 milliseconds [INAUDIBLE] come into the model? Is that at all?
JIM DICARLO: The model does not have any temporal dynamics at all. This is like this large classic model. You probably already heard it on this.
It's more about spatial interaction across the visual field, what we think is interesting about these models. It's not a temporal model, yet.
AUDIENCE: Is that something you had earlier I found very interesting, and I think that's missed in the model. The fact that when you go from station to station, it does not potentially [INAUDIBLE]. There's some complication there. So what is that, and why?
JIM DICARLO: Good point. And I'm looking at that when I say look at these, arrows. Right? There is local feedback, [INAUDIBLE] normal [INAUDIBLE], but there's no feedback in these models here. So any rumblings here would be independent of any feedback lines of between areas. There's no cortical-cortical feedback model at all.
And again, you could say it's missing something that biology has. That doesn't mean it's necessarily wrong, it's just a start. We just don't have it there yet.
I think this gets to deep questions about, if you want to predict exact dynamics of B4, this model's not going to do that. It's not going to offer that. And I would agree with you. We get at the mean rate of IT over the images as a first pass.
That's all we've done.
You put that as a grain of prediction. We're not going to predict every spike. We're not going to predict all the dynamics. But I tried to convince you earlier, it doesn't seem like it means much to [INAUDIBLE] those dynamics, as least up here, to actually explain the behavior. And that may not be as fully satisfying, but it depends on what your goal is.
But that's what we've done. And we could imagine a dynamic model. We just don't have it yet. We're [INAUDIBLE] how you integrate across stations, and what nonlinears would you use.
I said to my philosophy, start with something you haven't set up. If it's simple and wrong, but at least it gets you started, and see how well you can do.
AUDIENCE: Do you guys know if performance depends on the data set that you use?
JIM DICARLO: The performance of the model, or the [INAUDIBLE]?
AUDIENCE: I guess, either way, for predicting--
JIM DI CARLO: I haven't even shown you how to predict anything.
AUDIENCE: As a said, I sort of know,
JIM DICARLO: Yeah. OK. So, you can predict something. I want to show everybody else that. I think you're asking how the optimization target interact with the prediction set? That's my translation. Or, the model class.
AUDIENCE: Well, actually, you can think of it like this, constructing a correlation between background. In the real world there are correlations.
JIM DICARLO: Great. So maybe if we had actually optimized the model on those [INAUDIBLE], we could make better predictions.
AUDIENCE: Yeah. I just wondered if you guys had tried that.
JIM DICARLO: Not yet, no. You should talk to Dan in the lab, if you want to. He's interested in all that stuff.
And there's lots of cool things you can do in an extension of this. We just wanted you to see this is a different strategy. Just to point out, if you can try to fit these neurons up here, then they usually take the neural data and then try to fit it directly.
And as it turns out, there's not a lot of neural data. But if you had a lot of it, you could optimize more. So it's just a different way to get into this: take a guess about what the system is doing, and then [INAUDIBLE] those predictions rather than fitting the data directly.
That's what we've done. And we can now predict IT way better than the previous models that we had in our hands. [INAUDIBLE].
Maybe I should show you that. We're almost out of time. Can I just show you guys this [INAUDIBLE]?
AUDIENCE: Well, it relates to the last slide.
JIM DICARLO: OK.
AUDIENCE: What do we make of the number of parameters in the model?
JIM DICARLO: Nothing, other than that it's a very complicated model, even though it has no feedback, which would make it even more complicated.
AUDIENCE: You can use thousands of parameters?
JIM DICARLO: Yeah. But they're locked down. Once we lock them, there are no more free parameters. They're locked according to the task, then held fixed, and we issue predictions of IT with all its parameters locked. We used no neural data to set those parameters.
AUDIENCE: So it's all about performance on the task?
JIM DICARLO: That's right.
JIM DICARLO: And I should say, for the aficionados, you might ask, wait, how do you pick out each [INAUDIBLE] and map it to a single neuron? You have to take some data from that neuron. What we do is linear regression from some of the [INAUDIBLE] for each neuron, from the features [INAUDIBLE], and the linear regression predicts on a held-out [INAUDIBLE].
If you [INAUDIBLE] here, it depends on how many neurons, or how many features, we have to regress on. So a little bit of data is used to actually make the final prediction.
All those other parameters are [INAUDIBLE] locked down. It's just the feature mapping. We're just asking: is this neuron linearly spanned by this [INAUDIBLE]?
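A minimal sketch of the mapping step described here, with synthetic data: the model's features are held fixed, a ridge-regularized linear regression maps them to one neuron's mean rates, and the fit is scored only on held-out images. The function names, ridge penalty, and data shapes are my own assumptions, not the lab's actual pipeline.

```python
import numpy as np

def fit_neuron_from_features(F_train, r_train, lam=1.0):
    """Ridge regression from fixed model features to one neuron's mean rates.

    F_train: (n_images, n_features) model features (all upstream parameters locked).
    r_train: (n_images,) the neuron's mean response to each training image.
    """
    n_feat = F_train.shape[1]
    # Closed form: w = (F'F + lam*I)^(-1) F'r
    return np.linalg.solve(F_train.T @ F_train + lam * np.eye(n_feat),
                           F_train.T @ r_train)

def explained_variance(r_true, r_pred):
    """Fraction of response variance explained (the r-squared reported in the talk)."""
    ss_res = np.sum((r_true - r_pred) ** 2)
    ss_tot = np.sum((r_true - np.mean(r_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```

The only neural data consumed is the set of training responses used to fit the weights; every other model parameter stays locked, and the score always comes from images the regression never saw.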
Here's the prediction. OK. The black line is the actual mean response at 100 milliseconds, and [INAUDIBLE]. Here's a bunch of images, not timed [INAUDIBLE]. So you can see there is more or less a certain average structure here.
You can see, for some reason, it likes pears. The red line is the prediction of this model. You can see it tracks really, really well. Its correlation r squared is 0.48. That's against [INAUDIBLE] at 0.7. It's not perfect, but it's really, really good.
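The ceiling of roughly 0.7 mentioned here is the kind of number typically obtained from a split-half reliability estimate: correlate the mean responses from two halves of the repeated trials and apply the Spearman-Brown correction. A sketch with synthetic trial data; the trial counts and function name are assumptions:

```python
import numpy as np

def split_half_reliability(trials, rng):
    """Estimate a neuron's noise ceiling from repeated presentations.

    trials: (n_repeats, n_images) responses of one neuron to each image.
    """
    n = trials.shape[0]
    perm = rng.permutation(n)
    half_a = trials[perm[: n // 2]].mean(axis=0)   # mean over one half of repeats
    half_b = trials[perm[n // 2:]].mean(axis=0)    # mean over the other half
    r = np.corrcoef(half_a, half_b)[0, 1]
    # Spearman-Brown correction: each half used only half the repeats
    return 2 * r / (1 + r)
```

On this view, a model r-squared of 0.48 against a ceiling near 0.7 means the model captures a large share of the explainable, image-driven variance for that neuron.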
These are some example images here. These were never seen. The objects themselves were never seen by this algorithm before. That's the point we were just discussing.
So it performs reasonably well over these kinds of not fully natural, but naturalistic, images, on which we've measured IT neurons.
Here is a face neuron. You could use it, again, as a face [INAUDIBLE]. But there's always structure in each [INAUDIBLE].
[INAUDIBLE] pattern here, a prediction, that's actually a [INAUDIBLE] neuron, which I would call, simply, a face neuron. The neuron doesn't respond to all faces. But that structure is also really well [INAUDIBLE].
There are other neurons you wouldn't look at and categorize as [INAUDIBLE] anything. But the model can actually predict this neuron pretty well. There's something about the same features.
These are r squareds, and they tend to be about 0.5, so about half of the variance is explained.
Let me show you this in the context of other models and ideal observers, as fraction of explained variance. We take a bunch of other models and do exactly the same thing. [INAUDIBLE] would be one, there would be a two-layer model [INAUDIBLE] and other [INAUDIBLE] that we built.
A bunch of other models here. They all predicted about 0.2 on this data. This model, again, it was really impressive that the top level was at 0.5. So, again, that's a dramatic improvement, but not perfection, which would be 1.0.
And each level gets better and better. So that's pretty cool.
I think we've got to probably stop soon.
Also, what's really neat, what Dan, the leader of this work, really likes is this. We [INAUDIBLE] predict this [INAUDIBLE]. But let me take these intermediate levels, where I showed you Ed Connor's work. [INAUDIBLE].
What words should we use for [INAUDIBLE]? Like the word face is used [INAUDIBLE] words with [INAUDIBLE] who understand it. [INAUDIBLE] here is very mysterious.
You can take these models that were optimized for the object [INAUDIBLE], and then you can look at the intermediate levels. The intermediate levels of this model, for some reason, layer 3 predicts them. V4 was predicted really, really well.
Even though, again, we never solved for a neuron [INAUDIBLE], it's better than all these other models here at the intermediate level, and at the top layer they get worse. It's really striking: you just optimize these models and you can already predict the higher layer, but also the intermediate level, at least at the grain that we've been [INAUDIBLE], 100 millisecond mean responses for each image.
I'm just going to close this and put this up. This is a [INAUDIBLE]. This is a population-level metric. By these kinds of metrics, we're almost perfect. That's a more compressed grain of analysis. We're almost perfect at predicting these kinds of [INAUDIBLE], for those of you who know that style.
If you don't, don't worry about it. I just want to leave this group with this take home message of what we did. Maybe you guys can [INAUDIBLE].
This is the performance of an algorithm on the high-variation recognition task, and this is why we did this whole thing. Sorry, I should have shown you this from the beginning.
This is the ability of a model to [INAUDIBLE] responses. These dots are samples out of that model space I showed you. They are different parameter settings within that large family of models.
Here's some existing models from previous years of [INAUDIBLE]. Here's a model we built [INAUDIBLE].
So, what I want you to see is that there's structure here: the better you do on this, the better you do on that. And this, again, supports the idea that you can optimize on this and you might do better on that, which is one of the things neuroscience explains. So what we did was optimize here. We found the model that we call HMO, and it's now up to 50% [INAUDIBLE].
Obviously, if we do this well, if we keep optimizing along the right path in the right model space, you should be able to predict better and better. And that's kind of what we're trying to do.
So that's the theory of what we did. We can get encoding functions out, and talk about development. And I'll end by saying what needs to be done next: we need to start predicting [INAUDIBLE] image. I've already mentioned this. We need to talk about dynamics. And keep asking those questions about these kinds of algorithms.
We need to test with direct neural perturbation; I've alluded to the proof of principle that we have. We have to make these encoding models now. This is just a starting point. I gave you one, one [INAUDIBLE]. That's just one example in this family.
How do we fit the rest of the variance with encoding algorithms? That's a good discussion to have. Are any of these unsupervised, or more developmentally appropriate, as in Emily's question earlier? These are all open questions. They're not just for the field. They're for you guys to chew on.
And I think this came up early in the discussion. This is one set of visual tasks that I call [INAUDIBLE]. We're thinking about 2.0. We're working on 2.0. What's this [INAUDIBLE]: we're going to expand the domain of tasks here so that it gets bigger and bigger, so to speak. And, of course, we want to do that, as well.
And then it's up to you guys. For me, as an experimental scientist, algorithms that predict stuff are good. The more of them we have, the better. They can capture experimental data. That's why building models is a useful thing to do. Because that's how [INAUDIBLE].
These guys did all the work. I thank them along the way. Ha Hong and Dan Yamins, especially, for the work I showed you through most of this talk. And the folks listed, as well. Thank you guys for listening to my talk.
JIM DICARLO: So, Tommy, I know you have to cut us off. I'm sure there are [INAUDIBLE] teed up here.
TOMMY: No, we have five minutes.
JIM DICARLO: OK.
AUDIENCE: So one, maybe, takeaway from the modeling work is that the IT responses didn't constrain your model at all. They're just used to validate it. So do you think we're just not at the right [INAUDIBLE] of the scale where we can study the brain to inform models? Or?
JIM DICARLO: I think, yeah. When we do this for [INAUDIBLE], I spend a lot of time thinking the exact same thing. What's the value of the neuroscience to the modeling? And when we started, [INAUDIBLE], if you look at neuroscientists, most neuroscience is actually trying to confirm things that engineers already know how to do.
So they take these simple models of decisions and various models, and then you compute [INAUDIBLE] for your quantitative findings, which engineers think are easy to do. And then they're figuring out where those live in the brain. That's a little oversimplified.
But what was cool about this space of problems, to me, is that these are things that engineers didn't know how to do, as I motivated at the beginning of the talk. And the hope was that we could find ideas from the brain that would tell us what to do to help the engineers.
But where it seems to be right now, is that the engineering is getting better.
These models were inspired by things that neuroscientists did 30 years ago. So that may seem like a long time. Now these networks are all the rage, right? But those grew out of things that neuroscience had constrained. So there was inspiration there, and eventually it's producing things that you could [INAUDIBLE] as the referees on which one is correct. Which is a little like what was said earlier, but now in a more complicated domain.
Even our work here is an example of that. Let's just try to be good engineers, and then validate, as you said. And I think there's a good discussion to have about the role of neuroscience in driving the next generation of technology. Maybe it just has a very long time lag. I think that's what my colleagues might say: it does have impact, it just doesn't arrive immediately.
But if you think about the brain and the things we learn, it does have impact on the kinds of models that some of us, like some of you in the room, are building. As it did for Jeff [INAUDIBLE], and Yann, and Tommy, and others, when they were building those kinds of classic models that are now having some success.
So I hear you. I'd like to go in and extract the secret magic and just hand it to the engineers, and we still hope for that. But, to be honest, right now it's really the validation of engineering modeling in a bio-constrained space set up by earlier neuroscience. That's one way of looking at it. [INAUDIBLE]
Sorry. I didn't give you an excuse to [INAUDIBLE], but that's fair.
AUDIENCE: Could I just get a citation [INAUDIBLE].
JIM DICARLO: Sorry, the last talk?
AUDIENCE: The last [INAUDIBLE] IT.
JIM DICARLO: Oh. It's a Cosyne abstract, and it's under review.
AUDIENCE: Oh, OK.
JIM DICARLO: So, I can give you the Cosyne abstract. [INAUDIBLE].
But I'd be happy to send you a draft of it.
AUDIENCE: Yeah, that would be terrific.
JIM DICARLO: By email. Yeah.
AUDIENCE: What do you mean by learning each image as [INAUDIBLE]?
JIM DICARLO: Predict each image. It came up in the discussion. All we're doing is predicting the pattern of difficulty and the pattern of confusion at the object level. We're not predicting image-by-image confusions yet. And we're not predicting trial-by-trial confusions yet, which we'd have to do with monkeys because you can't [INAUDIBLE].
That's the finer-grained prediction. If you believe in an algorithm, you'd better pin it down to [INAUDIBLE].
AUDIENCE: This isn't behavioral. This is predicting [INAUDIBLE] activity.
JIM DICARLO: Well, OK, it's two predictions. This predicts the [INAUDIBLE], that's what's here, those decoding algorithms. Then it predicts neural responses, again, at the grain of 100 millisecond mean responses.
You might argue that grain is not good enough. But to me, you take this to inspire what you should predict, and that's what we're trying to do.
AUDIENCE: But with more [INAUDIBLE]?
JIM DICARLO: This one, or that one?
AUDIENCE: Predicting each image. That is, having more here.
JIM DICARLO: The more [INAUDIBLE] in the decoding algorithm, the better [INAUDIBLE]. That was a question that came up. My students asked the same question, and I wanted them to work on it, but [INAUDIBLE]. It works for the measures we have, up to about 150, and then you're already perfect.
But now, at the finer grain, it's not yet perfect. Can you tell whether there's a trend toward getting more perfect? I wish I could answer that for you. We have the data, I just haven't analyzed it yet.
And that way we have [INAUDIBLE] range. We're still trying to publish the part of it [INAUDIBLE].
AUDIENCE: What do you think about [INAUDIBLE] core objects 2.0?
JIM DICARLO: Oh, great. That's a great question. So, occlusion is a big factor in computer vision that we didn't have at all. That's one of the things we wanted to add. We also wanted to think about tasks that are not necessarily about identity.
I didn't show you this, but we can estimate things like the position in [INAUDIBLE]. It turns out, that same algorithm predicts quite well the human ability to estimate this [INAUDIBLE] position of an object.
You might say, well, V1 could do that. V1 can't do that on these cluttered images, [INAUDIBLE] latent variables.
AUDIENCE: What position does that take?
JIM DICARLO: The position in the visual field. Here's an image; where was the object? On those cluttered images, you can take out [INAUDIBLE] over there. But you can't do that with a V1 simulation.
So I offered this as a little diversion, but part of 2.0 is not just expanding the image set, but expanding the goals of what things you're going to be able to predict. So it's not just categories, but also some of the other latent variables, which we can already try here.
One of the postdocs [INAUDIBLE] wants to do things like [INAUDIBLE] surface normals, for instance. Can we estimate other factors of the scene that are not just about [INAUDIBLE]?
We'd have to get a whole bunch of behavioral data on that, and probably measure more neural data.
But our main thing for 2.0 is occlusion and multiple objects, just one or two. That's the thing I think is missing here, and that we know is challenging for computer [INAUDIBLE].
If you have suggestions, we'd love to hear it.
JIM DICARLO: We're also trying to mine image sets for interesting dynamics, for what [INAUDIBLE] happens early and late, with questions of feedback. Can we enrich the set with things that take a little longer? Not a second longer, but, let's say, 200 milliseconds longer. Can we find tasks like that? [INAUDIBLE].
We're hoping to find things of that type. They might not have names like occlusion, but they are findable. They might have [INAUDIBLE] that we can pull out and enrich the set with, and we'd see those emerging later in the IT response. [INAUDIBLE].
AUDIENCE: I think I missed it. Right now your prediction is not for each [INAUDIBLE].
JIM DICARLO: The prediction here is for each image in that domain that I showed you, the domain that I generated.
AUDIENCE: OK. But your goal is to do [INAUDIBLE] for each image? That's what underlies the goal?
JIM DICARLO: The goal is to take a population pattern somewhere in the ventral stream and predict the human report.
AUDIENCE: Oh, OK.
JIM DICARLO: Right? I showed you we can predict the reports from monkey data using an algorithm that I call [INAUDIBLE]. We can take population vectors here and predict this very, very well, but not perfectly yet.
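The decoding step being described, reading a report out of a population vector, can be sketched as a simple regularized linear readout. This is an illustration of the general idea, not the actual algorithm referenced in the talk; the class counts, penalty, and function names are assumptions:

```python
import numpy as np

def train_linear_decoder(X, y, n_classes, lam=1.0):
    """Least-squares linear readout from population vectors to object labels.

    X: (n_trials, n_neurons) population responses; y: (n_trials,) integer labels.
    Returns W: (n_neurons, n_classes) weights for a one-vs-rest readout.
    """
    Y = np.eye(n_classes)[y]  # one-hot targets
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def decode(X, W):
    """Predicted label: the class with the largest readout response."""
    return np.argmax(X @ W, axis=1)
```

The point the talk keeps returning to is that a readout this simple, applied to the right population, already tracks the pattern of reports quite well.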
JIM DICARLO: But that algorithm [INAUDIBLE] to the discussion about face neurons, for instance. Under that algorithm, and an assumption about spatial layout, you can reason about parts of this and make predictions of what that [INAUDIBLE] should be affecting.
JIM DICARLO: So that's why this model has value. That's what I was talking about with optogenetics. That's, to me, one of the reasons it has value. It demystifies the [INAUDIBLE] of the early discussion. It demystifies it for us. You don't need to do anything complex to get to that. And maybe that pushes us to harder tasks, or harder reports. That was in the discussion you and I were having earlier.
But that's the value of doing these things. I just want to point out that visual neuroscience has spent most of its time over here. This is the issue [INAUDIBLE] neuroscience, those [INAUDIBLE].
The field does a little bit of this, but it neglected this for a long time, spending most of the time trying to build good encoding algorithms. And that's not bad work; it's just that we think this really helps that, as well. It tells you what you need to predict. It's also deeply interesting, and it's on the right, right? You push neurons and change perception. That's the [INAUDIBLE]. That's pretty cool. That's the deepest question neuroscientists are asking.
So again, I'm trying to leave you guys with the idea that the predictive models are good and this is how we're using them. And you guys are being trained to be able to build these kinds of things.
That's what [INAUDIBLE] neuroscience needs more of, so I hope that you will be attracted to this kind of work. And part of what I like to do with the experimental lab is help test which [INAUDIBLE] are correct and [INAUDIBLE]. How are they incorrect? That's what we'd normally do together.
AUDIENCE: So going back to the idea of the object manifold, to what extent do you think they are related to multiple degrees of freedom of how we collect that [INAUDIBLE]?
JIM DICARLO: Yeah, a great question. That's sort of related to the 2.0 question. We've restricted to a limited domain of visual [INAUDIBLE], core recognition. You're asking to expand the domain over bits of time. Like, how you reach to an object, or how you're going to interact with it?
AUDIENCE: Just move your visual system around, and see--
JIM DICARLO: So now we've got time-varying movies, which is what you'd have to show the visual system to emulate that. And again, as an experimental concession, not a necessity, but a constraint, we've stayed in this simpler domain of 200 millisecond, or 100 millisecond, snapshots.
Time variations will be really interesting, but you'd want to motivate that from an interesting behavioral phenomenon that evolves over time. You'd like to show, say, action recognition, and things that often require time. As experimentalists, we can't just start showing movies to monkeys; that's why we [INAUDIBLE] these experiments for a long time.
So if you get motivated by a behavior up here, that's the way into that kind of question. That's why I'd put that under object recognition 2.0, or 3.0, or call it object interaction 2.0, or whatever behavior you want.
If you get motivated that way, you drive the science from the top, from a perceptual sense, a goal sense, as an engineer. And then we can work in to ask, well, if it's supported here, which part of the brain supports it? What algorithm supports it? That's just how I like to think about problems, rather than diving right into the middle of the system and later figuring out that maybe these neurons relate to a task, when you can't really define the domain behaviorally on the task and try to make it interesting.
You'd run a behavioral experiment before you go in and measure neurons. That just makes more sense to me as an experimental approach. So if you have good ideas about that, I'd love to hear what you're thinking about.
AUDIENCE: Do you have predictions about what should happen if there's a movie and the object is undergoing some pretty straightforward transformation?
JIM DICARLO: Yeah. That's asking for the dynamics of the whole system. And you can create predictions from these simple models. [INAUDIBLE] as soon as you've shown it up there, [INAUDIBLE] you show that face-over-time prediction, where we had different views of the face wobbling around, and we predicted what that neuron should do over time.
In reality, the dynamics of the system when we start to show a movie are more complicated than the dynamics of [INAUDIBLE] together static images like that. Right? I think you guys are right to push on this: as you have time-based, unfolding movies, how well are these kinds of models going to work?
And I already know that they're not going to fully work, and there are going to be dynamics they're not going to capture.
And so, this really was meant to be a [INAUDIBLE] you could build on in a limited regime. You guys are right in pushing me to go to a larger domain.
We've done a little bit of work in that regard. But take this as a [INAUDIBLE], right? As soon as you put a face up, it should go--
[MAKING LOUD NOISE]
So the model thinks it's still there, right? But in practice the neuron goes, [MAKING LOUD NOISE]. It's adapting, or whatever word you want to use. But the object isn't moving around. You put the [INAUDIBLE].
[NOISE SOUNDS] You say, well, that's explaining away. It's predictive coding, and all of that could be part of that system.
The neuro guys will just call that habituation with [INAUDIBLE]. These models have none of that. They have no predictive coding in them at all. I just want to point out that someone might push back to say it could be layered on. Now we have this core to layer some of these ideas on, and we're not the first to say that. But then, could we do that, predict the dynamics, not just call it explaining away, but put it in a larger theoretical framework?