Jim DiCarlo: Introduction to the Visual System, Part 1
Date Posted:
June 4, 2014
Date Recorded:
June 4, 2014
CBMM Speaker(s):
James DiCarlo
Description:
Topics: Why study object recognition in the brain; comparison of behavior in humans and monkeys; overview of the ventral visual stream and the ventral (what) vs. dorsal (where) pathways; retinal receptive fields; simple and complex cells in V1; shape features that drive V4 responses; response properties of IT neurons; increase in size of receptive fields along the ventral stream; spatial organization of IT cortex; face patches; the challenge of recognition: image variation, object manifolds (DiCarlo, Cox, TICS 2007; Pinto, Cox, DiCarlo, PLoS Computational Biology 2008; DiCarlo, Zoccolan, Rust, Neuron 2012)
JAMES DICARLO: All right. Thank you, Tommy. So, as I understand it, many of the folks in this room have a pretty good background in machine learning and computational biology but maybe less of a background in neuroscience. I think my job is to try to give you guys [INAUDIBLE] what goes on in some aspects of computational vision. The talk I put together-- the first half is going to be a review of the visual system. And in the second half, we'll get more into some specifics of how at least my lab is applying some of the things I'm sure you guys are learning about to the problem of visual object recognition.
But more broadly, you can think of what we're after-- when I say we, I mean all of us, I hope, in this room-- [INAUDIBLE] vision is computational algorithms that can explain challenging visual tasks-- how our brain executes those tasks. And I want to sort of start by-- if my slides will advance. Sorry. There we go. So this is a slide actually that I stole from Tommy at one point. But it's just to illustrate generally the problem of vision.
So you all look at this, and you quickly do things like identify that these are cars. These are people, [INAUDIBLE], buildings. You could maybe estimate that this is a walkable surface. You probably are pretty good at predicting what these people are going to do next, whether it's safe to cross the street. You do a lot of that very, very quickly with this whole scene here. There's a lot of visual tasks there.
Now I want to say that in my lab, I actually work on a collaboration with the Defense Department on many things. And there's something I thought would be fun to share with this group that is still somewhat classified, if you will-- not fully classified, but some of it is not [INAUDIBLE] yet. There is a sort of computational system that has basically 100 billion computing elements and solves problems not solved by any previous machine. Actually, this doesn't look like it, but it only requires about 20 watts of power. And the [INAUDIBLE] is still classified.
And so if I know what I'm doing here, I'm not talking about this thing-- I'm talking about that thing. There's this amazing set of algorithms in here-- that's how we think about it-- that we'd like to reverse engineer, or extract, from the brain. And I know many of you are probably interested in those things as well. So let me tell you again that what I'm going to talk about today is the problem of visual object recognition. And this is probably the [INAUDIBLE] computer vision-- to be able to at least put bounding boxes around things and identify them.
Now this is maybe not exactly how our visual system does it, but at least it's an operational definition of the problem. And I'll be using a definition that's very close to that. So first, I want to back up and talk about, why study object recognition in the brain more broadly? It's because these representations that our brain is computing from the visual world are really the substrate of higher-level cognition. I mentioned things in the last slide like being able to predict where a person's going to go next, whether that car is going to hit you.
You can't do that unless you first estimate those items in the environment and their states and perhaps their future states. So they underlie things like your memory-- what things you're going to lay down in memory-- your value judgments, your decisions, and actions. And if you're building robots, then you need front-end systems like this to do things like obstacle avoidance, navigation, danger avoidance, et cetera. I won't read all those. So again, why study this?
Well, the other reason that I want to point out is that, as I mentioned, this is already a challenging computational problem. So I want to point out this slide. This is from 1966, from MIT-- the Summer Vision Project. I used to think this was an apocryphal story, but [INAUDIBLE] actually gave me this document. The goal of the summer project was to have some undergraduates put together a system-- this was in the sort of heyday of early AI, the first round of AI.
The goal was to solve object identification-- to name objects by matching them against a vocabulary of known objects. There were some goals for July and August. And the goal was to essentially solve this problem by the end of the summer. That was almost 50 years ago. Now I think we're close to starting to solve this problem. But I want to point out that we had an intuition that some of these things were easy because our own brains do them quite well. Yet it turned out, as many of you already know, that this is a very, very hard problem.
The [INAUDIBLE] document's really fascinating [INAUDIBLE]-- objects that you think are cigarette containers and things that we would not look for in our offices [INAUDIBLE]. OK. Let me then put that in context. So many of you are interested in complex tasks. Here is just one simple slide I put together on what I like to think about: what are our brains good at, and what are machines good at today? I'm not going to take you through all of these. Machines are better than us at a lot of things. But at some things, like object recognition or general scene understanding or even things like walking well, machines are not as good as us.
And our goal-- for many of us, at least in my lab-- is to try to discover how the brain solves object recognition, and I mean that in an algorithmic sense. So potentially that would be commensurate with building machines-- that's sort of the definition of actually saying that you understood the problem. And I, and probably others, think of this as a gateway problem: understanding a complex problem like visual object recognition might give you more insight into how the cortex works more broadly.
So this is all just context here. So back up a little bit. The goal of any science, if you will-- really big picture [INAUDIBLE] here-- is to say, can you take measurements in some domain 1 and predict what's going to happen in domain 2? And maybe I don't need to motivate this for a computational audience, but more for my neuroscience and psychology colleagues: predictive power is really what science is all about. This could be a future point in time-- this could be the state of planets at time one and time two.

And we can do those kinds of things with classical physics, but it feels like neuroscience is still struggling to do this well. For us, though, you can envision this as the domain of all possible images, where each point is an image-- a point in some high-dimensional pixel space-- and these might map to what you perceive in that image.

And there can be various measures of perception here, and that's a very interesting domain, but that's the general problem: you'd like to be able to take pretty much any image and predict what's going to pop into your mind. Now, the accuracy of this kind of predictive mapping is again the strength of any scientific field. So the goal is accurate predictivity-- I don't think I need to motivate that, but in general [INAUDIBLE].
And that's what ultimately underlies your ability to fix the system when it's broken, to build something that's comparable, maybe even to augment it. This is the diagram I'll talk about today, and the goal is accurate predictivity. For problems like face detection-- which, as you may or may not know, has seen a lot of progress recently-- face detection was a pretty challenging problem for a while because of variation in the images, a thing we'll talk about in a moment.
And so for that and other reasons, people started to go and record neural activity. So now, instead of just the domain of images, you have the [INAUDIBLE] neural activity, where you can think of each of these dots as being [INAUDIBLE] pattern of a population of neurons somewhere in the brain. We'll talk about those later. I just want to give you the big picture. And so the way that I, and many others who work in this way, think about what we're trying to do here is: we're trying to predict the mapping between images and patterns of neural activity-- population patterns of firing. That's usually called encoding. And then we're trying to make algorithms that go from neural activity to accurately predict things like perceptual [INAUDIBLE]. That's called decoding. That's sort of by way of big picture. You'll hear those words, encoding and decoding [INAUDIBLE].
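To make the decoding vocabulary concrete, here is a minimal sketch in Python: given a matrix of population responses (trials x neurons) and an object label for each trial, a simple linear decoder learns the map from neural activity back to the percept. Everything here-- the shapes, the simulated responses-- is a toy stand-in, not data or methods from the talk.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_neurons, n_objects = 600, 168, 8   # all hypothetical sizes
labels = rng.integers(0, n_objects, n_trials)

# Toy "population responses": each object evokes a characteristic mean
# pattern across the neurons, plus trial-to-trial noise.
class_means = rng.normal(0.0, 1.0, (n_objects, n_neurons))
responses = class_means[labels] + rng.normal(0.0, 2.0, (n_trials, n_neurons))

# Decoding: learn the map from neural activity back to the reported object.
decoder = LogisticRegression(max_iter=1000)
accuracy = cross_val_score(decoder, responses, labels, cv=5).mean()
print(f"cross-validated decoding accuracy: {accuracy:.2f} (chance = {1/n_objects:.3f})")
```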
Here's what I hope to take you guys through today. First is to define a complex visual task. Again, my work is on core visual object recognition, so I'll try to define that for you. But I hope some of what I tell you today is inspiring for those of you who study other aspects of vision. I'm going to give a very brief overview of the brain regions that house the circuits that execute the brain's algorithm. That's a series of visual areas called the ventral visual stream. For those of you who haven't seen that yet, I'll take you through it. Then I'm going to discuss the central computational problem-- how we think about how it's solved, or how it must be solved. Then I'll talk in a minute about decoding algorithms.
So those are things like asking how neural populations could effectively solve the problem-- how those neurons can support that. [INAUDIBLE] answers here. Is there any causal evidence in support of this? I'll tell you a bit about that. And then the last part before we end-- we'll probably break somewhere around here-- but the last part, I'll talk about encoding: what specific algorithm is at work along the ventral stream. Many people are interested in that, and I'll tell you what we've been doing on that and the strategy we're taking [INAUDIBLE].
So I want to stop here and ask-- since I've just been talking at you guys for a while, I want to get a sense of where you guys are in your thinking. I don't want to just start talking [INAUDIBLE], and if you're not stopping and asking me questions, then this will be a waste of time. We're here to educate you guys. So tell me if you'd like me to skip through a lot of this. What's your sense? Would people like to hear a lot of review of the ventral visual stream and object recognition, or would you like me to jump ahead? Who wants a lot of review of the visual system? Sort of half. Who's more interested in talking about decoding and encoding algorithms? OK. Am I talking at too low a level? Too high a level? OK, so I guess I'm doing OK. Again, please raise your hand to interrupt me. As I sort of alluded to, I could just machine-gun fire this whole block at you, but that's not the point-- this is really for you guys, not for me.
So let's try to define this [INAUDIBLE], which I'll call core visual recognition. Again, I want to point out that I'm not talking about all of vision. We're going to talk about something that I refer to loosely as object recognition. When I say that, I include faces-- I consider a face to be another object. So again, as I already mentioned, here is a scene. You can quickly tell me there's a car, there's a person, there's a sign. How did you do that? Well, first of all, [INAUDIBLE] ventral system-- I'll show you that the ventral stream doesn't really analyze the whole scene at once. The anatomy suggests that the central 10 degrees is heavily represented in the part of the visual system that's solving object recognition, called the ventral visual stream.
And we can think of it as heavily analyzing this central 10 degrees. Now, 10 degrees-- for you guys, that's basically if you hold two hands up at arm's length, that's about 10 degrees. So that should give you a feel for what we think your ventral stream is about. For some of you, that screen might be about 10 degrees, depending where you are in the room. Hopefully it's close. You might be fixated there. Or you can think about 10 degrees.
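A quick worked check on that rule of thumb, assuming a hand span of about 10 cm held about 57 cm from the eye (both numbers are assumed typical values, not measurements from the talk):

```python
import math

span_m = 0.10      # assumed: width covered by the hands, ~10 cm
distance_m = 0.57  # assumed: eye-to-hand distance at arm's length, ~57 cm

# visual angle = 2 * atan(size / (2 * distance))
angle_deg = math.degrees(2 * math.atan(span_m / (2 * distance_m)))
print(f"visual angle: {angle_deg:.1f} degrees")  # ~10 degrees
```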
But when you explore this scene, when I put it up for you, spontaneously, without me instructing you, you make these rapid eye movements around the scene. [INAUDIBLE]. And you pause at these points here, which are called fixations. And the duration of the pause is about 200 to 500 milliseconds. 200 milliseconds is quick.
So you're making this sample of the scene. And there's interesting science-- that maybe you guys or we can talk about-- in how you direct those gazes. And it's parts of the brain that I'm not going to talk about today that control that sample.
What I am going to tell you is that given this kind of sampling, what this brings to your [INAUDIBLE] engine-- think about that central 10 degrees, now. It might look something like this.
OK. So those were snapshots at those fixation locations, each for about 200 milliseconds. And notice they were sort of in context. I mean, they were all drawn from that scene, so you had some context, because you saw the scene.
But I hope you noticed you can recognize one or more objects in each and every one of those. I'll do it for you again. You can quickly map it to some word in memory, maybe one or two words from each of those snapshots, even in only a couple hundred milliseconds.
This is what we would call core recognition. And again, it's an operational definition of a problem that connects [INAUDIBLE] understanding to a more defined problem that we can go in and study quantitatively.
AUDIENCE: Jim?
JAMES DICARLO: Yes.
AUDIENCE: Is this like category recognition, and not of a specific exemplar?
JAMES DICARLO: So I was being loose about that right now. You mean distinguishing one type of car from another. For now, think of it as basic-level recognition, which would be the first word that pops into your head. You might call that category. But there are, of course, different grades at which you might be able to do that. Was that car a Chevy or a Honda? Maybe there are some things that you can't do in those time durations. There is a lot you can do, which [INAUDIBLE] on, but of course there are limits to what you can do. We're actually going to take advantage of those limits. So you could think of this as categorization. But it depends on exactly how you define that. I'd rather think of it as basic [INAUDIBLE] recognition.
All I'm trying to illustrate is that you have some power in that domain. What we have to do is then measure that power. And that's what we'll show you later. So I'm trying to set up how your brain tends to sample it. It's part of the problem that we're going to study.
So we're not going to do all scene understanding. We're just going to do, what do you do in that snapshot.
AUDIENCE: [INAUDIBLE]?
JAMES DICARLO: Depending on the pixels, you can do something. If it's a person, it's under what conditions and-- it depends, right? So yes. But I think I was trying to [INAUDIBLE] that there's a lot of single words that you can pull out of each of those images very quickly. OK. Does that answer the question?
AUDIENCE: The only reason I asked is that it seems like things [INAUDIBLE] like discrimination and cross-category detection [INAUDIBLE] would be very different computational problems.
JAMES DICARLO: It could be. I think we and others tend to assume that they're grades of the same strategy. But I think that's an interesting point for discussion.
Especially when I show you the behavioral data, which show that, of course, you do worse at discriminating between two very similar faces in that kind of time window than you would, say, a face and a car. The information content itself is [INAUDIBLE].
I think that's what you're leaning toward-- what extra mechanisms do we invoke for [INAUDIBLE] the finer-grained discriminations, which is usually called the subordinate level of discrimination, as opposed to entry level. That's what you're after, right? Right. OK. And I think that's an interesting problem [INAUDIBLE] behavioral ability [INAUDIBLE].
And I'm calling it core, because again, I think that the things we can understand from that will inform those more fine-grained things. But we'll have to demonstrate that to convince you. That's just our strategy [INAUDIBLE]. Yes.
AUDIENCE: Have people done Molly Potter [INAUDIBLE] stuff for subordinate categories as well? And do they have this quantitative drop-off?
JAMES DICARLO: In terms of time of presentation?
AUDIENCE: [INAUDIBLE].
JAMES DICARLO: So we've done [INAUDIBLE]. And that's why we call this ability core recognition. And again, that's just the operational definition of, say, central 10 degrees of vision, let's say 200 milliseconds. And you're asking about-- you could ask about within 200 milliseconds, and even beyond [INAUDIBLE] milliseconds.
And we've plotted out [INAUDIBLE]. Between zero and 100, you rise up quickly from chance to a very high level. Then we saw a little bit of an increase from 200 to 500-- a very gradual increase. This is for single objects of the type that I'll show you guys. So of course it's going to depend on the task. But that's the general trend if you plot those data.
AUDIENCE: But does it interact with-- can you make generalizations about, is subordinate generally harder? Or are some subordinates harder? Or does it depend on expertise?
JAMES DICARLO: Right. I mean, this is sort of getting into a space that we haven't gone into, which both of you are asking about. Which is, if I plotted that curve for subordinate discriminations, it would rise later and be steeper. And presumably it will be, because there's more room to gain there. But what are the limits of that? And once we get beyond a couple hundred milliseconds, it's hard to control experimentally where your eye movements are. But I can plot that for you.
But that's where we're going next. What we've been doing now is, what do you do up to 200? And I want to point out that you have this rapid rise up to about 100. And that's what we've been trying to understand. And you're already getting into like, well what about beyond that?
So hopefully I can convince you that we have a sort of reasonable understanding of that first part. But the next part is sort of-- we call that the next frontier. OK.
So that's the Molly Potter type video, right? You mentioned RSVP-- Molly Potter in the '70s-- where you can do videos like this. And you can usually map one or more objects to memory. I mean, you can attach a name to it. So again, this is known from the '70s, that your visual system can do a lot, at least under some conditions, quite rapidly.
I want to point out that this is pretty fast. It doesn't feel like you're putting a lot of effort, for whatever that's worth. It doesn't feel like you're thinking about long division or playing chess.
Again, I think this is why, early on, it was sort of assumed that this problem could be solved very quickly. It's something our brains have been evolved to do quite well. So we had sort of assumed that it was something easy to solve in a machine.
I want to also point out, I didn't need to tell you to expect Star Wars characters. I didn't need to tell you to expect buildings in Italy. Yet you were probably able to say, oh, that's the Leaning Tower of Pisa. I didn't have to strongly cue you attentionally. That doesn't mean there are no effects of attention. But you can do a lot without any pre-cuing. You're entertaining thousands of things that quickly map to memory-- I mean, Tower of Pisa. Who would have known that if I showed you those things, you could quickly do that? So there's a lot you can do without these kinds of things. I'm not denying them, but there's a lot already happening.
And what's really impressive is it's very tolerant to variation. [INAUDIBLE] on Yoda [INAUDIBLE].
AUDIENCE: Are there priming effects for under the 200 millisecond range? I'm guessing what you're saying is that the priming effects would happen toward the end of the 200 milliseconds or afterwards.
JAMES DICARLO: It depends on the task conditions how much effects you're going to get. When you say priming-- like if I cue you for Star Wars characters.
AUDIENCE: Before you start [INAUDIBLE].
JAMES DICARLO: You're going to see one of the Star Wars characters-- so it's going to depend on the specifics. You're going to see something in the center, to the right. Pre-cuing things are what we call classic top-down attention. And those things do have an effect. The magnitude of the impact on behavioral performance depends on a lot of specifics. But they do have an effect.
AUDIENCE: [INAUDIBLE].
JAMES DICARLO: Yeah. Because you're pre-cued. So you can already see those effects very early. OK.
But what I want to point out is that those effects actually are sort of modulations on the larger effects that we've already talked about in terms of time. So we have been trying to focus on what happens even ignoring these effects, in these neutral conditions where I'm just going to put things at you in unknown order, if you will. And the argument is, you already do a lot. And we'd like to understand that first, as a baseline for understanding the more complex [INAUDIBLE].
So again, I call it core object recognition. It's a subset of recognition-- again, operationally defined. So let me give you guys now an overview of the brain regions that we think house the circuits that execute that kind of task, setting aside the pre-cuing things that were mentioned.
So this is the ventral visual stream. Of course, this is the human brain. We all really would like to understand that brain. We work on this brain-- the non-human primate. Have you guys had any talks from non-human primate folks?
AUDIENCE: [INAUDIBLE].
JAMES DICARLO: OK. Well, maybe some. Anyway, part of the reason we and others like the non-human primate is that it's a very good model of human ability. Now, these are recent data from my lab-- because I used to just assert that. Here are the abilities of monkeys to discriminate objects at high degrees of variation, which I'll tell you about in a moment.
These are the confusion matrices. Red means the two were confused. Blue means not very confused. Here are some object names on the bottom. These are just 24 objects.
And all I want you to see here is that these patterns look very, very similar. If you did a low-level pixel model, or a V1 model-- we've done that-- [INAUDIBLE] very, very different from what you see here.
So monkeys look a lot like humans in their confusion. These are monkeys that have been trained to discriminate these objects. It takes them a couple weeks to learn each object. But then they know many, many objects. So they need some learning. They don't know how to respond to these objects initially.
But I just want to point out, these are things that sort of feel intuitive to you. A camel's confused with a dog highly by monkeys and by humans. Tanks are confused with trucks.
This is not surprising, because it's in the shapes themselves, you think. But again, it's not like any algorithm does this. The point is that the monkey looks a lot like the human in these kinds of core object recognition tasks.
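A minimal sketch of how such a comparison can be made: given two confusion matrices over the same objects, correlate their off-diagonal (error) entries. The matrices below are random stand-ins, so r comes out near zero; the claim in the talk is that with the real monkey and human data, the error patterns agree strongly.

```python
import numpy as np

n_objects = 24
rng = np.random.default_rng(1)
monkey_cm = rng.random((n_objects, n_objects))  # stand-in for real data
human_cm = rng.random((n_objects, n_objects))   # stand-in for real data

# Compare only the errors: the diagonal (correct trials) dominates both
# matrices and would inflate any similarity measure.
off_diag = ~np.eye(n_objects, dtype=bool)
r = np.corrcoef(monkey_cm[off_diag], human_cm[off_diag])[0, 1]
print(f"correlation of error patterns: r = {r:.2f}")
```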
AUDIENCE: Jim? What region of the brain is that?
JAMES DICARLO: Well, that's not a region of the brain. Those are behavioral data. Sorry, I wasn't clear about that. But a monkey is a good model for a human in part because its behavioral reports look a lot like a human's reports.
AUDIENCE: What's [INAUDIBLE] report of identity?
JAMES DICARLO: So again, [INAUDIBLE] jump ahead, but the idea would be, put up an image, and then you have choices of which of those [INAUDIBLE] words [INAUDIBLE] to the point. Sort of like the images I showed you, but one at a time. I'll show you those images a bit later. That's the biggest [INAUDIBLE]. From that, sometimes they'll say, oh, the camel was a dog-- [INAUDIBLE] averaged over a lot of trials. Those are the kinds of errors [INAUDIBLE].
AUDIENCE: What does the monkey do?
JAMES DICARLO: Same thing, exactly the same thing.
AUDIENCE: [INAUDIBLE]?
JAMES DICARLO: No words. It's a little different-- they have picture maps. They see an image and then iconic pictures of, say, the car and the dog. [INAUDIBLE] training with words-- we haven't done that in this situation. It's not really reading; it's making an association. But the interesting part is the mapping between the images-- which can be highly varied images of a car-- and which label the monkey chooses.
And [INAUDIBLE] a monkey doesn't know what cars are. Humans know what they are. But at least in terms of these kinds of visual tasks, they show the same patterns of errors once they're trained to do this kind of task.
And we think all of us had to be trained to know what a car was at some point in our lives, too-- to make a mapping between some feature basis and the word car. And we like to think that that's what we're teaching the monkeys. But this gets into deep things [INAUDIBLE] monkeys. All I'll tell you is, operationally, on these tasks, those are the data we get. Yes.
AUDIENCE: So then you teach them to classify [INAUDIBLE]?
JAMES DICARLO: We teach them to discriminate among objects and, yeah, the labels if they can do that. I could show you the training curve at some point, but again, it takes a couple of weeks to learn every new set of objects-- they learn a new object practically every day or two, on average. So you can imagine they could increase the vocabulary very [INAUDIBLE] something [INAUDIBLE].
AUDIENCE: You say they learn a new object? [INAUDIBLE]?
JAMES DICARLO: Oh, no, this is deep and interesting. When did you have to learn the mapping to the word car, and how much training [INAUDIBLE]?
AUDIENCE: No, I mean like novel objects.
JAMES DICARLO: What's a novel object? I mean people like to say that, but the way to think about it is the visual basis that all of us might have-- might be built in by evolution, might be learned by statistics from the environment, some combination of both. But you have to attach a label to it to actually do something to report on these kinds of tasks.
And you'd like to think that the monkeys and us share a common neural basis, but that they haven't yet attached the labels that we have. And this is some evidence for that: when we teach monkeys to attach labels, they make the same mistakes we do. So it's evidence in support of that idea. Again, there are limits to how much a monkey is going to be like a human-- there are functional goals of things that a monkey is not going to do-- but in these kinds of tasks, we don't seem to see differences yet between monkeys and humans.
AUDIENCE: Well, the task is a two-alternative task?
JAMES DICARLO: For the humans, we've done [INAUDIBLE]. These data were actually two alternative choices for the humans. But they were not pre-cued.
These are all interleaved, so again, it's important [INAUDIBLE] questions [INAUDIBLE]. There's an image and then two choices. And you don't even know what the two choices will be ahead of time. You're just ready for anything in the [INAUDIBLE] world of 20-some odd [INAUDIBLE] possibly [INAUDIBLE].
AUDIENCE: Do you have any idea why the [INAUDIBLE] watch and shorts [INAUDIBLE]?
JAMES DICARLO: [INAUDIBLE] specifics of the particular objects we picked-- probably somebody could try to pick it apart, but yes, at some level. These are actually pretty fresh data. We haven't delved into exactly why that would be.
And we would think it's because, again, the neural bases, for some reason, [INAUDIBLE] each other, and we're [INAUDIBLE] hypothesis. But that's not answering your question, just shifting it to, why does the neural basis look like that? And that's a deep question that we would call the encoding question, which we'll talk about at the end of the talk. OK, [INAUDIBLE] in the back.
AUDIENCE: So do you know what other species might also match the human performance in [INAUDIBLE]?
JAMES DICARLO: We don't know. Again, this is the animal system we used in our lab. We haven't [INAUDIBLE] all possible [INAUDIBLE].
I imagine, based on the anatomy, chimps would probably do as well as humans. Some kinds of monkeys would do less well. Rodents would certainly do much less well, just for acuity reasons.
These are the exact same images for the monkeys as for the humans. So it's a really interesting question how that maps out across species. For us, we're just pointing out that in this domain, this is a good model system, quantitatively, for this kind of behavior.
So I'm sorry, I can't give you any more than that, because this is all that we've done. Let me [INAUDIBLE] why the heck we do this with monkeys anyway? You guys already know why we don't use humans.
Well, partly there's a lot of work on monkeys that tells us about the visual areas that are involved, called the ventral stream. We think these are the areas that are computing the representations, and I'll give you summary evidence of that in a moment. You should know that neurons in these areas project to things like prefrontal cortex, which is involved in action, these cortical areas up here, and around the bend of the temporal lobe, which is thought to be involved in things like memory.
And so we think of these neural patterns-- features, if you will-- as the substrate or the basis for what you would call higher-level cognition, which is usually what's meant by these [INAUDIBLE] here. So you can see, anatomically, that makes sense. And we're not the first to point this out. I'm just orienting you guys.
Because a monkey is not a human, though, we can study the neural activity at the level of neural spikes. And we think that's the most appropriate level of abstraction-- but that's also an interesting point of discussion, how fine-grained these measurements have to be. But in monkeys, we can go and do that. And we can push the neurons around, and I'll show you that, I hope, in a bit. We basically have [INAUDIBLE] better tools to go into that system than we do here.
OK, let's talk about visual recognition. This is a very simplistic view, but here's the ventral stream. As engineers, we like to break it down into these-- let's call each of these a [INAUDIBLE] visual area. There are millions of neurons in each area. I'll show you that in a minute.
So what happens is, when you get this image here, it produces a pattern on your retina, which is really a nicely processed photograph. It's rapidly conveyed up the visual system. And this implies a sort of [INAUDIBLE] pipeline. It's not that simple. But the average latencies here are 60 milliseconds to V1. And then at about 100 milliseconds, on average, you start seeing responses in IT cortex. Those are the kind of data, plus anatomical data, that give you the layout there.
And notice I have feedforward connections with some unknown transform, feedback connections indicated by the dotted line, and cortico-cortical connections with this arrow. So it's complicated anatomy here, but this is the rough layout.
AUDIENCE: [INAUDIBLE] there are bypass pathways from V1 to IT, V2 to IT, and those kinds?
JAMES DICARLO: Yeah, those are the bypass pathways. [INAUDIBLE] reduced to some degree. And there aren't really great quantitative data on that. These data often come from making an injection here and looking at the density of connections in different areas.
So I'm sure, if you go back and look at the studies, V1 makes some projection to V4; it's just a little weaker, and so forth. Quantifying that is not something that's been done well yet. So I know that's not really an answer, Tommy, but if you look at the dominant connections, this is the first-order model.
So please, I don't want you guys to walk away [INAUDIBLE]. The anatomy plus the latencies are here. And the connectivity pattern between those-- I don't have a slide on which layer of cortex-- which of the six layers of cortex. Maybe I should have a slide on that for you guys.
These are not all equal neurons. Particularly in the cortex, they're all a little bit different. There are output neurons. There are ones that stay local. There are ones that project back.
The anatomy of how they project back and forth is what helped establish this hierarchical layout that you guys often hear about in the ventral stream, and in other cortical regions. But this is all a first-order model of how this system looks. Yeah.
AUDIENCE: Yeah, do you think the estimates on the latencies could be overestimates due to experimental conditions, and in the natural world, [INAUDIBLE] visual stimulus much quicker?
JAMES DICARLO: I guess it's possible. Again, those are average estimates. There's a wide range [INAUDIBLE] each area. You're asking if you change the conditions. If anything, they might be underestimates, because often they're measured from a blank screen to some luminance transient, which is a pretty powerful pulse to a system.
I mean, a new image on a blank screen is a better driver than a movie, where neurons kind of trickle along and you have a hard time estimating [INAUDIBLE]. But I don't think it's going to be much different. When you look at how people estimate latency where it's more of a continuous estimate-- we've done some of this in our lab, with things like reverse correlation-- the latency according to that is comparable to these numbers.
I have a whole file of studies of latency if you're interested, but I don't know if we should delve into that right now. But it's not going to be that much different from this. That would be my guess.
I just wanted to point out what we think is happening: when you're watching RSVP, your homologue of IT, your other visual areas-- the parts of your brain that are homologous with the monkey brain-- are clicking along, populations of neurons firing. Just to give you a ballpark, about 10% of neurons in each area are, on average, active for each image shown from a set of natural images. So they're all firing in there, responding to the images.
And of course, we can go ahead and measure those kinds of things. My lab, and lots of labs, tend to think of each area as containing a new population representation. And the reason we and others think that way is that we know there's a complete retinotopic map in the retina, of course; in the LGN in the thalamus; in V1, in V2, in V4.
And then I think things get murkier. I'll show some data on that, but it's maybe more than one area, probably more than one. And it's getting less organized around spatial layout and more organized around things like categories-- things like faces. So I'll show some of that data later if we have time.
Does everybody know what a retinotopic map is? Don't be shy if you want to-- [INAUDIBLE]. So here's a drawing of the rough size of those areas now. This is a little more drawn to scale in the monkey, so you can get a sense of things.
Here again are the latencies. Each area is sized to represent the approximate number of neurons, and the color bars indicate roughly the central 10 degrees. The color in V1 I tried to scale to estimate what portion is devoted to the central 10 degrees. As you can see, IT is almost all filled by [INAUDIBLE].
I want to point this out because if you care about other areas, [INAUDIBLE]. But there's MT-- this little tiny thing. Lots of people study these kinds of things, and you probably know about all those studies, but there's a huge chunk of brain devoted to vision that you can see here, again called the ventral stream, especially this central 10 degrees. So again, IT here is three areas [INAUDIBLE]-- anterior, central, and posterior. People parse it in different ways, but that's the way we talk about it.
AUDIENCE: This is more or less in scale?
JAMES DICARLO: Yeah. Well, I mean, these are scaled relative to each other. So you can see-- I don't know if you can read that-- 190 million neurons here, 36 million there, 10 million right--
AUDIENCE: But this MT is small?
JAMES DICARLO: Yeah, MT is small relative to these other areas. It comes off like a pimple on the floor of the [INAUDIBLE]. [INAUDIBLE] is a big [INAUDIBLE]. A bunch of [INAUDIBLE] research is huge here. Very little study up here. It's a difficult area for [INAUDIBLE]. Yeah?
AUDIENCE: I was wondering where that 10% number came from that you said [INAUDIBLE].
JAMES DICARLO: Oh, that was from studies-- partly from my lab [INAUDIBLE]. We didn't have time to tell you, but Nicole [INAUDIBLE] records a bunch of neurons, shows a bunch of images, and then just counts. That number is roughly how many, after this period, have this maximal firing rate observed [INAUDIBLE].
AUDIENCE: Do you have to worry about [INAUDIBLE]?
JAMES DICARLO: Yeah, you do. But when you look at the [INAUDIBLE] data that people have more recently, it doesn't seem like there are a lot of dark neurons, if you will-- so we might worry that we're just overestimating. There's some worry there, but it doesn't look like it's a big worry.
And Nicole, in those studies, would take any neuron spiking at all. You might worry that neurons [INAUDIBLE], but at some point we can estimate that they're not [INAUDIBLE] become insignificant [INAUDIBLE]. We don't care about that right now. So my policy is [INAUDIBLE].
AUDIENCE: I have a question about the latency. It's not that I don't trust the number, but it seems like [INAUDIBLE] we can recognize the gist of a scene in just 25 milliseconds. But the time when-- that is probably [INAUDIBLE].
JAMES DICARLO: OK, let's be clear. You can't report at 25 milliseconds, but you can deal with a 25-millisecond input-- so there's a difference there, right? The input coming in can be short, but then you're not going to get any response out until probably at least 300 milliseconds. For a monkey, it's about 200 milliseconds before they can press a button [INAUDIBLE]. So there's a time lag for the behavior.
But you're right. You can put smaller pulses of data in that are going to take a little while to propagate through your system. So I think you can get 25 milliseconds, but that's the image time, not the reaction time. Does that answer your question?
So the reaction time, depending on the [INAUDIBLE], is probably going to be longer or shorter [INAUDIBLE] than what's required in the neurons. Whether that can be measured behaviorally-- I don't know of an example to show that, but that would be the prediction.
AUDIENCE: Is it just a lag or a buffer time? Or is it process time? I mean, process--
JAMES DICARLO: Well, OK. What I can tell you [INAUDIBLE]. The time between here and there, given axonal conduction delay, should be about 1 millisecond. So these latencies have more to do with cortical integration than they do with just transport time.
So the travel of the action potential from here to here-- you can't explain it by distance, because there should be only about a 1 millisecond difference if you just went all the way out there. Remember that synapses have to release. Neurons fire in layer four, and they transmit to layer two-three. Layer two-three transmits to layer five-six. All of that is churning and happening here, and then you see these latencies up here.
But it's all just bang, bang, bang-- this rolling wave of activity, right? So very overlapping, with large variance. These are the average-- the median-- latencies. [INAUDIBLE].
AUDIENCE: [INAUDIBLE] recognize sequential images or [INAUDIBLE]?
JAMES DICARLO: I don't know if I want to use the [INAUDIBLE]. All I can say is that these are the averages to a bunch of images, so maybe we can sideline that question [INAUDIBLE] for a moment. If you want to come back to it--
AUDIENCE: OK.
JAMES DICARLO: It's [INAUDIBLE] in this part of the brain-- if you [INAUDIBLE] this part of the brain, you get [INAUDIBLE] in the monkey, for instance, being able to discriminate this object from that object. It was studies like that that led people to think that IT cortex [INAUDIBLE], and that, for instance, the dorsal stream is [INAUDIBLE] the location information. This is very oversimplified, but that's part of the literature that motivates the ventral stream being involved in this task.
So let me give you the brief tour of what's going on here. Here's the retina-- I'll show you one side of the retina. The retina has, basically, nice center-surround pixel detectors. They respond to a pixel, if you will. They have tiny little receptive fields.
Depending on where they are, they can vary in size over a wide range-- they can be about a quarter degree, or actually a little bit less than that. I'm sorry, I messed up-- depending on where they are, they can vary from very small to pretty large. But they respond when light comes on, and that drives the neuron more in the center. And if you put light in the surround-- this is an on-center [INAUDIBLE]-- it suppresses the response.
So [INAUDIBLE] ganglion cells. They're roughly-- they're pretty linear [INAUDIBLE], and there's a lot of work on the retina-- beautiful work on the retina. But from the point of view of the image, they're just taking tiny, tiny little samples, almost [INAUDIBLE]-- like the camera on your phone, which actually does a really good job of [INAUDIBLE] a nice image out of that.
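A minimal sketch of that center-surround idea, treating a retinal ganglion cell as a linear difference-of-Gaussians filter. The sizes and gains here are illustrative choices, not fitted values:

```python
import numpy as np

def difference_of_gaussians(size=21, sigma_center=1.5, sigma_surround=4.0):
    """On-center/off-surround filter: a narrow excitatory Gaussian minus a
    broader inhibitory one, on a size x size pixel grid."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    center = np.exp(-r2 / (2 * sigma_center**2)) / (2 * np.pi * sigma_center**2)
    surround = np.exp(-r2 / (2 * sigma_surround**2)) / (2 * np.pi * sigma_surround**2)
    return center - surround

# The linear model cell's response is just a dot product with the image patch.
rf = difference_of_gaussians()
patch = np.zeros((21, 21))
patch[8:13, 8:13] = 1.0                 # small spot of light over the center
print(float(np.sum(rf * patch)))        # positive: light in the center drives it
```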
So the retina has little tiny center-surround cells. If you go up into V1-- [INAUDIBLE] that you guys should all know about, if you don't already-- cells have what's called orientation tuning. [INAUDIBLE] are called simple cells. What that means is they respond to a particular line orientation, and they can be modeled quite well as linear filters that have plus and minus [INAUDIBLE]. I know you guys can't all see that, but those are built up by the LGN cells converging in the right way.
And [INAUDIBLE] likes this orientation and likes [INAUDIBLE]. If you change the orientation, you get fewer spikes. That's called orientation selectivity.
And then there are other cells in V1 called complex cells. They have selectivity-- they like this orientation and not that one. I know you can't see that here. But they also have tolerance, in this case to position.
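A minimal sketch of that simple-cell/complex-cell distinction, assuming simple cells are Gabor-shaped linear filters and a complex cell pools a quadrature pair of them-- the classic energy model, one common formalization, not the only one. All parameters are illustrative:

```python
import numpy as np

def gabor(size=21, theta=0.0, wavelength=8.0, sigma=4.0, phase=0.0):
    """Oriented linear filter with alternating plus and minus regions."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    xr = xx * np.cos(theta) + yy * np.sin(theta)
    envelope = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength + phase)

def linear_response(image, theta, phase):
    return float(np.sum(gabor(theta=theta, phase=phase) * image))

def simple_cell(image, theta):
    # Half-rectified linear filter: orientation-selective AND sensitive to
    # the exact position/phase of the stimulus.
    return max(0.0, linear_response(image, theta, 0.0))

def complex_cell(image, theta):
    # Pool a quadrature pair (filters 90 degrees apart in phase): orientation
    # selectivity is kept, sensitivity to exact position is reduced.
    return np.hypot(linear_response(image, theta, 0.0),
                    linear_response(image, theta, np.pi / 2))

# A vertical bar at two nearby positions: the simple cell's response collapses,
# the complex cell's response barely changes.
for col in (10, 12):
    bar = np.zeros((21, 21))
    bar[:, col] = 1.0
    print(col, round(simple_cell(bar, 0.0), 2), round(complex_cell(bar, 0.0), 2))
```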
And this elemental operation inspires many important models of the visual system. It still inspires them: this buildup of both selectivity, as described there, and then a little bit of tolerance, in this case to position, is what [INAUDIBLE] and others have built into many models of the visual system [INAUDIBLE]. Yeah?
AUDIENCE: Are there differences between the retinal ganglion cells and the LGN? They're both--
JAMES DICARLO: Those are generally very, very similar. People--
AUDIENCE: Why do you have them? That seems weird, right?
JAMES DICARLO: Yeah. Why would it just be a relay? What other processes go on there? Again, evolution may do things that we wish it could just undo in some way, but when people measure those cells, they look very much, functionally, like retinal ganglion cells. That doesn't mean there's nothing else going on in the thalamus there, but functionally, if you just did linear filter maps of them, they look very similar to the maps I showed you [INAUDIBLE].
AUDIENCE: [INAUDIBLE] same size on and off like things--
JAMES DICARLO: They have the same on and off types, and they're very similar sized. It's not [INAUDIBLE], but when I read about those things that [INAUDIBLE] how they're portrayed, I don't know if you guys are having anyone from the LGN [INAUDIBLE].
[INTERPOSING VOICES]
JAMES DICARLO: I guess [INAUDIBLE] are a little bit different. There are certainly nice studies on [INAUDIBLE] LGN, but not V1 and not the retina. [INAUDIBLE] involving feedback there [INAUDIBLE], but functionally, most of the [INAUDIBLE] community thinks that they are very similar, almost relay-like [INAUDIBLE]. But just from a filter-mapping point of view, you can explain them pretty well with [INAUDIBLE]. Yeah?
AUDIENCE: So these are cells from V1?
JAMES DICARLO: This is V1. So that's primary visual cortex right there.
AUDIENCE: Right. So do you actually see tolerance to orientation as well [INAUDIBLE]?
JAMES DICARLO: Well, OK. So the [INAUDIBLE] they don't like to have [INAUDIBLE]-- they have a range [INAUDIBLE] as this model would suggest. [INAUDIBLE].
AUDIENCE: So yeah. I was talking about complex cells. So do you see any--
JAMES DICARLO: Complex cells have a similar bandwidth of orientation selectivity but then have some ability to tolerate position. And simple cells can be modeled pretty well as a linear filter; complex cells cannot.
AUDIENCE: So they don't have orientation tolerance as well?
JAMES DICARLO: Well-- again, tell me what you mean by that word. Within some range, it will respond, so maybe we should define tolerance a little more clearly. This is how people have usually used the word. [INAUDIBLE]. It's going to respond similarly across the change [INAUDIBLE]. [INAUDIBLE] how I think about tolerance.
I think maybe I should leave that for a little bit later. If you're still confused about that, ask me after a few more slides. I wanted to say-- again, I'm doing nobody justice. I can't review the whole visual system, but I'm trying to give you a feel for what's known about each area. There's been more recent work-- I'm skipping over this-- on V2 and how it differs from V1. There's some beautiful work from Tony [INAUDIBLE] lab suggesting that, basically, V2 is doing combinations of V1 cells-- products of V1 cells.
I don't have a slide for that work today, but it seems consistent with the idea in these models that you just keep building up selectivity. The particular type of selectivity-- which V1 cells you cross together-- is still [INAUDIBLE] a mystery for most kinds of data. But they have evidence suggesting that you repeat the idea you saw in V1: a combination of [INAUDIBLE] cells, then some buildup of tolerance in complex cells, then take the output of those complex cells and build products of that-- another level of selectivity, templates of those complex cells. That idea is consistent with the kind of data that they have.
For a long time, it was hard for people even to distinguish between V1 and V2. They were just using those simple bar stimuli like I showed you. Everything just looked like complex cells. So you had to use natural images and [INAUDIBLE]. I don't have a slide for that, but I want to tell you that there's some advance there.
I wanted to give you a little bit of the history of the field, though, if you go in and record in higher areas-- V4 and up into IT. I would be remiss not to show you slides like this. People for a long time would go in and record there, and they would say, well, I think the cells might like hyperbolic gratings. This is, I think, Jack [INAUDIBLE] work.
So they would show, here's an example of a cell. The red indicates high firing-- this is a single cell-- and the blue indicates [INAUDIBLE]. It likes these patterns and not so much these, and then like this one. You can find a long history of, here are some cells that [INAUDIBLE] and some [INAUDIBLE] under a particular stimulus set, maybe theoretically motivated by some regime, but it was really hard to make sense of what these cells are doing. It's a collection of interesting data, but without a unifying theory to predict it all.
Similarly, from Connor's lab, there's some beautiful work on measuring how cells respond to curvature. They would use stimuli like this. I'm sorry I won't have time to take you through this in detail, but you can think of a simple object, like this one, as having curves relative to the object's center-- different amounts of curvature, different segments that are curved, and different orientations of those curves.
And they devised a basis of those kinds of things to describe these shapes-- a curvature basis-- and they could then test the hypothesis that V4 neurons are tuned in this basis. It was just an idea, and then they go and say, let's now test that idea with a bunch of simple shapes.
You can see the responses here of-- sorry, this is one cell. Black indicates high response. White or gray indicates low response. [INAUDIBLE] likes that thing, that thing, that thing, that thing. You can see it likes some of these weird, potatoey looking shapes with varying curvature [INAUDIBLE].
And from that, they could build models where they describe each of these shapes in a basis set-- curvature relative to the center of the object, and details that I'm not going to have time to take you through. But they could explain half the variance of V4 cells with that kind of model. So this is to give you a sense of the work that has been done in this area.
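A hedged sketch of the flavor of such a model: a model V4 cell with Gaussian tuning over a (curvature, angular position) description of an object's boundary. This is loosely in the spirit of those models, with a made-up parameterization-- not the actual fitted model from that lab.

```python
import numpy as np

def v4_response(boundary, pref_curvature=0.7, pref_angle=np.pi / 2,
                sigma_c=0.3, sigma_a=0.6):
    """boundary: list of (curvature, angular_position) fragments of a shape,
    with angular position measured relative to the object's center. The model
    cell responds to the best-matching fragment on the boundary."""
    responses = []
    for curvature, angle in boundary:
        dc = (curvature - pref_curvature) / sigma_c
        # circular distance between angular positions
        da = np.angle(np.exp(1j * (angle - pref_angle))) / sigma_a
        responses.append(np.exp(-0.5 * (dc**2 + da**2)))
    return max(responses)

# This cell prefers a sharp convexity at the top of the object: a shape with
# a pointy top drives it, a shape that is flat on top does not.
pointy_top = [(0.8, np.pi / 2), (0.1, -np.pi / 2)]
flat_top = [(0.0, np.pi / 2), (0.1, -np.pi / 2)]
print(round(v4_response(pointy_top), 2), round(v4_response(flat_top), 2))
```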
These are not things that are easily computable on actual images, which I think really limits them as predictive models. But it was a kind of descriptor of what's going on at that level of the visual system for some types of stimuli.
This is, again, a little bit of the history of the field. A simple way to put it: they showed V4 neurons a bunch of images-- here are some responses to those. We can't predict the responses to natural images well, but there was some progress in a limited domain of stimuli.
People have looked at other kinds of things. If you go to the V4 literature, you'll find those kinds of studies-- the dominant [INAUDIBLE] studies. If you go up into IT cortex, what you'll find is studies that look-- I'm going to skip through. This is the anatomy about the central 10 degrees. I'm going to skip through that.
Here's the history of IT cortex. The people that started studying IT came from a different tradition than those studying V1 or the LGN. They came from more of a psychology tradition, and Charlie Gross's lab was a big [INAUDIBLE] of this field. A lot of this work was dismissed, but I think it later turned out to be very important.
This is some early work from Gross's lab. You can see, this is 1972. [INAUDIBLE] had just discovered the simple cells in V1, which like to respond to oriented stimuli. And so here they are, recording in these mysterious areas of cortex deep in the ventral stream, called IT. And they would start showing stimuli like this. They're trying to drive the neurons with various stimuli to figure out what makes the neuron go. Everybody was looking for the trigger features of the neuron.
And I like to show this quote. "The use of these stimuli," you see them up here, "was begun one day when, having failed to drive a unit--" they mean a neuron. "Drive," of course, means produce a lot of spikes. I should have said that-- "with any light stimulus, we waved a hand at the stimulus screen," so they [INAUDIBLE], and people [INAUDIBLE] when they do that. This is what they said, "waved our hand at the screen and elicited a very big response from the previously unresponsive neuron."
So that's very anecdotal, but it gives you a sense of what the field was doing. This is what Charlie wrote. "We then spent the next 12 hours--" this is an anesthetized monkey. The eyes are propped open. They're showing stimuli. "We spent the next 12 hours testing various paper cutouts in an attempt to find a trigger feature for the unit," the one thing that it really wants. "When the entire stimulus set was ranked according to the length of the response they produced, we could not find a simple physical dimension that correlated with this rank order. However, the rank order of adequate stimuli did correlate with similarity, for us, to the shadow of a monkey hand."
So these are the stimuli that produced high responses or low responses from each [INAUDIBLE]. It looks like a hand neuron. From that launched a whole idea: that the trigger features of these neurons are full objects, things you might attach words to, like "hand." There is selectivity in these neurons. I'm telling you the history of the field. Whether these are strictly [INAUDIBLE] is still effectively an ongoing debate.
From Charlie's lab, Bob [INAUDIBLE] group showed similar things [INAUDIBLE] quantitatively. As you can see, these are post-stimulus time histograms-- basically sums over lots of presentations of these images. These blips mean lots of spikes coming out. Those are, again, [INAUDIBLE]. There's the stimulus on here, here, here. You can see these are the visual stimuli. You see it likes these kinds of hands, doesn't like the face, doesn't like that box. So there were these games being played of, let's figure out what these neurons do, with a limited set of stimuli, which is [INAUDIBLE].
Tanaka's group did similar things. Here's a neuron that seems to like a face, even a reduced face. Take away this mouth line, it goes away. Take away these eye dots, it goes away. They bring it back over here, change the contrast, it kind of went away. So you'll find a lot of papers through the '90s that take this approach, let's figure out what these cells are doing, especially from Tanaka's group.
And they tried to do things like reduce the stimuli. What features are the IT neurons tuned to? They take a bunch of objects, dangle them in front of the monkey, and figure out that the neuron seems to respond better to this object here. Then, on a computer screen, they try to reduce it to something minimal. And here, it's still responding, still responding, still responding, still responding. Now, when I do this, no more responding. This is how they present the work.
And from doing that kind of thing, they would say, well, here's the initial object, and we reduced it to this. The one thing I think you can take from this is that they were usually able to reduce the stimuli to something you would call more intermediate than the full object they started with, and still maintain a reasonably good response from the neuron. So that's already a clue that these neurons are not-- you shouldn't think of them as object detectors that need, say, a cat. This one maybe needs some texture oriented in a funny way, enough to drive [INAUDIBLE]. These were just what they started with.
Remember, they're trolling around in a super high-dimensional space. I'm not trying to [INAUDIBLE] our understanding-- this is the history of what people have done. You guys all follow this so far? Again, this is what you do when you don't know what to do, so I'm not trying to badmouth them. I'm just giving you the history of the field.
And from this, they did things like compare [INAUDIBLE] from V2. So here's the brain. Here's V2 way back here, V4, and then IT coming out here. You can see the recording locations are those dots. And these are the kinds of features that they would call the minimal features to drive the neuron. One thing they want you to do is eyeball it-- they said, these look somehow more complex than those. How are they judging that? They're just [INAUDIBLE] and then hoping that you will [INAUDIBLE].
The receptive field-- they and others have shown it gets bigger along the ventral stream. So you record here. The receptive fields here are shown as these little squares, and you can see they get bigger. This is a 10-degree field. Remember I said [INAUDIBLE] 10 degrees. You see a lot of different sizes here-- sometimes bigger, sometimes a lot smaller-- that might have [INAUDIBLE], but they don't show those here. But you could say, on average, about 10 degrees. So the receptive field sizes are getting bigger down the ventral stream, something that you've probably heard about. These are just primary data.
From those kinds of data, and others that I won't show you, came this very simple conceptual model that we still carry around, and it basically goes like this. Neurons in IT have some receptive field, indicated with a dotted line, so a portion of the visual field. They respond well to some object-- again, many objects or many shape features, which we don't quite understand; here the stimulus is indicated by the letter A-- and they respond to that even if you shift it around inside that field.
So remember that complex cell in V1 that likes that bar? It's the same thing, except the field's bigger and the shapes are more complex. It's now an A being shifted around inside this bigger receptive field, and you can change the size and it still responds well. But then if you switch to another set of features, indicated here with B, it doesn't respond well. So this was the qualitative, conceptual story about what it seemed like these neurons did. Of course, the details are still not understood, but it's an extension of the [INAUDIBLE]: more complicated shapes [INAUDIBLE] in bigger receptive fields.
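To make that pooling idea concrete, here is a minimal Python sketch of the standard textbook scheme (my illustration, not code or data from the talk): a "complex" unit takes the max over "simple" units that share a feature template but differ in position, so its response tolerates shifts while staying selective for the feature.

```python
import numpy as np

def simple_responses(image, template, positions):
    """Responses of 'simple' units: the same feature template dotted
    against the image at several positions."""
    w = len(template)
    return [image[p:p + w] @ template for p in positions]

def complex_response(image, template, positions):
    """A 'complex' unit pools (max) over the simple units."""
    return max(simple_responses(image, template, positions))

template = np.array([1.0, -1.0, 1.0, -1.0])   # toy preferred feature, "A"
other    = np.array([1.0, 1.0, -1.0, -1.0])   # toy non-preferred feature, "B"
positions = range(12)

for shift in (0, 3, 6):                        # slide the preferred feature around
    img = np.zeros(16)
    img[shift:shift + 4] = template
    print(shift,
          complex_response(img, template, positions),   # stays high: tolerance
          complex_response(img, other, positions))      # stays lower: selectivity
```

The pooled response to the preferred feature stays high at every shift and stays above the response to the non-preferred feature, which is the qualitative behavior being described.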
AUDIENCE: Are these also invariant to rotation?
JAMES DICARLO: To some degree, and [INAUDIBLE] had a really cool study on that in the '90s. I was going to mention that in a moment. There's some degree of tolerance to rotation, but it's more difficult and it hasn't been fully fleshed out quantitatively, so it's hard to give you a clean answer on that. You can find neurons that will respond to an object even across large changes in rotation, but then you'll find other neurons that are tuned to a particular view of that object.
You have both types, but of course, you can take those view-tuned neurons and pool them together to build up fully invariant neurons. So just like the receptive field isn't the whole visual field, the tuning isn't the whole rotation range-- that's a way to think about these neurons. It's just that, in some range of rotation, they're still responding better to their preferred object than to the least preferred object.
That's what I mean by tolerance. They maintain rank order selectivity across some range-- in that case of pose, in this case of position, but also [INAUDIBLE]. So the most important thing, computationally, is that the response to B doesn't rise above the response to A over some range here, and that can be for position or for pose or for anything.
That's what I mean by tolerance, to the question that came up earlier. If you have neurons that can do that-- that like object A better than all the other objects over some range-- that's actually a really, really good basis for discriminating objects while maintaining information about other things.
Actually, you've then separated two jointly coded latent variables, one about position and one about identity, or one about pose and one about identity. So these are individual neurons, but think about a population of them: that's the key single unit property you would want if you were to separate out these types of latent variables. I'll show that in the next slide.
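Here is a hedged sketch of that point (the model form is my assumption: a multiplicative object gain times a Gaussian position tuning curve). Each simulated neuron keeps its object rank order at every position, and a simple linear readout over the same population then recovers both latent variables, identity and position.

```python
import numpy as np

rng = np.random.default_rng(1)
n_neurons, n_trials = 50, 400

pref = rng.uniform(0.5, 1.5, n_neurons)      # per-neuron gain for object A vs B
center = rng.uniform(-1, 1, n_neurons)       # per-neuron preferred position

identity = rng.integers(0, 2, n_trials)      # latent variable 1: 0 = A, 1 = B
position = rng.uniform(-1, 1, n_trials)      # latent variable 2

# Response = object gain x position tuning (+ noise). The object rank order
# of each neuron is preserved at every position: that's the tolerance.
gain = np.where(identity[:, None] == 0, pref, 1.0 / pref)
tuning = np.exp(-(position[:, None] - center) ** 2)
R = gain * tuning + 0.05 * rng.standard_normal((n_trials, n_neurons))

# Two simple linear readouts over the SAME population, one per latent variable.
X = np.column_stack([R, np.ones(n_trials)])
w_id, *_ = np.linalg.lstsq(X, identity.astype(float), rcond=None)
w_pos, *_ = np.linalg.lstsq(X, position, rcond=None)

print("identity decode accuracy:", np.mean((X @ w_id > 0.5) == identity))
print("position decode correlation:", np.corrcoef(X @ w_pos, position)[0, 1])
```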
Is everybody with me so far on my brief, whirlwind tour of the ventral stream? So, this question was just asked: IT has some tolerance to the viewing context, the cue that defines the shape. I can show you that-- I mean, exactly how you define the shape boundaries, there's interesting work on that. And tolerance to retinal position and size: I can show you the data I was just describing.
There was the Scott and [INAUDIBLE] study and the Booth study. Both suggest some tolerance to view and illumination for 3D objects, and some ability to tolerate visual clutter. Again, all these things are partway; they're not perfect, but there's some ability to deal with these things.
I also wanted to say: remember, we're dealing with a piece of tissue. What I've described so far is a bunch of neurons [INAUDIBLE] sample model, and I've told you about them individually. But the neurons are not in a vacuum, in a big mixed-up bowl. They actually have some spatial organization in there.
Neurons that like nearby things tend to live nearby. Just like in V1: if they respond to the same position of the visual field, they tend to live nearby-- that's a retinotopic map. Or if they like similar orientations, they live nearby-- that's an orientation map. In IT, there's some spatial structure on a scale of around 500 microns to a millimeter.
Tanaka's group was the one that first pushed this idea. When they reduced the features, they would show that nearby neurons tended to have similar features. They kind of drew it like this and then hoped that you would [INAUDIBLE] as a visual model of what they thought was going on from their data.
Gabriel Kreiman, with Chou Hung, working in my lab, actually showed a very similar thing later. So I think there's some support for the idea that nearby neurons in IT respond to similar things. There's some structure, again, at a 500 micron to 1 millimeter scale.
This is optical imaging data from IT. I don't have time to tell you about optical imaging, but just think of these patches as being the response area to these particular stimuli. Don't worry about the crazy cats. All I want you to see is that the blobs here-- these regions of activation-- are on the order of about 500 microns to a millimeter. They're different depending on which stimuli you show, but you can see chunks of tissue getting activated.
I don't want you to think every neuron in there does exactly the same thing. It's just that these neurons are more similar to each other than, say, a neuron here and a neuron there, which, when you go far enough apart, are very dissimilar when tested with a large number of objects. And that's the kind of data, I think, that Gabriel showed: there is spatial structure at this scale. That's how [INAUDIBLE] that.
Then I want to point out that more recently, Doris Tsao's lab, working with Margaret Livingstone and Winrich Freiwald, showed this spatial structure at an even slightly larger scale. I already alluded to this with faces. You have the IT cortex here. They were doing image contrasts motivated by Nancy Kanwisher's group, who were saying, hey, there are regions in the human that respond more to images of faces than to non-face objects. And they showed that there are these regions, which they name here.
These are their names; I don't need to take you through them. But there are regions of tissue that respond more to these than to these. And the sizes of these regions depend a lot on the MR-- things like thresholds. These are MR based: functional magnetic resonance imaging. And this is all of IT. It looks a little weird to you, but IT is basically from here to here, because of the way they rotated and inflated this monkey brain.
But there are these structures in there, and Doris later showed that if you go in and record from at least the middle patch here, and later from the anterior patches, she and Winrich showed that a lot of the neurons, as you might expect from the MR, tend to respond better to images of faces than to other things.
This is a more modern-day version of what Charlie Gross said: [INAUDIBLE] they're going to respond well to faces. But here, they're spatially clumped at a scale on the order of two to six millimeters. There's some clumping of neurons that at least like these face images better than these other images. So people tend to call these face neurons. I think that's too strong, but operationally, when we test them with a bunch of images of these things, these are the data you get.
I'm sorry I didn't take you through this: the red means high response, this means low response. You can see almost all the neurons tend to respond better to images of faces than to other things. That's the [INAUDIBLE]. The point is that there's structure at slightly larger scales, maybe for classes of things like faces.
This is just to show you, from my lab, there's some work [INAUDIBLE] the MR. There's a lot of [INAUDIBLE] work in my lab. You go in and record in IT cortex, and each of these dots is a recording location. He's measuring the response to, again, a bunch of faces versus non-face objects-- not just two images; there's a whole battery of images there. And then the color indicates the degree of selectivity, or the preference for faces over non-faces.
This is more of an unbiased sample of what's going on [INAUDIBLE]. Look at the red clumping here and here and here. These turn out to line up very well with the MR patches that we measured in the same animal. This is called the posterior patch, the middle patch, and the anterior patch. So you can see that there is some clumping at this larger scale for these kinds of objects [INAUDIBLE].
AUDIENCE: I have a question about a point made on the previous slide. So there are a lot of dimensions along which things could be similar and neurons may respond to similar things, but the brain is embedded in a three dimensional blob, so you won't be able to replicate all similarity spatially.
JAMES DICARLO: Yeah. You've got to put a whole bunch of objects on a 2D map is what you're saying.
AUDIENCE: [INAUDIBLE].
JAMES DICARLO: I think this is a great point of discussion. I'm giving a review of the literature, so I'm trying to be fair to what's out there; I have my own biases on these things, and we can debate them. I should say there's more recent work from [INAUDIBLE] lab that relates to color: you test [INAUDIBLE] color versus non-color, and they get little patches that are here, here, and here.
So to me, that looks like a repeating topographic structure in a complex shape space, not modules for [INAUDIBLE] of color. But the folks doing that work like the idea of a module, and that connects well to our own ways of thinking about understanding. My guess is that this is more of a continuum of complex shape mapping-- that there is some clumping, a natural grouping of the cells, and that when you do these contrasts, you'll get what tend to look like clumps.
That doesn't mean I discount the data; the question is how strongly you want to think of these as being [INAUDIBLE]. And the big question for this field is: do those neurons only support your ability to discriminate faces, or do they also support your ability to discriminate a clock from another clock-- which has round features, but no human would ever call either of those a face?
That's the crux of the question that's still debated, and we're starting to get close to being able to answer it. I think that's the bottom line. These are the data: if you do the MR [INAUDIBLE] with this, you get this map. The inference from that is where the debate is [INAUDIBLE]. I did hear [INAUDIBLE] a question [INAUDIBLE] as to how we interpret this. I'm telling you the data [INAUDIBLE] discussion on the edge of science.
AUDIENCE: Have there ever been neurons that not only are sensitive to faces but have been identified with individuals, that would be sensitive to an individual monkey or person?
JAMES DICARLO: So far I've been talking about face categories. The [INAUDIBLE] study that I briefly mentioned but didn't show you, and the Booth study that I mentioned, both trained monkeys to discriminate among 3D objects, or to know a bunch of 3D objects-- they each used a different type of training.
Then they had neurons that were tuned inside the space of the things the animals had been trained to discriminate among. You could call that individuation within the neural population. So I think the answer is that if you train an animal to discriminate, then you can find neurons somewhere in IT that show sensitivity to the things the animal had learned.
Now the debate becomes: how much of that was there already, and how much of that was trained in [INAUDIBLE]? There are various studies that speak to that, but I think the appropriate answer is somewhere in the middle. Some of it was already existing-- I believe some of it comes from development and exposure to natural images [INAUDIBLE]-- and then some of it came during the training period, which enforced this discrimination habit.
That's my guess, but sorting that out is still-- nobody's done that well. It's difficult. I mean, you'd have to raise baby monkeys and control all the visual statistics to really dig into those questions. I think that's where your question is sort of leaning, because what we know as individuals depends on our own history. But I mean, I don't think we'd find--
That being said, monkeys can discriminate you from someone else-- you know, one image of you from somebody else. The information's there in the pixels. But mapping that onto the animal's perception, which is a deeper question, is probably going to require some pretty-- I don't know if I-- did I answer your question? Or should I try again?
AUDIENCE: Yeah, I guess.
JAMES DICARLO: OK. Think about it some more. Again, all of these are super cool, interesting questions. I'm trying to give you guys a review of what the field's gone through and where it stands right now. But because this is a computational audience, I want to come back to object recognition.
I went through here [INAUDIBLE]. But I want to talk more about the problem of object recognition that we started with because that's a problem that needs to be solved. And then we'll take a break once we get to the last of these slides. So I'm moving pretty quick for this audience.
The problem of object recognition-- I've already alluded to this. Even if you take away color, there are many, many possible objects; estimates are on the order of tens of thousands of objects that we think we can discriminate among. These are some that we use in the lab-- we use 3D models-- just to give you a sense. OK, I don't have the slide up yet, but there are many of them. So you need to deal with that problem.
Dealing with that problem would be called selectivity among these objects, and it has to happen in the context of being able to tolerate the fact that any one of those objects, like this car, can produce an essentially infinite number of images on your retina, because you don't see the car just like this all the time. You see the car in different positions, different poses, different illumination.
You might even, to the point of categorization, want to say, well, look, these are all cars. So this is exemplar [INAUDIBLE]. And of course, a car can appear on many different backgrounds, and you can have occluding things, but you still want to call it a car. So you need to deal with those kinds of image changes.
But for all of these, humans can quickly say, that's a car. And I say we can do that-- the degree to which we do that is an empirical question that we'd have to measure, and we have measured it, which I'll show you in a moment.
But this ability to say it's the car, not one of those other 600 objects-- [INAUDIBLE] on the last slide-- or one of those other 50 objects back here, is called tolerance. To say all of these are cars, and they're not any of those other things, across all this variation-- that ability is called tolerance, or sometimes invariance. I don't like the word invariance just because it implies perfection, and we clearly don't have that in our own visual systems, but that's the colloquial term for this ability.
OK, so that sets up the problem space. Why is this hard? Again, I like this geometric interpretation; some people in my lab don't. Well, you guys can see what you think. So think about neurons living in a state space, where each axis is the activity of one neuron. And there could be many neurons. In the retina, there are about a million retinal ganglion cells, so a million-dimensional space.
And when you see an image-- so this is an image of an object, a 2D projection of one source that happens to be a person we'll call Joe-- it drives a pattern across a million neurons. It's just one point in this space, which I'm drawing as three-dimensional.
The interesting thing is when this object undergoes a transformation of one other latent variable-- in this case, pose. It's not just jumping all around; it's a continuous latent variable. But it's producing different images-- like me turning my head-- that move to different points in the space. And you can see it's not linear in the pixel space; it's some complex curve.
So here's this other image of Joe, but they're all images of Joe. And then here's the other latent variable, the other axis of pose-- a second pose axis. Now he tilts off that way; he's looking upward. But these are all still Joe.
And you could imagine not just the three degrees of pose; you could imagine two degrees of position, or a third for size as part of position. You can think about those kinds of latent variables. And of course, there are more that I showed you in the last slide. But it gives you the sense that there's a continuous set of points that Joe could give rise to in this space.
And that continuous but highly curved [INAUDIBLE]-- I call that Joe's identity manifold. Any point on that [INAUDIBLE] should be categorized as Joe, because Joe is the source of it; Joe's what gave rise to it. [INAUDIBLE] here, your best report is that's Joe, and not Sam or Eve or Jill or whoever. But it's this very thin sheet of points that's curved up in the space.
And so now we come back to, well, why does this matter? Well, somebody downstream-- not a monkey, but some mechanism downstream looking at the neural population-- has to make a decision when I put an image up on the monitor: is this Joe, or not Joe, where not-Joe means all the other distractor objects it has to deal with. In this case, the distractor is called Sam.
And here, of course, is the manifold of points that Sam could give rise to. What I'm showing is the simple idea of a separating hyperplane. There could be other simple mechanisms, but they're simple with respect to this population space. And we like simple mechanisms because we think of them as implementable by downstream neurons. These are linear classifiers for us: [INAUDIBLE] the population activity with a threshold.
That's why we like this. Many people like classifiers too, but we think of this as a simple, biologically plausible thing-- something engineers know how to do, that many of you guys already know how to do-- and it probably maps onto what downstream neurons [INAUDIBLE].
For instance, it's been shown in the prefrontal cortex that there are neurons that seem to be able to read off of [INAUDIBLE] to do categorization tasks. So there's evidence that downstream neurons reading from the ventral stream could use these kinds of mechanisms to solve these kinds of tasks.
OK, so again, we don't think of this as us going in there trying to extract information from the brain. We think of this as asking: if there were a mechanism in the brain, here are the kinds of mechanisms that we think are within its toolbox. That's why we use these [INAUDIBLE]: what are the downstream neurons that are viewing this, and what could they easily do?
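As a concrete stand-in for that toolbox, here is a minimal sketch (mine, not the lab's code) of the readout being described: a weighted sum of population activity passed through a threshold, i.e., a linear classifier, trained here with plain perceptron updates on simulated responses. All the names and numbers are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def readout(r, w, theta):
    """One downstream 'neuron': fire (1) if the weighted sum of its
    inputs crosses the threshold, else stay silent (0)."""
    return int(w @ r - theta > 0)

# Fake population responses: 100 IT-like neurons, two objects (Joe=1, Sam=0).
n, trials = 100, 200
axis = rng.standard_normal(n)                       # direction separating classes
labels = rng.integers(0, 2, trials)
R = 0.5 * np.outer(2 * labels - 1, axis) + rng.standard_normal((trials, n))

w, theta = np.zeros(n), 0.0
for _ in range(20):                                 # plain perceptron updates
    for r, y in zip(R, labels):
        err = y - readout(r, w, theta)
        w += err * r                                # adjust synaptic weights
        theta -= err                                # adjust the threshold
pred = np.array([readout(r, w, theta) for r in R])
print("training accuracy:", np.mean(pred == labels))
```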
And I wanted to say it's not just linear separability that needs to hold here for this [INAUDIBLE]; you should be able to do it with a relatively low number of training examples. We'll talk about that later. But that's the idea of what you want: with a few training examples, you have a simple decoder that can separate all of the object manifolds. That's the ideal for a task like this.
And we think of this as shown here: if the representation can support that, then we call it explicit. It's still a population representation, but it's explicit because there's only one little step of training needed, with a simple decoder, to extract the information. So that's why, again, explicit. That's why--
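One way to operationalize "explicit," as a sketch under my own assumptions (not the lab's actual test): train a simple linear decoder on only a few examples and measure cross-validated accuracy. A toy untangled code passes; a toy tangled code of the same dimensionality, where identity is only multiplied into nuisance variables, sits near chance.

```python
import numpy as np

rng = np.random.default_rng(2)

def few_shot_accuracy(X, y, n_train=10, n_reps=50):
    """Cross-validated accuracy of a least-squares linear decoder
    trained on only n_train examples."""
    acc = []
    for _ in range(n_reps):
        idx = rng.permutation(len(y))
        tr, te = idx[:n_train], idx[n_train:]
        A = np.column_stack([X[tr], np.ones(len(tr))])
        w, *_ = np.linalg.lstsq(A, 2.0 * y[tr] - 1.0, rcond=None)
        pred = np.column_stack([X[te], np.ones(len(te))]) @ w > 0
        acc.append(np.mean(pred == y[te]))
    return np.mean(acc)

n = 400
y = rng.integers(0, 2, n)
nuisance = rng.standard_normal((n, 2))             # e.g., pose variables

# Untangled: identity sits along one population axis, nuisance along others.
untangled = np.column_stack([2.0 * y - 1.0 + 0.3 * rng.standard_normal(n),
                             nuisance])

# Tangled: identity only appears multiplied into the nuisance (XOR-like).
theta = np.pi * y + 0.3 * rng.standard_normal(n)
tangled = np.column_stack([np.sin(theta) * nuisance[:, 0],
                           np.cos(theta) * nuisance[:, 0],
                           nuisance[:, 1]])

print("untangled few-shot accuracy:", few_shot_accuracy(untangled, y))
print("tangled few-shot accuracy:  ", few_shot_accuracy(tangled, y))
```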
AUDIENCE: So are the transformations that lead to the separation supposed to be static?
JAMES DICARLO: In this case, these are just the different static images of Joe.
AUDIENCE: Sorry, I mean, are they going to teach you all sorts of [INAUDIBLE]. The idea is that neurons have a visual hierarchy, but if I [INAUDIBLE] it to maintain separability, or is it--
JAMES DICARLO: Well, you're asking how-- I'm just giving you a sense of what a good representation might look like. You're asking how you actually build it and get it to be that way. I'm saying this is what you need to support this task of core recognition: here's an image flash [INAUDIBLE]; you need to pass it through some [INAUDIBLE] to press the right button. If you're going to do this well, you're going to need a representation that looks like this [INAUDIBLE]. It's not a simple thing.
That's all I'm saying. I didn't tell you how to do this. This is just to give you a sense of what I mean by good. It also leads to our test of what we think is good: using population tools to ask, is the information explicit? That's what I mean by it. That's all I'm trying to say. I didn't tell you how to get there yet. So maybe I'll show two more slides and then you can ask me again.
I want to show you that when-- I'm saying explicit representation of shape. We talked about-- I think you asked about identity and category. We assume that, again, category and identity [INAUDIBLE] are maybe just different grains of shape discrimination.
So you can do identity-- this is actually subordinate level, which is what we were talking about earlier, right? But you could imagine boat versus car, a basic level discrimination. The concepts are still the same: we think you need a shape representation that can support these kinds of things.
OK, I want to point out that this is simulated-- a simulated [INAUDIBLE] of what's going on in the retina, a million-dimensional space. Well, this isn't quite a million-dimensional space; it's more like a 20,000 [? something ?] space.
You take these two three-dimensional objects and project them into a three-dimensional view, and it looks really awful, which is why this is so hard. These are objects undergoing view changes in this case, and it looks like a red and blue mess, [INAUDIBLE]. But they're actually just crumpled up sheets of paper living in a high dimensional space, and you can't easily see where the separating hyperplanes live.
When we project it into three dimensions, it looks ugly, and we call that tangled. That's why you can't put simple [INAUDIBLE] on pixels to deal with the recognition problem with a limited number of training examples on pixel spaces, even with a much nicer looking number of [INAUDIBLE].
Again, [INAUDIBLE] thinking about feature spaces and learning. But I think neuroscientists need to understand that [INAUDIBLE]. I like to say that the manifolds are tangled in this initial space. That's why the problem is hard and interesting: you have tangled information in the [INAUDIBLE] spaces of pixels and retinal ganglion cells. Then you go through some sort of transformation, and maybe you have more untangled information.
And I use that word to mean a couple of things that I've shown here. First, there's separability: you can put a separating hyperplane between them. Second, you can do it with lower numbers of training samples.
And I think the word also helps imply that it's not just building grandmother cells for Joe and Sam, right? Those would be points in the activity space. This is IT, right? These are IT neurons here; this is how we think about IT. It's not that there's a Joe cell and a Sam cell that develop in IT-- then those manifolds from the [INAUDIBLE] would project to a point in this case.
So what are the other axes there? Does anybody know from what I've shown so far? This is a state space [INAUDIBLE]. If I put a classifier here, I can detect Sam and not Joe. But what are these? Can you see from what I-- I went through it quickly.
But what information is represented along these other axes?
AUDIENCE: Pose?
JAMES DICARLO: Yeah, pose in this case, right? So we can read out the other latent parameters that were there initially. The system didn't throw them away; it just recoded them.
So here, if you want to do other things, like estimate [INAUDIBLE], the principle is the same [INAUDIBLE], to some degree. And we've done some recent work on that if you want to talk about it. But I think it gets people away from [INAUDIBLE]. Then just think about: what are the latent parameters, the latent [INAUDIBLE]?
We think IT's done a reasonably good job-- the ventral stream's done a reasonably good job, reflected in the responses of IT, of untangling those variables from each other--
AUDIENCE: Jim, can you go back for a second?
JAMES DICARLO: Yeah.
AUDIENCE: How did you get to the first-- the crumpled one?
JAMES DICARLO: How did we generate it?
AUDIENCE: Yeah.
JAMES DICARLO: So this is generated by-- what you do is you take the 3D objects and you simulate them over-- I'm trying to remember, because this is pretty old now-- at least 3 degrees of pose. I think that's what it was. Maybe, actually, I think it was 2 degrees of pose.
And then you have all these points in high-dimensional space. You simulate all that, and then you try to find the axes-- we were trying to find the axes, even from training examples, that best separate the classes, which is one way to get the projections that we showed you.
And I'm trying to remember exactly how we did that. It was either a simple [INAUDIBLE] classifier or something close to that-- not exactly the same.
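A rough reconstruction of that recipe (my assumptions throughout; the original code is not shown in the talk): sweep two pose angles on a grid, map each pose non-linearly into a high-dimensional "pixel" space, then project both object manifolds onto a crude separating axis plus two principal components for display.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 2000                                  # stand-in for pixel dimensionality
W = rng.standard_normal((D, 6))           # fixed random "rendering" weights

def render(obj_id, az, el):
    """Toy non-linear 'image' of object obj_id at pose (az, el)."""
    feats = np.array([np.sin(az), np.cos(az), np.sin(el), np.cos(el),
                      az * el, float(obj_id)])
    return np.tanh(W @ feats)

# Two degrees of pose, swept on a grid, for each of two objects.
poses = [(az, el) for az in np.linspace(-1, 1, 15)
                  for el in np.linspace(-1, 1, 15)]
joe = np.array([render(0, az, el) for az, el in poses])
sam = np.array([render(1, az, el) for az, el in poses])
X = np.vstack([joe, sam])                 # 450 points in 2000 dimensions

# Crude separating axis (difference of class means), then PCA on the rest.
sep = joe.mean(0) - sam.mean(0)
sep /= np.linalg.norm(sep)
Xc = X - X.mean(0)
Xc -= np.outer(Xc @ sep, sep)             # remove the separating direction
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
proj = np.column_stack([X @ sep, X @ Vt[0], X @ Vt[1]])
print(proj.shape)                         # (450, 3): a 3D view of two manifolds
```

Plotting the two halves of `proj` in different colors gives exactly the kind of crumpled, apparently interpenetrating sheets being discussed, even though the manifolds never actually intersect in the full space.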
AUDIENCE: Because there's a simple argument that this will lead to the formation in the figure. Then you don't have [INAUDIBLE]. You don't have intersections. They're separate in the very high-dimensional space.
JAMES DICARLO: I didn't-- I'm sorry. Thank you for-- I didn't mean to imply that these were intersecting.
AUDIENCE: OK.
JAMES DICARLO: They look like they're intersecting. That's why [INAUDIBLE] tangled-- because if I took two sheets of paper here and rolled them up, they never fuse. Except when these two people [INAUDIBLE], and then I have ambiguous information; then they will contact.
But we think that's the unusual case that, I think as you were saying, too, that in high-dimensional space, the stuff is still there.
AUDIENCE: There is a lot of room in a high-dimensional--
JAMES DICARLO: There's a lot of room. But that's the point: there's a lot of room, but finding that room with a limited number of training examples is still a hard problem.
AUDIENCE: That's it.
JAMES DICARLO: Yeah. Right. And I know there are two different ways to think about this. [INAUDIBLE] is not just about linear separability. In a high-dimensional space with enough training examples, I can find a hyperplane, right? I think the fundamental point is that it's there; it's just very hard to find in that space.
So another way to think of this is that what you care about is not linear separability per se but the effective dimensionality of the space, meaning the number of training examples you need to find that hyperplane is being reduced, yet we're maintaining the information.
AUDIENCE: [INAUDIBLE]. Yeah. And is IT a recurrent network, or just feedforward up to there?
JAMES DICARLO: I haven't shown you evidence yet that IT does what I said here. I was trying to give you a sense of why the problem looks hard at the retina and how it might look [INAUDIBLE] for IT. And there's nothing to support that up until [INAUDIBLE] moment.
And from that previous slide that I briefly went through-- when I make a plot like this, all I need is the firing rate of one IT neuron over some time window [INAUDIBLE]. So is that produced by a recurrent network? Possibly. Possibly by [INAUDIBLE] recurrent network.
But we can measure the firing over some time window [INAUDIBLE]-- neuron one, neuron two, neuron three. And then this is what the information looks like when measured that way.
AUDIENCE: Because then you need another dimension for time to represent time and some information that--
JAMES DICARLO: Right. Again, my presentation's been ignoring [INAUDIBLE]. Here's the image; tell me what it was. It's not about predicting what's going to happen next. It's just about separating [INAUDIBLE] from [INAUDIBLE]. Core recognition is just a static image. And--
AUDIENCE: But a static image can have spatiotemporal--
JAMES DICARLO: It could. They do this. But I can show you plots of how IT supports this, and actually IT responses are pretty stable over the time we observe, out to 200 milliseconds. There's a little bit of modulation in there, and sometimes information may come up a little earlier or later, but I'd have to show you the data on that.
This is just to fix the conceptual ideas about what the problem looks like, at least the problem we were thinking about. So again, you may like it or not, but I think this is a useful discussion.
In [INAUDIBLE], what would you want it to look like at the end? And now, what's the evidence that it looks like that?
AUDIENCE: Now why do some people in your lab not like the geometric [INAUDIBLE]? Can you give me [INAUDIBLE] about that?
JAMES DICARLO: It's hard for me to say. I guess I just like this point of view. I don't know.
AUDIENCE: [INAUDIBLE].
JAMES DICARLO: Alex was asking why some people don't like this point of view. And I can't [INAUDIBLE] in that. But I think [INAUDIBLE] guy. He doesn't like it. And [INAUDIBLE] a real manifold might be something [INAUDIBLE] there.
But every time you describe what's going on, he'll go, what you really mean is that [INAUDIBLE]. It's just a matter of how you think about the problem. I don't think he disagrees with me [INAUDIBLE].
I like to think about it geometrically [INAUDIBLE]. But maybe our intuitions [INAUDIBLE] are not helping us, to some degree. Again, for the neuro audience, I think the key thing here is that the problem is hard, and this is why: because of the transformations of these variables. And yet you don't throw everything away.
And again, people have said that in the computational literature; it's just that the field of neuroscience didn't know it or didn't factor it in. But I think this kind of drawing helps bring a larger audience into what's going on and what you would want to do.
Again, the details start to nag, but this is why we did this: to explain to people what's interesting, what's hard, and what the [INAUDIBLE] looks like. It's just conceptual. Yeah.
AUDIENCE: So [INAUDIBLE] the assumption of linear separability?
JAMES DICARLO: Sure.
AUDIENCE: What evidence is there to support that?
JAMES DICARLO: Just the fact that you can do the task.
AUDIENCE: Why not anything more complicated?
JAMES DICARLO: OK. So here's the way I think about it. Because we're recording here, we want to ask about-- we think these are powerful features, and then we do this.
But of course, you need linear separability somewhere. And of course, if I were back here, it's not linearly separable, which is exactly what's being implied there. So I'm not saying the whole transformation is linear. I'm saying that from some point in the brain onward, it's probably linear, and we think that point is here.
AUDIENCE: Well, why do you need linear separability?
JAMES DICARLO: Somebody-- some [INAUDIBLE] neuron has to eventually make a decision, whether it's a motor neuron or the basal ganglia. Somebody has to make a decision, and that decision can be modeled by a linear discriminant.
A linear discriminant is one way to model a decision. There are other ways, but they're all relatively simple. At some point, you need to get simple-- the motor neuron's not complicated, right?
You have to get simple. You have to make the [INAUDIBLE] somewhere in the brain at some point when you're dealing with action. That's all I'm saying.
AUDIENCE: So that is [INAUDIBLE] modular interface the--
JAMES DICARLO: The [INAUDIBLE] provides [INAUDIBLE] basis, which then could [INAUDIBLE]. But the [INAUDIBLE] neurons [INAUDIBLE] might not know who Joe and Sam are. We had training for a while on Joe and Sam-- oh, let's just learn up some linear weights, [INAUDIBLE] on that linear performance. That's how I think about it.
That's a hypothesis, right? I mean, [INAUDIBLE] saying you have [INAUDIBLE], you have measurements, and you try to figure out what's the mechanism to get from A to B. So--
AUDIENCE: I don't have any trouble with the linear thinking because it seems like there are all sorts of classifiers that [INAUDIBLE].
JAMES DICARLO: Sure. And if we open up the box of classifiers, then we could say, well, [INAUDIBLE] throw all this brain stuff out, make this all [INAUDIBLE], and say: I need a non-linear classifier for the problem of separating Joe from Sam over a bunch of transformations. It's a non-linear classifier-- let's all get back to work.
That's not helpful, right? Because if you're going to go into the system, you want to pick points and say, look, it looks linear from here, and now the problem's over there; it's not linear from here--
AUDIENCE: I mean, just because you can use a linear classifier to decode at the final layer doesn't mean that's what the brain [INAUDIBLE] is doing.
JAMES DICARLO: Fair enough. But remember, at the beginning, I said this is a predictive model. This is a hypothesis that's testable. If you have an alternative one, just put it up there and we'll test it.
So that's what we're doing in [INAUDIBLE] the next part: if this is right, then it makes predictions about what behavior should look like, so we go measure those predictions.
And I'm not saying the brain is linear, of course. I'm just saying that from here to the kind of task I showed you at the beginning seems to be well fit by a linear decoder. Now, again, remember, those are the data you make inferences from. "Well, I don't like the way you're thinking about the brain"-- that's not a real criticism to me.
What I'd like is: here's an alternative that could also explain the data. The data are behavioral performances on the task, in this case, and I'm going to show you that linear decodes on this can basically predict them.
So does that mean the brain's just doing a linear decode? I don't know. But it definitely could be. It's a hypothesis.
So we're never going to prove anything. We're just going to test ideas. And so then we'll say, well, what does that predict that we haven't yet tested? What other idea would you like?
So that's how we're proceeding. And I don't want you to think it's all linear, because this is not linear up here. If this were linear, we would [INAUDIBLE] rotation in space-- I mean, a linear transformation is just going to rotate that space around [INAUDIBLE]. That's the [INAUDIBLE] interesting part of the problem. All we're trying to do is say, look, the problem is already solved up here, and then our work is mostly trying to understand what's going on in the non-linear parts. That's how the lab sees it. We started at the end, defined [INAUDIBLE], and then we work toward the middle to understand how you produce that.
AUDIENCE: So if the intermediate layers are doing non-linear transformations and [INAUDIBLE], why do you not have that same thing going on at the decoding?
JAMES DICARLO: This is a subtle point. It's actually linear, non-linear: linear with a threshold. That threshold is a huge non-linearity. This is the model, and this is also very much like that: linear with a threshold. There are other types of non-linearity, but linear with a threshold is non-linear. So when I say "linear," I mean linear with a threshold. The decision is a threshold; that's the non-linearity.
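Written out in symbols (my notation, not a slide from the talk), the readout being described is a linear stage followed by a hard threshold:

$$\hat{y} \;=\; H\big(\mathbf{w}^{\top}\mathbf{r} - \theta\big), \qquad H(u)=\begin{cases}1, & u>0\\ 0, & u\le 0\end{cases}$$

where $\mathbf{r}$ is the population activity vector, $\mathbf{w}$ the synaptic weights, and $\theta$ the threshold. Stacking such stages, $\mathbf{r}^{(k+1)} = \big[\,W^{(k)}\mathbf{r}^{(k)} - \boldsymbol{\theta}^{(k)}\,\big]_{+}$, gives a cascade that is linear at every stage except for the threshold or rectification, which is precisely what makes the overall transformation non-linear.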
So I like the spirit of what you're saying. Whatever [INAUDIBLE] down here, you'd like to think of it working here and here and here, whatever that class [INAUDIBLE]. And linear with a threshold is an option [INAUDIBLE]. And [INAUDIBLE] ignoring the feedback, asking what it's for. We're just driven by the data to explain the data, and this is the fun discussion that happens.
AUDIENCE: Just another comment on that. Even if you're using a non-linear [INAUDIBLE] or something, eventually it's just leading to a linear separation. If you're dealing with a non-linear SVM, it still boils down to an SVM in a transformed space. At some point, it has to be linear if you're--
JAMES DICARLO: Well, I think that's all I'm saying.
AUDIENCE: Exactly. I agree with you. I'm just trying to explain.
AUDIENCE: I know that, but you can write any function as [INAUDIBLE] with any other function-- a composition of two other functions. That's fine, but that's not really the point.
JAMES DICARLO: Of course, we all believe that the global transformation is interestingly non-linear. The question is, how do we get into the brain and [INAUDIBLE] what the brain is doing? All I'm saying is, go in and record in [INAUDIBLE]. It looks like, from that point, linear with a threshold explains the behavioral data in the [INAUDIBLE] that I've defined here. And your reaction to that might be, linear [INAUDIBLE] is wrong because I wish it were non-linear from IT, or some [INAUDIBLE] motivates me to not like the work. But you can't deny the data that--
AUDIENCE: It's not that I wish it were non-linear. I just don't see why the fact that-- you could train other non-linear models [INAUDIBLE] on the [INAUDIBLE] and you would also be able to decode, right?
JAMES DICARLO: Yeah, but [INAUDIBLE]. You'd say linear is more parsimonious than those models, so I want the most parsimonious model from IT to explain the data. And that, so far, is--
AUDIENCE: I think it's arguable whether it's more parsimonious to say, well, as you go higher, it's linear rather than to actually use a more complicated model.
JAMES DICARLO: I thought you meant from IT. You were saying, given--
[INTERPOSING VOICES]
Linear decoders or non-linear? Because I'm saying linear decoders work, and so when [INAUDIBLE], IT looks like this in the state space. That's what I'm saying.
AUDIENCE: That is what I meant, but you could presumably go to a lower area and use a non-linear decoder. Maybe that would also be parsimonious. I don't know.
JAMES DICARLO: Yeah, but you'd want to build that non-linear decoder so it's structured to look a little bit like this. But yeah, you could do that. If you want to go record, or you want to build non-linear decoders to do [INAUDIBLE], be my guest. You could model V1 right now-- I could give you a model of V1 and you could [INAUDIBLE]. That, in some sense, is what I'll show at the end of the talk, what we've been doing on the decoding work.
We're building models that go from here to try to solve the tasks-- to try to find a non-linear transformation-- and then later ask, how well do they match the neurons? That's just for us. We put the split here because we recorded here and called that [INAUDIBLE]. This is decoding and that's encoding, so we can use linear decoders with a threshold from this point, and we need non-linear encoders to get to this point. And overall, the whole thing's non-linear; that's just where we came into the problem.
That's all. We could move this point over here: we could call it sort of linear up to here and then build a non-linear piece there and just call those decoding and encoding. Relative to where you're standing, it's going to be more or less non-linear. That's all I'm saying.
AUDIENCE: I like the motivation of the linear readout from a simple model of a neuron: a dot product and a threshold. But that's a very simple model of a neuron, and we know they're more complex in certain ways. What classes of classifiers does that open up when you take into account [INAUDIBLE]?
JAMES DICARLO: Right. And even within the space of simple decoders, there's a lot of room for choice; at some point, the data are not going to constrain all those possibilities. I'm not disagreeing with you on any of this-- of course, the brain is probably not as simple as what I'm describing on the screen. But I'm saying this is enough to explain the behavioral performance, and that's a piece of evidence that we as a field have to reckon with and decide what we interpret it to mean. All I'm saying is that, to me, it means the layout of the state space looks something like this.
AUDIENCE: I think what Alex is imagining and what I'm alluding to is that as far as I know, there's other data that's well established that you're ignoring.
JAMES DICARLO: On IT?
AUDIENCE: No, that point neurons aren't really neurons.
JAMES DICARLO: Neurons are not neurons?
AUDIENCE: Point neurons.
JAMES DICARLO: Point neurons are not--
AUDIENCE: Linear, non-linear models for neurons, they aren't perfect models of-- and the feedback stuff. You, of course, acknowledge it, but you say, well, I don't want to deal with that.
JAMES DICARLO: I'm just saying I don't need to make a more complex model.
AUDIENCE: But it's not more complex. He's talking about data. That is the data. The data is--
JAMES DICARLO: No, no. This is a concept slide. There's no data on this slide. [INAUDIBLE] can show you.
AUDIENCE: No. I wasn't pointing at the slide as if I thought that was actually the data, but the [INAUDIBLE]. I think, there's a lot of feedback. It's a [INAUDIBLE] to postulate that it's there because you know it's there.
JAMES DICARLO: Again, I've tried to point out that the fact that we can record up here is not denying the importance of feedback-- to keep using that word, and you have to define the word. It does not deny that. It just says that from up here you get to the [INAUDIBLE]. Do you like that? That's all it says.
AUDIENCE: I don't see how-- I mean, your model doesn't have that in it.
JAMES DICARLO: When you say my model-- there's no model yet, even on the page. This is just a conceptual idea, [INAUDIBLE] simple, and I'm going to show you the evidence that it works. So that's all. Why don't we go through the evidence, and then we'll have this discussion again, OK? I think we should take a break, because I was near the end of this section anyway.
I just want to say, for the people that are computationally oriented, I just think of this as a [INAUDIBLE] encoding basis for these kinds of tasks, and it's a powerful encoding basis. It doesn't say how we get here-- we can call it magic. A magic basis does well; a basis that we know how to build from the image does not do well. It doesn't tell you how to get the magic basis; it's just that the magic basis is very powerful with respect to the [INAUDIBLE]. And I don't mean powerful in a computer-vision, high-performance sense.
I mean powerful in that it matches the kinds of errors and difficulties that humans have, which shouldn't be at all surprising [INAUDIBLE], because a human has neurons in his head that somewhere have to correlate with his performance. That's, in one sense, what I'm saying about neuroscience: those neurons that support the task live here, and [INAUDIBLE] the evidence that I'm going to show you next.
So let's take a break. Sorry-- question. I'm going to try to keep it a quick break, because we're moving to the data in our discussion here. And then, if we have time, we'll talk about the decoding model, which is probably really interesting [INAUDIBLE]. One question.
AUDIENCE: I wanted to ask [INAUDIBLE] question, [INAUDIBLE] in the history of [INAUDIBLE]. The concept was almost there already between [INAUDIBLE] and [INAUDIBLE], and the technique of [INAUDIBLE] existed. So how come-- the technique and [INAUDIBLE] were almost there, and you could have done all of these experiments [INAUDIBLE] in IT, showing images and showing the-- I mean, why did it take so long to have these--
JAMES DICARLO: To do what? I showed the history of the field, to record and measure some images? What? [INAUDIBLE] part of that chain. But what step are you saying took so long to do?
AUDIENCE: I don't know because [INAUDIBLE] neuroscience because of the techniques [INAUDIBLE], and this seems like it could have been done a lot earlier.
JAMES DICARLO: Yeah. I mean, really, what this is is just a blending-- and Tommy and I gave the initial [INAUDIBLE] together; he probably gets a lot of the credit for it. It's a blending of ideas from machine learning, population recording methods, and high-level [INAUDIBLE]. [INAUDIBLE], a lot of people [INAUDIBLE] were just like, I'm going to record a neuron, and all they knew how to do was [INAUDIBLE] images, because that's where the features were.
But if you start to take this view-- think about the task and how you might solve it and what kind of decoder [INAUDIBLE]-- then yeah, it can become obvious now. I think that's your question: why hadn't they done it? I think the people working there would point out that it was a neglected part of the brain for the last 10 years. There was a Japanese group's foundation; it was neglected.
We're trying to bring ideas that a lot of you guys know [INAUDIBLE] better than I do onto these kinds of problems that just mapped out of neuroscience. That's the spirit, I think, of part of [INAUDIBLE]: there are insights here that inform in both directions-- one is [INAUDIBLE] modeling, the other is knowledge about what's going on in the brain. That's, I think, the spirit of [INAUDIBLE]. That would be the vision part here, as I see it.