Reverse engineering visual object recognition
- All Captioned Videos
- Brains, Minds and Machines Summer Course 2020
PRESENTER 1: Welcome back, everyone. It's a great pleasure to introduce Professor James DiCarlo. He's the head of the Department of Brain and Cognitive Sciences. He was trained by Kenneth Johnson and John Maunsell.
And he contributes what I think is a perfect example of the marriage of the three levels of analysis that Tommy Poggio was alluding to-- combining computational work, investigating neural circuits at the neurophysiological level, as well as building computational models and studying behavior and cognition. I don't want to take any time away from James-- so please, Jim, go ahead.
JAMES DICARLO: OK. Thank you, Gabriel. And thank you, Tommy, and the other organizers for organizing this summer school virtually and for giving me a chance to speak a little bit about our science, but also to try to give you a sense, as Gabriel said, of how the spirit of Brains, Minds, and Machines from CBMM-- I think all three of those areas can be brought together to the benefit of all of them.
Again, I think that's what's quite exciting about CBMM. And I hope that's why many of you are listening today. To that point, my first slide-- why are you here to listen to my talk? Or why are you here at CBMM?
I'm going to let our first department head for Brain and Cognitive Sciences, Hans-Lukas Teuber, motivate you a bit-- and what motivated me into the field. So you're going to listen to some audio here. This is Hans-Lukas Teuber in 1974 motivating a question that I'm going to talk a bit about today.
HANS-LUKAS TEUBER: In a way, we could say that the first step in any new field is to know what to be astonished at-- what to wonder at. A very, very gifted theoretical physicist by the name of Murray Gell-Mann was here on a visit from that strange place on the west coast, Caltech-- you know, our competitors. We tried very hard to persuade him not to go back there, but we failed.
Gell-Mann said to me, without any provocation on my part, that he thought ours was a weak-kneed department, rather than being unorthodox, as I claim, because it had in it neuroanatomy and neurophysiology and linguistics and other things that are usually left out. It was not bold enough, because we were not studying those issues in psychology that were really important. I said, which ones are important?
And he said-- and he absolutely floored me by saying that, because I think he's bright. I know he's bright. He said, well, you shouldn't study perception. You should study extra-sensory perception. You should not study language, but thought transference and hypnosis.
There was one third thing, which I have forgotten or repressed, as some people say. It might come back under hypnosis, maybe. But in any case, I was astounded and shocked because he said all this in front of our then-provost-- fortunately, a very good friend of mine.
I tried to convey to him that the thing to wonder about is perception-- that if extra-sensory perception did exist, I would feel somewhat let down. I'm not against its existing, but I would feel deflated, because it would be much easier to understand.
Perception is the great riddle. How do I see a line as a line and a face as a face? And as that face amongst the 1,000 faces? Where emotion and memory enter-- that's a much greater riddle.
JAMES DICARLO: OK, so I'm in Teuber's shoes now, as the Head of Brain and Cognitive Sciences. The thing I wanted you to take from that clip is that even back then, he was motivating something that at the time seemed almost simple-- the problem of perception. That is one of the things we should be wondering at and astonished about. Indeed, visual intelligence and visual perception is one of the most astonishing things that the human brain is able to accomplish. And I'm going to tell you a bit about one aspect of visual intelligence today.
And again, Teuber mentioned other things-- language and other things as well. And I think you'll hear about those in other talks. But today, my talk is going to focus on an aspect of visual intelligence that is really motivated by identifying the nouns that are out there in the world-- things like, what is in this scene? Where are the cars? Where are the people?
So these are objects of the world. And being able to pick those out of pixels and to say they're either there or not there. And where are they, exactly, in the scene? Those are fundamental questions in visual perception and computer vision. And we call them and put them under the umbrella of visual object recognition.
There are more complicated questions, like, what will happen next? And where is it safe to walk? And so forth. And these are questions that you'll hear from other speakers during this course. But I'm going to tell you about the first three-- these core visual object recognition questions. Just basically, what's out there?
OK, before I do that, I wanted to-- since this is a CBMM talk-- orient all of you. Many of you are coming from computer science or AI backgrounds, or you're interested in an AI approach to things. So what you're trying to do, I think-- and I'd put myself in this category-- is build some silicon-based system, what we call computational intelligence, that meets or exceeds biological or human intelligence. We'd ideally like that to be a good model of the system-- or maybe even, again, exceed the system. So you have lots of choices if you're going to take that as your goal. And many of you out in the audience, I think, might take this as your goal.
First of all, you could decide, as a strategy, to ignore brain sciences and just proceed as straight-up engineering and do your best. And I don't criticize that approach. But I don't think you'd be here at this Brains, Minds, and Machines course if you thought that that was the right way to go. But I want to point out there are other things you could do.
You could ignore the brain sciences in what you actually do day-to-day, but talk about your systems as brain-inspired or brain-like. I'm not a big fan of that approach. It's really good for advertising and PR, but it's taking the branding of the brain and the mind without actually using the data to constrain what you're doing. So I encourage you to think beyond just stamping things with a brain-like seal of approval, and move to actual data measured from behavior, neurons, or other brain activity.
You could use human performance as a benchmark to report your progress. That is a bit better, because you're using the constraints of human performance. And so you're moving in the right direction as you do that. And many of you may be thinking about that as you think about that work you want to do beyond the CBMM course.
If you're a scientist-- this is the opposite approach-- coming from that background, you might say, well, let's start with some very simple, reduced neural system and hope that some principles of intelligence will be discovered. That's motivated by the code of DNA first being worked out in simpler systems, like viruses-- the idea that you work on principles first and scale up later.
Again, a lot of people take that approach. And I don't think that's a bad idea. But I don't think it's going to be enough for what our project is here.
Now, this is, again, my opinion-- not fact. I'm just giving you the lay of the land here. I'm going to contextualize what we do as forward-engineering systems within wisely-chosen brain science measurements-- looking at the data to test deviations from those measurements as you build those systems, and adjusting your system-building if the deviations are growing rather than shrinking.
So we're trying to build better and better approximations of what's actually going on in the biology to accomplish the intelligence tasks that interest us-- in our case, visual intelligence tasks. And I refer to this broadly as reverse engineering, as a theme of approach. A bonus of this approach, relative to some of the approaches above, is that it enables advances in human health, education, and brain-machine interfaces, to name a few applications.
And by the way-- and I think Tommy would be the first to point this out-- there's a small bonus that if you work on this, you're working on one of the greatest open science problems in the history of our species: how biology gives rise to intelligence and consciousness and so forth. And that's really motivating, beyond just building good AI systems.
OK, so that, to me, falls under this umbrella of reverse engineering, which I also sometimes describe this way: our goal is to account for the ability of the mind-- say, visually intelligent behavior-- using systems of components in the brain-- for instance, connected neurons-- in the language of engineering, as predictive, buildable systems.
So this is where science and engineering meet, around what we'll call algorithms-- which to us are specific in-silico neural networks that actually work. They can do tasks that engineers would like them to do, and they can explain the measurements of both behavior and neurons that you find within the mind and the brain. They're motivated by both engineering and science. And building them benefits both fields-- as hypotheses for brain and cognitive sciences and as actual applications for engineering.
So it's quite exciting right now that engineering and science are coming together. And again, CBMM represents that intersection. And I'm going to give you an example of how this science merging with engineering has really been blossoming in the area of visual intelligence-- or really, visual recognition.
And when I say we, I'm mostly referring to work from some of these folks down here. This is my current lab. And I'll try to highlight them along the way. But they deserve all the credit for things I'll show you. I'm just an ambassador of their great accomplishments.
So when I started building a lab at MIT around 2002, we said, let's set about trying to reverse-engineer some aspects of human visual intelligence. And these are some of the aspects that I've already referred to in the earlier slide-- what's out there in the scene. And to do that, we said, let's reduce the problem to something manageable. So rather than how you absorb that whole scene and think about that whole scene as a scene understanding problem, we focused on how the primate brain extracts object identity from that scene.
And the first thing we applied is the notion that you, as a primate, do not just absorb the whole scene at once. You have high acuity at the center of gaze-- so if this red dot is your fixation point, this is illustrating, say, the central 10 degrees, where you have high-acuity vision. That's the portion of the visual field we've been analyzing-- the part of your retina that processes the central 10 degrees, to which a lot of your brain is devoted. And the way you analyze the larger scene as a whole is by making eye movements around the scene, with fixation durations-- shown by the dots-- on the order of a couple hundred milliseconds.
So just a split second at each of these dot locations, as an example-- so-called scan path. I'm telling you all this, because now, instead of us thinking about a whole scene analysis problem, we can think about what you do in each of those glimpses. And I'm going to now show you those short fixation glimpses here as a sequence of frames. Now, if you just fixate in the middle here while I play these glimpses for you. Those are 10 degree cut-out glimpses of the scene. I'll play that one more time.
I hope you can see, as I show those, that you're not solving all problems of visual intelligence in those brief glimpses. But you're able to identify things like, what is out there? Cars, signs, people. You're also able to identify things about the pose and position of those objects within those glimpses as well.
So this reduced problem focuses the question. You can answer questions like, is a car here? Is a person here? And so forth. This reduced problem is what we refer to as core object recognition.
And again, that's just an operational definition of a 10-degree viewing window with about a 200-millisecond viewing duration. So this, as you see, is a building block for vision and visual intelligence, but it's not all the visual intelligence. But that's what I'm going to focus on today so that we can make progress on our understanding.
OK, the way we test this behaviorally is that you might be brought into the lab and you might look at a scene like this. And you're asked to fixate the dot. And then an image might come up like that one. And I hope that each of you can recognize that what was in that image was more likely a car than a bird. And you would indicate to the right a choice of a car.
Now, this is a little bit of a tangent here to remind you that the reason the problem of "recognition" in general is hard is because the light coming off the car never strikes your eyes in exactly the same way. So you have what's called identity-preserving image variation. So the cars can be in different positions, sizes, poses, illuminations-- and have background clutter and so forth. All of that stuff that you have to deal with or be tolerant to or so-called invariant to is what the visual system is robustly able to generalize to. It's been a long-known core problem that the visual system has solved.
And so I'm just pointing that out to you to say, when we test you or other test subjects in this-- or animal species, like primates, that I'll show you in a minute-- we're careful to test not just very central objects but also wide degrees of variation. And again, the position, size, pose, and also the background.
We test both synthetic images, like the ones I showed [INAUDIBLE], where you have 3D-rendered objects on complex backgrounds, as well as naturalistic photographs taken from the real world. And we test both of these along the way. I'm telling you this to give you a sense of how we do experiments. And then we can test subjects like this again-- so here you go.
What did you think? That was a synthetic object-- more likely a face than a car. Here's another one-- OK, more likely a bird than an elephant. These are pretty easy examples.
OK, there is a natural image. That's a photograph-- more likely a bird than an elephant. So these are the kind of tests that we do. We do them, as I mentioned, on humans.
We test many humans online to measure your ability to do these-- and you're not perfect at it. You're good, but not perfect. And your pattern of errors is actually quite useful in guiding us. But you are much better than machines-- at least the machines that were around 2010 or so.
And that's what's shown on this plot. Performance-- your ability to report the object identity accurately-- is on the y-axis, plotted as a function of the variation in the object. So you're very good, even when you have high uncertainty in the object, [INAUDIBLE] shown to the right side of this plot-- and much better than machines were at the time. And that regime, where you're generalizing well across these variations, is what we call core object recognition.
So again, this is all by way of set-up to say, we study this problem of visual recognition in primates and humans. And in my lab, we study it also in rhesus monkeys, because we can gain access to the neural hardware at a much finer grain than we can in humans. Here is a monkey doing the task that I was just asking you to do a few slides ago. The monkey is triggering the screen to present a test image that you see here in the middle.
That 100-millisecond presentation is followed by a choice between two objects. You can see multiple different types of choice objects coming up. This animal knows on the order of 30 objects. It takes him about a day to learn each new object.
And he's doing it with high-variation conditions presented to the center of his gaze, as I set up for you. Black means he got the trial wrong-- so he doesn't get them all right. Green means he got the trial right. So you see he runs a lot of trials this way, and we can collect a lot of data about the monkey's behavior-- and, as I mentioned, the human behavior.
And just as a shortcut, I'm going to tell you that those studies basically reveal that you, as a primate, are equivalent to the rhesus monkey as a primate in your ability to discriminate among objects in the conditions that I just described for you as core recognition conditions. That doesn't mean monkeys know what a car is in the sense of being able to drive it away or that they can use it to get from point A to point B. It just means that they can distinguish a car from a truck as well as you can distinguish a car from a truck. And they make the same kinds of errors that you do. So this means that we have a very good, almost quantitatively accurate model inside the monkey brain of our own brains in doing this kind of intelligence test.
And again, by the way, these primate systems are much better than computer vision systems-- certainly back in 2010, and they're still actually better at doing these things. So we still have a lot to learn from primate biology about how it does better than computer vision systems. But the big picture-- primates equal to each other, and better than computer vision systems-- is what I'd like you to take from this.
OK, as I mentioned, we studied the rhesus monkey as a primate model because we can access that brain and we know much more about it. So decades of neuroscience have provided measurements of the macro and meso architecture of the non-human primate brain. And I'll show you some of those areas here.
So now we're switching exclusively to the rhesus monkey. These areas in color are what we refer to as the ventral visual stream-- a series of cortical processing areas that I'll outline for you. The highest of those purely visual areas-- it's mostly visual-- is called inferotemporal cortex, or IT cortex. We're especially interested in IT cortex and the ventral stream that leads to it, because it's long been known that lesions in this part of the brain cause deficits in tasks like core recognition.
Just to orient you, this is not visual processing that goes nowhere. The ventral stream projects to parts of the brain involved in decision and action, and also to regions in the medial temporal lobe involved in memory and value judgments. We think the ventral stream is processing the visual data into good forms-- able to explicitly encode things like objects-- to guide decision and action and to lay down long-term memories. That's just to orient you for the other talks you'll hear in the CBMM course. So here is this ventral visual stream: a series of visual processing areas in the cortex.
Remember, it's preceded by the eyes, of course, which are capturing the data in the back of the eyes, in the retinal ganglion cells-- down here is RGC. The thalamus, which lives in the middle of the brain that I'm not showing on the slide-- this is the lateral geniculate nucleus, the LGN. And then these series of cortical areas-- V1, V2, V4, and IT.
And each of these dots within each area is meant to signify a neuron. There's, of course, millions of neurons in each of these areas. And I'm just showing a schematic here to give you a feel for things.
There are both feedforward and feedback connections, as well as recurrent connections. This is a very simplified ventral visual stream diagram, just to orient you to the basic dominant neuroanatomy of the ventral stream. When you think about the function of the ventral stream, the way we usually start conceptualizing it is this: you have an image like this one-- a pattern of light that might be striking, again, the central 10 degrees of your retina. That light energy is then converted into patterns of firing in the retinal ganglion cells in the back of your eyes-- spiking neurons that transmit various spike rates.
And here I'm showing high-firing neurons in black and low-firing ones in white, just to schematize this. These signals are transmitted down your optic nerve to your lateral geniculate, which then transmits them to V1, to V2, to V4, and to IT. So you have this cascaded flow of neural activity, with different neurons firing at each level of the system, indicated here by colored dots. An image shown to the eyes produces a pattern of neural firing in each of these areas, up to this IT cortex that I outlined for you.
It takes about 100 milliseconds to go from the image to the firing at IT cortex. And the key thing is that the images on the retina, they're basically nice photographic copies of the image. They're like what your camera would capture on your smartphone. But then your brain is transforming these so that the neural pattern up in IT cortex looks nothing like a photograph anymore. It's some special pattern of neural activity that has made the object identity much more explicit than is on the retinal cells themselves. And I'll show you more about that in a minute.
But you get the basic idea: a new image leads to a new pattern in IT cortex. And if I go back to the old image, it evokes basically the same pattern of activity in IT as when I showed it the first time. So these patterns are reproducible when you study them with electrode recordings, which I'll show you in a moment. But I also want to point out that your visual system can easily follow along at timescales like this one.
These are images shown at about one per 100 milliseconds. And you can notice, if you watch this movie yourself, that you can identify one or more objects as they come by, similar to what I showed you at the beginning with the natural scene. And your visual system-- your ventral stream-- can follow along quite nicely, again with a lag of about 100 milliseconds, producing a new pattern at the top, in IT cortex, for each image.
So since this is one of your first CBMM talks, I want to remind you just a little bit of neuroanatomy. This is a rhesus monkey. As I've been showing you some behavioral data and the lay of the land, neurally-- this is what a monkey brain looks like after it's been fixed and taken out of the skull. So you can get a feel for what a monkey brain looks like. And I'm showing you that if you then cut this brain and look at the neurons, this is what you might see-- the layers of cortex all folded up here.
This is in visual area V1. I'm not going to do a neuroanatomy lesson here, other than to remind you that there are multiple layers within the cortex. Each of the little dots that you're seeing here, stained with this Nissl stain, is an actual neuron within the cortex.
If you stain it in different ways-- this is a Golgi stain-- you see all the dendritic and axonal processes associated with a subset of those neurons. So now you see the neurons as these pyramidal bodies, with all these dendrites and axons coming off of them. What we do in our experiments-- what neurophysiology in general does-- is lower microelectrodes into the tissue, close enough to individual neurons that we can record their spiking patterns. That is the workhorse method of neurophysiology.
And the basic idea-- the reason we'd really like to record these spikes-- is that spikes are what transmit information through the neural network from one set of neurons, say in visual area V1, to another set in V2. This spiking-level activity is a very privileged level of information, because spikes are the only way neurons can communicate at the speeds at which you can do things like vision. Even though there are many other things you can measure in neurons-- calcium signals and other things-- those aren't fast enough to support vision. So the spikes are what we're most interested in modeling: what the spike activity looks like within each cortical area leading up to behavior. That's why it's a special level of information.
OK, so that's a bit of neuroanatomy and context for why we study spikes inside the monkey brain. Here's back to the visual ventral stream. Remember, you're getting different images producing different patterns of activity up in IT cortex. And as I mentioned, you can lower electrodes and record from individual neurons in IT cortex.
This is an example recording from IT. These are patterns of neural spikes from one individual recording site within the monkey brain, in response to the four images you see at the top. Each tick mark is the firing of an action potential by the neuron, and each row is a different trial on which we presented the image. Just to orient you to what's going on here: when we present this first image, you can see that it always gives rise to an increased firing rate with a lag-- you see the timescale at the bottom-- of about 100 milliseconds after the image comes on at time zero.
The firing rate of the neuron increases from the background firing rate that you see in these sparse spikes to this higher firing rate. And it does that on every one of the 10 trials shown here. For the second image, you see the same thing-- increased firing in response to the image. So we sometimes say the neuron likes these two images, and it doesn't like the other two images as much-- the firing rate doesn't elevate as much.
So this is just an example so you can get a feel for the elemental neural data. Here's another site in IT cortex. It doesn't like the first image, but it likes the second, the third, and the fourth. And here's a third site, which doesn't like either of the first two, but likes the third and the fourth.
And I'm showing you this to remind you that there's a diversity of neural response types within IT. And these are examples of just random recordings within IT cortex and showing you how the neurons respond to different images. And I'll show you in a minute-- how the neurons respond to these images is actually one of the great mysteries, I think, we've been working to solve. But I'm just giving you a feel for the kind of data that you get when you record from monkey IT.
OK, now instead of tracking these individual spikes out of neurons, we make an approximation. We average the spiking activity in a time window-- in this case, I'm showing you a time window on the order of 100 milliseconds long. And we average over the repetitions-- different presentations of the image-- on the assumption that the responses are roughly the same across trials. So we're averaging out the trial-by-trial variability, and we're averaging across time, ignoring any spatiotemporal dynamics. We'll return to some of those interesting points later.
But for now, I'm just telling you how we summarize the data so that we end up with one number per image. So we can say this site in IT likes this image this much-- 60 spikes per second, the average spiking rate of this neuron to this image, in the units of these measurements. Here the average rate is 71 spikes per second; here, 25; here, 7, as examples. So now you can understand how we go from the elemental recordings to what we actually use when we model systems like this one.
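The averaging just described-- spike counts in a time window, averaged over repeated presentations of an image-- can be sketched in a few lines of code. This is a minimal illustration: the spike times are simulated, and the window bounds are arbitrary choices, not values from real recordings.

```python
import numpy as np

# Simulated spike times (in seconds) for one IT site across 10 repeated
# presentations of one image. None of these numbers come from real data.
rng = np.random.default_rng(0)
trials = [np.sort(rng.uniform(0.1, 0.3, size=rng.poisson(12))) for _ in range(10)]

def mean_firing_rate(spike_trains, t_start, t_stop):
    """Average rate (spikes/s) in [t_start, t_stop), pooled over repeated trials."""
    counts = [np.sum((st >= t_start) & (st < t_stop)) for st in spike_trains]
    return float(np.mean(counts)) / (t_stop - t_start)

# One number per site per image: e.g., the 100 ms window after response onset.
rate = mean_firing_rate(trials, 0.1, 0.2)
```

The single number `rate` plays the role of the "60 spikes per second" summary in the talk: trial-by-trial variability and within-window dynamics are both averaged away.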
And we don't record neurons just one at a time-- in our lab, we're typically recording around 300 simultaneous sites by implanting chronic arrays, like the ones shown here, which have about 100 electrodes each. They live in the animal's cortex for multiple months, so we can record responses to thousands of images over those time frames and stitch together large data sets this way.
OK, now I'm going to ask you to do a little bit of a mind flip in how you think about these data. Instead of thinking about one neuron responding to many images, I'm now asking you to think about many neurons in IT cortex responding to one image. So here's this one image at the top-- there are actually about 168 neurons shown here-- and the green color indicates the spike rate of each neuron for this particular image.
So again, the units are spikes per second. And you have this big, long response vector, whose length is the number of neurons you were recording. This is meant to be a sample estimate, from the recorded neurons, of what IT as a whole is doing.
Again, we're not recording the millions of neurons in IT. We're recording on the order of hundreds and trying to make inferences about how the system works by recording this way in IT or in other areas, like V4 or the other visual areas. OK, so this is one so-called response vector-- a population response in which we keep track of the individual responses of each neuron.
Here are eight different images. And we don't record just eight but, as I mentioned, thousands of images. So we get large data sets that look like this one: thousands of images and hundreds to thousands of neurons.
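The data layout described here-- one population response vector per image, stacked into an images x neurons matrix-- can be sketched like this. The values are simulated; only the shapes (168 sites, with 8 images standing in for the thousands actually recorded) echo the numbers mentioned in the talk.

```python
import numpy as np

# Simulated stand-in for an IT data set: rows are images, columns are sites,
# entries are average firing rates in spikes/s. All values are made up.
rng = np.random.default_rng(1)
n_images, n_neurons = 8, 168
responses = rng.poisson(lam=20.0, size=(n_images, n_neurons)).astype(float)

# One row is the population "response vector" evoked by one image...
pattern_for_image_0 = responses[0]
# ...and one column is a single site's responses across all images.
tuning_of_site_0 = responses[:, 0]
```

The "mind flip" in the talk is just the difference between reading this matrix by column (one neuron, many images) and by row (many neurons, one image).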
And by the way, if you're interested in modeling these kinds of data in your CBMM projects, we have some of these data sets online, and we're happy to share them-- and more-- with you to do projects in this space. So one of the main messages I'd like you to take from the talk at this point, beyond the background on the monkey visual system, is that these patterns in IT-- these population patterns that we can measure-- are very, very special.
And the reason they're special is that if you apply linear decoders to these populations and you test those decoders' ability to generalize to new tasks-- as if I trained you to identify a dog and then tested you on new images of dogs-- they generalize quite well. In fact, they generalize just as well as the animal does-- the animal being either the human or the monkey, which, remember, are equivalent in recognition performance.
So when you just put linear decoders on these kinds of patterns-- and we've shown this in a number of papers, some of which are listed here-- they're quite powerful, in the sense that they generalize across those hard image-variation problems, like the varying car I showed you earlier in the talk. This is all by way of background, in some sense, to tell you how we think about vision in the monkey for object recognition. And from a computer vision or AI point of view, the fact that you and I have these codes computed in our heads-- codes in IT cortex that we can measure in a monkey-- is why our brains are better than machine systems at tasks like recognition. Because once these codes are computed, linear decoders explain the rest of our behavior.
So in some sense, you can think of shifting the problem away from explaining your behavior to explaining this almost magical code up in IT cortex. What these kinds of studies taught us early on is that there are very special codes up there-- privileged in the sense that simple linear decoders can be applied to them.
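The linear-decoder idea can be illustrated with a toy simulation: "IT-like" population responses for two object classes, a linear readout fit on training images, then evaluated on held-out images. Everything here is synthetic-- the class patterns, the noise level, and the ridge decoder are illustrative assumptions; only the general point, that a simple linear readout of the population generalizes to new images, is from the talk.

```python
import numpy as np

rng = np.random.default_rng(2)
n_neurons = 168  # matches the site count mentioned; otherwise arbitrary

# Each object class gets a characteristic (made-up) population pattern.
mean_a = rng.normal(0.0, 1.0, n_neurons)
mean_b = rng.normal(0.0, 1.0, n_neurons)

def sample(mean, n):
    # Each "image" evokes the class pattern plus trial-to-trial variability.
    return mean + rng.normal(0.0, 0.8, size=(n, n_neurons))

X_train = np.vstack([sample(mean_a, 50), sample(mean_b, 50)])
y_train = np.array([1.0] * 50 + [-1.0] * 50)
X_test = np.vstack([sample(mean_a, 20), sample(mean_b, 20)])   # held-out images
y_test = np.array([1.0] * 20 + [-1.0] * 20)

# Ridge-regularized least-squares linear decoder on the population vectors.
w = np.linalg.solve(X_train.T @ X_train + 1e-2 * np.eye(n_neurons),
                    X_train.T @ y_train)
accuracy = np.mean(np.sign(X_test @ w) == y_test)
```

In this easy synthetic regime the decoder generalizes nearly perfectly; the empirical claim in the talk is the much stronger one that real IT populations support this for hard, high-variation images.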
So this is a point where I want to pause and take questions, because that is basically a big set of background on vision to get you to the newer stuff over the last decade or so that I'd like to tell you about, if we have time. So Chris, if you see questions or if someone can field for me, that would be helpful. I should be about halfway through the talk now, maybe a little more.
PRESENTER 2: Sounds good. We have a question from [INAUDIBLE]. The question is, how do you derive the recurrent network architecture in the visual system? How do you find the evidence?
JAMES DICARLO: So I think that's a question about neuroanatomy. So at the bottom, I'm showing this feedforward and recurrent network. Unfortunately, I didn't have time to put that anatomy into the talk. The connections between the different areas are established by making micro-injections in animal brains, euthanizing those animals, and then tracing by various methods how one area connects to the other. And those tracings, for the rough ventral stream, are at the level of giving you area-by-area connectivity, not neuron-by-neuron connectivity.
So diagrams like this tell you one area is, on average, strongly connected to another, without giving you the exact, precise wiring diagram-- which is still a topic of great investigation at the moment. So I think that's my short answer to the question. It includes both recurrent and feedforward connections. I think the question was about recurrence.
And I should say, by the way, what I've been telling you about is a feedforward view of the ventral stream. The recurrence is very important to the processing. And if we have time, we'll talk about that at the end. But you've got to start with the first order model, which is the feedforward model. And then you can start to add in the recurrence.
That's how we think about approaching the problem. But good question. Thank you. Others, please.
PRESENTER 2: Great, thanks. We have one from Nathan Corneal. The question is, do you think spikes as units of information transmission can be useful from an engineering perspective for machine learning, where currently floating point numbers are usually the unit?
JAMES DICARLO: Yeah, so that's another interesting question. So as I mentioned, when we model this, we take the spikes and we convert them to mean fire rate. So now they're analog numbers. I think there are lots of deep questions that I'm not going to talk about here today about why spikes might be useful from an energetics perspective, relative to current compute [INAUDIBLE]. And I think that's the spirit of the question.
And maybe if the questioner is interested, we can have some more of that discussion offline and I can connect him to some references. But for the talk today, we're going to ignore the individual spike timings and just focus on predicting the mean firing rates. And one thing I'll add to that: when I mentioned that the IT codes explain the behavior, that means you can take the mean firing rates, without regard to the individual timings of those spikes, and explain the behavior with linear mechanisms at that point.
So that's part of the reason we believe that mean firing rate is a good level of approximation for the information processing. But the question contains elements not just of information carrying but also of energetics, which I think are interesting with regard to spikes. But I think that's all I'll say at this point.
PRESENTER 2: Great, thanks. The next question is from Guy Gasiv. You have shown reproducible responses to images in IT. After how many trials does adaptation occur? How is adaptation avoided or how is it addressed?
JAMES DICARLO: Yeah, so that's another good question-- though I did say the responses are quite stable. I hope you noticed, when I showed you these examples of the responses, that they're not stable at the individual spike level. So you can think of the mean rates as being stable-- that's the way we model it, with stochasticity on top. Sorry, Chris, can you say-- there was something more to that question that I think I was going to get to.
PRESENTER 2: Sure. So the two main parts were, after how many trials does adaptation occur? And then how is adaptation avoided? Or how is it addressed?
JAMES DICARLO: Yes, I forgot the word adaptation. So adaptation generally means a change in response over some timescale. And there are many timescales of adaptation here. When the timescales are long enough, it's referred to more as learning. And of course, throughout the brain, you can find all of the timescales.
And the way we study IT is we generally ignore the short-term adaptations that you'll get trial by trial-- that are produced, for instance, when you go from a very dark room to a light room. Just the light transient will cause that. And the way we wash that out is we average over many presentations, which averages out those short-term effects.
And again, basically we can check whether the population vectors look similar at the beginning of the experiment versus the end of the experiment. And to the extent that they do, we say the system hasn't adapted. And so we treat it as a non-adapting system at that point. And that's generally what you see, unless you apply very specialized statistics or other kinds of learning paradigms that can change the IT responses.
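As a toy illustration of that stability check, here is a hedged sketch with simulated responses: stable mean rates with Poisson-like trial variability and no drift built in. Real analyses would work on recorded rates, where a low early-versus-late similarity would flag adaptation.

```python
import numpy as np

# Hypothetical stability check: compare the image-by-neuron response matrix
# estimated from early trials against the one from late trials. Simulated data.
rng = np.random.default_rng(1)

n_images, n_neurons, n_reps = 50, 100, 20
true_rates = rng.gamma(shape=2.0, scale=5.0, size=(n_images, n_neurons))

# Poisson-like trial variability around stable mean rates (no drift built in).
trials = rng.poisson(true_rates[None], size=(n_reps, n_images, n_neurons))

early = trials[: n_reps // 2].mean(axis=0)   # first half of the session
late = trials[n_reps // 2 :].mean(axis=0)    # second half

r = np.corrcoef(early.ravel(), late.ravel())[0, 1]
print(f"early-vs-late correlation: {r:.3f}")
# A value near 1 is what gets interpreted as "no adaptation over the session".
```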
So here you should think of these IT responses as essentially stable, with adaptation effects that we washed out for the moment. That's a short answer to a very complicated question. And I hope that's enough for now. Anymore, Chris? I should maybe move on to get through the rest of the talk.
PRESENTER 2: Yes. So we have about 24 minutes left. And plenty of questions that we can get to at the end if you have time.
JAMES DICARLO: I hope that by this point in the talk you're motivated that the ventral stream is an interesting area-- a place to do CBMM projects. Data are available if you want to work on many of these kinds of questions, many of which are still open. But I want to orient you to the things we've been working on over the last decade or so. Even though I told you that IT has these powerful, almost magical codes-- I use the word magical-- that of course shouldn't satisfy you if you're an engineer, because it just says we can measure them in the brain.
And of course, they need to live in the brain somewhere because the brain supports behavior. But how they're computed from the pixels is the really interesting question to us. So how do you go from the pixels up to what we measure in IT? Or how is the IT code computed? Or that is, how are the neural responses derived from the pixels themselves?
And I've already given you a hint of that. Of course they don't just go from the pixels up to IT, but they go through all this neuroanatomy. So there's intermediate transformation stages that need to be understood to get up to IT cortex. So that's another way of phrasing the same kind of question. What are the population patterns of activity along the way, along the ventral stream?
And here, I'm speaking of these questions all at the mean rate level-- at the 100-millisecond-or-so timescale, as I introduced. In our lab, we had what we consider a bit of a breakthrough around 2013 or so. And I'm going to tell you about that today, just to give you a spirit of how that worked out, and what we think it tells us and doesn't tell us yet. And just so you have context here-- as neurophysiologists, as I mentioned, we've been recording for decades in these areas, like V4 and IT.
My lab's not the first to make these measurements. But it's always been mysterious as to what makes these neurons do what they do-- that is, what aspects of the images drive those neural responses? And so you can do things like this-- you could plot the neural response here to a bunch of images that I'm showing here.
Here, they're organized by category. This is the mean firing rate, as I mentioned earlier-- over a 100-millisecond timescale. And you see these fluctuations in the response to different images.
Just to be clear, this is not time on the x-axis-- these are just different images, grouped by category. And you see this neuron for some reason tends to like images of chairs-- some of the images of chairs, but not all chairs. So it would be a mistake to call it a chair neuron, but it has this average preference for chairs.
But it's really hard to eyeball these data and say, what makes these neurons do what they do? You need a mathematical or computational model to try to take this on. Same thing where these are famous neurons in IT called face neurons, because they respond more to faces, images of faces, than to other, non-face objects. But you can see, they don't respond to all images of faces and they sometimes respond to images of non-faces higher than some images of faces. So it's more complicated than just saying it's a face neuron.
And down here in area V4-- remember, a mid-level visual area-- the responses look even less categorical. So here, you see some of the top firing images of this neuron. And here, you wouldn't even dare to call it a car neuron or a chair neuron. It's some kind of image features that are driving these neurons.
You have a complicated pattern of responses, because this is just one example neuron. And I would say to all of you that if we want to say we've understood a system whose neural mechanisms we need to explain, we must at least be able to predict these neural responses. That's not the only thing we need to be able to do, but we need models that can at least accurately predict these responses. And that's what we've been after: trying to find neurally mechanistic models-- that is, neural networks that can capture these neural responses. Not as the end of understanding, but as the next step in understanding.
And we've been following a long line of very important work. And I think Tommy probably introduced some of this in his introduction, because his lab was a key driver of a lot of this background work-- is that there's a lot of background from brain science that led to this style of modeling where you have this feedforward cascade models with local processing copied over the visual field and a stacked-like system. They build a feedforward approximation of what's going on from the brain data, the type that I alluded to, or outlined for you earlier. And that led us over decades to many types of models.
Some of them I've listed here-- models of V1, which you might know as [INAUDIBLE] filter-like models. Models of the whole ventral stream-- the HMAX family of models built by Tommy and his collaborators. My lab built some of these models-- we called Klaus O9 models. These are all examples of feedforward, cascade-like models motivated by brain science.
The trouble we were having is not that we didn't believe this family of models. It's just that none of them were really adequate as scientific hypotheses, because they couldn't explain the data well. They all failed to accurately predict those complicated V4 and IT responses that I just took you through. So we were inspired by this class of models, but they weren't quite right.
So the reason we knew that is we could take those V4 neurons, like I showed you, and ask, how well can you predict the responses to new images? That's what I'm showing on the y-axis. This is for cortical area V4-- predicting for new images-- and this is noise-corrected, so you've already adjusted for the reproducibility of the data.
And you can see that it's really pretty poor. This red line shows a model trying to predict the responses of the neuron, in black, that I showed you a minute ago. And you see, it doesn't fit very well. What your eye sees there is just quantified on the left-- we're not doing very well. Here in IT, we're doing even slightly worse with some of these models.
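The noise-correction idea can be sketched like this. The data below are simulated, and the split-half correction shown is one common convention, not necessarily the exact procedure used in the papers: divide the raw model-to-neuron correlation by an estimate of the neuron's own trial-to-trial reliability, so a model is judged against what is predictable in principle.

```python
import numpy as np

# Hypothetical sketch of noise-corrected predictivity on simulated data.
rng = np.random.default_rng(2)

n_images = 300
signal = rng.normal(size=n_images)                 # the neuron's "true" tuning
rep1 = signal + 0.5 * rng.normal(size=n_images)    # two repeated measurements
rep2 = signal + 0.5 * rng.normal(size=n_images)
model_pred = signal + 0.8 * rng.normal(size=n_images)  # an imperfect model

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

raw = corr(model_pred, (rep1 + rep2) / 2)
reliability = corr(rep1, rep2)          # split-half noise-ceiling estimate
corrected = raw / np.sqrt(reliability)  # fraction of explainable variance reached
print(f"raw={raw:.2f}  ceiling={np.sqrt(reliability):.2f}  corrected={corrected:.2f}")
```

The corrected score is always at least as large as the raw one, because no real neuron is perfectly reproducible; a corrected score near 1 would mean the model predicts everything that is predictable.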
So would you call this a failure? We weren't there yet, but we didn't throw out the hypothesis class. The problem, we thought, was that there are a lot of parameters in these feedforward models that are not determined by the existing neuroscience results. As I mentioned in answer to the anatomy question, we don't know all the details about how things are wired up, what the weight values are on the synapses, and so forth.
And so we got around this with a trick-- and when I say we, it's a large group of people, but the people in my lab who were most involved were Dan [INAUDIBLE], a postdoc at the time who's now at Stanford, and [INAUDIBLE], a graduate student. We started to take models of the family that I described-- these artificial neural networks, feedforward and convolutional. That's basically the HMAX class of models that Tommy and others introduced, which I showed you on the last slide. It's guided by neuroscience.
And I think you're going to hear more about these throughout the course-- you've probably already seen them a bunch. I'm not going to dwell on the details of these feedforward convolutional ANNs, other than to say they're inspired by neuroscience. But the real trick was that you start to force these models to do something useful-- like get them to do those core recognition tasks that I introduced to you, to get them to be able to invariantly say, oh, this is a pair of boots, across changes and variation, for example.
And so you've got to get the models to do that. And to do that, you use engineering tricks to tune a lot of the parameters of the models-- both the weight parameters and the hyperparameters-- to basically do learning within these models, so you find parameter sets that are then able to do well on the recognition task. And so you can see, this is a mixture of brains, minds, and machines around a core problem here.
And what's really exciting about this is that when you build these models-- and I'll show you in a second-- once you've got these models to perform on these tasks, you can measure neurons within the different levels of the models-- so the model IT or the model V4. And it turns out the individual neurons in the models started to behave very much like the individual neurons that we had been recording in the monkey brain that I showed you that we were not able to explain.
So that's the big-picture take-home: if you optimize within this family of networks, the neurons in there end up looking a lot like the brain-- much more than the previous neural models. And that's why we call it a breakthrough, because there was a big gain when we started to do that. And so here, I'll just show you that again. Here's a deep neural network model. And one thing that's great about these models is you can take any image and push it into the model.
There's neurons in there, so you can make comparisons with the actual neurons that we recorded in the brain. Now, again, shown at the bottom is the brain data that I've been illustrating for you all along. And then we can go in and we can compare neurons in our model IT with neurons in actual IT.
Now, there's a lot of detail in how you do that. The easiest way to think about it, and one way you can do it: for each recorded IT neuron, find the in-silico, or model, neuron that's most similar to the recorded IT neuron, and then check how well it predicts that IT neuron's responses to new, held-out images. The details of these things you'll find in papers like the ones I've listed here.
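As a toy version of that mapping procedure, the sketch below uses random data in place of real model features and recordings, and the simplest nearest-unit (best-correlated) mapping; published comparisons typically use regularized linear regression over all model units instead. The recorded neurons are simulated as noisy copies of particular model units, so a good match exists by construction.

```python
import numpy as np

# Hypothetical sketch: map each "recorded" neuron to its most similar model
# unit on training images, then score that unit on held-out images only.
rng = np.random.default_rng(3)

n_images, n_model_units, n_recorded = 200, 500, 20
model_resp = rng.normal(size=(n_images, n_model_units))

# Simulate recorded neurons as noisy copies of particular model units.
source = rng.integers(0, n_model_units, n_recorded)
neural_resp = model_resp[:, source] + 0.5 * rng.normal(size=(n_images, n_recorded))

train, test = np.arange(0, 100), np.arange(100, 200)

scores = []
for j in range(n_recorded):
    # Correlate this neuron with every model unit on the training images.
    r_train = [np.corrcoef(model_resp[train, k], neural_resp[train, j])[0, 1]
               for k in range(n_model_units)]
    best = int(np.argmax(r_train))
    # Score the chosen unit on held-out images.
    scores.append(np.corrcoef(model_resp[test, best], neural_resp[test, j])[0, 1])

print(f"median held-out predictivity: {np.median(scores):.2f}")
```

The key design point survives the simplification: the mapping is fit on one set of images and evaluated only on images it never saw, so a high score means the model unit genuinely tracks the neuron's tuning.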
But I'm just trying to give you a spirit for how we do that. And the upshot is, once you use these models to make predictions internally-- in IT or V4 or other layers; this is an example of an IT prediction-- you see how well this red line, which is the model prediction, lines up with the black line, which is that complicated neural data that I showed you earlier.
Again, I said this neuron is not a chair neuron. It's something more complicated. And now you see that complexity falling out of these models almost naturally, once you tune them up to do recognition tasks. It's inside these models that neurons exist that can explain the neurons we were recording.
That's an example there of this chair-like neuron, and these face neurons-- all their detail can also be predicted. I want to say they're predicting very well-- better than word models that say, it's a face neuron, it's a chair neuron. I hope you can see that. But you can also see they're not perfect.
So they're about halfway there in explaining these neural responses, is where we think things stand at the moment. So we call this a glass half full, half empty story. They're about halfway there.
In the interest of time, I'm going to skip through this, which is a motivation for building better models, and jump to some newer results. So this is basically saying that these models are motivated by neuroscience, but tuned with engineering-- and that led to models that can better explain the brain.
And so we've got about a 50% match in IT-- a little bit more in V4, depending on how you do the analysis. We can match and check at the behavioral level. And it's pretty good, but it's not yet human-like-- or monkey-like, even. And others have shown in V1 that these things are pretty good matches at V1 as well.
And I want to stress here that we're not just using deep neural networks as black box predictors. We're trying to match every level of the deep neural network with every level of the ventral stream. There can't be components of the models that are not mapped to something within the brain. And that's how we treat these models-- not as just hammers to predict stuff, but as actually neural mechanistic hypotheses of what might be going on.
If you're interested in checking your latest and greatest convolutional ANN-- or any ANN-- against neural data, including ours and others', there's a free website, brain-score.org, where we're organizing a lot of our data and others'. And you can actually score your models quite quickly against these kinds of numbers that I'm showing you here. And you'll see where you sit on the leaderboard with regard to models of this type explaining brain data along the ventral stream.
Now, connecting to the broader topics of the CBMM session that you guys are in-- these advances, I want to stress, resulted from performance gains in these models. Again, we trained them to do things better than they were able to do before. And in some of the early training, we weren't even using gradient descent methods for learning the models-- we were using architectural search methods. So any performance gain was leading to gains in matching to the neural data.
So this is an important point, because sometimes people say these kinds of data are evidence that the brain is using deep learning or gradient descent. And I would say the data are not inconsistent with that, but they're certainly consistent with many hypotheses beyond gradient descent that could be running in the brain. Because any performance optimization tends to lead to a better match of the ANN model to the brain. And in fact, I think that's an active area of work-- moving away from the supervised gradient descent methods that are common in deep learning. And so I'm also highlighting that for you here.
The brain does not really learn that way with a lot of supervised examples. But you can get matches without training models that way as well. So big picture here-- let's see, I have about 10 minutes left. Chris, is that right? Does that sound right to you?
PRESENTER 2: You have 12 minutes to go.
JAMES DICARLO: Yeah, OK. So I'm going to do about five more and then I'm going to take questions. So I want to highlight for you what happened here is really engineering and science came together to build neural networks that actually can do things. Those became good hypotheses for how the brain works, which is what I was just showing you. But also actually, of course, these hypotheses-- these are now the leading computer vision models.
And actually, when you apply deep learning broadly, this is the big workhorse tool in what's broadly referred to as AI at the moment. It's not really AI, but it's what people call AI. But again, this was all inspired by these feedforward recognition models having those first breakthroughs in computer vision and then extending beyond. So that's now referred to broadly as deep learning.
I want to return to these ANN models in general as hypotheses for scientists about how we think about the brain and the mind. What do they tell us more about the neurons within the brain? Or how can we use them? This slide here shows what I said to you earlier in words-- that this is a glass half full, half empty story.
These particular ANNs are leading ANN models of the ventral stream. They explain about half of the explainable response variance-- that means [INAUDIBLE] variance that I keep referring to. But they don't explain the other half. We can already see there's something wrong with them. They're much better than previous models, but they're not perfect yet.
Of course, what our lab is interested in is filling up this glass to make them perfect. And I'll say one minute about that at the end. But I just want to highlight for you-- even though they're not perfect, they already allow us to do something.
So there are debates in our field about whether these models count as understanding. But we turn that around and say, well, what can you do? If it's really understanding, it should allow us to do something. And here, I'm going to let Teuber speak again to introduce one of the things we did next with these models. So here's Hans-Lukas Teuber again, in the same lecture.
HANS-LUKAS TEUBER: You see, the positivist view is that science-- any science-- tries to explain, predict, and control. There are certain difficulties, then, in defining astrophysics as a science at this stage. You may explain and predict, but control is still a little difficult, particularly where other galaxies are concerned. But who knows? By that token, you could say we are far from being there, because we cannot yet explain-- but we try.
JAMES DICARLO: OK, so in 1974, I would say the same thing about the [INAUDIBLE] system. Even as late as 2013, we couldn't really explain or predict what was going on. So even thinking about control was not on the radar. But now I'm telling you about models that are starting to better explain and predict, and that motivates us to say, well, OK, what can we do with these models in terms of control? And so two postdocs in the lab-- Pouya Bashivan and Kohitij Kar-- and I decided to push this: if these models are actually reasonably accurate predictors of the neural activity in the ventral stream, we should be able to turn this around and use them to start to control the neural activity.
And so what does that mean in practice? It means now you have a neural network model, like shown here at the top. You have the actual brain, shown in the bottom. And our control knobs are the pixel energy that we're applying on the eye.
So the models-- remember, they take you from pixels all the way up to neural activity, in this case in visual area V4, which, remember, is a mid-level visual area. And we can invert through the model to design custom pixel patterns that should control the neurons into whatever state we want them to be in. So that is using these models in a control framing: what they describe is the relationship between pixels and neural activity. Let's turn that around and try to do control.
And now you can ask, what do I mean, control V4? Well, you're going to pick whatever goal you mean-- whatever you want V4 to do. Turn all the neurons on, turn all the neurons off, turn one on and all the others off. Whatever control goal you have for the V4 population-- that's how we were thinking about it.
And so we started to use these ventral stream models to design patterns of light energy that would set V4 into particular states. You ask some optimization code to work through the model and say, please derive the images that will achieve your control goal. And again, there are many possible control goals. And here's an example of an image being optimized to do one of the control goals in V4.
Now, this means the optimizer can find images that are not natural. It can find things that you would have never seen before, never could create. Remember, image space is quite large. But we were then able to apply these images and show that we could actually control neurons in ways that we weren't able to prior to having these kinds of models.
So I sometimes refer to these as our digital treatments, because there are possible health care consequences down the road. But here's the basic idea-- here's an example of a control goal. You have the V4 population shown here on the x-axis-- here are 38 simultaneously recorded V4 neurons-- and their activity rate on the y-axis.
And say your goal is, I want to, for whatever reason-- maybe it's a health reason, maybe it's a science reason-- I want to turn on neuron 12 and shut off all other neurons. Hold them all at baseline. Please find me an image that will do that. And you can ask the model to do that for you.
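Here is a minimal sketch of that inversion idea, with a random linear map standing in for a trained ventral-stream model: gradient ascent on the pixels pushes one "target" unit up while penalizing activity in the others. Everything here-- the model, the sizes, the step size, the objective-- is illustrative, not the procedure from the actual study.

```python
import numpy as np

# Hypothetical control sketch: optimize the *input* so one unit fires high
# while the rest stay near baseline (zero). The "model" is a random linear map.
rng = np.random.default_rng(4)

n_pixels, n_units, target = 64, 38, 12
W = rng.normal(size=(n_units, n_pixels)) / np.sqrt(n_pixels)

def responses(x):
    return W @ x

x = rng.normal(size=n_pixels) * 0.01      # start from a near-blank image
lam, step = 1.0, 0.1
others = np.arange(n_units) != target

for _ in range(200):
    r = responses(x)
    # Objective: r[target] - lam * sum(r_others^2); analytic gradient w.r.t. x.
    grad = W[target] - 2 * lam * W[others].T @ r[others]
    x = x + step * grad                   # gradient ascent on the pixels

r = responses(x)
print(f"target unit: {r[target]:.2f}, max |other|: {np.abs(r[others]).max():.2f}")
```

In a deep network the gradient would come from backpropagation through the model rather than an analytic formula, but the framing is the same: the model describes pixels-to-responses, and the optimizer inverts that relationship to serve a stated control goal.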
If you didn't have a model-- just as an aside, all these neurons overlap in their visual receptive fields, shown here. So this is the visual field, and these are the receptive fields of the neurons. So this isn't as simple as putting light energy at some spot on the eyeballs. All these neurons are responding to the same portion of the visual field, so you need to be more clever than that.
If you just searched through natural image databases-- and here, we used a bunch of images we collected-- you could say, this is the best image we could find in terms of activating site 12 and deactivating all other sites. And you see it roughly achieves the goal here-- activating site 12, in the dark blue. And the other ones are more active than we'd like-- but OK, it's in the right direction. But once we have these models, we can say, model, please find me an image anywhere in image space that could do this.
And here's the image that it found. It doesn't look like a chair, it doesn't look like anything. Again, it's not natural. But it drives the neurons-- in fact, what it did was it drove this neuron up and all the other neurons much lower than before.
It's not perfect-- you see this is up. I don't like that. This should all be flat. So it's, again, half full, half empty. But you can already see it's given us control ability of these neurons.
It's far greater than we would have had without the models in place. And so this is just one example of how we can use these models and turn them around to do other things in the science field. We're starting to dream about using these controls beyond V4 to control areas downstream. Of course, IT is one of our targets. But then IT connects to areas like the amygdala, which is involved in mood disorders.
So we start thinking about how we could shape the patterns of neural activity downstream by having these kinds of models. And that's an active area of research for us at the moment. But I'm pointing out that you can now think about computational models not just as AI, but as tools to help brain scientists achieve goals-- both science goals and maybe health-related goals connected to that. And as we make these models better, these control abilities will, of course, get better as well.
This was asked in one of the questions-- I'll just highlight real quickly. The recurrent aspects of these models, I haven't been talking about at all. We focused on the feedforward part in my talk, but we've been working on the recurrent part. And some key [INAUDIBLE] supported work was to show that recurrence is also critical to the functioning of the creation of those IT codes that I've described to you earlier. And I unfortunately don't have time to tell you about that, but if you're interested, I would refer you to this paper.
So just broadly, our goal is to really discover even more accurate computer vision models of the human visual system, especially the ventral stream. Brain score, as I mentioned, is a way of gauging where we are. We and others-- and I hope many of you-- might be building models that are even better than the models we have today. Those models then can lead to applications in both AI, better computer vision, some health applications that I was alluding to, possibly even educational things. That's how we think about the framing of why we do this kind of work and why many of you might be excited to do something similar.
And I'll just summarize for you, in case you fell asleep for most of the lecture, what I told you today in a few nutshells. The ventral visual stream produces an IT population that carries this very powerful, generalizable code. Optimizing deep ANN architectures for core recognition tasks leads to models whose internals are much better matches to the actual ventral stream than any previous models we had-- and this includes matches for face neurons. And this result, I mentioned along the way, is consistent with the brain doing some kind of backprop. But it does not imply the brain is running any type of backprop-- it just says performance optimization tends to lead you there.
The other thing I would add is that these same models can start to be used to construct novel synthetic images to both super-activate neurons-- I didn't show you that-- and control sub-populations of neurons, which is the example I showed you-- to tune the patterns that you get out of the neural firing, using the models. And these same models, nevertheless-- I keep stressing this-- are not correct yet. They are far better, but they're not yet correct. They need lots of improvements-- recurrence, maybe spikes, many of the things you may be wondering about.
That's what we're wondering about, too. We can see they're not correct. That's the half empty part of the glass. I mentioned recurrent circuits. That's something that's been missing.
And we've started to add that in, but we still don't have the right formula yet. And more broadly, we and our collaborators are building more and more models using more biological constraints. And so far, these models show improvements in efficiency and some gains in robustness to perturbation and even adversarial attack. And that's work that I don't have time to show you today, but I hope it gives you a spirit of how the science connects to that style of work.
And with that, I will just remind you that I'm talking about vision. In this course, you're going to learn about many things beyond just core recognition-- scene understanding, intuitive physics, and so forth-- language and other things. Intelligence is far broader than vision and it's certainly far broader than core recognition that I introduced to you here. And I hope you're inspired by this idea to connect to other aspects of intelligence. With that, I'll end and say thank you and take any questions we have time for. Thanks.
PRESENTER 2: Great. Thank you very much, Jim. We can squeeze in a few questions. And please let us know if any of these have been answered, since they include questions that were previously submitted. The one at the top at the moment is from Gio Batista Perez Candovo. And the question is, do the linear decoders generalize also across brains, at least to some extent?
JAMES DICARLO: Generalize across frames. So just to be clear, the way we test linear decoders is we train them on a bunch of static images where we measure the responses, and build the decoder weights that way. And then we test on new images. So in some sense, that is a generalization across frames. But I think the question may be about something more, like movies, if they want to clarify.
PRESENTER 2: Sorry, I think you misheard me or I misspoke. It's generalized also across brains, with a b.
JAMES DICARLO: Oh, brains. OK, so remember, the decoders are a set of weights applied to a specific set of IT neurons. If I were going to plug a decoder into your brain or my brain, we'd have to at least measure some neurons to know how to put those weights on your neurons versus my neurons, because we don't have names or labels for neurons. I can't say neuron 12 in you is neuron 12 in me.
So we think they generalize in that sense, but you need some calibration to know what you're measuring. Imagine you stuck a BMI device in each of our heads. You'd have to measure some responses to make that mapping, because we have no other way to map it. So yes, the idea generalizes, but the details require some retuning per brain.
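The per-brain recalibration point can be illustrated with a toy simulation. Everything here is hypothetical, not the lab's actual data or pipeline: "brain B" is just "brain A" with its neurons relabeled, which is enough to show that a decoder's weights don't transfer blindly across brains, while a small calibration set of measured responses recovers performance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Two simulated "brains" encode the same two object categories, but
# brain B is brain A with its 200 neurons shuffled (no shared labels).
n_images, n_neurons = 600, 200
labels = rng.integers(0, 2, n_images)
signal = rng.normal(size=(2, n_neurons))        # category-selective component
brain_a = signal[labels] + rng.normal(scale=2.0, size=(n_images, n_neurons))
perm = rng.permutation(n_neurons)
brain_b = brain_a[:, perm]                      # same information, new "wiring"

decoder_a = LogisticRegression(max_iter=1000).fit(brain_a, labels)

# Brain A's weights applied blindly to brain B land near chance...
blind = decoder_a.score(brain_b, labels)

# ...but measuring a small calibration set in brain B recovers the mapping.
calib = slice(0, 100)
decoder_b = LogisticRegression(max_iter=1000).fit(brain_b[calib], labels[calib])
refit = decoder_b.score(brain_b[100:], labels[100:])
print(f"blind transfer: {blind:.2f}, after calibration: {refit:.2f}")
```

The design choice mirrors the answer above: the decoding *scheme* generalizes across brains, but the weight vector itself must be refit per brain because neuron identities don't line up.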
PRESENTER 2: Great, thanks. The next one is from Jesse Parent. And in regards to your talk about the goal of AI researchers, is this related to a general suggestion for how to direct cognitive science in a post-cognitive-revolution era?
JAMES DICARLO: I think this goes back to almost my first slide. My comments were mostly to say, look, you can do straight-up engineering. And there's lots of people doing that. But we think engineering under the guardrails of both neuroscience and cognitive science is going to lead you to answers on intelligence faster than straight-up engineering. And that's a bet. You don't have to take that bet. But I was explaining to you how that bet seems to have paid off, at least in the space of vision.
I don't know about a revolution, but I think it's been a revolution in vision science by merging engineering with some of the brain science I've referred to here. And I think there will be in other aspects of intelligence as well, at least in the natural sciences. But that you can't forget the natural sciences is the key thing I would like you to take from my talk. It really helps move the models into the right space to begin with.
PRESENTER 2: Great, thanks. The next one is from Brandon Davis. Is it conventional to assume that a linear decoder explains the behavior? What if there is a non-linear relationship?
JAMES DICARLO: Yeah, OK. That is a great question, and of course I don't have time for that. We use linear decoders from the IT codes because we try the simplest thing first. And to the extent the simplest thing works, you stick with it. And the linear decoder works, so we leave that as our stand-in model of how IT relates to the ultimate choices that the animal, or you and I, make when we do the task that I showed you.
But the mechanisms between IT and the final button push are certainly not linear. We are just abstracting them away with that linear readout and saying, look, the computations have mostly been solved at this point. And there are a lot of open questions about the neural mechanisms on that side-- how you even learn a new object, for instance; I mentioned the monkey learns a new object. So I wouldn't want you to walk away and say the brain's linear after IT.
Just think from a computational perspective-- you don't need fancy tools after IT to do the kind of tasks that I've been showing you. So that's the sense in which you should take that work-- not that that's a good model of the mechanisms after IT. Because it's certainly not a perfect model of them after IT. I hope that helps.
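The decoding scheme Jim describes earlier (fit linear weights on IT responses to one set of static images, then test on held-out images) can be sketched with synthetic data. Everything below, from the response statistics to the dimensions, is illustrative rather than the lab's actual recordings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for an IT population:
# 600 images x 200 recorded neurons, two object categories.
n_images, n_neurons = 600, 200
labels = rng.integers(0, 2, n_images)           # object identity per image
signal = rng.normal(size=(2, n_neurons))        # category-selective component
responses = signal[labels] + rng.normal(scale=2.0, size=(n_images, n_neurons))

# Fit linear decoder weights on one set of images...
X_train, X_test, y_train, y_test = train_test_split(
    responses, labels, test_size=0.25, random_state=0)
decoder = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ...then test generalization to images the decoder never saw.
print(f"held-out accuracy: {decoder.score(X_test, y_test):.2f}")
```

The point of the simplicity argument is visible here: a single weight vector per category, with no non-linear machinery after the "IT" responses, is enough to read out object identity when the population code is good.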
PRESENTER 2: Thanks. I think we have time for one more question. This is from [INAUDIBLE]. How are the cells in IT-- for example, with the different image preferences-- spatially organized?
JAMES DICARLO: Yeah, that's another good question. I talked about them like they're in a big sea and all mixed together. And in fact, if I had time to give another talk, I'd show you they're actually not. They're spatially clumped, so nearby cells respond similarly.
So you might have heard about, for instance, face patches-- the work of Doris [INAUDIBLE] and others showing that neurons are clumped together, and Nancy [INAUDIBLE] showing it in humans. Neurons that tend to like faces-- those face neurons that I showed you that are not really face neurons but, on average, like faces-- tend to be nearby. So they're not randomly scattered. And then, why is that?
Well, our current best hypothesis is that if you put in wiring-cost constraints and train those into the neural networks, you end up with models that can perform just as well but also show those spatial organizations. So the spatial organization of places like IT in the ventral stream may be due to metabolic or other costs beyond information processing [INAUDIBLE]. That's how we think about it.
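The wiring-cost idea can be illustrated with a toy regularizer. This is a hypothetical sketch, not the actual training objective from the work Jim cites: assign each model unit a position on a simulated 2-D cortical sheet and penalize strong connections between distant units, so that a network trained with this term added to its task loss favors local, clustered connectivity.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "cortical sheet": each of 64 units in a layer gets a 2-D position.
n_units = 64
positions = rng.uniform(0, 1, size=(n_units, 2))

# Pairwise distances between units on the sheet.
diffs = positions[:, None, :] - positions[None, :, :]
dist = np.sqrt((diffs ** 2).sum(-1))

def wiring_cost(weights, dist):
    """Penalty = sum over connections of |weight| * distance.

    Added to the task loss, this pushes strong connections to be short,
    so units that must exchange information end up with similar tuning
    at nearby positions -- i.e., spatial clumping falls out of the cost.
    """
    return (np.abs(weights) * dist).sum()

# Recurrent weights within the layer (illustrative random init).
weights = rng.normal(scale=0.1, size=(n_units, n_units))

# A local connectivity pattern (weights falling off with distance)
# is cheaper than the same weights scattered at random.
local = weights * np.exp(-dist / 0.1)
print(wiring_cost(local, dist), wiring_cost(weights, dist))
```

Under this kind of penalty, minimizing total cost while preserving task performance is exactly the trade-off the hypothesis describes: information processing sets what must be computed, and wiring or metabolic cost shapes where on the sheet it gets computed.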
PRESENTER 2: Great. Actually, we can sneak in one more if you're up for it.
JAMES DICARLO: As long as you can take it, it's fine.
PRESENTER 2: We're still waiting for our next speaker, anyway. Martin Irani asks, given that visual perception is usually accompanied by attentional control, do the current state-of-the-art computer prediction models consider this? And if so, how do they implement this top-down monitoring?
JAMES DICARLO: Yeah, that's another great question. And I did not touch on that at all. The only way I touch on that is overt attention, the point I introduced in the beginning. The way most attentional control works is by overtly moving your eyes to fixate a new location. We don't model that at all.
Once some machine-- other parts of your brain-- moves your eyes, we're modeling the parts that are processing the central 10 degrees. So you can think of the first entry into that attentional question as, what determines where to move your eyes next? I think Gabriel or others may talk about that, so it's a good segue to some of the other lectures that you're going to have. And just think about the recognition system as a very nice processing retina that is being driven around overtly by another system, mostly in the dorsal stream. And I hope Gabriel will talk on that.
There is also covert attentional control, which you may get talks on as well. That is another interesting variant, and I did not talk about that either. But we think of those as being layered on top of the network that I described for you. Those are open questions, but I hope you'll get the latest progress on them in some of the other lectures you hear in the CBMM course.
PRESENTER 2: Great, thanks very much for your time. And great talk today, Jim. If you're up for it-- if you want to email me your slides, I can provide them for everybody to download. Totally up to you.
JAMES DICARLO: Happy to do that. And I can also send links to some of the papers that you were asking about-- the spatial organization work, for instance. A lot of this stuff we have on bioRxiv. So I'll try to put some of that together if people are interested. And if you have questions about data sets, please email me and I'll try to get you to the right person on that.
PRESENTER 2: Great, thank you very much.
PRESENTER 1: Thank you very much, Jim. That was fantastic, as usual. And our next talk is by Ethan Myers at 2:30.
JAMES DICARLO: Yes. And Ethan will introduce decoding, I think. So on that linear decoder question, that will come up in his talk.
PRESENTER 1: Great.