Recurrent Computations in Cortex
August 12, 2019
Brains, Minds and Machines Summer Course 2019
Gabriel Kreiman, Harvard University/ Children's Hospital Boston
GABRIEL KREIMAN: So I want to keep this quite informal. I put here a lot of slides about a couple of different topics centered around recurrent computations, but I would like this to be more of a conversation. So please stop me. Please interrupt. And if we don't go through the whole thing, that's perfectly fine.
Like Christof, I'm a very strong believer that we need to share code and data. He gave a whole spiel about this. I'm not going to repeat that.
Every now and then, when you see some of these QR codes, you can check them out, and they should link to code and data for what I'm talking about at the time. It's not quite complete. We still have a lot of data and code that we haven't quite shared.
And it's not as easy as it sounds. But in many cases, that should work. And if there's anything that you're interested in that you cannot quite get access to, please talk to me afterwards.
So I want to start by giving you an example of approximately state-of-the-art image captioning. Some of you may be familiar with this Microsoft captioning bot. It doesn't matter which one it is. There are many very similar ones.
If you upload a picture such as this one, it says-- I think it's a group of people standing in front of the Leaning Tower of Pisa. So it's quite remarkable. So if you know about the state of the art a decade ago or so, we've come a pretty long way, right? So the system can recognize-- when I say we, I mean the field.
We can recognize the Tower of Pisa. And of course, perhaps that's not too surprising. There are probably a gazillion pictures of probably almost the exact same Tower of Pisa out there for training.
Tourists are not super original. They mostly take very small variations of exactly the same picture. So this is probably almost memorizing pixels, perhaps, or memorizing certain features.
But [INAUDIBLE] goes beyond that. It says it can detect that there's a group of people. And that's quite impressive.
These people are pretty small. They're probably just a handful of pixels, and the algorithm can detect that, which is, I think, pretty cool. Of course, it may be that in the training set also, when you have a lot of blue, then that means that the picture may be outdoors, and it correlates very strongly with having a group of people.
It also says that the people are standing, which is actually probably true for most of them. I don't know if the algorithm was lucky about it or, again, whether that's just the statistics of the training set. Then if we change the picture to this one, it says, I think it's a person standing in front of the Leaning Tower of Pisa. So again, it recognized the Tower of Pisa.
That's quite impressive. The person is probably-- there is a person there. It's probably standing. Yet, I think it's pretty clear that the algorithm misses the most important aspects of what's going on here.
So I would contend that we have a very, very long way to go. So for all the people who say that we've solved vision, that computer vision is solved, I think we have a very, very long way to go until we can build algorithms that will understand a picture like this. And if you scramble the picture and you just do a simple game like this one, the algorithm says, I think it's a person standing in front of a building.
So now this very simple transformation, the algorithm already lost the notion of the Tower of Pisa. It's still got the idea that there are people in there. But again, the caption is pretty similar, although the picture is completely different in all sorts of ways.
So the goal, ultimately what we'd like to be able to do, is take any picture and be able to understand what's happening in that picture, meaning that we want to be able to understand and answer what's effectively an infinite number of questions about the picture. We want to understand what is there, where Obama's foot is. We want to be able to-- I can ask you to search for all the mirrors in the picture and so on and so forth.
So here are just a few of the example questions that one may be interested in. And in particular, as I will argue very shortly, we not only want to be able to do this, but we want to do it in a way that mimics biological processing. That is, we want to understand the mechanisms that allow us to answer these questions based on our own hardware.
So incidentally, you may understand from this particular picture that this picture is a little bit funny. Obama is being playful. He's sort of playing a joke on this guy here.
And I cannot even begin to imagine what kind of computational algorithms we need to be able to understand and grasp that type of concept. You need to be able to understand that that white thing there is a scale. Maybe some of you are very young and have never seen a scale like that.
And you need to understand that people are self-conscious about their weight. You need to understand that he doesn't know that Obama is doing that. You need to understand that all the other people do know what's going on, and that's why they're laughing. There's so much more to understanding a scene than where we are now. And I think that's where we'd like to navigate towards.
And the way to do that for us-- I am a physicist by training. I like to think about building models and breaking models. Nothing more sophisticated than that-- that's what we've been doing for centuries.
And that's partly why we want to build computational models. We want these computational models to be related to biological hardware, and I'll say a few more words about that. And Jim will have a lot to say about that, I'm sure. And then we like to break those models in order to be able to build better models.
So this is our trick, the trick that scientists invented. We could perhaps argue that Galileo invented this. This is the trick that we invented to be able to have a job forever. So as you can see, we can build models. We can break them, and then we can continue this loop and have fun for a very long time.
So a couple of desiderata and basic assumptions for the kind of models that we want to build-- so this is basic science. We want models to be falsifiable. So we don't believe in models where we postulate that there's a homunculus in the brain or there is an Obama engine in the brain.
So we don't like engines. We don't like homunculi. We need to be able to instantiate models in the language of real biological hardware. We want to be able to have quantitative predictions.
We want models that are image-computable. There's a lot of exciting work that has happened in the field of vision with models that cannot do computations directly on the images. So we think that this is an essential ingredient.
We think that the models have to be based on neural networks. We need to respect some sort of basic mapping of the model onto something that could be implemented by biological hardware. And that restricts our space of possible models quite a lot. And we want to be able to explain neuronal responses. We need to be able to explain what happens in the brain.
And there are lots of assumptions that we can unpack about this. At the same time, we would like to be able to capture the behavior of the organism as a whole. We don't think that either of these by itself is sufficient. I would not be satisfied purely by explaining behavior without the neurons.
I would not be satisfied by purely explaining neuronal activity without the behavior. And critically for me, we want to be able to extrapolate to novel situations-- so to ensure that whatever model we have, it doesn't only apply to just the 10 types of picture that I chose to use for my particular experiment. I want to be able to use these models for any possible picture in the world.
So here's a very naive initial working hypothesis of some types of computations that we sometimes refer to as visual routines that we think might be important to be able to do visual scene understanding. The idea of a visual routine was coined by Shimon Ullman, who is part of CBMM but is not going to give a talk this year. But I think you can see some of his videos online.
And the idea is that there are some computations that solve specific smaller parts of a problem and that can be used in a flexible manner, in an interchangeable manner, and in a recurrent and repeatable fashion as well. So one routine could be to extract an initial sensory map, followed by proposing some sort of gist of the image, some basic understanding of the properties of the image, followed by extracting and perhaps labeling specific objects in the foveal region, the central part of where the subject is fixating, being able to put together this gist of the image with the foveal information to make inferences about what's happening in the picture, and then being able to sample the picture, perhaps by moving your eyes to sample different parts of the picture. And that's what we refer to as task-dependent sampling and active sampling, for [INAUDIBLE] potentially many other routines, depending on the question, depending on the task: detecting people, detecting spatial relationships, and so on.
So this is not an exhaustive list of visual routines nor am I claiming that this is necessarily correct. This is just a working hypothesis of certain operations that we think would be interesting and important to implement in order to ultimately be able to understand visual scenes.
And the idea is that many of these things could be studied somewhat independently. And then we can plug and play and combine them to be able to solve complex visual tasks in terms of scene understanding. So I'm going to give you an example of a few of these. And again, when I get more into the meat of some of these specific routines and specific computations, I would encourage people to stop and interrupt and ask questions, OK?
Each one of these routines can in turn be subdivided into multiple subroutines. Perhaps one of the ones that we understand the best is this idea of how to propose a foveal object, what happens along the ventral visual stream. And I'm not going to say a lot about this.
I'm going to give a brief introduction because Jim is going to say much more about this very, very soon. Usually, we do this the other way around, which I think probably makes more sense, with Jim describing all of these before I go into talking about recurrent computation. So I'll say a few words about this part, but most of this will be explained much better in much more detail in the next talk.
So a lot of the inspiration for us to build computational models of visual processing comes from understanding the basic anatomical circuitry of connectivity in the macaque monkey. So when you see these-- and I think Christof showed and flashed a version of this diagram during one of these talks as well. For those of you who are not familiar, what you're seeing here is what I call the mesoscopic map of connectivity in the macaque monkey visual cortex.
Each one of these boxes is meant to represent a visual area. At the bottom, you have the retinal ganglion cells. The next step is this part of the thalamus called the lateral geniculate nucleus. Information from there goes on to the next stage, which is called primary visual cortex, and so on.
These lines indicate the connectivity between one of these boxes and the other. And as was already alluded to in the talks that Christof gave, this is just a very, very coarse approximation to the actual connectivity in the system. In a few days, we'll have a talk by Jeff Lichtman, who is going to describe at the ultrastructural level, at the EM level, the connectivity between different neurons in a small patch of the brain. So if you open each one of these boxes, you'll have a bewildering and amazing complexity of connectivity, of motifs, and so on. So this is just a very, very coarse approximation to the way that neurons are connected in the visual system.
But even at this very coarse and mesoscopic level, we see that there is a lot of stuff going on. There's a lot of complexity. And I will contend that the kind of models that we have today only capture a very small fraction of what's happening. And this is only looking at the visual system alone without even considering all the other parts of the brain.
What are we inspired by, and why are we looking at the macaque? One of the difficulties is that we just don't have a diagram like this for humans. In humans, we just don't know-- basically, we know almost nothing about real connectivity in the brain.
There are a lot of people who make indirect measurements in humans. And as Christof alluded to, these indirect measurements-- well, how can I put it? They are difficult to interpret, and they don't have the rigor of being able to trace individual axons to know that one particular part of the brain is connected with another. We just don't know, basically, how different parts of the human brain are connected with each other.
So just to give you a quick glimpse, you already had a tutorial about computer vision. Both Jim as well as Tommy will talk more about some of these deep convolutional network models. Many of you are familiar with them.
So these diagrams here correspond to AlexNet and VGG. These are names for several deep convolutional neural networks, deep because they have multiple stages, convolutional because of the main operation that's repeated over and over across the space of the picture. And these are networks that have been extremely successful in visual object recognition tasks.
So in a typical visual object recognition task-- and I know that some of you are quite familiar with these type of problems. But just for those of you who are not, you have a lot of pictures. Let's say they belong to some number of categories. For example, there's one database that's called ImageNet.
There are 1,000 different categories. And you can train these algorithms to put labels onto these pictures. This is a chair, this is a table, and so on. And then you do cross-validation.
You take other pictures, and you try to put labels on. You evaluate how well you can label the test images after training the system. So these are essentially several different systems that achieve a pretty high performance in these type of tasks. To some extent-- and again, we'll discuss this in more depth with Jim's talk.
The black boxes, the boxes that are circled in black on the left there, are the ones that are purported to be described by this type of model. So it's only a very small fraction of what's happening in the visual system that's being incorporated into these types of computational models. So how do we know what happens in one of these boxes? How would we know what a neuron in the retina does?
How do we know what a neuron in the LGN or V1 does, and so on? So the gold standard to be able to understand function is to look at the spiking activity of neurons in the brain. So there have been heroic efforts by many people over the last approximately seven decades putting electrodes in different parts of the brain and trying to map the responses to different types of images.
So just to give a quick overview, and a very unfair overview, of seven decades of visual neurophysiology, Stephen Kuffler recorded activity of retinal ganglion cells and showed that neurons in the retina are particularly picky about the location of illumination in the visual field. Following up on earlier work, he defined the receptive field of neurons in the retina, so that neurons are interested in a particular location in the visual field and not in others. Fast forward a few years, and we have [INAUDIBLE] showing that in primary visual cortex, neurons are tuned to bars of a particular orientation.
A neuron may fire very strongly to a vertical bar and almost not at all to a horizontal bar. People have divided the visual system into two main streams-- the parietal or dorsal stream, where we find neurons that are particularly selective to aspects of the visual stimulus, such as stereo or the direction of motion, and then the temporal pathway or the ventral pathway, which is going to be the main focus of the next talk, where we can find neurons that are selective to complex shapes, including fractals, including paperclips, including faces, including chairs, and many other types of shapes. So just as a quick description, what do we know about the human visual system?
Again, almost nothing. It's very hard to record the spiking activity of neurons in the human brain. Every now and then, we can record intracranial field potentials, and I show you a couple of examples of field potentials in the human brain. And in a couple of very limited cases, we have access to spiking activity in the human brain by working with patients that have refractory epilepsy.
So at the top of this diagram, the pinnacle of this diagram is actually the hippocampus and the entorhinal cortex, ER, and the hippocampus, HC. These are strictly not purely visual areas. If you make a lesion in the hippocampus, humans can still see very well.
Nonetheless, this is one of the areas that we had access to. So if you put electrodes in the human medial temporal lobe in areas like the hippocampus, entorhinal cortex, and the amygdala, you can also elicit visually selective responses. This is an example from very old work that we did with our colleagues [INAUDIBLE] and Christof [INAUDIBLE]. In this case, this is an example of a neuron.
In this particular case, this is a neuron that was located in the right amygdala of this patient. Each one of those dots corresponds to one spike. And this is a typical raster plot where you see the responses in every single trial in response to that particular picture. And the histograms below the raster plots correspond to the peristimulus time histogram. That is the average activity over trials of that particular neuron to that particular picture.
So this is a single neuron. And I'm showing you here the responses to 15 different pictures. So you see that this neuron was very picky. It didn't just respond to any picture. It responded to some pictures-- in this particular case, to three different pictures of former president Bill Clinton. It did not respond to other faces.
It did not respond to other presidents. It did not respond to other pictures of chairs, animals, and so on. So it was highly selective.
So some people started calling these Bill Clinton neurons or Jennifer Aniston neurons in the same way that people talk about orientation tuned neurons or face neurons or chair neurons and so on. So I want to start by following up on some of the discussion that Christof elicited by asking how well do we really understand cortical responses. In other words, what do neurons really want? How do we know that out of all the possible stimuli in the world, the ones that we happen to have chosen for the experiment are really meaningful in any sort of way?
So just to clarify, when we say what do visual neurons really want, of course neurons do not want anything. Neurons are in the business of communicating with other neurons. They fire action potentials depending on the inputs. And to a first approximation, if the sum of all the inputs is strong enough, there's a particular part of the neuron that essentially will dictate whether an action potential will be fired or not.
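As a caricature of that first approximation-- sum the inputs and fire if the sum crosses a threshold-- the idea looks like this, with entirely illustrative numbers:

```python
# A caricature of the first approximation above: the unit fires an action
# potential when the weighted sum of its inputs crosses a threshold.
import numpy as np

def fires(inputs, weights, threshold=1.0):
    return float(np.dot(inputs, weights)) >= threshold

print(fires([0.8, 0.9, 0.4], [1.0, 0.8, 0.5], threshold=1.0))  # True: weighted sum = 1.72
print(fires([0.1, 0.2, 0.0], [1.0, 0.8, 0.5], threshold=1.0))  # False: weighted sum = 0.26
```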
So the sense in which I ask the question what do neurons really want is how can we effectively drive a neuron? How can we maximize the firing rate of a neuron, the activation of a neuron, by carefully selecting what kind of pictures we use?
Is it true that neurons are interested in chairs? Is it true that neurons are interested in orientation or Bill Clinton or whatever kinds of visual stimuli will maximally drive neurons? I'm going to restrict this question to the domain of flashed stimulus presentations and in this case to passive viewing and, importantly, to realistic experimental conditions where we have a finite amount of time to entertain ourselves with a neuron.
The finite amount of time is critical because of the curse of dimensionality. Essentially with current techniques, it's impossible to exhaustively sample the stimulus space, OK? So if you consider an image that has 100 by 100 pixels, arguably a very small patch of the visual field but nonetheless 100 by 100 pixels, each pixel can have 256 shades of gray. And you want to exhaustively explore stimulus space.
I think I did a calculation that if you show five pictures per second continuously for five years, one could present about 10 to the 9 pictures. And it would take a couple of millennia for a grad student that never sleeps and never eats to be able to show all of those possible pictures. So it's just impossible to test all possible pictures.
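Just to spell out that back-of-the-envelope argument with the numbers above-- 100 by 100 pixels, 256 gray levels, 5 pictures per second-- the arithmetic looks like this:

```python
# Back-of-the-envelope check: how many 100 x 100 grayscale images exist,
# versus how many pictures one could show at 5 images/second for 5 years.
import math

pixels = 100 * 100
log10_possible_images = pixels * math.log10(256)   # number of possible images ~ 10^24082
shown_in_5_years = 5 * 5 * 365 * 24 * 3600          # ~8 x 10^8 presentations

print(f"possible images  ~ 10^{log10_possible_images:.0f}")
print(f"shown in 5 years ~ {shown_in_5_years:.1e}")
```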
So what do people do? So people are inspired by previous studies, so somebody else published a paper saying that gratings are effective at driving V1. So let's try gratings.
So that has been extremely successful, lots of heroic experiments mapping responses to gratings all throughout the visual system. A lot of people have intuitions about certain images that may be more interesting than others. For example, some people believe that faces are cute and interesting, and therefore we should study them as a special category because they may be of some sort of evolutionary relevance. So they did-- this is basically choosing stimuli based on intuitions about what kind of pictures are important.
Other people have argued that the statistics of natural images are relevant to dictate what kind of images we want, that our visual systems were trained in an environment that has some regularities, and we can use those regularities to guide what kind of pictures we use. Then people have used computational models. And perhaps one of the most fundamental ways of discovering the tuning properties of neurons along the ventral visual cortex has always been serendipity, people getting lucky. And if people are interested, we can discuss three examples here of how people got lucky, essentially.
So it's a combination of careful observation and a lot of hard work and getting lucky and being at the right place at the right time to discover what neurons may be interested in. So I want to briefly talk about a study that was conducted by a student [INAUDIBLE] in the lab to try to come up with an unbiased, or at least less biased, way of exploring the tuning properties of neurons, of visual neurons, in monkeys, in humans, in whatever species you want. So the basic idea is that we're going to try to get the neuron to dictate what it prefers, what it likes.
So the algorithm goes like this. We have-- on the top left here, we have an image generator. So this is an algorithm that will generate pictures. If you think about deep convolutional networks, basically what happens in a deep convolutional network is that at the beginning, at the input, you have an image. You have pixels.
And then you extract some features. This image generator turns the problem upside down, basically. So we start with features. And the output of that is an image. So the output of this generator is a picture.
So then we're going to record the activity of a neuron in response to that picture. And then we're going to use some sort of search algorithm-- for example, a genetic algorithm-- to try to get the neuron itself to evolve and dictate what kind of pictures it likes. So this is done in real time in a closed loop to try to evolve pictures that will maximize some sort of function that's dictated by the neuron. In the examples that I'm going to describe now, we're going to use the firing rate of the neuron as the main function that we're trying to optimize.
So we want to maximize the number of spikes, the spike count, in a given window. This is an assumption. We choose firing rates because lots of people in the field have been using firing rates forever.
But in principle, you could optimize other things. You could optimize local field potentials. You could optimize the derivative of the firing rate. You could optimize the synchrony between neurons. You could choose your favorite function, your favorite intuitions about the neural code, and try to optimize that. Here, we're going to work mostly with spike counts.
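As a rough sketch of that closed loop-- with generate_image, record_firing_rate, and mutate_and_recombine as placeholder names standing in for the real image generator, the recording setup, and the genetic algorithm-- the logic looks something like this, not the actual experimental code:

```python
# Schematic of the closed-loop evolution described above (a sketch, not the real code).
# generate_image, record_firing_rate, and mutate_and_recombine are placeholders for
# the image generator (e.g., an inverted AlexNet), the recording, and the genetic algorithm.
import numpy as np

def evolve_preferred_stimuli(generate_image, record_firing_rate, mutate_and_recombine,
                             pop_size=36, code_dim=4096, max_generations=200, tol=0.5):
    codes = np.random.randn(pop_size, code_dim)   # start from random latent codes (noise images)
    prev_mean = -np.inf
    for generation in range(max_generations):
        images = [generate_image(c) for c in codes]                   # codes -> pictures
        rates = np.array([record_firing_rate(im) for im in images])   # spikes/s per picture
        codes = mutate_and_recombine(codes, fitness=rates)            # favor high-rate codes
        if abs(rates.mean() - prev_mean) < tol:                       # simple convergence check
            break
        prev_mean = rates.mean()
    return codes, rates
```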
I see several questions. I think Lucy and then Nils. Yeah.
AUDIENCE: And how do you know when to stop?
GABRIEL KREIMAN: That's a good question. So we don't. I will show you that in a couple of cases, we got lucky, and there seems to be monotonic convergence.
And I think it's pretty reasonable to observe that empirically, these converged. Ultimately, we don't know where to stop. We can talk about local maxima in firing rate or not. I think these are interesting questions.
I'm happy to discuss those later. But we don't, and we just set some sort of criterion for convergence. Yes.
AUDIENCE: How did the parameters of the generative network itself [INAUDIBLE]?
GABRIEL KREIMAN: Very good. I'm going to show you some computational exercises where we play with that and we try different possibilities for that. I'm also going to show you some monkey [INAUDIBLE] physiology experiments that we've done in collaboration with Marge Livingstone. And in that case, we used a fixed system, which was essentially an inversion of AlexNet.
We started with FC6. So this was a pretrained-- so for those of you who are aficionados in the field, AlexNet is an eight-layer neural network that was pretrained on ImageNet. This is the large data set I was referring to earlier.
And so this was pretrained on ImageNet, and this is an inverted version of that. And computationally, we played with many other different generators, and [INAUDIBLE] show you results. With monkeys, unfortunately, this is very costly, very expensive.
We had to make some choices. And that's one of our choices. Cole.
AUDIENCE: [INAUDIBLE] imagine you walk into a [INAUDIBLE] multiple people who are sitting there talking a language that I don't understand. But I see them producing sound and formulating their sound and producing words in some language. It seems to me like this approach is comparable that you have neurons that are talking, they have some role in this function that they're producing.
And I have access to their firing rates and like, the anatomy of the sound. And we're able to make sense of what role each [INAUDIBLE] member is playing just by making them shout aloud. And that seems to be kind of--
GABRIEL KREIMAN: I think that's a very good question, and we can-- so basically, you're asking why are we trying to maximize firing rate. Why not something else, right? So--
AUDIENCE: No, no [INAUDIBLE].
GABRIEL KREIMAN: Right, so let me ask the question a different way. Hubel and Wiesel got a Nobel Prize for discovering orientation tuning. What is orientation tuning in V1?
It means that there are more spikes for one orientation than another, right? So there's an implicit assumption there that more spikes is a good thing, right? What is a place cell in the hippocampus?
A place cell in the hippocampus is a cell that fires more in one location than another. And I can go on and on and on. So it's true. That's an empirical question.
Do firing rates matter? Does anyone listen to those firing rates? Does a higher firing rate have any correlation with behavior?
These are very good questions. I think I'm together with 99% of systems neuroscientists over the last five decades in using firing rates. But you're absolutely right. In principle, this general formulation of the approach is agnostic to what you're trying to maximize.
So if you have a better metric-- so you could tell me that I know for sure what really is relevant is the joint mutual information between these 250 neurons, and we can try to maximize that using the same algorithm. We start with firing rates because, again, that's a simple thing to start with.
But you're right. I don't disagree with that. Yeah, yes.
AUDIENCE: Do you think there will be a difference when [INAUDIBLE] functional-- functioning regime versus I'm just pushing it to the limits? For example, if the animal is doing some task while you're doing this, [INAUDIBLE] I guess the most extreme example I'm thinking about is if [INAUDIBLE] recording, and you have some profile with that [INAUDIBLE] looking like. And you do size recordings [INAUDIBLE] inject some kind of current that [INAUDIBLE] to like crazy. But I don't see that cell behaving like that in vivo.
GABRIEL KREIMAN: That's a very good point as well. Again, we had to start somewhere, and we chose to start by looking at the condition of passive viewing and flashing pictures. This is certainly not the only thing that I want to understand in vision.
I'd like to understand everything. Does it depend on the temporal history? Does it depend on whether the monkey is scratching his head?
Does it depend on whether the monkey is doing a visual discrimination task? These are very valid questions. In this case, we started with the most basic-- passive viewing condition, flashing pictures, and trying to maximize firing rates.
And we think this is a useful and interesting regime, but I'm not claiming that this will extrapolate to every possible condition. Maybe if we show videos, we'll get a different answer, for example. And we're very slow. We're going very, very slowly and--
AUDIENCE: [INAUDIBLE] also interesting.
GABRIEL KREIMAN: Yeah. Yep, I agree. I don't know who's first. Yeah, go ahead.
AUDIENCE: I was wondering whether it's [INAUDIBLE] the monkey is learning here. Is there some [INAUDIBLE] comparison, or basically you show [INAUDIBLE] and then see how the neural responses [INAUDIBLE] developing [INAUDIBLE]?
GABRIEL KREIMAN: I guess you're all very smart. You assume that this worked. That's why I'm showing it.
I haven't shown you anything yet. So hold onto this question. And if it's still unclear at the end, ask me again. Yeah.
AUDIENCE: Yeah, I was going to ask [INAUDIBLE] the neurons they were firing [INAUDIBLE].
GABRIEL KREIMAN: Tired, tired, yes. That's a very good point. So I'd say a few words about that as well.
So let me show a little bit of the results, and then that's a very important question as well. Is there anything unclear about the methodology? Or-- OK, so let me start quickly by-- so OK, so first step, we have an image generator.
This is essentially an inversion of AlexNet here. This is the reference. We start with completely random conditions.
We chose 36 random images, so we start with noise. And to start and assess whether this might even work or not, instead of a human or instead of a monkey there, we put a computational model.
So we take a network here. And we're going to study that network. For example, we're going to study AlexNet or ResNet.
This is very different from that network, OK? So here, we're using a computational model as a proxy, as a species, as a cat, as a monkey, as a human. We're going to record the activity of a unit in the network just to see whether we can actually maximize activation in that network.
This is a useful exercise for us to even know whether it works. If it works here, doesn't mean that it will work in vivo. But if it doesn't work here, then we would be in trouble. So the short answer is that it does work in here. Yes.
AUDIENCE: So you [INAUDIBLE] network is [INAUDIBLE] outside, but a network [INAUDIBLE].
GABRIEL KREIMAN: OK, you don't get the identity because in this network, we're assuming-- we are making no assumptions about what we know. So we are treating this as a black box. So we're making no assumptions that we know anything about this particular network.
But you may be worried about the fact that if this works, does it work because I'm using exactly the same network, right? And I'm going to show you that that's not the case, OK? But that's a good concern to have. Are we just overfitting here, OK?
But we're going to treat this as a black box. I'm going to assume absolutely no knowledge. I know nothing about the number of layers in that network, about the weights, OK?
All right. OK, so we put our electrode in the network. We look at activation of one of these units. And this is one example. So this is the top layer of AlexNet, a layer called FC8.
This is unit 599. It's called a honeycomb unit because it's involved in recognizing honeycombs. So if you look at the 1.4 million pictures in ImageNet, the three best pictures are those three pictures, and that's the activation value that you get in response to those pictures for that particular unit.
So if you look at the distribution of activation values for all the ImageNet images, that's the green curve there that's a distribution of activation values for all possible images. And then we run our evolution algorithm. We create images, trying to maximize activation.
And lo and behold, we can maximize activation. We can get images that will drive this unit much, much better than the best possible images in ImageNet. So that's what is shown here with all of those gray bars. The shade of gray here indicates which generation we're talking about.
At the beginning, we start randomly. So we have images that are very bad, meaning they trigger very low activation values. At the end, we have very high activation values. So at the end, we have this picture over there that gives you an activation of 37.6, which is much better than the 1.4 million images in ImageNet, OK?
We call these super stimuli-- images that elicit higher activation than classical ones. So we can do this again and again and again. So here we did this for 100 different units in different layers. This works in different layers.
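In this in silico setting, "recording" from a unit just means reading out its activation for a given image. Here is a minimal sketch with a pretrained AlexNet from torchvision-- my choice for illustration-- probing output unit 599, the honeycomb unit mentioned above; the image path is a placeholder, and the evolution loop sketched earlier would simply try to push this number up:

```python
# Read out the activation of one output unit of a pretrained AlexNet,
# e.g., unit 599 of the last layer (the ImageNet "honeycomb" class).
import torch
from torchvision import models, transforms
from PIL import Image

model = models.alexnet(pretrained=True).eval()
preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def unit_activation(image_path, unit=599):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(x)[0, unit].item()   # the "firing rate" of this in silico neuron

print(unit_activation("candidate_image.jpg"))
```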
There was an important question before. Does it only work if you try it on AlexNet, because the generator is from AlexNet? We can try this on lots of other architectures.
Again, remember that generator is fixed. That's based on AlexNet. But now we're looking at units in Inception, in ResNet, in all kinds of other networks.
Again, we're making no assumption about any of the weights. We know nothing about those. We're treating those as black boxes.
We can still drive them. We can still find images that will drive those units better than the best pictures in ImageNet most of the time, not all of the time. So everything that's below 1 means that we failed to find something that's better than pictures in ImageNet. There may still be very good pictures but not better than the ones in ImageNet. Yes.
AUDIENCE: If you looked at the best images, you'd find [INAUDIBLE].
GABRIEL KREIMAN: Let me come back to that. One of the arguments I'd like to make is that we should avoid trying to find patterns in terms of words, in terms of describing pictures.
So what I'd like to argue is that units are interested in describing and extracting certain types of features that we cannot necessarily put into words. So I showed you one example of that image.
So that one over there is the image. And if you want, you can probably start to make claims that it kind of looks honeycomb-ish to me, or it has a lot of yellow, or it has-- we can use those words to describe it. I'd rather not.
I'd rather avoid any kind of description of what that is. All I can say is that that's a picture that can trigger high activation. Yeah.
AUDIENCE: What I mean is you look at, say, 100 images and try [INAUDIBLE].
GABRIEL KREIMAN: Well, so here we have 100 different units. So I don't necessarily expect them to be similar. These are different units in the network.
One unit likes chairs. One unit likes cars. One unit likes honeycombs. So--
GABRIEL KREIMAN: So here, I'm showing a distribution over different units. But if you do that for the same unit and you have different initial conditions-- and I think I'm not going to show that here. Then I think your question is more relevant.
That is, if I start with different initial conditions, do I always converge onto the same thing? The answer is no. Do I converge onto similar things?
The answer is yes. Do I converge onto similar activation values? The answer is yes. But those things are not identical. And that's an important question that we don't have a good grasp on, and that we can discuss further.
OK, I want to accelerate here. So now we felt confident that this method works. We collaborated on it with Marge Livingstone. Incidentally, she's going to come and give a talk in a couple of days as well.
So now I'm going to talk about doing exactly the same thing but looking at the activity of neurons in macaque monkeys. So this is a typical recording from a neuron in inferior temporal cortex, specifically in one of these areas that Marge and many others call a face patch. What is a face patch?
It's a particular part of cortex that, using non-invasive functional neuroimaging, seems to have higher activation for pictures of faces compared to other types of stimuli. And lo and behold, if you present lots of different natural pictures-- these are examples of the types of pictures that they have been using-- and you measure the response, the firing rate, you get a very low response to objects, and the response to monkey faces and human faces is much higher. So based on this kind of response, a lot of people would like to call this particular type of cell a face cell.
What does it mean to be a face cell? It means that you get a bit more spikes in response to some of these human and monkey faces compared to some of these other arbitrary categories that we used for comparison purposes. If you look at the top 10 pictures, they're shown here. The worst 10 pictures are shown there. This is what in the field has been called a face selective response.
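If you want to put a number on "a bit more spikes to faces than to other categories," one common way is a selectivity index contrasting the mean rates; this is just an illustration of that kind of criterion, with made-up numbers, not necessarily the exact one used in these experiments:

```python
# Illustrative face selectivity index: contrast of mean firing rates to faces
# versus non-face objects; positive values mean more spikes to faces.
import numpy as np

def face_selectivity_index(face_rates, nonface_rates):
    rf, rn = np.mean(face_rates), np.mean(nonface_rates)
    return (rf - rn) / (rf + rn)

# Example with made-up mean rates (spikes/s) per image category.
print(face_selectivity_index(face_rates=[45, 50, 38], nonface_rates=[8, 12, 10]))  # ~0.63
```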
So we run our synthetic generation algorithm, our evolution algorithm. And this is what we get. So here, I'm showing the response, the firing rate of the neuron, as a function of the generation number. So I'd like to make a couple of points partly in response to previous questions.
So the natural pictures are the real-world pictures. Those are the chairs and faces and monkeys and so on. You see that the response to those pictures decreases a little bit over time. That's a small effect. And we think that has to do with adaptation, with the fact that we're repeating the same pictures over and over again.
There's also adaptation to the evolved pictures, which end up looking very similar to each other over many generations. So we're fighting against that type of adaptation. We think we have some sort of quantitative grasp on adaptation by looking at these gray curves.
The firing rate to the synthetic images increases over time. Almost by definition, that's basically saying that our method is working. And we don't have a good stopping time.
We stop at some point where we think that there is convergence, based on putting a threshold on how much change there is in the firing rate from one generation to the next. This is the standard stopping criterion in many of the ways in which people train neural networks as well.
But we don't know that that's the final point in any sense of final, right? It's conceivable that if we were to record from this same neuron for five years, there could be another picture that will trigger 200 spikes per second, and we just didn't see that. And we can discuss that as well.
AUDIENCE: Just like there's adaptation to multiple images that are [INAUDIBLE], is there any surprise factor [INAUDIBLE]?
GABRIEL KREIMAN: I'm not sure exactly how you'd quantify that here. So the natural pictures and the synthetic ones are intermixed. The monkeys are fixating in this case. The natural pictures may show some sort of advantage in here because there are fewer of the--
AUDIENCE: [INAUDIBLE] different. They're not like the same picture.
GABRIEL KREIMAN: Right. So I don't know how you quantify surprise. So one version of surprise is you're seeing all these funny evolved images.
And all of a sudden, you see a chair. So to the extent that you think that that's surprising, that would be included in this. But maybe you have another definition of surprising.
OK, all right. So this is what we get. We can generate images that are better-- that is, in the sense of eliciting higher firing rates than natural pictures. Here's one example of that process. We start in generation 0 in the upper left.
We go through this whole procedure. At the end of 209 generations, we get that picture. And somebody may ask, well, what is that picture?
I don't know what that picture is. I personally would like to refrain from putting a word onto what that picture is. A lot of people disagree with me. I think Marge disagrees with me.
I think it will be fun to ask her, and she'll have a whole conversation about this. I see this as a combination of features that are effective in eliciting high firing rates. That's nothing more, nothing less.
Here are two more examples. In green, I'm showing the distribution of firing rate responses after background subtraction-- that's why you can get negative firing rates here-- in response to 2,550 natural images. And in gray, you can see the responses to the images that we evolved. OK, so here are two examples.
The one on the right, we did not manage to get a picture that was better than all the 2,550. We got something that was very close in terms of firing rates but not necessarily better in this case. Adam.
AUDIENCE: The question [INAUDIBLE] setting [INAUDIBLE] monkeys.
GABRIEL KREIMAN: The monkeys are fixating. So they are doing passive fixation.
AUDIENCE: There's no response to the [INAUDIBLE].
GABRIEL KREIMAN: There's no behavior. It's passive fixation. All they have to do is fixate in this case.
AUDIENCE: And when I see the [INAUDIBLE], it seems that the high frequency [INAUDIBLE].
GABRIEL KREIMAN: Say it again?
AUDIENCE: So when I see the earlier rates in the [INAUDIBLE]--
GABRIEL KREIMAN: Yes.
AUDIENCE: --it seems more likely like only the low frequency domains are there. There is no high frequency per se.
GABRIEL KREIMAN: Oh, in the images themselves. I think you're right. I think you're right.
We are not selecting for high frequency or high spatial frequencies in the images. But the algorithm in principle can evolve to wherever it is. But I think you're right.
I think that the early synthetic ones, they are noise. But the way we synthesize that noise based on-- we use something called the Portilla-Simoncelli algorithm. And you may be right that we start with-- that overall, they have lower frequency content.
So we're not trying to optimize frequency or color or contrast or chairness or faceness. We're letting the algorithm decide where it goes. The only goal here is to maximize firing rates.
AUDIENCE: I was curious. So did you also look at the surrounding cells? So basically, is there some sort of [INAUDIBLE] distribution [INAUDIBLE]?
GABRIEL KREIMAN: Yeah, I don't have a good picture of this here. These recordings were done with a [INAUDIBLE] array. Nearby cells were correlated in their firing preferences, but they were not identical. So typically, when we were trying to maximize-- so again, this particular exercise was done on one single channel in the [INAUDIBLE] array at a time.
These are multi-unit responses from a single channel. We also tried with single-unit responses. In general, we observed that if you're maximizing one of the channels, you'll get relatively effective stimuli for nearby channels but not necessarily optimal or not necessarily better than the rest. So we're focusing on one. And because there are correlations in the structure and topography of the responses in cortex, these are not bad stimuli for-- but they're not the best either. Yes.
AUDIENCE: In terms of my understanding the results, I'm assuming the neural network [INAUDIBLE] fixed. So I guess I'm trying to find out the correct input [INAUDIBLE] activate to get [INAUDIBLE] architecture to get the maximum activation. I guess in the brain, because it's dynamic, how should I understand what that [INAUDIBLE]?
GABRIEL KREIMAN: So there are several things you might be alluding to by dynamic. One of them is adaptation. So that's one effect that happens over time.
Another thing that you may be arguing here is that the weights between neurons, the connectivity between neurons, might be changing during the course of our experiment. I don't know if that's true. In general, if you look at the firing rate responses of a neuron to a particular picture, we get very comparable responses at the beginning of the experiment and the end of the experiment.
That's [INAUDIBLE] mathematical proof that nothing changed in the network. But we think that during the course of one experiment, at least in the course of half an hour, one hour, the recordings are sufficiently stable that the average firing rate doesn't change. I'm talking about the average firing rates.
Neurons are funny devices. If you look at the trial-by-trial response of a neuron, there's a lot of variability, so much so that people have argued that the Fano factor, that is, the variance in the response over trials divided by the mean, in many cases is close to one. So there's a huge amount of trial-to-trial variability in response to the same picture.
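Concretely, that quantity is just the across-trial variance of the spike count divided by its mean; a quick sketch with made-up counts:

```python
# Fano factor: variance of the spike count across repeated trials divided by
# its mean; values near 1 are what a Poisson-like spiking process would give.
import numpy as np

def fano_factor(spike_counts_per_trial):
    counts = np.asarray(spike_counts_per_trial, dtype=float)
    return counts.var() / counts.mean()

# Made-up spike counts for repeated presentations of the same picture.
print(fano_factor([7, 12, 9, 15, 8, 11]))  # ~0.7, in the range typically reported
```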
And we can talk about why and what that means. But if you look at the average firing rate at the beginning of the experiment and the end of the experiment, they're extremely similar. That doesn't mean that the network is not changing.
I don't know. One way in which it's changing is adaptation, and that we can quantify. Otherwise, we don't know.
OK. I wanted to talk a little bit about recurrent computation. So I'm going to skip ahead; I showed you a couple of examples, and you can look at more examples that are here. In general, I think one issue that I'd like to raise here for discussion-- and I think this follows up on several of the points that were already made by Christof-- is that I think we should revisit our anthropomorphic verbal descriptions of neuronal tuning.
So we like to use words to describe neuronal tuning, and I think that's not the right vocabulary. So saying that this is a chair neuron, saying that this is a face neuron, saying that this is an orientation neuron or a curvature neuron and so on, that's not the right vocabulary. So what is the right vocabulary?
We need to build models. We need to build quantitative models. Jim will have a lot to say about that. But let me just point out very quickly that some of the best models that we have today, we think, are not very good at describing the responses to these pictures that we have evolved.
So what we are doing here is taking a deep convolutional network. We use the same natural pictures-- so all of those pictures of chairs and faces and houses and whatnot. And we use those to fit the responses using features at a particular level of the deep convolutional network. So we take one of the layers of the convolutional network, and we try to see how well we can predict the firing rate.
So this is an approach that was championed by [INAUDIBLE] and Jim DiCarlo, showing that you can get very reasonable fits of the firing rate responses to those pictures. And we're showing that here, and those are the green points.
So on the x-axis are the firing rates that we predict using a convolutional neural network. And on the y-axis are the actual firing rates. So we can have pretty decent predictability of firing rates in response to those pictures by simply doing a linear fitting of responses from the model onto firing rates. For the aficionados, we're doing exactly the same thing that others have done. This is a partial least squares regression fit.
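As a sketch of that fitting step, with placeholder arrays standing in for the real layer activations and recorded firing rates-- the shapes and the number of components here are illustrative assumptions, not the parameters we actually used:

```python
# Fit a neuron's firing rates from CNN-layer features with partial least squares,
# then predict held-out responses (these would be the green points in the figure).
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import train_test_split

features = np.random.randn(2550, 4096)   # placeholder: layer activations per natural image
rates = np.random.randn(2550)            # placeholder: measured firing rates per image

X_train, X_test, y_train, y_test = train_test_split(features, rates, test_size=0.2)
pls = PLSRegression(n_components=25).fit(X_train, y_train)
predicted = pls.predict(X_test).ravel()  # predicted vs. actual firing rates
print(np.corrcoef(predicted, y_test)[0, 1])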
However, when we look at the responses to our evolved images, to our synthetic images, these are images that are very different from anything that we've used to train this mapping, this linear fitting. We see that they depart.
These are all the gray points, and they depart quite strongly from this linear fit. So we cannot predict the responses to these pictures very well. This argues that, yes, we need to build computational models-- that's a much better way of describing the responses than trying to assign words to describe them-- but at the same time it suggests that we still have a long way to go to bring our computational models to be able to explain the responses to any picture. So here's a whole family of pictures that we just cannot predict very well using these types of computational models. Go.
AUDIENCE: So if you take the [INAUDIBLE] I can see that [INAUDIBLE] on the network. I assume it's the [INAUDIBLE] mapping the features of the neuron, right? So let's say you're [INAUDIBLE] and [INAUDIBLE]. We have a [INAUDIBLE]. Is this something that is happening here that because [INAUDIBLE], we are [INAUDIBLE] function [INAUDIBLE] you may [INAUDIBLE]?
GABRIEL KREIMAN: Yes, I think that's a very good point. So you're right. So if you only learned about apples, it's going to be hard to extrapolate. So I would say that's a bug in our computational models.
I'd like to build models that will extrapolate to any picture in the world without having to train on them. I see training as cheating. If you always have to train on those same type of pictures, then we're not really learning much and extrapolating much. So I would argue that-- so I think part of that, I think maybe what's going on is that we did try to include some of the evolved images into our training set. And that improved a little but not too much.
But I think that is part of what's going on. And you and many other people have done this exercise [INAUDIBLE] leaving some whole categories out, for example, and seeing how well these models extrapolate. And these models are pretty decent at extrapolating but certainly not perfect. So you do take a hit.
If you've never seen apples, predicting the response to apples is hard. And I think that's one of the problems, why we think that we need better computational models. In my mind, once we have a good computational model, it should allow us to explain the response to any possible picture in the world, and I think we're not there. Yes.
AUDIENCE: So [INAUDIBLE] computational model, we need the [INAUDIBLE]. But it also could be the linear map of [INAUDIBLE] all trained on the [INAUDIBLE]. So I think the computational model could still be OK at predicting representations that represent synthetic images. But the linear mapping [INAUDIBLE].
GABRIEL KREIMAN: It's quite possible. Yeah, absolutely. So I'm just saying this particular flavor doesn't seem to be able to account for these responses. I'm not saying any-- I'm not trying to argue all deep computational models are wrong.
I'm very fond of them. There's plenty to do with them. I suspect we can fix this, and we're trying to. You may be right that it's in the linear mapping part, and that's a reasonable hypothesis. We tried a couple of things that didn't work, but we have much more to do.
AUDIENCE: I thought you were saying that that [INAUDIBLE] creating some [INAUDIBLE].
GABRIEL KREIMAN: I don't want to have to train on every single possible picture in the world because then-- right? So I want to be able to ensure that we can extrapolate to completely different pictures. That's my measure of success.
So that to me-- I think it's a reasonable goal to have. I don't want to say, well, you can only test me with apples. That doesn't seem like a reasonable [INAUDIBLE].
AUDIENCE: [INAUDIBLE] for the mapping function could be exposed to a certain number of dimension [INAUDIBLE].
GABRIEL KREIMAN: It has to be rich enough. It has to be rich enough in some sense. You're right.
So it has to be exposed to a certain number of dimensions. That's a topic that we can discuss further. I don't know very well how to quantify that, but I share your intuition.
If I train my linear mapping only with a picture of this pointer, that's not going to work. So that's unfair, right? But at what point is it-- so I want to make sure that we can extrapolate. But I do want to have a rich enough dictionary of features that we use for that mapping [INAUDIBLE].
AUDIENCE: So this is analysis [INAUDIBLE] at which you're identifying a correspondence between individual units in biological cortex and in your artificial cortex. You're optimizing the image with respect to the artificial neuron, but then you're testing to see whether the optimized image with respect to the artificial neuron actually then increases the performance of the firing rate of [INAUDIBLE].
GABRIEL KREIMAN: Not in here. That's not what we're doing here. So what we're doing here is mapping between, let's say, an entire layer of the artificial network and a single neuron.
And that's what you're seeing in this example here. So we're describing the response of a single neuron. This is very similar to what many other people have done before. So we're not trying to optimize anything in the artificial network here, OK? So all the optimization, all these evolved images were done based on the real biological neuron.
AUDIENCE: But have you found that sort of relationship where you optimize an image with the artificial neuron? So you find some correlation between the [INAUDIBLE] artificial unit. You optimize with the artificial unit, and that ends up increasing the firing rate of the biological--
GABRIEL KREIMAN: That's an interesting experiment. We haven't done that. So I think it's an interesting idea.
Indeed, there have been people that have been trying to do-- not mapping an entire layer but mapping one unit of the network onto one biological neuron. And that seems to work surprisingly well.
And we can debate why. And if you do that, then you can do that type of exercise that you're alluding to. We haven't done that, but-- OK.
I just want to point out that there's a beautiful paper that I think Jim will probably talk about, and this is work by [INAUDIBLE] and co. And where's [INAUDIBLE]? OK, [INAUDIBLE] and co and many others, with a different approach with similar goals-- and I think everybody should read this. The idea here is, instead of using this evolution approach, to actually build a computational model of the unit and then use that computational model to try to generate images according to certain functions, such as maximizing firing rates, or other ideas.
I won't have much time to discuss that now. I think that this is a very interesting approach. The relationship between these two is also very interesting. OK, let me-- I'm going to accelerate here.
OK, I want to talk at least briefly about one or two more points along this list. What I was telling you about is how do we know what we know, how do we know what neurons prefer, how do we build models about trying to propose foveal objects. And again, this will be expanded upon in the next presentation.
I want to give you just a flavor of how we're going about trying to study some of these other visual routines, some of these other computations that we think are essential for visual scene understanding. So I'll talk very briefly about these two, and I'm happy to expand on this and discuss further. So the first one is about making inferences, about being able to reason, in some very generic form of the word reasoning, about the picture, to be able to interpret it in cases where you have incomplete information in the image, in the visual system.
Perhaps one of the most basic ways of being in a situation where we have incomplete information is the case of visual occlusion. So that chair over there, for example, I can only see a couple of pixels. It's heavily occluded by a lot of people.
And yet I know that that's a chair. So how are we able to make inferences to be able to complete patterns from minimal information? So we used, in this case, an experimental paradigm called bubbles.
Essentially, it's like looking at the world like this, OK? So if you have a lot of bubbles, it's relatively easy to recognize what the object is. In this case, this is a toolbox seen with 20 bubbles.
If you have only four bubbles, the one at the bottom is pretty hard to recognize. So we can titrate based on the amount of visibility. We can titrate the difficulty of the visual recognition task.
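To make the manipulation concrete, here is a small sketch of that kind of bubbles occlusion: the image is visible only through a few Gaussian apertures, and fewer bubbles means less of the object is visible. The aperture size and counts here are illustrative choices, not the exact parameters of the experiment.

```python
# Generate a "bubbles" mask: the image is visible only through a few Gaussian
# apertures; multiply an image by the mask to get the occluded stimulus.
import numpy as np

def bubble_mask(height, width, n_bubbles=4, sigma=10.0, seed=None):
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width))
    for _ in range(n_bubbles):
        cy, cx = rng.integers(0, height), rng.integers(0, width)
        mask += np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
    return np.clip(mask, 0.0, 1.0)

# easy = image * bubble_mask(*image.shape[:2], n_bubbles=20)   # most of the object visible
# hard = image * bubble_mask(*image.shape[:2], n_bubbles=4)    # heavily occluded
```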
So I'm going to ask how well can humans recognize objects when they are heavily occluded. And then I'm going to ask how well can computational models do this and what happens inside the brain while subjects are performing pattern completion for heavily occluded objects. So let me start with the basic psychophysics. This is behavioral data in human subjects. What you're seeing here is performance as a function of the amount of visibility in the objects.
So if the object is fully visible, this is a very simple task. Subjects were performing a five-way categorization task. And people are essentially at ceiling. That's the point that you see there on the upper right.
As you start reducing the amount of visibility, the task becomes harder and harder. Chance is 20%. But what's quite remarkable here is that subjects are quite robust, and they are at about 60 or more, 60% or better performance, even up to 10% visibility. So you have only a very small fraction of the object is visible. And yet, subjects can recognize very well.
The colors are not too relevant here. They will become relevant very soon. The colors here represent the amount of exposure of the picture, for how long the picture was on the screen.
OK, let me keep these. So let me now show you what happens when we use a technique that's called backward masking. Backward masking involves presenting a picture and then very briefly after showing that picture introducing a noise pattern.
This has been purported to interrupt processing in the visual system; people have claimed that it's a way of studying-- largely, but probably not perfectly-- the bottom up pathway without the interaction between incoming signals and feedback signals. So the idea is that you present the picture, and after a few milliseconds, you show one of these mask noise patterns.
And the argument is that because you are interrupting processing with this noise image, you basically interrupt the interaction between the incoming visual signals and any potential feedback signals. I have to confess that I'm not completely convinced that this is true. We can debate about the mechanisms of what's actually happening during backward masking, and I suspect that they are far more complicated than the short story I just alluded to.
But at least to a first approximation, it's clear from lots of behavioral psychophysics experiments, going back almost a century, that backward masking can severely disrupt visual processing. And I'm going to show you behavioral evidence that that's the case. So here is the same exact experiment, except that briefly after showing the picture, we introduce this backward mask.
So the backward mask is introduced either 25 milliseconds, 50 milliseconds, all the way up to 150 milliseconds after the onset of the picture. That's what the different colors represent. So if you have a lot of processing time, if you have about 150 milliseconds of processing of the image, basically nothing happens. That is, the performance under the backward masking condition is essentially identical to what I showed you before without masking.
However, particularly in the 25 millisecond condition but also in some of the other conditions when you have a very brief exposure to the image, backward masking severely disrupts visual recognition performance. To the extent that backward masking is indeed interrupting some sort of feedback computation, we argue that this backward mask is pointing to the notion that we need additional computations in order to be able to perform this visual pattern completion task.
OK, so that's behavior. So now let me give you a very quick glimpse of what happens in the human brain when subjects are doing this visual recognition task with heavily occluded images. The way that we can record activity in the human brain, as I already mentioned earlier and Christof also mentioned in one of his talks, is by virtue of working with patients that have pharmacologically intractable epilepsy, meaning that none of the known drugs work for these patients. What the neurosurgeons end up doing is inserting electrodes inside the brain in order to map where the seizures are coming from, and also to functionally map different parts of the brain, so that they can resect the parts that are responsible for the seizures. So removal of the epileptogenic focus is one of the main ways of treating this type of pharmacologically resistant epilepsy.
So we have a patient that stays in the hospital for about one week with electrodes implanted. I showed you earlier one example of the very few cases where we can record action potentials from the human brain. The case that I'm showing now is one where we're recording intracranial field potential signals.
So what are these field potential signals? Well, nobody really knows. They're some sort of conglomerate activity of a large number of neurons in the vicinity of the electrode, probably spanning on the order of 2 to 5 millimeters.
We have millisecond resolution, high signal to noise ratio. But we don't have the ability to identify individual neurons here. So what you're seeing here is an intracranial field potential response to this picture.
There are 39 repetitions. The x-axis is time. The dashed line corresponds to the onset of the picture.
And you can see that we get highly reproducible responses to this picture. The gray lines are the individual trials. The green is the average of those 39 responses.
So what happens now if we show some of these heavily occluded images? These are four examples, four single trials with heavily occluded images. And you can see that there is a considerable amount of variability from one response to the next. But we can still elicit visually selective responses, despite the fact that these are cases where the subject only saw about 10% to 15% of the picture.
The numbers shown there are the times in milliseconds of the peak of the response. So you can see that the peak of the response to the fully visible picture was at 150 milliseconds. In the other cases, all the response peaks were later. A skeptical scientist might ask whether I'm just selecting a couple of trials.
So let me just show you more trials. So these are all the responses that we got for a single electrode in this particular patient in this particular session. And what I'm showing you in this case are all the responses to five different pictures of faces.
And you can see a pretty consistent response. The color corresponds to the intracranial field potential. The scale is shown up there. So sometime before 200 milliseconds, there is a negative voltage that's similar to what was shown here.
So it's a negative voltage starting somewhere before 200 milliseconds, followed by a positive voltage-- that's the red-- somewhere after 200 milliseconds. So this is very consistent across trials. As a matter of fact, in some senses that are still hard to quantify, these are much more consistent than the responses of individual neurons that I alluded to earlier.
So what happens when we present the partially visible images? We still have selective responses. We still see those blue and red traces.
They look poorly aligned in this case because every picture is different. The positions of those bubbles are random. So that's why we get a lot of variability.
If we fix the positions of the bubbles, we get a little bit more consistency-- not as much as with the fully visible responses. But interestingly, all the physiological responses that we obtain in response to the partially occluded images are significantly delayed with respect to the fully visible ones. In other words, at the physiological level, it costs about 50 milliseconds-- it takes about 50 milliseconds extra-- to be able to make this type of visual inference.
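Just to illustrate the kind of latency comparison being made-- the actual analysis pipeline may differ-- one could estimate the peak time of the trial-averaged field potential in each condition along these lines:

```python
import numpy as np

def peak_latency_ms(trials, t_ms, window=(50, 400)):
    """Return the time (ms after picture onset) of the largest-magnitude
    deflection of the trial-averaged field potential within an analysis window.
    trials: (n_trials, n_samples) array; t_ms: (n_samples,) time axis, 0 = onset."""
    avg = trials.mean(axis=0)
    in_win = (t_ms >= window[0]) & (t_ms <= window[1])
    return float(t_ms[in_win][np.argmax(np.abs(avg[in_win]))])

# Hypothetical usage with arrays of recorded trials:
# delay = peak_latency_ms(occluded_trials, t_ms) - peak_latency_ms(visible_trials, t_ms)
# The delay reported here is on the order of ~50 ms for the occluded condition.
```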
So coupling the behavioral results from the backward masking experiment with the physiological responses, we conjecture that we need additional computations. We need a little bit more time-- a few extra milliseconds-- to be able to do pattern inference, to be able to recognize objects that are heavily occluded. Indeed, if you show exactly the same images to your favorite deep convolutional neural network, we find that it does a relatively poor job of recognizing those images. So this is the same format as before-- performance as a function of the amount of visibility.
Each color here corresponds to a different layer or a different deep convolutional network architecture. And essentially, all of these networks would perform extremely well if you had fully visible objects. But when you have a very heavy occlusion, human performance is significantly better than any of these networks.
For the aficionados, these are networks that are pretrained on ImageNet. We're not doing any fine tuning, any training at all with our pictures. We don't want to train with our pictures. We want to be able to extrapolate from training with completely different pictures.
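Again for the aficionados, here is a minimal sketch of that kind of evaluation, assuming a torchvision AlexNet pretrained on ImageNet and a simple linear readout on FC7 features. The file name, the choice of readout, and the exact preprocessing are illustrative assumptions, not the published pipeline.

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# AlexNet pretrained on ImageNet (torchvision >= 0.13); no fine tuning on the occlusion stimuli.
alexnet = models.alexnet(weights="IMAGENET1K_V1").eval()

# Everything up to (but not including) the final classification layer -> FC7 (4096-d).
fc7_extractor = nn.Sequential(
    alexnet.features, alexnet.avgpool, nn.Flatten(),
    *list(alexnet.classifier.children())[:-1],
)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open("occluded_example.png").convert("RGB")   # hypothetical stimulus file
with torch.no_grad():
    fc7 = fc7_extractor(preprocess(img).unsqueeze(0))       # shape: (1, 4096)
# A linear readout (e.g., an SVM fit only on fully visible images) would then be
# applied to fc7 for the five-way categorization of the occluded pictures.
```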
So based on the behavior, based on the physiology, we conjecture that the addition of recurrent computations-- that is, the addition of additional computational steps-- would be useful to be able to do pattern completion. And we implemented this in an extremely simple way. We took, in this case, AlexNet. We did this also for many other networks.
We took the top level of one of these computational architectures, and we added horizontal connections between all possible units in that network. Those units were trained using what we call a Hopfield attractor network, meaning that these computational units have symmetric weights. And those weights were dictated purely by the fully visible objects.
So we're never training our system with the partially visible images. And that's the network that we refer to as RMNH. And we see that that already gave us a small but quite significant boost in performance, without any free parameters.
There are zero free parameters. There are zero knobs here. We're not tweaking, training, learning anything here.
Just by adding horizontal connections in a single layer of AlexNet, we can get better performance. And then if we play the games that many people like to play with training networks, that's the version that we called RMNH5 where we allow ourselves to train the network with those occluded images, of course always using cross validation. And in that case, we can match human performance just by adding this recurrent connectivity in the network. OK, let me-- are there any questions about this? Yes.
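To make that idea concrete, here is a minimal sketch-- not the actual RMNH code-- under simplifying assumptions: FC7 activations binarized to +/-1 codes, Hopfield weights set by the standard Hebbian outer-product rule on the fully visible images only, and asynchronous sign updates. The binarization scheme and update schedule here are assumptions.

```python
import numpy as np

def hopfield_weights(patterns):
    """Hebbian outer-product rule on +/-1 codes of the fully visible images.
    Symmetric weights, zero diagonal, no free parameters to train."""
    p = np.asarray(patterns, dtype=float)          # (n_patterns, n_units)
    W = p.T @ p / p.shape[0]
    np.fill_diagonal(W, 0.0)
    return W

def complete(state, W, n_sweeps=10, seed=0):
    """Asynchronous sign updates; with symmetric weights these cannot increase
    the Hopfield energy, so the state settles toward a stored attractor."""
    rng = np.random.default_rng(seed)
    s = np.where(np.asarray(state, dtype=float) >= 0, 1.0, -1.0)
    for _ in range(n_sweeps):
        for i in rng.permutation(s.size):
            s[i] = 1.0 if W[i] @ s >= 0 else -1.0
    return s

# Hypothetical usage: binarize FC7 activations (e.g., threshold each unit at its mean),
# store the fully visible images' codes as attractors, then let an occluded image's
# code settle before the five-way readout.
# W = hopfield_weights(visible_fc7_pm1)
# completed = complete(occluded_fc7_pm1, W)
```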
AUDIENCE: What layer did you select from that? And is there any kind of-- what motivated that choice in the network?
GABRIEL KRIEMAN: So everything that's shown here is with the FC7 layer if that tells you anything. OK, so AlexNet has eight layers. FC stands for Fully Connected and the seven stands for layer seven.
The next layer, which is called FC8, is the classification layer. So that's trained on ImageNet, for example, to categorize 1,000 things. So there are 1,000 units in FC8.
One is the honeycomb unit. Another one is the chair unit and so on. So this is just one layer before that.
We actually used-- if you look at the figures and supplements, we looked at many different layers and many different options, many different architectures as well. I'm just showing you one example of that in here. There's variation between layers, between different architectures.
Trust me. I'm a scientist, as Christof said. The basic conclusions, I think, are independent of whether you choose ResNet or AlexNet, and they are somewhat dependent on the layer. But ultimately, I think this is just a proof of principle toy example.
What we really probably should be doing here is adding horizontal connections, for example, in every possible layer, which is the biologically more relevant way of doing this.
Incidentally, Cole, who's sitting right behind you, has, I think, a beautiful story that came out recently in Nature Neuroscience, which I think is very related to this, where they've shown that adding essentially recurrent computations throughout all the different layers in these networks can help, and not only [INAUDIBLE] the conditions of visual occlusion. This is just one example of many hard visual recognition problems; essentially, in all kinds of visual recognition scenarios where feed forward networks fail, you can add these recurrent connections, train them, and get much better performance.
AUDIENCE: [INAUDIBLE] one [INAUDIBLE]
GABRIEL KRIEMAN: OK, sorry, sorry. OK, so Cole and then Martin and [INAUDIBLE].
GABRIEL KRIEMAN: Jonas, OK. All right. Yes, Adam.
AUDIENCE: [INAUDIBLE]. But [INAUDIBLE] the only network that was trained on the images?
GABRIEL KRIEMAN: Of the ones shown in here, yes. I mean, we can debate about how humans are trained and how humans-- but yes.
AUDIENCE: No, I'm just wondering if it's related to the horizontal connections [INAUDIBLE] try training another network that doesn't have your [INAUDIBLE] connection [INAUDIBLE] images in a single [INAUDIBLE].
GABRIEL KRIEMAN: Indeed. So that's a very good point. So here's one thing. Let me see if I have a picture of that.
Yeah, so in some sense, there's nothing completely magical about recurrent connections. As a matter of fact, you can do what people call unfolding. You can take a network that has horizontal connections, and you can actually convert that network into a purely feed forward one by doing what people call weight sharing.
So I just take that network over there, and I just inject multiple additional feed forward layers with the same weights. So that tells you that at least there's a proof of existence that there has to be a feed forward network that can do exactly the same type of computation. As you pointed out, we can take a feed forward network. We can train it with the occluded objects, and we can actually get similar performance as well.
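A minimal sketch of that unfolding argument, with hypothetical layer sizes: running a layer with lateral connections for a fixed number of time steps is the same computation as stacking that many feedforward copies that all share the same weights.

```python
import torch
import torch.nn as nn

class UnrolledLateralLayer(nn.Module):
    """Layer with horizontal (lateral) connections run for n_steps time steps.
    Unrolling the loop is equivalent to a feedforward stack of n_steps layers
    that all share the same weight matrices (weight sharing)."""
    def __init__(self, dim=4096, n_steps=4):
        super().__init__()
        self.bottom_up = nn.Linear(dim, dim)   # feedforward input weights
        self.lateral = nn.Linear(dim, dim)     # horizontal/recurrent weights
        self.n_steps = n_steps

    def forward(self, x):
        h = torch.relu(self.bottom_up(x))
        for _ in range(self.n_steps):          # the same weights are reused at every step
            h = torch.relu(self.bottom_up(x) + self.lateral(h))
        return h

# layer = UnrolledLateralLayer(dim=128, n_steps=3)
# y = layer(torch.randn(1, 128))
```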
Why do we care about recurrent computations? So I like to argue for recurrent computations for a couple of reasons. One of them may be less interesting to you and to other people, and that is the sheer economy of how many units you have and how many weights you need to train.
So networks that have horizontal connections are much more economical in the sense of how much energy you expend, the size of the brain, how many units you need, how many synapses you need, and so on. So there may be a lot of energetic considerations why it may be preferable to use horizontal connections as opposed to keep expanding enormous feed forward networks. Now that may not be super interesting.
If you are Mr. Google, you have infinite power. You don't care about the environment. You can keep making enormous networks, and you don't care about saving units perhaps. That's a fair point.
So I'd like to make another argument, which I think is interesting, about doing things that way. And that has to do with the flexibility of the computations that you can make. If you have a purely feed forward structure and you train that type of structure for a particular task, you're stuck with that architecture. Whereas in this case, you can flexibly route information and do as many computations as needed on a per case basis, depending on the task demands.
So there may be certain pictures that are very easy to recognize-- for example, the fully visible pictures that I showed. In which case, you don't need to worry about all of those recurrent loops. Maybe you can just go through the entire system very, very, very fast, and you can solve that problem, let's say, in 100, 150 milliseconds. And that's what the physiology sort of alluded to. That's what the backward masking experiments also alluded to.
But if you have a more complex task, maybe you need to ruminate and you need additional computations. And the same architecture, without any retraining, without anything, can flexibly be used to solve the problem one way or the other. Whereas with the feed forward version, at least in this form, you're stuck with that architecture. You always have to go through all the layers in order to solve that problem.
An alternative way to solve the problem computationally, and this has been used in certain architectures such as ResNet, is to have what people call bypass connections. So in principle, there's no reason why, in this diagram, you couldn't connect layer i minus 1 directly to layer i plus 1, bypassing all the intervening ones. As a matter of fact, you could create a network with 100 layers and connect them all to all if you wanted to. But then I would argue that we're starting to resemble that kind of scenario.
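A minimal sketch of that alternative, with a hypothetical layer size: layer i minus 1 reaches layer i plus 1 both through layer i and through a direct bypass (skip) connection, in the spirit of ResNet.

```python
import torch
import torch.nn as nn

class BypassBlock(nn.Module):
    """Layer i-1 feeds layer i+1 both through layer i and via a direct bypass."""
    def __init__(self, dim=512):
        super().__init__()
        self.layer_i = nn.Linear(dim, dim)

    def forward(self, x_prev):                      # activity of layer i-1
        through = torch.relu(self.layer_i(x_prev))  # through the intervening layer
        return through + x_prev                     # plus the direct bypass connection

# block = BypassBlock(dim=64)
# y = block(torch.randn(1, 64))
```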
AUDIENCE: [INAUDIBLE] what I think is interesting here is [INAUDIBLE] if you train the network with the same number of parameters [INAUDIBLE] horizontal connections, it is possible that there is some memorization inside the network, and the horizontal connections are not even being used.
GABRIEL KRIEMAN: You're right. So if you trained networks with the same number-- well, the number of parameters is actually smaller in the recurrent network in this case. If you train AlexNet with fewer weights, or the same number of weights as in the recurrent case-- which I think is what you're asking-- you get slightly worse performance, but you still get a pretty significant boost as long as you're training with those occluded objects.
So here, the argument-- to look deeper into this question, we did another experiment-- I was not planning to show that-- which has to do with the following. We used completely novel objects that humans had never seen before and showed that humans can actually recognize occluded versions of those objects from just a single exposure as well. So again, I'm a bit skeptical that in order to recognize an occluded chair, you need to have seen occluded chairs before.
So all this business of training with occluded images seems, in principle at least, strange to me-- the idea that you need training with specifically that same class of objects in an occluded version. But if you want to, you can actually get very good performance if you train that way-- partly, I think, because these networks have so many free parameters that even with heavy occlusion, with the number of categories that we use, you're still able to solve the problem. As another point, that kind of approach breaks down again when we expand the number of categories enormously.
The kind of exercise that we play with here is pretty small by computer vision standards. I didn't quite go into details, but we have five categories and five pictures per category, so 25 pictures. By computer vision standards, this is very small. Now if you think of ImageNet and you include all the pictures in ImageNet, then you run very quickly into trouble if you're trying to train a purely feed forward network. So you can solve this small problem in that way, with this very artificial version of training with the same kind of pictures when they are occluded.
AUDIENCE: So when you're [INAUDIBLE] like the main [INAUDIBLE] or even like the [INAUDIBLE] designed to do that kind of associated memory and partial [INAUDIBLE]. So perhaps the way to check that would be-- so we could add the [INAUDIBLE] when you expect [INAUDIBLE] association.
GABRIEL KRIEMAN: Indeed. We chose the Hopfield attractor network because we knew you can mathematically show that a Hopfield attractor network can do error correction.
GABRIEL KRIEMAN: So it was not a-- perhaps I should have been more clear about that. We didn't choose randomly a Hopfield network. We know that we can mathematically show that under some conditions, if the weights are symmetric and whatnot, that has to converge to the representation of the whole object.
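For reference, the standard convergence argument goes through the Hopfield energy function: with symmetric weights, zero self-connections, and asynchronous threshold updates, every update can only lower (or leave unchanged) the energy, so the dynamics must settle into a stored attractor, which is the sense in which the network performs error correction.

$$
E(\mathbf{s}) = -\tfrac{1}{2}\sum_{i \neq j} w_{ij}\, s_i s_j,
\qquad
s_i \leftarrow \operatorname{sign}\!\Big(\sum_{j} w_{ij}\, s_j\Big),
\qquad
w_{ij} = w_{ji},\; w_{ii} = 0.
$$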
I am not sure what you mean by-- at least I have to think a bit more. And you can show me later how I can build a feed forward network with an attractor-based Hopfield rule. I'm not sure how to.
Maybe-- that sounds a bit strange to me, but maybe you have a way. I agree that if you have a way to do that, that would be an interesting comparison. I don't think there's anything magical about the Hopfield rule.
What I like about it is that we have zero free parameters. We don't have to do any training, and it works. But as I showed, you can also inject horizontal connections, train them, and get similar results.
AUDIENCE: Then you don't get the [INAUDIBLE].
GABRIEL KRIEMAN: And then if you actually look at the weights, first of all, they're not symmetric. And second, it was very unclear to us that there's any kind of attractor-based dynamics in networks when that-- yeah. Yes.
AUDIENCE: I have a big picture question. So in what sense do you think the problems people are working on are really helping us solve the original task [INAUDIBLE], which is understanding that picture of [INAUDIBLE]? So surely at some point, you've got to-- even if the representations themselves can't be described with language, the point of that task is to label that image in the way that we would as a [INAUDIBLE].
GABRIEL KRIEMAN: I think it's a fair question. I have to confess I'm extremely far from being able to understand Obama's picture. I'm arguing and I'm postulating, if you will, that there are certain routines that we need in order to do visual understanding. And we're studying each one of those steps independently.
I have an almost-- the intuition or the working hypothesis here is that as we get better and better at solving all of these tasks that we know happen in the brain, and for which we have so much physiology and behavior, ultimately we're going in the right direction. But you're absolutely right. I'm not telling you-- I use that mostly as motivation. I won't tell you anything about how people think about their weight or how people understand humor.
Actually, in one of the previous summer courses, I proposed this as a project. I have lots of graphical humor pictures. I think that's an impossible project. I don't think that there's any chance right now that we can get a system to that.
And these are very, very, very tiny steps towards that, and I'm happy to discuss more about how we're going to get there. Any other questions? OK, I'm just going to give the title of the other thing that I wanted to mention just in case anyone is interested in these kind of issues. I'm going to skip most of the slides here just to conclude very quickly.
The other topic that I wanted to mention briefly has to do with eye movements. And the reason I wanted to bring up eye movements is because I think this is one of the next things that happens when you're understanding a scene. Very quickly, you have a glimpse of what's happening within, let's say, some radius around your center of fixation.
And very quickly, you are moving your eyes in the image. It takes somewhere between about 200 and 300 milliseconds for the first [INAUDIBLE], for the first eye movement, under most of these natural conditions. So we've been spending quite some time trying to understand and computationally model the mechanisms that dictate where and how you're going to move your eyes within a scene, which, again, we think is a very tiny component of the long-term goal of being able to understand these types of scenes.
So I'm going to skip that. I just want to mention that we did a visual search task where we just flashed this picture very quickly. If you're interested in eye movements as one of the next steps in our attempts to understand how we integrate different parts of an image and what happens during that image understanding, you can scan that QR code over there. You can download the code.
You can download the data and see the primitive initial steps that we have made toward trying to predict, in this case, how humans move their eyes while they're doing a visual search task. So I'm not going to say much more about this. If anyone is interested in visual search, come talk to me. And I think I'll just stop there, and if there are any other questions, I'll take them.