What are the computations underlying primate vs. machine vision?
August 25, 2021
Thomas Serre, Brown University
Brains, Minds and Machines Summer Course 2021
THOMAS SERRE: Today I want to tell you about, or give you my view on, neural computations, computer vision, and what I think is important today in the state of computer vision and visual neuroscience. Because I literally just drove here, my understanding of what you've heard about comes from a preliminary schedule, so I'm guessing a bit here. I'm assuming you've heard about computational models of vision, so you've probably seen this diagram or figure many, many times. A lot of what we know about vision as of today is the anatomy and physiology of the ventral stream of the visual cortex, a sequence of visual areas that are almost always reciprocally connected. And the key computational models that have been developed to try to formalize our understanding of vision take the form of these hierarchical feedforward models.
And again, I'm assuming that you've heard a lot about those, and I don't want to repeat any of that, but I'm sure you've seen this figure several times. We're at the point where progress in computer vision, AI, and deep learning has led to a tremendous increase in the ability of computational models to fit neural data. The bottom line is that progress has been significant. We now have computational models of vision that have been trained for image categorization tasks and that fit neural data much better than older models. And I'm sure Gabriel mentioned prior work done in Tommy Poggio's lab on the HMAX model.
Until very recently, convolutional neural networks were the state of the art in fitting computational neuroscience data related to the ventral stream. In the past couple of years, a bit of a course correction has taken place. Again, work in Jim DiCarlo's lab has been influential in this area. The addition of feedback and recurrent mechanisms to those hierarchical feedforward models, or convolutional neural networks, has led to a further increase in the ability of these computational models to fit neural data. I'm sure Jim has told you about this.
I think we're at the level where it's pretty clear that some form of feedback or recurrent connections will be needed if we want to understand and model the visual system. My own view on the topic is that as of now we know that feedback is necessary. I don't think anybody has a very good understanding of what feedback does precisely in our visual system, OK.
To add to this point, I would say that it is somewhat frustrating to me that the state of the art in computer vision, on almost any kind of real-world benchmark, is dominated by feedforward neural networks. We know that the brain leverages feedback. We know that recurrent models of vision fit neural data better. We know that these feedback connections are there. It's surprising that so much of modern computer vision progress, and the engineering insight that has been gained, has revolved around purely feedforward neural networks.
So I'll try to change your view on this. I'll start by suggesting that things could be different, that feedforward neural networks are indeed limited in their ability to solve visual problems. My view is that part of the reason why they are still dominating the field is the set of tasks that people have chosen to focus on in computer vision. Much of the field has been focusing on image categorization and action recognition, which are in some ways template-matching kinds of tasks. If we think about visual reasoning more broadly, and the myriad of tasks that primates, and humans in particular, can solve, those tasks go far beyond image categorization. There is much more to vision than just shape and object recognition, which has been the focus, I would say, of much of modern computer vision at least until now.
And then the point that I'm going to try to make throughout this lecture is that when we consider more complex visual reasoning tasks beyond image categorization -- and I'll try to illustrate and give you examples of what I mean by visual reasoning tasks -- it's pretty clear from the neuroscience that there are many additional operations and computations that take place, at least in primates. I'll allude to some of those mechanisms and computations, things such as attention, memory, et cetera -- the kinds of things that, at least until a few years ago, were not present in state-of-the-art convolutional neural networks. This is changing. If we look at what has been happening in the past two years in computer vision, we are starting to see a shift from purely convolutional neural networks towards architectures that incorporate some form of attention and some form of memory, OK.
So here are examples of basic visual reasoning tasks. I took this figure from Shimon Ullman's textbook from 1996, but the work is actually much older; this is work that Shimon did in the '80s. Did Shimon lecture this year? Maybe before COVID? No, OK. So you've missed Shimon, but this is very important and seminal work from the '80s. Here is a set of basic visual reasoning tasks that Shimon proposed.
Here's an example in panel A. Assume these are positive examples and these negative examples; you have to discover the rule. Here the rule would be whether those markers fall inside or outside of a closed contour shape. The task here could be whether those two shapes are the same or different. And you could imagine asking any learning agent, whether biological or machine, to solve this task over a variety of shapes, not just circles.
Here's another interesting task: figuring out whether dots fall on lines or not. So these would be examples of dots falling on a line, and dots falling next to a line. And here is perhaps a slightly more challenging task: figuring out a path to go from this marker, this black circle, to a cross. The learning algorithm would have to figure out that there is a way to go here, maybe through here, here, and then here. These are examples of tasks that, algorithmically speaking, are believed to require computations different from the kinds of image categorization tasks that I'm sure you've heard about in this course.
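As an aside, the inside/outside task has a crisp algorithmic solution, in the spirit of Ullman's visual routines. Here is a minimal sketch, under my own simplifying assumption that the closed contour is given as a polygon rather than as pixels, using the standard even-odd ray-crossing test:

```python
def point_in_polygon(x, y, poly):
    """Even-odd ray-crossing test: shoot a horizontal ray from (x, y)
    and count how many polygon edges it crosses; an odd count means
    the point lies inside the closed contour."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon(2, 2, square))  # True: marker inside the contour
print(point_in_polygon(5, 2, square))  # False: marker outside
```

The point of writing it out is that the rule is a serial procedure over the image, not a fixed template, which is exactly what makes it awkward for a purely feedforward network.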
All right, so here are a couple of take-home messages that I would like to convey today. If you forget everything else I said, these are the few things I would like you to remember. First, I think we have a tendency in computer vision, because we spend a lot of time trying to beat benchmarks and tweaking architectures to improve accuracy on very specific data sets, to focus on architectures that solve particular data sets as opposed to solving visual tasks. I'll give you a few examples of those. And I will argue that if we care about solving Vision with a capital V, if we care about understanding how our visual systems solve different kinds of visual reasoning tasks, then we will need architectures that incorporate feedback mechanisms.
So what I'll try to do in the rest of this lecture is give you an overview of the work we have done in my lab in the past five or six years. Really, what I think we've been doing is computer vision from the perspective of cognitive psychologists. A lot of our work has been to think up different kinds of visual reasoning tasks, for the most part inspired by older work in cognitive psychology -- things we knew require computations in primates that are not available in convolutional neural networks, things we knew require feedback, attention, memory, and so on -- in order to demonstrate the limitations of the state of the art in computer vision, which, as I said, is mostly feedforward neural networks.
So I'll try to give you an overview of that work and illustrate how, as neuroscientists or cognitive psychologists, you have something to contribute to the debate. Then I'll try to show you how RNNs, and in particular a breed of RNNs that we developed in the lab, grounded in the anatomy and physiology of the visual cortex, are indeed able to address some of the limitations we were able to identify in those feedforward neural networks.
All right, but before I get into the meat of this topic I want to open with a little bit of a preamble. I'm assuming you all know by now that modern deep neural networks -- including, in particular, perceptrons, convolutional neural networks, and transformer networks -- are all universal approximators. That means that with enough hidden units and enough training examples, they can learn to approximate any arbitrary input-to-output mapping. That could be an association between natural images, or pictures of objects, and class labels. It could be positive and negative examples from some of the visual reasoning problems I illustrated earlier. Again, with enough units and enough training examples, there is nothing, in theory, that these neural networks cannot learn to solve.
So the question is never whether different architectures can solve various kinds of problems; the question is whether they can solve them efficiently. But before I tell you more about this, I wanted to illustrate what I meant by this universal approximator point. Here is a very influential paper from a few years ago from Ben Recht's group in Berkeley. What the group wanted to show was exactly that: there is a very significant possibility that a lot of what's happening in computer vision involves the rote memorization of associations between images and class labels.
And so what they did is they came up with a very simple scheme, a way to approximate or estimate the capacity of modern neural networks, OK. They completely shuffled the class labels of ImageNet, so now you have random associations between images and class labels. You would have to learn that class one includes cats and dogs and different kinds of boats and different kinds of animals; all the classes now contain random bags of images. Any kind of visual statistical association between images and class labels has been, by construction, removed by the shuffling, which means that the only way for a neural network to solve this task, assuming that it can, is simply by rote memorization of images and class labels, OK.
What we know from statistical learning, as I said earlier, is that if you have enough units and enough training examples you can do that. Even for a very large data set like ImageNet, which contains over a million images, if you have enough capacity, enough free parameters, enough hidden units, in theory you can just store these random associations between images and class labels. And what you see here is essentially what they were able to do, just to demonstrate that it's possible, training what was at the time a representative convolutional neural network, AlexNet, which I'm sure you've heard about.
They showed that they could shuffle the images and class labels and, if they removed regularization and things of that sort, still classify perfectly on the training data. There's no way that this can work on the test set, because again these are just random associations. This is not evidence of very much by itself; it should just get you to think that there is a non-zero possibility that a lot of what we see in computer vision involves some form of rote memorization, OK. Again, these neural networks can rote-memorize associations between images and class labels for a million images or so.
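To make the rote-memorization point concrete, here is a toy stand-in for the shuffled-label experiment (this is my own illustration, not the actual ImageNet setup from the paper): a pure memorizer, here a 1-nearest-neighbor lookup, fits randomly labeled data perfectly on the training set and drops to chance on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the shuffled-label experiment: random "images"
# paired with randomly drawn class labels, so any statistical link
# between image content and label is destroyed by construction.
n_train, n_test, dim, n_classes = 200, 200, 32, 10
X_train = rng.normal(size=(n_train, dim))
y_train = rng.integers(0, n_classes, size=n_train)
X_test = rng.normal(size=(n_test, dim))
y_test = rng.integers(0, n_classes, size=n_test)

def predict_1nn(X_ref, y_ref, X):
    """A pure memorizer: answer with the label of the nearest stored example."""
    d = ((X[:, None, :] - X_ref[None, :, :]) ** 2).sum(axis=-1)
    return y_ref[d.argmin(axis=1)]

train_acc = (predict_1nn(X_train, y_train, X_train) == y_train).mean()
test_acc = (predict_1nn(X_train, y_train, X_test) == y_test).mean()
# train_acc is exactly 1.0 (each point is its own nearest neighbor);
# test_acc hovers around chance, 1 / n_classes = 0.1.
```

A sufficiently overparameterized network can behave the same way on shuffled ImageNet: perfect training accuracy purely by storage, with nothing transferable to the test set.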
All right, so the point that I'm trying to make here is that if you unshuffle those class labels, there is also a significant possibility that the same rote memorization mechanisms could help you address the bigger challenge of image categorization. In general, whenever I show you a visual reasoning task -- and in fact all of those tasks will involve synthetic examples that we've generated with computers, so we can generate millions of those images -- the question is never whether a CNN or a ResNet or a transformer network can learn to solve the task. They can, if you give them millions of training examples. The question is whether they can do it as efficiently as other architectures that are endowed with feedback mechanisms, attention mechanisms, memory, and so on and so forth.
All right, so here is what I think is really unique about primate vision, in stark contrast with the state of the art in computer vision. I know that you guys are getting a little younger every year, so I don't know how familiar you are with these faces. Actually, can you guess any of those? Yeah. Actually, you're not that young. Who is that? Woody Allen. You guys are good. I don't even-- this might be Seinfeld, I don't know this one. And this is Reagan, right? OK.
OK, think about it. You've seen those faces undegraded at some point in your life, but I'm quite sure that none of you, or at least most of you, have ever seen a face that was altered in such a weird way, right. Here, this one has been stretched in a way that you would never have experienced in real life. And yet, you didn't have any problem recognizing those faces. In fact, I would go as far as saying that not only did you recognize the faces underneath, but you were also able to tell that there was something wrong with those faces, and perhaps even guess the transformation that had been applied to them.
As you know, today there are computer vision algorithms for face recognition that perform on par with, or even slightly better than, human observers. In fact, there's a study published in PNAS a few years ago showing that state-of-the-art face recognition systems perform at the level of official forensic experts -- people who recognize faces as their day job. So this is very impressive, and yet I promise you that if you ask those systems, which outperform you and me on face recognition, to recognize faces that have been altered in any of those ways, they will fail pretty sharply, OK.
So one critical distinction between state-of-the-art feedforward CNNs, et cetera, for object and face recognition and us is our almost uncanny ability to deal with image degradations that we've potentially never seen before. Why is it different from CNNs? Well, CNNs can actually deal very, very well with image nuisances, image noise. There was a study published a few years ago comparing human subjects performing a rapid visual categorization task -- classifying images into 16 categories, with images flashed for something like 200 to 500 milliseconds -- against state-of-the-art convolutional neural networks. They found that, at least in this rapid categorization regime, the state of the art was able to outperform human subjects on this 16-way categorization task.
Not only that, but the authors of the study could add a dramatic amount of noise. What you see here is, I think, quite strong: first of all, the images have been grayscaled, and then a very significant level of what I believe is white noise has been applied to them. It's possible to train CNNs with such a level of noise, and if you do, the neural network can actually outperform human subjects.
So it's not like CNNs and their friends cannot deal with noise. They can be trained to deal with noise even better than humans. However -- and here's the dirty little secret -- if you now make a tiny change in the kind of noise applied between training and test -- for instance, if you test on the white noise we just saw but instead train with what I think is salt-and-pepper noise, just switching some of the pixels to black or white -- I'm sure at the distance where you are, those three images look pretty much identical. And yet this is enough to go from superhuman accuracy down to chance level.
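For reference, the two noise models mentioned here differ in a simple way, which is why a network trained on one can treat the other as out of distribution even when the images look alike to us. A minimal sketch (the parameter values are illustrative, not those of the study):

```python
import numpy as np

def add_white_noise(img, sigma=0.2, rng=np.random.default_rng(0)):
    """Additive Gaussian ("white") noise: every pixel is perturbed a little."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def add_salt_and_pepper(img, p=0.1, rng=np.random.default_rng(1)):
    """Salt-and-pepper noise: a fraction p of pixels is flipped to 0 or 1."""
    out = img.copy()
    mask = rng.uniform(size=img.shape) < p
    out[mask] = rng.integers(0, 2, size=int(mask.sum())).astype(float)
    return out

img = np.random.default_rng(2).uniform(size=(64, 64))  # stand-in grayscale image
noisy_white = add_white_noise(img)   # perturbs essentially every pixel
noisy_sp = add_salt_and_pepper(img)  # corrupts only about 10% of pixels
```

Pixel-wise, the two corruptions have very different statistics -- dense small perturbations versus sparse extreme ones -- so a model that memorized one noise distribution has no guarantee of handling the other.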
OK, so the point that I'm trying to make here is that I showed you that neural networks have the capacity to rote-memorize associations between images and class labels. In addition, they have the ability to completely memorize the noise model applied to images. And yet, if we make a small change in the kind of noise applied -- if we present those networks with types of noise that were not used to train them -- then their accuracy falls apart in ways to which human accuracy would be largely robust.
So here's a very good example as well, from Boris' group. They showed that you could take state-of-the-art convolutional neural networks -- you see the names of these different networks; the details don't really matter. The point is, they showed that there are a lot of low-level biases in ImageNet. Objects in ImageNet tend to appear at very similar positions, scales, contexts, viewpoints, and so on. I would go as far as saying that the dimensionality of the space of images in ImageNet is not quite as large as one would have thought initially, given the large size of the data set.
Nonetheless, they showed that if you take your favorite ImageNet-trained deep neural network and then test it on a data set that shares at least a subset of the object categories with ImageNet, but where, for instance, the context or the viewpoint of the objects has been altered, then the accuracy of all these systems drops quite significantly. In fact, I think they reported something like a 40 to 45% accuracy drop, OK.
So here is another kind of image degradation, I guess: presenting those neural networks with things they were not introduced to during training -- small changes in viewpoint, position, context, and things of that sort. The kinds of things that we, as human participants, would have been robust to are sufficient to yield a significant drop in the accuracy of these neural networks. So: relatively fragile accuracy in the face of image degradation.
And then I don't know if you guys have seen this video. This is something that I found on Twitter a few weeks ago. Has anyone seen this? This is taken from a Tesla. I'm sure you all know that Teslas are equipped with an "autopilot," quote unquote -- a system that, again, is largely based on convolutional neural networks and close extensions. These systems typically annotate every pixel in an image. They're used to detect lane markings, to make sure the car stays in its lane, and to detect pedestrians that could be crossing the street, bicycles, traffic lights, and so on. In this video, the car is not on autopilot; there is an actual human driver, but the person is filming the control panel of the car, which shows essentially the output of the car's recognition system. What you're going to see might not make sense at first, because you're just going to see the control panel, but wait a few seconds for the driver to point the cell phone at the road. I don't know if you can see.
So essentially, here the car is seeing traffic lights flying at it. And if you look at what's facing the car, there's just a truck that is carrying stuff -- I don't think these are even traffic lights; these are probably just speakers from a concert. The point that I'm trying to make here is that my critics doing computer vision would say, in all fairness, these cars have never seen a truck carrying these kinds of speakers.
But obviously, this is a serious issue, right. At the point where you want to deploy those algorithms, they need to be able to generalize to objects and images that were not present in their training data set. They need to be able to handle distractors that are very similar to target object categories. They need to be able to understand that something sitting on a truck is not flying at you, and you don't want those cars braking on the highway.
So I think this illustrates, I hope, why a lot of what I'm going to tell you is important. If we want to be able to deploy those computer vision systems in the world, we need some guarantees that they'll learn to generalize beyond the training data. And I would argue that today there is very limited evidence that those deep neural networks generalize far beyond their training data. There is at least a lot of evidence that as soon as they encounter out-of-distribution examples -- examples that are sufficiently different from the specific examples used for training -- their accuracy collapses. And that, to me, is perhaps the key difference between the ability of these algorithms and our own visual system.
So here's another example related to what I mean by this notion of abstraction and generalization. This is a representative data set called Sort-of-CLEVR, which is used to evaluate the ability of modern computer vision systems to solve the so-called problem of visual question answering. You have a small world made of shapes of different colors, and computer vision systems are asked to answer questions about what's going on in a computer-generated visual scene. For instance, in this scene, the kinds of questions given to the system could be: how many circles? How many squares? What is the object to the right of the green square? And so on.
And a few years ago, if you looked at the state of the art, you could already see that relatively simple architectures -- convolutional neural networks, or simple extensions of convolutional neural networks -- would do a very good job at answering the questions posed to them, at least on those simple kinds of artificial visual worlds.
Now, one of the points that my graduate students made very quickly is that part of the issue with the way those algorithms are evaluated is that, of course, different IID samples are used for training and test, but the test images always use shapes generated from the same finite dictionary of colors and shapes. So these algorithms have seen all possible geometric shapes, and they have seen all possible combinations of colors and shapes. The question becomes: if you have really learned to perform visual reasoning, you should be able to do it beyond the examples used during training. You should be able to see some amount of training examples and then generalize to a held-out set of examples.
So the experiment they ran to test whether these systems were at some point learning an abstract representation, an abstract ability to reason, was to hold out some combinations of shapes and colors. They didn't do a full sweep of visual reasoning tasks; they tested the ability of these algorithms to solve a particular kind of visual reasoning problem known as same-different. It's a very classic task that has been used in cognitive psychology for decades. The idea is that I could show you two shapes, or n shapes, on a screen, and ask you whether all the shapes are the same, or whether at least one is different. It's a very simple task; human subjects have no problem solving it. And of course, if you train a state-of-the-art convolutional neural network on that task in the kind of regime in which these networks are normally trained and tested, they learn the task, no problem, OK.
But here's the dirty little secret: now you keep one combination of shape and color out of the training data. So for instance, in this case, you would train on squares of all colors except maybe this blue color, OK. You've seen squares of all other colors, you've seen all other shapes that are blue, but you have never seen a blue square. You have seen all the other combinations, so there are still a lot of training examples.
The assumption is that if these algorithms are able to learn an abstract rule -- which is, again, very simple: whether two shapes are the same or different -- at some point they should be able to abstract away from the samples seen during training, which in this case means generalizing to blue squares. And sure enough, my students found that the networks could fit the training data perfectly, so there's no problem for these networks to learn, or give the appearance of learning, the rule. But if you hold out blue squares and test the network on them, the accuracy falls down to chance, OK. So these networks can learn associations between visual features and an abstract visual rule, but they don't really show any evidence of generalizing beyond the visual appearance of the samples used during training, OK.
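The held-out-combination protocol described here is easy to sketch. This is my own toy generator for symbolic same-different trials, not the lab's actual stimulus code; the key property is that one shape-color combination never appears during training:

```python
import itertools
import random

SHAPES = ["square", "circle", "triangle"]
COLORS = ["red", "green", "blue"]

def make_trial(rng, exclude=None):
    """Sample one same/different trial over (shape, color) objects,
    never drawing the held-out combination (e.g. a blue square)."""
    pool = [c for c in itertools.product(SHAPES, COLORS) if c != exclude]
    a = rng.choice(pool)
    if rng.random() < 0.5:
        return (a, a, "same")
    b = rng.choice([c for c in pool if c != a])
    return (a, b, "different")

rng = random.Random(0)
held_out = ("square", "blue")
train_set = [make_trial(rng, exclude=held_out) for _ in range(1000)]
# The held-out combination never appears anywhere in training...
assert all(held_out not in (a, b) for a, b, _ in train_set)
# ...yet at test time the network is asked about blue squares:
test_trial = (held_out, held_out, "same")
```

A learner that has truly abstracted the same-different rule should answer `test_trial` correctly despite never having seen a blue square; the finding described above is that feedforward networks trained this way do not.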
In comparison, there's a study that was published in Science in 2016, where they showed that they could train ducklings at birth to solve that task -- and not only that, they could train ducklings to solve it from a single training example. If you go back to this figure, we are talking about numbers of training examples in the thousands; half a million training examples is not enough for these networks to learn this abstract rule and generalize over a held-out combination of shape and color. These authors found that from a single training example at birth, simply by imprinting ducklings with a pair of shapes, the ducklings could from there on, in a [INAUDIBLE] choice, generalize beyond either the color or the shapes used during training.
The point that I'm trying to make here is more general: our visual system has the ability to learn all those tasks and to generalize beyond the training examples in a way that those neural networks fail to. So I think, as a vision scientist, my quest is to discover neural architectures that are able to learn all these tasks without requiring us to craft particular loss functions or particular kinds of operations and computations tailored for every one of these tasks.
And when you think about it, this is essentially how computer vision works today. Someone specializes in image categorization, someone else in visual question answering, action recognition, image segmentation, and so on. When we try a state-of-the-art architecture on a new problem, it doesn't really work, but then there's always an engineering feat that someone cleverly designs to alleviate the problem. At some point, as a vision scientist, I want to develop an architecture that is consistent with what we know about the anatomy and physiology of the visual system, but that is also able to learn any arbitrary task in the same way that a primate, human or non-human, would be able to learn and generalize.
So this is mostly introduction, but I hope you got the gist of what I was trying to convince you of. I'm going to switch gears a little bit and tell you about recurrent neural networks. At some point, the two threads will come together, and I'll try to show you that it's possible to design recurrent neural network architectures that address some of the limitations of the feedforward ones.
I'm assuming that everyone knows about feedforward and feedback mechanisms. You've heard about that at some point, right? OK. I'm not going to go into too much detail about RNNs; I'm hoping you've heard about them in some way. If you haven't, there's an excellent tutorial from Kanaka Rajan that she gave at Cosyne. It's on YouTube and it's really excellent, so if you want to learn more about RNNs, how they work, and the theory behind them, you should watch that. I think there are three or four lectures and they're all excellent. So this is Kanaka here. I'll skip ahead.
All right, so my focus in telling you about RNNs will be specifically on vision and the work we've done. Just to clarify before we get into the weeds, I'll be talking, interchangeably, about two kinds of feedback or recurrent processes. I'll refer to as horizontal connections any kind of connections between neurons within a layer of the cortex, or within a layer of your favorite convolutional neural network. So feedforward connections run from lower layers to higher layers; horizontal connections run within a layer of a CNN or of cortex. That means that as soon as we add horizontal connections, we're allowing units in our convolutional neural network to exchange information not just with higher-up neurons but also with neighbors, and potentially a whole neighborhood of neurons. And then I'll contrast those horizontal connections with top-down connections, which run from higher-level layers or visual areas back onto lower layers or visual areas.
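A minimal way to picture horizontal connections is a layer whose units are updated repeatedly, each mixing its feedforward drive with its neighbors' activity. This sketch (a 1-D layer with a made-up lateral kernel, not the lab's architecture) shows activity spreading from a single driven unit to its neighborhood over recurrent steps:

```python
import numpy as np

def horizontal_step(h, ff_drive, k_lateral, dt=0.5):
    """One recurrent update of a 1-D layer: each unit mixes its
    feedforward drive with activity from neighbors in the same layer."""
    pad = len(k_lateral) // 2
    padded = np.pad(h, pad)
    lateral = np.array([padded[i:i + len(k_lateral)] @ k_lateral
                        for i in range(len(h))])
    return h + dt * (-h + np.maximum(ff_drive + lateral, 0.0))

k = np.array([0.2, 0.0, 0.2])  # weak excitatory lateral kernel (illustrative)
ff = np.zeros(9)
ff[4] = 1.0                    # feedforward input drives only the center unit
h = np.zeros(9)
for _ in range(200):           # let the layer settle
    h = horizontal_step(h, ff, k)
# After settling, the driven unit's activity has spread to its neighbors,
# something a single feedforward pass through this layer could not do.
```

The same idea, with learned 2-D kernels and gating, is what recurrent-convolutional models of cortex implement at scale.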
So a few years ago we got interested in understanding one particular phenomenon that is known, from the neuroscience standpoint, to involve recurrent processes. How many of you have heard about extra-classical receptive fields in vision? Half of you have and half haven't, OK. What we call the classical receptive field in neuroscience is simply the part of the visual field that needs to be stimulated with the proper stuff -- bars of different orientations, faces, you name it, depending on the visual area -- for that particular neuron to respond. If you remember what you've heard earlier in this course, these receptive fields tend to be very small in early visual areas, maybe a fraction of a degree, and as you go to higher and higher visual areas, the receptive fields of neurons grow larger and larger, OK. So that's what we call the classical receptive field, or receptive field in short.
What we refer to in visual neuroscience as the extra-classical receptive field is a part of space that sits right outside of the classical receptive field -- usually something like an annulus around it. How do we know the difference between the classical and extra-classical receptive fields? The extra-classical receptive field, if stimulated alone, is typically not going to elicit any response from the neuron, OK. However, if you stimulate the classical receptive field, then what falls in the extra-classical receptive field can elicit a very strong modulation of the neural response -- to the point where you could present a bar at a particular orientation, maybe even the preferred orientation of the neuron, which would normally elicit a very vigorous response, and that same response could be completely suppressed with the proper stimulation of the extra-classical receptive field.
And in the same way that the extra-classical receptive field can sometimes reduce or suppress the response of the neuron, sometimes it's also possible to increase the neural response by presenting the right kind of stimulus in an area of space that sits right outside of the classical receptive field.
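One standard way these surround effects are modeled is divisive normalization, in the spirit of Carandini and Heeger. This toy function, with made-up parameters, reproduces the qualitative signature just described: the surround alone evokes no response but suppresses the response to a center stimulus (facilitation would need an additional additive surround term, which I omit here):

```python
def surround_response(center_drive, surround_drive, w=1.0, sigma=0.1):
    """Divisive-normalization sketch: the surround never drives the
    neuron by itself, but it divides down the response to a stimulus
    placed inside the classical receptive field."""
    return center_drive / (sigma + center_drive + w * surround_drive)

r_center = surround_response(1.0, 0.0)    # center alone: vigorous response
r_surround = surround_response(0.0, 1.0)  # surround alone: no response
r_both = surround_response(1.0, 1.0)      # center + surround: suppressed
```

The point of the division is that the surround modulates rather than drives, which is exactly the operational definition of the extra-classical receptive field given above.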
So we got interested in building computational neuroscience models of those phenomena. And our starting point was kind of a classic line of modeling that has been used in computational neuroscience since Cohen and others and maybe even earlier. I'm not going to get into the detail of this equation. The point is just to tell you that this uses a dynamical systems approach. Here the responses of neurons are modeled by two sets of differential equations: one governing the excitation that a neuron receives, the other the inhibition. And the idea here is to be able to model not just the instantaneous response of neurons, as you would in a perceptron or convolutional neural network, but also the evolution of those responses as a function of time as neurons are able to exchange information.
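To make this concrete, here is a minimal sketch of what Euler-integrating such a coupled excitation/inhibition pair looks like. The time constants, weights, and rectifying nonlinearity are illustrative placeholders, not the parameters of the actual model from the talk:

```python
import numpy as np

def simulate_ei(steps=200, dt=0.05, drive=1.0):
    """Euler-integrate a toy excitatory/inhibitory pair:
       tau_e * dE/dt = -E + f(drive - w_ie * I)
       tau_i * dI/dt = -I + f(w_ei * E)
    All constants here are illustrative, not fitted values."""
    f = lambda x: np.maximum(x, 0.0)   # rectifying nonlinearity
    tau_e, tau_i = 1.0, 2.0            # time constants
    w_ie, w_ei = 0.8, 0.6              # I->E and E->I coupling
    E, I = 0.0, 0.0
    trace = []
    for _ in range(steps):
        dE = (-E + f(drive - w_ie * I)) / tau_e
        dI = (-I + f(w_ei * E)) / tau_i
        E += dt * dE
        I += dt * dI
        trace.append(E)
    return np.array(trace)

r = simulate_ei()   # excitatory response settles toward a steady state
```

The point of writing it this way is that the response is not instantaneous: the excitatory trace rises, gets pushed back by the lagging inhibition, and relaxes to an equilibrium over time.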
In this case, initially we focused on modeling horizontal connections, so both within and across cortical columns. Something else maybe I need to tell you-- you have all heard about cortical columns, I'm assuming. So remember that your visual areas especially are organized in columns. At almost every location in space, there's a battery of neurons tuned to a battery of different templates, filters, sitting in your primary visual cortex. So at every location, you have maybe 1,000 or so neurons spanning a whole range of possible selectivities. So for instance, in the primary visual cortex, at every location in space you would have neurons spanning a range of orientations, color selectivity, motion, disparity, so on and so forth. So you can think of that as a battery of filters or features as we would see in computer vision. And then those channels get replicated at every position in space.
So here, whenever you see this kind of darker cylinder, this corresponds to those cortical columns. There is neuroscience work suggesting that these classical/extra-classical receptive field interactions are mostly driven by two kinds of connections: connections within cortical columns-- linking neurons that have overlapping receptive fields, so corresponding to the same spatial location but spanning a range of selectivities-- as well as connections across columns at different locations.
And so the point here is that we have models [INAUDIBLE] that could help us simulate these kinds of very complex recurrent interactions between those neurons. The million dollar question is to figure out how to wire them-- how to set the right pattern of connectivity both within and across cortical columns in order to explain neuroscience data, OK. And so these kernels of connections would be this big W here for excitation and this big W here for inhibition.
So without getting into too much detail, if you want to build an intuition about what those recurrent neural networks do, think about a standard feedforward model where you would measure a bunch of feature activations at all locations. Here this would correspond to only the very initial [INAUDIBLE] of the neural responses. Because those neurons are connected horizontally, over time those responses are going to change. They're going to evolve dynamically, reflecting interactions between these neurons.
And I would go as far as to say that if now you're trying to simulate with those models a full-field stimulus that covers the entire visual field, then in theory those neurons are able to exchange information locally at first, but this information can propagate over the entire visual field. So in theory, these kinds of recurrent neural network architectures can allow you to exchange information between two neurons that sit at two opposite sides of the same visual scene without the need for a deep hierarchy of feedforward connections, as would be the case with a standard CNN.
What we were able to do is go back to monkey electrophysiology, mammalian electrophysiology. There are a lot of phenomena that have been observed and attributed to these interactions between cortical columns. I'll spare you the details. I'll just tell you that we've done our homework. We took all the experiments we could find, and then I had one of my former graduate students, David Mely, essentially tweak by hand the pattern of synaptic connectivity so that, if presented with the same stimuli as those used for the actual physiology, the model would be able to reproduce those results.
And so here's how the model works, and then I'll speed up a little bit. Imagine that we are looking at the response of a cortical column centered here on this red patch. The assumption, for simplicity, is that we're in the primary visual cortex looking at a cortical column spanning all possible orientations. So you have your classical filter channels as found in a CNN. The idea here is that there are two kinds of interactions between these feature channels. Their responses are no longer independent; there are going to be both excitatory and inhibitory connections between all of those.
And then what I'll show you is that regarding the extra-classical receptive field, we have to consider two different parts of it: what we call the near extra-classical receptive field, which is this blue circle here-- we again are looking at cortical columns, and we will be talking about tuned excitation here in the near eCRF-- and what we'll be referring to as the far surround, the further parts in blue here of the eCRF. So three parts: red is the classical receptive field, the center cortical column. Then the near surround, which will be the set of columns immediately surrounding this central part. And then the blue columns a little bit further out for the far surround. And of course, whatever I'm going to be describing in this idealized case, imagine that the same thing is happening everywhere. So the neurons are able to exchange potentially very complex signals across a large region of space.
OK, so in order to account for the physiology data, we had to essentially bake three constraints into the model, which are as follows. We had to assume that the influence of this near extra-classical surround onto the center was excitatory. So stimulating that part will increase the probability that the neurons in the center fire. In the far surround, the opposite: we had to bake in some form of inhibition. And again, that means that whatever we put here potentially reduces the probability of neurons in the center firing. So it's a very specific pattern of connectivity.
For the surround we found that the connectivity between center and surround neurons was of the like-to-like kind. That means that two neurons at two nearby locations will be connected if and only if they share the same selectivity, OK. So I'm part of a cortical column. I'm vertically tuned. I'm looking at my neighbor here. I'm going to be connected to that friend if and only if that friend has the same preferred selectivity. If not, I don't care, OK. And so if this neighbor is close, in the near surround, the connection will be excitatory, so we'll help each other up. If it's past a critical distance, this will be inhibitory, so we'll push each other down.
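As an illustration of this wiring rule, here is a hypothetical way to build such a connectivity kernel: only same-selectivity (like-to-like) connections, excitatory in the near surround, inhibitory in the far surround. The sizes and unit weights are made up for illustration, not the fitted values:

```python
import numpy as np

def like_to_like_kernel(n_orient=8, radius=5, near=2):
    """Hypothetical center-surround connectivity kernel:
    neurons connect only to same-orientation neighbors (like-to-like),
    excitatory within `near` pixels of the center, inhibitory beyond
    that out to `radius`. Shapes and values are illustrative only."""
    size = 2 * radius + 1
    W = np.zeros((n_orient, n_orient, size, size))
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    dist = np.hypot(yy, xx)
    near_mask = (dist > 0) & (dist <= near)       # near surround
    far_mask = (dist > near) & (dist <= radius)   # far surround
    for o in range(n_orient):
        W[o, o][near_mask] = +1.0   # tuned excitation
        W[o, o][far_mask] = -1.0    # tuned inhibition
    return W

W = like_to_like_kernel()
```

Note the diagonal structure over the channel dimensions: cross-orientation entries stay zero, which is exactly the "I only talk to friends with my selectivity" rule.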
And then the last thing we had to do, which is somewhat interesting for computer vision: most of the work until recently in recurrent neural networks and CNNs involved mostly linear operations-- excitation and inhibition. There is a well-known form of nonlinear inhibition that was discovered decades ago, known as shunting or divisive inhibition. So this is a nonlinear form of inhibition. And we found that having not just linear inhibition but also this nonlinear inhibition was critical in order to account for the electrophysiology data.
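For intuition, here is a one-function sketch of divisive (shunting) inhibition, with an illustrative semi-saturation constant. The key property is that a fixed inhibitory input takes more away, in absolute terms, from strongly driven units than from weakly driven ones:

```python
import numpy as np

def divisive_inhibition(drive, inhibition, sigma=1.0):
    """Shunting/divisive inhibition: inhibitory input scales the
    response down multiplicatively rather than subtracting from it.
    `sigma` is a semi-saturation constant (illustrative value)."""
    return np.maximum(drive, 0.0) / (sigma + inhibition)

# Same inhibitory input, different drives: the strongly driven unit
# loses more activity in absolute terms than the weakly driven one.
strong = divisive_inhibition(1.0, 1.0)   # 1.0 / 2.0 = 0.5
weak = divisive_inhibition(0.2, 1.0)     # 0.2 / 2.0 = 0.1
```

This is the property used later in the tilt-illusion argument: for a fixed amount of inhibition, the higher a neuron's activity, the more it gets suppressed.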
Now, the magic of this model is that we constrained it so that it would explain the neurophysiology. And then when we tested it on visual illusions that are contextual in nature, we found that the model was able to account for human behavior over a host of contextual illusions. So here are examples of so-called contextual illusions. I don't know how many of you have heard about the tilt illusion. I should point out that these phenomena are scale and distance dependent.
So depending on your distance to the screen, you might hit the sweet spot or not. But assuming you're at the right distance-- and I always forget; I'm clearly too close-- when the angle between the center and the surround is small, the surround tends to repel the center. And so this center bar should look tilted to the left for you. Here the angle between the center and the surround increases, so the effect reverses: the center grating here should look tilted to the right. Don't expect too much-- we're talking about a few degrees. If you test subjects, you'll find that they exhibit a bias to the left or to the right by a few degrees, depending on the angle of difference between the center and the surround. How many of you see the effect here? OK, most of you. OK, good. All right.
So the model that I just described is able to account for that, and I'll show you a simulation of the model in a second to show you why that is. Here's another example of a contextual phenomenon. As I'm sure many of you know, color is highly context dependent. So I can show you the same gray patch, and depending on whether this patch is shown on a somewhat neutral gray, greenish, or pinkish background, the center gray here should look completely different.
Again, a lot of these effects are scale dependent, so I don't know how well you see it. And this is a completely uncalibrated projector, but this color here-- or what we call the hue-- here and here are identical. And if you had the right scale they should look completely different: one should look, you know, orange and the other one salmon. Those are all examples of contextual phenomena, things that are potentially explained by this model that I just illustrated.
So let me just give you an intuition for-- so, yes. Very good point. I should say that it's very clear that there is not a single pattern of horizontal connections. In fact, I'll show you-- and as you pointed out, this was an initial model where we really baked in the connectivity just to explain a set of electrophysiology results we had. And so for instance, that doesn't include collinearity effects. For this model we didn't take that into account. I'll show you, if we have time, some examples where we actually optimize the connections for different tasks-- trained on natural images, for instance, for contour detection. There we find a much greater variety of kernels for horizontal connections.
So I guess what I'm trying to say is that you're completely right. This is a model, so we focused on explaining one set of phenomena. The pattern of connections we baked in was sufficient for all the electrophysiology, but it's not accounting for exactly what you described-- those collinearity effects. But we've done follow-up work which does. And yeah, that's a good point.
All right, so let me give you a little bit of intuition for how this kind of recurrent neural network and the associated dynamics differ from what you would get from a bottom-up or feedforward model. So let's look at the tilt illusion in this case. Let's assume that we stick an electrode in the classical receptive field and one in the far surround, shown in blue here. And let's say that we are recording from an entire neural population. So for what you are seeing here-- there is no axis, but assume that this axis would be orientation. The height of those bars tells you the response of a neuron with that particular preferred orientation. So maybe this is 0 degrees, vertical. This is a max response. And then the responses of neurons to off orientations start to get reduced.
This is the response in the center. The response in the surround, because the surround is tilted to the right, is shifted toward the right, OK. So if you remember, I told you that the pattern of connectivity between the far surround and the center is like-to-like and inhibitory. That means that the connections work like this: this guy connects to this guy, this guy to this guy, et cetera, et cetera. And because this is inhibitory, those guys are going to push down the responses of those guys in the center.
And because the inhibition is nonlinear, the amount of inhibition that these guys get is proportional to their own activity, which means that, for a fixed amount of input inhibition, the higher those guys are, the more suppressed they will be. And so that means that this guy is going to get a big push down. This guy is also going to get a push down-- there is less inhibition, but its activity is even higher than the other one's.
And so here-- I know that I'm going pretty fast-- but here is a simulation of the model. At time t equals 0, the model starts. You see two peaks. Remember, red is the center, blue is the surround. You have these population responses-- I think we have 100 neurons within a column. The two population responses start with peaks at the correct orientations of the stimuli. But then over time, you see this red population response shift to the left. And that's because there is a lot of activity in the surround, and so the activity of the surround is essentially going to push all those guys down-- all the ones that are overlapping here, especially the ones that are very high, are going to get pushed down by a lot.
And so the effect, dynamically, is that it looks like this guy is moving to the left. And what you see here, what we show in darker color, is the readout we are getting. We're just training an ideal observer, a decoder, to try to predict the orientation in the center. And so what you should have seen is that at t equals 0, the decoder would have seen a vertical line, but then over time, as the population response shifts to the left, the decoder starts to see an orientation tilted to the left. [INAUDIBLE]. I might have gone too fast.
So you'll see here there is a little bit of dynamics going on. And then you see this guy slowly shifting to the left, the decoded orientation tilting to the left, and you get the percept of a tilt to the left. Because there is, proportionally, a small center and a large surround, there's an asymmetry between the activity in the center and the surround, so the effect is mostly from the surround to the center. At least with these simplified stimuli-- when you present natural scenes to a model like this, you get very complex dynamics which we're still trying to understand.
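Here is a toy re-creation of that kind of simulation: a center population divisively inhibited, like-to-like, by a surround population tuned 20 degrees away, with an argmax readout standing in for the ideal observer. All the tuning widths, gains, and time constants are invented for illustration, not the values of the actual model:

```python
import numpy as np

def tilt_repulsion(center_ori=0.0, surround_ori=20.0, n=100, steps=50, dt=0.1):
    """Toy tilt-illusion simulation: a center population is divisively
    inhibited, like-to-like (same preferred orientation index), by a
    surround population tuned to a nearby orientation. Illustrative
    constants only."""
    prefs = np.linspace(-90, 90, n, endpoint=False)
    tuning = lambda ori: np.exp(-0.5 * ((prefs - ori) / 15.0) ** 2)
    center = tuning(center_ori)
    surround = tuning(surround_ori)            # held fixed for simplicity
    for _ in range(steps):
        inhibition = 2.0 * surround            # like-to-like: same index
        target = tuning(center_ori) / (1.0 + inhibition)
        center = center + dt * (target - center)  # relax toward inhibited drive
    return prefs[np.argmax(center)]            # argmax readout ("decoder")

decoded = tilt_repulsion()   # decoded orientation is repelled below 0 degrees
```

Because the surround sits at +20 degrees, the divisive inhibition hits the right-hand flank of the center population hardest, so the population peak-- and therefore the decoded orientation-- drifts to the left, away from the surround, just as in the repulsion regime of the illusion.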
AUDIENCE: So this simulation is based on a static image?
THOMAS SERRE: Yes. So you're absolutely right. This is just literally the tilt illusion that I showed you two slides ago. But you're right-- if I show a dynamic stimulus, then there are two things riding on top of each other: the bottom-up activity that these neurons are receiving is going to change, and on top of it they're going to exchange information potentially reflecting the past. And so the future response is based on both what the neurons get bottom-up and also how they exchange information.
And this is a very difficult problem, and your friend here in the third row is working exactly on this problem-- how to train recurrent neural networks to deal with it. You need to rethink a little bit how backprop works. But anyway, that's a point of detail.
You know, first of all, if you're interested and I went too fast, I'll be around and we can talk more. I think the point that I'm trying to make here is not so much for you to remember the nitty-gritty details of the simulation. The point is for you to start appreciating the fact that as soon as you have recurrent connections, you can have very interesting dynamics. It's not like a CNN, where the only thing a neuron sees is what it is presented with in its receptive field. The context-- the broader context, the entire visual field-- can have a drastic impact on the response of those neurons, which again is consistent with our own perception, because everything from orientation perception to color perception is context dependent, and we know that, OK.
OK, so I have 24 minutes. I'll try to give you a sense of what kind of benefits can be gained computationally from these kinds of recurrent connections. So very quickly, as I mentioned earlier, we started with toy computational neuroscience models where the training would be done by graduate student descent, right. So you take a set of electrophysiology experiments and then try to quantitatively reproduce those results. So you have a graduate student spending six months of their life changing the pattern of synaptic connections, trying more or less linear inhibition, changing the pattern, et cetera, until everything is explained. And then you say, OK, job done. That tells you what is needed to account for these phenomena.
But of course, it's limited in a way. And especially, it doesn't allow us to relate those particular patterns of connectivity to broader tasks. We just know that you need this pattern of connectivity, but that doesn't tell us what this pattern of connectivity is good for. In order to answer this, you really want to be able to train and optimize this pattern of connectivity for different visual tasks. That's hard to do if you have these complex systems of coupled differential equations, because they are heavy to simulate. And when you try to simulate the entire visual field-- remember that you have two differential equations per neuron. If you have thousands of neurons at every location, and then thousands and thousands of locations, you can easily end up with millions of differential equations that you need to simulate.
But this is the deep learning era, and so there's a lot of work that has been done on discrete-time approximations of the kinds of continuous-time dynamical systems that I showed you earlier. And so today there are very good architectures that have been developed that are very efficient at learning. You've probably heard about LSTMs, gated recurrent units, and all these things.
They are going to be very convenient for us-- not necessarily completely biological in their implementation, but here we're going to ignore that fact for a little bit. Instead of trying to solve and integrate the differential equations, we're going to use these gated recurrent units to learn the pattern of connectivity from data for a particular task, OK. So this is not just going to be arbitrary feedback connections. We're going to be considering ways to learn kernels of horizontal connections as well as feedback connections from higher layers to lower layers. And we'll try to understand how these connections affect performance and how they relate to human behavior for different kinds of tasks.
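As a sketch of the general idea, here is a minimal GRU-flavored update in which the hidden state at each location is mixed with its neighbors through a horizontal convolution kernel (hand-set here, but learnable in principle). This is an illustration of the mechanism, not the published hGRU equations:

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same' 2D convolution (single channel), zero padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def hgru_step(h, drive, w_h, w_gate=0.5, bias=0.0):
    """One hypothetical gated horizontal update (GRU-flavored): a gate
    decides how much of the horizontally mixed candidate state replaces
    the previous hidden state. Illustrative, not the actual hGRU."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    horizontal = conv2d_same(h, w_h)             # messages from neighbors
    gate = sigmoid(w_gate * horizontal + bias)   # update gate
    candidate = np.tanh(drive + horizontal)      # candidate state
    return (1.0 - gate) * h + gate * candidate
```

The point of iterating this step is exactly the propagation argument from earlier: even with a tiny 3x3 kernel, repeated application lets activity from one location reach locations far outside the kernel's radius, one step of neighbors at a time.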
So here's-- maybe I'll spend a bit more time on this task because I think this was our starting point. Here's a fun task. We call it pathfinder, and we thought this would be perhaps the most convincing evidence or illustration of the benefits of recurrent horizontal connections over convolutional neural networks.
So here's the task. You can think of it as following footsteps in the snow in the middle of other footsteps, and we are framing it as a categorization problem. We have positive examples and negative examples. You should be able to see contours here that are made of paddles. This is all synthetic, so we can control the angles between the paddles, their spacing, et cetera. And the part that might be harder for you to see if you're sitting far: both the positive and negative examples have markers on the images. In the positive examples, the two markers fall on the same contour. In the negative examples, they fall on different contours. It's actually reminiscent-- if you've heard about Marvin Minsky's work-- this is kind of a modern twist on the XOR problem that famously defeated the perceptron. If you don't know about it, that's no issue.
So the task here is to figure out whether there's a path going between the two markers, right. It's kind of a yes/no question. But we're going to be training neural networks and human participants to solve this task. The stimulus parameter of interest here is the number of paddles-- or the length of the contours. So you see that here the contours are made of 6 paddles, 9 paddles, 14 paddles. And the reason we care about this-- our assumption, to give you this lengthy introduction, is that CNNs and multilayer perceptrons are universal approximators. If I give them enough training examples, they'll learn the task, right-- though that may require millions of training examples.
But our assumption is that in order for CNNs to solve this task, they would need to have receptive fields that include the entire contour-- you would need to have a neuron here that is large enough to see the contour and the two markers. And so what that means is that if I consider longer and longer contours, the minimal CNN that will learn the task is one with a larger and larger receptive field.
So as the contours get longer, the receptive field of the optimal CNN to solve the task should get larger and larger. Well, that depends on the specifics of how you engineer the CNN, right. But if you assume a fixed downsampling from layer to layer, and you start from an architecture that has a given number of layers, the point is that you're going to need more depth-- more and more layers-- to process longer and longer contours. This is all I'm saying here.
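The depth argument comes down to standard receptive field bookkeeping; here is a small helper that does it. The formula is standard, and the layer configurations below are just examples:

```python
def receptive_field(layers):
    """Effective receptive field of a stack of conv/pool layers.
    `layers` is a list of (kernel_size, stride) pairs.
    Standard recurrence: rf += (k - 1) * jump; jump *= stride."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Example: three 3x3 stride-1 convs only reach a 7-pixel receptive field,
# so covering a long contour forces either more layers or more striding.
rf_shallow = receptive_field([(3, 1)] * 3)                 # -> 7
rf_strided = receptive_field([(3, 1), (2, 2), (3, 1)])     # -> 8
```

Either way-- stacking layers or adding downsampling-- the receptive field grows only gradually with depth, which is why longer contours demand deeper feedforward networks.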
AUDIENCE: But in all the CNNs, the fully connected layer will see the whole image.
THOMAS SERRE: Yeah, I mean-- yes. I guess if you think about a ResNet, where you average out. So here-- I'll show you results with ResNets. When I'm saying I agree, it's a little bit specific to a particular way of engineering a neural network, which would involve a fixed downsampling from layer to layer. But I'll show you the results hold more generally. From the intuition that I'm giving you, we expected that the minimal depth needed to solve this task would increase for longer and longer contours.
And if you follow me there, this is what these figures show. I know that there is a lot going on here, so I'll just say a few things. One is that we call our horizontally connected recurrent neural network hGRU, because it builds horizontal connections out of gated recurrent units. A single layer of this recurrent neural network can solve the task perfectly for all the different lengths. When you look at CNNs of different depths, we do find indeed that as the length of the contours increases, we need deeper and deeper convolutional neural networks. And then-- I think this is the point that [INAUDIBLE] is making-- even if you use a ResNet, which in this case, independently of the depth, sees the entire visual field, we do find that you need more and more depth to solve longer and longer contours.
So to give you a sense here-- there's a lot of stuff, but focus on the top one. Every subpanel includes the accuracy of different architectures for solving the task with lengths 6, 9, and 14. You get perfect accuracy with the hGRU. You see a one-layer CNN-- so one with limited receptive fields that doesn't see the entire image-- does OK on 6, but the accuracy goes down pretty quickly for 9 and 14. With three layers of depth you do a bit better, but again, you don't solve the 14, et cetera, et cetera. And it's the same story even for alternative architectures like a ResNet, where in order to solve the length-14 version of the pathfinder, you need to go all the way to 152 layers of depth, OK.
So if we just replot this-- this is for path length 14-- we look at the number-of-parameters multiplier for all these feedforward architectures compared to our single-layer RNN. Our RNN has the most efficient number of parameters, so it gets a free-parameter ratio of one, and it gets perfect accuracy. You can get perfect accuracy with a feedforward neural network, but you need to go all the way to 152 layers of depth. So this is almost 1,000 times more free parameters for the same accuracy, OK. So again, the point is not that you cannot solve this task with a feedforward neural network-- it's that you're going to need to resort to very brute-force solutions. Yeah?
AUDIENCE: Would anything change if they had or had not been pretrained on ImageNet?
THOMAS SERRE: If you had-- sorry?
AUDIENCE: If you had or had not pretrained on ImageNet.
THOMAS SERRE: So this is-- we've done, I believe-- I mean, if you look at the paper, there are probably 20 pages of controls. We have versions that are pretrained on ImageNet, some trained from scratch. One thing I should have said: it's one million training examples, by the way. And this is trained from scratch. The results hold. OK.
All right, and so the key here, again, if this was not clear enough, is that if you allow recurrent connections, the neurons can just exchange information. As I'll show you in a second, nearby neurons can just tell each other, we're on the same contour, we're on the same contour, et cetera. So the task can be solved almost trivially. With a CNN, you need to do that through depth.
All right, so here is an extension of this task. We said, well, this task is limited-- it's all about the horizontal connections. We also know about top-down connections. What could those top-down connections do? Well, we came up with this extension of the pathfinder that now involves not just simple contours. After all, the rule here is to learn a very simple Gestalt rule, one of continuity and proximity: a contour is defined by paddles that are not too far from each other and not too different in terms of their angle. So this is essentially a low-level task. You don't need any kind of knowledge of objects, for instance.
So we extended the task to involve objects-- in this case, letters. The game is pretty much the same. I don't know if you can see the markers; they are black here. We have letters, and the question is, are the two markers falling on the same letter or not? So here the answer is yes-- they fall on the same N. Here the answer is no-- these are two different letters. I don't know if you can see that the markers are here and here.
And so here we are playing a similar game, and I don't have enough time to get into the details, but we're always after this idea of a combinatorial explosion. If you are a CNN and you're solving this task brute force, then as we make the contours longer, et cetera, there's an enormous number of combinations of orientations that you need to learn. Here we are applying two different affine transformations to the two letters.
Here we're applying almost exactly the same transformations to the two letters. Here we make the transformations more and more different, to the point where here there is a very big difference between the transformations applied. So if you just store templates, there's a combinatorial explosion in the number of templates you need to store if you want to solve the task like a CNN.
OK, so our assumption was that for the pathfinder, horizontal connections are all you need-- in fact, we know that a single layer is sufficient-- whereas these higher-level tasks should require more. And by the way, we even ran controls to make sure that simple, low-level Gestalt rules wouldn't help solve the task: it's not enough to use simple principles like continuity and so on and so forth. I can tell you more about it if you're interested. But the point is that we show a double dissociation. What you see here is the accuracy for the pathfinder and for this semantic task that I just described, which we call Cluttered ABC.
You have human baselines here from Mechanical Turk. Humans don't have much problem solving those tasks. What you see here is the accuracy of the horizontal recurrent neural network, which we call H-CNN. It can solve all the pathfinder tasks pretty well. However, it's having trouble on the semantic version, this Cluttered ABC. The pattern is reversed with top-down connections: the network with purely top-down connections, without horizontal connections, struggles a bit on the pathfinder, but it does, in comparison, much better on the Cluttered ABC. And of course, if you have both, there's a little bit of overfitting happening, but in general this architecture that incorporates both horizontal and top-down connections can solve both of those tasks. So in general, you want both if you want a general enough architecture.
AUDIENCE: These horizontal and feedback connections-- they are convolutions that then feed that LSTM module, right? So you're learning the kernels that--
THOMAS SERRE: Two kernels-- one for horizontal and absolutely one for top-down.
AUDIENCE: And they're-- like, let's say feature one. It learns the kernel of feature one with feature one, the kernel of feature one with feature two. Or are the kernels the same?
THOMAS SERRE: It's over-- yes, it's over space and channels. It's specific. OK. I mean, we can talk more if you're interested after the talk-- I'm happy to give you more details. Or one of the experts is sitting two rows behind you as well.
All right, one final example. This is our latest work, which is not even published yet. We've extended the task to motion and tracking-- we call this one the path tracker. If you look at most of the real-world tracking datasets, I would say they are pretty simple, in the sense that there's always a target object which is very different from the background or the context.
So we wanted to see what would happen to these systems-- most of them rely on appearance cues. And we know that human subjects, primates, can track very well even when target stimuli are embedded among distractors. So we built this extension of the pathfinder to the spatiotemporal domain. Now, instead of having static trajectories-- this path linking two markers-- the paths are invisible and we just have dots moving along invisible trajectories.
At the beginning of the task, there is a marker crossing the little red square there, and then the question for the agent concerns the marker that crosses the blue square at the end of the video-- the video is looping, which is why what I'm saying might be hard to follow; there's a beginning and an end. The question is whether the marker that passed over the red square at the beginning is the same one that crosses the blue square at the end. So it's kind of a yes or no. It's literally a temporal extension of the pathfinder that I just showed you.
And so we can vary the number of these distractors, and we can vary the length of the video, so we can make it harder in the sense that the spatiotemporal dependencies that need to be learned by those neural networks can be made arbitrarily complex. Again, there is a natural combinatorial explosion here in the number of templates that you would need to store if you were trying to solve this task brute force.
And again-- my postdoc, [INAUDIBLE], took all the state of the art we could find: 3D convolutional neural networks, spatiotemporal convolutional neural networks, 2.5D convolutional neural networks, transformer networks and their extensions to the spatiotemporal domain, et cetera, et cetera. In general, we find that those guys do OK. So you have a human baseline here-- I don't know if you can see the stripes; this is the confidence interval around human observers. With a single distractor and a short video, 32 frames, people are perfect or near perfect-- well above 90%. All the architectures do OK; most of them are within the confidence interval of humans. When we increase the number of distractors, they can still do OK, and even with 25 distractors they do relatively well.
Now things get tricky for those models when we start adding frames, and in particular when we train and test them on different numbers of distractors. Here, with a single distractor and 64 frames, they already start to collapse, and things only get worse as we increase the number of distractors. I'll just say-- and I won't give you many details-- that the pink bar here is a very slight extension of the recurrent neural network I introduced earlier, which incorporates both horizontal and top-down connections.
There's only one extension here, which is to incorporate an attentional map-- something akin to a saliency map, if you have heard about models of attention. If you do this, the network essentially learns to keep track of task-relevant targets and distractors. And we find that the network is able to handle occlusions and temporal ambiguities and things of that sort, which unfortunately I don't have time to tell you about.
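[Editor's note] One way to picture an attention-gated recurrent update of the kind described here: an attention map decides where horizontal and top-down evidence is allowed to overwrite the hidden state. The weights, shapes, and gating rule below are hypothetical, not the published architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gated_step(h, x, W_in, W_hor, W_td, td, W_att):
    """One illustrative step of a recurrent unit with horizontal and
    top-down connections, gated by a learned attention map.

    h: hidden state, x: feedforward input, td: top-down signal.
    The attention map plays the role of the saliency map mentioned
    above: it selects where the state gets updated. All weights are
    stand-ins for learned parameters.
    """
    attn = sigmoid(W_att @ h)                  # task-driven attention map
    drive = W_in @ x + W_hor @ h + W_td @ td   # bottom-up + lateral + top-down
    return (1 - attn) * h + attn * np.tanh(drive)
```

Because the update is convex between the old state and the new drive, unattended locations simply carry their state forward, which is one simple way a network could bridge occlusions.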
What I want to do in the last five minutes-- so far I have shown you a lot of what most of my colleagues in computer vision would think of as cute little computational neuroscience work: tasks inspired by cognitive psychology, very constrained, far from the complexity of the real world. What I want to tell you is that this is not necessarily the case. Much of what we've learned from these synthetic cognitive psychology tasks has translated very well to the real-world domain, to the point where we can claim that several of these systems achieve state-of-the-art accuracy on modern computer vision benchmarks.
So here's some early work, which I'm going to go through quickly-- this was already a few years ago-- on contour detection with the Berkeley dataset, which was introduced by Malik and colleagues a long time ago. We can train our recurrent neural networks on those images, and they do about as well as the state of the art, which are very deep feedforward neural networks. As I said earlier, that's not necessarily surprising. By the way, both the feedforward networks and what we call the GammaNet, our recurrent neural network, achieve human-level accuracy on this contour detection task.
But what is interesting is what happens when we start reducing the amount of training data available to those algorithms. What we find is that when we, for instance, remove the augmentation on the training datasets, the accuracy of the feedforward neural networks starts to drop very significantly, to the point where, with only 10% of the training data, the recurrent neural network can do as well as a feedforward neural network that uses 100% of the training data. So again, the assumption is that the recurrent neural networks are able to learn a visual strategy based on something like Gestalt principles, as well as semantic information, which makes them much more sample efficient.
And this is to show you that the strategy these recurrent neural networks use is very different from that used by convolutional neural networks. What you see here are representative images and the ground-truth contours, which are averaged over many human annotators. At t equal 1, the output of our recurrent neural network is very messy: some of the contours are there, but there are a lot of spurious contours and a lot of missing contours. The recurrent neural networks do not build or detect contours through static depth as the CNNs do; rather, they do it through those recurrent connections through time. And so you see that, through time, the recurrent neural network is able to do what you would expect is the right thing to do, which is to slowly fill in the missing contours, suppress the background, and so on. By the final time step, which I think was 8 here, the recurrent neural network produces something very similar to what human subjects would do.
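[Editor's note] A toy 1D illustration of refinement through recurrent time steps rather than feedforward depth: at each step, contour evidence that is flanked on both sides by collinear support (a crude "association field") is reinforced, so gaps fill in while unsupported noise decays. The kernel and update rule are invented for illustration only.

```python
import numpy as np

def refine_contours(evidence, n_steps=8, gain=0.6, decay=0.8):
    """Iteratively refine a 1D contour-evidence map.

    Each step mixes a decaying copy of the current evidence with lateral
    support from collinear neighbors. Gap pixels flanked by strong contour
    evidence fill in over time; isolated noise pixels, lacking support,
    decay away. Illustrative only, not the GammaNet update.
    """
    c = evidence.astype(float).copy()
    for _ in range(n_steps):
        support = np.zeros_like(c)
        # Support only where the pixel is flanked on BOTH sides.
        support[1:-1] = np.minimum(c[:-2], c[2:])
        c = np.clip(decay * c + gain * support, 0.0, 1.0)
    return c
```

Running this on a contour with a one-pixel gap fills the gap within a few steps, while a lone noisy pixel fades, mirroring the fill-in/suppression behavior described above.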
The other thing that is somewhat interesting-- I started by telling you about illusions and the original computational neuroscience models. We didn't know what those illusions were about, but we find that we can actually test the sensitivity of these feedforward and feedback models, which are at human level for contour detection. I won't tell you how to read these plots-- you'll just have to trust me. This is how human perception looks for the tilt illusion. This is what you get out of state-of-the-art feedforward models: there is no effect at all. But the recurrent neural network does in fact exhibit the exact same biases that humans do: repulsion between center and surround orientations when they are close, and attraction when they are far apart.
And so presumably what is interesting is that this shows that these illusions are not necessarily just bugs, right? They're not just limitations of the visual system whereby our perception deviates from the real-world stimulus. In this case, we see that, one, they arise only for feedback networks, and two, they arise as a byproduct of optimizing for contour detection, which suggests that they reflect a feature much more than a bug, and that they only become visible when we push the visual system into these kinds of corner cases. We're also able to recapitulate a lot of the physiology, including some of what you were asking about, [INAUDIBLE], related to this. I cannot remember the name for this property of the kernels-- collinearity. Anyway, it will all come back. But yes, if we study the kernels that are learned for contour detection, we do find excitation for collinear contours. So the kernels are not just isotropic; there is some anisotropy in them. And we can talk more if you're interested.
OK, the last thing-- I know it's 4:30. Here's an example of state-of-the-art accuracy. Panoptic segmentation is currently one of the main remaining challenges in computer vision. The idea is not only to color every pixel according to a class label, but also to distinguish between different instances of the same object, which could potentially be occluding each other. It's very hard. The state of the art is a very strong feedforward baseline-- a feature pyramid network. You don't need to know all the details, but the point is that the accuracy is impressive, and yet the criteria used to evaluate these systems are such that there's still room for improvement.
We took this strong baseline feedforward architecture and simply slapped on our recurrent neural network modules to incorporate both horizontal and top-down connections. If you do this, after just one time step we already find an improvement over the feedforward baseline. And then, as we allow the neurons to exchange more and more information over time, as we iterate over time, we find significant improvements in accuracy. With only three to five time steps, you see that we can already close the gap quite significantly.
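[Editor's note] The scheme described here can be pictured as a recurrent module unrolled on top of a feedforward backbone's per-pixel logits, with a readout at every time step. `update_fn` below is a generic stand-in for the horizontal/top-down module; nothing about this sketch reflects the published architecture.

```python
import numpy as np

def recurrent_refine(baseline_logits, update_fn, n_steps=5):
    """Iteratively refine a feedforward baseline's predictions.

    baseline_logits: array of shape (n_pixels, n_classes) from the
    feedforward backbone. update_fn maps the current logits to a
    residual correction. A class prediction is read out at every
    time step, so one can track accuracy as a function of t.
    """
    h = baseline_logits.copy()
    readouts = [h.argmax(axis=-1)]  # t = 0: the pure feedforward answer
    for _ in range(n_steps):
        h = h + update_fn(h)        # residual recurrent update
        readouts.append(h.argmax(axis=-1))
    return readouts
```

With a well-trained update module, the per-step readouts are exactly what lets one plot accuracy against the number of time steps, as in the results described above.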
If that doesn't look impressive, here are representative examples. This is what the feedforward baseline system does: there is a sheep here on the rock, and a [INAUDIBLE] zebra here on this train. The feedback simply suppresses these kinds of false detections. Now, what is interesting is that we don't tell the network how to solve the task; we just train it. And when we look at how the network actually ends up solving the task, it looks as if it uses a coloring or flood-fill strategy. At t equal 1, it seems to start from a seed somewhere around the center of mass of the object, and then activity seems to spread until it hits the contour of the object.
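[Editor's note] The coloring strategy described here is classic flood fill: spread from a seed and stop at contours. A minimal sketch, assuming a binary contour map and a 4-connected neighborhood; this is an analogy for the network's emergent behavior, not its actual mechanism.

```python
from collections import deque
import numpy as np

def flood_fill_segment(contour_map, seed):
    """Toy version of the 'coloring' strategy: start from a seed near the
    object's center of mass and let activity spread until it hits a contour.

    contour_map: 2D binary array (1 = contour pixel).
    seed: (row, col) starting point inside the object.
    Returns a binary mask of the filled region.
    """
    h, w = contour_map.shape
    mask = np.zeros_like(contour_map)
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        if not (0 <= r < h and 0 <= c < w):
            continue
        if mask[r, c] or contour_map[r, c]:
            continue  # stop at contours and already-filled pixels
        mask[r, c] = 1
        queue.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return mask
```

Notice that the fill is bounded entirely by the contour, which is why a network that has also learned good contours can use spreading activity to delineate object instances.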
This is one example; we have more examples here, just very briefly, to give you a sense of how good these neural networks are. What is interesting is that there is beautiful work from the group of Pieter Roelfsema in the Netherlands suggesting that human observers use a very similar flood-fill strategy for image segmentation. I'm not going to have time to tell you more about that. We are also getting state-of-the-art results in tracking: we are currently number one on the TrackingNet challenge, which involves real videos. All right. OK.
So just to conclude, I tried to convince you of a few things. The first one, you remember: don't buy a Tesla, at least for a while-- or at least don't turn on the autopilot. Vision is more than categorization. So if you have to pick a problem, don't work on image categorization; there's a whole world out there, and there are many more interesting tasks. I would say image categorization is saturated but not really solved. And the third point that I wanted to hammer home-- and I hope I made that point clear-- is that much of computer vision really is about solving datasets, and at some point we need to start worrying about solving tasks. So be critical in how you assess the quality of modern algorithms, and in particular how they are being evaluated.
I tried to make the case that if we tackle the real problem of vision and hard visual reasoning tasks, there is no way around considering feedback mechanisms. I've tried to illustrate the fact that it's possible to engineer recurrent neural networks that are grounded in the anatomy and physiology of the visual system, and I showed you that they can, in fact, outperform engineered feedforward computer vision systems for contour detection, panoptic segmentation, and tracking.
I'll just leave it at that. I just want to leave you with the names of all the people who did the hard work, including Lakshmi, who is one of your classmates here. Several of these people actually graduated from this wonderful summer school, so I hope you'll get the same enjoyment as they did. Thank you very much.