Feedforward and feedback processes in visual recognition
Date Posted:
November 6, 2019
Date Recorded:
November 5, 2019
Speaker(s):
Thomas Serre
Brains, Minds and Machines Seminar Series
Description:
Thomas Serre - Cognitive, Linguistic & Psychological Sciences Department, Carney Institute for Brain Science, Brown University
Abstract: Progress in deep learning has spawned great successes in many engineering applications. As a prime example, convolutional neural networks, a type of feedforward neural networks, are now approaching – and sometimes even surpassing – human accuracy on a variety of visual recognition tasks. In this talk, however, I will show that these neural networks and their recent extensions exhibit a limited ability to solve seemingly simple visual reasoning problems involving incremental grouping, similarity and spatial relation judgments. Our group has developed a recurrent network model of classical and extra-classical receptive fields that is constrained by the anatomy and physiology of the visual cortex. The model was shown to account for diverse visual illusions providing computational evidence for a novel canonical circuit that is shared across visual modalities. I will show that this computational neuroscience model can be turned into a modern end-to-end trainable deep recurrent network architecture which addresses some of the shortcomings exhibited by state-of-the-art feedforward networks for solving complex visual reasoning tasks. This suggests that neuroscience may contribute powerful new ideas and approaches to computer science and artificial intelligence.
KAMILA JOZWIK: Good afternoon. Welcome to our CBMM seminar. My name is Kamila Jozwik. And I am from the [INAUDIBLE] CBMM, working with Jim DiCarlo and Nancy Kanwisher. And it's my pleasure to introduce Thomas Serre who is an associate professor at Brown University.
And, before starting at Brown, Thomas was actually here working with Tommy Poggio, doing his PhD and postdoc. So he has a lot of friends in the audience. And Thomas made very important contributions to our understanding of the computational mechanisms of visual perception. And, recently, he has been working on comparing the features in deep nets and in humans and showing where the deep nets fail. And he wrote the fun piece titled, "Deep Learning-- The Good, the Bad, and the Ugly."
And, beyond feedforward deep nets, he has been working with recurrent networks. And I think he will be talking about that today because the title of his talk is "Feedforward and feedback processes in visual recognition." So please welcome Thomas.
[APPLAUSE]
THOMAS SERRE: Thank you. Can you guys hear me well? OK, yes.
AUDIENCE: Yes.
THOMAS SERRE: Yes, thank you. And thank you very much, Kamila, for the kind introduction. Thank you, Hector, for the invitation. It's wonderful to be back. It's wonderful to see so many old friends and mentors, but also I have to confess so many new faces. So I kind of feel old right now.
All right, so I want to start maybe by celebrating a little bit the many tremendous successes of computer vision from the past few years. What I put here on the screen are, I think, two representative examples of the successes that computer vision has been witnessing-- object recognition on the left, ImageNet, face recognition here on my left, your right.
And those are the kinds of tasks that, I would say, 10 years ago we thought would be very hard for computer vision algorithms. And we're at the point where, by some estimates, the state of the art in deep learning is already approaching, if not outperforming, human subjects on image categorization. I would say even more impressive is what has been happening in facial recognition.
Jonathan Phillips, who has been working for the government for the past several decades essentially keeping track of progress in facial recognition, published a study last year demonstrating that the state of the art is not just approaching our ability, yours and mine, to recognize faces, but that it's essentially already able to approach the level of accuracy of the best visual recognizers we have, which are the facial forensic experts. So I think this is very impressive.
And I think, as much as we need to celebrate the great achievements, as more and more people have been starting to work in computer vision and visual recognition, the limitations of the current state of the art are slowly becoming more and more evident. And I want to spend a few minutes maybe going over what I think are the limitations of the field at the moment.
So I think most of us are familiar already with adversarial attacks. I don't want to dwell too much on it. What is fascinating to me, as a computational neuroscientist, is how we can get computer vision algorithms that can outperform the best human observers at facial recognition and yet that can be fooled with simple 3D-printed glasses that can make you and me look like Brad Pitt, for instance, or almost anyone that the researchers choose.
It is amazing. You can fool these algorithms with a very small amount of noise, which is already superimposed here on those images. The noise is targeted, obviously, but it is invisible to the human eye. And this is sufficient to fool a deep neural network into very reliably, confidently misclassifying those images, which would have been otherwise easily classified.
Here is another example. I don't know whether this falls, strictly speaking, in the realm of adversarial examples, but this is, again, one of these deep neural networks with superhuman ability to discriminate between traffic signs. This is obviously important for autonomous vehicles. And people have shown that you can just print and paste a few of these black and white stickers and turn this stop sign into a speed limit sign, so somewhat puzzling.
I think, in general, one of the main issues we are facing in the field is that of essentially trying to get those algorithms to generalize beyond training data. And I think the best way to illustrate the strengths of our own visual system is a case that [INAUDIBLE] here made, I believe, some 17 years ago, where these simple examples are there to illustrate the fact that, hopefully, you've never been exposed to these kinds of very weird distortions that are applied to faces.
I can stretch faces along arbitrary dimensions. I can blur them to a point where you've never seen faces blurred at that level. And yet what's fascinating and unique about our own visual system is our ability to deal with those completely novel image degradations.
If you look at what's happening in deep neural networks, you find that the story is quite different. So this is an interesting study that was published just last year from the group of [INAUDIBLE] in Germany, where the idea is that they took a few representative deep learning architectures. They showed that, if you train them on your favorite image database, say CIFAR, they can exhibit this superhuman accuracy, at least when you somewhat constrain the presentation time for human subjects.
On top of it, you can superimpose a very severe amount of noise. So here would be an example of Gaussian noise. You can make those images literally unrecognizable to human observers. Yet you can train the deep neural networks, and they'll be able to recognize images at levels of noise where it is literally impossible for human observers to recognize them.
And yet, if you fool around with the noise-- so, for instance, you train the neural network on something like salt and pepper noise, and you test the neural network on Gaussian noise. To human observers, the two kinds of noise, especially at your distance, are almost identical. And that's sufficient to take this deep neural network from superhuman accuracy down to chance level.
So, in other words, these neural networks are perfectly able to fit the noise applied to the training data. To this day, we don't have evidence that they're able, even when they have been trained with many, many different kinds of noise, to generalize to image degradations that they were never exposed to.
All right, why is that a problem? Well, it's a problem if you care about applications of computer vision. So Brown purchased an electron microscope a few years ago. My colleagues wanted to leverage the great progress in computer vision-- a lot of it, I should say, started here in the department, in efforts led by Sebastian Seung. And, essentially, my colleagues are interested in leveraging computer vision to help with the automation of the 3D reconstruction of neural circuits.
There are claims that there are at least two systems, the 3D U-Net developed by Sebastian Seung here and the flood-filling network developed, actually, by a former student of Sebastian who is now at Google Brain, Viren Jain. Both of these systems outperform human observers if you ask human observers to trace all the various kinds of neural processes in those images.
But it's important to realize that those superhuman neural networks are always trained and tested on the same brain volume. So they're always trained on maybe the first half of the brain volume and tested on the second half. For this to be useful to a more general audience, those networks need to be, essentially, usable kind of out of the box.
So I asked my students first to replicate what was published. They were able to train and test this deep neural network on the same volume, using different sections of the volume for training and testing. Then they tried to train those neural networks on all the available volumes that were already annotated-- and there was already a number of them; here are examples shown in blue. They were able to do that, but we found that the neural network generalized very poorly on a left-out volume, OK?
So you see here how a network that's trained and tested on this [? Briggman ?] data set does well. And then, as soon as we test for generalization to another volume, the accuracy collapses, OK? So, in other words, the neural networks are able to essentially fit the idiosyncratic properties of the training data. To date, I would say there is very little evidence that they're able to generalize beyond those training data.
What I think is part of the issue in the field is illustrated on this plot. This plot is meant to summarize the history of what has been happening in computer vision with this main challenge known as ImageNet. What you see here-- actually, this is the error rate. The height of the blue bars is the error rate of the winning entry for this ImageNet challenge.
And so you see that the error rate has been steadily decreasing since the introduction of the challenge in 2010, all the way to the last time the challenge was held in 2016. But perhaps more impressive is the trend that's pretty evident here that the depth of the winning architectures has been growing, I would say, probably even exponentially, OK? So, today, the state of the art are those very, very deep neural networks that incorporate on the order of hundreds of layers of processing.
Part of the issue when you have so many layers of processing is that you end up with architectures that are heavily overparameterized. As you're probably all aware, this is kind of machine learning 101.
The number of free parameters in a learning algorithm essentially gives you a rough estimate of the number of training examples that can be stored or, in other words, the number of random associations between training input and training output that can be just rote memorized. So, if you have an architecture that contains on the order of hundreds of millions of free parameters, in theory, this architecture should be able to just rote memorize roughly on the order of millions of training examples.
And, to convince you of that, there is a very interesting work that was done in Ben Recht's lab at Berkeley. Ben Recht was actually an alumnus here at MIT. And they showed that they could, essentially, shatter ImageNet.
So ImageNet has about a million training examples. What they showed is that, essentially, you could completely shuffle the class labels on ImageNet, so create completely random associations between input images and class labels. And, if you look here, for instance, at this column with the top-1 training accuracy, then, depending on the amount of regularization you apply, you can get, essentially, perfect training accuracy.
So those networks are deep enough. They have enough parameters, enough capacity, that, in principle, they can store completely random associations between input images and class outputs at the scale of ImageNet, which is about a million training examples.
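To make that experiment concrete, here is a minimal sketch in the spirit of that randomization test-- not the Berkeley group's actual code; the model, data tensors, and hyperparameters are placeholders:

```python
# Sketch of a label-randomization test: shuffle labels once, then see
# whether training accuracy still reaches ~1.0 through pure memorization.
import torch
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

def randomization_test(model, images, labels, epochs=100, lr=0.01):
    # Break the image-label relationship once, up front.
    shuffled = labels[torch.randperm(len(labels))]
    loader = DataLoader(TensorDataset(images, shuffled),
                        batch_size=128, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
    # Training accuracy near 1.0 on random labels implies rote memorization.
    with torch.no_grad():
        correct = sum((model(x).argmax(1) == y).sum().item() for x, y in loader)
    return correct / len(shuffled)
```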
So I would say this is potentially part of the issue: we've been building those very deep neural networks to the point where now they are, essentially, orders of magnitude deeper than our own visual system. So, if we look at the anatomy and physiology of our own visual system, depending on how you count, we can guesstimate on the order of maybe a half a dozen, up to a dozen, layers of processing.
What is interesting is what happens when we leave enough time to our visual system to solve image categorization tasks. I love one of the quotes from [INAUDIBLE] who says, "The more you look, the more you see." There's this notion that our visual system is able to deepen its processing, essentially, through time.
And so the argument that I'm going to be trying to make here is that, rather than building depth of processing through a very deep neural architecture, it is as if our visual system has chosen to rely on a backbone, a hierarchical architecture, that has maybe a half a dozen, up to a dozen, stages of processing, and that, presumably, the depth of processing is achieved not through static feedforward depth but, rather, through cortical feedback, OK?
And just to maybe introduce some definitions, I'll be using terminology for two different kinds of feedback. I'll refer to recurrent or horizontal connections for the pattern of connectivity within a layer of processing or maybe within an area of processing. And I'll refer to top-down processing or top-down connections for connections that arise from higher-level visual areas onto a lower visual area, OK?
So where do we get started? Where do we begin to build a model of visual processing that leverages these types of feedback mechanisms? Well, our starting point was work we performed a few years ago with a former graduate student of mine, David Mely, who tried to build a neurophysiologically constrained model of visual processing.
And so we tried to build a model that would take into account the notion of so-called extra-classical receptive fields. So maybe let me introduce very briefly what the notion of an extra-classical receptive field is. Most of you are probably familiar with the notion of the classical receptive field. This is what neurophysiologists typically refer to when they refer to the receptive field: the part of the visual field that needs to be stimulated in order to elicit a response from the neuron.
Perhaps less known is the extra-classical region, which is typically a surround region that sits right outside the classical receptive field. If this area of the visual field is stimulated all by itself, to be clear, this is not sufficient to elicit any kind of response from the neuron. But, if the classical receptive field is activated properly, then any kind of presentation in this extra-classical region can have a very dramatic effect on the neural response.
The neural basis for these extra-classical receptive field effects is widely assumed to originate from the combination of patterns of connectivity that can be found both within cortical columns and hypercolumns, as well as patterns of connectivity across cortical columns. So we built a computational neuroscience model. If there are aficionados in the audience-- and I know there are-- this is a model very much in the spirit of seminal work by Cohen, people like Grossberg, Sebastian Seung a few years ago.
The idea is to model the interactions among neurons through systems of coupled differential equations-- so to model the visual system as a dynamical system. I'm going to spare you the details of those differential equations. The most important thing here, maybe, is to realize that the key parameters in this model are those kernels, W_I and W_E, which aim to characterize how neurons both within a hypercolumn and across hypercolumns are connected.
We typically simulate this model using relatively simple Euler integration. It tends to be tedious to simulate these kinds of models because you have to realize that, with two equations per neuron, if we try to simulate interactions at the level of the entire visual field, we end up with a very large number of differential equations to simulate.
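For intuition, here is a minimal Euler sketch of this kind of model-- the rate equations, kernels, and constants below are illustrative stand-ins, not the published model:

```python
# Forward-Euler integration of a toy two-population rate model: an
# excitatory population h and an inhibitory population s per unit,
# coupled through spatial kernels.
import numpy as np
from scipy.ndimage import convolve

def euler_simulate(x, w_e, w_i, tau=5.0, dt=0.1, steps=200):
    """x: feedforward drive, shape (height, width, orientation channels);
    w_e, w_i: near-excitatory / far-inhibitory kernels of the same ndim."""
    h = np.zeros_like(x)  # excitatory population activity
    s = np.zeros_like(x)  # inhibitory population activity
    for _ in range(steps):
        e_in = convolve(h, w_e, mode="constant")  # near-surround excitation
        i_in = convolve(h, w_i, mode="constant")  # far-surround drive to inhibition
        dh = (-h + np.maximum(x + e_in - s * h, 0.0)) / tau  # s*h: shunting-style term
        ds = (-s + np.maximum(i_in, 0.0)) / tau
        h += dt * dh
        s += dt * ds
    return h
```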
Nonetheless, it can be done, and we did it. And, if you have a good graduate student who can tweak the parameters here, the pattern of connectivity both within and across columns, we found that, in practice, we could explain a number of classical electrophysiology studies that have highlighted the effects of this extra-classical receptive field on the neural responses. And I'll spare you the details on that.
I'll just tell you that there are three ingredients that we found were key for the model to work and to be able to account for the electrophysiology. First, we had to assume, essentially, two distinct surround regions, what I'll be referring to as a near excitatory surround, shown in green here. So imagine this is an annulus. You have a center reference column in red. I have a near excitatory surround right around me.
And then, further away, we had to assume a far inhibitory surround. And I'll get back to that in a second. So that's the first key ingredient in this model.
The other particular mechanism that we had to bake into this model is this idea of tuned connections, often referred to as like-to-like. So the basic idea is that, if I'm a neuron in a center reference column, I'm going to be connected to my neighbor with an excitatory connection if my neighbor is nearby. The connection will be present if and only if my neighbor and I share the same preferred stimulus.
So, if we're in the realm of orientation tuning, you can think of a center column, essentially, with a neuron tuned to a vertical bar. I'm going to be receiving excitatory connections from my near neighbor if it is also tuned to a vertical bar. I'm going to be receiving a connection from my faraway neighbor if and only if my faraway neighbor is also tuned to a vertical bar, but now the interaction switches from excitatory to inhibitory, OK? So this is the second aspect of this model.
And then the third aspect, third computation we had to bake into this model, is to have some kind of asymmetry between excitation and inhibition. Just because I know that Tommy is in the audience and is a big fan of shunting inhibition, essentially, we had to assume a form of nonlinear inhibition known as shunting inhibition. This was introduced by Grossberg in the early '70s.
The idea behind shunting inhibition is that, if I'm a postsynaptic neuron and I receive some amount of synaptic inhibition, the decrease in my activity will be based not only on the amount of inhibition or the level of activity of my afferents, but also on my own level of activity. In other words, the more active I am, the more inhibited I'm going to be for a fixed amount of presynaptic input, if that makes sense. So that's the model in a nutshell.
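In equation form, a standard Grossberg-style shunting update looks roughly like the following-- a generic textbook form, not necessarily the exact parameterization used in the model:

```latex
\tau \frac{dx}{dt} = -\alpha x + (\beta - x)\,E - x\,I
```

For a fixed inhibitory input I, the suppression term xI grows with the cell's own activity x, which is exactly the asymmetry just described.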
And so, if you do this, something essentially magical happens: we found that, essentially, all the kinds of classic contextual illusions we could find, at the behavioral level, could be explained by this otherwise very simple computational model. So here are a few examples. We were able to account for the tilt illusion. So I don't know if that's going to be obvious for you guys because some of those effects are frequency dependent.
But here you should see there's a vertical grating in the center surrounded by a grating tilted slightly to the right. When the center and the surround are of a small angular difference, the surround typically tilts or pushes the center grating to the left, away from the surround. I don't know if you see it here.
If I increase the angular difference between the center and the surround, the effect gradually decreases up to a point where then the effect reverses. Past a certain difference between the center and the surround, the surround starts to attract the center towards it. So you should see a tilt here in the center to the left and here to the right.
AUDIENCE: They're both straight.
THOMAS SERRE: Sorry? You both see them straight.
AUDIENCE: I see both straight.
THOMAS SERRE: Sorry? You see them both straight. It could be a matter of the distance. I suspect that people in the back might see it better. This has to do with the spatial frequency of those gratings.
So I'll do it again at the end if you want. And you'll get a chance to move to the back and, hopefully, to observe it. But this is not me. This is a very well known, classic contextual illusion that has been extensively studied.
And so we found that this was true in the orientation domain. If we consider similar neural population responses, we were able to account for many of the color contrast, color assimilation, motion induction, and stereo induction effects, so on and so forth.
And so maybe I'll go briefly through this one. This is an example of color contrast. This is the same center here in both conditions. They should look quite different. The key difference is the surround. Here the surround is red. It gives you a sense that the gray center looks greener than it actually is.
All right, so I know that I'm going to be running out of time. So I'm going to try to go quickly through this. Maybe just to give you a flavor of how the model works, imagine that we're sticking a laminar probe, a depth electrode, at two locations-- one somewhere in a column located in the center stimulus, another electrode here in the surround. What I'm showing you here are neural population responses.
So imagine that we have something like orientation on the x-axis. The height of the bars corresponds to the response of the neuron at that particular preferred orientation. So let's say that this would be, maybe, vertical orientation. This is where we get the peak response. And then the nearby neurons respond a bit less. The population neural response looks like this typical bell-shaped curve, peaked at vertical.
In the surround, the stimulus is tilted a little bit to the right. So this population neural response is somewhat tilted to the right. Now remember that the inhibition in this particular regime would exhibit this like-to-like pattern of connectivity, which means that, essentially, all these responses here are going to push down the responses in the center. And so, if you put yourself in the shoes of a neural decoder trying to read out the orientation here in the center, essentially, after this orientation gets pushed down, you would get a false readout, essentially shifted to the left.
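Here is an illustrative toy-- not the actual model-- of how this like-to-like suppression biases a population-vector readout away from the surround orientation:

```python
# Like-to-like surround suppression repels a population-vector readout.
import numpy as np

prefs = np.linspace(-90.0, 90.0, 181)                      # preferred orientations (deg)
center = np.exp(-0.5 * (prefs - 0.0) ** 2 / 15.0 ** 2)     # center population, peak at vertical
surround = np.exp(-0.5 * (prefs - 20.0) ** 2 / 15.0 ** 2)  # surround tilted 20 deg right

# Surround units suppress center units with matching preferences,
# flattening the right flank of the center response.
suppressed = np.maximum(center - 0.5 * surround, 0.0)

def decode(resp):
    # Population-vector readout on the doubled angle (orientation is 180-deg periodic).
    ang = np.deg2rad(2.0 * prefs)
    return np.rad2deg(np.arctan2((resp * np.sin(ang)).sum(),
                                 (resp * np.cos(ang)).sum())) / 2.0

print(decode(center))      # ~0 deg: unbiased readout of the vertical center
print(decode(suppressed))  # negative: readout repelled away from the surround
```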
I'm going to show you a movie, an actual simulation of the model, that shows you precisely that. It goes pretty quick. You see here that, when we turn on the stimulus, we get, essentially, activity both from the center and the surround. You can see here that, because the surround is much larger than the center, the surround has the effect of essentially pushing this center population away from it.
And so, starting from a vertical orientation here, the darker bar corresponds to the readout from our neural decoder. And you see that the neural decoder now gets a readout which is shifted to the left. In general, we find that, essentially, all the visual illusions we could find, the contextual illusions we could find, can be well explained through, essentially, three separate regimes of contribution of this near excitatory surround and this far inhibitory surround.
And you can think of those near and far surrounds as acting like push-pull forces on the center population. Depending on what we feed into the surround, in some regimes the near excitation essentially pulls the center responses towards it. In regimes where the inhibition wins, it pushes the center population responses away. And so, essentially, all the contextual illusions we could find could be explained, depending on the kinds of stimulus and the specifics of the experimental paradigms.
In some experiments, we found that the particular stimuli used would lead to a main effect from exclusively either the near or the far surround. We find that, in some regimes, both get activated, both the near and the far surround. And so they essentially fight one another and try to simultaneously push and pull. Sometimes, the excitation wins. Sometimes, the inhibition wins.
And then we found some very special kinds of conditions, as in this weird stimulus here that I forgot to describe-- so this is an illusion that was introduced by Monnier and Shevell. Again, I don't know if that works for you because the screen is completely uncalibrated, but, when the illusion works, these are the same RGB values here.
They should look pretty different. They don't look that different to me right now. But the only difference between these two stimuli is essentially the ordering of the outer rings, going from salmon to lime to purple here versus salmon to purple to lime there. Just flipping the ordering of these two types of rings produces very different-- does it work for you? Or--
AUDIENCE: Yes. No, I can't see what you mean.
THOMAS SERRE: OK, sure. I'm happy to take a question.
AUDIENCE: Are you going to say something about like sort of normative. Like why would it do this? Or why does it produce a--
THOMAS SERRE: I will touch on it
AUDIENCE: --reference error.
THOMAS SERRE: Yes, I'll try to tell you something about it. So I guess your point is that I've been showing you that this circuit mostly helps the visual system get fooled, right? So why would you bother?
And the point that we are trying to make as we speak is that those errors that arise because of these circuits are kind of edge cases when we use those very artificial stimuli. But, as an example, we found that, just testing this circuit without changing any free parameters, we can explain the level of color constancy measured from human subjects. So we know that this circuit allows the neural population to be much more invariant to changes in illumination.
What I'm going to show you, if I have time, in the rest of my talk will be that a system that exhibits this tilt illusion does much better at detecting contours in natural images. So the normative story here would be that this is actually a byproduct of a visual system-- or an artificial system, in this case-- optimized for contour detection.
But the point that I wanted to make-- and, again, I don't know whether the effect is working; it's not working for me, so I hope it is working for you-- is that these effects appear to be very dependent on the actual size of those rings. And we found that we could perfectly explain this dependency on the size of the rings. We find that the effect is maximal when the size of those rings falls exactly in the near surround that we've been hypothesizing.
The effect here happens and is maximal when those lime and purple colors are on opposite sides of the color circle from this salmon color. Essentially, the lime is pushing the salmon away from it, while the purple is essentially pulling it towards it. And so, essentially, these two mechanisms collaborate to make these superadditive effects, in a sense-- and I'm happy to take questions.
So, anyway, I'm going to skip through the actual data, but we did our homework. And, although the intent was not to provide a quantitative fit to experimental data, in practice, we found that, quantitatively, we could explain most of these illusions. And I'm just going to flash through these slides very quickly.
On the left will be psychophysics data that we replotted. On the right will be the data produced from the model. And you'll see that, in general, the fit is quite good.
So this is the tilt illusion with the attraction and repulsion. This is a similar effect happening in depth. This is a similar induction in motion, in slightly different experimental conditions. This is the color contrast, color induction. And this is the Monnier and Shevell kind of superadditive effect.
I apologize for flashing those slides so fast, but this is mostly published work. So, if there are psychophysicists here in the audience who are interested, I would encourage you to check out the paper.
All right, so, to [INAUDIBLE] point, I've shown you a circuit that essentially messes around with perception. So one could rightfully ask what's the point. And so, as I said, we believe that those illusions are the byproduct of computations that serve the greater purpose of maybe building constant object representations. And we would like to essentially demonstrate that by being able to train this module and optimize it for arbitrary tasks, maybe something like color constancy, contour detection, et cetera.
The problem is that, in these kinds of dynamical systems, these systems of coupled differential equations, it's not quite easy to train the networks-- to set the pattern of inhibitory and excitatory weights for these horizontal connections. Here we were able to do it with a graduate student [INAUDIBLE] optimizing and tweaking the pattern of connectivity to match the electrophysiology, but this is not necessarily amenable to optimizing for arbitrary computer vision tasks.
Luckily-- and it's perhaps kind of surprising that we didn't realize it earlier-- when we wrote down the Euler integration step, we realized that-- and I'm trying not to look at Jean-Jacques because he would be very upset at me-- it's true that, if you write down the Euler integration step for one of those systems, you can literally approximate it with a modern kind of recurrent neural network.
And so, if you do this, the benefit now is that you can approximate this dynamical-systems approach with your favorite recurrent neural network, with all the bells and whistles of your favorite LSTM or GRU. And so now you can build a machine learning module that's fully differentiable, fully trainable. And that can be, essentially, embedded in your favorite deep learning architecture so that you can actually demonstrate and test the usefulness of such a circuit.
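As a sketch of that mapping from an Euler step to a gated recurrent update-- the gating and the two-stage suppression/facilitation here are illustrative, not the published hGRU/fGRU definition:

```python
# Euler: h <- h + dt * (-h + f(x, neighbors));  GRU: h <- (1-g)*h + g*candidate.
# The learned sigmoid gate g plays the role of the time step dt.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HorizontalCell(nn.Module):
    def __init__(self, channels, ksize=7):
        super().__init__()
        pad = ksize // 2
        self.w_inh = nn.Conv2d(channels, channels, ksize, padding=pad)  # horizontal inhibition
        self.w_exc = nn.Conv2d(channels, channels, ksize, padding=pad)  # horizontal excitation
        self.gate = nn.Conv2d(channels, channels, 1)                    # learned step size

    def forward(self, x, h):
        inh = F.relu(x - h * F.relu(self.w_inh(h)))    # shunting-like suppression stage
        cand = F.relu(inh + F.relu(self.w_exc(inh)))   # horizontal facilitation stage
        g = torch.sigmoid(self.gate(h))
        return (1.0 - g) * h + g * cand

# Unrolling the cell for several time steps stands in for integrating the ODE.
cell = HorizontalCell(16)
x = torch.randn(1, 16, 32, 32)
h = torch.zeros_like(x)
for _ in range(8):
    h = cell(x, h)
```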
And so, maybe to illustrate one of the benefits of these kinds of highly recurrent neural network architectures, here's a task that we developed. This is inspired, I should say, by classical cognitive psychology work and, in particular, by the extensive electrophysiology that Peter [INAUDIBLE] has done on these kinds of tasks. We call this task the Pathfinder.
So the analogy would be: imagine following footsteps in the snow. The idea here is that this is a binary classification task. We produce on the order of hundreds of thousands of positive and negative examples. All the examples are made of contours. Those contours are made of the little paddles you see here.
On both the positive and the negative set, you see pairs of dots. On the positive set, the dots fall on the same contour. So, yes, there is a path from A to B. On the negative set, there is no path. The markers fall on different contours.
The main parametrization we used here is that we are able to fully control the task difficulty. In particular, we're able to control the length of the path from A to B. The reason we cared about that is that we reasoned that your favorite deep convolutional neural networks are universal approximators. We can produce millions of those images. So there is nothing that they will not learn if we give them enough training examples.
But our assumption was that, if you are trying to solve this task with purely feedforward pooling mechanisms, the only way to solve it is by reaching a receptive field size within which the entire contour would be contained. And the way state-of-the-art deep neural networks increase their receptive field sizes is through depth. So our prediction is that, as we make the contours longer and longer, the minimal convolutional neural network needed to solve the task should get deeper and deeper.
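As a quick back-of-the-envelope check of that argument, the receptive field of a stack of stride-1, 3-by-3 convolutions grows only linearly with depth-- this is the generic formula, not tied to any specific architecture from the talk:

```python
# Effective receptive field of a stack of convolutions.
def receptive_field(layers, kernel=3, stride=1):
    rf, jump = 1, 1
    for _ in range(layers):
        rf += (kernel - 1) * jump  # each layer widens the field by (k-1)*jump pixels
        jump *= stride             # stride compounds the step between adjacent units
    return rf

for depth in (5, 20, 50):
    print(depth, receptive_field(depth))  # 5 -> 11, 20 -> 41, 50 -> 101 pixels
```

So containing a contour twice as long requires a feedforward stack roughly twice as deep.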
In comparison, I'll show you results with a single layer of our highly recurrent neural network trained on the same task, OK? So the plot that I'm about to show you is a little complex, but I'll try to distill the essence of it for you.
So please look at the top row here. Each one of those cells corresponds to the accuracy of one of our baseline architectures. On the far left is our recurrent neural network, which we call the hGRU, as in Horizontal GRU or Horizontal Gated Recurrent Unit. And you see that this single-layer network has no problem learning to solve this task at all possible contour lengths.
In comparison, we ran all kinds of feedforward convolutional neural networks. I'll draw your attention here to the ResNets, which are still today considered among the state-of-the-art architectures. You see that we have three ResNets, with 18 layers, 50 layers, and 152 layers.
You see that all the networks, at all the various depths, are able to solve the easiest task when the contour length is six. But, as we make the contour longer and longer, you see that, at the end, only the deepest of the ResNets, the 152-layer one, is able to solve the task.
If I replot this, just to give you a sense of the computational complexity of that beast: shown here, on the y-axis, is the accuracy of all these models. We have our hGRU here with a perfect accuracy for the longest contour. Shown here is the free-parameter multiplier, so the architecture with the fewest parameters is our hGRU. We can get almost perfect accuracy with a ResNet, but with 1,000 times more free parameters.
So I hope this illustrates the benefit of having this kind of highly recurrent neural network. At least for this kind of what I would refer to as incremental grouping task, there is a very natural solution where the idea is to propagate information between neighbors, as opposed to feeding forward through depth.
And, of course, we know that this is not the only way that our visual system solves these kinds of grouping tasks. I illustrated one example, which I think is able to capture one aspect of grouping based on low-level gestalt principles, but we also know that, in addition to these gestalt principles, there are also so-called semantic or top-down cues, right?
So here's an example. If I give you a zebra on a well-separated background, it's very easy, actually, for human subjects, even possibly in feedforward, pre-attentive mode, to answer a question such as whether those two yellow dots fall on the same side of the contour or not. If I give you a whole group of zebras, the task becomes very difficult. And simply leveraging gestalt principles of proximity, continuity, et cetera is not going to buy you a lot in this case.
So this is evidence that, potentially, the only way to solve this task would require some higher-level knowledge, whether you want to call this object-level or semantic. And so, in order to try to disambiguate the role of these horizontal versus top-down connections at a computational level, we performed an extension of our Pathfinder task that I'm going to be illustrating in just one second.
So I should point out that we essentially extended our architecture to account for top-down connections. So I'm going to be referring to our architecture as hGRU for pure horizontal connections or fGRU, which is a more general form that can include top-down feedback in addition to these horizontal connections. And I'll spare you the details. The details are probably not too important.
What I want to show you is our extension of this Pathfinder here. So remember the Pathfinder. The goal was for a learner to learn to say whether there was a path between the two markers. You can think of that as figuring out whether the two markers fall on the same object. Here the object is a contour.
To tap into this idea of top-down semantics, we extended the task to now be based on objects. And the easiest objects we could come up with and parametrize are letters. And so we call this task the cluttered ABC.
The idea is the same. I don't know if you can see the markers here. They're a bit harder to see. But the task is very similar. The networks have to learn to answer whether the two dots fall on the same letter, for the positives, or on two different letters, for the negatives.
Here we're going to be playing the same game. We're going to try to make the job of the feedforward neural networks harder and harder. And the way we do this is by gradually increasing the transformations that we apply to these two letters.
But here's the trick. What we did here is that we applied two affine transformations to those letters. In the easy case, we apply pretty much exactly the same transformation to the two letters. To make the task harder and harder, we gradually decrease the correlation between the transformations that are applied to the two letters.
So, in what you see here, to make the task really hard, we are applying completely different affine transformations to the two letters. The reason why we expected this to be hard for a feedforward neural network is that, if you are a feedforward neural network, the way to solve this task is by, essentially, storing a lot of feature templates.
And you'll notice that, if you're using this strategy, the number of templates that you need to store increases exponentially from left to right. Because here there is correlation, you can just learn to deal with one transformation for the two objects. There, you need to figure out what the transformation is for each one of the two objects.
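Here is a sketch of that correlation knob-- the parameter ranges and names are made up for illustration, and the published stimulus generator may differ:

```python
# Draw one affine transform, then either reuse it for the second letter
# (easy, fully correlated) or draw an independent one (hard).
import numpy as np

def random_affine(rng):
    return {"rotation_deg": rng.uniform(-45, 45),
            "scale": rng.uniform(0.7, 1.3),
            "shear_deg": rng.uniform(-15, 15)}

def letter_pair_transforms(rng, correlation=1.0):
    t1 = random_affine(rng)
    t2_indep = random_affine(rng)
    # correlation = 1 -> identical transforms; 0 -> fully independent ones.
    t2 = {k: correlation * t1[k] + (1.0 - correlation) * t2_indep[k] for k in t1}
    return t1, t2

rng = np.random.default_rng(0)
print(letter_pair_transforms(rng, correlation=1.0))  # same transform on both letters
print(letter_pair_transforms(rng, correlation=0.0))  # independent transforms
```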
So there is a combinatorial explosion in the number of samples we can produce from left to right. If you're a feedforward neural network and if you are, as we expect, rote memorizing templates to solve the task, you're going to need an increasingly large capacity to solve this task. In other words, you're going to need to get deeper and deeper. Yeah?
AUDIENCE: Are you only doing this with the computer neural networks? Or do you do this with people as well?
THOMAS SERRE: I'll show you a human baseline. We have human baselines on those tasks, yes. Yeah, OK? All right, so let me show you some results. Let's look at the left side first. Here is the accuracy of humans, actually. On the top will be the Pathfinder. On the bottom will be the cluttered ABC.
So the first thing that you see is that human subjects don't have too much trouble. They can solve this task. And they don't exhibit much of this straining that I alluded to. So they don't seem to care too much about how long the contours are or how correlated or decorrelated the transformations are on those letters.
The second thing to notice is that a model that incorporates-- so a convolutional neural network that incorporates both horizontal connections and top-down connections can solve pretty much both of the tasks. There's a little bit of straining here for the hardest cABC task. But, most interestingly, we found this double dissociation between the role of horizontal and top-down connections between these two tasks.
So you'll notice that here, for the Pathfinder, as expected, horizontal mechanisms, perhaps carrying gestalt-like biases or inductive biases, can solve this task pretty well across all path lengths. However, pure top-down systems, lacking these horizontal connections, exhibit very significant straining. So you see that, as the path between the points gets longer and longer, a pure top-down architecture gets a major, major drop in its accuracy.
For the cluttered ABC, the result is somewhat the opposite. We find that the pure horizontal version of the CNN is unable to solve the cluttered ABC, while a pure top-down version, essentially, does pretty much as well as the full model.
So, to try to give you some intuition about how these recurrent neural networks work, we essentially retrained them on a slightly different task. Here we just trained them, for the sake of explanation, to produce a segmentation mask. We show only one dot here, one marker.
And we train the network to produce a segmentation mask for the one object that contains the marker. And so these would be examples of the ground truth, or similar to the ground truth, that you would expect for those.
And so what you see here is that the strategy that the horizontal recurrent neural network leverages is one where there is a very broad kind of global inhibition along most of the contours, and then a spread of activation along the target contour.
So time goes from left to right here in different stages. And you see, from left to right, starting from this example, the network starts with a spot of excitation at one of the markers. And then you see activation essentially spreading along the contour and immediately trailed by inhibition.
In comparison, a pure top-down architecture seems to solve the cABC very differently. So here you see that, in the latter case, it is as if the network makes an initial guess about the rough location and parameters of the affine transformation for the letter. And then, through time, it seems to gain confidence about the actual position and viewpoint of the letter.
All right, so we've run many more controls that I'm not going to have time to go through. I'll just maybe briefly show you-- oh, I forgot to show you the human data. Where are my human data? Oh, yeah, so these are the human data here on the right side.
We ran the same experiment on human subjects. And what we are reporting here is the percent of the variance in the human data explained by the models that were able to actually learn the task. Not all the models were able to learn the task, but, for those that could, on the Pathfinder, we find that the network with horizontal connections does best-- no significant difference with the fuller model that incorporates top-down connections, but significantly better than our feedforward networks.
And then the converse is also true for the cABC. So we found that the pure top-down model does about as well as the fuller model that incorporates both top-down and horizontal connections, and also significantly better than the next best feedforward neural network, OK? So, in other words, those recurrent neural networks seem to be able to solve the tasks much more efficiently, with a fraction of the number of training examples. And they seem to be much more consistent with the human data.
All right, so I want to maybe end by showing you a couple of proofs of concept that those neural networks can solve more complex tasks than these cognitive psychology tasks. And so, as a proof of concept, we trained one of our fully recurrent networks, with both top-down and horizontal connections, on a contour detection task.
So we use the Berkeley Segmentation Data Set. To be honest, I think the data set has been overly used, but this is, essentially, the only way to convince my colleagues in computer vision that we neuroscientists have potentially relevant ideas for them. And I should preface everything that I'm going to say by telling you a couple of things.
First of all, the state of the art today is still very deep feedforward neural networks. Two, the way we train these neural networks is, I would say, ludicrous. You need to pre-train them on ImageNet for object recognition. Then you fine-tune them on PASCAL, where there are very coarse masks for objects. And then you further fine-tune them on the actual Berkeley data set.
But wait. Not only do you fine-tune them on the data set, but, because the data set is relatively small, with only a few hundred manually annotated images, people use the training and the validation set plus data augmentation. So this is how you get state-of-the-art accuracy on contour detection today.
If you do this, you can actually achieve human level-- so this BDCN here is a [INAUDIBLE] feedforward neural network that performs at human level on this contour detection task. If we train our recurrent neural network in the same way, we get very similar accuracy, on par with human subjects. So we do pretty well.
What is more interesting here is what we can do when we now cut down the amount of training data for our neural network. So, if you look here on the right, you see what happens with different amounts of training data. So 100% would be using the whole Berkeley data set, but without the augmentation.
And you see that what we call gamma-net, our recurrent neural network, can do significantly better. In particular, if we go down to only 10% of the training data, we can achieve about the same accuracy as what a purely feedforward neural network can do with 100% of the training data.
So I would say this is an example where recurrent neural networks can leverage recurrence to achieve a much lower sample complexity on tasks that I would have expected to require recurrent connections in the first place. And so here you see how those networks solve this task. Again, they don't do it through depth. They solve it through time.
So on the left side here, the network produces initial guesses about the position of edges and contours. And then you see those maps getting refined through time. You see that there's a lot of junk here. There are incomplete contours. There are spurious detections.
As you proceed through time, you see the spurious detections essentially decreasing and the segmented object contours gradually increasing. And this is the ground truth here on the right side. So we've been able to achieve state-of-the-art accuracy on natural image segmentation.
Here is an example in connectomics, for neural tissue segmentation. So we tried two of the main connectomics data sets. We started from a state-of-the-art architecture, again, one of these U-Nets developed by students in Sebastian's group. And there, again, we see a very similar trend. The results, perhaps, are not as impressive, but we do find a clear benefit of introducing this recurrent neural network to learn more efficiently with fewer training data.
Now, going back again to [INAUDIBLE] point, if we actually test this trained architecture-- after we train it for contour detection, we test it for the tilt illusion-- we do find a tilt illusion. So this is, essentially, very similar, at least qualitatively, to what was reported in human subjects.
In comparison, if you do that on a purely feedforward architecture, there is no bias whatsoever, no tilt. So the question then becomes: what is this bias or tilt illusion good for? Well, we tried to remove this bias from the neural network. And the way to do this is, essentially, to freeze the readout layer for contour prediction, but then to fine-tune all the layers underneath to try to force the neural network to produce an unbiased estimate, even in the presence of tilted contexts.
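A sketch of that bias-removal procedure, assuming hypothetical model.readout and model.backbone attributes and a loader that serves unbiased targets:

```python
# Freeze the contour readout, fine-tune the layers underneath toward
# unbiased estimates. Attribute names here are placeholders.
import torch
import torch.nn.functional as F

def remove_tilt_bias(model, loader, epochs=10, lr=1e-4):
    for p in model.readout.parameters():
        p.requires_grad = False  # freeze the readout layer for contour prediction
    opt = torch.optim.Adam(
        [p for p in model.backbone.parameters() if p.requires_grad], lr=lr)
    for _ in range(epochs):
        for images, unbiased_target in loader:  # targets without the tilt bias
            opt.zero_grad()
            F.mse_loss(model(images), unbiased_target).backward()
            opt.step()
```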
And, if you do this, you find something very interesting. So what I'm showing you here are the predictions produced by-- I believe the original, full model is in blue. The bias-corrected, tilt-illusion-corrected model is in red. This is showing you the results of a sensitivity analysis.
So this, essentially, is showing you what matters in the image for the network to make its decision. You see that the full model seems to focus mostly on kind of high-level, semantic contours. In the network for which we corrected for the bias, there seem to be many more spurious detections on textures. So the normative story here, or the explanation, would be that this tilt illusion is the byproduct of an active feedback mechanism that is aimed at learning to detect semantic object contours.
All right, so this is all I had. Let me just conclude briefly. I've tried to make the case that feedforward neural networks have achieved impressive abilities, and I don't want to minimize that. I'll just say that there are still very clear limitations of these feedforward neural networks.
I've tried to show you how it's possible, in theory, to build recurrent neural circuits that are inspired by the anatomy and the physiology of the visual system, how it's possible to leverage that knowledge to build machine learning modules that can then be embedded into your favorite deep neural network architecture, and how, if you do this, good things happen. You get comparable accuracy, albeit with a much lower sample complexity.
I'll just leave you with an acknowledgment of all the wonderful students and postdocs in the lab who did all the hard work and allowed me to give this talk today. So thank you very much.
[APPLAUSE]