Towards a system-level theory of computation in the visual cortex
Date Posted:
April 14, 2015
Speaker(s):
Prof. Thomas Serre, Brown University
Brains, Minds and Machines Seminar Series
Description:
Abstract: Perception involves a complex interaction between feedforward (bottom-up) sensory-driven inputs and feedback (top-down) attention and memory-driven processes. A mechanistic understanding of feedforward processing, and its limitations, is a necessary first step towards elucidating key aspects of perceptual functions and dysfunctions.
In this talk, I will review our ongoing effort towards the development of a large-scale, neurophysiologically accurate computational model of feedforward visual processing in the primate cortex. I will present experimental evidence from a recent electrophysiology study with awake behaving monkeys engaged in a rapid natural scene categorization task. The results suggest that bottom-up processes may provide a satisfactory description of the very first pass of information in the visual cortex. I will then survey recent work extending a feedforward hierarchical model from the processing of 2D shape to motion, depth and color. I will show that this bio-inspired approach to computer vision performs on par with, or better than, state-of-the-art computer vision systems in several real-world applications. This demonstrates that neuroscience may contribute powerful new ideas and approaches to computer science and artificial intelligence.
PRESENTER: It's an enormous pleasure to introduce our friend and colleague, Thomas Serre. Thomas is no stranger to this community. He's a good friend of many of us and a tremendous colleague who has made enormous contributions to our computational understanding of the principles orchestrating visual processing in cortex and also computer vision and engineering algorithms that actually work in practice and that are actually used in a variety of different applications.
Thomas started his career in the old continent, in this funny country called France in the Lycee Pasteur studying math and physics, and then moving on to several degrees in engineering that he may tell us more about, probability and statistics as well as EE and CS. After that, he joined our community here at MIT in BCS where he got his PhD under the mentorship and guidance of Tommy Poggio, and that's where I met him.
And it's been a tremendous pleasure to interact with him. I still vividly remember all our conversations and discussions. In many ways, those were the highlights of many of my days here, being able to talk to Thomas and having him as a source of continuous inspiration, an endless source of new ideas and coming up with all sorts of new projects and thoughts about what to do next.
I could tell a lot of stories about Thomas. He just explicitly told me not to, and I should say as little as possible. So because he's here, I'm not going to say anything. I'm going to share all the stories when he's gone afterwards if anyone's interested. So with that said, again, it's a great pleasure to introduce him.
He's done tremendous work in computation neuroscience as well as computer vision. And I could go on and on praising his work and describing all his multiple contributions. But it's much, much better to hear it directly from him. So please join me in welcoming Thomas to this lecture today.
[APPLAUSE]
THOMAS SERRE: Thank you for an overly positive introduction, and thanks for the invitation. I'm overjoyed. I haven't been here in many years. And seeing so many mentors and friends in the room, I'm actually-- I feel the pressure. I'll do my best. So today, I'm running a little bit of risk of running low on science. I decided to try to give you a general overview of the kind of work that I've been interested in studying in the good old days here at MIT and then continuing in my lab at Brown.
And so I'm going to tell you about some of the most recent work we've done on the computational mechanisms of rapid visual processing. And I'm going to tell you about a recent collaboration with a monkey electrophysiology group from France. Then I'm going to try to give you an overview of some of the recent work we've done towards the development of an integrated computational model of the visual system, at least part of the visual system, and trying to integrate multiple cues ranging from 2D to 3D shapes, binocular cues, as well as motion and color. And then, if I have time, and I have the feeling that I'm going to be running out of time, I'll try to tell you or show you an application of the kinds of computational models we've been working on towards the automated behavioral analysis of rodents.
All right. So I'm going to start by showing you a little movie of what happens in your brain when you process a visual scene. So this data we actually collected with Gabriel in his lab many years ago. So this is data from an epileptic patient. So the white dots here are surface electrodes.
And I chose this particular patient because the coverage is quite good, although we don't have any control over where those electrodes are located. This particular patient has a reasonable coverage along the temporal lobe. So I'm going to show you two movies. The first movie I'll show you just the raw voltages, so you're going to see a raw-- the visual information flowing throughout this patient's brain.
So 0 milliseconds corresponds to the presentation of the stimulus. Nothing much happens until about 16 milliseconds, when you start seeing potentials developing in more posterior areas. And then, after about 180 milliseconds, one can start seeing activation in more anterior areas. So I would characterize this flow of information as, essentially, a bottom-up or feedforward flow, running from lower visual areas to what we would think of as higher-level visual areas, probably some IT analog.
Now I want to give you a slightly different perspective on these visual dynamics. And I'm going to show you the result of a so-called decoding analysis. So without giving you too many details, I'm going to show you how we assess the diagnosticity of the categorical content of the stimulus from the signal recorded on each of these electrodes. OK?
So what you're going to see here is a slightly different perspective where, essentially, not much happens until about 180 milliseconds, when you're going to start seeing a hot spot here towards higher-level visual areas. And then after this hot spot forms, you're going to see, essentially, this categorical information flowing back to almost all the electrodes we're recording from.
So I would characterize this flow of information as, essentially, a top-down flow of information, initiating in higher visual areas and running towards lower visual areas. So the point that I'm trying to make here is that if you think of the primate visual system, with about three dozen visual areas or so, given that those visual areas are almost always reciprocally connected, one should expect, in the most general case, very complex underlying visual dynamics, with both bottom-up and top-down signals flowing through the brain.
In my view, if we want to make progress towards understanding how these visual areas carry specific computation related to different visual functions, we need to find ways somehow to constrain the underlying visual dynamics. And so, luckily for us, psychophysicists have worked on this for decades now, and, actually, some of the pioneers are here in the room.
There is an experimental paradigm called the rapid visual presentation paradigm that was started here with Mary Potter many, many years ago, decades ago I should say. The specific paradigm has evolved in the past couple of decades. What I'm going to show you here is a particular example where the rules of the game are that images will be flashed very briefly, and participants have to report the presence or absence of a particular category of objects, namely the presence or absence of animals in those scenes.
So here the presentations are relatively slow compared to what we would be running in the lab. This is about 100 milliseconds or so per flash. In the lab, we could speed that up, down to just a few milliseconds. And so at this speed, people don't perceive every single little detail of the visual content of these images. But people are normally able to report the gist of the scene. In this particular case, the presence or absence of an animal.
So what do we know about the neural basis of this task? Well, the best animal model we have to study this is, of course, the primate, or monkeys. Unfortunately, I would say, most of the electrophysiology work in monkeys has been conducted on passive, passively fixating, animals. These passive-fixation electrophysiology protocols have told us a lot about the detailed anatomy and physiology of one part of the visual system, the ventral stream of the visual cortex shown here, which we know is involved in the processing, or we assume is involved in the processing, of 2D shape information and object recognition.
But most of the evidence is indirect in the sense of there hasn't been yet a link between neural activity in these brain areas and behavior responses during one of these rapid categorization tasks. So I was fortunate enough to have two fantastic collaborators at CNRS in France, Denis Fize and Maxime Cauchoix, who were able to train monkeys to perform this rapid animal categorization task.
So a sketch of the task is shown here. These are head-free monkeys; they are rewarded with a drop of juice whenever they correctly categorize the presence or absence of an animal in rapidly presented images. So every trial starts with a fixation cross. The stimulus is very briefly flashed for about 33 milliseconds. And then, after a short interval, they have to report the presence or absence of an animal with a go/no-go paradigm.
So, essentially, they are holding their hand on the button pad, and then whenever they see an animal, they release the button and touch the screen. Or if no animal has been presented, they are trained to hold their hand on the button pad. I should also say that on half the trials, we actually use a backward mask, because we're interested in understanding how the presentation of this backward mask will affect the underlying visual dynamics and the availability of neural information.
Just in the interest of time, I'm not going to tell you anything about the mask. I'm just going to tell you that, essentially, we found that this mask leaves the first 50 to 60 milliseconds of visual processing within the ventral stream unaffected. And for all the analyses I'll be presenting, we could thus just collapse the trials from both masked and unmasked presentations.
So here you're going to see the animal working. So you see here, actually, this is a masked set, a block of masked trials. So you see that it goes so fast that probably most of you only see the mask at this speed. And so you see the monkey at work. You see the monkey here releasing the button whenever it sees an animal and otherwise holding the button.
Denis and Maxime trained the two animals, and so there was a familiar set of images-- about, I think, 300 to 400 images-- and the accuracy of the two monkeys on this set was very high, on the order of 90%. Now, of course, we know that monkeys have a fantastic memory, and so it's entirely possible that, despite the large size of the data set, they were simply memorizing the animal and non-animal examples.
So we wanted to make sure that they were actually generalizing and really learning the concept of animal. And so I essentially sent Maxime, in France, a set of stimuli that we had been using here on human participants. And we tested those animals on the very first presentation of those completely new images. We found that, despite a significant drop of about 10% in accuracy-- each of the two dots corresponds to one of the two monkeys-- the accuracy remained well above chance.
And just to give you some context, human participants are at about 80% accuracy on the same set of images here. And, actually, human subjects-- these are experiments we did many years ago when I was still a PhD student-- these are actually MIT students. So I don't know what that tells you about MIT students, but to me, that tells me that monkeys are an excellent animal model for this kind of high-level vision.
AUDIENCE: But MIT students drink more juice.
THOMAS SERRE: [LAUGHTER] That's the other interpretation. So now, of course, just because you have two visual systems performing at the same level of accuracy doesn't necessarily mean that they share the same underlying visual strategy. So we wanted to characterize a little bit better the visual strategy that both of these species were relying on, and we came up with this very simple index, which we call the animalness index.
So for human participants, for each individual image, we could compute, over all 25 participants, the fraction who responded "animal." So we end up with an index running from zero to one. Zero means that all the participants rated this image as highly non-animal-like. A score of one means everyone agrees that there's an animal. And anything in between reflects some uncertainty in their decisions.
For monkeys, we didn't have 25 monkeys, we only had two. So, just to reduce the noise level a little bit, we had to collapse the first 20 or so trials. So there are repeated presentations, but I really doubt that this affected the results. So we get one score for humans and one score for monkeys for each stimulus that was used, and we can now plot one versus the other. Right?
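As a rough sketch of the per-image index just described-- assuming a hypothetical response matrix of observers by images, with 1 meaning "responded animal", and using NumPy purely for illustration, not the authors' code:

```python
import numpy as np

# Hypothetical response matrix: rows = observers (participants, or pooled
# monkey trials), columns = images; 1 if the response was "animal", else 0.
responses = np.random.randint(0, 2, size=(25, 300))

# Per-image index: the fraction of "animal" responses, running from 0
# (everyone said non-animal) to 1 (everyone said animal).
animalness_index = responses.mean(axis=0)

# Computing the same score from a second response matrix (e.g. the monkeys')
# gives two vectors, one per species, that can be plotted against each other.
```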
So here on the x-axis is the score that we obtained from humans. On the y-axis, the same score derived from monkey responses. And the first thing that's pretty obvious is that there's a very strong correlation between the two. Now, I'm not going to tell you exactly the level of correlation, because that wouldn't be completely fair.
And the reason it wouldn't be completely fair is that if you have two systems performing at about 80% accuracy, you see here that most of the images on the upper right corner are green, meaning that they contain an animal, and most of the images here on the left contain non-animals. So most of the correlation comes from the fact that those two visual systems are pretty accurate.
So if we want to know exactly how correlated those two species are, we essentially need to factor out the class label. And without getting into too much detail, the correct measure for that is a measure of partial correlation. I'm only going to show you the plot here. So the dashed line here corresponds to the significance level, computed with a bootstrapping procedure. You see that the correlation from human to human is still relatively high, even factoring out the class labels, at about 0.25.
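A hedged sketch of the partial-correlation idea-- regress the binary class label out of both per-image scores and correlate the residuals; the data here are random placeholders and NumPy/SciPy are assumptions, not the authors' actual analysis code:

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, labels):
    """Correlation between x and y after regressing the class label out of both."""
    design = np.column_stack([np.ones(len(labels)), labels.astype(float)])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return stats.pearsonr(res_x, res_y)[0]

# Random placeholder scores and ground-truth labels (1 = animal present).
human_score  = np.random.rand(300)
monkey_score = np.random.rand(300)
label        = np.random.randint(0, 2, 300)

r_partial = partial_corr(human_score, monkey_score, label)
# Significance would then be assessed by recomputing r_partial on
# bootstrapped or label-shuffled data, as in the talk.
```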
The correlation between the two monkeys that are called Deeky and Rooky, hence the DY and RY, is on par with the level of inter-correlation between human subjects. Most interestingly, the level of correlation between human and each of the two monkeys is almost at the same level as the level of correlation within each of the two species.
So, you know, I used to hear a lot that-- a lot of complaints about these rapid visual categorization paradigms that just by flashing images very briefly or forcing subjects to answer very quickly, we're just forcing them to make random motor errors. What that tells me is that that's not the case, that in fact there is a pattern of correct and incorrect responses that seems to be shared both within each of the two species and across the two species.
So, very briefly, I just want to give you a sense of the kind of electrophysiology analysis we've done. So Maxime and Denis went on and recorded from intermediate areas of the ventral stream, in area V4 and PIT. Those are actually wire electrodes, so we are essentially sensing the equivalent of a local field potential, a relatively coarse signal. However, we were able to map out the receptive fields from these relatively coarse signals.
And most interestingly, we were able to reliably decode the presence or absence of an animal on individual trials. So what we did for this analysis is we considered the full neural signal from all the electrodes for each of the two animals. And then we passed the raw voltages-- so about a dozen electrodes or so-- to a linear classifier. And for every point in time, we tried to predict the presence or absence of an animal in the stimulus that was just presented.
So here you see two curves which reflect the accuracy of the decoding as a function of time. So the x-axis corresponds to time. The y-axis corresponds to how well we can decode this categorical signal from those electrodes. And so the first thing to notice here is that we passed significance here-- again, assessed with a bootstrapping method-- very, very early.
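A minimal sketch of this kind of time-resolved decoding-- random placeholder voltages, a dozen hypothetical electrodes, and scikit-learn's logistic regression standing in for whatever linear classifier was actually used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: trials x electrodes x time samples of raw voltages,
# plus one binary label per trial (animal present or absent).
n_trials, n_electrodes, n_times = 400, 12, 200
voltages = np.random.randn(n_trials, n_electrodes, n_times)
labels = np.random.randint(0, 2, n_trials)

# For every time point, cross-validate a linear classifier on the
# instantaneous voltage pattern across electrodes.
accuracy = np.empty(n_times)
for t in range(n_times):
    clf = LogisticRegression(max_iter=1000)
    accuracy[t] = cross_val_score(clf, voltages[:, :, t], labels, cv=5).mean()

# Plotting accuracy against time gives the decoding curve; the significance
# band would come from re-running the same loop on label-shuffled data.
```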
Just to give you a sense of the kind of visual latencies we would expect from these areas, we would expect raw visual latencies on the order of 60 to 80 milliseconds. Here we find that, already within 80 milliseconds post-stimulus onset, we can decode the presence or absence of an animal in those images. Now, just because we can decode the presence or absence of an animal does not necessarily mean that the animal is actually exploiting this information. After all, there's a relatively large literature arguing for a role for subcortical areas-- in particular, the amygdala-- in this kind of rapid visual processing.
So to try to link a little bit better the availability of categorical information in this set of areas to behavioral responses, what we did here was to actually sort out our trials according to reaction time going from fastest to slowest trials. Then what I'm going to show you here for the analysis is the decoding analysis conducted on four quartiles of reaction times.
So we use reaction times to organize our trials, and then we perform a separate decoding on each of those quartiles. So what I'm actually showing you here is the decoding on the fastest trials, when the animal is responding within about 250 to 280 milliseconds. If we consider the next quartile, we find that the latency of decoding is shifted a little bit to the right. For the next quartile we find a further shift to the right, and for the last quartile a further shift still.
So this strongly suggests that there's a link between the availability of categorical information in this set of areas and the reaction time, or how fast the animal will be to respond. Of course, I wish I could give you causal evidence-- ideally, we would have liked to stimulate, which we haven't done-- but, unfortunately, one of the two animals removed its implant before we could do that. So, at the moment, this is mostly suggestive evidence, but I think it's nonetheless very strongly suggestive evidence of the involvement of this set of visual areas in driving behavior.
So why do we care? Well, we care because, I think, this kind of result-- and because of the speed of the availability-- I had forgotten about the train here. I'll wait for it. That's going to be the perfect transition to my final point for this set of results.
Why do we care about these kinds of results? We care about them because what they tell us is that the very first, the earliest, visual signal we can record from this set of areas seems, one, to contain reliable categorical information and, second, seems to be driving behavior. And I would take this as suggestive evidence for this class of feedforward hierarchical models, which suggests that what James DiCarlo calls core object recognition proceeds in a feedforward, bottom-up sweep of activity through a hierarchy of visual areas.
So let me switch gears a little bit and transition to telling you about what I think is the state of the art in computational models of this rapid visual processing. And so, being at MIT, I'm assuming that most people are familiar with this class of feedforward hierarchical models. They were pioneered, or proposed, by Hubel and Wiesel somewhere in the early '60s. Fukushima extended these ideas of hierarchical processing to computer vision applications. Tommy Poggio here has been doing a lot of work with a particular type of model that has been dubbed HMAX, starting with work done by one of his graduate students, Max Riesenhuber.
I was fortunate enough to be here for several years where, together with colleagues who are actually in the audience, we extended this initial proof of concept and showed that it was sufficient to account for a number of electrophysiology results across areas of the ventral stream of the visual cortex. I'm not going to do justice to the number of people who have built on this line of work; there have been many extensions. There's also been some fantastic work by Jim DiCarlo's group on this idea of high-throughput screening, extending this class of models and making them work even better.
Now one thing that is interesting that has been happening is that, kind of in parallel to this work done in computational neuroscience, trying to build hierarchical feedforward models of processing along the ventral stream of the visual cortex. There's been a lot of work on a related class of model. You might have-- I'm sure you've heard, actually, about convolutional neural networks, or deep learning networks, which have been developed from a different community, the machine learning and computer vision community.
And so what's interesting is that while these two classes of models, or architectures, have been developed somewhat independently, they share a lot of the same underlying architecture. In particular, both HMAX, or these biological models, and the CNNs share the same key operations. They both alternate between layers of template matching and invariance pooling. And they also share essentially everything, down to the level of detail of the specific kinds of normalization happening across layers.
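As a bare-bones illustration of those two shared operations-- not any particular published model-- here is a sketch of a template-matching ("simple") stage followed by a local max-pooling ("complex") stage, with hypothetical sizes and SciPy assumed for the convolution:

```python
import numpy as np
from scipy.signal import convolve2d

def simple_stage(image, template):
    # Template matching: convolve with a template and rectify.
    return np.maximum(convolve2d(image, template, mode="valid"), 0.0)

def complex_stage(fmap, pool=2):
    # Invariance pooling: local max over small neighborhoods,
    # giving some tolerance to position.
    h = fmap.shape[0] // pool * pool
    w = fmap.shape[1] // pool * pool
    return fmap[:h, :w].reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))

image = np.random.rand(64, 64)          # hypothetical input
template = np.random.randn(7, 7)        # hypothetical learned/tuned template
c1 = complex_stage(simple_stage(image, template))
```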
Now, one major difference between these machine learning or computer vision architectures and the work that has been done in neuroscience is that, unlike in biological models where we care about trying to mimic or constrain the tuning properties of cells across stages of processing, these architectures didn't really care about mimicking the biology. Instead, they were able to optimize all the hyper-parameters-- for instance, the receptive field sizes, the range of invariance, and the tuning properties of cells-- simply for optimal accuracy on various kinds of tasks.
Another significant difference between these classes of models or architectures and biologically realistic architectures or models from biology is that the learning here is all supervised. So all the layers are essentially optimized for a particular visual function, say object recognition, as opposed to architectures that have been developed in computational neuroscience where the learning is mostly unsupervised, without a teaching signal or a task in mind.
Now, what I think is happening is fascinating. I mean, I was just looking at the arXiv, and I found out that Microsoft is now claiming that one version of this deep-learning architecture actually surpasses human-level accuracy in categorization. And this is not a binary classification task. This is on one of the main computer vision data sets, ImageNet, which has millions of images in tens of thousands of categories.
So, you know, I think this is an amazing time, and certainly I think the community has achieved a lot of progress in the past couple of years. In addition, there is work here at MIT in Jim DiCarlo's lab showing that not only are these machine learning or computer vision architectures better in terms of recognition accuracy, but they are also better than any of the existing biological models in terms of predicting the firing rates of neurons, or how they respond to different kinds of stimuli.
And so that's yet another reason why I'm very excited. Now, just to be clear, despite all of these major achievements, there is also something interesting that has surfaced in the past couple of months. And what has surfaced is the fact that these architectures-- and although some of these experiments were conducted on convolutional neural networks, I suspect any of the more biologically plausible models would suffer from the same exact limitations-- have the limitation shown here.
So the kind of experiments that have been run is a so-called adversarial type of optimization of stimulus. So what this group did, and there are several different versions of that experiment, they picked some of the best stimuli, if you want, the stimuli that the network was correct at recognizing and very confident in terms of its decision.
And so what they did is that they did a type of stochastic optimization on the input stimulus, so essentially starting from one of these images for which the network was very confident about the class label to just optimize some very fine level of noise on individual pixel intensities to see if they could essentially swap the class label. So going from the network being very sure that it's looking at a dog to producing an image for which the network would be 100% sure that it's not a dog. And the images that they produced are those images.
The level of noise that was needed to swap the classification is so small that you can barely perceive it on these images. So these networks are highly sensitive to a very small, IID type of independently drawn noise on individual pixels. Sorry, I need to move my car. All right, too late.
OK. So clearly there's something missing here. And the reason why I'm using this experiment, or this example, is that I think it's a representative example of the kind of limitation one would expect from this type of feedforward hierarchical model. And I think it's good, because most of the work we've been doing in my lab in the past couple of years has been trying to go beyond this kind of what I would call bag-of-neurons type of visual representation, towards building more sophisticated models with much richer visual representations.
And so I'm going to try to argue for two things that I think are missing in these types of architectures. And I'll start with one, which is: I think if you look at it carefully, you'll see that the basic computing element remains what we call the LN model of a neuron. So each of the millions of units in this kind of hierarchical model essentially assumes that there's a set of afferent units, and the output of each unit is computed by considering the level of activity of these inputs.
Weighting that by a linear synaptic weight, summing that up, and then passing that through a nonlinearity. So this is often referred to as the LN model, the linear-nonlinear model. That remains a very simplified and idealized model of what we understand about the biophysics of neural circuits.
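In code, the LN unit just described is nothing more than this-- a rectifier stands in for the generic nonlinearity, and the inputs and weights are made up:

```python
import numpy as np

def ln_unit(afferents, weights):
    # Weighted sum of afferent activities passed through a static
    # nonlinearity (a rectifier here).
    return np.maximum(np.dot(weights, afferents), 0.0)

# Made-up afferent firing rates and synaptic weights.
print(ln_unit(np.array([0.2, 0.8, 0.5]), np.array([1.0, 0.5, 0.3])))  # 0.75
```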
And so some of the work we've been doing in my lab for the past couple of years now has been to try to refine the kinds of elemental operations that are carried out in those networks and to come up with more sophisticated mechanisms. And one example of, I think, a biophysical mechanism that's often overlooked in those architectures is spike timing and synchrony.
So here's a simulation, very simple simulation, that we conducted using neurons. So I'm going to show you a very simple simulation with a Hodgkin-Huxley type of neuron. And so what we did here is that we're considering two presynaptic neurons, and we're inducing some rhythmicity on these inputs. We're assuming that there is some clock somewhere, and that's inducing some oscillations on the firing rate of these neurons.
So we're assuming that both of those neurons' firing rates are modulated by a signal at one particular frequency and that these two neurons are shifted in terms of the exact phase of this carrier oscillation. What I'm showing you here is that, as we would expect, starting from rhythmic inputs, we end up with a firing rate on the postsynaptic neuron which is also rhythmic.
What is interesting, and probably most of you know this already, is that if we vary the relative phase between the two inputs, we find that it has a tremendous effect on the actual firing rate of the postsynaptic site. OK? If we normalize the output of this neuron, and if we induce zero phase difference between these two inputs, we find that we have the highest possible normalized firing rate.
And as we start shifting the relative phase between the two inputs, we start getting a monotonic decrease in the postsynaptic firing rate. Then the output goes up again once we hit pi, because we are now reversing the cycle. So the point is that the amount of synchrony between the two inputs-- the relative phase between the two carrier signals, the rhythmic signals that are modulating those firing rates-- has a tremendous impact on the firing rate of the output neuron.
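This is not the Hodgkin-Huxley simulation itself, just a crude rate-based stand-in with made-up numbers that reproduces the qualitative phase dependence described above:

```python
import numpy as np

t = np.linspace(0, 1, 1000)   # 1 s of "time"
f = 10.0                      # hypothetical clock frequency, in Hz
threshold = 2.0               # postsynaptic threshold, arbitrary units

def output_rate(phase_shift):
    # Two presynaptic rates modulated at the same frequency, shifted in
    # phase, summed, and passed through a rectifying threshold.
    r1 = 1 + np.cos(2 * np.pi * f * t)
    r2 = 1 + np.cos(2 * np.pi * f * t + phase_shift)
    return np.maximum(r1 + r2 - threshold, 0).mean()

phases = np.linspace(0, 2 * np.pi, 50)
rates = [output_rate(p) for p in phases]
# The output is maximal at zero phase difference, falls to a minimum at pi,
# and recovers towards 2*pi, mirroring the curve from the biophysical model.
```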
So, you know, I am a system-level computational neuroscientist, and I don't want to have to bother with this level of detail when I work with these kinds of large-scale visual architectures. And so we wanted to come up with a class of abstractions, or what I would call phenomenological operations, that would be able to capture this idea without necessarily requiring spikes.
And so our idea was very simple, but I hope I can convince you that it's a good idea. The idea is to extend the linear-nonlinear model, the LN model, of neurons from the realm of real numbers to complex numbers. And the rationale for relying on complex numbers is that, in addition to being able to carry out operations on the amplitude of the signal, the firing rate, which would be given by the magnitude of a complex number, we can also model the phase between multiple inputs.
So the basic idea is that, again, we have a clock somewhere, and we're going to be considering the relative phase of all the units in our networks with respect to this clock. And in most of the next slides, I'm going to be representing phase using [INAUDIBLE] of this kind.
And so here the idea is that, rather than having a neuron that only cares about the activity of the afferent units, which then gets weighted by a synaptic weight, which is a real number, here what matters as well, in terms of the output of this unit, is the relative phase between the two inputs. So if you do summation in the complex plane, the optimal output would be obtained when the two inputs are collinear, when they have the same phase. And then the output will decrease as a function of the difference in phase.
And so, just to show you an example, this is the kind of output firing rate as a function of phase difference that we can obtain, and it looks relatively close to the curve that I showed you on the previous slide, which we obtained from the Hodgkin-Huxley neuron. So this is a phenomenological mechanism that, presumably, allows us to take into account the relative phase between units at all stages of a visual architecture when computing these elemental operations.
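A minimal sketch of the complex-valued LN unit just described-- rates as magnitudes, phases relative to a hypothetical global clock, summation in the complex plane; all values are illustrative:

```python
import numpy as np

def complex_ln_unit(rates, phases, weights):
    # Each input is a complex number: firing rate as magnitude, phase
    # relative to a global clock as angle. Summation happens in the
    # complex plane; the output rate is the magnitude of the sum.
    z = rates * np.exp(1j * phases)
    out = np.sum(weights * z)
    return np.abs(out), np.angle(out)

rates, weights = np.array([1.0, 1.0]), np.array([0.5, 0.5])
for dphi in (0.0, np.pi / 2, np.pi):
    amp, _ = complex_ln_unit(rates, np.array([0.0, dphi]), weights)
    print(round(dphi, 2), round(amp, 2))   # 0 -> 1.0, pi/2 -> 0.71, pi -> 0.0
```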
So here's an example of where I think this kind of synchrony could be useful. This is work that I did with a former post-doc of mine, David Reichert, who has since been poached by DeepMind in London, and so we haven't been able to finish this work. So what David did-- he's a big fan of these deep learning networks, so this is a stack of restricted Boltzmann machines. The details don't really matter.
One difference between this type of architecture and the feedforward hierarchical model that I introduced earlier is that the connections run both bottom-up as well as top-down. And so, essentially, carrying out visual recognition in this architecture takes several cycles of bottom-up, top-down iterations. So what David did is that he trained this architecture to recognize squares, which are made of four corners. I don't know how well you can see those squares here in the examples.
So after the architecture was trained-- you know, training the architecture means learning the visual representation in the different stages-- what we did was simply to initialize the input of this architecture with units at random phases, completely random phases. And then we let the inference run, and we let the architecture converge towards a classification response for each of the inputs.
And what happens is quite interesting. What happens here is that as the network goes through several bottom-up, top-down loops, initially there are going to be a few units here, maybe responding to different kinds of salient features in the image, that are going to have a particular phase. And they're going to end up driving the other nearby neurons. They're going to entrain, or push, these units to start oscillating at the same phase that they are.
So what's going to happen, after a couple of iterations, is that we're going to go from completely random phases in the inputs to all units in this architecture essentially spontaneously synchronizing at one particular phase, which is the phase of the most salient units coding for one particular object. So what you see here is that after convergence, all the inputs from the square have turned blue, which means that all these units are now in phase. Anything due to spurious feature detection in the background ends up oscillating at a different phase.
So from these kinds of simulations, what you can see is that one way in which synchrony could be deployed in this kind of architecture is to bind visual information from all the units, at all levels of the architecture, that are responding to the same given object. So you can think of this mechanism as a way to read out the binding of information across layers.
So, similarly, David conducted further experiments. So here's an example where he's showing that by just segmenting out the phase of all the units, in all the layers of this architecture, we can actually separate between different objects, occluding objects, separating between this object and this one here. And these are more examples.
So, unfortunately, I am well aware that those are still very simple examples. We are in the process of running more experiments. In theory, there's no reason why we cannot run these types of complex-valued units and synchrony mechanisms in more realistic architectures like HMAX or convolutional neural networks. As I mentioned, a very well known company, probably known to many of you here, based in London, poached my post-doc before we were able to continue the work.
Anyway, the second point I want to make in terms of limitations of these kinds of visual architectures is that, despite all their bells and whistles, they remain texture recognizers. OK? Essentially, the thing that these units at different stages care about is the 2D visual appearance of the objects that are being fed to them. And I don't think I have to lecture anyone here in the department about how much we know that 3D shape affects both the learning and the recognition of natural everyday objects.
And so we know that there are many cues that our visual system relies on in order to build this complex 3D representation of objects. So here's just a general outline from a relatively old paper by Van Essen and Gallant. The details of this particular diagram have probably changed, but the bottom line is that we know that different cues interact at all stages of the visual architecture. They interact both within areas and across areas.
So a lot of the work we've done in my lab, and particularly with these two graduate students, David Mely and Junkyung Kim, has been to start from this hierarchical feedforward model that had focused so far on 2D shape processing and trying to extend them towards a complete model of the visual system that incorporates color, motion, and stereo processing. So I'm not going to have time to go through the details of all these different models.
I'm just going to give you a sense of how these extended models work. Essentially, what we've done is, I guess, we've applied Occam's razor and started from what we understood well in the domain of 2D shape processing and then tried to see how far we could go with these kinds of architectures towards extending them to multiple visual cues.
And so the basic element in these extended models is an extension of the model of simple cells-- which were essentially applying Gabor functions with different properties in terms of their tuning to spatial frequency and orientation-- from the luminance domain to the domain of color opponency, so working with color-opponent channels that operate in the red-green, red-cyan, and blue-yellow domains.
We've also been working on an extension to stereo, where the basic model assumes that binocular cells in the primary visual cortex receive inputs from afferent cells which are each monocular, each input tuned to the same feature-- a particular orientation and a particular spatial frequency-- with a given spatial disparity between the two eyes. And the last one is an extension of these Gabor models from the spatial domain to the spatio-temporal domain.
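To make the first of these extensions concrete, here is a hedged sketch of a single Gabor kernel applied to luminance and color-opponent channels of an arbitrary RGB image. The opponent definitions and filter parameters are simplifications for illustration, not the lab's exact model, and SciPy is assumed for the convolution:

```python
import numpy as np
from scipy.signal import convolve2d

def gabor(size, wavelength, theta, sigma, phase=0.0):
    """2D Gabor kernel tuned to one orientation and spatial frequency."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength + phase)

# Arbitrary RGB image; the opponent channels below are crude linear
# combinations of the color planes.
image = np.random.rand(128, 128, 3)
r, g, b = image[..., 0], image[..., 1], image[..., 2]
channels = {
    "luminance":   (r + g + b) / 3.0,
    "red-green":   r - g,
    "blue-yellow": b - (r + g) / 2.0,
}

kernel = gabor(size=21, wavelength=8.0, theta=np.pi / 4, sigma=4.0)
responses = {name: convolve2d(ch, kernel, mode="same") for name, ch in channels.items()}
```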
So I decided not to bore you with the details of this architecture. Instead, I wanted to give you a sense of some of our ongoing work, which is essentially, now that we have these multi-cue models of visual processing, to come up with good benchmarks, human benchmarks, to evaluate their accuracy not just in terms of object, 2D object, recognition, but extended to multiple domains.
So just to give you a representative example. Here is work done with one of my staff, Youssef Barhomi. So I'm lucky enough to have a fantastic colleague at Brown, Bill Warren, who has been doing a lot of work on the visual control of navigation. And so, Bill has done a lot of work in essentially putting subjects in virtual reality environments and assessing the kind of cues that are affecting the visual control of their locomotory behavior.
And so, again, with our effort to try to move beyond this 2D object recognition, I couldn't resist the desire to essentially move Bill's subjects from these virtual environments, these virtual reality environments, to an experiment that I'm more familiar with, which is these rapid presentation paradigms. Here's just one example of the kinds of displays we gave our subjects.
So here you see-- oops, sorry-- we give them a very short presentation. So this is a first-person view of a particular scene. This particular example is generated using gaming engines. And the reason why we use gaming engines is that we can get ground truth on essentially anything we want in those scenes. Some of the main models for predicting the pattern of locomotory behaviors from human subjects take as input the relative distance of all the obstacles in the scene.
So we wanted to be able to get an accurate estimate of the depth of every obstacle in those scenes. So what we did here is that we essentially flashed those first-person-view, short video sequences. They lasted for about 200 milliseconds, which we estimated would be just enough for people to be able to integrate motion information-- which we believed would be giving our subjects very strong motion parallax cues-- while blocking, as much as possible, any kind of recurrent cortical feedback.
So here's just where we stand right now in terms of these benchmarks. So here we broke down the intersubject correlation, starting from simple kinds of scenes where we have just a single obstacle. By the way, I forgot to tell you what the task was. We told our subjects that there is a goal which is very far away, and that, given the scene we flash, they have to steer, using a joystick, towards the direction that they think is the best direction to avoid a collision with an obstacle on their course.
So here you see the blue curves correspond to the intersubject agreement for different kinds of scenes with an increasing number of obstacles. What is interesting-- and, again, I'm not going to show you the details here-- but we find that the pattern of results from human observers is actually very well predicted by a mathematical psychology model that Bill had designed a couple of years ago, which is meant to explain how people will walk in these virtual environments.
So, essentially, the rapid presentation here seems to be a very good predictor of the kind of more online control law that subjects would be relying on during a more natural, free-moving experiment. What you see here in red is where we stand right now with our computational models. This is, in particular, a model of motion processing along the dorsal stream of the visual cortex. And so we're not quite achieving human level of accuracy, but we are gradually getting there.
So we know that many different cues provide useful information for a variety of visual tasks. So we've been testing people's ability to segment visual scenes in terms of their contours. There was no available data set which was multi-cue, so we went on and collected our own data set. We used off-the-shelf, consumer-grade stereo cameras.
So we were able to collect these kinds of short video sequences, which are stereo and color and a few hundred milliseconds in length. So here are representative key frames from these databases. Essentially, pictures of my vacation in San Diego. So we had human participants annotate all the contours in these images, and we evaluated the accuracy of different cues in terms of recovering these different kinds of contours.
So without giving you too many details, we found a number of interesting results. First of all, we are, again, not quite at the level of human agreement, but we're hoping to push that. We find that combining all the cues, even the weakest cues, yields a very significant and robust improvement in accuracy.
And to our surprise, we found that although the usual suspects-- things like luminance cues and color cues-- were the most powerful in isolation, when we combine all the cues together, the most useful ones in the combination turned out to be the weakest ones, the motion and the stereo cues, because they are the least correlated with the color and the luminance cues.
All right, so, I'm going to run out of time, so I'm going to go very briefly through that. Essentially, where we are is that I think there is still a major gap between the way we train our modern computer vision systems, or models of visual processing, and the way kids would naturally explore the visual environment. And so this is just to show you representative examples of some data collected in the laboratory of one of my colleagues, Dima Amso, where these kids are wearing portable eye trackers, and we're able to monitor their visual experience as they are exploring new toys and this room.
And you see that the kind of visual experience that these kids are having is very different from the kind of random flash of images that are being used to train our neural networks. And instead, they have access to many different cues, much richer cues-- binocular cues, motion cues that they are able to recover from that object manipulation. So, again, in the interest of time, I think I am going to skip that, but most of the work we're doing at the moment is developing learning algorithms so that we can combine most of these cues to learn richer intermediate visual representations that are able to leverage some of the richness of these visual cues. So I'm going to speed that up a little bit.
Yes, sorry. This was fancy. So just unless-- I mean if you're interested-- [LAUGHTER] I showed you more. OK, so very briefly-- the work we've been doing has essentially taken a sparse coding type of approach. If you are familiar with it, the original formulation, the paired dictionary learning framework, came from Antonio Torralba's lab.
And so the assumption we're making here is that our hope is to learn an intermediate visual representation that leverages multiple cues. And so what we're trying to do is to learn an intermediate level of representation for motion cues, starting from a realistic model of MT cells carrying both speed and motion direction selectivity.
And the assumption is to learn the visual representation for motion jointly with the visual representation carried by a population of disparity [INAUDIBLE] cells, which encode both orientation selectivity and disparity. And so the idea here is that rather than learning those different cues independently, we are forcing them to be learned simultaneously, in addition to learning connections between these two populations of units.
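A very rough sketch of the joint-learning idea-- not the lab's paired dictionary learning implementation-- is to concatenate motion and disparity descriptors for the same patches and learn one sparse dictionary over the joint vector, so that each atom couples the two cues. Scikit-learn and all sizes here are assumptions:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Placeholder descriptors: motion-energy and disparity-energy features
# computed on the same set of image patches (sizes are made up).
n_patches = 500
motion_feats    = np.random.randn(n_patches, 64)
disparity_feats = np.random.randn(n_patches, 64)
joint = np.hstack([motion_feats, disparity_feats])

# One sparse dictionary over the concatenated vector: each atom then couples
# a motion pattern with a disparity pattern, a crude form of joint learning.
dico = DictionaryLearning(n_components=64, alpha=1.0, max_iter=50)
codes = dico.fit_transform(joint)
motion_atoms    = dico.components_[:, :64]
disparity_atoms = dico.components_[:, 64:]
```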
And so we are at the level where-- again, I don't have any quantitative results to show you-- but we're at the level where we can work with the kinds of stimuli that are used in structure-from-motion testing. Neurons have been found in areas of the dorsal stream, such as FST, that become 3D-shape selective from the presentation of these motion stimuli.
And we are making quantitative assessments demonstrating that it's possible to activate, just from the presentation of motion cues, these disparity-selective units in a way which is consistent with what the binocular presentation of the 3D shape would have produced. Sorry.
OK, so just to conclude that part briefly. I hope I shared my excitement about this convergence between neuroscience and computer vision on this idea of feedforward deep learning architectures. I think, if I have to give you two take home messages, it's, one, that I think we need to start worrying about considering more complex elemental neural circuits; we need to move beyond linear models of neurons and consider finer levels of biophysical realism.
And I hope I've convinced you that this idea of extending units from real-valued units to complex units will at least enable us to take into account, for instance, the spike timing between units without necessarily needing to model things at the level of spikes.
In addition, I've tried to make a case that if we want to understand visual processing, we need to move away from the rapid presentation of 2D visual cues and start worrying about how our babies, or small humans, learn from all these cues simultaneously.
So in the last five minutes or so that I have, I want to show you an application of some of these computational models to the automated analysis of behavior. So I probably don't have to tell you too much about the use of behavioral analysis in biomedical research and neuroscience research, but here is just a sample of the kinds of problems that some of my colleagues in biomed are trying to solve.
So they are hiring and using armies of undergrads to just watch these kinds of videos and, you know, essentially, using a stopwatch, assess the amount of time that the animal might be spending doing some kind of repetitive behavior, like hyper-grooming. Or they might care about understanding, in an animal model of social disorders, how one animal interacts with another animal.
They might be interested in figuring out how long an animal spends freezing in studies of learning and memory. And last, they may be interested in recognizing one particular specific behavior-- say, for instance, rating the severity of a particular epileptic seizure in an animal model of epilepsy.
So I think the fact that we're relying so much on human manual annotation is problematic for two reasons. First of all, this work is tedious. And it's done mostly by undergrad students while they watch TV. In addition, many of those tests that I'm actually not showing you here rely a lot on manual intervention by the experimenter.
And we know that this manual intervention can have a very significant effect on the outcome of these experiments. I think there was a paper, for instance, that just came out a couple of months ago-- I think it was in Nature or Science-- showing that mice are actually sensitive to the gender of the experimenter who handles them. And depending on whether it's a female or male experimenter, the outcome of the experiment can be very different.
Even worse, actually, I discovered this ALS Therapy Development Institute here in Cambridge that I was not aware of. There is a major problem in ALS research. Apparently, in the past couple of years, there have been hundreds of studies that have demonstrated that different compounds can actually extend the life expectancy of some of these animal models of ALS. This has yielded hundreds of clinical trials on humans, which have all failed. And the usual explanation is that animal models-- well, they're animals, right? So they're not humans, and that's why the clinical trials failed.
So what this group did is that they relentlessly went back and tried to reproduce the results of the animal studies, the rodent studies, on all of these compounds. They couldn't reproduce a single one of those experiments, not a single one. OK? And part of the reason is that people don't always do the right kinds of controls. First of all, they don't have enough data, so most of the time the statistical power is too low. And then, in addition, they are not doing all the controls as many times as they should be.
So while I don't think automation is necessarily going to solve all our problems, I think removing the human in the loop and increasing the amount of data that can be analyzed and monitored would certainly help these kind of studies. So when I was still at MIT, actually, before I left, we used a computational model that we had developed in Tommy's lab, a model of the dorsal stream of the visual cortex for the processing of motion information.
We showed that we could use this model as a good visual representation and use that as an input to a modern machine learning classifier. And by doing that, we were able to determine, on a frame-by-frame basis, the underlying behavior that the animal was carrying out. So we had eight behaviors: things like resting, micro-movement, grooming, hanging, rearing, walking, eating, and drinking. For these eight behaviors, we were able to reproduce the level of consistency that one can find between human manual annotators.
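A minimal sketch of the frame-wise classification step-- random placeholder features standing in for the dorsal-stream motion representation, and a plain linear SVM from scikit-learn standing in for the classifier actually used in the published system:

```python
import numpy as np
from sklearn.svm import LinearSVC

BEHAVIORS = ["rest", "micro-move", "groom", "hang", "rear", "walk", "eat", "drink"]

# Placeholder motion-feature vectors (one per frame) and per-frame labels
# from the manually annotated training videos.
n_frames, n_features = 5000, 300
features = np.random.randn(n_frames, n_features)
labels = np.random.randint(0, len(BEHAVIORS), n_frames)

# Train on the first part, then predict one behavior label per held-out frame.
clf = LinearSVC().fit(features[:4000], labels[:4000])
predicted = clf.predict(features[4000:])
```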
So while this system makes mistakes, it's performing on par with the kind of accuracy you would get from trained undergrads, whether at MIT or Brown. And I should say that this is a proof of concept we published where, essentially, we were still doing that on a small scale. We would video record individual animals for 24 hours at a time, then get the memory stick, plug the memory stick into a computer, copy the content onto the computer, do the analysis, and do that maybe for a handful of animals.
So when I moved to Brown, I met this fantastic colleague, Kevin Bath, who's an expert in behavior, and he convinced me to invest the time to actually build what I think is one of the very first fully automated behavioral cores. So we went from this kind of proof-of-concept system to something that allows us to record animals 24/7 without any kind of manual intervention.
So right now, the core-- just to pitch a little bit or advertise for the core-- is not running yet at full capacity. I think it's at about 50% capacity, with 24 cages being monitored 24/7. So here you actually see the output of the system on four representative cages. Just to give you an idea, at 50% capacity the core produces 3,000 hours of video per month. We estimate that if you were to fully annotate this video content by hand, with trained undergrads, you would need about 70,000 hours of work per month. That's roughly 24 years of work at eight hours a day, 365 days a year-- or about eight years working around the clock.
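A quick back-of-the-envelope check on those annotation numbers, using only the figures quoted in the talk:

```python
# All numbers are the ones quoted in the talk.
annotation_hours_per_month = 70_000
print(annotation_hours_per_month / (8 * 365))    # ~24 person-years at 8 h/day
print(annotation_hours_per_month / (24 * 365))   # ~8 person-years around the clock
```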
So I see a lot of potential in this approach. So we've already worked with a couple of colleagues, and they have used the system for things that have been published or are currently under review. Just want to give you a sense of what this system is good for. So I'm going to skip that very quickly. I just want to point out one study that Kevin conducted just to show you the power of the method.
What I'm going to show you here are ethograms. So you see here on the x-axis is time. So this is 1:00 AM all the way through 24:00, so midnight. On the y-axis is the amount of time the animal spent for every hour block doing this particular behavior. And so, in general, we monitor the animal for about 5 to 10 days, and then compute these ethograms as the average over the 10 days.
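A minimal sketch of how such an ethogram can be assembled-- assuming one behavior label per second of video (down-sampled from the per-frame predictions) and using pandas purely for illustration:

```python
import numpy as np
import pandas as pd

# One behavior label per second over five hypothetical recording days.
n_seconds = 24 * 3600 * 5
frames = pd.DataFrame({
    "timestamp": pd.date_range("2015-01-01", periods=n_seconds, freq="s"),
    "behavior":  np.random.choice(["rest", "groom", "walk"], size=n_seconds),
})

# Fraction of each hour-of-day block spent on each behavior, pooling the
# recording days into the same hourly bins.
frames["hour"] = frames["timestamp"].dt.hour
ethogram = frames.groupby(["hour", "behavior"]).size().unstack(fill_value=0)
ethogram = ethogram.div(ethogram.sum(axis=1), axis=0)
```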
So Kevin is interested in early life stress, so he has a model to induce stress in these young mice. He first tested his model on males, and to his despair, he found no difference between control and stressed males. So the males wouldn't care whether they were stressed or not; they would groom, rest, or walk in exactly the same way. By the way, I should have said that those grayer areas correspond to nighttime. So you see here, pretty well, the circadian rhythm, where the mouse is much more active during the nighttime and rests much more during the daytime.
What he found for females was very different. For all the behaviors he considered-- and we're just showing three here-- he found a very significant decrease for the stressed females, the early-life-stress model, versus the controls. So stressed females start to groom less. They have a disturbed circadian rhythm, so they start sleeping less. And they become less active.
So it sounded to us like a clear case of depression. Right? A decrease in self-care, deficits in sleeping and walking. So he gave those animals ketamine, which is a form of antidepressant. And after giving ketamine, we found that, essentially, the behavior of the animals went back to the level of the control animals. Essentially, the mice were cured. We were able to validate all those results using more standard tests of depression, and we find very, very good agreement with the automated analysis. So I'm very excited about this automated study of behavior.
We have, right now, a core that's working pretty well for the analysis of single mice. We are now extending it, together with colleagues in engineering, to the study of social behaviors-- understanding how multiple animals interact together. And our goal is to have no manual intervention whatsoever: to just let the animals grow in their colonies and study their behaviors in what we think is the most natural environment for them.
Very briefly, I just want to say that we actually just got a Human Frontier grant to push that work in the wild. And so I've been starting to work with collaborators in Canada and France towards the study of cognition in the wild. And so, essentially, we're trying to automate all the things that my colleagues would be doing by hand. So here we are trying to assess the evolutionary pressure on these animals, and how that affects their behavioral accuracy on various kinds of cognitive tests.
But, essentially, in order to assess the fitness of these animals, we need to make many kinds of measurements, from understanding the survival rate of the baby birds in their nest to assessing the amount of food that these birds are able to bring to the nest, and so on and so forth. So, again, I'm very excited by this approach, and I think computer vision is just at the level where we can start making a difference in many areas of behavioral analysis. So that's pretty much all I had. I think I'm over time, so I'm going to leave the acknowledgements on screen and take questions if you have any. Thank you very much.
[APPLAUSE]
AUDIENCE: Can you give us an intuition about why adding a little teeny bit of the right noise causes [INAUDIBLE] models to go nuts? I'd assume that that would not be the case in the models that we have.
THOMAS SERRE: That wouldn't be the case, although we haven't fully tested it. The way I like to think of it, and I don't think anyone has a very clear, final answer, so I'm putting myself on the line here a bit, is that it has to do with the notion of functional connectivity.
Essentially, what these hierarchical architectures do is detect features, and they detect features throughout many different stages, but there is no explicit relative position information. One assumption is that the architecture would need additional mechanisms in order to recover the geometry, or the relative location, of these features.
I don't know if you know about "bag of words" in computer vision. I think the analogy for models of biology would be a "bag of neurons": they don't know about objects. They don't even have a notion of objects. They essentially detect, maybe, the most salient features that make up those objects.
And what I would like to show, although I haven't shown it fully here, is that if we had some way to bind these visual representations together to build a whole, what we would call a gestalt in psychology, then I think a lot of this IID type of noise would vanish, or at least its effect wouldn't be so strong. Is that enough of an intuition? The short answer is that I don't know, and I don't know that anyone really knows. All we do know is that it's kind of amazing that these models are so sensitive to such small levels of noise, especially IID noise. Right? Yes?
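To make the "bag of neurons" point above concrete, here is a toy illustration, my own example rather than anything from the models discussed: a global pool over feature maps keeps only "was this feature detected anywhere?", so two inputs with the same local features in different spatial arrangements become indistinguishable downstream.

    # Toy illustration of the "bag of neurons" point: a global pool over feature
    # maps discards relative position, so rearranged parts look identical downstream.
    import numpy as np

    def bag_of_features(feature_maps):
        # feature_maps: array of shape (n_features, H, W)
        # Global max-pool: keep only whether each feature was detected anywhere.
        return feature_maps.max(axis=(1, 2))

    rng = np.random.default_rng(0)
    maps = rng.random((8, 16, 16))      # fake feature detections
    scrambled = maps[:, ::-1, ::-1]     # same features, positions rearranged

    print(np.allclose(bag_of_features(maps), bag_of_features(scrambled)))  # True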
AUDIENCE: Doesn't that point to an impracticality of using this as a model of the human visual system? I mean, in the top example I can't tell a single difference between those two images at all. What's the difference between them?
THOMAS SERRE: Yeah, to be fair, you would see it a bit better without the low resolution and poor contrast here. But look, I agree. You can see that the level of noise is not big, but it's there. And what's happening is what happens when you have individual units doing their own local processing, even if there is a globalization, if you will, a more holistic representation, as you build up to higher stages.
You still see that, essentially, most of the units work independently. Right? You can start creating spurious patterns of neural activity by just pushing pixels up or down, because these are linear models, after all. Right? Just [INAUDIBLE]. So, yes?
AUDIENCE: Is it conceivable that just using [INAUDIBLE] similarly weakened? Like, it--
THOMAS SERRE: It's unlikely. I could flash those images for seven milliseconds, and I don't know whether you'd be correct or not, but I'm pretty certain you'd show the same pattern. I haven't done it, but my intuition is that you wouldn't see any difference. You have to see them in print: they are so similar that you actually have to look closely to see the level of noise.
AUDIENCE: Well, that's what I mean. What I'm saying is, I thought maybe that noise was specific to that particular network.
THOMAS SERRE: It is, right? This is the adversarial optimization. It's optimized. Literally, it's the stochastic gradient, right? You find a very good stimulus, and then you tweak the input, increasing or decreasing the RGB values on those channels, so that the network's decision flips. So this is completely optimized for this network. But I guess one way to think of it is that the network, despite the 10 million examples that were used to train it, is still overfitting.
But again, it's overfitting in a way that, if a visual psychophysicist or visual psychologist sees it, I agree, doesn't capture anything about human perception. I don't think that means the whole architecture is wrong. I just think that, in terms of the specific neural elements and the kind of visual representation being emphasized, there is something missing. Yes?
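What "completely optimized for this network" means in practice can be sketched in a few lines. This is a generic gradient-based perturbation in the spirit of what was described (a fast-gradient-sign style of attack), not the specific procedure from the studies discussed; `model`, `x`, `y`, and `epsilon` are placeholders.

    # Sketch of a gradient-based adversarial perturbation, assuming a differentiable
    # classifier `model`, a batched image tensor `x` in [0, 1], and integer labels `y`.
    import torch
    import torch.nn.functional as F

    def adversarial_example(model, x, y, epsilon=0.01):
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)   # loss for the correct label
        loss.backward()                       # gradient of the loss w.r.t. the pixels
        # Nudge each RGB value slightly up or down in the direction that increases
        # the loss, i.e., pushes this particular network toward flipping its label.
        x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1)
        return x_adv.detach()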
AUDIENCE: Sorry for asking two here. Just to comment on this: if the probability of that kind of noise is actually small, does it make any difference? We don't really live in such an antagonistic environment. [INAUDIBLE] What I really want to know is, if we want to make progress beyond feedforward networks and to model it, certainly an interesting way to do that is to make more interesting, complex units. But maybe we need to do something more drastic, especially if the majority of synapses in humans are feedback synapses. So maybe we need to go past the feedforward paradigm to feedback in order to get there.
THOMAS SERRE: Yeah, I don't disagree with that. In fact, I made the conscious choice not to talk about cortical feedback at all, but I think you're right. I think understanding feedforward visual processing might, in the end, turn out to be relatively boring. To tell you the truth, Simon Thorpe, who has done a lot of work on this rapid categorization, gives talks saying, look, it's amazing that the visual system is so fast and so superb at solving these problems.
The truth is that if you spend enough time, like I did, looking at the pattern of errors for either humans or monkeys, it turns out that most of the mistakes are actually quite dumb. So if I can make any strong claim, it's that understanding this feedforward path is within our reach; I see it as something we can solve in the next decade or so. Understanding the feedback is going to take the decades after that. And unless we understand exactly what's going on at the level of this initial bottom-up sweep, I think we're going to have a lot of trouble understanding how the feedback could come about and modulate those visual representations. But that's just my personal opinion.
AUDIENCE: In the work about the complex clockwork.
THOMAS SERRE: Yes, complex-valued units.
AUDIENCE: Then it seems that you are relying on one clock. Did you think of multiple clocks, or a lot of clocks?
THOMAS SERRE: Yes. That's exactly the kind of thing we're thinking about right now. As a simplification, we're assuming that there is one general clock for the whole hierarchy, the whole visual system. But we know very well, from a lot of people here at MIT and elsewhere, that there are many clocks, many frequency bands, and that different visual areas, or different stages of visual processing, may communicate with different clocks at different frequencies. So I think understanding how those clocks interact and what they contribute to visual processing is fascinating.
AUDIENCE: Is this [INAUDIBLE] generalized in theory? Do you think that's a potential now?
THOMAS SERRE: Yes, I mean, again, I would be hard pressed to predict what would happen when you start putting in multiple clocks. But in general, what I'm offering here is just a simple abstract extension of the LN model that would, presumably, enable you to keep track of phase relative to ongoing oscillations, whatever their frequency and whatever area is driving them, and maybe to try to bridge the gap between the very detailed biophysical level that most computational neuroscientists consider and the very high-level abstract level that machine learning and computer vision people work with.
I think this is one way to bridge the gap between these two levels of analysis and to start to understand things like synchrony. People have been talking about synchrony for decades, right? But I don't think anyone has actually shown that it is useful for anything. So this kind of abstraction might enable us to understand, at a functional level, how some of these biophysical mechanisms may contribute to visual function, synchrony being one example. But the list is very long. Yes?
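A rough sketch of what such a complex-valued extension of an LN unit could look like; this is my own abstraction of the idea described, with made-up parameter names and an assumed nonlinearity, not the model from the talk. The unit's output carries both a firing rate and a phase relative to a single reference oscillation, so downstream units can read out timing as well as magnitude.

    # Rough sketch of a complex-valued linear-nonlinear (LN) unit: inputs and
    # weights are complex numbers whose angle encodes phase relative to one
    # reference oscillation ("clock"). Names and the nonlinearity are assumptions.
    import numpy as np

    def complex_ln_unit(inputs, weights, clock_phase, threshold=0.0):
        # inputs: complex array, |z| = input drive, angle(z) = spike phase
        # relative to the ongoing oscillation; weights: complex synaptic weights.
        drive = np.sum(weights * inputs)            # linear stage in the complex domain
        rate = max(np.abs(drive) - threshold, 0.0)  # rectified magnitude: the firing rate
        phase = np.angle(drive) - clock_phase       # phase relative to the reference clock
        return rate * np.exp(1j * phase)            # output carries rate and relative phase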
AUDIENCE: This question is probably more speculative. But you mentioned cognition in the wild. What parameters are you going to be capturing? Are you looking only at vision, or are you looking at other sensory parameters that might confuse or impair recognition and motion? [INAUDIBLE] So, roughly, an analysis of how that child could [INAUDIBLE].
THOMAS SERRE: I'm mostly interested in visual function, so we'll be looking at very basic tasks, things like visual acuity as well as object recognition, motion processing, and so on. But I'm working with evolutionary biologists who are interested in cognition more broadly, so we'll be running basic cognitive tests on those animals. One of my colleagues in particular has done some work, just a proof of concept that this can be done, using a very simple associative learning type of paradigm.
This one, I think I'm showing it here. If you're interested, I can tell you more, but we are building this little weatherproof workstation with a touchscreen, using cheap, off-the-shelf hardware, because we need to deploy a bunch of them. The idea is that we'll be able to go beyond this associative learning paradigm and run essentially any test we want. These are still computers running Python, so we can run all kinds of experiments. We just got the grant for three years. Hopefully, I can come back in three years and tell you about the outcome of the work.