A conversation with Prof. Thomas Serre
Date Posted:
November 7, 2019
Date Recorded:
November 5, 2019
CBMM Speaker(s):
Kamila Jozwik
Thomas Serre
All Captioned Videos Scientific Interviews
Description:
On November 5, 2019, CBMM Postdoctoral Fellow Kamila Jóźwik took the opportunity to sit down and chat briefly with Prof. Thomas Serre of Brown University.
KAMILA JOZWIK: Hello. I'm Kamila Jozwik, and I'm a postdoc here at the Center for Brains, Minds and Machines. And it's my pleasure today to talk with Thomas Serre, who is an associate professor at Brown University. Thomas also did a PhD here with Tommy Poggio, and then stayed on for his postdoc before joining Brown. So to start with, could you tell us more about your scientific background, and then your research interests in general?
THOMAS SERRE: Sure. All my studies were done in France. I studied math and physics as an undergrad. Then I attended what are called the [FRENCH], one of the [FRENCH], which are typically graduate engineering schools.
And while I was studying engineering in France, I realized that engineering was probably not going to cut it for me. And as I was working with artificial neural networks, and working in image processing, I quickly fell in love with biological neural networks. So I looked for a place where I could come and do my final internship for the school-- a group where potentially people would be doing solid mathematical work in machine learning and statistical learning theory, doing work on the engineering side of things, getting computer vision applications out of that, and also studying neuroscience work on learning and vision.
And googling around-- or I guess back then, there was no Google, so AltaVistaing around, I found this place at MIT, CBCL, which was the Center for Biological and Computational Learning, which was Tommy Poggio's group back then, and realized that it felt like an ideal place for me. So I applied for an internship, got it, and came here, and just fell in love with the place. And decided to stay, and so I applied for the PhD program in BCS, in brain and cognitive sciences. I spent about a year, I think, just working on computer vision with Tommy and a postdoc, [INAUDIBLE]. Started my PhD here in 2001. Was a graduate student from 2001 to 2006. And then I was having too much fun to leave, so I decided to stay for my postdoc, so I spent another four years as a postdoc here, and started at Brown in January 2010.
KAMILA JOZWIK: At Brown, how did you-- what is your current research about, and what did you decide to work on after leaving MIT?
THOMAS SERRE: Yeah. So my PhD with Tommy was essentially on modeling the feed-forward processes involved during so-called rapid visual categorization tasks, something that has been dear to me for many years. I worked on extending a model known as HMAX, for Hierarchical Max, which is some type of an ancestor of our modern deep convolutional neural networks. So that was essentially most of my work as a graduate student.
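The HMAX idea mentioned here can be sketched in a few lines of numpy. This is a simplified illustration, not the original model: an "S" (simple-cell, template-matching) stage alternating with a "C" (complex-cell, max-pooling) stage, the "Max" that gives HMAX its name. The template and image below are arbitrary toy choices.

```python
import numpy as np

def s_layer(image, templates):
    """'Simple'-cell stage: correlate the input with a bank of
    templates (a crude stand-in for Gabor-like filters)."""
    h, w = templates.shape[1], templates.shape[2]
    H, W = image.shape
    out = np.zeros((len(templates), H - h + 1, W - w + 1))
    for k, t in enumerate(templates):
        for i in range(H - h + 1):
            for j in range(W - w + 1):
                out[k, i, j] = np.sum(image[i:i+h, j:j+w] * t)
    return out

def c_layer(maps, pool=2):
    """'Complex'-cell stage: max-pool over local neighborhoods,
    building tolerance to position."""
    k, H, W = maps.shape
    out = np.zeros((k, H // pool, W // pool))
    for i in range(H // pool):
        for j in range(W // pool):
            out[:, i, j] = maps[:, i*pool:(i+1)*pool,
                                j*pool:(j+1)*pool].max(axis=(1, 2))
    return out

# Toy example: one vertical-edge template applied to a small image.
image = np.zeros((6, 6))
image[:, 3] = 1.0                      # a vertical bar
templates = np.array([[[-1.0, 1.0],    # 2x2 vertical-edge detector
                       [-1.0, 1.0]]])
s = s_layer(image, templates)          # shape (1, 5, 5)
c = c_layer(s)                         # shape (1, 2, 2)
print(s.shape, c.shape)
```

Stacking several such S/C pairs, with learned rather than hand-set templates, is essentially the convolution-plus-pooling recipe of modern CNNs.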
As a postdoc, I was able to branch out. I did some work that is very dear to me with Bob Desimone, where we worked on attention. We built computational models of attention, understanding how attention and object recognition interact.
And then at Brown, it was-- I had some kind of-- I guess some sort of an epiphany, because I'm in a cognitive, linguistic, and psychological sciences department, where there's a number of people studying vision, and very different aspects of vision way beyond object recognition. People studying reaching and grasping, the control of locomotory behavior, low-level vision, high-level vision. And suddenly, it occurred to me that object recognition was too narrow of a field, so I embarked on deciding to study vision with a capital V.
So we've done a number of studies. I'm still very much interested in vision, visual processing. My long-term goal has been to build, essentially, large-scale computational models of the visual system toward better understanding the brain mechanisms that guide our everyday visual processing tasks.
More recently, we've been focusing on the role of feedback in our visual system. We've been trying to understand how feedback acts in our visual cortex, toward trying to build better, smarter computer vision algorithms.
KAMILA JOZWIK: Yeah, and you did a lot of work recently about recurrence, and the importance of recurrence for different [INAUDIBLE] that humans solve. Could you tell us more about what you think we really need recurrence for in the brain?
THOMAS SERRE: Well, I think that's the million-dollar question. What do we need recurrence for? Well, I think we know that there are tasks that potentially do not require recurrence. So we were just talking about rapid categorization tasks, something that you are very familiar with. Presumably, when we constrain visual processing to be fast for things like image categorization, scientists have been working with the assumption that one way by which our visual system can speed itself up to its temporal limit is by allowing decisions to be made based on a single feedforward sweep of activity.
And perhaps not surprisingly, the dominant class of computational models of vision is still feedforward hierarchical models. Interestingly, there are probably as many, if not more, feedback connections in our visual system than there are feedforward connections, which raises the interesting question of what feedback is for.
And so while, presumably, for certain easy-enough image categorization tasks, feedforward, pre-attentive processing is probably sufficient, under more severely degraded visual conditions-- perhaps under occlusion, under noisy conditions-- there is probably a need for feedback even in object recognition. And indeed, there is recent work from [INAUDIBLE] now suggesting that this is indeed the case.
But I think beyond object recognition, there are many tasks that are not just image categorization. I think, you know, [INAUDIBLE] proposed a number of visual routines some 30, 40 years ago-- things like figuring out whether there's a path to go from one place to another. Perceptual grouping, we know, presumably cannot, under certain circumstances, be solved with purely feedforward processing. And there is neuroscience evidence that horizontal connections between neurons at different locations in the visual cortex, as well as top-down connections from higher into lower visual areas, do play a role in some of those-- I would call them more general visual reasoning tasks, beyond just image categorization. So I would say most of vision beyond this rapid categorization probably requires feedback mechanisms.
KAMILA JOZWIK: Mm-hmm. And beyond feedback, what are other brain-inspired constraints that we should build into deep nets?
THOMAS SERRE: Yeah, again, I think this is another million-dollar question. To me, it's fascinating to see how much progress we've made in computer vision and AI with what are mostly feedforward neural networks, which raises the interesting question, as we were discussing, as to what feedback is for.
We know in vision, from several decades of visual neuroscience work, that there are many other functions or computations that are required for our everyday visual recognition tasks-- things like working memory, things like attention, among many other things. And so I do not necessarily have evidence at this very moment that incorporating working memory, or attention, or the kinds of gating mechanisms that we know are taking place and routing visual information in our visual system, are required for the next generation of computer vision algorithms.
But if I had to put my money on a next important direction for research, I would probably put my money on some of those functions-- things that we know humans, and primates more generally, seem to leverage for certain visual recognition tasks.
KAMILA JOZWIK: Mm-hmm. So it seems that you want to really expand the tasks that deep nets do. Do you think that it's also worth trying to change some other biological constraints? Like, for example, changing the learning rule, because it's not biological; changing the visual diet of the images that we give these networks to train on; or some architectural stuff beyond the recurrence-- I don't know, like topography, or some human brain anatomy-inspired things. Do you think it's important, or--
THOMAS SERRE: I think it's really important. In fact, my PhD thesis was really about building on earlier work, trying to extend a model that was somewhat loosely constrained by anatomical and physiological data to make it more closely consistent with the anatomy and the physiology of the visual system.
I literally spent, I think, two or three years of my life reading as many monkey electrophysiology, or mammalian electrophysiology, papers as I could find, toward coming up with some kinds of physiological wiring constraints-- architectural constraints, as you would put it: constraining receptive field sizes to be of a certain size so that they could closely mimic the receptive field sizes in the corresponding brain areas, constraining the pooling in our convolutional neural network so that we could model quantitatively the receptive field sizes of simple and complex cells in various brain areas, et cetera.
So yes, I think those things do matter. Sadly, I have to acknowledge the fact that today, it seems as if, by considering computer vision systems optimized for image categorization that do not have any constraints from biology-- or at least very limited quantitative constraints-- those models seem to be fitting experimental data, monkey electrophysiology data, pretty well.
So I would say the jury's still out in terms of how much adding those constraints will be needed: one, to better account for neural responses in the visual cortex, and two, perhaps, again, to lead to the next generation of computer vision architectures.
But I think you made excellent points. I think most visual neuroscientists would agree that our visual diet, and the way babies learn, is very different from the way our computer vision algorithms learn. We're not just being flashed with random IID samples and class labels-- for one thing, we have access to a much richer visual world, where multiple cues are available to us to, say, parse figure from ground, to get better estimates of the surface properties of objects, and things of that sort, which are potentially very hard for a modern neural network to learn simply because of the very impoverished visual diet, as you put it, that we feed to them.
And so in that sense, I think, yes, there is a lot to be learned from development, from neuroscience. I think it is up to us neuroscientists, computational neuroscientists, to make sure that those claims are backed up by actual data-- meaning showing that we can beat the state of the art by taking some of those constraints into consideration.
KAMILA JOZWIK: That's true. You mentioned that deep nets predict a lot of data, and that definitely is true. But also, your work pointed out some problems with deep learning. You were comparing--
THOMAS SERRE: In fact, I would say this is most of what we do-- trying to bring in the state of the art to pinpoint [INAUDIBLE].
KAMILA JOZWIK: Yeah, yeah, yeah, exactly. So do you think that we just need to keep our model class being deep nets and just fine-tune them, as you have also shown in your work, making the features that they are sensitive to more similar to the human ones? Or maybe we need to start thinking about some other model class that is not deep learning, but something else. Or do you think that we can just fine-tune deep nets, and then we can explain full brain [INAUDIBLE]?
THOMAS SERRE: I think a little bit of both. I think it's hard. Given what we know about neuroscience, again, we have to remember that deep convolutional neural networks are grounded in visual neuroscience. And so today, they are still the best models we have available. Whether those models will pass the test of time, I don't know. But at the moment, it's hard to imagine a model of the visual system or the brain that would not involve some form of deep learning, or at least a deep neural network. It is probably not as deep as our modern deep neural networks-- we know, for one thing, that our visual system is much shallower than the hundreds of processing layers that are found in our modern architectures-- but it's hard for me to imagine a final, ultimate model that would not involve some kind of feedforward neural network.
Now, we also know that those models are not sufficient. We've been discussing the fact that we need to take feedback into account. There's a growing body of literature suggesting that feedback is needed to better model and account for the neural responses along the ventral stream. We and others have started to point to key limitations of the state of the art in certain visual tasks that are easy for humans and hard for deep neural networks.
I should point out that what I mean by hard is that these are universal approximators, so they can learn any arbitrary mapping from input to output, but for certain tasks, they might require a very, very large number of training examples-- orders of magnitude more than what humans and babies would potentially need.
So regarding your question about what's missing to get the next generation of smart seeing machines or brain models, I think you hit all the right points. I think it's going to be a combination of additional computations that are currently lacking in those architectures. So we've discussed working memory and attention, even though, of course, there's an increasing realization that attention actually is important, even in computer vision.
It's going to be ways to better approximate, or at least improve, the way those algorithms are being trained. So as you mentioned, one way to help deep neural networks is potentially by leveraging human supervision-- for instance, by instructing deep networks to care about certain parts, or certain object features, when they are trained: the same object features that seem to be important for human subjects.
So I think it's going to be a combination of a lot of things. It's going to be extending what we have, changing the visual diet of those algorithms. And perhaps in the long run, it's going to be rethinking entirely the class of models that we are considering. But it's hard to envision that at the moment, given the, I would say, overwhelming evidence from neuroscience that these architectures [INAUDIBLE] some level of realism and resemblance to biological neural networks.
KAMILA JOZWIK: Yeah. And so it seems that in vision, we are going in the right direction. But what are some other domains that you think we are really far from reaching human-level intelligence?
THOMAS SERRE: Yeah. So you say in vision; I would say in image categorization. In fact, there was a study, I think published just last year, showing that the state of the art in, say, facial recognition is not just matching you and me in our ability to recognize faces, but is on par with the very best humans we have, the facial forensic experts. So to me, this is stunning progress in the field.
So I would say many would claim that image categorization has been solved. At the same time, I think when you push these systems beyond computer vision benchmarks, you still have things falling apart a little bit. And what I mean by this is one example in the real-- actually, one example: autonomous vehicles. So I have to confess that just a few years ago, I would tell my family, over a casual conversation at the family table, that they should expect self-driving cars within a few years.
I'm no longer convinced that this is the case. And I think we're seeing the limits of the current learning paradigm in the sense that we have algorithms that are very good at storing a large number of training examples. But there is, I think, some amount of evidence that there's not a lot of generalization beyond training data.
And so what I mean by this is that the world is essentially-- our visual world is completely open-ended, and so there is no way that we're going to be exposed to all possible image degradations that could be applied to pictures of pedestrians, or cars, or street scenes, and things of that sort.
However, we know that at the moment, the state of the art in computer vision is able to be trained on specific types of image degradation. So I can apply a very specific kind of noise to the images that are used to train these neural networks, and then they'll be able to recognize pedestrians and cars much better than you and I, at levels of degradation that actually make recognition impossible for us.
And yet it is also known that if we make a tiny change in the type of noise applied to, say, the test data-- if I go from a salt-and-pepper type of noise, if you've played with Photoshop, to a Gaussian type of noise-- suddenly, the accuracy of these algorithms collapses.
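The two noise types contrasted here are easy to make concrete. Below is a minimal numpy sketch (the image, noise fraction, and sigma are arbitrary illustrative choices, not taken from the interview): salt-and-pepper noise flips a small fraction of pixels to pure black or white, while Gaussian noise perturbs every pixel, so the two produce very different pixel statistics even at similar "amounts" of corruption.

```python
import numpy as np

rng = np.random.default_rng(0)

def salt_and_pepper(image, amount=0.1):
    """Flip a random fraction of pixels to pure black (0) or white (1)."""
    noisy = image.copy()
    mask = rng.random(image.shape) < amount
    noisy[mask] = rng.choice([0.0, 1.0], size=mask.sum())
    return noisy

def gaussian_noise(image, sigma=0.1):
    """Add zero-mean Gaussian noise to every pixel, clipped to [0, 1]."""
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

image = np.full((8, 8), 0.5)   # a flat gray test image
sp = salt_and_pepper(image)    # most pixels untouched, a few at 0 or 1
gn = gaussian_noise(image)     # every pixel slightly perturbed
print(sp.shape, gn.shape)
```

A network trained only on images like `sp` sees a very different input distribution from `gn`, which is one way to think about why accuracy can collapse when the test-time noise type changes.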
And so these algorithms are able to generalize to image degradations that they were trained with. But so far, there is no evidence that they can truly generalize to image degradations that they've never seen before. This is a point that Pawan Sinha here in BCS made when I trained here some 17 years ago: that one of the hallmarks of human vision is this uncanny ability that we have to deal with potential image degradations that we were never exposed to before.
And so I think that's really where the challenge is for deep learning: dealing with the unknown and the unseen. And I think the fact that we don't have self-driving cars reflects the fact that we can keep driving millions of miles around the country to try to find those edge cases-- those weird illuminations, those weird weather conditions, those weird specular highlights. We will never run out of edge cases, and so we're going to be driving those cars around potentially forever unless we come up with a better solution.
KAMILA JOZWIK: Mm-hmm, that's true. And many young researchers may be watching this interview, so do you have any closing advice for the people in this field, in terms of what they should think of as next questions, and in terms of fruitful careers in this field?
THOMAS SERRE: Yeah, I would tell them to study vision. I know that I'm going to sound like an old fart, but I find it shocking to attend computer vision conferences and talk to brilliant young researchers who know a lot about deep learning, about the latest library to implement their algorithms, about all kinds of tricks and heuristics to get these very deep neural networks to learn anything at all. But often, I'm baffled by how much the classical training in computer vision has been forgotten, and how often it is missing from general computer science, or even AI, training.
And so I would tell students to go back to classic computer vision training, understanding the mathematical processes behind vision. I would tell them to go back to visual perception, to go back to some of the points that our mentors were making already in the '50s and '60s about what makes human vision so much better than sheer, brute-force template matching.
And I think we need more people studying and working in the area of deep learning who approach the problem from a cognitive psychology or visual neuroscience perspective, with the idea of understanding what's happening in the system, and really probing-- not necessarily focusing on pushing benchmarks and pushing the state of the art by a few percent, but really tackling the hard problems. Figuring out the edge cases. What are the things that are easy or hard for those networks to learn, and how does this compare to humans' ability to learn?
KAMILA JOZWIK: Thank you very much for joining us today. It was a pleasure to talk with you. Thomas will give a talk later today, which you can see on the CBMM channel. And thank you very much for watching, and see you next time.
THOMAS SERRE: Thank you.