CBMM Panel Discussion: Should models of cortex be falsifiable?
December 7, 2020
Presenters: Prof. Tomaso Poggio (MIT), Prof. Gabriel Kreiman (Harvard Medical School, BCH), and Prof. Thomas Serre (Brown U.)
Discussants: Prof. Leyla Isik (JHU), Martin Schrimpf (MIT), Michael Lee (MIT), Prof. Susan Epstein (Hunter CUNY), and Jenelle Feather (MIT)
Moderator: Prof. Josh McDermott (MIT)
Abstract: Deep Learning architectures designed by engineers and optimized with stochastic gradient descent on large image databases have become de facto models of the cortex. A prominent example is vision. What sorts of insights are derived from these models? Do the performance metrics reveal the inner workings of cortical circuits or are they a dangerous mirage? What are the critical tests that models of cortex should pass?
We plan to discuss the promises and pitfalls of deep learning models contrasting them with earlier models (VisNet, HMAX,…) which were developed from the ground up following neuroscience data to account for critical properties of scale + position invariance and selectivity of primate vision.
JOSH MCDERMOTT: Welcome, everyone. Thanks for joining us here. So this is going to be a discussion of some of the issues that are raised by the newfound role of contemporary neural network models in neuroscience, particularly, with regard to sensory systems.
So this is-- what we're going to do here is have presentations by three people who are involved in this work. And then we've got a bunch of people assigned as discussants. And we'll, obviously, hope to get a lot of involvement from people who are here.
So the plan is that each of our three presenters will speak for seven minutes, showing some slides to present some material that will stimulate discussions. And then we will give our discussants an opportunity to give their thoughts and then open it up to the floor and see what happens.
So the people who are going to be presenting today are Tommy Poggio, the director of CBMM who has worked in this area for a long time, Gabriel Kreiman who's a core member of CBMM who's a professor at the Harvard Medical School, Thomas Serre who's a professor at Brown University who's an esteemed graduate of the BCS PhD program, works on computational vision.
And then, for discussants, we have another star alumnus, Leyla Isik, who was a core contributor to CBMM for many years while she was a grad student and then a postdoc, and she's now a professor at Johns Hopkins, Martin Schrimpf who is a grad student in Jim DiCarlo's lab who's been one of the main people working on Brain-Score, which I think will be discussed today, Michael Lee who's also a grad student in Jim's who's also done some work on Brain-Score, Susan Epstein who's a professor of computer science at Hunter CUNY and a member of CBMM, and Jenelle Feather who's a grad student in my lab who works on computational models of both the auditory system and the visual system.
So, with that, I'm going to turn it over to our presenters. I think we're going to start with Tommy and then go to Gabriel and then Thomas.
TOMASO POGGIO: Well, thank you, Josh, for the introduction and for playing the role of moderator. It will be tough. So part of the fun of being a scientist, I always thought, is the ability or possibility to argue with gusto about specific ideas, theories, data. And, of course, by the way, this is the essence of being an illuminist. If you read just the first 20 pages of Horkheimer and Adorno's Dialectic of Enlightenment, they are clear on this.
But I want to do just that today and to draw a few provocation lines in the sand. So, first, models of cortex are falsifiable in the same way that models of the action potential are. And we should try harder to falsify them and be less lazy. There are too many labs getting into this modeling simply because software is easily available, without checking how much biological support there is for some of their key hypotheses.
So these are the three points. You can see them, right? And it's kind of interesting that HMAX-- HMAX was a model that was developed in my group, first, by Max Riesenhuber and then Thomas Serre and Gabriel and others. And this was a more quantitative version of Fukushima's model, which was itself an implementation of Hubel and Wiesel ideas of hierarchies of simple and complex and hypercomplex cells to explain visual cortex.
So the original goal for HMAX was really to falsify a hypothesis. And the story was this. You see HMAX here, by the way. This is a series of layers, starting with simple cells and complex cells, then simple and complex cells again in V4, and simple and complex cells in IT.
And the motivation for it, the reason we started this project, was because of the following. Back in the '90s, we had a model with Shimon Edelman trying to explain how we may recognize 3D objects like faces or like these paper clips that you see up here. This is one three-dimensional paper clip seen from different viewpoints, like you may see a face from different viewpoints.
And the idea we pushed is that you're able to recognize 3D objects because you store a series of views, different points of view, of that object. And then you are able to extrapolate and interpolate between these views. It was a simple, shallow network, a radial basis function network, with view-tuned neurons in the middle and an output combining them to get a view-invariant output.
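The view-interpolation idea can be sketched in a few lines. This is a toy illustration, not the original Poggio-Edelman model: the feature function, the stored viewpoints, and the tuning width are all invented for the example.

```python
import numpy as np

def view_features(angle, dim=50):
    """Stand-in for the feature vector evoked by one viewpoint of the
    object: an arbitrary smooth function of the viewing angle."""
    freqs = np.arange(1, dim + 1)
    return np.cos(freqs * angle) / freqs

# View-tuned hidden units: one Gaussian RBF centered on each stored view.
stored_angles = np.deg2rad([0, 60, 120, 180, 240, 300])
centers = np.array([view_features(a) for a in stored_angles])

def view_tuned_units(x, sigma=0.5):
    d2 = ((centers - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

def view_invariant_output(angle):
    """Output unit: combines the view-tuned units, interpolating
    between stored views to stay roughly constant across viewpoints."""
    return float(view_tuned_units(view_features(angle)).sum())

# Novel, intermediate viewpoints still drive the output unit.
for deg in (0, 30, 90, 210):
    print(deg, round(view_invariant_output(np.deg2rad(deg)), 3))
```

The hidden units behave like the view-tuned cells Logothetis and Pauls found, and the output behaves like the view-invariant cell: selective for this object's views, roughly constant across viewpoint.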
And we tested that with psychophysics, Bulthoff and Edelman especially, and then there was physiology done in IT by training monkeys to recognize a few of these 3D objects among hundreds of distractors. They were trained for two or three months.
And, after that, Jon Pauls and Nikos Logothetis were able to find cells that were view-tuned, like the ones you see up here. You see three different cells tuned to different views of the same paperclip, one of those that the monkey had learned to recognize. And you see also a neuron, lower bottom, that was very selective for that paperclip, but invariant to the views.
So it was really a nice confirmed prediction, but there was a puzzle that came up from the data, which is what you see on the right-hand side here, where you see that a cell that is very selective to a view of a paperclip is also very invariant to the scale of that paperclip. Half the size or double the size elicits the same type of selective response.
And so the question was how can you achieve that, this kind of simultaneous selectivity and the invariance. And the obvious idea was to see whether a Hubel and Wiesel type of architecture, a sequence of simple and complex cells and hypercomplex cells, could achieve the selectivity and the invariance to translation and, mostly, to scale.
And I thought it was possible to show such an architecture could not do it, but, in fact, we found it could. So we verified the hypothesis that a Hubel and Wiesel type model could achieve that kind of invariance. By the way, I don't think present deep models of visual cortex, the ones listed on Brain-Score, can pass this simple test.
We are able to see a face once. Suppose you have never seen this face before. And then I can ask you, is this the same face or not? Or is this the same face or not? And I can, of course, do this with distractors.
And the answer is, of course, yes, you're able to recognize a face you've seen once at different scales, at different distances from it. I don't think any of the current models can do that. This was what HMAX did, and it's, clearly, a simple and basic ability of human vision. So that's one challenge.
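For readers who want a concrete picture of the simple/complex alternation Tommy is describing, here is a toy one-dimensional sketch. The signals and template are invented for the example; the real HMAX operates on 2D images across multiple positions and scales.

```python
import numpy as np

def simple_layer(signal, template):
    """'Simple' units: normalized template matching at every position."""
    k = len(template)
    t = template / np.linalg.norm(template)
    out = []
    for i in range(len(signal) - k + 1):
        w = signal[i:i + k]
        n = np.linalg.norm(w)
        out.append(w @ t / n if n > 0 else 0.0)
    return np.array(out)

def complex_layer(responses):
    """'Complex' unit: MAX over positions, trading spatial precision
    for tolerance to where the feature appears."""
    return float(responses.max())

template = np.array([1.0, 2.0, 3.0, 2.0, 1.0])

# Same feature at two different positions -> identical complex response.
a = np.zeros(40); a[5:10] = template
b = np.zeros(40); b[25:30] = template
r_a = complex_layer(simple_layer(a, template))
r_b = complex_layer(simple_layer(b, template))

# A different feature -> weaker response: selectivity is preserved.
c = np.zeros(40); c[5:10] = np.array([3.0, 1.0, 0.5, 1.0, 3.0])
r_c = complex_layer(simple_layer(c, template))

print(r_a, r_b, r_c)  # r_a == r_b, and r_c is smaller
```

The MAX pooling gives invariance to position without retraining anything, which is the property at stake in the one-shot face test: the template is stored once and the response tolerates where (and, with pooling over scales, at what size) it appears.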
And the last slide I want to show is just a general point. I will speak more about it, but it is that I think absolute performance can be quite misleading as a test for models. It encourages nonbiological implementations, such as backpropagation. And, if you do that, you should then say stochastic gradient descent is a critical prediction of my model, and we should find, in the brain, how it is done and where. And, if we don't find it, we should probably abandon the model.
There are other points related to overfitting and to recent work on the neural tangent kernel that says something interesting about the limits of performance as a measure of quality of models. So let me finish here. Probably, I was too long anyway, and let's go to the next one.
GABRIEL KREIMAN: So thanks for inviting me to participate in this discussion. I'm going to be brief. I mostly wanted to be provocative here and stimulate discussion. I find it extremely exciting that we're in a new era of quantitative models, as opposed to a plethora of non-falsifiable models, word models, pseudo-quantitative models, models that, ultimately, require some sort of homunculus or engine to tell us more.
So what's exciting about falsifiable models and quantitative models is that they can be image computable in the case of vision, quantitative, predictive. And the question that we face today, I think, is, when we have models that have enormous numbers of free parameters, are those models falsifiable or not.
So I want to say a few words about cross-validation, extrapolation, and interpolation. Models with lots of free parameters may be able to interpolate within some sort of distribution, but, generally, tend to fail to extrapolate. And one of the excuses here in terms of trying to defend ourselves against overfitting is to use cross-validation. And I would argue that random cross-validation does not really show generalization out of distribution.
What exactly we mean by distribution is a matter that maybe we can discuss more about. And I just want to give you a couple of quick examples of our struggles with building models of responses in cortex and to what extent they can generalize to novel types of images.
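Gabriel's distinction between random cross-validation and out-of-distribution generalization is easy to reproduce on synthetic data. In this sketch (the "neural response" and the polynomial readout are illustrative assumptions, not real data), the same flexible model looks excellent under a random split and fails badly when an entire region of input space is held out:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, size=500)
y = np.sin(2 * x) + 0.05 * rng.normal(size=x.size)  # synthetic "neural response"
X = np.vander(x, 10)  # degree-9 polynomial features: a flexible readout

def r2(train, test):
    """Fit a least-squares readout on `train`, return R^2 on `test`."""
    w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    resid = y[test] - X[test] @ w
    return 1 - (resid ** 2).sum() / ((y[test] - y[test].mean()) ** 2).sum()

# Random cross-validation: test points interleave with training points.
mask = rng.random(x.size) < 0.8
r2_random = r2(mask, ~mask)

# Out-of-distribution split: hold out an entire region (x >= 1).
in_dist = x < 1.0
r2_ood = r2(in_dist, ~in_dist)

print("random split R^2:", round(r2_random, 3))
print("held-out region R^2:", round(r2_ood, 3))
```

The random split only tests interpolation within the training distribution; holding out a region forces extrapolation, where the flexible fit collapses.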
So, first, I want to just recap the basic methodology. This goes back to work that I hope most people are very familiar with-- really beautiful work by Dan Yamins and Jim DiCarlo and many others where they took deep convolutional neural networks, in this case, sort of precursors of many of these modern networks, and showed that one can explain a significant fraction of the variance in the responses of neurons in IT cortex.
And, if you look at that paper, sort of somewhat hidden in supplemental figure 9 and 10, you see that this performance and how well we can explain cortex decreases quite a lot when we try to generalize across different object categories. So this is already one hint that generalization is complicated.
So here's another version of the same thing, of the same idea, using the same data that Jim and others graciously shared, showing what happens if you try to train these kind of models with only one training category. This is work that was done by Thomas O'Connell in the CBMM summer school.
And I won't get into a lot of the details here. The basic point I want to make here is, if you train these models with one training category here, they really struggle. We can still get above-chance performance in explaining the responses or here, in this case, in doing classification from single-shot decoding. But we lose a lot of the ability to generalize when we have only one type of category.
So I'm going to give you one more example of our struggles with generalization with these type of models. And this has to do with a study that we published recently. This is work done by Carlos Ponce and Will Xiao and Marge Livingstone where we developed this architecture that we call XDream, which, basically, you can think of this architecture as a hypothesis generator.
So this has a generative model that creates images. We record activity of neurons. And then we have a closed loop system here with a genetic algorithm where we can keep generating images that give us a better fitness function.
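Schematically, the XDream loop looks something like the following. Everything here is a toy stand-in: the real system uses a deep generative network for images and actual recorded neurons, not the simulated ones below.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 20                             # latent code dimensionality (toy choice)
PROJ = rng.normal(size=(64, DIM))

def generator(code):
    """Toy 'image generator': a fixed random projection of the code."""
    return PROJ @ code

PREFERRED = generator(rng.normal(size=DIM))  # neuron's hidden preferred stimulus

def neuron(image):
    """Simulated firing rate: cosine similarity to the preferred image."""
    return float(image @ PREFERRED /
                 (np.linalg.norm(image) * np.linalg.norm(PREFERRED) + 1e-9))

# Closed loop with a simple genetic algorithm: score, select, mutate.
population = rng.normal(size=(30, DIM))
for generation in range(50):
    rates = np.array([neuron(generator(c)) for c in population])
    parents = population[np.argsort(rates)[-10:]]      # keep the 10 best codes
    children = (parents[rng.integers(0, 10, size=20)]
                + 0.3 * rng.normal(size=(20, DIM)))    # mutated copies
    population = np.vstack([parents, children])

print("best normalized firing rate:", round(float(rates.max()), 2))
```

The key property is that the loop needs no model of the neuron at all: it treats the firing rate as a black-box fitness function, which is why the images it discovers make good out-of-distribution tests for fitted models.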
For example, in this case, we tried to maximize the firing rate of neurons in inferior temporal cortex. So I won't have time to do justice to this work. You can scan this QR code and go to the paper. We generate images that look like this. This is one of the images that we generated.
And this image is extremely effective in driving the activity of one of the neurons in inferior temporal cortex. So, by the kind of nomenclature that we used to define orientation tuning or any other kind of feature tuning in cortex, this is what the neuron likes. This neuron in IT cortex likes this particular image up here.
So then we tried to fit the responses of neurons in IT using deep convolutional neural networks followed by a linear mapping with lots of free parameters that we tuned. And, indeed, we can do this pretty well with what we call reference images. These are lots of natural images that we presented.
We do the typical cross-validation by randomly selecting a subset of these random images. We have predicted firing rates, which match the observed firing rates quite well. And this is consistent with the work that I mentioned earlier and with work that many other people have done.
But then, when we look at the images that we generated that were not used in any of the data-fitting exercises, we see that we struggle to really generalize. There's still some sort of correlation between the predicted firing rates and the observed firing rates, but correlation is a very poor metric to describe what's actually happening here: in terms of explained variance, the actual firing rates, the ones elicited by these images, are much higher than what would be predicted by this linear fitting exercise.
Conversely, if we use a substitute model approach, this is what some people have-- Jim and others have called a control model. We build a model, and we try to generate images based on the model. Then, these images, the predictions tend to be much, much higher than the actual firing rates that we get. These are two examples of our failures to generalize when we try novel images, either novel images created by a substitute model or novel images created by our XDream approach.
So I want to conclude here because I want to really move on to the discussion, but the main point is that I would move these models with large numbers of parameters further and further to the right, toward non-falsifiable. The problem with word models is that you can always change your words. You can always change your words to fit whatever data you want.
And this is the same problem with models that have very large numbers of free parameters. Give me enough parameters, and you can fit whatever you want. And, therefore, I think that many of these models that have enormous numbers of data fitting in them are not very falsifiable, perhaps, not falsifiable at all. I'll stop there.
JOSH MCDERMOTT: So Gabriel, thank you. Just, if I may exert moderator's privilege, on the slide that you just presented, like do you intend to imply that current deep network models fall somewhere in that space that you're depicting? Or are you agnostic about that?
GABRIEL KREIMAN: I think the question is how many free parameters you have. So, if you have thousands of free parameters-- if every single experiment you run requires refitting thousands of parameters to your data, then I think they fall on the right. They're not falsifiable.
If you can-- if you have a neural network that you train once, and then, every single subsequent experiment, you use the same model, I think that that's a falsifiable model. For example, you can do the Francis Crick experiment that Tommy was alluding to, OK? The important thing is whether you need to refit and do linear fitting or retrain or do backprop again for every experiment. If you have to do backprop or linear fitting again in every experiment, that's not a falsifiable model.
JOSH MCDERMOTT: OK, I think those are issues that we can discuss more, but we can do that. Let's turn to Thomas.
THOMAS SERRE: All right, well, thank you so much for having me. It's a real pleasure to be surrounded by so many old friends. I don't have a lot to add on to what Tommy and Gabriel already said. I'll just try to maybe clarify my view in light of what they've said.
Maybe just two quick points to answer to Tommy, I'm not entirely sure that I share Tommy's optimism for models of the cortex to be falsifiable. I think this is, obviously-- at least, not in the physicist sense because I think biology is somewhat messier. That said, it shouldn't be an excuse for being lazy, as Tommy would put it.
Just to recap what we've already been saying, there's a heavy focus today on fitting deep nets to neuroscience data. Those benchmarks are, obviously, useful. I think we can all agree with that, but they are insufficient. And so what I want to do for the next three or four slides that I have is to provide examples of alternatives to, or maybe softer versions of, falsifiability.
So I think one property, for instance, of models is that they should be faithful to neurobiology. I think, just as a case in point, I used here a recent study from Pospisil, Pasupathy, and Bair. Anitha Pasupathy has been recording from V4 for many, many years now.
This is one of their latest studies, where they record the responses of V4 neurons to relatively simple shapes. The details of the experiments are not all that important. They take representative deep neural networks. They assess the ability of different layers of these neural networks to be fitted to V4 data. They find that many of the layers actually fit V4 data equally well.
However-- and this is to Tommy's point-- when they compare the tolerance properties, tolerance to position, of both V4 and model data, they find that, if you want to get the same amount of invariance from one of the deep neural networks, you need to go through quite a few stages of processing and, in fact-- for the aficionados here-- through stages of processing that are fully connected.
So there seems to be somewhat of a mismatch between the number of processing stages that are known to take place between the retina and V4 and how many layers of processing are needed for those deep neural networks to account for the data. So, again, I guess the point that I'm trying to make here is that, when you have a black-box model, as a deep neural network is, it's possible for these networks to fit data, but for the wrong reasons.
Another aspect or criterion, I think, for these models is that I think they should be somewhat credible models of biology. So, again, to Tommy's credit, older models, HMAX included, but many others, of course, had a handful of operations, maybe even less. In fact, two key operations of the HMAX were the MAX operation and some form of tuning slash normalization operation. Both of those were shown to be implementable with simple biophysically plausible models.
If we look at modern-day deep neural networks and AI systems, there is an entire toolbox of operations that have been added in the past 10, 12 years. And, not to say that they are not all necessarily biophysically implementable, but I think, at least, someone should try. And I think, for these models to become credible models of biology, there should be some backup in approximations of those operations from a biophysical standpoint.
The third point that I want to make is that-- and that's probably an obvious one-- is that those models need to be able to make testable predictions. And so I've had several of these debates or related debates where I've found myself in a hard spot, trying to come up with actual testable predictions that modern-day deep neural networks have made for biology.
And, again, just to give full credit to Tommy and the work done in his lab, the idea with older models like HMAX was that a MAX operation is needed from a theoretical perspective to achieve invariance properties consistent with IT data. Back then, there was no real evidence for anything like a MAX-like operation at the level of the visual cortex.
But this work, subsequently, motivated further neurobiology work by Ilan Lampl and David Ferster, who went on and recorded in the visual cortex to look for a MAX-like operation, which they found to be implemented by at least a subset of the complex cells. Again, not to say that there are no possible predictions to be derived from deep neural network models, but I would very much like to see a similar line of work originating from the work we see today in computational neuroscience.
OK, in the last minute or so that I have, I just want to kind of bring a word of cautiousness from the field of computer vision. So I'm sure everyone here in the audience knows about ImageNet. It's hard to argue that ImageNet hasn't had a huge impact on the field of computer vision, machine learning, and even to the field of neuroscience.
There's a challenge that has been, until recently, happening every year, comparing and testing varying kinds of computer vision systems on ImageNet. The winning architectures, since 2012, have been deep neural networks. What you see here on the left is the gradual progress that the field, as a whole, has been making year after year, making incremental improvements over the previous year's architectures towards reaching what is considered today, arguably, human-level accuracy for image categorization.
The point that I'm trying to make here is that this is all great, but I want to remind many of the younger participants in this seminar that there was a life before ImageNet. And so, back in 2006, '07, '08 when I was a graduate student, there was something called the PASCAL VOC challenge, a much smaller data set compared to ImageNet. And, back then, we would see a similar trend.
So, as of 2007, the winning kind of solution in computer vision was the bag of words. And then, subsequently, year after year, we saw this gradual improvement in the ability of those models to beat the previous year's state of the art with various kind of twists, how to compute and store histograms, how to change the learning rule, et cetera.
The point that I'm trying to make is that those five years of incremental improvement from 2006 to 2011 turned out to be of little value and haven't been applicable to modern-day deep neural networks. So, arguably, much of the gains in accuracy we achieved in computer vision over those five years never translated to what is today, arguably, a shift in the kinds of architectures that we're considering and, hence, was of very little use and contributed little to the field.
So I guess the final word is a word of caution that benchmarks are good, but they're mostly relevant if you have the right model of the visual cortex. There is a high risk that, if we don't have the right model of the cortex, we might be tweaking, year after year, the wrong architecture and not learning very much about the brain in the process.
JOSH MCDERMOTT: All right, thanks, Thomas. So I think what we're going to do now is turn to our discussants. And, if anybody is really itching to go first, you can let me know, but I haven't heard from anyone. So I think I'll just go through people in the order that we specified. And that means that Leyla is up next.
LEYLA ISIK: All right, great. Well, thank you for those presentations. I think they were interesting, and I don't know how, but, somehow, I guess I find myself now in the deep net defender position. So maybe I'll start with Gabriel's point about overfitting and lack of generalization and say that, actually, I think-- I don't think that's really a fair statement. I think that these networks actually do an OK job of generalization.
So you mentioned that, if you have to rerun backprop every time you run a new task, that's not a good model that generalizes well. And I agree with that. But I don't agree with the point about linear fitting.
I think, a lot of these models, you can train them on one task and then learn a new linear decision boundary on a totally different task. And they work quite well. And I think that's a pretty strong case for their generalization, right? And I think the biological plausibility of that is something you could imagine happening, for example, in the human brain.
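Leyla's point about learning only a new linear decision boundary on top of frozen features (a "linear probe") can be sketched like this, with a toy stand-in for the pretrained network:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 16))

def frozen_features(x):
    """Stand-in for the activations of a pretrained, frozen network."""
    return np.tanh(x @ W1)

# A 'new task' the network was never trained on: classify by x0 + x1 > 0.
X = rng.normal(size=(400, 2))
labels = (X.sum(axis=1) > 0).astype(float)

# Learn only the linear readout on the frozen features (least squares).
F = np.column_stack([frozen_features(X), np.ones(len(X))])
w, *_ = np.linalg.lstsq(F, labels, rcond=None)
pred = F @ w > 0.5

print("linear probe accuracy:", (pred == labels.astype(bool)).mean())
```

No backprop touches the network itself; only the readout weights `w` are learned for the new task, which is the kind of lightweight adaptation Leyla suggests a brain could plausibly implement.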
But I guess the other point that I wanted to bring up that no one really mentioned is that we don't know how well these models really generalize on different tasks because I think that, as a field, we've been really overly focused on the object recognition task. And so, to some extent, we don't know how well they generalize on totally different vision tasks, right?
So little of what we do as humans is go around and name different objects, right? A lot of what we do with our visual system is try and understand where to go, different properties of objects, who's around, what those people are doing. And I think, as a field, in part, because we don't have those benchmarks, we don't really know how well those models do on those tasks.
So, when we're discussing benchmarks, I think it's really important to think about what the right tasks we want to be evaluating are and develop the right data sets to test those. So I think I'm pro benchmarks. I think they help bring everybody on the same field, but I think that the problem is, when you get overly focused on one particular benchmark, one particular task, and even one particular data set in that task, that's really where you can fall into problems.
JOSH MCDERMOTT: Thanks, Leyla. Martin?
MARTIN SCHRIMPF: Thanks for the presentations. I agree with what Leyla said. I also wanted to add one point for Thomas and one point for Tommy. So Thomas, at the end, said that the gains from PASCAL did not lead to any useful models or new insights in the field.
And I think I really disagree with that because, for one, the models were still used for applications. Like they were used in search, Google Search and others. So, even though they did not perform as well, they still had some tangible outcomes that were beneficial.
And, also, in terms of the benchmarks, we, at some point, realized that PASCAL really just didn't cut it anymore because we could get away with pretty simple models. So then people set out to build ImageNet. And, like Leyla said, at some point, the benchmarks will just be too simple. So we need to keep making them harder and harder.
And I also just don't know what else would be the alternative. Like, if we don't benchmark and, at least, figure out what the models can do today on some data sets we have, then I just don't know what else we will do. We're sort of just fishing in the dark. So that's my [INAUDIBLE] for benchmarks.
Maybe, on that same line, to what Tommy said about this one simple test that models fail: I think that points to a great benchmark. That's something we should add to Brain-Score to explore [INAUDIBLE] all these models, and then we can try to improve them.
And, one more point to Tommy, you mentioned that SGD is something that's not biological and that, since the models already make incorrect biological predictions at that level, we should not even bother with them. I also don't agree there because, for instance, even HMAX didn't have spikes, and it probably didn't have neurotransmitters. So there's always some abstraction level that we do not match. And maybe the discussion really is what is the right level of abstraction that we think we should operate on and that we think we should benchmark.
In Brain-Score, the decision we at some point made was that we think spike rates are the right level. And, so far, we don't even care about learning, but that is something that can be added. So, at least, on some level of abstraction, the current models are doing OK at predicting neural and behavioral data but, even on the benchmarks we have, are still falling short. That was a mouthful. I'll end there.
JOSH MCDERMOTT: Thanks, Martin. Michael?
MICHAEL LEE: Yeah, thanks for inviting me to be on this panel. I want to revisit something Gabriel said: you made the point that models based on deep nets only interpolate over their training domain and don't necessarily extrapolate to new domains that you test them on. So I'm interpreting that to mean that you implicitly believe it's important that scientific models should be able to do this, should be able to generalize to new domains that they haven't seen before.
But I'm going to push back and ask whether it's sort of fair or whether that's something that we should expect these models to do. So, in machine learning, we don't expect models to generalize for free to new data distributions. And, sometimes, they do, as Leyla pointed out, but they're not guaranteed to in any strong sense. So it's not surprising, in some sense, that scientific models based on statistical learning don't generalize for free either.
And you can always sort of fix this problem, quote unquote, "fix this problem," by adjusting your training distribution to include the domain in which the model has been assessed to fail. So I guess my question for the panelists is whether this fix is acceptable or whether you expect like our models to meet this more stringent requirement that they must generalize to completely unseen domains, quote unquote, "for free."
So like, as an example, like physicists can predict planetary motion using models of physical dynamics fit at sort of human scales and black holes from quantum models and that sort of thing. So that's my question for the panelists.
GABRIEL KREIMAN: Just a quick comment, maybe to answer some of the panelists, so, yes, my aspiration is to have a model that really can work with any image and any condition. So, if I show you a new picture, and you need to retrain your entire model-- again, to me, linear fitting and model are part of the same thing. I don't know why people make a distinction between the two. They're all a part of the model. So, if you need to retune your parameters or you need to start collecting new images, then it's not a real model.
MICHAEL LEE: I mean, at some point--
[? THOMAS SERRE: ?] It's not falsifiable at least.
MICHAEL LEE: I mean, at some point, your model will be good enough that it will, in expectation, generalize to any new image you show it. Like, at some point, you will have fit it on enough data, given it has enough capacity, that it will, eventually, just work, right?
GABRIEL KREIMAN: Then it's not a model anymore, right? So, if your model requires every possible image in the universe for training, yes, then it's just a memorization of every single image in the universe, right? That's not a computational model.
MICHAEL LEE: If it's an accurate account of what the brain does to any image, does it matter how you got there?
GABRIEL KREIMAN: Again, you can always-- you cannot present every image in the universe, right? So, if my mother has never seen a picture of this-- let's say you've never seen this one before-- your brain is doing something. Your neurons are responding to it, OK?
I want my model to be able to do that too without having to do any new linear fitting, without having to present any new images, right? That's an aspirational goal. Like I don't have that model. I certainly don't just to be clear. That's my aspiration.
LEYLA ISIK: Just to go back quickly to the linear fitting point, so, if I, as an adult, have seen other dongles, and I haven't seen your particular dongle, I share that intuition. But, if you show a toddler, who doesn't know about dongles yet, a dongle, they can pick it up after a few examples because you give them a new example to linearly tune their category [INAUDIBLE].
GABRIEL KREIMAN: There's no linear tuning. The neurons in an infant will still respond to that thing.
LEYLA ISIK: They will still respond, but I guess the decision boundary that ties--
GABRIEL KREIMAN: There's no decision boundary.
LEYLA ISIK: --that [? initial ?] response--
GABRIEL KREIMAN: There's no decision boundary. When you're talking about a model of cortex, I want to predict the activity of neurons. There's no decision boundary. There's no task. I just show you a picture, and I want to see what the response of the neurons will be. That's my task. That's all.
THOMAS SERRE: I think-- if I can add something, I think there is a little bit of a double discussion here. It seems to me that some of us are speaking about classification accuracy and generalization. I don't think this is Gabriel's point. Gabriel's point is at the level of fitting neurons.
If the definition of falsifiability-- I mean, one definition of falsifiability is that you should be allowed some amount of training data for your model to be fitted to experimental data, and then you should be able to produce novel predictions on images that were never seen. If that's not possible for any reason, because of the approach taken to build these models or whatnot, then the answer is that those models are not falsifiable.
MARTIN SCHRIMPF: Gabriel, do you think that would work between two brains? Let's say we take your IT representation and my IT representation, and then we show it 1,000 images. And then we fit a linear regression from your brain to my brain. Do you think that would predict every single new image you could feed it? Or would it also fall short of generalizing at some point?
GABRIEL KREIMAN: First of all, I don't know why you would fit a linear map between my brain and your brain, but, anyway-- if you really want to do that, I have no idea. I think that's an interesting experimental question. But I have no idea.
MARTIN SCHRIMPF: I'm mainly asking if we like-- we need some mapping of system A into system B. Like system A and B can both be brains, or they can be model and brains. And what most of us are using in some kind of linear map. And we, I guess, now are discussing whether we expect that map plus the underlying model to generalize. But it's unclear to me if that would work for two brains that we think are the right model.
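[The mapping Martin describes can be sketched numerically: fit a linear map from one system's response space to another on some images, then ask whether the frozen map predicts held-out images. Everything below is synthetic toy data-- the sizes, the noise level, and the shared latent structure are invented purely for illustration, not taken from any real experiment.]

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for two systems' responses to the same 1,000 images
# (rows = images, columns = recorded units); both share latent structure.
n_images = 1000
shared = rng.normal(size=(n_images, 50))
A = shared @ rng.normal(size=(50, 100)) + 0.1 * rng.normal(size=(n_images, 100))
B = shared @ rng.normal(size=(50, 80)) + 0.1 * rng.normal(size=(n_images, 80))

# Fit a linear map from system A to system B on the first 800 images...
train, test = slice(0, 800), slice(800, None)
W, *_ = np.linalg.lstsq(A[train], B[train], rcond=None)

# ...then ask how well the frozen map predicts the held-out 200 images.
pred = A[test] @ W
r = np.corrcoef(pred.ravel(), B[test].ravel())[0, 1]
print(f"held-out prediction correlation: {r:.2f}")
```

[In this toy case the map generalizes because the two systems share latent structure by construction; Martin's question is precisely whether two real brains, or a model and a brain, share enough structure for the analogous map to generalize.]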
GABRIEL KREIMAN: Well, I don't know. Maybe we should let-- I think the point is, if you have a model of Martin's brain-- and I really admire your brain, as you know very well. So I think this would be a very sophisticated model.
So that model, at some point, I say this is a model of Martin's brain. And, at that point, it's frozen. I'm not allowed to change it anymore, OK?
If it's a true model of Martin's brain, I should be able to predict what happens in your brain under any circumstance. That's my definition of what a model is, OK? I don't have that. I hope, one day, I will, but it's almost a [? definition, ?] right?
SUSAN EPSTEIN: Gabriel, if you want to model the response of neurons, are you going to think about things like synaptic fatigue? Do you want-- do you want the model to also stop responding if it's overloaded in some way?
GABRIEL KREIMAN: That's an excellent question. So this then goes back to what Martin was saying before about what's the right level of abstraction. I was playing the same game that many of us have been playing, which is trying to fit firing rates. And I'm pretty agnostic right now as to what's the actual level of biology that we need to be able to fit firing rates, but many of the models that we have been working with, with a few exceptions, don't really have dynamics.
And so we can expand the set of questions that we want to explain by saying, well, I don't want to just fit firing rates. I want to fit the sort of dynamic distribution of spiking activity. And then we may need to incorporate many of the fundamental aspects of biology that you are referring to.
But, right now, at this point, I was a bit agnostic as to what exactly-- what are the ingredients of the model. I was just defining that a model should be able to fit new data without any tweaking. That's all I will say.
JOSH MCDERMOTT: All right, thanks. Susan, well, do you have any-- were there any other points you wanted to make?
SUSAN EPSTEIN: Oh yeah, if I can share my screen for a second?
JOSH MCDERMOTT: Sure, yeah.
SUSAN EPSTEIN: OK, on the theory that one picture is worth many words, can we see this one?
JOSH MCDERMOTT: Yeah.
SUSAN EPSTEIN: OK, so face recognition is one of the most important areas that the brain deals with and, also, something that many people have worked on with neural networks. First of all, if you've never seen this picture before, many people think it's quite funny. It makes you grin. It does.
And then so my first question to you might be, why are your neurons making you smile? And my next question would be there is the ability of humans to not only recognize something as being a member of a category, but to take two images and distinguish between them. And so I'm wondering-- little children very quickly can tell a cat from a dog. I'm wondering how you feel about this as a way of falsifying a model. I'd want the model to say different.
THOMAS SERRE: Sorry, Susan, I think you are suggesting to move away from classification and towards more representation of similarity.
SUSAN EPSTEIN: Yeah, I think so. I think that classification is almost trivial. I think someone else said earlier that there are so many more interesting and challenging problems that the brain is solving that are not just knowing that this is a cat or a dog. And so I have a whole long list of them, but I thought we could start with thinking about this one because it's a classic example. It's in lots of textbooks. And it's a valid one, I think.
We don't just want to know what stuff is. We also want to know why something is what it is. And we want to know why two things are different. And how could you explain that to me?
JOSH MCDERMOTT: Good, OK, well, I think we should return to this issue of classification and the fact that it's so central to the approaches that we're discussing and whether that's a fundamental limitation that I think we're, somehow, going to need to overcome.
JENELLE FEATHER: Yeah, so thanks, everyone, for these thought-provoking presentations. I guess one of the things that I wanted to say is, actually, I think pretty related to what Susan was just bringing up with this image and, also, what Leyla and Martin were saying before, which is that I sort of see us as just needing more benchmarks.
And so the image that Susan was just showing is, similarly, another benchmark that could be added to a much larger set of tests, which do not necessarily just rely on predicting neural responses, but, also, perhaps, predicting behavior in ways that are maybe unnatural. And so this is a lot of what Josh and I, I think, have been working on in the past couple of years with our work using model metamers.
So I guess that that's the first thing is that we need these more targeted tests, but then the second thing that I wanted to bring up is related to something that Thomas said and, also, maybe was just the thread in some of the presentations, which is considering a lot of these models as black boxes because I really think that we should stop considering most of the neural networks that people use to actually be black boxes.
We are able to look into them and see what all of the firing properties are of each of the individual units. And so, if we actually do that, it's a hard process, and we don't necessarily have the right ideas and the right tools in order to fully characterize like what these models are doing, but we are able to see inside and develop insights in that way.
And so it's just sort of keeping in mind that that's one additional way that we have in order to actually falsify some of these models that we might consider to be a black box is, say, designing stimuli that track them in particular ways or designing stimuli that evoke particular model responses. And that could be another way that we can compare these models back to humans.
TOMASO POGGIO: OK, so this figure on the right is just trying to answer your question, Susan. We are able to recognize different objects, essentially, from one example, or at least children are. And most deep networks cannot do that. Now suppose the following-- well, let me describe the experiment first.
If you show a linear classifier based on pixels the objects on the left there-- see, these are cars or planes, but from different viewpoints, different sizes, different scales. OK, I want to see whether I can show one of them, one plane, one car, and then have the classifier classify correctly a new image. And the answer is, if you do it with this training set, one car and one plane in arbitrary viewpoints, scale, position, then the linear classifier, when I present one pair or two pairs or three pairs or 20 pairs, is basically at chance when you test it. It cannot do it.
But suppose that there is another module, which transforms the image before giving it to the linear classifier so that it is in a standard position and scale and viewpoint. So this is the B training set. It's one car and one plane from this training set.
I'm claiming the ventral stream up to IT is doing just that. Then just a single pair gives you 85% on new examples, so much better than chance. You need that to sort out this kind of complexity due, trivially in a sense, to viewpoint, position, scale, and so on. And there are ways to do that that deep networks of today don't use.
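[A toy version of the experiment Tommy describes can be simulated: a one-example-per-class linear readout is near chance when objects appear at arbitrary positions, but works once an idealized front end maps every image to a standard position. The 1-D "images" and circular shifts standing in for viewpoint and position are made up for illustration; the 85% figure he quotes is from the real experiment, not this sketch.]

```python
import numpy as np

rng = np.random.default_rng(1)
D = 64  # pixels in a toy 1-D "image"
templates = {"car": rng.normal(size=D), "plane": rng.normal(size=D)}

def render(label, shift):
    """An image of the object at some position: circular shift plus noise."""
    return np.roll(templates[label], shift) + 0.2 * rng.normal(size=D)

def classify(image, exemplars):
    """Minimal linear readout: pick the training exemplar with the
    largest dot product with the image."""
    return max(exemplars, key=lambda lab: exemplars[lab] @ image)

def accuracy(normalize, trials=300):
    correct = 0
    for _ in range(trials):
        # With normalize=True, an idealized front end puts every image
        # in a standard position (shift 0) before the classifier sees it.
        shift = (lambda: 0) if normalize else (lambda: rng.integers(D))
        # One training example per class, then one test image.
        train = {lab: render(lab, shift()) for lab in templates}
        label = rng.choice(list(templates))
        correct += classify(render(label, shift()), train) == label
    return correct / trials

print("arbitrary positions:", accuracy(normalize=False))  # near chance
print("standard position:  ", accuracy(normalize=True))   # far above chance
```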
JOSH MCDERMOTT: Thanks, Tommy. Susan, did you have any additional comments based on that?
SUSAN EPSTEIN: Yeah, I have another picture for Tommy and everyone else. These are all faces. Small children will tell you they see the faces. How are we going to contend with that? And would you want-- I have a feeling Gabriel would say I wouldn't want anybody to say-- I wouldn't want my model to say this was a face. But there are some wonderful problems in these pictures.
TOMASO POGGIO: Yeah, this can be a nice test.
SUSAN EPSTEIN: You like that. Good, all right, I'll send them to you.
TOMASO POGGIO: OK.
JOSH MCDERMOTT: So, yeah, thank you. That one is also really nice. Jim, do you want to chime in?
AUDIENCE: So great to see [? everyone. ?] This panel is awesome, all my friends on the panel. So, to Tommy's point, I've been trying to summarize what I think I heard. And I think, mostly, you guys agree on a lot of stuff.
One, models cannot have free parameters. Gabriel said that most clearly. If they have free parameters, then it's not a model. It's a model class, so as long as that language is used.
And, if you have a model class, you might not be able to falsify it. Gabriel said that, and I think we all agree on that. So let's call models things without free parameters.
The second point: when we talk about the primate visual system, gradient descent-- somebody mentioned that. That's not a model. That's just a technique to get to a model. I hope we agree on that too.
If people think, me or others, think gradient descent is the model, that's not a model. I know some people in AI will call that the model. I don't. That's not the model we're talking about in this group.
TOMASO POGGIO: But it is part of the model, Jim. The performance you get, you get it through that.
AUDIENCE: I'm not disagreeing with that it's a part of the technique to get to what I'm going to call the model.
TOMASO POGGIO: No, but you have-- no, come on. You cannot-- you cannot [INAUDIBLE].
TOMASO POGGIO: [? As ?] [? part ?] [? of ?] [? the ?] [? model, ?] it is a critical prediction that that kind of--
TOMASO POGGIO: Yes.
AUDIENCE: No, but I think that's why we're kind of talking past each other because we view the endpoint of these things as useful simulations.
TOMASO POGGIO: Jim, you are [? teaching. ?] You are cheating, [? teaching ?] and cheating, right? And you should be careful not to confuse the two.
AUDIENCE: All right, I'm going to just frame this from neuroscience. To neuroscientists, those are two different questions. One is how the adult visual system is actually operating. The other is how the visual system got to evolve and develop. And those are different questions. They're both interesting.
They're not-- but I'm just talking about the adult operation question, which is how I'm phrasing the model. It doesn't dismiss the interest in the developmental question. I just was trying to focus the discussion around adult inference operation mode for the system.
I'm just trying to make that clearer for discussion, not that it's not interesting to talk about gradient descent. But gradient descent is a technique that might correspond to development. I don't think it does, personally, but I didn't think that's what we were all talking about in this group.
So, just moving past that-- and maybe we can come back to that-- some specific ANN models, fully trained, are better approximations of the ventral stream than all prior models. And I think we probably all agree on that too. So I--
TOMASO POGGIO: I don't agree. I don't agree.
AUDIENCE: Let me just finish, Tommy. At the same time, all current ANN models are currently wrong. They're all better-- well, not all better; the ones that have been tested are better-- but they're also false. And I think we all agree on that too. And Tommy showed examples. How do we--
TOMASO POGGIO: That's a trivial statement too, right? Every theory is wrong.
AUDIENCE: Tommy, it was the same one you made with the flashing of the image. I'm making that statement. I'm agreeing with you. I'm just stating it for the group as agreement that we all agree they're false. So they're falsified. All models that are defined that way are now falsified.
OK, how do we know they're falsified? Not because I had a flashed image, and I said it's not going to work. It's because we built something that we test the models on. Those are the benchmarks.
And we can see none of them can predict all the data. So we know they're wrong, and we can all say they're wrong in our own ways. And we can agree that they're wrong.
OK, so they're wrong. So here we are. We all agree on all that, I hope. And then we say, what are the interesting questions going forward? The interesting questions are, how do we build the next model?
I suspect people differ on their opinions, and that's a good discussion. How would we know if we actually built a better model? I think people disagree on that too, but we've got to have some way to know.
And the deeper, existential question that underlies both of those questions is, how do we even propose to ground those questions? That is, what is the point of our field, ultimately? Why are we even building those models? And I guess I'd like to hear all the panelists talk about that.
Like how do we build the next models? We agree that our current models are wrong. And how would we know if they're more better or less better than the current models? And I think that's the only operational question on the table. And I'm going to go away and let the panelists talk.
JOSH MCDERMOTT: Yeah, I'll just say I would also-- I think it would be very useful for people to sketch out what they see as a good path forward. It's sort of been hinted at in the presentations, but not really explicitly declared.
TOMASO POGGIO: Well, I think we should first discuss a bit more the issue of, quote, "overfitting." And there are a couple of people in the audience who could speak to it.
JOSH MCDERMOTT: There's also some things in the Q&A list. Like the first question, Tommy, I don't know if you can see that. I can just read it. Can the presenters elaborate on the problem of overfitting? In what sense does the fit of neural network models to brain data benefit from the number of parameters? Since the neural networks aren't fit to the brain data directly, it doesn't seem like the correlations to brain data are a result of overfitting.
TOMASO POGGIO: Yeah, I think we should, perhaps, listen, if you agree, Josh, to Colin Conwell.
JOSH MCDERMOTT: Yeah, definitely. Colin, are you here? Can you speak to this issue?
AUDIENCE: Certainly, I can at least attempt that. So the question of overfitting--
TOMASO POGGIO: Anyway, Colin is at Harvard.
AUDIENCE: Yes, I'm in the Harvard Vision Sciences Lab. I've been doing some work recently with some optical physiology data from mice. And I think the type of overfitting that I'm referring to is, actually, a product of, basically, the statistical procedures that we're using in our benchmarking and, in particular, ways that we evaluate feature spaces from deep neural network models.
And the phrase that keeps coming to mind for me is that the map is not the territory is something that's come up a lot here. When we're talking about linear fits between two brains, at that point, we're talking about a huge number of parameters, not in terms of the model itself that we're using to produce the feature spaces, but in terms of the actual sort of statistical free parameters that we're using to get a final benchmark for our model scores.
So what seems to be happening here is that, when we look at these very large feature spaces, not large in the sense of just total number of free parameters in the model, but in terms of the layers from which we're selecting feature spaces, and we use something like a MAX operation to get the best feature space from that model, what we're actually, subtly, doing in that way is we're creating more opportunities for models with greater numbers of feature spaces to produce higher scores.
And it seems like-- at least, some trends that we've been seeing, at least, with the optical physiology data in mice is that it does seem like many of the large-scale motifs of ImageNet trained models doing better, of deeper models doing better, of ImageNet accurate models doing better are, actually, somewhat, a simple byproduct of the fact that we're allowing for operations like taking the max over layers to produce better feature spaces by dint of chance and not necessarily of actually having better feature spaces as a model.
So I think the only input I would offer in this domain is that we need to revisit a number of statistical foundations in how we're doing benchmarking before we take the superiority of certain models in our benchmarks to be an indication of paths moving forward. That's what I'd say, largely, for the notion of overfitting when we're not dealing with the number of free parameters in our models, but we have free parameters actually in the linear mapping that we're using to produce benchmarks.
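[The selection effect Colin describes-- taking the max score across many candidate feature spaces-- can be demonstrated with pure noise. In the sketch below, every "layer" is random and unrelated to the "neural" data, yet the best-of-many score comes out well above the single-layer score purely by chance. All sizes and counts are arbitrary toy choices, not taken from any real benchmark.]

```python
import numpy as np

rng = np.random.default_rng(2)

def layer_score(n_images=100, n_features=30):
    """Cross-validated fit of a pure-noise feature space to pure-noise
    'neural' data; any nonzero held-out score is a statistical fluke."""
    X = rng.normal(size=(n_images, n_features))
    y = rng.normal(size=n_images)
    half = n_images // 2
    w, *_ = np.linalg.lstsq(X[:half], y[:half], rcond=None)
    return np.corrcoef(X[half:] @ w, y[half:])[0, 1]

# Compare a single random "layer" to the best of 20 random "layers",
# averaged over 200 repetitions of each.
single = np.mean([layer_score() for _ in range(200)])
best_of_20 = np.mean([max(layer_score() for _ in range(20)) for _ in range(200)])
print(f"mean single-layer score: {single:+.3f}")
print(f"mean best-of-20 score:   {best_of_20:+.3f}")
```

[The single-layer score hovers around zero, as it should for noise, while max-over-layers is reliably positive-- which is why models offering more candidate feature spaces can score higher without being better, unless the layer choice is committed on separate data.]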
TOMASO POGGIO: Yeah, very interesting.
JOSH MCDERMOTT: Thank you, Colin. Do any of the Brain-Score folks want to respond to that?
MICHAEL LEE: Yeah.
MARTIN SCHRIMPF: Yeah.
MICHAEL LEE: Me and Martin can take [? this one. ?]
MARTIN SCHRIMPF: [INAUDIBLE]. I maybe will take the first one, and then you can go, Michael. So there's three points I wanted to make to that. So, first, to the number of layers, at least, in the current benchmarks we have, all models that come in are not allowed to search over layers. They have to pre-commit to a certain layer based on public data that is not used for testing.
And they still search on the public data, but that is completely separate from any of the scores that we have on the Brain-Score side. So that, to me, takes a big chunk out of the argument against having many, many layers.
We also, of course, cross-validate everything. And, like Gabriel said, there is just different levels of generalization. We, typically, cross-validate over objects or images. You could require more cross-validation, and then that would become a new benchmark in our minds. And so we also try to take care of not just fitting the data, but, also, predicting new data.
And, finally, there's also other measures that we're looking at that are not using any kind of regression or any kind of fitting in the middle, such as RDMs or CCA, that tend to correlate very strongly with neural [INAUDIBLE], [? but, ?] [? typically, ?] are more noisy, which is why we like [? neural ?] [INAUDIBLE]. But then there are definitely parameter-free approaches that we can use to compare representations that also align with the results that we have with the parameter-based regression approaches.
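[The parameter-free comparison Martin mentions-- RDMs, as used in representational similarity analysis-- can be sketched as follows: instead of fitting a map between two response spaces, compare the pairwise dissimilarity structure each system assigns to the same stimuli. The synthetic data and sizes below are invented for illustration, and in practice Spearman rather than Pearson correlation is often used on the RDM entries.]

```python
import numpy as np

def rdm(responses):
    """Representational dissimilarity matrix: 1 - Pearson correlation
    between the response patterns for every pair of stimuli."""
    return 1.0 - np.corrcoef(responses)

def rdm_similarity(resp_a, resp_b):
    """Correlate the upper triangles of two RDMs -- no fitted map
    between the two response spaces is involved."""
    iu = np.triu_indices(resp_a.shape[0], k=1)
    return np.corrcoef(rdm(resp_a)[iu], rdm(resp_b)[iu])[0, 1]

rng = np.random.default_rng(3)
n_stimuli = 40
latent = rng.normal(size=(n_stimuli, 10))  # shared stimulus structure
brain = latent @ rng.normal(size=(10, 120)) + 0.3 * rng.normal(size=(n_stimuli, 120))
model = latent @ rng.normal(size=(10, 500)) + 0.3 * rng.normal(size=(n_stimuli, 500))
unrelated = rng.normal(size=(n_stimuli, 500))

print(f"model vs. brain RDM correlation:     {rdm_similarity(model, brain):.2f}")
print(f"unrelated vs. brain RDM correlation: {rdm_similarity(unrelated, brain):.2f}")
```

[Note the two systems can have different numbers of units; the comparison happens in stimulus-by-stimulus space, which is what makes it fitting-free.]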
AUDIENCE: Yeah, absolutely. So I was going to suggest there are, I think, a number of remedies that actually preserve the overall trends in the benchmarking, but that, basically, the large-scale story is that this subtle overfitting that Professor Poggio was referring to can occur at the level of the statistical techniques we're using for our benchmarking. There are ways, as you're mentioning, RSA being one of them, that can skirt this particular set of problems.
But I think the larger point is that we just need to be very careful in our presentation of where we are in terms of mapping to the brain before we take the indications from the benchmarking as paths forward. And, in terms of, I think, responding to the question of, well, what are the next models that we're going to design, if, for example, we're taking away, well, let's just add more feature spaces from our model-- to our model, that's probably not the direction that we want to go.
And I think CORnet, obviously, is a great response to this, and there are many other great responses to this, but there are certain trends that we might see in the benchmarking that would lead us down the garden of forking paths, as it were, to models we don't really want to be building because, number one, they are missing some of the biologically inspired components we have, and, number two, it's the case that maybe these models are just doing better because of statistical flukes and not, actually, a meaningful fit to the brain data.
MARTIN SCHRIMPF: Yeah, thanks for that. I think more input from someone like you who is more grounded in stats would be really helpful to help us get the benchmarks completely right. So I'd be [? happy ?] [? to know ?] more about that.
JOSH MCDERMOTT: So can we hear people sketch out their vision for how to move forward from the current generation of models?
JENELLE FEATHER: Can I actually sort of ask a question in regards to this, which is that, a lot of the people talking today, like, I guess, in particular, some of the plots that Tommy was showing in regards to invariances in the visual system rely on us knowing what those invariances are up front or rely on us knowing what some of these properties of, say, the human visual system are up front. And I guess the idea is that we would then build those directly into a model.
And so I guess I'm just curious whether people think that is something that our models should explicitly include or whether some of those things should fall out as byproducts. To give a little bit of context, I think this is something that oftentimes comes up in our work on the auditory system, where maybe we don't know what some of those features should be. And so, essentially, we view neural networks, or some of these deep networks, as a way to maybe learn what some of those features of the human auditory system are.
TOMASO POGGIO: Yeah, well, personally, I think-- I may be biased, but I think it's a mistake to fall into the fad of deep networks because they work so well and to try to map engineering hacks onto the brain. We should go back and start from the data that we have from psychophysics and physiology and anatomy and build from there. There are techniques and tools from deep learning that we can use. It's mostly having the ability to use much more computing power than was possible 10 years ago.
But I would love to see the same story happen that happened with HMAX. HMAX was a model that was suggested by visual cortex, that we built to imitate visual cortex, and that then, for some time, was state of the art in computer vision. And now we are doing the opposite. We are taking state-of-the-art machine learning and trying to claim that it represents the brain.
And there are a lot of problems with that. There are a lot of nonbiological aspects of deep networks. And, I mean, do we want just to say, OK, let the engineers do the models? We'll just try to fit and to, somehow, squeeze the data so that they fit? I find that-- I hope it is not true.
JOSH MCDERMOTT: Gabriel or Thomas, do you have anything to add to that? Or, Michael, go ahead.
MICHAEL LEE: Yeah, for Tommy, could I ask what the alternative is? Like how do you envision progress that sort of would not involve some form of fitting to data?
TOMASO POGGIO: Well, as I said, I think the big advantage that we have-- I'm speaking we, all of us, all of us in CBMM, all of us in BCS-- have, with respect to all the thousands of people working with deep nets, is that we can do experiments on the real thing, on the brain. And we should really leverage that.
We should, ideally, be first to solve the problem of intelligence, not let the engineers do it. We have something that we can look at, which is intelligent. And we can put electrodes into it and test input and output.
MICHAEL LEE: In the extreme, like we, as neuroscientists in this department, we can collect measurements from the brain. In the extreme, like--
TOMASO POGGIO: Well, it's not only collecting measurements, right? I think there is a lot to be said in terms of doing clever experiments.
MICHAEL LEE: Right, would you-- let's say I hand you a model. And, if I showed any image in some domain that we've agreed to beforehand, like ImageNet, it can very accurately predict how some monkey's IT cortex is going to respond.
And then I tell you, oh, I trained this model by using stochastic gradient descent. An engineer designed it. I fit it directly to a bunch of neural data beforehand, but not on the test images. Would you accept that as progress? Or is that sort of a misleading result that we've come upon?
TOMASO POGGIO: Well, if it fits all the data-- is that what you said?
MICHAEL LEE: Yeah, I mean, it's an accurate image-computable model of naturalistic images let's say. And it can sort of completely describe the responses of some monkey's IT cortex to any image.
TOMASO POGGIO: Yeah, but I would like to test it with some critical things like the images that Susan presented.
MICHAEL LEE: Well, OK, so what if I added images like that to my training set beforehand? And I collected neural data from monkeys on those images, and I fit that too with my giant function approximator. And I just kept on doing this forever.
Like you come up with some tests. I collect the data on those test images in the domain, holding out some images for actual testing. And we just continue just fitting the brain.
And Jim calls this like brain siphon or brain vacuum. And we just continue just sucking up all this data, updating some high-capacity model. Would you accept that as neuroscience, as progress? Or how would you think about something like that?
TOMASO POGGIO: It's like Einstein and quantum mechanics. I'll do all I can not to believe it.
SUSAN EPSTEIN: I want to ask a naive question here, which is why is everything binary in this conversation? Why doesn't anyone want to look at confusion or uncertainty? Why aren't we modeling those too? Certainly, the brain spends more time looking at some things than at others. Why wouldn't our model also do that?
TOMASO POGGIO: You mean like illusions, or?
SUSAN EPSTEIN: The pictures I showed you because they were not what was to be expected.
TOMASO POGGIO: Yeah, right, I find, in a sense, somehow, I was trained back in Germany with the idea of critical tests for a theory. This is something that wouldn't-- it will destroy, ideally, a whole class of theories. So you're looking for-- before you're looking for a lot of precise measurements of the mass of the electron or so, you want to see whether the electron exists and try to disprove it if you can.
I think we should do more of that exercise, you know? There are some of these properties that you want from any reasonable model-- like, as I said, the ability to recognize a face if you have seen it once and you walk away or come closer. There are several things of this type, or some of the strange images or difficult images. How does the model react? You want to try to break it. That's what one should do.
This was, by the way, what Bohr was saying. You should have a model, a theory. You get it. You are very happy about it. You collect evidence for it.
But then you have to go into the next phase in which you are trying to disprove it. And, if you cannot, that's OK. That's great. But--
SUSAN EPSTEIN: I'm just suggesting, Tommy, that maybe there's more than right and wrong.
TOMASO POGGIO: Yeah, yeah, I take the point. I don't know whether that's the point you want to make, but it's not always easy to have a yes or no answer.
SUSAN EPSTEIN: Yeah, that's it. And I think that modeling the confusion that a person might have looking at something would be valuable too. And it, certainly, is on the same scale I think.
TOMASO POGGIO: Yeah.
JENELLE FEATHER: I could be-- maybe I'm misremembering, but, some of the people in Jim's lab, haven't some of these things been incorporated into some of the deep networks in various ways such that you can look at their responses for easy images versus hard images?
MARTIN SCHRIMPF: Yeah, and, in fact, a behavioral benchmark is like trying to also account for confidences by pooling over like different subjects' responses. So we're definitely interested in that, Susan.
TOMASO POGGIO: You know, there is-- this is a question for Martin. You had these very nice results about language in which you show this complete, very complicated hack GP-3, right?
MARTIN SCHRIMPF: GPT, yeah.
TOMASO POGGIO: Sorry?
MARTIN SCHRIMPF: GPT, yeah, that's what it is.
TOMASO POGGIO: Yeah, and this is a model, a language model, neural networks, trained with billions of training data. I don't know how many times more than a single individual could ever see. But then you show that, in some way, it matches neuroscience data.
How-- can you say a few words about it and what you think about it? You know, the model is, clearly, not biologically plausible. It's not what happens in the mind of any one of us.
MARTIN SCHRIMPF: Right, so, yeah, what Tommy is referring to is a recent result that some of the latest NLP models coming, primarily, out of OpenAI, such as GPT-2 and [? 3 ?] now, seem to do pretty well at predicting both neural activity in humans in fMRI [? and ECoG ?] conditions, as well as human behavior. And I think Tommy's criticism probably comes back to what Jim was outlining before.
Like do we care about how the model got there? And, I mean, at some point, probably all of us care. Just the [INAUDIBLE] [? stage ?] in different conditions seems to resemble that. And, by the way, for GPT-2, in particular, it seems like, even if you don't train it at all, just the inherent structure that is given to it by the architecture already does pretty well at predicting neural activity. But that's something we also have seen in [INAUDIBLE], to some extent, and, I believe, [INAUDIBLE] where, even without training, you can get pretty far.
So training seems to add on top of it, or [? experience ?] [? dependence ?] [? updates ?] seem to definitely be important, but [? the model ?] [? might ?] just be a lot of inherent structure that is defined by the architecture. And, perhaps, that's something we should discuss more on. Like what are the right priors to bake into the models based on a structural perspective?
TOMASO POGGIO: OK, yeah, that's an important point. Josh, any comment?
JOSH MCDERMOTT: From me?
TOMASO POGGIO: Yeah. You're a moderator, but you can express some--
JOSH MCDERMOTT: Yeah, look, I mean, I think what's hard in this domain is because-- it's the fact that all of the models are wrong. We know they're wrong, right? And so the question is like, well, how do you move forward? What do you take away from something as evidence that you're on the right track? And I think that what's interesting about where we are right now is that people have different intuitions about how to tell whether we're on the right track and what the metrics are that you should be tracking in order to move forward.
And there's like the strong view that like, well, you just have a quantitative measure of how well you can predict neural responses, and you just try to keep bumping that up. There's other views that, like kind of what Susan is expressing, that there are sort of qualitative phenomena in perception that we really need to be paying closer attention to and that really represent fundamental inconsistencies with the way the current model classes work.
But, again, the tricky thing is like, the current models that we have, they're obviously wrong. They've already been falsified, right? So I think the issue of falsifiability gets a bit tricky because it sort of becomes a more qualitative distinction. And people can have very different and legitimate opinions about how exactly to make progress.
I mean, the other thing that I was just going to say, though, Tommy, is I think that some of this discussion reflects this sort of tension between the old-school view that principles matter and that one should build models by looking at biology and then incorporating engineering principles-- and that's, of course, how the original generation of models that we all cut our teeth on were developed.
But, on the other hand, I mean, I think there was a widespread sentiment that we were kind of hitting the ceiling on those models. And I think it's difficult to actually know how to move forward and make them better. And that's kind of why people turned to machine learning, the idea that, well, you would learn a solution, rather than try to kind of design it by hand. And, of course, that's got its own challenges, right? But I think that's the fundamental tension.
TOMASO POGGIO: Josh, one could say, OK, you had models like HMAX, basically, the neocognitron made a bit more faithful to the anatomy and physiology. And then you just optimize it. That's what deep learning today is doing. There is no other difference.
And you optimize it using techniques that are not biologically plausible in terms of evolution or development of the individual. And so what? You get performance that is maybe 5% better than if you don't optimize everything. Is this worthwhile?
JOSH MCDERMOTT: Oh, but it's more than 5%. I mean, it's a pretty substantial improvement from training.
TOMASO POGGIO: I don't know. I saw-- if you look at the NTK stuff, this is the Neural Tangent Kernel theory, basically, that says this is what you can get from a deep network if you train the last layer, and everything else is random. And you can maybe just minimally change the weights in the previous layers.
And this is about 5% less than the optimal you can get. And this is equivalent to a kernel machine. So it's equivalent to an HMAX type thing. So, yeah, 5% is important, but I'm wondering.
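Tommy's claim can be sketched in a toy regression: freeze a random first layer and train only the linear readout with gradient descent. This is a hypothetical illustration, not the NTK derivation itself; the architecture, target function, and learning rate below are invented for the demo:

```python
import math
import random

random.seed(0)

# Frozen random first layer: weights are drawn once and never updated,
# standing in for the "everything else is random" part of the network.
D_IN, D_HID = 2, 64
W = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_HID)]

def features(x):
    """ReLU random features computed by the fixed layer."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]

# A toy target function to regress (purely illustrative).
def target(x):
    return math.sin(3 * x[0]) + 0.5 * x[1]

data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(100)]

# Train ONLY the readout weights v with plain gradient descent.
v = [0.0] * D_HID
lr = 0.01 / D_HID
for _ in range(200):
    for x in data:
        phi = features(x)
        err = sum(vi * fi for vi, fi in zip(v, phi)) - target(x)
        for j in range(D_HID):
            v[j] -= lr * err * phi[j]

def mse(predict):
    return sum((predict(x) - target(x)) ** 2 for x in data) / len(data)

baseline = mse(lambda x: 0.0)  # untrained readout (v = 0)
trained = mse(lambda x: sum(vi * fi for vi, fi in zip(v, features(x))))
print(f"untrained readout MSE: {baseline:.3f}, trained readout MSE: {trained:.3f}")
```

The comparison in the NTK literature is between this frozen-feature regime and full end-to-end training; the theory makes the correspondence to kernel machines precise.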
THOMAS SERRE: Can I interject? Yeah, I mean, so, maybe just to clarify the point that I was trying to make, and I think that's also where we're heading-- I still go to VSS, and I see so much work that is just fitting deep networks to all sorts of data. And, usually, that's the end goal. I mean, that's it.
And I think that's the issue. I think an improvement of whatever amount of percent it is from one arbitrary architecture to the next is of very little value in terms of telling us much about the visual system or perception more broadly. And I think the nostalgia here is that there was a time when this fitting, or checking consistency with data, was just the first step towards, potentially, making testable predictions for experiments.
I think that's one example-- one simple example was the MAX. The MAX, empirically, for image classification worked better than the average. And that, in turn, motivated electrophysiologists to go and look for a MAX operation.
Part of my criticism is that, today, again, I go to VSS, and I see very little value in much of the work. And I'm sorry for being openly critical, but it's hard for me to see the value of just fitting models for the sake of it. At that scale, just going from one arbitrary model to the next-- from a DenseNet-121 or a [? ResNet-151 ?] to a [? DenseNet-180 ?]-- any such improvement tells me nothing about the brain.
Where I think it would become interesting is whether there are key mechanisms or operations in one network over another, which, in turn, can be tested experimentally, mechanistically. And that I don't see. I mean, maybe I'm missing it, but a lot of the work that I see in the field really just stops at fitting to experimental data. And that, to me, is insufficient at best.
JOSH MCDERMOTT: So can you outline your vision for how to move forward?
THOMAS SERRE: Well, so, for instance, the point wouldn't be to just stop at fitting, but to figure out-- so I guess there should be enough control in the architecture that there would be one key variable changing across models, which would allow us to make a testable prediction. For instance, again, the example was going from a network that uses average pooling to one that uses MAX pooling.
If that leads to a quantitatively better fit on brain data, then the prediction would be that we need to look-- there should be a MAX operation somewhere in the brain. Someone should look for it. And that's where you need clever experimental design to be able to tease apart predictions from an average operation versus a MAX-like operation.
And I would argue that the past many years of neuroscience, going back to Heeger and Carandini testing specific predictions of the normalization model, was one instantiation of that idea. And so I guess my argument is I think two things. One is shifting a little bit from a purely quantitative fit to a more quantitative test where the quantitative fit would be only the first step in helping us identify critical operations that can then be tested more explicitly.
And the second one, I guess, is-- and this is probably closely related-- moving away from purely data-driven approaches to top-down approaches, hypothesis-driven experiments that will allow us to answer, in a more falsifiable way, whether this particular model is more correct than these other ones because it uses the right or the wrong operation at its core.
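The tease-apart logic here can be sketched concretely: choose stimuli whose feature responses share a mean but differ in their maximum, so an average-pooling model predicts identical outputs while a MAX-pooling model does not. A minimal sketch with invented numbers:

```python
# Two hypothetical feature-response vectors with equal means but
# different maxima: average pooling cannot tell them apart, MAX can.
flat = [0.25, 0.25, 0.25, 0.25]  # weak, uniform drive across units
peaked = [1.0, 0.0, 0.0, 0.0]    # one strongly driven unit

def avg_pool(responses):
    return sum(responses) / len(responses)

def max_pool(responses):
    return max(responses)

print(avg_pool(flat), avg_pool(peaked))  # both 0.25: indistinguishable
print(max_pool(flat), max_pool(peaked))  # 0.25 vs 1.0: distinguishable
```

A real experiment would, of course, need stimuli that drive actual neurons this way; the point is only that the two operations are behaviorally distinguishable on suitably chosen inputs.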
JOSH MCDERMOTT: Thanks. So we have a question in the Q&A that's been upvoted quite a lot, and I think it's probably worth turning to. This is-- so I think I'm just going to read these from the list. This one is addressed to Leyla. It's from Daniel Janini.
He says, I'm interested in learning more about Leyla's earlier comments about studying how we learn new categories. Leyla brought up the problem of how children learn what a dongle is from a few examples. One possibility is that children learn a linear classification boundary operating over pre-existing object features. This seems a falsifiable hypothesis to me because the pre-existing feature space may or may not be capable of classifying dongles with only linear readout.
Leyla, could you say more about how you imagine these types of studies could be properly conducted? What do we learn from these tests of generalization? What kinds of visual domains and tasks do you think we should apply these tests of generalization to?
LEYLA ISIK: That's a great question. So I think we really should be looking more at development, not to understand how we get to the features that we need to solve a particular problem, but to understand what the problems we're solving are, right? So, for me, it's not so much do we use stochastic gradient descent or not, but is object categorization the right problem to be looking at?
And I think we have a lot of knowledge about what the core domains that children learn very early are from developmental cognitive scientists. So I would love to see a movement towards trying to develop benchmarks to see how our networks match that human behavioral data. And I totally agree, Daniel, that that is a falsifiable test. And you could test models in that way.
And maybe this is related to Jenelle's question, which I thought was a great one, about whether all of this stuff is going to come from first principles or if some of it is just going to fall out. And I think it has to be both. I mean, I totally agree with what Tommy is saying and Thomas said, that we need to be going back more to first principles and figuring out what the intelligent things to build into models are. But I don't think we're going to be able to build all of these into the models. Or should we?
So I loved Susan's example, but we're not going to have like a dog-human-- we're probably not going to have to build dog-human invariance into our model. That's something that probably falls out from-- humans find that funny based on all sorts of other visual and social learning that we do. So I think that's a great example of something that would fall out. You're not going to build that invariance in from first principles.
SUSAN EPSTEIN: What about dog versus cat?
LEYLA ISIK: The classification or the [? similarity ?] [? or-- ?]
SUSAN EPSTEIN: Are these the same class?
LEYLA ISIK: Oh yeah, I mean, I think that-- I think the object categorization problem is an important one for sure, but I think it's one we've been overly focused on. Even just when we're talking about fitting brain responses, we're talking about fitting-- mostly, we're talking about fitting IT firing rate in a very particular time window, right? We're not even looking at the more complex brain responses in other brain regions, later in the cycle, those sorts of things. So I think we've been really overly focused on the categorization problem, both explicitly, but, also, in the way we're talking about fitting brain data.
JOSH MCDERMOTT: Thanks, Leyla. So there's a bunch of questions in the Q&A, the next one from an anonymous attendee. Shouldn't strong benchmarks be able to weed out the wrong models? This is sort of addressing like the role of benchmarks and how dependent we should be on them. A bunch of people hit on that. Do you want to come back to that?
THOMAS SERRE: I don't think this exists out of the box. I don't think it'll come out of just a random set of natural images. That will require some hypotheses and some clever experimental design.
JOSH MCDERMOTT: And, Thomas, do you want to say-- I mean, I guess there's a deep philosophy-of-science question lurking here about where hypotheses should come from, because, right now, a lot of hypotheses sort of come from machine learning. At least, that's how some people would view it. But that's not how it used to be. I mean, do you have anything to say about your vision for that?
THOMAS SERRE: I think this is precisely what I'm arguing for, that these models should be useful for screening across mechanisms and operations, whether from a purely engineering, accuracy perspective or from fitting to neural data, which, in turn, could inform and motivate experiments that would test explicitly, at the mechanistic level, whether those operations are effectively carried out, or whether there is evidence for a subclass of cells in sets of visual areas that would be carrying out those operations.
So I think we're all in agreement. I mean, I guess the argument here is that there is a need for more hypothesis-driven work, not to stop at fitting, but really to test some of these seemingly critical operations.
LEYLA ISIK: Can I ask--
THOMAS SERRE: Case in point, if I give you a network that has twice the depth and fits the data better-- and we know already that there are more layers in those networks than there are stages of processing in the brain-- I'm not sure what to make of that. I mean, this is not really helping me make a testable prediction, right?
If, however, someone comes up-- case in point, the transformer networks that are all the rage these days, right? This is, arguably, one of the main breakthroughs, if I may say, of the past couple of years in language and now vision. It's hard to imagine that the brain could be implementing this particular operation. But, if that turns out to fit brain data much better, then maybe there is something there. Someone should be looking at biophysically plausible implementations of those specific indexing mechanisms, and evidence or lack thereof of such indexing in the visual system.
JOSH MCDERMOTT: Thomas, Jim has sent you a question via the chat. He wants to know what counts as a mechanism.
THOMAS SERRE: What counts as a mechanism or operations? Well, anything that could be tested experimentally, I guess.
MARTIN SCHRIMPF: But then-- so then you accept every deep neural network as being a proposal of a mechanism because that is what they are. They all make falsifiable predictions on unseen data.
THOMAS SERRE: Sorry, I missed the beginning of your question, Martin.
MARTIN SCHRIMPF: I was saying, if the key aspect of being a mechanism is to be testable, then you also have to accept all deep neural networks as being testable mechanisms and predictions because they all make new predictions on unseen data. We can test those.
THOMAS SERRE: But, again, I'm not-- just to be clear, but what I call prediction wouldn't be necessarily fitting data better. I mean, this is the first step, right? So, again, just to be concrete, an example of a mechanism would be arbitrary pooling operations, average, [? MAX, ?] and more. This could be-- we know that normalization is critical in deep learning. And there are so many flavors of [? normalization. ?] I cannot keep track of them.
You know, there's, literally, a zoo of neural computations that have been developed in the past 10, 12 years. I think what we're seeing from benchmarks, either in computer vision or Brain-Score-like benchmarks, is that some of those bring additional oomph to those models. They perform better, and they fit data better.
So I think I would-- it's hard to precisely define mechanisms, but I would call some of those neural computations, operations-- again, anything that you can implement in PyTorch-- a candidate for a plausible, possible mechanism for brains. And so I think then the question becomes what to do next, how to go about it. And, again, you can test-- I don't have a recipe for all possible operations, but a subset of them, at least, can be tested, right?
And different pooling operations will behave differently on the rightly chosen kinds of input data, right? MAX operation is going to perform differently from an average pooling or any kind of higher-order moments on some inputs, right? So those can be testable experimentally. Someone can design-- and I think this is very much the point that Jenelle was making.
For instance, the metamers are a great example. You can create metamers based on a variety of statistics and pooling mechanisms. And so I think it's great to-- metamers are great. Again, we need more of those that give us constraints at the behavioral level. Again, I think we shouldn't stop there. And I think there should be closer interactions with neurophysiologists to go and test the validity of some of those.
JOSH MCDERMOTT: So, from the Q&A, Nancy Kanwisher-- this is really a comment, but somebody might have a response. She says, I don't see any disagreement here. You collect a lot of data, and you keep improving models so they fit the data better and better.
But you don't stop there. You ask why this model fits better than that one-- in virtue of what does it work better. Now we have methods, as Jenelle said, to open the black box and ask how the models work. So let's both find models that fit the data and figure out how those models work and why they fit. Wise words.
Now one point of disagreement that came up earlier is Tommy's assertion that the use of stochastic gradient descent to build a model is intrinsically problematic given that we don't really believe in that as a model of learning. Jim is arguing for a separation for the means by which you build the model and the model itself. And that seems like an interesting issue that might be fun to talk about.
I mean, I think one interesting question, to me, that we think about a lot is like the extent to which the learning process really leaves its fingerprints on the model. And some of the perverse properties that current deep nets have really might be due to the way that the learning works such that really SGD might kind of be tying our hands at some level. But I don't know if Tommy or Jim wants to dive back into that discussion or anybody else.
TOMASO POGGIO: Yeah, I think I can-- I think it does. It does tie your hands. First of all, you believe you have performance that, actually, is biologically unreachable. And so we need to consider how to get that performance in ways that are biologically plausible, which may mean ending up with models that actually perform better-- for instance, models more resistant to adversarial examples, just to mention a reasonable conjecture in that direction: if you rely less on stochastic gradient descent and more on multiple approaches, multiple models, multiple networks, you could get higher performance and more robust performance.
So, yes, it may mislead and put you in the wrong direction. And I don't see any reason to take everything the engineers have and accept it as good, especially when the difference in performance, as we discussed, is not so huge.
And it seems to me that either it's a fashion, or it's, of course, nice to have available software that works and that you can use in a number of different ways on different problems. But, if that's the reason, we should really be able to get over it, one way or the other.
JOSH MCDERMOTT: But, I mean, there's also sort of the broader issue of just learning from labeled data, independent of stochastic gradient descent, right?
TOMASO POGGIO: Yeah.
JOSH MCDERMOTT: I think one of the things that troubles me-- I'm sort of currently haunted a little bit by the notion that it may be very difficult to actually separate biological learning, as it applies to sensory systems, from general intelligence-- from the fact that we are entities operating in the world that have to do lots of different things, and that that's so fundamental to development that we may not really be able to study sensory systems as these isolated entities in the way that we traditionally do. Of course, that's a very exciting and interesting problem, but, also, it's a hard one. And it just makes our lives, as scientists, a lot more challenging.
SUSAN EPSTEIN: I worry about the origin of the data itself. So much of it seems to be from web scraping. And I would say immediately that that is not the natural world, nor is it the human experience.
TOMASO POGGIO: Yeah, I mean, we don't have Boris here, but, in his group-- and Andrei Barbu. But, in this group, they have a data set that is-- I forget the name. Does somebody remember?
AUDIENCE: Yeah, it's called ObjectNet.
TOMASO POGGIO: [? Oh, hi, ?] Boris.
AUDIENCE: Yeah, so we noticed that, of course, as well, Susan. We realized that randomly sampling a bunch of the billions of pictures that people put up would not help us because these pictures were all the same. And so we set out to build a data set that our Mechanical Turk workers created: we would tell them what position to put an object in, which room to bring it to, and what camera angle to use. And the result was a data set which lowered the performance of all the winners of all the ImageNet bake-offs by about 40%, 45%.
And, in fact, at [INAUDIBLE] in about 10 days, we will start a competition based on ObjectNet where, hopefully, people will improve their systems to better performance on that data set.
MARTIN SCHRIMPF: But ObjectNet does not address the issue of [INAUDIBLE] in terms of experience. It's a more controlled data set, and it's good for that reason. But it does not reflect what children see when they grow up.
And maybe one response to what both Josh and Tommy said is that there are now first approaches that work completely without labels, such as work from [INAUDIBLE] and [? Jim ?] [? Zull ?] [? as well, ?] that do not use any supervised labels. They just work based on experience dependence and still yield representations that seem to be brain-like.
In fact, the paper from [? Jim ?] [? Zull ?] also shows that you can take a video that was captured to resemble what babies would see as they grow up, feed those kinds of videos into the model, and still get representations that resemble those in IT, for instance. So I think there are enough first approaches that speak to that point very directly and make some real modeling progress on that front.
JOSH MCDERMOTT: Look, I mean, I would say that stuff is obviously exciting, but my suspicion is that like the kinds of problems that Gabriel is talking about would, very plausibly, apply equally well to those kinds of models. I don't know if it's been tested, but that would be my guess.
So there's a bunch of other interesting stuff in the Q&A. Jeremy Wolfe says, it's not clear to me that the goal of a general visual system model without free parameters makes any sense. Gabriel wants his model to look at a new scene and do what the brain does. But what the brain does will depend on its current goal. A free-viewing brain is different from a visual-searching brain, which is different from a navigating brain.
GABRIEL KREIMAN: Sure, I agree with Jeremy. I mean, this is adding yet another exciting layer, which is the current goal. But, even with a simpler version of the problem, let's just focus on one goal. Take free viewing or visual search or whatever you want. I would be extremely excited to have a model that can capture that.
I want to take this opportunity to circle back to sort of the question that Jim posed earlier and sort of Tommy commented upon of where do we go from here. I think that there seems to be some general agreement that our current models are false, that we have a lot of problems with our current models. So how can we build better models?
So I think that's where maybe there may be some disagreement. And it's exciting that there's disagreement. I think, if we all agreed on what the next step is, we may all be doing the same thing. So I think it's good that there are different opinions about what the next steps are.
So I think I'm concerned that just doing data fitting and improvements of 0.25% will not get us where we want to go very fast. I do think that we would learn, eventually, by trial and error, getting 0.25% improvements on fitting existing data. But I want to go back to what Tommy sort of advocated earlier and sort of the history of physics, and really trying to break our models and trying to come up with basic, fundamental tests that any model should pass.
And we can discuss what exactly those tests should be. Maybe invariance is a very basic one. Susan presented a couple of pictures that I think were very clear and very stimulating. We can talk about a series of basic yes-or-no tests that we can sort of require of any computational model moving forward.
And this is not about 0.25% improvement on fitting existing data, but, rather, just a basic, critical test that any computational model should pass. And that series of critical tests may be itself expanded or changed over time. But, if you don't pass those critical tests, adversarial images, invariance, et cetera, et cetera, we're in trouble.
AUDIENCE: Could I say something about that?
JOSH MCDERMOTT: Please.
AUDIENCE: Yeah, so, Gabriel, I think, again, we've already falsified the models. So it's not a question of coming up with tests. It's coming up with tests that the answer would tell us what model to build next. That is the hard problem.
And we can already look at all the tests we have right now. We can already see they're wrong. But how do we take the residuals on that to update the model? That's the move that we need.
Otherwise, as you say, we're in trouble, right? OK, we're all wrong. OK, therefore, we're in trouble. That's what you just said at the end. OK, we're in trouble. What do we do next?
We've got a bunch of residuals. There's two options. Somehow, there's humans sitting in the room being really clever and smart, and that seems to be what [? Tommy ?] and [? Thomas ?] are, mostly, advocating for, that they're going to think about this somehow, that they're going to get the principles, and they're going to get the right answer.
There's another approach, which is to absorb the residuals into the updates on the model. And you can do that in a semi-automated, human-assisted version, which is more of, I think, what Martin and maybe Michael and I and maybe Jenelle-- I don't know who else on [? the ?] [? call ?] is sort of in that mode that we're going to look at those residuals, but we're going to use the machines to help us on the update.
And I think that's the tension here, ultimately, is like what experiment are you imagining that is going to tell us what to do on the model build in a complex system. That's the really hard question. And maybe you guys have great answers, but I certainly don't.
So I like to tell my students that it's like this just because I'm not smart enough. Maybe Tommy and Thomas have a clever experiment that will tell us, and we'll do it. But I don't know what it is. So we're doing the poor man's approach, which is, we get residuals and then update on them. And, at least, we'll go somewhere with that, we think, as long as we don't overfit, which we're trying to be sensitive to.
TOMASO POGGIO: Well, we had a class today by Christos Papadimitriou on models of the brain based on ensembles of neurons. This is a completely different class of models. And there may be something there.
Now it's impossible to explore with machines and parameters the class of models you are exploring, and this class of models, and all the other infinite number of classes of models that exist. One of the main failures of the original, old AI was believing that search can be used to solve every problem. They failed to realize that exponentials are really bad.
If you have 10 parameters, and you have to explore all the different combinations of them, this goes bad pretty quickly. It's like the story of the man who did a favor for the Chinese emperor. And then the emperor told him he could ask for any favor in return.
And he said, well, I want a grain of rice on the first square of the chessboard. And then you double it on the next one, and then you double it again. And the emperor says, oh, you don't want very much. But this is about 2 to the 64 grains of rice at the end, right?
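The arithmetic of the anecdote is a geometric series: doubling one grain across the chessboard's 64 squares gives 1 + 2 + 4 + ... + 2^63 grains. A one-liner to check the total:

```python
# Total grains after doubling across all 64 squares: a geometric series.
total = sum(2 ** k for k in range(64))

print(total)                 # 18446744073709551615, about 1.8 * 10**19
print(total == 2 ** 64 - 1)  # True: the closed form of the sum
```

The same explosion is what makes brute-force search over combinations of model parameters hopeless.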
But what I'm saying is I don't think you can do optimization exploring everything. It's absurd in terms of models and theories. So we have to do it in the traditional way of coming up with good ideas.
And some ideas may be quite-- what are the intracortical circuits in V1 or V2? What are they doing? Are they only doing normalization? Are they only doing ReLUs?
That seems to be the implication of current models of cortex, including HMAX. There must be something more going on. And this is not an experiment for the model. It's just doing physiology.
JOSH MCDERMOTT: But let's not forget, I mean, there's another direction that is sort of similar in spirit that Susan was getting at, which is to focus, actually, on qualitative behavioral [? discrepancies-- ?]
TOMASO POGGIO: Yeah, and I'm all for it, yes.
JOSH MCDERMOTT: --like the amazing things that people can do that these kinds of models are not great at at the moment.
TOMASO POGGIO: Very good. I think we went through a lot, and it's two hours.
JOSH MCDERMOTT: Yeah, maybe we're at a good stopping point.
TOMASO POGGIO: Unless you can provide a virtual coffee to the panelists.
JOSH MCDERMOTT: But this was fun. I'd like to thank everybody who participated, the three presenters and our discussants and everybody in the audience for all the questions. I'm sure there will be further discussions of these issues in months to come.