Vision beyond ImageNet: Understanding the brain mechanisms underlying visual recognition
Date Posted:
August 19, 2025
Date Recorded:
August 15, 2025
Speaker(s):
Thomas Serre, Brown University
GABRIEL KREIMAN: It's a great pleasure for me to introduce Thomas Serre. Many of you have seen him already yesterday evening. And so you've heard a little bit of his own introduction.
But Thomas was a grad student with Tommy Poggio. And from the very beginning, he was rigorously and meticulously thinking about the exciting intersection between behavior on the one hand, the neuroscience hardware on the other, and the computational question. So he's one of the most talented people in the world of vision who directly embodies these three levels of analysis, so to speak, of Marr and Poggio in terms of thinking about visual processing.
He's made seminal contributions to many areas of visual recognition and computer vision: developing algorithms that help biologists understand animal behavior through automatic tracking; developing careful computational models that respectfully follow critical aspects of the anatomy and physiology of the primate visual system; comparing computational models with rigorous psychophysics measurements; and more recently, over the last several years, doing exciting work on the alignment between human vision and computational vision, developing new methods in AI that are also directly relevant to those of us who are interested in understanding cognitive function and brain function in general.
So thank you very much, again, for joining us. And we look forward to your talk.
THOMAS SERRE: Thank you, Gabriel, for this overly-generous introduction, as always. Thank you. Thank you.
My name is Thomas Serre. Gabriel said that it was appropriate to introduce myself, so I'll give you a very quick introduction and mostly to tell you a little bit about the journey that took me from all the way back in France, many years ago, to the US and doing what I do today, in case this is of any interest.
So as I said, I was born and raised in France, in the south, in a little-- not so little, but by American standards a small-- city called Montpellier. I did all my education in France.
My educational system post-bac was the system of what we call the grandes écoles, which is a somewhat elitist system that dates back to Napoleon, where you take a couple of years of studying for a nationwide exam, which will then take you, or not, to one of those grandes écoles.
So I did the equivalent of math and physics for my undergrad. And then I attended a graduate engineering school in telecommunication, which I think is, roughly speaking, equivalent to EECS. I was very frustrated in engineering school. I did a couple of internships in the industry, debugging base stations for mobile phones back in the '90s, and thought that I didn't want to spend the rest of my life doing this.
At the time, I was specializing in image processing and computer vision. So back in the '90s, we were learning about artificial neural networks in school, but this was not really something that people were very excited about. But I was fascinated by it. And not so much in terms of the potential of artificial neural networks for computer vision, but really about-- I was fascinated by the biological underpinning.
And I realized that, actually, back in high school, I wanted to be a neurosurgeon, was pushed not to become a neurosurgeon. And then finally, learning about artificial neural networks, I realized that my real passion was in neuroscience. And I wanted to learn more about the underlying biology.
One limitation, I think in France, especially compared to the American educational system, is that disciplines are very compartmentalized. And so if you go to an engineering school, you'll do a lot of engineering things, but you don't have access to colleagues or faculty that can tell you about neuroscience.
Around the same time-- and that's where the story gets a little cheesy-- I was watching a documentary on the BBC. And they were talking about this group of crazy scientists growing neurons in a dish. And that was not the crazy part. The crazy part is that they were able to hook up a computer to this culture of neurons.
And so, obviously, this was through a recording array. They were able to supposedly communicate, send information from the computer to the neuron, and then do the opposite-- measure the activity of neurons and then send that signal back to the computer.
And so as a student in telecommunication, I was like, oh, my god. This is so much more interesting than debugging base stations. I decided that this is what I wanted to do.
And so at the end of the documentary, they mentioned that the work was being done in a place that I'd never heard about, which was called MIT. We didn't have Google back then, but I had to go online and figure out what MIT stood for. And then I proceeded to look at groups that could have anything to do with some form of vision, computer science, engineering, and most importantly, neuroscience.
And that's where, I guess, you'll find out that life is a lot of chance. I was lucky and fortunate enough to-- so I emailed Tommy, and I don't think Tommy actually answered my emails. But I spammed several people in the lab. And back then, there was a postdoc from Germany who had actually heard about the French educational system and thought that despite my very meager record-- no research experience, so-so grades, and all that stuff-- it was worth giving me a chance.
And so I then went to MIT to do an internship for my graduate degree. And then once I got into MIT, this was like a dream come true.
I mean, I'd never seen a place like this, where people, especially in brain and cognitive sciences-- again, maybe now it feels normal to all of you. But back then, being in a place where you could have people doing the mathematics of machine learning and statistical learning sitting next to people doing monkey electrophysiology, sitting next to people designing the next generation of engineered vision systems, that was just the most amazing thing that I'd ever seen.
So I worked hard as an intern-- published and worked as hard as I could, published as many papers as I could, because I knew that that was my only chance to get a PhD or to get into the PhD program, which luckily I did with a lot of support from Tommy. And for this, I'll be infinitely grateful forever.
And then, yeah, I did my PhD at MIT. I switched from face detection and recognition, which at the time was really about engineering systems-- back then, we were working with support vector machines and things of that sort.
But this was around the time when Tommy had just published a paper with Max Riesenhuber, a seminal paper on HMAX and the first serious contender as a model of the ventral stream of the visual cortex. And so I really got interested in and fascinated by that research, and then continued that line of work as a PhD student.
I spent a couple of years as a postdoc at MIT, and then moved to Brown in 2010 as an assistant professor. And I've been a professor since then.
All right. So that was a very lengthy introduction, but I thought there are a couple of life lessons in there-- in particular, the fact that there's a lot of luck in life. I think I've been very lucky and got to do what I love through serendipity. But, yeah, you need luck, and you also need to force your luck. And so I think it's important to be resilient and work hard towards your goals.
All right. Switching topic, I'm going to be telling you about vision beyond ImageNet. I'm not going to go into a lot of depth. I'm going to try to give you a sense of where I think the field is, and of the kinds of projects we are working on.
We probably have about an hour left, so it's going to have to be superficial. But I'm also hoping-- I don't have to cover everything. I don't have any strict agenda. So I hope this can be interactive, and you should ask questions. Let me know if something is not so interesting or you've heard about it before, and ask questions if any of it is interesting. And I'm happy to spend more time.
All right. So I'd like to start-- oh, and before I move on, I also wanted to make a point about the summer school. I hope you realize how lucky you are to be here. I haven't been coming since the very first edition of the summer school, but I've been coming for the past several years. And I promise you that this is something very special.
Several of my grad students-- including students who were not yet my grad students and went through my lab afterwards-- went through the course. And the experience has been transformative for all of them. So I want to thank Boris, Gabriel, and Tommy for having put this very special thing together, as well as all the TAs and the supporting staff for their help. So, yeah, I think this is a special thing.
All right. So turning to science. So I like to start my lectures with a demo. And you might have seen this demo before. I'm going to show you an example of a so-called "rapid visual categorization task." How many of you have seen this before? All right, yeah. A few have.
OK. So here, images are going to be flashed very briefly. And you're going to have to try to answer whether an animal is present or not in those images. So I'm going to give you-- I'm going to start the demo. So no, yes, yes, so on and so forth.
I hope you can appreciate that the objects here can appear at varying positions, sizes, et cetera. So there is a real challenge here for your visual system. And although, at this speed, you cannot perceive every detail in those images, I think most of you are able to perceive what I would call the gist of the scene.
Why do we care about evaluating people's ability to process images very quickly? Well, if you think about your visual system, this is a very complex dynamical system. There's a lot of things going on when you see. You can shift your attention. You can move your eyes. There are all kinds of connections between different brain areas.
And so the assumption behind this experimental paradigm that was pioneered some 30 years ago or so-- the key assumption was that perhaps we can push the visual system to its temporal limits by forcing very fast presentation and forcing people to answer as fast as they can. And if we do this, we're going to have an opportunity to study what people often refer to as "core visual recognition."
The assumption is that, when you process visual information that quickly, your visual system does not process the scene in full, but it's required-- it has to build what we would call a base visual representation that is sufficient to enable this rapid categorization.
But it's obviously not the end of vision. So this is not natural, everyday vision, where your eyes can move, your attention can shift, et cetera. So this is only really meant to capture the first 100, 150 milliseconds or so of visual processing.
Now, this ability to process visual information very fast is not limited to humans. I know that Jim gave you a lecture, so you're familiar with some of the monkey electrophysiology work that they've done. Monkeys can be trained to solve these kinds of very fast visual presentation tasks.
What I'm going to show you here is just a demo of a monkey doing the task. This is a slowed-down version. You see the monkey essentially holding a position and then releasing and touching the screen whenever it sees an animal.
And you see that, at the real speed, it goes so fast that you don't even have time to see anything. And yet, both humans and monkeys are able to classify the contents. Obviously, both monkeys and humans will make mistakes. And the interesting thing is that, if you give people more time to respond, they'll slow their responses down and start making fewer and fewer mistakes.
But overall, you'll find that for a given set of images, there is a pattern of correct and incorrect responses that is consistent across human subjects and across monkeys.
So the point that I'm trying to make here is that, when you force visual recognition to be fast, it's not like people just make random mistakes. They do follow a strategy, and the strategy is very consistent across human subjects.
So what do we know about the underlying brain processes? I'm assuming that-- so have you had a general ventral stream overview in neuroscience? OK, so I don't have to go into too much detail.
You've heard about the ventral stream of the visual cortex, which is a set of visual areas that have been critically and causally linked to our ability to process visual information and recognize objects. Our current thinking about the types of computational mechanisms at work in those visual areas is essentially one of a feedforward, hierarchical, gradual build-up of invariant representations.
We know that through the ventral stream, as we record from units in higher and higher visual areas, one can find representations that are gradually more complex and gradually more invariant. What I mean by this is that, in early visual areas, one will find neurons tuned to relatively simple stuff, like edges, bars, or gratings at specific orientations.
And as we record in higher and higher visual areas, neuroscientists have reported recording from neurons that are selective for more and more complex kinds of things-- combinations of orientations and things of that sort in intermediate visual areas. And by the time you reach IT, you'll find neurons responsive to things like object parts. And then for those classes of objects that are ecologically important, like faces and body parts, you'll find neurons selective for those classes of objects.
So we know that the-- well, so I'll skip that if you already heard about it. So one key set of facts that neuroscience has produced in the past several decades is that visual processing is based on a hierarchy of visual areas, gradually building up more and more complex and invariant visual representations.
The second point that I wanted to make is that all these visual areas are almost always reciprocally connected. So there are so-called feedforward connections that run from lower visual areas onto higher visual areas, and for every feedforward connection, there are almost always feedback connections-- that is, information flowing back from higher visual areas onto lower visual areas.
The assumption, when we force visual processing to be fast through this rapid visual categorization experimental paradigm, is that when the visual system is pushed to its temporal limits, visual processing has to be approximated by a single feedforward sweep of activity through the ventral stream.
If you give people more time, there will be enough time for feedback to kick in. But during this rapid presentation, when responses are speeded, there is only time for a single feedforward sweep of activity from early to higher visual areas, all the way to motor areas, to then press a lever.
There's a long history of models. I'm not going to bore you with the details, but you might have heard about some of them. Fukushima's neocognitron essentially extended earlier work by Hubel and Wiesel, meant to describe the anatomy and physiology of the primary visual cortex, to the modeling of the entire ventral stream of the visual cortex.
We were just discussing the HMAX model that was developed by Tommy and collaborators many years ago. And if we take a step back-- if you remember the discussion we were having last night about how much neuroscience can contribute to AI and vice versa-- to me, this time, right before the deep learning revolution, was kind of a golden age for computational neuroscience.
I think for the first time, we had models that people were able to ground in monkey electrophysiology-- models that could really recapitulate a host of experimental data. And when you simulate a model like HMAX, for instance-- and this was part of my PhD thesis-- one could show that the pattern of correct and incorrect responses of this model grounded in electrophysiology was consistent with the kind of correct and incorrect responses made by human observers during this rapid categorization task.
So in a sense, this was kind of a nice story because those models were helping us close the loop from electrophysiology all the way to behavior. Even more fascinating is the fact that these models were able to do this using only four key cascaded operations.
I told you that this is a feedforward hierarchy. All of these models were able to account for a large body of data using only those four cascaded operations. Convolution, I'm sure you're all familiar with: this is the extension of the receptive-field mechanism postulated by Hubel and Wiesel when they were thinking about a single neuron-- with a convolution, you replicate the same idea across positions.
Nonlinear rectification, to avoid negative firing rates. Divisive normalization, for contrast normalization. And then this was kind of a prediction from HMAX: some selective pooling mechanism was needed to achieve position and scale invariance, as proposed by Hubel and Wiesel, and the critical operation to achieve this pooling had to be a MAX operation, based on a blend of theoretical grounding and neuroscience.
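To make those four operations concrete, here is a minimal sketch of one such feedforward stage in PyTorch. This is not the original HMAX code; the filter sizes, normalization constant, and pooling neighborhood are illustrative choices.

```python
# A minimal sketch of the four cascaded operations: convolution ->
# rectification -> divisive normalization -> MAX pooling. Parameters are
# illustrative, not taken from HMAX itself.
import torch
import torch.nn.functional as F

def hmax_like_stage(x, filters, eps=1e-6):
    # 1. Convolution: replicate a Hubel-Wiesel-style receptive field across positions.
    r = F.conv2d(x, filters, padding=filters.shape[-1] // 2)
    # 2. Nonlinear rectification: no negative firing rates.
    r = F.relu(r)
    # 3. Divisive (contrast) normalization: divide by pooled local activity.
    norm = torch.sqrt(F.avg_pool2d(r ** 2, 3, stride=1, padding=1) + eps)
    r = r / norm
    # 4. MAX pooling over a local neighborhood for position/scale tolerance.
    return F.max_pool2d(r, kernel_size=2, stride=2)

# Example: a grayscale image and a bank of 8 oriented 7x7 filters.
x = torch.randn(1, 1, 64, 64)
filters = torch.randn(8, 1, 7, 7)
print(hmax_like_stage(x, filters).shape)  # torch.Size([1, 8, 32, 32])
```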
So that, I think, is kind of the nice story, where neuroscience contributed, in effect, convolutional neural networks. What happened beyond these models is, I guess, history. As you all know, this class of hierarchical feedforward models kind of stayed as they were. Meanwhile, CNNs exploded and took over the field.
So here's a little figure that I took from Papers with Code that recapitulates the history of the field, starting with AlexNet. I should point out that this is a summary of results achieved year after year on a challenge that I'm sure you're all familiar with, the ImageNet challenge, where models are trained and tested on about a million images that fall into a thousand object categories.
And when you look at the first deep neural network that was submitted to the competition and won back in 2012, AlexNet, it was interestingly, in terms of size and scale and depth, very similar to earlier models of the ventral stream of the visual cortex. It's actually pretty much an HMAX model that is just a little bit wider and trained with backpropagation, which was, at the time, one of the main novel contributions of this work.
If we look through the years, year after year, engineers have been able to design clever engineering tricks that have allowed these networks to learn and to become increasingly more efficient. You might know some of these tricks. Without getting into the details, for VGG, for instance, the basic idea was to replace those big convolutional filters, which were very hard to train because they have a lot of parameters, with a stack of three 3-by-3 filters. People realized, through trial and error, that it was much easier to learn a stack of small filters than one big filter all at once.
For ResNet, which I don't think is shown here, the basic idea was that it helps with gradient flow to have skip connections in those networks-- so not a strict hierarchy, but also connections that skip some of the layers, which, by the way, is also consistent with the anatomy of the ventral stream, where such skip connections are also found.
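A small sketch of those two tricks, assuming arbitrary channel counts, is below: a VGG-style stack of 3-by-3 convolutions covering the same receptive field as one large filter, and a ResNet-style block where the input skips over the convolutions.

```python
# Hedged illustration of the VGG and ResNet tricks described above;
# channel counts and layer counts are arbitrary example values.
import torch
import torch.nn as nn

# VGG-style: three 3x3 convolutions cover roughly the receptive field of one
# 7x7 convolution, with fewer parameters and easier optimization.
vgg_style_block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
)

class ResidualBlock(nn.Module):
    """ResNet-style block: the input 'skips' over the convolutions,
    which helps gradients flow through very deep networks."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # skip connection

x = torch.randn(1, 64, 32, 32)
print(vgg_style_block(x).shape, ResidualBlock()(x).shape)
```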
Anyway, if you look at the trend here in the field from about 2012 to the end of this plot, which is 2022, you see that the field has made quite a bit of progress. How do these models compare to human classification accuracy? The ImageNet data set is a little messy, I guess. The initial annotations were done relatively quickly. A lot of it comes from web searches, et cetera.
So a group a few years ago, led by Shankar and colleagues, did a tremendous job of going back to ImageNet, cleaning up a subset of the data set, and getting an actual clean human baseline on it. One of the issues with ImageNet is that, although there is a single class label per image, it turns out that there are often multiple objects present. And so when you count top-1 accuracy, when you force a single output from those models, some of the models could potentially be penalized for correctly detecting something present in the image that is just not matched by the class label.
They also evaluated a number of human subjects. Just to give you a sense-- I cannot really show the human level of accuracy here because it's a different kind of measure-- but in this paper, back in 2020, they found that this FixResNeXt model was within the confidence interval of humans.
So that means that since 2018, we've been able to design neural networks that are either at the level of human accuracy or better. And obviously, this trend has continued with the introduction, as you're all familiar with, of transformers, self-supervised learning, and all sorts of things.
So just to be clear, although I started motivating my presentation with this rapid visual categorization to motivate the development of feedforward hierarchical models, we are way past that. So we already had models 20 years ago that could recapitulate human accuracy during this rapid presentation. We're now at a level where models are performing on par or better than humans during these unspeeded categorization tasks.
Where is the field right now? We were curious to know where the latest and greatest models stand in comparison to humans.
So we took a library, which I really like, called timm. Some of you might be familiar with it. It's a PyTorch library where people have diligently taken representative models, retrained them from scratch, and tried to reproduce the original papers. So these are models that you can just use-- you have the model, the weights, and all that stuff-- and you can evaluate them.
And so all we did was take these models and evaluate them on this subset of multi-label ImageNet for human comparisons. Out of the 350 or so models we evaluated, about 30% were within the human confidence interval, and 4% were actually outperforming every human subject. And we're not talking about random naive subjects. We're talking about people who were trained to discriminate between all these classes of objects.
So I think this is amazing. If you had asked me back when I was a grad student, I never thought that I would see that during my lifetime. And so because of this, if you go to vision science conferences like I do and listen to conversations at posters or in hallways, you'll hear people casually stating or colleagues casually stating that deep neural networks are currently the best models of primate vision.
Now, the point that I'm going to-- and I'll build on that point. But you have to be careful about the fact that just because you have two systems that produce the same output or achieve the same level of accuracy does not necessarily mean that they leverage the same visual strategy, right? This is one of the points of debate we had yesterday.
Remember, for the mapping from image to class label, those models are so overparameterized that the solutions are never unique. And so the point that I'm trying to make is that it's entirely possible for those models to achieve human or superhuman levels of accuracy without leveraging a strategy that resembles that of human observers.
I'll get into the weeds about how we characterize visual strategy, but for now, what I mean by visual strategy, think about the kinds of visual information, visual features that neural networks might be leveraging to make their decision, versus the kinds of visual information or visual features that humans leverage when making those decisions.
All right. So I'll tell you about how we are making those comparisons. I'll start by telling you a little bit about how to characterize the visual strategies of human subjects. I should point out that there has been a lot of work in human psychophysics, and I'm not going to do justice to it. There are 40 years of psychophysics that have tried to address this very question of understanding the nature of the visual features used by human observers while they solve a variety of visual tasks.
The challenge with those methods is that they typically require tens of thousands of repetitions for a single image, for a single stimulus. And because here we're interested in collecting large, ImageNet-scale data sets, those methods cannot be applied readily.
So we had to come up with coarse approximations and simplifications when trying to characterize human strategies. Our best attempt at this is a game which we call ClickMe. The ClickMe part is a little bit of a misnomer because there is no click involved.
The game is as follows. If you are a participant, you start from a screen that looks something like this, with an image and a class label. You are the teacher, and you are paired with a student sitting somewhere else. You know that the student starts a trial with a blank screen. And as the teacher, you need to decide what part of the image to paint, to reveal. So whatever you brush over, it gets revealed to the student. And the rule of the game is for the student to recognize the image as quickly as possible.
And so the assumption with the game is that, in order to do well, people have to introspect. They have to ask themselves, what are the most informative portions of the image? And so the hope is that they're going to click on those parts of the image which then, in turn, will allow them to make points.
All right. So we get, again, click maps. We call them click maps, although there's no click-- you start clicking at the beginning, and then it's more like brushing over the screen. I'm not going to dwell on the details. We implement a lot of subtleties to make sure that people cannot just do a lot of clicks or a salt-and-pepper strategy. There's a maximum rate at which pixels can get revealed to the student.
We've run a lot of consistency studies, making sure that the clicks are repeatable across subjects. Initially, we were pairing human subjects with human subjects. But essentially, you spend a lot of human subjects on just one trial. We realized afterwards that we could get rid of the human as a student and just put in a deep neural network. The maps were consistent. Everything stayed the same.
So we can talk more about the details if you're interested. But for now, I'll ask you to just trust me on the fact that we have a reliable way to measure what parts of an image people think are most informative about the class label.
We get ClickMe maps for individual images, and then we can average across dozens of human participants. And we get maps of that sort, which we call importance maps. You see examples here of those importance maps.
So on animated objects, we find that people tend to click on facial components. They'll tend to select things like eyes, nose, and things of that sort. On vehicles, they tend to select things like wheels or front grilles.
And as I said earlier, sometimes we cannot necessarily name the parts. But we find that if you take a random half of the subjects and correlate the maps with the other half, there's a high level of consistency across subjects. So we know that they are not just randomly guessing. There's an underlying strategy.
So in phase one of this experiment, we went through 1,000 or so participants. We collected ClickMe maps for about 200,000 images from a subset of ImageNet. We're in the process of improving on this first phase. We're now in ClickMe 2.0.
Actually, if you want to play the game, this is a QR code that's going to send you to the game. You can make money, depending on your rank. And our goal is to cover the entire ImageNet. And so I think we're, like, 95% through ImageNet.
All right. So the point here is that we have a strategy to describe or characterize the human visual representation through these importance maps. We need to do something similar for deep neural networks. There's a whole field here. Unfortunately, I missed Josh's talk this morning, but you've probably heard from him or someone else in the summer school about attribution methods, which are part of the field of XAI, or explainability. There are several methods available.
Having done work in that space, I can say that there are subtle differences-- some are better than others, et cetera. But if you want a simple, kind of go-to way to derive attribution maps or importance maps from a deep neural network, a relatively low-cost and relatively well-grounded approach is the "saliency" approach.
And so the basic idea in the approach is to compute the gradient of the model output with respect to the input image. Maybe to give you a little bit of intuition for why that makes sense: remember that when you train the network, you rely on the gradient of the loss with respect to the weights. That gives you a recipe for optimizing the weights to minimize the classification loss, given your training data.
Remember that the gradient, which is the extension of the derivative from 1D to multiple dimensions, gives you a measure of the sensitivity of the function to an infinitesimal perturbation in the input. So once you have a trained network, you can interpret your network as a function of the input.
And so now you can measure the sensitivity of the output of the network by computing the gradient with respect to the input image. If you do this, you're going to get maps that look something like that, where a hot pixel means that the gradient with respect to the image is high-- making a tiny little change at that pixel location has a big effect on the output of the network for classifying that image according to the class label. Yeah?
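A minimal sketch of this gradient-saliency idea is below. The timm model name is just an example choice, and the preprocessing is omitted; a real pipeline would load and normalize an ImageNet image.

```python
# Sketch of gradient saliency: the importance map is the magnitude of the
# gradient of the class score with respect to the input pixels.
import torch
import timm

model = timm.create_model("resnet50", pretrained=True).eval()  # example model

def saliency_map(model, image, class_idx):
    """image: (1, 3, H, W) tensor; returns an (H, W) importance map."""
    image = image.clone().requires_grad_(True)
    score = model(image)[0, class_idx]   # scalar score for the target class
    score.backward()                     # d(score)/d(pixels)
    # Sensitivity at each pixel: max gradient magnitude over color channels.
    return image.grad.abs().amax(dim=1)[0]

# Usage with a random stand-in image.
img = torch.randn(1, 3, 224, 224)
smap = saliency_map(model, img, class_idx=207)
print(smap.shape)  # torch.Size([224, 224])
```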
AUDIENCE: Just going back to the ClickMe?
THOMAS SERRE: Yeah.
AUDIENCE: Because you were saying that it's introspection, so you choose it. Did you check the correlation with eye gaze or something?
THOMAS SERRE: Yes. So I don't think I have the data here, but two things. So the first is that we did a completely separate validation in rapid categorization task, so completely separate group of subjects. We present images, and we mask part of the image. Now, we decide what gets hidden in the image and what gets shown.
In one case, we use saliency data-- where people have looked when they passively fixate on images-- versus revealing the parts of the image that correspond to hot locations from ClickMe. Again, you'll have to trust me, but this is published. We show that with only something like a percent of the pixels revealed according to ClickMe, people are already at saturation. When we do the same thing with saliency maps, we need to go to 50% of the pixels.
So we know that the features that people click on are not just introspection-- we know that they are useful for visual categorization. In other words, if I hide most of the image and just show you what is shown here, people will be very accurate at discriminating between animal and non-animal and things of that sort. So the approach is not perfect, but we're fairly confident that we're not measuring noise.
All right. So we have a way to measure importance maps for humans, a way to measure importance maps from models. And so now we can ask, how do the two agree?
So just to show you a subset of images-- these are ImageNet images. You see a few examples here. Each row is a different image. You see the ClickMe maps from humans. And this is what you get if you apply this attribution method to representative networks.
And I don't think I need to show you a lot of complex math. It's pretty obvious that the networks look at almost everywhere in the image except where humans do.
Now, to characterize the agreement more quantitatively, we can just take those two maps, correlate them, and see how correlated they are.
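A sketch of this comparison is below: flatten the model saliency map and the averaged human ClickMe map and compute their rank correlation. The blurring step is an illustrative detail, not necessarily the exact preprocessing used in the published work.

```python
# Sketch of the "feature alignment" metric: Spearman correlation between a
# model importance map and a human importance map over the same image.
import numpy as np
from scipy.stats import spearmanr
from scipy.ndimage import gaussian_filter

def feature_alignment(model_map, human_map, sigma=2.0):
    """Both maps are (H, W) arrays over the same image."""
    m = gaussian_filter(model_map, sigma).ravel()  # light smoothing (assumption)
    h = gaussian_filter(human_map, sigma).ravel()
    rho, _ = spearmanr(m, h)
    return rho

# Example with random stand-ins for the two maps.
rho = feature_alignment(np.random.rand(224, 224), np.random.rand(224, 224))
print(f"feature alignment (Spearman rho): {rho:.3f}")
```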
AUDIENCE: Sorry.
THOMAS SERRE: Yeah?
AUDIENCE: Just to clarify the task that is ClickMe, they start from a blank and you give--
THOMAS SERRE: No. You start from an image, and you have to decide where to click on the image-- what pixels to reveal to a student somewhere else. And the assumption is that, to solve this task, you need to identify what parts of the image are most informative about the class label. Because if I click on the background, potentially, there wouldn't be much for you to recognize.
So what you see on this plot here is, on the x-axis, the object classification accuracy-- this is ImageNet accuracy-- versus the metric that I just mentioned, where we correlate the importance maps from models and humans. We call that the "feature alignment."
Each dot here is a different architecture. I'll show you more maps like this. I forgot to add the legend, but you see a range of accuracies. I think these are CNNs, and I think the purple ones are maybe transformers. The paper is a little dated, but we have more recent work that I can show you if you're interested.
AUDIENCE: [INAUDIBLE]
THOMAS SERRE: This is a Spearman correlation. So essentially, this is telling you the agreement between the two. So the point--
AUDIENCE: What is the human-to-human correlation?
THOMAS SERRE: The human-to-human correlation is about 0.65, I think, on this data set. Yeah. So the point here is that there's a Pareto front. There are no dates here, but pretty much we're going from earlier, older models, like AlexNet, towards more and more performant models, trained with bigger data sets, at greater scale, and all those things.
And you see that, as these networks get more and more accurate on ImageNet, their agreement with humans keeps falling, to the point where, here, there's very little correlation between the image regions that are important for models and those that are important for humans.
Yeah. So I just added that. This is an update. We just wrote a review article where we added all the latest and greatest models, so this includes many more models.
You see here that, again, the initial trend with the older CNNs was that, yes, as the models were getting more performant, they were getting more aligned with humans. But since about-- I think this wide ResNet dates from about 2019, I want to say-- since this change point here, you see that the models are definitely getting better, but whatever they are doing, their strategy is getting increasingly misaligned with human strategies.
AUDIENCE: Yeah. Could you say again how you pull out the strategy for the models?
THOMAS SERRE: So the model is just a gradient saliency.
AUDIENCE: Oh, I see.
THOMAS SERRE: Yeah.
AUDIENCE: Have you tried having the model find the strategy?
THOMAS SERRE: Sorry, the model--
AUDIENCE: Have you tried having the model find the strategy? Like, an approach where the model would find the most informative patch and so on, and then compare it?
THOMAS SERRE: So as opposed to-- I mean, I'm not sure I fully understand. So this is pretty much what we are doing here.
AUDIENCE: The model has global information. And then you are asking, what's important to you?
THOMAS SERRE: Which is exactly what we do for human subjects, right? The human subject sees the image and has to decide--
AUDIENCE: I mean, like, with the human [INAUDIBLE], what's the most informative patch in the image? So you can ask the same thing of the model. What is the most informative patch? And then given this, what's the second most informative patch, and so on? And then you have a series of patches.
THOMAS SERRE: I mean, honestly, I don't think-- so this is based on saliency. We've tried other methods. Some of the methods to get the attribution maps are essentially doing this: rather than looking at the gradient for individual pixels, you smooth out the gradient and look at the local value, and you get pretty much the same result.
I'm confident that the misalignment we're finding is independent of the method used to derive the importance of pixels. Yeah.
AUDIENCE: I was going to say, there's two interpretations. One interpretation is that the model is doing something more dissimilar to humans. The other interpretation is that the gradient saliency map is less informative of the actual strategy the model is using.
THOMAS SERRE: As I said, the results we are showing use saliency, but we have used others-- there are sampling-based methods that are not gradient-based. That includes RISE, for instance, if I remember correctly. Those methods give very similar saliency or importance maps. But I'm happy to discuss this afterwards if you want.
All right. I just wanted to point out-- because I think Jim spoke earlier-- that obviously this is not the only metric. So you've heard about Brain-Score and how you can fit a linear mapping from model representations onto neural responses. Jim has been doing a lot of this work.
And again, initially, during the early days, you have your old HMAX here, a fully feedforward model of the ventral stream. The early task-optimized models were kind of leaving these older models in the dust. So here, this is explained variance in IT responses instead of our feature alignment, but this is image categorization as before, with individual dots corresponding to models.
So there was a certain level of optimism that engineering and task optimization alone-- better models for image categorization-- would yield better models of the brain. There was an update from Jim's group published a couple of years ago, where they added some of the more recent models beyond AlexNet, et cetera. Again, the trend seemed to be going in the right direction.
And again, we just did an analysis where we took all the models on Brain-Score, because we wanted to update this. We looked at all the ViTs, all the different pretraining schemes, different amounts of data, and you see that the trend is always there.
I think this is a model that dates from about 2019. The trend is reversing. Those models are definitely getting less in line with neural data.
This is the Brain-Score data, Jim DiCarlo's data set. We also have collaborators at Harvard, in Marge Livingstone's group, who have been giving us similar IT data. And the trend is exactly the same on those data. So the point that I'm trying to make here is that newer, larger-scale models are definitely worse models of biology.
Now, the question that I'm after-- and I'm going to have very little time to tell you about it-- is, why is that? One possibility is that there is just a fundamental limitation that prevents these models from learning a human-like visual strategy. Maybe it's the architecture. Maybe there's just something about convolutional neural networks or transformers that prevents them from learning the same kinds of visual features that are diagnostic for humans.
We don't think this is the case. And the reason why we don't think this is the case-- and I'm not going to tell you many details about the method-- is that we've developed a method, which we call "harmonization," where we use machine learning optimization methods to force the model to learn a human-like visual representation.
The way we do this is that, when we optimize the model for image classification, we also force the gradient of the model to agree with the importance maps of the humans. You just need to take a gradient of a gradient. It's all differentiable, so we can do this.
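A heavily simplified sketch of this dual objective is shown below. It is not the published harmonization code; the normalization and the MSE penalty are illustrative stand-ins for the actual alignment term.

```python
# Sketch of a dual loss: classification plus a penalty on the mismatch between
# the model's own saliency map and the human importance map. The saliency map
# is itself a gradient, so the loss differentiates through a gradient
# (create_graph=True), i.e., a gradient of a gradient.
import torch
import torch.nn.functional as F

def harmonization_like_loss(model, images, labels, human_maps, alpha=1.0):
    images = images.requires_grad_(True)
    logits = model(images)
    cls_loss = F.cross_entropy(logits, labels)

    # Model saliency: gradient of the true-class scores w.r.t. the pixels,
    # kept in the graph so it can itself be optimized.
    scores = logits.gather(1, labels[:, None]).sum()
    grads = torch.autograd.grad(scores, images, create_graph=True)[0]
    sal = grads.abs().amax(dim=1)                       # (B, H, W)

    # Alignment term: normalized saliency should match normalized human maps.
    sal = sal / (sal.amax(dim=(1, 2), keepdim=True) + 1e-8)
    hum = human_maps / (human_maps.amax(dim=(1, 2), keepdim=True) + 1e-8)
    align_loss = F.mse_loss(sal, hum)

    return cls_loss + alpha * align_loss
```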
And it works. What you see here are four models. You have the base model here, the base VGG, which has pretty poor alignment with humans. When we harmonize-- when we prevent the model from choosing whatever features it wants and force it to leverage human-like features-- we find that the alignment improves.
I think the ViT here is pretty much-- you were asking me about the noise ceiling-- this is pretty much at the noise ceiling. And these are, of course, cross-validated. So those models can learn a human-like strategy. They just don't do it when they are trained and optimized for ImageNet classification. Yeah.
AUDIENCE: With the models that have been harmonized, does their accuracy reach a ceiling?
THOMAS SERRE: Yeah. So the ViT is a good example. I think the ViT-- don't quote me on that. I should have put the-- I think this is very close, if not at ceiling.
AUDIENCE: In terms of the x-axis too, does it have a ceiling on ImageNet accuracy? Is it a bad thing to rely on human-level features?
THOMAS SERRE: Ah, OK. So the x-axis is ImageNet accuracy. I'm not going to make a big point about that. I was predicting that the models would get worse at classification accuracy, because we know that there are statistical shortcuts in internet images-- information that's potentially not even visible to a human, like, I don't know, a watermark or things of that sort.
Here, in this particular version of harmonization, we find that, for all the models, either the accuracy stays the same or it improves slightly. I think the reason why it improves is just that we help regularize the network a little bit when we do this kind of dual optimization, where we simultaneously train the model to recognize and classify images and force it to attend to image regions that are important for humans.
So we know that those models can learn human visual strategy. They just don't do it when we train them on ImageNet. Yes.
AUDIENCE: Can you quickly describe how the harmonization works?
THOMAS SERRE: I'll tell you afterwards, because I'm going to run out of time. And this is not even the most interesting part of this talk. This is just a proof of concept to show you that we can.
AUDIENCE: Technically, can you test some of the other points that are lower in the y-axis?
THOMAS SERRE: So we are working on that. This was a paper published, I think, back in 2022. We have since been working on the harmonization and trying to scale it up. Literally, what we would want to do is take the timm library and be able to just harmonize all the models there, so that all these models would ideally be forced to align with humans. And those are models that people could then use.
AUDIENCE: So the points you tested have already started quite high on the y-axis. You're starting with a point [INAUDIBLE].
THOMAS SERRE: Again, I'm not trying to make any point about the method itself. All I'm trying to say is that, the Pareto front here, we can break the Pareto front. The models can learn the strategies. They just don't. OK.
So, all right. And so just to show you, this is what the harmonized model looks like. So this is your human data here. This is the original ViT. Initially, it looks everywhere in the image except on the object. And then as soon as we harmonize it, it actually does exactly what you would want it to be doing.
Very quickly-- again, I'm running out of time. What is interesting here is that you can then test those harmonized models on other kinds of alignment data. So here, we're measuring the alignment of both the initial and harmonized models, and the harmonized models are here in purple.
And you see that, here again, we have the same-- on our neural data, we have the same Pareto front. But again, so these are models that have never seen any neural data. They have just been co-trained with human psychophysics data. And you see that just this co-training allowed them to cross the Pareto front.
So yesterday, we were talking about inductive biases and the fact that the inductive biases of biological brains are different from those of machines. Here, you see that just the addition during training of this small psychophysical inductive bias is sufficient to enable the models to suddenly align better with neural data.
AUDIENCE: [INAUDIBLE]
THOMAS SERRE: That's fair. I wouldn't say we have solved vision yet. But the point is that we are still going in the right direction, which is, I think, somewhat encouraging.
Similarly, you probably heard from Jim about adversarial attacks and all these things. Here, we wanted to look at tolerance to adversarial attacks, which is the y-axis here, as well as another measure. When you generate an attack, you produce a mask that gets superimposed on the image. And so the question is, we would like our harmonized models not only to be more robust to attack, but we would also want the attack to be targeting more human-like visual features.
And so we are measuring, here, the alignment between the mask used for the attack and the importance maps derived from human, which is essentially the x-axis here. And you see that after we harmonize, we get exactly this. The harmonized models are those yellow dots.
We are not as robust as adversarially trained models. We are more susceptible to attacks. But you see that, in order for the models to be attacked, the pixels that need to be targeted are those very same pixels that were important for human observers.
In a sense, there's a bit of circularity here. We're just showing that the harmonization works in the sense that the model is really now paying attention to those pixels, because that's how you're going to need to modify those pixels for the attack to be successful.
AUDIENCE: What does it mean for an adversarial attack to be human-like?
THOMAS SERRE: Sorry? Yeah, I'm going very fast. What I mean by this is that, when you generate the attack, you're going to get a mask of the attack. That mask, you can correlate with the importance map we are getting from humans. Essentially, it gives you a measure of how much the attack is targeting features that are important for humans, if that makes sense.
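A sketch of that kind of analysis is below, using a single FGSM step as a stand-in for whatever attack was actually used: generate a perturbation, then correlate its magnitude map with the human importance map, exactly as in the feature-alignment metric above.

```python
# Sketch: adversarial perturbation (FGSM, as an example attack) and its
# alignment with a human importance map.
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

def fgsm_perturbation(model, image, label, eps=2.0 / 255):
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    return eps * image.grad.sign()                    # (1, 3, H, W) perturbation

def attack_human_alignment(perturbation, human_map):
    attack_map = perturbation.abs().amax(dim=1)[0]    # (H, W) attack "mask"
    rho, _ = spearmanr(attack_map.flatten().numpy(), human_map.flatten())
    return rho
```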
All right. Let me close that parenthesis. So the point that I've been trying to make with the harmonization is that we know that baking in constraints from human psychophysics-- human inductive biases-- is useful. We're getting better models that are more robust and better aligned with neurophysiology data.
The fact that we force that on the model makes the harmonization not very interesting scientifically, right? To me, what is interesting about the success of the approach is that it tells us that the models can get aligned. They can learn a human-like visual representation.
The interesting question then becomes, how do we get those models? What is it in development, learning, plasticity that leads our visual system to learn the kinds of visual representations that we learn? And why is it that these models do not learn them?
The fact that we can harmonize different architectures-- we can harmonize CNNs and transformers-- suggests, I think, that the misalignment probably has less to do with the specifics of the neural architecture and more to do with the ways those algorithms are being trained.
If we go back to the literature that dates back decades ago, there's a lot of work in computational neuroscience that has pointed to video data, temporal continuity as a key enabler of the visual representations that are built and learned by our visual system.
Again, in the interest of time, I'm not going to go through all the details. But some of you might have heard about slow-feature analysis. The idea is that good invariant representations can be learned if you try to learn representations that are stable across transformation sequences of an object, for instance.
Because if you are able to build a representation that is stable over time while you're experiencing, for instance, an object that's rotating in depth, or translating, or zooming in and out, you're going to be able to build a representation that is invariant to those transformations. So there's a whole literature, which I'm not going to do justice to, that has described or made assumptions about the underlying computational mechanisms and data diets that are important. Yes.
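A toy illustration of this slowness principle is below. It is not any specific published slow-feature-analysis implementation; the variance term and weights are illustrative choices to avoid the trivial constant solution.

```python
# Sketch of a temporal-slowness objective: features on consecutive frames of a
# transformation sequence should change slowly, without collapsing to a constant.
import torch

def slowness_loss(features, var_weight=1.0):
    """features: (T, D) tensor, one feature vector per video frame."""
    # Slowness: consecutive frames should map to nearby points.
    slow = (features[1:] - features[:-1]).pow(2).mean()
    # Anti-collapse: keep each feature dimension from going constant.
    variance_penalty = torch.relu(1.0 - features.std(dim=0)).mean()
    return slow + var_weight * variance_penalty

# Example: a smooth 16-frame trajectory in a 128-d feature space.
feats = torch.cumsum(0.01 * torch.randn(16, 128), dim=0)
print(slowness_loss(feats))
```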
AUDIENCE: I just have a question. How much of this misalignment can be explained by the [INAUDIBLE] of vision or [INAUDIBLE]?
THOMAS SERRE: Well, yeah, that's kind of where I'm going. Some of the work that I'm going to show you, which is a big effort in my lab and related to the discussion we were having last night-- the idea is really to identify the key learning and developmental principles that are going to help shape the learned representations of these networks to better align with humans.
So, yes, I think going towards data diets that are more baby-like is going to help. Allowing a much more active vision system rather than a passive one that just observes-- being able to manipulate an object or make predictions about how an object should look from different viewpoints, rather than just passively observing transformations-- is probably the kind of thing that is going to help those networks be better aligned.
I'm not going to show you data on embodiment. But all I can say is that we're working on this, and I'm quite optimistic about the possibility of improving the alignment through active vision. Yes.
AUDIENCE: I think just the point is, one of the important parts that we've shown for models is that it was quite scattered. Humans cannot afford having that kind of scatter [INAUDIBLE] focus. So if you just force--
THOMAS SERRE: Yeah. And I would also make the claim that, unlike what Jim has been probably telling you, that image categorization is the goal of the visual system, I would argue that it's probably a very minimal part of what our visual system does.
I think navigation and manipulation of objects is probably at least equally as important. And if you need to manipulate an object, the features that matter for this manipulation will be very different, potentially, from the features that are needed for rapid categorization, for instance. So I think we're on the same page here.
Anyway, so there's a lot of work that's being done. And because I have 15 minutes, I just want to give you a quick overview of initial results we have in that space.
OK. So the general idea is that we want to get rid of training on static ImageNet images. There has been some work over the years trying to build data sets that would be ecologically more valid.
Back when I was a grad student, someone had published a cat-cam data set. This was essentially a couple of grad students who taped a little camera on top of their cat, let the cat roam around the woods, and then ran behind the cat because they had a laptop that needed to-- so that was one data set. And that's where most of the early work on learning invariant visual representations was done.
There are baby-cam data sets you might have heard about. This was just some data we had collected a few years ago with a colleague of mine, a developmental psychologist, where we were recording from kids.
So you were asking me about eye movements. And I think this is a 13-month-old that has just been given a bag of toys and seeing those toys for the first time. And, of course, the first thing that the kid is going to be doing is getting social cues, looking at objects, obviously with [INAUDIBLE], and then grabbing some of those objects, manipulating the objects.
So it's pretty clear that the kind of data diet that this kid is experiencing the first time they encounter those objects is very different from the kind of data that deep neural networks trained on internet images are being fed.
We are not working with those kinds of ecologically valid data sets because it's really hard to have good control. And I really want to understand and I want to be able to relate principles, statistics of those data sets to what gets learned. And so we've been working with this data set, which is publicly available, called Co3D. So you see examples here.
These are data sets that were collected by people taking videos, kind of rotating around objects. The cool thing is that this data set was collected for computer graphics, so people have produced NeRF and Gaussian splatting models for it. If you don't know what those are, it's not very important.
This is kind of the modern computer graphics. Once you have those models, we can essentially resynthesize novel videos and have complete control over the parameters of the camera. So we can smooth out those videos. We can control. We can have clean rotation, translation, any kind of arbitrary transformations applied to these objects, which gives us a lot of control over the data sets.
I told you a little bit about the computational neuroscience background on learning from video data. I should also point out that there's a lot of work right now in the field of self-supervised learning, in computer vision and AI more broadly. I don't know-- did anyone tell you about self-supervised learning? A little bit? OK.
And so there's a lot of work. I'm not going to go through all the details, but maybe I'll try to give you a sense for how it works. The basic idea is just like how ChatGPT and BERT and all the large language models were trained: those models are trained on sequences of words, or tokens, and the idea is to mask some of those tokens.
In ChatGPT, you are doing a kind of next-token prediction. So if you think about those general methods, you can characterize them in terms of how much of the input they mask, from 0 to 100%. You can also think about how the mask gets applied.
So as we were discussing, in ChatGPT, you're predicting the next token, so the mask is applied to the future. In BERT, you're masking random tokens, so there is a lot more dispersion. Anyway, all of the current self-supervised learning methods-- trust me on that-- will fall somewhere on this spectrum, defined by the amount of temporal dispersion of the mask and how much of the input the mask covers.
All right. So in the case of vision, the same idea pretty much applies. When we feed those sequences-- so this would be a sequence where we just randomly hide patches. There are three frames here, or maybe four. I know it's hard to see.
So you have to fill in the blanks here. This is an example of what is called an MAE, a Masked AutoEncoder. During training, the network has to fill in the blanks. To solve the task, it essentially has to do spatiotemporal interpolation-- it has to combine information across space and across time.
Conversely, if we look at the autoregressive, GPT-style training, you would have to predict the next frame. Here, you see that all the masks are pushed to the final frame. And so there's an interesting distinction between those two tasks. As I mentioned, the first is an example of spatiotemporal interpolation-- all the information is there; you just need to interpolate. The second is quite different computationally, because you have to extrapolate. You have to predict entirely into the future.
And similarly, once you think about those different versions of how you can mask, and whether you try to predict the present or the future, you can also do this prediction or reconstruction either in the input space-- literally trying to recreate the pixels-- or in the latent space of the model. That's the subtle difference between things like DINO and MAE.
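A sketch contrasting the two masking schemes just described is below, for a clip laid out as patch tokens per frame. The mask ratio and tensor layout are illustrative choices, not the exact setup of any particular method.

```python
# MAE-style masking hides random patches scattered across space and time
# (interpolation); GPT-style masking hides the entire final frame
# (extrapolation into the future).
import torch

def mae_style_mask(num_frames, patches_per_frame, mask_ratio=0.75):
    """Random spatiotemporal mask: True = hidden from the encoder."""
    return torch.rand(num_frames, patches_per_frame) < mask_ratio

def autoregressive_mask(num_frames, patches_per_frame):
    """Next-frame prediction: only the last frame is hidden."""
    mask = torch.zeros(num_frames, patches_per_frame, dtype=torch.bool)
    mask[-1] = True
    return mask

T, N = 8, 196          # 8 frames, 14x14 patches per frame
print(mae_style_mask(T, N).float().mean())       # roughly 0.75 of tokens hidden
print(autoregressive_mask(T, N).float().mean())  # exactly 1/T of tokens hidden
```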
One thing that I forgot to mention is that there is a lot of work-- I mean, literally, all of the industry work is done with variants of these self-supervised learning frameworks. They work on large video data sets. But ironically, none of these approaches really leverage the temporal smoothness properties of videos. In general, as far as I know, they just use videos as an opportunity to sample random frames, to get more data.
All right. So we have our Co3D. We also ran our ClickMe experiment on Co3D, and we get importance maps from humans. And here are the initial results we're getting.
Every dot here is one of the models we took from TIMM, so you have a variety of architectures, CNNs and ViTs. The x-axis is accuracy on Co3D. There are about 50 or so classes, so the accuracy is what it is.
The other axis is the feature alignment that we derive from humans. You can actually see the Pareto front here: we have better and better models, but there again, the better models turn out to be less aligned with humans.
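The exact alignment metric isn't spelled out here, but a common way a human feature-alignment score of this kind can be computed is to correlate a gradient-based saliency map from the model with the human ClickMe-style importance map. The function names and the choice of Spearman correlation below are my own illustration, not necessarily the metric used in this work:

```python
from scipy.stats import spearmanr

def saliency_map(model, image, label):
    """Magnitude of the class-logit gradient with respect to the pixels."""
    image = image.clone().requires_grad_(True)   # image: (C, H, W) tensor
    logits = model(image.unsqueeze(0))           # (1, n_classes)
    logits[0, label].backward()
    return image.grad.abs().sum(dim=0)           # collapse channels -> (H, W)

def alignment_score(model, image, label, human_map):
    """Rank correlation between model saliency and human importance map."""
    sal = saliency_map(model, image, label).flatten().numpy()
    hum = human_map.flatten().numpy()            # human_map: (H, W) tensor
    rho, _ = spearmanr(sal, hum)
    return rho                                   # average over images in practice
```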
These are models that have been pretrained on ImageNet, and we are evaluating them through linear probing on our data set. What we are after is a recipe for learning from video data that will help steer those models towards better alignment with humans.
In the ideal case, we would like to take one of those models and push it as far into the upper-right corner as we can. But I would already be pretty happy if we could move it up, which would mean we had at least learned a more human-like visual representation.
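For reference, linear probing in this setting looks roughly like the sketch below: the pretrained backbone is frozen and only a linear classifier is trained on top of its features. All names here are placeholders, not the actual evaluation code.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, train_loader, feat_dim, n_classes, epochs=10):
    """Train a linear classifier on frozen backbone features."""
    backbone.eval()                                # backbone stays frozen
    probe = nn.Linear(feat_dim, n_classes)
    opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                feats = backbone(images)           # (batch, feat_dim)
            loss = loss_fn(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe                                   # evaluate on held-out data
```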
So this is-- yes.
AUDIENCE: Is your overall goal with this to have kind of a model system for biological vision?
THOMAS SERRE: Well, ideally, yes, the goal would be to have such a model. But as we were discussing yesterday, those are transformers, and I don't think transformers are necessarily good models of vision. One of the interesting things about modern AI, though, is that these learning approaches really work, in ways they didn't work when I was a grad student.
So now we can start testing principles, right? We have this hypothesis about predictive coding in neuroscience. Well, we can put it to the test. And we can see what comes out of optimizing for predictive coding from temporal video data versus other forms of self-supervised learning. That's not the only way to do it.
The hope here is that we can identify principles. We can find that certain data diets, certain general objective functions, certain developmental principles will yield representations that are better aligned with humans than others. That is the goal.
I don't know whether we're necessarily going to get mechanistic models of vision, and I don't think we will. But I think if we can identify general principles, I think we'll have learned something about biological vision.
OK. So these are our initial results, which are just coming out. We're starting from the middle of the pack: a small ViT model pretrained on ImageNet. And we're doing something a little weird: we are fine-tuning this model through self-supervised learning, and we're going to try different representative self-supervised learning losses.
First, we do the plain fine-tuning: we start from the ImageNet-trained model and fine-tune it for classification on Co3D. The fine-tuning works, we get better accuracy, but you see that the alignment with humans is not improving. So learning object features that are diagnostic for image classification is not helping this model get better aligned with humans.
Here, we are trying the MAE. This is the BERT style of learning: we just learn to fill in the blanks, doing spatiotemporal interpolation, so the network has to leverage information that is scattered across frames. And here we are not doing very well: the accuracy goes down, and the alignment has not improved.
Here is a slightly more interesting case. The video MAE is a special case of the MAE: rather than scattering the masked patches across frames, we mask a tube, a little 3D spatiotemporal volume, so there's a bit more structure in the task. And you see that we are starting to align a bit better with humans, but it comes at the cost of classification accuracy.
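A small sketch of the tube-masking idea, with made-up patch-grid sizes: the same spatial patch locations are hidden in every frame, so the mask forms a spatiotemporal volume rather than independent per-frame holes.

```python
import torch

def tube_mask(n_frames: int, grid_h: int, grid_w: int, mask_ratio: float):
    """Boolean mask of shape (n_frames, grid_h * grid_w); True = hidden."""
    n_patches = grid_h * grid_w
    n_masked = int(round(mask_ratio * n_patches))
    hidden = torch.randperm(n_patches)[:n_masked]
    frame_mask = torch.zeros(n_patches, dtype=torch.bool)
    frame_mask[hidden] = True
    # The same spatial pattern is repeated across time: a "tube".
    return frame_mask.unsqueeze(0).expand(n_frames, -1)

mask = tube_mask(n_frames=8, grid_h=14, grid_w=14, mask_ratio=0.9)
print(mask.shape, mask[0].equal(mask[-1]))   # torch.Size([8, 196]) True
```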
Here are the two most interesting, so far the best, self-supervised learning objective functions that we have found. These are the autoregressive ones, what I call predictive coding: you have to predict the next frame, so this is an example of extrapolation. And we find that, both in pixel space and in latent space, we can improve the alignment with humans.
And in the case of the autoregressive version, we can simultaneously improve alignment with humans and also improve slightly on accuracy. That's all I have.
So we are working on scaling that up. What I would like to do next year, when I come back to this summer course, is to show the impact of this autoregressive self-supervised loss on all of the initial models. My hope is that they all move towards better accuracy and better human alignment.
But as we were discussing yesterday, the goal here is really to focus on human alignment. I'm not entirely sure that the optimal solution for aligning with humans will necessarily be the optimal one for improving accuracy. Yeah.
AUDIENCE: So I have two questions. So the first one is, would that also generalize to neural predictivity?
THOMAS SERRE: We are in the process of testing that. I don't have results to show you yet; we are hoping. This is very initial, so I don't have much more to show, except to say that I'm really interested in understanding how these different losses yield qualitatively different visual features.
There's a lot of work that has been done in the past 50 years of vision science research. You might have heard about all kinds of junctions that vision scientists have linked with people's ability to recover surface properties from 2D images, things of that sort.
I would love to be able to tell you that these kinds of learning losses yield better representations of non-accidental properties, the kinds of things that are important for humans. We are in the process of annotating our data set so that we can have a finer-level characterization of what is different about the learned representations.
AUDIENCE: My second question is related; it's more high level. It seems that the search space is quite large.
THOMAS SERRE: Yes.
AUDIENCE: The loss functions, the data diets, the mechanisms. Also, you talked about feedback and [INAUDIBLE]. So how do you envision approaching this large search space, and how can some--
THOMAS SERRE: We struggle. I mean, we struggle doing this in academia. We keep asking ourselves-- one anecdote I'll share, and I'm not sure I want to say this on camera, but I'll do it anyway: we got a small grant from NAIR, or something like that, just to have access to compute.
This was toward the end of the year. We have an NSF grant that's funding this work, and I think this was in April or May, getting close to the end of the fiscal year.
This institute has credits on Microsoft Azure, and someone reached out saying, we have a bunch of extra credits; we see that you already have a lot allocated for your NSF grant; are you interested in using what's left? We said yes, and we got a grant for $300,000. With that, we tested five models, and we were not even able to train them all the way.
[LAUGHTER]
AUDIENCE: Isn't that great? So clearly the challenge is that we need to pick our battles. How do you envision-- are we going to have to pick one thing to adapt in a model and see how--
THOMAS SERRE: Yeah. As a scientist, I really want to cover the space and identify general principles. But you're right, we're going to have to pick; we are not going to be able to test the entire space.
Right now, we are doing a kind of greedy search; I quite like this predictive coding, extrapolation-versus-interpolation direction. At some point, we might have to give up on some of the alternatives and commit to this path. Or maybe Anthropic will get interested in the work and give us access to a whole bunch of compute.
I mean, certainly, we're in the process of getting, what is it, B200 GPUs, the latest generation of GPUs, at Brown, not for my lab. We'll see how far that gets us. But yeah, that's the challenge.
If you're interested in this kind of work, any kind of learning or developmental work with deep learning, the cost of compute will be the bottleneck for academia, I think, and the limit on what we can do.
I think I'm at time, so I'll skip the few more examples I was going to show. Obviously, this is very superficial and preliminary, but we're really trying to characterize, both qualitatively and quantitatively, the nature of the representations that are learned. This is an example of the kinds of junctions that people have worked with.
We are designing new tasks, which I'm not going to have time to tell you about unless you ask me in the questions. So I just want to finish with some final thoughts, wrapping my head around yesterday's discussion.
So I tried to make the claim, or the point, that initially neuroscience did have an impact on AI, right? This is the case people make with CNNs, and there are more examples.
I think neuroscience is becoming, and will continue to become, less and less relevant to AI, at least for the kind of work that I do in vision. I can envision a space where building the next generation of AI agents might rely on theory of mind and notions of more human-like intelligence. But for vision, and any kind of sensory modeling, I'm pretty convinced that neuroscience doesn't have a lot to contribute to AI. I do think, however, that AI has a lot more to contribute to neuroscience.
The example I gave you is that testing principles relating to learning and development is something we can really instantiate in AI models, and we can really test hypotheses that we were not able to test back when I was a grad student, because the methods didn't work. We were never able to learn representations good enough to serve as valid instantiations of those theories.
So I think we're at a turning point. If I sounded a little more negative than I wanted to, I am genuinely excited. But I think the synergies between AI and neuroscience will become more and more unidirectional.
So I'll leave you with that, along with an acknowledgment of all the wonderful lab members who have contributed to this work. We're out of time, but I'm here, and if people want to ask me a question, I'm happy to stick around for a little bit. Thank you.
[APPLAUSE]