Aligning deep networks with human vision will require novel neural architectures, data diets and training algorithms
Date Posted:
February 24, 2025
Date Recorded:
February 11, 2025
Speaker(s):
Thomas Serre, Brown University
Brains, Minds and Machines Seminar Series
Description:
Abstract: Recent advances in artificial intelligence have been mainly driven by the rapid scaling of deep neural networks (DNNs), which now contain unprecedented numbers of learnable parameters and are trained on massive datasets, covering large portions of the internet. This scaling has enabled DNNs to develop visual competencies that approach human levels. However, even the most sophisticated DNNs still exhibit strange, inscrutable failures that diverge markedly from human-like behavior—a misalignment that seems to worsen as models grow in scale.
In this talk, I will discuss recent work from our group addressing this misalignment via the development of DNNs that mimic human perception by incorporating computational, algorithmic, and representational principles fundamental to natural intelligence. First, I will review our ongoing efforts in characterizing human visual strategies in image categorization tasks and contrasting these strategies with modern deep nets. I will present initial results suggesting we must explore novel data regimens and training algorithms for deep nets to learn more human-like visual representations. Second, I will show results suggesting that neural architectures inspired by cortex-like recurrent neural circuits offer a compelling alternative to the prevailing transformers, particularly for tasks requiring visual reasoning beyond simple categorization.
PRESENTER: Welcome to this talk hosted by CBMM and the Quest. I'm very happy to have Thomas Serre here, because this is kind of a homecoming-- I don't know if you recognize the building after-- how many years? He was a PhD student and a postdoc here back in 2006 or '07. And now he's a professor at Brown University with a joint appointment in the Department of Cognitive, Linguistic and Psychological Sciences and also the Department of Computer Science.
And I think he's Associate Director of the Center for Computational Brain Science at Brown, and so on and so forth. And he also holds, let's see, something else in France that I'm not sure about-- an International Chair in Artificial Intelligence. When I mentioned it to him, he did not seem to recognize it.
And anyway, the work Thomas did back then was in the good old days in which a model in neuroscience-- HMAX, in the hands of Thomas-- was as good as or better than computer vision models. And it was probably one of the first really successful deep networks in human vision. It was a model built by trying to imitate the ventral stream, which I'm sure we'll hear more about from him.
But as I said, the interesting thing was that at that time neuroscience was really leading computer science. And now it's quite different, and I think we'll hear something about it from Thomas in the talk. The title is too long, but it has to do with deep networks and human vision. And let's welcome Thomas.
[APPLAUSE]
THOMAS SERRE: Well, thank you so much, Tommy, for the introduction. Thank you all for the kind invitation and for being here. It's always a little bit of a party or, as Tommy mentioned, a homecoming, being able to speak here. Many of my old mentors-- or not so old, but they were mentors when I was a graduate student-- are here in the room. It's always a nice experience to be able to catch up with old mentors, old friends.
And with the novelty that now some of my former graduate students from Brown are moving to postdocs at MIT and Harvard. And so it feels like it's a big old family. All right.
So as Tommy pointed out already, this is too long of a title. I grayed out part of the title because, as always, last night I realized there was no way I would have enough time to cover everything I was excited to talk about. So I decided to cut the part of our work related to the development of computational models and novel neural architectures, and just give you a preview of what I would have told you about.
As you know, the main engine behind most of the latest and greatest developments in AI is the transformer architecture. It's difficult, even squinting your eyes, to figure out analogies between modules and operations in transformers and neural circuits. So we've been focusing on the development of more biologically realistic neural architectures.
A lot of it is based on integrating cortical feedback or cortical-like feedback mechanisms into modern deep neural networks. I'm not going to tell you much about it, but I do think that transformers are not the ultimate neural architecture, neither for AI nor for neuroscience.
All right. So today I'm going to be telling you instead about another line of research in my group, which I'm particularly excited about, which has to do with our attempt at reverse engineering the data diets and general learning principles that would allow us to develop deep neural network models that are better models of human vision.
So maybe some of you in the audience are interested in furthering the development of deep learning models for AI. Our own focus is really on trying to steer those models towards being better models of human vision.
So being at MIT, the mecca of computational modeling and computational modeling in vision, I think I can skip much of my typical introduction and briefly skim through the history of computational models. There's a long history of so-called hierarchical feedforward models of the ventral stream. You can probably trace the origins all the way back to the work of McCulloch and Pitts in the late '40s, the work of Hubel and Wiesel in the '50s and '60s, fast forwarding to maybe-- I don't know if Tommy agrees-- Steve Grossberg in the '70s, Fukushima in the '80s, and then a number of people, obviously, including Tommy and colleagues, towards the development of computational models of the ventral stream of the visual cortex throughout the '90s up to 2010.
So if we look at what has been happening in the field of AI, as I mentioned earlier, there is, of course, a deep connection between modern-day AI systems and the more traditional, classical computational neuroscience models of vision. If we look at the general set of neural operations that those two sets of architectures use, there's a lot of overlap: things like dot products, rectification nonlinearities, normalization, pooling-- including max pooling, which came out of the HMAX model developed in Tommy's lab and that Tommy alluded to earlier-- and the notion of a feature hierarchy.
But as much as the design of early computational neuroscience models was largely constrained by neuroscience principles, with the explicit goal of constraining model architectures-- parameters of the model in terms of receptive field sizes, number of connections, number of layers, et cetera-- to be really reflective of the underlying anatomy and physiology, I think it's also fair to say that AI and neuroscience have, in that sense, started to diverge today, where modern deep neural networks have somewhat eased that constraint. And I would say the key driver of the development of deep learning systems today is largely task optimization rather than biological constraints.
So here's one example of task optimization. I'm sure you're all familiar with the ImageNet data set, a 1,000-way image classification task. Here's a little overview of the research done between 2012 and 2022, sourced from Papers With Code-- I just exported it yesterday. And what you see here is how the state of the art-- the leading neural architecture for any particular year-- fared on this ImageNet data set.
And so back in 2012, when AlexNet was first introduced, we had an accuracy on ImageNet of about 65%, I guess, somewhere around there. This is the top-1 accuracy, I should point out. So this is 1,000-way classification; chance is about 1 out of 1,000.
And there are issues with this metric. I'll allude to some of the issues in a little bit. But for now, higher accuracy here means a stronger, better-performing model. And so you can see that year after year, architecture after architecture, the accuracy of these models has been increasing almost steadily.
Now, trying to identify one particular mechanism or principle that has been driving this progress is quite hard. If you look at trends, it's pretty clear that those deep neural networks are getting deeper and deeper. AlexNet is probably comparable in terms of number of stages to your own visual system, the ventral stream, somewhere around half a dozen or so visual areas.
The latest and greatest here are probably equivalent to hundreds of layers of processing. So certainly depth is one factor. But I don't think this is the only factor. Architecture after architecture, there's a number of mechanisms that have been introduced, clever heuristics, clever tricks that have allowed these networks to learn much more efficiently.
Examples include breaking down fairly large convolutional kernels into smaller ones, all the way down to a one-by-one convolution, which is literally the smallest convolution you can do. Residual connections are another example. We've also built, over the years, better and better optimizers and all kinds of schedulers and so on and so forth.
So it's never entirely clear how much of these gains derive from true innovations in the space of neural network architectures versus the various smaller heuristics. Because I had promised Tommy, as a pet project-- as a way to gauge how much of the progress is due to architectures versus everything else-- I asked a few of my students to go back to the good old HMAX developed in Tommy's lab many, many years ago, a little bit before AlexNet.
So this is a fairly shallow deep neural network by today's standards. It's also not that wide because back then we were constrained by the availability of compute. And certainly we didn't have GPUs.
And just to give a ballpark of how much gain we've made: if you just take the standard HMAX and you optimize it on ImageNet with backprop and class supervision, you get pretty bad results. If AlexNet is at about 65%, we're somewhere around-- I think we're at about 40%, so not great.
But as I mentioned, we were constrained back then to a fairly narrow neural network. The input is effectively four kernels, four orientations times the number of scale bands. So if we just allow it to have more input filters, like AlexNet, we get a pretty big boost. We get somewhere comparable to AlexNet.
And then if we start including one-by-one convolutions-- and I won't bore you by going one step at a time-- but if we start integrating many of the additional heuristics and tricks that people have used to improve learning and reach better local minima with those architectures, we can get significantly higher. I think the latest we can get here is about 75%, and I should point out that, for reasons I'm not going to get into, we are not using data augmentation. So this is really competitive with respect to a lot of the state-of-the-art architectures.
We're not reaching the top. Not to say that the architecture doesn't matter, but the point that I'm trying to make here is that you can still go back to a fairly old architecture, leverage the latest and greatest in deep learning, and get reasonable results. We're still working on adding those feedback mechanisms and getting better accuracy.
So where are humans on that curve? As I mentioned, there are a number of issues with the way accuracy is measured on ImageNet. In particular, there's the fact that often there is more than one object in the image, so the top-1 accuracy is problematic.
You might be penalizing a system for making correct predictions. But I think the best attempt that I've seen at really putting an actual number on a human baseline on ImageNet is work that came out of Berkeley a few years ago, where they took a very large subset and trained subjects. They even recruited experts, et cetera.
And they came up with a multilabel metric, which is difficult to calibrate against the top-1 accuracy. You'll have to trust me on that. But essentially, if you were to put a human accuracy here, you'd be around here. And in fact, the claim in this paper is that a fixed ResNeXt-- the paper was from 2020--
--a fixed ResNeXt is at about human level. So essentially, to a first approximation, all the models that are above this line are above human level. I'm not aware of any recent attempt at evaluating how close the state of the art in computer vision is to human level.
So I asked a graduate student of mine to do a back-of-the-envelope calculation. We took a library of pretrained deep neural network models in PyTorch called timm, if any of you are familiar with the toolbox. There are something around 1,000 or so models, which encompass most of the representative modern deep neural networks.
We didn't want to run 1,000 models, so we restricted ourselves to representative ones. We have all kinds of CNNs, transformers, different kinds of training and pretraining. We took a subset of 367 of them and counted how many of those 367 fall within the human confidence level.
And we find that about 14% of them are within it. 4% of those networks are actually at superhuman level, so they outperform humans on this 1,000-way image classification.
So it's pretty clear that the progress has been significant. We now have AI systems, computer vision systems, that are on par with, if not better than, human accuracy on this task-- what most of us would have considered an important milestone for computer vision.
Now, if, like me, you care about understanding the visual system, just because you have an artificial vision system that matches the performance of a biological system obviously doesn't mean that they share the same visual strategy. It's possible that they achieve the same level of performance by simply leveraging completely different visual strategies.
So the question I'm going to be trying to answer in the next few minutes is how relevant those AI models are for understanding human vision. And for that, I'm going to be defining a metric that's going to allow us to benchmark the alignment between human vision and those models. So let me start briefly with the easy part, which is evaluating the visual strategies leveraged by deep neural networks.
There's an entire field of AI known as the field of xAI, or explainability. There's a gazillion methods that have been proposed. I'm not going to give you a tutorial on those methods. I'm sure many of you are familiar with them.
But what we're going to be using here as a way to describe the visual strategy of those models is what are called attribution methods. The goal of those methods is to produce importance maps of this kind for individual images-- maps that indicate what parts of an image are driving decisions by the neural network.
As I said, there are many methods that have been developed. Here we're going to be focusing on perhaps the simplest one. I should point out that all the results that I'm going to be showing are robust to the actual method used. We're going to be using the saliency method, and a very simple way to describe this method is that it's essentially using the gradient of the output of the model with respect to the input image.
So if you know what a derivative does, it tells you how sensitive the output of a function is to some small changes around a particular value of the input. And so here we're going to be evaluating how sensitive the output of the model for the correct class label is in response to minimal perturbations of individual pixels. So that's the saliency method we're going to be using.
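To make that concrete, here is a minimal PyTorch sketch of the saliency computation just described; the model and input names are placeholders for whatever classifier and image you use.

```python
import torch

def saliency_map(model, image, class_idx):
    """Gradient of one class logit with respect to the input pixels.
    `model` is any torch.nn.Module classifier; `image` is a (1, C, H, W) tensor."""
    model.eval()
    image = image.detach().clone().requires_grad_(True)
    logits = model(image)
    # Back-propagate only the logit of the (correct) class of interest.
    logits[0, class_idx].backward()
    # Collapse channels so we get one importance value per pixel.
    return image.grad.abs().max(dim=1).values.squeeze(0)  # (H, W)
```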
Just to put in a plug for the toolbox we used: one of my former graduate students, Thomas Fel, has developed the Xplique toolbox, which, to my knowledge, is the most complete today. And this is what we've used for this set of experiments.
Now, it's easy-- or somewhat easy-- to characterize the visual strategy used by deep neural network models. We're going to have to do the same for human observers, which is, obviously, a slightly different and more complex task. We have the field of psychophysics, and without giving you a full history of the field, there's, obviously, a variety of methods that have been described and proposed for characterizing the important features leveraged by human observers when classifying objects.
I was personally very influenced by a paper from Shimon Ullman and collaborators from a few years ago, and I know that many of you are familiar with the work, the MIRCs. I think this is quite elegant, and the method literally allows one to identify the minimal features that are both sufficient and necessary for human observers to still correctly recognize the class label of the image.
Now, the method is beautiful. The challenge is that just to characterize and identify a particular visual feature, the method requires tens of thousands of trials. And so clearly, since our goal here is to scale up this characterization to the entire ImageNet, we couldn't afford 10,000 trials per image. So we came up with a very coarse simplification of Shimon's method.
We developed a game which we call Click Me. So Click Me is a little bit of a misnomer because there is very little clicking involved. But the premise of the game is that you, the player, would be told that you are a teacher. And you have to decide-- you are given an image, an image of a dog here. You have the class label.
And you have to decide what part of the image to paint. And you are told that there is a student somewhere else, and the student starts from a blank screen. Actually, I'm realizing, seeing Pavan in the audience, there is a connection here with the RISE method that he developed many years ago as well.
But the idea is that the student will start from a blank screen. And then part of the image will get revealed gradually over time. What gets revealed is decided by the teacher.
The goal of the game is for the teacher to score as many points as possible. The number of points is determined by how fast the student recognizes the image. Initially we had pairs of human players, a human playing with a human. The problem with this is that it's hard to collect a lot of data, again, because you need to figure out ways to pair human players.
And so we realized that the student actually didn't matter too much, and that we could either have something making random guesses, just to keep the game entertaining, or we could pass the image-- part of the image-- to a deep neural network. And so in practice, we cheat half the time. I won't get into the details; this is pretty old work from 2019, and it's all published.
I'm also happy to answer questions. But the gist of this is that we are passing much bigger patches to the deep neural networks. We're also making the life of the deep neural network as easy as possible so that the game is interactive, and so players feel like there is a strategy, even though, to be honest, there is very little strategy to the game.
So that's the game. All right. So we get those click maps, and we get potentially multiple maps for every image across many subjects. We can then average those click maps, and we get these heat maps.
So here you see a few representative examples. Just to highlight a couple of points: for animate categories, like the animals here, human observers tend to select facial components-- things around the eyes, the nose, the mouth. There is also a consistent selection of features for inanimate things like vehicles. We find that for vehicles, a lot of the time, the wheels are selected, or the front grilles, et cetera.
Again, I'm not going to give you all the numbers, but you'll have to trust me. We've done extensive analysis measuring the reliability of those maps, correlating half of the subjects versus the other half. We've also compared human subjects playing with humans versus with computers. Again, we are confident that people are not just randomly clicking. There is an underlying visual strategy, and those maps are reliably replicable.
So in the first version of the game a few years ago, we recruited about 12,000 participants over about six months, and that allowed us to collect about half a million clicking maps, which, when you average them and collapse them per image, covered essentially a big chunk of the ImageNet validation set. As I mentioned earlier, you can swap the human with the DNN; you get the same results.
And we've done all kinds of post-hoc analysis to make sure that, again, those are not just random clicks. We've also done psychophysics with a completely different set of participants. We evaluate them in rapid visual categorization experimental paradigms, but where we only show part of the image-- only the features corresponding to the hotspots of those heat maps.
And we find that with only a handful-- literally, I think something around 1% to 4% of the pixels-- if those pixels are selected from the hotspots, people are reliably able to classify images presented rapidly. If we do that with random selection, or with something akin to a bottom-up saliency algorithm driving the selection, we find that people need at least one order of magnitude more pixels.
So the method is not perfect. We can go over the limitations. But I think in general we're fairly confident that there is signal there.
That said, this was still a subset of ImageNet. We had about 200,000 images. Last year, we received funding from NSF to scale up the effort. So we resumed the game.
We have now been running the game for a few months. We already have twice as many participants, and we have about 99% of ImageNet covered. I think we're just a few months shy of completing the full data collection.
Just to give you a quick live demo, this is what the game looks like. So you are given an image. You are given the class label. And you have to decide what part to reveal.
So I would probably reveal this section here. So I've made points against the AI, and so on and so forth. You see how the game plays? Coral reef. I'm going to select this here.
Here it looks like it's harder for the AI, I guess. All right. So this is the game. You can scan the QR code and become a participant and win prizes.
All right. So I'm going to show you here a side-by-side comparison between humans and deep neural network models. Here are representative images from ImageNet, and here are the corresponding Click Me maps.
And here are representative architectures-- ViT, MLP-Mixer, ResNet, SimCLR, et cetera-- and the corresponding heat maps. And I think I don't have to show you a lot of complex statistics. It's pretty clear already, just from eyeballing the human versus machine maps, that the two learning systems attend to very different parts of the image, very different pixels.
In general, we find that the heatmaps from deep neural networks tend to be pretty scattered. They tend to leverage a lot of the cues from the background, often much more so than the foreground.
All right. So I'm going to be using a measure of alignment here, which is going to be a very simple measure of correlation-- for individual images, the correlation between the human map and the attribution map from the computer vision algorithm. We can do that for every image, average the scores across the entire data set, and that gives us one feature alignment score.
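A minimal sketch of that alignment score, assuming each map is a 2-D array of the same size; the choice of Spearman rank correlation here is one reasonable option, not necessarily the exact statistic used in the talk.

```python
import numpy as np
from scipy.stats import spearmanr

def feature_alignment(model_maps, human_maps):
    """Mean correlation between model attribution maps and human Click Me maps,
    computed image by image and then averaged over the data set."""
    scores = []
    for m, h in zip(model_maps, human_maps):
        rho, _ = spearmanr(m.ravel(), h.ravel())  # rank correlation of the two maps
        scores.append(rho)
    return float(np.mean(scores))
```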
So what I'm showing you here is on the x-axis is the ImageNet classification accuracy. Each point here corresponds to an architecture. The blue dots are CNNs. The purple dots are transformers. Here on the y-axis you get a measure of alignment between the models and humans.
And so you see that the trend was initially going in the right direction. When I go from left to right, I'm literally going from earlier, older models toward more recent and more performant models. Initially, the trend was clear: better models led to better alignment with humans.
But you see that with the latest and greatest models-- potentially those models that are approaching or outperforming humans-- the trend is reversing. The models are getting more and more misaligned with humans, suggesting that the way they are solving this image categorization task has less and less to do with the visual strategy used by human observers.
Now, this doesn't necessarily mean that there's a fundamental limitation preventing these architectures from learning human-like visual features. To demonstrate this, we developed a machine learning method that constrains those deep neural networks while they are being trained to recognize or classify images: we add an extra loss where we are, in a sense, brute-forcing them. We're literally forcing the strategy that they are using-- through their attribution maps-- to be as close as possible to the Click Me maps derived from humans.
So the networks still learn to recognize objects. But while they learn to recognize objects, we're not letting them freely choose which pixels to leverage to drive their decision; rather, we force them to rely on the human features, the features derived from the human clicking maps. And the approach works.
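Here is a minimal sketch of what such a training loss could look like, assuming Click Me maps resized to the image resolution; the published harmonization procedure uses a more careful multi-scale comparison of the maps, and the weighting term lam is a made-up hyperparameter.

```python
import torch
import torch.nn.functional as F

def harmonization_loss(model, images, labels, click_maps, lam=1.0):
    """Cross-entropy plus a penalty pulling the model's saliency maps
    toward the human Click Me maps (a rough sketch, not the exact recipe)."""
    images = images.clone().requires_grad_(True)
    logits = model(images)
    ce = F.cross_entropy(logits, labels)

    # Saliency maps, built with create_graph=True so the attribution
    # penalty itself can be back-propagated through the network.
    correct_logits = logits.gather(1, labels[:, None]).sum()
    grads, = torch.autograd.grad(correct_logits, images, create_graph=True)
    sal = grads.abs().amax(dim=1)                                  # (B, H, W)

    # Normalize both maps and penalize their mismatch.
    sal = sal / (sal.amax(dim=(1, 2), keepdim=True) + 1e-8)
    cm = click_maps / (click_maps.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return ce + lam * F.mse_loss(sal, cm)
```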
We are certainly scaling up the approach, and we are trying to get a system that would, out of the box, apply this method-- which we call harmonization-- to arbitrary neural networks. On this slide, I just have a handful of neural networks. But you see here the byproduct of this harmonization procedure with the yellow dots, and you see that in general the procedure works. So we know how to do optimization, I guess, or how to minimize our losses.
In general, starting from any of these points, we're able to break this Pareto front and get neural networks that are very significantly better aligned with human observers. A small added benefit that I had not anticipated is that, in general, those arrows tend to go up and a little bit to the right, suggesting that there is an improvement-- not a very large one, but an improvement nonetheless-- in their classification accuracy. That means we don't necessarily have to trade off image classification accuracy against alignment with humans; in general, we get a good trade-off between the two.
And just to give you a few qualitative examples: you see here this harmonized ViT. This was the original ViT, where the attribution maps are all over the place. After we do this procedure, you see that the network has learned, spot on, to latch onto human-like visual features.
We've also run additional psychophysics. We can look at the ability of the model to recognize images that have been masked to different degrees, again based on the heat maps, the clicking maps. Humans are in black, and the harmonized ViT is almost perfectly lining up with humans. I should point out that this is the amount of the image that's being shown to the subjects in our rapid categorization task versus their performance. And you see that the harmonized ViT agrees almost perfectly with human observers, unlike the original ViT, which is much less efficient in the way it takes up visual information.
We've done additional work; I'm going to go through it very briefly and just flash a few salient results to highlight the benefits of this type of harmonization procedure, where we force those networks to adopt a more human-like visual strategy. There's a group at Harvard that I think most of you are familiar with, led by Marge Livingstone, where Marge and collaborators have been developing a method that allows them to collect something very similar to the heat maps that I've shown you for humans and deep neural networks, but now derived from neural data.
In particular, these recordings were made from the IT cortex, and the idea is relatively simple. Rather than collecting neural population responses for an entire image, they consider fairly large images and then essentially measure neural responses for different patches of the image, literally, by shifting the image under the array so that it's the closest approximation you can get to one of those heat maps.
And you see examples here where they just collapse all the neural responses. You see that in general, in IT cortex, when you present stimuli that include living objects, the neurons will typically fire much more strongly on facial components than on the background. And similarly, when you have inanimate objects, the responses tend to focus on the more salient and dominant objects.
So this is a perfect setup for evaluating our harmonized models. Again, these are models that never saw neural data before; they were just harmonized based on those human psychophysics data. And again, similar to the curve I showed you earlier, every dot here is a different deep neural network, with ImageNet accuracy for those networks on the x-axis. We then derive a neural predictivity score, borrowed from the Brain-Score tool set developed here at MIT by Jim DiCarlo's group, to evaluate the agreement between the computational models and the neural data.
And you see that, again, in general, we find a similar Pareto front to the one I described earlier for the human data: more performant models are now becoming worse models at predicting neural data. After we harmonize with the human psychophysics data, we find that the differences are not as significant as we find for humans, but we are consistently able to break this Pareto front and improve the fit to the monkey electrophysiology data.
One last example of this harmonization method and its application. I decided to include these two slides because of all the work that's being done here at MIT in Jim DiCarlo's lab. We have been looking at the question of adversarial attacks on deep neural networks-- how aligned are they with, how well do they target, the visual features that are important for humans?
And so if you look back at the old days of deep neural networks and convolutional neural networks, here's an example we had. And I should point out that the adversarial attacks here are magnified so that they can be visible; in practice, they are of much smaller amplitude.
But here's an example of an early, brittle DNN, where you have a very small perturbation that's typically not even noticeable to the naked human eye. And here, because the features that are targeted have very little to do with the features that matter for humans to recognize images, we find that typically those perturbations will be very hard to detect by human observers. Next is an example of unaligned robustness.
I'm sure you're familiar with all the work that has been done on adversarial training; there are explicit ways to make deep neural networks more robust to adversarial attacks. And so here you see the outcome of such training, where the argument is that, in this example, we end up with a model that exhibits an unaligned robustness, in the sense that the network is now much more robust. The magnitude of the attack needs to be increased for the network to be fooled, so this is a demonstration of the strength and the robustness of the network.
It's harder to trigger an attack. But again, in this case, this is misaligned with human vision, because the features that are being targeted are again spread all over the image and are not the features that are important for human observers.
And then that's what we are aiming for here: an example of aligned robustness, where in theory this network is not only more robust to adversarial attacks-- the magnitude of the attack needed is quite large and noticeable-- but on top of it, the attack targets the kinds of visual features that are important for humans, which makes those attacks much more noticeable.
And so that's essentially what I'm summarizing here on this plot, where on the x-axis we have a measure of alignment-- a correlation between the actual pattern of the perturbation of the attack and our clicking maps-- which tells us how well the attack lines up with features that are important for humans. On the y-axis is a measure of robustness, a perturbation tolerance, where higher is better.
This is the minimum norm needed to fool the network. So the higher, the more robust the network is. Again, every dot is a deep neural network; this is from the timm toolbox that I mentioned earlier.
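As a rough illustration of those two axes, here is a sketch that estimates a perturbation tolerance by binary-searching the smallest single-step (FGSM-style) attack that flips the prediction, and then correlates the attack pattern with the human Click Me map; the actual benchmark presumably uses a stronger minimum-norm attack, and pixel values are assumed to live in [0, 1].

```python
import torch
import torch.nn.functional as F
from scipy.stats import spearmanr

def perturbation_tolerance(model, image, label, click_map, eps_max=1.0, iters=20):
    """Minimum FGSM step size that fools the model on one image, plus the
    correlation between that perturbation and the human Click Me map."""
    image = image.detach().clone().requires_grad_(True)    # (1, C, H, W)
    F.cross_entropy(model(image), label).backward()
    direction = image.grad.sign()                           # FGSM attack direction

    lo, hi = 0.0, eps_max
    with torch.no_grad():
        for _ in range(iters):                               # binary search on epsilon
            mid = 0.5 * (lo + hi)
            adv = (image + mid * direction).clamp(0, 1)
            if model(adv).argmax(1).item() != label.item():
                hi = mid                                      # fooled: try smaller
            else:
                lo = mid                                      # not fooled: go bigger

    delta = (hi * direction).abs().amax(1).squeeze(0)         # per-pixel attack magnitude
    rho, _ = spearmanr(delta.flatten().cpu().numpy(),
                       click_map.flatten().cpu().numpy())
    return hi, rho   # tolerance (higher = more robust), alignment with human features
```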
Again, we see a similar Pareto front here for standard convolutional and transformer networks. We find an interesting breaking of the Pareto front here via robust networks: you see that these pink dots are networks that were trained to be robust. In yellow are models that were harmonized-- so, again, forced to leverage human-like visual representations.
What is interesting with the harmonization is that we are not able to get quite the level of adversarial robustness that you would get if you trained the network with one of those adversarial training methods. But there is a significant improvement; we're able to break the Pareto front.
And not only are we able to get models that are more robust, but you can also see that the alignment with the Click Me maps is improving, suggesting that, again, the harmonization procedure works. The networks are now targeting visual features that are important for human vision.
All right. So this is a lengthy part one. I've tried to ask the question of whether AI models are aligned with human vision. I've shown you several examples-- human feature alignment, neural data predictivity, and adversarial attacks-- where I tried to make the case that we are hitting a Pareto front: as models become more and more accurate at ImageNet classification, the alignment along all of these different metrics is now worsening.
At the same time, I've shown you that we can leverage the deep learning toolbox, collect psychophysics data at scale, and then force the alignment on the neural networks. We can simultaneously optimize them for image classification while keeping them aligned with human vision.
The fact that we have so far been able to harmonize all the architectures that we've tried-- which span different CNNs, transformers, et cetera-- suggests that the misalignment problem has probably little to do with the underlying neural architectures. As I said, I don't necessarily believe that CNNs and transformers are, in their details, biologically plausible. But since we're able to align all of these architectures with human data, that suggests that the limitations are probably not coming from the architecture.
Our hypothesis was that the limitations come from the training-- that they come from the data diets used to train those networks, which are static ImageNet images, and perhaps as well from the underlying optimization, the tasks that are being optimized to train those networks. So the point that I'm going to try to make in the remainder of this talk is that we need to rethink the data diets we feed our deep neural networks and reverse engineer learning principles that will spontaneously lead to deep neural networks that are better aligned with human vision.
So obviously, we're not the first ones to make such a proposal. There has been work on the development of ecologically more valid data sets, starting with the CAD/CAMs from back in the days when I was a grad student. There's also a number of data sets that have been collected to reflect the kinds of visual data diets that babies experience, which are typically continuous transformation sequences.
So here's an example of some of the data we actually collected a few years ago at Brown with a former colleague of mine, Dima Amso, where you see an 18-month-old wearing a portable eye tracker being presented with a box of toys for the first time. And what you see here is the baby experiencing for the first time novel objects.
The kid is able to manipulate, move their heads. And so rather than getting IID samples from an ImageNet-like data set here, the baby is able to experience smooth continuous transformations of objects.
We're not going to go quite all the way to baby data, and I think part of the issue is that there is very little control we can get over such baby data. Instead, we're going to try to develop an alternative, I guess, to ImageNet. And so our starting point has been the CO3D data set, which is a publicly available data set that [INAUDIBLE] collected from people capturing video data-- people moving around representative everyday objects with their iPhone.
What we're going to do here, for better control, is leverage some of the latest and greatest modern computer graphics methods, which are also based on deep learning. Some of you have probably heard about NeRFs and Gaussian splatting. What we're going to be doing here is starting from these iPhone videos, but then rerendering them so that we can generate smooth transformation sequences-- things like rotations around the objects, translations, zooming in and zooming out.
And I should point out that one key limitation of this data set-- and we're hoping that the data set will continue to grow. Right now it's only about 60 or so classes of objects. So this is very small with respect to ImageNet. But this is a starting point nonetheless.
All right. So we need better data diets. And the other point that I made is that we probably need to rethink the tasks that we are using to optimize our deep neural networks. There's, again, a lot of work that has been done on the topic; some of it dates back to the '80s. Some of you might have heard about slow feature analysis and its derivatives, which were developed to train artificial neural networks to learn better invariances from transformation sequences of the kind that I just showed you.
There's also related work in the space of predictive coding. Gabriel Kreiman and colleagues have done a lot of work in that space; you might have heard about PredNet. There's also recent work from Brenden Lake's group, as well as Yamins's and DiCarlo's groups, where they have been adapting some of these self-supervised learning architectures and evaluating their plausibility as a pretraining stage for deep neural networks.
I should also point out that there's a large body of work in self-supervised learning right now in computer vision; this is probably the hottest topic in computer vision. I'm not going to have time to give you a full review of the field, but in general, you might have heard about masked autoencoders.
Those were developed in the context of natural language processing. This is how BERT was initially trained: you give an input sequence of words, and then some words are hidden. The network has to learn to fill in the blanks-- to predict a probability distribution over all possible words for any one of the blanks. ChatGPT is yet another alternative to this, where rather than hiding individual words, the model is trained to predict the next immediate word, which is a form of autoregressive training.
So I'm going to put a lot of things under this masked autoencoder umbrella. But the basic idea is the same, whether it's in NLP or vision: we're going to be hiding part of the image, and, as pretraining, we're going to require the network to learn to fill in the blanks.
The space is large. If you look at the literature in computer vision, it's daunting; every other day there is a new method. But in general, one way to describe, at a high level, what's happening is that, rather than words, we're going to be looking at sequences of frames from these video input sequences. And depending on how different parts of the image get masked, one can formulate different objective losses.
So here's an example, which is the original version of this masked autoencoder. Again, here you have a sequence of frames, and then you can just hide some patches distributed uniformly across space and time. Here we're asking the networks to reconstruct what's hidden behind the blanks.
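Here is a minimal sketch of that kind of pretraining step, assuming a generic encoder and decoder (both placeholders, and the decoder's signature is made up): patches are hidden uniformly across space and time, and the loss is reconstruction error on the hidden patches only.

```python
import torch
import torch.nn.functional as F

def masked_autoencoder_step(encoder, decoder, frames, patch=16, mask_ratio=0.75):
    """One masked-autoencoder-style step on a clip of frames (B, T, C, H, W)."""
    B, T, C, H, W = frames.shape
    # Cut each frame into non-overlapping patches: (B, T*h*w, C*patch*patch).
    patches = frames.unfold(3, patch, patch).unfold(4, patch, patch)
    patches = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(B, -1, C * patch * patch)

    n = patches.shape[1]
    keep = int(n * (1 - mask_ratio))
    idx = torch.rand(B, n).argsort(dim=1)          # random patch order per clip
    visible_idx, masked_idx = idx[:, :keep], idx[:, keep:]

    gather = lambda ind: torch.gather(
        patches, 1, ind[..., None].expand(-1, -1, patches.shape[-1]))
    latent = encoder(gather(visible_idx))          # encode only the visible patches
    pred = decoder(latent, masked_idx)             # predict the hidden ones (placeholder API)
    return F.mse_loss(pred, gather(masked_idx))    # reconstruction loss on masked patches
```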
And so the network, in order to solve this task, will have to perform a form of spatiotemporal interpolation, leveraging whatever it can from spatial and temporal dependencies. An extreme version of that, which is essentially what a lot of the work in computer vision right now is doing: although people are trying to leverage video data, often videos are used just as something to sample individual frames from.
So essentially, learning from this video is literally equivalent to just learning from single images. What we're after here is really to encourage networks to leverage motion information and all the good things-- all the spatiotemporal dependencies-- that are naturally present when we experience transformation sequences.
So the variant of this MAE that we suspected would help, which in the following we will call autoregressive, is as follows. Rather than hiding random patches across space and frames, we're going to show a sequence of frames where everything is revealed, and then we ask the network to autoregressively predict the next frame. So this is quite literally the translation of what ChatGPT was trained to do, from language to vision.
And then, just as a point of clarification, one can play this game by training the model to reconstruct at the pixel level-- trying to reconstruct or predict what the patch is. Or you can do that in latent space: you learn a high-level representation, and you compare at that representation level.
You can use the prediction error between the incoming signal and the network's prediction to drive learning in the network. Again, for the neuroscientists in the audience who are familiar with the notion of predictive coding theory in neuroscience, you can think of this as one instantiation of that theory.
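A minimal sketch of that autoregressive variant in latent space, assuming a frame-level encoder and a causal predictor module (both names are placeholders); detaching the target latent is an extra assumption borrowed from common self-distillation setups.

```python
import torch
import torch.nn.functional as F

def autoregressive_step(encoder, predictor, frames):
    """Next-frame prediction in latent space for a clip (B, T, C, H, W):
    the prediction error on the upcoming frame's latent drives learning."""
    B, T = frames.shape[:2]
    latents = torch.stack([encoder(frames[:, t]) for t in range(T)], dim=1)  # (B, T, D)

    loss = 0.0
    for t in range(T - 1):
        pred = predictor(latents[:, : t + 1])            # causal: only past frames
        loss = loss + F.mse_loss(pred, latents[:, t + 1].detach())
    return loss / (T - 1)
```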
So I should point out that these are very preliminary results. I'm going to show you results for one architecture. We're trying to extend those results to a whole bunch of architectures to demonstrate the generality of our results.
So here's a bunch of models, like we've done before. This is object classification accuracy, as before, on the x-axis and feature alignment on the y-axis. But now this is for the CO3D data set. So we collected novel data from human subjects; we know exactly what features are diagnostic in those video sequences presented as frames. And then we evaluate the accuracy of models on the data set.
So as promised, we picked one arbitrary model. We took a small ViT because it was manageable. And here we start from a model that has been pretrained. If you just take a pretrained model and do a linear decoding on this CO3D data set, you get an accuracy of about 0.85 and a human alignment of somewhere around 0.2.
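For reference, linear decoding from a frozen backbone looks roughly like the sketch below; the regularization, feature choice, and classifier used in the actual experiments may well differ.

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def linear_probe(backbone, train_loader, test_loader, device="cuda"):
    """Extract features from a frozen pretrained backbone, then fit a
    linear classifier on top and report its test accuracy."""
    backbone.eval().to(device)

    def featurize(loader):
        feats, labels = [], []
        for x, y in loader:
            feats.append(backbone(x.to(device)).flatten(1).cpu())
            labels.append(y)
        return torch.cat(feats).numpy(), torch.cat(labels).numpy()

    Xtr, ytr = featurize(train_loader)
    Xte, yte = featurize(test_loader)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    return clf.score(Xte, yte)  # top-1 accuracy of the linear readout
```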
Now, you can try to do fine-tuning. So the question we are trying to answer here is: can we figure out, reverse engineer, the right kind of loss that's going to yield better alignment and potentially better accuracy, starting from this base model and moving towards a better model? So here is an example of the representation derived from a model that has been fine-tuned.
So we started from an ImageNet-pretrained network and are trying to move in the direction that maximizes image classification on CO3D. And you see that the fine-tuning works; we get an improvement in image classification accuracy on CO3D, which makes sense.
But you see that there is no real improvement in alignment. So driving those networks towards more class-specific or class-diagnostic features does not help the alignment between these models and humans.
Here we're introducing this masked autoencoder. Remember, this is the example where we either sample frames and ask the networks to just fill in the blanks on single frames, or we randomly distribute the blanks across the spatiotemporal sequence. And we find that here things are not going very well. We are keeping similar alignment, but the classification accuracy actually goes down. So this MAE does not seem like a good way to approximate human vision.
Here's a slight alternative-- maybe too much detail-- but rather than distributing those patches uniformly, we're using tubes, where there's a certain spatiotemporal coherence in the mask that's being presented. And here we see slightly better alignment but also a drop in classification accuracy.
And we find that when we train in this ChatGPT style, with autoregressive models that have to predict the next frame, either in pixel space or in latent space, we can really get a significant improvement in the alignment with humans. That means this form of autoregressive learning ends up shifting the learned visual representations towards features that are closer to those selected by human observers when they solve this task. Again, this is just one network; we need to scale that up.
But I think the good news is that we're also finding that our results translate to ImageNet. What I showed you was for this CO3D data set. Now we can take the same network and evaluate, again, all the different training we've done on CO3D, and we find that it actually does generalize to ImageNet.
So with the proper self-supervised learning, we can improve ImageNet classification accuracy as well as the alignment with humans derived from ImageNet, which really shows that we're not just overfitting on the data set. We're really learning a novel visual representation that generalizes to a completely different data set, including static image data sets.
So as I said, this is one model; the jury is still out. I'm still hopeful that we're going to be able to generalize these results to arbitrary architectures. But I think this is promising.
Just to give you a quick visualization of what those features look like-- so it's pretty noisy. This is the attribution map derived from a linear probe on the standard ImageNet-pretrained ViT-Small. As expected, you see that the features are all over the place. The network integrates features from the foreground, the background, and everywhere else.
When we do this autoregressive self-supervised learning, we find that the network is much more stable and seems to be hitting the right kind of features. I want to think that those are features that are related to nonaccidental properties, potentially features that are more stable across changes in viewpoints through the self-supervised learning. And we have more examples again.
All right. So just to conclude this part very briefly, I tried to make the case-- and I showed you initial results suggesting-- that we need to rethink the data diets and learning objectives used to train our deep neural networks, at least if we want to improve the alignment between those networks and human vision. I've shown you that self-supervised learning from transformation sequences, to predict future sensory signals-- something akin to predictive coding in neuroscience-- yields visual representations that are better aligned with humans.
I think it's fair to say that we still have very little qualitative understanding. We know that there are quantitative improvements, but qualitatively it's not entirely clear how different those learned representations are. I would like to be able to tell you-- and certainly we're working on this-- I would love to tell you, well, we have better tuning for features that correspond to nonaccidental properties, the kinds of things that are stable across changes in viewpoint, or that we're able to capture the kinds of junctions that we know are important for 3D vision in humans, et cetera.
But I don't have those results yet, and I hope next time I come, I'll be able to present some of them. For the final two minutes, which I want to use for concluding: I think we are finding some improvements, certainly, by forcing alignment between deep neural networks and humans and by optimizing different kinds of loss functions. But I think there is still something missing in the way we evaluate these models.
I've long been convinced that evaluating the ability of these models to categorize 2D images is not the way to really calibrate the abilities of those models against human observers. There's still a feeling, I think, in the field that deep neural networks are, to a large extent, processing 2D images as 2D textures and are missing some aspects of 3D human vision-- certainly some aspects of the 3D shape representations that we know are important for biological vision. But I think it's fair to say that at the moment we don't really have good tests, either on the neurophysiology side of things or on the machine learning side, that would really challenge those neural networks to exhibit human-like 3D vision.
So we've been working hard, trying to come up with such a test. I've been discussing and chatting with colleagues who have been working on 3D vision for much longer than I have. And so last year I met with a colleague of ours, Zeke [INAUDIBLE] who's on the West Coast, and I asked Zeke, what do you think would be a test to really challenge these neural networks?
And so Zeke made an interesting proposal, and I think most of you will be familiar with this visual perspective taking task. This is a task that was proposed by Piaget somewhere in the late '40s. There are many versions of the task. This is one version.
If you're familiar with visual perspective taking, or VPT, this is VPT-1. So this is the version of the task that is presumably still thought to be primarily visual, as opposed to some of the more involved forms of visual perspective taking that seem to involve some form of cognition. So here the task is pretty natural.
I think Piaget assessed that kids as young as four to six years of age were able to solve the task. With improvements to the experimental paradigm, I think more recent research has shown that kids are able to solve this version of visual perspective taking by one year of age.
And so here's the task. You can show an image of this kind and ask whether they think the teddy bear is able to see the house or the cross there. And so that requires the subject to change their perspective from an egocentric point of view, which is literally what they get from 2D images, to an allocentric representation, where they're going to have to, quote unquote, "put themselves in the shoes of the bear" and figure out whether there's an occluder between the bear and the house.
There's no occluder, so the answer is yes for the house, and no for the cross because there is a mountain in between. All right. So we adapted the task to make it amenable to computer vision.
This is back to our NeRF-rendered 3D data set. So we have those graphics 3D models. And so we can, again, embed arbitrary objects within those scenes. I know that the contrast on these images is not great, so I don't know how well you can see it.
But here on the images we're showing a camera in green-- I hope you can appreciate there's an arrow here and a field of view-- and there's a red ball. Again, those are embedded in the 3D scene.
And so the question for the networks and the human subjects is whether they think the camera can see the red ball. So here are examples of a "no" answer, because there is an occluder here, and this is an example of a "yes" answer, because the ball is visible. And just to have a positive control, on the same set of images, completely counterbalanced, we also formulated a version of the task where we just ask the models and the subjects: which one of these two objects is closer to the viewer?
So in this case, green is closer to me than red, and so on and so forth. This is a relative depth task, which is still egocentric: we have to make a judgment as to which one is closer to us, unlike visual perspective taking, which requires a change in coordinate system.
We again took a battery of over 300 ImageNet-optimized models and evaluated them. And here are the results we are getting. So again, each dot is a deep neural network-- 300 of them. ImageNet accuracy is on the x-axis. These are the results for the depth-ordering task. Human level here is derived from about 30 subjects.
So you see that humans are right under 80%; chance would be 50%. And you see that some of the networks that are best performing on ImageNet are able to reach human level or even exceed it. An interesting trend here, which I didn't expect, is that there is actually a positive correlation between the classification accuracy of these models on ImageNet and their ability to predict this relative depth.
So what is interesting here is that those deep neural networks can see depth-- they can solve the relative depth task. And it looks like, through optimization for image categorization or object recognition, they are able to capture cues that are somewhat correlated with the judgments needed for this relative depth task.
Bad news for us is that our autoregressive models, which did so well in aligning with humans, actually don't do so well on this task; they are middle of the pack. Interestingly, if you take your favorite LLM-- Claude 3, GPT, Gemini, we tested them all-- they don't do very well on this depth-ordering task.
And we are doing everything by the book: chain-of-thought prompting, and we give them the same exact 20 samples as we give to humans. So these systems are not able to solve the task here.
The most performant ones are typically-- I have the legend on the next slide, but I think these are transformers that were probably trained on extra data with self-supervision. So that's the state of the art for image classification on ImageNet.
So here's what we get for VPT. Interestingly, for humans, we get even higher accuracy than we were getting for relative depth-- we get almost 90% for humans. And this is the accuracy for all the deep neural networks: not a single one of them is able to solve the task, and most of them are really around chance level.
To be fair, there is a trend. So it's not like they are completely at chance. There is a positive trend between the accuracy on ImageNet and the accuracy on VPT. But the trend is relatively weak.
So of course, when I started giving talks and showing this to deep neural network practitioners, I was getting a lot of pushback. And so just to-- oh, sorry. And the autoregressive model is actually doing slightly better in comparison, but not great. And again, none of the LLMs are able to solve the task.
So I was getting pushback, and people were like, well, who knows? Maybe we don't know what we are doing. So we wanted to show that it's not that they cannot learn it.
So here we just fine-tuned those models. We took a set of images, and rather than doing just a linear readout, we actually fine-tuned the models on the task. So this is the original accuracy, and this is what happens after we fine-tune those models.
And sure enough, if we fine-tune them, the top ones can actually reach human-level accuracy. You see that several models now perform on par with or better than humans on both tasks. However, after we fine-tune them, we again make a small, subtle change to the data set that we use.
Here we are really testing a leading assumption in the field: the assumption that for humans to solve the task, they use a strategy known as line of sight-- literally, you estimate the direction of the agent's gaze, draw a line of sight, and then figure out whether there's an obstacle along the way or not.
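As an illustration, that line-of-sight strategy can be written down as a simple geometric check; the sketch below approximates occluders as spheres, which is a simplification of the actual 3D scene geometry, and the coordinates in the example are made up.

```python
import numpy as np

def can_see(camera, target, occluders, step=0.01):
    """Walk along the segment from the camera to the target and test
    whether any (center, radius) sphere occluder blocks the sight line."""
    camera, target = np.asarray(camera, float), np.asarray(target, float)
    for t in np.arange(0.0, 1.0 + step, step):
        point = camera + t * (target - camera)          # point along the sight line
        for center, radius in occluders:
            if np.linalg.norm(point - np.asarray(center, float)) < radius:
                return False                            # the sight line is blocked
    return True

# e.g. a laptop-sized sphere sitting between the camera and the ball:
print(can_see((0, 0, 0), (2, 0, 0), [((1, 0, 0), 0.3)]))  # False
```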
And so after we fine-tune the models, we evaluate them on a very controlled scene, where we have a single object. And then we just move the camera and the ball step by step, so that it starts from a "yes" answer, then at some point there's an occlusion behind the laptop, and then we recover again.
When we do this simple manipulation, where we control for everything, we find that-- again, these are very similar data to the ones used in the training set-- and yet the accuracy of the models goes back to chance.
And it's not like the models do not understand the task, because when we look at the pixels that are used to drive their decision, we find that they actually latch onto the camera and the object. So they understand that there is some meaning there. But they are just not able to extrapolate and implement this line-of-sight algorithm.
So OK, I'll just leave you with one final thought. I think we haven't solved vision yet. I think this is interesting, and I'm certainly eager to hear feedback on this visual perspective taking task. I think there is a big distinction, for deep neural networks and humans, between egocentric and allocentric tasks. Our own hypothesis is that we need to extend self-supervised learning beyond the realm of passive learning agents, where videos are just passively streamed to and learned from by those neural networks, to a more active kind of learning, where the network, if not fully embodied, has to imagine how the scene looks under different kinds of occlusions in the scene.
My guess is that these are the kinds of self-supervised learning approaches that will yield better representations for solving those allocentric tasks. But this is still a hypothesis. I'm running out of time, so I'll just leave you with final acknowledgments to thank all the collaborators, the lab, and our funders. And I'm happy to take questions if there are any. Thank you.
[APPLAUSE]