A Fruitful Reciprocity: The Neuroscience-AI Connection

Date Posted: June 22, 2023
Date Recorded: May 18, 2023
Speaker(s): Dan Yamins, Stanford University
All Captioned Videos Brains, Minds and Machines Seminar Series 
                  Description: 
                  
Abstract: The emerging field of NeuroAI has leveraged techniques from artificial intelligence to model brain data. In this talk, I will show that the connection between neuroscience and AI can be fruitful in both directions. Towards "AI driving neuroscience", I will discuss a new candidate universal principle for functional organization in the brain, based on recent advances in self-supervised learning, that explains both fine details as well as large-scale organizational structure in the vision system, and perhaps beyond. In the direction of "neuroscience guiding AI", I will present a novel cognitively-grounded computational theory of perception that generates robust new learning algorithms for real-world scene understanding. Taken together, these ideas illustrate how neural networks optimized to solve cognitively-informed tasks provide a unified framework for both understanding the brain and improving AI.
Bio: Dr. Yamins is a cognitive computational neuroscientist at Stanford University, an assistant professor of Psychology and Computer Science, a faculty scholar at the Wu Tsai Neurosciences Institute, and an affiliate of the Stanford Artificial Intelligence Laboratory. His research group focuses on reverse engineering the algorithms of the human brain to learn how our minds work and build more effective artificial intelligence systems. He is especially interested in how brain circuits for sensory information processing and decision-making arise by optimizing high-performing cortical algorithms for key behavioral tasks. He received his AB and PhD degrees from Harvard University, was a postdoctoral researcher at MIT, and has been a visiting researcher at Princeton University and Los Alamos National Laboratory. He is a recipient of an NSF Career Award, the James S. McDonnell Foundation award in Understanding Human Cognition, and the Sloan Research Fellowship. Additionally, he is a Simons Foundation Investigator.
                    
JIM DICARLO: OK, welcome. Thank you all for attending this special time seminar. I'm Jim DiCarlo. I have the pleasure today of introducing our speaker Dan Yamins.
I guess I'd like to start with a welcome back to MIT, Dan. Many people know Dan. We've certainly missed you around here.
Dan, just to give you a bit of background, Dan grew up-- he was born in New York, Brooklyn, actually, and he grew up on Long Island. He did his undergraduate at that other university down the street, and he also did his graduate degree at Harvard.
And I think he did that-- Dan's work background is really applied math, and he actually worked in developmental biology. He may say something about that, and I like to think his most recent work is kind of almost coming back to that thread. But we'll let him speak to that.
But, as he might put it, he became somewhat disillusioned with that, and he saw maybe there were some other opportunities. That work was done with Radhika Nagpal.
But at the time, I met Dan right around that time, and I was lucky enough to somehow recruit him to come work on problems in vision. And so I like to think Dan saw the light and took a leap to come both to MIT and to work on problems related to the brain and do brain science. So I was lucky enough to work with Dan around that time. That's probably, like, 2010 or so, 2009.
And among many things, Dan was a co-lead on kind of one of the, still one of the most important papers that I've been associated with. So Dan really deserves really the lion's share of the credit for that. So I owe a lot personally to Dan for his time working in the lab. And I'll say a bit more about that in a minute.
But he was really an independent person while I was here. He did independent work beyond my time working with us together on vision problems, which he still works on. He also launched a bunch of exciting work with Josh McDermott, who's also here, in the auditory system, and they've taken that in entirely new directions. Also, somatosensory work he later launched.
So he really impacts multiple sensory systems. And we actually tried to recruit Dan here to MIT and BCS a number of years ago, at that time because he was independent, we saw his future there. But unfortunately for us, family took him to California and to one of our major competitors, Stanford. So Dan has been at Stanford for about the last five years or so.
And the general theme of Dan's work is building alternative models. I might call that engineering under neurobiological guidance and constraints. I sometimes call that reverse engineering, but not everybody likes that phrase. It might also be called accomplishing the grand integrative goal of our field using reproducible machine executable hypotheses, but that's so long winded, so I just tend to say reverse engineering.
And Dan's group has really been known as a world leader in a particular contemporary style of doing this, again, reverse engineering the brain. I think you've heard of that style referred to as performance optimization, as a phrase that people often use. And basically, it's a normative approach where you take the ingredients of the alternative scientific hypotheses, our knowledge about the brain architecture, cognitive science-driven intuition about the tasks and the environmental stimuli in history, and then apply optimization methods, mostly from machine learning now, to try to then produce models.
When you mix those ingredients together, you get models that are alternative mechanistic models of how brain function happens in various parts of the brain. Those models are called artificial neural networks, or ANNs. And we think of these as alternative, again, hypotheses about the mechanisms of aspects of brain function. And they're potentially capable of explaining many things.
So beyond dwelling on all that, which Dan will tell you and I could, I know, go on and on about, this requires, I'll just point out the abilities in both applied math, computing, machine learning, as well as knowledge in neuroscience and cognitive science, and collaborators in those spaces with the long-term goal of actually understanding how the brain gives rise to the mind.
And there are very few people in the world that can span all that and do all that. And in my opinion, Dan is among the very best, and really is the leader in that space. So I think that's probably why you're all here.
And so I would just like to just point out from his CV, beyond pioneering all these amazing things, he's already been named numerous awards. He was the Young Investigator-- NSF Career Award, the James S. McDonnell Foundation Award, Sloan Research Award, and a Simons Investigator. He's in high demand for invited talks.
He's already had multiple trainees in top-notch places, even though he hasn't been at Stanford all that long. And Dan and his team have really continued to push the frontier of these kind of scientific models, asking questions about recurrence, unsupervised learning, physical layouts of neurons. I think you'll hear that in his talk.
And I'm really most fascinated about Dan's most recent work that you'll hear about that we'll call universal principles, and how that might even at some point connect back to his longstanding roots in developmental biology with a long reach. And I hope I'm not overstretching there, Dan. But I'm excited to have Dan here to share his latest with all of you.
Before we give him a round of applause, I just want to tell you there'll be a reception afterwards upstairs on the sixth floor, in the room that's with the plants on the sixth floor.
And I also know that this slot is a CBMM slot. So it's booked-- it's a 2:00 to 3:30 slot, but I know many of you have to leave at 3:00. Dan is going to do his talk with integrated questions, we'll stop at 3:00, many of you have to walk out, Dan will not take offense, no one will take offense, but we know that has to happen. So we'll pause then. But we will be in this room probably till 3:30. So with that, I'd like a round of applause for Dan Yamins.
[APPLAUSE]
DAN YAMINS: Oh, thanks so much for that introduction, Jim. It is definitely true that I feel like this is my intellectual home. So it's nice to be back, and to see so many people. So yeah.
There are these three things that I find really interesting, like cognitive science. I have these pretty reductive pictures of what each one of these things are. So please forgive me for that. But cognitive science, so measuring behavior and modeling behavior. Neuroscience, so measuring the brain, modeling the brain. And artificial intelligence, which is building neural networks to do things and compare to the brain.
So of course, there's all these different linkages between the different parts, and all of these are very interesting linkages. We in my group focus on a couple of these in particular. So the linkage from cognitive science to artificial intelligence, that's about basically target setting and behavior. And the linkage from artificial intelligence back to cognitive science and neuroscience, that's really about hypothesis generation.
So it's really this series of steps that we do most of our time on. So we work in each of these things with the idea of basically taking behavioral insight, figuring out how to do those behaviors with a model, and then comparing the internals of that model to the brain as a way of generating actual hypotheses.
So, I'm going to spend-- I plan to have basically integrated questions with two basic stories today, a story about hypothesis generation going from AI to neuroscience, and then a story about cognitive science and some new AI algorithms. Is there some way we can stop that from happening? Anyway.
So, I'll start with this first one. I'll try to get done, the whole thing by, like, 3:10 or so. So it's a little faster than I expected, but please still ask questions so that I'm not just talking to you, talking at you that period. OK.
So we'll start with this first one, which is the AI to helping neuroscience, right? And so ancient history now is how we build computational models of the visual system by basically optimizing neural networks to solve, I don't know, sort of ecologically reasonable tasks, like categorization and ImageNet, and having built models and fixing all their parameters, then comparing the internals of those models to the brain.
And this is something that people here are pretty familiar with and had some utility for making predictive models, and also giving you a sense of why things were the way they were. Well, e.g. because they solved those tasks.
And this was the kind of key plot: performance on the task correlates pretty well with predictivity of neurons. And this was true not just in vision, but also in other areas as well. And in particular, it was also the case that it wasn't just one area for which this was the case. You could have the same model explaining multiple different areas out of different parts of the model. Right?
Since then, people have gone on to make much better models, and to measure them much better, like Martin and his work in Jim's group with Brain-Score. But as I was saying, it wasn't just the case that this was something that happened in vision.
And actually, this was an early version of a plot of work that I did with Alex Kell and others in Josh McDermott's group, where we basically had the same idea, but now with different model classes, different neural predictivity, different task, an auditory task, but a very similar idea-- the better you got at the task, the better the model got at predicting the neural responses.
So different tasks, different models, same ideas and basic concept of, basically, performance optimization and seeing how far that could get you at building predictive models of actual neurons.
And so in the time since that ancient history at this point, lots of people have done lots of really amazing work in this direction. Folks in my group, folks in groups outside of, you know, a lot of folks outside of my group. So building like models of primate and human vision and taking those further, human audition, models of mouse systems, such as mouse vision, the medial entorhinal-hippocampal circuit, and all sorts of other things, motor system, olfaction, that work from Robert Yang, and human language work from Ev and others here.
So there's been like a kind of flowering of these kind of goal-driven models that do things in the world, behaviorally, and can be compared to brains. So the basic principle behind this is that nothing in biology really makes sense except in light of evolution, which can be specialized to neuroscience as nothing in neuroscience makes sense except in light of behavior. And I like to say that we think that nothing in neuroscience actually makes sense except in light of optimization, and maybe we should specialize that to computational neuroscience. But that's really the translation of that message into something like, highly pithy.
So just restating this in another way, behavior is highly constraining of the brain for animals at their scale, and maybe humans and non-human primates at their scale. So that is basically the core idea behind the neural work, the way that AI helps neuroscience, and it's been very productive, basically.
So what are the actual core ingredients in any such model? There's four of them. This is basically the setup for any optimization problem, outside of neuroscience or not. The architecture class that you're optimizing within, the objective you're optimizing for, the data you're optimizing on, and the learning rule, which implements the changes to make the system better. Right?
And so all of these have kind of machine learning-y and kind of neuroscience-y interpretation. So like you can think of the architecture class as your hypothesis about the circuit neuroanatomy. You can think of the task or objective as your hypothesis about the ecological niche of the organism.
And each of these things-- environment, natural selection, and synaptic plasticity-- have interpretations. They're not always correct, but they are implementations of those ideas.
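The four ingredients can be made concrete with a toy sketch. Everything below is an illustrative assumption rather than anything from the models in the talk: the "architecture" is a one-parameter linear map, the "objective" is mean squared error, the "data" are four made-up input/target pairs, and the "learning rule" is plain gradient descent.

```python
# Toy illustration of the four ingredients of an optimization problem.
# All names and values are hypothetical stand-ins, not the speaker's models.

def architecture(w, x):
    """Architecture class: here, a one-parameter linear map y = w * x."""
    return w * x

def objective(w, data):
    """Objective: mean squared error between prediction and target."""
    return sum((architecture(w, x) - y) ** 2 for x, y in data) / len(data)

def learning_rule(w, data, lr=0.1):
    """Learning rule: one step of gradient descent on the objective."""
    grad = sum(2 * (architecture(w, x) - y) * x for x, y in data) / len(data)
    return w - lr * grad

# Data: input/target pairs generated by the mapping y = 3x.
data = [(x, 3.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]

w = 0.0  # initial parameter value
for _ in range(200):
    w = learning_rule(w, data)
```

Swapping any one of the four pieces (a deeper architecture, a different objective, a different data stream, a different update rule) gives a different hypothesis, which is the sense in which each ingredient is separately open to revision.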
And so back in 2016, the best proxies for the ventral stream pathway were ConvNets for the architecture; multi-way object categorization for the task; ImageNet images, which is a big computer vision data set, for the data; and then evolutionary architecture search and filter learning through gradient descent. That was back in 2016 and it all seemed fine. But at the same time, there were a lot of big problems with that idea.
So the architecture was clearly wrong. I'm saying bad as in obviously deeply wrong as a model of the brain or behavior because, for instance, there was no recurrence or feedback in such a model. There was no topographical structure. We know the brain is topographically laid out.
The objective was clearly wrong because it required labeled data, millions of examples from supervised computer vision data sets, and that's clearly not how organisms learn. The data set was itself a bad data stream in the sense that it wasn't real noisy video data streams the way they really are given to an organism. And finally, of course, the learning rule is backpropagation that has many discontents.
So all of these things are really good because these are opportunities to improve and develop more nuanced neural hypotheses. So I always think of these badnesses in the theory as a jumping off point for how to figure out a good research program. And so we've worked on, I would say, some elements of each of these things.
For example, Aran, who has taken MIT by storm and is somewhere in the audience, did quite a bit of amazing work on both the architecture class and on the learning rule. And there's lots of papers there and Aran has probably told you about some of those things. So I'm not going to talk about that stuff much today, but I'm happy to answer questions about it, although Aran is really the right person to ask about it.
What I am going to talk about a bit, though, are things associated with needing too much labeled data, the right data stream, and then how those actually go back and interact with topography. So this was something that really bothered me at the time that I started in my lab, which was that there's just no way that creatures like this receive millions of high-level semantic labels during learning. So it meant that whatever we were doing with category supervision just couldn't be right. Real data actually looks like this and there's no labels, so it's not framed well and there's all sorts of differences.
So for example, there's this amazing data set that I heard about around that time that came from, in part, my collaborator, Mike Frank, at Stanford, which was basically the SAYCam data set. So this is a data set, three infants, 6 to 32 months, recorded the whole time with a head-mounted camera, mono video and audio two hours a week. And I was confronted with this problem. How would you use this data set to learn a representation? When Mike first told me about this, I was like, I don't have any clue. I don't know what to do with this data. Oh my God. This was a big challenge.
But obviously the answer had to be some form of unsupervised learning. So unsupervised learning with neural networks has a very long history, starting in the '90s with sparse autoencoding and all those things. But basically, for a long time it didn't work that well. People would try lots of ideas. The things wouldn't transfer very well. They were interesting, but it kind of didn't-- I don't know, it didn't do it too much.
So for example, autoencoders were this nice idea that you would basically have a system that kind of had an input and output, just tried to make the output look like the input, so reduce reconstruction loss, with some kind of penalty in the middle, meaning for example, the famous penalty from Bruno Olshausen and others of having sparse reconstructions of the intermediate-- sparse activity in the middle layer of a shallow neural network. And the key point was that you optimize something like this and what you get out of it looks pretty much like, well, something like neurons in V1.
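As a rough sketch of the kind of objective being described, here is a shallow autoencoder loss with a reconstruction term plus a sparsity penalty on the middle layer. The ReLU encoder, the L1 form of the penalty, the weight `lam`, and the function names are assumptions chosen for illustration; this is not the exact sparse coding formulation from the literature.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sparse_autoencoder_loss(x, W_enc, W_dec, lam=0.1):
    """Reconstruction error plus an L1 sparsity penalty on the hidden code."""
    hidden = [max(0.0, h) for h in matvec(W_enc, x)]  # ReLU encoder (assumed)
    recon = matvec(W_dec, hidden)                     # linear decoder
    recon_err = sum((r - xi) ** 2 for r, xi in zip(recon, x))
    sparsity = sum(abs(h) for h in hidden)            # penalty "in the middle"
    return recon_err + lam * sparsity

# Usage: identity encoder/decoder reconstruct perfectly, so only the
# sparsity penalty on the single active hidden unit remains.
identity = [[1.0, 0.0], [0.0, 1.0]]
loss = sparse_autoencoder_loss([1.0, 0.0], identity, identity, lam=0.1)
```

Without the penalty term, the identity map is a trivial solution, which is why some constraint in the middle is needed for the problem to be interesting.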
So this idea was really powerful and people were excited by it. But actually, it didn't perform very well beyond that. Question?
AUDIENCE: [INAUDIBLE]
DAN YAMINS: The question is, are we thinking of the untrained networks as just at the moment of birth or at the beginning of evolution? And it's usually the former. So I'm imagining in this very crude way that the architecture of the network is selected over evolution but that the synaptic weights are the things that are being developed, created by data-driven learning during the lifetime of the organism. That's like the usual idea.
Of course, it's probably not exactly so clean as that. But the natural physical interpretation is that structural things in the network, like how many layers there are or what the operations are or whatever, are selected over evolutionary time, and the synapses are basically learned by experience during the development of the organism and even beyond development. Obviously, it's a crude approximation, but it's the kind of mental picture. Yeah.
So beyond the autoencoding, there's this other class of problems called missing data problems, or at least that's what I call them. And one of the most interesting ones of these was colorful image colorization. Missing data means you take away some of the data and you've got to predict the rest. And in this case, the colorful image colorization one was where you basically get the color image, of course. You rip off the color channel, then you give the network the grayscale image. And then it's got to predict the colored image, as opposed to just reconstruct the input.
So this is a problem that actually doesn't really need a penalty because it's non-trivial by its very definition. And it turns out that actually this one transferred a bit better to doing things. But as I say, for a long time, things didn't work out too well in unsupervised learning.
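The missing-data setup can be sketched as a function that builds an (input, target) pair from a color image. This is a simplified illustration under stated assumptions: images are nested lists of RGB tuples, grayscale is a plain channel average, and the loss is mean squared error. Published colorization methods make different choices on all three points.

```python
def make_colorization_pair(rgb_image):
    """Split a color image into (grayscale input, color target)."""
    gray = [[(r + g + b) / 3.0 for (r, g, b) in row] for row in rgb_image]
    return gray, rgb_image

def colorization_loss(predicted_rgb, target_rgb):
    """Mean squared error over all pixels and channels (assumed loss)."""
    total, count = 0.0, 0
    for pred_row, true_row in zip(predicted_rgb, target_rgb):
        for pred_px, true_px in zip(pred_row, true_row):
            for p, t in zip(pred_px, true_px):
                total += (p - t) ** 2
                count += 1
    return total / count

# Usage: a 1x2 image. The network would see gray_input and be scored
# against color_target; no external labels are involved.
image = [[(0.0, 0.3, 0.6), (1.0, 1.0, 1.0)]]
gray_input, color_target = make_colorization_pair(image)

# A trivial baseline that copies the gray value into every channel.
gray_baseline = [[(v, v, v) for v in row] for row in gray_input]
baseline_loss = colorization_loss(gray_baseline, color_target)
```

The baseline that just copies the gray value has nonzero loss on any pixel that actually has color, which is the sense in which the task is non-trivial without an added penalty.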
But then around 2018, a new family of methods started to get developed, and these are called deep contrastive unsupervised embedding methods. One of the really original, I think, really good versions of this was instance recognition from Stella Yu's lab-- she's a computer scientist-- that came out in 2018. And the basic idea is that the network, now represented by this triangle, is not trying to autoencode anything. It's just putting inputs into some embedding space, like a couple of hundred dimensional, e.g. some neurons representing things, with the objective that responses to basically head motion, like cropping by moving your head around, would be pushed together. Those things would be put together by the network, and things that were just different images that you saw at a different time would be pushed apart.
It sounds trivial. It's like just saying, the same image is the same embedding and the different image is the different embedding, but image is generated by moving your head around or augmenting the data in some way. And it turns out that this actually was a big leap in terms of performance and it opened a whole notion of contrastive unsupervised learning that was very powerful in many different ways. And actually, one of the big players on this was Chengxu Zhuang, who was a student in my lab and is sitting right there, and did just a whole series of amazing work on unsupervised learning. So he built methods of unsupervised learning and he compared them to brain data. And so I'm going to just tell you briefly about some of that.
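The push-together, push-apart idea can be sketched as a softmax-style contrastive loss over cosine similarities. This is a generic InfoNCE-style form chosen for illustration, not the specific instance recognition objective; the `temperature` value and all function names are assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Low when anchor is near its positive and far from the negatives."""
    a = normalize(anchor)
    pos_logit = dot(a, normalize(positive)) / temperature
    neg_logits = [dot(a, normalize(n)) / temperature for n in negatives]
    logits = [pos_logit] + neg_logits
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denom = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    return log_denom - pos_logit

# Usage: "positive" stands for another view of the same image (for example
# a crop); "negatives" stand for embeddings of different images.
good = contrastive_loss([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Here `good` is the case where the two views of the same image already land close together, and `bad` is the case where a different image is closer than the matching view, so `good` comes out smaller than `bad`.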
So he built this notion, this algorithm called local aggregation, which was not just trying to figure out how to push the same image together, but also figure out how to get kind of classes of images to generically arise. And I can tell you about the details and Chengxu can certainly tell you about the details, but the core idea was that it distinguished three kinds of stimuli, ones that were really close and looked like they should be the same at the beginning, ones that were close by but not quite, and the ones that were far away. And basically, the objective function that was optimized by this local aggregation procedure was to bring the really close ones closer together, push the ones that were kind of close but not too close further away, and then do nothing with the others. So this is kind of an inverted U-shaped loss function inspired by some thinking and memory.
And this thing actually was really good. Basically, before training, an embedding might look like this. Afterward, natural category structure of various kinds would arise. This is just the loss function, but the details of the math don't matter so much as the fact that at the time this was actually the first network that, trained in an unsupervised way, surpassed AlexNet, the big thing that changed deep learning, trained directly on the ImageNet labels. That's kind of an arbitrary milestone, but it basically said, look, unsupervised learning was working.
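The three-way distinction described above can be sketched as a ratio of similarity mass: the mass on the "really close" set divided by the mass on the larger "close by" set, with everything else left out of the loss. This is a simplified illustration, not the published procedure. Picking both sets by similarity rank, the values of `k_close`, `k_background`, and `temperature`, and the assumption of unit-length embeddings are all choices made here for brevity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def local_aggregation_loss(i, embeddings, k_close=1, k_background=3,
                           temperature=0.1):
    """Pull the closest neighbors of point i in, push the merely-nearby out."""
    v = embeddings[i]
    others = [j for j in range(len(embeddings)) if j != i]
    # Rank the other points by similarity to v, most similar first.
    ranked = sorted(others, key=lambda j: dot(v, embeddings[j]), reverse=True)
    close = ranked[:k_close]            # "really close" neighbors
    background = ranked[:k_background]  # close plus "nearby but not quite"
    # Points beyond the background set never enter the loss.

    def mass(indices):
        return sum(math.exp(dot(v, embeddings[j]) / temperature)
                   for j in indices)

    return -math.log(mass(close) / mass(background))

def on_circle(angle):
    """Unit-length 2-D embedding at the given angle (radians)."""
    return [math.cos(angle), math.sin(angle)]

# Usage: point 0 with one close neighbor, two nearby ones, one far one.
points = [on_circle(0.0), on_circle(0.1), on_circle(1.0),
          on_circle(1.2), on_circle(math.pi)]
loss = local_aggregation_loss(0, points)

# Moving only the far-away point leaves the loss unchanged.
moved_far = points[:4] + [on_circle(-math.pi / 2)]
loss_moved_far = local_aggregation_loss(0, moved_far)
```

Because the far point sits outside the background set in both configurations, it contributes nothing, matching the "do nothing with the others" part of the description.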
AUDIENCE: [INAUDIBLE]
DAN YAMINS: Self-supervised something. It could be computed. It can be computed by the animal. You can run this objective on natural video. You don't need to know anything that's not pretty easily given to the organism.
So whether we want to call that unsupervised in the classic sense or not, I think, is a semantic issue. The key thing is that the organism could compute this type of objective by itself. That's the real criteria.
So what the point was is that things trained like this actually were able to generalize to lots of tasks. So the red things are the self-supervised ones in this deep contrastive version. There are a bunch of versions there. The other ones are the other type-- the other colors, the other types of unsupervised methods.
And you can see that they were much better at object categorization, although not as good as the black bar, which is the thing directly supervised on that task, although they made a lot of progress. But they were as good at tasks, like object position, and better at other tasks, like object size determination. So what you have is basically a system that is a kind of generic task-agnostic proxy for learning something that's useful for transferring to many tasks.
And this was a really powerful result from the point of view of a computer vision perspective and there were many other variants of this, some of which have gone on to be even more successful but are kind of in this generic domain. So of course, the big question that we wanted to answer, having looked at this, was how well does it predict neurons. And there's this really amazing kind of tour de force paper that Chengxu has on this, among other things showing that actually really, finally we were able to get a kind of quantitatively accurate unsupervised model of the higher brain areas, in this case, IT neurons. But actually, we could actually show that there was this same anatomical consistency, earlier layers looking like V1, intermediate layers looking like V4.
But basically the point was that these red bars were finally at the point of being as good at neural prediction as the supervised models were on categorization. And this was a big moment for me because I was like, I thought this problem was going to take 20 years or something. And it just felt like finally this problem that was going to dog us forever kind of got broken.
And in particular, what was really cool was that you could actually take the SAYCam data-- this was the data from the head-mounted camera of the kid. This is the point. You could run that data through a video version of the unsupervised learning model, since you also built a video version, which was also good from a computer vision perspective.
But you could run that like head-mounted cam data through the model. It didn't cost anything to do that. It could be done. There was no need for any labels or anything. And you could see that, actually, the models that were learned that way were also good models of predicting neural responses. So actually, you could do two things at once, get rid of the labels and also make something actually operate on the real data that organisms had, not just framed things in images or something like that.
So it started to make us think, oh yeah, maybe there'll be a candidate model of visual development. It's going to be imperfect. When digital developmental data is collected, as it's collected, we'd love to continue to do this. I think Chengxu is working hard on that. But anyhow, this was something basically intermediate, I'll say, that contrastive unsupervised approaches finally made up the supervision gap, both in performance and neural fits, and also worked pretty well in real data streams.
But they're still imperfect. So aside from the fact that they're not actually getting to the dotted line, I was really surprised, actually, that the red bar was not that much higher, if at all, than the black bar. And Chengxu never let me forget it.
He always said, look, we should really be able to do better than this because we're getting something that's more like the actual biological learning procedure. Why isn't it better at predicting the neurons? That was always-- so I think of this as a glass 3/4 full, but that quarter is really important.
So now I'm going to tell you something that sounds like it's going to be very different at first, but actually ends up being quite related. So what do we know about the visual brain? Well, we know about neural response properties. Everything I talked to you about so far is neural response properties. It's neurons responding to stimuli.
But there's also this other thing that happens in the brain, which is neurons are arranged on the cortical surface topographically, and that's a big, salient fact. And without that, MRI wouldn't work, basically. So it's really important. Lots of things wouldn't work, probably.
And there's all this phenomenology that's very robust and very well measured. Early visual cortex, higher visual cortex, they both have topographical structures. So in early visual cortex, there's basically pinwheels and other stuff, cytochrome oxidase blobs, all sorts of things that are kind of characteristic structural topography of V1. Higher-level visual cortex also has lots of properties like patches of various kinds selective to various things with reliable structure.
Putting aside for a moment the relationship to self-supervised learning-- I'll come back to that-- this is something you'd really like to have a model of. And so in concert with Jim and Hyo from Jim's lab and Kendrick Kay and Kalanit Grill-Spector, my student Eshed did some really exciting work in this direction. And the basic idea was that the whole purpose of the system was to represent generic visual features that are useful for some task under some general physical biophysical constraint that led to topographical structure. That's a generally natural idea.
In practice, what that means is you take a neural network and you train it to do both task performance, like categorization perhaps, and minimizing some biophysical or spatial cost. So there's two terms of the cost function here, a categorization term, for instance, and a spatial term, and then some weighting, alpha, that says how important the spatial cost is, the biophysical cost is. And then the question is, can you get networks that will generate all of the phenomenology from that idea? So how do you do that?
Well, more specifically, you've got to assign physical locations to model positions. We decided that we actually didn't want to model the emergence of retinotopy since, going back to Sam's question, this is something that's built in genetically at the beginning of the organism. So we're starting from birth, not from evolution here. So we take the networks and we lay out their units in a kind of retinotopic way naturally associated with a convolutional network. Every layer has a physical layout.
And then in each layer there's a spatial cost that basically asks that neurons that are nearby physically should have correlated responses. I'll come back to the detail of that. But anyhow, this is the kind of network that we built. And of course, there's a task loss as well.
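The two-term cost can be sketched as a task loss plus `alpha` times a spatial cost that is low when physically nearby units have correlated responses. The specific form below (one minus Pearson correlation, weighted by inverse distance) is an assumption made for illustration; the talk leaves the actual definition of both terms for later, and the function names and default `alpha` are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length response vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spatial_cost(positions, responses):
    """Penalize nearby units whose responses are weakly correlated.

    positions[i] is the (x, y) location of unit i on the model "cortex";
    responses[i] is that unit's responses across a set of stimuli.
    """
    total, weight_sum = 0.0, 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.dist(positions[i], positions[j])
            weight = 1.0 / (1.0 + distance)  # nearby pairs count more
            total += weight * (1.0 - pearson(responses[i], responses[j]))
            weight_sum += weight
    return total / weight_sum

def total_loss(task_loss, positions, responses, alpha=0.25):
    """Task term plus alpha-weighted spatial term."""
    return task_loss + alpha * spatial_cost(positions, responses)

# Usage: four units in two spatial clusters, with two response profiles.
positions = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
rising, rising2 = [1.0, 2.0, 3.0], [1.0, 2.0, 3.5]
falling, falling2 = [3.0, 2.0, 1.0], [3.5, 2.0, 1.0]
clustered = spatial_cost(positions, [rising, rising2, falling, falling2])
scattered = spatial_cost(positions, [rising, falling, rising2, falling2])
```

Placing similarly tuned units next to each other (`clustered`) gives a lower spatial cost than interleaving them (`scattered`), and `alpha` controls how much that matters relative to the task.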
So I'll call these topographical deep artificial neural networks. That's a name that Hyo and Eshed thought about for a long time. And in particular, the first question was, do they predict V1-like phenomenology?
So how would you know that? You take probe drifting grating stimuli, just images in this case, but coming from different grating stimuli, and you stick them into the network and see what happens in response. That's how you would do it physiologically. And you can measure the orientation tuning of each unit and you can make a map because, remember, it's physically laid out.
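One common way to turn grating responses into an orientation map is vector averaging with doubled angles, sketched below. The talk does not say which estimator was used, so this choice, along with the function names and the nested-list layout of the responses, is an assumption.

```python
import math

def preferred_orientation(orientations_deg, responses):
    """Estimate preferred orientation in degrees, in the range [0, 180)."""
    # Orientation repeats every 180 degrees, so angles are doubled before
    # averaging and the result is halved afterwards.
    s = sum(r * math.sin(2 * math.radians(t))
            for t, r in zip(orientations_deg, responses))
    c = sum(r * math.cos(2 * math.radians(t))
            for t, r in zip(orientations_deg, responses))
    return (0.5 * math.degrees(math.atan2(s, c))) % 180.0

def orientation_map(unit_responses, orientations_deg):
    """unit_responses[row][col] holds one response per grating orientation."""
    return [[preferred_orientation(orientations_deg, r) for r in row]
            for row in unit_responses]

# Usage: a 1x2 sheet of units probed with four grating orientations.
orientations = [0.0, 45.0, 90.0, 135.0]
sheet = [[[0.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]]
pref_map = orientation_map(sheet, orientations)
```

Because each unit has a physical position in its layer, the resulting grid of preferred orientations can be displayed directly as a map.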
And qualitatively, it actually is pretty good. Actually, it's quite often hard for me to tell which response maps came from the model and which ones were figures from papers. This one actually is from the model. And if you look at a finer grain level, this is what you see from the unit responses, so about as clean as Clay Reid's distributional results.
And it's not just orientation tuning. That is one thing, orientation tuning, but there's also spatial frequency maps and also have good maps for color preference. So these are basically CO blob stains that you can compare to.
And on each of these things, you can say qualitatively it looks reasonable. There's a whole bunch of quantification that we've done to see if that's correct. And basically, the purple line is the model I just told you about, and compared to a bunch of pretty reasonable strong alternatives, this is, I think, capturing a lot of the phenomenology that you see in V1 across these different types of maps.
So I've talked about V1. What about higher cortical areas? So you can measure category selectivity maps and both amounts of units of given category selectivity and also the maps of how they're laid out. And if you actually just look at the proportion of units and how they are responsive, what you see is that actually the purple model, that topographical model, is at the human-to-human similarity ceiling on that metric.
And there's other measures that are more detailed, and we've spent a lot of time on this. And Eshed and Khalid have really given me tremendous insight about how to actually quantify these things. But basically, by comparing actual high-resolution human fMRI maps to the models, we find that basically the topographical model is a much better picture of that structure than lots of other things.
So for example, on smoothness of maps or different categories, that's one metric you can compute. Number of patches is another type of metric that you can compute. Patch area as a third type of metric that you compute. You can also compute like inter-patch relational distance.
And of course, there's going to be some ceiling. The humans aren't going to be perfectly correlated with each other. But there's enough data to get a pretty good ceiling there and the models are in quite good shape at predicting that kind of structure.
So obviously, how you quantify these things is a little bit up in the air. I wouldn't say that our quantification of higher cortical maps is perfect. I'm open to ideas. Please tell me if you think there's some better way to do it. So part of the effort is actually figuring out how to benchmark what it means to be a good model of topography. But putting those important questions aside for a moment, I think we've been able to find out something reasonable about that as well.
But now I've hidden two important things here, intentionally. What actually are these losses, the functional loss and the structural loss? What are they, the task loss and the spatial loss? We'd like to figure out why the thing is as it is by reverse engineering these factors.
So you might have thought that what I actually did in that, what I was showing you, was a categorization model, and that is what we started with. That was the natural thing to do. But actually what that model is, that TDANN model is, is using the type of contrastive self-supervision that I talked to you about earlier that Chengxu helped pioneer.
We started with the first thing and it didn't work as good as we thought it should, and then basically did the second thing thinking, OK, maybe we'll see if it's any better. And it turned out to be a lot better. And it's better not just in category areas, but in V1 as well.
So if you look at, say, for example, pinwheel density or you look in higher cortical areas, what you find is the categorization model versus the self-supervised model has like somewhat less correct pinwheel spacing and other V1 phenomenology, but much less good patch prediction. And that's really interesting because you might think, oh, categorization would be how you would build patches of categories. But ImageNet categorization is actually maybe not quite right. And whatever comes out of the self-supervised thing that kind of gets you ready for lots of tasks, that actually gives you selective units that form a much better description of the patches. So for topographical metrics, the self-supervised models are substantially better than the categorization-based models.
Now let's look at the spatial loss for a second. The natural thing to do here is to basically write down the formula. I'm going to get together units that have this-- line up the distances, the distance correlation with the feature correlation. So that's expressed by the spatial loss, absolute spatial loss, 1 minus the difference between basically this feature correlation and the physical distance. This was what we started with and it actually does not do terribly in higher areas.
Actually, though, the thing that is in the model that I just told you about is another thing. It's called relative spatial loss. So what that basically says is, instead of the absolute difference between the cost and distance, the featural similarity, the functional one and the physical one, we have a more flexible thing which basically asks for there being correlation.
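To make the contrast between the two spatial losses concrete, here is a minimal sketch under stated assumptions. The talk only says that the absolute version penalizes the difference between feature correlation and a function of physical distance, while the relative version asks that the two be correlated. The inverse-distance form `1 / (1 + d)`, the use of Pearson correlation, and all function names below are assumptions for illustration, not a statement of the exact formulas used in the model.

```python
import numpy as np

def pairwise_terms(responses, positions):
    """Response correlation and inverse distance for every pair of units.

    responses: (n_stimuli, n_units) activations; positions: (n_units, 2).
    """
    r = np.corrcoef(responses, rowvar=False)            # unit-by-unit correlation
    diff = positions[:, None, :] - positions[None, :, :]
    d = np.linalg.norm(diff, axis=-1)                   # physical distance
    iu = np.triu_indices(positions.shape[0], k=1)       # each pair once
    return r[iu], 1.0 / (1.0 + d[iu])                   # assumed inverse-distance form

def absolute_spatial_loss(responses, positions):
    # Penalizes the pointwise gap between similarity and inverse distance.
    r, inv_d = pairwise_terms(responses, positions)
    return np.mean(np.abs(r - inv_d))

def relative_spatial_loss(responses, positions):
    # Only asks that similarity and inverse distance be correlated.
    r, inv_d = pairwise_terms(responses, positions)
    return 1.0 - np.corrcoef(r, inv_d)[0, 1]

# Toy layout: units on a line whose tuning varies smoothly with position.
rng = np.random.default_rng(0)
n_units = 30
positions = np.stack([np.arange(n_units, dtype=float), np.zeros(n_units)], axis=1)
centers = rng.uniform(0, n_units, size=200)
responses = np.exp(-(positions[None, :, 0] - centers[:, None]) ** 2 / (2 * 3.0 ** 2))

shuffled = positions[rng.permutation(n_units)]           # destroy the topography
smooth_rel = relative_spatial_loss(responses, positions)
shuffled_rel = relative_spatial_loss(responses, shuffled)
smooth_abs = absolute_spatial_loss(responses, positions)
```

In this toy case the smooth layout gets a lower relative loss than the shuffled one. Because a correlation ignores the overall scale and offset of the two quantities, the relative form does not commit to one fixed mapping from distance to similarity, which is one way to read the remark about scaling with cortical area size.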
Now, why is this important? Well, the first one does not scale with cortical area, but the second one does scale with cortical area size. And that turns out to be really important in allowing for the model to handle many different areas at the same time. So just to make it physical, the loss function, you can think of it as a machine learning thing, but it's actually like a hypothesis about some circuit somewhere in the brain directing neurons to do a certain thing.
So the question is, which of those circuits is there? Is it more like this? Is it more like that, absolute or relative? And it turns out that these are basically hypotheses for how the brain measures physical distance versus functional similarity.
And there it turns out that basically the pattern is inverted. So the absolute spatial loss is much less correct at V1 and a bit less good at patches in IT. This is why it kind of worked in earlier things when we only had the absolute thing, but only at higher areas. But it was really this low-level structure in V1 that allowed us to see that we had to have a different hypothesis about what that loss circuit was. And you can quantify this if you want, but basically the quantification boils down to both self-supervision and relative spatial loss, so both being task-agnostic general transfer and being able to scale to cortical areas of different sizes, are important for getting the topography correct.
And actually, it's not just that topography that gets correct. So you might wonder, does having spatial loss change the underlying feature representation? And it turns out that it does in a very interesting way. Fundamentally, what it does is as you have spatial loss-- 0 means no spatial loss. But as you have spatial loss, the intrinsic dimensionality of the representations reduces. So that's basically like saying that the features get more compactified in some way.
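One common way to put a number on "intrinsic dimensionality" is the participation ratio of the feature covariance spectrum. The sketch below uses that measure purely as an assumption, since the talk does not say which estimator was used; the function name `participation_ratio` and the synthetic data are hypothetical.

```python
import numpy as np

def participation_ratio(features):
    """Effective dimensionality: (sum of eigenvalues)^2 / sum of squared eigenvalues.

    features: (n_samples, n_units) responses of a layer to a stimulus set.
    """
    x = features - features.mean(axis=0)
    cov = x.T @ x / (x.shape[0] - 1)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)    # guard tiny negatives
    return eig.sum() ** 2 / (eig ** 2).sum()

rng = np.random.default_rng(0)
# Spread-out representation: variance shared evenly across 10 directions.
isotropic = rng.normal(size=(5000, 10))
# Compact representation: the same 10 units driven by only 2 latent factors.
low_rank = rng.normal(size=(5000, 2)) @ rng.normal(size=(2, 10))

pr_iso = participation_ratio(isotropic)
pr_low = participation_ratio(low_rank)
```

Under this measure, a representation that becomes more "compactified" shows a lower value, as `pr_low` does relative to `pr_iso` here.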
And if you put on this human VTC, this is where you find that you are. So basically, at the level of alpha, where all of the topographical features lined up, that's actually-- well, so anyway, the point is that the features have been changed in a very important way by having the spatial loss, even though the features are in some sense not spatial. And another way to think about this is that it actually improves the neural fit, at least on a 1 to 1 basis of matching.
There's a lot to be said about that, what the metrics are. But basically, if you look at this model that has the right structure, it substantially increases the ability to look like the neurons just on that pure featural perspective. And if you look at the human-to-human ceiling, you'll see that it's pretty close to that, again, at the same value.
And I should say, alpha of about 0.5 is where all of the magic seems to happen. So basically all of the metrics topographical, non-topographical kind of line up at that spot, suggesting we can do some kind of parameter identification there. But you might wonder, why would you want this spatial smoothness loss from a biophysical point of view? And we think the answer there is basically wiring lengths.
So if you do a natural measure of wiring length, what you find is, of course, not that surprising. As you increase spatial loss, wiring length drops. It would be kind of hard to imagine something else happening just because as things get more compactified and as they get more similar locally, you might expect wiring to get better. But that is true.
More interestingly, actually, if you compare those various alternatives, which are not unlike some trivial continuum or something, what you find is that the model that has by far the best topographical factors also has the least wiring length across all these different kind of alternatives. So to me, what this is really suggesting is now, again, at an evolutionary time scale, loss circuits are optimized to support task transfer and minimize feedforward wiring length. That's kind of the interpretation.
So there's a lot that can be done with this. I'll just briefly mention some work from Dawn Finzi, also featuring Eshed using the same model, to think not just about the ventral pathway, but beyond that and actually to think about the paths, the multiple processing streams in the visual system. So people have identified several different streams, the parietal stream, the lateral stream, the ventral pathway, and there's a lot of explanations for why. And I think the predominant one is something like behavioral demands. Different behaviors are solved in different places with different streams.
But Dawn basically looked at the previous model and was like, well, what about this alternative? This is actually just the previous model. I just wrote it down in a little equation form here, contrastive loss plus a spatial loss. Same contrastive loss, does that give you all of the structure?
So basically, there's two hypotheses here. Behavioral demands theory, for example, as having one stream for like what and another stream for where and another stream for action or something, and that can be actually implemented by deep neural networks and she's done that, versus the unified model that actually says, basically, you have the self-supervised relative spatial loss and you just compare it to the actual whole brain data. And it actually turns out that, basically, to cut a long story short, clustered stream-like structure emerges in the TDANN model, and much better than it does in the behavioral demands theory model. And-- question?
AUDIENCE: What exactly do you mean by streams? Because it looks like there's one stream.
DAN YAMINS: Like, dorsal, ventral, lateral.
AUDIENCE: I understand in the brain. But in your model, there's one stream. Is there anatomical subdivisions that--
DAN YAMINS: Yeah. So you can actually-- so a clustered stream-like structure emerges in the spatial thing and then you can tag which of the units looks like, if you compare it to actual brain data, ventral unit, which looks like a lateral unit, which looks like a parietal unit.
AUDIENCE: And they're actually connected in that stream within the big stream.
DAN YAMINS: Yeah, that's right. Yeah. So there's a lot to say about that and I'm cutting it really short to say this. But at alpha 0.5, you basically get a pretty good match to the actual data, certainly a lot better than the kind of, I don't know, behavioral demands theory. So it's basically saying the emergence of distinct visual streams may be captured simply by learning a general visual representation in the context of a spatial constraint.
So what's the argument? That basically this formula, this contrastive loss that's functional and that spatial loss-- I'm putting the half out there to make it seem like physics. It's not really. So anyway, this formula explains a lot of data. By lots, I mean most of what we actually maybe know about the visual system.
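Since the slide itself is not in the transcript, the formula being pointed at can only be reconstructed from what is said: a contrastive (functional) loss plus a spatial loss weighted by one half, matching the alpha of about 0.5 mentioned earlier. The symbols below, and the sum over layers (suggested by the earlier remark that each layer has its own spatial cost), are assumptions rather than the exact notation on the slide.

```latex
\mathcal{L}_{\text{total}}
  \;=\; \mathcal{L}_{\text{contrastive}}
  \;+\; \alpha \sum_{k \,\in\, \text{layers}} \mathrm{SL}_{\text{rel}}^{(k)},
\qquad \alpha \approx \tfrac{1}{2}
```

Here $\mathcal{L}_{\text{contrastive}}$ is the self-supervised task loss and $\mathrm{SL}_{\text{rel}}^{(k)}$ is the relative spatial loss evaluated on layer $k$.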
And maybe to put this in perspective, there's nothing vision-specific about that first term. It just has visual input, but there's no categories or what. There's nothing vision specific about that.
So maybe this is a general theory of functional organization, so here's a speculation. The same thing will explain primary auditory cortex and perhaps structure internal to the belt and parabelt. There's reason to think that might work. There's reason to think it might not work.
I'm looking at Josh. I wonder what he thinks. But anyway, it's a speculation.
And here's a somewhat wild-- I'm not sure if it's wild or not, but anyway, it's speculation. The same theory will account for basically structure in the medial entorhinal hippocampal circuit, basically like grid cell modules. There's no reason in principle it couldn't. Whether it does or not is a big, open-- is a question I don't know the answer to.
But the idea is just to get you thinking that the principles that give rise to lots of things in vision might be more general. So just to kind of wrap this piece up, contrastive self-supervised approaches have largely made up the supervision gap, work reasonably well in real data streams, and make a lot more sense when spatial phenomenology is taken into account. It's like it was OK before, but now we have a picture that seems to fall much more in line when we expand our way of thinking about what we're measuring to include all of those other phenomenologies.
And from a conceptual point of view, it's like, basically, a kind of unified, mode-agnostic principle explains a lot of data, but can generate pretty specific fine-grained circuit structure inferences as well. So to me, why I wanted to tell you about this thing was basically because I feel like it's synthesized all the things I hoped might happen by pushing that goal-driven neural network approach much further down into the neural details.
AUDIENCE: Dan?
DAN YAMINS: Yeah.
AUDIENCE: [INAUDIBLE]. There were quite a lot of models about orientation columns and [INAUDIBLE] columns in cortex. Is this model consistent with those?
DAN YAMINS: Well, it's better than those in the sense that if you take-- I had a bunch of bars on plots. I always show a lot of bars on plots. Those bars were, as best as we could do, implementations of a whole variety of existing models from the literature, and there's two things to know about those models. There are things like self-organizing maps and stuff.
One of them is that those are area specific, and it's really hard to get the same model to explain data across multiple areas. And two, even when you do that, the model that I showed you is quantitatively usually better.
AUDIENCE: But the constraint is similar or not? I don't--
DAN YAMINS: I think the constraints are all similar in the sense that anytime you're going to be thinking about biophysical constraints, you're going to be thinking about something that naturally is about wiring length. So that's general. That's been around for decades, that idea.
But that you can get specific inferences about how that wiring is measured and how it can be implemented in a way that plays correctly with the functional representation, that's what you get out of doing something like this. So there's a good point. There's a whole literature on topography. But getting that to interact in a way, where you actually have a unified principle that you can think applies to all of those phenomenology and vision and maybe beyond that, I think, is really where there's utility. Yeah.
So from the report card point of view and as we transition into the other part of this, there were things that were bad before, like, obviously deeply wrong. I would say that in at least three of these areas we're doing OK, at least that it's harder to reject out of hand. We're not good yet. We're still working.
It's OK as a learning rule. That's really complicated and that's a story for a different time. But I think that there is progress that's been made.
So I told you about this. I'm going to transition to now the stuff that we get in computer vision that's inspired by thinking about the brain. Oh, question.
AUDIENCE: I'm curious about-- before the transition, I just have a question about the first part. You talk about these biophysical constraints. I'm just curious about what your thoughts are on how to map those constraints to actual experimentally measurable measurements.
DAN YAMINS: Well, in certain sense, I feel like I'll answer you glibly by saying, didn't I just show you an incredibly large number of experimental measurements?
AUDIENCE: Yes.
DAN YAMINS: So that's a back inference that the mechanism had to kind of be right to get all of those factors. I think I know what you're saying, though, to give you a more serious answer. You're basically saying, OK, if you think this circuit is right, you should be able to go in and do some detailed predictions about what you should cut and see with what if you change the circuit in some way. Right?
And so I think the answer to that is, yes, that you should be able to do that. Eshed has a lot of theories about how to connect the biophysical model of the cell, the neuron, to those quantities and that circuit. I don't know myself exactly what the details of the experiments would have to look like, but I think that that's very much a topic for immediate future investigation. Yeah, question.
AUDIENCE: So the goal of finding one objective function that explains the data, is it well posed? Can't you have many variants of the same objective function? How do you deal with [INAUDIBLE] and so on?
DAN YAMINS: I think that's a great question. Yeah, there could be other things that explain the data. At any given time, that could be true.
Usually, I try to-- OK, here's a problem choice aesthetic. Find places where you're going to find that you need to make a big advance to explain lots of data. But you're totally right. You get to a point in any given endeavor where there's actually a couple of different models that you then need to push by having better empirics, stronger empirics. I think that's where the Brain-Score project is in categorization, and that really makes that interesting from an experimental point of view.
So I was going to transition to artificial intelligence inspired by and target set from cognitive science. But there is a neural motivation, which is basically, how do you fill this gap. There's got to be something missing. And the hypothesis is that AI models of vision are actually behaviorally insufficient.
So I'm going to tell you a story about how I got to what we're doing now. I'm going to try to wrap it up not too long after 3:00, although there's some content here and I'm a little worried that I won't be able to do it in the level of detail that I need to make it live. But let me just start anyway.
So I've been inspired by Liz Spelke, through Josh Tenenbaum to basically take very seriously the idea that you need to make predictions of physical-- basically intuitive physics. And so from an empirical point of view, there's this great project that I've done in collaboration with Josh and Judy Fan, who's joining us at Stanford next year, which is basically measuring vision models of physics. That's why it's called Physion. And basically what this project was, was a data set that had a lot of different physical scenarios in it that basically covered-- whoops.
AUDIENCE: Kind of things to seriously think about architecture, task optimization. Given your conversation with Sam about this is thought of as at the beginning of the birth of a brain, what do you think about initial conditions? And the reason-- so initial conditions, that is how you set up the synapses before the training begins.
The reason I ask this is that you show that, for example, a contrastive network and a supervised network end up different endpoints. Of course, those endpoints are solutions to the problem. Of course, had you started with that initial condition, then your backprop would have landed to the same solution that the contrastive network landed on. So how do you think about the space of initial conditions that you're allowed to choose for any given learning scheme or any methodology that you imagine is going to--
DAN YAMINS: Yeah, it's a good question. Normally, we take the crude answer, which is whatever we get from machine learning as the initializer, like random features with some variance or something, is what we do. So that's basically a kind of cop out because it doesn't really have any particular content to it.
But one thing that is-- what could be said and I think is very interesting-- oops, wrong one. One thing that I think is interesting and could be said is that networks that start at an initial condition don't have zero predictivity for the brain. That's like basically saying that a random initial assignment of filters is not a random function. It's a very specific function and it has a pretty good, although it's not like the adult trained network, but it's pretty good predictivity of adult neural data, suggesting-- that's like saying, oh, if you have a baby X, it's going to look something like an adult X, even when it's born, just due to the structural factors of the network.
So that's not really answering your question. But what it basically says is, it's not like the space of possible things that you can get by changing initializers or by choosing reasonable initializers is totally weird. It's a reasonable thing to do. Now, ideally we'd have much more power to talk about exactly what the initial conditions were and to reject the basic standard model of random filters chosen a certain way.
AUDIENCE: But that's not-- I wasn't saying [INAUDIBLE] to reject a particular thing that comes from machine learning. I was saying, when you compare supervised to unsupervised, and you say unsupervised is better, I imagine if the solution that the unsupervised gave you, you had used as the initial condition for the supervised, you would have led to the same answer, which means it's not really a difference that is strictly about supervised versus unsupervised. There is a really important question about what does backprop plus the choice of initial conditions can give you.
DAN YAMINS: So you're saying, you think that if you start with the unsupervised thing as the initial condition and then you run supervised loss, you end up in the same place as what?
AUDIENCE: Unsupervised, because it's a better solution. According to your results, it's a better solution--
DAN YAMINS: Is that true?
AUDIENCE: Yeah, I don't think that's what happened. The supervised loss will drive the weights again to its own preferred landscape, and that's very different from the unsupervised solution. It's going to end very similar to just starting from the random initialization passed to supervised. The initial point at that time for supervised will not matter too much.
DAN YAMINS: But it's a good initial-- it's a good question, how much is catalyzed by that. So let me tell you a little bit about this computer vision stuff. We were inspired by doing these physics experiments, basically, intuitive physics experiments.
And you saw it run, I think, what was happening was basically humans were asked, for a bunch of these scenarios, to choose whether or not the red thing would hit the yellow thing. So this is basically a contact prediction task. And across a whole variety of different scenarios, measure some aspect of physical understanding.
And this was done in humans for thousands of examples over thousands of humans to get reliable human data on that. And then we tried a bunch of physics models, basically, models from the AI literature at making predictions of the future by basically taking the model that had like a physical visual encoder, some kind of dynamics model on top of that, and then reading out, would contact occur. There's a lot of details in there. But anyway, the core results were the following.
Basically, across everything that we tried from various visual models of the future, humans were pretty good. The scenarios were actually titrated so that humans would be kind of in the intermediate range here, so they wouldn't be at ceiling or floor. There was a huge gap between pure vision models and humans on this task. And this is a little bit old, but we have new results from the latest and greatest visual transformers, big-scale models, still pretty bad at this task.
And the reason they're bad is because effectively what happens is that, real physics, the objects move around and a lot of models, the objects kind of do that. So object-centric learning may help. So models that have some kind of object centricity built into them do better, but not all the way. There's still a pretty big gap there.
So this led us to think, OK, look, there are these results that say physical intuition is present very early. We wanted to capture this in some kind of model very directly, not actually a vision model to start with, kind of a psychology model. And that was work that I did with Josh and with Fei-Fei and with a bunch of others, including Chengxu. And basically the idea here is to build physical scene graphs, where basically the nodes of the graph correspond to particles comprising the object and the edges correspond to the relationships between the particles.
Of course, humans don't think about all the particles in an object all at once, so actually what we built is a hierarchicalization of the underlying scene graph. So at higher-level nodes, a lot of details would be dropped out, but they would still be computed from the lower-level scene particles. And the underlying model here was a hierarchical graph convolution. So it's like a convolution, but on a graph.
And I'm going to not tell you about this picture in too much detail, but the key point was that you could train its parameters to make predictions about physics and the network would learn about how to interpret the description of the scene as a graph. And it did pretty well. So for example, this is an example of a deformable cone bouncing off the flat floor, ground truth and prediction, or the Stanford bunny, or rigid sphere rolling out of a bowl, or a floppy teddy bear bouncing off the floor and recovering, knocking over an unstable block tower, folding cloth. Definitely not perfect, but pretty good across a wide range of material types and stuff.
And then there's a whole literature that developed from this, some of which are much better than that, but were based on that basic idea. And so in contrast to the vision-based models, the thing I just told you, which has several different subtypes but the main one is the one that's the DPI there, is much better. So this is basically saying, having an explicit physical scene description helps a lot at this task.
And actually, if you look at the error correlation, not just the performance of the models, you see almost the same pattern. So basically, across a wide range of models, being better at this task means being better at your correlation of the errors that you make with humans. None of the models are perfect at that, by the way. There's a pretty big gap there. So there's still quite a bit to explain.
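The two quantities being compared here, task performance and agreement with human response patterns, can be sketched as follows. This is only an illustration under assumptions: the per-scenario numbers are made up, and the actual benchmark may use a different agreement statistic than the plain Pearson correlation shown.

```python
import numpy as np

def accuracy(pred_prob, labels):
    """Fraction of scenarios where the thresholded prediction matches ground truth."""
    return np.mean((np.asarray(pred_prob) > 0.5) == np.asarray(labels, dtype=bool))

def response_correlation(model_prob, human_rate):
    """Pearson correlation between model outputs and the fraction of humans
    answering 'contact' on each scenario (a stand-in for error-pattern agreement)."""
    return np.corrcoef(model_prob, human_rate)[0, 1]

# Hypothetical data for six scenarios: did the red object hit the yellow one?
labels = np.array([1, 0, 1, 0, 1, 1])
human_rate = np.array([0.9, 0.2, 0.4, 0.1, 0.8, 0.6])   # humans err on scenario 3
model_prob = np.array([0.8, 0.3, 0.3, 0.2, 0.9, 0.7])   # model errs there too

model_acc = accuracy(model_prob, labels)
human_like = response_correlation(model_prob, human_rate)
```

A model can score well on `model_acc` while scoring poorly on `human_like` if its mistakes fall on different scenarios than human mistakes, which is why the two are reported separately.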
But to me, the core point of this was it raised the question, putting aside physics for a second, how do you get that scene representation? What I just told you is not computed from vision. It's cheating. It was, like, you know the scene and the parts.
So that focused us back on this vision problem. We started with dynamics. This focuses us back on the simpler-- I don't know, simpler-- anyway, vision problem. How do you get scene graphs?
So this is a kind of, I don't know, hierarchy of potential thoughts for how that might happen from pixels. One is you start with knowing optical flow. From that, you somehow compute segmentation and depth, which you can then lift from 2.5 D to 3D, which then gives you a kind of scene graph, from which you can then predict physics. So this is kind of like a Spelke-Marr synthesis.
The problem with this type of idea is-- some of you who follow the machine learning literature may be familiar with the bitter lesson, which is, just TLDR, it's like all your fancy theories make things too complicated and everything will be solved by large models and lots of data, and that might feel better. So more constructively, it's like, make your theory as simple as possible, but no simpler and try not to build in too much by hand.
So how do you use the thing I just said, but kind of do it in a way that's not inconsistent with, I don't know, Sutton's bitter lesson, which has been pretty powerful? I'm going to tell you some thoughts that kind of go toward a Spelke-Marr-Sutton synthesis, at least for the first couple of steps of this chain. So the very first thing is that you might think that a good problem is unsupervised category-agnostic segmentation of static real-world images into objects. So it's basically like this problem, like in that video, being able to know where the segment boundaries of the objects are.
And there's a clear inspiration here from developmental cognitive science, which is this concept that we at least call a Spelke object, namely a collection of physical stuff that will always move together under the application of everyday physical actions. And that concept has been operationalized by Liz and many others in trying to think about what's very early, either built in or quite early in organisms. And here's a kind of, I guess, obvious idea, do Spelke object inference from motion.
So basically, extract optical flow from the video. That's what you're seeing being extracted. Threshold the optical flow to get the object borders. And then train the static segmenter from that teaching signal. And I'm talking in explicitly computer vision terms, and I'm not going to apologize for it because I'm telling you about how to actually build good computer vision algorithms.
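The thresholding step can be sketched in a few lines. This assumes a dense flow field is already available from some flow estimator (not shown), and the function name `motion_segments` and the threshold value are hypothetical choices for illustration.

```python
import numpy as np

def motion_segments(flow, threshold=0.5):
    """Turn a dense optical flow field into a binary 'moving stuff' mask.

    flow: (H, W, 2) array of per-pixel displacement (dx, dy) between two frames.
    Returns an (H, W) boolean mask of pixels moving faster than the threshold.
    """
    speed = np.linalg.norm(flow, axis=-1)
    return speed > threshold

# Toy scene: a 3x3 object moving right over a static background.
flow = np.zeros((8, 8, 2))
flow[2:5, 3:6] = [2.0, 0.0]
mask = motion_segments(flow)
```

The resulting mask would then serve as the teaching signal for a segmenter that only sees a single static frame. Note that two distinct objects moving together would land in one mask, and a deforming object could be split, so the threshold alone carries no notion of objecthood.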
So this is a reasonable idea. It's not that hard to implement. And it led to some pretty good results in collaboration with Josh and Jiajun Wu, who's now at Stanford, also was from here, led by Honglin Chen, Dan Bear. So this was pretty reasonable from a computer vision point of view. This was an oral paper at ECCV22.
But there's a really fundamental problem here, which is that it was limited in the purview that it could apply to because thresholding flow does not work in many real world situations. Like this one, you have a bunch of objects that are kind of different objects, but moving together, or in this example, where you have single objects that are deformable. So thresholding flow in these cases would be kind of bad.
Now, that might not seem like a big problem. It feels like a little crack. But actually, in attempting to solve this problem, we kind of formulated something that we think is getting towards a kind of pure vision foundation model.
So what is a foundation model, you might ask. It's this thing that was coined at Stanford by my colleague Percy. It was probably one of the few people who was not on this paper.
But what is a foundation model? Many people talk about it in different ways. It's a computer-- an AI thing.
I say it has two components, a large pre-trained model-- that's what my colleague Jitendra Malik likes to call it-- and a generic task parameterization. A foundation model means a pair of these things. So let me make that concrete. In language is where it has really blossomed, this idea.
So the large pre-trained model in language is a contextual next-word predictor. That's what GPT-3 is. And the generic task parameterization is the very simple idea that if you want to parameterize any language task, you write it down and you put it in the context of the predictor. And then whatever comes out is the answer to the task. That's a really simple task parameterization. It almost sounds trivial, but this realization was amazing, that the predictor could be good enough so that lots of tasks could be solved this way.
So now the question that struck me when GPT-3 came out was, how would you do this in vision? It's much harder to imagine exactly how to do this right. And this has actually led to the interesting situation where vision has kind of been lagging from the point of view of true foundation models.
So I'm going to tell you about an approach that kind of seems like it may be going in that direction. Some of you here may be familiar with the concept of a masked autoencoder. Basically, a masked autoencoder is a function that you take the whole image, you subtract off 75% of it in patches, and you predict the whole thing from the 25% of patches that remain. It might sound funny as a task, but the idea is that you can train a network to solve this completion task.
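The input side of that completion task can be sketched as below: cut the image into patches, hide 75% of them at random, and keep the remaining 25% as what the network sees. The patch size, image size, and function names are assumptions for illustration, and the network that fills in the hidden patches is not shown.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def random_patch_mask(n_patches, mask_ratio, rng):
    """Boolean mask over patches; True means hidden from the encoder."""
    n_masked = int(round(n_patches * mask_ratio))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.permutation(n_patches)[:n_masked]] = True
    return mask

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))             # stand-in for a real image
patches = patchify(image, patch=16)           # 14 x 14 = 196 patches
mask = random_patch_mask(len(patches), mask_ratio=0.75, rng=rng)

visible = patches[~mask]                      # what the model is given
targets = patches[mask]                       # what it must reconstruct
```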
And it turns out that, 1, it works pretty well, kind of amazingly well. And 2, the underlying representation transfers pretty well to many tasks. So it's a pretty good unsupervised learning framework.
So I just told you about a technical idea. It works on single frames. We wanted to solve tasks like optical flow, so we needed to do it on two frames. But the non-obvious question is, what should the mask policy be? That might sound like a technical detail, but it's not.
So I'm going to tell you about a particular prediction problem, and this prediction problem says the following. I start with two frames, frame 0 and frame 1. I give you all of the information in frame 0 and 1% of the information in frame 1. That's a very, very sparse mask, much too sparse to work with a single image. And then you've got to predict the rest of frame 1. That's the prediction problem.
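As a rough sketch of this two-frame mask policy: frame 0 is fully visible and only about 1% of frame 1's patches are revealed. The name `temporally_factored_mask`, the 32x32 patch grid, and the uniform random choice of revealed patches are assumptions for illustration; the talk does not say how the revealed patches are chosen.

```python
import random

def temporally_factored_mask(grid_h, grid_w, frame1_visible_frac=0.01, seed=0):
    """Return per-patch visibility flags for frame 0 and frame 1.

    Frame 0 is fully visible; only a small fraction of frame 1 is revealed.
    """
    rng = random.Random(seed)
    n = grid_h * grid_w
    n_visible = max(1, round(n * frame1_visible_frac))   # always reveal at least one patch
    frame0_visible = [True] * n                          # all appearance information
    revealed = set(rng.sample(range(n), n_visible))      # the sparse motion hints
    frame1_visible = [i in revealed for i in range(n)]
    return frame0_visible, frame1_visible

# Hypothetical 32x32 grid of patches per frame (1024 patches).
f0, f1 = temporally_factored_mask(32, 32)
print(sum(f0), sum(f1))  # 1024 10
```

With so few frame-1 patches visible, the appearance of the predicted frame has to come from frame 0, which is the factoring argument made next.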
If such a predictor works, the key point is that it must cause factoring of motion from appearance because appearance information has to come from frame 0. Where else is it going to come from? There's not enough information in frame 1 to make it work. But object transform information has got to come from comparing those few little patches between frame 0 and frame 1. Where else is it going to come from? And F has got to effectively learn-- that predictor has got to effectively learn to apply transforms to each object.
It turns out that this works quite well. You can solve this problem with a standard modern architecture and build what we call a temporally factored masked autoencoder, temporally factored for the reason I just said. So let's look at temporal factoring in action.
These are factual predictions. You want to get from Xt to Xt plus delta. That's one mask point which tells you what the background looks like. That fixes how the background has moved. There's camera panning.
The next point tells you how the horse needs to move at a gross level. The horse moves from the beginning to the end. It moves all the way forward with a couple of points-- one point is really enough.
As you add points in the right places, you get fine detail, like where the legs are. So this is temporal factoring because what it's basically saying is you can have a very tiny number of points that give you different pieces of the way that the system changes. Let me just make that more specific.
That's my wife. In the video, she's leaning in, she's smiling, smiling. So let's do that again. One point moves the head into the right location. Two points does a size correction, couple of points for pose correction, and a couple more around the mouth to get the smile. That's what it means to do temporal factoring in factual predictions by adding more points, key points essentially, which you can determine by seeing which ones reduce the error.
That's fine. But now I'm going to tell you about how to use counterfactuals with the thing I just told you to do interesting stuff. So let's say you start with the underlying ground truth image. Now there's no second frame. And you want to move one point a given distance, just move the patch and produce the masked counterfactual that you would then put into the model.
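A minimal sketch of building such a motion counterfactual: copy one patch of the ground truth frame to a displaced location and present it as the only revealed patch of an otherwise masked second frame. The dictionary layout, the function name `motion_counterfactual`, and working in whole-patch units are assumptions for illustration.

```python
def motion_counterfactual(frame0_patches, src, displacement):
    """Build the sparse frame-1 input for a motion counterfactual.

    frame0_patches: dict mapping (row, col) -> patch content for frame 0.
    src:            (row, col) of the patch to move.
    displacement:   (d_row, d_col) in patch units.
    Returns a dict holding the single revealed frame-1 patch; every other
    frame-1 location stays masked.
    """
    r, c = src
    dr, dc = displacement
    dst = (r + dr, c + dc)
    if dst not in frame0_patches:
        raise ValueError("displacement moves the patch outside the frame")
    return {dst: frame0_patches[src]}

# Toy 4x4 grid where each "patch" is just a label string.
frame0 = {(r, c): f"patch_{r}_{c}" for r in range(4) for c in range(4)}
frame1_visible = motion_counterfactual(frame0, src=(1, 1), displacement=(1, 1))
print(frame1_visible)  # {(2, 2): 'patch_1_1'}
```

The predictor would receive all of `frame0` together with `frame1_visible` and fill in the rest of the second frame.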
That moves the whole horse, not perfectly, but it moves it pretty well. So that's one patch. That moves down to the right. The horse moves down to the right. That's a counterfactual motion.
So you can do lots of different counterfactual motions, like horse down or right or back or rearing or whatever. Not perfect predictions in the diffusion sense, but a very small number of points moved to produce big changes in the system. There's another case, ground truth, and then moving a bowl, moving the banana, moving the basket, moving the basket and the banana but not the bowl, or pitching the bowl upward by moving one point toward you and one point away to give you pitching, out of plane rotation.
Counterfactuals turn out to be really useful for doing things like optical flow. Let me show you how. Let's start with the ground truth image and add a few little perturbations to this image. Now, this is not motion counterfactual. This is just RGB counterfactual.
I'm adding a little red patch. Three little squares have literally been added to the RGB image. Here they are. So now we've got the original image and the counterfactual image, and we can apply the predictor to get the clean prediction and the counterfactual prediction.
And what you see is if you take the difference between those, there's a counterfactual effect. The red dot has moved. The green dot has stayed in place because it's on the background. It's moved a little bit differently. It's a panning.
And what happened to the blue one? Well, basically, flow has happened. The red has moved. The green has stayed still and the blue one has been occluded. There's no response to the blue because there's the occlusion.
So in some sense, this resolves the hard problem of unsupervised flow estimation, which is dealing with low contrast regions, by counterfactually creating high contrast regions wherever you want them. So it's a bit like saying, if you want to track a point, you put a little perturbation at that point, apply the predictor, and see where it goes. And that's how you construct flow and occlusion maps in this very direct, unsupervised way from that pre-trained model.
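The "perturb a point, apply the predictor, see where it goes" recipe can be sketched on a toy problem. Everything here is an assumption for illustration: `toy_predictor` is a stand-in that shifts the frame by a fixed amount (the real model is a trained network that also sees sparse patches of the second frame), and treating a missing response as occlusion is a simplification of what the talk describes.

```python
def toy_predictor(frame0, shift=(1, 2)):
    """Stand-in for the pre-trained predictor: moves every pixel by a fixed
    (dy, dx); pixels shifted out of the frame are dropped."""
    h, w = len(frame0), len(frame0[0])
    dy, dx = shift
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = frame0[y][x]
    return out

def counterfactual_flow(predictor, frame0, point, eps=1.0):
    """Estimate flow at `point` by perturbing it and locating the response."""
    y, x = point
    perturbed = [row[:] for row in frame0]
    perturbed[y][x] += eps                      # the counterfactual "dot"
    clean_pred = predictor(frame0)
    cf_pred = predictor(perturbed)
    best, best_val = None, 0.0
    for yy in range(len(frame0)):
        for xx in range(len(frame0[0])):
            d = abs(cf_pred[yy][xx] - clean_pred[yy][xx])
            if d > best_val:
                best, best_val = (yy, xx), d
    if best is None:
        return None                             # no response: treated as occluded
    return (best[0] - y, best[1] - x)

frame0 = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
print(counterfactual_flow(toy_predictor, frame0, (3, 3)))  # (1, 2)
print(counterfactual_flow(toy_predictor, frame0, (7, 7)))  # None
```

The second call has no response because the perturbed pixel leaves the frame, which stands in here for the occluded blue dot.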
Now, you might be wondering, OK well, even if you thought counterfactuals were good, they'd be really inefficient because you basically have to poke at each different location and that's really going to take time. You have to make a different input image for each one. How is that going to work?
Well, it turns out that if you normalize the effect and take the limit as the perturbation gets smaller, so you stay better and better within the image distribution, you get to something that should look very familiar from a first-year calculus class. What this formula is basically saying is that normalized flow is the gradient of the predictor with respect to the input patches. So that's nice because it means that you can see how to do, like, infinitesimal flows and not stray out of the image distribution.
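The slide formula is not reproduced in the transcript, so the following is only one plausible way to write the limit being described. The notation is assumed: f is the predictor, x the input frame, delta_p a unit perturbation at input location p, epsilon its size, and f(.)_q the prediction at output location q.

```latex
% Assumed notation, not taken from the slides.
% Normalized counterfactual response at output location q to a perturbation at input location p:
\[
  \frac{\partial f(x)_q}{\partial x_p}
  \;=\;
  \lim_{\epsilon \to 0}
  \frac{f(x + \epsilon\,\delta_p)_q - f(x)_q}{\epsilon}
\]
% One way to read off a flow vector at p: find where the response concentrates.
\[
  \mathrm{flow}(p) \;\approx\; \arg\max_{q}\,
  \left\lVert \frac{\partial f(x)_q}{\partial x_p} \right\rVert \;-\; p
\]
```

Under this reading, the responses for all input locations come from the Jacobian of the predictor rather than from a separate perturbed image per location.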
But also it's tensorial and it can be done efficiently like in PyTorch. It's the same gradient you use for backprop. So that actually works really well to give you high quality flow on random video from the web.
So I've constructed flow. I'll construct one more thing and then wrap up. So I've said, how do you do flow from this predictor, plus counterfactuals as derivative. I was originally interested in segmentation, so let me show you how to use that.
Here's an image from a counter-- a picture of my counter at home. Let's say I move one point on the tissue box and then look at the prediction. And it doesn't look like it's changed that much, but actually it has. If you toggle back and forth, that object has moved.
If you look at the flow of that counterfactual image prediction, you've got the segment, the Spelke segment. I've just constructed it really fast, but the point is that you can do this counterfactual flow generation to pick out lots of objects even in very complicated circumstances. And the key point here is that much like flow was the derivative of the predictor, the Spelke affinity, i.e., how much two points should move together, is the derivative of flow.
Basically, if I change flow at one point, does that change or not change the flow at another point? If so, then they have high affinity, they're on the same object, else not. And that's a derivative because they're saying, OK, if I change this, how much will the other respond. So what I've just basically shown you is how to construct two very basic concepts from taking counterfactuals as derivatives on this particularly-- this temporally factored, masked autoencoder. You can also do this with depth if you want. But just in terms of time, I will just have you believe me that you can estimate depth well.
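The "change flow at one point, see whether flow at another point responds" test can be sketched as a finite difference. This is a toy under strong assumptions: `toy_flow` cheats by using a ground-truth object label map to decide which pixels move together, which the real model would have to infer on its own, and the 0.5 threshold is arbitrary.

```python
def toy_flow(labels, probe, motion):
    """Stand-in for the flow of a counterfactual prediction.

    ASSUMPTION: imposing `motion` at `probe` moves every pixel that shares
    the probe's object label and leaves all other pixels still.
    """
    target = labels[probe[0]][probe[1]]
    return [[motion if lab == target else (0, 0) for lab in row] for row in labels]

def spelke_affinity(flow_fn, labels, probe, query, step=1):
    """Finite-difference response of the flow at `query` to a change in
    the motion imposed at `probe`."""
    base = flow_fn(labels, probe, (0, 0))
    moved = flow_fn(labels, probe, (0, step))
    by, bx = base[query[0]][query[1]]
    my, mx = moved[query[0]][query[1]]
    return ((my - by) ** 2 + (mx - bx) ** 2) ** 0.5 / step

# Toy scene: a 2x2 object (label 1) on a background (label 0).
labels = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
probe = (1, 1)
segment = [(y, x) for y in range(4) for x in range(4)
           if spelke_affinity(toy_flow, labels, probe, (y, x)) > 0.5]
print(segment)  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

Points with high affinity to the probe form the segment; points on the background do not respond and are left out.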
So putting this back in the form of a foundation model, the claim is that now we're thinking about unifying visual cognition by having a large pre-trained model in the form of a structured masked predictor, in particular a temporally factored masked predictor, and the generic task parameterization is counterfactuals constructed as derivatives. And basically, we've done a bunch of constructions, like optical flow, segmentation, depth, key points, a variety of others, to the point where we think that this is a potential-- it's both useful for getting to unsupervised things that we can use in the wild for all these different visual concepts.
But also, it might even be the beginnings of a foundation model for vision. So if we think back to how scene graphs are extracted, what we're basically doing is starting with short-range prediction, getting optical flow from that, getting segmentation and depth from that, but through a generic process that doesn't mean you have to have a different construction for each one. And then we think that some of these other things will be possible and there's reason from the literature to think that's possible, although that's speculative and quite interesting.
But hopefully, the idea is that we've kind of, I don't know, synthesized the problems with the bitter lesson and being able to not have to build too much structure in, but still be able to autodiscover structure that will solve lots of different tasks. And I think that I'm-- I wasn't planning to talk about those, just mention briefly. This is what Jim was alluding to.
My hypothesis is that such a structure could be discovered evolutionarily, that basically these different parameters of how much to predict and how much to mask and which derivatives to take, essentially, that those are things that actually could be parameterized evolutionarily and discovered automatically without having to build anything in. This is basically an intelligent design pathway, but in practice I think we could release that assumption. Of course, there's a big question of whether this will explain neurons better. Like, will it help us with dorso-ventral integration and all those gaps that are there? I don't know, but hopefully that will be able to allow us to complete the cycle.
So I just want to wrap this up by thanking the team who did this work, Honglin, Kevin, Rahul, Dan, and Wanhee. We needed two dots to separate Dan from Wanhee, but we can get the others with single dots. And just to remind you that this is the type of stuff we do. Ask me about intrinsic motivation separately if you're interested. That's a totally different research line that I'm really interested in but didn't have time to talk about today.
And thanks to everybody who funded this and who did the work. Was just a really amazing set of people. So yeah, thanks.
[APPLAUSE]