CBMM Panel Discussion: “Testing generative models in the brain”
April 16, 2021
April 13, 2021
Talia Konkle, Ila Fiete
Panelists: Profs. Talia Konkle (Harvard), Josh Tenenbaum (MIT), and Sam Gershman (Harvard)
Moderator: Prof. Ila Fiete (MIT)
PRESENTER 1: Well, welcome, everybody. So I'm looking forward to today's discussion. We're going to be hearing from Sam Gershman, Talia Konkle, and Josh Tenenbaum on testing generative models in the brain. So I certainly hope to learn a lot.
Just to introduce our panelists-- so Sam Gershman is a professor in the Department of Psychology and Center for Brain Science at Harvard. He has been working on what kinds of inductive biases people use to solve challenging problems in reinforcement learning. Talia Konkle is assistant professor in the Department of Psychology and Center for Brain Science at Harvard, and her lab focuses on mapping and modeling topography in the visual system. And then we have as our third panelist Josh Tenenbaum, professor of computational cognitive science at MIT, and he has done a bunch of work on the nature and role of generative models in human cognition and focused on inductive learning, reasoning, inference, intuitive physics, intuitive psychology, and planning. So he's been especially interested in generative models and how we can find signatures of that.
So just a bit of really general background on the terminology of generative models. This is just a one-slide primer of generative and discriminative models. So let's start with an assumption about the world. Let's suppose that the world has some latent causes y that generate data x.
And we can think about y either as these latent variables or causes, or as being supplied to us in the form of labels that go along with the data in supervised learning. And so discriminative models consist of learning this conditional distribution, p of y given x, often characterized as some function over here that directly describes this conditional distribution. On the other hand, generative models model the full joint distribution of y and x together. And often, they're modeled by decomposing this joint distribution into a probability over these latent variables times a model for generating the data given the latents.
It's also possible to work from one direction going to the other. So you can end up with a discriminative model starting from a generative one. So given a model p of y comma x, you can then build a discriminative model based on the generative model, p sub g of y given x, by doing some marginalization.
And similarly, you can actually build generative models if you start with discriminative ones. If you have a distribution model p of y given x, then you can additionally learn to predict the distribution of the data, p of x, and put the two together to get a generative model based on the discriminative model.
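As a rough illustration of moving between the two kinds of model, here is a minimal Python sketch. The labels, observations, and probabilities are invented for this example and are not from the slide. It builds the conditional p(y | x) from a joint p(y, x) by marginalizing over y, and then rebuilds the joint from p(y | x) and p(x).

```python
# Toy generative model: p(y, x) = p(y) * p(x | y).
# All names and numbers below are hypothetical, chosen only for illustration.
prior = {"cat": 0.3, "dog": 0.7}              # p(y): latent cause
likelihood = {                                # p(x | y): data given cause
    "cat": {"meow": 0.9, "bark": 0.1},
    "dog": {"meow": 0.2, "bark": 0.8},
}

def joint(y, x):
    """Generative model: joint probability p(y, x)."""
    return prior[y] * likelihood[y][x]

def marginal_x(x):
    """p(x), obtained by summing the joint over all latent causes y."""
    return sum(joint(y, x) for y in prior)

def posterior(y, x):
    """Discriminative quantity p(y | x) derived from the generative model."""
    return joint(y, x) / marginal_x(x)

def joint_from_discriminative(y, x):
    """Other direction: combine p(y | x) with a model of p(x) to get p(y, x)."""
    return posterior(y, x) * marginal_x(x)

p_cat_given_meow = posterior("cat", "meow")
print(round(p_cat_given_meow, 4))
```

In this toy case the two directions agree exactly because the same table is used both ways. With learned models, p(y | x) and p(x) would each be approximations.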
So in general for discrimination-- so if you want to do a classification task, discriminative models tend to do better in the limit of large data volumes, but generative models can be good for smaller data volumes and when there is unlabeled data. So I guess those are the basic one-page textbook primer definitions. But of course, we're going to hear that these are pretty sharp distinctions. But maybe in reality, they get more [INAUDIBLE]. So that's it.
I'm just going to give you the position statements of Sam, Talia, and Josh and let them take it away. So Sam's going to argue that generative models are inevitable and often implicitly specified in the brain. Talia is going to argue that discriminative and generative models are two sides of the same coin and that maybe the visual brain really gets to its internal model through discriminative learning and that somehow nevertheless makes the underlying generative processes accessible for readout.
And finally, Josh Tenenbaum is going to talk-- well, he's going to start from the assumption [INAUDIBLE] contain generative models but then address how these models get there and how we determine that they are there. Without much ado, take it away. And I think, Sam, you're going first.
SAM GERSHMAN: Thanks. That was a great introduction. And maybe I should just preface this by saying that I'm not really making a super strong argument that implicit generative models are inevitable, but I want to put it out there as a conjecture.
So Ila already introduced some useful introductory material, and I don't need to rehash that. My notation is just slightly different, and that's the only reason I'm showing you this picture. So I want to make a distinction between the ground truth generative process-- this is the world that's generating data. And then, inside your brain possibly is a generative model that approximates this generative process. I'm going to call that Q to distinguish it from the ground truth data-generating process P.
And so the question, of course, is does the brain have a generative model something like Q and what form does that take if it does? And I think my points are going to be complementary to the ones that Talia and Josh are going to make. I want to call attention to a distinction between explicit and implicit generative models.
So the easiest thing to understand is an explicit generative model, which means a computationally accessible representation of the generative model. And by computational accessibility, I mean that downstream neurons are reading out something about that generative model. So it can be decoded from the neural activity. And there are a bunch of examples of this, and I mention just two here. One is probabilistic population codes and another is what I'll call Monte Carlo codes. Sometimes, that's referred to as neural sampling.
And just for illustration's sake, let me show you what I mean by this using probabilistic population codes as an example. So there is an important paper by Wei Ji Ma, Alexandre Pouget, and colleagues from 2006 where they took the classical population coding idea where you have a bunch of neurons that are tuned to different stimuli. So for example, the x-axis here could be orientation of some stimulus and each neuron has a preferred stimulus that orders them along this x-axis. And then when you present a stimulus of a particular orientation that's shown here, then the neurons are going to respond to varying degrees depending on their preferred stimulus. So that's the peak of their receptive field.
And this is going to be noisy because it's assumed that these are Poisson-spiking neurons. And the critical thing here in this setup is that the amplitude of this hill or the gain of this hill of activity is going to be related to the posterior uncertainty. So if there's a downstream neuron or population of neurons that's looking at this population activity, then it can apply a Bayesian decoder to read out a posterior distribution over the stimulus. And the variance of that posterior distribution is going to be monotonically related to the gain of this population activity.
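The relationship between gain and posterior uncertainty can be sketched in a few lines of Python. This is a toy under stated assumptions, not the model from the 2006 paper: Gaussian tuning curves, independent Poisson noise, a flat prior, and (to keep the example deterministic) noise-free expected spike counts instead of sampled spikes. The function names, tuning width, and grid are all hypothetical.

```python
import math

PREFS = [float(p) for p in range(-90, 91, 5)]   # preferred orientations (deg), assumed
GRID = [g * 0.5 for g in range(-120, 121)]      # candidate stimuli from -60 to 60 deg

def tuning(pref, s, gain, width=15.0):
    """Expected spike count of a neuron preferring `pref` when shown stimulus s."""
    return gain * math.exp(-0.5 * ((s - pref) / width) ** 2)

def expected_counts(s, gain):
    """Noise-free population response (a stand-in for sampled Poisson counts)."""
    return [tuning(pref, s, gain) for pref in PREFS]

def posterior(counts, gain):
    """Bayesian decoder: posterior over GRID under Poisson noise and a flat prior."""
    log_post = []
    for s in GRID:
        ll = 0.0
        for pref, r in zip(PREFS, counts):
            f = tuning(pref, s, gain)
            ll += r * math.log(f) - f           # Poisson log likelihood up to a constant
        log_post.append(ll)
    m = max(log_post)
    weights = [math.exp(l - m) for l in log_post]
    z = sum(weights)
    return [w / z for w in weights]

def variance(p):
    """Variance of a discrete distribution defined on GRID."""
    mean = sum(s * q for s, q in zip(GRID, p))
    return sum((s - mean) ** 2 * q for s, q in zip(GRID, p))

var_low_gain = variance(posterior(expected_counts(0.0, 2.0), 2.0))
var_high_gain = variance(posterior(expected_counts(0.0, 20.0), 20.0))
print(var_low_gain > var_high_gain)
```

Under these assumptions a higher gain gives a narrower decoded posterior, which is the monotonic relationship described above.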
So this is an explicit generative model in the sense that you can read out the probability distribution directly from the population activity. And the claim here is that it actually is getting read out. It's not just there, but it's actually being used for something.
And I'll just point out two other things about this, which may be important. So one is that if there's no stimulus present, then the assumption is that the population activity is going to represent the prior distribution. So that's what we often mean when we talk about the generative model. It's like the distribution over stimuli a priori.
And then the other thing to notice here is that this model is not actually generating data. So it's not literally sampling stimuli in some sense, but it is representing a distribution over stimuli. And we can contrast that with Monte Carlo codes where we interpret neural activity as actual samples from a probability distribution, and then we can see that it literally is generating data.
So when there's no stimulus present, it's sampling from the prior. In other words, it's sampling from the generative model. But either way we would think about both of these as an explicit generative model.
Now, what about implicit generative models? So here, the idea is that there's no computationally accessible representation. So it's not like there's some downstream neurons that are reading out the generative model or some property of the generative model from the population activity. But there's some property of the neural system that's isomorphic to the generative model. So the system operates as if it had a generative model, even if it's not actually representing one in a computationally accessible way.
So there are a bunch of examples of this, and I don't want to spend too much time going through all these examples in detail. I'll just briefly go through the argument about compression. The other arguments for implicit generative models take a variety of forms. And I'm referring to a few of them here.
The unifying theme here-- and this is what I mean by the inevitability of generative models-- is that you could start with a principle like I just want to predict, or I just want to compress, or I just want to take actions that are going to maximize my cumulative reward. And under some assumptions if you set those goals for yourself, then you will in some sense have to implicitly represent a generative model or encode a generative model in some way. In other words, the system that's trying to achieve those goals, that's being optimized to achieve those goals, is going to implicitly depend on a generative model in some sense.
So let me just give you an example of this. And before I do, I'm for flavor invoking a famous quote by Voltaire that, "If God did not exist, it would be necessary to invent Him." And we could say if generative models did not exist, it would be necessary to invent them. That is, even if we didn't start off with the goal of building generative models, if we were trying to solve these other problems like compression, prediction, reward maximization, we might have to invent them in our brains. So that's another argument for why we might look for generative models in the brain.
So let's take compression as an example. So suppose you're receiving data from some distribution. This is the data-generating process, and you want to store it with as few bits as possible. So Shannon's source coding theorem famously states that lossless compression cannot be achieved with fewer bits, on average, than the entropy of the data-generating source.
And one way to achieve this Shannon bound, which is known as entropy coding, is to assign a code word to each datum. This is a binary code word whose length is equal to the negative log probability. So if you think about this for a second, what this suggests is that if I'm trying to build a neural system that does efficient storage, then to the extent that it approximates entropy coding it's going to be functioning as if it has a generative model, even if that generative model is not directly read out in the sense that there's not necessarily downstream neurons that are going to be reading out those probabilities.
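To make the code-length argument concrete, here is a small Python sketch. The four-symbol source and both distributions are made up for illustration. It assigns each datum a code length of negative log base 2 of its probability and compares the average length under the true source P with the average length when the lengths come from a mismatched internal model Q.

```python
import math

# Hypothetical source P and a mismatched internal model Q over four symbols.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
q = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

def code_lengths(model):
    """Ideal code length in bits for each datum: -log2 of its model probability."""
    return {x: -math.log2(px) for x, px in model.items()}

def expected_length(source, lengths):
    """Average bits per datum when data come from `source`."""
    return sum(source[x] * lengths[x] for x in source)

entropy_bits = expected_length(p, code_lengths(p))      # lengths matched to P
mismatched_bits = expected_length(p, code_lengths(q))   # lengths derived from Q
print(entropy_bits, mismatched_bits)
```

The gap between the two averages is the cost of coding with the wrong model, so a system that compresses efficiently behaves as if its code lengths track the source probabilities, whether or not anything downstream reads those probabilities out.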
But the efficient compression depends implicitly on something that's isomorphic to those probabilities. That is the code length. So just to conclude, my first point is that neuroscientists should not necessarily be looking for narrowly-construed generative models in the brain. That is, sometimes when you hear the word generative model, we think, we should be looking for something that's literally generating samples of data in the brain.
And there are some people that take that idea literally, and I don't think it's actually that implausible that we might have Monte Carlo codes in the brain. I think there's some pretty good evidence for that, and we could talk about that later if there's time. But that's not the only possibility, so we don't want to be too limited in the way we think about generative models.
And in particular, I want to suggest that we could have generative models in the brain but not necessarily in the obvious way that there's some representation of a generative model that gets read out downstream, but rather that the generative model is implicit in the functioning of other systems that are not obviously representing generative models. So that's the implicit story. So I'll stop there. Thank you.
TALIA KONKLE: So thanks for having me on the panel. I want to click my slides. So I just want to give you some context of where I'm coming from, which is that my work is basically on measuring, spelling things properly, and mapping response structure in the human brain. We've been using a lot of deep convolutional neural networks to operationalize different kinds of feature spaces and see how those relate to brain measurements, and some self-organizing map algorithms for making topographic predictions.
So there's no explicit generative models in here. And so to get a launching point for thinking about generative models in the brain, I actually remembered way back to when I was in grad school at the BCS Department. And I asked Antonio Torralba-- oh, I lost the slide. There it is.
What's the difference between generative and discriminative models, and is the distinction important, something like that? In that energetic zen master kind of way, he said, discriminative? Generative? It doesn't really matter.
They are two sides of the same coin, really. It's better if you can hear it in a Spanish accent too. Pondering this nugget of wisdom has led me to the following intuitive interpretation of what he might have meant.
So by generative-type models, I think maybe these are models that are explicitly describing what makes the thing a thing. So an X is two slashes. An O is round. The generative model of X makes no reference to the generative model of O at all. The relevant building blocks are explicit in the model itself.
Whereas by more discriminative-type features in contrast, I'm thinking about maybe focusing on the ways in which things are different from other things. So if you're looking for an X among O's on the display, the optimal template is not an X but actually an X minus an O. I learned this from Ruth Rosenholtz probably the same year.
What matters is the presence of evidence in the right places and the absence of evidence in the wrong places, and similarly for an O among X's. And in this way, specifying what a thing is makes direct reference to other things that it's not. And these links, my intuitive take on why discriminative and generative models might be two sides of the same coin, have come up recently in a number of bits of work.
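The X-minus-O template idea can be sketched with tiny binary images in Python. The 3-by-3 patterns and the signal-to-noise measure are assumptions made for this example and are not taken from the talk. The measure is the template's response difference between target and distractor, divided by the template's norm, which is proportional to discriminability under equal-variance pixel noise.

```python
import math

# Hypothetical 3x3 binary "images" of an X and an O.
X = [[1, 0, 1],
     [0, 1, 0],
     [1, 0, 1]]
O = [[1, 1, 1],
     [1, 0, 1],
     [1, 1, 1]]

def flatten(img):
    return [v for row in img for v in row]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def discriminability(template, target, distractor):
    """Response difference (target minus distractor) per unit of template norm."""
    t = flatten(template)
    diff = [a - b for a, b in zip(flatten(target), flatten(distractor))]
    return dot(t, diff) / math.sqrt(dot(t, t))

# Difference template: positive where only X has ink, negative where only O does.
x_minus_o = [[a - b for a, b in zip(row_x, row_o)] for row_x, row_o in zip(X, O)]

d_plain = discriminability(X, X, O)          # template is the X itself
d_diff = discriminability(x_minus_o, X, O)   # template is X minus O
print(d_plain, d_diff)
```

In this toy case the plain X template responds almost as strongly to the O, because the two share their corner pixels. The difference template ignores the shared pixels and penalizes ink where only an O would have it.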
So I'm just going to give you two cases that I think will foster this idea. So the first is principal component analysis in low-dimensional manifolds of a population code. So this is actually based entirely off the talk I saw by Sara Solla recently, and so I'm basically just recapitulating her arguments here.
So the setup is you have a monkey and you have an array in M1, and you're recording a number of neurons from the population. And say, the monkey is doing reaches in six directions. So it's generally accepted that the number of neurons is exceeding the amount of information that an area is encoding. And this redundancy implies that there's covariation among neurons, and that reflects constraints of connectivity, actually. And what that means is that there's a data manifold that's lower dimensional than the number of units, and this makes sense if not all parts of the population code are necessarily occupied.
And so what she does is just use principal component analysis to find the neural modes, or the major bits of neurons that are going together at the same time, groups of neurons that are going together at the same time. So I thought of PCA as generally discriminative. So I'm trying to clarify the dimensions along which items are different.
But then in her talk, she says, well, now you just think of principal component analysis as a generative model. So spikes are generated by the activation of these neural modes. So the reason why this is interesting is where she took it next in the context of thinking about this in terms of long-term recordings.
So if you have this electrode or this array and it's implanted for say two years, by two years later it's shifted. You may not be recording from any of the same neurons at all, but you're still tapping the same general population. But how are you going to decode the arm movement now across the array? The neurons have changed, so you're sampling the population differently.
So here's the value of thinking of it in this generative way. You can do PCA on the recordings from-- oh, yeah, my mouse over here-- the first day, and you can do PCA on the recordings from the day two years later. And if you assume these are both samples from an underlying true data manifold, then these spaces should be able to align well. And they do, and they align well enough that you find these really consistent, stable, low-dimensional dynamics.
And you can actually take a decoder learned on this day and run it through the alignment and very strongly predict how these neurons should go and decode actions just as well. So these work much better than fixed decoding that slowly degrades as the array shifts. So the take-home here is that by thinking of PCA generatively, the step to align these codes made sense.
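A simulated version of this align-then-decode idea might look like the following Python sketch using NumPy. Everything here is an assumption for illustration: the two-dimensional latent dynamics, the random mixing into neurons, the synthetic behavior signal, and especially the alignment step, which swaps in a plain least-squares linear map between the two sets of PCA latents instead of whatever alignment method the original work used. It also assumes the two days provide time-matched samples of the same behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared low-dimensional latent dynamics (two "neural modes") over T time points.
T = 200
t = np.linspace(0, 4 * np.pi, T, endpoint=False)
Z = np.column_stack([np.cos(t), np.sin(t)])
behavior = Z @ np.array([1.5, -0.7])     # hypothetical behavior driven by the latents

def record(n_neurons):
    """Simulate one day's recording: a fresh set of neurons mixing the same latents."""
    mixing = rng.normal(size=(2, n_neurons))
    return Z @ mixing + 0.01 * rng.normal(size=(T, n_neurons))

def pca_latents(X, k=2):
    """Project centered activity onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

day1 = pca_latents(record(30))
day2 = pca_latents(record(25))           # different neurons, same population dynamics

# Decoder trained on day-1 latents only.
w_day1, *_ = np.linalg.lstsq(day1, behavior, rcond=None)

# Alignment: linear map taking day-2 latents into the day-1 latent space.
M, *_ = np.linalg.lstsq(day2, day1, rcond=None)

def r2(y, yhat):
    """Coefficient of determination between true and predicted behavior."""
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

r2_aligned = r2(behavior, day2 @ M @ w_day1)
print(r2_aligned)
```

The design point is that the day-1 decoder is never retrained. Only the map between latent spaces is re-estimated, which is what treating the PCA modes as samples of one underlying manifold licenses.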
But it's built off PCA, which has this discriminative type of feature that's related to variation across the six movements that the monkey did in this task. And in fact, they actually really lean into this point. They say to really understand the code, we need to have a much more natural behavior and not just these six reaching movements to actually better characterize the neural manifold, data manifold.
The second case is related to some advances in self-supervised contrastive learning and the emerging understanding of how and why it's working so well. So as I'm sure everyone in this audience is keenly aware, DeepNets trained on ImageNet to do 1,000-way object categorization learn features that have a pretty good emergent match to visual system tuning. And of course, this is a discriminative model proper. It gets labels for all 1.2 million images for 1,000 categories.
But what do we mean by categories? Are these the right categories? What level of categories? And ideally, you'd want a learning algorithm that can operate over any view without presupposing categories at all.
But if not discriminating all the categories from each other, what's the representational alternative? And we were inspired by this paper in 2018 by Wu et al. where they changed the goal from 1,000-way categorization to 1.2-million-way image-level categorization-- so try and make every image separable from every other image. And they did this by mapping it into a 128-dimensional L2-normed space.
And what they found is that you have emergent category structure that's just naturally in that space that's due to the similarity structure in the natural world and the prior through the DeepNet architecture. But it still has 1.2 million labels. You're still saying, oh, this is image 762. Here it is again. Put it in the right spot in this latent code.
So to get around that, us and others have been doing contrastive learning. So I'll tell you this is work done with George Alvarez. This is the model I know best, but it's the same flavor as many of these others. So you have some image that is the world, and you can take samples from it or augmentations, different crops, and maybe jitter the contrast, et cetera. You run those through a DeepNet, and you map that into 128-dimensional space.
Now, how do you learn good structure in this space? You try and make the different instances similar to each other and different from all the other things you've seen in recent memory. And in general, all the contrastive learnings have the same sort of flavor. Some things are similar and pushed away from other things.
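The "similar to each other, different from everything else in recent memory" objective is often written as an InfoNCE-style loss. Here is a minimal Python sketch of that generic form. It is not the specific loss of any model mentioned in the talk, and the 3-dimensional embeddings, temperature, and memory contents are invented for illustration (the models discussed use 128 dimensions).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so similarity is a cosine."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: pull the positive close, push the negatives away."""
    a = l2_normalize(anchor)
    pos = math.exp(dot(a, l2_normalize(positive)) / temperature)
    neg = sum(math.exp(dot(a, l2_normalize(n)) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Hypothetical embeddings of two augmentations of one image, plus a memory of others.
anchor = [1.0, 0.2, 0.0]
close_view = [0.9, 0.3, 0.1]      # second view embedded near the anchor
far_view = [-0.2, 1.0, 0.5]       # second view embedded far from the anchor
memory = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [-1.0, 0.0, 0.0]]

loss_when_views_agree = contrastive_loss(anchor, close_view, memory)
loss_when_views_disagree = contrastive_loss(anchor, far_view, memory)
print(loss_when_views_agree < loss_when_views_disagree)
```

No category labels appear anywhere in the loss. The only supervision is which embeddings came from the same image.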
And there's no labels during training, but this model ends up with emergent categorization capacity. And since then, many new contrastive learning models have come out that basically operate over instances and use these same kind of rules with slight differences on exactly how you pick your positive pair, exactly how you separate from your negatives. They now have parity with supervised models on ImageNet classification. They have parity with supervised models on capturing brain responses. And directly comparing them, people are finding out that the learned feature spaces of these category-supervised and self-supervised contrastive learning models are pretty similar.
So what's going on? Now, we're starting to get even deeper insights into this. So this is really beautiful work by Tongzhou Wang and Phillip Isola, who's also at CSAIL. So three useful properties of these latent spaces-- by aligning things, by making two things similar, you're basically telling the model what the relevant invariance is. These are the same.
And by contrasting with all the things you've seen, you're doing some sort of uniformity or preserving maximal information. Or I think of it as using your representational resources wisely. And as a consequence, you end up with structure in this latent space that is extremely easy to read out with a linear classifier because similar bits of things end up in similar spots, and a linear classifier can just lop off that spherical cap, if you will, allowing some easy readout of category information that wasn't directly put in there.
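One common way to quantify these two properties on a unit sphere is an alignment score (mean squared distance between positive pairs) and a uniformity score (log of the mean Gaussian-kernel similarity over all pairs). The Python sketch below uses that formalization as an assumption. The talk does not spell out the formulas, and the points on a circle are invented for illustration.

```python
import math
from itertools import combinations

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment(positive_pairs):
    """Mean squared distance between embeddings that should be 'the same'. Lower is better."""
    return sum(sq_dist(a, b) for a, b in positive_pairs) / len(positive_pairs)

def uniformity(points, t=2.0):
    """Log mean Gaussian-kernel similarity over all pairs. Lower means more spread out."""
    pairs = list(combinations(points, 2))
    return math.log(sum(math.exp(-t * sq_dist(a, b)) for a, b in pairs) / len(pairs))

def on_circle(deg):
    """Unit vector at a given angle: a 2-D stand-in for a point on the hypersphere."""
    r = math.radians(deg)
    return [math.cos(r), math.sin(r)]

spread = [on_circle(d) for d in range(0, 360, 45)]     # uses the whole circle
clumped = [on_circle(d) for d in range(0, 40, 5)]      # wastes most of the circle
tight_pairs = [(on_circle(d), on_circle(d + 2)) for d in range(0, 360, 45)]
loose_pairs = [(on_circle(d), on_circle(d + 60)) for d in range(0, 360, 45)]

print(uniformity(spread) < uniformity(clumped))
print(alignment(tight_pairs) < alignment(loose_pairs))
```

A space that scores well on both keeps views of the same thing together while spreading different things over the whole sphere, which is the geometry that makes a linear readout easy.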
And following on these insights, a new paper by Zimmermann et al. has actually recently just directly argued that contrastive learning inverts the data-generating process. They say an encoder learned with a contrastive loss can recover the true generative factors, building theoretical bridges between contrastive learning, independent component analysis, and generative modeling, and that the success of contrastive learning is actually due to an implicit approximation, an approximate inversion, of the data-generating process. Maybe this links to some of the things Sam was saying. But exactly how, I do not know yet.
So in sum, I've tried to show you a couple of cases of what I would typically think of as more discriminative-focused unsupervised approaches but that can be cast as generative or might be able to recover generative models. And we were tasked with trying to say something provocative at the end, so here goes. So maybe the brain is getting to its model through discriminative-type operations that nonetheless make explicit some kind of underlying generative properties for readout. Stepping back a little bit from the ledge, maybe fluidly moving between these two framings may be critical, or at least useful, for understanding neural computation and the emergent properties of the representation. Thanks.
JOSH TENENBAUM: Maybe my provocative point is just that I think it's just completely clear that we have some kind of generative models in our brains. And the questions are, what is their form and content? What information is in them? How is that represented? How do we learn them?
How do we build them through some combination of what our genes give us and our experience? What's the data? What are the learning algorithms, the loss functions? How are these models implemented in actual neural circuits?
Most of my work and my perspective, which I'll share, comes from thinking about the mind at the cognitive level, looking at behavior. I'll try to keep it kind of light for this afternoon panel at the end here, but I'll give you some behavioral demos. But I'm really very fascinated by the question of, how might these things that we've been studying in computational cognitive science work in neural circuits and what are the experiments we can do? How can we best get an empirical handle to use neuroscience data to help answer these questions?
So to give you some flavor of this, consider the following questions. And I'd like the panelists to just weigh in. So if you can unmute yourselves, you can be my participant panel here. But I hope everybody at home can just give me your own answers.
Imagine a chimpanzee eating an ice cream cone. I don't know if you've ever seen one of these before. I haven't. But I can imagine one, and you guys imagine one.
So you tell me, did you imagine something maybe like this? Maybe. Here's a few other things you might have imagined. These are actually all images that I got off of just asking Google Image search query to imagine a chimpanzee eating an ice cream cone, and it came up with some things that I hadn't thought of including something that isn't a chimpanzee but it was holding two ice cream cones or a chimp sharing ice cream experience with a girl.
And in some sense, you can think about-- if you want to think about, what are the data from which you might build a generative model? In this case, something that allows you to condition on a statement or a question and, say, imagine a scene. You can think about what's out there on the web, a search engine like Google, as a way of implicitly specifying a joint generative model. In this case one that's not bad.
But imagine a slightly different query. Imagine a chimpanzee eating an ice cream cone with three scoops. You could fill in more detail like a strawberry on top, vanilla in the middle, and then chocolate on the bottom. Panelists, I think you can all imagine this, right? Yes.
You can say [INAUDIBLE]. I'll give you an opportunity to speak, as you like. But this is actually harder for a web search engine. If you type in this query, you get all sorts of things that are related-- ice cream, sometimes with scoops. You get-- I think it might be a chimp eating a popsicle.
You get dramatically-- some ice cream in the shape of a monkey of some kind, and you get a penguin eating something with three scoops. So you get a lot of things that are related to the query. But fundamentally, you don't get the actual thing.
How about this one? Imagine a parrot eating an ice cream cone. Here's a question. Who's holding the ice cream cone? So, panelists?
TALIA KONKLE: Parrot, until you said that.
PRESENTER 1: In the cage, and he's just pecking at it.
SAM GERSHMAN: Yeah, yeah, I think the person is holding the ice cream cone.
JOSH TENENBAUM: It could be a parrot. It could be just off scene, or it could be a person. And actually, the web is pretty good at this. So these are the things that you get when you query the web on that. You get all sorts of different possibilities.
How about this one? This is with credit to Noah Goodman and Toby Gerstenberg. Imagine a giraffe eating an ice cream cone. The giraffe has a blue ribbon around its neck. The giraffe is standing on the ground and holding the ice cream cone with one of its four feet-- so three feet on the ground, one holding the ice cream cone-- while it licks the top scoop of ice cream.
So you might imagine something like this. This is a fanciful drawing from Noah and Toby. But then you can again ask the question of yourself or of this image, how many scoops of ice cream are there?
And this image shows-- I don't know-- seven or eight scoops, and this is a short giraffe. But hopefully, all of you could agree you could all imagine something like this even though if you search the web for this-- I mean, nobody's ever drawn this until this particular case. Nobody's ever seen anything like this.
Yet, you can imagine an answer to this question. And knowing how tall a giraffe is, there's got to be a lot more than the usual three scoops. I think we all agree.
AUDIENCE: And I don't know if a giraffe can get its mouth to its foot.
JOSH TENENBAUM: Well, that's the whole point. I mean, it would have to have a whole lot of scoops.
SAM GERSHMAN: And Josh, what does it mean? I feel like-- I'm sure I'm not the only one who's had the sensation that for some of these questions, you don't think about it until you're asked. What does that mean about generative models?
JOSH TENENBAUM: Save that question. That's one of the questions I'd like us to think about. Save that. And I think this is something that Tomer Ullman has been investigating in a lot of interesting ways. I'm not sure if Tomer is here, but I'd love him to weigh in on that if he'd like.
Here's another example. Imagine a red cube on the table in front of you. Now imagine next to that one red cube is a stack of three cubes with a green cube on top of a red one and a blue cube on top of the green cube. Everybody got that?
Now imagine one more stack next to that one, this time with nine cubes, all red, stacked one on top of each other. Could you imagine that? So now, imagine that someone bumps the table. What will happen?
AUDIENCE: The stack of nine falls over.
JOSH TENENBAUM: What about the stack of three?
AUDIENCE: No, not mine.
JOSH TENENBAUM: Anybody else's? Which is more likely to fall over, the stack of three or the one?
JOSH TENENBAUM: So Kenny, thanks for being a good volunteer.
AUDIENCE: Everybody else is muted.
JOSH TENENBAUM: It's fine. It's fine. So the point here is that you can imagine this complex scene. It has a bunch of objects in a certain kind of structure, and you can make sort of plausible guesses about what might happen in a certain case that puts together your knowledge of physics, a little bit of an intuitive probability judgment, and so on.
One more kind of thought experiment-- imagine being on the second floor of a house, leaning out the window to see a stack of cubes on the ground coming up to the height you are on the second floor. The cubes are red and yellow, alternating colors in the stack. Can you imagine that?
JOSH TENENBAUM: How many cubes are there?
AUDIENCE: I made them very big.
AUDIENCE: They're very big.
JOSH TENENBAUM: And how big are they?
AUDIENCE: Not many.
JOSH TENENBAUM: So how--
AUDIENCE: I'm getting five.
JOSH TENENBAUM: What did you say? What?
JOSH TENENBAUM: Huge cubes. If you gave a large number-- if you give a small number, then they were big cubes. If you gave a large number, they could have been much smaller cubes. Imagine the stack of alternating red and yellow cubes is much taller, continuing far up above your head until it reaches a low layer of clouds blanketing the sky.
The stack of cubes continues up into the clouds beyond where you can see. The cubes aren't even perfectly aligned horizontally but offset in different ways, like the steps in a funny sort of ladder or stairway. Can you guys imagine that? And imagine-- so now, could you climb up the cubes until you reach the clouds and beyond?
AUDIENCE: I'd have a hard time, but it's a Jack and the Beanstalk picture.
JOSH TENENBAUM: So you could, in principle-- I mean, if you had enough time. It might require that the cubes are glued or stuck together. Exactly. This is exactly the kind of story that could be a funny version of Jack and the Beanstalk only with my favorite kinds of intuitive physics demos.
So these are somewhat fanciful examples. But what I'm trying to get us to think about here is when we think about generative models, the models of the world, what are the kinds of things, capacities, we need in the mind and the brain to capture the human ability to imagine all these things that we've never seen before? So in the work that we do-- and again, I think most people in CBMM are familiar with this, so I won't really go over the details.
But we build-- we use the idea of probabilistic programs to capture our generative models which have symbolic structure, the kind of things that we might put into game engines where we have objects and physics, the ability to simulate the mental models inside a kid's heads. But then we wrap these inside a framework for probabilistic inference so that we can answer these questions. So like in the work that we've done as part of CBMM, where we build models of simple kinds of blocks world scenarios, and people's judgments about what's likely to happen with these blocks all over, which way will they fall, but also questions like a demo I've shown many times in CBMM, which I was also trying to trigger before.
If I say, well, imagine I have some combination of red and yellow blocks and I knock the table, is it more likely to have red or yellow blocks fall on the floor? Again, a question that isn't something we've had direct experience about but we can make probabilistic, graded judgments, as I'm showing in the data from one of our experiments here across different scenarios. And we can capture those by doing these simulations in something like this physics engine in your head. I can simulate a small bump of the table or a big bump of the table.
And that's a way to specify a generative model by simulation, where you query it by imagining. Even a very poor Monte Carlo, just a small number of simulations, is able to make rough estimates at the grain of intuitive physics that give you these judgments. But also, for example, we've used this kind of thing to model the intuitive physics of young babies.
In this case, looking at the experiments of Ernő Téglás and Luca Bonatti and others, you can have kids looking at little things bouncing around inside a lottery machine, vary the number of objects, when one comes out, where the relative positions are, and capture the graded looking time. Again, this goes back to one of the many ways to implicitly get at expectations like Sam was talking about-- the idea that kids will look longer when something is less of a good fit with their generative model. And that's captured by these probabilistic judgments.
So these are some of the things that we've done at the cognitive level to try to address these questions. And the fact that you can see probabilistic, object-based, compositional models giving good quantitative predictions of graded judgments in young babies, we and others have taken to suggest that-- it's not to say that you don't learn your intuitive physics. But some combination of the structure that's built in with early experience allows even young babies to go beyond their experience. They haven't seen things bouncing around in lottery machines necessarily like that, any more than you've seen some of the fanciful scenes that I described.
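As a rough illustration of the point above that a handful of noisy simulations can yield graded judgments, here is a minimal Python sketch. It is not the model used in the CBMM experiments: the one-dimensional stack, the stability rule in `tower_falls`, and the parameters `perceptual_noise` and `n_simulations` are all assumptions made up for this example.

```python
import random


def tower_falls(offsets, block_width=1.0):
    """Crude static stability check for a 1D stack of equal-mass blocks.

    offsets[i] is the horizontal center of block i, bottom block first.
    The stack is judged to fall if, at any level, the center of mass of
    everything above lies outside the block that supports it.
    """
    for i in range(len(offsets) - 1):
        above = offsets[i + 1:]
        center_of_mass = sum(above) / len(above)
        if abs(center_of_mass - offsets[i]) > block_width / 2:
            return True
    return False


def judged_fall_probability(offsets, perceptual_noise=0.1,
                            n_simulations=5, rng=None):
    """Estimate P(fall) from a few simulations of a noisily perceived scene."""
    rng = rng or random.Random()
    falls = 0
    for _ in range(n_simulations):
        # Each run starts from a slightly different guess about where
        # the blocks really are (hypothetical Gaussian perceptual noise).
        noisy = [x + rng.gauss(0.0, perceptual_noise) for x in offsets]
        falls += tower_falls(noisy)
    return falls / n_simulations


if __name__ == "__main__":
    rng = random.Random(0)
    stable_looking = [0.0, 0.05, 0.1]
    precarious = [0.0, 0.35, 0.7]
    print(judged_fall_probability(stable_looking, rng=rng))
    print(judged_fall_probability(precarious, rng=rng))
```

With only five simulations the estimate can take just six values (0, 0.2, ..., 1), which is coarse but already graded: clearly stable stacks come out near 0, clearly unstable ones near 1, and borderline ones in between.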
Now, how do these work in the brain? Well, this is the most interesting, outstanding challenge. And partly, my earlier examples were also designed to make us think of something which I know a lot of us have been thinking about, which is recent exciting developments basically trying to build generative models of language or images or joint models of this. For example, the DALL-E model from OpenAI, which uses transformers, the same thing that underlies the big neural language models that have revolutionized a lot of NLP.
But now you train the same models jointly on the kind of data I was showing you of text captions and images. And they can do these amazing things, like for example, something that's close to like the giraffe example I gave. Now, they haven't actually let other people play with the code yet, so I don't know if they could do the giraffe with the blue ribbon and the ice cream cone. But they can imagine an armchair in the shape of an avocado or a baby daikon radish in a tutu walking a dog.
So in some sense you might say, well, maybe this is an example of how some kind of neural model could learn from something like the data that people have, a powerful generative model of the sort that people seem to use in these tasks. It raises challenges, though. The model gets far more data than any person gets, and we don't know how transformers might map onto-- we call them neural networks because you can write them in PyTorch. It's not at all clear how they would map onto the brain, so those are interesting challenges to ask about.
And there are bigger challenges, like you give it some of these kinds of scenes, which are also the ones that you guys had no trouble visualizing, like a stack of red cubes with blue on green and green on red or another thing like very similar, but a stack of three cubes with red on green and green on blue, and you get samples like these, which are very impressive in the sense that they look like images of cubes and they're red, green, and blue. But they don't capture any of the compositional structure of these scenes that you can see. In fact, it's hard for most of us to tell which of these images goes with which query.
I barely remember myself. It's clear that it's not really capturing what is easy for us to imagine in these scenes, what the structure of language with its compositional structure gives rise to in our compositional imagination. There's far more than three cubes in most cases, and there's no clear sign of blue on green on red or red on green on blue in either of these.
Then the last thing I want to show is it's just interesting to actually go back to our Google Image queries. Can we see something in the source of the training data that would explain the successes and failures? Well, if you search for a chair in the shape of an avocado, this is what you get from Google today, this morning-- a lot of images of OpenAI, but actually a lot of avocado chairs.
These are real chairs that might provide the training data that would be the basis for this amazing success. The fact that this exists in the training data is not to take away from the accomplishments of the algorithm, but it's interesting that it is there. And related image queries like beanbag avocado chairs-- there's actually a lot of data towards this.
Whereas, remember this example, this looks more like what we have with the red on green on blue like this case here, which is-- well, there's ice cream. There's three. Maybe there's more. But it's not really capturing the compositional structure of these scenes. So again, these are just the questions that I want to leave us with.
One possibility is that something like these transformers, which are very powerful, could be a model of how to think about certain canonical neural computations. And maybe if the learning algorithm were improved, it might be able to learn something that generalizes the way people do from the data we get. Or maybe we need different models, and maybe we need models which can combine some of the compositional, symbolic structure and probabilistic capacity that you have in these probabilistic programs that enable us to do the kinds of intuitive physics or imagination that I was talking about in the beginning-- maybe something like that.
But then we have the challenge of figuring out how that could work in the brain. So anyway, that's the things I'd like us to think about. And I'll stop there.
PRESENTER 1: Thanks, Josh. OK, let's give a hand to all our panelists first before we go to discussing some of the very interesting points you all raised. And by the way, a quick question for you, Josh, is what happens when you ask DALL-E the question of imagining a chimpanzee eating three scoops of ice cream?
JOSH TENENBAUM: Well, so DALL-E-- you can't actually-- so all those things I showed came from OpenAI's blog post where they very generously gave you these little templates that you could play with. It's not actually available to the public or really even written up in a paper yet, so we don't know.
I'd love to find that. They gave you the ability to query various templates and play around with that. So I don't know. It's a good question.
I can tell you that related to DALL-E is something called CLIP, which is a set of joint language vision embeddings which are not a generative model. They're used to kind of re-rank the outputs of the transformer, and that is public in some form. And you can play around with it.
And it also shows like very striking successes and patterns of interesting failure in compositional scene generation. For folks here who know some of the work that Sam Gershman and I did back in 2015, where we looked at the ability of generative models-- or of compositional language models to describe simple scenes like a young woman in front of an old man or an old man behind a young woman-- there have been some advances in the last few years. But actually, the CLIP embeddings don't fundamentally advance over-- they make the same kinds of errors that big language models have made for the last three or four years that are easy to diagnose with some of the diagnostics that Sam came up with. So I think there's been huge advances, and I don't want to take away from the impressive results. And yet, when you apply some of the simple diagnostics that cognitive scientists have come up with to at least whatever publicly available code is there, you can also see the ways in which easy kinds of scene graph structures, basically, don't seem to be present in those generative models.
PRESENTER 1: Great, thanks. So I just wanted to ask a very high-level question that I think is motivated by the title of today's panel discussion. And it's really, when is a model not a generative model? I'm asking a discriminative question about generative models.
I'm trying to tell whether there is actually an assertion that there are some models of the brain that are not generative models. And so just to sort of build in the context: Sam asked the question, must sampling be built in? And no. He argued that sampling does not need to be built in.
In fact, according to the definition, a generative model is one that just learns the full joint distribution. One does not need to have a model that does sampling and actually generates data for it to be a generative model. And therefore, while variational autoencoders and GANs are obviously generative models in the very sampling-based sense, so is LDA, linear discriminant analysis, which also learns a joint distribution.
So the answer is no, sampling need not be built in for a model. You don't have to have a sampler to have a generative model. Next question, must it involve unsupervised learning?
And the answer is no. You can have generative models that can be trained by supervised learning or by unsupervised learning or by something in between. So it's not the case that generative models specify a specific form of learning.
And then it raises this other question for me, which is that, can every compressed latent variable learning model be interpreted as a generative model, especially if, as Talia argued, PCA does have an interpretation as a generative model? Really, should we be thinking about all of these models as generative models? And in which case, what's not a generative model? And so this is very high-level, and then I have a bunch of [INAUDIBLE] questions. But let me let you respond to the high-level questions before we get to this very long list of audience questions.
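To make concrete the sense in which LDA counts as generative, here is a small sketch, assuming NumPy and synthetic two-class data invented for the example. It fits the joint distribution p(y) p(x | y) with class-conditional Gaussians that share one covariance, uses Bayes' rule to get the discriminative quantity p(y | x), and shows that sampling is an optional extra rather than part of the definition. The function names `fit_joint`, `posterior`, and `sample` are hypothetical.

```python
import numpy as np


def fit_joint(X, y):
    """Fit p(y) and p(x | y) = N(mean_y, shared_cov), the LDA assumptions."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    centered = X - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / len(X)  # pooled (shared) covariance
    return classes, priors, means, cov


def posterior(x, priors, means, cov):
    """p(y | x) by Bayes' rule; no sampling is involved."""
    diffs = x - means                      # shape (n_classes, n_features)
    precision = np.linalg.inv(cov)
    # The Gaussian normalizer is the same for every class (shared
    # covariance), so it cancels and only the quadratic term is needed.
    log_lik = -0.5 * np.einsum("kd,de,ke->k", diffs, precision, diffs)
    log_joint = np.log(priors) + log_lik
    log_joint -= log_joint.max()           # numerical stability
    p = np.exp(log_joint)
    return p / p.sum()


def sample(n, priors, means, cov, rng):
    """Optional: draw (x, y) pairs from the fitted joint distribution."""
    labels = rng.choice(len(priors), size=n, p=priors)
    xs = np.array([rng.multivariate_normal(means[k], cov) for k in labels])
    return xs, labels


# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2.0, 0.0], size=(100, 2)),
               rng.normal(loc=[2.0, 0.0], size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

classes, priors, means, cov = fit_joint(X, y)
print(posterior(np.array([-2.0, 0.0]), priors, means, cov))
xs, labels = sample(5, priors, means, cov, rng)
```

The same fitted parameters serve both uses: `posterior` answers a classification question, while `sample` generates data. Removing `sample` would not change what the model has learned.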
SAM GERSHMAN: Can I just mention one thing in response to the last point? Basically, what isn't a generative model? And I wish I had a crisper answer to that.
But I mean, it does make you think that if so many things can be interpreted as generative models, does the whole concept of a generative model completely lose its usefulness? And I think the answer is no in the sense that the appeal of using a generative model is that you can make-- basically, a generative model encapsulates hypotheses about the world. It's not just about something inside the algorithmic machinery, but it's something out there in the world that you could then go around and verify.
So for example, in bygone days when people did natural scene statistics, they would go and measure the distribution of orientations in natural images and then say, the distribution looks like this. And then we can make predictions about things like oblique biases and orientation discrimination and things like that.
And that was a really powerful research strategy because we could actually verify these claims about generative models. And I think that this is potentially still an important point in the age of deep learning where we no longer hand-engineer these descriptors of the world and then go try to estimate the distribution of orientations. We just feed in natural images and see what happens.
But I think that a place where this comes up is in the context of data augmentation, something that Josh and I have been discussing recently. It's been observed empirically that you can improve the performance of supervised learning models by feeding in larger augmented data sets where you've generated them synthetically in some way-- sometimes through some kind of procedural generation, sometimes through various kinds of modifications of the training set. And actually, some of these ideas have a long history. They go back partly to some work that Tommy did in the 90s.
But there's a really big question there which is, is this essentially just an empirical enterprise? We're going to hypothesize that some data augmentation can be useful. We'll try it. If it's useful, good, we'll write a paper about it. And if not, we'll move on to the next augmentation.
Or is there some theory of data augmentation that says basically what are good data augmentations? And people have tried to do something like this in the language of group theory where you say basically the good augmentations are the ones that are invariant with respect to some group that defines the category. But that's a little bit, I think, vacuous unless you have some hypotheses about the nature of that group structure.
Otherwise, it's just a mathematical re-description of the problem that you're trying to solve. I think that's where thinking in terms of generative models is still relevant, because you can ask, can I justify particular augmentations over others by appealing to the idea that particular augmentations should arise from certain kinds of models? These are the kinds of data that should be generated by my model of the world, and these other things shouldn't.
JOSH TENENBAUM: I think there's a direct link between what Sam was talking about and what Talia was talking about and what I was trying to talk about. So just so everyone knows, data augmentation is this thing where you sometimes by making image transformations but often with simulators-- like it's often been used in robotics research to do so-called sim-to-real transfer where you train like an algorithm in a robotic simulation. Another project from OpenAI with the Rubik's cube was a good example. But other things where you train a robotic perception and planning system in a simulator and then you want to transfer it to the world.
And you do all these crazy things to the textures, the objects. You basically think of anything you can imagine that would keep the basic structure the same but change the visual appearance and train some kind of-- often it's a discriminative method to do the perceptual task you want, like object pose identification. So I think it's a version of the two sides of the coin that Talia was talking about.
But where do the data augmentation transforms come from? They come from your mental model, from your imagination. So it's a way, in a sense, of basically saying, let me imagine all the different possible worlds which actually are-- most of them are quite different from anything you'll ever see in the real world. But if you train a classifier that solves the discriminative problem on not the actual world but the space of possible worlds that come from your mental model, then it might be a way, an interesting practical way, of training effectively a proxy for a much more powerful generative model than you could get from the empirical data sources available.
TALIA KONKLE: That's so interesting. I was going to go in a completely opposite direction of where your augmentations could come from, which is motor system, your motor system. Eye movements actively sample. You turn an object, and you get the data.
You use-- you don't have to simulate the world. That's computationally heavy. The world is there. You just turn something, and you see the consequence.
So I always thought of the sampling-- and you just have to learn this world. And its invariances have possibilities for future other worlds. But this one's pretty good, and you have this rich data source and all you have to do-- so it's really more about links, like eye movement links, visual motor links, attentional operations that have a physical-- a bodily component that I think. I mean, that's where my mind goes for where those augmentations you'd want them to come from.
JOSH TENENBAUM: I think there's a link there too, Talia, because I mean-- as a lot of people at CBMM have seen from Nancy Kanwisher's lab, like a recent talk that Pramod gave, we've been studying using neuroimaging. Where are these intuitive physics computations in the brain? And it turns out they're in like the parietal and premotor areas that might be used in basically a lot of what you were talking about in your talk and now what you're talking about for motor planning.
So I think the idea that there could be a powerful generative model for simulating the world, including our actions on it and the physics that bring our bodies into contact with objects in the world-- that these could all be linked up both in linking up planning and perception. Again it's-- well, I think you can get a lot from embodied action.
But especially if you're talking about training a perception system, you're-- moving the cube around doesn't make it pink and polka-dotted and make the background deeper, striped, and all the other crazy things. But the mental models that are needed to do the kind of action effectively that you're talking about could be. So there could be a link evolutionarily if not necessarily developmentally.
PRESENTER 1: Let me, let me-- Talia, do you want to respond to that or--
TALIA KONKLE: Quick texture point-- it's very quick.
PRESENTER 1: Go for it.
TALIA KONKLE: --which is actually deep learning systems are operating over RGB images. But of course, the eye has separate channels for the luminance and color, and actually, the input that we're really working over is something more like edges, where color gets painted in at way later stages. And so actually, I wonder if some of these artificial textures-- I mean, those are artifacts more of the way deep learning systems are getting the visual input and that's less of [INAUDIBLE]
JOSH TENENBAUM: That's a good point.
PRESENTER 1: OK, so there's some fantastic questions in the Q&A, and I want to read a couple of them. So one of them sounds pretty quick. Does a generative model in the sense that Josh talked about necessarily need to be probabilistic in the explicit sense that Sam talked about? Or does it make sense to talk about non-probabilistic generative models? So in other words-- yes. Do generative models need to be probabilistic?
SAM GERSHMAN: But are we talking now about implicit or explicit generative models? Because I think explicit models have to be probabilistic.
PRESENTER 1: The question asked about the explicit sense.
SAM GERSHMAN: I mean, I don't understand-- how do you define a non-probabilistic generative model?
JOSH TENENBAUM: Well, the kind of models I was talking-- I mean, one way to do it is with a simulator where--
SAM GERSHMAN: Oh, I see. Right. But--
PRESENTER 1: It's a prediction [INAUDIBLE]. You can have a prediction.
SAM GERSHMAN: So I think this is a semantic question about what we mean when we refer to a generative model. Is it just the simulator or is it the joint distribution that we can actually access the-- we can query probabilities, basically? I mean, if you have a simulator, you can always approximate them by Monte Carlo.
JOSH TENENBAUM: Yeah, I mean, what's interesting if you have a causal model is it's-- you could say you can approximate. You can do Monte Carlo simulations on it to get a distribution, or you can say it's actually a family of distributions.
So it's not just one of them because it might be modularly composed with other knowledge or conditioners that you might have. I think the modularity of being able to have your generative model not be monolithic or not even just be like a black box latent variable model for distribution but to have components with causal structure is partly what allows our generative models to go beyond the data that we directly experience to imagine lots of possible worlds that we haven't seen.
SAM GERSHMAN: I think that part of the--
JOSH TENENBAUM: --judgments in them. Again, Tomer has nice results on this. People can make sensible conditional probability judgments about magic, things that have never been observed by anybody, and they--
PRESENTER 1: If you don't mind, let me move on to the next question. We have very little time. There's another fantastic question. I feel like you got some responses in on that one, which were great. And by the way, that question was from Richard Lange.
This next question is from Max Siegel. This is the question. One of the major advantages of generative models from a cognitive perspective is that they answer arbitrary questions about a concept. For example, how likely is it that any unfamiliar plant is going to build a house?
To answer, sample a bunch of plants and count how many have that property versus the total number. This requires that our plant distribution models reasonably well the relative frequency of plant species, which discriminative models don't. Do any of the generative-like models have this capacity?
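The "sample and count" recipe in the question can be sketched in a few lines of Python. Everything here is a hypothetical toy: the species names, their frequencies, and the per-species probability of having the queried property are made-up numbers, chosen only so the Monte Carlo estimate can be compared with the exact answer.

```python
import random

# Hypothetical toy generative model of "plants". All numbers are invented.
SPECIES_FREQUENCY = {"grass": 0.6, "fern": 0.3, "nightshade": 0.1}
P_PROPERTY_GIVEN_SPECIES = {"grass": 0.0, "fern": 0.1, "nightshade": 0.9}


def sample_plant(rng):
    """Draw one plant: first its species, then whether it has the property."""
    species = rng.choices(list(SPECIES_FREQUENCY),
                          weights=list(SPECIES_FREQUENCY.values()))[0]
    has_property = rng.random() < P_PROPERTY_GIVEN_SPECIES[species]
    return species, has_property


def estimate_property_probability(n_samples, rng):
    """Sample a bunch of plants and count how many have the property."""
    hits = sum(sample_plant(rng)[1] for _ in range(n_samples))
    return hits / n_samples


def exact_property_probability():
    """Exact marginal, available here only because the toy model is tiny."""
    return sum(SPECIES_FREQUENCY[s] * P_PROPERTY_GIVEN_SPECIES[s]
               for s in SPECIES_FREQUENCY)


if __name__ == "__main__":
    rng = random.Random(0)
    print(estimate_property_probability(20_000, rng))
    print(exact_property_probability())
```

The estimate is only as good as `SPECIES_FREQUENCY`: if the model's species frequencies are off, the count is off too, which is the requirement the question highlights.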
SAM GERSHMAN: It depends so strongly on which kind of models you're talking about. If we're totally unconstrained in which generative or descriptive models we choose, I think the answer is yes. There exists something that can do that.
But I think this question and the discussion we were just having, it points to the fact that even if you accept that generative models are in some form in the brain somehow-- maybe the more interesting question is, what kind of generative model would it have to be, for example, to answer all the questions that Josh posed? Would something like the kind of model that Talia showed, where it's this contrastive learning model-- can it learn to answer questions like that?
PRESENTER 1: That's right. That's right. And so there is-- anyone else? Talia, would you like to say something?
TALIA KONKLE: I'm going to say I don't think the model can do it yet in part because it's pretty dumb, and it operates over images. And it doesn't have an attentional focus and doesn't know what's context and what's salient. But transformers are an interesting model because they start to bring in those self-attention mechanisms to pull out a thing and shift its representation in the context of the things around it.
And so I think the current-- so sometimes, I take the stance where these initial-- it's a visual model, and the initial system is just trying to get you into the right regime where you have reasonable primitives to attempt to learn these kinds of things over. And then attempting to learn these kinds of things all the way down over like the V1 or LGN type representations is just really hard.
But you can get pretty far by doing these local contrastive things, normalize, keep trying to separate the data and represent all things uniquely, and that that becomes a useful basis over which other kinds of operations can operate. So I actually often think these questions you're asking are not in the purview of the visual system, per se. They might hook into bits that then you can like use the recursive connections to then make as vivid in a mental image as you need to answer the question to give the output. But I think that these operations that you're talking about aren't particularly visual in terms of-- I would put them somewhere else, not in the visual system.
PRESENTER 1: OK, so I have a question here from Tommy Poggio. The necessity of constraints to solve ill-posed problems in perception has been recognized for some time. Would Sam define the algorithms that solve these problems by leveraging constraints as generative?
So to unpack that, I guess the point is that if you want to build models of the world, you have to add some priors because the problems are underconstrained. And can we think of our generative models, I guess, as a special form of imposing a prior? I guess. That's my interpretation of the question. But maybe it was [INAUDIBLE].
SAM GERSHMAN: Yeah. I mean, I think so. The follow-up that Tommy gave, about where forms of regularization are equivalent to data augmentation-- I was actually explicitly referencing that work. I think that was a nice example of where there is a clear relationship between a particular kind of inductive bias and data augmentation. You don't necessarily have to interpret that inductive bias or regularizer as a generative model, right?
But I think a lot of these regularizers can be reinterpreted as basically arising from optimizing some generative model. But it again raises the question, is it then useful to invoke the generative modeling concept? If you can write down a regularizer, a cost function that when you optimize it does the thing that you wanted it to do, is it useful still to talk about generative models?
PRESENTER 1: That's right. I thought that we think about generative models as a very specific kind of regularizer, where the regularizer specifically also models x, the data, right?
SAM GERSHMAN: But I mean, you can always-- every regularizer that people have written down-- not every regularizer maybe, but many-- have a corresponding generative model. I mean, there's a recipe that you can follow to write down regularizers as particular generative models. But I think when it comes to data augmentation-- I think actually the example that Tommy gives-- I don't know if Tommy wants to jump in here.
But I think it's illuminating because in order to use that strategy, you need to know that-- in that, if I'm thinking of the right paper, this Partha Niyogi paper, they had a 3D model of an object which they could rotate around and say, I can generate an infinite number of images of this rotated object because I have access to a 3D model. I can do imaginative operations on it, just like what Josh is talking about.
So it presupposes that you already have access to a certain kind of generative model that defines what a good augmentation is. Would you agree with that, Tommy? I don't know if Tommy can--
TOMMY: Yeah. I think you don't need to know the regularizer. It's just that instead of data augmentation you can use the term virtual examples-- so you can create virtual examples if you know that, for instance, the identity of the object in the image does not change if you scale the image up, you rotate it, you translate it, and so on. So these are legal transformations.
And by providing this data augmentation with these legal transformations, it's like a regularizer. You don't need to know the regularizer. It's like a regularizer, and you get a system that will be invariant to these transformations. So that's a powerful idea to keep in mind.
And in some simple cases, you can formally prove mathematically that's exactly what happens. Now, I'm sure that there are more general data augmentations that do not correspond to simple regularizers. But I think the basic idea is correct.
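One simple case where the correspondence between augmentation and regularization can be checked exactly is linear least squares with small symmetric input shifts. This is a textbook-style example chosen for illustration, not necessarily the case Tommy has in mind, and it concerns shift perturbations rather than rotations or scalings. In the sketch below, each training point is copied 2d times, shifted by +s and -s along each of the d input axes; fitting ordinary least squares to the augmented set gives the same weights as ridge regression on the original set with penalty n·s²/d. The data and the function names are hypothetical.

```python
import numpy as np


def augment_with_axis_shifts(X, y, s):
    """Create virtual examples: each x shifted by +/- s along every axis."""
    n, d = X.shape
    shifts = np.vstack([s * np.eye(d), -s * np.eye(d)])       # (2d, d)
    X_aug = (X[:, None, :] + shifts[None, :, :]).reshape(-1, d)
    y_aug = np.repeat(y, 2 * d)   # labels are unchanged by the shifts
    return X_aug, y_aug


def least_squares(X, y):
    """Ordinary least squares, no explicit regularizer."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def ridge(X, y, lam):
    """Ridge regression: minimize ||y - Xw||^2 + lam * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


# Synthetic regression data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

s = 0.7
X_aug, y_aug = augment_with_axis_shifts(X, y, s)
w_augmented = least_squares(X_aug, y_aug)
w_ridge = ridge(X, y, lam=X.shape[0] * s**2 / X.shape[1])

print(w_augmented)
print(w_ridge)
```

The equality holds because the shifts cancel in the cross terms and contribute 2n·s² times the identity to the augmented Gram matrix, which is exactly the ridge term after dividing through by 2d. The learner never sees the regularizer; it only sees virtual examples.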
PRESENTER 1: There's a question from Ruth Rosenholtz. Do we need to distinguish between generative tasks, for which (to refer to an earlier speaker, Sam) any solution you have would certainly look generative, and other tasks that aren't so generative in nature? In other words, do we really need to distinguish between generative tasks and other tasks?
Ruth isn't here. She had to leave. So if that question is not clear, we can continue.
JOSH TENENBAUM: That's partly what I was trying to get at with these causal examples or where you imagine what happens if you do this, or what would be effects of your plans, or could you do this, or, I think, what Talia is getting at also in these embodied things if we want to be able to act on the world. And one of the things we want our generative models to do is to be ways to evaluate possible plans we could make. Then I think that's an example of, in some sense, a certain kind of more generative task than just perceptual discrimination, object recognition.
I think a lot of these places where it seems like there are two sides of the same coin come up in the context, maybe not surprisingly, of what the task is-- basically, a classification task. So it makes sense that good representations for classification might look a lot like generative models or they might be more interchangeable. Again, maybe this also goes back to one of the other questions.
It's true that when I think about generative models, I think particularly about causally structured generative models that aren't just probabilistic models but capture something about the underlying structure of objects and their interactions and our interactions with them. I think if you want the claim about generative models in the brain to have a little bit more teeth, you say, it's not just anything.
It's those kind of generative models which are particularly useful for being the glue between perception, action, and even language. And so I think we should be thinking about that particular kind of generative model and how it might work in neurons. And then, that might mean we have to focus on tasks which engage that as opposed to ones which could be just used by any much more generic kinds of representations.
SAM GERSHMAN: I think-- can I just add that the more toothy argument there is interesting from a neuroscience perspective because when we write down causal models what we're in effect doing is factorizing the joint distribution in a particular way. And then from a neural perspective, you might ask, well, do those factors correspond to distinct parts of the brain or processes in the brain that can act semi-independently-- or not independently, but rather they interact in a specific way that's specified by the causal model. So neuroscientists have for a long time been interested in questions about modularity and build wiring diagrams and box-and-arrow diagrams. But maybe the kinds of box-and-arrow diagrams they should be making should look more like the causal models of the sort Josh is talking about. Or maybe that's the notion of modularity that we should be looking for in the brain.
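For reference, the factorization mentioned above can be written out. For a causal model given as a directed acyclic graph over variables $x_1, \dots, x_n$, the joint distribution factorizes into one term per variable conditioned on its parents. The second line is a hypothetical scene example (object properties $o$, applied forces $f$, resulting physical state $s$, image $v$) invented here to show what the separate factors might look like.

```latex
\begin{align}
  p(x_1, \dots, x_n) &= \prod_{i=1}^{n} p\bigl(x_i \mid \mathrm{pa}(x_i)\bigr) \\
  p(o, f, s, v)      &= p(o)\, p(f)\, p(s \mid o, f)\, p(v \mid s)
\end{align}
```

Each factor is a candidate "box" in the box-and-arrow sense: it can be learned, swapped, or intervened on without rewriting the others, and the arrows are the conditioning relationships.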
AUDIENCE: Josh, can I ask in the strong generative models, the sort where you say as you did to us, imagine that or something like that that's your prompt-- I found that for some of them, I had distinctly unphysical models that I was building, like the Jack and the Beanstalk. I mean, there's nothing particularly physically plausible about what I was constructing. I have some experience in cartoon form or fairy tales form or something like that with structures like that that I was probably drawing on. Or at least introspectively, that's what it seemed. But I don't know how well that fits into these ideas about intuitive physics or plausible worlds or something like that.
JOSH TENENBAUM: Yeah, I mean, I think those are great questions. And partly, I wanted those examples to raise some of those questions. I think that by studying this more systematically, that could be a way to get a better handle on what actually is the more specific representational content of our generative models.
Let's say our intuitive physics because intuitive physics is different from real physics. In the work that we've done, we've emphasized some of the ways that it's different but only in very high-level terms. We said, well, maybe like game physics engines, our intuitive physics make certain kinds of approximations which help to simulate real physics in ways that look good and work on short timescales but aren't actually correct.
But in some ways it emphasizes more the parts, at least in the experiments we do, that are more correct than incorrect. But if we push more towards questions like the ones I was asking where maybe your things are very unphysical or, as Sam was pointing out and as Tomer Ullman has looked at and been looking at also, things where like you imagine some things and other things you don't even imagine because you don't need to imagine them until the question or the task leads you to imagine them. So these sort of partially specified, incrementally built models, which is a different way of thinking about simulation and conditional inference that again, I think would give us-- if we study those more systematically at the cognitive level, it'll give us more of a handle on what are the really interesting distinctive representations in human intuitive physics.
SAM GERSHMAN: Or simulators or lazy evaluators.
JOSH TENENBAUM: Yeah, exactly. I mean, again, Tomer-- I don't know if Tomer is here, but Tomer and some of his students have been looking at those questions in really interesting ways.
PRESENTER 1: I think on that note, we should probably conclude the panel. Thanks, everyone, for participating. I had a blast.
I hope the audience did too, and see you again at a future discussion, maybe, of generative or other models.
SAM GERSHMAN: Thank you, Ila.
AUDIENCE: Thanks so much, everybody.
JOSH TENENBAUM: Thanks, Ila. Thanks for keeping us all on track and raising great questions.