Bayes in the age of intelligent machines
Date Posted:
March 19, 2024
Date Recorded:
March 12, 2024
Speaker(s):
Tom Griffiths, Princeton University
Brains, Minds and Machines Seminar Series
Description:
Abstract: Recent rapid progress in the creation of artificial intelligence (AI) systems has been driven in large part by innovations in architectures and algorithms for developing large scale artificial neural networks. As a consequence, it’s natural to ask what role abstract principles of intelligence — such as Bayes’ rule — might play in developing intelligent machines. In this talk, I will argue that there is a new way in which Bayes can be used in the context of AI, more akin to how it is used in cognitive science: providing an abstract description of how agents should solve certain problems and hence a tool for understanding their behavior. This new role is motivated in large part by the fact that we have succeeded in creating intelligent systems that we do not fully understand, making the problem for the machine learning researcher more closely parallel that of the cognitive scientist. I will talk about how this perspective can help us think about making machines with better informed priors about the world and give us insight into their behavior by directly creating cognitive models of neural networks.
Bio: I am interested in developing mathematical models of higher level cognition, and understanding the formal principles that underlie our ability to solve the computational problems we face in everyday life. My current focus is on inductive problems, such as probabilistic reasoning, learning causal relationships, acquiring and using language, and inferring the structure of categories. I try to analyze these aspects of human cognition by comparing human behavior to optimal or "rational" solutions to the underlying computational problems. For inductive problems, this usually means exploring how ideas from artificial intelligence, machine learning, and statistics (particularly Bayesian statistics) connect to human cognition. These interests sometimes lead me into other areas of research such as nonparametric Bayesian statistics and formal models of cultural evolution.
I am the Director of the Computational Cognitive Science Lab at Princeton University. Here is a reasonably up-to-date curriculum vitae.
My friend Brian Christian and I recently wrote a book together about the parallels between the everyday problems that arise in human lives and the problems faced by computers. Algorithms to Live By outlines practical solutions to those problems as well as a different way to think about rational decision-making.
I am interested in how novel approaches to data collection and analysis - particularly "big data" - can change psychological research. Read my manifesto and check out the Center for Data on the Mind.
VIKASH MANSINGHKA: Welcome, everybody. It's my pleasure to introduce Tom Griffiths. He's a professor of psychology and computer science at Princeton. And he'll be talking to us about Bayes in the age of intelligent machines.
Just want to say, personally, it's just a real pleasure to have Tom here. I owe him a tremendous debt of gratitude. He taught me how to build my first probabilistic models and run my first experiments. And over many years and really many, many areas of cognition, his work has been an inspiration for me and, I think, maybe for many of us here in bringing together philosophical acuity, mathematical rigor, and empirical care all at the same time. So really looking forward to your talk, Tom.
[APPLAUSE]
TOM GRIFFITHS: All right, thank you, Vikash. That's a lot to live up to. [LAUGHS] I was amused to find a picture of myself outside Josh's lab on the fourth floor. So if people are curious, you can go and figure out which of the Simpsons characters that appear there is me.
So yeah, it's wonderful to be back here. And today I'm going to be talking about some ideas that are new ideas and old ideas meeting together. And I think some of the old ideas go all the way back to the 18th century, when the Reverend Thomas Bayes made the suggestion that maybe we could think about representing our degrees of belief in terms of probabilities. And one consequence of that is that you get probability theory as a tool for trying to understand how it is that you should make sense of the world.
So if you're an agent who's going around trying to evaluate different hypotheses that you have about how the world works, Bayes' rule tells you how to do that. You should start out with what's called a prior distribution that encodes your degrees of belief in those hypotheses before you see data. You want to end up with what's called the posterior distribution that tells you how you should assign degrees of belief after you see data. And Bayes' rule says, to get there, you just apply a principle of probability theory. You multiply these prior probabilities by a likelihood that tells you how likely it is that you'll see those data if that hypothesis were true and then just normalize by summing over the space of all of those hypotheses.
And this is a nice, simple way of thinking about solving inductive problems. It's one that has a number of virtues. One of these is that if you look at this equation on the right-hand side, only one of the terms here involves the data. So that's this likelihood term. The other term, the prior probability, tells you about all of the things that the agent is bringing to the problem that they're trying to solve. It tells you about the biases and dispositions that that agent has.
And that's a really useful thing to do when you're trying to do something like understand human cognition. And so a lot of the work that I've done in cognitive science is really about trying to understand what it is that people bring to the problems that they solve. What are the inductive biases that inform human learning that make it possible for us to learn from small amounts of data, and so on?
And so this approach of building Bayesian models, as I said, has a bunch of virtues. You end up with models that are easy to understand that make clear what the inductive biases of the learners are, what it is that they're bringing to inductive problems other than the data that's influencing the conclusions that they reach.
And they also have some challenges that are involved in them. So typically, if you're going to make this kind of model, it needs to be something which is tailored to the specific problem that you're thinking about. You have to think hard about what the hypotheses are and what prior distributions make sense.
And scaling can be a challenge because if you're doing probabilistic inference, if you're applying Bayes' rule in the way that I showed you, that's something which involves a lot of computation. You have to think about all of your hypotheses. You have to do those calculations for all of your hypotheses. And so that can limit the scale at which we can use those models.
So despite these things, these kinds of Bayesian models are things that have been used pretty widely in statistics and continue to be used in parts of machine learning, where people are really concerned with things like quantifying uncertainty or being clear about the assumptions that go into those models. It doesn't really need to be said that a lot of the rest of machine learning is focused on a very different kind of approach, deep neural networks, which you've probably heard a lot about.
But deep neural networks have a very complementary set of strengths and weaknesses. In fact, we can take each of these things and then find the converse of it. So deep neural networks are notoriously difficult to understand and have opaque inductive biases in the sense that when we make a particular model, it can be quite hard for us to make predictions about what kinds of things that model is going to find easy to learn or hard to learn.
But nonetheless, the same kinds of architectures can succeed in solving a surprisingly wide range of problems. You don't need to necessarily make specific models that are tailored to specific problems in the same kind of way. And these models can process large amounts of data, so scaling, in this case, is less of a challenge and more of a virtue.
And that capacity for scaling is something that's driven a lot of innovation in AI over the last decade and a bit. This is just an example from a blog post from OpenAI. This only goes up to 2019, so all of the crazy things that have happened in the last few years don't appear here yet.
But the key observation is that as we look at time here on the horizontal axis, these are all various sorts of breakthroughs in building these AI systems. The amount of compute that's gone into those systems, which corresponds also to the amount of data that they're using, has been increasing. And it's worth pointing out that this is on a logarithmic scale, so it's increasing exponentially and, I think, doubling about every three months.
So that capacity for scaling which deep neural networks have, that you have this architecture which is general purpose enough it can be applied to lots of different problems and something that you can translate into models that can be applied to larger and larger data sets, is something that's then brought us to this present moment, where we have things like chat bots appearing on the cover of TIME magazine, Noam Chomsky writing critiques about chat bots, chat bots writing critiques of Noam Chomsky's critique--
[LAUGHTER]
--and people writing papers with titles like "Sparks of Artificial General Intelligence." And so this talk comes out of talking to a lot of stressed out graduate students in early 2023, who were really feeling like they were interested in this kind of thing, expressive interpretable models. But they were worried that there wasn't really much point in doing those things because this was the kind of thing that seemed to be solving all of the problems that they might be interested in.
And I don't know if people in this room identify with that sort of sentiment. But it's something where it made me think a lot about whether there's room, in a world where we have intelligent machines that are built out of deep neural networks, for thinking about these kinds of Bayesian models and whether this is a useful perspective in the world that we currently live in.
And so as a consequence of thinking about that, I realized one thing, which is that many of the properties that I listed on the right-hand side of the slide here also apply to human beings. So if we look at humans, humans are kind of difficult to understand. We have opaque inductive biases, again, notoriously. We've really spent a lot of time-- people in this building have spent a lot of time-- trying to figure out what the inductive biases of humans are like.
But nonetheless, humans are able to succeed on a surprisingly wide range of problems with one piece of architecture and are capable of processing lots of data. We do that through our lifespan. We take in a lot of data, produce reasonable behaviors.
And so that recognition that, in fact, many of the characteristics that are true of deep neural networks are also true of humans suggested to me that, in fact, yes, we do live in a world where there's room for making these kinds of Bayesian models because we live in a world where it's actually valuable for us to make probabilistic Bayesian models of human cognition. And so that's what I've done through my career is making probabilistic models of human cognition because those models allow us to explore these kinds of questions about what the inductive biases of humans are.
And that connection then gives us a way of thinking about how we can actually try and make sense of some of the big AI systems that we're currently building but that we don't really understand-- that we've succeeded in producing systems that display a kind of intelligence that we'd really like to know more about, but that are increasingly opaque to us as people who are scientists who are thinking about intelligence who are trying to make sense of these things.
And so one of the key things that motivates this in the human case is the idea of levels of analysis-- again, an idea that's going to be very familiar to people in this room, one that dates back to the work of Marr and Poggio here at MIT and was encapsulated in Marr's book Vision. This is actually a photograph-- I don't know if people recognize who's in this photograph. So we have Marr and Poggio here, and this is Francis Crick in the background.
But this idea of levels of analysis is that if we want to make sense of an information processing system, we can do so at different levels of analysis. The kinds of questions that we ask about that information processing system can engage with different levels at which we might try and understand that system.
And so this classic distinction was in terms of three different levels of analysis-- the computational level, where we say, what's the goal of the computation? Why is it appropriate? What's the logic of the strategy by which it can be carried out? The level of representation and algorithm, where we say, what's the representation for the input and output and the algorithm for the transformation? And then the level of implementation, where we say, how can that representation and algorithm be realized physically?
And when we're doing cognitive science, these levels map onto the kinds of questions we ask. The sort of abstract question at the computational level is the kind of question that a Bayesian model engages with. What is the problem that we're trying to solve? What's the optimal solution to that problem? Something like Bayesian inference gives us a way of understanding that in terms of what it is that a rational agent should do when they're trying to solve a problem of inductive inference.
This level of representation and algorithm is more like the level of cognitive psychology, where we're engaging with trying to figure out what the processes are that might support those abstract solutions to those problems. And the level of implementation is something which is more like neuroscience. But this same idea of thinking about different levels of analysis is something that we can also apply when we're thinking about AI systems.
And the key idea that I want you to take away today is that even though we might build an AI system in a particular way, that doesn't mean that that's the only way that we can understand it. And in fact, understanding what that system is doing might require us to ask questions that are at a different level of analysis. And that allows us to use a different kind of modeling approach.
So I wrote that down here. So the idea is that different models can coexist at different levels of analysis answering different kinds of questions. And so in this age of intelligent machines that are, say, built out of deep neural networks, Bayesian methods have an important and novel role to play, which is that they actually give us tools that tell us what it is that machines should do. And they give us tools for understanding why those machines do the things that they do.
And this role is very much the role that they played in trying to understand human cognition. But it's novel in the context of thinking about engineered systems, where, I think, if you're an engineer, and you're going to go and build a system, the way that you normally think about it is, I have a problem to solve. I want to make a system that's going to do x.
Then I'm going to choose what method I'm going to use to make that system. Shall I choose a Bayesian method, or shall I choose a neural network? And then you say, OK, I'm going to make it with a neural network. You make your neural network model. You now have a system that solves that problem. And then you're done.
But the key observation here is that, in fact, you could actually think about this in terms of different levels of analysis. You've made your neural network. And that's something like that representation and algorithm level. You've made a system which is implementing a solution to that problem. But if you want to understand what it is that that system is doing, then you can ask that question at that more abstract computational level for which it's appropriate to make a Bayesian model as a tool for understanding what that system is doing and why it's behaving in the way that it's behaving.
So the novel and slightly perhaps unfamiliar idea here is that, in fact, making these parallel versions of models in a particular domain might be something which is useful to us. We shouldn't just think about it as we're going to build a system one way, and we're done. We can think about it as we build a system, but then we can build a model of that system as a tool for understanding what it's doing.
And so the critical observation here is that these Bayesian methods can be useful for understanding what the machines do, even if those underlying representations and algorithms don't look like Bayesian inference. So you've made your neural network. Now you're going to try and make sense of it. You know that really what's going on is that there's a neural network which is being implemented in your machine, and that's what's doing all of this work for you. But that doesn't mean that you can't get insight into it by making that more abstract model.
And so what I'm going to do today is give you a couple of examples of how adopting that perspective can help us both understand the things that machines do, but also give us new tools for making machines do things that might be a little more like the kinds of things that people do. And so the talk is really going to proceed in two stages here corresponding to these two arrows, one where we think about how we can use Bayes as a tool for understanding neural network models, and then the other where we think about using Bayes as a tool for making neural network models do some of the things that we might want them to do.
So this first part is really focused on asking what it is that's the benefit that we get from making a Bayesian model of something that might be a large, opaque neural network. And the large, opaque neural network I'm going to choose is this one, so GPT-4, inspiration for the "Sparks of AGI" paper.
And what I'm going to talk about is maybe a different perspective on the way that we can make sense of the system, where, I think, while the focus of that paper was on some of the ways in which this kind of system was able to solve problems unexpectedly, we wrote a paper whose focus was on some of the ways in which the model fails to solve problems, in ways that we might have anticipated.
And so our paper is called "Embers of Autoregression" to contrast with "Sparks of Artificial General Intelligence," where really the argument was that, in fact, we can understand some of the weird things that these systems do and the ways in which they might not be things that we want our artificial general intelligence to do by thinking about the problems that they solve. And so this is joint work with Tom McCoy, Shunyu Yao, Dan Friedman, and Matt Hardy.
So here are four examples of weird things that GPT-4 does. And in our paper, we have more like 14 of these examples. I'm just going to show you four of them.
So this is the first one. And I should say, since we put out our paper, they fixed some of these things. [LAUGHS] And so the focus here is these are all deterministic problems, so problems where there should be a single right answer to these questions. And they're all examples of things that you could do using a simple algorithm. And so the way that they fixed it now is that for some of these things, it will say, oh, here's a simple algorithm that solves your problem. But we were focused here on, what are the responses that you get from just using this as a large language model?
So the first of these is just counting things. And so what we do is we give the model a string of letters, and we say, how many letters are in the string? And it's asked to produce an output. And the slightly surprising thing is that it is more likely to be correct if you give it 30 letters than if you give it 29. So you can think about what's going on there.
Here's another one. This is just a simple linguistic manipulation. We say, swap each article, a, an, or the, with the word before it. And it's more likely to be correct for some sentences, like this one, than for other sentences, like this one. You can think about what's going on there.
This is one of my favorites. This paper came out of a bunch of us getting together and thinking about weird things that these models might do. This is one that I came up with while in an airport waiting for one flight to go to another flight, which is getting it to solve shift ciphers.
So the way that a shift cipher works is that you replace each letter with a letter that is shifted some number of positions forward in the alphabet. These are fun ciphers that you learn about when you learn about codes as a kid. And so here we say, decode the sequence by shifting each letter 13 positions backward in the alphabet. If you ask it to shift 13 positions backward in the alphabet, it does a pretty good job. But if you give it the problem of decoding the sequence by shifting each letter 12 positions backward in the alphabet, it fails terribly.
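For concreteness, here is a minimal sketch of the shift-cipher operation itself (not the prompt format used in the paper):

```python
# Minimal shift ("rotation") cipher: decode by shifting each letter
# backward in the alphabet by a fixed amount. ROT13 is shift=13.
def shift_decode(text: str, shift: int) -> str:
    decoded = []
    for ch in text:
        if ch.isalpha():
            base = ord('A') if ch.isupper() else ord('a')
            decoded.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            decoded.append(ch)  # leave spaces and punctuation untouched
    return "".join(decoded)

# ROT13 example: encoding and decoding use the same shift of 13.
print(shift_decode("Uryyb, jbeyq!", 13))  # -> "Hello, world!"
print(shift_decode("Tqxxa, iadxp!", 12))  # the shift-12 version of the same message
```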
And here's the last one. So if you ask it to take a number, multiply it by 9/5, and add 32, it does a good job. If you ask it to take a number, multiply it by 7/5, and add 31, then it doesn't do very well.
So these are four weird things these models do. You can think about why they might be weird. My argument is that it does these things because of what it's been built to do. And we can actually understand how these models work by thinking about them from this computational perspective in terms of the problem that they're trying to solve.
So when we make computational-level models of human cognition, one of the hardest things we have to do is to try and specify, what's the problem that human beings are trying to solve? What's the abstract problem that humans are built to be able to solve?
For large language models, we don't have that same issue. It's much easier for us to answer the question of what it is that they're trying to do. So what they're trying to do is predict the next word in a sequence based on information they're provided in internet text.
But the way that you should be solving this particular problem of figuring out the answer to a question is by performing probabilistic inference-- and, in particular, probabilistic inference where, because these problems are deterministic, there is some subset of hypotheses, the answers that you could give, which are correct answers. Those correct answers should receive some non-zero probability. But the probability that you assign to all other answers should be zero.
And so there's some sense in which here the prior distribution over hypotheses shouldn't matter at all. So if you're trying to solve these deterministic problems, then the answers that you produce should be entirely driven by the data which is being provided to you. And so that's what makes a problem deterministic. There's no stochasticity that's involved in it.
But if you have a model where you don't quite have the likelihood learned in the appropriate way, where you're not recognizing the deterministic structure of that problem and you're just treating it as another inductive problem, another problem of probabilistic inference, then you should be applying Bayes' rule. And you should be taking the prior into account.
So if you have a likelihood function which is non-zero for some invalid hypotheses, then the prior probability can be high enough that it favors some of those invalid hypotheses over the valid answers. And so essentially what's going on in these models is that when you give them deterministic problems, they are not appropriately constraining the likelihood and, as a consequence, getting leakage of the prior. And that explains all of the phenomena that I just showed you. So I'm just going to go through those one by one and show how this works.
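As a toy numerical illustration of this "leaky likelihood" account (the numbers here are invented for illustration, not estimated from GPT-4):

```python
# Toy illustration: when the likelihood is not exactly zero on invalid
# answers, a sufficiently strong prior can flip the posterior ranking.
def posterior(prior, likelihood):
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

prior = {"29": 0.2, "30": 0.8}   # "30" is more frequent in internet text
exact = {"29": 1.0, "30": 0.0}   # deterministic task: only "29" is a valid answer
leaky = {"29": 0.6, "30": 0.4}   # imperfectly learned likelihood

print(posterior(prior, exact))   # all posterior mass on the correct answer "29"
print(posterior(prior, leaky))   # the prior leaks in: "30" now wins (about 0.73)
```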
So for this first problem that I mentioned, where you just ask the system to count how many letters appear in a sequence, the reason why it's right more often if the answer is 30 than if the answer is 29 is that in internet text, the number 30 appears more often than the number 29. So what's going on is that the prior distribution is leaking into the answers that it produces.
And we can actually show this. This is plotting the accuracy of the model in just performing this counting task versus the negative log frequency of those sequences. And these two different lines correspond to-- this is GPT-3.5 and GPT-4.
And so the accuracy of the model is decreasing as the frequency of the answer is decreasing, which is exactly what you should do if you're trying to solve an inductive problem. You're allowing your prior distribution over possible answers to have an effect. But because this is a deterministic problem, this shouldn't be happening. And so what's going on is your likelihood is leaking a little bit, and so you're getting an influence of the prior.
Same thing for this case. So we can look at, what is the probability of the sequences that are actually being produced as output here? And we see the same relationship: the accuracy of the model is increasing as a function of output log probability. So when the output is a probable sentence, then you're going to get it right. If the output is an improbable sentence, you're going to get it wrong.
And so, again, it's worth thinking about this. If you're going to be using these systems to try and solve problems, it's worth recognizing that the answers that it's going to produce are going to be influenced by the probability with which those answers occur within the data set on which it's been trained. And this is, again, not necessarily something that you want your robust AGI system to be doing.
The other kinds of effects that we see are effects where the tightness of the likelihood is modulated by the amount of experience that the model has with different kinds of tasks. So if we look at these shift ciphers, again, here the mystery was that there's really not a big difference between shifting each letter backward 13 steps and shifting each letter backward 12 steps, right?
But that's because you're thinking about this algorithmically. There is a big difference between these if you look at the frequency with which these problems show up in internet text. So this is the performance of these two models, GPT-3.5 and GPT-4, as a function of the amount of shift that's involved in the cipher. And you can see GPT-3.5 only gets this right, some of the time, for a shift of 13, and GPT-4 for shifts of 1, 3, and 13.
So what's special about those numbers? Well, 13 is special because if anyone here has heard of the code ROT13, this is a very standard code that's used on the internet. For example, if you are on internet puzzle forums, sometimes they will encode the answer in ROT13 so you don't accidentally read it. And you have to actually deliberately decode it in order to work out what the answer to the problem is. And so in internet text, ROT13 is actually another language that you can experience if you're reading the internet.
And likewise, ROT1 and ROT3 are often used in examples of decipherment when you're learning about ciphers. I think ROT3 is called the Caesar cipher. It's the cipher that Caesar used for encrypting his messages. ROT1 is used as an example. So these are the things that show up in internet text.
And so what's going on is it's not performing some kind of deterministic algorithm to solve this problem in the way that it should. It's solving this problem by basically having learned a mapping in the way that you might for learning a foreign language. And so the tightness with which it's able to identify the appropriate answers is being constrained by the amount of experience that it has in that particular domain. And when you get outside those domains, it's going to do something where it's going to basically default to its prior distribution. And I'll show you that in a moment.
For thinking about these linear functions, again, it doesn't seem like there's much difference between multiplying by 9/5 and adding 32, multiplying by 7/5 and adding 31. But does anybody recognize one of these? OK. So the first of these is this is the operation that you use for converting from Celsius to Fahrenheit, and this is not.
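In equation form, the familiar conversion is the first of these (the standard Celsius-to-Fahrenheit formula), while the second is the arbitrary variant:

```latex
F = \tfrac{9}{5}\,C + 32 \qquad \text{versus} \qquad y = \tfrac{7}{5}\,x + 31
```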
And so, again, you see a big difference in performance between this rare version of the task and the common version of the task. And it's a consequence of the fact that one of these is things that you can find in internet text, and the other is not. And so here the frequency with which it's performed that task is constraining the degree to which it's able to overcome the influence of the prior distribution in terms of the responses that it's producing. And again, not necessarily a property that you want in your AGI system.
I said that it's defaulting to priors. What does that mean? Well, what it means is that if we have a situation where the likelihood is uninformative-- so if it's, for example, decoding a ROT10 cipher, which it hasn't seen in the data-- this is the shift cipher where you shift 10 steps forward-- then the prior guides the response.
And we actually see this explicitly in the responses it produces. Shift ciphers are a really good tool if you want to try and figure out what the priors of these models are and the way in which they behave, because you can very easily construct messages that it doesn't know how to decode, and then see what it says or what it comes up with.
And so, for example, in this case, the correct answer was, "She never regretted her passion for the artistic craft, nor did she waver in her tireless dedication." And GPT-4, when it's trying to translate this from ROT10, says, the quick brown fox jumps over the lazy dog but not the sheep in the background. So at least part of this is a very familiar, high-probability phrase.
Here, for this one, the correct answer is, "As a doctor of humanities, he was a university professor, founded a university and the newspaper, and won awards in journalism and literature." GPT-4's output is, "To be or not to be, that's the question, whether 'tis nobler in the mind to suffer the slings and arrows of outrageous fortune--" so you can see it's going off in another direction. You could even say something is ROT10 in Denmark.
[LAUGHTER, GROANS]
So that is the best pun that we ever put in a paper. And I should say Tom McCoy is the one who came up with it. All right. OK.
So you can also ask about what the consequences are of introducing better kinds of prompts for these models. So here this is what happens if we take GPT-4 and then instead of using that basic prompting method, try having it think step by step or using a chain-of-thought prompt in which we give it an example of a solved version of one of these ciphers. And what we see when we do that is we do see an improvement in performance. But it's this kind of non-uniform improvement in performance. It's sort of like it's getting bumped up near the cases where it's actually reasonably good at solving these problems.
I think one way of thinking about this is that what you're doing when you're giving it those better prompts is you're doing the equivalent of tightening the likelihood function. You're giving it more structure, which is allowing it to overcome the influence that the prior distribution has. And I think this gives us a tool for thinking about how it is that we can make these systems better at solving these problems.
OK, so the quick takeaway from this part of the talk is that I think here thinking about Bayesian modeling can help us understand the behavior of really complex neural networks. And we've got other kinds of examples of this I'll talk about very briefly, where we've been trying to do things like estimate prior distributions used by these models or actually explicitly making Bayesian models of neural networks to identify their inductive biases.
So one example of this is some work with Michael Li and Erin Grant, where what we did is explicitly make a Bayesian model of very simple kinds of neural networks. We take the neural network, train it on some data, and have it produce responses. We use that to construct a bunch of data sets, where those data sets consist of the input to the network and the output that it produces. And then we can train a Bayesian model-- in this case, a Gaussian process Bayesian model-- that allows us to explicitly estimate the prior distribution assumed by that neural network.
And so the way that this works, that Gaussian process model is assuming some prior distribution on functions. And we estimate the kernel parameters of that prior based on the behavior of the neural network, which then gives us this picture of what the prior distribution might be.
And so just as a simple illustration of this: if you take neural networks with ReLU activation functions at different depths, or sinusoidal activation functions at different depths, this is what the output looks like as a function of the input-- these are the explicit functions produced by those neural networks. We can then estimate a prior distribution that tries to capture some of the characteristics of those functions.
And by doing so, we end up with something where you've got your original neural network. That neural network might be costly to retrain on particular data sets. Or it might be hard to decide what neural network you should use for solving a particular problem. But by having an explicit probabilistic model of what that neural network is doing, we can then much more efficiently do things like calculate what the consequences would be of leaving one data point out of the training set, or even integrate over the prior distribution to answer questions like, would this neural network or that neural network do a better job of capturing a particular data set?
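A minimal sketch of that idea, using a toy stand-in network and scikit-learn's Gaussian process regressor (the paper's actual setup and kernel-estimation procedure are more involved; the network here is hypothetical):

```python
# Sketch: probe a trained network's input-output behavior, then fit a
# Gaussian process whose kernel hyperparameters summarize the implied
# prior over functions. Illustrative only, not the paper's exact procedure.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def probe_network(net, n_points=200, x_range=(-3.0, 3.0)):
    """Collect (input, output) pairs from an already-trained network."""
    x = np.linspace(*x_range, n_points).reshape(-1, 1)
    return x, net(x).ravel()

# Stand-in for a trained ReLU network's learned function (hypothetical).
trained_net = lambda x: np.maximum(0.0, 1.3 * x - 0.4) - np.maximum(0.0, -0.7 * x)

X, y = probe_network(trained_net)
gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=1.0),
                              normalize_y=True)
gp.fit(X, y)

# The fitted kernel hyperparameters (e.g. the RBF length scale) give an
# explicit, interpretable summary of the network's inductive bias.
print(gp.kernel_)
```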
OK, so that was the first part, where we're making Bayesian models of neural networks. The second part of the talk engages with the question of, if we've got an idea about what we want to be doing, what sort of prior distribution we want to have for a particular neural network model, then how do we translate that back to the algorithmic level? So how do we build a model that actually instantiates the particular prior distribution that we're interested in?
And the tool that we're going to use for doing this is meta-learning. So I'm going to take a moment to introduce that idea in case it's unfamiliar. So I hinted at that very briefly in just that very last little part that I talked about.
The idea of meta-learning starts from ordinary learning, where we have our neural network that has some weights phi and a loss function L, and we have a characterization of a task that we're going to perform, where we're going to try and find the best weights phi for minimizing that loss function L.
In meta-learning, we have many such tasks, which are drawn from some task distribution. Each of those is associated with its own loss function. And then we have a set of learners. Each of those learners has their own weights. But we're going to try and find something like, say, an initialization of those weights such that these models end up with shared hyperparameters that allow them to solve tasks that are sampled from that distribution.
And so the way that our meta-learning procedure goes, what we're going to be doing is we're going to be saying, we want to find some shared hyperparameters theta that are going to make it so that across all of these different tasks, our learners are going to end up finding it easier to find the right weights phi.
So one algorithm for doing this is called model-agnostic meta-learning, or MAML. The basic idea is we say, well, let's define our global loss associated with this initialization theta to be the loss that we get on each of our tasks, summed across our tasks, when we start at that initialization and then take one gradient step away from it in the direction of the loss that's associated with that specific task.
And so, intuitively, what this looks like is saying, what we're looking for is some initialization theta such that that initial set of weights is close to all of the phis that we want to end up with, the weights that we're going to use to perform the individual tasks, so that we can reach each of those individual phis by taking a gradient step away from our initialization theta. And the way we're going to find that theta, which is going to get us close to the solutions to all of these different problems, is we're going to have a global gradient step that will differentiate this global loss.
And so in meta-learning, what we're doing is we're trying to solve this problem of minimizing this global loss, where we're looking for a solution which is going to get us close to something which is minimizing the local losses on those individual tasks. The kind of solution that you're going to find is going to be influenced by the distribution of tasks that you assume.
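A minimal, first-order sketch of that meta-learning loop on toy one-dimensional regression tasks (a simplified first-order variant for illustration, not the full second-order MAML update, and using the same data for adaptation and evaluation as a further simplification):

```python
# First-order MAML-style meta-learning on toy linear regression tasks:
# find an initialization theta that is one gradient step away from a
# good solution for tasks drawn from the task distribution.
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A task is a linear function y = w*x, with w drawn from the task distribution."""
    w = rng.normal(loc=2.0, scale=0.5)
    x = rng.uniform(-1, 1, size=20)
    return x, w * x

def grad(theta, x, y):
    """Gradient of the squared-error loss for the model y_hat = theta * x."""
    return np.mean(2 * (theta * x - y) * x)

theta, inner_lr, outer_lr = 0.0, 0.1, 0.05
for step in range(2000):
    outer_grad = 0.0
    for _ in range(5):                                   # a small batch of tasks
        x, y = sample_task()
        phi = theta - inner_lr * grad(theta, x, y)       # one inner adaptation step
        outer_grad += grad(phi, x, y)                    # first-order approximation
    theta -= outer_lr * outer_grad / 5

print(theta)  # ends up near the mean of the task distribution (about 2.0)
```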
And so Erin Grant wrote a nice paper where we showed that, in fact, you can analyze this MAML algorithm as a form of hierarchical Bayesian inference. So the way to think about this is that in a hierarchical Bayesian inference problem, we are trying to estimate the parameters of a prior distribution on, say, the parameters of a model.
So we have a bunch of models that we're trying to use to explain data. So this would be our data. We have parameters of those models that we're trying to infer. And then we're going to estimate hyperparameters that characterize the prior distribution that applies to the parameters of those individual models.
And you can show that the procedure which is used in MAML, taking that one gradient step, can be analyzed as defining an implicit prior distribution on the parameters of the model. And then you can think about the optimization that's being done in MAML as basically trying to optimize the hyperparameters of the implicit prior distribution that's defined by taking that gradient step with early stopping.
And so this gives us a tool for thinking about what meta-learning is doing. You can think about what meta-learning is doing is trying to learn a prior distribution. What prior distribution is it trying to learn? It's trying to learn the prior distribution that corresponds to the distribution of tasks that you've given to the system.
And so that gives us a new idea, which is that we can actually use meta-learning as a tool for distilling an inductive bias from a Bayesian model and putting it into a neural network. So if we want to build a neural network that instantiates a particular prior distribution, one way we can do that is by starting with that prior distribution, sampling hypotheses from that prior distribution, using those hypotheses to construct tasks that the neural network is going to perform, and then using meta-learning to learn a prior distribution that allows it to perform well in all of those different tasks.
And so what we're doing, then, is effectively taking this explicit prior distribution that we started out with, turning it into an arbitrary amount of training data that we can use for training that neural network, such that we end up with a neural network that internalizes the prior distribution that we started out with.
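Schematically, the distillation pipeline looks something like this (a toy, runnable sketch with an invented prior over linear functions; the meta-training step could be the first-order loop sketched above or any other meta-learning method):

```python
# Toy schematic of "prior distillation": sample hypotheses from an explicit
# prior, turn each one into a small supervised task, and hand the resulting
# task set to a meta-learner. The prior here is invented for illustration.
import numpy as np

rng = np.random.default_rng(1)

def sample_hypothesis():
    """Explicit prior: functions y = w * x, with slope w ~ Normal(2, 0.5)."""
    return rng.normal(2.0, 0.5)

def make_task(w, n_examples=20):
    """Each sampled hypothesis generates one task: a small labeled data set."""
    x = rng.uniform(-1, 1, size=n_examples)
    return x, w * x

meta_training_tasks = [make_task(sample_hypothesis()) for _ in range(5000)]

# A meta-learning procedure would now be run over meta_training_tasks, so that
# the network's initialization internalizes the prior that generated them.
print(len(meta_training_tasks), "synthetic tasks generated from the prior")
```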
And you can think about this as a powerful tool if you're a cognitive scientist. You say, I want to understand human cognition. I'll figure out what their inductive biases are. We can take those inductive biases. We can express them in the form of a prior distribution.
And this gives you a way to actually train a neural network that will then instantiate that human-like prior distribution because it'll be very costly to train a neural network to emulate human behavior directly. You need to get lots and lots of training data in order to do that. This is a way that you can do that without needing to get all the training data because you summarize the human inductive bias in the form of that prior distribution. And then you use the prior distribution to generate all of the data, which you're going to use to train the neural network to end up with that appropriate prior. That sort of makes sense? Yep? OK.
So what I'll do is I'll show you a couple of examples of using this approach. One of the cool things about this is that we can start with a prior distribution over here, which is something which is a symbolic, expressive, probabilistic model. And we can end up with something over here, which is a neural network which is opaque and has everything expressed in terms of continuous parameters.
And so this actually gives us a tool for thinking about how the kinds of symbolic Bayesian models we build could end up being things that end up being inside people's heads and the neural networks that are inside our brains, where it gives us a way of translating the prior distribution into the initial weights of what can be a relatively generic neural network.
As a consequence, it's a very soft and expressive way of identifying inductive biases, where we know exactly what those inductive biases are because we've written them down in the form of the prior distribution. But they're instantiated in something as simple as the initialization of the neural network.
So my motivating example here, again, comes from thinking about GPT. So one of the critiques of GPT is that it doesn't learn from human-like amounts of data. The best example we have of a system that's able to learn from human-like amounts of data comes from this work by Yang and Piantadosi. So this is not work on learning human languages. This is actually work on learning formal languages.
But one of the things that they showed was that by making a Bayesian model of formal language learning, you could actually learn arbitrary new formal languages from just a handful of examples. And this is the first system that was able to do something like that. So what we can then do is take the prior distribution that's instantiated in this model and then use that to generate the meta-training data set that we're going to use to then take our neural network model and try and get that neural network model to be much more efficient in its learning of languages.
And so we do this. We take-- it's not quite exactly the same prior as the one that's in the Yang and Piantadosi paper. And I can talk about how it's different and why. But we take something which is a very expressive prior distribution on formal languages expressed in terms of a simple probabilistic program or distribution of probabilistic programs. And then what we can do is then look at how well this performs.
And so this is the Yang and Piantadosi model, and this is measured in F-scores. It measures performance in learning a particular formal language. Their key discovery was that it was possible to make this model which could learn something from one example and, in fact, learn quite a lot from 10 training examples. And if you contrast that with a neural network model, that neural network model is really not doing very well. In order to get to the point where it's doing as well as the Bayesian model with 10 training examples, it has to have more like 500 training examples.
But what we do is take this pipeline, where now we're generating samples of languages from this distribution, using those to construct our training tasks, where basically you're just deciding, is this sentence going to be in the language or not, and then using that to meta-train our neural network. And when we do that, we take exactly the same neural network architecture, an LSTM model. And that model is now able to perform much, much closer to the Bayesian model and, in fact, parallels it across most of this learning curve.
And so the only thing that we've done here is change the initialization of that neural network. Architecture is exactly the same. We find that initialization by using this meta-training procedure. And we end up with a neural network that's able to learn essentially as efficiently as a Bayesian model with the appropriate prior distribution.
It's also worth pointing out that because this is instantiated in a neural network, the Bayesian model here could take on the order of a week in order to actually identify the correct hypothesis. The neural network can do this on the order of seconds.
And as a consequence, we're actually able to apply a prior-trained neural network like this to an English corpus-- in this case, a small English corpus, the CHILDES corpus. We end up with a language model which is better than the best previous language model for these data. And we can look at what characteristics of English actually make this language model better than previous language models.
And so one of the ways that we do this is by looking at sentences that have particular grammatical constructions, taking phrases from those sentences, and then gradually increasing the degree of recursion that appears in those phrases. And when we do that, what we see is that as we increase the level of recursion, the accuracy of the standard neural network that's trained on this child-directed speech data set decreases rapidly, whereas the accuracy of our prior-trained neural network decreases more slowly.
And this is true across all of the different kinds of constructions that we looked at. It seems like what it's really learned is a capacity for recursion. It's much more robust to recursive structures. And so not only are we able to identify an initialization that seems to support faster language learning, that initialization has pulled out something that seems like an important characteristic of natural language through this process of training on these formal languages.
OK, so the key takeaway here is this idea that Bayesian models can actually be a tool for identifying inductive biases in an explicit way that we then transfer to neural networks using something like this kind of meta-learning pipeline. And again, I'm going to show you a couple of other examples that use this kind of approach. So one thing that we can do very directly-- and this is with Ioana Marinescu and Tom McCoy-- is take the same kind of strategy but apply it to logical concept learning.
So in logical concept learning, we're trying to learn a Boolean concept. We're trying to learn a logical rule from examples. There's a nice Bayesian model of this called the rational rules model, which defines a prior distribution over logical concepts by specifying a probabilistic grammar on concepts. So these are the rules of that grammar.
That probabilistic grammar defines a prior distribution. We can generate many, many samples from that prior distribution and use them as a meta-training distribution. And when we do that, we end up with a neural network model which, again, is just a standard LSTM model. There's nothing special about the architecture.
But it's being meta-trained on examples that are generated from this prior distribution. And it ends up giving us a very nice correspondence with both human performance when people are learning particular concepts and also with the predictions of this rational rules model. And this kind of approach also generalizes to first-order logical concepts. And we're starting to do some empirical tests of whether it actually does a better job than the probabilistic model in terms of accounting for human judgments.
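Here is a tiny sketch of what sampling Boolean concepts from a probabilistic grammar can look like (the grammar and its probabilities are invented for illustration, not the ones used in the rational rules model or the paper):

```python
# Sample Boolean concepts (logical rules over binary features) from a small
# probabilistic grammar, in the spirit of a rational-rules-style prior.
import random

FEATURES = ["f1", "f2", "f3", "f4"]

def sample_concept(depth=0):
    """CONCEPT -> feature | NOT CONCEPT | (CONCEPT AND CONCEPT) | (CONCEPT OR CONCEPT)"""
    r = random.random()
    if depth >= 3 or r < 0.4:
        return random.choice(FEATURES)
    if r < 0.6:
        return ("not", sample_concept(depth + 1))
    op = "and" if r < 0.8 else "or"
    return (op, sample_concept(depth + 1), sample_concept(depth + 1))

def evaluate(concept, example):
    """Evaluate a sampled concept on an example given as a feature -> bool dict."""
    if isinstance(concept, str):
        return example[concept]
    if concept[0] == "not":
        return not evaluate(concept[1], example)
    left, right = evaluate(concept[1], example), evaluate(concept[2], example)
    return (left and right) if concept[0] == "and" else (left or right)

# Each sampled concept defines one meta-training task: label random examples.
concept = sample_concept()
example = {f: random.random() < 0.5 for f in FEATURES}
print(concept, evaluate(concept, example))
```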
The other context in which we've looked at these kinds of things is in the context of taking some of the other ideas that come from Bayesian statistics and thinking about how we can translate them into methods that we can use as components of deep learning models. So this is an example. This is work by Jake Snell and Gianluca Bencomo, where we take Dirichlet processes, which are a classic idea in nonparametric Bayesian statistics, and then create a neural circuit that instantiates the inference procedure that's involved in what's called a Dirichlet process mixture model.
And so the idea here is that if you have a data set where you've got objects that belong to classes, but you don't know how many classes those things belong to, you can use this as a component that you introduce into your model, where now your model is able to then make predictions about what the class membership of novel things might be.
And so we do this by taking our Dirichlet process mixture model and generating many, many instantiations of data from it. We then meta-train a recurrent neural network, which is learning a policy, essentially, that corresponds to inference in this nonparametric Bayesian model. And so the output of this neural network is a prediction of the class label in the training data for each of the inputs that it's provided.
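As an illustration of how that kind of synthetic meta-training data can be generated, here is a minimal Chinese restaurant process sampler, the standard sequential view of a Dirichlet process (a sketch, not the paper's exact generator):

```python
# Sample cluster assignments from a Chinese restaurant process, then attach
# Gaussian observations. Each sampled sequence could serve as one
# meta-training episode for the neural circuit described in the talk.
import numpy as np

rng = np.random.default_rng(0)

def sample_crp_episode(n_points=50, alpha=1.0, dim=2, cluster_std=0.1):
    labels, counts, means = [], [], []
    for _ in range(n_points):
        probs = np.array(counts + [alpha], dtype=float)  # existing tables + new table
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):                             # seat at a new table
            counts.append(1)
            means.append(rng.normal(0.0, 1.0, size=dim))
        else:
            counts[k] += 1
        labels.append(k)
    xs = np.stack([rng.normal(means[k], cluster_std) for k in labels])
    return xs, np.array(labels)

xs, labels = sample_crp_episode()
print(xs.shape, labels.max() + 1)  # 50 observations and the sampled number of clusters
```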
And then that becomes a component that we can then introduce into an arbitrary deep neural network. So you can take your convolutional neural network, and then we just put this trained neural circuit on top of your convolutional neural network. And it allows you to take that convolutional neural network and now apply open set probabilistic inference to naturalistic image data in a way that expands the capacity of the kinds of neural network architectures that we're able to apply to those models.
You end up with a model that's performing Bayesian inference. It's competitive with particle filter-based models for Bayesian inference on small data sets. Those models can't transfer to large data sets, and so we're able to apply this on data sets where these kinds of methods couldn't previously be used. And this neural circuit is fully differentiable, which means that you can use it as a component in a model and then differentiate through it to further fine-tune the representations that appear in that model.
And so this gives us a kind of general recipe for taking cool ideas from Bayesian statistics, turning them into little neural circuits that you can then introduce as components that you add into whatever your favorite deep learning architecture is. And we're super excited about this as a methodology.
OK, so I'm going to wrap up here. But the key takeaways are exactly the things that I told you they were going to be. So I think the real thing that I want you to get from this talk is the idea that different models can coexist at different levels of analysis and answer different questions. And in particular, in the context of thinking about machine learning, that means that even though you built a neural network to try and solve your problem, it might be worthwhile thinking about making a Bayesian model to try and understand how it is that that neural network is solving that problem and why it's behaving in the way that it is.
The two virtues of these Bayesian models, which we get from thinking about this in the context of human cognition, are telling us what machines should do-- they instantiate a rational solution to these problems, and so they give us tools for understanding what a solution to a problem looks like-- and also understanding why they do the things that they do. And I gave you a couple of examples of this, where if your machine is doing something weird, it gives you some tools that you can use for trying to diagnose where that weirdness is coming from.
And the key idea here is that even if the thing that you've built is a deep neural network, if it's doing a good job of solving a problem-- in much the same way that, when we think about this in the context of human cognition, humans do a reasonably good job of solving particular problems even though they're doing something quite different in terms of the underlying representations and algorithms-- then it can be worthwhile to think about making a Bayesian model to try and help you understand exactly what it's doing. So thank you very much.
[APPLAUSE]