Clustering and generalization of abstract structures in reinforcement learning
Date Posted:
September 7, 2022
Date Recorded:
August 12, 2022
Speaker(s):
Michael J. Frank
Brains, Minds and Machines Summer Course 2022
Description:
Michael J. Frank, Brown University
MODERATOR: It's a great pleasure to introduce Professor Michael Frank from Brown. So thank you very much for joining us. He has made seminal contributions to many different aspects of cognitive science, ranging from questions about reinforcement learning to computational models of cognition to cognitive control.
And I think there's a menu of possible topics for him to discuss today. So I think this is going to be very exciting. And thank you, again, for coming. And again, we're going to pass the microphone to people for questions.
MICHAEL FRANK: I was initially planning to talk about clustering and generalization of abstract structures in reinforcement learning. That's this title. It's essentially a talk about computation and about the lack of generalization and abstraction that deep networks may have and how we might start thinking about that, connecting it to human behavioral data and a little bit of neural data. I have another talk I could give, which is more about dopamine across species. Some computation, but some more animal data.
So my question for you is, which one would you prefer? I want a vote.
AUDIENCE: Deep networks.
MICHAEL FRANK: All right. Here we go. Good. So, as I mentioned, a lot of you are familiar with deep networks. This is the paper from Nature in 2015, showing that, if you train deep networks on Atari games, they can do well-- better than humans-- at a lot of tasks. But it's also known that they specialize to specific tasks-- you have to retrain them to perform a new task, and even within a single task.
This is a video of A3C, which is a deep reinforcement learning agent that was trained to play Breakout. And it's playing very well. And another company that was a competitor to DeepMind showed that this thing is actually brittle: if all we do is take this paddle and shift it up by one pixel, the agent's ability to play the task completely fails. It has to be, essentially, retrained, which is clearly not the kind of thing that would happen to a human who had learned to play the task. Even if you hadn't seen that pixel shift, you would be able to very quickly adjust.
Well, I guess I might go back to this for a minute. My point is not to pick on deep networks. There are a lot of ways in which people have made them better, and there's this whole forefront of trying to get them to do continual learning, multitask learning, and so forth. But our goal is to try to understand what makes humans a bit different. And, in order to do that, the strategy that we and other people take in computational cognitive neuroscience is to try to isolate the specific computations that are needed with tasks that, often, we would describe as toy problems but that allow you to really identify the key things that are needed, rather than trying to scale it up and compete.
So there are complementary aims here. The general gist of my talk is in defense of toy problems. We're trying to understand computation. So I wanted to also motivate this by something that-- I'm not at all a developmentalist or, really, a motor-learning person. But there's this problem that always bothered me, and I'm going to try to motivate the way I'm thinking about things in terms of this problem.
So one is that, if you look at what happens with animals-- so these are some kittens that are a couple of weeks old and they're doing these marvelous feats of jumping and playing around with each other and doing acrobatics and so forth. And this is a picture of my two twins when they were, I don't know, a month old or something. And they were lying on the ground, just staring up at me. And I wanted to be proud of them, but they weren't doing anything. They're just staring up at me.
And why is it? Why are humans so incompetent when they're born? And there is a standard story for this. The standard story is that infants are born early because they have a big head and the mother has a small birth canal, and they have to come out earlier than they would have normally. And so, you need a fourth trimester after they're born for them to develop further. And then, they become more competent. And I'm not denying that there's something to that, but I don't think it's nearly enough.
So, if you look at what happens after that fourth trimester, you could, as I did when I was a younger parent, look at babycenter.com, which lists all these milestones of what kids are supposed to show. So if you look at what happens at three months, the milestones are-- you no longer need to support his head. When he's on his stomach, he can lift his head and his chest. He can open and close his hands. It's not really mind-blowing stuff, compared to what's happening with these kittens.
So my hypothesis here is that the human brain-- and, perhaps, other primate brains-- is wired in a way that builds in inductive biases-- in a way that is trying to search for latent, generalizable structure. Actually having the architecture that is needed to discover that structure is inefficient for learning very specific problems. But once you develop it, it allows you to generalize and be much more flexible across a wide range of tasks.
So, now, my kids are a few years older. They're attending the French American School of Rhode Island. They're speaking different languages. I have some videos-- they're trying to speak brain areas in French-- but I'm not going to show those to you. So, of course, my point was not to just say that humans are incompetent or that I'm not proud of my kids. The question is, what happened inside their brains that allowed them to do this? And so, the overall overview of my talk today: I'm going to have all these different musical themes, although everything that I'm going to talk about has analogs in language and basic motor learning and navigation and so forth. And I'll sometimes try to make allusions to that.
So let's say you're trying to learn to play the guitar. The first part of my talk is just about how we can learn rules in a hierarchical manner so that we can, then, generalize to other things that are similar in some ways. So if you learn to play a single guitar, you should, then, be able to generalize the way in which you're trying to play the song and the movements that you need to make to other guitars.
And maybe, initially, you don't know that the thing that matters for generalizing is that they're both guitars. Maybe you would think that there are other features of the stimulus that would cluster together. So you have to learn, OK, these are all the same, so you can do that.
And once you do that, you can generalize and learn between them. And I'll try to illustrate how we study that in humans and in models. But then, I'll move on to a problem that, I think, if you're thinking about it in terms of music-- if you're a really skilled musician, you don't necessarily learn only one instrument. You might also learn to play, for example, the piano. And the things that you learn when you're learning to play the piano and learning to play the guitar are different but, in other ways, they're the same, in the sense that you might want to play the same song on the guitar and on the piano but use different movements.
So how can you figure out how to be more compositional in what you're transferring? So instead of reusing an entire structure, you have to break it down into bits, like what you want to play and how you play it. And then, finally, at the end-- if I get to it-- I want to talk about abstractions: how you want to learn to reuse not just specific songs or specific ways in which you play things, but to learn equivalent structures that allow you to internalize how something works, so that you can learn one song and then immediately learn a different song, even though the order of the note sequences you have to play might be completely different.
So for the first part, we think that there is a critical role of nested, hierarchical corticostriatal circuits. So there are these loops that go through the frontal cortex, the basal ganglia, the thalamus and so forth that seem to have hierarchical structure to them. So you have a first-order loop that would be about action selection for selecting motor programs or premotor programs. Let's say motor programs.
But when you're deciding what motor program to do, you might want to contextualize that by a premotor plan. And then, you might want to contextualize that by some higher-order context. So let's say you're in a coffee shop that looks like this, and you have to decide-- maybe I want to choose between different coffees, depending on the context. And then, when you get the coffee, you then decide to reach towards it, lift it, and drink it. And so, there's action selection.
And learning is conditionalized by the higher-order context. And the question is, if you learn that this is a good motor action and you drink and you get the coffee and it tastes good, should you really credit your low-level motor program for reaching and getting that coffee, or should you credit the higher-level action that you took, which is, I ordered a coffee in this place? And that has a different causal structure as to what deserves to get reinforced. And maybe you're going to go into that same coffee shop. But now, you're the musician, and you have to decide not just which coffee to get, but what kind of music to play or which guitar to play.
So all I'm saying here is that action selection is highly conditionalized, based on the context of what you're trying to do. So, if we want to study that, one thing that we use is this framework called task sets. So let's say you just want to map stimuli onto actions and you may have a cue-- a higher-level context-- that tells you which stimuli go with which actions to get reinforced. So you can think of the cue as, I'm in the coffee shop, for example.
And if you were going to do this in a cognitive psychology experiment with people or with animals, you can think of it as just simply mappings between stimuli and actions. So let's say you have a circle, triangle, and a square, and you have to map them onto A1, A2, and A3. But, maybe, the rules for which action you take for which stimulus might depend on other contexts-- in this case, a color.
So I'm going to interchangeably use the letter C to reflect context and color. But you should just know that that's arbitrary. The C could also be a shape or any other dimension. And this thing could also be a color or something else. And so, you can write that in more compact form-- there's a bunch of states that get mapped onto a bunch of actions. And then, when you go to another context, maybe you can learn a whole other set of mappings, because maybe the rules are different if you go to a different coffee shop or play a different guitar and so forth.
And you could just keep doing that. But, of course, if you did that, it wouldn't be very general. So what we want to do is think about how there might be a latent rule set that you're trying to learn that we call a task set, which comes from the cognitive psychology literature. And perhaps you might want to cluster the context around past task sets. So even if these contexts don't look like each other-- these colors are not necessarily similar to each other-- they may all indicate the same set of rules.
And you want to be able to do that. And if you do that, you can generalize between them. But the problem is that task set space is latent. You don't observe it directly. All you observe are stimuli in the world, actions, and consequences. So you have to try to extract those. And again, you can think of these different colors, for example, as different guitars. And then, what happens if you go to a new context altogether that you've never seen before? Well, what are you going to do?
You, potentially, can say, well, I'm going to just try to reuse the task set that I've seen before that's the most popular, the one that has been clustered-- that has been seen across the most contexts. And, critically, for our models, it's not really the one that you've used the most frequently in the past, necessarily. It's the one that has appeared across the most distinct contexts. And if you're interested in that, we can come back to it.
Or you can try the other one. And, for those of you familiar with the Chinese restaurant process, this is a non-parametric Bayesian model where we, basically, say that your probability of trying out a given task set is going to be proportional to how popular it was across contexts. So you'd be more likely to try this one, less likely to try that one. But you also need some possibility of creating a new one altogether because, maybe, the set of rules that you need to learn doesn't match anything that you've seen before.
So the Chinese restaurant process prior is just something that enforces parsimony. It tries to get you to reuse things if you can because that will allow you to generalize. But it also allows you to continually expand them to create new latent rules when needed. So, besides it being latent, it also has unknown size. So that's why you have this parsimony bias to try to keep it as small and compact as possible while allowing it to grow as needed.
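To make that prior concrete, here is a minimal sketch-- illustrative only, not the lab's actual code-- of how a Chinese restaurant process would score which task set to try in a brand-new context. The counts, the concentration parameter alpha, and the names are all hypothetical.

```python
# Minimal sketch of a Chinese restaurant process prior over task sets.
# Counts, alpha, and names are hypothetical, not the actual model code.

def crp_prior(context_counts, alpha=1.0):
    """Probability of reusing each existing task set, or creating a new one.

    context_counts[k] = number of *contexts* (not trials) already assigned
    to task set k, so popularity is counted across contexts.
    """
    total = sum(context_counts.values()) + alpha
    probs = {k: n / total for k, n in context_counts.items()}
    probs["new"] = alpha / total  # leave room to create a brand-new task set
    return probs

# Two contexts already point to task set TS1, one context to TS2.
print(crp_prior({"TS1": 2, "TS2": 1}))
# {'TS1': 0.5, 'TS2': 0.25, 'new': 0.25} -> try the more popular task set first.
```

The alpha term is what keeps the model willing to posit a new task set when nothing familiar fits, which is exactly the parsimony-plus-growth tradeoff described above.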
So it turns out that, even though we have these Bayesian models for inferring and creating these task sets, you could also implement something that approximates that in a neural network that has this hierarchical corticostriatal circuitry. I'm just going to show it in cartoon form here, rather than showing the nitty-gritty of the neurons. But the idea is that you have these different corticostriatal loops that are classically thought to learn through reinforcement learning.
So a lot of people study-- including me-- a single circuit where you have these different motor actions. You have this thing that we call the basal ganglia, which has complex circuitry in it-- but I'm not going to focus too much on the details of that here-- and that undergoes some kind of reinforcement learning process for learning which actions produce the best outcomes. But when you expand it hierarchically, you now have another whole set of loops that selects, in this case, not literal motor actions, but representations of what will become task sets.
So the neurons in these models are not hard-coded to represent task sets. They don't mean anything, initially. But the model learns through reinforcement learning-- dopamine goes up and down and reinforces synapses.
As it learns to select certain motor actions together for certain stimuli, it will create these abstract task set representations. And it can do reinforcement learning at both of these levels at the same time, such that, if you then take the model to a new context altogether, it will be faster at learning if one of the task sets repeats than if it had to learn from scratch.
And this is just to show that what's underlying any of these circuits is a more detailed neural network model. Let's see. So one thing, just to connect it to how I motivated this in the first place-- and I'm not going to show it too much here-- if you built this all into one conjunctive circuit, where the Cs and Ss just form conjunctions that then go into one circuit, you can actually learn very simple task sets much faster than with this complex hierarchical structure.
So that's like the kitten point that I was making before. You can learn to do new things really quickly if you just have one circuit [INAUDIBLE] to memorize for these conjunctions of states and action [INAUDIBLE] and context what to do. But if you do that, you will not be able to show this kind of transfer. It would be highly specific to that content. So the upshot of this is that building in this nested, hierarchical structure causes the system to try to search for, which dimensions of the world should I treat as higher-order task sets that, then, can conditionalize what actions I should take? And if the world is structured in that way, then you will benefit from it.
And, while this is more of a focus of my other talk, I just wanted to mention it briefly-- we have evidence that the dopamine system, which projects to these different striatal areas, is not, as people had initially thought, global, where it just goes up and reinforces all parts of the striatum at the same time. But, rather, there are these traveling waves of dopamine activity that propagate towards or away from different striatal areas, depending on their involvement in the current task.
So the upshot of this paper-- it's a pretty dense paper, but it's showing that there are wavelike dopamine dynamics that move across the striatum. And the direction of that wave propagation depends on whether the striatal region underlying it is necessary for doing well in the task. So we think of that as a mechanism of [INAUDIBLE]. I'm happy to talk about that later, if anybody is interested. And I'm going to just skip that. OK.
So how do we study this phenomenon of task set learning that I mentioned in humans? I'm going to go back to this idea of simple, really, toy problems in order to point out where deep networks fail and where humans don't. So let's say we just do a simple reinforcement learning experiment where we give people these different contexts, C-- which, again, could be colors or shapes or anything else-- and lower-level stimuli-- which could be shapes. And we just have them learn that, when they see C0 and S1 together, they should press button A1-- just a key on the keyboard-- in order to get a reward, whereas, when they see S2, they should press A2 to get a reward.
Then, they also learn, in interleaved fashion, that C1-- even though it doesn't look like C0 in the experiment-- happens to have the same set of rules. And then, they also learn that C2 has a different set of rules, A3 and A4. So, again, you can just learn this. Very simple. It's only six mappings. And you could learn it in a conjunctive way. But if you were to learn it in a hierarchical way, you don't know, initially, which dimension of these things is the higher-level one.
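As a concrete picture of that training phase (the labels just follow the slide's notation; this is a schematic, not the experiment script), the six mappings look like this:

```python
# Schematic of the six training mappings (labels follow the slide, not real stimuli).
training_rules = {
    ("C0", "S1"): "A1", ("C0", "S2"): "A2",  # context C0
    ("C1", "S1"): "A1", ("C1", "S2"): "A2",  # C1: a different color, but the same hidden rule as C0
    ("C2", "S1"): "A3", ("C2", "S2"): "A4",  # C2: a genuinely different rule set
}
# A conjunctive learner memorizes all six (C, S) -> A pairs independently;
# a hierarchical learner can discover that C0 and C1 point to one latent task set.
```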
But if you happen to realize that these two things are equivalent to each other, then you could cluster them together. And the way that we test that is not by looking at learning in this phase. But after they learn here, we start adding in new stimuli. And the actions that they have to map to those new stimuli are arbitrary, so they can't really know what the correct thing to do is, initially. But, as soon as they learn that, let's say, A1 goes with S3 and C0-- if they've realized this is equivalent structure, they should immediately know to try A1 for S3 and C1.
And vice versa. They see S4 with A4 in C1 first, and then they can transfer that back here, whereas, for C2, there's no other context there. So they also have to learn an arbitrary set of associations, and there's no opportunity for transfer. So it turns out that, if you just run a vanilla backprop neural network-- deep or not, it doesn't matter-- on this problem, it doesn't actually show this kind of transfer, because it doesn't have this internal latent representation that says that C0 and C1 are pointing to something that then generates the task set.
It really just learns units that map C0 and S3 onto A1, which are different from the units that map C1 and S3 onto A1, and so forth. So that actually doesn't show the transfer. Of course, you can build a deep network, like the hierarchical one that I showed, that does do it, but you need to have that kind of architecture, or something like it. So now, I'm going to show you simulations from our models-- both the Bayesian models and the neural models-- for this task where, let's say, we taught people that, when they see, let's say, red and gray, they both point to the same task set.
And they've learned already that A1 and A2 go with these different shapes. And now, suddenly, we've introduced a new shape. Let's say a diamond. The prediction from the model is that the model should be faster at learning in this context, where C0 and C1 point to the same task set and it can transfer knowledge between these two things, compared to learning this diamond shape in a new task set altogether, where it has no opportunity for transfer.
And that's despite the fact that, for experimental control reasons, we presented each of these contexts on one quarter of trials and this one on half the trials. So it's not just that these guys are more frequent than each other. They're actually equated in frequency. And so, that's what people show. If you have them learn this experiment, they show this transfer-- really, right from the first trial, they're more likely to benefit from having seen one of the other contexts before. Any questions about the task structure at this point? Makes sense?
So we wanted to ask questions about how this relates to what's going on in the brain-- the kind of transfer that you might-- is the reinforcement learning system acting in a way that is sensitive to the structure? And the standard thing that people often look at in model-based reinforcement learning-- or, sorry, when you're using a reinforcement learning model to interrogate the brain-- is the prediction error. It's a quantity that's in all reinforcement learning models, which is just the difference between the reward that you get and the reward that you expected to get.
So just to illustrate the cartoon of that, let's say you're experiencing these new orange diamonds for the first time. You're pressing button three. And you happen to get reward feedback that says you were right, and you're going to win some points. Maybe you'll win some money or whatever. The prediction error at that point, according to a reinforcement learning model, would be positive because you never experienced that association before, so you don't know what to expect.
But then, after you've done that, you then select button 3 again because it was rewarding. The prediction error to the next reward feedback should be smaller if you've done any learning because you've adjusted your expectation to expect reward. So that's just the standard idea. But then, let's say what happens for the red one here, same thing. Let's say you press button one, and it's correct. You get a reward prediction error. And then, let's say, following that-- not necessarily right after, but at some point later-- you get the gray diamonds, which you've also never seen before.
According to a standard reinforcement learning model-- for the kitten or the conjunctive model-- this initial response with this new stimulus should also elicit the same size reward prediction error. But if you're sensitive to structure, you've noticed that gray and red both point to this latent thing. That suggests that, then, your prediction error should be smaller for this very first time you've ever experienced the diamond and that button because you know that it's not gray that matters.
It's the latent rule. And gray just happens to point to that. That make sense? That's the prediction. So what we did is we used EEG in humans. I'm not going to go into a lot of details, but we use a GLM approach. So rather than looking at traditional, event-related potentials, we're going to try to model the voltage at all electrodes and time points on a trial-by-trial basis and see to what degree they are sensitive to quantities like prediction error and structure prediction error and other things.
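To put that structured-prediction-error regressor in symbols (a schematic rendering of the idea, not the paper's exact notation): a flat learner keys its reward expectations on the literal context, while a structure-sensitive learner keys them on the inferred task set,

```latex
\delta_{\mathrm{flat}} = r - Q(c, s, a)
\qquad \text{vs.} \qquad
\delta_{\mathrm{struct}} = r - Q\big(\mathrm{TS}(c),\, s,\, a\big).
```

Because gray and red map to the same latent task set, Q(TS(gray), diamond, A1) has already been updated by the rewarded red-diamond trials, so the structured prediction error is smaller on the very first gray-diamond trial even though that exact context-stimulus pairing is new.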
And we did find-- this is just a regression coefficient, in this case at a single electrode over mid-frontal cortex, which is classically related to prediction errors-- that both in this early time period and in this later time period, it's sensitive to reward prediction error, which we and others have seen before. And then, this is just a scalp topography of what that prediction error effect looks like.
It doesn't matter. I'm not trying to make any claims here about the underlying neural source because it's EEG. All we want to know is, is there some prediction error in the brain, and is it sensitive to structure? And what Anne Collins found when she was a postdoc in the lab was that, if you look at either the early or the late time period in which the brain is sensitive to prediction error, there's an additional, unique effect of structured prediction error of the type that I showed you on the last slide-- that the brain seems to be sensitive to that hierarchical structure.
And so, then, we could ask, does that actually predict the degree to which people are generalizing and transferring? Not only the kind of transfer that I just talked to you about, but transfer to a new context altogether, which, so far, I haven't told you about. So I'm going to go back to the cartoons here, about what the task looks like. Let's say, after they've learned the initial phase and the second, transfer, phase, we now start introducing new contexts, like C3 and C4.
And we could introduce C3 such that it is the most popular task set from the past. Or it could be another one that also had been seen before but was less popular. Or it could be a new one altogether. And we could ask again, do people show transfer? So that's what this looks like. After we go through this transfer phase, we now start adding in new contexts. And if people are transferring, then, for this entirely new context, they might say, well, if it's one of these that I've seen before, then maybe I should learn faster than if I have to create a new one altogether.
And we could additionally ask, do they transfer better if it's the more popular one than the less popular one? So the model does make that prediction-- again, I showed this in the neural network before; this is the Bayesian model. It doesn't really matter. They both predict that learning should be faster when the task set in the new context is an old one. And people also show that. So, obviously, I wouldn't have motivated all of this if they didn't. But, to me, the interesting thing is, we can now look at the brain measures of prediction error during the initial learning phase to see if people are sensitive to this hierarchical structure and see if that relates to the transfer.
So what I showed you before was, across all subjects, people are sensitive to structured prediction error. But if you look at the individual subject, the regression coefficients of how much they have a unique effect of structured prediction error-- not the overall reward prediction error, but just the structure part-- you could see that there's a bunch of people that are positively sensitive to structure. And then, there's another bunch of people that are clustered around zero. And maybe we could speculate, later, why some people would do that.
But the interesting thing is, if you now look at their later transfer in the third phase of the task, the people who had this kind of structured prediction error showed very clear transfer to that old task set, whereas these people show no transfer at all. And then, one other quick result is that those people who show transfer also seem to be doing it in a way that's consistent with this clustering prior. So if you look at the action that they selected on the very first trial, they're more likely to choose the task set that is the most popular compared to the less popular one.
So the summary of this is that this kind of structure learning affords transfer, both within contexts and across. It seems to depend on clustering priors, and it informs neural representations of reward predictions. And I didn't show you the performance in the initial learning phase. But, actually, we see no benefit of clustering in the initial learning phase, before they know the clustering. Just like I said about my kids not doing very well, there's actually a cost to building the structure. So people actually do worse at these contexts than this one, initially, which I can talk about later if you want.
But the critical thing is that, once they've learned it, the structure learning affords transfer of new information within learned clusters and of known rules to new contexts via the clustering prior. OK. And those are the papers, in case you want to look in more detail. So a lot of this work is built on what Anne Collins did in my lab and what she's done since. But one thing that you might have noticed here is that these structures are all-- when you go to a new context or you add new stimuli, you're basically having to import or transfer the entire structure as a whole, which might apply, sometimes.
But I motivated before-- what happens if you learn the guitar and then learn the piano? You don't want to have to say, well, this is completely different from anything I've ever seen before. After all, it's something that plays music. So we wanted to start addressing this question of compositionality, which is a buzzword that comes up all over the place in computer science and in cognitive science. In this case, we're talking about compositionality in terms of reinforcement learning-- both what do you do and how do you do it.
So in reinforcement learning, there's typically something called a reward function, which is just what do you value? What states of the world are things that you want? And then, there's also a transition function, which is, given that you're in a state and you take an action, what's the next state that you're going to end up in, regardless of the reward? And those are two things that get composed together to figure out what the optimal thing to do is.
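In standard MDP notation-- this is just the textbook Bellman relation, not anything specific to these models-- the two functions are composed during planning as

```latex
Q(s, a) \;=\; R(s, a) \;+\; \gamma \sum_{s'} T(s' \mid s, a)\, \max_{a'} Q(s', a'),
```

so an agent that stores R and T separately can, in principle, recombine a familiar reward function with a familiar transition function in a pairing it has never experienced.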
So if you're going to transfer between a guitar and a piano, well, they share things, like the chord progression for a particular song or the rhythm, for example, of the desired sound and song, whereas, let's say, the flute and the saxophone might share things like the physical movements that you need to play certain notes. But, maybe, you're going to play different songs across them. Ideally, if you're a really good musician, if you're exposed to a new instrument, like the piccolo, you should be able to say, OK. This seems pretty similar to the flute, so I'll be able to make those movements.
But maybe you want to use it to play a song that you learned on a guitar. And if this seems really hard to you-- it is, because you have to be an experienced musician to do that kind of transfer. But this kind of compositional transfer, I would argue, is actually happening all over the place, not just in music-- in navigation, for example. I'll give you an example of that in a minute. And maybe you can think of other situations.
So we need compositionality. We need to be able to reuse the flute mappings to play a song that's usually played on the guitar. So, basically, you would add this plus that together and say, OK, now I can do the right thing.
The clustering models that I talked about before could not do that because, essentially, what they did was something like this. You'd have a context. You would figure out how to cluster it into something latent-- the task set. But the task set essentially defines, together, what the transitions are, or what you have to do to get to the next state, and what the rewards are.
And, so if you're clustering those jointly, that's not going to allow you to generalize them independently of each other. And I told you I'll give you another example. This is just one of them. Let's say you've learned to drive a car to get somewhere and you would learn to ride a bike to get somewhere. If you're going to a new place-- let's say you want to reuse the bike routes and the rules that you use when you're riding a bike to go somewhere that you usually drive to. It shouldn't be that difficult.
It's not like you'd have to learn to bike from scratch and learn all the bike routes again. You'd be able to put those two things together. And that's essentially what this is addressing. OK. But to study that, we break it down to its key elements.
And so, we are going to have our agent navigate in just a grid world, which has actions-- they can move up, down, left and right-- and it has reward spaces. Later on, we're also going to have people do this online. They're going to press buttons to try to move around.
And we can vary the rewards so that the different spaces in the grid can have rewards of zero or one, in this case. But we also vary the keys that you have to press in order to move up, right, left or down. So, sometimes, this A5 maps onto this left button. But sometimes, if you want to move left, you have to press some other button. So that's the transition function of how you act to move to a different state, which is different from the reward function.
And, again, this seems contrived. But if you think about a guitar-- you just pick up a guitar and you play a single note on a single fret-- if you're a guitarist, you then know, immediately, what the other notes are, at least on that string and probably the rest of the guitar. So, similarly, here, if there's a certain set of mappings such that, if I press this button to go north, I might know immediately that this button takes me east and this button takes me south. That's the way that we've structured it.
So Nick Franklin built these clustering models. The joint clustering model is essentially the same one that I showed you before. It just takes these contexts, puts them into these clusters that are latent task sets that then define both phi-- the transition function-- and the reward function. That then influences your policy. We just did that in a model-based setting that requires the agent to plan in order to decide how to move around to get the reward.
But we also built a model that we call an independent clustering model, which says, well, maybe we should treat the reward function and transition function, a priori, as completely different things. So we should do our clustering of them both separately. So it could be that we have a set of clusters for the reward function that may or may not be the same set of clusters as for the transition function.
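A minimal sketch of the difference (labels and cluster assignments are made up for illustration):

```python
# Illustrative sketch only: joint vs. independent clustering of contexts.
# Context labels and cluster IDs are hypothetical.

# Joint clustering: one latent task set per context bundles BOTH functions.
joint_assignment = {
    "context_A": "TS1",  # TS1 = (reward_fn_1, transition_fn_1)
    "context_B": "TS1",
    "context_C": "TS2",  # TS2 = (reward_fn_2, transition_fn_2)
}

# Independent clustering: rewards and transitions each get their own clustering,
# so a context can reuse a familiar transition function with a novel reward.
independent_assignment = {
    "context_A": {"reward": "R1", "transition": "T1"},
    "context_B": {"reward": "R2", "transition": "T1"},  # new goal, same mappings
    "context_C": {"reward": "R1", "transition": "T2"},  # old goal, new mappings
}
```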
And I just wanted to mention that we have some theoretical and empirical work on this that I'll talk to you about in a minute. But Rex Liu, who's a postdoc in my lab, just had a paper accepted a couple of days ago in Artificial Intelligence, where he realized that you can get the best of both worlds. You don't have to have just an independent clustering agent or a joint clustering agent. You can actually simultaneously learn the covariance structure and the independent structure.
Well, I'm not going to be able to go into the details of exactly which situations that's important in, but I'll show you an example of it in a second. Sorry. It uses something called the hierarchical Dirichlet process, if any of you are familiar with that. But, to illustrate the need for this kind of clustering and why it's important to think about, we designed a task that's called a rooms task, often used in reinforcement learning, where an agent has to go around the grid world, go to some door which is going to then take it to another room, which is going to take it to another room. And, finally, it gets a reward.
But we structured it such that there are a number of doors in each room. Only one of them takes you to the next room. At the end, you have the reward. But also, in each room, the mappings-- the transition functions of which buttons to take to actually move in the right direction-- could be different. But in some rooms, they might be the same as things you've seen before. And, similarly, the door that you have to take to get out might be the same as something you've seen before, or it might not. But they don't necessarily have to go together.
So, if you're going to learn well in this task, you have to be able to try to transfer what is most likely to be true, in terms of the transition function and the reward function and, maybe, do that separately. And then, we made it diabolical, which means that if you make a mistake at any point along this task-- you go into the wrong door-- you have to go back to the start. So that's really going to force the agent to really need to benefit from transferring the right thing-- the right structure.
And what Nick showed is that-- in this case, the room structure is independent-- if you look at the total number of steps it takes, the independent model is much faster at solving this task compared to a joint model, which is, itself, faster than just a flat model, which would be the kittens learning everything from scratch. And that advantage grows with the number of rooms. So, in this cartoon, I only have three rooms. But the agents actually have to go through up to 40 rooms. So the benefit of independent clustering grows with the number of rooms and, also, with the size of the grid area within an individual room.
So, as the task gets more complex and there's some repeated structure that you could reuse, it really matters whether you're clustering in a way that is sensitive to the statistical structure, whether it's independent or joint. And I mentioned Rex's work. So, just to very briefly highlight that, he made it yet more diabolical. It's amazing that he got it to perform at all. The agent had to go in a room-- we call this the castle-with-dungeons task, or the hierarchical diabolical rooms task-- where the agent had to navigate to a particular spot on the grid.
But instead of that taking it to the next room, it took it down to a sublevel. So we think of this like Super Mario Brothers. If you're playing the video game, you have to go down and solve a level and come back up. And so, once it solved that-- oh, sorry. Once it got to this thing, it would go down to the sublevel, it would have to learn the mappings and the reward function in order to come back up and, then, go to the second one, go back down, then go to the third one and go back down and, then, finally go to the door that would take it to the next room. And then, do the same thing all over again until it gets to the final room.
And there are going to be repeated mappings and reward functions at all these levels. But they might be independent from each other. In some cases, they may be joint. There might be a popular, overall room type in which certain mappings down here tend to go with a particular higher-level one. And if you're going to be smart about it, you should be able to learn that as a special case, even though things are more independent otherwise.
And, again, I wouldn't have shown it if it wasn't helpful. But he showed that this hierarchical Dirichlet process model, which basically learns both covariance structure and independent structure, outperforms both of the models that we had developed before and, of course, the overall joint model. So this is at a purely computational level. We haven't tested this particular thing in humans or in animals. But it's just a way of trying to think about abstraction or generalization, or the way in which clustering would have to happen if we're going to benefit from natural structure in the world.
And we don't yet know how the brain would actually do that. We haven't implemented a neural network that can do this. This is purely a Bayesian model. But, briefly, just to show that people do do it empirically-- at least, not in the hierarchical version, but in a simpler grid world-- we can construct environments that are joint. So this is an environment in which you see a couple of different contexts. They look different from each other, but they both have both the reward in the bottom-left corner and this mapping function.
These other two contexts go with this mapping function, and that one, and so forth. And that would be an environment that has joint structure. So, once you know the reward function, it tells you the transition function, or vice versa. And so, our agent can infer whether to use the joint actor or the independent actor. When it's exposed to this environment, it increases the probability of assuming that the world is joint. But you can also construct an environment, of course, that is completely independent. And then, you get the opposite.
So the question is, what do people do when exposed to these different grid worlds that have different structure? And without going into the nitty gritty of all the details, all I'm showing here is the degree to which people's performance is similar to this, what we call the meta agent that infers whether it's joint or independent. And what we found was that people are smart. They generalize in a way that accords with the statistical structure that they've seen.
So, if the world, overall, seems independent, when they go to a new environment, they assume independence. And they generalize that way. But if the structure that they've seen before is joint, they look more joint. One other point I want to make is that we do have some neural networks that can do things like compositionality. And I'm just going to stick with the musical theme here. I had a postdoc in my lab, Chris Calderon, who built a recurrent neural network in which populations of neurons interact with some basal ganglia circuitry that then selects actions.
But the compositionality he was studying here was how an agent can learn to play musical sequences independent of their rhythms. So imagine you learn a song and you learn another song in a different rhythm. You want to be able to combine the two of them. You don't want to have to learn a song only in one rhythm and have to, then, learn it from scratch when it's in a different rhythm. Let's see if this works.
This is mostly just an advertisement for a fun demonstration. Oh, I forgot to say that he called the model the ACDC model-- the Associated Cluster-Dependent Chaining model, I think, is what it stands for. So I said, if you're calling it the ACDC model, you better make it play an AC/DC song. So I'll put the model to the song "Thunderstruck." And what you're going to see here is just the trajectory of the recurrent neural dynamics as it's playing the intro to "Thunderstruck."
[MODEL BEEPS TO "THUNDERSTRUCK"]
I don't know if any of you recognized that. It's super simple.
[MODEL RESUMES BEEPING]
And he showed that, if the model had also been trained to play something in a bossa nova tempo for a different song, you can combine that tempo with "Thunderstruck." The very first time, it can play "Thunderstruck" in a bossa nova tempo because it has this inductive bias to do things [? compositionally. ?]
[MODEL BEEPS IN A DIFFERENT TEMPO]
OK. So, for the last part of the talk, I just want to focus on-- so far, we've considered transfer in which an agent can transfer specific transitions or goals-- rewards-- like learning a C scale and transferring it to other guitars and to pianos. So you learn it on one guitar. You transfer it to another guitar. If you're learning in a smarter, compositional way, you should be able to learn things on one instrument and transfer them to another.
But, for the last part, I want to focus on something that is more related to abstraction, which is-- in the example of guitars, let's say you've learned a particular scale, like C, D, E, F, G, A, B, on a guitar. Even if you're going to transfer it within a guitar, you might want to learn a different scale that has a different ordering of the sequences of notes that you need to play, a different set of songs that you want to play.
But if you're learning it in an abstract way, you should be able to learn a new scale quickly, even if you haven't seen this particular sequence before. And I'll try to illustrate that in a minute. And this is going to be a situation, if you're familiar with other models of generalization in reinforcement learning, where both the rewards change-- so which things you want to do, the rewards that you get, are going to be completely different-- and the transitions change. But yet, the abstraction is preserved. So the way in which notes map onto different parts of the fretboard is preserved.
And there are a lot of examples of that. Again, not just in music. In speech, for example, when you're trying to speak, you're trying to say words with a certain volume and pitch-- maybe accent, in some cases. And maybe your tongue is in one part of your mouth or another. But you learn the abstraction of how all of those things mean the same thing, in terms of the phoneme that you're trying to produce. It's something that is effortless, but you're learning that abstraction and transferring it immediately.
So, in the example of the guitar, you might want to learn that this is the fretboard of the guitar. So this string and this fret-- this is a C note, and so is this, and so is this, and so is this, and so is this. They all actually mean the same thing in note space, even though, in movement space, they're totally different. If you learn that abstraction, then, if you learn one scale, you can immediately learn another scale on a different part of the fretboard. So that's the gist of it.
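A toy way to see that many-to-one mapping (the tuning and fret arithmetic here are just standard guitar facts used for illustration; nothing in it is the model):

```python
# Several physically different (string, fret) positions map to the same note,
# i.e., the same abstract state. Standard tuning; positions chosen for illustration.

NOTES = ["E", "F", "F#", "G", "G#", "A", "A#", "B", "C", "C#", "D", "D#"]
OPEN_STRINGS = {6: "E", 5: "A", 4: "D", 3: "G", 2: "B", 1: "E"}

def note_at(string, fret):
    start = NOTES.index(OPEN_STRINGS[string])
    return NOTES[(start + fret) % 12]  # the chromatic scale repeats every 12 frets

# Three different movements, one abstract "C" state:
print(note_at(5, 3), note_at(6, 8), note_at(2, 1))  # -> C C C
```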
So we talked about this corticostriatal circuitry for this kind of transferring. But one question is, how can you learn to compress this representation? Instead of this high-dimensional state space, how can you learn a compressed representation that would allow you to, effectively, transfer it? And so, this is going to be all work by Lucas Lehnert, who was a grad student with Michael Littman in computer science and with myself. And we got these models to perform a bunch of tasks, including navigation tasks and so forth, that transfer in ways that some other models don't.
But for the musical theme, I'm going to focus on just the guitar-playing task here. So if you're learning to play the guitar, what you're trying to do is figure out, again, what button to press or what string to press to make a certain note. And you might want to transfer that to other situations. And, as I mentioned, a guitarist can internalize how these notes are mapped onto the fretboard of the guitar which, then, accelerates the learning of new sequences.
And so, the question is, how can reinforcement learning algorithms construct these kinds of internal representations that will transfer? So this is Lucas. And I'll very briefly mention some work that we're doing empirically with Alana Jaskir on how this might relate to the brain. So the usual way in which a deep network performing reinforcement learning would work is, it tries to maximize reward. That's what reinforcement learning algorithms do. They try to get as much reward as possible.
So if you're going to connect something that has a high-dimensional state space into some internal representation through a sequence of layers, you would then get it to try to produce a particular note. And the latent space-- what the neural network will learn-- is going to be informed by the feedback about whether you got it correct or not. You get a reward, and that's going to modify this internal representation. So what the neural network learns is going to be informed by just getting the most points.
But what Lucas worked on was called a reward-predictive model. So it still uses a reward. But instead of trying to maximize reward, it tries to figure out, what is the representation I can have that's going to allow me to still predict sequences of rewards?
So it's going to say, I'm going to predict that I'm in this particular state of the world. And if I take this action, here's a sequence of rewards that I'm going to get. And if I can produce that-- make that prediction correctly, I'm going to collapse the states that are all similar to each other that still allow me to make those predictions. That's the gist of it.
And so, now, instead of reward maximization training the neural network, it's these reward-sequence errors that are going to train the internal representation. And then, once you've learned the abstraction, you can throw away the details of exactly the transition function, the reward function, all that. All you're going to keep is the way in which things are equivalent to each other. And the question is, does that provide you usable knowledge? Here's the thesis statement from Lucas's dissertation, which is that learning an internal representation that's detailed enough to predict reward sequences prevents overfitting to one task and allows you to accelerate learning across previously unseen tasks.
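In rough formal terms-- this is a paraphrase of the idea, not a quotation from the dissertation-- a state abstraction phi is reward-predictive if collapsing states through phi still lets you predict the entire reward sequence for any sequence of actions:

```latex
\phi(s) = \phi(s') \;\Longrightarrow\;
\mathbb{E}\left[ r_t \mid s,\, a_1, \dots, a_t \right]
= \mathbb{E}\left[ r_t \mid s',\, a_1, \dots, a_t \right]
\quad \text{for all } t \text{ and all action sequences } a_1, \dots, a_t.
```

A reward-maximizing abstraction only has to preserve the best achievable return in the current task, which is a weaker and much more task-specific constraint.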
So I'm not going to show it here, but he showed it in simple Markov decision process situations, like simple grid worlds, that-- of course, you can get it to do well by compressing the state space to maximize reward. But the amount that you have to compress will then not work very well if you have to learn new grid worlds. But if you're doing it in a way that allows you to predict sequences of rewards-- if there is a hidden abstraction, you'll be able to find it.
So, in the guitar example-- again, I'm showing you that there are these different locations that are all equivalent to each other. And I also mentioned already that, even though this seems really difficult and you have to be a really expert musician, I would submit that this is the kind of thing that we, as humans, are doing all the time. Speech is one example, but there are a lot of examples of this.
The idea is to try to make all these things equivalent to each other so that you could then transfer the most compressed representation possible that still allows you to know what's relevant. So, to transform this into a Markov decision process, he had an agent start in a state. And if it played the C note, it would get a reward, because it's trying to play a particular scale. And the only way it would keep getting rewards is if it did this particular order-- C, D, E, F, G, A, B. Otherwise, it wouldn't get rewarded.
And what he showed here is that, if you're learning the initial scale here-- in green is this reward-predictive model that is trying to predict the sequences of rewards while also trying to do the right thing. And in orange is the model that is compressing by trying to maximize reward. If you're familiar with it, it's using something called a successor-feature model. And it can do just as well, initially, when it's just the initial sequence that it's learning.
AUDIENCE: Yes, I just wanted to ask, what do you mean by the reward sequence, once again? Is the abstraction, essentially, coded in the reward sequence of the fact that you go from C, D, E, G, and it was F, G. And now, [INAUDIBLE] can be mapped to a sequence of rewards? Is that the idea?
MICHAEL FRANK: Now, it's basically saying, OK, if I were to take this high-dimensional state space and compress it such that A and X are equivalent to each other so that, as far as I know, I'm going to make all my planning based on that compressed representation, would I still be able to predict the actual, observed sequence of rewards that I get? So a simpler example-- which, maybe, I should have shown, but I know I'm putting a lot into this talk-- is, let's say, you have a 3 by 3 grid world, and one column of it is always rewarded.
If you're a reward-maximizing model-- and, yes, let's say it's the right column-- if you're reward maximizing, you can compress that down to just one state. It just says, go right. I don't really care what column I'm in or what X, Y location I'm in. I'm just going to get rewarded if I go right. If you're not compressing anything, you have the 3 by 3 representation. But if you're the reward-predictive model, you would say each of these rows is equivalent to the others-- so I'm going to compress along the row dimension-- but the columns are different.
And now, if you go to a new grid, in which a different column is rewarded and even the actions that you need to take to move between columns are different, as long as the thing that matters is columns, it will be able to maximize reward in that new column world, whereas the reward-maximizing one would not.
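To spell that 3-by-3 example out (a toy sketch with made-up labels, just restating what was said above):

```python
# Toy 3x3 grid where only the column matters (say the rightmost column pays off).
states = [(row, col) for row in range(3) for col in range(3)]

# Reward-maximizing compression: keep only what the current task needs ("go right"),
# e.g., collapse everything into a single state -- useless once the paying column changes.
reward_max_abstraction = {s: 0 for s in states}

# Reward-predictive compression: collapse rows (they never affect the reward sequence)
# but keep columns distinct, because reward sequences depend on the column.
reward_pred_abstraction = {(row, col): col for (row, col) in states}

# In a new grid where a different column pays off -- even with different key mappings --
# the column-preserving abstraction still supports fast learning; the one-state abstraction does not.
```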
AUDIENCE: So can I ask a follow-up question on that? Essentially, does that require that, when you go from one world to another, there are-- in the first instance, I guess you can say there is a subset of states, and those map to a subset in the other states. That's why you can re-use the compressed representation of sequences in some sense.
MICHAEL FRANK: Are you saying, does this depend on us creating tasks in which the abstraction is the same when the agent moves.
AUDIENCE: Yes. In the sense that, if you go from a guitar to a piano, for example, there is the abstraction of a fret or a note. And there's a 1-to-1 mapping. What I'm wondering is if it is necessarily required that there be a mapping between the two different worlds for this to work out.
MICHAEL FRANK: Between the two rewards?
AUDIENCE: The two different worlds that you have.
MICHAEL FRANK: Oh, I see. Yes, so we addressed that in a number of ways. You're definitely right that, if we only assume that there's one abstraction-- one compression-- and we go to another world, it may work if that world happens to have that same abstract rule, but it wouldn't otherwise. And so, we use the same trick with the Chinese restaurant process, where we build a number of abstractions. When you go to a new world, what you're doing is inferring, does it look like that previous one? So does this look like a guitar, or does it look like a piano?
If it doesn't look like any of them, it has to create a new one altogether.
AUDIENCE: In order to create that abstraction, do you have to explore all possible states? Otherwise, how do you know that-- in a 3-by-3 grid-- you have to go right, if you haven't explored all the possible states?
MICHAEL FRANK: Yes. That's an excellent question because it motivates where we're going with this, which is this model only-- the compression only works if you sufficiently explore the environment.
AUDIENCE: Exactly.
MICHAEL FRANK: And if you try to compress too early, you get into these local minima, which is going to be really bad.
AUDIENCE: Can go really wrong, right.
MICHAEL FRANK: And this is why-- I'll mention it briefly, but we think it's happening during replay. So the stuff that Matt Wilson was referring to a bit this morning: while you're learning a new world, you're actually using the high-dimensional representation-- some hippocampal, conjunctive representation. But after you've explored it enough, offline you might recognize that things are similar to each other in the reward-predictive space. And then, you start compressing in ways that allow you to transfer it.
I'll get to that in a minute, but that's the gist of it. Just because we could, we also output this to something that just makes sounds. So if you look at the initial sequence, scale 1-- for some reason, I can't play the things. But it doesn't matter. It sounds good, here, for both models. And when it transfers from a C scale to a G scale or some other scale, the reward-maximizing algorithm-- even if it's trying to compress, according to the state-of-the-art other models using successor features and so forth-- waffles around, because it depended on the specific transitions between latent states.
Whereas the reward-predictive model doesn't care about any of the specific transitions. It just cares about similarities. So a C here and a C here are always the same. A D here and a D here are always the same. But I can put them in whatever order I want, and it'll still be able to learn quickly. That's the gist of it.
And so, this is getting at what I was just mentioning in response to the last question, what Alana is exploring here, conceptually and with some human experiments, for now is that, perhaps, because the algorithm requires you to learn everything, you have to explore the environment enough to be able to compress it. And perhaps, during the initial experience, you're just using a high-dimensional representation. Do I have a picture? No, I don't have a picture. So, during rest, you would replay, but not replay specific sequences that you've experienced before in their order, necessarily. But rather, you would replay items that are similar in reward-predictive representations.
So you might actually replay something that is not, I went from this state to that state to that state to get the reward. But if this state, in terms of its ability to predict sequences of reward, is similar to some other state altogether, the idea is that you would actually replay those sequentially, next to each other. You would sample from the part of your brain that's learning these reward-predictive sequences and replay them. And then, your cortex would consolidate those two to have a similar, abstract representation that would then allow you to generalize.
So that's the idea that we're exploring now, that would then allow you to do this kind of transfer. But we don't have any evidence for that, except that Alana created a task that shows that humans can do this kind of reward-predictive transfer, it seems like. But we don't even know if it's sensitive to sleep or anything like that.
AUDIENCE: Does this work, also, in the case of a very sparse reward? What if you only get a reward when you're in the win state? How do you reason on a sub-path if you didn't get any reward yet-- if you have only zeros?
MICHAEL FRANK: Sorry. If the reward structure is very sparse, then the question is, how can you compress?
AUDIENCE: Yes. If I understood correctly, you basically reason in the terms of the subclass. You don't go till the end, right? Is that what you said?
MICHAEL FRANK: Yes.
AUDIENCE: I didn't understand.
MICHAEL FRANK: So, yes, it's important that you don't start compressing the representation too early. So you would have to have experienced that sparse reward at some point. Then, you also would have to experience that, in order to get to that sparse reward, if there is a particular sequence that you needed to get there-- you would have to have experienced that enough. And then, this algorithm for compression is done offline, afterwards. So there would have to be some criterion for, how well do you know this environment? How much of the world have you experienced before you actually start using the abstraction?
And we also think that it's not-- even if you do the abstraction, your brain still has access to high-dimensional representations that it could still always use, if needed.
AUDIENCE: OK. Thank you.
MICHAEL FRANK: OK. Well, that's the gist. So the summary here is that we think that these hierarchical frontal corticostriatal systems interact to support not just basic reinforcement learning-- stimulus-response learning, as has often been studied-- but, really, structure learning across multiple levels of abstraction in hierarchical ways. And, if you have that inductive bias, that architecture imposed on you, it leads to a slower ability to acquire action contingencies, compared to having just a single circuit.
Because you have to do credit assignment across these different levels, you don't know which dimensions of the world are indicative of the higher-level states or the lower-level stimulus-action associations. But it affords generalization and transfer. And I motivated this in terms of my kids and the kittens and so forth. I'm not a developmentalist, but I did collaborate with Dima Amso's lab and Denise Werchan, who showed that even eight-month-old kids seem to show generalization that's consistent with this hierarchical structure. And it relates to activity in their prefrontal cortex.
And then, finally, I talked about how, in order to be able to learn to learn, there's this balance where you need to figure out how to maintain compositionality-- separating the things that you need to compose together in order to act-- while still clustering things that go together and are indicative of each other. And, finally, I talked about state abstractions for deep transfer and, potentially, a role of replay. So, with that, I just want to thank everybody who did this work-- Alana, Nick, Lucas, Rex and, mostly, my lab. And thank you for your attention.