Predictive Maps in the Brain
Date Posted:
April 7, 2020
Date Recorded:
April 7, 2020
CBMM Speaker(s):
Samuel Gershman
Description:
Sam Gershman, Harvard University
Abstract: In this talk, I will present a theory of reinforcement learning that falls in between "model-based" and "model-free" approaches. The key idea is to represent a "predictive map" of the environment, which can then be used to efficiently compute values. I show how such a map explains many aspects of the hippocampal representation of space, and the map's eigendecomposition reveals latent structure resembling entorhinal grid cells. I will then present evidence, using a novel revaluation task, that humans employ such a predictive map to solve reinforcement learning tasks. Finally, I will discuss the role of dopamine error signals in learning the predictive map.
SAM GERSHMAN: I'm going to talk to you guys today about predictive maps in the brain. Let me start with the big picture. So many of you have probably heard the term "cognitive map" before. And particularly in connection with the hippocampus, this idea has risen to prominence as people have discussed things like place cells and grid cells as the underlying neural substrate of the cognitive map. But what's remained somewhat murky is what exactly is a cognitive map? And from a computational perspective, what kind of cognitive map should we have in our heads?
So there's a kind of conventional notion of a map which we're all familiar with, which is a map of space. It tells you where you are. So we can imagine that this rat here, who is in this arena with a bunch of objects, has an internal model of the environment that tells it the spatial layout of the different objects: the Euclidean distances between itself and the different objects and between the different objects themselves.
So that's the conventional notion of a cognitive map. But from a computational perspective, I'm going to argue that a possibly more useful map is one that will tell you where you will be in the future. And I'll try to elaborate why that's useful and put some formal flesh on those bones.
So here's the outline. I'm going to talk about the conceptual origins of the cognitive map idea. And then I'm going to offer a different perspective, coming from a reinforcement learning framework and talk about what kind of cognitive map a reinforcement learning agent might want. And that's going to lead us to this notion of a predictive map, which has been implemented in various ways.
And I'll talk about one particular way, which is called the successor representation. And I'll use that idea to reinterpret a number of results from the physiology literature concerning place cells and grid cells. And then I'm going to take it one step further and ask how we can actually learn this predictive representation. And that's going to lead to a new interpretation of dopamine as a kind of vector-valued error signal that's used to update the successor representation.
And finally I'll talk about some evidence for the successor representation from humans. And if there's time, I'll also talk about some rodent studies, particularly with respect to dopamine.
So as I'm sure many of you know, the term "cognitive map" originated with Tolman. And Tolman, who was working at a time when American psychology was dominated by behaviorism, really struck out in a different direction. He did some very clever experiments that seemed to argue against the prevailing behaviorist dogma that all learning was based on reinforcement.
So here's an example of the kind of experiment he did. You put an animal, a rat, into this maze, where it would start in the circular part of the maze. And it would have to navigate through the circuitous path to get a food reward.
And then what he would do is he would start the animal again in the circular part of the maze, but he replaced the circuitous path with a radial arm maze. And what he observed was that the animals typically went straight along the shortest path to the food reward, even though they had never gone on this path before and never been rewarded for it. So that seemed to be problematic for a notion of learning that was driven purely by reinforcement. But it was compatible with the conventional Euclidean map notion of space, where the spatial coordinates of the food reward were represented in some kind of allocentric coordinate system.
Here's another example of one of Tolman's studies, the so-called latent learning study, where he would put an animal in the maze for a number of days without any reward. And then after day 10, he'd start rewarding the animal. And what he found was that these animals learned faster if they had received this preexposure compared to animals that started without any preexposure to the maze.
So again, this was problematic for a notion of learning purely by reinforcement. Clearly, the animals were learning something during this preexposure phase. And it was not driven, at least, by overt reinforcement.
So what exactly is a cognitive map? Let me try to make the classical, conventional notion more formal. So it's a set of landmarks embedded in a Euclidean metric space, encoding distances between landmarks. And that was the notion that was used, for example, by O'Keefe and Nadel in their 1978 book on The Hippocampus as a Cognitive Map.
So how is this map actually constructed and what is it good for? The standard algorithm for updating and using a cognitive map is known as path integration or dead reckoning. And it's very simple because position is just the integral of velocity. So if an organism can track its velocity over time and integrate it, then it can use the resulting integral to position itself in some point in this Euclidean map of space.
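To make the integration concrete, here's a minimal dead-reckoning sketch (an illustration under simple assumptions: discretely sampled 2D velocity and no sensor noise; the function and variable names are ours, not from the talk):

```python
import numpy as np

def dead_reckon(velocities, dt=0.1):
    """Path integration: position is the (discrete) integral of velocity."""
    position = np.zeros(2)
    for v in velocities:              # v is a 2D velocity sample
        position += np.asarray(v) * dt
    return position

# A meandering outbound path of random velocity samples...
rng = np.random.default_rng(0)
outbound = rng.normal(size=(200, 2))
# ...and the straight-line homing vector is just the negated position estimate.
home_vector = -dead_reckon(outbound)
print(home_vector)
```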
And one big advantage of that is that it can find the shortest path home without retracing its steps. So think about, for example, a desert ant, who is meandering around the desert. And it finds a food source. Now, instead of tracing its circuitous path, it could just calculate the shortest path back home. And, indeed, this is what desert ants do.
And one of the clever ways that people have shown that desert ants do this is you can pick up the ant and move it to some other part of the desert, and it takes a straight path directly to where its home should have been if it hadn't been moved. And this is also the technique that European navigators used. And it was used to make many of the discoveries that we know today.
But there are a number of problems with this classical definition. So path integration is useful in wide open spaces like deserts and oceans, where few obstacles exist. But think about a complex environment of the sort that mammals live in, like rodents or humans. The shortest Euclidean path is not going to be very useful, right? Like if I wanted to chart the shortest Euclidean path to my car, I'd have to go through lots of walls. So it's really only useful for route planning over short distances.
There are also more general issues here. So what path integration does is solve the problem of how do I update my spatial location and use that to potentially get home. But what about multiple goals, or varying reward magnitudes, or uncertainty about rewards, uncertainty about my spatial location, or action costs? The issue here is that path integration is not a solution to the full decision problem that's facing most organisms. And so if we really want to understand what the brain is doing in these kinds of tasks, we need to understand the more complete decision problem.
And that's where the reinforcement learning framework comes in because it at least gets us closer to the decision problem that we think the brain is solving, which is optimizing some sequential decision policy in a long-term-- over some long-term horizon. So in the basic reinforcement learning setup, you have an agent who is taking actions. And those actions are impinging upon the environment and causing a transition in states and the delivery of rewards to the agent.
And the standard objective for the agent is to maximize cumulative future reward or possibly discounted cumulative future reward. And path integration is not a general solution to this problem. So if we're looking at this from a reinforcement learning perspective, how do cognitive maps fit into this picture?
So to give you some background, I'm going to quickly go over kind of the current state of the art in thinking about reinforcement learning in the brain very briefly. And then I'm going to talk about how cognitive maps, and in particular predictive maps, fit into that framework.
So let's imagine this rat, who's trying to maximize its cumulative future reward. And it's moving between these states, which are here defined as particular physical locations, S1, S2, S3. So you can see that there are various kinds of items of food and water. And so what the animal will need to do is chart a path that takes it through these states and maximizes future reward.
Now, broadly speaking, there are two ways to solve this kind of problem. One is called model-free because the way it works is you're learning some kind of cached value function that maps states and actions to values. In the simplest case, you could just have a lookup table that tells you the cumulative future reward. And this is the technical definition of value: cumulative future reward, or discounted cumulative future reward.
Using this kind of approach, you can circumvent the need to build an internal model of the task. You just learn the elements of this lookup table through interaction with the environment. And it turns out that there's a simple algorithm for updating the elements of this lookup table, known as the temporal difference learning algorithm. And it takes advantage of the fact that if the environment is Markovian, so if your transitions and rewards depend only on your current state and action, then the value function can be decomposed recursively into what's known as the Bellman equation. And that Bellman equation can be entered into a stochastic approximation procedure for approximating the elements of this lookup table. And there are more general elaborations of this basic idea. So you can replace the lookup table with a linear function approximator or even a nonlinear function approximator, like a deep neural network.
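As a rough sketch of that stochastic approximation idea, here's a toy tabular TD(0) update (the learning rate, discount, and table size are illustrative assumptions, not values from the talk):

```python
import numpy as np

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """One tabular TD(0) step: nudge the lookup-table entry V[s]
    toward the Bellman target r + gamma * V[s_next]."""
    delta = r + gamma * V[s_next] - V[s]   # the reward prediction error
    V[s] += alpha * delta
    return delta

V = np.zeros(10)   # value lookup table over a hypothetical 10-state task
# td_update would be called once per observed transition (s, r, s_next).
```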
All right, so that's the model-free approach. And one reason that the model-free approach has been so influential within neuroscience is that there are these seminal observations from Wolfram Schultz, and by now many other people, that the phasic firing of dopamine neurons in the midbrain seems to track the prediction error that's being used to update the value function. So you see in this previous slide the value function is being updated incrementally by this prediction error delta, the discrepancy between received and expected reward. And there's a lot of evidence now that dopamine signals something like that, although I'll revisit this a little bit later in this talk.
Now, the other broad approach to solving sequential decision problems is known as model-based algorithms. And here the idea is to build an explicit representation of the Markov decision process and then use that to plan explicitly. So the Markov decision process here is parameterized by a transition function that tells you which states you'll go to when you take particular actions and a reward function that tells you how much reward you'll get when you visit particular states. And then to actually use that model to generate plans, you can use a classical approach like dynamic programming, known as value iteration in the reinforcement learning literature, or you can use various kinds of forward planning approaches, like Monte Carlo tree search.
So these approaches have different operating characteristics. The model-based approach is computationally inefficient because it requires planning, but it's very flexible. So if there's some change in the environment, that allows the agent to make local changes in its model, right? Like, for example, if there is some barrier that prevented the agent from moving through this environment, then you just have to update that one part of the transition function. And then when you do planning, your planner will automatically take that into account. So that endows the model-based approach with some flexibility.
And this contrasts with the model-free approach, which is computationally efficient because in the simplest case you just have to inspect your lookup table to know what to do. But it's inflexible because the recursive structure of the value function means that even local changes to the environment translate to non-local changes in the value function. So you might end up having to relearn all or a large part of the value function.
And it turns out that both of these systems seem to be used by the brain, although not necessarily at the same time. So I'll give you a very brief caricature of the classic task for demonstrating this, that was developed by Tony Dickinson in the 1980s. This is known as the devaluation paradigm.
So let's imagine that there is a rat. He's pressing a lever for food. And he does this for a while. And then you take the lever away and give the rat the food freely, but you devalue it, for example, by pairing it with illness. So now if you offer the food to the rat again, it doesn't want to eat the food. It's learned a food aversion.
Now, the key question is what happens if in the choice test you put the animal back in front of the lever? So think now about model-free reinforcement learning. So model-free reinforcement learning is really a kind of souped up version of Thorndike's law of effect. It says that if I took an action and I got rewarded for it, then I'm going to continue taking that action in the future. So it predicts that the animal will actually continue pressing the lever to obtain food that it doesn't actually want because it's always been reinforced for the lever. Remember, the lever is absent during the devaluation.
The model-based system, in contrast, would know something about the causal structure of the environment. It would know that if it pressed the lever, then it's going to get food. And if it eats the food, then it's going to get sick. So that by using this causal knowledge about the environment, it can avoid pressing the lever to obtain food it doesn't want.
So these two systems would make separate predictions about what the animal should do. And it turns out that you can see evidence for both of these predictions depending, for example, on how much you train the animal. So if you moderately train the animal, then it appears model-based in this test phase. So it abstains from pressing the lever. But if you overtrain it during the initial instrumental learning phase, then it will persist in pressing the lever to obtain food that it doesn't want. So it seems like there's some kind of shift in control from model-based early on to model-free later on.
So with all this as background, it might be tempting to interpret the cognitive map as basically a form of model-based RL. That is how some people have interpreted the cognitive map. So, for example, the cognitive map might encode something like the transition function between states. And then we can think of navigation as being solved by some model-based planning.
I'm going to offer a different account of the cognitive map, which is a kind of predictive code. So the idea here in a nutshell is that the cognitive map encodes predictive statistics about upcoming states. And this looks superficially a little bit like a transition function. But it's not a transition function.
And this goes back to an idea that was proposed a number of years ago by Peter Dayan, which he called the successor representation. And the real computational value of the successor representation is that it renders value computation a linear operation. So if you have the successor representation, then you can get some of the computational efficiency of a model-free approach. But as we'll see, it also endows an agent with some of the flexibility of a model-based approach. So in some ways it's a kind of middle ground between model-based and model-free algorithms.
So to give some intuition for how this works, let's think about this environment where we have a bunch of neurons that are spatially tuned. So we can think of these as place cells, for example. And they're labeled by their preferred spatial location.
So now the question is, if the animal is in state 1, what are the firing rates of all these different neurons? So if we think about the spatial map here, with tuning functions encoding Euclidean distance, then the states that are farthest away in Euclidean distance are going to show the weakest activation. So states 2 and 4 are going to show weak activation compared to state 3.
And now compare this to a map that encodes predictive statistics. So now the firing of a neuron corresponds to how often an animal is going to be in its corresponding preferred spatial location in the future, given its current location. Then it changes the picture. So now look at the neuron whose preferred spatial location is state 3. Because the actual geodesic distance between 3 and 1 is very long, even though its Euclidean distance is very short, it's going to show weak activation because if the animal is in state 1, then it's unlikely to visit state 3 in the near future.
So this gives you the basic intuition of the predictive map that I'm going to talk about more formally in a moment, which is that the neurons are going to fire in proportion to the predicted future state occupancy, rather than distance in space. And that's going to render it sensitive to things like topology and geodesic distance.
OK, so here's the formal definition of the successor representation. So let's imagine this big table that's shown in the bottom right, where the rows and columns correspond to states. And we can think of the rows corresponding to some initial state and the columns corresponding to some destination state. And we'll envision an agent who is traversing the state space according to some policy.
And we can ask how often is this agent going to visit a particular destination state after starting a trajectory in the initial state? But in addition, we're going to discount these occupancies exponentially. So occupancies that happen far in the future, relative to the time at which the agent started its trajectory, are going to be downgraded. And that's what this gamma term means in the equation up here. So we're going to count up the number of times the agent visits a particular state, discounting exponentially. And then this expectation operator is taking the average over randomness in state transitions and actions.
So that's the definition of the successor representation. And it turns out that in a Markov decision process the value of a particular state is simply the inner product between the successor representation and the reward function. So the intuition here is that if I want to know my cumulative future reward in a particular state, what I should do is consider all the possible states I'll visit in the future, and how much reward I'll get in each of those states, and how often I'm going to visit those states. And then by multiplying those things and summing them up, I'll get the expected future reward-- cumulative future reward for that current state.
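Written out, the successor representation is M(s, s') = E[ sum_t gamma^t * 1(s_t = s') | s_0 = s ], and the value function is the inner product V(s) = sum_s' M(s, s') * R(s'). Here's a minimal numerical sketch (the three-state ring and its reward function are toy assumptions of ours):

```python
import numpy as np

def successor_representation(T, gamma=0.9):
    """Closed-form SR for a fixed policy: M = (I - gamma * T)^(-1),
    where T[s, s'] is the policy-induced transition probability."""
    return np.linalg.inv(np.eye(len(T)) - gamma * T)

# Toy three-state ring: state 0 -> 1 -> 2 -> 0, reward only in state 2
T = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
r = np.array([0., 0., 1.])
M = successor_representation(T)
V = M @ r   # value = inner product of each SR row with the reward function
print(V)    # states fewer steps away from the reward have higher value
```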
And another important feature of this is that for a Markov decision process, the successor representation can actually be learned using a form of temporal difference learning. I think actually I have this in the next slide. Yeah. So it turns out that the successor representation obeys the Bellman equation.
And that's what's shown at the top here. I'm showing this for a single row of the successor representation, so the row corresponding to a state. So the expected future occupancy of all the other states, which is encoded in this m vector, equals the expectation of the current state occupancy, which I'm showing with this x vector, plus the discounted future state occupancy at the successor state. And this expectation takes the average over the state transitions.
So this is analogous to the Bellman equation for the value function. But here we're defining it for state occupancy rather than rewards. And, in fact, we can generalize this to a feature-based version of this Bellman equation, where x now corresponds to some feature vector. And if one of those features corresponds to immediate reward, then that corresponding component of the successor representation is simply the classical value function.
And we can use the temporal difference learning rule, which is basically a stochastic approximation of this Bellman equation, to update these expected future occupancies. So if you entered into a state which you weren't expecting to visit, then that's a positive prediction error. And you're going to increase the expected occupancy for that state.
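Here's a sketch of that update (vector-valued TD learning for a tabular SR; the learning rate and identity initialization are standard but illustrative choices):

```python
import numpy as np

def sr_td_update(M, s, s_next, alpha=0.1, gamma=0.9):
    """One TD step on row s of the successor representation. The
    prediction error is vector-valued: one component per future state."""
    onehot = np.zeros(M.shape[0])
    onehot[s] = 1.0                        # occupancy of the current state
    delta = onehot + gamma * M[s_next] - M[s]
    M[s] += alpha * delta                  # positive error: raise expected occupancy
    return delta

M = np.eye(5)   # SR for a hypothetical 5-state task, initialized to identity
```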
So with all that theoretical background, we can now turn to the physiology data. So I won't belabor this. I think probably many of you are familiar with the basic physiology of place cells and grid cells. We have place cells, which are classically defined by a spatial receptive field. So these are neurons that fire when an animal is in a particular location in space. And then grid cells are also spatially tuned. But they're periodic. So they arrange themselves into this hexagonal grid.
So the theoretical interpretation that Kim Stachenfeld, Matt Botvinick, and I developed in this 2017 paper interpreted place fields as retrodictive codes. So what I mean by that is they correspond to a column of the successor representation. In other words, a given neuron is telling you which states the animal was likely to have been in in the recent past. That's what a single cell's place field is.
And that means that the rows of this matrix correspond to the population code. So the population code is a predictive code; it's encoding a row of the successor representation. Whereas the place fields, the single-neuron firing fields, are retrodictive codes. By the way, please feel free to interrupt me if you have any questions.
So in the simplest case, we can ask what do these successor representation place fields, so to speak, look like in two dimensions? So if we just take a square open field, and you have a random walk in that field, then you get roughly radially symmetric receptive fields, except near the boundaries, where they're a little bit distorted. And that looks a lot like the radially symmetric place fields that you see in many hippocampus recordings.
But things get more interesting if you add more structure here. So keep in mind that the predictive map depends not just on your location in space, but also where you're going and where you came from. So a nice example of this was work by Mayank Mehta in Matt Wilson's lab, now a couple of decades ago. And what they did was they had animals running repeatedly in a particular direction along a linear track.
And what they found was that there was a backwards skewing of the place field. So as the animal ran more and more, place cells started to fire more and more at earlier locations, preceding the positions to which they were initially tuned. And this is exactly what you'd predict of a retrodictive code of the sort that I just laid out, because now these earlier locations on the track are predicting upcoming locations on the track.
Another implication of this theoretical framework is that there should be reward clustering because the predictive map is policy-dependent. So it depends on where the animal actually goes.
And in environments where there are inhomogeneities in the reward locations, the animal is going to spend more time, naturally, in the places where there's more reward. And so you're going to see greater representation, greater occupancy, in the predictive map. And that's what you see experimentally. The place cells tend to cluster near rewarded locations. And that's what you get in the simulations.
Another interesting implication is that you get geometric constraints on the representation. So the place cells will distort around barriers. And that is indeed what's been shown in a number of studies. And you see that also in the simulations, where you add barriers and the place cells are distorting around them because the predictive map acknowledges the fact that the animal can't pass through walls. And so it has to go around those barriers.
Here's an example of another paradigm that was originally developed by Tolman, called the Tolman detour maze. And in this setup, Tolman trained the animals in these mazes. And then what he did was he blocked off part of the maze. And interestingly, the animals immediately switched to a different path, which they weren't accustomed to taking, to get to the reward.
And what was observed here in this particular study was that the place fields in state 1 seemed to specifically distort near where the barrier was inserted, but not far away from the barrier. And this is what you also see in the successor representation because the main changes in the predictive map are going to happen near the boundary, rather than far away from the boundary.
So I've been talking about spatially defined states. But actually the successor representation is more general than this. The states can be defined arbitrarily, as long as they conform to a Markov decision process. So, for example, we can look at the-- actually, sorry. I'm jumping the gun a little bit. So before I get to that, let me just mention this one study where they were looking at the effects of both space and time.
And this was a clever study in humans, with fMRI, where they had people navigating this virtual environment. And they inserted these teleporters there. So they could basically dissociate space and time by teleporting people from one part of the virtual environment to a very far away part of a virtual environment. And they showed that hippocampal patterns were sensitive to both spatial similarity and temporal similarity. And this makes sense from the perspective of the successor representation, where both space and time are going to determine the predictive statistics of the map.
So now let me come back to what I was saying before, which is the nonspatial states. So here's an example of a study by Anna Schapiro, where she had subjects traversing this state space, where the states were represented by these fractal images. And unbeknownst to the subjects, the state space was organized into this particular structure where there are communities, or modules.
And what she showed was that the hippocampus was sensitive to this community structure, such that states within a common community showed more similar pattern representations in the hippocampal BOLD signal compared to states that cross community boundaries. And this falls naturally out of the successor representation, because states within a community are more likely to have high future occupancy with one another than with states in different communities.
We can also use this theoretical framework to make sense of some data that takes us a little bit away from place cells. So one classic finding from Fanselow, and many others since then, was that if you do a contextual fear conditioning experiment where you put a rat into a box and then shock it, the standard finding is that the animal is going to learn a conditioned fear response to the context. But interestingly, the fear response is much greater if you preexpose the animal to the environment before shocking it. So this is known as the preexposure facilitation effect, or the immediate shock deficit.
And one idea for why this happens is that during this preexposure phase, the animal is wandering around this environment and it's learning the predictive relationships between the different parts of the state space, such that when the shock arrives, the animal is going to be able to generalize what it learns in one particular part of the state space to all the other parts of the state space. But it can't do that if you shock it immediately and it didn't have time to explore the environment.
So we show this in simulation, that first of all you get this preexposure effect. But the other critical data point is that if you lesion the hippocampus, you no longer get this preexposure facilitation effect. And that's also true in our simulations, of course, because if you can't learn this predictive map, then you can't take advantage of the generalization that it enables.
So now I'm going to turn from place cells to grid cells. Now, when people originally looked at these periodic firing fields, it was very tempting to think of them as something like a Fourier basis for place cells. But there are a number of limitations of that perspective. And without going into too much detail about that, let me just talk about the alternative interpretation that we'd like to argue for, which is grid cells as an eigendecomposition of the successor representation.
So the idea here is the same idea, that we're going to take this predictive map. But now the eigenvectors correspond to grid cells in this framework. And in an open field, some of these eigenvectors are going to look periodic, with different frequencies.
And it's tempting to map those different frequencies, which correspond to the different eigenvalues, onto the dorsal-ventral axis, where you see a gradient of frequency in the medial entorhinal cortex, so that on one end you have the smoothest eigenvectors, with the largest eigenvalues, and on the other end you have the least smooth eigenvectors, with the highest frequencies.
So if you take the open field, you're going to get an eigendecomposition of the SR that produces grid-like fields. And interestingly, these are also sensitive to structure. So, for example, if you introduce compartmentalization, you get compartmentalization of the eigenvector fields.
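Here's a rough numerical sketch of that computation (a random walk on a small square field of our own construction; symmetrizing M before the eigendecomposition is a convenience assumption):

```python
import numpy as np

# Random walk on an n x n open field (4-neighbor moves, walls block movement)
n = 20
N = n * n
T = np.zeros((N, N))
for i in range(n):
    for j in range(n):
        nbrs = [(i + di, j + dj) for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                if 0 <= i + di < n and 0 <= j + dj < n]
        for (a, b) in nbrs:
            T[i * n + j, a * n + b] = 1.0 / len(nbrs)

M = np.linalg.inv(np.eye(N) - 0.98 * T)        # successor representation
evals, evecs = np.linalg.eigh((M + M.T) / 2)   # eigendecomposition of (symmetrized) M
grid_field = evecs[:, -5].reshape(n, n)        # one low-frequency, periodic "grid field"
```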
And there is some evidence for this, although not a ton. So, for example, this is a hairpin maze that was developed by Dori Derdikman, where the animals are going up and down this track. And you can see that there is this kind of repetitive structure of the grid cells recorded in this task, where it's almost like the grid fields are being warped around the hairpin maze. And you see something like that also in the eigenvector fields.
Now one important question is, what exactly is the relationship between grid cells and place cells? Should we be thinking about the grid cells as literally taking the eigendecomposition of the place cells, or are they serving some other purpose? So a few things to consider. One is that grid cells do not seem to generate place cells. Grid cells, for one thing, develop after place cells. And removing entorhinal input to the hippocampus does not eliminate place cells, although it does mess them up a little bit, as I'll talk about in a second.
So what I'd like to propose is that the grid cells act as a kind of regularization network. And the reason you need such a regularization network is that place cells are primarily updated on the basis of noisy sensory input, whereas grid cells are updated primarily on the basis of self-motion cues. So, for example, in the dark, the place cell representation is going to degrade. But the grid cell representation won't, as long as it gets proprioceptive cues.
So if you can use the grid cells as a kind of smooth basis for approximating the place cells, then you can build a spectral regularization network that will basically denoise the place cells when they're being updated on the basis of noisy sensory input. So you can construct a smooth approximation of the place cells using the top K eigenvectors. And this is inspired directly by some old work from Tommy, actually.
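A minimal sketch of that regularization idea, on a one-dimensional track for simplicity (the noise model and the choice of K are our assumptions):

```python
import numpy as np

# Build the SR for a random walk on a 100-state linear track
N, K = 100, 10
T = np.diag(np.full(N - 1, 0.5), 1) + np.diag(np.full(N - 1, 0.5), -1)
T[0, 1] = T[-1, -2] = 1.0                     # reflecting track ends
M = np.linalg.inv(np.eye(N) - 0.95 * T)

# The top-K eigenvectors are the smoothest modes -- the "grid cell" basis
_, evecs = np.linalg.eigh((M + M.T) / 2)
basis = evecs[:, -K:]

place_field = M[:, N // 2]                    # one place cell's field (an SR column)
noisy = place_field + np.random.default_rng(1).normal(0, 0.05, N)
denoised = basis @ (basis.T @ noisy)          # project onto the smooth basis
```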
So what is the evidence for this? So one is that entorhinal cortex lesions reduce place cell stability and also discharge rate and field size. So even though the place cells are not destroyed, they do basically get messed up. They get noisier. And the other observation is that grid cells are stable across environments and in the absence of visual input. So it makes them a good candidate for serving this regularization function.
And I'll also just mention that you can write down a stochastic gradient descent rule that learns the eigenvectors of this predictive representation directly, without actually taking the eigendecomposition of the successor representation. So it's not like the grid cells need to necessarily be listening to the place cells and computing the eigendecomposition. They can actually learn the eigendecomposition autonomously. And this is a technique that we adapted from some earlier work on Laplacian eigenmaps.
So this is a more speculative part of the talk, trying to make the case that spatial structure is useful and it can be exploited using this eigendecomposition of the successor representation. So in particular, hierarchical reinforcement learning can exploit a structured decomposition of space, for example, by producing subgoals for planning. And you can do this with the eigendecomposition by segmenting along the eigenvector with the second largest eigenvalue, also known as the Fiedler vector. And this is closely related to the normalized cuts algorithm from Shi and Malik that is used in computer vision.
So the idea is that you can recursively decompose space on the basis of this eigendecomposition. And then you could plug this into a hierarchical reinforcement learning agent, which we showed provides a useful representation for reinforcement learning. So here's an example from some work by Matt Botvinick, where planning with options-- options is a technical term, one way of formalizing this kind of hierarchical structure-- can lead to pretty major learning speed gains. But as of right now, we don't have any direct evidence that grid cells are actually useful for this purpose-- or rather, that the brain is actually using grid cells in this way.
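A sketch of the spectral segmentation step (here via the normalized graph Laplacian, whose second-smallest eigenvector is the Fiedler vector; the toy two-room graph is our own example):

```python
import numpy as np

def fiedler_partition(A):
    """Split a state graph in two using the Fiedler vector of the
    normalized Laplacian -- the spectral core of normalized cuts."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    _, evecs = np.linalg.eigh(L)
    return evecs[:, 1] >= 0        # sign of the Fiedler vector labels the two parts

# Two fully connected 3-state "rooms" joined by a single doorway edge
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], float)
print(fiedler_partition(A))        # splits states {0,1,2} from {3,4,5}
```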
So I want to now come to some new experiments that we did-- well, they're not new anymore. They were new a few years ago-- that distinguish between model-based and successor representation accounts. So the successor representation has some of the flexibility of a model-based system, in the sense that it can adapt very rapidly to changes in the reward structure of the environment. But critically, because the SR compiles the transition information into a predictive code, it's going to be insensitive to changes in the transition structure, unlike a model-based system.
So let me show you the task that we developed. And I'll give you some intuitions for what this means. So we had humans play this task, where in the first phase they traverse two chains of states ballistically. And those are the same for both conditions. And then as a consequence of this initial learning phase, they learn that state 1 leads to much more reward than state 2. So if they were given a choice to express their preference between states 1 and 2, they would strongly prefer state 1.
Now, in the second phase we devalued these initial value functions by either changing the reward structures, so swapping the rewarded locations, or changing the transition structure, so swapping the transitions, as shown in the bottom over here. And these have equivalent effects on the value function at the initial state, states 1 and 2. And then in the test phase, we're going to re-evaluate people's preferences for states 1 versus 2.
So let me give you a few predictions from different theoretical accounts. A model-free learner is going to be equally insensitive to both forms of revaluation because it has no representation of either reward or transition structure. It only learns these cached values.
And one property of the temporal difference learning algorithm is that it requires these unbroken chains of experience in order to update the first stage values. But we basically prevented the learners from gaining this experience because we start people off in this middle state during the devaluation. So they never have a chance to go back to the first states and realize that this full chain of experiences will result in different rewards. So we've kind of disrupted the efficacy of the temporal difference learning algorithm.
Now, a model-based learner, in contrast, is going to be equally sensitive to both reward and transition changes because it explicitly represents them and will immediately propagate those changes to the value function. Any mixture of model-based and model-free algorithms will just be somewhere in between these two extremes.
The successor representation, in contrast, predicts sensitivity to reward, but not to transition changes. You can think of the successor representation as basically a partially compiled representation of the environment. So it represents the reward function explicitly.
But the transition function has now been compiled into this predictive map, where the individual transitions have been erased, in essence. So it only knows about long-term predictive relationships, not instantaneous transitions. And as a consequence, it's basically blind to these local transition changes. It needs to relearn the successor representation, much in the same way that in a classic reward devaluation task, a model-free algorithm needs to relearn the value function.
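The logic can be illustrated numerically (a toy version of the two-chain task; the state indices, rewards, and discount are illustrative): a stale SR handles a reward swap immediately, but not a transition swap, whereas recomputing values from the new transitions, as a model-based planner effectively does, handles both.

```python
import numpy as np

def sr(T, gamma=0.95):
    return np.linalg.inv(np.eye(len(T)) - gamma * T)

# Two chains: 0 -> 2 -> 4 (reward 10) and 1 -> 3 -> 5 (reward 1)
T_old = np.zeros((6, 6))
T_old[0, 2] = T_old[1, 3] = T_old[2, 4] = T_old[3, 5] = 1.0
r_old = np.array([0, 0, 0, 0, 10., 1.])
M_old = sr(T_old)                 # SR compiled under the old transitions

# Reward revaluation: swap the rewards; the stale SR adapts instantly
r_new = np.array([0, 0, 0, 0, 1., 10.])
print((M_old @ r_new)[:2])        # preference between states 0 and 1 flips

# Transition revaluation: swap the second-stage transitions instead
T_new = np.zeros((6, 6))
T_new[0, 2] = T_new[1, 3] = T_new[2, 5] = T_new[3, 4] = 1.0
print((M_old @ r_old)[:2])        # stale SR: preference does NOT flip
print((sr(T_new) @ r_old)[:2])    # model-based recomputation: it flips
```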
We can also consider various hybridizations of these, like some kind of mixture of model-based and successor representations. So, for example, the case where the successor representation is being used to initialize a model-based value function that's subsequently refined through tree search or dynamic programming. And then you'd predict partial sensitivity to transition changes, rather than complete insensitivity. And the upshot of these experiments is that we find evidence for the hybrid model, where you get differential sensitivity to reward and transition changes, just as we predicted from the successor representation, but you do see partial sensitivity to transition changes.
And we also included this control condition where there were no changes, just to confirm that learners were not going to exhibit any of these revaluation effects when no change had actually occurred.
Here's some new data that has not been published yet. This is work by Evan Russek in Nathaniel Daw's lab. So he took this task into the scanner. And the basic idea was that you could use neural representations of the future states as an index of how much we think people are relying on model-based computation or the successor representation. And in particular, he looked at face and scene classifiers for the case where the terminal states in this graph correspond to faces and scenes, which are things that we know how to decode from brain activity.
And then the idea is that the model-based predictions and the SR predictions will differ in how much they think that face versus scene activations should occur on a given trial. And we can basically compute a summary statistic of this, which we call SR alignment: how aligned are the activations of these category-specific representations in the brain with the successor representation predictions? And overall, we see evidence for higher SR alignment in these signals on transition trials.
So this is broadly consistent with this idea that on trials where there's transition devaluation, the subjects are basically failing to completely prospect about what the future states are. And they're falling back on the successor representation. So on average, predictive neural activity tracks outdated successor representation predictions.
Another interesting observation is that this alignment seems to be higher on error trials compared to correct trials, which supports the idea that these errors are really driven by failures in prospection about future states that are consistent with the successor representation. And when I say failures in prospection, what I really mean is failures to update the internal model and use that model to make predictions about future states.
So how is the successor representation learned? So remember I told you earlier about this influential idea that phasic firing of dopamine neurons encodes the reward prediction error. And what I'm going to argue for here is that you can take that same idea and generalize it in a pretty important way to learn the successor representation. So remember, the successor representation can be learned using the temporal difference learning algorithm.
We've now shown behaviorally that people act as though they're updating their successor representation consistent with a prediction error-- a temporal difference prediction error. And now we'd like to ask whether the dopamine signal itself is carrying this error signal. Now, a critical feature of this is that the error signals now need to be vector-valued because the successor representation is vector-valued.
And that's a fairly significant divergence from the classical theory, where the prediction errors are scalar. But it can still encompass the classical reward prediction error if we assume that one of the features in this vector representation corresponds to reward. So we're not throwing out the original theory. We're strictly generalizing it.
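To illustrate, here's the feature-based version of the error from before, with a toy feature layout of our own choosing in which reward is one component of the feature vector:

```python
import numpy as np

def vector_td_error(phi_s, psi_s, psi_next, gamma=0.95):
    """Vector-valued prediction error over features phi. If one
    component of phi is reward, that component of the error is the
    classical scalar reward prediction error."""
    return phi_s + gamma * psi_next - psi_s

# Hypothetical feature layout: [reward, cherry flavor, grape flavor]
phi = np.array([1.0, 0.0, 1.0])     # received one grape-flavored drop
psi_s = np.array([1.0, 1.0, 0.0])   # predictions learned when it was cherry
psi_next = np.zeros(3)              # end of trial
print(vector_td_error(phi, psi_s, psi_next))
# [0, -1, 1]: zero reward error, but a nonzero identity error
```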
So here's a piece of data from Geoff Schoenbaum's lab. And before going to that, I'll just mention that in the interest of time, I'm only going to talk about one little piece of this work. If you'd like to learn more about how we've explained a whole bunch of different findings from the dopamine literature, you can take a look at this paper that was written with Matt Gardner and Geoff Schoenbaum.
So I'm going to focus on one task that Geoff Schoenbaum's lab has used quite a bit, where they have blocks of trials in which they shift between values or identities. So you can hold reward identity fixed-- these are flavors-- and change values, so you can go from one to three drops or from three to one drop. Or you can change identities, so you hold the number of drops the same, but you change the flavor.
And one of their key results was that you see higher firing rates early during these identity shifts compared to late in the block. And they also see this for value shifts, which is not surprising given the classical theory. But it is surprising that you'd see this dopaminergic sensitivity to changes in identity because that's putatively not accompanied by any change in reward, by construction. And you see this also in the model, if we assume that dopamine neurons are encoding the successor representation prediction error, because many of the features in that error signal are going to correspond to sensory features, not just reward features.
So to be a bit more specific about what we're claiming here about dopamine neurons, one hypothesis is that the collection of dopamine neurons in the midbrain corresponds to a kind of population code for the successor representation prediction error. The idea here is that we have some tuning curves that are defined over these features. And the population response is some linear combination of these.
And so when we're talking about the ensemble of dopamine neurons in these kinds of tasks, we can ask whether that ensemble is carrying information not just about reward, but also identity. And I'll just mention here that it's possible rewards and punishments may be privileged features, which is why they can be read out from individual neurons. So we still need to explain why it is that if you just record single units in the midbrain, you find a lot of neurons that look like reward prediction error coding neurons. And that might be because reward is a particularly important feature to encode.
So let me give you some evidence for population coding that comes from the same task that I showed you before. This is recent work that was done by Thomas Stalnaker in Schoenbaum's lab. So what they found was that if you look at single-unit recordings, you can decode reward magnitude, but you cannot decode reward identity. But if you take the population recordings, you can actually decode reward identity, but only during the initial part of the block, so only during the first few trials of a block.
And that makes sense if you think about the successor representation prediction error idea, because the errors are only going to be non-zero in the beginning of the block, after the change. Over training within the block, they're going to gradually go to zero and stop encoding information about the reward identity. So that would explain why there's a dynamic change in the ability to decode flavor identity.
So just to wrap up this part, I've argued that the successor representation provides a significant generalization of the reward prediction error hypothesis for dopamine. And this enables it to account for a number of anomalous phenomena, but without discarding the core ideas that motivated the original hypothesis. And I only really told you about a small number of these phenomena. But you can look at that paper if you're interested in learning more.
And a number of people have suggested that dopamine neurons might do more than just reward prediction error coding. I think the value of taking this perspective is that it grounds that suggestion in a normative theory. So it's not just explaining the what, but also the why. Why would it make sense for dopamine to carry this kind of vector-valued error signal? It's because that's the kind of error signal that we need to update the predictive map.
So to conclude, the SR is an old idea. Peter Dayan first proposed it in 1993. But it's experienced a kind of renaissance in the last five or 10 years. And in particular, it's become a new, fertile concept in the discussion about multiple systems of reinforcement learning. And it adds to the kind of menagerie of different systems that have been postulated for reinforcement learning in the brain.
And I'd emphasize that it provides a framework for thinking about the brain's cognitive map and how it's used in the service of reinforcement learning, refocusing the cognitive map on a predictive conceptualization that I would argue is more useful for solving sequential decision problems. And I told you about a bunch of experiments that provide direct support for this representation.
And with that I'd like to acknowledge my collaborators here. So Kim Stachenfeld in particular, who worked with me on this idea, developing it as an explanation for hippocampal place cells and entorhinal grid cells. Ida Momennejad, Nathaniel Daw, and Evan Russek collaborated on the human experiments, the fMRI experiments. And Matt Gardner and Geoff Schoenbaum worked with me on developing this idea for dopamine. And with that, I'd like to thank you. And I'll take any questions.