Emily Mackevicius: Learning from a Computational Neuroscience Perspective
Date Posted:
June 2, 2014
Date Recorded:
June 2, 2014
CBMM Speaker(s):
Emily Mackevicius
Brains, Minds and Machines Summer Course 2014
Description:
Topics: Marr levels of analysis, types of learning (unsupervised, supervised, reinforcement), Hebb rule, LTP, correlation and covariance based learning, reinforcement learning, classical conditioning, conditioning paradigms, credit assignment problem, TD learning, model-free vs. model based learning; birdsong: behavior, how the brain produces song, refinement through reinforcement learning (Goldberg, Fee, J. Neurophysiology 2011)
EMILY MACKEVICIUS: Hi, can everyone hear me? Good? OK, awesome. All right, so I'm talking about learning from a computational neuroscience perspective. Josh briefly went through the Marr Levels yesterday, and you guys are familiar with this. There are three levels: the computational level, the algorithmic level, and the implementation level.
So Marr's original example in his vision paper was a cash register. So can someone say what the computational level would be with a cash register? Any volunteers?
AUDIENCE: Trying to add up the prices to find out what the final price should be.
EMILY MACKEVICIUS: Exactly, so it's addition, and there are some computational properties that addition should satisfy. For instance, the order of adding things doesn't matter, and if you add something to 0, you get the original thing back. But it's a very high-level computational theory. And then, what about the algorithmic level? No volunteers?
AUDIENCE: Maybe just how to implement the calculations. I mean, is it a Taylor series or some other kind of rule?
EMILY MACKEVICIUS: Yeah, exactly. So the way that you implement it. So whether it's carrying the 1, or, as you're saying, a Taylor series, which is more complicated or something. That's the algorithmic level. And then the hardware level in this case is a digital display. But some cash registers are mechanical, with mechanisms that are maybe doing some sort of carrying thing and coming up with the answer.
So those are the three Marr Levels for the cash register. And we can think about the Marr Levels for a wide variety of tasks. For example, object detection. So "Where's Waldo? I think he's here." And there are a lot of cases like this. For example, solving Jeopardy, chess, self-driving cars. You can think about them as a broad goal that you're trying to fulfill, then an approach or algorithm, and then the technical specifications of how you're doing it. A person could do Jeopardy or a machine could do Jeopardy. They might use different algorithms and also different hardware, but they have the same goal that they're trying to solve.
So this has been a very useful way of thinking about different problems. But I wanted to point out that it maybe doesn't capture all problems or all aspects of learning that we think about. So, for instance, I play cello. And if I'm trying to learn this cello piece, you could maybe think about it as me having some goal to play it well. But if it takes me several years to get to the level where I could even play it, it's less clear that it's really a single task that I'm doing. And part of my goal is coming up with sub-goals, and coming up with maybe the next goal. After I've learned this piece, what else should I try? So there are aspects of learning that might not strictly fit into this. Do you have a question? Oh, sorry.
And again, in terms of the algorithm, it's somewhat connected to having an input and output, but it seems like there might be something else going on as well. And the hardware matters in some cases. For the cello, the whole point is that I play it on the cello, and that I'm playing, and it's fun for me. So the hardware might actually matter, whereas in other cases it might not really matter. Anyway, I just wanted to bring that up about learning in general.
So learning, we often categorize it into different types of learning. We talked about unsupervised learning, where you're estimating the probability distribution of the data that you get. There's supervised learning, where a teacher gives you feedback on what the output should be, and you want to estimate the probability distribution of y given x. And there's reinforcement learning, where you get feedback, but it's really minimal supervision because you get feedback in the form of punishment or reward. You're not actually told what the answer should be. And these types of learning are often treated separately, but they're really not mutually exclusive, and they don't encompass everything.
So if I'm trying to learn how to be good at making conversation with people, I might use aspects of unsupervised learning, observing the conversations around me. I might use aspects of supervised learning if somebody is telling me or giving me an example of what sorts of things are interesting to say. And I might also use reinforcement learning if I can tell that people are bored with me or happy with what I said.
So these are types of learning, but they're not strict types of learning. And in addition, it's sometimes difficult to define what the task is. And part of the difficulty in learning is just segmenting what you're trying to learn into tasks and doing them in some sort of progression.
OK, so how does the brain learn? For awhile, people thought of the brain learning through strengthening of synapses. Ramon y Cajal said learning might occur through the strengthening of existing connections between nerve cells. And famously, Donald Hebb said, "When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficacy, as one of the cells firing B, is increased." So this is one mechanism that we think of with the brain learning.
And before I get into some of the math behind how this might happen, I want to review some notation. If you talk about neurons-- in this case, I'm just talking about rate model neurons. So neurons have a firing rate. And the firing rate v of the output neuron depends on some weighted function of its input. In general, it would approach this weighted function of its input with some time constant. But in steady state, if dv/dt is 0, then the output is just some function of the input, which often has weights. And I'm going to focus mostly on linear neurons, so it's just v equal to w dot u.
And in general, a neuron will have many inputs. And the output of the neuron is the dot product between the inputs and the synaptic weights. So for a given magnitude of input, the response is maximized when the input is parallel to the weight vector. That's the dot product. And you can think of this as the receptive field of a neuron. So the stimulus that best excites it aligns with the weight vector. Just a question. For an L1 magnitude-- instead of an L2 magnitude, we're given an L1 magnitude-- what's the best stimulus that excites the neuron?
AUDIENCE: Same?
EMILY MACKEVICIUS: Sorry?
AUDIENCE: Same input?
EMILY MACKEVICIUS: So the answer is the same input parallel to it? So in this case, let's say that you have two inputs, one along the x-axis and one the y-axis, right? And they're each weighted. This is your weight vector, and that's the input you're giving, right? So if you go along here, you'll get a certain output, which is the dot product of the two, right? Does anyone have another answer, because that's not quite right. There's something more, but I don't want to give it away by--
AUDIENCE: Would it just be the maximum-- the direction that's closest to parallel to your weight vector?
EMILY MACKEVICIUS: So the direction parallel to your weight vector?
AUDIENCE: Yeah, never mind.
EMILY MACKEVICIUS: Sorry.
AUDIENCE: No, I got it.
EMILY MACKEVICIUS: OK.
[LAUGHTER]
Yeah, so parallel to the weight vector is right if you have an L2 magnitude, which is the square root of the sum of squares. But let's say that now you're just adding up the inputs. And you have a certain amount of power that you can distribute to different inputs.
AUDIENCE: Just put it in the one that has the highest weight?
EMILY MACKEVICIUS: Exactly, yeah. So if it's given it a L1 magnitude, you would put it in the input that had the highest weight, and then that would excite the neuron. Make sense? OK.
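The L2-versus-L1 point can be checked with a few lines of Python. This is a sketch I'm adding, with made-up weight values:

```python
import numpy as np

w = np.array([0.2, 0.9, 0.4])   # synaptic weights (made-up values)

def response(u, w):
    """Linear rate neuron: output is the dot product of input and weights."""
    return float(np.dot(u, w))

# L2 budget (||u||_2 = 1): the best stimulus is parallel to the weight vector.
u_l2 = w / np.linalg.norm(w)

# L1 budget (sum(u) = 1, u >= 0): put everything into the input with the
# largest weight, as discussed above.
u_l1 = np.zeros_like(w)
u_l1[np.argmax(w)] = 1.0

print(response(u_l2, w))  # = ||w||_2
print(response(u_l1, w))  # = max(w) = 0.9
```

Both inputs "spend" the same budget, but the optimal allocation depends on which norm measures the budget.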
So first thing we're going to talk about, supervised learning, unsupervised learning, so some simple rules where you could change the synaptic weights. And I'm going to show some useful transformations that are implemented by these simple rules.
So the basic rule that everyone thinks about is Hebb's Rule, which is encapsulated by "neurons that fire together, wire together." So here, if your output times your input is high-- if they fire together-- then the weight will increase. And the weight change is slower than the firing rate change. So the firing rate reaches steady state quickly, and then, on a slower timescale, the weights change.
AUDIENCE: But wouldn't Tw be larger than Tn?
EMILY MACKEVICIUS: Oh, good call. Yeah, typo. This one is larger, so it takes longer. OK, so in neurons, you see a phenomenon that you can think of as Hebb's Rule, which is long-term potentiation. In this experiment, they recorded the amplitude of a neuron's response when probing an input to that neuron, so this is how much an input excites that neuron. And here-- so this is baseline-- the input was stimulated for one second at 100 Hz, which is pretty fast. And after this stimulation, after firing together, the neurons wired together. So now, when you probe the neuron, the response is higher. And there's a similar phenomenon in reverse, LTD, long-term depression. Here, if you stimulate a lot slower, so the cells tend to not fire together, the weights go down.
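The basic Hebb rule is easy to simulate directly. This is a minimal sketch I'm adding (not from the lecture; all parameters illustrative), and it shows the problem that comes up later: with non-negative firing rates, the weights only ever grow.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([0.1, 0.1])    # initial weights (illustrative)
tau_w, dt = 100.0, 1.0      # weight time constant >> firing-rate time constant

for _ in range(1000):
    u = np.abs(rng.normal(size=2))   # input firing rates are non-negative
    v = w @ u                        # steady-state output of a linear neuron
    w += (dt / tau_w) * v * u        # Hebb: fire together, wire together

print(w)  # both weights have grown; plain Hebb can never decrease a weight
```

Because v, u, and w are all non-negative, every update is non-negative, so there is no LTD in this rule by itself.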
OK, so we talked about how the weight vector changes as a function of the product between the output and the inputs. So now, we take the average over time of this product, which is what winds up mattering because the weight change is slow. And remember, v is w dot u. So when you substitute that in, the average of v times u becomes Q times w. And Q is actually the correlation matrix. So what is Qij in this case?
AUDIENCE: ui uj?
EMILY MACKEVICIUS: Sorry?
AUDIENCE: ui times uj?
EMILY MACKEVICIUS: ui times uj, but it's the average ui times uj over time, which is the correlation between ui and uj, so the correlation between those two inputs.
So that's sort of cool. The weights are changing as the function of the correlation between the inputs. So you're almost learning something about the structure of the correlation of your inputs. Yeah, exactly as you said: ui times uj averaged.
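The identity being used here can be checked numerically: averaging the Hebbian update v times u over many inputs gives exactly the correlation matrix Q applied to the weights. A small sketch (values made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
U = np.abs(rng.normal(size=(10000, 3)))   # many non-negative input samples (rows)
w = rng.random(3)                         # current weights

Q = (U.T @ U) / len(U)                    # correlation matrix Q_ij = <u_i u_j>

# Average Hebbian update <v u>, computed two ways:
avg_vu = np.mean((U @ w)[:, None] * U, axis=0)   # directly: average of v * u
Qw = Q @ w                                        # via the correlation matrix

print(np.allclose(avg_vu, Qw))  # True: <v u> = <(w.u) u> = Q w
```

The agreement is exact (up to floating point), since it is just the same sum written two ways.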
So here, in a paper by Miller in 1994, they built a network based on these types of rules, the correlation-based rules. And the inputs were the type of responses you would see in the visual thalamus. And they were building a model of V1, just based on Hebbian plasticity, so wiring up based on the correlations between the inputs. And these are the receptive fields that their network developed. So you can see these receptive fields have become orientation selective, sort of like you'd see in actual V1. So your network has done some sort of unsupervised learning about the structure of its inputs.
But one thing that we're ignoring in this rule-- so firing rates of neurons are always positive. You can't have a negative firing rate. So after you strengthen a synapse in this rule, you can't undo it. So it's bad that you can't undo it, and it's also bad that if you have any recurrent connections, your network can start going epileptic, because everything is strengthened a ton, and it goes crazy. So what's a way to fix this?
AUDIENCE: [INAUDIBLE]
EMILY MACKEVICIUS: Sorry?
AUDIENCE: [INAUDIBLE]
EMILY MACKEVICIUS: Yeah, so one type of normalization is a threshold, such as subtracting out the average. So you can add a threshold, either a threshold to your inputs or a threshold to your outputs. So you can say "it's fire together, wire together," but the input would have to be bigger than average, or the output would have to be bigger than average. So why wouldn't you do both, a threshold on both the inputs and the outputs, instead of either the inputs or the outputs? Sorry?
AUDIENCE: You'd have a constant term in there, so if you extended the product, you'd have data u, data v terms.
EMILY MACKEVICIUS: Yes, you'd have a constant term that doesn't depend on either the inputs or the outputs. That's true. What I had been thinking of is just the fact that if they're both low, then you'd get potentiation. So that might wind up being the same thing, but the idea is, in this case, if the output is low, then this product is negative and the weights will stay low. In this case, if the input is low, the weights will stay low. But if you had both thresholds, then when the input was low and the output was low, the weight would increase, which isn't really what you see and might not make sense.
AUDIENCE: So you assume that u and v are always positive?
EMILY MACKEVICIUS: Exactly, so they're firing rates. How many times does it spike per second? So the rate has to be greater than 0, but its difference from the average doesn't have to be. That can be negative.
So what this gives you is correlation-based learning. So here I'm doing a threshold on the inputs. I'm setting the threshold to be the average value of u. Remember, the output is the weights times the inputs. So now, I'm changing the weights based on the output times the difference between the inputs and the average input. And I'm averaging that across many training inputs, because it's a slow process. So what do I get in this case?
AUDIENCE: Covariance?
EMILY MACKEVICIUS: Yeah, so what I get in this case, if you do it out-- you substitute v by the weights times the inputs-- is that the weights change as a function of your covariance matrix times the weights. So that's cool. The average weight change is given by the covariance matrix. So over time, as your weight keeps getting passed through this covariance matrix, what happens is that w becomes parallel to the first eigenvector of the covariance matrix. So after learning, after it's seen many inputs, what does the network do to its input?
AUDIENCE: [INAUDIBLE].
EMILY MACKEVICIUS: Exactly, so now, after learning, it projects its inputs onto the first principal component, because the first principal component is the first eigenvector of the covariance matrix. So that's cool. It's a natural way, just by simple Hebbian plasticity, that a network can do PCA on its inputs.
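This convergence can be sketched numerically: repeatedly applying the covariance matrix to the weight vector (renormalizing, since only the direction matters for the receptive field) aligns it with the first eigenvector, i.e. the first principal component. All data and parameters here are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated 2-D inputs whose first principal component is along (1, 1).
base = rng.normal(size=(5000, 1))
U = np.hstack([base + 0.1 * rng.normal(size=(5000, 1)),
               base + 0.1 * rng.normal(size=(5000, 1))])

C = np.cov(U.T)             # input covariance matrix
w = rng.random(2)           # random initial weights
for _ in range(100):
    w = C @ w               # average covariance-rule update
    w /= np.linalg.norm(w)  # renormalize: only the direction matters here

top_pc = np.linalg.eigh(C)[1][:, -1]    # first principal component of the inputs
print(abs(np.dot(w, top_pc)))           # ~1.0: w is parallel to the first PC
```

This is just power iteration on the covariance matrix, which is why the weight vector ends up on the dominant eigenvector.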
One thing that you might have noticed is that, even in this case with the thresholds, the weights still just keep growing. You can check that the change in the magnitude of the weight vector is always positive, and so you could get very high weights. And there are a number of ways that people use to keep this in check.
The simplest way is probably just setting some maximum weight, and saying none of the weights can be higher than w max. There's also the BCM rule, which sets a sliding threshold dependent on the activity. There are different types of normalization, and a lot of these wind up producing synaptic competition. As a practical matter, some of your inputs gain strong weights, but not all of them.
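As one concrete version of keeping the weights in check, here is the simplest option just mentioned, a hard maximum w max, in a small sketch I'm adding (illustrative parameters; the BCM sliding threshold would be a different, activity-dependent mechanism):

```python
import numpy as np

rng = np.random.default_rng(3)
w = np.full(3, 0.1)         # initial weights (illustrative)
w_max, eta = 1.0, 0.01      # hard weight ceiling and learning rate

for _ in range(5000):
    u = np.abs(rng.normal(size=3))             # non-negative firing rates
    v = w @ u                                  # linear neuron output
    w = np.clip(w + eta * v * u, 0.0, w_max)   # Hebbian growth, capped at w_max

print(w)  # weights have grown, but none exceeds w_max
```

The cap bounds the runaway growth but, unlike normalization rules, it doesn't by itself create competition between inputs.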
So now I'm going to move on to reinforcement learning. That was unsupervised learning.
AUDIENCE: Emily, can I ask a question? What about experimental manipulation [INAUDIBLE] control of the weight?
EMILY MACKEVICIUS: For the competition?
AUDIENCE: [INAUDIBLE]
EMILY MACKEVICIUS: Oh for any of these--
AUDIENCE: [INAUDIBLE]
EMILY MACKEVICIUS: That's a good question. I don't know offhand. I mean, I've heard of synaptic pruning. And I think that there are also ways that these changes can happen locally in dendrites, instead of through the whole cell. So you could get effects that happen locally instead of on the global output of the cell, because a lot of synapses happen on spines. And so the spines capture the local activity of the cell. But I don't know offhand of a good experiment for that.
Oh, and then also just practically, synapses can't have infinite strength. There's only so much machinery that can fit at the post-synaptic ending. So as a practical matter, there might be some limit.
AUDIENCE: With respect to BCM, there's [INAUDIBLE] like synaptic scaling, right? So there's a set point for the neuron, and it shifts all the weights up or down based on that. So it's thought to be a long-timescale [INAUDIBLE] mechanism that maintains the overall firing range of a neuron.
EMILY MACKEVICIUS: Yeah, so yeah. Homeostasis, basically then. Did someone else have a question? Looked like it. No, OK.
All right, reinforcement learning. So this is learning about stimuli or actions solely on the basis of the rewards and punishments associated with them. In the simplest form, you get reinforcement independent of any action that you take. And then in operant conditioning, you get reinforcement after taking an action. And that reinforcement can happen immediately after the action, or it could depend on an entire sequence of actions, and the reward could happen after the whole long sequence.
So you're probably all familiar with classic Pavlovian conditioning. Before training, when the dog gets food, he salivates. Before training, when he hears a tone, he doesn't salivate. After training, which involves pairing the tone with getting food, then just hearing a tone alone will make him salivate. So here, this could be seen as a simple example of "fire together, wire together." He heard the tone and he got food paired together. This connection forms, so now just hearing the tone would make him salivate.
So this can be described by the Rescorla-Wagner rule: the weight changes as a function of how your actual reward differs from your predicted reward. So let's say you're trying to predict how much reward you're going to get. The weight change depends on the difference between how much reward you expected and how much reward you actually got.
So this simple rule explains many conditioning paradigms. There's Pavlovian conditioning, which we just talked about. There is extinction, where you pair a stimulus like a tone with a reward, and then you play the tone by itself a lot. After that, playing the tone alone produces no salivation. That's because in this phase, you were predicting the reward, and you got less reward than you predicted. And so that brings the weights back down to not predicting the reward. You can have partial training. You can have blocking, where if one stimulus predicts a reward, and then you play that stimulus along with another stimulus for a reward, the second stimulus never comes to predict the reward, because the reward wasn't unexpected. It was already explained by the first stimulus.
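These paradigms are easy to simulate. The sketch below (which I'm adding; the learning rate is illustrative) shows acquisition, blocking, and extinction all falling out of the single Rescorla-Wagner update, delta w = epsilon times (r minus w dot u) times u:

```python
import numpy as np

eps = 0.1   # learning rate (illustrative)

def train(w, u, r, trials):
    """Rescorla-Wagner: delta_w = eps * (r - w.u) * u, repeated over trials."""
    u = np.asarray(u, dtype=float)
    for _ in range(trials):
        prediction_error = r - w @ u
        w = w + eps * prediction_error * u
    return w

w0 = np.zeros(2)  # weights for two stimuli: [tone, light]

# Acquisition: tone alone is paired with reward; the tone weight climbs to ~1.
w_acq = train(w0, u=[1, 0], r=1.0, trials=100)

# Blocking: tone+light paired with the same reward. The tone already predicts
# it, so there is almost no error left, and the light learns almost nothing.
w_blk = train(w_acq, u=[1, 1], r=1.0, trials=100)

# Extinction: tone alone with no reward; the tone weight decays back toward 0.
w_ext = train(w_acq, u=[1, 0], r=0.0, trials=100)

print(w_acq)  # tone weight ~1, light weight 0
print(w_blk)  # light weight still ~0: blocked
print(w_ext)  # tone weight back ~0: extinguished
```

Note what the rule cannot do: secondary conditioning, where a stimulus is only ever paired with another stimulus and never with the reward itself, produces no learning here, which is the failure discussed below.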
And there's also inhibitory conditioning. So if one stimulus by itself gets a reward, but the two stimuli together get nothing, then the second stimulus is almost interpreted as preventing the reward or something. Yeah, so now the stimulus gets the reward and-- did I explain this wrong?
AUDIENCE: No, you were right.
EMILY MACKEVICIUS: Yeah. Sorry?
AUDIENCE: I think you were right.
EMILY MACKEVICIUS: OK, OK, yeah. So now this stimulus gets part of the reward prediction, and that stimulus gets part. That's overshadowing. And the one thing that the rule doesn't explain is secondary conditioning. So let's say you have a stimulus which predicts a reward in one phase, and then you train with a second stimulus that predicts the first stimulus. What winds up happening is that the second stimulus comes to predict the reward, even though it was never directly paired with the reward, and this simple rule doesn't capture that. One thing about this that's sort of newer or different is the sequence, or causal chain: first one thing, and then another thing, and then reward.
That's maybe a good time to bring in the fact that directionality matters in a lot of these paradigms. So, in this case, we got "fire together, wire together" from hearing a tone to salivating, but presumably we didn't get in the reverse direction. And why wouldn't that happen? Why doesn't it?
AUDIENCE: The synapses aren't directed, and also, events in the world are causal.
EMILY MACKEVICIUS: Yeah, so it makes sense intuitively that you would want to have some sort of causality, where something that happens first predicts something that happens second. In this case, I was maybe being somewhat naive about whether the tone happens first, but the tone does happen first. And so the tone happens, he gets the reward, and he associates that with the tone causing the reward. But in terms of mechanism, how does this work? If we're just talking about "fire together, wire together," you might assume that you would also get the reverse happening.
AUDIENCE: And it is known that they don't get that?
EMILY MACKEVICIUS: Is it known?
AUDIENCE: I don't know, that if you give them food, they don't expect a tone from [INAUDIBLE]?
EMILY MACKEVICIUS: Yeah, I mean, I think that you don't think of it as the causal direction if it happened in the reverse order.
AUDIENCE: No, I know, but has anyone ever tried this with animals?
EMILY MACKEVICIUS: I mean, it's hard with tones because you don't really know if he hears the tone. Yeah, I think so. I can't come up with something offhand. I don't know if anybody can.
AUDIENCE: The stimulus has to be showing. It actually has to be a reward to train. Right, so if you're walking around, and you hear a random tone, and it doesn't indicate anything, you learn to ignore it. And so there'd be no way to determine-- learning the presence of the tone stimuli, he'd probably learn not to notice it over time. So you might be able to learn that steak subsequently indicates another set of rewards after it, right? But that set of rewards would actually have to be rewarded. [INAUDIBLE]
EMILY MACKEVICIUS: I think he's asking about whether, let's say, the next time the dog got a steak and he salivated, whether he would call the tone to mind or something. Right?
AUDIENCE: I think he would expect to hear the tone after he gets the food [INAUDIBLE]
Why do you assume [INAUDIBLE] metrics in the teaching process, [INAUDIBLE] metric in the underlying mechanism. Why would you even suggest symmetry?
EMILY MACKEVICIUS: Yeah, so I guess the point is that there isn't symmetry. But before, I was just talking about steady state, where you'd have two things that were both firing. Why not, if this was firing and that was firing, why not have the reverse?
AUDIENCE: But they're not together. You need some causal connection.
EMILY MACKEVICIUS: Exactly, so that's why we need to bring in time. There are different ways to bring in time, and one that's often used is spike-timing-dependent plasticity. So if you have a pre-synaptic cell and a post-synaptic cell, and the pre-synaptic cell fired first, then that might cause the weight to increase, whereas if they happened in the reverse order, the non-causal order, then that would cause the weight to decrease. So here, this axis is the time post-synaptic minus pre-synaptic. And if the pre-synaptic spike came before the post-synaptic spike, the synapse is strengthened. Otherwise, it's weakened.
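A standard way to write the spike-timing-dependent plasticity window is a pair of exponentials in the spike-time difference. This is a sketch with illustrative amplitudes and time constants, not the specific curve in the figure:

```python
import numpy as np

A_plus, A_minus = 0.01, 0.012      # potentiation/depression amplitudes (illustrative)
tau_plus, tau_minus = 20.0, 20.0   # decay time constants in ms (illustrative)

def stdp(dt_ms):
    """Weight change as a function of dt = t_post - t_pre."""
    if dt_ms > 0:                  # causal order: pre fired before post
        return A_plus * np.exp(-dt_ms / tau_plus)
    else:                          # anti-causal order: post fired before pre
        return -A_minus * np.exp(dt_ms / tau_minus)

print(stdp(10.0))   # positive: potentiation
print(stdp(-10.0))  # negative: depression
```

The sign of the change flips with the order of the spikes, which is exactly what turns "fire together, wire together" into a directional, causal rule.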
So we've been talking about timing, and I wanted to go into a bit of an interlude about timing, and also about, if you're modeling intelligence, how should you use neural architecture? How is it relevant to you?
I just want to give a warning: don't be too constrained by what the brain can't do. If there's something that hasn't [INAUDIBLE], and nobody knows how the brain could do it, that doesn't mean the brain can't do it. You shouldn't force your models to avoid things that you think the brain can't do. But you should perhaps be inspired by what we know is very easy for the brain to do.
So we know it's very easy for the brain to do things like dot products. Measuring correlations-- we just saw PCA. Linking sequences together, for instance with spike-timing-dependent plasticity: one event happens, then the next and the next. There are certain things like that that are very easy for the brain to do, though they're by no means the only things the brain could possibly do.
And one example is that a huge difference between the brain and a computer is that the brain doesn't have a fast centralized clock in the way that computers do. And this is a real difference. I mean, you can do things very quickly, even though there's no real evidence that you have a clock parsing it out in the same way that a computer does. You might have state oscillations or breathing rhythms or circadian rhythms, but those are all a lot slower than what a computer clock would do.
So anyway, I think this is an important and interesting difference. But that also doesn't mean that you can't use timing. So there might be a way for the brain to do it even if we haven't come up with it. There are ways of synchronizing things without a centralized clock.
So now we're going to address the situation where the reward or punishment might depend on an entire sequence of actions, and it might be delayed until the very end of the sequence. This is often a hard problem, because you don't know which actions actually caused the reward, right? So what's going on in this picture is you have some baseball players who are superstitious, and they think that balancing the ball on top of their cap is creating a rewarded outcome, making them do well in the game. And you could imagine this happened because they might have done it once, then done well, and assumed that it was causal. And then they repeatedly do it and keep doing well, and it seems like the right thing to do. But most people would say that this is not actually causing the wins. So it's hard to know what causes the reward.
And one way to try to estimate what causes the reward is, if you have a long sequence of steps and you get a reward at the end, you can try to assign an intermediate value to each state that you have. And that intermediate value would give you an estimate of the predicted reward starting at that point until the end of the trial. So, for instance, if you are a baseball player and you are losing, the other team has many runs, then you could say well, probably you'll lose. You have a lower expected value from there than if you were ahead.
So we think of this as TD learning, where you have a sequence of states, and in each new state, you calculate the expected reward for the rest of the trial. And you change your weights based on how this estimate changes as you take action.
And this works fairly well. So TD-Gammon used TD learning, and it was, at the time, on par with the best human players, using this method of estimating the difference between successive states. And it practiced just by playing against itself.
So this concept of your expected reward versus your actual reward is very important for learning. And there's evidence that it's actually encoded in the brain. Specifically, it might be encoded in VTA.
So here, you have a simple stimulus and reward pairing. Early on, the reward elicits a big spike, because it's very rewarding compared to your expectation, which is no reward. But after training, you don't see a spike at the reward.
And so here's another neuron, and here is a case after training, when either you get the reward or you don't. So going along, the stimulus itself is actually a rewarding experience, because you're expecting reward in the future-- it's a rewarding state. So the neuron fires. It doesn't fire when he actually gets the reward, but it fires less than baseline if he gets no reward, because he expected to get a reward.
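This shift of the prediction-error response from the reward to the stimulus falls out of TD learning. Here's a minimal TD(0) sketch I'm adding (parameters illustrative; the assumption that the stimulus arrives unpredictably, so the pre-stimulus value is fixed at 0, is mine):

```python
import numpy as np

T, alpha = 5, 0.2          # trial length and learning rate (illustrative)
V = np.zeros(T)            # learned value of each time step after the stimulus

def run_trial(V):
    """One trial: stimulus at t=0, reward at t=T-1. Returns the TD errors."""
    deltas = [V[0] - 0.0]  # error at stimulus onset (pre-stimulus value is 0)
    for t in range(T):
        r = 1.0 if t == T - 1 else 0.0            # reward only at the end
        v_next = V[t + 1] if t + 1 < T else 0.0   # trial ends after the reward
        delta = r + v_next - V[t]                 # TD prediction error
        V[t] += alpha * delta
        deltas.append(delta)
    return np.array(deltas)

early = run_trial(V)       # first trial: the error spikes at the reward
for _ in range(500):
    run_trial(V)
late = run_trial(V)        # after training: the error spikes at the stimulus

print(early)  # [0. 0. 0. 0. 0. 1.]
print(late)   # ~[1. 0. 0. 0. 0. 0.]
```

Early in training the error sits at the unexpected reward; after training the reward is fully predicted and the error has migrated back to the stimulus, matching the dopamine recordings described above.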
So this reward prediction error can be used for learning to do again the things that worked well for you. A long time ago, Thorndike came up with the Law of Effect, saying responses that produce a satisfying effect in a particular situation become more likely to occur again in that situation, and responses that produce a discomfiting effect become less likely to occur again in that situation. I'm going to soon talk about this in relation to birdsong, which is what my lab works on.
Thorndike originally came up with this by making these puzzle boxes for cats. And so it's complicated for the cat to get out of this box. He wants to get out of the box, and he does a sequence of actions. If that gets him out of the box, he's more likely to do that sequence of actions again.
But I want to point out that stimulus response pairing isn't all there is to learning. It's also important to make cognitive maps, or models of your world. This could be a physical map of [INAUDIBLE], how do you get to places, or a map of state transitions between possible outcomes in a game. You want to be able to extrapolate to new situations, not just interpolate between things that you've seen before. You don't want to just repeat getting out of the box or something that you've done before.
And so this is an illustration of this distinction between model-free and model-based reinforcement learning. In this case, you have a mouse in a maze with three different paths that he could go along. One is short, one is longer, and one even longer. And in the first case, there are no obstacles. He's seen it before. In both model-based and model-free learning, he just takes the short path.
In the next case, there is an obstacle, and the obstacle is just on the short path. So again, in both cases, he just turns back and takes the second shortest path. But here's a situation where the two shortest paths are blocked. In this case, with model-based learning, he knows that they're both blocked just by seeing it, because he has a map. And so he goes and takes the longest route. But with model-free learning, he doesn't have a map, so he just takes the second route, and then needs to go back again and take the longest route.
So it's useful to have models. And another thing-- sorry-- model-free versus model-based learning is not a binary distinction, either. This paper by Dolan and Dayan gives a good review of how the thinking about model-free and model-based learning has evolved, and also how you could have a combination, or one system training the other, or a competition between these different systems. They're not completely separate.
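The maze distinction can be caricatured in a few lines: a model-based chooser plans around known blockages, while a model-free chooser replays cached preferences and only discovers blockages by running into them. Everything here is a made-up toy sketch, not a model from the literature:

```python
# Three maze paths with lengths 1 < 2 < 3; shorter is preferred when open.
path_lengths = {"short": 1, "medium": 2, "long": 3}

def model_based_choice(blocked):
    """Has a map: plans around known blockages before moving."""
    open_paths = {p: c for p, c in path_lengths.items() if p not in blocked}
    return min(open_paths, key=open_paths.get)

def model_free_choice(cached_preference, blocked):
    """No map: tries cached options in order, bumping into blockages.
    Returns the sequence of paths actually attempted."""
    attempts = []
    for p in cached_preference:
        attempts.append(p)
        if p not in blocked:
            return attempts
    return attempts

cached = ["short", "medium", "long"]   # preference learned from prior trials
blocked = {"short", "medium"}          # both shorter paths now blocked

print(model_based_choice(blocked))           # 'long', chosen immediately
print(model_free_choice(cached, blocked))    # ['short', 'medium', 'long']
```

The model-based agent commits directly to the long route; the model-free agent wastes trips down the cached routes before finding it, like the mouse in the illustration.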
So now I'm going to start talking about birdsong, which is what my lab works on. This is not my own work, but previous work from Michale Fee's lab, where I work. I'm going to start by going through the basic behavior of song learning.
So zebra finches, the birds that I study, learn their song from their father. They hear a tutor song, which sounds like this. It might not be plugged in. Sorry.
This is actually challenging, because I can't play in both-- Oh, I can. Never mind.
Does anyone know how the volume works on this? No.
[BIRD CHIRP]
Oh, there we go. OK, sorry. Just didn't press it. OK, so that is what the tutor song sounds like. So they'll hear this song. And this early stage is called sensory learning, because he's forming a sensory representation of his tutor song. He's memorizing his tutor song. And the next overlapping phase of learning is sensory motor learning, where he starts practicing his own song. So he starts singing early sub-song, which is babbling. And it doesn't sound similar to the adult song. Gradually, it gets better, a little more structure in the song.
So this first structured song is called plastic song. And then, as an adult, he sings a near perfect match of the tutor song. So I'm just going to play the two of them, so you can compare.
So it's a near perfect match. It's a difficult skill for the bird, because it takes him a couple of months to learn. But a lot of birds learn a near perfect match. And my lab has developed a hypothesis about this process of matching the tutor, in the framework of reinforcement learning. The idea is the bird is singing some song, and he notices that when he sings a note slightly higher than normal, it sounds really good. So he wants to do that again the next time he [INAUDIBLE] that. That's the basic idea, and then I'm going to go into the brain circuitry behind it.
AUDIENCE: So how does the bird determine what sounds good, exactly?
EMILY MACKEVICIUS: That's a good question, and actually, slightly related to what I'm working on, but it hasn't been nailed down exactly how he determines it. There are some regions involved that seem to respond to errors in the song. So if you target a particular syllable with white noise, then you can get error responses. They seem to respond to error, but it's not clear exactly how he computes an error. If I were to completely speculate, I think it might be some sort of coincidence detection, aligning what he's hearing with some memory of the tutor and being rewarded if it's a good match, but that's speculation.
So how does this reinforcement learning work in the brain? The brain regions that are typically thought of as being involved in reinforcement learning are the basal ganglia. And the basal ganglia are connected in several loops with the pre-motor cortex, the cortex in general, and also with the thalamus. And there is analogous circuitry in the bird. The bird basal ganglia region that is related to song is called Area X. It contains neurons that are analogous to different parts of the mammalian basal ganglia, and connects to cortical nuclei as well as [INAUDIBLE].
So how does the bird brain generate song in the first place, before any learning? There is a descending motor pathway. Neurons in HVC drive motor neurons in RA, which produce vocalizations. And in HVC, each neuron will fire at a particular time in the song. Mike Long, a former lab member, calls these time cells. Sort of analogous to place cells, which fire at a particular location in space, these neurons are time cells, firing at a particular location in time. And we think about these as firing in a sequence. So as the sequence runs, each of these cells drives downstream neurons in RA to produce different notes in the song.
And there's a question of whether this sequence actually sets the timing of the song. To test that, Mike Long, in the lab, cooled HVC. So the idea is that if you cool a brain circuit, you slow down all of its dynamics. So if you have a sequence, cooling will slow down the transitions between successive moments of the sequence.
So this is a spectrogram of the bird's song before any cooling. And various amounts of cooling stretch the song uniformly. So it's a cool causal test, because obviously if you just lesion a region and song goes away, the region could be doing anything. But with cooling, you can really test if it's controlling the dynamics of the behavior. Any questions so far?
AUDIENCE: So this was behavior in zebra finches?
EMILY MACKEVICIUS: Yes. So he developed a Peltier device, which is like a little refrigerator, where running current in one direction will pump heat from here to here, and the other direction will pump heat the other way. And so they'd cool HVC, which is on the surface of the brain, when the bird starts to sing. Different amounts of current cause different amounts of cooling, which is how they got these different results.
AUDIENCE: This mechanism suggests that temporal resolution of [INAUDIBLE]
EMILY MACKEVICIUS: Yeah.
AUDIENCE: [INAUDIBLE] that temporal resolution?
EMILY MACKEVICIUS: That's a good question. The question is about what the temporal resolution of the song is. And there are features in this song that are repeatable to a few milliseconds. Firing of HVC neurons is also repeatable to within just a few milliseconds. And if you want to induce changes in the song-- so I talked before about targeting parts of the song with white noise. You can induce changes that are very localized to specific points in time, so it's a very precisely timed process.
AUDIENCE: Resolution agrees with 10 milliseconds, or even higher order?
EMILY MACKEVICIUS: On that order. I don't remember the exact numbers, but it's definitely a local point in the song. Within the syllable. It doesn't affect the whole syllable. It's a certain point in the syllable. Whatever he conditions [INAUDIBLE] feedback.
AUDIENCE: Does the bird learn to speed things up? It gets back to the real--
EMILY MACKEVICIUS: That's a good question. And so one thing that is really awesome that Mike Long has gone on to do now is develop a device for cooling regions of people's brains. And so he's cooled different areas, including Broca's area. It's really awesome. And what appears to happen in that case is that their speech actually slows down. But it's not as simple as this, because they appear to do some sort of correction. I think with the birds, it would require a lot longer to do that correction, and they didn't cool for that long.
AUDIENCE: Any systematic errors in doing that? Because I imagine if you only cool one area, that it would then no longer be acting on the same time course as other areas. And so if interactions matter, then you should see some kind of errors, right?
EMILY MACKEVICIUS: Yeah, that's a good question. And how we think of this is that HVC is really setting the time of the song, and it is interacting with other areas, but that's all downstream. If you cool it too much, he just won't sing. So you can't infinitely cool it. But at least at some range, you can cool it, and that just sets the timing and influences downstream areas accordingly.
AUDIENCE: Does the fact that you can stretch it to scale also imply certain things about the structure of the network that produces the pattern? So cooling just scales it, as opposed to destroying it or something else. Does that imply something about the network? Do you guys know?
EMILY MACKEVICIUS: Yeah, I think it implies that the timing is controlled in HVC specifically. Yeah. Did someone else have a question? No? OK.
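The cooling result can be sketched with a toy model: if song timing is set by a synaptic chain in HVC, and cooling slows every link in the chain by the same factor, the whole song stretches uniformly. The chain length, link delay, and slowdown factor below are illustrative assumptions, not measured values.

```python
import numpy as np

# Toy model of the HVC cooling experiment.  Assumption: song timing is set
# by a synaptic chain in HVC, and cooling slows every link in the chain by
# the same factor.  Chain length, link delay, and slowdown are illustrative.

def song_times(n_cells=100, dt_per_link=0.01, slowdown=1.0):
    """Firing time of each cell in the chain, in seconds."""
    return np.arange(n_cells) * dt_per_link * slowdown

normal = song_times()
cooled = song_times(slowdown=1.1)  # ~10% slowdown from cooling

# Every moment of the song shifts by the same relative amount, so the
# song is stretched uniformly rather than distorted locally.
stretch = cooled[1:] / normal[1:]
print(stretch.min(), stretch.max())  # both ~1.1: uniform stretch
```

This is why uniform stretching is such a clean signature: a circuit that merely gated or relayed the song would not scale every moment of it by the same factor when cooled.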
All right. So that covers the descending motor pathway, including HVC. There is another pathway, the anterior forebrain pathway, which involves Area X, which I talked about, and also a region called LMAN. So LMAN is thought to add variability to the song, to add little explorations on top of the stereotyped song from HVC.
And there are several pieces of evidence that LMAN is the source of variability. If you have a young bird who is singing sub-song, and then you lesion LMAN, all of the variability in his song goes away, and he's left with one repeated syllable.
Also, I haven't shown it here, but if you take an adult bird and leave him only with LMAN, so you lesion HVC, he reverts back to sub-song. So all that's left is the [INAUDIBLE]. And the idea is that through development, the bird's song starts off controlled by LMAN with lots of exploration, and then HVC takes more control. But there are still slight variations from LMAN. So LMAN is injecting variability into the song.
Now I'm going to get to this theory of the reinforcement learning in Area X. So the idea is that Area X keeps track of what variation LMAN did and what time in the song it occurs at. And if that variation sounded good, then Area X, through the thalamus, biases LMAN to do that same variation again, next time he gets to that point in the song.
So one thing that's missing from the picture I just described is how Area X knows what sounded good. And I think you were asking about that before-- how do you know if it sounded good? This is the part of the theory that addresses it: the reinforcement signal comes from the dopaminergic midbrain, from VTA, which receives input from higher-level auditory areas.
And there is unpublished work from the Fee lab, on these auditory areas in the pathway that goes to VTA, showing that neurons in AIV respond to errors in the song. And there's work from Richard Hahnloser's lab showing that some neurons in CM respond to errors in the song, as well.
AUDIENCE: So that Area X does bias LMAN to do the same thing, so LMAN needs to have some kind of memory as well?
EMILY MACKEVICIUS: So that's one thing I didn't cover, but-- so the question is, when Area X biases LMAN to do the same thing that sounded good before, there are actually different loops within this system. So if you have a neuron in LMAN that projects to a population of neurons in Area X, that population in Area X connects back to the original population in LMAN. So just the topography of the connections determines which part should be influenced.
AUDIENCE: And the correct time, as well.
EMILY MACKEVICIUS: So the correct time comes from HVC. I'm about to go into more detail about how we think this works. So here's a diagram with HVC on top, the sequence of time cells in HVC, and neurons in Area X getting input from VTA and also from LMAN. And let's say this LMAN neuron is responsible for driving the pitch of the birdsong up, and let's say that the bird wanted to drive the pitch up at timepoint four. What would you do to make that happen?
So let me go through this again. These are time cells in HVC. Each time cell projects to a neuron in Area X. Those neurons also receive input from LMAN about what variation to do, but the output of those neurons influences what LMAN does. So if this is a pitch-up neuron in LMAN, how would I get pitch-up at timepoint four?
AUDIENCE: And do you have the outputs turn LMAN on at timepoint 3? So by the time it gets around [INAUDIBLE]
EMILY MACKEVICIUS: So assume that everything happens very quickly here, or instantly. And I think you said it correctly: you increase the weight onto this neuron from that timepoint. So let's say that timepoint four now has a strong weight here. The bird starts singing, the time cell activates at timepoint four, and this causes pitch-up.
AUDIENCE: Could you just make it so that it's more activated? It can do all sorts of things, comparable to any other animal that changes, not just [INAUDIBLE].
EMILY MACKEVICIUS: So I'm saying that this neuron in LMAN is a pitch-up neuron.
AUDIENCE: Yeah, but not [INAUDIBLE]. Not just pitch-up, but you increase the weight between four and that neuron, [INAUDIBLE] the input to any other LMAN neuron, which is also pitch-down, or--
EMILY MACKEVICIUS: The thing is Area X also has a corresponding pitch-up channel.
AUDIENCE: Ah, so the entire channel is pitch-up.
EMILY MACKEVICIUS: Yes, yes. And we don't know if there actually is a pitch-up channel. So what we do know is that if you have a small region of LMAN, that projects to a small region of Area X, which, through the thalamus, goes back to the same small region of LMAN.
AUDIENCE: So for every modulation, you should have [INAUDIBLE]
EMILY MACKEVICIUS: Think about it in different channels, yeah. OK, so if you wanted to get a pitch-up at timepoint four, then you'd strengthen synapse four. And, according to our theory, how that happens is that you have a plasticity rule which strengthens synapse four whenever you get input from LMAN and HVC, followed by reward. So the idea is if he's singing, if his pitch was high at timepoint four, these neurons would be active at the same time. If he subsequently got reward, strengthen the synapse, and so next time he goes around to sing his song, he'll sing a higher pitch at timepoint four.
AUDIENCE: What's the time overlap with the reward like?
EMILY MACKEVICIUS: That's a good question. It's somewhat delayed, obviously, because he has to hear himself after he has produced the song. And I'm going to go into a little more detail later, but the idea is that we think there might be an eligibility trace, or something localized at the synapse, that says this synapse is eligible to be strengthened if reward comes within a certain amount of time.
AUDIENCE: Are there any [INAUDIBLE] copy sent? Do they have any internal models to [INAUDIBLE] model to say, okay, here's what I think I sound like. Is it based on that instead? Or [INAUDIBLE]?
EMILY MACKEVICIUS: Yeah, I mean, we think of LMAN as an efference copy, because LMAN is influencing the song through RA, but LMAN is also telling Area X what it's doing. And there's some experimental evidence that this bias is stored in LMAN. So in this experiment, Jesse Goldberg in Michale Fee's lab-- Jesse is the one who developed this whole theory-- there's a bird who's singing, and he has a particular syllable that is going to be targeted. This syllable has some mean pitch, and that's the threshold. The experimenter says, if the syllable's pitch is above the mean, I'm going to blast it with noise; otherwise, I'll let it go. So here's an example of the bird's song blasted with noise. What this sounds like to the bird is that when he sang slightly too high, it sounded terrible. And it's dependent on how high he sang.
So in this paradigm, you can drive the pitch up. So this is over a matter of days. And on some days, the pitch was driven up. On other days, the pitch was driven down. So you drive the pitch down by putting noise when the pitch is below the mean pitch. So here's the pitch as a function of day, and it's gradually going up, and then down, driven all the way down, and then up. So you get this behavior. But the cool thing is that after driving the pitch down, if you inactivate LMAN, the pitch will go back to where it was before.
AUDIENCE: So does the bird think that it's making that noise, or is that a negative reward?
EMILY MACKEVICIUS: It's hard to say. He thinks that he's making it to the extent that he'll change his behavior to avoid making it in the future. I mean, what's happening is he is doing something, making his pitch high, which is causing it to sound bad. And it's unclear whether he thinks he's making it himself or just that something he's doing is creating it or what he's thinking at all. Yeah. Yeah, so this bias is going through LMAN.
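The pitch-contingent white-noise paradigm can be sketched as a minimal reinforcement loop. All numbers below (baseline pitch, jitter size, learning rate, rendition count) are hypothetical; the sketch only shows how punishing above-threshold renditions drives the bias down, and punishing below-threshold renditions drives it up, as in the experiment.

```python
import random

# Minimal sketch of the pitch-contingent white-noise paradigm.  On each
# rendition the bird sings the target syllable at baseline + bias +
# exploratory jitter; renditions on the wrong side of the threshold are
# "blasted with noise," and the bird shifts his bias away from the
# punished variations.  All parameter values here are hypothetical.

random.seed(0)

def run_paradigm(punish_above, n_renditions=2000, lr=0.02):
    baseline, bias = 100.0, 0.0               # pitch in arbitrary units
    for _ in range(n_renditions):
        jitter = random.gauss(0, 1.0)         # LMAN-like variability
        pitch = baseline + bias + jitter
        punished = pitch > baseline if punish_above else pitch < baseline
        if punished:
            bias -= lr * jitter               # move away from punished variation
    return bias

bias_down = run_paradigm(punish_above=True)   # noise when pitch is high
bias_up = run_paradigm(punish_above=False)    # noise when pitch is low
print(bias_down < 0 < bias_up)                # -> True
```

Note that the update only uses the exploratory jitter and the binary punishment signal, which is the sense in which the paradigm is a reinforcement-learning experiment rather than supervised learning.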
So again, here's the picture that I started off with. And remember, the bias is going through LMAN. And I mentioned the eligibility trace-- here's how we think it might happen, in a little more detail. So here is time. This medium spiny neuron would get [INAUDIBLE] activity from HVC at a particular time, and also from LMAN. And that would create this eligibility trace, which is slightly delayed, so that if he gets reward at the slightly delayed timepoint, then the synapse will get strengthened. So if he does get the reward-- this is VTA input-- then the weight will get strengthened. That's the idea: if these three inputs converge with the right timing, that weight will get strengthened, and that will bias him to sing the good-sounding variation again.
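The three-factor rule just described can be sketched as follows. All parameters (decay constant, learning rate, number of timepoints) are illustrative; the point is that coincident HVC and LMAN input tags a synapse with an eligibility trace, and a later VTA reward signal converts that trace into a weight change.

```python
import numpy as np

# Sketch of the three-factor plasticity rule: coincident HVC and LMAN
# input to an Area X medium spiny neuron sets an eligibility trace at
# that HVC synapse, and the synapse is strengthened only if VTA reward
# arrives while the trace is still active.  Parameters are illustrative.

n_timepoints = 10
w = np.zeros(n_timepoints)           # HVC -> Area X weights, one per time cell
eligibility = np.zeros(n_timepoints)
tau = 0.5                            # eligibility decay per timestep

def step(t, lman_active, reward, lr=0.1):
    eligibility[:] *= tau                 # traces decay over time
    if lman_active:                       # HVC cell t fires at time t;
        eligibility[t] += 1.0             # the coincidence tags this synapse
    w[:] += lr * reward * eligibility     # reward converts trace into weight

# The bird varies pitch (LMAN active) at timepoint 4; reward arrives one
# timestep later, after he hears that the variation sounded good.
for t in range(n_timepoints):
    step(t, lman_active=(t == 4), reward=1.0 if t == 5 else 0.0)

print(np.argmax(w))  # -> 4: only the synapse tagged just before reward grows
```

Because the trace decays, only synapses tagged shortly before the reward are strengthened, which is how the rule solves the delay between singing a variation and hearing that it sounded good.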
So that's it for birdsong reinforcement learning. I'm close to done. Does anyone have any other questions about how reinforcement learning works in the bird?
AUDIENCE: Would there be anything analogous in this model to muscle memory? Or is that all related to building the circuits?
EMILY MACKEVICIUS: What do you mean by muscle memory?
AUDIENCE: So if he has some kind of muscle memory in his vocal tract? It would be analogous to your muscle memory when you're playing the cello?
EMILY MACKEVICIUS: Yeah, you might call it muscle memory, but it would actually be more of a sequence that's in the brain that's causing this to happen, or in the spinal cord in some cases. But I think of HVC as that sequence in the brain that's causing the song.
AUDIENCE: Are there any predictions about how long it should take to learn from proper reinforcement models?
EMILY MACKEVICIUS: Yeah, that's a good question. And so one of the general critiques of reinforcement learning is that it takes a really long time. All you know is whether it was good or bad, and it takes a really long time to learn it. The thing with birdsong is that it actually does take a really long time. It takes several months for him to learn the song. He's singing hundreds, maybe even more than 1,000 renditions every day. And so, in that case, it's a slow learning process and [INAUDIBLE].
AUDIENCE: [INAUDIBLE] keeping an order of magnitude [INAUDIBLE] population based on [INAUDIBLE]. I mean, so you don't need a trillion.
EMILY MACKEVICIUS: Yeah, so Michael Stetner, another member of the lab, has coded up a reinforcement-learning-based model of how this works. And I don't remember the exact number, but yeah, he's gone through it. It doesn't take a trillion. It's long, but it's on the order of how long it takes the bird. Does anyone else have a question?
All right. And I just wanted to raise the question, what is happening during the very earliest song? So at this stage, if you record in HVC, which another member of the lab, Tatsuo Okubo, has done, you don't have the same structure that you have in an adult bird. So you're not yet engaging the system of rewards and learning that I just described. So something else is happening during sub-song. And there are several theories of what type of learning might be happening during sub-song. Oops, that's sub-song. Sorry.
So during sub-song, instead of trying to learn the match, he might be more learning how to breathe properly. So song, as well as speech and other behaviors, requires breathing. The bird breathes out at every syllable and in at every gap, and this is the same for every syllable. So he might actually just be learning this basic pattern of breathing, this proto-syllable, which he can later turn into other syllables.
So some people, Surya Ganguli and Richard Hahnloser in particular, think about the bird as developing an inverse model during sub-song. So the idea is that he might be forming connections between what he does and what he hears-- "fire together, wire together" plasticity-- creating an inverse model so he can go between the motor representation and the auditory representation.
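The "fire together, wire together" inverse-model idea can be sketched in a hypothetical toy setup where each motor command deterministically produces one sound; the one-hot coding and the random permutation standing in for the vocal tract are illustrative assumptions, not part of the actual proposal.

```python
import numpy as np

# Sketch of the Hebbian inverse-model idea: during babbling, motor
# commands and the sounds they produce are active at the same time, so
# "fire together, wire together" plasticity between auditory and motor
# units learns the auditory -> motor (inverse) mapping.

rng = np.random.default_rng(0)
n_motor = n_audio = 5
perm = rng.permutation(n_motor)      # unknown forward map: motor -> sound

def forward(m):
    """One-hot motor command m -> one-hot auditory response."""
    audio = np.zeros(n_audio)
    audio[perm[m]] = 1.0
    return audio

W = np.zeros((n_motor, n_audio))     # auditory -> motor (inverse) weights

# Babbling: try every motor command repeatedly, hear the result, and apply
# a Hebbian update so co-active motor and auditory units wire together.
for _ in range(10):
    for m in range(n_motor):
        motor = np.zeros(n_motor)
        motor[m] = 1.0
        W += np.outer(motor, forward(m))

# The learned weights invert the forward map: for each heard sound, the
# most strongly connected motor unit is the one that produces it.
recovered = np.argmax(W, axis=0)[perm]
print((recovered == np.arange(n_motor)).all())  # -> True
```

The appeal of this scheme is that it needs no reward signal at all: random exploration plus coincidence-based plasticity is enough to learn the mapping, which fits a stage of learning before the reinforcement system is engaged.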
And we might also think about learning what the tasks are. So at the beginning, I talked about how a lot of learning, in the Marr Levels framework, is framed in terms of these goals and tasks. But you also have to learn what the goals are, and that's an important part of learning. So in terms of bracketing these tasks, I talked about time cells, but is there some equivalent of grid cells, which might bracket these tasks into different segments?
And there's a quote from a review by Kyle Smith and Ann Graybiel about bracketing tasks together. They say, "Habits are sequences of actions that are grouped together-- chunked-- for ready deployment. And beginning-and-end task-bracketing activity provides a compelling candidate neural correlate for the chunking of actions together into a habitual unit." So if you're learning something that involves time, a really important thing is to break it up into units and sub-tasks that you can later deploy in succession.
So ending a bit early, but thank you very much. Everyone, [INAUDIBLE] lab.
[APPLAUSE]
Any other questions?
AUDIENCE: Going back to your comment about the difference between a brain and a computer, and that the brain doesn't have a fast clock. Doesn't the brain already work on a slower time scale than a modern computer? So then couldn't you say that something like the theta rhythm, or other kinds of clock signals, plays that role?
EMILY MACKEVICIUS: So there are things like the theta rhythm, which you can think of as a clock. It might just be a slower clock. But the theta rhythm is not synchronized across the entire brain, whereas the computer has a clock which sets the time scale for the entire computer.
And so I guess I'm thinking about it-- not that a clock doesn't exist in the brain, but it's something different, like it might synchronize a pair of regions or a set of regions and not the whole brain. And also, even though the theta rhythm or any real biological oscillation that I can think of is much slower and messier than a computer clock, the brain can actually do things that are very quick compared to what you would think if that was clocking every computation. So I guess I'm thinking about it as time in the brain is fundamentally different than how it is in a computer with a centralized clock.