Next-generation recurrent network models for cognitive neuroscience
Date Posted:
June 16, 2021
Date Recorded:
June 15, 2021
CBMM Speaker(s):
Guangyu Robert Yang
Description:
Recurrent Neural Networks (RNNs) trained with machine learning techniques on cognitive tasks have become a widely accepted tool for neuroscientists. In comparison to traditional computational models in neuroscience, RNNs can offer substantial advantages in explaining complex behavior and neural activity patterns. Their use allows rapid generation of mechanistic hypotheses for cognitive computations. RNNs further provide a natural way to flexibly combine bottom-up biological knowledge with top-down computational goals into network models. However, early work in this approach faces fundamental challenges. In this talk, I will discuss some of these challenges and several recent steps that we took to partly address them and to build next-generation RNN models for cognitive neuroscience.
PRESENTER: Robert Yang, whom I'm introducing today, will be joining MIT as an assistant professor in the Brain and Cognitive Sciences Department, with a joint appointment in the Schwarzman College of Computing, starting next week, I think July 1 or the week after.
He received his undergraduate degree from Peking University and his PhD in neuroscience from New York University, working with Xiao-Jing Wang. During his PhD, he studied how distinct types of inhibitory neurons in the brain can coordinate information flow across brain areas. Very nice work.
And in another piece of work, he studied how the same artificial neural network can accomplish many cognitive tasks. He still is, I think, a post-doctoral research scientist in the Center for Theoretical Neuroscience at Columbia University for a few more days. And I'm very happy to welcome him to CBMM. He's probably the newest member of CBMM. I really think that his approach is exactly what we need to bridge computation with machine learning and to really transform deep learning models into serious models of the brain.
Right now they are engineering artifacts with few concrete relations to neurons, and often even to the anatomy and physiology of the brain. But Robert's approach incorporates a lot of biological constraints that we know from neuroscience, and his models are real neuroscience models.
And he will speak about some of this work. He has been doing a lot of different things, but some of it is in terms of recurrent networks. So welcome, Robert.
ROBERT YANG: Thank you for the invitation and for the very nice introduction. Very happy to be here. So today, this is kind of an unusual talk, because I'll talk about science but also give you some opinions on what I think the next-generation recurrent neural network models for cognitive neuroscience should be. This is by no means the only view. But I want to present to you one view, with some evidence from our own work.
So of course, I don't have to tell this audience that neural networks have been used in cognitive science and neuroscience for many decades, dating back at least to the '80s, probably earlier. In neuroscience, for example, neurons in feedforward networks have been compared to neurons in parietal cortex, by Zipser and Andersen. And in the cognitive science field, it's a whole business. The connectionists did a lot of work, starting with these earlier feedforward networks, but also with simple recurrent networks and then multi-layer recurrent networks. It's a very rich literature.
But in neuroscience particularly, there has been increased interest since the 2010s, and a lot of it started at MIT with Dan Yamins's work and Jim DiCarlo's work. But many other people have also worked on this, too many to be listed here. At this point, it seems like everyone doing computational neuroscience has at least one person in their group who does some neural network stuff.
Now of course, it's not a coincidence that this happened in the last decade. We have better hardware. Deep learning happened. We have better algorithms and better software. But I want to make the case that our use of these neural networks in cognitive science and neuroscience is not just because they're hot, not because they're the latest fancy tool. There are fundamental reasons to use them, because they have some advantages, but also disadvantages, compared to traditional computational models.
So we described some of that in a recent primer. But there are many, many other interesting reviews and opinion articles on this topic from the last two or three years. Neural networks can, of course, be used as data analysis tools, but I'm not going to talk about that. When we use them as computational models of the brain, they can help us because they allow us to model much more complex behavior. This is most obvious in vision or other sensory areas. For cognition, they're really helpful at explaining complex activity that is observed in the brain. It's quite hard to capture the complex neural activity patterns that you observe in prefrontal cortex with hand-designed neural networks.
Another thing that is very nice about these neural networks is that they provide an optimization perspective, or you can call it a deep learning or evolutionary perspective. It provides a way to look at the network through its objective, architecture, and learning algorithm, instead of only the mechanistic model after training. So that's why some of us want to use neural networks to study the brain. It's not just because it's convenient and hot.
And today I will focus on recurrent neural networks for cognitive neuroscience. And again, this has a long tradition. But really, there is a paradigm that has emerged. And it's really well exemplified in this paper by Mante, Sussillo, and Bill Newsome. And many other people have worked in this tradition as well.
So the tradition is, you start with a single task, then you train a single network with backpropagation. And that gives you a computational model. So in this paper, they took a cognitive task where animals have to make a decision based on a rule. Sometimes they need to tell whether dots are moving left or right. Sometimes they have to tell whether dots are green or red. And then they trained a recurrent neural network. These are actually very similar to the rate-based recurrent networks that traditional computational neuroscientists study.
So in terms of the neurons, they are not that different. The main difference is that these networks are trained with gradient-based methods, which is, of course, not biologically plausible at face value. But usually the point of this approach is not to mimic learning per se. It's to come up with a candidate model for whatever cognitive task you're interested in.
And then you have this candidate model. And then you can do whatever data analysis you want to do. You can look at neural representation or neural dynamics and so on. For example here, they did some PCA-like dimensionality reduction analysis. And then you can compare that with monkeys performing very similar tasks. And then you can make a decision whether or not this is a good enough match.
So this is a nice paradigm because it can be applied to many, many different tasks. So back in 2016, [INAUDIBLE] and I thought we would apply this paradigm to many classical tasks, just to make sure that these networks are not doing something crazy. So you can, for example, take a perceptual decision-making task, take some salient feature from the neural data, and then compare the model with it, and do this on other tasks: a parametric working memory task, a multi-sensory integration task.
So the general lesson here is that the most salient features that people have found in neural activity, in prefrontal cortex for example, can usually be recovered, to some extent, in these recurrent neural networks, despite the networks having really different learning rules. And all of this is done with supervised learning. But you can play the same game with reinforcement learning, which is more biologically relevant. And there you find a very similar match between data and model.
So this is nice. And this is fun. But of course, today I promised to talk about the next generation, so how to move beyond that. Now before we talk about how to move beyond it, we should think about why we should move beyond it. Using neural networks to study the brain has many problems in itself that everyone already knows about. For example, they are difficult to interpret, gradient descent is not biological, and things like that.
But besides that, there are some specific issues when using RNN models for cognitive neuroscience. One is that cognition involves a wide range of brain areas, cortical and subcortical. It's not just a single recurrent network. And cognition is very flexible; that's the hallmark of cognition.
But when you train a recurrent network on a single cognitive task, it is not very flexible; that's the only task it can do. And learning of lab tasks relies on existing circuitry. Animals and humans go into the lab and their brains already can do other things.
So the brain is not a completely random network, unlike what we start from when we train a network in machine learning. And there are probably more issues that I'm not talking about here. These are reasons that we should move beyond this first generation.
And now, how should we move beyond it? Just to recap, in the first generation we're talking about a single cognitive task, a single-module rate-based RNN, and a backpropagation-based training algorithm. Hopefully in the future we can get to more naturalistic cognition, we can incorporate multiple areas and cell types and so on, and we can have biological learning and plasticity rules.
So how do we go from the first generation to the future? Today I will talk about a few works, some done by me and some collaborative, that really just start to explore this direction. Nothing is fully satisfactory yet. So first, let's talk about how we can move away from looking at a single cognitive task.
So of course, prefrontal cortex is engaged in many, many things. It's not just a single task. The same neurons can be involved in working memory and decision making and other things. So how does the same circuit perform many tasks? This was the scientific motivation for us when we did this work, which was published two years ago.
So we can come up with some hypotheses for how the same circuit can perform many different tasks. At the single-neuron level, it's possible that you have very clustered solutions versus non-clustered solutions. So what do I mean by that? Let's say you look at how a network represents two different tasks. It's possible, in one extreme, that you have a completely private network for each task, and they are completely independent. Or you can have a complete mixture, where each neuron is involved in both tasks, maybe to different extents. And there is a gradient in between.
And there is another axis, which I'm not going to talk much about today, which is whether or not you can represent these tasks compositionally. So I'll skip this today. In this work, together with Madhura Joglekar, Francis Song, Bill Newsome, and Xiao-Jing Wang, what we did is try to investigate potential solutions in neural networks. We took a single-module recurrent neural network, very similar to what David Sussillo and other people use, and we simply trained it on many different tasks that cognitive neuroscientists have studied, particularly in animals like monkeys.
This includes memory-guided saccade, parametric working memory, perceptual decision-making, context-dependent decision-making, multi-sensory integration, delayed match-to-sample, and delayed match-to-category. And some technical details: here we used vanilla recurrent neural networks and trained them with stochastic gradient descent methods, Adam in particular.
The equation is rather straightforward and very similar to how you would code up a traditional rate-based neural network model. An important detail is that all the tasks are randomly interleaved during training. So now we have this network, and it can perform these tasks pretty well. It's actually pretty fast to train; it takes maybe an hour on a laptop CPU.
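To make that concrete, here is a minimal sketch of the kind of rate-based update I mean; the sizes, time constants, and the rectified-linear nonlinearity are illustrative choices, not the exact settings of the paper:

```python
import numpy as np

n_rec, n_in = 256, 10          # illustrative sizes
dt, tau = 20.0, 100.0          # ms; illustrative time constants
alpha = dt / tau

rng = np.random.default_rng(0)
W_rec = rng.normal(0.0, 1.0 / np.sqrt(n_rec), (n_rec, n_rec))
W_in = rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_rec, n_in))
b = np.zeros(n_rec)

def step(r, u):
    """One Euler step of  tau * dr/dt = -r + f(W_rec r + W_in u + b)."""
    return (1 - alpha) * r + alpha * np.maximum(0.0, W_rec @ r + W_in @ u + b)

r = np.zeros(n_rec)
for t in range(100):           # run on a dummy input stream
    r = step(r, rng.normal(size=n_in))
```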
And now we need to quantify how each neuron is engaged in each task, and we need to quantify it in a way that generalizes to all the tasks we have. So we introduced a simple measure called task variance. You take a unit and you look at its activity across different task conditions. These can correspond to different stimuli or different responses within one task. So you have these different curves.
And then you simply look at the variance across these curves at a single time point, and then average across time points. So in the end, you get a single number for each neuron and each task. And then you can do that for all the tasks; you get 20 numbers for 20 tasks. And that tells you, for example, that this unit cares about the task conditions in these tasks but not so much in those tasks.
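In code, the measure as just described could look something like this (shapes and names are mine, for illustration):

```python
import numpy as np

def task_variance(activity):
    """Task variance per unit.

    activity: condition-averaged responses for one task, with shape
    (n_conditions, n_time, n_units). Take the variance across conditions
    at each time point, then average over time: one number per unit.
    """
    return activity.var(axis=0).mean(axis=0)      # shape (n_units,)

# e.g. 8 conditions, 50 time steps, 256 units of placeholder data
tv = task_variance(np.random.rand(8, 50, 256))
```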
And then you can do that, of course, for every unit in the network. And we got a plot like this. Here the color indicates how strongly a neuron is engaged in a task. Each column is one neuron, and each row is one task. We already got rid of all the neurons that are completely silent. And what you can see is that when we sort this, it's already obvious that there are some clusters of neurons. Some neurons have very similar engagement across tasks. For example, this set of neurons is engaged in this set of tasks.
Importantly, it's not that each task has its own specific cluster. In fact, these clusters are usually involved in multiple tasks, and each task usually involves multiple clusters. So what this suggests is that perhaps it's not each task that corresponds to a cluster here, but an underlying cognitive process that corresponds to a functional module.
This can be seen more clearly when we look at specific tasks. So I didn't tell you the tasks, but these are decision-making tasks and these are working memory tasks. And people have suggested that working memory and decision making can share a similar underlying neural circuit. And here we have a module that is engaged in both decision-making and working memory tasks.
And we can also look at whether or not this is causal. You can lesion one cluster at a time. And you can see that if you lesion a cluster, for example cluster 5, then it hurts the tasks where the task variance is high, telling us these clusters are causally relevant for good performance.
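A sketch of such a lesion, assuming we already have a cluster label for each unit (the function and names here are hypothetical stand-ins for the actual analysis code):

```python
import numpy as np

def lesion_cluster(W_rec, cluster_ids, cluster):
    """Silence one cluster by zeroing its outgoing (and incoming) weights."""
    W = W_rec.copy()
    dead = (cluster_ids == cluster)
    W[:, dead] = 0.0     # remove the lesioned units' influence on others
    W[dead, :] = 0.0     # and the input they receive
    return W

# re-evaluate each task's performance with, e.g., cluster 5 removed
W_lesioned = lesion_cluster(np.random.randn(50, 50),
                            np.random.randint(0, 8, 50), cluster=5)
```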
PRESENTER: Do these results hold across different architectures / activation functions / other modifications, e.g., dropout?
ROBERT YANG: Right. That's a great question. So we did look at combinations of hyperparameters. And in general, I think that's very important. There was a time when papers like these published results from just a single set of hyperparameters, and that's not satisfying.
Here, we trained hundreds of networks with different combinations. And most hyperparameters don't matter. We did find that the clustering result seems to depend on the choice of activation function. Here, we used an activation function that is rectifying at lower values: if a neuron receives negative or low input, it's not very active. If we don't have that, then it seems like we don't get clusters. We don't really understand this yet, so more work is needed to figure it out. That's a good question. Thank you.
PRESENTER: Great. Thanks. We have a follow up to it. You mentioned once you exclude the inactive neurons, you get the heat map. What fraction of neurons were excluded?
ROBERT YANG: It's a small proportion, maybe 20%. Yes. I don't think that is a major concern. If you want, you can probably make sure that all the neurons are active: for example, during training, if a neuron has been consistently silent, you just remove it from the network. Then you have a network where every neuron is active. And I think the result would be the same.
So one question is about catastrophic forgetting, which is a great question because it leads me to the next slide. So as I mentioned earlier, all the tasks are interleaved. So we don't really have the catastrophic forgetting problem. But if we do train them sequentially, then yes, we would have this problem.
And just for people who are not familiar with catastrophic forgetting, the idea is simple. If you train a network on one task, it finds some parameters that are good for that task. If you then train it on another task without going back to the first one, it's going to find parameters that are good for the other task, task 2, which are not necessarily good for task 1 anymore. So then you forget task 1.
So this is a pretty big problem that people have spent a lot of effort on in the past five years. And some of the earliest work on this uses essentially this idea: you can penalize deviations of important synaptic weights. If you have a method to determine that some synapses are important for a previously learned task, then you can try not to mess with them; you can try to protect them. And technically, you can add a penalty to the loss function to prevent those weights from moving away.
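Schematically, and glossing over how the importance weights are computed, the penalized loss looks like this (a sketch in the spirit of synaptic intelligence; the names are illustrative):

```python
import torch

def continual_loss(task_loss, params, old_params, omegas, c=1.0):
    """task_loss plus a quadratic penalty that anchors each parameter to
    its value after the previous task, weighted by an importance estimate
    omega (how omega is computed is the method's key ingredient)."""
    penalty = sum((om * (p - p_old) ** 2).sum()
                  for p, p_old, om in zip(params, old_params, omegas))
    return task_loss + c * penalty
```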
And so we used one of these methods, developed by Friedemann Zenke and colleagues in Surya Ganguli's group, called synaptic intelligence. What you see here is the task performance when you directly apply gradient descent versus when you use this continual learning technique. Here, we first train on this task, then on this task, then on this task, and so on. And what you can see is, for example, if you look at this task, when you're training on it, of course, it's doing pretty well. But once you're no longer training on it, the performance starts to drop with the traditional method.
But using continual learning methods, you can do better. And this is quantified here. This is the performance after all the training, measured at the end across all tasks. And you can see, for example, for this task, it really makes a big difference: with this continual learning method, you can do much better than with the traditional method. But of course, if we just stopped here, this would be just another application of their very nice rule.
What we want to do is, again, look at the neural representation and see whether or not continual learning has an impact on the representation. And in this case, we also want to compare with data. To compare with data, we will focus on just two tasks, because that's where we have the data. These are the context-dependent decision-making 1 and 2 tasks, which is the Mante et al. task that I introduced earlier.
So now we need a measure that focuses on just two tasks. If you remember this plot, this is the task variance map. And if you look at these two tasks, you can already see some structure. You can see these neurons are engaged in both tasks, whereas these neurons are engaged in DM1 and these neurons are engaged in DM2. So we can quantify that better using the fractional task variance, which is simply the task variance for one task minus the task variance for the other task, divided by their sum.
And so what this gives you is a number between minus 1 and 1 for each neuron. And then we can plot the distribution of this value across a network. And so if this value is close to 1, it means that the neuron is only involved in one of the tasks. And if it's close to minus 1, then it's only involved in the other task. And if it's close to zero, it means it's equally involved in both tasks. So here, in this network, we see that we have these three peaks corresponding to three modules.
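The fractional task variance is a one-liner; per unit it is (TV_A - TV_B) / (TV_A + TV_B), so a value of +1 or -1 means the unit is purely involved in one task, and 0 means equal involvement:

```python
import numpy as np

def fractional_task_variance(tv_a, tv_b, eps=1e-12):
    """Per-unit fractional task variance, in [-1, 1]."""
    return (tv_a - tv_b) / (tv_a + tv_b + eps)

# a histogram of these values across units gives the three-peak plot
ftv = fractional_task_variance(np.random.rand(256), np.random.rand(256))
```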
Now what is the impact of continual learning? What I showed you here is without continual learning; this is from our previous results. If we introduce continual learning (here c = 1 corresponds to continual learning), what you can see is that there is a very big change in the neural representation.
Now I want to remind you that the performance is very similar. So all that's changing is the learning rule. And what this suggests is that continual learning actually increases mixing: more mixing and less modularity, at least in these two tasks.
And we can compare this result with prefrontal cortex data recorded from monkeys doing very similar tasks. And what we see is that, at least in these two tasks, in the area we're looking at, the data seem more consistent with the network trained with continual learning.
So there is another question about spatial locality: is there spatial locality in the clusters; are neurons that are assigned to similar clusters nearby or connected? That's a good question. We don't have topology here. But an interesting extension is to embed neurons in a two-dimensional sheet. You can also introduce a cost on long-range connections. And then you can look at the interplay between spatially embedded neurons and clustering. Thank you. That's a good question.
So I'll move on for now and come back to see if there are highly voted questions. I'll just briefly mention that when I was doing that work, it became very obvious that it's tedious to code up 20 tasks, and it would be tedious if someone had to do it again. So working with Manuel Molano-Mazón at IDIBAPS in Barcelona, we have a collection of tasks that we have open-sourced online. So feel free to check it out.
So I'll move on to talk about another work, where we try to introduce more biological learning and plasticity rules into these recurrent networks. In particular, we introduce short-term plasticity. These are short-term synaptic changes that last usually hundreds of milliseconds to a couple of seconds. And they have been hypothesized to be important for many things, including working memory.
So the classical theory about working memory is that it's based on persistent neural activity. And this hypothesis has been refined: so this is a space of neural activity, and if you want to store something, you can store it in some sustained state. You can store it in a unique dynamic trajectory. Or you can even store it in some transient trajectory, where activity goes up and then goes down. But of course, the direction in which it goes up depends on what you're storing in working memory.
Now an alternative hypothesis was proposed, which is that-- it's not so much an alternative. It's a complementary hypothesis, which is that working memory can rely on short term synaptic plasticity. So the way it works is that first you activate some neurons when you show a stimulus. And that would strengthen the connections going out of these neurons. So that even when the neurons themselves are no longer active, you still have the trace of memory in the synapse.
And then you can essentially recover that trace by reactivating this population just uniformly. But because these synapses have been strengthened, they would be able to reactivate the neurons that are relevant. So this is a hypothesis. And people have designed clever experiments to test this hypothesis in humans and animals. But it's a fairly difficult thing because it's very hard to measure synaptic variables.
So what we thought is, maybe we can look at this in neural networks. In particular, we want to test the hypothesis that working memory is more active, that it relies more on activity, when it needs to be manipulated. The intuition is that storing something in synaptic weights is very good; it gives you high capacity. But it's very hard to change whatever you're storing, because synaptic weights are not as easy to modify as activity.
So our game plan here to test this hypothesis is to first introduce recurrent networks with short term plasticity and then train them on various working memory tasks. And then we would quantify their reliance on activity versus plasticity based mechanisms. And finally we would test if working memory is more active when it needs to be manipulated.
So this is a really fun collaboration with Nick Masse and Dave Freedman at Chicago. Nick Masse is a postdoc with Dave Freedman and did 95% of the work; I did very little here. So first we introduce RNNs with short-term plasticity. All we need is an implementation of short-term plasticity that is rate-based, so that it works with our rate-based neurons.
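For concreteness, here is one common rate-based formulation, in the spirit of Mongillo et al.'s synaptic theory of working memory; the parameter values are illustrative, not the ones used in this work:

```python
import numpy as np

def stp_step(u, x, r, dt=0.01, U=0.15, tau_f=1.5, tau_d=0.2):
    """One Euler step of facilitation u and depression x per presynaptic
    neuron; the effective output a synapse transmits is u * x * r."""
    du = (U - u) / tau_f + U * (1.0 - u) * r
    dx = (1.0 - x) / tau_d - u * x * r
    return np.clip(u + dt * du, 0, 1), np.clip(x + dt * dx, 0, 1)

n = 100
u, x = np.full(n, 0.15), np.ones(n)
r = 20.0 * np.random.rand(n)          # placeholder firing rates (Hz)
u, x = stp_step(u, x, r)
effective_drive = u * x * r            # what the recurrent weights act on
```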
And then we train them on various working memory tasks. Importantly, we chose tasks where some have a low level of manipulation and some have a high level of manipulation. I'll show you what I mean. For example, in this delayed match-to-sample task, you have a fixation, then a sample, then a delay period, and then a test stimulus. What you need to report is whether or not the test is in the same direction as the sample. In this case, you don't need to manipulate the sample stimulus. All you need to do is remember it.
This is in contrast to a modified task, delayed match-to-rotated-sample, where you still have a similar structure, but now a match is when the sample is 90 degrees away from the test. In this case, intuitively, you need to do some effective rotation of the sample stimulus, either during encoding or during the delay period, so that you can compare it with the test. It's not necessarily how the network works, but we thought this is one way to introduce a need for manipulation into the network.
And we have a series of tasks like this. Then we need to quantify the network's reliance on activity-based versus plasticity-based mechanisms. And because this is a recurrent network, we can do whatever we want with it. So we can try to decode the amount of information about the stimulus that is available in neural activity and in synaptic variables. For example, in a network trained on this delayed match-to-sample task, each curve corresponds to an independently trained network. You can see that you can decode the stimulus very well from the synapses, while in some networks you cannot decode it from the neurons at all towards the end of the delay period.
And all of these networks can do the task just fine. So what this is telling us is that this recurrent network, when endowed with short-term plasticity, can solve the delayed match-to-sample task with a silent working memory mechanism, where at the end of the delay period the network is silent. There is no neural activity.
In comparison, in networks trained on the delayed match-to-rotated-sample task, all the networks have some decodable information in neural activity during the delay period. And we can quantify that; we can quantify how much information is in the persistent activity.
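The decoding analysis itself is standard; something like the following, with a cross-validated linear decoder standing in for whatever classifier was actually used, and random placeholder data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

n_trials, n_units = 200, 100
stimulus = np.random.randint(0, 8, n_trials)     # e.g. 8 motion directions
activity = np.random.rand(n_trials, n_units)     # rates at one time point
synaptic = np.random.rand(n_trials, n_units)     # u * x at the same time

for name, X in [("activity", activity), ("synapses", synaptic)]:
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, stimulus, cv=5)
    print(name, acc.mean())                      # chance here is 1/8
```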
So finally, this allows us to test if working memory tends to be more active when it needs to be manipulated. We also have a measure for how much manipulation is happening in the network for each task. And this allows us to make a plot like the following, where for each task we measure how much the network relies on persistent activity versus how much manipulation is happening. To measure manipulation, we essentially compare the activity vector at the beginning of the delay period with a synaptic vector.
Here, we have one synaptic variable for each neuron, because essentially the mechanism is presynaptic. So now we have two vectors, and we can look at whether or not these two vectors are similar. If they're very similar, that indicates a low level of manipulation. And if they're not similar, that's a high level.
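One illustrative way to turn that comparison into a number is a cosine similarity between the two population vectors (this is my shorthand for the measure, not necessarily its exact form in the paper):

```python
import numpy as np

def manipulation_index(activity_vec, synaptic_vec):
    """Higher when the activity and synaptic population vectors point in
    different directions, which we read as more manipulation."""
    cos = np.dot(activity_vec, synaptic_vec) / (
        np.linalg.norm(activity_vec) * np.linalg.norm(synaptic_vec) + 1e-12)
    return 1.0 - cos
```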
And what we see is that there is a very strong correlation between the level of manipulation in the network and the amount of persistent neural activity, indicating that, possibly, in the brain more persistent activity is seen in tasks that require more manipulation.
So finally, I will just briefly talk about one more work and then turn to some discussion points. Another thing that is missing in the classical paradigm is that there is no place for evolution and development. You start with a randomly connected recurrent network and then you just train it on one task. But of course, we go into the lab with a lot of knowledge already.
So how do we build that in, and why would that be useful for explaining data? Here, in this collaboration with Manuel Molano-Mazón and Jaime de la Rocha at IDIBAPS, we looked at a suboptimal behavior that they discovered in a previous paper. They trained rats to do a simple decision-making task, left or right. The interesting design is that there are blocks where the correct choice tends to repeat, and blocks where the correct choice tends to alternate.
And the rats do something strange. In the repeat block, they tend to repeat more; that's correct, that's the right thing to do. But if they experience a single error, they essentially just throw away the information about whether they're in the repeating block or the alternating block, and they treat the two blocks the same. So this is quite confusing, and it's not the optimal thing to do.
You can quantify this behavior with a reset index. If it's high, it means the animal ignores the block it's in and just treats the blocks the same. These are results from different rats. And if you train a network directly on this task, you don't see this behavior, which, to some extent, is expected, because this behavior is suboptimal. If you train a network very heavily on one task, there is no guarantee that it will learn the optimal strategy, but it certainly tries to.
So we thought, how do we address this discrepancy? And we thought, in a natural environment, unlike in the 2AFC (Two-Alternative Forced Choice) case, a correct and an error are not equally informative. A correct is good because you can just keep doing what you have been doing. Whereas if you get an error, it just tells you that what you are doing is wrong, but it doesn't tell you what the right thing to do is.
So we thought, maybe we can build this intuition into our network. What we did is pre-train the network on an NAFC task. Instead of a 2AFC, where the network always sees two choices, so that an error on one means the other choice is correct, we train on an NAFC. And when we pre-train it on the NAFC and then test it on this 2AFC task, the network does reset. In fact, there is a parametric relationship between how complex the pre-training environment is and whether or not the network behaves more like an animal.
And here, as you can see, as you increase the number of choices in the pre-training environment, the network's reset index gets higher and higher. So this is an example where taking into account the history (evolutionary or developmental) of animals before they enter the lab can presumably help us better capture suboptimal behaviors of animals. And suboptimal behaviors may not be suboptimal in general; they may just be suboptimal in that specific task.
So finally, I want to end with some discussion points. I have shown you three examples of how to go beyond the first generation. One thing I really want to do in my new lab at MIT is to look at multi-area models. Now, I want to make the case that we can already build multi-area models if we're only focused on the engineering. For those of you who use PyTorch, going from one area to n areas is a single number when you call torch.nn.RNN.
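Literally, something like this, where the number of stacked recurrent "areas" is just the num_layers argument:

```python
import torch

# one "area" versus four stacked "areas": a single argument changes
one_area = torch.nn.RNN(input_size=10, hidden_size=256, num_layers=1)
four_areas = torch.nn.RNN(input_size=10, hidden_size=256, num_layers=4)

x = torch.randn(50, 1, 10)      # (time, batch, input)
out, h = four_areas(x)          # h has shape (num_layers, batch, hidden)
```

Of course, a stack of feedforward-connected layers is only the crudest notion of multiple areas; the scientific questions below are about doing this meaningfully.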
In this work that I did together with Igor Ganichev, Jonathon Shlens, and David Sussillo back at Google, what we did is build a multi-area model that incorporates visual areas, a short-term memory area, a semantic memory area, and a controller that sends a lot of feedback to these earlier areas through different attention mechanisms. We built this network really in admiration of the classical cognitive control model, which proposed that prefrontal cortex guides how information is transformed from sensory through association to output areas.
Here, similarly, our controller guides how information is processed in earlier areas. So we can build a complex model like this, and it can do complicated things. It performs pretty well on the CLEVR dataset, which was proposed several years ago: a network is shown an image like this and is then asked questions such as, are there an equal number of large things and metal spheres? This is already a pretty complex task, and I would argue it is more complicated, in many ways, than most tasks that we study in animals.
So this is showing that, from an engineering standpoint, maybe we already have the compute power and the tools to build a sophisticated network that can have multiple areas and do very sophisticated things, at least by the standards of animal cognition. But we're not done, because if we care about science, there are many scientific questions that remain unanswered.
For example, one question is, how do we capture the diversity across brain areas with a few principles? I'll go into more detail about this on the next slide. And there are many more: how to meaningfully map RNN areas to brain areas, how to evaluate multi-area models across many experimental datasets, how multi-area models should be trained. Here we just trained it on one massive task, or you could meta-train it on some massive meta-dataset. But is that the answer? Maybe it is. Maybe it's not. We don't know. And there are many more.
And so to give you a taste of the challenge ahead of us, I'll talk about this question of how to capture the diversity across areas with a few principles. Earlier, I was talking about prefrontal cortex as if it's just a single area. Of course, it's not; it consists of many areas. This is just a view of the lateral areas of prefrontal cortex, and there are many more areas.
And they each have, not completely different, but different properties and different engagement across tasks. So how can we build models that meaningfully capture this diversity? One idea, or perhaps the only idea, I would say, that we have right now is to rely on area-specific long-range connectivity.
One example is to build in a hierarchical organization, where you have several areas: the first area preferentially receives sensory inputs, the last area preferentially produces motor outputs, and the middle areas sit in between. That's one way to build area-specific long-range connectivity. And it does work to some extent.
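As a sketch, area-specific long-range connectivity can be imposed with a block mask over the recurrent weights; here, three hypothetical areas form a chain in which only adjacent areas are connected:

```python
import numpy as np

n_areas, n_per = 3, 100
n = n_areas * n_per
mask = np.zeros((n, n))
for i in range(n_areas):
    for j in range(n_areas):
        if abs(i - j) <= 1:      # within-area and neighboring-area blocks
            mask[i * n_per:(i + 1) * n_per, j * n_per:(j + 1) * n_per] = 1.0

W_rec = 0.05 * np.random.randn(n, n) * mask   # train under this fixed mask

input_mask = np.zeros(n)
input_mask[:n_per] = 1.0        # only area 1 receives sensory input
output_mask = np.zeros(n)
output_mask[-n_per:] = 1.0      # only area 3 drives the readout
```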
For example, in this very nice paper by Jonathan Michaels and colleagues, the early area is more similar to the parietal cortex that they recorded, the middle area is more similar to premotor cortex, and the last area is more similar to motor cortex. Another possibility is to have, for example, different readouts. In this work back in 2017, they built a two-area network with an actor-critic structure, trained on a bunch of tasks. One area needs to produce action outputs, whereas the other area needs to produce the value of the state in reinforcement learning.
And what we see is that the action-producing area is more similar to DLPFC, whereas the value-producing area is more similar to orbitofrontal cortex. But these are just some examples. What we lack is a more general demonstration that this principle works. And we also need to understand to what extent it is not enough.
So finally, I'll end with how we should build next-generation RNN models, and in what style. I've been struggling with, or thinking a lot about, this, because many of us read a lot of machine learning papers, and at some point you start to think more like machine learning people. You're tempted by that.
But on the other hand, if we want to do science, we cannot completely do machine learning. So how do we strike the balance? This is just my thinking, just some ideas. One is that we should have a continued commitment to function. Today, I mainly talked about how we should introduce biology into the networks.
But of course, if you just focus on that, then the networks may not be able to do interesting things, and we lose the whole point of studying these recurrent networks in the first place. So we should still commit to studying networks that can do interesting things. But at the same time, we should stay close to experimental data. If we just focus on building very, very powerful machines, then we move farther away from science, and it's harder to see the relevance to the brain.
And finally, I think we should emphasize both quantitative metrics and intellectual insights. In our field, the emphasis has been predominantly on intellectual insights, and that's great. But we can also learn something from vision and from machine learning, where large-scale benchmarks are used. We shouldn't rely solely on those, though.
So with that, I'd like to thank the people who over the years have worked on this line of research; I only got to talk about some of it. In particular, I want to thank Francis Song, who went to DeepMind; Xiao-Jing Wang, who was my PhD advisor at NYU; Manuel Molano-Mazón, with whom I did several really fun collaborations; Nick Masse at UChicago, who is brilliant; and David Sussillo, who hosted me at Google. And of course, thank you.