The Simplest Neural Model and a Hypothesis for Language
Date Posted:
December 19, 2023
Date Recorded:
December 7, 2023
Speaker(s):
Daniel Mitropolsky, Columbia University
Description:
Abstract: How do neurons, in their collective action, beget cognition, as well as intelligence and reasoning? As Richard Axel recently put it, we do not have a logic for the transformation of neural activity into thought and action; he considers discerning this logic the most important future direction of neuroscience. I will present a mathematical neural model of brain computation called NEMO, whose key ingredients are spiking neurons, random synapses and weights, local inhibition, and Hebbian plasticity (no backpropagation). Concepts are represented by interconnected co-firing assemblies of neurons that emerge organically from the dynamical system of its equations. It turns out it is possible to carry out complex operations on these concept representations, such as copying, merging, completion from small subsets, and sequence memorization. NEMO is a neuromorphic computational system that, because of its simplifying assumptions, can be efficiently simulated on modern hardware. I will present how to use NEMO to implement an efficient parser of a small but non-trivial subset of English, and a more recent model of the language organ in the baby brain that learns the meaning of words, and basic syntax, from whole sentences with grounded input. In addition to constituting hypotheses as to the logic of the brain, we will discuss how principles from these brain-like models might be used to improve AI, which, despite astounding recent progress, still lags behind humans in several key dimensions such as creativity, hard constraints, and energy consumption.
PROFESSOR: It's great to welcome Daniel Mitropolsky. Did I pronounce that correctly? He's a computer science theory guy who has proved a number of results that I'm told are very interesting.
And he is a student of Christos Papadimitriou at Columbia. And Christos is very excited about the results Daniel will speak about, which is a model of the brain that can generate a language program. This is based on assemblies, right? Neural assembly, which I think is probably the only interesting alternative to the neural network model of the brain. So, looking forward very much to your talk.
DANIEL MITROPOLSKY: Thank you very much. So welcome, everyone. Thank you so much for coming. My talk is about the simplest neural model and the hypothesis for language. So, of course, we're all here to study the brain, because one of the greatest fundamental questions of basic science is, what is the logic of the brain.
And I think nobody put this better than the famous neurobiologist, Richard Axel, who in 2018 said, "We don't have a logic for the transformation of neuronal activity to thought and action. I consider discerning this logic as the foremost research direction in neuroscience."
So just to put into context more what this means, it sort of means solving this bridging problem between, on one hand, the biology of neurons, which we understand very well. We have very sophisticated models of individual neuronal cells, as well as neuronal circuits. For example, for some small animals we've mapped out all the neurons in the animal, usually on the scale of a few hundred.
And on the other hand, we observe and have studied and know a lot about animal and human cognition, reasoning, linguistics, and cognitive science. But we don't know the mechanism by which millions of neurons act en masse to engender these phenomena. So in a sense, that's why we're all here in this institute and studying this.
So what is our approach or what can we offer as computer scientists? So this is sort of-- I like to motivate this with a paper that was a meme a few years ago, which many of you likely saw, called Could a Neuroscientist Understand a Microprocessor? Basically, these authors asked the question of whether, if you collected data from a microprocessor in a similar way as we do from the brain, and then applied similar data processing and analysis techniques, would you be able to understand, mechanistically, a microprocessor, which is something that, as computer scientists, we understand very well. We understand the exact logic. We know how logical gates work and so on.
And their conclusion was, you couldn't. You could find some patterns and some regularities. But you wouldn't really-- there's no way to uncover the structural logic of what's going on.
And so this really seems like a call to computer scientists to contribute somehow. And one way I like to phrase our approach is that it's sort of in the opposite direction, perhaps, of many neuroscientists. Maybe you could call it bottom-up or something. So instead of looking at more and more neuroscientific data-- and now there's much more; we have connectomes of some animals completely mapped out, and so on.
What if we take the most basic principles we know to be true across animal brains, and from the ground up, try to construct really minimalistic models of the brain as computational models, and maybe show that these are capable of doing something non-trivial? And then maybe eventually, we can bridge the gap and come from the other side and explain data from these computational models.
Here's an overview of how this talk is going to go. We're going to, together, come up with a very simple model of the brain. We'll think about how to do it, and then define one.
And then we'll see that this extremely simple model can do some surprising things. And then we'll talk about how you might be able to do language in this model. And I'll motivate why we would want to do language, once we've defined a model like this. I mean, you may already be very motivated to, as it is humanity's greatest trick.
How would you go about constructing a theoretical computational model of the brain? So what are the most basic tenets that we have to work from? Obviously, there's firing neurons. So all computation in the brain compiles down to neurons firing, with firing as an atomic operation, right?
Oh, and I should say something now before I go on, which is that this talk was designed for a variety of audiences, both computer scientists and people in neuroscience and cognitive science. So I'll be defining some things that may be very basic to you. So please bear with me, keeping in mind that this really is for an interdisciplinary audience.
So firing neurons, good, atomic. We also know that most neurons in the brain are either excitatory or inhibitory. So this is one way in which artificial neural networks are stronger than the brain, because in the brain, a given neuron's outgoing weights are either all positive or all negative. So we know that.
And we also know that the main mechanism by which synapses change in the brain, and by which learning happens, is plasticity: synapses change locally. So these are the ingredients we would start with, the simplest tenets that are true across brains, and across the brain. But we have to make even more concretizing and simplifying assumptions to be able to define anything mathematically. So here are more.
For firing, we're going to discretize time, which means that at every discrete time step t, a neuron can fire or not. Now, whenever I present this to computer scientists, they're like, this is not an assumption. Of course, you would do this. But, of course, for many neuron modelers, you might have real time.
So this is a very strong assumption. But it makes the model much easier to study. And it also makes it simulatable on standard hardware. If you're familiar with the fundamental issues of neuromorphic computing, it's that you need to design a hardware system that can simulate some real-time dynamical system of neurons. But ours can be simulated on standard hardware.
Now, when it comes to inhibition, there are many kinds of inhibition in the brain. But let's focus on one that is cross-brain, ubiquitous, and that we know plays a very important role, and start just with that: local inhibition, meaning that inhibitory neurons act on a small, local anatomical area. We see patterns like this across cortical columns and cortex and so on, maintaining some kind of activation threshold in local areas.
I'll make it clear what we mean by local inhibition in a second. But we'll restrict ourselves to that. And when it comes to plasticity, let's also start with the simplest possible kind, which is Hebbian plasticity, which is summarized by the principle, neurons that fire together, wire together.
So now I will define to you a very simplified model of the brain based off of those principles and assumptions. And by the way, our goal before wasn't just to simplify as much as possible. It might sound like that. But then we could just say, OK, time will be constant, and then it's trivially simulatable, right?
The point isn't just to simplify, but it's to simplify to the right extent where we get something that is simple enough to study but can still do something interesting. So let's see what we get. So this is our neural model, which, for short, we'll call NEMO. And I will now define it to you.
So in this model, we have a finite number of brain areas, which we'll label with capital letters. And each brain area will have n neurons. So n is a parameter of an instance of a NEMO model. So in different experiments or theorems, you'll change n. And I'll remark upon other parameters as we go.
And we initialize each brain area-- you could think at birth or in early development-- as a completely random graph, an Erdos-Rényi random graph, which means we connect every neuron to every other neuron with a directed edge with a fixed probability p. So p is another parameter of an instance of this model.
Pairs of areas may be connected or not. So another choice you make in an instance is which pairs of areas are connected. And if an area is connected to another area, we'll say that there's a fiber between the areas. Now, for those edges that we sampled with probability p, we initialize them with a weight, which we set to 1 initially.
This is an extremely simple setup. But so far, nothing I told you actually makes areas, areas, right? Because so far, I've connected neurons inside of an area with probability p. But then I also said that for pairs of areas that are connected, I randomly connect them as well with each synapse occurring with probability p.
So far, it's just one big area, really. So here is where areas become areas. And this is where the local inhibition piece of our equation comes in.
We'll assume that at any time t, exactly k neurons in every area can fire, where k is something small. Again, it depends on the instance. But we'll study settings where k may be square root of n or log n, something like that, so much smaller. So we actually don't model any inhibitory neurons explicitly. They're all modeled implicitly, in that they are maintaining a threshold at all times so that exactly k neurons in an area can fire.
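To make this setup concrete, here is a minimal sketch of how such an instance might be initialized in code. It is not the simulator used for the results in this talk; the parameter values are illustrative placeholders, and the only structure assumed is what was just described: Erdős-Rényi connectivity, with weight 1 on every sampled synapse.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_fiber(n_pre, n_post, p):
    """Random directed connectome between two areas (or recurrently within
    one area): each potential synapse exists independently with probability
    p, and every synapse that exists starts with weight 1."""
    return (rng.random((n_pre, n_post)) < p).astype(float)

# A toy instance (illustrative parameters, not the settings used in the talk):
n, p, k = 1000, 0.05, 32                 # k is roughly the square root of n
W_AA = make_fiber(n, n, p)               # recurrent synapses inside area A
W_BB = make_fiber(n, n, p)               # recurrent synapses inside area B
W_AB = make_fiber(n, n, p)               # the fiber from area A into area B
```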
And I should have said, by the way, if anyone has any questions throughout this talk, please feel free to ask. Prefer to be informal. Good. Yes?
AUDIENCE: You say fiber and random connections. I don't have a good intuition of what a fiber is.
DANIEL MITROPOLSKY: A fiber is just what we call the connections between two areas, if, in that instance, we decided to connect two areas. That's just terminology that I'll reuse later.
AUDIENCE: A neuron?
DANIEL MITROPOLSKY: They're synapses. Yeah, so area A has n neurons, area B has n neurons. Those n's could be different. If I say I want a fiber between them in this instance, then at that initialization-- birth, development-- I just connect all the neurons from here to here with probability p. Yeah. Good.
So I just said something without defining it. I said that in an area, k neurons can fire. So what is firing? That's the fundamental operation of this model.
So if we have two areas, A and B, and assume we chose some neurons in A, pre-initialized to fire, this is what happens. It's very simple. For each neuron i in B, its input is just the sum of the weights of incoming synapses from the neurons that fired. So for all the neurons that fired, if there's a synapse, you count its weight, and you add them together. So every neuron can locally compute its input.
And then this is what happens. The top k inputs in B are selected. And those will be the winners at this time step. These will be the neurons that fire in the next time step in area B. So this is where our local inhibition mechanism is implemented.
And in this model of computation, selecting the top k, it costs you one time step. So that's very important. And here's the last ingredient to our model, which is plasticity, which I mentioned before. And again, we're using the simplest kind for now.
If a neuron j fired, and immediately after, a neuron i fires, we reward this synapse by scaling it by a number, one plus beta, where beta is a non-negative number. So this is some positive constant. And beta is another parameter of an instance. So multiplicative weight updates.
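Putting the firing rule and the plasticity rule together, one NEMO time step for a target area can be sketched as follows. This builds on the toy setup above (numpy, make_fiber); the function name and signature are mine, not the talk's, and the only logic implemented is exactly what was just described: sum the weights from neurons that fired, take the top k, and scale the synapses from firing neurons onto the new winners by one plus beta.

```python
def step(inputs, W_rec, prev_winners, k, beta):
    """One NEMO time step in a target area.

    `inputs` is a list of (upstream_winners, fiber_weights) pairs, one per
    upstream area whose winners fired into this area at time t; `W_rec` and
    `prev_winners` are the area's own recurrent weights and winners at time t.
    """
    total = W_rec[prev_winners].sum(axis=0)           # recurrent input
    for pre, W in inputs:
        total = total + W[pre].sum(axis=0)            # input along each fiber
    new_winners = np.argpartition(total, -k)[-k:]     # k-winners-take-all: local inhibition
    for pre, W in inputs + [(prev_winners, W_rec)]:
        W[np.ix_(pre, new_winners)] *= (1 + beta)     # Hebbian multiplicative update
    return new_winners
```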
Now, especially for any neuroscientists in the audience, what I just presented to you is extremely simple. I mean, the ingredients in it are known to be in the brain and to be important. But we could add much more neuroscientific detail. We could have many more sophisticated notions of plasticity. We could have additive plasticity-- even Hebbian plasticity could be additive-- or a general increasing function instead of a fixed beta. There's STDP.
We don't have any sort of weight normalization. This is a question I get often when I present the basic model. People say, well, your weights will just blow up, right? But we know that homeostasis and weight renormalization is important in the brain.
We could also have variable k. So even though there's good evidence to believe that there are areas in the brain maintaining pretty consistent thresholds of how many neurons fire, we also know that in those same areas, with overwhelming input coming in, you might have 2k neurons fire at some time step. So this could also be modeled in a more elegant, or a more complex way, I should say.
And we also only have one kind of inhibition. In fact, all the neurons we modeled explicitly were excitatory, right? But we know that there's long range inhibition in the brain and that it plays an important role. So we could have inhibition between areas or long range interneurons.
And I want to say that all of these extra ingredients, we do sometimes use them. And we study them. So you have something like a complexity theoretic zoo, if that means anything to you, where we have the most basic model, which is the one I defined to you. And then we have these extensions. And I'll mention them when they become important. In particular, the long range inhibition will be important.
But it suggests the following computer science to neuroscience agenda. So the way I see this work is we start with the simplest possible model, which is the one I showed you, and we see what it can and what it cannot do, and build a theory this way. Good. So I told you early on that I'll show you that this thing can do some surprising things. So here we go. Yes?
AUDIENCE: [INAUDIBLE]
DANIEL MITROPOLSKY: Same probability, p. Yeah, you could have a model where you have different probabilities. You could have different p's in different areas or among different fibers. But for now, think of them as all the same.
AUDIENCE: [INAUDIBLE]
DANIEL MITROPOLSKY: Yeah. Well, I think all the algorithms I'll present to you, except maybe one, k is the same between all areas as well. Yeah. That's the simplest. You can make all those parameters the same.
Good. So now I'm going to show you that this model can do something non-trivial, something surprising. So assume the following setup. I have two areas, A and B. And I'm going to fix k neurons in A that are going to fire for all time steps, one, two, and onwards.
So this is sort of breaking the model. I'm forcing the same k neurons to fire. And when we do this, we call this a stimulus. You can think of it as, when you're seeing something, whatever brain area is responsible for that perception, no matter what other inputs are coming in, the stimulus is always firing and overriding anything else.
So at the first time step, there's only the stimulus firing in A. So it fires. It will engender some neurons in B that will fire at the next time step, and so on, and so on. And the theorem that we prove is that, surprisingly, the system converges to a single set of neurons in B.
Now, why is this surprising? This will also explain what's happening here, what the dynamical system is doing. So at time t equals 1, the stimulus in A fires those k neurons, which summons a completely random set of winners, Y1-- those that were the best connected to those k neurons. By symmetry, over the randomness of initialization, it could be any subset of B.
But what happens at time t equals 2? At time t equals 2, the winners of the second time step, Y2, are not the same as Y1. Depending on the parameters, they may have very little overlap with Y1. Intuitively, it's because the neurons that won at the first time step were the ones that had the most connections from the stimulus.
But now you're firing those neurons recurrently inside of B, so there are some neurons out there-- especially if B is large, and the areas are usually quite big-- that are reasonably well connected to the stimulus, and reasonably well connected to the first set of winners. So they weren't well connected enough to the stimulus to win in round one. But they do win in round two.
But as you continue firing round 3, 4, and so on, old winners start to show up again. They start to show up again because you've been sampling neurons that have high connectivity to the previous firing neurons. So you start finding dense areas of this graph.
And eventually, in fact for reasonable settings of parameters, we can talk about what that means, this converges really quickly to a single set Y. So what are the properties of this set Y?
One important property is that it's highly interconnected. The internal probability of an edge in this set is several times the ambient connectivity, which is p. So you can think of this, actually, as a biological algorithm for finding pseudo cliques or dense subgraphs. But don't forget, this is in the model of computation, where finding top k is instantaneous.
And it also has stability properties. So it has the stability property I just showed you on the previous slide, which is if you fire the stimulus and that set, you will always get that set winning in area B. It's stable in that sense.
But here's the thing. If you kept firing the system, even after you converged to the single set Y, you would very quickly get even stronger stability properties, which I'll just mention. For instance, if you erase the winners in B, but kept the synaptic weights-- because the set is remembered in the synaptic weights-- then even if you forget who the winners in B are and fire the stimulus, in just one time step, you'll get that set Y again in B.
Or alternatively, if you just fire Y all by itself in the area B, only considering recurrent connections, you'll get Y again. So it's self-sustaining. And we call this set an assembly.
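As a sanity check of this convergence story, a loop along the following lines, reusing make_fiber and step from the sketches above, lets you watch the winner set in B settle down. Again, the names and parameter values are illustrative, not the talk's.

```python
def project(stimulus, W_AB, W_BB, k, beta, steps=30):
    """Fire a fixed stimulus (k neurons in A) into area B at every time step,
    letting B's winners also fire recurrently. Returns the final winner set
    and the overlap between consecutive winner sets, which should climb
    toward k as the dynamics converge to a single assembly."""
    winners = np.array([], dtype=int)
    history = []
    for _ in range(steps):
        winners = step([(stimulus, W_AB)], W_BB, winners, k, beta)
        history.append(set(winners.tolist()))
    overlaps = [len(a & b) for a, b in zip(history, history[1:])]
    return winners, overlaps

stimulus = rng.choice(n, size=k, replace=False)       # the k neurons forced to fire in A
assembly_B, overlaps = project(stimulus, W_AB, W_BB, k, beta=0.10)
print(overlaps)   # small overlaps at first, then close to k once the assembly has formed
```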
And what's suggested here is that it's no coincidence that assemblies, or sets with the properties I just mentioned, emerge organically from the simplest tenets of neural computation. I went ahead and made the most stylized, simplified model of neural computation, and yet, dense subgraphs naturally emerge. Because in neuroscience, there's increasing evidence that assemblies play a fundamental role in cognition.
In neuroscience, assemblies are large and densely interconnected sets of excitatory neurons whose near-simultaneous firing is tantamount to a subject's thinking of a memory, a concept, a word, or so on. And so when we discovered that you can get assemblies naturally, just from the properties of this dynamical system, this got us thinking that maybe we can do more kinds of computation, where the assemblies will be our basic variables of cognition.
Now I want to say that assemblies were hypothesized a very long time ago, in the '40s, by neuroscientists. And since then, there's a long line of work on finding them experimentally and studying them. And this is just to say that there's growing evidence in experimental neuroscience for their role.
AUDIENCE: It's probably less with [INAUDIBLE].
DANIEL MITROPOLSKY: Yes. I've been reading his book recently. Yeah, good. We can talk about his model of assemblies and how it's different from ours. It's very interesting.
Good. But I want to say, what I just showed you-- taking a stimulus in one area (you could also take an existing assembly in one area, by the way) and firing it into a downstream area to create another assembly-- this is an operation of the system. And we call this operation projection. You can think about this as some sort of variable copy, in some sense.
And it turns out that many more fundamental operations are possible. So I'll just mention a few. If you take an existing assembly and fire a small subset of that assembly, something like 40%, now that's breaking the rules of the model that I just told you, because I told you k neurons fire in an area. But let's say we did that. Then in a few time steps, you would have the whole assembly firing. That's suggestive.
And here's another very useful operation that we show is possible. It's like projection, but where you have two inputs. So say I have two stimuli or two assemblies, in two areas, x and y, and I project them into a third area, A. Then very quickly, as well, I can create an assembly in A, which is richly connected to both of them. So firing either one of these can summon the merged assembly.
And more interestingly, if we actually consider firing back to the upstream areas, this also works. Because if these two are assemblies, then firing back from A won't destabilize them. So in the end, you'll have connections from x, y back to x and y as well. So you can fire this assembly to activate x and y in those two areas.
So when you see something like this, I hope you also have the reaction that, wow, it seems that you really can compute with this in some way. Oh, sorry. I should mention there's a few more operations, which I won't show much about. But we can do association of two assemblies to make them overlap. We can do disassociation, take two overlapping assemblies, and get them to be non-overlapping by projecting them, sequence memorization, and several others.
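The completion operation mentioned a moment ago can be illustrated with a few lines on top of the earlier sketches. Firing fewer than k neurons deliberately breaks the k-firing rule for the first step, exactly as described; whether most of the assembly comes back depends on the parameter settings, as noted above.

```python
def completion_fraction(assembly, W_rec, k, frac=0.4, steps=5):
    """Fire a random fraction of an assembly's neurons, let the area run on
    its recurrent synapses alone for a few steps, and report how much of the
    full assembly is active at the end."""
    winners = rng.choice(assembly, size=int(frac * len(assembly)), replace=False)
    for _ in range(steps):
        winners = step([], W_rec, winners, k, beta=0.0)   # no external input, no plasticity
    return len(set(winners.tolist()) & set(assembly.tolist())) / len(assembly)

# After project() has formed assembly_B (strengthening its recurrent synapses),
# firing roughly 40% of it should recover most of the whole set:
print(completion_fraction(assembly_B, W_BB, k))
```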
AUDIENCE: [INAUDIBLE]
DANIEL MITROPOLSKY: Yes, [INAUDIBLE] plasticity, and they require different parameter settings. Yeah, some of them require very demanding parameter settings, which is interesting. Good. So all of this suggests that this-- you can compute with this model, somehow. And, indeed, we proved the following result. We proved that our model, NEMO, with an asterisk, is Turing complete. So what does this mean? It means that, technically, this model can carry out arbitrary computation.
More concretely, if you have a Turing machine that uses space S and time T, then our simulation in NEMO will take something like polylog(S) times T, in both time and space, where "space" means the size of the areas-- not so important. Now, this is in a strengthened model of NEMO, actually. OK? This is in NEMO with LRIs, with long-range inhibition. And I'll define, more concretely, what this means later, because we'll use this again.
But we need an extra ingredient, which are neurons, such that when that neuron fires, it can inhibit or disinhibit a whole area. So that's a pretty powerful ingredient. We need to add neurons that can fully silence or unsilence an area, but the whole thing. Good. We don't know how to prove Turing completeness without this, I should say. OK. Now, I'm always hesitant to present this result to computer scientists, in particular. Yes?
AUDIENCE: Is the inhibition scale with [INAUDIBLE]?
DANIEL MITROPOLSKY: No.
AUDIENCE: It's independent of [INAUDIBLE]?
DANIEL MITROPOLSKY: Yes. There's different ways you can model the other [INAUDIBLE]. But in the simplest version to present, just the firing of that neuron-- or you could model it as a small set of neurons-- inhibits another area. Yeah. So why do I hesitate to present this to computer scientists, especially theoreticians? Because they'll say, well, OK, great, so you're done. You've found another model of computation that is equivalent to Turing machines, which is all of them.
But are we done? I mean, in some sense, this is a positive result. It does show that this very simple model we came up with from basic tenants is capable of arbitrary computation. But the more interesting and important answer, I believe, is that, no, we're far from done. I mean, it's important that we have this result. But we're interested in how the actual logic of this computational model can be exploited and used and how it works.
So there's two senses in which this means we're not done. First of all, the algorithm that's used for simulating the Turing machine is wildly impossible, biologically. So insofar as we're interested in modeling actual cognitive algorithms for tasks that we know animals and humans are good at, implementing it via a Turing machine reduction to NEMO through this theorem isn't a good hypothesis, at all. And it's also wildly inefficient. So this suggests the following research program-- and I'll use the term [INAUDIBLE] complexity theory here.
We're interested in the fine-grained complexity of NEMO and related NEMO models-- because, really, NEMO is a family of models-- by which we mean we care about the exact time and space of the algorithms that we implement and study in this model. Specifically, we're interested in studying algorithms for those problems we know humans do and do well. And we want algorithms-- so what does "efficient and biologically plausible" mean? It's efficient if you, at least, beat the Turing simulation. Because if you beat the Turing simulation, that means you've exploited the actual, mm, structure of this model of computation, at least to some extent, instead of going through this circuitous reduction.
Now, when it comes to making algorithms that are biologically plausible, that's much harder to define. And our sense of what that means has evolved and continues to evolve in this project. OK. But it means something like using a number of areas that's cognitively plausible and parameters of the model that are cognitively plausible. Or maybe the execution of the algorithm actually reflects what goes on in the brain. If you can do algorithms that way, as well, then you have something that resembles an actual hypothesis for what this algorithm might look like in the brain.
So, hopefully, I've motivated this research program [INAUDIBLE] to continue studying this model, implementing algorithms in this model or proving results in it, even though we know it's Turing complete. So I'll now take the time to mention our team, our collaborators. So, of course, work with my advisor, Christos Papadimitriou. We work with Santosh Vempala, a theoretical computer scientist at Georgia Tech, and his PhD student, Max Dabagia.
We work with Michael Collins, a computational linguist at Google Research [INAUDIBLE] before, so many of you will know him, yes. And, more recently, we've been working a lot with two very interesting researchers at Google, Srini Narayanan and Iulia Comsa. These are cognitive scientists turned Google AI researchers. But their bones are cog sci and neuro. So it's a great team. Good.
So I just showed you, A, how we come up with these very simple computational models. And we saw that it's capable of doing surprising operations. And I should say the operations I showed you, that we walked through, the NEMO without LRIs could do those, right? Whereas, we needed the LRIs for the Turing completeness result. So that's a small, but important distinction. OK. Yes?
AUDIENCE: [INAUDIBLE] implement the algorithms in this model, what things are you [INAUDIBLE] that make the model specific to that [INAUDIBLE]? What kinds of things, in general?
DANIEL MITROPOLSKY: Good. So a big challenge, in general, with this field is figuring out how you're even representing-- what's the setting you're working in? So if I say, I want to see whether I can use NEMO to implement an algorithm for, I don't know, mm, distinguishing images or, I don't know, learning a hyperplane or something, so one big challenge is figuring out how we-- what does input look like. Or what's the setting? How do we model data? How do we model those things?
And then, also, it's setting up-- so all the parameters of the model that I mentioned before: the areas that we have, which are connected to which, and the other parameters-- the sizes, the assembly size k, and beta.
AUDIENCE: Just a few parameters.
DANIEL MITROPOLSKY: Just a few parameters. Yeah. But, often, the choice of areas, and which ones are connected to which, the choice of LRIs, in those algorithms where we use them, those are the ways in which we design an algorithm. Yeah. But I'm going to show you some concrete algorithms. So it'll also become clearer. Yeah. So why did we think about doing language in NEMO? Well, on one hand, because I absolutely love language and so does Christos. I'm really obsessed. So it's the first thing I wanted to think about.
But, also, because if somebody comes to me and says, here is a highly stylized but abstracted model of brain computation that I think captures the essence of how the brain works, I would say, well, how do you do language in it? Because, in some sense, language is the hardest thing that brains do, well, at least on this planet. So I want to know-- and brains are particularly good at language. Language is something whose structure is pretty restricted. And maybe brains are somehow specifically good at language.
So our first idea was to build a parser in NEMO. And, again, I'll remind you: for this to be interesting-- if all we wanted was any algorithm, we'd be done already, because we have the Turing completeness result. So it needs to be efficient. And it needs to be biologically plausible to the extent possible, roughly in line with what is known about language in the brain. OK. And we show that we can do this. There is an algorithm for parsing in time roughly linear in the length of the sentence, in NEMO with LRIs as the model. And this assumes a setting.
And this is where you'll see how you design algorithms here, part of the complexity there. It assumes a setting where we already learned the language. So this isn't learning. This is implementing the mechanism for parsing. So you can think of this as the adult brain of a native speaker of some language. OK? And so what are these LRIs? I mentioned them before. They're going to be neurons such that when they fire, it either inhibits or disinhibits a whole area or a fiber between two areas. So this is actually quite a powerful primitive that we're allowed to use here. But it's used in a limited way. And when I walk you through the algorithm, you'll see exactly how that's used.
So how do you build a parser in NEMO? OK. We're going to have a bunch of areas, NEMO brain areas, corresponding to grammatical roles: verb, subject, object, determiner. Determiners are words like "the," "a," "any," "some," that go before nouns in English. OK? These are also, by the way, the tags in the parse tree that we are going to construct. So I'm eliding the details of what kind of parsing we're doing.
But for those in the audience that are interested in parsing, we're doing dependency parsing here. And these areas are also going to be the dependency parse tags. But if you don't know what that is, it'll become clear when I walk you through an example, what kind of tree structure where we're building here for a sentence. OK.
Now, we also have a special area, LEX, which is thought to correspond to the neural lexicon in the brain, which is going to contain a fixed assembly, xw, for every word, w. So we actually plant a bunch of assemblies. In simulation, you could create these beforehand by projecting into the area and make as many assemblies as you want words in your lexicon. OK? So these assemblies, xw, contain LRIs that will inhibit or disinhibit areas and fibers. This is important.
These are the grammar of the system. And, again, these are planted. We just hard code them because we're modeling the adult brain that's already learned the language and the grammar. OK. And then the algorithm for parsing a sentence is extremely simple. Given the sentence-- that's a bunch of words-- you just feed in every word, one by one, by illuminating, by firing the assembly for that word in LEX, and letting the whole dynamical system fire for a fixed number of time steps. OK. And this is a very modest number of times in reasonable settings of parameters.
So let me show you how this works with an example. OK. So our example is "the cat ran." And this is the resting state of the machine. So, for every language, there's a resting state, when you're ready to hear a new sentence, which is restored between sentences. This is, again, dependent on the language. So we've learned this, somehow. And so, for English, the resting state-- or for this subset of English that we're going to handle, the resting state looks like this. So all the fibers are inhibited, but Subject and VERB are disinhibited. LEX will always be disinhibited, as an area, by the way.
So it's very simple. So when the word "cat" comes, this illuminates the "cat" assembly in LEX. But "cat" has some LRIs that fire in the first time step. And what do they do? They disinhibit the Determiner area, as well as the fiber from LEX to Determiner.
AUDIENCE: [INAUDIBLE] "the," not "cat"?
DANIEL MITROPOLSKY: "The," sorry. Yes. Did I say, cat? I'm thinking ahead to the next slide. [LAUGHS] Good, so "the" fires. And what does this do? This, very quickly, creates an assembly. In fact, if you're doing multiple sentences, "the" projects into the DET area so many times there's a neural assembly ready to fire there, right away. So it immediately recruits this assembly for "the" in Determinant, which is connected to the word assembly in LEX. OK. So nothing that interesting, yeah?
Then the word "cat" comes along. OK. Cat is a noun, so it has more complex rules encoded by its LRIs. So these LRIs, in particular, will disinhibit the fibers between LEX and Subject and Object, as well as between DET and Subject and Object. Now, intuitively, you don't know, when you hear a noun, just by itself, whether it's a subject or an object of a sentence. OK? But the state of the machine is such that only Subject is disinhibited, at this time, as an area.
So the assembly for the word "cat" in LEX will fire into SUBJ, as will this assembly down here in DET. So an assembly quickly forms in SUBJ, such that if you fired it into LEX, you would get the word "cat" again. So you've stored, you've remembered, temporarily, that you've heard "cat." But, more interestingly, if you fired from SUBJ to DET and then from DET to LEX, you would get back "the." So this assembly actually encodes a noun phrase. It encodes the noun and its determiner, "the cat."
And this is extremely easy to do in our model because-- exactly because NEMO is really good at taking existing representations and quickly merging them into one that can restore the previous ones. It's very good at building these tree-like structures, if that makes sense, between areas. Good. So then "ran" comes. Ran is an intransitive verb for now. Of course, it's not, but in this subset of English. So it disinhibits the fiber from LEX to VERB and from VERB to Subject. It doesn't disinhibit any fibers anywhere else because it's intransitive.
So here we have an assembly quickly form in VERB, which is connected to the word "ran," but also to the noun phrase in SUBJ. So, once we get here, I claim we can read out the dependency parse of the sentence with a very intuitive and simple algorithm. And, briefly, what you do is you take the active assembly in VERB, because the verb is always the root of a sentence in the dependency parse, OK, and you project-- you try firing it into other areas. And when you fire it into SUBJ, you're going to get an assembly. You're going to get something stable.
So you know that there's a dependency here. And you can read out the dependent, the word. In fact, you can read out that it was "cat." If you tried firing into Object, you wouldn't get an assembly. So there's no dependency there. And, similarly, firing from SUBJ to DET, you would get that dependency. You would get the dependency for the word "the." So, to summarize, after parsing, the synaptic weights of the graph that have been created temporarily implicitly encode the sentence's grammatical structure. OK.
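To make the control flow concrete, here is a purely symbolic trace of this example, with the LRI actions written out as data. It abstracts away the neural dynamics entirely (the NEMO sketches earlier would supply those); the area names follow the talk, but the exact per-word actions and the simplification that each word opens its own fibers are illustrative choices, not the parser's actual rule set.

```python
RESTING_AREAS = {"LEX", "SUBJ", "VERB"}     # disinhibited between sentences

# Which areas / fibers each word's LEX assembly disinhibits when it fires.
# These per-word "LRI programs" are the hard-coded grammar of the system.
LRI = {
    "the": {"areas": {"DET"}, "fibers": {("LEX", "DET")}},
    "cat": {"areas": set(),   "fibers": {("LEX", "SUBJ"), ("LEX", "OBJ"),
                                         ("DET", "SUBJ"), ("DET", "OBJ")}},
    "ran": {"areas": set(),   "fibers": {("LEX", "VERB"), ("VERB", "SUBJ")}},
}

def trace_parse(sentence):
    open_areas = set(RESTING_AREAS)
    active = {"LEX"}                         # areas currently holding an active assembly
    for word in sentence.split():
        open_areas |= LRI[word]["areas"]
        open_fibers = LRI[word]["fibers"]    # this word's LRIs open these fibers
        # An assembly forms wherever an active source has an open fiber into an
        # open target area (in the full model, via k-WTA plus Hebbian plasticity).
        for src, dst in sorted(open_fibers):
            if src in active and dst in open_areas:
                print(f"{word!r}: assembly forms in {dst}, linked to {src}")
                active.add(dst)

trace_parse("the cat ran")
# 'the': assembly forms in DET, linked to LEX
# 'cat': assembly forms in SUBJ, linked to DET
# 'cat': assembly forms in SUBJ, linked to LEX
# 'ran': assembly forms in VERB, linked to LEX
# 'ran': assembly forms in SUBJ, linked to VERB
```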
Now, this parser can handle simple sentences, something like "the young couple in the next house saw the little white car of the main subject quite clearly." OK. Now, for settings of parameters that we used, generally, if k is around square root of n, this thing converges in 20 to 25 spikes per word. So applying our heuristic to go from spiking in our model to real time, that's something like half a second per word, which is quite cognitively plausible. OK.
Now, after we made this, students-- and I worked on some of these, myself, as well-- implemented versions in Russian, Japanese, Chinese, and Hungarian, mostly to show that this isn't somehow limited to the structure of English. And you could implement very distinct and diverse grammars using this idea of LRIs, inhibiting/disinhibiting areas, and firing the dynamical system for a fixed number of steps. OK.
Now, I said that this sentence that I gave you is extremely simple. And, even though it's long, it's very simple because it has no dependent clauses in it. It has nothing like "the man whom I know came yesterday." And that's the most interesting part of language, actually. But we did come up with an algorithm for handling dependent or embedded clauses. All right. And if there's extra time, I can talk about it. Or we can talk offline. Because it's something I really am excited about. Yes?
AUDIENCE: [INAUDIBLE] says having a plausible [INAUDIBLE].
DANIEL MITROPOLSKY: A failed parse, so this thing does have some error detection built in. Basically, it would mean that you either fail-- so it failed-- it means that when you run that, the algorithm that I sketched for you, you start with the assembly in VERB and project into the-- all areas it's connected to, see if you get an assembly. For those that you do, you count as a dependency, and then recurse. OK. This algorithm won't get back to you the whole sentence in this dependency tree. It'll get you, maybe, a small piece. And that's what'll happen for every sentence that was grammatically incorrect. There are other ways to detect errors, as well, during parsing. So, yeah, good. Yes?
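Here is that read-out recursion written symbolically for "the cat ran." The STABLE table stands in for the neural test just described (project the active assembly in one area into another, see whether a stable assembly appears, and which LEX word it maps back to); in the real system that test is run against the simulator, not looked up. A failed or ungrammatical parse simply leaves entries missing, so the recursion returns only a fragment of the tree.

```python
# Which (source area, target area) projections settle into a stable assembly
# after parsing "the cat ran", and which LEX word each one reads out as.
STABLE = {("VERB", "SUBJ"): "cat", ("SUBJ", "DET"): "the"}
ROOT_WORD = "ran"    # the active VERB assembly always reads out the root

def read_out(area="VERB", word=ROOT_WORD):
    """Project from `area` into each downstream role area; wherever a stable
    assembly appears, record a dependency and recurse on it."""
    deps = [read_out(dst, w) for (src, dst), w in STABLE.items() if src == area]
    return {"word": word, "role": area, "deps": deps}

print(read_out())
# {'word': 'ran', 'role': 'VERB', 'deps': [{'word': 'cat', 'role': 'SUBJ',
#   'deps': [{'word': 'the', 'role': 'DET', 'deps': []}]}]}
```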
AUDIENCE: [INAUDIBLE].
DANIEL MITROPOLSKY: You mean-- good. So there's two answers to that question. When we create an instance of this model, in development, all of the connectomes inside of areas and between areas are initialized randomly via Erdos-Renyi randomness, like I said when I defined the model. Then these assemblies in LEX for each word, those, we plant. But you could also create them using projection. But you do have to plant the LRIs because the LRIs encode the grammar of each word.
Really, the LRIs are-- grammar isn't per word. It's per part of speech in English. So you can run this model where the LRIs are shared by all assemblies for the same part of speech. So all intransitive verbs of a certain kind could share those LRIs. Good. So this is-- yeah, I mean, it's a very simple algorithm. And once you see it, you might think, well, that's obvious, that this works or you could do this. And maybe that's the point.
But the thing is, the parser, of course, is not some advance in NLP. There are much better parsers out there, which were also learned from data. Ours wasn't learned from data. We planted it using our knowledge of English or other grammars. But what you have to remember is that this is a parser implemented without any higher-order language, variables, or conditionals, right? This whole thing is implemented entirely by realistic, though highly stylized, neurons and synapses with very basic primitive operations-- firing, and inhibition, and disinhibition of areas. OK.
So, with approximately eight brain areas, depending on the language, this is something with around 10 million neurons or more, a huge dynamical system that fires, merging a bunch of representations, which converges with high probability. There are failure cases, but with high probability, this thing converges and, after parsing a sentence, encodes its dependency structure in the weights.
So the biggest question left open by this thing, by this algorithm is, where do these word assemblies come from? In fact, they-- and their LRIs-- did a lot of the meat of the computation. And it's the question we immediately asked ourselves. And so now I'm going to sketch for you some very recent developments we've had on this front. So what does this mean, when I say, where do these assemblies come from? What would our desired algorithm be? What is the task here?
What we want to be able to do with NEMO is take some sort of tabula rasa or blank slate of a few dozen brain areas and fibers between them, maybe like the ones that I showed you for the parser-- maybe we'd need some extra areas, but something like this-- and then feed it, input it, in some model of input, grounded shared attention sentences, by which we mean whole sentences that occur simultaneously with some perception of the semantic content of that sentence at the same time. OK.
And what it should be able to output is a mature language organ, which, in particular, has an area, LEX, that contains word representations for every word in your lexicon, OK, that can parse and generate. Now this is a huge problem. It's very difficult. And, so far, we've been able to do a very small, but, I believe, important piece of this, which is learning concrete nouns, nouns with the most consistent perceptual signature, so to speak, something like animals and basic objects, and concrete verbs or action verbs.
And we'll assume, as well, just to be able to make progress, that phonology is completely solved. We have something that just knows the phonological form, and we can encode it in some hardcoded way. So in order to be able to do this, or even approach this problem, we need a model of input for our-- for context. So we're going to introduce a bunch of contextual areas. These are just going to be regular NEMO areas. And two of these contextual areas are going to be privileged. They're MOTOR and VISUAL.
So MOTOR is going to contain a representation for the motor response to witnessing an action verb. So there's a lot of neurological evidence that these kinds of things exist, even when you're not performing the action, for example, mirror neurons responding to witnessing an action, so on. So we assume that, for every kind of action, there's going to be an assembly that represents the higher-order perception of this motion. OK. And we're going to have an area which has representations of static images, of static shapes. OK.
And then we can have any other-- any number of other contextual areas. And in our experiments, we can vary these, but they might represent other kinds of context, like the affect you feel towards things around you; maybe other auditory signals that aren't language, that occur at the same time. So this is how we're going to feed in context. And this is what the model is going to look like. I haven't explained any mechanism. I'm just outlining what it looks like. So here are the context areas I've just told you about.
We're going to have an area, PHON, through which we'll feed in fixed representations of the phonological encoding of a word, something like a little motor program that can read out the word or perceive it. It's not uncontroversial that this is shared between perception and production, but there's a lot of evidence to support that. And in this model, to make this work, we need two lexical areas, LEX VERB and LEX NOUN. And we need some extra geometry here that doesn't break NEMO. It's still possible in NEMO.
That LEX VERB is much more highly connected to the motor perceptual area. And LEX NOUN is much more strongly connected to the visual perception area. OK. There can be some interconnection in the other direction. And they can be both connected to the other contextual areas. But you need this kind of geometric biasing, so to speak. And what is the algorithm going to do? Now, actually, the algorithm I'm going to show you now, for learning word representations, is going to work in plain NEMO. You don't even need LRIs for what I'm about to show you.
And here's what makes this problem hard. The contextual representation of the whole sentence is going to fire throughout the duration of the whole sentence, throughout all the words you hear. So when I hear a sentence, "the dog jumps," the MOTOR signature for jumping is going to be firing. And the static visual signature for "the dog" is going to be firing, at the same time, for the whole sentence. OK. But words, in PHON, will be fed in, one by one. And for each one, we'll fire the dynamical system for just a few time steps, just a few time steps, so that, hopefully, over many sentences, over time, we start to learn something. OK.
Mm. Good. Let's see how we're doing on time. Good. So let me just show you my pictures. So what happens when, in the sentence "The dog jumps," we hear the word "dog?" So in the moment we hear "dog," the "dog" phonological assembly fires, and the visual trace of the dog fires. So this starts to create some sort of merge-like representation in LEX NOUN for the concept-- for the word "dog." And similarly, when the verb "jump" comes, we have the phonological form of "jump" firing and the motor signature of "jump," and this starts to create something in LEX VERB.
So actually, it's not very surprising that if I just feed this thing many sentences over time, and these contextual signatures are very consistent, then you end up getting strong assemblies in LEX VERB for verbs and in LEX NOUN for nouns, which are strongly connected, reliably, to the phonological form and to the contextual representation. I guess that's not surprising, though it's good that it works just from data. But I tricked you.
When I showed you what happens when a noun fires, I just hid the verb area. But when the word "dog" is heard, isn't there also firing into LEX VERB? And this is what makes the problem non-trivial, because for all you know, English could be a language where "dog" and "jumps," these two words, are perfectly reversed. "Jumps" could mean dog, and "dog" means jumps. So it has to be able to handle that as well. So there has to be firing from "dog" into LEX VERB a priori.
And this is where the simple dynamics of NEMO come to save us once again. OK, so they save us in a sort of more straightforward way by creating a good representation in the right area. But they also save us with respect to the wrong area. So what happens when you hear "dog"? It's true that "dog" fires into LEX VERB. But you know what else is firing into LEX VERB? The neural signature of the jumping action. So by the rules of the system, this selects some set of winners in LEX VERB at that time and rewards them a little bit.
So you certainly couldn't fire this too much at this stage. You need to just let it fire a little bit. OK. But shortly after, you'll hear a sentence like "the dog stretches." And now, the visual representation for stretching-- sorry, the motor signature for stretching-- is firing. And then when this fires into LEX VERB with "dog," you get a completely different set of winners. So the dynamics of NEMO make it such that nothing stable can emerge in LEX VERB. You get a non-assembly in a concrete sense. So we feed the system a modest amount of grounded sentences, and we check that robust representations of words have been created in the correct area for each word.
That means there's an assembly in the correct area, which is well connected to PHON and the context. And more importantly, if you fire from PHON or context into the wrong area, you get a non-assembly. OK, and I just want to kind of take a step back and say, what have we created here after training? You have word assemblies in their respective lexical area, which are really stars of assemblies. It's like a hub, which is connected to the phonological form, and a bunch of contextual representations of this word.
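The stable-assembly versus non-assembly distinction can be probed with a small experiment on top of the earlier NEMO sketches (make_fiber and step). Everything here (area names, fiber layout, parameter values, number of sentences) is an illustrative stand-in, not the configuration used for the results in the talk.

```python
# Fibers into LEX_NOUN (from PHON and VISUAL) and into LEX_VERB (from PHON
# and MOTOR), plus each lexical area's own recurrent synapses.
W_pn, W_vn, W_nn = make_fiber(n, n, p), make_fiber(n, n, p), make_fiber(n, n, p)
W_pv, W_mv, W_vv = make_fiber(n, n, p), make_fiber(n, n, p), make_fiber(n, n, p)

phon_dog = rng.choice(n, size=k, replace=False)   # fixed phonological form of "dog"
vis_dog = rng.choice(n, size=k, replace=False)    # fixed visual signature of the dog

noun_w = verb_w = np.array([], dtype=int)
noun_sets, verb_sets = [], []
for sentence in range(30):                        # "the dog jumps", "the dog stretches", ...
    motor = rng.choice(n, size=k, replace=False)  # a different action verb each sentence
    for _ in range(3):                            # only a few steps per word
        noun_w = step([(phon_dog, W_pn), (vis_dog, W_vn)], W_nn, noun_w, k, beta=0.10)
        verb_w = step([(phon_dog, W_pv), (motor, W_mv)], W_vv, verb_w, k, beta=0.10)
    noun_sets.append(set(noun_w.tolist()))
    verb_sets.append(set(verb_w.tolist()))

# With consistent grounding, the LEX_NOUN winners for "dog" should settle
# (overlap near k across sentences); with inconsistent grounding, the LEX_VERB
# winners keep shifting, a non-assembly. Exact numbers depend on these toy parameters.
print(len(noun_sets[-1] & noun_sets[-2]), len(verb_sets[-1] & verb_sets[-2]))
```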
And the star is also dynamic. Over time, the strength of synaptic connectivity, the assemblies in context that it's connected to, these can be changed with time and with experience. OK? And another extremely important test of this system is that it needs to be able to generate. So the way we do generation is we present this model a new scene that it's never seen before through the sensorimotor system, the context areas. So for example, a cat chasing a dog. We'll light up chasing action, cat, and dog, and so on. And then we let the thing fire.
And by the dynamics of the system, it should somehow squeeze these inputs and output them in the correct order, cat chases dog. Now, I haven't told you actually anything about how you get it to input-- to output the words in the correct order. I mean, you might believe me that it'll-- that you can get the correct assembly and the correct area, and read out that word's phonological form, but I haven't talked about order. But we also have an algorithm for learning very basic word order, which again, if there's time or interest perhaps offline, I'd be very happy to tell you about.
And it's an important ingredient. And it's really just a first step in syntax. So when you have a model like this, you can run all sorts of experiments. So for example, you can just increase the lexicon size and see how many sentences you need for this thing to succeed, which means to form stable assemblies for all the words, and not in the wrong area. And it roughly increases linearly with the number of words. But remember. This represents the learning of kind of the earliest words that a baby might learn.
Because after you've bootstrapped, we know that language acquisition goes through many other algorithms as well. Like, for example, you could use syntax to quickly determine whether a word is a noun or a verb. Certainly as adults, we do that whenever we hear new words. So this really represents the earliest stage. But this is not incompatible with what we know about acquisition at the earliest stages.
And you can really run any kind of experiment you want. So for example, we asked what if you randomly, every few sentences, fed this system a single word? So you did some-- you helped it a little bit. We call this single-word tutoring. So you fire, for example, the word dog. And just the semantic contextual trace for dog in the contextual areas. So no interference coming in from motor or anything like that. And we see whether or not this helps speed up training. And indeed, in regimes when you're learning only a few words at a time, this can speed you up a lot. Yeah.
AUDIENCE: I have a question [INAUDIBLE].
DANIEL MITROPOLSKY: Yes.
AUDIENCE: So I can see how you can do this with nouns. But is doing this with verbs plausible with respect to-- like, can you illustrate jumps without complete subjects?
DANIEL MITROPOLSKY: True. But you're not-- at least you're not saying the subject. Yeah, so this is definitely an abstraction. That's a very good point actually. Yeah. But you would be doing something like halfway there perhaps. Yeah. Good. Very good question. Yeah. Great. So these are the kinds of experiments we run.
But to recap, what we've done here is we've come up with a plausible, and reasonably efficient, algorithm for a small but crucial step of language acquisition. And really, we have just begun. But the suggestive thing here is that as we tackle more and more cognitive phenomena and problems, and try to come up with algorithms for them, we have yet to perceive fundamental limits.
But on the other hand, maybe we will need more enhancements of NEMO. And so this is kind of the area we are trying to create here. Another way to see what we're doing is that we've created a brain-like model that can support novel modes of learning. So even if you forget about the brain, or the theoretical aspects of this, this is a new kind of learning algorithm. And it is one that does everything through Hebbian plasticity without any backpropagation.
And for those in the audience that sort of know the history of Hebbian plasticity in the learning world, it's something that fell out of popularity decades ago, actually, because it was hard to use it to do anything meaningful. But I think what we've shown is that with just a few extra ingredients, in particular this k-winners-take-all mechanism, Hebbian plasticity can be used to do something useful and something interesting. So it's time to revisit it and see what else it can do.
So is NEMO the missing link that bridges neuron biology and cognitive phenomena? I don't know. Time will tell. Let me tell you a little bit about future work, and what one might pursue. So I made three different slides here. Maybe I'll quickly do each one for this audience. So language, hope you'll agree with me that it-- we've really just scratched the surface. There's kind of-- for anyone interested in the modeling of psycholinguistics, or computational linguistics, there's really a treasure trove here of things to do.
We need to be able to scale this system up: right now, it handles something like 20 words; we want a hundred words, and ultimately, [INAUDIBLE] thousands of words. And this may require some modifications to the most primitive NEMO model. And how do we handle abstract words? So far, we've relied on really consistent contextual signatures to be able to learn these words. But what about words like unknown, or barter, or democracy, which, incidentally, psycholinguistics tells us are words that we continue learning up through adolescence?
So there are many mechanisms for this. Syntax, of course, needs to go beyond basic word order. And we may discover a lot about our neural model, or new ways to use it. Because in some sense, I'd like to say that linguistics is biology. It's a manifestation of what the brain is good at doing, and sort of naturally through language change, has evolved. So things that are true cross-linguistically must reflect, I believe, something that the brain is particularly good at doing.
And showing that in our model, I think, would be very interesting. Handling multilinguality is another interesting problem. So we tried doing this almost right away in our model. And we partly succeeded. But it's very unstable, at least in the parameter settings we were using for learning the words in one language. So that's interesting. AI. So one area we really want to-- and this is really the most aspirational area, but perhaps the most exciting to many-- is to somehow apply our neural model [INAUDIBLE]. Sorry.
So AI is doing great, but it still lags behind brains in many key dimensions: groundedness; hard semantic constraints-- we've all experienced this when asking some very concrete question of ChatGPT; continual learning; inventiveness and emotional intelligence; energy consumption. These are all areas where human brains are really great. So can AI benefit from brain-like intelligence? Or can we create some sort of hybrid model of an LLM that interacts with a NEMO-like model that learns from sentences? And could that be useful in any way?
And then for the computer scientists in the audience, I'm also very interested in the model fundamentals of NEMO. So we need a more systematic study of really, really simple problems in NEMO to prove upper and lower bounds for them. So a very recent result we have with Max is that there's an algorithm that learns any discrete conditional distribution, so a two-variable graphical model, using three NEMO areas. And we prove it's impossible if you only have two NEMO areas.
So this might seem really, really fine-grained to you. And it is. But I think these are the right kind of questions to be asking for this model of computation. And this also has very interesting cognitive implications: you would expect there to be certain numbers of areas, or certain parameter settings, needed for some problems, in order for our brain to be able to handle them.
And we need a more systematic study of the NEMO family of models. So I define for you the most basic one. But then I use the NEMO with LRIs and I mentioned others. And my kind of dream is that we could prove something like a neural Church-Turing thesis. And this is hard to define concretely, but it would say something like, NEMO with LRIs is in some concrete sense just as powerful as NEMO with other additions, with other cognitively plausible, or biologically grounded, additions.
Now of course, this thing is already Turing complete. So what I mean is with very-- very little overhead, you could simulate the extra power you get from adding more biological realism. Now, this is interesting not just from a pure TCS perspective, but if you had something like this, it would give even more evidence, theoretical evidence, that maybe the primitives of NEMO, or NEMO with LRIs, really do capture the essence of how brains compute. Thank you.