Computation and Learning with Assemblies of Neurons
Date Posted:
February 24, 2021
Date Recorded:
February 23, 2021
Speaker(s):
Santosh Vempala, Georgia Tech
All Captioned Videos Brains, Minds and Machines Seminar Series
Description:
Prof. Santosh Vempala, Georgia Tech
Abstract: Despite great advances in ML, and in our understanding of the brain at the level of neurons, synapses, and neural circuits, we still have no satisfactory explanation for the brain's performance in perception, cognition, language, memory, behavior; as Nobel laureate Richard Axel put it, ``we have no logic for translating neural activity into thought and action''. The Assembly Calculus (AC) is a framework to fill this gap, a computational model whose basic data type is the assembly, a large subset of neurons whose simultaneous excitation is tantamount to the subject's thinking of an object, idea, episode, or word. The AC provides a repertoire of operations ("project", "reciprocal-project", "associate", "pattern-complete", etc.) whose implementation relies only on Hebbian plasticity and inhibition, and encompasses a complete computational system, thereby enabling complex function. Very recently, it has been shown, rigorously and in simulation, that the AC can learn to classify samples from well-separated classes. For basic concept classes in high dimension, an assembly can be formed and recalled for each class, and these assemblies are distinguishable as long as the input classes are sufficiently separated. Viewed as a learning algorithm, this mechanism is entirely online, generalizes from very few samples, and requires only mild supervision --- all attributes expected of a brain-like mechanism. The talk will highlight several fascinating questions that arise, from the convergence of assemblies to their unexpected generalization abilities.
This is joint work with Christos Papadimitriou, Max Dabagia, Mirabel Reid and Dan Mitropolsky.
PRESENTER: I'm personally becoming more and more convinced that there is one key question we should focus on that [INAUDIBLE] from the experimental and computational point of view. And the question is, what are the neural circuits underlying human intelligence, language, logic, and so on? Where are they? And what is their motive? I think this is the key question today for the science of intelligence.
Now, in general, in terms of basic proposals about neural circuits, I know two basic frameworks. One is the McCulloch-Pitts one, the proposal for specific circuits he believed were neurons performing specific computation. And the recent special case of this class of neural circuits are deep nets. These are circuits with specific connectivity, with specific weights. They may be redundant, but they're wired precisely.
The other one, the other broad proposal, is the neural assembly's idea, first proposed by Sherrington, more forcefully by Donald Hebb in his book Organizational Behavior, and then later by [? Ibanez, ?] and Buzsaki, Leslie Valiant. And in this proposal, individual sets may be part of different assemblies that are created perhaps in real time when needed. And they may be involved, each individual cells, in multiple computations.
And it's kind of sobering to admit that we still don't know which one is the correct theory, if any one of the two. Especially now, with so many people on the bandwagon of deep learning, I think it's important to keep our mind open and, therefore, to speak a bit more about assemblies of neurons. And there is no better theory of neuron assemblies that I know of than the recent work of Santosh, Christos Papadimitriou, and colleagues. And so I'm very happy to introduce Santosh Vempala today to speak about assemblies of neurons.
Santosh got his PhD in computer science in Berkeley. He taught there, and at MIT, and is now a professor of computer science-- actually, distinguished professor of computer science-- at the Georgia Institute of Technology. His main work so far has been in theoretical computer science, randomized algorithms, and so on. But he got in love recently with the question of, how does the brain work? Welcome, Santosh.
SANTOSH VEMPALA: Thank you very much. And thanks for the great introduction to assemblies and the context. I've been on this journey for the past seven years, all of it with Christos Papadimitriou who's been a long-time mentor and collaborator.
And the high-level question is roughly what [? Tommy ?] just asked. How does the mind emerge from the brain? And to say it a bit more precisely, there's fantastic, amazing, stunning progress in neuroscience and, increasingly, insight in cognitive science. And yet an overarching theory remains elusive. How do these neurons and synopses and other molecular, chemical, and electrical mechanisms result in this robust behavior that we don't understand?
So that's the overarching question. And this was phrased precisely by Nobel Laureate Richard Axel not that long ago where he said, "We do not have a logic for the transformation of neural activity into thought. I view discerning this logic as the most important future direction of neuroscience." Yeah.
So at the end, maybe if there's time, we'll discuss further motivations for this. But that's the high-level thing. And let me get straight into it. What kind of formal theory would qualify as such a logic to convert the activity of low-level hardware into cognition?
An early-- or now it's 25 years ago. A theory was proposed by Les Valiant, the neuroidal model, where he allowed little units to change state based on what they saw from their neighbors, and thereby simulate various types of computation, and memory allocation, and so on. But what is it that we would like to have in this model, in such a model, for turning neural activity into cognition?
First of all, it should be a computational system, something that you could simulate, something that you know how a state's changing, what the output, input of the behavior will be, and so on. It should be consistent with current understanding of the brain. At least it should not have some clear discrepancy. We don't need it to be precise at the level of every single reaction or chemical. But certainly, we want to be consistent.
And then of course, most importantly, it should explain cognitive phenomena, some of which we do not explain by any means, forget about bio-neural means. So more precisely, what would such a system consist of over its data types be, and all what would its operations be? And if we're talking about a computational system, we must have data types and operations at some level.
And what is the right level? That is the next question. Are we talking about the whole brain, the entire brain as a network, as Camillo Golgi would have said? Are we talking about neurons and synapses, like Ramon y Cajal said? Maybe that's the seat of activity-- maybe even lower, maybe dendrites, molecules? Or is it something maybe in between the whole brain and this level of description?
So this is where the assembly calculus comes in. This is our proposal, building on much work from the past. It's a formal, probabilistic model of the brain. So first of all, it's going to be precise enough that you can reason about it, and in addition, can argue what will happen, and so on.
It will have a single basic data type. It'll have a handful of operations that you can do, which is the [? protocols. ?] And it'll have a completeness theorem. So you'll be able to say that, using this, you can actually simulate interesting computation. And finally, and this is certainly a work in progress, we hope that it has a killer app, which hopefully is language. But this is early days for it.
There's a paper that came out last year based on our rigorous formulation of these ideas. It's called "Computation by Assemblies of Neurons" by Christos, Dan Mitropolsky, who's a student at Columbia, Mike Collins, a language and natural language expert at Columbia, and Wolfgang Maass, who has been doing [? clinical ?] neuroscience for decades now. So I'll start right away with the mathematical model of the brain. So this is going to be a serious abstraction. And I won't even bother to tell you about the biology.
So here we go. We'll think of the brain as having a finite number of brain regions. So there are these regions. And those are finite-- 10, 20, something like that. Each one will contain a large number of neurons. And we'll just say, for simplicity, they are the same number, n, neurons. You could n1, n2 as well.
And the following very important principle will be in play to help-- that, in any brain area, at any given point, only k of the neurons are firing. So neurons we can think of as just on or off. So they're either firing once, or they're off, zero. And k, something smaller than n, significantly smaller, will be a cap on how many are firing at any given time.
So there will be-- and this step here, let me just mark, is inhibition. Everybody might want to fire. But only the top k are allowed to fire. And one of the beautiful things that has been well understood is the brain's mechanism for quickly regulating how many neurons are firing. So this is an abstraction of that.
Now, some pairs of areas are going to be connected by a bipartite random graph. So these red arrows represent directed bipartite graph from one area to another area, with the probability of a particular edge is P, and since it's n by n, following B and P. And every area is internally recurrently connected. So there are recurrent connections within each area. And that we will model just as G and P.
Again, the P and the P could have been different. But let's just say n neurons in every area and probability, P, of an edge being present. This is the base connect all, so to speak. You can already contrast this with any modern deep learning architecture.
OK, now, that's the set up of the graph. Now, what are the dynamics? The neurons will fire in discrete steps. So this is, again, a theoretical simplification to this time that's going by in discrete steps.
And what are the k that are going to fire? The k neurons that are going to find in any one area are the ones with the highest weighted input from their neighbors, whatever their neighbors be, whether they're from the same area, or other areas. You look at every synapse edge as a weight.
Look at the weighted sum. And whichever k are the highest k in the entire area, they will fire, the top k. The connections between areas are allowed to be inhibited or disinhibited, meaning you can declare that, from A to B, it's blocked for now, or it's open again.
And finally, we have plasticity, which we'll model in the simplest Hebbian version. So if there is a synapse from i to j, and i fires, and then in the next step j fires-- so i fires in step t, and j fires in step t plus 1-- then the weight of this synapse goes up by a small multiplicative effect, 1 plus beta. We'll call beta the plasticity parameter. This is how much it changes if you fire in quick succession.
And then, in addition to this, there are global processes, which are very simple, natural things. Homeostasis is just the idea that every neuron will keep the sum of all its incoming weights held constant or upper bounded so that things don't explode. And forgetting refers to, at some slow rate, all neurons will-- all synapses, the weights will decay. So unless reinforced, or reinforced a lot in some period, things will forget, meaning you would go back to resting state.
So that's the model. That's the entire model. That's the entire model. There's no algorithm. It's just a model, and it describes how each part works.
Now, if you want to have specific parameters in mind, we're thinking of n as several million. k, the cap within each area, is about square root of n, roughly. And the edge probably is tiny, so one in 1,000 or so. And this is all consistent with biology. That was the reason for these ranges.
And the plasticity parameter will be 1.1. So every time you have this in quick succession, you get a 1.1 factor boost. So the idea is that we'll have just this randomness-- you've seen this-- selection, that's the inhibition, and plasticity. So that's it.
And there's really only one operation underlying everything. So this is the assembly-level operation on which we'll build slightly more sophisticated operations. And that's this random project and cap. We'll discuss that in a bit. But this refers, basically, to the idea that you are going to take the top k and cap. Everybody else gets set to zero.
At the end of this time, I'll show you a quick online simulator. It's web based, so anybody can access it. And you can run your own brain programs using these operations.
OK, so that's the model so far. How realistic is it? At one level, it's not realistic at all. But at another level, it's reasonably so.
It's a formal model. It's not inconsistent with any of the biology. And given that it's an extreme simplification, maybe it's reasonably so.
For example, the discrete steps assumption is certainly realistic. There's no reason to think that there's this coordinated activity at such a strong level. But as far as we can tell, it likely doesn't change the phenomena that we're able to establish.
Plasticity and assemblies are used in rapid timescales. And so it is important to think about the timing. So it's our feeling-- and this is just a subjective thing-- that this model itself is somewhere between realism and being useful. And it's at the point where we can be rigorous, and make some progress, have some insight.
OK, so let's get back to the question. What is the basic data type? I haven't defined that yet. I've only told you the model so far. And we want it to be larger than single neurons, but smaller than brain. And indeed, the basic data type will be assemblies, or ensembles, of neurons.
I should have put Sherrington on this list, as I said. But Hebb's book really brought this out, and the work by Harris. Buzsaki has been working on this for a while. And his lab does both experimental and theoretical work-- Yuste, and many others.
Now, what is an assembly? It's going to be a large and densely interconnected set of excitatory neurons, so not the inhibitory ones that are trying to suppress the firing, the main ones, et excitatory. These end neurons that we're talking about are all excitatory. The inhibition, we'll model away as the k cap.
And so large and densely interconnected set of excitatory neurons in the brain area who's firing, either simultaneously or in a fixed pattern, is tantamount to our thinking for particular things or a particular concept. So an assembly maps a particular concept-- a memory, a concept, a person, a name, a word, an episode. All of these will have assemblies. So this is your basic memory data structure or data type.
If you have an apple, you see an apple, somewhere in the brain there is a subset of neurons that are mostly likely to fire when you see an apple, or think of an apple, or somebody says apple, and so on. So that's what assemblies are. In his book, Buzsaki calls assemblies the alphabet of the brain.
So here is the hypothesis. There is this intermediate level of brain computation. We're calling it the assembly calculus. It's implicated in higher-level cognitive functions.
And for this, we will need to demonstrate or review experiments, such as reasoning, planning, language, storytelling, math, music, et cetera. And assemblies of neurons are its basic rep, data type. That's the first part.
Now, what are the operations? What operations are we going to be able to do this? First thing is going to be projection. Remember, when I say operation, I don't want any additional control and a computer sitting on top of this do more things. It should work on this substrate.
Association-- I'll describe all these things. Pattern completion-- projection, basically, you can think of as a copy from one area to another area. So it's an assembly in one area projects to an assembly in another area.
Association-- two assemblies that come together. And it's something that seems to happen in human cognition all the time. Pattern completion-- where you have a hint of something, and it completes to the full [INAUDIBLE]. You, ah. It's ignited. You smell an apple, and you think, apple.
Merge-- this is the most complicated operation that we've managed to implement in the assembly calculus. And it seems to be quite useful for language. I'll describe it later.
And then reciprocal projection-- where an assembly copies itself to another area. But the new copy also has a link back. So they have this tight connection with each other. And a couple of control commands-- by control commands, I mean this inhibit of connections from one area to another, or disinhibit.
OK, so let's jump into it. What I'm going to do in the next part of the talk is go in some detail over two of these, the first two of these operations, projection and association, so you see a flavor of how the model works. And we'll be able to do things rigorously there. And then we'll get back to high-level questions.
So for projection, projection is something that's not just human. It's implicated in all kinds of animals. In fact, we can go all the way back to the fruit fly and its olfactory abilities, which are at some level very similar to any other animals. Odor consists of these odorants, molecules that come by. And then we have these odor receptors, which react to particular odorants.
And in the case of the fruit fly, we're talking about 50 different odorant receptors, the sensory neurons that project onto these 50 neurons. And then there this step, this strange step here, where these 50 are expanded to about 2,000 Kenyon cells. Each cell seems to be more or less a random combination, a fixed but random combination, of these 50.
So you get this expansion. And after these 2,000, what goes forward up to the rest of this nervous system here is just the top 100 or so. Top 100 are the ones that go through. The rest are silenced. So you have this.
So this is the motivation for random projection and cap. You have this sensing, after which there's this random projection to a larger space. It's still a projection, but to a larger [INAUDIBLE], maybe random lifting. And then there's a cap. You only take the top tier.
So that's the whole procedure. Now, you could ask-- OK, fine. So what's happening with the random protection and cap is, if you're seeing 2,000 of these, you could think of this as sampling from the distribution, which is a weighted sum of these inputs here. So it's a random weighted sum of these inputs. That's what each neuron is.
And you're sampling that distribution 2,000 times. So since it's a random weighted sum, it looks like a Gaussian distribution. And when we're taking the top k, all we're doing is, out of this Gaussian shower, we are taking only the tail. The highest k are the ones you are keeping. So that's the operation going on.
Now, going back to this, you could ask, why are you saying this is random? That's a great model, but we're talking about the fruit fly. You should be able to look at the connections. Why do you say this is a random bipartite graph? Why is this random projection? And indeed, it is.
In this paper by Caron, Ruta, Abbott, and Axel, they do these measurements very precisely. And up to statistical bounds, this really looks like a random bipartite graph, except that the distribution is not uniform on the left-hand side. So some of the neurons that have come in, the 50 neurons have higher degree, and some have lower degree. But conditioned on that, it looks, up to the law of statistics, random. So that's the process.
Now, why is this any good? What are we doing here? So the main feature here is the following. Suppose we have-- let's just think of this as both sides being n. It doesn't really matter. And a subset of, let's say, k neurons fire on the left-hand side.
Now let me just go to it. There we go. Those are k neurons. And as a result, the top k on the right-hand side-- so lots of neurons get input. But the top k by random projection and cap, are ignited.
And now let's take another subset B. Maybe it corresponds to a different smell, a different combination of odor receptors. And it ignites a different subset. Now, here's a question. What we'd like to have is that, if A and B overlap, then the projections should also overlap.
Why? Because then you recognize some [INAUDIBLE] as a basic step. And then this suggests that, if indeed that happens, maybe this is a reasonable algorithm for near-neighbor search. Do this projection and cap. And then compare the cap, which is relatively small, potentially.
And is this any good at doing near-neighbors? Now, it's not going to be the state of the art. But it turns out it is actually quite good on benchmarks. And this was a paper a couple of years ago by Dasgupta, Stevens, and Navlakha. But we want to do this mathematically. And then, of course, that's what we were going to establish, is the intersection of cap and cap b.
So A and B are both subsets of size k. And their overlap is, let's say, alpha. The overlap is alpha, so alpha fraction of 10 k. Alpha k [INAUDIBLE] overlapping.
And you were asking, what's the fraction of overlap in the projections? And if it was totally random, each one was independent and random, then you would only expect the k over n fraction, because each one is taking k out of n, so the fraction of overlap should be k over n. But in fact, it's much higher. It's k over n to the power of 1 minus alpha over 1 plus alpha.
So when alpha is a constant fraction, you really get this boost in the overlap on the right-hand side. In our proof, we have a denominator that's logarithmic. But it's still there. And as far as we can tell, it shouldn't be there. So that's a question that might be nice to resolve
But going back to this-- and these parameters match what has been observed in fruit flies, but also locusts. And lots and lots of insects have a similar behavior mechanism. So why is this the case? Why is the cap higher?
This is relatively easy to explain. The idea is it's probability. So you have a subset A. And you're asking-- so suppose this is the cap that came out of one stimulus. And what does it mean, these x and y1? That means that what you got from-- the input you got from A put you over some threshold.
Let's say, the top k needed to be higher than some threshold. The threshold is whatever was the value of the k at the highest input. So you know that the total input you got from A, which I'm thinking of as x plus y-- x is the input from just A, y is the input from just the intersection at z from here.
So if there's some neuron here, it's getting input from everything, some random input. Remember, every edge is probability P. And I'm just [INAUDIBLE] the input to the x part, the y part, and the z part. And the fact that this neuron made it to the cap for A-- so this belongs to cap of A-- means that x plus y is bigger than t. [INAUDIBLE].
Now, the question is, what is now the chance that the same neuron belongs to the cap of B? For that, you will need y plus z to be exceed t, because z is new, but y is old. But the point is that this conditioning gives you an advantage. y, in general, should be slightly higher than just random, with respect to this threshold. And therefore, you get this little advantage. And it's much higher than k over n.
So that's the idea. And if you compare it with simulations-- I mean, in simulation with the actual overlaps that you get-- this yellow line here is what the theorem proves. This purple is the conjecture. The blue is what's actually observed to be the overlap of projections.
And the upper line here, intriguingly, is what we get with actual assemblies, which we'll discuss next. So keep that upper line in mind. There's something that gives you even higher overlap. OK, so that was the fruit fly. But that's your question-- is, does something similar happened in mammals?
Actually, there is something significantly nicer that happens, which is that-- well, it's based on two things. First of all, here we were just talking about one layer of protection, you're done. That's it, done, bipartite graph.
Now I'm talking about, in mammals where you have-- going back to things like the brain model I described, there's recurrent connectivity. There's connectivity between the neurons on the other side. And there's plasticity. So we'll use these very carefully.
So here's the [INAUDIBLE] of the process. Here is an assembly somewhere, or a stimulus that's firing. And we look on this new area. And these are the top k. The top k fire.
Now it happens again. But now you seen, in the second round, these things are firing, these k from somewhere, stimulus or assembly from outside. And in addition, the top k of the first round, this first cap, C1, is also firing. So you've got those k and these k firing.
So now the new top k could very well be different. These were the ones that had the highest input from outside. But the new top k depends on who's getting highest input when you take both of these together. And so we get a different top k, and again a different top k, and so on.
And so the question is-- you've got this process here. And there's a question. Will this converge? Under what circumstances will this actually converge?
If it's just a random graph with no plasticity, then in general it won't converge. It'll just keep on moving. But with plasticity, it's going to converge. So in fact, the theory is that, with plasticity, as long as it's above this threshold, which you can see is sub-constant, the total number of neurons that will ever be activated-- looking ahead forever, as a stimulus is presented-- is k plus [INAUDIBLE] k.
So of course, the very first theorem, already there's k. And the point is that the second round, it will mostly overlap with this because of plasticity, except for a few, and then even fewer, and even fewer, so that the total additional thing still remains k plus [INAUDIBLE] k. Now, this doesn't mean that it's the same k that will always light up. There's going to be little fluctuations. It happens in simulations. And that's what the theory says, too.
But the total support is only slightly more than k plus [INAUDIBLE] k, which means there must be roughly k neurons are firing almost all the time for this particular. So that's the convergence theorem. I'll discuss this is a little bit more detail here.
But just to show you some quick plots, each curve here is for a different plasticity. So at plasticity zero, this total support just keeps increasing. So it doesn't actually converge. But by the time you get to plasticity just over 0.005 for this simulation size, eventually you do converge. You stop activating any new neurons. You just cycle between all neurons, or just fire the same ones again and again.
And once plasticity is a little bit higher, like a 0.1, very quickly it converges. And this is another way to look at it. If you look at the number of new neurons activated, it drops very fast during this process in about a dozen iterations on the size and speed of this expansion. Very quickly, the proof of convergence goes more or less like this. The first cap is the k highest ones, great, out of the n.
Now, the second cap receives k from this, as well as from the outside still. So we can think of this as-- approximate this with this Gaussian, whose mean is 2pk p because of the edge of probability. And the variance is 2p over minus k.
Now, there is this competition between the k previous winners. They have a slight advantage, because they were the ones that got the most input from the outside k. But then there's this much larger pool of n neurons. We think of k as much smaller than n. And so what we can do is we can ask for the fraction of new winners at each step.
How many new winners will you get? And that determines the threshold, the threshold you must pass for your input in order to make it this top k. So get this threshold. OK, great. But we won't use the thresholds to bound how many new winners.
And the idea is just this. The people who won the first time-- you see, because they won, they get this boost in the plasticity from the stimulus. Stimulus was on. Then this top k won. So this edge increases its weight by 1 plus beta.
And so this 1 plus beta boost is enough to really give you a much higher chance of being in the second k. And of course, if you survive multiple iterations, then you get this exponential boost in your ability to survive, meaning [INAUDIBLE]. So that's basically the idea.
One very nice aspect I'd like to point out here is the following. So you could ask, what was the advantage of recurrent connections and assemblies, or just projection? Why couldn't we just stop the projection if evolution would give you assemblies with recurrent connections?
And here's one clear advantage. Suppose, instead of firing the entire stimulus, which leads to most of the assembly firing, I only fire a tiny piece of it. So somehow, I create a tiny piece of it. The point is that if you do that, and if your assembly is strong enough, then it will just quickly fall out to activate almost the entire assembly.
Roughly speaking, assemblies are expanders, because their connections internally are very strong compared to the outside world. So once you ignite a small constant fraction of them, it takes virtually no time to ignite the whole thing. So your memory is fully enabled. And this, you can do with just projections. OK, so that was only one.
So the next operation is association. So in association, we have two stimuli or two assemblies that are both firing repeatedly, together. As a result, the corresponding assemblies, which had already been created, come closer. They are overlap in terms of number of neurons. And the strength of the connections between their corresponding neurons increases.
This one, you don't have to take my word for it. Here's an experiment by Ison, et al, 2016, when they were recording from human patients with electrodes that are directly into several hundred cells in the hippocampus. This is in the medial temporal lobe. And here's what the observed.
So here's what they did. They showed them pictures of places, places they would be familiar with. And then they observed which neurons are firing and how many are firing. Great.
Then, for different places, there will be different neurons firing. And then they showed them pictures of famous people. And when that happens, you have again different neurons firing.
And then they did this interesting thing. They would show them together a person and place, something that the person might never have actually seen together. But this is being done in the experiment together. And some neurons would fire. Some of the ones fired for Obama, some of the ones were fired for Eiffel, and some new ones, and so on.
And then here's what they would do. They showed the picture again of, say, the person. And some of the neurons that were originally firing only for the place that they were shown together with Obama also started firing. So literally these assemblies, if you think of the subset that's firing as an assembly, actually became-- increased their [INAUDIBLE]. Some neurons, which were not previously in one assembly, came over now, after this repeated simultaneous presentation.
OK, so that's association. This is proven. This, we can prove in the same model. And the proof of convergence is a little bit more complicated, but quite similar.
Merge is more complicated. It has to build a backwards thing. Maybe I will skip this for now. But there's a question. We can get back to it.
So to recap the simple calculus, we have a few operations. And I'll put them all up here-- projecting from one area to another area, associating to existing assemblies, pattern completing, merge, which is about two assemblies merging into one assembly in a different area, and then some control commands, which are [? fluid ?] with reading an area, or is something firing, and disinhibiting, meaning cut off, these connections are not going to change around.
And you could ask, with this setup, how powerful is this system? It turns out that you can actually perform arbitrary Turing computations. So whatever a Turing machine can do, square root and space, you can do with a constant number, find regions, each with n neurons, with some assumptions about interference and obliviousness.
But nevertheless, this is already enough for simulating power for all computations, actually, with some loss in the space usage. We're not trying to say at all that Turing machine type of computations is what happens in the brain. It's just the point that we're not limiting yourself severely by sticking to these neuroplausible or motivated calculus here.
So here's what I want to do in the rest of the talk. Go to Learning with assemblies. The random graph model is something that one could have objections with, and how to move away from this, and is there a benefit to it? And briefly, this time, language.
I could take a question or two now. Or I could proceed. Whichever, as you prefer.
KRIS BREWER: Sure. There are two quick questions. One's from [? Maha ?] [? Allay. ?] How can you verify the assumptions are valid? Can this model learn more efficiently than the current exist deep learning models? Does the model generalize well?
SANTOSH VEMPALA: Right. Good. So we'll get into learning right now. So I think you will be in a better position to appreciate that.
As far as, how do we know the assumptions, yes, so the big principle throughout has been to be consistent with what is accepted in neuroscience across a [INAUDIBLE]. So we've had neural recordings now for a while. And the point is that things like plasticity, inhibition, and more recently the fact that assemblies come together, and so on, are all experimentally validated.
Now, in terms of competing with deep learning, the short answer is certainly not on benchmarks. But the hope is that there are aspects on which it will provide lessons and we might find ultimately, even though it seems to be limited, suggest ways to do better learning. That's certainly the hope.
KRIS BREWER: Great.
SANTOSH VEMPALA: [INAUDIBLE]
KRIS BREWER: And we have another one from anonymous. Can this approach simulate, explain every neural activity, as mentioned in the second side, or only a subset of them?
SANTOSH VEMPALA: Oh, certainly a subset. This is a very severely limited thing. Certainly a subset. But the hope is that, if you can explain an interesting subset of activity, then maybe by extending it suitably you could explain others. But no, absolutely, it's going to be for a small subset.
KRIS BREWER: Great. And then one last quick one for now. How about garbage collection, i.e. Memory reuse or memory decay? Can assemblies model those?
SANTOSH VEMPALA: Yes. So this is what I meant by-- I quickly said, forgetting. So basically, just as we have-- so one thing awaits exploring is normalization, homeostasis, where each neuron will make sure that the sum of the weights of its incoming synapses is kept bounded.
But also, if you do nothing, generally across the board the weights of neurons will decay at a slow rate, not at the plasticity rate at which they increase when something active happens. But they will all decrease. So things that are not used will just keep decaying. And in fact, it's not just about the weights. Synapses appear and disappear continually through life.
So learning, how do brains learn? So locally, we understand its plasticity. I put up only the simplest, most common rule. But there are other versions of plasticity. It's a very interesting question about the landscape of plasticity rules. But it's local.
Globally, you could ask, does the brain do gradient descent? You can do lots of great things with gradient descent. This is a topic of debate with very little convincing evidence the brain can or does do gradient descent. So in fact, this motivates a whole area of biologically plausible gradient descent alternatives.
So we're going to stick to the question, is synaptic plasticity by itself an effective learning mechanism, and how effective? And then why not go further? Could it have advantages over gradient descent? Not only do we want to say that it can do what Gd does, up to some level. But can it actually have advantages?
So for that, we first need to set up learning. How can the brain learn through assemblies? It's a crucial problem. And how to formulate this idea up front is unclear. How best to formulate this, this funny system? We don't have a clear input-output specification.
There's some initial progress on Boolean functions. I mentioned that. But I mostly focus on a different approach, which is to view what's happening with assemblies, their creation and their operation, as quite similar to recurrent neural networks. So that's going to be the plan.
But first let me tell you something that came out of CBMM, Rangamani and Gandhi. And they showed how you could learn Boolean functions of two variables. Basically, their idea was, let's have assembly areas for labels.
So each label-- in this case, zero and one-- will have separate assembly areas, brain areas. And then, when you present-- so this is a version of supervised learning. When you present an example, if its label is zero, then all connections to the one label are disinhibited. And then it's Hebbian plasticity that changes the weights of the ones that are allowed to be happening.
And then, if you present the next example and that's labeled as one, the connections to zero are disintegrated, and so on. So you do this. And it learns AND and ORs of two assemblies, two inputs. And they tried it on MNIST.
And it works reasonably for pairwise classification, which is, I'll immediately remind you, much easier than doing all 10 digits. And of course, MNIST itself is-- yeah. But it's already a demonstration that plasticity can already be effective for supervised learning. So this was what they did.
Now let me go to a different view, unsupervised learning. It seems to be, at best, mildly supervised, certainly not supervised in our traditional sense of, here's the label, here is the example. And also, something that's talked about a lot, it's only a few examples. There are relatively few examples, relatively little data.
So we know that the assembly projection process preserves overlap. We can prove it, and that's why. But here is the question. Suppose, instead of talking about single inputs, I talk about a cluster of data, a set of possible inputs, which are close to each other, another set of possible inputs.
What I'd like to do is-- does this process create for you an assembly, one per cluster? So it's actually generalizing. It's saying, all of these guys are really one assembly. All of these guys are really one assembly.
So no matter which one of them we actually see, it's the same assembly that lights up. So it doesn't matter which apple you see. You've got one assembly firing. It's Fuji, or Gala, or [INAUDIBLE]. So that would be nice.
So far, we talked about memorization of single stimuli. Now we're talking about of distributions, or of clusters. Is that possible? So how do we even set this up? So here's how we set this up.
Each distribution will be just a sparse Bernoulli distribution. So the input is just still 0, 1. And about k out of the n will be on. Now, the way we will run the model is that there's this big graph. This input comes on. And it projects to the big area, one example. OK, done. Plasticity happens.
Next example, from the same class-- plasticity happens. And just five examples-- so five examples come in from one distribution, one Bernoulli distribution. And plasticity happens. Next, you get five examples from the next distribution, the next cluster. And plasticity happens.
So as usual, when I say, everything operates as it should, which means that you do the random prediction in top k and plasticity. That's all. No special algorithm. And here's what you get.
So for example, at four assemblies here, these are the assemblies. Now, this is what happens when you now look at a new stimulus from the same class. This is the first distribution, second distribution, third distribution, fourth distribution. But these are the actual assemblies, the subset. This is just showing you the subset of neurons that are firing on presenting new stimuli from the same distribution.
So the overlap within a class becomes really high, the neurons that fire. So that's why it's an assembly, 98%. And between classes it's low. It's 15%. And so I'm showing you this with four clusters.
If you look at it a little bit more closely, what's happening in the five rounds? Remember, each example is shown only once. So the first time the example is shown, this is what's lighting up. Second example, this-- third example, fourth example, and fifth example. And it's basically stabilized already with five examples. Same thing for all of these.
But this distribution is different. So it's going to a different first subset, different second subset. And it stays significantly different from the other one. So that's the process.
Can we prove this? Do you have any hope of proving this? Fortunately, the answer is, yes. But we need to define a couple of things.
So we have a notion of a stimulus class, which is just a distribution of 0, 1 to the n. But it has a core subset S of size k. And so the elements in the core subset-- so you have n things. Part of them, there's the core subset of size k.
And in here, these things are more likely to be fired. That's probability r. And then everything outside is less likely to fire. We'll call it q times k over n. So there's a subset that's more likely to fire, meaning in the input. That's how the inputs are selected.
And the rest of it is less likely to fire. So these are Bernoullian distributions that have these random core subsets. And so the theorem is that, if we have sufficiently high plasticity, and the plasticity needs to go as the inverse square of the probability of the core firing. So if the core fires with smaller and smaller probability, you need higher and higher plasticity.
But then you need only log, log n samples from each class. So that's the 5 here. And this will lead, again, to a stable assembly with support just k plus little o of k. It will quickly converge. And after that, if you give me a new stimulus from the same stimulus class, same distribution, it'll just ignite that assembly.
If you make this a little bit more spread out-- so in this example here, we're setting r and q to be equal, so roughly the same number of neurons from the core and outside the core. Look at what's happening. These are the corresponding assemblies. They're quite disjoined. But these little 2 by 2 matrices are showing you the overlap of the input distribution within a class, between classes, versus the overlap of the firing neurons, the actual assemblies. So you see?
Now, once you project, it's very likely that things from the same class ignite almost entirely the same subset of neurons. That's the observation here, and consistent with theory. You can do this with multiple classes, as I just mentioned, and classification. And then, of course, once you have these assemblies that are consistently firing for something from a class, it's easy to build a classifier. Just add up all those neurons.
How that's done biologically, I don't know. That's why I'm not putting it in the model. Maybe there is a readout neuron that gets activated once the assembly is activated, so it specializes to a subset. But that's something to be worked out. This is just showing the same thing.
Now you could say, OK, great, semi-supervised, or mildly supervised learning. How about a more classical problem like learning a halfspace? So here, we have examples, again from 0, 1 inputs from the Boolean cube. And what we will do is that we'll say halfspaces are those that have delta higher than the expected value in some direction v, arbitrary direction v. And then the other halfspaces are those that are minus delta.
So there's this little margin between the positive and negative examples. That's already enough to create these assemblies. This is the assembly corresponding to the positive examples. This is the assembly for the negative examples. And if you look at the overlap matrix, indeed, these are just Boolean n dimensional coordinates with about k1's that just have a bias in one direction. The overlap is almost the same within and outside.
But once you've projected, created this assembly, exact same process-- one presentation, five examples at a time-- you get this much more amplified separation between these corresponding assemblies. Can this be proven [INAUDIBLE]? We have a theorem towards that. I don't think this is the last word. But it does establish something.
Basically, if you have k-sparse inputs with margin two delta, so that's your halfspace separation, as long as delta is bigger than square root k, then everything [INAUDIBLE]. Log, log n presentations create stable assemblies for each halfspace. And the new samples are correctly mapped.
So that was about learning. I'm at 4:54. I could either do one more topic or go to conclusions. Any suggestions? What is preferred?
PRESENTER: Well, maybe conclusions. And then--
SANTOSH VEMPALA: Questions.
PRESENTER: --I'll ask you a question about random draft models, if there are not too many other questions.
SANTOSH VEMPALA: OK. Just to entice you, these slides have some very nice simulations. This is the joint work with-- and a beautiful process that might be very nice to understand. But this is work with Mirabel Reid.
So language is very interesting to study from this point of view, and in general from the learning point of view, because it's one of the last things that the brain adapted to. In fact, it adapted to the brain's strengths. Language came after human brains. And so it could be an invaluable lens. And it's this demonstration of what the brain is good at.
And there's been a huge, great set of experiments. I'll just show you one, the Poeppel experiment from Poeppel's lab. He presented subjects with a series of words-- fret, ship, hill, melt, fans, blue, guess, hits, then. Looks like complete nonsense. And then he noticed that there is this-- it peaks at roughly 4 hertz, which is roughly what it takes you to see these one-syllable words. This is the activity, just measuring electrical activity.
And then, stage two, he reformulates them into little sentences-- bad cats eat fish, new plan gave joy, little boy kicks ball. And now he notices that there are three spikes. There's still one at 4 hertz, one for every word. But then there's one at one. And there's one at two.
And if this were live, I would now ask you, what do you think the one and the two are for? And indeed, you would see that the two is for phrases-- bad cats, eat fish, gave joy, and so on. And the one is for sentences. And all of these things are happening. Great.
So you're following these little trees in your brain. And indeed, this is also experimentally validated in another set of experiments. The completion of phrases activates parts of Broca's area.
And so here's a assembly calculus based architecture for syntax in language. You see something, or you get a stimulus, and you look for the corresponding assemblies of kick. And it projects to this area that's going to try to make sense of the word. And you see boy, and that particular one boy. Great.
And maybe you see ball. These are existing assemblies are activated. Now kick and ball together form this word phrase. That's this merge that I mentioned, but didn't talk about. And together, they form a sentence. And the sentence now can go back and make sense of the whole thing. So assemblies can be hierarchical and could potentially do parsing.
I'll stop there. And work that Christos and others have done-- they've actually built a parser on top of these assembly calculus operations. There's very nice mathematical questions about assemblies. Most fundamental of all is, what should we define as an assembly?
We have working definitions. But it's not entirely clear that that's right, or the final answer simulator. And so there are open questions on all of these aspects-- the calculus itself, random graph models, sparsification, learning, learning robustly. So one thing we've observed is that plasticity rules seem to give more robust outputs-- and then, of course, language. OK, thanks to all of my collaborators, and to you.
PRESENTER: Very good, Santosh. That's great. Let's go to the questions.
KRIS BREWER: What about this model is novel or interesting?
SANTOSH VEMPALA: Let's see. As far as I know, there's a novelty. There isn't a mathematically rigorous explanation of the creation of memories and the association of memories, even those operations, that's consistent with what we know actually can happen in the brain. But that's the novelty part.
Interesting? Well, the memory is perhaps the first big difference between how computers work and the brain works. We don't have lookup tables or other immediate data structures. So figuring out that seems fundamental. And if we can get to how parsing might happen, that would be great.
Now, the most recent experiments with language as well as with the learning suggests that there is much more here at work that could really be interesting. Maybe these limitations, both in terms of what you're allowed to do to update, and how you do it could be the reason why you get this behavior that's much more robust and generalizable. So that's conjecture for now, only supported by empirical things. But you've seen what we can prove.
KRIS BREWER: Great. Thanks. The next one is from [INAUDIBLE] Kenny [? Blum. ?] It might be slightly related. But my impression is that everything up to, basically, the first 45 minutes of your talk is in valiance work-- associate, merge, n, k, et cetera. What is different?
SANTOSH VEMPALA: OK, great. So valiance, the model, is certainly inspiring and very helpful. But while those operations were around-- I mean, those are operations-- the actual implementation is very different.
So in the neuroidal model, each neuron and each synapse has states. And it can basically decide to change state arbitrarily. There's a little program sitting in each neuron that can tell you how to change state, what state change to make in response to anything. So much more powerful, and one could argue, as far as we know, not plausible, at least not in the generality.
But second-- for example, the creation of memories. The way it works is that you actually go ahead and try to recruit new neurons every time you create an assembly. When you do an association, it's not that these things come together. You actually create a completely new thing that has common neighbors to both. So I think the implementation details are both more consistent here with biology, but also potentially allow us to do things like this unsupervised learning.
KRIS BREWER: Great. Thanks. The next one is from Naveen. Most of the talk develops a model based off of neuroscience experiments, such as the random graph from fruit fly experiments. Has the inverse occurred whereby this model predicts some new behavior, which was previously undiscovered by neuroscientists?
SANTOSH VEMPALA: Yeah, that would be fantastic. These are early days. And we are working with some imaging neuroscientists to try to make final predictions about assemblies, and especially the creation of assemblies on the fly and hierarchical creation.
But it's early days to go the other direction. For example, we're pinpointing how something like merge could occur. And so it would be great to actually find neural evidence of merge at the level of recordings from [INAUDIBLE].
PRESENTER: Santosh, one question from me is, is one assembly equivalent to one neuron in the McCulloch-Pitts deep network framework? And if so, how many assemblies can you have there, and neurons?
SANTOSH VEMPALA: Right. Very good. So the same question is, indeed, in terms of function, assemblies are doing only-- McCulloch-Pitts neurons can do basically anything, compute in any [INAUDIBLE] you can put down.
Here we are saying, yes, think of a McCulloch-Pitts neuron that recognizes a particular input, or a class of inputs. Yes, their correspondence will be a single neuron. Yes. Sorry. Your second question was?
PRESENTER: Well, essentially, can I consider one assembly as--
SANTOSH VEMPALA: Ah, capacity.
PRESENTER: --one neuron with a threshold? Or it's more complicated, the equivalence between the two, the correspondence?
SANTOSH VEMPALA: I think, functionally, you could say they are similar, or quite close. Yes. Or I should say assemblies capture many aspects of single neurons. You can do single neurons that do extremely powerful things, which I don't know how to do directly with assemblies without building on top.
But to answer your second question about capacity and how many you can have, in principle, it's n choose k. But we have to be very careful about interference. Once the overlap becomes too large, then--
PRESENTER: Yeah.
SANTOSH VEMPALA: And so at the moment, it looks like quadratic n is plausible with just these methods.
PRESENTER: But this means, suppose I have n neurons, and they can choose the McCulloch-Pitts framework with precise wiring for each neuron, so each neuron is a threshold unit.
SANTOSH VEMPALA: Yes.
PRESENTER: Or I can choose to have assemblies out of these n neurons.
SANTOSH VEMPALA: Then you can [INAUDIBLE].
PRESENTER: Can I get enough assemblies to have the same power as--
SANTOSH VEMPALA: Oh, no, you get even more.
PRESENTER: More.
SANTOSH VEMPALA: More. You get even more. I'm saying, at the moment, you can plausibly show that you can get at least quadratic in n. So if you're doing a quadratic in n, assemblies that behave reliably without interference taking over.
PRESENTER: Yeah, because interference is the question, which is difficult.
KRIS BREWER: The next one, from Sanjee, how did you come up with these operators? Can a smaller set of operators also lead to Turing complete computation?
SANTOSH VEMPALA: Yeah, that's a good question. This is just-- so each operation was consistent with some rule observed across the board by neuroscience. That's one. But second, indeed, for the small set that we do have, removing any one seems to not allow us to do the Turing complete proof. So at least it's locally [INAUDIBLE]. Yeah.
KRIS BREWER: Great.
PRESENTER: Santosh, can I ask you now if you can say what you left out from your talk?
SANTOSH VEMPALA: OK. Yeah. So there is a beautiful question here about convergence. So I want to just present that very quickly. And the question is just about the following process. So this is the basic convergence process that we've talked about, the top k five.
You see, the point is that the brain graph is not completely random. Of course, it does not look like Gn,p. We know that. So what does it actually look like?
And so there's these beautiful papers that measure these. And they find that the probability of reciprocal connections [INAUDIBLE] two cycles. And other small cycles is much higher. Little motives is much higher.
And someone could ask, why is this? What's the advantage of this? And so to study this, we can still find super simple process.
You have a graph. There are k vertices firing. And now you sum up the weights of edges coming in, and now the top k fires. That's it, the same process we've talked about.
But instead of Gn,p-- in Gn,p, we need plasticity. What about in other graphs? Maybe you take into account the structure of the graph is not completely random. So a natural choice here is geometric random graphs where you have hidden variables representing your position. And the probability of an edge depends on the hidden variables.
So here, for example, this is a Gaussian graph where the probability of an edge is proportional to the Gaussian function, e to the minus distance squared. And as a result, you see what happens is that this top k process converges without plasticity.
PRESENTER: Without.
SANTOSH VEMPALA: Without plasticity, it still converges. And so you could ask, actually, what does it converge to? So it turns out that, at least on the line-- we go to the completely continuous 0, 1 interval-- any decreasing function and anything where the probability of edge decreases with distance.
So it strictly decreases in distance. It will always converge to single intervals. Size corresponds to the size of the cap. And the convergence is constant time. It won't have time to do this.
When you go to a higher dimension, I don't even know what the shape of the limit is going to be. If I wrote in 2D, I don't even know what the shape is going to be now. In 1D, it's intervals. It's nice.
So that's what I was going to say. But the point is that the convergence is faster. The geometry is actually helping the convergence.
PRESENTER: Interesting questions. OK.
KRIS BREWER: Let's say we want to compare this method with deep learning for a given task. What would be the metric of success? With neuron assembly, it seems that not all the neurons are used when making predictions. Can we take this into account?
SANTOSH VEMPALA: I see. Interesting. For a first cut, it would be nice if we just compare it on the same benchmarks as you would use for deep learning. Can you actually classify it to the accuracy of the [INAUDIBLE]?
Why not? That would be nice. But yes, short of that, we can try to measure the sparsity. If you think of energy, how many total activations in the process.
But the thing that already comes up, as you saw in my last few slides, is the efficiency in terms of samples. You need very few samples here to do this. Even if you were to do a simple averaging, you'd need a lot of samples in a traditional supervised learning method.
KRIS BREWER: And the next one's from Demetrius. How is the production-- sorry. How is the prediction that assemblies are almost the same for the same stimulus consistent with the fact that multiple patterns of neurons, different assemblies, give rise to the same electromagnetic field? The solution of the inverse problem is non-unique.
SANTOSH VEMPALA: Right. Assemblies are at the scale of neurons. So the electromagnetic activity that we're measuring is a much more course thing. So if all your activation is, let's say, in the medial temporal lobe or one part of your visual cortex, then you can certainly have this confusion of which exactly generated it. So at this point, it's a matter of the granularity of the measurement.
PRESENTER: Very good. I think this was great. And we took a lot of your time. But thank you, Santosh. That's very nice work. A lot to think about. It would be nice to come up with a critical experiment--
SANTOSH VEMPALA: Yes.
PRESENTER: --to prove it or disprove it. Great. Thanks to everybody. We'll be back next week.