Emergent Computation and Learning with Assemblies of Neurons
Date Posted:
August 29, 2024
Date Recorded:
August 10, 2024
Speaker(s):
Santosh Vempala, Georgia Tech
Brains, Minds and Machines Summer Course 2024
SANTOSH VEMPALA: Unlike my fellow theory speakers, I'm actually going to talk about the brain. So I hope that's still OK here. So I'll explain every part of this, at least the precise sense in which I mean it. So please interrupt with questions. I hope to convey an entire model of the brain and what consequences we can derive from it so far, what things we can explain.
So before I start, a very quick background on what I mean by computation as this universal thing. It's just a state plus a well-defined sequence of state changes. So it's any system that maintains a state and has some rules about how the state changes. For a computer, that could be the memory contents, what's on the tape, what's in the input, and what you're outputting; that's your state, OK. Memory could be various levels of memory, yeah.
But computation is universal in a much broader sense. So there's the planets. They're in a current state right now, and due to some rules, they'll be in a different state this afternoon and so on. Weather; metabolic networks in your body, which are a computational system in how various chemical concentrations and their rates change, et cetera; ant colonies; but also brains, some physical system we don't fully understand that has state and changes state based on some rules that we don't fully understand.
OK, so that's it. That's one part of it. The second part is not to tell you what the brain is but to tell you what my understanding of the brain is, and all that this model is going to be based on is this cartoon version of a brain. I will only be using these elements to define the model. So these are going to be principles that are generally accepted, but of course, they don't cover the many nuances and important sophistications the brain has.
So it's about 80 billion neurons. Each one has 1,000 to 10,000 connections, more or less. And it's a directed graph, so these connections are directed, and these synapses or directed connections have variable strengths. So not all synapses are equal, some are stronger, some are weaker, and it's not a static graph. New synapses can appear and existing ones might disappear. OK, so it's a weighted, dynamic, directed graph. So that's the connectivity part.
And then there's activity. So individual neurons will spike or fire based on some activation rules, which are just functions of the signals on their input synapses, the presynaptic neurons. Whatever is happening there, plus the weights, give you some input to this neuron. And then there is a rule that determines whether you fire or not.
So these activations are non-linear. And a common model is a threshold function. If you exceed a certain threshold, the neuron fires. But of course, this is by far not comprehensive. There are more than 1,000 types of neurons. And these signals also are not just about on or off, but there seems to be important information in the rate, the rate at which you're spiking, which changes over time.
But that's basically it. That's all I'm going to assume about the brain. Now, the question, going back as a theoretician, is: how does the mind, meaning higher-level behavior, perception, cognition, all these things, emerge from the brain, from things that are very physical and that we can measure and quantify: neurons, synapses, et cetera?
And in spite of the stunning progress in neuroscience in the past decade or two especially, where you can measure thousands of neurons in live mammals and do crazy things with mapping them and lots of insight from cognitive science for over a century, we don't have an overarching theory. There really isn't even a plausible theory that tells you how this could be happening.
So how could this not be so-- how could this not be the most important question in science? Why are you not doing this and nothing else? I don't know but anyway, so.
[LAUGHTER]
So we'll propose something here. So the question arises, what's the level at which you want this model to be? The whole brain? Spiking neurons and synapses that you can actually verify things about? Even lower, there's very interesting activity that goes on at dendrites. Even lower, of course, there's molecular change. And that could be responsible for certain computations. Where is the right level at which to base such a theory?
I don't have a slide here, but in 2018, a few years after we started taking this somewhat seriously and publicly, the neuroscientist Richard Axel gave a nice quote saying that, in his view, the most important problem in neuroscience is figuring out the logic that translates neural activity into thought and higher-level cognition.
And so this is the model. And in the next few slides, you'll know the entire model, every detail of it. It's going to be a formal probabilistic model. There's going to be a probability aspect to it. You'll see what it is. It will be implementable by neurons and synapses. It will be consistent with the principles of neuroscience, as essentially I've already mentioned to you. And there's early work of Les Valiant on what he calls the neuroidal model; that was certainly an inspiration.
So the graph itself, we're very far from figuring out the precise structure of the graph, and maybe it's not so important. So we're going to just model it as a random graph. And in case you're not familiar with the theory of random graphs, it's an elegant, beautiful theory. And here it is on one slide.
So G(n,p), the basic random object, is a graph on n vertices, where every possible pair of vertices is connected by an edge independently with probability p. You toss a separate coin for each pair, a probability-p coin, and if it comes up heads, you put in that edge. Of course, in our case, we'll think of directed edges.
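As a quick illustration, here is a minimal sketch in Python (NumPy) of sampling a directed G(n,p); this is my own toy code, not anything from the talk:

```python
import numpy as np

def directed_gnp(n, p, rng=None):
    """Sample a directed G(n,p): each ordered pair (i, j), i != j,
    gets an edge independently with probability p."""
    rng = rng or np.random.default_rng(0)
    A = (rng.random((n, n)) < p).astype(float)  # A[i, j] = 1 means an edge i -> j
    np.fill_diagonal(A, 0.0)                    # no self-loops
    return A

A = directed_gnp(1000, 0.05)
print(A.sum(axis=0).max())  # max in-degree, concentrated near p*(n-1), about 50
```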
OK, so p = 0 gives the empty graph; p = 1 gives the complete graph. And this is a very simple model with surprising structure. So you might think, what structure could this have? I mean, each draw looks different if you haven't seen it before.
Well, for example, one thing that maybe is not so surprising is that the maximum degree, the maximum number of edges at a vertex (either in-degree, out-degree, or degree in the undirected sense), is very closely concentrated near its expectation. That's not surprising because you're doing many independent tosses, and it should concentrate around the average.
But here is a more surprising fact that even when it came out was quite striking. Suppose I start with this random graph, this distribution where p is 0, so it's just the empty graph, and I slowly increase p. So there's a distribution evolving on graphs with higher and higher density.
Now, certain properties, which were not true before, might become true. For example, if I ask you, is the graph connected? Increasing p can only increase the probability of being connected, meaning there's a path between every pair of vertices.
So there are lots of other monotone properties such as, does the graph have a Hamilton cycle? If you increase the probability, it only increases the chance of it having a Hamilton cycle. So monotone just means adding edges doesn't hurt the property.
And the theorem is that any edge-monotone property, the probability of it being true, has a sharp threshold. So it goes from basically 0 to 1 in an interval that's smaller than any constant, an interval whose width goes to 0 with the size of the graph. So it's a sharp threshold, the canonical sharp threshold.
So all of these properties automatically come true. You don't have to prove it one by one. Of course, if you want to determine what is the threshold, then you have to investigate more carefully. But if you just want to know is there one, there's nothing more to do. It's done.
And we are talking about directed G(n,p), where both directions are independent, so the simplest possible thing. So this is simple and unrealistic, of course; the real connectome is not like this. But it's still going to be useful to see what, even with this, we can say.
For example, if you could make a statement like: you pick a connectome like this, and with high probability, over the choice of the graph, all of these things will happen; that's saying something interesting. You don't even need that much structure from the graph except what you would get from randomness, OK. So that's the thing.
Now, here's the actual model of the brain itself. So we view the brain as having a finite number of brain areas, regions called areas, about 10, 20, a fixed constant number of brain areas. For everything that I'll be doing, you'll see that five or six areas is going to be sufficient. So it's a small constant number of areas. Each contains n neurons. So we could have made them different sizes. We'll just say same n everywhere.
There's going to be activity in discrete time steps. So this is also an approximation. There are some clock steps at which there's activity. We simply imagine it happens in discrete time steps.
OK, now, some pairs of areas are connected, not every pair, some pairs. So these red arrows tell you that these areas are connected. And how are they connected? By a directed bipartite random graph. So it's just a random graph between every pair of neurons here; the probability of an edge in this direction is p. Some are bidirected like that, so that's just a graph in each direction between those.
And every area itself is recurrently connected, so meaning this is a random graph, just like the one I described to you earlier, independently drawn everywhere, where the probability of an edge is p in both directions, independent. OK, so that's it. There's a bunch of random graphs that are connected randomly. That's the connector.
And then the activity, I already mentioned, proceeds in discrete steps. Which neurons fire? So the rule is the following. The k neurons in each area with the highest total weighted input fire. So just like with a neural network or with the brain, each neuron does a weighted sum of whatever its presynaptic activity is. In this case, it's a zero-one weighted sum, and that gives some total input.
And then we're not doing a threshold locally. We're saying, out of the n, there is a fixed k, and the top k fire. The k neurons with the highest total input, those are the ones that are going to fire. That's the next round in each region, and then that gives you the round after that, and so on.
When I say total input, the input could be coming from the same region or from other regions; it's the total input to the neuron. The neurons don't know which area they're in. That's our interpretation. Well, you may have a question about how we're doing top-k. We'll get to that in a second.
Now, the connections between areas can also be inhibited: this one is not active right now, or it is active. And you'll see a mechanism for how we'll do that, also with neurons. So by the middle of the talk, there will be no global control. Basically, there is a set of rules for how every component operates, and that's it. The only thing in your hands will be what input you present. There will be no central control, no program, no loop, nothing.
Now, in addition to this firing activity, the weights of the neurons-- this is the weighted graph. The weights of the synapses, I'm sorry, might change. Again, there is a rich variety of rules that have been observed here and are analyzed. We'll use the simplest one, the most basic, Hebbian plasticity.
So for the synapse i to j, if i fires and then j fires, in the very next step, i fires in step t and j fires in step t plus 1, then the weight of the synapse i to j is increased by beta. You can do it additively or multiplicatively. You get slight differences in the quantitative behavior, but that's basically it.
So that's the entire description. And for the purpose of bookkeeping, since these are all non-negative weights and they're only increasing, we don't want things to go off to infinity. So we'll do homeostasis, which is just that every once in a while (it doesn't have to be at the same clock frequency), the total incoming weight to each neuron is normalized back down to some constant, say one. OK, that's it.
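To make these rules concrete, here is a minimal single-area sketch of one time step: top-k firing, additive Hebbian updates, and homeostatic renormalization. This is my own toy code under the definitions just given, not the official simulator; beta is the plasticity parameter:

```python
import numpy as np

def step(W, active, k, beta):
    """One discrete time step in one area. W[i, j] > 0 is the weight of
    synapse i -> j; `active` is a 0/1 vector of neurons that fired at step t."""
    inputs = active @ W                    # total weighted input to each neuron
    winners = np.argsort(inputs)[-k:]      # the top k fire (ties broken arbitrarily)
    new_active = np.zeros_like(active)
    new_active[winners] = 1.0
    # Hebbian plasticity: i fired at t and j fires at t+1 => strengthen i -> j
    W += beta * np.outer(active, new_active) * (W > 0)
    return W, new_active

def homeostasis(W):
    """Every once in a while, renormalize each neuron's total incoming weight to 1."""
    col = W.sum(axis=0)
    col[col == 0] = 1.0
    return W / col
```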
So now I have given you the entire model. Now, you may ask, how is input presented? Well, input goes to certain areas. So you have an area for input. You might think of it as an olfactory area or an area for vision or so on, some sensory areas. And each of those areas is neurons again. And an input means some subset of those are set to fire.
And this really is how it happens. I mean, if you think of smell, there are all these neurons. I mean, an odor consists of odorants, these chemicals, and then there are these neurons that are sensitive to specific odorants. If that's sensed, then this will fire. Otherwise, it won't fire. So some subset of neurons in your olfactory sensory area fire if you present that smell.
Similarly, if you're talking about vision, which is very, very well studied and understood, there are neurons that are very specific. If there is light at this angle, or if there is movement in this direction, then the neuron fires. Otherwise, it doesn't fire. So there are all these lots and lots and lots of yes/no sensory neurons.
So that's the model. Now, the question is, will computation, interesting computation, and learning emerge, rather than have to be programmed? I don't want to program anything. What's going to emerge here, and in particular, will interesting computation and learning emerge?
Note that there is no backprop or gradient descent here. There's no data type. There's no programming language. And so maybe the first question you might ask is, how do you even create or recall a memory, right? I mean, at least you should be able to remember that you saw something before.
And some people would argue memorization is at the core of explaining even the most advanced models today. Whether or not you believe that, memory seems to be a crucial, important thing and something that's true for the brain. We memorize things all the time. So we'll start with that.
And this is an idea that's well over 50 years old, more than that, a 75-year-old idea: assemblies of neurons. And what is an assembly? It's a subset of neurons, a large subset that's a little bit more densely interconnected than the base graph, so the probability of connections within it is higher.
And their weights might also be higher. And its functionality is such that it corresponds to a particular memory, a particular concept. So when this fires or mostly fires, not all of them have to fire, so it corresponds to you thinking about a concept, a person, a name. It doesn't have to be very concrete.
All of you have different assemblies for Woods Hole and for this summer school. That's right. But also maybe for this particular podium here and so on. So there are assemblies for everything, subsets of neurons, OK, great. And as recently as 2020, the empirical neuroscientist Buzsáki called assemblies the alphabet of the brain. And they are measured.
All right, so here's the hypothesis. There is this intermediate level of brain computation, where you can actually see the computation happening. We'll call it the assembly calculus. It's implicated in higher cognitive functions, all of these things. And the basic data type, which is emergent, again, not programmed, is assemblies of neurons. That's the representation. That's the only representation at the intermediate level that we'll need to do all of this. All right, so that's the first claim.
So what do I mean? So let's say that's a sensory area and there's a stimulus being presented. So it fires. And when it fires, in the next upstream area in the brain, whatever it is, maybe there are multiple, some subset of neurons fire. By our rules, it's the top k in this area.
So what does that mean? It's the ones that had the highest weighted input from the sensory area, the highest in-degree from there. This was random. So this is essentially a random subset. Great, OK, that's one step.
Well, that fires again, but this time, that fires and this fires. And now, what's going to happen to the top k? Potentially, a different subset because now you're interested in the total, the top k highest input from both of those together. So it could be a different subset and then fires again. And maybe it's a different subset.
But there's also Hebbian plasticity. Every time you do this, if it so happens that i fired the first time and j fired the second time, the i-to-j synapse will get strengthened. So the question is, is that enough to create a subset that's strong and stable, so that next time I present that stimulus, you skip all of these steps and directly fire that? That would be recall.
And the answer is yes. This was sort of our first concrete theorem in this model, 2018. This projection, you may call it projection (all you're doing is presenting a stimulus and seeing what happens in an upstream area), converges exponentially fast with high probability and has the following guarantee.
The total number of neurons involved, if you take the union of the caps over all time, is k plus little o of k. Of course it's at least k, because it's top-k; the point is that the extra term is asymptotically smaller than k, provided the plasticity is higher than a certain threshold, which I'll put up explicitly in a second.
And as long as the plasticity is positive, it will still converge, but potentially to a much larger support. And the threshold looks like this; that we can prove. This is not a threshold in the previous sense. It's a threshold only in the sense of a lower bound: if you are above this, this will happen.
I don't know, for example, if you're just interested in the math, you could ask, is this threshold sharp, or is there a sharp threshold? It's unclear. It's much more complicated than saying the property is monotone.
So the threshold looks like this, some constant. Don't pay attention to the nice-looking constant. p is the edge probability, k is the number of neurons that are firing each time, and n is the total number. So if you think of p as roughly 1 over square root of n (each neuron has about root n connections) and k as about root n, then this is all just a constant, not even that large. So if the plasticity is higher than some constant, for reasonable ranges of p, k, and n, you will have exponential convergence.
And this is an experimental plot just showing you that the number of new activations falls off rapidly. And if you have high plasticity, basically, you pick up hardly any more new neurons. If you have low plasticity, you may take a few iterations before you stop picking up new neurons, new things fire. And if you're at zero, as far as we can tell, it just keeps going.
Of course, the convergence time cannot be bigger than exponential in the size of the graph, because the total number of possible subsets is n choose k. At some point, you'll cycle, and that's it. But until that happens, this could keep going, OK.
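If you don't want to wait for the demo, here is a self-contained toy version of that projection experiment, a sketch under the model above. The parameters are illustrative, not the slide's, and homeostasis is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, p, beta, T = 2000, 50, 0.05, 0.2, 10
S = (rng.random((k, n)) < p).astype(float)   # stimulus -> area synapses (k stimulus neurons)
W = (rng.random((n, n)) < p).astype(float)   # recurrent synapses within the area
cap = np.zeros(n)
support = set()
for t in range(T):
    inputs = S.sum(axis=0) + cap @ W         # the stimulus fires every step, plus recurrence
    winners = np.argsort(inputs)[-k:]
    new = len(set(winners) - support)        # neurons that have never won before
    support |= set(winners)
    print(f"round {t}: {new} new winners, total support {len(support)}")
    # Hebbian updates: potentiate synapses from last step's firing into the new cap
    S[:, winners] *= 1 + beta
    prev = np.where(cap > 0)[0]
    W[np.ix_(prev, winners)] *= 1 + beta
    cap = np.zeros(n); cap[winners] = 1.0
```

With the plasticity turned up, the count of new winners drops off within a few rounds; with beta = 0 it keeps wandering, as the plot shows.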
So I will just have a quick-- I'm very happy to discuss the proof in detail. And I think there's some insight here, but let me at least sketch it without spending too much time on it. And a sketch looks like this. Why are we making progress in this way?
So the very first cap is just the highest-input neurons. Now, the total input at each neuron is really a sum of Bernoulli random variables, a binomial distribution, which is closely approximated by a Gaussian. To prove the formal theorem, you cannot just say it's approximate; you have to work out the bounds.
But let's say, for the purpose of this intuitive explanation, we're thinking of the winners as being draws from that Gaussian distribution. The mean is pk, because k neurons are firing and the probability that a neuron has a connection to each one is p. So you expect pk to be your input, but then there's a standard deviation of about the square root of pk. So that's it.
So you take the highest k from this. Draw n times and take the highest k. That's the first cap, great. What about the second cap, cap meaning the top k? Well, the second cap is more interesting because it's now the top k when both these are firing.
So you see, there is now a competition between the neurons that fired the first time and the big pool of neurons that did not fire the first time. For the latter, it just looks random: mean 2pk and variance 2p(1-p)k, because 2pk is the expected total input from the two firing sets together, and you're looking at the highest draws from this.
But these are a little bit different. Why? Why is the first cap different?
AUDIENCE: Connected.
SANTOSH VEMPALA: Yeah, they're connected. And what happened to the connections? The weight went up a little bit, just a small factor, 1 plus beta, but that means that the mean went up. So you have a small pool of higher-mean neurons and a big pool of lower-mean neurons. And now you get the top k from this entire pool.
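In symbols, the competition he's describing looks roughly like this, a heuristic Gaussian picture of the proof idea rather than the formal statement (c is an unspecified constant):

```latex
\begin{align*}
\text{fresh neuron: }   & X \sim \mathrm{Bin}(2k, p), \quad
  \mathbb{E}[X] = 2pk, \quad \sigma(X) = \sqrt{2p(1-p)k},\\
\text{round-1 winner: } & Y \approx (1+\beta)\, S_1 + \mathrm{Bin}(k, p), \quad
  S_1 \gtrsim pk + c\sqrt{pk \log(n/k)}.
\end{align*}
```

As beta grows, the previous winners' mean advantage grows, so fewer and fewer fresh neurons crack the top k, which is why the fraction of new winners shrinks geometrically.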
And of course, by choosing how much higher the mean is, you can adjust what happens here. And that's what happens. This is the fraction of winners. To do the proof, we have to be a little bit careful: these are the fractions of new winners at each step, and we'll show that the fraction of new winners, neurons that have never won before, is an exponentially diminishing fraction.
So we can figure out these thresholds and then prove, using those thresholds, that it converges exponentially fast for a large enough plasticity. That's basically the outline of the proof. You have a--
AUDIENCE: I have a question. What is the difference between this and [INAUDIBLE] model?
SANTOSH VEMPALA: And what?
AUDIENCE: [INAUDIBLE] model.
SANTOSH VEMPALA: Oh, oh.
AUDIENCE: [INAUDIBLE] model with some basic connection which is in there.
SANTOSH VEMPALA: I mean--
AUDIENCE: --around connection.
SANTOSH VEMPALA: Yeah, you'll see. So far it's just the creation of a memory, right? So in that sense, functionally at least, you've created memories. One thing is that the patterns can be arbitrary. We don't need to assume the input pattern is random or anything. It's just arbitrary.
And I haven't argued anything about capacity. How many can I encode here? But because of the nature of the projection, it's going to be as if it were random because they're going to project to random subsets. But the more interesting differences will come up now. So far it's just memorization.
The next thing you get is pattern completion. So first of all, already from before, if you present the same stimulus, you'll get the same assembly firing immediately. But what if you fire only a subset of it, say 10% of that?
So then what happens is that, as long as the assembly was strong enough, meaning the stimulus was presented sufficiently many times, roughly log of-- I'll put up the theorem in a second. If you want that when you fire an epsilon fraction, the entire assembly should fire, so that an epsilon fraction, this much of a person's face, is enough to remind you of who it is, you just have to have seen that face enough times. And the number of times is log(1/epsilon), basically.
So yeah, so if this small fraction fires because internally, the connections are stronger, the expansion is higher, compared to what's happening with the rest of the graph, this activates the rest of it very quickly. That's the point. And this is a benefit of recurrent connections. If you didn't have recurrent connections, this wouldn't happen.
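Continuing the toy projection sketch from above (reusing its W, cap, n, k, and rng), pattern completion can be tested like this; whether the assembly completes depends on how strong it got, that is, on beta and the number of training rounds:

```python
# Fire only an epsilon fraction of the learned assembly, with no stimulus input,
# and let the potentiated recurrent synapses do the rest.
eps = 0.1
assembly = np.where(cap > 0)[0]
probe = np.zeros(n)
probe[rng.choice(assembly, int(eps * k), replace=False)] = 1.0
for t in range(5):
    winners = np.argsort(probe @ W)[-k:]   # recurrent input only
    probe = np.zeros(n); probe[winners] = 1.0
overlap = len(set(winners) & set(assembly)) / k
print(f"recovered {overlap:.0%} of the assembly")
```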
OK, so that's pattern completion, which to some extent you also see in Hopfield networks, yeah. Now let's go to the next one. Now, this is something that's very much part of our experience, the mammalian experience.
You see two assemblies co-occurring, right? I don't know. You see Tommy and Woods Hole, and you see them co-occurring. And then what happens? These assemblies, they seem to start overlapping more, at least for us. Next time you're in Woods Hole, you're like, where is Tommy? Well, at least that's my experience. Maybe you only saw him today but OK.
And what happens in the model itself is that the overlap of the assemblies actually increases. Assemblies do change over time, and they pick up more in common. So this is association. It provably happens with this, just assuming the original graph was G(n,p).
And this, in 2016 (I mean, such fortuitous timing), was verified by recordings in human subjects, where they presented stimuli and recorded which neurons fired, out of several hundred. They noticed about a dozen, up to tens, of neurons firing for each stimulus: places, famous places, the Eiffel Tower and so on, and the Great Wall of China.
And then people, famous people, and they saw, OK, there are assemblies, and a different dozen or so fire. And then they did this extremely clever thing where they superimposed the place and the person and presented that a few times. And this is in live humans.
And then they measured, after some time, by presenting only, let's say, the place. And some of the neurons that were previously firing for the person also started firing. OK, so this is happening. And when they test this after weeks, it's still happening. And this experiment, originally by Ison et al., has been replicated multiple times. Yes?
AUDIENCE: What kind of brain recording was this?
SANTOSH VEMPALA: Neural. No, no, we're talking about individual neurons. They record from individual neurons, about 600-some of them, in the medial temporal lobe. No gross fMRI. We're talking about actual neurons. That's how neural assemblies are today claimed to actually exist for memory, yeah.
OK, so far, so good, fine, association-- yes?
AUDIENCE: On the co-occurrence, you mean the-- like they're given at the same time?
SANTOSH VEMPALA: Yes.
AUDIENCE: [INAUDIBLE]
SANTOSH VEMPALA: No, together, together a few times, yeah.
AUDIENCE: What if you do like one and then the other?
SANTOSH VEMPALA: I don't know. [CHUCKLES] I wish I could just do that, but yeah, it's a real experiment. Yeah, there are some variations they did to show the robustness of this finding, but I don't remember them actually doing--
AUDIENCE: I'd like to talk about more yourself. So like the-- when you present this thing.
SANTOSH VEMPALA: Oh, yeah, if you alternate them enough, then it's as good as putting them together. Yeah, in fact, I'll get to that. A more general thing in that line, yeah.
So, so far, so good, where we have storage and retrieval, maybe retrieval of near neighbors. But you can do much more. I won't mention what these operations are, but we want to get to learning, higher-level learning.
So how do brains learn, just to ask the crucial question here? Do they do gradient descent? It's still a topic of debate, but there's really no evidence. And then there is plasticity, but is this an effective learning mechanism? Can you show me that this actually is useful? And so this is a crucial problem in general, but also, certainly, for the model, because that's the only mechanism we have.
So by the way, my plan is to finish by 4:00 and do a simulation. Not a set of slides, but we will run the simulation for the last 10, 15 minutes. So if you don't believe the proofs, hopefully, the demo, yeah, OK.
So here's the next thing. So learning, what is learning? So far, it's just memorization. I see a stimulus, I memorize it. I memorize it so strongly that a partial cue is enough, and so on. But what would be learning is if you could memorize patterns.
So I see examples from a distribution that's somewhat concentrated, and I learn one assembly for it: not one for each example, but one assembly for the class. And then I see examples of another distribution, and I learn an assembly for it. And next time I see an example from one of these, the assembly for the class from which it came should fire.
OK, so we design these distributions; I will call them stimulus classes. So these stimulus classes are distributions over what's going to fire. It's going to be k neurons firing. But if you're from distribution 1, or class 1, there is a subset that's more likely to fire, and the rest are random. If you're from distribution 2, there's a different subset, not entirely different, but largely different, that's more likely to fire, and the rest are random.
But for each individual stimulus, there's still variation. So if you do this-- OK, sorry, the punch line was already there. Yeah, with similar classes, we'll see, it learns them beautifully. All you need is a few examples from each class.
But. But the examples from each class have to be presented consecutively. Remember, there is no label. You give me five examples from class 1, and then five examples from class 2, and you can do it for 10, 20, or whatever number of classes. And now you give me a new example from one of these classes, and it will fire the correct assembly, very likely, because it's a higher-level assembly. There is some learning here, some rudimentary learning going on.
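One way to generate such stimulus classes, my reading of the setup rather than the exact distribution used in their experiments, is sketched below:

```python
import numpy as np

def make_class(n, k, core_frac=0.8, rng=None):
    """A stimulus class: a core subset that is likely to fire; the rest is random."""
    rng = rng or np.random.default_rng()
    core = rng.choice(n, k, replace=False)
    def sample():
        x = np.zeros(n)
        keep = core[rng.random(k) < core_frac]   # most of the core fires
        x[keep] = 1.0
        pool = np.setdiff1d(np.arange(n), keep)  # pad with random neurons up to k total
        x[rng.choice(pool, k - len(keep), replace=False)] = 1.0
        return x
    return sample

class1, class2 = make_class(1000, 50), make_class(1000, 50)
examples = [class1() for _ in range(5)] + [class2() for _ in range(5)]  # consecutive blocks
```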
OK, now here is an example in a more classical learning-theory vein. I want to learn a halfspace. So the way we'll design it is that there are positive examples and negative examples, and we don't know what the direction of the halfspace is.
So there is some linear combination such that the positive examples are a little bit further out than the negatives, so we assume there's a margin here. That's important, both in theory and in experiment, in this model. So there's a margin.
And so that's the delta there. So it's a little bit higher. And that's it. So we pick a random v, we make sure there's a margin, and we draw random examples. The positives are the ones above the threshold; the negatives are the ones below it by delta. And we'll just do five examples of each in the experiment.
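A sketch of how such examples could be generated: Bernoulli coordinates, a random direction v, and a margin delta. The parameters are my illustrative choices, not the slide's:

```python
import numpy as np

rng = np.random.default_rng(2)
n, delta = 100, 0.5
v = rng.standard_normal(n); v /= np.linalg.norm(v)

def draw(label, tries=100000):
    """Rejection-sample a 0/1 vector on the positive or negative side of the margin."""
    for _ in range(tries):
        x = (rng.random(n) < 0.5).astype(float)
        s = v @ (x - 0.5)                  # center the coordinates so both signs occur
        if label == +1 and s > delta:
            return x
        if label == -1 and s < -delta:
            return x
    raise RuntimeError("margin too large for rejection sampling")

positives = [draw(+1) for _ in range(5)]
negatives = [draw(-1) for _ in range(5)]
```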
And it forms very different assemblies with very little overlap. In fact, you can look at the overlap matrices of the coordinates. Remember, the vectors are just Bernoulli, each coordinate a one or a zero, and so this is the overlap.
So those from the same halfspace are slightly more overlapped than those from different halfspaces. This is the same halfspace, one side, the other side: slightly more overlap than if you were in different halfspaces.
And once you've formed the assemblies, you look at the fraction of firing neurons that are the same: from the same halfspace, it's this much, and from different halfspaces, the overlap is much smaller. So now if you give me a new example, very likely the right assembly is the one that's going to fire, because most of your neurons are going there. And indeed, that's what happens.
So this is a theorem. So far, so good: very basic classification learning. But in this mild, unsupervised setting. It's unsupervised, but there is label information in how we present all the positives first, followed by the negatives.
We get to sequences, which is an important question. Sequences seem to be very crucial for the brain. We learn lots of things by sequences. Certainly, we learn linguistic things by sequences. But as you may have already learned in summer school, or beforehand, even when you look at a painting, you parse it as a sequence. You look at a person, you parse it as a sequence.
Your eye does a sequence of saccades, where it decides to move somewhere else, move somewhere else, move somewhere else, because at any given point, it's only looking at a small region of your view. The fovea is small.
And so there is a sequence in which you see an eye, great. You look for another eye, great. You look for a nose. Oh, if there's something else, it's a surprise. So even a painting is really a sequence for the brain, but not for GPT, as far as I know. Maybe I'm mistaken there. Not for deep learning, I should say.
Now, assembly sequences are also observed in experiments, I mean, neural recordings. So they do these experiments (this one is from Buzsáki's lab) where a mouse has to enter and traverse a maze in a particular way, because that's what gives it the reward.
And they have to make a sequence of decisions about left or right, move or stay. And they practice, practice, practice. They're trained fine. And here's the interesting thing. When they are then put down in this maze, in the brain, they preplay the sequence that they're going to do, and then they do it.
So yeah, there was also one from Tonegawa's lab, essentially a refined or different variation of this preplay of sequences.
AUDIENCE: What is the date of that paper?
SANTOSH VEMPALA: Let me see. We are talking about very close to 2020, but I can certainly check. Sorry, yeah. So then, here's the question for the neural model, for NEMO: how should we present this?
We want to present a sequence of stimuli, not just one stimulus, but I present a sequence A, B, C, D, E, F, A, B, C, D, E, F a few times. OK, what should happen? Next time I say A, you're going to say?
AUDIENCE: B
SANTOSH VEMPALA: B, thank you. I mean, you already knew that one, but I could have tried something else, yeah, OK. And not only the first one. If I tell you C, you're going to say D, E, F if you're paying attention. And this actually happens, provably. You don't need any additional mechanism. You just present it like this, and you'll get this behavior. OK, sequences, great.
But sequences have very powerful consequences. Basically, once you have sequences, you can do finite-state automata. What do I mean by this? What I mean is, before I get to this, sorry.
What is a finite-state automaton? You have states and transitions. A transition might depend on some input, and it might produce some output. Let's say it's a deterministic one. We're just talking about a deterministic machine: based on your current state and the input, you do a very specific thing. You either stay or change state, and you output something.
But that's a set of two, a sequence of length two. So how do we train this? All we do is present sufficiently many examples of the transitions themselves, just the transitions. It's a very helpful trainer. It's not like, oh, you've got to figure out the finite-state machine given only the output. No, we're not talking about a hard learning problem.
We're just saying, a helpful teacher is showing you what the sequences are by playing them out, sufficiently many times, in some order. And now, will you have learned the whole machine, not just the individual sequences? And the answer is-- so the answer is going to be yes.
But here's the thing. Every algorithm, just every algorithm, period, is a finite-state machine with memory that it can read and write. That's all it is. So basically, if you're also able to simulate the memory, which you can, then you can do any computation.
Yeah, so you get to do this. And this is what I mean by emergent brain computation. So in this assembly calculus, without any control commands, the only thing we need is that a certain pair of areas be forced to alternate in their firing.
When this area fires, then this area doesn't fire, and the next time this area fires. So there has to be an alternation between the areas, for which there are multiple mechanisms in the brain. I don't know which one is actually responsible for it. By the way, the way we do this is, we have three areas: symbols, states, and transitions. By presenting the symbol-state pairs in consecutive steps, you learn these transition assemblies, which then map back to the next state, and that's how it continues.
But what I want to say is, there are no control commands here, and this alternation can be achieved by inhibition. By the way, I forgot to say how we're doing top-k. It's not some magic or some extra assumption. It's just the well-known excitatory-inhibitory feedback loop.
The moment more than a certain number of excitatory neurons are firing, more inhibitory neurons fire, which reduces the number of excitatory neurons, and it very quickly reaches a balance in a wide range of parameters. Certainly, you can set up bad starting points, which kill it, but for a large range of parameters, it stabilizes super quickly.
So that's how we enforce the k. It's not going to be exactly k this way; it's going to be k plus or minus a tiny amount. That's what would happen from this, yeah, OK.
What I'm saying here is that if I want the alternation to happen between areas, meaning in one step, a subset of k from here is allowed to fire, and in the next step, nothing is allowed to fire from there and a subset of k is allowed to fire from here, this can be set up with what are called long-range interneurons.
Maybe this is the mechanism, maybe there is some other way to do it, but this is certainly one plausible way it can happen. There's a small, simple circuit with inhibitory long-range neurons that will allow you to achieve alternation, and that's all we need. Yes?
AUDIENCE: Are these inhibitory connections chosen by--
SANTOSH VEMPALA: Oh, all we have is that every area has a population of inhibitory neurons for it, and what we call disinhibitory neurons, which inhibit the inhibitory neurons. So they're already there. And when you present stimuli, these weights get changed accordingly.
AUDIENCE: But they are also learned with [INAUDIBLE].
SANTOSH VEMPALA: No, no, they are-- they're the-- yeah, let me go back and say a little bit more. So, this alternation part is already there.
AUDIENCE: It's like hard coded.
SANTOSH VEMPALA: Yeah, it's in the network, basically. It's like we say some areas are wired that way, so ab initio, [INAUDIBLE] makes sure that these areas are only going to alternate for all time, so yeah, exactly. So for the purposes of the rest, this allows us to just abstract this away and say, OK, there's alternation.
So the realization of these finite-state machines is emergent computation. All you need is a sequence of inputs. Everything else is operating by local rules. And in particular, what this says is that you don't need the middle level of Marr's three levels. There's no need for an algorithm. You prescribe the behavior, you prescribe the lower-level rules; that's it. You don't need to specify algorithms anymore.
All of this, you can do in simulation. But before we get there: my title had coin flipping. And there is an issue here, which is, yes, the graph itself is random, but once I fix the graph, everything I've described so far is deterministic. It's completely deterministic. If you did the same thing again, the exact same outputs would happen, once you fix the initial graph.
The only probability was over the initial graph. But of course, our behavior is much more complex. You may do some things the same every day, but some things differently. Yes?
AUDIENCE: How would you choose the top k?
SANTOSH VEMPALA: By inhibition.
AUDIENCE: Yeah, but I think that they are not randomly sampled.
SANTOSH VEMPALA: No, no, the top k means the k neurons in the area that are receiving the highest total input from the previous step's activation.
AUDIENCE: But if we give always the same stimulus, why don't you get always the same--
SANTOSH VEMPALA: You will have exactly the same. If I present the stimulus, I get the top k. But remember, the next time that k and this k are firing, that's a new top k, new top k. There is some behavior but that--
AUDIENCE: [INAUDIBLE]
SANTOSH VEMPALA: Yes, but this entire behavior is deterministic, meaning if I present the same stimulus again, on the same graph, you will get the same sequence, which itself is a bit more restricted than what we observe in reality. And part of the reason is that, of course, you may choose a different path to work or a different coffee or different clothes every day.
But it's not just because you're doing things randomly. Randomness is just this metaphor for all this other stuff that's happening. There's lots and lots of other sensory input. So who you see, how you slept, whatever. All of these things we just model as there's some randomness here. But nevertheless, we don't want to keep this deterministic here.
So this takes us to an extremely basic question. How does the brain learn statistics? So, for example, all of you have some kind of probability in your mind about the chance of it raining in Woods Hole or raining in some other place, just from experience. It's not that you kept track explicitly, and of course other things happen at a lower perceptual level. Sorry, you had a question?
AUDIENCE: Maybe you answered it already.
SANTOSH VEMPALA: So the first question is, how does it learn statistics? And second, even once it has learned these statistics, how does it actually draw from that? How does it sample from that? And we don't want to do this again with any new mechanisms, to the extent possible, just the same rules.
So here's what I mean. So suppose we have this basic probabilistic event. There's an assembly A that's firing. And when A fires, sometimes B fires and sometimes C fires. You could say half, half, one third, two third, whatever. When A fires, sometimes it's B, and sometimes it's C.
So what's going to happen in the model right now? The connections from A to B will be strengthened every time A's firing is followed by B's. The connections from A to C are strengthened every time A is followed by C. Great, so far, so good.
But what we'd like is that if we now fire A, what would we like? We'd like B to fire the fraction of the time that we saw that, and C to fire the fraction of the time we saw that. That would be nice. But it's deterministic, so that's not going to happen. There will be some fixed subset; it might not even be either of these.
So there is no randomness. So we need randomness. We can't produce randomness out of nothing. I mean, that's a theorem. We need randomness somewhere to be able to make a probabilistic choice. So how would we do this? We're going to abstract away everything else that's happening in the brain as noise.
OK, so how do we do this? The input to each neuron, in addition to whatever it's receiving from everything in the part that you've mapped out and are looking at, is perturbed by a small amount of randomness, say a Gaussian with zero mean and small variance.
Maybe you can attribute this to biochemical variation or input from the rest of the brain or whatever. But there is some randomness each neuron feels, in addition to what it's getting from the actual mapped-out activity. This is a tiny bit of randomness. So now we don't have the entropy problem; we're not creating randomness out of nothing. But is this actually useful? Can we do this?
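In code, this changes exactly one line of the earlier step sketch: perturb the inputs before taking the top k. The noise scale sigma is my illustrative choice:

```python
import numpy as np

def noisy_top_k(W, active, k, sigma=0.05, rng=None):
    """Top-k selection where each neuron's input gets independent Gaussian noise."""
    rng = rng or np.random.default_rng()
    inputs = active @ W + sigma * rng.standard_normal(W.shape[1])
    out = np.zeros(W.shape[1])
    out[np.argsort(inputs)[-k:]] = 1.0
    return out
```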
So here's a problem we were stuck on for about a year, including when I gave a talk here last year. Of course, this is a natural question. The model seems deterministic; how do you learn statistics? And I didn't know.
But as of this April, we made some progress. And so let me reduce the question to just one thing. How do I flip a fair coin in the brain? You've seen heads half the time, tails half the time, in some order. And now you want to be able to pick one of those at random.
So we can't use an algorithm. Of course, there are ways to do this. I mean, we know algorithms to do this. I mean, there's a beautiful paper by Manuel Blum that shows you how you can do this on the telephone, without cheating. You can flip a fair coin. But we can't do all of this. So what's our setup?
So we have a brain area A with n neurons, and this has two assemblies. The heads assembly and the tails assembly, already there. Yes?
AUDIENCE: Can you clarify what you mean by the coin flipping?
SANTOSH VEMPALA: You need to be able to pick one of two assemblies with probability half. That's it. So I tell you "pick," which is some assembly firing, and then in the next step, either the heads assembly is firing or the tails assembly is firing, each with probability half.
And the way we'll set this up is that this input S, which is the one that's saying "pick," for whatever reason, has weights to the whole area. But the weights to these two assemblies are, let's say, stronger by a factor of 2: the weights to H are stronger, and the weights to T are stronger. The weights to the rest are 1, the weights to these are 2, and they're otherwise random. So we could have gotten here by just the learning that we did so far.
Now, here's the theorem. We need a little bit higher density than before: about 1 over root k, where k is the size of the subset that we're picking. And now let's say S fires; that's the thing that says "pick." And C_t is the top k neurons at round t. So S will fire some cap here, some top k, and then there's a next top k, and the next, and so on, in this area.
Now, here's the claim. The first time S fires, it picks up something, which may not be H or T. The next time, it moves somewhere else. But the point is that very quickly it will converge: you get H or T.
And in fact, the number of rounds is a fixed constant. And the statement is that the probability that what you converge to is exactly H (and you can replace H with T if you like) deviates from half by little o(1), smaller than any constant. For this, we need the base graph to be a bit denser than what we needed for just the creation of assemblies. The randomness requires a slightly denser graph.
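Here is a toy cartoon of that setup, not their construction: two pre-built assemblies with potentiated internal synapses, a "pick" input with doubled weights into both, and noisy top-k dynamics, repeated over many trials. For simplicity this toy redraws the graph each trial, so the 0.5 below comes from symmetry over both graph and noise; the theorem is stronger, giving probability near half for a typical fixed graph, with the noise alone deciding:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, trials, rounds, sigma = 1000, 50, 200, 8, 0.1
p = 1 / np.sqrt(k)                              # the denser graph the theorem asks for
H, T = np.arange(0, k), np.arange(k, 2 * k)     # heads and tails assemblies
s = np.ones(n); s[:2 * k] = 2.0                 # 'pick' input: weight 2 into H and T, 1 elsewhere

heads = 0
for _ in range(trials):
    W = (rng.random((n, n)) < p).astype(float)  # redrawn per trial (see note above)
    W[np.ix_(H, H)] *= 4; W[np.ix_(T, T)] *= 4  # assemblies: strengthened internal synapses
    cap = np.zeros(n)
    for t in range(rounds):
        inputs = s + cap @ W + sigma * rng.standard_normal(n)
        winners = np.argsort(inputs)[-k:]
        cap = np.zeros(n); cap[winners] = 1.0
    heads += cap[H].sum() > cap[T].sum()        # which assembly captured the cap?
print(f"heads fraction: {heads / trials:.2f}")  # hovers near 0.5 by symmetry
```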
So that's something. So this allows us to record statistics. And in fact, you can get any probability you want. If I present A t_a times and B t_b times, then next time I see the input, I need to be able to pick according to the proportions I've seen, and we can do this. But we need a slightly different plasticity rule, sorry.
So this would otherwise be just Hebbian plasticity: it's additive, and the weight change is just alpha. Here, we're also capping it off; these are all fixed parameters. But as the weight increases, you get a lower and lower increase. So your curve looks like it's increasing, and then the rate of increase slows down, which is what's observed. But nevertheless, this is what we came up with by proof, not by experiment.
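The slide with the rule isn't reproduced in the transcript, but a saturating additive rule of the kind he describes would look something like this; this is my guess at one consistent form, not necessarily the exact rule shown:

```latex
w_{t+1} = w_t + \alpha\left(1 - \frac{w_t}{w_{\max}}\right)
\qquad\Longrightarrow\qquad
w_t = w_{\max} - (w_{\max} - w_0)\left(1 - \frac{\alpha}{w_{\max}}\right)^{t}
```

Since w_t increases monotonically in the number of presentations t (for 0 < alpha < w_max), the map from presentation counts to weights is invertible, which is the one-to-one property mentioned next.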
And this is a one-to-one map, so you can go back and forth. I mean, it's an invertible map. So that's recording statistics: we have recording by Hebbian plasticity. Now let's do something more interesting.
So now that we can flip a coin based on statistics, how about learning a probabilistic finite automaton, a Markov chain, which is pretty much the simplest probabilistic model? Yes. So all we're going to do is play out the state transitions. But now they're not deterministic: sometimes you see A, sometimes you see B. And this works beautifully in experiment. We'll show it to you in a minute.
For the theorem, at least what we can prove, we have to assume that the out-degree is two. Previously, from each state, when I see a particular symbol, I go to only one state; that's a deterministic transition. Now you have two choices, with whatever probabilities they might be.
But if it's more than two, we don't know how to prove it. For two, you can do it. Of course, you can take any probabilistic finite automaton and replace it with one that's not much larger where the out-degree is at most two, because you can binarize the out-edges. But that's technically not the most pleasing thing to do, and this is what we have.
Yeah, so you learn this, and this works in practice. I'm just showing you a quick picture. That's a 3-by-3 Markov chain with the transition probabilities on the left; those are the actual probabilities, taught by just playing out the transitions. And this is what was learned. How do I know what it's learned? By letting it play out several times: we just play the Markov chain several times, whatever is learned in there, without any new stimulus. And this works with 15 states. And the theorem about how many neurons you need: it's of the order of the number of states in the Markov chain.
All right, so those are essentially the technical results I was going to present; just a couple of comments. Once you're able to simulate a coin and learn its probabilities, you can enable a brain-like mechanism, meaning exactly in this model, to learn probability distributions over sequences.
Well, once you can learn over sequences, why not do what's most in fashion now? How about a generative model? OK, so I want to present a sequence of tokens, sure. Each token will have an assembly; that's no problem, like the words in some lexicon. And then there's a sequence, there's Hebbian learning, and there's noise too.
So now what's going to happen? Is this thing going to work? I don't know. A few of you, not the students perhaps, have had kids. And you may remember what it was like when they turned two, roughly. You'll see that here.
So this is a real output. It was trained on just one poem. This poem, called "The Owl and the Pussycat." And this is being generated based on what it was trained on. You can see this is not the original poem for various reasons, but who's going to deny that this is exciting? I mean, this is like babbling. They start babbling. Oh, my God, they're babbling. This is where we are. I can babble with this model.
OK, so great, statistical correlations. Can you go beyond just correlations, in some sense? What about structure? We already saw that assemblies can be hierarchical: you can get assemblies of assemblies of assemblies, no problem. But look at this. I love this experiment. This is from Poeppel's lab. So this coincides with the rhythm of speech. It's just a universal thing. You read it, no problem.
Now, this is the rhythm of spiking neurons, the most common cycle, about 12 times faster. Now, here is an experiment. The experiment that they did. OK, I need one volunteer, please. All you have to do is read a word, that's all. OK, just read whatever word appears on the screen. Go ahead.
AUDIENCE: Fret, ship, hull, give, true, melts, fans, blue, guess, hits, then, cats.
SANTOSH VEMPALA: OK, great. So while he was doing this, I did an fMRI recording of the relevant parts of his brain. And I saw the following frequencies. Anybody want to guess what I saw? Just one frequency, at four hertz, which is roughly the rate at which you read words. OK, good.
Now, one more time. We should do the same brain for consistency, ready? OK, go ahead.
AUDIENCE: Bad, cats, eat, fish, new, plan, gave, joy, little, boy, kicks, ball.
SANTOSH VEMPALA: And then I measure his fMRI again. Whatever, sorry for the grammar. And I recorded the frequencies. What frequencies do you expect? How many, first of all? Take a guess, no cost. 36? Fewer. You said one? Higher. So, how many frequencies?
AUDIENCE: Two.
AUDIENCE: Three.
SANTOSH VEMPALA: How many people think two? OK, how many people think three? Yeah, you guys got it. And these are the frequencies. The four hertz, no surprise. That's each word. What's the one hertz? Each little sentence. And the two is each little phrase.
And so these experiments were already done, and yeah, so you're forming, on the fly, these parse trees that are firing these little assemblies telling you, hey, you've got what's going on. And this has been confirmed in different ways.
We know from linguistics research in neuroscience that the completion of phrases, and especially of sentences, activates parts of Broca's area. And so, can this tree-building step be done in a dozen spikes? And yes, at least in this model, a dozen spikes is what it takes to build an assembly on the fly.
So there are many research directions here, but that's the conclusion of what I was going to tell you about the model. One question is, is there a phase transition (I don't even know how to address this) in cognitive ability based on the complexity of the connectome?
Just as you're increasing the size and the synapses, do you reach some point at which you suddenly enable higher-level cognition? Is there a difference between mammal and below mammal, or rat and human, in this sense? Or is there some additional structure that comes into the connectome that's the factor? I don't want to bias you either way.
The point is that such a question can now be asked when you have a precise model. So on the assembly model itself: we've talked about G(n,p), the basic random graph model. We know that the brain is not a pure random graph. There are all kinds of structures.
For example, the probability of triangular structures is much higher than what you'd expect at random. These motifs are well known. And indeed, one can try to understand, how does that help? What we found (I may not have time) is that, for example, in what are called geometric random graphs, where the probability of an edge is higher if you are closer physically or in some attribute space, some of these operations are more efficient.
Then there's the whole issue that the brain has this phase, six months certainly, but all the way up to two years, when it's mostly sparsifying: the connections are getting pruned. What is the role of that? What is it actually preserving? You can ask this question in the model.
We saw very rudimentary statistical learning in terms of next-token prediction, and something more solid in terms of Markov chain learning; that's a theorem. Can we actually scale this model up to learn at a human level? I don't want to learn the entire internet. But if I read five books on machine learning (I'm not planning to do that), then tomorrow, I should be able to speak as if I know what I'm talking about, something like this.
How about word embeddings and analogies? Things that humans are very good at. Can we get these things to happen here? Now, here's a question that I'm currently just struggling with. So far, it's the classical supervised learning paradigm. You present examples.
There isn't a phase where you get to rehearse, where the child is trying to say something and the mother corrects them. There isn't a global feedback that says, yay, it did well, or whatever. But we know there is dopamine, which does global feedback. How can we incorporate dopamine into this model in some meaningful and powerful way? Maybe that's important.
Reasoning: this is a question I hesitate to raise because I think one has to define reasoning, and maybe there are attempts out there that I'm unaware of. As opposed to learning, reasoning seems to be an important part of cognition. But do we need something additional, or is this enough? Maybe some structure. And of course, somewhere distant in the future, some people (I'm not claiming I'm one of them) might want to explain what consciousness is.
This is joint work over the last decade-plus with Christos Papadimitriou of Columbia, and with Dan Mitropolsky, Mike Collins, Wolfgang Maass, and Max and Mirabel, who are students at Georgia Tech.
Thanks for listening. And there is a GitHub here with all of the simulations, which I'm going to play in a second, so you're not done yet. But just one thought before we get caught up in the experiments. We have a tiny working memory, literally, very tiny. We're talking about a few symbols, a few pointers.
Slow serial processing: it's serial and it's slow. Very little energy usage; Arun already mentioned that. And yet we get this robustness: the vast majority of all neurotypical humans do these cognitive tasks reliably. So we're missing some ideas. And I'm happy to listen to any suggestions of what those might be.
I'll stop here with the talk itself. You're welcome to ask questions anytime while I try to get this running, yeah.
[APPLAUSE]