From Associative Memories to Deep Networks and from Associative Memories to Universal Machines
March 9, 2021
March 10, 2021
Christos Papadimitriou, Santosh Vempala
Panelists: Profs. Christos Papadimitriou (Columbia), Tomaso A. Poggio (CBMM, MIT) and Santosh Vempala (Georgia Tech)
Moderator: Kenneth Blum
Abstract: About fifty years ago, holography was proposed as a model of associative memory. Associative memories with similar properties were soon after implemented as simple networks of threshold neurons by Willshaw and Longuet-Higgins. It turns out that recurrent Willshaw networks were very similar to today's deep nets. Thinking about deep learning in terms of associative memories provides a more realistic and sober perspective on the promises of deep learning and on its role in eventually understanding human intelligence.
KENNETH BLUM: Welcome, everybody. Thanks for joining us for this CBMM panel discussion. From time to time, we, in CBMM, gather to describe in some, perhaps, new way the failings of current deep networks. Of course, nobody would be much interested in their failings and limitations if it weren't for their spectacular successes. So we'll just take that for granted. This is another such occasion. We have three panelists and one is known very well to you and, I think, I will skip a formal introduction. That's Tommy Poggio, the Director of CBMM.
The second is Santosh Vempala. Actually, you're appearing in the other order, I think, as you said. So, the second person who will speak is Christos Papadimitriou. He's currently at Columbia University. He spent many years at UC Berkeley. He is well-known for his work on computational complexity and algorithmic game theory, among other topics in theoretical computer science. He's a winner of prizes with fancy names like Babbage and Von Neumann, and Gödel and Knuth.
He is also one of, I think, two authors of a graphic book called Logicomix that was published a decade or so ago, which reads extremely easily, which must mean that it was incredibly difficult to write. It is a description of the debates in mathematical logic, philosophy, and musings about rationality and madness. And it's told, in part, through the stories of Bertrand Russell and cameos by Turing and Gödel and Frege and others.
So our final panelist to speak is Santosh Vempala. He has been for 15 years or so, a professor at Georgia Tech. He is also a theoretical computer scientist, prominent in the field, fellow of the ACM. His work is on algorithms and algorithms in particular for convex sets and probability distributions. And both he and Christos, together and separately, have been interested in recent years in brains and human computation. And so with that, I think Tommy speaks first, then Christos, then Santosh, and then some discussion amongst the three, and then we'll open it up for questions from the audience.
TOMASO POGGIO: Thank you, Kenneth. Everybody, welcome. I'll start with some brief thoughts about associative memories and deep networks. And it's basically a historical observation. Back in the '60s, holography was quite fashionable, a little bit like quantum computing today. Dennis Gabor received the Nobel Prize for his invention of holography. He was a Professor at Imperial College London at the time.
And a very brief description of a holographic memory is that through coherent optics, lasers, and lenses, and so on, you store on a photographic plate the convolution between x and h, where x is what you call a reference signal, noise-like, a little bit like white noise with a delta-like correlation function, and h is an image you want to store. So this is stored on a photographic plate. It's basically the convolution of the two. And therefore, the Fourier transform of that is the product of the transforms of x and h.
And then, when you want to retrieve h, what you have to do is to present x through the photographic plate again. And in that case, what happens is the correlation of x with what was stored on the photographic plate. And so, because of the delta-like autocorrelation of x, you get out h. So this is an associative memory distributed across the whole photographic plate. If you cut off half of it, the memory is still there, just with a bit more noise.
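The store-and-retrieve argument can be written out in a few lines. This is only a sketch: it assumes the reference signal x is the one presented again at retrieval, and the symbols p (the plate), \star (correlation) and \mathcal{F} (Fourier transform) are notation introduced here, not taken from the slides.

```latex
\begin{align*}
\text{storage:}   \quad & p = x * h,
  \qquad \mathcal{F}[p] = \mathcal{F}[x]\,\mathcal{F}[h] \\
\text{retrieval:} \quad & x \star p = (x \star x) * h \approx \delta * h = h
\end{align*}
```

The approximation in the last step is exactly the assumption that x is noise-like, so that its autocorrelation x \star x is close to a delta function.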
And shortly after holography was recognized and used for some initial associative memories of this type, where you associate one item to another and you can retrieve the associated item with the first one, there was a paper by Christopher Longuet-Higgins and a student of his, David Willshaw, in Nature in 1969, about a network implementation of this holographic-like memory. And by the way, Christopher Longuet-Higgins, I think, was also the PhD advisor of Geoffrey Hinton, so a lot of things connect here.
And in this network that they proposed, there are neurons which are capable of taking the values 0 and 1, so it's binary, and this is a matrix. This is the memory. What you store here is this product, the outer product of the vector y with a vector x, into this matrix. And then, if you want to retrieve y, you can present x through one of the inputs where the B are and get out y. And you can understand how this works in a simplified version of the original Willshaw model, in which they had thresholds in order to clean up the output.
What you have to think about is the linear model, in which I want to find a matrix A that can associate each column of a matrix x to a corresponding column of the matrix y. And if I assume, for simplicity, that x and y are square matrices, then I find that A is equal to y x transpose when the columns of x are orthonormal. So then, with that choice of A, I can retrieve for any column xj of x the corresponding yj, assuming, as I said, that the columns of x are orthonormal.
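The linear model can be checked on a tiny example. The following is a minimal sketch, not the original Willshaw network (there are no thresholds here): the 4x4 matrices X and Y and the helper names are made up for illustration, and the keys are chosen as a normalized Hadamard matrix so that the columns of X are orthonormal.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def column(m, j):
    """Extract column j of a matrix as a flat list."""
    return [row[j] for row in m]

# Keys: the columns of X are orthonormal (a normalized 4x4 Hadamard matrix).
s = 0.5
X = [[s,  s,  s,  s],
     [s, -s,  s, -s],
     [s,  s, -s, -s],
     [s, -s, -s,  s]]

# Values: column j of Y is the item associated with column j of X.
Y = [[1, 0, 2, 0],
     [0, 3, 0, 1],
     [4, 0, 0, 2],
     [0, 1, 5, 0]]

# Store: A = Y X^T, which is the sum of the outer products y_j x_j^T.
A = matmul(Y, transpose(X))

def recall(memory, key):
    """Retrieve the value associated with a key vector as memory @ key."""
    return [sum(m_ij * k_j for m_ij, k_j in zip(row, key)) for row in memory]

# Presenting the third key returns the third stored value.
retrieved = recall(A, column(X, 2))
```

Because X^T X is the identity for orthonormal columns, A x_j = Y X^T x_j = y_j, which is the retrieval property described above.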
So if the reference beam or signal is rich enough-- and by the way, this is very similar to what is used in CDMA encoding for cell phones, an analogy I never saw but it's pretty clear. What David Willshaw did is not only to come up with this one-layer network, but also to empirically experiment with putting the output of the network back in as input of the network for several iterations, and he found that this often gave a cleaner retrieval, with less noise.
And the recurrent Willshaw nets of this type are essentially equivalent to the recurrent networks of today, to recurrent deep networks. So from this point of view, you start thinking of deep networks as memories. And to show it's not so far-fetched, think about older learning machines, which are kernel machines, and think about a specific version of them, which is when the kernel is a Gaussian basis function, so this k is a Gaussian.
Then depending on the sigma of the Gaussian, you have a network that, when sigma is very small, so the Gaussian is very narrow, is a lookup table. The network can essentially retrieve only the input that was stored. It can retrieve only the y of the training data (xi, yi) pairs. And if sigma becomes larger, like you see on the right of the figure, you start to have some simple interpolation between the training points, which we call generalization in standard machine learning. But it's really not much more than nearest neighbor in terms of classification.
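The effect of sigma can be seen in a small sketch. As a simplification, this uses a normalized Gaussian-weighted average of the stored outputs (a Nadaraya-Watson style smoother) rather than a kernel machine with fitted coefficients; the three training pairs and the function names are invented for illustration.

```python
import math

def gaussian_kernel(x, xi, sigma):
    """Gaussian basis function centered at the stored input xi."""
    return math.exp(-((x - xi) ** 2) / (2.0 * sigma ** 2))

def kernel_recall(x, train, sigma):
    """Kernel-weighted average of stored outputs; None if nothing responds."""
    weights = [gaussian_kernel(x, xi, sigma) for xi, _ in train]
    total = sum(weights)
    if total == 0.0:
        # Every kernel underflowed to zero: the memory has nothing near x.
        return None
    return sum(w * yi for w, (_, yi) in zip(weights, train)) / total

# Hypothetical stored (x_i, y_i) pairs.
train = [(0.0, 1.0), (1.0, 3.0), (2.0, 2.0)]

# Very narrow Gaussian: behaves like a lookup table.
narrow_at_stored = kernel_recall(1.0, train, sigma=0.01)  # a stored input
narrow_off_grid = kernel_recall(1.5, train, sigma=0.01)   # an unseen input

# Wider Gaussian: interpolates between neighboring training points.
wide_off_grid = kernel_recall(1.5, train, sigma=0.5)
```

With the narrow kernel the stored input returns its stored output and the unseen input returns nothing, while the wider kernel gives a value between the outputs of the two nearest training points.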
OK, so, and to reinforce this point, consider that in many cases, deep networks with many layers, if they are either very wide or you train them with large initialization norms, turn out to be equivalent to a kernel machine, with a kernel that is the so-called neural tangent kernel, which is very similar to a Laplacian basis function, which is itself a radial basis function similar to a Gaussian. So from this point of view, deep networks, as lookup tables, simple memories, are clearly not enough for intelligence.
And in fact, look at what the New York Times wrote 50-plus years ago about the neural networks of the time, Rosenblatt's perceptron. Clearly, the deep networks of today cannot walk, talk, see, or reproduce themselves. They're just lookup tables. OK, so this is really a kind of sobering observation, which I find interesting. It may be slightly exaggerated, of course, but it gives a perspective on the limitations of deep networks. There is much more that one can say about this, but, I think, this is a pretty good analogy.
They are large associative memories. And we could discuss this, because that's a topic in itself. Let me go to another one and then we can discuss both of them together. The next one follows quite naturally from this first one. If deep networks are just memories, how can we go to something more powerful, something that could really underlie human-like intelligence? And Christos and Santosh have a version of it, which is very interesting and, as they will explain, quite powerful.
Another one, and there are many, probably, is to think of how I could modify an associative memory, a deep network, to make it more powerful from a computational point of view. What could evolution have discovered? I like the idea that evolution could have started with associative memories because this seems a small step after discovering, say, reflexes in very primitive animals like Aplysia that are many hundreds of millions of years old. There are reflexes like defensive reflexes, the gill withdrawal, and so on.
And it seems a relatively small step to go from that to an associative memory. But then, as I mentioned, an associative memory is just a memory. It's not enough for really intelligent behavior. So as I mentioned, one thing that evolution could discover is what Christos and Santosh speak about. But as another one, it could have discovered how to make a finite state machine out of a memory. Essentially, you can think of a finite state machine as a Turing machine running for a finite number of steps.
And what you need is this: you have some inputs and there is a state, and you have a lookup table with your program, which maps inputs and the state into outputs and the next state, and you can iterate this. And if you have a finite set of instructions, and you make this work for a finite number of steps in time, you have a finite state machine. So the discovery here would be recurrence plus hidden states. You can discuss whether this is reasonable from a computer science point of view and from the neuroscience and evolution point of view.
Anyway, that's one proposal. And so, in summary, this is the kind of thesis I propose for discussion: how evolution could ever discover how to build more powerful computing systems from lookup tables. And then of course, if we solve this question, the next series of questions is how evolution discovered programs, how to store them, which means storing weights, which is neurophysiologically not very plausible or simple to do, and how to evolve these programs.
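The "lookup table plus recurrence plus hidden state" idea can be sketched in a few lines. The parity program below is a hypothetical example chosen for brevity; the table plays the role of the memory, and feeding the next state back in is the recurrence.

```python
def run_machine(table, state, inputs):
    """Iterate a lookup table mapping (input, state) to (output, next state)."""
    outputs = []
    for symbol in inputs:
        output, state = table[(symbol, state)]
        outputs.append(output)
    return outputs, state

# Hypothetical program: report the parity of the 1s seen so far.
PARITY_TABLE = {
    (0, "even"): ("even", "even"),
    (1, "even"): ("odd", "odd"),
    (0, "odd"): ("odd", "odd"),
    (1, "odd"): ("even", "even"),
}

outputs, final_state = run_machine(PARITY_TABLE, "even", [1, 0, 1, 1])
```

The table on its own is a pure memory; it only becomes a finite state machine because the state it retrieves is presented back to it at the next step.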
So, let me give the word back to Kenneth, probably to Christos.
KENNETH BLUM: Thank you Tommy. Yes, that was great. Thank you very much. Great way to start off. I think that Christos, next, you're up.
CHRISTOS PAPADIMITRIOU: Thanks, Kenny, and thanks, everybody, for being here. Santosh and I will speak about our travails with this question. And basically, what I'm trying to point out is that there are two communities who are interested in this question, how the mind emerges from the brain, cognitive scientists and experimental neuroscientists. But there is a huge gap, both in scale, but also in experimental methodology, in point of view, and in, actually, computational modeling.
And this gap has been, to my mind, the main roadblock for making progress on this question. And two years ago, my hero in this business, Richard Axel, my colleague here at Columbia-- I mean, when I saw this, I felt that I was blessed by the Pope, because that's exactly what I was working on. I mean, he said, we do not have a logic. I mean, you know that Richard Axel is a profoundly experimental scientist. We do not have a logic, notice the choice of word, for the transformation of neural activity into thought, and he views discerning this logic as the most important future direction of neuroscience.
By logic, I believe I understood correctly that he meant some kind of formal system. And Santosh and I had been working on this for some time by then. And this is what I'm going to tell you about. We have a proposal. Many of you remember, the old timers among you remember, that in the 70s and 80s, there was this myth that we have a cell, and when the cell fires, we remember our grandmother. But now we know that this cannot be possible because there is no way that one cell, one neuron, can have an effect on our cognition.
And so if there is one grandmother cell, there is no grandmother. Now we have wised up and we call these the Jennifer Aniston cells because, first of all, we noticed them in patients, in human subjects. And two, by back-of-the-envelope calculation, we realized that it's not one cell. It's probably many tens of thousands of cells. Hence, what Hebb had visualized 70-something years ago came into existence. An assembly is a large, highly connected, stable set of neurons representing a word, idea, object, person, et cetera, and this is what I'm going to tell you about.
Assemblies, as many of you know, are not science fiction. There is a growing consensus in neuroscience and cognitive science that they are for real and that they represent such things. So what I'm going to show you is how we outfit assemblies with an algebra of operations. But before that, I want to show you something else which is cool: our underlying model of the brain, which we see as, basically, interacting recurrent nets. It's a finite number of brain regions, each containing n neurons.
Only k of these fire at any moment. Some pairs of areas are connected by directed bipartite random graphs, and all areas are recurrently connected by directed random graphs. This means, of course, that any two neurons have the same small probability of being connected in any direction. And so the point is that these are interacting, randomly connected recurrent nets. So this is the original model. Neurons fire in discrete steps, and, an important point, the k neurons with the highest input in each area are selected to fire.
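One step of this model, a random graph followed by selecting the k neurons with the highest input, can be sketched as follows. The sizes n, k, p below are toy values chosen so the example runs instantly, the function names are invented here, and breaking ties by lowest index is an assumption; this is not the authors' simulator.

```python
import random

def random_graph(n_from, n_to, p, rng):
    """weights[i][j] is 1.0 if there is a synapse from neuron i to neuron j."""
    return [[1.0 if rng.random() < p else 0.0 for _ in range(n_to)]
            for _ in range(n_from)]

def k_cap(inputs, k):
    """Indices of the k neurons with the highest total input.

    sorted() is stable, so ties go to the lower index (an assumption).
    """
    ranked = sorted(range(len(inputs)), key=lambda j: inputs[j], reverse=True)
    return set(ranked[:k])

def project_step(weights, firing, k):
    """One discrete step: sum synaptic input from firing neurons, then cap."""
    n_to = len(weights[0])
    inputs = [sum(weights[i][j] for i in firing) for j in range(n_to)]
    return k_cap(inputs, k)

rng = random.Random(0)
n, k, p = 200, 10, 0.1           # toy sizes, far smaller than the real model
weights = random_graph(n, n, p, rng)
stimulus = set(range(k))         # an arbitrary set of k firing neurons
winners = project_step(weights, stimulus, k)
```

The cap replaces a per-neuron threshold: however strong or weak the total drive is, exactly k neurons fire at each step.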
So instead of ReLU, we have this random projection and cap thing, as we call it. And this, of course, models the inhibition-excitation equilibrium. Incidentally, these assemblies are an important behavior of recurrent nets. Coming back to Tommy's discussion of evolution: in the insect brain, in the fly's brain, assemblies are how smells, for example, are stored, and we have known this for decades. But the point is that these are unwieldy assemblies that do not have a life of their own.
Because the mushroom body of the fly lacks recurrence. OK, so the recurrence of cortex is an incredibly powerful evolutionary step in the story of assemblies throughout the millions of years. Connections between areas can be inhibited and disinhibited. And there is also plasticity, which is Hebbian: if i is connected to j, i fires, and in the next step j fires, then the weight of ij is multiplied by 1 plus beta. And then there are some other details that I'm not going to bother you with, but this is our basic model.
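The multiplicative plasticity rule is short enough to write down directly. This is a sketch with an invented 3-neuron weight matrix; the "other details" mentioned in the talk are deliberately left out.

```python
def hebbian_update(weights, fired_before, fired_now, beta):
    """If i fired and j fired on the next step, scale weight i->j by (1 + beta).

    A weight of 0.0 means there is no synapse, so it is left untouched.
    """
    for i in fired_before:
        for j in fired_now:
            if weights[i][j] > 0.0:
                weights[i][j] *= 1.0 + beta

# Hypothetical 3-neuron graph: weights[i][j] is the synapse from i to j.
weights = [[0.0, 1.0, 1.0],
           [0.0, 0.0, 1.0],
           [1.0, 0.0, 0.0]]

# Neuron 0 fired, then neuron 1 fired on the next step.
hebbian_update(weights, fired_before={0}, fired_now={1}, beta=0.1)
```

Only the synapse from neuron 0 to neuron 1 is strengthened; every other entry, including missing synapses, is unchanged.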
Not only that, but it's alive. You can actually look. You can actually play with it on the web. In our PNAS paper, which I'm going to show you later, we describe our simulator, which you can play with. So the first command that you've got to type is load brain, which is a funny thing to type, especially if it's the first thing you do in the morning. This is supposed to be a sort of tongue-in-cheek model of what happens to the real brain.
And what we have is that n is about several million or tens of millions of excitatory neurons. k, the size of assemblies, is about 10,000 neurons. The probability of connection is, let's say, 1 in 1,000, and beta, let's say, is 10%. The main ideas of this model are sort of the three basic forces of life: randomness, selection, and plasticity. And the basic operation is random projection and cap. So, assemblies are the attractors of this model in some sense. So an assembly is a stable set of k densely interconnected neurons in an area representing an idea, word, episode, et cetera.
György Buzsáki has recently simply called them the alphabet of the brain. In other words, the new thing here is not assemblies. The new thing here is the operations on assemblies. An assembly can be projected, copied, to another area. And this means that in the future, when the first assembly fires, the second will also fire. Two assemblies can be associated, increasing their overlap, to reflect any kind of affinity like co-occurrence or correlation, and so on.
The merge of two assemblies in a third area is an important operation that can build hierarchies and is, of course, very useful for language, for example, and for more complex thinking like deduction, planning, and so on. The point I want to make is that these assembly operations are real in the following two senses, actually. They're real in the sense that they reflect behaviors of assemblies that have been observed in experiments, or can explain other experiments, in the case of merge.
And they are real in a very important, different, second sense: that provably, in theorems, these assembly operations converge with high probability, where the underlying probability space is the randomness of the graph, within a dozen or so steps. So in other words, this is essentially a computational programming language which you can compile down to neurons, not down to the spiking neurons. And we have also simulated them in much more realistic models of spiking neurons.
And then, again, simulations show that these operations work. So these are the assembly operations and, of course, it's fair to ask how powerful this system is. It is sort of Turing complete. It can perform arbitrary square root of n space computation. I mean, as you know, square root of n space where n is, let's say, even 10,000 is a lot of-- it's basically 100 parallel steps. This is a very powerful computation, as powerful as language comprehension, deduction, and so on. OK, good.
But I know that Turing machines are not what you want to hear. So I'm going to tell you something that we have been extremely excited about very recently. We have written, and have in submission, a parser of English written in this language. So we have something that actually parses English sentences and does it purely by the spiking of neurons, obeying all the constraints that we know from reading the literature on how language happens in the brain.
And the basic architecture is that one of the areas is the lexicon, lex, the one on your right, where several tens of thousands of assemblies are there representing words. And these, we know, exist in the medial temporal lobe. That's the purple area on the image there. And these are special assemblies in the sense that they trigger actions, which are the inhibition and disinhibition of various areas. And basically, there is no other control than this incoming stream of words exciting and firing these assemblies.
So this is the whole control mechanism of the program. We have this running, and we have a submitted paper. I want to emphasize, because this is sort of unusual: it works exclusively through the spiking of stylized neurons, through the operations of the assembly calculus. It parses simple sentences like, the young couple in the next house saw the little old white car of the main suspect quite clearly. It does so in about 20 to 25 spikes per word, which is about 0.3 to 0.5 seconds per word.
In other words, that's very much commensurate with what I'm doing right now, what you are doing, actually, right now. This sentence sounds complicated, but it's really very simple. So all in all, there are many things that this parser cannot do. But there is nothing that the language organ does that we don't know how to implement in this parser. So for example, we have ideas about implementing recursion in the [INAUDIBLE]. We know about handling all the other parts of speech that we have not taken care of.
And we sort of know how to handle ambiguity and polysemy. We have a plan for handling ambiguity and polysemy. Very interestingly, we have also implemented a parser for a subset of Russian. And Russian is a very different language from English. It does not have a fixed word order, sort of all permutations are good, and it has cases. And, of course, we are using the same architecture, except that the words are different and the actions associated with every word are different.
Because, you see, this is sort of a lexicalized parser, in the sense that the words themselves contain information about the part of speech and the syntactic role of the word, and therefore about the actions that must be taken in this architecture, by inhibiting and disinhibiting fibers, in order for correct parsing to proceed. So, the fact that the same architecture, the exact same architecture, can host an English and a Russian parser, and I'm pretty sure Arabic, Japanese, and so on-- I mean, I don't want to exaggerate the importance of this experiment.
But the fact is that it brings you, essentially for free, very close to extremely fundamental questions. Because we know that Russian babies and English babies have the same hardware, and this hardware ends up behaving very differently. And the question is, how is this done? So what I'm saying is that this is a question: how are the parameters of language learned by the age of two? This question has been haunting us for decades. I mean, what I'm saying is that this is an interesting way to get into sort of very intimate contact with it.
I mean, in sum, the neural basis of language is sort of a cliche phrase that we have been using for decades, sort of mutually understood between speaker and audience, oh, OK. So what I'm saying is that now, we have a concrete proposal, a concrete hypothesis, for what could be, and probably is not, but what could reasonably be the neural basis of language. So thank you very much, and Santosh will take over here.
SANTOSH VEMPALA: OK, so we've seen some exciting presentations of hardware, but also computation, and lots of computation is enabled in these models. But perhaps a very exciting part of what the brain does, and maybe the most exciting part, is learning: the fact that we're born with almost nothing, and robustly, neurotypical babies and children and adults will do all these crazy things, and how is all this happening? I'm going to focus on this learning part.
But before I start, I can't help but say a little bit about a couple of things that Tommy said up front, the ones that I disagree with, much as it was nice to hear them. Willshaw nets being equal to deep nets, I don't think so. I mean, maybe the analogy is that Willshaw nets are to deep nets like fruit flies are to humans. You have something like one layer going on, but it seems to be much more than lookup tables. And even just pattern completion is already a bit more, but certainly there seems to be generalization and association with these things.
So I guess my view is much more that learning, even in the brain, is much more than memorization, and we don't completely understand it, but it's far from it. I'm sure we'll get to discuss this from other aspects, but, for example, one of the simplest arguments that really stuck with me is this one by the Russian motor neuroscientist Bernstein, who proves to you in one thought experiment that there must be a hierarchy in our brains. As he says, imagine teaching a child how to draw a circle with their hand, say.
And so, you just show them once, maybe on paper, maybe just in the air. Great, and now they can draw a circle again, no problem. But they can also draw it with their left hand, and then they can draw it with their foot. They can draw it under water. They can draw it lying on their backs. The point is that all of these different actions require completely different sequences of muscle activations. So therefore, he concludes, there must be a higher level representation of what it means to draw a circle.
You must have hierarchy. So anyway, exciting as the story is, it seems we're far from explaining or going beyond deep nets. So, getting back to the assemblies and learning: how do brains learn? I mean, the obvious proposal is, why not gradient descent? It's so successful in practice. But there's one point of contention. It's whether it's actually plausible in the brain, whether it's biologically plausible, and there's little evidence that it is. And indeed, there's a lot of search for plausible alternatives to it.
But what we do seem to observe all the time in the brain is that learning is through plasticity. That's the main change. And the simplest of these rules is, if j fires soon after i, in close proximity, then the weight of the edge ij goes up. In relation to this, you also have some normalization, but this is the main rule. There are lots of variants about how much it goes up. This is the simplest possible rule. And so, the actual question is, is synaptic plasticity actually an effective learning mechanism?
Or, can it perhaps be as effective as deep learning, I mean, as gradient descent? And then another, maybe even more forward-looking question is, does it have advantages? I mean, at this point, we all agree that the brain's ability seems to be much more exciting than what neural networks can do, in spite of all the advances. And so, does plasticity have advantages? What are they, quantifiably? But to get there, how can the brain learn through assemblies? This is an important problem for the calculus that Christos described. How exactly to formulate it? This is not clear, to be honest.
There's been progress with some reasonable formulations. For example, with Boolean networks. But let's focus on the recurrent neural network aspect, which is very similar to Willshaw nets. So we can think of this as a brain-inspired neural network. So there's just one layer, and it's recurrent. Input is coming in and then maybe there is a readout output, that's it. So there's input, there's one recurrent layer, and output. That is the basic recurrent neural network. But there's one deviation from the standard version, where, instead of each neuron having a threshold, we just say the top k fire.
That's the deviation in the assembly calculus. Only the top k fire, and this might happen for several rounds, say for t rounds. So the input connects to this graph, which connects to the output. And how do you train this? I mean, that's what learning is: to figure out what the weights are from the input, internally, and going to the output. One way would be to just do gradient descent, but already there's going to be a crucial difference. We're going to do this one sample at a time. Each sample should somehow update the network through plasticity.
Now, we could run this on my laptop, and indeed it works, at least in the simplest setting, say MNIST, where almost anything works, but this also works with a very small model. It doesn't work on a more complex thing. Maybe it's not a surprise, because I'm using no convolution and only one layer, but it doesn't work on that. You could ask, though, does it have some advantages even in the case where it works? Is it, for example, more robust than just using thresholds, because of the fact that we're using the k-cap? Does it avoid the vanishing gradient problem, perhaps, because we're using top k?
And the answer, empirically, is yes. And this was in a paper in ICLR last year by some colleagues of Christos, on k-Winners-Take-All, which is the same as the k-cap. And indeed, they show that it's actually more robust on a range of data sets, more robust to perturbations. So trying to prove that this is robust to noise would be very nice. Robustness is indeed observed across the board for brain learning. That's one aspect. Going to plasticity rules, are plasticity rules provably effective? Now, we've talked about one rule, but we could--
There is a landscape of plasticity rules. Hebbian is just one rule that we observed. I mean, the rule itself could be a small network. For example, it could be a lookup table, maybe a small lookup table that says, if this is the activation at i and this is the activation at j, and not just for one round. Neurons fire spikes, spiking patterns, so look at the activation pattern at i and the activation pattern at j, and that determines what happens to the weight of ij. Why not? It's still very local and very much plausible for the brain. Would such a rule let you do better learning?
Maybe the rule itself, instead of being a lookup table, is a little network with weights. Maybe those weights are the weights for the plasticity, not the weights for the networks you learn in life. So all these things are possible. Recurrent neural nets are well suited because you get to implement this very quickly. And now, how do we choose our design of the plasticity rule? Now, one possibility would be to do gradient descent. But maybe we should learn the plasticity rule also with plasticity. Why not? This is there. So, why is there any hope here, or something reasonable here?
One concrete point of evidence is that two of the most successful algorithms in machine learning, perceptron and multiplicative weights, are both simply plasticity rules. One is an additive plasticity rule and one is a multiplicative plasticity rule. Both can be viewed like this, and we know that they have guarantees, certainly for simple classifiers like halfspaces or kernels with halfspaces, but also when there are errors of a benign kind. And if this was only at the output layer, then it's a convex problem, and you could do gradient descent to figure out the best plasticity rule for a given data set.
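Written side by side, the two updates do look like local plasticity rules: each weight changes based only on its own input and the error signal. This is a sketch; the multiplicative version shown is a Winnow-style update with a fixed threshold theta and rate beta, which is one common form of multiplicative weights and an assumption about which variant is meant.

```python
def perceptron_update(w, x, y):
    """Additive rule: on a mistake, add y * x_i to each weight (y is +1 or -1)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    if y * score <= 0:
        return [wi + y * xi for wi, xi in zip(w, x)]
    return w

def multiplicative_update(w, x, y, theta, beta=0.5):
    """Multiplicative rule: on a mistake, scale each weight by (1 + beta)^(y * x_i).

    Inputs x_i are 0 or 1, so only the weights of active inputs change.
    """
    score = sum(wi * xi for wi, xi in zip(w, x))
    predicted = 1 if score >= theta else -1
    if predicted != y:
        return [wi * (1.0 + beta) ** (y * xi) for wi, xi in zip(w, x)]
    return w

# Additive: a mistake on a positive example moves w toward x.
w_add = perceptron_update([0.0, 0.0], [1.0, 2.0], y=1)

# Multiplicative: a mistake on a positive example boosts the active weights.
w_mul = multiplicative_update([1.0, 1.0, 1.0], [1, 0, 1], y=1, theta=3.0)
```

In both cases the sign of the change for an active input follows the label y, which is the kind of sign pattern a learned plasticity rule could be compared against.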
And we know there are some nice properties in this optimization. The best plasticity rules always have the same sign pattern as the known perceptron or multiplicative rules, although not necessarily the same weights. Not necessarily 1 and minus 1, but the same negative and positive patterns. For the recurrent neural network itself, we don't know. And it seems to empirically improve the performance to have a plasticity rule in the recurrent part. And why that should help, and how it helps, is a fascinating question here. That was supervised learning, where we have a target and labels.
But much of what the brain does seems to be unsupervised or semi-supervised. And so, let's look at a first attempt at this. One thing you might want to do is not have-- So far, in Tommy's presentation and also in Christos', we see that particular inputs are mapped to representations in the network, these things, the fixed points that they go to, like assemblies, great. But what if I wanted to have a representation for a whole distribution? Something that's tightly clustered, different views of a person, or what a table is? Something like that.
So I would like to have one assembly per distribution or per cluster of inputs. And we just see a cluster of inputs, maybe some samples from it, and we see another cluster of inputs, and we would like to have different assemblies. Assume these are different examples of pigs and examples of rhinos or whatever, and so, this is what you'd like to see. Now you see different inputs that are different in their input characteristics, their features, and they should map to different assemblies. But the ones that are similar should map to the same one.
This, at least, works out very nicely with just one-layer neural networks. If you draw examples from a sparse Bernoulli distribution, let's say, a hundred dimensional, and do exactly this random projection and cap process in your recurrent neural network, and we just present five examples from each class, followed by just Hebbian plasticity, it turns out that we get this: the figure on the left shows you the firing pattern of the assemblies, sorted according to the first one. So far, four different classes with four different Bernoulli means. They're different in their means.
But random samples are drawn from them, and the overlap within the classes-- for example, within the classes in the assemblies becomes really high while keeping the overlap between classes quite low. So you really get stable assemblies with mild little fluctuations. Indeed we can model and prove this. I won't go into too much detail here, but basically, it's saying, with sufficiently high plasticity, this is provably true. It actually magnifies the separation between the clusters. It makes the clusters come together while keeping them separate, for this model of four clusters.
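The experiment described above can be sketched roughly as follows. This is a toy reconstruction under stated assumptions, not the speakers' code: the sizes `n`, `k`, the synapse probability `p`, the Hebbian increment `beta`, the block structure of the class means, and the two-round readout in `respond` are all invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, k = 100, 1000, 50        # input dimension, neurons in the area, cap size (assumed)
p, beta = 0.1, 0.1             # synapse probability, Hebbian increment (assumed)
n_classes, block = 4, 25       # each class prefers its own block of 25 coordinates

W_in = (rng.random((n, d)) < p).astype(float)    # random input -> area synapses
W_rec = (rng.random((n, n)) < p).astype(float)   # random recurrent synapses

def sample(c):
    """Draw one sparse Bernoulli input from class c (synthetic data)."""
    rates = np.full(d, 0.02)
    rates[c * block:(c + 1) * block] = 0.8
    return (rng.random(d) < rates).astype(float)

def step(x, prev, learn):
    """One round of random projection and cap, optionally followed by Hebbian plasticity."""
    drive = W_in @ x + W_rec @ prev
    winners = np.argpartition(drive, -k)[-k:]    # the k neurons with the highest input
    if learn:
        # Strengthen synapses from every active source onto every winner.
        W_in[np.ix_(winners, np.flatnonzero(x))] *= 1 + beta
        W_rec[np.ix_(winners, np.flatnonzero(prev))] *= 1 + beta
    new = np.zeros(n)
    new[winners] = 1.0
    return new

for c in range(n_classes):                       # five examples per class, shown together
    prev = np.zeros(n)
    for _ in range(5):
        prev = step(sample(c), prev, learn=True)

def respond(x):
    """Response to a fresh input: two rounds with plasticity switched off."""
    first = step(x, np.zeros(n), learn=False)
    return step(x, first, learn=False)

resp = [[respond(sample(c)) for _ in range(5)] for c in range(n_classes)]

def mean_overlap(pairs):
    return float(np.mean([a @ b / k for a, b in pairs]))

within = mean_overlap([(resp[c][i], resp[c][j])
                       for c in range(n_classes)
                       for i in range(5) for j in range(i + 1, 5)])
between = mean_overlap([(a, b)
                        for c1 in range(n_classes) for c2 in range(c1 + 1, n_classes)
                        for a in resp[c1] for b in resp[c2]])
print(f"mean overlap within classes:  {within:.2f}")
print(f"mean overlap between classes: {between:.2f}")
```

The two printed numbers are the quantities discussed in the talk: the fraction of shared winners for inputs from the same class versus inputs from different classes. The modest `beta` keeps this small example well behaved; the provable statement in the talk concerns sufficiently high plasticity.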
So we can do this in multiple classes and so on. What about a classic machine learning problem, like learning a halfspace or learning a halfspace with a kernel? So you can run this experiment again. Now, the data is not coming from distributions that are chosen to have small overlap.
Rather, the two classes are simply separated by a halfspace. You pick a halfspace and random points, and then those on one side are called one type and the rest are the other type. So indeed, if you look at this matrix here, the top matrix, you see that the overlap within and across clusters is quite similar.
However, after we do the assembly process, where we present again just five examples from one and five examples from the other, the assemblies themselves are quite concentrated and separated, the different ones. We can't completely prove this. We can prove a weak version of this: with sufficiently high margin between the positive and negative, it turns out it's true. But the experiment seems to show that something stronger is true.
So, going along this style of analysis, you could ask what concept classes and example distributions, distributions over inputs, would work with this type of unsupervised or mildly supervised assembly-- it's mildly supervised because you're seeing the sequence of examples from one cluster together. What does it work for? Maybe one interesting aspect of this is that we only talked about one-layer, simple neural networks. Will hierarchy make this learning more efficient and more powerful? The success of deep learning suggests that these models are still quite a ways behind.
So, to conclude: learning in the brain is a fascinating question, wide open. But maybe, most fundamentally, more than even the technical questions that were raised in the experiments, maybe we're missing a model of how to even ask the question about the success of the brain's learning and about using its strengths. Its strengths include the fact that you don't need much supervision, that a few examples suffice, that it's robust, that you learn continually over many tasks, and so on.
And there are many attempts being made because these are also important questions from a machine learning point of view. But how do we even define a family of tasks and goals? Classical theories such as PAC and statistical learning seem to miss the point. Maybe they need to be extended properly. So how do we even measure the complexity of these things so that they reflect the brain's success and strengths? And one aspect of this, perhaps crucial, is modeling the environment. Now, how do we talk about the data that's provided to the brain and that it builds on? So that's all I have. Yeah.
KENNETH BLUM: Wonderful. That's great, thank you. Thank you, Santosh. Thank you, everybody. I think that because-- well, no surprise. Three interesting speakers have a lot to say. I'm going to read one from Jaja Jao: "Thanks for the presentations. What are the main mathematical tools used in the proofs of your theorems, such as the halfspace learning theorem?"
SANTOSH VEMPALA: It's a probabilistic analysis. So we show that over repeated iterations, this process of k-cap, keeping the k neurons that get the highest input, localizes. So at each step, neurons that emerge at the top are likely to remain at the top, and thereby it converges to an assembly. This probabilistic process is a combination of discrete and continuous considerations. But, yeah. It's an iterative convergence proof.
KENNETH BLUM: I'm going to move to a question that's extended in the chat, but I'll just paraphrase it. And maybe, Christos, you can speak to differences between assemblies and Hopfield networks.
CHRISTOS PAPADIMITRIOU: That's right. So there is a question by Dmitri Crothall: what is the difference? In one sense, they couldn't be more different. I mean, the convergence proof for assemblies is completely different from the convergence proof through these potential functions for Hopfield nets. The convergence proof for assemblies is random graph theory and probabilistic analysis. So Dmitri responds that you can do plasticity in Hopfield nets too. The point is that without plasticity, you have no assemblies.
Plasticity is the basis for assemblies. And so, it's something absolutely different. So in all these-- I mean, what I'm trying to do is differentiate myself from prior work. So this is like day and night.
SANTOSH VEMPALA: I mean, one aspect is, for example, association, or the fact that assemblies change over time, an immediate departure from Hopfield nets.
CHRISTOS PAPADIMITRIOU: Right. So assemblies are designed to change and shift around over time and also to copy themselves elsewhere, and so on. OK, yeah. Thanks. Yeah.
TOMASO POGGIO: Also, you have shown, Christos and Santosh, that, in your assembly calculus, you are equivalent to a universal computing machine. I don't think there is a similar result for Hopfield networks.
CHRISTOS PAPADIMITRIOU: I mean, you know, to-- Really, thanks. That's absolutely right, Tommy. I mean, we believe that assemblies, sort of, you know-- we are doing experiments now. We have some ideas. Assemblies can be used sort of as a form of storage device, storing correlations between tens of thousands of items, a huge Venn diagram, with retrieval, as some kind of associative memory, but you have to struggle very hard to do this. For Hopfield nets, it's immediate.
KENNETH BLUM: Well, it looks like there was an answer typed. But I'll read an earlier question which was about clarifying in, Christos, in your whirlwind tour of the English parser or language parser. What did you mean by parsing? So maybe you can [INAUDIBLE].
CHRISTOS PAPADIMITRIOU: Right, yes. So I should have mentioned that because-- and thanks for the opportunity. This is very fundamental. The processing of a sentence creates, through the spiking neurons and through the assembly operations, a retrievable trace of side effects. Retrievable, I mean, in that by new, other operations of the assemblies, we can read them. We can read the side effects. And these side effects, put together, constitute a dependency graph of the sentence. So it does parse.
So that's-- and in our experiments, we do the retrieval also, and we check that it retrieves a correct dependency parse of the input sentence. But thanks a lot for the opportunity to clarify that.
KENNETH BLUM: And there's another, looks like, follow up question, but it's somewhat separate from Jaja Jao. Is there a clock? Is there discrete timing or is it continuous? Does it matter, both?
CHRISTOS PAPADIMITRIOU: Yeah, so number one and three are correct. I mean, he said there is a discrete clock, not continuous. But we also know that it does not matter. In other words, this is a very, very arbitrary assumption. But it's so mathematically convenient that it's irresistible. And we have evidence that it is, of course-- it's not a faithful representation, but it's also not distortive. It does not distort the results. Because we have also simulated assembly operations in sort of more realistic neurons, using textbook, standard neuron models used in practice to simulate the cortex, and they work. They have the same behaviors there as well.
So, I mean, even though these simple-- there are two basic mathematical simplifications that make this model, I believe, very, very fun, a lot of fun to work with. One is the assumption of the random projection and cap, which is a simplification of inhibition. And, of course, the second one is that everything proceeds in lockstep, which, of course, is not realistic, but it also does not change things too much.
KENNETH BLUM: Great. I see a question in the chat about spiking per word, from Nina Dekane: the spiking per word seems high for what is experimentally observed in recordings from human brains. Do you have comments on that numerical discrepancy, quantitative discrepancy?
CHRISTOS PAPADIMITRIOU: I see. So I didn't know that. My impression is that speech is four hertz in syllables, almost universally, roughly. This means to me that something between a quarter and half a second per word is about right. And also that the kind of spiking we're talking about is gamma, actually the low gamma range, which is something between 50 and 70 hertz. So 50 or 60, 70 hertz. So, I think these calculations are commensurate with speech.
I mean, we're not competing against transformers, we realize. We want to do better. We want to do as well as human speech.
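The arithmetic behind that estimate can be spelled out in a few lines. The figures are the ones quoted in the answer; treating one gamma cycle as one firing step of the model is an assumption made here only for the calculation.

```python
seconds_per_word = (0.25, 0.5)   # "between a quarter and half a second per word"
gamma_hz = (50.0, 70.0)          # "between 50 and 70 hertz"

# Assumption: one firing step of the model per gamma cycle.
fewest_steps = seconds_per_word[0] * gamma_hz[0]
most_steps = seconds_per_word[1] * gamma_hz[1]
print(f"about {fewest_steps:.1f} to {most_steps:.1f} firing steps per word")
# prints: about 12.5 to 35.0 firing steps per word
```

So under these numbers a word leaves room for somewhere between roughly a dozen and a few dozen firing steps.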
KENNETH BLUM: I have, I guess, a radical generalization of that question. I suppose it's directed more at Christos and Santosh than Tommy, but it's about incisive experiments that really have the potential to, in some strong fashion, falsify some central piece of your proposal. I mean, these are hard. I mean, I know that because I think that Les Valiant, in some sense, has been looking for a way, for experimental tests that have the potential to really speak directly to these kinds of questions. And so far, I haven't seen those experiments.
So do you have ideas, proposals, for good experiments that really could put some of this to the test?
TOMASO POGGIO: We have one, but Christos, go first.
CHRISTOS PAPADIMITRIOU: Again, thanks for mentioning Les Valiant. Les Valiant is a giant in our field who has been an inspiration in our work, for our work. Of course, our work differs from his proposals, but we read his works many years before he started working on this subject. So indeed, I mean, Les is interested in coming up with experiments. We are also. So we have gotten an NSF grant with cognitive neuroscientist from CUNY.
Basically, in order to look hard for evidence of assembly computation, essentially in the medial temporal lobe and the superior temporal gyrus, where language is supposed to be happening. Unfortunately, everything has stopped for a year now. But, I mean, we have designed experiments of this sort. So, yes.
KENNETH BLUM: It sounds exciting. Tommy, you had a comment?
TOMASO POGGIO: Yeah, we are starting a collaboration with [? Hermiller. ?] It's a wonderful protocol for measuring activity in prefrontal cortex. So it's kind of working memory in monkeys. The monkey is fixating. There is an object to the right of the fixation point. The left hemisphere sees that object and there is activity there. Then the object disappears, and at the same time the monkey saccades to the right of the object. Now the object is on the left and goes to the right hemisphere. What is transferred is a working memory of the object. This is assuming that the working memory is stable with respect to the fixation.
So now you can see, it's like copying an assembly in your terms, from one area to the other, it's a mystery. And now you can look whether the activity is precisely the same or not, for instance. And it looks like it's not precisely the same. You can look for other differences.
CHRISTOS PAPADIMITRIOU: Looks precisely the same between which two things, Tommy?
TOMASO POGGIO: Well, seeing directly.
CHRISTOS PAPADIMITRIOU: Why is that?
TOMASO POGGIO: And then moving, copying, that activity to the other hemisphere. Is it a transfer of the precise activity, or is it just a transfer of assemblies that maybe, probably, are different in terms of individual neurons?
CHRISTOS PAPADIMITRIOU: And these are not in the visual cortex? They're in the prefrontal cortex, you are saying?
TOMASO POGGIO: Prefrontal cortex, yes.
CHRISTOS PAPADIMITRIOU: OK.
TOMASO POGGIO: Where it's not the only place, but you assume that there is a working memory of what we see around us. You close your eyes, there is still a memory of what is around you, and things should be stable with respect to this working memory. So if you move your eyes, it makes sense that it moves around, like he found. Anyway, there is more to discuss, but it seems an interesting way of looking at how copying works, and whether it's consistent with your version of copying or something else.
CHRISTOS PAPADIMITRIOU: Yeah, very interesting. That's fascinating.
KENNETH BLUM: Tommy, maybe this question in the chat is directed to you. What does it mean to say that, quote, activity is the same across different brain regions? That they show the same response and some task condition?
TOMASO POGGIO: Well, you can, for instance, I think Earl did that. There is a Neuron paper in which you can record from a large number of electrodes, train a classifier, and then check whether you can have the same readout on the copied memory instead of the real one. So you can verify whether it's the same or not.
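The readout test described in this answer can be illustrated on simulated data. Everything below is hypothetical: the activity patterns are random numbers, the classifier is a plain nearest-centroid rule, and the "copy on different neurons" case is modelled crudely as a fixed shuffle of neuron identities. It is not the analysis from the paper mentioned.

```python
import numpy as np

rng = np.random.default_rng(2)
n_objects, n_neurons, n_trials, noise = 8, 50, 20, 0.5   # all values are made up

prototypes = rng.normal(size=(n_objects, n_neurons))     # mean activity per object

def record(patterns):
    """Simulate noisy trials; returns (activity, labels)."""
    labels = np.repeat(np.arange(n_objects), n_trials)
    activity = patterns[labels] + noise * rng.normal(size=(labels.size, n_neurons))
    return activity, labels

def fit_centroids(activity, labels):
    """Nearest-centroid readout: one mean pattern per object."""
    return np.stack([activity[labels == c].mean(axis=0) for c in range(n_objects)])

def accuracy(centroids, activity, labels):
    dists = ((activity[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(dists.argmin(axis=1) == labels))

# Readout trained on the "directly seen" memory.
train_x, train_y = record(prototypes)
centroids = fit_centroids(train_x, train_y)

# Case 1: the copy reuses the same population code.
same_x, same_y = record(prototypes)
# Case 2: the copy carries the same information on different neurons.
shuffled_x, shuffled_y = record(prototypes[:, rng.permutation(n_neurons)])

acc_same = accuracy(centroids, same_x, same_y)
acc_shuffled = accuracy(centroids, shuffled_x, shuffled_y)
print(f"readout transfers to identical code: {acc_same:.2f}")
print(f"readout transfers to shuffled code:  {acc_shuffled:.2f}")
```

If the trained readout keeps working on the copied memory, the two codes are the same in the sense relevant to that readout. If it drops toward chance, the copy may still hold the information but on a different set of neurons.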
KENNETH BLUM: But, Tommy, you, yourself, put a comment in the chat about evolution and Turing machines and assemblies, and I'm not sure I understood it, but you can view a bit more?
TOMASO POGGIO: I think one simplified but very concise version of sort of summarizing what you said, I said. My version was essentially, could evolution ever discover Turing machines? And your version is different. Assemblies, I think, are different from point-by-point precise wiring like I had in my case. Now, can you think of using assemblies in computers, forgetting for a moment the brain? Do you think it's a computing scheme that makes sense to use in computers, and which kinds of properties of the hardware would make it interesting for today's or tomorrow's computers?
SANTOSH VEMPALA: That's a good question. Indeed, one of the first papers that influenced us was this paper showing that this random projection and cap operation gives you an algorithm for nearest neighbor search that's competitive with locality-sensitive hashing on benchmark data sets. So the cap operation seems to be useful purely computationally as a hash, as a locality-sensitive hash. But more broadly, I would guess that the answer is yes, and we have to find the right applications.
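A minimal sketch of random projection and cap used as a similarity-preserving hash follows. The dimensions, the 10% sampling rate, and the overlap-based similarity are assumptions for the illustration, and it is not a reimplementation of the paper referred to.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m, k = 50, 2000, 32         # input dim, expanded dim, winners kept (all assumed)

# Sparse binary random projection: each output unit samples about 10% of the inputs.
M = (rng.random((m, d)) < 0.1).astype(float)

def cap_hash(x):
    """Hash = indices of the k output units with the largest projected value."""
    return set(np.argpartition(M @ x, -k)[-k:].tolist())

def similarity(a, b):
    """Fraction of winners shared by the two hashes."""
    return len(cap_hash(a) & cap_hash(b)) / k

query = rng.normal(size=d)
near = query + 0.1 * rng.normal(size=d)    # small perturbation of the query
far = rng.normal(size=d)                   # unrelated point

print("hash overlap, near pair:", similarity(query, near))
print("hash overlap, far pair: ", similarity(query, far))
```

Nearby inputs share most of their winners while unrelated inputs share few, so the sets of winners can be bucketed or compared directly to shortlist nearest-neighbor candidates.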
And maybe, for example, things like word embeddings or language associations could be one place where assembly-type data structures could be more useful than most standard ones. Let's say, if you want to solve the analogy task. Man is to woman as king is to whatever. There are some more precise algorithms to solve it, but maybe, if you just create assemblies for these things, and also assemblies for different types of relationships, maybe that works better. At this point it's purely conjecture.
TOMASO POGGIO: Yeah, but I was asking, they are both-- a Turing machine is universal. Your calculus is universal. Can you see that as the foundation for something different with computers? And they have different, ideally different, components. Like, instead of gates, something else. I don't know.
CHRISTOS PAPADIMITRIOU: I mean, what comes to mind is that now they have these massive and very clever, very efficient, neuromorphic computers all over. I mean, starting in Europe a few years back. I mean, it could be that assemblies might be a mode of operation that would be interesting for such machines. I mean, simply because the brain seems to have chosen to do it this way. I'll tell you, for a very long time, and Santosh knows that, we were trying to talk to Geoff Hinton about assemblies and he was interrupting us.
I mean, he wouldn't listen to the second sentence. And it was all, you know, "I cannot believe that the brain uses such a wasteful technology," sort of. And there is a point there. So why have 30,000-bit words? That doesn't sound so very efficient. But on the other hand, we know for some applications, it could be beneficial.
TOMASO POGGIO: There is another question, which is, evolution may have discovered both Turing machines. I'm saying Turing machine because this was what I came up with, and assemblies, it could have.
CHRISTOS PAPADIMITRIOU: Yeah?
TOMASO POGGIO: But the cortex is very similar everywhere.
CHRISTOS PAPADIMITRIOU: Yeah.
TOMASO POGGIO: What you think about the visual system is similar to Broca's area in terms of anatomy?
CHRISTOS PAPADIMITRIOU: Tommy, I have to tell you this. Before I started thinking about the brain, I was thinking about evolution.
TOMASO POGGIO: I know. I know. I know. Yes.
CHRISTOS PAPADIMITRIOU: Do you still want to ask me?
TOMASO POGGIO: OK. Yeah, you had a very nice paper about evolution. Santosh, do you have one minute?
SANTOSH VEMPALA: Yes, yes, of course, there is.
TOMASO POGGIO: I had an audio crash while you were saying that deep networks are not lookup tables so, can you repeat what the final--
SANTOSH VEMPALA: You said several things, bridging the gap between these models and deep nets. And I was contesting those points, including the characterization of deep nets as lookup tables.
TOMASO POGGIO: [LAUGHTER] But a lot. If you look at the NTK, in that regime they're really kernel machines.
SANTOSH VEMPALA: But NTK, we already know, is very limited in what it can capture.
TOMASO POGGIO: Well, the performance is not too bad. It's like--
SANTOSH VEMPALA: --start on. [LAUGHTER] Right? Oh, yes. On benchmark tasks that are--
TOMASO POGGIO: Right. But in the same spirit, the densely connected deep networks are not very good at all. I've never seen a task where they are better than kernel machines. It's only convolutional networks which are better.
SANTOSH VEMPALA: Yes, right.
TOMASO POGGIO: --connectors are the other reasons. This is like a hierarchical lookup table.
SANTOSH VEMPALA: Yeah. I see.
TOMASO POGGIO: Anyway.
SANTOSH VEMPALA: Multi-label-- hierarchical lookup table.
TOMASO POGGIO: And I was exaggerating a bit, but I think the generalization you get is essentially similarities, like the interpolating nearest neighbor, a bit better. But if you don't get-- it's not like rule-based. It's the reason why I don't think that a lookup table, or an interpolating lookup table trained end-to-end, can explain real human level intelligent things. I'm arguing for something different like you are, your system.
SANTOSH VEMPALA: One missing piece here is the architecture itself. And as you just mentioned, CNNs seem to be crucial to the performance of deep learning in many aspects. And we have evidence of some CNN structures, let's say, in the visual system. But the question is, can we describe an experiment analogous to evolution, which would result in the architecture converging to CNNs?
TOMASO POGGIO: Well, I'll tell you. I think it's-- First of all, they're working much better for stimuli that have the appropriate properties. Vision is one, speech, text, but not everything. Predicting the financial market, no. In the meantime, we know from approximation theory that if you have locality, so functions that are functions of functions, these convolutional networks are very good. They escape the exponential curse of dimensionality.
In the meantime, we also know that they cannot be learned from data. There is a very recent example, not published yet, by [INAUDIBLE] and Aaron [INAUDIBLE]. You need too much data to learn the locality of the architecture. So it's a strong prior.
Once you have the prior in the architecture-- So you have local receptive fields, your units looking at nearby units in the layer below, not at ones that are very distant, like in the visual system. Once you have that prior, you can learn, from examples, the weight sharing, which is much less important anyway.
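A crude way to see the relative size of the two constraints is to count free weights in one layer. The layer width and receptive-field size below are arbitrary assumptions, edges are ignored, and parameter counting is only a proxy for the approximation-theory argument being made.

```python
n_units = 1024        # units per layer, arranged in 1-D (assumed)
field = 9             # receptive-field size (assumed)

dense_weights = n_units * n_units     # every unit sees every unit in the layer below
local_weights = n_units * field       # every unit sees only `field` nearby units
shared_weights = field                # local, with one filter reused at every position

print("dense: ", dense_weights)       # 1048576
print("local: ", local_weights)       # 9216
print("shared:", shared_weights)      # 9
```

Locality alone already removes more than 99% of the weights in this example. Weight sharing then reduces what remains further, but it acts on a layer that locality has already made small.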
TOMASO POGGIO: My bet is that locality of connection is genetically determined, probably because of just how development works. It's easy to connect to nearby things, to make precise connections. And the sharing is learned during childhood or babyhood.
CHRISTOS PAPADIMITRIOU: Sharing?
TOMASO POGGIO: Weight sharing.
CHRISTOS PAPADIMITRIOU: Weight sharing.
TOMASO POGGIO: Yeah, if you have a stimulated translation in--
CHRISTOS PAPADIMITRIOU: I'm just not there yet. OK.
TOMASO POGGIO: Because also, weight sharing is biologically very difficult to justify, that's the usual problem. But if you learn it, I don't think you can put it as a [? fiat, ?] like you do in a computer program. But you can learn it from data in an essentially unsupervised way without directly imposing it.
CHRISTOS PAPADIMITRIOU: Yeah, good points of those needs, Tommy, thanks.
TOMASO POGGIO: Yeah, but I'm curious, you know, my simplification of-- For evolution to discover recurrence and hidden state variables, that seems to me, is the right thing for getting to a Turing machine or a finite state machine. But maybe I'm forgetting something. The difference is always in the detail, like the memory or the tape or--
CHRISTOS PAPADIMITRIOU: Right. At the level of assemblies, you can have memory, because a new assembly operation can erase the effects of the previous one, and things like that.
TOMASO POGGIO: I see. Yeah.
CHRISTOS PAPADIMITRIOU: At the neuron level, it's harder to see how this can be done in a reliable way. But having said that, the big surprise of the parser was that it barely needs any control at all. So it is a streamlined thing that says, get the next word and do the right thing. So nothing else.
TOMASO POGGIO: Do you think it will be more difficult to implement it in the [INAUDIBLE] current network that I had? Be a very interesting point if it is, more difficult.
CHRISTOS PAPADIMITRIOU: I doubt it.
TOMASO POGGIO: You doubt it?
CHRISTOS PAPADIMITRIOU: I don't think so. You need plasticity. Plasticity is paramount.
TOMASO POGGIO: Hebbian-type plasticity, right?
CHRISTOS PAPADIMITRIOU: Hebbian-type plasticity. [INAUDIBLE]. If you ask me, what is the feature of the assemblies model that we are the least comfortable about? Experimental neuroscientists believe in plasticity a lot. And also, they believe in assemblies. But they never thought of them at this rate, so not for health. Think of them as intermediate scale. It's not that they know they don't work, but they just have not--
TOMASO POGGIO: There is an old literature by [INAUDIBLE].
TOMASO POGGIO: Yes, thank you. Who has very rapid plasticity. [INAUDIBLE]
CHRISTOS PAPADIMITRIOU: Was this about audition? About--
TOMASO POGGIO: No. It was related, I think, to oscillations, and training between different parts of the brain. In the 80s or 90s, there was a period in which neurophysiologists were quite in love with the idea that you could have oscillations synchronized in different parts of the brain.
CHRISTOS PAPADIMITRIOU: I'm sorry. I missed the name of the researcher.
PRESENTER: Christophe [INAUDIBLE]
CHRISTOS PAPADIMITRIOU: OK. [INAUDIBLE] I think I know who you are, OK.
TOMASO POGGIO: I knew him quite well. Yeah, it may be useful just in terms of references to data about this.
CHRISTOS PAPADIMITRIOU: I have--
CHRISTOS PAPADIMITRIOU: I do have very few neuroscience references that talk about rapid plasticity. [INAUDIBLE].
TOMASO POGGIO: Right, right.
CHRISTOS PAPADIMITRIOU: --exist. But I--
TOMASO POGGIO: So this would be a place to try to look at it.
SANTOSH VEMPALA: But, Tommy, another point of deviation from something like a general machine, or even our simulation of Turing completeness at the assembly [INAUDIBLE], is that we still require a lot of control. It's still this high level thing saying, this happens, then this happens, then this happens.
CHRISTOS PAPADIMITRIOU: And then if this is equal to that, then do this. That's why it was a relief to see that in our first application-- first app, so the parser-- there's none of that. Because this is something, of course, we are uncomfortable about.
TOMASO POGGIO: Yep. But this is related to the questions I had at the end, we did not come to it. But you, again, think about evolution inventing, say, your first program. And then the way evolution probably would improve it, is to copy it and mutate, right? And then mix them up, and things like that.
CHRISTOS PAPADIMITRIOU: Yes, absolutely.
TOMASO POGGIO: Like with genes. And also, this is what programmers do more and more, by the way. [LAUGHTER] These days. But then, you need to have simple operation in copying, for instance. And if your network is weights, I don't know how to copy weights in the brain.
CHRISTOS PAPADIMITRIOU: The language organ must have some massive copying device. We know that there are a lot of new fibers, evolutionarily new. In other words, the chimps don't have it, don't have them in that area.
Or in fact, the largest of these five fibers is many times the size of the same fiber in the chimp. And it's bigger in the left hemisphere than in the right hemisphere. So this must be used for massive copying, could be used for massive copying. That has to be a big part of what's going on.
I am convinced that a lot of language is basically retrieved. What I'm telling you now, I've said before, and I'm pretty sure I said it pretty much the same way. Not just poetry, or prayer, or mantra, or something. But every day-- All the people talk about idioms, like kick the bucket and so on, that are retrieved. But I believe that actual sentences are retrieved. So they exist for the future.
PRESENTER: Tommy proposed that formal systems, for example, programs may be stored in the brain using recurrence and states. Do you think these would be enough to capture these formal systems? Are there any other components you think might be useful in your view? How difficult would it be to build an equivalent neural network model for a given formal system, such that it can store and use it? What are the major challenges, in your view? I lost a bit what was being referred to there. But if one of the panelists parsed that, please go ahead.
SANTOSH VEMPALA: The question, I think, is how can a program, a sequence of instructions, be stored in one of these recurrent networks? How can the control mechanism itself be stored as a program?
CHRISTOS PAPADIMITRIOU: That's a good--
CHRISTOS PAPADIMITRIOU: --question. So we know--
TOMASO POGGIO: It's the associative memory, right?
CHRISTOS PAPADIMITRIOU: Yes, but how is this then loaded and executed, Tommy? That's the question. So I think it's going to be once-- if people believe that something like programs do run in the brain, this will have to be the next thing to start to look for. So how is the control of these programs done?
TOMASO POGGIO: Well, the way it would be represented is simple. It could be like a [INAUDIBLE] network. By the way, it's [INAUDIBLE] not Wiltshire, Santosh.
SANTOSH VEMPALA: Oh, excuse me.
TOMASO POGGIO: [LAUGHTER] So it could be a network, right? And specific weights. Now how it's loaded, or changed, and started this-- Yes, I agree. These are big questions.
PRESENTER: So with thanks to the panelists. Thanks to all of the people in the audience, especially those who asked questions. Apologies to those people whose questions did not get addressed. I think that we will bid everybody goodbye. Thanks. It was great, very stimulating, interesting. And I look forward to future conversations, discussions, debates.
TOMASO POGGIO: Thank you, Santosh. Thank you, Christos. This was great.
SANTOSH VEMPALA: Thanks, Tommy. And thank you, Kenneth.
CHRISTOS PAPADIMITRIOU: Thanks, Tommy, again. Bye bye.
PRESENTER: Thank you. Thank you.