Panel Discussion: Open Questions in Theory of Learning
Date Posted:
November 19, 2024
Date Recorded:
November 12, 2024
CBMM Speaker(s):
Tomaso Poggio,
Ila Fiete,
Haim Sompolinsky
Speaker(s):
Philip Isola, MIT; Eran Malach, Harvard
Description:
In a society that is confronting the new age of AI in which LLMs begin to display aspects of human intelligence, understanding the fundamental theory of deep learning and applying it to real systems is a compelling and urgent need. This panel will introduce some new simple foundational results in the theory of supervised learning. It will also discuss open problems in the theory of learning, including problems specific to neuroscience.
Moderator: Tomaso Poggio - Professor of Brain and Cognitive Sciences, MIT
Panelists: Ila Fiete - Professor of Brain and Cognitive Sciences, MIT
Haim Sompolinsky - Professor of Molecular and Cellular Biology and of Physics, Harvard University
Eran Malach - Research fellow, Kempner Institute at Harvard University
Philip Isola - Associate Professor, EECS at MIT
TOMASO POGGIO: I'm Tomaso Poggio and today we have a great panel. So I'm looking forward to a lot of interesting ideas and discussions. I think we need a science of intelligence, and I don't want to waste time explaining why we need it. We need a science because we want to understand natural intelligence in addition to artificial intelligence, and we want to understand what's going on inside transformers and other architectures.
And we need a theory. Maybe we disagree or agree on that. But I think we need a theory, a bit like physics, in which there are some fundamental principles. They may not all be like Maxwell's equations. Some of them may be more like a fundamental principle of molecular biology — the DNA helical structure, and how it immediately suggests how to copy and replicate.
But I think it's important to have a theory for many reasons. One of them is also to be able to deal with problems like explaining what's going on, aligning with what we want, safety considerations, and so on. The main reason why you want a theory is really something the history of electricity can tell us. I think electricity is, in its history, a bit similar to deep learning, to machine learning.
Until 1800, there was no continuous source of electricity. The pila — Alessandro Volta's voltaic pile — was the first one. And once that was found, a lot of applications immediately followed: generators, electrical motors. The whole of electrochemistry was done within 20 years. The telegraph was designed by Alessandro Volta — never built — between Pavia and Milan, 30 kilometers away.
This was the time in which information was traveling in the world at the speed of a horse, and suddenly it was the speed of light. It was-- but all of this happened, telegraph and so on, without people knowing what electricity was. It was only 60 years later that Maxwell had a theory of electromagnetism, and a lot of other things happened after that. So I think from the theory point of view, right now in deep learning we are between Volta and Maxwell. I'm not sure exactly where. That could be a topic for discussion.
Deep learning — there are many questions one can ask, just to tell you what a theory should be able to answer. Why is optimization apparently so easy? Why do you get generalization despite overparametrization? That one is in gray because I think we know the answer to it already. Why is the ReLU activation so critical? It's not, so that's why it's in gray.
Some of the others are still open. Like, what is really happening in transformers? What's the magic of it? Maybe Eran, one of our panelists, will explain a big part of it. Anyway, we're looking for fundamental principles that can answer these kinds of questions. And so let me now do the following. I'll introduce our panelists. You can-- not here. That's mine.
[LAUGHTER]
Thank you. And I'll introduce our panelists. And then each one of them will speak for no more than 10 minutes, a few slides. And then I'll be the last one and then we'll start the discussion. And the discussion will be between us and then we'll open it to all of you.
So prepare your questions. So Philip is a professor in EECS, a member of CSAIL, and he has been working on computer vision and machine learning. He is known for his pioneering work on generative models, image synthesis, and all the other good things that ChatGPT wrote for you-- about you.
Ila Fiete is a professor in BCS, in this building, and she also co-directs the ICoN Center. Her research focuses on theoretical neuroscience, particularly the mechanisms of memory and navigation in the brain. Haim Sompolinsky is a professor at Harvard and a senior investigator at the Hebrew University of Jerusalem. Still true. And directs the Safra Center.
HAIM SOMPOLINSKY: Former.
TOMASO POGGIO: Former director of the Safra Center. OK. He has been a pioneer in theoretical neuroscience, exploring the principles of neural computation and dynamics in complex systems. And Eran Malach is a Kempner Fellow at Harvard University. He graduated from the Hebrew University of Jerusalem, and his work centers on the theoretical foundations of deep learning and neural network optimization. That's all ChatGPT.
And I'm a co-director of the Center for Brains, Minds, and Machines here, a professor in this building. And these days I'm mostly known because I have famous postdocs and students. So that's my claim to fame and I'm very proud of it. OK. Let's start with Philip.
PHILIP ISOLA: So the blurb that I'll give is on this hypothesis that we're calling the platonic representation hypothesis. This is work that some of my students and I have put out recently. And yeah, Tommy asked each of us: what's one of the fundamental principles of intelligence? This is really just a hypothesis. I don't know if it's a fundamental principle, but I think there's something here that could lead to fundamental principles. So I wouldn't treat this as more than a call to action to study it.
OK. So I'll tell you what that title means. But I want to start with this paper, which is one of my favorite papers from the last decade or so. This is from Antonio Torralba and some of his collaborators and students. And it's Object Detectors Emerge in Deep Scene CNNs.
And what they observed, which is a story that you'll all have seen now over and over again over the last decade, is that if you train a neural network to do something like detect whether this is an indoor scene or an outdoor scene, or if it's a cat or a dog, whatever it is, and you probe neurons at some layer of that network, you'll find that there are object detectors. So a neural net that's trained to do scene classification ends up kind of self-organizing to find units that respond selectively for these images.
So neuron A will fire. These were the top four images that most activate neuron A, and here's the top four images that most activate neuron B. So there is a dog face detector and there is a robin detector. And in BCS over here, you'll have seen the same thing in monkey visual cortex. There are units in IT and other places that respond selectively for objects.
OK. So this is something that's interesting. There's some internal structure. These are not just black boxes. We can understand the representations they learn to some degree. So other people have studied similar things. And here's one that we did a few years ago where we trained a network to do image colorization. So in this problem, we're going to try to predict the missing colors in a black and white photo. We're going to do the same test. We're going to probe neuron A and neuron B at some layer of the network and ask what do they respond to.
OK. So the thing that was surprising in 2016 — now it's kind of known — is that you get the same things. You get a dog face detector, you get a flower detector, maybe you get a robin detector; it's further down the list. So this is very strange, right? A neural network that's trained to classify scenes — of course it will parse the world into objects. Those are the components of scenes.
But a neural net that's trained to just colorize black and white photos, some low level photometric property of images, it will also discover the same types of structures. And that's just a story that's repeated over and over again. That whenever we train deep nets to do whatever we care about, they seem to be learning similar detectors and patterns and parsing the world in similar ways. OK.
So a lot of people have put forth versions of this hypothesis that different neural networks, trained in different ways on different data sets are somehow converging to the same way of representing the world. And in particular, I'm going to focus on the same kernel function, which I'm going to describe in a second. OK.
So that's the rough hypothesis. Just one more note, of course: this hypothesis has been shown to hold at V1, at the first few layers of visual cortex, where Gabor-like detectors emerge in all of these networks. But further, deeper in the net, we don't know for sure. OK. So we've been looking at this in terms of representational kernels. So let me quickly define what that is.
We're going to characterize representations with vector embeddings, so mapping from data to vectors, and we're going to then characterize the representation by its kernel, which is how does the embedding space measure distance between two different items. So the kernel for a set of images under some image embedding would look like that on the left there. We have the representation.
The vector that represents apple and the vector that represents orange might be similar vectors, so the kernel says that those things are alike. Whereas the embedding for elephant will be distant or dissimilar from the embedding for orange. So this structure is just one of those fundamental structures that is important to characterize representation. It tells us how does the representation measure distance between different items.
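To make the definition concrete, here is a minimal sketch of a representational kernel in code, assuming we already have an embedding for each item; the cosine-similarity choice and the example numbers are illustrative assumptions, not necessarily the exact definition used in the paper.

```python
import numpy as np

def representation_kernel(embeddings: np.ndarray) -> np.ndarray:
    """Cosine-similarity kernel: entry (i, j) says how alike items i and j
    look under this representation. `embeddings` has shape (n_items, dim)."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return z @ z.T

# Hypothetical rows: embeddings of "apple", "orange", "elephant".
emb = np.array([[1.0, 0.2, 0.0],
                [0.9, 0.3, 0.1],
                [0.0, 0.1, 1.0]])
K = representation_kernel(emb)
print(K.round(2))  # apple/orange entry is large, apple/elephant entry is small
```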
In some of our recent work, we ran a bunch of experiments looking at these kernels and how they're alike or different between different neural networks and different training regimes. And I'll show just one experiment, where we looked at the similarity between a kernel for a language model and a kernel for a vision model. We're asking: does the language model represent the distance between two words, apple and orange, as being small in the same way that the vision model represents the distance between an image of an apple and an image of an orange as being small? So are the language kernel and the vision kernel for matching items alike? And here's the main finding.
If we plot language modeling performance of language models on the x-axis against alignment — kernel alignment to vision models — on the y-axis, what we're seeing is that, over time, as language models get better and better at just doing next character prediction, they're getting more and more aligned in their kernel representation to the kernel of a state-of-the-art vision system, DINO. So this is pure LLM performance on the x-axis and alignment to a pure vision model that has not been trained with any language at all.
And yet over time they're getting more aligned. And if you look at alignment to bigger and better vision models, you get increasing alignment too. So the best vision model is the most aligned with the best language model, and worse language models are less aligned with worse vision models. So it does look like there's some kind of convergence going on. And the hypothesis is that that will keep on going. But who knows? Maybe that will be false. Maybe this will fall off after a while.
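For concreteness, here is one simple way such a cross-modal alignment score could be computed from two kernels on matched items; the actual metric used in the work may well differ (mutual nearest-neighbor style scores are common), so treat this as a crude stand-in.

```python
import numpy as np

def kernel_alignment(K_lang: np.ndarray, K_vis: np.ndarray) -> float:
    """Correlation between the off-diagonal entries of two kernels computed
    on matched items (e.g., caption i paired with image i). Higher means the
    two representations agree more about which items are alike."""
    mask = ~np.eye(K_lang.shape[0], dtype=bool)
    return float(np.corrcoef(K_lang[mask], K_vis[mask])[0, 1])
```

Plotting a score like this against language-modeling performance, across a family of LLMs and a fixed vision model, is the kind of experiment described above.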
So why is this going on? So I think the most common response at this point when I've talked about this work is, oh, it's all about the data. All of the models are trained on the internet, so we're just learning a model of the internet. But the thing is this is comparing the kernel for a language model and a vision model. The data is not really the same. It's a different format, different modality. So it could be the architectures. We all use transformers. Maybe there's other fundamental principles here.
We're telling people to use the same methods. But our rough argument — my rough argument, and something we could talk more about in the discussion — is that it's about the world. It's something about nature. So that is what led us to this articulation of the platonic hypothesis, going back to Plato's Allegory of the Cave. Plato imagined these prisoners whose only experience of the outside world is the shadows cast on the cave wall.
And so they have to infer what is out there in the actual reality from just these projections. And it's an allegory because he's saying that that's really how our senses work. We don't have direct access to the physical state, but we do have some measurements, some observations of the state. So there is some real world out there. Platonic reality, latent variable z.
And we observe it through cameras or through text or through other modalities because if we do representation learning in any modality, we'll get a representation, and because they come from the same causal process, at the end of the day, those two representations should somehow become alike. That's the rough argument. Modeling the world with different modalities should arrive at a similar representation, because the underlying causal variables are the same.
So I won't have time to go in full detail into the math, but we do have a toy theory or an initial theory we've started to work out, which basically constructs a toy world of discrete events and so forth. You observe those events with language or vision or other modalities. And we could talk about this more offline, but the current candidate for what is this kernel, where is this convergence all heading to, is that we're learning a kernel that measures distance between events in a way that is proportional to the co-occurrence rate of those two events.
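One way to make the "distance proportional to co-occurrence" candidate concrete — my paraphrase of a standard contrastive-learning result, not necessarily the exact statement of their theory — is that for InfoNCE-style objectives the learned representation satisfies, approximately,

$$\langle f(x),\, f(x')\rangle \;\approx\; \log\frac{P(x, x')}{P(x)\,P(x')} + \text{const},$$

i.e., the inner-product kernel approaches the pointwise mutual information of the two observations, a monotone function of how often the underlying events co-occur relative to chance. Two modalities observing the same events would then share this kernel.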
So the formalisms here are for the contrastive learning setting. Again, I'm not going to have detail to go into all of that. But if you have a world in which different events co-occur with different rates and you observe them under different modalities, you can prove in that simple toy world that they will converge to the same kernel representation. These are the three points I want to leave you with, which I think could be the starter of some fundamental principles.
One is that I'm just more and more convinced that kernels are an object of fundamental importance for understanding representations. They seem to be converging, in theory, at least in simple theory, but also in practice empirically, with large language models and large vision models, and one candidate for this kind of convergent kernel that it might all be heading up toward is a kernel in which distance is proportional to the co-occurrence rate of events in space time. And I will leave you with that and we can keep on discussing. Thank you.
[APPLAUSE]
HAIM SOMPOLINSKY: So I'm going to tell you a story that overlaps with what we have just heard, but from kind of a different angle, perhaps. Neural manifolds are a framework for understanding representations of categories in AI and brains. So the question: we all know that at some point in cognitive systems, there emerge representations of categories. In vision these would be objects, and so on. So it's a fundamental question how these categories emerge from the continuous stream of signals that impinge on the brain.
One might imagine that somewhere at the top layer, in IT cortex, there is a neuron that is specific to a particular object — the grandmother-cell hypothesis. But the answer is that there is no such thing; many neurons respond to all objects, basically. A more sophisticated assumption would be what's called neural collapse: that there is a distribution or pattern of activity in IT cortex that is unique to a particular object.
But the answer is that this is not even the case, because the neuronal responses depend on the physical variability of these categories. So the natural hypothesis is that we are not talking about unique patterns of activity, which is invariant, but we are talking about object manifolds as the representation of these categories. So object manifolds.
So this is a description of them. You have many images corresponding to the category dog. The collection of these response vectors defines a manifold. Similarly for cat, and so on and so forth. And the idea is that part of the job of deep networks, in AI and in the brain, is to reformat or reshape those manifolds so that they allow downstream computations that have to do with object identity.
So there are two questions: what are the ensembles of category-based computations downstream, relative to which we will measure how good these representations are, and what are the relevant geometric measures of those manifolds in the context of those computations? I'm just flashing this here — I'm not going to go into the math — but statistical mechanics gives us qualitative and also precise quantitative predictions: sufficient and necessary conditions on the manifolds at the feature layer that allow for different computations.
So you have examples here. One of the computations, which I'll mention briefly soon, is high-capacity linear classification of a large number of categories. You have a large number of categories; you want to classify some of them as plus and some as minus. So the manifolds need to have an appropriate radius and an appropriate effective dimension. The radius is normalized to the mean distance between the centroids, which is the signal.
So that's one type of result. Another ensemble of computations is fast learning — few-shot learning — to discriminate between new categories. This, again, has a predicted SNR, which involves the radii, the dimension, and the overlap between the variability of the manifold and the signal vector. And finally, zero-shot learning, involving cross-modal estimation of prototypes. So you transfer knowledge from language to vision and so on and so forth.
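To make these geometric quantities concrete, here is a minimal sketch of how one might compute them from a pre-trained network's feature vectors; the exact normalizations in the published capacity and SNR formulas differ in detail, so this is illustrative only.

```python
import numpy as np

def manifold_geometry(features: np.ndarray):
    """features: (n_examples, n_units) responses to many images of ONE category.
    Returns the manifold centroid, radius, and participation-ratio dimension."""
    centroid = features.mean(axis=0)
    deltas = features - centroid                         # variability around the centroid
    radius = np.sqrt((deltas ** 2).sum(axis=1).mean())   # RMS spread of the manifold
    eigvals = np.clip(np.linalg.eigvalsh(np.cov(deltas.T)), 0, None)
    dimension = eigvals.sum() ** 2 / (eigvals ** 2).sum()  # participation ratio
    return centroid, radius, dimension

# The "signal" entering the SNR is the distance between two category centroids,
# against which the radii are normalized, e.g.:
# signal = np.linalg.norm(centroid_dog - centroid_cat)
```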
So let me give you an example: linear classification of a large number of categories. This is a cartoon. You have many categories that you want to categorize, plus one or minus one. This is a lot of work done with my colleagues, and primarily SueYeon Chung has done beautiful work on these types of computations. Here is an example of what the outcome is. So here you have a deep network — in this case, ResNet, I don't know, 150 or whatever.
So you see the capacity, or the separability, as a function of depth in the network; it increases incrementally, but it rises nonlinearly at a certain stage in the network. So you can say this is the point where object manifolds emerge. And if you look at the radius and at the dimension, you similarly see that, at this point, the geometry of the manifolds is getting into the shape that allows for a high capacity of discrimination.
So for the second type of computation, few-shot learning of new categories. Again, all I'm talking about is pre-trained networks, and all I'm doing is looking at downstream computations on the pre-trained network. So in few-shot learning you take the pre-trained representations induced by a few examples from new categories that the network had not seen during training, and you ask whether a downstream classifier can separate them nicely.
And surprisingly, to us at least, the performance is very high. If you take pre-trained networks that are trained for canonical ImageNet object classification, you see amazingly good performance for few-shot learning. And as you can see here, this holds across different architectures but also across different types of learning — supervised learning for object recognition, self-supervised contrastive learning, and so on. All of them perform very well at categorizing new categories from a few examples.
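A sketch of the kind of few-shot evaluation being described, assuming frozen features from a pre-trained network and a nearest-prototype rule; whether this matches the exact classifier used in the experiments is my assumption.

```python
import numpy as np

def few_shot_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Nearest-prototype classification on frozen features.
    train_feats: (m * n_classes, d) features of a few labelled examples
    from NEW categories never seen during pre-training."""
    classes = np.unique(train_labels)
    prototypes = np.stack([train_feats[train_labels == c].mean(axis=0)
                           for c in classes])
    dists = np.linalg.norm(test_feats[:, None, :] - prototypes[None, :, :], axis=-1)
    preds = classes[dists.argmin(axis=1)]
    return (preds == test_labels).mean()
```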
Moreover, the type of errors that they make on this task is consistent across different networks. So there is some universality here, which is fundamental to the way a network, after learning its own task — whether supervised or self-supervised — generates representations that are conducive to this type of computation. And down here, you see that our SNR theory predicts the empirical error on this task very well.
We can use the same computations and the same geometric measures to compare representations measured in IT cortex with deep convolutional neural networks. And what you see here is that, whether you measure the SNR from neural recordings from the DiCarlo lab or in a deep convolutional network, there is a very strong correlation between the performance, or SNR, and the different geometric measures on this task.
So this is not only performance measures, but it also gives you an underlying understanding of what are the geometric features that give rise to this performance. And this is a level that is very useful to compare different architecture, different learning, and different tasks, and also brain and networks. You can also compare the emergence of these properties of this manifold, for instance, by measuring the error as it goes down along the depth of the network.
Again, you see V1 or V4 is in the right place, so to speak, and IT cortex fits roughly what you expect from deep networks. On the other hand, if you look more closely at the effective dimension of the manifolds, you see a very strong violation, a discrepancy, between the dimensionality of V4 for images and objects and the dimensionality predicted by deep convolutional networks. And basically the message here is that, yes, we see a lot of commonality, a lot of universality, even between brains and artificial networks.
But if you look more carefully, you also discover — using this methodology — substantial discrepancies, which call for further understanding of their causes. Zero-shot learning was already hinted at by Philip. You're trying to learn to discriminate between two new visual categories based on no visual example of these new categories, simply using the language representation of these categories.
So you fit a linear mapping between the representations of the centroids of the manifolds in the feature layer of the visual system and the corresponding word embeddings in the language model. And the question is whether this mapping, once you freeze it, allows you to estimate where the prototype — the centroid — in visual space will be for new categories for which you have only the representation from the language model. And again, surprisingly, at least for us, the performance of these networks is extremely high.
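A sketch of the zero-shot procedure as I understand it: fit a linear map from word embeddings to visual centroids on seen categories, then predict the centroid of an unseen category from its word embedding alone and classify by nearest predicted prototype. The function names and the ridge regularizer are my choices, not the paper's.

```python
import numpy as np

def fit_cross_modal_map(word_embs, visual_centroids, lam=1e-3):
    """Ridge-regression map W: language space -> visual feature space,
    fit on categories that have both a word embedding and visual examples."""
    X, Y = word_embs, visual_centroids              # (n_seen, d_lang), (n_seen, d_vis)
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def zero_shot_classify(test_feats, new_word_embs, W):
    """Classify visual features of UNSEEN categories using only their
    word embeddings, via the frozen cross-modal map."""
    predicted_centroids = new_word_embs @ W          # estimated visual prototypes
    d = np.linalg.norm(test_feats[:, None, :] - predicted_centroids[None], axis=-1)
    return d.argmin(axis=1)                          # index of the predicted category
```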
And again, our SNR theory predicts the empirical pattern of errors very well. So language and vision prototypes are aligned up to scaling and rotation, which, as hinted already by Philip, is telling us something fundamental about different modalities representing the same natural concepts. Finally, I would like to show another example of the usefulness of this methodology and framework: looking at word manifolds in the speech hierarchy.
So this is now work by Shane Shang and Shailee Jain from Edward Chang's lab, where we are again looking at manifolds — but now manifolds of words like "air" or "fire", constructed from many, many utterances: the word spoken by the same person several times, or by a different person, a different gender, and so on. So these form an entire manifold of representations in the speech hierarchy.
And again, the question is whether these manifolds have the nice separation property that would allow the downstream system to actually recognize that this is the word "air", this is the word "fire", and so on. So we are doing some ongoing analysis on new Neuropixels recordings from the human brain, from patients listening to many, many sentences and words and so on.
But I want to show you here an example of an analysis on a specific speech-to-text network called Whisper, which takes the acoustic signal and — there is an encoder stage and a decoder stage — eventually generates a word: automatic speech recognition. In this case, it will correctly identify the word. And if you apply the same methodology of measuring manifolds, or the SNR for a particular task, we now see an interesting pattern of increasing performance, but in a non-monotonic fashion.
There is a difference between the encoding part and the decoding part, which tells us something about the inner workings of the network. But eventually, there is a nonlinear improvement in the manifold geometry towards the end. And we can go deeper and ask what features underlie it — the variability of the speech signal, the number of phonemes, et cetera. So we can actually go to an even finer level and use the geometry to hint at what the critical features are that characterize the differences between manifolds and the variability within manifolds.
So I would like to add that when we talk about a theory of intelligence, you can basically divide it into two main problems. One is the learning problem, and the other is representation: what is the nature of the solution that the system has come up with? And I think Philip and I talked about the second one — how do we understand the representations in different AI systems, how do we compare them to the brain, and what kind of predictions can we make.
But I think one of the big questions is how those manifolds or, in general, how good representations emerge through learning. And I think this is-- we and others have made some progress in this direction, but I think this is still a very hard problem to deal with and probably will be something that we will address in our discussion. Thank you.
[APPLAUSE]
Yeah.
ILA FIETE: All right. Well, wonderful. Great to be here. Thanks, all of you for coming. And I think I wanted to talk about a slightly different regime from the regimes that Philip and Haim have been talking about. I think they've beautifully illustrated that if you have data from which you've trained models and the data are representative of the world at large, you have enough data, then you get very beautiful structures in the representation that are cross-modal aligned.
And one thing that I'm very interested in as a neuroscientist — and also, I think, in terms of the theory of learning in deep networks and how to make those networks more efficient — is this question of sample efficiency, and a fundamental issue that comes up when you look at biological systems: the prevalence of modularity. And so what I would like to talk about is the principle of modularity for efficiency and robustness in learning, in brains and in deep networks.
So there is a story told by Herbert Simon, the Nobel laureate, to illustrate the benefits of modularity. He said: once there were two watchmakers, Hora and Tempus, who made very fine watches. The phones in their workshops rang frequently and new customers were constantly calling them. Hora prospered while Tempus became poorer and poorer. In the end, Tempus lost his shop.
What was the reason behind this? The watches consisted of about a thousand parts each. The watches that Tempus made were designed such that when he had to put down a partly assembled watch, it immediately fell to pieces and had to be reassembled from the basic elements. Hora had designed his watches so that he could put together subassemblies of about 10 components each, and each subassembly could be put down without falling apart. 10 of these subassemblies could be put together to make a larger subassembly, and 10 of the larger subassemblies constituted the whole watch.
So we also understand that in neural systems, if we want to learn compositions — colors and animals, say — and we understand colors as one independent variable and animals as another, then we can do things like imagine a red panda, even though the data set never contained red pandas before. So we don't need to see examples of all possible colors of all possible animals.
So having a factorized, disentangled, modular understanding of concepts can be very useful for being able to imagine and generalize to new situations. And in general, if the data points that you're learning from are drawn from some latent state with k dimensions that vary independently: if you learn all of that as one combined representation, you need an amount of data that scales exponentially with the dimension k. But if you learn the independent factors of variation, the amount of data you need scales with k rather than with k in the exponent. So the idea is that if you can understand the independence and modularity structure of the world, you can get by with much more modular, much more data-efficient learning.
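A back-of-the-envelope version of this argument, with illustrative numbers of my own: if the world has $k$ independent factors each taking $v$ values, then

$$N_\text{joint} \sim v^{k}, \qquad N_\text{factorized} \sim k\,v, \qquad \text{e.g. } v = 10,\; k = 10:\ 10^{10}\ \text{configurations vs. about } 10^{2}\ \text{examples},$$

since a learner that treats every joint configuration as its own case must cover all $v^k$ of them, while one that models the factors separately only needs to cover each factor's values.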
So there are many articulated reasons for modularity in the literature, spanning everywhere from the theory of evolution, to evolutionary dynamical simulations, to learning in deep networks, to other biological systems. I've listed a few of the advantages; this extensive literature from across fields says that modular solutions have enhanced robustness to sparse perturbations.
Because if you perturb a part, you're not perturbing the whole system — you're perturbing just that one module. Modularity also allows for the evolution of complex systems by allowing modifications of individual modules and parts. So for example, suppose you've got an animal with a visual hierarchy and a knob that can tune how many layers or levels deep that hierarchy is. If that knob is controlled by a gene, then evolution can simply change the value of that gene and thereby change the number of layers in that processing hierarchy. It doesn't have to rewrite the whole brain wiring network from scratch; it's a modular solution that can tweak the depth of sensory processing. Also, modularity allows for compositional generalization and sample efficiency, in the way I just talked about with red pandas.
And it's also possible to build upon the existing functional units and add functionality to the system, or recombine functionality, without redesigning the whole system. And finally, from a machine learning perspective, and from societal and regulatory perspectives, modular solutions just tend to be much more interpretable.
So now, of course, I think we all appreciate these benefits of modularity. But we haven't been that successful so far, I think, in articulating modular solutions for the problems that we task our deep learning networks with. Somehow the deep network solutions tend to be very mixed — at least the initial conditions of the network start out very mixed, the networks evolve to be very mixed over learning, and they don't tend to become modular. And also, if we build in modularity, there are challenges associated with using those modules. So I wanted to highlight two main challenges related to modularity. Challenge one: how can networks, or models, or learning systems discover modularity and self-organize to be modular?
So the reason this is a big challenge is the following. Consider a task, which is to learn a function y = f(x1, ..., xn). And this function, the actual function, decomposes into f1(x1), f2(x2), and so on — it has a factorized form like this.
So if you have just a small data sample-- so you've only seen a few examples of x and y-- then there are way more non-modular solutions to this problem than this modular solution. So there's no reason why the modular solution should be discovered, because it is that needle in the haystack in terms of the whole function space of functions that could be fit to a finite data sample. Of course, if you go asymptotically in the limit of very large number of data samples, then all of those degenerate solutions, which are the non-modular solutions, start to fall away. And maybe the model can find the modular solution.
But it takes a lot of data. What's interesting is that in biology, we see that evolution is a process that has discovered modular solutions in body systems and in the brain. And so a really fascinating challenge is: if finding a modular solution is finding a needle in a haystack, how has evolution done that? So one idea is that modularity gives rise to robustness.
So now if you train networks or systems — a learning system — to perform a computation in the presence of noise, maybe that would push the system towards modularity. So here is an example with Boolean networks, where the target is two decoupled Boolean functions. There are four inputs, x1, x2, x3, x4, and two outputs: one is a function (an AND, I think) of x1 and x2, and the other is the same function of x3 and x4. They're independent.
OK. So when you evolve these networks through genetic algorithms, you get a diversity of solutions. If you evolve the system in the absence of noise, you get these highly coupled networks: the four inputs project into a tangled mess, and then you get your two outputs out of it. But if you evolve in the presence of noise, you get these decoupled networks, each with two inputs and one output.
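A minimal sketch of the kind of noisy fitness evaluation that would drive such an evolutionary search; the target task (two independent ANDs over disjoint input pairs), the NAND-gate encoding, and the noise model are my illustrative assumptions, not the exact setup of the study being described.

```python
import itertools
import random

# Target: two decoupled Boolean functions of the four inputs (here, two ANDs).
def target(x):
    return [x[0] and x[1], x[2] and x[3]]

# A candidate network is a list of gates; gate i NANDs two earlier signals
# (inputs are signals 0-3, gate outputs are appended after them).
# The last two gate outputs are read out as (y1, y2).
def run(gates, x, noise_p=0.0):
    signals = list(x)
    for (a, b) in gates:
        out = not (signals[a] and signals[b])   # NAND gate
        if random.random() < noise_p:           # internal bit-flip noise
            out = not out
        signals.append(out)
    return signals[-2:]

def fitness(gates, noise_p=0.05, n_noise_samples=20):
    """Average fraction of correct output bits over all 16 input patterns,
    with each internal node flipped with probability noise_p."""
    score, total = 0, 0
    for _ in range(n_noise_samples):
        for x in itertools.product([False, True], repeat=4):
            y = run(gates, x, noise_p)
            score += sum(yi == ti for yi, ti in zip(y, target(x)))
            total += 2
    return score / total

# A modular solution: AND(a, b) = NAND(NAND(a, b), NAND(a, b)).
modular = [(0, 1), (2, 3), (4, 4), (5, 5)]
print(fitness(modular, noise_p=0.0))   # 1.0 without noise
print(fitness(modular, noise_p=0.05))  # degrades under internal bit flips
```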
And it turns out that you can further analyze these networks, and they form really good fault-tolerant computers — they're robust to single bit flips in the internal nodes. The other really interesting thing is that these solutions have less mutational lethality: if you have single-bit deletions or mutations, these error-correcting, modular solutions have a smaller probability of getting the output wrong, and they better survive.
They also better survive a whole sequence of mutations: if you do one mutation and then another and another, they are more robust. And in fact, this better mutational robustness means that these networks are also more evolvable, in the sense that if exploration quickly leads to lethality — a completely dysfunctional solution — you won't be able to traverse that lethal state and discover another solution that might be better. But by having this noise robustness or fault tolerance, it's possible to explore a bigger space and find even better solutions. So it's a more evolvable system in general. All right. The second challenge of modularity — the first challenge was discovering modularity and modular solutions — the second is utilizing modules if they exist.
So what do I mean by this? So now this is another very simple example of a network where there are two inputs, x1 and x2. And we're going to build in two nodes that implement a function f1 and another that implements the function f2. And our target is to attain y1 is equal to f1 of x1, and y2 is equal to f2 of x2. So just feedforward network and we just want to find this simple solution.
And u and v are both unconstrained. You supply these nonlinear functions f1 and f2, so the solution exists in the network. All it needs to do is discover that it should pipe x1 only to f1, x2 only to f2, and then f1 only to y1 and f2 only to y2. That's all it needs to discover. But if you try to train this network end to end with backprop, with u and v both free, backprop typically fails.
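A minimal sketch of this setup, assuming u maps the inputs onto the two fixed nonlinear nodes and v maps those nodes onto the two outputs, with f1 and f2 chosen arbitrarily here; whether plain end-to-end gradient descent finds the routing solution (u and v near the identity) is exactly the question being raised, so the code asserts nothing about the outcome.

```python
import torch

# Fixed, known nonlinearities for the two hidden nodes (illustrative choices).
f1 = torch.sin
f2 = torch.tanh

def hidden(h):
    # Node 1 applies f1, node 2 applies f2: the "modules" are built in.
    return torch.stack([f1(h[:, 0]), f2(h[:, 1])], dim=1)

# Trainable routing weights: u maps inputs -> hidden nodes, v maps hidden -> outputs.
u = torch.randn(2, 2, requires_grad=True)
v = torch.randn(2, 2, requires_grad=True)
opt = torch.optim.SGD([u, v], lr=1e-2)

for step in range(5000):
    x = torch.randn(64, 2)                                    # x1, x2 drawn independently
    target = torch.stack([f1(x[:, 0]), f2(x[:, 1])], dim=1)   # y1 = f1(x1), y2 = f2(x2)
    pred = hidden(x @ u.T) @ v.T
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# The modular solution is u = v = identity; inspect whether training found it.
print(loss.item(), u.detach(), v.detach())
```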
And of course, if you fix v at the correct value and train u, or the other way around, it works. So basically end-to-end backpropagation does not learn to exploit a modular solution even when it exists. All right. And so I just want to conclude quickly by saying that there are plenty of striking examples of modularity in biology, like I said earlier. You can look some of these up if you're interested, or come ask me about them later.
So there are modules that operate in parallel, and there's also hierarchical modularity, with discrete networks that feed forward into multiple processing areas. This is very familiar to most of you in the room: the visual system in mammals, including primates, actually consists of relatively few areas feeding forward into one another, with local recurrence within them. And that's really in contrast to the extremely deep networks that we have in computer vision.
And so somehow nature has committed to-- although we can say that the deep networks correspond to unrolling in time a few shallow recurrent networks, the fact is biology has committed to a small number of networks — say, five. Each one is locally recurrent, but they're largely feedforward between them. So why so few? And why five? The final insight here is that in biology we've got different learning rules, and we can use spontaneous activity.
There are rules for whether neurons are going to wire up to one another or not, which can be dependent on distance between neurons. And there are also competitive dynamics in the innervation of neurons. So if a neuron receives an input and that input is strongly strengthened, then maybe the other inputs to that neuron may be slightly weakened because they're all competing for a scarce resource, which is innervation of that neuron.
So if you have these kinds of competitive dynamics as well as some distance dependent growth rules, then it turns out that even with a completely undifferentiated cortical sheet, with all to all connectivity within the sheet, very quickly, these kinds of learning rules can give rise to hierarchical modular architecture in which an input comes in and ends up innervating only a small subregion of that undifferentiated cortical sheet, and then these neurons then innervate the next layer, and the next layer, and so on, forming discrete areas that are hierarchically connected but are small in number and discrete.
And in fact, that can mirror the visual hierarchy and also give rise to other features. So basically, there are many advantages to modularity and it's going to be very interesting, I think, going forward to think about what are the drivers of modularity and how to incorporate them into our models. Yeah.
TOMASO POGGIO: OK. Thank you.
[APPLAUSE]
ERAN MALACH: Thank you. Thanks for having me. I will talk about the power of learning with next token predictors or, as you probably all know them, language models. So I'm sure you're all aware of how great these kind of new, brand new language models are, how well they're doing in various different tasks and benchmarks. I don't need to tell you that.
I would like to point out one very interesting thing about language models is that we really train them to do something very simple. We train them to predict in parallel the next word in a sentence. We feed them massive amounts of data, and we just want them to predict the next word in the sentence. And then when we use them at inference time, all we do is feed in some question, maybe from the bar exam, and ask them to predict the probability of the first word in the answer.
We sample from this probability, feed this back into the model, and then generate the second word, the third word, et cetera, et cetera. So really feeding them questions and having them generate the output word by word. And this sort of seems like magic. We're really training them to do something very simple, and then they end up doing something very impressive and being able to solve very complex tasks.
So why is this mechanism so useful for driving the capabilities of language models? I'll try to give a partial answer for why this kind of autoregressive mechanism of predicting the answer word by word is so useful. As a motivation, I'll show you one experiment. So I'm training or fine tuning a language model on this kind of simple logical reasoning riddle. Jamie is telling the truth.
Sharon says that Jamie is telling the truth. Michael says that Richard is lying, et cetera, et cetera. And then I ask, for the last person in this list, is he telling the truth? I trained the model on this problem, and you can see that it converges — it gets 100% accuracy on this problem. It takes roughly 160,000 examples generated from this problem. I can increase the complexity of the problem by just adding more people to the list, and then it takes the model a bit longer to find the solution, but eventually it does.
It takes it around 300,000 examples. I can increase the complexity even further, and then it takes it about half a million examples until it solves the problem. You can imagine if I keep increasing the complexity of the problem, it will take it longer and longer to solve the same problem. Now I'll do something a little bit different. I will feed it not only the question and the answer, but also the step by step reasoning of how to solve this problem.
So an output that consists of the truth value of whether or not each of the people in this list is telling the truth or is lying. This is maybe the way that you would solve this very simple task. And I trained this model with the question, the output that contains both the step by step reasoning, and the final answer.
And you can see that in this case, it's able to solve all of these different complexities of the problem at roughly the same rate, very fast compared to the previous experiment. It takes a few tens of thousands of examples. So feeding the model this kind of step-by-step solution during training makes learning much faster. OK?
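A minimal sketch of how one might generate the two kinds of training data being compared (outcome-only versus with the step-by-step trace); the names and exact phrasing are my own, not the actual dataset.

```python
import random

NAMES = ["Jamie", "Sharon", "Michael", "Richard", "Dana", "Alex", "Morgan", "Lee"]

def make_example(n_people, with_steps=False):
    """Chain of 'X says that Y is telling the truth / lying' statements.
    Person 0's truthfulness is given; each later person speaks about the previous one."""
    truth = [random.random() < 0.5]               # is person 0 telling the truth?
    lines = [f"{NAMES[0]} is {'telling the truth' if truth[0] else 'lying'}."]
    for i in range(1, n_people):
        claim = random.random() < 0.5             # claims previous person is truthful?
        # Speaker i is truthful iff their claim matches the previous person's status.
        truth.append(claim == truth[i - 1])
        lines.append(f"{NAMES[i]} says that {NAMES[i - 1]} is "
                     f"{'telling the truth' if claim else 'lying'}.")
    question = f"Is {NAMES[n_people - 1]} telling the truth?"
    answer = "yes" if truth[-1] else "no"
    if with_steps:                                # process supervision: emit the trace
        steps = [f"{NAMES[i]} is {'truthful' if t else 'lying'}."
                 for i, t in enumerate(truth)]
        answer = " ".join(steps) + " Answer: " + answer
    return " ".join(lines) + " " + question, answer

print(make_example(4, with_steps=True))
```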
So this, I think, is a very nice illustration of two approaches for doing supervision with language models. So assume I have this input question. You can imagine it goes through a computational process and then generates the answer. I can write down this computational process as a kind of step by step reasoning. And I can either do outcome supervision, so only supervise the output of this computational process without giving the language model any kind of transparency into the computational process itself, or I can supervise the process.
So really give it some hints into what is the computational process that happens. And we really saw that giving this process supervision is very helpful in speeding up training and making everything converge much faster. So why is this the case and how does this relate to this kind of autoregressive mechanism of doing language models?
So in the outcome supervision setting, I'm only asking the model to produce the answer, and this answer — one word or one token — really depends on a lot of variables from the input; it depends densely on the input context. When instead I ask it to generate, word by word, the step-by-step reasoning solution and then arrive at the answer, each word in the solution only depends on a few variables. Maybe the first word here depends only on one word from the input.
And for the second line, I need to know the state of the previous person and one variable from the input. But really the dependencies are very sparse, and this makes every word here very easy to predict, given the things that you have already computed or already predicted. So imagine that you're generating the answer word by word.
Really, everything breaks down into a sequence of very simple problems. And the interesting point is that this is not just a property of this particular problem. In fact, for any computational process that you give me, I can write this kind of sparse chain of thought, a reasoning process that decomposes it into a sequence of very simple operations, and the length of this chain of thought will reflect the complexity of the computational process. I'm not going to prove this, but it follows quite simply from the basics of computer science: you can assemble any computer from very simple logical gates, or decompose any problem into a sequence of very simple problems.
Another thing we can show is that there are problems, similar to the one I just presented, that are very hard to learn if you're only given outcome supervision, but with process supervision — if I supervise the entire chain-of-thought reasoning for solving the problem — even very simple language models are guaranteed to learn to solve it. And this might explain a lot of the progress driving LLMs: we're just able to provide them with data that decomposes complex problems into simple ones.
And maybe another thing that I would like to point out is that even though process supervision is really speeding up the training of gradient descent, it's not that you cannot solve this problem just from outcome supervision. In fact, the first experiments that I showed showed you that gradient descent is able to learn all of these problems eventually. It takes it maybe hundreds of thousands of examples, orders of magnitude more data than you actually need.
But in the end, it's able to solve the problem and get the same level of accuracy. And this really relies on the ability of backpropagation and gradient descent, which drive the learning of these language models, to tweak the whole circuit in the network until it finds the correct solution. So it might take a very long time to arrive at the correct solution, but eventually it does. So it can do well with only outcome supervision, but the cost is extremely high compared to the cost of training with process supervision.
And maybe this could explain, to some extent, the kind of enormous cost of training large language models as just increasing in cost from one generation to the other, because essentially, for most of the training data that we're providing the language models with, it's mostly outcome supervision, text that is gathered from all over the internet and doesn't necessarily have this kind of step by step solutions to these complex problems. Maybe the model needs to learn to infer these solutions on its own.
OK. And just to leave some room for discussion, let me relate this to a previous talk, or more generally to neuroscience. So I think we all know, essentially, that learning in the brain is, in some sense, very different from the kind of global, synchronous optimization of gradient descent. In the brain, we understand learning more as asynchronous, local learning rules that operate independently.
And backpropagation essentially relies on this kind of global synchronization of the entire system. This seems to be very important if the only thing that you have is this outcome supervision. I give you a problem, a very complex computational process that generates the answer, and you don't see anything in the middle. To solve this problem, you really need to optimize everything together.
Maybe with more of this kind of process supervision, more transparency into the computational process, maybe you don't need this kind of global synchronized optimization and these kind of local updates are enough. And maybe this-- again, giving some wild hypotheses-- maybe this can explain the efficiency of learning for humans that see much less data and use much less energy and are able to learn, in some sense, more efficiently than these language models. OK.
[APPLAUSE]
Thank you.
TOMASO POGGIO: Let me try to tell you about a potential principle: compositional sparsity. So what do I mean? Here is, again, a series of puzzles like the ones I've shown before that you can ask. And they are essentially: why do you need deep networks, and why do networks escape, or seem to escape, the curse of dimensionality — which says that you potentially need a number of parameters that explodes with the dimensionality of the function you are trying to learn.
OK. There are related questions about generalization and about physics. But let me remind you briefly of the framework of classical machine learning theory in the deep network case. So think about using a deep CNN or such for learning to classify images. The basic framework is that you have a set of data, x and y. You have an unknown function, f_mu, that produces this data. You don't know this function.
So you are trying to learn a proxy for your data. And you want to do this by using a family of parametrized function that approximate well the unknown function. You want this to be parametrized, because eventually you want to optimize the parameter by minimizing the error on the data, which is the only one thing you have.
So the key part is to have a family of parametric function-- this would be deep neural networks, and the parameters are the weights-- that is powerful enough to approximate a very large class of functions, and to do this approximation without having a number of parameters that explodes with the dimensionality or other properties of the unknown function.
OK. So the main result I want to show you is this one: every function that is efficiently Turing computable — computable by a Turing machine in time that is not exponential in, in this case, the dimensionality of the function — for every such function, there exists a sparse, deep network that can approximate it without the curse of dimensionality.
OK. That's the main result. Let me try to explain the framework. Suppose you have a function of d variables, with d bigger than 20 or so. Then the theory says — and this has been known for many years — that an upper bound on the number of parameters you need to approximate this function with an error of epsilon in the sup norm scales like epsilon to the minus d over m, where m is some measure of smoothness, like the number of bounded derivatives of the function.
So if you take d equal to 10 and m equal to 1 for simplicity, you have epsilon to the minus 10; with epsilon at, say, 10% error, this is 10 to the 10, which is big but not so big. But if you have a small image — CIFAR is 32 by 32, so about 1,000 pixels — now you have 10 to the 1,000.
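Written out, the bound being quoted is the classical approximation-theory statement for functions with $m$ bounded derivatives (constants omitted):

$$N(\varepsilon) \;=\; O\!\big(\varepsilon^{-d/m}\big),$$

so with $m = 1$ and $\varepsilon = 0.1$, $d = 10$ gives $N \sim 10^{10}$, while $d \approx 1000$ (a small image) gives $N \sim 10^{1000}$.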
Just to remind you, the number of protons in the universe is about 10 to the 80. So what happens if you have a function that can be represented as a function of functions? Think about a binary tree, or more generally a directed acyclic graph. The function is a composition of many functions; the red dots are functions.
This function has two variables; this one has three; this one — sorry, two. Two inputs. Four inputs. So basically, for functions of this type, the number that enters the curse of dimensionality — the d in the previous result — is not the d of the full compositional function, but the maximum d among the constituent functions.
For example, in the binary tree, if each node is a function of two variables, the curse of dimensionality for that compositional function has d equal to 2. So compositional functions can avoid the curse of dimensionality if the constituent functions are sparse. That's one result that we proved a few years ago. And the second theorem is that efficiently Turing computable functions are compositionally sparse.
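A concrete instance of the binary-tree case, in my own notation: with eight inputs,

$$f(x_1,\dots,x_8) \;=\; h_3\Big(h_{21}\big(h_{11}(x_1,x_2),\,h_{12}(x_3,x_4)\big),\; h_{22}\big(h_{13}(x_5,x_6),\,h_{14}(x_7,x_8)\big)\Big),$$

every constituent $h$ is a function of just two variables, so approximating each node costs on the order of $\varepsilon^{-2/m}$ parameters, and the whole network costs roughly the number of nodes times that, rather than $\varepsilon^{-8/m}$ (glossing over how the approximation errors propagate through the composition).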
And if you think about it, a Turing machine can be represented, at the end, as a very deep series of conjunctions and disjunctions. So that's the basic intuition: you can compose complex computations out of simple ones. Think about a program and rewriting it in terms of simple subroutines. So it turns out that sparse compositionality is almost equivalent to computability, at least in the efficient case.
And so after 20 years, I have an answer to this question. We had a paper about the theory of shallow networks — kernel machines, one hidden layer — and at the time we had no understanding of why we needed depth, in the brain or in artificial networks. Now: you need depth if you want to represent a large class of functions while still being able to approximate them very well, without the curse of dimensionality.
OK. So there is, furthermore, another result that says that if you have sparsity — if you have a function that is sparse and you assume that a layer of the network represents each constituent function — then you conclude that each of the hidden units or subnetworks should have a small number of effective inputs.
And if you have that, we have a separate proof that you can get a bound, through Rademacher complexity, on the test error of a deep network that is several orders of magnitude better than the standard ones. So sparsity seems to be important for generalization in this case. It's an open question whether it plays a role in optimization, which is, of course, the most open area of machine learning today. So let me finish here, and I think we can now assemble all five of us and try to answer questions from each other and from the audience. All right.
[APPLAUSE]
Let me start. I think we spoke — all together as a panel — about the formation of representations; I think that was you and Haim, grouped together quite accidentally, but in a good way. About the evolution of architectures that support representations. And about principles for transformers or large language models — the autoregressive principle — which is related to compositionality. So let's start with feature representations.
I think this is a question about optimization. I don't know if we agree, but it's the question, if you focus on deep networks, of how the features — I don't like the term features, actually — how the output of each layer and the weights at each layer change across layers and across iterations. And I don't know whether the manifold hypothesis can say something about that or not, because, in a sense, it addresses the end result. Right, Haim?
HAIM SOMPOLINSKY: Well. Me? OK. As I hinted, I think it is a very important problem and largely open, in my view. We have made some progress in understanding how each motif of a deep network — nonlinearity, pooling, convolution, et cetera — changes the geometry of the feature representation. But still, I think the overall picture is open.
And I would like to highlight the reason why. I may be wrong, but I think a large part of the progress in understanding the theory of learning and generalization — there are many, many approaches to it, and Tommy has described one of them — but another line of approach uses wide networks and the notion of kernels as a way to make theoretical advances on how solutions and representations emerge.
And I think we understand very well what is known as the lazy regime, where learning is basically doing a small fine-tuning of underlying random weights. The network is big enough — wide, deep, and so on — that it can deal with the training problem, and, as you mentioned, Tommy, it has enough regularization or inductive bias to yield reasonable generalization. But this type of solution or architecture will not do the job in terms of the representations that I described, and I think also the ones that you described.
So the emerging representations are nearly random — not entirely random, because there is some structure in the underlying input, but not much more than that. So I think the actual real-life, high-performing networks are really living in a different regime, either because it's a non-lazy or rich regime, or because the amount of data, the structure of the data, and the task put them in a different regime.
And to me that's kind of a key point: the progress that the theory of learning has made in the last, I don't know, 5 or 10 years in certain directions has not captured the emergence of the very high-quality representations that we see in deep networks that solve real-life tasks.
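The lazy regime referred to here is usually formalized by linearizing the network around its random initialization (the neural tangent kernel picture), so that training reduces to kernel regression with a fixed, initialization-dependent kernel and the internal features barely move:

\[
f(x;\theta) \;\approx\; f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}(\theta - \theta_0),
\qquad
K(x, x') \;=\; \nabla_\theta f(x;\theta_0)^{\top}\,\nabla_\theta f(x';\theta_0).
\]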
TOMASO POGGIO: Well, that's a question for optimization. How does that happen? Clearly kernel machines suffer from the curse of dimensionality, so they cannot be general enough. I wish I had known that 20 years ago. Unless you use a modified kernel-- what we called, many years ago, hyper BF, and then forgot about-- where you have, for instance, a Gaussian, but with a learnable Mahalanobis-type distance. Then you can avoid the curse of dimensionality.
HAIM SOMPOLINSKY: Sure. But I think RBFs will not make much difference. In the context that I was talking about, you can think of them as just different nonlinear units that compute the RBF. The problem is that, in the context I was speaking about, there is no assumption about using kernel machines or RBFs and so on.
It is an emergent property of wide deep networks, in a certain regime of data, that more or less random kernels emerge-- even if the units are RBFs, the kernels of the RBFs will be random. In other words, there is not too strong a pressure for the network to actually build a very high-quality representation.
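As a rough illustration of the hyper basis function idea mentioned above (a minimal sketch with illustrative names, not code from the panel): a Gaussian unit whose center and Mahalanobis metric M = WᵀW are both learned.

```python
import torch
import torch.nn as nn

class HyperBF(nn.Module):
    """Gaussian basis functions with a learnable Mahalanobis metric per unit."""
    def __init__(self, in_dim, n_units):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(n_units, in_dim))          # learnable centers c_k
        self.W = nn.Parameter(torch.eye(in_dim).repeat(n_units, 1, 1))     # metric factors W_k

    def forward(self, x):                                  # x: (batch, in_dim)
        diff = x.unsqueeze(1) - self.centers               # (batch, n_units, in_dim)
        proj = torch.einsum('kij,bkj->bki', self.W, diff)  # W_k (x - c_k)
        return torch.exp(-(proj ** 2).sum(-1))             # exp(-||W_k (x - c_k)||^2)
```

Because the metric is learned, each unit can stretch or shrink along task-relevant directions, which is what distinguishes this from a plain fixed-width Gaussian RBF.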
TOMASO POGGIO: There are a lot of questions we could discuss, and a lot of open problems. Let's focus on the ones that sit between the mathematics of artificial networks and neuroscience. So one problem, I think, is a missing piece-- a gap at the moment between deep networks as they are engineered and neuroscience. We don't know what kind of optimization algorithm in the brain could replace SGD or other gradient descent techniques. Does anybody have an idea?
ILA FIETE: I mean, I guess I could take one stab at it. I think it's related to one point Haim was making at the end, which is that we don't know. I mean, one way to look at it is in terms of representational learning and feature learning. And it's true that if you have a very wide network with weak scaling of the weights, then you don't get rich feature learning to emerge in these neural tangent kernel theory limits. But the brain clearly has very rich representations.
And I think it's also related to other architectural emergent properties and other ways in which the brain understands the world, which is that we-- I think if we see data, we see pixel data, we see a ball running into a wall, we're not thinking in pixel space and we're not finding dense models at the pixel level of interactions. We're actually assuming that there are sparse causes.
There's an object, a ball, and it's moving coherently as one entity, running into the wall and bouncing back. So we tend to infer sparse causes even from dense observations. I think all of these things-- the fact that we have rich feature learning, the fact that we have a sparsity bias in thinking about causes, and the fact that we tend to favor disentangled representations-- these are not biases that backprop contains.
And somehow I think it is indeed, as you say, Tommy, about learning rules. And it could very well be that-- it's a very difficult problem, because we know that if you're not doing gradient learning of some type, then a difficult problem is hard to solve unless you're moving along the gradient. But at the same time, there's--
TOMASO POGGIO: One option is that there is an alternative, an implementation of gradient techniques in the brain. Possible. Who believes that? I believe it.
ILA FIETE: I mean, I think there's no alternative. What do you all think? I think there has to be gradient learning of some type, but complemented with other things. There can be other pressures too.
TOMASO POGGIO: Yeah. I mean, the consensus seems to be that it's not biologically reasonable to expect exact backpropagation in the brain. But I think there are good alternatives. The other option is something completely different, like learning one layer at a time.
ERAN MALACH: Yeah. I mean, we take it for granted that backpropagation is the only way to optimize neural networks, but other methods have been explored with different degrees of success. Optimizing one layer at a time can be competitive in certain situations. I feel like it's not a question of whether or not you're using gradients-- taking derivatives-- but whether all the optimization is synchronized throughout the network, or whether you have something that's more like local optimization.
I think that for optimizing neural networks we rely on this mechanism of synchronized optimization of the entire circuit because data is cheap and compute is cheap. It's the thing we have, and it's easier to throw more data and energy at the problem and solve things with gradient descent. There could be an alternative algorithm that is maybe just as good, but we haven't discovered it, so we're using what we have.
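A minimal sketch of the layer-wise alternative mentioned here, assuming a simple local classification objective per layer (an illustrative setup, not a method described by the panelists):

```python
import torch
import torch.nn as nn

def train_layerwise(layers, heads, data, epochs=1, lr=1e-3):
    """Greedy layer-wise training: each layer gets its own local head and loss,
    so no error signal is backpropagated through the earlier (frozen) layers."""
    loss_fn = nn.CrossEntropyLoss()
    for i, (layer, head) in enumerate(zip(layers, heads)):
        opt = torch.optim.SGD(list(layer.parameters()) + list(head.parameters()), lr=lr)
        for _ in range(epochs):
            for x, y in data:
                with torch.no_grad():            # earlier layers act as a fixed encoder
                    for frozen in layers[:i]:
                        x = frozen(x)
                loss = loss_fn(head(layer(x)), y)
                opt.zero_grad()
                loss.backward()                  # gradients stay local to this layer + head
                opt.step()
```

The point of the sketch is only that each update uses information local to one stage, rather than a gradient synchronized across the whole network.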
PHILIP ISOLA: Yeah. I guess my thought on this would be: what do you mean by gradient descent? Almost everything, in some sense, is a local move that goes toward a lower loss. But I have a postdoc, Jeremy Bernstein, who's been teaching me a lot about this-- the gradient descent algorithm is one specific way of doing a local perturbation that minimizes the loss, and there's actually a whole family of other steepest descent methods. So they're all gradient descent from my prior way of thinking about it. But I guess--
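For context on that family: steepest descent under a general norm takes the unit-norm direction most aligned with the negative gradient, and different norms recover different familiar updates:

\[
\Delta w \;=\; -\,\eta\,\operatorname*{arg\,max}_{\|d\|\le 1}\ \langle \nabla L(w),\, d\rangle,
\]

so the Euclidean norm gives (normalized) gradient descent, while the $\ell_\infty$ norm gives sign descent, $\Delta w = -\eta\,\mathrm{sign}(\nabla L(w))$.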
TOMASO POGGIO: The question is whether there is a biologically plausible implementation that uses only Hebb-like rules, things we know the brain or synapses in the brain do.
PHILIP ISOLA: So I guess my thought would be that it will be a local perturbation that moves toward lower loss, but it might be very different from the gradient given by backprop. And I'm not super optimistic myself that it will be better than backprop. There have been so many attempts at coming up with local learning rules that are better, and that hasn't panned out yet. But at least it would be biologically interesting, and maybe the benefit would not be so much in optimization but more in how it regularizes-- an implicit bias toward a different type of representation that's learned.
TOMASO POGGIO: I think this is a very interesting gap that, if we could fill, would unify research in neuroscience and in artificial networks. Yeah?
HAIM SOMPOLINSKY: I would like to comment on that. I think we have to be careful about what we are comparing to what. If we are comparing artificial neural networks for basic visual recognition or language to vision or language in the brain-- I mean, the brain has evolved over millions of years. Are we trying to compare SGD and backprop to evolution?
This is why I'm going back, I think, to the representation. We have a pre-trained network in our mature brain, and we have pre-trained networks that got there through SGD. But I don't think trying to approximate SGD is actually the relevant scientific question in these cases. If you really want a fair comparison, we have to take a task that a mature brain is learning-- maybe a random task, so to speak, for the mature brain-- and then compare how a pre-trained network does on that.
Because then you can actually look both behaviorally and, for animals, also neurally, and make the comparison. Otherwise, the brain has so much of an advantage. I mean, otherwise you start from scratch-- even if the network is deep, you start from some random weights.
TOMASO POGGIO: So one implication of what you say is that a lot of people are using deep neural networks to build models of the brain, if they--
HAIM SOMPOLINSKY: They're doing it, yes.
TOMASO POGGIO: Right. So why do you think that's right?
HAIM SOMPOLINSKY: Because I don't care how the two systems arrive at the solution; I care about the solution. And we find that different networks with different learning algorithms-- they may all use SGD, but one of them is contrastive and didn't see any labels, another one is supervised, and so on and so forth-- have very similar properties in the solutions they arrive at. So this is why I think it is fair to take as a tentative hypothesis. I'm not saying--
TOMASO POGGIO: But I assume you agree that synapses do learn during the adult life, right?
HAIM SOMPOLINSKY: But learn what? Vision? What do they learn? I mean, you have to take a task which is really not a natural cognitive task that we are just born with and naturally develop. If you are really interested in the question of learning, then you have to take those tasks. And there are tasks like that.
From what we know-- at least most of what we know-- it is really reward-based learning: you take a mature animal and it learns some new navigation task, say. And reward-based learning is an optimization; there is an objective. But if you look at the algorithms, it's more like local perturbation and exploration-- more exploration than exploitation.
TOMASO POGGIO: Anybody wants to add something?
PHILIP ISOLA: So I completely agree about the importance of the pre-training and the representation. Maybe just to emphasize that a little more: it's always been kind of confusing to me when people compare the learning algorithm in the brain and the learning algorithm in deep nets. It's apples to oranges if they're not pre-trained to the same degree. One point to drive that home: I think there are theoretical results stating that any learning algorithm, for some definition, can be approximated by gradient descent on a pre-trained representation.
That's universal in some sense. Chelsea Finn and Sergey Levine had this in one of their works on MAML: they show that a pre-trained representation plus n steps of gradient descent can approximate any learning algorithm. So if we don't take into account what the pre-trained representation is, there's not much we can say about the learning algorithm and its efficiency.
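A minimal sketch of that setup-- a fixed pre-trained representation plus a few gradient steps on a small task-specific head; the names are illustrative and this is not Finn and Levine's code:

```python
import torch
import torch.nn as nn

def adapt(encoder, head, task_data, steps=5, lr=1e-2):
    """Few-step adaptation: keep the pre-trained representation fixed and take
    n gradient steps on a task-specific head; here the 'learning algorithm'
    is just gradient descent on top of the pre-trained features."""
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(head.parameters(), lr=lr)
    for _ in range(steps):
        for x, y in task_data:
            with torch.no_grad():
                features = encoder(x)        # frozen pre-trained representation
            loss = loss_fn(head(features), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```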
TOMASO POGGIO: I personally think that if you put one of us as a baby in the forest, no parenting, no teachers, we would be very much like a monkey. So what I'm saying is there is a lot of learning going on and what we call intelligence is learning based on what mankind has done, written over centuries or millennia.
ILA FIETE: That's interesting. I mean, I would almost take the opposite tack. As a parent, I would say that there's very little that is imparted from the environment. Of course, the things that we value as human beings, like cultural learning, literacy, all of that stuff is surely stuff that you learn from your culture. But just looking at people coming up and growing up in very diverse environments and countries and levels of affluence and stuff like that, I guess I would almost say that there's very little learning that we do in a lifetime. My sense would be that it really is all evolutionary time scale learning, and there's just very little that we learn on top of that. So that's--
HAIM SOMPOLINSKY: But again, if we're really serious about this problem-- and I think it's a fundamental problem-- we have to be practical about it and not ask how vision or language evolved or developed, and so on. I think it will be hard to actually solve the problem that way. We have to take a mature neural network and a mature brain and test how these two systems, on the basis of their pre-training, learn a new task. Then we can make a fair comparison and ask--
ILA FIETE: How can we do that if the initial-- the pre-trained architectures are really different? Right? Because then it's the pre--
HAIM SOMPOLINSKY: It's a feature representation which is very similar.
ILA FIETE: OK, so you're saying hopefully they're similar on some level.
HAIM SOMPOLINSKY: Let's see what happens.
TOMASO POGGIO: What I'm saying is where is learning in the brain? What is the algorithm for learning in the brain?
HAIM SOMPOLINSKY: So I can give you an example, which is hyperacuity perceptual learning, a classical psychological and neural paradigm of learning. You take a mature sensory system-- a mature visual system-- and now you take an animal or a human and ask it to do very fine discrimination between two nearby things.
So this is the example. Now you have a mature perceptual system, in a deep network or whatever you choose, and in the real brain. And now you can ask: OK, so how do I learn it? You can ask about gradient descent, but you can also ask where the learning occurs. Is it in the readout? Is it in the input? You know? And then--
TOMASO POGGIO: We did that.
HAIM SOMPOLINSKY: We also did that. Thank goodness. What is the answer that you got?
ILA FIETE: You guys know.
HAIM SOMPOLINSKY: What is the answer that you got-- where does learning occur in this task, hyperacuity perceptual learning?
TOMASO POGGIO: It can be quite simple. One layer. Yeah. It can be.
HAIM SOMPOLINSKY: Where?
TOMASO POGGIO: Well--
HAIM SOMPOLINSKY: Where? Where in the brain?
TOMASO POGGIO: In the brain.
HAIM SOMPOLINSKY: No, where? At which stage?
TOMASO POGGIO: The point is what kind of algorithm. Do you need something like backpropagation across more than one layer, or do you need a very local rule? Right? And if it's more than one layer, where are the plausible circuits that do that? I think you need more than one layer.
HAIM SOMPOLINSKY: I can tell you that for some classical perceptual learning paradigms, what we find, surprisingly-- at least in deep networks; in the brain it's still controversial-- is that you don't mess with the readout. The readout is actually fixed. The deep layers are fixed. You actually have to go to one of the early layers and change it. Let's say V1.
Just change V1 and keep everything else fixed. And if you do the opposite-- if you fix the representation and just build a specialized readout-- you're not going to solve the problem. Counterintuitively. So here's an example. And in the brain, by the way, it's controversial.
Some labs find changes in V1, some labs claim there are none, and some labs find changes that are not really good for the task-- in other words, changes that are not really helpful for the task. In any case, this is the kind of thing that is concrete enough that we can argue about this algorithm or that algorithm, how to do it in--
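In a deep network, the kind of experiment described here can be set up by freezing everything except an early stage; a minimal sketch with illustrative layer names (not the panelists' code):

```python
def freeze_all_but_early(model, trainable_prefix="layer1"):
    """Freeze the readout and deep layers of a (PyTorch) model; leave only an
    early, V1-like stage plastic, mimicking the perceptual-learning setup above."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefix)
    return [p for p in model.parameters() if p.requires_grad]

# Usage sketch, assuming a torchvision-style ResNet with blocks layer1..layer4 and an fc readout:
# import torch
# from torchvision import models
# model = models.resnet18(weights="IMAGENET1K_V1")
# plastic_params = freeze_all_but_early(model, trainable_prefix="layer1")
# optimizer = torch.optim.SGD(plastic_params, lr=1e-3)
```

The reverse condition-- fixed representation, trainable readout-- is the same sketch with `trainable_prefix="fc"`.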
TOMASO POGGIO: Yeah. What I'm saying is an open question in the brain where--
HAIM SOMPOLINSKY: But this is concrete enough that I can imagine we can ask our experimentalist colleagues to actually make good measurements and shed light on these kinds of questions.
TOMASO POGGIO: I think it would be good to have plausible ideas first and then do the experiments. Anyway, I'm suggesting this is a very interesting open problem. That's all. Let's see. Maybe we should ask the audience to come in with questions. Yes? Yeah?
AUDIENCE: Yeah. I have a question about the anatomy and structure of the neural net, because you mentioned that there's anatomical and biological evidence showing that even a locally connected shallow network can mimic the performance of a pretty dense neural net. So I wonder, what is the challenge in replicating this in machine learning? My question is, why can't we do that right now? What is the challenge in replicating shallow but locally, recurrently connected layers, like a DNN?
TOMASO POGGIO: Right.
ILA FIETE: That's a great question. So actually, I think conventionally what has not been done is building in local recurrence. So the conventional answer is that the extremely deep networks for visual processing are like rollouts of recurrent processing. And so instead, what biology seems to be doing is it seems to have a few feedforward steps but with local recurrence in each of them.
And computationally, I think there's no fundamental obstacle to trying to model such a circuit anymore. It's possible to train a network that is a few layers deep with lateral connectivity within each layer. And actually, that's an effort that's ongoing in my group and in a few other groups. I think Jim DiCarlo's lab is trying to-- they've built a shallower network, but it doesn't have the lateral recurrence in a fully realistic way.
But I think they're working on that. There are some technical challenges: the interesting question is what the right model for the recurrent connectivity is, and training is harder. But yeah, we're finding results competitive with much deeper architectures using these sparsely, laterally connected shallower networks.
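A toy sketch of that kind of architecture-- a few feedforward stages, each with a few unrolled steps of lateral (within-stage) recurrence; the module names are illustrative, not any group's actual model:

```python
import torch
import torch.nn as nn

class LaterallyRecurrentStage(nn.Module):
    """One feedforward stage followed by a few unrolled steps of lateral,
    within-stage recurrence -- a toy version of 'few layers, local recurrence'."""
    def __init__(self, in_ch, out_ch, steps=3):
        super().__init__()
        self.feedforward = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.lateral = nn.Conv2d(out_ch, out_ch, 3, padding=1)   # local recurrent kernel
        self.steps = steps

    def forward(self, x):
        h = torch.relu(self.feedforward(x))
        for _ in range(self.steps):                              # unrolled lateral recurrence
            h = torch.relu(self.feedforward(x) + self.lateral(h))
        return h

# A shallow stack of a few such stages, instead of many purely feedforward layers:
# model = nn.Sequential(LaterallyRecurrentStage(3, 32), LaterallyRecurrentStage(32, 64))
```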
AUDIENCE: I can ask my question. So this is a question for Ila. You mentioned that the brain has emerged in a modular way-- it's modular, right? And the primary reason you gave is noise: if you have noise in your search space and in how you're searching, then that forces you toward a modular solution that's more robust to the noise.
But it seems like that can't be the only driver of modularity; there have to be others, because if it were, we would already have come up with deep learning algorithms that put noise in the search process, or something like that. So I guess my question is, what are the other main drivers of modularity that you see? And then, can we use them to create better modular networks for deep learning?
One main one that I see is the developmental process, because in real life you have the constraint of being compressed into DNA, so you can't encode every weight; you can only encode the developmental process. And that means you can basically only encode modular structures. So are there other drivers of modularity that evolution found?
ILA FIETE: Yeah, that's a great question. And that's kind of the program that I'm hoping many of you will be excited about and want to work on-- we should talk. But yes, indeed, noise is just one of the drivers of modularity. In the literature there are other proposed drivers of modularity emergence: for instance, if you've got tasks that are by nature compositional, with many reused subtasks, that's one other driver of modularity.
Other drivers of modularity can be just spatial constraints, which are that developmentally, each neuron only makes connections locally and not further away, and so that forces connected and functionally related neurons to be physically close together and that encourages more modular solutions. Another one is competitiveness, like competition in wiring so that you're forced to prune weights that don't really contribute to the task.
There may also be learning rules that are not backprop but more local, where you choose the neuron that's doing best at a task and update only its weights-- do a step for that neuron, but not for all the neurons in the circuit. There's a huge number of potential drivers of modularity, and I think it's going to be very interesting to mix and match them; they can have very synergistic effects, not just combining linearly but accelerating the dynamics of a system toward modularity.
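One way to make the spatial-constraint driver concrete is a distance-weighted wiring penalty; a toy sketch, assuming each unit has been assigned a 2D position (an illustration, not something the panel specified):

```python
import torch

def wiring_cost(weight, pos_in, pos_out, p=1):
    """Distance-weighted penalty on connections: long-range weights cost more,
    which pushes solutions toward spatially local, more modular wiring.
    weight: (n_out, n_in); pos_in: (n_in, 2); pos_out: (n_out, 2)."""
    dist = torch.cdist(pos_out, pos_in)          # (n_out, n_in) pairwise distances
    return (weight.abs() ** p * dist).sum()

# Usage sketch: total_loss = task_loss + lam * wiring_cost(layer.weight, pos_in, pos_out)
```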
HAIM SOMPOLINSKY: Are convolutional neural networks not enough?
ILA FIETE: I mean, they are in some sense, because they've got spatially local kernels. So you could say they're kind of modularized in how they process local pixel space, and we've built that in by hand, which is good. I think that's clearly one of the strengths of convolutional networks over MLPs in the visual domain. But then they're not modular in the sense of depth-- there are a lot of layers, 50 or 100, almost a continuum in the limit of very deep networks. I would say the biological circuit is more modular because there are, say, five layers with recurrent connectivity within them.
AUDIENCE: So, excellent panel. My question has to do with-- and I come from left field a bit on this, so please excuse anything that was implicit or already said-- but my question has to do with supervised learning versus unsupervised learning. My main question about the state of the art of the field is, why are we looking only at supervised learning?
What is happening with unsupervised learning? In particular with respect to sparsity or modularity, however you want to call it, and efficiency, which are the two top themes that were talked about today. I would like to hear your thoughts on unsupervised learning-- or experimentation, if you want to call it that-- versus either sparsity or efficiency.
PHILIP ISOLA: Maybe I can start with just unsupervised versus supervised. I would say that the line between them has gotten quite blurry. So there's something called self-supervised learning, which is applying the tools of supervised learning to predict raw unlabeled data. And that's kind of the main paradigm right now for how you pre-train these models.
So you arrive at a really good vision system just by training it to predict missing pixels: mask out some of the pixels and predict them. You come up with a really good language model by just masking out some words and predicting them. And people call that self-supervised learning. It's not supervised toward the final task you'll use it for-- it's not supervised for, say, classifying my emails into spam or not. It's just supervised to predict missing data. So I think that's the main framework right now behind these models.
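A minimal sketch of that masked-prediction objective-- generic, not any particular model's recipe:

```python
import torch
import torch.nn as nn

def masked_prediction_loss(model, tokens, mask_id, mask_prob=0.15):
    """Self-supervised objective: hide a random subset of tokens (or patches)
    and train the model to reconstruct them from the surrounding context."""
    mask = torch.rand(tokens.shape) < mask_prob
    corrupted = tokens.clone()
    corrupted[mask] = mask_id                      # replace masked positions with a mask token
    logits = model(corrupted)                      # (batch, seq_len, vocab)
    return nn.functional.cross_entropy(
        logits[mask], tokens[mask]                 # loss only on the hidden positions
    )
```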
AUDIENCE: It's different from experimentation, though, somehow. I mean-- maybe it's not. Maybe I'm thinking about it the wrong way. If I try to experiment in the world, walk around, and things like that.
PHILIP ISOLA: Oh. Oh, yes. So, right-- interactive learning, or learning by experimentation. That's not part of the way these things, like language models and foundation models, are trained right now. But I agree that active learning like that is fundamentally different.
HAIM SOMPOLINSKY: I think in reinforcement learning paradigms there is a strong component of that exploration. And in some versions of reinforcement learning, part of the learning is driven by curiosity-- exploring the world and gaining information about it. But ultimately it's only a component of something that has a reward and a goal: reinforcement.
In my own view, the problem with unsupervised learning is that the space is large and, because it's not supervised, we lack rules to guide us-- whether Hebbian or anti-Hebbian, competitive or cooperative. We can experiment with different unsupervised learning paradigms, but it's hard to come up with something that works; it may work for one problem and not for another. So I think the attraction of supervised learning, even in the self-supervised fashion, is that there is ultimately an objective function, which is--
AUDIENCE: Would it not be much more efficient to use this kind of experimentation for unsupervised learning?
HAIM SOMPOLINSKY: Yeah.
ERAN MALACH: I think reinforcement learning, or this type of experimentation, is getting more and more into the domain of language models. There's reinforcement learning from human feedback, which is becoming more dominant. You let the model generate sentences-- you can think of these as novel explorations of new solutions-- and then they get rated by human evaluators. So there is, I think, this component of what you're describing. It's not necessarily interacting with the physical world, but you do interact with a human through that channel.
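For reference, the RLHF objective mentioned here is commonly written as maximizing a learned reward while staying close to the pre-trained reference model:

\[
\max_{\pi}\; \mathbb{E}_{x\sim\mathcal{D}}\Big[\, \mathbb{E}_{y\sim\pi(\cdot\mid x)}\big[r(x,y)\big] \;-\; \beta\,\mathrm{KL}\big(\pi(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big) \Big],
\]

where $r$ is a reward model fit to human preference ratings and $\beta$ controls how far the policy may drift from the reference model.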
TOMASO POGGIO: I would say that most of the architectures at the moment are a combination of the supervised learning framework, even when there is no explicit label, and reinforcement learning. AlphaFold, AlphaGo, all the large language models-- they use, in various forms, the supervised framework and, for instance, human reinforcement for fine-tuning, and so on. I think we are told that the food will disappear if we don't go there. It may have already disappeared. So I think we should thank all of our panelists and--
[APPLAUSE]
ERAN MALACH: Thanks, Tommy.