Panel Discussion: Open Problems in the Theory of Deep Learning
Date Posted:
August 29, 2024
Date Recorded:
August 10, 2024
CBMM Speaker(s):
Tomaso Poggio, Brian Cheung
Speaker(s):
Jacob Andreas, MIT; Eran Malach, Harvard University; Santosh Vempala, Georgia Tech
Brains, Minds and Machines Summer Course 2024
Description:
Moderator: Brian Cheung, MIT
Panelists: Tomaso Poggio, MIT; Jacob Andreas, MIT; Eran Malach, Harvard University; Santosh Vempala, Georgia Tech
BRIAN CHEUNG: Today, we have the pleasure of having a panel session, and I'll be your moderator. So I'm Brian. And I'll have the speakers introduce themselves to make sure that we all know their backgrounds and the context they're answering from.
SANTOSH VEMPALA: So I'm Santosh Vempala. I'm at Georgia Tech. I study algorithms, and I'm interested in the complexity of computational problems. And for the past decade, intensively, I've wanted to understand the brain rigorously.
ERAN MALACH: Hi, I'm Eran Malach. I'm a research fellow at the Kempner Institute at Harvard, working on ML theory, the theory of deep learning, and, more recently, starting to think about a theory of language models.
JACOB ANDREAS: Hi. I'm Jacob Andreas. I'm at MIT. I'm going to be giving a talk tomorrow and thinking about natural language processing.
TOMASO POGGIO: And I'm Tomaso Poggio. I like compositionality. And I'm interested in optimization-- whether it works, and when it works, for compositional functions.
BRIAN CHEUNG: Great. So I think, for the format, we're going to have the audience also ask questions. But in the meantime, while the audience is preparing their questions, I'll ask maybe a first question, which is: how have research priorities shifted now that we have these kinds of huge models, and even larger data sets for these models? Has the paradigm changed for how you think about your own work?
TOMASO POGGIO: Jacob?
JACOB ANDREAS: Yeah. So I think maybe the biggest thing that's changed for me is that even when I was in grad school, not so long ago, a lot of the questions about how you build computational models of human language processing and human language acquisition, and how you build models that generalize in human-like ways from human-sized data sets, felt like they were sort of on the critical path to also accomplishing all of our engineering goals and building language technologies that were actually useful for doing things out in the world.
And it does feel like those things have diverged a little bit, in the sense that there's still a big role for computation to play, for example, in our understanding of human language use, and language processing, and language acquisition. But I feel like now, progress on those things is not necessarily progress towards GPT-5, or whatever the next big LLM is going to be.
TOMASO POGGIO: Yeah. For me-- I thought that, until transformers came around, the architecture to be considered was basically deep convolutional networks. And now, there are a number of other architectures that are intelligent, to some degree. So there is a real zoo of intelligent architectures. And the question becomes, what are the fundamental principles that make them work so well?
It's not just a single architecture. And there may be many more that I'm pretty sure that people have not yet thought about.
BRIAN CHEUNG: Great. So, then, do you think transformers are the end of the architecture search? Or do you think there's something that we will-- or an intermediate process--
TOMASO POGGIO: If it's the end of the road--
BRIAN CHEUNG: Yeah.
TOMASO POGGIO: --I'd be very surprised, also because they clearly have serious limitations. Just look at the last-- I don't know if it's a paper, but the achievement of DeepMind, with a silver medal at the Math Olympiad. That's a combination of a transformer and other things, including reinforcement learning. I think transformers by themselves, my guess, are not enough to get the best possible intelligence.
ERAN MALACH: Yeah, I think for me, things are kind of changing all the time, and the paradigms are changing all the time. Yeah, as Tommy mentioned, a few years back, it was convolutional networks with BatchNorm and Dropout. This is what everyone did. And the networks were highly overparameterized.
And then there was a lot of learning theory focused on explaining why BatchNorm is so important, or why Dropout is so important, or why ConvNets or overparameterized networks are super important. And now, we're training on basically infinite data. We're not doing BatchNorm. We're not doing Dropout. And the architecture is completely different. And this is within a few years.
So I think, definitely, we should not stick to the architecture or the paradigm that is working right now and say, this is how things should work, because in just a few years, everything is changing. And we want to maybe try to understand or explain the fundamental principles and not necessarily what we kind of got working right now with the current hardware and technology.
SANTOSH VEMPALA: Yes. You still want me to answer the first question? Yeah, sure. So, yeah, I've always thought that it's just a matter of time before computers are going to be just as smart as humans, or smarter. They're machines after all. And so is the brain. But it was very hard to convince a lot of people.
And now, with transformers, at least now you can say, hey, look, it's doing what you thought was impossible. So that's one great aspect, in life in general. Concretely, I love the problems that come up. Why is next-token prediction-- doing very well at that-- giving you so much ability, so much power?
I think it's a question I don't understand.
JACOB ANDREAS: Can I just-- one more thought, responding to all those things and also to your initial question. I feel like maybe the biggest thing that's changed, at least in the part of the field that I sit in, is that it's gone from being an engineering discipline to something that feels more like a scientific discipline. It used to be that you would come up with some clever new model for task x, and in writing down that model, you would also have a pretty precise description of the kinds of things it was going to be able to do, the kinds of things it wasn't going to be able to do, and the kinds of ways in which it was going to generalize or not generalize.
And now, we have these sort of weird computational artifacts that do all these things that we didn't expect them to be able to do and are much more in the mode of just trying to describe and explain rather than design from first principles, what's going on. And I feel like, actually, a lot of the way we even just train people to do research in the field hasn't caught up to that.
BRIAN CHEUNG: OK, so then related to that, I guess, in terms of things that haven't caught up, what common assumption in deep learning do you think needs to be abandoned, whether it be in theory or just deep learning in general? I think everyone has to answer this question, so no one can avoid it.
SANTOSH VEMPALA: Do you have candidates in mind?
BRIAN CHEUNG: Well, there's examples-- like Dropout was something that was very popular, and then it kind of--
ERAN MALACH: Dropout.
BRIAN CHEUNG: So there's a history of things that didn't stand the test of time, I would say. But do you have your own personal belief that something that's popular now that might not stand in a few years from now? Think. Eran?
ERAN MALACH: I would hope that we would go beyond backpropagation and gradient descent-- maybe this is an extreme view-- but it seems that, with all the research that people in the field are doing, it's all very focused on training neural networks with gradient descent. And we have no other algorithm, at least for these kinds of problems.
And the brain was mentioned in previous talks. And I think it's maybe pretty clear the brain is not doing backpropagation gradient descent. So it seems that there's a better algorithm out there, and we just need to find it. So maybe this is kind of the big thing that maybe not in a few years, but looking into the future, maybe this is something that would change.
BRIAN CHEUNG: [INAUDIBLE] was actually using a model that was learning not with backprop, and also learning things not in an i.i.d. way. So do you have any kind of beliefs about that?
SANTOSH VEMPALA: Yeah. At least in the model and the way we can understand things right now, there's a very helpful teacher. The environment is helping you, like the human experience. So, yeah.
In terms of what might change and what's to be dropped, I guess the lesson is really don't be too attached to the current model. Yeah. Yeah.
TOMASO POGGIO: When I spoke briefly about the conceptual approach of machine learning, I said it's to find a parametric approximation to the unknown function, and deep networks are such an approximation. But there are other approaches that are non-parametric-- for instance, nearest neighbor, with no parameters. And there are other ways-- the so-called Nadaraya-Watson estimator. So one could try to explore this non-parametric way to learn, but that's going outside the normal, traditional deep learning. It's more like trying other directions that may be quite good.
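As a minimal sketch of the non-parametric direction Poggio alludes to (this example was not shown on the panel, and assumes only NumPy), the Nadaraya-Watson estimator predicts by taking a kernel-weighted average of the training labels, with no trained parameters at all:

```python
# Illustrative sketch: Nadaraya-Watson kernel regression, a non-parametric
# estimator -- prediction is a kernel-weighted average of training labels.
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, bandwidth=0.1):
    # Gaussian kernel weight of every training point for every query point
    w = np.exp(-0.5 * ((x_query[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)      # weighted average of labels

rng = np.random.default_rng(0)
x_tr = rng.uniform(0, 1, 100)
y_tr = np.sin(2 * np.pi * x_tr) + 0.1 * rng.normal(size=100)

x_q = np.linspace(0, 1, 5)
print(nadaraya_watson(x_q, x_tr, y_tr))       # smooth estimates of sin(2*pi*x)
```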
BRIAN CHEUNG: Jacob, do you have a--
JACOB ANDREAS: Yeah, I think it's largely been said already: basically, don't get too attached to any specific architecture that we're using right now, because, historically at least, the kinds of shifts in choice of architecture that have happened in the field have really been driven more by computational constraints than by this parameterization somehow being a better fit than that parameterization to all of the tasks that we care about. And I would expect that to continue to be true.
BRIAN CHEUNG: So since this is a theory workshop, and I think a lot of the talks have been on aspects of generalization and overfitting-- just for each speaker: since the models are training on so much more data now and their parameter counts are very large, do you think the issue is overfitting or underfitting? Because the CEO of Anthropic recently said training costs are going to go from $10 billion to $100 billion in the next three or four years. So I'm wondering which you feel is the bigger problem in the space right now, overfitting or underfitting. Tommy?
TOMASO POGGIO: Well, I think overfitting has been exaggerated, in the sense that you can have more parameters than data and effectively not overfit, because you have constraints like regularization. It's not necessarily the number of parameters that gives you capacity control in a system; it's regularization and other properties of the network.
So you can overfit and generalize. That's fine. In the old version of VC theory, this was not possible, because, essentially, the hypothesis space was fixed, and the main results were for an increasing number of data points. But you can formulate equivalent results in a framework in which the hypothesis space is not fixed and still get generalization even if you overfit-- if you fit the data and you have more parameters than data.
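A minimal sketch of the point being made here, assuming NumPy (this is an illustration, not something presented on the panel): a model with many more parameters than data points that fits the training data essentially exactly, yet still generalizes, because the minimum-norm solution acts as an implicit regularizer-- capacity is controlled by the norm, not by the parameter count.

```python
# Illustrative sketch: random Fourier features with d >> n. The minimum-norm
# solution (pseudo-inverse) fits the training points but stays smooth off-sample.
import numpy as np

rng = np.random.default_rng(0)

def rff(x, W, b, d):
    # random Fourier features approximating a Gaussian kernel
    return np.sqrt(2.0 / d) * np.cos(np.outer(x, W) + b)

n, d = 20, 500                                # 500 parameters, 20 training points
lengthscale = 0.2
W = rng.normal(0.0, 1.0 / lengthscale, size=d)
b = rng.uniform(0.0, 2 * np.pi, size=d)

x_tr = rng.uniform(0, 1, n)
y_tr = np.sin(2 * np.pi * x_tr) + 0.05 * rng.normal(size=n)
x_te = np.linspace(0, 1, 200)
y_te = np.sin(2 * np.pi * x_te)

Phi_tr, Phi_te = rff(x_tr, W, b, d), rff(x_te, W, b, d)

# Minimum-norm (implicitly regularized) solution: near-interpolation of the
# 20 training points despite 500 free parameters.
w_min = np.linalg.pinv(Phi_tr) @ y_tr

print("train MSE:", np.mean((Phi_tr @ w_min - y_tr) ** 2))   # ~0 (near-interpolation)
print("test  MSE:", np.mean((Phi_te @ w_min - y_te) ** 2))   # typically small despite d >> n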
ERAN MALACH: Yeah, maybe related to this: these big language models are pretty much trained without repeating data. So it's kind of like the online setting. Again, a few years back, we would worry about generalization and overfitting. Now, you're training in the online setting, and in some sense, there is no worry-- you only care about optimization, because generalization is, in some sense, for free, right, because you're doing this online learning procedure.
Each example is new. If the loss is dropping, this is all you care about. So maybe there's out-of-distribution generalization, or generalization to new domains that you don't see during training-- these are still open. But the classical notion of train-test difference is no longer a problem, which is maybe nice. We got rid of this generalization problem. We're only left with, how can we optimize these things? Yes.
SANTOSH VEMPALA: Yeah. I agree with the comments about generalization-- or at least so it seems, although I still don't know the proofs-- that more parameters don't seem to hurt your generalization for a specific task or distribution. But I find particularly puzzling how we seem to learn higher-level things.
You play, I don't know, tic-tac-toe, you learn how to play it, and then you play Connect Four or something, then you play something more complicated-- and by the time you learn the next game, you learned something even for a game you've never seen before. Like, you learned how to learn to play.
And I'd love to see this-- maybe it's already doing it. Maybe somehow, next token is capturing all this. But how?
BRIAN CHEUNG: Actually, that's a good question, because I think in the first lecture when Tommy presented this notion of compositionality, do you believe that the models can learn compositionality the way Tommy kind of specified it as building concepts that are just stacked on top of each other rather than individual concepts one at a time?
SANTOSH VEMPALA: I mean, certainly for efficiency, it appears that that's a crucial element to keep both the sparsity and the compositionality. And given that all empirical training seems to be quite efficient, I would guess that there is some level of this, either because of regularization or other things that's being done. Whether that generalizes across distributions and across tasks or how it does that--
TOMASO POGGIO: Well, you can imagine you have various constituent functions, various modules, that are common among different tasks. And so if you learn those, it's like having a large dictionary of skills, or of constituent functions--
SANTOSH VEMPALA: You might have learned explicitly--
TOMASO POGGIO: Yes.
SANTOSH VEMPALA: Some of them, at least.
TOMASO POGGIO: Some of them may be learned explicitly. Others may be learned while learning more complex compositions. But whether they remain available-- it's all pretty vague.
JACOB ANDREAS: Yeah, just to go back to the underfitting-or-overfitting discussion: I feel like, in some ways, the problem is not that so much as that we don't actually know how to specify the distribution that we're trying to fit in the first place. Going back to what Eran was saying in the talk earlier today, we're really good at building models of the entire internet.
The problem is that the language model that you actually want to hand to somebody to use is not a model of the entire internet. It has a consistent personality. It generates correct facts. It doesn't say too much or say too little.
It's not sensitive to cues that the user might put into the input, like misspellings, or other things that would generally be correlated with bad behavior in whatever prediction is downstream of that. And I think the problem that we have right now is that that system, that interaction that we want to build-- you don't get it from training on the entire internet, because the internet doesn't look like that.
You don't really get it from the various kinds of clever fine-tuning techniques that we have right now, either. And so the problem is that we just don't actually know how to specify what the target distribution we're trying to hit actually is.
BRIAN CHEUNG: That's great. So then I have a follow-up question to that. I know all of you, or some of you, have done work in this direction: why do models hallucinate? Why do they generate things that are not even in the training set?
SANTOSH VEMPALA: I'm happy to start, because the only theoretical paper I've written on language models is about hallucination. The title of the paper is "Calibrated Language Models Must Hallucinate." And this is with Adam Kalai, who came up with the idea and wrote the proof.
But, basically, the one thing that has been shown is that if you just do loss minimization with, say, some of the standard loss measures, the models tend to be calibrated to the input distribution, in various senses. And once it's calibrated, you can show, under extremely simple and general notions of what are facts and non-facts, that the hallucination rate is essentially lower-bounded by the fraction of unseen data.
And so that's just a theorem. It doesn't matter what the architecture is-- it's independent of architecture. Any statistical language model which hasn't seen some fraction of the data is going to hallucinate at roughly that rate. So it seems inevitable. And, indeed, I think it's a very important problem how to, I guess, provide feedback so that we move away from this to more reliable outputs. Yeah.
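Written as an inequality, the claim as stated here has roughly the following shape (this is a loose paraphrase of the panel remark, not the precise theorem statement from the Kalai-Vempala paper):

```latex
% Informal shape of the bound described above: for a statistical language
% model that is calibrated to its training distribution,
\[
  \Pr[\text{generated output is a hallucination}]
  \;\gtrsim\;
  \underbrace{\Pr[\text{fact was not seen in training}]}_{\text{fraction of unseen facts}}
  \;-\; \bigl(\text{calibration and misfit error terms}\bigr).
\]
```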
JACOB ANDREAS: Something that is sort of a trivial follow-on from that is that if you have a data set in which nobody ever says "I don't know," then the model is never going to learn to say "I don't know." And if you ask a question that it sort of doesn't know the answer to, the loss-minimizing response is to place a sort of uniform distribution, or whatever the right posterior is, over things of the right type, rather than just to say "I don't know."
And you can't have, simultaneously, a model that assigns probability proportional to the right answers, which is something that we do want for a lot of applications, and also have a model that communicates to a user, hey, I have uncertainty. I happened to sample this token right now, but if you call me again, I'll give you a different output instead.
And I think one of the really interesting, again, sort of empirical questions there is this behavioral characterization of hallucination. Does the model say, I don't know? Or does it give you the right answer? Or does it make something up?
It seems totally plausible that even in the setting where you have a sort of calibrated language model that's giving you this distribution over answers, there's some internal representational correlate of knowing the answer and not knowing the answer. And if we had a better handle on how knowledge, certainty, whatever, gets represented internally in these models, it would be much easier to take a model that is a good, calibrated model of the distribution and turn it into something that says "I don't know" in the right circumstances, or back again.
AUDIENCE: Could it also be that, since the language models are trained with cross-entropy, which is a mean-fitting objective, they just upweight parts of the distribution that could have been lower in the actual distribution of language?
SANTOSH VEMPALA: So the point is that even if you have a perfect model-- let's say you've learned the distribution exactly, the training distribution, but you haven't seen some fraction of the data, you're going to hallucinate. So it's independent of the mechanism-- not only the model, but also the mechanism by which you acquired the model. Yeah.
What did I want to say? I wish I was a language model at this point, so I could remember. Yes-- what I would love for GPT2 to do is to output what it thinks is the probability of what it just generated. I think that should be a law.
If you're going to tell me this is the completion or this is the answer, just give me your real confidence-- what the probability of your model is, which, obviously, they must have because they sample from it. And then that way I can make a decision whether I want to trust it or not.
Maybe my bar is 0.99, and somebody else's bar is 0.5, depending on the task. But they don't do it right now. I ask it, what's the probability-- your confidence? And it gives me a different number each time, because it's just using another generation to do that.
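For the exact string a model produced, the quantity Vempala asks for is easy to compute: sum the per-token log-probabilities of the completion given the prompt. A minimal sketch, assuming the Hugging Face transformers package and the small gpt2 checkpoint (chosen purely for illustration):

```python
# Illustrative sketch: the probability a causal LM assigns to a specific
# completion, obtained by summing per-token log-probabilities.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt, answer = "The sun rises in the", " east"
ids = tok(prompt + answer, return_tensors="pt").input_ids
n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]

with torch.no_grad():
    logits = model(ids).logits                      # (1, seq_len, vocab)

# log P(token_t | tokens_<t) for every position after the first
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)

answer_lp = token_lp[:, n_prompt - 1:].sum()        # log P(answer | prompt)
print(f"P(answer | prompt) = {answer_lp.exp().item():.4f}")
```

Note that this gives the probability of that exact string, which is precisely the objection raised next in the discussion: the same answer can be phrased many ways, so string probability is not the same as confidence in the underlying fact.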
JACOB ANDREAS: Well, but what do you actually want there? It's certainly not the probability of the string, right, because for any answer to a question, there's a million ways of expressing it. You have uncertainty over how much detail the user wanted. You have uncertainty over--
SANTOSH VEMPALA: The sun rises in the blank-- east, and probably give me 99.7%.
JACOB ANDREAS: Yeah.
SANTOSH VEMPALA: Right?
JACOB ANDREAS: Well, no, because it could have also said morning. It could have said some very precise angle. I think the hard thing is figuring out, how do you translate the entropy of this distribution over strings into somehow the entropy over concepts?
SANTOSH VEMPALA: Jacob had a blank for lunch.
JACOB ANDREAS: Yeah.
SANTOSH VEMPALA: And then it says-- sorry, I already gave it away. Salmon burger. What's the probability? Is it true?
JACOB ANDREAS: Well, no.
SANTOSH VEMPALA: It could have been right.
AUDIENCE: Maybe this is just from my perspective, but it also often feels like, especially recently in ML theory, that there's this sort of loop where something works and then, not necessarily your papers, but a lot of people will come out with papers explaining why that thing works. But often, it feels like that's not really the right direction things should be going and theory should be informing the applied sciences.
So my question is, what are the best pathways for ML theory to inform applied ML today?
ERAN MALACH: I agree. And it's often a problem. And I think maybe some of it is a consequence of the explosion in technology and the value of just putting a ton of engineers on it and pushing the state of the art.
And you don't invest as much money in learning theory researchers to catch up with this. So it's kind of like thousands of engineers who are moving forward, and then many fewer people who are trying to do theory.
And definitely, you want to be able to explain what's happening right now. And I agree that kind of the ideal role of theory would be to point out, what is the right thing to do in the future, and not just explain what we already know-- which is, of course, much harder.
I think in some aspects, there are theoretical works that are doing this in certain areas. But right now, I like the analogy of the time before Maxwell: there's this explosion in technology that's just happening. And I think that over time, the theory will catch up.
TOMASO POGGIO: Yeah, let me take a controversial view and attack computer scientists-- you guys. I think in the theory of machine learning, until 15 years ago, in the time of kernel machines and so on, the overall picture from the theory point of view was pretty clear. And so what theoreticians had to do was details-- fixing things, finding useless bounds, things like that. And now, we are in a new era.
It's more like natural science. It's like physics. There is a transformer that spits out stuff, and we don't know why it works. It's not a question of proving a formally correct theorem. It's, what is the basic principle?
It's like asking, is it the fusion of hydrogen into helium that powers the sun? These kinds of questions-- fundamental questions. Forget the formal details. In a sense, machine learning today has become a natural phenomenon.
Your computer-- by the way, the title of my talk was "Computable Intelligence of Mathematics," because this mathematics-- machine learning-- actually does useful things. It's a natural phenomenon that we somehow have to explain. And that's a different framework.
The priority is not having formally correct bounds, but more finding the principle-- the basic principle that makes these things work. So, sorry, did I offend anybody? I don't think anybody here is actually guilty, but maybe some people in the community have been emphasizing, let's say, this kind of old mode, where--
ERAN MALACH: Maybe let me try to respond. So, OK, there's scientific theory, which is you can think of like physics is a scientific theory of nature--
TOMASO POGGIO: Natural science?
ERAN MALACH: Yes. And then there's--
[INTERPOSING VOICES]
ERAN MALACH: Yeah. And there's mathematical theory.
TOMASO POGGIO: Yeah.
ERAN MALACH: Which is maybe--
TOMASO POGGIO: Maybe nature or not.
ERAN MALACH: You're trying to criticize. And these are similar but maybe different things. And I think a mathematical theory is often a part of a scientific theory-- not all of the theory has to be mathematical, but there's definitely value in doing mathematical theory as part of developing a scientific theory.
And I think that the power of mathematical theory, even in today's paradigm of learning, is that it can tell you things about experiments that you cannot run. If I ask you what would happen if I gave you compute exponential in the size of the problem-- would you be able to solve it or not?-- this is not something you can do. You cannot run this experiment.
You can only do mathematical theory. And these questions of lower bounds-- what are the problems that you can absolutely not solve, even if I gave you an extreme amount of resources?-- this is not something you can ever answer with an experiment. And I think there's value in these kinds of theoretical statements: this is a problem that you cannot solve no matter how much money you put into it or how many GPUs you buy. And I think--
TOMASO POGGIO: Some of the more powerful results in physics are these no-go theorems, showing that some type of model can never explain, for instance, quantum mechanics or some of its puzzles.
BRIAN CHEUNG: So I guess to help the computer scientists out a little bit. Neuroscience, arguably, at least as a field, has contributed many things to neural networks. But recently, there hasn't been as much transfer, I would say-- maybe the most recent example being Fukushima's neocognitron and the convnet being inspired by that. Why is that? And what can you do about that?
TOMASO POGGIO: Well, Santosh may be a counterexample where you have--
BRIAN CHEUNG: How does [INAUDIBLE]?
SANTOSH VEMPALA: I mean, the brain is the most solid example we have of artificial general intelligence-- sorry, of general intelligence. You see? Got me. So, yeah, the lack of a theory is absolutely mind-boggling.
But going back to the question, I think explaining phenomena has always been part of science, certainly. Yes, the hot phenomena in machine learning are changing rapidly. That's a bit of an issue.
But nevertheless, there's lots that we don't understand and would still like to explain. And I guess there are also, as all three of them have pointed out, questions that have emerged that we just didn't see before. And that is fantastic, now that we have this possibility of AGI.
One highlight in particular that I feel is very important is safety, which can take many aspects. For example, we talk about bias in all kinds of machine learning systems, certainly in generative models, and there are all these fixes, right? People try to fix it, for whatever reason, and it works, and then it breaks; it works, and it breaks.
But what would we like AI to be? We'd like it to be, sure, unbiased. We want ourselves to be unbiased. We'd like it to be ethical. Why not? We'd like it to be altruistic. We'd like it to be kind, all these things.
So what I'd like to ask is, how do I get AI to do this without doing spot fixes? Keep fixing it-- that seems like, what is it called-- whack-a-mole. It's not going to happen.
We didn't do the whack-a-mole with humans. Somehow, humanity, for the most part, decided that these are things we're going to have as a society, and it's working.
ERAN MALACH: Evolution.
SANTOSH VEMPALA: Well, evolution-- but culture and ethics only happened in the last 100,000 years or so, after language, at which point the brain has hardly changed. Whereas with AI, what can we do, or what is missing, to make this come naturally? OK?
Is it because we're putting all our resources into one unit so it doesn't care about sharing anything? Is it because the objective is just loss and not something else? Or is there some other more complicated thing that we could incorporate into AI that will make it safe in all these senses?
BRIAN CHEUNG: I think on that note, there's this emergent field of mechanistic interpretability, which, from my point of view, has been very observational and very empirical. How does theory help with interpretability of models that are already trained, that exist today?
JACOB ANDREAS: So the example of this phenomenon that's most on my mind right now is that there's a lot of talk about world models and language models, and to what extent they can simulate things. And we have all these nice results from formal language theory and circuit complexity that talk about, for a transformer-shaped thing, what are the classes of problems that you can solve and what are the classes of problems that you can't solve. And one of them is just simulating general unrolling of automata.
And so if you have a language model with finite depth, and finite precision, and all the usual caveats-- if you have a model that looks like it's doing sort of state tracking, with an arbitrarily complex state, for an arbitrary number of time steps, it can't actually be doing that, because we have these negative results that, under standard assumptions, there is no such circuit that solves that problem.
And so I think, to the extent that you're then engaged in the business of observational theory-building and hypothesis-formulation about what's going on inside these models, knowing at least the theoretical impossibility result constrains the set of hypotheses that you need to entertain about how a model is solving a problem, and maybe it tells you something about what kinds of shortcut solutions to look for instead.
BRIAN CHEUNG: So that sounds like you can, given the architecture, figure out the bound of what it could possibly do to simulate something. And if it's doing something that seems more advanced than that, then it's probably doing it in an incorrect way.
JACOB ANDREAS: Yeah. And I think that's at least one category. And then, in the opposite direction, where we know, OK, here's a great algorithm for solving this particular task, our first guess at how a model is solving some problem might be to look for that algorithm specifically.
And I think there's been lots of recent examples of this in the context of things like in-context learning. And yeah.
ERAN MALACH: Yeah, I think for theory, there's a lot of theoretical work on simple, maybe synthetic, tasks that we understand well. And for these tasks, there are not a lot of ways to solve them. There are a few possible algorithms or solutions that you could implement, and then you just need to check and see, what is the algorithm the model is actually learning?
And we often see that if you look at this correctly and you probe the model correctly, it did learn one of the possible algorithms that you had in mind. So definitely, starting from a problem that you know how to solve, and then looking at the model and seeing how it solves the problem, is the right way to do it.
BRIAN CHEUNG: It's like a well-defined toy environment where you know what the ground truth is, and you start solving from that.
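A minimal sketch of one such toy environment, assuming only the Python standard library (this is an illustration, not a task discussed in detail on the panel): composing permutations of five elements is the kind of state-tracking problem that the circuit-complexity limits mentioned above are usually stated about, and whether a trained model solves it exactly or via shortcuts is something you can then probe.

```python
# Illustrative sketch: generating a permutation-composition ("state tracking")
# task over S5, usable as ground truth for probing a trained sequence model.
import random
from itertools import permutations

PERMS = list(permutations(range(5)))            # the 120 elements of S5

def compose(p, q):
    """Apply p, then q (both tuples encoding permutations of {0..4})."""
    return tuple(q[p[i]] for i in range(5))

def make_example(length, rng):
    seq = [rng.choice(PERMS) for _ in range(length)]
    state = (0, 1, 2, 3, 4)                     # identity permutation
    for p in seq:
        state = compose(state, p)
    # input: token ids of the permutations; target: id of the composed state
    return [PERMS.index(p) for p in seq], PERMS.index(state)

rng = random.Random(0)
xs, y = make_example(length=16, rng=rng)
print(xs, "->", y)
```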
AUDIENCE: I was wondering, what are your thoughts on how neural networks and machine learning, in general, maybe, have changed our view of the world. Or what have we learned about the world? And I don't mean in a narrow case where we learn the structure of a protein, and now we discover new proteins, and we make stuff, and we understand the world better-- but, rather, as maybe a shift of perspective on how we think about the world.
Because the main thing I've studied is machine learning, and now I look at everything as distributions that we learn, and representations, and what is the minimal representation, and stuff like that. So I was wondering whether you have thoughts on this.
TOMASO POGGIO: I've always found that science is kind of a social phenomenon. And, for instance, what it means to understand a phenomenon varies not only across different fields of science, but also across time. There was a time when I arrived at MIT in the '80s, 1980s, that it was a very successful time for molecular biologists.
They were pretty arrogant, in general. And for them, at the time, finding a gene involved in a disease was understanding the whole problem of that disease. Of course, we know now that finding a gene is just a step. Then, you have to understand how the proteins that the gene encodes interact with the disease. It's complicated, right?
But at the time, finding a gene was understanding. And the scientist who got a Nobel Prize for discovering, I forgot whether neutrons or neutrinos, said, the rest is chemistry.
So I wonder whether machine learning will change what we mean by understanding a problem. So, for instance, let me take an example from what I did in the past. Back, again, in the '80s, we developed with David Marr an algorithm to solve stereopsis.
You take two images, and you can compute the distance-- the depth-- from the disparity between the two images. Now, you don't need to understand the disparities involved, or where they come from. You could, in principle, train a machine learning model with a lot of stereo pairs and the ground truth, and the system will learn stereo.
Will that be understanding stereo? I don't think so. But from a certain practical point of view, it is, right? If you understand how machine learning works, you may, in principle, know how to solve a lot of other problems without really understanding them.
I think that's dangerous, but that's one way-- you asked how machine learning can change science. In a sense, that's one way. So I think it's even more important to try to understand how machine learning really works.
SANTOSH VEMPALA: I completely agree. I mean, at least this is one bad effect of machine learning. People feel like scientists cannot afford not to do it. And it gives you these seemingly impossible solutions-- some level of solution. And, unfortunately, then you move on. It's not satisfactory like it used to be. Yeah.
TOMASO POGGIO: So, for instance, this is another good example, which is a really big accomplishment: AlphaFold. So you can infer the 3D structure of proteins from the sequence of amino acids. And that's the first program that should get a Nobel Prize.
But it's a big problem. And it's a problem that, from a computational point of view, probably should be NP-complete. Yeah. You don't really solve the folding-- you don't really understand, in the program, how proteins fold.
You understand through data-- or maybe understanding it directly is too much work. You predict through data what the final 3D structure is going to be. But it's a big thing.
SANTOSH VEMPALA: Yeah, I feel, more concretely, even in natural language processing, it's amazing what computers can do for translation. It's basically better than--
TOMASO POGGIO: Translation.
SANTOSH VEMPALA: Translation.
TOMASO POGGIO: Absolutely. Yeah.
SANTOSH VEMPALA: Which is great for users and for people who are not in the field. But I feel like it must feel like a hole for people who are in the field. Previously, whatever fragment of language you understood, you really understood. And then there were things you didn't understand. Now, it's all big-- yeah, sorry.
JACOB ANDREAS: Just going back to what Tommy was saying before, I think there's an interesting analogy between, on one hand, the sort of process of scientific modeling and scientific model-building and theory-building, and, on the other hand, the problem that these big unsupervised neural sequence models have to solve.
And I think the biggest or most surprising thing in all of this, for me, is that within some relatively large family of pretty simple, pretty unstructured neural net type architectures, if you train on enough data, you get good at all these problems that seem to require understanding a lot about the latent structure of the world and that, in some cases, people only seem to be able to do because we've really been hard-wired to do them like language learning, or that require going off, and getting a PhD, and spending several lifetimes accumulating scientific knowledge-- things like building models of the process by which proteins fold.
And I think it's still, then, an open question whether these neural sequence models are solving these problems by just noticing a bunch of shallow surface heuristics or whether there are actually things about the natural world that are getting learned inside these models that somehow aren't known to us, that we somehow need to build tools to pull out. And that feels like one of the big, exciting, important challenges is just that we have these really good predictive models that, in some ways, we can run experiments on that we wouldn't be able to run in the real world.
And for all of the problems that we care about, the work now is figuring out, are those problems getting solved by these models, one, in a way that actually reflects the true data-generating process in the real world, and, two, in a way that allows us to derive some kind of scientific insight?
AUDIENCE: I was still thinking about the problem of hallucinations in generative models, and I had a question regarding this. Do you believe that one methodology we could exploit to, let's say, train these models in the regions of the representation where they are not trained with actual data could be training these models on their own generated samples? And, in general, what do you think about this technique of training models on their own generated samples? Do you think this can be helpful in the future of deep learning?
ERAN MALACH: In some sense, this is done-- definitely in the fine-tuning phase, you are generating from the model, and based on different ways to rank the generations, you can fine-tune the generations of the model.
And presumably, more and more models are trained from synthetic data, which is basically generations of the model. So I think this is becoming more and more popular to do.
I guess maybe the question is, can you bootstrap this way-- get a better model by training on your own generations? And there are some recent works that give different answers on this. Maybe other people know better, but I think it's currently not very clear how much better you can get by training on your own generated data, unless you have some kind of sophisticated mechanism to throw out the garbage, basically.
TOMASO POGGIO: There is a recent paper showing garbage in, garbage out.
SANTOSH VEMPALA: Going back to deep learning, which was the motto of this panel-- data augmentation by data perturbation was a very successful technique. I believe it's still successful-- one way, for example, to do better on adversarial examples and so on. So using data generated by the model for improvement certainly has its merits.
Whether it will get rid of hallucinations, that's completely unclear. For that, we might need what Eran mentioned about human feedback or other kinds of feedback.
AUDIENCE: And a question to you theory people-- theoretical researchers-- so maybe that's not your strongest side, but as someone doing more empirical work on deep learning, I'm really thinking about-- do you know Francois Chollet's ARC challenge, or ARC-AGI? Some of you, I see, are nodding.
So he basically presented a benchmark that shows a generalization problem with current ML architectures, let's say. And, basically, if you have any notion of what the next type of model might be, or any idea or intuition that is not necessarily based on deep learning-- if you have something on it, and if you know the challenge, maybe you can reference that. That would be great.
ERAN MALACH: We will split the million dollars now.
AUDIENCE: Yeah.
ERAN MALACH: Yeah. Do people know this? I don't know. So, do you want to explain it?
AUDIENCE: No. I'm really stressed. I think that it would be better if you would.
ERAN MALACH: It's basically these kinds of visual puzzles. You get like two or three examples with simple rules that humans figure out very easily. But apparently, it's hard for basically any AI model. Currently, they're very far from human-level performance.
And there's a prize of, I think, $1 million or something.
AUDIENCE: $1 million. Yeah. So, basically, humans get like 85% on this. And currently, machine learning, I think, gets 40%. And that's as of last month.
But I think it really references the problem of data generalization, like with LLMs and current machine learning paradigms.
ERAN MALACH: Yeah. So I think--
AUDIENCE: If not, it's OK. You can move to the next question.
ERAN MALACH: Of course, I don't know if I have a solution-- otherwise, I would go for the $1 million. But one of the approaches that I saw people try for this problem, which I think is interesting, is basically to generate a lot of programs from the language model, choose the best ones, and maybe iterate-- which is also something that is becoming more useful for code and math.
And I think the interesting idea here, maybe from a theory perspective, is that for this challenge-- and, actually, a lot of problems-- we have very few examples of input and output, and you need to figure out the rule.
And our only paradigm for doing this right now in machine learning is training a neural network with gradient descent. That's a terrible idea when you have very few examples, because you'll immediately overfit and nothing will be learned.
But another approach, which is actually maybe much more data-efficient, is really to try out a bunch of different hypotheses and choose the one that works best. And this is actually a very sample-efficient thing to do. There are classical learning theory results showing that the number of examples you need for this is logarithmic in the number of things that you're going to try.
So this is very data-efficient. And in some sense, this is one of the solutions people are trying: just try out a large number of things and choose the ones that work best, which is actually a pretty good algorithm if you have a heuristic for what to try.
TOMASO POGGIO: Yeah. That would also be a non-parametric approach, in that sense.
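The classical result being gestured at here can be stated as the standard finite-hypothesis-class bound from PAC learning (a textbook statement, not one quoted on the panel): in the realizable setting, the number of examples needed grows only logarithmically with the number of hypotheses you are willing to try.

```latex
% Finite hypothesis class, realizable PAC setting: with
%   m >= (1/eps) * ( ln|H| + ln(1/delta) )
% i.i.d. examples, with probability at least 1 - delta, every hypothesis in H
% that is consistent with all m examples has true error at most eps.
\[
  m \;\ge\; \frac{1}{\varepsilon}\Big(\ln|\mathcal{H}| + \ln\tfrac{1}{\delta}\Big)
  \;\;\Longrightarrow\;\;
  \Pr\Big[\exists\, h \in \mathcal{H}:\ h \text{ consistent with the sample and } \mathrm{err}(h) > \varepsilon\Big] \le \delta .
\]
```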
AUDIENCE: Hello. So do you think the scientific community is doing enough to supervise the application of these technologies, such that they do more benefit than harm, and also maybe to counter potential exploitation such as those deepfakes? So if "yes," what are these attempts? And if "no," what can be done?
SANTOSH VEMPALA: I mean, it's a very important question. Obviously, we don't know the answer. But it's the type of service that I think none of us can escape now. We all have to spend some time trying to formulate what a satisfactory approach would even look like.
It will include some regulations. It will include some cultural development. It's such a new age, with all this possibility. So, obviously, we don't know the answer-- some transparency, some honesty. People have tried-- for example, you might have seen the pledge several months ago trying to stop the development of generative AI for a certain number of years or something.
And that didn't go anywhere, in spite of a lot of things. But, yeah, we don't want to wait for something really disastrous to happen. There are enough bad things already going on.
JACOB ANDREAS: Yeah. And I think just adding to that, there are things that the scientific community can do alone. But at the end of the day, the decision about how to regulate these technologies and what constraints to put on them is not a decision for the research community to make on its own. And probably the most valuable thing we can be doing right now is just engaging with the public, with policymakers, making sure that people really understand what's possible, what's not possible with current technologies and contributing to the public making of these kinds of decisions, rather than trying to make them ourselves.
ERAN MALACH: Yeah. I don't know if I have anything interesting to say about regulations, but what I find interesting is that we still don't know whether we're a few years away from something that is superhuman and really is intelligent in the way that we mean intelligent, or whether it's just a kind of smart Google that can retrieve things from the data set and maybe do some manipulations-- definitely very useful, but not intelligent in the way that we humans think of intelligence.
And I think maybe this is kind of the fundamental question, because definitely you can do harm with language models the same way you can do harm with Google Search. But this is not the question.
The question is, is this thing going to develop real intelligence that can do actual harm just from the fact that it developed intelligence in itself? I don't know what the answer is.
SANTOSH VEMPALA: The answer is, yes, of course, there will be intelligence that's supreme. The question is when, and what are we going to do about it?
JACOB ANDREAS: Well, but there's also lots of harms that you have to start thinking about well before that point.
[INTERPOSING VOICES]
JACOB ANDREAS: Yeah.
AUDIENCE: Kind of following on that, I guess you could think of humans as also large models, but multimodal. And then, following on from that, you can think of, for example, creativity or discovery as being like the hallucinations. But from our view, it's justified, because we have all of the information from all the other modalities. So do you think there's a difference between, I guess, humans, based on your subjective experiences, and a very big multimodal model?
SANTOSH VEMPALA: The title of the book will be I Am My Language Model.
AUDIENCE: It won't be a language model, because you can't just use language. It's not sufficient.
SANTOSH VEMPALA: Language model in the general sense that it's a generative model. But no, sorry, that was just a tongue-in-cheek answer. Yes, there are some high level similarities between creativity and hallucination.
And I don't know the answer to your question, but something related that I feel like maybe we can try to answer is-- something I've forgotten.
AUDIENCE: It seems to me like neural networks, like the ones you use now, are very passive observers of the world most of the time. They kind of simulate our perception. They take data-- they have some input, which is a concept-- but they're not acting like we do.
And there were some machine learning agents like AlphaGo and stuff like this, but they're using other kinds of frameworks, which are, I don't know, maybe less successful, or at least less stable-- really slow learning, which is a little behind.
Do you think we need a new theoretical framework for acting and strategizing in the world? Or do you think reinforcement learning is enough with some perception that we have already quite figured out, or at least quite advanced? Or do you think there is no problem?
SANTOSH VEMPALA: I'm still answering the previous question, but here's the thing. We haven't seen, as far as I know, a new insight from a generative model. We've seen amazing stuff, but I haven't seen a truly fundamental new explanation.
AUDIENCE: But we have seen hallucinations.
SANTOSH VEMPALA: Yes, absolutely. Absolutely. So whether those hallucinations are the ones that will lead to creativity that gives you a new insight-- so you could summarize it by saying what's known is known and what's not known is not known.
So is that a drawback or a fundamental limitation of language models, not being able to come up with a new insight? Or is it just that we're not there yet in terms of capacity and training? I don't know.
But so far, there hasn't been like, oh, my god, I now understand how this thing works, which we didn't know before. That hasn't happened yet. Yes, you can solve an IMO problem, but so can 10 other people.
And I don't want to distract from the next question, which was about something very interesting-- do you need to be embodied to be intelligent?
ERAN MALACH: Yeah. I'm not sure-- this was maybe more about active interaction with the environment. So the recent IMO silver medal was exactly this: doing some variant of search and reinforcement learning for generating solutions to math problems. And it seemed to give good results.
So I would say that this is definitely something that is becoming more and more dominant in the field, and for these kinds of things-- solving math problems or coding-- it's definitely very useful.
AUDIENCE: So, yeah, I'm going back to, what is theory useful for? I know Tommy mentioned-- maybe this is his bias from a physics background-- but there are more sciences than just the natural sciences, right? You have the engineering sciences.
If you think back to when we built bridges with new materials-- I read this recently-- apparently, they first built a huge chain with steel to figure out what the sag would be, without knowing the equation for a catenary. So I'd like to think about these new demonstrations that way: this is how you would do it.
But the demonstration usually comes first, and then you go back and try to figure out what are the conditions under which this is possible. So I think there's another role for theory in terms of thinking of this as an engineering science rather than a natural science.
AUDIENCE: It's kind of a follow-up on the question of embodiment. So is it a shortcut to go straight to using language? Because, in the end, language is the structure that humans found to describe the environment. Shouldn't we expect that if we want agents that develop their own intelligence, they should either be able to interact with the environment to extract concepts from the world, or be given some input modalities and recover on their own some computations and language that would make sense to compress the information they are receiving?
Like, a fly has a brain that uses different types of coding. So should we keep pushing for intelligence in language models when these models will never have access to the environment that created those embeddings?
JACOB ANDREAS: So one question is, do you need embodiment to learn things about how the physical world works? Are there things that we learn from having access to vision, hearing sounds that we never describe in text or that we don't encode in text to the precision that you would need to build really detailed models of the real world?
That seems almost certainly to be true, right? There's all kinds of really-- among the many things that language models can't do right now is solve really complicated motion planning problems. And there's a lot about just like what it takes to walk or what it takes to grasp a glass without dropping it that we never talk about and that we really only learn by being in the world.
And if we want models that can do those things, certainly, you need access to all those other sensory modalities. There's a separate question, then, of whether you need interaction, or whether you can learn these things purely observationally. And I think one of the big surprises, again, with what shows up in a sufficiently large natural data set is how much of the causal structure of the world you can get in purely observational terms. And all these sort of Pearlian arguments-- well, you can't really tell what the causal relationship is from this pair of variables observationally.
But then, in the real-world data, it says, well, I'm the doctor, and I'm giving the patient this treatment, and this is the outcome that occurred. And there's a lot of just people talking about causality and people describing running experiments in the real world.
And if enough people describe their experiments to you, it's as though you had done them yourself. Could we learn more efficiently by actually being able to do the interactions? Almost certainly. But there, it seems less clear-- it's not the sort of thing that could, at least naively, be solved by scale.
AUDIENCE: All right. If there are no more questions, let's thank the wonderful panel.
[APPLAUSE]