CBMM Panel Discussion: Is the theory of Deep Learning relevant to applications?
November 4, 2020
Daniela L Rus
All Captioned Videos Brains, Minds and Machines Seminar Series
Deep Learning has enjoyed impressive growth over the past few years in fields ranging from visual recognition to natural language processing. Improvements in these areas have been fundamental to the development of self-driving cars, machine translation, and healthcare applications. This progress has arguably been made possible by a combination of increases in computing power and clever heuristics, raising puzzling questions that lack full theoretical understanding. Here, we will discuss the relationship between the theory behind deep learning and its applications.
KENNETH BLUM: On behalf of the Center for Brains, Minds and Machines, welcome. We have a panel discussion today and the title is, "Is Theory of Deep Learning Relevant to Applications?" And I have to say, at the outset, it strikes me that probably it's a bit of a straw person question or prompt. So maybe I would modify it slightly and say-- I think it's a straw person because I suspect that everybody is going to say that theory is relevant, just based on the backgrounds of the panelists. So in that case, maybe I would modify it slightly to be, How or why is theory of Deep Learning relevant to applications?
And I want to introduce all five of the panelists at the outset and then step back and let them have at it. So in the order in which they will be speaking, they are Tommy Poggio, Tomaso Poggio-- he's the director of the Center for Brains, Minds and Machines. He'll speak first.
After him will come Andrea Tacchetti. He's a senior research scientist at DeepMind. He's working on things like Multi-agent Reinforcement Learning, how to play seven-player diplomacy board games, things like that. Third will be Max Tegmark. Max is a professor of physics at MIT who is equally well known for his analyzing large astrophysical data sets and for his adventurous theorizing. And he's increasingly interested in AI. That's a major part of his research endeavor now.
Fourth will be Lorenzo Rosasco. He's a professor at the University of Genova, team leader at the Italian Institute of Technology, and research scientist at MIT. And he works on both learning theory and practical learning algorithms. So I think we know where he stands. And final panelist will be Daniela Rus, who as you probably all know, is director of CSAIL and an omnivorous roboticist who gobbles up anything she needs in order to create embodied intelligence. So with that, I turn it over to Tommy.
TOMASO POGGIO: Thank you, Kenny. I want just to start with a little story from history. And the topic is that it's quite common in science that applications come first after perhaps a random discovery. And then for further progress, there is a theory that plays a basic role. And the one example that is dear to my heart is electricity.
Electricity really started with Alessandro Volta in Pavia. The year is 1800, when he published a paper about his pila-- the voltaic pile-- in the Proceedings of the Royal Society. It's 1800. Napoleon is in power. Napoleon made him a count because of his paper in the Proceedings of the Royal Society.
It was a momentous discovery. Until then, electricity was just sparks lasting a few microseconds. The pila I have here, a faithful copy of the first one that Alessandro Volta built, is discs of copper, of zinc, and wood. And if you pour water on it, you'll produce 2.1 volts for about four minutes or so. After that, you have to take it apart, clean all the discs, and put it back together.
Now once the pila arrived, it was the first time scientists could study electricity. As long as there were only sparks, they could not. And then things developed very quickly. Volta himself designed telegraph lines between Pavia and Milan. Pavia is 30 kilometers from Milan. And I was there, actually. They opened a museum in the year 2000, which was the bicentennial of the invention of the pila.
And they opened a museum for Alessandro Volta. And what happened in Pavia after that-- for the next 70 years, it looked like the Silicon Valley of electricity. Among the companies that were started, there was one started by Albert Einstein's father and uncle, which eventually went bankrupt. And so fortunately, we have the theory of relativity and not an Einstein and Einstein-- the start of a Siemens-- which could have happened at that time. But the point I want to make is that if you look at what happened--
By the way, think about it. Until Volta invented the pila, information had traveled the world at the speed of a horse for thousands of years. And then immediately after that, there were telegraph lines. Information had traveled at the speed of the horse. There are these letters from when Constantinople fell to the Turks around 1450, more or less the year Columbus was born-- people writing letters in Vienna saying, "I heard Constantinople has fallen," and people in Paris writing letters-- "I heard Constantinople has fallen."
And so you can reconstruct that it took three weeks for the news from Constantinople to arrive in Vienna, four weeks to Paris, five weeks to Madrid. That's a horse traveling 24 hours a day. So it was a momentous discovery. And a lot of things, applications, happened without people really understanding what electricity was. Volta did not. It was more or less a random discovery.
And his motivation was typical for a professor-- he was a professor in Pavia. He wanted to show that another professor at another university-- Bologna; this was Galvani-- was wrong. And so he invented the pila to show that electricity did not need biological life to exist. And so people did not understand what electricity was, but this was not an obstacle to developing electrical generators, telegraphs, electric motors, and so on.
The real progress started when a theory was developed. This was Maxwell, 60 years later or so. And after that, electricity really took off with the radio, electric lights, transformers, the photoelectric effect, computers, the internet, the machine learning of today. So this is a kind of metaphor for what may happen with machine learning.
I certainly don't compare machine learning or, especially, deep networks with electricity. But we may see something like that. Until now, applications, which have been quite impressive, were developed without understanding what was going on from a theoretical point of view. But I can see a time very, very soon in which we'll have a theory, and then things may develop, not only for deep learning, but more generally machine learning. OK.
ANDREA TACCHETTI: Thank you so much for inviting me. This is a fantastic topic, and the panelists other than me are outstanding. So I just wanted to give my thoughts on the topic of the day. And the title of the slides is Levels of Analysis for Learning Systems. So the levels of analysis of a computational system were introduced by Marr and Poggio in the '70s, then revised in the '80s. The last revision that I know of is from 2012, by Poggio. And these levels are a framework to describe a computational system.
And so there are three levels. There is the computational level, which is: what is the system trying to do? And you could think of a vision system-- it's trying to represent the light that comes into a sensor or an eye in a way that is useful to the viewer. That's what a vision system is trying to do. There is the algorithmic level, which is: how is it trying to do it? There could be an LGN, then V1, V4, IT-- cortical subregions that process this data. Or there could be a CNN implemented on some computer that processes this data.
And so how is the data represented? How is this goal achieved? And finally there's the implementation level. How does the system manifest itself in the physical world, in hardware or software? This could be at the level of neural circuitry, or the Python implementation of your CNN. And Marr advocated that we should describe complex computational systems at these three levels of understanding.
What I would like to do in the next three slides is to provide a framework for the discussion that will follow with the levels of analysis of a learning system. And what I'll argue is that we'll need theories at all of these three levels if we want to have a serious impact on applications of the kind that Tommy just described from actual equations to Silicon Valley.
So the computational level of a learning system is: what is it trying to do? This could be described, for example, in a loss function. It could be described in a game that the system is trying to play. One example that I hope most people are familiar with is a DQN that tries to play Atari. And what a DQN is trying to do is to efficiently produce actions that will maximize reward given an observation from the environment.
Now, how is this learning system trying to do so? Well, in the specific case of DQN, it's collecting a lot of trajectories where it plays its current policy, it refines the value function, and then it follows the value function to improve its gameplay. And this is the algorithm. How is it trying to achieve the computational goal? How is it trying to output actions in the environment so as to maximize the reward it receives while it's trying a lot of stuff and sticking to what works? That would be a way to describe it. And this is an algorithmic description of a learning system.
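The loop described here-- act with the current policy, store what happened, and refine the value function-- can be sketched in a few lines. This is a toy tabular stand-in for illustration only, not DeepMind's DQN (which uses a neural network for the value function); the environment interface and all parameters are invented:

```python
import random
from collections import defaultdict

def train_q(env_step, n_actions, episodes=200,
            gamma=0.99, alpha=0.1, epsilon=0.1):
    """Toy value-learning loop: play the current policy,
    store transitions in a replay buffer, and refine the
    value estimates from sampled transitions."""
    Q = defaultdict(float)   # state-action value estimates
    replay = []              # replay buffer of transitions
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: mostly follow the current values,
            # but keep trying a lot of stuff
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda b: Q[(s, b)])
            s2, r, done = env_step(s, a)
            replay.append((s, a, r, s2, done))
            # refine the value function on a sampled transition
            s_, a_, r_, s2_, d_ = random.choice(replay)
            target = r_ if d_ else r_ + gamma * max(
                Q[(s2_, b)] for b in range(n_actions))
            Q[(s_, a_)] += alpha * (target - Q[(s_, a_)])
            s = s2
    return Q
```

Fed a tiny environment where one action ends the episode with a reward, the learned values quickly come to favor that action-- trying a lot of stuff and sticking to what works, as Tacchetti puts it.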
And then finally, there is the implementation level. How does this system manifest itself in the physical world? How is it implemented in hardware or software? Well, in this particular case, this paper, there were a few convolutional layers followed by fully connected layers. And that's what was used to implement the value function. So what I wanted to-- I don't know.
My opinion about the topic of how a theory of deep learning can impact applications is that we should try and come up with theories at all of these three levels. The questions that we could ask are of the form: which goals or games lead to systems that perform very well, for example, outside of their training environments? At the algorithmic level, which representations should we aim for? What should we look for in the data? Which optimization algorithms lead to systems that perform well, and why is that the case?
If we understand all of this, we can make a huge impact on applications. And finally, the implementation level-- what do we need in programming languages that we don't have today? Which submodules should we develop to accelerate research, so that the physical manifestation of the systems that we want to build can be achieved very quickly? And so I'm just going to-- I think I have a couple more minutes-- give a couple of examples, just to give a flavor of the types of questions that I have in mind.
So recently, I was involved in a project where we trained a reinforcement learning agent to play Diplomacy. Diplomacy is a seven-player board game. The board is a map of Europe at the beginning of last century. And the seven agents control units on the board. And the goal of each agent is to capture some supply centers.
The important thing is that the players can support each other's moves. So for example, a player could move between Paris and Munich, and this move could be supported by another player, thereby gaining strength. So the interesting thing that this leads to is that the game is mixed-motive. While players have to collaborate to make progress towards victory, they ultimately have to stand alone to win. This very complex multi-agent interaction leads to a new phenomenon, which is high-level strategy shifts.
So what this means is that we can't simply train a system just by having it play previous versions of itself, because it will start obsessing over beating the previous version of itself, and it will get worse at the game. So it needs to be robust to all kinds of opponents. This is something that we have to contend with. So this is a computational-level goal. What is the true goal of the system? It's not to beat the previous version of itself; it's to get better at the game.
And so we don't have a good theory to describe systems that have these goals. This is obviously very relevant for applications. Imagine a phone with a learning system that tries to save battery and another learning system that tries to promote apps that are relevant to you. Imagine that you have very low battery, and the promotion system wants to show you how a specific app works because it thinks it might be very relevant to you.
Now, the battery system might not be super pleased with downloading videos and showing them to you. In this instance, the two systems will come to a head. So they have locally incompatible incentives, but globally they both want to improve the experience that you have with the phone. If we understand the theory of these interacting systems, we can really make progress in applications like this.
Similarly, think about a system that recognizes individuals with very few labeled examples. What is the computational goal of the system, and how can we improve the algorithmic level? So the hypothesis of this work is that things that happen close in time are semantically related to one another. In the top row of this image, you see frames of a video. And you can see that the identity of the person that's speaking is not changing through all the pictures, even though the physical representations-- the images themselves-- are very different from one another.
And so if you can build a representation that is robust to all of these changes, we find that we can train a system that recognizes people based on this representation that requires very few examples to perform at a high level. So these are just two examples of computational-level theoretical questions that we have and algorithmic-level contributions that we could desire in future systems. Again, what I'm trying to say is that if we want a theory of Deep Learning that impacts applications profoundly in a way that Tommy was alluding to, possibly the levels of analysis might help guide the discussion. And we need a theory at all levels.
MAX TEGMARK: Thank you so much for inviting me. So if the question were, are Italians relevant to applications, I think the answer is obviously yes, because not only have we heard from two awesome Italians, but we have a third one coming up soon on this very panel. For the actual question we were supposed to answer-- is the theory of Deep Learning relevant to applications-- this is my five-second answer: yes, of course. I'll spend the rest of the time explaining why I think so, with a few examples.
I think Deep Learning theory-- first of all, it can be very powerful not just for making Deep Learning work better in various ways. We have already heard some great examples about that from the two previous speakers. But it can also help us make systems that we can trust more, because they are intelligible. Trust not because the salesman said so, but because we understand how they work. That's the focus of my MIT group.
And I want to give you just one example from a paper on this that we just had accepted to NeurIPS. So we've worked on automatically discovering various kinds of simplicity in machine learning systems. Today, I want to focus on discovering equations in particular. So you have some data. In general, you train and get this inscrutable black-box model that does something you think is intelligent. If you can simplify it somehow so that it still performs as well on your metrics but is also simpler, then you might have more reason to trust it.
So this is the NeurIPS paper, Pareto-optimal symbolic regression exploiting graph modularity. Symbolic regression is simply taking a data table and auto-discovering a function that approximates it well-- approximating the last column as a function of the other three. Johannes Kepler, to continue with historical examples in the spirit of Tomaso Poggio, spent four years staring at Mars data until he realized that the orbit was an ellipse, and revolutionized astronomy. How can we automate these sorts of things?
Graph modularity refers to the fact that the computational graph of the function that you have, seen in the middle here, has a modular structure for many of the computations we actually care about in science and engineering. That means you can decompose-- in this example, a function of three variables into two functions of two variables-- and much more broadly with functions of very many variables, like the pixels in the images we just heard about.
So what we did here was we first trained a neural network, as a black box, just to approximate the function defined by the data. And then we developed some tools for automatically discovering graph modularity in it. We showed that by studying the gradients of the neural network, we could actually automatically discover all such modularity. Then we do this recursively until the pieces get simple enough that we can find good approximations for them. And then we put the whole thing together, and we have solved the symbolic regression problem.
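One way to picture the gradient trick mentioned here: if f(x, y, z) decomposes as g(h(x, y), z), then the ratio of partial derivatives df/dx over df/dy depends only on (x, y), not on z-- and that can be checked numerically. Below is a toy version of such a test, for illustration only, not the paper's actual implementation:

```python
import random

def grad_ratio(f, x, y, z, eps=1e-5):
    """Central-difference estimate of (df/dx) / (df/dy)."""
    fx = (f(x + eps, y, z) - f(x - eps, y, z)) / (2 * eps)
    fy = (f(x, y + eps, z) - f(x, y - eps, z)) / (2 * eps)
    return fx / fy

def looks_modular(f, trials=100, tol=1e-3):
    """Check whether f(x, y, z) = g(h(x, y), z) plausibly holds:
    the gradient ratio must not change when only z moves."""
    rng = random.Random(0)
    for _ in range(trials):
        x, y = rng.uniform(1, 2), rng.uniform(1, 2)
        z1, z2 = rng.uniform(1, 2), rng.uniform(1, 2)
        if abs(grad_ratio(f, x, y, z1) - grad_ratio(f, x, y, z2)) > tol:
            return False
    return True

f_modular = lambda x, y, z: (x * y) * z   # g(h, z) = h*z with h = x*y
f_not = lambda x, y, z: x * z + y         # no such decomposition
```

`looks_modular(f_modular)` comes out true and `looks_modular(f_not)` false. In the approach described in the talk, a trained network stands in for `f`, so a check of this flavor can run on the black box itself.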
So you can think of what I said here as a recursive divide-and-conquer algorithm, in the spirit of another famous Italian, Julius Caesar. Earlier work we had done on this exploited only some kinds of modularity. But what we have now is very general: it discovers any modularity, in terms of modules with any number of variables going into them. Finally, what's the deal with Pareto optimal, the third weird phrase in the title?
Well, I told you that I'm very excited about not just obsessing over minimizing the loss or inaccuracy, which is on the vertical axis, but also minimizing this complexity. In the spirit of Occam's razor, if you have two machine learning models that perform equally well, with equally low loss, I would prefer the simpler one, because I'm more likely to understand it, and it's less likely to have problems. So we used an information-theory-based definition of complexity and put it on the horizontal axis.
And then of all the possible models you can find, we kept the ones that are on this so-called Pareto frontier-- named after another Italian, Vilfredo Pareto-- where for each complexity you only want to keep the model that is the most accurate in that class. And for this example, we threw in some data on how much energy of motion, kinetic energy, objects had. It not only discovered a very accurate model in the lower right corner here, which is Einstein's formula for kinetic energy, but it also discovered the approximation, mv squared over 2, that Galileo worked out.
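The Pareto-frontier filter itself is simple to state in code: keep every model that no other model beats on both complexity and accuracy. A minimal sketch, with hypothetical (complexity, loss) numbers echoing the kinetic-energy example:

```python
def pareto_frontier(models):
    """Keep the models not dominated by any other model: dominated
    means some other model is no more complex and no less accurate,
    and strictly better on at least one of the two."""
    frontier = []
    for c, l, name in models:
        dominated = any(
            c2 <= c and l2 <= l and (c2 < c or l2 < l)
            for c2, l2, _ in models)
        if not dominated:
            frontier.append((c, l, name))
    return sorted(frontier)

# Hypothetical (complexity-in-bits, loss, formula) candidates
# fit to kinetic-energy data:
candidates = [
    (5, 0.08, "m*v**2/2"),            # simple, approximate
    (20, 0.0001, "relativistic KE"),  # complex, very accurate
    (9, 0.5, "c1*m*v**2"),            # more complex AND worse
]
```

Here both the simple approximation and the accurate relativistic formula survive, while the dominated candidate is dropped-- exactly the accuracy-versus-simplicity frontier described above.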
I promise it was unintentional to sneak so many Italians into the talk. But I'm just realizing now, they come one after the other. So what's really nice-- what we humans tend to appreciate the most are convex corners in this plot, models that are kind of unique, both in accuracy and simplicity. What we implemented here is a trade-off between Occam's razor, which is too vague to ever code up as an actual numerical implementation, on one hand, and Ray Solomonoff's algorithmic complexity, which is uncomputable, on the other. We have something in the middle, which is fast and easy to compute and really seems to work pretty well in practice.
We tested this on the 100 most famous or complicated equations from the Feynman Lectures on Physics, on a bunch of harder equations from other graduate physics textbooks, and on a bunch of very modular equations. And we also tested it on samples from a bunch of probability distributions, where we first used normalizing flows to convert the samples into a mystery function, which we then ran this on. And it works really well. It's not just state of the art in terms of the rate at which it can actually guess the correct answer, or a really good one.
But it turned out that this rewarding of simplicity also dramatically improved the robustness. When we added more and more noise to the data until we broke it, it turned out that with this information-theory thing that rewards simplicity, we could often tolerate a hundred or a thousand times higher noise levels. So the three-word summary of this application is, "pip install aifeynman," because that's all you have to type on your computer if you want to play with it and run it.
And coming back to the original question, I think yes, deep learning theory is very relevant to applications in so many ways. The one example I showcased here was, I really believe that training neural networks might only be the first step in a lot of future deep learning applications. Once you have something that works, it's very interesting to see if you can then gradually and automatically simplify it by discovering modularity or other things in the systems that you understand better and therefore have more reason to trust. And the more we put AI in charge of systems that affect people's lives, the more we want to be able to trust them. Thank you.
LORENZO ROSASCO: I hope I'm not the only one who didn't prepare any slides. I guess I could ask Tommy and steal one from him. He has a slide from Steve Smale and his work a few years ago. And I guess that's pretty much where I would like to start. I mean, Kenny started at the beginning guessing that most of us would indeed be in favor of theory, and yeah, you got it right. I'm indeed in favor of theory.
I think it's hard to ever say that you want the science, the practice, without theory. So as far as that part of the discussion goes, there's not much to say. I think about this in a different light. I started to do machine learning theory and algorithms around 15 years ago. And those were what I would call the "golden years." Papers like the one in this slide came out. And this was after 15 years of great papers that somewhat generalized the law of large numbers and really looked into approximation theory.
So statistics, approximation theory-- at the time, the models that were popular were things like kernel methods. Then there was the whole wave of sparsity, compressed sensing, high-dimensional phenomena-- probability in high dimensions. And then a bit later as well, everything about optimization and revamping old ideas-- think of simple ideas like gradient descent. And it was great, because the stuff that was used in practice was the stuff that you could actually study. And the theory, the math, was precise enough to make predictions and to inform, to some extent, the way algorithms should be designed and used.
I guess the discussion got a bit different in the last few years, because I think deep learning is undeniably successful but also way more complicated than the models we were used to. In some sense, we replaced systems of modules-- feature selection, classification-- with just one module, which is somewhat hard to study. And then, of course, because theory is important, there have been all kinds of theories proposed to explain deep learning.
And I guess the question is, is deep learning theory useful for deep learning? Perhaps. I have no doubt saying that theory is important for applications and for machine learning. But I'm not sure where I stand on whether deep learning theory these days is useful for deep learning. I'm not sure. [LAUGHS] Would I recommend that people working in deep learning today try to navigate the tons of deep learning theory papers that have come out? I'm not sure. I'm really not sure. I think this is a good question for a panel.
I would definitely suggest studying the classics more. There are fantastic results from approximation theory from the '80s and '90s, and their follow-up papers that are coming out now. The classical work on universality, for example, and the recent developments-- I think they are useful. They're among the few things that actually give insights. And I think it's good to see that things are so complicated that perhaps what we should study is not deep learning itself, but some of the phenomena that deep learning helped discover, like this idea that interpolating the data is not so bad. And it turns out that maybe we don't even know how to explain this for linear least squares.
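The linear least squares puzzle Rosasco alludes to can be seen in a few lines: with more features than samples, the minimum-norm solution fits the training data exactly-- noise included-- yet can still predict better than chance on fresh data. A toy sketch with synthetic data (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparameterized linear regression: more features (d) than
# samples (n), so infinitely many weight vectors interpolate
# the data. The pseudoinverse picks the minimum-norm one.
n, d = 20, 100
w_true = np.zeros(d)
w_true[:5] = 1.0                              # only 5 features matter

X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)  # noisy labels

w_hat = np.linalg.pinv(X) @ y                 # minimum-norm interpolator

train_err = np.linalg.norm(X @ w_hat - y)     # essentially zero

X_test = rng.standard_normal((1000, d))
test_mse = np.mean((X_test @ w_hat - X_test @ w_true) ** 2)
baseline = np.mean((X_test @ w_true) ** 2)    # error of predicting zero
```

Training error is zero to machine precision even though the labels are noisy-- the model interpolates the noise-- yet the test error still beats the trivial predict-zero baseline. That interpolation can be this benign, even in plain least squares, is the kind of phenomenon Rosasco says theory still struggles to fully explain.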
And so maybe that's a good thing. Maybe we should look back-- practitioners should look back at extremely simple situations, to see if what puzzles them is already there. It turns out that linear logistic regression has some really weird behavior. And ultimately, I think perhaps theory and practice should try to get back together by trying to design-- and this is happening a lot-- simpler experiments that define really well the setup, the parameters, and the details of the algorithms that are run.
Because especially with the libraries that are used today, it's often kind of hard to go into the details of how the algorithms are used. And if you want to do theory, the saying that the devil is in the details-- everybody who has done theory knows it is true. So if you don't know what the details are-- I often find these days that it feels hard to do a theory of deep learning exactly because we don't know all the details of the story. So yeah, I guess that's all I wanted to say.
DANIELA RUS: I firmly believe that theory is important and there should be a cycle between theory and application. They can go back and forth to inform and empower each other. And so when we have a complex system like a computer or a deep learning engine, it is important to ask the question, what is the system capable of, and how accurately? What is the scope? What are the guarantees of the current solutions? Are there limitations?
And then if we are developing a technology for people for broad use in applications, how can we use it safely? How can we improve our understanding? Can we make it better? And so these are the questions that I think about when I imagine the brains of machines that are driven by complex machine learning models.
Now, going to the first question, what is the system capable of? With machine learning, we have seen a huge range of really powerful applications. And I just want to share with you one example. This is a machine learning system that worked side by side with doctors to diagnose lymph node scans. And the error of the doctors was about 3.5% on this task. The error of the AI system was 7.5%. So both were imperfect, but working together, the humans and the machines achieved an 80% improvement.
And so this is an example that shows you how the current possibilities around machine learning can truly empower people with the kinds of tools that are not possible to get at without machines. So I like to think of the current solutions as being sort of like interns that help process data and provide humans with support for making decisions.
And why is that? Well, this is because of the current limitations of deep learning. And these limitations range across the availability of data-- ensuring that we have massive data sets that are well labeled and that include the corner cases, ensuring that whatever data is used has good quality, ensuring that there is no bias, and ensuring that if data is donated and shared, the privacy of the people who donate it is not violated.
Another limitation is on the computation side. The models tend to be really big, and so the cost of training and the cost of storage are non-trivial. Just think about the GPT-3 models-- how huge they are, and how much it costs to train them. I heard it was on the order of $4.6 million in energy costs to train the GPT-3 models.
Then we have the black box. Then we have the robustness issue, and also the fact that we don't get much semantics out of deep learning engines. And so with these challenges, we can then start asking: OK, if we know that these are the limitations of current systems, how can we make progress? How can we address each of these areas so that we can come up with better mathematical foundations?
And I just wanted to show you an example of a system that we have built in my group. And this example is about learning how to steer a self-driving vehicle by connecting raw images directly to actuation via a deep neural network. Now, this solution was actually very valuable for autonomous driving, because prior to putting this deep network between perception and actuation, we had a lot of modules in the self-driving pipeline. And these modules had to be finely tuned and parameterized for every type of environmental situation, for every type of road condition encountered-- whether it rains, or it's nighttime, or the visibility is poor.
So we did this, and we achieved a really robust solution-- we verified empirically and experimentally that the solution is robust. And in the middle of this video, you see the dots. They represent the machine learning engine that is driving the vehicle, that is steering the vehicle. You see the map. You see the trajectory taken by the vehicle. And then you also see the attention of the neural system.
And so now, what do we learn from this? Well, from this we learn that, yes, we have a system that has great empirical performance, but we don't really understand what happens among all these yellow and blue dots. And when we look at the attention map-- when we look at where on the road the machine learning system is basing its decisions-- we see that this is all over the map. And so from this, we can learn that maybe the models that we have right now have some limitations. Can we improve the models? Can we improve the algorithms?
And so I just wanted to show you a different model, a different version of a machine learning system, that essentially replaces the step function inside a neuron with a differential equation. And so now, notice it's the same task. It's the same input processing, where you see the convolutional layers. But now the decision part is encapsulated by many fewer nodes. And we can associate those nodes with specific high-level, abstract behaviors in the world.
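A single unit of this kind can be sketched as a state that evolves under a differential equation, integrated here with forward Euler. This is a simplified illustration in the spirit of continuous-time, liquid-style neurons, not the actual model from the work described; all parameters are invented:

```python
import math

def ode_neuron(inputs, dt=0.01, tau=1.0, w=2.0, b=0.0, A=1.0):
    """One continuous-time neuron: instead of a one-shot
    activation, the state x obeys
        dx/dt = -x / tau + sigmoid(w*u + b) * (A - x),
    so the input u gates how fast x is pulled toward A."""
    x = 0.0
    trace = []
    for u in inputs:
        gate = 1.0 / (1.0 + math.exp(-(w * u + b)))
        dx = -x / tau + gate * (A - x)
        x += dt * dx               # forward Euler integration step
        trace.append(x)
    return trace
```

Driving the unit with a constant input makes its state rise smoothly toward an input-dependent equilibrium rather than jumping: the dynamics themselves, not just a static nonlinearity, carry the computation.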
And we see that the attention map is actually better. It's much more focused on where we would intuitively expect the system to look. And so in summary, with this example, what I want to underscore is that theory and models are important. It is important to understand what guarantees we can make of our systems. It is important to understand how our systems operate. And these are necessary if we want any kind of independent autonomous behavior.
If we don't have that level of understanding, we can still use the systems, much like we did in the medical application where the system provided guidance for humans to make decisions on. But there is a continuum. And the more we experiment with, the more we build systems, the more we understand what properties matter, and the more we can tackle those properties with improved algorithms and models. So this is all I wanted to share. I'm going to stop and open this for discussions.
KENNETH BLUM: Thank you. Thank you so much, Daniela. Thank you to all the speakers. So I'm hoping to have a few minutes of discussion amongst the panel-- although I see it's quarter to five, so maybe this segment won't go long before we switch to the general Q&A. But since, as I suspected, we have general agreement that theory is important for applications, and we've seen some indication that for considerations of safety, and interpretation, and robustness, and improvement, theory is going to play an important role--
I wanted to know if we could get to the answer "no" instead of getting to "yes"-- set some limits on this. Accept for the moment the premise that deep learning is stalling, or is going to stall in five years or something like that. Do you think that we can get it unstuck with theory, or is it more likely to be something else? So, is theory going to get deep learning unstuck?
TOMASO POGGIO: Well, I think another question to ask first is-- we all believe that theory will come and will be useful. But maybe it will not happen. So far, the story has been that hackers have carried the day. Maybe they will continue to.
Maybe this is the end of science-- sorry, the end of theory, say. The End of Science was a book that appeared several years ago. But the end of theory-- you don't need to prove theorems anymore, because you can run so many simulations, given the computing power that we have. And then machine learning would be just one area where this appears, but--
LORENZO ROSASCO: But you're joking, right?
TOMASO POGGIO: We would have to be afraid of that, that this is the end of theory.
KENNETH BLUM: Max.
MAX TEGMARK: I think that the challenge should be more ambitious than just getting machine learning unstuck, as you said. I think if all that happens is that somehow we manage to accelerate just training neural networks better until they can do everything that we humans can do, I think that's the worst possible situation that we could put humanity into, that we have these incredibly powerful systems and we still have no clue how they work or why we should trust them.
I would feel much more confident if we can, instead, have deep learning play a part of what we build, but have it modular, have it bring in all sorts of theoretical tools so that we can actually understand more of what's happening and trust it. I was encouraged by what you showed, Daniela, and if you have time, I would love to chat more with you about this wonderful driving work you did, because it felt like it was in the same spirit as what we were doing, where you found, in the end, that you can have a simpler system that is a bit more modular, where you can say, actually, here is the sensory part, here is the decision part, and here is the motor part.
And I think, more broadly, if you look at us humans, why is it that we trust Tommy Poggio, and Daniela Rus, and so on to be much better at solving things in a trustworthy way? It's because, with our human brains, we're not just blindly training a neural network to try to replicate things. We're also using symbolic reasoning at this higher level.
Take Galileo-- to come back to another Italian here. He wasn't just training-- as a small child, he learned to catch apples that his father threw to him, or pencils, or whatever, because his neural network could roughly predict the motion. But when he was older, he figured out a symbolic representation for this: y equals x squared, it's a parabola. He started doing the science of this. And I think this ability is what sets us humans apart from most other animals. And I'm hopeful that we can bring together these more logic-based tools that help us understand things with machine learning.
KENNETH BLUM: Great. Daniela.
DANIELA RUS: So I usually love to argue with Max, and I'm finding myself in the odd position of agreeing with him. I would like to underscore his proposal that says let's ask for more. Let's use deep learning as a way of encouraging us to develop better models and better understanding of learning and intelligence, because ultimately, the profound objective for the field should be to develop, as we say with Tommy, the science and the engineering of intelligence.
Deep networks are based on a very old model. They result in very complex systems that seem to have extraordinary behavior. I mean, look at GPT-3. It is a huge model. It's immense. It uses every single text that has been made available by humanity. And yet, it's not perfect. So do we really need such big models?
Other recent work has shown that you can take these networks with hundreds of thousands of nodes and remove many of the parameters. You can remove many of the nodes-- up to 80% or 90% of them-- and still retain the same performance. And that's interesting, because a smaller, more compact model gives us a better handle on perhaps developing the function approximation that describes the system as a whole, rather than just the individual tiny functions inside the network.
What I forgot to tell you about the driving example is that the difference between the end-to-end original model and the new model is going from over a hundred thousand nodes in the neural network to 19 nodes in the new network. Now, I don't have a function approximation for the 19 nodes, but I can visualize it better, and I have a better hope of composing all the functions into an overall function with such a small handle than if I had to deal with hundreds of thousands of nodes.
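The pruning result Daniela describes (in the spirit of the lottery-ticket line of work) can be sketched as simple global magnitude pruning. The layer sizes, the random weights, and the 90% ratio below are illustrative assumptions, not the actual driving-network figures from the talk:

```python
# Illustrative magnitude pruning: zero out the smallest weights by
# absolute value and check what fraction of parameters survives.
# The three 64x64 toy layers and the 90% pruning ratio are made up
# for illustration only.
import random

random.seed(0)
layers = [[random.gauss(0, 1) for _ in range(64 * 64)] for _ in range(3)]

def magnitude_prune(weights, ratio):
    """Zero the smallest `ratio` fraction of entries (by |w|, globally)."""
    flat = sorted(abs(w) for layer in weights for w in layer)
    threshold = flat[int(ratio * len(flat))]  # ~ratio of entries lie below
    return [[0.0 if abs(w) < threshold else w for w in layer]
            for layer in weights]

pruned = magnitude_prune(layers, ratio=0.9)
kept = sum(1 for layer in pruned for w in layer if w != 0.0)
total = sum(len(layer) for layer in pruned)
print(f"kept {kept}/{total} weights ({kept / total:.0%})")
```

On random weights this only counts survivors; in practice one would prune a trained model, fine-tune, and then verify that accuracy is retained, which is the empirical claim Daniela refers to.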
KENNETH BLUM: Wonderful. So I think I want to-- just being mindful of the time here, I want to move on to the Q&A session where we open it up to questions from the outside. And Kris Brewer will take them, I think, from the Q&A box, and we'll read them out to the panel, and you can answer them.
KRIS BREWER: Sounds good. Our topmost question at the moment is from Jason Lynch. "What topics and directions in deep learning theory appear useful? Which ones are not?"
DANIELA RUS: I actually think that dealing with the size of the network is important. I think that dealing with the robustness of the network is important. These kinds of research directions aim at really better understanding the approximation that the network computes and giving some guarantees of what you can expect from the network. So I would say that anything that gives us any kind of handle on guarantees that go beyond empirical guarantees is valuable.
LORENZO ROSASCO: I would like to expand on that, if that's OK. I think the boundaries of where deep networks are successful are starting to emerge pretty clearly. There are applications where data is virtually free and exploration carries no risk, so you can unleash an agent that can try a lot of things without any risk of crashing a car or anything like that. Computation has to be plentiful, and that has to be true also for inference, which is what Daniela was alluding to. I think any theoretical advance that could chip away at this boundary, that could make this region larger, would be truly useful for deep learning. I hope that makes sense.
DANIELA RUS: Well said, Lorenzo. Well said.
TOMASO POGGIO: So I think the key, from a theory point of view-- the most interesting feature of deep networks, as Lorenzo said, is not specific to deep networks. It's a general question, and it's interesting because it goes beyond deep networks: they are overparameterized models. This is quite unlike classical statistics, where, typically, if you have to fit a model to data, you want to have more data than parameters in the model. In deep networks-- and you can do the same with simple linear kernel methods-- you have more parameters than data, typically many more.
So this underlies several questions you can ask. Why are convolutional networks so good? It's not only deep networks with many layers; it's convolutional networks. Why are they so good? Another one-- why is optimization so easy? And then [INAUDIBLE]-- it's easy because you have many parameters. You have more parameters than data, and you can fit the data. You can find a zero of the loss. But one has to formalize this.
And then, also related to overparameterization: how can you generalize, not overfit, and predict well? These are, I think, some of the basic questions, and they all revolve around overparameterization. And just to show you-- I think what Daniela and Max said is quite relevant.
I think one of the most important features, if you look at convolutional networks, is the fact that they can avoid the curse of dimensionality in approximating functions that have a graph structure like this one. They are functions of functions of functions, and the constituent functions have low dimensionality-- in this example, two variables.
Essentially, for a function of this type, if you use a convolutional network, you can approximate it very well with a number of parameters that is linear in the dimensionality instead of exponential-- the so-called curse of dimensionality. And this brings up the idea that convolution is fine for situations like images, where you have local structure.
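The linear-versus-exponential contrast Tommy sketches can be stated informally as follows; this is a paraphrase of results from the compositional-approximation literature (smoothness assumptions and constants omitted), not a formula shown in the talk:

```latex
% Parameters N needed to approximate f : R^d -> R to accuracy \varepsilon.
% Generic (non-compositional) function, shallow network:
\[
  N_{\text{shallow}}(\varepsilon) = O\!\left(\varepsilon^{-d}\right)
\]
% Compositional function (binary tree of 2-variable constituents),
% deep network whose architecture matches the graph:
\[
  N_{\text{deep}}(\varepsilon) = O\!\left((d-1)\,\varepsilon^{-2}\right)
\]
```

The exponent in the first bound is what makes the generic problem exponentially hard in the dimension d; matching the network architecture to the function's graph replaces it with a fixed small exponent, leaving only a linear dependence on d.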
But it should not be fine for many other situations-- for instance, using convolution when you do speech recognition. It's OK to have convolution in time, but it makes no sense to have convolution in frequency. You don't have locality in frequency. By the way, the important part of convolution is not weight sharing. Weight sharing does not reduce the exponential curse of dimensionality. What is important is the locality. Weight sharing helps, but not exponentially.
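The locality-versus-weight-sharing point can be illustrated by counting parameters for a toy 1-D layer; this counting is a sketch to illustrate the scaling, not something taken from the talk:

```python
# Parameter counts for a 1-D layer mapping n inputs to n outputs with
# receptive-field (kernel) size k. Locality removes the n*n blow-up;
# weight sharing then shaves a further factor of n, which is useful
# but is not what defeats the exponential cost.
def fully_connected_params(n: int) -> int:
    return n * n  # every output unit sees every input unit

def local_unshared_params(n: int, k: int) -> int:
    return n * k  # each output sees only k neighboring inputs

def local_shared_params(n: int, k: int) -> int:
    return k  # one kernel of size k reused at every position

n, k = 1024, 3
print(fully_connected_params(n))    # 1048576
print(local_unshared_params(n, k))  # 3072
print(local_shared_params(n, k))    # 3
```

The middle case, local but unshared, exists in practice as so-called locally connected layers: they keep the benefit of locality without assuming translation invariance, which matches Tommy's point that locality does the heavy lifting.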
So I think in the future, you may analyze a particular task, a bit like dimensional analysis in physics. You have to say which variables are important and which other variables they interact with. Think about financial problems: you have bonds and stocks and other things. Convolution is not good there, but I'm sure there is a compositional structure like this one, a directed acyclic graph, that may represent your problem pretty well.
And if you have a deep network with the same architecture as this graph, you'll be in a very good position to solve it efficiently. So this would be the kind of theory at the level of decomposing a task into modules, a bit like the modular approach Max Tegmark described. And I think there is a lot of theory and potential application in this direction.
KRIS BREWER: We have one from Jim DiCarlo. "To be provocative, if we were to not use the word 'theory' and instead say that we need models that can make accurate predictions, what would be lost? Similarly, do we have a theory of an iPhone? If no, then do we not trust our iPhone? If yes, then please tell me how we use that theory to explain the limits of an iPhone."
TOMASO POGGIO: We have a theory of an iPhone. We have theories at different levels. We have theory at the level of the transistors-- we know exactly how they work-- and of the chips. We have a theory at the level of the software, of the graphics interface. We have a theory.
MAX TEGMARK: Also, just to chime in quickly a little bit more on Jim DiCarlo's question there-- if it's just a word game, if you're comfortable saying that we don't have a theory of general relativity, but general relativity is just a model that makes accurate predictions, then I guess it is just a word game. But if you take theory in the more traditional sense of science, then I think a theory is something more ambitious, where when you find it, you're often able to apply it much more broadly.
Tomaso mentioned Maxwell's discovery of Maxwell's equations. Even though it was discovered first to explain some limited phenomena with magnets and currents in wires, it was so much more general that it could also be used to build radio. I think similarly, deep insights that you might want to call contributions to deep learning theory should be applicable in much wider domains--
MAX TEGMARK: --problems that you didn't even think about applying them to initially.
TOMASO POGGIO: Let's see if I managed to provoke anybody into more questions.
KENNETH BLUM: Lorenzo, were you going to chime in?
LORENZO ROSASCO: I was just saying that there is a certain level of precision in what you mean by 'theory.' Oftentimes, we refer to a mathematical theory that gives extremely quantitative predictions-- not just accurate but, as Max was saying, broad, quantitative, and able to generalize well. Sure, if that's what we mean by 'model,' I'm happy to say that. And as Tommy was saying, think of an example that is often made when you compare deep learning or AI to fields where we do have theory: airplanes.
I'm not sure I need a theory to explain my iPhone, but I certainly want one to explain my airplanes. It's a fading memory now, but I used to be on a plane very often, and I definitely want to know that somebody knows why they fly and how they stay up there-- that it's not just a trial-and-error thing. So yeah, I think we do have a theory there, and it's the same kind of theory we have for the iPhone. The distinction is pretty clear to me.
MAX TEGMARK: And finally, coming back to the iPhone in Jim's question, what I would ultimately love to have is to be able to do formal verification of the software on my iPhone so I can prove to myself that it will never be hacked. And the more safety critical a device is, I think, the more valuable that is, including Lorenzo's airplanes.
KRIS BREWER: Our next one's from [INAUDIBLE]. "Could you give us an example where the machine learning theory paved the way for a practical discovery which boosted the performance on real applications?"
LORENZO ROSASCO: Let's apply the same question to Maxwell, Tommy. What is an example where Maxwell's equations were useful beyond Volta's battery?
TOMASO POGGIO: Yeah, the red ones on this slide are arguably all things where theory played a role. And they are all big.
MAX TEGMARK: Just to add to that for machine learning specifically-- I agree with Tommy that we're very far away from having anything we could call a theory of deep learning, but there has been a lot of progress. Every AI conference you go to has a theory section, and you can already see how progress there has helped in a lot of ways. For example, various abstract analyses of the loss landscape of machine learning problems have given us much better training algorithms. That's why GANs today train so much better than Ian Goodfellow's first GAN, for example-- very practical.
TOMASO POGGIO: And there are people like Ben Recht and Ali Rahimi who, I think, wrote a successful, pretty well-known paper comparing the state of the theory of deep learning to alchemy, saying it's time to start developing the chemistry of deep learning-- enough of alchemy. So I think it has to be done, and it will be done pretty soon, I think. That's my prediction-- two to five years.
KENNETH BLUM: What about counterexamples-- of a limited sort, I understand-- where the theory of something did not particularly advance applications, even though we understood the science of the thing better? Can people think of something like that? I don't know-- slaves built pyramids without Newtonian mechanics or an understanding of gravity, and lots of things, lots of widgets, were made.
So along comes Newton and, what, do we have some better machines? Or is that a kind of a counterexample, or is there something else where you have a tremendous advance in theory, but it's not really linked to application so much? Or maybe you could say that for QCD, or something like that.
TOMASO POGGIO: What about biology?
MAX TEGMARK: Well, the buildings of MIT weigh a lot more than the Giza pyramids combined, but they used a lot less than 400,000 slaves in their construction, thanks to all this theoretical progress.
KENNETH BLUM: OK. But one might say that we have a deeper understanding of symmetry breaking in fundamental physics, but that has not helped with applications.
MAX TEGMARK: Actually, the study of symmetries is exactly what inspired Yann LeCun and others to come up with convolutional neural networks. There have already been some interesting papers trying to look at other symmetries. Convnets come from translational symmetries, just like Tommy mentioned, but there are many other symmetries which we may be able to exploit to improve other aspects of machine learning.
KENNETH BLUM: OK.
KRIS BREWER: It's from Mozart200. "What kind of important and essential role does BD play in deep learning, both in theory and practical applications?"
TOMASO POGGIO: BD?
KENNETH BLUM: Big data.
TOMASO POGGIO: Oh, big data. OK.
DANIELA RUS: Well, models require a lot of data to train, and so most applications that are effective today are built on huge amounts of data. From this point of view, big data is super important. In fact-- let's see, going back to the GPT system-- do you remember how many parameters it has? I think it has 175 billion parameters, and it used 45 terabytes of data. That is huge data.
Now, GPT-3 is considered sort of universal for natural language processing-- each application can rely on GPT-3 but would require some specialized training. And the reason the performance of the system is so good, and yet still not perfect, is this very, very broad access to data.
LORENZO ROSASCO: And maybe, expanding on a place where theory could really help: one way to mitigate the need for large amounts of data is simulation-to-real transfer. If you could come up with systems that are trained in simulation and are demonstrably robust in the real world, you could leverage the fact that simulations generate virtually infinite amounts of data, and then deploy your systems in the real world. This is a wide-open field for theoretical contributions: how can we characterize the robustness properties of networks trained in one environment when we transfer them to another?
TOMASO POGGIO: Here is a question. Max, would we need theoretical physicists anymore once we have AI?
MAX TEGMARK: [LAUGHS] Will any of us be needed? Well, I think if Demis and his colleagues ever succeed in building artificial general intelligence, then we have to ask bigger questions about, of course, all of us. But I prefer not to ask the question, will we be needed, as if somehow we're just sitting here, passively waiting to see what's going to happen. I mean, this is our future. We're all building it together. So I would prefer that we build a future that we're excited about, where we all have meaningful things to do.
But coming back to the more near-term future-- certainly, I am hopeful that ideas from theoretical physics can continue to be useful for machine learning. I think physicists tend to be among the most ambitious in terms of what they mean by understanding something. We don't just accept that Elon Musk's rockets are probably going to get to Mars and not crash because of some trial and error-- we tried it a lot.
Instead, we decided to figure out the laws of gravity and all these other things. If we can take both that audacity from physics and also some of the physics techniques for studying nonlinear dynamical systems-- which is what neural networks really are-- maybe physicists can help a little bit in the quest that you're advocating, Tommy, for more of a theory before we become obsolete.
KRIS BREWER: Let's pop onto the next one from [INAUDIBLE]. "It seems current research is stuck on deep neural networks and data. Is there any direction that frees AI models from data and moves toward AGI?"
MAX TEGMARK: I can say something brief again, connecting back to what we said earlier. I think if you compare our intelligence with the intelligence of a chipmunk, for example, I would say the chipmunk is just as good at deep learning in its visual system as we are, at recognizing the difference between an acorn and a peanut, and so on. And we tend to think of deep learning as the new latest, coolest thing, because we came to it after we did the more logic-based stuff.
But actually, deep learning is a skill we share with most other animals. What's unique about human intelligence-- and more closely related to our definition of AGI-- isn't that we can do deep learning in our heads, but that we can also do symbolic reasoning. So I think the path forward will probably involve combining deep learning with more of the old-fashioned symbolic reasoning AI, the way human scientists do.
KENNETH BLUM: But then, do you think that theory gets us there?
MAX TEGMARK: I think theory is very much a key part of it, yeah, to understand how this all fits together. That's why I mentioned Galileo earlier, again. He didn't just content himself with having enough intuition for how a ball was going to move to be able to catch it. But he tried to distill out the theory of motion, theory of gravity, and so on. This is the edge we humans have over chipmunks, that we can also do theory. And I think we should leverage that.
KENNETH BLUM: Yup, great.
MAX TEGMARK: Well, thank you so much for hosting this. This was a really fun conversation.
KENNETH BLUM: Wonderful. Yeah.
ANDREA TACCHETTI: Thank you very much for inviting me. It was awesome.
MAX TEGMARK: [ITALIAN]
TOMASO POGGIO: Can't see you, but it has been good to see you, Andrea.
ANDREA TACCHETTI: Yeah. Can't wait to do this in person. I mean it.
TOMASO POGGIO: Yeah. Very good.
KENNETH BLUM: OK. Stay well, everybody. See you the next time. Thanks for joining us.
MAX TEGMARK: Thanks, [INAUDIBLE].
TOMASO POGGIO: Bye bye.