Stability of overparametrized learning models
Date Posted:
April 15, 2020
Date Recorded:
April 15, 2020
CBMM Speaker(s):
Tomaso Poggio, Lorenzo Rosasco
Speaker(s):
Mikhail Belkin, Constantinos Daskalakis, Gil Strang
Description:
A panel discussion featuring Tomaso Poggio (CBMM), Mikhail Belkin (Ohio State University), Constantinos Daskalakis (CSAIL), Gil Strang (Mathematics) and Lorenzo Rosasco (University of Genova).
Abstract: Developing theoretical foundations for learning is a key step towards understanding intelligence. Supervised learning is a paradigm in which natural or artificial networks learn a functional relationship from a set of n input-output training examples. A main challenge for the theory is to determine conditions under which a learning algorithm will be able to predict well on new inputs after training on a finite training set, i.e. generalization. In classical learning theory, this was accomplished by appropriately restricting the space of functions represented by the networks (the hypothesis space), characterizing a regime in which the number of training examples (n) is greater than the number of parameters to be learned (d). Here we will discuss the regime in which networks remain overparametrized, i.e. d > n as n grows, and in which the hypothesis space is not fixed. Our panel discussion will center on key stability properties of general algorithms, rather than the hypothesis space, that are necessary to achieve learnability.
TOMASO POGGIO: So I'm Tomaso Poggio, and today, we will have a discussion, and we'll do it in the following way-- that Misha Belkin will speak for 10 minutes, and then I'll speak for 10 slides, and then Lorenzo, and Costis, and Gil will start questioning, and discussing, and criticizing what Misha and I have said. And there will be, also, time for a question from the audience. The best way, probably, would be to, let's see, raise your hand, or maybe using chat to send me a message.
All right, so it's great to have Misha here, thanks to Zoom. He's in Columbus, Ohio. And I was thinking today that you are my grandchild, academically speaking, because Partha Niyogi, who was your advisor, was a student of mine a long time ago. Partha was great. Unfortunately, he passed away.
But let's start with Misha. He has done some very interesting work-- not only the work with Partha, but also the work he's speaking about today, and some other, more recent work that he'll speak about, maybe, next time, or soon, at CBMM. And this is about overparameterization, interpolation, and, essentially, the puzzles of modern machine learning. Misha?
MIKHAIL BELKIN: Thank you, Tommy. So yeah, I was asked to give a short overview of some problems which are very much inspired by modern machine learning, and some aspects of which we're starting to understand now. I think it makes sense to frame it as 1.5 lessons of deep learning-- there are actually two lessons here, but I only have time to really discuss one. I'll just very briefly say something about the second one.
And so the lessons are the following. The first lesson, which we're somehow learning now and starting to understand, is that certain foundations of statistical learning need to be rethought with a new view of model complexity and overfitting. So that's the lesson for statistics. And the second lesson is for optimization-- that we now realize we need a new theory of non-convex optimization for large systems.
And the common thread of those two things is this idea of overparameterization, which really, I think, has become one of the main narratives [INAUDIBLE] recently, explaining empirical phenomena with deep learning, and the consequence of that, which is interpolation.
So, as we know, empirical risk minimization is one of the main theoretical foundations of machine learning. And many, probably most, theoretical analyses for machine learning are based on empirical risk minimization. We simply try to minimize our empirical risk, the risk on the training data, over a class of functions.

And the way this analysis works-- and I'm repeating things which you will, of course, know already, but I think it's probably useful to set this up as a basis for the discussion-- is that it classically has been analyzed by using uniform laws of large numbers. There are many, many various forms of those laws of large numbers, but they're all what you might call WYSIWYG bounds-- what you see is what you get-- in that you're comparing the expected risk, which is what we get in the future, to the empirical risk-- the loss on the training data-- plus a complexity term depending on the function class, which may or may not be data-dependent. Classically, it's not data-dependent, but there are actually a lot of bounds now which are data-dependent-- well, not just now-- starting with margin bounds and other things.
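A minimal sketch of the kind of "what you see is what you get" bound being described here, in generic notation (the symbols below are assumed for illustration, not taken from the slides): with probability at least $1-\delta$ over a training set of size $n$, for every $f$ in the hypothesis class $\mathcal{H}$,

```latex
% Generic uniform-convergence bound: expected risk <= empirical risk + complexity term.
L(f) \;\le\; \hat{L}_n(f) \;+\; C(\mathcal{H}, n, \delta),
\qquad \text{e.g.}\quad
C(\mathcal{H}, n, \delta) \;=\; 2\,\mathfrak{R}_n(\mathcal{H}) \;+\; \sqrt{\frac{\log(1/\delta)}{2n}},
```

where $\hat{L}_n$ is the empirical risk and $\mathfrak{R}_n(\mathcal{H})$ is a complexity term, such as the Rademacher complexity, which may or may not be data-dependent.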
Now, this has been a foundation of machine learning. And if you look at Vapnik's book, he said that the theory of induction is based on the Uniform Law of Large Numbers plus capacity control. And the outcome of that thinking that we all know-- and which actually precedes Vapnik-- is this U-shaped curve.
And the way it works is that as we increase complexity from low to high, the training error goes down, but the test error traces this U shape, going from under-fitting to over-fitting. And the goal of machine learning would be to find the model at the bottom of the U. The [INAUDIBLE] would then be how to theoretically identify that model using bounds.

And a corollary of this is that if you fit the data exactly, the model with zero training error will overfit, meaning that it will generalize poorly. So that has been the structure of machine learning. I'll go over this in just a couple of slides.
But empirical evidence actually suggests that this is not exactly true. And you can train, in particular, a neural net to have 100% training accuracy. Yet, the test accuracy is still very good. There is no obvious over-fitting going on.
And then, something, perhaps, even more surprising happens. Well, first, you don't really need to have a neural net here. You can do this with the kernel machine. And you can do the same thing, even with data which are very, very noisy-- like 60% of the data or even 80% of the data is noise.
So you can fit this kernel machine to fit the data exactly. The data [AUDIO OUT] be noisy, yet you'll get performance which is close to the Bayes optimal. It doesn't lack the [INAUDIBLE]. Graphically, this is represented by this red line being close to the green line. And this is very counterintuitive from the point of view of statistics, because usually, we think that fitting noisy data is bad.
So that's the empirical finding, just to summarize it briefly-- that the goal of optimization, which is interpolation, unexpectedly aligns with the goal of machine learning, which is to minimize the expected loss. I emphasize this "unexpectedly" here, because I think that's something new. At least for noisy data, I think it is new.
Now, by some extra argument which I'm not going to go through, we can see that WYSIWYG bounds really have a lot of difficulty accounting for this type of generalization. So maybe that's the first lesson of deep learning, which is for statistics: the theory of induction-- to use a term from Vapnik-- for modern learning cannot be based on uniform laws of large numbers.
And the question, of course, is then, what do we do about this? Where should we go next? And while we don't yet have a full theory of it-- we only have a rather partial theory, I would say, rather special cases, and Tommy will have some suggestions of where it may go, the idea of stability and understanding it from that point of view-- we're kind of, I think, starting to get some sort of direction.
And the direction is based on this picture. The classical risk curve, there is nothing wrong with it as such. There is a regime-- you may call it classical regime-- where you actually get this U-shaped curve.
But this is only half of the picture. And the other half of the picture starts when this U-shaped curve ends and when the training risk becomes 0. So basically, for these overparameterized models whose training risk is 0, adding parameters usually helps.
And in fact, this curve goes down and often goes down indefinitely as the number of parameters, as the complexity of it goes to infinity. It sort of asymptotes, and often, it [INAUDIBLE] U.
So this is, maybe, the kind of view, at least from my point of view, of modern machine learning. A general theory of machine learning needs to explain not just the classical part, but the new part, and ideally, of course, both at the same time. That would be the grand unification.
Now, just to give an example-- which I think is helpful to have something specific in mind-- I'm making a very simple neural net to fit the red points here. And my data is just one-dimensional data. The blue line is the output of a neural net. I'm fixing the first layer, so the first layer just has random weights, and I'm only training the second one. So the problem becomes linear regression.
And you can see that with three neurons, you get this kind of classical, nice statistical fit. With 30 neurons, we get, again, classical overfitting. The curve goes crazy. But with 3,000 neurons-- way, way overfit; there are just way more parameters than data-- it works nicely, and you get this nice curve.
Now, which is better, 3 or 3,000? Well, it depends on your model, depends on your data. But they're both, as you can see, sort of reasonable approximations [INAUDIBLE]. So that may be a kind of visual picture of what's going on.
Now, to connect to this idea of stability, you can say, well, what if I take these random features, right? I can do different samples. And you can see that with 35 random ReLU features, you get overfitting. And each one of those curves is kind of crazy.

So you get a lot of instability in this regime. Yet, when you average those curves, you get this red curve, which is not so bad either. So somehow, you can gain stability by averaging, which is very reminiscent of bagging. Essentially, it's a version of bagging.
And interestingly enough, with 3,000 random ReLU features, you don't really need any averaging. This is very stable. So each of these curves is essentially the same. Averaging doesn't do very much. And, well, if you've seen it, it's kind of reminiscent of boosting.
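A rough illustration of the experiment described here (a sketch under assumed details: the actual data points, feature scaling, and solver from the slides are not specified in the talk): fit 1-D data with random ReLU features, train only the second layer by minimum-norm least squares, and compare several random draws of the first layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data (assumed; the actual points from the slide are not available).
x = np.sort(rng.uniform(-1, 1, 10))
y = np.sin(3 * x) + 0.3 * rng.standard_normal(10)

def relu_features(x, w, b):
    """Random ReLU features: first layer fixed, only the second layer is trained."""
    return np.maximum(0.0, np.outer(x, w) + b)

def fit_predict(n_features, x_train, y_train, x_grid, rng):
    w = rng.standard_normal(n_features)
    b = rng.standard_normal(n_features)
    Phi = relu_features(x_train, w, b)
    # Minimum-norm least-squares solution (pseudo-inverse); in the
    # overparameterized case this picks one interpolant among infinitely many.
    coef = np.linalg.pinv(Phi) @ y_train
    return relu_features(x_grid, w, b) @ coef

x_grid = np.linspace(-1, 1, 200)
for n_features in (3, 30, 3000):
    # Several random draws of the first layer: near n_features ~ n the individual
    # fits vary wildly (instability), while in the heavily overparameterized case
    # they barely change, so averaging adds almost nothing.
    fits = np.stack([fit_predict(n_features, x, y, x_grid,
                                 np.random.default_rng(s)) for s in range(10)])
    spread = fits.std(axis=0).mean()
    print(f"{n_features:5d} features: mean spread across random draws = {spread:.3f}")
```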
So in a sense, stability clearly connects to this phenomenon that we have in general.
AUDIENCE: I have a question.
MIKHAIL BELKIN: Yeah?
AUDIENCE: So essentially, if you take, like, 3,000 parameters, your generalization error is the same as taking, I don't know, 300 parameters and averaging?
MIKHAIL BELKIN: Well, in this case, there is no actual generalization error, because this is just data that I chose somehow. There is no more underlying model, so I can't really talk about generalization error. But if you actually do that for, say, three, yes, you'll get something very similar. I don't know that it's the same. I don't think it's exactly the same. But the models you get are quite similar.
AUDIENCE: So essentially, you claim that if I overparameterize--
TOMASO POGGIO: I'm sorry, [INAUDIBLE]. Let's have the question after the two of us speak. OK?
AUDIENCE: OK.
TOMASO POGGIO: Thank you. Misha?
MIKHAIL BELKIN: OK. So this is kind of a summary-- well, this is maybe a cartoon of ERM and interpolation. And then I have a couple more slides on optimization.

So what do we have with ERM and interpolation? You can have this cartoon of the situation, as follows. Classical empirical risk minimization tries to minimize the loss over this class of functions. Now, the modern regime-- as opposed to this classical regime-- minimizes the norm subject to fitting the data exactly.

Now, if you look at linear regression, both of them are actually what we call linear regression. One is the underparameterized case. The other is the overparameterized case. And remarkably, when you optimize it by gradient descent-- which is very similar to what people use in practice with SGD-- you don't have to know which one is which.

So somehow, SGD just knows whether it's underparameterized or overparameterized and always gives you the right solution. It's pretty remarkable. And presumably, something similar is happening for neural networks. Of course, we understand that much worse than the linear case. So maybe this is a useful way to think about this, although it's not the full [AUDIO OUT]
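A minimal sketch of this point for the linear case (an assumed setup, not taken from the slides): gradient descent on the square loss, started at zero, converges to the same minimum-norm least-squares solution in both the under- and overparameterized regimes, without being told which regime it is in.

```python
import numpy as np

def gd_least_squares(X, y, lr=1e-3, steps=200_000):
    """Plain gradient descent on the square loss, started at w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

rng = np.random.default_rng(0)
for n, d in [(50, 10), (10, 50)]:           # underparameterized, overparameterized
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)
    w_gd = gd_least_squares(X, y)
    w_pinv = np.linalg.pinv(X) @ y           # minimum-norm least-squares solution
    print(n, d, np.allclose(w_gd, w_pinv, atol=1e-4))
```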
So let me say a couple of words about the second lesson, just for a minute. I think with optimization, what we find is that those methods, like gradient descent [INAUDIBLE], are extremely efficient. And this can be kind of a puzzle, because, usually, when we think about non-convexity-- and those systems are extremely non-convex--
We have this picture. There are a bunch of local minima. There is no generally good optimization technique for such things. And it's kind of a mystery why SGD does what it does, or GD.
However, this picture is deeply misleading, I feel, for an overparameterized system. Really, for an overparameterized system, the picture should be this. It's a landscape, and the global minima form a valley of local or global minima. So it's a [INAUDIBLE] full of global minima with curvature.
And the interesting thing about this type of landscape is that it's not convex, even locally. So this one is convex locally around every local minimum. This is nowhere convex-- not at a single point. But somehow, it does work. You can optimize it.
So the theory-- and maybe this is the second lesson-- the theory of optimization for overparameterized systems cannot be based on convexity. And we need a new analysis. Well, actually, this analysis turned out to be not so new. It goes back to the Polyak-Lojasiewicz condition, which is from back in the '60s. But somehow, we need a new view of these overparameterized systems. And this is a topic for a separate discussion, so I think I'll stop. But this is kind of a summary slide.
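For reference, the Polyak-Lojasiewicz condition alluded to here can be written in its standard form (assumed, not quoted from the slides): a loss $L$ with minimum value $L^{*}$ satisfies PL with constant $\mu > 0$ if

```latex
% Polyak-Lojasiewicz (PL) condition: gradient dominance, no convexity required.
\tfrac{1}{2}\,\|\nabla L(w)\|^{2} \;\ge\; \mu\,\bigl(L(w) - L^{*}\bigr)
\qquad \text{for all } w \text{ in the region of interest,}
```

and under this condition, gradient descent with a suitable step size $\eta$ on a smooth loss converges linearly, $L(w_t) - L^{*} \le (1 - \eta\mu)^{t}\,(L(w_0) - L^{*})$, even though the landscape may be nowhere convex.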
TOMASO POGGIO: Thank you, Misha. So I think overparameterization-- you know, Misha spoke about it and made a very good point-- is very important for understanding the success of deep learning, of deep networks.

My present conjecture-- and I would love it if somebody could prove or disprove it-- is that the reason why deep networks perform so well is, one, because they capture quite powerful prior information about the compositionality of certain tasks, in vision, speech, and text, thanks to the fact that these tasks can be decomposed in terms of functions of functions, in the same spirit as parts of parts of parts and wholes. And the other reason is overparameterization-- the fact that deep networks are much better suited than shallow networks to deal with a number of parameters that is larger than the number of data.
So this is just putting down some definitions. You have a training set, pairs of x and y. We have the expected error of the function f, which is the expectation over z of some measure of error, like the square error or an exponential error of the function f. We have the empirical error of f, which is measured on the training set, from 1 to n. ERM is defined as the function that minimizes the empirical error on the training set. This is an infimum, so this is a quasi-minimizer, a bit more general than a minimizer.

And when we say "generalization," technically, what is meant in the literature is not predictivity per se, but convergence of the empirical error of the empirical risk minimizer to the expected error of the minimizer, for n going to infinity, where n is the number of examples.

And consistency is convergence of the expected error of the minimizer to the best you can do in the function class over which you are minimizing-- like, for instance, a reproducing kernel Hilbert space, or a space of, say, some smooth functions, or a compact space of functions.
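In symbols, these definitions can be sketched as follows (notation assumed to match the intent of the slides):

```latex
% Expected and empirical errors, and ERM as a quasi-minimizer of the latter.
L(f) = \mathbb{E}_{z}\bigl[V(f,z)\bigr], \qquad
\hat{L}_n(f) = \frac{1}{n}\sum_{i=1}^{n} V(f, z_i), \qquad
\hat{L}_n(\hat{f}_n) \;\le\; \inf_{f \in \mathcal{H}} \hat{L}_n(f) + \varepsilon_n .
% Generalization:  |\hat{L}_n(\hat{f}_n) - L(\hat{f}_n)| -> 0   as n -> infinity.
% Consistency:     L(\hat{f}_n) -> inf_{f in H} L(f)            as n -> infinity.
```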
I will speak like Misha did. He invented the terms, which are, actually, quite good. Classical regime means the number of parameters is less than the number of examples; the modern regime is when the number of parameters is greater than the number of examples.

So why is this? It's because in the classical theory-- the Vapnik one and so on-- you are assuming that the hypothesis space, the function space, is fixed. This means your number of parameters may be large, but it's fixed. And this means that at some point, when you consider the number of examples growing and going to infinity, you are going to be in the situation of the classical regime of n greater than d.

And so we know that in the classical theory, like Misha said, the uniform law of large numbers holds. So in this theory, you can prove that empirical risk minimization-- the solution of the empirical risk minimization-- will generalize: the empirical error will converge to the expected error, and it will be consistent, so the empirical minimizer will converge to the best you can do in the class.

This happens if and only if the function class is uniformly [INAUDIBLE]-- a class of functions that corresponds, for binary functions, to finite VC dimension, finite covering numbers, and so on.
So notice, as I said, that this theory deals with a classical regime. It cannot deal with a modern regime, because in the modern regime, you'd like to keep d greater than n when n goes to infinity. And you cannot do this in the classical framework because h has to be fixed.
And as Misha said, in the modern regime in which the empirical error can be 0-- the interpolating case-- we cannot expect generalization-- in other words, convergence of the empirical error to the expected error-- because the expected error will be, in general, different from 0.
So we need a new theory. Which kind of theory? Well, an indication is given by the fact that on the right, you see here the double descent curve that Misha found. And this is from his paper on a kernel machine, I think, for the test error.
And we observed with Andy and Gil that the condition number of a matrix also has a double descent curve-- which is an interesting fact, because it says that from the point of view of stability of the solution of a system of linear equations, the worst situation is d equals n. And this is the usual, classical case you studied in high school: as many unknowns as equations, and then, with some condition on the determinant of the matrix, you can solve it.
But this turns out to be the most unstable, the worst possible solution. Notice that for n less than d-- so the overparameterized case-- you have an infinite number of solutions in the linear case to the linear system of equations. And you have, in general, zero solutions in the under-parameterized case, when n is greater than d. But you can find a unique solution that minimizes the square error.
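A minimal numerical sketch of this observation (random Gaussian data matrices, an assumption used here only for illustration): the condition number of an n x d matrix peaks around d = n and improves again as d grows past n.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50  # number of equations / data points

# Condition number of random n x d Gaussian matrices, over a few trials.
# It blows up near the interpolation threshold d = n and decreases again
# in the overparameterized regime d > n -- the "double descent" of conditioning.
for d in (10, 25, 45, 50, 55, 100, 400):
    conds = [np.linalg.cond(rng.standard_normal((n, d))) for _ in range(20)]
    print(f"d = {d:4d}   median condition number = {np.median(conds):10.1f}")
```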
All right, so here is my proposal for a general theory that works for the classical and the modern regime. And this is by using a notion of stability that we defined 20 years ago in a paper with Partha Niyogi, Sayan Mukherjee, and Ryan Rifkin-- we called it "Leave One Out Cross-Validation Stability." And its definition, in expectation, is the following.

In expectation over the training set S, for any i, we want the absolute value of this difference to be less than a certain beta, and in the classical regime we want beta to go to 0 as n goes to infinity. So if you look at this, this is measuring the difference in the loss when z i is in sample-- z i is one element of S, the training set-- and the loss, on z i, of the function trained on S i, which is the training set minus the data point z i. So this is out of sample. This is measuring the difference between in sample and out of sample by perturbing the training set by one data point.
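Written out, the definition being described is, in symbols (notation assumed to match the slide):

```latex
% CV leave-one-out stability: in expectation over the training set S,
% removing the point z_i changes the loss at z_i by at most beta_CV(n).
\mathbb{E}_{S}\Bigl[\,\bigl|\,V(f_{S}, z_i) \;-\; V(f_{S^{i}}, z_i)\,\bigr|\,\Bigr]
\;\le\; \beta_{CV}(n),
\qquad \beta_{CV}(n) \to 0 \ \text{as} \ n \to \infty \ \text{(classical regime)},
```

where $S^{i} = S \setminus \{z_i\}$, so the first term is the in-sample loss at $z_i$ and the second is the leave-one-out, out-of-sample loss at $z_i$.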
OK, so we proved-- this was, as I said, some 15 years ago-- that if beta goes to 0 as n goes to infinity, this happens if and only if you have generalization and consistency-- in other words, if and only if H is UGC. It's completely equivalent to the classical conditions.

So this definition captured the classical theory, replacing ideas like VC dimension with stability, CV leave-one-out stability. And what happens in the modern regime?
Well, in the modern regime, the term that measures your error in sample-- the first one, V of f S at z i-- is 0, if you are interpolating. You are overparameterized and interpolating-- that's the assumption. And then, what remains is the other term. And this is the expected error.

So in this case, we have this trivial result: the CV leave-one-out stability, in expectation, is equivalent to the expected error. Notice that we can still use the idea, in this case, of looking at a perturbation of the training set-- the effect of that perturbation-- because, for instance, in this example of the square loss, linear regression, y i is exactly equal to the prediction on x i of w S, the weights in the linear regression obtained by training on S, because of interpolation. And so we are still measuring here the difference between the w subscript S and the w subscript S i.
Now, a more interesting statement is that for ERM, in both the classical and the modern interpolating regime, minimizing beta CV minimizes the expected error. So that would be a unifying picture. And furthermore, what does it mean to optimize stability, to minimize beta CV?
Well, this is still being done, but the result seems pretty clear: for both kernel machines and deep networks, minimizing beta CV in the modern regime is equivalent to selecting the minimal norm solution.

So in the linear case, linear regression, this is equivalent to selecting the pseudo-inverse among the infinite number of solutions. The same, essentially, for kernel regression. And this is true, also, for deep networks. And so this, by itself, also explains the double descent curve that Misha found in the test error of kernel machines.
Now, these are a couple of observations, is that in the classical regime, you just look for the minimizer of the empirical risk, and then this is your solution that will predict well.
For the modern regime, ERM is not enough, because typically, it will give you a lot of solutions that fit the data. You have, in addition, to select one of these solutions. And you select the minimum norm one, as Misha said. But this is an extra step in addition to empirical risk minimization.

However, if you are using gradient descent algorithms to perform empirical risk minimization, then this extra step is not needed, because gradient descent will do it for you. There is a proof that, under exponential-type losses-- like the exponential loss or cross-entropy-- gradient descent techniques converge to the minimum norm solution, independently of the initial conditions. This is the same as saying they converge to maximum margin solutions. So this, I think, gives a clear connection between minimum norm, stability, and gradient descent.
Now, if we want to be-- just a philosophical thought here. One can think of learning theory as a metaphor for telling us when theories in science are scientific, are predictive. So in the classical regime, doing the right thing means that you have a fixed small set of theories and then you pick the one which fits the data best. This is the Vapnik point of view.
In the modern regime, the idea is you choose a theory from a large hypothesis set-- which, actually, you can even increase when new data arrive, or before new data arrive-- you choose the hypotheses that fit the data, and then, among them, you look for the simplest one. This is like Occam's razor, or Einstein saying, choose the simplest theory you can have.
And all of this can be summarized by the principle of selecting the most stable theory-- the one that changes the least if data perturbs or changes the least when new data arrive-- most of the time. Remember, it's an expectation. It's in high probability. So scientific revolutions, in the sense of Thomas Kuhn, are allowed, but not too often. And all of this is kind of a tautology. And of course, mathematics is tautology by definition.
So let me finish here. This is just a tribute to Hadamard, who defined well-posed and ill-posed problems. And so let me now, let's see, open the discussion-- maybe we should have some questions from the audience about what Misha and I said. And then, we'll have Gil Strang, and Costis, and Lorenzo comment on and criticize what Misha and I said. So first, the audience-- any questions?
AUDIENCE: I will. Misha and I actually had this discussion.
MIKHAIL BELKIN: OK.
AUDIENCE: If-- actually, if you have a gradient system and you're wondering whether it tends to a unique minimum, the condition for that is not convexity. It's contraction-- here, contraction in the sense of contracting dynamical systems. And convexity corresponds to contraction in an identity metric, where, actually, you could have contraction in any metric, and it would still tend towards a unique minimum.

We can also show that if a system is semi-contracting, then, actually, if you take a gradient descent, it will tend towards a global minimum, and all equilibrium points will correspond to the same cost. All equilibrium points will be global minima, and also, all equilibrium points will be path-connected by a path of points which are all equilibria and all have the same cost. So very much like the drawing that Misha gave at the end.
TOMASO POGGIO: I was suggesting that we should have a separate CBMM discussion in a couple of weeks about optimization. And I think we should speak about predictivity and stability today. But Misha, do you want to answer Jean-Jacques quickly?
MIKHAIL BELKIN: Yeah. Yeah, it's a very closely related issue there. So Jean-Jacques has a closely related analysis for the continuous dynamics case. And we have this analysis for gradient descent, for stochastic gradient descent.

There are some important differences between the continuous dynamics and the discrete dynamics, but this phenomenon seems to be very general. I actually suspect that the phenomenon may be even more general-- there seems to be even more to it than that. Maybe there is a much broader class of things which could somehow be put into this framework. But I'm happy to have this discussion next time. Maybe we can concentrate on generalization issues right now.
TOMASO POGGIO: So I think I would like to have Gil Strang comment on what he heard. Gil, of course, is the prince of linear algebra, not only at MIT, and I know he was very interested in this double-descent phenomenon. Gil?
GIL STRANG: So I was just saying that I wish-- I mean, this isn't linear algebra. This is really nonlinear. And I think it's great that this Polyak condition seems to be exactly right, or PL, PL* seems exactly right for these problems. Is this something we knew, or is this essentially something new-- that the right way to prove convergence is with Polyak's condition?
TOMASO POGGIO: Misha?
MIKHAIL BELKIN: Well, I mean, Polyak, of course, came up with the condition a long time ago, right? It's, like, I think, '63. So in a sense, yes, we knew it. Did we know-- I don't know. I mean, there are a lot of recent analyses which are essentially doing something very similar to this. But I don't think that the condition was directly identified there as a sort of mathematical structure. So maybe that's the new part of it.
I think-- well, I think for me, at least, I think the interesting bit was that somehow this essential non-convexity of the landscape. The landscape is never convex, not even locally, which just seems very--
TOMASO POGGIO: But the minima are convex.
MIKHAIL BELKIN: What?
TOMASO POGGIO: The global minima are convex.
MIKHAIL BELKIN: Oh, no, it's not even convex around any-- so there is not a single point where it is, generically, for a generic nonlinearity. Of course, you can--
TOMASO POGGIO: And where the gradient is 0, the Hessian is positive definite-- positive semi-definite.
MIKHAIL BELKIN: Semi-definite, semi-definite.
TOMASO POGGIO: Yeah.
MIKHAIL BELKIN: So you cannot really speak about convexity at one point, right? Convexity has to be in a neighborhood. So there is no neighborhood-- yes, it's positive semidefinite at that point. But if you take any neighborhood at that point, that's not positive.
TOMASO POGGIO: Yeah, but the Hessian is positive semi-definite.
MIKHAIL BELKIN: Only at the actual minima.
TOMASO POGGIO: Yeah.
MIKHAIL BELKIN: But nowhere else.
AUDIENCE: In continuous time, the Polyak condition is just saying that the minimum cost, the residual cost, is exponentially convergent. In continuous time, the Polyak condition is a one-line proof, right? It's just saying the residual cost is exponentially convergent, and that gives you directly the Polyak condition.
TOMASO POGGIO: Guys, can we go back to stability and generalization? We will do the optimization discussion two weeks from now. Maybe Jean-Jacques will organize it.
AUDIENCE: No, no, it was just to comment on Gil's question.
TOMASO POGGIO: No, no, no, OK. Gil, do you want to say more, or should I ask Costis?
GIL STRANG: Go ahead to Costis. I'll come back.
TOMASO POGGIO: Costis is in CSAIL and is very well-known, very nice guy. Costis? [LAUGHS]
COSTIS DASKALAKIS: Yeah. Hi, everybody. Yeah, so I found both talks very illuminating. And just to step back a little bit, I think it's great that deep networks and deep learning are asking us to rethink all these foundations.
In particular, coming from-- so I'm a theoretical computer scientist. And in algorithms, we often take a worst-case perspective on the problems we're studying. Similarly, ERM could be thought of as a worst-case perspective on learning problems.
And here, we have to face the fact that this, let's say, worst-case perspective is not going to give us the full picture. I mean, at least it doesn't give us the picture of what's going on. So Misha's work and Tommy's work have been illuminating on that front, of what we could expect.
Just to push more on going away from the worst-case perspective, what I would like you to discuss a little bit is-- so it sounded like the observations, as well as the theory proposed by Tommy, do not talk at all about trying to exploit any structure in the data.

So I would like to push on that front-- the fact that images are not worst-case distributions in R^n. OK, so I doubt there is any theory that talks about worst-case distributions in R^n. You're not going to generalize on those distributions.
So what I would like to understand is, if I would like to push this concept-- I mean, make use of the stability proposal or push on the front of proving generalization bounds for deep learning, how would we go about exploiting structure in the distribution?
TOMASO POGGIO: OK. I think I have some answers, but they're disconnected from the stability consideration. They have to do with approximation theory and representation power. The fact is that, from the point of view of approximation theory, both shallow networks-- one hidden layer, like support vector machines, or a one-hidden-layer network with a ReLU nonlinearity-- and deep networks can approximate arbitrarily well a continuous function on a compact support. So there is no advantage of deep networks in that sense.

But in general, both of them suffer from the curse of dimensionality. So the number of parameters that are needed to get an approximation, in the sup norm, of epsilon depends exponentially on the dimensionality of the function. It's like epsilon to the minus d. And so if d is large, like the number of pixels in an image, this becomes easily larger than the number of electrons in our universe.

OK, so that's a real problem. But it turns out that there is a certain subclass of compositional functions-- these are functions of functions of functions-- where the inner functions have low dimensionality. So think of a binary tree where each node is a function of two variables, and consider functions that have this binary tree structure. Then, it turns out that deep networks, with an appropriate architecture-- which turns out to be very similar to convolutional deep networks-- can avoid the curse of dimensionality, and shallow networks cannot.
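A concrete instance of the binary-tree structure described here (an illustrative example, not the exact function on the slide): an 8-variable function built from two-variable constituents,

```latex
% Hierarchically local compositional function on a binary tree:
% every constituent function depends on only two variables.
f(x_1,\dots,x_8) = g_3\bigl(
   g_{21}\bigl(g_{11}(x_1,x_2),\, g_{12}(x_3,x_4)\bigr),\;
   g_{22}\bigl(g_{13}(x_5,x_6),\, g_{14}(x_7,x_8)\bigr)\bigr),
```

where, following the claim in the talk, the approximation cost scales with the dimensionality of the constituent functions (here two) rather than with the full input dimension.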
So I think-- so this is not true for every function, but there are certain problems, especially in vision, and text, and speech, where this compositionality is likely to hold in the underlying target function.
COSTIS DASKALAKIS: But if I may add, though, that does not talk about generalization. It talks about representation.
TOMASO POGGIO: Yes.
COSTIS DASKALAKIS: So trying to explain why deep nets learn well through some training procedure might, at some point, have to take into-- I mean, I guess the stability proposal explains potentially how this may happen. But if I want to understand, you know, that coefficient beta, or whatever symbol you were using, for how stable you are, I would have to prove something that is, I believe, making use of some structure in the data.
TOMASO POGGIO: Yeah, this is, I think, the first slide I showed. So the stability issue is the second point here. But to answer your question, the first point is also important. It's what I was just telling you about.
COSTIS DASKALAKIS: Yes, it's a very regular representation.
TOMASO POGGIO: And you know, I don't know whether it's true, and I don't know how to connect one to two. I think that's-- so yeah, it's open, but it's--
COSTIS DASKALAKIS: But just to understand your definition, though. So your stability has an expectation outside. I wanted to understand, is that expectation you have in your definition, is it distribution-dependent, or is it for a fixed data set? Could you bring up your stability implies generalization slides?
TOMASO POGGIO: Let's see.
COSTIS DASKALAKIS: Yes, could you parse this for us, please? Can I put an expectation over data sets in the-- I could, in principle, do that, right?

TOMASO POGGIO: Yeah, there is an expectation over the data sets.

COSTIS DASKALAKIS: So maybe the structure in the distribution might come in handy in getting your bound better if you put another expectation around it?
TOMASO POGGIO: Yeah. Yeah, that would be a possibility, yes. I think that could be-- I don't know. It seems you have to restrict, also, the f somehow to be of this hierarchically local compositional structure. Anyway, yes, very interesting, I think.
MIKHAIL BELKIN: Yeah, I kind of feel that there are three things. So I talked about those two lessons, right? One for optimization, one for generalization. And there is really a third. Well, first, we don't actually even know very well how those two lessons are really connected together. We are sort of trying to identify the mechanisms.

But the third lesson, which is somehow in addition to that, maybe orthogonal, is this notion of depth, and how this compositionality works, and why certain structures arise in the network-- why it recovers this manifold, and why it is compatible with this optimization strategy.
So somehow, OK, in an overparameterized system, the deep network may be not that different from a shallow network. But for generalization, it does seem to be capturing this manifold. And it seems to be capturing this manifold at least for tasks like vision. I don't know about generic tasks, but for things like vision, it seems to be significantly better than what you can do with a shallow network.
So it seems like we're missing one way of understanding, somehow, in this understanding of the data. And so perhaps, yeah, it's not an answer to your question, maybe just to identify the issue.
COSTIS DASKALAKIS: Right. So I guess it's what Tommy was saying, connecting 1 and 2 in the slide he showed us.
TOMASO POGGIO: Yeah, because the structure, there is this theorem here which is just about approximation.
COSTIS DASKALAKIS: Right.
TOMASO POGGIO: So this is a compositional function with local constituent functions-- here, dimension two for each one. And this is basically a convolutional network without weight sharing. The weight-sharing case would be the one in which G11, G12, G13-- the constituent functions at the bottom level-- are all the same. But interestingly, it's not weight sharing, it's the locality of the constituent functions that avoids the exponential curse of dimensionality.
OK, before we go further with this, let me ask Lorenzo whether he has anything to say. Lorenzo survived a COVID scare in Italy, so he's our national representative. And also, we should get him to speak before he falls asleep, because of the six-hour--
LORENZO ROSASCO: That's OK, it's 9:00. OK.
MIKHAIL BELKIN: OK, Lorenzo, any biting criticism?
LORENZO ROSASCO: It was great, but I wonder if you might want to comment on the role of-- I think there was a bit of a conflation between parameters and dimensions. I think when you talk about things like neural networks or kernel networks, you can talk about the number of points, the size of the function space. But then you have the dimension of the data.
To me, that's the most important ingredient in this whole story about having to do something new with respect to the past. And Misha was showing this one example in one dimension. So one can say, well, there goes one dimension. So if it happens in one dimension--
But I don't know if this was exactly random data. But if you take 10 points in one dimension, you don't get them so nicely spaced as you showed. You end up having, with very high probability, two points that are extremely close.
And I think this fact-- the fact that you might have functions in a domain and then you might have points scattered in the domain, and the relative distance of the points in the domain-- actually plays a big role in this whole story. And in fact, whether you have stability or not, and whether you have a smaller or bigger [INAUDIBLE] space, regardless of the number of parameters-- I think, to me, as a theoretician, it basically depends, primarily, on what is sometimes called the separation distance between the points, or the fill distance.
So this aspect was not addressed in either one of your talks. And I wondered-- I thought it was a good point to comment on and discuss a bit, either you or Misha. And Misha, I think you had this one-dimensional plot-- that's when I thought of it. Because you had this one-dimensional plot, and I was like, oh, that's one-dimensional, but look, these points are very nicely spaced. I mean, they're not perfectly on a grid, but definitely not close to each other. And I would like to see how well the 3,000-parameter network works when you start to have two points that are very close to each other. I mean, the double descent might have a different shape, maybe.
TOMASO POGGIO: Yeah. I think, Lorenzo, I think it's a good point. I think the separation--
LORENZO ROSASCO: It's not optimization.
TOMASO POGGIO: No, that's good. [LAUGHS] But--
[INTERPOSING VOICES]
TOMASO POGGIO: Also, beyond the fact that it's not optimization either. I think, you know, separation is an ingredient to have stability, for sure. And in the linear case, in the linear case, you definitely get instabilities if you get points that are too close. In the limit case, if you get two points that are identical but have different Y's, then you cannot have interpolation. So one of our hypotheses falls away.
MIKHAIL BELKIN: Yes, so if I can add to this-- yeah, this is really a great question, Lorenzo. I think you're right. I mean, the points were sort of reasonably spaced, so you get a nice curve. If you start moving them together-- I actually haven't tried this, but I'm pretty sure you start getting more and more instability as you move them closer together. For the Gaussian kernel, I know this for sure, for that specific kernel.

But the interesting thing is that it seems that we are somehow getting this blessing of dimensionality for free here-- well, not for free, for something-- because somehow, in high dimensions, as we know, the fill distance becomes much larger, right? Because if you put points in high dimensions, the distances between them are much more similar than in low dimensions.
And you can really see that with Gaussian kernels. If you tried doing interpolation in the low dimension with Gaussian kernels, minimal norm like in dimension two, once you have more than, I don't know, 50 points, you basically cannot do it. Matrices are too degenerate. They just get out of numeric accuracy. But if you have 15 or 20 dimensions, it works just fine.
So there is some sort of-- certainly, there is this interaction with dimensionality of the space, and the dimensionality of the data is important. And I don't think we have the theoretical understanding of what it is. But clearly, those problems are easy and they work better in higher dimensions, at least for kernels which are very small, like the Gaussian. Yeah, no, that's an excellent point. I think that's something that we need to understand, for sure.
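A small numerical sketch of this point about dimension (with assumed data and an assumed bandwidth): the Gaussian kernel matrix on random points is numerically near-singular in low dimension but, for the same number of points, much better conditioned in moderately high dimension.

```python
import numpy as np

def gaussian_kernel_matrix(X, bandwidth=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * bandwidth ** 2))

rng = np.random.default_rng(0)
n = 50
for dim in (1, 2, 15):
    X = rng.uniform(-1, 1, size=(n, dim))
    K = gaussian_kernel_matrix(X)
    # In 1-D the points are close relative to the bandwidth and K is nearly
    # rank-deficient; in 15-D the same number of points gives a well-conditioned K.
    print(f"dim = {dim:3d}   condition number of K = {np.linalg.cond(K):.3e}")
```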
LORENZO ROSASCO: Because I think it's good to remember this: kernel machines have infinitely many parameters, and sparse models tend to have bases which are orders of magnitude bigger than the number of points. So this regime of number of parameters larger than number of points has been around for several years. So maybe that's not really what's new. What's new is the interplay between number of points, dimensions, and parameters.
MIKHAIL BELKIN: Well, I think [INAUDIBLE] it's really driving the error to zero on noisy data.
LORENZO ROSASCO: Yeah, which happens in a good-- which happens when the dimension is high enough and the number of points is small enough.
MIKHAIL BELKIN: Yeah, that, I think, is the kind of-- and I mean, I completely agree that overparameterization as such is not such a new concept. But I think this interplay of fitting the noisy data exactly and this overparameterization gives a new aspect of it. But yeah, if you go back to splines, of course, they're infinitely parameterized. [INAUDIBLE]
LORENZO ROSASCO: OK, so are there questions from the audience?
AUDIENCE: Let me ask a question about Tommy's observation on the expected condition number and the new, the non-linear, Misha's non-linear condition number. Does Tommy's observation carry over directly to give a double descent curve for the nonlinear problem condition number, or how are the two condition numbers connected?
TOMASO POGGIO: Misha, do you want to-- which condition, which nonlinear condition number are you referring to? Peter?
AUDIENCE: I'm taking the one in Misha's paper, the long paper with-- the new paper.
MIKHAIL BELKIN: With [? Yakno ?] [? Ghezevich. ?]
TOMASO POGGIO: We decided not to speak about it. That's about optimization. It will be two weeks. [LAUGHS]
MIKHAIL BELKIN: But is condition number OK? Can we speak about that part?
TOMASO POGGIO: Oh, right. [LAUGHS]
AUDIENCE: Yeah, because that's a piece of Misha's paper and a key observation of yours.
TOMASO POGGIO: Yeah, OK. Misha, you have to explain which definition you have for the non-linear condition number.
MIKHAIL BELKIN: Yes, so we have this nonlinear condition number, which essentially-- OK, it's not exactly that, but maybe it's easiest to think of it this way: you pick a neighborhood of the point, and you just take the worst condition number of the tangent kernel in that neighborhood.
TOMASO POGGIO: So of the loss function?
MIKHAIL BELKIN: Not of the loss function-- it's always a condition number of the map. Yeah, yeah, it's not the loss function. So if you just think about optimization, you just have a map, and you just evaluate it at your data points. You don't care about anything else, because that's the only thing in the optimization problem.

So you have a map into R^n, where n is the number of training points, which is given by your neural net. And you can view it as a function of the parameters w. So you have F of w-- it's a nonlinear system of equations, right? You are trying to solve F of w equals y. So the optimization is just a nonlinear system of equations, and that's what machine learning does.

Now, you can look at it and say, OK, I am going to take the tangent kernel of that, which is simply the derivative of this map times its transpose. That's a matrix whose size is n by n. And then, in a neighborhood, you take the worst condition number-- it's not exactly that, but it's essentially the supremum of the condition number over the neighborhood.
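A rough sketch of what that object looks like for a tiny network (all details here-- architecture, sizes, initialization, and the finite-difference Jacobian-- are assumptions for illustration, not the construction in the paper): form the Jacobian of the network outputs on the training set with respect to the parameters, build the n x n tangent kernel J J^T, and look at its condition number.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_in, width = 20, 5, 500          # training points, input dim, hidden width
X = rng.standard_normal((n, d_in))

# A small one-hidden-layer ReLU network, parameters packed into one vector.
W1 = rng.standard_normal((width, d_in)) / np.sqrt(d_in)
w2 = rng.standard_normal(width) / np.sqrt(width)
params = np.concatenate([W1.ravel(), w2])

def forward(params, X):
    W1 = params[:width * d_in].reshape(width, d_in)
    w2 = params[width * d_in:]
    return np.maximum(0.0, X @ W1.T) @ w2          # outputs F(w) on all n points

def jacobian(params, X, eps=1e-5):
    """Finite-difference Jacobian dF/dparams, shape (n, num_params)."""
    base = forward(params, X)
    J = np.zeros((len(base), len(params)))
    for j in range(len(params)):
        p = params.copy()
        p[j] += eps
        J[:, j] = (forward(p, X) - base) / eps
    return J

J = jacobian(params, X)
K = J @ J.T                                        # n x n tangent kernel
print("condition number of tangent kernel:", np.linalg.cond(K))
```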
Now, how does it relate to-- or tell me, how does it relate to your condition number, which is kind of-- I mean, your argument is for generalization. So in that paper, we show, essentially, that if this condition number can be controlled, then optimization works.
TOMASO POGGIO: Right, so the key question could be whether there is double descent behavior of this nonlinear condition number.

MIKHAIL BELKIN: Yeah, I mean, if it's a question for me, I think it's really likely, but we couldn't really prove it. Our proof works primarily in the overparameterized case. So it's kind of tricky to analyze it near the actual interpolation boundary. So yeah, I'm convinced that it should be true. I don't know how to prove it.
AUDIENCE: Yeah. Yeah, I'm probably asking well into the second descent, into the overparameterized part. Well, do we have an idea of the expected condition number, like Marchenko-Pastur for the linear case? Perhaps that's pretty complicated.
MIKHAIL BELKIN: We can control the condition number for large enough networks. Yeah, it's not so bad, actually, because we can control it at the initialization using some Marchenko-Pastur type of stuff, which actually was proven by several people. Like Simon, you had some version of that.
AUDIENCE: OK.
MIKHAIL BELKIN: And then, the key observation is that for large enough neural networks, you can prove that along the optimization path, the tangent kernel doesn't change very much, and that allows for the control in the whole neighborhood, along the whole path.
So we can kind of control it. It's not completely satisfactory, since it depends too heavily on the fact that this tangent kernel is kind of close to constant-- it really changes very little, which seems too special. But probably, this is just a deficiency of the analysis. It seems like it has to be a more general fact.
AUDIENCE: Thanks.
MIKHAIL BELKIN: But maybe there is a kind of a really interesting sort of different question about that, right? How is conditioning for optimization related to this conditioning for stability, right? Because they seem to rely on a very similar type of mathematical property of the systems. But this property seems to arise in different ways on these two occasions. So maybe that gives a little bit of a hint why these optimization methods somehow respect generalization. I don't know. Tell me if you have any sort of thought.
TOMASO POGGIO: Yeah.
MIKHAIL BELKIN: [LAUGHS]
TOMASO POGGIO: This reminds me of the fact that originally, when I did the work with Partha on stability, my initial observation that started it all was that the conditions needed to have a well-posed optimization problem in machine learning were very similar to the conditions for generalization-- for predictivity. And so that was puzzling. And in fact, we found, from there, this stability condition.

So maybe, yeah, my hint is that you're right. There is a deep connection between optimization and good generalization-- test error. But there is work to be done.
LORENZO ROSASCO: Can I make a comment?
TOMASO POGGIO: Yeah.
LORENZO ROSASCO: So I think, perhaps, one technical aspect of the condition number and the Polyak-Lojasiewicz condition, and the relation to stability, is that usually, these kinds of convexity-like conditions, like Polyak-Lojasiewicz and related ideas, are local around a global minimizer of the objective function.

So they're not global properties of the system of equations or of the function to be optimized. They're local around the minimizer. And this is in stark contrast with what happens with other notions of convexity or condition number that are global.

So in my view, when I think of this kind of Polyak-Lojasiewicz condition, I think of a local condition number, not only on a local neighborhood, but potentially on a subset or subspace of the whole space. So you have to go around the minimizer and, potentially, project onto some subset. And then you have a notion of condition number. So that's one big difference.

And I wanted to make this comment because I think it relates to what Costis was talking about. This is a property of a minimizer of the true distribution error. So it really depends-- it's really a structural assumption on the true distribution, and it's not just a property of the objective. It's really a property of how the best possible solution is induced by the true distribution.
So it's a very structural assumption because it's local around a global minimizer or a global minimizer. That's a bit of a technical comment, I guess.
MIKHAIL BELKIN: Lorenzo, could you maybe explain a little bit? Because you do get around the initialization point. It's also-- it is true that it's local, but it somehow holds around the initialization, right?
LORENZO ROSASCO: But I mean, the condition is usually that you need-- you basically say that there is a neighborhood, or maybe an intersection of a neighborhood with a subset, and the neighborhood is taken around a minimizer. And then you need curvature around there. That's typically what you do in these kinds of conditions.
If you're assuming to have it globally, that's a much stronger condition. But usually, these conditions are local. They're in the neighborhood of the minimizer.
MIKHAIL BELKIN: Well, they're kind of local, but they're not so local. They're somehow in the neighborhood of initialization, and that neighborhood includes the minimizer. And this is true once you--
[INTERPOSING VOICES]
LORENZO ROSASCO: That's the cheating part of this whole story. Typically, the way it really goes is that you assume that the condition is true at the minimizer, and then you cross your fingers that your initialization is also in the neighborhood. Because then, at that point, in some sense, you're in the right place already. If your initialization and the global minima are far away, then you don't get anywhere. So I mean, that's the typical way of doing nonlinear problems, both for stability and convergence. You have to have some nice properties around your global minimizer and then start close to it.
So I mean, that's the way I see it. So you said it's reversed. You said, I have the nice property around initialization, and the minimizer is close. Well, either way, you have this nice property somewhere, and both initialization and global minimizer are there.
MIKHAIL BELKIN: That's not the way I view it. I feel that overparameterization allows you to have it around the initialization. And once you have it around the initialization, you're guaranteed that the minimizer is close. So it's kind of inverting that story, in a sense.
LORENZO ROSASCO: Right, right, but in a very fundamental way. I mean, anyway--
TOMASO POGGIO: Here is a technical conjecture that I was mentioning to Misha: the classical situation in machine learning, especially in the under-parameterized case, is that the loss function is a Morse function. And in the modern situation, overparameterized, for deep networks-- at least under the conditions that Misha was studying, which are not completely general; they don't include ResNets, for instance-- it could be a Morse-Bott function. Does anybody understand what I said? [LAUGHS]

So in other words, you could have, essentially, degenerate manifolds at the minima. That's fine. And we know that it's easy to prove that, for overparameterized deep networks, the global minima are degenerate. So you have multiple zeros, multiple solutions of your interpolation equations.
AUDIENCE: So, Tommy's comment that ResNets are not included here-- can you say a little more about that?
TOMASO POGGIO: Well, they're not included in the optimization paper of Misha, if I'm correct. You should ask him, but--
MIKHAIL BELKIN: Yeah, well, we prove it for-- I think for deep networks. We prove it for fully connected.
TOMASO POGGIO: And also very wide, right?
MIKHAIL BELKIN: Well, not very wide. The width is kind of linear. Well, it's complicated, because it depends on this--
TOMASO POGGIO: OK.
MIKHAIL BELKIN: Yeah, well, wide or not wide, I guess, it depends. In any case, it's somewhat wide, I would say. But it seems like it's more of a technical issue rather than a conceptual one. So I'm convinced that the condition itself is true, but the proof gets a little hairy. So it's probably just that we didn't find quite the right way to prove those things.
TOMASO POGGIO: OK, any more questions or comments? Costis, you are happy? [LAUGHS]
COSTIS DASKALAKIS: Yeah, this obviously requires more thinking to process, right? Yeah, I mean, I can follow up a little bit on Lorenzo's question, or this little debate. But I guess one thing that I was always confused about is, yeah, you can prove that, close to the initialization, there will be a solution that fits that data.
But I guess I don't understand, how, then, you do the jump to generalization after that? How do you talk about-- there are many optimal solutions that fit the data. And the one that is close to the initialization, why is that a good one?
MIKHAIL BELKIN: Yeah, I think it's kind of a mystery, actually. I agree. We don't know. Well, OK, for linear regression, we understand it pretty well, because if you start linear regression at 0, you'll converge to the minimum norm solution. And presumably, we believe that minimum norm solutions are good. They're more stable. They have all these properties.

Now, when we actually train those networks in practice, we start with some sort of randomly chosen initialization. We run SGD for a while. We get convergence to something which has-- well, not exactly 0, but small loss. And we can actually run for rather a long time and get even smaller losses, and it still works fine.

Why does this solution have the property that it generalizes? So it seems to me that right now, we have analyses of convergence for optimization and some analyses of generalization, and they're somehow not completely aligned yet. So it's--
TOMASO POGGIO: Now, maybe, actually, this is an interesting point. You're speaking about the square loss, right, Misha?
MIKHAIL BELKIN: Well, it doesn't matter so much, right? That's--
TOMASO POGGIO: Because if we're speaking about exponential loss, then we know that gradient descent techniques would converge to minimum norm solutions. This only assumes that, at some point during gradient descent, you are separating the data. So the classification error is zero. And then you can prove that you will converge to a minimum norm. And this is independent of the initial conditions.
MIKHAIL BELKIN: So--
[INTERPOSING VOICES]
TOMASO POGGIO: So this is true only for--
MIKHAIL BELKIN: --I think this can-- yeah?
TOMASO POGGIO: It's only true for the exponential type of loss functions.
MIKHAIL BELKIN: So I don't think this conveys the full story, because what we see, at least empirically, is for the square loss-- even for linear, for the square loss. In any case, we know that for the square loss, essentially, empirically, performance is very similar, but you cannot have this analysis based on going to infinity. OK, it's a technical kind of issue.

But it seems that there is some-- I don't know. OK, here is my conjecture: the theoretical story depends heavily on the loss function, but of course, from an empirical point of view, we don't see that there is a huge difference in how things perform.
TOMASO POGGIO: Well, we find a pretty big empirical difference between exponential and square loss for difficult problems. But--
MIKHAIL BELKIN: I see.
TOMASO POGGIO: --apart from this, I think I believe what you're saying. I believe, also, that theoretically, it should not make such a big difference. What I'm saying is that for the exponential loss, we have a proof of convergence to the minimum norm solution for kernel machines, for linear systems, and also for deep networks.
I don't know how to extend that proof to the square loss. It's probably just a technical problem. I'm just saying, I don't know.
COSTIS DASKALAKIS: So Tommy, can you quantify a little bit the claim you're making? Are you making a big width assumption, like you--
TOMASO POGGIO: No, there is no assumption, clearly.
COSTIS DASKALAKIS: I mean, gradient descent does not always work, right? I mean--
[INTERPOSING VOICES]
TOMASO POGGIO: No, no. The one big assumption I'm making is that at some point during gradient descent, you are separating the data. Separating means you are classifying-- I'm speaking about binary classification, for simplicity-- and I'm saying that your function has the correct sign for each data point in the training set.

This is a big assumption, because gradient descent could get stuck and never separate the data. But if you get to a point where it separates the data-- and empirically, you get there most of the time, even for large data sets-- then we have a proof that it will converge to the maximum margin, minimum norm solutions. But I don't know, just for technical reasons, right now, how to extend the proof to loss functions different from the exponential type.
OK, I think we are at the point in which-- I think we have some interesting results. Thanks, Misha. And I think these results are interesting not only for deep learning, but more generally for a number of other problems like linear algebra, perhaps. And there are a lot of open questions, so it would be great to have help in answering at least some of them.
And I think we'll try to have, maybe, an optimization discussion like this one in a couple of weeks or so, if Misha is willing to spend some more time with us.
MIKHAIL BELKIN: Sure, I would definitely.
TOMASO POGGIO: [LAUGHS] OK. And so let me thank Misha, and Gil, and Costis, and Lorenzo, and all of you for participating. Keep safe. Continue to wash your hands. Bye-bye.
COSTIS DASKALAKIS: Thank you very much.
AUDIENCE: Thank you, Tommy.
MIKHAIL BELKIN: Thanks, bye.
AUDIENCE: Bye-bye.
MIKHAIL BELKIN: Bye.