Overparametrized deep network predict well despite the lack of an explicit complexity control during training such as an explicit regularization term. For exponential-type loss functions, we solve this puzzle by showing an effective regularization effect of gradient descent in terms of the normalized weights that are relevant for classification.

}, author = {Tomaso A. Poggio and Qianli Liao and Andrzej Banburski} } @article {4375, title = {Double descent in the condition number}, year = {2019}, month = {12/2019}, abstract = {In solving a system of n linear equations in d variables\ \ Ax=b, the condition number of the (n,d) matrix A measures how\ \ much errors in the data b affect the solution x. Bounds of\ \ this type are important in many inverse problems. An example is\ \ machine learning where the key task is to estimate an underlying\ \ function from a set of measurements at random points in a high\ \ dimensional space and where low sensitivity to error in the data is\ \ a requirement for good predictive performance. Here we report the\ \ simple observation that when the columns of A are random vectors,\ \ the condition number of A is highest, that is worse, when d=n,\ \ that is when the inverse of A exists. An overdetermined system\ \ (n\>d) and especially an underdetermined system (n\<d), for which\ \ the pseudoinverse must be used instead of the inverse, typically\ \ have significantly better, that is lower, condition numbers. Thus\ \ the condition number of A plotted as function of d shows a\ \ double descent behavior with a peak at d=n.

}, author = {Tomaso Poggio and Gil Kur and Andrzej Banburski} } @article {4281, title = {Theoretical Issues in Deep Networks}, year = {2019}, month = {08/2019}, abstract = {While deep learning is successful in a number of applications, it is not yet well understood theoretically.\ A theoretical\ characterization of deep learning should answer questions about their approximation power, the dynamics of optimization by gradient descent and good out-of-sample performance --- why the expected error does not suffer, despite the absence of explicit regularization, when the networks are overparametrized. We review our recent results towards this goal.\ In {\it approximation theory} both shallow and deep networks are known to approximate any continuous functions on a bounded domain at a cost which is exponential (the number of parameters is exponential in the dimensionality of the function). However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can have a linear dependence on dimensionality, unlike shallow networks. In characterizing {\it minimization} of the empirical exponential loss we consider the gradient descent dynamics of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to the normalized network. The dynamics of the normalized weights implied by standard gradient descent turns out to be equivalent to the dynamics of the constrained problem of minimizing an exponential-type loss subject to a unit $L_2$ norm constraint. In particular, the dynamics of the typical, unconstrained gradient descent converges to the same critical points of the constrained problem. Thus, there is {\it implicit regularization} in training deep networks under exponential-type loss functions with gradient descent. The critical points of the flow are hyperbolic minima (for any long but finite time) and minimum norm minimizers (e.g. maxima of the margin). Though appropriately normalized networks can show a small generalization gap (difference between empirical and expected loss) even for finite $N$ (number of training examples) wrt the exponential loss, they do not generalize in terms of the classification error. Bounds on it for finite $N$ remain an open problem. Nevertheless, our results, together with other recent papers, characterize an implicit vanishing regularization by gradient descent which is likely to be a key prerequisite -- in terms of complexity control -- for the good performance of deep overparametrized ReLU classifiers.

}, author = {Tomaso Poggio and Andrzej Banburski and Qianli Liao} } @article {3694, title = {Theory III: Dynamics and Generalization in Deep Networks}, year = {2018}, month = {06/2018}, abstract = {The key to generalization is controlling the complexity of

\ \ \ \ \ \ the network. However, there is no obvious control of

\ \ \ \ \ \ complexity -- such as an explicit regularization term --

\ \ \ \ \ \ in the training of deep networks for classification. We

\ \ \ \ \ \ will show that a classical form of norm control -- but

\ \ \ \ \ \ kind of hidden -- is present in deep networks trained with

\ \ \ \ \ \ gradient descent techniques on exponential-type losses. In

\ \ \ \ \ \ particular, gradient descent induces a dynamics of the

\ \ \ \ \ \ normalized weights which converge for $t \to \infty$ to an

\ \ \ \ \ \ equilibrium which corresponds to a minimum norm (or

\ \ \ \ \ \ maximum margin) solution. For sufficiently large but

\ \ \ \ \ \ finite $\rho$ -- and thus finite $t$ -- the dynamics

\ \ \ \ \ \ converges to one of several margin maximizers, with the

\ \ \ \ \ \ margin monotonically increasing towards a limit stationary

\ \ \ \ \ \ point of the flow. In the usual case of stochastic

\ \ \ \ \ \ gradient descent, most of the stationary points are likely

\ \ \ \ \ \ to be convex minima corresponding to a regularized,

\ \ \ \ \ \ constrained minimizer -- the network with normalized

\ \ \ \ \ \ weights-- which is stable and has asymptotic zero

\ \ \ \ \ \ generalization gap, asymptotically for $N \to \infty$,

\ \ \ \ \ \ where $N$ is the number of training examples. For finite,

\ \ \ \ \ \ fixed $N$ the generalizaton gap may not be zero but the

\ \ \ \ \ \ minimum norm property of the solution can provide, we

\ \ \ \ \ \ conjecture, good expected performance for suitable data

\ \ \ \ \ \ distributions. Our approach extends some of the results of

\ \ \ \ \ \ Srebro from linear networks to deep networks and provides

\ \ \ \ \ \ a new perspective on the implicit bias of gradient

\ \ \ \ \ \ descent. We believe that the elusive complexity control we

\ \ \ \ \ \ describe is responsible for the puzzling empirical finding

\ \ \ \ \ \ of good predictive performance by deep networks, despite

\ \ \ \ \ \ overparametrization.\