Deep networks are usually trained and tested in a regime in which the training classification error is not a good predictor of the test error. Thus the consensus has been that generalization, defined as convergence of the empirical to the expected error, does not hold for deep networks. Here we show that, when normalized appropriately after training, deep networks trained on exponential type losses show a good linear dependence of test loss on training loss. The observation, motivated by a previous theoretical analysis of overparametrization and overfitting, not only demonstrates the validity of classical generalization bounds for deep learning but suggests that they are tight. In addition, we also show that the bound of the classification error by the normalized cross entropy loss is empirically rather tight on the data sets we studied.

%8 07/2018 %1 %2http://hdl.handle.net/1721.1/116911

%0 Generic %D 2018 %T Theory III: Dynamics and Generalization in Deep Networks %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Tomaso Poggio %A Lorenzo Rosasco %A Bob Liang %A Jack Hidary %XWe review recent observations on the dynamical systems induced by gradient descent methods used for training deep networks and summarize properties of the solutions they converge to. Recent results illuminate the puzzle in the special case of linear networks for binary classification. They prove that minimization of loss functions such as the logistic, the crossentropy and the exponential loss yields asymptotic convergence to the maximum margin solution for linearly separable datasets, independently of the initial conditions. Here we discuss the case of nonlinear multilayer DNNs near zero minima of the empirical loss, under exponential-type losses and square loss, for several variations of the basic gradient descent algorithm, including a new NMGD (norm minimizing gradient descent) version that converges to the minimum norm fixed points of the gradient descent iteration. Our main results are:

- gradient descent algorithms with weight normalization constraint achieve generalization;
- the fundamental reason for the effectiveness of existing weight normalization and batch normalization techniques is that they are approximate implementations of maximizing the margin under unit norm constraint;
- without unit norm constraints some level of generalization can still be obtained for not-too-deep networks because the balance of the weights across different layers, if present at initialization, is maintained by the gradient flow.

In the perspective of these theoretical results, we discuss experimental evidence around the apparent absence of “overfitting”, that is the observation that the expected classification error does not get worse when increasing the number of parameters. Our explanation focuses on the implicit normalization enforced by algorithms such as batch normalization, since the control of the norm of the weights is related to Halpern iterations for minimum norm solutions which are equivalent to regularization with vanishing λ(t).

^{1}This replaces previous versions of Theory IIIa and Theory IIIb.

http://hdl.handle.net/1721.1/116692

%0 Generic %D 2017 %T Musings on Deep Learning: Properties of SGD %A Chiyuan Zhang %A Qianli Liao %A Alexander Rakhlin %A Karthik Sridharan %A Brando Miranda %A Noah Golowich %A Tomaso Poggio %X[*formerly titled "Theory of Deep Learning III: Generalization Properties of SGD"*]

In Theory III we characterize with a mix of theory and experiments the generalization properties of Stochastic Gradient Descent in overparametrized deep convolutional networks. We show that Stochastic Gradient Descent (SGD) selects with high probability solutions that 1) have zero (or small) empirical error, 2) are degenerate as shown in Theory II and 3) have maximum generalization.

%8 04/2017 %2http://hdl.handle.net/1721.1/107841

%0 Generic %D 2017 %T Theory of Deep Learning IIb: Optimization Properties of SGD %A Chiyuan Zhang %A Qianli Liao %A Alexander Rakhlin %A Brando Miranda %A Noah Golowich %A Tomaso Poggio %XIn Theory IIb we characterize with a mix of theory and experiments the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: *SGD concentrates in probability - like the classical Langevin equation – on large volume, “flat” minima, selecting flat minimizers which are with very high probability also global minimizers*.

http://hdl.handle.net/1721.1/115407

%0 Generic %D 2017 %T Theory of Deep Learning III: explaining the non-overfitting puzzle %A Tomaso Poggio %A Keji Kawaguchi %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Xavier Boix %A Jack Hidary %A Hrushikesh Mhaskar %X**THIS MEMO IS REPLACED BY CBMM MEMO 90**

A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave near zero stable minima of the empirical error as gradient system in a quadratic potential with degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient.

Our proposition provides the extension to deep networks of key properties of gradient descent methods for linear networks, that as, suggested in (1), can be the key to understand generalization. Gradient descent enforces a form of implicit regular- ization controlled by the number of iterations, and asymptotically converging to the minimum norm solution. This implies that there is usually an optimum early stopping that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution which guarantees good classification error for “low noise” datasets.

The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.

%8 12/2017 %1 %2http://hdl.handle.net/1721.1/113003

%0 Journal Article %J International Journal of Automation and Computing %D 2017 %T Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review %A Tomaso Poggio %A Hrushikesh Mhaskar %A Lorenzo Rosasco %A Brando Miranda %A Qianli Liao %K convolutional neural networks %K deep and shallow networks %K deep learning %K function approximation %K Machine Learning %K Neural Networks %XThe paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

%B International Journal of Automation and Computing %P 1-17 %8 03/2017 %G eng %U http://link.springer.com/article/10.1007/s11633-017-1054-2?wt_mc=Internal.Event.1.SEM.ArticleAuthorOnlineFirst %R 10.1007/s11633-017-1054-2 %0 Generic %D 2016 %T Theory I: Why and When Can Deep Networks Avoid the Curse of Dimensionality? %A Tomaso Poggio %A Hrushikesh Mhaskar %A Lorenzo Rosasco %A Brando Miranda %A Qianli Liao %X[formerly titled "*Why and When Can Deep - but Not Shallow - Networks Avoid the Curse of Dimensionality: a Review*"]

The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

%8 11/2016 %1https://arxiv.org/abs/1611.00740v5

%2http://hdl.handle.net/1721.1/105443