%0 Conference Paper %B NAS Sackler Colloquium on Science of Deep Learning %D 2019 %T Dynamics & Generalization in Deep Networks -Minimizing the Norm %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Jack Hidary %A Tomaso Poggio %C Washington D.C. %8 03/2019 %G eng

%0 Conference Paper %B ICML %D 2019 %T Weight and Batch Normalization implement Classical Generalization Bounds %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Jack Hidary %A Tomaso Poggio %C Long Beach/California %8 06/2019 %G eng

%0 Generic %D 2018 %T Classical generalization bounds are surprisingly tight for Deep Networks %A Qianli Liao %A Brando Miranda %A Jack Hidary %A Tomaso Poggio %X

Deep networks are usually trained and tested in a regime in which the training classification error is not a good predictor of the test error. Thus the consensus has been that generalization, defined as convergence of the empirical to the expected error, does not hold for deep networks. Here we show that, when normalized appropriately after training, deep networks trained on exponential-type losses show a good linear dependence of test loss on training loss. The observation, motivated by a previous theoretical analysis of overparametrization and overfitting, not only demonstrates the validity of classical generalization bounds for deep learning but also suggests that they are tight. We also show that the bound on the classification error by the normalized cross-entropy loss is empirically rather tight on the data sets we studied.

%8 07/2018 %1

http://hdl.handle.net/1721.1/116911

%0 Generic %D 2018 %T Theory III: Dynamics and Generalization in Deep Networks %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Tomaso Poggio %A Lorenzo Rosasco %A Jack Hidary %A Fernanda De La Torre %X

The key to generalization is controlling the complexity of
the network. However, there is no obvious control of
complexity -- such as an explicit regularization term --
in the training of deep networks for classification. We
show that a classical form of norm control -- albeit a
hidden one -- is present in deep networks trained with
gradient descent techniques on exponential-type losses. In
particular, gradient descent induces a dynamics of the
normalized weights which converge for $t \to \infty$ to an
equilibrium which corresponds to a minimum norm (or
maximum margin) solution. For sufficiently large but
finite $\rho$ -- and thus finite $t$ -- the dynamics
converges to one of several margin maximizers, with the
margin monotonically increasing towards a limit stationary
point of the flow. In the usual case of stochastic
gradient descent, most of the stationary points are likely
to be convex minima corresponding to a regularized,
constrained minimizer -- the network with normalized
weights -- which is stable and has zero
generalization gap asymptotically for $N \to \infty$,
where $N$ is the number of training examples. For finite,
fixed $N$ the generalization gap may not be zero, but the
minimum norm property of the solution can provide, we
conjecture, good expected performance for suitable data
distributions. Our approach extends some of the results of
Srebro from linear networks to deep networks and provides
a new perspective on the implicit bias of gradient
descent. We believe that the elusive complexity control we
describe is responsible for the puzzling empirical finding
of good predictive performance by deep networks, despite
overparametrization.

%8 06/2018 %2

http://hdl.handle.net/1721.1/116692

%0 Generic %D 2017 %T Theory of Deep Learning III: explaining the non-overfitting puzzle %A Tomaso Poggio %A Kenji Kawaguchi %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Xavier Boix %A Jack Hidary %A Hrushikesh Mhaskar %X

THIS MEMO IS REPLACED BY CBMM MEMO 90

A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave near zero stable minima of the empirical error as a gradient system in a quadratic potential with degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient.

Our proposition extends to deep networks key properties of gradient descent methods for linear networks, which, as suggested in (1), may be the key to understanding generalization. Gradient descent enforces a form of implicit regularization that is controlled by the number of iterations and converges asymptotically to the minimum norm solution. This implies that there is usually an optimum early stopping point that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution, which guarantees good classification error for “low noise” datasets.

The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.

%8 12/2017 %1

arXiv:1801.00173

http://hdl.handle.net/1721.1/113003