Recent theoretical results show that gradient descent on deep neural networks under exponential loss functions locally maximizes classification margin, which is equivalent to minimizing the norm of the weight matrices under margin constraints. This property of the solution, however, does not fully characterize the generalization performance. We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization. We then show that, after data separation is achieved, it is possible to dynamically reduce the training set by more than 99% without significant loss of performance. Interestingly, the resulting subset of “high capacity” features is not consistent across different training runs, which agrees with the theoretical claim that all training points should converge to the same asymptotic margin under SGD and in the presence of both batch normalization and weight decay.

%2https://hdl.handle.net/1721.1/129744

%0 Generic %D 2018 %T Theory III: Dynamics and Generalization in Deep Networks %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Tomaso Poggio %A Lorenzo Rosasco %A Jack Hidary %A Fernanda De La Torre %XThe key to generalization is controlling the complexity of the network. However, there is no obvious control of complexity, such as an explicit regularization term, in the training of deep networks for classification. We will show that a classical, though hidden, form of norm control is present in deep networks trained with gradient descent techniques on exponential-type losses. In particular, gradient descent induces a dynamics of the normalized weights which converges for $t \to \infty$ to an equilibrium which corresponds to a minimum norm (or maximum margin) solution. For sufficiently large but finite $\rho$, and thus finite $t$, the dynamics converges to one of several margin maximizers, with the margin monotonically increasing towards a limit stationary point of the flow. In the usual case of stochastic gradient descent, most of the stationary points are likely to be convex minima corresponding to a regularized, constrained minimizer (the network with normalized weights) which is stable and has zero generalization gap asymptotically for $N \to \infty$, where $N$ is the number of training examples. For finite, fixed $N$ the generalization gap may not be zero, but the minimum norm property of the solution can provide, we conjecture, good expected performance for suitable data distributions. Our approach extends some of the results of Srebro from linear networks to deep networks and provides a new perspective on the implicit bias of gradient descent. We believe that the elusive complexity control we describe is responsible for the puzzling empirical finding of good predictive performance by deep networks, despite overparametrization.