%0 Generic %D 2021 %T Distribution of Classification Margins: Are All Data Equal? %A Andrzej Banburski %A Fernanda De La Torre %A Nishka Pant %A Ishana Shastri %A Tomaso Poggio %X

Recent theoretical results show that gradient descent on deep neural networks under exponential loss functions locally maximizes classification margin, which is equivalent to minimizing the norm of the weight matrices under margin constraints. This property of the solution, however, does not fully characterize the generalization performance. We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization. We then show that, after data separation is achieved, it is possible to dynamically reduce the training set by more than 99% without significant loss of performance. Interestingly, the resulting subset of “high capacity” features is not consistent across different training runs, which is consistent with the theoretical claim that all training points should converge to the same asymptotic margin under SGD and in the presence of both batch normalization and weight decay.

%2

https://hdl.handle.net/1721.1/129744
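
To make the abstract's notion of the area under the margin-distribution curve concrete, here is a minimal sketch of one plausible way to compute such a quantity from a trained classifier's outputs. The function names (classification_margins, margin_curve_area), the multiclass margin definition (correct-class score minus best competing score), and the sorted-margin-versus-rank-fraction integration are illustrative assumptions, not taken from the paper.

import numpy as np

def classification_margins(scores, labels):
    # Per-example multiclass margin: correct-class score minus the best
    # competing class score. scores: (N, C) float array, labels: (N,) int array.
    n = len(labels)
    correct = scores[np.arange(n), labels]
    competitors = scores.copy()
    competitors[np.arange(n), labels] = -np.inf
    return correct - competitors.max(axis=1)

def margin_curve_area(margins):
    # Sort the margins, plot them against the rank fraction k/N, and integrate
    # that curve with the trapezoidal rule. This is one way to realize an
    # "area under the margin distribution" score; the paper's exact
    # normalization may differ.
    m = np.sort(margins)
    x = np.arange(1, len(m) + 1) / len(m)
    return float(np.sum(0.5 * (m[1:] + m[:-1]) * np.diff(x)))

# Hypothetical usage with random scores, purely for illustration:
rng = np.random.default_rng(0)
scores = rng.normal(size=(1000, 10))
labels = rng.integers(0, 10, size=1000)
print(margin_curve_area(classification_margins(scores, labels)))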

%0 Generic %D 2018 %T Theory III: Dynamics and Generalization in Deep Networks %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Tomaso Poggio %A Lorenzo Rosasco %A Jack Hidary %A Fernanda De La Torre %X

The key to generalization is controlling the complexity of the network. However, there is no obvious control of complexity -- such as an explicit regularization term -- in the training of deep networks for classification. We will show that a classical form of norm control -- albeit a hidden one -- is present in deep networks trained with gradient descent techniques on exponential-type losses. In particular, gradient descent induces a dynamics of the normalized weights which converges for $t \to \infty$ to an equilibrium corresponding to a minimum norm (or maximum margin) solution. For sufficiently large but finite $\rho$ -- and thus finite $t$ -- the dynamics converges to one of several margin maximizers, with the margin monotonically increasing towards a limit stationary point of the flow. In the usual case of stochastic gradient descent, most of the stationary points are likely to be convex minima corresponding to a regularized, constrained minimizer -- the network with normalized weights -- which is stable and has zero generalization gap asymptotically for $N \to \infty$, where $N$ is the number of training examples. For finite, fixed $N$ the generalization gap may not be zero, but the minimum norm property of the solution can provide, we conjecture, good expected performance for suitable data distributions. Our approach extends some of the results of Srebro from linear networks to deep networks and provides a new perspective on the implicit bias of gradient descent. We believe that the elusive complexity control we describe is responsible for the puzzling empirical finding of good predictive performance by deep networks, despite overparametrization.

%8 06/2018 %2

http://hdl.handle.net/1721.1/116692
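
Both abstracts invoke the equivalence between norm minimization under margin constraints and margin maximization over normalized weights. As a sketch in generic notation (mine, not the memo's), assuming a network $f(W; x)$ that is positively homogeneous in its weights and a binary-labeled training set $\{(x_n, y_n)\}$, the two problems are equivalent up to rescaling of the solution:

\[
\min_{W} \|W\| \ \ \text{s.t.} \ \ y_n f(W; x_n) \ge 1 \ \ \forall n
\qquad \Longleftrightarrow \qquad
\max_{\|V\| = 1} \ \min_{n} \ y_n f(V; x_n), \qquad V = W / \|W\|.
\]

Here $\rho = \|W\|$ is the weight norm referred to in the second abstract; the claim there is that the gradient-descent dynamics of the normalized weights $V$ converges to stationary points of the right-hand problem.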