The key to generalization is controlling the complexity of the network. However, there is no obvious control of complexity -- such as an explicit regularization term -- in the training of deep networks for classification. We will show that a classical form of norm control -- albeit a hidden one -- is responsible for the good expected performance of deep networks trained with gradient descent techniques on exponential-type losses. In particular, gradient descent induces a dynamics of the normalized weights which converges for $t \to \infty$ to an equilibrium corresponding to a minimum norm (or maximum margin) solution. For sufficiently large but finite $\rho$ -- and thus finite $t$ -- the dynamics converges to one of several hyperbolic minima corresponding to a regularized, constrained minimizer -- the network with normalized weights -- which is stable and generalizes. In the limit, generalization is lost, but the minimum norm property of the solution provides, we conjecture, good expected performance. Our approach extends some of the results of Srebro from linear networks to deep networks and provides a new perspective on the implicit bias of gradient descent. The elusive complexity control we describe is responsible, at least in part, for the puzzling empirical finding of good predictive performance by deep networks despite overparametrization.

}, author = {Andrzej Banburski and Qianli Liao and Brando Miranda and Tomaso Poggio and Lorenzo Rosasco and Jack Hidary and Fernanda De La Torre} }