%0 Generic %D 2018 %T Theory III: Dynamics and Generalization in Deep Networks %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Tomaso Poggio %A Lorenzo Rosasco %A Jack Hidary %A Fernanda De La Torre %X
The key to generalization is controlling the complexity of
the network. However, there is no obvious control of
complexity, such as an explicit regularization term,
in the training of deep networks for classification. We
show that a classical form of norm control, though a
hidden one, is present in deep networks trained with
gradient descent techniques on exponential-type losses. In
particular, gradient descent induces a dynamics of the
normalized weights that converges for $t \to \infty$ to an
equilibrium corresponding to a minimum-norm (or
maximum-margin) solution. For sufficiently large but
finite $\rho$ (the norm of the unnormalized weights), and
thus finite $t$, the dynamics converges to one of several
margin maximizers, with the margin increasing monotonically
towards a limit stationary point of the flow. In the usual case of stochastic
gradient descent, most of the stationary points are likely
to be convex minima corresponding to a regularized,
constrained minimizer (the network with normalized
weights), which is stable and has a generalization gap
that vanishes as $N \to \infty$,
where $N$ is the number of training examples. For finite,
fixed $N$, the generalization gap may not be zero, but the
minimum-norm property of the solution can, we
conjecture, provide good expected performance for suitable data
distributions. Our approach extends some of the results of
Srebro from linear networks to deep networks and provides
a new perspective on the implicit bias of gradient
descent. We believe that the elusive complexity control we
describe is responsible for the puzzling empirical finding
of good predictive performance by deep networks, despite
overparametrization.
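
The implicit maximum-margin bias summarized above can be illustrated numerically in the simplest setting the abstract refers to: a linear classifier trained by gradient descent on an exponential loss over separable data. The following is a minimal sketch in NumPy, not code from the paper; the data, step size, and iteration budget are hypothetical choices. It shows the weight norm $\rho = \|w\|$ growing without bound while the margin of the normalized weights $w/\|w\|$ increases toward its maximum.

    # Minimal sketch (not from the paper): gradient descent on an
    # exponential loss for a linear classifier on separable data.
    # Tracks (i) the growing weight norm rho = ||w|| and (ii) the
    # margin of the normalized weights w/||w||, which increases
    # toward the maximum margin. All data and hyperparameters here
    # are hypothetical illustrative choices.
    import numpy as np

    rng = np.random.default_rng(0)

    # Two linearly separable Gaussian blobs, labels +1 and -1.
    n = 100
    X = np.vstack([rng.normal(+2.0, 0.5, size=(n, 2)),
                   rng.normal(-2.0, 0.5, size=(n, 2))])
    y = np.concatenate([np.ones(n), -np.ones(n)])

    w = 0.01 * rng.normal(size=2)  # small initialization
    lr = 0.1

    for t in range(1, 100001):
        margins = y * (X @ w)  # y_i <w, x_i>
        # Gradient of the mean exponential loss: mean_i exp(-y_i <w, x_i>)
        grad = -(X * (y * np.exp(-margins))[:, None]).mean(axis=0)
        w -= lr * grad
        if t % 20000 == 0:
            rho = np.linalg.norm(w)               # keeps growing (roughly like log t)
            margin = (y * (X @ (w / rho))).min()  # margin of the normalized weights
            print(f"t={t:6d}  rho={rho:6.3f}  margin={margin:.4f}")

This sketch covers only the linear case addressed by the results of Srebro mentioned above; the abstract's claim is that the analogous behavior, convergence of the normalized weights to a maximum-margin stationary point, holds layer-wise for deep networks.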