While deep learning is successful in a number of applications, it is not yet well understood theoretically. A satisfactory theoretical characterization of deep learning however, is beginning to emerge. It covers the following questions: 1) *representation power*\ of deep networks 2) *optimization\ *of the empirical risk 3)\ *generalization properties*\ of gradient descent techniques - why the expected error does not suffer, despite the absence of explicit regularization, when the networks are overparametrized? In this review we discuss recent advances in the three areas. In *approximation theory*\ both shallow and deep networks have been shown to approximate any continuous functions on a bounded domain at the expense of an exponential number of parameters (exponential in the dimensionality of the function). However, for a subset of compositional functions, deep networks of the convolutional type (even without weight sharing) can have a linear dependence on dimensionality, unlike shallow networks. In *optimization\ *we discuss the loss landscape for the exponential loss function. It turns out that global minima at infinity are completely degenerate. The other critical points of the gradient are less degenerate, with at least one - and typically more - nonzero eigenvalues. This suggests that stochastic gradient descent will find with high probability the global minima. To address the question of *generalization\ *for classification tasks, we use classical uniform convergence results to justify minimizing a surrogate exponential-type loss function under a unit norm constraint on the weight matrix at each layer. It is an interesting side remark, that such minimization for (homogeneous) ReLU deep networks implies maximization of the margin. The resulting constrained gradient system turns out to be identical to the well-known {\it weight normalization} technique, originally motivated from a rather different way. We also show that standard gradient descent contains an implicit\ L_{2}\ unit norm constraint in the sense that it solves the same constrained minimization problem with the same critical points (but a different dynamics). Our approach, which is supported by several independent new results, offers a solution to the puzzle about generalization performance of deep overparametrized ReLU networks, uncovering the origin of the underlying hidden complexity control in the case of deep networks.

The key to generalization is controlling the complexity of

\ \ \ \ \ \ the network. However, there is no obvious control of

\ \ \ \ \ \ complexity -- such as an explicit regularization term --

\ \ \ \ \ \ in the training of deep networks. We will show that a

\ \ \ \ \ \ classical form of norm control -- but kind of hidden -- is

\ \ \ \ \ \ responsible for generalization in deep networks trained

\ \ \ \ \ \ with gradient descent techniques. In particular, gradient

\ \ \ \ \ \ descent induces a dynamics of the normalized weights which

\ \ \ \ \ \ converge for $t \to \infty$ to a degenerate equilibrium

\ \ \ \ \ \ which corresponds to the maximum margin solution. For

\ \ \ \ \ \ sufficiently large but finite $\rho$ -- and thus fnite $t$

\ \ \ \ \ \ -- the dynamics converges to an hyperbolic minimum. Such

\ \ \ \ \ \ dynamics is equivalent to regularized, constrained

\ \ \ \ \ \ minimization which is stable and generalizes at finite

\ \ \ \ \ \ time. Our approach extends some of the results of Srebro

\ \ \ \ \ \ from linear networks to deep networks and provides a new

\ \ \ \ \ \ perspective on the implicit bias of gradient descent. The

\ \ \ \ \ \ elusive complexity control we describe is responsible, at

\ \ \ \ \ \ least in part, for the puzzling empirical finding of good

\ \ \ \ \ \ generalization despite overparametrization by deep

\ \ \ \ \ \ networks.

\

}, author = {Andrzej Banburski and Qianli Liao and Brando Miranda and Tomaso Poggio and Lorenzo Rosasco and Jack Hidary and Fernanda De La Torre} }