In solving a system of n linear equations in d variables\ \ Ax=b, the condition number of the (n,d) matrix A measures how\ \ much errors in the data b affect the solution x. Bounds of\ \ this type are important in many inverse problems. An example is\ \ machine learning where the key task is to estimate an underlying\ \ function from a set of measurements at random points in a high\ \ dimensional space and where low sensitivity to error in the data is\ \ a requirement for good predictive performance. Here we report the\ \ simple observation that when the columns of A are random vectors,\ \ the condition number of A is highest, that is worse, when d=n,\ \ that is when the inverse of A exists. An overdetermined system\ \ (n\>d) and especially an underdetermined system (n\<d), for which\ \ the pseudoinverse must be used instead of the inverse, typically\ \ have significantly better, that is lower, condition numbers. Thus\ \ the condition number of A plotted as function of d shows a\ \ double descent behavior with a peak at d=n.

}, author = {Tomaso Poggio and Gil Kur and Andrzej Banburski} } @article {4281, title = {Theoretical Issues in Deep Networks}, year = {2019}, month = {08/2019}, abstract = {While deep learning is successful in a number of applications, it is not yet well understood theoretically.\ A theoretical\ characterization of deep learning should answer questions about their approximation power, the dynamics of optimization by gradient descent and good out-of-sample performance --- why the expected error does not suffer, despite the absence of explicit regularization, when the networks are overparametrized. We review our recent results towards this goal.\ In {\it approximation theory} both shallow and deep networks are known to approximate any continuous functions on a bounded domain at a cost which is exponential (the number of parameters is exponential in the dimensionality of the function). However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can have a linear dependence on dimensionality, unlike shallow networks. In characterizing {\it minimization} of the empirical exponential loss we consider the gradient descent dynamics of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to the normalized network. The dynamics of the normalized weights implied by standard gradient descent turns out to be equivalent to the dynamics of the constrained problem of minimizing an exponential-type loss subject to a unit $L_2$ norm constraint. In particular, the dynamics of the typical, unconstrained gradient descent converges to the same critical points of the constrained problem. Thus, there is {\it implicit regularization} in training deep networks under exponential-type loss functions with gradient descent. The critical points of the flow are hyperbolic minima (for any long but finite time) and minimum norm minimizers (e.g. maxima of the margin). Though appropriately normalized networks can show a small generalization gap (difference between empirical and expected loss) even for finite $N$ (number of training examples) wrt the exponential loss, they do not generalize in terms of the classification error. Bounds on it for finite $N$ remain an open problem. Nevertheless, our results, together with other recent papers, characterize an implicit vanishing regularization by gradient descent which is likely to be a key prerequisite -- in terms of complexity control -- for the good performance of deep overparametrized ReLU classifiers.

}, author = {Tomaso Poggio and Andrzej Banburski and Qianli Liao} } @article {3694, title = {Theory III: Dynamics and Generalization in Deep Networks}, year = {2018}, month = {06/2018}, abstract = {The key to generalization is controlling the complexity of

\ \ \ \ \ \ the network. However, there is no obvious control of

\ \ \ \ \ \ complexity -- such as an explicit regularization term --

\ \ \ \ \ \ in the training of deep networks for classification. We

\ \ \ \ \ \ will show that a classical form of norm control -- but

\ \ \ \ \ \ kind of hidden -- is responsible for good expected

\ \ \ \ \ \ performance by

\ \ \ \ \ \ deep networks trained with gradient descent techniques on

\ \ \ \ \ \ exponential-type losses. In particular, gradient descent

\ \ \ \ \ \ induces a dynamics of the normalized weights which

\ \ \ \ \ \ converge for $t \to \infty$ to an equilibrium which

\ \ \ \ \ \ corresponds to a minimum norm (or maximum margin)

\ \ \ \ \ \ solution. For sufficiently large but finite $\rho$ -- and

\ \ \ \ \ \ thus fnite $t$ -- the dynamics converges to one of several

\ \ \ \ \ \ hyperbolic minima corresponding to a regularized,

\ \ \ \ \ \ constrained minimizer -- the network with normalized

\ \ \ \ \ \ weights-- which is stable and generalizes. At the limit,

\ \ \ \ \ \ generalizaton is lost but the minimum norm property of the

\ \ \ \ \ \ solution provides, we conjecture, good expected

\ \ \ \ \ \ performance. Our approach extends some of the results of

\ \ \ \ \ \ Srebro from linear networks to deep networks and provides

\ \ \ \ \ \ a new perspective on the implicit bias of gradient

\ \ \ \ \ \ descent. The elusive complexity control we describe is

\ \ \ \ \ \ responsible, at least in part, for the puzzling empirical

\ \ \ \ \ \ finding of good predictive performance by deep networks, despite

\ \ \ \ \ \ overparametrization.\

\

}, author = {Andrzej Banburski and Qianli Liao and Brando Miranda and Tomaso Poggio and Lorenzo Rosasco and Jack Hidary and Fernanda De La Torre} }