Recent theoretical results show that gradient descent on deep neural networks under exponential loss functions locally maximizes classification margin, which is equivalent to minimizing the norm of the weight matrices under margin constraints. This property of the solution, however, does not fully characterize the generalization performance. We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization. We then show that, after data separation is achieved, it is possible to dynamically reduce the training set by more than 99\% without significant loss of performance. Interestingly, the resulting subset of {\textquotedblleft}high capacity{\textquotedblright} features is not consistent across different training runs, which is consistent with the theoretical claim that all training points should converge to the same asymptotic margin under SGD and in the presence of both batch normalization and weight decay.

}, author = {Andrzej Banburski and Fernanda De La Torre and Nishka Pant and Ishana Shastri and Tomaso Poggio} } @article {4880, title = {Dynamics and Neural Collapse in Deep Classifiers trained with the Square Loss}, year = {2021}, abstract = {Here we consider a simplified model of the dynamics of gradient flow under the square loss in ReLU networks. We show that convergence to a solution with the absolute minimum "norm" -- defined as the product of the Frobenius norms of each layer weight matrix -- is expected when normalization by a Lagrange multiplier (LN) is used together with Weight Decay (WD). In the absence of LN+WD, good solutions for classification may still be achieved because of the implicit bias towards small norm solutions in the trajectory dynamics of gradient descent introduced by close-to-zero initial conditions on the norms of the weights. The main property of the minimizers that bounds their expected binary classification error is the norm: we prove that among all the close-to-interpolating solutions, the ones associated with smaller norm have better margin and better bounds on the expected classification error. We also prove that quasi-interpolating solutions obtained by gradient descent in the presence of WD show the recently discovered behavior of Neural Collapse and describe related predictions. Our analysis supports the idea that the advantage of deep networks relative to other standard classifiers is restricted to specific deep architectures such as CNNs and is due to their good approximation properties for target functions that are locally compositional.

Overparametrized deep networks predict well despite the lack of an explicit complexity control during training such as an explicit regularization term. For exponential-type loss functions, we solve this puzzle by showing an effective regularization effect of gradient descent in terms of the normalized weights that are relevant for classification.

}, doi = {https://doi.org/10.1038/s41467-020-14663-9}, url = {https://www.nature.com/articles/s41467-020-14663-9}, author = {Tomaso Poggio and Qianli Liao and Andrzej Banburski} } @article {4680, title = {Dreaming with ARC}, year = {2020}, month = {11/2020}, abstract = {Current machine learning algorithms are highly specialized to whatever it is they are meant to do {\textendash}{\textendash} e.g. playing chess, picking up objects, or object recognition. How can we extend this to a system that could solve a wide range of problems? We argue that this can be achieved by a modular system {\textendash}{\textendash} one that can adapt to solving different problems by changing only the modules chosen and the order in which those modules are applied to the problem. The recently introduced ARC (Abstraction and Reasoning Corpus) dataset serves as an excellent test of abstract reasoning. Suited to the modular approach, the tasks depend on a set of human Core Knowledge inbuilt priors. In this paper we implement these priors as the modules of our system. We combine these modules using a neural-guided program synthesis.

}, author = {Andrzej Banburski and Anshula Gandhi and Simon Alford and Sylee Dandekar and Peter Chin and Tomaso Poggio} } @article {4570, title = {Hierarchically Local Tasks and Deep Convolutional Networks}, year = {2020}, month = {06/2020}, abstract = {The main success stories of deep learning, starting with ImageNet, depend on convolutional networks, which on certain tasks perform significantly better than traditional shallow classifiers, such as support vector machines. Is there something special about deep convolutional networks that other learning machines do not possess? Recent results in approximation theory have shown that there is an exponential advantage of deep convolutional-like networks in approximating functions with hierarchical locality in their compositional structure. These mathematical results, however, do not say which tasks are expected to have input-output functions with hierarchical locality. Among all the possible hierarchically local tasks in vision, text and speech we explore a few of them experimentally by studying how they are affected by disrupting locality in the input images. We also discuss a taxonomy of tasks ranging from local, to hierarchically local, to global and make predictions about the type of networks required to perform efficiently on these different types of tasks.

}, keywords = {Compositionality, Inductive Bias, perception, Theory of Deep Learning}, author = {Arturo Deza and Qianli Liao and Andrzej Banburski and Tomaso Poggio} } @article {4679, title = {An Overview of Some Issues in the Theory of Deep Networks}, journal = {IEEJ Transactions on Electrical and Electronic Engineering}, volume = {15}, year = {2020}, month = {10/2020}, pages = {1560 - 1571}, abstract = {During the last few years, significant progress has been made in the theoretical understanding of deep networks. We review our contributions in the areas of approximation theory and optimization. We also introduce a new approach based on cross-validation leave-one-out stability to estimate bounds on the expected error of overparametrized classifiers, such as deep networks.

}, issn = {1931-4973}, doi = {10.1002/tee.23243}, url = {https://onlinelibrary.wiley.com/toc/19314981/15/11}, author = {Tomaso Poggio and Andrzej Banburski} } @article {4565, title = {Theoretical issues in deep networks}, journal = {Proceedings of the National Academy of Sciences}, year = {2020}, month = {Sep-06-2020}, pages = {201907369}, abstract = {While deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about their approximation power, the dynamics of optimization, and good out-of-sample performance, despite overparameterization and the absence of explicit regularization. We review our recent results toward this goal. In approximation theory both shallow and deep networks are known to approximate any continuous functions at an exponential cost. However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid the curse of dimensionality. In characterizing minimization of the empirical exponential loss we consider the gradient flow of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to normalized networks. The dynamics of normalized weights turn out to be equivalent to those of the constrained problem of minimizing the loss subject to a unit norm constraint. In particular, the dynamics of typical gradient descent have the same critical points as the constrained problem. Thus there is implicit regularization in training deep networks under exponential-type loss functions during gradient flow. As a consequence, the critical points correspond to minimum norm minimizers. 
This result is especially relevant because it has been recently shown that, for overparameterized models, selection of a minimum norm solution optimizes cross-validation leave-one-out stability and thereby the expected error. Thus our results imply that gradient descent in deep networks minimizes the expected error.

}, issn = {0027-8424}, doi = {10.1073/pnas.1907369117}, url = {https://www.pnas.org/content/early/2020/06/08/1907369117}, author = {Tomaso Poggio and Andrzej Banburski and Qianli Liao} } @article {4375, title = {Double descent in the condition number}, year = {2019}, month = {12/2019}, abstract = {In solving a system of n linear equations in d variables Ax=b, the condition number of the (n,d) matrix A measures how much errors in the data b affect the solution x. Bounds of this type are important in many inverse problems. An example is machine learning where the key task is to estimate an underlying function from a set of measurements at random points in a high dimensional space and where low sensitivity to error in the data is a requirement for good predictive performance. Here we report the simple observation that when the columns of A are random vectors, the condition number of A is highest, that is worse, when d=n, that is when the inverse of A exists. An overdetermined system ($n>d$) and especially an underdetermined system ($n<d$), for which the pseudoinverse must be used instead of the inverse, typically have significantly better, that is lower, condition numbers. Thus the condition number of A plotted as a function of d shows a double descent behavior with a peak at d=n.

}, author = {Tomaso Poggio and Gil Kur and Andrzej Banburski} } @conference {4516, title = {Dynamics \& Generalization in Deep Networks -Minimizing the Norm}, booktitle = {NAS Sackler Colloquium on Science of Deep Learning}, year = {2019}, month = {03/2019}, address = {Washington D.C.}, author = {Andrzej Banburski and Qianli Liao and Brando Miranda and Lorenzo Rosasco and Jack Hidary and Tomaso Poggio} } @article {4281, title = {Theoretical Issues in Deep Networks}, year = {2019}, month = {08/2019}, abstract = {While deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about their approximation power, the dynamics of optimization by gradient descent and good out-of-sample performance --- why the expected error does not suffer, despite the absence of explicit regularization, when the networks are overparametrized. We review our recent results towards this goal. In {\it approximation theory} both shallow and deep networks are known to approximate any continuous functions on a bounded domain at a cost which is exponential (the number of parameters is exponential in the dimensionality of the function). However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can have a linear dependence on dimensionality, unlike shallow networks. In characterizing {\it minimization} of the empirical exponential loss we consider the gradient descent dynamics of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to the normalized network. The dynamics of the normalized weights implied by standard gradient descent turns out to be equivalent to the dynamics of the constrained problem of minimizing an exponential-type loss subject to a unit $L_2$ norm constraint.
In particular, the dynamics of the typical, unconstrained gradient descent converges to the same critical points as the constrained problem. Thus, there is {\it implicit regularization} in training deep networks under exponential-type loss functions with gradient descent. The critical points of the flow are hyperbolic minima (for any long but finite time) and minimum norm minimizers (e.g. maxima of the margin). Though appropriately normalized networks can show a small generalization gap (difference between empirical and expected loss) even for finite $N$ (number of training examples) with respect to the exponential loss, they do not generalize in terms of the classification error. Bounds on it for finite $N$ remain an open problem. Nevertheless, our results, together with other recent papers, characterize an implicit vanishing regularization by gradient descent which is likely to be a key prerequisite -- in terms of complexity control -- for the good performance of deep overparametrized ReLU classifiers.

}, author = {Tomaso Poggio and Andrzej Banburski and Qianli Liao} } @article {4515, title = {Theories of Deep Learning: Approximation, Optimization and Generalization}, year = {2019}, month = {09/2019}, author = {Qianli Liao and Andrzej Banburski and Tomaso Poggio} } @conference {4517, title = {Weight and Batch Normalization implement Classical Generalization Bounds}, booktitle = {ICML}, year = {2019}, month = {06/2019}, address = {Long Beach/California}, author = {Andrzej Banburski and Qianli Liao and Brando Miranda and Lorenzo Rosasco and Jack Hidary and Tomaso Poggio} } @article {3694, title = {Theory III: Dynamics and Generalization in Deep Networks}, year = {2018}, month = {06/2018}, abstract = {The key to generalization is controlling the complexity of the network. However, there is no obvious control of complexity -- such as an explicit regularization term -- in the training of deep networks for classification. We will show that a classical form of norm control -- but kind of hidden -- is present in deep networks trained with gradient descent techniques on exponential-type losses. In particular, gradient descent induces a dynamics of the normalized weights which converge for $t \to \infty$ to an equilibrium which corresponds to a minimum norm (or maximum margin) solution. For sufficiently large but finite $\rho$ -- and thus finite $t$ -- the dynamics converges to one of several margin maximizers, with the margin monotonically increasing towards a limit stationary point of the flow. In the usual case of stochastic gradient descent, most of the stationary points are likely to be convex minima corresponding to a regularized, constrained minimizer -- the network with normalized weights -- which is stable and has zero generalization gap asymptotically for $N \to \infty$, where $N$ is the number of training examples. For finite, fixed $N$ the generalization gap may not be zero but the minimum norm property of the solution can provide, we conjecture, good expected performance for suitable data distributions. Our approach extends some of the results of Srebro from linear networks to deep networks and provides a new perspective on the implicit bias of gradient descent. We believe that the elusive complexity control we describe is responsible for the puzzling empirical finding of good predictive performance by deep networks, despite overparametrization.