This paper is motivated by an open problem around deep networks, namely, the apparent absence of over-fitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we analyze this phenomenon in the case of regression problems when each unit evaluates a periodic activation function. We argue that the minimal expected value of the square loss is inappropriate to measure the generalization error in approximation of compositional functions in order to take full advantage of the compositional structure. Instead, we measure the generalization error in the sense of maximum loss, and sometimes, as a pointwise error. We give estimates on exactly how many parameters ensure both zero training error as well as a good generalization error. We prove that a solution of a regularization problem is guaranteed to yield a good training error as well as a good generalization error and estimate how much error to expect at which test data.

}, keywords = {deep learning, generalization error, interpolatory approximation}, issn = {08936080}, doi = {10.1016/j.neunet.2019.08.028}, url = {https://www.sciencedirect.com/science/article/abs/pii/S0893608019302552}, author = {Hrushikesh Mhaskar and Tomaso Poggio} } @article {4750, title = {Function approximation by deep networks}, journal = {Communications on Pure \& Applied Analysis}, volume = {19}, year = {2020}, month = {08/2020}, pages = {4085 - 4095}, abstract = {We show that deep networks are better than shallow networks at approximating functions that can be expressed as a composition of functions described by a directed acyclic graph, because the deep networks can be designed to have the same compositional structure, while a shallow network cannot exploit this knowledge. Thus, the blessing of compositionality mitigates the curse of dimensionality. On the other hand, a theorem called good propagation of errors allows to "lift" theorems about shallow networks to those about deep networks with an appropriate choice of norms, smoothness, etc. We illustrate this in three contexts where each channel in the deep network calculates a spherical polynomial, a non-smooth ReLU network, or another zonal function network related closely with the ReLU network.

}, keywords = {approximation on the Euclidean sphere, deep networks, degree of approximation}, issn = {1553-5258}, doi = {10.3934/cpaa.2020181}, url = {http://aimsciences.org//article/doi/10.3934/cpaa.2020181}, author = {Hrushikesh Mhaskar and Tomaso Poggio} } @article {4240, title = {An analysis of training and generalization errors in shallow and deep networks}, year = {2019}, month = {05/2019}, abstract = {This paper is motivated by an open problem around deep networks, namely, the apparent absence of overfitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we analyze this phenomenon in the case of regression problems when each unit evaluates a periodic activation function. We argue that the minimal expected value of the square loss is inappropriate to measure the generalization error in approximation of compositional functions in order to take full advantage of the compositional structure. Instead, we measure the generalization error in the sense of maximum loss, and sometimes, as a pointwise error. We give estimates on exactly how many parameters ensure both zero training error as well as a good generalization error. We prove that a solution of a regularization problem is guaranteed to yield a good training error as well as a good generalization error and estimate how much error to expect at which test data.

}, keywords = {deep learning, generalization error, interpolatory approximation}, author = {Hrushikesh Mhaskar and Tomaso Poggio} } @article {3315, title = {An analysis of training and generalization errors in shallow and deep networks}, year = {2018}, month = {02/2018}, abstract = {An open problem around deep networks is the apparent absence of over-fitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we explain this phenomenon when each unit evaluates a trigonometric polynomial. It is well understood in the theory of function approximation that ap- proximation by trigonometric polynomials is a {\textquotedblleft}role model{\textquotedblright} for many other processes of approximation that have inspired many theoretical constructions also in the context of approximation by neural and RBF networks. In this paper, we argue that the maximum loss functional is necessary to measure the generalization error. We give estimates on exactly how many parameters ensure both zero training error as well as a good generalization error, and how much error to expect at which test data. An interesting feature of our new method is that the variance in the training data is no longer an insurmountable lower bound on the generalization error.

}, keywords = {deep learning, generalization error, interpolatory approximation}, author = {Hrushikesh Mhaskar and Tomaso Poggio} } @article {3266, title = {Theory of Deep Learning III: explaining the non-overfitting puzzle}, year = {2017}, month = {12/2017}, abstract = {**THIS MEMO IS REPLACED BY CBMM MEMO 90**

A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave near zero stable minima of the empirical error as gradient system in a quadratic potential with degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient.

Our proposition provides the extension to deep networks of key properties of gradient descent methods for linear networks, that as, suggested in (1), can be the key to understand generalization. Gradient descent enforces a form of implicit regular- ization controlled by the number of iterations, and asymptotically converging to the minimum norm solution. This implies that there is usually an optimum early stopping that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution which guarantees good classification error for {\textquotedblleft}low noise{\textquotedblright} datasets.

The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.

}, author = {Tomaso Poggio and Keji Kawaguchi and Qianli Liao and Brando Miranda and Lorenzo Rosasco and Xavier Boix and Jack Hidary and Hrushikesh Mhaskar} } @proceedings {2682, title = {When and Why Are Deep Networks Better Than Shallow Ones?}, year = {2017}, abstract = {While the universal approximation property holds both for hierarchical and shallow networks, deep networks can approximate the class of compositional functions as well as shallow networks but with exponentially lower number of training parameters and sample complexity. Compositional functions are obtained as a hierarchy of local constituent functions, where "local functions{\textquoteright}{\textquoteright} are functions with low dimensionality. This theorem proves an old conjecture by Bengio on the role of depth in networks, characterizing precisely the conditions under which it holds. It also suggests possible answers to the the puzzle of why high-dimensional deep networks trained on large training sets often do not seem to show overfit.

},
author = {Hrushikesh Mhaskar and Qianli Liao and Tomaso Poggio}
}
@article {2557,
title = {Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review},
journal = {International Journal of Automation and Computing},
year = {2017},
month = {03/2017},
pages = {1-17},
abstract = {The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

}, keywords = {convolutional neural networks, deep and shallow networks, deep learning, function approximation, Machine Learning, Neural Networks}, doi = {10.1007/s11633-017-1054-2}, url = {http://link.springer.com/article/10.1007/s11633-017-1054-2?wt_mc=Internal.Event.1.SEM.ArticleAuthorOnlineFirst}, author = {Tomaso Poggio and Hrushikesh Mhaskar and Lorenzo Rosasco and Brando Miranda and Qianli Liao} } @article {2183, title = {Deep vs. shallow networks : An approximation theory perspective}, year = {2016}, month = {08/2016}, abstract = {The paper briefly reviews several recent results on hierarchical architectures for learning from examples, that may formally explain the conditions under which Deep Convolutional Neural Networks perform much better in function approximation problems than shallow, one-hidden layer architectures. The paper announces new results for a non-smooth activation function {\textemdash} the ReLU function {\textemdash} used in present-day neural networks, as well as for the Gaussian networks. We propose a new definition of *relative dimension* to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning.

},
keywords = {blessed representation, deep and shallow networks, Gaussian networks, ReLU networks},
issn = {0219-5305},
doi = {10.1142/S0219530516400042},
url = {http://www.worldscientific.com/doi/abs/10.1142/S0219530516400042},
author = {Hrushikesh Mhaskar and Tomaso Poggio}
}
@article {1741,
title = {Learning Functions: When Is Deep Better Than Shallow},
year = {2016},
abstract = {While the universal approximation property holds both for hierarchical and shallow networks, we prove that deep (hierarchical) networks can approximate the class of compositional functions with the same accuracy as shallow networks but with exponentially lower number of training parameters as well as VC-dimension. This theorem settles an old conjecture by Bengio on the role of depth in networks. We then define a general class of scalable, shift-invariant algorithms to show a simple and natural set of requirements that justify deep convolutional networks.

}, url = {https://arxiv.org/pdf/1603.00988v4.pdf}, author = {Hrushikesh Mhaskar and Qianli Liao and Tomaso Poggio} } @article {2321, title = {Theory I: Why and When Can Deep Networks Avoid the Curse of Dimensionality?}, year = {2016}, month = {11/2016}, abstract = {[formerly titled "*Why and When Can Deep - but Not Shallow - Networks Avoid the Curse of Dimensionality: a Review*"]

The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

}, author = {Tomaso Poggio and Hrushikesh Mhaskar and Lorenzo Rosasco and Brando Miranda and Qianli Liao} }