This paper is motivated by an open problem around deep networks, namely, the apparent absence of over-fitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we analyze this phenomenon in the case of regression problems when each unit evaluates a periodic activation function. We argue that the minimal expected value of the square loss is inappropriate to measure the generalization error in approximation of compositional functions in order to take full advantage of the compositional structure. Instead, we measure the generalization error in the sense of maximum loss, and sometimes, as a pointwise error. We give estimates on exactly how many parameters ensure both zero training error as well as a good generalization error. We prove that a solution of a regularization problem is guaranteed to yield a good training error as well as a good generalization error and estimate how much error to expect at which test data.

}, keywords = {deep learning, generalization error, interpolatory approximation}, issn = {08936080}, doi = {10.1016/j.neunet.2019.08.028}, url = {https://www.sciencedirect.com/science/article/abs/pii/S0893608019302552}, author = {Hrushikesh Mhaskar and Tomaso Poggio} } @article {4240, title = {An analysis of training and generalization errors in shallow and deep networks}, year = {2019}, month = {05/2019}, abstract = {This paper is motivated by an open problem around deep networks, namely, the apparent absence of overfitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we analyze this phenomenon in the case of regression problems when each unit evaluates a periodic activation function. We argue that the minimal expected value of the square loss is inappropriate to measure the generalization error in approximation of compositional functions in order to take full advantage of the compositional structure. Instead, we measure the generalization error in the sense of maximum loss, and sometimes, as a pointwise error. We give estimates on exactly how many parameters ensure both zero training error as well as a good generalization error. We prove that a solution of a regularization problem is guaranteed to yield a good training error as well as a good generalization error and estimate how much error to expect at which test data.

}, keywords = {deep learning, generalization error, interpolatory approximation}, author = {Hrushikesh Mhaskar and Tomaso Poggio} } @article {3315, title = {An analysis of training and generalization errors in shallow and deep networks}, year = {2018}, month = {02/2018}, abstract = {An open problem around deep networks is the apparent absence of over-fitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we explain this phenomenon when each unit evaluates a trigonometric polynomial. It is well understood in the theory of function approximation that approximation by trigonometric polynomials is a {\textquotedblleft}role model{\textquotedblright} for many other processes of approximation that have inspired many theoretical constructions also in the context of approximation by neural and RBF networks. In this paper, we argue that the maximum loss functional is necessary to measure the generalization error. We give estimates on exactly how many parameters ensure both zero training error as well as a good generalization error, and how much error to expect at which test data. An interesting feature of our new method is that the variance in the training data is no longer an insurmountable lower bound on the generalization error.

}, keywords = {deep learning, generalization error, interpolatory approximation}, author = {Hrushikesh Mhaskar and Tomaso Poggio} } @article {4294, title = {Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like?}, journal = {bioRxiv preprint}, year = {2018}, abstract = {The internal representations of early deep artificial neural networks (ANNs) were found to be remarkably similar to the internal neural representations measured experimentally in the primate brain. Here we ask, as deep ANNs have continued to evolve, are they becoming more or less brain-like? ANNs that are most functionally similar to the brain will contain mechanisms that are most like those used by the brain. We therefore developed *Brain-Score* {\textendash} a composite of multiple neural and behavioral benchmarks that score any ANN on how similar it is to the brain{\textquoteright}s mechanisms for core object recognition {\textendash} and we deployed it to evaluate a wide range of state-of-the-art deep ANNs. Using this scoring system, we here report that: (1) DenseNet-169, CORnet-S and ResNet-101 are the most brain-like ANNs. (2) There remains considerable variability in neural and behavioral responses that is not predicted by any ANN, suggesting that no ANN model has yet captured all the relevant mechanisms. (3) Extending prior work, we found that gains in ANN ImageNet performance led to gains on Brain-Score. However, correlation weakened at *>=* 70\% top-1 ImageNet performance, suggesting that additional guidance from neuroscience is needed to make further advances in capturing brain mechanisms. (4) We uncovered smaller (i.e. less complex) ANNs that are more brain-like than many of the best-performing ImageNet models, which suggests the opportunity to simplify ANNs to better understand the ventral stream. The scoring system used here is far from complete.
However, we propose that evaluating and tracking model-benchmark correspondences through a Brain-Score that is regularly updated with new brain data is an exciting opportunity: experimental benchmarks can be used to guide machine network evolution, and machine networks are mechanistic hypotheses of the brain{\textquoteright}s network and thus drive next experiments. To facilitate both of these, we release Brain-Score.org: a platform that hosts the neural and behavioral benchmarks, where ANNs for visual processing can be submitted to receive a Brain-Score and their rank relative to other models, and where new experimental data can be naturally incorporated.

Compressed learning (CL) is a joint signal processing and machine learning framework for inference from a signal, using a small number of measurements obtained by a linear projection. In this chapter, we review this concept of compressed learning, which suggests that learning directly in the compressed domain is possible, and with good performance. We experimentally show that the classification accuracy, using an efficient classifier in the compressed domain, can be quite close to the accuracy obtained when operating directly on the original data. Using a convolutional neural network for image classification, we examine the performance of different linear sensing schemes for the data acquisition stage, such as random sensing and PCA projection. Then, we present an end-to-end deep learning approach for CL, in which a network composed of fully connected layers followed by convolutional ones performs the linear sensing and the nonlinear inference stages simultaneously. During the training phase, both the sensing matrix and the nonlinear inference operator are jointly optimized, leading to a suitable sensing matrix and better performance for the overall task of image classification in the compressed domain. The performance of the proposed approach is demonstrated using the MNIST and CIFAR-10 datasets.

Full text available online - https://books.google.com/books?hl=en\&lr=\&id=zDx4DwAAQBAJ\&oi=fnd\&pg=PA3\&ots=vxCX2Ddl0f\&sig=RNZB40wA-2EFLjOpkazg8cnWyYo$\#$v=onepage\&q\&f=false

}, keywords = {Compressed learning, Compressed sensing, deep learning, Neural Networks, sparse coding, Sparse representation}, isbn = {9780444642059}, issn = {15708659}, doi = {10.1016/bs.hna.2018.08.002}, url = {https://linkinghub.elsevier.com/retrieve/pii/S1570865918300024}, author = {Zisselman, E. and Amir Adler and Elad, M.} } @article {3573, title = {A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy}, journal = {Neuron}, volume = {98}, year = {2018}, month = {04/2018}, abstract = {A core goal of auditory neuroscience is to build quantitative models that predict cortical responses to natural sounds. Reasoning that a complete model of auditory cortex must solve ecologically relevant tasks, we optimized hierarchical neural networks for speech and music recognition. The best-performing network contained separate music and speech pathways following early shared processing, potentially replicating human cortical organization. The network performed both tasks as well as humans and exhibited human-like errors despite not being optimized to do so, suggesting common constraints on network and human performance. The network predicted fMRI voxel responses substantially better than traditional spectrotemporal filter models throughout auditory cortex. It also provided a quantitative signature of cortical representational hierarchy{\textemdash}primary and non-primary responses were best predicted by intermediate and late network layers, respectively. The results suggest that task optimization provides a powerful set of tools for modeling sensory systems.

}, keywords = {auditory cortex, convolutional neural network, deep learning, deep neural network, encoding models, fMRI, Hierarchy, human auditory cortex, natural sounds, word recognition}, doi = {10.1016/j.neuron.2018.03.044}, url = {https://www.sciencedirect.com/science/article/pii/S0896627318302502}, author = {Alexander J. E. Kell and Daniel L K Yamins and Erica N Shook and Sam V Norman-Haignere and Josh H. McDermott} } @article {4185, title = {Theory I: Deep networks and the curse of dimensionality}, journal = {Bulletin of the Polish Academy of Sciences: Technical Sciences}, volume = {66}, year = {2018}, abstract = {We review recent work characterizing the classes of functions for which deep learning can be exponentially better than shallow learning. Deep convolutional networks are a special case of these conditions, though weight sharing is not the main reason for their exponential advantage.

}, keywords = {convolutional neural networks, deep and shallow networks, deep learning, function approximation}, author = {Tomaso Poggio and Qianli Liao} } @article {3155, title = {Fisher-Rao Metric, Geometry, and Complexity of Neural Networks}, year = {2017}, month = {11/2017}, abstract = {We study the relationship between geometry and capacity measures for deep neural networks from an invariance viewpoint. We introduce a new notion of capacity {\textemdash} the Fisher-Rao norm {\textemdash} that possesses desirable invariance properties and is motivated by Information Geometry. We discover an analytical characterization of the new capacity measure, through which we establish norm-comparison inequalities and further show that the new measure serves as an umbrella for several existing norm-based complexity measures. We discuss upper bounds on the generalization error induced by the proposed measure. Extensive numerical experiments on CIFAR-10 support our theoretical findings. Our theoretical analysis rests on a key structural lemma about partial derivatives of multi-layer rectifier networks.

}, keywords = {capacity control, deep learning, Fisher-Rao metric, generalization error, information geometry, Invariance, natural gradient, ReLU activation, statistical learning theory}, url = {https://arxiv.org/abs/1711.01530}, author = {Liang, Tengyuan and Tomaso Poggio and Alexander Rakhlin and Stokes, James} } @article {2557, title = {Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review}, journal = {International Journal of Automation and Computing}, year = {2017}, month = {03/2017}, pages = {1-17}, abstract = {The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represents an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

}, keywords = {convolutional neural networks, deep and shallow networks, deep learning, function approximation, Machine Learning, Neural Networks}, doi = {10.1007/s11633-017-1054-2}, url = {http://link.springer.com/article/10.1007/s11633-017-1054-2?wt_mc=Internal.Event.1.SEM.ArticleAuthorOnlineFirst}, author = {Tomaso Poggio and Hrushikesh Mhaskar and Lorenzo Rosasco and Brando Miranda and Qianli Liao} } @article {2761, title = {Why does deep and cheap learning work so well?}, journal = {Journal of Statistical Physics}, volume = {168}, year = {2017}, month = {09/2017}, pages = {1223{\textendash}1247}, chapter = {1223}, abstract = {We show how the success of deep learning could depend not only on mathematics but also on physics: although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can frequently be approximated through {\textquotedblleft}cheap learning{\textquotedblright} with exponentially fewer parameters than generic ones. We explore how properties frequently encountered in physics such as symmetry, locality, compositionality, and polynomial log-probability translate into exceptionally simple neural networks. We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine learning, a deep neural network can be more efficient than a shallow one. We formalize these claims using information theory and discuss the relation to the renormalization group. We prove various {\textquotedblleft}no-flattening theorems{\textquotedblright} showing when efficient linear deep networks cannot be accurately approximated by shallow ones without efficiency loss; for example, we show that *n* variables cannot be multiplied using fewer than $2^{n}$ neurons in a single hidden layer.