This paper is motivated by an open problem around deep networks, namely, the apparent absence of overfitting despite large over-parametrization, which allows perfect fitting of the training data. We analyze this phenomenon in the case of regression problems in which each unit evaluates a periodic activation function. We argue that the minimal expected value of the square loss is inappropriate for measuring the generalization error in the approximation of compositional functions, because it fails to take full advantage of the compositional structure. Instead, we measure the generalization error in the sense of maximum loss, and sometimes as a pointwise error. We give estimates on exactly how many parameters ensure both zero training error and a good generalization error. We prove that a solution of a regularization problem is guaranteed to yield a good training error as well as a good generalization error, and we estimate how much error to expect at any given test point.

%B Neural Networks %V 121 %P 229 - 241 %8 01/2020 %G eng %U https://www.sciencedirect.com/science/article/abs/pii/S0893608019302552 %! Neural Networks %R 10.1016/j.neunet.2019.08.028 %0 Generic %D 2019 %T An analysis of training and generalization errors in shallow and deep networks %A Hrushikesh Mhaskar %A Tomaso Poggio %K deep learning %K generalization error %K interpolatory approximation %X This paper is motivated by an open problem around deep networks, namely, the apparent absence of overfitting despite large over-parametrization, which allows perfect fitting of the training data. We analyze this phenomenon in the case of regression problems in which each unit evaluates a periodic activation function. We argue that the minimal expected value of the square loss is inappropriate for measuring the generalization error in the approximation of compositional functions, because it fails to take full advantage of the compositional structure. Instead, we measure the generalization error in the sense of maximum loss, and sometimes as a pointwise error. We give estimates on exactly how many parameters ensure both zero training error and a good generalization error. We prove that a solution of a regularization problem is guaranteed to yield a good training error as well as a good generalization error, and we estimate how much error to expect at any given test point.

%8 05/2019 %1 https://arxiv.org/abs/1802.06266

%2 https://hdl.handle.net/1721.1/121183

%0 Generic %D 2018 %T An analysis of training and generalization errors in shallow and deep networks %A Hrushikesh Mhaskar %A Tomaso Poggio %K deep learning %K generalization error %K interpolatory approximation %X An open problem around deep networks is the apparent absence of over-fitting despite large over-parametrization, which allows perfect fitting of the training data. In this paper, we explain this phenomenon in the case where each unit evaluates a trigonometric polynomial. It is well understood in the theory of function approximation that approximation by trigonometric polynomials is a “role model” for many other approximation processes, and it has inspired theoretical constructions in the context of approximation by neural and RBF networks as well. In this paper, we argue that the maximum loss functional is necessary to measure the generalization error. We give estimates on exactly how many parameters ensure both zero training error and a good generalization error, and how much error to expect at any given test point. An interesting feature of our new method is that the variance in the training data is no longer an insurmountable lower bound on the generalization error.
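The interpolation claim in this abstract can be illustrated with a standard construction from approximation theory. This is a minimal sketch, not the paper's actual estimator; the target function and degree are arbitrary choices. A degree-n trigonometric polynomial built from 2n+1 equispaced samples fits the training data exactly, while its maximum (uniform) error away from the nodes stays small for a smooth periodic target:

```python
import numpy as np

def trig_interpolant(f, n):
    """Degree-n trigonometric interpolant of a 2*pi-periodic f, via the DFT."""
    m = 2 * n + 1                          # number of equispaced nodes
    nodes = 2 * np.pi * np.arange(m) / m
    coeffs = np.fft.fft(f(nodes)) / m      # Fourier coefficients c_k
    ks = np.fft.fftfreq(m, d=1.0 / m)      # signed frequencies -n..n
    def p(x):
        x = np.atleast_1d(x)
        return np.real(np.exp(1j * np.outer(x, ks)) @ coeffs)
    return p, nodes

f = lambda x: np.exp(np.sin(x))            # smooth periodic target
p, nodes = trig_interpolant(f, 8)

# "Training error": exact fit at the nodes, up to rounding.
train_err = np.max(np.abs(p(nodes) - f(nodes)))
# "Generalization error" in the maximum-loss sense, on a fine grid.
grid = np.linspace(0, 2 * np.pi, 1000)
gen_err = np.max(np.abs(p(grid) - f(grid)))
```

Because the Fourier coefficients of a smooth periodic function decay rapidly, the uniform error off the nodes is tiny here even though the fit at the nodes is exact, which is the qualitative picture the abstract describes.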

%8 02/2018 %2 http://hdl.handle.net/1721.1/113843

%0 Journal Article %J bioRxiv preprint %D 2018 %T Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like? %A Martin Schrimpf %A Jonas Kubilius %E Ha Hong %E Najib J. Majaj %E Rishi Rajalingham %E Elias B. Issa %E Kohitij Kar %E Pouya Bashivan %E Jonathan Prescott-Roy %E Kailyn Schmidt %E Daniel L K Yamins %E James J. DiCarlo %K computational neuroscience %K deep learning %K Neural Networks %K object recognition %K ventral stream %X The internal representations of early deep artificial neural networks (ANNs) were found to be remarkably similar to the internal neural representations measured experimentally in the primate brain. Here we ask, as deep ANNs have continued to evolve, are they becoming more or less brain-like? ANNs that are most functionally similar to the brain will contain mechanisms that are most like those used by the brain. We therefore developed *Brain-Score* – a composite of multiple neural and behavioral benchmarks that score any ANN on how similar it is to the brain’s mechanisms for core object recognition – and we deployed it to evaluate a wide range of state-of-the-art deep ANNs. Using this scoring system, we here report that: (1) DenseNet-169, CORnet-S and ResNet-101 are the most brain-like ANNs. (2) There remains considerable variability in neural and behavioral responses that is not predicted by any ANN, suggesting that no ANN model has yet captured all the relevant mechanisms. (3) Extending prior work, we found that gains in ANN ImageNet performance led to gains on Brain-Score. However, the correlation weakened at ≥ 70% top-1 ImageNet performance, suggesting that additional guidance from neuroscience is needed to make further advances in capturing brain mechanisms. (4) We uncovered smaller (i.e. less complex) ANNs that are more brain-like than many of the best-performing ImageNet models, which suggests the opportunity to simplify ANNs to better understand the ventral stream.
The scoring system used here is far from complete. However, we propose that evaluating and tracking model-benchmark correspondences through a Brain-Score that is regularly updated with new brain data is an exciting opportunity: experimental benchmarks can be used to guide machine network evolution, and machine networks are mechanistic hypotheses of the brain’s network that can drive the next experiments. To facilitate both of these, we release Brain-Score.org: a platform that hosts the neural and behavioral benchmarks, where ANNs for visual processing can be submitted to receive a Brain-Score and their rank relative to other models, and where new experimental data can be naturally incorporated.

Compressed learning (CL) is a joint signal processing and machine learning framework for inference from a signal, using a small number of measurements obtained by a linear projection. In this chapter, we review this concept of compressed learning, which suggests that learning directly in the compressed domain is possible, and with good performance. We experimentally show that the classification accuracy, using an efficient classifier in the compressed domain, can be quite close to the accuracy obtained when operating directly on the original data. Using a convolutional neural network for image classification, we examine the performance of different linear sensing schemes for the data acquisition stage, such as random sensing and PCA projection. Then, we present an end-to-end deep learning approach for CL, in which a network composed of fully connected layers followed by convolutional ones performs the linear sensing and the nonlinear inference stages simultaneously. During the training phase, both the sensing matrix and the nonlinear inference operator are jointly optimized, leading to a suitable sensing matrix and better performance for the overall task of image classification in the compressed domain. The performance of the proposed approach is demonstrated using the MNIST and CIFAR-10 datasets.
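The core idea of compressed learning, classifying from a few linear measurements y = Ax instead of the original signal x, can be sketched on synthetic data. This is a minimal illustration with a random Gaussian sensing matrix and a nearest-centroid classifier standing in for the chapter's CNN pipeline; the dimensions and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n_per_class = 784, 64, 200       # signal dim, measurements, samples/class

# Two synthetic classes: Gaussian clouds around distinct "template" signals.
templates = rng.normal(size=(2, d))
X = np.vstack([t + 0.5 * rng.normal(size=(n_per_class, d)) for t in templates])
y = np.repeat([0, 1], n_per_class)

# Random linear sensing: each sample is reduced from d to m numbers.
A = rng.normal(size=(m, d)) / np.sqrt(m)
Z = X @ A.T                            # compressed measurements y = A x

# Nearest-centroid classifier fit and evaluated entirely in the compressed domain.
centroids = np.stack([Z[y == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(((Z[:, None, :] - centroids) ** 2).sum(-1), axis=1)
acc = (pred == y).mean()
```

Random projections approximately preserve pairwise distances (the Johnson-Lindenstrauss effect), so well-separated classes remain separable after sensing, which is why learning in the compressed domain can approach the accuracy obtained on the original data.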

Full text available online - https://books.google.com/books?hl=en&lr=&id=zDx4DwAAQBAJ&oi=fnd&pg=PA3&ots=vxCX2Ddl0f&sig=RNZB40wA-2EFLjOpkazg8cnWyYo#v=onepage&q&f=false

%B Handbook of Numerical Analysis %I Elsevier %V 19 %P 3 - 17 %8 10/2018 %@ 9780444642059 %G eng %U https://linkinghub.elsevier.com/retrieve/pii/S1570865918300024 %R 10.1016/bs.hna.2018.08.002 %0 Journal Article %J Neuron %D 2018 %T A task-optimized neural network replicates human auditory behavior, predicts brain responses, and reveals a cortical processing hierarchy %A Alexander J. E. Kell %A Daniel L K Yamins %A Erica N Shook %A Sam V Norman-Haignere %A Josh H. McDermott %K auditory cortex %K convolutional neural network %K deep learning %K deep neural network %K encoding models %K fMRI %K Hierarchy %K human auditory cortex %K natural sounds %K word recognition %X A core goal of auditory neuroscience is to build quantitative models that predict cortical responses to natural sounds. Reasoning that a complete model of auditory cortex must solve ecologically relevant tasks, we optimized hierarchical neural networks for speech and music recognition. The best-performing network contained separate music and speech pathways following early shared processing, potentially replicating human cortical organization. The network performed both tasks as well as humans and exhibited human-like errors despite not being optimized to do so, suggesting common constraints on network and human performance. The network predicted fMRI voxel responses substantially better than traditional spectrotemporal filter models throughout auditory cortex. It also provided a quantitative signature of cortical representational hierarchy—primary and non-primary responses were best predicted by intermediate and late network layers, respectively. The results suggest that task optimization provides a powerful set of tools for modeling sensory systems.

%B Neuron %V 98 %8 04/2018 %G eng %U https://www.sciencedirect.com/science/article/pii/S0896627318302502 %) Available online 19 April 2018 %R 10.1016/j.neuron.2018.03.044 %0 Journal Article %J Bulletin of the Polish Academy of Sciences: Technical Sciences %D 2018 %T Theory I: Deep networks and the curse of dimensionality %A Tomaso Poggio %A Qianli Liao %K convolutional neural networks %K deep and shallow networks %K deep learning %K function approximation %X We review recent work characterizing the classes of functions for which deep learning can be exponentially better than shallow learning. Deep convolutional networks are a special case of these conditions, though weight sharing is not the main reason for their exponential advantage.

%B Bulletin of the Polish Academy of Sciences: Technical Sciences %V 66 %G eng %N 6 %0 Report %D 2017 %T Fisher-Rao Metric, Geometry, and Complexity of Neural Networks %A Liang, Tengyuan %A Tomaso Poggio %A Alexander Rakhlin %A Stokes, James %K capacity control %K deep learning %K Fisher-Rao metric %K generalization error %K information geometry %K Invariance %K natural gradient %K ReLU activation %K statistical learning theory %X We study the relationship between geometry and capacity measures for deep neural networks from an invariance viewpoint. We introduce a new notion of capacity — the Fisher-Rao norm — that possesses desirable invariance properties and is motivated by Information Geometry. We discover an analytical characterization of the new capacity measure, through which we establish norm-comparison inequalities and further show that the new measure serves as an umbrella for several existing norm-based complexity measures. We discuss upper bounds on the generalization error induced by the proposed measure. Extensive numerical experiments on CIFAR-10 support our theoretical findings. Our theoretical analysis rests on a key structural lemma about partial derivatives of multi-layer rectifier networks.
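The capacity measure summarized in this abstract can be written down compactly. The following is a sketch in assumed notation (the definition via the Fisher information matrix and the homogeneity identity behind the structural lemma follow standard conventions and should be checked against the paper itself):

```latex
% Fisher-Rao norm of the parameter vector \theta (assumed notation):
\|\theta\|_{\mathrm{fr}}^{2} \;=\; \theta^{\top} I(\theta)\,\theta,
\qquad
I(\theta) \;=\; \mathbb{E}_{(x,y)}\!\left[
  \nabla_{\theta}\,\ell\bigl(f_{\theta}(x),y\bigr)\,
  \nabla_{\theta}\,\ell\bigl(f_{\theta}(x),y\bigr)^{\top}\right].
% A structural fact used for rectifier networks: a bias-free ReLU network
% with L weight layers is positively homogeneous of degree L in \theta,
% so Euler's theorem for homogeneous functions gives
\langle \theta,\, \nabla_{\theta} f_{\theta}(x) \rangle \;=\; L\, f_{\theta}(x).
```

The homogeneity identity is what makes the norm invariant under layer-wise rescalings of the weights that leave the network function unchanged, which is the invariance property the abstract emphasizes.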

%B arXiv.org %8 11/2017 %G eng %U https://arxiv.org/abs/1711.01530 %0 Journal Article %J International Journal of Automation and Computing %D 2017 %T Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review %A Tomaso Poggio %A Hrushikesh Mhaskar %A Lorenzo Rosasco %A Brando Miranda %A Qianli Liao %K convolutional neural networks %K deep and shallow networks %K deep learning %K function approximation %K Machine Learning %K Neural Networks %X The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

%B International Journal of Automation and Computing %P 1-17 %8 03/2017 %G eng %U http://link.springer.com/article/10.1007/s11633-017-1054-2?wt_mc=Internal.Event.1.SEM.ArticleAuthorOnlineFirst %R 10.1007/s11633-017-1054-2 %0 Journal Article %J Journal of Statistical Physics %D 2017 %T Why does deep and cheap learning work so well? %A Henry Lin %A Max Tegmark %K Artificial neural networks %K deep learning %K Statistical physics %X We show how the success of deep learning could depend not only on mathematics but also on physics: although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can frequently be approximated through “cheap learning” with exponentially fewer parameters than generic ones. We explore how properties frequently encountered in physics such as symmetry, locality, compositionality, and polynomial log-probability translate into exceptionally simple neural networks. We further argue that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine learning, a deep neural network can be more efficient than a shallow one. We formalize these claims using information theory and discuss the relation to the renormalization group. We prove various “no-flattening theorems” showing when efficient linear deep networks cannot be accurately approximated by shallow ones without efficiency loss; for example, we show that *n* variables cannot be multiplied using fewer than 2^*n* neurons in a single hidden layer.
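The closing claim about multiplication can be made concrete with the standard quadratic-nonlinearity gadget. This is a sketch assuming a smooth activation σ with σ''(0) ≠ 0, not necessarily the paper's exact construction: four hidden units suffice to multiply two variables, and pairing up n inputs leads to the 2^n count.

```latex
% Multiplying two variables with four units: start from the polarization identity
xy \;=\; \frac{(x+y)^{2} - (x-y)^{2}}{4}.
% Taylor-expanding \sigma(u) \approx \sigma(0) + \sigma'(0)u
% + \tfrac{1}{2}\sigma''(0)u^{2} near u = 0, the constant and linear terms
% cancel in the symmetric combination below, leaving only the squares:
xy \;=\; \lim_{\lambda \to 0}
\frac{\sigma\bigl(\lambda(x+y)\bigr) + \sigma\bigl(-\lambda(x+y)\bigr)
    - \sigma\bigl(\lambda(x-y)\bigr) - \sigma\bigl(-\lambda(x-y)\bigr)}
     {4\lambda^{2}\,\sigma''(0)}.
```

Each of the four σ-evaluations is one hidden unit; composing such gadgets pairwise over n inputs in a single hidden layer yields the exponential 2^n unit count that the abstract's no-flattening theorem shows is unavoidable for shallow networks.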