%0 Generic %D 2023 %T The Janus effects of SGD vs GD: high noise and low rank %A Mengjia Xu %A Tomer Galanti %A Akshay Rangamani %A Lorenzo Rosasco %A Andrea Pinto %A Tomaso Poggio %X

It has long been clear that, for neural networks, SGD with small minibatch size yields much higher asymptotic fluctuations in the updates of the weight matrices than GD. It has also often been reported that SGD in deep ReLU networks empirically shows a low-rank bias in the weight matrices. A recent theoretical analysis derived a bound on the rank and linked it to the size of the SGD fluctuations [25]. In this paper, we provide an empirical and theoretical analysis of the convergence of SGD vs GD, first for deep ReLU networks and then for the case of linear regression, where sharper estimates can be obtained and which is of independent interest. In the linear case, we prove that the component $W^\perp$ of the matrix $W$ corresponding to the null space of the data matrix $X$ converges to zero for both SGD and GD, provided the regularization term is non-zero. Because of the larger number of updates required to go through all the training data, these components converge much faster {\it per epoch} under SGD than under GD. In practice, SGD has a much stronger bias than GD towards solutions whose weight matrices $W$ have high fluctuations -- even when the choice of minibatches is deterministic -- and low rank, provided the initialization is from a random matrix. Thus SGD with non-zero regularization, unlike GD, shows the coupled phenomena of asymptotic noise and a low-rank bias.
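
A minimal numerical sketch (not the paper's experiments; sizes, step size and regularization are made up) of the linear setting above: with n < d the data matrix X has a non-trivial null space, the ridge term shrinks the null-space component of the weights by a fixed factor per update, and SGD, making n updates per epoch, drives it down much faster per epoch than GD.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam, lr = 20, 50, 1e-2, 1e-2        # n < d, so X has a non-trivial null space
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    # Orthonormal basis of the null space of X (right singular vectors beyond rank n)
    _, _, Vt = np.linalg.svd(X, full_matrices=True)
    null_basis = Vt[n:]

    def grad(w, Xb, yb):
        # Gradient of the ridge-regularized square loss on a (mini)batch
        return Xb.T @ (Xb @ w - yb) / len(yb) + lam * w

    w0 = rng.standard_normal(d)               # random initialization shared by both methods
    w_gd, w_sgd = w0.copy(), w0.copy()
    for epoch in range(1, 201):
        w_gd -= lr * grad(w_gd, X, y)                      # GD: one update per epoch
        for i in rng.permutation(n):                       # SGD: n size-1 updates per epoch
            w_sgd -= lr * grad(w_sgd, X[i:i+1], y[i:i+1])
        if epoch % 50 == 0:
            print(epoch,
                  np.linalg.norm(null_basis @ w_gd),       # null-space component under GD
                  np.linalg.norm(null_basis @ w_sgd))      # shrinks ~n times faster per epoch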

%8 12/2024 %2

https://hdl.handle.net/1721.1/153227

%0 Generic %D 2022 %T Scalable Causal Discovery with Score Matching %A Francesco Montagna %A Nicoletta Noceti %A Lorenzo Rosasco %A Kun Zhang %A Francesco Locatello %X

This paper demonstrates how to discover the whole causal graph from the second derivative of the log-likelihood in observational non-linear additive Gaussian noise models. Leveraging scalable machine learning approaches to approximate the score function, we extend the work of Rolland et al., 2022, that only recovers the topological order from the score and requires an expensive pruning step to discover the edges. Our analysis leads to DAS, a practical algorithm that reduces the complexity of the pruning by a factor proportional to the graph size. In practice, DAS achieves competitive accuracy with the current state of the art while being over an order of magnitude faster. Overall, our approach enables principled and scalable causal discovery, significantly lowering the compute bar.

%B NeurIPS 2022 %8 09/2022 %U https://openreview.net/forum?id=v56PHv_W2A %0 Generic %D 2020 %T For interpolating kernel machines, the minimum norm ERM solution is the most stable %A Akshay Rangamani %A Lorenzo Rosasco %A Tomaso Poggio %X

We study the average CVloo stability of kernel ridge-less regression and derive corresponding risk bounds. We show that the interpolating solution with minimum norm has the best CVloo stability, which in turn is controlled by the condition number of the empirical kernel matrix. The latter can be characterized in the asymptotic regime where both the dimension and cardinality of the data go to infinity. Under the assumption of random kernel matrices, the corresponding test error follows a double descent curve.
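
As a concrete illustration (a sketch with hypothetical data, not the paper's setup), the minimum-norm interpolating solution of kernel "ridge-less" regression can be written via the pseudoinverse of the empirical kernel matrix, whose condition number is the quantity the stability result refers to.

    import numpy as np

    def gaussian_kernel(A, B, gamma=1.0):
        # Pairwise Gaussian (RBF) kernel matrix between the rows of A and B
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)

    rng = np.random.default_rng(0)
    n, d = 50, 5
    X = rng.standard_normal((n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

    K = gaussian_kernel(X, X)
    c = np.linalg.pinv(K) @ y                   # minimum-norm interpolating coefficients
    print("training residual:", np.linalg.norm(K @ c - y))   # ~0: the solution interpolates
    print("cond(K):", np.linalg.cond(K))        # condition number controlling CVloo stability

    def predict(X_test):
        return gaussian_kernel(X_test, X) @ c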

%8 06/2020 %1

https://arxiv.org/abs/2006.15522

%2

https://hdl.handle.net/1721.1/125927

%0 Conference Proceedings %B Neural Information Processing Systems (NeurIPS 2019) %D 2019 %T Beating SGD Saturation with Tail-Averaging and Minibatching %A Nicole Muecke %A Gergely Neu %A Lorenzo Rosasco %X

While stochastic gradient descent (SGD) is one of the major workhorses in machine learning, the learning properties of many practically used variants are still poorly understood. In this paper, we consider least squares learning in a nonparametric setting and contribute to filling this gap by focusing on the effect and interplay of multiple passes, mini-batching and averaging, in particular tail averaging. Our results show how these different variants of SGD can be combined to achieve optimal learning rates, also providing practical insights. A novel key result is that tail averaging allows faster convergence rates than uniform averaging in the nonparametric setting. Further, we show that a combination of tail-averaging and minibatching allows more aggressive step-size choices than using any one of said components.
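
A toy sketch (finite-dimensional least squares with made-up sizes, not the paper's nonparametric setting) of the ingredients discussed above: multiple passes, minibatching, and uniform versus tail averaging of the SGD iterates.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lr, batch, epochs = 200, 10, 0.05, 10, 50
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.1 * rng.standard_normal(n)

    w, iterates = np.zeros(d), []
    for _ in range(epochs):                               # multiple passes over the data
        for start in range(0, n, batch):                  # minibatches of size `batch`
            Xb, yb = X[start:start+batch], y[start:start+batch]
            w = w - lr * Xb.T @ (Xb @ w - yb) / batch
            iterates.append(w.copy())

    T = len(iterates)
    uniform_avg = np.mean(iterates, axis=0)               # average over all iterates
    tail_avg = np.mean(iterates[T // 2:], axis=0)         # tail averaging: last half only
    for name, w_hat in [("last", w), ("uniform", uniform_avg), ("tail", tail_avg)]:
        print(name, np.linalg.norm(w_hat - w_true))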

%B Neural Information Processing Systems (NeurIPS 2019) %C Vancouver, Canada %8 11/2019 %G eng %0 Conference Paper %B NAS Sackler Colloquium on Science of Deep Learning %D 2019 %T Dynamics & Generalization in Deep Networks -Minimizing the Norm %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Jack Hidary %A Tomaso Poggio %B NAS Sackler Colloquium on Science of Deep Learning %C Washington D.C. %8 03/2019 %G eng %0 Conference Proceedings %B Neural Information Processing Systems (NeurIPS 2019) %D 2019 %T Implicit Regularization of Accelerated Methods in Hilbert Spaces %A Nicolò Pagliana %A Lorenzo Rosasco %X

We study learning properties of accelerated gradient descent methods for linear least squares in Hilbert spaces. We analyze the implicit regularization properties of Nesterov acceleration and a variant of heavy-ball in terms of the corresponding learning error bounds. Our results show that acceleration can provide faster bias decay than gradient descent, but it also suffers from more unstable behavior. As a result, acceleration cannot in general be expected to improve learning accuracy with respect to gradient descent, but rather to achieve the same accuracy with reduced computations. Our theoretical results are validated by numerical simulations. Our analysis is based on studying suitable polynomials induced by the accelerated dynamics and combining spectral techniques with concentration inequalities.
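
A finite-dimensional sketch (hypothetical sizes and step size) contrasting plain gradient descent with Nesterov acceleration on least squares: the number of iterations plays the role of the regularization parameter, and acceleration drives the empirical error down faster for the same iteration budget.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 20
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)

    step = n / np.linalg.norm(X, 2) ** 2      # 1/L, L the Lipschitz constant of the gradient

    def grad(w):
        return X.T @ (X @ w - y) / n

    w_gd = np.zeros(d)
    x_prev, v = np.zeros(d), np.zeros(d)      # Nesterov: previous iterate and extrapolated point
    for t in range(1, 201):
        w_gd = w_gd - step * grad(w_gd)
        x_new = v - step * grad(v)
        v = x_new + (t - 1) / (t + 2) * (x_new - x_prev)   # momentum coefficient (t-1)/(t+2)
        x_prev = x_new
        if t % 50 == 0:
            print(t, np.linalg.norm(X @ w_gd - y), np.linalg.norm(X @ x_new - y))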

%B Neural Information Processing Systems (NeurIPS 2019) %C Vancouver, Canada %8 11/2019 %G eng %0 Conference Paper %B ICML %D 2019 %T Weight and Batch Normalization implement Classical Generalization Bounds %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Jack Hidary %A Tomaso Poggio %B ICML %C Long Beach/California %8 06/2019 %G eng %0 Generic %D 2018 %T Theory III: Dynamics and Generalization in Deep Networks %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Tomaso Poggio %A Lorenzo Rosasco %A Jack Hidary %A Fernanda De La Torre %X

The key to generalization is controlling the complexity of the network. However, there is no obvious control of complexity -- such as an explicit regularization term -- in the training of deep networks for classification. We show that a classical form of norm control -- albeit a hidden one -- is present in deep networks trained with gradient descent techniques on exponential-type losses. In particular, gradient descent induces a dynamics of the normalized weights which converges, for $t \to \infty$, to an equilibrium corresponding to a minimum norm (or maximum margin) solution. For sufficiently large but finite $\rho$ -- and thus finite $t$ -- the dynamics converges to one of several margin maximizers, with the margin monotonically increasing towards a limit stationary point of the flow. In the usual case of stochastic gradient descent, most of the stationary points are likely to be convex minima corresponding to a regularized, constrained minimizer -- the network with normalized weights -- which is stable and has zero generalization gap asymptotically for $N \to \infty$, where $N$ is the number of training examples. For finite, fixed $N$ the generalization gap may not be zero, but the minimum norm property of the solution can provide, we conjecture, good expected performance for suitable data distributions. Our approach extends some of the results of Srebro from linear networks to deep networks and provides a new perspective on the implicit bias of gradient descent. We believe that the elusive complexity control we describe is responsible for the puzzling empirical finding of good predictive performance by deep networks, despite overparametrization.

%8 06/2018 %2

http://hdl.handle.net/1721.1/116692

%0 Book Section %B Computational and Cognitive Neuroscience of Vision %D 2017 %T Invariant Recognition Predicts Tuning of Neurons in Sensory Cortex %A Jim Mutch %A F. Anselmi %A Andrea Tacchetti %A Lorenzo Rosasco %A JZ. Leibo %A Tomaso Poggio %B Computational and Cognitive Neuroscience of Vision %I Springer %P 85-104 %G eng %0 Generic %D 2017 %T Symmetry Regularization %A F. Anselmi %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %X

The properties of a representation, such as smoothness, adaptability, generality, equivariance/invariance, depend on restrictions imposed during learning. In this paper, we propose using data symmetries, in the sense of equivalences under transformations, as a means for learning symmetry-adapted representations, i.e., representations that are equivariant to transformations in the original space. We provide a sufficient condition to enforce the representation, for example the weights of a neural network layer or the atoms of a dictionary, to have a group structure and specifically the group structure in an unlabeled training set. By reducing the analysis of generic group symmetries to permutation symmetries, we devise an analytic expression for a regularization scheme and a permutation invariant metric on the representation space. Our work provides a proof of concept on why and how to learn equivariant representations, without explicit knowledge of the underlying symmetries in the data.

%8 05/2017 %2

http://hdl.handle.net/1721.1/109391

%0 Generic %D 2017 %T Theory of Deep Learning III: explaining the non-overfitting puzzle %A Tomaso Poggio %A Keji Kawaguchi %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Xavier Boix %A Jack Hidary %A Hrushikesh Mhaskar %X

THIS MEMO IS REPLACED BY CBMM MEMO 90

A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave near zero stable minima of the empirical error as a gradient system in a quadratic potential with a degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient.

Our proposition provides the extension to deep networks of key properties of gradient descent methods for linear networks that, as suggested in (1), can be the key to understanding generalization. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, asymptotically converging to the minimum norm solution. This implies that there is usually an optimal early stopping point that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution, which guarantees good classification error for “low noise” datasets.

The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.
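
Below is a minimal sketch (toy sizes, zero initialization assumed) of the linear-network property invoked above: on an overparametrized least-squares problem, plain gradient descent converges to the minimum norm interpolating solution, which is why the number of iterations itself acts as the regularizer.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 10, 30                             # overparametrized: many zero-error solutions
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    w = np.zeros(d)                           # zero initialization keeps iterates in the row space
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(5000):
        w -= lr * X.T @ (X @ w - y)           # plain gradient descent on the square loss

    w_min_norm = np.linalg.pinv(X) @ y        # minimum norm interpolant
    print(np.linalg.norm(X @ w - y))          # ~0: zero training error
    print(np.linalg.norm(w - w_min_norm))     # ~0: GD converged to the minimum norm solution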

%8 12/2017 %1

arXiv:1801.00173

%2

http://hdl.handle.net/1721.1/113003

%0 Journal Article %J International Journal of Automation and Computing %D 2017 %T Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review %A Tomaso Poggio %A Hrushikesh Mhaskar %A Lorenzo Rosasco %A Brando Miranda %A Qianli Liao %K convolutional neural networks %K deep and shallow networks %K deep learning %K function approximation %K Machine Learning %K Neural Networks %X

The paper reviews and extends an emerging body of theoretical results on deep learning, including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represents an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

%B International Journal of Automation and Computing %P 1-17 %8 03/2017 %G eng %U http://link.springer.com/article/10.1007/s11633-017-1054-2?wt_mc=Internal.Event.1.SEM.ArticleAuthorOnlineFirst %R 10.1007/s11633-017-1054-2 %0 Conference Paper %B Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) %D 2016 %T Holographic Embeddings of Knowledge Graphs %A Maximilian Nickel %A Lorenzo Rosasco %A Tomaso Poggio %X

Learning embeddings of entities and relations is an efficient and versatile method to perform machine learning on relational data such as knowledge graphs. In this work, we propose holographic embeddings (HolE) to learn compositional vector space representations of entire knowledge graphs. The proposed method is related to holographic models of associative memory in that it employs circular correlation to create compositional representations. By using correlation as the compositional operator, HolE can capture rich interactions but simultaneously remains efficient to compute, easy to train, and scalable to very large datasets. In extensive experiments we show that holographic embeddings are able to outperform state-of-the-art methods for link prediction in knowledge graphs and relational learning benchmark datasets.
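
A small sketch of the compositional operator referred to above: circular correlation computed via the FFT, used to score a hypothetical triple against a relation embedding (toy dimensions and random vectors, for illustration only).

    import numpy as np

    def circular_correlation(a, b):
        # [a * b]_k = sum_i a_i * b_{(i + k) mod d}, computed in O(d log d) via the FFT
        return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

    def hole_score(e_s, r, e_o):
        # Plausibility of a triple (s, r, o): sigmoid of the relation embedding dotted with
        # the circular correlation of the subject and object embeddings
        return 1.0 / (1.0 + np.exp(-np.dot(r, circular_correlation(e_s, e_o))))

    rng = np.random.default_rng(0)
    dim = 64
    e_s, r, e_o = 0.1 * rng.standard_normal((3, dim))
    print(hole_score(e_s, r, e_o))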

%B Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) %C Phoenix, Arizona, USA %G eng %0 Journal Article %J Information and Inference: A Journal of the IMA %D 2016 %T On invariance and selectivity in representation learning %A F. Anselmi %A Lorenzo Rosasco %A Tomaso Poggio %X

We study the problem of learning from data representations that are invariant to transformations, and at the same time selective, in the sense that two points have the same representation only if one is the transformation of the other. The mathematical results here sharpen some of the key claims of i-theory—a recent theory of feedforward processing in sensory cortex (Anselmi et al., 2013, Theor. Comput. Sci. and arXiv:1311.4158; Anselmi et al., 2013, Magic materials: a theory of deep hierarchical architectures for learning sensory representations. CBCL Paper; Anselmi & Poggio, 2010, Representation learning in sensory cortex: a theory. CBMM Memo No. 26).

%B Information and Inference: A Journal of the IMA %P iaw009 %8 05/2016 %G eng %U http://imaiai.oxfordjournals.org/lookup/doi/10.1093/imaiai/iaw009 %! Information and Inference %R 10.1093/imaiai/iaw009 %0 Generic %D 2016 %T Theory I: Why and When Can Deep Networks Avoid the Curse of Dimensionality? %A Tomaso Poggio %A Hrushikesh Mhaskar %A Lorenzo Rosasco %A Brando Miranda %A Qianli Liao %X

[formerly titled "Why and When Can Deep - but Not Shallow - Networks Avoid the Curse of Dimensionality: a Review"]

The paper reviews and extends an emerging body of theoretical results on deep learning, including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represents an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

%8 11/2016 %1

https://arxiv.org/abs/1611.00740v5

%2

http://hdl.handle.net/1721.1/105443

%0 Generic %D 2015 %T Deep Convolutional Networks are Hierarchical Kernel Machines %A F. Anselmi %A Lorenzo Rosasco %A Cheston Tan %A Tomaso Poggio %X

We extend i-theory to incorporate not only pooling but also rectifying nonlinearities in an extended HW module (eHW) designed for supervised learning. The two operations roughly correspond to invariance and selectivity, respectively. Under the assumption of normalized inputs, we show that appropriate linear combinations of rectifying nonlinearities are equivalent to radial kernels. If pooling is present, an equivalent kernel also exists. Thus present-day DCNs (Deep Convolutional Networks) can be exactly equivalent to a hierarchy of kernel machines with pooling and non-pooling layers. Finally, we describe a conjecture for theoretically understanding hierarchies of such modules. A main consequence of the conjecture is that hierarchies of eHW modules minimize memory requirements while computing a selective and invariant representation.

%8 06/17/2015 %1

arXiv:1508.01084

%2

http://hdl.handle.net/1721.1/100200

%0 Conference Paper %B INTERSPEECH-2015 %D 2015 %T Discriminative Template Learning in Group-Convolutional Networks for Invariant Speech Representations %A Chiyuan Zhang %A Stephen Voinea %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %B INTERSPEECH-2015 %I International Speech Communication Association (ISCA) %C Dresden, Germany %8 09/2015 %G eng %U http://www.isca-speech.org/archive/interspeech_2015/i15_3229.html %0 Generic %D 2015 %T Holographic Embeddings of Knowledge Graphs %A Maximilian Nickel %A Lorenzo Rosasco %A Tomaso Poggio %K Associative Memory %K Knowledge Graph %K Machine Learning %X

Learning embeddings of entities and relations is an efficient and versatile method to perform machine learning on relational data such as knowledge graphs. In this work, we propose holographic embeddings (HolE) to learn compositional vector space representations of entire knowledge graphs. The proposed method is related to holographic models of associative memory in that it employs circular correlation to create compositional representations. By using correlation as the compositional operator, HolE can capture rich interactions but simultaneously remains efficient to compute, easy to train, and scalable to very large datasets. In extensive experiments we show that holographic embeddings are able to outperform state-of-the-art methods for link prediction in knowledge graphs and relational learning benchmark datasets.

%8 11/16/2015 %G English %1

arXiv:1510.04935

%2

http://hdl.handle.net/1721.1/100203

%0 Generic %D 2015 %T On Invariance and Selectivity in Representation Learning %A F. Anselmi %A Lorenzo Rosasco %A Tomaso Poggio %X

We discuss data representations which can be learned automatically from data, are invariant to transformations, and at the same time selective, in the sense that two points have the same representation only if one is the transformation of the other. The mathematical results here sharpen some of the key claims of i-theory, a recent theory of feedforward processing in sensory cortex.

%8 03/23/2015 %G English %1

arXiv:1503.05938v1

%2

http://hdl.handle.net/1721.1/100194

%0 Generic %D 2015 %T I-theory on depth vs width: hierarchical function composition %A Tomaso Poggio %A F. Anselmi %A Lorenzo Rosasco %X

Deep learning networks with convolution, pooling and subsampling are a special case of hierarchical architectures, which can be represented by trees (such as binary trees). Hierarchical as well as shallow networks can approximate functions of several variables, in particular those that are compositions of low dimensional functions. We show that the power of a deep network architecture with respect to a shallow network is rather independent of the specific nonlinear operations in the network and depends instead on the behavior of the VC-dimension. A shallow network can approximate compositional functions with the same error as a deep network but at the cost of a VC-dimension that is exponential rather than quadratic in the dimensionality of the function. To complete the argument, we argue that there exist visual computations that are intrinsically compositional. In particular, we prove that recognition invariant to translation cannot be computed by shallow networks in the presence of clutter. Finally, a general framework that includes the compositional case is sketched. The key condition that allows tall, thin networks to be nicer than short, fat networks is that the target input-output function must be sparse in a certain technical sense.

%8 12/29/2015 %2

http://hdl.handle.net/1721.1/100559

%0 Conference Paper %B NIPS 2015 %D 2015 %T Learning with incremental iterative regularization %A Lorenzo Rosasco %A Villa, Silvia %X

Within a statistical learning setting, we propose and study an iterative regularization algorithm for least squares defined by an incremental gradient method. In particular, we show that, if all other parameters are fixed a priori, the number of passes over the data (epochs) acts as a regularization parameter, and prove strong universal consistency, i.e., almost sure convergence of the risk, as well as sharp finite sample bounds for the iterates. Our results are a step towards understanding the effect of multiple epochs in stochastic gradient techniques in machine learning and rely on integrating statistical and optimization results.
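
A sketch of the incremental scheme on a toy finite-dimensional least-squares problem (made-up sizes and step size): one cyclic pass over the data per epoch, with the number of epochs playing the role of the regularization parameter.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 20
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.5 * rng.standard_normal(n)

    lr, w = 0.01, np.zeros(d)
    for epoch in range(1, 101):
        for i in range(n):                    # one incremental (cyclic) pass = one epoch
            w -= lr * X[i] * (X[i] @ w - y[i])
        if epoch in (1, 10, 100):
            # The number of passes acts as the regularization parameter:
            # stopping early regularizes, running too long fits the noise.
            print(epoch, np.linalg.norm(w - w_true))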

%B NIPS 2015 %G eng %U https://papers.nips.cc/paper/6015-learning-with-incremental-iterative-regularization %0 Conference Paper %B NIPS 2015 %D 2015 %T Less is More: Nyström Computational Regularization %A Alessandro Rudi %A Raffaello Camoriano %A Lorenzo Rosasco %X

We study Nyström-type subsampling approaches to large-scale kernel methods, and prove learning bounds in the statistical learning setting, where random sampling and high probability estimates are considered. In particular, we prove that these approaches can achieve optimal learning bounds, provided the subsampling level is suitably chosen. These results suggest a simple incremental variant of Nyström Kernel Regularized Least Squares, where the subsampling level implements a form of computational regularization, in the sense that it controls at the same time regularization and computations. Extensive experimental analysis shows that the considered approach achieves state-of-the-art performance on benchmark large-scale datasets.
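
A sketch of a plain Nyström kernel regularized least squares estimator of the kind studied above (uniform subsampling, Gaussian kernel, hypothetical data and hyperparameters): the subsampling level m replaces the full n-dimensional problem with an m-dimensional one.

    import numpy as np

    def rbf(A, B, gamma=0.5):
        # Pairwise Gaussian kernel matrix
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-gamma * d2)

    rng = np.random.default_rng(0)
    n, d, m, lam = 2000, 5, 100, 1e-3         # m: subsampling level (number of Nystrom centers)
    X = rng.standard_normal((n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

    centers = X[rng.choice(n, m, replace=False)]          # uniformly subsampled centers
    Knm, Kmm = rbf(X, centers), rbf(centers, centers)
    # Solve the m-dimensional regularized problem instead of the full n x n one
    alpha = np.linalg.solve(Knm.T @ Knm + n * lam * Kmm, Knm.T @ y)

    def predict(X_test):
        return rbf(X_test, centers) @ alpha

    print("training MSE:", np.mean((predict(X) - y) ** 2))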

%B NIPS 2015 %G eng %U https://papers.nips.cc/paper/5936-less-is-more-nystrom-computational-regularization %0 Generic %D 2015 %T Notes on Hierarchical Splines, DCLNs and i-theory %A Tomaso Poggio %A Lorenzo Rosasco %A Amnon Shashua %A Nadav Cohen %A F. Anselmi %X

We define an extension of classical additive splines for multivariate function approximation that we call hierarchical splines. We show that the case of hierarchical, additive, piece-wise linear splines includes present-day Deep Convolutional Learning Networks (DCLNs) with linear rectifiers and pooling (sum or max). We discuss how these observations together with i-theory may provide a framework for a general theory of deep networks.

%2

http://hdl.handle.net/1721.1/100201

%0 Journal Article %J Theoretical Computer Science %D 2015 %T Unsupervised learning of invariant representations %A F. Anselmi %A JZ. Leibo %A Lorenzo Rosasco %A Jim Mutch %A Andrea Tacchetti %A Tomaso Poggio %K convolutional networks %K Cortex %K Hierarchy %K Invariance %X

The present phase of Machine Learning is characterized by supervised learning algorithms relying on large sets of labeled examples (n→∞). The next phase is likely to focus on algorithms capable of learning from very few labeled examples (n→1), like humans seem able to do. We propose an approach to this problem and describe the underlying theory, based on the unsupervised, automatic learning of a “good” representation for supervised learning, characterized by small sample complexity. We consider the case of visual object recognition, though the theory also applies to other domains like speech. The starting point is the conjecture, proved in specific cases, that image representations which are invariant to translation, scaling and other transformations can considerably reduce the sample complexity of learning. We prove that an invariant and selective signature can be computed for each image or image patch: the invariance can be exact in the case of group transformations and approximate under non-group transformations. A module performing filtering and pooling, like the simple and complex cells described by Hubel and Wiesel, can compute such a signature. The theory offers novel unsupervised learning algorithms for “deep” architectures for image and speech recognition. We conjecture that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects/images which is invariant to transformations, stable, and selective for recognition—and show how this representation may be continuously learned in an unsupervised way during development and visual experience.
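
A schematic sketch of the signature described above (random vectors stand in for images and stored templates; all names and sizes are hypothetical): project the input onto each stored template orbit and keep the one-dimensional empirical distribution of the projections as a histogram.

    import numpy as np

    def signature(x, template_orbits, bins=10):
        # One histogram per template: the empirical distribution of the normalized
        # projections of x onto all stored transformations of that template.
        x = x / np.linalg.norm(x)
        parts = []
        for orbit in template_orbits:                      # orbit: (n_transformations, dim), unit rows
            hist, _ = np.histogram(orbit @ x, bins=bins, range=(-1, 1))
            parts.append(hist / len(orbit))
        return np.concatenate(parts)

    rng = np.random.default_rng(0)
    dim, n_templates, n_transf = 256, 8, 32
    # Stand-ins for stored orbits: each template together with its transformed versions
    # (e.g. its translations), as would be memorized during unsupervised visual experience.
    orbits = []
    for _ in range(n_templates):
        t = rng.standard_normal((n_transf, dim))
        orbits.append(t / np.linalg.norm(t, axis=1, keepdims=True))

    print(signature(rng.standard_normal(dim), orbits).shape)   # (n_templates * bins,)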

%B Theoretical Computer Science %8 06/25/2015 %G eng %U http://www.sciencedirect.com/science/article/pii/S0304397515005587 %R 10.1016/j.tcs.2015.06.048 %0 Generic %D 2014 %T A Deep Representation for Invariance And Music Classification %A Chiyuan Zhang %A Georgios Evangelopoulos %A Stephen Voinea %A Lorenzo Rosasco %A Tomaso Poggio %K Audio Representation %K Hierarchy %K Invariance %K Machine Learning %K Theories for Intelligence %X

Representations in the auditory cortex might be based on mechanisms similar to the visual ventral stream: modules for building invariance to transformations and multiple layers for compositionality and selectivity. In this paper we propose the use of such computational modules for extracting invariant and discriminative audio representations. Building on a theory of invariance in hierarchical architectures, we propose a novel, mid-level representation for acoustical signals, using the empirical distributions of projections on a set of templates and their transformations. Under the assumption that, by construction, this dictionary of templates is composed from similar classes, and samples the orbit of variance-inducing signal transformations (such as shift and scale), the resulting signature is theoretically guaranteed to be unique, invariant to transformations and stable to deformations. Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations. We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification.

%8 03/2014 %1

arXiv:1404.0400v1

%2

http://hdl.handle.net/1721.1/100163

%0 Conference Paper %B ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing %D 2014 %T A Deep Representation for Invariance and Music Classification %A Chiyuan Zhang %A Georgios Evangelopoulos %A Stephen Voinea %A Lorenzo Rosasco %A Tomaso Poggio %K acoustic signal processing %K signal representation %K unsupervised learning %B ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing %I IEEE %C Florence, Italy %8 05/04/2014 %G eng %U http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6854954 %R 10.1109/ICASSP.2014.6854954 %0 Generic %D 2014 %T Learning An Invariant Speech Representation %A Georgios Evangelopoulos %A Stephen Voinea %A Chiyuan Zhang %A Lorenzo Rosasco %A Tomaso Poggio %K Theories for Intelligence %X

Recognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluate its validity for voiced speech sound classification. Our version of the theory requires the memory-based, unsupervised storage of acoustic templates — such as specific phones or words — together with all the transformations of each that normally occur. A quasi-invariant representation for a speech segment can be obtained by projecting it to each template orbit, i.e., the set of transformed signals, and computing the associated one-dimensional empirical probability distributions. The computations can be performed by modules of filtering and pooling, and extended to hierarchical architectures. In this paper, we apply a single-layer, multicomponent representation for phonemes and demonstrate improved accuracy and decreased sample complexity for vowel classification compared to standard spectral, cepstral and perceptual features.

%8 06/2014 %1

arXiv:1406.3884

%2

http://hdl.handle.net/1721.1/100186

%0 Conference Paper %B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %D 2014 %T Phone Classification by a Hierarchy of Invariant Representation Layers %A Chiyuan Zhang %A Stephen Voinea %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %K Hierarchy %K Invariance %K Neural Networks %K Speech Representation %X

We propose a multi-layer feature extraction framework for speech, capable of providing invariant representations. A set of templates is generated by sampling the result of applying smooth, identity-preserving transformations (such as vocal tract length and tempo variations) to arbitrarily-selected speech signals. Templates are then stored as the weights of “neurons”. We use a cascade of such computational modules to factor out different types of transformation variability in a hierarchy, and show that it improves phone classification over baseline features. In addition, we describe empirical comparisons of a) different transformations which may be responsible for the variability in speech signals and of b) different ways of assembling template sets for training. The proposed layered system is an effort towards explaining the performance of recent deep learning networks and the principles by which the human auditory cortex might reduce the sample complexity of learning in speech recognition. Our theory and experiments suggest that invariant representations are crucial in learning from complex, real-world data like natural speech. Our model is built on basic computational primitives of cortical neurons, thus making an argument about how representations might be learned in the human auditory cortex.

%B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %I International Speech Communication Association (ISCA) %C Singapore %G eng %U http://www.isca-speech.org/archive/interspeech_2014/i14_2346.html %0 Generic %D 2014 %T Speech Representations based on a Theory for Learning Invariances %A Stephen Voinea %A Chiyuan Zhang %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %X

Recognition of sounds and speech from a small number of labelled examples (like humans do) depends on the properties of the representation of the acoustic input. We formulate the problem of extracting robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain, that requires the memory-based, unsupervised storage of acoustic templates -- such as specific phones or words -- together with all the transformations of each that normally occur. A quasi-invariant representation for a speech signal can be obtained by projecting it to a number of template orbits, i.e., each one a set of transformed template signals, and computing the associated one-dimensional empirical probability distributions. The computations are performed by modules of filtering and pooling, that can be used for obtaining a mapping in single- or multilayer architectures. We consider several aspects of such representations including different signal scales (word vs. frame), input domains (raw waveforms vs. frequency filterbank responses), structures (shallow vs. multilayer/hierarchical), and ways of sampling from template orbit sets given a set of observations (explicit vs. learned). Preliminary empirical evaluations for learning to separate speech phones and words are given on TIMIT and subsets of TI-DIGITS.

%C SANE 2014 - Speech and Audio in the Northeast %8 10/2014 %9 poster presentation %0 Generic %D 2014 %T Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning? %A F. Anselmi %A JZ. Leibo %A Lorenzo Rosasco %A Jim Mutch %A Andrea Tacchetti %A Tomaso Poggio %K Computer vision %K Pattern recognition %X

The present phase of Machine Learning is characterized by supervised learning algorithms relying on large sets of labeled examples (n→∞). The next phase is likely to focus on algorithms capable of learning from very few labeled examples (n→1), like humans seem able to do. We propose an approach to this problem and describe the underlying theory, based on the unsupervised, automatic learning of a “good” representation for supervised learning, characterized by small sample complexity (n). We consider the case of visual object recognition, though the theory applies to other domains. The starting point is the conjecture, proved in specific cases, that image representations which are invariant to translations, scaling and other transformations can considerably reduce the sample complexity of learning. We prove that an invariant and unique (discriminative) signature can be computed for each image patch, I, in terms of empirical distributions of the dot-products between I and a set of templates stored during unsupervised learning. A module performing filtering and pooling, like the simple and complex cells described by Hubel and Wiesel, can compute such estimates. Hierarchical architectures consisting of this basic Hubel-Wiesel module inherit its properties of invariance, stability, and discriminability while capturing the compositional organization of the visual world in terms of wholes and parts. The theory extends existing deep learning convolutional architectures for image and speech recognition. It also suggests that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects/images which is invariant to transformations, stable, and discriminative for recognition—and that this representation may be continuously learned in an unsupervised way during development and visual experience.

%8 03/2014 %1

1311.4158v5

%2

http://hdl.handle.net/1721.1/90566

%0 Conference Paper %B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %D 2014 %T Word-level Invariant Representations From Acoustic Waveforms %A Stephen Voinea %A Chiyuan Zhang %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %K Invariance %K Speech Representation %K Theories for Intelligence %X

Extracting discriminant, transformation-invariant features from raw audio signals remains a serious challenge for speech recognition. The issue of speaker variability is central to this problem, as changes in accent, dialect, gender, and age alter the sound waveform of speech units at multiple scales (phonemes, words, or phrases). Approaches for dealing with this variability have typically focused on analyzing the spectral properties of speech at the level of frames, on par with frame-level acoustic modeling usually applied to speech recognition systems. In this paper, we propose a framework for representing speech at the whole-word level and extracting features from the acoustic, temporal domain, without the need for spectral encoding or pre-processing. Leveraging recent work on unsupervised learning of invariant sensory representations, we extract a signature for a word by first projecting its raw waveform onto a set of templates and their transformations, and then forming empirical estimates of the resulting one-dimensional distributions via histograms. The representation and relevant parameters are evaluated for word classification on a series of datasets with increasing speaker-mismatch difficulty, and the results are compared to those of an MFCC-based representation.

%B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %I International Speech Communication Association (ISCA) %C Singapore %G eng %U http://www.isca-speech.org/archive/interspeech_2014/i14_2385.html %0 Book Section %B Empirical Inference %D 2013 %T On Learnability, Complexity and Stability %A Villa, Silvia %A Lorenzo Rosasco %A Tomaso Poggio %A Schölkopf, Bernhard %A Luo, Zhiyuan %A Vovk, Vladimir %X

Empirical Inference, Chapter 7

Editors: Bernhard Schölkopf, Zhiyuan Luo and Vladimir Vovk

Abstract:

We consider the fundamental question of learnability of a hypothesis class in the supervised learning setting and in the general learning setting introduced by Vladimir Vapnik. We survey classic results characterizing learnability in terms of suitable notions of complexity, as well as more recent results that establish the connection between learnability and stability of a learning algorithm.

%B Empirical Inference %I Springer Berlin Heidelberg %C Berlin, Heidelberg %P 59 - 69 %@ 978-3-642-41135-9 %G eng %U http://link.springer.com/10.1007/978-3-642-41136-6 %& 7 %R 10.1007/978-3-642-41136-6_7 %0 Generic %D 2013 %T Object recognition data sets (iCub/IIT) %A Lorenzo Rosasco %K Computer vision %K object recognition %K robotics %X

Data set for object recognition and categorization. 10 categories, 40 objects for the training phase. The acquisition size is 640×480 and subsequently cropped to the bounding box of the object according to the kinematics or motion cue. The bounding box is 160×160 in human mode and 320×320 in robot mode. For each object we provide 200 training samples. Each category is trained with 3 objects (600 examples per category).

The dataset can be downloaded from the IIT website.

Publications

Fanello, S.R.; Ciliberto, C.; Santoro, M.; Natale, L.; Metta, G.; Rosasco, L.; Odone, F., “iCub World: Friendly Robots Help Building Good Vision Data-Sets,” In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR), 2013

Fanello, S. R.; Ciliberto, C.; Natale, L.; Metta, G., “Weakly Supervised Strategies for Natural Object Recognition in Robotics,” IEEE International Conference on Robotics and Automation (ICRA). Karlsruhe, Germany, May 6-10, 2013

Fanello, S.R.; Noceti, N.; Metta, G.; Odone, F., “Multi-Class Image Classification: Sparsity Does It Better,” International Conference on Computer Vision Theory and Applications (VISAPP), 2013

Ciliberto C.; Smeraldi F.; Natale L.; Metta G., “Online Multiple Instance Learning Applied to Hand Detection in a Humanoid Robot,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS2011). San Francisco, California, USA, September 25-30, 2011

%8 05/2013 %0 Conference Proceedings %D 2013 %T Unsupervised Learning of Invariant Representations in Hierarchical Architectures. %A F. Anselmi %A JZ. Leibo %A Lorenzo Rosasco %A Jim Mutch %A Andrea Tacchetti %A Tomaso Poggio %K convolutional networks %K Hierarchy %K Invariance %K visual cortex %X

Representations that are invariant to translation, scale and other transformations can considerably reduce the sample complexity of learning, allowing recognition of new object classes from very few examples – a hallmark of human recognition. Empirical estimates of one-dimensional projections of the distribution induced by a group of affine transformations are proven to represent a unique and invariant signature associated with an image. We show how projections yielding invariant signatures for future images can be learned automatically, and updated continuously, during unsupervised visual experience. A module performing filtering and pooling, like simple and complex cells as proposed by Hubel and Wiesel, can compute such estimates. Under this view, a pooling stage estimates a one-dimensional probability distribution. Invariance from observations through a restricted window is equivalent to a sparsity property w.r.t. a transformation, which yields templates that are a) Gabor for optimal simultaneous invariance to translation and scale or b) very specific for complex, class-dependent transformations such as rotation in depth of faces. Hierarchical architectures consisting of this basic Hubel-Wiesel module inherit its properties of invariance, stability, and discriminability while capturing the compositional organization of the visual world in terms of wholes and parts, and are invariant to complex transformations that may only be locally affine. The theory applies to several existing deep learning convolutional architectures for image and speech recognition. It also suggests that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects which is invariant to transformations, stable, and discriminative for recognition – this representation may be learned in an unsupervised way from natural visual experience.


%8 11/2013 %G eng %0 Conference Paper %B Advances in Neural Information Processing Systems 25 (NIPS 2012) %D 2012 %T Learning manifolds with k-means and k-flats %A Guillermo D. Canas %A Tomaso Poggio %A Lorenzo Rosasco %X

We study the problem of estimating a manifold from random samples. In particular, we consider piecewise constant and piecewise linear estimators induced by k-means and k-flats, and analyze their performance. We extend previous results for k-means in two separate directions. First, we provide new results for k-means reconstruction on manifolds and, second, we prove reconstruction bounds for higher-order approximation (k-flats), for which no known results were previously available. While the results for k-means are novel, some of the technical tools are well-established in the literature. In the case of k-flats, both the results and the mathematical tools are new.
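
A compact sketch of a k-flats estimator in the spirit of the one analyzed above (toy data, hypothetical parameters; not the exact estimator used for the bounds): alternate between fitting a q-dimensional affine flat to each cluster via PCA and reassigning points to the nearest flat.

    import numpy as np

    def k_flats(X, k=8, q=1, n_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        labels = rng.integers(0, k, len(X))
        for _ in range(n_iter):
            flats = []
            for j in range(k):
                P = X[labels == j]
                if len(P) <= q:                            # reseed a degenerate cluster
                    P = X[rng.choice(len(X), q + 1, replace=False)]
                mean = P.mean(0)
                _, _, Vt = np.linalg.svd(P - mean, full_matrices=False)
                flats.append((mean, Vt[:q]))               # affine flat: offset + orthonormal basis
            # Distance of every point to every flat (residual after projecting onto the flat)
            dists = np.stack([np.linalg.norm((X - m) - (X - m) @ V.T @ V, axis=1)
                              for m, V in flats], axis=1)
            labels = dists.argmin(1)
        return labels, flats

    # Toy "manifold": a noisy circle in the plane, approximated by k local line segments (1-flats)
    rng = np.random.default_rng(0)
    t = rng.uniform(0, 2 * np.pi, 500)
    X = np.c_[np.cos(t), np.sin(t)] + 0.02 * rng.standard_normal((500, 2))
    labels, flats = k_flats(X, k=8, q=1)
    print(len(flats), np.bincount(labels, minlength=8))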

%B Advances in Neural Information Processing Systems 25 (NIPS 2012) %8 12/2012 %G eng %U https://papers.nips.cc/paper/2012/hash/b20bb95ab626d93fd976af958fbc61ba-Abstract.html