While stochastic gradient descent (SGD) is one of the major workhorses in machine learning, the learning properties of many practically used variants are still poorly understood. In this paper, we consider least squares learning in a nonparametric setting and contribute to filling this gap by focusing on the effect and interplay of multiple passes, mini-batching and averaging, in particular tail averaging. Our results show how these different variants of SGD can be combined to achieve optimal learning rates, also providing practical insights. A novel key result is that tail averaging allows faster convergence rates than uniform averaging in the nonparametric setting. Further, we show that a combination of tail-averaging and minibatching allows more aggressive step-size choices than using any one of said components.

%B Neural Information Processing Systems (NeurIPS 2019) %C Vancouver, Canada %8 11/2019 %G eng %0 Conference Proceedings %B Neural Information Processing Systems (NeurIPS 2019) %D 2019 %T Implicit Regularization of Accelerated Methods in Hilbert Spaces %A Nicolò Pagliana %A Lorenzo Rosasco %XWe study learning properties of accelerated gradient descent methods for linear least-squares in Hilbert spaces. We analyze the implicit regularization properties of Nesterov acceleration and a variant of heavy-ball in terms of corresponding learning error bounds. Our results show that acceleration can provide faster bias decay than gradient descent, but also suffers from more unstable behavior. As a result, acceleration cannot in general be expected to improve learning accuracy with respect to gradient descent, but rather to achieve the same accuracy with reduced computations. Our theoretical results are validated by numerical simulations. Our analysis is based on studying suitable polynomials induced by the accelerated dynamics and combining spectral techniques with concentration inequalities.

%B Neural Information Processing Systems (NeurIPS 2019) %C Vancouver, Canada %8 11/2019 %G eng %0 Generic %D 2018 %T Theory III: Dynamics and Generalization in Deep Networks %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Tomaso Poggio %A Lorenzo Rosasco %A Jack Hidary %A Fernanda De La Torre %XThe key to generalization is controlling the complexity of the network. However, there is no obvious control of complexity -- such as an explicit regularization term -- in the training of deep networks for classification. We will show that a classical form of norm control -- but kind of hidden -- is present in deep networks trained with gradient descent techniques on exponential-type losses. In particular, gradient descent induces a dynamics of the normalized weights which converge for $t \to \infty$ to an equilibrium which corresponds to a minimum norm (or maximum margin) solution. For sufficiently large but finite $\rho$ -- and thus finite $t$ -- the dynamics converges to one of several margin maximizers, with the margin monotonically increasing towards a limit stationary point of the flow. In the usual case of stochastic gradient descent, most of the stationary points are likely to be convex minima corresponding to a regularized, constrained minimizer -- the network with normalized weights -- which is stable and has asymptotic zero generalization gap, asymptotically for $N \to \infty$, where $N$ is the number of training examples. For finite, fixed $N$ the generalization gap may not be zero but the minimum norm property of the solution can provide, we conjecture, good expected performance for suitable data distributions. Our approach extends some of the results of Srebro from linear networks to deep networks and provides a new perspective on the implicit bias of gradient descent. We believe that the elusive complexity control we describe is responsible for the puzzling empirical finding of good predictive performance by deep networks, despite overparametrization.

http://hdl.handle.net/1721.1/116692

%0 Book Section %B Computational and Cognitive Neuroscience of Vision %D 2017 %T Invariant Recognition Predicts Tuning of Neurons in Sensory Cortex %A Jim Mutch %A F. Anselmi %A Andrea Tacchetti %A Lorenzo Rosasco %A JZ. Leibo %A Tomaso Poggio %B Computational and Cognitive Neuroscience of Vision %I Springer %P 85-104 %G eng %0 Generic %D 2017 %T Symmetry Regularization %A F. Anselmi %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %XThe properties of a representation, such as smoothness, adaptability, generality, equivariance/invariance, depend on restrictions imposed during learning. In this paper, we propose using data symmetries, in the sense of equivalences under transformations, as a means for learning symmetry-adapted representations, i.e., representations that are equivariant to transformations in the original space. We provide a sufficient condition to enforce the representation, for example the weights of a neural network layer or the atoms of a dictionary, to have a group structure and specifically the group structure in an unlabeled training set. By reducing the analysis of generic group symmetries to permutation symmetries, we devise an analytic expression for a regularization scheme and a permutation invariant metric on the representation space. Our work provides a proof of concept on why and how to learn equivariant representations, without explicit knowledge of the underlying symmetries in the data.

%8 05/2017 %2http://hdl.handle.net/1721.1/109391

%0 Generic %D 2017 %T Theory of Deep Learning III: explaining the non-overfitting puzzle %A Tomaso Poggio %A Keji Kawaguchi %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Xavier Boix %A Jack Hidary %A Hrushikesh Mhaskar %X**THIS MEMO IS REPLACED BY CBMM MEMO 90**

A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave near zero stable minima of the empirical error as a gradient system in a quadratic potential with degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient.

Our proposition provides the extension to deep networks of key properties of gradient descent methods for linear networks that, as suggested in (1), can be the key to understanding generalization. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, and asymptotically converging to the minimum norm solution. This implies that there is usually an optimum early stopping that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution which guarantees good classification error for “low noise” datasets.

The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.

%8 12/2017 %1 %2http://hdl.handle.net/1721.1/113003

%0 Journal Article %J International Journal of Automation and Computing %D 2017 %T Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review %A Tomaso Poggio %A Hrushikesh Mhaskar %A Lorenzo Rosasco %A Brando Miranda %A Qianli Liao %K convolutional neural networks %K deep and shallow networks %K deep learning %K function approximation %K Machine Learning %K Neural Networks %XThe paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

%B International Journal of Automation and Computing %P 1-17 %8 03/2017 %G eng %U http://link.springer.com/article/10.1007/s11633-017-1054-2?wt_mc=Internal.Event.1.SEM.ArticleAuthorOnlineFirst %R 10.1007/s11633-017-1054-2 %0 Conference Paper %B Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) %D 2016 %T Holographic Embeddings of Knowledge Graphs %A Maximilian Nickel %A Lorenzo Rosasco %A Tomaso Poggio %XLearning embeddings of entities and relations is an efficient and versatile method to perform machine learning on relational data such as knowledge graphs. In this work, we propose holographic embeddings (HolE) to learn compositional vector space representations of entire knowledge graphs. The proposed method is related to holographic models of associative memory in that it employs circular correlation to create compositional representations. By using correlation as the compositional operator HolE can capture rich interactions but simultaneously remains efficient to compute, easy to train, and scalable to very large datasets. In extensive experiments we show that holographic embeddings are able to outperform state-of-the-art methods for link prediction in knowledge graphs and relational learning benchmark datasets.

%B Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) %C Phoenix, Arizona, USA %G eng %0 Journal Article %J Information and Inference: A Journal of the IMA %D 2016 %T On invariance and selectivity in representation learning %A F. Anselmi %A Lorenzo Rosasco %A Tomaso Poggio %XWe study the problem of learning from data representations that are invariant to transformations, and at the same time selective, in the sense that two points have the same representation if one is the transformation of the other. The mathematical results here sharpen some of the key claims of *i-theory*—a recent theory of feedforward processing in sensory cortex (Anselmi *et al.*, 2013, *Theor. Comput. Sci.* and arXiv:1311.4158; Anselmi *et al.*, 2013, Magic materials: a theory of deep hierarchical architectures for learning sensory representations. *CBCL Paper*; Anselmi & Poggio, 2010, Representation learning in sensory cortex: a theory. *CBMM Memo No.* 26).

[formerly titled "*Why and When Can Deep - but Not Shallow - Networks Avoid the Curse of Dimensionality: a Review*"]

The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

%8 11/2016 %1https://arxiv.org/abs/1611.00740v5

%2http://hdl.handle.net/1721.1/105443

%0 Generic %D 2015 %T Deep Convolutional Networks are Hierarchical Kernel Machines %A F. Anselmi %A Lorenzo Rosasco %A Cheston Tan %A Tomaso Poggio %XWe extend i-theory to incorporate not only pooling but also rectifying nonlinearities in an extended HW module (eHW) designed for supervised learning. The two operations roughly correspond to invariance and selectivity, respectively. Under the assumption of normalized inputs, we show that appropriate linear combinations of rectifying nonlinearities are equivalent to radial kernels. If pooling is present, an equivalent kernel also exists. Thus present-day DCNs (Deep Convolutional Networks) can be exactly equivalent to a hierarchy of kernel machines with pooling and non-pooling layers. Finally, we describe a conjecture for theoretically understanding hierarchies of such modules. A main consequence of the conjecture is that hierarchies of eHW modules minimize memory requirements while computing a selective and invariant representation.

%8 06/17/2015 %1 %2http://hdl.handle.net/1721.1/100200

%0 Conference Paper %B INTERSPEECH-2015 %D 2015 %T Discriminative Template Learning in Group-Convolutional Networks for Invariant Speech Representations %A Chiyuan Zhang %A Stephen Voinea %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %B INTERSPEECH-2015 %I International Speech Communication Association (ISCA) %C Dresden, Germany %8 09/2015 %G eng %U http://www.isca-speech.org/archive/interspeech_2015/i15_3229.html %0 Generic %D 2015 %T Holographic Embeddings of Knowledge Graphs %A Maximilian Nickel %A Lorenzo Rosasco %A Tomaso Poggio %K Associative Memory %K Knowledge Graph %K Machine Learning %XLearning embeddings of entities and relations is an efficient and versatile method to perform machine learning on relational data such as knowledge graphs. In this work, we propose holographic embeddings (HolE) to learn compositional vector space representations of entire knowledge graphs. The proposed method is related to holographic models of associative memory in that it employs circular correlation to create compositional representations. By using correlation as the compositional operator, HolE can capture rich interactions but simultaneously remains efficient to compute, easy to train, and scalable to very large datasets. In extensive experiments we show that holographic embeddings are able to outperform state-of-the-art methods for link prediction in knowledge graphs and relational learning benchmark datasets.

%8 11/16/2015 %G English %1 %2http://hdl.handle.net/1721.1/100203

%0 Generic %D 2015 %T On Invariance and Selectivity in Representation Learning %A F. Anselmi %A Lorenzo Rosasco %A Tomaso Poggio %Xhttp://hdl.handle.net/1721.1/100194

%0 Generic %D 2015 %T I-theory on depth vs width: hierarchical function composition %A Tomaso Poggio %A F. Anselmi %A Lorenzo Rosasco %XDeep learning networks with convolution, pooling and subsampling are a special case of hierarchical architectures, which can be represented by trees (such as binary trees). Hierarchical as well as shallow networks can approximate functions of several variables, in particular those that are compositions of low dimensional functions. We show that the power of a deep network architecture with respect to a shallow network is rather independent of the specific nonlinear operations in the network and depends instead on the behavior of the VC-dimension. A shallow network can approximate compositional functions with the same error as a deep network but at the cost of a VC-dimension that is exponential rather than quadratic in the dimensionality of the function. To complete the argument we argue that there exist visual computations that are intrinsically compositional. In particular, we prove that recognition invariant to translation cannot be computed by shallow networks in the presence of clutter. Finally, a general framework that includes the compositional case is sketched. The key condition that allows tall, thin networks to be nicer than short, fat networks is that the target input-output function must be sparse in a certain technical sense.

%8 12/29/2015 %2http://hdl.handle.net/1721.1/100559

%0 Conference Paper %B NIPS 2015 %D 2015 %T Learning with incremental iterative regularization %A Lorenzo Rosasco %A Villa, Silvia %XWithin a statistical learning setting, we propose and study an iterative regularization algorithm for least squares defined by an incremental gradient method. In particular, we show that, if all other parameters are fixed a priori, the number of passes over the data (epochs) acts as a regularization parameter, and prove strong universal consistency, i.e. almost sure convergence of the risk, as well as sharp finite sample bounds for the iterates. Our results are a step towards understanding the effect of multiple epochs in stochastic gradient techniques in machine learning and rely on integrating statistical and optimization results.

%B NIPS 2015 %G eng %U https://papers.nips.cc/paper/6015-learning-with-incremental-iterative-regularization %0 Conference Paper %B NIPS 2015 %D 2015 %T Less is More: Nyström Computational Regularization %A Alessandro Rudi %A Raffaello Camoriano %A Lorenzo Rosasco %XWe study Nyström type subsampling approaches to large scale kernel methods, and prove learning bounds in the statistical learning setting, where random sampling and high probability estimates are considered. In particular, we prove that these approaches can achieve optimal learning bounds, provided the subsampling level is suitably chosen. These results suggest a simple incremental variant of Nyström Kernel Regularized Least Squares, where the subsampling level implements a form of computational regularization, in the sense that it controls at the same time regularization and computations. Extensive experimental analysis shows that the considered approach achieves state of the art performances on benchmark large scale datasets.

%B NIPS 2015 %G eng %U https://papers.nips.cc/paper/5936-less-is-more-nystrom-computational-regularization %0 Generic %D 2015 %T Notes on Hierarchical Splines, DCLNs and i-theory %A Tomaso Poggio %A Lorenzo Rosasco %A Amnon Shashua %A Nadav Cohen %A F. Anselmi %XWe define an extension of classical additive splines for multivariate function approximation that we call hierarchical splines. We show that the case of hierarchical, additive, piece-wise linear splines includes present-day Deep Convolutional Learning Networks (DCLNs) with linear rectifiers and pooling (sum or max). We discuss how these observations together with i-theory may provide a framework for a general theory of deep networks.

http://hdl.handle.net/1721.1/100201

%0 Journal Article %J Theoretical Computer Science %D 2015 %T Unsupervised learning of invariant representations %A F. Anselmi %A JZ. Leibo %A Lorenzo Rosasco %A Jim Mutch %A Andrea Tacchetti %A Tomaso Poggio %K convolutional networks %K Cortex %K Hierarchy %K Invariance %XThe present phase of Machine Learning is characterized by supervised learning algorithms relying on large sets of labeled examples (n→∞). The next phase is likely to focus on algorithms capable of learning from very few labeled examples (n→1), like humans seem able to do. We propose an approach to this problem and describe the underlying theory, based on the unsupervised, automatic learning of a “good” representation for supervised learning, characterized by small sample complexity. We consider the case of visual object recognition, though the theory also applies to other domains like speech. The starting point is the conjecture, proved in specific cases, that image representations which are invariant to translation, scaling and other transformations can considerably reduce the sample complexity of learning. We prove that an invariant and selective signature can be computed for each image or image patch: the invariance can be exact in the case of group transformations and approximate under non-group transformations. A module performing filtering and pooling, like the simple and complex cells described by Hubel and Wiesel, can compute such signature. The theory offers novel unsupervised learning algorithms for “deep” architectures for image and speech recognition. We conjecture that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects/images which is invariant to transformations, stable, and selective for recognition—and show how this representation may be continuously learned in an unsupervised way during development and visual experience.

%B Theoretical Computer Science %8 06/25/2015 %G eng %U http://www.sciencedirect.com/science/article/pii/S0304397515005587 %R 10.1016/j.tcs.2015.06.048 %0 Conference Paper %B ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing %D 2014 %T A Deep Representation for Invariance and Music Classification %A Chiyuan Zhang %A Georgios Evangelopoulos %A Stephen Voinea %A Lorenzo Rosasco %A Tomaso Poggio %K acoustic signal processing %K signal representation %K unsupervised learning %B ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing %I IEEE %C Florence, Italy %8 05/04/2014 %G eng %U http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6854954 %R 10.1109/ICASSP.2014.6854954 %0 Generic %D 2014 %T A Deep Representation for Invariance And Music Classification %A Chiyuan Zhang %A Georgios Evangelopoulos %A Stephen Voinea %A Lorenzo Rosasco %A Tomaso Poggio %K Audio Representation %K Hierarchy %K Invariance %K Machine Learning %K Theories for Intelligence %XRepresentations in the auditory cortex might be based on mechanisms similar to the visual ventral stream; modules for building invariance to transformations and multiple layers for compositionality and selectivity. In this paper we propose the use of such computational modules for extracting invariant and discriminative audio representations. Building on a theory of invariance in hierarchical architectures, we propose a novel, mid-level representation for acoustical signals, using the empirical distributions of projections on a set of templates and their transformations. Under the assumption that, by construction, this dictionary of templates is composed from similar classes, and samples the orbit of variance-inducing signal transformations (such as shift and scale), the resulting signature is theoretically guaranteed to be unique, invariant to transformations and stable to deformations. 
Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations. We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification.

%8 03/2014 %1 %2http://hdl.handle.net/1721.1/100163

%0 Generic %D 2014 %T Learning An Invariant Speech Representation %A Georgios Evangelopoulos %A Stephen Voinea %A Chiyuan Zhang %A Lorenzo Rosasco %A Tomaso Poggio %K Theories for Intelligence %XRecognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluate its validity for voiced speech sound classification. Our version of the theory requires the memory-based, unsupervised storage of acoustic templates — such as specific phones or words — together with all the transformations of each that normally occur. A quasi-invariant representation for a speech segment can be obtained by projecting it to each template orbit, i.e., the set of transformed signals, and computing the associated one-dimensional empirical probability distributions. The computations can be performed by modules of filtering and pooling, and extended to hierarchical architectures. In this paper, we apply a single-layer, multicomponent representation for phonemes and demonstrate improved accuracy and decreased sample complexity for vowel classification compared to standard spectral, cepstral and perceptual features.

%8 06/2014 %1 %2http://hdl.handle.net/1721.1/100186

%0 Conference Paper %B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %D 2014 %T Phone Classification by a Hierarchy of Invariant Representation Layers %A Chiyuan Zhang %A Stephen Voinea %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %K Hierarchy %K Invariance %K Neural Networks %K Speech Representation %XWe propose a multi-layer feature extraction framework for speech, capable of providing invariant representations. A set of templates is generated by sampling the result of applying smooth, identity-preserving transformations (such as vocal tract length and tempo variations) to arbitrarily-selected speech signals. Templates are then stored as the weights of “neurons”. We use a cascade of such computational modules to factor out different types of transformation variability in a hierarchy, and show that it improves phone classification over baseline features. In addition, we describe empirical comparisons of a) different transformations which may be responsible for the variability in speech signals and of b) different ways of assembling template sets for training. The proposed layered system is an effort towards explaining the performance of recent deep learning networks and the principles by which the human auditory cortex might reduce the sample complexity of learning in speech recognition. Our theory and experiments suggest that invariant representations are crucial in learning from complex, real-world data like natural speech. Our model is built on basic computational primitives of cortical neurons, thus making an argument about how representations might be learned in the human auditory cortex.

%B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %I International Speech Communication Association (ISCA) %C Singapore %G eng %U http://www.isca-speech.org/archive/interspeech_2014/i14_2346.html %0 Generic %D 2014 %T Speech Representations based on a Theory for Learning Invariances %A Stephen Voinea %A Chiyuan Zhang %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %XRecognition of sounds and speech from a small number of labelled examples (like humans do), depends on the properties of the representation of the acoustic input. We formulate the problem of extracting robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain, that requires the memory-based, unsupervised storage of acoustic templates -- such as specific phones or words -- together with all the transformations of each that normally occur. A quasi-invariant representation for a speech signal can be obtained by projecting it to a number of template orbits, i.e., each one a set of transformed template signals, and computing the associated one-dimensional empirical probability distributions. The computations are performed by modules of filtering and pooling, that can be used for obtaining a mapping in single- or multilayer architectures. We consider several aspects of such representations including different signal scales (word vs. frame), input domains (raw waveforms vs. frequency filterbank responses), structures (shallow vs. multilayer/hierarchical), and ways of sampling from template orbit sets given a set of observations (explicit vs. learned). Preliminary empirical evaluations for learning to separate speech phones and words are given on TIMIT and subsets of TI-DIGITS.

%C SANE 2014 - Speech and Audio in the Northeast %8 10/2014 %9 poster presentation %0 Generic %D 2014 %T Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning? %A F. Anselmi %A JZ. Leibo %A Lorenzo Rosasco %A Jim Mutch %A Andrea Tacchetti %A Tomaso Poggio %K Computer vision %K Pattern recognition %XThe present phase of Machine Learning is characterized by supervised learning algorithms relying on large sets of labeled examples (n→∞). The next phase is likely to focus on algorithms capable of learning from very few labeled examples (n→1), like humans seem able to do. We propose an approach to this problem and describe the underlying theory, based on the unsupervised, automatic learning of a “good” representation for supervised learning, characterized by small sample complexity (n). We consider the case of visual object recognition though the theory applies to other domains. The starting point is the conjecture, proved in specific cases, that image representations which are invariant to translations, scaling and other transformations can considerably reduce the sample complexity of learning. We prove that an invariant and unique (discriminative) signature can be computed for each image patch, I, in terms of empirical distributions of the dot-products between I and a set of templates stored during unsupervised learning. A module performing filtering and pooling, like the simple and complex cells described by Hubel and Wiesel, can compute such estimates. Hierarchical architectures consisting of this basic Hubel-Wiesel module inherit its properties of invariance, stability, and discriminability while capturing the compositional organization of the visual world in terms of wholes and parts. The theory extends existing deep learning convolutional architectures for image and speech recognition.
It also suggests that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects/images which is invariant to transformations, stable, and discriminative for recognition—and that this representation may be continuously learned in an unsupervised way during development and visual experience.

%8 03/2014 %1 %2http://hdl.handle.net/1721.1/90566

%0 Conference Paper %B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %D 2014 %T Word-level Invariant Representations From Acoustic Waveforms %A Stephen Voinea %A Chiyuan Zhang %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %K Invariance %K Speech Representation %K Theories for Intelligence %XExtracting discriminant, transformation-invariant features from raw audio signals remains a serious challenge for speech recognition. The issue of speaker variability is central to this problem, as changes in accent, dialect, gender, and age alter the sound waveform of speech units at multiple scales (phonemes, words, or phrases). Approaches for dealing with this variability have typically focused on analyzing the spectral properties of speech at the level of frames, on par with frame-level acoustic modeling usually applied to speech recognition systems. In this paper, we propose a framework for representing speech at the whole-word level and extracting features from the acoustic, temporal domain, without the need for spectral encoding or pre-processing. Leveraging recent work on unsupervised learning of invariant sensory representations, we extract a signature for a word by first projecting its raw waveform onto a set of templates and their transformations, and then forming empirical estimates of the resulting one-dimensional distributions via histograms. The representation and relevant parameters are evaluated for word classification on a series of datasets with increasing speaker-mismatch difficulty, and the results are compared to those of an MFCC-based representation.

%B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %I International Speech Communication Association (ISCA) %C Singapore %G eng %U http://www.isca-speech.org/archive/interspeech_2014/i14_2385.html %0 Book Section %B Empirical Inference %D 2013 %T On Learnability, Complexity and Stability %A Villa, Silvia %A Lorenzo Rosasco %A Tomaso Poggio %A Schölkopf, Bernhard %A Luo, Zhiyuan %A Vovk, Vladimir %XEmpirical Inference, Chapter 7

Editors: Bernhard Schölkopf, Zhiyuan Luo and Vladimir Vovk

**Abstract:**

We consider the fundamental question of learnability of a hypothesis class in the supervised learning setting and in the general learning setting introduced by Vladimir Vapnik. We survey classic results characterizing learnability in terms of suitable notions of complexity, as well as more recent results that establish the connection between learnability and stability of a learning algorithm.

%B Empirical Inference %I Springer Berlin Heidelberg %C Berlin, Heidelberg %P 59 - 69 %@ 978-3-642-41135-9 %G eng %U http://link.springer.com/10.1007/978-3-642-41136-6 %& 7 %R 10.1007/978-3-642-41136-6_7 %0 Generic %D 2013 %T Object recognition data sets (iCub/IIT) %A Lorenzo Rosasco %K Computer vision %K object recognition %K robotics %XData set for object recognition and categorization. 10 categories, 40 objects for the training phase. The acquisition size is 640×480 and subsequently cropped to the bounding box of the object according to the kinematics or motion cue. The bounding box is 160×160 in human mode and 320×320 in robot mode. For each object we provide 200 training samples. Each category is trained with 3 objects (600 examples per category).


**Publications**

Fanello, S.R.; Ciliberto, C.; Santoro, M.; Natale, L.; Metta, G.; Rosasco, L.; Odone, F., “*iCub World: Friendly Robots Help Building Good Vision Data-Sets*,” In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR), 2013

Fanello, S.R.; Ciliberto, C.; Natale, L.; Metta, G., “*Weakly Supervised Strategies for Natural Object Recognition in Robotics*,” IEEE International Conference on Robotics and Automation (ICRA). Karlsruhe, Germany, May 6-10, 2013

Fanello, S.R.; Noceti, N.; Metta, G.; Odone, F., “*Multi-Class Image Classification: Sparsity Does It Better*,” International Conference on Computer Vision Theory and Applications (VISAPP), 2013

Ciliberto, C.; Smeraldi, F.; Natale, L.; Metta, G., “*Online Multiple Instance Learning Applied to Hand Detection in a Humanoid Robot*,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS2011). San Francisco, California, USA, September 25-30, 2011

Representations that are invariant to translation, scale and other transformations can considerably reduce the sample complexity of learning, allowing recognition of new object classes from very few examples – a hallmark of human recognition. Empirical estimates of one-dimensional projections of the distribution induced by a group of affine transformations are proven to represent a unique and invariant signature associated with an image. We show how projections yielding invariant signatures for future images can be learned automatically, and updated continuously, during unsupervised visual experience. A module performing filtering and pooling, like simple and complex cells as proposed by Hubel and Wiesel, can compute such estimates. Under this view, a pooling stage estimates a one-dimensional probability distribution. Invariance from observations through a restricted window is equivalent to a sparsity property w.r.t. a transformation, which yields templates that are a) Gabor for optimal simultaneous invariance to translation and scale or b) very specific for complex, class-dependent transformations such as rotation in depth of faces. Hierarchical architectures consisting of this basic Hubel-Wiesel module inherit its properties of invariance, stability, and discriminability while capturing the compositional organization of the visual world in terms of wholes and parts, and are invariant to complex transformations that may only be locally affine. The theory applies to several existing deep learning convolutional architectures for image and speech recognition. It also suggests that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects which is invariant to transformations, stable, and discriminative for recognition – this representation may be learned in an unsupervised way from natural visual experience.

%8 11/2013 %G eng