%0 Conference Paper %B Advances in Neural Information Processing Systems 31 %D 2018 %T Trading robust representations for sample complexity through self-supervised visual experience %A Andrea Tacchetti %A Stephen Voinea %A Georgios Evangelopoulos %E S. Bengio %E H. Wallach %E H. Larochelle %E K. Grauman %E N. Cesa-Bianchi %E R. Garnett %X

Learning in small sample regimes is among the most remarkable features of the human perceptual system. This ability is related to robustness to transformations, which is acquired through visual experience in the form of weak- or self-supervision during development. We explore the idea of allowing artificial systems to learn representations of visual stimuli through weak supervision prior to downstream supervised tasks. We introduce a novel loss function for representation learning using unlabeled image sets and video sequences, and experimentally demonstrate that these representations support one-shot learning and reduce the sample complexity of multiple recognition tasks. We establish the existence of a trade-off between the size of weakly supervised data sets, automatically obtained from video sequences, and that of fully supervised data sets. Our results suggest that equivalence sets other than class labels, which are abundant in unlabeled visual experience, can be used for self-supervised learning of semantically relevant image embeddings.
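
As a concrete illustration of the weak supervision described here (not the authors' code), the following minimal Python sketch forms training triplets from unlabeled video: frames of the same clip are treated as equivalent (one orbit), frames of different clips as negatives. The data layout and sampling strategy are assumptions for illustration only.

import random

def sample_orbit_triplet(video_clips):
    # video_clips: list of clips, each clip a list of frames (e.g., NumPy arrays);
    # assumes at least two clips and at least two frames per clip.
    clip = random.choice(video_clips)
    anchor, positive = random.sample(clip, 2)                  # two frames of the same clip: same orbit
    other = random.choice([c for c in video_clips if c is not clip])
    negative = random.choice(other)                            # a frame from a different clip
    return anchor, positive, negative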

%B Advances in Neural Information Processing Systems 31 %I Curran Associates, Inc. %C Montreal, Canada %P 9640–9650 %8 12/2018 %G eng %U http://papers.nips.cc/paper/8170-trading-robust-representations-for-sample-complexity-through-self-supervised-visual-experience.pdf %0 Generic %D 2017 %T Discriminate-and-Rectify Encoders: Learning from Image Transformation Sets %A Andrea Tacchetti %A Stephen Voinea %A Georgios Evangelopoulos %X

The complexity of a learning task is increased by transformations in the input space that preserve class identity. Visual object recognition for example is affected by changes in viewpoint, scale, illumination or planar transformations. While drastically altering the visual appearance, these changes are orthogonal to recognition and should not be reflected in the representation or feature encoding used for learning. We introduce a framework for weakly supervised learning of image embeddings that are robust to transformations and selective to the class distribution, using sets of transforming examples (orbit sets), deep parametrizations and a novel orbit-based loss. The proposed loss combines a discriminative, contrastive part for orbits with a reconstruction error that learns to rectify orbit transformations. The learned embeddings are evaluated in distance metric-based tasks, such as one-shot classification under geometric transformations, as well as face verification and retrieval under more realistic visual variability. Our results suggest that orbit sets, suitably computed or observed, can be used for efficient, weakly-supervised learning of semantically relevant image embeddings.
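
The loss outlined above combines a discriminative, contrastive part over orbit membership with a reconstruction part that rectifies transformations. A minimal PyTorch sketch of that combination follows, assuming `encoder` and `decoder` are torch.nn modules; the margin-based contrastive form, the margin value, and the weighting `alpha` are assumptions, not the paper's exact objective.

import torch
import torch.nn.functional as F

def orbit_loss(encoder, decoder, x, x_pos, x_neg, x_canonical, margin=1.0, alpha=0.5):
    # x_pos: a transformed example from the same orbit as x; x_neg: an example from a
    # different orbit; x_canonical: a reference (untransformed) view of x's orbit.
    z, z_pos, z_neg = encoder(x), encoder(x_pos), encoder(x_neg)
    d_pos = F.pairwise_distance(z, z_pos)
    d_neg = F.pairwise_distance(z, z_neg)
    # Contrastive part: contract distances within orbits, separate across orbits.
    contrastive = d_pos.pow(2).mean() + F.relu(margin - d_neg).pow(2).mean()
    # Rectification part: decode the embedding of a transformed example back to the canonical view.
    rectify = F.mse_loss(decoder(z_pos), x_canonical)
    return contrastive + alpha * rectify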

%8 03/2017 %1 arXiv:1703.04775v1 %2 http://hdl.handle.net/1721.1/107446

%0 Conference Paper %B AAAI Spring Symposium Series, Science of Intelligence %D 2017 %T Representation Learning from Orbit Sets for One-shot Classification %A Andrea Tacchetti %A Stephen Voinea %A Georgios Evangelopoulos %A Tomaso Poggio %X

The sample complexity of a learning task is increased by transformations that do not change class identity. Visual object recognition, for example, i.e., the discrimination or categorization of distinct semantic classes, is affected by changes in viewpoint, scale, illumination or planar transformations. We introduce a weakly-supervised framework for learning robust and selective representations from sets of transforming examples (orbit sets). We train deep encoders that explicitly account for the equivalence up to transformations of orbit sets and show that the resulting encodings contract the intra-orbit distance and preserve identity, either by preserving reconstruction or by increasing the inter-orbit distance. We explore a loss function that combines a discriminative term and a reconstruction term that uses a decoder-encoder map to learn to rectify transformation-perturbed examples, and demonstrate the validity of the resulting embeddings for one-shot learning. Our results suggest that a suitable definition of orbit sets is a form of weak supervision that can be exploited to learn semantically relevant embeddings.
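
For context, one-shot evaluation with a learned embedding typically reduces to nearest-neighbour matching against a single labelled exemplar per class. A minimal sketch follows, assuming `embed` maps an image to a 1-D NumPy vector; the exact protocol used in the paper may differ.

import numpy as np

def one_shot_predict(embed, support_images, support_labels, query_image):
    # One labelled exemplar per class; classify the query by its nearest exemplar in embedding space.
    support = np.stack([embed(x) for x in support_images])
    q = embed(query_image)
    dists = np.linalg.norm(support - q, axis=1)
    return support_labels[int(np.argmin(dists))]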

%B AAAI Spring Symposium Series, Science of Intelligence %C AAAI %G eng %U https://www.aaai.org/ocs/index.php/SSS/SSS17/paper/view/15357 %0 Conference Paper %B INTERSPEECH-2015 %D 2015 %T Discriminative Template Learning in Group-Convolutional Networks for Invariant Speech Representations %A Chiyuan Zhang %A Stephen Voinea %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %B INTERSPEECH-2015 %I International Speech Communication Association (ISCA) %C Dresden, Germany %8 09/2015 %G eng %U http://www.isca-speech.org/archive/interspeech_2015/i15_3229.html %0 Conference Paper %B NIPS 2015 %D 2015 %T Learning with Group Invariant Features: A Kernel Perspective %A Youssef Mroueh %A Stephen Voinea %A Tomaso Poggio %X
In this paper, we analyze a random feature map based on a theory of invariance (I-theory) introduced in Anselmi et al. (2013). More specifically, a group-invariant signal signature is obtained through cumulative distributions of group-transformed random projections. Our analysis bridges invariant feature learning with kernel methods, as we show that this feature map defines an expected Haar-integration kernel that is invariant to the specified group action. We show how this non-linear random feature map approximates the group-invariant kernel uniformly on a set of N points. Moreover, we show that it defines a function space that is dense in the equivalent invariant Reproducing Kernel Hilbert Space. Finally, we quantify the error rates of the convergence of empirical risk minimization, as well as the reduction in the sample complexity of a learning algorithm using such an invariant representation for signal classification, in a classical supervised learning setting.
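
A minimal sketch of the kind of feature map described above: project the signal onto group-transformed copies of random templates and pool the projections with an empirical cumulative distribution. Cyclic shifts stand in for the group action; the template count, threshold grid, and dimensions are illustrative assumptions, not the paper's settings.

import numpy as np

def invariant_feature_map(x, templates, thresholds):
    # For each random template, take inner products of x with every group-transformed
    # copy of the template, then pool them into an empirical CDF over fixed thresholds.
    feats = []
    for t in templates:
        orbit = np.stack([np.roll(t, s) for s in range(len(t))])   # cyclic-shift orbit of the template
        projections = orbit @ x
        cdf = (projections[:, None] <= thresholds[None, :]).mean(axis=0)
        feats.append(cdf)
    return np.concatenate(feats)

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
templates = rng.standard_normal((16, 64))
phi = invariant_feature_map(x, templates, np.linspace(-10.0, 10.0, 20))
# phi is identical if x is replaced by np.roll(x, k) for any k, since the shift only permutes the projections.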
%B NIPS 2015 %G eng %U https://papers.nips.cc/paper/5798-learning-with-group-invariant-features-a-kernel-perspective %0 Generic %D 2014 %T A Deep Representation for Invariance And Music Classification %A Chiyuan Zhang %A Georgios Evangelopoulos %A Stephen Voinea %A Lorenzo Rosasco %A Tomaso Poggio %K Audio Representation %K Hierarchy %K Invariance %K Machine Learning %K Theories for Intelligence %X

Representations in the auditory cortex might be based on mechanisms similar to the visual ventral stream: modules for building invariance to transformations and multiple layers for compositionality and selectivity. In this paper, we propose the use of such computational modules for extracting invariant and discriminative audio representations. Building on a theory of invariance in hierarchical architectures, we propose a novel, mid-level representation for acoustical signals, using the empirical distributions of projections on a set of templates and their transformations. Under the assumption that, by construction, this dictionary of templates is composed from similar classes and samples the orbit of variance-inducing signal transformations (such as shift and scale), the resulting signature is theoretically guaranteed to be unique, invariant to transformations and stable to deformations. Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations. We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification.

%8 03/2014 %1 arXiv:1404.0400v1 %2 http://hdl.handle.net/1721.1/100163

%0 Conference Paper %B ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing %D 2014 %T A Deep Representation for Invariance and Music Classification %A Chiyuan Zhang %A Georgios Evangelopoulos %A Stephen Voinea %A Lorenzo Rosasco %A Tomaso Poggio %K acoustic signal processing %K signal representation %K unsupervised learning %B ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing %I IEEE %C Florence, Italy %8 05/04/2014 %G eng %U http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6854954 %R 10.1109/ICASSP.2014.6854954 %0 Generic %D 2014 %T Learning An Invariant Speech Representation %A Georgios Evangelopoulos %A Stephen Voinea %A Chiyuan Zhang %A Lorenzo Rosasco %A Tomaso Poggio %K Theories for Intelligence %X

Recognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluate its validity for voiced speech sound classification. Our version of the theory requires the memory-based, unsupervised storage of acoustic templates — such as specific phones or words — together with all the transformations of each that normally occur. A quasi-invariant representation for a speech segment can be obtained by projecting it to each template orbit, i.e., the set of transformed signals, and computing the associated one-dimensional empirical probability distributions. The computations can be performed by modules of filtering and pooling, and extended to hierarchical architectures. In this paper, we apply a single-layer, multicomponent representation for phonemes and demonstrate improved accuracy and decreased sample complexity for vowel classification compared to standard spectral, cepstral and perceptual features.

%8 06/2014 %1 arXiv:1406.3884 %2 http://hdl.handle.net/1721.1/100186

%0 Conference Paper %B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %D 2014 %T Phone Classification by a Hierarchy of Invariant Representation Layers %A Chiyuan Zhang %A Stephen Voinea %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %K Hierarchy %K Invariance %K Neural Networks %K Speech Representation %X

We propose a multi-layer feature extraction framework for speech, capable of providing invariant representations. A set of templates is generated by sampling the result of applying smooth, identity-preserving transformations (such as vocal tract length and tempo variations) to arbitrarily-selected speech signals. Templates are then stored as the weights of “neurons”. We use a cascade of such computational modules to factor out different types of transformation variability in a hierarchy, and show that it improves phone classification over baseline features. In addition, we describe empirical comparisons of a) different transformations which may be responsible for the variability in speech signals and of b) different ways of assembling template sets for training. The proposed layered system is an effort towards explaining the performance of recent deep learning networks and the principles by which the human auditory cortex might reduce the sample complexity of learning in speech recognition. Our theory and experiments suggest that invariant representations are crucial in learning from complex, real-world data like natural speech. Our model is built on basic computational primitives of cortical neurons, thus making an argument about how representations might be learned in the human auditory cortex.
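
To make the cascade concrete, here is a minimal sketch (not the authors' implementation) of stacking two filtering-and-pooling modules, each pooling the projections onto a template orbit into an empirical histogram. The transformation families (shifts, rescalings), template counts, and histogram range are illustrative assumptions.

import numpy as np

def pool_layer(x, template_orbits, n_bins=16):
    # template_orbits: list of arrays, each of shape (n_transformations, dim(x)).
    # For every template, project x onto all transformed copies and keep only the
    # empirical distribution (histogram) of the projections: filtering, then pooling.
    feats = []
    for orbit in template_orbits:
        proj = orbit @ x
        hist, _ = np.histogram(proj, bins=n_bins, range=(-3.0, 3.0))
        feats.append(hist / len(proj))
    return np.concatenate(feats)

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
x = rng.standard_normal(128)

# Layer 1: pool over circular shifts of the templates (a stand-in for tempo-like variability).
orbits1 = [np.stack([np.roll(unit(t), s) for s in range(0, 128, 8)])
           for t in rng.standard_normal((8, 128))]
h1 = pool_layer(x, orbits1)

# Layer 2: pool over rescalings of new templates, applied to the layer-1 signature,
# illustrating how modules can be cascaded to factor out different kinds of variability.
orbits2 = [np.stack([a * unit(t) for a in np.linspace(0.5, 2.0, 8)])
           for t in rng.standard_normal((8, h1.size))]
h2 = pool_layer(h1, orbits2)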

%B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %I International Speech Communication Association (ISCA) %C Singapore %G eng %U http://www.isca-speech.org/archive/interspeech_2014/i14_2346.html %0 Generic %D 2014 %T Speech Representations based on a Theory for Learning Invariances %A Stephen Voinea %A Chiyuan Zhang %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %X

Recognition of sounds and speech from a small number of labelled examples (as humans do) depends on the properties of the representation of the acoustic input. We formulate the problem of extracting robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain, which requires the memory-based, unsupervised storage of acoustic templates -- such as specific phones or words -- together with all the transformations of each that normally occur. A quasi-invariant representation for a speech signal can be obtained by projecting it to a number of template orbits, i.e., each one a set of transformed template signals, and computing the associated one-dimensional empirical probability distributions. The computations are performed by modules of filtering and pooling, which can be used for obtaining a mapping in single- or multilayer architectures. We consider several aspects of such representations, including different signal scales (word vs. frame), input domains (raw waveforms vs. frequency filterbank responses), structures (shallow vs. multilayer/hierarchical), and ways of sampling from template orbit sets given a set of observations (explicit vs. learned). Preliminary empirical evaluations for learning to separate speech phones and words are given on TIMIT and subsets of TI-DIGITS.

%C SANE 2014 - Speech and Audio in the Northeast %8 10/2014 %9 poster presentation %0 Conference Paper %B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %D 2014 %T Word-level Invariant Representations From Acoustic Waveforms %A Stephen Voinea %A Chiyuan Zhang %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %K Invariance %K Speech Representation %K Theories for Intelligence %X

Extracting discriminative, transformation-invariant features from raw audio signals remains a serious challenge for speech recognition. The issue of speaker variability is central to this problem, as changes in accent, dialect, gender, and age alter the sound waveform of speech units at multiple scales (phonemes, words, or phrases). Approaches for dealing with this variability have typically focused on analyzing the spectral properties of speech at the level of frames, in line with the frame-level acoustic modeling usually applied in speech recognition systems. In this paper, we propose a framework for representing speech at the whole-word level and extracting features from the acoustic, temporal domain, without the need for spectral encoding or pre-processing. Leveraging recent work on unsupervised learning of invariant sensory representations, we extract a signature for a word by first projecting its raw waveform onto a set of templates and their transformations, and then forming empirical estimates of the resulting one-dimensional distributions via histograms. The representation and relevant parameters are evaluated for word classification on a series of datasets with increasing speaker-mismatch difficulty, and the results are compared to those of an MFCC-based representation.

%B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %I International Speech Communication Association (ISCA) %C Singapore %G eng %U http://www.isca-speech.org/archive/interspeech_2014/i14_2385.html