%0 Generic
%D 2017
%T Musings on Deep Learning: Properties of SGD
%A Chiyuan Zhang
%A Qianli Liao
%A Alexander Rakhlin
%A Karthik Sridharan
%A Brando Miranda
%A Noah Golowich
%A Tomaso Poggio
%X <p>[<em>formerly titled "Theory of Deep Learning III: Generalization Properties of SGD"</em>]</p>    <p>In Theory III we characterize with a mix of theory and experiments the generalization properties of Stochastic Gradient Descent in overparametrized deep convolutional networks. We show that Stochastic Gradient Descent (SGD) selects with high probability solutions that 1) have zero (or small) empirical error, 2) are degenerate as shown in Theory II and 3) have maximum generalization.</p>
%8 04/2017
%2 <p><a href="http://hdl.handle.net/1721.1/107841">http://hdl.handle.net/1721.1/107841</a></p>

%0 Generic
%D 2017
%T Theory of Deep Learning IIb: Optimization Properties of SGD
%A Chiyuan Zhang
%A Qianli Liao
%A Alexander Rakhlin
%A Brando Miranda
%A Noah Golowich
%A Tomaso Poggio
%X <p>In Theory IIb we characterize with a mix of theory and experiments the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: <em>SGD concentrates in probability - like the classical Langevin equation – on large volume, “flat” minima, selecting flat minimizers which are with very high probability also global minimizers</em>.</p>
%8 12/2017
%2 <p><a href="http://hdl.handle.net/1721.1/115407">http://hdl.handle.net/1721.1/115407</a></p>

%0 Conference Paper
%B INTERSPEECH-2015
%D 2015
%T Discriminative Template Learning in Group-Convolutional Networks for Invariant Speech Representations
%A Chiyuan Zhang
%A Stephen Voinea
%A Georgios Evangelopoulos
%A Lorenzo Rosasco
%A Tomaso Poggio
%B INTERSPEECH-2015
%I International Speech Communication Association (ISCA)
%C Dresden, Germany
%8 09/2015
%G eng
%U http://www.isca-speech.org/archive/interspeech_2015/i15_3229.html

%0 Conference Paper
%B Advances in Neural Information Processing Systems (NIPS 2015) 28
%D 2015
%T Learning with a Wasserstein Loss
%A Charlie Frogner
%A Chiyuan Zhang
%A Hossein Mobahi
%A Mauricio Araya-Polo
%A Tomaso Poggio
%X <p>Learning to predict multi-label outputs is challenging, but in many problems there is a natural metric on the outputs that can be used to improve predictions. In this paper we develop a loss function for multi-label learning, based on the Wasserstein distance. The Wasserstein distance provides a natural notion of dissimilarity for probability measures. Although optimizing with respect to the exact Wasserstein distance is costly, recent work has described a regularized approximation that is efficiently computed. We describe an efficient learning algorithm based on this regularization, as well as a novel extension of the Wasserstein distance from prob- ability measures to unnormalized measures. We also describe a statistical learning bound for the loss. The Wasserstein loss can encourage smoothness of the predic- tions with respect to a chosen metric on the output space. We demonstrate this property on a real-data tag prediction problem, using the Yahoo Flickr Creative Commons dataset, outperforming a baseline that doesn’t use the metric.</p>
%B Advances in Neural Information Processing Systems (NIPS 2015) 28
%G eng
%U http://arxiv.org/abs/1506.05439

%0 Generic
%D 2014
%T A Deep Representation for Invariance And Music Classification
%A Chiyuan Zhang
%A Georgios Evangelopoulos
%A Stephen Voinea
%A Lorenzo Rosasco
%A Tomaso Poggio
%K Audio Representation
%K Hierarchy
%K Invariance
%K Machine Learning
%K Theories for Intelligence
%X <p>Representations in the auditory cortex might be based on mechanisms similar to the visual ventral stream; modules for building invariance to transformations and multiple layers for compositionality and selectivity. In this paper we propose the use of such computational modules for extracting invariant and discriminative audio representations. Building on a theory of invariance in hierarchical architectures, we propose a novel, mid-level representation for acoustical signals, using the empirical distributions of projections on a set of templates and their transformations. Under the assumption that, by construction, this dictionary of templates is composed from similar classes, and samples the orbit of variance-inducing signal transformations (such as shift and scale), the resulting signature is theoretically guaranteed to be unique, invariant to transformations and stable to deformations. Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations. We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification.</p>
%8 03/2014
%1 <p><a href="http://arXiv:1404.0400v1">arXiv:1404.0400v1</a></p>
%2 <p>http://hdl.handle.net/1721.1/100163</p>

%0 Conference Paper
%B ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing
%D 2014
%T A Deep Representation for Invariance and Music Classification
%A Chiyuan Zhang
%A Georgios Evangelopoulos
%A Stephen Voinea
%A Lorenzo Rosasco
%A Tomaso Poggio
%K acoustic signal processing
%K signal representation
%K unsupervised learning
%B ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing
%I IEEE
%C Florence, Italy
%8 05/04/2014
%G eng
%U http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6854954
%R 10.1109/ICASSP.2014.6854954

%0 Generic
%D 2014
%T Learning An Invariant Speech Representation
%A Georgios Evangelopoulos
%A Stephen Voinea
%A Chiyuan Zhang
%A Lorenzo Rosasco
%A Tomaso Poggio
%K Theories for Intelligence
%X <p>Recognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluate its validity for voiced speech sound classification. Our version of the theory requires the memory-based, unsupervised storage of acoustic templates — such as specific phones or words — together with all the transformations of each that normally occur. A quasi-invariant representation for a speech segment can be obtained by projecting it to each template orbit, i.e., the set of transformed signals, and computing the associated one-dimensional empirical probability distributions. The computations can be performed by modules of filtering and pooling, and extended to hierarchical architectures. In this paper, we apply a single-layer, multicomponent representation for phonemes and demonstrate improved accuracy and decreased sample complexity for vowel classification compared to standard spectral, cepstral and perceptual features.</p>
%8 06/2014
%1 <p><a href="http://arxiv.org/abs/1406.3884">arXiv:1406.3884</a></p>
%2 <p>http://hdl.handle.net/1721.1/100186</p>

%0 Conference Paper
%B EAGE Conference and Exhibition 2014
%D 2014
%T Machine Learning Based Automated Fault Detection in Seismic Traces
%A Chiyuan Zhang
%A Charlie Frogner
%A Mauricio Araya-Polo
%A Detlef Hohl
%X <p>Introduction:</p>    <p>The Initial stages of velocity model building (VMB) start off from smooth models that capture geological assumptions of the subsurface region under analysis. Acceptable velocity models result from successive iterations of human intervention (interpreter) and seismic data processing with in complex workflows. The interpreters ensure that any additions or corrections made by seismic processing are compliant with geological and geophysical knowledge. The information that seismic processing adds to the model consists of structural elements, faults are one of the most relevant of those events since they can signal reservoir boundaries or hydrocarbon traps. Faults are excluded in the initial models due to their local scale. Bringing faults into the model in early stages can help to steer the VMB process.</p>    <p>This work introduced a tool whose purpose is to assist the interpreters during the initial stages of the VMB, when no seismic data has been migrated. Our novel method is based on machine learning techniques and can automatically identify and localize faults from not migrated seismic data. Comprehensive research has targeted the fault localization problem, but most of the results are obtained using processed seismic data or images as input (Admasu and Toennies (2004); Tingdahl et al. (2001); Cohen et al. (2006); Hale (2013), etc). Our approach suggests an additional tool that can be used to speed up the<br />  VMB process.</p>    <p>Fully automated VMB has not been achieved because the human factor is difficult to formalize in a way that can be systematically applied. Nonetheless, if our framework is extended to other seismic events or attributes, it might become a powerful tool to alleviate interpreters’ work.</p>
%B EAGE Conference and Exhibition 2014
%C The Netherlands
%8 06/2014
%G eng
%U http://cbcl.mit.edu/publications/eage14.pdf

%0 Conference Paper
%B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association
%D 2014
%T Phone Classification by a Hierarchy of Invariant Representation Layers
%A Chiyuan Zhang
%A Stephen Voinea
%A Georgios Evangelopoulos
%A Lorenzo Rosasco
%A Tomaso Poggio
%K Hierarchy
%K Invariance
%K Neural Networks
%K Speech Representation
%X <p>We propose a multi-layer feature extraction framework for speech, capable of providing invariant representations. A set of templates is generated by sampling the result of applying smooth, identity-preserving transformations (such as vocal tract length and tempo variations) to arbitrarily-selected speech signals. Templates are then stored as the weights of “neurons”. We use a cascade of such computational modules to factor out different types of transformation variability in a hierarchy, and show that it improves phone classification over baseline features. In addition, we describe empirical comparisons of a) different transformations which may be responsible for the variability in speech signals and of b) different ways of assembling template sets for training. The proposed layered system is an effort towards explaining the performance of recent deep learning networks and the principles by which the human auditory cortex might reduce the sample complexity of learning in speech recognition. Our theory and experiments suggest that invariant representations are crucial in learning from complex, real-world data like natural speech. Our model is built on basic computational primitives of cortical neurons, thus making an argument about how representations might be learned in the human auditory cortex.</p>
%B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association
%I International Speech Communication Association (ISCA)
%C Singapore
%G eng
%U http://www.isca-speech.org/archive/interspeech_2014/i14_2346.html

%0 Generic
%D 2014
%T Speech Representations based on a Theory for Learning Invariances
%A Stephen Voinea
%A Chiyuan Zhang
%A Georgios Evangelopoulos
%A Lorenzo Rosasco
%A Tomaso Poggio
%X <p>Recognition of sounds and speech from a small number of labelled examples (like humans do), depends on the properties of the representation of the acoustic input. We formulate the problem of extracting robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain, that requires the memory-based, unsupervised storage of acoustic templates -- such as specific phones or words -- together with all the transformations of each that normally occur. A quasi-invariant representation for a speech signal can be obtained by projecting it to a number of template orbits, i.e., each one a set of transformed template signals, and computing the associated one-dimensional empirical probability distributions. The computations are perfomed by modules of filtering and pooling, that can be used for obtaining a mapping in single- or multilayer architectures. We consider several aspects of such representations including different signal scales (word vs. frame), input domains (raw waveforms vs. frequency filterbank responses), structures (shallow vs.&nbsp;multilayer/hierarchical), and ways of sampling from template orbit sets given a set of observations (explicit vs. learned). Preliminary empirical evaluations for learning to separate speech phones and words are given on TIMIT and subsets of TI-DIGITS.&nbsp;</p>
%C SANE 2014 - Speech and Audio in the Northeast
%8 10/2014
%9 poster presentation

%0 Conference Paper
%B INTERSPEECH 2014  - 15th Annual Conf. of the International Speech Communication Association
%D 2014
%T Word-level Invariant Representations From Acoustic Waveforms
%A Stephen Voinea
%A Chiyuan Zhang
%A Georgios Evangelopoulos
%A Lorenzo Rosasco
%A Tomaso Poggio
%K Invariance
%K Speech Representation
%K Theories for Intelligence
%X <p>Extracting discriminant, transformation-invariant features from raw audio signals remains a serious challenge for speech recognition. The issue of speaker variability is central to this problem, as changes in accent, dialect, gender, and age alter the sound waveform of speech units at multiple scales (phonemes, words, or phrases). Approaches for dealing with this variability have typically focused on analyzing the spectral properties of speech at the level of frames, on par with frame-level acoustic modeling usually applied to speech recognition systems. In this paper, we propose a framework for representing speech at the whole-word level and extracting features from the acoustic, temporal domain, without the need for spectral encoding or pre-processing. Leveraging recent work on unsupervised learning of invariant sensory representations, we extract a signature for a word by first projecting its raw waveform onto a set of templates and their transformations, and then forming empirical estimates of the resulting one-dimensional distributions via histograms. The representation and relevant parameters are evaluated for word classification on a series of datasets with increasing speaker-mismatch difficulty, and the results are compared to those of an MFCC-based representation.</p>
%B INTERSPEECH 2014  - 15th Annual Conf. of the International Speech Communication Association
%I International Speech Communication Association (ISCA)
%C Singapore
%G eng
%U http://www.isca-speech.org/archive/interspeech_2014/i14_2385.html