%0 Journal Article %J IEEE Access %D 2022 %T Representation Learning in Sensory Cortex: a theory %A Anselmi, Fabio %A Tomaso Poggio %K Artificial neural networks %K Hubel Wiesel model %K Invariance %K Sample Complexity %K Simple and Complex cells %K visual cortex %X

We review and apply a computational theory based on the hypothesis that the main function of the feedforward path of the ventral stream in visual cortex is the encoding of invariant representations of images. A key justification of the theory is provided by a result linking invariant representations to small sample complexity for image recognition - that is, invariant representations allow learning from very few labeled examples. The theory characterizes how an algorithm that can be implemented by a set of "simple" and "complex" cells - a "Hubel-Wiesel module" - provides invariant and selective representations. The invariance can be learned in an unsupervised way from observed transformations. Our results show that an invariant representation implies several properties of the ventral stream organization, including the emergence of Gabor receptive fields and specialized areas. The theory requires two stages of processing: the first, consisting of retinotopic visual areas such as V1, V2 and V4 with generic neuronal tuning, leads to representations that are invariant to translation and scaling; the second, consisting of modules in IT (inferior temporal cortex) with class- and object-specific tuning, provides a representation for recognition with approximate invariance to class-specific transformations, such as pose (of a body, of a face) and expression. In summary, our theory is that the ventral stream's main function is to implement the unsupervised learning of "good" representations that reduce the sample complexity of the final supervised learning stage.

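To make the module concrete, here is a minimal NumPy sketch of the simple/complex-cell computation the abstract describes - our illustration, not code from the paper; max or mean pooling stands in for the more general pooling nonlinearities of the theory:

```python
import numpy as np

def hw_module(x, templates, transforms, pool="max"):
    """One "Hubel-Wiesel module": simple cells compute dot products of the
    input with stored transformed templates; one complex cell per template
    pools those responses over the transformation orbit."""
    signature = []
    for t in templates:
        # Simple cells: one response per observed transformation of the template.
        responses = np.array([np.dot(x, g(t)) for g in transforms])
        # Complex cell: pooling over the orbit gives the invariant response.
        signature.append(responses.max() if pool == "max" else responses.mean())
    return np.array(signature)

# Invariance check for 1-D circular shifts (a group, so invariance is exact).
rng = np.random.default_rng(0)
shifts = [lambda v, s=s: np.roll(v, s) for s in range(16)]
templates = [rng.standard_normal(16) for _ in range(8)]
x = rng.standard_normal(16)
print(np.allclose(hw_module(x, templates, shifts),
                  hw_module(np.roll(x, 5), templates, shifts)))  # True
```

Shifting the input only permutes the simple-cell responses within each orbit, so any permutation-invariant pooling (max, mean, histogram) leaves the signature unchanged.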
%B IEEE Access %P 1 - 1 %8 09/2022 %G eng %U https://ieeexplore.ieee.org/document/9899392/ %! IEEE Access %R 10.1109/ACCESS.2022.3208603 %0 Journal Article %J Annual Review of Vision Science %D 2018 %T Invariant Recognition Shapes Neural Representations of Visual Input %A Andrea Tacchetti %A Leyla Isik %A Tomaso Poggio %K computational neuroscience %K Invariance %K neural decoding %K visual representations %X

Recognizing the people, objects, and actions in the world around us is a crucial aspect of human perception that allows us to plan and act in our environment. Remarkably, our proficiency in recognizing semantic categories from visual input is unhindered by transformations that substantially alter their appearance (e.g., changes in lighting or position). The ability to generalize across these complex transformations is a hallmark of human visual intelligence, which has been the focus of wide-ranging investigation in systems and computational neuroscience. However, while the neural machinery of human visual perception has been thoroughly described, the computational principles dictating its functioning remain unknown. Here, we review recent results in brain imaging, neurophysiology, and computational neuroscience in support of the hypothesis that the ability to support the invariant recognition of semantic entities in the visual world shapes which neural representations of sensory input are computed by human visual cortex.

%B Annual Review of Vision Science %V 4 %P 403 - 422 %8 10/2018 %G eng %U https://www.annualreviews.org/doi/10.1146/annurev-vision-091517-034103 %N 1 %! Annu. Rev. Vis. Sci. %R 10.1146/annurev-vision-091517-034103 %0 Report %D 2017 %T Fisher-Rao Metric, Geometry, and Complexity of Neural Networks %A Liang, Tengyuan %A Tomaso Poggio %A Alexander Rakhlin %A Stokes, James %K capacity control %K deep learning %K Fisher-Rao metric %K generalization error %K information geometry %K Invariance %K natural gradient %K ReLU activation %K statistical learning theory %X

We study the relationship between geometry and capacity measures for deep neural networks from an invariance viewpoint. We introduce a new notion of capacity - the Fisher-Rao norm - that possesses desirable invariance properties and is motivated by information geometry. We discover an analytical characterization of the new capacity measure, through which we establish norm-comparison inequalities and further show that the new measure serves as an umbrella for several existing norm-based complexity measures. We discuss upper bounds on the generalization error induced by the proposed measure. Extensive numerical experiments on CIFAR-10 support our theoretical findings. Our theoretical analysis rests on a key structural lemma about partial derivatives of multi-layer rectifier networks.

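In our notation (which may differ from the paper's conventions), the central objects the abstract names can be written as follows; the last identity is Euler's theorem for positively homogeneous functions, a structural fact about partial derivatives of rectifier networks of the kind the abstract refers to:

```latex
% Fisher information matrix of the model p_\theta, and the Fisher-Rao norm:
I(\theta) = \mathbb{E}_{x,y}\!\left[\nabla_\theta \log p_\theta(y \mid x)\,
            \nabla_\theta \log p_\theta(y \mid x)^{\top}\right],
\qquad
\|\theta\|_{\mathrm{fr}}^{2} = \theta^{\top} I(\theta)\, \theta.

% A bias-free network of L ReLU layers is positively homogeneous of
% degree L in its weights, so Euler's theorem gives
\langle \theta, \nabla_\theta f_\theta(x) \rangle = L\, f_\theta(x).
```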
%B arXiv.org %8 11/2017 %G eng %U https://arxiv.org/abs/1711.01530 %0 Journal Article %J Theoretical Computer Science %D 2015 %T Unsupervised learning of invariant representations %A F. Anselmi %A JZ. Leibo %A Lorenzo Rosasco %A Jim Mutch %A Andrea Tacchetti %A Tomaso Poggio %K convolutional networks %K Cortex %K Hierarchy %K Invariance %X

The present phase of Machine Learning is characterized by supervised learning algorithms relying on large sets of labeled examples (n→∞). The next phase is likely to focus on algorithms capable of learning from very few labeled examples (n→1), as humans seem able to do. We propose an approach to this problem and describe the underlying theory, based on the unsupervised, automatic learning of a "good" representation for supervised learning, characterized by small sample complexity. We consider the case of visual object recognition, though the theory also applies to other domains, such as speech. The starting point is the conjecture, proved in specific cases, that image representations which are invariant to translation, scaling and other transformations can considerably reduce the sample complexity of learning. We prove that an invariant and selective signature can be computed for each image or image patch: the invariance can be exact in the case of group transformations and approximate under non-group transformations. A module performing filtering and pooling, like the simple and complex cells described by Hubel and Wiesel, can compute such a signature. The theory offers novel unsupervised learning algorithms for "deep" architectures for image and speech recognition. We conjecture that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects/images which is invariant to transformations, stable, and selective for recognition - and show how this representation may be continuously learned in an unsupervised way during development and visual experience.

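The exactness of the invariance under group transformations admits a one-line argument, sketched here in our notation (assuming a finite group G acting unitarily on images):

```latex
% Signature component for template t^k and pooling nonlinearity \eta_n:
\mu_n^k(I) = \frac{1}{|G|} \sum_{g \in G} \eta_n\!\left(\langle I, g\, t^k \rangle\right).

% Exact invariance: transforming the image by g' \in G merely permutes the
% summands, since \langle g' I, g\, t^k \rangle = \langle I, g'^{-1} g\, t^k \rangle
% and g \mapsto g'^{-1} g is a bijection of G onto itself. Hence
\mu_n^k(g' I) = \mu_n^k(I) \quad \text{for all } g' \in G.
```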
%B Theoretical Computer Science %8 06/25/2015 %G eng %U http://www.sciencedirect.com/science/article/pii/S0304397515005587 %R 10.1016/j.tcs.2015.06.048 %0 Generic %D 2014 %T Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines? %A Qianli Liao %A JZ. Leibo %A Youssef Mroueh %A Tomaso Poggio %K Computer vision %K Face recognition %K Hierarchy %K Invariance %X

The standard approach to unconstrained face recognition in natural photographs is via a detection, alignment, recognition pipeline. While that approach has achieved impressive results, there are several reasons to be dissatisfied with it, among them its lack of biological plausibility. A recent theory of invariant recognition by feedforward hierarchical networks, like HMAX, other convolutional networks, or possibly the ventral stream, implies an alternative approach to unconstrained face recognition. This approach accomplishes detection and alignment implicitly by storing transformations of training images (called templates) rather than explicitly detecting and aligning faces at test time. Here we propose a particular locality-sensitive-hashing-based voting scheme, which we call "consensus of collisions", and show that it can be used to approximate the full 3-layer hierarchy implied by the theory. The resulting end-to-end system for unconstrained face recognition operates on photographs of faces taken under natural conditions, e.g., Labeled Faces in the Wild (LFW), without aligning or cropping them, as is normally done. It achieves a drastic improvement in the state of the art on this end-to-end task, reaching the same level of performance as the best systems operating on aligned, closely cropped images (no outside training data). It also performs well on two newer datasets, similar to LFW but more difficult: LFW-jittered (new here) and SUFR-W.

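The abstract only names the voting scheme; the following single-table toy (hypothetical names and parameters, not the authors' implementation, which would use multiple hash tables) illustrates how locality-sensitive hashing can turn stored transformed templates into identity votes:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_keys(vecs, planes):
    """Random-hyperplane LSH: map each row vector to a tuple of sign bits."""
    bits = (vecs @ planes.T) > 0
    return [tuple(row) for row in bits.astype(int)]

# "Gallery": one feature vector per stored transformed template, with a label.
d, n_bits = 64, 16
planes = rng.standard_normal((n_bits, d))
gallery = rng.standard_normal((200, d))
labels = rng.integers(0, 10, size=200)

table = {}
for key, lab in zip(lsh_keys(gallery, planes), labels):
    table.setdefault(key, []).append(lab)

def consensus_of_collisions(query):
    """Vote for the identity whose stored templates collide most with the query."""
    votes = table.get(lsh_keys(query[None, :], planes)[0], [])
    return max(set(votes), key=votes.count) if votes else None

# A slightly perturbed gallery vector usually hashes to the same bucket.
print(consensus_of_collisions(gallery[3] + 0.01 * rng.standard_normal(d)))
```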
%8 03/2014 %1

arXiv:1311.4082v3

%2

http://hdl.handle.net/1721.1/100164

%0 Generic %D 2014 %T Computational role of eccentricity dependent cortical magnification. %A Tomaso Poggio %A Jim Mutch %A Leyla Isik %K Invariance %K Theories for Intelligence %X

We develop a sampling extension of M-theory focused on invariance to scale and translation. Quite surprisingly, the theory predicts an architecture of early vision with increasing receptive field sizes and a high-resolution fovea - in agreement with data about the cortical magnification factor, V1 and the retina. From the slope of the inverse of the magnification factor, M-theory predicts a cortical "fovea" in V1 on the order of 40 by 40 basic units at each receptive field size - corresponding to a foveola of size around 26 minutes of arc at the highest resolution and ≈6 degrees at the lowest resolution. It also predicts uniform scale invariance over a fixed range of scales independently of eccentricity, while translation invariance should depend linearly on spatial frequency. Bouma's law of crowding follows in the theory as an effect of area-by-area cortical pooling; the Bouma constant is the value expected if the signature responsible for recognition in the crowding experiments originates in V2. From a broader perspective, the emerging picture suggests that visual recognition under natural conditions takes place by composing information from a set of fixations, with each fixation providing recognition from a space-scale image fragment - that is, an image patch represented at a set of increasing sizes and decreasing resolutions.

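A small illustrative script (the constants below are placeholders, not the paper's fitted values) shows the "inverted truncated pyramid" sampling pattern the theory predicts: a fixed number of units per scale, covering a region that grows with receptive-field size.

```python
import numpy as np

def rf_size(eccentricity_deg, s_min=0.04, slope=0.1):
    """Receptive-field size vs. eccentricity: constant inside a small fovea,
    then growing linearly (illustrative constants, not the paper's fits)."""
    return np.maximum(s_min, slope * np.asarray(eccentricity_deg, dtype=float))

# At every scale the same number of units (the predicted ~40-unit cortical
# "fovea") tiles a region whose extent grows with the RF size at that scale.
for s in (0.04, 0.08, 0.16, 0.32):
    n_units = 40
    extent = n_units * s
    print(f"RF size {s:.2f} deg -> ~{n_units} units cover ±{extent / 2:.2f} deg")
```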
%8 06/2014 %1

arXiv:1406.1770v1

%2

http://hdl.handle.net/1721.1/100181

%0 Generic %D 2014 %T A Deep Representation for Invariance And Music Classification %A Chiyuan Zhang %A Georgios Evangelopoulos %A Stephen Voinea %A Lorenzo Rosasco %A Tomaso Poggio %K Audio Representation %K Hierarchy %K Invariance %K Machine Learning %K Theories for Intelligence %X

Representations in the auditory cortex might be based on mechanisms similar to those of the visual ventral stream: modules for building invariance to transformations and multiple layers for compositionality and selectivity. In this paper we propose the use of such computational modules for extracting invariant and discriminative audio representations. Building on a theory of invariance in hierarchical architectures, we propose a novel, mid-level representation for acoustical signals, using the empirical distributions of projections on a set of templates and their transformations. Under the assumption that, by construction, this dictionary of templates is composed from similar classes and samples the orbit of variance-inducing signal transformations (such as shift and scale), the resulting signature is theoretically guaranteed to be unique, invariant to transformations and stable to deformations. Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations. We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification.

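A minimal sketch of the proposed mid-level signature, assuming unit-norm templates and norm-preserving transformations (our illustration, not the authors' code):

```python
import numpy as np

def invariant_signature(x, templates, transforms, n_bins=20):
    """Mid-level audio signature: for each template, histogram the signal's
    projections onto the template's transformed versions - an empirical
    estimate of the one-dimensional distributions the theory pools over."""
    x = x / (np.linalg.norm(x) + 1e-12)
    feats = []
    for t in templates:
        proj = np.array([np.dot(x, g(t)) for g in transforms])
        # With unit norms and norm-preserving transforms (shifts are),
        # projections lie in [-1, 1], so the bin range is fixed.
        hist, _ = np.histogram(proj, bins=n_bins, range=(-1.0, 1.0))
        feats.append(hist / max(len(transforms), 1))
    return np.concatenate(feats)

# Usage on a toy "audio frame": shifted templates give a shift-invariant code.
rng = np.random.default_rng(0)
frame = rng.standard_normal(256)
templates = [t / np.linalg.norm(t) for t in rng.standard_normal((10, 256))]
shifts = [lambda v, s=s: np.roll(v, s) for s in range(0, 256, 8)]
print(invariant_signature(frame, templates, shifts).shape)  # (200,)
```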
%8 03/2014 %1

arXiv:1404.0400v1

%2

http://hdl.handle.net/1721.1/100163

%0 Conference Paper %B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %D 2014 %T Phone Classification by a Hierarchy of Invariant Representation Layers %A Chiyuan Zhang %A Stephen Voinea %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %K Hierarchy %K Invariance %K Neural Networks %K Speech Representation %X

We propose a multi-layer feature extraction framework for speech, capable of providing invariant representations. A set of templates is generated by sampling the result of applying smooth, identity-preserving transformations (such as vocal tract length and tempo variations) to arbitrarily selected speech signals. Templates are then stored as the weights of "neurons". We use a cascade of such computational modules to factor out different types of transformation variability in a hierarchy, and show that it improves phone classification over baseline features. In addition, we describe empirical comparisons of a) different transformations which may be responsible for the variability in speech signals and of b) different ways of assembling template sets for training. The proposed layered system is an effort towards explaining the performance of recent deep learning networks and the principles by which the human auditory cortex might reduce the sample complexity of learning in speech recognition. Our theory and experiments suggest that invariant representations are crucial in learning from complex, real-world data like natural speech. Our model is built on basic computational primitives of cortical neurons, thus making an argument about how representations might be learned in the human auditory cortex.

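The layered factorization can be sketched as follows (max pooling is used here only for brevity; this line of work pools via empirical distributions of projections, and every name below is our illustration, not the paper's code):

```python
import numpy as np

def invariance_layer(x, templates, transforms):
    """Pool each template's responses over one family of transformations
    (e.g., warps modeling vocal-tract length, or time dilations for tempo)."""
    return np.array([max(np.dot(x, g(t)) for g in transforms)
                     for t in templates])

def cascade(x, layers):
    """Factor out one transformation family per layer; each layer's templates
    live in the output space of the layer below."""
    for templates, transforms in layers:
        x = invariance_layer(x, templates, transforms)
    return x

# Toy usage: shifts at layer 1, identity (purely selective) at layer 2.
rng = np.random.default_rng(0)
layer1 = ([rng.standard_normal(64) for _ in range(32)],
          [lambda v, s=s: np.roll(v, s) for s in range(8)])
layer2 = ([rng.standard_normal(32) for _ in range(16)], [lambda v: v])
print(cascade(rng.standard_normal(64), [layer1, layer2]).shape)  # (16,)
```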
%B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %I International Speech Communication Association (ISCA) %C Singapore %G eng %U http://www.isca-speech.org/archive/interspeech_2014/i14_2346.html %0 Generic %D 2014 %T Subtasks of Unconstrained Face Recognition %A JZ. Leibo %A Qianli Liao %A Tomaso Poggio %K Face identification %K Invariance %K Labeled Faces in the Wild %K Same-different matching %K Synthetic data %X

Unconstrained face recognition remains a challenging computer vision problem despite recent exceptionally high results (∼ 95% accuracy) on the current gold standard evaluation dataset: Labeled Faces in the Wild (LFW) (Huang et al., 2008; Chen et al., 2013). We offer a decomposition of the unconstrained problem into subtasks based on the idea that invariance to identity-preserving transformations is the crux of recognition. Each of the subtasks in the Subtasks of Unconstrained Face Recognition (SUFR) challenge consists of a same-different face-matching problem on a set of 400 individual synthetic faces rendered so as to isolate a specific transformation or set of transformations. We characterized the performance of 9 different models (8 previously published) on each of the subtasks. One notable finding was that the HMAX-C2 feature was not nearly as clutter-resistant as previous publications had suggested (Leibo et al., 2010; Pinto et al., 2011). Next we considered LFW and argued that it is too easy a task to continue to be regarded as a measure of progress on unconstrained face recognition. In particular, strong performance on LFW requires almost no invariance, yet LFW cannot be considered a fair approximation of the outcome of a detection→alignment pipeline, since it does not contain the kinds of variability that realistic alignment systems produce when working on non-frontal faces. We offer a new, more difficult, natural-image dataset: SUFR-in-the-Wild (SUFR-W), which we created using a protocol similar to LFW's but with a few differences designed to produce more need for transformation invariance. We present baseline results for eight different face recognition systems on the new dataset and argue that it is time to retire LFW and move on to more difficult evaluations for unconstrained face recognition.

%I 9th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications. (VISAPP). %C Lisbon, Portugal %8 01/2014 %0 Conference Paper %B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %D 2014 %T Word-level Invariant Representations From Acoustic Waveforms %A Stephen Voinea %A Chiyuan Zhang %A Georgios Evangelopoulos %A Lorenzo Rosasco %A Tomaso Poggio %K Invariance %K Speech Representation %K Theories for Intelligence %X

Extracting discriminative, transformation-invariant features from raw audio signals remains a serious challenge for speech recognition. The issue of speaker variability is central to this problem, as changes in accent, dialect, gender, and age alter the sound waveform of speech units at multiple scales (phonemes, words, or phrases). Approaches for dealing with this variability have typically focused on analyzing the spectral properties of speech at the level of frames, in line with the frame-level acoustic modeling usually applied in speech recognition systems. In this paper, we propose a framework for representing speech at the whole-word level and extracting features from the acoustic, temporal domain, without the need for spectral encoding or pre-processing. Leveraging recent work on unsupervised learning of invariant sensory representations, we extract a signature for a word by first projecting its raw waveform onto a set of templates and their transformations, and then forming empirical estimates of the resulting one-dimensional distributions via histograms. The representation and relevant parameters are evaluated for word classification on a series of datasets with increasing speaker-mismatch difficulty, and the results are compared to those of an MFCC-based representation.

%B INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association %I International Speech Communication Association (ISCA) %C Singapore %G eng %U http://www.isca-speech.org/archive/interspeech_2014/i14_2385.html %0 Conference Proceedings %D 2013 %T Unsupervised Learning of Invariant Representations in Hierarchical Architectures. %A F. Anselmi %A JZ. Leibo %A Lorenzo Rosasco %A Jim Mutch %A Andrea Tacchetti %A Tomaso Poggio %K convolutional networks %K Hierarchy %K Invariance %K visual cortex %X

Representations that are invariant to translation, scale and other transformations can considerably reduce the sample complexity of learning, allowing recognition of new object classes from very few examples - a hallmark of human recognition. Empirical estimates of one-dimensional projections of the distribution induced by a group of affine transformations are proven to represent a unique and invariant signature associated with an image. We show how projections yielding invariant signatures for future images can be learned automatically, and updated continuously, during unsupervised visual experience. A module performing filtering and pooling, like the simple and complex cells proposed by Hubel and Wiesel, can compute such estimates. Under this view, a pooling stage estimates a one-dimensional probability distribution. Invariance from observations through a restricted window is equivalent to a sparsity property w.r.t. a transformation, which yields templates that are (a) Gabor functions, for optimal simultaneous invariance to translation and scale, or (b) very specific for complex, class-dependent transformations such as rotation in depth of faces. Hierarchical architectures consisting of this basic Hubel-Wiesel module inherit its properties of invariance, stability, and discriminability while capturing the compositional organization of the visual world in terms of wholes and parts, and are invariant to complex transformations that may only be locally affine. The theory applies to several existing deep learning convolutional architectures for image and speech recognition. It also suggests that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects which is invariant to transformations, stable, and discriminative for recognition - this representation may be learned in an unsupervised way from natural visual experience.

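For reference, the template form the theory singles out under the translation-scale sparsity argument is the Gabor function (our notation):

```latex
% Gabor function: Gaussian envelope modulating a complex exponential,
% localized jointly in space and frequency:
\psi_{\sigma,\omega}(x) = e^{-x^{2}/(2\sigma^{2})}\, e^{\, i \omega x}.
```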
%8 11/2013 %G eng