A convolutional neural network strongly robust to adversarial perturbations at reasonable computational and performance cost has not yet been demonstrated. The primate visual ventral stream seems to be robust to small perturbations in visual stimuli but the underlying mechanisms that give rise to this robust perception are not understood. In this work, we investigate the role of two biologically plausible mechanisms in adversarial robustness. We demonstrate that the non-uniform sampling performed by the primate retina and the presence of multiple receptive fields with a range of receptive field sizes at each eccentricity improve the robustness of neural networks to small adversarial perturbations. We verify that these two mechanisms do not suffer from gradient obfuscation and study their contribution to adversarial robustness through ablation studies.
We study the average CVloo stability of kernel ridge-less regression and derive corresponding risk bounds. We show that the interpolating solution with minimum norm has the best CVloo stability, which in turn is controlled by the condition number of the empirical kernel matrix. The latter can be characterized in the asymptotic regime where both the dimension and cardinality of the data go to infinity. Under the assumption of random kernel matrices, the corresponding test error follows a double descent curve.
}, author = {Akshay Rangamani and Lorenzo Rosasco and Tomaso Poggio} } @article {4570, title = {Hierarchically Local Tasks and Deep Convolutional Networks}, year = {2020}, month = {06/2020}, abstract = {The main success stories of deep learning, starting with ImageNet, depend on convolutional networks, which on certain tasks perform significantly better than traditional shallow classifiers, such as support vector machines. Is there something special about deep convolutional networks that other learning machines do not possess? Recent results in approximation theory have shown that there is an exponential advantage of deep convolutional-like networks in approximating functions with hierarchical locality in their compositional structure. These mathematical results, however, do not say which tasks are expected to have input-output functions with hierarchical locality. Among all the possible hierarchically local tasks in vision, text and speech we explore a few of them experimentally by studying how they are affected by disrupting locality in the input images. We also discuss a taxonomy of tasks ranging from local, to hierarchically local, to global and make predictions about the type of networks required to perform efficiently on these different types of tasks.
}, keywords = {Compositionality, Inductive Bias, perception, Theory of Deep Learning}, author = {Arturo Deza and Qianli Liao and Andrzej Banburski and Tomaso Poggio} } @article {4240, title = {An analysis of training and generalization errors in shallow and deep networks}, year = {2019}, month = {05/2019}, abstract = {This paper is motivated by an open problem around deep networks, namely, the apparent absence of overfitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we analyze this phenomenon in the case of regression problems when each unit evaluates a periodic activation function. We argue that the minimal expected value of the square loss is inappropriate to measure the generalization error in approximation of compositional functions in order to take full advantage of the compositional structure. Instead, we measure the generalization error in the sense of maximum loss, and sometimes, as a pointwise error. We give estimates on exactly how many parameters ensure both zero training error as well as a good generalization error. We prove that a solution of a regularization problem is guaranteed to yield a good training error as well as a good generalization error and estimate how much error to expect at which test data.
}, keywords = {deep learning, generalization error, interpolatory approximation}, author = {Mhaskar, H. N. and T. Poggio} } @conference {4460, title = {Biologically-plausible learning algorithms can scale to large datasets.}, booktitle = {International Conference on Learning Representations, (ICLR 2019)}, year = {2019}, abstract = {The backpropagation (BP) algorithm is often thought to be biologically implausible in the brain. One of the main reasons is that BP requires symmetric weight matrices in the feedforward and feedback pathways. To address this {\textquotedblleft}weight transport problem{\textquotedblright} (Grossberg, 1987), two biologically-plausible algorithms, proposed by Liao et al. (2016) and Lillicrap et al. (2016), relax BP{\textquoteright}s weight symmetry requirements and demonstrate comparable learning capabilities to that of BP on small datasets. However, a recent study by Bartunov et al. (2018) finds that although feedback alignment (FA) and some variants of target-propagation (TP) perform well on MNIST and CIFAR, they perform significantly worse than BP on ImageNet. Here, we additionally evaluate the sign-symmetry (SS) algorithm (Liao et al., 2016), which differs from both BP and FA in that the feedback and feedforward weights do not share magnitudes but share signs. We examined the performance of sign-symmetry and feedback alignment on ImageNet and MS COCO datasets using different network architectures (ResNet-18 and AlexNet for ImageNet; RetinaNet for MS COCO). Surprisingly, networks trained with sign-symmetry can attain classification performance approaching that of BP-trained networks. These results complement the study by Bartunov et al. (2018) and establish a new benchmark for future biologically-plausible learning algorithms on more difficult datasets and more complex architectures.
}, author = {Xiao, Will and Chen, Honglin and Qianli Liao and Tomaso Poggio} } @conference {4242, title = {Deep Recurrent Architectures for Seismic Tomography}, booktitle = {81st EAGE Conference and Exhibition 2019}, year = {2019}, month = {06/2019}, abstract = {This paper introduces novel deep recurrent neural network architectures for Velocity Model Building (VMB), which is beyond what Araya-Polo et al 2018 pioneered with the Machine Learning-based seismic tomography built with convolutional non-recurrent neural network. Our investigation includes the utilization of basic recurrent neural network (RNN) cells, as well as Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells. Performance evaluation reveals that salt bodies are consistently predicted more accurately by GRU and LSTM-based architectures, as compared to non-recurrent architectures. The results take us a step closer to the final goal of a reliable fully Machine Learning-based tomography from pre-stack data, which when achieved will reduce the VMB turnaround from weeks to days.
}, author = {Amir Adler and Mauricio Araya-Polo and Tomaso Poggio} } @article {4375, title = {Double descent in the condition number}, year = {2019}, month = {12/2019}, abstract = {In solving a system of n linear equations in d variables Ax=b, the condition number of the (n,d) matrix A measures how much errors in the data b affect the solution x. Bounds of this type are important in many inverse problems. An example is machine learning where the key task is to estimate an underlying function from a set of measurements at random points in a high dimensional space and where low sensitivity to error in the data is a requirement for good predictive performance. Here we report the simple observation that when the columns of A are random vectors, the condition number of A is highest, that is worst, when d=n, that is when the inverse of A exists. An overdetermined system (n > d) and especially an underdetermined system (n < d), for which the pseudoinverse must be used instead of the inverse, typically have significantly better, that is lower, condition numbers. Thus the condition number of A plotted as a function of d shows a double descent behavior with a peak at d=n.
}, author = {Tomaso Poggio and Gil Kur and Andrzej Banburski} } @conference {4516, title = {Dynamics \& Generalization in Deep Networks -Minimizing the Norm}, booktitle = {NAS Sackler Colloquium on Science of Deep Learning}, year = {2019}, month = {03/2019}, address = {Washington D.C.}, author = {Andrzej Banburski and Qianli Liao and Brando Miranda and Lorenzo Rosasco and Jack Hidary and Tomaso Poggio} } @conference {4537, title = {Properties of invariant object recognition in human one-shot learning suggests a hierarchical architecture different from deep convolutional neural networks}, booktitle = {Vision Science Society}, year = {2019}, month = {05/2019}, address = {Florida, USA}, author = {Yena Han and Gemma Roig and Geiger, Gad and Tomaso Poggio} } @article {4281, title = {Theoretical Issues in Deep Networks}, year = {2019}, month = {08/2019}, abstract = {While deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about their approximation power, the dynamics of optimization by gradient descent and good out-of-sample performance --- why the expected error does not suffer, despite the absence of explicit regularization, when the networks are overparametrized. We review our recent results towards this goal. In {\it approximation theory} both shallow and deep networks are known to approximate any continuous functions on a bounded domain at a cost which is exponential (the number of parameters is exponential in the dimensionality of the function). However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can have a linear dependence on dimensionality, unlike shallow networks.
In characterizing {\it minimization} of the empirical exponential loss we consider the gradient descent dynamics of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to the normalized network. The dynamics of the normalized weights implied by standard gradient descent turns out to be equivalent to the dynamics of the constrained problem of minimizing an exponential-type loss subject to a unit $L_2$ norm constraint. In particular, the dynamics of the typical, unconstrained gradient descent converges to the same critical points of the constrained problem. Thus, there is {\it implicit regularization} in training deep networks under exponential-type loss functions with gradient descent. The critical points of the flow are hyperbolic minima (for any long but finite time) and minimum norm minimizers (e.g. maxima of the margin). Though appropriately normalized networks can show a small generalization gap (difference between empirical and expected loss) even for finite $N$ (number of training examples) wrt the exponential loss, they do not generalize in terms of the classification error. Bounds on it for finite $N$ remain an open problem. Nevertheless, our results, together with other recent papers, characterize an implicit vanishing regularization by gradient descent which is likely to be a key prerequisite -- in terms of complexity control -- for the good performance of deep overparametrized ReLU classifiers.
}, author = {Tomaso Poggio and Andrzej Banburski and Qianli Liao} } @article {4515, title = {Theories of Deep Learning: Approximation, Optimization and Generalization}, year = {2019}, month = {09/2019}, author = {Qianli Liao and Andrzej Banburski and Tomaso Poggio} } @conference {4517, title = {Weight and Batch Normalization implement Classical Generalization Bounds}, booktitle = {ICML}, year = {2019}, month = {06/2019}, address = {Long Beach/California}, author = {Andrzej Banburski and Qianli Liao and Brando Miranda and Lorenzo Rosasco and Jack Hidary and Tomaso Poggio} } @article {3315, title = {An analysis of training and generalization errors in shallow and deep networks}, year = {2018}, month = {02/2018}, abstract = {An open problem around deep networks is the apparent absence of over-fitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we explain this phenomenon when each unit evaluates a trigonometric polynomial. It is well understood in the theory of function approximation that approximation by trigonometric polynomials is a {\textquotedblleft}role model{\textquotedblright} for many other processes of approximation that have inspired many theoretical constructions also in the context of approximation by neural and RBF networks. In this paper, we argue that the maximum loss functional is necessary to measure the generalization error. We give estimates on exactly how many parameters ensure both zero training error as well as a good generalization error, and how much error to expect at which test data. An interesting feature of our new method is that the variance in the training data is no longer an insurmountable lower bound on the generalization error.
}, keywords = {deep learning, generalization error, interpolatory approximation}, author = {Hrushikesh Mhaskar and Tomaso Poggio} } @article {4162, title = {Biologically-plausible learning algorithms can scale to large datasets}, year = {2018}, month = {11/2018}, abstract = {The backpropagation (BP) algorithm is often thought to be biologically implausible in the brain. One of the main reasons is that BP requires symmetric weight matrices in the feedforward and feedback pathways. To address this "weight transport problem" (Grossberg, 1987), two more biologically plausible algorithms, proposed by Liao et al. (2016) and Lillicrap et al. (2016), relax BP{\textquoteright}s weight symmetry requirements and demonstrate comparable learning capabilities to that of BP on small datasets. However, a recent study by Bartunov et al. (2018) evaluates variants of target-propagation (TP) and feedback alignment (FA) on MNIST, CIFAR, and ImageNet datasets, and finds that although many of the proposed algorithms perform well on MNIST and CIFAR, they perform significantly worse than BP on ImageNet. Here, we additionally evaluate the sign-symmetry algorithm (Liao et al., 2016), which differs from both BP and FA in that the feedback and feedforward weights share signs but not magnitudes. We examine the performance of sign-symmetry and feedback alignment on ImageNet and MS COCO datasets using different network architectures (ResNet-18 and AlexNet for ImageNet, RetinaNet for MS COCO). Surprisingly, networks trained with sign-symmetry can attain classification performance approaching that of BP-trained networks. These results complement the study by Bartunov et al. (2018), and establish a new benchmark for future biologically plausible learning algorithms on more difficult datasets and more complex architectures.
}, author = {Will Xiao and Honglin Chen and Qianli Liao and Tomaso Poggio} } @article {3962, title = {Can Deep Neural Networks Do Image Segmentation by Understanding Insideness?}, year = {2018}, month = {12/2018}, abstract = {THIS MEMO IS REPLACED BY CBMM MEMO 105
A key component of visual cognition is the understanding of spatial relationships among objects. Albeit effortless to our visual system, state-of-the-art artificial neural networks struggle to distinguish basic spatial relationships among elements in an image. As shown here, deep neural networks (DNNs) trained with hundreds of thousands of labeled examples cannot accurately distinguish whether pixels lie inside or outside 2D shapes, a problem that seems much simpler than image segmentation. In this paper, we sought to analyze the capability of artificial neural networks (ANNs) to solve such inside/outside problems using an analytical approach. We demonstrate that it is a mathematically tractable problem and that two previously proposed algorithms, namely the Ray-Intersection Method and the Coloring Method, achieve perfect accuracy when implemented in the form of DNNs.
}, author = {Kimberly M. Villalobos and Jamel Dozier and Vilim Stih and Andrew Francl and Frederico Azevedo and Tomaso Poggio and Tomotake Sasaki and Xavier Boix} } @article {3703, title = {Classical generalization bounds are surprisingly tight for Deep Networks}, year = {2018}, month = {07/2018}, abstract = {Deep networks are usually trained and tested in a regime in which the training classification error is not a good predictor of the test error. Thus the consensus has been that generalization, defined as convergence of the empirical to the expected error, does not hold for deep networks. Here we show that, when normalized appropriately after training, deep networks trained on exponential type losses show a good linear dependence of test loss on training loss. The observation, motivated by a previous theoretical analysis of overparametrization and overfitting, not only demonstrates the validity of classical generalization bounds for deep learning but suggests that they are tight. In addition, we also show that the bound of the classification error by the normalized cross entropy loss is empirically rather tight on the data sets we studied.
}, author = {Qianli Liao and Brando Miranda and Jack Hidary and Tomaso Poggio} } @article {3452, title = {A fast, invariant representation for human action in the visual system}, journal = {Journal of Neurophysiology}, year = {2018}, abstract = {Humans can effortlessly recognize others{\textquoteright} actions in the presence of complex transformations, such as changes in viewpoint. Several studies have located the regions in the brain involved in invariant action recognition; however, the underlying neural computations remain poorly understood. We use magnetoencephalography decoding and a data set of well-controlled, naturalistic videos of five actions (run, walk, jump, eat, drink) performed by different actors at different viewpoints to study the computational steps used to recognize actions across complex transformations. In particular, we ask when the brain discriminates between different actions, and when it does so in a manner that is invariant to changes in 3D viewpoint. We measure the latency difference between invariant and noninvariant action decoding when subjects view full videos as well as form-depleted and motion-depleted stimuli. We were unable to detect a difference in decoding latency or temporal profile between invariant and noninvariant action recognition in full videos. However, when either form or motion information is removed from the stimulus set, we observe a decrease and delay in invariant action decoding. Our results suggest that the brain recognizes actions and builds invariance to complex transformations at the same time and that both form and motion information are crucial for fast, invariant action recognition.
Associated Dataset: MEG action recognition data
}, doi = {https://doi.org/10.1152/jn.00642.2017}, url = {https://www.physiology.org/doi/10.1152/jn.00642.2017}, author = {Leyla Isik and Andrea Tacchetti and Tomaso Poggio} } @article {3871, title = {Invariant Recognition Shapes Neural Representations of Visual Input}, journal = {Annual Review of Vision Science}, volume = {4}, year = {2018}, month = {10/2018}, pages = {403 - 422}, abstract = {Recognizing the people, objects, and actions in the world around us is a crucial aspect of human perception that allows us to plan and act in our environment. Remarkably, our proficiency in recognizing semantic categories from visual input is unhindered by transformations that substantially alter their appearance (e.g., changes in lighting or position). The ability to generalize across these complex transformations is a hallmark of human visual intelligence, which has been the focus of wide-ranging investigation in systems and computational neuroscience. However, while the neural machinery of human visual perception has been thoroughly described, the computational principles dictating its functioning remain unknown. Here, we review recent results in brain imaging, neurophysiology, and computational neuroscience in support of the hypothesis that the ability to support the invariant recognition of semantic entities in the visual world shapes which neural representations of sensory input are computed by human visual cortex.
}, keywords = {computational neuroscience, Invariance, neural decoding, visual representations}, issn = {2374-4642}, doi = {10.1146/annurev-vision-091517-034103}, url = {https://www.annualreviews.org/doi/10.1146/annurev-vision-091517-034103}, author = {Andrea Tacchetti and Leyla Isik and Tomaso Poggio} } @article {3881, title = {Single units in a deep neural network functionally correspond with neurons in the brain: preliminary results}, year = {2018}, month = {11/2018}, abstract = {Deep neural networks have been shown to predict neural responses in higher visual cortex. The mapping from the model to a neuron in the brain occurs through a linear combination of many units in the model, leaving open the question of whether there also exists a correspondence at the level of individual neurons. Here we show that there exist many one-to-one mappings between single units in a deep neural network model and neurons in the brain. We show that this correspondence at the single-unit level is ubiquitous among state-of-the-art deep neural networks, and grows more pronounced for models with higher performance on a large-scale visual recognition task. Comparing matched populations{\textemdash}in the brain and in a model{\textemdash}we demonstrate a further correspondence at the level of the population code: stimulus category can be partially decoded from real neural responses using a classifier trained purely on a matched population of artificial units in a model. This provides a new point of investigation for phenomena which require fine-grained mappings between deep neural networks and the brain.
We review recent work characterizing the classes of functions for which deep learning can be exponentially better than shallow learning. Deep convolutional networks are a special case of these conditions, though weight sharing is not the main reason for their exponential advantage.
}, keywords = {convolutional neural networks, deep and shallow networks, deep learning, function approximation}, author = {Tomaso Poggio and Qianli Liao} } @article {4186, title = {Theory II: Deep learning and optimization}, journal = {Bulletin of the Polish Academy of Sciences: Technical Sciences}, volume = {66}, year = {2018}, abstract = {The landscape of the empirical risk of overparametrized deep convolutional neural networks (DCNNs) is characterized with a mix of theory and experiments. In part A we show the existence of a large number of global minimizers with zero empirical error (modulo inconsistent equations). The argument, which relies on the use of Bezout theorem, is rigorous when the RELUs are replaced by a polynomial nonlinearity. We show with simulations that the corresponding polynomial network is indistinguishable from the RELU network. According to Bezout theorem, the global minimizers are degenerate unlike the local minima which in general should be non-degenerate. Further we experimentally analyzed and visualized the landscape of empirical risk of DCNNs on the CIFAR-10 dataset. Based on the above theoretical and experimental observations, we propose a simple model of the landscape of empirical risk. In part B, we characterize the optimization properties of stochastic gradient descent applied to deep networks. The main claim here consists of theoretical and experimental evidence for the following property of SGD: SGD concentrates in probability {\textendash} like the classical Langevin equation {\textendash} on large volume, {\textquotedblleft}flat{\textquotedblright} minima, selecting with high probability degenerate minimizers which are typically global minimizers.
}, doi = {10.24425/bpas.2018.125925}, author = {Tomaso Poggio and Qianli Liao} } @article {3694, title = {Theory III: Dynamics and Generalization in Deep Networks}, year = {2018}, month = {06/2018}, abstract = {The key to generalization is controlling the complexity of
the network. However, there is no obvious control of complexity -- such as an explicit regularization term -- in the training of deep networks for classification. We will show that a classical form of norm control -- but kind of hidden -- is present in deep networks trained with gradient descent techniques on exponential-type losses. In particular, gradient descent induces a dynamics of the normalized weights which converge for $t \to \infty$ to an equilibrium which corresponds to a minimum norm (or maximum margin) solution. For sufficiently large but finite $\rho$ -- and thus finite $t$ -- the dynamics converges to one of several margin maximizers, with the margin monotonically increasing towards a limit stationary point of the flow. In the usual case of stochastic gradient descent, most of the stationary points are likely to be convex minima corresponding to a regularized, constrained minimizer -- the network with normalized weights -- which is stable and has zero generalization gap asymptotically for $N \to \infty$, where $N$ is the number of training examples. For finite, fixed $N$ the generalization gap may not be zero but the minimum norm property of the solution can provide, we conjecture, good expected performance for suitable data distributions. Our approach extends some of the results of Srebro from linear networks to deep networks and provides a new perspective on the implicit bias of gradient descent. We believe that the elusive complexity control we describe is responsible for the puzzling empirical finding of good predictive performance by deep networks, despite overparametrization.
Image instance retrieval is the problem of retrieving images from a database which contain the same object. Convolutional Neural Network (CNN) based descriptors are becoming the dominant approach for generating {\it global image descriptors} for the instance retrieval problem. One major drawback of CNN-based {\it global descriptors} is that uncompressed deep neural network models require hundreds of megabytes of storage making them inconvenient to deploy in mobile applications or in custom hardware. In this work, we study the problem of neural network model compression focusing on the image instance retrieval task. We study quantization, coding, pruning and weight sharing techniques for reducing model size for the instance retrieval problem. We provide extensive experimental results on the trade-off between retrieval performance and model size for different types of networks on several data sets providing the most comprehensive study on this topic. We compress models to the order of a few MBs: two orders of magnitude smaller than the uncompressed models while achieving negligible loss in retrieval performance.
}, url = {https://arxiv.org/abs/1701.04923}, author = {Vijay Chandrasekhar and Jie Lin and Qianli Liao and Olivier Mor{\`e}re and Antoine Veillard and Lingyu Duan and Tomaso Poggio} } @article {2914, title = {Do Deep Neural Networks Suffer from Crowding?}, year = {2017}, month = {06/2017}, abstract = {Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. In this work, we study the effect of crowding in artificial Deep Neural Networks for object recognition. We analyze both standard deep convolutional neural networks (DCNNs) as well as a new version of DCNNs which 1) is multi-scale and 2) has convolution filters whose size changes depending on the eccentricity with respect to the center of fixation. Such networks, which we call eccentricity-dependent, are a computational model of the feedforward path of the primate visual cortex. Our results reveal that the eccentricity-dependent model, trained on target objects in isolation, can recognize such targets in the presence of flankers, if the targets are near the center of the image, whereas DCNNs cannot. Also, for all tested networks, when trained on targets in isolation, we find that recognition accuracy of the networks decreases the closer the flankers are to the target and the more flankers there are. We find that visual similarity between the target and flankers also plays a role and that pooling in early layers of the network leads to more crowding. Additionally, we show that incorporating the flankers into the images of the training set does not improve performance with crowding.
Associated code for this paper.
}, author = {Anna Volokitin and Gemma Roig and Tomaso Poggio} } @article {2485, title = {Eccentricity Dependent Deep Neural Networks for Modeling Human Vision}, year = {2017}, author = {Gemma Roig and Francis Chen and X Boix and Tomaso Poggio} } @conference {2487, title = {Eccentricity Dependent Deep Neural Networks: Modeling Invariance in Human Vision}, booktitle = {AAAI Spring Symposium Series, Science of Intelligence}, year = {2017}, abstract = {Humans can recognize objects in a way that is invariant to scale, translation, and clutter. We use invariance theory as a conceptual basis, to computationally model this phenomenon. This theory discusses the role of eccentricity in human visual processing, and is a generalization of feedforward convolutional neural networks (CNNs). Our model explains some key psychophysical observations relating to invariant perception, while maintaining important similarities with biological neural architectures. To our knowledge, this work is the first to unify explanations of all three types of invariance, all while leveraging the power and neurological grounding of CNNs.
}, url = {https://www.aaai.org/ocs/index.php/SSS/SSS17/paper/view/15360}, author = {Francis Chen and Gemma Roig and Leyla Isik and X Boix and Tomaso Poggio} } @article {3274, title = {A fast, invariant representation for human action in the visual system.}, journal = {J Neurophysiol}, year = {2017}, month = {11/2017}, pages = {jn.00642.2017}, abstract = {Humans can effortlessly recognize others{\textquoteright} actions in the presence of complex transformations, such as changes in viewpoint. Several studies have located the regions in the brain involved in invariant action recognition, however, the underlying neural computations remain poorly understood. We use magnetoencephalography (MEG) decoding and a dataset of well-controlled, naturalistic videos of five actions (run, walk, jump, eat, drink) performed by different actors at different viewpoints to study the computational steps used to recognize actions across complex transformations. In particular, we ask when the brain discriminates between different actions, and when it does so in a manner that is invariant to changes in 3D viewpoint. We measure the latency difference between invariant and non-invariant action decoding when subjects view full videos as well as form-depleted and motion-depleted stimuli. We were unable to detect a difference in decoding latency or temporal profile between invariant and non-invariant action recognition in full videos. However, when either form or motion information is removed from the stimulus set, we observe a decrease and delay in invariant action decoding. Our results suggest that the brain recognizes actions and builds invariance to complex transformations at the same time, and that both form and motion information are crucial for fast, invariant action recognition.
}, keywords = {action recognition, magnetoencephalography, neural decoding, vision}, issn = {1522-1598}, doi = {10.1152/jn.00642.2017}, author = {Leyla Isik and Andrea Tacchetti and Tomaso Poggio} } @article {3155, title = {Fisher-Rao Metric, Geometry, and Complexity of Neural Networks}, year = {2017}, month = {11/2017}, abstract = {We study the relationship between geometry and capacity measures for deep neural networks from an invariance viewpoint. We introduce a new notion of capacity {\textemdash} the Fisher-Rao norm {\textemdash} that possesses desirable invariance properties and is motivated by Information Geometry. We discover an analytical characterization of the new capacity measure, through which we establish norm-comparison inequalities and further show that the new measure serves as an umbrella for several existing norm-based complexity measures. We discuss upper bounds on the generalization error induced by the proposed measure. Extensive numerical experiments on CIFAR-10 support our theoretical findings. Our theoretical analysis rests on a key structural lemma about partial derivatives of multi-layer rectifier networks.
}, keywords = {capacity control, deep learning, Fisher-Rao metric, generalization error, information geometry, Invariance, natural gradient, ReLU activation, statistical learning theory}, url = {https://arxiv.org/abs/1711.01530}, author = {Liang, Tengyuan and Tomaso Poggio and Alexander Rakhlin and Stokes, James} } @article {2484, title = {On the Human Visual System Invariance to Translation and Scale}, year = {2017}, author = {Yena Han and Gemma Roig and Gadi Geiger and Tomaso Poggio} } @conference {2486, title = {Is the Human Visual System Invariant to Translation and Scale?}, booktitle = {AAAI Spring Symposium Series, Science of Intelligence}, year = {2017}, author = {Yena Han and Gemma Roig and Gadi Geiger and Tomaso Poggio} } @article {3162, title = {Invariant action recognition dataset}, year = {2017}, month = {11/2017}, abstract = {To study the effect of changes in view and actor on action recognition, we filmed a dataset of five actors performing five different actions (drink, eat, jump, run and walk) on a treadmill from five different views (0, 45, 90, 135, and 180 degrees from the front of the actor/treadmill; the treadmill rather than the camera was rotated in place to acquire from different viewpoints). The dataset was filmed on a fixed, constant background. To avoid low-level object/action confounds (e.g. the action {\textquotedblleft}drink{\textquotedblright} being classified as the only videos with water bottle in the scene) and guarantee that the main sources of variation of visual appearance are due to actions, actors and viewpoint, the actors held the same objects (an apple and a water bottle) in each video, regardless of the action they performed. This controlled design allows us to test hypotheses on the computational mechanisms underlying invariant recognition in the human visual system without having to settle for a synthetic dataset.
More information and the dataset files can be found here - https://doi.org/10.7910/DVN/DMT0PG
}, url = {https://doi.org/10.7910/DVN/DMT0PG}, author = {Andrea Tacchetti and Leyla Isik and Tomaso Poggio} } @article {3272, title = {Invariant recognition drives neural representations of action sequences}, journal = {PLOS Computational Biology}, volume = {13}, year = {2017}, month = {12/2017}, pages = {e1005859}, abstract = {Recognizing the actions of others from visual stimuli is a crucial aspect of human perception that allows individuals to respond to social cues. Humans are able to discriminate between similar actions despite transformations, like changes in viewpoint or actor, that substantially alter the visual appearance of a scene. This ability to generalize across complex transformations is a hallmark of human visual intelligence. Advances in understanding action recognition at the neural level have not always translated into precise accounts of the computational principles underlying what representations of action sequences are constructed by human visual cortex. Here we test the hypothesis that invariant action discrimination might fill this gap. Recently, the study of artificial systems for static object perception has produced models, Convolutional Neural Networks (CNNs), that achieve human level performance in complex discriminative tasks. Within this class, architectures that better support invariant object recognition also produce image representations that better match those implied by human and primate neural data. However, whether these models produce representations of action sequences that support recognition across complex transformations and closely follow neural representations of actions remains unknown. Here we show that spatiotemporal CNNs accurately categorize video stimuli into action classes, and that deliberate model modifications that improve performance on an invariant action recognition task lead to data representations that better match human neural recordings. 
Our results support our hypothesis that performance on invariant discrimination dictates the neural representations of actions computed in the brain. These results broaden the scope of the invariant recognition framework for understanding visual intelligence from perception of inanimate objects and faces in static images to the study of human perception of action sequences.
}, doi = {10.1371/journal.pcbi.1005859}, url = {http://dx.plos.org/10.1371/journal.pcbi.1005859}, author = {Andrea Tacchetti and Leyla Isik and Tomaso Poggio}, editor = {Berniker, Max} } @article {3453, title = {Invariant recognition drives neural representations of action sequences}, journal = {PLoS Comp. Bio}, year = {2017}, abstract = {Recognizing the actions of others from visual stimuli is a crucial aspect of human perception that allows individuals to respond to social cues. Humans are able to discriminate between similar actions despite transformations, like changes in viewpoint or actor, that substantially alter the visual appearance of a scene. This ability to generalize across complex transformations is a hallmark of human visual intelligence. Advances in understanding action recognition at the neural level have not always translated into precise accounts of the computational principles underlying what representations of action sequences are constructed by human visual cortex. Here we test the hypothesis that invariant action discrimination might fill this gap. Recently, the study of artificial systems for static object perception has produced models, Convolutional Neural Networks (CNNs), that achieve human level performance in complex discriminative tasks. Within this class, architectures that better support invariant object recognition also produce image representations that better match those implied by human and primate neural data. However, whether these models produce representations of action sequences that support recognition across complex transformations and closely follow neural representations of actions remains unknown. Here we show that spatiotemporal CNNs accurately categorize video stimuli into action classes, and that deliberate model modifications that improve performance on an invariant action recognition task lead to data representations that better match human neural recordings. 
Our results support our hypothesis that performance on invariant discrimination dictates the neural representations of actions computed in the brain. These results broaden the scope of the invariant recognition framework for understanding visual intelligence from perception of inanimate objects and faces in static images to the study of human perception of action sequences.
Associated Dataset: MEG action recognition data
}, author = {Andrea Tacchetti and Leyla Isik and Tomaso Poggio} } @inbook {2562, title = {Invariant Recognition Predicts Tuning of Neurons in Sensory Cortex}, booktitle = {Computational and Cognitive Neuroscience of Vision}, year = {2017}, pages = {85-104}, publisher = {Springer}, organization = {Springer}, issn = {978-981-10-0211-3}, author = {Jim Mutch and F. Anselmi and Andrea Tacchetti and Lorenzo Rosasco and JZ. Leibo and Tomaso Poggio} } @article {2780, title = {Musings on Deep Learning: Properties of SGD}, year = {2017}, month = {04/2017}, abstract = {[formerly titled "Theory of Deep Learning III: Generalization Properties of SGD"]
In Theory III we characterize with a mix of theory and experiments the generalization properties of Stochastic Gradient Descent in overparametrized deep convolutional networks. We show that Stochastic Gradient Descent (SGD) selects with high probability solutions that 1) have zero (or small) empirical error, 2) are degenerate as shown in Theory II and 3) have maximum generalization.
}, author = {Chiyuan Zhang and Qianli Liao and Alexander Rakhlin and Karthik Sridharan and Brando Miranda and Noah Golowich and Tomaso Poggio} } @article {3111, title = {Object-Oriented Deep Learning}, year = {2017}, month = {10/2017}, abstract = {We investigate an unconventional direction of research that aims at converting neural networks, a class of distributed, connectionist, sub-symbolic models into a symbolic level with the ultimate goal of achieving AI interpretability and safety. To that end, we propose Object-Oriented Deep Learning, a novel computational paradigm of deep learning that adopts interpretable {\textquotedblleft}objects/symbols{\textquotedblright} as a basic representational atom instead of N-dimensional tensors (as in traditional {\textquotedblleft}feature-oriented{\textquotedblright} deep learning). For visual processing, each {\textquotedblleft}object/symbol{\textquotedblright} can explicitly package common properties of visual objects like its position, pose, scale, probability of being an object, pointers to parts, etc., providing a full spectrum of interpretable visual knowledge throughout all layers. It achieves a form of {\textquotedblleft}symbolic disentanglement{\textquotedblright}, offering one solution to the important problem of disentangled representations and invariance. Basic computations of the network include predicting high-level objects and their properties from low-level objects and binding/aggregating relevant objects together. These computations operate at a more fundamental level than convolutions, capturing convolution as a special case while being significantly more general than it. All operations are executed in an input-driven fashion, thus sparsity and dynamic computation per sample are naturally supported, complementing recent popular ideas of dynamic networks and may enable new types of hardware accelerations. 
We experimentally show on CIFAR-10 that it can perform flexible visual processing, rivaling the performance of ConvNet, but without using any convolution. Furthermore, it can generalize to novel rotations of images that it was not trained for.
}, author = {Qianli Liao and Tomaso Poggio} } @article {3283, title = {Pruning Convolutional Neural Networks for Image Instance Retrieval}, year = {2017}, month = {07/2017}, abstract = {In this work, we focus on the problem of image instance retrieval with deep descriptors extracted from pruned Convolutional Neural Networks (CNN). The objective is to heavily prune convolutional edges while maintaining retrieval performance. To this end, we introduce both data-independent and data-dependent heuristics to prune convolutional edges, and evaluate their performance across various compression rates with different deep descriptors over several benchmark datasets. Further, we present an end-to-end framework to fine-tune the pruned network, with a triplet loss function specially designed for the retrieval task. We show that the combination of heuristic pruning and fine-tuning offers 5x compression rate without considerable loss in retrieval performance.
}, keywords = {CNN, Image Instance Retrieval, Pooling, Pruning, Triplet Loss}, url = {https://arxiv.org/abs/1707.05455}, author = {Gaurav Manek and Jie Lin and Vijay Chandrasekhar and Lingyu Duan and Sateesh Giduthuri and Xiaoli Li and Tomaso Poggio} } @conference {2673, title = {Representation Learning from Orbit Sets for One-shot Classification}, booktitle = {AAAI Spring Symposium Series, Science of Intelligence}, year = {2017}, address = {AAAI}, abstract = {The sample complexity of a learning task is increased by transformations that do not change class identity. Visual object recognition for example, i.e. the discrimination or categorization of distinct semantic classes, is affected by changes in viewpoint, scale, illumination or planar transformations. We introduce a weakly-supervised framework for learning robust and selective representations from sets of transforming examples (orbit sets). We train deep encoders that explicitly account for the equivalence up to transformations of orbit sets and show that the resulting encodings contract the intra-orbit distance and preserve identity either by preserving reconstruction or by increasing the inter-orbit distance. We explore a loss function that combines a discriminative term, and a reconstruction term that uses a decoder-encoder map to learn to rectify transformation-perturbed examples, and demonstrate the validity of the resulting embeddings for one-shot learning. Our results suggest that a suitable definition of orbit sets is a form of weak supervision that can be exploited to learn semantically relevant embeddings.
}, url = {https://www.aaai.org/ocs/index.php/SSS/SSS17/paper/view/15357}, author = {Andrea Tacchetti and Stephen Voinea and Georgios Evangelopoulos and Tomaso Poggio} } @article {2900, title = {Symmetry Regularization}, number = {063}, year = {2017}, month = {05/2017}, abstract = {The properties of a representation, such as smoothness, adaptability, generality, equivariance/invariance, depend on restrictions imposed during learning. In this paper, we propose using data symmetries, in the sense of equivalences under transformations, as a means for learning symmetry-adapted representations, i.e., representations that are equivariant to transformations in the original space. We provide a sufficient condition to enforce the representation, for example the weights of a neural network layer or the atoms of a dictionary, to have a group structure and specifically the group structure in an unlabeled training set. By reducing the analysis of generic group symmetries to permutation symmetries, we devise an analytic expression for a regularization scheme and a permutation invariant metric on the representation space. Our work provides a proof of concept on why and how to learn equivariant representations, without explicit knowledge of the underlying symmetries in the data.
}, author = {F. Anselmi and Georgios Evangelopoulos and Lorenzo Rosasco and Tomaso Poggio} } @article {2698, title = {Theory II: Landscape of the Empirical Risk in Deep Learning}, year = {2017}, month = {03/2017}, abstract = {Previous theoretical work on deep learning and neural network optimization tends to focus on avoiding saddle points and local minima. However, the practical observation is that, at least for the most successful Deep Convolutional Neural Networks (DCNNs) for visual processing, practitioners can always increase the network size to fit the training data (an extreme example would be [1]). The most successful DCNNs such as VGG and ResNets are best used with a small degree of "overparametrization". In this work, we characterize with a mix of theory and experiments, the landscape of the empirical risk of overparametrized DCNNs. We first prove the existence of a large number of degenerate global minimizers with zero empirical error (modulo inconsistent equations). The zero-minimizers -- in the case of classification -- have a non-zero margin. The same minimizers are degenerate and thus very likely to be found by SGD that will furthermore select with higher probability the zero-minimizer with larger margin, as discussed in Theory III (to be released). We further experimentally explored and visualized the landscape of empirical risk of a DCNN on CIFAR-10 during the entire training process and especially the global minima. Finally, based on our theoretical and experimental results, we propose an intuitive model of the landscape of DCNN{\textquoteright}s empirical loss surface, which might not be as complicated as people commonly believe.
}, author = {Tomaso Poggio and Qianli Liao} } @article {3261, title = {Theory of Deep Learning IIb: Optimization Properties of SGD}, year = {2017}, month = {12/2017}, abstract = {In Theory IIb we characterize with a mix of theory and experiments the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: SGD concentrates in probability - like the classical Langevin equation {\textendash} on large volume, {\textquotedblleft}flat{\textquotedblright} minima, selecting flat minimizers which are with very high probability also global minimizers.
}, author = {Chiyuan Zhang and Qianli Liao and Alexander Rakhlin and Brando Miranda and Noah Golowich and Tomaso Poggio} } @article {3266, title = {Theory of Deep Learning III: explaining the non-overfitting puzzle}, year = {2017}, month = {12/2017}, abstract = {THIS MEMO IS REPLACED BY CBMM MEMO 90
A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave near zero stable minima of the empirical error as a gradient system in a quadratic potential with degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient.
Our proposition provides the extension to deep networks of key properties of gradient descent methods for linear networks, that, as suggested in (1), can be the key to understanding generalization. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, and asymptotically converging to the minimum norm solution. This implies that there is usually an optimum early stopping that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution which guarantees good classification error for {\textquotedblleft}low noise{\textquotedblright} datasets.
The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.
}, author = {Tomaso Poggio and Keji Kawaguchi and Qianli Liao and Brando Miranda and Lorenzo Rosasco and Xavier Boix and Jack Hidary and Hrushikesh Mhaskar} } @article {2327, title = {View-Tolerant Face Recognition and Hebbian Learning Imply Mirror-Symmetric Neural Tuning to Head Orientation}, journal = {Current Biology}, volume = {27}, year = {2017}, month = {01/2017}, pages = {1-6}, abstract = {The primate brain contains a hierarchy of visual areas, dubbed the ventral stream, which rapidly computes object representations that are both specific for object identity and robust against identity-preserving transformations, like depth rotations. Current computational models of object recognition, including recent deep-learning networks, generate these properties through a hierarchy of alternating selectivity-increasing filtering and tolerance-increasing pooling operations, similar to simple-complex cells operations. Here, we prove that a class of hierarchical architectures and a broad set of biologically plausible learning rules generate approximate invariance to identity-preserving transformations at the top level of the processing hierarchy. However, all past models tested failed to reproduce the most salient property of an intermediate representation of a three-level face-processing hierarchy in the brain: mirror-symmetric tuning to head orientation. Here, we demonstrate that one specific biologically plausible Hebb-type learning rule generates mirror-symmetric tuning to bilaterally symmetric stimuli, like faces, at intermediate levels of the architecture and show why it does so. Thus, the tuning properties of individual cells inside the visual stream appear to result from group properties of the stimuli they encode and to reflect the learning rules that sculpted the information-processing system within which they reside.\
}, doi = {http://dx.doi.org/10.1016/j.cub.2016.10.015}, author = {JZ. Leibo and Qianli Liao and F. Anselmi and W. A. Freiwald and Tomaso Poggio} } @proceedings {2682, title = {When and Why Are Deep Networks Better Than Shallow Ones?}, year = {2017}, abstract = {The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.
}, keywords = {convolutional neural networks, deep and shallow networks, deep learning, function approximation, Machine Learning, Neural Networks}, doi = {10.1007/s11633-017-1054-2}, url = {http://link.springer.com/article/10.1007/s11633-017-1054-2?wt_mc=Internal.Event.1.SEM.ArticleAuthorOnlineFirst}, author = {Tomaso Poggio and Hrushikesh Mhaskar and Lorenzo Rosasco and Brando Miranda and Qianli Liao} } @article {2034, title = {Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex}, year = {2016}, month = {04/2016}, abstract = {We discuss relations between Residual Networks (ResNet), Recurrent Neural Networks (RNNs) and the primate visual cortex. We begin with the observation that a shallow RNN is exactly equivalent to a very deep ResNet with weight sharing among the layers. A direct implementation of such a RNN, although having orders of magnitude fewer parameters, leads to a performance similar to the corresponding ResNet. We propose 1) a generalization of both RNN and ResNet architectures and 2) the conjecture that a class of moderately deep RNNs is a biologically-plausible model of the ventral stream in visual cortex. We demonstrate the effectiveness of the architectures by testing them on the CIFAR-10 dataset.
}, author = {Qianli Liao and Tomaso Poggio} } @article {2337, title = {Deep Learning: Mathematics and Neuroscience}, journal = {A Sponsored Supplement to Science}, volume = {Brain-Inspired intelligent robotics: The intersection of robotics and neuroscience}, year = {2016}, month = {12/2016}, pages = {9-12}, chapter = {9}, abstract = {Understanding the nature of intelligence is one of the greatest challenges in science and technology today. Making significant progress toward this goal will require the interaction of several disciplines including neuroscience and cognitive science, as well as computer science, robotics, and machine learning. In this paper, I will discuss the implications of recent empirical successes in many applications, such as image categorizations, face identification, localization, action recognition through a machine learning technique called "deep learning," which is based on multi-layer or hierarchical neural networks. Such neural networks have become a central tool in machine learning.
}, url = {http://science.imirus.com/Mpowered/imirus.jsp?volume=scim16\&issue=6\&page=10}, author = {Tomaso Poggio} } @article {2066, title = {Deep Learning: mathematics and neuroscience}, year = {2016}, month = {04/2016}, abstract = {The problems of Intelligence are, together, the greatest problem in science and technology today. Making significant progress towards their solution will require the interaction of several disciplines involving neuroscience and cognitive science in addition to computer science, robotics and machine learning...
}, author = {Tomaso Poggio} } @article {2183, title = {Deep vs. shallow networks : An approximation theory perspective}, year = {2016}, month = {08/2016}, abstract = {The paper briefly reviews several recent results on hierarchical architectures for learning from examples, that may formally explain the conditions under which Deep Convolutional Neural Networks perform much better in function approximation problems than shallow, one-hidden layer architectures. The paper announces new results for a non-smooth activation function {\textendash} the ReLU function {\textendash} used in present-day neural networks, as well as for the Gaussian networks. We propose a new definition of relative dimension to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning.\
Isik, L*, Tacchetti, A*, and Poggio, T (* authors contributed equally to this work)
The ability to recognize the actions of others from visual input is essential to humans{\textquoteright} daily lives. The neural computations underlying action recognition, however, are still poorly understood. We use magnetoencephalography (MEG) decoding and a computational model to study action recognition from a novel dataset of well-controlled, naturalistic videos of five actions (run, walk, jump, eat, drink) performed by five actors at five viewpoints. We show for the first time that actor- and view-invariant representations for action arise in the human brain as early as 200 ms. We next extend a class of biologically inspired hierarchical computational models of object recognition to recognize actions from videos and explain the computations underlying our MEG findings. This model achieves 3D viewpoint-invariance by the same biologically inspired computational mechanism it uses to build invariance to position and scale. These results suggest that robustness to complex transformations, such as 3D viewpoint invariance, does not require special neural architectures, and further provide a mechanistic explanation of the computations driving invariant action recognition.
}, url = {http://arxiv.org/abs/1601.01358}, author = {Leyla Isik and Andrea Tacchetti and Tomaso Poggio} } @article {1617, title = {Foveation-based Mechanisms Alleviate Adversarial Examples}, number = {044}, year = {2016}, month = {01/2016}, abstract = {We show that adversarial examples, i.e., the visually imperceptible perturbations that cause Convolutional Neural Networks (CNNs) to fail, can be alleviated with a mechanism based on foveations---applying the CNN in different image regions. To see this, first, we report results in ImageNet that lead to a revision of the hypothesis that adversarial perturbations are a consequence of CNNs acting as a linear classifier: CNNs act locally linearly to changes in the image regions with objects recognized by the CNN, and in other regions the CNN may act non-linearly. Then, we corroborate that when the neural responses are linear, applying the foveation mechanism to the adversarial example tends to significantly reduce the effect of the perturbation. This is because, hypothetically, the CNNs for ImageNet are robust to changes of scale and translation of the object produced by the foveation, but this property does not generalize to transformations of the perturbation. As a result, the accuracy after a foveation is almost the same as the accuracy of the CNN without the adversarial perturbation, even if the adversarial perturbation is calculated taking into account a foveation.
}, author = {Luo, Yan and X Boix and Gemma Roig and Tomaso Poggio and Qi Zhao} } @article {1594, title = {Group Invariant Deep Representations for Image Instance Retrieval}, year = {2016}, month = {01/2016}, abstract = {Most image instance retrieval pipelines are based on comparison of vectors known as global image descriptors between a query image and the database images. Due to their success in large scale image classification, representations extracted from Convolutional Neural Networks (CNN) are quickly gaining ground on Fisher Vectors (FVs) as state-of-the-art global descriptors for image instance retrieval. While CNN-based descriptors are generally remarked for good retrieval performance at lower bitrates, they nevertheless present a number of drawbacks including the lack of robustness to common object transformations such as rotations compared with their interest point based FV counterparts.
In this paper, we propose a method for computing invariant global descriptors from CNNs. Our method implements a recently proposed mathematical theory for invariance in a sensory cortex modeled as a feedforward neural network. The resulting global descriptors can be made invariant to multiple arbitrary transformation groups while retaining good discriminativeness.
Based on a thorough empirical evaluation using several publicly available datasets, we show that our method is able to significantly and consistently improve retrieval results every time a new type of invariance is incorporated. We also show that our method, which has few parameters, is not prone to overfitting: improvements generalize well across datasets with different properties with regard to invariances. Finally, we show that our descriptors are able to compare favourably to other state-of-the-art compact descriptors in similar bitranges, exceeding the highest retrieval results reported in the literature on some datasets. A dedicated dimensionality reduction step {\textendash}quantization or hashing{\textendash} may be able to further improve the competitiveness of the descriptors.
Learning embeddings of entities and relations is an efficient and versatile method to perform machine learning on relational data such as knowledge graphs. In this work, we propose holographic embeddings (HolE) to learn compositional vector space representations of entire knowledge graphs. The proposed method is related to holographic models of associative memory in that it employs circular correlation to create compositional representations. By using correlation as the compositional operator HolE can capture rich interactions but simultaneously remains efficient to compute, easy to train, and scalable to very large datasets. In extensive experiments we show that holographic embeddings are able to outperform state-of-the-art methods for link prediction in knowledge graphs and relational learning benchmark datasets.
}, author = {Maximilian Nickel and Lorenzo Rosasco and Tomaso Poggio} } @conference {2629, title = {How Important Is Weight Symmetry in Backpropagation?}, booktitle = {Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)}, year = {2016}, address = {Phoenix, AZ.}, abstract = {Gradient backpropagation (BP) requires symmetric feedforward and feedback connections -- the same weights must be used for forward and backward passes. This "weight transport problem" (Grossberg 1987) is thought to be one of the main reasons to doubt BP{\textquoteright}s biological plausibility. Using 15 different classification datasets, we systematically investigate to what extent BP really depends on weight symmetry. In a study that turned out to be surprisingly similar in spirit to Lillicrap et al.{\textquoteright}s demonstration (Lillicrap et al. 2014) but orthogonal in its results, our experiments indicate that: (1) the magnitudes of feedback weights do not matter to performance (2) the signs of feedback weights do matter -- the more concordant signs between feedforward and their corresponding feedback connections, the better (3) with feedback weights having random magnitudes and 100\% concordant signs, we were able to achieve the same or even better performance than SGD. (4) some normalizations/stabilizations are indispensable for such asymmetric BP to work, namely Batch Normalization (BN) (Ioffe and Szegedy 2015) and/or a "Batch Manhattan" (BM) update rule.
}, url = {https://cbmm.mit.edu/sites/default/files/publications/liao-leibo-poggio.pdf}, author = {Qianli Liao and JZ. Leibo and Tomaso Poggio} } @conference {1651, title = {How Important Is Weight Symmetry in Backpropagation?}, booktitle = {Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)}, year = {2016}, month = {Accepted}, publisher = {Association for the Advancement of Artificial Intelligence}, organization = {Association for the Advancement of Artificial Intelligence}, address = {Phoenix, AZ.}, abstract = {Gradient backpropagation (BP) requires symmetric feedforward and feedback connections -- the same weights must be used for forward and backward passes. This "weight transport problem" (Grossberg 1987) is thought to be one of the main reasons to doubt BP{\textquoteright}s biological plausibility. Using 15 different classification datasets, we systematically investigate to what extent BP really depends on weight symmetry. In a study that turned out to be surprisingly similar in spirit to Lillicrap et al.{\textquoteright}s demonstration (Lillicrap et al. 2014) but orthogonal in its results, our experiments indicate that: (1) the magnitudes of feedback weights do not matter to performance; (2) the signs of feedback weights do matter -- the more concordant signs between feedforward and their corresponding feedback connections, the better; (3) with feedback weights having random magnitudes and 100\% concordant signs, we were able to achieve the same or even better performance than SGD; and (4) some normalizations/stabilizations are indispensable for such asymmetric BP to work, namely Batch Normalization (BN) (Ioffe and Szegedy 2015) and/or a "Batch Manhattan" (BM) update rule.
}, author = {Qianli Liao and JZ. Leibo and Tomaso Poggio} } @article {2142, title = {Introduction Special issue: Deep learning}, journal = {Information and Inference}, volume = {5}, year = {2016}, pages = {103-104}, abstract = {Faced with large amounts of data, the aim of machine learning is to make predictions. It applies to many types of data, such as images, sounds, biological data, etc. A key difficulty is to find relevant vectorial representations. While this problem had often been handled in an ad hoc way by domain experts, it has recently proved useful to learn these representations directly from large quantities of data, and Deep Learning Convolutional Networks (DLCN) with ReLU nonlinearities have been particularly successful. The representations are then based on compositions of simple parameterized processing units, the depth coming from the large number of such compositions.
The goal of this special issue was to explore some of the mathematical ideas and problems at the heart of deep learning. In particular, two key mathematical questions about deep learning are:
The question of the power of the architecture{\textemdash}which classes of functions can it approximate well? Why are deep networks better than shallow ones, and when?
Learning the unknown parameters{\textemdash}weights and biases{\textemdash}from the data via optimization of a loss function: do multiple solutions exist? How {\textquotedblleft}many{\textquotedblright}? Why is stochastic gradient descent (SGD) so unreasonably efficient, at least in appearance?
These questions are still open and a full theory of Deep Learning is still in the making. This special issue, however, begins with two papers that provide a useful contribution to several other theoretical questions surrounding supervised deep learning.
}, doi = {10.1093/imaiai/iaw010}, url = {http://imaiai.oxfordjournals.org/content/5/2/103.short}, author = {Bach, Francis and Tomaso Poggio} } @article {2098, title = {On invariance and selectivity in representation learning}, journal = {Information and Inference: A Journal of the IMA}, year = {2016}, month = {05/2016}, pages = {iaw009}, abstract = {We study the problem of learning from data representations that are invariant to transformations, and at the same time selective, in the sense that two points have the same representation if one is the transformation of the other. The mathematical results here sharpen some of the key claims of i-theory{\textemdash}a recent theory of feedforward processing in sensory cortex (Anselmi et al., 2013, Theor. Comput. Sci. and arXiv:1311.4158; Anselmi et al., 2013, Magic materials: a theory of deep hierarchical architectures for learning sensory representations. CBCL Paper; Anselmi \& Poggio, 2010, Representation learning in sensory cortex: a theory. CBMM Memo No. 26).
}, issn = {2049-8764}, doi = {10.1093/imaiai/iaw009}, url = {http://imaiai.oxfordjournals.org/lookup/doi/10.1093/imaiai/iaw009}, author = {F. Anselmi and Lorenzo Rosasco and Tomaso Poggio} } @article {1741, title = {Learning Functions: When Is Deep Better Than Shallow}, year = {2016}, abstract = {While the universal approximation property holds both for hierarchical and shallow networks, we prove that deep (hierarchical) networks can approximate the class of compositional functions with the same accuracy as shallow networks but with exponentially lower number of training parameters as well as VC-dimension. This theorem settles an old conjecture by Bengio on the role of depth in networks. We then define a general class of scalable, shift-invariant algorithms to show a simple and natural set of requirements that justify deep convolutional networks.
}, url = {https://arxiv.org/pdf/1603.00988v4.pdf}, author = {Hrushikesh Mhaskar and Qianli Liao and Tomaso Poggio} } @article {3281, title = {Nested Invariance Pooling and RBM Hashing for Image Instance Retrieval}, journal = {arXiv.org}, year = {2016}, month = {03/2016}, abstract = {The goal of this work is the computation of very compact binary hashes for image instance retrieval. Our approach has two novel contributions. The first one is Nested Invariance Pooling (NIP), a method inspired from i-theory, a mathematical theory for computing group invariant transformations with feed-forward neural networks. NIP is able to produce compact and well-performing descriptors with visual representations extracted from convolutional neural networks. We specifically incorporate scale, translation and rotation invariances but the scheme can be extended to any arbitrary sets of transformations. We also show that using moments of increasing order throughout nesting is important. The NIP descriptors are then hashed to the target code size (32-256 bits) with a Restricted Boltzmann Machine with a novel batch-level regularization scheme specifically designed for the purpose of hashing (RBMH). A thorough empirical evaluation with state-of-the-art shows that the results obtained both with the NIP descriptors and the NIP+RBMH hashes are consistently outstanding across a wide range of datasets.
}, keywords = {CNN, Hashing, Image Instance Retrieval, Invariant Representation, Regularization, unsupervised learning}, url = {https://arxiv.org/abs/1603.04595}, author = {Olivier Mor{\`e}re and Antoine Veillard and Vijay Chandrasekhar and Tomaso Poggio} } @article {1892, title = {Neural Tuning Size in a Model of Primate Visual Processing Accounts for Three Key Markers of Holistic Face Processing}, journal = {Public Library of Science | PLoS ONE}, volume = {1(3): e0150980}, year = {2016}, month = {03/2016}, abstract = {Faces are an important and unique class of visual stimuli, and have been of interest to neuroscientists for many years. Faces are known to elicit certain characteristic behavioral markers, collectively labeled {\textquotedblleft}holistic processing{\textquotedblright}, while non-face objects are not processed holistically. However, little is known about the underlying neural mechanisms. The main aim of this computational simulation work is to investigate the neural mechanisms that make
face processing holistic. Using a model of primate visual processing, we show that a single key factor, {\textquotedblleft}neural tuning size{\textquotedblright}, is able to account for three important markers of holistic face processing: the Composite Face Effect (CFE), Face Inversion Effect (FIE) and Whole-Part Effect (WPE). Our proof-of-principle specifies the precise neurophysiological property that corresponds to the poorly-understood notion of holism, and shows that this one neural property controls three classic behavioral markers of holism. Our work is consistent with neurophysiological evidence, and makes further testable predictions. Overall, we provide a parsimonious account of holistic face processing, connecting computation, behavior and neurophysiology.
This textbook presents a wide range of subjects in neuroscience from a computational perspective. It offers a comprehensive, integrated introduction to core topics, using computational tools to trace a path from neurons and circuits to behavior and cognition. Moreover, the chapters show how computational neuroscience{\textemdash}methods for modeling the causal interactions underlying neural systems{\textemdash}complements empirical research in advancing the understanding of brain and behavior.
The chapters{\textemdash}all by leaders in the field, and carefully integrated by the editors{\textemdash}cover such subjects as action and motor control; neuroplasticity, neuromodulation, and reinforcement learning; vision; and language{\textemdash}the core of human cognition.
The book can be used for advanced undergraduate or graduate level courses. It presents all necessary background in neuroscience beyond basic facts about neurons and synapses and general ideas about the structure and function of the human brain. Students should be familiar with differential equations and probability theory, and be able to pick up the basics of programming in MATLAB and/or Python. Slides, exercises, and other ancillary materials are freely available online, and many of the models described in the chapters are documented in the brain operation database, BODB (which is also described in a book chapter).
Available now through MIT Press - https://mitpress.mit.edu/neuron-cognition
}, isbn = {9780262034968}, url = {https://mitpress.mit.edu/neuron-cognition}, author = {Owen Lewis and Tomaso Poggio} } @article {2576, title = {Spatio-temporal convolutional networks explain neural representations of human actions}, year = {2016}, author = {Andrea Tacchetti and Leyla Isik and Tomaso Poggio} } @article {2243, title = {Streaming Normalization: Towards Simpler and More Biologically-plausible Normalizations for Online and Recurrent Learning}, year = {2016}, month = {10/2016}, abstract = {We systematically explored a spectrum of normalization algorithms related to Batch Normalization (BN) and propose a generalized formulation that simultaneously solves two major limitations of BN: (1) online learning and (2) recurrent learning. Our proposal is simpler and more biologically-plausible. Unlike previous approaches, our technique can be applied out of the box to all learning scenarios (e.g., online learning, batch learning, fully-connected, convolutional, feedforward, recurrent and mixed {\textemdash} recurrent and convolutional) and compares favorably with existing approaches. We also propose Lp Normalization for normalizing by different orders of statistical moments. In particular, L1 normalization is well-performing, simple to implement, fast to compute, more biologically-plausible and thus ideal for GPU or hardware implementations.
[formerly titled "Why and When Can Deep - but Not Shallow - Networks Avoid the Curse of Dimensionality: a Review"]
The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represents an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.
}, author = {Tomaso Poggio and Hrushikesh Mhaskar and Lorenzo Rosasco and Brando Miranda and Qianli Liao} } @article {2047, title = {Turing++ Questions: A Test for the Science of (Human) Intelligence.}, journal = {AI Magazine}, volume = {37}, year = {2016}, month = {03/2016}, pages = {73-77}, abstract = {It is becoming increasingly clear that there is an infinite number of definitions of intelligence. Machines that are intelligent in different narrow ways have been built since the 50s. We are entering now a golden age for the engineering of intelligence and the development of many different kinds of intelligent machines. At the same time there is a widespread interest among scientists in understanding a specific and well defined form of intelligence, that is human intelligence. For this reason we propose a stronger version of the original Turing test. In particular, we describe here an open-ended set of Turing++ Questions that we are developing at the Center for Brains, Minds and Machines at MIT {\textemdash} that is questions about an image. Questions may range from what is there to who is there, what is this person doing, what is this girl thinking about this boy and so on. The plural in questions is to emphasize that there are many different intelligent abilities in humans that have to be characterized, and possibly replicated in a machine, from basic visual recognition of objects, to the identification of faces, to gauge emotions, to social intelligence, to language and much more. The term Turing++ is to emphasize that our goal is understanding human intelligence at all Marr{\textquoteright}s levels {\textemdash} from the level of the computations to the level of the underlying circuits. Answers to the Turing++ Questions should thus be given in terms of models that match human behavior and human physiology {\textemdash} the mind and the brain. These requirements are thus well beyond the original Turing test.
A whole scientific field that we call the science of (human) intelligence is required to make progress in answering our Turing++ Questions. It is connected to neuroscience and to the engineering of intelligence but also separate from both of them.
}, doi = {10.1609/aimag.v37i1.2641}, url = {http://www.aaai.org/ojs/index.php/aimagazine/article/view/2641}, author = {Tomaso Poggio and Ethan Meyers} } @article {2122, title = {View-tolerant face recognition and Hebbian learning imply mirror-symmetric neural tuning to head orientation}, year = {2016}, month = {06/2016}, abstract = {The primate brain contains a hierarchy of visual areas, dubbed the ventral stream, which rapidly computes object representations that are both specific for object identity and relatively robust against identity-preserving transformations like depth-rotations [33, 32, 23, 13]. Current computational models of object recognition, including recent deep learning networks, generate these properties through a hierarchy of alternating selectivity-increasing filtering and tolerance-increasing pooling operations, similar to simple-complex cells operations [46, 8, 44, 29]. While simulations of these models recapitulate the ventral stream{\textquoteright}s progression from early view-specific to late view-tolerant representations, they fail to generate the most salient property of the intermediate representation for faces found in the brain: mirror-symmetric tuning of the neural population to head orientation [16]. Here we prove that a class of hierarchical architectures and a broad set of biologically plausible learning rules can provide approximate invariance at the top level of the network. While most of the learning rules do not yield mirror-symmetry in the mid-level representations, we characterize a specific biologically-plausible Hebb-type learning rule that is guaranteed to generate mirror-symmetric tuning to faces at intermediate levels of the architecture.
}, author = {JZ. Leibo and Qianli Liao and W. A. Freiwald and F. Anselmi and Tomaso Poggio} } @book {2207, title = {Visual Cortex and Deep Networks: Learning Invariant Representations}, year = {2016}, month = {09/2016}, pages = {136}, publisher = {The MIT Press}, organization = {The MIT Press}, address = {Cambridge, MA, USA}, abstract = {The ventral visual stream is believed to underlie object recognition in primates. Over the past fifty years, researchers have developed a series of quantitative models that are increasingly faithful to the biological architecture. Recently, deep learning convolution networks{\textemdash}which do not reflect several important features of the ventral stream architecture and physiology{\textemdash}have been trained with extremely large datasets, resulting in model neurons that mimic object recognition but do not explain the nature of the computations carried out in the ventral stream. This book develops a mathematical framework that describes learning of invariant representations of the ventral stream and is particularly relevant to deep convolutional learning networks.
The authors propose a theory based on the hypothesis that the main computational goal of the ventral stream is to compute neural representations of images that are invariant to transformations commonly encountered in the visual environment and are learned from unsupervised experience. They describe a general theoretical framework of a computational theory of invariance (with details and proofs offered in appendixes) and then review the application of the theory to the feedforward path of the ventral stream in the primate visual cortex.
We extend i-theory to incorporate not only pooling but also rectifying nonlinearities in an extended HW module (eHW) designed for supervised learning. The two operations roughly correspond to invariance and selectivity, respectively. Under the assumption of normalized inputs, we show that appropriate linear combinations of rectifying nonlinearities are equivalent to radial kernels. If pooling is present, an equivalent kernel also exists. Thus present-day DCNs (Deep Convolutional Networks) can be exactly equivalent to a hierarchy of kernel machines with pooling and non-pooling layers. Finally, we describe a conjecture for theoretically understanding hierarchies of such modules. A main consequence of the conjecture is that hierarchies of eHW modules minimize memory requirements while computing a selective and invariant representation.
}, author = {F. Anselmi and Lorenzo Rosasco and Cheston Tan and Tomaso Poggio} } @conference {1142, title = {Discriminative Template Learning in Group-Convolutional Networks for Invariant Speech Representations}, booktitle = {INTERSPEECH-2015}, year = {2015}, month = {09/2015}, publisher = {International Speech Communication Association (ISCA)}, organization = {International Speech Communication Association (ISCA)}, address = {Dresden, Germany}, url = {http://www.isca-speech.org/archive/interspeech_2015/i15_3229.html}, author = {Chiyuan Zhang and Stephen Voinea and Georgios Evangelopoulos and Lorenzo Rosasco and Tomaso Poggio} } @article {1508, title = {Holographic Embeddings of Knowledge Graphs}, year = {2015}, month = {11/16/2015}, abstract = {Learning embeddings of entities and relations is an efficient and versatile method to perform machine learning on relational data such as knowledge graphs. In this work, we propose holographic embeddings (HolE) to learn compositional vector space representations of entire knowledge graphs. The proposed method is related to holographic models of associative memory in that it employs circular correlation to create compositional representations. By using correlation as the compositional operator, HolE can capture rich interactions but simultaneously remains efficient to compute, easy to train, and scalable to very large datasets. In extensive experiments we show that holographic embeddings are able to outperform state-of-the-art methods for link prediction in knowledge graphs and relational learning benchmark datasets.
}, keywords = {Associative Memory, Knowledge Graph, Machine Learning}, author = {Maximilian Nickel and Lorenzo Rosasco and Tomaso Poggio} } @article {1593, title = {How Important is Weight Symmetry in Backpropagation?}, year = {2015}, month = {11/29/2015}, abstract = {Gradient backpropagation (BP) requires symmetric feedforward and feedback connections{\textemdash}the same weights must be used for forward and backward passes. This {\textquotedblleft}weight transport problem{\textquotedblright} [1] is thought to be one of the main reasons for BP{\textquoteright}s biological implausibility. Using 15 different classification datasets, we systematically study to what extent BP really depends on weight symmetry. In a study that turned out to be surprisingly similar in spirit to Lillicrap et al.{\textquoteright}s demonstration [2] but orthogonal in its results, our experiments indicate that: (1) the magnitudes of feedback weights do not matter to performance; (2) the signs of feedback weights do matter{\textemdash}the more concordant signs between feedforward and their corresponding feedback connections, the better; (3) with feedback weights having random magnitudes and 100\% concordant signs, we were able to achieve the same or even better performance than SGD; and (4) some normalizations/stabilizations are indispensable for such asymmetric BP to work, namely Batch Normalization (BN) [3] and/or a {\textquotedblleft}Batch Manhattan{\textquotedblright} (BM) update rule.
}, author = {Qianli Liao and JZ. Leibo and Tomaso Poggio} } @article {695, title = {On Invariance and Selectivity in Representation Learning}, number = {029}, year = {2015}, month = {03/23/2015}, abstract = {We discuss data representations that can be learned automatically from data, are invariant to transformations, and are at the same time selective, in the sense that two points have the same representation only if one is a transformation of the other. The mathematical results here sharpen some of the key claims of i-theory, a recent theory of feedforward processing in sensory cortex.
Is visual cortex made up of general-purpose information processing machinery, or does it consist of a collection of specialized modules? If prior knowledge, acquired from learning a set of objects is only transferable to new objects that share properties with the old, then the recognition system{\textquoteright}s optimal organization must be one containing specialized modules for different object classes. Our analysis starts from a premise we call the invariance hypothesis: that the computational goal of the ventral stream is to compute an invariant-to-transformations and discriminative signature for recognition. The key condition enabling approximate transfer of invariance without sacrificing discriminability turns out to be that the learned and novel objects transform similarly. This implies that the optimal recognition system must contain subsystems trained only with data from similarly-transforming objects and suggests a novel interpretation of domain-specific regions like the fusiform face area (FFA). Furthermore, we can define an index of transformation-compatibility, computable from videos, that can be combined with information about the statistics of natural vision to yield predictions for which object categories ought to have domain-specific regions in agreement with the available data. The result is a unifying account linking the large literature on view-based recognition with the wealth of experimental evidence concerning domain-specific regions.
}, doi = {10.1371/journal.pcbi.1004390}, url = {http://dx.plos.org/10.1371/journal.pcbi.1004390}, author = {JZ. Leibo and Qianli Liao and F. Anselmi and Tomaso Poggio} } @article {1380, title = {The Invariance Hypothesis Implies Domain-Specific Regions in Visual Cortex}, year = {2015}, month = {07/2015}, author = {JZ. Leibo and Qianli Liao and F. Anselmi and Tomaso Poggio} } @article {2560, title = {Invariant representations for action recognition in the visual system}, year = {2015}, author = {Leyla Isik and Andrea Tacchetti and Tomaso Poggio} } @article {2559, title = {Invariant representations for action recognition in the visual system.}, volume = {15}, year = {2015}, journal = {Journal of Vision}, doi = {10.1167/15.12.558}, url = {http://jov.arvojournals.org/article.aspx?articleid=2433666}, author = {Andrea Tacchetti and Leyla Isik and Tomaso Poggio} } @article {1588, title = {I-theory on depth vs width: hierarchical function composition}, year = {2015}, month = {12/29/2015}, abstract = {Deep learning networks with convolution, pooling and subsampling are a special case of hierarchical architectures, which can be represented by trees (such as binary trees). Hierarchical as well as shallow networks can approximate functions of several variables, in particular those that are compositions of low dimensional functions. We show that the power of a deep network architecture with respect to a shallow network is rather independent of the specific nonlinear operations in the network and depends instead on the behavior of the VC-dimension. A shallow network can approximate compositional functions with the same error as a deep network but at the cost of a VC-dimension that is exponential rather than quadratic in the dimensionality of the function. To complete the argument we argue that there exist visual computations that are intrinsically compositional.
In particular, we prove that recognition invariant to translation cannot be computed by shallow networks in the presence of clutter. Finally, a general framework that includes the compositional case is sketched. The key condition that allows tall, thin networks to be nicer than short, fat networks is that the target input-output function must be sparse in a certain technical sense.
}, author = {Tomaso Poggio and F. Anselmi and Lorenzo Rosasco} } @conference {1574, title = {Learning with a Wasserstein Loss}, booktitle = {Advances in Neural Information Processing Systems (NIPS 2015) 28}, year = {2015}, abstract = {Learning to predict multi-label outputs is challenging, but in many problems there is a natural metric on the outputs that can be used to improve predictions. In this paper we develop a loss function for multi-label learning, based on the Wasserstein distance. The Wasserstein distance provides a natural notion of dissimilarity for probability measures. Although optimizing with respect to the exact Wasserstein distance is costly, recent work has described a regularized approximation that is efficiently computed. We describe an efficient learning algorithm based on this regularization, as well as a novel extension of the Wasserstein distance from probability measures to unnormalized measures. We also describe a statistical learning bound for the loss. The Wasserstein loss can encourage smoothness of the predictions with respect to a chosen metric on the output space. We demonstrate this property on a real-data tag prediction problem, using the Yahoo Flickr Creative Commons dataset, outperforming a baseline that doesn{\textquoteright}t use the metric.
}, url = {http://arxiv.org/abs/1506.05439}, author = {Charlie Frogner and Chiyuan Zhang and Hossein Mobahi and Mauricio Araya-Polo and Tomaso Poggio} } @conference {1559, title = {Learning with Group Invariant Features: A Kernel Perspective}, booktitle = {NIPS 2015}, year = {2015}, abstract = {We define an extension of classical additive splines for multivariate
function approximation that we call hierarchical splines. We show that the
case of hierarchical, additive, piece-wise linear splines includes present-day
Deep Convolutional Learning Networks (DCLNs) with linear rectifiers and
pooling (sum or max). We discuss how these observations together with
i-theory may provide a framework for a general theory of deep networks.
We are in the midst of a revolution in machine intelligence, the engineering of getting computers to perform tasks that, until recently, could only be done by people. You can speak to your smart phone and it answers back, software identifies faces at border-crossings and labels people and objects in pictures posted to social media. Algorithms can teach themselves to play Atari video games. A camera and chip embedded into the front view-mirror of top-of-the-line sedans let the vehicle drive autonomously on the open road...
}, author = {Christof Koch and Tomaso Poggio} } @article {1415, title = {Unsupervised learning of invariant representations}, journal = {Theoretical Computer Science}, year = {2015}, month = {06/25/2015}, abstract = {The present phase of Machine Learning is characterized by supervised learning algorithms relying on large sets of labeled examples ($n \to \infty$). The next phase is likely to focus on algorithms capable of learning from very few labeled examples ($n \to 1$), like humans seem able to do. We propose an approach to this problem and describe the underlying theory, based on the unsupervised, automatic learning of a {\textquotedblleft}good{\textquotedblright} representation for supervised learning, characterized by small sample complexity. We consider the case of visual object recognition, though the theory also applies to other domains like speech. The starting point is the conjecture, proved in specific cases, that image representations which are invariant to translation, scaling and other transformations can considerably reduce the sample complexity of learning. We prove that an invariant and selective signature can be computed for each image or image patch: the invariance can be exact in the case of group transformations and approximate under non-group transformations. A module performing filtering and pooling, like the simple and complex cells described by Hubel and Wiesel, can compute such signature. The theory offers novel unsupervised learning algorithms for {\textquotedblleft}deep{\textquotedblright} architectures for image and speech recognition.
We conjecture that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects/images which is invariant to transformations, stable, and selective for recognition{\textemdash}and show how this representation may be continuously learned in an unsupervised way during development and visual experience.
}, keywords = {convolutional networks, Cortex, Hierarchy, Invariance}, doi = {10.1016/j.tcs.2015.06.048}, url = {http://www.sciencedirect.com/science/article/pii/S0304397515005587}, author = {F. Anselmi and JZ. Leibo and Lorenzo Rosasco and Jim Mutch and Andrea Tacchetti and Tomaso Poggio} } @article {2062, title = {What if...}, year = {2015}, month = {06/2015}, abstract = {Over the last 3 years and increasingly so in the last few months, I have seen supervised DCLNs {\textemdash} feedforward and recurrent {\textemdash} do more and more of everything quite well. They seem to learn good representations for a growing number of speech and text problems (for a review by the pioneers in the field see LeCun, Bengio, Hinton, 2015). More interestingly, it is increasingly clear, as I will discuss later, that instead of being trained on millions of labeled examples they can be trained in implicitly supervised ways. This breakthrough in machine learning triggers a few dreams. What if we have now the basic answer to how to develop brain-like intelligence and its basic building blocks?...
}, author = {Tomaso Poggio} } @article {357, title = {Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines?}, number = {003}, year = {2014}, month = {03/2014}, abstract = {The standard approach to unconstrained face recognition in natural photographs is via a detection, alignment, recognition pipeline. While that approach has achieved impressive results, there are several reasons to be dissatisfied with it, among them is its lack of biological plausibility. A recent theory of invariant recognition by feedforward hierarchical networks, like HMAX, other convolutional networks, or possibly the ventral stream, implies an alternative approach to unconstrained face recognition. This approach accomplishes detection and alignment implicitly by storing transformations of training images (called templates) rather than explicitly detecting and aligning faces at test time. Here we propose a particular locality-sensitive hashing based voting scheme which we call {\textquotedblleft}consensus of collisions{\textquotedblright} and show that it can be used to approximate the full 3-layer hierarchy implied by the theory. The resulting end-to-end system for unconstrained face recognition operates on photographs of faces taken under natural conditions, e.g., Labeled Faces in the Wild (LFW), without aligning or cropping them, as is normally done. It achieves a drastic improvement in the state of the art on this end-to-end task, reaching the same level of performance as the best systems operating on aligned, closely cropped images (no outside training data). It also performs well on two newer datasets, similar to LFW, but more difficult: LFW-jittered (new here) and SUFR-W.
}, keywords = {Computer vision, Face recognition, Hierarchy, Invariance}, author = {Qianli Liao and JZ. Leibo and Youssef Mroueh and Tomaso Poggio} } @article {443, title = {Computational role of eccentricity dependent cortical magnification.}, number = {017}, year = {2014}, month = {06/2014}, abstract = {We develop a sampling extension of M-theory focused on invariance to scale and translation. Quite surprisingly, the theory predicts an architecture of early vision with increasing receptive field sizes and a high resolution fovea {\textemdash} in agreement with data about the cortical magnification factor, V1 and the retina. From the slope of the inverse of the magnification factor, M-theory predicts a cortical {\textquotedblleft}fovea{\textquotedblright} in V1 in the order of 40 by 40 basic units at each receptive field size {\textemdash} corresponding to a foveola of size around 26 minutes of arc at the highest resolution, ≈6 degrees at the lowest resolution. It also predicts uniform scale invariance over a fixed range of scales independently of eccentricity, while translation invariance should depend linearly on spatial frequency. Bouma{\textquoteright}s law of crowding follows in the theory as an effect of cortical area-by-cortical area pooling; the Bouma constant is the value expected if the signature responsible for recognition in the crowding experiments originates in V2. From a broader perspective, the emerging picture suggests that visual recognition under natural conditions takes place by composing information from a set of fixations, with each fixation providing recognition from a space-scale image fragment {\textemdash} that is an image patch represented at a set of increasing sizes and decreasing resolutions.
}, keywords = {Invariance, Theories for Intelligence}, author = {Tomaso Poggio and Jim Mutch and Leyla Isik} } @conference {1141, title = {A Deep Representation for Invariance and Music Classification}, booktitle = {ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing}, year = {2014}, month = {05/04/2014}, publisher = {IEEE}, organization = {IEEE}, address = {Florence, Italy}, keywords = {acoustic signal processing, signal representation, unsupervised learning}, doi = {10.1109/ICASSP.2014.6854954}, url = {http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6854954}, author = {Chiyuan Zhang and Georgios Evangelopoulos and Stephen Voinea and Lorenzo Rosasco and Tomaso Poggio} } @article {227, title = {A Deep Representation for Invariance And Music Classification}, number = {002}, year = {2014}, month = {03/2014}, abstract = {Representations in the auditory cortex might be based on mechanisms similar to the visual ventral stream; modules for building invariance to transformations and multiple layers for compositionality and selectivity. In this paper we propose the use of such computational modules for extracting invariant and discriminative audio representations. Building on a theory of invariance in hierarchical architectures, we propose a novel, mid-level representation for acoustical signals, using the empirical distributions of projections on a set of templates and their transformations. Under the assumption that, by construction, this dictionary of templates is composed from similar classes, and samples the orbit of variance-inducing signal transformations (such as shift and scale), the resulting signature is theoretically guaranteed to be unique, invariant to transformations and stable to deformations. Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations. 
We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification.
}, keywords = {Audio Representation, Hierarchy, Invariance, Machine Learning, Theories for Intelligence}, author = {Chiyuan Zhang and Georgios Evangelopoulos and Stephen Voinea and Lorenzo Rosasco and Tomaso Poggio} } @article {389, title = {The dynamics of invariant object recognition in the human visual system.}, journal = {J Neurophysiol}, volume = {111}, year = {2014}, month = {01/2014}, pages = {91-102}, abstract = {The human visual system can rapidly recognize objects despite transformations that alter their appearance. The precise timing of when the brain computes neural representations that are invariant to particular transformations, however, has not been mapped in humans. Here we employ magnetoencephalography decoding analysis to measure the dynamics of size- and position-invariant visual information development in the ventral visual stream. With this method we can read out the identity of objects beginning as early as 60 ms. Size- and position-invariant visual information appear around 125 ms and 150 ms, respectively, and both develop in stages, with invariance to smaller transformations arising before invariance to larger transformations. Additionally, the magnetoencephalography sensor activity localizes to neural sources that are in the most posterior occipital regions at the early decoding times and then move temporally as invariant information develops. These results provide previously unknown latencies for key stages of human-invariant object recognition, as well as new and compelling evidence for a feed-forward hierarchical model of invariant object recognition where invariance increases at each successive visual area along the ventral stream.
Corresponding Dataset - The dynamics of invariant object recognition in the human visual system.
}, keywords = {Adolescent, Adult, Evoked Potentials, Visual, Female, Humans, Male, Pattern Recognition, Visual, Reaction Time, visual cortex}, issn = {1522-1598}, doi = {10.1152/jn.00394.2013}, url = {http://jn.physiology.org/content/early/2013/09/27/jn.00394.2013.abstract}, author = {Leyla Isik and Ethan Meyers and JZ. Leibo and Tomaso Poggio} } @article {2288, title = {The dynamics of invariant object recognition in the human visual system.}, year = {2014}, month = {01/2014}, abstract = {This is the dataset for corresponding Journal Article - The dynamics of invariant object recognition in the human visual system.
The human visual system can rapidly recognize objects despite transformations that alter their appearance. The precise timing of when the brain computes neural representations that are invariant to particular transformations, however, has not been mapped in humans. Here we employ magnetoencephalography decoding analysis to measure the dynamics of size- and position-invariant visual information development in the ventral visual stream. With this method we can read out the identity of objects beginning as early as 60 ms. Size- and position-invariant visual information appear around 125 ms and 150 ms, respectively, and both develop in stages, with invariance to smaller transformations arising before invariance to larger transformations. Additionally, the magnetoencephalography sensor activity localizes to neural sources that are in the most posterior occipital regions at the early decoding times and then move temporally as invariant information develops. These results provide previously unknown latencies for key stages of human-invariant object recognition, as well as new and compelling evidence for a feed-forward hierarchical model of invariant object recognition where invariance increases at each successive visual area along the ventral stream.
Dataset files can be downloaded here - http://dx.doi.org/10.7910/DVN/KRUPXZ
11 subjects{\textquoteright} MEG data from Isik et al., 2014. Data is available in raw .fif format or in Matlab raster format that is compatible with the neural decoding toolbox (readout.info).
For Matlab code to pre-process this MEG data, and run the decoding analyses please visit
https://bitbucket.org/lisik/meg_decoding
}, doi = {10.7910/DVN/KRUPXZ}, author = {Leyla Isik and Ethan Meyers and JZ. Leibo and Tomaso Poggio} } @article {438, title = {The Invariance Hypothesis Implies Domain-Specific Regions in Visual Cortex}, number = {004}, year = {2014}, month = {04/2014}, abstract = {Is visual cortex made up of general-purpose information processing machinery, or does it consist of a collection of specialized modules? If prior knowledge acquired from learning a set of objects is only transferable to new objects that share properties with the old, then the recognition system{\textquoteright}s optimal organization must be one containing specialized modules for different object classes. Our analysis starts from a premise we call the invariance hypothesis: that the computational goal of the ventral stream is to compute an invariant-to-transformations and discriminative signature for recognition. The key condition enabling approximate transfer of invariance without sacrificing discriminability turns out to be that the learned and novel objects transform similarly. This implies that the optimal recognition system must contain subsystems trained only with data from similarly-transforming objects and suggests a novel interpretation of domain-specific regions like the fusiform face area (FFA). Furthermore, we can define an index of transformation-compatibility, computable from videos, that can be combined with information about the statistics of natural vision to yield predictions for which object categories ought to have domain-specific regions. The result is a unifying account linking the large literature on view-based recognition with the wealth of experimental evidence concerning domain-specific regions.
}, keywords = {Neuroscience, Theories for Intelligence}, doi = {10.1101/004473}, url = {http://biorxiv.org/lookup/doi/10.1101/004473}, author = {JZ. Leibo and Qianli Liao and F. Anselmi and Tomaso Poggio} } @article {451, title = {Learning An Invariant Speech Representation}, number = {022}, year = {2014}, month = {06/2014}, abstract = {Recognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluate its validity for voiced speech sound classification. Our version of the theory requires the memory-based, unsupervised storage of acoustic templates {\textemdash} such as specific phones or words {\textemdash} together with all the transformations of each that normally occur. A quasi-invariant representation for a speech segment can be obtained by projecting it to each template orbit, i.e., the set of transformed signals, and computing the associated one-dimensional empirical probability distributions. The computations can be performed by modules of filtering and pooling, and extended to hierarchical architectures. In this paper, we apply a single-layer, multicomponent representation for phonemes and demonstrate improved accuracy and decreased sample complexity for vowel classification compared to standard spectral, cepstral and perceptual features.
}, keywords = {Theories for Intelligence}, author = {Georgios Evangelopoulos and Stephen Voinea and Chiyuan Zhang and Lorenzo Rosasco and Tomaso Poggio} } @conference {222, title = {Learning invariant representations and applications to face verification}, booktitle = {NIPS 2013}, year = {2014}, month = {02/2014}, publisher = {Advances in Neural Information Processing Systems 26}, organization = {Advances in Neural Information Processing Systems 26}, address = {Lake Tahoe, Nevada}, abstract = {One approach to computer object recognition and modeling the brain{\textquoteright}s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformation-invariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identity-preserving transformations. The model{\textquoteright}s wiring can be learned from videos of transforming objects{\textemdash}or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter {\textquotedblleft}transformations{\textquotedblright} which map an image of a face on one background to an image of the same face on a different background. 
Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered{\textemdash}achieving strong performance in these highly unconstrained cases as well.
}, keywords = {Computer vision}, url = {http://nips.cc/Conferences/2013/Program/event.php?ID=4074}, author = {Qianli Liao and JZ. Leibo and Tomaso Poggio} } @article {450, title = {Neural tuning size is a key factor underlying holistic face processing.}, number = {021}, year = {2014}, month = {06/2014}, abstract = {Faces are a class of visual stimuli with unique significance, for a variety of reasons. They are ubiquitous throughout the course of a person{\textquoteright}s life, and face recognition is crucial for daily social interaction. Faces are also unlike any other stimulus class in terms of certain physical stimulus characteristics. Furthermore, faces have been empirically found to elicit certain characteristic behavioral phenomena, which are widely held to be evidence of {\textquotedblleft}holistic{\textquotedblright} processing of faces. However, little is known about the neural mechanisms underlying such holistic face processing. In other words, for the processing of faces by the primate visual system, the input and output characteristics are relatively well known, but the internal neural computations are not. The main aim of this work is to further the fundamental understanding of what causes the visual processing of faces to be different from that of objects. In this computational modeling work, we show that a single factor {\textendash} {\textquotedblleft}neural tuning size{\textquotedblright} {\textendash} is able to account for three key phenomena that are characteristic of face processing, namely the Composite Face Effect (CFE), Face Inversion Effect (FIE) and Whole-Part Effect (WPE). Our computational proof-of-principle provides specific neural tuning properties that correspond to the poorly understood notion of holistic face processing, and connects these neural properties to psychophysical behavior.
Overall, our work provides a unified and parsimonious theoretical account for the disparate empirical data on face-specific processing, deepening the fundamental understanding of face processing.
}, keywords = {Theories for Intelligence}, author = {Cheston Tan and Tomaso Poggio} } @conference {220, title = {Phone Classification by a Hierarchy of Invariant Representation Layers}, booktitle = {INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association}, year = {2014}, publisher = {International Speech Communication Association (ISCA)}, organization = {International Speech Communication Association (ISCA)}, address = {Singapore}, abstract = {We propose a multi-layer feature extraction framework for speech, capable of providing invariant representations. A set of templates is generated by sampling the result of applying smooth, identity-preserving transformations (such as vocal tract length and tempo variations) to arbitrarily-selected speech signals. Templates are then stored as the weights of {\textquotedblleft}neurons{\textquotedblright}. We use a cascade of such computational modules to factor out different types of transformation variability in a hierarchy, and show that it improves phone classification over baseline features. In addition, we describe empirical comparisons of a) different transformations which may be responsible for the variability in speech signals and of b) different ways of assembling template sets for training. The proposed layered system is an effort towards explaining the performance of recent deep learning networks and the principles by which the human auditory cortex might reduce the sample complexity of learning in speech recognition. Our theory and experiments suggest that invariant representations are crucial in learning from complex, real-world data like natural speech. Our model is built on basic computational primitives of cortical neurons, thus making an argument about how representations might be learned in the human auditory cortex.
}, keywords = {Hierarchy, Invariance, Neural Networks, Speech Representation}, url = {http://www.isca-speech.org/archive/interspeech_2014/i14_2346.html}, author = {Chiyuan Zhang and Stephen Voinea and Georgios Evangelopoulos and Lorenzo Rosasco and Tomaso Poggio} } @article {457, title = {Representation Learning in Sensory Cortex: a theory.}, number = {026}, year = {2014}, month = {11/2014}, abstract = {We review and apply a computational theory of the feedforward path of the ventral stream in visual cortex based on the hypothesis that its main function is the encoding of invariant representations of images. A key justification of the theory is provided by a theorem linking invariant representations to small sample complexity for recognition {\textendash} that is, invariant representations allows learning from very few labeled examples. The theory characterizes how an algorithm that can be implemented by a set of {\textquotedblright}simple{\textquotedblright} and {\textquotedblright}complex{\textquotedblright} cells {\textendash} a {\textquotedblright}HW module{\textquotedblright} {\textendash} provides invariant and selective representations. The invariance can be learned in an unsupervised way from observed transformations. Theorems show that invariance implies several properties of the ventral stream organization, including the eccentricity dependent lattice of units in the retina and in V1, and the tuning of its neurons. The theory requires two stages of processing: the first, consisting of retinotopic visual areas such as V1, V2 and V4 with generic neuronal tuning, leads to representations that are invariant to translation and scaling; the second, consisting of modules in IT, with class- and object-specific tuning, provides a representation for recognition with approximate invariance to class specific transformations, such as pose (of a body, of a face) and expression. 
In the theory, the main function of the ventral stream is the unsupervised learning of {\textquotedblright}good{\textquotedblright} representations that reduce the sample complexity of the final supervised learning stage.
Recent months have seen an increasingly public debate taking form around the risks of AI (Artificial Intelligence). A letter signed by Nobel laureates and other physicists defined AI as the top existential risk to mankind. More recently, Tesla CEO Elon Musk has been quoted saying that it is {\textquotedblleft}potentially more dangerous than nukes.{\textquotedblright} Physicist Stephen Hawking told the BBC that {\textquotedblleft}the development of full artificial intelligence could spell the end of the human race{\textquotedblright}. And of course recent films such as Her and Transcendence have reinforced the message. Thoughtful comments by experts in the field such as Rod Brooks, Oren Etzioni and others have done little to settle the debate.
As the Director of a new multi-institution, NSF-funded and MIT-based Science and Technology Center {\textemdash} called the Center for Brains, Minds and Machines (CBMM) {\textemdash} I am arguing here, on behalf of my collaborators and many colleagues, that the terms of the debate should be fundamentally rephrased. Our vision of the Center{\textquoteright}s research integrates cognitive science, neuroscience, computer science, and artificial intelligence. Our belief is that understanding intelligence and replicating it in machines goes hand in hand with understanding how the brain and the mind perform intelligent computations. The convergence and recent progress in technology, mathematics, and neuroscience have created a new opportunity for synergy across fields. The dream of understanding intelligence is an old one. Yet, as the debate around AI shows, now is an exciting time to pursue this vision. Our mission at CBMM is thus to establish an emerging field, the Science and Engineering of Intelligence. This integrated effort should ultimately make fundamental progress with great value to science, technology, and society. We believe that we must push ahead with research, not pull back.
}, author = {Tomaso Poggio} } @article {1140, title = {Speech Representations based on a Theory for Learning Invariances}, year = {2014}, month = {10/2014}, type = {poster presentation}, address = {SANE 2014 - Speech and Audio in the Northeast}, abstract = {Recognition of sounds and speech from a small number of labelled examples (as humans do) depends on the properties of the representation of the acoustic input. We formulate the problem of extracting robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain that requires the memory-based, unsupervised storage of acoustic templates -- such as specific phones or words -- together with all the transformations of each that normally occur. A quasi-invariant representation for a speech signal can be obtained by projecting it to a number of template orbits, i.e., each one a set of transformed template signals, and computing the associated one-dimensional empirical probability distributions. The computations are performed by modules of filtering and pooling, which can be used for obtaining a mapping in single- or multilayer architectures. We consider several aspects of such representations including different signal scales (word vs. frame), input domains (raw waveforms vs. frequency filterbank responses), structures (shallow vs. multilayer/hierarchical), and ways of sampling from template orbit sets given a set of observations (explicit vs. learned). Preliminary empirical evaluations for learning to separate speech phones and words are given on TIMIT and subsets of TI-DIGITS.
}, author = {Stephen Voinea and Chiyuan Zhang and Georgios Evangelopoulos and Lorenzo Rosasco and Tomaso Poggio} } @article {219, title = {Subtasks of Unconstrained Face Recognition}, year = {2014}, month = {01/2014}, publisher = {9th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications. (VISAPP).}, address = {Lisbon, Portugal}, abstract = {Unconstrained face recognition remains a challenging computer vision problem despite recent exceptionally high results (\~{} 95\% accuracy) on the current gold standard evaluation dataset: Labeled Faces in the Wild (LFW) (Huang et al., 2008; Chen et al., 2013). We offer a decomposition of the unconstrained problem into subtasks based on the idea that invariance to identity-preserving transformations is the crux of recognition. Each of the subtasks in the Subtasks of Unconstrained Face Recognition (SUFR) challenge consists of a same-different face-matching problem on a set of 400 individual synthetic faces rendered so as to isolate a specific transformation or set of transformations. We characterized the performance of 9 different models (8 previously published) on each of the subtasks. One notable finding was that the HMAX-C2 feature was not nearly as clutter-resistant as had been suggested by previous publications (Leibo et al., 2010; Pinto et al., 2011). Next we considered LFW and argued that it is too easy of a task to continue to be regarded as a measure of progress on unconstrained face recognition. In particular, strong performance on LFW requires almost no invariance, yet it cannot be considered a fair approximation of the outcome of a detection{\textrightarrow}alignment pipeline since it does not contain the kinds of variability that realistic alignment systems produce when working on non-frontal faces. 
We offer a new, more difficult, natural image dataset: SUFR-in-the-Wild (SUFR-W), which we created using a protocol that was similar to LFW, but with a few differences designed to produce more need for transformation invariance. We present baseline results for eight different face recognition systems on the new dataset and argue that it is time to retire LFW and move on to more difficult evaluations for unconstrained face recognition.
}, keywords = {Face identification, Invariance, Labeled Faces in the Wild, Same-different matching, Synthetic data}, author = {JZ. Leibo and Qianli Liao and Tomaso Poggio} } @article {384, title = {Subtasks of unconstrained face recognition}, year = {2014}, month = {01/2014}, abstract = {This package contains:
1. SUFR-W, a dataset of {\textquotedblleft}in the wild{\textquotedblright} natural images of faces gathered from the internet. The protocol used to create the dataset is described in Leibo, Liao and Poggio (2014).
2. The full set of SUFR synthetic datasets, called the {\textquotedblleft}Subtasks of Unconstrained Face Recognition Challenge{\textquotedblright} in Leibo, Liao and Poggio (2014).
}, keywords = {Computer vision}, author = {JZ. Leibo and Qianli Liao and Tomaso Poggio} } @article {455, title = {Unsupervised learning of clutter-resistant visual representations from natural videos.}, number = {023}, year = {2014}, month = {09/2014}, abstract = {Populations of neurons in inferotemporal cortex (IT) maintain an explicit code for object identity that also tolerates transformations of object appearance e.g., position, scale, viewing angle [1, 2, 3]. Though the learning rules are not known, recent results [4, 5, 6] suggest the operation of an unsupervised temporal-association-based method e.g., Foldiak{\textquoteright}s trace rule [7]. Such methods exploit the temporal continuity of the visual world by assuming that visual experience over short timescales will tend to have invariant identity content. Thus, by associating representations of frames from nearby times, a representation that tolerates whatever transformations occurred in the video may be achieved. Many previous studies verified that such rules can work in simple situations without background clutter, but the presence of visual clutter has remained problematic for this approach. Here we show that temporal association based on large class-specific filters (templates) avoids the problem of clutter. Our system learns in an unsupervised way from natural videos gathered from the internet, and is able to perform a difficult unconstrained face recognition task on natural images (Labeled Faces in the Wild [8]).
}, author = {Qianli Liao and JZ. Leibo and Tomaso Poggio} } @article {226, title = {Unsupervised learning of invariant representations with low sample complexity: the magic of sensory cortex or a new framework for machine learning?}, number = {001}, year = {2014}, month = {03/2014}, abstract = {The present phase of Machine Learning is characterized by supervised learning algorithms relying on large sets of labeled examples (n{\textrightarrow}$\infty$). The next phase is likely to focus on algorithms capable of learning from very few labeled examples (n{\textrightarrow}1), like humans seem able to do. We propose an approach to this problem and describe the underlying theory, based on the unsupervised, automatic learning of a {\textquotedblleft}good{\textquotedblright} representation for supervised learning, characterized by small sample complexity (n). We consider the case of visual object recognition though the theory applies to other domains. The starting point is the conjecture, proved in specific cases, that image representations which are invariant to translations, scaling and other transformations can considerably reduce the sample complexity of learning. We prove that an invariant and unique (discriminative) signature can be computed for each image patch, I, in terms of empirical distributions of the dot-products between I and a set of templates stored during unsupervised learning. A module performing filtering and pooling, like the simple and complex cells described by Hubel and Wiesel, can compute such estimates. Hierarchical architectures consisting of this basic Hubel-Wiesel moduli inherit its properties of invariance, stability, and discriminability while capturing the compositional organization of the visual world in terms of wholes and parts. The theory extends existing deep learning convolutional architectures for image and speech recognition. 
It also suggests that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects/images which is invariant to transformations, stable, and discriminative for recognition{\textemdash}and that this representation may be continuously learned in an unsupervised way during development and visual experience.
}, keywords = {Computer vision, Pattern recognition}, author = {F. Anselmi and JZ. Leibo and Lorenzo Rosasco and Jim Mutch and Andrea Tacchetti and Tomaso Poggio} } @conference {221, title = {Word-level Invariant Representations From Acoustic Waveforms}, booktitle = {INTERSPEECH 2014 - 15th Annual Conf. of the International Speech Communication Association}, year = {2014}, publisher = {International Speech Communication Association (ISCA)}, organization = {International Speech Communication Association (ISCA)}, address = {Singapore}, abstract = {Extracting discriminant, transformation-invariant features from raw audio signals remains a serious challenge for speech recognition. The issue of speaker variability is central to this problem, as changes in accent, dialect, gender, and age alter the sound waveform of speech units at multiple scales (phonemes, words, or phrases). Approaches for dealing with this variability have typically focused on analyzing the spectral properties of speech at the level of frames, on par with frame-level acoustic modeling usually applied to speech recognition systems. In this paper, we propose a framework for representing speech at the whole-word level and extracting features from the acoustic, temporal domain, without the need for spectral encoding or pre-processing. Leveraging recent work on unsupervised learning of invariant sensory representations, we extract a signature for a word by first projecting its raw waveform onto a set of templates and their transformations, and then forming empirical estimates of the resulting one-dimensional distributions via histograms. The representation and relevant parameters are evaluated for word classification on a series of datasets with increasing speaker-mismatch difficulty, and the results are compared to those of an MFCC-based representation.
}, keywords = {Invariance, Speech Representation, Theories for Intelligence}, url = {http://www.isca-speech.org/archive/interspeech_2014/i14_2385.html}, author = {Stephen Voinea and Chiyuan Zhang and Georgios Evangelopoulos and Lorenzo Rosasco and Tomaso Poggio} } @inbook {218, title = {On Learnability, Complexity and Stability}, booktitle = {Empirical Inference}, year = {2013}, pages = {59 - 69}, publisher = {Springer Berlin Heidelberg}, organization = {Springer Berlin Heidelberg}, chapter = {7}, address = {Berlin, Heidelberg}, abstract = {Empirical Inference, Chapter 7
Editors: Bernhard Sch{\"o}lkopf, Zhiyuan Luo and Vladimir Vovk
We consider the fundamental question of learnability of a hypothesis class in the supervised learning setting and in the general learning setting introduced by Vladimir Vapnik. We survey classic results characterizing learnability in terms of suitable notions of complexity, as well as more recent results that establish the connection between learnability and stability of a learning algorithm.
}, isbn = {978-3-642-41135-9}, doi = {10.1007/978-3-642-41136-6_7}, url = {http://link.springer.com/10.1007/978-3-642-41136-6}, author = {Villa, Silvia and Lorenzo Rosasco and Tomaso Poggio and Sch{\"o}lkopf, Bernhard and Luo, Zhiyuan and Vovk, Vladimir} } @article {223, title = {NSF Science and Technology Centers {\textendash} The Class of 2013}, year = {2013}, month = {11/2013}, publisher = {North America Gender Summit}, address = {Washington, D.C.}, author = {Eaton Lattman and Tomaso Poggio and Robert Westervelt} } @proceedings {387, title = {Unsupervised Learning of Invariant Representations in Hierarchical Architectures.}, year = {2013}, month = {11/2013}, abstract = {Representations that are invariant to translation, scale and other transformations can considerably reduce the sample complexity of learning, allowing recognition of new object classes from very few examples {\textendash} a hallmark of human recognition. Empirical estimates of one-dimensional projections of the distribution induced by a group of affine transformations are proven to represent a unique and invariant signature associated with an image. We show how projections yielding invariant signatures for future images can be learned automatically, and updated continuously, during unsupervised visual experience. A module performing filtering and pooling, like simple and complex cells as proposed by Hubel and Wiesel, can compute such estimates. Under this view, a pooling stage estimates a one-dimensional probability distribution. Invariance from observations through a restricted window is equivalent to a sparsity property w.r.t. a transformation, which yields templates that are a) Gabor for optimal simultaneous invariance to translation and scale or b) very specific for complex, class-dependent transformations such as rotation in depth of faces.
Hierarchical architectures consisting of this basic Hubel-Wiesel module inherit its properties of invariance, stability, and discriminability while capturing the compositional organization of the visual world in terms of wholes and parts, and are invariant to complex transformations that may only be locally affine. The theory applies to several existing deep learning convolutional architectures for image and speech recognition. It also suggests that the main computational goal of the ventral stream of visual cortex is to provide a hierarchical representation of new objects which is invariant to transformations, stable, and discriminative for recognition {\textendash} this representation may be learned in an unsupervised way from natural visual experience.
}, keywords = {convolutional networks, Hierarchy, Invariance, visual cortex}, author = {F. Anselmi and JZ. Leibo and Lorenzo Rosasco and Jim Mutch and Andrea Tacchetti and Tomaso Poggio} } @article {380, title = {A Large Video Database for Human Motion Recognition}, year = {2011}, month = {01/2011}, abstract = {With nearly one billion online videos viewed every day, an emerging new frontier in computer vision research is recognition and search in video. While much effort has been devoted to the collection and annotation of large scalable static image datasets containing thousands of image categories, human action datasets lag far behind.
Here we introduce HMDB, collected from various sources, mostly from movies, with a small proportion from public databases such as the Prelinger archive, YouTube and Google videos. The dataset contains 6849 clips divided into 51 action categories, each containing a minimum of 101 clips.
The action categories can be grouped in five types:
A general GPU-based framework for the fast simulation of {\textquotedblleft}cortically-organized{\textquotedblright} networks, defined as networks consisting of n-dimensional layers of similar cells.
This is a fairly broad class, including more than just {\textquotedblleft}HMAX{\textquotedblright} models. We have developed specialized CNS packages for HMAX feature hierarchy models (hmax), convolutional networks (cnpkg), and networks of Hodgkin-Huxley spiking cells (hhpkg).
While CNS is designed for use with a GPU, it can run (much more slowly) without one. It does, however, require MATLAB.
CNS ({\textquotedblleft}Cortical Network Simulator{\textquotedblright})
Neurobehavioural analysis of mouse phenotypes requires the monitoring of mouse behaviour over long periods of time. In this study, we describe a trainable computer vision system enabling the automated analysis of complex mouse behaviours. We provide software and an extensive manually annotated video database used for training and testing the system. Our system performs on par with human scoring, as measured from ground-truth manual annotations of thousands of clips of freely behaving mice. As a validation of the system, we characterized the home-cage behaviours of two standard inbred and two non-standard mouse strains. From these data, we were able to predict in a blind test the strain identity of individual animals with high accuracy. Our video-based software will complement existing sensor-based automated approaches and enable an adaptable, comprehensive, high-throughput, fine-grained, automated analysis of mouse behaviour.
}, author = {E. Garrote and H. Jhuang and V. Khilnani and Tomaso Poggio and T. Serre and X. Yu} }