Recent results suggest that square loss performs on par with cross-entropy loss in classification tasks for deep networks. While the theoretical understanding of training deep networks with the cross-entropy loss has been growing, the study of square loss for classification has been lacking. Here we study the dynamics of training under Gradient Descent techniques and show that we can expect convergence to minimum norm solutions when both Weight Decay (WD) and normalization techniques, like Batch Normalization (BN), are used. We perform numerical simulations that show approximate independence of initial conditions, as suggested by our analysis, while in the absence of BN+WD we find that good solutions can be achieved for small initializations. We prove that quasi-interpolating solutions obtained by gradient descent in the presence of WD are expected to show the recently discovered behavior of Neural Collapse and describe other predictions of the theory.

Overparametrized deep networks predict well despite the lack of an explicit complexity control during training, such as an explicit regularization term. For exponential-type loss functions, we solve this puzzle by showing an effective regularization effect of gradient descent in terms of the normalized weights that are relevant for classification.

%B Nature Communications %V 11 %8 02/2020 %G eng %U https://www.nature.com/articles/s41467-020-14663-9 %R https://doi.org/10.1038/s41467-020-14663-9 %0 Generic %D 2020 %T Hierarchically Local Tasks and Deep Convolutional Networks %A Arturo Deza %A Qianli Liao %A Andrzej Banburski %A Tomaso Poggio %K Compositionality %K Inductive Bias %K perception %K Theory of Deep Learning %XThe main success stories of deep learning, starting with ImageNet, depend on convolutional networks, which on certain tasks perform significantly better than traditional shallow classifiers, such as support vector machines. Is there something special about deep convolutional networks that other learning machines do not possess? Recent results in approximation theory have shown that there is an exponential advantage of deep convolutional-like networks in approximating functions with hierarchical locality in their compositional structure. These mathematical results, however, do not say which tasks are expected to have input-output functions with hierarchical locality. Among all the possible hierarchically local tasks in vision, text and speech we explore a few of them experimentally by studying how they are affected by disrupting locality in the input images. We also discuss a taxonomy of tasks ranging from local, to hierarchically local, to global and make predictions about the type of networks required to perform efficiently on these different types of tasks.

%8 06/2020 %1https://arxiv.org/abs/2006.13915

%2https://hdl.handle.net/1721.1/125980

%0 Generic %D 2020 %T Implicit dynamic regularization in deep networks %A Tomaso Poggio %A Qianli Liao %XSquare loss has been observed to perform well in classification tasks, at least as well as crossentropy. However, a theoretical justification is lacking. Here we develop a theoretical analysis for the square loss that complements the existing asymptotic analysis for the exponential loss.

%8 08/2020 %2https://hdl.handle.net/1721.1/126653

%0 Journal Article %J Proceedings of the National Academy of Sciences %D 2020 %T Theoretical issues in deep networks %A Tomaso Poggio %A Andrzej Banburski %A Qianli Liao %XWhile deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about its approximation power, the dynamics of optimization, and good out-of-sample performance, despite overparameterization and the absence of explicit regularization. We review our recent results toward this goal. In approximation theory both shallow and deep networks are known to approximate any continuous function at an exponential cost. However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid the curse of dimensionality. In characterizing minimization of the empirical exponential loss we consider the gradient flow of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to normalized networks. The dynamics of normalized weights turn out to be equivalent to those of the constrained problem of minimizing the loss subject to a unit norm constraint. In particular, the dynamics of typical gradient descent have the same critical points as the constrained problem. Thus there is implicit regularization in training deep networks under exponential-type loss functions during gradient flow. As a consequence, the critical points correspond to minimum norm minimizers. This result is especially relevant because it has been recently shown that, for overparameterized models, selection of a minimum norm solution optimizes cross-validation leave-one-out stability and thereby the expected error. Thus our results imply that gradient descent in deep networks minimizes the expected error.

%B Proceedings of the National Academy of Sciences %P 201907369 %8 Sep-06-2020 %G eng %U https://www.pnas.org/content/early/2020/06/08/1907369117 %! Proc Natl Acad Sci USA %R 10.1073/pnas.1907369117 %0 Conference Paper %B International Conference on Learning Representations, (ICLR 2019) %D 2019 %T Biologically-plausible learning algorithms can scale to large datasets. %A Xiao, Will %A Chen, Honglin %A Qianli Liao %A Tomaso Poggio %XThe backpropagation (BP) algorithm is often thought to be biologically implausible in the brain. One of the main reasons is that BP requires symmetric weight matrices in the feedforward and feedback pathways. To address this “weight transport problem” (Grossberg, 1987), two biologically-plausible algorithms, proposed by Liao et al. (2016) and Lillicrap et al. (2016), relax BP’s weight symmetry requirements and demonstrate comparable learning capabilities to that of BP on small datasets. However, a recent study by Bartunov et al. (2018) finds that although feedback alignment (FA) and some variants of target-propagation (TP) perform well on MNIST and CIFAR, they perform significantly worse than BP on ImageNet. Here, we additionally evaluate the sign-symmetry (SS) algorithm (Liao et al., 2016), which differs from both BP and FA in that the feedback and feedforward weights do not share magnitudes but share signs. We examined the performance of sign-symmetry and feedback alignment on ImageNet and MS COCO datasets using different network architectures (ResNet-18 and AlexNet for ImageNet; RetinaNet for MS COCO). Surprisingly, networks trained with sign-symmetry can attain classification performance approaching that of BP-trained networks. These results complement the study by Bartunov et al. (2018) and establish a new benchmark for future biologically-plausible learning algorithms on more difficult datasets and more complex architectures.

%B International Conference on Learning Representations, (ICLR 2019) %G eng %0 Conference Paper %B NAS Sackler Colloquium on Science of Deep Learning %D 2019 %T Dynamics & Generalization in Deep Networks -Minimizing the Norm %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Jack Hidary %A Tomaso Poggio %B NAS Sackler Colloquium on Science of Deep Learning %C Washington D.C. %8 03/2019 %G eng %0 Generic %D 2019 %T Theoretical Issues in Deep Networks %A Tomaso Poggio %A Andrzej Banburski %A Qianli Liao %XWhile deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about their approximation power, the dynamics of optimization by gradient descent and good out-of-sample performance --- why the expected error does not suffer, despite the absence of explicit regularization, when the networks are overparametrized. We review our recent results towards this goal. In {\it approximation theory} both shallow and deep networks are known to approximate any continuous functions on a bounded domain at a cost which is exponential (the number of parameters is exponential in the dimensionality of the function). However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can have a linear dependence on dimensionality, unlike shallow networks. In characterizing {\it minimization} of the empirical exponential loss we consider the gradient descent dynamics of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to the normalized network. The dynamics of the normalized weights implied by standard gradient descent turns out to be equivalent to the dynamics of the constrained problem of minimizing an exponential-type loss subject to a unit $L_2$ norm constraint. 
In particular, the dynamics of the typical, unconstrained gradient descent converges to the same critical points of the constrained problem. Thus, there is {\it implicit regularization} in training deep networks under exponential-type loss functions with gradient descent. The critical points of the flow are hyperbolic minima (for any long but finite time) and minimum norm minimizers (e.g. maxima of the margin). Though appropriately normalized networks can show a small generalization gap (difference between empirical and expected loss) even for finite $N$ (number of training examples) wrt the exponential loss, they do not generalize in terms of the classification error. Bounds on it for finite $N$ remain an open problem. Nevertheless, our results, together with other recent papers, characterize an implicit vanishing regularization by gradient descent which is likely to be a key prerequisite -- in terms of complexity control -- for the good performance of deep overparametrized ReLU classifiers.

%8 08/2019 %2https://hdl.handle.net/1721.1/122014

%0 Generic %D 2019 %T Theories of Deep Learning: Approximation, Optimization and Generalization %A Qianli Liao %A Andrzej Banburski %A Tomaso Poggio %B TECHCON 2019 %8 09/2019 %0 Conference Paper %B ICML %D 2019 %T Weight and Batch Normalization implement Classical Generalization Bounds %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Jack Hidary %A Tomaso Poggio %B ICML %C Long Beach/California %8 06/2019 %G eng %0 Generic %D 2018 %T Biologically-plausible learning algorithms can scale to large datasets %A Will Xiao %A Honglin Chen %A Qianli Liao %A Tomaso Poggio %XThe backpropagation (BP) algorithm is often thought to be biologically implausible in the brain. One of the main reasons is that BP requires symmetric weight matrices in the feedforward and feedback pathways. To address this "weight transport problem" (Grossberg, 1987), two more biologically plausible algorithms, proposed by Liao et al. (2016) and Lillicrap et al. (2016), relax BP's weight symmetry requirements and demonstrate comparable learning capabilities to that of BP on small datasets. However, a recent study by Bartunov et al. (2018) evaluates variants of target-propagation (TP) and feedback alignment (FA) on MNIST, CIFAR, and ImageNet datasets, and finds that although many of the proposed algorithms perform well on MNIST and CIFAR, they perform significantly worse than BP on ImageNet. Here, we additionally evaluate the sign-symmetry algorithm (Liao et al., 2016), which differs from both BP and FA in that the feedback and feedforward weights share signs but not magnitudes. We examine the performance of sign-symmetry and feedback alignment on ImageNet and MS COCO datasets using different network architectures (ResNet-18 and AlexNet for ImageNet, RetinaNet for MS COCO). Surprisingly, networks trained with sign-symmetry can attain classification performance approaching that of BP-trained networks. These results complement the study by Bartunov et al. (2018), and establish a new benchmark for future biologically plausible learning algorithms on more difficult datasets and more complex architectures.

%8 11/2018 %1https://arxiv.org/abs/1811.03567

%2https://hdl.handle.net/1721.1/121157
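
The difference between BP, sign-symmetry, and feedback alignment described in the abstract above comes down to which matrix carries the error signal backwards through a layer. The following is a minimal sketch for a single linear layer y = W x, not the authors' implementation; the function names, the toy matrix `W`, and the error vector `delta` are all hypothetical and chosen only for illustration.

```python
# Sketch: three ways to send an error signal back through y = W x.
# All names and values here are illustrative assumptions.

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def transpose_matvec(M, delta):
    """Compute M^T @ delta for a matrix stored as a list of rows."""
    rows, cols = len(M), len(M[0])
    return [sum(M[i][j] * delta[i] for i in range(rows)) for j in range(cols)]

def backprop_feedback(W, delta):
    """Standard BP: the feedback weights are exactly W^T."""
    return transpose_matvec(W, delta)

def sign_symmetry_feedback(W, delta):
    """Sign-symmetry: feedback weights keep the signs of W, not its magnitudes."""
    S = [[sign(w) for w in row] for row in W]
    return transpose_matvec(S, delta)

def feedback_alignment_feedback(B, delta):
    """Feedback alignment: a fixed random matrix B stands in for W."""
    return transpose_matvec(B, delta)

W = [[0.5, -2.0, 0.1],
     [-0.3, 0.7, -1.5]]
delta = [1.0, -1.0]  # error signal arriving at the layer output

bp = backprop_feedback(W, delta)
ss = sign_symmetry_feedback(W, delta)
```

In this sketch only the backward pass changes; the forward pass and the weight update would use `W` as usual, which is why sign-symmetry removes the need to transport weight magnitudes but still requires the signs to agree.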

%0 Generic %D 2018 %T Classical generalization bounds are surprisingly tight for Deep Networks %A Qianli Liao %A Brando Miranda %A Jack Hidary %A Tomaso Poggio %XDeep networks are usually trained and tested in a regime in which the training classification error is not a good predictor of the test error. Thus the consensus has been that generalization, defined as convergence of the empirical to the expected error, does not hold for deep networks. Here we show that, when normalized appropriately after training, deep networks trained on exponential type losses show a good linear dependence of test loss on training loss. The observation, motivated by a previous theoretical analysis of overparametrization and overfitting, not only demonstrates the validity of classical generalization bounds for deep learning but suggests that they are tight. In addition, we also show that the bound of the classification error by the normalized cross entropy loss is empirically rather tight on the data sets we studied.

%8 07/2018 %1 %2http://hdl.handle.net/1721.1/116911

%0 Journal Article %J Bulletin of the Polish Academy of Sciences: Technical Sciences %D 2018 %T Theory I: Deep networks and the curse of dimensionality %A Tomaso Poggio %A Qianli Liao %K convolutional neural networks %K deep and shallow networks %K deep learning %K function approximation %XWe review recent work characterizing the classes of functions for which deep learning can be exponentially better than shallow learning. Deep convolutional networks are a special case of these conditions, though weight sharing is not the main reason for their exponential advantage.

%B Bulletin of the Polish Academy of Sciences: Technical Sciences %V 66 %G eng %N 6 %0 Journal Article %J Bulletin of the Polish Academy of Sciences: Technical Sciences %D 2018 %T Theory II: Deep learning and optimization %A Tomaso Poggio %A Qianli Liao %XThe landscape of the empirical risk of overparametrized deep convolutional neural networks (DCNNs) is characterized with a mix of theory and experiments. In part A we show the existence of a large number of global minimizers with zero empirical error (modulo inconsistent equations). The argument which relies on the use of Bezout theorem is rigorous when the RELUs are replaced by a polynomial nonlinearity. We show with simulations that the corresponding polynomial network is indistinguishable from the RELU network. According to Bezout theorem, the global minimizers are degenerate unlike the local minima which in general should be non-degenerate. Further we experimentally analyzed and visualized the landscape of empirical risk of DCNNs on CIFAR-10 dataset. Based on above theoretical and experimental observations, we propose a simple model of the landscape of empirical risk. In part B, we characterize the optimization properties of stochastic gradient descent applied to deep networks. The main claim here consists of theoretical and experimental evidence for the following property of SGD: SGD concentrates in probability – like the classical Langevin equation – on large volume, ”flat” minima, selecting with high probability degenerate minimizers which are typically global minimizers.

%B Bulletin of the Polish Academy of Sciences: Technical Sciences %V 66 %G eng %N 6 %R 10.24425/bpas.2018.125925 %0 Generic %D 2018 %T Theory III: Dynamics and Generalization in Deep Networks %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Tomaso Poggio %A Lorenzo Rosasco %A Jack Hidary %A Fernanda De La Torre %XThe key to generalization is controlling the complexity of the network. However, there is no obvious control of complexity -- such as an explicit regularization term -- in the training of deep networks for classification. We will show that a classical form of norm control -- but kind of hidden -- is present in deep networks trained with gradient descent techniques on exponential-type losses. In particular, gradient descent induces a dynamics of the normalized weights which converge for $t \to \infty$ to an equilibrium which corresponds to a minimum norm (or maximum margin) solution. For sufficiently large but finite $\rho$ -- and thus finite $t$ -- the dynamics converges to one of several margin maximizers, with the margin monotonically increasing towards a limit stationary point of the flow. In the usual case of stochastic gradient descent, most of the stationary points are likely to be convex minima corresponding to a regularized, constrained minimizer -- the network with normalized weights -- which is stable and has asymptotically zero generalization gap for $N \to \infty$, where $N$ is the number of training examples. For finite, fixed $N$ the generalization gap may not be zero but the minimum norm property of the solution can provide, we conjecture, good expected performance for suitable data distributions. Our approach extends some of the results of Srebro from linear networks to deep networks and provides a new perspective on the implicit bias of gradient descent. We believe that the elusive complexity control we describe is responsible for the puzzling empirical finding of good predictive performance by deep networks, despite overparametrization.

http://hdl.handle.net/1721.1/116692

%0 Journal Article %D 2017 %T Compression of Deep Neural Networks for Image Instance Retrieval %A Vijay Chandrasekhar %A Jie Lin %A Qianli Liao %A Olivier Morère %A Antoine Veillard %A Lingyu Duan %A Tomaso Poggio %XImage instance retrieval is the problem of retrieving images from a database which contain the same object. Convolutional Neural Network (CNN) based descriptors are becoming the dominant approach for generating {\it global image descriptors} for the instance retrieval problem. One major drawback of CNN-based {\it global descriptors} is that uncompressed deep neural network models require hundreds of megabytes of storage making them inconvenient to deploy in mobile applications or in custom hardware. In this work, we study the problem of neural network model compression focusing on the image instance retrieval task. We study quantization, coding, pruning and weight sharing techniques for reducing model size for the instance retrieval problem. We provide extensive experimental results on the trade-off between retrieval performance and model size for different types of networks on several data sets providing the most comprehensive study on this topic. We compress models to the order of a few MBs: two orders of magnitude smaller than the uncompressed models while achieving negligible loss in retrieval performance.

%8 01/2017 %G eng %U https://arxiv.org/abs/1701.04923 %0 Generic %D 2017 %T Musings on Deep Learning: Properties of SGD %A Chiyuan Zhang %A Qianli Liao %A Alexander Rakhlin %A Karthik Sridharan %A Brando Miranda %A Noah Golowich %A Tomaso Poggio %X[*formerly titled "Theory of Deep Learning III: Generalization Properties of SGD"*]

In Theory III we characterize with a mix of theory and experiments the generalization properties of Stochastic Gradient Descent in overparametrized deep convolutional networks. We show that Stochastic Gradient Descent (SGD) selects with high probability solutions that 1) have zero (or small) empirical error, 2) are degenerate as shown in Theory II and 3) have maximum generalization.

%8 04/2017 %2http://hdl.handle.net/1721.1/107841

%0 Generic %D 2017 %T Object-Oriented Deep Learning %A Qianli Liao %A Tomaso Poggio %XWe investigate an unconventional direction of research that aims at converting neural networks, a class of distributed, connectionist, sub-symbolic models into a symbolic level with the ultimate goal of achieving AI interpretability and safety. To that end, we propose Object-Oriented Deep Learning, a novel computational paradigm of deep learning that adopts interpretable “objects/symbols” as a basic representational atom instead of N-dimensional tensors (as in traditional “feature-oriented” deep learning). For visual processing, each “object/symbol” can explicitly package common properties of visual objects like its position, pose, scale, probability of being an object, pointers to parts, etc., providing a full spectrum of interpretable visual knowledge throughout all layers. It achieves a form of “symbolic disentanglement”, offering one solution to the important problem of disentangled representations and invariance. Basic computations of the network include predicting high-level objects and their properties from low-level objects and binding/aggregating relevant objects together. These computations operate at a more fundamental level than convolutions, capturing convolution as a special case while being significantly more general than it. All operations are executed in an input-driven fashion, thus sparsity and dynamic computation per sample are naturally supported, complementing recent popular ideas of dynamic networks and may enable new types of hardware accelerations. We experimentally show on CIFAR-10 that it can perform flexible visual processing, rivaling the performance of ConvNet, but without using any convolution. Furthermore, it can generalize to novel rotations of images that it was not trained for.

%8 10/2017 %2http://hdl.handle.net/1721.1/112103

%0 Generic %D 2017 %T Theory II: Landscape of the Empirical Risk in Deep Learning %A Tomaso Poggio %A Qianli Liao %XPrevious theoretical work on deep learning and neural network optimization tends to focus on avoiding saddle points and local minima. However, the practical observation is that, at least for the most successful Deep Convolutional Neural Networks (DCNNs) for visual processing, practitioners can always increase the network size to fit the training data (an extreme example would be [1]). The most successful DCNNs such as VGG and ResNets are best used with a small degree of "overparametrization". In this work, we characterize with a mix of theory and experiments, the landscape of the empirical risk of overparametrized DCNNs. We first prove the existence of a large number of degenerate global minimizers with zero empirical error (modulo inconsistent equations). The zero-minimizers -- in the case of classification -- have a non-zero margin. The same minimizers are degenerate and thus very likely to be found by SGD that will furthermore select with higher probability the zero-minimizer with larger margin, as discussed in Theory III (to be released). We further experimentally explored and visualized the landscape of empirical risk of a DCNN on CIFAR-10 during the entire training process and especially the global minima. Finally, based on our theoretical and experimental results, we propose an intuitive model of the landscape of DCNN's empirical loss surface, which might not be as complicated as people commonly believe.

%8 03/2017 %1 %2http://hdl.handle.net/1721.1/107787

%0 Generic %D 2017 %T Theory of Deep Learning IIb: Optimization Properties of SGD %A Chiyuan Zhang %A Qianli Liao %A Alexander Rakhlin %A Brando Miranda %A Noah Golowich %A Tomaso Poggio %XIn Theory IIb we characterize with a mix of theory and experiments the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: *SGD concentrates in probability - like the classical Langevin equation – on large volume, “flat” minima, selecting flat minimizers which are with very high probability also global minimizers*.

http://hdl.handle.net/1721.1/115407

%0 Generic %D 2017 %T Theory of Deep Learning III: explaining the non-overfitting puzzle %A Tomaso Poggio %A Kenji Kawaguchi %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Xavier Boix %A Jack Hidary %A Hrushikesh Mhaskar %X**THIS MEMO IS REPLACED BY CBMM MEMO 90**

A main puzzle of deep networks revolves around the absence of overfitting despite overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamical systems associated with gradient descent minimization of nonlinear networks behave near zero stable minima of the empirical error as a gradient system in a quadratic potential with degenerate Hessian. The proposition is supported by theoretical and numerical results, under the assumption of stable minima of the gradient.

Our proposition provides the extension to deep networks of key properties of gradient descent methods for linear networks that, as suggested in (1), can be the key to understanding generalization. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, asymptotically converging to the minimum norm solution. This implies that there is usually an optimum early stopping that avoids overfitting of the loss (this is relevant mainly for regression). For classification, the asymptotic convergence to the minimum norm solution implies convergence to the maximum margin solution which guarantees good classification error for “low noise” datasets.

The implied robustness to overparametrization has suggestive implications for the robustness of deep hierarchically local networks to variations of the architecture with respect to the curse of dimensionality.

%8 12/2017 %1 %2http://hdl.handle.net/1721.1/113003
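
The linear-network property that the abstract above builds on (gradient descent converging to the minimum norm solution) can be checked on a toy problem. The sketch below is an illustration under stated assumptions, not code from the memo: a single equation a . w = y in two unknowns, square loss, and a zero initialization, which is the assumption under which the minimum norm limit holds. The names `gradient_descent`, `a`, `y`, and the step size are hypothetical.

```python
# Sketch: gradient descent on an underdetermined least-squares problem.
# Assumption: one equation a . w = y with two unknowns, square loss.

def gradient_descent(a, y, w0, lr=0.1, steps=200):
    """Minimize 0.5 * (a . w - y)^2 by plain gradient descent from w0."""
    w = list(w0)
    for _ in range(steps):
        residual = sum(ai * wi for ai, wi in zip(a, w)) - y
        # The gradient is residual * a, so every update lies along a.
        w = [wi - lr * residual * ai for ai, wi in zip(a, w)]
    return w

def norm(w):
    """Euclidean norm of a vector stored as a list."""
    return sum(wi * wi for wi in w) ** 0.5

a, y = [1.0, 2.0], 5.0

# Closed-form minimum norm interpolant: a * y / ||a||^2.
w_min_norm = [ai * y / sum(x * x for x in a) for ai in a]

w_from_zero = gradient_descent(a, y, [0.0, 0.0])    # starts in the span of a
w_from_other = gradient_descent(a, y, [2.0, -1.0])  # starts orthogonal to a
```

Both runs fit the equation, but only the run started at zero lands on the minimum norm interpolant; the other keeps the component of its initialization that is orthogonal to `a`, so the choice of initialization matters in this toy setting.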

%0 Journal Article %J Current Biology %D 2017 %T View-Tolerant Face Recognition and Hebbian Learning Imply Mirror-Symmetric Neural Tuning to Head Orientation %A JZ. Leibo %A Qianli Liao %A F. Anselmi %A W. A. Freiwald %A Tomaso Poggio %XThe primate brain contains a hierarchy of visual areas, dubbed the ventral stream, which rapidly computes object representations that are both specific for object identity and robust against identity-preserving transformations, like depth rotations. Current computational models of object recognition, including recent deep-learning networks, generate these properties through a hierarchy of alternating selectivity-increasing filtering and tolerance-increasing pooling operations, similar to simple-complex cells operations. Here, we prove that a class of hierarchical architectures and a broad set of biologically plausible learning rules generate approximate invariance to identity-preserving transformations at the top level of the processing hierarchy. However, all past models tested failed to reproduce the most salient property of an intermediate representation of a three-level face-processing hierarchy in the brain: mirror-symmetric tuning to head orientation. Here, we demonstrate that one specific biologically plausible Hebb-type learning rule generates mirror-symmetric tuning to bilaterally symmetric stimuli, like faces, at intermediate levels of the architecture and show why it does so. Thus, the tuning properties of individual cells inside the visual stream appear to result from group properties of the stimuli they encode and to reflect the learning rules that sculpted the information-processing system within which they reside.

%B Current Biology %V 27 %P 1-6 %8 01/2017 %G eng %R http://dx.doi.org/10.1016/j.cub.2016.10.015 %0 Conference Proceedings %B AAAI-17: Thirty-First AAAI Conference on Artificial Intelligence %D 2017 %T When and Why Are Deep Networks Better Than Shallow Ones? %A Hrushikesh Mhaskar %A Qianli Liao %A Tomaso Poggio %XWhile the universal approximation property holds both for hierarchical and shallow networks, deep networks can approximate the class of compositional functions as well as shallow networks but with exponentially lower number of training parameters and sample complexity. Compositional functions are obtained as a hierarchy of local constituent functions, where "local functions" are functions with low dimensionality. This theorem proves an old conjecture by Bengio on the role of depth in networks, characterizing precisely the conditions under which it holds. It also suggests possible answers to the puzzle of why high-dimensional deep networks trained on large training sets often do not seem to overfit.

%B AAAI-17: Thirty-First AAAI Conference on Artificial Intelligence
%G eng
%0 Journal Article
%J International Journal of Automation and Computing
%D 2017
%T Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review
%A Tomaso Poggio
%A Hrushikesh Mhaskar
%A Lorenzo Rosasco
%A Brando Miranda
%A Qianli Liao
%K convolutional neural networks
%K deep and shallow networks
%K deep learning
%K function approximation
%K Machine Learning
%K Neural Networks
%X The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

%B International Journal of Automation and Computing %P 1-17 %8 03/2017 %G eng %U http://link.springer.com/article/10.1007/s11633-017-1054-2?wt_mc=Internal.Event.1.SEM.ArticleAuthorOnlineFirst %R 10.1007/s11633-017-1054-2 %0 Generic %D 2016 %T Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex %A Qianli Liao %A Tomaso Poggio %XWe discuss relations between Residual Networks (ResNet), Recurrent Neural Networks (RNNs) and the primate visual cortex. We begin with the observation that a shallow RNN is exactly equivalent to a very deep ResNet with weight sharing among the layers. A direct implementation of such a RNN, although having orders of magnitude fewer parameters, leads to a performance similar to the corresponding ResNet. We propose 1) a generalization of both RNN and ResNet architectures and 2) the conjecture that a class of moderately deep RNNs is a biologically-plausible model of the ventral stream in visual cortex. We demonstrate the effectiveness of the architectures by testing them on the CIFAR-10 dataset.

%8 04/2016 %1 %2http://hdl.handle.net/1721.1/102238
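
The observation in the abstract above, that a shallow RNN matches a very deep ResNet whose layers share weights, can be illustrated by unrolling. The sketch below assumes a simple residual update h + relu(W h); that block form, the function names, and the toy values are assumptions made for illustration and are not taken from the memo.

```python
# Sketch: a weight-tied ResNet and an unrolled RNN compute the same map.
# Assumption: each block computes h + relu(W h).

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def residual_block(W, h):
    """One residual update: h + relu(W h)."""
    return [x + r for x, r in zip(h, relu(matvec(W, h)))]

def resnet_forward(layer_weights, h):
    """Deep ResNet: one weight matrix per layer."""
    for W in layer_weights:
        h = residual_block(W, h)
    return h

def rnn_forward(W, h, steps):
    """Shallow RNN: the same W is reused at every time step."""
    for _ in range(steps):
        h = residual_block(W, h)
    return h

W = [[0.2, -0.5],
     [0.4, 0.1]]
h0 = [1.0, 2.0]
depth = 10

out_resnet = resnet_forward([W] * depth, h0)  # weights shared across layers
out_rnn = rnn_forward(W, h0, depth)           # one layer applied depth times

# The RNN stores one matrix; an untied ResNet would store `depth` of them.
params_rnn = len(W) * len(W[0])
params_untied_resnet = depth * params_rnn
```

With tied weights the two forward passes run the same sequence of operations, so their outputs agree exactly, while the parameter count differs by a factor equal to the depth.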

%0 Conference Paper %B Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) %D 2016 %T How Important Is Weight Symmetry in Backpropagation? %A Qianli Liao %A JZ. Leibo %A Tomaso Poggio %XGradient backpropagation (BP) requires symmetric feedforward and feedback connections -- the same weights must be used for forward and backward passes. This "weight transport problem" (Grossberg 1987) is thought to be one of the main reasons to doubt BP's biological plausibility. Using 15 different classification datasets, we systematically investigate to what extent BP really depends on weight symmetry. In a study that turned out to be surprisingly similar in spirit to Lillicrap et al.'s demonstration (Lillicrap et al. 2014) but orthogonal in its results, our experiments indicate that: (1) the magnitudes of feedback weights do not matter to performance; (2) the signs of feedback weights do matter -- the more concordant signs between feedforward and their corresponding feedback connections, the better; (3) with feedback weights having random magnitudes and 100% concordant signs, we were able to achieve the same or even better performance than SGD; (4) some normalizations/stabilizations are indispensable for such asymmetric BP to work, namely Batch Normalization (BN) (Ioffe and Szegedy 2015) and/or a "Batch Manhattan" (BM) update rule.

%B Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) %I Association for the Advancement of Artificial Intelligence %C Phoenix, AZ. %G eng %U https://cbmm.mit.edu/sites/default/files/publications/liao-leibo-poggio.pdf %0 Generic %D 2016 %T Learning Functions: When Is Deep Better Than Shallow %A Hrushikesh Mhaskar %A Qianli Liao %A Tomaso Poggio %XWhile the universal approximation property holds both for hierarchical and shallow networks, we prove that deep (hierarchical) networks can approximate the class of compositional functions with the same accuracy as shallow networks but with exponentially lower number of training parameters as well as VC-dimension. This theorem settles an old conjecture by Bengio on the role of depth in networks. We then define a general class of scalable, shift-invariant algorithms to show a simple and natural set of requirements that justify deep convolutional networks.

%U https://arxiv.org/pdf/1603.00988v4.pdf %1 %2http://hdl.handle.net/1721.1/101635

%0 Generic %D 2016 %T Streaming Normalization: Towards Simpler and More Biologically-plausible Normalizations for Online and Recurrent Learning %A Qianli Liao %A Kenji Kawaguchi %A Tomaso Poggio %Xhttp://hdl.handle.net/1721.1/104906

%0 Generic %D 2016 %T Theory I: Why and When Can Deep Networks Avoid the Curse of Dimensionality? %A Tomaso Poggio %A Hrushikesh Mhaskar %A Lorenzo Rosasco %A Brando Miranda %A Qianli Liao %X[formerly titled "*Why and When Can Deep - but Not Shallow - Networks Avoid the Curse of Dimensionality: a Review*"]

The paper reviews and extends an emerging body of theoretical results on deep learning including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represent an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems and conjectures.

%8 11/2016 %1https://arxiv.org/abs/1611.00740v5

%2http://hdl.handle.net/1721.1/105443

%0 Generic %D 2016 %T View-tolerant face recognition and Hebbian learning imply mirror-symmetric neural tuning to head orientation %A JZ. Leibo %A Qianli Liao %A W. A. Freiwald %A F. Anselmi %A Tomaso Poggio %XThe primate brain contains a hierarchy of visual areas, dubbed the ventral stream, which rapidly computes object representations that are both specific for object identity and relatively robust against identity-preserving transformations like depth-rotations [33, 32, 23, 13]. Current computational models of object recognition, including recent deep learning networks, generate these properties through a hierarchy of alternating selectivity-increasing filtering and tolerance-increasing pooling operations, similar to simple-complex cell operations [46, 8, 44, 29]. While simulations of these models recapitulate the ventral stream’s progression from early view-specific to late view-tolerant representations, they fail to generate the most salient property of the intermediate representation for faces found in the brain: mirror-symmetric tuning of the neural population to head orientation [16]. Here we prove that a class of hierarchical architectures and a broad set of biologically plausible learning rules can provide approximate invariance at the top level of the network. While most of the learning rules do not yield mirror-symmetry in the mid-level representations, we characterize a specific biologically-plausible Hebb-type learning rule that is guaranteed to generate mirror-symmetric tuning to faces at intermediate levels of the architecture.

%8 06/2016 %1arXiv:1606.01552v1 [cs.NE]

%2http://hdl.handle.net/1721.1/103394

%0 Generic %D 2015 %T How Important is Weight Symmetry in Backpropagation? %A Qianli Liao %A JZ. Leibo %A Tomaso Poggio %XGradient backpropagation (BP) requires symmetric feedforward and feedback connections -- the same weights must be used for forward and backward passes. This “weight transport problem” [1] is thought to be one of the main reasons for BP’s biological implausibility. Using 15 different classification datasets, we systematically study to what extent BP really depends on weight symmetry. In a study that turned out to be surprisingly similar in spirit to Lillicrap et al.’s demonstration [2] but orthogonal in its results, our experiments indicate that: (1) the magnitudes of feedback weights do not matter to performance; (2) the signs of feedback weights do matter -- the more concordant signs between feedforward and their corresponding feedback connections, the better; (3) with feedback weights having random magnitudes and 100% concordant signs, we were able to achieve the same or even better performance than SGD; (4) some normalizations/stabilizations are indispensable for such asymmetric BP to work, namely Batch Normalization (BN) [3] and/or a “Batch Manhattan” (BM) update rule.

%8 11/29/2015 %1http://arxiv.org/abs/1510.05067

%2http://hdl.handle.net/1721.1/100797

%0 Journal Article %J PLOS Computational Biology %D 2015 %T The Invariance Hypothesis Implies Domain-Specific Regions in Visual Cortex %A JZ. Leibo %A Qianli Liao %A F. Anselmi %A Tomaso Poggio %XIs visual cortex made up of general-purpose information processing machinery, or does it consist of a collection of specialized modules? If prior knowledge, acquired from learning a set of objects, is only transferable to new objects that share properties with the old, then the recognition system’s optimal organization must be one containing specialized modules for different object classes. Our analysis starts from a premise we call the invariance hypothesis: that the computational goal of the ventral stream is to compute an invariant-to-transformations and discriminative signature for recognition. The key condition enabling approximate transfer of invariance without sacrificing discriminability turns out to be that the learned and novel objects transform similarly. This implies that the optimal recognition system must contain subsystems trained only with data from similarly-transforming objects and suggests a novel interpretation of domain-specific regions like the fusiform face area (FFA). Furthermore, we can define an index of transformation-compatibility, computable from videos, that can be combined with information about the statistics of natural vision to yield predictions for which object categories ought to have domain-specific regions, in agreement with the available data. The result is a unifying account linking the large literature on view-based recognition with the wealth of experimental evidence concerning domain-specific regions.

%B PLOS Computational Biology %V 11 %P e1004390 %8 10/23/2015 %G eng %U http://dx.plos.org/10.1371/journal.pcbi.1004390 %N 10 %! Invariance and Domain Specificity %R 10.1371/journal.pcbi.1004390 %0 Generic %D 2015 %T The Invariance Hypothesis Implies Domain-Specific Regions in Visual Cortex %A JZ. Leibo %A Qianli Liao %A F. Anselmi %A Tomaso Poggio %8 07/2015 %0 Generic %D 2014 %T Can a biologically-plausible hierarchy effectively replace face detection, alignment, and recognition pipelines? %A Qianli Liao %A JZ. Leibo %A Youssef Mroueh %A Tomaso Poggio %K Computer vision %K Face recognition %K Hierarchy %K Invariance %XThe standard approach to unconstrained face recognition in natural photographs is via a detection, alignment, recognition pipeline. While that approach has achieved impressive results, there are several reasons to be dissatisfied with it, among them its lack of biological plausibility. A recent theory of invariant recognition by feedforward hierarchical networks, like HMAX, other convolutional networks, or possibly the ventral stream, implies an alternative approach to unconstrained face recognition. This approach accomplishes detection and alignment implicitly by storing transformations of training images (called templates) rather than explicitly detecting and aligning faces at test time. Here we propose a particular locality-sensitive hashing based voting scheme which we call “consensus of collisions” and show that it can be used to approximate the full 3-layer hierarchy implied by the theory. The resulting end-to-end system for unconstrained face recognition operates on photographs of faces taken under natural conditions, e.g., Labeled Faces in the Wild (LFW), without aligning or cropping them, as is normally done. It achieves a drastic improvement in the state of the art on this end-to-end task, reaching the same level of performance as the best systems operating on aligned, closely cropped images (no outside training data).
It also performs well on two newer datasets, similar to LFW, but more difficult: LFW-jittered (new here) and SUFR-W.

%8 03/2014 %1 %2http://hdl.handle.net/1721.1/100164

%0 Generic %D 2014 %T The Invariance Hypothesis Implies Domain-Specific Regions in Visual Cortex %A JZ. Leibo %A Qianli Liao %A F. Anselmi %A Tomaso Poggio %K Neuroscience %K Theories for Intelligence %XIs visual cortex made up of general-purpose information processing machinery, or does it consist of a collection of specialized modules? If prior knowledge, acquired from learning a set of objects, is only transferable to new objects that share properties with the old, then the recognition system’s optimal organization must be one containing specialized modules for different object classes. Our analysis starts from a premise we call the invariance hypothesis: that the computational goal of the ventral stream is to compute an invariant-to-transformations and discriminative signature for recognition. The key condition enabling approximate transfer of invariance without sacrificing discriminability turns out to be that the learned and novel objects transform similarly. This implies that the optimal recognition system must contain subsystems trained only with data from similarly-transforming objects and suggests a novel interpretation of domain-specific regions like the fusiform face area (FFA). Furthermore, we can define an index of transformation-compatibility, computable from videos, that can be combined with information about the statistics of natural vision to yield predictions for which object categories ought to have domain-specific regions. The result is a unifying account linking the large literature on view-based recognition with the wealth of experimental evidence concerning domain-specific regions.

%8 04/2014 %G eng %U http://biorxiv.org/lookup/doi/10.1101/004473 %2http://hdl.handle.net/1721.1/100168

%R 10.1101/004473 %0 Conference Paper %B NIPS 2013 %D 2014 %T Learning invariant representations and applications to face verification %A Qianli Liao %A JZ. Leibo %A Tomaso Poggio %K Computer vision %XOne approach to computer object recognition and modeling the brain’s ventral stream involves unsupervised learning of representations that are invariant to common transformations. However, applications of these ideas have usually been limited to 2D affine transformations, e.g., translation and scaling, since they are easiest to solve via convolution. In accord with a recent theory of transformation-invariance [1], we propose a model that, while capturing other common convolutional networks as special cases, can also be used with arbitrary identity-preserving transformations. The model’s wiring can be learned from videos of transforming objects—or any other grouping of images into sets by their depicted object. Through a series of successively more complex empirical tests, we study the invariance/discriminability properties of this model with respect to different transformations. First, we empirically confirm theoretical predictions (from [1]) for the case of 2D affine transformations. Next, we apply the model to non-affine transformations; as expected, it performs well on face verification tasks requiring invariance to the relatively smooth transformations of 3D rotation-in-depth and changes in illumination direction. Surprisingly, it can also tolerate clutter “transformations” which map an image of a face on one background to an image of the same face on a different background. Motivated by these empirical findings, we tested the same model on face verification benchmark tasks from the computer vision literature: Labeled Faces in the Wild, PubFig [2, 3, 4] and a new dataset we gathered—achieving strong performance in these highly unconstrained cases as well.

%B NIPS 2013 %I Advances in Neural Information Processing Systems 26 %C Lake Tahoe, Nevada %8 02/2014 %G eng %U http://nips.cc/Conferences/2013/Program/event.php?ID=4074 %0 Generic %D 2014 %T Subtasks of unconstrained face recognition %A JZ. Leibo %A Qianli Liao %A Tomaso Poggio %K Computer vision %XThis package contains:

1. SUFR-W, a dataset of “in the wild” natural images of faces gathered from the internet. The protocol used to create the dataset is described in Leibo, Liao and Poggio (2014).

2. The full set of SUFR synthetic datasets, called the “Subtasks of Unconstrained Face Recognition Challenge” in Leibo, Liao and Poggio (2014).

%8 01/2014 %0 Generic %D 2014 %T Subtasks of Unconstrained Face Recognition %A JZ. Leibo %A Qianli Liao %A Tomaso Poggio %K Face identification %K Invariance %K Labeled Faces in the Wild %K Same-different matching %K Synthetic data %XUnconstrained face recognition remains a challenging computer vision problem despite recent exceptionally high results (∼ 95% accuracy) on the current gold standard evaluation dataset: Labeled Faces in the Wild (LFW) (Huang et al., 2008; Chen et al., 2013). We offer a decomposition of the unconstrained problem into subtasks based on the idea that invariance to identity-preserving transformations is the crux of recognition. Each of the subtasks in the Subtasks of Unconstrained Face Recognition (SUFR) challenge consists of a same-different face-matching problem on a set of 400 individual synthetic faces rendered so as to isolate a specific transformation or set of transformations. We characterized the performance of 9 different models (8 previously published) on each of the subtasks. One notable finding was that the HMAX-C2 feature was not nearly as clutter-resistant as had been suggested by previous publications (Leibo et al., 2010; Pinto et al., 2011). Next we considered LFW and argued that it is too easy a task to continue to be regarded as a measure of progress on unconstrained face recognition. In particular, strong performance on LFW requires almost no invariance, yet it cannot be considered a fair approximation of the outcome of a detection→alignment pipeline since it does not contain the kinds of variability that realistic alignment systems produce when working on non-frontal faces. We offer a new, more difficult, natural image dataset: SUFR-in-the-Wild (SUFR-W), which we created using a protocol that was similar to LFW, but with a few differences designed to produce more need for transformation invariance.
We present baseline results for eight different face recognition systems on the new dataset and argue that it is time to retire LFW and move on to more difficult evaluations for unconstrained face recognition.

%I 9th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications. (VISAPP). %C Lisbon, Portugal %8 01/2014 %0 Generic %D 2014 %T Unsupervised learning of clutter-resistant visual representations from natural videos. %A Qianli Liao %A JZ. Leibo %A Tomaso Poggio %XPopulations of neurons in inferotemporal cortex (IT) maintain an explicit code for object identity that also tolerates transformations of object appearance e.g., position, scale, viewing angle [1, 2, 3]. Though the learning rules are not known, recent results [4, 5, 6] suggest the operation of an unsupervised temporal-association-based method e.g., Foldiak’s trace rule [7]. Such methods exploit the temporal continuity of the visual world by assuming that visual experience over short timescales will tend to have invariant identity content. Thus, by associating representations of frames from nearby times, a representation that tolerates whatever transformations occurred in the video may be achieved. Many previous studies verified that such rules can work in simple situations without background clutter, but the presence of visual clutter has remained problematic for this approach. Here we show that temporal association based on large class-specific filters (templates) avoids the problem of clutter. Our system learns in an unsupervised way from natural videos gathered from the internet, and is able to perform a difficult unconstrained face recognition task on natural images (Labeled Faces in the Wild [8]).

%8 09/2014 %1 %2http://hdl.handle.net/1721.1/100187