%0 Conference Paper %B IEEE International Conference on Acoustics, Speech and Signal Processing %D 2022 %T Neural Collapse in Deep Homogeneous Classifiers and the role of Weight Decay %A Andrzej Banburski %A Akshay Rangamani %B IEEE International Conference on Acoustics, Speech and Signal Processing %C Singapore %8 05/2022 %G eng %0 Generic %D 2022 %T PCA as a defense against some adversaries %A Aparna Gupte %A Andrzej Banburski %A Tomaso Poggio %X

Neural network classifiers are known to be highly vulnerable to adversarial perturbations in their inputs. Under the hypothesis that adversarial examples lie outside of the sub-manifold of natural images, previous work has investigated the impact of principal components in data on adversarial robustness. In this paper we show that there exists a very simple defense mechanism in the case where adversarial images are separable in a previously defined $(k,p)$ metric. This defense is very successful against the popular Carlini-Wagner attack, but less so against some other common attacks like FGSM. It is interesting to note that the defense is still successful for relatively large perturbations.
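
The hypothesis that adversarial examples lie outside the sub-manifold of natural data can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the defense described in the abstract: it projects an input onto the top-k principal components of clean data, which removes any perturbation component orthogonal to that subspace. The names `fit_pca` and `project`, the synthetic data, and the choice of `k` are all hypothetical.

```python
import numpy as np

def fit_pca(X, k):
    """Return the data mean and the top-k principal directions (rows)."""
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def project(x, mean, components):
    """Project x onto the affine subspace spanned by the principal directions."""
    return mean + (x - mean) @ components.T @ components

# Synthetic "natural" data lying on a 2-dimensional subspace of R^5 (assumption).
rng = np.random.default_rng(0)
basis = rng.standard_normal((2, 5))
X = rng.standard_normal((200, 2)) @ basis

mean, components = fit_pca(X, k=2)

# Build a perturbation that is orthogonal to the data subspace.
r = rng.standard_normal(5)
delta = r - r @ components.T @ components

x_clean = X[0]
x_defended = project(x_clean + delta, mean, components)
```

In this toy setting the projection recovers the clean point exactly because the perturbation lies entirely off the data subspace; real images and real attacks need not behave this way.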

%0 Generic %D 2021 %T Distribution of Classification Margins: Are All Data Equal? %A Andrzej Banburski %A Fernanda De La Torre %A Nishka Pant %A Ishana Shastri %A Tomaso Poggio %X

Recent theoretical results show that gradient descent on deep neural networks under exponential loss functions locally maximizes classification margin, which is equivalent to minimizing the norm of the weight matrices under margin constraints. This property of the solution however does not fully characterize the generalization performance. We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization. We then show that, after data separation is achieved, it is possible to dynamically reduce the training set by more than 99% without significant loss of performance. Interestingly, the resulting subset of “high capacity” features is not consistent across different training runs, which is consistent with the theoretical claim that all training points should converge to the same asymptotic margin under SGD and in the presence of both batch normalization and weight decay.

https://hdl.handle.net/1721.1/129744
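
A minimal sketch of how an area under the margin distribution could be computed is given here. It assumes the margin of a training point is the score of its true class minus the largest competing score, and that the scores come from an appropriately normalized network; the exact definition used in the paper may differ. The names `classification_margins` and `margin_curve_area` and the example scores are hypothetical.

```python
import numpy as np

def classification_margins(scores, labels):
    """Margin per sample: true-class score minus the best other-class score."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    true_scores = scores[idx, labels]
    others = scores.copy()
    others[idx, labels] = -np.inf  # mask out the true class
    return true_scores - others.max(axis=1)

def margin_curve_area(margins):
    """Trapezoidal area under the sorted margins, with the index scaled to [0, 1]."""
    sorted_margins = np.sort(margins)
    x = np.linspace(0.0, 1.0, len(sorted_margins))
    heights = (sorted_margins[1:] + sorted_margins[:-1]) / 2.0
    return float(np.sum(heights * np.diff(x)))

# Toy scores for three samples and three classes (illustrative values only).
scores = [[2.0, 0.0, 1.0],
          [0.5, 1.5, 0.0],
          [0.0, 1.0, 3.0]]
labels = [0, 1, 2]

margins = classification_margins(scores, labels)
area = margin_curve_area(margins)
```

Scaling the index to [0, 1] makes the area comparable across training sets of different sizes, which is a design choice of this sketch.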

%0 Generic %D 2021 %T Dynamics and Neural Collapse in Deep Classifiers trained with the Square Loss %A M. Xu %A Akshay Rangamani %A Andrzej Banburski %A Q. Liao %A Tomer Galanti %A Tomaso Poggio %X

We overview several properties -- old and new -- of training overparametrized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous ReLU networks. We study the convergence to a solution with the absolute minimum $\rho$, which is the product of the Frobenius norms of each layer weight matrix, when normalization by Lagrange multipliers (LM) is used together with Weight Decay (WD) under different forms of gradient descent. A main property of the minimizers that bounds their expected error {\it for a specific network architecture} is $\rho$. In particular, we derive novel norm-based bounds for convolutional layers that are orders of magnitude better than classical bounds for dense networks. Next we prove that quasi-interpolating solutions obtained by Stochastic Gradient Descent (SGD) in the presence of WD have a bias towards low rank weight matrices -- that, as we also explain, should improve generalization. The same analysis predicts the existence of an inherent SGD noise for deep networks. In both cases, we verify our predictions experimentally. We then predict Neural Collapse and its properties without any specific assumption -- unlike other published proofs. Our analysis supports the idea that the advantage of deep networks relative to other classifiers is greater for the problems that are appropriate for sparse deep architectures such as CNNs. The deep reason is that compositionally sparse target functions can be approximated well by ``sparse'' deep networks without incurring the curse of dimensionality.
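
The quantity $\rho$ defined above, the product of the Frobenius norms of the layer weight matrices, can be sketched numerically. The example assumes a bias-free fully connected ReLU network, for which positive homogeneity gives f(x; W) = rho * f(x; V), where V denotes the layers rescaled to unit Frobenius norm. The function names and layer shapes are hypothetical.

```python
import numpy as np

def rho(weights):
    """Product of the Frobenius norms of all layer weight matrices."""
    return float(np.prod([np.linalg.norm(W, ord="fro") for W in weights]))

def normalize_layers(weights):
    """Rescale every layer to unit Frobenius norm."""
    return [W / np.linalg.norm(W, ord="fro") for W in weights]

def relu_net(weights, x):
    """Bias-free ReLU network: ReLU after every layer except the last."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 5)),
           rng.standard_normal((8, 8)),
           rng.standard_normal((1, 8))]
x = rng.standard_normal(5)

output = relu_net(weights, x)
rescaled = rho(weights) * relu_net(normalize_layers(weights), x)
```

The two outputs agree up to floating-point error, which is why the classification decision depends only on the normalized weights while $\rho$ acts as an overall scale.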

%0 Generic %D 2021 %T Evaluating the Adversarial Robustness of a Foveated Texture Transform Module in a CNN %A Jonathan Gant %A Andrzej Banburski %A Arturo Deza %A Tomaso Poggio %B NeurIPS 2021 %8 12/2021 %U https://nips.cc/Conferences/2021/Schedule?showEvent=21868 %0 Generic %D 2020 %T Biologically Inspired Mechanisms for Adversarial Robustness %A Manish Vuyyuru Reddy %A Andrzej Banburski %A Nishka Pant %A Tomaso Poggio %X

A convolutional neural network strongly robust to adversarial perturbations at reasonable computational and performance cost has not yet been demonstrated. The primate visual ventral stream seems to be robust to small perturbations in visual stimuli but the underlying mechanisms that give rise to this robust perception are not understood. In this work, we investigate the role of two biologically plausible mechanisms in adversarial robustness. We demonstrate that the non-uniform sampling performed by the primate retina and the presence of multiple receptive fields with a range of receptive field sizes at each eccentricity improve the robustness of neural networks to small adversarial perturbations. We verify that these two mechanisms do not suffer from gradient obfuscation and study their contribution to adversarial robustness through ablation studies.

%8 06/2020 %2

https://hdl.handle.net/1721.1/125981

%0 Journal Article %J Nature Communications %D 2020 %T Complexity Control by Gradient Descent in Deep Networks %A Tomaso Poggio %A Qianli Liao %A Andrzej Banburski %X

Overparametrized deep networks predict well despite the lack of an explicit complexity control during training such as an explicit regularization term. For exponential-type loss functions, we solve this puzzle by showing an effective regularization effect of gradient descent in terms of the normalized weights that are relevant for classification.

%B Nature Communications %V 11 %8 02/2020 %G eng %U https://www.nature.com/articles/s41467-020-14663-9 %R https://doi.org/10.1038/s41467-020-14663-9 %0 Generic %D 2020 %T Dreaming with ARC %A Andrzej Banburski %A Anshula Gandhi %A Simon Alford %A Sylee Dandekar %A Peter Chin %A Tomaso Poggio %X

Current machine learning algorithms are highly specialized to whatever it is they are meant to do –– e.g. playing chess, picking up objects, or object recognition. How can we extend this to a system that could solve a wide range of problems? We argue that this can be achieved by a modular system –– one that can adapt to solving different problems by changing only the modules chosen and the order in which those modules are applied to the problem. The recently introduced ARC (Abstraction and Reasoning Corpus) dataset serves as an excellent test of abstract reasoning. Suited to the modular approach, the tasks depend on a set of human Core Knowledge inbuilt priors. In this paper we implement these priors as the modules of our system. We combine these modules using a neural-guided program synthesis.

%B Learning Meets Combinatorial Algorithms workshop at NeurIPS 2020 %8 11/2020 %2

https://hdl.handle.net/1721.1/128607

%0 Generic %D 2020 %T Hierarchically Local Tasks and Deep Convolutional Networks %A Arturo Deza %A Qianli Liao %A Andrzej Banburski %A Tomaso Poggio %K Compositionality %K Inductive Bias %K perception %K Theory of Deep Learning %X

The main success stories of deep learning, starting with ImageNet, depend on convolutional networks, which on certain tasks perform significantly better than traditional shallow classifiers, such as support vector machines. Is there something special about deep convolutional networks that other learning machines do not possess? Recent results in approximation theory have shown that there is an exponential advantage of deep convolutional-like networks in approximating functions with hierarchical locality in their compositional structure. These mathematical results, however, do not say which tasks are expected to have input-output functions with hierarchical locality. Among all the possible hierarchically local tasks in vision, text and speech we explore a few of them experimentally by studying how they are affected by disrupting locality in the input images. We also discuss a taxonomy of tasks ranging from local, to hierarchically local, to global and make predictions about the type of networks required to perform efficiently on these different types of tasks.

%8 06/2020 %1

https://arxiv.org/abs/2006.13915

https://hdl.handle.net/1721.1/125980

%0 Journal Article %J IEEJ Transactions on Electrical and Electronic Engineering %D 2020 %T An Overview of Some Issues in the Theory of Deep Networks %A Tomaso Poggio %A Andrzej Banburski %X

During the last few years, significant progress has been made in the theoretical understanding of deep networks. We review our contributions in the areas of approximation theory and optimization. We also introduce a new approach based on cross‐validation leave‐one‐out stability to estimate bounds on the expected error of overparametrized classifiers, such as deep networks.

%B IEEJ Transactions on Electrical and Electronic Engineering %V 15 %P 1560 - 1571 %8 10/2020 %G eng %U https://onlinelibrary.wiley.com/toc/19314981/15/11 %N 11 %! IEEJ Trans Elec Electron Eng %R 10.1002/tee.23243 %0 Journal Article %J Proceedings of the National Academy of Sciences %D 2020 %T Theoretical issues in deep networks %A Tomaso Poggio %A Andrzej Banburski %A Qianli Liao %X

While deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about their approximation power, the dynamics of optimization, and good out-of-sample performance, despite overparameterization and the absence of explicit regularization. We review our recent results toward this goal. In approximation theory both shallow and deep networks are known to approximate any continuous function at an exponential cost. However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid the curse of dimensionality. In characterizing minimization of the empirical exponential loss we consider the gradient flow of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to normalized networks. The dynamics of normalized weights turn out to be equivalent to those of the constrained problem of minimizing the loss subject to a unit norm constraint. In particular, the dynamics of typical gradient descent have the same critical points as the constrained problem. Thus there is implicit regularization in training deep networks under exponential-type loss functions during gradient flow. As a consequence, the critical points correspond to minimum norm minimizers. This result is especially relevant because it has been recently shown that, for overparameterized models, selection of a minimum norm solution optimizes cross-validation leave-one-out stability and thereby the expected error. Thus our results imply that gradient descent in deep networks minimizes the expected error.

%B Proceedings of the National Academy of Sciences %P 201907369 %8 Sep-06-2020 %G eng %U https://www.pnas.org/content/early/2020/06/08/1907369117 %! Proc Natl Acad Sci USA %R 10.1073/pnas.1907369117 %0 Generic %D 2019 %T Double descent in the condition number %A Tomaso Poggio %A Gil Kur %A Andrzej Banburski %X

In solving a system of n linear equations in d variables Ax=b, the condition number of the (n,d) matrix A measures how much errors in the data b affect the solution x. Bounds of this type are important in many inverse problems. An example is machine learning where the key task is to estimate an underlying function from a set of measurements at random points in a high dimensional space and where low sensitivity to error in the data is a requirement for good predictive performance. Here we report the simple observation that when the columns of A are random vectors, the condition number of A is highest, that is, worst, when d=n, that is, when the inverse of A exists. An overdetermined system (n>d) and especially an underdetermined system (n<d), for which the pseudoinverse must be used instead of the inverse, typically have significantly better, that is, lower, condition numbers. Thus the condition number of A plotted as a function of d shows a double descent behavior with a peak at d=n.

%8 12/2019 %2

https://hdl.handle.net/1721.1/123108
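
The observation in the abstract above can be checked with a small numerical sketch. It assumes standard Gaussian entries for A and takes the condition number as the ratio of the largest to the smallest singular value, which is the quantity relevant for the pseudoinverse; the function name, the value n = 50, the grid of d values, and the number of trials are hypothetical choices.

```python
import numpy as np

def median_condition_number(n, d, trials=20, seed=0):
    """Median ratio of largest to smallest singular value of random (n, d) matrices."""
    rng = np.random.default_rng(seed)
    conds = []
    for _ in range(trials):
        A = rng.standard_normal((n, d))
        s = np.linalg.svd(A, compute_uv=False)  # min(n, d) values, descending
        conds.append(s[0] / s[-1])
    return float(np.median(conds))

n = 50
curve = {d: median_condition_number(n, d) for d in (10, 25, 50, 100, 250)}

for d, cond in curve.items():
    print(f"d={d:4d}  median condition number = {cond:.1f}")
```

Plotting `curve` against d should show the peak at d = n, with the condition number falling on both the overdetermined and underdetermined sides.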

%0 Conference Paper %B NAS Sackler Colloquium on Science of Deep Learning %D 2019 %T Dynamics & Generalization in Deep Networks -Minimizing the Norm %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Jack Hidary %A Tomaso Poggio %B NAS Sackler Colloquium on Science of Deep Learning %C Washington D.C. %8 03/2019 %G eng %0 Generic %D 2019 %T Theoretical Issues in Deep Networks %A Tomaso Poggio %A Andrzej Banburski %A Qianli Liao %X

While deep learning is successful in a number of applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about their approximation power, the dynamics of optimization by gradient descent and good out-of-sample performance --- why the expected error does not suffer, despite the absence of explicit regularization, when the networks are overparametrized. We review our recent results towards this goal. In {\it approximation theory} both shallow and deep networks are known to approximate any continuous functions on a bounded domain at a cost which is exponential (the number of parameters is exponential in the dimensionality of the function). However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can have a linear dependence on dimensionality, unlike shallow networks. In characterizing {\it minimization} of the empirical exponential loss we consider the gradient descent dynamics of the weight directions rather than the weights themselves, since the relevant function underlying classification corresponds to the normalized network. The dynamics of the normalized weights implied by standard gradient descent turns out to be equivalent to the dynamics of the constrained problem of minimizing an exponential-type loss subject to a unit $L_2$ norm constraint. In particular, the dynamics of the typical, unconstrained gradient descent converges to the same critical points of the constrained problem. Thus, there is {\it implicit regularization} in training deep networks under exponential-type loss functions with gradient descent. The critical points of the flow are hyperbolic minima (for any long but finite time) and minimum norm minimizers (e.g. maxima of the margin). 
Though appropriately normalized networks can show a small generalization gap (difference between empirical and expected loss) even for finite $N$ (number of training examples) with respect to the exponential loss, they do not generalize in terms of the classification error. Bounds on it for finite $N$ remain an open problem. Nevertheless, our results, together with other recent papers, characterize an implicit vanishing regularization by gradient descent which is likely to be a key prerequisite -- in terms of complexity control -- for the good performance of deep overparametrized ReLU classifiers.

%8 08/2019 %2

https://hdl.handle.net/1721.1/122014

%0 Generic %D 2019 %T Theories of Deep Learning: Approximation, Optimization and Generalization %A Qianli Liao %A Andrzej Banburski %A Tomaso Poggio %B TECHCON 2019 %8 09/2019 %0 Conference Paper %B ICML %D 2019 %T Weight and Batch Normalization implement Classical Generalization Bounds %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Lorenzo Rosasco %A Jack Hidary %A Tomaso Poggio %B ICML %C Long Beach/California %8 06/2019 %G eng %0 Generic %D 2018 %T Theory III: Dynamics and Generalization in Deep Networks %A Andrzej Banburski %A Qianli Liao %A Brando Miranda %A Tomaso Poggio %A Lorenzo Rosasco %A Jack Hidary %A Fernanda De La Torre %X

The key to generalization is controlling the complexity of
the network. However, there is no obvious control of
complexity -- such as an explicit regularization term --
in the training of deep networks for classification. We
will show that a classical form of norm control -- albeit
a hidden one -- is present in deep networks trained with
gradient descent techniques on exponential-type losses. In
particular, gradient descent induces a dynamics of the
normalized weights which converges for $t \to \infty$ to an
equilibrium which corresponds to a minimum norm (or
maximum margin) solution. For sufficiently large but
finite $\rho$ -- and thus finite $t$ -- the dynamics
converges to one of several margin maximizers, with the
margin monotonically increasing towards a limit stationary
point of the flow. In the usual case of stochastic
gradient descent, most of the stationary points are likely
to be convex minima corresponding to a regularized,
constrained minimizer -- the network with normalized
weights -- which is stable and has zero generalization
gap asymptotically for $N \to \infty$,
where $N$ is the number of training examples. For finite,
fixed $N$ the generalization gap may not be zero but the
minimum norm property of the solution can provide, we
conjecture, good expected performance for suitable data
distributions. Our approach extends some of the results of
Srebro from linear networks to deep networks and provides
a new perspective on the implicit bias of gradient
descent. We believe that the elusive complexity control we
describe is responsible for the puzzling empirical finding
of good predictive performance by deep networks, despite
overparametrization.

%8 06/2018 %2

http://hdl.handle.net/1721.1/116692