In this paper, we conduct an empirical study of the feature learning process in deep classifiers. Recent research has identified a training phenomenon called Neural Collapse (NC), in which the top-layer feature embeddings of samples from the same class tend to concentrate around their means, and the top layerâ€™s weights align with those features. Our study aims to investigate if these properties extend to intermediate layers. We empirically study the evolution of the covariance and mean of representations across different layers and show that as we move deeper into a trained neural network, the within-class covariance decreases relative to the between-class covariance. Additionally, we find that in the top layers, where the between-class covariance is dominant, the subspace spanned by the class means aligns with the subspace spanned by the most significant singular vector components of the weight matrix in the corresponding layer. Finally, we discuss the relationship between NC and Associative Memories (Willshaw et al., 1969).

%8 02/2023 %2https://hdl.handle.net/1721.1/148239

%0 Generic %D 2023 %T Norm-Based Generalization Bounds for Compositionally Sparse Neural Networks %A Tomer Galanti %A Mengjia Xu %A Liane Galanti %A Tomaso Poggio %XIn this paper, we investigate the Rademacher complexity of deep sparse neural networks, where each neuron receives a small number of inputs. We prove generalization bounds for multilayered sparse ReLU neural networks, including convolutional neural networks. These bounds differ from previous ones, as they consider the norms of the convolutional filters instead of the norms of the associated Toeplitz matrices, independently of weight sharing between neurons.

As we show theoretically, these bounds may be orders of magnitude better than standard norm-based generalization bounds and empirically, they are almost non-vacuous in estimating generalization in various simple classification problems. Taken together, these results suggest that compositional sparsity of the underlying target function is critical to the success of deep neural networks.

%2https://hdl.handle.net/1721.1/145776

%0 Generic %D 2023 %T SGD and Weight Decay Provably Induce a Low-Rank Bias in Deep Neural Networks %A Tomer Galanti %A Zachary Siegel %A Aparna Gupte %A Tomaso Poggio %XIn this paper, we study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay causes a bias towards rank minimization over the weight matrices. Specifically, we show, both theoretically and empirically, that this bias is more pronounced when using smaller batch sizes, higher learning rates, or increased weight decay. Additionally, we predict and observe empirically that weight decay is necessary to achieve this bias. Finally, we empirically investigate the connection between this bias and generalization, finding that it has a marginal effect on generalization. Our analysis is based on a minimal set of assumptions and applies to neural networks of any width or depth, including those with residual connections and convolutional layers.

%2https://hdl.handle.net/1721.1/148230

%0 Journal Article %J arXiv %D 2022 %T On the Implicit Bias Towards Minimal Depth of Deep Neural Networks %A Tomer Galanti %A Liane Galanti %XWe study the implicit bias of gradient based training methods to favor low-depth solutions when training deep neural networks. Recent results in the literature suggest that penultimate layer representations learned by a classifier over multiple classes exhibit a clustering property, called neural collapse. We demonstrate empirically that neural collapse extends beyond the penultimate layer and emerges in intermediate layers as well. In this regards, we hypothesize and empirically show that gradient based methods are implicitly biased towards selecting neural networks of minimal depth for achieving this clustering property.

%B arXiv %8 03/2022 %G eng %U https://arxiv.org/abs/2202.09028 %0 Generic %D 2022 %T SGD Noise and Implicit Low-Rank Bias in Deep Neural Networks %A Tomer Galanti %A Tomaso Poggio %XWe analyze deep ReLU neural networks trained with mini-batch stochastic gradient decent and weight decay. We prove that the source of the SGD noise is an implicit low rank constraint across all of the weight matrices within the network. Furthermore, we show, both theoretically and empirically, that when training a neural network using Stochastic Gradient Descent (SGD) with a small batch size, the resulting weight matrices are expected to be of small rank. Our analysis relies on a minimal set of assumptions and the neural networks may include convolutional layers, residual connections, as well as batch normalization layers.

%8 03/2022 %2https://hdl.handle.net/1721.1/141380

%0 Generic %D 2021 %T Dynamics and Neural Collapse in Deep Classifiers trained with the Square Loss %A M. Xu %A Akshay Rangamani %A Andrzej Banburski %A Q. Liao %A Tomer Galanti %A Tomaso Poggio %X*We overview several properties -- old and new -- of training overparametrized deep networks under the square loss. We first consider a model of the dynamics of gradient flow under the square loss in deep homogeneous ReLU networks. We study the convergence to a solution with the absolute minimum $\rho$, which is the product of the Frobenius norms of each layer weight matrix, when normalization by Lagrange multipliers (LM) is used together with Weight Decay (WD) under different forms of gradient descent. A main property of the minimizers that bounds their expected error {\it for a specific network architecture} is $\rho$. In particular, we derive novel norm-based bounds for convolutional layers that are orders of magnitude better than classical bounds for dense networks. Next we prove that quasi-interpolating solutions obtained by Stochastic Gradient Descent (SGD) in the presence of WD have a bias towards low rank weight matrices -- that, as we also explain, should improve generalization. The same analysis predicts the existence of an inherent SGD noise for deep networks. In both cases, we verify our predictions experimentally. We then predict Neural Collapse and its properties without any specific assumption -- unlike other published proofs. Our analysis supports the idea that the advantage of deep networks relative to other classifiers is greater for the problems that are appropriate for sparse deep architectures such as CNNs. The deep reason compositionally sparse target functions can be approximated well by ``sparse'' deep networks without incurring in the curse of dimensionality.*