The Janus effects of SGD vs GD: high noise and low rank

Title: The Janus effects of SGD vs GD: high noise and low rank
Publication Type: CBMM Memos
Year of Publication: 2023
Authors: Xu, M, Galanti, T, Rangamani, A, Rosasco, L, Pinto, A, Poggio, T
Date Published: 12/2024

It has long been clear that, for neural networks, SGD with a small minibatch size yields much higher asymptotic fluctuations in the updates of the weight matrices than GD. It has also often been reported that SGD in deep ReLU networks empirically shows a low-rank bias in the weight matrices. A recent theoretical analysis derived a bound on the rank and linked it to the size of the SGD fluctuations [25]. In this paper, we provide an empirical and theoretical analysis of the convergence of SGD vs GD, first for deep ReLU networks and then for the case of linear regression, where sharper estimates can be obtained and which is of independent interest. In the linear case, we prove that the component $W^\perp$ of the matrix $W$ corresponding to the null space of the data matrix $X$ converges to zero for both SGD and GD, provided the regularization term is non-zero. Because SGD performs a larger number of updates in going through all the training data, the convergence rate per epoch of these components is much faster for SGD than for GD. In practice, SGD has a much stronger bias than GD towards solutions for weight matrices $W$ with high fluctuations (even when the choice of mini batches is deterministic) and low rank, provided the initialization is from a random matrix. Thus SGD with non-zero regularization shows the coupled phenomenon of asymptotic noise and a low-rank bias, unlike GD.
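The linear-case claim in the abstract can be illustrated with a small numerical sketch. This is not the authors' code: it assumes a squared loss with weight decay, $\frac{1}{2b}\|X_B W - Y_B\|^2 + \frac{\lambda}{2}\|W\|^2$, a constant step size, deterministic mini batches, and hypothetical names (`run`, `null_space_projector`, `lr`, `lam`). Under these assumptions every data-gradient lies in the row space of $X$, so each update multiplies $W^\perp$ by $(1 - \eta\lambda)$; SGD with batch size 1 makes $n$ such updates per epoch, whereas GD makes one.

```python
import numpy as np

def null_space_projector(X):
    # Orthogonal projector onto the null space of X (rows of X are samples).
    return np.eye(X.shape[1]) - np.linalg.pinv(X) @ X

def run(X, Y, W0, lr, lam, batch_size, epochs):
    # Mini-batch gradient descent on squared loss with weight decay.
    # batch_size == n gives full-batch GD; smaller values give SGD with
    # deterministic (in-order) mini batches.
    n = X.shape[0]
    P = null_space_projector(X)
    W = W0.copy()
    norms = [np.linalg.norm(P @ W)]  # size of the null-space component
    for _ in range(epochs):
        for start in range(0, n, batch_size):
            Xb = X[start:start + batch_size]
            Yb = Y[start:start + batch_size]
            grad = Xb.T @ (Xb @ W - Yb) / len(Xb) + lam * W
            W = W - lr * grad
        norms.append(np.linalg.norm(P @ W))
    return np.array(norms)

# Assumed toy setup: fewer samples than dimensions, so X has a null space.
rng = np.random.default_rng(0)
n, d, k = 20, 50, 3
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, k))
W0 = rng.standard_normal((d, k))  # random initialization
lr, lam, epochs = 0.01, 0.1, 30

gd = run(X, Y, W0, lr, lam, batch_size=n, epochs=epochs)
sgd = run(X, Y, W0, lr, lam, batch_size=1, epochs=epochs)

print("GD  shrink of null-space component:", gd[-1] / gd[0])
print("SGD shrink of null-space component:", sgd[-1] / sgd[0])
```

In this toy setting the per-epoch contraction of the null-space component is $(1 - \eta\lambda)$ for GD and $(1 - \eta\lambda)^{n}$ for SGD with batch size 1, and setting `lam = 0` leaves the component unchanged for both. The sketch does not address the fluctuation or low-rank results for deep ReLU networks.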


CBMM Memo No: 144

Associated Module: 

CBMM Relationship: 

  • CBMM Funded