Loss landscape: SGD has a better view

Title: Loss landscape: SGD has a better view
Publication Type: CBMM Memos
Year of Publication: 2020
Authors: Poggio, T, Cooper, Y
Date Published: 07/2020
Abstract

Consider a loss function $L = \sum_{i=1}^{n} \ell_i^2$ with $\ell_i = f(x_i) - y_i$, where $f(x)$ is a deep feedforward network with $R$ layers, no bias terms, and scalar output. Assume the network is overparametrized, that is, $d \gg n$, where $d$ is the number of parameters and $n$ is the number of data points. The networks are assumed to interpolate the training data (i.e., the minimum of $L$ is zero). If GD converges, it will converge to a critical point of $L$, namely a solution of $\sum_{i=1}^{n} \ell_i \nabla \ell_i = 0$. There are two kinds of critical points: those for which each term of the above sum vanishes individually, and those for which the expression vanishes only when all the terms are summed. The main claim in this note is that while GD can converge to both types of critical points, SGD can only converge to the first kind, which includes all global minima.
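
To make the distinction concrete, here is a short worked sketch of the fixed-point argument; it is not the memo's own derivation, and the step size $\eta$ is introduced purely for illustration. Full-batch GD is stationary wherever the summed gradient vanishes,

    $\nabla L(w) = 2 \sum_{i=1}^{n} \ell_i(w) \, \nabla \ell_i(w) = 0,$

which allows the individual terms to cancel one another without vanishing. An SGD step instead updates with a single sampled term $i_t$,

    $w_{t+1} = w_t - 2\eta \, \ell_{i_t}(w_t) \, \nabla \ell_{i_t}(w_t),$

so a point $w^*$ is left fixed by every possible sample only if $\ell_i(w^*) \, \nabla \ell_i(w^*) = 0$ for each $i$ separately, which is exactly the first kind of critical point. In particular, the interpolating solutions, where every $\ell_i(w^*) = 0$, are of this kind.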


DSpace@MIT: https://hdl.handle.net/1721.1/126041

CBMM Memo No: 107


CBMM Relationship: 

  • CBMM Funded