Tomaso Poggio on Deep Learning Representation, Optimization, and Generalization [Synced]

April 20, 2018

Those outside academia may know Tomaso Poggio through his students, DeepMind Founder Demis Hassabis and Mobileye Founder Amnon Shashua. The former built AlphaGo, the celebrated Go-playing AI champion, while the latter has installed driver-assistance ("copilot") systems in more than 15 million vehicles worldwide and produced the world's first L2 autonomous driving system in a car.

While Poggio the teacher has taught some extraordinary leaders in AI, Poggio the scientist is renowned for his theory of deep learning, presented in papers with self-explanatory names: Theory of Deep Learning I, II and III.

He is a Professor in the Department of Brain and Cognitive Sciences, an investigator at the McGovern Institute for Brain Research, a member of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and Director of the Center for Biological and Computational Learning at MIT and the Center for Brains, Minds, and Machines.

Poggio’s research focuses on three deep learning problems: 1) Representation: Why are deep neural networks better than shallow ones? 2) Optimization: Why is SGD (Stochastic Gradient Descent) good at finding minima and what are good minima? 3) Generalization: Why is it that we don’t have to worry about overfitting despite overparameterization?

Poggio uses mathematics to explain each problem before inductively working out the theory.

Why Are Deep Neural Networks Better Than Shallow Ones?

Poggio and mathematician Steve Smale co-authored a 2002 paper that summarized classical learning theories on neural networks with one hidden layer. "Classical theory tells us to use one-layer networks, while we find that the brain uses many layers," recalls Poggio.

Both deep and single-layer networks can approximate continuous functions. This was one reason why research in the 80s focused on simpler single-layer networks.

The problem lies in dimensionality. To represent a complicated function, a single-layer network can require more units than there are atoms in the universe. Mathematically, this is called "the curse of dimensionality": the number of parameters needed grows exponentially with the dimensionality of the function.
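As a rough sketch of this scaling (a standard approximation-theory count of the kind Poggio's theory papers build on; constants and the precise function class are omitted here), the number of units a one-hidden-layer network needs to reach accuracy ε for a generic d-variable function with smoothness of order m grows exponentially in d:

```latex
% Sketch of the classical bound (constants and function-class details omitted):
% units needed by a one-hidden-layer network to approximate a generic
% d-variable function with smoothness order m to accuracy \epsilon.
N_{\text{shallow}} = O\!\left(\epsilon^{-d/m}\right)
% Illustrative numbers: d = 100, m = 2, \epsilon = 0.1 gives N \sim 10^{50} units.
```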

Mathematicians escape the curse of dimensionality by making assumptions about function smoothness. Deep learning offers a different way out: compositional functions. The number of units a deep network needs to approximate a compositional function grows only linearly with the function's dimensionality, as the sketch below illustrates.
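Here is a minimal illustration of the counting argument (my own sketch with illustrative exponents, not code from Poggio's papers): when a d-variable function is built as a binary tree of two-input constituent functions, a deep network that mirrors the tree needs roughly d − 1 small subnetworks, so its unit count grows linearly in d, while the generic shallow bound grows exponentially.

```python
# Sketch: parameter counting for shallow vs. deep approximation.
# A binary-tree compositional function of d inputs has d - 1 two-input nodes;
# a deep network mirroring the tree needs one small subnetwork per node,
# so its unit count grows ~linearly in d (illustrative numbers only).

def shallow_units(d: int, eps: float, m: int = 2) -> float:
    """Generic d-variable function with smoothness m: ~eps**(-d/m) units."""
    return eps ** (-d / m)

def deep_units(d: int, eps: float, m: int = 2) -> float:
    """Binary-tree compositional function: (d - 1) two-input nodes,
    each needing only ~eps**(-2/m) units."""
    return (d - 1) * eps ** (-2 / m)

if __name__ == "__main__":
    d, eps = 32, 0.1
    print(f"shallow: ~{shallow_units(d, eps):.3g} units")  # ~1e16
    print(f"deep:    ~{deep_units(d, eps):.3g} units")     # ~310
```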

Deep learning works beautifully on datasets that are compositional in nature, such as images and voice samples. Images can be broken down into related snippets of detail, while voice samples can be decomposed into meaningful phonemes. For an image classification task there is no need to compare pixels that are far apart; the model simply observes each small patch and combines them hierarchically. The neural network escapes the curse of dimensionality by using a very small number of parameters...
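To make the locality idea concrete, here is a toy sketch (my own illustration, not the article's model): a small one-dimensional "image" is reduced by repeatedly combining adjacent pairs with a single local function, so no step ever compares far-apart entries, yet the final output still depends on every input.

```python
import numpy as np

def local_combine(pair: np.ndarray) -> float:
    """Stand-in for a learned two-input constituent function (toy nonlinearity)."""
    return float(np.tanh(pair[0] + 0.5 * pair[1]))

def hierarchical_pass(x: np.ndarray) -> float:
    """Merge adjacent pairs layer by layer until one value remains.
    Each layer only sees local neighborhoods, mirroring how a deep network
    composes nearby pixels into progressively larger structures."""
    layer = x.astype(float)
    while layer.size > 1:
        layer = np.array([local_combine(layer[i:i + 2])
                          for i in range(0, layer.size, 2)])
    return float(layer[0])

print(hierarchical_pass(np.arange(8)))  # one output built from purely local combinations
```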

Read the full story on Synced's website using the link below.
