|Title||How Deep Sparse Networks Avoid the Curse of Dimensionality: Efficiently Computable Functions are Compositionally Sparse|
|Publication Type||CBMM Memos|
|Year of Publication||2022|
The main claim of this perspective is that compositional sparsity of the target function, which corresponds to the task to be learned, is the key principle underlying machine learning. Mhaskar and Poggio (2016) proved that sparsity of the compositional target functions (when the functions are on the reals, the constituent functions are also required to be smooth), naturally leads to sparse deep networks for approximation and thus for optimization. This is the case of most CNNs in current use, in which the known sparse graph of the target function is reflected in the sparse connectivity of the network. When the graph of the target function is unknown, I conjecture that transformers are able to implement a flexible version of sparsity (selecting which input tokens interact in the MLP layer), through the self-attention layers. Surprisingly, the assumption of compositional sparsity of the target function is not restrictive in practice, since I prove here that for computable functions (if on the reals with Lipschitz continuous derivatives) compositional sparsity is equivalent to efficient computability, that is Turing computability in polynomial time.
- CBMM Funded