Theoretical Frameworks for Intelligence

Understanding intelligence and the brain requires theories at several levels: the biophysics of single neurons, algorithms and circuits, overall computations and behavior, and a theory of learning. Over the past few decades, advances have been made in many of these areas from multiple perspectives. Several major contributors to these advances are members of our team.

This theoretical foundation provides a common framework for fields as diverse as computer science, cognitive science, and neuroscience. Recent successes in intelligent systems applications – from Google to Watson – would not have been possible without these developments. For the first time, we have the beginnings of a unifying and useful mathematics of brains, minds, and machines with rigorous foundations, demonstrated applicability in almost every area of cognitive and neural science, and real practical value for building intelligent systems.

The computational role of eccentricity-dependent resolution in the retina: consequences for hierarchical models of object recognition

The convolutional architectures of current hierarchical models of object recognition are designed primarily around translation invariance. This is achieved by layers of complex cells, which pool responses to some visual feature over a local region in (x,y). The eccentricity-dependent resolution of the primate retina, however, suggests that for the ventral visual pathway, the most important transformation is not position but scale.
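As a minimal sketch of how pooling over a local region yields translation invariance, the following one-dimensional example applies non-overlapping max pooling to a row of feature responses. The function name `max_pool`, the pool size, and the response values are illustrative assumptions, not part of any specific model described here.

```python
def max_pool(responses, pool_size):
    """Max over non-overlapping windows, a stand-in for a layer of complex cells."""
    return [
        max(responses[i:i + pool_size])
        for i in range(0, len(responses), pool_size)
    ]

# Hypothetical simple-cell responses to one feature at 8 positions.
original = [0, 0, 9, 0, 0, 0, 0, 0]
shifted_within_pool = [0, 0, 0, 9, 0, 0, 0, 0]   # feature moved by one position
shifted_across_pool = [0, 0, 0, 0, 9, 0, 0, 0]   # feature moved into the next window

pooled_original = max_pool(original, 4)
pooled_within = max_pool(shifted_within_pool, 4)
pooled_across = max_pool(shifted_across_pool, 4)

# The pooled output is unchanged only while the shift stays inside one window.
print(pooled_original, pooled_within, pooled_across)
```

The pooled response is identical for the first two inputs and differs for the third, which illustrates that the invariance obtained this way extends only as far as the pooling region.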

Retinal resolution decreases approximately linearly with eccentricity, which means that the radius of the visual field for which a given spatial frequency can be detected is proportional to wavelength. Low spatial frequencies can be detected over a wide range of eccentricities, while the highest frequencies are detectable only very near fixation. The linear relationship implies that the visual input at each spatial frequency can be represented by the same number of samples. Under this scheme, a fixated object undergoing scaling induces a simple shift in the sampled representation along the dimension of scale. This allows for scale invariant recognition over a wide range of scales, but translation invariance only proportional to object size. This relative priority is perhaps unsurprising given that changes in target position may be accommodated relatively easily via eye movements, while changes in scale cannot.
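The sampling argument above can be sketched numerically. The example below assumes octave-spaced wavelengths, a detectable radius of 8 wavelengths, and 2 samples per wavelength; these constants and the function names are hypothetical choices made only for illustration. It shows that every band then holds the same number of samples, and that doubling the size of a fixated object while moving up one band leaves its extent in samples unchanged.

```python
def band_layout(wavelength, radius_per_wavelength=8.0, samples_per_wavelength=2.0):
    """Return (radius, spacing, n_samples) for one spatial-frequency band.

    Both the detectable radius and the sample spacing are assumed to be
    proportional to wavelength, as implied by the linear falloff in resolution.
    """
    radius = radius_per_wavelength * wavelength
    spacing = wavelength / samples_per_wavelength
    n_samples = round(2 * radius / spacing)  # samples across the band's diameter
    return radius, spacing, n_samples

def extent_in_samples(object_size, wavelength, samples_per_wavelength=2.0):
    """Width of a fixated object measured in samples of the given band."""
    return object_size / (wavelength / samples_per_wavelength)

wavelengths = [2.0 ** k for k in range(5)]          # assumed octave spacing
sample_counts = [band_layout(w)[2] for w in wavelengths]

# Doubling the object size and moving one octave coarser gives the same extent.
small = extent_in_samples(object_size=4.0, wavelength=1.0)
large = extent_in_samples(object_size=8.0, wavelength=2.0)
print(sample_counts, small, large)
```

Under these assumptions, scaling a fixated object changes which band carries its representation but not the pattern of samples within the band, which is the shift along the scale dimension described above.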

Decomposition of the retinal image into spatial frequency bands takes place in V1, whose cells are well modeled by Gabor filters. Measurements of receptive field properties for V1 and beyond are difficult to obtain near the fovea. The scale invariance hypothesis predicts a small central region of constant maximum resolution, which we identify with the foveola. Based on estimates for V1 of the smallest receptive field size in central vision, and of the rate of increase with eccentricity, we estimate the diameter of the foveola to be on the order of 20-30 arcminutes.

A major difference between HMAX and other hierarchical models is that its convolutional architecture explicitly includes a scale dimension, which arises from a fixed Gabor filter model of V1. However, in all existing HMAX models, input resolution is uniform, i.e., every scale band spans the entire visual field. This is impossible given a biologically realistic retina; moreover, we conjecture that such a strategy may not be desirable even for an artificial vision system.

Current HMAX models overrepresent high frequencies, which must be pooled more aggressively to achieve an invariant signature. Initial experiments have shown that overreliance on high frequencies impairs scale invariance, even in controlled single-object situations. In natural images, aggressive pooling at high frequencies increases sensitivity to clutter. The alternative suggested by the retinal resolution curve is a nested representation, where different scale bands represent different amounts of the scene. For an object spanned by an intermediate scale band, a finer band might span one of its parts, while a coarser band spans the surrounding context.
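One way to picture the nested representation is as a set of concentric windows around fixation, one per scale band. The sketch below is illustrative only: the octave spacing, the window radii, and the thresholds used to label a band as covering a part, the object, or its context are all assumptions introduced here, not values taken from HMAX or from the retina.

```python
def band_windows(n_bands, finest_radius=1.0):
    """Windows (lo, hi) centered on fixation; radius doubles per band (assumed)."""
    return [
        (-finest_radius * 2 ** k, finest_radius * 2 ** k)
        for k in range(n_bands)
    ]

def band_role(window, object_radius):
    """Label what a band's window covers relative to a fixated object.

    The thresholds are hypothetical: a window smaller than the object sees a
    part, one up to twice its radius sees the object, anything larger also
    takes in surrounding context.
    """
    radius = window[1]
    if radius < object_radius:
        return "part"
    if radius < 2 * object_radius:
        return "object"
    return "context"

windows = band_windows(4)
roles = [band_role(w, object_radius=2.0) for w in windows]

# Each window is contained in the next coarser one, so the bands are nested.
is_nested = all(
    windows[k + 1][0] <= windows[k][0] and windows[k][1] <= windows[k + 1][1]
    for k in range(len(windows) - 1)
)
print(roles, is_nested)
```

In this toy layout the finest band covers only part of the object, the next band spans it, and the coarser bands take in progressively more of the surroundings.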

The initial goal of this project is to understand the consequences for object recognition of using invariant signatures constructed in this manner: how they affect performance in different situations (e.g., controlled vs. cluttered scenes), the number of fixations required for learning and inference, and the effect of other architectural choices (e.g., local vs. global pooling over space or scale). Of longer-term interest is the observation that nested signatures necessarily encode information about the hierarchical structure of scenes, and might therefore be elements of a shared representation for object recognition as well as higher-level reasoning about a scene, in the direction of the CBMM challenge.

The Center for Brains, Minds & Machines

Tomaso Poggio

Tomer Ullman

Ken Nakayama

Jim Mutch

Leyla Isik
