The convolutional architectures of current hierarchical models of object recognition are designed primarily around translation invariance. This is achieved by layers of complex cells, which pool responses to some visual feature over a local region in (x,y). The eccentricity-dependent resolution of the primate retina, however, suggests that for the ventral visual pathway, the most important transformation is not position but scale.
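To make the pooling operation concrete, here is a minimal sketch of complex-cell-style pooling, assuming a max over local (x,y) neighborhoods of a simple-cell response map; the window and stride values are illustrative assumptions, not parameters of any specific model.

```python
import numpy as np

def complex_cell_pool(responses: np.ndarray, window: int = 4, stride: int = 4) -> np.ndarray:
    """Max-pool a 2D response map over local (x, y) regions."""
    h, w = responses.shape
    out_h, out_w = (h - window) // stride + 1, (w - window) // stride + 1
    pooled = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = responses[i * stride:i * stride + window,
                              j * stride:j * stride + window]
            pooled[i, j] = patch.max()
    return pooled

# A feature that shifts by less than the pooling window produces the same
# pooled output -- the source of local translation invariance.
pooled = complex_cell_pool(np.random.rand(32, 32))
```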
The smallest resolvable wavelength on the retina grows approximately linearly with eccentricity, which means that the radius of the visual field over which a given spatial frequency can be detected is proportional to its wavelength. Low spatial frequencies can be detected over a wide range of eccentricities, while the highest frequencies are detectable only very near fixation. The linear relationship implies that the visual input at each spatial frequency can be represented by the same number of samples: sampling at intervals proportional to wavelength, over a radius proportional to wavelength, yields a constant sample count per band. Under this scheme, a fixated object undergoing scaling induces a simple shift of the sampled representation along the scale dimension. This allows scale-invariant recognition over a wide range of scales, but translation invariance only proportional to object size. This relative priority is perhaps unsurprising, given that changes in target position can be accommodated relatively easily by eye movements, while changes in scale cannot.
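The constant-sample-count property and the scale-to-shift correspondence can be checked with a short sketch. The octave spacing of the bands and the radius-to-wavelength ratio below are assumptions chosen for illustration.

```python
import numpy as np

wavelengths = 2.0 ** np.arange(0, 6)   # octave-spaced bands (assumed)
c = 8.0                                # radius/wavelength ratio (assumed)

for lam in wavelengths:
    radius = c * lam                           # detectable out to r proportional to lambda
    n_samples = int(2 * radius / (lam / 2))    # samples at Nyquist spacing lambda/2
    print(f"lambda={lam:5.1f}  radius={radius:6.1f}  samples per band={n_samples}")

# Every band holds the same number of samples (4c). Scaling a fixated object
# by 2x moves its content from band lambda to band 2*lambda: a pure shift
# along the scale axis of the representation.
```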
Decomposition of the retinal image into spatial frequency bands takes place in V1, whose cells are well modeled by Gabor filters. Measurements of receptive field properties in V1 and beyond are difficult to obtain near the fovea. The scale-invariance hypothesis predicts a small central region of constant maximum resolution, which we identify with the foveola. Based on estimates of the smallest V1 receptive field size in central vision, and of the rate at which receptive field size increases with eccentricity, we estimate the diameter of the foveola to be on the order of 20-30’ of arc.
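A hedged back-of-the-envelope version of this estimate: if the smallest central receptive fields have diameter s0 and receptive field size grows with eccentricity at rate k, resolution can remain at its maximum only where the linear growth has not yet exceeded s0, giving a foveola radius of roughly s0 / k. The numerical values below are illustrative assumptions, not measurements.

```python
s0_arcmin = 1.5   # assumed smallest central V1 RF diameter, in arcmin
k = 0.1           # assumed growth rate: arcmin of RF size per arcmin of eccentricity

radius_arcmin = s0_arcmin / k
print(f"foveola diameter ~ {2 * radius_arcmin:.0f} arcmin")   # ~30' under these assumptions
```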
A major difference between HMAX and other hierarchical models is that its convolutional architecture explicitly includes a scale dimension, which arises from a fixed Gabor filter model of V1. However, in all existing HMAX models, input resolution is uniform, i.e., every scale band spans the entire visual field. This is impossible given a biologically realistic retina; moreover, we conjecture that such a strategy may not be desirable even for an artificial vision system.
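The following is a minimal sketch of a fixed multi-scale Gabor front end of this kind, in which every scale band is convolved across the entire image (the uniform-resolution scheme). The wavelengths, orientations, and envelope width are illustrative assumptions rather than parameters of any published HMAX implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor(size: int, wavelength: float, theta: float, sigma: float) -> np.ndarray:
    """A 2D Gabor filter: cosine carrier under a Gaussian envelope."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * xr / wavelength)

image = np.random.rand(128, 128)            # stand-in for a retinal image

# Every (wavelength, orientation) band covers the full visual field.
bands = {}
for wavelength in (4.0, 8.0, 16.0):         # assumed octave-spaced scales
    for theta in np.arange(4) * np.pi / 4:  # four orientations
        filt = gabor(int(4 * wavelength) | 1, wavelength, theta, wavelength / 2)
        bands[(wavelength, theta)] = fftconvolve(image, filt, mode="same")
```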
Current HMAX models overrepresent high frequencies, which must be pooled more aggressively to achieve an invariant signature. Initial experiments have shown that overreliance on high frequencies impairs scale invariance, even in controlled single-object situations. In natural images, aggressive pooling at high frequencies increases sensitivity to clutter. The alternative suggested by the retinal resolution curve is a nested representation, in which different scale bands represent different amounts of the scene. For an object spanned by an intermediate scale band, a finer band might span one of its parts, while a coarser band spans the surrounding context.
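Continuing the Gabor-bank sketch above, the nested alternative might look as follows: each band keeps only a central window whose radius is proportional to its wavelength, resampled to a fixed grid, so finer bands see an object's parts while coarser bands see its context. The window ratio, grid size, and nearest-neighbor subsampling are simplifying assumptions; a real model would presumably pool or filter before subsampling.

```python
import numpy as np

def nested_signature(bands: dict, center: tuple, c: float = 8.0, grid: int = 16) -> dict:
    """Crop each scale band to a wavelength-proportional window and resample it."""
    cy, cx = center
    signature = {}
    for (wavelength, theta), response in bands.items():
        r = int(c * wavelength)                        # window radius proportional to wavelength
        patch = response[max(cy - r, 0):cy + r, max(cx - r, 0):cx + r]
        # Resample to the same number of samples per band (nearest neighbor).
        rows = np.linspace(0, patch.shape[0] - 1, grid).astype(int)
        cols = np.linspace(0, patch.shape[1] - 1, grid).astype(int)
        signature[(wavelength, theta)] = patch[np.ix_(rows, cols)]
    return signature

# Using the `bands` dict from the previous sketch, fixated at the image center:
sig = nested_signature(bands, center=(64, 64))
```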
The initial goal of this project is to understand the consequences for object recognition of using invariant signatures constructed in this manner: how they affect performance in different situations (e.g., controlled vs. cluttered scenes), the number of fixations required for learning and inference, and the effect of other architectural choices (e.g., local vs. global pooling over space or scale). Of longer-term interest is the observation that nested signatures necessarily encode information about the hierarchical structure of scenes, and might therefore serve as elements of a representation shared between object recognition and higher-level reasoning about a scene, in the direction of the CBMM challenge.