%0 Journal Article %J PLOS Computational Biology %D 2022 %T Look twice: A generalist computational model predicts return fixations across tasks and species %A Zhang, Mengmi %A Armendariz, Marcelo %A Xiao, Will %A Rose, Olivia %A Bendtz, Katarina %A Livingstone, Margaret %A Ponce, Carlos %A Kreiman, Gabriel %E Faisal, Aldo A. %X

Primates constantly explore their surroundings via saccadic eye movements that bring different parts of an image into high resolution. In addition to exploring new regions in the visual field, primates also make frequent return fixations, revisiting previously foveated locations. We systematically studied a total of 44,328 return fixations out of 217,440 fixations. Return fixations were ubiquitous across different behavioral tasks, in monkeys and humans, both when subjects viewed static images and when subjects performed natural behaviors. Return fixation locations were consistent across subjects, tended to occur within short temporal offsets, and typically followed a 180-degree turn in saccadic direction. To understand the origin of return fixations, we propose a proof-of-principle, biologically inspired, and image-computable neural network model. The model combines five key modules: an image feature extractor, bottom-up saliency cues, task-relevant visual features, finite inhibition-of-return, and saccade size constraints. Even though there are no free parameters that are fine-tuned for each specific task, species, or condition, the model produces fixation sequences resembling the universal properties of return fixations. These results provide initial steps towards a mechanistic understanding of the trade-off between rapid foveal recognition and the need to scrutinize previous fixation locations.
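
The five-module architecture can be illustrated with a short sketch. The code below is a minimal illustration with made-up hyperparameters, not the authors' implementation: it assumes the saliency and task-relevance modules have already produced 2D maps, and shows how a finite inhibition-of-return combined with a saccade-size penalty yields a priority map from which the next fixation is selected. Because the inhibition fades, previously fixated locations can win again and produce return fixations.

```python
# Minimal sketch (not the authors' code) of combining the five modules named in the
# abstract into a priority map for the next fixation. Saliency and task relevance are
# taken as precomputed 2D maps; the finite inhibition-of-return (IOR) only covers the
# most recent fixations and weakens with age, so revisits remain possible.
import numpy as np

def next_fixation(saliency, task_relevance, past_fixations, current_xy,
                  ior_radius=3, ior_memory=8, saccade_sigma=10.0):
    """All maps are HxW arrays; fixations are (row, col) tuples."""
    h, w = saliency.shape
    rows, cols = np.mgrid[0:h, 0:w]

    # Bottom-up saliency plus task-relevant visual features.
    priority = saliency.astype(float) + task_relevance

    # Finite IOR: only the last `ior_memory` fixations are inhibited,
    # and older fixations are inhibited less.
    for age, (fy, fx) in enumerate(reversed(past_fixations[-ior_memory:])):
        dist = np.sqrt((rows - fy) ** 2 + (cols - fx) ** 2)
        strength = 1.0 - age / ior_memory
        priority -= strength * (dist < ior_radius)

    # Saccade-size constraint: penalize locations far from the current fixation.
    cy, cx = current_xy
    dist = np.sqrt((rows - cy) ** 2 + (cols - cx) ** 2)
    priority *= np.exp(-dist ** 2 / (2 * saccade_sigma ** 2))

    return np.unravel_index(np.argmax(priority), priority.shape)
```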

%B PLOS Computational Biology %V 18 %P e1010654 %8 11/2022 %G eng %U https://dx.plos.org/10.1371/journal.pcbi.1010654 %N 11 %! PLoS Comput Biol %R 10.1371/journal.pcbi.1010654 %0 Journal Article %J Nature Human Behaviour %D 2021 %T Beauty is in the eye of the machine %A Zhang, Mengmi %A Gabriel Kreiman %X

Ansel Adams said, “There are no rules for good photographs, there are only good photographs.” Is it possible to predict our fickle and subjective appraisal of ‘aesthetically pleasing’ visual art? Iigaya et al. used an artificial intelligence approach to show how human aesthetic preference can be partially explained as an integration of hierarchical constituent image features.

Artificial intelligence (AI) has made rapid strides in a wide range of visual tasks, including recognition of objects and faces, automatic diagnosis of clinical images, and answering questions about images. More recently, AI has also started penetrating the arts. For example, in October 2018, the first piece of AI-generated art came to auction with an initial estimate of US$ 10,000, and strikingly garnered a final bid of US$ 432,500 (Fig. 1). The portrait depicts a portly gentleman with a seemingly fuzzy facial expression, dressed in a black frockcoat with a white collar. Appreciating and creating a piece of art requires a general understanding of aesthetics. What are the nuances, structures, and semantics embedded in a painting that give rise to an aesthetically pleasing experience?

%B Nature Human Behaviour %V 5 %P 675 - 676 %8 05/2021 %G eng %U http://www.nature.com/articles/s41562-021-01125-5 %N 6 %! Nat Hum Behav %R 10.1038/s41562-021-01125-5 %0 Journal Article %J CVPR 2020 %D 2020 %T Putting visual object recognition in context %A Zhang, Mengmi %A Tseng, Claire %A Gabriel Kreiman %X

Context plays an important role in visual recognition. Recent studies have shown that visual recognition networks can be fooled by placing objects in inconsistent contexts (e.g. a cow in the ocean). To understand and model the role of contextual information in visual recognition, we systematically and quantitatively investigated ten critical properties of where, when, and how context modulates recognition, including the amount of context, context and object resolution, geometrical structure of context, context congruence, time required to incorporate contextual information, and temporal dynamics of contextual modulation. The tasks involve recognizing a target object surrounded by context in a natural image. As an essential benchmark, we first describe a series of psychophysics experiments, where we alter one aspect of context at a time, and quantify human recognition accuracy. To computationally assess performance on the same tasks, we propose a biologically inspired, context-aware object recognition model consisting of a two-stream architecture. The model processes visual information at the fovea and periphery in parallel, dynamically incorporates both object and contextual information, and sequentially reasons about the class label for the target object. Across a wide range of behavioral tasks, the model approximates human-level performance without retraining for each task, captures the dependence of context enhancement on image properties, and provides initial steps towards integrating scene and object information for visual recognition.
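
As a rough illustration of the two-stream idea, the sketch below is an assumption-laden toy rather than the published model: a foveal stream sees a high-resolution crop around the target object, a peripheral stream sees the (e.g. blurred or downsampled) full scene, and their features are fused to classify the target. The ResNet-18 backbone and fusion by concatenation are illustrative choices, not details taken from the paper.

```python
# Minimal two-stream sketch: foveal crop + peripheral scene, fused for classification.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TwoStreamContextNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Same architecture, separate weights, for the two streams; drop the final fc layer.
        self.fovea = nn.Sequential(*list(resnet18().children())[:-1])      # -> (B, 512, 1, 1)
        self.periphery = nn.Sequential(*list(resnet18().children())[:-1])  # -> (B, 512, 1, 1)
        self.classifier = nn.Linear(512 * 2, num_classes)

    def forward(self, fovea_crop, blurred_scene):
        f = self.fovea(fovea_crop).flatten(1)         # object information at the fovea
        p = self.periphery(blurred_scene).flatten(1)  # contextual information in the periphery
        return self.classifier(torch.cat([f, p], dim=1))

# Example usage with dummy tensors (batch of 2, 224x224 RGB inputs).
model = TwoStreamContextNet(num_classes=55)
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
```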

%B CVPR 2020 %8 01/2020 %G eng %0 Journal Article %J Nature Communications %D 2018 %T Finding any Waldo with zero-shot invariant and efficient visual search %A Zhang, Mengmi %A Feng, Jiashi %A Ma, Keng Teck %A Lim, Joo Hwee %A Qi Zhao %A Gabriel Kreiman %X

Searching for a target object in a cluttered scene constitutes a fundamental challenge in daily vision. Visual search must be selective enough to discriminate the target from distractors, invariant to changes in the appearance of the target, efficient to avoid exhaustive exploration of the image, and must generalize to locate novel target objects with zero-shot training. Previous work on visual search has focused on searching for perfect matches of a target after extensive category-specific training. Here, we show for the first time that humans can efficiently and invariantly search for natural objects in complex scenes. To gain insight into the mechanisms that guide visual search, we propose a biologically inspired computational model that can locate targets without exhaustive sampling and which can generalize to novel objects. The model provides an approximation to the mechanisms integrating bottom-up and top-down signals during search in natural scenes.
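
The idea of target-modulated, zero-shot search can be sketched as follows. This is a hypothetical simplification, not the published model: features of the target image serve as a template convolved with the feature map of the search image, and the resulting attention map is sampled greedily with inhibition-of-return. The VGG-16 backbone, fixed inhibition window, and fixation budget are all illustrative assumptions.

```python
# Minimal sketch of target-modulated search with no category-specific training.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Generic convolutional backbone (untrained here for brevity; a pretrained one
# would be used in practice).
features = vgg16().features.eval()

def search_fixations(search_img, target_img, max_fixations=10):
    """search_img: (1,3,H,W); target_img: (1,3,h,w). Returns a list of (row, col) fixations."""
    with torch.no_grad():
        search_feat = features(search_img)                   # (1, C, H', W')
        target_feat = features(target_img).mean(dim=(2, 3))  # (1, C) global target template
        # Top-down modulation: use the target template as a 1x1 convolution kernel.
        attn = F.conv2d(search_feat, target_feat[:, :, None, None])  # (1, 1, H', W')
        attn = F.interpolate(attn, size=search_img.shape[2:],
                             mode='bilinear', align_corners=False)[0, 0]
    fixations = []
    for _ in range(max_fixations):
        y, x = divmod(int(attn.argmax()), attn.shape[1])
        fixations.append((y, x))
        # Inhibition-of-return: suppress a window around the visited location.
        attn[max(0, y - 20):y + 20, max(0, x - 20):x + 20] = attn.min()
        # In a full model the loop would stop once a fixation lands on the target.
    return fixations
```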

%B Nature Communications %V 9 %8 09/2018 %G eng %U http://www.nature.com/articles/s41467-018-06217-x %! Nat Commun %R 10.1038/s41467-018-06217-x %0 Generic %D 2018 %T What am I searching for? %A Zhang, Mengmi %A Feng, Jiashi %A Lim, Joo Hwee %A Qi Zhao %A Gabriel Kreiman %X

Can we infer intentions and goals from a person’s actions? As an example of this family of problems, we consider here whether it is possible to decipher what a person is searching for by decoding their eye movement behavior. We conducted two human psychophysics experiments on object arrays and natural images in which we monitored subjects’ eye movements while they were looking for a target object. Using as input the pattern of "error" fixations on non-target objects before the target was found, we developed a model (InferNet) whose goal was to infer what the target was. "Error" fixations share similar features with the sought target. The InferNet model uses a pre-trained 2D convolutional architecture to extract features from the error fixations and computes a 2D similarity map between the error fixation and all locations across the search image by modulating the search image via convolution across layers. InferNet consolidates the modulated response maps across layers via max pooling to keep track of the sub-patterns highly similar to features at error fixations and integrates these maps across all error fixations. InferNet successfully identifies the subject’s goal and outperforms all competitive null models, even without any object-specific training on the inference task.
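
A minimal sketch of this pipeline is shown below, with hypothetical layer choices and helper names rather than the published InferNet code: features of each error-fixation patch act as convolution kernels over multi-layer feature maps of the search image, the per-layer similarity maps are consolidated by max pooling, and the maps are summed across error fixations to produce a target-likelihood map.

```python
# Minimal sketch (assumptions, not the published InferNet code) of inferring the
# search target from "error" fixations via multi-layer feature similarity.
import torch
import torch.nn.functional as F
from torchvision.models import alexnet

backbone = alexnet().features.eval()
LAYERS = [3, 6, 8]  # illustrative choice of intermediate convolutional layers

def layer_features(img):
    """Return activations of `img` at the selected layers."""
    feats, x = [], img
    for i, layer in enumerate(backbone):
        x = layer(x)
        if i in LAYERS:
            feats.append(x)
    return feats

def target_likelihood_map(search_img, error_patches):
    """search_img: (1,3,H,W); error_patches: list of (1,3,h,w) crops (e.g. 64x64) at error fixations."""
    with torch.no_grad():
        search_feats = layer_features(search_img)
        total = torch.zeros(search_img.shape[2:])
        for patch in error_patches:
            per_layer = []
            for s, p in zip(search_feats, layer_features(patch)):
                # Treat the patch's pooled features as a 1x1 convolution template.
                kernel = p.mean(dim=(2, 3))[:, :, None, None]   # (1, C, 1, 1)
                sim = F.conv2d(s, kernel)                        # similarity map at this layer
                per_layer.append(F.interpolate(sim, size=search_img.shape[2:],
                                               mode='bilinear', align_corners=False))
            # Max pool across layers, then integrate (sum) across error fixations.
            total += torch.stack(per_layer).max(dim=0).values[0, 0]
    return total
```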

%8 07/2018 %1 arXiv:1807.11926 %2 http://hdl.handle.net/1721.1/119576