%0 Generic %D 2021 %T Image interpretation by iterative bottom-up top-down processing %A Shimon Ullman %A Liav Assif %A Alona Strugatski %A Ben-Zion Vatashsky %A Hila Levi %A Aviv Netanyahu %A Adam Uri Yaari %X

Scene understanding requires the extraction and representation of scene components, such as objects and their parts, people, and places, together with their individual properties, as well as relations and interactions between them. We describe a model in which meaningful scene structures are extracted from the image by an iterative process, combining bottom-up (BU) and top-down (TD) networks, interacting through symmetric bi-directional communication between them (‘counter-streams’ structure). The BU-TD model extracts and recognizes scene constituents with their selected properties and relations, and uses them to describe and understand the image.

The scene representation is constructed by the iterative use of three components. The first model component is a bottom-up stream that extracts selected scene elements, properties and relations. The second component (‘cognitive augmentation’) augments the extracted visual representation based on relevant non-visual stored representations. It also provides input to the third component, the top-down stream, in the form of a TD instruction, instructing the model what task to perform next. The top-down stream then guides the BU visual stream to perform the selected task in the next cycle. During this process, the visual representations extracted from the image can be combined with relevant non-visual representations, so that the final scene representation is based on both visual information extracted from the scene and relevant stored knowledge of the world.
We show how the BU-TD model composes complex visual tasks from sequences of steps, invoked by individual TD instructions. In particular, we describe how a sequence of TD-instructions is used to extract structures of interest from the scene, including an algorithm to automatically select the next TD-instruction in the sequence. The selection of the next TD-instruction depends in general on the goal, the image, and on information already extracted from the image in previous steps. The TD-instruction sequence is therefore not a fixed sequence determined at the start, but an evolving program (or ‘visual routine’) that depends on the goal and the image.
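
A minimal Python sketch of the iterative BU-TD cycle described above; every name here (bottom_up, cognitive_augmentation, select_td_instruction) and the toy instruction plan are illustrative assumptions, since the paper's components are learned networks rather than these stubs:

# Illustrative sketch of the iterative BU-TD cycle; all names and the toy
# instruction plan are assumptions, not the paper's implementation.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SceneRepresentation:
    elements: List[str] = field(default_factory=list)   # extracted objects, parts, relations
    knowledge: List[str] = field(default_factory=list)  # augmented non-visual facts

def bottom_up(image, instruction: str, rep: SceneRepresentation) -> SceneRepresentation:
    # BU stream: perform the task named by the TD instruction on the image.
    rep.elements.append(f"result-of:{instruction}")
    return rep

def cognitive_augmentation(rep: SceneRepresentation) -> SceneRepresentation:
    # Augment the visual representation with relevant stored, non-visual knowledge.
    rep.knowledge.append("stored-knowledge-for:" + rep.elements[-1])
    return rep

def select_td_instruction(goal: str, rep: SceneRepresentation) -> Optional[str]:
    # Choose the next TD instruction from the goal and what was already extracted;
    # None ends the routine. A fixed toy plan stands in for the learned selector.
    plan = [f"{goal}:locate", f"{goal}:extract-parts", f"{goal}:extract-relations"]
    step = len(rep.elements)
    return plan[step] if step < len(plan) else None

def interpret(image, goal: str) -> SceneRepresentation:
    # Run BU-TD cycles until the evolving 'visual routine' selects no instruction.
    rep = SceneRepresentation()
    while (instruction := select_td_instruction(goal, rep)) is not None:
        rep = bottom_up(image, instruction, rep)
        rep = cognitive_augmentation(rep)
    return rep

print(interpret(image=None, goal="person-holding-cup").elements)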

The extraction process is shown to have favourable properties in terms of combinatorial generalization, generalizing well to novel scene structures and new combinations of objects, properties and relations not seen during training. Finally, we compare the model with relevant aspects of human vision, and suggest directions for using the BU-TD scheme for integrating visual and cognitive components in the process of scene understanding.

%8 11/2021 %2

https://hdl.handle.net/1721.1/139678

%0 Journal Article %J Cognition %D 2018 %T Full interpretation of minimal images %A Guy Ben-Yosef %A Liav Assif %A Shimon Ullman %K Image interpretation %K Minimal images %K Parts and relations %K Top-down processing %X

The goal in this work is to model the process of ‘full interpretation’ of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object into the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small, and the variability of possible configurations is low.

We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of ‘minimal configurations’: these are reduced local regions, which are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model, and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss possible extensions and implications of full interpretation for difficult visual tasks, such as recognizing social interactions, which are beyond the scope of current models of visual recognition.
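
A hedged sketch of the minimality criterion above: keep shrinking a recognizable patch, and call a patch minimal when every further reduction becomes unrecognizable. The crop-based reductions, the step size, and the toy recognizability test are assumptions; in the paper, recognizability is measured with human observers.

# Hedged sketch of the minimality criterion; the crop-based reductions, step
# size, and toy recognizability test are assumptions (the paper measures
# recognizability with human observers).
from typing import Callable, List, Tuple

Patch = Tuple[int, int, int, int]  # (left, top, right, bottom) image region

def reductions(patch: Patch, step: int = 2) -> List[Patch]:
    # Candidate reduced versions: crop `step` pixels from one side at a time.
    l, t, r, b = patch
    cands = [(l + step, t, r, b), (l, t + step, r, b),
             (l, t, r - step, b), (l, t, r, b - step)]
    return [c for c in cands if c[2] - c[0] > 0 and c[3] - c[1] > 0]

def find_minimal_configurations(start: Patch,
                                recognizable: Callable[[Patch], bool]) -> List[Patch]:
    # Descend from a recognizable patch; a patch is 'minimal' when every
    # further reduction becomes unrecognizable.
    minimal, frontier, seen = [], [start], {start}
    while frontier:
        patch = frontier.pop()
        still_ok = [q for q in reductions(patch) if recognizable(q)]
        if not still_ok:
            minimal.append(patch)
        for q in still_ok:
            if q not in seen:
                seen.add(q)
                frontier.append(q)
    return minimal

# Toy criterion: 'recognizable' while the region is at least 10x10 pixels.
big_enough = lambda p: p[2] - p[0] >= 10 and p[3] - p[1] >= 10
print(find_minimal_configurations((0, 0, 14, 14), big_enough))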

%B Cognition %V 171 %P 65-84 %8 01/2018 %G eng %U https://linkinghub.elsevier.com/retrieve/pii/S001002771730269X %! Cognition %R 10.1016/j.cognition.2017.10.006

%0 Generic %D 2017 %T Full interpretation of minimal images %A Guy Ben-Yosef %A Liav Assif %A Shimon Ullman %K Image interpretation %K Parts and relations %K Visual object recognition %X

The goal in this work is to model the process of ‘full interpretation’ of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object into the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small, and the variability of possible configurations is low.

We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of ‘minimal configurations’: these are reduced local regions, which are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model, and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss implications of full interpretation for difficult visual tasks, such as recognizing human activities or interactions, which are beyond the scope of current models of visual recognition.

This manuscript has been accepted for publication in Cognition.

%8 02/2017 %2

http://hdl.handle.net/1721.1/106887

%0 Journal Article %J PNAS %D 2016 %T Atoms of recognition in human and computer vision %A Shimon Ullman %A Liav Assif %A Eitan Fetaya %A Daniel Harari %K Computer vision %K minimal images %K object recognition %K visual perception %K visual representations %X
Discovering the visual features and representations used by the brain to recognize objects is a central problem in the study of vision. Recently, neural network models of visual object recognition, including biological and deep network models, have shown remarkable progress and have begun to rival human performance in some challenging tasks. These models are trained on image examples and learn to extract features and representations and to use them for categorization. It remains unclear, however, whether the representations and learning processes discovered by current models are similar to those used by the human visual system. Here we show, by introducing and using minimal recognizable images, that the human visual system uses features and processes that are not used by current models and that are critical for recognition. We found by psychophysical studies that at the level of minimal recognizable images a minute change in the image can have a drastic effect on recognition, thus identifying features that are critical for the task. Simulations then showed that current models cannot explain this sensitivity to precise feature configurations and, more generally, do not learn to recognize minimal images at a human level. The role of the features shown here is revealed uniquely at the minimal level, where the contribution of each feature is essential. A full understanding of the learning and use of such features will extend our understanding of visual recognition and its cortical mechanisms and will enhance the capacity of computational models to learn from visual experience and to deal with recognition and detailed image interpretation.
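
A small illustration, under stated assumptions, of the sharp-drop measurement implied above: compare a recognizer's scores on minimal recognizable images against their slightly reduced descendants. The recognize callable and the toy scores are placeholders for the paper's human psychophysics and model simulations.

# Hedged sketch of the minimal-image sensitivity measurement; the data and the
# recognize callable are placeholders for human psychophysics / model simulations.
from typing import Callable, List, Tuple

def recognition_gap(pairs: List[Tuple[str, str]],
                    recognize: Callable[[str], float]) -> float:
    # Mean drop in recognition score from each minimal image to its slightly
    # reduced descendant; a large drop marks features critical for recognition.
    drops = [recognize(minimal) - recognize(reduced) for minimal, reduced in pairs]
    return sum(drops) / len(drops)

# Toy scores standing in for measured human (or model) recognition rates.
scores = {"mirc_a": 0.88, "sub_a": 0.14, "mirc_b": 0.79, "sub_b": 0.20}
gap = recognition_gap([("mirc_a", "sub_a"), ("mirc_b", "sub_b")], scores.get)
print(f"mean recognition drop: {gap:.2f}")
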
%B PNAS %V 113 %P 2744–2749 %8 03/2016 %G eng %U http://www.pnas.org/content/113/10/2744.abstract %N 10 %R 10.1073/pnas.1513198113

%0 Conference Proceedings %B Cognitive Science Society %D 2015 %T A model for full local image interpretation %A Guy Ben-Yosef %A Liav Assif %A Daniel Harari %A Shimon Ullman %X

We describe a computational model of humans' ability to provide a detailed interpretation of a scene’s components. Humans can identify meaningful components almost everywhere in an image, and identifying these components is an essential part of the visual process, and of understanding the surrounding scene and its potential meaning to the viewer. Detailed interpretation is beyond the scope of current models of visual recognition. Our model suggests that this is a fundamental limitation, related to the fact that existing models rely on feed-forward processing with only limited top-down processing. In our model, a first recognition stage leads to the initial activation of class candidates, which is incomplete and with limited accuracy. This stage then triggers the application of class-specific interpretation and validation processes, which recover a richer and more accurate interpretation of the visible scene. We discuss implications of the model for visual interpretation by humans and by computer vision models.
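
A rough Python sketch of the two-stage scheme in this abstract; the function names, signatures, and acceptance threshold are assumptions for illustration, not the authors' implementation.

# Rough sketch of the two-stage scheme; names, signatures, and the acceptance
# threshold are assumptions for illustration, not the authors' implementation.
from typing import Callable, Dict, List, Tuple

Interpreter = Callable[[object], Tuple[dict, float]]  # -> (part map, validation score)

def interpret_scene(image,
                    propose: Callable[[object], List[str]],
                    interpreters: Dict[str, Interpreter],
                    accept_at: float = 0.5) -> Dict[str, dict]:
    # Stage 1: an initial recognition pass proposes class candidates, possibly
    # incomplete and of limited accuracy.
    results = {}
    for label in propose(image):
        if label not in interpreters:
            continue
        # Stage 2: the candidate triggers its class-specific interpretation and
        # validation process, recovering a richer, more accurate interpretation.
        parts, score = interpreters[label](image)
        if score >= accept_at:
            results[label] = {"parts": parts, "score": score}
    return results

# Toy usage with a single candidate class and a stub interpreter.
propose = lambda img: ["horse-head"]
interpreters = {"horse-head": lambda img: ({"ear": (3, 1), "eye": (5, 4)}, 0.9)}
print(interpret_scene(None, propose, interpreters))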

%B Cognitive Science Society %G eng

%0 Journal Article %J Visual Cognition %D 2015 %T Visual categorization of social interactions %A Stephan de la Rosa %A Rabia N. Choudhery %A Cristóbal Curio %A Shimon Ullman %A Liav Assif %A Heinrich H. Bülthoff %X

Prominent theories of action recognition suggest that during the recognition of actions the physical pattern of the action is associated with only one action interpretation (e.g., a person waving his arm is recognized as waving). In contrast to this view, studies examining the visual categorization of objects show that objects are recognized in multiple ways (e.g., a VW Beetle can be recognized as a car or a beetle) and that categorization performance is based on the visual and motor movement similarity between objects. Here, we studied whether there is evidence for multiple levels of categorization for social interactions (physical interactions with another person, e.g., handshakes). To do so, we compared visual categorization of objects and social interactions (Experiments 1 and 2) in a grouping task and assessed the usefulness of motor and visual cues (Experiments 3, 4, and 5) for object and social interaction categorization. Additionally, we measured recognition performance associated with recognizing objects and social interactions at different categorization levels (Experiment 6). We found that basic level object categories were associated with a clear recognition advantage compared to subordinate recognition, but basic level social interaction categories provided only a small recognition advantage. Moreover, basic level object categories were more strongly associated with similar visual and motor cues than basic level social interaction categories. The results suggest that the cognitive categories underlying the recognition of objects and social interactions are associated with different performances. These results are in line with the idea that the same action can be associated with several action interpretations (e.g., a person waving his arm can be recognized as waving or greeting).

%B Visual Cognition %V 22 %8 02/06/2015 %G eng %N 9-10 %R 10.1080/13506285.2014.991368