%0 Generic %D 2021 %T Image interpretation by iterative bottom-up top-down processing %A Shimon Ullman %A Liav Assif %A Alona Strugatski %A Ben-Zion Vatashsky %A Hila Levi %A Aviv Netanyahu %A Adam Uri Yaari %X
Scene understanding requires the extraction and representation of scene components, such as objects and their parts, people, and places, together with their individual properties, as well as relations and interactions between them. We describe a model in which meaningful scene structures are extracted from the image by an iterative process, combining bottom-up (BU) and top-down (TD) networks, interacting through a symmetric bi-directional communication between them (‘counter-streams’ structure). The BU-TD model extracts and recognizes scene constituents with their selected properties and relations, and uses them to describe and understand the image. The scene representation is constructed by the iterative use of three components. The first model component is a bottom-up stream that extracts selected scene elements, properties and relations. The second component (‘cognitive augmentation’) augments the extracted visual representation based on relevant non-visual stored representations. It also provides input to the third component, the top-down stream, in the form of a TD instruction, instructing the model what task to perform next. The top-down stream then guides the BU visual stream to perform the selected task in the next cycle. During this process, the visual representations extracted from the image can be combined with relevant non-visual representations, so that the final scene representation is based on both visual information extracted from the scene and relevant stored knowledge of the world. The extraction process is shown to have favourable properties in terms of combinatorial generalization, generalizing well to novel scene structures and new combinations of objects, properties and relations not seen during training. Finally, we compare the model with relevant aspects of human vision, and suggest directions for using the BU-TD scheme to integrate visual and cognitive components in the process of scene understanding.
https://hdl.handle.net/1721.1/139678
%0 Journal Article %J Cognition %D 2018 %T Full interpretation of minimal images %A Guy Ben-Yosef %A Liav Assif %A Shimon Ullman %K Image interpretation %K Minimal images %K Parts and relations %K Top-down processing %XThe goal in this work is to model the process of ‘full interpretation’ of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object into the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small, and the variability of possible configurations is low.
We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of ‘minimal configurations’: these are reduced local regions, which are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model, and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss implications of full interpretation for difficult visual tasks, such as recognizing human activities or interactions, which are beyond the scope of current models of visual recognition.
%B Cognition %V 171 %P 65-84 %8 02/2018 %G eng %& 65 %R https://doi.org/10.1016/j.cognition.2017.10.006 %0 Generic %D 2017 %T Full interpretation of minimal images %A Guy Ben-Yosef %A Liav Assif %A Shimon Ullman %K Image interpretation %K Parts and relations %K Visual object recognition %XThe goal in this work is to model the process of ‘full interpretation’ of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object into the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small, and the variability of possible configurations is low.
We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of ‘minimal configurations’: these are reduced local regions, which are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model, and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss implications of full interpretation to difficult visual tasks, such as recognizing human activities or interactions, which are beyond the scope of current models of visual recognition.
This manuscript has been accepted for publication in Cognition.
http://hdl.handle.net/1721.1/106887
%0 Journal Article %J PNAS %D 2016 %T Atoms of recognition in human and computer vision %A Shimon Ullman %A Liav Assif %A Eitan Fetaya %A Daniel Harari %K Computer vision %K minimal images %K object recognition %K visual perception %K visual representations %XWe describe a computational model of humans' ability to provide a detailed interpretation of a scene’s components. Humans can identify meaningful components almost everywhere in an image, and identifying these components is an essential part of the visual process, and of understanding the surrounding scene and its potential meaning to the viewer. Detailed interpretation is beyond the scope of current models of visual recognition. Our model suggests that this is a fundamental limitation, related to the fact that existing models rely on feed-forward but limited top-down processing. In our model, a first recognition stage leads to the initial activation of class candidates, which is incomplete and with limited accuracy. This stage then triggers the application of class-specific interpretation and validation processes, which recover a richer and more accurate interpretation of the visible scene. We discuss implications of the model for visual interpretation by humans and by computer vision models.
%B Cognitive Science Society %G eng %0 Journal Article %J Visual Cognition %D 2015 %T Visual categorization of social interactions %A Stephan de la Rosa %A Rabia N. Choudhery %A Cristóbal Curio %A Shimon Ullman %A Liav Assif %A Heinrich H. Bülthoff %XProminent theories of action recognition suggest that during the recognition of actions the physical pattern of the action is associated with only one action interpretation (e.g., a person waving his arm is recognized as waving). In contrast to this view, studies examining the visual categorization of objects show that objects are recognized in multiple ways (e.g., a VW Beetle can be recognized as a car or a beetle) and that categorization performance is based on the visual and motor movement similarity between objects. Here, we examined whether there is evidence for multiple levels of categorization for social interactions (physical interactions with another person, e.g., handshakes). To do so, we compared visual categorization of objects and social interactions (Experiments 1 and 2) in a grouping task and assessed the usefulness of motor and visual cues (Experiments 3, 4, and 5) for object and social interaction categorization. Additionally, we measured recognition performance associated with recognizing objects and social interactions at different categorization levels (Experiment 6). We found that basic-level object categories were associated with a clear recognition advantage compared to subordinate recognition, but basic-level social interaction categories provided only a small recognition advantage. Moreover, basic-level object categories were more strongly associated with similar visual and motor cues than basic-level social interaction categories. The results suggest that the cognitive categories underlying the recognition of objects and social interactions are associated with different recognition performance.
These results are in line with the idea that the same action can be associated with several action interpretations (e.g., a person waving his arm can be recognized as waving or greeting).
%B Visual Cognition %V 22 %8 02/06/2015 %G eng %N 9-10 %R 10.1080/13506285.2014.991368