%0 Journal Article %J Cognition %D 2020 %T Minimal videos: Trade-off between spatial and temporal information in human and machine vision. %A Guy Ben-Yosef %A Gabriel Kreiman %A Shimon Ullman %K Comparing deep neural networks and humans %K Integration of spatial and temporal visual information %K minimal images %K Minimal videos %K Visual dynamic recognition %X

Objects and their parts can be visually recognized from purely spatial or purely temporal information, but the mechanisms integrating space and time are poorly understood. Here we show that visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by identifying minimal videos: short and tiny video clips in which objects, parts, and actions can be reliably recognized, but in which any further reduction in either space or time makes them unrecognizable. Human recognition in minimal videos is invariably accompanied by full interpretation of the internal components of the video. State-of-the-art deep convolutional networks for dynamic recognition cannot replicate human behavior in these configurations. The gap between human and machine vision demonstrated here is due to critical mechanisms for full spatiotemporal interpretation that are lacking in current computational models.
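
As a concrete illustration of the minimality criterion described above, here is a minimal sketch in Python. The `is_recognized` oracle is a hypothetical stand-in for the human psychophysics test used in the paper, and the sketch checks only single-step reductions (one row or column of pixels in space, one frame in time):

```python
def spatial_reductions(clip):
    """Yield one-step spatial reductions of a clip (T x H x W array):
    each crop removes a single row or column of pixels."""
    yield clip[:, 1:, :]   # remove top row
    yield clip[:, :-1, :]  # remove bottom row
    yield clip[:, :, 1:]   # remove left column
    yield clip[:, :, :-1]  # remove right column

def temporal_reductions(clip):
    """Yield one-step temporal reductions: drop the first or last frame."""
    yield clip[1:]
    yield clip[:-1]

def is_minimal_video(clip, is_recognized):
    """A clip is minimal if it is recognized, but every one-step
    reduction in either space or time is not."""
    if not is_recognized(clip):
        return False
    reductions = list(spatial_reductions(clip)) + list(temporal_reductions(clip))
    return all(not is_recognized(r) for r in reductions)
```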

%B Cognition %8 08/2020 %G eng %U https://www.sciencedirect.com/science/article/abs/pii/S0010027720300822 %R 10.1016/j.cognition.2020.104263 %0 Conference Paper %B International Conference on Learning Representations (ICLR 2020) %D 2020 %T What can human minimal videos tell us about dynamic recognition models? %A Guy Ben-Yosef %A Gabriel Kreiman %A Shimon Ullman %X

In human vision, objects and their parts can be visually recognized from purely spatial or purely temporal information, but the mechanisms integrating space and time are poorly understood. Here we show that human visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by identifying minimal videos: short and tiny video clips in which objects, parts, and actions can be reliably recognized, but in which any further reduction in either space or time makes them unrecognizable. State-of-the-art deep networks for dynamic visual recognition cannot replicate human behavior in these configurations. This gap between humans and machines points to critical mechanisms in human dynamic vision that are lacking in current models.

Published as a workshop paper at “Bridging AI and Cognitive Science” (ICLR 2020)

%B International Conference on Learning Representations (ICLR 2020) %C Virtual Conference %8 04/2020 %G eng %U https://baicsworkshop.github.io/pdf/BAICS_1.pdf %0 Conference Paper %B International Conference on Learning Representations (ICLR) %D 2019 %T Minimal images in deep neural networks: Fragile Object Recognition in Natural Images %A S. Srivastava %A Guy Ben-Yosef %A X. Boix %X

The human ability to recognize objects is impaired when the object is not shown in full. "Minimal images" are the smallest regions of an image that remain recognizable for humans. Ullman et al. (2016) show that a slight modification of the location and size of the visible region of a minimal image produces a sharp drop in human recognition accuracy. In this paper, we demonstrate that such drops in accuracy due to changes of the visible region are common to humans and existing state-of-the-art deep neural networks (DNNs), and are much more prominent in DNNs. We found many cases where DNNs classified one region correctly and the other incorrectly, even though the two regions differed by only one row or column of pixels, and even though the regions were often larger than the average human minimal image. We show that this phenomenon is distinct from the lack of invariance to minor changes in object location that previous work has reported for DNNs. Our results thus reveal a new failure mode of DNNs, one that also affects humans, but to a much lesser degree. They expose how fragile DNN recognition is for natural images, even without adversarial patterns being introduced. Bringing the robustness of DNNs on natural images up to the human level remains an open challenge for the community.
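
A minimal sketch of this kind of one-pixel probe, assuming a pretrained torchvision classifier; the image path, crop location, and model choice are placeholders, and the standard ImageNet preprocessing (which resizes the crop) only approximates the paper's protocol:

```python
import torch
from PIL import Image
from torchvision import models

# Any pretrained recognition model could be probed this way.
weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights).eval()
preprocess = weights.transforms()

def predict(region):
    """Top-1 ImageNet class index for a PIL image region."""
    with torch.no_grad():
        return model(preprocess(region).unsqueeze(0)).argmax(1).item()

img = Image.open("object.jpg").convert("RGB")  # placeholder image
left, top, size = 80, 60, 100                  # placeholder crop location
region = img.crop((left, top, left + size, top + size))
shifted = img.crop((left + 1, top, left + 1 + size, top + size))  # one pixel over

p1, p2 = predict(region), predict(shifted)
labels = weights.meta["categories"]
print("changed" if p1 != p2 else "stable", ":", labels[p1], "->", labels[p2])
```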

%B International Conference on Learning Representations (ICLR) %C New Orleans, LA %G eng %U https://arxiv.org/pdf/1902.03227.pdf %0 Journal Article %J Cognition %D 2018 %T Full interpretation of minimal images %A Guy Ben-Yosef %A Liav Assif %A Shimon Ullman %K Image interpretation %K Minimal images %K Parts and relations %K Top-down processing %X

The goal in this work is to model the process of ‘full interpretation’ of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object into the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small and the variability of possible configurations is low.

We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of ‘minimal configurations’: reduced local regions that are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss possible extensions and implications of full interpretation for difficult visual tasks, such as recognizing social interactions, which are beyond the scope of current models of visual recognition.
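
The components-and-relations vocabulary in the abstract can be made concrete with a small sketch. The primitive types and the single example relation below are illustrative assumptions, not the paper's exact inventory:

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]

@dataclass
class Contour:
    points: List[Point]  # ordered points along a contour primitive

@dataclass
class Interpretation:
    parts: Dict[str, Contour]  # named semantic parts, e.g. {'ear': ..., 'bridle': ...}

def end_connected(a: Contour, b: Contour, tol: float = 3.0) -> bool:
    """Example relation: contour a ends where contour b begins (within tol pixels)."""
    (ax, ay), (bx, by) = a.points[-1], b.points[0]
    return math.hypot(ax - bx, ay - by) <= tol

def score(interp: Interpretation, relations) -> float:
    """Fraction of required relations a candidate interpretation satisfies.
    `relations` is a list of ((part_name, part_name), relation_fn) pairs."""
    checks = [fn(interp.parts[p], interp.parts[q]) for (p, q), fn in relations]
    return sum(checks) / len(checks)
```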

%B Cognition %V 171 %P 65-84 %8 02/2018 %G eng %& 65 %U https://linkinghub.elsevier.com/retrieve/pii/S001002771730269X %! Cognition %R 10.1016/j.cognition.2017.10.006 %0 Journal Article %J Interface Focus %D 2018 %T Image interpretation above and below the object level %A Guy Ben-Yosef %A Shimon Ullman %X

Computational models of vision have advanced in recent years at a rapid rate, rivalling human-level performance in some areas. Much of the progress to date has focused on analysing the visual scene at the object level: the recognition and localization of objects in the scene. Human understanding of images reaches a richer and deeper level, both ‘below’ the object level, such as identifying and localizing object parts and sub-parts, and ‘above’ the object level, such as identifying object relations and agents with their actions and interactions. In both cases, understanding depends on recovering meaningful structures in the image, along with their components, properties and inter-relations, a process referred to here as ‘image interpretation’. In this paper, we describe recent directions, based on human and computer vision studies, towards human-like image interpretation beyond the reach of current schemes, both below the object level and at the level of meaningful configurations above the recognition of individual objects, in particular interactions between two people in close contact. In both cases the recognition process depends on the detailed interpretation of so-called ‘minimal images’, and at both levels recognition depends on combining ‘bottom-up’ processing, proceeding from low to higher levels of a processing hierarchy, with ‘top-down’ processing, proceeding from high to lower levels of visual analysis.

%B Interface Focus %V 8 %P 20180020 %8 06/2018 %G eng %U https://royalsocietypublishing.org/doi/full/10.1098/rsfs.2018.0020#d3e1503 %N 4 %! Interface Focus %R 10.1098/rsfs.2018.0020 %0 Generic %D 2018 %T Image interpretation above and below the object level %A Guy Ben-Yosef %A Shimon Ullman %K Interaction Recognition %K minimal images %K Social Interactions %K Visual interpretation %K visual recognition %X

Computational models of vision have advanced in recent years at a rapid rate, rivaling human-level performance in some areas. Much of the progress to date has focused on analyzing the visual scene at the object level: the recognition and localization of objects in the scene. Human understanding of images reaches a richer and deeper level, both ‘below’ the object level, such as identifying and localizing object parts and sub-parts, and ‘above’ the object level, such as identifying object relations and agents with their actions and interactions. In both cases, understanding depends on recovering meaningful structures in the image, their components, properties, and inter-relations, a process referred to here as ‘image interpretation’.

In this paper we describe recent directions, based on human and computer vision studies, towards human-like image interpretation beyond the reach of current schemes, both below the object level and at the level of meaningful configurations above the recognition of individual objects, in particular interactions between two people in close contact. In both cases the recognition process depends on the detailed interpretation of so-called ‘minimal images’, and at both levels recognition depends on combining ‘bottom-up’ processing, proceeding from low to higher levels of a processing hierarchy, with ‘top-down’ processing, proceeding from high to lower levels of visual analysis.

%8 05/2018 %2

http://hdl.handle.net/1721.1/115373

%0 Generic %D 2018 %T Partially Occluded Hands: A challenging new dataset for single-image hand pose estimation %A Battushig Myanganbayar %A Cristina Mata %A Gil Dekel %A Boris Katz %A Guy Ben-Yosef %A Andrei Barbu %K dataset %K Partial occlusion %K RGB hand-pose reconstruction %X

Recognizing the pose of hands matters most when hands are interacting with other objects. To understand how well both machines and humans perform on single-image 2D hand-pose reconstruction from RGB images, we collected a challenging dataset of hands interacting with 148 objects. We used a novel methodology that provides the same hand in the same pose both with the object present and occluding the hand, and without the object occluding the hand. Additionally, we collected a wide range of grasps for each object, designing the data collection methodology to ensure this diversity. Using this dataset, we measured the performance of two state-of-the-art hand-pose recognition methods, showing that both are extremely brittle when faced with even light occlusion from an object. This is not evident in previous datasets, because they often avoid hand-object occlusions and because they are collected from videos where hands are often between objects and mostly unoccluded. We annotated a subset of the dataset and used it to show that humans are robust with respect to occlusion, and also to characterize human hand perception, the space of grasps that seem to be considered, and the accuracy of reconstructing occluded portions of hands. We expect that such data will be of interest both to the vision community, for developing more robust hand-pose algorithms, and to the robotic grasp planning community, for learning such grasps. The dataset is available at occludedhands.com
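
One way to quantify the brittleness described above on such paired data is the PCK metric (percentage of correct keypoints). The sketch below assumes a model callable returning (J, 2) pixel coordinates and an iterable of matched unoccluded/occluded pairs; both are placeholders, not the dataset's actual API:

```python
import numpy as np

def pck(pred, gt, thresh=10.0):
    """Percentage of correct keypoints: fraction of predicted 2D joints
    within `thresh` pixels of ground truth. pred, gt: (J, 2) arrays."""
    return float((np.linalg.norm(pred - gt, axis=1) <= thresh).mean())

def occlusion_gap(model, pairs, thresh=10.0):
    """Mean PCK drop from the unoccluded to the occluded view of the
    same hand pose; `pairs` yields (clean_img, occluded_img, gt_keypoints)."""
    gaps = [pck(model(clean), gt, thresh) - pck(model(occluded), gt, thresh)
            for clean, occluded, gt in pairs]
    return float(np.mean(gaps))
```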

%8 12/2018 %0 Conference Paper %B The 14th Asian Conference on Computer Vision (ACCV 2018) %D 2018 %T Partially Occluded Hands: A challenging new dataset for single-image hand pose estimation %A Battushig Myanganbayar %A Cristina Mata %A Gil Dekel %A Boris Katz %A Guy Ben-Yosef %A Andrei Barbu %K dataset %K Partial occlusion %K RGB hand-pose reconstruction %X

Recognizing the pose of hands matters most when hands are interacting with other objects. To understand how well both machines and humans perform on single-image 2D hand-pose reconstruction from RGB images, we collected a challenging dataset of hands interacting with 148 objects. We used a novel methodology that provides the same hand in the same pose both with the object present and occluding the hand, and without the object occluding the hand. Additionally, we collected a wide range of grasps for each object, designing the data collection methodology to ensure this diversity. Using this dataset, we measured the performance of two state-of-the-art hand-pose recognition methods, showing that both are extremely brittle when faced with even light occlusion from an object. This is not evident in previous datasets, because they often avoid hand-object occlusions and because they are collected from videos where hands are often between objects and mostly unoccluded. We annotated a subset of the dataset and used it to show that humans are robust with respect to occlusion, and also to characterize human hand perception, the space of grasps that seem to be considered, and the accuracy of reconstructing occluded portions of hands. We expect that such data will be of interest both to the vision community, for developing more robust hand-pose algorithms, and to the robotic grasp planning community, for learning such grasps. The dataset is available at occludedhands.com

%B The 14th Asian Conference on Computer Vision (ACCV 2018) %8 12/2018 %G eng %U http://accv2018.net/ %0 Generic %D 2018 %T Spatiotemporal interpretation features in the recognition of dynamic images %A Guy Ben-Yosef %A Gabriel Kreiman %A Shimon Ullman %X

Objects and their parts can be visually recognized and localized from purely spatial information in static images, and also from purely temporal information, as in the perception of biological motion. Cortical regions have been identified that appear to specialize in visual recognition based on either static or dynamic cues, but the mechanisms by which spatial and temporal information are integrated are poorly understood. Here we show that visual recognition of objects and actions can be achieved by efficiently combining spatial and motion cues in configurations where each source on its own is insufficient for recognition. This analysis is obtained by identifying minimal spatiotemporal configurations: short videos in which objects and their parts, along with an action being performed, can be reliably recognized, but in which any reduction in either space or time makes them unrecognizable. State-of-the-art computational models for recognition from dynamic images, based on deep 2D and 3D convolutional networks, cannot replicate human recognition in these configurations. Action recognition in minimal spatiotemporal configurations is invariably accompanied by full human interpretation of the internal components of the image and their inter-relations. We hypothesize that this gap is due to a full spatiotemporal interpretation process, which in human vision is an integral part of recognizing dynamic events, but which is not sufficiently represented in current DNNs.
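
As a rough illustration of the two information sources the abstract contrasts, the sketch below strips a clip down to purely temporal information (frame differencing, a crude stand-in for the motion cues in biological-motion stimuli) or purely spatial information (a single repeated frame):

```python
import numpy as np

def motion_only(clip):
    """Keep only frame-to-frame change in a T x H x W grayscale clip;
    static appearance maps to zeros."""
    diff = np.abs(np.diff(clip.astype(np.float32), axis=0))
    return diff / max(float(diff.max()), 1e-6)

def spatial_only(clip):
    """Keep only static appearance: repeat the middle frame, removing motion."""
    mid = clip[len(clip) // 2]
    return np.repeat(mid[None], len(clip), axis=0)
```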

%8 11/2018 %2

http://hdl.handle.net/1721.1/119248

%0 Generic %D 2017 %T Full interpretation of minimal images %A Guy Ben-Yosef %A Liav Assif %A Shimon Ullman %K Image interpretation %K Parts and relations %K Visual object recognition %X

The goal in this work is to model the process of ‘full interpretation’ of object images, which is the ability to identify and localize all semantic features and parts that are recognized by human observers. The task is approached by dividing the interpretation of the complete object into the interpretation of multiple reduced but interpretable local regions. In such reduced regions, interpretation is simpler, since the number of semantic components is small and the variability of possible configurations is low.

We model the interpretation process by identifying primitive components and relations that play a useful role in local interpretation by humans. To identify useful components and relations used in the interpretation process, we consider the interpretation of ‘minimal configurations’: reduced local regions that are minimal in the sense that further reduction renders them unrecognizable and uninterpretable. We show that such minimal interpretable images have useful properties, which we use to identify informative features and relations used for full interpretation. We describe our interpretation model and show results of detailed interpretations of minimal configurations, produced automatically by the model. Finally, we discuss implications of full interpretation for difficult visual tasks, such as recognizing human activities or interactions, which are beyond the scope of current models of visual recognition.

This manuscript has been accepted for publication in Cognition.

%8 02/2017 %2

http://hdl.handle.net/1721.1/106887

%0 Conference Paper %B AAAI Spring Symposium Series, Science of Intelligence %D 2017 %T A model for interpreting social interactions in local image regions %A Guy Ben-Yosef %A Alon Yachin %A Shimon Ullman %X
Understanding social interactions (such as ‘hug’ or ‘fight’) is a basic and important capacity of the human visual system, but a challenging and still open problem for modeling. In this work we study visual recognition of social interactions, based on small but recognizable local regions. The approach is based on two novel key components: (i) A given social interaction can be recognized reliably from reduced images (called ‘minimal images’). (ii) The recognition of a social interaction depends on identifying components and relations within the minimal image (termed ‘interpretation’). We show psychophysics data for minimal images and modeling results for their interpretation. We discuss the integration of minimal configurations in recognizing social interactions in a detailed, high-resolution image.
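
As an illustration of the kind of relation such an interpretation might test between two agents, here is a hypothetical 'contact' predicate over body keypoints; the paper's actual primitives and relations are defined over minimal-image structures, so this is only an assumed analogy:

```python
import numpy as np

def in_contact(kp_a, kp_b, thresh=15.0):
    """True if any keypoint of person A lies within `thresh` pixels of any
    keypoint of person B; kp_a, kp_b: (J, 2) arrays of 2D body keypoints.
    A relation like this could support labels such as 'hug' or 'fight'."""
    d = np.linalg.norm(kp_a[:, None, :] - kp_b[None, :, :], axis=-1)
    return bool((d <= thresh).any())
```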
%B AAAI Spring Symposium Series, Science of Intelligence %C Palo Alto, CA %8 03/2017 %G eng %U http://www.aaai.org/ocs/index.php/SSS/SSS17/paper/view/15354 %0 Generic %D 2016 %T Recognizing and Interpreting Social Interactions in Local Image Regions %A Guy Ben-Yosef %A Alon Yachin %A Shimon Ullman %X

Understanding social interactions (such as 'hug' or 'fight') is a basic and important capacity of the human visual system, but a challenging and still open problem for modeling. Here we study visual recognition of social interactions, based on small but recognizable local regions. The approach is based on two novel key components: (i) A given social interaction can be recognized reliably from reduced images (called 'minimal images'). (ii) The recognition of a social interaction depends on identifying components and relations within the minimal image (termed 'interpretation'). We show psychophysics data for minimal images and modeling results for their interpretation. 

%B The 24th Annual Workshop on Object Perception, Attention, and Memory (OPAM), Boston, MA %8 11/2016 %0 Conference Proceedings %B Cognitive Science Society %D 2015 %T A model for full local image interpretation %A Guy Ben-Yosef %A Liav Assif %A Daniel Harari %A Shimon Ullman %X

We describe a computational model of humans' ability to provide a detailed interpretation of a scene's components. Humans can identify meaningful components almost everywhere in an image, and identifying these components is an essential part of the visual process, and of understanding the surrounding scene and its potential meaning to the viewer. Detailed interpretation is beyond the scope of current models of visual recognition. Our model suggests that this is a fundamental limitation, related to the fact that existing models rely on feed-forward processing with only limited top-down processing. In our model, a first recognition stage leads to the initial activation of class candidates, which is incomplete and of limited accuracy. This stage then triggers the application of class-specific interpretation and validation processes, which recover a richer and more accurate interpretation of the visible scene. We discuss implications of the model for visual interpretation by humans and by computer vision models.
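
A minimal sketch of the two-stage scheme the abstract outlines: a bottom-up pass proposes class candidates, and class-specific top-down interpreters validate them. All names and signatures here are placeholders, not the paper's implementation:

```python
def interpret_scene(image, classifier, interpreters, k=5, accept=0.5):
    """classifier(image) -> {class_name: score} (bottom-up activation);
    interpreters[name](image) -> (interpretation, validity) (top-down)."""
    scores = classifier(image)
    candidates = sorted(scores, key=scores.get, reverse=True)[:k]
    validated = {}
    for name in candidates:
        if name in interpreters:
            interpretation, validity = interpreters[name](image)
            if validity >= accept:  # keep only top-down validated candidates
                validated[name] = interpretation
    return validated
```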

%B Cognitive Science Society %G eng