Vision and Language

Vision and Goals

We refer to this as the ‘Turing test for vision’: using vision to answer a large and flexible set of queries about objects and agents in an image in a human-like manner. Queries can concern, for example, objects, their parts, spatial relations between objects, actions, goals, and interactions. Understanding queries and formulating answers requires interaction between vision and natural language, and interpreting goals and interactions requires connections between vision and social cognition. Answering queries also requires task-dependent processing, i.e., different visual processes to achieve different goals.

Approach

To achieve our goals we will develop novel methods for extracting meaningful information from images, based on extended interpretation and goal-directed processing. Semantic image interpretation often requires an extended process directed at specific objects and relations in a task-dependent manner, e.g., what is person X looking at or touching, or is object Y stable. Our method will combine probabilistic inference with policy learning to generate a sequence of operations applied to the image. The first stage will construct an initial interpretation of the scene in a bottom-up manner, producing rich hierarchical representations that are robust and invariant. The second stage will extend this interpretation in a task-dependent manner, synthesizing different processes in response to different queries; this extended goal-directed processing will use policy learning based on Markov decision processes and reinforcement learning.
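As a minimal illustration of the second stage (not CBMM's actual system), the sketch below casts operation selection as a tiny Markov decision process: the actions are visual operations, a correct sequence of operations answers the query, and tabular Q-learning discovers that sequence. All specifics here — the operation names, the fixed dependency order, and the reward values — are hypothetical, chosen only to make the policy-learning idea concrete.

```python
import random

# Hypothetical visual operations for a query like "what is person X looking at?".
# Assumption: each operation is only useful after its predecessor has been applied.
OPS = ["segment_scene", "locate_person", "trace_gaze", "report_object"]

def step(state, op):
    """Toy MDP dynamics. State = number of useful operations applied so far.

    Returns (next_state, reward, done). A correct next operation advances the
    interpretation; the final one answers the query (+1); any other operation
    wastes effort (-0.1).
    """
    if op == OPS[state]:                     # the correct next operation
        state += 1
        if state == len(OPS):                # query answered
            return state, 1.0, True
        return state, 0.0, False
    return state, -0.1, False                # wasted operation

def train(episodes=2000, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy exploration policy."""
    rng = random.Random(seed)
    Q = [[0.0] * len(OPS) for _ in range(len(OPS))]  # Q[state][action]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < eps:           # explore
                a = rng.randrange(len(OPS))
            else:                            # exploit current estimates
                a = max(range(len(OPS)), key=lambda i: Q[s][i])
            s2, r, done = step(s, a)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q

def greedy_sequence(Q, max_steps=20):
    """Roll out the learned policy greedily; returns the operation sequence."""
    s, seq = 0, []
    for _ in range(max_steps):
        a = max(range(len(OPS)), key=lambda i: Q[s][i])
        seq.append(OPS[a])
        s, _, done = step(s, a)
        if done:
            break
    return seq
```

After training, the greedy policy applies the four operations in their useful order. In the full vision setting, the state would be the current partial interpretation of the image and the action set would be a library of visual routines, but the control structure — a learned policy choosing which operation to apply next, conditioned on the query — is the same.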

Integration

Close interactions with the Social Intelligence research thrust involve understanding actions, goals, and interactions among agents. Interactions with the Development of Intelligence research thrust involve incorporating useful structures and biases derived from human developmental cognition. Hierarchical object recognition connects with the Circuits for Intelligence research thrust, both for modeling cortical mechanisms of hierarchical representation and object recognition and for connecting computational constraints with neuronal circuits. Computational theories will be used to design stimuli for testing neuronal responses to object configurations, agent-object interactions, and interactions among agents (Circuits for Intelligence), and for predicting, testing, and analyzing interactions among brain areas (Social Intelligence). Visual aspects of social interactions also connect with the Social Intelligence research thrust: the Vision and Language research thrust will focus on what vision can deliver and how, while the Social Intelligence research thrust will focus on how representations of social knowledge incorporate visual information to make social inferences. All projects will engage the Theoretical Frameworks for Intelligence research thrust on invariant recognition and probabilistic modeling and inference.

Recent Publications

W. Lotter, G. Kreiman, and D. Cox, "Unsupervised Learning of Visual Structure using Predictive Generative Networks," in International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2016. (CBMM Funded)

K. Allen, I. Yildirim, and J. B. Tenenbaum, "Integrating Identification and Perception: A case study of familiar and unfamiliar face processing," in Proceedings of the Thirty-Eighth Annual Conference of the Cognitive Science Society, 2016. (CBMM Related)

A. Wong and A. Yuille, "One Shot Learning by Composition of Meaningful Patches," in International Conference on Computer Vision (ICCV), Santiago, Chile, 2015. (CBMM Funded)