Vision and Language

Vision and Language

The goal of this research is to combine vision with aspects of language and social cognition to obtain complex knowledge about the environment. To obtain a full understanding of visual scenes, computational models should be able to extract from the scene any meaningful information that a human observer can extract about actions, agents, goals, scenes and object configurations, social interactions, and more. We refer to this as the ‘Turing test for vision,’ i.e., the ability of a machine to use vision to answer a large and flexible set of queries about objects and agents in an image in a human-like manner. Queries might be about objects, their parts, and spatial relations between objects, actions, goals, and interactions. Understanding queries and formulating answers requires interactions between vision and natural language. Interpreting goals and interactions requires connections between vision and social cognition. Answering queries also requires task-dependent processing, i.e., different visual processes to achieve different goals.

Vision and Goals

We refer to this as the ‘Turing test for vision’ — being able to use vision to answer a large and flexible set of queries about objects and agents in the image in a human-like manner. Queries can be, for example, about objects, their parts, spatial relations between objects, actions, goals, and interactions. Understanding queries and formulating answers requires interactions between vision and natural language. Interpreting goals and interactions requires connections between vision and social cognition. Answering queries also requires task-dependent processing, i.e., different visual processes to achieve different goals.

Approach

To achieve our goals we will develop novel methods of extracting meaningful information from images based on extended interpretation and goal-directed processing. Semantic image interpretation often requires an extended process directed to specific objects and relations in a task-dependent manner, e.g., what is person X looking at or touching, or is object Y stable. Our method will combine probabilistic inference with policy learning to generate a sequence of operations applied to the image. The first stage will construct in a bottom-up manner an initial interpretation of the scene, and the second will generate and apply an interpretation in a task dependent manner, in which different processes can be synthesized in response to different queries. The first component will produce rich hierarchical representations in a robust and invariant manner. The second stage, generating extended goal directed processing, will use policy learning related to Markov Decision Processes and reinforcement learning.

Integration

Close interactions with the Social Intelligence research thrust involve understanding actions, goals, and interactions among agents. Interactions with the Development of Intelligence research thrust involve incorporating useful structures and biases derived from human developmental cognition. Hierarchical object recognition connects with the Circuits for Intelligence research thrust for modeling cortical mechanisms of hierarchical representations and object recognition, and for connecting the computational constraints with neuronal circuits. Computational theories will be used to design stimuli for testing neuronal responses to object configurations, agent-object interactions, and interactions among agents in Circuits for Intelligence, and for predicting, testing, and analyzing interactions among brain areas (Social Intelligence). Visual aspects of social interactions connect with the Social Intelligence research thrust. The Vision and Language research thrust will focus on what vision can deliver and how, and the Social Intelligence research thrust will focus on how representations of social knowledge incorporate visual information to make social inferences. All projects will engage Theoretical Frameworks for Intelligence research thrust concerning invariant recognition and probabilistic modeling and inference.

The Center for Brains, Minds & Machines

Vision and Language

Vision and Goals

Approach

Integration

Related Projects

Recent Publications

View all Vision and Language publications >

Research Thrust Leaders

Principal Investigators

Co-Investigators

Search form

You are here

Vision and Language

Vision and Goals

Approach

Integration

Related Projects

Recent Publications

View all Vision and Language publications >

Research Thrust Leaders

Principal Investigators

Co-Investigators