Projects: Vision and Language

Grounded language acquisition

Children learn to describe what they see through visual observation while overhearing incomplete linguistic descriptions of events and of the properties of objects. Our prior work made progress on learning the meanings of some words from visual observation, and we are now extending it in several ways. That prior work required the descriptions to be unambiguous, meaning that they have a single syntactic parse.

Grounded question answering

We have constructed techniques for describing videos with natural-language sentences. Building on this work, we are going beyond description to answering questions such as "What is the person on the left doing with the blue object?" This work takes as input a natural-language question and produces a natural-language answer.

Example of Bayesian vector analysis

This project explores a Bayesian theory of vector analysis for hierarchical motion perception. The theory takes a step towards understanding how moving scenes are parsed into objects.

Human language learning

We are developing a computational framework for modeling language typology and understanding its role in second language acquisition. In particular, we are studying the cognitive and linguistic characteristics of cross-linguistic structure transfer by investigating the relations between the speakers’ native language properties and their usage patterns and mistakes in English.

Intentions and goals in human interactions

We are constructing detectors that determine gaze direction in 3D and applying these detectors to model human-human and human-object interactions. We aim to predict the intentions and goals of agents from their direction of gaze, head pose, body pose, and the spatial relations between agents and objects.
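
A core primitive in this kind of modeling is deciding whether an agent's gaze is directed at a particular object. A minimal sketch of one such geometric test, assuming we already have a 3D head position, a gaze direction vector, and an object position (the function name and the angular tolerance are illustrative, not part of the project's actual system):

```python
import math

def looks_at(head_pos, gaze_dir, obj_pos, max_angle_deg=10.0):
    """Return True if the gaze ray from head_pos along gaze_dir points
    at obj_pos within max_angle_deg (a hypothetical tolerance)."""
    # Vector from the agent's head to the object.
    to_obj = [o - h for o, h in zip(obj_pos, head_pos)]
    norm_obj = math.sqrt(sum(c * c for c in to_obj))
    norm_gaze = math.sqrt(sum(c * c for c in gaze_dir))
    if norm_obj == 0 or norm_gaze == 0:
        return False  # degenerate input: no defined direction
    # Angle between the gaze direction and the head-to-object vector.
    cos_angle = sum(g * t for g, t in zip(gaze_dir, to_obj)) / (norm_gaze * norm_obj)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle)) <= max_angle_deg

# Agent at the origin looking along +x.
print(looks_at([0, 0, 0], [1, 0, 0], [2, 0.1, 0]))  # True: object ~2.9 deg off axis
print(looks_at([0, 0, 0], [1, 0, 0], [0, 2, 0]))    # False: object 90 deg off axis
```

Per-frame outputs of a test like this, combined with body pose and agent-object spatial relations, could then feed a downstream model of intention.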

Investigating neural signals underlying language processing in the human brain

We take advantage of a rare opportunity to interrogate the neural signals underlying language processing in the human brain by invasively recording field potentials from the cortex of epileptic patients. These signals provide high spatial and temporal resolution and are therefore ideally suited for investigating language processing, which is difficult to study in animal models.

Multi-sentence event recognition

Existing approaches to labeling images and videos with natural-language sentences generate either one sentence or a collection of unrelated sentences. Humans, however, produce a coherent set of sentences, which reference each other and describe the salient activities and relationships being depicted.

Objects and hands in context

Many of the most interesting and salient activities that humans perform involve manipulating objects, frequently by using the hands to grasp and move them. Unfortunately, object detectors tend to fail at precisely this critical juncture, where the salient part of the activity occurs, because the hand occludes the object being grasped. At the same time, while a hand is manipulating an object, the hand itself is significantly deformed, making it more difficult to recognize.

The computational role of eccentricity dependent resolution in the retina: consequences for hierarchical models of object recognition

Current hierarchical models of object recognition lack the non-uniform resolution of the retina, and they also typically neglect scale.  We show that these two issues are intimately related, leading to several predictions.  We further conjecture that the retinal resolution function may represent an optimal input shape, in space and scale, for a single feed-forward pass.  The resulting outputs may encode structural information useful for other aspects of the CBMM challenge, beyond the recognition of single objects.

Visual processing with minimal recognizable configurations

We are studying human visual processing using image patches that are minimal recognizable configurations (MiRC) of larger images. Although a MiRC covers only a small portion of the original image, the majority of people can recognize it, while any further cropping or down-sampling severely hurts recognition performance.
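
A patch is minimal in this sense when every reduced version of it becomes unrecognizable. A minimal sketch of generating such reduced descendants, assuming single-row/column crops and a factor-2 down-sampling (the actual reductions used in the MiRC studies use different percentages):

```python
def crop_variants(patch):
    """The four single-step crops of a 2-D patch (drop one border
    row or column) -- an illustrative stand-in for the slight croppings
    used to test whether a patch is minimal."""
    return [
        patch[1:],                   # drop top row
        patch[:-1],                  # drop bottom row
        [row[1:] for row in patch],  # drop left column
        [row[:-1] for row in patch], # drop right column
    ]

def downsample(patch):
    """Halve resolution by keeping every other pixel (a crude reduction)."""
    return [row[::2] for row in patch[::2]]

patch = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12]]
print(len(crop_variants(patch)))  # 4 cropped descendants
print(downsample(patch))          # [[1, 3], [9, 11]]
```

In the human studies, a patch counts as a MiRC when recognition rates stay high for the patch itself but drop sharply for each of its descendants.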


Recent Publications

Lotter, W., Kreiman, G., and Cox, D., "Unsupervised Learning of Visual Structure using Predictive Generative Networks," in International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, 2016.
CBMM Funded
Wong, A. and Yuille, A., "One Shot Learning by Composition of Meaningful Patches," in International Conference on Computer Vision (ICCV), Santiago, Chile, 2015.
CBMM Funded