Vision and Language

Research Thrust: Vision and Language

Shimon Ullman

Vision can be combined with aspects of language and social cognition to obtain and communicate complex knowledge about the surrounding environment, for example, to answer a large and flexible set of queries about objects and agents in an image or video in a human-like manner, as captured in the CBMM Challenge. These lectures survey current approaches to achieving this understanding from visual input, the START natural language system, and recent efforts to bridge these capabilities. The last lecture of this series addresses a cognitive ability that distinguishes human intelligence from that of other primates: the ability to tell, understand, and recombine stories.

Presentations

Shimon Ullman: Visual Understanding: State of the World, Future Directions

Topics: Overview of visual understanding; object categorization and variability in appearance within categories; recognizing individuals; identifying object parts; learning categories from examples by combining different features (simple to complex) and classifiers; visual classes as similar configurations of image components; finding optimal features that maximize mutual information for the class vs. non-class distinction (Ullman et al., Nature Neuroscience 2002); SIFT and HOG features; the HMAX model; state-of-the-art systems from the PASCAL challenge vs. human performance; deep learning and convolutional neural nets (e.g. ImageNet); unsupervised learning methods; fMRI and EEG studies indicating high correlation between the informativeness of image patches and activity in higher-level visual “object” areas (e.g. LOC); recognition of object parts with hierarchies of sub-fragments at multiple scales (Ullman et al., PNAS 2008); object segmentation (e.g. Malik et al.; Brandt, Sharon, Basri, Nature 2006), using top-down semantic information to enhance segmentation; future challenges, including recognizing what people are doing, interactions between agents, task-dependent image analysis (e.g. answering queries), visual routines, and using vision to learn conceptual knowledge in a new domain
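
The mutual-information feature selection mentioned above can be made concrete in a few lines. This is a minimal sketch, assuming binary fragment detections (in the actual method a fragment fires when its normalized correlation with some image region exceeds a threshold); the toy detection data are invented for illustration:

```python
import numpy as np

def mutual_information(detected, is_class):
    """Estimate I(F; C) in bits between a binary fragment-detection
    variable F and a binary class label C from paired samples."""
    mi = 0.0
    for f in (0, 1):
        for c in (0, 1):
            p_fc = np.mean((detected == f) & (is_class == c))
            p_f = np.mean(detected == f)
            p_c = np.mean(is_class == c)
            if p_fc > 0:  # zero-probability cells contribute nothing
                mi += p_fc * np.log2(p_fc / (p_f * p_c))
    return mi

# Toy data: one candidate fragment's detections on 8 images,
# 4 of which contain the target class.
detections = np.array([1, 1, 1, 0, 0, 1, 0, 0])
labels     = np.array([1, 1, 1, 1, 0, 0, 0, 0])
print(mutual_information(detections, labels))  # higher = more informative fragment
```

Ranking all candidate fragments by this score and greedily keeping those that add information yields the "intermediate complexity" fragments the lecture describes.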

Boris Katz: Telling Machines about the World, and Daniel Harari: Innate Mechanisms and Learning: Developing Complex Visual Concepts from Unlabeled Natural Dynamic Scenes

Topics: (Boris Katz) Limitations of recent AI successes (Goggles, Kinect, Watson, Siri); brief history of computer vision system performance; scene understanding tasks: object detection, verification, identification, categorization, recognition of activities or events, spatial and temporal relationships between objects, explanation (e.g. what past events caused the scene to look as it does?), prediction (e.g. what will happen next?), filling in gaps in objects and events; enhancing computer vision systems by combining vision and language processing (e.g. creating a knowledge base about objects for the scene recognition system and testing performance with natural language questions); overview of the START system: syntactic analysis producing parse trees, semantic representation using ternary expressions, language generation, matching of ternary expressions and transformational rules, replying to questions, the object-property-value data model, decomposition of complex questions into simpler ones; recent progress on understanding and describing simple activities in video; (Daniel Harari) supervised vs. unsupervised learning; the infeasibility of obtaining labeled training data for all visual concepts; toward social understanding: hand recognition and following gaze direction; toward scene understanding: object segmentation and containment; detecting “mover” events as a pattern of interaction between a moving hand and an object (co-training on appearance and context); using mover events to generate training data for a kNN classifier that determines the direction of gaze; a model for object segmentation using common motion and motion discontinuity; learning the concept of containment
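
START's ternary-expression representation is compact enough to sketch. The toy knowledge base and matching helper below are hypothetical; the actual system layers parsing, language generation, and transformational rules on top of this kind of matching:

```python
# Minimal sketch of START-style ternary expressions (subject, relation, object)
# and pattern matching with variables. Facts and helper names are illustrative.

knowledge = [
    ("hand", "touch", "cup"),
    ("cup", "contain", "coffee"),
]

def match(pattern, fact, bindings):
    """Unify a pattern whose '?'-prefixed terms are variables
    against a ground ternary expression; return bindings or None."""
    bindings = dict(bindings)
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            if bindings.get(p, f) != f:  # variable already bound differently
                return None
            bindings[p] = f
        elif p != f:
            return None
    return bindings

def query(pattern):
    """Answer a question by matching its pattern against stored facts."""
    return [b for fact in knowledge
            if (b := match(pattern, fact, {})) is not None]

print(query(("?x", "contain", "coffee")))  # -> [{'?x': 'cup'}]
```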

Andrei Barbu: From Language to Vision and Back Again

Topics: Importance of bridging low-level perception with high-level cognition; a model system for a limited domain that can (1) recognize how well a sentence describes a video, (2) retrieve sample videos for which a sentence is true, (3) generate language descriptions and answer questions about videos, (4) acquire language concepts, (5) use video to resolve language ambiguity, (6) translate between languages, and (7) guide planning; determining whether a sentence describes a video involves recognizing participants, movements, directions, and relationships; overview of a system that starts with many unreliable detections, uses HMMs to track coherently moving objects and recognize words from tracks, and obtains information about participants and relations from a dependency parser (e.g. START) that encodes sentence structure; a similar approach is used to generate sentences and answer questions about videos (combining trackers and words); examples involving simple objects and agents performing actions such as approach, pick up, and put down; translation between languages via imagination of videos depicting sentences
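
To picture how a word model can score a pair of tracks, here is a minimal sketch in the spirit of the HMM approach described above. The two-state "approach" model, its parameters, and the single distance-decreasing feature are illustrative stand-ins for the richer detector-based features the system uses:

```python
import numpy as np

def observations(track_a, track_b):
    """Per-frame feature: 1 if the distance between the two tracked
    objects decreased since the previous frame, else 0. (Illustrative.)"""
    d = np.linalg.norm(np.asarray(track_a, float) - np.asarray(track_b, float), axis=1)
    return (np.diff(d) < 0).astype(int)

# Toy HMM for the word "approach": state 0 = far apart, state 1 = near.
start = np.array([0.9, 0.1])
trans = np.array([[0.7, 0.3],    # far tends to stay far or become near
                  [0.1, 0.9]])   # near tends to stay near
emit  = np.array([[0.4, 0.6],    # P(obs | far): distance likely shrinking
                  [0.7, 0.3]])   # P(obs | near): less motion toward

def log_likelihood(obs):
    """Forward algorithm: log P(observations | word model)."""
    alpha = start * emit[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
    return np.log(alpha.sum())

agent  = [(0, 0), (1, 0), (2, 0), (3, 0)]   # track moving right
object = [(5, 0), (5, 0), (5, 0), (5, 0)]   # stationary track
print(log_likelihood(observations(agent, object)))  # higher = better fit for "approach"
```

Sentence-level scoring then combines one such word model per lexical item, with the parser dictating which tracks fill each word's participant roles.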

Patrick Winston: The Story Understanding Story

Topics: Brief history of AI and arguments against the possibility of artificial intelligence; emergence of symbolic processing capability through evolution; the strong story hypothesis: the ability to tell, understand, and recombine stories distinguishes human intelligence from that of other primates; understanding the story of Macbeth: how to answer questions about information that is not explicit, such as whether Duncan is dead at the end; use of inference rules, explanation rules, and concept patterns; the Genesis system for story understanding, which can find connections between events, integrate the cultural background of the reader, answer questions about motives, assess similarity between stories, and interpret stories from different domains such as politics and conflict, e.g. understanding analogies between the US-Viet Cong and Arab-Israeli conflicts; the social animal hypothesis; the directed perception hypothesis
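
A minimal forward-chaining sketch shows how an inference rule can make an implicit fact, such as Duncan's death, explicit. The rule syntax and facts below are hypothetical and not the actual Genesis representation:

```python
# Story facts as (subject, verb, object) triples; rules derive new facts.
story = {("Macbeth", "murders", "Duncan")}

# Each rule: (antecedent pattern, consequent pattern). '?'-prefixed
# terms are variables; this sketch assumes subject/object slots in
# antecedents are always variables.
rules = [
    (("?x", "murders", "?y"), ("?y", "is", "dead")),
    (("?x", "murders", "?y"), ("?x", "harms", "?y")),
]

def apply_rules(facts, rules):
    """Apply rules repeatedly until no new facts are derived."""
    changed = True
    while changed:
        changed = False
        for (sv, vb, ov), (cs, cv, co) in rules:
            for (s, v, o) in list(facts):
                if v == vb:
                    bind = {sv: s, ov: o}
                    new = (bind.get(cs, cs), cv, bind.get(co, co))
                    if new not in facts:
                        facts.add(new)
                        changed = True
    return facts

apply_rules(story, rules)
print(("Duncan", "is", "dead") in story)  # True: the implicit fact is now explicit
```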