Language and Vision

Candace Ross and colleagues developed a semantic parser that learns to understand sentences after being trained on video-sentence pairs without explicit parses or logical forms.


Answering simple questions about the objects, agents, and actions portrayed in a visual scene, and using natural language commands to direct a robotic agent to navigate through a scene and manipulate objects, both require combining visual and language understanding. This unit explores general frameworks for solving many tasks that rely on the integration of vision and language.


Boris Katz and Andrei Barbu: Grounding language acquisition
As Boris Katz notes, children learn language by observing the environment, listening to the people around them, and connecting what they see with what they hear. This contrasts with natural language understanding systems trained on large databases of parse trees and logical forms that capture the syntax and meaning of sentences. Andrei Barbu describes a semantic parser that instead learns the structure of natural language from sentences grounded in a visual context, paired with videos that capture their meaning, analogous to what children experience as they acquire language.
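One way to see how a parser can learn without gold-standard logical forms is that the video acts as a weak supervisory signal: among competing parses of a sentence, prefer the one whose predicates are best supported by what the video shows. The sketch below illustrates this selection step only; the detections, candidate parses, and scoring are invented toy structures, not the actual system described in the talk.

```python
# Toy sketch: choosing among candidate parses by visual grounding.
# The predicate strings and the detection set are illustrative inventions.

def grounding_score(parse, detections):
    """Fraction of the parse's predicates supported by video detections."""
    supported = sum(1 for pred in parse if pred in detections)
    return supported / len(parse)

def select_parse(candidates, detections):
    """Pick the candidate parse that best matches the video, standing in
    for training without explicit parses or logical forms."""
    return max(candidates, key=lambda p: grounding_score(p, detections))

# Two candidate logical forms for "the person picked up the ball".
candidates = [
    ("person(x)", "ball(y)", "pick_up(x, y)"),   # intended reading
    ("person(x)", "ball(y)", "approach(x, y)"),  # competing reading
]
# Predicates a hypothetical video-analysis module reports as observed.
detections = {"person(x)", "ball(y)", "pick_up(x, y)"}

best = select_parse(candidates, detections)
```

In the real setting the score would come from matching a sentence's compositional structure against object tracks and actions detected in the video, but the principle is the same: the visually consistent parse wins.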
Andrei Barbu: Language & vision
Andrei Barbu ties language understanding to the ability of a robotic agent to act in the world. A sampling-based path planner for robot navigation is enhanced with a recurrent neural network that learns successful sequences of actions from past experience in similar environments, guided by natural language commands. Grounding language in vision, incorporating how well a sentence captures the content of a video, can form the basis for solving problems such as recognition, image and video retrieval, question answering, language acquisition, and common sense reasoning.
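The idea of biasing a sampling-based planner with learned preferences can be sketched in miniature: instead of a recurrent network trained on past successes, the toy below mixes uniform random action proposals with proposals biased toward the goal. All names and parameters here are invented for illustration, not the planner from the talk.

```python
import random

# Illustrative sketch: a sampling-based planner on a grid whose proposals
# are biased toward the goal, a stand-in for a recurrent network that has
# learned which action sequences tend to succeed.

ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def towards(pos, goal):
    """An action that reduces Manhattan distance to the goal."""
    if pos[0] != goal[0]:
        return "E" if pos[0] < goal[0] else "W"
    return "N" if pos[1] < goal[1] else "S"

def plan(start, goal, bias=0.5, budget=200, seed=0):
    """Sample actions, mixing uniform exploration with goal-biased
    proposals; return the action sequence that reaches the goal."""
    rng = random.Random(seed)
    pos, path = start, []
    for _ in range(budget):
        if pos == goal:
            return path
        if rng.random() < bias:
            a = towards(pos, goal)           # "learned" proposal
        else:
            a = rng.choice(list(ACTIONS))    # uniform exploration
        dx, dy = ACTIONS[a]
        pos = (pos[0] + dx, pos[1] + dy)
        path.append(a)
    return path if pos == goal else None

path = plan((0, 0), (2, 2))
```

In the systems described above, the bias would instead come from a network conditioned on the natural language command and past experience in similar environments.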
Stefanie Tellex reflects on how we can create robots that seamlessly use natural language to communicate with humans. The combination of probabilistic methods based on partially observable Markov decision processes (POMDPs), corpus-based training, and decision theory enables the development of interactive robotic systems that people can control with natural language commands, and that respond appropriately to these commands, asking questions or requesting help when needed to complete the desired actions.
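The POMDP framing can be illustrated with a tiny example: the robot maintains a belief over which object the human means, updates it with each noisy word via Bayes' rule, and asks for clarification when no hypothesis is confident enough. The objects, word likelihoods, and threshold below are invented for illustration.

```python
# Toy sketch of the POMDP-style idea: belief update over referents from
# noisy language, plus a decision-theoretic act-or-ask policy.
# All numbers and names are illustrative inventions.

def update(belief, likelihoods):
    """Bayes' rule: reweight each hypothesis by how likely it makes the
    observed word, then renormalize."""
    posterior = {h: belief[h] * likelihoods.get(h, 1e-6) for h in belief}
    z = sum(posterior.values())
    return {h: p / z for h, p in posterior.items()}

def act(belief, threshold=0.8):
    """Act on a confident hypothesis; otherwise request help."""
    best = max(belief, key=belief.get)
    return ("grasp", best) if belief[best] >= threshold else ("ask", None)

belief = {"red_cup": 1/3, "blue_cup": 1/3, "red_plate": 1/3}
# Invented observation models P(word | referent) for "red", then "cup":
belief = update(belief, {"red_cup": 0.9, "blue_cup": 0.05, "red_plate": 0.9})
belief = update(belief, {"red_cup": 0.9, "blue_cup": 0.9, "red_plate": 0.05})
action = act(belief)
```

After "red" alone the belief is split between the cup and the plate, so the policy would ask for help; after "cup" the posterior concentrates and the robot can act.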
Richard Socher shows how a model based on dynamic memory networks enables many aspects of natural language processing to be cast as question-answering tasks. The model uses gated recurrent units from recurrent neural networks (RNNs) and incorporates memory and attention mechanisms. With a novel input module for processing images, the model can be applied to the task of answering questions about natural images.
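At the heart of the dynamic memory network is an episodic memory module that repeatedly attends over encoded facts conditioned on the question. The minimal sketch below keeps only that attention step, with fixed toy vectors in place of the learned GRU encoders; it is not the published model.

```python
import math

# Minimal sketch of the episodic-attention idea in a dynamic memory
# network: weight each fact by its relevance to the question, then
# summarize the facts into a memory vector. Real DMNs use learned GRU
# encoders and multiple episodes; the vectors here are toy inventions.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def episode(facts, query):
    """One attention pass: relevance-weighted summary of the facts."""
    attn = softmax([dot(f, query) for f in facts])
    dim = len(query)
    memory = [sum(w * f[i] for w, f in zip(attn, facts)) for i in range(dim)]
    return memory, attn

# Toy 3-d encodings: two facts and a question matching the first fact.
facts = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
question = [1.0, 0.0, 0.0]
memory, attn = episode(facts, question)
```

In the full model, the memory vector from one episode conditions the next pass over the facts, letting the network chain together evidence that no single fact provides, and the same machinery accepts an image-derived input module for visual question answering.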

Further Study

Online Resources

Additional information about the speakers’ research can be found in the following publications:


Berzak, Y., Huang, Y., Barbu, A., Korhonen, A., Katz, B. (2016) Anchoring and agreement in syntactic annotations, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas

Kumar, A., Irsoy, O., Ondruska, P., Iyyer, M., Bradbury, J., Gulrajani, I., Zhong, V., Paulus, R., Socher, R. (2016) Ask me anything: Dynamic memory networks for natural language processing, 33rd International Conference on Machine Learning

Kuo, Y., Barbu, A., Katz, B. (2018) Deep sequential models for sampling-based planning, International Conference on Intelligent Robots and Systems, Madrid, Spain

Kuo, Y., Katz, B., Barbu, A. (2020) Deep compositional robotic planners that follow natural language commands, IEEE International Conference on Robotics and Automation, Paris, France

Patel, R., Pavlick, E., Tellex, S. (2020) Grounding language to non-markovian tasks with no supervision of task specifications, Proceedings of Robotics: Science and Systems

Ross, C., Barbu, A., Berzak, Y., Myanganbayar, B., Katz, B. (2018) Grounding language acquisition by training semantic parsers using captioned videos, Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium

Tellex, S., Gopalan, N., Kress-Gazit, H., Matuszek, C. (2020) Robots that use language: A survey, Annual Review of Control, Robotics, and Autonomous Systems, 3, 1-35

Wang, C., Ross, C., Kuo, Y., Katz, B., Barbu, A. (2020) Learning a natural-language to LTL executable semantic parser for grounded robotics, 4th Conference on Robot Learning, Cambridge MA

Xiong, C., Merity, S., Socher, R. (2016) Dynamic memory networks for visual and textual question answering, 33rd International Conference on Machine Learning

Yu, H., Siddharth, N., Barbu, A., Siskind, J. M. (2015) A compositional framework for grounded language inference, generation, and acquisition in video, Journal of Artificial Intelligence Research, 52, 601-713