Unsupervised Learning of Spoken Language with Visual Context

Date Posted: February 10, 2017

Date Recorded: February 3, 2017


Jim Glass
  • Speech Representation, Perception and Recognition

Associated CBMM Pages: 


Jim Glass, MIT

Abstract: Despite continuous advances over many decades, automatic speech recognition (ASR) remains fundamentally a supervised learning scenario that requires large quantities of annotated training data to achieve good performance. This requirement is arguably the major reason that fewer than 2% of the world's languages have achieved some form of ASR capability. Such a learning scenario also stands in stark contrast to the way that humans learn language, which inspires us to consider approaches that involve more learning and less supervision. In our recent research towards unsupervised learning of spoken language, we are investigating the role that visual contextual information can play in learning word-like units from unannotated speech. In this talk, I describe our recent efforts to learn an audio-visual embedding space using a deep learning model that associates images with corresponding spoken descriptions. Through experimental evaluation and analysis, we show that the model is able to learn a useful word-like embedding representation that can be used to cluster visual objects and their spoken instantiations.

Associated Research Thrust: