Unsupervised Learning of Spoken Language with Visual Context
- Speech Representation, Perception and Recognition
Jim Glass, MIT
Abstract: Despite continuous advances over many decades, automatic speech recognition remains fundamentally a supervised learning problem that requires large quantities of annotated training data to achieve good performance. This requirement is arguably the major reason that fewer than 2% of the world's languages have achieved some form of ASR capability. Such a learning scenario also stands in stark contrast to the way that humans learn language, which inspires us to consider approaches that involve more learning and less supervision. In our recent research towards unsupervised learning of spoken language, we are investigating the role that visual contextual information can play in learning word-like units from unannotated speech. In this talk, I will describe our recent efforts to learn an audio-visual embedding space using a deep learning model that associates images with corresponding spoken descriptions. Through experimental evaluation and analysis, we show that the model learns a useful word-like embedding representation that can be used to cluster visual objects and their spoken instantiations.
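To make the general idea concrete, below is a minimal, hypothetical sketch of the kind of model the abstract describes: two encoders (one for images, one for spoken captions) that map into a shared embedding space and are trained so that matching image/speech pairs score higher than mismatched pairs from the same batch. This is not the speaker's actual architecture; the encoder layers, spectrogram dimensions, margin value, and class names are all illustrative assumptions.

```python
# Illustrative sketch only: a dual-encoder audio-visual embedding model trained
# with a margin ranking loss. All architectural details are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        # Small stand-in for a (typically pretrained) visual CNN.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(128, embed_dim)

    def forward(self, images):                 # images: (B, 3, H, W)
        h = self.conv(images).flatten(1)       # (B, 128)
        return F.normalize(self.proj(h), dim=-1)

class SpeechEncoder(nn.Module):
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        # 1-D convolutions over a log-mel spectrogram of the spoken caption.
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(128, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(256, embed_dim)

    def forward(self, spectrograms):           # spectrograms: (B, n_mels, T)
        h = self.conv(spectrograms).flatten(1)  # (B, 256)
        return F.normalize(self.proj(h), dim=-1)

def ranking_loss(img_emb, aud_emb, margin=1.0):
    # Similarity matrix between every image and every spoken caption in the
    # batch; the diagonal holds the true (matching) pairs.
    sims = img_emb @ aud_emb.t()
    pos = sims.diag().unsqueeze(1)
    # Push matched pairs above mismatched ones by the margin, in both directions.
    cost_aud = (margin + sims - pos).clamp(min=0)
    cost_img = (margin + sims - pos.t()).clamp(min=0)
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    return cost_aud.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()

# Toy usage with random tensors standing in for real image/speech pairs.
images = torch.randn(8, 3, 224, 224)
spectrograms = torch.randn(8, 40, 1024)
img_emb = ImageEncoder()(images)
aud_emb = SpeechEncoder()(spectrograms)
loss = ranking_loss(img_emb, aud_emb)
loss.backward()
```

After training on paired images and spoken descriptions, nearest-neighbor search in the shared embedding space can group visual objects with the speech segments that describe them, which is the sense in which word-like units can be discovered without transcriptions.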