Unsupervised Learning of Spoken Language with Visual Context
- Speech Representation, Perception and Recognition
Jim Glass, MIT
Abstract: Despite continuous advances over many decades, automatic speech recognition remains fundamentally a supervised learning scenario that requires large quantities of annotated training data to achieve good performance. This requirement is arguably the major reason that fewer than 2% of the world's languages have achieved some form of ASR capability. Such a learning scenario also stands in stark contrast to the way that humans learn language, which inspires us to consider approaches that involve more learning and less supervision. In our recent research towards unsupervised learning of spoken language, we are investigating the role that visual contextual information can play in learning word-like units from unannotated speech. In this talk, I describe our recent efforts to learn an audio-visual embedding space using a deep learning model that associates images with corresponding spoken descriptions. Through experimental evaluation and analysis, we show that the model is able to learn a useful word-like embedding representation that can be used to cluster visual objects and their spoken instantiations.
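The sketch below is an illustration of the general idea described in the abstract, not the speaker's actual model: a two-branch network embeds images and spoken captions (log-mel spectrograms) into a shared space and is trained with a margin ranking objective so that matched image/speech pairs score higher than mismatched pairs drawn from the same batch. The layer sizes, embedding dimension, and margin are placeholder assumptions.

```python
# Illustrative sketch only: a two-branch audio-visual embedding model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualEmbedder(nn.Module):
    def __init__(self, embed_dim=512, n_mel=40):
        super().__init__()
        # Image branch: a small CNN standing in for a pretrained visual encoder.
        self.image_net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, embed_dim),
        )
        # Speech branch: 1-D convolutions over a log-mel spectrogram, pooled over time.
        self.audio_net = nn.Sequential(
            nn.Conv1d(n_mel, 128, 5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 256, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, images, spectrograms):
        img_emb = F.normalize(self.image_net(images), dim=-1)
        aud_emb = F.normalize(self.audio_net(spectrograms), dim=-1)
        return img_emb, aud_emb

def margin_ranking_loss(img_emb, aud_emb, margin=0.2):
    """Matched pairs lie on the diagonal of the batch similarity matrix;
    every off-diagonal entry is treated as a mismatched (negative) pair."""
    sim = img_emb @ aud_emb.t()                       # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                     # (B, 1) matched-pair scores
    cost_aud = (margin + sim - pos).clamp(min=0)      # image anchor vs. wrong captions
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # caption anchor vs. wrong images
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    return ((cost_aud + cost_img) * mask).sum() / sim.size(0)

# Toy usage: a batch of 8 images paired with 8 spoken captions.
model = AudioVisualEmbedder()
images = torch.randn(8, 3, 224, 224)
spectrograms = torch.randn(8, 40, 1024)
img_emb, aud_emb = model(images, spectrograms)
loss = margin_ranking_loss(img_emb, aud_emb)
loss.backward()
```

After training on paired images and spoken descriptions, nearest-neighbor search in the shared space can group visual objects with the spoken word-like segments that describe them, which is the kind of clustering behavior the abstract refers to.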