%0 Conference Paper
%B AAAI 2017
%D 2017
%T Attention Correctness in Neural Image Captioning
%A Chenxi Liu
%A Junhua Mao
%A Fei Sha
%A Alan Yuille
%X

Attention mechanisms have recently been introduced in deep learning for various tasks in natural language processing and computer vision. But despite their popularity, the "correctness" of the implicitly-learned attention maps has only been assessed qualitatively, by visualizing a few examples. In this paper we focus on evaluating and improving the correctness of attention in neural image captioning models. Specifically, we propose a quantitative evaluation metric for the consistency between the generated attention maps and human annotations, using recently released datasets with alignment between regions in images and entities in captions. We then propose novel models with different levels of explicit supervision for learning attention maps during training. The supervision can be strong when alignment between regions and caption entities is available, or weak when only object segments and categories are provided. We show on the popular Flickr30k and COCO datasets that introducing supervision of attention maps during training consistently improves both attention correctness and caption quality, demonstrating the promise of making machine perception more human-like.

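As a rough illustration of what such a consistency metric could look like, the sketch below scores a single generated attention map against a human-annotated region by measuring how much of the (normalized) attention mass falls inside that region. The function name, the NumPy array inputs, and the normalization are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def attention_correctness(attention_map, annotation_mask):
    """Score one word's attention map against a human-annotated region.

    attention_map:   2-D array of non-negative attention weights over image locations.
    annotation_mask: 2-D boolean array marking the ground-truth region for the word.
    Returns the fraction of normalized attention mass inside the annotated region.
    """
    attn = np.asarray(attention_map, dtype=np.float64)
    mask = np.asarray(annotation_mask, dtype=bool)
    attn = attn / attn.sum()            # treat the map as a spatial distribution
    return float(attn[mask].sum())      # attention mass falling inside the region

# Toy example: a random 4x4 attention map scored against a 2x2 annotated region.
attn = np.random.rand(4, 4)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
print(attention_correctness(attn, mask))  # 1.0 would mean all attention lies inside the region
```
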
%B AAAI 2017
%G eng

%0 Conference Paper
%B The Conference on Computer Vision and Pattern Recognition (CVPR)
%D 2016
%T Generation and Comprehension of Unambiguous Object Descriptions
%A Junhua Mao
%A Jonathan Huang
%A Alexander Toshev
%A Oana Camburu
%A Alan Yuille
%A Kevin Murphy
%X
  We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described.
  We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene.  
  Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate,  our task allows for easy objective evaluation.
  We also present a new large-scale dataset for referring expressions, based on MS-COCO.
  We have released the dataset and a toolbox for visualization and evaluation, see \url{https://github.com/mjhucla/Google_Refexp_toolbox}.
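
The comprehension direction described above can be viewed as a scoring problem: a captioning-style model assigns a likelihood to the expression for every candidate region, and the best-scoring region is returned. Below is a minimal sketch of that selection step; `score_fn` is a hypothetical placeholder for a trained model's log-likelihood and is not part of the released toolbox.

```python
import numpy as np

def comprehend(expression_tokens, candidate_regions, score_fn):
    """Pick the candidate region for which the expression is most probable.

    score_fn(tokens, region) is assumed to return log p(expression | region, image)
    from a trained description model (a placeholder here).
    """
    scores = [score_fn(expression_tokens, region) for region in candidate_regions]
    return candidate_regions[int(np.argmax(scores))]

# Toy usage with a dummy scorer that simply prefers larger boxes.
boxes = [(0, 0, 50, 50), (10, 10, 200, 150)]
dummy_score = lambda tokens, box: (box[2] - box[0]) * (box[3] - box[1])
print(comprehend(["the", "large", "object"], boxes, dummy_score))
```
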
%B The Conference on Computer Vision and Pattern Recognition (CVPR)
%C Las Vegas, Nevada
%8 06/2016
%G eng
%U https://github.com/mjhucla/Google_Refexp_toolbox

%0 Conference Paper
%B NIPS 2016
%D 2016
%T Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images
%A Junhua Mao
%A Jianjing Xu
%A Yushi Jing
%A Alan Yuille
%X

In this paper, we focus on training and evaluating effective word embeddings with both text and visual information. More specifically, we introduce a large-scale dataset with 300 million sentences describing over 40 million images crawled and downloaded from publicly available Pins (i.e., images with sentence descriptions uploaded by users) on Pinterest [2]. This dataset is more than 200 times larger than MS COCO [22], the standard large-scale image dataset with sentence descriptions. In addition, we construct an evaluation dataset to directly assess the effectiveness of word embeddings in terms of finding semantically similar or related words and phrases. The word/phrase pairs in this evaluation dataset are collected from the click data of millions of users in an image search system, and thus contain rich semantic relationships. Based on these datasets, we propose and compare several Recurrent Neural Network (RNN) based multimodal (text and image) models. Experiments show that our model benefits from incorporating the visual information into the word embeddings, and that a weight sharing strategy is crucial for learning such multimodal embeddings. The project page is: http://www.stat.ucla.edu/~junhua.mao/multimodal_embedding.html.

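To make the evaluation protocol concrete, here is a hedged sketch of how relatedness between words or phrases could be scored with learned embeddings, using cosine similarity over an embedding table. The dictionary layout, function names, and random vectors are illustrative assumptions; a trained multimodal model would supply the actual embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    u, v = np.asarray(u, dtype=np.float64), np.asarray(v, dtype=np.float64)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def nearest_phrases(query, embeddings, top_k=3):
    """Return the top_k vocabulary entries most related to `query` under cosine similarity."""
    scores = {phrase: cosine_similarity(embeddings[query], vec)
              for phrase, vec in embeddings.items() if phrase != query}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy vocabulary with random 8-dim embeddings (a trained model would supply these).
rng = np.random.default_rng(0)
embeddings = {p: rng.normal(size=8)
              for p in ["wedding dress", "bridal gown", "sports car", "sneakers"]}
print(nearest_phrases("wedding dress", embeddings))
```
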
%B NIPS 2016
%G eng

%0 Generic
%D 2015
%T Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)
%A Junhua Mao
%A Wei Xu
%A Yi Yang
%A Jiang Wang
%A Zhiheng Huang
%A Alan Yuille
%X

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated according to this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

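The interaction sketched above (a recurrent sub-network for the sentence and a convolutional sub-network for the image, joined in a multimodal layer) can be illustrated as a single decoding step. The layer sizes, ReLU activations, and randomly initialized parameters below are assumptions for illustration, not the published configuration; a trained m-RNN would learn these weights and take CNN features as `image_feat`.

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, R, M, D = 1000, 256, 256, 512, 4096   # vocab, embedding, recurrent, multimodal, CNN-feature sizes

# Illustrative, randomly initialized parameters (a trained model would learn these).
W_embed = rng.normal(0, 0.01, (V, E))
W_in, W_hh = rng.normal(0, 0.01, (E, R)), rng.normal(0, 0.01, (R, R))
V_w, V_r, V_img = rng.normal(0, 0.01, (E, M)), rng.normal(0, 0.01, (R, M)), rng.normal(0, 0.01, (D, M))
W_out = rng.normal(0, 0.01, (M, V))

def mrnn_step(word_id, r_prev, image_feat):
    """One decoding step: embed the word, update the recurrent state, fuse word,
    recurrent state and image feature in a multimodal layer, predict the next word."""
    w = W_embed[word_id]
    r = np.maximum(0.0, w @ W_in + r_prev @ W_hh)                  # recurrent layer
    m = np.maximum(0.0, w @ V_w + r @ V_r + image_feat @ V_img)    # multimodal fusion
    logits = m @ W_out
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), r                                  # next-word distribution, new state

# Toy usage with a random image feature and a start-of-sentence token id 0.
probs, state = mrnn_step(0, np.zeros(R), rng.normal(size=D))
print(probs.argmax())
```
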
%8 05/07/2015
%G English
%1 arXiv:1412.6632
%2 http://hdl.handle.net/1721.1/100198

%0 Conference Paper
%B International Conference on Computer Vision
%D 2015
%T Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images
%A Junhua Mao
%A Wei Xu
%A Yi Yang
%A Jiang Wang
%A Zhiheng Huang
%A Alan Yuille
%X
  In this paper, we address the task of learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions.  
  Using linguistic context and visual features, our method is able to efficiently hypothesize the semantic meaning of new words and add them to its word dictionary so that they can be used to describe images which contain these novel concepts.
  Our method has an image captioning module based on m-RNN with several improvements.
  In particular, we propose a transposed weight sharing scheme, which not only improves performance on image captioning, but also makes the model more suitable for the novel concept learning task.
  We also propose methods to prevent overfitting to the new concepts.
  In addition, three novel concept datasets are constructed for this new task, and are publicly available on the project page.
  In the experiments, we show that our method effectively learns novel visual concepts from a few examples without disturbing the previously learned concepts.
  The project page is: \url{www.stat.ucla.edu/~junhua.mao/projects/child_learning.html}.
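
As a rough illustration of the transposed weight sharing idea mentioned above, the sketch below ties the softmax output weights to the transpose of the word-embedding matrix, so that learning a novel word only appends one row to a single shared matrix. The paper uses an intermediate layer that is omitted here; the vocabulary, sizes, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["a", "cat", "on", "the", "mat"]
E = 64                                        # illustrative embedding size
U = rng.normal(0, 0.01, (len(vocab), E))      # shared word matrix: rows are embeddings

def embed(word_id):
    return U[word_id]                         # encoding reads a row of U

def next_word_probs(hidden):
    """Decoding reuses U transposed as the softmax weights (shared parameters)."""
    logits = hidden @ U.T
    p = np.exp(logits - logits.max())
    return p / p.sum()

def add_novel_word(word, init_scale=0.01):
    """Learning a new concept only appends one row to the shared matrix."""
    global U
    vocab.append(word)
    U = np.vstack([U, rng.normal(0, init_scale, (1, E))])

add_novel_word("quidditch")
print(len(vocab), next_word_probs(embed(1)).shape)   # vocabulary grows to 6; distribution over 6 words
```
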
%B International Conference on Computer Vision
%C Santiago, Chile
%8 12/2015
%G eng
%U www.stat.ucla.edu/~junhua.mao/projects/child_learning.html