Title | Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) |
Publication Type | CBMM Memos |
Year of Publication | 2015 |
Authors | Mao, J, Xu, W, Yang, Y, Wang, J, Huang, Z, Yuille, A |
Number | 033 |
Date Published | 05/07/2015 |
Publication Language | English |
Abstract | In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated according to this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly |
CBMM Relationship:
- CBMM Funded
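
Note on the abstract: the statement that the model "directly models the probability distribution of generating a word given previous words and an image" can be read as a chain-rule factorization of the caption probability. The sketch below is one way to write that down. The symbols are assumptions introduced here for illustration and do not appear in this record: I is the image, w_n is the n-th word of the caption, and L is the caption length.

```latex
% Assumed notation (not from the record): I = image, w_n = n-th word, L = caption length.
P(w_{1:L} \mid I) \;=\; \prod_{n=1}^{L} P\!\left(w_n \mid w_{1:n-1},\, I\right)
```

Under this reading, a caption is generated one word at a time: each new word is drawn from (or chosen to maximize) the conditional distribution given the image and the words produced so far.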