%0 Generic %D 2018 %T Deep Nets: What have they ever done for Vision? %A Alan Yuille %A Chenxi Liu %X

This is an opinion paper about the strengths and weaknesses of Deep Nets. They are at the center of recent progress on Artificial Intelligence and are of growing importance in Cognitive Science and Neuroscience since they enable the development of computational models that can deal with a large range of visually realistic stimuli and visual tasks. They have clear limitations but they also have enormous successes. There is also gradual, though incomplete, understanding of their inner workings. It seems unlikely that Deep Nets in their current form will be the best long-term solution either for building general purpose intelligent machines or for understanding the mind/brain, but it is likely that many aspects of them will remain. At present Deep Nets do very well on specific types of visual tasks and on specific benchmarked datasets. But Deep Nets are much less general purpose, flexible, and adaptive than the human visual system. Moreover, methods like Deep Nets may run into fundamental difficulties when faced with the enormous complexity of natural images. To illustrate our main points, while keeping the references small, this paper is slightly biased towards work from our group.

%8 05/2018 %1

https://arxiv.org/abs/1805.04025

%2

http://hdl.handle.net/1721.1/115292

%0 Generic %D 2018 %T Deep Regression Forests for Age Estimation %A Wei Shen %A Yilu Guo %A Yan Wang %A Kai Zhao %A Bo Wang %A Alan Yuille %X

Age estimation from facial images is typically cast as a nonlinear regression problem. The main challenge of this problem is that the facial feature space w.r.t. age is inhomogeneous, due to the large variation in facial appearance across different people of the same age and the non-stationary property of aging patterns. In this paper, we propose Deep Regression Forests (DRFs), an end-to-end model, for age estimation. DRFs connect the split nodes to a fully connected layer of a convolutional neural network (CNN) and deal with inhomogeneous data by jointly learning input-dependent data partitions at the split nodes and data abstractions at the leaf nodes. This joint learning follows an alternating strategy: first, with the leaf nodes fixed, the split nodes and the CNN parameters are optimized by back-propagation; then, with the split nodes fixed, the leaf nodes are optimized by iterating a step-size-free update rule derived from Variational Bounding. We verify the proposed DRFs on three standard age estimation benchmarks and achieve state-of-the-art results on all of them.
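A minimal sketch of the alternating strategy described in this abstract, not the authors' released code: split nodes driven by CNN features are updated by back-propagation while the leaf predictions are held fixed, then the leaves are refreshed in closed form while the splits are held fixed. The closed-form step below is a simplified routing-weighted average standing in for the paper's Variational Bounding update, and `Backbone` / `TinyRegressionTree` are hypothetical toy modules.

```python
import torch
import torch.nn as nn

class Backbone(nn.Module):                      # hypothetical stand-in for the CNN
    def __init__(self, dim=16):
        super().__init__()
        self.fc = nn.Linear(8, dim)
    def forward(self, x):
        return torch.relu(self.fc(x))

class TinyRegressionTree(nn.Module):
    """Depth-2 soft regression tree: 3 split nodes, 4 scalar leaf ages."""
    def __init__(self, dim=16):
        super().__init__()
        self.splits = nn.Linear(dim, 3)                         # routing logits
        self.register_buffer("leaf_age", torch.tensor([20., 35., 50., 65.]))
    def leaf_probs(self, f):
        s = torch.sigmoid(self.splits(f))                       # P(go left) at each split
        return torch.stack([s[:, 0] * s[:, 1],
                            s[:, 0] * (1 - s[:, 1]),
                            (1 - s[:, 0]) * s[:, 2],
                            (1 - s[:, 0]) * (1 - s[:, 2])], dim=1)
    def forward(self, f):
        return self.leaf_probs(f) @ self.leaf_age               # expected age

backbone, tree = Backbone(), TinyRegressionTree()
opt = torch.optim.SGD(list(backbone.parameters()) + list(tree.parameters()), lr=0.1)
x, age = torch.randn(32, 8), torch.rand(32) * 60 + 15           # toy data

for _ in range(10):
    # Step 1: leaves fixed -- update CNN and split nodes by back-propagation.
    loss = ((tree(backbone(x)) - age) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    # Step 2: splits fixed -- refresh each leaf as the routing-weighted mean age
    # (a simplified stand-in for the Variational Bounding update).
    with torch.no_grad():
        p = tree.leaf_probs(backbone(x))
        tree.leaf_age.copy_((p * age[:, None]).sum(0) / p.sum(0).clamp_min(1e-6))
```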

%8 06/2018 %2

http://hdl.handle.net/1721.1/115413

%0 Generic %D 2018 %T DeepVoting: A Robust and Explainable Deep Network for Semantic Part Detection under Partial Occlusion %A Zhishuai Zhang %A Cihang Xie %A Jianyu Wang %A Lingxi Xie %A Alan Yuille %X

In this paper, we study the task of detecting semantic parts of an object, e.g., a wheel of a car, under partial occlusion. We propose that all models should be trained without seeing occlusions while being able to transfer the learned knowledge to deal with occlusions. This setting alleviates the difficulty in collecting an exponentially large dataset to cover occlusion patterns and is more essential. In this scenario, proposal-based deep networks, like the RCNN series, often produce unsatisfactory results, because both the proposal extraction and classification stages may be confused by the irrelevant occluders. To address this, [25] proposed a voting mechanism that combines multiple local visual cues to detect semantic parts. The semantic parts can still be detected even though some visual cues are missing due to occlusions. However, this method is manually designed and is thus hard to optimize in an end-to-end manner.

In this paper, we present DeepVoting, which incorporates the robustness shown by [25] into a deep network, so that the whole pipeline can be jointly optimized. Specifically, it adds two layers after the intermediate features of a deep network, e.g., the pool-4 layer of VGGNet. The first layer extracts the evidence of local visual cues, and the second layer performs a voting mechanism by utilizing the spatial relationship between visual cues and semantic parts. We also propose an improved version, DeepVoting+, which learns visual cues from the context outside objects. In experiments, DeepVoting achieves significantly better performance than several baseline methods, including Faster-RCNN, for semantic part detection under occlusion. In addition, DeepVoting enjoys explainability, as the detection results can be diagnosed by looking up the voting cues.
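A hedged, illustrative sketch of the two added layers described above (channel counts, kernel sizes and the number of parts are invented; this is not the released DeepVoting code): one convolution turns intermediate CNN features such as VGGNet's pool-4 into visual-cue evidence maps, and a second, large-kernel convolution encodes the spatial relationship between cues and parts, i.e. the voting step.

```python
import torch
import torch.nn as nn

class VotingHead(nn.Module):
    def __init__(self, in_channels=512, num_cues=64, num_parts=39):
        super().__init__()
        # Layer 1: evidence for each local visual cue at each spatial position.
        self.cue_layer = nn.Conv2d(in_channels, num_cues, kernel_size=3, padding=1)
        # Layer 2: each cue votes for part centres through a large spatial kernel.
        self.vote_layer = nn.Conv2d(num_cues, num_parts, kernel_size=15, padding=7)
    def forward(self, pool4_features):
        cues = torch.relu(self.cue_layer(pool4_features))    # (N, num_cues, H, W)
        return self.vote_layer(cues)                          # (N, num_parts, H, W) part score maps

head = VotingHead()
scores = head(torch.randn(1, 512, 28, 28))    # e.g. pool-4 of a 448x448 input
print(scores.shape)                            # torch.Size([1, 39, 28, 28])
```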

%8 06/2018 %2

http://hdl.handle.net/1721.1/115181

%0 Conference Paper %B Conference on Computer Vision and Pattern Recognition (CVPR) %D 2018 %T DeepVoting: An Explainable Framework for Semantic Part Detection under Partial Occlusion %A Zhishuai Zhang %A Cihang Xie %A Jianyu Wang %A Lingxi Xie %A Alan Yuille %X

In this paper, we study the task of detecting semantic parts of an object, e.g., a wheel of a car, under partial occlusion. We propose that all models should be trained without seeing occlusions while being able to transfer the learned knowledge to deal with occlusions. This setting alleviates the difficulty in collecting an exponentially large dataset to cover occlusion patterns and is more essential. In this scenario, proposal-based deep networks, like the RCNN series, often produce unsatisfactory results, because both the proposal extraction and classification stages may be confused by the irrelevant occluders. To address this, [25] proposed a voting mechanism that combines multiple local visual cues to detect semantic parts. The semantic parts can still be detected even though some visual cues are missing due to occlusions. However, this method is manually designed and is thus hard to optimize in an end-to-end manner. In this paper, we present DeepVoting, which incorporates the robustness shown by [25] into a deep network, so that the whole pipeline can be jointly optimized. Specifically, it adds two layers after the intermediate features of a deep network, e.g., the pool-4 layer of VGGNet. The first layer extracts the evidence of local visual cues, and the second layer performs a voting mechanism by utilizing the spatial relationship between visual cues and semantic parts. We also propose an improved version, DeepVoting+, which learns visual cues from the context outside objects. In experiments, DeepVoting achieves significantly better performance than several baseline methods, including Faster-RCNN, for semantic part detection under occlusion. In addition, DeepVoting enjoys explainability, as the detection results can be diagnosed by looking up the voting cues.

%B Conference on Computer Vision and Pattern Recognition (CVPR) %C Salt Lake City, Utah %8 06/2018 %G eng %U http://cvpr2018.thecvf.com/ %0 Generic %D 2018 %T Recurrent Multimodal Interaction for Referring Image Segmentation %A Chenxi Liu %A Zhe Lin %A Xiaohui Shen %A Jimei Yang %A Xin Lu %A Alan Yuille %X

In this paper we are interested in the problem of image segmentation given natural language descriptions, i.e., referring expressions. Existing works tackle this problem by first modeling images and sentences independently and then segmenting images by combining these two types of representations. We argue that learning word-to-image interaction is more native in the sense of jointly modeling two modalities for the image segmentation task, and we propose a convolutional multimodal LSTM to encode the sequential interactions between individual words, visual information, and spatial information. We show that our proposed model outperforms the baseline model on benchmark datasets. In addition, we analyze the intermediate output of the proposed multimodal LSTM approach and empirically explain how this approach enforces a more effective word-to-image interaction.
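A minimal sketch, under assumptions, of the word-to-image interaction described above: at each word, the word embedding is tiled over the visual feature map, concatenated with the visual features and a normalized coordinate map, and fed to a convolutional LSTM cell whose final hidden state is decoded into a mask. Dimensions and the 1x1 decoder are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size=3, padding=1)
    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

def spatial_coords(n, h, w):
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1).expand(n, 1, h, w)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w).expand(n, 1, h, w)
    return torch.cat([ys, xs], dim=1)

vis_ch, word_dim, hid_ch = 256, 128, 128
cell = ConvLSTMCell(vis_ch + word_dim + 2, hid_ch)
decode = nn.Conv2d(hid_ch, 1, kernel_size=1)           # hidden state -> mask logits

vis = torch.randn(1, vis_ch, 40, 40)                    # visual feature map
words = torch.randn(6, 1, word_dim)                     # embeddings of a 6-word expression
h = torch.zeros(1, hid_ch, 40, 40); c = torch.zeros_like(h)
for w_t in words:                                       # sequential word-visual interaction
    tiled = w_t.view(1, word_dim, 1, 1).expand(1, word_dim, 40, 40)
    x = torch.cat([vis, tiled, spatial_coords(1, 40, 40)], dim=1)
    h, c = cell(x, (h, c))
mask_logits = decode(h)                                 # (1, 1, 40, 40)
```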

%8 05/2018 %2

http://hdl.handle.net/1721.1/115374

%0 Generic %D 2018 %T Scene Graph Parsing as Dependency Parsing %A Yu-Siang Wang %A Chenxi Liu %A Xiaohui Zeng %A Alan Yuille %X

In this paper, we study the problem of parsing structured knowledge graphs from textual descriptions. In particular, we consider the scene graph representation, which describes objects together with their attributes and relations: this representation has proved useful across a variety of vision and language applications. We begin by introducing an alternative but equivalent edge-centric view of scene graphs that connects them to dependency parses. Together with a careful redesign of the label and action space, we combine the two-stage pipeline used in prior work (generic dependency parsing followed by simple post-processing) into one, enabling end-to-end training. The scene graphs generated by our learned neural dependency parser achieve an F-score similarity of 49.67% to ground truth graphs on our evaluation set, surpassing the best previous approaches by 5%. We further demonstrate the effectiveness of our learned parser on image retrieval applications.
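A toy illustration of the edge-centric view (the sentence and labels are invented, and this is not the paper's parser): instead of listing objects with attached attributes and relations, every attribute and relation becomes a labelled arc between words, which has the same shape as a dependency parse and is what allows a single dependency-style parser to emit the scene graph directly.

```python
sentence = ["a", "young", "boy", "rides", "a", "brown", "horse"]

# Node-centric scene graph: objects with attached attributes and relations.
node_centric = {
    "objects": ["boy", "horse"],
    "attributes": {"boy": ["young"], "horse": ["brown"]},
    "relations": [("boy", "rides", "horse")],
}

# Edge-centric view: one labelled arc per attribute/relation over the sentence,
# analogous to (head, dependent, label) arcs in a dependency parse.
edge_centric = [
    ("boy", "young", "ATTRIBUTE"),
    ("horse", "brown", "ATTRIBUTE"),
    ("boy", "horse", "RELATION:rides"),
]

for head, dep, label in edge_centric:
    print(f"{head:>6} --{label}--> {dep}")
```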

%8 05/2018 %2

http://hdl.handle.net/1721.1/115375

%0 Conference Paper %B Conference on Computer Vision and Pattern Recognition (CVPR) %D 2018 %T Single-Shot Object Detection with Enriched Semantics %A Zhishuai Zhang %A Siyuan Qiao %A Cihang Xie %A Wei Shen %A Bo Wang %A Alan Yuille %X

We propose a novel single-shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector, using a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground-truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module which learns the relationship between channels and object classes in a self-supervised manner. Comprehensive experimental results on both PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16-based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev with an inference time of 31.5 milliseconds per image on a Titan Xp GPU. With a lower resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference time of 13.0 milliseconds per image.
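A hedged sketch of a global-activation-style module consistent with the description above (bottleneck ratio and sizes are assumptions, not the DES configuration): global pooling followed by a small bottleneck yields one weight per channel, which rescales the detection feature map so that channels correlated with object classes are emphasised.

```python
import torch
import torch.nn as nn

class GlobalActivation(nn.Module):
    def __init__(self, channels=512, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # global spatial pooling
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )
    def forward(self, feat):
        n, c, _, _ = feat.shape
        w = self.fc(self.pool(feat).view(n, c)).view(n, c, 1, 1)
        return feat * w                                       # channel-reweighted features

feat = torch.randn(2, 512, 38, 38)                            # e.g. an SSD-style feature map
print(GlobalActivation()(feat).shape)                         # torch.Size([2, 512, 38, 38])
```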

%B Conference on Computer Vision and Pattern Recognition (CVPR) %C Salt Lake City, Utah %8 06/2018 %G eng %U http://cvpr2018.thecvf.com/ %0 Generic %D 2018 %T Single-Shot Object Detection with Enriched Semantics %A Zhishuai Zhang %A Siyuan Qiao %A Cihang Xie %A Wei Shen %A Bo Wang %A Alan Yuille %X

We propose a novel single-shot object detection network named Detection with Enriched Semantics (DES). Our motivation is to enrich the semantics of object detection features within a typical deep detector, using a semantic segmentation branch and a global activation module. The segmentation branch is supervised by weak segmentation ground-truth, i.e., no extra annotation is required. In conjunction with that, we employ a global activation module which learns the relationship between channels and object classes in a self-supervised manner. Comprehensive experimental results on both PASCAL VOC and MS COCO detection datasets demonstrate the effectiveness of the proposed method. In particular, with a VGG16-based DES, we achieve an mAP of 81.7 on VOC2007 test and an mAP of 32.8 on COCO test-dev with an inference time of 31.5 milliseconds per image on a Titan Xp GPU. With a lower resolution version, we achieve an mAP of 79.7 on VOC2007 with an inference time of 13.0 milliseconds per image.

%8 06/2018 %2

http://hdl.handle.net/1721.1/115180

%0 Journal Article %J Annals of Mathematical Sciences and Applications (AMSA) %D 2018 %T Visual Concepts and Compositional Voting %A Jianyu Wang %A Zhishuai Zhang %A Cihang Xie %A Yuyin Zhou %A Vittal Premachandran %A Jun Zhu %A Lingxi Xie %A Alan Yuille %K deep networks %K pattern theory %K visual concepts %X

It is very attractive to formulate vision in terms of pattern theory (Mumford and Desolneux, 2010), where patterns are defined hierarchically by compositions of elementary building blocks. But applying pattern theory to real world images is currently less successful than discriminative methods such as deep networks. Deep networks, however, are black-boxes which are hard to interpret and can easily be fooled by adding occluding objects. It is natural to wonder whether by better understanding deep networks we can extract building blocks which can be used to develop pattern theoretic models. This motivates us to study the internal representations of a deep network using vehicle images from the PASCAL3D+ dataset. We use clustering algorithms to study the population activities of the features and extract a set of visual concepts which we show are visually tight and correspond to semantic parts of vehicles. To analyze this, we annotate these vehicles by their semantic parts to create a new dataset, VehicleSemanticParts, and evaluate visual concepts as unsupervised part detectors. We show that visual concepts perform fairly well but are outperformed by supervised discriminative methods such as Support Vector Machines (SVM). We next give a more detailed analysis of visual concepts and how they relate to semantic parts. Following this, we use the visual concepts as building blocks for a simple pattern theoretical model, which we call compositional voting. In this model several visual concepts combine to detect semantic parts. We show that this approach is significantly better than discriminative methods like SVM and deep networks trained specifically for semantic part detection. Finally, we return to studying occlusion by creating an annotated dataset with occlusion, called VehicleOcclusion, and show that compositional voting outperforms even deep networks when the amount of occlusion becomes large.
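A simplified numpy illustration of the compositional voting idea (offsets, scores and map sizes are invented; this is not the paper's implementation): each visual concept produces a log-likelihood-ratio-style evidence map and votes for the semantic part at a learned spatial offset, so the part can still be found when some cues are occluded.

```python
import numpy as np

H = W = 32
rng = np.random.default_rng(0)

# Evidence maps for 3 visual concepts (higher = stronger log-likelihood ratio).
concept_evidence = rng.normal(0.0, 0.2, size=(3, H, W))
concept_evidence[0, 10, 12] = 3.0        # e.g. a "hub-cap"-like cue
concept_evidence[1, 10, 18] = 3.0        # e.g. a "tyre-edge"-like cue
concept_evidence[2, 14, 15] = 3.0        # e.g. a "wheel-arch"-like cue

# Learned mean offset (dy, dx) from each concept to the part centre ("wheel").
offsets = [(2, 3), (2, -3), (-2, 0)]

part_score = np.zeros((H, W))
for ev, (dy, dx) in zip(concept_evidence, offsets):
    # Shift each concept's evidence so it lands on the part centre it predicts.
    part_score += np.roll(np.roll(ev, dy, axis=0), dx, axis=1)

cy, cx = np.unravel_index(part_score.argmax(), part_score.shape)
print("part detected at", (cy, cx))      # (12, 15); it would still peak there if one cue were occluded
```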

%B Annals of Mathematical Sciences and Applications (AMSA) %V 3 %P 151–188 %G eng %U http://www.intlpress.com/site/pub/pages/journals/items/amsa/content/vols/0003/0001/a005/index.html %N 1 %R 10.4310/AMSA.2018.v3.n1.a5 %0 Generic %D 2018 %T Visual concepts and compositional voting %A Jianyu Wang %A Zhishuai Zhang %A Cihang Xie %A Yuyin Zhou %A Vittal Premachandran %A Jun Zhu %A Lingxi Xie %A Alan Yuille %X

It is very attractive to formulate vision in terms of pattern theory [26], where patterns are defined hierarchically by compositions of elementary building blocks. But applying pattern theory to real world images is very challenging and is currently less successful than discriminative methods such as deep networks. Deep networks, however, are black-boxes which are hard to interpret and, as we will show, can easily be fooled by adding occluding objects. It is natural to wonder whether by better understanding deep networks we can extract building blocks which can be used to develop pattern theoretic models. This motivates us to study the internal feature vectors of a deep network using images of vehicles from the PASCAL3D+ dataset with the scale of objects fixed. We use clustering algorithms, such as K-means, to study the population activity of the features and extract a set of visual concepts which we show are visually tight and correspond to semantic parts of the vehicles. To analyze this in more detail, we annotate these vehicles by their semantic parts to create a new dataset which we call VehicleSemanticParts, and evaluate visual concepts as unsupervised semantic part detectors. Our results show that visual concepts perform fairly well but are outperformed by supervised discriminative methods such as Support Vector Machines. We next give a more detailed analysis of visual concepts and how they relate to semantic parts. Following this analysis, we use the visual concepts as building blocks for a simple pattern theoretical model, which we call compositional voting. In this model several visual concepts combine to detect semantic parts. We show that this approach is significantly better than discriminative methods like Support Vector Machines and deep networks trained specifically for semantic part detection. Finally, we return to studying occlusion by creating an annotated dataset with occlusion, called VehicleOcclusion, and show that compositional voting outperforms even deep networks when the amount of occlusion becomes large.

%8 03/2018 %2

http://hdl.handle.net/1721.1/115182

%0 Conference Paper %B AAAI 2017 %D 2017 %T Attention Correctness in Neural Image Captioning %A Chenxi Liu %A Junhua Mao %A Fei Sha %A Alan Yuille %X

Attention mechanisms have recently been introduced in deep learning for various tasks in natural language processing and computer vision. But despite their popularity, the "correctness" of the implicitly learned attention maps has only been assessed qualitatively by visualization of several examples. In this paper we focus on evaluating and improving the correctness of attention in neural image captioning models. Specifically, we propose a quantitative evaluation metric for the consistency between the generated attention maps and human annotations, using recently released datasets with alignment between regions in images and entities in captions. We then propose novel models with different levels of explicit supervision for learning attention maps during training. The supervision can be strong when alignment between regions and caption entities is available, or weak when only object segments and categories are provided. We show on the popular Flickr30k and COCO datasets that introducing supervision of attention maps during training solidly improves both attention correctness and caption quality, showing the promise of making machine perception more human-like.
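A small sketch of one plausible way to score attention correctness in the spirit of this abstract, hedged because it is a reading of the metric rather than the paper's evaluation code: normalise the generated attention map and sum the mass that falls inside the human-annotated region for the mentioned entity.

```python
import numpy as np

def attention_correctness(attn_map, region_mask):
    """attn_map: (H, W) non-negative attention; region_mask: (H, W) boolean GT region."""
    attn = attn_map / attn_map.sum()
    return float(attn[region_mask].sum())     # in [0, 1]; 1 = all attention on the entity

attn = np.zeros((14, 14)); attn[3:6, 4:8] = 1.0               # toy attention for the word "dog"
mask = np.zeros((14, 14), dtype=bool); mask[2:7, 3:9] = True  # annotated "dog" region
print(attention_correctness(attn, mask))                      # 1.0 here: attention lies inside the region
```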

%B AAAI 2017 %G eng %0 Conference Paper %B British Machine Vision Conference (BMVC) %D 2017 %T Detecting Semantic Parts on Partially Occluded Objects %A Jianyu Wang %A Cihang Xie %A Zhishuai Zhang %A Jun Zhu %A Lingxi Xie %A Alan Yuille %X

In this paper, we address the task of detecting semantic parts on partially occluded objects. We consider a scenario where the model is trained using non-occluded images but tested on occluded images. The motivation is that there is an infinite number of occlusion patterns in the real world, which cannot be fully covered in the training data. So the models should be inherently robust and adaptive to occlusions instead of fitting/learning the occlusion patterns in the training data. Our approach detects semantic parts by accumulating the confidence of local visual cues. Specifically, the method uses a simple voting method, based on log-likelihood ratio tests and spatial constraints, to combine the evidence of local cues. These cues are called visual concepts, which are derived by clustering the internal states of deep networks. We evaluate our voting scheme on the VehicleSemanticPart dataset with dense part annotations. We randomly place two, three or four irrelevant objects onto the target object to generate testing images with various occlusions. Experiments show that our algorithm outperforms several competitors in semantic part detection when occlusions are present.

%B British Machine Vision Conference (BMVC) %C London, UK %8 09/2017 %G eng %U https://bmvc2017.london/proceedings/ %0 Generic %D 2017 %T Detecting Semantic Parts on Partially Occluded Objects %A Jianyu Wang %A Cihang Xie %A Zhishuai Zhang %A Jun Zhu %A Lingxi Xie %A Alan Yuille %X

In this paper, we address the task of detecting semantic parts on partially occluded objects. We consider a scenario where the model is trained using non-occluded images but tested on occluded images. The motivation is that there is an infinite number of occlusion patterns in the real world, which cannot be fully covered in the training data. So the models should be inherently robust and adaptive to occlusions instead of fitting/learning the occlusion patterns in the training data. Our approach detects semantic parts by accumulating the confidence of local visual cues. Specifically, the method uses a simple voting method, based on log-likelihood ratio tests and spatial constraints, to combine the evidence of local cues. These cues are called visual concepts, which are derived by clustering the internal states of deep networks. We evaluate our voting scheme on the VehicleSemanticPart dataset with dense part annotations. We randomly place two, three or four irrelevant objects onto the target object to generate testing images with various occlusions. Experiments show that our algorithm outperforms several competitors in semantic part detection when occlusions are present.

%8 09/2017 %2

http://hdl.handle.net/1721.1/115179

%0 Generic %D 2017 %T Multi-stage Multi-recursive-input Fully Convolutional Networks for Neuronal Boundary Detection %A Wei Shen %A Bin Wang %A Yuan Jiang %A Yan Wang %A Alan Yuille %X

In the field of connectomics, neuroscientists seek to identify cortical connectivity comprehensively. Neuronal boundary detection from Electron Microscopy (EM) images is often used to assist the automatic reconstruction of neuronal circuits. But the segmentation of EM images is a challenging problem, as it requires the detector to detect both filament-like thin and blob-like thick membranes, while suppressing ambiguous intracellular structure. In this paper, we propose multi-stage multi-recursive-input fully convolutional networks to address this problem. The multiple recursive inputs for one stage, i.e., the multiple side outputs with different receptive field sizes learned from the lower stage, provide multi-scale contextual boundary information for the subsequent learning. This design is biologically plausible, as it resembles the way a human visual system compares different possible segmentation solutions to resolve the ambiguous boundary issue. Our multi-stage networks are trained end-to-end and achieve promising results on two publicly available EM segmentation datasets, the mouse piriform cortex dataset and the ISBI 2012 EM dataset.
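A rough sketch of the multi-stage, multi-recursive-input idea (channel counts and the number of side outputs are assumptions): a first FCN stage emits several side outputs with different receptive fields, and the next stage takes the image together with all of those side outputs as recursive inputs, so competing boundary hypotheses can be weighed against each other.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    def __init__(self, in_ch, num_side_outputs=3):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        # Side outputs with (here, simulated) different receptive field sizes.
        self.sides = nn.ModuleList(
            [nn.Conv2d(32, 1, k, padding=k // 2) for k in (1, 3, 5)][:num_side_outputs])
    def forward(self, x):
        f = self.trunk(x)
        return [torch.sigmoid(side(f)) for side in self.sides]   # boundary probability maps

image = torch.randn(1, 1, 64, 64)                  # a single-channel EM image
stage1 = Stage(in_ch=1)
side_maps = stage1(image)                          # multi-scale boundary hypotheses
stage2 = Stage(in_ch=1 + len(side_maps))           # recursive inputs: image + side outputs
refined = stage2(torch.cat([image] + side_maps, dim=1))
print([m.shape for m in refined])                  # three refined (1, 1, 64, 64) maps
```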

%8 10/2017 %2

http://hdl.handle.net/1721.1/115411

%0 Conference Proceedings %B ECCV %D 2016 %T DOC: Deep OCclusion Recovering From A Single Image %A Peng Wang %A Alan Yuille %X

Recovering the occlusion relationships between objects is a fundamental human visual ability which yields important information about the 3D world. In this paper we propose a deep network architecture, called DOC, which acts on a single image, detects object boundaries and estimates the border ownership (i.e. which side of the boundary is foreground and which is background). We represent occlusion relations by a binary edge map, to indicate the object boundary, and an occlusion orientation variable which is tangential to the boundary and whose direction specifies border ownership by a left-hand rule. We train two related deep convolutional neural networks, called DOC, which exploit local and non-local image cues to estimate this representation and hence recover occlusion relations. In order to train and test DOC we construct a large-scale instance occlusion boundary dataset using PASCAL VOC images, which we call the PASCAL instance occlusion dataset (PIOD). This contains 10,000 images and hence is two orders of magnitude larger than existing occlusion datasets for outdoor images. We test two variants of DOC on PIOD and on the BSDS occlusion dataset and show they outperform state-of-the-art methods. Finally, we perform numerous experiments investigating multiple settings of DOC and transfer between BSDS and PIOD, which provides more insights for further study of occlusion estimation.
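A tiny sketch of the occlusion representation described above, with the convention hedged: each boundary pixel carries a tangent orientation, and border ownership is taken here to follow a left-hand rule in the sense that the foreground (occluding) side lies to the left of the orientation arrow; the helper returns a unit vector pointing from the boundary into that assumed-foreground side.

```python
import numpy as np

def foreground_normal(theta):
    """theta: boundary tangent angle in radians. Returns a unit normal toward the
    assumed-foreground side (left of the tangent direction)."""
    tangent = np.array([np.cos(theta), np.sin(theta)])
    # Rotate the tangent by +90 degrees to point to its left-hand side.
    return np.array([-tangent[1], tangent[0]])

# Boundary running along +x (theta = 0): under this assumed convention the
# foreground is then on the +y side in standard mathematical (x, y) coordinates.
print(foreground_normal(0.0))      # [0., 1.]
```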

%B ECCV %G eng %0 Conference Paper %B The Conference on Computer Vision and Pattern Recognition (CVPR) %D 2016 %T Generation and Comprehension of Unambiguous Object Descriptions %A Junhua Mao %A Jonathan Huang %A Alexander Toshev %A Oana Camburu %A Alan Yuille %A Kevin Murphy %X
  We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described.
  We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene.  
  Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate,  our task allows for easy objective evaluation.
  We also present a new large-scale dataset for referring expressions, based on
  MS-COCO.
  We have released the dataset and a toolbox for visualization and evaluation; see https://github.com/mjhucla/Google_Refexp_toolbox.
%B The Conference on Computer Vision and Pattern Recognition (CVPR) %C Las Vegas, Nevada %8 06/2016 %G eng %U https://github.com/mjhucla/Google_Refexp_toolbox %0 Conference Paper %B NIPS 2016 %D 2016 %T Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images %A Junhua Mao %A Jianjing Xu %A Yushi Jing %A Alan Yuille %X

In this paper, we focus on training and evaluating effective word embeddings with both text and visual information. More specifically, we introduce a large-scale dataset with 300 million sentences describing over 40 million images crawled and downloaded from publicly available Pins (i.e., images with sentence descriptions uploaded by users) on Pinterest [2]. This dataset is more than 200 times larger than MS COCO [22], the standard large-scale image dataset with sentence descriptions. In addition, we construct an evaluation dataset to directly assess the effectiveness of word embeddings in terms of finding semantically similar or related words and phrases. The word/phrase pairs in this evaluation dataset are collected from the click data of millions of users in an image search system and thus contain rich semantic relationships. Based on these datasets, we propose and compare several Recurrent Neural Network (RNN) based multimodal (text and image) models. Experiments show that our model benefits from incorporating the visual information into the word embeddings, and that a weight sharing strategy is crucial for learning such multimodal embeddings. The project page is: http://www.stat.ucla.edu/~junhua.mao/multimodal_embedding.html.

%B NIPS 2016 %G eng %0 Conference Paper %B ECCV %D 2016 %T Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net %A Fangting Xia %A Peng Wang %A Liang-chieh Chen %A Alan Yuille %B ECCV %C Amsterdam, The Netherlands %8 09/2016 %G eng %0 Conference Paper %B ECCV %D 2016 %T Zoom Better to See Clearer: Human Part Segmentation with Auto Zoom Net %A Fangting Xia %A Peng Wang %A Liang-chieh Chen %A Alan Yuille %X

Parsing articulated objects, e.g., humans and animals, into semantic parts (e.g., body, head and arms, etc.) from natural images is a challenging and fundamental problem for computer vision. A big difficulty is the large variability of scale and location for objects and their corresponding parts. Even limited mistakes in estimating scale and location will degrade the parsing output and cause errors in boundary details. To tackle these difficulties, we propose a “Hierarchical Auto-Zoom Net” (HAZN) for object part parsing which adapts to the local scales of objects and parts. HAZN is a sequence of two “Auto-Zoom Nets” (AZNs), each employing fully convolutional networks that perform two tasks: (1) predict the locations and scales of object instances (the first AZN) or their parts (the second AZN); (2) estimate the part scores for predicted object instance or part regions. Our model can adaptively “zoom” (resize) predicted image regions into their proper scales to refine the parsing. We conduct extensive experiments over the PASCAL part datasets on humans, horses, and cows. For humans, our approach significantly outperforms the state of the art by 5% mIOU and is especially better at segmenting small instances and small parts. We obtain similar improvements for parsing cows and horses over alternative methods. In summary, our strategy of first zooming into objects and then zooming into parts is very effective. It also enables us to process different regions of the image at different scales adaptively so that, for example, we do not need to waste computational resources scaling the entire image.

%B ECCV %G eng %0 Generic %D 2015 %T Complexity of Representation and Inference in Compositional Models with Part Sharing %A Alan Yuille %A Roozbeh Mottaghi %X This paper performs a complexity analysis of a class of serial and parallel compositional models of multiple objects and shows that they enable efficient representation and rapid inference. Compositional models are generative and represent objects in a hierarchically distributed manner in terms of parts and subparts, which are constructed recursively by part-subpart compositions. Parts are represented more coarsely at higher levels of the hierarchy, so that the upper levels give coarse summary descriptions (e.g., there is a horse in the image) while the lower levels represent the details (e.g., the positions of the legs of the horse). This hierarchically distributed representation obeys the executive summary principle, meaning that a high level executive only requires a coarse summary description and can, if necessary, get more details by consulting lower level executives. The parts and subparts are organized in terms of hierarchical dictionaries, which enable part sharing between different objects, allowing efficient representation of many objects. The first main contribution of this paper is to show that compositional models can be mapped onto a parallel visual architecture similar to that used by bio-inspired visual models such as deep convolutional networks but more explicit in terms of representation, hence enabling part detection as well as object detection, and suitable for complexity analysis. Inference algorithms can be run on this architecture to exploit the gains caused by part sharing and executive summary. Effectively, this compositional architecture enables us to perform exact inference simultaneously over a large class of generative models of objects. The second contribution is an analysis of the complexity of compositional models in terms of computation time (for serial computers) and numbers of nodes (e.g., "neurons") for parallel computers. In particular, we compute the complexity gains from part sharing and executive summary and their dependence on how the dictionary scales with the level of the hierarchy. We explore three regimes of scaling behavior where the dictionary size (i) increases exponentially with the level of the hierarchy, (ii) is determined by an unsupervised compositional learning algorithm applied to real data, (iii) decreases exponentially with scale. This analysis shows that in some regimes the use of shared parts enables algorithms which can perform inference in time linear in the number of levels for an exponential number of objects. In other regimes part sharing has little advantage for serial computers but can enable linear processing on parallel computers. %8 05/2015 %1

arXiv:1301.3560v1

%2

http://hdl.handle.net/1721.1/100196
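The part-sharing gain analysed in the complexity abstract above can be illustrated with a back-of-envelope count (all numbers invented): without sharing, every object evaluates its own parts at every level; with a shared hierarchical dictionary, each distinct part is evaluated once and reused by every object that contains it.

```python
num_objects = 1000
parts_per_object_per_level = 4
levels = 3
dictionary_size_per_level = [50, 200, 800]        # shared parts at each level (assumed)

without_sharing = num_objects * parts_per_object_per_level * levels
with_sharing = sum(dictionary_size_per_level)
print(without_sharing, with_sharing)              # 12000 vs 1050 part evaluations
```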

%0 Generic %D 2015 %T Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) %A Junhua Mao %A Wei Xu %A Yi Yang %A Jiang Wang %A Zhiheng Huang %A Alan Yuille %X

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated according to this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly
optimize the ranking objective function for retrieval.
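A schematic sketch of the m-RNN idea in this abstract, with dimensions and the additive fusion chosen for illustration rather than taken from the paper: at each step the word embedding, the recurrent state and the CNN image feature meet in a multimodal layer, from which the next word's distribution is predicted.

```python
import torch
import torch.nn as nn

class TinyMRNN(nn.Module):
    def __init__(self, vocab=1000, word_dim=128, hid_dim=128, img_dim=512, mm_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, word_dim)
        self.rnn = nn.RNN(word_dim, hid_dim, batch_first=True)
        # Multimodal layer: project word, hidden state and image into one space.
        self.w_proj = nn.Linear(word_dim, mm_dim)
        self.h_proj = nn.Linear(hid_dim, mm_dim)
        self.i_proj = nn.Linear(img_dim, mm_dim)
        self.out = nn.Linear(mm_dim, vocab)
    def forward(self, tokens, img_feat):
        e = self.embed(tokens)                       # (N, T, word_dim)
        h, _ = self.rnn(e)                           # (N, T, hid_dim)
        m = torch.tanh(self.w_proj(e) + self.h_proj(h)
                       + self.i_proj(img_feat).unsqueeze(1))   # broadcast image over time
        return self.out(m)                           # (N, T, vocab) next-word logits

logits = TinyMRNN()(torch.randint(0, 1000, (2, 7)), torch.randn(2, 512))
print(logits.shape)                                  # torch.Size([2, 7, 1000])
```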

%8 05/07/2015 %G English %1

arXiv:1412.6632

%2

http://hdl.handle.net/1721.1/100198

%0 Conference Paper %B International Conference of Computer Vision %D 2015 %T Learning like a Child: Fast Novel Visual Concept Learning from Sentence Descriptions of Images %A Junhua Mao %A Wei Xu %A Yi Yang %A Jiang Wang %A Zhiheng Huang %A Alan Yuille %X
  In this paper, we address the task of learning novel visual concepts, and their interactions with other concepts, from a few images with sentence descriptions.  
  Using linguistic context and visual features, our method is able to efficiently hypothesize the semantic meaning of new words and add them to its word dictionary so that they can be used to describe images which contain these novel concepts.
  Our method has an image captioning module based on m-RNN with several improvements.
  In particular, we propose a transposed weight sharing scheme, which not only improves performance on image captioning, but also makes the model more suitable for the novel concept learning task.
  We propose methods to prevent overfitting the new concepts. 
  In addition, three novel concept datasets are constructed for this new task, and are publicly available on the project page.
  In the experiments, we show that our method effectively learns novel visual concepts from a few examples without disturbing the previously learned concepts.
  The project page is: www.stat.ucla.edu/~junhua.mao/projects/child_learning.html.
%B International Conference of Computer Vision %C Santiago, Chile %8 12/2015 %G eng %U www.stat.ucla.edu/~junhua.mao/projects/child_learning.html %0 Conference Paper %B International Conference on Computer Vision (ICCV) %D 2015 %T One Shot Learning by Composition of Meaningful Patches %A Alex Wong %A Alan Yuille %X
The task of discriminating one object from another is almost trivial for a human being. However, this task is computationally taxing for most modern machine learning methods, whereas we perform it with ease given very few examples for learning. It has been proposed that the quick grasp of a concept may come from the shared knowledge between the new example and examples previously learned. We believe that the key to one-shot learning is the sharing of common parts, as each part holds immense amounts of information on how a visual concept is constructed. We propose an unsupervised method for learning a compact dictionary of image patches representing meaningful components of an object. Using those patches as features, we build a compositional model that outperforms a number of popular algorithms on a one-shot learning task. We demonstrate the effectiveness of this approach on hand-written digits and show that this model generalizes to multiple datasets.
%B International Conference on Computer Vision (ICCV) %C Santiago, Chile %8 12/2015 %G eng %0 Conference Paper %B International Conference on Computer Vision (ICCV) %D 2015 %T One Shot Learning via Compositions of Meaningful Patches %A Alex Wong %A Alan Yuille %X

The task of discriminating one object from another is almost trivial for a human being. However, this task is computationally taxing for most modern machine learning methods, whereas we perform it with ease given very few examples for learning. It has been proposed that the quick grasp of a concept may come from the shared knowledge between the new example and examples previously learned. We believe that the key to one-shot learning is the sharing of common parts, as each part holds immense amounts of information on how a visual concept is constructed. We propose an unsupervised method for learning a compact dictionary of image patches representing meaningful components of an object. Using those patches as features, we build a compositional model that outperforms a number of popular algorithms on a one-shot learning task. We demonstrate the effectiveness of this approach on hand-written digits and show that this model generalizes to multiple datasets.

%B International Conference on Computer Vision (ICCV) %G eng %0 Generic %D 2015 %T Parsing Occluded People by Flexible Compositions %A Xianjie Chen %A Alan Yuille %X

This paper presents an approach to parsing humans when there is significant occlusion. We model humans using a graphical model which has a tree structure, building on recent work [32, 6], and exploit the connectivity prior that, even in the presence of occlusion, the visible nodes form a connected subtree of the graphical model. We call each connected subtree a flexible composition of object parts. This involves a novel method for learning occlusion cues. During inference we need to search over a mixture of different flexible models. By exploiting part sharing, we show that this inference can be done extremely efficiently, requiring only twice as many computations as searching for the entire object (i.e., not modeling occlusion). We evaluate our model on the standard benchmarked “We Are Family” Stickmen dataset and obtain significant performance improvements over the best alternative algorithms.

 

%B Computer Vision and Pattern Recognition (CVPR) %8 06/1/2015 %G eng %1

arXiv:1412.1526

%2

http://hdl.handle.net/1721.1/100199

%0 Conference Proceedings %B IEEE International Conference on Computer Vision (ICCV) %D 2015 %T Scene-Domain Active Part Models for Object Representation %A Zhou Ren %A Chaohui Wang %A Alan Yuille %X

In this paper, we are interested in enhancing the expressivity and robustness of part-based models for object representation, in the common scenario where the training data are based on 2D images. To this end, we propose scene-domain active part models (SDAPM), which reconstruct and characterize the 3D geometric statistics between an object’s parts in the 3D scene domain using 2D training data in the image domain alone. On top of this, we explicitly model and handle occlusions in SDAPM. Together with the developed learning and inference algorithms, such a model provides rich object descriptions, including 2D object and part localization, 3D landmark shape and camera viewpoint, which offer an effective representation for various image understanding tasks, such as object and part detection, and 3D landmark shape and viewpoint estimation from images. Experiments on the above tasks show that SDAPM outperforms previous part-based models, and thus demonstrates the potential of the proposed technique.

%B IEEE International Conference on Computer Vision (ICCV) %C Santiago, Chile %P 2497 - 2505 %8 12/2015 %G eng %U http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7410644&tag=1 %R 10.1109/ICCV.2015.287 %0 Conference Paper %B CVPR %D 2015 %T Semantic Part Segmentation using Compositional Model combining Shape and Appearance %A Jianyu Wang %A Alan Yuille %X

In this paper, we study the problem of semantic part segmentation for animals. This is more challenging than standard object detection, object segmentation and pose estimation tasks because semantic parts of animals often have similar appearance and highly varying shapes. To tackle these challenges, we build a mixture of compositional models to represent the object boundary and the boundaries of semantic parts. And we incorporate edge, appearance, and semantic part cues into the compositional model. Given part-level segmentation annotation, we develop a novel algorithm to learn a mixture of compositional models under various poses and viewpoints for certain animal classes. Furthermore, a linear complexity algorithm is offered for efficient inference of the compositional model using dynamic programming. We evaluate our method for horse and cow using a newly annotated dataset on Pascal VOC 2010 which has pixelwise part labels. Experimental results demonstrate the effectiveness of our method.

%B CVPR %G eng %0 Generic %D 2014 %T Detect What You Can: Detecting and Representing Objects using Holistic Models and Body Parts. %A Xianjie Chen %A Roozbeh Mottaghi %A Xiaobai Liu %A Sanja Fidler %A Raquel Urtasun %A Alan Yuille %X

Detecting objects becomes difficult when we need to deal with large shape deformation, occlusion and low resolution. We propose a novel approach to i) handle large deformations and partial occlusions in animals (as examples of highly deformable objects), ii) describe them in terms of body parts, and iii) detect them when their body parts are hard to detect (e.g., animals depicted at low resolution). We represent the holistic object and body parts separately and use a fully connected model to arrange templates for the holistic object and body parts. Our model automatically decouples the holistic object or body parts from the model when they are hard to detect. This enables us to represent a large number of holistic object and body part combinations to better deal with different “detectability” patterns caused by deformations, occlusion and/or low resolution.
We apply our method to the six animal categories in the PASCAL VOC dataset and show that our method significantly improves the state of the art (by 4.1% AP) and provides a richer representation for objects. During training we use annotations for body parts (e.g., head, torso, etc.), making use of a new dataset of fully annotated object parts for PASCAL VOC 2010, which provides a mask for each part.

%8 06/2014 %1

arXiv:1406.2031

%2

http://hdl.handle.net/1721.1/100179

%0 Generic %D 2014 %T Human-Machine CRFs for Identifying Bottlenecks in Holistic Scene Understanding. %A Roozbeh Mottaghi %A Sanja Fidler %A Alan Yuille %A Raquel Urtasun %A Devi Parikh %X

Recent trends in image understanding have pushed for holistic scene understanding models that jointly reason about various tasks such as object detection, scene recognition, shape analysis, contextual reasoning, and local appearance based classifiers. In this work, we are interested in understanding the roles of these different tasks in improved scene understanding, in particular semantic segmentation, object detection and scene recognition. Towards this goal, we “plug-in” human subjects for each of the various components in a state-of-the-art conditional random field model. Comparisons among various hybrid human-machine CRFs give us indications of how much “head room” there is to improve scene understanding by focusing research efforts on various individual tasks.

%8 06/2014 %1

arXiv:1406.3906

%2

http://hdl.handle.net/1721.1/100184

%0 Generic %D 2014 %T Parsing Semantic Parts of Cars Using Graphical Models and Segment Appearance Consistency. %A Wenhao Lu %A Xiaochen Lian %A Alan Yuille %X

This paper addresses the problem of semantic part parsing (segmentation) of cars, i.e., assigning every pixel within the car to one of the parts (e.g., body, window, lights, license plates and wheels). We formulate this as a landmark identification problem, where a set of landmarks specifies the boundaries of the parts. A novel mixture of graphical models is proposed, which dynamically couples the landmarks to a hierarchy of segments. When modeling pairwise relations between landmarks, this coupling enables our model to exploit the local image contents in addition to spatial deformation, an aspect that most existing graphical models ignore. In particular, our model enforces appearance consistency between segments within the same part. Parsing the car, including finding the optimal coupling between landmarks and segments in the hierarchy, is performed by dynamic programming. We evaluate our method on a subset of PASCAL VOC 2010 car images and on the car subset of the 3D Object Category dataset (CAR3D). We show good results and, in particular, quantify the effectiveness of using the segment appearance consistency in terms of accuracy of part localization and segmentation.

%8 06/2014 %1

arXiv:1406.2375v2

%2

http://hdl.handle.net/1721.1/100182

%0 Generic %D 2014 %T Robust Estimation of 3D Human Poses from a Single Image. %A Chunyu Wang %A Yizhou Wang %A Zhouchen Lin %A Alan Yuille %A Wen Gao %X

Human pose estimation is a key step towards action recognition. We propose a method of estimating 3D human poses from a single image, which works in conjunction with an existing 2D pose/joint detector. 3D pose estimation is challenging because multiple 3D poses may correspond to the same 2D pose after projection due to the lack of depth information. Moreover, current 2D pose estimators are usually inaccurate, which may cause errors in the 3D estimation. We address the challenges in three ways: (i) We represent a 3D pose as a linear combination of a sparse set of bases learned from 3D human skeletons. (ii) We enforce limb length constraints to eliminate anthropomorphically implausible skeletons. (iii) We estimate a 3D pose by minimizing the L1-norm error between the projection of the 3D pose and the corresponding 2D detection. The L1-norm loss term is robust to inaccurate 2D joint estimations. We use the alternating direction method (ADM) to solve the optimization problem efficiently. Our approach outperforms the state of the art on three benchmark datasets.
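A compact numpy sketch of the objective described in this abstract, not the paper's ADM solver: the 3D pose is a linear combination of a few basis skeletons (assumed given), it is projected with an assumed weak-perspective camera, and the fit to the 2D detections is an L1 norm, which is what tolerates a few badly detected joints.

```python
import numpy as np

rng = np.random.default_rng(0)
J, K = 15, 8                                   # joints, basis skeletons
bases = rng.normal(size=(K, J, 3))             # learned 3D pose bases (assumed given)
detected_2d = rng.normal(size=(J, 2))          # output of a 2D joint detector

def project(pose_3d, scale=1.0):
    """Weak-perspective projection: drop depth, apply a global scale."""
    return scale * pose_3d[:, :2]

def l1_reprojection_error(coeffs):
    pose_3d = np.tensordot(coeffs, bases, axes=1)            # (J, 3) = sum_k w_k * B_k
    return np.abs(project(pose_3d) - detected_2d).sum()       # robust L1 data term

w = np.zeros(K); w[0] = 1.0                    # a sparse coefficient vector
print(l1_reprojection_error(w))
```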

%8 06/2014 %1

arXiv:1406.2282v1

%2

http://hdl.handle.net/1721.1/100177

%0 Generic %D 2014 %T The Secrets of Salient Object Segmentation. %A Yin Li %A Christof Koch %A James M. Rehg %A Alan Yuille %X

In this paper we provide an extensive evaluation of fixation prediction and salient object segmentation algorithms, as well as statistics of major datasets. Our analysis identifies serious design flaws of existing salient object benchmarks, called the dataset design bias, caused by over-emphasising stereotypical concepts of saliency. The dataset design bias not only creates the discomforting disconnection between fixations and salient object segmentation, but also misleads algorithm design. Based on our analysis, we propose a new high-quality dataset that offers both fixation and salient object segmentation ground truth. With fixations and salient objects presented simultaneously, we are able to bridge the gap between fixations and salient objects, and propose a novel method for salient object segmentation. Finally, we report significant benchmark progress on three existing datasets of segmenting salient objects.

%8 06/2014 %1

arXiv:1406.2807

%2

http://hdl.handle.net/1721.1/100178