%0 Conference Paper %B International Conference on Learning Representations %D 2021 %T Unsupervised Discovery of 3D Physical Objects %A Yilun Du %A Kevin A Smith %A Tomer Ullman %A Joshua B. Tenenbaum %A Jiajun Wu %X
We study the problem of unsupervised physical object discovery. Unlike existing frameworks that aim to learn to decompose scenes into 2D segments purely based on each object's appearance, we explore how physics, especially object interactions, facilitates learning to disentangle and segment instances from raw videos, and to infer the 3D geometry and position of each object, all without supervision. Drawing inspiration from developmental psychology, our Physical Object Discovery Network (POD-Net) uses both multi-scale pixel cues and physical motion cues to accurately segment observable and partially occluded objects of varying sizes, and to infer properties of those objects. Our model reliably segments objects in both synthetic and real scenes. The discovered object properties can also be used to reason about physical events.
%B International Conference on Learning Representations %8 07/2020 %G eng %U https://openreview.net/forum?id=lf7st0bJIA5 %0 Conference Paper %B Proceedings of the 42nd Annual Meeting of the Cognitive Science Society - Developing a Mind: Learning in Humans, Animals, and Machines, CogSci 2020, virtual, July 29 - August 1, 2020 %D 2020 %T The fine structure of surprise in intuitive physics: when, why, and how much? %A Kevin A Smith %A Lingjie Mei %A Shunyu Yao %A Jiajun Wu %A Elizabeth S Spelke %A Joshua B. Tenenbaum %A Tomer D. Ullman %E Stephanie Denison %E Michael Mack %E Yang Xu %E Blair C. Armstrong %B Proceedings of the 42nd Annual Meeting of the Cognitive Science Society - Developing a Mind: Learning in Humans, Animals, and Machines, CogSci 2020, virtual, July 29 - August 1, 2020 %G eng %U https://cogsci.mindmodeling.org/2020/papers/0761/index.html %0 Journal Article %J Current Opinion in Neurobiology %D 2019 %T An integrative computational architecture for object-driven cortex %A Ilker Yildirim %A Jiajun Wu %A Nancy Kanwisher %A Joshua B. Tenenbaum %X
Objects in motion activate multiple cortical regions in every lobe of the human brain. Do these regions represent a collection of independent systems, or is there an overarching functional architecture spanning all of object-driven cortex? Inspired by recent work in artificial intelligence (AI), machine learning, and cognitive science, we consider the hypothesis that these regions can be understood as a coherent network implementing an integrative computational system that unifies the functions needed to perceive, predict, reason about, and plan with physical objects—as in the paradigmatic case of using or making tools. Our proposal draws on a modeling framework that combines multiple AI methods, including causal generative models, hybrid symbolic-continuous planning algorithms, and neural recognition networks, with object-centric, physics-based representations. We review evidence relating specific components of our proposal to the specific regions that comprise object-driven cortex, and lay out future research directions with the goal of building a complete functional and mechanistic account of this system.
%0 Conference Paper %B 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) %D 2019 %T Modeling Expectation Violation in Intuitive Physics with Coarse Probabilistic Object Representations %A Kevin A Smith %A Lingjie Mei %A Shunyu Yao %A Jiajun Wu %A Elizabeth S Spelke %A Joshua B. Tenenbaum %A Tomer D. Ullman %XFrom infancy, humans have expectations about how objects will move and interact. Even young children expect objects not to move through one another, teleport, or disappear. They are surprised by mismatches between physical expectations and perceptual observations, even in unfamiliar scenes with completely novel objects. A model that exhibits human-like understanding of physics should be similarly surprised, and adjust its beliefs accordingly. We propose ADEPT, a model that uses a coarse (approximate geometry) object-centric representation for dynamic 3D scene understanding. Inference integrates deep recognition networks, extended probabilistic physical simulation, and particle filtering to form predictions and expectations across occlusion. We also present a new test set for measuring violations of physical expectations, using a range of scenarios derived from developmental psychology. We systematically compare ADEPT, baseline models, and human expectations on this test set. ADEPT outperforms standard network architectures in discriminating physically implausible scenes, and often performs this discrimination at the same level as people.
%B 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) %C Vancouver, Canada %8 11/2019 %G eng %U http://physadept.csail.mit.edu/ %0 Conference Proceedings %B Neural Information Processing Systems (NeurIPS 2019) %D 2019 %T Visual Concept-Metaconcept Learning %A Chi Han %A Jiayuan Mao %A Chuang Gan %A Joshua B. Tenenbaum %A Jiajun Wu %XHumans reason with concepts and metaconcepts: we recognize red and blue from visual input; we also understand that they are colors, i.e., that red is an instance of color. In this paper, we propose the visual concept-metaconcept learner (VCML) for joint learning of concepts and metaconcepts from images and associated question-answer pairs. The key is to exploit the bidirectional connection between visual concepts and metaconcepts. Visual representations provide grounding cues for predicting relations between unseen pairs of concepts. Knowing that red and blue are instances of color, we generalize to the fact that green is also an instance of color, since they all categorize the hue of objects. Meanwhile, knowledge about metaconcepts empowers visual concept learning from limited, noisy, and even biased data. From just a few examples of purple cubes, we can understand the new color purple as resembling the hue of the cubes rather than their shape. Evaluation on both synthetic and real-world datasets validates our claims.
%B Neural Information Processing Systems (NeurIPS 2019) %C Vancouver, Canada %8 11/2019 %G eng %0 Conference Paper %B The IEEE International Conference on Computer Vision (ICCV) %D 2017 %T Generative modeling of audible shapes for object perception %A Zhoutong Zhang %A Jiajun Wu %A Qiujia Li %A Zhengjia Huang %A James Traer %A Josh H. McDermott %A Joshua B. Tenenbaum %A William T. Freeman %XHumans infer rich knowledge of objects from both auditory and visual cues. Building a machine with such competence, however, is very challenging, due to the great difficulty of capturing large-scale, clean data of objects with both their appearance and the sounds they make. In this paper, we present a novel, open-source pipeline that generates audio-visual data purely from 3D object shapes and their physical properties. Through comparison with audio recordings and human behavioral studies, we validate the accuracy of the sounds it generates. Using this generative model, we construct a synthetic audio-visual dataset, Sound-20K, for object perception tasks. We demonstrate that auditory and visual information play complementary roles in object perception, and further, that the representation learned on synthetic audio-visual data can transfer to real-world scenarios.
%B The IEEE International Conference on Computer Vision (ICCV) %C Venice, Italy %8 10/2017 %G eng %U http://openaccess.thecvf.com/content_iccv_2017/html/Zhang_Generative_Modeling_of_ICCV_2017_paper.html %0 Conference Proceedings %B Advances in Neural Information Processing Systems 30 %D 2017 %T MarrNet: 3D Shape Reconstruction via 2.5D Sketches %A Jiajun Wu %A Yifan Wang %A Tianfan Xue %A Xingyuan Sun %A William T. Freeman %A Joshua B. Tenenbaum %E I. Guyon %E U. V. Luxburg %E S. Bengio %E H. Wallach %E R. Fergus %E S. Vishwanathan %E R. Garnett %X3D object reconstruction from a single image is a highly under-determined problem, requiring strong prior knowledge of plausible 3D shapes. This poses a challenge for learning-based approaches, as 3D object annotations in real images are scarce. Previous work chose to train on synthetic data with ground-truth 3D information, but suffered from domain adaptation issues when tested on real data. In this work, we propose an end-to-end trainable framework that sequentially estimates 2.5D sketches and 3D object shapes. Our disentangled, two-step formulation has three advantages. First, compared to full 3D shape, 2.5D sketches are much easier to recover from a 2D image, and to transfer from synthetic to real data. Second, for 3D reconstruction from 2.5D sketches, we can readily transfer a model learned on synthetic data to real images, as rendered 2.5D sketches are invariant to object appearance variations in real images, including lighting, texture, etc. This further alleviates the domain adaptation problem. Third, we derive differentiable projective functions from 3D shape to 2.5D sketches, making the framework end-to-end trainable on real images, requiring no real-image annotations. Our framework achieves state-of-the-art performance on 3D shape reconstruction.
%B Advances in Neural Information Processing Systems 30 %I Curran Associates, Inc. %C Long Beach, CA %P 540–550 %8 12/2017 %G eng %U http://papers.nips.cc/paper/6657-marrnet-3d-shape-reconstruction-via-25d-sketches.pdf %0 Conference Paper %B Annual Conference on Neural Information Processing Systems (NIPS) %D 2017 %T Self-supervised intrinsic image decomposition %A Michael Janner %A Jiajun Wu %A Tejas Kulkarni %A Ilker Yildirim %A Joshua B. Tenenbaum %B Annual Conference on Neural Information Processing Systems (NIPS) %C Long Beach, CA %8 12/2017 %G eng %U https://papers.nips.cc/paper/7175-self-supervised-intrinsic-image-decomposition %0 Conference Proceedings %B Advances in Neural Information Processing Systems 30 %D 2017 %T Shape and Material from Sound %A Zhoutong Zhang %A Qiujia Li %A Zhengjia Huang %A Jiajun Wu %A Joshua B. Tenenbaum %A William T. Freeman %E I. Guyon %E U. V. Luxburg %E S. Bengio %E H. Wallach %E R. Fergus %E S. Vishwanathan %E R. Garnett %XWhat can we infer from hearing an object fall onto the ground? Based on knowledge of the physical world, humans are able to infer rich information from such limited data: the rough shape of the object, its material, the height of the fall, etc. In this paper, we aim to approximate such competence. We first mimic human knowledge about the physical world using a fast physics-based generative model. Then, we present an analysis-by-synthesis approach to infer properties of the falling object. We further approximate human past experience by directly mapping audio to object properties using deep learning with self-supervision. We evaluate our method through behavioral studies, where we compare human predictions with ours on inferring object shape, material, and the initial height of the fall. Results show that our method achieves near-human performance, without any annotations.
%B Advances in Neural Information Processing Systems 30 %C Long Beach, CA %P 1278–1288 %8 12/2017 %G eng %U http://papers.nips.cc/paper/6727-shape-and-material-from-sound.pdf %0 Conference Paper %B 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) %D 2017 %T Synthesizing 3D Shapes via Modeling Multi-view Depth Maps and Silhouettes with Deep Generative Networks %A Amir Arsalan Soltani %A Haibin Huang %A Jiajun Wu %A Tejas Kulkarni %A Joshua B. Tenenbaum %K 2d to 3d %K 3D generation %K 3D reconstruction %K Core object system %K depth map %K generative %K perception %K silhouette %XWe study the problem of learning generative models of 3D shapes. Voxels or 3D parts have been widely used as the underlying representations to build complex 3D shapes; however, voxel-based representations suffer from high memory requirements, and parts-based models require a large collection of cached or richly parametrized parts. We take an alternative approach: learning a generative model over multi-view depth maps or their corresponding silhouettes, and using a deterministic rendering function to produce 3D shapes from these images. A multi-view representation of shapes enables generation of 3D models with fine details, as 2D depth maps and silhouettes can be modeled at a much higher resolution than 3D voxels. Moreover, our approach naturally brings the ability to recover the underlying 3D representation from depth maps of one or a few viewpoints. Experiments show that our framework can generate 3D shapes with variations and details. We also demonstrate that our model has out-of-sample generalization power for real-world tasks with occluded objects.
%B 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) %C Honolulu, HI %8 07/2017 %G eng %U http://ieeexplore.ieee.org/document/8099752/ %R 10.1109/CVPR.2017.269 %0 Conference Paper %B NIPS 2015 %D 2015 %T Galileo: Perceiving physical object properties by integrating a physics engine with deep learning %A Jiajun Wu %A Ilker Yildirim %A Joseph J. Lim %A William T. Freeman %A Joshua B. Tenenbaum %X