%0 Conference Paper %B Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21) %D 2021 %T Temporal and Object Quantification Networks %A Mao, Jiayuan %A Luo, Zhezheng %A Gan, Chuang %A Tenenbaum, Joshua B. %A Wu, Jiajun %A Kaelbling, Leslie Pack %A Ullman, Tomer D. %E Zhou, Zhi-Hua %Y Gini, Maria %X

We present Temporal and Object Quantification Networks (TOQ-Nets), a new class of neuro-symbolic networks with a structural bias that enables them to learn to recognize complex relational-temporal events. This is done by including reasoning layers that implement finite-domain quantification over objects and time. The structure allows them to generalize directly to input instances with varying numbers of objects in temporal sequences of varying lengths. We evaluate TOQ-Nets on input domains that require recognizing event-types in terms of complex temporal relational patterns. We demonstrate that TOQ-Nets can generalize from small amounts of data to scenarios containing more objects than were present during training and to temporal warpings of input sequences.

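As a rough illustration of the quantification idea described above (not the authors' implementation): with soft truth values, existential and universal quantification over a finite domain can be realized as max and min pooling over the object and time axes of a feature tensor. The tensor shapes and function names below are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the TOQ-Nets implementation) of
# finite-domain quantification over objects and time: "exists" as max
# pooling and "forall" as min pooling over the corresponding axes.
import numpy as np

def exists_over_objects(x):
    """x: [T, N, F] per-object soft truth values in [0, 1]; returns [T, F]."""
    return x.max(axis=1)

def forall_over_objects(x):
    return x.min(axis=1)

def eventually(x):
    """x: [T, F] per-timestep truth values; 'at some time' -> max over time."""
    return x.max(axis=0)

def always(x):
    return x.min(axis=0)

# Toy input: 3 timesteps, 4 objects, 2 soft predicates per object.
rng = np.random.default_rng(0)
x = rng.random((3, 4, 2))

# "At some time, some object satisfies each predicate."
print(eventually(exists_over_objects(x)))
# "At every time, every object satisfies each predicate."
print(always(forall_over_objects(x)))
```

Because pooling has no fixed arity, the same layer applies unchanged to inputs with more objects or longer sequences, which is the structural bias behind the generalization claim in the abstract.
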
%B Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21) %C Montreal, Canada %8 06/2021 %G eng %U https://www.ijcai.org/proceedings/2021 %R 10.24963/ijcai.2021/386 %0 Conference Paper %B International Conference on Learning Representations %D 2021 %T Unsupervised Discovery of 3D Physical Objects %A Yilun Du %A Kevin A Smith %A Tomer Ullman %A Joshua B. Tenenbaum %A Jiajun Wu %X

We study the problem of unsupervised physical object discovery. Unlike existing frameworks that aim to learn to decompose scenes into 2D segments purely based on each object's appearance, we explore how physics, especially object interactions, facilitates learning to disentangle and segment instances from raw videos, and to infer the 3D geometry and position of each object, all without supervision. Drawing inspiration from developmental psychology, our Physical Object Discovery Network (POD-Net) uses both multi-scale pixel cues and physical motion cues to accurately segment observable and partially occluded objects of varying sizes, and infer properties of those objects. Our model reliably segments objects in both synthetic and real scenes. The discovered object properties can also be used to reason about physical events.

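A toy sketch of how physical motion cues can propose object segments, in the spirit of (but far simpler than) POD-Net: pixels that change together across frames are grouped by connected components. The frames, threshold, and helper name are illustrative assumptions.

```python
# Toy illustration of motion-cue segmentation: group pixels that changed
# between frames into connected components. This is a stand-in for the
# idea of using motion cues, not POD-Net itself.
import numpy as np
from scipy import ndimage

def motion_segments(frame_t, frame_t1, thresh=0.1):
    """Return a label map of connected regions that moved between frames."""
    moving = np.abs(frame_t1 - frame_t) > thresh      # binary motion mask
    labels, num = ndimage.label(moving)               # connected components
    return labels, num

# Two synthetic 32x32 grayscale frames with one square moving one pixel right.
f0 = np.zeros((32, 32)); f0[10:16, 10:16] = 1.0
f1 = np.zeros((32, 32)); f1[10:16, 11:17] = 1.0
labels, num = motion_segments(f0, f1)
print("object candidates found:", num)
```
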
%B International Conference on Learning Representations %8 07/2020 %G eng %U https://openreview.net/forum?id=lf7st0bJIA5 %0 Conference Paper %B Proceedings of the 42nd Annual Meeting of the Cognitive Science Society - Developing a Mind: Learning in Humans, Animals, and Machines, CogSci 2020, virtual, July 29 - August 1, 2020 %D 2020 %T The fine structure of surprise in intuitive physics: when, why, and how much? %A Kevin A Smith %A Lingjie Mei %A Shunyu Yao %A Jiajun Wu %A Elizabeth S Spelke %A Joshua B. Tenenbaum %A Tomer D. Ullman %E Stephanie Denison %E Michael Mack %E Yang Xu %E Blair C. Armstrong %B Proceedings of the 42nd Annual Meeting of the Cognitive Science Society - Developing a Mind: Learning in Humans, Animals, and Machines, CogSci 2020, virtual, July 29 - August 1, 2020 %G eng %U https://cogsci.mindmodeling.org/2020/papers/0761/index.html %0 Journal Article %J Current Opinion in Neurobiology %D 2019 %T An integrative computational architecture for object-driven cortex %A Ilker Yildirim %A Jiajun Wu %A Nancy Kanwisher %A Joshua B. Tenenbaum %X


Objects in motion activate multiple cortical regions in every lobe of the human brain. Do these regions represent a collection of independent systems, or is there an overarching functional architecture spanning all of object-driven cortex? Inspired by recent work in artificial intelligence (AI), machine learning, and cognitive science, we consider the hypothesis that these regions can be understood as a coherent network implementing an integrative computational system that unifies the functions needed to perceive, predict, reason about, and plan with physical objects—as in the paradigmatic case of using or making tools. Our proposal draws on a modeling framework that combines multiple AI methods, including causal generative models, hybrid symbolic-continuous planning algorithms, and neural recognition networks, with object-centric, physics-based representations. We review evidence relating specific components of our proposal to the specific regions that comprise object-driven cortex, and lay out future research directions with the goal of building a complete functional and mechanistic account of this system.

%B Current Opinion in Neurobiology %V 55 %P 73 - 81 %8 01/2019 %G eng %U https://linkinghub.elsevier.com/retrieve/pii/S0959438818301995 %! Current Opinion in Neurobiology %R 10.1016/j.conb.2019.01.010 %0 Conference Proceedings %B 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) %D 2019 %T Modeling Expectation Violation in Intuitive Physics with Coarse Probabilistic Object Representations %A Kevin A Smith %A Lingjie Mei %A Shunyu Yao %A Jiajun Wu %A Elizabeth S Spelke %A Joshua B. Tenenbaum %A Tomer D. Ullman %X

From infancy, humans have expectations about how objects will move and interact. Even young children expect objects not to move through one another, teleport, or disappear. They are surprised by mismatches between physical expectations and perceptual observations, even in unfamiliar scenes with completely novel objects. A model that exhibits human-like understanding of physics should be similarly surprised, and adjust its beliefs accordingly. We propose ADEPT, a model that uses a coarse (approximate geometry) object-centric representation for dynamic 3D scene understanding. Inference integrates deep recognition networks, extended probabilistic physical simulation, and particle filtering for forming predictions and expectations across occlusion. We also present a new test set for measuring violations of physical expectations, using a range of scenarios derived from developmental psychology. We systematically compare ADEPT, baseline models, and human expectations on this test set. ADEPT outperforms standard network architectures in discriminating physically implausible scenes, and often performs this discrimination at the same level as people.

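A minimal sketch of measuring surprise as the negative log-likelihood of an observation under a particle filter with an approximate dynamics model, in the spirit of the approach described above; the 1D state, noise levels, and "teleporting" scenario are illustrative assumptions, not ADEPT's actual representation.

```python
# Hedged sketch: surprise = -log average observation likelihood over the
# particles of a filter that runs an approximate physics model.
import numpy as np

rng = np.random.default_rng(0)
n_particles = 500
# State per particle: 1D position and velocity of a single object.
particles = np.stack([rng.normal(0.0, 0.1, n_particles),      # position
                      rng.normal(1.0, 0.1, n_particles)], 1)  # velocity

def step(p, dt=0.1, noise=0.02):
    """Approximate physics: constant velocity plus process noise."""
    pos = p[:, 0] + p[:, 1] * dt + rng.normal(0, noise, len(p))
    vel = p[:, 1] + rng.normal(0, noise, len(p))
    return np.stack([pos, vel], 1)

def surprise(p, obs_pos, sigma=0.05):
    """Negative log of the average Gaussian observation likelihood."""
    lik = np.exp(-0.5 * ((p[:, 0] - obs_pos) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(lik.mean() + 1e-12)

for t, obs in enumerate([0.1, 0.2, 0.3, 1.5]):   # the last observation "teleports"
    particles = step(particles)
    print(f"t={t}  surprise={surprise(particles, obs):.2f}")
    # Resample particles in proportion to how well they explain the observation.
    w = np.exp(-0.5 * ((particles[:, 0] - obs) / 0.05) ** 2) + 1e-12
    idx = rng.choice(n_particles, n_particles, p=w / w.sum())
    particles = particles[idx]
```

The surprise spikes at the final, physically implausible observation, which is the qualitative signature the paper's test set is designed to probe.
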
%B 33rd Conference on Neural Information Processing Systems (NeurIPS 2019) %C Vancouver, Canada %8 11/2019 %G eng %U http://physadept.csail.mit.edu/ %0 Conference Proceedings %B Neural Information Processing Systems (NeurIPS 2019) %D 2019 %T Visual Concept-Metaconcept Learning %A Chi Han %A Jiayuan Mao %A Chuang Gan %A Joshua B. Tenenbaum %A Jiajun Wu %X

Humans reason with concepts and metaconcepts: we recognize red and blue from visual input; we also understand that they are colors, i.e., red is an instance of color. In this paper, we propose the visual concept-metaconcept learner (VCML) for joint learning of concepts and metaconcepts from images and associated question-answer pairs. The key is to exploit the bidirectional connection between visual concepts and metaconcepts. Visual representations provide grounding cues for predicting relations between unseen pairs of concepts. Knowing that red and blue are instances of color, we generalize to the fact that green is also an instance of color since they all categorize the hue of objects. Meanwhile, knowledge about metaconcepts empowers visual concept learning from limited, noisy, and even biased data. From just a few examples of purple cubes we can understand a new color, purple, which resembles the hue of the cubes rather than their shape. Evaluation on both synthetic and real-world datasets validates our claims.

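A toy sketch of the concept/metaconcept distinction (not VCML's architecture): visual concepts are embeddings whose energy concentrates in the dimensions of one attribute, and a metaconcept such as "same kind" is scored by comparing those per-attribute profiles. The block layout and concept vectors are invented for illustration.

```python
# Illustrative sketch: a metaconcept relates concepts via how they
# categorize objects, here approximated by per-attribute embedding energy.
import numpy as np

BLOCKS = {"hue": slice(0, 3), "shape": slice(3, 6)}

concepts = {
    "red":   np.array([0.9, 0.1, 0.0, 0.0, 0.0, 0.0]),
    "blue":  np.array([0.0, 0.2, 0.9, 0.0, 0.0, 0.0]),
    "green": np.array([0.1, 0.9, 0.1, 0.0, 0.0, 0.0]),
    "cube":  np.array([0.0, 0.0, 0.0, 0.9, 0.1, 0.0]),
}

def block_profile(v):
    """Energy of the embedding within each attribute block."""
    return np.array([np.linalg.norm(v[s]) for s in BLOCKS.values()])

def same_kind(a, b):
    """Cosine similarity of block profiles: high when both concepts
    categorize the same attribute (e.g., both are colors)."""
    pa, pb = block_profile(concepts[a]), block_profile(concepts[b])
    return float(pa @ pb / (np.linalg.norm(pa) * np.linalg.norm(pb)))

print(same_kind("red", "green"))   # ~1.0: both are colors
print(same_kind("red", "cube"))    # ~0.0: a color vs. a shape
```
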
%B Neural Information Processing Systems (NeurIPS 2019) %C Vancouver, Canada %8 11/2019 %G eng %0 Conference Paper %B The IEEE International Conference on Computer Vision (ICCV) %D 2017 %T Generative modeling of audible shapes for object perception %A Zhoutong Zhang %A Jiajun Wu %A Qiujia Li %A Zhengjia Huang %A James Traer %A Josh H. McDermott %A Joshua B. Tenenbaum %A William T. Freeman %X

Humans infer rich knowledge of objects from both auditory and visual cues. Building a machine with such competency, however, is very challenging, due to the difficulty of capturing large-scale, clean data of objects with both their appearance and the sound they make. In this paper, we present a novel, open-source pipeline that generates audio-visual data, purely from 3D object shapes and their physical properties. Through comparison with audio recordings and human behavioral studies, we validate the accuracy of the sounds it generates. Using this generative model, we are able to construct a synthetic audio-visual dataset, namely Sound-20K, for object perception tasks. We demonstrate that auditory and visual information play complementary roles in object perception, and further, that the representation learned on synthetic audio-visual data can transfer to real-world scenarios.

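For intuition about physics-based audio generation (the general idea, not the paper's exact pipeline), here is a minimal modal-synthesis sketch: an impact excites damped sinusoidal modes whose frequencies and decay rates would, in a full system, be derived from the object's shape and material. The mode parameters below are made up.

```python
# Rough sketch of physics-based modal sound synthesis: an impact sound as a
# sum of exponentially damped sinusoids, one per vibration mode.
import numpy as np

def modal_impact(freqs_hz, decays, amps, sr=44100, dur=1.0):
    """Synthesize one impact as a sum of damped sinusoidal modes."""
    t = np.arange(int(sr * dur)) / sr
    sound = np.zeros_like(t)
    for f, d, a in zip(freqs_hz, decays, amps):
        sound += a * np.exp(-d * t) * np.sin(2 * np.pi * f * t)
    return sound / np.max(np.abs(sound))

# Illustrative "wood-like" modes: a few frequencies with fast decay.
clip = modal_impact(freqs_hz=[220, 530, 910], decays=[8, 12, 20], amps=[1.0, 0.6, 0.3])
print(clip.shape)  # (44100,): one second of audio at 44.1 kHz
```
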
%B The IEEE International Conference on Computer Vision (ICCV) %C Venice, Italy %8 10/2017 %G eng %U http://openaccess.thecvf.com/content_iccv_2017/html/Zhang_Generative_Modeling_of_ICCV_2017_paper.html %0 Conference Proceedings %B Advances in Neural Information Processing Systems 30 %D 2017 %T Learning to See Physics via Visual De-animation %A Jiajun Wu %A Lu, Erika %A Kohli, Pushmeet %A William T. Freeman %A Joshua B. Tenenbaum %E I. Guyon %E U. V. Luxburg %E S. Bengio %E H. Wallach %E R. Fergus %E S. Vishwanathan %E R. Garnett %X
We introduce a paradigm for understanding physical scenes without human annotations. At the core of our system is a physical world representation that is first recovered by a perception module and then utilized by physics and graphics engines. During training, the perception module and the generative models learn by visual de-animation: interpreting and reconstructing the visual information stream. During testing, the system first recovers the physical world state, and then uses the generative models for reasoning and future prediction.

Even more so than forward simulation, inverting a physics or graphics engine is a computationally hard problem; we overcome this challenge by using a convolutional inversion network. Our system quickly recognizes the physical world state from appearance and motion cues, and has the flexibility to incorporate both differentiable and non-differentiable physics and graphics engines. We evaluate our system on both synthetic and real datasets involving multiple physical scenes, and demonstrate that our system performs well on both physical state estimation and reasoning problems. We further show that the knowledge learned on the synthetic dataset generalizes to constrained real images.
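
A minimal sketch of the analysis-by-synthesis loop described above, with illustrative stand-ins throughout: a one-ball "physics engine", a dot-rasterizing "renderer", and grid search in place of a learned perception module.

```python
# Illustrative sketch: recover a physical state whose simulated-and-rendered
# rollout reconstructs the observed frames. All components are toy stand-ins.
import numpy as np

def simulate(pos, vel, steps=5, dt=0.1, g=-9.8):
    """Toy physics engine: ballistic motion of a single 2D point."""
    traj = []
    for _ in range(steps):
        vel = vel + np.array([0.0, g]) * dt
        pos = pos + vel * dt
        traj.append(pos.copy())
    return np.array(traj)

def render(pos, size=16):
    """Toy graphics engine: rasterize the point into a size x size image."""
    img = np.zeros((size, size))
    x, y = np.clip((pos * size).astype(int), 0, size - 1)
    img[y, x] = 1.0
    return img

# "Observed" frames generated from a hidden initial state.
true_traj = simulate(np.array([0.1, 0.9]), np.array([0.8, 0.0]))
observed = np.stack([render(p) for p in true_traj])

# Perception-as-inference stand-in: grid search over the initial horizontal
# velocity, keeping the rollout that best reconstructs the observed frames.
best_loss, best_vx = float("inf"), None
for vx in np.linspace(0.0, 1.5, 16):
    rollout = simulate(np.array([0.1, 0.9]), np.array([vx, 0.0]))
    frames = np.stack([render(p) for p in rollout])
    loss = np.abs(frames - observed).sum()
    if loss < best_loss:
        best_loss, best_vx = loss, vx
print("recovered horizontal velocity ~", round(best_vx, 2))
```

In the paper's setting, a learned perception module replaces this search, and the physics and graphics engines may be differentiable or non-differentiable, as the abstract notes.
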
%B Advances in Neural Information Processing Systems 30 %P 152–163 %8 12/2017 %G eng %U http://papers.nips.cc/paper/6620-learning-to-see-physics-via-visual-de-animation.pdf %0 Conference Proceedings %B Advances in Neural Information Processing Systems 30 %D 2017 %T MarrNet: 3D Shape Reconstruction via 2.5D Sketches %A Jiajun Wu %A Wang, Yifan %A Xue, Tianfan %A Sun, Xingyuan %A William T. Freeman %A Joshua B. Tenenbaum %E I. Guyon %E U. V. Luxburg %E S. Bengio %E H. Wallach %E R. Fergus %E S. Vishwanathan %E R. Garnett %X

3D object reconstruction from a single image is a highly under-determined problem, requiring strong prior knowledge of plausible 3D shapes. This poses a challenge for learning-based approaches, as 3D object annotations in real images are scarce. Previous work chose to train on synthetic data with ground truth 3D information, but suffered from domain adaptation issues when tested on real data. In this work, we propose an end-to-end trainable framework, sequentially estimating 2.5D sketches and 3D object shapes. Our disentangled, two-step formulation has three advantages. First, compared to a full 3D shape, 2.5D sketches are much easier to recover from a 2D image and to transfer from synthetic to real data. Second, for 3D reconstruction from the 2.5D sketches, we can easily transfer a model learned on synthetic data to real images, as rendered 2.5D sketches are invariant to object appearance variations in real images, including lighting, texture, etc. This further alleviates the domain adaptation problem. Third, we derive differentiable projective functions from 3D shape to 2.5D sketches, making the framework end-to-end trainable on real images, requiring no real-image annotations. Our framework achieves state-of-the-art performance on 3D shape reconstruction.

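A hedged sketch of the reprojection-consistency idea mentioned above: project a predicted voxel occupancy grid back to a silhouette and depth map and penalize disagreement with the estimated 2.5D sketches. The orthographic projection and hard max/argmax here are simplifying assumptions; a trainable version would use soft, differentiable projections.

```python
# Illustrative sketch of projecting a voxel grid to 2.5D sketches and
# measuring consistency with estimated sketches.
import numpy as np

def project(voxels):
    """Orthographic projection along axis 0. voxels: [D, H, W] in {0, 1}."""
    silhouette = voxels.max(axis=0)               # any occupied voxel along the ray
    depth = np.where(silhouette > 0,
                     voxels.argmax(axis=0),       # first occupied voxel = surface depth
                     voxels.shape[0])             # background rays get max depth
    return silhouette, depth

def sketch_loss(voxels, est_sil, est_depth):
    sil, depth = project(voxels)
    return np.abs(sil - est_sil).mean() + np.abs(depth - est_depth).mean()

# Toy 8^3 shape (a small box) and 2.5D "estimates" rendered from it.
vox = np.zeros((8, 8, 8)); vox[2:5, 3:6, 3:6] = 1.0
est_sil, est_depth = project(vox)
print(sketch_loss(vox, est_sil, est_depth))  # 0.0 for a consistent prediction
```
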
%B Advances in Neural Information Processing Systems 30 %I Curran Associates, Inc. %C Long Beach, CA %P 540–550 %8 12/2017 %G eng %U http://papers.nips.cc/paper/6657-marrnet-3d-shape-reconstruction-via-25d-sketches.pdf %0 Conference Paper %B Annual Conference on Neural Information Processing Systems (NIPS) %D 2017 %T Self-supervised intrinsic image decomposition %A Michael Janner %A Jiajun Wu %A Tejas Kulkarni %A Ilker Yildirim %A Joshua B. Tenenbaum %B Annual Conference on Neural Information Processing Systems (NIPS) %C Long Beach, CA %8 12/2017 %G eng %U https://papers.nips.cc/paper/7175-self-supervised-intrinsic-image-decomposition %0 Conference Proceedings %B Advances in Neural Information Processing Systems 30 %D 2017 %T Shape and Material from Sound %A Zhoutong Zhang %A Qiujia Li %A Zhengjia Huang %A Jiajun Wu %A Joshua B. Tenenbaum %A William T. Freeman %E I. Guyon %E U. V. Luxburg %E S. Bengio %E H. Wallach %E R. Fergus %E S. Vishwanathan %E R. Garnett %X

What can we infer from hearing an object falling onto the ground? Based on knowledge of the physical world, humans are able to infer rich information from such limited data: the rough shape of the object, its material, the height of the fall, etc. In this paper, we aim to approximate such competency. We first mimic human knowledge about the physical world using a fast physics-based generative model. Then, we present an analysis-by-synthesis approach to infer properties of the falling object. We further approximate human past experience by directly mapping audio to object properties using deep learning with self-supervision. We evaluate our method through behavioral studies, where we compare human predictions with ours on inferring object shape, material, and the initial height of the fall. Results show that our method achieves near-human performance, without any annotations.

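A minimal analysis-by-synthesis sketch in the spirit of the abstract: propose latent object properties, run a forward model to synthesize audio features, and keep the candidate that best matches the observed features. The two-number feature model (impact time from drop height, a material-dependent dominant frequency) is a stand-in for the paper's physics-based audio engine.

```python
# Illustrative analysis-by-synthesis: random search over (material, height)
# against features produced by a toy forward model.
import numpy as np

MATERIALS = {"wood": 400.0, "metal": 1200.0, "plastic": 700.0}  # toy dominant freqs (Hz)

def forward(material, height_m, g=9.8):
    impact_time = np.sqrt(2 * height_m / g)          # free fall until first impact
    return np.array([impact_time, MATERIALS[material]])

def infer(observed, n_samples=2000, rng=np.random.default_rng(0)):
    best, best_err = None, np.inf
    for _ in range(n_samples):
        m = str(rng.choice(list(MATERIALS)))
        h = rng.uniform(0.1, 2.0)
        err = np.abs((forward(m, h) - observed) / observed).sum()  # relative error
        if err < best_err:
            best, best_err = (m, h), err
    return best

obs = forward("metal", 1.0)   # pretend these features were extracted from audio
print(infer(obs))             # -> ('metal', ~1.0)
```
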
%B Advances in Neural Information Processing Systems 30 %C Long Beach, CA %P 1278–1288 %8 12/2017 %G eng %U http://papers.nips.cc/paper/6727-shape-and-material-from-sound.pdf %0 Conference Paper %B 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) %D 2017 %T Synthesizing 3D Shapes via Modeling Multi-view Depth Maps and Silhouettes with Deep Generative Networks %A Amir Arsalan Soltani %A Haibin Huang %A Jiajun Wu %A Tejas Kulkarni %A Joshua B. Tenenbaum %K 2d to 3d %K 3D generation %K 3D reconstruction %K Core object system %K depth map %K generative %K perception %K silhouette %X

We study the problem of learning generative models of 3D shapes. Voxels or 3D parts have been widely used as the underlying representations to build complex 3D shapes; however, voxel-based representations suffer from high memory requirements, and parts-based models require a large collection of cached or richly parametrized parts. We take an alternative approach: learning a generative model over multi-view depth maps or their corresponding silhouettes, and using a deterministic rendering function to produce 3D shapes from these images. A multi-view representation of shapes enables generation of 3D models with fine details, as 2D depth maps and silhouettes can be modeled at a much higher resolution than 3D voxels. Moreover, our approach naturally enables recovering the underlying 3D representation from depth maps of one or a few viewpoints. Experiments show that our framework can generate 3D shapes with variations and details. We also demonstrate that our model has out-of-sample generalization power for real-world tasks with occluded objects.

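A hedged sketch of a deterministic fusion step from silhouettes to a 3D shape, in the spirit of the rendering function mentioned above: space carving keeps a voxel only if it projects inside the silhouette in every view. Using three axis-aligned orthographic views is a simplification of the paper's multi-view setup.

```python
# Illustrative space carving: intersect the extrusions of per-view silhouettes.
import numpy as np

def carve(sil_x, sil_y, sil_z):
    """Each silhouette is an [N, N] binary mask seen along one axis."""
    n = sil_x.shape[0]
    vox = np.ones((n, n, n), dtype=bool)
    vox &= sil_x[None, :, :]   # view along axis 0
    vox &= sil_y[:, None, :]   # view along axis 1
    vox &= sil_z[:, :, None]   # view along axis 2
    return vox

# Toy example: a solid box reconstructed from its three axis-aligned silhouettes.
gt = np.zeros((16, 16, 16), dtype=bool); gt[4:10, 5:12, 6:9] = True
recon = carve(gt.max(axis=0), gt.max(axis=1), gt.max(axis=2))
print((recon == gt).all())   # True for convex, axis-aligned shapes like this box
```
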
%B 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) %C Honolulu, HI %8 07/2017 %G eng %U http://ieeexplore.ieee.org/document/8099752/ %R 10.1109/CVPR.2017.269 %0 Conference Paper %B NIPS 2015 %D 2015 %T Galileo: Perceiving physical object properties by integrating a physics engine with deep learning %A Jiajun Wu %A Ilker Yildirim %A Joseph J. Lim %A William T. Freeman %A Joshua B. Tenenbaum %X
Humans demonstrate remarkable abilities to predict physical events in dynamic scenes, and to infer the physical properties of objects from static images. We propose a generative model for solving these problems of physical scene understanding from real-world videos and images. At the core of our generative model is a 3D physics engine, operating on an object-based representation of physical properties, including mass, position, 3D shape, and friction. We can infer these latent properties using relatively brief runs of MCMC, which drive simulations in the physics engine to fit key features of visual observations. We further explore directly mapping visual inputs to physical properties, inverting a part of the generative process using deep learning. We name our model Galileo, and evaluate it on a video dataset with simple yet physically rich scenarios. Results show that Galileo is able to infer the physical properties of objects and predict the outcome of a variety of physical events, with an accuracy comparable to human subjects. Our study points towards an account of human vision with generative physical knowledge at its core, and various recognition models as helpers leading to efficient inference.
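
A minimal Metropolis-Hastings sketch in the spirit of "brief runs of MCMC which drive simulations in the physics engine to fit key features of visual observations"; the one-parameter sliding-block simulator, the distance feature, and the noise model are illustrative stand-ins for the paper's 3D physics engine and video features.

```python
# Illustrative MCMC over a single physical property (friction) whose
# simulated outcome is compared against an "observed" feature.
import numpy as np

rng = np.random.default_rng(0)

def simulate(friction, v0=3.0, dt=0.05, steps=60):
    """Toy physics: a block decelerating under Coulomb friction.
    Returns the total sliding distance as the 'key feature'."""
    x, v = 0.0, v0
    for _ in range(steps):
        v = max(0.0, v - 9.8 * friction * dt)
        x += v * dt
    return x

observed = simulate(0.3) + rng.normal(0, 0.01)   # pretend this came from video

def log_lik(friction, sigma=0.05):
    return -0.5 * ((simulate(friction) - observed) / sigma) ** 2

friction, samples = 0.5, []
for _ in range(2000):
    prop = np.clip(friction + rng.normal(0, 0.05), 0.01, 1.0)
    if np.log(rng.uniform()) < log_lik(prop) - log_lik(friction):
        friction = prop
    samples.append(friction)
print("posterior mean friction ~", round(np.mean(samples[500:]), 2))  # about 0.3
```
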
%B NIPS 2015 %C Montréal, Canada %G eng %U https://papers.nips.cc/paper/5780-galileo-perceiving-physical-object-properties-by-integrating-a-physics-engine-with-deep-learning