%0 Journal Article %J Nature Machine Intelligence %D 2020 %T A neural network trained to predict future video frames mimics critical properties of biological neuronal responses and perception %A William Lotter %A Gabriel Kreiman %A David Cox %X

While deep neural networks take loose inspiration from neuroscience, it is an open question how seriously to take the analogies between artificial deep networks and biological neuronal systems. Interestingly, recent work has shown that deep convolutional neural networks (CNNs) trained on large-scale image recognition tasks can serve as strikingly good models for predicting the responses of neurons in visual cortex to visual stimuli, suggesting that analogies between artificial and biological neural networks may be more than superficial. However, while CNNs capture key properties of the average responses of cortical neurons, they fail to explain other properties of these neurons. For one, CNNs typically require large quantities of labeled input data for training. Our own brains, in contrast, rarely have access to this kind of supervision, so to the extent that representations are similar between CNNs and brains, this similarity must arise via different training paths. In addition, neurons in visual cortex produce complex time-varying responses even to static inputs, and they dynamically tune themselves to temporal regularities in the visual environment. We argue that these differences are clues to fundamental differences between the computations performed in the brain and in deep networks. To begin to close the gap, here we study the emergent properties of a previously-described recurrent generative network that is trained to predict future video frames in a self-supervised manner. Remarkably, the model is able to capture a wide variety of seemingly disparate phenomena observed in visual cortex, ranging from single unit response dynamics to complex perceptual motion illusions. These results suggest potentially deep connections between recurrent predictive neural network models and the brain, providing new leads that can enrich both fields.

%B Nature Machine Intelligence %8 04/2020 %G eng %0 Conference Proceedings %B Advances in Neural Information Processing Systems 33 pre-proceedings (NeurIPS 2020) %D 2020 %T Simulating a Primary Visual Cortex at the Front of CNNs Improves Robustness to Image Perturbations %A Joel Dapello %A Tiago Marques %A Martin Schrimpf %A Franziska Geiger %A David Cox %A James J. DiCarlo %X

Current state-of-the-art object recognition models are largely based on convolutional neural network (CNN) architectures, which are loosely inspired by the primate visual system. However, these CNNs can be fooled by imperceptibly small, explicitly crafted perturbations, and struggle to recognize objects in corrupted images that are easily recognized by humans. Here, by making comparisons with primate neural data, we first observed that CNN models with a neural hidden layer that better matches primate primary visual cortex (V1) are also more robust to adversarial attacks. Inspired by this observation, we developed VOneNets, a new class of hybrid CNN vision models. Each VOneNet contains a fixed weight neural network front-end that simulates primate V1, called the VOneBlock, followed by a neural network back-end adapted from current CNN vision models. The VOneBlock is based on a classical neuroscientific model of V1: the linear-nonlinear-Poisson model, consisting of a biologically-constrained Gabor filter bank, simple and complex cell nonlinearities, and a V1 neuronal stochasticity generator. After training, VOneNets retain high ImageNet performance, but each is substantially more robust, outperforming the base CNNs and state-of-the-art methods by 18% and 3%, respectively, on a conglomerate benchmark of perturbations comprised of white box adversarial attacks and common image corruptions. Finally, we show that all components of the VOneBlock work in synergy to improve robustness. While current CNN architectures are arguably brain-inspired, the results presented here demonstrate that more precisely mimicking just one stage of the primate visual system leads to new gains in ImageNet-level computer vision applications.

Github: https://github.com/dicarlolab/vonenet
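For illustration, here is a minimal sketch of the VOneBlock idea described in the abstract: a fixed Gabor filter bank, simple- and complex-cell nonlinearities, and a Poisson-like stochasticity stage. This is not the released vonenet code; the filter parameters, channel counts, and the Gaussian approximation to Poisson noise are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def gabor_kernel(size=31, sigma=4.0, theta=0.0, wavelength=8.0, phase=0.0):
    """One 2D Gabor filter (parameters are illustrative, not the paper's fitted values)."""
    half = size // 2
    ys, xs = torch.meshgrid(
        torch.arange(-half, half + 1, dtype=torch.float32),
        torch.arange(-half, half + 1, dtype=torch.float32),
        indexing="ij",
    )
    x_r = xs * math.cos(theta) + ys * math.sin(theta)
    y_r = -xs * math.sin(theta) + ys * math.cos(theta)
    envelope = torch.exp(-(x_r ** 2 + y_r ** 2) / (2 * sigma ** 2))
    carrier = torch.cos(2 * math.pi * x_r / wavelength + phase)
    return envelope * carrier

class ToyVOneBlock(nn.Module):
    """Fixed Gabor front-end + simple/complex-cell nonlinearities + Poisson-like noise."""

    def __init__(self, n_orientations=8):
        super().__init__()
        thetas = [i * math.pi / n_orientations for i in range(n_orientations)]
        # Quadrature pairs (phases 0 and pi/2) so complex cells can compute local energy.
        kernels = [gabor_kernel(theta=t, phase=p) for t in thetas for p in (0.0, math.pi / 2)]
        weight = torch.stack(kernels).unsqueeze(1)           # (2 * n_orientations, 1, k, k)
        self.conv = nn.Conv2d(1, weight.shape[0], weight.shape[-1], padding="same", bias=False)
        self.conv.weight.data = weight
        self.conv.weight.requires_grad = False               # fixed weights: the front-end is not trained

    def forward(self, x):
        responses = self.conv(x)                             # x: (batch, 1, H, W) grayscale input
        even, odd = responses[:, 0::2], responses[:, 1::2]
        simple = F.relu(even)                                # simple cells: half-wave rectification
        complex_ = torch.sqrt(even ** 2 + odd ** 2 + 1e-6)   # complex cells: quadrature-pair energy
        drive = torch.cat([simple, complex_], dim=1)
        if self.training:
            # Poisson-like stochasticity, approximated here by noise with variance ~ mean.
            drive = drive + torch.sqrt(F.relu(drive)) * torch.randn_like(drive)
        return drive                                         # fed into a conventional CNN back-end
```

A VOneNet-style model would then stack a standard CNN back-end on top of this output, as the abstract describes; only that back-end is trained.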

%B Advances in Neural Information Processing Systems 33 pre-proceedings (NeurIPS 2020) %8 12/2020 %G eng %U https://proceedings.neurips.cc/paper/2020/hash/98b17f068d5d9b7668e19fb8ae470841-Abstract.html %0 Journal Article %J arXiv %D 2020 %T ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation %A Chuang Gan %A Jeremy Schwartz %A Seth Alter %A Martin Schrimpf %A James Traer %A Julian De Freitas %A Jonas Kubilius %A Abhishek Bhandwaldar %A Nick Haber %A Megumi Sano %A Kuno Kim %A Elias Wang %A Damian Mrowca %A Michael Lingelbach %A Aidan Curtis %A Kevin Feigelis %A Daniel Bear %A Dan Gutfreund %A David Cox %A James J. DiCarlo %A Josh H. McDermott %A Joshua B. Tenenbaum %A Daniel L. K. Yamins %X

We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation. With TDW, users can simulate high-fidelity sensory data and physical interactions between mobile agents and objects in a wide variety of rich 3D environments. TDW has several unique properties: 1) real-time near photo-realistic image rendering quality; 2) a library of objects and environments with materials for high-quality rendering, and routines enabling user customization of the asset library; 3) generative procedures for efficiently building classes of new environments; 4) high-fidelity audio rendering; 5) believable and realistic physical interactions for a wide variety of material types, including cloths, liquids, and deformable objects; 6) a range of "avatar" types that serve as embodiments of AI agents, with the option for user avatar customization; and 7) support for human interactions with VR devices. TDW also provides a rich API enabling multiple agents to interact within a simulation and return a range of sensor and physics data representing the state of the world. We present initial experiments enabled by the platform around emerging research directions in computer vision, machine learning, and cognitive science, including multi-modal physical scene understanding, multi-agent interactions, models that "learn like a child", and attention studies in humans and neural networks. The simulation platform will be made publicly available.
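As a schematic illustration of the command/response pattern such an API implies (agents send batched commands to the simulation and receive sensor and physics data back), here is a toy client. The class name, command names, and wire format below are hypothetical placeholders, not the actual TDW Python API; see the arXiv link for the real interface.

```python
# Toy command/response client illustrating the kind of simulation API the abstract
# describes. Everything here (class name, command names, wire format) is hypothetical,
# not the real tdw package.
import json
import socket

class ToySimClient:
    """Sends a batch of JSON commands to a simulation server and reads back observations."""

    def __init__(self, host="localhost", port=1071):
        self.sock = socket.create_connection((host, port))

    def _recv_exact(self, n):
        data = b""
        while len(data) < n:
            chunk = self.sock.recv(n - len(data))
            if not chunk:
                raise ConnectionError("simulation server closed the connection")
            data += chunk
        return data

    def communicate(self, commands):
        payload = json.dumps(commands).encode("utf-8")
        self.sock.sendall(len(payload).to_bytes(4, "little") + payload)
        size = int.from_bytes(self._recv_exact(4), "little")
        return json.loads(self._recv_exact(size).decode("utf-8"))

if __name__ == "__main__":
    client = ToySimClient()
    # One simulation step: place an object, push it, and ask an avatar for images.
    observations = client.communicate([
        {"$type": "add_object", "name": "ball", "position": [0.0, 1.0, 0.0]},
        {"$type": "apply_force", "target": "ball", "force": [2.0, 0.0, 0.0]},
        {"$type": "request_images", "avatar": "agent_0"},
        {"$type": "step_physics", "frames": 1},
    ])
    print(observations)
```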

%B arXiv %8 07/2020 %G eng %U https://arxiv.org/abs/2007.04954 %9 Preprint %0 Report %D 2018 %T A neural network trained to predict future video frames mimics critical properties of biological neuronal responses and perception %A William Lotter %A Gabriel Kreiman %A David Cox %X

While deep neural networks take loose inspiration from neuroscience, it is an open question how seriously to take the analogies between artificial deep networks and biological neuronal systems. Interestingly, recent work has shown that deep convolutional neural networks (CNNs) trained on large-scale image recognition tasks can serve as strikingly good models for predicting the responses of neurons in visual cortex to visual stimuli, suggesting that analogies between artificial and biological neural networks may be more than superficial. However, while CNNs capture key properties of the average responses of cortical neurons, they fail to explain other properties of these neurons. For one, CNNs typically require large quantities of labeled input data for training. Our own brains, in contrast, rarely have access to this kind of supervision, so to the extent that representations are similar between CNNs and brains, this similarity must arise via different training paths. In addition, neurons in visual cortex produce complex time-varying responses even to static inputs, and they dynamically tune themselves to temporal regularities in the visual environment. We argue that these differences are clues to fundamental differences between the computations performed in the brain and in deep networks. To begin to close the gap, here we study the emergent properties of a previously-described recurrent generative network that is trained to predict future video frames in a self-supervised manner. Remarkably, the model is able to capture a wide variety of seemingly disparate phenomena observed in visual cortex, ranging from single unit response dynamics to complex perceptual motion illusions. These results suggest potentially deep connections between recurrent predictive neural network models and the brain, providing new leads that can enrich both fields.

%I arXiv | Cornell University %8 05/2018 %G eng %U https://arxiv.org/pdf/1805.10734.pdf %0 Journal Article %J Proceedings of the National Academy of Sciences %D 2018 %T Recurrent computations for visual pattern completion %A Hanlin Tang %A Martin Schrimpf %A William Lotter %A Charlotte Moerman %A Ana Paredes %A Josue Ortega Caro %A Walter Hardesty %A David Cox %A Gabriel Kreiman %K Artificial Intelligence %K computational neuroscience %K Machine Learning %K pattern completion %K Visual object recognition %X

Making inferences from partial information constitutes a critical aspect of cognition. During visual perception, pattern completion enables recognition of poorly visible or occluded objects. We combined psychophysics, physiology, and computational models to test the hypothesis that pattern completion is implemented by recurrent computations and present three pieces of evidence that are consistent with this hypothesis. First, subjects robustly recognized objects even when they were rendered <15% visible, but recognition was largely impaired when processing was interrupted by backward masking. Second, invasive physiological responses along the human ventral cortex exhibited visually selective responses to partially visible objects that were delayed compared with whole objects, suggesting the need for additional computations. These physiological delays were correlated with the effects of backward masking. Third, state-of-the-art feed-forward computational architectures were not robust to partial visibility. However, recognition performance was recovered when the model was augmented with attractor-based recurrent connectivity. The recurrent model was able to predict which images of heavily occluded objects were easier or harder for humans to recognize, could capture the effect of introducing a backward mask on recognition behavior, and was consistent with the physiological delays along the human ventral visual stream. These results provide a strong argument of plausibility for the role of recurrent computations in making visual inferences from partial information.
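For intuition about the last point, here is a toy sketch of attractor-based pattern completion using Hopfield-style recurrent dynamics. It illustrates only the principle that recurrence can fill in missing parts of a stored pattern; it is not the recurrent model evaluated in the paper, and the pattern sizes are arbitrary.

```python
# Toy attractor-based pattern completion (Hopfield-style), illustrating the kind of
# recurrent computation the abstract argues for. Not the paper's architecture.
import numpy as np

def train_hopfield(patterns):
    """Hebbian storage of +/-1 patterns as a symmetric weight matrix."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)
    return W

def complete(W, partial, steps=20):
    """Iterate the recurrent dynamics so a partially visible pattern settles into an attractor."""
    state = partial.copy()
    for _ in range(steps):
        state = np.sign(W @ state)
        state[state == 0] = 1.0
    return state

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stored = rng.choice([-1.0, 1.0], size=(3, 200))   # three stored "object" patterns
    occluded = stored[0].copy()
    occluded[50:] = 0.0                               # hide ~75% of the pattern
    recovered = complete(train_hopfield(stored), occluded)
    print("overlap with original:", float(recovered @ stored[0]) / 200)
```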

%B Proceedings of the National Academy of Sciences %8 08/2018 %G eng %U http://www.pnas.org/lookup/doi/10.1073/pnas.1719397115 %! Proc Natl Acad Sci USA %R 10.1073/pnas.1719397115 %0 Conference Paper %B ICLR %D 2017 %T Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning %A William Lotter %A Gabriel Kreiman %A David Cox %B ICLR %G eng %0 Generic %D 2017 %T Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning %A William Lotter %A Gabriel Kreiman %A David Cox %X

While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning — leveraging unlabeled examples to learn about the structure of a domain — remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network (“PredNet”) architecture that is inspired by the concept of “predictive coding” from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and the representation learned in this setting is useful for estimating the steering angle. Altogether, these results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.
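The core architectural idea, each layer predicting its own input and passing only the prediction error upward, can be sketched in a few lines. The code below is a simplified single-layer, single-timestep illustration (a plain convolutional recurrence stands in for the ConvLSTM units), not the released PredNet implementation; layer sizes are arbitrary.

```python
# Simplified single-timestep sketch of a PredNet-style layer: predict the bottom-up
# input and forward only the prediction error. Illustrative PyTorch, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPredNetLayer(nn.Module):
    def __init__(self, in_channels, r_channels):
        super().__init__()
        # Recurrent representation update (a plain conv recurrence stands in for the ConvLSTM).
        self.update_r = nn.Conv2d(r_channels + 2 * in_channels, r_channels, 3, padding=1)
        # Prediction of this layer's input from its representation.
        self.predict = nn.Conv2d(r_channels, in_channels, 3, padding=1)
        # Error units are what get pooled and sent to the next layer.
        self.to_higher = nn.Conv2d(2 * in_channels, 2 * in_channels, 3, padding=1)

    def forward(self, a, r, e_prev):
        # 1) Update the representation from its previous state and the previous error.
        r = torch.tanh(self.update_r(torch.cat([r, e_prev], dim=1)))
        # 2) Predict the current input.
        a_hat = self.predict(r)
        # 3) Split the prediction error into rectified positive and negative parts.
        e = torch.cat([F.relu(a - a_hat), F.relu(a_hat - a)], dim=1)
        # 4) Only the error (not the input itself) is pooled and forwarded upward.
        a_next = F.max_pool2d(self.to_higher(e), 2)
        return a_hat, r, e, a_next

# One step on a dummy frame; the training loss is simply the mean error activation.
frame = torch.rand(1, 3, 64, 64)
layer = ToyPredNetLayer(in_channels=3, r_channels=16)
r0 = torch.zeros(1, 16, 64, 64)
e0 = torch.zeros(1, 6, 64, 64)
a_hat, r1, e1, a_up = layer(frame, r0, e0)
loss = e1.mean()
```

A full model stacks several such layers (feeding `a_up` into the layer above) and unrolls them over the frames of a video, summing the error activations over layers and time as the training objective.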

%8 03/2017 %1

arXiv:1605.08104v5

%2

http://hdl.handle.net/1721.1/107497

%0 Generic %D 2016 %T PredNet - "Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning" [code] %A William Lotter %A Gabriel Kreiman %A David Cox %X

The PredNet is a deep convolutional recurrent neural network inspired by the principles of predictive coding from the neuroscience literature [1, 2]. It is trained for next-frame video prediction with the belief that prediction is an effective objective for unsupervised (or "self-supervised") learning [e.g. 3-11].


For full project information and links to download the code, visit the project website: https://coxlab.github.io/prednet/

%0 Conference Paper %B International Conference on Learning Representations (ICLR) %D 2016 %T Unsupervised Learning of Visual Structure using Predictive Generative Networks %A William Lotter %A Gabriel Kreiman %A David Cox %X

The ability to predict future states of the environment is a central pillar of intelligence. At its core, effective prediction requires an internal model of the world and an understanding of the rules by which the world changes. Here, we explore the internal models developed by deep neural networks trained using a loss based on predicting future frames in synthetic video sequences, using a CNN-LSTM-deCNN framework. We first show that this architecture can achieve excellent performance in visual sequence prediction tasks, including state-of-the-art performance in a standard 'bouncing balls' dataset (Sutskever et al., 2009). Using a weighted mean-squared error and adversarial loss (Goodfellow et al., 2014), the same architecture successfully extrapolates out-of-the-plane rotations of computer-generated faces. Furthermore, despite being trained end-to-end to predict only pixel-level information, our Predictive Generative Networks learn a representation of the latent structure of the underlying three-dimensional objects themselves. Importantly, we find that this representation is naturally tolerant to object transformations, and generalizes well to new tasks, such as classification of static images. Similar models trained solely with a reconstruction loss fail to generalize as effectively. We argue that prediction can serve as a powerful unsupervised loss for learning rich internal representations of high-level object features.
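The training objective mentioned here, a weighted combination of pixel mean-squared error and an adversarial loss on the predicted frame, can be sketched as follows. The generator and discriminator below are deliberately tiny stand-ins (not the paper's CNN-LSTM-deCNN architecture), and the weighting value `lambda_adv` is an assumption.

```python
# Minimal sketch of training a frame predictor with a weighted MSE + adversarial loss,
# as mentioned in the abstract. Architectures and `lambda_adv` are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

predictor = nn.Sequential(                      # stands in for the CNN-LSTM-deCNN generator
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
discriminator = nn.Sequential(                  # classifies frames as real vs. predicted
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(32 * 16 * 16, 1),   # assumes 32x32 input frames
)
opt_g = torch.optim.Adam(predictor.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
lambda_adv = 0.01                               # relative weight of the adversarial term (assumed)

def train_step(prev_frame, next_frame):
    real = torch.ones(prev_frame.size(0), 1)
    fake = torch.zeros(prev_frame.size(0), 1)

    # Discriminator step: real next frames vs. detached predictions.
    pred = predictor(prev_frame).detach()
    d_loss = (F.binary_cross_entropy_with_logits(discriminator(next_frame), real)
              + F.binary_cross_entropy_with_logits(discriminator(pred), fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: pixel MSE plus the weighted adversarial term.
    pred = predictor(prev_frame)
    g_loss = (F.mse_loss(pred, next_frame)
              + lambda_adv * F.binary_cross_entropy_with_logits(discriminator(pred), real))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return float(g_loss), float(d_loss)

train_step(torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32))
```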

%B International Conference on Learning Representations (ICLR) %C San Juan, Puerto Rico %8 05/2016 %G eng %U http://arxiv.org/pdf/1511.06380v2.pdf %0 Generic %D 2015 %T Unsupervised Learning of Visual Structure Using Predictive Generative Networks %A William Lotter %A Gabriel Kreiman %A David Cox %X

The ability to predict future states of the environment is a central pillar of intelligence. At its core, effective prediction requires an internal model of the world and an understanding of the rules by which the world changes. Here, we explore the internal models developed by deep neural networks trained using a loss based on predicting future frames in synthetic video sequences, using an Encoder-Recurrent-Decoder framework (Fragkiadaki et al., 2015). We first show that this architecture can achieve excellent performance in visual sequence prediction tasks, including state-of-the-art performance in a standard “bouncing balls” dataset (Sutskever et al., 2009). We then train on clips of out-of-the-plane rotations of computer-generated faces, using both mean-squared error and a generative adversarial loss (Goodfellow et al., 2014), extending the latter to a recurrent, conditional setting. Despite being trained end-to-end to predict only pixel-level information, our Predictive Generative Networks learn a representation of the latent variables of the underlying generative process. Importantly, we find that this representation is naturally tolerant to object transformations, and generalizes well to new tasks, such as classification of static images. Similar models trained solely with a reconstruction loss fail to generalize as effectively. We argue that prediction can serve as a powerful unsupervised loss for learning rich internal representations of high-level object features.

%8 12/15/2015 %G eng %1

arXiv:1511.06380

%2

http://hdl.handle.net/1721.1/100275