%0 Journal Article %J Cognition %D 2021 %T Causal inference in environmental sound recognition %A James Traer %A Sam Norman-Haignere %A Josh H. McDermott %X

Sound is caused by physical events in the world. Do humans infer these causes when recognizing sound sources? We tested whether the recognition of common environmental sounds depends on the inference of a basic physical variable -- the source intensity (i.e. the power that produces a sound). A source's intensity can be inferred from the intensity it produces at the ear and its distance, which is normally conveyed by reverberation. Listeners could thus use intensity at the ear and reverberation to constrain recognition by inferring the underlying source intensity. Alternatively, listeners might separate these acoustic cues from their representation of a sound's identity in the interest of invariant recognition. We compared these two hypotheses by measuring recognition accuracy for sounds with typically low or high source intensity (e.g. pepper grinders vs. trucks) that were presented across a range of intensities at the ear or with reverberation cues to distance. The recognition of low-intensity sources (e.g. pepper grinders) was impaired by high presentation intensities or reverberation that conveyed distance, either of which imply high source intensity. Neither effect occurred for high-intensity sources. The results suggest that listeners implicitly use the intensity at the ear along with distance cues to infer a source's power and constrain its identity. The recognition of real-world sounds thus appears to depend upon the inference of their physical generative parameters, even generative parameters whose cues might otherwise be separated from the representation of a sound's identity.
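To make the physical relation the abstract appeals to concrete: under a free-field, inverse-square idealization (an assumption used here for illustration, not a claim of the paper), the source power P relates to intensity I at the ear at distance r by I = P / (4*pi*r^2), so P can be recovered from I and r. A minimal Python sketch with illustrative numbers only:

    import math

    def inferred_source_power(intensity_at_ear, distance_m):
        # Invert the free-field inverse-square law I = P / (4 * pi * r^2).
        return intensity_at_ear * 4.0 * math.pi * distance_m ** 2

    # The same intensity at the ear implies a ~100x more powerful source at 10 m than at 1 m,
    # which is why distance cues (e.g., reverberation) can constrain source identity.
    print(inferred_source_power(1e-6, 1.0))   # ~1.3e-5 W
    print(inferred_source_power(1e-6, 10.0))  # ~1.3e-3 W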

%B Cognition %8 03/2021 %G eng %R 10.1016/j.cognition.2021.104627 %0 Journal Article %J arXiv %D 2020 %T ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation %A Chuang Gan %A Jeremy Schwartz %A Seth Alter %A Martin Schrimpf %A James Traer %A Julian De Freitas %A Jonas Kubilius %A Abhishek Bhandwaldar %A Nick Haber %A Megumi Sano %A Kuno Kim %A Elias Wang %A Damian Mrowca %A Michael Lingelbach %A Aidan Curtis %A Kevin Feigleis %A Daniel Bear %A Dan Gutfreund %A David Cox %A James J. DiCarlo %A Josh H. McDermott %A Joshua B. Tenenbaum %A Daniel L K Yamins %X

We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation. With TDW, users can simulate high-fidelity sensory data and physical interactions between mobile agents and objects in a wide variety of rich 3D environments. TDW has several unique properties: 1) real-time, near-photorealistic image rendering; 2) a library of objects and environments with materials for high-quality rendering, and routines enabling user customization of the asset library; 3) generative procedures for efficiently building classes of new environments; 4) high-fidelity audio rendering; 5) believable and realistic physical interactions for a wide variety of material types, including cloths, liquids, and deformable objects; 6) a range of "avatar" types that serve as embodiments of AI agents, with the option for user avatar customization; and 7) support for human interactions with VR devices. TDW also provides a rich API enabling multiple agents to interact within a simulation and return a range of sensor and physics data representing the state of the world. We present initial experiments enabled by the platform around emerging research directions in computer vision, machine learning, and cognitive science, including multi-modal physical scene understanding, multi-agent interactions, models that "learn like a child", and attention studies in humans and neural networks. The simulation platform will be made publicly available.

%B arXiv %8 07/2020 %G eng %U https://arxiv.org/abs/2007.04954 %9 Preprint %0 Generic %D 2020 %T ThreeDWorld (TDW): A High-Fidelity, Multi-Modal Platform for Interactive Physical Simulation %A Jeremy Schwartz %A Seth Alter %A James J. DiCarlo %A Josh H. McDermott %A Joshua B. Tenenbaum %A Daniel L K Yamins %A Dan Gutfreund %A Chuang Gan %A James Traer %A Jonas Kubilius %A Martin Schrimpf %A Abhishek Bhandwaldar %A Julian De Freitas %A Damian Mrowca %A Michael Lingelbach %A Megumi Sano %A Daniel Bear %A Kuno Kim %A Nick Haber %A Chaofei Fan %X

TDW is a 3D virtual world simulation platform that utilizes state-of-the-art video game engine technology.

A TDW simulation consists of two components: a) the Build, a compiled executable running on the Unity3D Engine, which is responsible for image rendering, audio synthesis, and physics simulation; and b) the Controller, an external Python interface that communicates with the Build.

Researchers write Controllers that send commands to the Build, which executes those commands and returns a broad range of data types representing the state of the virtual world.
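As a rough illustration of this workflow (a sketch based on the public TDW Python API; exact command names and availability may differ between releases):

    from tdw.controller import Controller

    # Starting a Controller launches the Build and opens a socket connection to it.
    c = Controller()

    # Commands are JSON-serializable dictionaries; communicate() sends them to the Build,
    # which advances the simulation and returns output data describing the world state.
    resp = c.communicate([{"$type": "create_empty_environment"},
                          {"$type": "simulate_physics", "value": True}])

    c.communicate({"$type": "terminate"})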


TDW is being used on a daily basis in multiple labs, supporting research that sits at the nexus of neuroscience, cognitive science and artificial intelligence.

Find out more about ThreeDWorld on the project website using the link below.

%8 07/2020 %U http://www.threedworld.org/ %1

ThreeDWorld on GitHub - https://github.com/threedworld-mit/tdw

%0 Journal Article %J Proceedings of the 22nd International Conference on Digital Audio Effects (DAFx-19) %D 2019 %T A perceptually inspired generative model of rigid-body contact sounds %A James Traer %A Maddie Cusimano %A Josh H. McDermott %X

Contact between rigid-body objects produces a diversity of impact and friction sounds. These sounds can be synthesized with detailed simulations of the motion, vibration and sound radiation of the objects, but such synthesis is computationally expensive and prohibitively slow for many applications. Moreover, detailed physical simulations may not be necessary for perceptually compelling synthesis; humans infer ecologically relevant causes of sound, such as material categories, but not with arbitrary precision. We present a generative model of impact sounds which summarizes the effect of physical variables on acoustic features via statistical distributions fit to empirical measurements of object acoustics. Perceptual experiments show that sampling from these distributions allows efficient synthesis of realistic impact and scraping sounds that convey material, mass, and motion.
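The synthesis idea can be sketched as follows (illustrative code, not the paper's implementation; the distributions below are placeholders standing in for the empirically fitted ones):

    import numpy as np

    def synth_impact(sr=44100, dur=0.5, n_modes=12, seed=0):
        # Render an impact as a sum of exponentially decaying sinusoids ("modes")
        # whose frequencies, decay rates, and amplitudes are sampled from distributions.
        rng = np.random.default_rng(seed)
        t = np.arange(int(sr * dur)) / sr
        freqs = rng.uniform(200.0, 8000.0, n_modes)                 # mode frequencies (Hz)
        decays = rng.lognormal(mean=3.0, sigma=0.7, size=n_modes)   # decay rates (1/s)
        amps = rng.lognormal(mean=0.0, sigma=1.0, size=n_modes)     # mode amplitudes
        modes = amps[:, None] * np.exp(-decays[:, None] * t) * np.sin(2 * np.pi * freqs[:, None] * t)
        sound = modes.sum(axis=0)
        return sound / np.max(np.abs(sound))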

%B Proceedings of the 22nd International Conference on Digital Audio Effects (DAFx-19) %8 09/2019 %G eng %0 Conference Paper %B Cognitive Science %D 2019 %T Scrape, rub, and roll: causal inference in the perception of sustained contact sounds %A Maddie Cusimano %A James Traer %A Josh H. McDermott %B Cognitive Science %C Montreal, Québec, Canada %8 07/2019 %G eng %0 Generic %D 2018 %T Human inference of force from impact sounds: Perceptual evidence for inverse physics %A James Traer %A Josh H. McDermott %X

An impact sound is determined both by material properties of the objects involved (e.g., mass, density, shape, and rigidity) and by the force of the collision. Human listeners can typically estimate the force of an impact as well as the material that was struck. To investigate the underlying auditory mechanisms, we played listeners audio recordings of two boards being struck and measured their ability to identify the board struck with more force. Listeners significantly outperformed models based on simple acoustic features (e.g., signal power or spectral centroid). We repeated the experiment with synthetic sounds generated from simulated object resonant modes and simulated contact forces derived from a spring model. Listeners could not distinguish synthetic from real recordings and successfully estimated simulated impact force. When the synthetic modes were altered (e.g., to simulate a harder material), listeners altered their judgments of both material and impact force, consistent with the physical implications of the alteration. The results suggest that humans use resonant modes to infer object material, and use this knowledge to estimate the impact force, explaining away material contributions to the sound.

%B Annual Meeting of the Acoustical Society %V 143 %8 03/2018 %U https://asa.scitation.org/doi/abs/10.1121/1.5035721 %N 3 %) Published Online: April 2018 %R 10.1121/1.5035721 %0 Generic %D 2018 %T Human recognition of environmental sounds is not always robust to reverberation %A James Traer %A Josh H. McDermott %X

Reverberation is ubiquitous in natural environments, but its effect on the recognition of non-speech sounds is poorly documented. To evaluate human robustness to reverberation, we measured its effect on the recognizability of everyday sounds. Listeners identified a diverse set of recorded environmental sounds (footsteps, animal vocalizations, vehicles moving, hammering, etc.) in an open-set recognition task. For each participant, half of the sounds (randomly assigned) were presented in reverberation. We found the effect of reverberation to depend on the typical listening conditions for a sound. Sounds that are typically loud and heard in indoor environments, and which thus should often be accompanied by reverberation, were recognized robustly, with only a small impairment in reverberant conditions. In contrast, sounds that are either typically quiet or typically heard outdoors, for which reverberation should be less pronounced, showed a large recognition decrement in reverberation. These results demonstrate that humans can be remarkably robust to the distortion induced by reverberation, but that this robustness disappears when the reverberation is not consistent with the expected source properties. The results are consistent with the idea that listeners perceptually separate sound sources from reverberation, constrained by the likelihood of source-environment pairings.

%B Annual Meeting of the Acoustical Society %7 The Journal of the Acoustical Society of America %V 143 %U https://asa.scitation.org/doi/abs/10.1121/1.5035960 %N 3 %R 10.1121/1.5035960 %0 Generic %D 2017 %T Auditory Perception of Material and Force from Impact Sounds %A James Traer %A Josh H. McDermott %B Annual Meeting of Association for Research in Otolaryngology %0 Conference Paper %B The IEEE International Conference on Computer Vision (ICCV) %D 2017 %T Generative modeling of audible shapes for object perception %A Zhoutong Zhang %A Jiajun Wu %A Qiujia Li %A Zhengjia Huang %A James Traer %A Josh H. McDermott %A Joshua B. Tenenbaum %A William T. Freeman %X

Humans infer rich knowledge of objects from both auditory and visual cues. Building a machine of such competency, however, is very challenging, due to the great difficulty in capturing large-scale, clean data of objects with both their appearance and the sound they make. In this paper, we present a novel, open-source pipeline that generates audio-visual data, purely from 3D object shapes and their physical properties. Through comparison with audio recordings and human behavioral studies, we validate the accuracy of the sounds it generates. Using this generative model, we are able to construct a synthetic audio-visual dataset, namely Sound-20K, for object perception tasks. We demonstrate that auditory and visual information play complementary roles in object perception, and further, that the representation learned on synthetic audio-visual data can transfer to real-world scenarios.

%B The IEEE International Conference on Computer Vision (ICCV) %C Venice, Italy %8 10/2017 %G eng %U http://openaccess.thecvf.com/content_iccv_2017/html/Zhang_Generative_Modeling_of_ICCV_2017_paper.html %0 Generic %D 2017 %T Investigating audition with a generative model of impact sounds %A James Traer %A Josh H. McDermott %B Annual Meeting of Acoustical Society of America %0 Generic %D 2017 %T A library of real-world reverberation and a toolbox for its analysis and measurement %A James Traer %A Josh H. McDermott %B Annual Meeting of Acoustical Society of America %0 Generic %D 2016 %T Environmental statistics enable perceptual separation of sound and space %A James Traer %A Josh H. McDermott %X

The sound that reaches our ears from colliding objects (e.g., bouncing, scraping, rolling) is structured both by the physical characteristics of the sound source and by environmental reverberation. The inference of any single parameter (mass, size, material, motion, room size, distance) is ill-posed, yet humans can simultaneously identify properties of sound sources and environments from the resulting sound, via mechanisms that remain unclear. We investigate whether our ability to recognize sound sources and spaces reflects an ability to separately infer how physical factors affect sound, and whether any such separation is enabled by statistical regularities of real-world sounds and real-world reverberation. To first determine whether such statistical regularities exist, we measured impulse responses (IRs) of both solid objects and environmental spaces sampled from the distribution encountered by humans during daily life. Both the objects and the sampled spaces were diverse, but their IRs were tightly constrained, exhibiting exponential decay at frequency-dependent rates. Object IRs showed sharp spectral peaks due to strong resonances, and environmental IRs showed broad frequency variation: mid frequencies reverberated longest while higher and lower frequencies decayed more rapidly, presumably due to absorptive properties of materials and air. To test whether humans utilize these regularities to separate reverberation from sources, we manipulated environmental IR characteristics in simulated reverberant audio. Listeners could discriminate sound sources and environments from these signals, but we found that their abilities degraded when reverberation characteristics deviated from those of real-world environments. Subjectively, atypical IRs were mistaken for sound sources. The results suggest the brain separates sound into contributions from the source and the environment, constrained by a prior on natural reverberation. This separation process may contribute to robust recognition while providing information about spaces around us.

%B Speech and Audio in the Northeast %0 Journal Article %J Proceedings of the National Academy of Sciences %D 2016 %T Statistics of natural reverberation enable perceptual separation of sound and space %A James Traer %A Josh H. McDermott %K auditory scene analysis %K environmental acoustics %K natural scene statistics %K psychoacoustics %K Psychophysics %X

In everyday listening, sound reaches our ears directly from a source as well as indirectly via reflections known as reverberation. Reverberation profoundly distorts the sound from a source, yet humans can both identify sound sources and distinguish environments from the resulting sound, via mechanisms that remain unclear. The core computational challenge is that the acoustic signatures of the source and environment are combined in a single signal received by the ear. Here we ask whether our recognition of sound sources and spaces reflects an ability to separate their effects and whether any such separation is enabled by statistical regularities of real-world reverberation. To first determine whether such statistical regularities exist, we measured impulse responses (IRs) of 271 spaces sampled from the distribution encountered by humans during daily life. The sampled spaces were diverse, but their IRs were tightly constrained, exhibiting exponential decay at frequency-dependent rates: Mid frequencies reverberated longest whereas higher and lower frequencies decayed more rapidly, presumably due to absorptive properties of materials and air. To test whether humans leverage these regularities, we manipulated IR decay characteristics in simulated reverberant audio. Listeners could discriminate sound sources and environments from these signals, but their abilities degraded when reverberation characteristics deviated from those of real-world environments. Subjectively, atypical IRs were mistaken for sound sources. The results suggest the brain separates sound into contributions from the source and the environment, constrained by a prior on natural reverberation. This separation process may contribute to robust recognition while providing information about spaces around us.
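The reported regularity can be illustrated with a toy synthetic IR (a sketch with placeholder decay times, not the measured statistics): bandlimited noise whose exponential decay rate depends on frequency, with mid frequencies decaying slowest.

    import numpy as np

    def synthetic_ir(sr=16000, dur=2.0, bands=((125, 0.4), (500, 1.0), (2000, 1.2), (6000, 0.5))):
        # Each (center frequency in Hz, RT60 in s) pair contributes bandlimited Gaussian noise
        # multiplied by an exponential envelope that falls 60 dB over its RT60.
        n = int(sr * dur)
        t = np.arange(n) / sr
        freqs = np.fft.rfftfreq(n, 1.0 / sr)
        edges = [0.0] + [np.sqrt(f1 * f2) for (f1, _), (f2, _) in zip(bands, bands[1:])] + [sr / 2.0]
        ir = np.zeros(n)
        for (fc, rt60), lo, hi in zip(bands, edges[:-1], edges[1:]):
            spec = np.random.randn(len(freqs)) + 1j * np.random.randn(len(freqs))
            spec[(freqs < lo) | (freqs >= hi)] = 0.0      # keep only this frequency band
            ir += np.fft.irfft(spec, n) * np.exp(-6.9 * t / rt60)
        return ir / np.max(np.abs(ir))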

%B Proceedings of the National Academy of Sciences %V 113 %P E7856 - E7865 %8 09/2016 %G eng %U http://www.pnas.org/lookup/doi/10.1073/pnas.1612524113 %N 48 %! Proc Natl Acad Sci USA %R 10.1073/pnas.1612524113