Deep neural network models reveal interplay of peripheral coding and stimulus statistics in pitch perception [video]
December 13, 2021
Perception is thought to be shaped by the environments for which organisms are optimized. These influences are difficult to test in biological organisms but may be revealed by machine perceptual systems optimized under different conditions. CBMM researchers Mark Saddler and Ray Gonzalez, graduate students at MIT, and Prof. Josh McDermott of MIT investigated environmental and physiological influences on pitch perception, whose properties are commonly linked to peripheral neural coding limits. Hear from Mark and Josh about how they were able to accomplish this.
[MUSIC PLAYING] MARK SADDLER: I'm Mark Saddler. I'm a PhD student in Josh McDermott's lab in the Brain and Cognitive Sciences Department at MIT. Human pitch perception is perhaps the most studied aspect of human hearing.
JOSH MCDERMOTT: So pitch refers to the fact that sounds can be high or low. This is true in music, and it's true when we speak. So I can make a very low sound or a very high sound. And while I'm changing the pitch of my voice, I'm changing the fundamental frequency. And the perceptual correlate of that is what we refer to as pitch. And because it's important in music and speech, there's been long-standing interest in understanding how the brain estimates fundamental frequency from sound.
So in hearing science, right now we have really, really good models of the front end of the system. So people have invested an enormous amount of effort in trying to understand the ear and the auditory nerve. And there are very good computational models that can predict with pretty good accuracy what the auditory nerve will do in response to sound. By contrast, we don't have very good models of the rest of the auditory system.
MARK SADDLER: In this paper, we trained deep artificial neural networks to estimate the fundamental frequency of natural sounds heard through a human cochlea. So the input representation to our networks, rather than just being the sound waveform directly, is a simulated auditory nerve representation of sound.
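The pipeline Mark describes can be caricatured in a few lines of code. In this sketch, a toy filterbank with half-wave rectification and a lowpass cutoff stands in for the detailed auditory nerve simulation, and a classical summary-autocorrelation stage stands in for the trained deep network. The sample rate, channel spacing, and 3 kHz phase-locking cutoff below are all illustrative assumptions, not the study's actual parameters.

```python
import numpy as np

FS = 20000  # sample rate in Hz (illustrative choice)

def harmonic_tone(f0, dur=0.1, n_harmonics=10):
    """Harmonic complex tone with fundamental frequency f0 (Hz)."""
    t = np.arange(int(dur * FS)) / FS
    return sum(np.sin(2 * np.pi * h * f0 * t) for h in range(1, n_harmonics + 1))

def toy_nerve_response(sound, n_channels=30, cf_lo=100.0, cf_hi=4000.0):
    """Crude stand-in for an auditory nerve model: bandpass channels
    (FFT masking), half-wave rectification, and a lowpass cutoff that
    caps spike-timing precision (here ~3 kHz)."""
    spec = np.fft.rfft(sound)
    freqs = np.fft.rfftfreq(sound.size, d=1.0 / FS)
    channels = []
    for cf in np.geomspace(cf_lo, cf_hi, n_channels):
        band_spec = np.where((freqs > 0.85 * cf) & (freqs < 1.15 * cf), spec, 0)
        band = np.fft.irfft(band_spec, n=sound.size)
        rect = np.maximum(band, 0.0)        # half-wave rectification
        rect_spec = np.fft.rfft(rect)
        rect_spec[freqs > 3000.0] = 0       # phase-locking (temporal fidelity) limit
        channels.append(np.fft.irfft(rect_spec, n=sound.size))
    return np.stack(channels)               # shape: (n_channels, n_samples)

def estimate_f0(nerve, f_lo=80.0, f_hi=500.0):
    """Summary autocorrelation across channels; the peak lag gives the
    period. (The study replaces this stage with a trained deep network.)"""
    n = nerve.shape[1]
    sac = np.zeros(n)
    for ch in nerve:
        ch = ch - ch.mean()
        sac += np.correlate(ch, ch, mode="full")[n - 1:]
    lag_lo, lag_hi = int(FS / f_hi), int(FS / f_lo)
    best_lag = lag_lo + np.argmax(sac[lag_lo:lag_hi])
    return FS / best_lag

f0_hat = estimate_f0(toy_nerve_response(harmonic_tone(200.0)))
```

The point of the structure, which the sketch shares with the study, is that the estimator never sees the waveform itself, only the nerve-like representation, so the fidelity of that representation constrains what any downstream estimator can do.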
JOSH MCDERMOTT: There were two key questions we sought to answer. The first was, what kind of information has to be provided to a model from the cochlea in order to get behavior out that resembles that of humans? And the second question was, to what extent are the properties of human pitch perception a consequence of the nature of natural sounds? So there's a long history of trying to explain pitch perception mechanistically. But in this project, we instead asked, why does pitch perception have the properties that it does?
MARK SADDLER: The only constraints that we put on our model are the task of estimating the fundamental frequency of natural sounds, the kinds of sounds that are important to humans, such as speech and music, and the ears of our model, the hard-coded peripheral auditory model that we use as the network's input representation. And then we let the model learn whatever strategy it can to estimate the fundamental frequency of those sounds. Despite never being fit to human data in any way, when we tested our model on the same stimuli from all these human pitch psychophysical experiments, we found that the model does a remarkably good job replicating aspects of human behavior.
It really suggests that you can understand these aspects of human behavior as byproducts of a system that was optimized to estimate the fundamental frequency of natural sounds heard through a human cochlea. Now we can go back in and test which constraints of the model are actually necessary to achieve human-like behavior.
JOSH MCDERMOTT: In order to build a model that reproduces human pitch perception, you have to give it input from the cochlea that preserves the high temporal fidelity that we think exists in the auditory nerve. And that's pretty compelling evidence that human pitch perception actually makes use of the very precise spike timing coming out of the auditory nerve.
MARK SADDLER: It really suggests that the characteristics of human pitch perception are driven by the temporal resolution of the cochlea.
JOSH MCDERMOTT: You also have to optimize it on natural sounds. So if you instead train it on various kinds of unnatural sounds, you get out a system that can estimate fundamental frequency, but that does so in ways that deviate from human pitch perception. And so that suggests that pitch perception is really fundamentally shaped by the demands of estimating fundamental frequency from natural sounds in natural environments.
MARK SADDLER: What if the model was instead optimized for some idealized world where there were never competing sources, where you only ever heard one sound at a time? How might that have shaped the pitch strategy that humans use? We can test this in our model by training only on sounds with no background noise. And when we did this, we found that the model exhibits very un-human-like behavior. In particular, it fails to replicate key aspects of human pitch perception.
I think one of the really interesting contributions of this work is that we're combining detailed models of what is known about the cochlea with data-driven machine learning techniques. By combining these, we can build models that are able to perform real-world auditory tasks. And because we know what's happening in that input representation, we can go in and see how changes in the peripheral auditory system, in our model's cochlea, drive changes in behavior.