Successes and Failures of Neural Network Models of Hearing

Date Posted: November 2, 2020
Date Recorded: October 20, 2020
CBMM Speaker(s): Josh McDermott
Description:
Humans derive an enormous amount of information about the world from sound. This talk will describe our recent efforts to leverage contemporary neural networks to build models of these abilities and their instantiation in the brain. Such models have enabled a qualitative step forward in our ability to account for real-world auditory behavior and illuminate function within auditory cortex. But they also exhibit substantial discrepancies with human perceptual systems that we are currently trying to understand and eliminate. 
                    
HECTOR PENAGOS: It is my pleasure to have Josh here. And as many of you know, he's taken a really cool approach to studying human hearing from a computational perspective, building on neural network models to try to infer mechanisms that can give rise to different hearing phenomena. Today, he's going to tell us about successes with this approach and some challenges that he continues to work on in his lab.
Let's try to keep this interactive. This is a Zoom meeting format. So you can unmute yourself and ask a question as Josh is presenting. There will also be a Q&A session at the end so that we allow Josh time to finish what he's prepared to present, and we don't get stuck at any given point. But feel free to ask questions. And, Josh, by all means, you tell us when to stop and when to continue. So I'm going to turn it over to Josh now. Thanks.
JOSH MCDERMOTT: All right, thanks a lot, Hector. Thanks, everybody, for coming. So I'm going to tell you about some of our recent progress and roadblocks in using neural networks to build models of hearing. So the problem that we're interested in is deriving information from sound. And so I usually start by playing people some sounds, so just listen to this audio signal.
[AUDIO PLAYBACK]
- Much nicer because it doesn't know when [INAUDIBLE]
- Yeah, it's great.
- It's-- everyone [INAUDIBLE] had a place to go [INAUDIBLE] And so many people-- I used to hang out with a friend of mine--
- Are you using the app that determines what [INAUDIBLE]
[END PLAYBACK]
JOSH MCDERMOTT: OK, so that's just some everyday audio, just something I recorded on my phone. And the point of playing you that is that, just by listening, you're able to tell that that was a recording that was made in a cafe. You could hear the voices of people talking, tell that a couple of them are women. You can hear a man. You could distinguish their accents. You can hear the music in the background, the dishes clattering, all that stuff, right?
And so what was happening is that there was a waveform that made its way to your ears. It caused a particular pattern of pressure displacements inside your eardrum like that shown here. And just from that pattern of the way that your eardrum was wiggling back and forth, you were able to tell all those things that were going on in the world. So it's a pretty remarkable computational feat.
But on the other hand, human hearing is quite fragile. So it's probably, like, the most common sensory deficit, which is that as people-- typically, as they age, also with noise exposure, you lose your hearing. So this is a graph that kind of shows the average audiogram as a function of age group. And you can see that if you're in your 20s and 30s, like, things are pretty good.
But then after that, on average, your hearing steadily degrades. And so this is a plot of your detection threshold as a function of frequency. And so especially at high frequencies, hearing loss becomes extremely common by the time you're in your 70s, 80s, or 90s.
We have treatments for hearing loss, which is hearing aids. They help people in quiet, but less so in noisy environments. So typically, if people have to go into a restaurant or a bar, they'll have a hard time hearing. And even those of us like myself who are middle-aged will have a harder time than we did when we were in college.
And our ability to develop treatments for hearing impairment is really limited by our understanding of how we hear. Like, we don't really understand how our brain takes the input it gets from the ear and does the interesting things that we do with sound. And so it's a little hard to know how to fix it.
So our research group is called the Lab for Computational Audition. And really, our number one kind of long-term goal is to build good predictive models of human hearing. So we'd like to end up with a computer program that will take audio as input and then make good predictions about what a person is going to hear when they get that audio into their ears. And we think if we were successful in that goal, it would transform our ability to make people hear better.
So from where I sit, the peripheral auditory system, by which I mean the ear, which includes the cochlea, is fairly well characterized. So sound comes in through the outer ear, gets funneled through the ear canal. It causes your eardrum to vibrate back and forth. Those vibrations are transmitted through these three tiny bones to this thing that looks like a snail. That's the cochlea. It's the sensory organ that does the transduction in hearing.
And we've got pretty standard and widely accepted models of that part of the auditory system. So typically, there's a sound signal. And then that gets passed through a bank of bandpass filters, because one of the signature properties of cochlear transduction is that it's frequency tuned. There's typically nonlinear operations. And that culminates in a representation we'll often refer to as a cochleagram.
And we commonly will look at that representation as an image where you have frequency on the y-axis and time on the x-axis. And then the gray level here represents the energy at that point in time, or the firing rate that would be coming out of that particular channel in your auditory nerve. So this is a cochleagram that we made for a recording of somebody drumming.
[DRUMMING]
And we kind of think of this picture as, like, what your ear is sending to your brain. So as I said, the models of the peripheral auditory system are pretty widely accepted. There's lots of different variants of them. And you work with different variants depending on your purposes. But that's a pretty well-understood part of the system. So mostly what we worry about is what happens downstream.
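The processing chain described above (a bank of bandpass filters, a nonlinearity, and a time-frequency image as output) can be sketched in a few lines of Python. This is only an illustration, not the lab's cochlear model: the Gaussian filter shapes, the logarithmic channel spacing, the power-law compression, and every name and default value (`toy_cochleagram`, `n_channels`, `compression`, `frame_len`) are assumptions chosen for brevity.

```python
import numpy as np

def toy_cochleagram(signal, sample_rate, n_channels=30, f_min=50.0,
                    f_max=8000.0, compression=0.3, frame_len=0.01):
    """Return (cochleagram, center_freqs); cochleagram is channels x frames."""
    n = len(signal)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    # Assumed: log-spaced center frequencies (real models often use ERB spacing).
    centers = np.geomspace(f_min, f_max, n_channels)
    hop = int(round(frame_len * sample_rate))
    n_frames = n // hop
    coch = np.zeros((n_channels, n_frames))
    for i, fc in enumerate(centers):
        # Bandpass filter: Gaussian gain centered on fc, wider at high frequencies.
        gain = np.exp(-0.5 * ((freqs - fc) / (0.1 * fc)) ** 2)
        subband = np.fft.irfft(spectrum * gain, n)
        # Nonlinearity: rectify, average within each frame, then compress.
        env = np.abs(subband[:n_frames * hop]).reshape(n_frames, hop).mean(axis=1)
        coch[i] = env ** compression
    return coch, centers

# Usage: a 1 kHz tone should excite the channel tuned closest to 1 kHz.
sr = 16000
t = np.arange(sr // 2) / sr
coch, centers = toy_cochleagram(np.sin(2 * np.pi * 1000 * t), sr)
best = centers[np.argmax(coch.mean(axis=1))]
```

Each row of `coch` plays the role of one frequency channel and each column one time frame, which is the image-like layout (frequency on the y-axis, time on the x-axis) described in the talk.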
And so over the past five years or so, my group has spent a lot of time asking whether we can obtain better models of the auditory system by training systems to perform tasks. That has been enabled in large part by the revolution that's happened in engineering that everybody here knows about, which is that we can now get pretty good performance on a variety of classification tasks with artificial neural networks. So these systems have repeated applications of pretty simple operations. And the parameters can be optimized to classify input signals pretty well.
So the approach that we generally take is to hardwire a model of the cochlea to be faithful to biology, on the grounds that we have a fair amount of knowledge of what that stage of the system is doing and how to model it. And then we generally learn all the subsequent stages with a neural network. We consider the result as a candidate model of the auditory system.
So everybody knows that similar approaches have been fruitful in the visual system, in particular in predicting responses in the visual cortex and predicting aspects of behavior. We thought that they would be particularly useful in the auditory system where, by contrast, the cortex was not very well understood. And we didn't really have any previous good models of behavior.
So there are lots of widely discussed limitations to this approach that everybody here, I think, has heard about, so I won't dwell on them. But the fact is that for now, if you really want to approach aspects of human behavior and brain responses, it's pretty hard to avoid dealing with neural networks.
And so it's important to understand and appreciate and think about the limitations. And we certainly spend a lot of time doing that. And then we will spend the second half of the talk talking about some of the limitations that we have hit upon. But it's been an exciting approach that we've tried to mine.
So the plan for what I was going to talk about today is to kind of have two parts of the talk-- I mean, in the first part, I was going to give you a summary of some of the recent successes of our neural network models of hearing, mostly in terms of their ability to account for a pretty broad range of human behaviors. And then in the second part, I was going to talk about some of the shortcomings that we have come across and some of the exploratory work we've been doing in that domain to try to understand those shortcomings and figure out how to fix them.
So just in case I don't get through everything, these are the take-home messages from the first part. Overall, after you train neural networks on natural auditory tasks using natural sounds, across the board, we find pretty good matches to human behavioral experiments. And this is now evident in a bunch of different domains: the recognition of speech in noise, sound localization, pitch perception. We've also done some experiments on music recognition I will talk about.
And one of the things that we've done with this is we've manipulated the training conditions. And doing this shows that the similarity that you observe between the model and human behavior is really a function of optimization for natural tasks and natural sounds with a biological cochlea. And so if you alter those tasks or sounds or cochlea, you tend to no longer see the behavioral similarity to the same extent. And this can be useful in that it provides insight into the origins of human behavioral traits. So that's kind of one application of that.
The other thing I'll show you is what happens when you degrade the simulated cochlear input in order to simulate the effects of hearing impairment on the ear. If you degrade the input to the neural network, it tends to reproduce the characteristics of human hearing impairment, OK? And so we have really the first models of human hearing impairment that can actually perform behaviors. So we think those will be useful in a bunch of different ways. So this is what I'm going to do in the first part of the talk.
So the question that we initially started to ask was whether trained neural networks can replicate human behavior. We were naturally drawn to speech recognition because it's an important human behavior. And it's one where there were lots of labeled corpora.
So we've trained a lot of networks to recognize speech in background noise. We'll take excerpts of speech, superimpose them on different kinds of background noise. And the task that we have mostly used is to report the word that occurs in the middle of the clip.
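Superimposing speech on background noise at a chosen signal-to-noise ratio can be sketched as follows. This is a minimal illustration rather than the training pipeline used in this work: the function name `mix_at_snr`, the use of mean power to define SNR, and the stand-in signals are all assumptions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then add."""
    noise = noise[:len(speech)]  # assumes the noise is at least as long as the speech
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR in dB is 10 * log10(speech_power / noise_power); solve for the noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise, scaled_noise

# Usage with stand-in signals (a tone for "speech", Gaussian noise for the background).
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(20000)
mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=-3.0)
measured_snr = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled_noise ** 2))
```

Sweeping `snr_db` over a range of values and swapping in different noise types would give the kind of condition grid (background type by SNR) that the experiments described next rely on.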
And there are a lot of reasons why we chose that task. One is that it's a task that you can ask a person to do. And this is what one of the stimuli is like-- so there's some speech, and you have to say the word that occurs in the middle of the clip. In this example, the person will say gross domestic product grew. And the answer will be domestic.
[AUDIO PLAYBACK]
- --gross domestic product grew one--
[END PLAYBACK]
JOSH MCDERMOTT: All right, and in different variants of this, there have been different numbers of words, but there's always a lot. You know, in this case, it was 600. In more recent versions, it was more like 800 or 1,000.
The methodology that we used to build these models is pretty standard. So the weights are learned with backpropagation. We typically do some kind of optimization over the architectural hyperparameters.
I mean, one of the big things that we impose on these models which is not entirely uncontroversial is convolution in both time and frequency. And so nobody really has too much difficulty with convolution in time. But convolution in frequency is not obviously a very natural choice for sounds. And it turns out that that is actually pretty useful to do if you're trying to build models. And in fact, I'm going to come back to that towards the end of the talk.
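To make concrete what convolution in both time and frequency means, here is a small sketch that slides a single shared kernel over both axes of a cochleagram-like array. It is not the architecture used in this work; `conv2d_valid`, the kernel size, and the toy input are assumptions. The example shows the property that makes the choice debatable for sound: the same pattern produces the same response wherever it sits on the frequency axis.

```python
import numpy as np

def conv2d_valid(cochleagram, kernel):
    """Slide one kernel over the frequency (rows) and time (columns) axes."""
    kf, kt = kernel.shape
    n_f, n_t = cochleagram.shape
    out = np.zeros((n_f - kf + 1, n_t - kt + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same weights are applied at every frequency and time position.
            out[i, j] = np.sum(cochleagram[i:i + kf, j:j + kt] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))

# A small patch of energy, and the same patch moved up two frequency channels.
x = np.zeros((10, 12))
x[2:4, 3:6] = 1.0
x_shifted = np.roll(x, 2, axis=0)

out = conv2d_valid(x, kernel)
out_shifted = conv2d_valid(x_shifted, kernel)
# Shifting the input in frequency shifts the output by the same amount.
same_up_to_shift = np.allclose(out_shifted, np.roll(out, 2, axis=0))
```

Convolution in time alone would share weights only along the columns; sharing them along the rows as well builds in the assumption that features are reusable across frequency.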
And so our initial work in this domain was led by Alex Kell and Dan Yamins. And these behavioral experiments were run by Erica Shook, who was an undergrad in the lab at the time through the CBMM MSRP program. So I'm going to show you what happens with humans when they have to recognize speech in background noise. There'll be a bunch of different conditions, which are different types of background noise and different signal-to-noise levels.
And here's what the graph is going to look like. So this is going to plot proportion correct as a function of signal-to-noise ratio. So as you go from left to right, the speech is essentially getting louder relative to the noise. And you'd expect that it should get easier to recognize. And indeed, that's true, but there's different lines here for different types of background noise. And you can see that it's a lot easier to recognize speech when it's superimposed on music than if it's superimposed on babble, which is kind of like crowd noise. All right, so this is just what humans do, OK?
And so we then ran the model on the exact same experiment. And the results of that are shown here. And there's really two points to take away from this. One is that the model is overall matching human performance in this domain. In fact, it's doing a little bit better. But the other thing is that the relative performance of the different conditions is pretty similar. So the model also does a lot better with music than it does with speech babble, OK?
And the key point to make here-- and this is true for everything that I will show you-- is that there's no fitting of the model to human behavior, right? What happens here is you train the model to perform a task. In this case, the model was trained to recognize speech in noise. And then it is tested. And this is just what it does, OK? And the results look fairly similar to humans.
And in fact, if you plot the proportion correct of humans on the task versus the model proportion correct on the task for all the different conditions-- each dot's a different condition here-- they're pretty strongly correlated. OK, so this was one of the first examples that we found in this domain that was published a couple of years ago.
We've since moved on to other auditory behaviors. One of the other really cool things that we do with sounds is localize them. So localization with sound is pretty interesting because, unlike in vision, a sound's location is not made explicit on the sensory receptor epithelium. So you get this map of frequency in your cochlea. And if a sound comes from a different location, that's not really laid out on the cochlea in a straightforward way. But it's something that has been studied scientifically over a very long time. It's in all the textbooks.
And the classical story is that there are three main types of cues to a sound's location. So if the sound is coming from the right, in general, it will arrive first at the right ear compared to the left ear. And that produces a time difference between the left and the right ear, here shown in red and blue. There will also be level differences, and that's because your head casts an acoustic shadow, such that the intensity of the sound at the right ear will generally be higher than the sound at the left ear.
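The two binaural cues just described can be estimated from a pair of ear signals roughly as follows. This is a textbook-style sketch, not a description of how the neural network models in this talk work; `estimate_itd_ild`, the sign conventions, and the synthetic test signal are assumptions.

```python
import numpy as np

def estimate_itd_ild(left, right, sample_rate):
    """Return (itd_seconds, ild_db).

    Positive ITD means the left ear lags the right ear (sound reached the right first).
    Positive ILD means the right-ear signal is more intense than the left-ear signal.
    """
    # Cross-correlate to find the lag at which the two ear signals best align.
    xcorr = np.correlate(left, right, mode="full")
    lags = np.arange(-(len(right) - 1), len(left))
    itd = lags[np.argmax(xcorr)] / sample_rate
    # Level difference as a ratio of mean power, in decibels.
    ild = 10 * np.log10(np.mean(right ** 2) / np.mean(left ** 2))
    return itd, ild

# Synthetic source on the right: the left ear gets a delayed, attenuated copy.
sample_rate = 44100
delay_samples = 20
rng = np.random.default_rng(0)
burst = rng.standard_normal(1000)
right = np.concatenate([burst, np.zeros(delay_samples)])
left = 0.5 * np.concatenate([np.zeros(delay_samples), burst])
itd, ild = estimate_itd_ild(left, right, sample_rate)
```

In this toy case the delay and attenuation are known by construction, so the estimates can be checked directly; with real recordings, noise and reflections make both cues much less reliable, which is the difficulty raised later in the talk.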
And then your ears have this funny shape. And the thing on the outside of your ear is called the pinna. And the funny shape of the ear filters sound in particular ways. And that filtering is different depending on where the sound comes from. So if it comes from above, you'll get a particular kind of transfer function. If it comes from below, you get a different kind of transfer function, OK? And so we believe that people have learned to recognize that filtering and use that in order to localize sounds in the vertical range, OK?
All right, so that's the textbook story. But in real-world environments, there's noise. There's reflections. Reflections pose a particularly interesting problem because if you think about it, if there's a sound coming from a particular place, if the sound that comes from that source reflects off a surface in the environment, that reflection will arrive from the wrong direction, all right? So reflections actually really provide erroneous cues. So in general, localization in real-world conditions is a hard problem. And we've never really had models that can actually localize sounds, OK?
So Andrew Francl, who is a grad student in the lab, he has been trying to build models of sound localization using neural networks. And to get around the necessity of needing lots of labeled data, he's trained the model in a virtual environment. So this is a schematic of how that works.
So you have natural sound sources-- that's shown in red-- noise sources which are shown in black. And then those are rendered in an acoustic simulation of a room. So there's a virtual person with two ears, who's positioned at a particular location. And the source is at a particular location. And then we simulate what the room does to the sound. So you simulate all the reflections and stuff.
And what that gives you are two audio signals that should replicate what the audio would be in the ears of a person that was in that room listening to those sounds. So you get left-ear and right-ear signals. You pass those through a model of the cochlea. And then that's the input to a neural network that has to report the azimuth-- that's the location in the horizontal plane-- and the elevation of the sound, OK? So Andrew set this whole thing up, and he trained it, and the model trains.
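One common way to obtain left-ear and right-ear signals from a simulated room is to summarize everything the room and the head do to the sound as a pair of impulse responses and convolve the source with each. The sketch below assumes that approach; the talk does not say which simulator was used, and `render_binaural` and the toy impulse responses are invented for illustration.

```python
import numpy as np

def render_binaural(source, ir_left, ir_right):
    """Convolve one source with a left-ear and a right-ear impulse response."""
    left = np.convolve(source, ir_left)
    right = np.convolve(source, ir_right)
    return np.stack([left, right])  # shape: (2, len(source) + len(ir) - 1)

# Toy impulse responses (made up): a direct path plus one weaker, later reflection.
ir_left = np.zeros(64)
ir_left[[3, 40]] = [0.6, 0.2]
ir_right = np.zeros(64)
ir_right[[0, 35]] = [1.0, 0.3]

rng = np.random.default_rng(0)
source = rng.standard_normal(1000)
ears = render_binaural(source, ir_left, ir_right)
```

Here the direct sound reaches the right ear earlier and louder than the left, and each ear also receives a delayed reflection, so the rendered pair carries both the useful cues and the misleading ones.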
And then one of the kind of cool things is that even though it's trained in the virtual environment, it generalizes to the real world, by which we mean building 46. So this is a mannequin that's relaxing in a chair in our lab space. And it's a very special mannequin that has microphones inside its mannequin ears. And so you can make recordings of what sounds sound like coming from particular locations in this particular room, OK?
So Andrew created a test set from the mannequin recordings and then provided those as input to the model. And you can see that the model does pretty well. So the judgments here are largely along the diagonal, OK? All right, so that's kind of cool. We have a model of sound localization that actually works in real-world conditions.
But then what he did is he went and reproduced a lot of experiments from the literature on the model. So people have been studying sound localization for a really, really long time. And there's a very rich panoply of experimental results. And overall, the findings are that the model generally reproduces human results across a pretty wide range of experiments. And I'll just give you a couple of the highlights here.
So one of the classic things that you find in textbooks is that the use of time and level differences between the two ears is frequency dependent. And this is classically referred to as duplex theory. So the classical story is that humans rely on interaural time differences at low frequencies and interaural level differences or intensity differences at high frequencies.
And so this is an e-- one of many, many, many experiments that provided evidence for this. And so in this experiment, people were presented with noise that was either low pass-- so this is, like, a schematic spectrogram of the noise. So this one's got low frequencies here-- or high pass, all right?
And then the noises are rendered spatially. And then the experimenter secretly adds an additional level difference or time difference to the stimulus, OK? And then what you ask the participant to do is to localize the sound. And the question is whether the added time or level difference will change their perceived location, OK?
And so you get these graphs that plot the imposed bias for either the ITD or the ILD versus the response, OK? And so what you find here is that in humans, for high-frequency stimuli, the level differences that you add have a big effect on their localization, whereas the time differences don't. And the reverse is true for low-frequency stimuli. So the time difference really has a big effect, and the level difference doesn't, OK?
And there is a classical story that that's because that's usually the places where these cues are useful. And the model largely reproduces that. So it is strongly affected by added level differences for high-frequency stimuli and strongly affected by time differences for low-frequency stimuli, OK?
Another kind of classical finding is that people really rely on their outer ears if they are localizing in the vertical plane, but not in the horizontal plane. So this is a really beautiful study from 1998 from the group of John Van Opstal, where they brought people in the lab, and they measured their ability to localize. And then they put these plastic ear molds in their ear. And the purpose of the plastic ear molds is to alter the way that the ear filters sound. So this was an attempt to test whether people are actually using the filtering in their ears.
And so what you can see in panel B is human localization before you put the molds in. And so these dashed lines are a grid of locations in space. And then the solid lines depict human localization. And the point is that the solid lines are kind of on top of the black lines, which means that people are accurate, right? People can accurately localize sounds.
Panel C is showing what happens when you put these plastic molds in people's ears. And what's cool is that people retain the ability to localize in the horizontal plane, but they completely lose the ability to localize in elevation, OK? And so this is an indication that these people have learned to use the particular filters that are in their ears.
And so Andrew was able to reproduce this experiment with the model because the model was trained on a particular set of ear filters, all right? But then he can take another set of ear filters and swap that in and test the model on that.
And so this is what happened. And you see more or less the same thing that you see with humans, which is that the model localizes accurately when you test it with the set of ears that it was trained on. But then when you swap in a different set of ears, it retains the ability to localize in azimuth-- so that's the horizontal plane-- but it loses the ability to localize in elevation.
So, one other final example that I'll leave you with is this thing called the precedence effect. So this is a well-known effect in sound localization, whereby the very first part of the sound tends to dominate your perception of the location. And the classical example of this was discovered by Hans Wallach, who was a great Gestalt psychologist who also did a lot of work on human motion perception and many other great things. But he's also known for the precedence effect.
And the setup here is that you have two speakers, like one to the left, and one to the right. And the speakers just play clicks. And one of them is leading-- that means that the click comes out of this speaker first, and the other one is lagging. So there's a slight time delay between the clicks.
All right, and the phenomenon here is that when the delay between the two clicks is short, so less than 10 milliseconds or so, in general, people will report hearing a single sound. And the location that they hear is that of the first click, all right? So that's why it's called the precedence effect. The click that precedes is the one that dominates your localization, OK? So if you ask them to tell you the location they perceive, they'll report 45 degrees, which in this case, is the location of the leading click. And then at some point, that breaks down.
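The two-speaker click stimulus can be sketched as a pair of signals, one per speaker, that differ only in onset time. The function name `lead_lag_clicks`, the single-sample click, and the default sample rate and duration are assumptions for illustration, not the parameters of the original experiment.

```python
import numpy as np

def lead_lag_clicks(delay_ms, sample_rate=48000, duration_ms=50.0):
    """Return (lead, lag): one click per speaker, the second delayed by delay_ms."""
    n = int(round(duration_ms * sample_rate / 1000))
    delay = int(round(delay_ms * sample_rate / 1000))
    lead = np.zeros(n)
    lag = np.zeros(n)
    lead[0] = 1.0      # click from the leading speaker
    lag[delay] = 1.0   # identical click from the lagging speaker, delayed
    return lead, lag

# A 5 ms delay falls in the short-delay range where a single sound is reported.
lead, lag = lead_lag_clicks(delay_ms=5.0)
onset_gap_samples = int(np.argmax(lag) - np.argmax(lead))
```

Sweeping `delay_ms` from a fraction of a millisecond up to a few tens of milliseconds gives the stimulus set over which the judged location is measured.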
OK, all right, now, this has been widely hypothesized to be something that is an adaptation for dealing with reflections, the idea being that when sound is reflecting off of surfaces in the environment, well, the sound that comes direct from the source is generally going to get there first because that's the shortest path. And so if you get delayed copies of the sound, those might be reflections. And so your brain might have learned to suppress them in some way. So it's a well-known effect in human sound localization.
And the model replicates that too. So this is the graph that shows that. So the judged location here is dominated by the leading click when the delays are short. And then the effect kind of goes away.
OK, so we've also got kind of analogous results in pitch perception. In this case, these are models that are trained to report the fundamental frequency of natural sounds superimposed on noise, where you can take a whole panoply of classical psychophysical experiments, replicate them on the model. And in general, it tends to qualitatively and, in many cases, quantitatively reproduce how humans hear.
All right, so from my perspective, this is a big advance over previous models, in that we're getting human-like behavior out of our auditory models for the first time. And this occurs in realistic conditions, in many cases, with comparable accuracy. They exhibit similar psychophysics, which suggests similar use of cues.
And one of the interesting things that we've done with this is to use this phenomenon, the fact that we get these behavioral matches between the model and humans, to investigate the conditions that give rise to human-like behavior. So in particular, we've asked whether this similarity that we often observe depends on the statistics of the environment or on the properties of the ear.
And the way that you do this is you can train the model in alternative conditions, right? So for instance, with these pitch perception phenomena, we instead trained our models on unnatural synthetic tones instead of-- in this case, it was clips of speech and music. And again, you're not going to understand the details of these experiments, and it doesn't really matter. [INAUDIBLE] It's just a bunch of effects which, if you were into pitch perception, you would understand and love.
But the key point here is to just notice that when I flip between these two slides-- so this one is the result of the model that was trained on natural sounds. And this is the model that's trained on the synthetic tones. You can get really different results. So the model is solving the same problem in this case. But it seems to be doing it in a very different way that's really unlike the way that humans solve it, presumably because humans are optimized for natural sounds in some sense. And so these kinds of models give us a way to actually evaluate that.
And we've got somewhat similar kinds of results for sound localization. So what Andrew did is he took his virtual training environment and altered it in various ways. And he did three different things. One is that he got rid of reverberations, so he removed echoes. So that's anechoic training. So that's like if you lived in a world where every surface would completely absorb the sound that impinged on it. He also removed background noise. So you can keep all of the reverberation, get rid of the background noise, or you can make the sounds unnatural. So bandpass noise in this particular case.
And so these are models that are trained up in these alternative worlds. And you think of this as, like, simulating evolution and development in some alternative universe. You can then bring the model into the lab and run it on this same set of experiments. And so this is a summary graph that shows the human-model dissimilarity across a big set of experiments, OK? So lower means you have a better fit to the human data.
And the light blue bars is training in normal conditions. And then the other three bars are these three alternative virtual worlds. And you can see that in each of these three cases, you get a worse fit to the human data. And in many cases, these divergences are interpretable.
And so I'm just going to show you the one that is our favorite. So this is the results of that precedence effect. So again, this is where localization is really dominated by the leading sound. And the blue curve is the one that I showed you before. And then the other three curves are these three alternative environments. And so you can see that two of them largely reproduce the effect. And one of those is the thing that has no noise. And the other one is the thing with unnatural sounds.
And then the one that looks really, really different is the anechoic training. So if you train a system in conditions that do not have reflections, you actually lose the precedence effect, which provides pretty nice evidence that this particular perceptual effect is, in fact, some kind of adaptation to deal with the presence of reflections when you have to localize sounds.
All right, I'm going to just pause for one minute and ask if anybody has any questions.
PRESENTER: It's from [INAUDIBLE] He's asking, is it possible that the sound volume confuses the model when detecting the sound source distance?
JOSH MCDERMOTT: Not sure under what conditions you are referring to that. I mean, in the virtual training environment, of course the-- I mean, the sound volume is sort of appropriate for the distance at some level. I mean, all that stuff is kind of rendered correctly insofar as the simulation is correct. And so the model should be learning some of that. I mean, like, in this version of this, we didn't use a big range of distances. But in principle, that's just something that-- that's another thing that it should learn to use.
And in the-- I mean, in the psychophysical experiments, you don't really see any signs that that's causing problems. And one of the things, I guess, that is-- that's kind of interesting about these results is that you're getting human-like behavior with some pretty weird psychophysical stimuli that were dreamed up by an experimentalist at some point because they thought they'd test some kind of hypothesis, you know?
So the model is trained on natural sounds, but it does exhibit generalizations to certain kinds of sounds that are not obviously in the training distribution. So that's kind of mostly what I've got to say about that. OK.
AUDIENCE: I have a question about the cochleagram. So is it-- have you tried to train this model without using the cochleagram? And were the results worse?
JOSH MCDERMOTT: Yeah, it's a good question. We have done-- we've done lots of variants of that where-- you know, we've certainly altered our cochlear model in lots of ways. So for instance, if you-- in these models that are trained to estimate fundamental frequency that we use to account for pitch perception, if you degrade the timing information in the cochlear model, you tend to get abnormal, like inhuman results, from that, which is some evidence that human pitch perception really depends on the fine timing information that's coming out of the cochlea.
So there's lots of things that are kind of like that. But I think you might have been asking, well, have you just gotten rid of the cochlear model entirely and tried to learn from the waveform? Is that what you're asking about?
AUDIENCE: Yes.
JOSH MCDERMOTT: Yeah, so we've done a little bit of that. And yeah, in general, things tend to be worse, although you can often do fine on the training set. But the generalization is often funny in various ways. You know, we haven't explored that in great detail, in part because, again, it's--
I mean, it's an interesting question, but from the standpoint of building models of the auditory system, we think we have a pretty good idea of what should go in for the ear, you know? And so it's not obvious that that's the greatest idea from the standpoint of building a model of the auditory system. But I agree. It's an interesting question.
AUDIENCE: Thank you.
JOSH MCDERMOTT: OK. All right, so I mentioned that hearing is fragile. And the most common complaints are that-- so we often measure hearing impairment by measuring the audiogram, right, and find that the thresholds are elevated. But the most common complaint of hearing-impaired listeners is actually difficulty hearing in noisy conditions. You know, you go to a restaurant with your grandchildren, and you can't hear what they're saying, or whatever.
And one of the frustrating things about that is that we don't really understand how the peripheral impairments that we're beginning to understand-- so the changes that happen in the ear when people lose their hearing-- we don't understand how those impairments give rise to behavioral impairments. And that's in part because we've never really had working models that can actually instantiate auditory behavior, all right? So we've been attempting to try to build those models in the hopes that we might be able to use them to develop better treatments for hearing impairment.
And there's a couple behavioral signatures that are measured in people with hearing impairment. And the first is just what I said, which is to say that speech recognition is usually pretty good when the signal-to-noise ratio is high. So that's what's shown here, where you don't have a lot of background noise. But there are big deficits in noisy conditions, OK?
So this is proportion correct versus signal-to-noise ratio. The solid symbols here are normal hearing listeners. The open symbols are people with hearing impairment. So you can see there's a bigger gap here at the lower SNRs, all right? So that's kind of one common finding.
And then another is that normal hearing listeners have a much easier time hearing in noise if the noise is modulated. So this is the temporal-- the time waveform of the noise. And so you can see these are amplitude fluctuations that are imposed on the noise, whereas this is just stationary noise that has a pretty flat envelope. And so again, this is the same kind of thing. This is percent correct in a speech recognition task as a function of signal-to-noise ratio.
But for the dashed line, the modulated noise, normal hearing listeners do a lot better for the equivalent signal-to-noise ratio compared to stationary noise. Whereas for hearing-impaired listeners, that advantage pretty much goes away, OK? So again, not really well understood why this happens, although there are various theoretical explanations.
So what we've done is we tried to simulate the loss of outer hair cells, which is one type of hearing loss. Again, the details here don't really matter. There's a handful of common traits that we associate with the loss of outer hair cells. They include broader frequency tuning, reduced response to quiet sounds, and a narrower dynamic range. And we have a way of instantiating that in our models of the cochlea, OK? And we then swap those into the neural network model.
So these are the results of the normal hearing model with a normal hearing cochlea on that speech recognition task. So this is proportion correct versus signal-to-noise ratio. The lines are different types of background noise. And then when you swap in the hearing-impaired cochlea, you can see that things get worse, but particularly in the very noisy conditions, right? So the model is almost as good at what we call clean speech. So that's without any background noise.
So similarly, we can reproduce that benefit that normal hearing listeners get for modulated noise-- so the human graph is shown here at the bottom, and the model graph is shown at the top-- if we have a normal cochlea there, right? So the dashed line is pretty far above the solid line. But if we swap in this model of impaired cochlear function, that advantage pretty much goes away, OK? So when we alter the cochlea to simulate hearing impairment, it qualitatively reproduces the signatures of hearing impairment in humans.
Now one other kind of interesting thing here is that the way that we did this is we trained the neural network on the normal cochlea. And we freeze the network, and then swap in a model of impaired hearing. And it's natural to wonder, well, what happens if instead the neural network is trained on the impaired cochlea?
And what's interesting is that, at least for this type of hearing impairment, when you do that-- we call that the plastic model of hearing impairment-- the deficits basically go away, OK? So you can see that the relationship between the impaired hearing model and the normal hearing model is right on the diagonal here. And the same is true for this benefit for modulated masking noise. And so when the neural network is allowed to adapt itself to the altered ear, it's actually able to get pretty normal behavior.
And so that is-- it's sort of intriguing to speculate that, well, maybe that is consistent with the idea that aspects of human hearing impairment are due to a lack of plasticity. So, you know, your brain is kind of fixed once you get older. And then your ear changes, and the system doesn't have the ability to change to optimize, to reoptimize itself for the new operating conditions. But maybe if there was some way to imbue it with sufficient plasticity, you could fix things a little bit. That's total speculation, but kind of interesting.
OK, so, and I think-- so we've also done a whole bunch of work to try to use these kinds of models to predict brain responses. I've talked about this lots of times. So I'm going to just skip over that in the interest of time. But they do better than previous models.
All right, so those are the take-home messages from the first part, which is to say that, in general, when you take neural networks, and you train them on natural auditory tasks with natural sounds, you get pretty good matches to human behavior. And this is in all of the domains that we've looked at, so speech recognition in noise, sound localization, pitch perception, music recognition.
You can also manipulate the training conditions. And that seems to suggest that the similarity that you observe between these models and humans is a function of optimization for natural sounds and tasks and the cochlea, right? So when you make the optimization conditions unnatural, you tend to deviate in various ways. And that can give you insight into the origins of human behavioral traits.
And then finally, I showed you how degrading the simulated cochlear input to the neural network can reproduce some of the characteristics of human hearing impairment, which we think is an interesting new direction.
All right, so what I want to turn to now is a discussion of some of the model shortcomings. And these are the take-home messages from the second part, in case I don't get through all of them. So I'm going to tell you about a method called metamers. We're going to generate metamers of neural networks and argue that these provide a way to reveal model invariances.
One of the key findings from doing this is that metamers of the deep layers of standard neural network models are not metameric for humans. They're not even recognizable to humans, which seems like a pretty huge discrepancy. And this is true for both vision and auditory networks.
We have found that model metamers can be made more human-recognizable with some architectural modifications, in this case, by reducing aliasing, and by making models more robust to adversarial examples, for reasons that we don't fully understand. But neither of these things is sufficient, and the divergences kind of still remain. And that is a challenge for now.
OK, so let's talk about metamers. So metamers are a really old idea in perceptual science. I teach them every year in my undergraduate perception class. They're defined as physically distinct stimuli that are indistinguishable to the observer. The classic example comes from color vision.
So these are two spectra of visible light. So wavelength is on the x-axis, and power's on the y-axis. So the one on the left is the spectrum from a tungsten bulb. The one on the right is a metameric match from a color monitor. So you can see that physically, those two spectra are completely different. But to a human with normal color vision, they'll look the same and, in fact, will be indistinguishable.
And the reason for this is very well understood. It's because you have three types of cones. And so you take that high-dimensional spectrum and project that onto your three photopigments. And that projection down onto the subspace inevitably is going to map many different things onto the same point. And in fact, metamers were used long before we had the ability to go poking around in the eye to infer the trichromatic nature of human [INAUDIBLE] So it's a kind of a classical story in perception research.
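The projection argument can be made concrete with a toy calculation. The sketch below is illustrative only: the three "cone" sensitivity rows and the four wavelength bins are made-up numbers (not measured cone fundamentals), and the names `CONES` and `cone_responses` are hypothetical. It shows two physically different spectra that project to identical responses because their difference is invisible to all three rows.

```python
# Hypothetical cone sensitivities over four wavelength bins (illustrative
# numbers only, not measured cone fundamentals).
CONES = {
    "S": [1.0, 1.0, 0.0, 0.0],
    "M": [0.0, 1.0, 1.0, 0.0],
    "L": [0.0, 0.0, 1.0, 1.0],
}


def cone_responses(spectrum):
    """Project a spectrum (power per wavelength bin) onto the three cone types."""
    return {
        name: sum(s * p for s, p in zip(sensitivity, spectrum))
        for name, sensitivity in CONES.items()
    }


# Two physically different spectra. Their difference, [1, -1, 1, -1], is
# orthogonal to every row of CONES, so the projection cannot see it.
spectrum_a = [4.0, 1.0, 3.0, 2.0]
spectrum_b = [5.0, 0.0, 4.0, 1.0]

responses_a = cone_responses(spectrum_a)
responses_b = cone_responses(spectrum_b)
print(responses_a)
print(responses_b)
```

With more wavelength bins than cone types, many such invisible differences exist, which is why distinct spectra can be metameric.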
And the idea of metamers has been revived multiple times since then. It's also been a big part of the study of human texture perception, and then most recently, was important in the understanding of crowding, including by our own Ruth Rosenholtz as well as others, Eero Simoncelli most notably.
OK, so we got interested in the idea that metamers might be a useful way to try to understand these neural network models. And so the-- you know, we normally think of these-- the recognition tasks that these kind of models are solving as confronting the challenge of becoming invariant in the right ways to all of the different ways in which natural stimuli vary.
So you learn to recognize the word "dog," and the difficulty is that I can say the word "dog" in lots of different ways. And you say it in a way that sounds different. Everybody's voice is different. So to recognize the word "dog," you have to be invariant to all of those factors of variation, right?
And it's natural to suppose that when you train the network, what you're doing is imbuing it with invariances that allow it to perform its task. And you would expect that if the network ends up being a good model of human recognition, well, the invariances should be the same. And so we thought that metamers, stimuli generated to produce the same responses in the model, might reveal the learned transformations and could provide another test of whether the model captures human perception.
So this is work that was led in the lab by Jenelle Feather. And the idea is really simple. So you have a stimulus that gets passed through the model. And then you have some activations that are induced at some stage of the model. And the goal is to generate a stimulus that produces nearly the same activations, OK?
So we want to have a synthetic stimulus that, when passed through the model, will give you the same activations at some particular point within the model. And so that is pretty easy to do. You just do gradient descent on the input signal in order to minimize a loss function that you set up between the activations to the two signals.
All right, so when you do this, you end up with a synthetic signal for which the model's response in a particular layer is matched. Now, the kinds of models that we work with are feedforward and deterministic. And so if the responses are matched in one layer, they will be matched in all subsequent layers. And then the decision about the stimulus will be the same. So if the model thinks that this particular speech utterance contains a particular word in the middle, it will, by definition, think that this synthetic signal also contains that same word in the middle, OK?
And so this is just graphs that kind of quantify that. So if you match on an early layer, by definition, the late layer also has to be matched. And so these are the correlation coefficients between the activations of the original signal and the metamer.
But critically, the same thing is not true if you match on a late layer. So if you match on a late layer, in general, the responses at the early layer will not be matched. And that's because the network involves pooling and is building up these invariances. And so it's just the case that you can have multiple distinct stimuli that will produce the same responses deep in the network.
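A minimal sketch of this procedure, under heavy simplifying assumptions: the two-stage "model" here is a toy (an identity early stage followed by energy pooling over adjacent pairs), not the speech network from the talk, and the gradient is written out by hand rather than obtained from an automatic differentiation framework as one would do in practice. The function names and the reference signal are hypothetical. The point is only to show gradient descent on the input matching a late, pooled layer while leaving the early layer unmatched.

```python
def early_layer(x):
    # Stand-in for an early stage that keeps sample-by-sample detail.
    return list(x)


def late_layer(x):
    # Stand-in for a deep stage: pool energy over adjacent pairs of units.
    h = early_layer(x)
    return [0.5 * (h[i] ** 2 + h[i + 1] ** 2) for i in range(0, len(h), 2)]


def pearson(a, b):
    # Correlation coefficient between two activation vectors.
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5


def generate_metamer(reference, init, lr=0.1, steps=500):
    """Gradient descent on the input to match late-layer activations."""
    target = late_layer(reference)
    x = list(init)
    for _ in range(steps):
        pooled = late_layer(x)
        for i in range(len(x)):
            # Hand-derived gradient of sum_j (pooled_j - target_j)^2
            # with respect to x_i for this toy model.
            x[i] -= lr * 2.0 * (pooled[i // 2] - target[i // 2]) * x[i]
    return x


reference = [1.0, -0.5, 0.3, 0.8, -1.2, 0.4, 0.9, -0.7]
metamer = generate_metamer(reference, init=[0.5] * 8)

late_corr = pearson(late_layer(reference), late_layer(metamer))
early_corr = pearson(early_layer(reference), early_layer(metamer))
print("late-layer correlation:", late_corr)
print("early-layer correlation:", early_corr)
```

Because the pooling stage discards the sign and the split of energy within each pair, many inputs reach the same late-layer activations, so matching there does not constrain the early layer.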
OK, so that's the central idea, right? We're going to be generating metamers from different layers of these networks, in our case, from speech. And we're going to listen to them and see what they sound like. And so some examples of that are shown here. So these are metamers that were generated from a neural network that's trained to recognize speech. So we take this original speech signal. We measure its activations. And then we're going to synthesize signals matching the activations at each stage of the network from very early up to very late, OK?
And to visualize the signals, we are going to represent them as their cochleagram, right? So this is a spectrogram-like representation-- frequency on the x-axis, time on the-- I'm sorry, frequency on the y-axis, time on the x-axis, OK? And just from eyeballing it, you can kind of see that as you move through the network, these particular examples of metamers start to look less and less like the original signal. And so that's consistent with the idea that the network is building up invariances.
And if you know anything about audio, you will probably intuit that these signals don't look a whole lot like speech. And indeed, they don't sound much like speech. And that's kind of the key finding. So I'm just going to play you some examples here.
[AUDIO PLAYBACK]
- The job security program that prevents la--
[END PLAYBACK]
JOSH MCDERMOTT: All right, so that's from a very early layer that essentially sounds like the original. And you can keep going.
[AUDIO PLAYBACK]
- The job security program that prevents like la-- the job security program that prevents la--
[END PLAYBACK]
JOSH MCDERMOTT: All right, starts to sound more different.
[AUDIO PLAYBACK]
- The job security program that prevents la-- the job security program that prevents la-- the job security program that prevents la--
[END PLAYBACK]
JOSH MCDERMOTT: And by the end, it's very hard to hear anything.
[AUDIO PLAYBACK]
- The job security program that prevents la--
[END PLAYBACK]
JOSH MCDERMOTT: OK, so I should just emphasize these are-- each of these is a signal, an audio signal, that produced nearly identical activations to the clean speech signal at the corresponding layer of the network, OK? And so the consequence of that is that they are fully recognizable to the network by design. But what we find is that they become progressively unintelligible to humans.
So the way that we evaluate this is actually with a recognition task. So this is actually not a test of metamers, and it's more conservative than that. We're just asking whether the human can recognize the speech utterance, or the word in the middle that was used to generate it, OK?
And these are the results. We verified that the network can recognize its own metamers. It's supposed to. And indeed, it does, right? So that's the gray line. It's at ceiling. So the graph here is plotting the proportion correct versus the layer of the network from which the metamer was generated. And so the green line here is human performance. And so we can see that from the early layers, human listeners can recognize the speech. But then it gets harder and harder. And then by the late layers, they really can't hear anything.
So this result is not specific to audio. You get qualitatively similar results if you do the same kind of exercise for vision networks. These are three common visual neural network architectures. And these are metamers generated from this particular image from successive layers. And by the time you get to the deep layers, you can't recognize them.
And so I think this is-- this is not new. And indeed, the synthesis method here has been around for a long time. There's an early neural network visualization paper that did essentially this. But I think the significance of this for the relevance of these models to perception has really not been very widely appreciated. And it certainly hadn't been measured. So Jenelle went and did the recognition experiment on these vision models and found more or less the same thing. So by the time you get to the deep layers, you generate metamers for those. And those are not recognizable.
OK, so it's kind of an interesting contrast to what I was telling you about in the first part of the talk, where I gave you all these examples where we find similar behavior between the models and humans. But those are all with natural sounds, or at least not with incredibly unnatural sounds, OK? So here, we're generating these signals using the model, and we get very divergent behavior. So it seems like a pretty substantial inconsistency with biological perceptual systems.
So we've since been using this to try to evaluate models, and also trying to understand what is responsible for this kind of curious behavior. One kind of interesting finding actually relates to biologically inspired constraints, or what some people might argue are biologically inspired. So I mentioned at the start of the talk that typically, when we are building these models, we always impose convolutions in both dimensions of the input, time and frequency.
And we've found empirically that that tends to make the models easier to train, tends to make performance better. And what Jenelle did with Christina Trexler, who is an undergrad who was in the lab last summer with the MSRP program, is to actually evaluate the role of this particular architectural constraint in the metamer phenomenon.
So what you're seeing here in the top row are metamers that are generated from the kind of standard models that we normally use that have convolution in both time and frequency, OK? And so that kind of looks more or less like the thing I just showed you. And the thing at the bottom is a model that only has 1D convolutions, so in time only, and it's fully connected in frequency.
And particularly, so I was drawing attention to the fact that the deep layers of our standard models generate metamers that are not recognizable to humans, and that's true. But the early layers generally give you things that are pretty normal. But you can see here that with 1D convolutions, even at these very early layers, the metamers are pretty crazy. So I'll just play you a couple to give you a flavor.
[AUDIO PLAYBACK]
- The job security program that prevents la--
[END PLAYBACK]
JOSH MCDERMOTT: That's very intelligible compared to this.
[AUDIO PLAYBACK]
- The job security program that prevents la--
[END PLAYBACK]
JOSH MCDERMOTT: OK? All right, so this is at least some indication that if you're in the business of building models of the auditory system, imposing convolution in frequency is, at minimum, a useful thing to do, in the sense that when you have these fully connected models, though in principle, in some conditions, they might be able to learn the same thing, they have a pretty hard time doing that.
And one of the other kind of interesting observations that Jenelle and Christina made with this particular finding is that although the metamers of these models were very different, their behavioral phenotype on natural speech was actually really similar. So this is the result of that same speech recognition experiment I've shown you a few times now.
So we've got proportion correct on the y-axis, and signal-to-noise ratio on the x-axis. And the different lines are different types of background noise. And you can see that the two types of models do pretty much the same on this, right? So you really only are seeing this difference when you look at metamers.
OK, so some reason for optimism is that we have had a little bit of success improving metamers by making sensible architectural modifications. So one kind of scandal of the deep learning era is that the models typically violate a lot of signal processing principles. So in particular, it's pretty common to get significant aliasing.
And so we made some architectural modifications that remove aliasing. We found that when you do that, the metamers get a bit better. So it doesn't get rid of the problem, but they become more recognizable. So that's consistent with classical signal processing intuitions about what you might expect biological sensory systems to do.
But the last thing that I want to leave you with is another thing that we have found that kind of helps to improve this to some extent. And that has to do with addressing another kind of commonly talked about divergence between neural networks and human perception, which is adversarial examples. This is an extremely well-known phenomenon at this point, which is that most neural networks can be fooled by small things that are imperceptible to humans, adversarial perturbations. So you take a particular training example-- could be an image, could be a speech utterance that has a particular label-- and you can derive, using the model, a very small change to the stimulus that will cause the model to misclassify it.
And one of our neighbors across the street, Aleksander Madry, has developed a method for making models robust to adversarial examples. And this essentially involves generating adversarial examples during training and then simply adding those to the training set, training the model to correctly classify them.
And this is sort of a picture that is intended to give a little bit of an intuition of this, which is that there's some class boundary that the model learns. And maybe it's not sufficiently complex, such that you can get these really small changes that end up on the wrong side of the class boundary. And with the adversarial training, you kind of get something that maybe is a little bit more correct.
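The training loop described here can be sketched on a toy problem. Everything below is an assumption for illustration: a two-dimensional linear classifier with logistic loss, a made-up six-point dataset, and a perturbation budget `eps` on each coordinate. For a linear model the worst-case bounded perturbation has a closed form, which stands in for the iterative gradient-based search typically used with deep networks; the talk does not specify those details.

```python
import math


def sign(v):
    return (v > 0) - (v < 0)


def worst_case(x, y, w, eps):
    # For a linear classifier, the loss-maximizing perturbation with each
    # coordinate bounded by eps moves x against its label along sign(w).
    return [xi - eps * y * sign(wi) for xi, wi in zip(x, w)]


def margin(x, y, w, b):
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def adversarial_train(data, eps, lr=0.1, epochs=200):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in data:
            # Generate the adversarial example for the current model...
            x_adv = worst_case(x, y, w, eps)
            # ...and take the logistic-loss gradient at that example.
            coeff = -y / (1.0 + math.exp(margin(x_adv, y, w, b)))
            grad_w = [g + coeff * xi for g, xi in zip(grad_w, x_adv)]
            grad_b += coeff
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


# Hypothetical toy dataset: two well-separated classes labelled +1 and -1.
data = [
    ([2.0, 1.5], 1), ([1.5, 2.5], 1), ([2.5, 2.0], 1),
    ([-2.0, -1.5], -1), ([-1.5, -2.5], -1), ([-2.5, -2.0], -1),
]
eps = 0.5
w, b = adversarial_train(data, eps)

# Check that every training point stays correctly classified even under
# its worst-case perturbation of size eps.
robust = all(margin(worst_case(x, y, w, eps), y, w, b) > 0 for x, y in data)
print("weights:", w, "bias:", b, "robust on training set:", robust)
```

The design choice to regenerate the perturbed examples at every epoch matters: the adversarial examples depend on the current weights, so they have to track the model as it changes.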
And Aleksander had noted that when he actually visualized representations from these networks, they tended to look better. And so we thought that we could measure this and relate this to this metamer phenomenon that we've been looking at. And indeed, Jenelle found, in collaboration with Guillaume Leclerc and Aleksander, that the robust models in the vision domain have metamers that are more recognizable to humans.
So you can just kind of see this when you generate them. She did a behavioral experiment showing that, particularly for the deep layers, the robustly trained models give you more recognizable metamers for humans than do the standard trained models. What Jenelle then did is she went and trained robust audio models and found a similar kind of phenomenon. So you get a pretty big improvement in the recognizability of the metamers for the robust networks.
Now the big question here is, like, why is this happening? It's also worth noting that the issue is far from completely fixed, right? So you still get a pretty big recognition gap here. But it does make things better. And I think we still don't totally know why this is making a difference.
And it's interesting to think a little bit about the relationship between metamers and adversarial examples. And sometimes, metamers are like the converse of adversarial examples. So these are cases where the model is judging two signals to be the same, but where they look or sound different to humans.
But it's more complicated than that. I mean, metamers, in some sense-- they're very different from adversarial examples in the sense that they're independent of a classifier. So you can generate metamers for any kind of representation, like the classifier is not really part of it. So it's just as relevant for models that are trained without supervision.
So indeed, one of the things we're working on now is looking at this kind of phenomenon in models that are trained with unsupervised learning. So this is really still kind of an open question. It's mostly an empirical finding at this point. But it is something that really makes a big difference.
And so the last thing that I'll just leave you with is that this phenomenon of metamers is useful in the sense that it reveals differences that are not evident with our usual metrics, OK? So in the paper by Alex Kell from a couple of years ago, we made two primary comparisons. One is behavioral comparisons with natural sounds. And the other is fMRI predictions.
And Jenelle did this nice demonstration that both of those kinds of metrics really don't show differences between the standardly trained neural networks and the robustly trained neural networks. So the correlation between human and model speech recognition across different background conditions is pretty much the same for the two types of models. It's very high in both cases. And the fMRI predictions are really equally good in the two cases, right? So it's really only when you do the metamer test or if you evaluated adversarial examples in this particular case that you would see these differences, OK?
OK, so those are just the take-home messages from part 2. We've been using metamers of neural networks to reveal model invariances. The general finding is that the metamers of the deep layers of neural networks are not metameric for humans. They're usually not even recognizable to humans. So they're very far from metameric. This is true across modalities for both vision and auditory networks.
We've had some success improving the models. Metamers can be made more human-recognizable with certain kinds of architectural modifications that seem sensible and by making models more robust to adversarial examples, though we don't totally understand that yet. But this definitely doesn't fix the problem completely. These divergences remain.
And so just a top-level summary here-- we've been working on building new models of audition via deep learning of audio tasks. We have lots of examples now of compelling matches to human behavior with real-world sounds and tasks that we didn't have before. We've replicated lots of classical psychophysical results that give us insight into the origins of behavioral traits. We've got better models of the auditory cortex than we did before. Evidence for hierarchical organization-- I skipped over that. But there are significant remaining discrepancies that we see with model metamers.
But I just want to conclude by acknowledging all the folks who did this work. Alex Kell, Andrew Francl, Mark Saddler, Jenelle Feather, Erica Shook, Ray Gonzalez, Dan Yamins, Yang Zhang, Kaizhi Qian, their collaborators at IBM, and Guillaume and Aleksander across the street. And I'd like to thank you. And I'm happy to take questions.
PRESENTER: Josh, there's a question from [INAUDIBLE] in the chat. He asks, can you please elaborate more on what aliasing is and how you reduced it with architectural choices?
JOSH MCDERMOTT: Yeah, I went through that really, really quickly. So aliasing is a phenomenon that happens when you downsample a signal without low-pass filtering first. And it causes information that is in high frequencies to get kind of moved into low frequencies. And so it's something that you always try to avoid when you're doing signal processing. So if you downsample, you apply a low-pass filter, OK?
And neural networks often end up aliasing because, of course, as you know, the downsampling operations are kind of a big part of what they do. And the filtering that precedes that is usually not constrained to be low-pass in any way. And so you can address that by essentially building in low-pass filters prior to the pooling operations. And that can largely eliminate the phenomenon.
And that has the effect of making the metamers more recognizable, which is consistent with the idea that, in accord with what we were taught in DSP, the brain might be set up so as to try to minimize aliasing. I mean, you can think of it as like, when you alias, you kind of-- you're mixing things up, right, that maybe should not be mixed up, which is to say the different frequencies.
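The downsampling point can be checked numerically with a small sketch. The assumptions here are mine, not the talk's: two pure tones at 1/8 and 3/8 cycles per sample, downsampling by a factor of 2, and a circular `[1, 2, 1] / 4` smoothing kernel as a crude stand-in for whatever low-pass filter a network would build in before pooling.

```python
import math


def tone(cycles_per_sample, n_samples=16):
    return [math.cos(2 * math.pi * cycles_per_sample * n) for n in range(n_samples)]


def downsample(x, factor=2):
    # Keep every `factor`-th sample, with no filtering.
    return x[::factor]


def lowpass(x):
    # Circular [1, 2, 1] / 4 smoothing filter, a crude low-pass stand-in.
    n = len(x)
    return [
        0.25 * x[(i - 1) % n] + 0.5 * x[i] + 0.25 * x[(i + 1) % n]
        for i in range(n)
    ]


low = tone(1 / 8)   # below the post-downsampling Nyquist limit
high = tone(3 / 8)  # above it, so it will alias

# Naive downsampling: the high tone lands exactly on the low tone.
naive_low, naive_high = downsample(low), downsample(high)
max_naive_diff = max(abs(a - b) for a, b in zip(naive_low, naive_high))

# Low-pass filtering first attenuates the high tone before it can alias.
filtered_low = downsample(lowpass(low))
filtered_high = downsample(lowpass(high))
peak_low = max(abs(v) for v in filtered_low)
peak_high = max(abs(v) for v in filtered_high)

print("difference after naive downsampling:", max_naive_diff)
print("peak of low tone after filter + downsample:", peak_low)
print("peak of high tone after filter + downsample:", peak_high)
```

After naive downsampling the two tones are sample-for-sample indistinguishable, which is the "mixing things up" described above. The simple smoothing kernel does not remove the high tone entirely, but it passes most of the low tone while strongly attenuating the high one; a sharper filter would reduce the residual further.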
AUDIENCE: Can I ask a follow-up question?
JOSH MCDERMOTT: Definitely.
AUDIENCE: So another question I wanted to ask was, have you tried synthesizing with some of these constraints-- like, not allowing any high-frequency sound to be in your synthetic stimuli, and see what the effect of that would be? The reason I'm asking this is to kind of try to isolate how much of the problem is coming from having high-frequency components in any kind of stimuli.
JOSH MCDERMOTT: So to be clear, the-- well, when we're talking about high frequencies here, we're not talking about high audio frequencies, necessarily. I mean, the aliasing phenomenon is kind of internal to the network, right? So it's sort of high frequencies inside the neural network representation. So if you've got a spatial array of x and y, it's frequencies in that domain, right?
And I mean, we-- I'm not sure if we've actually tried to-- we've probably done some experiments along these lines. But I think what you might be getting at, or what your question is related to, is the tendency of a lot of people who do neural network visualizations to actually employ smoothness priors, so that's a pretty common thing to do. And I think people who have tried to do visualizations, they realize that if you don't impose smoothing priors, things don't look very good, which is essentially the phenomenon that we're talking about.
You can definitely make things better by imposing priors. But it kind of defeats the purpose, right, because the question here is whether the model has got the same invariances as the human. And so if you really want to ask that, you don't want to actually be imposing priors onto the solution, you know?
So, I mean, I think it's possible that it might give you some clues. But in this case, we sort of already got the clue from the aliasing architectural manipulation. So I'm not sure that that particular prior is itself going to be all that informative. 