Sound, Ears, Brains and the World (59:15)
- All Captioned Videos
- CBMM Summer Lecture Series
JOSH MCDERMOTT: Today what I'm going to tell you about is a couple of things. So first of all, we're going to do a quick kind of overview of the problem of audition, and the basics of the auditory system. And then I'm going to talk to you about a general big problem called auditory scene analysis. Then I'll tell you a couple of little stories of things that we've been doing in that domain.
Today, we're going to talk about "Sound, Ears, Brains, and the World." So as I said, I study how people hear. And one of the funny things about perception is that perception is usually pretty effortless. And so we often don't spend a lot of time thinking about all the complicated things that are going on inside our heads to enable us to perceive the world in the way that we do.
And so it's often useful, I find, when I'm giving talks, to just start by asking people to listen. So what I'm going to do here is play you just a couple examples of sound. This is the kind of stuff that would enter your ears on a daily basis. And so everybody just close your eyes and listen to this.
[AMBIENT NOISE]
[INTERPOSING VOICES]
[MUSIC PLAYING]
OK. We could skip the last two. So this is what you just heard. So those first two clips, they're just things that I recorded on my phone, right? And this is the kind of stuff you hear all the time. And you could probably tell what they were.
And the purpose of playing you these things is just to point out that your brain is deriving an awful lot of information about what was happening in the world to cause those sounds. So in the first case, you could tell that there were people in sort of a cafe or restaurant. And you could hear a couple different people talking. You could probably hear music in the background. You could hear dishes clattering. You might have been able to tell one of the people had an Irish accent, all kinds of stuff like that.
In the second one, you could probably tell the people were in a sports bar. And that they were watching football. You could, again, hear the voices of some of the people who were talking. You could identify clapping. You could also, probably, interestingly, separately identify the sound that was coming from the TV in the bar.
So there was an announcer who was talking from the TV. And that has a different sound than the people in the bar. And you could probably also hear the noise from the stadium coming in through the TV. All right.
So you're deriving a lot of information from sound. And what makes that kind of interesting and remarkable, and gives us a lot of things to study, is that your brain was deriving all that information from sensory input that, to first order, just looked like this, right?
So there was a sound signal that was coming out of the speakers in this room. It was propagating through the air. And it was causing your eardrum to vibrate back and forth in some particular pattern, right? And so that pattern of wiggles is what's plotted here, right?
So we've got the pressure that would be recorded at the eardrum as a function of time. And so as the pressure varies, your eardrum would be wiggling back and forth in this particular funny pattern of displacement. OK?
So that's all you got to work with, right? Your brain gets this as the input. And from that input, you're able to determine all of those things that I was just telling you about, right? All right.
So that's the problem of audition. When objects in the world vibrate, they transmit acoustic energy through the surrounding medium, usually air, in the form of a wave. And the ears are devices to measure this sound energy and transmit it to your brain. And the task of your brain is to interpret this signal and to use it to figure out what's out there in the world.
So that's really what you care about as a listener, right? You're not particularly interested in those wiggles of your eardrum, per se. You want to know what happened in the world to cause the sound. And so the reason that this is an interesting problem is that most of the things that you are interested in as a listener are not explicit in the waveform.
How many people have heard that term "explicit" in the context of sensory problems? Couple, OK. So what I mean by that is that, if I give you this waveform, and I just let you look at it, or if I give it to you in digital form, and I allow you to run some kind of simple classifier on it, you're going to have a very hard time determining whether it was a dog, or a train, or rain, or people singing, right?
That's why you need a brain. The purpose of your brain is to take this sound waveform and to transform it into representations where the things that you care about are more easily recovered, right? So that's what we really mean by explicit. We mean that something is easily recovered from the representation.
All right, so that's kind of the key question that drives our research. How do we derive information about the world from sound? OK.
And so the general approach we take is to start with what the brain has to work with. And so we try to look at sound using representations like the ones that we find in the early auditory system. And so as I said, in the first little part of today's lecture, I'm going to take you on a little tour of the auditory system, right?
So this is a diagram of your ear. So this is the outer ear. This is known as the pinna. That's the thing you can see, the floppy thing on the side of the head. And so the pinna funnels sound through the ear canal to the eardrum, right?
So the eardrum is this thin little umbrella-like membrane. And that's the thing that vibrates in response to sound. And those vibrations, they get transmitted through these three small bones to an organ known as the cochlea. So the cochlea's this thing that looks like a snail. And that's the thing that takes that mechanical energy from the sound wave form and turns that into electrical signals that then get passed to your brain.
All right, that's the plan for today. OK. All right, so this is the standard functional schematic of the ear. So as I said, we've got the outer ear, which consists of this thing called the pinna, and then the eardrum. We've got the middle ear, which are those three little bones. And then the inner ear, which is the cochlea. And each of those has a different function.
So this is just a slice-through, a schematic head and brain, to kind of orient you. So we've got the outer ears here, the ear canals. The cochlea is here. And then those signals-- as I said, remember, the cochlea is this device that takes sound energy and turns that into electrical signals. Those signals then get sent to all these different stations in the auditory pathways, eventually making their way up to the auditory cortex.
And I'm not going to tell you in detail about any of those. But there's a whole bunch of them. OK.
So one of the signature-- probably the signature-- features of that process, of taking sound energy and turning that into electrical signals, is that it's frequency-tuned, right? So I told you how the cochlea is that snail-shaped thing. You can see this coil here.
And this is that snail just kind of unwrapped a little bit to try to make it easier to see. And so the snail-shaped thing, the cochlea, consists of these tubes that are separated by a membrane, all right? And the membrane is called the basilar membrane.
And so, generally speaking, what happens is vibrations get transmitted to this tube, which is filled with fluid. And it sets up a traveling wave on that basilar membrane, right? So it's a mechanical device. It's pretty remarkable.
And so sound comes in. There's a wave that gets set up on the basilar membrane. But the wave peaks in different places depending on the frequency content of the sound. And so that's what these three diagrams are indicating. So if you send in a high frequency, you get a peak down here near the base. If you send in a medium frequency, the wave peaks in amplitude kind of in the middle. And if you send in a low frequency, it peaks in amplitude down here towards the apex. All right? So different frequencies get transduced in different places along the membrane.
All right, now, where do the electrical signals come from? Well, this is a cross-section of the cochlea. So if you imagine taking this and kind of making a slice, right there. So this is that basilar membrane, the thing that's wiggling up and down. And sitting on top of that is this mass of cells. It's called the organ of Corti. And the organ of Corti contains these things called hair cells.
And as this thing vibrates up and down, there's a shearing motion between this mass of cells and this membrane here. And it causes the hair cells to physically deform. And hair cells are this particular type of nerve cell whose membrane potential changes when their shape changes. And so as this thing moves up and down, there's a membrane potential change in these cells, which synapse with auditory nerve fibers, which eventually generate action potentials that then get sent to your brain.
OK. So the point of all this is that nerve fibers that are synapsing to different places along the cochlea are going to end up transmitting different frequencies because of this particular setup of the cochlea, the fact that the traveling wave peaks in different places depending on the frequency content. And this is just another close up, here, showing the same thing. So the things that are doing a lot of the action, here, are these hair cells.
All right. And so here's just an example of what's called a tuning curve for an auditory nerve fiber. So the way this experiment would work is, you would stick an electrode in, and try to hook a single auditory nerve fiber. And then you can measure the response of that nerve fiber as you played different sounds to the animal that you're recording from, all right?
And so what this graph is plotting is the minimum sound intensity needed to elicit a neural response as a function of the frequency of a tone, all right? So you're just presenting a single tone, like a beep. You can vary the frequency of the tone. And then you can vary kind of how intense it is, how loud it is, all right?
And so what this graph is showing is that there's a particular frequency, and this is known as the characteristic frequency for the nerve fiber, where you can elicit a response from the nerve fiber at a very low level. So this is like 20 dB SPL. So that's lower than the background noise of the air conditioning in this room. That's pretty quiet. OK?
And then, as the frequency kind of increases or decreases, the sound level that's needed to elicit a response goes way up. All right? So it's as though this auditory nerve fiber's really tuned to this particular frequency. All right?
And so that's just one example. Here are a whole bunch of tuning curves from different auditory nerve fibers. And so you can see that these kind of V-shaped things, they sort of cover, more or less, the whole spectrum. So the idea is that this nerve fiber would be synapsing near the base. This one would be synapsing near the apex. These would be kind of in the middle.
So this is the picture, in terms of these auditory nerve responses, right? So the idea here is that each auditory nerve fiber is kind of responding to a restricted range of frequencies. And so we typically model what the ear is doing as a bank of band-pass filters.
And so each one of these curves here-- so we've kind of just taken those plots and sort of turned them upside down. So you can think of each one of these things as representing a nerve fiber or a filter. And the idea is that sound comes in. And the sound that gets passed through the filter is the stuff that's kind of in this range for this filter, and in this range for this one, and in this range for this one. And so, collectively, we kind of think of what the ear is sending the brain as the output of a whole bunch of these band-pass filters.
And so, in engineering terms, we call the output of these filters subbands. So they have like a subset of all the possible frequencies in a sound. And so it's pretty easy to kind of set up a simple model to replicate this, where, here, we just have some sound waveform. So it's just a signal where the amplitude, or voltage, or pressure is just varying over time.
And we're going to pass that through a bank of filters. So each one of these colored curves represents one filter that you might kind of loosely think of as an auditory nerve fiber. And so this is frequency here. And this is the response of the filter. And so the idea is that this blue filter here is just passing frequencies that are in kind of this range.
And so, well, what happens if you do this? Well, this is one example. So, again, this is the original sound wave form. And this is a subband that comes out of the filter that spans 350 to 520 Hertz. So these are relatively low frequencies.
And so you can see that the original waveform kind of wiggles all over the place. And it does so at pretty fast rates. Whereas this subband actually wiggles around here at pretty low rates. And that's because the frequencies that are being passed are low. All right? And so it oscillates fairly slowly with time.
By comparison, here's the output of a filter that's tuned to slightly higher frequency. And so you can see that the wiggles here have increased in their rate. And here's one that's higher still. And you can see that those wiggles are even faster. OK?
And so the idea is that this is more or less what the ear is sending your brain, right? So it's taking that original sound waveform-- so this is the pattern of wiggles at your eardrum, all right-- and it's chopping that up into all these different pieces that contain different parts of the spectrum.
All right. And so we think of what the ear is doing as taking that waveform and then turning it into a higher-dimensional representation, all these different filtered versions of the signal, each of which contains a particular range of frequencies. And so we call those things subbands. All right.
And so the simple auditory model that we typically work with has that form. So you've got a sound signal. It gets passed to a bank of filters. We're just showing two of them here for clarity. And then the output of those filters are these things called subbands. And so, typically, when we try to think about how the brain does stuff with sounds, we at least start out by looking at sounds in this kind of a representation.
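In code, that filter-bank front end looks roughly like the following. This is only a minimal sketch: it uses plain Butterworth band-pass filters rather than the gammatone-style filters typically used to model the cochlea, and the band edges and input signal are made up for illustration.

```python
# Split a waveform into band-limited "subbands" with a small filter bank.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def make_subbands(signal, fs, band_edges):
    """Return one band-pass filtered copy of `signal` per (low, high) pair in Hz."""
    subbands = []
    for lo, hi in band_edges:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        subbands.append(sosfiltfilt(sos, signal))
    return np.array(subbands)

fs = 16000                                     # sample rate in Hz
x = np.random.randn(fs)                        # stand-in for one second of recorded sound
edges = [(350, 520), (520, 800), (800, 1200)]  # illustrative band edges only
bands = make_subbands(x, fs, edges)            # shape: (n_bands, n_samples)
```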
Turns out that a lot of the information in sound is carried not simply by the frequencies that it contains in some sense but by the way that frequencies are modulated over time, right? So when I talk, my mouth is opening and closing. And the amplitude of the signal that's coming out of my mouth is changing very quickly in time.
And you can see that in this particular example. So this is one example subband. And so you can see that-- so this is a subband from a pretty high-frequency filter. And so the blue curve, here, is the filtered waveform. And so it's wiggling around at this very fast rate.
But you can see that the amplitude of the waveform waxes and wanes at kind of a slower rate. And that's what's plotted by the red curve. All right? So the red curve is what we call the envelope of the blue curve.
And, so, typically, when we think of what the ear is sending the brain, we think of these envelopes. Because if you imagine an auditory nerve fiber that is responding to frequencies in this particular range, the envelope, here, would correspond to the instantaneous firing rate of the nerve fiber. And so it's pretty easy to extract this using standard signal processing tricks.
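Here is one standard signal-processing recipe for that envelope (Hilbert transform plus light smoothing). It is a generic sketch, not necessarily the exact extraction used for the figures in the lecture.

```python
# Amplitude envelope of a subband: magnitude of the analytic signal, lightly smoothed.
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

def envelope(subband, fs, cutoff_hz=200.0):
    env = np.abs(hilbert(subband))                               # instantaneous amplitude
    sos = butter(2, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, env)                                 # smoothed envelope
```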
So if that's kind of unfamiliar to you, this might be a slightly more familiar way in which to think about this. So what I'm showing you here is a variant of what's called a spectrogram. So this is probably the most common way to look at sounds. How many people have seen a spectrogram before? All right, so about half of you.
And so what a spectrogram plots is the frequency content of a sound, usually on the y-axis, as a function of time, usually on the x-axis. And the instantaneous amplitude will be plotted in grayscale, or color, or something like that. OK?
And so this is just a spectrogram of this particular sound.
[DRUMMING]
All right. So it's just somebody drumming. And you can kind of see that there are these things that happen here. Those are like the drum hits. Boom, boom, ba, bom, bada, or something like that, OK? All right. So that's a spectrogram. All right.
So what does this have to do with what I was just telling you? Well, one way to generate a spectrogram is to have a bank of those filters, like I was saying, all right-- here's the output of one of them-- to extract the envelope from each filter, and then to plot the envelope in grayscale, typically, horizontally, right? So if you take the output of each of those filters, you plot it in grayscale, and then you just stack them one on top of the other, you're going to get a picture that will describe the frequency content of the sound over time.
So remember, the amplitude envelope tells you the instantaneous amplitude in a particular range of frequencies, right? And so we're just going to plot that in grayscale. And so it says, well, it's high here, and here, and here, and here, right? And so together, you get this image that we typically use to visualize sound.
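Putting the two sketches above together, that image is just the subband envelopes stacked row by row and plotted in grayscale. This assumes `make_subbands`, `envelope`, `bands`, and `fs` from the earlier sketches; the compressive exponent is an illustrative choice.

```python
# Stack one envelope per frequency channel into a spectrogram-like image.
import numpy as np
import matplotlib.pyplot as plt

envs = np.array([envelope(b, fs) for b in bands])  # (n_bands, n_samples)
img = envs ** 0.3                                  # rough compressive nonlinearity

plt.imshow(img, aspect="auto", origin="lower", cmap="gray_r")
plt.xlabel("time (samples)")
plt.ylabel("frequency channel (low to high)")
plt.show()
```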
And so one of the interesting things, when you look at these kinds of images, is that they often look pretty messy and noisy, right? So that was kind of low in amplitude. So I don't know how easy that was to hear. But that was a pretty crisp sounding drum break. It sounded good.
And then you look at this image. And it's kind of a mess, right? But the fact that it looks kind of noisy and messy is mainly, we think, a reflection of the fact that your brain is not used to getting sound in this format, right? Normally, you listen to it. You don't look at it, OK?
And we actually believe that this picture contains most of the perceptually-relevant information about the sound. And one way to demonstrate that is to use this picture to make a sound, all right?
So imagine we play this game where you give me that recording of somebody playing the drums. You allow me to make this picture. And then you throw out the original sound. All right? And I have to generate a sound that's consistent with this picture. OK?
And so it turns out that if you do that, the sound that you get out is typically perceptually indistinguishable from the original, all right? And I can play you one example, here.
[DRUMMING]
Compare that to--
[DRUMMING]
AUDIENCE: [INAUDIBLE]
JOSH MCDERMOTT: What's that?
AUDIENCE: You can see the bass note in the spectrogram, right?
JOSH MCDERMOTT: That's right, yeah. Exactly. Yeah. But the point is, they sound the same in the two cases, right? So that was the original. And this is this one that we've synthesized just from this picture, OK? So even though this looks kind of messy and noisy, it seems to contain most of the information that you actually need about the sound.
You might ask, well, how do you make a sound out of this? And so I'm going to tell you about this. Because this turns out to be pretty useful. And so the basic idea is, we want to generate a sound, constraining it only to generate the same picture as some original waveform, all right?
And so the way that we do that is we start with a sound that's as random as possible, just a noise signal, all right? We then pass it through that same filter bank. And so we now have a subband representation of the noise signal, right? So we've got all those different filtered versions of it.
And all we do is we take each of those subbands, and we just adjust them so that they have the same amplitude envelope as the sound that we measured, all right? We can then take the modified subbands and add them back up. And we get a new signal, all right?
And so that's the basic game that we play. And it's actually a little bit more complicated than that, but not that much, OK? All right. And so this process of synthesizing a sound from a representation turns out to be a pretty useful thing to do. And we do it in a lot of different contexts. And in a few minutes, I'll show you an interesting application of that.
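A bare-bones version of that synthesis game might look like this. It is only the core idea: it reuses `make_subbands` and `envelope` from the earlier sketches, assumes the target envelopes have already been measured from the original sound, and leaves out the extra constraints the full procedure adds.

```python
import numpy as np

def synthesize_from_envelopes(target_envs, fs, band_edges, n_iter=10):
    """Shape noise so its subband envelopes match `target_envs` (n_bands x n_samples)."""
    y = np.random.randn(target_envs.shape[1])        # start from white noise
    for _ in range(n_iter):                          # a few passes help the subbands agree
        subs = make_subbands(y, fs, band_edges)
        envs = np.array([envelope(s, fs) for s in subs])
        subs = subs * target_envs / (envs + 1e-8)    # impose the measured envelopes
        y = subs.sum(axis=0)                         # recombine into a single waveform
    return y
```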
What is on the screen right now is a live spectrogram. So the microphone input of my computer is being turned into a spectrogram. So this is frequency. And time is just kind of rolling by.
And so the only thing I don't like about this thing is that it's kind of slow. And so speech changes very quickly. And so, when you talk, it just looks like kind of a mess. But if, instead, I sing. (SINGING) La. Loo. Lee.
(SPEAKING) All right. So this is kind of related to your question. So this is not three different instruments. But it's three different sounds that I was making with the same pitch, all right?
And so you can see that, in each case, we get these horizontal stripes. These are the harmonics. So it looks like I was singing something that was maybe 170 Hertz. OK. So you get 170, and then two times that, which is 340, and then three times that, four, five, six, seven, eight.
And they get closer together here. Because this is a logarithmic frequency scale. And so what I did in each of those three cases is I was singing a different vowel. So I think that was "la," and "loo," or "lee," or something like that. And what happens when you make different vowels is you change the shape of your mouth.
And when you change the shape of your mouth, the resonant frequencies change. And so, in all of those cases, the sound is generated by a sound source, which is the vocal cords. And the vocal cords-- I'm actually going to show you this in a little bit. But the vocal cords are these sort of membranes.
And they open and close at periodic rates. So they literally produce this train of clicks, all right? So that's in the time domain. And in the frequency domain, it's periodic. And so you get a bunch of harmonics of a fundamental, all right? But the point is that, in all of these three cases, the source is doing the same thing. It's just making a 170 hertz sound.
But my throat and my mouth are doing different things in all those three cases. And so the resonances are very different. And so if you look at the spectrum, here, and here, and here, you can see that they just look different, right?
So they've all got the same set of stripes. But the relative amplitude of the stripes is different in all of the cases. And so that's what makes "ah" sound different from "ooh," sound different from "eeh." So when you make different vowels, "ah-eeh-ooh-eeh-ah-eeh-ooh," you're changing the shape of your mouth. And so those resonances are moving around.
And so for different instruments, it wouldn't look exactly like this, but the principle is kind of the same, right, which is that the amplitude of the different harmonics would, again, be defined by whatever the instrument was. And in addition, there would be this signature in the time domain, which is the way that the amplitude would change over time.
So what I'm going to do in the second part, here, is we'll take some of that stuff that we just talked about and kind of apply it to some problems in hearing. And so one of the general problems that my lab is really interested in is what's called auditory scene analysis.
And so, in general, auditory scene analysis is the process of inferring events in the world from sound. And, I guess, maybe, more specifically, when I say auditory scene analysis, what I mean is the fact that, usually, the sound signal that we receive, it's caused by multiple things in the world. And there's a lot of different variants of that.
And so one of the things that makes perception really hard is we often face problems of that sort. So there's a single signal that kind of enters your ears, or that enters your eyes. But it results from a whole bunch of things in the world. And so your brain somehow has to infer what the causes are likely to be in the world. And so we'll look at a couple of examples of that.
So one of the most commonly studied aspects of auditory scene analysis is what's known as the cocktail party problem. So how many people have heard of the cocktail party problem? OK.
So this refers to the fact that, oftentimes, when we listen, our ears are receiving a mixture of multiple sound sources in the world. And so, in the cocktail party problem, the classic version, the different sources are different people talking, all right? So you get this red signal here, that's what is actually causing your eardrum to vibrate back and forth.
But that red signal, it results from these two different sources in the world. So this might be your friend, and your mom, or whatever, all right? And so those two sources are producing this sound. That's what's coming out of their mouths. But then those things travel through the air. And out at your ear, they add together into this particular, single mixture. All right.
And so, as a listener, you're not really interested in that mixture, per se. You're usually interested in those individual sources. Because you want to know what your friend was saying, or maybe, what your mother was saying.
And so, somehow or another, you have to figure out something about the content of one or both sources just from that mixed signal that you receive. OK? All right. So here is just a demonstration of this. So what you're usually interested in, maybe, is what one particular person is saying.
AUDIO PLAYBACK: She argues with her sister.
JOSH MCDERMOTT: But what could enter your ears could be this.
[INTERPOSING VOICES]
Or maybe even this.
[INTERPOSING VOICES]
Or maybe even this.
[INTERPOSING VOICES]
OK. So next to each of those icons, I'm plotting the spectrograms. That's the thing that we just learned about. OK? And this is the spectrogram of the single speaker. And you can see that there is all this structure in the spectrogram.
So the timescale, here, is kind of zoomed in, compared to the one that we were just looking at. And so you can kind of see that things are varying in the sort of detailed way as a function of time and frequency. And we believe that that structure is used by your brain to help you figure out what the person was saying.
And these are spectrograms of the sounds with the additional speakers. And what is very obvious is that, as those additional speakers are added to the cocktail party, the structure that was evident in the single speech signal becomes progressively more and more obscured.
And so by the time you get down to this one here, where you have seven additional speakers, it's kind of amazing that you can do anything at all with this. And how many people felt like they could kind of hear the target talker down here? Yeah. Quite a few of you, right? So, OK.
So people are really remarkably good at this. It remains beyond the reach of present-day speech recognition algorithms. So your iPhone, for instance, will work pretty well if you're in a quiet room. But you take it to a restaurant. And it's probably pretty hopeless.
It's also an important problem because, when people lose their hearing, this is one of the first things to go, the ability to hear in kind of complicated situations like this. So your grandparents probably have a hard time if you take them out to a noisy restaurant, even if they're wearing their hearing aids. So this is not a problem that we currently have really good engineering solutions to. OK.
So one of the things that makes this problem harder, and that also makes it very interesting, is it's what we call ill-posed. And so many interesting perceptual problems have this character. And, in general, what is meant by ill-posed is that the problem, in general, is not solvable. You don't have enough information in the sensory input to recover the things that you're trying to recover.
So remember, you observe this mixture, all right? And it's caused by these two blue things in the world. And in this particular case, the reason this is ill-posed is that many sets of sounds add up to equal the observed mixture.
So there's the blue ones, which is the ones that actually happened in the world. But then there's lots, and lots, and lots of sets of green ones, right, which are different from the blue ones, but that also add up to be equal to the red signal, right? And so, somehow or another, your brain has to take this mixture and infer the true sources, and not all these other ones.
So this is really exactly analogous to this math problem, all right? So suppose I tell you that x plus y equals 17. Please solve for x, right? So everybody knows if you get this on a math test, you should complain to the professor. Because there is not a unique solution, right?
It could be 16 and 1. It could be 15 and 2. It could be 14 and 3, and so on, and so forth. Right? There is an infinite family of possible solutions, all right? And yet, that's exactly the problem that your brain is facing when you get a mixture of sources and you need to understand what somebody is saying. OK. All right. So, in general, it's impossible.
So how do we manage to hear? Well, the thing with ill-posed problems is that you can only solve them with assumptions, in this case, about the sound sources that occur in the world. And you can only make assumptions about sound sources if real-world sound sources actually have some degree of regularity.
Fortunately, they have quite a lot of regularity. So real-world sounds are very far from random. One way to realize this is actually just to listen to some of them. So here are some example real-world sounds.
[NOISES]
All right. So those are just four kind of more or less randomly selected sounds. All right? By comparison, I'm going to play you some sounds that are fully random. All right? So what's a fully random sound?
Well, remember, a sound is just a wave form, right? So at every point in time, there's the amplitude of that particular sample. And so we can generate a random sound by just generating a sequence of random numbers, right? So at every point in time, I'm just going to pick a random number, say, from some Gaussian, all right?
And so when you get that, you get white noise. And so here's one sample.
[WHITE NOISE]
Here's another one.
[WHITE NOISE]
Here's another one.
[WHITE NOISE]
OK.
AUDIENCE: Those sound exactly the same.
JOSH MCDERMOTT: They do. And that's interesting in and of itself. Those are actually three different, independent samples of white noise. But, hopefully, it's apparent to you from those three examples, that we would have to sit here drawing these fully random samples for a very, very, very long time before we got something that sounded like a doorbell, or a hawk, OK?
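For reference, "fully random" here just means every sample is an independent draw from a Gaussian, which is white noise. A minimal way to generate the three samples (the file names are arbitrary):

```python
import numpy as np
from scipy.io import wavfile

fs = 16000
for i in range(3):
    noise = np.random.randn(fs)       # one second: each sample is an independent Gaussian draw
    noise /= np.abs(noise).max()      # normalize so it can be written as audio
    wavfile.write(f"white_noise_{i}.wav", fs, (noise * 32767).astype(np.int16))
```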
And so the point is that real-world sounds are a very small portion of all possible sounds, all right? They're a very tightly restricted space of the set of signals, all right? And so that means that they exhibit regularities. And we believe that your brain has internalized those regularities. And that's what enables you to hear. Right.
So one example of such a regularity is one that we were just talking about, which is harmonic frequencies. So voices and instruments tend to produce frequencies that are harmonics or multiples of a fundamental frequency. So here's an example amplitude spectrum of a schematic of a person talking. So you've got the fundamental frequency, and then all of the harmonics, which are multiples of that frequency.
And so, as I was just saying, that results, in the case of speech, from the vocal cords. So what this is, on the screen, is a movie that was taken from a camera that was stuck down somebody's throat while they were talking, all right? So they put a little camera on a tube, stuck it down the throat. And this is just a looped animated gif.
And so these are the vocal cords, right? So there are these two membranes. And when you speak, air gets blown through those membranes. And they pulse open and closed in a very regular way. All right.
All right. And so what you can see in this waveform is a sequence of those pulses. And so a very important fact about signals is that, whenever you have something that is periodic in the time domain, in the frequency domain, it's harmonic. So a signal that is periodic in the time domain, namely, repeating at some regular rate, so a sequence of pulses, if you look at it in the frequency domain, you're going to see harmonics of a fundamental.
And so here's the time domain representation. This is time. And that's just the amplitude of the waveform. And this is a frequency representation. So it's a spectrogram, frequency versus time. And you see you have these kinds of regular stripes. And so that's just the same thing that we saw on the live spectrogram that we were just looking at. OK.
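You can check that time/frequency fact for yourself with a toy click train. Here 160 Hz is used instead of the ~170 Hz from the singing demo, just so the period divides the sample rate exactly.

```python
import numpy as np

fs = 16000
f0 = 160                           # repetition rate in Hz; the period is exactly 100 samples
pulses = np.zeros(fs)              # one second of signal
pulses[::fs // f0] = 1.0           # one click per period
pulses -= pulses.mean()            # remove the DC component

spectrum = np.abs(np.fft.rfft(pulses))
freqs = np.fft.rfftfreq(fs, 1 / fs)
print(freqs[spectrum > 0.5 * spectrum.max()][:5])  # [160. 320. 480. 640. 800.] -- harmonics of f0
```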
All right. So this is just a diagram of the way that you make speech sounds. So the vocal folds are down here. And as we discussed, this is the source of the sound, that's what's generating that harmonic structure. That sound then propagates through the throat, the mouth, and the nasal passages, and then comes out. And so, depending on the shape of that cavity, you're going to get different sounds out.
And so when you make different vowels, the vocal cavity is in these different configurations. So the shape is a little bit different. And so that creates a particular filter. So this is frequency. And that's the response. And so these peaks, they kind of move around. They're in different places.
And so you have this harmonic sound source that's produced by the vocal folds. That, again, then gets passed through these three filters. And so the spectrum that comes out always has those same frequencies. Those are the stripes that we saw on the last spectrogram. But the peaks are in different places, all right?
And so that's what generates different vowels. That's the thing we just talked about. OK.
But the point is that the frequencies that are there are typically harmonic, at least in the case of vocal sounds. And so we can ask the question, does the brain use harmonicity to solve the cocktail party problem?
And so we've done some experiments on this in my lab by artificially manipulating speech to mess up the harmonic frequency relations. And I'm not going to tell you in detail how we do this. It ought to be illegal, but in this day and age, you can do these kinds of things.
So you can take speech apart, move the frequencies around, and then kind of put them back together, and create a sound signal, OK? So this is just regular, re-synthesized harmonic speech.
AUDIO PLAYBACK: She smiled and [INAUDIBLE] in her beautifully [INAUDIBLE] face.
JOSH MCDERMOTT: Someone saying some random sentence. And so this is the spectrogram. And so, again, you can see that the frequencies here are regularly spaced. So now we have a linear frequency scale. And so one of the things that's useful about using the linear frequency scale is that the harmonics are always separated by the same amount. Because they're multiples of the fundamental.
So here we've got 200 hertz, 400, 600, 800, 1,000, 1,200, and so on, and so forth. OK. So the comparison is this re-synthesized signal, where we've messed up the frequency relations to cause them to be inharmonic.
And so you can see here that these frequencies are no longer exactly the same amount apart, right? That's too far. And that's too little. And that's too far. And so on, and so forth. OK? Now, you can say that-- I mean, this is kind of a subtle manipulation, on the one hand. But when you re-synthesize this, it sounds really weird.
AUDIO PLAYBACK: She smiled and [INAUDIBLE] in her beautifully [INAUDIBLE] face.
JOSH MCDERMOTT: So that's inharmonic speech. Now, it's probably very apparent to you that that's not something that could ever be produced by a human, right? So that's kind of interesting. But it's also pretty apparent to you that you can understand perfectly what this computer is saying, right?
And that's because the only thing that we've changed are the properties of the source. So the filtering that's distinguishing different vowels, that's left exactly intact. And that's because we use this method that kind of decomposes speech into these two different components, of the source and the filter. OK?
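To give a concrete feel for the manipulation, here is a toy version of frequency jittering applied to a synthetic harmonic complex. This is not the lab's speech-resynthesis pipeline, just the basic idea of nudging each component off its exact harmonic frequency; the parameters are arbitrary.

```python
import numpy as np

def complex_tone(f0, n_harmonics, fs, dur, jitter=0.0, seed=0):
    """Sum of sinusoids at (possibly jittered) multiples of f0; jitter is a fraction of f0."""
    rng = np.random.default_rng(seed)
    t = np.arange(0, dur, 1 / fs)
    tone = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        offset = rng.uniform(-jitter, jitter) * f0       # 0 for harmonic, nonzero for inharmonic
        tone += np.sin(2 * np.pi * (k * f0 + offset) * t)
    return tone / n_harmonics

fs = 16000
harmonic = complex_tone(200, 10, fs, 1.0, jitter=0.0)    # components at exact multiples of 200 Hz
inharmonic = complex_tone(200, 10, fs, 1.0, jitter=0.3)  # components jittered by up to 30% of f0
```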
All right. OK, so the question here is, now that we've messed up this regularity of the sound, will this impair our ability to solve the cocktail party problem? So if you go to a cocktail party where everybody sounds like this, will it be harder to understand what people are saying? OK? All right.
And so we did some experiments to try to test this. So we would present people with words, either a single word at a time, or two words at the same time, concurrently. And we would ask them to just type in what they hear. Right?
So the question is when you make the speech inharmonic, would this be a lot harder to do, when you have two sources? All right? So first of all, I mean, as you could hear from those demos, there's a pretty big effect of this, in terms of just the way that these things sound.
So here we just varied the amount by which the frequencies are jittered. And we got people to rate how natural it sounded. And so when it's harmonic, they say it's very natural. And then as we add the jitter, it's increasingly unnatural. And so you can hear this for yourself. This is harmonic. Oops.
AUDIO PLAYBACK: Finally, he asked, do you object to pay? Finally, he asked, do you object to pay?
JOSH MCDERMOTT: Yeah. I didn't choose this sentence. Somebody did this for me. All right? It's from a speech database that has-- they took all of these sentences from plays, and stuff like that. And so it's pretty eccentric.
AUDIO PLAYBACK: Finally, he asked, do you object to pay? Finally, he asked, do you object to pay? Finally, he asked, do you object to pay?
JOSH MCDERMOTT: OK. So the point of this is just that it kind of sounds weirder and weirder. And then it kind of bottoms out. OK. All right.
So what happens when you do this experiment? All right, so here are the results of this word-recognition experiment. So what's being plotted here on the y-axis, this is the average number of words that people correctly type in as a function of the amount of jitter. So this is when the speech is regular and harmonic. And then we're kind of adding more jitter to it.
And so the black line is what happens if you just present a single word in isolation. And when you present a single word in isolation, there's really no effect of the inharmonicity. And hopefully, that was pretty apparent to you in just listening to that example, that when you just hear one of these things, you can understand what the person's saying without any problem.
But when we present pairs of concurrent words, things behave pretty differently. So when the speech is harmonic-- so people are not perfect. But they're getting something like 65% of the words right. And then, as you increase the jitter, they get worse. And then it kind of bottoms out. All right.
So this suggests that, indeed, there's something about mixtures of inharmonic speech that makes it harder for you to understand what's being said. And we believe it's because that harmonicity is one of these regularities that your brain is really relying on to solve this ill-posed problem.
OK. OK, so in the last little bit of the talk, I'm going to tell you about a second scene analysis problem that is the focus of a lot of active investigation in my lab. And that scene analysis problem is due to the fact that sound sources interact with the environment on their way to your ear, OK.
So right now, I'm talking, and sound is reaching you directly from my mouth to your ears. But it's also reaching you indirectly after reflecting off of all of the various surfaces in the room. All right?
And so this is the diagram that shows the direct sound from a speaker to this person's ears. Those are the green lines. But in addition to those green lines, the direct path, the sound can also reflect once off the walls before hitting the ears. And there's a bunch of those kinds of paths.
They can also reflect twice. So here it reflects once, and then twice, and then reaches the ears. And of course, there's all these other paths where it reflects three times, and four times, and five times. And we're just not drawing those in order to have the thing not be too cluttered. OK.
So these are reflections. Collectively, they're known as reverberation. And so what happens is, your ear gets all these delayed copies of the original sound source, right? So you get the one that comes direct from the source. And it reaches you first. Because that's the shortest path. And then all these other paths are longer, right? And so they come with a time delay. All right.
OK. So this happens all the time. In basically any real-world environment, you're going to have this to some extent. And you'll be well-aware that some environments have more of it than others. But it's very, very common.
And yet, it's not something that we really understand, perceptually, very well. And so I like to show this picture. Because it kind of captures-- it conveys two things. So this is a picture from an old paper from the '30s where they were trying to study sound localization. And so they specifically, actually, wanted to avoid the effect of all these reflections, right?
And the only way that they could do that is by putting people on the roof of this building where they did the study. And so this is a quote from the paper. They say, "in order to avoid possible reflecting surfaces, a tall swivel chair was erected on top of a ventilator, which rises nine feet above the roof of the new biological laboratories at Harvard University. At the present time, this procedure appears to be the only practical means of avoiding errors due to the reflection of sound." All right.
And so I like to show this picture because it really makes two points. One is that, historically, reverberation is something that people have kind of tried to avoid in their experiments, right? And number two, reverberation is ubiquitous. And the only way you can avoid it is by doing these ridiculous things like going up on a roof where there's no reflecting surfaces.
So this is Stevens and Newman, 1936. Those of you who are interested in psychophysics, this is the same Stevens of Stevens' law fame. So S.S. Stevens, who's a famous psychophysicist.
OK. So reverberation's ubiquitous. And it's a big challenge for your brain. So this is a picture that kind of shows why that's the case. So what we're showing you here is what's called a dry version of a sound signal, so that's a sound without any reverberation. So the only real way you would make that is by going up on that roof, or by going into one of these anechoic chambers where there's no reverberation.
And then this is somebody talking. And this is that same signal, but recorded, or simulated in a room that has a lot of reverberation. So the point is that this and this look really different. Similarly, you can look at spectrograms. And the point is that those things look really different.
And so the idea is that, if you had designed some mechanism to recognize this, it's kind of obvious that it would have a hard time recognizing this. OK? And, indeed, that's the case. So this is a big problem for machine speech recognition.
So this is a graph that just shows the percent errors as a function of the amount of reverberation. So each colored bar here is some particular speech recognition algorithm. And so, when there's no reverberation-- dry, clean speech-- they're not making many errors, right? So that's what happens if you hold your iPhone up like this, right?
But as we introduce just a little bit of reverb, the errors kind of jump way, way up. And so, if you actually put your iPhone on the other side of the room, that will effectively amplify the reverberation. And it's not going to understand you. You should try it. OK.
But reverberation also provides us with information. So it's one of the things we use to figure out how far away things are. And it also provides a cue to the size of the space that we're in. It's also a big ingredient in music production.
So this is one of the most popular tricks in music production. So this is-- how many people know this song, "For the Love of Money" by the O'Jays? Couple, all right. Yeah, you should check this out.
So this is like a song from the '70s that features a lot of interesting production tricks. And this is just the intro. So the intro to the song is just a bass line. And throughout the song, what the engineer does is he turns the reverb on and off, just to mess with you, to make it sound interesting.
And I don't know how obvious this will be, played over these speakers. And so I should also just say, in general, all the demos that I'm going to play to you are going to have the effects of this room reverberation. All right? So, ideally, what you'd want to do is to listen to this over headphones so you wouldn't get that. But it's unavoidable. All right.
So for that reason, I don't know how obvious that will be. And so what I've done here, just to make it more apparent, is I'm plotting the waveform. And so there's this little riff that the guy plays on the bass. And then it kind of repeats. And in the second repetition, they've turned the reverb off. And so you can just see that this looks pretty different from that. And I've just re-plotted it down below.
So you can see that the kind of notes are in the same place. But the waveform looks really different. See whether you can hear it.
[MUSIC - THE O'JAYS - "FOR THE LOVE OF MONEY"]
All right, you can hear that. All right. OK. So people use this as a production trick all the time. And it's because it sounds interesting to people. OK. So we wanted to try to understand how the brain deals with this.
Now, the way that reverb is typically measured is by recording something called the impulse response. And so that's literally what it sounds like. It's the response of a space to an impulse.
What's an impulse? Well, it's a very brief, high-amplitude sound. It's more or less what you'd get if you fired a gun, OK? So we could fire a gun in here, and just record what that sounded like. And that would give us the impulse response.
And so, if there was no reverberation, the sound that you would get would just be this impulse. But because of the reverberation, you get all of these reflected copies of the impulse. And so each reflection is attenuated in amplitude. And that's because each surface that the sound interacts with absorbs some of the energy. And it arrives with a delay, because the propagation path is longer. All right.
So you get this thing, the impulse response. This is an actual impulse response measured from a classroom. And so you get this peak here. That's the direct sound. And then you get a bunch of these individual reflections. And the reflections kind of all blur together. And they form this dense tail.
And so, if you just listen to the impulse response--
[NOISE]
Right, it just sounds like somebody making a tap in a classroom. OK.
OK, so the problem here that the brain faces is that you get this sound signal that enters your ears. And it's the combination of the effects of the sound that was produced at the source and the effects of the environment. So the effects of the environment, as I say, can be summarized with this quantity called the impulse response.
And the sound that enters your ears is what's called the convolution of the source signal with the impulse response of the environment. And if you don't know what a convolution is, it's OK. The point is just that these two things in the world get combined to generate the signal that enters your head. Right?
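In code, simulating what reaches the ear is just a convolution of a dry source with an impulse response. This is a generic sketch; the file names are placeholders for a mono dry recording and a mono measured impulse response at the same sample rate.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

fs, dry = wavfile.read("dry_speech.wav")                 # placeholder: source with no reverb
_, ir = wavfile.read("classroom_ir.wav")                 # placeholder: measured impulse response
wet = fftconvolve(dry.astype(float), ir.astype(float))   # the reverberant signal at the ear
wet /= np.abs(wet).max()                                 # normalize before writing out
wavfile.write("reverberant_speech.wav", fs, (wet * 32767).astype(np.int16))
```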
So just like in the cocktail party problem, your ear receives a mixture of these two different sources in the world. Here, you're getting the sound from a single source. But it interacts with the environment, right? So there are these two like causal factors in the world that affect the sound that enters your ears. All right.
And the effect of this is to massively distort the sound from the source. And because these impulse responses vary from place to place, this poses a real problem. OK. So, again, as was the case with the cocktail party problem, the listener is usually not particularly interested in that signal that it receives, but maybe in the source signal, and maybe, also, in the environment, right? And so it'd be nice if you could somehow take those apart.
And so the key question we've been working on is whether we can view the perception of reverberation as a process of separating the sound source from the reverb. OK. And so as we were just discussing, the nature of ill-posed problems is that, in general, they're unsolvable, right? The only way that you can solve them is by making assumptions about the nature of the things that you're trying to infer, OK?
And we just saw how natural sound sources are very far from random. And we wanted to see whether these environmental impulse responses were also non-random, and whether they would exhibit regularities that, perhaps, the brain could have learned, and that it could use to separately infer the source and the filter. OK. And so the general approach here is to try to measure reverb in real-world conditions, and to characterize regularities.
And so this is work by a fellow by the name of James Traer, who is a postdoc in my lab. You'll probably see him around the building this summer. OK. So what James did is he developed a method to measure impulse responses. So I just told you how you can measure an impulse response by firing a gun, OK?
But what we wanted to do is to measure the nature of reverberation in the kinds of places that people spend their time in during life, right? So we wanted to be able to go into restaurants, into bars, and city streets. And you can't just walk around firing guns. And so we needed a method to measure the impulse response that was slightly lower-impact. All right.
So this is the apparatus. So there's a speaker here, and a recorder. And the speaker plays out a fairly low-amplitude noise signal that lasts between 3 and, say, 10 minutes, OK? And the noise is not that loud. And so you can take it to a public space. And you just hear this speaker that's playing this noise signal. And you record what that sounds like in the space.
And the key thing is that we've got the recording of what it sounds like in the space. And we also know what the signal was that was coming out of the speaker. And because we know those two things, we can then use those to infer the impulse response of the space. So we can take this thing anywhere, and measure the impulse response. And James has been most places in Boston, at this point.
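One textbook way to do that inference, given the played signal and the recording, is regularized deconvolution in the frequency domain. This is a generic sketch and not necessarily the method actually used for these measurements.

```python
import numpy as np

def estimate_ir(played, recorded, ir_len, eps=1e-3):
    """Estimate the impulse response that maps `played` to `recorded`."""
    n = len(played) + len(recorded)                # zero-pad to avoid circular-convolution artifacts
    P = np.fft.rfft(played, n)
    R = np.fft.rfft(recorded, n)
    H = R * np.conj(P) / (np.abs(P) ** 2 + eps)    # regularized spectral division
    return np.fft.irfft(H, n)[:ir_len]             # keep the first ir_len samples
```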
So we had people participate in the survey. So the volunteers in the survey, they got pinged with text messages 24 times a day at random times. And we programmed their phones to send us back their GPS coordinates when they would get this ping. And in addition, the participants were supposed to take a photograph of wherever it was that they were, and to also send the address. And so the idea is that, with these three sources of information, we could figure out where they were. And James could then go to the space and measure the impulse response. And so over the past three years, he's been doing that.
So we now have about 300 locations, from seven different people, that did this for a couple weeks of their lives. And there are all kinds of different spaces. So we've got stuff from restaurants, department stores, city streets, the woods, bathrooms, subway stations, and so on, and so forth, all the different places that people spend their lives. So the idea is that we're taking samples from the distribution of real-world reverberations. And so the question is whether these things exhibit regularities.
All right. And so part of the payoff of that stuff that we were talking about during the first part of the talk is that, to analyze these impulse responses, we're going to look at them in frequency bands that simulate the ear's response to sounds, right? So we've got that good old set of band-pass filters that we were looking at before. And we take our impulse response. And we're going to plot it as a spectrogram.
And we call these things cochleagrams, because they're like a particular kind of spectrogram where we use the frequency response of the ear to generate the spectrogram, OK? And so you get something that looks like this. And so we just want to see whether these things are consistent. OK.
All right, so I just showed you the spectrogram. These are the amplitude envelopes of particular frequency channels, the ones in red, here. These are two examples. This is an office. And this is the Park Street T Station. Has anybody been to Park Street yet? Yeah, it's over there. OK. So this is what it sounds like. Or looks like.
OK, and so what you see is that energy decays. And in all these cases, the decay is pretty well approximated by a straight line. At some point, you hit the noise floor. That's shown in red, all right?
And so we're plotting this on a log scale. And so the fact that these things are straight lines means that we're seeing exponential decay. All right, so there's a functional form here of exponential decay. And if you don't know what exponential decay is, it doesn't really matter. The point is, it's a particular functional form. And it seems to describe pretty much all the impulse responses we see.
And we quantified that by fitting additional polynomial terms. And, basically, the linear term kind of explains everything that you can explain. If you fit quadratic and cubic terms, and so forth, it doesn't help. All right. So the punchline here is that you pretty much always see exponential decay.
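The fitting itself is simple: exponential decay means the envelope in dB falls along a straight line, so a line fit gives the decay rate. In this sketch, `env` is assumed to be a subband envelope of a measured impulse response (e.g. from the `envelope` sketch earlier), truncated before it reaches the noise floor.

```python
import numpy as np

def decay_rate_db_per_s(env, fs):
    """Slope of a straight-line fit to the envelope in dB; larger means faster decay."""
    t = np.arange(env.size) / fs
    log_env = 20 * np.log10(env + 1e-12)   # envelope in dB
    slope, _ = np.polyfit(t, log_env, 1)   # a straight line in dB is exponential decay
    return -slope                          # dB of decay per second
```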
We can then, also, look at the decay rates. So that's what's plotted here, as a function of frequency on the x-axis. And these are these two examples, the office and the Park Street T station. And when you do this, you, again, see pretty tight regularities. And in particular, one of the key things that we see is that the decay rates are always fast at high frequencies. And they're typically slower kind of at the mid frequencies.
There's a little bit of variance from place to place. But here's sort of some summary statistics where each of these lines is the median of a quartile. And so this sort of captures the central tendency. And you always see these like sideways U shapes.
So, again, this is frequency. And that's the decay time. What this is saying is that, in the middle range of the frequency spectrum, the energy is kind of sticking around for a long time. And at the high and the very low end, it's decaying much more quickly.
And we believe this is just due to the absorptive properties of typical materials, that materials tend to absorb a lot of high frequencies. The low frequencies are probably passing through the material. So you know how, if you're in an apartment building, and your neighbor's playing music, what you're going to hear is the bass? That's because those low frequencies tend to propagate through materials very easily.
And so when they propagate through, they don't come back. And so that's why those things go away. They kind of spread out to the rest of the world. Whereas the high frequencies are just going to get absorbed.
So one interesting thing is that things look pretty similar if you compare indoor and outdoor environments. So this is those same kind of plots. We've got frequency and decay time. This is the subset of the impulse responses from rural environments, where there's basically no man-made structures. And you see that same kind of sideways U. This is outdoor urban. This is indoor.
There's a little bit of variation. But, basically, this trend for the high frequencies to decay quickly seems to be pretty universal. All right. So the point here is that these regularities raise the possibility that your brain might have internalized this, and that you could use this in your perception.
OK. All right, and so we want to test whether you actually have implicit knowledge of these kinds of regularities. And so one simple test of this is just to listen to the impulse response. And so what we're going to try to do is synthesize impulse responses that have these regularities, or that don't. And then we're going to listen to them, and see if they sound like reverberation.
And so when you do this with actual, real-world impulse responses, they sound like impulses. So here's a classroom.
[NOISE]
An office.
[NOISE]
This is the inside of somebody's car.
[NOISE]
A forest.
[NOISE]
This is the BCS atrium.
[NOISE]
It's kind of quiet. But that's a pretty long one. 'Cause that's a big space. All right.
AUDIENCE: It's the same source every time, right?
JOSH MCDERMOTT: Yeah. So every space sounds a little different. But the point is that they always kind of sound like impulses in spaces.
AUDIENCE: The distance from the microphone is always the same?
JOSH MCDERMOTT: Yes, that's right, yeah. Yeah. All right, so we're going to make some impulse responses. And so we're going to play the same game that I was showing you at the start. So remember how, when I was trying to convince you that these spectrograms actually capture a lot of the perceptually important information, we played this game of generating a sound signal from the spectrogram, all right? So we're going to do the same thing here.
So we have a noise signal. We're going to split it up into all these different sub bands. We're going to impose decay on the sub bands, with particular rates, and particular shapes. And we're going to add them back up. And we're going to get a synthetic impulse response, right? And then we can do experiments with these things and see whether these regularities actually matter. All right?
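Here's a hedged sketch of that recipe in Python. The Butterworth filter bank, the band edges, and the decay times are illustrative assumptions, not the parameters used to build the actual stimuli.

```python
# Sketch of synthesizing an impulse response: filter noise into subbands,
# impose an exponential decay on each subband, and sum the subbands back up.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def synth_impulse_response(fs=32000, dur=1.5,
                           band_edges=(60, 250, 1000, 4000, 12000),
                           decay_times=(0.6, 1.0, 0.7, 0.2)):
    """Synthetic impulse response: per-band exponential decay imposed on Gaussian noise."""
    n = int(fs * dur)
    t = np.arange(n) / fs
    noise = np.random.randn(n)
    ir = np.zeros(n)
    for (lo, hi), tau in zip(zip(band_edges[:-1], band_edges[1:]), decay_times):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, noise)
        ir += band * np.exp(-t / tau)          # exponential decay within this band
    return ir / np.max(np.abs(ir))

ir = synth_impulse_response()                  # mids decay slowest, highs fastest
```

Reversing the decay-time profile, so that the high band hangs around longest, is one way to produce the kind of non-ecological impulse response described below.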
All right. And so the interesting thing is that, when we impose the kind of decay characteristics that we observe in the real world, exponential decay with this particular dependence on frequency, you get something that sounds like an impulse in a space.
[NOISE]
So that's a pretty big space.
[NOISE]
And what's really cool is that if you violate those regularities, so, if, instead, the high frequencies decay a lot more slowly than the mids, it no longer sounds like reverberation. And what you'll hear when I play you this is this kind of high-frequency hiss. So it's like your brain refuses to believe that this reflects reverberation.
[NOISE]
You hear that "sss" kind of thing. Interestingly, linear decay also sounds pretty weird.
[NOISE]
[NOISE]
So it doesn't sound like reverb, right? OK. So these demonstrations-- and we've done experiments to try to verify this-- seem to suggest that your brain has implicit knowledge of these regularities. And what I think is so cool about this is that there's some sense in which these things are all the same thing, right?
We just took a noise signal. And we imposed different kinds of decay, right? But unless you get it kind of more or less right, it doesn't actually sound like reverb, right? So there's this very specific part of the space that is occupied by natural reverb. And your brain seems to have knowledge of that.
OK, so the very last thing, and then we'll wrap up, is we wanted to see whether listeners could use these regularities to separate the sound source from the effects of the environment. And so the key prediction is that our ability to extract properties of the source ought to be better for impulse responses that conform to the real-world distribution.
Remember, we're talking about an ill-posed problem here. The only way you can solve it is with knowledge of those regularities. And so the notion is that, if you violate those regularities, that ought to interfere with your ability to estimate the things you're trying to estimate.
So we had people do an experiment. So in every trial of the experiment, they hear three sounds. Each of the sounds is a source convolved with an impulse response. Two of the sources are the same. And one of them is different. But all three of the impulse responses are different.
So as a consequence of this, all three of the sounds are different, right? But two of them are generated by the same source. And so the task of the listener is to say which source is different from the other two. And people were told that these things are going to be heard in reverberant conditions. OK?
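To make the trial structure concrete, here is a rough sketch, with hypothetical names, of how one odd-one-out trial could be put together from three different impulse responses; it illustrates the design rather than the actual stimulus-generation code.

```python
# Sketch of one trial: three sounds, each a source convolved with a different
# impulse response; two sounds share a source, one source is the odd one out.
import numpy as np

def make_trial(irs, fs=32000, src_dur=0.5, rng=None):
    """Return the three reverberant sounds and the position of the odd source."""
    rng = np.random.default_rng() if rng is None else rng
    n = int(fs * src_dur)
    src_same = rng.standard_normal(n)          # the source heard twice
    src_odd = rng.standard_normal(n)           # the source heard once
    odd_position = int(rng.integers(3))
    sounds = []
    for i, ir in enumerate(irs):               # all three impulse responses differ
        src = src_odd if i == odd_position else src_same
        sounds.append(np.convolve(src, ir))
    return sounds, odd_position                # the listener's task: report odd_position

# Usage (with three synthetic or measured impulse responses ir1, ir2, ir3):
# sounds, answer = make_trial([ir1, ir2, ir3])
```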
And so these are like random, synthetic sources. This is what they sound like without reverb, or with the room reverb.
[NOISE]
Right, you could probably tell the first one was different from the other two. So they're just random signals. OK. All right.
And so here are the results. So this is proportion correct as a function of the kind of reverberation that we're giving them. And so when there's no reverb, that's the dry condition. People get about 85% correct. It's kind of a hard task.
When we add what we call ecological reverb that mimics the properties of natural reverberation, people get a little bit worse. But they're still pretty good. And so the key question is, if we give them various types of non-ecological reverberation, so if it decays linearly rather than exponentially, or if the frequency dependence is altered, will they basically stay at this level? Or are they going to be worse, right?
And so we predict that if they're really relying on these real-world regularities to separate the contributions of the source and the environment, then they ought to be worse here. And, indeed, that is what we find.
So in all of these different conditions, people perform worse at this discrimination task than when the reverb reflects the regularities of real-world reverberation. All right. I'm just going to skip over this in the interest of time.
So just to summarize what I have told you today. So I've talked about some core problems of audition. And a lot of the ones that are of interest to me in my lab are what we call ill-posed, right? And a lot of interesting problems in perception have this character.
And so ill-posed problems require the brain to use prior knowledge of statistical regularities of real-world sounds and real-world acoustics. And so in the context of reverberation, we measured the distribution of environmental impulse responses that people encounter in their lives. And we found that they are highly constrained. So they always exhibit exponential decay that's frequency dependent.
And we found that people's ability to discriminate sources from reverberant audio is better for impulse responses that are faithful to that empirical distribution. And so we think this is suggestive of a separation process that uses some strong assumptions about the nature of reverberation. And we think that knowledge of reverberation is either learned from experience or maybe built into the auditory system.
So one of the interesting findings is that the acoustic properties of rural environments are qualitatively similar to those of urban environments. And so it's possible that this is something that auditory systems have been dealing with for millennia. And maybe it's something that is built into your brain.
On the other hand, it might be something that you just learn from growing up. So I want to just emphasize again that the reverb work was done by James Traer, with help from various people. Thank you all for listening. And I'm happy to answer any other questions.
[APPLAUSE]
Josh McDermott, Professor of Brain and Cognitive Sciences at MIT, describes the early stages of human auditory processing and addresses how important information about the world can be derived from sound. Combining studies of auditory perception and computational modeling, Dr. McDermott shows how the brain may take advantage of regularities in how reverberation manifests itself in the complex signal that enters the ear, in order to distinguish multiple sound sources, often referred to as the Cocktail Party problem.
Resources:
- Josh McDermott’s website
- McDermott, J. H. (2009) The cocktail party problem, Current Biology 19(22): R1024–27.
- McDermott, J. H. (2013) Audition, In Oxford Handbook of Cognitive Neuroscience Two Volume Set (Oxford Library of Psychology), edited by K. N. Ochsner and S. Kosslyn, Oxford University Press.
- Traer, J., & McDermott, J. H. (2016) Statistics of natural reverberation enable perceptual separation of sound and space, Proceedings of the National Academy of Sciences 113(48): E7856–E7865.