Metamers of neural networks reveal divergence from human perceptual systems [video]
December 4, 2019
November 25, 2019
First author and MIT graduate student Jenelle Feather and MIT Associate Professor Josh McDermott discuss their recent paper, published in the NeurIPS 2019 proceedings, in which they investigated whether the invariances learned by deep neural networks actually match human perceptual invariances.
[MUSIC PLAYING] JOSH MCDERMOTT: Artificial neural networks have recently emerged as leading models of sensory systems. When appropriately optimized, these models perform tasks like speech recognition and object classification about as well as humans and exhibit similar patterns of behavior. And the feature spaces that they learn can be used to predict brain activity substantially better than previous models.
JENELLE FEATHER: A critical part of the representational power of contemporary neural networks is their invariance. They instantiate nonlinear functions that map many distinct stimulus examples onto the same category, thus achieving robust recognition abilities.
In this work, we investigated whether the invariances that are learned by deep neural networks actually match human perceptual invariances. We found sets of stimuli that the network represented as the same, and asked whether humans also perceive these stimuli as the same. We called these stimuli model metamers: stimuli that are physically distinct but that a model perceives to be the same.
JOSH MCDERMOTT: So the basic logic is simple. If we have a good model of some aspect of perception, say, speech recognition, then if we pick two sounds that the model judges to be the same, a human listener when presented with those two sounds should also judge them to be the same. If, instead, they judge them to be different, that indicates a clear difference between the representations in the model and those in human perception.
JENELLE FEATHER: In our paper, we evaluated both visual and auditory neural networks. Model metamers are generated by first measuring the model activations for a particular natural stimulus, such as an image or a sound. We then take a noise stimulus and use optimization tools to modify this noise input until, eventually, its activations match those of the natural stimulus.
We then consider this optimized noise to be our model metamer. The model metamer is also assigned to the same category as the natural stimulus, even though the input can be very different from the original.
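The generation procedure described here can be sketched with a toy example. In this minimal sketch, a single random leaky-ReLU layer stands in for one network stage (an assumption for illustration; the paper's models are deep image- and audio-trained networks), and gradient descent adjusts a noise input until its activations match those of a "natural" target:

```python
import numpy as np

# Toy sketch of model-metamer generation. A single random leaky-ReLU
# layer stands in for one network stage (illustration only).
rng = np.random.default_rng(0)

W = rng.normal(size=(16, 64))      # one random "network stage"
SLOPE = 0.1                        # leaky-ReLU negative slope

def act(x):
    pre = W @ x
    return np.where(pre > 0, pre, SLOPE * pre)

target = rng.normal(size=64)       # stands in for a natural stimulus
a_target = act(target)             # its activations at this stage

x = rng.normal(size=64)            # noise initialization
lr = 5e-3
for _ in range(10_000):
    pre = W @ x
    a = np.where(pre > 0, pre, SLOPE * pre)
    # Gradient of L = 0.5 * ||a - a_target||^2 with respect to x.
    grad = W.T @ ((a - a_target) * np.where(pre > 0, 1.0, SLOPE))
    x -= lr * grad

# x is now a "model metamer" of target: its activations match closely,
# yet the input itself remains far from the original stimulus.
mismatch = np.linalg.norm(act(x) - a_target) / np.linalg.norm(a_target)
distance = np.linalg.norm(x - target)
print(f"activation mismatch: {mismatch:.4f}, input distance: {distance:.2f}")
```

Because this toy stage has fewer units (16) than input dimensions (64), many physically distinct inputs produce identical activations, which is what allows a metamer to remain far from the original stimulus while matching it in the model.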
JOSH MCDERMOTT: Now, previous work has used related optimization tools to invert neural network representations, but has always used priors that constrain the resulting signals to be naturalistic. So from the standpoint of using this as a tool to evaluate these models as models of human perception, those priors could actually mask differences that might exist between the model and the human representations. And so when we generated model metamers, they were constrained only by the activations in the neural network.
JENELLE FEATHER: We generated these model metamers for three different image-trained architectures. You can see, from looking at the image demos, that the model metamers generated from the late model stages are completely unrecognizable for all of the tested models.
To quantify the extent to which the model representations actually match those of human perception, we ran a human behavioral experiment where participants had to classify the natural image and the model metamers. As you may expect from having seen the example images, humans can recognize the model metamer when it is matched to early stages of the network. But they're completely unable to recognize it when it is matched to the late model stages.
We saw the same trend in networks trained to recognize speech. Model metamers matched to late stages of the network are unrecognizable.
Not only do listeners get the word incorrect, but the metamer doesn't sound like speech at all, further suggesting that the network representations don't line up with human representations.
JOSH MCDERMOTT: Having obtained this result, we dug a bit deeper to try to understand the origins of the model metamer failures for our audio-trained networks. We explored the effect of the particular task the models were trained on and of the model architecture, and we found some modifications that increased the recognizability of the model metamers to humans. This gives us some hope that we may eventually be able to develop models that pass the metamer test and that, thus, better capture the invariances of human perception.
JENELLE FEATHER: Model metamers demonstrate a significant failure of present-day neural networks to match the invariances in the human visual and auditory systems. We hope that this work will provide a useful behavioral measuring stick to improve model representations and create better models of the human sensory systems.