%0 Generic %D 2016 %T Learning mid-level codes for natural sounds %A Wiktor Mlynarski %A Josh H. McDermott %X

Auditory perception depends critically on abstract and behaviorally meaningful representations of natural auditory scenes. These representations are implemented by cascades of neuronal processing stages in which neurons at each stage recode outputs of preceding units. Explanations of auditory coding strategies must thus involve understanding how low-level acoustic patterns are combined into more complex structures. While models exist in the visual domain to explain how phase invariance is achieved by V1 complex cells, and how curvature representations emerge in V2, little is known about analogous grouping principles for mid-level auditory representations.

We propose a hierarchical, generative model of natural sounds that learns combinations of spectrotemporal features from natural stimulus statistics. In the first layer the model forms a sparse, convolutional code of spectrograms. Features learned on speech and environmental sounds resemble spectrotemporal receptive fields (STRFs) of mid-brain and cortical neurons, consistent with previous findings [1]. To generalize from specific STRF activation patterns, the second layer encodes patterns of time-varying magnitude (i.e. variance) of multiple first layer coefficients. Because it forms a code of a non-stationary distribution of STRF activations, it is partially invariant to their specific values. Moreover, because second-layer features are sensitive to STRF combinations, the representation they support is more selective for complex acoustic patterns. The second layer substantially improved the model's performance on a denoising task, implying a closer match to the natural stimulus distribution.
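As a rough illustration of the two-layer idea only (not the authors' implementation: the STFT front end, soft-threshold inference, random placeholder filters, and all parameter values below are assumptions), a minimal numpy sketch:

    import numpy as np

    def spectrogram(x, win=256, hop=128):
        # Log-magnitude STFT as a stand-in for the spectrogram front end.
        frames = np.lib.stride_tricks.sliding_window_view(x, win)[::hop]
        S = np.abs(np.fft.rfft(frames * np.hanning(win), axis=-1))
        return np.log(S + 1e-6).T                      # (freq, time)

    def layer1_codes(S, filters, sparsity=1.0):
        # Correlate the spectrogram with STRF-like filters along time and
        # soft-threshold: a crude stand-in for a sparse convolutional code.
        F, T = S.shape
        codes = []
        for w in filters:                              # each w: (F, tau)
            tau = w.shape[1]
            resp = np.array([np.sum(S[:, t:t + tau] * w) for t in range(T - tau + 1)])
            codes.append(np.sign(resp) * np.maximum(np.abs(resp) - sparsity, 0.0))
        return np.stack(codes)                         # (n_filters, time)

    def layer2_codes(codes, win=20):
        # Second layer: local time-varying magnitude (variance) of each
        # first-layer coefficient, invariant to the coefficients' signs.
        env = np.stack([np.convolve(c ** 2, np.ones(win) / win, mode='valid')
                        for c in codes])
        return np.log(env + 1e-6)                      # (n_filters, time)

    # Usage with random data and random placeholder "filters" (not learned features).
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16000)
    S = spectrogram(x)
    filters = [rng.standard_normal((S.shape[0], 8)) * 0.01 for _ in range(4)]
    c1 = layer1_codes(S, filters)
    c2 = layer2_codes(c1)
    print(c1.shape, c2.shape)

Squaring before pooling is what makes the second-layer code sensitive to the magnitude envelope of STRF activations rather than to their specific signed values.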

Quantitative hypotheses emerge from the model regarding the selectivity of auditory neurons characterized by multidimensional STRFs [2] and sensitivity to increasingly abstract structure [3]. The model also predicts that the auditory system constructs representations progressively more invariant to noise, consistent with recent experimental findings [4]. Our results suggest that mid-level auditory representations may be derived from high-order stimulus dependencies present in the natural environment.

%B Computational and Systems Neuroscience (Cosyne) 2016 %C Salt Lake City, UT %8 02/2016 %U http://www.cosyne.org/c/index.php?title=Cosyne2016_posters_2