%0 Journal Article %J NeuroImage %D 2020 %T The speed of human social interaction perception %A Leyla Isik %A Mynick, Anna %A Pantazis, Dimitrios %A Nancy Kanwisher %X

The ability to perceive others’ social interactions, here defined as the directed contingent actions between two or more people, is a fundamental part of human experience that develops early in infancy and is shared with other primates. However, the neural computations underlying this ability remain largely unknown. Is social interaction recognition a rapid feedforward process or a slower post-perceptual inference? Here we used magnetoencephalography (MEG) decoding to address this question. Subjects in the MEG viewed snapshots of visually matched real-world scenes containing a pair of people who were either engaged in a social interaction or acting independently. The presence versus absence of a social interaction could be read out from subjects’ MEG data spontaneously, even while subjects performed an orthogonal task. This readout generalized across different people and scenes, revealing abstract representations of social interactions in the human brain. These representations, however, did not come online until quite late, at 300 ms after image onset, well after feedforward visual processes. In a second experiment, we found that social interaction readout still occurred at this same late latency even when subjects performed an explicit task detecting social interactions. We further showed that MEG responses distinguished between different types of social interactions (mutual gaze vs joint attention) even later, around 500 ms after image onset. Taken together, these results suggest that the human brain spontaneously extracts information about others’ social interactions, but does so slowly, likely relying on iterative top-down computations.

%B NeuroImage %P 116844 %8 Jan-04-2020 %G eng %U https://www.ncbi.nlm.nih.gov/pubmed/32302763 %! NeuroImage %R 10.1016/j.neuroimage.2020.116844 %0 Journal Article %J Nature Communications %D 2019 %T How face perception unfolds over time %A Dobs, Katharina %A Leyla Isik %A Pantazis, Dimitrios %A Nancy Kanwisher %X

Within a fraction of a second of viewing a face, we have already determined its gender, age and identity. A full understanding of this remarkable feat will require a characterization of the computational steps it entails, along with the representations extracted at each. Here, we used magnetoencephalography (MEG) to measure the time course of neural responses to faces, thereby addressing two fundamental questions about how face processing unfolds over time. First, using representational similarity analysis, we found that facial gender and age information emerged before identity information, suggesting a coarse-to-fine processing of face dimensions. Second, identity and gender representations of familiar faces were enhanced very early on, suggesting that the behavioral benefit for familiar faces results from tuning of early feed-forward processing mechanisms. These findings start to reveal the time course of face processing in humans, and provide powerful new constraints on computational theories of face perception.
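
As a rough illustration of the time-resolved representational similarity analysis described above, here is a minimal sketch in Python (NumPy/SciPy); the array names and the model RDM are hypothetical placeholders, not the authors' analysis code.

```python
# Minimal RSA sketch: correlate a time-resolved neural RDM with a model RDM.
# `meg` is a hypothetical array of condition-averaged MEG patterns with shape
# (n_conditions, n_sensors, n_timepoints); `model_rdm` is a hypothetical
# (n_conditions, n_conditions) model dissimilarity matrix (e.g., same/different gender).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def rsa_timecourse(meg, model_rdm):
    n_cond, _, n_times = meg.shape
    model_vec = squareform(model_rdm, checks=False)   # condensed upper triangle of the model RDM
    rho = np.zeros(n_times)
    for t in range(n_times):
        # Neural RDM at time t: correlation distance between condition patterns
        neural_vec = pdist(meg[:, :, t], metric="correlation")
        rho[t], _ = spearmanr(neural_vec, model_vec)
    return rho  # Spearman correlation between neural and model RDMs at each time point
```

A latency estimate for a given face dimension can then be read off as the first time point at which this neural-model correlation reliably exceeds zero.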

%B Nature Communications %V 10 %8 01/2019 %G eng %U http://www.nature.com/articles/s41467-019-09239-1 %N 1 %! Nat Commun %R 10.1038/s41467-019-09239-1 %0 Journal Article %J Journal of Neurophysiology %D 2018 %T A fast, invariant representation for human action in the visual system %A Leyla Isik %A Andrea Tacchetti %A Tomaso Poggio %X

Humans can effortlessly recognize others’ actions in the presence of complex transformations, such as changes in viewpoint. Several studies have located the regions in the brain involved in invariant action recognition; however, the underlying neural computations remain poorly understood. We use magnetoencephalography decoding and a data set of well-controlled, naturalistic videos of five actions (run, walk, jump, eat, drink) performed by different actors at different viewpoints to study the computational steps used to recognize actions across complex transformations. In particular, we ask when the brain discriminates between different actions, and when it does so in a manner that is invariant to changes in 3D viewpoint. We measure the latency difference between invariant and noninvariant action decoding when subjects view full videos as well as form-depleted and motion-depleted stimuli. We were unable to detect a difference in decoding latency or temporal profile between invariant and noninvariant action recognition in full videos. However, when either form or motion information is removed from the stimulus set, we observe a decrease and delay in invariant action decoding. Our results suggest that the brain recognizes actions and builds invariance to complex transformations at the same time and that both form and motion information are crucial for fast, invariant action recognition.

Associated Dataset: MEG action recognition data
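
A minimal sketch of the kind of time-resolved, viewpoint-invariant decoding analysis described above, written in Python with scikit-learn; the arrays `X`, `y`, and `view` are hypothetical placeholders, and this is not the published analysis code.

```python
# Minimal sketch: train an action classifier at each time point on trials from one
# viewpoint and test on trials from a different viewpoint; above-chance accuracy
# indicates view-invariant action information. `X` has shape
# (n_trials, n_sensors, n_timepoints); `y` holds action labels; `view` holds viewpoints.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def cross_view_decoding(X, y, view, train_view=0, test_view=90):
    train, test = (view == train_view), (view == test_view)
    acc = np.zeros(X.shape[2])
    for t in range(X.shape[2]):
        clf = make_pipeline(StandardScaler(), LinearSVC())
        clf.fit(X[train, :, t], y[train])            # train on one 3D viewpoint
        acc[t] = clf.score(X[test, :, t], y[test])   # test on another viewpoint
    return acc
```

Comparing the onset latency of this cross-view accuracy curve with that of a within-view curve gives the invariant versus non-invariant latency difference discussed in the abstract.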

%B Journal of Neurophysiology %G eng %U https://www.physiology.org/doi/10.1152/jn.00642.2017 %R https://doi.org/10.1152/jn.00642.2017 %0 Journal Article %J Annual Review of Vision Science %D 2018 %T Invariant Recognition Shapes Neural Representations of Visual Input %A Andrea Tacchetti %A Leyla Isik %A Tomaso Poggio %K computational neuroscience %K Invariance %K neural decoding %K visual representations %X

Recognizing the people, objects, and actions in the world around us is a crucial aspect of human perception that allows us to plan and act in our environment. Remarkably, our proficiency in recognizing semantic categories from visual input is unhindered by transformations that substantially alter their appearance (e.g., changes in lighting or position). The ability to generalize across these complex transformations is a hallmark of human visual intelligence, which has been the focus of wide-ranging investigation in systems and computational neuroscience. However, while the neural machinery of human visual perception has been thoroughly described, the computational principles dictating its functioning remain unknown. Here, we review recent results in brain imaging, neurophysiology, and computational neuroscience in support of the hypothesis that the ability to support the invariant recognition of semantic entities in the visual world shapes which neural representations of sensory input are computed by human visual cortex.

%B Annual Review of Vision Science %V 4 %P 403 - 422 %8 10/2018 %G eng %U https://www.annualreviews.org/doi/10.1146/annurev-vision-091517-034103 %N 1 %! Annu. Rev. Vis. Sci. %R 10.1146/annurev-vision-091517-034103 %0 Generic %D 2018 %T MEG action recognition data %A Leyla Isik %A Andrea Tacchetti %X

MEG action recognition data from Isik et al., 2018 and Tacchetti et al., 2017, provided in binned format for use with the Neural Decoding Toolbox (2018-02-13).

Associated publications:

Isik, L., Tacchetti, A., and Poggio, T. A fast, invariant representation for human action in the visual system. Journal of Neurophysiology, 2018.
Tacchetti, A., Isik, L., and Poggio, T. Invariant recognition drives neural representations of action sequences. PLoS Computational Biology, 2017.
%U https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/KFYY2M %R https://doi.org/10.7910/DVN/KFYY2M %0 Journal Article %J NeuroImage %D 2018 %T What is changing when: decoding visual information in movies from human intracranial recordings %A Leyla Isik %A Jedediah Singer %A Nancy Kanwisher %A Madsen JR %A Anderson WS %A Gabriel Kreiman %K Electrocorticography (ECoG) %K Movies %K Natural vision %K neural decoding %K object recognition %K Ventral pathway %X

The majority of visual recognition studies have focused on the neural responses to repeated presentations of static stimuli with abrupt and well-defined onset and offset times. In contrast, natural vision involves unique renderings of visual inputs that are continuously changing without explicitly defined temporal transitions. Here we considered commercial movies as a coarse proxy to natural vision. We recorded intracranial field potential signals from 1,284 electrodes implanted in 15 patients with epilepsy while the subjects passively viewed commercial movies. We could rapidly detect large changes in the visual inputs within approximately 100 ms of their occurrence, using exclusively field potential signals from ventral visual cortical areas including the inferior temporal gyrus and inferior occipital gyrus. Furthermore, we could decode the content of those visual changes even in a single movie presentation, generalizing across the wide range of transformations present in a movie. These results present a methodological framework for studying cognition during dynamic and natural vision.

%B NeuroImage %V 180, Part A %P 147-159 %8 10/2018 %G eng %U https://www.sciencedirect.com/science/article/pii/S1053811917306742 %) Available online 18 August 2017 %R 10.1016/j.neuroimage.2017.08.027 %0 Conference Paper %B AAAI Spring Symposium Series, Science of Intelligence %D 2017 %T Eccentricity Dependent Deep Neural Networks: Modeling Invariance in Human Vision %A Francis Chen %A Gemma Roig %A Leyla Isik %A X Boix %A Tomaso Poggio %X

Humans can recognize objects in a way that is invariant to scale, translation, and clutter. We use invariance theory as a conceptual basis to computationally model this phenomenon. This theory discusses the role of eccentricity in human visual processing and is a generalization of feedforward convolutional neural networks (CNNs). Our model explains some key psychophysical observations relating to invariant perception, while maintaining important similarities with biological neural architectures. To our knowledge, this work is the first to unify explanations of all three types of invariance, all while leveraging the power and neurological grounding of CNNs.

%B AAAI Spring Symposium Series, Science of Intelligence %G eng %U https://www.aaai.org/ocs/index.php/SSS/SSS17/paper/view/15360 %0 Journal Article %J J Neurophysiol %D 2017 %T A fast, invariant representation for human action in the visual system. %A Leyla Isik %A Andrea Tacchetti %A Tomaso Poggio %K action recognition %K magnetoencephalography %K neural decoding %K vision %X

Humans can effortlessly recognize others' actions in the presence of complex transformations, such as changes in viewpoint. Several studies have located the regions in the brain involved in invariant action recognition; however, the underlying neural computations remain poorly understood. We use magnetoencephalography (MEG) decoding and a dataset of well-controlled, naturalistic videos of five actions (run, walk, jump, eat, drink) performed by different actors at different viewpoints to study the computational steps used to recognize actions across complex transformations. In particular, we ask when the brain discriminates between different actions, and when it does so in a manner that is invariant to changes in 3D viewpoint. We measure the latency difference between invariant and non-invariant action decoding when subjects view full videos as well as form-depleted and motion-depleted stimuli. We were unable to detect a difference in decoding latency or temporal profile between invariant and non-invariant action recognition in full videos. However, when either form or motion information is removed from the stimulus set, we observe a decrease and delay in invariant action decoding. Our results suggest that the brain recognizes actions and builds invariance to complex transformations at the same time, and that both form and motion information are crucial for fast, invariant action recognition.

%B J Neurophysiol %P jn.00642.2017 %8 11/2017 %G eng %R 10.1152/jn.00642.2017 %0 Generic %D 2017 %T Invariant action recognition dataset %A Andrea Tacchetti %A Leyla Isik %A Tomaso Poggio %X

To study the effect of changes in view and actor on action recognition, we filmed a dataset of five actors performing five different actions (drink, eat, jump, run and walk) on a treadmill from five different views (0, 45, 90, 135, and 180 degrees from the front of the actor/treadmill; the treadmill rather than the camera was rotated in place to acquire footage from different viewpoints). The dataset was filmed on a fixed, constant background. To avoid low-level object/action confounds (e.g., the action “drink” being classified as the only videos with a water bottle in the scene) and to guarantee that the main sources of variation in visual appearance are due to actions, actors and viewpoint, the actors held the same objects (an apple and a water bottle) in each video, regardless of the action they performed. This controlled design allows us to test hypotheses on the computational mechanisms underlying invariant recognition in the human visual system without having to settle for a synthetic dataset.

More information and the dataset files can be found at https://doi.org/10.7910/DVN/DMT0PG

%8 11/2017 %U https://doi.org/10.7910/DVN/DMT0PG %0 Journal Article %J PLoS Comp. Bio %D 2017 %T Invariant recognition drives neural representations of action sequences %A Andrea Tacchetti %A Leyla Isik %A Tomaso Poggio %X

Recognizing the actions of others from visual stimuli is a crucial aspect of human perception that allows individuals to respond to social cues. Humans are able to discriminate between similar actions despite transformations, like changes in viewpoint or actor, that substantially alter the visual appearance of a scene. This ability to generalize across complex transformations is a hallmark of human visual intelligence. Advances in understanding action recognition at the neural level have not always translated into precise accounts of the computational principles underlying what representations of action sequences are constructed by human visual cortex. Here we test the hypothesis that invariant action discrimination might fill this gap. Recently, the study of artificial systems for static object perception has produced models, Convolutional Neural Networks (CNNs), that achieve human level performance in complex discriminative tasks. Within this class, architectures that better support invariant object recognition also produce image representations that better match those implied by human and primate neural data. However, whether these models produce representations of action sequences that support recognition across complex transformations and closely follow neural representations of actions remains unknown. Here we show that spatiotemporal CNNs accurately categorize video stimuli into action classes, and that deliberate model modifications that improve performance on an invariant action recognition task lead to data representations that better match human neural recordings. Our results support our hypothesis that performance on invariant discrimination dictates the neural representations of actions computed in the brain. These results broaden the scope of the invariant recognition framework for understanding visual intelligence from perception of inanimate objects and faces in static images to the study of human perception of action sequences.

Associated Dataset: MEG action recognition data

%B PLoS Comp. Bio %G eng %0 Journal Article %J PLOS Computational Biology %D 2017 %T Invariant recognition drives neural representations of action sequences %A Andrea Tacchetti %A Leyla Isik %A Tomaso Poggio %E Berniker, Max %X

Recognizing the actions of others from visual stimuli is a crucial aspect of human perception that allows individuals to respond to social cues. Humans are able to discriminate between similar actions despite transformations, like changes in viewpoint or actor, that substantially alter the visual appearance of a scene. This ability to generalize across complex transformations is a hallmark of human visual intelligence. Advances in understanding action recognition at the neural level have not always translated into precise accounts of the computational principles underlying what representations of action sequences are constructed by human visual cortex. Here we test the hypothesis that invariant action discrimination might fill this gap. Recently, the study of artificial systems for static object perception has produced models, Convolutional Neural Networks (CNNs), that achieve human level performance in complex discriminative tasks. Within this class, architectures that better support invariant object recognition also produce image representations that better match those implied by human and primate neural data. However, whether these models produce representations of action sequences that support recognition across complex transformations and closely follow neural representations of actions remains unknown. Here we show that spatiotemporal CNNs accurately categorize video stimuli into action classes, and that deliberate model modifications that improve performance on an invariant action recognition task lead to data representations that better match human neural recordings. Our results support our hypothesis that performance on invariant discrimination dictates the neural representations of actions computed in the brain. These results broaden the scope of the invariant recognition framework for understanding visual intelligence from perception of inanimate objects and faces in static images to the study of human perception of action sequences.

%B PLOS Computational Biology %V 13 %P e1005859 %8 12/2017 %G eng %U http://dx.plos.org/10.1371/journal.pcbi.1005859 %N 12 %R 10.1371/journal.pcbi.1005859 %0 Journal Article %J Proceedings of the National Academy of Sciences %D 2017 %T Perceiving social interactions in the posterior superior temporal sulcus %A Leyla Isik %A Kami Koldewyn %A David Beeler %A Nancy Kanwisher %X

Primates are highly attuned not just to social characteristics of individual agents, but also to social interactions between multiple agents. Here we report a neural correlate of the representation of social interactions in the human brain. Specifically, we observe a strong univariate response in the posterior superior temporal sulcus (pSTS) to stimuli depicting social interactions between two agents, compared with (i) pairs of agents not interacting with each other, (ii) physical interactions between inanimate objects, and (iii) individual animate agents pursuing goals and interacting with inanimate objects. We further show that this region contains information about the nature of the social interaction—specifically, whether one agent is helping or hindering the other. This sensitivity to social interactions is strongest in a specific subregion of the pSTS but extends to a lesser extent into nearby regions previously implicated in theory of mind and dynamic face perception. This sensitivity to the presence and nature of social interactions is not easily explainable in terms of low-level visual features, attention, or the animacy, actions, or goals of individual agents. This region may underlie our ability to understand the structure of our social world and navigate within it.

%B Proceedings of the National Academy of Sciences %V 114 %8 10/2017 %G eng %U http://www.pnas.org/content/early/2017/10/06/1714471114.short %N 43 %! PNAS %( PNAS October 9, 2017. 201714471; published ahead of print October 9, 2017 %R https://doi.org/10.1073/pnas.1714471114 %0 Journal Article %J Neuroimage %D 2017 %T What is changing when: Decoding visual information in movies from human intracranial recordings %A Leyla Isik %A Jedediah Singer %A Joseph Madsen %A Nancy Kanwisher %A Gabriel Kreiman %X

The majority of visual recognition studies have focused on the neural responses to repeated presentations of static stimuli with abrupt and well-defined onset and offset times. In contrast, natural vision involves unique renderings of visual inputs that are continuously changing without explicitly defined temporal transitions. Here we considered commercial movies as a coarse proxy to natural vision. We recorded intracranial field potential signals from 1,284 electrodes implanted in 15 patients with epilepsy while the subjects passively viewed commercial movies. We could rapidly detect large changes in the visual inputs within approximately 100 ms of their occurrence, using exclusively field potential signals from ventral visual cortical areas including the inferior temporal gyrus and inferior occipital gyrus. Furthermore, we could decode the content of those visual changes even in a single movie presentation, generalizing across the wide range of transformations present in a movie. These results present a methodological framework for studying cognition during dynamic and natural vision.

%B Neuroimage %G eng %U https://www.sciencedirect.com/science/article/pii/S1053811917306742 %R https://doi.org/10.1016/j.neuroimage.2017.08.027 %0 Generic %D 2016 %T Fast, invariant representation for human action in the visual system %A Leyla Isik %A Andrea Tacchetti %A Tomaso Poggio %X

Isik, L.*, Tacchetti, A.*, and Poggio, T. (*these authors contributed equally to this work)

The ability to recognize the actions of others from visual input is essential to humans' daily lives. The neural computations underlying action recognition, however, are still poorly understood. We use magnetoencephalography (MEG) decoding and a computational model to study action recognition from a novel dataset of well-controlled, naturalistic videos of five actions (run, walk, jump, eat, drink) performed by five actors at five viewpoints. We show for the first time that actor- and view-invariant representations for action arise in the human brain as early as 200 ms. We next extend a class of biologically inspired hierarchical computational models of object recognition to recognize actions from videos and explain the computations underlying our MEG findings. This model achieves 3D viewpoint-invariance by the same biologically inspired computational mechanism it uses to build invariance to position and scale. These results suggest that robustness to complex transformations, such as 3D viewpoint invariance, does not require special neural architectures, and further provide a mechanistic explanation of the computations driving invariant action recognition.

%8 01/2016 %U http://arxiv.org/abs/1601.01358 %1 arXiv:1601.01358v1 %2 http://hdl.handle.net/1721.1/100804

%0 Report %D 2016 %T Spatio-temporal convolutional networks explain neural representations of human actions %A Andrea Tacchetti %A Leyla Isik %A Tomaso Poggio %G eng %0 Generic %D 2015 %T Invariant representations for action recognition in the visual system. %A Andrea Tacchetti %A Leyla Isik %A Tomaso Poggio %B Vision Sciences Society %C Journal of vision %V 15 %U http://jov.arvojournals.org/article.aspx?articleid=2433666 %N 12 %R 10.1167/15.12.558 %0 Generic %D 2015 %T Invariant representations for action recognition in the visual system %A Leyla Isik %A Andrea Tacchetti %A Tomaso Poggio %B Computational and Systems Neuroscience %0 Generic %D 2014 %T Abstracts of the 2014 Brains, Minds, and Machines Summer Course %A Nadav Amir %A Tarek R. Besold %A Raffaello Camoriano %A Goker Erdogan %A Thomas Flynn %A Grant Gillary %A Jesse Gomez %A Ariel Herbert-Voss %A Gladia Hotan %A Jonathan Kadmon %A Scott W. Linderman %A Tina T. Liu %A Andrew Marantan %A Joseph Olson %A Garrick Orchard %A Dipan K. Pal %A Giulia Pasquale %A Honi Sanders %A Carina Silberer %A Kevin A Smith %A Carlos Stein N. de Briton %A Jordan W. Suchow %A M. H. Tessler %A Guillaume Viejo %A Drew Walker %A Leila Wehbe %A Andrei Barbu %A Leyla Isik %A Emily Mackevicius %A Yasmine Meroz %X

A compilation of abstracts from the student projects of the 2014 Brains, Minds, and Machines Summer School, held at Woods Hole Marine Biological Lab, May 29 - June 12, 2014.

%8 09/2014 %2 http://hdl.handle.net/1721.1/100189

%0 Generic %D 2014 %T Computational role of eccentricity dependent cortical magnification. %A Tomaso Poggio %A Jim Mutch %A Leyla Isik %K Invariance %K Theories for Intelligence %X

We develop a sampling extension of M-theory focused on invariance to scale and translation. Quite surprisingly, the theory predicts an architecture of early vision with increasing receptive field sizes and a high resolution fovea — in agreement with data about the cortical magnification factor, V1 and the retina. From the slope of the inverse of the magnification factor, M-theory predicts a cortical “fovea” in V1 in the order of 40 by 40 basic units at each receptive field size — corresponding to a foveola of size around 26 minutes of arc at the highest resolution, ≈6 degrees at the lowest resolution. It also predicts uniform scale invariance over a fixed range of scales independently of eccentricity, while translation invariance should depend linearly on spatial frequency. Bouma’s law of crowding follows in the theory as an effect of cortical area-by-cortical area pooling; the Bouma constant is the value expected if the signature responsible for recognition in the crowding experiments originates in V2. From a broader perspective, the emerging picture suggests that visual recognition under natural conditions takes place by composing information from a set of fixations, with each fixation providing recognition from a space-scale image fragment — that is an image patch represented at a set of increasing sizes and decreasing resolutions.

%8 06/2014 %1 arXiv:1406.1770v1 %2 http://hdl.handle.net/1721.1/100181

%0 Journal Article %J J Neurophysiol %D 2014 %T The dynamics of invariant object recognition in the human visual system. %A Leyla Isik %A Ethan Meyers %A JZ. Leibo %A Tomaso Poggio %K Adolescent %K Adult %K Evoked Potentials, Visual %K Female %K Humans %K Male %K Pattern Recognition, Visual %K Reaction Time %K visual cortex %X

The human visual system can rapidly recognize objects despite transformations that alter their appearance. The precise timing of when the brain computes neural representations that are invariant to particular transformations, however, has not been mapped in humans. Here we employ magnetoencephalography decoding analysis to measure the dynamics of size- and position-invariant visual information development in the ventral visual stream. With this method we can read out the identity of objects beginning as early as 60 ms. Size- and position-invariant visual information appear around 125 ms and 150 ms, respectively, and both develop in stages, with invariance to smaller transformations arising before invariance to larger transformations. Additionally, the magnetoencephalography sensor activity localizes to neural sources that are in the most posterior occipital regions at the early decoding times and then move temporally as invariant information develops. These results provide previously unknown latencies for key stages of invariant object recognition in humans, as well as new and compelling evidence for a feed-forward hierarchical model of invariant object recognition where invariance increases at each successive visual area along the ventral stream.

Corresponding Dataset - The dynamics of invariant object recognition in the human visual system.

%B J Neurophysiol %V 111 %P 91-102 %8 01/2014 %G eng %U http://jn.physiology.org/content/early/2013/09/27/jn.00394.2013.abstract %N 1 %R 10.1152/jn.00394.2013 %0 Generic %D 2014 %T The dynamics of invariant object recognition in the human visual system. %A Leyla Isik %A Ethan Meyers %A JZ. Leibo %A Tomaso Poggio %X

This is the dataset for the corresponding journal article, “The dynamics of invariant object recognition in the human visual system.”

The human visual system can rapidly recognize objects despite transformations that alter their appearance. The precise timing of when the brain computes neural representations that are invariant to particular transformations, however, has not been mapped in humans. Here we employ magnetoencephalography decoding analysis to measure the dynamics of size- and position-invariant visual information development in the ventral visual stream. With this method we can read out the identity of objects beginning as early as 60 ms. Size- and position-invariant visual information appear around 125 ms and 150 ms, respectively, and both develop in stages, with invariance to smaller transformations arising before invariance to larger transformations. Additionally, the magnetoencephalography sensor activity localizes to neural sources that are in the most posterior occipital regions at the early decoding times and then move temporally as invariant information develops. These results provide previously unknown latencies for key stages of invariant object recognition in humans, as well as new and compelling evidence for a feed-forward hierarchical model of invariant object recognition where invariance increases at each successive visual area along the ventral stream.

Dataset files can be downloaded at http://dx.doi.org/10.7910/DVN/KRUPXZ

11 subjects’ MEG data from Isik et al., 2014. Data are available in raw .fif format or in a Matlab raster format compatible with the Neural Decoding Toolbox (readout.info).

For Matlab code to pre-process this MEG data and run the decoding analyses, please visit

https://bitbucket.org/lisik/meg_decoding
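
For readers working in Python rather than Matlab, a minimal sketch of reading one of the raw .fif files with MNE-Python is shown below; the file name, trigger settings, and epoch window are assumptions for illustration, not part of the dataset documentation.

```python
# Minimal sketch: load a raw .fif MEG recording and epoch it around stimulus triggers.
# The file name and the epoch window are hypothetical; consult the Dataverse listing
# for the actual files and trigger scheme.
import mne

raw = mne.io.read_raw_fif("subject01_raw.fif", preload=True)
events = mne.find_events(raw)                     # assumes a standard stimulus/trigger channel
epochs = mne.Epochs(raw, events, tmin=-0.1, tmax=0.6,
                    baseline=(None, 0), preload=True)
X = epochs.get_data()        # (n_trials, n_channels, n_timepoints), ready for decoding
y = epochs.events[:, 2]      # trigger codes identifying the stimulus on each trial
```

The Matlab raster format and the decoding code linked above remain the reference pipeline; this is only an alternative entry point to the same recordings.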

%8 01/2014 %R http://dx.doi.org/10.7910/DVN/KRUPXZ