Module 2 Research Update
Date Posted:
February 17, 2021
Date Recorded:
February 16, 2021
CBMM Speaker(s):
Gabriel Kreiman,
Mengmi Zhang,
Will Xiao
Speaker(s):
Jie Zheng
Description:
Abstracts:
Speaker: Mengmi Zhang
Title: The combination of eccentricity, bottom-up, and top-down cues explain conjunction and asymmetric visual search
Abstract: Visual search requires complex interactions between visual processing, eye movements, object recognition, memory, and decision making. Elegant psychophysics experiments have described the task characteristics and stimulus properties that facilitate or slow down visual search behavior. En route towards a quantitative framework that accounts for the mechanisms orchestrating visual search, here we propose an image-computable biologically-inspired computational model that takes a target and a search image as inputs and produces a sequence of eye movements. To compare the model against human behavior, we consider nine foundational experiments that demonstrate two intriguing principles of visual search: (i) asymmetric search costs when looking for a certain object A among distractors B versus the reverse situation of locating B among distractors A; (ii) the increase in search costs associated with feature conjunctions. The proposed computational model has three main components, an eccentricity-dependent visual feature processor learnt through natural image statistics, bottom-up saliency, and target-dependent top-down cues. Without any prior exposure to visual search stimuli or any task-specific training, the model demonstrates the essential properties of search asymmetries and slower reaction time in feature conjunction tasks. Furthermore, the model can generalize to real-world search tasks in complex natural environments. The proposed model unifies previous theoretical frameworks into an image-computable architecture that can be directly and quantitatively compared against psychophysics experiments and can also provide a mechanistic basis that can be evaluated in terms of the underlying neuronal circuits.
Speaker: Jie Zheng
Title: Neurons detect cognitive boundaries to structure episodic memories in humans
Abstract: While experience is continuous, memories are organized as discrete events. Cognitive boundaries are thought to segment experience and structure memory, but how this process is implemented remains unclear. We recorded the activity of single neurons in the human medial temporal lobe during the formation and retrieval of memories with complex narratives. Neurons responded to abstract cognitive boundaries between different episodes. Boundary-induced neural state changes during encoding predicted subsequent recognition accuracy but impaired event order memory, mirroring a fundamental behavioral tradeoff between content and time memory. Furthermore, the neural state following boundaries was reinstated during both successful retrieval and false memories. These findings reveal a neuronal substrate for detecting cognitive boundaries that transform experience into mnemonic episodes and structure mental time travel during retrieval.
Speaker: Will Xiao
Title: Adversarial images for the Primate Brain
Abstract: Deep artificial neural networks have been proposed as a model of primate vision. However, these networks are vulnerable to adversarial attacks, whereby introducing minimal noise can fool networks into misclassifying images. Primate vision is thought to be robust to such adversarial images. We evaluated this assumption by designing adversarial images to fool primate vision. To do so, we first trained a model to predict responses of face-selective neurons in macaque inferior temporal cortex. Next, we modified images, such as human faces, to match their model-predicted neuronal responses to a target category, such as monkey faces, with a small budget for pixel value change. These adversarial images elicited neuronal responses similar to the target category. Remarkably, the same images fooled monkeys and humans at the behavioral level. These results call for closer inspection of the adversarial sensitivity of primate vision, and show that a model of visual neuron activity can be used to specifically direct primate behavior.
GABRIEL KREIMAN: So it's a great pleasure to introduce today's three speakers for an update on what's happening in module 2 in CBMM. I'm going to be very brief because I want to listen to all three of these talented scholars. But before I do so, I just want to give a big shout-out to Dr. Jerry Wang, who many of you know, who has been working in module 2 and successfully defended his thesis today. So now he's our youngest PhD in the lab, in module 2, and probably in CBMM right now. Congratulations, Jerry, if you're in the audience.
So very briefly, just to recap module 2, we're interested in how we can flexibly use visual information to answer an essentially infinite number of questions about an image. Take an image like this one: in a very short amount of time, we can make a lot of inferences and answer questions about how many shoes there are, who these people are, what they are doing, where Obama's foot is, maybe even what happened before, and perhaps understand why this image is meant to be humorous.
Borrowing a term that was coined by Shimon Ullman, we think this is done through computational routines, visual routines that can extract visual information. Importantly, they can be reused in a compositional and flexible manner. That's one of the key questions we're interested in understanding in module 2. Those include quite a broad and potentially large number of different types of routines, and we're trying to make progress towards defining what they are, understanding the neuronal mechanisms that implement these types of routines, and instantiating them in biologically-inspired computational models.
So today we have three talented young people who are going to talk about work in progress in this domain. Actually, the order here is not quite right: we start with Mengmi first, then Jie, and finally Will. Mengmi is going to tell us about one of the most basic routines, which has to do with the role of attention and eye movements in sequentially understanding a visual scene. Then Jie is going to talk about the transfer of visual information into episodic memory and how our brains detect boundaries to segment information and form episodic memories. And finally Will will talk about stress-testing computational models to understand whether humans and non-human primates are susceptible to the types of adversarial images that have been so important in defining the limits of current convolutional neural networks.
So without further ado, I'm going to stop sharing and invite Mengmi to start with her presentation.
MENGMI ZHANG: So, hi everyone. My name is Mengmi. I'm a second-year post-doc from the Kreiman Lab, and these are my collaborators. In particular I want to mention Shashi, a very talented college student who has done a majority of the work I'm about to present. So let me first start by briefly introducing what visual search is and why we need it. Visual search is a very challenging problem in natural vision. For example, sometimes we have to find our car keys in our living room, or look for a friend in a parking lot.
And here is an example where I'm showing a subject. He's asked to search for a calculator, and he constantly shifts his attention, as indicated by the yellow dot. Eventually he manages to find the calculator in the red box. The study of visual search has a long history in neurophysiology and psychology. In this work our goal is to put together pieces of evidence from neurophysiology and come up with a computational model for visual search, such that we can quantitatively assess its behavior and compare it with human psychophysics. This will help us gain a better understanding of the neuronal circuits involved in visual search in the brain.
So, there are two fundamental properties in visual search. The first one is called search asymmetry, so let me explain what I mean by that. In the first experiment, what you are seeing is that the target and the distractors differ in terms of their lighting condition. For example, the target is lit from the left, whereas the distractors are lit from the right, the opposite direction.
Here is a second set of experiments, where the lighting condition becomes vertical. In psychology, scientists have found that we as humans are better at searching for targets under vertical lighting conditions than under horizontal lighting conditions. This is intriguing to me because it leads to the question of whether the human visual system has feature preferences, and if yes, where this preference comes from. Is it simply because our eyes are used to seeing more objects under vertical lighting, given that the sun is always up in the sky and we therefore tend to see more objects lit vertically?
This is the second fundamental property in visual search, which is called feature conjunction search. In the first row, I'm showing conditions where the distractor and the target differ in only one feature. It could be color, orientation, or size. In the second row, the distractor and the target differ in more than one feature. This is the feature conjunction search condition.
And scientists have found that we as humans are worse at searching in the feature conjunction conditions compared with the simpler conditions, where the target and distractor differ in only one feature. So with these two fundamental properties of visual search in mind, let me give you an overview of our proposed computational model for visual search, which is called Eccentricity Net. The same as in the psychophysics experiments, the Eccentricity Net is first presented with a target image. Next, the Eccentricity Net is presented with a search image.
We use 2D convolutional networks (convnets) that have been pre-trained on ImageNet for object recognition to extract the feature maps. We can then use the feature maps of the search image to predict the bottom-up saliency map. Similarly, we can use the stored abstract representation of the target to modulate the feature maps of the search image across multiple layers, resulting in a top-down modulation map.
Next, we linearly combine these two maps to generate the overall attention map. Then we use a winner-take-all mechanism to choose the location with the maximum attention value on this overall attention map as the next fixation location. This process iterates until the Eccentricity Net finds the target. If you have been to my previous talks, you probably recognize that this architecture is very similar to our previous work, and that's true. But I want to highlight three very important components that are new in the current model.
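Before going through those three components, here is a minimal sketch of the fixation loop just described: linearly combining the bottom-up and top-down maps and selecting fixations by winner-take-all. The weighting, the inhibition-of-return step, and all names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def run_search(bottom_up, top_down, target_mask, w=0.5, max_fixations=50, ior_radius=20):
    """bottom_up, top_down: HxW maps; target_mask: HxW boolean mask of the target."""
    attention = w * bottom_up + (1.0 - w) * top_down          # linear combination of the two maps
    fixations = []
    for _ in range(max_fixations):
        y, x = np.unravel_index(np.argmax(attention), attention.shape)  # winner-take-all
        fixations.append((y, x))
        if target_mask[y, x]:                                  # target fixated: search ends
            break
        attention[max(0, y - ior_radius):y + ior_radius,
                  max(0, x - ior_radius):x + ior_radius] = -np.inf  # inhibition of return (assumed)
    return fixations
```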
So the first component we introduce in this new model is the bottom-up saliency model. What we did is, given any unit on the feature maps of the search image, we compute its self-information. The higher the self-information, the more salient that location is. The second component that is different from our previous model is the top-down modulation: instead of performing top-down modulation only at the top of the convnet, here we apply top-down modulation across multiple layers. The third component, which is new, is in the pre-trained convnet itself.
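For the first component, a rough sketch of a self-information-based saliency map; the Gaussian density estimate over feature vectors is an assumption used only to illustrate the idea that rarer features are more salient, not necessarily the estimate used in the model.

```python
import numpy as np

def self_information_saliency(features):
    """features: C x H x W feature maps of the search image.
    Saliency at each location = -log p(feature vector at that location),
    with p estimated from the image's own feature statistics."""
    C, H, W = features.shape
    X = features.reshape(C, -1).T                         # (H*W) x C feature vectors
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(C)      # regularized covariance
    inv = np.linalg.inv(cov)
    d = X - mu
    log_p = -0.5 * np.einsum('ij,jk,ik->i', d, inv, d)    # Gaussian log-density, up to a constant
    return (-log_p).reshape(H, W)                         # higher self-information = more salient
```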
I'm going to talk a little bit more about the third component. I'm pretty sure most of us are very familiar with the standard 2D convnet architecture. But here, if we zoom in on the third pooling layer, let me explain what I mean by an eccentricity-dependent pooling layer. Given any unit G on pooling layer three, we compute its Euclidean distance from the center fixation location. The further the distance, the larger the receptive field size of unit G, as indicated by the different colors on the right.
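A minimal sketch of eccentricity-dependent pooling: the pooling window grows with each unit's distance from the fixation center. The linear growth rate and base size below are illustrative values, not the paper's parameters.

```python
import numpy as np

def eccentric_pool(feature_map, base_radius=2, gain=0.05):
    """Average-pool each location with a window whose radius grows linearly
    with eccentricity (distance from the fixation at the image center)."""
    H, W = feature_map.shape
    cy, cx = H // 2, W // 2
    out = np.empty_like(feature_map, dtype=float)
    for y in range(H):
        for x in range(W):
            ecc = np.hypot(y - cy, x - cx)                 # eccentricity of this unit
            r = int(base_radius + gain * ecc)              # receptive field radius grows with eccentricity
            out[y, x] = feature_map[max(0, y - r):y + r + 1,
                                    max(0, x - r):x + r + 1].mean()
    return out
```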
All right, so here we are reproducing the eccentricity versus receptive field size plot from [INAUDIBLE], where they measured receptive field sizes in macaque monkey brains. There are two observations to make in this figure. The first is that, within the same visual area, as eccentricity increases, the receptive field size increases. And if we compare across visual areas, neurons in V4 tend to have larger receptive field sizes than neurons in V1 and V2.
So now let's see whether the Eccentricity Net reproduces this. What we did is quite similar to what they did with monkeys: we present a black image with a white square on top of it to the Eccentricity Net, and ask the Eccentricity Net to fixate at the center of the image. Then we slide the white square across the entire image, so that we can record the responses, or activation values, of each unit in the Eccentricity Net.
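A simplified sketch of that receptive-field mapping procedure; the probe size, threshold, and the `model_unit` callable are hypothetical stand-ins for the actual model and analysis.

```python
import numpy as np

def map_receptive_field(model_unit, image_size=224, probe=8):
    """Slide a small white square over a black image, record the unit's activation
    at each probe position, and estimate the receptive field from the activated region."""
    span = image_size - probe
    responses = np.zeros((span, span))
    for y in range(span):
        for x in range(span):
            img = np.zeros((image_size, image_size), dtype=np.float32)
            img[y:y + probe, x:x + probe] = 1.0        # white square on a black background
            responses[y, x] = model_unit(img)          # activation of the unit of interest
    active = responses > 0.5 * responses.max()         # region above half-maximum response
    rf_size = np.sqrt(active.sum())                    # approximate side length of the receptive field
    return rf_size, responses
```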
And from there we can compute the corresponding receptive field sizes, and this is what we get. I would say that qualitatively the three curves match the monkey data quite well. Just to give you a better sense of the receptive field sizes of the units in each individual pooling layer, this is how they look. All right, so in this project we are interested in studying visual search tasks, and we want to test whether Eccentricity Net can transfer the knowledge learned in object recognition to visual search.
Therefore we loaded the weights pre-trained on ImageNet for object recognition, and we did not do any fine-tuning or retraining of the pre-trained convnet. This is a visualization of an example scanpath predicted by Eccentricity Net. If you look at subfigure A, Eccentricity Net starts the search at the center of the image. It then takes the Eccentricity Net only one fixation to find the target in the vertical lighting condition, whereas it takes the model three fixations to find the target in the horizontal lighting condition.
Similarly, we can get example scanpaths in the feature conjunction search experiments. All right, so back in the old days in psychology, eye-tracking technology was not available. Therefore scientists used reaction time in milliseconds to measure how long it takes a subject to find the target. Since our model generates a sequence of fixations, we ran a separate experiment to map the number of fixations to reaction time. After that mapping, here's what we get.
So the dashed lines indicate the performance of the model, and the solid lines indicate the performance of humans. The blue curve denotes the vertical lighting condition; the red curve denotes the horizontal lighting condition. There are two observations we can make from this comparison. The first is that the red curve is always above the blue curve. That means both humans and the model are better at searching for targets in the vertical lighting condition than in the horizontal lighting condition.
The second observation is that if we compare the slopes of the blue and red curves, the blue curve is relatively flat. That means the number of items on the display has little effect on reaction time, and this is true for both humans and the model. These are the corresponding results for humans and the model in the feature conjunction search experiments, where we see similar conclusions. That is, the blue and green curves are always below the red curve, meaning humans and the model are better at searching for simple targets than in the feature conjunction conditions. And the number of items has little effect on reaction time in the simpler conditions compared with the feature conjunction condition.
In psychology, scientists also introduced another measurement, called the search slope, to evaluate the humans' and the model's performance over all conditions from all experiments. Here, the x-axis is the human search slope, in reaction time per item, and the y-axis is the model's. Due to time constraints I didn't introduce all six experiments we did, but here you can get a brief overview of the six experiments included in the search asymmetry condition.
So there are six experiments in total and two conditions per experiment; those are the corresponding 12 markers on the plot. The dashed line indicates the identity line, and the dotted lines connect the two conditions from the same experiment. Most of the markers lie around the identity line, which means there is a high linear correlation, and it seems that the humans and the model match pretty well. A similar analysis can be applied to the feature conjunction experiments.
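For reference, the search slope is simply the slope of reaction time as a function of the number of display items; a minimal sketch, where the example numbers are made up:

```python
import numpy as np

def search_slope(set_sizes, reaction_times_ms):
    """Search slope = increase in reaction time per added display item,
    i.e. the slope of a least-squares line through RT vs. set size."""
    slope, _intercept = np.polyfit(set_sizes, reaction_times_ms, 1)
    return slope  # ms per item; near zero for 'pop-out' search, larger for conjunction search

# Example with made-up numbers: search_slope([4, 8, 12, 16], [510, 530, 545, 565]) ≈ 4.5 ms/item
```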
All right, so far I have been talking about the two fundamental properties of visual search, and most of the time we have been evaluating Eccentricity Net on very simplistic stimuli, for example searching for Ts among Ls, or for simple shapes. The next step is to test whether Eccentricity Net can generalize to naturalistic settings. These are three experiments from our previous projects. The three datasets include object arrays, natural images, and the finding-Waldo experiment.
The first evaluation metric we introduce here is the cumulative search performance. It tells you the probability of finding the target within a fixed number of fixations. Although Eccentricity Net can only access very limited information from certain parts of the image due to eccentricity, it still does its job: it performs almost as well as humans.
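A minimal sketch of the cumulative search performance metric as described; the trial bookkeeping below is an assumption:

```python
import numpy as np

def cumulative_search_performance(fixations_to_find, max_fixations=10):
    """fixations_to_find: per-trial number of fixations needed to find the target
    (np.inf for trials where it was never found). Returns P(found within k fixations)
    for k = 1..max_fixations."""
    n = np.asarray(fixations_to_find, dtype=float)
    return np.array([(n <= k).mean() for k in range(1, max_fixations + 1)])
```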
The second evaluation metric is the scanpath similarity score. It tells you how similar two scanpaths are in terms of their temporal and spatial dynamics. We can see that Eccentricity Net did a better job than competing models; it has a higher score. The last aspect we evaluated Eccentricity Net on is the saccade size distribution. It is interesting that, without imposing additional constraints such as muscular constraints, Eccentricity Net yields a saccade size distribution similar to that of humans.
So, to summarize, our proposed model approximates human behavior in a variety of aspects. This includes search asymmetry, conjunction search, and even naturalistic settings. So yeah, I'm happy to take any questions.
WILL XIAO: Hi, Mengmi, just a quick question. I'm curious how the fixation size distribution compares to the IVSN model, since you showed it for your new model but not--
MENGMI ZHANG: What do you mean by the [? scanpaths? ?]
WILL XIAO: Sorry, sorry, not fixation, the saccade size, so the previous slide.
MENGMI ZHANG: Oh yeah, we do have that in the supplementary figures. So for IVSN the distribution looks totally different.
WILL XIAO: I see, thanks.
TIAGO MARQUES: Hi, Mengmi, nice work. I have a question about-- so you mentioned that the net was pre-trained on ImageNet, and then you observe these differences, for example for the orientation, in terms of reaction time and the number of fixations. So I was wondering whether you expect that all of that comes just from the natural statistics of the images on ImageNet, so that if you trained the model with randomized image orientations, you would lose that effect, no? And whether you've tried that hypothesis? Just retraining the model, but now on rotated images from ImageNet, and seeing whether you still have that effect?
MENGMI ZHANG: Sorry, Tiago, so would you mind clarifying what do you mean by effect here?
TIAGO MARQUES: So for the first experiment, where you showed that detecting the vertically lit object was faster than the horizontally lit one, right?
MENGMI ZHANG: Are you referring to the horizontal versus vertical lighting condition? Yeah, that's a good question. Just to rephrase your question to make sure I understand: you're asking, if we retrained this model using rotated images from ImageNet, would this asymmetry effect be different?
Yeah, this question is fantastic, because I personally think this asymmetry is somehow inherited from the object recognition task. We haven't tested this, but yes, I think this is an interesting experiment that we should definitely try. And I wouldn't bet on the effect flipping if, for example, we rotated the images 180 degrees.
TIAGO MARQUES: Well, I mean, I guess if you just rotate the image 90 degrees it's obvious, because everything is rotated. What I was saying is, if for each training image you just randomly rotate it, you break whatever preferred orientations exist in the natural statistics of the images. So you would expect, if you trained the model under that situation, this effect to be gone, right? No asymmetry across these orientations?
MENGMI ZHANG: Yep, excellent question.
AUDIENCE: Hi, Mengmi, great talk. So my question is about search asymmetry per se, because my understanding of search asymmetry is that you have the same two objects, where you make one of them the target in one situation and make the same thing the distractor in the other situation, and you find differences in reaction time, right? That is what search asymmetry is: for example, searching for an O among Qs versus a Q among Os.
MENGMI ZHANG: Well yeah, I agree, yeah.
AUDIENCE: But did you find something similar when you used your model to test that--
MENGMI ZHANG: Let me just pull up the slide that is relevant to your question. Yeah, so here, these are the six experiments that we did in terms of search asymmetry. But I would like to clarify that, for example, searching for an L among Ts versus a T among Ls produces this search asymmetry effect, yeah.
AUDIENCE: OK, cool, thank you.
MENGMI ZHANG: And this also happens in the orientation.
GABRIEL KREIMAN: OK, thank you very much, Mengmi. If there are no further questions, then maybe we can move on to Jie.
JIE ZHENG: Hi, my name is Jie, so you can call me JZ. I'm a post-doc from Dr. Gabriel Kreiman's lab. Today I'm going to share some of our recent findings about these interesting neurons we found in humans that detect cognitive boundaries to structure episodic memories. As we all know, our lives unfold over time, weaving rich, dynamic, and multisensory information into a continuous experience.
However, we tend to remember this continuous experience as a set of discrete events, which serve as the building blocks of our autobiographical memories. One example: we don't necessarily remember a two-hour movie frame by frame, but instead we remember it as a set of salient moments or salient events, as depicted here.
Then the fundamental question becomes: what defines an event, or what determines the onset and offset of that event? Numerous computational and theoretical works have proposed that the transition from continuous experience into discrete mnemonic episodes relies on the detection of cognitive boundaries. Before I jump into the neural evidence supporting this cognitive boundary detection concept, I want to step back a little bit to talk about another type of boundary, the spatial boundary, which has been well studied in rodent electrophysiology.
So here I'm showing an example: border cells, reported in a 2008 Science paper from the Moser lab, where they found that neurons in the hippocampal network increase their activity when the rodent approaches a spatial boundary in an open space. And besides encoding spatial boundaries in a fixed open space, like the walls here, they also found that neurons in the hippocampal network are capable of capturing structural information from more complex spatial environments.
Here I'm showing three grid cells, whose firing patterns look like a grid when the rodent runs around in this open area. But now, if we put walls into the same open area to create a hairpin maze and ask the rodents to run from one side to the other, the grid-like firing patterns become segmented into repeated sub-maps, with the highest firing rates most likely occurring around the turning points of the walls.
Both of these pieces of evidence suggest that neurons in the hippocampal network are capable of capturing the structural information of a spatial environment. More interestingly, in a recent study from Tonegawa's group, in which they trained rodents to run around the same spatial environment repeatedly, as shown here for four laps, they found neurons in the hippocampal network that increased their response strongly on a specific lap, irrespective of spatial location.
Moreover, they found that this lap-specific firing phenomenon was highly transferable: when they moved the same rat from the square maze to a different maze, the circular maze here, these neurons consistently showed the same firing on the specific laps, irrespective of the spatial change. This finding suggests that neurons in the hippocampal network may also be capable of capturing the event structure of non-spatial scenarios.
So all these findings inspired us to do this work, trying to figure out whether there are specific neurons in the hippocampal network signaling cognitive boundaries in humans as well. To address these questions, we developed a task with three different phases. During the encoding phase we present a series of clips without any sound to the subjects. The key point here is that in those clips we embed different types of boundaries. As shown here, we have no boundary, which is just a clip with a continuous shot from a movie without any visual cuts or editing.
For the soft boundaries, we are referring to scene cuts where the editors change the camera angle, but the clips are still from the same movie. And for the hard boundaries, we manually created them by taking clips from two different movies and putting them together. As you may notice, both the soft boundary and the hard boundary contain visual discontinuities, but only the hard boundaries actually have a jump in the narrative.
And one thing to mention: since we don't have any real boundaries in the no-boundary clips, we manually defined the middle of the clip as a virtual boundary, just for comparison purposes in the analysis. After subjects watched all the clips, 90 of them, 30 clips for each of the no-boundary, soft-boundary, and hard-boundary conditions, we evaluated their memories of those clips in two different ways.
In the scene recognition task, we extracted frames from the clips they had seen in the encoding phase, and frames from clips they hadn't seen, and asked them to decide whether each frame was old or new. In the time discrimination task, we extracted two frames crossing a given type of boundary, presented them side by side, and asked the subjects to determine whether the left or the right frame occurred first in the original clip they had seen during the encoding session.
Besides the binary choice, we also collected measurements of their confidence levels, just to make sure they had good-quality memories. So far we have had 20 patients with drug-resistant epilepsy perform this task while electrodes implanted in their brains recorded neural signals. First, let's look at the behavioral data from all the subjects.
For the scene recognition task, if we plot subjects' memory performance in terms of their recognition accuracy, their response time, and their confidence levels, we actually don't find a significant difference across the three boundary types. However, if we take a closer look, for example at the accuracy, and split the trials within each boundary type based on the distance between the target frames extracted from the clips and the preceding boundary, what we find is that frames right after a soft or hard boundary tend to be remembered better than frames further away.
And we did not observe this distance effect in the no-boundary condition. Then how about time discrimination? If we plot the same measurements for the time discrimination task, in terms of accuracy, response time, and confidence level, what we see is that subjects have worse time discrimination accuracy, take longer to respond, and are less confident about their choice when they try to recall the temporal order of two frames that cross a hard boundary in the original clip.
So it seems that boundaries influence both scene recognition and time discrimination memory. Then what are the neural signatures that could potentially explain these memory effects we observed at the behavioral level? In terms of the neural data, we recorded single-unit activity from the human subjects using Neuralynx systems and hybrid micro-macro electrodes.
A typical procedure is that the subjects first have a pre-implantation MRI scan, and based on that the neurosurgeons and neurologists do surgical planning to determine where to insert the electrodes. During the surgery they insert the macro electrodes first and use them as a guide to insert the microwires. After that, they do a post-implantation MRI or CT to help us localize where those electrodes are.
And here I'm just showing the signals we recorded. On the top are the raw signals. You can see that the local field potential data is embedded with small spikes, which are the single-unit data we record from humans. By applying band-pass filters and spike-sorting algorithms, we are able to capture single-neuron activity from the raw data. Among the 20 patients we have recorded so far, we detected approximately 1,000 neurons from different brain regions.
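For readers unfamiliar with that preprocessing step, a much-simplified sketch of extracting putative spikes from a broadband trace; the band edges, threshold, and sampling rate are typical values, not necessarily those used in this study.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def detect_spikes(raw_trace, fs=30000.0, low=300.0, high=3000.0, k=4.0):
    """Band-pass the broadband signal and flag negative threshold crossings as
    putative spikes; full spike sorting would then cluster their waveforms."""
    b, a = butter(3, [low / (fs / 2), high / (fs / 2)], btype='band')
    filtered = filtfilt(b, a, raw_trace)
    noise = np.median(np.abs(filtered)) / 0.6745           # robust (MAD-based) noise estimate
    crossings = np.where(np.diff((filtered < -k * noise).astype(int)) == 1)[0]
    return crossings / fs                                   # spike times in seconds
```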
For the analyses and results I'm going to present, we focus on three specific regions, the hippocampus, amygdala, and parahippocampal cortex, because they are known to play critical roles in episodic memory and boundary detection. As I mentioned before, during the encoding session subjects watched a series of clips embedded with different types of boundaries.
The first thing we looked at in the neural data was whether any neuronal activity can differentiate these different types of boundaries. Here I'm showing a raster plot, which we generate for each individual neuron we detect. Among all the neurons we have recorded so far, we found two types of neurons that signal the different types of boundaries mentioned before.
One of them is shown in the example here, which we call boundary neurons. Each dot represents a spike of this specific neuron, and the trials are color-coded by boundary condition. What you see is that this neuron increases its firing rate right after soft-boundary and hard-boundary onsets, but not in the no-boundary condition.
We also observed another type of neuron, shown in this example, which only increases its firing rate right at the hard boundaries but not for the soft-boundary and no-boundary conditions; we named these event neurons, or event cells. So far, across the 580 neurons we recorded in those three regions, we were able to find 42 boundary cells, and they all have very consistent firing patterns, as shown here, where each row represents one boundary cell we detected.
All of them consistently respond to soft and hard boundaries, but not to the no-boundary condition. Across these 580 neurons in the medial temporal lobe we also found 36 event cells, which respond very consistently only to the hard-boundary condition. This plot also contains some interesting information: if we look at these two cell populations, we notice that each neuron responds either to the hard boundary only, or to both soft and hard boundaries.
So the question is, as I mentioned before, both soft and hard boundaries contain visual transitions, or visual discontinuities, during the clips, but the no-boundary clips do not. Is the absence of neuronal firing in the no-boundary condition simply due to the lack of visual transitions in those clips? To address this question, we plotted the neural activity of these two cell populations aligned to the clip onsets and offsets, which contain the same kind of abrupt visual transitions, because the display changes either from a fixation cross to a movie or from the movie back to a fixation cross.
As shown here, neither population shows a firing rate increase as strong as the one they show for the soft-boundary and hard-boundary conditions. Another interesting observation in this plot: if we compare the boundary cells and the event cells, which both respond to the hard-boundary condition, you can probably already see that there is a slight delay between the responses of these two cell populations.
If we further quantify this temporal relationship by computing the average firing rate within each cell population, what you see is that the pink line is the boundary cells' average firing rate trace and the purple one is from the event cells. Apparently the boundary cells respond a little bit earlier, approximately 100 milliseconds earlier than the event cells.
Then the question is, what could be driving this sequential firing between the two cell populations? To address this, we looked into the spatial locations of the B cells and E cells in the brain. As shown in this table, the majority of the boundary cells are located in the parahippocampal gyrus, while most of the event cells are located in the hippocampus. And this spatial specificity remains true when we extend our search to all the other brain regions where we have access to single-neuron recordings.
This suggests that in this sequential firing, the earlier response of the B cells might reflect earlier responses in the visual streams or visual areas, while the later response of the event cells, which are mostly in the hippocampus, might reflect later computations in the hippocampus. Then, besides purely signaling which type of boundary occurred, we also found that these boundary cells and event cells reflect subjects' subsequent memory performance early on, already during the encoding session.
Here is the same boundary cell I showed you before, which increased its firing for both soft and hard boundaries. If we now split this plot based on subjects' subsequent memory performance in scene recognition, that is, whether they got the trial correct or incorrect, what we find is that the increased firing rate in the soft-boundary and hard-boundary conditions comes mainly from the correct trials, not from the incorrect trials.
If we further quantify these firing rate modulations across all the boundary cells, we consistently see higher firing rates in the correct soft- and hard-boundary trials but not in the incorrect soft- and hard-boundary trials. And we don't see any difference between correct and incorrect trials in the no-boundary clips. Then how about the event cells? This is the same event cell I showed you before, which increases its firing rate only for the hard boundaries.
We converted this raster plot into a phase plot by converting the time at which each spike occurs into its phase relative to the theta oscillation we observed. So now this plot becomes a phase plot, with the x-axis still being time but the y-axis being the theta phase. If we further split this plot based on subjects' subsequent memory performance in time discrimination, what we observe is that for the correct soft- and hard-boundary trials we see phase clustering, but not for the other four conditions.
We can further quantify this phase clustering by computing the mean resultant length. As shown here, each arrow represents the theta phase at which a spike occurred. If all the spikes tend to occur at a specific phase, as the bottom plot shows, we expect a very large mean resultant length, shown by the blue arrow, compared to the case of firing at random phases, where we expect a shorter mean resultant length, shown in gray.
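The mean resultant length is standard circular statistics; a minimal sketch, where the example at the bottom uses made-up phases:

```python
import numpy as np

def mean_resultant_length(spike_phases_rad):
    """Concentration of spike theta phases: 1 = all spikes at the same phase,
    0 = phases spread uniformly around the cycle."""
    phases = np.asarray(spike_phases_rad)
    return np.abs(np.mean(np.exp(1j * phases)))

# Example with made-up phases: tightly clustered phases give a value near 1,
# e.g. mean_resultant_length(np.random.normal(np.pi, 0.2, 200)) is about 0.98.
```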
We then quantified this phase-clustering effect across all 36 event cells. We consistently see this modulation of spike timing in the correct trials for soft and hard boundaries, but not in the incorrect trials, and not in the no-boundary condition. So, as I'm showing you, the B cells and E cells seem to use spike rate modulation or spike timing modulation, respectively, to carry specific memory information about scene recognition or time discrimination.
We also tried the other combinations to see whether each population represents the other type of information through the other modality, and we don't find any significant results. This suggests that these two populations of neurons carry these distinct memory aspects through their own individual strategies. So those are the single-neuron data in response to the different boundaries.
Now we look at the population level, at whether all the neurons in the medial temporal lobe also respond to the different boundaries. To address this question, we form a neuronal dynamics matrix by putting together the spike data from all the neurons we recorded in the medial temporal lobe. We then perform dimensionality reduction using PCA, reducing this high-dimensional information to a three-dimensional space, and plot the neural trajectories at each time point while the subjects view the different clips.
As shown here, each dot in the low-dimensional PC space represents the neural state at one time point. By applying a sliding window we can plot the neural trajectories; here I'm plotting the trajectories for the no-boundary, soft-boundary, and hard-boundary conditions before the boundary occurs. As you can see, before the boundary they are loosely tangled with each other. If we extend the analysis window forward, you can see that in the soft-boundary and hard-boundary conditions, after the boundary occurs (marked by the black dots), the neural state shifts largely from the original state to a state further away.
We can further quantify this neural state shift by computing the neural distance, or what we call the multi-dimensional distance, as the Euclidean distance between the neural state at any time point and the state at the boundary, which converts this 3D plot into a 2D diagram. We can clearly see this neural state shift for both soft and hard boundaries, which are the bumps in the blue and red curves. We also looked at whether this neural state shift correlates with subjects' memory performance, so we plotted the accumulated multi-dimensional distance in neural space against the scene recognition accuracy and the time discrimination accuracy.
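A minimal sketch of that population analysis, projecting the population activity into a low-dimensional state space with PCA and measuring the Euclidean distance of each time point from the boundary state; the matrix layout and the choice of reference state are assumptions, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

def neural_state_shift(firing_rates, boundary_bin, n_components=3):
    """firing_rates: time_bins x neurons matrix of binned population activity
    around a clip. Returns the multi-dimensional (Euclidean) distance of every
    time bin from the neural state at the boundary."""
    states = PCA(n_components=n_components).fit_transform(firing_rates)  # project to 3D state space
    boundary_state = states[boundary_bin]
    return np.linalg.norm(states - boundary_state, axis=1)               # distance per time bin
```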
Interestingly, we found a positive correlation between the neural state shift and scene recognition accuracy, and a negative correlation with time discrimination accuracy, which can potentially provide a neural explanation for the memory trade-off effect we observed at the behavioral level. So to summarize: at the behavioral level, we see that boundaries enhance recognition accuracy while hard boundaries actually impair order memory.
At the single-cell level, we detected 78 neurons signaling different types of boundaries, responding either to both soft and hard boundaries or only to hard boundaries. The response strength of those neurons correlates with subjects' memory performance, through different strategies for the two populations. And at the population level, we observed a neural state shift across boundaries that correlates positively with scene recognition but negatively with time discrimination.
So in the end I want to thank all my PIs and also my collaborators from our consortium teams, and the funding sources. Given the time limit I didn't present all the results we have, but they are publicly available in our bioRxiv preprint right now. Thank you. I'm happy to take any questions now.
AUDIENCE: I have a question. First of all, what a fascinating story, [INAUDIBLE] amazing neurons to have discovered and characterized. And I really like the controls that you have for the low-level stimuli. I had one question though: I'm trying to understand, if I wanted to envision these neurons in the wild, right, in these patients' heads, when you're not showing the movie clips that are so well-defined, what would correspond to the hard onsets and the soft onsets in the real world for these subjects?
I could imagine that maybe you could have observed some of these behaviors, some of these neurons behaving when, for example, you're setting up things and the subject's beginning to make memories related to maybe you coming into the room. What's the wild-- wild type equivalent of the stimuli?
JIE ZHENG: Yeah, that's a very good question. Actually, we are also trying to look into how these neurons would respond under more naturalistic setups, and how we could capture those event transitions as well. One idea we were thinking about: for example, right now we are having this conversation, and suddenly my phone rings and I pick up the phone and start talking to someone else.
So I think something like that, a salient transition or a jump in the narrative, could potentially be a hard boundary as well in a more naturalistic setup. But that's definitely harder to study on the neural analysis side, because in order to have the statistical power to see those neuronal firings, we created these manually constructed hard boundaries as representatives of those salient changes.
AUDIENCE: That makes sense. One thing that would also follow is maybe emotional content. Was there any kind of emotional content to the changes, maybe something intense happening in one clip and not in another, that would also be marked as a boundary?
JIE ZHENG: Yeah, I think it's highly likely. A lot of behavioral studies have been done on that question, whether salient or emotion-related information can be a trigger for people to segment a continuous experience. Behaviorally, there is definitely a lot of evidence to support this idea. But I think that's also a very interesting direction, to look at whether the neurons we found here actually represent that kind of information as well. Thanks.
AUDIENCE: Jie, hi. That was fascinating, really, really wonderful. I'm interested in the possibility that there may still be a low-level explanation, and maybe more generally, what is it really that is driving the hard-boundary neurons? I mean, it's quite amazing. I can imagine a bunch of other experiments, but of course that means more experiments, so that's hard. But I'm wondering about things like color, or the intervening time.
Like, there are movies with long extended single takes; the most famous is one called Russian Ark, which is an hour and a half in one take, and you could move the clips farther and farther apart in time. Or is it something like color versus black and white, or something about the statistics of the scene in those particular clips?
JIE ZHENG: Yeah, that's definitely very interesting; I think we are all questioning this topic, whether this event segmentation is really driven by some low-level visual features, or whether we are able to do it without them. Some of the controls we did here I didn't show, but basically we were looking at, for example, the soft boundaries and hard boundaries.
We measured different visual features, such as luminance, contrast, complexity, and color distributions, to see whether those low-level features differ between the two types of boundaries in a way that could create the different responses in the event cells, which respond only to hard boundaries but not soft boundaries. We don't find any significant differences for now, but that is not to say that none of these features contribute to event segmentation.
But I think the more interesting question you're asking here is how we gradually form this event structure even without very obvious visual cues. Referring to papers like Anna Schapiro's paper in the Journal of Neuroscience, she finds that event structure can also gradually form without any very transient visual cues. One explanation would be the prediction error theories: assuming our brain is constantly making predictions, if the prediction error increases past a certain level, maybe that already triggers event segmentation.
So that could also fit these continuous [INAUDIBLE] scenarios, as well. But another thought, which we haven't tested yet but is also on my mind, is the proposed relationship between working memory and long-term memory. There are recent findings from [? Javadpour's ?] studies, which find that people who segment a continuous experience very frequently also tend to have worse working memory performance.
So in that study she was claiming that your working memory capacity might also be an indicator of, or a factor determining, how finely or coarsely you segment a continuous experience.
AUDIENCE: Great, thank you so much.
AUDIENCE: Can I ask one question? What do you make of the theta phase preference of the spiking of those neurons? I forget whether the prediction there is that they accumulate around a theta phase in the recognition task or in the time order task.
JIE ZHENG: Oh, you mean the event cells? Whether this is for the scene recognition, or whether it's the encoding or the memory retrieval, right?
AUDIENCE: Exactly so--
JIE ZHENG: Yeah, so the dots and the neural activity I was showing here are actually from the encoding. I was just splitting their activity based on the behavior from the retrieval part. So it seems like, already during encoding, they reflect or are associated with the later memory performance.
AUDIENCE: Right, so but what is the memory task that they're asked to do in this case?
JIE ZHENG: Oh, OK. So in this task, basically we pull out two frames that cross either a no boundary, a soft boundary, or a hard boundary, present them side by side to the subject, and ask them to decide whether the left or the right frame happened earlier in the original clip they have seen.
AUDIENCE: But presumably the timing varies, right? So you could be pulling an image that happens two seconds after the transition, or only one second after the transition. It's not that you're grabbing an image that happens right around the time when you have that nice phase-locking around that theta phase.
JIE ZHENG: Yes, you mean that the frame I take out is not necessarily the frame right after the soft boundary or hard boundary, right?
AUDIENCE: Exactly right. So the question is, if that's the case, what do you think about this influence of theta in the encoding, which is in a way unpredictable at the time of the test, right?
JIE ZHENG: Yeah, that's a good point. Another thing I haven't presented here, which might be related to your question, is that for the retrieval data we also looked at the reinstatement of the neural context, basically trying to see, when people recall these dynamic clips, whether they recall all the frames or specific frames of this continuous experience. What we do is compute correlations between the encoding activity and the retrieval activity, and what we find is that we actually see stronger correlations around the boundaries, but not exactly around when the target frame occurs.
So we think that even though we presented maybe 80 seconds of a clip to the subjects, what they really remember is those frames right after the boundaries. And to be honest, even though some frames are further away from the boundary, they are very similar to the frames right after the boundary as well. So yeah, I don't know whether that addresses your question or not.
AUDIENCE: I'm thinking, is it possible that one of the reasons you might be able to assign these theta phases is that theta per se is strongest in that particular case? Let's say in cases in which subjects are really paying attention to the transition versus cases in which they're not paying attention, there might be a correlation, and you might have low-amplitude theta, in which case your estimate of the phase is more inaccurate? That's one possibility.
The other is, is it possible that these neurons, again in the cases in which subjects are paying attention, exhibit more of that time-lock, or phase-lock rather, for a particular phase, and that you're just capturing one instance of that in your measurement? So just different possibilities. If it were just pure recognition at the time of the phase-locking, I'd have an easier time seeing how that happens. The fact that you have [INAUDIBLE] variability in the probes that you choose makes it hard to interpret what this means.
JIE ZHENG: So to the first point you brought up, attention: I think attention definitely plays a critical role here in terms of subjects' memory performance, and also in this event segmentation and the detection of the boundaries. But unfortunately we don't have a very sensitive way to measure the patients' attention for this data set. It's definitely interesting to see whether it actually correlates with the neural responses or not.
And the second point you brought up is about the theta phase, or the theta power: whether stronger power gives us a better estimate of the theta phase, so that we tend to see better phase clustering. That's something I'm actually looking into right now, so I should be able to give more updates later once I've looked into this data, yeah.
AUDIENCE: Yeah, cool, thank you.
GABRIEL KREIMAN: Hey, thank you very much, everybody; Hector also posted some feedback. I'm going to make an editorial decision now to move on to Will, because we still have one more talk. But maybe, Jie, you can contact Hector, who has thought about this very deeply and a lot, and get further feedback from him. Thank you very much. So let's move on to the last presentation today, by Will.
WILL XIAO: Hi, I'm Will. So today I want to tell you about an ongoing project where we're trying to create adversarial images for primate vision. Is the slide working, and can you hear me? OK, great, yes. This is a collaborative project with Li, a visiting student from the National University of Singapore. First I want to explain why we're interested in adversarial images for primate vision and how we operationally define them.
I'm sure most of the audience here know of adversarial images, but just to provide a brief background: adversarial images are really tied to convolutional neural networks, or CNNs, which are some of the current best computer vision algorithms. However, if we add small, carefully-crafted noise to natural images, CNNs can be misled into making wrong predictions. Here is a familiar example. On the left is an image of a panda, which is correctly categorized by the CNN. In the middle is the added noise, magnified to be visible. On the right is the adversarial image, which the previously correct CNN now categorizes as a gibbon.
But you would be hard-pressed to see any change in the image at all. What's more, and this is not shown here, CNNs assign very high confidence to these incorrect predictions about adversarial images. And this seems like an intuitively compelling example that CNNs are unlike our vision in being susceptible to adversarial attack. However, this conclusion doesn't really sit well with the finding that CNNs are very good models, the current best models, of primate vision, shown by work including the one Mengmi just presented and work from many other labs at CBMM.
And I think there have been several roundtable discussions here about to what extent CNNs are actually good models of primate vision. I think some here would agree that CNNs are good but already, quote unquote, "falsified" models. But I still think there is no specific and satisfactory explanation of why CNNs can predict primate visual neural responses pretty well yet behave so differently when faced with adversarial images. So to talk about adversarial images for primates, including humans, I want to use the following definition.
So I defined an adversarial image relative to a clean image, which is a natural image that can be unambiguously classified. Then the adversarial image must be minimally different from the clean image, but results in a different classification. I want to unpack this a little more because this is pretty counterintuitive and contradicts some popular definitions of adversarial images. For example, a common definition is that the adversarial noise should be unnoticeable to humans. However, because we want to test adversarial images of primate vision including humans, by definition the adversarial noise has to be perceptible.
And for the same reason, we cannot use the definition that adversarial images are images that humans classify correctly but that fool CNNs. OK, another potentially counterintuitive consequence of this definition is that we can have trivial examples such as this. For example, let's take a clean image x1 from a different class, and construct the image x0 plus the difference between x1 and x0. This of course is just x1, and it will be classified differently than x0 by construction. Unfortunately, this will count as an adversarial image by our definition, although it's obviously not a very interesting one.
So the key here is that it's really essential to be quantitative, to control the size of the adversarial change, and to define what counts as a small change. We really need to measure it and compare it to the actual separation between two clean image classes. In this framework, up to some budget there will be so-called adversarial images for any visual system. But the key question is how robust, for example, primate vision is to adversarial images, and whether a small change to the image can lead to a change in categorization.
Here, small is defined relative to trivial methods such as just replacing the image. Meanwhile, however much perturbation it turns out we need, I think it will be interesting to compare that amount to what's needed to attack convolutional neural nets, both naive and adversarially robust ones, because I think that would be a useful benchmark for both computer vision and for models of the brain.
And briefly, I want to summarize what's already known related to this question. There is a paper from Elsayed et al., who showed that time-limited humans can be affected by the same adversarial images that affect convolutional nets. Their effect does require very short presentation times, and it basically reduces an already-low accuracy of 75% by a further 10%. And this is not surprising, because usually we are hardly able to see adversarial noise under normal viewing conditions. So for this to work, something has to be different.
And another somewhat related result, from Zhou and Firestone, is that humans can guess which wrong categorizations will be made by CNNs. That doesn't necessarily mean humans make the same mistakes, just that they can anticipate how the CNNs will get it wrong. So what is not known is whether primates can be affected by adversarial images under normal viewing conditions, or more precisely, as we just discussed, how much image change is required to change their categorization. And finally, it is not known how visual neurons respond to adversarial images, both those tailored to CNNs and those tailored to the neurons themselves.
And it's interesting because one prediction is that encoding models based on CNNs should diverge from actual neural responses when it comes to adversarial images, at least if primate vision turns out to be more robust than CNNs. So to try to address this question, here's our specific attack setting. We chose to conduct a targeted attack between human and monkey faces. This is because we have access to face-selective neurons which are highly selective between these two categories, with clean categorization performance of about 95% from just a handful of neurons.
And today I will focus on the first attack direction. We also tested the reverse direction, monkey to human, as well as a non-face attack, and I'm happy to show those if time permits at the end. So to make adversarial images tailored to primate neurons, we designed what we call a gray box adversarial method. For those of you who are familiar with the work from [? Pouya and Ko ?] in the DiCarlo Lab on neural population control, this pipeline is very similar to their pipeline for making images designed to drive neurons. The main difference is the loss function, which I'll describe now.
So to briefly summarize the method: first we record the responses of some face-selective neurons to face and object images, and then we fit an encoding model of the neuronal responses based on features extracted by, in this case, a ResNet. As I alluded to earlier, this class of models is state of the art at predicting neural responses to arbitrary images. And in our setting, the fitted model explained about 50% of the explainable variance.
And finally, we use this model, which is end-to-end differentiable, as a substitute for the actual neurons we want to attack, and we use the model to modify clean images. The loss function is to match the predicted response to the real response vector of the target class. So to be concrete, in a human-to-monkey attack, we modify a human image so that the model predicts that the neurons will respond to the modified image similarly as to monkey images. We call this method gray box because we use some, but very limited, information from the system being attacked.
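To make that pipeline concrete, here is a minimal sketch of what such a gray box attack could look like, assuming a PyTorch backbone standing in for the fine-tuned ResNet and a linear readout standing in for the fitted encoding model; every name here (feature_net, readout, target_response, gray_box_attack) is a placeholder for illustration, not the code actually used in the project.

```python
# Minimal sketch of a gray-box attack on a substitute encoding model.
# Hypothetical stand-ins, not the authors' code: `feature_net` maps images to
# features, `readout` is a linear encoding model fit to recorded firing rates,
# and `target_response` is the neural response vector of the target class.
import torch
import torch.nn.functional as F
import torchvision.models as models

feature_net = models.resnet18(weights=None).eval()      # stand-in backbone
n_neurons = 20
readout = torch.nn.Linear(1000, n_neurons)              # stand-in fitted readout
target_response = torch.zeros(n_neurons)                # stand-in target-class responses

def gray_box_attack(clean_img, mse_budget=800.0, steps=50, step_size=2.0):
    """Perturb `clean_img` (1x3xHxW, pixel values in [0, 255]) so the substitute
    model predicts target-class responses, while capping the pixel MSE."""
    x = clean_img.clone().detach().requires_grad_(True)
    for _ in range(steps):
        pred = readout(feature_net(x / 255.0)).squeeze(0)   # predicted firing rates
        loss = F.mse_loss(pred, target_response)            # match the target responses
        grad, = torch.autograd.grad(loss, x)
        with torch.no_grad():
            x -= step_size * grad.sign()                    # FGSM-style signed step
            delta = x - clean_img                           # project back into the budget
            mse = delta.pow(2).mean()
            if mse > mse_budget:
                delta *= torch.sqrt(mse_budget / mse)
            x.copy_((clean_img + delta).clamp(0, 255))
    return x.detach()

adv = gray_box_attack(torch.rand(1, 3, 224, 224) * 255)     # dummy image just to run
```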
And to keep with the box analogy, in the adversarial attack literature this is probably much closer to a black box attack, which assumes little to no information about the system, than it is to a white box attack, which requires being able to differentiate through the whole model. Another sense in which the information is limited is that the results I'm showing today are based on the same set of attack images, made using the same encoding model, which was fit on data from one monkey, but we tested several monkeys and also humans.
OK, so now that we have this algorithm to make images, we also want to define the budget allowed for image change. For this we used a simple metric: the mean squared error of the pixel value change. This is directly proportional to the square of the L2 metric commonly used in the attack literature. For the levels we used, we capped the MSE at 800, with pixel values ranging from 0 to 255, and we tested 10 linearly spaced levels from 200 to 800.
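Just to spell out that budget metric with a toy example (the 800 cap and the 0-to-255 pixel range are from the talk; the images below are random placeholders):

```python
# Pixel-change budget as described: mean squared error over pixel values (0-255),
# which is proportional to the squared L2 norm of the perturbation.
import numpy as np

def pixel_mse(clean, perturbed):
    """Mean squared pixel difference between two images of equal shape."""
    diff = perturbed.astype(np.float64) - clean.astype(np.float64)
    return np.mean(diff ** 2)        # == np.linalg.norm(diff)**2 / diff.size

rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(224, 224, 3))
noisy = np.clip(clean + rng.normal(0.0, np.sqrt(800.0), clean.shape), 0, 255)
print(pixel_mse(clean, noisy))       # roughly at the MSE cap of 800 used in the talk
```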
And here are some example human-to-monkey attack images at these various noise levels. For reference, here is the distribution of the MSE distances between 250 each of clean human and monkey faces. The average of this distribution is about 6,500, and the minimum over all of these pairs is about 950. That might not seem too far from 800, which is the maximum level we are using, but remember that an adversarial attack has to work on the basis of individual images.
So to estimate that, we can plot the minimum MSE as a function of the number of images each image is compared to. The little gray lines here are individual images and the purple line is the average across images. A power law fits pretty well, and by extrapolation it seems we would need to compare one image to about 100 million others before we come up with an image that is an MSE of 800 away from it.
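As a rough sketch of that kind of extrapolation (with made-up placeholder numbers, not the actual measurements), one can fit a line in log-log space and solve for where it crosses the MSE budget:

```python
# Sketch: fit a power law to minimum-MSE-versus-number-of-comparisons in
# log-log space, then extrapolate to the 800 MSE budget. Synthetic data only.
import numpy as np

n_compared = np.array([1, 2, 5, 10, 20, 50, 100, 250])   # images compared against
min_mse = 6500.0 * n_compared ** -0.35                    # made-up power-law curve

slope, intercept = np.polyfit(np.log(n_compared), np.log(min_mse), 1)
# Solve exp(intercept) * n**slope = 800 for n
n_needed = np.exp((np.log(800.0) - intercept) / slope)
print(f"extrapolated number of comparisons to reach MSE 800: {n_needed:.3g}")
```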
OK, so I just described how we attempt to make adversarial images tailored to neurons. As comparisons, we also tested several other attack methods at the same noise levels. The first one is adversarial images made to fool the pure model. These were made for the same fine-tuned ResNet that we used as the backbone for the substitute model, but without fitting to any neural data.
Next, we also tested merged images, which are just simple linear interpolations between two clean images from the two classes. And thirdly, to control for image quality degradation, we also tested adding Gaussian noise. This isn't really an attack method because it's not targeted, but I think it serves as a good example of how big that noise level of 800 really is. And lastly, I'm not going to discuss the PS images; I'm just putting them here because they're going to show up on the plots. And finally, I want to describe some results we have.
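Here is a minimal sketch of how those two untargeted controls could be generated at a matched pixel-MSE level (a hypothetical helper assuming 0-to-255 images; not the code used in the project):

```python
# Sketch of the untargeted controls: a "merged" image (linear interpolation
# toward a clean image of the other class) and Gaussian pixel noise, both
# scaled to approximately hit a given pixel-MSE level. Placeholder helper.
import numpy as np

def control_images(human_img, monkey_img, target_mse, seed=0):
    rng = np.random.default_rng(seed)
    h = human_img.astype(np.float64)
    m = monkey_img.astype(np.float64)

    # Merged image: interpolate just far enough toward the other class.
    pair_mse = np.mean((m - h) ** 2)
    alpha = np.sqrt(min(target_mse / pair_mse, 1.0))
    merged = h + alpha * (m - h)

    # Gaussian-noise control at (approximately) the same MSE before clipping.
    noisy = h + rng.normal(0.0, np.sqrt(target_mse), size=h.shape)

    return (np.clip(merged, 0, 255).astype(np.uint8),
            np.clip(noisy, 0, 255).astype(np.uint8))
```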
First, we tested the attack images at the level of neuronal responses. For this we recorded from face-selective monkey IT neurons while the monkey fixated on the images. To quantify the attack success rate, we trained linear SVMs to classify neural responses to clean human and monkey faces. Accuracy on clean images was pretty high, as described earlier, and we used these SVMs to classify the responses to the various attack images.
So first I'm going to show one example array recorded from one monkey. This is a UMAP visualization of the responses of maybe 20-something neurons. You can see that the clean human face images are clearly separated, whereas the human-to-monkey adversarial images are already less similar to the human images and more similar to the monkey images. Of course, one caveat of UMAP is that it's not linear, so when quantifying the attack success rate we just use a linear SVM directly on the normalized firing rates.
Here, success rate is defined as the fraction of images, out of 40 per condition per noise level, classified as the opposite class. The dashed gray line indicates the error rate on clean images, and you can see from it that the neural population is highly selective between the two categories. Meanwhile, the gray box attack images were able to change the image categorization with up to around a 40% success rate, and none of the other attack methods reached success rates much higher than the error rate on clean images.
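A sketch of that quantification, with placeholder firing rates in place of the recorded data (names and numbers are illustrative only):

```python
# Sketch: train a linear SVM on normalized firing rates to clean human vs.
# monkey faces, then score attack images by the fraction classified as the
# target (monkey) class. Placeholder data throughout.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_neurons = 24
clean_rates = rng.poisson(5.0, size=(80, n_neurons)).astype(float)   # 40 human + 40 monkey
labels = np.array([0] * 40 + [1] * 40)                               # 0 = human, 1 = monkey
attack_rates = rng.poisson(5.0, size=(40, n_neurons)).astype(float)  # human-to-monkey attacks

clf = make_pipeline(StandardScaler(), LinearSVC())
clf.fit(clean_rates, labels)

success_rate = np.mean(clf.predict(attack_rates) == 1)   # fraction flipped to "monkey"
print(f"attack success rate: {success_rate:.2f}")
```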
We tested the same images on two other monkeys, and these are the results. Gray box images in all cases were consistently more effective than the other attack methods. And interestingly, they achieved about the same success rate at the highest noise level, despite being tested on different monkeys using a model based on responses from just the first monkey. And just quickly, we addressed one concern, that the attack images could be misclassified by the SVMs simply because they became non-face-like, and this is not the case.
As you can see from this UMAP visualization, we also trained SVMs to classify between face and non-face images, and almost all of the attack images are still classified as faces. OK, so next I want to show some behavioral results. For this, we tested monkeys who were trained to do this binary classification task in the cage on a touch screen, and we also tested the images on human subjects on Amazon Mechanical Turk.
So here are some results from the monkey behavior experiments. Again, the success rate is the fraction of images that were categorized as the opposite class. In the inset, each star is an individual image, and the inset shows the fraction of trials on which the image was chosen as each class. Again, the dashed gray line shows the accuracy, or error rate, on clean images. The monkeys were well trained on this task, and still they made up to 40% to 60% errors on the gray box images, while not making as many errors on the other attack methods.
And finally, we also tested the images on Amazon Mechanical Turk. Oh, one important thing I forgot to mention is that both monkeys and humans had a full second to examine the images, so the viewing time is practically unlimited. And it's interesting that the gray box images were more effective than the other methods even on humans, although the overall success rate is lower than the success rate on monkeys.
OK, to summarize what I talked about, I want to recall the question at the beginning, which is: how robust is primate vision to adversarial attack? If you take only one message away from this talk, I hope it is that it is not insane to ask this question. It can be a well-defined question, and it can be useful, and I would argue it needs a quantitative answer. As for our results, based on our attack methods we found that relatively small changes to an image could change the categorical representation of visual neurons in IT, and could also change categorization in primate behavior, both in a targeted way.
And here, relatively small is compared to the natural separation between the two categories of images. The actual changes are still very noticeable, but that is, by definition, required to change primate perception. And lastly, how does the required noise compare to the noise needed to attack CNNs? For this I would like to hear your thoughts and maybe get some pointers to the literature, because I'm not really familiar with the state of the art here. I just put up two arbitrary recent papers on the topic, and as you can see, the range of noise levels is quite different, showing that the number can depend strongly on the task, probably on the image size, as well as on the number of alternative categories.
Regardless, I think being able to measure robustness in the brain, and finding a fair comparison between the brain and CNNs, is a useful yardstick by which to evaluate progress in making adversarially robust models. And finally, I just want to say that the success rates and noise levels we could achieve depend on the methods we have, which are still fairly simple. So it's possible that future methods will come up with even more effective and even less-perturbed images that change human perception.
With that, I would like to thank my collaborator Li, as well as my advisors Gabriel and Marge. Thank you for listening, and I would love to hear your questions and comments.
AUDIENCE: Will, quick question. First of all, excellent line of investigation, it's very, very cool, very gutsy, I like it. I'm just curious what the basic PSTH responses of the neurons were to human faces, monkey faces, and to the adversarial images. Were they very different?
WILL XIAO: So, yes, several answers to your question. First, with what I already have, we're limited by the arrays we currently have in the lab. One of the recordings has relatively lower SNR, as you can already see here, but overall these are face-selective arrays. If you average images within a category, they are highly face-selective; if you look at individual images, then the lower-SNR arrays are less good.
But just as I was making this talk, I was reminded that it would be interesting to actually test what I promised, which is to fit models on clean images and evaluate how well they predict responses to adversarial images, as well as how the neurons respond to the adversarial images themselves, which I actually have an answer for. This is the slide that maybe I passed over too quickly. The adversarial images are little changed from face images, and they are still categorized as face images based on the neural responses. So presumably the neurons are still highly activated.
AUDIENCE: But in terms of overall firing rate, have you visualized just the PSTH?
WILL XIAO: Well, I haven't. No, I have not actually.
AUDIENCE: OK, interesting.
WILL XIAO: I really should, to look at the time course for example.
AUDIENCE: Hi, Will. Did you say something about human responses or Mechanical Turk? Was there some--
WILL XIAO: Yes, maybe I went over this too quickly as well. The dashed lines are just replotting the lines from the left, which I tested on monkeys. We also ran this on Mechanical Turk as a two-alternative forced-choice task, and about 30% of the human-to-monkey attack images were classified as monkey, averaged over subjects.
AUDIENCE: OK, so that's a forced-choice paradigm. But actually, I think if we looked at it, we would say, well, that looks like a doctored monkey face, or that looks like an altered human face, it looks weird, it's neither. Or I'm trying to figure out what it is that you, Will, think it's closer to, or something like that. I mean, that would be my immediate response to these images. So I'm not quite sure how to think about a two-alternative forced-choice task in this case.
WILL XIAO: Yeah, I think that's an excellent question, though. In defense of the quantification, in the adversarial attack literature it's always a forced choice. But of course it's an interesting question what exactly these images are doing. I've been discussing this with Gabriel, for example: the attack is certainly destroying certain face features in the human faces, probably making them less human-like. But it's unclear to me whether they are just making them less human-like, or whether there is also some direction along which they're moving toward the monkey category.
The neural response data I was just showing suggests that they're not moving too far away from the face cluster, at least. But in the vast space of possible pixel combinations, how tight are the islands of human face and monkey face images, and is this really in the no-man's-land between them? And maybe just to anticipate a little, a related question I've been puzzling over is that I feel this method, even given an unlimited noise budget, is still not going to come up with a full monkey image, right? That doesn't mean the models aren't good local approximations of the decision boundary, but yeah, that's just something to keep in mind as well.
AUDIENCE: Thank you, it's great and fascinating.
AUDIENCE: Just to quickly follow up on that, Will, so yeah, it's pretty cool. So this slide kind of shows you're not doing a fast gradient sign method attack or PGD. You can't do that, as far as I understand, because you're working with a monkey, not with a network, right? I could be wrong, but I think that's what I understood. So here what you're showing is different strengths along one specific direction in which you're perturbing the stimuli, right? OK, OK.
Yeah, it would be interesting, in the PGD way of thinking, if there's somehow a curve instead of a straight line that you could draw in this high-dimensional space. If it is a straight line, maybe you won't get a monkey face, but if it is a curve, or something that's not straight, you could maybe actually see something that is more natural. I don't know, it's just that when I saw it, it resembled the interplay between the fast gradient sign method and PGD, and that got me thinking. I don't know what your thoughts on that are as well.
WILL XIAO: Yeah, I just want to quickly clarify that this is actually based on the fast gradient sign method. To do that we had to basically make a substitute, in silico version of the monkey neurons. An exciting direction is to be able to do this in a zeroth-order way, without the intermediate substitute model. I'm still not clear how easy that is, because changing images pixel-wise entails a very large search space, right? So yeah, I'm just not courageous enough to undertake that yet. But unless we have a perfect model, I think finding the optimal direction of change will have to be based directly on the neural responses.
AUDIENCE: Right, right, yeah, kind of reminds me of the eigen-distortions paper by Berardino and Simoncelli.
WILL XIAO: Exactly, exactly. So yeah, I also think of this in that framework, except that eigen-distortions, I think, are only about perceptible differences, and this is both perceptible and targeted, so a little different.
AUDIENCE: Thanks, Will. Thanks.
AUDIENCE: Hi, Will. I really liked your talk. So one thing I feel is, instead of framing your study in terms of an adversarial attack, could you frame it in terms of the minimal set of features needed to recognize certain objects? Your human face turning into a monkey face, and leading humans and monkeys to judge it otherwise, suggests the image definitely has some features that humans or monkeys use to do the classification. So this result could be interpreted as your adversarial attack capturing those minimal features needed for the classification.
And the follow-up is: if you do not start from a human face but start from a gray screen or gray noise, and just evolve or generate those features, can you still achieve that kind of performance?
WILL XIAO: So the first point is very interesting, and I have to think more about it. I guess the results as they currently stand, and the perplexing nature of this question, mean there are all sorts of ways to frame this. Another suggestion I've heard is that this is finally a direction where you could use a model to design something like microstimulation, for example, that could change behavior. So I'm not sure what the best way to frame this is, but your suggestion is the first time I've heard that framing, so I'll have to consider it carefully.
And for the second one, I have some data related to it, which is the non-face-to-face attack I alluded to earlier. I hope that's relevant to your question. Sorry, I have to find the slide. So we didn't start from a gray screen, but we're actually doing the experiment right now starting not from objects but from noise images, using the same pipeline to make the images evoke face-like neural responses, and we could do that as well.
AUDIENCE: So it's kind of comparable to the face to face task, right?
WILL XIAO: At the same noise level, I would say this is bounded at around 20% to 30%, whereas for the other it's more like 30% to 40%. So it is harder, which might make sense given the separation, yeah.
AUDIENCE: Yeah, thank you, yeah.
GABRIEL KREIMAN: OK, thank you very much, Will. And if there are no further questions, I want to thank all three speakers, Mengmi, Jie, and Will, and thank you all for your feedback and questions.