Visual Search Asymmetry: Deep Nets and Humans Share Similar Inherent Biases
Date Posted:
October 27, 2021
Date Recorded:
October 26, 2021
CBMM Speaker(s):
Mengmi Zhang
Description:
Visual search is a ubiquitous and often challenging daily task, exemplified by looking for the car keys at home or a friend in a crowd. An intriguing property of some classical search tasks is an asymmetry such that finding a target A among distractors B can be easier than finding B among A. To elucidate the mechanisms responsible for asymmetry in visual search, we propose a computational model that takes a target and a search image as inputs and produces a sequence of eye movements until the target is found. The model integrates eccentricity-dependent visual recognition with target-dependent top-down cues. We compared the model against human behavior in six paradigmatic search tasks that show asymmetry in humans. Without prior exposure to the stimuli or task-specific training, the model provides a plausible mechanism for search asymmetry. We hypothesized that the polarity of search asymmetry arises from experience with the natural environment. We tested this hypothesis by training the model on an augmented version of ImageNet where the biases of natural images were either removed or reversed. The polarity of search asymmetry disappeared or was altered depending on the training protocol. This study highlights how classical perceptual properties can emerge in neural network models, without the need for task-specific training, but rather as a consequence of the statistical properties of the developmental diet fed to the model. Our work will be presented at the upcoming NeurIPS 2021 conference.
GABRIEL KREIMAN: It's a great pleasure to introduce Mengmi. Mengmi was a fantastic post-doc in our lab who's now back in Singapore. It's not only amazing work that she's going to talk about, but it's particularly commendable that it's 4:00 AM in Singapore, and she woke up specifically to deliver this presentation. So that's pretty exciting.
This work has just been accepted in NeurIPS. And she's going to be presenting that in NeurIPS as well very soon. So it's very, very exciting to have her here. And so Mengmi, please go ahead.
MENGMI ZHANG: All right. Thanks, Gabriel, for the introduction. So hi, everyone. My name is Mengmi. Today, I will talk about visual search and how we can use visual search as a tool to explore some inherent biases in deep nets. Feel free to interrupt me during my presentation if you have any questions.
So this is joint work with Shashi, Chia-Chen, Jeremy, and Gabriel. Visual search is a [AUDIO OUT] problem in vision. For example, we often search for our car keys in the living room, or look for our friends in a party crowd.
Here is an example visual search trial with eye movements recorded. In this example, this person is looking for a calculator. The yellow dots denote the eye movement, and the rectangle denotes the ground truth target location. After looking around, eventually this person managed to find where their calculator is.
An intriguing property of some classical search tasks is asymmetry, such that finding a target A among distractors B can be easier than finding B among A. Now let me explain what I mean with this specific example.
In the first experiment, the target and the distractors differ in the direction of lighting, and both are lit along the horizontal direction. While the target receives light from left to right, the distractors receive light in the reverse direction. However, in the second experiment, the lighting direction changes to vertical.
Surprisingly, psychologists have found that humans search for targets faster in the vertical lighting condition than in the horizontal lighting condition. This finding makes us wonder how such asymmetry is implemented in the human brain, and where it comes from. Does it come from natural statistics? For example, we get used to seeing objects under vertical lighting more often simply because the sun always shines from above.
It turns out that [AUDIO OUT] asymmetry in lighting conditions. There are many other such asymmetries as well. In total, we examined six foundational, classical psychophysics experiments showing visual search asymmetry in the literature. Each experiment has two conditions.
In experiment one, we studied curves versus lines. Experiment two is the lighting-direction experiment I just described. In experiment three, we searched for shapes without intersections among crosses, or vice versa. In experiment four, we included rotated T's versus rotated L's.
In experiment five, we included oblique lines versus vertical lines in the homogeneous case, where the distractors are all of the same type. In experiment six, we studied the heterogeneous case, where the distractors are oblique lines with different orientations.
Visual search has been explored both in neurophysiology and in psychology. However, the mechanism underlying how neural representations guide visual search and lead to search asymmetry remains mysterious. By putting together all these pieces of evidence from neuroscience and psychology, one of our goals is to develop an image-computable model for asymmetry in visual search.
By quantitatively assessing this model's behaviors and comparing them with humans in a variety of experiments, we hope to gain a better understanding of the neural mechanisms involved in search asymmetry. Here, we repurpose an eccentricity-dependent deep net for visual search. First, the model takes the target image as input and extracts feature maps through a stack of convolution blocks.
Then, the model takes the search image as input and extracts the corresponding feature maps through the same stack of convolution blocks. Note that at the beginning of each search trial, the model starts by fixating at the center of the search image, as indicated by the green circle. The model then continuously shifts its fixations until the target is found.
Next, the model applies top-down modulation from the feature maps of the target onto the feature maps of the search image. This top-down modulation produces three different attention maps at three different feature levels. These three individual attention maps are then normalized and linearly combined to produce an overall attention map.
A winner-take-all mechanism then selects the maximum location of the attention map as the location for the next fixation. This process iterates until the model finds the target. Here, we assume the model has infinite inhibition-of-return, and therefore it does not revisit previous locations.
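To make this step concrete, here is a minimal sketch of how such a top-down modulation and winner-take-all step could look. It is an illustration under simplifying assumptions (equal weights across levels, feature maps assumed to share one spatial size), with my own function and variable names, not the exact eccNET code.

```python
import numpy as np

def next_fixation(target_feats, search_feats, visited_mask):
    """Sketch of the top-down attention and winner-take-all step.

    target_feats / search_feats: lists of (H, W, C) feature maps from several
    convolutional levels (assumed here to share the same spatial size);
    visited_mask: boolean map of previously fixated locations, implementing
    infinite inhibition of return.
    """
    attention_maps = []
    for t, s in zip(target_feats, search_feats):
        # Top-down modulation: correlate the pooled target feature vector with
        # every spatial location of the search-image feature map.
        t_vec = t.mean(axis=(0, 1))                      # pool target to one vector
        amap = np.tensordot(s, t_vec, axes=([2], [0]))   # (H, W) similarity map
        amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)  # normalize
        attention_maps.append(amap)
    # Linearly combine the per-level maps (equal weights, for illustration).
    overall = np.mean(attention_maps, axis=0)
    overall[visited_mask] = -np.inf                      # never revisit old fixations
    # Winner-take-all: the maximum of the overall attention map is the next fixation.
    return np.unravel_index(np.argmax(overall), overall.shape)
```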
For the visual processor of the model, we used VGG16 pre-trained on ImageNet for object recognition. However, standard CNN designs assume uniform sampling within a layer and do not reflect the property of eccentricity dependence. Previous work by Freeman et al. has shown that in the ventral stream of primate brains, within the same area, receptive field sizes increase with increasing eccentricity.
Across multiple visual areas, receptive field sizes also increase at the same eccentricity. Several observations from the literature indirectly support the idea that eccentricity-dependent sampling enhances visual search asymmetry. Thus, here we introduce novel eccentricity-dependent pooling layers in VGG16 to capture this eccentricity dependence.
Now, let's zoom in on this proposed eccentricity-dependent pooling layer and see how it works. In this example, the model fixates at the center of the image. For any given unit G, its pooling window size is a function of eccentricity. In other words, the further away unit G is located from the current fixation, the larger its pooling window.
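As a rough illustration of this pooling rule (not the exact eccNET implementation; the base window size and growth rate below are made-up parameters), a minimal version could look like this:

```python
import numpy as np

def eccentricity_pooling(feature_map, fixation, base_window=1, growth=0.1):
    """Minimal sketch of an eccentricity-dependent average-pooling layer.

    feature_map: (H, W, C) array; fixation: (row, col) of the current fixation.
    The pooling window at each location grows linearly with its distance
    (eccentricity) from the fixation; base_window and growth are illustrative
    values, not the ones used in eccNET.
    """
    h, w = feature_map.shape[:2]
    fy, fx = fixation
    pooled = np.zeros_like(feature_map)
    for y in range(h):
        for x in range(w):
            ecc = np.hypot(y - fy, x - fx)           # eccentricity in feature-map units
            k = int(base_window + growth * ecc)      # window half-size grows with eccentricity
            y0, y1 = max(0, y - k), min(h, y + k + 1)
            x0, x1 = max(0, x - k), min(w, x + k + 1)
            pooled[y, x] = feature_map[y0:y1, x0:x1].mean(axis=(0, 1))
    return pooled
```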
Here, we reproduce the plot of eccentricity versus receptive field size in macaque monkeys. In order to generate the comparable plot for eccNET, we did the following experiment. First, we had eccNET fixate at the center of a black image. Then, we introduced a bright white spot on this black image and presented it to eccNET.
For each location of the bright spot, we can measure the eccentricity and the response of each artificial neuron in the pooling layer. This gives us the plot of eccentricity versus receptive field size on the right. It's worth noting that the plot for eccNET is not an exact match of the monkey plot on the left, but it preserves a similar trend.
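A hedged sketch of that probe procedure is shown below. The probe spacing, spot size, and response threshold are illustrative choices, and `layer_response` is a hypothetical stand-in for whatever function returns a pooling layer's activation for an input image.

```python
import numpy as np

def estimate_rf_sizes(layer_response, image_size=224, spot_radius=2, step=8):
    """Relate probe eccentricity to receptive field size in a pooling layer.

    layer_response: hypothetical callable mapping a (H, W) image to a (h, w)
    activation map of one pooling layer, with the model fixating the center.
    For each probe position, we present a bright spot on a black background,
    find the units that respond, and use their spatial extent as a crude
    proxy for receptive field size at that eccentricity.
    """
    center = image_size // 2
    eccentricities, rf_sizes = [], []
    for x in range(spot_radius, image_size - spot_radius, step):
        image = np.zeros((image_size, image_size), dtype=np.float32)
        image[center - spot_radius:center + spot_radius,
              x - spot_radius:x + spot_radius] = 1.0     # bright spot, black background
        response = layer_response(image)
        active = np.argwhere(response > 0.5 * response.max())
        if active.size == 0:
            continue
        extent = active.max(axis=0) - active.min(axis=0) + 1
        eccentricities.append(abs(x - center))           # eccentricity of the probe
        rf_sizes.append(float(extent.mean()))            # proxy for receptive field size
    return np.array(eccentricities), np.array(rf_sizes)
```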
To give you a better understanding of the receptive field size of each unit in each layer, here we provide a visualization of one example image and its corresponding eccentricity-dependent sampling at different layers. Different from previous work, we emphasize that this eccentricity-dependent mechanism is applied across multiple layers of eccNET.
In contrast to most computational models, which have been trained on specific visual search tasks, the weights of our model were only pre-trained for object recognition on the ImageNet dataset. And we did not do any fine-tuning using any human data or any ground truth data from the asymmetry search experiments.
Back in the old days, when scientists did not have the luxury of tracking human eye movements in visual search experiments, they measured the reaction time of a key press as a means to assess search speed. Thus, eye movements in the six psychophysics experiments I presented before were not reported in the original papers. On the other hand, most computational models of visual search produce a sequence of fixations.
Therefore, in order to compare the results between the models and the humans, we conducted an additional experiment where we measured both the key press reaction time and the eye movement data simultaneously. Since reaction time results from the time taken by the eye movements plus the motor response time of the finger key press, we performed a linear regression between the eye tracking data and the reaction times, as shown in the figure on the right. This gives us a linear model to convert the number of fixations into reaction time, where the slope of the line denotes the fixation duration, and the intercept indicates the motor response time.
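As a minimal sketch of that conversion (the function and variable names here are mine, and I assume an ordinary least-squares fit, which matches the linear regression described above):

```python
import numpy as np

def fit_rt_model(n_fixations, reaction_times):
    """Fit reaction time (ms) as a linear function of fixation count.

    n_fixations, reaction_times: per-trial arrays from the eye-tracking plus
    key-press experiment. The slope is interpreted as the per-fixation
    duration and the intercept as the motor (key press) response time.
    """
    slope, intercept = np.polyfit(np.asarray(n_fixations, dtype=float),
                                  np.asarray(reaction_times, dtype=float), 1)
    return slope, intercept

def fixations_to_rt(num_fixations, fixation_duration, motor_time):
    """Convert a model's fixation count into a predicted reaction time (ms)."""
    return fixation_duration * num_fixations + motor_time
```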
Before we dive into the quantitative results, let me first show you some visualization examples of fixation sequences predicted by eccNET. In this particular search trial, under the vertical lighting condition, eccNET starts the search from the center of the image, denoted by the yellow dot. It then takes eccNET only one fixation to find the ground truth target, indicated by the red box.
In contrast, when eccNET searches under the horizontal lighting condition shown on the right, it takes eccNET two fixations to find the target. Again, I cannot emphasize enough that eccNET uses weights pre-trained on ImageNet for object recognition, and eccNET has never been exposed to such simple stimuli during training. Moreover, eccNET has never been fine-tuned using any human eye movement data or ground truth target locations in the visual search task.
The proposed model shows search asymmetry, and it quantitatively captures human behavior when the reaction time plots are compared side by side. For example, in experiment A, curves versus lines, the x-axis shows the number of distractors in the search image, and the y-axis shows the reaction time in milliseconds. The model looks for a straight line among curved lines.
Similar to humans, increasing the number of distractors leads to longer reaction times. In other words, the slope of the red curve is positive for both the humans and the model. But when the target and the distractors were reversed, that is, when humans have to search for a curved line among straight lines, the blue curve is relatively flat and lies below the red curve.
This indicates that the reaction time was shorter, and there is minimal dependence of reaction time on the number of distractors. In other words, it's easier to search for a curved line among straight lines than the reverse. This also holds true for eccNET.
Similarly, the model quantitatively captures human behavior in the rest of the experiments as well, except for experiment E. To further investigate the polarity of search asymmetry, we introduced a new metric called the asymmetry index to better compare different baseline models and eccNET against human results.
Now, let me first introduce the definition of the asymmetry index. Within each experiment, we can first define the easy versus hard condition based on human performance. For example, in the figure on the right, the red curve is above the blue curve. Thus, searching for a line among curves is the harder condition for humans, compared to searching for a curve among lines.
We can then compute the slopes of these two individual lines. With the two slopes for the hard and easy conditions of the model, we can now define the asymmetry index by the formula shown here for each experiment, where H is the slope of the hard condition defined in the human experiments, and E is the slope of the easy condition defined in the human experiments. Therefore, if the model follows the human asymmetry pattern for a given experiment, it will have a positive asymmetry index.
If there is no asymmetry at all, then the asymmetry index equals zero. And a negative value indicates that the model does show asymmetry, but the asymmetry is opposite to that of humans. Based on this newly defined metric, we calculated the asymmetry index for humans. It is around 0.6.
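The slide formula itself is not spelled out in the transcript, but a normalized slope difference has exactly the sign behavior described above; the sketch below is my reconstruction, so please check the paper for the exact definition.

```python
def asymmetry_index(slope_hard, slope_easy):
    """Plausible reconstruction of the asymmetry index described above.

    slope_hard / slope_easy: reaction-time-vs-set-size slopes of the hard and
    easy conditions (labeled by human performance). The normalized difference
    is positive when the model reproduces the human polarity, zero when there
    is no asymmetry, and negative when the polarity is reversed. The exact
    formula used in the paper may differ.
    """
    return (slope_hard - slope_easy) / (slope_hard + slope_easy)
```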
Interestingly, our proposed model, eccNET, scored quite close to humans in terms of the asymmetry index. Please note that this model had no previous exposure to any of the images in the current study, and it was not trained on any of these tasks beforehand. It did not have any tuning parameters that depended on either eye movements or reaction times from humans. And it was not designed with the goal of showing asymmetry. Despite all this, it still scored quite close to humans, at around 0.5.
We also compared the results with other visual search models, including our earlier model, IVSN. We found that even IVSN shows a positive asymmetry index, which suggests that it behaves similarly to humans, but not as closely as eccNET.
For baseline models, we compared against chance, which showed an asymmetry index of almost zero. This indicates that chance has no specific bias for different search conditions. We also compared the results with GBVS, a purely bottom-up saliency model. GBVS also scored close to zero, indicating that most of the asymmetry is driven by top-down modulation. A simple pixel-matching model also did not show positive asymmetry, suggesting that the feature biases arise at an abstract level in the latent space rather than in pixel space.
To understand the mechanisms responsible for asymmetry, we ablated the eccNET model. First, we tried top-down modulation from a single feature layer. Second, we tried top-down modulation from multiple layers but removed the eccentricity-dependent pooling layers. Both ablations showed positive asymmetry, but their scores were significantly lower than eccNET's. This suggests that both top-down modulation across multiple layers and the eccentricity-dependent pooling layers are important for explaining search asymmetry.
Then, we tested the effect of the training data used to train the visual processor. First, instead of using ImageNet, we used MNIST to train the model. We found that MNIST does give a positive score, but the score is significantly lower than that of eccNET.
Second, we trained the model on a 90-degree rotated version of ImageNet. This gave an interesting result: the absolute score was high, but the polarity was negative. This indicates that asymmetry still exists between the search conditions, but it shows the reverse polarity across all conditions in our experiments.
To elaborate on the last two points: for example, in the lighting-direction experiment, compared with the polarity of eccNET pre-trained on ImageNet, the polarity of the asymmetry was completely reversed after eccNET was trained on the rotated ImageNet. Instead of using ImageNet, which contains millions of natural images with a rich variety of real-world statistics, we then trained eccNET on MNIST, which contains only grayscale handwritten digits.
We found that the model trained on MNIST shows a similar asymmetry, but its absolute reaction times, as indicated in the red box, are very far from humans'. This implies that the model did learn some of the asymmetry biases, but it was not able to learn the features needed to quickly find the target.
We further evaluated the role of the training regime in other experiments. First, we trained the model on ImageNet images after applying a fisheye transform, to reduce the proportion of straight lines and increase the proportion of curves. Second, we introduced extra vertical and horizontal lines into the training data, thus increasing the proportion of straight lines. Third, to test whether data statistics other than ImageNet's would alter the polarity, we also trained the model on the Places dataset as well as a rotated Places dataset.
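For concreteness, here is a minimal sketch of two of these training-set manipulations (the rotation and the added straight lines); the line spacing, intensity, and thickness below are illustrative choices, not the values used in the paper.

```python
import numpy as np

def rotate_90(image):
    """Rotate a training image (H, W, C array) by 90 degrees, as in the
    rotated-ImageNet / rotated-Places manipulation."""
    return np.rot90(image, k=1, axes=(0, 1))

def add_straight_lines(image, spacing=32, value=255):
    """Overlay extra horizontal and vertical lines on a training image to
    increase the proportion of straight lines; spacing and intensity here
    are made-up values for illustration."""
    out = image.copy()
    out[::spacing, :] = value    # horizontal lines
    out[:, ::spacing] = value    # vertical lines
    return out
```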
Due to time constraints, please refer to our paper for more analysis of these additional experiments. Moreover, to evaluate whether asymmetry depends on the architecture of the visual processor, we replaced VGG16 in the IVSN model with ResNet and repeated the six experiments. Though the ResNet backbone does not approximate human behavior as well as eccNET, it still shows a positive asymmetry index. This suggests that search asymmetry is a general effect in deep nets.
Apart from search asymmetry, to evaluate the model's visual search ability in the natural world, we also tested the performance of eccNET in more complex visual search environments. In the first experiment, the model has to search for a target object in an object array. In the second experiment, the model has to search for a target in a natural image.
In the third [AUDIO OUT], the model has to search for Waldo. To assess the search efficiency of the model, we use the cumulative search score, that is, the probability that either the human subjects or the model find the target within a given number of fixations. The x-axis denotes the number of fixations, and the y-axis denotes the cumulative probability that the model finds the target.
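A minimal sketch of how such a cumulative search score could be computed (my own variable names; infinite values mark trials where the target was never found):

```python
import numpy as np

def cumulative_search_score(fixations_to_find, max_fixations=30):
    """Fraction of trials in which the target is found within k fixations.

    fixations_to_find: per-trial number of fixations at which the target was
    found (np.inf for trials where it was never found within the budget).
    Returns an array indexed by the fixation budget k = 1..max_fixations,
    i.e. the cumulative probability plotted on the y-axis described above.
    """
    fixations_to_find = np.asarray(fixations_to_find, dtype=float)
    return np.array([(fixations_to_find <= k).mean()
                     for k in range(1, max_fixations + 1)])
```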
We can see that eccNET approximates human behavior on all three tasks, and it is close to IVSN's performance in terms of search efficiency. In the second column, to assess the spatiotemporal similarity between two fixation sequences, we introduce the scanpath score. The scanpaths predicted by eccNET share more similarities with humans than those of the previous IVSN model.
Lastly, we assessed the distribution of saccade sizes made by humans and by the model. We can see that eccNET approximates the human saccade distribution much better, while the IVSN model fails to do so. So here are several key messages. First, asymmetry reflects strong priors in our visual system about the features that guide search. Second, we introduced a biologically plausible model for visual search with novel eccentricity-dependent pooling layers.
Our model approximates human search asymmetry without any prior exposure to the stimuli or task-specific training. We conducted a series of augmentation studies to alter the statistics of the training data. Our experimental results suggest that search asymmetry is a general effect in deep nets. The results also highlight that the asymmetry of search behaviors arises from both the training regime and the network architecture. Please feel free to check out our GitHub page for more information. Thank you.