Modeling human visual search: A combined Bayesian searcher and saliency map approach for eye movement guidance in natural scenes
December 16, 2020
December 12, 2020
Melanie Sclar, University of Buenos Aires
All Captioned Videos SVRHM Workshop 2020
SPEAKER: Our next speaker is Melanie Sclar, and she is speaking from Argentina, from the University of Buenos Aires. She is the last presenter in the series of the four accepted papers that had the highest scores. As a fun fact, [INAUDIBLE] on your website is an Olympian, a math Olympian. So that's kind of cool, [INAUDIBLE] Olympicos, as we'd say in South America. So without further ado, take it away, Melanie.
MELANIE SCLAR: Thank you, Antonio, for the presentation. Yeah, well, I am Melanie Sclar. I'm here to present "Modeling human visual search: a combined Bayesian searcher and saliency map approach for eye movement guidance in natural scenes." And this is joint work with [INAUDIBLE] and [INAUDIBLE], all from the University of Buenos Aires in Argentina.
So predicting eye movements for passive observation has been thoroughly studied. But when we take a task into account, it becomes much more difficult. As we all know, people will fixate on different points in the image depending on whether they are free viewing or doing a task, and on which task they're doing. For example, one task may be visual search.
When dealing with a task, it is important to consider not only the fixation locations but the order of those fixations-- in other words, the scanpath. So it might be that two different people are focusing on the same points but in a different order, and that might make one person finish before the other one. So it's important to look at the shape of the scanpath, what it looks like.
So how can we model eye movements during visual search in natural images? There's been some prior work on this, mainly by Najemnik and Geisler in 2005, who proposed an influential Bayesian model called the Ideal Bayesian Searcher, or IBS for short. But it has only been tested on artificial images. So our proposal is to adapt IBS with the necessary considerations for natural images.
Then the question becomes what we should take into account when adapting IBS to work on natural images. So this is a top-level schema of how IBS and our modifications work, and I'm going to walk you through it and all the modifications that we made.
So first of all, we compute the template response. The template response quantifies the similarity between the image patch at a position and the target image, given the current fixation position. What this means is, if I'm looking at a position i, how similar does this position j look to the target, given that, for example, I'm very far away? So this position might not be very visible, et cetera.
So once we have this computed, we can estimate the probability of the next fixation. And for that, we'll need a prior probability. We'll talk about what we did here. We can then update the posterior probability for the target location, choose the maximum [INAUDIBLE] location, and, of course, execute the saccade to the position that attains the maximum.
So first of all, let's look at the prior that we added. Previously, it was a uniform distribution because in Najemnik and Geisler's case, all the positions were as likely as each other to have been fixated first. Of course in natural images, this is not the case as it has been studied with saliency maps. So we decided to put that as a prior.
We know that saliency maps predict fixations very well in a free-viewing task. But how will they perform on a visual search task? So we selected different models from the MIT/Tuebingen Saliency Benchmark, and we tested how well they predict the first fixation, the second, the third, the fourth, and so on.
And we see here that at first they predict very well, but they sharply drop afterwards. In black, you'll see humans, and the best model that we found was DeepGaze II. So that's the reason why we're going to use the DeepGaze II saliency model from now on for all the experiments.
Great, so we know what prior we'll put, but I have not talked about how to compute the template response. We also modified this computation. We changed it a little bit from how the 2005 work was doing it.
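One way to score a saliency map separately for each fixation rank is an AUC-style measure: how often the saliency at a human fixation exceeds the saliency at an arbitrary pixel. The talk does not name the metric that was used, so the choice of AUC and the names `auc_for_rank` and `auc_by_rank` are assumptions made for this sketch.

```python
import numpy as np

def auc_for_rank(saliency, fixations):
    """AUC of saliency at the given fixations against all pixels (ties count half).

    saliency: 2D array; fixations: list of (row, col) integer positions.
    """
    negatives = np.sort(saliency.ravel())
    positives = np.array([saliency[r, c] for r, c in fixations])
    # Number of pixels strictly below / below-or-equal to each fixated value.
    below = np.searchsorted(negatives, positives, side="left")
    below_or_equal = np.searchsorted(negatives, positives, side="right")
    return float(np.mean((below + below_or_equal) / 2.0) / negatives.size)

def auc_by_rank(saliency, scanpaths, max_rank):
    """One AUC per fixation rank (0 = first fixation), pooling over scanpaths."""
    scores = []
    for rank in range(max_rank):
        fixations = [sp[rank] for sp in scanpaths if len(sp) > rank]
        scores.append(auc_for_rank(saliency, fixations) if fixations else float("nan"))
    return scores
```

A value near 0.5 means the map is no better than chance for that rank, which is the kind of drop over later fixations described above.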
So just to reiterate, this is quantifying the similarity between a given position i and the target image, given that I'm looking at the position j. So if this position is very far away, visibility is low. And then we really are not sure of what we're seeing because it's in the peripheral vision, for example. So in IBS, this was modeled as a Gaussian.
So the value for each pair (i, j) was drawn from a Gaussian distribution. This is also very good because if you run the model twice, you might get different [INAUDIBLE], which is a nice characteristic. And the variance was 1 divided by the visibility to reflect this.
When you have high visibility, you are sure of your judgment-- so lower variance. And when you have low visibility, then you have high variance. You're not very sure.
Regarding the mean, they were only putting a positive value if the analyzed position was the target location and a negative value otherwise. So they were not looking at how similar this position looks to the target, which makes sense, again, in their context of an experiment with artificial images, where you don't have this behavior of maybe something that looks like my keys but is not actually my keys. And I might look and find out, oh no, they're not.
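As a rough sketch of the original IBS loop described so far (prior, Gaussian template response with variance 1 over visibility, posterior update, move to the maximum), the Python below may help. The function name, the means of +0.5 and -0.5, the flattened grid of locations, and the precomputed `visibility` matrix are illustrative assumptions, not details taken from the talk. For simplicity it moves to the posterior maximum, whereas Najemnik and Geisler's ideal searcher selects the fixation expected to maximize the probability of locating the target.

```python
import numpy as np

def bayesian_search(prior, visibility, target, start, max_saccades=12, seed=0):
    """Simplified Bayesian searcher on n flattened grid locations.

    prior: (n,) strictly positive probabilities.
    visibility: (n, n), visibility[k, i] > 0 is how well location i is seen
        while fixating location k (assumed to be given).
    Returns the list of fixated location indices, starting with `start`.
    """
    rng = np.random.default_rng(seed)
    n = prior.size
    # Mean template response: positive at the target, negative elsewhere.
    mean = np.full(n, -0.5)
    mean[target] = 0.5
    log_posterior = np.log(prior)
    fixation, scanpath = start, [start]
    while fixation != target and len(scanpath) <= max_saccades:
        vis = visibility[fixation]
        # Noisy template response with variance 1 / visibility.
        response = rng.normal(mean, 1.0 / np.sqrt(vis))
        # Gaussian log-likelihood ratio "target here" vs "not here" is vis * response.
        log_posterior = log_posterior + vis * response
        fixation = int(np.argmax(log_posterior))
        scanpath.append(fixation)
    return scanpath
```

Because the responses are sampled, two runs with different seeds can produce different scanpaths, which matches the stochastic behavior mentioned above.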
So for that, we analyzed the image patch at the position I'm considering moving to and computed the correlation between the target-- because we actually show people the target that they're going to search for-- and that image patch. And we weight that value with the prior mean, the mean that IBS was already using. That's what I mean.
So we are combining the two, and we assign more weight to the IBS mean, the original one, if the visibility is high. And if it's low, we assign a higher weight to the image similarity. Great, so we now have an idea of how we are computing the template response.
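A minimal sketch of this combination, assuming a plain Pearson correlation between equally sized grayscale patches and a linear blend controlled by a visibility value in [0, 1]. The talk only says that high visibility favors the original IBS mean and low visibility favors image similarity, so the exact weighting and both function names are assumptions.

```python
import numpy as np

def patch_correlation(patch, target):
    """Pearson correlation between two equally sized grayscale arrays.

    Returns nan if either array is constant, since the correlation is undefined.
    """
    return float(np.corrcoef(patch.ravel(), target.ravel())[0, 1])

def combined_mean(ibs_mean, similarity, visibility):
    """Blend the original IBS mean with image similarity.

    visibility is assumed to be normalized to [0, 1]: at 1 the original IBS
    mean is used unchanged, at 0 only the image similarity is used.
    """
    return visibility * ibs_mean + (1.0 - visibility) * similarity
```

With this blend, a distractor that resembles the target and sits in the periphery gets a higher mean response than an unrelated patch, so the searcher may be drawn to it before ruling it out.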
But how do we compute the visibility map? In prior work, again, this was estimated empirically with a previous experiment. So before you came for the visual search experiment, you came for another one just to model the individual differences of the retina for each individual subject. And we decided to simplify this by just taking a 2D Gaussian with the same parameters for every participant.
This has benefits, right, because not only do we avoid an extra experiment, but we are also avoiding possible leaks of viewing patterns to the model. So just to reiterate, on top of this simplification, we also changed the prior from uniform to a saliency map. And we also modified the computation of the template response to account for the presence of distractors.
So to test our model, we designed an experiment and gathered data from 57 subjects searching for a target in 134 indoor images-- so kitchens, living rooms, et cetera. And each trial will have a randomized number of [INAUDIBLE] that you're allowed. So we first show you the target.
Then we make you fixate on a specific point, the same point for everyone. And without subjects actually knowing, they have a limited amount of [INAUDIBLE]. After that, they will be asked where the target was and their confidence. This last bit is not included in today's presentation. It's part of a larger work.
Great, so with this data set, now we can actually compare model performance between CIBS and previous work, and CIBS with different saliency maps. So first, here, I'm going to show CIBS, which stands for Correlation-based IBS-- so our modification that includes image similarity through the correlation. And we decided specifically that we wanted to account for several different perspectives, several different ways of comparing our model with humans.
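The shared visibility map can be sketched as a 2D Gaussian centered on the current fixation. The peak value of 1 and the parameter names `sigma_y` and `sigma_x` are assumptions; the talk does not give the parameters that were used.

```python
import numpy as np

def gaussian_visibility(shape, fixation, sigma_y, sigma_x):
    """Visibility of every grid cell while fixating `fixation` = (row, col).

    Returns an array of the given shape with value 1 at the fixation,
    falling off as a 2D Gaussian with the given standard deviations (in cells).
    """
    rows, cols = np.indices(shape)
    fix_row, fix_col = fixation
    dy = (rows - fix_row) / sigma_y
    dx = (cols - fix_col) / sigma_x
    return np.exp(-0.5 * (dy ** 2 + dx ** 2))
```

Using the same `sigma_y` and `sigma_x` for every participant is what removes the need for a per-subject calibration session.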
And these are not all the metrics that we use. There are some more in the paper. But we're specifically interested in making sure that the [INAUDIBLE] look human-like and not that we are just achieving the result in the same amount of fixations.
The first panel, though, is quantifying the proportion of targets found. So if I allowed the subject to only do two [INAUDIBLE], what is the proportion of the targets that they will find? What's the percentage of the trials that they will be able to correctly solve, and so on and so forth for all the possible maximum [INAUDIBLE]?
So the box plots are the humans, and the lines on top are the different models, which are CIBS, our model, with the DeepGaze II saliency map-- so a state-of-the-art saliency map-- a center bias, a uniform or flat prior, and a noise prior. So again, we are not trying to just find all the targets with two [INAUDIBLE]. We are trying to make it look like humans. So something that goes near the mean of this [INAUDIBLE] plot would be beneficial.
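The quantity in this first panel can be computed with a few lines of Python. The sketch assumes each trial is recorded as a pair (allowed eye movements, whether the target was found); that record layout and the function name are hypothetical.

```python
from collections import defaultdict

def found_rate_by_budget(trials):
    """Proportion of targets found for each eye-movement budget.

    trials: iterable of (budget, found) pairs, with found a bool.
    Returns a dict mapping budget -> fraction of trials where the target was found.
    """
    counts = defaultdict(lambda: [0, 0])  # budget -> [hits, total]
    for budget, found in trials:
        counts[budget][0] += int(found)
        counts[budget][1] += 1
    return {b: hits / total for b, (hits, total) in sorted(counts.items())}
```

Running the same function on model-generated trials gives the curves that are overlaid on the human box plots.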
So the second plot I think is very interesting because we are now looking at the shapes of the [INAUDIBLE] with [INAUDIBLE] dissimilarity metric. Since it's dissimilarity, lower is better. And this metric is based on string-edit distance.
So it is trying to analyze how similar the shape is. There is some variance between humans as well. As you can see, they are in black-- one human against all the other ones.
And then all the models appear. But here, we can already see that the red and the blue ones are performing better. Their distribution is more human-like than the purple and yellow ones.
And on the third panel, we also use the same similarity-- dissimilarity metric, sorry. But we use it to compare the dissimilarity between humans with the dissimilarity between humans and each model. So on the y-axis, that abbreviation means [INAUDIBLE] dissimilarity between humans, and on the x-axis is [INAUDIBLE] dissimilarity between humans and a model.
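A common way to compare scanpath shapes is an edit distance over sequences of grid cells. The sketch below uses a plain Levenshtein distance normalized by the longer sequence; whether this matches the exact metric used in the work is an assumption, as are the `cell_size` discretization and the function names.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(previous[j] + 1,              # delete x
                               current[j - 1] + 1,           # insert y
                               previous[j - 1] + (x != y)))  # substitute or match
        previous = current
    return previous[-1]

def to_cells(scanpath, cell_size):
    """Map (y, x) pixel fixations to (row, col) grid cells."""
    return [(int(y // cell_size), int(x // cell_size)) for y, x in scanpath]

def scanpath_dissimilarity(path_a, path_b, cell_size):
    """Edit distance between discretized scanpaths, scaled to [0, 1]."""
    a, b = to_cells(path_a, cell_size), to_cells(path_b, cell_size)
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

Two observers who visit the same cells in a different order get a nonzero dissimilarity, which is what makes the measure sensitive to fixation order and not only to fixation locations.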
So we are looking for a slope of exactly one. We are looking forward to [INAUDIBLE] the identity function. So they are indistinguishable.
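The slope can be estimated with a least-squares fit. This sketch assumes one point per image and a line forced through the origin; the talk does not say how the fit was done, so both choices and the function name are assumptions.

```python
import numpy as np

def slope_through_origin(human_model, human_human):
    """Least-squares slope of y = slope * x with no intercept.

    human_model: x values, dissimilarity between humans and a model, per image.
    human_human: y values, dissimilarity between humans, per image.
    """
    x = np.asarray(human_model, dtype=float)
    y = np.asarray(human_human, dtype=float)
    return float(np.dot(x, y) / np.dot(x, x))
```

A slope close to one means the model is about as dissimilar from humans as humans are from each other.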
So in that sense, of course, again, red and blue are better than purple and yellow, which are very far away from the slope of one. But if you take an overall look at them, then you will be able to tell that red is better. Again, we also have some additional metrics in the paper. We can discuss also in the poster session if you're interested.
Then, of course, we are now going to compare CIBS to other possible models, first to IBS and to some other baselines. So again, it's going to be with a DeepGaze II prior because we know that's the best one that we obtained from before. So we are going to have CIBS, IBS, [INAUDIBLE], and a saliency-based model, which just means looking at the most salient location each time and forcing [INAUDIBLE] over time.
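A saliency-only baseline of this kind can be sketched as repeatedly fixating the most salient location. To keep it from returning to the same peak, the sketch suppresses a disk around each visited location; that suppression rule, its `radius` parameter, and the function name are assumptions, since the corresponding words in the talk are inaudible.

```python
import numpy as np

def greedy_saliency_scanpath(saliency, n_fixations, radius):
    """Fixate the most salient remaining location n_fixations times.

    After each fixation, every cell within `radius` of it is excluded
    from later choices (an assumed suppression rule).
    """
    remaining = saliency.astype(float).copy()
    rows, cols = np.indices(remaining.shape)
    scanpath = []
    for _ in range(n_fixations):
        r, c = np.unravel_index(np.argmax(remaining), remaining.shape)
        scanpath.append((int(r), int(c)))
        visited = (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
        remaining[visited] = -np.inf
    return scanpath
```

Unlike the Bayesian searchers, this baseline never uses the target or the visibility map, so its scanpath is the same whatever the observer is looking for.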
So this is the same as before. In the third panel, we can already see that the green and pink differ a lot from humans. And between green and blue, which would be CIBS in green and IBS in blue, they both perform very well.
Maybe if you look at all of these metrics holistically, CIBS works a little bit better. But they both work excellently. So to conclude, CIBS and IBS have better performance than non-planning strategies, and we also showed that adding nontrivial priors resulted in [INAUDIBLE] behavior more similar to humans. We still have lots of future work, of course.
For example, we may want to change the template response, to explore other possibilities for showing this possible disturbance caused by distractors. And also, we want to explore individual differences for images where bimodalities or multimodalities exist, because we see that humans don't always perform search in the same way. For example, on the image that I'm showing here on the right, we asked people to search for a cup.
Half of them started on the table to the left, and half of them started on the counter top to the right. So we definitely need to move towards models that can account for these individual differences. So that's all I had for today.