Visual Search: Differences between your Brain and Deep Neural Networks
December 16, 2020
December 12, 2020
Miguel Eckstein, University of California, Santa Barbara
All Captioned Videos SVRHM Workshop 2020
PRESENTER: --Miguel Eckstein, who was my advisor in graduate school. So Miguel and I go back a long time, as I'd say. And I remember a peculiar joke that-- Miguel has influenced most of my work on foveation. And I was very excited for my first paper coming out. And I [? told him, ?] Miguel, the foveation revolution one day might be happening. And I hope it does.
And Miguel would tell me, well, I've been working-- the foveation revolution has been happening for quite some time, [INAUDIBLE]-- 25, 30 years before you were born. So without further ado, Miguel Eckstein will be talking about visual search-- differences between your brain and deep neural networks. Miguel-- [SPANISH]. Take it away.
MIGUEL ECKSTEIN: Thanks, everybody who-- inviting me to [? participate ?] in this workshop. I've enjoyed the talks this morning and this afternoon. They have been great. So some of the work I'll be talking about fits in quite well with some of the previous talks. I will be comparing human performance and some deep neural networks. For this talk, it will be quite simple. It will be at the behavioral level. It will be at the output level of the DNNs rather than looking at the-- some of the layers and the representations.
I put together a talk that has three examples about some-- three examples of differences in which-- by which humans, and actually, DNNs search. So there are three examples from visual search. The first two examples are intellectually more interesting. They illustrate different solutions to these problems, possibly reflecting human constraints for different goals that humans might have when they're doing these tasks, relative to the more narrow goals that the DNNs have.
And the third example is a little bit more obvious, but it has interesting implications in the practical world. It really reveals some of the more obvious limitations of humans when they actually search through images, but it has important implications. So I will start with the two examples. We'll see how we do.
I am going to start with a demo, which is a few years old now. You might have seen it. It might work not so well over Zoom, but we'll try it anyway. So what I'm going to ask you to do is to just look for a toothbrush, and then you can raise your hand or whatever. You can clap or whatever-- put your little icon there-- and then we'll see where we go. So I'm going to put an image up, and you're going to look for the-- you're going to localize a toothbrush.
OK. And by now, I think most of you probably found the toothbrush. And if it worked, most of you probably found this toothbrush right here, and-- but maybe some of you have missed this one right here, that large one, that there. And again, it depends on visual angle sometimes, but this is what typically happens.
Now, why does it happen? Arguably, you could actually say that this is actually-- the large toothbrush is actually way more salient than this small toothbrush. But what happens is, when I tell you to look for a toothbrush, your brain rapidly processes the scene, and it's actually [INAUDIBLE] to spatial scales when you're looking for a toothbrush that are consistent with the rest of the objects and the rest of the scene.
And this toothbrush is consistent with that spatial scale for the target, while this one is actually not consistent. It's mis-scaled, and it would be very unusual to actually have a toothbrush of that size. It's an error, but it's sort of a rational strategy, because these toothbrushes are just quite uncommon in the real world. So this is the main phenomenon, this idea that humans actually miss these giant objects.
But the idea is that it's actually a useful strategy because, if I actually have other large objects, like a big broom or something that might look like a toothbrush, you might be able to discount that just because it's at the wrong scale. So your brain rapidly looks at scales that are consistent with a likely size of the object that you're looking for.
So we did this. Obviously, I wanted to show-- we want to do, aside from a demo, have some data. And of course, we wanted [AUDIO OUT] giant toothbrush-- we wanted many objects-- and it's actually quite hard to actually have some of these giant [AUDIO OUT]. So what we did is we actually had an experiment that used these simpler computer-generated scenes, where we actually change the scale of the object so it was consistent with a scene, or inconsistent-- and typically larger, like I just showed you.
And we had 40 different scenes. We had different types of targets. And people would actually look at a fixation. Then you would actually have an image that tells you, OK, look for an object-- a toothbrush, or cork, or whatever you're looking for-- so just one object per trial. And then the stimulus will come up for a second so you have time to make eye movements, and then you actually can tell us whether the object was present or absent.
So half the time the object was not present, and half the time the object was present. And of those times when it was present, sometimes it was at the right scale, sometimes it was at the wrong scale. And every single participant just saw that scene with that object once, so you wouldn't have any sort of learning and so on. This is just an example of the toothbrush at the right scale right here, and that's the-- a mis-scale. So it's actually larger than what you'd actually expect.
Here are the results. So let's look at what I just showed you for human performance. So this is actually humans at the-- objects at the normal scale. So this is hit rate-- so how many times they actually correctly found that target and said that there was a target present, when it was actually at the normal scale. And this is when it's actually mis-scaled.
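As a rough illustration of the hit-rate measure described here, the sketch below computes the proportion of target-present trials on which the observer reported the target, split by scale condition. The trial format, the key names, and the example trials are assumptions made for illustration; they are not the study's data or analysis code.

```python
from collections import defaultdict

def hit_rates(trials):
    """Return the hit rate per scale condition over target-present trials.

    Each trial is a dict with hypothetical keys:
      'present'  : bool, whether the target was in the scene
      'scale'    : 'normal' or 'mis-scaled' (only meaningful when present)
      'said_yes' : bool, whether the observer reported the target
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t in trials:
        if not t["present"]:
            continue  # target-absent trials cannot produce hits
        totals[t["scale"]] += 1
        if t["said_yes"]:
            hits[t["scale"]] += 1
    return {cond: hits[cond] / totals[cond] for cond in totals}

# Made-up trials for illustration only; these are not the study's data.
example_trials = [
    {"present": True, "scale": "normal", "said_yes": True},
    {"present": True, "scale": "normal", "said_yes": True},
    {"present": True, "scale": "mis-scaled", "said_yes": False},
    {"present": True, "scale": "mis-scaled", "said_yes": True},
    {"present": False, "scale": None, "said_yes": False},
]
rates = hit_rates(example_trials)
```

A lower value for the mis-scaled condition than for the normal condition would correspond to the missed giant objects described in the talk.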
As you would imagine, yes, if you do this many, many, many times, people will pick up that you're tricking them, and they can adjust their strategy. It is not the case that you remain-- you will miss all giant objects forever. Eventually, you pick up that somebody's tricking you, and you can adjust your strategy. Our experiments were about 42 trials, so they were quite short, and we got a very strong effect.
These are three different CNNs. These were the state of the art in 2016. This is a few years back now. And this is the target object probability for the objects that were normally scaled and the mis-scaled. Now, what you can see is this dissociation, which is that the CNNs do not have any of this dependence on the relative size of the object-- the size of the object relative to the rest of the scenes or the objects around them.
So this is the first interesting thing, that these state-of-the-art object detectors are searching for these objects in a drastically different way than humans do. Humans are really guided by the relative positions and sizes of objects to other-- to the rest of the scene, while these object detectors are actually a little bit more indifferent to that. And of course, this might be changing with time, but this is actually in 2016. I'll show you a slide of another version of this algorithm, YOLO, which is-- I think now is up to version 3, or 3.0, or so on.
The next slide-- I'll show you just a little snippet of-- we've been interested in this and we've been interested in trying to find out where actually this happens in the brain. So what we've done is we've put humans in an MRI, while we actually change the relative size of an object relative to the rest of the scene, and try to look at the BOLD activity as a function of that relationship. And we've looked at maybe seven or eight areas that were predetermined or pre-segmented. We're using functional localizers.
I'm just showing you one of these areas-- TOS, which is an area that actually cares quite a bit about the actual physical size of-- it responds to physical size of objects, whether the objects are large or small. And here what we see is that this area is actually also caring about the relative size of an object with the rest of the scene.
So what you see here in blue is when the object is in its normal size relative to the rest of the scene. And then, here as you actually-- it's become too small. We're either doing two manipulations. We're either shrinking the size of the object itself-- so it's too small relative to the rest of the scene-- or we're actually-- we are also changing the actual field of view and keeping the [? retinal ?] size of the object.
And those two give you a similar effect. Sometimes the object is getting smaller, so this is less surprising, but this is a more surprising thing when you actually make it larger. When you make the object larger in relative size to the rest of the scene, the activity in TOS decreases. And that is not the case for other areas. If I show you FFA or V1, this is not happening.
So this is identifying where actually this happens in the brain, and we do similar things-- now this is years later. So we have YOLO version 3. We do get that, when the object gets smaller, the-- we have a significant drop, but we don't get this decrease in activity when the object gets too large relative to the rest of the scene-- so again, highlighting this dissociation in which-- with which humans and these DNNs are actually processing these scenes.
So now you might be wondering, what is it that's behind this dissociation? We don't have a full answer, but we're wondering about that too. There are two possible answers. There might be more. So one of-- is that there's really-- the current object detectors have processing bottlenecks.
As you know, many of the object detectors actually have a bounding box, so they're not-- they're partitioning and partially analyzing the scene. So they're not really analyzing the entire scene-- just mostly having to do with data processing bottlenecks that might-- in the next few years, we might get over. One exception is YOLO. The YOLO algorithm actually uses a version of the entire image, although it's a little bit low resolution. So it's actually surprising that that one doesn't take up any sort of spatial relationships.
So it might be that, as we actually-- and these models are mostly feedforward. It might be that it's actually-- as we build more things from other domains-- so I'm thinking of natural language processing work-- contextual information is actually fundamental there-- maybe we might start seeing these object detectors show some of the properties that we actually see in human behavior.
So the second one is perhaps related to these things from natural language processing. Maybe the object detectors are missing something fundamental and cannot learn these contextual relationships. And this might tie into some of the things that we've seen in other talks having to do with feedback and so on that might actually allow for faster and better learning of these contextual relationships.
So that is example 1. So now I'm going to move to a second example, where we see another interesting dissociation between how DNNs search and humans do. OK, here is the task. This is a person search in the wild. What you see are two targets. So there's Ian and Amanda, and just one of them is going to appear in the scene, or neither of them.
People are answering a three-class task. They either say Ian, they say Amanda, or neither. And the way this works is they're fixating here. We're going to eye track them-- our subjects. And these stimuli are actually videos, so they're going to actually be moving. And these videos were filmed over a period of about a month and a half or two months on campus at UC Santa Barbara, and Ian and Amanda were instructed to actually dress differently each time. And the people around them and the environments actually also changed and so on.
And this is what we're going to do. Here, what we are interested in is-- we're interested in understanding which parts of the body-- the faces, the heads-- are contributing to performance for humans and for the CNNs-- which features are important. So we're going to do that two ways. One way is we're going to look at where people are actually looking when they're doing this task, looking at their fixations. And the second way is we're going to manipulate the presence of features in these videos. And I'll show you how that-- how we did that.
OK, so here is the intact. Each of these dots is actually a different subject. It's the same video seen across different trials, because they're different people. We're not tracking them all simultaneously. I just collapse the data from everybody. And this is the-- what we call the intact condition. I'm going to play it again. And what you see is that people spend a whole lot of time looking at-- they look around, but they look a lot of times to the head.
Now, this is one condition. In a different condition, we actually eliminated-- this is what we call the-- here we go-- the headless condition. We've eliminated the heads. So the task remains the same. You still have to identify whether Ian, Amanda, or neither is present in the videos, but you've seen lots of different videos, different environments-- Ian, Amanda with different clothing, but no heads are present.
The interesting thing is that people still are looking towards the top part of the body, which-- we were surprised. They're looking for that head. Then we have a bodiless condition where the body got eliminated, and this is fairly obvious. They tend to look right there at the head. And then we had a fourth condition, which is-- this was hard to do because we had to segment the face and keep the head. It's called the faceless.
This is the best we could do, but you can see it's not perfect. We tried to eliminate the face, but keep-- you see the hair is there. So part of the hair is there and the head is there. And you can see people are looking at that little empty spot quite a bit. What we tried to do is really combine from all these videos by registering-- scaling the bodies of these people across videos-- because they're [INAUDIBLE] different angles and they have different sizes-- and registering all the fixations into one silhouette.
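One simple way to picture this kind of registration is sketched below: each fixation is expressed relative to the bounding box of the person in that frame, so that bodies of different sizes and positions map onto a common grid, and the grid cells are then counted. The bounding-box annotation, the grid size, and the example fixations are all assumptions for illustration; the registration used in the study may differ.

```python
def to_body_coords(fix_x, fix_y, box):
    """Map a fixation in image pixels to body-relative coordinates.

    box is a hypothetical (left, top, width, height) annotation of the
    person in that frame. Returns (u, v), where (0, 0) is the top-left
    corner of the box and (1, 1) is the bottom-right corner.
    """
    left, top, width, height = box
    return (fix_x - left) / width, (fix_y - top) / height

def heat_map(fixations, n_rows=10, n_cols=4):
    """Count body-relative fixations in an n_rows x n_cols grid."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for fix_x, fix_y, box in fixations:
        u, v = to_body_coords(fix_x, fix_y, box)
        if 0 <= u < 1 and 0 <= v < 1:  # ignore fixations off the body
            grid[int(v * n_rows)][int(u * n_cols)] += 1
    return grid

# Two made-up fixations on people of different sizes, both near the head.
example_fixations = [
    (110, 55, (100, 50, 40, 100)),   # small person
    (330, 210, (300, 200, 80, 200)), # person twice as large
    (500, 500, (300, 200, 80, 200)), # fixation away from the body
]
grid = heat_map(example_fixations)
```

After normalization, both head fixations land in the same top-row cell even though they are far apart in image pixels, which is the point of collapsing across videos.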
And this is what we got when we did this, sort of a heat map of where people were looking. Unsurprisingly, when it's intact, they're actually looking at-- right there at the faces. Here we go into bodiless. Well, when there is no body, they're still looking there. There's not really much to look at there. And then, for faceless, even though there's no face there, they're still looking there.
And now, even for headless, it does move somewhere, because there is nothing here, but it remains quite high-- so quite high in the body. They don't really look to a center of mass or anything. They're really still looking just towards the neck, where the body finishes. And that would be the head, but the head is not there. So this gives you this idea that people are actually focusing on faces and heads, and this is an [? overpracticed ?] and pretty common strategy.
OK, when we looked at performance across these four conditions, what we saw is the following. This is performance in the intact. Chance is 33%. Now, when you eliminated the face, you had this big drop in performance. And then, when you eliminate the head, that was another-- that's similar drop in performance as the face. Now, if you actually eliminate the body, you can see that human performance did not vary much.
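The 33% chance level follows from having three response classes, assuming each is equally likely. A minimal sketch of scoring such a task per condition is shown below; the condition names come from the talk, but the response lists are made up for illustration and are not the study's data.

```python
CLASSES = ("Ian", "Amanda", "neither")
CHANCE = 1 / len(CLASSES)  # 1/3, assuming the three classes are equally likely

def accuracy(trials):
    """Proportion correct for a list of (true_label, response) pairs."""
    return sum(truth == resp for truth, resp in trials) / len(trials)

def accuracy_by_condition(data):
    """Map each condition name to its proportion correct."""
    return {cond: accuracy(trials) for cond, trials in data.items()}

# Made-up responses for illustration only; not the study's data.
toy_data = {
    "intact": [("Ian", "Ian"), ("Amanda", "Amanda"),
               ("neither", "neither"), ("Ian", "Ian")],
    "headless": [("Ian", "Amanda"), ("Amanda", "Amanda"),
                 ("neither", "neither"), ("Ian", "neither")],
}
acc = accuracy_by_condition(toy_data)
drop = acc["intact"] - acc["headless"]  # cost of removing the head
```

Comparing each condition's accuracy against the intact condition, and against `CHANCE`, gives the kind of per-feature cost the talk describes for faces, heads, and bodies.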
And there has been literature that people can use body to identify-- observers can use bodies to identify people. That's been documented. A lot of those studies are actually people walking by themselves, and not these more complex cluttered scenes that I actually had, sort of in a search scenario. So this is a little bit surprising.
Now, when we ran the CNN-- in this case, this is a ResNet-18, where we actually trained the last layers to be able to identify Amanda and Ian. And what we found is a different pattern of results. In particular, we found that the-- eliminating the face had very little impact on performance for the CNN, and the body was actually quite important. You can see that detriment in performance. And the head itself-- maybe the hair or something-- was also important.
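To make "training the last layers" concrete, the sketch below trains only a small softmax classifier on top of fixed feature vectors, standing in for the outputs of a frozen backbone. It is a dependency-free toy under stated assumptions: the three-dimensional features, the labels, the learning rate, and the number of epochs are invented for illustration, and this is not the ResNet-18 pipeline used in the study.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_for(W, b, x):
    """Linear head: one score per class for feature vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]

def train_head(features, labels, n_classes, lr=0.5, epochs=200):
    """Fit only the final linear layer with SGD on cross-entropy loss.

    The features are treated as fixed, as if produced by a frozen backbone.
    """
    n_feat = len(features[0])
    W = [[0.0] * n_feat for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = softmax(logits_for(W, b, x))
            for k in range(n_classes):
                grad = probs[k] - (1.0 if k == y else 0.0)  # dLoss/dlogit_k
                for j in range(n_feat):
                    W[k][j] -= lr * grad * x[j]
                b[k] -= lr * grad
    return W, b

def predict(W, b, x):
    """Return the index of the highest-scoring class."""
    scores = logits_for(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical frozen-backbone features; classes 0=Ian, 1=Amanda, 2=neither.
toy_features = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.2], [0.1, 1.0, 0.0],
                [0.0, 0.8, 0.1], [0.0, 0.1, 1.0], [0.2, 0.0, 0.9]]
toy_labels = [0, 0, 1, 1, 2, 2]
W, b = train_head(toy_features, toy_labels, n_classes=3)
predictions = [predict(W, b, x) for x in toy_features]
```

Because only the head is optimized, and only for this one identification task, it is free to weight whatever features in the frozen representation happen to predict identity, which is consistent with the body-reliant pattern described in the talk.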
So here's a second dissociation in the way humans and CNNs are actually doing this search task. Humans are very face-centered in the way they do this, and they use the body to a lesser extent, while the CNNs are actually using the body quite a bit, and sort of care less about the face. Now, why would this be? Why would we have this dissociation in strategies?
Well, this is what we were thinking. We think, here, CNN is trained to optimize this single task, the search for these-- Ian or Amanda. But in general, humans in everyday life really need to simultaneously optimize a number of different tasks that are important-- identification, but also the emotional state-- inferring what's the emotional state of the person, the gaze orientation, and so on.
So what we think is that this-- partly, this might reflect this idea that the human strategy is really optimizing a set of multiple tasks, while the CNN is optimizing one. And the second one is perhaps-- there are obviously constraints on human cognitive resources-- in particular, memory. So as much as we could, we tried to have Ian and Amanda wear different clothes.
Like I said, it was over 45 days, the filming. I am sure they repeated the shoes one day or another. It could be that the CNNs pick up on all these things, while humans-- it would be an incredible memory load to pick up all these individual features, even though they're appearing once in a while and they're predictive of somebody's identity.
Humans obviously are-- have memory constraints, and faces are invariant across, at least, short periods of time, and so they have a high probability of stability. So that's another reason why humans might actually focus more on faces to do these tasks. So that is the second example that I had-- that I brought-- highlighting how humans and CNNs search differently.
I had this summary slide, just in case I ran out of time, but I did not, so we're going to go to the third example. I still have a few minutes. So this one-- you will have to pay a little bit of attention, because this is a task you might not be so familiar with. So I'm going to explain what's going on here. What you see to the left-- this is actually from medical imaging. This is actually from breast screening, where doctors, radiologists are trying to detect cancer.
What you see on the left is the traditional way that breast screening has [INAUDIBLE] for decades. So these are X-ray-- 2D X-ray mammograms, and doctors are interested in little features like masses and microcalcifications. Masses are a little bit bigger. Microcalcifications are little tiny. I'll try to show you a simulated one in a sec. To the right is a technology that's become prevalent in the last maybe five years or so-- five, six years.
So what you see here is called digital breast tomosynthesis. It's also X-ray-based, but it actually ends up-- you end up with something like a 3D volume. The goal of this is it allows radiologists to segment the targets that they're interested in from the normal anatomy. And typically, it actually improves performance. So radiologists do better diagnosing cancer with these 3D breast tomosynthesis images than the 2D mammograms, in general.
Last time I checked, there were about 40% of the clinics in the US that had this. This is probably now 70%. I need to really check the numbers. This was a few years ago. But it's what's now in most clinics coming, or it's there already in the big cities. So we were interested, again, in the relationship between how radiologists look for targets in these 2D mammograms and these 3D images, and the relationship between performance in these two conditions, and how they compare to a CNN.
And we particularly focused on one task that we thought was clinically important, which is the detection of these very small targets that are very salient when you look at them, but they're sort of hard to see in the periphery. This is just a-- this is called a microcalcification. Typically, they actually appear in clusters, not by themselves, so this is a little bit of a simplification.
We inserted these. This is a collaboration with the University of Pennsylvania and the Santa Barbara Women's Imaging Center. We stuck this at random locations in these phantoms. These are not actual DBTs. These are actually phantoms created by Penn-- so, complicated phantoms. They look similar. They're not exactly real DBTs, but they look similar to those.
And we stuck-- there are about 100 slices. I'm just showing four in the 3D. And we stuck this at a random location, and we also put it at a random location in the 2D, and then the humans, and the radiologists, and the CNN had to search for that little target. And when you actually do that and you measure-- the CNN is actually a U-Net, which is specific for medical images.
In 2016, it was state of the art in image segmentation for medical images, and we trained the last layers to do the task. And you can see that it does better in the 3D than in the 2D, which is actually consistent with this notion that the 3D, because it has volumetric information, carries more information than the 2D image.
But when we actually looked at our radiologists, look what happened. We had this dissociation where radiologists did much more poorly-- for this particular target that I just showed you-- in the 3D than in the 2D. And so this is actually an interesting result, because there's a big dissociation, but it's-- if you're not a vision scientist, if you're somebody in medical imaging or radiology, you might think this is actually-- oh, what is this [? happening? ?]
But those of you in the audience, you're vision scientists, so you probably have figured out what's going on. What happens is this target is extremely salient in the-- when you foveate, and in the 2D image-- you can actually-- not rapidly, but with moderate time, you can actually scan the image and actually find it. But in this huge volume of 100 slices, if you count the typical clinical times about to four minutes, people don't-- radiologists do not have the time to scrutinize every slice and every spot of every slice, so they often miss it.
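A back-of-the-envelope calculation can make this under-exploration argument concrete. The sketch below estimates an upper bound on the fraction of the image area that foveal inspection can cover in a fixed time, for a single 2D image versus a 100-slice volume. Only the 100-slice figure comes from the talk; the search time, fixation rate, useful field-of-view radius, and image size are hypothetical values chosen for illustration.

```python
import math

def coverage_fraction(search_seconds, fixations_per_second, ufov_radius_deg,
                      image_width_deg, image_height_deg, n_slices):
    """Upper bound on the fraction of total image area inspected foveally.

    Assumes each fixation covers a disc of radius ufov_radius_deg within
    which the small target is detectable, and that discs never overlap
    (so the true coverage can only be lower).
    """
    n_fixations = search_seconds * fixations_per_second
    area_per_fixation = math.pi * ufov_radius_deg ** 2
    total_area = image_width_deg * image_height_deg * n_slices
    return min(1.0, n_fixations * area_per_fixation / total_area)

# Hypothetical viewing parameters, shared by both conditions.
params = dict(search_seconds=120, fixations_per_second=3,
              ufov_radius_deg=1.0, image_width_deg=20, image_height_deg=25)

cov_2d = coverage_fraction(n_slices=1, **params)    # single 2D mammogram
cov_3d = coverage_fraction(n_slices=100, **params)  # 100-slice volume
```

Under these assumed numbers the 2D image can be covered completely, while only a few percent of the 3D volume can be, which is the intuition behind radiologists missing a small target that a non-foveated CNN finds. The sketch ignores scrolling behavior and the fact that a target can be visible across a few adjacent slices.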
So here's an example. I actually show you. So what you're seeing here-- I'll pause it for a sec-- in green is-- are the fixations of the radiologist. And at some point, you're going to see a little white ring that's going to appear for a few slices, and that's where the target is going to be. You're going to see how they miss it. So they're scanning around. They're fixating here.
That's it, right there. I don't know if everybody saw it. It was right there. I'll put it back. You're going to see here-- right there the-- maybe I missed it. OK. There-- there you go. That's where the target is. It's hard to see, but if you're looking away, it's going to be-- it's very easy to miss.
So clearly, what the-- what explains this dissociation is really that the humans have this foveated system, and they're under-exploring this 3D volume, which leads to this dissociation with CNNs, which are nonfoveated and are actually thoroughly exploring the entire 3D volume.
OK, so to summarize, I gave you sort of three examples-- very different examples of how humans and CNNs search in different ways. The first example emphasizes how the human brain heavily relies on contextual relations for object search, while current CNNs-- and like I said, I haven't tested-- this could have changed. If you have a point [INAUDIBLE] we need to try to see if it has these properties. That'd be awesome. The ones we had tested until recently do not seem to incorporate such relationships.
The second one was really visual search of a person in the wild, and illustrated how the human brain adopts strategies that optimize performance across a battery of tasks rather than-- CNNs are trained to do these narrow tasks, and that might explain the difference in strategy.
And then the third one identified, in a practical domain, instances that reveal visual cognitive bottlenecks in humans that result in these large search areas for these small signals. And it suggests that that-- those might be good circumstances where AI might greatly help mitigate human errors, and that was presented in cancer screening by radiologists. So with that, thank you very much.