What are you searching for? ... and how do you do it? (57:00)
August 17, 2018
August 16, 2018
Jeremy M. Wolfe
All Captioned Videos Brains, Minds and Machines Summer Course 2018
Jeremy M. Wolfe, Brigham & Women's Hospital; Harvard Medical School
Introduction to visual search that examines Treisman’s Feature Integration Theory, features that guide shifts of attention during search, challenges for model development such as how to terminate the search process and the temporal differences between search and recognition, and a neural architecture that combines a selective pathway for object recognition and non-selective pathway for visual properties such as texture and gist.
SPEAKER: I'm going to talk about visual attention today. And if I'm going to talk about visual attention, I need to say a word about really the founder of the modern field, Anne Treisman, who died in February. Oh, so here's the short sociology of science part of today's talk. This is Anne Treisman. This is Michel Treisman, her first husband but also a quite notable cognitive scientist. She was until her death married to Danny Kahneman, but she became famous under her first husband's last name, which was something of an annoyance to her and I think more of an annoyance to him, because she was the famous Treisman long after they were no longer an item. So you may take whatever message you want from that about changing your name if and when you get married or something like that. So here's a picture of Anne with Danny Kahneman. No, not really, but with the National Medal of Science, the US's highest scientific award, which she got from Obama back when we believed in science.
So here's what I'm going to do. The starting place for modern work on visual search, the problem of how you find what you're looking for in a world full of things that you're not looking for, is Anne Treisman's Feature Integration Theory. And I'll tell you a bit about sort of the founding data for that, and then the data that leads to modifications by people like me and others. My version of the modification is called Guided Search. I'll tell you about a few problems that any model that you want to develop of human visual search behavior would need to solve. And then I'll start trying to tie the problem of search to the broader question of what do you see. What are you seeing right now? What does the world look like? And I will end up somewhere in here getting beyond simple search tasks where you're looking for one specific target in the world, but we'll get there when we get there.
All right. So let's start with what should be intuitively obvious. Search is necessary because you simply cannot process all of the visual information in the field at once. It's a response to the limited capacity of the human visual and cognitive systems. So if you're looking for Waldo, you're having a bad time because this particular piece doesn't have Waldo in it. But if you look for a lion, you can find the lion. OK. Even after lunch. So this immediately raises a couple of interesting questions.
Interesting question number one, what was there before you found the lion? Right. There was a period of time when that image came up on the screen, you didn't necessarily even know what I was asking for. But you didn't know where the lion was. Eventually, you found the lion. What was there beforehand? It wasn't some sort of black or gray hole. There was some visual stuff there. The contents of that visual stuff we can call preattentive vision. That's a term in common use in the literature. We can ask a related problem. Suppose you redeploy your attention someplace else, like to this silly pink bus, now what's here?
When you got your attention here, you turned this into a recognized lion in some fashion. What's there when you move your attention away? Does it revert back to the preattentive state, or does it somehow remain lion? And that you could call the problem of post-attentive vision. That's a less common term but clear enough. The standard laboratory experiments that really got popularized by Treisman in the late '70s, early '80s are experiments like this: you put up a stimulus. Observers are going to push a key saying yes, there's a target present, or no, there's not a target present. We're going to vary the number of items on the screen, call that the set size. We're going to measure your response time or reaction time, usually abbreviated as RT. And the interesting data are going to be, in particular, the slopes of these reaction time by set size functions.
If you do a task like this, even if I don't tell you what the target is, I didn't tell you what to look for here but you'd be, Oh, yes, I did. Never mind. Even if I didn't tell you what to look for, it would have been fairly clear that this green thing was the target and you would not have had, you can intuitively imagine that it doesn't matter how many red things there are up there the slope of that function will be essentially 0. And that works for a variety of basic features. And it doesn't work for combinations of basic features, at least not in quite the same way. What Anne originally thought, what Anne Treisman originally thought was that you could process basic features in parallel across the entire field.
But as soon as you started combining features, your search became sequential, serial, and presumably self-terminating. That if you're looking for a red vertical thing, you're going to have to look around until you find it. More red horizontals and green verticals are going to produce a steeper slope. Oh, I didn't put any numbers on this, but if you're using great big salient stimuli like this, this is not constrained by your eye movements. The growth here isn't that you have to fixate one after the other. The slopes on a really serious visual search might be something like 25 or 30 milliseconds per item. You only move your eyes volitionally three or four times a second, which would produce obviously a much steeper slope. So you have a distinction between searches that are extremely efficient, with a slope near 0, and tasks that are not as efficient. But it's not just that you have to fixate on things in order to know what you're looking at.
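The arithmetic behind the eye-movement point can be checked in a few lines. This is only a back-of-the-envelope sketch using the figures quoted in the talk (25 to 30 ms per item, three or four eye movements a second); the function name is made up for illustration.

```python
def ms_per_item_if_fixating_each(fixations_per_second):
    """Slope in ms/item if every item needed its own eye movement."""
    return 1000.0 / fixations_per_second

# Figures quoted in the talk.
observed_slopes_ms = (25.0, 30.0)   # inefficient search, ms per item
fixation_rates_hz = (3, 4)          # volitional eye movements per second

fixation_slopes_ms = [ms_per_item_if_fixating_each(r) for r in fixation_rates_hz]
for rate, slope in zip(fixation_rates_hz, fixation_slopes_ms):
    print(f"{rate} fixations/s -> {slope:.0f} ms/item if each item were fixated")
print(f"observed: {observed_slopes_ms[0]:.0f} to {observed_slopes_ms[1]:.0f} ms/item")
```

One fixation per item would cost 250 ms or more per item, so slopes of 25 to 30 ms per item cannot be explained by eye movements alone.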
If you have items that are defined by how elements are put together. So you're getting closer to something that sounds more like object recognition, those are also going to produce these inefficient kind of searches if you're looking for a T among L's, for example. So the original claim of classic Treisman 1980 Feature Integration Theory is parallel search for features serial search for everything else. There are other properties that basic features had that are interesting and useful. So if you take a look at this collection of red and green vertical and horizontal bars, you will see that they segment into regions pretty clearly. Right? OK.
These regions are probably perfectly nice and obvious to you. Those four are less obvious if I go back, but the conjunctions of basic features don't support texture segmentation in the way that the features themselves do. So there's a clear border between green and red and between horizontal and vertical, but there is a less-- what is he doing over here? He snuck in. But here all the reds are horizontal, all the reds are vertical. All the greens are vertical, all the greens are horizontal. It's divided up into separate regions, but they don't segment. So basic features support texture segmentation; combinations of features don't.
Another one of the classic things that features do is they support search asymmetries. The presence of a feature is much easier to detect among its absence than vice versa. So if you're looking for the Q among Os, that's convincingly easier than a search for O among Qs. And all of these things have been used as ways of trying to figure out what are the fundamental features that are extracted in early vision for purposes of visual search and visual attention.
Oh, here's demo time. Try this. This is one of-- I'm going to flash up a bunch of stuff on the screen. There will be two big red numbers. I just want you to say which is the bigger number. The one on the left or the one-- you can't tell me already, I haven't put it up there yet. What can I?
AUDIENCE: Do you mean by size or the actual number?
SPEAKER: Oh, thank you. I think I mean the actual number. Now you're making me forget how I made this demo. But let's assume it's the actual number. Which number is the larger number? OK. On your mark, get set, go. Left? Was it left? OK. That's lovely, I don't care. What I really care about is there was a bunch of other stuff there. Right. You saw a bunch of other stuff. Right? Do you know what you saw? So this is two-alternative forced-choice psychophysics, which is what we do in the lab. That means you don't get to say I don't know. You have to say one of the two alternatives. So yes or no?
SPEAKER: No. No, people are pretty clear on no.
SPEAKER: OK. Not a deep sense of conviction here, except on the purple one I noticed, I think. And that's pretty good. Here, what are the answers? Yeah, the Q was there. You correctly knew that the purple thing was not there. The yellow one was there, the blue-- the R is not there. The blue one is not there. What Anne pointed out, and this is obviously the demo version of something that's a real experiment, is that people were pretty good at knowing what the features were that were present. But immediately after the image is gone, they were very unclear about how these things went together. They produced what she called illusory conjunctions. So you might be convinced that blue vertical was present, because there's blue stuff and vertical stuff. You might be convinced that the R was present because there's a P, and that Q has that diagonal that, if you stick it on there, you'd get an R.
So she thought that the features seemed to float rather separately, and the architecture that she proposed was you've got the stimulus out there in the world. You've got a set of feature maps early in the visual system, which in the 1980s were thought maybe to map on to early visual processes. And then if you wanted to do object recognition-- if you wanted to just figure out if there was red there, you could get that out of a color map. But if you wanted to do anything else, what you needed to do was deploy your attention there, your attentional spotlight metaphorically, which allowed you to reach back in a feedback way into the early visual system and bind together all the various different bits of a visual input. I mean, historically, where this is coming from is you're finding chunks of the brain that seem to like color or that like orientation or that like motion.
And you've got the problem that you don't see color or orientation or motion, you see a red bar moving to the left, how do those features cohere together? And the key idea here was this idea of binding and the idea that binding was a capacity limited operation that could only be done on one or maybe very few objects at the same time.
Now, in modern research, this really gets started in the 70s and 80s, the idea is not entirely new. So here is the French philosopher, Condillac who asked back in the 18th century; what would happen if you got up in the morning and threw open the curtains in a place where you didn't know what's outside the curtains, what do you see when you first see a brand new scene? And he said, what you would see is just well he would have said the preattentive features but he didn't have that language. He said, you just see little patches of color and shape. That's all you'd get. Only later would you know that you were looking at a hill and a river and things of that ilk.
So let's try that. Here are some curtains. I will fling them open, and you're going to tell me what you see. All right, let's see if it'll fling. OK. What did you see?
AUDIENCE: Crosses and plusses.
SPEAKER: Crosses and plusses. Yeah, OK. Plusses, Xs, good. What else?
SPEAKER: Colors. OK. What color? I heard red, green, and OK. Good, I gave away the answer here. Did you see-- which plus did you see?
Both, neither left-- Both neither left. All right. OK, good. We got all four answers. Those are the four logical possibilities. Tenenbaum's got a computer that will generate those. The answer is you didn't really know. Right.
So here's what you were actually looking at. It turns out that both is the correct answer. But you really didn't know. And the reason you didn't know is this binding issue. That before your attention gets there, as Condillac suggested you knew you had red and green, you knew you had vertical and horizontal didn't know how they go together. Only once you get attention onto an object, do those features get bound in a way that allows you to recognize that specific object.
Let's try that again. Look for the one in the next slide. Look for the one that's red vertical, green horizontal. I'll leave it up there this time. It's not just going to be a brief flash. So you ready, that's the one you're looking for. Yeah, you're feeling good because you found it. Now you're feeling like there are several. Oh, yeah, like half the display. This being the audience that it is, somewhere there's an overachiever here who found the other one. Yeah, it's you. Right. Oh, there's several. Right, there's one over here.
As soon as that image came up on the screen, you knew there were red and green plusses everywhere. But only once you started scrutinizing individual ones did you find those features together in a way that lets you know which particular plus was there. So you had this idea of feature integration with a preattentive stage followed by an attentive stage. There developed in the early 80s, when I got into this business, some problems with the basic scheme. In particular, Anne had decided that everything that wasn't a feature search was going to be the same serial search, but these kinds of searches where you were looking at conjunctions of fundamental features turned out to be fundamentally different than these kind. An under-cited paper that makes this point in an intuitively obvious way is this paper by Howard Egeth, forever at Johns Hopkins, and a couple of his students at the time.
What he had observers do was look for a T among Ls. On some trials, you looked for a T among Ls. On some trials he told you the T is red. And what you discovered was that if you knew that the T was red, the slope basically dropped by half if half the items were red. And the reason should be clear enough. If you know that the T is red, there's not much point in looking at black Ls. You can use that preattentive color information now to guide your attention to the only elements that could possibly be the target.
You can ramp that up to looking for conjunctive stimuli. If you're looking for a red vertical thing, and some things are red but not vertical and other things are vertical but not red, here you can try this. Clap when you find a red vertical thing.
Oh, good. And you can clap three times if you want, that's fine too. What you're doing here, what makes this an easy task, much easier than that task of looking for the plusses, is you can basically say to your search engine, give me all the red stuff, give me all the vertical stuff, and you're allowed to do the intersection operation. And the intersection operation says, well, I don't know, I have to bind these features, but the most likely place to find red verticals is like here, where there's redness and verticalness.
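The halving of the slope falls out of a very simple serial model. The sketch below is an illustration, not Egeth's analysis: it assumes a serial self-terminating search restricted to the items that share the target's color, and the per-item cost and base time are hypothetical numbers chosen only to make the arithmetic visible.

```python
def expected_rt_ms(set_size, candidate_fraction, ms_per_item=50.0, base_ms=400.0):
    """Expected target-present RT for a serial self-terminating search.

    Only the candidate items (e.g. the red ones) are ever attended, and on
    average the target is found halfway through them.
    ms_per_item and base_ms are hypothetical illustration values.
    """
    candidates = set_size * candidate_fraction
    return base_ms + ms_per_item * (candidates + 1) / 2

def rt_slope(candidate_fraction, set_sizes=(8, 16, 32)):
    """Slope of the RT x set size function, in ms per item."""
    rts = [expected_rt_ms(n, candidate_fraction) for n in set_sizes]
    return (rts[-1] - rts[0]) / (set_sizes[-1] - set_sizes[0])

print("no color cue:  ", rt_slope(1.0), "ms/item")
print("half are red:  ", rt_slope(0.5), "ms/item")
```

With half the items in the cued color, the model's slope is half the uncued slope, which is the pattern described above.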
The reason that plus demo is so hard is because if you say, give me all the red, give me all the vertical, you get every plus in the field. Right. Because they all have both red and vertical in them. Yeah, OK. So if you're looking for that thing, there's no way to guide yourself on the basis of the basic features. So you just have to start examining them one after the other.
And so feature integration theory leads, in my case, to guided search, which is a fairly modest modification in fact on the basic architecture that Treisman originally proposed. You've got early visual processes. If you want to recognize and bind an object, you're going to have to attend to it. But you're not going to attend randomly. Where you attend is going to be intelligently guided by a set of these preattentive features that you have available to you.
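A toy version of "give me all the red, give me all the vertical" shows why guidance helps in the conjunction display and does nothing for the plusses. This is a minimal sketch, not the Guided Search model itself: each item is reduced to the unbound set of features present at its location, and priority is simply the number of requested features found there.

```python
def priority(item_features, requested_features):
    """Number of requested features present at an item's location (unbound)."""
    return len(item_features & requested_features)

requested = {"red", "vertical"}

# Conjunction display: each item is one colored bar.
conjunction_display = [
    {"red", "vertical"},      # the target
    {"red", "horizontal"},
    {"green", "vertical"},
    {"green", "vertical"},
]

# Plus display: each plus contains a red bar and a green bar, one vertical and
# one horizontal. Before binding, both kinds of plus offer the same feature set.
plus_display = [
    {"red", "green", "vertical", "horizontal"},  # red vertical, green horizontal
    {"red", "green", "vertical", "horizontal"},  # green vertical, red horizontal
]

conjunction_scores = [priority(item, requested) for item in conjunction_display]
plus_scores = [priority(item, requested) for item in plus_display]
print("conjunction display:", conjunction_scores)
print("plus display:       ", plus_scores)
```

In the conjunction display the target gets the highest score, so attention can go there first. In the plus display every item ties, so the only option is to examine them one after another.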
It seems useful in this era to say that those guiding features are probably not the whatever 4,000 long feature vector that you get out of your favorite neural network. Those are features that are terrifically useful for object recognition, but they are not what you're doing when you're guiding your attention. Let's see. I think we can illustrate that. Here's the sort of features that you do have and then I'll illustrate why it's probably not the case that it's all the vast variety of things that come out of your favorite deep net.
The features that you do have are things like orientation, curvature something about shape that we don't understand, and some fairly advanced things, like lighting direction, or basically, the 3D layout of objects. Is that a hand? Yes, that's a hand.
AUDIENCE: Yeah, it strikes me from that previous example that it almost seems like there's a higher featured rate. So when I narrowed my search, I narrowed it by color first, and then by orientation, not by operation.
SPEAKER: Yeah, so absolutely everybody's intuition is exactly that, that what you're doing is a staged operation of get one subset and then work through that subset. Turns out not to be true. And the way you can show that is do experiments where you force people to do the subset search. So instead of saying, look for the green vertical thing, you say, tell me if there's an oddball in the green set. So you have to get the green set and then do basically the same pop-out search within that subset. And that turns out to be hundreds of milliseconds slower. The other evidence that that's not what you're doing is, so green vertical, that's two features. Well, we can play this game a lot: big green vertical, big green shiny vertical, big green shiny jagged vertical.
There's a bunch of features and you can make these higher order conjunctions. Search gets easier each time you do that. To a first approximation, it would get slower if you had to do each of those as a separate step. The evidence suggests that you can basically load a set of terms into your search engine and use them all at the same time. And then what's quite interesting is that there are a variety of things that simply don't work. And this actually in some ways gets to the connection between these features and object recognition features.
Doing object recognition, it's really useful to be able to pull out types of intersections. T intersections typically tell you about occlusion, X intersections don't. Searching for X intersections among T intersections is a very inefficient search, it's not easy at all. There's no evidence that the type of intersection will guide your attention. Interestingly, I said you can say, give me all the red stuff, give me all the vertical stuff, and get the intersection operation. That works between types of features, but not within. So if you say-- I can't even figure out what the target is here. It looks like it's blue and yellow. Say, give me all the blue stuff, give me all the yellow stuff. You don't get this guy. What you get is, instead of the AND operation that you wanted to do, you get the results of the OR, and you get everything. So conjunctions within a feature turn out to be quite inefficient.
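The AND-versus-OR point can be made concrete with a small sketch. It assumes, purely for illustration, that each feature dimension contributes a single vote saying whether an item contains any of the requested values in that dimension; the distractor colors are hypothetical, since the talk does not name them.

```python
def guidance(item, request):
    """One vote per feature dimension: does the item contain ANY requested value?"""
    return sum(1 for dim, wanted in request.items() if item[dim] & wanted)

# Within-dimension request: two colors at once.
request = {"color": {"blue", "yellow"}}

# Distractor colors are assumed for the example.
items = {
    "target (blue + yellow)":    {"color": {"blue", "yellow"}},
    "distractor (blue + red)":   {"color": {"blue", "red"}},
    "distractor (yellow + red)": {"color": {"yellow", "red"}},
}

# What guidance appears to deliver: an OR over the requested colors.
or_scores = {name: guidance(item, request) for name, item in items.items()}

# What you would have wanted: an AND, i.e. all requested colors present.
and_hits = [name for name, item in items.items()
            if request["color"] <= item["color"]]

print("OR-style guidance scores:", or_scores)
print("AND would have selected: ", and_hits)
```

Every item gets the same guidance score, so the color request selects everything, even though a true AND would have picked out the target alone.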
And then there's all sorts of interesting fights over faces. Basically, I won't go into the gory details but I don't think faces guide your attention. Here's an example. Look for the X intersection among T intersections. These are the particular characters are obviously made to balance the size and the line segments and stuff like that. But they differ in intersection type. I found it. Found a few. So there's that one, there's that whole region. So it's another way in another sense in which it doesn't work. If those guys were red, that would have just simply jumped out you would have known that was a red region.
I assume that unless you happened to see this little two-dot artifact, nobody noticed that region when it first came up there. Because that feature does not work to guide attention. Even though it would not be hugely surprising to discover that the top layer of your network, your AlexNet-y kind of network that's doing object recognition, might well have units that looked a lot like they were telling the difference between different forms of intersection.
Another important way that you know that these features are different from, say, AlexNet-y kind of features, is that what guides your attention is both a limited set of features and, within those features or within those types of features, it's coarse and categorical in nature. I mean, if you think about having, like, your own little Google search engine. In a normal Google search engine, I can type any old thing in there and it'll do good work for me. In your visual search engine, you can only type in a very small vocabulary. It's about one to two dozen types of features. And within that feature, like color, you only get a very small set of terms. You don't get to say, I would like to guide my attention to all items that are 583 nanometer wavelength lights. That's not going to happen for you even if you happened to know what that color would look like.
So as an example, look for orientation oddballs here. Which items attract your attention? You're ready? No problem. Right. Little harder, you got that. You didn't get that one. But once your attention is here, you can tell that one is a different orientation than this. The physiology, if you look at single cells, or the psychophysics, if you get people in the lab and measure it, says you can tell the difference between basically, like, a degree of orientation. But in terms of guiding your attention, it's more on the order of 10 or 15 degrees. It's coarse.
Here is another version. I'm going to put up a bunch of lines the targets are tilted 10 degrees to the left of vertical, there will be two of them. So I want you to find both of them. You're ready? OK. Got two of them. All right. The question is which one was easier to find? How many people vote that the one on the left was easier to find? How many vote that it was easier to find on the right?
All right. Left wins by at least 3-1, which is good because that's the answer I wanted. It occasionally doesn't work. And if it doesn't work, I can always revert back to the data. But what's the difference here? The difference is that this guy's easier, that guy's harder. This guy is easier because it is the categorically steep item in the display. The distractors are tilted plus and minus 50 degrees. So they're 40 degrees and 60 degrees away from the target.
Over here, this is the steepest thing, but these guys are steep. It's also a left-tilted item, but these guys are tilted to the left, and it turns out that that's more difficult. Because it looks like what you can say to your search engine is give me the steep stuff, or give me the stuff that's tilted to the left. You don't get to say I want the steepest stuff, you don't get to say I want 45 degree items. It's very coarse. Now, we haven't worked that out in every feature under the sun, but it's true in size, it's true in color. It looks like the way to think about this is you've got the front end of the visual system, which is doing all that feature extraction that allows you to do extremely fine detailed object recognition eventually, with a lot of detail in there. But there's a very coarse abstraction of those features. That is what you are using to guide your attention around.
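The categorical idea can be sketched as a lookup from angle to a couple of coarse labels. Everything specific here is an assumption for illustration: the 45 degree boundary between "steep" and "shallow", and the distractor angles in the hard display (+30 and -70), which the talk does not state. Both displays are built so that the distractors are 40 and 60 degrees from the target.

```python
def categories(angle_deg):
    """Coarse labels for tilt from vertical (negative = left).

    The 45-degree steep/shallow boundary is an assumed illustration value.
    """
    labels = {"steep" if abs(angle_deg) < 45 else "shallow"}
    if angle_deg < 0:
        labels.add("left")
    elif angle_deg > 0:
        labels.add("right")
    return labels

def unique_labels(target_deg, distractor_degs):
    """Labels the target has that no distractor shares."""
    taken = set().union(*(categories(d) for d in distractor_degs))
    return categories(target_deg) - taken

target = -10
easy_distractors = [50, -50]   # from the talk: plus and minus 50 degrees
hard_distractors = [30, -70]   # hypothetical: one steep, one left-tilted

print("easy display:", unique_labels(target, easy_distractors))
print("hard display:", unique_labels(target, hard_distractors))
```

In the easy display the target is the only "steep" item, so one coarse term is enough to find it. In the hard display no single coarse label is unique to the target, even though the angular differences are the same.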
Now remember you're doing all this presumably because you're doing all of this in the service of severe capacity limitations. You want to be quick and dirty about this. It's not mostly going to be that useful to you to say, I want to find items of exactly this shade of red. You say, I want red. And within the set of red things, OK, now I can use my built in deep net to figure out exactly which red item it is that I want. I can go through 30, 40, 50 objects a second in object recognition land, the evidence suggests. So I don't need to do this guidance business precisely. I just need to get into the neighborhood. I just need not to waste time on things that are never going to be my target.
Not all the rules are desperately clear. So in each of these displays there is one elephant that is different from all the others. Right. Which side is easier? It's easier to find a dead elephant among live elephants than a live elephant among dead elephants. Right. OK. That's a classic version of a search asymmetry. The rule is that the presence of a feature is easier to detect than the absence.
Oh, great. Like what's the feature here? Deadness? It's not-- there's a whole bunch of controlled experiments. And I'm not going to-- it's not just weirdness, it's not just pointy-upness. We have no idea what the answer is. But it's probably not the project you want to work on for this course. But if anybody figures out why dead elephants are easier to find than live elephants, let me know.
And so this is what those sort of data look like. Looking for a dead elephant among live elephants, that slope is only 5 milliseconds per item. Live among dead is about three times that. I don't know why. So the elephant problem is not really one of the deep problems in the field. There are more interesting problems. Here is one of them. Clap. Here's another demo, clap for a T. Ready? Ready, up we go. That's good. There is not a T.
Now, at some point I even heard somebody say there's no T. How do you know that? Well, the obvious answer would seem to be, to put up this animation, that you would go through and cross out all the items until-- please tell me I didn't put one on every L here. Stop. Anyway, then you would go through and exhaustively cross everybody out. And when you'd crossed everybody out, you'd know it's time to quit. But at some point you're going to know it's time to quit. If you don't find what you're looking for, it would be desperately stupid if you simply got stuck and could never move on. Oh, good. We can move on.
But so that's what is known as inhibition of return. Inhibition of return is an interesting phenomenon in the attention business where you can show that if attention has been here and moved away, it is indeed harder to get attention back to a previously attended location than to get it to a new location. Ray Klein at Dalhousie has been working on this for many years.
But that's not what you're doing. How do we know it's not what you're doing? Well, here let's do an experiment like this. We're going to have you looking for a T among Ls. We can make life a little more difficult by making them a little jagged. But it's the same basic task. And every 100 milliseconds, we're just going to randomly plot the locations of every item. So if there's a T present, it will be present on every frame. But the location will be randomized 10 times a second. Obviously, you cannot at that point be marking the locations that you rejected. It's just not going to work. So the only thing you can do is when it's time to grab a new one, you grab a new one. If it's a T, you're good. If it's not a T, you keep going until you quit at some point. So that is sampling from the display with replacement.
Inhibition of return, this story is sampling from the display without replacement. What do the data look like if you simply say how long does it take you to say yes there's a T, or no there's not a T present. So the static version the standard version is the green data. The red data is what you should get if you go from sampling without replacement to sampling with replacement. But the blue and black lines are two versions of the everything's flickering around 10 times a second experiment. It really doesn't make any difference. It's very odd. You would think that if I was looking for Gabriel, let's say and everybody in the room was changing positions 10 times a second, you would think that would make my life more difficult. But in this experiment, it doesn't. So you're not going through and inhibiting items one after the other. And deciding on that basis when to quit.
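Why a strict item-marking account predicts a cost for the scrambled displays can be seen in a short simulation. This is a sketch of the two sampling schemes only, not of the experiment: it counts how many items are attended before the target is hit when rejected items are never revisited (without replacement) versus when any item can be picked again (with replacement).

```python
import random

def samples_without_replacement(n_items, rng):
    """Attend items in random order, never revisiting; item 0 is the target."""
    order = list(range(n_items))
    rng.shuffle(order)
    return order.index(0) + 1

def samples_with_replacement(n_items, rng):
    """Attend a random item each time, with no memory; item 0 is the target."""
    count = 1
    while rng.randrange(n_items) != 0:
        count += 1
    return count

def mean_samples(sampler, n_items, trials=20000, seed=0):
    rng = random.Random(seed)
    return sum(sampler(n_items, rng) for _ in range(trials)) / trials

for n in (5, 10, 20):
    without = mean_samples(samples_without_replacement, n)
    with_rep = mean_samples(samples_with_replacement, n)
    print(f"set size {n:2d}: without replacement {without:5.2f}, with replacement {with_rep:5.2f}")
```

Without replacement the mean is about (N + 1) / 2 items; with replacement it is about N, roughly double. If observers relied on marking rejected items, destroying that memory by replotting the display should have produced a similar cost, and the data described above show no such cost.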
So what do you do? Well, a way to think about what you're doing in search is to think of it as really two decisions or a sort of an iterative process of a pair of decisions happening over and over again. You pay attention to an item and you've got essentially a signal detection problem. Is this a target or is this not a target? And that for most of the things I've shown you here that's an easy task. You have no trouble deciding if something is a T or an L. If we switch to my medical side of the world, you're attending to this thing. And the question is this cancer or not, the signal detection part is going to be much more difficult. But you're basically doing a little signal detection task. If the answer is yes, you've found your target. If the answer is no, you have to decide whether it's time to quit. And rather than going through and saying I'm marking everything, what you seem to be doing is accumulating some evidence to some variety of a quitting threshold. That quitting threshold gets set on the basis of your understanding of the task.
How likely is it that there's something there? If I never find something if this is a task where there's simply never anything there. I'll have a relatively fast quitting threshold. So if I don't find it I'm not going to waste time on it. If I know my damn keys are in the bedroom, that quitting threshold goes way up there because I'm going to hunt until I find them. But it's a different kind of probabilistic process, not an inhibition of return. OK. Here's another problem that so anybody who wants to build a search engine that behaves like the human search engine needs to figure out how you're going to quit if and when you don't find the target.
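The quitting rule can be sketched as a counter that runs up to a threshold. This is a deliberately crude stand-in for the evidence accumulation described above, with hypothetical numbers: items are sampled with replacement, each rejection adds one unit, and the search ends with "absent" when the count reaches the threshold.

```python
import random

def search_trial(n_items, target_present, quit_threshold, rng):
    """Return (response, number of items attended). Item 0 is the target."""
    rejected = 0
    while rejected < quit_threshold:
        picked = rng.randrange(n_items)
        if target_present and picked == 0:
            return "present", rejected + 1
        rejected += 1          # one more unit toward the quitting threshold
    return "absent", rejected

def miss_rate(n_items, quit_threshold, trials=5000, seed=1):
    """Fraction of target-present trials that end with an 'absent' response."""
    rng = random.Random(seed)
    misses = sum(
        search_trial(n_items, True, quit_threshold, rng)[0] == "absent"
        for _ in range(trials)
    )
    return misses / trials

for threshold in (5, 20, 40):
    print(f"threshold {threshold:2d}: miss rate {miss_rate(10, threshold):.3f}, "
          f"items attended on an absent trial: {threshold}")
```

A low threshold gets you out quickly when nothing is there but misses many targets; a high threshold (the keys-in-the-bedroom case) costs time on every empty search but rarely misses. Where the threshold sits is the part that depends on your understanding of the task.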
Here's another thing you're going to have to figure out. So we've got this notion that you're doing this object recognition object binding kind of thing, and the slope of those functions is essentially a rate function. Right. If the slope is 20 milliseconds an item, that means every second. 50 items are somehow getting through this binary. And the data on searches like searching for a T among Ls tells you that you're doing something on the order of 20 to 50 objects per second. Really fast.
It's pretty impressive the problem is that if you do studies of just straight up object recognition. Nobody thinks you can do it that fast. The estimates of how long it takes to go from a stimulus on the retina to recognition are, the fastest you get is like 100 milliseconds or so. And probably somewhat more than that. So there's a gap. That's not a hand. No, OK. It's like an auction, you make the wrong gesture.
So a couple of paths, there are multiple possibilities for what you're doing here. One possibility is that every time you move your eyes, let's say in a display like this you grab a bunch of items. And that you can find those guys in parallel. And then you move your eyes again in series and bind the next bunch in parallel so sort of a clumpy version, our version of the story is somewhat different. We think that you only actually attend to one item at a time. You select one item at a time, and each one of those starts accumulating object recognition kind of information.
And you can imagine, some sort of a diffusion process. Maybe that's the boundary for Ts and that's the boundary for Ls. And you grab this guy and it starts diffusing towards an L boundary and this guy is headed for an L boundary. Eventually, you get that one which is going to be your target. But at any given moment, so one after the other. But at any given moment if you slice through this, you'd be processing multiple items in parallel. It's not that radically different from this. The way to think about this is, look at that is as a car wash or a pipeline and computer science line language.
But think about a car wash. Cars go into that machine one at a time. So it's in that sense serial. If I bomb this car wash, how many cars am I going to destroy? I'm going to destroy a bunch of them, because several of them are being washed in parallel. And it's hard to prove, but we think that that's what's going on when you're doing search tasks: you're going through and loading up this car wash one item after the other, and then multiple items are in there at the same time.
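The car wash idea can be sketched as a small simulation: items are selected one at a time at a fixed interval, and each one then runs its own noisy accumulation toward a T or an L boundary. This is only an illustration under assumed values, not the model from the talk. The 20 ms selection interval comes from the slope discussed earlier; the drift, noise, and boundary values are invented so that one item takes roughly 100 ms on average; and the names `simulate_car_wash` and `items_in_flight` are hypothetical.

```python
import random


def simulate_car_wash(n_items=8, select_interval_ms=20, drift=0.01,
                      noise=0.02, bound=1.0, seed=0):
    """Select items serially; let each diffuse to a boundary in 1 ms steps.

    Returns a list of (start_ms, end_ms, label) tuples, one per item.
    The last item is treated as the target (drifts toward the T boundary);
    all others are distractors (drift toward the L boundary).
    """
    rng = random.Random(seed)
    spans = []
    for i in range(n_items):
        start = i * select_interval_ms          # serial entry into the car wash
        direction = 1.0 if i == n_items - 1 else -1.0
        evidence = 0.0
        t = start
        while abs(evidence) < bound:            # accumulate until a boundary is hit
            evidence += direction * drift + rng.gauss(0.0, noise)
            t += 1
        spans.append((start, t, "T" if evidence > 0 else "L"))
    return spans


def items_in_flight(spans, time_ms):
    """Count items that have entered but not yet reached a boundary at time_ms."""
    return sum(1 for start, end, _ in spans if start <= time_ms < end)


spans = simulate_car_wash()
for start, end, label in spans:
    print(f"entered at {start:3d} ms, finished at {end:3d} ms, classified as {label}")
print("items in flight at 90 ms:", items_in_flight(spans, 90))
```

With these assumed values, entry is serial (one item every 20 ms), but a slice through time catches several items being processed at once. That is how a throughput of tens of items per second can coexist with a per-item recognition time on the order of 100 ms.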
So all right, I said that. OK. So let's say a bit about how this relates to this question of, well, what are you actually seeing right now, whether or not you're actually searching for something? OK, so back to Waldo land. Go find someone there. Run your nice object into bears, and what makes Waldo problems good is that he figured out how to thwart lots of shared features. Front end of the vision-- front end of the process: massively parallel early visual processing. At some point, there's a mandatory, very tight bottleneck that's guarded by this guiding representation with a few features in it. You bind the features and you get your bear. But going back to the initial Waldo picture, you were seeing something everywhere. Right?
There's that bottleneck on object recognition, but you're seeing something everywhere else. And one way to think about that is to think about two pathways that are contributing to your visual awareness. It is tempting, but speculative, to try mapping this onto, say, a ventral pathway, and whats and wheres, and things like that. But in physical terms at least, you've got this selective pathway that can do object recognition.
And you've got a nonselective pathway that's giving you visual stuff everywhere. Condillac's version of this in the 18th century was to say you're just getting little patches of color everywhere. More recent work, notably by people like Aude Oliva at MIT, is telling you that this nonselective pathway is giving you more than just little local patches of color. You can also get some semantics out of that. So the gist of the scene. Did I put a gist demo in there? No, not quite. We'll come back to that in a second.
The result of this is the illusion that at any given moment you are seeing a world full of understood objects. And imagine what's going on. You've got this nonselective pathway giving you something everywhere. You've got this selective pathway that one way or another is giving you 20, 30, 40 objects per second, bouncing around all over the place recognizing stuff. The conscious experience is: I'm seeing everything everywhere. The way to show that that's not the case is with the classic demos. How many people have seen change blindness demos at some point? Most people. It doesn't matter. Here is a change blindness demo. What changed, other than the orientation? I know that.
Normally in a change blindness demo you put a blank in there. You don't need a blank frame. There's no blank. Oops, which way is back? There's no blank in there, but the orientation shift will also mask the transient that would tell you where the change is. Has anybody found the change yet? No. It doesn't matter. Your undergraduate education, completely wasted. You learned all about change blindness. OK. If we take out that orientation shift so that you'll see the transient, you will see that this is not subtle. Yeah. Come on. And of course, if I go back here, now you can see it just fine. Right? This is not a subtle change.
The reason it's interesting is because of this illusion that you have that you're seeing stuff everywhere all the time. It wouldn't be interesting if I said you have back-of-the-head blindness. Right? You don't know, but Tenenbaum is busy making horrible faces right now and you didn't even notice that. You'd say, that's really boring. I can't see behind my head. Josh is looking behind to see if he's-- Oh, OK. That'll work.
We'll see how that works. Anyway, but intuitively, you understand that he's gone. It's amazing that you intuitively understand that you can't see behind your head, right, but you don't understand that you can't see what's right in front of your eyes. And that becomes an unfortunate surprise. Now, that's a lovely surprise in intro psych kinds of lectures. It has real implications out in the real world. And here's an example. So actually, both Gabriel and I work in the Harvard Medical system, based in hospitals, God bless us. And I hang out with radiologists quite a lot. This is one slice through a chest CT. Those are the lungs. The white blob would be the heart. You're looking for signs of lung cancer. In this case, this is a particularly obvious one. They're basically like golf balls in the lung, little teeny golf balls. Lung nodules are what you'd be looking for.
And we did an experiment where we had radiologists searching for these lung nodules while we were tracking their eyes. All very nice. On the last case, as some of you will have noticed, we put a gorilla in the lungs. How many people have seen the great gorilla demo in that same class where you saw the change blindness stuff? That's why we used a gorilla. Right? It's an homage to Dan Simons and his gorilla.
If you're a radiologist looking for lung cancer, or anything else for that matter, your instructions are: look for the thing you're supposed to be looking for, and tell us if there's anything else clinically significant there. So-called incidental findings. A gorilla the size of a matchbook in your lung is clinically significant. A little unlikely, but clinically significant. 20 of our 24 radiologists failed to report it. Important note: this is not that we had bad radiologists. But they're working with the same search engine that you're working with. And the result is that they will miss things that are literally right in front of their eyes. We were tracking their eyes. They fixated on that gorilla for a full second on average, and still did not report it.
And it's not just in these situations that you get failure-to-see errors. How could this happen? It was right in front of your eyes. These happen all the time. They can be catastrophic, and they end up in court. So as an example, a woman comes around the bend on her motorcycle out on a rural road, with a quarter mile of clear road in front of her, and slams into a pickup truck. And I think she was killed, actually. How could she fail to see a pickup truck that's just sitting right there?
Well, the problem in this particular case was that it was sitting right there. The truck had come to a stop in the middle of a rural highway because the woman in the truck was talking to a friend in the front yard. And presumably the woman who was killed came around the bend, probably attended to that truck, said, I understand about trucks on highways, they're moving, and was still looking at the whole scene but not attending to the truck. By the time she attended to the truck again, having attended to whatever else she may have been attending to, she did not notice, in this case, the absence of a change until it was too late. So she could slam into something that was clearly visible because of this illusion that we can see everything.
I know about these sorts of things coming up because every now and then a lawyer calls me and says, I've got a defendant or a plaintiff, and the accident involves, oh my God, how could you not see that? And this is how you could not see that. Oh, if you have never seen the gorilla experiment, just to remind people: you're watching a ballgame and you're told to count the number of times the white team passes the ball. And in the middle of that video, a gorilla walks through. It's not a real gorilla. She's in a gorilla suit. It's OK, they got it through a human subjects committee. She pounds on her chest and walks out, and 50% of people fail to report seeing a gorilla when you query them afterwards. And of course, everybody can see it the second time.
One of the interesting aspects of this is that the next time, particularly if I'm giving the talk, the next time you see a piece of lung CT, you're going to be looking for a gorilla. You'll never get fooled by that one again, and you won't be fooled by the amazing disappearing angel either. But I could go onto my laptop, find another six versions of these sorts of things, and bamboozle you. It's not something that you can immunize yourself against. It's simply part of the structure of the way the human system is put together. All right. Let's see. Looking at that, yes, OK.
I'll do one last demo to make one last point here. This is lung nodules for non-radiologists. I told you that the radiologists are looking for little golf balls. I can't get you to look for nodules, so look for golf balls. All right. Success? Got that one. Feel good about that one. Did you find those five? Oh, good for you. Have we met before? No? Oh, even better for you. OK. That's great. Did you get that one too? OK. So I still win.
By the way, this is another problem you actually don't want to work on for the month, because it would only take you half an hour. You could write a piece of code that would find golf balls in this image with no problem. And it would be much happier with that one, which is a nice high-contrast golf ball, than with either of the two on the green. What's the problem here? Rather like the radiologists, you are also experts. But you're experts on the real world. You're mini golf experts. You know, and this is an important different form of guidance, where golf balls could be and should be. They could and should be on the little golf course here, and they shouldn't be floating up in the sky there. I forget who was asking about real scenes before, somebody over there. In the real world, this scene guidance is a profound constraint on where you put your attention.
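The contrast between a pure golf-ball detector and a searcher that uses scene guidance can be sketched roughly as follows. Everything in it is invented for illustration: the region labels, the prior values in `SCENE_PRIOR`, the `match` scores, and the rule of multiplying the two are assumptions, not a model presented in the talk.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    region: str    # hypothetical scene label for where the candidate sits
    match: float   # hypothetical appearance score in [0, 1], e.g. from contrast


# Made-up prior: how plausible a golf ball is in each part of the scene.
SCENE_PRIOR = {"green": 0.9, "sky": 0.02}


def unguided_order(candidates):
    """Rank by appearance alone, as a simple detector might."""
    return sorted(candidates, key=lambda c: c.match, reverse=True)


def guided_order(candidates, prior=SCENE_PRIOR):
    """Rank by appearance weighted by scene plausibility (assumed product rule)."""
    return sorted(candidates,
                  key=lambda c: c.match * prior.get(c.region, 0.0),
                  reverse=True)


candidates = [
    Candidate("ball_on_green_a", "green", 0.50),
    Candidate("ball_on_green_b", "green", 0.40),
    Candidate("ball_in_sky", "sky", 0.95),   # high contrast, implausible place
]

print([c.name for c in unguided_order(candidates)])
print([c.name for c in guided_order(candidates)])
```

The unguided ranking visits the high-contrast ball in the sky first, while the guided ranking pushes it to the end. That is the trade-off described here: scene guidance saves effort in ordinary scenes and costs you the item in the implausible location.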
I sort of alluded to that before, but we should underline that point here. If I'm looking for people, I'm going to look on a ground plane where people are likely to be. I'm not going to waste time looking for people hanging from the ceiling. On the rare occasion, mostly in lectures like this, where there is somebody hanging from the ceiling, you'll get fooled. But most of the time this is a very useful shortcut, again dealing with the capacity limits on your visual system. And you desperately don't want.
So it's a problem because you miss things. You miss weird stuff. This is a medical problem, it's a problem in the intelligence community: you're going to end up missing the golf ball in the wrong place. But you really don't want to take away the message, oh, this sort of scene guidance stuff, it's a disaster, I don't want to do it. So when you get into an Uber, you really, really don't want the Uber driver who doesn't believe in scene guidance, right, who is driving down the highway saying, I wonder where the exit sign is. It could be in the sky. I wonder if it's at my feet. That guy's going to kill you.
What you want under most normal circumstances is to deploy your limited resources in as intelligent a manner as you can get away with. And that'll get you in trouble in lectures like this. And unfortunately, it will also get you in trouble in a variety of settings, medical settings and things like that. And there's an interesting frontier. So one possibility is, we will build that deep net that simply does radiology, and end of story. More likely, we'll figure out how to pair smart AI with smart experts in a way that cuts down on these sorts of errors. It's a separate lecture to talk about the details of why that's not trivial. I think I'll stop there and not talk about hybrid search, which is very cool. But you need to get to the beach. So ask a couple of pithy questions and then everybody can go to the beach. So thanks.