Boris Katz: Telling Machines about the World, and Daniel Harari: Innate Mechanisms and Learning: Developing Complex Visual Concepts from Unlabeled Natural Dynamic Scenes
June 9, 2014
All Captioned Videos Brains, Minds and Machines Summer Course 2014
Topics: (Boris Katz) Limitations of recent AI successes (Goggles, Kinect, Watson, Siri); brief history of computer vision system performance; scene understanding tasks: object detection, verification, identification, categorization, recognition of activities or events, spatial and temporal relationships between objects, explanation (e.g. what past events caused the scene to look as it does?), prediction (e.g. what will happen next?), filling in gaps in objects and events; enhancing computer vision systems by combining vision and language processing (e.g. creating a knowledge base about objects for the scene recognition system and testing performance with natural language questions); overview of the START system: syntactic analysis producing parse trees, semantic representation using ternary expressions, language generation, matching ternary expressions and transformational rules, replying to questions, object-property-value data model, decomposition of complex questions into simpler questions; recent progress on understanding and describing simple activities in video
(Daniel Harari) Supervised vs. unsupervised learning; lack of feasibility to have labeled training data for all visual concepts; toward social understanding: hand recognition and following gaze direction; toward scene understanding: object segmentation and containment; detecting “mover” events as a pattern of interaction between a moving hand and an object (co-training by appearance and context); using mover events to generate training data for a kNN classifier to determine the direction of gaze; model for object segmentation using common motion and motion discontinuity; learning the concept of containment
BORIS KATZ: So we are starting. Good morning, everyone. The MIT Computer Science and Artificial Intelligence Lab celebrated its 50th anniversary last week, and we had a nice conference with MIT. What's that? Guess the boss left. Can you hear me? Who knows how this machine works? It's certainly on, yes, but there's no volume. Is this better? Engineers figured it out. The answer is yes. OK, thank you.
All right, as I was saying, we had a conference celebrating the 50th anniversary of Project MAC at MIT, which is sort of what was there before the MIT Computer Science and Artificial Intelligence Laboratory. And many people talked about progress in the field and, in particular, some outstanding successes of artificial intelligence applications, and you see some of them here on the screen-- Goggles and the Kinect and Watson and Siri.
But, of course, when you look a bit more carefully at these gadgets, at the systems, you will see that none of them could be said to be truly intelligent, in part because they don't have any knowledge about the world outside of their narrow areas of expertise. If we go back to one of the grand challenges of our center, you see that one would be [INAUDIBLE] computational system that would look at the visual scenes, answer questions about them, describe them the way humans do.
So as you can imagine, these tasks require quite a few technologies of natural language processing, [INAUDIBLE] language understanding, language generation, and question answering. And this is what I will be mostly talking about this morning. But I'll start by giving you a brief history of computer vision because that is certainly part of our task. [INAUDIBLE].
So 50 years ago, if you were to give a machine an image like that, well, the machine would tell you that it can't possibly handle even one image like that. It will not even [INAUDIBLE]. Going forward, 25 years ago, [INAUDIBLE] black and white, if someone gave their visual system an image like that and asked it to label objects and [INAUDIBLE] describing, this is what the answer was. So the machine saw sheep and beds and horses and airplanes.
Of course, much progress happened during the 25 years since then. But just a few months ago, a student of Antonio's gave it this image, and the machine said, bread. Well, it is-- as you all know, it's really amazing, exciting times for computer vision. We have these great gadgets like Kinect on the cellphones and [INAUDIBLE] magic, and there are cameras that easily see your faces. Machines drive themselves. We have robots.
So this is now a real story of a student at our lab who decided that he wants to do machine vision. Everyone told him, this is awesome. Look at all these wonderful images we have. Just pick one. Also, we have these fantastic data sets. We have ImageNet that [INAUDIBLE] mentioned a while ago, that's millions of images. We have [INAUDIBLE] Caltech. We have Pascal, just plenty of system and [INAUDIBLE]. And then, he gave it this image.
Well, we, of course, know what we see in this image, but his system said this, car. Well, there's this wonderful English expression-- if all you have is a hammer, everything looks like a nail. Once you start investigating what happened, well, there was this image patch right there circled by [INAUDIBLE] system, and if you try to find out what his detector sees, it actually sees something that sort of looks like a car.
So, why is this happening? Let's look a little bit more generally at a bunch of images and see what, in general, we want the scene recognition system to do. Well, some of the standard tasks that are used in the field are things like verification. You want to know if this is a street lamp or not.
Detection-- you want to know where there are people in the scene. Or object categorization-- you want the machine to tell you that these are mountains here or buildings, people, trees, and so forth. Activity-- you want the machine to be able to tell you what this guy is doing there, or what these two people are doing there.
Humans are, of course, absolutely remarkable at verification, detection, categorization, things like that. Humans do spatial and temporal relationships [INAUDIBLE], the same as event recognition. They can do more. They can explain things just by looking at an image-- what past event caused a scene to look as it does now. They can predict things-- what future event will happen at the scene.
They can hallucinate and fill gaps about the scene-- what objects that are occluded or invisible in the scene might also be present there, or what events could have occurred. Let me prove to you how awesome humans are at many of these tasks, so I'll show you a blurry video example, and maybe you can tell me what you see here. Anybody wants to say what they see? Tell me, please.
AUDIENCE: A guy typing on a computer.
BORIS KATZ: Yeah. The man at the computer with his monitor and things like that. Well, humans make mistakes, too. So please take a look what he actually was doing. So the guy had a shoe. He put some beer cans on his head. And he typed away in front of a trash can. He uses a stapler, and so on and so forth. But to tell you the truth, I would be ecstatic if any of our vision systems could make [INAUDIBLE].
So, why are machines falling short? Well, one of the reasons is that our vision system is tuned to process structures which are typically found in the world, but our machines just don't know enough about the world. And they don't know what is typical in the world, and therefore, they do the silliness that we sometimes see.
So let's ask a few questions. How is this knowledge that humans are so good at [INAUDIBLE] to be obtained? How can we pass this knowledge to our machines? And how can we determine whether the computer knowledge is correct?
I bolded for you the word partial. Our partial answer is using language because, of course, there are many avenues by which we humans acquire knowledge, and other members of our group [INAUDIBLE] for example. But since my project is about language, let's concentrate on what knowledge the language system can give to the vision system.
So the proposal is to create a knowledge base containing descriptions of objects, their properties, and relations between them as they're typically found in the world, and make this knowledge base available to a vision system. And test the performance of the system by [INAUDIBLE] We will test it by asking natural language questions.
A student in our group wanted to see how much knowledge people actually use when they look at images in order to understand them. And so, he asked mechanical [INAUDIBLE] people to look at several hundred images and write down questions that they think an image like that would answer. And so here are some pictures. So people wanted to know how many people are in the photo, what's in the cart, is anybody walking, is there any luggage.
In this example, people wanted to see what number is displayed or what color is the backpack in the chair. In this image, someone asked, what is the object that is parked in front of the fence? Just think about the knowledge that is involved because the question was about the stroller, but the machine needs to know that not only cars can be parked but also things like strollers, and that animate objects are not usually parked. And therefore, this woman who is standing next to the fence is not a good candidate.
Who is winning, yellow or red? Well, we need to know that yellow and red mean people wearing these colors, that we have to pay no attention to people in the audience wearing these colors, that it's a sports event, which usually involves winners and losers. And if somebody's on the floor, it's likely they're a loser, and things like that. A lot of knowledge.
So we built a system, which we call START, that provides a number of natural language tools. The tools we've got here are, first, the ability to go from natural language text to some semantic structure, which is providing machines with new knowledge. Then the ability to go back from that semantic representation into natural text, which is a generator.
And finally, the ability to test machines, how it understood what you told it or what you told it to do, and hear the question. [INAUDIBLE] ability to semantic representation and the machine's ability to reply, do something [INAUDIBLE]. Here are some of the building blocks of our system.
I will not have time to go through all of them, but I will describe some to give you a sense of how the system works, and in fact, hopefully, if you're interested you could ask us to give you some of our talks and [INAUDIBLE]. So the first thing that the language system does when it looks at a sentence, it takes the sentence as an input.
And here's a sentence from Tom Sawyer on the top. It's used to analyze it. So linguists usually analyze sentences using these beautiful parse trees. But as some of you who have tried know, this is not a very good representation for a machine. So what we propose instead is to use, of course, the syntactic information in these parse trees, but give the machines a more semantic representation, which involves a ternary expression representation.
So this is a versatile syntax-driven representation of language. It nicely highlights the interesting semantic relations that you want to know about. But, importantly, it's very efficient for indexing information [INAUDIBLE]. We'll show you examples of this representation. This is that same sentence, and instead of that scary parse tree, you see a nice set of ternary expression circles, which pretty much describe the whole sentence.
One triple can point to another, so although it looks like a linear set, it is, in fact, the semantic output. And we've distinguished a number of types of ternary expressions. While the [INAUDIBLE] syntax, syntactic structure of the sentence. We notice that it's related to syntactic features and also some lexical features.
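The ternary-expression idea described here can be sketched in a few lines of Python. This is an illustrative sketch only, not START's actual data structures: the `TExpr` class, the relation names (`sit_by`, `related_to`, `has_property`, `in`), and the way the sentence is broken into triples are all assumptions made for the example, using the sentence "Tom's aunt was sitting by an open window in an apartment."

```python
from collections import defaultdict
from typing import NamedTuple, Union


class TExpr(NamedTuple):
    """A hypothetical ternary expression: (subject, relation, object)."""
    subject: Union[str, "TExpr"]
    relation: str
    obj: Union[str, "TExpr"]


# Assumed breakdown of: "Tom's aunt was sitting by an open window in an apartment."
sitting = TExpr("aunt", "sit_by", "window")
knowledge_base = [
    sitting,
    TExpr("aunt", "related_to", "Tom"),
    TExpr("window", "has_property", "open"),
    TExpr(sitting, "in", "apartment"),  # one triple pointing to another
]

# Index the triples by relation so lookups avoid scanning the whole list.
index = defaultdict(list)
for expr in knowledge_base:
    index[expr.relation].append(expr)

for expr in index["in"]:
    print(expr.subject, "->", expr.obj)
```

Because a triple can appear as the subject or object of another triple, the flat list behaves like a small graph, and the per-relation index hints at why such a representation is convenient for retrieval.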
And this is a screenshot of the system, and again, it gives you access to all the sentences you wanted to, that show you different sentences from [INAUDIBLE]. We're going to see, and you see in the black syntactic structures of the sentence, that their features [INAUDIBLE] some semantic features.
Now, we will need to be able to take these structures-- the machine has them in its brain, or gets them either from analyzing sentences or from analyzing the world --and see if we can convert such nice semantic representations into language. And there are many reasons why you want to do it.
You want robotic people to explain actions to you. You want a machine question answering system to answer complex questions, to track conversation. You want to have a mixed-initiative dialog with the machine and many other reasons why you want a machine to be able to generate nice [INAUDIBLE] sentences and [INAUDIBLE].
And this is an example of our generator in action on that same sentence from The Old Man and the Sea, but we added and modified some situations the machine obtained from the original sentence. And it was able to generate a much-- or handle a lot of sentences such as [INAUDIBLE] of course. In fact, it's a question because we told [INAUDIBLE] that we wanted to turn it into a question. But this gives you a view of the kind of language this [INAUDIBLE] system can generate if you want it to.
So now, we know how to take a sentence and create structure, and we know how to generate a sentence or an answer given structure. So let's see how we can use this for question answering. So let's look at that same sentence about Tom Sawyer. And let's say, someone asks, was anybody sitting by an open window? On the left, the machine will create ternary expressions from the question. And on the right will be the assertion in the knowledge base that the machine has from analyzing the original sentence.
As you can see, there are some easy-looking matches, and this is what this looks like in the semantic network representation. And the matcher says, sure, we could match that. So matching in our system and, in fact, in any syntactic [INAUDIBLE] based question answering system, needs to happen on many levels. It needs to happen on the level of words. You want to know that similar words are similar, that synonyms would be similar, that hyponyms need to work one way and not the other.
But much harder is matching on the level of structure because in most languages, the same thought could be expressed in many different ways, syntactically, and they mean the same. And you want the language system to be able to answer questions that were formulated in one way, and then to associate them with assertions that the machine had formulated in another.
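Word-level matching of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not START's matcher: the `HYPERNYM_OF` and `CANONICAL` tables, the relation names, and the treatment of "anybody" as the most general person term are all hypothetical. The point it shows is the asymmetry mentioned above: a general term in the question matches a more specific term in the knowledge base, but not the other way around.

```python
# Hypothetical lexical tables; a real system would use a much larger lexicon.
HYPERNYM_OF = {"aunt": "relative", "relative": "person", "person": "anybody"}
CANONICAL = {"seated_by": "sit_by"}  # synonym -> canonical form


def canonical(word):
    """Map a word to its canonical synonym, if one is listed."""
    return CANONICAL.get(word, word)


def is_a(specific, general):
    """True if `specific` equals `general` or lies below it in the hierarchy."""
    word, general = canonical(specific), canonical(general)
    while word is not None:
        if word == general:
            return True
        word = HYPERNYM_OF.get(word)
    return False


def matches(question, fact):
    """A question triple matches a fact if the fact is at least as specific."""
    q_subj, q_rel, q_obj = question
    f_subj, f_rel, f_obj = fact
    return (is_a(f_subj, q_subj)
            and canonical(q_rel) == canonical(f_rel)
            and is_a(f_obj, q_obj))


fact = ("aunt", "sit_by", "window")
print(matches(("anybody", "seated_by", "window"), fact))   # general question, specific fact
print(matches(("aunt", "sit_by", "window"), ("person", "sit_by", "window")))
```

The first call succeeds because "aunt" is a kind of "anybody"; the second fails because knowing that some person sat by a window does not establish that an aunt did.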
So we'll say a few words about what we call transformational S-rules just because I find it interesting, [INAUDIBLE] point of view. So let's look at a few English verbs. So let's look at a verb like "surprise." And let's look at the sentence that's a very common English sentence. "The patient surprised the doctor with his fast recovery." And as you all know, the same sentence could be rephrased by something like, "The patient's fast recovery surprised the doctor."
If you look at the structures, they're very, very different syntactically, but they mean completely the same thing. Another verb, "load." "The crane loaded the ship with containers." And "the crane loaded containers onto the ship." This is a different alternation from "surprise." "Provide." "Did Iran provide Syria with weapons?" Or "Did Iran provide weapons to Syria?" Again, a different sort of alternation.
And one can think that we have this wonderful world that allows us to say anything any way, but if you try to use, for example, the "load" alternation for the word "surprise," you get complete gibberish. "*The patient surprised fast recovery onto the doctor." Or if you want to use the "surprise" alternation with the word "load," you get another type of gibberish. "*The crane's containers loaded the ship." The stars, by the way, indicate that it's a horrible sentence and no way to stick with the truth.
So this is very interesting. So we've started looking quite a while ago [INAUDIBLE] to look at these verbs. And they realized that, in fact, there is some semantic regularity with this alternation. So let's look again at the verb "surprise," and you see that many other verbs-- in fact, in English, several hundred verbs --undergo the same alternation. So you could confuse or anger or disappoint or embarrass the doctor with a fast recovery and so on and so forth.
And if you look at these verbs a bit more carefully, you actually see that they have similar meanings. They belong to the same semantic class. So this is an interesting intersection between syntax and semantics that I think we should pay quite a lot of attention to. And so, what our system does, instead of trying to teach the machine for every verb what kind of alternation it can or cannot do, we can do it much more generally on the level of verb classes, which will make for much more [INAUDIBLE] paired with lexicon.
It's also interesting that I invented a verb someone made the other day, is [INAUDIBLE] example. So if I invent the verb that you never knew, I'm sorry for you, and like jumping from the ceiling at me, and surprising or scaring me with something like that, but that would be a completely different verb that's like-- what should we say [INAUDIBLE].
Students [INAUDIBLE] Boris if you jump. If you see from the video what that means, you will eventually start using the verb with a different alternation, without any of you telling anything else. So there's this amazing ability by humans to understand semantic classes of English verbs to know how it works. That would be very interesting to study.
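The idea of attaching alternations to verb classes rather than to individual verbs can be sketched like this. It is a toy illustration, not the S-rule machinery in START: the class names, the frame dictionaries, and the `alternate` function are assumptions made for the example, and only the two alternations from the "surprise" and "load" examples above are covered.

```python
# Hypothetical verb classes; each class, not each verb, carries an alternation.
VERB_CLASSES = {
    "surprise": "emotional_reaction",
    "confuse": "emotional_reaction",
    "anger": "emotional_reaction",
    "disappoint": "emotional_reaction",
    "embarrass": "emotional_reaction",
    "load": "load",
}


def alternate(frame):
    """Return the alternate frame for the verb's class, or None if no rule applies."""
    verb_class = VERB_CLASSES.get(frame["verb"])
    if "with" not in frame:
        return None
    if verb_class == "emotional_reaction":
        # "A surprised B with C"  <->  "A's C surprised B"
        return {
            "subject": f"{frame['subject']}'s {frame['with']}",
            "verb": frame["verb"],
            "object": frame["object"],
        }
    if verb_class == "load":
        # "A loaded B with C"  <->  "A loaded C onto B"
        return {
            "subject": frame["subject"],
            "verb": frame["verb"],
            "object": frame["with"],
            "onto": frame["object"],
        }
    return None


print(alternate({"subject": "the patient", "verb": "embarrass",
                 "object": "the doctor", "with": "fast recovery"}))
print(alternate({"subject": "the crane", "verb": "load",
                 "object": "the ship", "with": "containers"}))
```

Adding a new verb to a class is a one-line lexicon change, and the verb then inherits the class's alternation without any verb-specific rule.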
Well, anyway, so we know how to analyze sentences, and we know how to match, so it's time to actually give answers. So if you ask the START system what I just told you, was anybody sitting by an open window, you parse it. You match it. You come up with the match. And our generator can, of course, actually generate the sentence back to you, which is not at all interesting. It would say again, Tom's aunt was sitting by an open window in an apartment. But a much more interesting application is when you ask the system a question that you actually don't know the answer to, and the system needs to do something to give you the answer.
So the example below is what our START system does in question answering: in response to a match of the kind I described to you, it executes a script to go to some knowledge repository, in this case on the web, get the answer back, and generate a nice answer. So the example here is, who directed Gone with the Wind? And the answer was that Gone with the Wind was directed by three people, Cukor, Fleming, and [INAUDIBLE].
Here are some screenshots from the subsystems that use this ability to perform a procedure in response to a match. The question was, does Russia border Moldova? And the machine went to a particular website, in this case, World Factbook, found out all the countries that border Moldova, looked where Russia was on the list, didn't find it, and says, Moldova does not border Russia.
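Running a procedure in response to a match can be sketched as below. This is only an illustration of the control flow: the `BORDERS` table is a small hard-coded stand-in for the external source (the talk mentions the World Factbook), and the function name and answer wording are assumptions, not START's implementation.

```python
# Stand-in for an external source; a real system would fetch this list
# from a remote resource when the question matches.
BORDERS = {"Moldova": ["Romania", "Ukraine"]}


def answer_border_question(candidate, country):
    """Answer 'Does <candidate> border <country>?' from the neighbor list."""
    neighbors = BORDERS.get(country)
    if neighbors is None:
        return f"I don't know which countries border {country}."
    if candidate in neighbors:
        return f"{country} borders {candidate}."
    return f"{country} does not border {candidate}."


print(answer_border_question("Russia", "Moldova"))
print(answer_border_question("Ukraine", "Moldova"))
```

The negative answer comes from checking a complete list rather than from finding a document that says "no," which is what a keyword search cannot do.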
When, of course, you try to go through a search engine, you get some hits, but they will not answer the question [INAUDIBLE] they talk about [INAUDIBLE]. This ability to analyze English sentences precisely gives you a lot of power. It allows you to answer much more complex questions-- this little book of questions that I described to you --by breaking them apart.
So the example here-- and again, it's especially convoluted, but just to show you how these things are done --the question is, who is the president of the fourth largest country married to? And the system needs to figure out that, in fact, there's more than one question in this larger question. And I'll show you how it's actually done, sort of an "under the hood" view of this syntactic decomposition.
So this is another T-expression from the question. START says, no, it's just hard to answer right away, so it uses a nice linguistically based, syntactically based algorithm that tells it in which order to resolve this expression. And the machine says, no, first we'll figure out what is the largest country, in fact, what is the fourth largest country, and finds China.
And then it says, OK, now I can try to answer a much simpler question. Who is the president of China married to? But even that was a little bit hard. But then the machine knows that it can easily find the president of China and then answer one simple question about him. And the machine [INAUDIBLE].
Here are some examples of START performing this complex question answering. In what city was the fifth president of the US born? START says, I know the fifth president of the USA is James Monroe, and then it tells you that he was born in Monroe Hall, Virginia from different sources, in fact-- from Wikipedia, from the public library, and from [INAUDIBLE]. So simple. Yeah?
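The inside-out resolution order described above can be sketched with a nested query and a recursive resolver. This is a toy sketch, not START's decomposition algorithm: the tuple encoding, the relation names, and the `FACTS` table (a tiny hard-coded stand-in for the external sources, reflecting 2014) are all assumptions for illustration.

```python
# Toy stand-in for external knowledge sources (as of 2014).
FACTS = {
    ("nth_largest_country", 4): "China",
    ("president_of", "China"): "Xi Jinping",
    ("spouse_of", "Xi Jinping"): "Peng Liyuan",
}


def resolve(query, trace=None):
    """Resolve a nested (relation, argument) query, innermost part first."""
    if not isinstance(query, tuple):
        return query  # a plain value needs no further resolution
    relation, argument = query
    value = resolve(argument, trace)      # answer the inner question first
    answer = FACTS[(relation, value)]     # then the simpler outer question
    if trace is not None:
        trace.append((relation, value, answer))
    return answer


# "Who is the president of the fourth largest country married to?"
question = ("spouse_of", ("president_of", ("nth_largest_country", 4)))
steps = []
print(resolve(question, steps))
for relation, value, answer in steps:
    print(f"{relation}({value}) = {answer}")
```

Each recursive call replaces an embedded sub-question with its answer, so the outer question becomes a simple lookup by the time it is asked.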
AUDIENCE: Does the system have any way to deal with ambiguity, with the question you asked with the fourth largest country?
BORIS KATZ: Yeah. That's right, and, in fact, there are many types of ambiguities and [INAUDIBLE] for ambiguities. There's [INAUDIBLE] ambiguities, and in fact, in a talk this afternoon Andre will tell you much more about the syntactical ambiguities that would [INAUDIBLE]. This example, as you said, required the system to understand that largest could mean largest by number of people or largest by area.
And the system selected-- I forgot --maybe the area in this particular example because it liked it better [INAUDIBLE]. There's no learning involved. It just tells the system it could be either or in this particular order besides individual. But it automatically does things if you have analyzed a lot of adjectives, it knows what property to look at. If you ask, how deep is the Black Sea?
It will know to look at that adjective or particular object rather than some [INAUDIBLE] actually analyze the semantics of [INAUDIBLE]. And if it's ambiguous in something, you just have the answers at the same time. So, this ability to automatically, as I showed you, to decompose the question and break it into pieces, and, on the fly, manufacture a set of procedures, it's quite useful to what, in fact, Shimon [INAUDIBLE] does in his talk.
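The adjective-to-property step mentioned here can be sketched as a small lookup that turns an adjective and an object into one or more property queries. The table contents and function name are hypothetical and chosen only for illustration; the list order stands for the preference order among ambiguous readings.

```python
# Hypothetical mapping from adjectives to the properties they ask about.
# Ambiguous adjectives list several properties, in order of preference.
ADJECTIVE_PROPERTIES = {
    "deep": ["depth"],
    "tall": ["height"],
    "large": ["area", "population"],
}


def lookups_for(adjective, entity):
    """Return the (object, property) pairs to look up for a question."""
    return [(entity, prop) for prop in ADJECTIVE_PROPERTIES.get(adjective, [])]


print(lookups_for("deep", "Black Sea"))   # "How deep is the Black Sea?"
print(lookups_for("large", "China"))      # ambiguous: both readings are returned
```

Returning every reading lets the system either pick the first one or present all the answers together when the question is ambiguous.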
Suppose we both see an image or video, and you want the machine to answer questions about it. So what we can do instead of executing a pre-fabricated procedure, we could automatically-- after analyzing the question and creating ternary expressions --we can automatically produce some forward representation [INAUDIBLE] suggested [INAUDIBLE] which is what we're looking at right now. Which will then be converted into a goal for a bunch of visual procedures, again, [INAUDIBLE] what Shimon was talking about, and actually automatically find the answers to the questions, or find image patches or images or pieces of video that actually respond to this question.
So I have a couple of minutes left. I will show you some recent progress-- and you'll hear much more about it from Andre this afternoon --the ability of our more recent system to understand and describe activities in video. So we're trying to go beyond bounding boxes and towards generating responses to the questions. So here's the video.
And the question to the system was, who approached the bin? So after analysis of the video and analysis of the question, the machine responded, the person to the right of the bin, which not only looks like magic, but I think it feels like magic. But Andre this afternoon will tell you much more about how this magic happens. And hopefully you'll believe [INAUDIBLE].
This is my last slide. If we're successful-- and hopefully we will be in some near future --if our machine gets an image like this, it will be able to recognize the objects, and it'll see that this is an amusement park, that people are watching, that there's a stroller. It will be able to answer queries, what is parked in front of the fence, many people looking at each other. And it even will be able to generate a narrative.
It will look at this picture or video and say it's a sunny day at the amusement park, that a blonde-haired mother [INAUDIBLE] blue jeans is waiting with her baby by the carousel, and so on and so forth. And next time, I hope [INAUDIBLE]. Thank you. Today, I should start set up this web. I hope it works. Yes, go ahead.
AUDIENCE: Actually, I have two questions, if you don't mind. The first [INAUDIBLE] dealing with depending on what language would it speak, we can very easily detect features or features of the language that don't exist otherwise. So take, for example, in African-American vernacular English, double negation doesn't necessarily make a positive, it just means somebody is more into it. So dealing with things like that, where the logical structure itself can be very different, is that something that can be easily approached? And should I ask my second question first, or let you answer first?
BORIS KATZ: Well, nothing is easily approached if for the last few days, [INAUDIBLE] discovered that. But as strange as it looks to standard English speakers, the language that most will speak, in fact, have structure and have grammar. As you know, sign languages have [INAUDIBLE] grammar. So then it would be analyzed and then eventually, you'd insist it here. I think you certainly could. And what's your second question?
AUDIENCE: So the question has to do with an interesting feature. I don't know if it has a name or not, but something that I've noticed in Hawaiian pidgin. This word, [HAWAIIAN PIDGIN], you heard of this before? OK, so if you go to Hawaii, and you talk to native speakers, they will often have sentences like, give me [HAWAIIAN PIDGIN] from the [INAUDIBLE].
So basically, the sentence itself is entirely context-dependent, and you only know what they're talking about if you know the person and what they're dealing with. So for certain people, they know exactly what they're talking about. They know that this person wants the coffee, or they want this [INAUDIBLE]. But for somebody else, really, you can't penetrate it.
So if you put enough time with [INAUDIBLE], like a family over there or people you know, eventually you could get fluent, but I feel like if you try to understand it [INAUDIBLE] extra module on there, like a theory of mind to say, oh, this person usually likes hot dogs, and so they want a hot dog, or something like that. So I wondered if there's any way that that can be approached in a nice manner.
BORIS KATZ: Yeah, I imagine at some point, but to me, it is a big [INAUDIBLE] in the culture. We're so behind the simple things that you want to do, that that certainly is an interesting problem, but we need to solve by [INAUDIBLE] from the [INAUDIBLE] problems. So as one can imagine, approaches to, for example, by actually understanding each person or at some point during the system [INAUDIBLE] use this knowledge to answer specific questions after recognizing you.
And again, it's an interesting problem, but we don't know how to do very simple things, and that's what we're doing. Thank you. All right, so maybe we should give Danny a chance, but I'm around all day, so let's talk. And if you guys want to use any of our systems, we would absolutely love to have you use them.
DANIEL HARARI: So I'll go back a little to what Shimon talked about before: vision systems, the ways that we as humans know how to recognize and detect things in our surroundings. And I will focus on the way this ability develops in humans and how we would like to develop it in an AI system. So this is joint work with Nimrod Dorfman and Shimon.
So as Shimon said, most of the visual perception tasks that AI systems deal with are related to object recognition and action recognition, but we would like to do a larger range of human visual perception tasks. And the major approaches that are currently available in computer vision are, at one end, supervised methods, where you have training data labeled for objects and actions, and all kinds of concepts.
And then, you will learn a classifier that discriminates between different categories. But it really doesn't seem a very scalable approach, now that the scope is at least 30,000 object categories. There are many more categories for actions and there's human interaction [INAUDIBLE]. So the unsupervised approach is at the other end of this spectrum, where labeled data is completely unavailable.
And then, most of the methods are working on statistical analysis, trying to learn common patterns within this data. But again, sometimes, the concept that you would like to learn is not very salient in the visual data. It's actually very subtle. And for these statistical methods, it's very hard to learn.
So we might learn with the current photos about this action-- a man is walking a dog, but how about this instance of the action? This is very rare, but still humans can do very [INAUDIBLE] to understand what's going on. How about physical laws? We humans can say exactly what's going to happen and predict-- by the way, don't worry. There's a mattress here. [INAUDIBLE]. This is a match. This is called [INAUDIBLE] system.
And also, by looking at a single image, we can say so much about the social interactions and the relationships between the people in the scene. And something that AI systems cannot deal with as approaches are [INAUDIBLE]. So how do humans do so well on these tasks?
We can go back to infancy and see that newborns actually have very limited capacities on many visual recognition tasks. They do have some face recognition capabilities. They are quite good at tracking motion of different types, but they don't have any object segmentation capabilities, no Gestalt cues. Still, they manage to develop these capabilities in the very first few months of life.
So I will focus on these kinds of developmental [INAUDIBLE] due to the time restrictions. We will go only toward social understanding research. Yeah?
AUDIENCE: So one thing that we've covered in the journal course, the PNAS paper, so you might be able to go through that a little more quickly.
DANIEL HARARI: OK, good. So that's a good point. So I'll go quickly on that, and then maybe we can have a chance to talk a little more about the [INAUDIBLE]. Just a short overview about this. So we would like to find hands. In images of hands, looking for a common pattern would not be a very good idea because hands are very variable in their appearance. And also, motion-- if we go back to think about the way that we usually get free hands in interactions, motion alone is not enough because there's much motion around us which is not related to hands.
So therefore, again, we'll go back to the infant. And we see that infants, for example, are not just sensitive to motion, but they are sensitive to different types of motion, like launching, active, and passive kinds of motion. And there is much developmental work about infants learning to relate hands with agents at a very young age, in the first year of life.
So as you probably saw when you went through the paper, we defined the notion of mover detection, in which motion comes into a certain place, and then, after the motion leaves, there is a change in the appearance of this place. For this, we actually don't need anything related to object detection or even segmentation, just the analysis of this motion, this optical flow in this specific region, and a very short memory about the appearance that was in the spot just before the motion came.
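The mover-event logic described here can be sketched for a single image cell as a small state machine. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes that an upstream step (not shown) has already computed, per frame, an optical-flow magnitude and a scalar appearance value for the cell, and the thresholds and function name are hypothetical.

```python
def detect_mover_events(flow, appearance, flow_thresh=1.0, change_thresh=20.0):
    """Return frame indices where motion leaves a cell whose appearance changed.

    flow[t]       -- assumed optical-flow magnitude in the cell at frame t
    appearance[t] -- assumed scalar appearance (e.g. mean gray level) at frame t
    """
    events = []
    remembered = None   # appearance just before the motion arrived
    moving = False
    for t in range(len(flow)):
        in_motion = flow[t] > flow_thresh
        if in_motion and not moving and t > 0:
            remembered = appearance[t - 1]        # short memory of the spot
        elif moving and not in_motion and remembered is not None:
            if abs(appearance[t] - remembered) > change_thresh:
                events.append(t)                  # motion left, cell changed
            remembered = None
        moving = in_motion
    return events


flow = [0, 0, 5, 5, 0, 0]
# Something passes through and the cell looks the same afterwards: no event.
print(detect_mover_events(flow, [0, 0, 100, 100, 0, 0]))
# Motion arrives at a cell holding an object and leaves it empty: an event.
print(detect_mover_events(flow, [50, 50, 100, 100, 0, 0]))
```

Only three ingredients are used: motion entering, motion leaving, and a before/after comparison of the same spot, which mirrors the claim that no object detector or segmentation is needed.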
So this is a short video showing how does this movement detector, or just general motion not trigger the [INAUDIBLE] detector. But whenever there is an interaction of an object on another active object, then it triggers the detector. Sometimes, there's a noisy event, but still, the ball here, it moves the car. So our [INAUDIBLE] algorithm is able to pick up those events. And our assumption is that in the [INAUDIBLE] of infants, during the first months of age, these kind of events highly correlate with the appearance of hands.
That's a nice internal teaching signal, and then they can start to pick up and learn about the hands' appearance. So we go further and enhance the appearance sample by relating the body features with the hand appearance. And after a very short duration of training on examples like this, with no supervised labelling, we managed to track-- the detection track of the hands. Whenever it's green, it's an action [INAUDIBLE] frame. We do not analyze the motion for this action. The yellow explains that there's no detection track for the motion.
So here we can see just the improvement in the performance. So while this magenta curve [INAUDIBLE] performance. By the way, everybody is familiar with precision-recall graphs? So, in contrast with [INAUDIBLE] the best performance is on the top left corner. So precision-recall graphs [INAUDIBLE] have best performance at the right [INAUDIBLE]. And here you can see that the performance is not that great.
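Editor's sketch: for readers less familiar with precision-recall graphs, the points on such a curve can be computed by ranking detections by score and accepting the top k for each k. The scores and labels below are invented for illustration and are not results from the talk. The function name is hypothetical.

```python
def precision_recall_points(scores, is_hand):
    """Return (recall, precision) pairs for accepting the top-k scored detections, k = 1..n."""
    ranked = sorted(zip(scores, is_hand), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(is_hand)
    points = []
    true_positives = 0
    for k, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
        recall = true_positives / total_positives   # fraction of real hands found so far
        precision = true_positives / k              # fraction of accepted detections that are hands
        points.append((recall, precision))
    return points


# Invented detector scores and ground-truth labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
is_hand = [True, True, False, True, False]
points = precision_recall_points(scores, is_hand)
```

With recall on the horizontal axis and precision on the vertical axis, a perfect detector sits at recall 1 and precision 1, the top right corner of the plot.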
The thing to notice is that for certain [INAUDIBLE], there is a very high precision in this case. If we take just the best-scoring examples, we'd be quite confident that they will actually be hands. And therefore, we can alternatively take these kinds of examples as positive labels. Then we can do the whole training procedure [INAUDIBLE] performance is quite [INAUDIBLE].
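Editor's sketch: taking only the most confident detections as positive labels for a further round of training can be illustrated as below. The fraction kept, the dictionary layout, and the function name are assumptions made for the example; the talk does not specify them.

```python
def select_pseudo_positives(detections, fraction=0.25):
    """Keep the top-scoring fraction of detections to use as positive training labels.

    detections -- list of dicts with a "score" key (hypothetical layout)
    fraction   -- share of detections to keep (assumed value, not from the talk)
    """
    ranked = sorted(detections, key=lambda d: d["score"], reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Invented detections from an initial, noisy detector.
detections = [
    {"patch": "p0", "score": 0.95},
    {"patch": "p1", "score": 0.40},
    {"patch": "p2", "score": 0.85},
    {"patch": "p3", "score": 0.10},
    {"patch": "p4", "score": 0.55},
    {"patch": "p5", "score": 0.30},
    {"patch": "p6", "score": 0.20},
    {"patch": "p7", "score": 0.60},
]
positives = select_pseudo_positives(detections)
chosen = [d["patch"] for d in positives]
```

The selected patches would then be fed back as labelled positives to retrain the appearance model, with no human annotation involved.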
The red curve here just as [INAUDIBLE] outward among the label. This is the best thing that appearance [INAUDIBLE]. And another task that we wanted to learn is about the direction of gaze. And again, our motivation was the ability of infants to follow another person's gaze direction. It's quite useful for learning about the environment, which objects are equitable later on.
Also to learn about object names and language. So we did research for a nice and easy way to learn about this, but still in an unsupervised way because, for infants, there's no easy way to get them labeled like that. So we would like to associate face images with gaze direction, and this analysis was done for 2D. And currently there [INAUDIBLE] standard be also for 3D. And then, once we have these couples of face appearances and directions, we can train a very simple [INAUDIBLE] algorithm: whenever we have [INAUDIBLE] image, we can associate it with similar images of faces and then have the direction.
So how can we create this kind of training set? So again, we went back to mover events, and then we associate any mover event with the direction from the face to the place where the hand manipulated the object. And it's a very strong teaching signal [INAUDIBLE] direction of gaze [INAUDIBLE]. Also, whenever you go pick up something or try to put it down, you look very close reaching [INAUDIBLE].
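Editor's sketch: the session's topic list describes this step as using mover events to generate training data for a kNN classifier of gaze direction. A minimal version, under assumptions, pairs each face descriptor with the angle from the face to the mover event and predicts a new face's gaze from its nearest neighbours. The 2D toy descriptors, the function names, the value of `k`, and the circular averaging of angles are all illustrative choices, not details confirmed by the talk.

```python
import math


def gaze_angle_from_mover_event(face_xy, event_xy):
    """Training label: angle (radians) from the face to where the hand manipulated the object."""
    return math.atan2(event_xy[1] - face_xy[1], event_xy[0] - face_xy[0])


def knn_gaze(train, query_descriptor, k=3):
    """Predict gaze as the circular mean of the angles of the k nearest face descriptors.

    train -- list of (descriptor, angle) pairs; descriptors are equal-length tuples
    """
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], query_descriptor))[:k]
    # Average unit vectors rather than raw angles so that wrap-around is handled.
    sum_x = sum(math.cos(angle) for _, angle in nearest)
    sum_y = sum(math.sin(angle) for _, angle in nearest)
    return math.atan2(sum_y, sum_x)


# Toy training set: two faces seen during events along +x, two during events along +y.
train = [
    ((0.0, 0.0), gaze_angle_from_mover_event((0, 0), (1, 0))),
    ((0.1, 0.0), gaze_angle_from_mover_event((0, 0), (1, 0))),
    ((1.0, 1.0), gaze_angle_from_mover_event((0, 0), (0, 1))),
    ((0.9, 1.0), gaze_angle_from_mover_event((0, 0), (0, 1))),
]
along_x = knn_gaze(train, (0.05, 0.0), k=2)
along_y = knn_gaze(train, (1.0, 0.95), k=2)
```

A real system would replace the two-number descriptors with an image descriptor of the face, but the association step has the same shape.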
So these are kind of training examples. We actually don't use these very high resolution face images. We represent these images with the [INAUDIBLE] descriptor that Shimon was talking about before. And then given the algorithm, applying this to test images. And by the way, this is full generalization: the algorithm hasn't [INAUDIBLE] the face of the tested person in the training set. So here, you can see the results. In red, [INAUDIBLE] model prediction, and the greens are two human annotators that were presented with the same images and were asked just to mark the direction of [INAUDIBLE].
Of course, there's also [INAUDIBLE] in this section, but in general, the performance is quite good. As you can see, the red, which is the algorithm, performs quite similarly to [INAUDIBLE] subjects here in green. So I went through this work briefly and have time now to speak about a study that's more related to scene understanding and then a trigger to know something about objects and their functionality around us.
And object segregation, in that sense, is also something which is learned, not innate. This work, for example, has shown that if adults are shown this kind of an object, which has different appearances and [INAUDIBLE] but in [INAUDIBLE] versus this object. So adults were habituated with this kind of display [INAUDIBLE] these two couples of images.
And for this one, they could predict that this is the exact [INAUDIBLE] object [INAUDIBLE] behind the [INAUDIBLE]. And for this one, they came to a percept of two different objects. And for infants, when they are younger than seven months, they don't have any preference for either of the objects, meaning that they don't have any good prediction of what the objects look like.
But around seven months of age, they already predict and have a longer [INAUDIBLE] at looking at this kind of object, meaning that they already start to figure out some Gestalt cues of good continuation of the object to have gone [INAUDIBLE]. But they don't have any prediction here on these kinds of objects which have different textures.
And it's only later on, by the age of a year or something, two years, that they learn to perceive these kinds of Gestalt cues that allow them to achieve this [INAUDIBLE]. So adults have a very strong capacity for segregating objects [INAUDIBLE] the scene, so how can we do it? So we believe that it all begins with motion. So the first type of motion is common motion.
The second type of motion is the motion discontinuities that can actually even [INAUDIBLE] complex appearance can give us a very good sense of what is figure and what is [INAUDIBLE]. So the task is that we are interested in segregating the object in a static image. But we would like to learn how to segregate the object in this image in an unsupervised way. So we start with the motion segregation.
We first detect motion discontinuities and extract local occlusion boundaries that are along these motion discontinuities, and we create some kind of large dictionary of features that are really discriminative for these boundaries being the object foreground. And then, again using the motion segregation, we use the common motion, the temporal continuity of the object that's moving around, to learn the object form.
We don't need to know what kind of object it is, but for a very short period of time, just by the common motion, we segregate the object from the background. And we can learn something from interference, at least at a certain [INAUDIBLE]. So these kinds of occlusion cues appear when we extract features along the boundaries.
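Editor's sketch: segregating an object by common motion can be illustrated on a small grid of optical-flow vectors, marking as figure the cells that move and move together. This is a simplified stand-in, not the model from the talk. The thresholds, the use of the mean flow of moving cells, and the function name are assumptions.

```python
import math


def common_motion_mask(flow, min_speed=0.5, tolerance=0.5):
    """Mark cells that share a common motion as figure (1) and everything else as background (0).

    flow -- 2D grid (list of rows) of (dx, dy) optical-flow vectors (assumed input)
    """
    moving = [v for row in flow for v in row if math.hypot(v[0], v[1]) >= min_speed]
    if not moving:
        return [[0] * len(row) for row in flow]
    # Mean flow of all moving cells, used here as the "common" motion.
    mean = (
        sum(v[0] for v in moving) / len(moving),
        sum(v[1] for v in moving) / len(moving),
    )
    return [
        [
            1 if math.hypot(v[0], v[1]) >= min_speed and math.dist(v, mean) <= tolerance else 0
            for v in row
        ]
        for row in flow
    ]


# Toy flow field: three neighbouring cells move right together, the rest are static.
flow = [
    [(0, 0), (0, 0), (0, 0)],
    [(0, 0), (2, 0), (2.1, 0)],
    [(0, 0), (2, 0.1), (0, 0)],
]
mask = common_motion_mask(flow)
```

No knowledge of the object's category is needed for this step; the mask comes from motion alone.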
So this is not from our study. It was a study [INAUDIBLE]. But you see the different types of occlusion cues like T-junctions and say-- can you say where is the front-- what is the figure and what is the background? Who says it's on the right? On the left? So it's a really strong cue, and there are two other very definitive types, one of which is convexity.
So convexity suggests that the more convex spots it responds on, and then, there are also very interesting cues about extremal edges. This is something which is very difficult to-- in natural images, it's very hard to reproduce this in synthetic data. So you see here the different reflections and changes in lighting.
But it's a very strong cue, and actually, many of you probably saw [INAUDIBLE] also with the artists that are drawing all kinds of complex things in the containers and stuff like that. They do have this kind of shading that is a very strong cue to perceive that information. So about the object, sometimes these cues are not enough. To discriminate the object, you actually need something more holistic, not just trying to figure out where they are found.
So this is the way we do it. So we start with a moving object. As you can see, [INAUDIBLE] very cluttered and similar background. We extract these boundary features around the motion discontinuities, and we associate these features with the direction of the figure of the object. And we, of course, assume that the moving part is the figure, and this is a good assumption if we think that motion attracts most of our attention and we focus on something which is moving.
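Editor's sketch: the labelling step described here, where each boundary location on a motion discontinuity is tagged with the side on which the figure lies, can be illustrated on a binary moving/static mask. Only horizontal neighbours are checked, and the mask, the output format, and the function name are hypothetical simplifications.

```python
def label_boundary_figure_side(moving_mask):
    """Tag each horizontal motion discontinuity with the side that is figure.

    moving_mask -- 2D grid of 0/1 values, 1 where the cell is moving (assumed input)
    Returns a list of ((row, col), side), where (row, col) is the left cell of the
    neighbouring pair and side is "left" or "right". The moving side is taken to be
    the figure, following the assumption stated in the talk.
    """
    labels = []
    for r, row in enumerate(moving_mask):
        for c in range(len(row) - 1):
            if row[c] != row[c + 1]:
                labels.append(((r, c), "left" if row[c] else "right"))
    return labels


# Toy mask: a moving object occupies the two middle cells of the second row.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
labels = label_boundary_figure_side(mask)
```

In a full pipeline, an image patch cut out around each labelled location would be stored with its figure side, building up the dictionary of boundary features.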
In a very local region around motion, it's usually the only motion that was there around it [INAUDIBLE]. So these are kind of patches that were extracted automatically from these motion discontinuities. And as you can see, without searching for specific types of boundaries, we actually extract those convex, those extremal edges, and some T-junctions, pretty similar to the others we have studied.
The thing that we notice here is that in order to discriminate these kinds of features, we actually need a very large number of features, not few. We shouldn't use the model for more than 100,000 [INAUDIBLE] good detection. As you can see, for the prediction we associate the appearance with the learned features and then we can associate it with the direction of the figure.
For the object, we [INAUDIBLE] a lot of instances of the object appearance or the object [INAUDIBLE] of the motion. And then we learned features through [INAUDIBLE] object. Even if it has the most [INAUDIBLE] and not very segmentation around. So here, you can see the results when applying these boundary features to new images. And again, we've done full generalized approach, so these objects were not seen during training.
But still, you can have a pretty good sense of what's in front and what's behind. You can even see that we don't just get the figure-ground between the object and the background, but also an object has more topological and figure [INAUDIBLE] that even further [INAUDIBLE] on the object itself. From the global object detector, we get a more complete detection of the object, but it's not very accurate around the boundaries.
And then, when we combine the two approaches, we get a very nice and precise segmentation of the object and also information of where the figure is detected. Even if we take a more sophisticated segmentation algorithm and feed this kind of map into it. So Shimon talked about it [INAUDIBLE] the graph cut segmentation.
So this is kind of an algorithm that represents the PowerPoint. It's called GrabCut, and it's a variation of the graph cut. If you apply it to our images with no cues, this is the way it would be segmented. But if we give it our maps as a cue, you can see that it is focused much better on the figure.
So I have five more minutes, so I'll just briefly talk about the work that is going on in the field and in research about object containers. So there are many types of containers and [INAUDIBLE] something very interesting [INAUDIBLE] we have very important functionalities. For humans, and also as a species we learn to use these kinds of containers, but it's not that any object can be a container. And it appears that infants are actually pretty good at it, also in the first year of life.
Just to show you some kind of experiments we've done on infants. So these are three examples of containment. One is an object which is loosely contained in a basket. Another one looks like something mechanical. It's very tight containment. Another type of container. So these habituation examples are being shown to infants. And then, they are being tested.
Again, first the object is near the container, but this time the container is upside down. And when the object is being put on the container, not inside, infants show longer looking time, which means that they are surprised that this happened, and it's a different event from what was in habituation. This happens around nine months of age and sometimes a bit later, but certainly [INAUDIBLE].
This is a different experiment that is being done with the containers. So in the first experiment, a container is being put behind the [INAUDIBLE] behind the container. And then, at the test event, the container is being pushed aside. In the other case, an object is being put inside the container, and then [INAUDIBLE] when the container is pushed aside, the object is revealed behind it.
So in this case, infants show much surprise when this event happens because they expect the object to be transferred with the container [INAUDIBLE]. So this is a very complex concept to understand. We were looking at ways we can learn [INAUDIBLE]. And again, we want to use the notion of motion, try to get both variations, loose and tight-fit, et cetera. So this is just to give you a taste of what we are doing.
We are training on these kinds of events where objects are appearing [INAUDIBLE] and put in containers, during which we learn all that happens to the object in the container, especially about the boundaries. We can detect that [INAUDIBLE] boundaries and [INAUDIBLE] boundaries. And then, even in a static image, we can [INAUDIBLE] what the object [INAUDIBLE] found behind inside.
Everything is being learned from the [INAUDIBLE] and then the model [INAUDIBLE] predict about the [INAUDIBLE]. So let me summarize and say that different complex concepts such as hands and direction of gaze, static object segregation, are very hard to learn in a more general statistical approach. But if we use specific cues, that can be either [INAUDIBLE] during their first months of age. It is possible to learn those very complex cues if-- sorry.
So it's possible to learn all these concepts just by using these cues as internal teaching signals that can then be applied to regular supervised mechanisms of learning and facilitate the learning and then enable further learning of more difficult and complex concepts. So thank you, and if you have any questions.
AUDIENCE: To some extent does detecting the face have to do with real-life correlations between the people? So are there other [INAUDIBLE] that can be [INAUDIBLE]? The first thing that comes to mind is maybe shapes in [INAUDIBLE]. Either you know the light source or if you realize that one shape can then extrapolate to other things. Like anything that has the same angle, I think, is easier to recognize. So are there other [INAUDIBLE] that you're thinking of that [INAUDIBLE].
DANIEL HARARI: So it's a nice idea which can say something about when objects are related to different sources of illumination. The thing is that we usually want to relate human responses to this kind of object, so it's a little bit [INAUDIBLE] strong signal, I think. Because we don't have control of the sources of illumination.
So all objects around us would respond similarly to these sources of illumination if they are changing through time. But they would not distinguish between objects that draw our attention more than others. And in these gaze mover projects, we wanted to see whether there is a specific signal that we can [INAUDIBLE] onto and just train on these examples whenever this kind of event is happening to grab both the face appearance and the specific place of attention. It's more specific.
Associated Research Thrust: