SAM GERSHMAN: In this section, this last section, I'm going to talk about taking those basic building blocks and putting them together to build new kinds of models. You should really think about the building blocks as a kind of modular tool kit. And since it's hard to go through the many different possible compositions of these building blocks, I'm going to focus on just one kind of combination, in the context of motion perception.
So to motivate this, the question is, how do we parse a moving scene? An interesting observation about motion, the aspects of motion I'm going to focus on here, is this idea that when we look at a moving scene, we naturally extract a kind of hierarchical structure from the scene. And I'll illustrate what I mean by that by this little clip.
So when we watch a movie like this, there are a number of things that we see. We can perceive the motion of Dr. Octopus and Spider-Man relative to the backgrounds. And within that reference frame, we can also perceive the relative motion of Spider-Man and Dr. Octopus relative to each other. And we can also perceive the motion of Dr. Octopus' tentacles relative to his body.
And so what we have here is a kind of hierarchically organized collection of nested reference frames. And this kind of nesting of reference frames has actually been studied for many decades. Probably the seminal contribution to that research was Gunnar Johansson's Ph.D. thesis, which was published in 1950 as the book Configurations in Event Perception. What Johansson showed was that you can get remarkably complex percepts using just simple dot configurations. I'll show you a few examples.
So let's consider this dot. Everyone can see that it's moving diagonally, right? What Johansson asked was, what happens if I add two dots that are always collinear with this diagonally translating dot, and have those two dots oscillate horizontally?
So now, I think for most people-- I've seen it so many times it's hard for me to say, but most people will observe that this dot is no longer perceived as moving diagonally, but rather as moving up and down within a horizontally translating reference frame.
Another classic example, which actually predated Johansson is known as the Duncker wheel. So the way they actually did these experiments originally is they took a wheel, and they put a little light on the rim of the wheel, and they turned all of the lights off in the room. And they rolled this wheel. And then they showed it to naive observers and asked them what they saw.
And when you do that, you don't see rolling motion. What you actually see is what's called cycloidal motion, a sort of bouncing thing like that. But if you then add another light at the center of the wheel, then all of a sudden, that same motion is perceived as a rotation, rotating around a horizontally translating reference frame.
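The rolling-wheel geometry is easy to check numerically. A small sketch (the radius and speed are arbitrary choices here): the rim light's world-frame path is a cycloid, but subtracting the hub's translation leaves pure rotation.

```python
import numpy as np

# A wheel of radius r rolling to the right: the hub translates at
# speed r*omega while a rim light rotates about the hub. In world
# coordinates the rim light traces a cycloid -- hub translation plus
# rotation -- which is the bouncing motion observers report.
r, omega = 1.0, 2.0
t = np.linspace(0.0, 4.0, 200)

hub = np.stack([r * omega * t, np.full_like(t, r)], axis=1)  # pure translation
rim = hub + r * np.stack([-np.sin(omega * t), -np.cos(omega * t)], axis=1)

# Adding a center light reveals the hub; rim minus hub is pure
# rotation: the residual stays at constant distance r from the hub.
residual = rim - hub
print(np.allclose(np.linalg.norm(residual, axis=1), r))  # True
```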
So these are two examples of cases where our percept of a particular motion can change radically depending on the reference frame that we create for it. And Johansson, in addition to laying some of the empirical groundwork for these ideas, also laid the theoretical groundwork. In particular, he had this idea of vector analysis: when you see these dot configurations, you basically extract the common motion of the configuration as well as the relative motions, and what you actually perceive is the relative motion within the frame defined by the common motion.
So for example, in the first dot configuration that I showed you, this vector analysis would take this pattern-- those are the observed motion vectors-- and decompose it into the sum of this horizontally translating motion plus this vertical motion that's specific to the middle dot. The problem, though, is that many different vector interpretations are available for a given motion pattern, so how do we choose which one? Over the years, various principles have been proposed-- the minimum principle, that simpler motions are preferred, and the adjacency principle, that you should assign dots to the nearest reference frame. But no coherent theory was proposed that could integrate all of these different principles.
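Johansson's decomposition for the three-dot display can be sketched in a few lines (the velocity values are invented, and the reference frame is taken to be the flankers' shared motion):

```python
import numpy as np

# Instantaneous velocities for the three-dot display: the two flankers
# translate horizontally; the middle dot's observed motion is diagonal.
v = np.array([[1.0, 0.0],   # top flanker
              [1.0, 1.0],   # middle dot
              [1.0, 0.0]])  # bottom flanker

# Vector analysis: subtract the common (reference-frame) motion to get
# each dot's relative motion within that frame.
common = v[0]               # the flankers' shared horizontal translation
relative = v - common

print(relative[1])          # [0. 1.] -- the middle dot is perceived as
                            # moving purely vertically within the frame
```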
So we-- that is, Josh and Frank Jakel and I-- tried to develop such a theory. Our starting point was actually a theory of Bayesian motion perception developed by Yair Weiss and Ted Adelson, which they called the slow-and-smooth theory. They called it that because the idea is that you combine a likelihood-- based on the motion energy coming into the early stages of your visual system-- with a prior that favors motions that are both slow and also smooth across space.
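The "slow" part of that prior is easy to caricature in one dimension. This is just an illustrative Gaussian shrinkage, not the full slow-and-smooth model (which also has a spatial smoothness term):

```python
# A noisy velocity measurement m has likelihood N(m | v, sigma_m^2);
# a zero-mean prior N(v | 0, sigma_p^2) favors slow motion. For
# Gaussians, the posterior mean is the measurement shrunk toward zero,
# and noisier measurements get shrunk harder -- the basic mechanism
# behind many slow-prior accounts of motion perception.
def map_velocity(m, sigma_m, sigma_p):
    gain = sigma_p**2 / (sigma_p**2 + sigma_m**2)
    return gain * m

print(map_velocity(2.0, sigma_m=0.1, sigma_p=1.0))  # ~1.98: clear signal, trust the data
print(map_velocity(2.0, sigma_m=1.0, sigma_p=1.0))  # 1.0: ambiguous signal, prior dominates
```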
But this is a sort of flat theory of motion perception. It doesn't do any kind of segmentation or nested reference frames. So we sought to generalize this idea into something we called Bayesian vector analysis. The centerpiece of this idea is what we call a motion tree, which is a decomposition of a moving scene into a hierarchical set of nested reference frames.
So the idea is that each one of these nodes corresponds to a motion component, which is basically a vector field, and that the observed motion of a particular object in the image is the sum of the motions along a path through the tree. So the observed motion of Spider-Man's hand here is decomposed into the sum of the motion of Spider-Man and Dr. Octopus relative to the background, plus Spider-Man's motion relative to Dr. Octopus, plus the hand's motion relative to Spider-Man.
And the way the model extracts the motion tree is using Bayes' rule: the probability of a motion tree given a sequence of images is proportional to the probability of the images given the tree, times the prior probability of the tree. So this is the point where we start combining the different building blocks that I mentioned before, to get these different pieces of the model.
So we assume that the tree structure is drawn from a nested CRP. Remember, the nested CRP is a probability distribution over trees that can be potentially infinitely deep and infinitely wide. So this allows the model to adaptively select how many motion components there should be to analyze a particular image. And each node in that tree is associated with a vector field that's drawn from a Gaussian process. The Gaussian process basically imposes something like the smoothness constraint from Weiss and Adelson's theory.
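A minimal sampler for that prior might look like this. It's truncated to a fixed depth for simplicity (the actual prior allows unbounded depth and width), and it's a sketch, not the paper's implementation:

```python
import random
from collections import defaultdict

def ncrp_paths(n_obs, depth=3, alpha=1.0, seed=0):
    """Draw a path through a nested CRP for each observation.

    At every level, an observation joins an existing branch with
    probability proportional to its count, or opens a new branch with
    probability proportional to alpha (the rich-get-richer rule).
    """
    rng = random.Random(seed)
    counts = defaultdict(lambda: defaultdict(int))  # node -> child -> count
    paths = []
    for _ in range(n_obs):
        node = ()
        for _ in range(depth):
            children = counts[node]
            r = rng.uniform(0.0, sum(children.values()) + alpha)
            chosen = None
            for child, c in children.items():
                if r < c:
                    chosen = child
                    break
                r -= c
            if chosen is None:
                chosen = len(children)   # open a new branch
            children[chosen] += 1
            node = node + (chosen,)
        paths.append(node)
    return paths

# Shared path prefixes correspond to shared reference frames.
print(ncrp_paths(5))
```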
And then finally, one nice thing about using Gaussian processes is that the sum of two Gaussian processes is another Gaussian process. That means you can take the tree structure and analytically marginalize out all of the component motions to compute the marginal likelihood of the observed motion. So that allows us to compute the posterior probability of a tree given the images without dealing directly with the component motions-- they're marginalized out.
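The marginalization trick can be sketched numerically (the kernels and lengthscales here are arbitrary choices): if two independent motion components are GPs with kernels K1 and K2, their sum is a GP with kernel K1 + K2, so the observed motion can be scored directly.

```python
import numpy as np

def rbf(x, lengthscale):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def log_marginal(y, K, noise=1e-2):
    # Gaussian log marginal likelihood of y under covariance K + noise*I.
    C = K + noise * np.eye(len(y))
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet + len(y) * np.log(2 * np.pi))

x = np.linspace(0.0, 1.0, 20)
# Coarse reference-frame motion plus fine relative motion: the sum of
# the two GP components has kernel K1 + K2, so the components never
# need to be inferred explicitly to score the observed motion.
K_sum = rbf(x, 0.5) + rbf(x, 0.05)
rng = np.random.default_rng(0)
y = rng.multivariate_normal(np.zeros(20), K_sum + 1e-2 * np.eye(20))
print(log_marginal(y, K_sum))  # score of the two-component model
```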
So this is what happens when you apply the theory to these kinds of stimuli. Here's the Johansson stimulus. The percept is this nested motion percept of vertical motion nested in horizontal motion. And what the model extracts is something that looks like this. The tree here is actually quite simple: the root node is this horizontal motion, and then you have this vertical motion here as its child.
And you see that the bottom and top nodes get assigned to this root node. Or rather they get assigned to a path that starts and terminates with the root node, whereas the middle dot gets assigned to a path that goes from the root node to the child.
The Duncker wheel is the same kind of story. If you just see the motion on the rim, you see basically cycloidal motion. But when you add the center light, the model extracts this rotational motion that's nested within a horizontally translating reference frame.
And then this model can capture a bunch of other things that we don't have time to go into. But this is just an illustration of the ways in which you can take these different building blocks-- in this case, Gaussian processes and nested Chinese restaurant processes, and combine them in ways that make sense.
AUDIENCE: By the way, [INAUDIBLE].
SAM GERSHMAN: Yeah. But it's interesting to think about the balance between simplicity and complexity in this kind of model. What is simplicity in this context? Well, the simplest thing would just be a single node. But the model will prefer nesting motions, to have more shared structure across motions, rather than building a large number of separate clusters.
So think about the motion tree. There are different things that could happen, right? It could try to make the tree deeper, or it could try to make the tree wider. And the model will prefer making the tree deeper because-- well, it's difficult to explain without going into the math, but the description length of the model is shorter when you have more shared components in the hierarchy.
OK. So the last part I'm going to talk-- oh, yeah.
AUDIENCE: So you're motivating this, but then you want a unifying theory for all [INAUDIBLE].
SAM GERSHMAN: Yeah.
AUDIENCE: And so if you could talk a little-- you kind of talked a little bit about it right there, but what exactly does the prior preferred-- it seems like all the work is being done by the prior. Or a lot of the work is being done by the prior, right, in terms of giving you a theory. And so it prefers nestings. Is there anything else that you can [INAUDIBLE]?
SAM GERSHMAN: Well, in this model, there's going to be a dominance of local information compared to distant information. So there's a proximity sensitivity here. This is basically equivalent to Gogel's adjacency principle, or one version of it: objects are going to be differentially sensitive to reference frames that are nearby rather than farther away. And the idea of--
AUDIENCE: The reference frames [INAUDIBLE] spatially contiguous?
SAM GERSHMAN: They don't have to be spatially contiguous. I mean, I should say that there are lots of ways that this model is sort of incomplete. For example, it doesn't really deal properly with certain kinds of motions that we know obey physical constraints in the world, like articulated motion, right?
So in the case of biological motion, this is not the same as continuity. In some senses it's a stronger constraint because we know that the physical structure of the body skeleton imposes articulation on motions. So things can only move around joints, right? But the model doesn't know anything about the skeleton. We just show it here for illustration purposes. It just knows about the coherent changes of different points on the skeleton.
But yeah, there's no continuity constraint. You could build something like that in. But that's actually a little dangerous, right, because of things like occlusion. So if part of my body-- my body's being occluded here, but you're not going to all of a sudden think that I lost half my body or something like that. OK. So does that answer your question, Alex? OK.
So now I'm going to just talk briefly about automatic structure discovery, and I think Josh will talk a little bit more about some of these things. So we had a kind of set of building blocks for structure. Right now we're more or less picking the right structure by hand. Can we do this automatically?
Charles Kemp a number of years ago developed the idea that you could devise essentially a grammar, a graph grammar, that would build up a very large family of different structures through a small set of production rules. And then the idea is that if I give the model a particular data set, then it will use these production rules. It will try to recover the set of productions that gave rise to those data.
So it's trying to analyze the structural form of a data set in terms of a small number of structure types, like partitions, chains, orders, rings, hierarchies, et cetera. Some examples: if you feed the model the features of animals, it can recover something that looks like a taxonomic tree. If you give it the voting patterns of Supreme Court judges, you get a linear, left-to-right space. If you give it color data, you get a circular space. If you give it distances between cities, you get a map-like structure. In all these cases, these are the structural forms that best explain the data.
More recently, Roger Grosse pursued a related line of reasoning where he was trying to unify a bunch of different probabilistic models that were based on matrix factorization. The idea here was to construct a grammar whose productions could be composed in potentially very complex ways. The grammar had production rules like low-rank approximation, clustering, linear dynamics, and so on. And when you set it up like this, it can actually produce many of the models that are in wide use today and show how they're all interconnected. So it can show how things like low-rank approximations are connected to low-rank linear dynamical systems, or sparse coding, and so on.
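One of those production rules is easy to illustrate on its own (a toy sketch, not Grosse's system): "low-rank" rewrites a matrix as the product of two thin factors, and truncated SVD gives the best such fit in squared error.

```python
import numpy as np

# Toy illustration of one production rule from such a grammar:
# "low-rank" rewrites a matrix as the product of two thin factors.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))  # exactly rank 3

# Truncated SVD is the best rank-k approximation in squared error.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
X_hat = (U[:, :k] * s[:k]) @ Vt[:k]

print(np.allclose(X, X_hat))  # True: the rank-3 structure is fully recovered
```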
And importantly, you can apply it to a data set, and it will figure out the right combination of these different production rules to model that particular data set, basically by doing a kind of greedy search over the marginal likelihood.
Even more interesting, I think, is the follow-up work to that. It's like, how far can we actually go? Can we basically replace scientists with computers? All right, so I can have my automatic structure discovery algorithm, which I give some data, it finds structure, and then I write a paper about it. But maybe I don't even have to write the paper.
And that's actually exactly what they did. So they defined a family. In this case, they were using Gaussian processes-- but this general idea is not unique to Gaussian processes-- where they defined a large number of different kinds of kernels and ways to combine them. And then what they did was they used Bayesian model selection to pick the right set of kernels. And then they had compositional descriptions of those kernels that they could basically use to synthesize a quantitative description of a data set.
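The kernel-search idea can be sketched like this (an illustration under assumed kernels and hyperparameters, not the authors' system): build candidate kernels by composing a few base kernels, then score each composite by its GP log marginal likelihood on the data.

```python
import numpy as np

def rbf(x, ell=0.3):
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def periodic(x, period=0.25, ell=0.5):
    d = np.pi * np.abs(x[:, None] - x[None, :]) / period
    return np.exp(-2.0 * np.sin(d) ** 2 / ell ** 2)

def log_ml(y, K, noise=1e-2):
    # GP log marginal likelihood: automatically trades data fit
    # against model complexity (the log-determinant term).
    C = K + noise * np.eye(len(y))
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (y @ np.linalg.solve(C, y) + logdet + len(y) * np.log(2 * np.pi))

x = np.linspace(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x / 0.25) + 0.05 * np.random.default_rng(1).standard_normal(60)

# Candidate "programs" built from the kernel grammar, scored on the data.
candidates = {"RBF": rbf(x), "PER": periodic(x), "RBF + PER": rbf(x) + periodic(x)}
for name, K in candidates.items():
    print(name, round(log_ml(y, K), 1))
```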
So in this case, you give it some data set-- I can't remember if it was solar data or something like that. Oh, you're going to talk about it? But I highly recommend you go look at these. They created a bunch of these automatically generated papers. And it actually reads like a paper-- well, a sort of poorly written paper-- but it was generated automatically. It's actually quite remarkable.
OK. So just to sum up: the motivation for non-parametric Bayesian models is to facilitate the process of structured learning in a way that balances arbitrary complexity against a preference for simplicity, so you can get the just-right structure. And I talked about a growing experimental literature in cognitive science supporting the idea that some of these things might actually be going on in the brain, and about how you can take the basic building blocks, like clusters, features, and functions, and compose them in new and interesting ways.
OK. And if you're interested, there are a few papers that I've written, some other people have written, that summarize this and also go into more of a mathematical [INAUDIBLE]. Thank you.
I'll be around for the rest of the day, so feel free to find me. I'd like to meet as many of you guys as I can.
JOSHUA TENENBAUM: So we're probably-- some of the things I'm going to talk about, it's very related to what Sam talked about, and just very complementary almost. I can always expand this [INAUDIBLE] time allotted to me, unfortunately. I don't know if we'll use the whole time. I guess we have until-- we have till 3:30 officially, or what? How long does this session [INAUDIBLE]?
AUDIENCE: [INAUDIBLE]. 4:00, right?
JOSHUA TENENBAUM: Well, we'll see how it goes. We might end a little bit earlier. Maybe [INAUDIBLE] is still getting all these notes on tape. So you might want to do the first part of your-- the more lecture-y part of your tutorial before the break and come back to-- we'll see. I'll go for somewhere between 30 to 45 minutes, and we'll see how it goes.
We could also have some time for discussion and sort of general questions on everything you've seen since morning. And another thing, if we have extra time, is somebody asked a question about how the kind of math that we're talking about here at the cognitive level relates to the math of neurons. I just [INAUDIBLE] really interesting open question. That's probably a more interesting discussion to have later on in the week, after you've seen-- or later on in the summer school through next week or even the week after. But it's never too early to start asking these questions.
Another version of that question is how can you implement the kind of things we're talking about here in neural networks? What would that even mean?
OK. So we've gone through the introduction-- basic cognition as Bayesian inference-- and structuring the world with non-parametric Bayes. I'm going to talk about learning to learn with hierarchical Bayes. But again, hierarchical Bayes and non-parametric Bayes go very naturally together. They both have to do with ways of building hypothesis spaces, not just [INAUDIBLE], and building them in flexible ways that respond to your previous experience with related problems or aspects of some domain.
So think of this as just sort of an additional set of applications. And maybe at the end-- I have some slides on the same kind of stuff that Sam talked about with grammars for-- taking this idea of grammars for structure and maybe say a few other things about those topics.
So let's go back to this problem that we presented in the morning. Again, related to many of the things Sam talked about, learning a concept from examples. How are you able to do this? How are you able to take a few examples and figure out which other things are in the concept from those?
Well, it must be because you somehow have already built up, from some other experience, the right kinds of hypothesis spaces, right? There's no free lunch: if you're getting an infinite concept from just a few examples, there must be a lot of other stuff that fills in the gap, some of which is based on the other objects that you can see here-- in some sense, in a kind of semi-supervised or unsupervised structure-finding way. All these other objects help you to see what kinds of things are out there in the world.
But also, your ability to correctly find the right kind of structure in this much larger array depends on your general knowledge of objects. Another way to put it is, if I had just shown you these things and said, what do you think a tufa is, you'd know a lot less than if I did what I'm doing here, which is showing you this whole world and then giving you a couple of labeled examples. There's a lot you can extract just from that whole world, by this sort of fancy kind of clustering. But your ability to do that depends on even more general abstract knowledge about visual objects.
So the way we've modeled this kind of thing is to say, well, really you have to start off with this big unsupervised bag of objects, which you're able to cluster in some way. The way we did it was sort of more of a hierarchical picture like this, where we said, basically by looking at the relevant features of these objects, you can group them into small groups and then group the groups into something like a hierarchy-- you can think of this as an intuitive version of an evolutionary tree, where these objects get organized into these big branches and then sub-structures there.
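The unsupervised grouping step can be sketched with off-the-shelf agglomerative clustering. The features and the two "species" here are synthetic, purely for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Two synthetic "species," each a cloud of feature vectors.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 0.3, size=(10, 5)),
                      rng.normal(2.0, 0.3, size=(10, 5))])

# Bottom-up agglomerative clustering builds a binary merge tree;
# each internal node is a candidate group (a possible word meaning).
tree = linkage(features, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)  # the two species land in the two top-level branches
```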
And then imagine that each node in this tree is a possible hypothesis for what the word tufa could mean. So maybe it labels this branch or this branch or this branch or that branch. Remember the [INAUDIBLE] and Griffiths stuff that you talked about, where people were learning a simpler, three-level, artificial, pre-structured taxonomy with one-syllable names for each of these categories.
So this is a similar kind of idea on a bigger scale. And then the idea is, all right, if you're given a few examples that are over there in that part of the tree and told, this is a tufa, this is a tufa, that's a tufa, well then, what could the word tufa mean? Maybe it refers to this branch, maybe this branch, maybe this branch. Already you've narrowed things down a lot, because out of all the possible subsets of these objects, you're now only considering the subsets corresponding to branches in this tree. And given these examples here, you're ruling out branches that don't span the examples, kind of like in that predicting-the-future example, where you rule out all the hypothesized values of t_total that are less than t.
And then what you're left with are the sort of branches that are bigger than the example set. There's only a few of those. And you're able to pretty quickly figure out, well-- or pretty confidently figure out it's probably this one here, because otherwise it would be sort of a suspicious coincidence to get the first three tufas, say, all the way clustered in this branch here, if, say, tufas referred to this really big branch up here, or even this much bigger one here. Then it'd be kind of a coincidence to get the first three tufas over here if the set of tufas was this much larger set, kind of like the coin flipping, where you could get five heads in a row flipping a coin, but it's a bit of a coincidence.
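That suspicious-coincidence argument is just the size principle, and it can be sketched numerically (the branch sizes are invented): under strong sampling, a hypothesis containing s objects assigns each example likelihood 1/s, so n consistent examples give (1/s)^n, and the smallest spanning branch wins quickly.

```python
# Nested hypotheses: three branches of the tree, all consistent with
# the observed examples, differing only in how many objects they contain.
sizes = {"small branch": 4, "medium branch": 12, "whole tree": 50}
n_examples = 3

# Size principle: likelihood (1/size)^n, with a uniform prior over branches.
posterior = {h: (1.0 / s) ** n_examples for h, s in sizes.items()}
Z = sum(posterior.values())
posterior = {h: p / Z for h, p in posterior.items()}

for h, p in posterior.items():
    print(h, round(p, 3))
# small branch 0.964 -- three tightly clustered examples already make
# the bigger branches look like suspicious coincidences
```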
So it's that sense of suspicious coincidence, interacting with the right structure of hypothesis space, which we think explains your ability to get this concept from just a few examples. And there might be a little bit of uncertainty. Remember this one over here-- people weren't quite sure. Maybe that's a tufa, maybe it's not. Probably not, but there's a little bit of uncertainty. And in general for these ones in here, you get a little bit of residual uncertainty that you don't get for these ones over here or over here, which, as you remember from our first demo, is kind of how it worked.
So I'm not going to go through any of the math of how we built these kinds of Bayesian models of word learning, but that's the basic idea. What I want to focus on here, again echoing themes that Sam was covering, is: how do you build up this thing in the first place? How do you know how to build this tree?
So there's at least two kinds of things you need to know. One is, you need to know, in a sense, what features to pay attention to. What counts as similar? There's a sense in which I think we'd all agree, these things over here are more similar to each other than they are to these things or these things. And these things over here are more similar to each other than they are to these things.
But how do we describe that knowledge? Learning to learn the word tufa means learning what counts as similar about objects in general so that you can build up trees like this from the right experience. And then the other thing is maybe, like, how do you learn that you should have a tree at all? That was more like the stuff that Charles Kemp was working on. So I'll talk a little bit about that after I talk about the aspect of learning to learn in terms of what features to pay attention to, what counts as similar.
Also, in keeping with thrust one of this research program, I want to provide a concrete developmental angle, because these aspects of learning to learn are actually really important in children's word learning. Some of my favorite developmental psychology is this work of Linda Smith and colleagues, which I'll talk a little bit about here; it helped to motivate a lot of this work we did on hierarchical Bayesian models of learning to learn.
So here's an example of a little thing you can do with your favorite two year old. You can show them a new object like this, and you say, this is a dax. And now you can show them a few other objects like this and say, can you show me which one of these is a dax? So notice here, you have an object that matches this one in its shape but not in its, say, material properties. Here you've got one which matches in its material properties but not in its shape. And here you've got one which doesn't match in either of those.
So if you show this to a two-year-old, which one do you think they'll pick as the dax? How many people say this one in the middle? How many people say this one? How many people say this one? Presumably you'd do the same thing.
So two-year-olds show what is called the shape bias. It's not a bad bias. It's probably a good bias. They preferentially generalize a new word to other things of the same shape but differing in, say, material properties rather than [INAUDIBLE] which match in material but not in shape.
It's a sensible thing to do-- a lot of words for categories are organized this way. In vision, shape analysis has always been central to object detection and object categorization. And many of children's first words refer to objects that are organized in this way.
But it's interesting that it's not like your brain comes pre-wired to do this, at least in the context of word learning. Two-year-olds are 24-month-olds; but if you take a 20-month-old-- someone who's not yet two-- they don't show the same shape bias in this task. They would typically be at chance: as likely to pick this one as to pick that one.
AUDIENCE: Does it matter that you have this object in the center that is U-shaped that is very similar to the one with the right texture?
JOSHUA TENENBAUM: It's a good question, but I don't think it matters. I'm not sure.
By other measures, even younger infants are already more interested in object shape than material properties. So just to be clear, it's not like they can't perceive shape or be interested in it. But it's something specifically about knowledge about how a new word is going to-- what kind of features a new word is going to map onto.
So there's at least some evidence that this bias is learned. Now, just because something develops-- and you'll hear a lot more about developmental psychology over the next week. Just because something develops doesn't mean that it's actually learned. It could be just like-- a lot of things mature in the brain. A lot of things are happening between 20 months and 24 months. Maybe just your brain-- something's changing in the wiring of your brain that isn't really something you want to describe as learning. Or maybe you're just sort of-- I don't know. Who knows what's going on?
But there's actually really interesting evidence that this is learned and learned in a way that's specific to learning about language and learning about how words are used in language to label objects. And that comes from-- this is sort of much older work that Linda Smith did together with Barbara Landau and others back in the '80s.
But in some more recent work that's about 10 years old, Smith and colleagues actually did some very interesting training studies to kind of get at how this kind of knowledge is learned. So you can take a 17-month-old, someone who's significantly younger than the 20-month-olds who are not passing the shape bias test here. So they're again going to be at chance here.
And bring them into the lab and give them an experience of this sort. So there are four pairs of objects. Each pair has the same shape but differs in size, material, color-- other possibly relevant properties. And the 17-month-olds come into the lab, and they're taught a new label for these interesting new-shaped things. Like, this one's a wib. The experimenter takes out these toys and plays with them, saying, oh, here's the wib, and here's my wib. Let's play with our wibs. Can you put your wib on top of my wib? And they do that for a few minutes.
Then bring out these ones and say, oh, OK. Here's a lug for you, and here's a lug for me. Let's play with our lugs. And here's a zub for you and so on.
So they just play in a fairly natural way, using the words to refer to these two objects in each pair. And they do this for a few minutes. And then they do this again a week later, and again the week after that. And they do this for maybe six weeks, eight weeks, each time for about 20 minutes a week and each time with the same four pairs of objects.
So kids have some systematic experience over a couple of months with pairs of objects that are given the same name based on shape. But it's not a whole lot of experience in the sense that it's really only these same four pairs of objects each time.
And somehow from this, after eight weeks of training-- and I think it also works with just six weeks-- the 19-month-olds who have been through this experiment have now acquired the shape bias. So how do you know that? Well, first of all, if you show them other objects with the wib shape-- other of these square arches-- they will call them wibs. So they will generalize these words to yet other novel-looking objects with the same shape.
But you can also give them this task, and now they will pass it. This is what's called a second-order generalization. A first-order generalization is generalizing wib to other similarly shaped arch-like objects. A second-order one is generalizing a totally novel word, which they've never heard before, applied to a totally novel shape, to other things that have that same totally novel shape. People get that?
AUDIENCE: What happens if you do this training study but instead of common shapes you use material properties?
JOSHUA TENENBAUM: I will come to that. Re-ask that question in one minute. Two minutes.
So this was pretty cool that this works. And also experimentally, the second order generalization is basically as strong as the first order generalization. So they do this quite reliably, just as reliably for dax, this new thing, as for wibs that they've had this experience on. So it seems like they've learned something generalizable about how to map words onto new object categories.
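One way to caricature that "learning to learn" as hierarchical Bayes (toy numbers, not the actual developmental model): pool within-category variability across the trained categories to learn which feature dimension is reliable, then use that overhypothesis to generalize a brand-new word from a single example.

```python
import numpy as np

# Each trained category's members share shape but vary in material
# (columns: shape, material). Toy numbers, for illustration only.
categories = [np.array([[0.50, 0.1], [0.51, 0.9], [0.49, 0.5]]),
              np.array([[0.80, 0.2], [0.79, 0.8], [0.81, 0.4]]),
              np.array([[0.20, 0.7], [0.21, 0.1], [0.19, 0.9]])]

# Higher-level learning: pool within-category variance per dimension.
# Tiny shape variance encodes the overhypothesis "names track shape."
within_var = np.mean([c.var(axis=0) for c in categories], axis=0)
print(within_var)  # shape variance ~1e-4, material variance ~1e-1

# Second-order generalization: one example of a brand-new category.
new_example = np.array([0.65, 0.3])
candidate = np.array([0.66, 0.95])  # same shape, different material
is_match = abs(candidate[0] - new_example[0]) < 3 * np.sqrt(within_var[0])
print(is_match)  # True: the novel word generalizes by shape alone
```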
But the most interesting generalization is outside of the lab. So here's data on the same kids' English vocabulary. So these are real words in English and their real meanings. And what they did was they took the kids at the beginning, 17 and 19 months, and they used standard measures of kids' vocabulary, which at this age, you basically ask parents from a big checklist, which of these words does your kids know? It's pretty noisy data, but there's some reliability to it. And this has been validated in a lot of ways over the years in developmental studies.
And there's a very good control group here. So though the error bars are rather big, the effect is quite convincing, and it's been replicated. Here's the basic effect: the kids who've gone through this training over two months, from 17 to 19 months, start off knowing about 15 words. Probably most of you don't have kids-- a 17-month-old hardly knows any words, really. They're just beginning to learn words in an explicitly recognizable way from their parents. So they know, on average, about 15 or 20 object names.
But after they've gone through this experiment two months later, they know on average 60 names. So they've really more than tripled their vocabulary. But the interesting thing is now compare it with the control group who had exactly the same experience, played with exactly the same toys, but just didn't get the words. So they would have the same toys and somebody would say, oh, you know, here's yours, here's mine, let's play with them. Oh, that's cool. Let's do that for a while. And now here's this. Here's yours. Let's play with these other ones. Here's yours, here's mine, let's play with them-- so the same four pairs of objects on the same experience without the words used to label the things of the same shape.
And what they found is that after two months, the control group had only learned about 30 words. So there's a factor of three difference in terms of the number of object names learned by the kids who had learned the shape bias. And also, yeah, those kids in the control group didn't learn the shape bias either, for the artificial objects and the artificial words.
It's also importantly an effect which is specific to object naming. It's not like these kids were just generally better at language acquisition or had learned something just generally about what words mean. Because if you look at all the other words the kids know, they also learned more over those two months, but there's no difference at all between the two conditions. So it's a really striking effect.
So somebody was raising their hand back there? It's a really striking effect. I don't know of very many studies that show anything like this kind of transfer from learning something in the lab to learning something in the real world, particularly something that's really quite important for real-world cognitive development, like early vocabulary [INAUDIBLE]. Yeah, question?
AUDIENCE: How do you explain all 24-month-olds having it and 20-month-olds not having it, even though there's a whole range of parenting styles, which might include something like this? Some parents might do something like this-- especially if this is only 20 minutes a week.
JOSHUA TENENBAUM: Yeah. Well, so first of all, when you say that 24-month-olds have it and 20-month-olds don't, that's not a super-precise thing. That's, of course, going to be sensitive to what culture and what language you're speaking. So that's only true for kids who are brought up with English as their first language in a particular kind of socioeconomic status group that is a typical laboratory population.
It's interesting that for kids who learn a different language-- I mean, I think that there are effects that are definitely going to be driven by what language you're learning. So languages that don't have as many frequent early words for concrete object naming show a later-developing shape bias, and it's much weaker. And in general-- I don't know if it's parenting style exactly, but the environment the kids grow up in-- the exact, precise timing of language milestones is going to vary as a function of that.
So there are going to be error bars on that within any individual demographically defined group, as well as significant variation across groups. The key thing is that this study was done in a typical experimental population, and that data is relevant for that population.
And this study has been replicated a bunch of times. I don't know how cross-culturally this kind of thing has been done. But at least the claim, which I think is well supported, is that something like this mechanism-- the experience of learning the first words and kind of bootstrapping off of what is generalizable across those first initial object names, namely that they all seem to refer to some different component or dimension of shape-- or not all of them, but that's a statistical tendency-- that is what's driving the change in this group, at least in the normal version of this population, from 20 to 24 months. Yeah?
AUDIENCE: How do they-- couldn't it be the case that it's not learned, that the kids have an innate bias for shape, and that somehow it's triggered by-- typically, when kids are 24 months old, parents trigger that sort of behavior. How come [INAUDIBLE].
JOSHUA TENENBAUM: Yes, that is quite possible. Again, when we say that it's learned, we're not saying here that they've learned to see shape and somehow before they didn't. Linda Smith has actually made some claims, which I'm not sure how to interpret, that learning words-- the same process here-- actually changes how you see shape.
I'm not going to talk about that here. I think there's probably something right about that and other parts which I'm just not so sure about. But yeah, I think this is totally consistent with-- and the models I'm going to present, actually, I'll show you almost three different models of this which vary in terms of-- one version basically says, you're already perceiving shape, and you're just kind of in a sense learning that that's the thing that counts for this--
AUDIENCE: No, no, no. I have two 18-month-old kids. They definitely know shape.
JOSHUA TENENBAUM: You could do a within-family control.
AUDIENCE: As soon as I get home, I'm going to start giving them these things. But what I'm saying is--
JOSHUA TENENBAUM: The control kid is never going to forgive you.
AUDIENCE: --the bias to associate shape with word learning, that could also be innate, and it's triggered artificially in the experiment. So it's just the triggering of the innate bias towards the word learning.
JOSHUA TENENBAUM: Right. So I think a lot of that is probably likely to be true, in that again, there's work on-- other [INAUDIBLE] work, which I think we'll talk about more next week, showing that even 10-month-olds-- like, there's a bias to sort of track object shape over, say, object color in infants who are younger than one year already, and to sort of in some ways associate that with nameable object properties.
So yeah, I think there's a lot of reasons. I think there's kind of-- when it comes to naming classes of objects, that there's at least a much more early developed, not necessarily innate bias to be interested in shape. I think many people would say-- the people who argue for innate biases would say, it's not so much about shape as function and essence, and shape is just a kind of observable cue. So it's possible that what's learned is that shape is a good cue to object function. I mean, there's many different ways to interpret this. And here I don't want to weigh in on that, but I think that's perfectly consistent.
This is also a good point to mention the thing that Sam asked me about. So Sam said, what if you do the same experiment with-- what did you say? With material property, yeah.
So they did that. Smith and colleagues did that. And what they found is, first of all, that it was a bit harder to train them, so they didn't learn it as well. And that's consistent with the idea that there's some kind of innate bias towards shape as the basis for object categories.
The other thing is that it actually had a negative effect on their real world vocabulary. They actually learned fewer words. You can see the error bars are big. It wasn't a significant effect, but it was negative. And so they stopped. It was one of the things, like, ooh, we just thought we were doing a fun experiment, but we're actually messing these kids up.
So they stopped. And they then brought them in for a follow-up study. This was never published; I only heard about this from Linda Smith. They brought them in for a follow-up after six months or a year to check that the kids' vocabularies had kind of recovered. And they mostly had.
Anyway, but that's consistent with the same idea that yes, there's probably some at least-- pre-existing at this age, pre-existing tendency to associate object category names with shape. But whatever you're learning in this experiment is really-- it is the driver of the real world, much more generalizable [INAUDIBLE] biases.
AUDIENCE: So has anyone done analyses of the statistics of different languages, predicting the shape versus material preferences? Are they actually predictable?
JOSHUA TENENBAUM: Yes. And again, that depends on the language. And yeah, Larissa Samuelson, who is one of the co-authors on some of these studies, did some analysis of that. And so yes, that varies language by language. Different languages just sometimes label different things, or label them preferentially more often, and that predicts the different ages at which kids acquire the shape bias, and the strength of the shape biases that they [INAUDIBLE].
So this is very interesting stuff. I mean, if you want to talk about how real learning works, real human concept learning-- I think the single most important thing that's different from the standard machine learning view is that real world concept learning happens very quickly. You just need a couple of examples of a concept.
But even more important is the learning to learn-- that the inductive biases that allow you to learn a new concept from so few examples can themselves be acquired from just a few examples of a few example concepts, [INAUDIBLE]. So it's not just that you learn to learn. You learn to learn from just a few examples of a few concepts. And then you generalize that to this whole broad set of object concepts. So that's very interesting to try to understand how that kind of thinking works.
In the real world of objects and language, of course, not all things are solid functional artifacts. And not all words are common nouns that label those things. And kids do wind up learning, a little bit later, other kinds of inductive biases.
So for example, for these non-solid substances like toothpaste or honey or jello and so on, kids by around age three, I think, acquire a sort of material preference for nouns which label these things. So like, here, if you say, oh, that's some tufa, then kids would think-- like, three-year-olds would think, oh, tufa is any kind of green toothpaste-y goo like that, or whatever, as opposed to anything that just happens to come out in that shape. Here you have a bunch of things which all have the same shape. They also have the same material properties. So if we called them jello, that would be fine.
But you can show that kids are not-- again, for non-solid substances and the nouns that label them, kids have the material preference rather than generalizing by shape. This also gets bootstrapped off of [INAUDIBLE] syntax of count nouns and mass nouns, which I'm not really going to go into; it's also very interesting. In languages like English-- not all languages have this-- there's the difference between count nouns like "tube," where you can say the tube, one tube, two tubes, three tubes-- you have the plural and you can count them-- versus mass nouns: you don't say one goop or two goops, or three toothpastes or four toothpastes. Or if you do say there are three toothpastes, what you mean is there are three tubes of toothpaste. So these kinds of inductive biases also interact with learning the syntax of noun phrases in a very interesting way.
Here's one other cool thing, which is you take these things here. Well, we can just do a little bit of a demo here. So let's say this is a blicket. So which of these do you think is a blicket? How many people say this is a blicket? How many people say this one is? So most people go with the thing that's a different color, the same shape. You show the shape bias.
But suppose I do this. Do you see what happened? It's a little low-- can you see what happened? OK, so for those of you who can see, now this is a blicket. How many people say this is a blicket? How many people say this one is?
Right. So your preference changed. What happened? I just added these two little dots that kind of look like eyes. And again, this is work that Smith and Landau and others showed much earlier, that if you perceive this as an animate creature, then its shape isn't as important, or at least not its rigid shape. This thing looks like it could be a plausible non-rigid transformation of the same body. And creatures are more likely to change their shape, say, in this way than they are to change their color and texture in this way.
So this is, again, a really important thing in toddler development, learning these flexible and kind of domain-specific inductive biases. All right, so how can we do this? Well, here's a way that we tried to use hierarchical Bayesian models to do this. And I'm just going to show you a very simple one. This is also helpful because this will be one of the models that is on the Church [INAUDIBLE] thing that you can look at. And in some sense, it's also just a tutorial on a very basic hierarchical Bayesian model that's called the Dirichlet-multinomial model.
And Charles Kemp and Amy Perfors, who were two students in our group a few years ago, realized that this pretty basic hierarchical Bayesian model could be used to at least qualitatively describe some of the kinds of things that you're seeing here. And then I'll show you how that worked. And then I'll show you some interesting work that people in our group did later, taking this and building this out actually to build useful computer vision systems based on this idea that it could actually improve how they learn object categories in a more real world, computer vision context by building on the sort of idea.
So here's an even more simple abstract game that will help you see how this kind of learning to learn works. Let's say we have a bunch of bags here, bags of marbles. And we draw one marble out of this bag. It comes out blue. So what do you think-- if I draw another marble out of that bag, what do you think it's going to be?
JOSHUA TENENBAUM: I mean, if you had to say. How confident are you? Raise your hand here if you're totally confident it's going to be blue, and here if you have no idea. Again, you're giving me sort of medium answers.
All right, what if you had drawn from these bags and you saw these things? So now, how confident are you that the next one here is going to be blue? Everybody's getting good exercise. Right.
So assuming that these bags in some sense are all the same kind of bag, then it does seem like the property of color is distributed homogeneously in these bags. So you could say that that's the same kind of inference. So let's see how we can capture this in a simple hierarchical Bayesian model and then talk a little bit about how this could be used to model the acquisition of a shape bias.
So here's the model. And in drawing it in this graphical format, what we're doing is capturing knowledge at different levels of abstraction, each captured with corresponding parameters. At the bottom level, we've got the data that's observed. And each of these branches here is one of the bags. At the bottom level, we've got the draws of marbles.
At one level up, what we're calling these thetas are parameters which describe the composition of each bag-- the relative distribution of colors inside the bag. And in asserting that you can generalize something abstract across these bags, we're saying, well, there's some kind of distribution on distributions up here, with parameters alpha and beta, that can be seen as generating the distributions inside each bag, which can be seen as generating the particular observations. And then maybe even some hyper-parameter on those. But the idea in this hierarchical Bayesian model is to observe data at this level and make inferences at these two levels simultaneously.
Now, the math behind this: technically we say, well, these are multinomial distributions, and this is a Dirichlet distribution. A multinomial distribution, again, is just a multi-dimensional generalization of a binomial distribution. In the same way that a binomial distribution might describe the counts of coin flips, this describes something like the counts of rolling a weighted die. So you can think of each of these thetas as the weights of a die, where one face of the die corresponds to each color. And if you want to say, what's the distribution of colors I would draw from a bag-- well, each draw is an independent roll of that die, and to characterize what the bag is like in general, I'm interested in the weights on each colored face. So that's a multinomial.
And then, here we're saying, well, what's a natural prior on multinomials? Well this is what's called a Dirichlet distribution. And if you're familiar with the idea of conjugate priors in Bayesian statistics, it's the conjugate prior for a multinomial. It has a very similar form, as we'll see, in the sense that you can think of the Dirichlet as specifying a prototypical multinomial. That's what this beta vector will capture. And then you can think of it also as capturing sort of a strength or variance-- like, how much each of these individual draws, multinomial draws from the Dirichlet looks like a prototype versus just sort of being [INAUDIBLE].
So just to put some pictures on it that are just schematic, each of the multinomials can be-- I'm just drawing it as kind of a histogram showing the weight on the different colors. So if you draw a bunch of red marbles from a bag, your guess here is that it's basically a red bag. So all the weight pretty much is on the red outcome. For this one, if you draw a bunch of yellow marbles, all the weight is on the yellow outcome. And if you make the inference from this one blue draw here that it's probably a blue bag-- although you maybe aren't quite as sure as in these cases-- then your guess is that the multinomial describing this new bag has most of its weight on the blue outcome, but maybe there's some possibility it could be others.
What licenses that strong inference from just one example is what you've learned up here, the shared prior that describes what's going on across these bags. The way that's captured here is we're saying, this beta basically characterizes the distribution of colors across the whole population. And the alpha characterizes how much we expect each individual bag to look like that.
So across the whole population, we see all the different colors in roughly equal proportions. Not as much blue, so that one's a little bit lower, but the others are pretty uniform. And having this alpha which is much less than one basically says, I think that each individual bag doesn't look very much like this. So that if, in an individual bag, I've seen a bunch of red things, it probably just looks like that.
Another way to think of alpha is to say, in estimating the bag proportions, how much do you weight the data, the kind of bottom-up signal, versus the shared prior? If alpha is much less than 1, you're basically saying, go with the data. If alpha is much greater than 1, you're saying, go with the prior-- you expect each individual bag to look a lot like this.
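The alpha trade-off he's describing can be sketched numerically. Here's a minimal Dirichlet-multinomial posterior predictive (the colors, alpha values, and prototype beta below are made-up illustrations, not numbers from the lecture): the probability that the next draw from a bag is color k is (n_k + alpha * beta_k) / (n + alpha).

```python
# Posterior predictive of a Dirichlet-multinomial bag: a minimal sketch.
# counts: observed draws per color from one bag; beta: learned prototype
# distribution over colors; alpha: how strongly bags resemble the prototype.
def predictive(counts, alpha, beta):
    n = sum(counts)
    return [(c + alpha * b) / (n + alpha) for c, b in zip(counts, beta)]

# colors: [red, yellow, blue, green]
beta = [0.25, 0.25, 0.25, 0.25]   # hypothetical prototype: colors uniform
one_blue = [0, 0, 1, 0]           # a single blue draw from a new bag

trust_data = predictive(one_blue, alpha=0.1, beta=beta)
trust_prior = predictive(one_blue, alpha=10.0, beta=beta)
print(round(trust_data[2], 2))    # ~0.93: alpha << 1, go with the data
print(round(trust_prior[2], 2))   # ~0.32: alpha >> 1, go with the prior
```

With alpha much less than 1, a single blue draw makes you nearly certain the bag is blue; with alpha much greater than 1, the shared prototype dominates and the one draw barely moves you.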
So in particular, consider a case like this. Here what I've done-- consider the difference between these. So the difference between these two is, if you look at the bottom, all I've done is take the same marbles and shuffle them around so that here, in the first one, they were nicely sorted color by bag. And now I've shuffled them around so we've got, the first bag has two red, two yellow, a green, and a brown. Here we've got three yellow, brown, red, green. Here we've got two brown, two green, yellow, red, and so on. Just shuffled the same ones around.
If you were to look at bags like this, well, now you'd say, well, I don't know, each bag-- first of all, all the bags look the same. Although empirically they're quite different, right? This one has two yellow, one brown, two red, and one green. This one has three yellow, a brown, and a green. This one has no yellow at all. This one has one yellow, two green, two brown, one red.
So in some ways, they're quite different. But actually, if you look at them, they all kind of look the same. They all look like bags of M&Ms or something. And somehow, what we get in a world like this is we think, oh, well, I guess the world as a whole has exactly the same distribution of colors-- red, yellow, brown, green, maybe a little blue. But now I get the sense that each individual bag looks a lot like the world as a whole. So beta's the same, but alpha's much higher. Alpha's much greater than 1.
And another way of putting that is to say that your best guess at what each individual bag is like should be highly skewed towards what the shared prior says, relative to each individual bag's data. There's a subtle little effect of the individual draws, but the prior is really doing most of the work.
And now in this case, if this is what you've learned, if you're doing joint inference at each of these levels, this is the part that you've learned that now carries over to a new case. Well, if this is what you've learned, then you see one blue. And you say, well, OK, I'm going to combine the data, the likelihood term here, with this learned prior. But since alpha is much greater than 1, the prior is going to dominate.
So most of what I think about this bag is driven by expecting it to look like this beta prototype. But there's maybe a little bit of an emphasis on blue. Maybe it's a little more blue than the rest of the population-- after all, I did see a blue here where I hadn't seen a blue in these ones.
So that's the basic difference between these two cases. Why is it that in this case, you kind of learned to do one shot learning, or you've learned to trust that one observation in that new bag very strongly? Whereas in this case, most of what you've learned is kind of ignore that one observation or to really down-weight it.
Well, it has to do with the fact that in this case you've learned that the population as a whole is not very reflective of what each individual bag looks like. But the statistics of that bag are very diagnostic. [INAUDIBLE] trust the data. Go with that one or a few examples. Whereas in this case, what you've learned is, again, basically ignore the data. This thing is just as variable as anything else in the population. So in some sense, here you've learned to not learn about this property, and here you've learned to learn.
So the [INAUDIBLE] that Kemp et al built for the shape bias, the simplest model, basically just follows this structure. But it says that there are multiple dimensions of objects: their shape, their material, their texture, and so on. And what you're learning is-- you're learning sort of parallel hierarchical Dirichlet-multinomials with structure like this, where it's the same structure of bags across different dimensions. Each bag reflects a nameable object category. So it's like the blickets, the lubs, the [INAUDIBLE], and so on. The draws are the objects that you've seen with that name.
But you have a different one of these models for each of the dimensions. And you learn essentially that dimensions of, say, material and color are like this, but shape is like this. So you've learned that shape is something which is distributed homogeneously within bags but heterogeneously across the whole population. Whereas you've learned that material properties are heterogeneous within each bag, or each object category, the same way they're heterogeneous in the object population as a whole. And thus it's really the shape one that should drive your generalization rather than the material properties.
So just to put a few sort of little numbers on that, here are the ideas. We've said, OK, to model this case, you have four bags corresponding to the four named categories. Each of those bags has two draws from it. And then we're going to measure a bunch of different dimensions which are all going to be sort of discrete categories. Just like the colors of marbles, there's a bunch of discrete shape categories. It turns out it doesn't really matter how many there are. There could be 20 or 100 or 1,000 and it kind of works out the same. It can even be an infinite number. It could be non-parametric. It'll come out about the same.
But here-- so there's a bunch of different possible shape categories, texture, color, and so on. And then here what you're seeing is, OK, basically there's a correlation where the two things that are in bag one have shape value one. The two things that are in bag two have shape value two. The two things in bag three have shape value three. But texture and color are sort of arbitrarily varying across bags.
And what you learn, again, from this is that shape is this thing where the heterogeneity of different shape values is not reflected in each individual bag. Each bag is much more homogeneous than the population as a whole, whereas for texture and color, you learn that their heterogeneity seems to be actually present in each individual bag, so that now, when you go to generalize to a new case-- so here now, this is the second order generalization, you now present one example of a new category, a new bag, five-- that's like the blue marble-- and it has a new shape you've never seen before, and also new texture, new color. So all of these are new values you've never seen before along each of these dimensions.
But the model is still able to figure out-- to do this second order transfer to figure out, OK, now if I now have some other things which have the same new shape as this object five but still other different values for texture and color-- so like, texture 10, color 10-- or I have this one that has, say, yet another new shape, shape six, but it matches this one on, say, the texture property, well, the model is learning to go with the shape as opposed to the texture match.
So that's what you're seeing here. It's preferentially saying, OK, this is a draw from a new bag. This one is likely to be from the same bag-- more likely than these two. And just like human children, the first- and second-order generalizations are almost equally strong. So the model is, again, generalizing the shape bias to novel shapes almost as strongly as to shapes that it has some extensive experience with, namely shapes one through four.
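The per-dimension learning described above can be sketched with the standard Dirichlet-multinomial marginal likelihood. The toy counts and the three candidate alpha values below are invented for illustration and are not the actual Kemp et al. setup; the point is just that a dimension that is homogeneous within bags favors a small alpha, and a dimension that is as variable within bags as across them favors a large alpha.

```python
import math

# Each "bag" is a named category; each row of counts says how often each of
# four discrete feature values appeared among that category's examples.
def log_marginal(bags, alpha, beta):
    """Log marginal likelihood of the bags under a Dirichlet-multinomial."""
    total = 0.0
    for counts in bags:
        n = sum(counts)
        total += math.lgamma(alpha) - math.lgamma(n + alpha)
        for c, b in zip(counts, beta):
            total += math.lgamma(c + alpha * b) - math.lgamma(alpha * b)
    return total

# Four categories, two examples each, over four discrete feature values.
shape = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 2, 0], [0, 0, 0, 2]]  # homogeneous within bags
color = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 1]]  # heterogeneous within bags
beta = [0.25, 0.25, 0.25, 0.25]

def best_alpha(bags):
    # pick the candidate alpha that best explains the observed bags
    return max([0.1, 1.0, 10.0], key=lambda a: log_marginal(bags, a, beta))

print(best_alpha(shape))  # 0.1: within-category homogeneity -> trust the data
print(best_alpha(color))  # 10.0: as variable as the population -> trust the prior
```

Having learned a small alpha for the shape dimension, one example of a new category is enough to generalize by shape; having learned a large alpha for color, a single color observation is largely ignored.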
This is the first version of something we called the blessing of abstraction. You'll see this in a few other examples later on. When you're learning to learn in one of these hierarchical Bayesian models, because the more abstract level-- in this case, the higher level here-- pools information across a bunch of different low-level categories, learning at the more abstract level can often be as strong or stronger than the learning at the more specific levels. That's an interesting feature of abstraction in human cognitive development-- somewhat different, maybe, from what you're used to in some deep learning systems-- again, you'll see more of that later on-- where you might learn layer by layer. You might think that in that kind of architecture, and it sort of seems to be true, you learn the more low-level, concrete features first, and only with a lot more experience do you learn more high-level things. Whereas in human learning, often you're learning the most abstract things pretty early on. Any questions about this model?
AUDIENCE: Can you explain the prior over the Dirichlet parameters?
JOSHUA TENENBAUM: I didn't say much about it. It's basically just a-- so we assume a symmetric Dirichlet. The beta part of the Dirichlet distribution can itself be seen as a multinomial parameter. It's exactly the same form as the thetas. It's one of these distributions that sums to 1 over the possible [INAUDIBLE] outcomes. So we assume a Dirichlet on that part of the Dirichlet, just a symmetric one. But it's very generic. It's basically [INAUDIBLE].
And for the alpha, we assumed an exponential distribution. It's just, again, a kind of natural, fairly uninformative prior-- a maximum entropy prior with just some mean value lambda. It doesn't really matter [INAUDIBLE].
Another interesting feature of hierarchical Bayesian models, right, is that as you go up in the model-- it's sort of the flip side of the [INAUDIBLE] section-- the choices you make tend to matter less. You're more detached from the data. The precise values of any parameter don't influence things that much. So if we had set lambda to some other value, it'd make a small difference, but it wouldn't make that much difference. Was there another question back there, or were you just trying to see?
But in principle, there's no reason you couldn't also do inference at that level. I mean, you could keep doing inference at multiple levels. And you don't have [INAUDIBLE]. It's just, it's going to matter less. There's going to be less constraint up there to [INAUDIBLE].
JOSHUA TENENBAUM: I mean, certainly if you assume-- so an exponential is one that just gives reasonably significant mass across a range of alphas; [INAUDIBLE] just convenient. But if you had a delta function on [INAUDIBLE]. So yeah, you could have had a very strong prior. And a typical thing people do, at least in Bayesian statistics and in a lot of hierarchical Bayesian models, is to make weaker priors as you go up. But you could have a very strong prior, and then that would very much commit you to something. But particularly if you're going to do as we did here, which is fix that lambda-- so not do inference on that-- then it's good to have a fairly broad prior.
So people have used these hierarchical Bayesian models to capture a lot of different kinds of learning to learn effects. Instead of talking about the different kinds of cognitive applications-- again, in the spirit of the summer school, I want to talk a little bit about how these ideas were used going back into a sort of more machine learning, computer vision context. This is some work that we did in our group with Ruslan Salakhutdinov. The names are sort of down there in small font, unfortunately. You might know Salakhutdinov as-- his official first name is Ruslan. He was a PhD student with Geoff Hinton, did some of the earlier Hinton deep learning stuff. He's mostly known for doing deep Boltzmann machines and various other kinds of restricted Boltzmann machines and [INAUDIBLE] type of things. And Russ is now a professor at Toronto in computer science.
But he spent a couple of years at MIT working with me and with Antonio Torralba and doing several projects. But a lot of what he was trying to do was to sort of combine or explore the interface between this more kind of hierarchical non-parametric Bayesian world view and the deep learning. This is a project that doesn't use any deep learning, so I've done a pretty good job of using up most of the time I have allotted. I have a couple of slides after this one on something that combines this deep learning which I probably won't go into. But if you're interested in that, I'll at least put it up there [INAUDIBLE].
But what Russ was really trying to do in these projects was to try to see, can we take these insights about how humans learn to do one shot learning and build better computer vision systems? Recognizing that the typical way that object detection systems are trained has a number of limitations, some of which, I think the deeper ones, which this work doesn't address, are the representations, right? There's nothing here about three-dimensional objects. All the things that I said you need if you want to do that bicycle rider problem or those graduation cap problems, none of these systems are going to solve that.
But they're going to try to get beyond the problem of having to get hundreds of thousands of training examples to tackle the one-shot learning problem, which again is one that a lot of people in this center have been interested in. Tommy, you guys probably didn't talk about one-shot learning, but you probably will next week in [INAUDIBLE]. So you'll see different takes on this problem of, in a sense, learning to do one-shot learning or getting the learning curve from the asymptotics of, as n goes to infinity, to the more interesting case of n goes to 1. This is a problem that a bunch of us are interested in.
And I think this is a nice project to talk about, because it's sort of really somewhere in between the kind of things that we've done on a more cognitive level. It's very human child inspired. But the actual representation it uses is very much of a kind of-- it's not learned, but it's a kind of hierarchical visual system inspired feature set.
So the features are what's called texture of textures. The precise details aren't that important. I think if you took, for example, the features in some of the models that Tommy's group has built, like [INAUDIBLE] type models, I think you'd have similar kinds of things that could be done with those.
These are the kind of features that were developed originally by Paul Viola and others to build the first practically used face detection systems. But they're very similar to things that people in Tommy's group had earlier done using support vector machines and so on for face detection.
But basically what these features are doing is they take a basic set of V1-type filters-- looking for edges and bars-- filter images with these [INAUDIBLE], and then run that recursively several times. So you're looking for edges of edges, or textures of textures, three levels deep. And that's done in parallel in the RGB channels. So when you take these simple edge detectors and [INAUDIBLE] and corner-type textures, and you nest them three levels deep in a non-linear hierarchy, you have about 15,000 of these features, and when you do that in three color channels, you get about 45,000 features. But again, this work was done a few years ago, and these days there's no reason to stop at three levels. You could do all sorts of multiple levels of pooling and linear and non-linear operations and so on.
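To make the recursive "textures of textures" idea concrete, here's a minimal sketch. The filter bank and depth here are hypothetical placeholders, not the actual Viola-style feature set: a few V1-like edge/bar filters, a rectifying non-linearity, and recursion so that level two responds to edges of edges, and so on.

```python
import numpy as np
from scipy.signal import convolve2d

# Toy V1-like filter bank (illustrative, not the real feature set).
FILTERS = [
    np.array([[1.0, -1.0]]),        # vertical edge
    np.array([[1.0], [-1.0]]),      # horizontal edge
    np.array([[-1.0, 2.0, -1.0]]),  # bar
]

def texture_of_textures(image, depth=3):
    """Recursively filter and rectify, returning pooled deepest-level responses."""
    maps = [image]
    for _ in range(depth):
        new_maps = []
        for m in maps:
            for f in FILTERS:
                # Filter, then rectify: the non-linearity between levels.
                new_maps.append(np.abs(convolve2d(m, f, mode="same")))
        maps = new_maps
    # Global average pooling gives one feature per deepest-level map.
    return np.array([m.mean() for m in maps])

# 3 filters nested 3 levels deep -> 3**3 = 27 features per channel; with a
# richer bank and 3 color channels you reach the tens of thousands described.
feats = texture_of_textures(np.random.rand(16, 16))
```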
The main thing is, you build up a high-dimensional, sort of neural-like feature space into which you can project any image and ask questions like this. So for example, let's say you didn't know anything about cows, but I showed you this one image of a cow, and I said, look through your big database of images and show me other things that are like this one. It's kind of like a computer vision version of word learning: if I say this is a cow, can you show me other things that you think are cows?
And what we'd like is for our system to return things like this. Most of these are cows-- well, maybe that one's a sheep. This, on the other hand, is not what we'd like it to do: it returns a sheep, a cow, a couple of chimneys, and an open field.
Now, the thing here is, if you just take this high dimensional feature space and find the nearest neighbors to this one image, then this is actually what you get. The nearest neighbor is a sheep, the next nearest neighbor is another cow, but then you've got these chimneys. What you'd like is to get something more like this.
And the system that Russ built learns to do this, because basically what it learns is to reshape the similarity metric in this high-dimensional space to better capture what defines the object categories of interest. That's similar in a sense, but in a much higher dimensional setting, to learning that shape is what counts for some kinds of categories, but that, say, material or some other properties count for others. So it's learning a similarity metric, but it's learning a category-specific metric.
The features that you want to generalize for cows are rather different from the ones that you want to generalize for, say, cars or trucks or knives and forks, as we'll see in a second. So it's not just that you want to take this 45,000-dimensional feature space and weight some of the dimensions differently. You want to learn that some dimensions should be weighted high for some categories, but other dimensions should be weighted high for other categories. Different categories should have not only different prototypes but different similarity metrics.
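Here's a tiny sketch of what a category-specific metric means, with illustrative numbers rather than anything learned by the actual model: each category carries its own diagonal covariance, so the same feature-space offset counts very differently depending on which category's metric you measure it with.

```python
import numpy as np

def mahalanobis_sq(offset, var):
    """Squared Mahalanobis distance for an offset under a diagonal covariance."""
    return float(np.sum(offset ** 2 / var))

# Hypothetical per-category metrics: elongated along different axes.
cow_var = np.array([4.0, 0.25])   # "animal" region: stretched along x
car_var = np.array([0.25, 4.0])   # "vehicle" region: stretched along y

offset = np.array([2.0, 0.0])     # the same 2-unit step along x
d_cow = mahalanobis_sq(offset, cow_var)   # 2**2 / 4    = 1.0  (unremarkable)
d_car = mahalanobis_sq(offset, car_var)   # 2**2 / 0.25 = 16.0 (very far)
```

The same Euclidean displacement is a small step under the cow metric but a huge one under the car metric, which is exactly the sense in which different categories have different similarity metrics.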
And the real challenge of one-shot learning is not just learning what the typical cow looks like from seeing one example. If different categories have different similarity metrics, how are you going to learn what counts as similar when you only have one example? Or to put it the way we do here: we're using a kind of high-dimensional Gaussian representation of each category.
So let's say you have one cow like this, and you have the sense that the similarity metric can be captured by some kind of covariance structure. How can you learn a covariance matrix from one example? I mean, it's hard enough to guess the mean of a high-dimensional Gaussian from one example. But it seems like you should have zero information about the covariance structure.
But the insight behind this is summed up in this slogan: similar categories have similar similarity metrics. Or you might say, categories whose means are close by are also likely to have similar covariance matrices. So it's a picture like this.
Let's say that you've seen a bunch of dogs, horses, and sheep, but only one cow. You might say, well, in this part of your high-dimensional feature space, it seems like the labeled categories tend to be elongated in the X direction but not the Y direction. Or if you like, being close in this vertical Y direction is very important for whether you're a dog versus a sheep-- it's very discriminating-- whereas the horizontal X dimension is not very discriminating.
As opposed to this other part of the space, where we've got our cars, vans, and trucks. Over here, it's different: your X position is very discriminating of whether you're a car, van, or truck, whereas your Y position is not so discriminating. So what you'd like to learn is that each of these categories can be described by some high-dimensional Gaussian-- of dimension two in this picture, but actually 45,000-dimensional.
But you want to learn that in this part of the space, the Gaussians are generated by one uber-Gaussian that's elongated in one direction, and over here, they're elongated in a different direction. That's kind of like learning that for some kinds of categories, say, the solid objects, it's shape that matters, whereas for other kinds of categories, it's the material [INAUDIBLE], the non-solid substances.
And the way that's captured in this model is by learning-- oh, here's just another example: given this cow, with the learned metric you get other cows and sheep, whereas here you get some sheep and some park benches. So this is not what you want-- that's what you get in the original space-- and this is what you get from the learned metric.
So the way that's captured is again, this is a model which combines a few of the elements that Sam was talking about. It's learning a hierarchical model on object categories where the basic-level, or most immediately nameable, categories like cow, horse, sheep, truck, car, and so on are these high-dimensional Gaussians, which means learning means and variances for each of the 45,000 dimensions at this level.
But it's also learning priors on those things. It uses a normal inverse gamma parameterization of the prior on these Gaussians. The details aren't really that important, but it's analogous to the way the Dirichlet prior on a multinomial looks a lot like the multinomial but with a strength parameter. It's the same idea but for Gaussians. So you can think of it as saying, as in this picture here, the prior on each individual low-level Gaussian is itself a kind of higher, broader Gaussian.
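For one feature dimension, the standard normal inverse gamma update looks like this. The particular prior values below are made up for illustration; the point is that the strength parameters kappa0 and alpha0 control how much the superclass prior dominates a single observation.

```python
import numpy as np

def nig_update(mu0, kappa0, alpha0, beta0, x):
    """Posterior NIG parameters after observing data x (one feature dimension).

    mu0: prior mean; kappa0: strength of the mean prior;
    alpha0, beta0: inverse-gamma shape/scale for the variance prior.
    """
    x = np.asarray(x, dtype=float)
    n, xbar = len(x), x.mean()
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n          # shrunk toward mu0
    alpha_n = alpha0 + n / 2.0
    ss = float(((x - xbar) ** 2).sum())                  # within-sample scatter
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

# One "cow" example: the posterior mean shrinks toward the superclass mean,
# and the variance estimate beta_n / (alpha_n - 1) stays near the prior's --
# that's how you can "learn a covariance" from a single example.
mu_n, kappa_n, alpha_n, beta_n = nig_update(mu0=0.0, kappa0=2.0,
                                            alpha0=3.0, beta0=4.0, x=[3.0])
```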
And if you get one example of a cow and it's kind of close to this very general Gaussian over here, then that gives you information that the covariance of cow is like the covariance of the more specific categories that are also generated by that same node. So that's basically like, you see your one cow, and you infer, well, I don't know what this cow is, but it seems to be a kind of an animal. So you're inferring that cows should be attached to this part of the hierarchy, as opposed to, say, seeing this thing and saying, I don't know what that is, but it seems to be a kind of a vehicle.
So the key part of the inference needed to do this one-shot learning is actually forming this hierarchy. The hierarchy is itself learned. Nothing is labeled animal or vehicle; that's learned by taking labeled examples of cow, horse, sheep, truck, or car, and then assembling them into this tree, which groups the classes into superclasses according to which classes seem to be generated from the same prior. And you're able to build out this tree even from one new example.
I think Sam talked about this. You talked about the nested CRP, is that right? So it's using that nested Chinese restaurant process, where the generative model-- the prior on trees-- has a Chinese restaurant process at each level. So this tree could be infinitely broad at any level.
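A single level of that Chinese restaurant process can be sketched in a few lines (a minimal sampler, not the full nested model): each new category joins an existing branch with probability proportional to that branch's size, or opens a new branch with probability proportional to a concentration parameter alpha.

```python
import random

def crp_assignments(n, alpha, rng):
    """Sample CRP table assignments for n customers (categories)."""
    counts = []                     # customers per table (categories per branch)
    assignments = []
    for _ in range(n):
        weights = counts + [alpha]  # existing tables, then a brand-new one
        r = rng.random() * sum(weights)
        for table, w in enumerate(weights):
            r -= w
            if r < 0:
                break
        if table == len(counts):
            counts.append(1)        # open a new branch
        else:
            counts[table] += 1      # rich-get-richer: join a popular branch
        assignments.append(table)
    return assignments, counts

assignments, counts = crp_assignments(25, alpha=1.0, rng=random.Random(0))
```

The rich-get-richer dynamics are what concentrate most categories into a relatively small number of branches at each level.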
But the way the CRP works is it wants to concentrate, if possible, the categories into a relatively small number of branches, each branching off a relatively small number of branches at the level above. So just to show how this applies: in this paper, we applied this to a data set of 25 or so categories from a Microsoft Research data set. This was a nice one to use because the particular named categories they had seemed to intuitively cluster into some natural superclasses. And you can see, this is the recovered hierarchy of the model [INAUDIBLE].
So for example, it has a branch that puts forks, knives, and spoons together. It doesn't know anything about forks, knives, and spoons, other than that those are three arbitrary labels for three categories, but they seem to have the same kinds of features in common. It puts cows and sheep together. It puts trees, birds, flowers, and leaves together. And that's a little funny-- trees, flowers, and leaves maybe should go together.
And you might say, well, why should we put birds in there? But in the grand scheme of things, given that it doesn't know anything about living things, it's not so crazy that there are features shared in common. It puts buildings, chimneys, doors, office scenes, and windows together; books; airplanes; benches, bicycles, cars, and different views of cars together.
Again, it might be a little weird to put benches and chairs in with bicycles and cars, but maybe not so weird. As far as visual high-level similarity, it makes sense. It puts clouds over here in their own category. Again, it's not saying that necessarily benches are particularly similar to airplanes or that trees, flowers, and leaves are particularly similar. But it's saying the ways in which trees are similar to each other are similar to the ways in which leaves are similar to each other. The ways in which forks are similar to each other are similar to the ways in which knives are similar. Yeah?
AUDIENCE: So [INAUDIBLE] an airplane could be described as a machine but also as a flying thing [INAUDIBLE]. And Cuba could be described as an island or as a country. So is your model restricted to hierarchies?
JOSHUA TENENBAUM: Well, this model is. But remember the thing that Sam talked about, which was the CrossCat model. That was designed to tackle that sort of problem, to recognize that there might be multiple ways of categorizing the same domain of objects. And you could do that. You could imagine making a cross-nested CRP thing. It's one of the nice things about this toolkit: it's very compositional. Sam gave various other examples of taking this set of building blocks and combining them. So that would be an interesting extension that you could do.
Just the last basic result to see here is that this is an ROC curve for classification. How many people are familiar with ROC curves? How many people are not familiar?
So most people are. If you're not, it's basically a way of measuring the performance of a classifier with a graded output signal, where, depending on how you set the threshold, you trade off the two different kinds of errors: false alarms, detecting something when it's not there, and misses, failing to detect it when it is there. And ideal, perfect performance would be an ROC curve that's fully up here in this upper left-hand corner.
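For concreteness, here's how an ROC curve is computed from a classifier's graded scores (a minimal sketch with toy data): sweep the threshold from high to low and record the hit rate against the false-alarm rate at each setting.

```python
def roc_points(scores, labels):
    """Return (false_alarm_rate, hit_rate) pairs, one per threshold setting."""
    pairs = sorted(zip(scores, labels), reverse=True)  # highest score first
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points, hits, fas = [(0.0, 0.0)], 0, 0
    for score, label in pairs:
        if label:
            hits += 1   # correctly detected positive
        else:
            fas += 1    # false alarm
        points.append((fas / n_neg, hits / n_pos))
    return points

# A classifier that ranks both positives above both negatives traces the
# ideal path: straight up the left edge to (0, 1), then across to (1, 1).
pts = roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```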
And here what we're doing is we're comparing the efficiencies of three different classifiers, none of which are all that great here, because this wasn't designed to optimize performance [INAUDIBLE] object detection. We're using these 45,000 feature dimensions. They're OK. They're not necessarily the best you can do. And in later work by Russ and others, he applied the same kind of idea to fancier kinds of representations.
But anyway, the main point here is to study the learning-to-learn effect. So what you've got here is: the black curve is the best possible performance you could get with an infinite number of examples using this representation, where you have this 45,000-dimensional feature space and you represent each category just as a multi-dimensional Gaussian. That's itself pretty restrictive. And this is the best you can do with all of the examples that you have.
The red curve is what you get when you just do Euclidean distance in the original feature space. So no learning to learn-- you're just doing nearest neighbor similarity. And chance performance on an ROC curve is just this diagonal line. You can see that Euclidean distance is barely above chance. It is above chance-- when I showed you those examples of Euclidean nearest neighbors, there were more correct matches than you'd get by chance-- but barely.
The blue curve is what you get when you've learned the metric. So basically, we're doing held-out cross-validation, where we hold out a category, and we learn this hierarchy of categories and the relevant inductive biases for all but that category. And then we see how well we can do with just one example of the new category. And what you can see is, with just one example, once you've done this learning to learn in the hierarchy, you can do almost as well as the ideal-- the best you can do with this representation.
And it's just a very simple but nice illustration of how you can learn to do one-shot learning: by learning these inductive biases in this hierarchical, non-parametric model, you can shift your learning curve such that almost all of the work for a new class is done by the first example. Question?
AUDIENCE: This is more for discussion, so I could wait.
JOSHUA TENENBAUM: Yeah. I was just going to show one other thing you can do with this: now you can also hold out some of these categories, and you can sort of learn to do unsupervised learning. So suppose we hadn't seen any chimneys, trees, or clouds in our training set, and now we see a bunch of new things, which include examples of things we've seen before, like cars and bicycles and forks and benches, but also a few trees, chimneys, and clouds.
This is like going back to the problem that Sam started with-- when you see a rhinoceros, is that some kind of weird unicorn, or is that some new species? This fundamental problem of trying to figure out, OK, is this a weird example of a thing I've seen before, or an example of a fundamentally new thing that I haven't seen, even if it hasn't been labeled.
Well, this system is able to do that. It's able to decide that each of these things is an example of a category it already knows, whereas for these ones here, it's able to figure out, hey, these three are-- first of all, that they're all in the same category, and that it's of the same kind as the birds, flowers, leaves, and so on. And it's able to say, OK, these are all in the same category-- I don't know what they're called, but they're of the same kind as the buildings, doors, windows, and office and urban scenes. And these things are all of the same kind, and they're not of the same kind as anything else. So it's able to figure out that clouds are a whole new branch at the highest level, which is kind of cool, because it's true-- they're not really like any other object in this database. Did you want to raise your discussion?
AUDIENCE: It occurred to me that it seems related in some way to these ideas that originated with Roger Shepard on the second-order isomorphism. There's the ubiquitous finding in psychophysics that people are much better at comparing things than making absolute judgments. And one idea-- this is Shepard's idea-- is that when we represent the similarities between two things, like cows and dogs, if you ask someone to judge similarity, you're not necessarily judging the similarity between prototypes of those groups, in the context of this model, but rather between similarity metrics, in some sense. You're comparing the similarities between similarities. That's the second-order isomorphism.
I'm curious what Tommy thinks, because I believe this is related to stuff that Shimon Edelman did with Tommy a few decades ago. There was this argument that the brain doesn't really have a veridical representation of the world but represents these similarities between similarities.
JOSHUA TENENBAUM: I think it's really related to stuff that you're probably going to talk about next week. The way to achieve invariance by basically getting examples of several other invariant categories and then sort of projecting other-- you're going to talk about that next week, right?
So why don't we come back to that question. But I think it'd be interesting when you talk about that, or we can all talk about this together, yeah, I think Tommy's recently developed some approaches which are sort of Edelman-like [INAUDIBLE] that course or prototypes or something. It's like some of the stuff that Shimon Edelman did, but now it's trying to achieve invariance in a sort of neurally plausible architecture, that I think they have a whole rich mathematical theory for, which is different from this one. But it's true, it's very interesting to ask how they might be related.
AUDIENCE: But [INAUDIBLE] capturing the idea of similarities between similarities.
JOSHUA TENENBAUM: Right. Because part of what you're doing here is you're learning about variance and invariance. You have to learn that, OK, even though I've never seen these things before, that it seems that there's some dimension that characterizes the abstract similarity of this category, such that I can figure out, hey, these should go together, and these should go together and these should go together, even though I have, in this case, no label for them at all. So yeah, anyway, it's something to come back to. Other questions about this?
I'm going to skip through-- there were a few other slides on this so-called hierarchical deep model, which basically learns a similar hierarchical non-parametric Bayesian model of the classes and superclasses, but it also integrates that with learning features at the lower level. And if you're interested, you can check out these papers by [INAUDIBLE] on that. And just to-- let me just see where we are in terms of time.
AUDIENCE: This was going [INAUDIBLE] 4:00.
JOSHUA TENENBAUM: Yeah, well, I guess it is our plan to continue at 4:00, but I'll just talk for maybe 10 more minutes, and then we'll take a 15-minute break. Is that reasonable?
So again, just to echo some of the themes that Sam raised, and to set up where we're going for the last part: it's great to say we want to capture learning to learn with hierarchical Bayesian models. But if we want to learn the most interesting kinds of abstract knowledge, then at the high levels of the hierarchy, we need things that I think go beyond the statistician's tool kit-- not just Dirichlets on top of Dirichlets and exponentials on top of those, but rather things that capture what really looks like more structural knowledge.
So Sam mentioned this work that Charles Kemp and I did. Just to put a little bit more of the cognitive motivation behind it: a lot of the most interesting discoveries of knowledge are ones that refer to the form of a model, right? Like, in biological thought, people went from thinking about animals as organized according to something like the great chain of being, in medieval Europe, to something more like the hierarchical tree structure of Linnaeus, or Darwin's more randomly branched evolutionary tree. Well before anybody understood the mechanisms of natural selection, or whatever evolutionary process might give rise to the tree structure of species, people were able-- just by looking around at the natural world and being somewhat scientific about it-- to decide that there was something like multi-level hierarchical structure.
Or how did Mendeleyev figure out the periodic structure of the elements? He didn't understand quantum mechanics, the reason why the periodic table has the form it does. Just by looking at the observable properties and thinking creatively about the right topology of similarity, he was able to come up with an initial version of the periodic table.
And children are making these kinds of structural inferences too. So this idea that category labels should be organized into a tree-- well, just like the shape bias, that's also something which kids don't initially come in thinking about. Again, they might be biased towards a tree as opposed to some other possible forms, for example, but initially when kids are learning labels, they have what's called a mutual exclusivity bias. They think each thing is one kind of thing: if you've learned one name for something, that's it. As opposed to the phenomenon of hierarchical classification, where something could be a dog, it could be a mammal, it could be an animal, it could be a living thing, just like we had in these models before.
So kids have to kind of learn that category labels can be hierarchically organized like that. But our standard structure learning algorithms, even the nice ones in the standard hierarchical non-parametric tool kit or any similar things in earlier eras of data analysis, are all assuming the form of the structure. So you can go in with a clustering algorithm or a mixture model-- it can be a non-parametric mixture model, and we can be learning how many clusters or mixture components there are-- but we've assumed we're looking for clusters. Or if we go in with a tree structure, a nested CRP, we might be able to learn the number of levels or the branching factor at each level, but we're fixed to looking at trees.
So Charles' approach here was to say, well, let's define these grammars, which are simple rules for growing out graphs. The idea is that graphs are a universal language for many different kinds of structure, and these graph grammars are languages to capture the more abstract forms of structure-- like the difference between a set of flat clusters and a chain, or a chain and a ring, or a hierarchy, or other kinds of things. And by taking cross-products of these rules, you're able to get more interesting kinds of multi-dimensional structures-- like describing a two-dimensional space as the cross-product of a chain and a chain, or a cylinder as the cross-product of a chain and a ring.
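A toy sketch of those form-generating rules, in the spirit of the structural-forms idea but not the paper's actual formalism: each form is a recipe for where new nodes attach, and cross-products of forms combine them, so chain x chain gives a grid and chain x ring gives a cylinder.

```python
def grow_chain(n):
    """Chain form: each new node attaches to the previous one."""
    return [(i, i + 1) for i in range(n - 1)]

def grow_ring(n):
    """Ring form: a chain whose two ends are joined."""
    return grow_chain(n) + [(n - 1, 0)]

def product(edges_a, n_a, edges_b, n_b):
    """Cross-product of two forms; nodes become pairs (i, k)."""
    edges = []
    for (i, j) in edges_a:                 # copy A-edges within each B-slice
        for k in range(n_b):
            edges.append(((i, k), (j, k)))
    for (i, j) in edges_b:                 # copy B-edges within each A-slice
        for k in range(n_a):
            edges.append(((k, i), (k, j)))
    return edges

grid = product(grow_chain(3), 3, grow_chain(4), 4)      # chain x chain: 3x4 grid
cylinder = product(grow_chain(3), 3, grow_ring(4), 4)   # chain x ring: cylinder
```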
And then what it means to do hierarchical Bayesian inference here-- and this is just putting a little bit more of the hierarchical picture behind what Sam was talking about, with most of the mathematical details still hidden away-- is to say, OK, let's say we've got some data here, which is like one of these data matrices, a set of observations-- remember the data that you showed with the animals and their properties, like the alligators, the ants, and so on, with whatever their properties were, [INAUDIBLE] and hunting. Any data matrix of objects and attributes.
And what we're trying to do here-- the rows of this matrix could be the images I showed you before, and the columns could be those 45,000-dimensional features, or they could be units in a deep network, whatever you like-- the idea is to say, well, let's see if we can take this data set very generally and make inferences at these two different levels: the level of the structure, which might be, say, a particular tree-structured hierarchy, or some low-dimensional space, whatever, and then also at this level, which is the rules which grow out the structure, which generate it. So this is a rule which generates only hierarchically tree-structured graphs with the objects at the leaf nodes. And here's a cross-product of two rules, each for growing a chain, which, when you put them together in that [INAUDIBLE], gives you a rule that grows out two-dimensional grids.
And the idea is that we can define a probabilistic model: first a prior on these rules, and then each of these rules can be seen as defining a distribution on graphs-- again, restricting our hypothesis space but also biasing us towards simpler graphs of the appropriate form. And then each of these graphs puts a distribution on these objects here, which, to keep up this combinatorial tool kit theme, is actually a Gaussian process.
So the way you link this level and this level is to say, basically, you assume that the features of objects are generated by a distribution defined over these graph structures. The intuition is that objects which are closer in the graph should have similar, correlated feature values. And the way you get that is by turning the graph into the kernel of a Gaussian process, so that it's capturing this notion of smoothness based on closeness in the graph. More technically, it's using the graph Laplacian to provide the inverse of the Gaussian process covariance function. But the intuition is just that: closeness in the graph captures correlation in feature values.
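A minimal sketch of that link (illustrative regularization value, and only the intuition-level version of the construction): build the graph Laplacian, add a small diagonal term so it's invertible, and invert it to get the covariance, so nodes that are close in the graph end up strongly correlated.

```python
import numpy as np

def graph_covariance(edges, n, sigma2=0.1):
    """GP covariance whose precision is the graph Laplacian plus sigma2*I."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, i] += 1.0   # degree terms on the diagonal
        L[j, j] += 1.0
        L[i, j] -= 1.0   # minus adjacency off the diagonal
        L[j, i] -= 1.0
    return np.linalg.inv(L + sigma2 * np.eye(n))  # covariance = (L + s*I)^-1

# On a 4-node chain, adjacent nodes covary more than distant ones:
K = graph_covariance([(0, 1), (1, 2), (2, 3)], n=4)
```

The resulting kernel decays with graph distance, which is exactly the smoothness-on-the-graph assumption the model needs.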
So in that sense, it's a really cool way to take, on the one hand, what's now become a pretty standard, powerful Bayesian machine learning tool kit of something like graph-based covariances for Gaussian processes, but then to add this extra level of what looks like more traditional AI symbolic knowledge representation-- grammars and graphs-- that is able to capture the knowledge at this high level, and to do joint inference at these multiple levels to figure out, as Sam was saying, that for a data set of animals, something like a tree structure is the right way to capture what's going on, whereas for some other domains-- like here we took the voting data, the same kind of data matrix, but now the rows, instead of being animals, are Supreme Court judges, and the columns, instead of being biological properties, are how they voted on different cases.
And here, you could logically represent this as a tree, but in fact what comes out as the best structure is this one-dimensional chain, a kind of left-right spectrum, where again we can recognize that the judges over here on the left are the liberal ones, like Marshall and Blackmun and so on, and here you've got the more conservative ones, like Scalia and Thomas and so on. And what the system is figuring out is not only that this liberal-conservative underlying factor is the best simple model of how these judges vote, but also that that kind of one-dimensional model is the best abstract form of structure to make sense of this data.
And again, Sam already went through these examples. Like here, this is an example of taking faces, recovering a sort of two dimensional structure of a white-black racial dimension, and there's sort of a masculinity dimension here. It's a little harder to see the masculinity dimension, but you can think of it as, like, these are the sort of football players and these are the tennis players or something, if you look at it like that.
Anyway, again, we generated these images by turning these knobs on a face synthesis program, and then we just gave this algorithm those high dimensional data vectors as [INAUDIBLE] to the pixel images. And it's able to figure out that there's that two dimensional grid structure and to recover what the dimensions are. Or here it's recovering that latitude and longitude of cities.
I'll just make one more point about this model, because it links back to some of the other hierarchical Bayesian themes that we've seen-- which is, remember this phrase, the blessing of abstraction, this idea that knowledge at a higher level of abstraction can sometimes be easier to learn than you might think, maybe in some ways learnable before much of the knowledge at lower levels of abstraction.
So a version of that here is to say, look at a kind of a developmental learning curve where we take this animal data set, which has about 100 features in it, and we present progressive amounts of the data. So to start off, we just present five randomly selected features of animals. It's about 5% of the data, and then 20% and then 100%.
And what you see is, when you give this model only a small fraction of the things you could observe about animals, it learns a much simpler representation. It learns this flat clustering, basically-- it doesn't learn a tree structure. In a sense, it follows what in kids you would call mutual exclusivity: it puts everything in exactly one category. So you've got a category of insects and birds; it's put fish together with amphibians; and here you've got a category of carnivorous mammals and one of non-carnivorous mammals. Those are sensible, high-level categories.
Then you give it a little bit more data. You give it 20% of the data. And now it's made a kind of a phase transition, or sort of a paradigm shift. Now it's got the idea of a tree. Suddenly it's linked all these things together.
It doesn't have the right tree. And if you look at it, you can see it says a lot of funny things compared to the adult tree. So it puts penguins in with dolphins, seals, and whales, whereas when you've given it all 100% of the data, penguins are over here by the birds, and the whales, dolphins, and seals are over in their own sub-tree.
What else does it do? It hasn't made a bunch of differentiations. It hasn't differentiated the flying from the flightless insects, or the flying from the flightless birds, and some other things like that. So it's less well articulated and doesn't have quite the right structure. In a sense, it takes a lot more data to work out all the details of the tree and to get the smaller-scale branches right. But you get the big picture first, like Darwin and even Linnaeus-- they got the idea of the tree before they understood the details and why it is the way it is. And that idea of getting the big picture first, in some kind of aha-like insight, is again a distinctively human kind of discovery phenomenon that we'd like our models to be able to capture.
So just to kind of summarize that up, then, since we should stop and have time for the coffee break and the tutorial, I was going to show a parallel example of this in sort of a causal learning setting. But I won't go into it. If you're interested, this would be kind of cool. We can talk about it later. Another example of sort of learning high-level causal knowledge with this blessing of abstraction.
But let's see. I guess I was optimistic about how much [INAUDIBLE]. Let me just skip through this. We'll do this Church stuff later and a lot of this development stuff later. But I just want to make a comment about some of the things that Sam mentioned, because the last thing I was going to show is, again, some of the things he said at the end-- this work by Roger Grosse and various people, including me but also Ruslan Salakhutdinov and Bill Freeman, who were big parts of this, where we were taking this idea of high-level, symbolic descriptions of knowledge in a hierarchical Bayesian framework and trying to see what you could do with it, again, in the machine learning setting.
And I share Sam's appreciation for this. Though I'm an author on these papers, I have played a very distant role, particularly on the really cool stuff that James Lloyd and David Duvenaud did: that thing at the end that Sam showed, the algorithm that writes its own papers.
But I just want to say that, again, part of the point of this was to take-- I don't think that any of this model or the other stuff with the Gaussian processes and the grammar of those Gaussian process kernels, I don't think those are particularly good models of anything that goes on in the human brain that humans do. But I think they are an example of trying to automate some aspects of scientific data analysis using some insights from what we've learned from computational cognitive science, that we can capture high-level discoveries that humans might make from data with this kind of toolkit of hierarchical Bayesian models with more structured symbolic grammar-like representations of the high-level knowledge.
I guess the one thing I would want to say a little bit differently from what Sam did is this. I appreciate the idea of replacing scientists with algorithms, or however you put it, and we do talk about this stuff in terms of automating statistics. But I think it's really important to say that it's automating some aspects of statistical data analysis, like the choice of the model form. Here are two other data sets. Sam showed an example of solar flare data; these are examples of the same Gaussian process covariance approach applied to airline traffic volume and atmospheric CO2 levels. And it's really pretty cool that you can give this system, in this case, a time series of how airline traffic volume has increased over the years. Not only that, it has this seasonal variation, and the seasonal variation itself increased over the years. So this is monthly data for airline traffic volume, a series of months over the '50s and '60s.
And you can actually get this thing to extrapolate this function, both continuing the general upward trend and recognizing this complex kind of periodicity, where the periodicity itself is expanding. That's all pretty cool. And you do this by, again, this grammar for composing the kernels of Gaussian processes to give a very rich set of [INAUDIBLE] functions. And, as Sam said, it can even write up some reports.
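The kernel-composition idea behind that demo can be sketched in a few lines. This is a minimal illustration, not the actual Automatic Statistician code: all function names, parameter values, and the toy data below are assumptions chosen to mirror the airline example, where a linear trend is added to a linear-times-periodic term so the seasonal amplitude itself grows over time.

```python
import numpy as np

# Base kernels for the grammar. Combining kernels by + and * (elementwise
# on the covariance matrices) yields valid composite kernels.

def linear(x1, x2, c=1950.0):
    """Linear kernel: captures an overall trend (zero at x = c)."""
    return np.outer(x1 - c, x2 - c)

def periodic(x1, x2, period=1.0, length=0.5):
    """Periodic kernel: captures seasonal repetition."""
    d = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length ** 2)

def airline_kernel(x1, x2):
    """Composition: linear + linear * periodic.

    The product term makes the seasonal component's amplitude grow
    over time, like the airline traffic data described above.
    """
    return linear(x1, x2) + linear(x1, x2) * periodic(x1, x2)

def gp_posterior_mean(x_train, y_train, x_test, kernel, noise=1e-2):
    """Standard GP regression posterior mean under a given kernel."""
    K = kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_star = kernel(x_test, x_train)
    return K_star @ np.linalg.solve(K, y_train)

# Toy airline-like series: upward trend plus a growing yearly oscillation.
x = np.linspace(1950.0, 1960.0, 120)
y = 10.0 * (x - 1950.0) + (x - 1950.0) * np.sin(2.0 * np.pi * x)

# Extrapolate one year past the training window.
mean = gp_posterior_mean(x, y, np.array([1961.0]), airline_kernel)
```

Because the composed kernel encodes both the trend and the growing periodicity, the posterior mean continues both structures beyond the data; a single smooth kernel in its place would revert toward the prior instead.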
So this is a few pages of one of these reports. And the whole thing is automatically generated. But it doesn't replace a scientist, for a bunch of reasons. One is that the way the report is generated isn't any sophisticated natural language processing. I wouldn't say it's necessarily badly written, but it's written from a very stock template. Then again, a lot of scientific papers are also written from a stock template, so maybe that is kind of like automating science.
But again, I want to emphasize how much more there is to science than just data analysis. Pretty much all of us are working scientists, right? We all do data analysis as part of our jobs. But there's much, much more to science that goes beyond not just fitting the parameters of a model, or even choosing the form of the model, but thinking about what questions to ask, what questions are even interesting, what phenomena are worth exploring. How would you design a good experiment? What would even count as relevant data for some question that interests you? How do you integrate what you've learned in one experiment with all the rest of what you know? None of these algorithms are trying to get at that.
But it's important for the thrust of this research program, because one of the things we'll see next week, when we get Laura Schulz in here and talk about children's learning, is that there's a long tradition of studying children's learning as a kind of intuitive science. Many things that children do when they learn about the world look like scientific activities. But that goes way beyond data analysis.
And in a sense, the tool kit that we've been able to turn into algorithms so far covers more and more of the parts of science that are data analysis. But the really interesting challenge, one that people are just starting to make progress on but that is still mostly open, is how you capture the aspects of learning that are like science but go beyond data analysis. That's still the really interesting open question. I'll leave that till next week.
So thanks for your attention here. Let's take a 10-minute break, say, and come back at 4:00. Then we'll start learning about probabilistic programs, which are a powerful unifying language for many of these kinds of models. [? Summer ?] will present on that. And we'll also do some interactive tutorial work, so make sure you have your laptop. If you haven't installed [INAUDIBLE], try to do that. If not--