Sam Gershman: Structure Learning, Clusters, Features, and Functions
May 31, 2014
May 31, 2014
All Captioned Videos Brains, Minds and Machines Summer Course 2014
Topics: Basic introduction to parameter learning, structure learning, nonparametric Bayes, mixture models, conditioning as clustering, learning relational concepts, multi-level category learning, latent feature models, function learning, Gaussian processes, and human function learning
SAM GERSHMAN: OK. So for those who don't know me, I'm Sam Gershman. I'm at MIT, Brain and Cognitive Sciences. And I'm going to, basically, pick up where Josh left off, and talk about structure learning, and many forms that structure learning can take.
So let's start with a story. In the 13th century, Marco Polo went to Java, and he saw a rhinoceros. But Marco Polo didn't know what a rhinoceros was. There's no such concept in his conceptual vocabulary. So he thought it was the next-- the closest thing that he could think of, which was a unicorn.
So and he thought, well maybe-- this doesn't look exactly like what I expected unicorns to look like, staring at medieval tapestries, or whatever. But maybe this is just some kind of funny, funny-looking unicorn. So this highlights a key distinction in what we're going to talk about for the rest of the morning, which is the difference between parameter learning and structure learning.
So in parameter learning, I might imagine some kind of unicorn space, and I'm trying to find the parameter setting that can best describe a rhino. But a fundamentally different problem, structure learning, is where-- I'm basically trying to figure out what kind of animals exist at all. Right. So what are the spaces, the hypothesis spaces, in which I'm going to try to place this rhinoceros.
Fundamentally, this is the central problem with structure learning. What's out there? What exists in the world? What is an object? What different kinds of natural [INAUDIBLE]?
And in practice, this takes on many different forms. So we may look at data like this, and ask how many clusters are there. What are those clusters? How many features do I need to describe an object like a face? What sort of structural form do I need to describe a particular domain? Is a domain that's described by a tree or some kind of square lattice, or a linear structure or a circular structure. And I think Josh is going to talk more about this later, and I'll touch on it again, at the end of this talk.
And what sort of functional form, if I'm doing function learning, what's the best functional form for modeling some sort of data. I'll come back to that later.
So the big picture here, is that, the way that Josh set up Bayes' report, we can-- we're given some hypothesis space. We can use Bayes' rule to infer the posterior description of our hypotheses from the data.
But where do hypotheses come from, in the first place? A key idea here is that we can actually apply the same principles, the same Bayesian principle, to the discovery and inference over hypothesis spaces themselves. And this is where nonparametric Bayes comes in. The motivating issue is that we need priors on hypothesis spaces that are rich enough that they can capture the full complexity of the data. But we also want to prefer simpler hypotheses, to avoid overfitting. This is essentially an ill posed problem.
So that's what nonparametric Bayes' does. It gives you a prior over, essentially, an infinitely large hypothesis space, and I'll talk more about what exactly that means. But the prior favor simpler explanations, simpler hypotheses.
So what is nonparametric about nonparametric Bayes? Nonparametric is really a confusing term because, of course, nonparametric Bayesian models do have parameters. The best way to understand this, I think, by analogy with non-classical, nonparametric statistics, like Kernel density estimation.
So let's take an example of parametric model, like a mixture of Gaussian . So here I'm showing a bunch of data points arranged on a line, and you see that there are a few data points that cluster over here, a few data points that cluster over here. So I could look at this and say OK, I want to use a mixture of Gaussians with two different mixture components. And I can, basically, place one Gaussian over here, place another Gaussian over here, and then the density estimate, the probability density over-- over data on this line, looks like this black line, some weighted combination of these two Gaussians.
Nonparametric kernel density estimator doesn't assume a fixed number of mixture components, what it does is it takes some kernel, some fixed kernel, and it places it over every single data point. And then takes an average of these kernels to construct a kernel density estimator. There's a huge literature on this, talking about how do you select the bandwidth of these kernels, and what sort of kernels you want, and so on.
Nonparametric data takes a kind of, intermediate route. What it says is that, we're not going to put, necessarily, kernels on every single one of these data points, but we're also not going to say there's only two that we have to-- and we have to learn how, learn where over the space to place the kernels. We're going to say that there's some unknown number of clusters. And we're going to, basically, learn how many there are, and where to place them, what their parameters are in the data. So this is what nonparametric Bayesian kernel density estimator looks like. You're saying [INAUDIBLE] is called process mixture model. I'll come back with that, in a few minutes.
OK. So over the course of this talk, and I'm not, probably, going to get through all this before lunchtime, but I think well I'll pick up some of it after lunch, and see how far we can get. But I want to-- the theme here is that there are a few fundamental building blocks of structured learning, and in particular within the-- I'm going to talk about it within the framework of nonparametric Bayes'.
So the three basic building blocks are a mixture models, that allow you to do clustering. And here we're going to focus on stochastic process known as the Chinese restaurant process. You'll notice that a few of these things have silly, culinary metaphors, and I'll explain what those mean.
Latent feature models like factor analysis, lead to a nonparametric distribution called the Indian buffet process. And then I'll talk at the end about function learning, which allows you to do things like regression. And there the key nonparametric distribution is a Gaussian process.
And I emphasize that these are building blocks, so each one of these is going to be able to capture some little-- some well-defined set of structures. But it may be the case that-- there's two issues. One is that we may not know in theory which kind of structure is appropriate for our data, and we want to discover that automatically. So I'll talk about some automatic ways of doing that. And then, the other issue is that, we may want to compose these together in interesting ways. So we might want to combine, for example, clustering and function learning in some way, to capture some more complex [INAUDIBLE] data. And I'll give you some examples of that.
OK. So let's start with mixture models and clustering. So as I mentioned before, people are often interested, given some data set like this, how many clusters are there in the data. And in the generative modeling framework, a cluster corresponds to a mixture component, let's say a Gaussian, which is parametrized by some mean and variance. So the cluster in-- in the Bayesian framework, clustering corresponds to assigning these data points to particular mixture components, and at the same time, figuring out what the parameters of those mixture components are. What are the means and variances.
I'm going to gloss over the problem of learning the parameters of the mixture components, and just talk right now about how do you infer the assignment of data points which I'll denote by x to clusters or mixture components, which I'll denote by z. And generally speaking, I'm going to-- this talk is going to be pretty light on math, but there's a lot of deep an interesting math here. And come talk to me if you're interested in that. I'll have some pointers at the end.
OK. So as we all know from Josh's lecture, the posterior description is proportional to the product likelihood of the data, given cluster z, and then this prior probability of cluster z. So what happens when we take the infinite limit, where we don't specifically say there's four clusters in these data, we allow some unbounded number of clusters. And we want to define prior distribution on clustering that allows for some arbitrary number of clusters.
So that's exactly the function that's fulfilled by the Chinese restaurant process. It's a prior over clusters with a number of clusters that's been bounded. Formally, it's actually a distribution on partitions of the integers. So this is some equivalence cost of clustering. So here's an illustration of this. Don't worry if it's hard to understand this diagram, it's not that important.
So the idea here in the Chinese restaurant process metaphor, we have tables. Let's imagine a Chinese restaurant with an infinite number of tables, and each table corresponds to mixture components. And the customers that enter into this restaurant, correspond to data points, and the customers-- the table at which your customers sit, determines if the, the assigning of, which mixture component it's going to be assigned to.
So here you see the customers that are attached to these tables, and the customers also carry with them some observation, that's the actual data that you're observing. And these thetas here represent the parameters of the two. So that's in the Gaussian to mean and variance. I think I should turn this out a little bit.
OK. So and just as a side note this-- the reason that this is called the Chinese restaurant process is because the statisticians who invented it, who are based in Berkeley were always very impressed by the Chinese restaurant in San Francisco that seemed to have some infinite seating capacity. That's where the metaphor comes in.
OK. So here is how the Chinese restaurant process actually generates a partition. So the first customer enters the restaurants, and sits at the first table. So remember customers correspond to data points, and tables correspond to clusters, or mixture components. Then subsequent customers enter and sit at table k, with probability proportional to the number of customers already sitting at that table. And with some probability they might sit at a new table.
So the probability's proportional to alpha. Alpha is a concentration parameter that determines the probability of sitting at a new table, in effect, that determines how many clusters are going to intersperse. So in the limit alpha goes to zero, all the customers are going to sit at the same table. And when alpha gets larger, the customers are going to progressively sit at distinct tables. The limit of alpha goes to infinity, they all sit at separate tables. And in general, the total number of occupied tables is going to scale alpha log n, where n is the total number of customers [INAUDIBLE].
OK. I think in the interest of time, I'm going to skip-- this is just explaining where this comes from, and where-- why these mixture models are sometimes called Dirichlet process mixture models. There's an intimate relationship between Dirichlet processes and the Chinese restaurant process. But in the interest of time, we'll just keep going.
So I'm going to talk about an example of this infinite mixture modeling framework in action, as I studied in some of my own research. So part of my research is working on classical conditioning, and understanding the theoretical basis of classical conditioning. So to take an example, here's a fear conditioning experiment, involving a fear conditioning experiment. You have a rat that's in this box and hears a tone that's called the conditioned stimulus, or CS. And after the tone terminates, the rat gets a shock. That's the unconditioned stimulus. And over time, if you do this a bunch of times, the rat when he hears the tone, he's going to freeze. That's an instinctive fear response.
Now, the classic idea about how learning in this framework works-- learning in this paradigm works, is that what the animal's actually learning, is an association between the tone and the shock. But that's not the only possibility of what the animal might be learning here.
And I like this cartoon as an illustration of why this is a problem. I don't know if people can see at the bottom. So there are these two ducks flying, and one of them is saying to the other, it's that time of year when guys randomly explode. And in the background, you see this hunter that's lurking, poised to shoot. This is an illustration of the kinds of perverse inferences that you might make if you had an impoverished model, that only allowed you to infer cause and relationships between observable variables.
What you really need here is actually, the possibility of having latent causes that can induce changes in your observations. So as I mentioned, the classic idea in the theories of classical conditioning, is that a tone causes the shock. So here I'm using graphical model notation like Josh used in his previous talk. But an alternative possibly, perhaps, is that the shock causes the tone. That's not all too plausible given the temporal relationships between the two.
Yet a third possibility is that something else entirely causes both the tone and the shock. And I'm putting Pablo here to stand in for this latent variable, because to a first approximation, that's actually the true generative model of the experiment. There's actually an experimenter there, that the animal can't see, who's generating the tone and shock pairs. As I already pointed out with that cartoon, in some sense, these models are too constrained. They can't cope with the richness of inferences that humans and perhaps animals, make about the world.
But this model has a different problem, which is that it might be too flexible. So when you start positing latent causes, what's to stop you from positing some infinite number of latent causes to explain your observations. This is simply an ill posed problem.
So the hypothesis that we explored, is that the animals are assuming a generative model, in which the number of latent causes is unbounded. So the animal can always infer some new latent cause, if the circumstances change, if its observational data change, and can no longer be explained by the causes it inferred previously. But a small number of latent causes is more likely to occur. So that pushes the animal towards the simplicity bias, or something that's called [INAUDIBLE].
So the idea, is really formally, exactly identical to the kind of infinite mixture model that I was describing to you just a moment ago. What I'm arguing here is that the rat does something like clustering. And conditioning behavior is, basically, the outcome of this clustering process.
Just to give you an example of how this works, so if you condition an animal in one box, let's say, this orange box, and the rat gradually learns to freeze when he hears the tone. And then, you extinguish the rat in another box. And what that means is that, you played the tone a bunch of times without the shock, and gradually the rat stops freezing, when he hears the tone.
So many theories of classical conditioning that assume some direct, positive relationship between the tone and the shock, basically, view extinction as the mirror image of acquisition. So in during acquisition, you learn an association between the tone and the shock. Then in extinction, you unlearn it. So that actually doesn't seem to be what happens, because under a variety of circumstances, this extinguished response comes back.
So for example, if you put the rat back in the acquisition box, or even in a novel box, you see a renewal of conditioned responding. So that suggests that the memory was not actually erased, but what animal learning theorists like to talk about, is that the animals actually learn something new, rather than unlearn an association. So the question is what does it mean to learn something new.
This latent cause model gives an answer to this problem. It basically says that learning something new corresponds to inferring a new latent cause. So I'll just illustrate that here. So let's imagine that the animal's a blank slate. When it comes into this conditioning apparatus, it infers some latent cause to explain these data.
And then when you put it into extinction, now two things have changed. The CS no longer predicts the US. The tone no longer predicts the shock. And the rat's in a new context. So that might be sufficient evidence to infer a new latent cause. But when you put the rat back into the test context, now the contextual information pushes the rat to infer that the previous latent cause is active, and hence to reinstate its fear of the tone.
And so this talk is not really going to be about neuroscience, but I thought I'd just mention this briefly. There's certainly a lot of interesting questions related to neuroscience, and here's of them.
People have suggested for a long time that hippocampus enables some sort of context sensitivity and flexibility to animal behavior. And in particular, when animals lose their hippocampus, they lose a large amount of context sensitivity. In particular, when an animal-- when you lesion the hippocampus, the animal no longer shows, the animals no longer show renewal of conditioned responding.
One way to think about the absence of renewal phenomenon is that, losing the hippocampus has impaired their ability to reflexively infer new latent causes. A tantamount to turning that concentration parameter down to zero. So they can no longer infer new latent causes. And indeed, you can capture this pattern of data by assuming, by making that assumption. And there are a bunch of other observations about hippocampal lesions that are consistent this general idea, which I'm not going to go into.
OK. I might actually skip, although I know Josh likes it. I can tell people about it later. OK.
So I want to talk about some extensions to this basic model [INAUDIBLE]. Right. So, mixture models are great, but they have a number of limitations. And in particular, these are limitations that we think the brain circumvents. And so the question is how-- what sort of generalizations can we think of that can more faithfully capture patterns of human reasoning.
I'm going to talk about three of them pretty briefly. The first is learning cross-cutting category. It's the model called CrossCat. The second will be clustering relational data. And then the third will be learning multi-level categories.
OK. So here is a graph that's-- here's a graph that shows the different features that are possessed by different animals. So if you see a little gray patch here, that indicates that, for example, an alligator bites, has the property that it bites things. Now you can see that there are some features that are shared across animals, and they line up pretty well. And then others, there's a lot of noisiness across different animals. So there's not consistent patterns of covariation.
One way to think about the task of clustering, is to basically, segment the animals into different groups or clusters, such that there's some kind of coherent pattern of feature covariation within a cluster, not necessarily between clusters. So if you apply the Chinese restaurant process mixture to this matrix, the optimal clustering is-- of breaking down the animals into four separate clusters.
And you can see here, and so here we've sorted the features now [INAUDIBLE] their likelihood. And what you can see on the left here, is that it does an actually, pretty good job of capturing certain coherent-- certain patterns of coherent variation among the features. So you see that certain features are consistently on, or consistently off within a cluster, but not necessarily across clusters.
You might reasonably ask, what's going on with all of these things. And one possible problem with this sort of-- this kind of clustering approach, is that is tries to pick clusters, such that it captures the coherent variation of all features. Right. So each cluster is trying to capture the structure of the entire feature space, but that may not be the most natural way to [INAUDIBLE].
In particular, humans show an ability to effectively partition the feature set, and learn what are called cross-cutting categories. For example, when asked to reason about anatomical properties like, does an animal have a liver with two chambers that act as one. People tend to draw upon taxonomic knowledge about animals. But if you ask them to reason about behavioral properties like, does it travel in a back-and-forth trajectory, then people draw upon a different source of knowledge, ecological knowledge.
So that suggests that there's not necessarily one clustering that people bring to a domain, rather, several different kinds of clustering that apply to different sets of features. So Pat Shafto and Josh, and a number of other people in Josh's lab, developed what they called CrossCat, which is a generalization of the Chinese restaurant process mixture model. But where you cluster-- what you do is you first cluster the features, and then you cluster objects within each feature subset.
So when you apply this algorithm, this model, to the data that I showed you before, what you get are-- you get three different feature subsets, and then different clustering within them. So basically, the first clustering captures the same groups that I showed you before. So leopard, sheep, and seal are clustered here, and you might have things that are more-- birds tend to go together in this cluster. But then you can get other kinds of clustering within the other feature subsets.
They also designed some artificial stimuli to test this idea more directly. The stimuli looked something like this, so you have these bugs that vary along in a number of different features like their texture, or their antenna, or tails, and so on. And unbeknownst to subjects, in this arrangement, the stimuli admit a natural clustering along the rows and columns. So two different clusterings.
And the question was, whether the model, and also people will pull out this natural clustering along the rows and columns. And indeed, if you apply CrossCat to the stimuli, you get two systems of-- two different feature subsets that are clustered sets.
I'm not going to go through the gory details of this, but in effect, what they found was that humans, they also tend to break down that stimulus set into two separate sets of clusters.
The next extension I'm going to talk about is clustering relational data. So here a few examples of relational data. Some of these are professors, and some of these are students. So professors give advice to graduate students and undergrads. Grad students give advice to undergrads. And undergrads give advice to no one. So this is an example of relational system.
Here's another example, magnets. So some of these objects are magnets, some are magnetic objects, and some of them are non-magnetic objects. And so you have a pattern of interactions where magnets interact with each other, magnets and magnetic objects interact. Magnetic objects do not interact with each other. Non-magnetic objects do not interact with anything.
And you can represent these kinds of relational systems in a matrix form like this, where the rows and columns represent the different objects in the domain. And you have a binary value here that indicates whether or not the two objects interact with each other. And this can also-- this can capture asymmetric patterns. So this means that person one gives advice to person nine.
And what we want to do in clustering these data, is basically, find a clustering of the objects, such that when we draw these line-- when we reorganize this matrix into the clusters, you get a consistent pattern of, basically, on or off, whether the relational predicate applies or doesn't apply within each of these boxes. So it's the same idea as clustering, where you want to get consistent covariation within a cluster, but now we're doing what's sometimes called bicluster.
The model is basically the same as the CRP mixture model with a slight modification. So we're going to draw a partition of the-- we're going to draw a partition of the object space from a Chinese restaurant process. And then for each pair of clusters, we're going to draw a parameter data, which, basically, captures the probability that object a interacts with object b. Or sorry, objects in cluster a interact with objects in cluster b. And then for each object in the entire domain, we're going to draw a relation from a Bernoulli distribution with that interaction parameter as its mean.
Charles Kemp, who developed this model, did a number of experiments to test this, and some of these involve what's called the causal blocks world. So you can see a bunch of these blocks with letters on that. And subjects are able to take some of these blocks, and put them together. And sometimes they'll make some sounds, and that indicates that causal-- that object a, block a turns on object c.
And Charles investigated a number of different relational structures, and looked at how people learned these structures. So for example, he looked at four different relational structures that vary along two dimensions. So one was whether the causal relationship between the two groups of objects, of groups are indicated by the capital letters A and B here. The whether the groups-- the interactions are symmetric or asymmetric, and whether they were deterministic or probabilistic. So for example, this is called the symmetric regular condition, where regular means deterministic. And here's the symmetric irregular condition or irregular [INAUDIBLE] or some kind of--
Or just to be clear. So when I say stochastic, what I mean is that, not all of the objects within a cluster necessarily have the same relational properties. But for a given object there is one-- it's deterministic whether or not it turns on another object. And one observation here is that these different systems have-- are more or less, easy to learn.
So you can expose people to different numbers of interactions in this causal box world. And we can ask what's the log Bayes' factor, well basically the log odds for the correct relational structure. So obviously, the more interactions you observe, the stronger the evidence is for the true causal structure. But these have different learning rates, so in general, the irregular systems take longer to learn.
So one interesting application of this model is to ask about one-shot relational learning. What they had people do is observe a number of interactions, and then they showed subjects a new block that they'd never seen interact with anything before. And they asked subjects whether or not it would turn on a certain other block. And they looked at this prediction as a function of the number of other-- of the number of interactions that they observed in that domain.
And they looked at both before or after they showed you one interaction between this test object, and another object in the domain. So they are looking at inferences before and after this one-shot learning. And unbeknownst to subjects, this test object belonged to either one set or another-- one of these two sets. And they specifically looked at tests where you apply the test object to the other group of objects, to the group of objects to which it doesn't belong.
And the learning curves look like this. So this is the number of objects observed and this is the probability of x turning on an object from group b. And what you see is that, before you've observed anything about x, you, basically, are indifferent between these two possibilities. But after you observe x interacting with one of the objects in group b, you have a-- you'll predict that it will turn on objects in group b. And the probability of that prediction, the strength of that prediction will grow with the number of objects that you observe, the number of interactions that you observe in between objects in the training [INAUDIBLE]. And that kind of pattern is exactly what you see when you apply the infinite relational problem.
Alright, the last extension I'm going to talk about for mixture models is multi-level category learning. So what is this, is it an animal, a mammal, a dog, a border collie. It doesn't seem to us intuitively, that one assignment of categories is sufficient here. In some sense, this object belongs to all these categories, and simply applying a standard mixture model is not going to give us the sort of multi-level membership. What we need is some kind of clustering model that will give us, basically, something like a taxonomic tree.
There's actually a straightforward extension of the Chinese restaurant process that will allow us to do that. It's called the nested Chinese restaurant process, and the idea here is that at each level of the tree we're going to generate a clustering from the CRP. And then for each of those clusterings, we can generate some kind of sub-clustering. And then you can recurse like this, essentially infinitely, but in practice we assume that data is drawn from some finite link path through this tree.
So Canini and Griffith developed a model of multi-level category learning that uses the nested Chinese restaurant process. And to test whether people followed the prediction of this model, they, basically, showed people hierarchies of objects where they had labels for the levels of the tree. So they might show this purple object here, and tell subjects that it was a [INAUDIBLE]. And they trained subjects up on this for a while. And then at the end, they asked subjects to arrange these category labels into a tree structure. And the question is whether subjects would arrange, would choose the tree structure that's consistent with this nested Chinese restaurant process model would predict.
And I don't have the data here because for some reason, it wasn't showing up right on the computer. But suffice it to say that, subjects are actually quite good at reconstructing this tree structure after only observing pairs of objects and their labels. And they basically-- this kind of pre-reconstruction is consistent with what the nested Chinese restaurant process model predicts.
OK. So I'm going to move on to the latent feature model, so if anyone has questions so far about mixture models. Yeah.
AUDIENCE: Does the clusterization actually depend on the identity of the objects at all?
SAM GERSHMAN: Are you asking about a-- asking about this particular model?
AUDIENCE: In your description of the Chinese restaurant process, it seems like it was just random, not dependant on the objects.
SAM GERSHMAN: Yeah. Well I mean this is the difference between the prior and the likelihood in the posterior. So the prior generates a partition of data, so it doesn't depend on the observed features of the data. But then when you combine that with the likelihood, the likelihood tells you how consistent are your observations with particular partition [INAUDIBLE]. And combine those together, you get a posterior that is dependant on [INAUDIBLE]. OK.
So latency [INAUDIBLE]. So here the question is, what in essence, is a feature of an object. It's really not a straightforward question to answer. There are many models of human cognition that, basically, assume a fixed feature set, but if you just get some of the feature. It's not actually [INAUDIBLE] an object is not actually completely obvious what the relevant features are. And in fact, as I'll show in a moment, the inferences people make over the relevant features depends strongly on the statistical properties of those features, and in particular the patterns of covariation between properties of different properties of the image.
The basic motif here for latent feature models, and this is only the simplest version of, really, a large family of models, is to imagine that objects are composed of different features, latent features. So we can imagine that this object is built out of these different pieces, this vocabulary of latent features. And then you combine that with a binary vector that tells you which feature is active in this particular object.
And multiplying these two-- multiplying the matrix representing the latent features, the visible properties of each latent feature, with this feature ownership vector, will give you the observed pattern of that. And this is showing that computation in, basically, block matrix. So here the question is, if we don't know how many features there are, what those features are, then how do we define a nonparametric generalization of this basic model to allow for, essentially, an unbounded number of features.
And that's where the Indian buffet process comes in. There's a similarly colorful food metaphor here. This idea was developed in London, I guess while Tom was visiting the Gatsby. And I suppose that in London, you have Indian buffets, which have seemingly infinite number of dishes. And so that's where the inspiration came from.
So what is an Indian buffet process. The IBP is a distribution of feature ownership matrices, where there's an unbounded number of features. So just to remind you here, so this is the feature ownership matrix. The rows correspond to data points, and then the columns correspond to features. So we're trying to-- so the IBP is a model, a distribution on matrices like this, where k is unbounded, but n is finite.
So here's the generative process. The first customer, the first data point, enters and samples some Poisson alpha number of dishes. The dishes correspond to features. And then subsequent customers come in. And customer n samples dish k with probability mk over n, where mk is the number of previous customers that have sampled that dish. And customer n will also sample a new dish with probably alpha over n.
So alpha here, really plays the same role as the concentration parameter that we saw before in the Chinese restaurant process. It controls-- now instead of the number clusters, it's controlling the number of features. So as alpha gets bigger, you're going to infer a larger number of features. And in the limit that alpha goes to zero, you have only a single feature.
So Joe Austerweil and Tom Griffith did a number of payroll studies to try to test the prediction of this model. And in particular, they were trying to answer the question of when do people unitize a bunch of observable features into a single latent feature. And the key idea here, the patterns of covariation between these observable features. So the idea here is that, when certain parts of an image are correlated, then with enough training, you're going to unitize them into a single feature. But if the parts are uncorrelated, or independent, then you're going to continue to represent those as separate features.
So what they did was, they built up a set of these objects out of a basic vocabulary of parts. And they told subjects that images like these are inscriptions on some-- the wall of a cave on Mars. So they imagine, they asked the subject to imagine Mars rovers going around Mars, and looking at caves. And they found these alien inscriptions, and subjects would be asked later to judge whether some new inscription came from the same Martian cave.
And to manipulate the correlation of the parts, they created two separate conditions. So these will run in two separate groups of subjects, where in the independent condition there is no correlation between these different parts. So the idea there was that it was going to prevent any kind of unitization of the parts into clustered latent features. Whereas in the correlated condition, certain parts tended to co-occur with each other. And the hypothesis was that, in this condition, subjects were going to unitize these parts into a single [INAUDIBLE].
So in the training phase, they just showed subjects images like this, and let them look at them, basically. And then in the testing phase, they looked at generalization. So what they did is they showed subjects a new object that they'd never seen before. And they asked whether they thought that-- how likely do you think the rover is to see this image in another part of the same cave wall. And they compared this rating to these novel or unseen objects, as well as to previously seen objects, and control objects where they just shuffle the parts of the seen objects.
And what they found was that in the correlated condition, subjects were very hesitant to generalize to the unseen object. And the reason why this makes sense within the latent feature model, is that in the correlated condition, these objects, because of the fact that these parts-- because of the fact that the co-occur together, if they're unitized, then some new object that has the same parts, but not in the same configuration, will be considered a new object.
So the latent future model predicts that unitization will prevent generalization or will reduce a generalization decrement. Where as, in the independent condition, subjects were very willing to generalize to the unseen object because, basically, they didn't lose the flexibility of parts. They never unitized parts into larger features. And so they showed no evidence for generalization. Yeah.
AUDIENCE: So in the condition where, well, I guess in both conditions, are the independent and correlated features determined by how the stimuli are generated? Is the description of the generative process of the stimuli themselves?
SAM GERSHMAN: Yeah.
SAM GERSHMAN: And this is just showing the features that were inferred by the Indian buffet process. You see that when you apply them to these two different sets of data, that IBP does, in fact, show [INAUDIBLE]. OK. Any questions on the latent feature section?
So the last building block that I'm going to talk about is function learning. So here's an example of [INAUDIBLE] data. And each of these panels shows a polynomial function that was [INAUDIBLE] do when you data. And they differ only in the order of that polynom. So this is a zero supported polynomial, which can only basically set a straight line, or rather it can only model, sort of, the overall bias of the-- whether these points are above or below zero. So this sort of polynomial is basically, linear regression. And you can consider higher order polynomials, as well.
So the key thing to take away from these graphs is that, a very low order polynomial is really too simple. It can't capture the [INAUDIBLE] data. But a really high order polynomial is too complex. Right. It's capturing maybe some idiosyncrasies of these, of these data that might just be noise. Whereas, some polynomial or intermediate order seems to capture what we see as the intruder structure of this function. So that sets up the basic tension here, that we've seen in the other examples, where you want a model that is flexible enough to capture, sort of, arbitrarily complex patterns of data. But it has a preference for simpler patterns.
So that's where Gaussian processes come in. Gaussian processes are the nonparametric model of choice for thinking about function learning. Then the idea is pretty straightforward. We're going to imagine there's some latent function that's mapping the input's x to the output's y, plus some white noise. And a Gaussian process-- one way to think about a Gaussian process, it's a distribution over function. And that distribution is parametrized by-- has two parameters. It has a mean function, and the covariance function, which is sometimes called the kernel. And these are exactly analogous to means and covariances. Mean vectors and covariance matrices are multivariate Gaussians.
But, in this case, we're imagining whereas multigrid Gaussians are distributions on finite dimensional vector spaces. This is, essentially, the distribution on some infinite dimensional vector space. And the kernel here, specifies the smoothness of the function. And I'll show you some [INAUDIBLE] in a second.
The one really nice thing about Gaussian processes is that when you condition on data with posterior predictions of the function values that unobserved points are available in closed form. So it's this very powerful and yet analytically tractable framework for modeling functions.
And of course, I showed you some one dimensional examples because it's easier to show in a plot. But x can be high dimensional as well. So here are some examples of functions that were drawn from the Gaussian process. So I can actually, because of the Gaussian process, the [INAUDIBLE] I can actually sample functions from it. And these different panels show different choices of kernels, or covariance functions that have different smoothness properties.
And so you can see that some kernels are very rough. So they have-- there's very little smoothness in the function that's drawn from them. Whereas, other kernels, like this one, are very smooth. Here's an example of what happens when you use different kernels to model the same data set. So here's a one dimensional data set. Here are samples drawn from the kernels just to give you an intuition for their smoothness properties. And then here, you see the posterior mean as the middle line, and then plus minus two standard deviations around it.
So you can see that there's this qualitative match between how rough the samp-- rough the functions are, sampled from that Gaussian process. And then the roughness of the posterior prediction. And in fact, you might say, well how will I choose the covariance function and the parameters of that covariance function. Well, a nice aspect of this framework is that you can actually use Bayesian model selection to choose the parameters and the covariance function. But, I'm not going to go into that now.
So there's actually a ton of data on human function learning, not so much applying this framework. But I'm going to talk just a little bit about what application the Gaussian process of human function learning that Tom Griffiths and his lab did. This is here, you can see at the top here, are data from this paper by Delosh, et al., in 1997. What they did was, they showed subjects function values within this region, between these two central lines . And the function values are taken from different functions. So they saw linear functions, exponential functions, and quadratic functions.
And then what subjects were asked to do, was to make predictions about the function values outside of that, of this interpolation region. So they were asked to extrapolate the region, to extrapolate function outside its interpolation. And what you find here is that, linear functions, subjects are very happy to extrapolate linearly. But for exponential functions, subjects seem to underestimate the degree of curvature. So this dotted line is the true function, and this black line is the-- are the human predictions.
And what you see is that, once you move outside of the interpolation region, subjects seem to underestimate the curvature. And that arises naturally from the Gaussian process framework, because as you move outside of the region in which you have data, your uncertainty grows. That effectively, regularizes your function for strong-- towards [INAUDIBLE]. And that's exactly what you see here. This is the Gaussian process model applied to these data. And you see that it qualitatively captures this pattern, this underestimation extrapolation. Yeah.
AUDIENCE: What is the mean function in this case?
SAM GERSHMAN: So in practice, people often assume that mean function is just zero. But you're asking-- what does it mean to choose a mean function of zero. Well, it's basically like modeling-- when you choose a mean function, what you're basically modeling are the residuals between the observed data and the mean. So you can-- so by choosing the mean function, you're basically just-- that by choosing the mean function to be zero, you're basically just modeling the covariance of these different data points on this function.
AUDIENCE: So when you say it's a bias towards the prior [INAUDIBLE] for having like a, that may have a shallower slope.
SAM GERSHMAN: Yeah. One way to think about this is that, the Gaussian process induces smoothness. So as you get farther away from your actual sample data, you're going to show a preference for smoother function. So function in this case, in this one dimensional space, they have less curvature. And that's the origin of this underestimation function.
SAM GERSHMAN: Yeah. And in fact, any kernel that you select for a Gaussian process, or any kernel that you select for an [INAUDIBLE] is it satisfies the same properties that you need to function as a covariance function in the Gaussian process. So there actually is a-- there a lot of deep connections between the two approaches And I'll say one other thing about function learning, which is that-- Tommy you talked, you probably talk about primal dual formulation.
AUDIENCE: Actually, [INAUDIBLE].
SAM GERSHMAN: Oh, OK. Well just as a brief aside, and then we'll go to lunch. There is substantial theoretical literature for this paper on function learning, and trying to understand function learning. And there are, essentially, two different classes of theories. One that was basically an exemplar based theory. So the idea is that, you're going to take your data points, and basically-- it's sort of like the kernel density estimation that I showed you before. You're going to take your observed data points, and then try to estimate unobserved data points by taking some weighted average-- that making a prediction by taking the weighted average of the other data points that you observed, created by their similarity to the unobserved data points.
And the other extreme, is basically a parametric function learning model, that tries to construct a neural network that can model arbitrarily [INAUDIBLE] functions. And one interesting outcome of using Gaussian process, the thing about function learning is that, it actually shows that these-- mathematically, these two approaches can be made to be [INAUDIBLE]. So exemplar models that reason about functions in terms of examples and similarity between examples, that similarity function is essentially what's captured by the kernel. You can show that any given parametric, any given parametric function, can be mapped onto this kernel formulation. So it turns out, that this is two different theoretical camps may actually been circling around the same basic theoretical data.
Associated Research Thrust: