A distributional point of view on hierarchy
Date Posted:
September 18, 2019
Date Recorded:
September 17, 2019
Speaker(s):
Maia Fraser
Description:
Maia Fraser, Assistant Professor, University of Ottawa. Abstract: Hierarchical learning is found widely in biological organisms. There are several compelling arguments for advantages of this structure. Modularity (reusable components) and function approximation are two where theoretical support is readily available. Other, more statistical, arguments are surely also relevant; in particular, there's a sense that "hierarchy reduces generalization error". In this talk, I will bolster this from a distributional point of view and show how this gives rise to deep vs. shallow regret bounds in semi-supervised learning that can also be carried over to some reinforcement learning settings. The argument in both paradigms deals with partial observation, namely partially labeled data resp. partially observed states, and useful representations that can be learned therefrom. Examples include manifold learning and group-invariant features.
PRESENTER: Welcome to this CBMM seminar. It's a pleasure to have today Maia Fraser from the University of Ottawa. So Maia has a kind of a diverse background. She got her first PhD in Math from Stanford, and then she went on to work in industry in high-performance computing for a few years.
And then she went back to get a second PhD in Computer Science from the University of Chicago, and started to work on machine-learning-related topics. And I guess that's around the time when we met, when we were trying to figure out mathematical models of the visual cortex that Tommy was working on at the time. So Tommy [INAUDIBLE] and I were discussing this stuff. And we had multiple discussions with Maia in Hong Kong at the time.
And then she went on to work at the University of Toronto as a postdoc. And she's now faculty at the University of Ottawa. So it's a pleasure to have you here.
MAIA FRASER: Yeah, so thanks very much. Thanks for the invitation. It's really a pleasure to be here.
So actually, I should say I expected an informal talk in a lab meeting. I wasn't expecting this at first, so I'll try to highlight some general themes and ideas without getting mired in too many technical math details. And maybe one thing I can say about this illustration here-- it's just to kind of emphasize that I'm going to try to talk about hierarchy and some ideas about the usefulness of hierarchy, but a little bit different from the typical deep versus shallow question.
And so that's why I've kind of chosen a different picture. This is like some artist's rendition of a molecule. And actually, I have an ongoing collaboration with chemists, and there's all sorts of-- it's really been eye opening to see the-- thanks-- kind of the magic that's going on even at that level. And so just in general, we have these very complex biological organisms. And there's hierarchy involved.
And so let me just first specify the terms a little bit. All right so just to explain what kind of hierarchy I'm going to be asking this question about, here are sort of typical, various feed-forward neural network architectures. And often, when people are asking a deep versus shallow question, at least in computer science, they're referring to architectures like this.
And this is actually a collage made out of material from this lab, from Andre Wibisono and yourself, from some lecture many years ago talking about this deep versus shallow question. So these are various architectures with the kind of references underneath. But I want to look more broadly at biological hierarchies.
And so in the previous slide, those were architectures that are modeled after peripheral regions in the brain, sort of mostly visual cortex, early stages of visual cortex, but could be auditory cortex, where similar principles are at work. But of course, in general, the brain is much more complicated. And this is, obviously, a highly simplified picture as well.
And so admittedly, it's not strictly a hierarchy. It's much more complex than that, but it's still true that from some sort of stimulus to the actual action or perception of an individual or agent, at least part of the process is proceeding in hierarchical fashion. And I also wanted to emphasize over here that there are parts of this sort of control that are happening even outside the brain. So this is an illustration with-- you can't really see it too specifically, but central pattern generators.
And so, first of all, there are these various modules. And the learning that happens in each of them is happening on very different timescales. Some things may be hardcoded already genetically-- so that's learning that's happening on an evolutionary scale. And there's also a sort of multi-scale aspect, in that the higher regions need to access these lower modules, and so on right down to the level of proteins and all the kind of behavior that goes on at that level.
So these are the references for the various figures, but I should also make a disclaimer. I'm not a neuroscientist at all. I just have certain collaborations with neuroscientists, and that's a big source of inspiration for trying to understand what machines can do, as everybody here is very aware.
And so mostly I work on the mathematical machine learning side, trying to come up with mathematical principles to help understand. So of course, mathematical formalism is necessarily a simplification. And it's interesting when it gives you some kind of insight about what might be going on in a problem. And sometimes, then, it also gives you sort of general principles that allow you to design other things. So that's kind of the angle from which I'm approaching this.
So just to recap, I want to emphasize, in that earlier slide where we saw the various feed-forward architectures, that a lot of the themes that were happening there are still relevant in what I want to consider today. But I want to allow the possibility of learning with one kind of data at one stage before one builds higher stages on top. So it's not a whole architecture learning at once, but learning in modules. And this learning, then, can be progressive, so that one unit has been optimized before one then calls on it from another area. And also, just in general, this sort of theme of multiple space and time scales throughout-- those are some of the main messages I want to build on.
So I'm going to call this kind of learning, this kind of hierarchy, "multi-step learning." And by that I just mean that the training can happen in multiple steps. So we can train with one data set, and then later train a further machine that calls on the originally learned module. And so in that kind of context, I want to consider what are the benefits of hierarchy.
And here, I should also mention neuroscience is definitely one of the big inspirations, but also just kind of an amazement at what humans can do as a parent, and just looking at all sorts of things in the world. So in general, humans learn through a curriculum. They start with simpler tasks. Partly it's the parents that give them these tasks, but it's also there are various arguments for even their perception of the world is maybe going to be restricted earlier on. And later on, once they've developed initial representations, let's say, then it will be already a different access to the outside environment.
So to mention that there's evidence, for example, that in the visual system, that early stages are learning before higher stages begin their learning. And this is just in the first few days and weeks after birth. And that's the sort of theme that I want to delve into a little bit mathematically.
So when a child is watching objects move around like this, there's nobody who's saying all the time, "This is a pen. This is a pen. This is a pen. This is a pen." The child is presumably doing something useful, learning something from all of this.
And then later on, later in life, you can say to the child, "This is my pen." And very quickly, from just one example, they will understand. So what is this useful process that's been going on earlier in life? And it's clearly some kind of invariant representation, so that at least there's an awareness that this is the same object that persists throughout.
So as I was saying, after the child has been exposed to sort of simple tasks, it's in a position to learn complex tasks, sometimes from very few examples. And these examples are not the same as for previous tasks. So this is the sense in which I want to look at multi-step learning, where we initially have one sort of data set which can be used to train one module, and then that can subsequently be used in further learning tasks.
I want to emphasize that the tasks themselves are hierarchical. And so I will be talking also about hierarchical reinforcement learning in that sense. And once again, these themes are sort of echoes of themes that are relevant even in a feed-forward architecture: that you have reuse of sub-modules, and invariances play a role, and so on. So the only extra ingredient is that now I want to consider what happens when we can train first on one kind of data and then use that in subsequent steps.
So the main themes that I'll address are representation learning, and another big part of this is sort of partial observability, or partially labeled data. And so when I was mentioning that there are sort of different data sets that we would learn from, the point is that we may be learning from plentiful, unlabeled data at one stage-- unlabeled in the sense that it's not fully labeled for the final learning task-- and then later, learning with relatively few examples of more fully labeled data.
And I'll highlight an analogy here between semi-supervised learning and reinforcement learning. And sort of the distributional theme that I want to focus on for most of the talk has to do with a learning environment in which learning of these various things proceeds. So as I was mentioning, I'm mostly going to focus on those two paradigms. So it'll be semi-supervised learning and hierarchical reinforcement learning.
And I'll spend most of the time talking about semi-supervised learning. And I'll describe a point of view which is slightly different from the usual PAC framework for learning, but it's a way in which one can get some insight about semi-supervised learning. And that's not possible in the typical PAC frameworks. And then I'm going to bring in a short discussion about hierarchical reinforcement learning to show some analogies-- that the sort of insights one gains from that approach to semi-supervised learning also give rise to certain insights in the context of hierarchical reinforcement learning.
So this illustration is-- let's just think of it as an illustration for hierarchical planning in general. If you're going to try to make a souffle, you have certain subroutines. So here, there's one subroutine for making the ganache, another for the egg whites. This is not actually the kind of recipe I would follow.
But anyway, this is like any kind of more complicated task you're doing. It has various subroutines. We tend to think of things in that manner.
And so the idea here is that these subroutines may have been learned, or the agent may have been trained on these in other contexts, and can then call on them in a sort of hierarchical control. So that's just to give a little bit of an illustration for people who are not working in hierarchical reinforcement learning. But I also mentioned POMDPs-- that's Partially Observable Markov Decision Processes. So that's a variant that could be in the hierarchical context or in the non-hierarchical context. But this notion of partial observability-- I'm going to kind of drive home the analogy between that and partial labels in the semi-supervised paradigm.
Before I get going, I just want to say that this work that I'm going to talk about today was triggered by ideas of Partha Niyogi, who was my supervisor at Chicago, and who passed away very young while I was doing my studies. He was a former student of Tommy's, and that's partly how I met up with everybody here. And in this technical report of his, he had been looking at some formal mathematical analyses to understand the benefit of semi-supervised learning, in particular manifold learning.
And in that paper, there is a sort of key example to illustrate his ideas. That's something that I built on in my thesis work, and then also in this paper and a subsequent paper, with similar ideas for reinforcement learning. And so even though the actual papers are very mathematical, in what I'm going to talk about today, I'm going to try to, like I was saying, avoid a lot of the math detail and just emphasize the ideas involved. But really, all of this was triggered by this original example of Partha's and then trying to generalize the principles that are at play in that example.
And I also have current collaborators with whom I'm working on some of the ideas that I'll be talking about today. So one is Georg Northoff, who's a neuroscientist at Ottawa-- actually, not just a neuroscientist, also a psychiatrist and philosopher. He's a very multidisciplinary person.
And then two people in Montreal, Prakash Panangaden and Guillaume Rabusseau. They're both in machine learning, but doing very mathematical work in that context. And then some students, Vincent Letourneau, Scott Parkinson, and Aleksandr Shumarov.
So as I was mentioning, I'm going to focus on these two paradigms, semi-supervised learning and reinforcement learning. So now let me just lay a little bit of the groundwork in semi-supervised learning to describe the setup. So how many people are familiar with this completely? OK.
At the very beginning, what I've written here, we're trying to predict some function from, let's say, a space x to a space y, typically called the labels. Right here, what I've written could just be a supervised learning problem. If you are given examples like this, of pairs x, y, then you can try to get a machine to learn the mapping from x to y.
It may be that there is a fixed y associated to each x. That's the non-noisy case. Or, more commonly in the real world, to each x there is a probability distribution, a probability of y given x, which would then be the case of noisy labels. So all of that would just be supervised learning.
However, if we additionally allow ourselves samples that are just samples from x, and we have an algorithm that can take advantage of these samples from x to improve its performance, let's say, on the actual supervised learning task, then this is a semi-supervised algorithm. So manifold learning is considered to be an example like this, and there are various other ones. So I'm actually going to extend this slightly.
So here, typically one considers adding unlabeled samples x1 up to x m. But I'm going to also consider adding in time sequences of unlabeled data, because in case there is a certain group action going on, it can be very informative to the agent to see samples as they appear in time.
So this is inspired by this case of a child seeing an object move through its visual field. That's time sequence data of unlabeled instances. So that's the basic setup.
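In symbols, the basic setup just described looks like this (the notation is illustrative, not taken from the slides):

    Labeled data:    (x_1, y_1), \ldots, (x_n, y_n) \sim p(x, y)
    Unlabeled data:  x'_1, \ldots, x'_m \sim p_X, \text{ optionally as a time sequence } (x'_t)_t
    Goal:            \text{find } h : X \to Y \text{ with low risk } R(h) = \mathbb{E}_{(x,y) \sim p}[\ell(h(x), y)]

A semi-supervised algorithm is then one that can use the unlabeled samples to improve its performance on the final supervised task.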
And then one can ask, what is the value of unlabeled data? What's the use of these extra x1 up to x m that don't have any labels on them? Castelli and Cover, in a couple of papers of theirs, were probably the first to actually formulate it that way, to ask that question in that sense and to address it carefully.
And that was the main inspiration for the technical report of Partha Niyogi's that I mentioned a few slides back, that kind of triggered what I'm talking about today. Other people who've looked at this question are Balcan and Blum in a 2005 paper. And there are many others.
Some people conclude unlabeled data is useless. Some people conclude it's useful in this way or that way. Clearly, there are many cases in the real world where it's useful for something, so let's just try to understand this theoretically.
And the reason I mentioned Balcan and Blum particularly is that there are a lot of echoes between their approach and what I'm going to describe today. So let me just say a couple more words about that. What they end up doing is they augment the PAC framework. So how many people know what this is? OK, so I'll talk about that in another slide.
This is a sort of mathematical formalism, in a sense, for analyzing learning. And it's become very widespread in machine learning. There are many variations on it and so on, but it's not suitable to analyze semi-supervised learning, and I'll explain why in a moment.
So Balcan and Blum augmented that framework by introducing, in addition, a compatibility function. And they show that if you have one of these, then you can use it to reduce the concept class. All these terms I'm going to come back to in a moment.
But this theme of reducing-- so "reduce" is maybe not quite the right word-- restricting in some sense, altering the concept class, basically, to make a simpler learning problem: that theme has something in common with what I'm going to talk about. So let's look at what the PAC framework is.
It's a framework introduced by Valiant in the early '80s. And this little quote here is just from a machine-learning textbook by Mohri and colleagues. I should also mention, since the audience is not just coming from computer science, what a concept class is.
So remember, we were trying to predict some function that goes from x's to y's. And the concept class governs all the possible relationships that these x's and y's could have together in this setting. So in other words, a concept class is a collection of functions from x to y.
And our agent is going to try to come up with a hypothesis that's also a function from x to y. And for simplicity here, we may as well assume it's being drawn from that actual concept class. So that means that the hypothesis class and the concept class, let's just assume are the same for now.
So one can describe a learning problem in terms of its associated underlying concept class. And a concept class is said to be PAC learnable if there exists an algorithm and a polynomial such that for any epsilon and delta, for all distributions d on x, and for any target concept c, you have this sort of bound on risk. And I'm going to delve into that.
That applies for any sample that's larger than this polynomial expression. So here, this is the source of the-- PAC stands for Probably Approximately Correct. And that's because this is saying with high probability, so with probability at least 1 minus delta, we have that the risk is less than or equal to epsilon. This is saying that our hypothesis is approximately correct.
And what is the risk? The risk is just some expected loss, where loss measures how far off our prediction is from the true concept. So this is a statement saying with high probability, the risk of our hypothesis, the extent to which it's bad, is less than or equal to epsilon.
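In symbols, a reconstruction of the standard definition being paraphrased (following Valiant's formulation, since the slide isn't reproduced here): a concept class C is PAC learnable if there exist an algorithm A and a polynomial poly such that for all epsilon, delta in (0, 1), all distributions d on X, and every target concept c in C,

    m \ge \mathrm{poly}(1/\epsilon, 1/\delta) \;\Longrightarrow\; \Pr_{S \sim d^m}\big[\, R(h_{A(S)}) \le \epsilon \,\big] \ge 1 - \delta,
    \quad \text{where } R(h) = \mathbb{E}_{x \sim d}\big[\, \ell(h(x), c(x)) \,\big].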
So back in that previous slide on SSL, I just said we're going to try to find some sort of map h from x to y. I didn't specify what properties that map should have. We want to find a map that has low risk. That's the goal.
So notice that for this definition here, there's also a version for p-concepts. And that's this other situation I was mentioning, when the labels are noisy, where instead of just a class of functions from x to y, you would have a class of probability distributions of y given x. That would serve as your concept class. It's called a p-concept class then.
But the rest of the definition would be the same in that setting. I should've mentioned that this p-concept variation arose roughly 10 years later, from Kearns and Schapire. And the original definition of PAC learning you see here is from Valiant in '84.
So both of these things involve a distribution-free worst-case analysis, because right here, you say for all distributions d on x. So that's a very drastic kind of statement. Out in the real world, that's just not the situation. You're not trying to learn with any possible distribution on x.
But in particular, when you're wanting to look at semi-supervised learning, that's a big problem, because you're trying to ask what's the value of this unlabeled data. Here, this whole framework is only looking at what an algorithm can do for all distributions d on x. And if you consider arbitrary distributions d on x, then having a sample of data from x is just going to tell you about what d is.
It's not going to tell you anything about the actual underlying concept that you're trying to learn. So this is just inherently not suited to analyze semi-supervised learning in its raw form. So that's why Balcan and Blum proposed this augmentation of that, to add in the extra notion of a compatibility function.
And then it says how compatible-- so if you see a bunch of unlabeled data, that gives you a sense of the distribution d. And then with this extra gadget, which is the compatibility function, you also can measure how compatible such a d is with various concepts. And you will only search for concepts within a smaller class of those concepts that are close to d, that are compatible with d. So that's the way their argument proceeds. But you have this extra gadget that you kind of invented and added in there.
So just to go back to this kind of setting-- when we have a child learning from seeing objects moving around, let's say that the object itself is the unlabeled data, and then the name of it, like "pointer," is the label. It's really not true that all of this unlabeled data can be sampled from an arbitrary distribution.
We're not going to judge a child based on how well it can learn in any kind of weird distribution, where suddenly there's one object, and another one, and another one, like that. There's clearly some sort of space of naturally occurring images. And that's, of course, specific to the actual agent. So the agent finds itself in a particular environment, and within that environment, there's some kind of probability distributions that govern what kind of data may be seen.
I'm going to focus on that aspect to try to characterize, in as simple a manner as possible, the agent's environment. But what is it that we need to know about the agent's environment? We need to know something about the probability distributions that are coming out of that.
So this is the distributional point of view that I want to take. I'm going to assume a joint distribution p on x cross y comes from some statistical model. A statistical model is a sort of formalism that's been used for years in statistics. It's just a family of joint distributions. So we've got a joint statistical model: a family of joint distributions on x cross y, possibly parameterized, not necessarily.
So there are some echoes here of Baxter's notion of a learning environment, from a 2000 paper. He set up a sort of formalism for studying transfer learning. That's when you have a machine or a biological agent that's acquiring skill in one task, and then later having an advantage in learning another skill.
And so that formalism puts these skills within a common learning environment. A learning environment in Baxter's sense is a calligraphic P, the collection of all distributions on x cross y. And then he assumes, in addition, a prior q, which says how likely these various distributions are.
So suppose you take the ones that are impossible-- they have probability 0 according to that prior q-- and throw them away. Then what you're left with is essentially a statistical model of distributions that may occur together with this additional prior, which says how likely you think they are. So I'm not going to go so far as to put a prior on how likely different distributions are. There may be settings in which that's quite useful and so on. But let's just start simple.
Let's at least take a collection of probability distributions that may occur. So p in the rest of this talk will be a statistical model that is a family of distributions that our data may be coming from. These are joint distributions, so every p in there can be written in this form.
We can write it as a marginal times a conditional. And it's actually this conditional that we're after. We want to know-- let's say maybe the conditional has all of its probability concentrated at one value; then that's the label we're trying to predict. Or maybe it's just some sort of more spread-out distribution over labels, and we might want to know its expectation.
We're basically trying to access this. And one of the themes that I wanted to delve into is representations. So let's suppose there's a representation.
In these models, where we see learning proceed up to a certain stage, and then, based on that transformation so far, one does further learning-- in that kind of situation, there's essentially a representation from the original space x to some new space t. And the key thing is that we can still do our final learning problem based on the t-values. We don't need to go back to x.
So this representation is sufficient. This map t is sufficient for the parameter of interest. So in this case, let's just assume that there's some parameter theta here for this conditional that we're after.
And so the sufficiency that I'm describing is actually a stronger form of sufficiency than what you would usually say, because it's a kind of sufficiency that's tailored to the situation of a joint statistical model. Normally, we just describe it for a particular parameter, but here, it's in this context of a joint model. And so this is not a standard definition, but it's basically the same theme as the usual notion of sufficiency.
This t is sufficient if you can write the final p theta as some rho of theta that's y given t of x. In a moment, we'll talk about various examples so you can get a feeling for what this means. But basically, we're just taking the original x. We're converting that to some other t of x, and then we're going to look at how the final y values depend on t, instead of asking how they depend on the original x.
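In symbols, the sufficiency condition just described (my transcription of the spoken formula):

    p_\theta(y \mid x) = \rho_\theta\big(y \mid t(x)\big) \quad \text{for some family } \rho_\theta \text{ of conditionals on } T,

so the dependence of the labels on x factors through the representation t : X \to T.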
So why would we want to do that? What's the advantage of that? That's what we'll talk about.
So notice that when you're doing semi-supervised learning and you have access to plentiful unlabeled data, you can narrow down, possibly, the marginal. Suppose the data is coming from some p that lives in this joint model. Then with plentiful unlabeled data, you can potentially get a very good idea of this marginal distribution.
And then there is a natural projection, which just takes this p x, y and maps it to here. So the fiber-- that's the pre-image of a particular p x here-- will be all of these conditionals that can occur together with that marginal. So when you're doing semi-supervised learning and you have all this unlabeled data, you're getting a hold on this marginal here. And then that means that essentially you really only need to search within a reduced concept class corresponding to the pre-image of pi.
So if we add in an extra representation, then this is further reduced, and we're actually dealing with this projection that goes to marginals with respect to t. So here, I would just be writing this expression as p of t, y equals p sub t of t times p of y given t-- a very analogous expression would hold. And similarly, we would then be searching in a fiber of this map.
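A rough rendering of these two projections in symbols (notation mine):

    p(x, y) = p_X(x)\, p(y \mid x), \qquad \pi : p \mapsto p_X, \qquad \pi^{-1}(p_X) = \{\, p' \in \mathcal{P} : p' \text{ has marginal } p_X \,\}
    p(t, y) = p_T(t)\, p(y \mid t), \qquad \pi_t : p \mapsto p_T

Plentiful unlabeled data pins down the marginal approximately, so the search is confined to the fiber over it.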
Just to emphasize the commonality with what Balcan and Blum were doing, we're seeing once again that we're searching in some sort of restricted class, but this restricted class is just arising because of the statistical model. So we assume a statistical model. Then automatically, to every marginal, there's only a certain collection of conditionals that are relevant.
And you can do an approximate version of this, and you would come up with something that's very similar to the compatibility function of Balcan and Blum. It's just that it arises without having to introduce an extra gadget. It arises from a simple assumption about what the joint statistical model is.
In this context, I'm going to try to ask what the value of unlabeled data is. And to do that, as is often done to get these upper and lower bounds on risk, we make use of measures of complexity. Since this whole thing is expressed in terms of statistical models, I'm introducing an analogous notion of complexity for statistical models. I've written here a simplified version just so you don't get bogged down in technical details.
There are more elaborate versions of this to deal with noise. This is a non-noisy version, with some extra simple assumptions in here. But just to give a flavor of it, it's very similar in spirit to VC dimension, so people who know of that will recognize this here.
And what is it saying? If you don't know VC dimension, don't worry. You can just take this definition for now. So we'll say that the statistical model p has gamma uniform shattering dimension at least n if we can find disjoint subsets S1 up to S n, and a subfamily of distributions within p, such that the marginals of those distributions each assign some basic amount of measure to every S i, but the conditionals are as wild as possible.
So what I mean by that more precisely is that the conditionals, these conditional expectations, expectation under p of y given x, should be constant on each S i so that if you have an x value that's coming from S1, then for all of those different x values in the set S1, you should get the same conditional expectation of y given x. It's basically saying that we're associating the same label to every x in this set S1. And likewise for S2 we have a certain label.
So we essentially get set functions. For each set, we have a label. And then what we're going to require is that these set functions shatter S1 up to S n. That means that I can choose any map, any sort of assignment of 0's and 1's to these sets S1 up to S n. So there are 2 to the n of those possible assignments, and those are called dichotomies.
And we want these conditionals, these conditional expectations, to be varied enough that we can get all of those possible labelings. If this is too much technicality to absorb in a short time, the take-home message is that the marginals are giving a certain reliable amount of weight to each of these sets S i, but the labels are varying as much as we want them to, as drastically as we want. A situation like that is what this measure of complexity captures.
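Putting the simplified definition in symbols (my paraphrase; the precise form of the mass condition on the slide may differ):

    \mathcal{P} \text{ has } \gamma\text{-uniform shattering dimension at least } n \text{ if there are disjoint sets } S_1, \ldots, S_n \subset X
    \text{ such that for every dichotomy } \sigma \in \{0, 1\}^n \text{ there is some } p_\sigma \in \mathcal{P} \text{ with}
    p_{\sigma, X}(S_i) \ge \gamma \quad \text{and} \quad \mathbb{E}_{p_\sigma}[\, y \mid x \,] \equiv \sigma_i \text{ on each } S_i.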
If you have that, then that's a certain statement of complexity of the model. And then you can prove lower bounds on risk. So it will say that the risk will then be at least this amount.
So basically, the model is sufficiently complex that you're never going to be able to get your expected loss below a certain amount. It's going to be hard, in that sense, to learn. And what I'm going to discuss is always at finite sample size.
This lower bound holds only for sample sizes below a certain number that comes from this n here. But basically, when your model has that kind of complexity property for this n, then if the sample size is too small, you'll be condemned to have high risk. That's what this is saying.
It's easier to understand this if we look at concrete examples. So the two semi-supervised learning settings that I'm going to look at are manifold learning and group invariant feature learning. So you have a group action, and let's say the conditionals that you're interested in, let's say they're invariant under the action of this group.
So for example, here, this label "pointer" that I'm putting onto this object, it's invariant no matter what kind of transformation I use to modify the image that my eyes are seeing, these naturally occurring geometric transformations. This is still a pointer. That label is invariant under these transformations. So these are transformations of x of the unlabeled data, but the labels are invariant under those.
I'll go into an example of that right after this. But first of all, let's talk about a manifold learning example, which was the original motivation for the technical report of Partha's that I mentioned. Here is kind of an ugly, simplified example. It's a piecewise linear one.
Let's, to make it a little bit nicer, imagine that it doesn't have these corners. Just think of some simple closed curve. And I intended this to be oriented, but I forgot to draw the arrow, so let's say it's going around clockwise.
And let's suppose that labels are assigned 1 where we have this dotted part, and 0 outside of that. So it's a very simple sort of scheme for labeling. We just take some subsegments here, and we assign 1's there, and we have 0's outside.
So if you happen to have circle coordinates for this thing, then the labels in terms of those coordinates would be very simple. They would just be-- I won't try to draw it-- you would just have some theta value, let's say an angle where the 1's start and an angle where they stop. That sector would be 1's, and the rest would be 0's. And you can, for example, check the VC dimension of this.
So that would be a simple learning problem. If you already have these circle coordinates, then the labels depend on the circle coordinates in a very simple way. However, if your agent is faced with data that comes from some embedded manifold, but you don't know which embedded manifold, then it's actually a hard learning problem. And so I'm going to show you how-- I'll describe the class p that-- I haven't given you the mathematical definition of p yet. I've just kind of handwaved.
We're going to consider embedded curves with this kind of labeling. But let me specify more precisely what I mean by that. I mean that we'll take an embedded curve, and we'll allow the x data to come from that embedded curve. And then we will assign the labels in the manner that I've just described here.
So there's a huge variety in terms of possible marginal distributions. It could be any embedding of a curve in the plane, so even a non-parametric marginal model right there. On the other hand, the conditional model of the labels given the circle coordinates would be very simple.
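As a toy numerical sketch of this example (the particular embedding and arc endpoints below are made up for illustration): the labels are a simple threshold function of the hidden circle coordinate, while the learner only ever sees points of an embedded curve.

    import numpy as np

    def embed(theta):
        # One possible convoluted embedding of the circle in the plane;
        # the learner does not know which embedding generated the data.
        r = 1.0 + 0.5 * np.sin(5 * theta)
        return np.stack([r * np.cos(theta), r * np.sin(theta)], axis=-1)

    def label(theta, start=1.0, stop=2.0):
        # Label 1 on one arc (in circle coordinates), 0 elsewhere.
        return ((theta >= start) & (theta < stop)).astype(int)

    rng = np.random.default_rng(0)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=1000)  # hidden circle coordinate
    x = embed(theta)   # raw inputs the learner actually sees
    y = label(theta)   # trivial to predict from theta, opaque from x alone

Given theta, a one-dimensional threshold rule suffices; given only x, with the embedding unknown, the problem has the high complexity discussed next.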
Let's see how that measure of complexity applies to that statistical model. The definition said our model has gamma uniform shattering dimension at least n if we can find these n sets and a bunch of distributions in p that assign sufficient probability to those sets, but do wild stuff with the labels. The sets in question here-- just ignore these first top two.
This is a very convoluted-looking picture, but it's still a piecewise linear embedded curve. It's just kind of deformed into a sort of maze shape, but it's still a closed curve. You can check.
And for right now, just ignore these top two horizontal wires there. And let's just focus on the ones that come down below. So the sets S i that I want to talk about are-- this will be the set S1, this pair of wires right here. And then S2 will be this pair of wires, S3 will be that pair, S4 will be that one, and so on. So suppose I want to put labels on S1 up to S6 like this-- 0, 1, 0, 0, 1, 0. This particular distribution here would be one that realizes that.
So we see where the 1's are. Those are on these heavy dotted regions. This set that's S5 has a 1. And this set here that's S2, right here, that has a 1. And we have a label 0 on the others.
So using examples like this, you can show that that huge class p, that joint statistical model that I described before, consisting of all these possible embedded manifolds together with a simple way of labeling, has high uniform shattering dimension. And in fact, you can take more elaborate versions of this and show that it has gamma uniform shattering dimension at least n for any n, by slightly adjusting the construction. And so in that extreme case, learning is impossible. You'll never be able to drive your risk down to 0 using raw data without access to the circle coordinates.
However, the other learning problem that I mentioned, based on circle coordinates, would be very simple. So if you have a lot of unlabeled data, you can first learn some sort of manifold coordinates, and then, after that initial phase which used maybe plentiful but unlabeled data, you can learn the actual labels. So what you end up having is a provable gap here. This is the infimum, over all algorithms that are supervised, of the risk.
It'll be bounded below by some sort of quantity. And here, I've written the risk of an algorithm that's got oracle access to manifold coordinates. That algorithm's risk we can upper bound. So there's this gap between not using the manifold coordinates and using the manifold coordinates. And I'll come back to that in a moment.
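Schematically, the gap has the following shape, with c_1 > c_2 standing in for the actual bounds (which aren't reproduced here):

    \inf_{A \text{ supervised}} \; \sup_{p \in \mathcal{P}} R_p(A) \;\ge\; c_1 \;>\; c_2 \;\ge\; \sup_{p \in \mathcal{P}} R_p(A_{\text{oracle}})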
This sort of theme also occurs in an example of group-invariant feature learning, which I'll describe in a minute. So for this example, this is just a rectangle right here that I've drawn, an interval I here and an interval J. And I'm going to imagine-- that's what these two arrows are trying to symbolize-- that I've connected those two edges that have the arrows, so that actually, the object that I'm looking at is an annulus, like this.
And now we have S1-- that's the circle group, let's say-- acting on this. So it's pushing points along. This is my x. This weird space here is x.
And if we sample data, let's say it's just sampled uniformly along here so we can get unlabeled data points along here. At the same time, we have the circle group S1 acting on this space. And let's say it's doing that-- these are the flow lines of how it acts on the space, so it's pushing an x point here along this orbit.
Let's suppose that the labels are assigned in a very simple way in terms of those orbits, so that there's some kind of threshold orbit along here. And it could be one of these other orbits, but some threshold orbit. Above that orbit, we have label 1 and below that orbit, we have label 0.
So if the learner were given access to the quotient space of x mod S1, which just means some sort of listing of these orbits-- that would essentially just be I-- then for any point out here, we would immediately know where it comes from on this original vertical line. If we knew that, then we could very easily say what kind of label it should have, because the label depends in a very simple way on those coordinates.
But if we're just given this raw picture, the group action could be very convoluted. And then it will be hard for us to guess the label, to predict the label. So once again, to get at how hard it is, I'm going to use this notion of gamma uniform shattering.
And what we need to do is show that we have this-- so once again, I didn't really describe the statistical model, but it's supposed to be that space x, with the uniform marginal on x, and the probability of y given x will basically be label 1 above one of these weird orbits and label 0 below the orbit. So the variety comes from the fact that these orbits can be very convoluted.
That was this joint statistical model. And now let's see what its gamma uniform shattering dimension is. We need to find n sets, S1 up to S n, and see that we have marginals assigning sort of the same probability there. That's easy, because all these marginals are just the uniform distribution in this example. And the conditionals should be able to give us any kind of dichotomy.
So this is an illustration. The sets are these S1, S2, S3, S4, the white regions. And the dichotomy that I'm illustrating here is 0, 1, 0, 0.
This dotted line here is that threshold orbit. And the threshold orbit-- if this action of S1 can be very convoluted, it might very well be like this. That's one of the guys in our big class p.
And for any dichotomy that we come up with, 1, 0, 0, 1, we can find a similar convoluted action that would produce that. So this kind of joint model has a high gamma uniform shattering dimension. And I'll sort of avoid all the technical details, but once again, we achieve some sort of gap between the risk of a learner that's just faced with raw x and y data versus an agent that has oracle access to this list of orbits.
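For concreteness, a toy sketch of this setup too (the warp that makes the orbits convoluted is invented for illustration):

    import numpy as np

    def observe(i, s):
        # A hypothetical convoluted S^1 action: the orbit through i is the
        # wiggly curve {(i + 0.2*sin(3*s), s)}, so the raw first coordinate
        # alone does not reveal which orbit a point lies on.
        return np.stack([i + 0.2 * np.sin(3.0 * s), s], axis=-1)

    def label(i, threshold=0.5):
        # Label 1 above the threshold orbit, 0 below: invariant under the action.
        return (i > threshold).astype(int)

    rng = np.random.default_rng(0)
    i = rng.uniform(0.0, 1.0, size=1000)          # orbit (quotient) coordinate
    s = rng.uniform(0.0, 2.0 * np.pi, size=1000)  # position along the orbit
    x = observe(i, s)   # raw annulus data the learner sees
    y = label(i)        # labels depend only on the orbit

With oracle access to the orbit coordinate i, a one-dimensional threshold again suffices; from raw x, the learner must effectively identify the unknown warp first.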
Those are the two settings. It's kind of technical to compute the gamma uniform shattering dimension, but this is just a mathematical formalism-- obviously an oversimplification of the real situation on the one hand, and kind of complicated and technical to compute on the other. But the point is the insight that it gives. These are very naturally occurring types of joint models, and there are situations where we have some kind of inherent complexity, although in terms of an intermediate representation, we have much less complexity. And the difference between those two is responsible for this gap.
Are there any questions for this particular example?
AUDIENCE: Is there a converse-- if there is a gap on [INAUDIBLE]
MAIA FRASER: Not necessarily the converse in that sense, but you could potentially, if you know that-- so I only gave a definition for the gamma uniform shattering dimension being at least n. But if you knew that it was less than n, then you can potentially devise an algorithm that would give you an upper bound instead of a lower bound. So here, I've actually only used this kind of novel measure of complexity to produce this lower bound.
But that is a nice idea. Maybe that's true, that if you have a situation where you find that there is a gap that cannot be bridged, perhaps that's actually an indicator that there is some kind of factorization underneath. Yeah.
So I should also mention something about these lower bounds that I obtained. In general, when one uses quantities like VC dimension or fat-shattering dimension to derive lower bounds in the PAC learning framework-- remember, in PAC learning, we are looking at worst-case bounds over arbitrary d-- the lower bounds that say the risk of any algorithm will be at least this amount always proceed by cooking up a bad distribution d. Then you show that for this distribution, the risk is bounded below by some number. So that's a place where this arbitrariness of d is quite important. In this framework here, by contrast, the lower bounds are actually proved for marginals that occur naturally in that joint model-- not arbitrary marginals, but only ones that might occur under this assumption of a joint model.
Let me also mention another sort of corollary of that collection of examples of group-invariant features and the analysis that you can do with gamma uniform shattering. There are various settings where you can set up, instead of an S1-- so S1 is this group of circle transformations, and if you discretize that, you would have what's called Z k, the integers mod k. It's basically the k discrete shifts that make up a whole turn, instead of a circle transformation.
So if you look at a setting where you have the group Z n1 cross Z n2-- the integers mod n1 and the integers mod n2-- acting on x, and the labels are invariant under that, then with 2-step learning you get a lower bound on sample complexity that's n1 plus n2: you have an n1 lower bound at the first step and then an n2 lower bound. That's if you learn, first of all, invariant features under one of the factors and then invariant features under the other. Whereas to do this in a flat architecture would require n1 times n2 sample complexity. So this is a much larger sample complexity.
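To make the additive-versus-multiplicative contrast concrete with made-up numbers: taking n1 = n2 = 100,

    n_1 + n_2 = 200 \qquad \text{versus} \qquad n_1 \cdot n_2 = 10{,}000,

a factor-of-50 difference in the sample requirement in this instance.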
And this kind of comparison between addition versus product is also something that appears in a memory-based factorization theorem by Tommy and others from 2012. So in that case, it was a similar kind of statement about memory of additive versus multiplicative memory requirements in a particular architecture. And a similar kind of thing is showing up here in terms of sample complexity.
So let me just then briefly delve into hierarchical reinforcement learning. And I should stop soon, so I won't go too much into this. The example that I've-- I'll just very quickly skip over the definition of the setup, because some of you may be familiar already with it.
And basically, we have a Markov decision process that encodes the environment. What that is, is just a specification of states, actions, transition probabilities to go from one state to another via actions, and rewards. A partially observed version of that is one where, instead of seeing the actual state, the agent just sees some observation. And you have also specified a probability of an observation for a particular state after some action.
So this is kind of the formalism behind this terminology of MDPs and POMDPs. And the job of the agent is to find a policy, which is a map from state to action. So in any given state, the agent should find out what sort of action is the optimal one to do in order to have the greatest expected reward.
We can also express that by saying the agent is aiming to obtain the lowest regret. And I mention it that way because then the analogy with supervised or semi-supervised learning is clearer: there, we were looking for a map h with lowest risk; here, we're looking for a map from S to A with lowest regret.
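In standard notation, a sketch of the definitions being paraphrased (not the exact slides):

    \text{MDP: } (S, A, P, R), \quad P(s' \mid s, a) \text{ the transition kernel}, \quad R \text{ the reward}; \quad \text{policy } \pi : S \to A
    \text{POMDP: additionally an observation space } \Omega \text{ and kernel } O(o \mid s', a)
    \text{Objective: maximize expected cumulative reward, equivalently minimize regret against the optimal policy.}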
So in terms of hierarchical reinforcement learning, this just means that we have a kind of hierarchy in time. There's some hierarchical planning possible. Let's look at the particular example of what are known as options.
So essentially, you have some region of state space where you can initiate an option. And there's a condition for ending the option, and some policy that's specified. So if you call an option, then the agent, starting in some initiation state, will just follow this policy for a while until it gets terminated, and then it passes control up higher.
So this is like this notion here. Here, we're involved in one option. This is a policy that we're following until it terminates. And the termination condition is maybe that the ganache is smooth enough or something.
Then we stop that. We pass control back to the upper agent. And then the upper agent will start the next subroutine, and so on. So that creates a kind of hierarchy of control.
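A minimal sketch of the options formalism in code, following the standard initiation-set/policy/termination decomposition of Sutton, Precup, and Singh (the environment interface here is hypothetical):

    from dataclasses import dataclass
    from typing import Any, Callable

    @dataclass
    class Option:
        can_initiate: Callable[[Any], bool]      # initiation set I
        policy: Callable[[Any], Any]             # internal policy pi: S -> A
        should_terminate: Callable[[Any], bool]  # termination condition beta

    def run_option(step, state, option):
        # Follow the option's internal policy until its termination
        # condition fires, then hand control back up to the caller.
        # `step` is a hypothetical environment transition function.
        assert option.can_initiate(state)
        while not option.should_terminate(state):
            state = step(state, option.policy(state))
        return state

The higher-level agent then chooses among options rather than primitive actions, which is what creates the hierarchy of control just described.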
And I won't go into the technical details here, but you can come up with an artificial reinforcement learning example so that the previous statements about semi-supervised learning give you a gap. Here, though, the gap is that an agent that does not have access to options will have a provably lower reward than an agent that has access to options. So the analogy that's kind of being cooked up here is that these options are sort of like the intermediate representations from before, and in this model that I set up here, they can actually be learned under partial observability, just the way the intermediate representations could be learned with partially labeled data.
So in summary, the general benefit of multi-step learning that I'm trying to get at with this mathematical formalism is that intermediate representations are useful when they induce a factorization of the learning problem into one of lower complexity. And we sort of accessed that with these complexity measures: after factoring, the VC dimension, for example, was very low, whereas before, there was high complexity, as measured by that gamma uniform shattering dimension that we had calculated. And the other message is that these useful representations can sometimes be learned from cheaper, partially labeled or partially observed data.
[APPLAUSE]