Calibrating Generative Models: The Probabilistic Chomsky-Schützenberger Hierarchy
Date Posted:
November 27, 2019
Date Recorded:
October 29, 2019
Speaker(s):
Thomas Icard
Description:
Thomas Icard, Stanford
Abstract: How might we assess the expressive capacity of different classes of probabilistic generative models? The subject of this talk is an approach that appeals to machines of increasing strength (finite-state, recursive, etc.), or equivalently, by probabilistic grammars of increasing complexity, giving rise to a probabilistic version of the familiar Chomsky hierarchy. Many common probabilistic models — hidden Markov models, generative neural networks, probabilistic programming languages, etc. — naturally fit into the hierarchy. The aim of the talk is to give as comprehensive a picture as possible of the landscape of distributions that can be expressed at each level in the hierarchy. Of special interest is what this pattern of results might mean for cognitive modeling.
PRESENTER: All right, hi, everybody. Today we want to welcome Thomas Icard. He's here talking to us from Stanford, where he's an assistant professor in the Philosophy Department and also in the Computer Science Department. Thomas has written on numerous topics: on the role of randomness in cognition, on causation and counterfactual cognition, as well as numerous topics in philosophical logic. And today he is here talking to us about the probabilistic Chomsky hierarchy. So with that, thank you, Tom.
THOMAS ICARD: Great. Thank you. So thanks so much to Tyler and the BCS-- thanks so much to Tyler and the BCS Philosophy Circle and the CBMM for having me here. It's really a pleasure to be here. It's such a stimulating place to be.
So the topic I want to talk about today is maybe a little bit of an unusual one for this seminar, and a little bit theoretical. I want to talk about the probabilistic Chomsky-Schutzenberger hierarchy. And there's kind of two ways to get into this subject. One is if you already know something about and care about the classical Chomsky-Schutzenberger hierarchy of grammars, or of machines-- a way of thinking about formal languages.
And the classical hierarchy is extremely well understood from many perspectives, from the perspective of expressivity and definability of languages, as sets of strings, to algebraic perspectives, and it's extremely well mapped out. It's been extended in various ways. It's extremely systematically well studied.
And so from one perspective, what I'm going to try to do here is give my best attempt at starting on something like this for the probabilistic version of the Chomsky-Schutzenberger hierarchy. And what that means is essentially take all of the grammar notions or machine notions that you had in the classical hierarchy and add probabilities and then ask not about expressivity of these types of objects with respect to formal languages, sets of strings, but rather probability distributions over strings or natural numbers or discrete objects of some sort.
So that's one way into the topic. Another way into the topic is if you don't necessarily think much about the Chomsky-Schutzenberger hierarchy, but just think about generative models, probabilistic generative models. And you might want to ask, for certain kinds of generative models, what kinds of distributions can these generative models express?
So I'm thinking about this topic a little bit from a kind of philosophical logician's perspective, thinking about expressivity. How can we kind of calibrate probabilistic generative models in that sense? And this is one way of doing that.
So specifically what I want to do is first just give a little bit more motivation and background-- and can everyone see that? Yeah, OK-- give a little bit more motivation and background, and then actually just go through various levels of the hierarchy and say at least what I know. Probably people here know how to add little bits here and there, or maybe big bits, to what I'm going to say, but I'll basically say everything I know about each level of the hierarchy and what can and cannot be done.
And so I want to start at the top level, the unrestricted case. This is essentially the domain of unrestricted probabilistic grammars, probabilistic Turing machines, probabilistic programs from some Turing complete probabilistic programming language, and then I want to jump all the way down to the bottom of the hierarchy and talk about probabilistic regular grammars. This is another way of thinking about probabilistic finite-state automata or hidden Markov models.
And then I want to move progressively up the hierarchy through some familiar and maybe less familiar types of objects, like probabilistic context-free grammars, probabilistic mildly context-free grammars-- these are kind of motivated by some work in linguistics, thinking about language modeling-- then up to probabilistic context-sensitive grammars and then end with some open questions and some discussion points.
So I guess I usually like being interrupted. But maybe given the format of this, that might not be a good idea. But raise your hand if something's unclear or if you didn't understand what I said.
OK, so just to kind of flash up the picture, at least at a qualitative level, of what this is going to look like: over here on the left-hand side is the classical-- I'll just call it the Chomsky hierarchy. Over here is the probabilistic version. And I just want to flag that a number of things are very different. And so some of the things we want to see are, what are the differences with the classical hierarchy?
OK, so a little bit in the way of motivation, we're going to be talking about probabilistic generative models. And in particular, I'm motivated by use of these for cognitive modeling, for cognitive science. And so one question is, why generative models? And arguments for generativity go way back to the beginning of cognitive science, and this is going to be kind of just rehearsing a couple of points there.
One is that it gives a kind of way of thinking about inference and learning and other topics in terms of kind of analysis by synthesis-- so this idea that goes way back at least to the 50s-- and the idea that in order to make some inference about something, you can kind of generate some sort of internal model and compare it to the world in some way.
And related to this, the idea that, especially in kind of dynamic domains or temporally extended domains, one way of representing something is an implicit way-- by essentially deriving some sort of internal simulation of the thing that you're representing-- so representing it in this kind of semi-iconic way.
And then the third sort of motivation for generativity-- there are more, but this one is probably the best known-- namely productivity. This famous quotation from von Humboldt-- "making infinite use of finite means." And more generally, productivity involves the idea that you can take some small number of primitives and generate a whole bunch of things very easily and quickly.
OK, so there's lots more arguments for generativity, why it's useful to model things-- aspects of cognition-- as in some way generative. But you might also ask, why probabilistic? So why are we interested in probabilistic generative models and probabilistic grammars and so on?
And one reason is kind of obvious, that many processes-- in particular, mental processes-- are, in fact, well modeled as being at least quasi random or pseudo random or well modeled as being random. So that's kind of obvious.
Why would that be? Well-- and each one of these is, of course, a topic unto itself-- one reason is that randomization might confer some benefit in terms of efficiency. So you might expect various aspects of our mental lives to be randomized or approximately randomized.
A third topic that, again, is a topic unto itself is that you might be able to think of some kinds of probabilistic generative processes as encoding the agent's uncertainty in some way. Again, this is a sort of semi-iconic kind of idea-- that the generative process itself can be used by the agent in a way that functions as a kind of representation of uncertainty. So it's both kind of receptive to data and also guides the agent's action in a certain kind of way that is reminiscent of probabilities and so on and so forth. So you can imagine other reasons.
Of course, probabilistic generative models abound in cognitive science and cognate disciplines-- so things like hidden Markov models, Boltzmann machines, and other kinds of generative neural networks, Bayesian networks, probabilistic context-free grammars, probabilistic programs, and so on and so forth. So these are very familiar.
And this is one way of thinking about a probabilistic process: by giving the actual process, giving the generative process in some way, and then this defines perhaps some probability distribution over outputs of the process in some sense.
So the distribution's sort of implicitly defined in that sense. But, of course, often when we're doing modeling, we also give more explicit characterizations of distributions, more kind of analytical descriptions of probability distributions, and I want to give a couple of these just to put some on the table, the kinds of things that are common in modeling to say what might be the target, what kinds of things might we want to understand with respect to generative models.
So for instance, famous example-- Poisson distribution. So you give the probability that you'll see, say, k events in some interval given some fixed parameter lambda, some rate parameter. And you can consider this sort of distribution on the natural numbers for different rate parameters lambda.
And this is, of course, a common modeling tool in areas like memory, and also in neuroscience it comes up in interesting ways. And so you might ask, what kinds of generative models could actually implicitly encode, for example, Poisson distribution?
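As a minimal illustration of the analytic description just mentioned, here is a short Python sketch (the function names are mine, not from the talk); the sampler is the standard Knuth-style one, included only to contrast the explicit formula with a generative procedure.

```python
import math
import random

def poisson_pmf(k, lam):
    """The analytic description: P(K = k) = lam**k * exp(-lam) / k!."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

def poisson_sample(lam):
    """Knuth's sampler: multiply uniforms until the product drops below
    exp(-lam); the number of multiplications needed, minus one, is Poisson."""
    threshold = math.exp(-lam)
    k, product = 0, 1.0
    while True:
        product *= random.random()
        if product <= threshold:
            return k
        k += 1
```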
A very different kind of probability distribution are those associated with random walks. So here's an example. Imagine I have a random walk on the natural numbers, just a symmetric random walk. I start at one, and then I randomly go left or right, and so I bounce around a little bit. And then after some number of steps, I get to zero.
So you might ask for the sort of distribution on hitting time. So what's the probability that I get back to zero in k steps? So that's another important distribution. This comes up in modeling, things like decision making and semantic memory and so on.
And this also has a kind of analytic description. We can just say what that probability distribution is. And of course, in this particular example, you only have positive probability on odd numbers. But 2k plus 1 has probability given by 1/2 to the power of 2k plus 1, multiplied by C_k.
So you have to count the number of such paths, and that happens to be the kth Catalan number. And so you can just explicitly say what this probability distribution is in this kind of analytic form.
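For concreteness, a small Python sketch of that closed form, with a simulator to check it against (the function names and the step cap are my own choices):

```python
import math
import random

def hitting_prob(k):
    """P(T = 2k + 1) = (1/2)**(2k + 1) * C_k, with C_k the kth Catalan number."""
    catalan_k = math.comb(2 * k, k) // (k + 1)
    return catalan_k / 2 ** (2 * k + 1)

def sample_hitting_time(max_steps=100_000):
    """Simulate the symmetric walk started at 1 until it first reaches 0.
    It gets there with probability one, but we cap the run just in case."""
    pos = 1
    for step in range(1, max_steps + 1):
        pos += random.choice((-1, 1))
        if pos == 0:
            return step
    return None

# hitting_prob(0) == 0.5 and hitting_prob(1) == 0.125; a Monte Carlo estimate
# from sample_hitting_time should agree with these values.
```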
OK, one more, again different example, just to sort of motivate the kind of question: the kind of distribution that might show up in hierarchical Bayesian models. So I can think of a beta-binomial distribution, or Dirichlet-multinomial, where I draw some parameter p from a beta distribution and then I draw k from a binomial using that parameter.
OK, so this is another example that comes up in many places and gives a distribution on natural numbers, and we can also write this explicitly using the beta function and so on and so forth. So these are the kinds of things where what we're going to be interested in is basically what kinds of generative procedures, generative processes, can define what kinds of probability distributions.
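Here is a hedged sketch of the beta-binomial case, writing the pmf with the beta function as described and pairing it with the two-stage generative version (function names are mine, not from the talk):

```python
import math
import random

def beta_binomial_pmf(k, n, a, b):
    """P(K = k) = C(n, k) * B(k + a, n - k + b) / B(a, b), written with the
    beta function, as in the analytic description."""
    def beta(x, y):
        return math.gamma(x) * math.gamma(y) / math.gamma(x + y)
    return math.comb(n, k) * beta(k + a, n - k + b) / beta(a, b)

def beta_binomial_sample(n, a, b):
    """Generative version: draw p ~ Beta(a, b), then k ~ Binomial(n, p)."""
    p = random.betavariate(a, b)
    return sum(random.random() < p for _ in range(n))
```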
So for example, these three I just gave-- where essentially in the Chomsky hierarchy, or the Chomsky-Schutzenberger hierarchy, can we actually get something that defines each one of these? So that's the kind of game that we're going to be playing.
OK, good. So I'm going to be looking at this from a particular perspective of formal grammars. Why? Well, formal grammars are nice in many ways. For one thing, formal grammars themselves are used often in modeling and language and in other places. So they're of interest in their own right.
But also, the kinds of grammar classes I'm going to be discussing, probabilistic grammar classes, actually correspond, in each case, to very natural machine notions and other kinds of generative models that we know and use. So it's a nice way of thinking about calibration, and it's all very kind of simple to kind of vary what you allow and what you disallow. It's sort of one language for thinking about computation. And so that's what I'm going to be adopting.
And, well, just to review, I'm going to review grammars really quickly. But if you haven't seen grammars, this will go way too quickly, of course. So what do we have? We have some set sigma of terminal symbols. This is the alphabet. We're interested in distributions over these things, strings over these things.
And we also have some set N of non-terminal symbols, and we're interested in productions, or production rules, that say, if I see an alpha, I rewrite it as a beta. So these are kind of rewrite systems. And alpha and beta, in the most general case, can be anything-- any strings over sigma and N.
And then what's a grammar? A grammar is just a finite set of production rules like this and a special start symbol, S. So we can think of a grammar as generating strings by starting with the start symbol and then iteratively applying these rules until I end up with a string in sigma, just a string of terminal symbols. OK, so that's a grammar.
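Just to make the rewrite-system picture concrete, here is a minimal Python sketch of a grammar as a finite set of productions plus a start symbol; the string encoding (upper-case non-terminals, lower-case terminals) and the random choice of which applicable rule to fire are my own simplifications, not anything from the talk.

```python
import random

def derive(productions, start="S", max_steps=100):
    """Grammar as a rewrite system: productions are (alpha, beta) string pairs.
    Keep rewriting until only terminal (lower-case) symbols remain."""
    string = start
    for _ in range(max_steps):
        if string == "" or string.islower():
            return string                           # all terminals: done
        applicable = [(a, b) for a, b in productions if a in string]
        if not applicable:
            return None                             # stuck: no rule applies
        alpha, beta = random.choice(applicable)     # pick some applicable rule
        string = string.replace(alpha, beta, 1)     # rewrite leftmost occurrence
    return None

# For example, the regular grammar S -> aS | a generates a, aa, aaa, ...
print(derive([("S", "aS"), ("S", "a")]))
```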
And what is the Chomsky hierarchy of grammars? Well, I'm going to start at the bottom-- at the top here. So we start with regular grammars, which are the most restricted. And they say that we can only have production rules of the following form: some non-terminal symbol X gets rewritten as either some string sigma of terminal symbols, or some string sigma of terminal symbols together with another non-terminal appended to the right. That's all you get in a regular grammar, or a type-three grammar.
Context-free grammars are a little bit more general. We just allow production rules where I take a non-terminal and rewrite it as some string, any string whatsoever. So alpha here means that can be made up of terminals and non-terminals.
And what is a context-sensitive grammar? Well, it's kind of like a context-free grammar, but these productions can be context sensitive. So I can say that X gets rewritten as gamma, but in this context, when flanked by alpha and beta. And it's also stipulated that gamma here cannot be the empty string. And to get around the issue of languages not including the empty string, we also allow that S can go directly to the empty string, as long as S doesn't occur on the right-hand side of any production.
So that's the context-sensitive grammar. And finally, unrestricted grammars are just that. They're just rewrite systems where you can rewrite any sequence of strings, any string as any other string. OK, so that's the hierarchy, the classical hierarchy.
Just to give one quick example, regular versus context free-- this is an example of a regular grammar where I have a terminal followed by non-terminal and then just a terminal, and here I can have two non-terminals on the right hand side. So that's really the only difference between regular and context free.
And if I take this grammar, I can look at the space of all possible derivations, and it looks something like this. I could derive a in one step, I could derive aa, I could derive aaa, and so on and so forth. So this kind of shows what the language generated is, and it's just the whole set of strings over the alphabet a. So it just generates all of them.
What about the context-free grammar? We can also look at this one. It's also a binary tree, and it can generate a. But, in general, it only can give you odd numbers of a's. So you get a. You get aaa, and aaaaa, five of them. And it's actually going to be important momentarily that this generates five a's in two different ways. So that's going to matter when we're interested in probabilities.
So this generates the strings with an odd number of a's. This is not a non-regular language-- it's also a regular language-- but this is a context-free grammar for generating that language.
OK, so what is a probabilistic grammar? There are different ways of defining it, and I'm going to define it in what I think is the simplest way possible. It's just going to be a grammar such that for every alpha I have at most two betas such that alpha goes to beta is a production rule-- at most two. I could have zero, one, or two.
And the intuition is basically if I reach in my alpha bag, if I have a bag of production rules for alpha, if I have zero I just stall and come up short. If I have one, I just grab that one. If I have two, I probabilistically choose one or the other. I assume the bag's really well mixed up, so I just grab a rule for alpha and get either, let's say, beta one or beta two-- so just a coin flip to determine which one I apply. And that's the only thing I'm going to assume. That's what a probabilistic grammar is.
OK, so just like a classical grammar gives rise to a set of strings, one of these gives rise to a kind of distribution on strings. And how does that work? Well, I can think about the sequence of productions that I apply and the probabilities of each of those. I multiply all of them to get my string, and that gives me the probability of a particular derivation of the string, and then I just add up all the probabilities of all the derivations of the string.
So in this example, going back to this example, the probability of a in this grammar-- this was a grammar that satisfied my requirement of having at most two rules for each alpha-- a has probability 1/2, aaa has probability 1/2 to the power of 3, and then aaaaa, five a's, has probability 1/2 to the power of 5 plus 1/2 to the power of 5, because I have two derivations of it, and so on and so forth.
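A quick way to check those numbers is to simulate the grammar. The sketch below assumes the context-free example is S -> a | a S S (the exact placement of the non-terminals doesn't change the length distribution, which is all we track here); it just counts how many S's remain to be rewritten.

```python
import random
from collections import Counter

def sample_length():
    """Length of a string from the PCFG  S -> a | a S S  (probability 1/2 each).
    Every rule application emits one 'a'; `pending` counts the S's still
    waiting to be rewritten, so the run behaves like a random walk."""
    pending, length = 1, 0
    while pending > 0:
        length += 1
        pending += 1 if random.random() < 0.5 else -1   # aSS nets +1 S, a nets -1
        if length > 1_000:     # halts with probability one, but guard long runs
            return None
    return length

# Expect roughly P(1) = 1/2, P(3) = 1/8, P(5) = 1/16, as computed above.
counts = Counter(sample_length() for _ in range(50_000))
print(counts[1] / 50_000, counts[3] / 50_000, counts[5] / 50_000)
```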
OK, good. So that's a probabilistic grammar. So now I want to start to consider what do different classes of probabilistic grammars look like? What can they do? So I want to start at the top, the most general class, and say, what kinds of objects are we even dealing with in general?
And here's an example of an unrestricted probabilistic grammar that doesn't satisfy any of the requirements for any of the other classes. So it's not context free or anything. But it is probabilistic. And in particular, you have this yz rewriting to u or vz with probability 1/2. And you don't need to work through this example, but if you play with it a little bit, you can see that it defines this distribution that only has support on powers of 2. So it only gives positive probability to powers of 2.
And the probability of 2 to the k is 1 over 2 to the k. So it's like a geometric distribution, but you don't return k, you return 2 to the k. So that's an example of a probabilistic grammar that has this distribution associated with it.
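A sketch that mimics that distribution directly (the sampler imitates the distribution itself, not the grammar's rewriting mechanics):

```python
import random

def sample_power_of_two():
    """k >= 1 is drawn geometrically, P(k) = (1/2)**k, and we return 2**k,
    so P(2) = 1/2, P(4) = 1/4, P(8) = 1/8, and so on."""
    k = 1
    while random.random() < 0.5:   # with probability 1/2, bump k one more time
        k += 1
    return 2 ** k
```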
So the first result about unrestricted probabilistic grammars is that these are kind of natural and correspond to a number of other natural notions. So you can show, basically extending the classical equivalence between unrestricted grammars and Turing machines, that this extends to the probabilistic setting pretty straightforwardly.
So if I have a probabilistic Turing machine-- and I can say more about how we're defining those-- I can always get some probabilistic grammar that defines the same distribution, and vice versa. So these are equivalent. And probabilistic Turing machines are often used as kind of models of probabilistic computation in general, models of probabilistic programming languages and so on-- so really in that domain of arbitrary probabilistic programming languages and probabilistic Turing machines.
OK, so what kinds of things are these that they define? What kinds of objects are these? Well, I want to introduce a definition: let's say mu, from the natural numbers to the unit interval, is a semi-measure if, adding up the measures of all the numbers, the sum is less than or equal to one.
And we have to make this less than or equal to one because grammars and machines and programs might not halt. There might be some positive probability that they don't halt. And so the sum over all the outputs need not actually be one. But it should be less than or equal to one.
And so I'm going to call that a semi-measure. And if I take a semi-measure, I can say that it's semi-computable, sometimes called lower semi-computable, if for every number I can approximate the probability of that number from below. So I can get some computable sequence of rational numbers that converges to mu of k from below, uniformly in k.
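In symbols, the two definitions just given amount to something like this (my notation, not the slides):

$$\mu : \mathbb{N} \to [0,1] \ \text{ is a semi-measure iff } \ \sum_{k \in \mathbb{N}} \mu(k) \le 1,$$

and $\mu$ is (lower) semi-computable iff there is a computable array of rationals $q_{k,i}$ with $q_{k,i} \le q_{k,i+1}$ and $\lim_{i \to \infty} q_{k,i} = \mu(k)$ for every $k$.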
OK, so that's what it is to be semi-computable. And the next kind of result is that-- what do probabilistic Turing machines or probabilistic grammars define? They define exactly the semi-computable semi-measures. So not only is every kind of distribution defined by probabilistic grammar, probabilistic machine, a semi-computable semi-measure, but for any semi-computable semi-measure, I can find some machine that has exactly that behavior. So this includes uncomputable things, merely requiring semi-computability.
But this does have a pretty quick corollary, which probably people are familiar with, which is that if I restrict attention to those grammars or machines that kind of halt with probability one, then those define exactly the computable probability measures. What's a computable probability measure? That's requiring not just that I can approximate from below, but I can also approximate from above.
So I can actually get-- I sort of know what kinds of bounds I'm getting as I approximate the probability, and that's a computable probability measure. And if my machine or my grammar or whatever halts, gives an output with probability one, then it defines exactly one of those and vice versa. You can always find one.
So it's kind of interesting, this kind of area between sure termination, or almost sure termination-- that is, termination with probability one-- and positive probability of non-termination, in particular because, well, the problem of determining whether a program or grammar halts with probability one is undecidable, but it's also actually harder-- it's more undecidable than the halting problem, the classical halting problem. It's actually a harder problem.
And I want to give just some indication of why it's hard-- so give an example on the boundary of this. So this is a kind of tortoise and hare kind of example, which is kind of inspired by some kinds of examples that come up in theory of algorithms.
So imagine we have this tortoise and hare kind of scenario, and we start out with the tortoise at step one-- so sort of one step ahead. And the hare is at step zero. And now, as time goes on, at each tick of the clock, the tortoise is going to move forward just one step every time.
And now the kind of erratic hare is sometimes going to go forward-- with probability 1/4 it's going to move forward. And if it does, then it moves forward some random number of steps between one and seven. So you can imagine this tortoise going steadily forward and the hare kind of randomly moving forward different amounts at different times.
So question-- will this program with probability one terminate? Will it actually give a probability distribution, not just a semi-measure? That is, will the hare with probability one catch up with the tortoise?
So it may not be immediately obvious. But in this particular example, the hare will, with probability one, catch up with the tortoise, and you can see that maybe by noting that t minus h, through time, forms a random walk martingale, and then by a general martingale convergence theorem, you can actually show that this thing does reach zero, so to speak. So it does.
But just some small change to the program-- so for example, adding epsilon to the pace of the tortoise now makes it so there's actually positive probability that the hare will never catch up. So there's a very fine line between almost sure termination and positive probability of non-termination. It's a very difficult thing.
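Here is a small Monte Carlo sketch of the race (the parameters, names, and the choice of a larger pace for the perturbed case are mine; a finite simulation can only suggest the almost-sure versus positive-probability contrast, not prove it):

```python
import random

def hare_catches_tortoise(pace=1.0, max_ticks=10_000):
    """One run: the tortoise starts one step ahead and moves `pace` per tick;
    the hare moves forward a uniform 1-7 steps with probability 1/4.
    Returns True if the hare catches up within max_ticks."""
    tortoise, hare = 1.0, 0.0
    for _ in range(max_ticks):
        tortoise += pace
        if random.random() < 0.25:
            hare += random.randint(1, 7)
        if hare >= tortoise:
            return True
    return False

# With pace 1 the gap is a martingale and the hare catches up with probability
# one; nudging the pace up leaves a positive probability that it never does.
for pace in (1.0, 1.5):
    runs = 1_000
    hits = sum(hare_catches_tortoise(pace) for _ in range(runs))
    print(pace, hits / runs)
```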
So in particular, what that means is that there's no natural class of grammars that I can just pick out syntactically or of machines that almost surely halt. So in some sense, this result is characterizing just the general class that we know how to define. OK, good.
OK, so that's a little bit about unrestricted probabilistic grammars. They can definitely define all the examples that I gave before. They can define Poisson distributions. So I can come up with a machine that returns k with exactly the Poisson probability on k, and so on and so forth. Pretty much anything that anyone's ever imagined, including uncomputable things that are at least semi-computable, can be encoded by some probabilistic grammar.
But what if we move down now to the bottom of the hierarchy-- so probabilistic regular grammars? So as I mentioned at the very beginning, these are an interesting class. Even though people don't usually consider probabilistic regular grammars, they're equivalent to types of models that are commonly explored, like probabilistic finite-state automata and hidden Markov models.
And so these things are actually quite well studied. And most of what I'm going to be saying right now is just straight from the literature and actually not just one literature, but multiple literatures that have kind of discovered some of these things independently.
So just to give some sense of what these things can do, I want to give one example because it's maybe a little bit surprising when you first see it. So there's this argument that goes back to the '40s, Von Neumann, about how to kind of mimic a fair coin if you have a biased coin. So probably people know this.
So the idea here is that I have a q-biased coin. I don't even know what q is. And I want to use this q-biased coin to simulate a fair coin. And so there's a nice trick that you can do there, which is I keep flipping it, and as soon as I see a heads followed by tails, then I say heads. If I ever see a tails followed by heads, I say tails. And then those two have the same probability, so, in fact, I've simulated a fair coin.
So if I wanted to write down grammar rules like this with half-half probabilities, I could do that. Basically, even though this looks like a while loop or an until loop or something, it can actually be done in this finite-state regular grammar kind of land.
So the thing that I just said essentially can be reproduced by just adding two more non-terminals and a couple of rules, where everything here is using the q-biased rules. OK, so this is possible. So that's just to give you some sense that these things can actually do a lot.
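For reference, here is the von Neumann trick itself in Python, in its paired-flips form, rather than the grammar encoding on the slide (function names are mine):

```python
import random

def biased_coin(q):
    """A q-biased coin: 1 (heads) with probability q, 0 (tails) otherwise."""
    return 1 if random.random() < q else 0

def fair_flip(q):
    """Flip the biased coin twice; heads-tails means heads, tails-heads means
    tails, otherwise try again. Both outcomes have probability q*(1-q), so the
    result is fair whatever q is."""
    while True:
        a, b = biased_coin(q), biased_coin(q)
        if a != b:
            return a

# Estimate should be near 0.5 regardless of the bias, e.g. q = 0.9:
print(sum(fair_flip(0.9) for _ in range(100_000)) / 100_000)
```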
OK, so what else can they do? Well, they can also define all sorts of other probability distributions, not just one half and, in general, not just dyadic numbers, but they can also define distributions with rational numbers, with rational weight. So if I want to simulate x goes to each one of these things with equal probability, all three of them, I can do that as well.
And so you don't need to study this-- it's small, it's not even hard. And the idea here is that I take the binary expansion of 1/3, which is 0, 1, 0, 1, 0, 1 repeating, and I basically just need to take that and put it into the automaton-- make the automaton mimic that binary expansion.
And this is possible in general. So I can do this for any rational numbers, for any finite distribution-- that is, distribution with support on some finite set-- so positive probability only for finitely many numbers-- and any rational values for those that sum to one or less than or equal to one, I can find some regular grammar, probabilistic regular grammar, that mimics that.
And it's essentially just because rational numbers have binary expansions that are eventually periodic, eventually repeating some sequence. And so we just take that and put it into an automaton-- a probabilistic automaton, essentially. So there's more to say about that, but that's possible.
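One way to see the binary-expansion idea concretely: flip fair coins against the digits of the expansion and stop at the first disagreement. The sketch below is my own rendering of that idea, not the automaton construction itself; it works because 1/3 = 0.010101... in binary is periodic, which is exactly what lets a finite-state device keep track of where it is.

```python
import random

def bernoulli_from_binary(bits):
    """Return 1 with the probability whose binary expansion (after the point)
    is the digit stream `bits`: flip fair coins against the digits and stop at
    the first disagreement."""
    for b in bits:
        c = random.randint(0, 1)
        if c != b:
            return 1 if c < b else 0   # coin 0 against digit 1 counts as success

def expansion_one_third():
    """Binary expansion of 1/3 = 0.010101..., an eventually periodic stream."""
    while True:
        yield 0
        yield 1

# Estimate should be close to 1/3.
print(sum(bernoulli_from_binary(expansion_one_third()) for _ in range(100_000)) / 100_000)
```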
So what does this mean? This actually means that we get a lot. We get a lot. So probabilistic regular grammars can already do, for example, beta binomial distributions or the multinomial generalization. As long as the parameters are in the natural numbers, then these are all finite rational distributions. So they can do that.
They can also encode Bayesian networks with rational parameters. And this next point is a little bit more theoretical. If we're interested in continuous distributions-- everything that I'm talking about explicitly here is discrete distributions.
But if I'm interested in continuous distributions, then these kinds of finite-support rational measures are actually dense in the space of Borel probability measures, which means I can arbitrarily well approximate any kind of reasonable continuous distribution with these little finite-state automata-- well, not little, actually, but with these probabilistic finite-state automata. They might not be little-- in fact, arbitrarily large, of course. But as a class, they're actually quite expressive, especially if we're interested in approximation.
OK, so that's pretty good. You might not be surprised to know that probabilistic regular grammars can only define rational-valued distributions. So we never get outside of rational numbers, and there are various ways of seeing that. I won't go through any proofs here, but this is an old result.
And now, we might have the idea that, well, yes, they can only define rational value distributions, but I could have just added in some other parameters. I could have added in various kinds of transcendental numbers as weights and so on, and then maybe I'd be able to do things like Poisson distribution or others that involve irrational numbers.
And so you might have the intuition that probabilistic regular grammars are not just deficient in the probability values they can define, but somehow more structurally-- there's something about the structure of these distributions that's simple. And so that intuition can be made more precise by talking about generating functions.
So this is really where Schutzenberger comes into the story. This is a French mathematician who worked a lot on formal language theory and automata in the '50s and '60s and brought these kinds of tools from analytic combinatorics into formal language theory-- in particular, generating functions.
And I'm going to be talking about a particular kind of generating function, namely a probability generating function. And essentially all that's going to be going on here is I take a measure, or a semi-measure, and then I consider this as a sort of infinite power series.
So I have some z to the k, and the coefficient of z to the k is just the measure of k. OK, so think of it as like an infinite polynomial. If you haven't seen these things before, they're just higher and higher terms. And you can just go back and forth between measures and probability generating functions.
But these, as objects, are quite useful because we can talk about properties of them. So one property that these might have is they might be rational. So it might be that this entire infinite power series is actually equal to some ratio of polynomials in z-- there's some ratio q0 of z over q1 of z. Many distributions are like this, like a simple geometric distribution where I start flipping a coin and return k if k is the first time I see a tails-- that has a rational generating function, like 1 over 2 minus z or something, and many others do as well.
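In symbols (my notation), the probability generating function of a semi-measure $\mu$ is

$$G_\mu(z) = \sum_{k \ge 0} \mu(k)\, z^k.$$

For the geometric example: if $k$ counts the heads before the first tails, $\mu(k) = (1/2)^{k+1}$ and

$$G_\mu(z) = \sum_{k \ge 0} \frac{z^k}{2^{k+1}} = \frac{1}{2 - z},$$

a ratio of polynomials, hence rational. (Counting the index of the first tails instead gives $z/(2-z)$, also rational.)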
A little bit more complicated are algebraic PGFs, probability generating functions, which require that this power series is a solution to a particular polynomial equation. So there's some polynomial in y and z, and if I plug in this power series for y, then I get a solution.
And actually, rational is the special case where that polynomial is degree one in y. So this is a little bit more general-- you might get some more probability distributions that way. And then, finally, you're transcendental if you're not algebraic.
So this is a well known kind of classification, and it also continues and is refined and so on. But it's already a kind of helpful tool for thinking about generative models. So the first result-- and this is not explicitly stated by Schutzenberger, but it's pretty easy to get from his more general results, is that the probability generating function for any probabilistic regular grammar is rational. It's just a ratio of polynomials.
And that's actually important because it already shows us that this random walk hitting time distribution cannot be encoded by any probabilistic finite-state automaton or probabilistic regular grammar. So this has a PGF involving a square root, and it's algebraic but not rational. So you can't do it. So think of this maybe as a pumping lemma kind of thing, some tool for showing that we cannot define a given distribution at a given level.
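For the record, the hitting-time PGF alluded to here is, by a standard calculation (not from the slides):

$$G(z) = \sum_{k \ge 0} \frac{C_k}{2^{2k+1}}\, z^{2k+1} = \frac{1 - \sqrt{1 - z^2}}{z},$$

which satisfies the polynomial equation $z\,y^2 - 2\,y + z = 0$ in $y = G(z)$, so it is algebraic, but it is not a ratio of polynomials.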
OK, so what can define things like distributions associated with random walks? Well, actually these are quite natural for context-free grammars. So if we move to-- this was the random walk hitting time example. If we move to context-free grammars, actually the very example that we saw defines exactly this distribution.
So this very simple grammar, now thought of as a probabilistic grammar-- you can think of it as starting at time one with an S. And then we're either adding an S that needs to be removed-- so we're moving one step to the right-- or we're subtracting an S, moving one step to the left. So this can be thought of as exactly that kind of random walk, and it has exactly that distribution in terms of the Catalan numbers and so on.
OK, good. So with probabilistic context-free grammars, PCFGs, we do get beyond the rational, at least in this sense. The probability values here are all rational, but PCFGs can also define algebraic probabilities. And here's one kind of fun example from literature and probabilistic programming.
So S here goes to SSS or to the empty string. Well, what's the probability of generating the empty string-- essentially, of halting? Well, it's 1/2 plus 1/2 times the probability of that happening three times over. So it's the solution to a particular equation: x equals 1/2 x cubed plus 1/2. And the solution to that actually is kind of a fun example, because it's 1 over the golden ratio-- so an algebraic number, but not rational. So you can get algebraic numbers that aren't rational with PCFGs.
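A small numerical check of that fixed-point equation; iterating from zero converges to the least fixed point, which is the halting probability:

```python
# Halting probability of  S -> empty string (prob 1/2) | SSS (prob 1/2):
# the least fixed point of  x = 1/2 + (1/2) * x**3,  found by simple iteration.
x = 0.0
for _ in range(200):
    x = 0.5 + 0.5 * x ** 3
print(x)                      # ~0.6180..., i.e. 1 over the golden ratio
print((5 ** 0.5 - 1) / 2)     # closed form (sqrt(5) - 1) / 2, for comparison
```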
And here's a slightly crazier example. I'm taking advantage of the fact that I can always write down any rational probabilities, because PRGs, probabilistic regular grammars, can do that, and certainly probabilistic context-free grammars can do that. So here's a little grammar such that if you do that same kind of solution, you get something which has no closed form at all-- it can't even be solved in radicals.
So this can happen in PCFGs fairly quickly, and people are worried about this kind of thing because it makes things like inference hard. So you get algebraic numbers pretty quickly.
In terms of limitative results-- so these were not accidental examples-- the probability generating function for a PCFG is always algebraic. So what does that mean? That means that, for example, PCFGs cannot define Poisson distributions, because not only are the numbers there, the probabilities themselves, transcendental, but the PGF for the Poisson distribution is transcendental. So this is definitely a limitative result.
For those maybe who are into formal language theory and know about this kind of thing, I want to make a couple of remarks about this result, which is related to but importantly different from an old theorem from the early '60s by Chomsky and Schutzenberger. They considered a different kind of generating function for context-free grammars where you're essentially counting words of a given length. So you can form a kind of power series with that.
And what they showed is that that generating function, for any unambiguous context-free grammar, is algebraic. OK, so that's the Chomsky-Schutzenberger theorem. And in particular, one really good way of showing that a context-free grammar is ambiguous is to show that its generating function is transcendental. So there definitely are going to be examples of transcendental generating functions for classical context-free grammars.
And this doesn't happen here. Any probabilistic context-free grammar whatsoever will have an algebraic generating function. So probabilities, in some sense, kind of wash things out in that way.
There's another result that I want to mention here, if people know this kind of thing, of Parikh's theorem. So Parikh's theorem implies that if I consider a one-element alphabet-- so an alphabet with just a, like the examples that I've been doing-- then context-free collapses down to regular.
The context-free languages over a one-element alphabet are just the regular languages over that alphabet. This dramatically does not happen here. All of the examples I've given are over an alphabet of size one. So we have no such collapse. So this is a little bit different from that classical result.
Another thing that, I think, at least surprised me is that if I consider distributions with finite support-- so I consider probability distributions over just some finite set-- then PCFGs give no more power than probabilistic regular grammars. So in particular, they can't define anything that's not rational. All those examples of algebraic stuff going on were inherently infinite-- they had to be infinite. So this is also limitative, and it also suggests that the power of these things really gets going when you're considering distributions with infinite support.
OK, so that's PCFGs, probabilistic context-free grammars. Now I want to go into something that's not actually part of the traditional Chomsky-Schutzenberger hierarchy, but is sort of in between the context-free and the context-sensitive languages, and these are the indexed languages that were devised by Aho in the late '60s.
So what are indexed languages? I haven't defined them yet. Indexed grammars add to the non-terminals and the terminal symbols some indices, some finite set of indices I. And then indexed grammars are just like context-free grammars, except that non-terminals can carry stacks of these indices and pass along those stacks, which might give us a little bit more power.
So what does this look like? I have a non-terminal rewriting to some sequence of terminals and non-terminals. And I can either just pass along the stack exactly as it is to all the non-terminals on the right-hand side, or I can push something onto the stack, or I can pop something off the stack. This is kind of like a push-down automaton kind of idea, but in a grammar. So that's an indexed grammar, and these, in the classical case, are strictly in between context free and context sensitive.
And now I want to consider probabilistic versions of it. And an initial result-- this is just a sort of first result. And I don't know much about these things, but one observation is that probabilistic indexed grammars can define transcendental PGFs. So they definitely go well beyond what probabilistic context-free grammars can do.
And here's a simple example. This is where we start out with S and the empty stack, and then we push something onto the stack with y, and then we sort of probabilistically decide whether to add one or just move on. And so we get a kind of geometric distribution inside of it. But then I duplicate-- I kind of take two to the power of it.
And this defines the distribution where 2 to the k gets probability 1 over 2 to the k. And you can show, because this gets really spread out in its support, that it has a transcendental PGF. This is beyond algebraic-- it's not going to be the solution to any polynomial equation. So this already shows that we go beyond probabilistic context free.
Now I want to say a couple of things that actually relate to the kinds of grammars that especially some linguists have focused on, which is a restriction of the indexed grammars to what's called linear indexed grammars. These are indexed grammars where, in the right-hand side of a production rule, I allow at most one non-terminal. So it's kind of like regular grammars, but I allow this non-terminal to appear anywhere.
So these are called linear indexed grammars, and now I want to consider the probabilistic version of those. And these are of interest because they're classically equivalent to things that people have studied in linguistics and computational linguistics-- things like tree-adjoining grammars, combinatory categorial grammar, head grammars, and some others. So these are of interest in their own right.
But they're also algebraic-- you can show that they have algebraic PGFs. But they also allow finite-support, irrational-valued measures. So they now are actually going beyond PCFGs. And moreover, this is even true in a further restriction, where I take not just linear indexed grammars but so-called right linear indexed grammars, where the non-terminal has to appear all the way at the right of the string, to the right of all the terminal symbols.
So those grammars, right linear indexed grammars, are classically equivalent to context-free grammars. They generate the same languages. But here we actually see a difference in the probabilistic setting. And I can explain why and say more about that, but this is a little bit surprising. And these things, the things I just mentioned, are equivalent to another machine notion, namely probabilistic push-down automata.
But if you follow the literature on these topics in computational linguistics, this may be a little bit confusing, because there are various results out there showing that probabilistic context-free grammars are equivalent in expressive power to probabilistic push-down automata, and I've just said that probabilistic push-down automata are equivalent to probabilistic right linear indexed grammars, which are more powerful than probabilistic context-free grammars.
And there's a kind of subtlety there, which is essentially something about how we set things up such that if I take a probabilistic push-down automaton and then I get the equivalent probabilistic context-free grammar, I actually have to solve some system of equations, which might give me parameters for the grammar that I couldn't have gotten just using one-half probabilities or rational probabilities or anything else. So that's essentially the crux there. But these are all kind of morally equivalent, but if you actually pay attention to details, there's some subtleties.
OK, so the last thing I want to say something about are probabilistic context-sensitive grammars. So these are not very common. Even context-sensitive grammars don't really appear that often in the literature, in any literature that I'm aware of. And probabilistic context-sensitive grammars appear almost never. And so what I want to do here is essentially just explain why-- so say something about what's peculiar about these kinds of distributions.
So remember, a probabilistic context-sensitive grammar allows contexts around the non-terminals that you're rewriting. And in the classical case, context-sensitive grammars are strictly more powerful than indexed grammars and the rest-- context-free grammars and so on.
But here's a first little result that kind of equates these with a machine notion. So if I look at probabilistic context-sensitive grammars, these are equivalent to non-erasing Turing machines-- so probabilistic Turing machines that, once they start writing, can never write a blank symbol. They can never erase-- so they work just like Turing machines ordinarily do, but they can't erase. Those are equivalent, and it's fairly easy to see.
There's a quick corollary of this-- in fact, many corollaries of this because probabilistic Turing machines are often easier to work with-- that PCSGs can define lots of distributions that can't be defined by any of the kinds of formalisms that we've considered so far, except the unrestricted ones.
So in particular, they can define distributions that have support in terms of factorial numbers and all kinds of things that go beyond even indexed grammars. So these definitely can do things that none of the other grammar formalisms can do or any of the other machines can do. And they're really complicated in other ways as well.
So here's a little result that's a kind of probabilistic version of an earlier result by Savitch about classical context-sensitive grammars. And maybe don't look at the actual statement of it. Let me just describe it.
So imagine I just take some semi-computable semi-measure, some possibly very complex kind of distributions, maybe not even computable, but something that some grammar or machine could define. Then I can give you a probabilistic context-sensitive grammar that has exactly that same distribution as long as you don't mind there being some string of dummy symbols tacked onto the end.
And the intuition here, I hope, is sort of intuitive: if these things are non-erasing probabilistic Turing machines, then I can do whatever computations I want, and every time I would erase, I just write some dummy symbol and push them all to the back at the end. OK, so essentially these things can do everything, but you get a bunch of junk at the end that you just maybe want to ignore or something. So these are really complex in some ways. So they can encode any computations in some sense-- not in other senses, of course.
And one sense in which they can't is this last result, which I think is kind of surprising: that PCSGs, probabilistic context-sensitive grammars, can only define rational probabilities. So we actually can't even do any of those examples that probabilistic context-free grammars could do. We can't even get irrational probabilities.
And so that looks a little bit weird because context-sensitive grammars classically, of course, extend the context free. But why do they extend the context free? This goes back to the original paper by Chomsky on this.
The fix-- the issue is that you have to add in this further stipulation that the start symbol can go to the empty string, so that you can have languages that include the empty string. So we have to allow that, and that's not problematic in the classical setting.
But in this probabilistic setting, that little fix actually reverberates in a big way. So the inability to erase, essentially, to shorten a string is what essentially gives rise to this issue of only being able to define rational probabilities and therefore being actually incomparable, strictly speaking, in expressive power with probabilistic context-free grammars.
OK, so that's a little tour of at least everything that I know about this hierarchy and ways of showing that something is and isn't at a given level of the hierarchy. So what is the picture that we end up with? We have this sort of nicely laid out classical hierarchy of increasing expressive power.
Over here, we see something similar, but we have this kind of unfortunate thing where the context sensitive are incomparable with everything except for the probabilistic regular and the unrestricted. And then another thing just to note-- I mentioned this earlier-- is that if I consider certain limit cases, like alphabet of size one, over on the classical side, a lot of this hierarchy collapses. So even the linear indexed grammars collapse down to the regular grammars.
And all of the examples showing strictness of this hierarchy use only one-element alphabets, just distributions over strings of a's. So it looks in many ways different from the classical hierarchy.
OK, so what does this mean? So I'm not going to mention all the very many open questions about this stuff. You can probably imagine many of them yourself. But let me just say a little bit about how I think about what we learn from this. It's, of course, very much of a kind of theoretical exercise.
But if you look in the literature, you see really interesting arguments for why we should actually focus on one extreme or another of the hierarchy. So some arguments say that we should really be thinking and focusing our attention on the upper parts of the hierarchy, sort of Turing-complete kinds of things-- that cognition, and aspects of cognition like language, call for that type of structure, and models seem to use it.
And then you have other arguments that I think are kind of equally compelling that start with the patent observation that brains are finite. So at most, models should be something like a finite-state machine or something that looks like a probabilistic regular grammar or is equivalent to one. And so we should really be focusing our attention on that part of the hierarchy.
And I think there's actually something right about both of these. And in some ways, the results here lend support to both of them. So if I take the unrestricted case, this allows us very simply and naturally to define many common distributions, including Poisson distributions and many others, essentially anything that we ever want to write down, including all sorts of interesting combinatorial stuff that happens in language and elsewhere.
Meanwhile, if we look at the very bottom part of the hierarchy, the probabilistic finite-state part or the probabilistic regular part, those are, at least in some theoretical sense, as good as we could ever need. They could approximate any distribution whatsoever, and that's pretty powerful.
So of course, practically speaking, there's a reason that we see lots of formalisms in between these two extremes. So in general, there's this kind of obvious trade-off between expressiveness, which is what I've been focusing on here, and various kinds of tractability-- so tractability in defining models and defining generative processes.
So sometimes it's just easier to write down a particular generative process if I have more apparatus to use, not to mention inference and learning and all these kinds of things that as a community we, of course, have tons of experience with very practical issues in tractability and what you can do and what you can't do with different parts of this hierarchy.
So what I've been trying to explore here is the other side of this trade-off, trying to get a little bit more comprehensiveness and precision about what the different levels of this hierarchy can do, so that we can actually understand these trade-offs perhaps a little bit better and understand the landscape of generative models just a tad better. So I think that's all I'm going to say. So thanks for listening.