Tutorial: Probability (43:23)
August 11, 2018
All Captioned Videos Brains, Minds and Machines Summer Course 2018
Andrei Barbu, MIT
Introduction to probability, covering uncertainty, simple statistics, random variables, independence, joint and conditional probabilities, probability distributions, Bayesian analysis, graphical models, mixture models, and hidden Markov models.
PRESENTER: So as with all tutorials, this is going to be very, very basic. If you know anything about probability, are taking a course in probability anytime soon or took one recently, or know anything about the math behind probability, you will certainly not learn anything new today. My goal is to provide you with some intuition about what's going on rather than actually give you equations. The intuition is much more useful for the rest of the stuff that's going to happen in this course.
In particular, in a few days, you're going to see a tutorial on probabilistic programming that's going to very directly let you use this intuition about probability to actually run experiments, to sample from complicated distributions, to model your data. And you'll be able to do that even if you actually don't really have a good grasp of the underlying equations just by having the basic concepts. So that's the goal of this.
Of course, there are some basic problems that we want to get at with probability, and one of them is we want to be able to capture uncertainty. There's much more to uncertainty than probability, even though in a 101 probability course or stats course, you tend to conflate the two. Uncertainty is a problem that people struggled with for many centuries. You can read about people trying to reason about uncertainty going back about 3,000 years or so.
And some basic questions that they were asking, even back then, are: I have two sources of data, are the two sources the same? How can I generate some data? Maybe I know part of some time series, how do I fill in more of it? And what can possibly explain my observed data? Even if you look at the writings of the ancient Greeks in 500 BC, or at Ptolemy, they're wondering about exactly these kinds of questions. And they come to some interesting conclusions that we might find sort of absurd today. But if you don't have a theory of probability, they're actually quite reasonable.
If you think about the way we think about uncertainty now, we have a pretty rigid framework for it. We say that every event has a probability of occurring, and it has a single probability of occurring. Some event always occurs. And when you combine events that are separate, that have nothing to do with each other, that are mutually exclusive, you add their probabilities.
You might recognize these as the basic equations that you start with in any probability course. They only date back to about the 1930s, to Kolmogorov. There's an even more basic question we can ask about this to give you some intuition about what's happening, which is, why these statements about uncertainty? If uncertainty is such a big problem, why arbitrarily choose to have one probability of something occurring? Or why require that some event should always occur, or why bother adding together the probabilities of different events?
One question that you could ask, which is a little bit more trivial, is, why these axioms for probability? There are lots of different ways. For example, Cox's theorem is another way to axiomatize probability. But a deeper question is, why not allow for multiple probabilities of something happening? Has anyone, just out of curiosity, seen what the basic reasoning behind this is? Oh, that's kind of fun.
OK, so let's relax some of these constraints. If you take away the idea that every event has a single probability, you can get what are called possibility theories or interval probability theories or maximal or minimal probabilities. So I can say that every event has a probability, and I can bound it with a lower and upper bound. I don't have to actually specify it.
There are some reasons why you might like this intuitively. If I don't know about something and I'm Bayesian, I can assume I have a uniform prior. But maybe it's even simpler to just say, I don't know, but I think the probability is probably above 0.3 and below 0.8.
Something else that you might do: you can also relax the case that some event always occurs, but that's slightly more annoying to talk about. You can also change the fact that you add together the probabilities of events that are mutually exclusive. For example, if you change this addition to maximization, you get what are called belief functions, which try to aggregate together your belief about something happening. And they are a very natural way to think about uncertainty.
Or you can just say, well, I'm not going to add them together, I'm going to take both the maximum and the minimum and try to simultaneously upper and lower bound the probability. What's fun is each one of these leads to an internally consistent, perfectly reasonable, perfectly good mathematical theory about uncertainty. You can prove theorems about it, and people actually do. And even though this isn't a huge subfield of mathematics today, there are people that actively work on this.
So why do you only ever see probability in day-to-day life? There's nobody that does machine learning with belief functions. There are two reasons. One is practical. It just turns out that the mechanisms for doing inference are a little bit easier in probability than in these other systems, at least by the standards of, say, the '90s, when I say easier. Today, with GPUs, you could certainly do inference in all of these different systems.
But there's a very good reason why you only see probability. And that's-- probability has to do with maximizing your rewards. This is an argument that comes from Maynard Keynes. Eventually, it was proven and systematized by de Finetti in the 1930s.
So let's assume that I'm a bookmaker. These are called Dutch books. Amusingly enough, no one has any idea why they're called Dutch books. Maynard Keynes just wrote it in a book in the 1920s, and no one has been able to figure it out since then. And if I'm a bookmaker or a bookie, what I'm going to do is I'm going to say, I will let you place a bet with me. I'm willing to buy this bet from you or sell it to you, just like any rational agent, because the two are equivalent.
And this bet is very simple. If the event happens, I will pay you r. If the event doesn't happen, I get to keep your money, right? Just as if you're betting on anything else. Why should I always value my bet at the probability of the event happening times the reward, p times r? This is what you would have if you were to compute the expected value of this.
It turns out that if I value my bet at anything other than this, there is a way for you to take advantage of me and make me lose money. The most immediate way is, let's say that even though the probability of the event is whatever it is out in the real world, I'm willing to value my bet at a negative p. Well, immediately I lose money because I'm offering to pay you r dollars for minus p times r dollars, and you should probably take me up on that. So I'll pay you this much up front, and then I'll pay you this much if you happen to win.
Let's say probabilities don't sum to 1. Well, if the probabilities don't sum to 1, you can take out a bet that always pays off, right? That's what it means for the probabilities to sum to 1. Just take out a bet on a tautology. It always happens. Then if the sum is below 1, what I'm doing is I'm basically selling you the ability to make r dollars for less than r dollars, right? Say r is $1. My bet always pays off, therefore it's worth exactly $1. And I'm saying you should pay me less than $1 for the privilege of me giving you $1. Probably not a good idea.
If the sum is above 1, you can do the opposite thing, which is you can make your own bet. You can become a bookie. And you can say, well, if you really value the pleasure of giving me $1 at more than $1, then you should buy this from me. So why don't you pay me $2 so that when this happens, which is inevitable, I have to give you $1? So again, that's probably not a good idea for me.
And if you try to relax the sum constraint, you can play exactly the same game. If the value is larger than the sum, I end up paying more than it's worth because you can take the bet and you can split it in half and add together the rewards. And in the opposite case, if the value is smaller, you can do the same thing of making that bet and selling it back to me.
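The sum-to-1 part of the Dutch book argument can be sketched in a few lines. This is an illustrative sketch only: the function name `guaranteed_profit` and the quoted prices are hypothetical, and it assumes the bookie prices each bet at quoted probability times payout for a set of mutually exclusive, exhaustive outcomes.

```python
def guaranteed_profit(prices, payout=1.0):
    """Profit for a bettor who buys one bet on every outcome.

    `prices` maps each mutually exclusive, exhaustive outcome to the
    probability the bookie quotes for it. A bet on an outcome costs
    price * payout and pays `payout` if that outcome occurs.
    Exactly one outcome occurs, so the bettor collects `payout` once.
    """
    total_cost = sum(p * payout for p in prices.values())
    return payout - total_cost


# Hypothetical quotes: the first set sums to 0.9, the second to 1.
incoherent = {"heads": 0.4, "tails": 0.5}
coherent = {"heads": 0.5, "tails": 0.5}

print(guaranteed_profit(incoherent))  # positive: a sure win for the bettor
print(guaranteed_profit(coherent))    # zero: no sure win in either direction
```

A negative result (quotes summing to more than 1) means the sure win goes to whoever sells that same set of bets back to the bookie, which is the "become a bookie" move described above.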
So there's something very deep about probability: somehow the only way to make rational decisions under uncertainty is to follow precisely these rules of probability. With anything else, someone or something can take advantage of you. Now, the word rational here is slightly sketchy. There are people in mathematics and philosophy who disagree that these are the only rational rules, and they think there's something wrong with the scenario.
But it's definitely the case that if you have a utility function, or you're following some process according to decision theory, probability is the only way to maximize your utility. Even more than this, in the 1950s Savage and Ramsey came up with-- Savage theory, if you ever see Bayesian Savages in any kind of mathematical literature, they're referring to the fact that, regardless of what experiment you run like this, as long as it follows the rules of decision theory, probability is the only way for you to act rationally.
So now you know why probability. One thing you might ask is, why didn't anybody else realize this 2,500 years ago? If you know anything about ancient history, the Greeks, the Romans, and friends, they loved to gamble. If you dig up any kind of Roman settlement or town, you're going to find a huge amount of gambling equipment everywhere. These are bones from the ankles of sheep, and this was their preferred way of gambling. And they occasionally had dice.
But what's fun-- we were talking yesterday with Eric Schmidt, and he mentioned the importance of experiment in science and the importance of getting good equipment and good data to be able to draw good conclusions. And he in passing mentioned this idea that maybe science is in part limited by our ability to gather data then analyze it.
Well, certainly the ancient Greeks were limited by this. If you look at their dice, the probability of any one face coming up is not uniform at all. And actually, for a good half a century or so, even if you go back and read the original exchanges between Pascal and Fermat, they had this idea that probability could only handle cases where all outcomes were equally likely. They could tell you that some outcomes became more likely in the posterior because they aggregated together, because they were identical, so thereby they were more frequent. But initially, they were all equally likely. So it just seems like it's really hard to come up with a theory of probability unless you really have good equipment.
Also you need a good funding source. Pascal and Fermat got some funding from gamblers who wanted to know why certain outcomes, when you roll multiple dice, are more likely than others. We can tell Eric Schmidt that, indeed, equipment is important for science.
Other things that you can do, once you have a basic theory of probability, that you could not do in the past, is you can try to measure some basic facts about what's going on. So if you get some data like this, like someone's shooting at a target or throwing some darts, you can, of course, compute the mean. That's what we just did in the exposition before. You can compute the variance if you feel like it.
I'm sure you've all seen means and variances before. I seem to have lost my friend, the microphone. There we go. The variance is just the expectation of your squared deviation away from your mean. Your covariance just asks, how does the expected deviation from the mean relate between one process and another process? What's nice about these things is you don't have to remember any equations. You can easily reason about all of these things from scratch.
You could also think about things like correlations, which are just covariances normalized by the standard deviations of the two. There are some nice properties of having these things. One thing that you don't get by computing means and variances is understanding: you actually understand very little about your data. You're essentially making an implicit assumption that your data is normally distributed when you analyze it this way.
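To make the "reason about it from scratch" point concrete, here is a minimal sketch that builds variance and correlation out of nothing but the mean. The data values are made up for illustration, and the population form (dividing by n) is an assumption made for simplicity.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    # Expected product of the two deviations from their means.
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])


def variance(xs):
    # Variance is the covariance of a variable with itself,
    # i.e. the expected squared deviation from the mean.
    return covariance(xs, xs)


def correlation(xs, ys):
    # Covariance normalized by the two standard deviations.
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))


# Hypothetical data: ys is exactly twice xs, a perfectly linear relation.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

print(mean(xs), variance(xs), covariance(xs, ys), correlation(xs, ys))
```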
There are a lot of very absurd conclusions that you can draw from your data if it's not normally distributed and you just rely on means and variances. That's one really important thing, both in machine learning and in statistics and probability: always look at your data over and over and over again. Because very often, people will spend weeks and weeks analyzing something just to realize that they actually don't quite understand what's going on.
Of course, any time you do this, you have to make sure that your samples are independent. If you ever want to read about IID samples, there's actually an entire tract in Ptolemy where he talks about the fact that if you measure something about the stars in one year, for some reason, those measurements don't carry over very well to the next year. But if you measure from year to year to year to year, even though you're gathering the same number of samples, for some reason that kind of data generalizes. That is just because the samples within one year were not actually independent from each other. But he didn't have the language to think about this.
Means and variances also don't capture enough about the underlying data unless it's really actually normally distributed. Very often, the single most confusing thing is this idea of correlation and causation and independence. This is the single biggest sort of probability intuition mistake that people make. These are all cases where the correlation is 0. Clearly, these are not unrelated to each other.
Any time you hear the word correlation, you need to think to yourself linear. That's all correlations can do. They can tell you about linear relations. We can even do something completely trivial, like take a sample from a normal distribution, take exactly that value and pass it through a sine, and pass it through a cosine. The two resulting values are actually entirely decorrelated from each other; they have correlation 0. But they're obviously causally related, there's just a phase shift between them. So definitely don't rely on a lack of correlation as evidence of independence in any way.
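The sine and cosine example is easy to simulate. The sketch below assumes a standard normal (mean 0, variance 1) and a fixed seed; both are illustrative choices rather than anything stated in the talk.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation computed directly from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)


random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]
sines = [math.sin(x) for x in samples]
cosines = [math.cos(x) for x in samples]

r = pearson(sines, cosines)
print(f"correlation between sin(x) and cos(x): {r:.4f}")  # close to 0

# Yet the two are completely tied together: sin^2 + cos^2 = 1 for every sample.
print(all(abs(s * s + c * c - 1.0) < 1e-9 for s, c in zip(sines, cosines)))
```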
That was how you would present this in the 20th century. In the 21st century, there is a more fun way to think about correlations, which is the data dinosaur. All of these have the same mean, the same standard deviation, the same correlation. Actually, they stay exactly the same even while the picture is changing. This was a cute paper from Autodesk from a year or two ago trying to update these experiments and to really show people that we shouldn't rely on means, variances, and correlations, that on their own they can be entirely meaningless. You need to know much more about your data before you can say something substantive about it.
Of course, there are lots of questions that we'd like to answer. And if it's not enough for us to use mean variance, covariance, and correlation, we have to do something slightly more. And that's where some sort of Bayesian reasoning is going to come into play. If you have information from darts like this, you might ask, are two players the same? How do you know, and how certain are you of this? What about two players is different? How do you quantify the difference between the two?
One of the biggest differences between us and, say, a paper you might read from the 1800s is that they weren't doing this. They weren't thinking systematically about differences like this, like whether these two processes are producing the same kind of information, because they didn't have the tools of probability. Actually, you'd be particularly out of luck if you were, say, in the Middle Ages. Because the way the trials worked, they didn't have a theory of probability. So there was no idea that you would be 50 plus 1% guilty, that you lose a court case in civil court with a 51% preponderance of evidence, or that there has to be 95 or 99% evidence if you get convicted of something and you get thrown in jail.
Back in the Middle Ages, the way that it worked is they knew about no evidence for something, a small amount of evidence for something, 50% evidence for something, or full evidence for something. And if there was one witness that was willing to come out against you, that was 50% evidence. If it was two witnesses, it was 100% evidence and you go to jail. And if there was one witness, it was up to the judge to decide if they wanted to get the other 50% of the evidence from you in the form of a sworn statement, or from you in the form of torture. So there are some really good reasons to have probability.
You might also ask here, how good is a player? Chess players will recognize this as a question we ask all the time. But even if you play Xbox these days, Microsoft does this, and they do this using something pretty close to Elo. You might want to ask what's the best information to ask for in [INAUDIBLE]. And questions keep getting more and more complicated as you go down.
At the top of these kinds of questions, you might say, if I want to know whether two players are the same, maybe all I really care about is comparing their means and their variances and seeing what's going on. But when you get to the bottom, if I change the size of the board, how might the results change? This already becomes much more complicated to think about just in terms of means, variances, and distributions, and what you would do with them.
Part of the reason why I'm telling you all of these stories about the past now is because they all involve experiments in one form or another. And that's really, I think, by far the most common misconception that people have about probability: that it's something mechanical, that you have the rules, you crank equations, and you get out results.
The best way to think about probability is as an experiment. There's a stochastic process somewhere. And what you want to do is you want to model that stochastic process. You can think of it like a machine that enters some state. That state is an event. Events have some probability, as we talked about, and there is some set of events out there. There's some random variable. A random variable just stands for part of the stochastic process.
The probability of either of two events happening is just the probability of their union. You can talk about intersections, and you can negate events. For joint probabilities, you just create a large table that lists how the two vary. And you have a much stronger statement about independence, one which is actually true, unlike correlation: the fact that you can take the table that tells you how x and y vary with respect to one another, and decompose that table into two matrices, a column and a row, that you can multiply back together to reconstruct the original one. It's more about the sparsity of interactions of x and y rather than their correlations.
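The table view of independence can be checked mechanically: compute the two marginals from the joint table and see whether multiplying them back together reproduces it. This is a small sketch with made-up tables; the helper names are hypothetical.

```python
def marginals(joint):
    """Row sums give P(x); column sums give P(y)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py


def is_independent(joint, tol=1e-9):
    """True if every cell equals the product of its two marginals."""
    px, py = marginals(joint)
    return all(
        abs(joint[i][j] - px[i] * py[j]) <= tol
        for i in range(len(px))
        for j in range(len(py))
    )


# Hypothetical joint tables over two binary variables.
# The first is the outer product of (0.4, 0.6) and (0.3, 0.7).
factorizable = [[0.12, 0.28],
                [0.18, 0.42]]
# The second has uniform marginals but x and y usually agree.
entangled = [[0.4, 0.1],
             [0.1, 0.4]]

print(is_independent(factorizable))  # True
print(is_independent(entangled))     # False
```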
The thing about conditional probabilities-- amusingly, there's a Dutch book argument for conditional probabilities as well, even though you can derive them from the rules of probability. People seem to be very worried about conditional probabilities in general. I think people just don't buy the law.
Just to underscore how important it is to think about experiments: we could talk about dry science experiments, but it's more fun if we talk about people cheating the lottery. This was a lottery from Toronto from a few years ago. And there was a mathematician in the States who likes to go around and try to cheat lotteries all over the place in every state.
So he was looking at this tic-tac-toe lottery in Toronto and he noticed something interesting. The way the lottery works is you have a bunch of numbers and you have to complete the tic-tac-toe. You can only complete it on the horizontal or the vertical and not on the diagonals. And you get a bunch of numbers here. And you scratch the 0 and it might say 19, and now you know this is a 0.
And depending on what numbers you get out here, maybe this turns out to be an x, maybe this turns out to be an x, maybe this turns out to be an x, and there you go. Now you win-- whatever, $10. That's all well and good. And the Ontario lottery certainly did their job with producing these numbers totally randomly.
But what they forgot is to actually compute the posterior distribution of these numbers. And so if you actually look at the distribution of the numbers, forget about anything else, don't look under the x's and o's at all, you'll notice something. Their algorithm vastly prefers to form rows and columns that are winners with numbers that are unique. That's just how their sampling process happened to have worked out.
This guy noticed this, and he noticed that he had a 95% chance of winning the lottery every time there were three unique numbers in a row or a column.
So that's another example where, in a sense, the lottery did its job. They set up all the initial conditions correctly. They ran the experiment, they got their data. But what they didn't do is actually look at their data after the fact and see whether it still made any sense. And they didn't look at the experiment from the point of view of the person that's actually analyzing the experiment.
You can think of this as I'm running an experiment and there's a subject that's looking at this data and I'm trying to get them to do something. I should analyze the experiment from their perspective too, to see whether it's actually uniform or whether there's a horrible bias like this.
Unfortunately, it turns out that after he did all of this, he computed his expected value and realized he could only win $600 a day with eight hours of work. So don't play the lottery because even when you cheat, you actually don't win. He ended up telling the lottery and they canceled it.
So let's look at this in a little bit more detail. So you play the lottery and you have a method to win. Half a percent of the tickets, for example, are winners. And you have a test to verify this and it's 85% accurate. 5% false positives, 10% false negatives. If you've seen Bayes' rule, you know where this goes. But I find that even when people have seen Bayes' rule, often they don't internalize what's going on very well.
So let's look at Bayes' rule in simpler terms, more directly. So with these initial conditions, is the test useful or not? It turns out we don't actually need Bayes' rule to answer this question. We can do it exactly the way that Pascal and Fermat did it a few hundred years ago. Actually, what they did is they drew out trees. That's when the gambler went to them and the gambler said, hey, when I roll these three dice, some of these values seem to be more frequent than others.
They took three dice and they rolled them out and they made a large tree and they counted how many things were on the leaves, and that's how probability theory got started. So let's do that. Either the ticket is a loser or it's a winner. So that's a 0.5% probability of winning. And there's a 90% probability that your test actually says it's a winner if it's a winner, and a 10% probability that it says it is not. And exactly the opposite here. If it's a loser, your test occasionally, 5% of the time, says it's a winner. And almost always it says it's a loser.
OK, when we draw it out like this, we actually don't need very much in the way of rules. As long as we can add, multiply, and divide, we can figure out the probability that you're a winner, right? All we have to do is figure out: my test said this thing is a winner. How likely is this path versus how likely is the sum of these two paths, right? The only way that t can come out to be true is by following this path or by following this path.
So that's exactly what we're going to do. We're going to take this path, the d plus and t plus path, this one, and we're going to divide it by every way that we might get a t plus. It's basic arithmetic, and you get 8.3%. So even though the numbers sounded good, the test is actually garbage. His test actually was good. His test had a posterior probability of being correct of 95%, not the prior.
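The tree calculation can be written out directly. The sketch below uses the numbers from the example (half a percent of tickets are winners, the test flags 90% of winners and 5% of losers); the function and variable names are illustrative.

```python
def posterior_winner(prior, true_positive_rate, false_positive_rate):
    """P(winner | test says winner), computed by comparing tree paths."""
    # Path 1: the ticket is a winner AND the test says winner.
    winner_and_flagged = prior * true_positive_rate
    # Path 2: the ticket is a loser AND the test still says winner.
    loser_and_flagged = (1 - prior) * false_positive_rate
    # Divide the path we care about by every path that ends in "test says winner".
    return winner_and_flagged / (winner_and_flagged + loser_and_flagged)


p = posterior_winner(prior=0.005, true_positive_rate=0.90, false_positive_rate=0.05)
print(f"{p:.1%}")  # 8.3%
```

The denominator is what Bayes' rule calls the probability of the data; here it is just the sum of the two paths on which the test comes out positive.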
If you've seen Bayes' rule, this is basically what it looks like. These two are exactly the same statement, just computing the probability of t plus. This d plus is this part of the path, and my test being true conditioned on my ticket actually being a winner is this part of the path. So you can think about these very geometrically if you want to have a little bit of intuition.
This is the statement about Bayes' rule that you normally see that's exactly the same as that, except we swapped out the notation. If you were describing this to one another rather than using math, we would say that there's the likelihood, there's a prior, there's the probability of the data being true in general, and that gives rise to your posterior.
Of course, it's not possible to roll out data like this every time. Otherwise, things will be too laborious, and you certainly could never deal with continuous data. And so we have probability distributions for this. The Bernoulli distribution is exactly this kind of-- I'm flipping a coin sort of thing, which we just saw a moment ago.
I'm not going to go into how these distributions work, why they work the particular way that they do, or proving any properties of them. It's more important that you know when they're appropriate and what the intent behind them is than you actually know the underlying math. So you'd use a Bernoulli distribution when you have a process that can either return true or false with a fixed probability.
Binomials are sequences of Bernoulli trials. This is precisely what our friends the gamblers were doing. They had, say, lots of coins. They were flipping them. They wanted to know how many coins can possibly come out true, and whether they should bet more on three coins coming up heads, or five of n coins coming up heads at any one time.
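As a quick sketch of the gamblers' question, the binomial probability of exactly k heads in n flips can be computed with the standard formula. The choice of n = 10 fair coins is an assumption made purely for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent Bernoulli(p) trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n = 10  # hypothetical number of fair coins
p_three = binomial_pmf(3, n, 0.5)
p_five = binomial_pmf(5, n, 0.5)

print(f"P(3 heads of {n}) = {p_three:.4f}")
print(f"P(5 heads of {n}) = {p_five:.4f}")  # the better bet of the two
```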
Gaussians are particularly special. I'm sure you've seen Gaussians many, many times. Unfortunately, I was really thinking about this this morning. There is this very important reason why we use Gaussians, which is the central limit theorem, which says that if you have any kind of higher order errors, they tend to look as if they're Gaussians eventually in the limit, as long as your process obeys some very basic rules.
But I can't come up with an intuitive explanation for why this is. And so rather than reproducing the proof, we'll just move on. But you should know that this is why you use Gaussians. And sometimes, your errors don't behave according to the rules that the central limit theorem requires.
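Short of a proof, a simulation gives some feel for the central limit theorem. The sketch below averages draws from a flat (uniform) distribution, which looks nothing like a bell curve, and checks that the averages behave like a Gaussian. The sample sizes and seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)


def sample_mean(n):
    """Average of n independent draws from a uniform(0, 1) distribution."""
    return sum(random.random() for _ in range(n)) / n


n = 30           # draws per average
trials = 20_000  # number of averages
means = [sample_mean(n) for _ in range(trials)]

mu = statistics.fmean(means)
sigma = statistics.pstdev(means)

# For a Gaussian, about 68% of the mass lies within one standard deviation.
within_one_sigma = sum(abs(m - mu) <= sigma for m in means) / trials

print(f"mean of averages:  {mu:.3f}")     # near 0.5
print(f"std of averages:   {sigma:.4f}")  # near sqrt(1 / (12 * n))
print(f"within one sigma:  {within_one_sigma:.3f}")
```

Swapping the uniform draw for a heavy-tailed one is a simple way to see the kind of case where this convergence is slow or fails.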
For example, the stock market crash that we had no less than 10 years ago is precisely because people applied the central limit theorem when it did not apply. And they should have used distributions that had much fatter tails. And because of that they massively, massively underestimated the risk of the various derivatives that they were producing. So sometimes this stuff applies and sometimes it crashes the economy, so we should be slightly careful.
Gaussians have a few nice properties aside from this. You can add them together and you still get Gaussians. They also happen to be their own conjugate priors, and I'll say just two words about that in a moment. You can also have multivariate Gaussians, which are very simple to reason about. They work just like Gaussians, but in lots of dimensions. You can think of them very geometrically as taking a Gaussian ball and rotating and stretching it. And the covariance matrix tells you about the rotation and the stretches. This isn't true of many other distributions that don't have this very simple geometrical interpretation.
In particular, Poisson distributions are very handy. I'm sure that the neuroscientists among you have seen Poisson distributions many times because they're very useful for analyzing firing rates. But hopefully, that gives you some idea of where you might use some of these. And if you come up with problems that seem to fall into some of these categories, now you'll at least know where to go.
Of course, one of the most important things that we want to do is be able to change our mind if we have new evidence. And these days, you might want to think of this as a Bayesian update. This is exactly the equation that we had before. Except that this time, rather than saying we have all the values and we just want to get out the result, we might say that we don't know one of these values. And in particular, we might say that we don't know this one. And just to make that a little bit more concrete, let's think of a situation again about how good a player is.
I might have no idea about how good players are in general. If you think of a chess player or something, I don't know, maybe people are good, maybe they're bad. So I can just choose something very uniform: a Gaussian with very, very large variance. I could say that I think something like dart throwing or playing chess is a stochastic process. Not everybody is perfect. On top of it all, people don't play perfectly every time. And I can say that I can also observe people playing chess or throwing darts.
I could also try to integrate over all possible outcomes of every chess game. But generally, this normalization factor goes away. You'll see in a moment why. But let's just say we pick this thing as a normal distribution with 0 mean and high variance; let's just call it an uninformative prior. And we'll also pick this as being a normal distribution.
So what we're saying is, I don't know how good people are in general. And I don't know very much about how they play games, but I think it's probably normally distributed. They play as if they're sampling from some distribution that's normal, as if the error is random. So you might ask, how good is this player if I observe a whole bunch of their behavior? Can I update my information about them?
So obviously, the simplest thing to do with this is to just treat it as if it's any optimization process. You could just take the derivative with respect to theta, which are your parameters, which is your knowledge about this particular player. And if you're taking derivatives and you want to maximize or minimize, you know that at a maximum or a minimum the derivative is going to be 0. So you just have to search for where that derivative is 0.
And very quickly, you get to what are called maximum likelihood approaches, where you take the log of your likelihood, you take the derivative, and you try to solve that equation. There are some good reasons not to do this, though. So if you believe in Bayesian updates, you probably don't want this kind of estimate. Because you might want to know a little bit more about this player, like how certain your estimate of their skill actually is, and whether they're very consistent.
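As a small illustration of the maximum likelihood recipe, the sketch below assumes the player's scores are Gaussian around an unknown skill theta with a known noise level. Setting the derivative of the log likelihood to zero then gives the sample mean, and a brute-force search over theta lands on the same place. The scores and the grid are made up.

```python
import math


def log_likelihood(theta, data, sigma=1.0):
    """Log probability of the data if each point is Normal(theta, sigma^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - theta) ** 2 / (2 * sigma**2)
        for x in data
    )


# Hypothetical observed scores for one player.
scores = [1.2, 0.8, 1.9, 1.1]

# Closed form: d/d(theta) of the log likelihood is zero at the sample mean.
closed_form = sum(scores) / len(scores)

# Numerical check: try many candidate values of theta and keep the best one.
grid = [i / 1000 for i in range(0, 3001)]
best_on_grid = max(grid, key=lambda theta: log_likelihood(theta, scores))

print(closed_form, best_on_grid)
```

Either way the result is a single number, which is exactly the limitation raised above: it says nothing about how certain that estimate is.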
You can imagine that some days they're particularly good, and other days they're really terrible because they're angry at something. Maybe their behavior is actually bimodal rather than unimodal. So there are sort of two peaks. Sometimes they're good, sometimes they're bad.
If you just did this maximum likelihood estimate, you would come up with a single estimate of how good this person is. But maybe you would be better off actually knowing that whole posterior distribution. I won't go into how you do this, largely because when you do the probabilistic programming tutorial, you're going to see that, at least if you're willing to put up with the kind of inferences that it can and can't make, you can actually write this out in a very, very high level language.
You can just say exactly, I believe that the process by which people play games is first, their skill is sampled from some uniform-- some Gaussian distribution. And then, their playing is some process that looks as if it has normally distributed error. And here's my data that I observe. And all you have to do is you provide the data and it will try to figure out what the skill of the player is. And it will give you the whole posterior distribution rather than a point estimate.
If you're doing this by hand, this is where conjugate priors are very important. This is where Gaussians have their other very nice property. That if your knowledge about the player and your knowledge about the process that's being undertaken are both captured by some Gaussian, you can actually find closed form equations for your Bayesian update. Otherwise, generally, you can't do this, which is why people use probabilistic programming and similar processes.
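The closed-form update mentioned above can be sketched as follows. This is a minimal example under stated assumptions: the prior over the player's skill is Gaussian, each observed score is the skill plus Gaussian noise with a known variance, and all numbers and names are hypothetical.

```python
def gaussian_posterior(prior_mean, prior_var, observations, obs_var):
    """Posterior over a Gaussian mean, given a Gaussian prior.

    Assumes each observation is Normal(skill, obs_var) with obs_var known.
    Precisions (inverse variances) add, and the posterior mean is a
    precision-weighted average of the prior mean and the data.
    """
    n = len(observations)
    post_precision = 1.0 / prior_var + n / obs_var
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + sum(observations) / obs_var)
    return post_mean, post_var


# Hypothetical prior belief about skill, and made-up observed scores.
post_mean, post_var = gaussian_posterior(
    prior_mean=10.0, prior_var=4.0,
    observations=[12.0, 15.0, 11.0, 14.0, 13.0], obs_var=4.0,
)
```

Unlike a point estimate, the result is a whole distribution: the posterior variance shrinks as more games are observed, which captures how certain you are about the player's skill.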
You may have noticed that this was a pretty laborious way to try to talk about someone playing a game. In general, the more complicated your models get, the more laborious this explanation gets. And we've talked about a few different concepts. It turns out that there's a way to tie them together at a higher level of abstraction. And that's what graphical models are. That's why Judea Pearl won the Turing award 10, 20 years ago.
So the idea is I can take a probability distribution and I can write that down as a graph. And this graph is entirely isomorphic to the probability distribution. If you want to think about what the distribution looks like, it's just: traffic jams are conditioned on rush hour, bad weather, and accidents; sirens are only conditioned on the accident; and accidents are only conditioned on the bad weather.
The other thing that graphical models let you do is that they let you think intuitively about your problem. Because often in probability, and this is again why it's good to think of it as an experiment, the inferences that you make can be a little bit counter-intuitive. So you might think that the more data I have, the more this graph sort of becomes separated. The more data that I have, the more independent all of the variables get from each other. And I don't have to look at the whole experiment anymore. I just have to look at the one half of the experiment that I don't have data for, and everything else is irrelevant.
It actually turns out that that's not particularly good intuition. So imagine that traffic jams have some other consequence, like everyone in cars on highways is unhappy. If you happen to know that a traffic jam is happening right now, you can compute the probability that someone in a car somewhere is unhappy. And it doesn't matter whether they're unhappy because there was an accident or because it was rush hour. The traffic jam happened. The probability that someone is unhappy is just conditioned on traffic jams, and you're done.
But on the other hand, if you know a traffic jam happened, rush hour and accidents are no longer independent, even though they were originally. There was no arrow between them. The intuition behind this has to do with explaining away. You've heard people talk about explaining away when it comes to Bayesian reasoning. This is what they mean.
Imagine for a second that I see that there's a traffic jam. But there's only a tiny, tiny likelihood that it's rush hour. Then, that increases my likelihood that it's an accident. And if I'm almost certain that this is an accident, that decreases my likelihood that this is actually rush hour. If I didn't observe that there's a traffic jam, I wouldn't actually know anything about these two variables. They wouldn't inform one another.
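Explaining away can be checked numerically on a cut-down version of the traffic graph. The sketch below keeps only three variables (rush hour, accident, traffic jam), and every probability in it is invented for illustration. It computes conditional probabilities by brute-force enumeration of the joint distribution, which is only practical for tiny models like this one.

```python
from itertools import product

# Made-up probabilities for a three-node graph: rush -> jam <- accident.
P_RUSH = 0.3
P_ACCIDENT = 0.1
P_JAM = {  # P(jam=True | rush, accident)
    (False, False): 0.05,
    (True, False): 0.70,
    (False, True): 0.80,
    (True, True): 0.95,
}


def joint(rush, accident, jam):
    """Joint probability, factored the way the graph is drawn."""
    p = (P_RUSH if rush else 1 - P_RUSH) * (P_ACCIDENT if accident else 1 - P_ACCIDENT)
    p_jam = P_JAM[(rush, accident)]
    return p * (p_jam if jam else 1 - p_jam)


def prob_accident(**evidence):
    """P(accident=True | evidence), summing over all consistent worlds."""
    numerator = denominator = 0.0
    for rush, accident, jam in product([False, True], repeat=3):
        world = {"rush": rush, "accident": accident, "jam": jam}
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(rush, accident, jam)
        denominator += p
        if accident:
            numerator += p
    return numerator / denominator


# Without observing the jam, rush hour says nothing about accidents.
no_jam_rush = prob_accident(rush=True)
no_jam_no_rush = prob_accident(rush=False)

# Once the jam is observed, rush hour "explains away" the accident.
jam_only = prob_accident(jam=True)
jam_and_rush = prob_accident(jam=True, rush=True)
jam_and_no_rush = prob_accident(jam=True, rush=False)
```

With these numbers, `no_jam_rush` and `no_jam_no_rush` are equal, while `jam_and_rush` is lower than `jam_only` and `jam_and_no_rush` is higher: knowing it is rush hour makes an accident less likely as the cause of the jam, and ruling out rush hour makes it more likely.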
So just to wrap up. We actually have enough to talk about a relatively complicated graphical model now. One that is so sophisticated that a variant of this is an almost state of the art speech recognizer-- was a state of the art speech recognizer up until about three years ago. And even deep learning-based speech recognizers are based around essentially these principles.
So if I give you a speech signal, one thing that you might want to do is break it up into parts and tell me what the person actually said. This is just a spectrogram. This is a spectrogram of different vowels. And you can see that in human speech vowels are not-- they're not random. They seem to have these ripples. These are a consequence of the shape of your throat and the position of your tongue. And these ripples are called formants. They're essentially the maximums in the spectrograms.
And it turns out that this is really why you can recognize speech very well because there is all this wonderful structure-- all of this wonderful structure inside speech. So if we just wanted to make the world's simplest speech recognizer, one thing that you could do is you could just get a set of features, you could say every sound is composed of these features, and you could try to recognize it.
And the natural features-- the moment that you look at a spectrum-- are these formants. People order the formants. So you talk about a vowel having the first formant, that's the frequency at which the highest amplitude in the spectrogram is. Second formant is the frequency at which the second one is, et cetera. Turns out the first two formants are most important for recognizing vowels.
So what you might say is, fine, I'll have two features. I'll have a feature for the first formant, I'll have a feature for the second formant. Now, I want to use this in order to determine what class the word that was spoken is, or the vowel that was spoken. Well, another way to write this is I'm saying I have some prior belief about how likely every class is, every vowel is.
I can compute the likelihood that conditioned on this vowel actually happening, I observe one of these formants, which you can just record some speech for. And you compute the posterior distribution using Bayes' rule exactly like we talked about, and compute the likelihood of a particular word or a particular vowel.
You'll often see the graphical models get simplified like this. There's actually much more notation, and I'll show you that in just a moment. This is a naive Bayes classifier. What it comes out to be is something totally trivial. At least if you choose simple distributions, it's essentially learning a linear combination of these features that are important for recognizing certain classes. If you look at early spam filters, this is how they worked.
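To make the classifier above concrete, here is a minimal Python sketch of a naive Bayes vowel classifier over the first two formants. It assumes each formant is Gaussian given the vowel and that the two formants are independent given the vowel. The vowel labels, priors, means, and standard deviations are invented placeholders, not measurements and not values from the tutorial.

```python
import math


def normal_logpdf(x, mean, std):
    """Log density of Normal(mean, std) at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


# Hypothetical per-vowel parameters: prior, and (mean, std) in Hz for F1 and F2.
VOWELS = {
    "i": {"prior": 1 / 3, "f1": (300.0, 50.0), "f2": (2300.0, 200.0)},
    "a": {"prior": 1 / 3, "f1": (750.0, 80.0), "f2": (1200.0, 150.0)},
    "u": {"prior": 1 / 3, "f1": (320.0, 50.0), "f2": (900.0, 150.0)},
}


def vowel_posterior(f1, f2):
    """P(vowel | f1, f2) via Bayes' rule with the naive independence assumption."""
    log_scores = {
        vowel: math.log(p["prior"])
        + normal_logpdf(f1, *p["f1"])
        + normal_logpdf(f2, *p["f2"])
        for vowel, p in VOWELS.items()
    }
    # Subtract the largest log score before exponentiating, for stability.
    top = max(log_scores.values())
    unnormalized = {v: math.exp(s - top) for v, s in log_scores.items()}
    total = sum(unnormalized.values())
    return {v: u / total for v, u in unnormalized.items()}


posterior = vowel_posterior(f1=310.0, f2=2250.0)
best_vowel = max(posterior, key=posterior.get)
```

The prior and the two per-feature likelihoods are simply added in log space, which is why a classifier like this ends up so simple.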
Unfortunately, this doesn't work. But one very small modification to it gets you something that's quite good. And you just have to assume that your vowels come from a Gaussian mixture model rather than assuming that all their features are independent. This is sort of what the vowel consonant space looks like for English. It's kind of a little bit of a mess. That's why when you talk to your phone sometimes Siri or Google don't recognize what you said. But they do sort of look as if they might be modeled by Gaussians. And indeed, it's not a bad model.
If you wrote out the graphical model in words, the way that we did before, it would be kind of a mess. But there is a simple way to write it down in the picture, and actually the picture is quite easy to read. So what we can say is-- let's work our way in from the extremities. We can say we have k classes of-- k categories that we want to recognize. We have some distribution over the categories that we care about. And then, we sample the category.
We have some prior distribution over the mean and the variance of each one of these Gaussians. So this is the mean and the variance of k different Gaussians because we have k different vowels. And we have some beliefs about what these means and variances are.
This is, in part, why people look for other theories of uncertainty. Because one criticism of Bayesian methods is that you constantly have to specify these priors, and people don't really like doing that. There's an argument to be made about the fact that specifying these can be difficult. But generally, once you get far removed from your experiment, you can start to put in uninformative priors. Priors that look pretty uniform. And it doesn't really impact your results anymore if your priors are slightly off. But this is one of the main problems that people have with these approaches.
So now, we have a mean and we have a variance for every class. And what we're going to do is we're going to determine for n different words what class-- for n different vowels-- sorry-- or sounds what class this is. This is just an indicator variable, and this just chooses one of these k things. And our output gets sampled from this mean and variance.
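The generative story above can be sketched in a few lines of Python. This is a simplified illustration, not the model from the slides: it uses a single one-dimensional feature, and it takes the mixture weights, means, and standard deviations as given rather than sampling them from priors. All parameter values are hypothetical.

```python
import random


def sample_mixture(n, weights, means, stds, seed=0):
    """Draw n (class, value) pairs from a 1-D Gaussian mixture.

    For each sound: pick one of the k classes with an indicator variable,
    then sample the observed feature from that class's Gaussian.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = rng.choices(range(len(weights)), weights=weights)[0]  # indicator
        y = rng.gauss(means[z], stds[z])  # output from the chosen Gaussian
        samples.append((z, y))
    return samples


# Hypothetical parameters for k = 3 vowel classes (one formant, in Hz).
samples = sample_mixture(
    n=1000,
    weights=[0.5, 0.3, 0.2],
    means=[300.0, 750.0, 500.0],
    stds=[50.0, 80.0, 60.0],
)
```

Training the model runs this story in reverse: given only the observed values, infer the class of each sound and the mean and variance of each Gaussian.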
So even though this is a long story, this is a pretty complicated graphical model, particularly because it actually works decently well as a speech recognizer. So what people do is they gather a large amount of data, they will extract the formants from it, and they'll train up this model. Actually, even as of this year, a lot of the deep learning-based speech recognizers begin by training this kind of model first. And then, they can reuse these parameters for other things.
The one additional piece that you need for this is to have a Hidden Markov model. Part of the reason why I mention this is that a lot of the more interesting models that you're going to see later on in the course have some component that involves time. So in a Hidden Markov model, the thing that you want to do is you have multiple observations, one at every time point. Then, you have some hidden states. That's x in this case. Generally, with Hidden Markov models, you assume that the amount of hidden state is relatively small, and that it's also discrete. Neither of these actually has to be true in any way. It just makes inference more tractable.
Another way to write the same picture would be to say that I have this finite state machine. I have some allowable transitions, others have zero probabilities between them. These should also have self-loops, otherwise, this doesn't make a whole lot of sense. And you can assign probabilities to each of these; that's called the transition matrix of the Hidden Markov model. And you have output models. This is the likelihood that x1 would produce output y1. These two pictures are equivalent. You'll see both variants of this picture all over the place.
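As a small illustration of how the transition matrix and the output models combine, here is a sketch of the standard forward algorithm, which computes the probability of an observation sequence under a discrete Hidden Markov model. The two-state model and all of its probabilities are made up; the transition matrix is left-to-right with self-loops, in the spirit of the finite state machine described above.

```python
def forward_likelihood(observations, initial, transition, emission):
    """P(observations) under a discrete HMM, via the forward algorithm.

    initial[s]       : probability of starting in hidden state s
    transition[r][s] : probability of moving from state r to state s
    emission[s][o]   : probability that state s produces output o
    """
    n_states = len(initial)
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[s][obs]
            * sum(alpha[r] * transition[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


# Hypothetical two-state model; state 1 cannot go back to state 0.
initial = [0.6, 0.4]
transition = [[0.7, 0.3],
              [0.0, 1.0]]
emission = [[0.9, 0.1],
            [0.2, 0.8]]

likelihood = forward_likelihood([0, 1], initial, transition, emission)
```

Summing over hidden states one time step at a time keeps the cost linear in the sequence length, rather than enumerating every possible hidden state sequence.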
This is called a Markov chain. Later on when we talk about probabilistic programming, you will see inference based on Markov chains. So hopefully, you'll have some context for this.
Now, it turns out that all you have to do is take those Gaussian mixture models that we saw before and essentially put them here in the model. You say, I'm going to hear a speech signal. I'm going to model all the low level features in the speech signal with some Gaussian mixtures, and take the posterior probabilities about which Gaussians I'm hearing right now. And then, use a Hidden Markov model as a kind of speech model, because we know that certain sequences of sounds are just invalid in English. Just no words are formed out of those sounds. You actually get a very good speech recognizer. This was the standard speech recognizer for a good 20 plus years.
So hopefully, you've seen a little bit more about probability, and a little bit more about the intuition that goes behind these things. Probabilities are definitely defined in terms of events. You should always think about experiments. You should always think about whether the distribution of your random variable makes sense for your experiments right now. You'll definitely be using Bayes' rule quite a lot later in the course, and perhaps in your projects as well. And you also have to think about whether your knowledge updates make any sense.
And of course, try to visualize whatever distributions-- whatever stochastic process you're making as a graphical model. You'll see that when you write things out in WebPPL, which is the probabilistic programming language that Kevin will show you, there's a difference between what you write and what you would see in the graphical model. And it's infinitely easier to first write your intuition in the graphical model, and then translate it and make sure that it makes sense than to just write something. Debugging these kinds of things can be very, very, very difficult because you don't get a single answer. You may have no idea what went wrong unless you do that. So with that, I'm happy to take your questions.