Learning from Biased Data
Date Posted:
August 17, 2020
Date Recorded:
August 14, 2020
Speaker(s):
Constantinos Daskalakis
Brains, Minds and Machines Summer Course 2020
PRESENTER 1: It's a great pleasure to introduce Constantinos Daskalakis, a professor of EECS and CSAIL at MIT, who's going to tell us today about learning from biased data. So Constantinos, this is all yours.
CONSTANTINOS DASKALAKIS: All right. Hi, everybody. I hope you're having a great time in the summer school. It's a great pleasure to be talking to you, and I'm really sorry that my camera doesn't work. But hopefully, there's enough content in the slides to keep you interested.
I'm going to talk about learning from biased data, and here at the bottom I'm showing you a fresco from a Minoan palace in Crete, where I'm from, and hopefully the relationship of that to the topic of the lecture will become apparent later during the lecture. Next, I want to do a little amuse-bouche before I get to the actual content of my talk, and I want to do a little experiment. So I want to start as follows. I want to consider a distribution, which is the blue distribution, a distribution over [0, 1].
So the support is just the interval [0, 1], and the density of the distribution is the one shown here. I hope you see my mouse moving around. So it's the one shown here.
So the log density is sine of 10x. So this thing is just a normalizing factor to make sure that this is a probability distribution. So the distribution takes this form shown here in blue.
Now, the orange one is the conditional of the blue distribution in the interval 0 to 1/2. So in particular, the log density is the same, but I'm normalizing by a different thing because I'm integrating this thing only over the interval 0 to 1/2 to make this thing a distribution. So the support of this orange distribution is 0 to 1/2, and that's the conditional distribution of the blue distribution whose support is 0 to 1.
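In symbols, the two densities just described are:

```latex
p(x) \;=\; \frac{e^{\sin(10x)}}{\int_0^1 e^{\sin(10t)}\,dt}, \quad x \in [0,1],
\qquad
p_{\mathrm{cond}}(x) \;=\; \frac{e^{\sin(10x)}}{\int_0^{1/2} e^{\sin(10t)}\,dt}, \quad x \in [0,\tfrac12].
```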
So those are our two distributions, and here's the experiment I want to do. I want to take a sample from the conditional distribution-- so from the orange distribution-- and I'm going to do maximum likelihood estimation to fit the best density whose log density is a polynomial. And I'm going to be playing with the degree of the polynomial. A higher degree makes this q more expressive.
So I'm going to be considering different degrees, and I'm going to collect samples from that orange distribution. So these are conditional samples. These are samples falling only in 0 to 1/2, and I'm going to do the best I can using maximum likelihood to fit a distribution from a different family than this one.
So for this one, the log density is a sinusoid. Here I'm considering log densities that are polynomials. But polynomials are quite expressive. So I'm hoping that for a high enough degree of q, I could in principle fit the details of the orange distribution that I'm fitting.
So this is what I'm going to do, but here's a twist in the plot. What I want to do is fit the orange distribution on the conditional sample that I got, and then see how well it does in extrapolating outside of the interval where I saw samples. So I'm going to be fitting samples over here, and I'm interested in how well the fitted distribution is going to do in the extrapolation regime. That is, if I take this density and normalize it differently, over 0 to 1 here, how well would it do outside of the range where I saw samples?
Another way to view it is that really, this blue is the distribution I'm interested in, but I only saw a biased sample from the conditional of the blue distribution in some subinterval. So I want to understand how well the biased samples inform me about what's happening outside of the domain where I saw samples. And that's the experiment I want to do, and I want to try different degrees for this polynomial q.
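A minimal sketch of this experiment, assuming NumPy/SciPy and using rejection sampling plus numerical integration (the degrees, sample size, and optimizer here are illustrative choices, not the exact ones from the talk), might look like this:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Target: density on [0, 1] proportional to exp(sin(10x)); we only get to see
# samples from its conditional on [0, 1/2].
def target_logdens(x):
    return np.sin(10 * x)

def rejection_sample(n, lo, hi):
    # Rejection sampling from exp(sin(10x)) restricted to [lo, hi].
    out = []
    while len(out) < n:
        x = rng.uniform(lo, hi, size=4 * n)
        u = rng.uniform(0, np.e, size=4 * n)     # exp(sin(.)) <= e
        out.extend(x[u < np.exp(target_logdens(x))])
    return np.array(out[:n])

def log_partition(coeffs, lo, hi, grid=2000):
    # Log of the normalizing constant of exp(polynomial) on [lo, hi].
    xs = np.linspace(lo, hi, grid)
    return np.log(np.trapz(np.exp(np.polyval(coeffs, xs)), xs))

def fit_mle(samples, degree, lo, hi):
    # Maximize the average log-density: mean(poly(x_i)) - log Z(poly).
    def neg_ll(coeffs):
        return -(np.mean(np.polyval(coeffs, samples))
                 - log_partition(coeffs, lo, hi))
    return minimize(neg_ll, np.zeros(degree + 1)).x

samples = rejection_sample(5000, 0.0, 0.5)       # the biased (conditional) sample
xs = np.linspace(0, 1, 1000)
true_dens = np.exp(target_logdens(xs))
true_dens /= np.trapz(true_dens, xs)
for degree in (4, 8, 12, 15):
    coeffs = fit_mle(samples, degree, 0.0, 0.5)  # fit only where we saw samples
    # Extrapolation: keep the fitted log-density but renormalize it over [0, 1].
    fitted = np.exp(np.polyval(coeffs, xs) - log_partition(coeffs, 0.0, 1.0))
    print(degree, 0.5 * np.trapz(np.abs(fitted - true_dens), xs))  # TV-style error
```

The printed error on [0, 1] is exactly the extrapolation quality discussed next.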
Here's something super interesting that takes place. So again, I fit on the conditional sample, but I ask, how would I do in the whole interval if I were to use the same log density that I identified on the conditional sample? And the cool stuff that happens is here.
So here I'm showing you in different colors what happens as I try different degrees. And in particular, what I'm interested in is what happens outside of the interval where I saw samples, so 1/2 to 1. As you see, in the domain where I see samples, in general, as the degree of the polynomial that I fit increases, I capture better and better the shape of the conditional distribution.
But the behavior, the performance in the extrapolation regime, has a non-monotone behavior. So let's look at the case of degree 10 for this polynomial. That distribution is shown here in orange. It doesn't do super well in the domain of my samples, but it also doesn't do horribly in the extrapolation regime.
So what I'm trying to match is this blue distribution. It overshoots over here and undershoots over there, but it doesn't do horribly. On the other hand, a higher degree polynomial, which is the green one, is horrible. The log density that it fits is this green line that you see here. So it's a horrible fit.
On the other hand, as I move past that point, I get improvements, and eventually, for a degree of the polynomial that is 15, I get a very good fit even in the extrapolation regime. OK, so I'm missing a little bit here in the interval, but that is pretty cool, right? Because it only saw samples in the first half of the domain, from the conditional distribution, but nevertheless it does really well in extrapolation. So what we see here in this experiment is that as the degree goes higher, the extrapolation property of the distributions that are fit-- and even the interpolation, because this green one is doing horribly-- has a non-monotone behavior. And the other thing that I see is that eventually, as the degree gets big enough, I get a very good fit even in the extrapolation regime.
So we get overfitting, basically, is what's happening here, and that's why I get this horrible fit of my underlying distribution. I get overfitting there. But if I increase the degree enough, that doesn't hurt me. And we see this double descent behavior in terms of the accuracy of my fit as I increase the degree.
So this is a mystery, and I want to understand this mystery a little bit. But at a higher level, this is a very nice example where I'm given a biased sample from a distribution, because I'm only seeing samples from the conditional of the blue distribution in the interval 0 to 1/2, and using those samples I'm trying to fit a density. But I want that fit to be good in the extrapolation region, outside of the domain where I actually saw samples. So that's a little experiment that I did, and that's just the beginning of the talk.
Do I see a question here? I cannot see it, but please interrupt me if there is a question.
PRESENTER 2: There is one quick question: does this work better with Hermite polynomials?
CONSTANTINOS DASKALAKIS: You mean if I didn't fit a log-- so what I'm doing here is that I have to make sure that I get a positive measure. That's why I sort of exponentiated it-- I raised e to my polynomial. That's my way of saying, sort of like, I want something simple, but I have to make it positive.
So if I was to fit Hermite polynomials on the density, then sort of like I would have to also see-- so yeah, I have to maintain it to be positive. So that would be another hurdle to have. That being said, I didn't try it, so I cannot claim that you would see this double descent behavior. But it may be.
So to take the broader sort of perspective: machine learning-- what is machine learning? Machine learning is to construct something, an algorithm, that takes inputs, processes them, and outputs something that is good. Like what's happening in this picture.
And the machine learning pipeline says, go out in the world, collect the relevant data for the phenomenon you're interested in understanding, partition your data into training and validation sets, select a family of models, train them, choose hyperparameters for your algorithm, and then deploy it in real life. And that paradigm has gone very far. If you put it on steroids, it does really well in image recognition. It beats humans in performance on ImageNet in terms of predicting what's inside pictures.
And so when I say humans, I mean a particular person, Andrej Karpathy, who actually sat down and measured his performance on ImageNet. But more broadly, I mean, it's undeniable that these technologies have been deployed in the real world, and they are around us. So they are being used in our cameras, in our security cameras at home, or whatever. This works.
But it does also have issues, and one reasonable question is, what does this whole progress that is being reported and that we experience really mean? And to put it in a bit of a different context, this is a celebrated ProPublica article that was written over four years ago that raised a super important issue with the use of algorithms more generally, and machine learning in particular, for important decision making, such as deciding whether or not to detain somebody before their trial.
So after a lot of experimentation with algorithms that are actually deployed in practice, what the authors of this article found is that they contain bias. And that's just one screenshot from the article. So here are two individuals who, based on their prior offenses, appear to have different risks. Somehow the algorithm reversed its assessment of the risk of each person committing a crime before their trial or not appearing at their trial. So the algorithm somehow assigned scores that are not consistent with what we would think are reasonable predictions.
And there are many reasons why algorithms might be biased. Maybe just the features that you input into your algorithm, or the architecture of your algorithm, are problematic. But today I wanted to talk about a very specific bias in the data that are used to train algorithms, and that is a classical type of bias: selection bias in data collection.
So when you go out to collect data, before you actually try to train your algorithm, there might be systematic sources of selection bias in your data collection. And it's very likely that if you train an algorithm using that data, you will see prediction bias as well. So the goal of the work that I would like to outline a little bit is to decrease bias by developing statistical methods that are robust to what is called, in statistics, censoring or truncation of data.
So to be precise, truncation in statistics refers to the situation where samples that fall outside of a certain observation window are hidden from the data collection, and their count is also hidden from the data collection. You don't even know if that data existed. Censoring is similar, but a slightly easier situation where at least you know that there are some samples that were censored that you cannot use.
And truncation and censoring appear a lot in many human endeavors and data science scenarios. They reflect limitations of measurement devices that may saturate in certain bands of measurements, and as such produce unreliable readings in those bands. Or they could be due to bad experiment design that ignores certain subpopulations that are important, or ethical or privacy considerations, or the law, which may disallow the algorithm from using some of the data because it might violate privacy or other things. So these types of biases in data collection appear in lots of sciences, and there's a lot of work dedicated to actually understanding and correcting for those types of biases.
Next, I want to give a few examples and sort of illustrate what might be going wrong if you're naive about your analysis if your data is biased. And I want to start with the following example. It's a classical example in econometrics. It's about whether IQ is important for income for low skill workers.
And of course, I mean, this is not super well defined, so you need a running definition for what a low-skill worker means. And for the purposes of the papers that I'm citing here, low skill means they make below a certain number of dollars per hour. Let's say, in today's dollars-- I don't know, maybe that's not the right number-- but let's say under $10 an hour. Probably in the '50s and '70s, or even now, that's not the right definition, but whatever. This is our running definition as an example.
So the question is, for low-skilled workers, does IQ influence the income of that person? When you're interested in low-skill workers, it's very reasonable to collect data from-- and this is what these papers propose to do-- to collect data from a poor neighborhood. So what these papers did is they used data that were collected in a district where the average income is smaller than 1.5 times the poverty line. They collected data in the form of pairs of feature vectors xi and scalars yi, where the feature vector was the IQ of the person, the occupational training of that person, the education level of that person, whether that person belongs to unions at their work, et cetera, et cetera. And yi is the income of that individual.
And what they did is they fitted some model. In particular, they fit a linear model, and they looked at the coefficient of the fitted model in front of IQ to understand, does IQ influence the response variable, which is the earnings of that person? But the obvious issue that arises from this approach, which otherwise looks pretty reasonable, is that by collecting your data in a particular district where the average income is less than 1.5 times the poverty line, you're doing some thresholding of your data. And that thresholding might actually introduce bias into the model that you fit, and then also, if you interpret your model as reflecting reality, into the conclusions that you draw from your model.
And indeed, that is the case, as was shown by a follow-up paper in the '70s which debunked those prior works, which had actually found that IQ does not affect income. As these guys actually proved with a better analysis that takes into account the bias in the data, IQ does actually influence income. And what I want you to think about is the following picture, and I think that picture says a lot about why selection bias in the data might result in model bias.
When we do linear regression, our mental picture of the world is the one shown here. The world is pretty much a line, and then every observation receives some independent noise that makes the observation slightly off of this line. So there's an orange line, which is defined by this coefficient vector, and then some noise that spreads the points around that line. And if the world is like this, doing least squares regression is a great idea.
What happened in the experiments about IQ and earnings that I showed in the previous slide should instead be thought of as the following scenario, where on the left are all the data that exist in the world, but on the right are the data that you see if you threshold the income of the person below a certain threshold. So these guys went to a neighborhood where the income was below 1.5 times the poverty line. So out of all the points that the world has produced, which are on the left, by going to that neighborhood they effectively collected the points on the right, thresholding all the true points below a certain income threshold.
Now, if the data that you see are only the data on the right, well, this line looks like a better fit. And as you see in that new fit, IQ has a much smaller effect on the income than it does in the real world. So it's possible that, let's say, high-IQ low-skill workers do multiple jobs, are more creative about where to find opportunities to work, and so on, and so forth. And as a result, they have higher income.
But they have higher income, and live in a different neighborhood, and by going and truncating your data you missed them. Because you decided to focus on low-skill workers from a particular neighborhood where you know that the income is below a certain threshold. So this is what went wrong in those papers. And what I want you to get out of this middle picture is why thresholding data-- in this setting, thresholding data with regards to the response variable y-- why truncating them may lead to bias in the models that you fit. So that's my first example of how truncation results in bias.
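To make this concrete, here is a minimal simulation sketch of the effect just described (all numbers are illustrative assumptions, not values from the cited papers): ordinary least squares on data whose response has been thresholded recovers a much flatter slope than the true one.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
iq = rng.normal(100, 15, size=n)               # covariate, e.g. IQ
income = 0.5 * iq + rng.normal(0, 10, size=n)  # true world: income depends on IQ

def ols_slope(x, y):
    # Slope of the least squares line y ~ a + b*x.
    design = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(design, y, rcond=None)[0][1]

# Only observe the low-income subpopulation (truncation on the response).
keep = income < np.quantile(income, 0.3)
print("slope on all data:      ", ols_slope(iq, income))              # ~0.5
print("slope on truncated data:", ols_slope(iq[keep], income[keep]))  # biased toward 0
```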
My second example is a much-discussed example from a couple of years ago, maybe two or three years ago, including this paper by Buolamwini and Gebru, who took gender recognition software that was state of the art at the time of the paper and tried that software on different subpopulations of subjects. As you see in the numbers that are reported here, the algorithms do really poorly for darker-skinned female subjects. Really poorly. And I'm pretty sure that these companies wouldn't have released their algorithms had they suspected that they do so badly for certain subpopulations.
So the explanation offered in that paper by Buolamwini and Gebru was that the training data that were used to train those algorithms contained many more faces that were of lighter skin tone, male, and Caucasian. And as a result, the training of the gender classifier paid less attention to the other subpopulations, which resulted in, as you can see here, bad accuracy over there. And that example is of a different kind. Here we have a situation where, based on some criteria that may have to do with the label, but also even with the picture itself, the data that were collected to train those classifiers are not representative of the real population of humans, and the algorithm is biased.
So the last example-- and then I'm going to do a little pause to address the first questions that might have arisen so far-- is an even older type of bias that was identified by astronomers, and that is called Malmquist bias. So, as you know, on the right, a luminous source, a source of light, emits light. But the intensity of the light that the source emits drops as 1 over distance squared with respect to the distance to the observer.
Because light spreads in spheres, and the surface area of a sphere increases with distance. And as a result, the intensity spreads over bigger surfaces. And as a result, the source appears less luminous the further away from it you are. So this is something that we know from mathematics or physics, but what is the result of that in astronomy?
Well, when you take a photo of the sky, your film has some sensitivity and some sensitivity threshold. And it could very well be-- and it does happen-- that equally luminous objects, depending on how far away from Earth they are, may or may not show up in the film in a photo of the night sky. So at this level of luminosity, depending on where the star is located-- say this star here is more luminous than that star, but because it's further away from Earth than that star-- the closer star ends up appearing in the film while the more luminous, farther star does not.
So if you collect data like that and you don't take into account this phenomenon, then when you fit a model, you get a false trend of increasing brightness of stars as you go away from the Earth. So that bias was identified by Malmquist, and astronomers have ways to correct for that type of bias. In our machine learning lingo, this is a scenario where you have an unsupervised learning problem, you collected some data, and you're trying to fit a model. And because there is truncation, again based on a function of the luminosity and the distance, you get biased models if you naively fit on the data.
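A minimal simulation sketch of this bias (all numbers here are illustrative assumptions): intrinsic luminosity and distance are independent by construction, yet among the sources that pass the flux threshold, the average luminosity grows with distance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
luminosity = rng.lognormal(mean=0.0, sigma=0.5, size=n)  # intrinsic brightness
distance = rng.uniform(1.0, 10.0, size=n)
flux = luminosity / distance**2          # apparent brightness falls off as 1/d^2
observed = flux > 0.05                   # the film's sensitivity threshold

# The false trend: mean observed luminosity increases with distance.
for lo, hi in [(1, 4), (4, 7), (7, 10)]:
    band = observed & (distance >= lo) & (distance < hi)
    print(f"distance {lo}-{hi}: mean observed luminosity = {luminosity[band].mean():.2f}")
```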
So this is a little intro to the types of problems that I'm interested in addressing, and maybe I can pause a little bit for a few questions.
PRESENTER 2: Sounds good. We have a little conversation going on that you can hopefully clarify. One person is saying-- Sasha is saying, I'm lost. How is it possible that a polynomial function with higher degree fits worse? Couldn't the degree equals 12 polynomial simply set the corresponding additional two parameters to 0? That way we'd end up with the same fit as the one for the polynomial with degree equal 10.
CONSTANTINOS DASKALAKIS: So let's start with that. Let's see, where is that? Slide two, maybe? Yeah.
So that is counterintuitive, right? So here's what's happening. Degree polynomial 12, so what data are you using to fit the polynomial? Well, you collected some data from this distribution, and you tried to do the best you can to fit this distribution. So now degree 12 gives you a lot of freedom.
So you go there, and hammer at it, and you try to match this density here as best as you can, exploiting all the flexibility that degree 12 gives you. And what happens is you overfit, right? So your performance-- so here you do very well. You did fit the polynomial well.
But what the polynomial that you've fitted does outside is uncontrollable. So the polynomial of degree 12 actually ends up having a huge amplitude in the interval where you didn't pay attention, because you just fitted in the data that you saw. And the data that you saw was in 0, 1/2, and you did the best you could.
And indeed, the degree 12 polynomial is better than the degree 10 polynomial for the conditional density. For the task it was given to solve, it did better. It fitted it better than the 10. However, it didn't pay attention to the amplitude that arises between 1/2 and 1. And as it turns out, by putting too much effort into fitting the details of the conditional distribution in here, it actually ends up having too high an amplitude towards the right end, around 1. And as a result, if you normalize it differently, if you take that polynomial that you fitted on the conditional and try to extrapolate outside of the stuff that we saw, that huge amplitude close to 1 hits you, and essentially trivializes all the probabilities up until here and shoots up close to 1.
So this is what's happening. Does it make sense? It's a result of extrapolation. We don't usually do extrapolation. Doing extrapolation is a very dangerous endeavor. And for this extrapolation task, as it turns out, 10 does better than 12 because it does not put huge-- it does worse over here, right? But it doesn't put huge amplitude outside, and as a result it does better.
And the observation that we have here is that-- the cool observation is that like, even though 12 overfits the conditional, if I give even more freedom, in the end it doesn't hurt me. So like from 13 on, polynomials become better, and better, and better even in the extrapolation task, which again, is a very dangerous thing to do in general. But as it turns out, it works.
So this guy is very bad because of this big amplitude over here. But eventually, you see how well 15 does. It does great. It gets almost all the details of our distribution, even though it has not seen it over here at all. It has not seen the distribution at all, yet it fitted it very well.
The green guy does great over here. If I take this green guy and look at its conditional in 0, 1/2, it's going to be a very good fit of the distribution. But if I use it for extrapolation, it's very bad. So that is-- I hope I addressed the question and clarified why this type of thing may happen.
PRESENTER 2: Great, thanks. Do you want to take another one, or move on?
CONSTANTINOS DASKALAKIS: Sure, I can take another one, yeah.
PRESENTER 2: OK. We have one from Sam Holz. A higher degree polynomial will fit points within the range where there is data better, and the improved fit depends on having more data points within this range. But may not do so outside of the range of the data points so that it does not extrapolate better. Actually, that might be more of a comment.
CONSTANTINOS DASKALAKIS: Yeah, so right. As I said, extrapolation is a dangerous endeavor. But the surprising thing that I am showing you here is that it actually works in this example. And I'm going to show you a formal theorem that says under what conditions it does work at the end of the talk. That's why this is an amuse-bouche, but yeah.
The amuse-bouche was supposed to be a little bit provocative. Kind of like, it's kind of unbelievable that it should work. It could work. Because as Sam correctly says, the algorithm has not seen data at all outside of 0 to 1/2. So how the heck can we possibly extrapolate?
For all I know, outside of here this distribution could be crazy. It could be some weird depiction of my face over here, right? And of course, there's not enough information about that over here. So one shouldn't expect that this extrapolation should work. But as I will show, it does work under certain conditions, and I will hammer on them in the end of the talk.
All right, so maybe I should proceed. I'm going to jump to slide 10. We'll see where we are. OK, good. And yeah, so I hope I motivated the type of questions that I'm interested in, and what I want to do is I want to present you a framework for addressing these types of issues.
So I want to define a mathematical model that tries to capture some of the stuff that we saw taking place. So here's a framework for thinking about regression and classification tasks when there is truncation. So bear with me on the model, and hopefully I will give a couple of examples to clarify why the model is set up the way it is.
So let's think about the world as follows. The world samples a feature vector from some distribution. That feature vector is mapped to some latent label z using some mechanism that is parameterized by some unknown vector of parameters, theta star, and noise is also added to that. Presumably, maybe this noise distribution is known-- maybe it's a Gaussian-- or maybe it's also a parametric family of distributions that is also unknown.
So the world samples a feature vector and produces a latent code by using this function-- some rich parametric family-- plus noise. And if the story ended here, that would be my classical supervised learning setting, where I'd see pairs of x and z, and I'd try to fit theta star and potentially also try to fit the noise model.
But the story to capture truncation doesn't end here. With some probability that depends maybe on both x and on the latent code, the nice little sample observation that was created in this way is thrown in the garbage. It will never be observed by somebody. It was produced, it is out there in the world, but the observer is not going to see that point, unfortunately.
With the remaining probability, this point survives observation. Maybe the latent code is further transformed into a label via some function pi, and then the point (x, y) is added to our training set. So the challenge is: using the training data that are produced through this model, which incorporates a little demon here that preferentially promotes some samples and throws some samples in the garbage, you want to estimate theta star, and potentially also the noise model.
And phi might be known-- for example, phi could be, I went to that neighborhood and I'm thresholding the income to below 1.5 times the poverty line-- or phi may also not be known, and you want to learn that as well.
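For concreteness, here is a minimal sketch of this data-generating pipeline; f, the noise distribution, phi, and pi are all placeholder assumptions, not the specific choices of any particular paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_set(n, theta_star, f, noise, phi, pi):
    """Generate n truncated observations (x, y) from the pipeline above."""
    data = []
    while len(data) < n:
        x = rng.normal(size=theta_star.shape)  # the world samples a feature vector
        z = f(x, theta_star) + noise()         # latent code: model output plus noise
        if rng.uniform() < phi(x, z):          # survives the selection step...
            data.append((x, pi(z)))            # ...projected to a label and kept
        # otherwise the sample is silently discarded: truncation
    return data
```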
So I hope that this type of model makes sense first, and also captures a lot of the phenomena that we saw earlier. In particular, I want to instantiate my model here to capture some problems with truncation and see what it does. So the first instantiation I want to do of my model here is to see whether it captures my IQ versus income example. Let's see whether it does or not.
Well, in my income versus IQ example, I assume that the world is linear. The true world is linear. So remember my left hand side picture, it was a bunch of points scattered around a line. The noise, let's say, is Gaussian, and my points are scattered around the line. And again, if the story ended here and I have samples x and z, I could just do least squares regression. I would be done.
So what this noise model does is it creates a nice Gaussian distribution around this thing, and z is distributed according to a normal distribution that is centered over here. That's really what this thing does.
So again, if the story ended here and I had a lot of samples x and z, I could just do least squares regression. However, our little devil-- or let's call them the econometricians who decided to use that truncated data-- threw away, I guess-- sorry, this is the wrong picture-- but they threw away points based on the income. There's no truncation at the bottom; they only have truncation at the top.
They throw away all the points whose income is distributed like this: if it fell above a certain threshold, they threw it away. And I guess my picture is a little bit of a cartoon picture, because it has negative incomes over here. But bear with me, please.
And the data that survives is kind of like the data that wasn't truncated, and that was added to the training set. In this little example here, there is no need for projecting the latent code. The label and the latent code are the exact same thing. So this is how I'm capturing linear regression.
So here's what I would modify to capture logistic regression. Now the noise becomes logistic. Similarly, for the latent code-- I hope you can see this thing here; I cannot see it, there is a bar on my screen-- this says x transpose theta star. So what this really does is it puts logistic noise around this point, so you get z, the latent code, to be distributed like this.
And this is a logistic regression problem. So the projection is just thresholding. After you create the latent code, you look at whether it's positive or negative, and you assign labels depending on whether the latent code is positive or negative. That's a way to talk about logistic regression. But the important addition here is that this phi function might be rejecting latent codes that are outside of some range of values, or doing something more complex based on z. So that's how to capture truncated logistic regression.
So again, if this phi function were 1, this would be exactly logistic regression. I hope that you're also familiar with this way of thinking about logistic regression. Again, if this thing were 1, nothing would be rejected, and this pipeline here would give you exactly what you have in logistic regression. A way to view logistic regression is: you create a feature vector, you create a latent code by adding logistic noise to x transpose theta star, where theta star is a vector of coefficients.
Then you look at the sign of the latent code, and that is the label that you assign to the point. You add your point to the training set. The only modification that I made to this standard pipeline is I said, well, there is a little demon that looks at the latent code. And if it doesn't like the latent code, it just throws away the sample and only allows samples to go through where the latent code, in this example, falls within some interval [A, B].
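Plugged into the sketch from a few lines back, the truncated-logistic-regression instantiation might look as follows (the truncation window and coefficients here are illustrative assumptions):

```python
A, B = -2.0, 4.0                           # illustrative truncation window
theta_star = np.array([1.5, -0.7, 0.3])    # illustrative true coefficients

data = sample_training_set(
    n=1000,
    theta_star=theta_star,
    f=lambda x, th: x @ th,                # x transpose theta star
    noise=lambda: rng.logistic(),          # logistic noise on the latent code
    phi=lambda x, z: float(A <= z <= B),   # the "demon": keep only z in [A, B]
    pi=lambda z: int(z > 0),               # label = sign of the latent code
)
```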
But more generally, this phi function could be probabilistic, not 0-1. And it could also be a more complex criterion for what samples are allowed and what samples are not allowed. So hopefully my model from, sort of, three slides ago-- so, in here-- makes sense. Again, more generally, I want to be thinking of a covariate vector being created, and that covariate vector being mapped to a latent code.
In a regression setting, maybe there is no projection. But in a classification setting, there might be a projection. But importantly, there is a demon over here that looks at both z, the latent code, and x, the vector of covariates, and might decide to throw the sample in the garbage bin. And then it's not going to be added to the training set.
With a similar logic, of course, I can capture truncated density estimation. So over there you can think of a nice point being sampled, like all the stars in the sky. But based on some criterion-- the distance and luminosity of the star-- that star may not be depicted on the film; otherwise, it will be depicted on the film. And again, I only have access to truncated samples like this, and I want to estimate the law that governs the intensity of the stars and their distance from the Earth.
And again, the truncation mechanism might be known or might not be known. Of course, the more general the problem becomes, the less identifiable it becomes from data. So if you make it too general-- if this is too general a family, and this is unknown and too general a type of function-- it might not be possible to identify theta star and/or phi from the truncated samples that you see. But as I will argue next, if there is enough structure over here, you can actually do it.
So the types of problems that I discussed both through examples but also through my high level models are very classical in statistics in the field that is called censored and truncated statistics. There is a ton of work in statistics, but also econometrics on these types of problems. Heckman, who unfortunately isn't cited here, got a Nobel Prize for these types of works.
Censoring and truncation are, of course, very much related to the domain adaptation problem in machine learning, which is trying to resolve a similar thing. So the challenge faced from a statistical and computational point of view in this literature is that while the error rates of the algorithms do scale well-- they achieve parametric rates in terms of the number of truncated observations that you see-- their dependence on the dimensionality of the problem, the number of parameters of your model, is very bad, or badly understood. Also, the methods that are used are computationally inefficient and do not provide convergence guarantees or running time guarantees.
So in recent work in my group at MIT with students, and also undergraduates, and also former students, and their students, and so on and so forth, we have been hacking on these problems, and we're making substantial progress, I would say, towards producing computationally and statistically efficient algorithms, and even handling arbitrary truncation sets. So in my two little examples a couple of slides back, I was looking at truncations that were intervals. In this work we can actually allow very, very, very permissive types of truncation sets. They don't even have to be convex. They can be pretty much arbitrary.
And the types of problems we can solve, both computationally and statistically efficiently, are the classical regression problems, like truncated linear, logistic, and probit regression, high dimensional linear regression and compressed sensing, and certain non-parametric density estimation problems. And just for the sake of being concrete, about the types of rates that we have: for example, for truncated linear regression, we can recover the statistical rates that you have for untruncated regression, even though we have truncated sampling. For compressed sensing and high dimensional regression, again, we can recover the rates that you see that pay sparsity times log dimension over the number of samples. So a lot of the guarantees that you have for untruncated regression, we can reproduce here even though the setting is more challenging. And we have efficient algorithms accompanying these types of guarantees.
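For reference, the kind of high dimensional rate being referred to here is, schematically (this is the standard form of such sparse-regression guarantees, not a quote of the exact theorem statement), for an s-sparse parameter vector in d dimensions estimated from n truncated samples:

```latex
\|\hat{\theta} - \theta^*\|_2^2 \;\lesssim\; \frac{s \log d}{n}.
```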
So now you may ask, so the references here span the whole century of statistics and econometrics. Why the heck do you come in the 21st century and present those results now? These are very classical problems. Who are you? What happened? Why are you able to do these things now?
And there is a confluence of things that actually happened that permit both getting the theoretical guarantees that I mentioned here and also making us more able to apply these types of techniques, even in situations where the model that we're trying to uncover is a deep neural network. The top two have to do with the mathematical understanding that enables our theoretical results. The last bullet talks about the more pragmatic hardware innovations that allow us to deploy our techniques even when the models that we're trying to uncover are deep neural networks.
Of course, over here you will lose these types of guarantees because you don't have these guarantees even if you don't have truncation. But at least you can get efficient algorithms that sort of, if you want, carry all the advancements that have happened in standard supervised learning to even supervised learning with truncation. But let me talk more about these top two bullets, which are the mathematical understanding that enabled us to get these results.
So the confluence of ideas is twofold. One is that mathematics has made a lot of progress in fields that, maybe unexpectedly, are very important and allow us to make progress on these problems. In particular, one of the tools that we use is the advancements around concentration and anti-concentration of measure for polynomials of random variables-- for example, the Carbery-Wright theorem. That is a very important tool. At the same time, a technique that is super important and plays a huge role here is all the advances that we have in stochastic optimization, which were obtained to enable progress in optimization and in machine learning; this understanding is very important for enabling these types of efficiency guarantees.
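To give a flavor of how stochastic optimization enters, here is a minimal sketch, in the spirit of these works, of SGD on the truncated linear regression negative log-likelihood (unit noise variance and a known survival set S are assumed; the projection step and step-size schedule that the actual guarantees require are omitted). Under these assumptions, the stochastic gradient for a sample (x, y) simplifies to x times (z - y), where z is a response rejection-sampled from the current model conditioned on S.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_truncated_linreg(X, y, in_S, steps=20000, lr=1e-3):
    """X, y: truncated observations (every y[i] lies in S); in_S(z) -> bool."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(y))
        x = X[i]
        # Rejection-sample a response from the current model, conditioned on S.
        while True:
            z = x @ theta + rng.normal()
            if in_S(z):
                break
        # Unbiased stochastic gradient of the truncated negative log-likelihood.
        theta -= lr * (z - y[i]) * x
    return theta
```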
So what I want to do next-- I'm almost done with my talk-- is to come back to my amuse-bouche example, which is summarized here. And I want to give you a little vignette of, if you want, an impressionistic picture of when one might expect that extrapolation could be possible. As was correctly pointed out in one of the questions and comments, in general extrapolation is not possible. If you give me the conditional distribution in some interval, then outside of the interval I can observe, the distribution can do funky stuff, and I have no information to extrapolate the funky stuff.
So in general, this problem of extrapolation is impossible, but I want to give you conditions under which it is possible. And here's a theorem I want to state that will kind of illustrate a little bit of what is taking place. So let's walk through this slowly.
So this is not exactly the setting I'm looking at here; it's a little bit of a baby version of it. But still, I'm trying to make an important point. So suppose we are looking at two distributions, P and Q. These distributions are high dimensional, so they're supported on the hypercube. And suppose that the log densities of these two distributions are polynomials of a certain degree k.
Suppose also that I'm only observing these distributions in a subset of the hypercube, and what I want to assume is that this subset is not a trivial subset. So it has reasonable Lebesgue measure-- say 1%, 2%, 5%, whatever-- it is some decent subset of the hypercube. What I want to do is relate the total variation distance of these two distributions on the whole hypercube, which is the numerator, and the total variation distance of the conditionals of these two distributions on this subset, S, of the domain.
And here's a result that you can prove: these total variation distances are related by an inequality which basically provides a trade-off that involves the dimension, the degree of the polynomials, and how big the volume of the set S that we are observing is. So the implication of this inequality is basically that if these two distributions are far apart on the whole domain, there will be some signature of that distance on the conditional distributions.
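Schematically, the relation has the following shape (this is only the qualitative form implied by the description above, not the exact statement; the precise factor is in the paper), where the factor C grows with the dimension d and the degree k and shrinks as the Lebesgue measure of S grows:

```latex
d_{\mathrm{TV}}(P, Q) \;\le\; C\big(d,\, k,\, \lambda(S)\big)\; d_{\mathrm{TV}}\!\big(P|_S,\; Q|_S\big).
```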
To say it differently, it cannot be that the conditionals appear super close when the unconditionals are very far. In this picture here, if I don't know, if I fit the distribution well in the half interval, it cannot be that they're super far outside of the domain where I can observe. And this trade-off here tells you what degree you need to consider for this thing to behave well.
So this is a little vignette. It's not exactly the setting, because here the log density is a sinusoid. But the sinusoid is a smooth function that can be well approximated by a polynomial. So think of, let's say, P as having a log density that is a polynomial approximating this guy super well in the whole domain-- in the whole support, 0 to 1.
So let's say P has a log density that is a polynomial approximating this super well in the full domain. Now, let's say Q has a log density that's a polynomial, and suppose we tried to fit Q to P only in the subdomain. This theorem tells us that it cannot be that this is super small when this is super big. These two things are related. That's really what this thing says.
OK, that's a little vignette of mathematical theorems, but it kind of sheds light on when you could expect extrapolation to be possible. And what this theorem suggests is that if this thing is sufficiently smooth-- which it is-- then by choosing a high enough degree, as dictated by this theorem, you can be certain that you won't find yourself in a situation like the green one, but you will be in situations like the purple one.
So that's a little vignette of theory. Let's do some experiments. So I'm going to do two experiments.
PRESENTER 2: Sorry to interrupt, but we're just about out of time. I don't know if you want to wrap up there or take a few questions first?
CONSTANTINOS DASKALAKIS: I could actually-- I could wrap up there. Let me just show one experiment, OK? And I'm done basically. Let me show a real data set experiment.
So let's see. What we did here is we looked at the UCI million song data set. x, for every song, is a bunch of attributes of that song. z is the year it was recorded. We only view z as a latent code. And y, the binary variable, is going to be whether the song was recorded before '96 or after '96, and we want to have truncation. So the truncation looks at the year the song was recorded and decides to only keep the newer songs-- songs that were produced above a certain threshold c. c could be 1980, could be 1990, or whatever.
So of course, if c is '96, you never see negative examples; you only see songs that were recorded after '96. But if c is minus infinity, it's a standard logistic regression problem. And if c is not minus infinity, it's one of these truncation situations, and here's what happens if you do it naively. So this is the truncation threshold-- which songs I'm considering in my data set. So this is 1980, 1985, 1990, et cetera, up to '95.
As I said, after '96 you only see positive examples, so it's impossible to do anything. So here, I'm comparing how the naive algorithm does on truncated data versus the sophisticated algorithm that uses these techniques. And, you know, you see some advantage. Of course, this is not a realizable setting, so the model is not exactly right. But we see the advantages of being more sophisticated.
And I want to wrap up here. So again, systematic missingness of observations potentially leads to systematic bias in the predictions. The work I described is some formalization of this issue, as well as some techniques to get more robust algorithms. We have a general framework that can apply to deep neural networks using the hardware optimizations [INAUDIBLE]. And we get end-to-end guarantees and efficient algorithms for several classical problems.
Yeah, and future work: there are super important problems over here. Bias in machine learning is a super important area. It is not only due to truncated data; there are many other sources of bias in creating and using algorithms. But there is also a lot of mathematical beauty behind these questions. So: important topic, mathematical beauty, and lots of things to do. Thank you very much.
PRESENTER 2: Great. Thank you for that wonderful talk. We can probably squeeze in two questions if you have time?
CONSTANTINOS DASKALAKIS: Sure, of course, yeah.
PRESENTER 2: Great. Here's one from Sofia. You mentioned not only data, but architecture and features can be biased. Could you comment on some examples for these?
CONSTANTINOS DASKALAKIS: Yeah. So let's say-- I don't know which one is the best example, but let's say that-- as you know, a lot of features are proxy for other things. And sometimes what features you decide to use in your algorithm might inadvertently result in bias.
Let me give one example relating to the decisions about pretrial detention over here. So in certain districts, the rate of arrests is higher than in other districts. So if my algorithm just looks at the-- I mean, this is not the example shown here on the right, so it won't explain this thing.
But imagine an algorithm uses prior arrest information to make a prediction. Now, if in some neighborhood the police tends to arrest people a lot, while in some other district the police is much more lenient, and I make the decision to use the feature of the number of prior arrests, or the types of prior arrests, in my algorithm without thinking about the effect that the police might be more or less lenient in different neighborhoods, that's a decision that I'm making in my algorithm about what features to use, and that's a super important decision. So that is a little example of what can go wrong when you choose certain features or not.
Of course, in this example here, as I said, the person on the left-hand side has many more prior offenses. And my example does not explain the scenario we saw in here, but it explains the other scenario.
PRESENTER 2: Great, thanks. And the last one, from Dario, is: you have always made reference to supervised problems regarding truncated stats and extrapolation. What are your thoughts about self-supervised scenarios such as contrastive methods?
CONSTANTINOS DASKALAKIS: Yeah, so these things will show up there as well. I mean, you mean for unsupervised, generation-type problems, correct? So, if I'm interpreting the question correctly, we're talking about learning distributions. And again, this example here is a situation where, whatever method you decide to use to fit a density, if you don't take into account the possible existence of bias, your model will suffer.