Sobolev Independence Criterion: Non-Linear Feature Selection with False Discovery Control.
Date Posted:
May 5, 2020
Date Recorded:
April 28, 2020
Speaker(s):
Youssef Mroueh, IBM Research and MIT-IBM Watson AI lab
Description:
Abstract: In this talk I will show how learning gradients helps us design new non-linear algorithms for feature selection and black box sampling, and also helps in understanding neural style transfer. In the first part of the talk, I will present the Sobolev Independence Criterion (SIC), which relates to saliency-based methods in deep learning. SIC is an interpretable dependency measure that gives rise to feature importance scores. Sparsity-inducing gradient penalties are crucial regularizers for the SIC objective and for promoting the desired non-linear sparsity. SIC can subsequently be used in feature selection and false discovery rate control.
Paper: http://papers.nips.cc/paper/9147-sobolev-independence-criterion.pdf Joint work with Tom Sercu, Mattia Rigotti, Inkit Padhi and Cicero Dos Santos
Bio: Youssef Mroueh is a research staff member at IBM Research and a principal investigator in the MIT-IBM Watson AI lab. He received his PhD in computer science in February 2015 from MIT, CSAIL, where he was advised by Professors Tomaso Poggio and Lorenzo Rosasco. In 2011, he obtained his engineering diploma from Ecole Polytechnique, Paris, France, and a master of science in Applied Maths from Ecole des Mines de Paris. He is interested in Deep Learning, Machine Learning, Statistical Learning Theory, and Computer Vision. He conducts modeling and algorithmic research in Multimodal Deep Learning.
TOMASO POGGIO: This is another CBMM virtual meeting. And through Zoom and the internet we have the opportunity to see people who are sometimes too far away or difficult to reach-- in this case, one of the CBMM alumni.
Youssef was at MIT. He got his PhD in 2015, formally supervised by me, in practice by Lorenzo Rosasco. I regret not having spent enough time with him, because he did, and is doing, great work. And you will hear from him about the Sobolev independence criterion.
YOUSSEF MROUEH: Thanks so much. Thank you so much, Tommy, for the introduction. And thank you for the invitation. And thanks also to Hector and Gadi.
So I will be talking today about the Sobolev independence criterion. It's a method for non-linear feature selection where we also care about false discovery rate control. This is joint work with a bunch of colleagues and ex-colleagues at IBM: Tom Sercu, Mattia Rigotti, Inkit Padhi, and Cicero Dos Santos.
I will start, first of all, by presenting some motivation. Then I will delve a little bit into the details of what we mean by feature selection and false discovery rate control. And I will tell you about the methods that most of us know-- from LASSO to elastic net to random forests-- that are used in feature selection, especially in bioinformatics and biology.
Then I'll be presenting our method that is called Sobolev independence criterion and how it relates to non-linear sparsity. And then I will show, how do we use our criterion to do false discovery control, using, as well, generative models, and something called holdout randomization tests. And I will finish with some experiments.
So let's start with some motivation. The motivation is that when we do scientific discovery, things might go wrong. For example, if we only care about simple correlation, we may end up with those kinds of headlines in certain newspapers-- that spaghetti sauce and pizza fight cancer. So in a sense, if you look only at spurious correlations without controlling for false discoveries, things may go wrong.
Another example of this is the crisis of reproducibility that has been seen recently across science domains: if you repeat an experiment another time, very often the results do not hold up. In biology, for example, a big replication effort conducted in Germany showed that many of those replications fail-- the original results were false discoveries.
Another case is drug trials: of the studies submitted to the FDA for the acceptance or rejection of a drug, most are also failing. So in a sense, when we are making all of those discoveries, it's important to control for false discoveries while keeping some power, which is the true positive rate. It's the tension we know from precision versus recall in computer vision: true positive rate versus false discovery rate.
So just to set the stage, the data that we will be talking about is in this form. So we have a bunch of individuals, right? So this is on the rows. And on the columns, we have a bunch of features. Think of them as being some genomic features, for example.
And then we have a response for each individual. With its genomic feature, we have a response, y, that is the presence or absence of a disease. And what we want to see is that this response depends only on a few of the features in the population, and we want to know which are those features that are important to predict the response. So it's a very old-fashioned feature selection problem.
So let's just put things formally. So what we have-- we have a vector x. This is, for example, the genomic features. And we have a response y that is the disease. And our goal is to find a subset of features S such that the complement of S-- if I restrict my x to the complement-- it's independent of the response, conditionally on all the other features. So I want to find which are the features that are responsible for this response.
This S is usually referred to as the Markov blanket, and our goal in finding this S is, as I was saying, to maintain a high true positive rate-- if I give you a candidate set, I want it to intersect the correct set S of features-- and to control the false discovery rate: I want to avoid selecting features in the complement of the correct set.
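For reference, the goal being described can be written compactly as finding the Markov blanket:

$$\text{find } S \subseteq \{1, \ldots, d\} \quad \text{such that} \quad y \perp\!\!\!\perp x_{S^c} \mid x_S,$$

where $x_{S^c}$ denotes the features in the complement of $S$.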
So what do we know about this, and how has this been solved in the past? Typically, people solve it via the so-called LASSO and/or elastic net, which fit a linear predictor: beta is just a linear predictor on your data X, and Y is the response.
So Y now is the vector of responses, X is a big matrix of all the data points, and beta is a linear predictor. And typically, you want to control how many coefficients in beta are active-- how many are nonzero. Formally, you use the L0 norm.
And typically, this is relaxed via some convex relaxation-- to an L1, or a mixture of L1 and L2. And when you mix L1 and L2, it's called the elastic net. So adding the L2 term usually enhances the stability of the method in this case, when we do L1.
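In symbols, the elastic net estimator being discussed is

$$\hat{\beta} = \arg\min_{\beta} \; \|Y - X\beta\|_2^2 + \lambda \|\beta\|_1 + \lambda_2 \|\beta\|_2^2,$$

where the $L_1$ term is the convex relaxation of the $L_0$ count and the $L_2$ term adds stability.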
So typically, how do we do the feature selection after we solve this? Then we get a linear predictor. We sort the absolute values of the beta j, and then we use this as a statistic to say if a feature is important or not.
The caveat of this method is that it is linear, right? It will not account for any non-linear dependency between Y and X. Even so, this method is still very widely used in many, many science domains. We know very well how to cross-validate the values of lambda and lambda 2, and there are a lot of packages that can handle this.
So this method is still very widely used, although it has the limitation of only being capable of fitting linear predictors.
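As a minimal sketch of this baseline (using scikit-learn; the hyperparameters here are illustrative, not the values used in the talk's experiments):

```python
# Elastic-net feature scores: fit a linear predictor with L1 + L2 penalties
# and rank features by the absolute value of their coefficients.
import numpy as np
from sklearn.linear_model import ElasticNet

def elastic_net_scores(X, y, alpha=0.1, l1_ratio=0.5):
    # alpha scales the total penalty; l1_ratio balances L1 vs. L2.
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    model.fit(X, y)
    return np.abs(model.coef_)  # the statistic |beta_j| for each feature j

# Usage: ranking = np.argsort(-elastic_net_scores(X, y))
```

In practice, alpha and l1_ratio would be chosen by cross-validation, as the speaker notes.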
Another very popular method, due to Leo Breiman, is random forests. The typical approach is that you consider your data set of features and responses, you fit many decision trees, and then you aggregate all those decision trees into your random forest.
And Leo Breiman-- actually, in the paper where he introduced random forests-- introduced feature importance scores via those decision trees. Those feature importances are computed by looking at how many times a feature has been visited in those trees.
So, on average, if a feature was visited many times in the decision trees, it means it is important. This part is not backed by theory-- I'm just saying the feature importance scores are not backed by theory, not the random forests themselves.
So the random forests have a very good capability of fitting non-linear dependencies between X and Y, and it's very widely used in all domains where we need, and we care about, interpretability and feature importance scores, such as bioinformatics.
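A corresponding sketch with scikit-learn's random forest (its impurity-based importances are a close cousin of the visit-count heuristic just described; again, this is illustrative, not the speaker's code):

```python
# Random-forest feature importances: fit an ensemble of decision trees and
# read off how much each feature contributes to the splits.
from sklearn.ensemble import RandomForestRegressor

def random_forest_scores(X, y, n_trees=500):
    forest = RandomForestRegressor(n_estimators=n_trees)
    forest.fit(X, y)
    return forest.feature_importances_  # non-negative, sums to 1
```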
So now that we have talked about the methods that are present in the literature, we will revisit how we think about feature selection from an information-theoretic point of view. So let me first describe how we will revisit this problem-- this very old problem.
The way we will think about it-- I will start by setting the stage first: how do we define distances between distributions using sup problems? OK. So we'll define a variational distance between a distribution p and a distribution q via a function f that has high values on the support of p and low values on the support of q.
So this is the sup over f in F of the expectation of f over p minus the expectation of f over q. If you think about it, this is just a function-- they usually refer to it as a witness function-- that is telling us: can I tell apart p from q? Right? It's a classifier, if you want, between p and q.
Those types of distances between distributions are called integral probability metrics-- integral because in the objective there is an integral: you are integrating over those distributions. In this family there are many distances, and the distance is determined by the choice of the function class F.
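In symbols, the integral probability metric being described is

$$D_{\mathcal{F}}(p, q) = \sup_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[f(x)].$$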
Since we are at CBMM, we'll talk first about RKHS. If F is the unit ball of an RKHS, this distance is called the maximum mean discrepancy. And if the kernel is universal, it really defines a proper distance between distributions.
Now, back to our problem. So what does it mean-- mutual information? So the mutual information-- we know that it is the distance between that joint distribution of the genomic features and the response versus the product of marginals. So seeing the joint distribution-- how reality is, how the features are with their responses together-- versus a scrambled view of reality. I take a genomic feature, and I assign to it a random response.
So the mutual information, as it is defined usually-- it's just the KL distance between the joint and the product of marginals. The issue with KL is that with high-dimensional data, it's not easy to estimate. So we will define our mutual information, now, via this integral probability metric, by just saying, instead of p and q, I will put the joint distribution, and the q-- it will be the product of marginals.
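So instead of the usual $I(x;y) = \mathrm{KL}(p_{xy} \,\|\, p_x \otimes p_y)$, the dependency measure used here is

$$D_{\mathcal{F}}(p_{xy}, \, p_x \otimes p_y) = \sup_{f \in \mathcal{F}} \; \mathbb{E}_{(x,y) \sim p_{xy}}[f(x,y)] - \mathbb{E}_{x \sim p_x} \mathbb{E}_{y \sim p_y}[f(x,y)].$$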
Do we have, so far, any questions?
TOMASO POGGIO: We're OK.
YOUSSEF MROUEH: OK. So how do we do feature selection within this point of view? So within this point of view, feature selection becomes, I want to find a gate w such that if I gate my input x, I maintain maximum mutual information.
So remember, now I'm putting D of pxy-- this is the joint-- comma px py. But what I did here-- I gated x. I'm putting a gate w, and I'm asking this gate to have an L0 norm of at most s. So in a sense, I am trying to zero out many of the features. I want to see for how long I can keep my mutual information maximal. So this is the information-theoretic point of view of feature selection: find a gate such that if I mask some features, I still maintain a very high value of the sup.
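Schematically, the gated problem is

$$\max_{w \,:\, \|w\|_0 \le s} \; D_{\mathcal{F}}\big(p_{(w \odot x),\, y}, \; p_{w \odot x} \otimes p_y\big),$$

where $w \odot x$ masks the input coordinates.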
So now, let's expand what's going on inside this, because D is also a sup, right? So we have a sup over the gate, sup over the classifier between the joint and the product of the marginals. So now, if we just think about it, this sup over w and sup over f can be absorbed in one sup, because it's still on the function class F.
But all we are asking is that the derivatives with respect to the inputs xj be sparse. So in a sense, instead of solving this difficult problem, where I want to learn a function and a gate, I could learn one function and constrain its gradient with respect to the input to be sparse-- which becomes obvious when you think about it.
We always do it in those attention- or saliency-based methods, right? We look at the gradient of the function with respect to the input to get a sense of if this feature is important or not. But usually, in saliency-based methods, you do it as an afterthought, right? So you just say, OK, here's my model. Let me see what I can explain with it.
Here, we want, really, our gradient to be meaningful, so we'll have to do [INAUDIBLE] which is in the next slide. So as I was saying, instead of the formulation with the gate and the function, the thing is to say, let me learn a function and regularize the gradient of this function so that it is sparse.
And here to the rescue, there is this notion of non-linear sparsity. So remember, when we talked about sparsity of linear models, it was very easy. You just look at the L1 norm of the linear predictor.
Now, to define sparsity for a non-linear function, there is this very nice notion of non-linear sparsity that was defined, actually, by Lorenzo in 2013-- Lorenzo Rosasco. It is the number of features for which, on average, my partial derivatives are different from 0.
So in a sense, if you want to get an estimate of whether a feature is useful, you need to look at the expectation of the squared partial derivative with respect to that input. But this quantity is difficult to deal with, because it's a counting norm.
So there is a relaxation for it that is similar to L1, L2 norms, which is you will be isolating each coordinate alone-- each xj alone-- and taking the square root. Really the norm, right? The norm of the partial derivatives.
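Written out, the counting norm and its relaxation are

$$\Omega_0(f) = \#\big\{ j : \mathbb{E}_{x \sim \mu}[(\partial_{x_j} f(x))^2] \neq 0 \big\}, \qquad \Omega_S(f) = \sum_{j=1}^d \sqrt{\mathbb{E}_{x \sim \mu}\big[(\partial_{x_j} f(x))^2\big]}.$$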
So in a sense, what we want is that this Omega_S(f) that is written here-- an expectation over some distribution mu that will be defined later-- is small. Is this clear? So the only difference between the usual linear sparsity and the non-linear sparsity is that linear sparsity was very easy to define.
Here, we need to take a look at the partial derivatives, and make sure if they are big or small. But since the partial derivatives are defined on points, we need to define things on average, right?
And if I take away the square root, there is an issue, right? If I take the square root away, what happens is that this becomes just an L2 norm of the gradient, so I'm not making the coordinates compete. So I just want you to see here that this square root is very important to isolate each feature alone, because if I remove it, all the features are treated equally, right?
So now, back to how we define our feature selection. We are looking for a function that compares the joint distribution to the product of the marginals. We are asking the function that makes this comparison to have the notion of non-linear sparsity we talked about-- this is the equivalent of the L1 in the LASSO. And for stability, as the equivalent of the elastic net, we will add an L2 norm of the function.
If you plug in a linear function, Omega_S(f) ends up being just the linear sparsity-- the usual sparsity-- because the derivative of beta transpose x with respect to xj is just beta j. So in a sense, for the linear predictor that we know about, it reduces to the usual gradient sparsity.
And in general, we will use mu to be just the product of marginals, because it covers all the space, right? The product of marginals covers more than just the joint. This is mu.
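Putting the pieces together, the SIC objective is, schematically (constants and exact weightings as in the paper),

$$\mathrm{SIC}(p_{xy}, p_x p_y) = \sup_{f} \; \underbrace{\mathbb{E}_{p_{xy}}[f] - \mathbb{E}_{p_x p_y}[f]}_{\text{dependency (green)}} - \lambda \underbrace{\Big( \sum_{j=1}^d \sqrt{\mathbb{E}_{\mu}[(\partial_{x_j} f)^2]} \Big)^{2}}_{\text{non-linear sparsity (blue)}} - \lambda_2 \underbrace{\mathbb{E}_{\mu}[f^2]}_{\text{stability (red)}},$$

with $\mu = p_x \otimes p_y$.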
Now, what do we gain out of this? So if we think about it, the expectation of the gradient-- of the partial derivative on a feature j-- will be our feature importance score. Because at the end of the day, if I train, I can compute how much this feature was important or not for this non-linear model. Then, again, I'm getting, also, a feature importance score.
So now we will start to try to optimize this cost. Before I delve into it, any questions on the formulation?
AUDIENCE: Yes. Can you remind me what exactly f is? I remember that it's related to the mutual information.
YOUSSEF MROUEH: Yeah. So f-- think of it just as a classifier that is giving plus 1 label to the joint distribution-- to the genomic features and the correct response. This is pxy.
And then px, py-- it's given a minus 1 label. It is when you scramble the responses, right? Think of it just as a contrastive loss that has high values on the joint, because you want to really learn what it means to be in the joint distribution. To learn the support of the joint distribution, you are given negative examples around it.
AUDIENCE: So how does that relate to the mutual information?
YOUSSEF MROUEH: OK. So the mutual information is the KL distance between the joint and the product of marginals, right? It's the KL distance. But instead of KL, you can use any other distance. And what I'm saying-- without the regularizer in blue and red, the green part-- the sup over f-- this is a distance between distributions.
If pxy were equal to px py, meaning they are independent, the sup is 0, right? And if they are not, it will try to find the distance between them. So the idea here is that instead of using the complicated KL to compute mutual information-- in high dimension you cannot do it, right?-- we are just using another form of measuring the distance between the joint and the product of the marginals that is easy to compute.
AUDIENCE: So sup over f is the distance, or f is the distance?
YOUSSEF MROUEH: No. So the full sup.
AUDIENCE: OK. Thank you.
YOUSSEF MROUEH: Yeah. So really, the first sup is the distance. And the parts in blue and red are just controlling the sparsity. So this is the regularization. So the main loss-- that is the dependency measure-- is the part in green. The part in blue is the part that is responsible for the gradient sparsity, and the part in red, on the extreme right of the screen, is just for stability, similar to elastic net.
So now we will turn to how to optimize this cost, and see where the complication is. The gradient penalty is a square root of an expectation. So imagine, now, that I want to do stochastic gradient descent.
As I get a mini-batch, this is an issue, because with a non-linearity outside the expectation, the square root of a mini-batch average is a biased estimate of the square root of the expectation. So this is one big issue. The other issue is that the square root by itself is not smooth at 0-- the derivative blows up at 0. So we will show how to alleviate those issues using two tricks.
So the first easy trick, for the non-smoothness at 0: just add a small perturbation epsilon, and let this epsilon go to 0. Then things will be fine. For the other issue, we need to remove the square root in some way.
But as I said before, the square root is very important, because if it was not there, it's just a Sobolev norm. It's just aggregating all the gradients together. So by putting this square root, we are isolating each coordinate so that they compete between each other.
So now, in order to remove the square root, we are going to resort to another variational trick that is called the eta trick. It's a very neat trick: it allows you to remove the square root at the expense of adding auxiliary variables.
So remember, what we have here is a sum of square roots of-- whatever's inside, I will call it aj-- raised to the power 2. So it is the sum over j of square root of aj, all raised to the power 2. And you can see that if I add other variables-- eta j's-- in this way, such that it's the sum of aj divided by eta j, with the sum over eta j equal to 1, I will recover the objective that I had.
So at optimum-- if you want to believe me-- at optimum, what you have-- it would be good if I can-- is there any way to annotate? Let me try that.
So if I take the optimal eta j-- square root of aj divided by the sum over k of square root of ak-- and plug it in, then aj divided by eta j becomes square root of aj times the sum over k of square root of ak. And summing over j, I get the square of the sum of the square roots-- this guy. So in a sense, the neat part of this trick is that I can remove this square here and this square root here by writing it this way.
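Written out, the eta trick is

$$\Big( \sum_{j=1}^d \sqrt{a_j} \Big)^{2} = \inf_{\eta \in \Delta} \; \sum_{j=1}^d \frac{a_j}{\eta_j}, \qquad \Delta = \Big\{ \eta : \eta_j \ge 0, \; \sum_j \eta_j = 1 \Big\},$$

with the infimum attained at $\eta_j^\star = \sqrt{a_j} \big/ \sum_k \sqrt{a_k}$; the inequality direction follows from Cauchy-Schwarz.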
And what I get now-- I remove the square root that was annoying when I had expectation. OK. So let's see how, then, things work.
So the way we will do it is that we will replace the square root using these eta j's, and we will put them in the cost, right? So instead of optimizing only over f, we'll also be optimizing over the etas. And as I was telling you, eta j, at the end of the day, will be proportional to the square root of the expectation of the squared partial derivative. And these are the feature importances that we care about.
So in a sense here, we kill two birds with one stone. So we made the optimization easy, but at the end of the optimization, the eta j's are the feature importance score that you care about. Is this clear?
So the full trick is just to remove the square root by adding auxiliary variables, the eta j's. And the eta j's, at the optimum, are proportional to the square roots of the expectations of the squared partial derivatives-- and these are the things we were after: the feature importances.
So altogether then, the method will become learning over f and eta. Something that does-- still, the green part is like the dependency measure. The sparsity part now is just written in a different way, but it's exactly the same thing that it was before, but we added an auxiliary variable we want to optimize on. And another term for stability.
So the cute thing about this trick is that at the end of the optimization, I'm left with the witness function that is discriminating between joint and the product of marginals. And I have also, for free, each coordinate-- how much it is important, as encoded by eta j's.
And the full optimization is then over the function f, over the etas. f is in a functional space. We haven't yet talked about what it is. And eta j-- it's just a probability, right?
So in a sense, eta j, at the end of the day, is an attention, right? It's an attention that sums to 1, and if you rank it, you will get how much each feature is important. And now, if I have samples, I just replace all the expectation by the empirical estimations of their expectations. And I can solve this problem.
Now, f-- if it was in RKHS, it happens that this problem is convex in both f and eta. So in a sense, if I do alternating optimization between f and eta, I'm guaranteed to converge to the optimum. Or if I do block coordinate descent, I'm also guaranteed to converge.
So let me just summarize what we did so far. We redefined feature selection via this gradient sparsity penalty on a dependency measure, and then we replaced the gradient sparsity with a convenient form that immediately gives us feature importance scores from those eta j.
And then we can solve this problem empirically by specifying what the function class is. And now we will delve into some theory to try to understand what this independence criterion is once the function class is in RKHS.
So if this function class was in RKHS-- here, I'm writing it in a finite-dimensional RKHS, but it's true also for infinite-dimensional. So what we get is that-- remember, you can think about the green part as being the objective, and the blue and red part as being constraints, right?
So in RKHS, in most of the cases, the optimum happens when the objective meets the constraints. So this is very nice for the interpretability of the feature selection method that we have, because the value of SIC decomposes in terms of those feature importance scores.
So as you see, SIC of pxy px py is proportional to the sum of the contribution of each feature. So in a sense, the difference between a regular mutual information and this type of measuring the dependency is that it's granular.
So I can read out from it how much each feature contributed to this mutual information. It's still global dependency, but it's granular in the way that I can read out each feature-- how much is contributed from-- and I can just see it from the eta j.
Now, we can also extend this to not only dealing with it as if it was an RKHS. We can do it also if it was a neural network, right? So this is what I was saying. It's a form of mutual information. So mutual information is not the correct term. I should say a dependency measure that decomposes in terms of the contributions of the coordinate.
And it's interpretable, because then I can read those out in terms of which one has contributed. If I was in RKHS, the problem happens to be jointly convex, and alternating optimization on the function and the eta works, or if I do just the gradient descent on the function and the gradient descent on the eta. And remember, eta lives on a simplex, right? It needs to sum to 1. So if I do gradient descent, I need to be doing mirror descent.
So if we were doing it with a neural network, the choice would be to do stochastic gradient, right? So we learn the function f as a neural network, and then we will update its parameter by just gradient descent.
Then for updating the etas that have the feature importance score, we have just to do it via mirror descent, which is, if you want, you just do the computation in the logit space, then you use softmax to bring it back to become a probability. So you do the regular update of gradient descent on the logits, then you push it through the softmax.
So this is basically the algorithm that we will be using, what we refer to as neural SIC: we will be alternating between learning a neural network that is trying to discriminate between the joint distribution and the scrambled distribution-- the associated product of marginals-- and updating the feature importance scores by doing mirror descent on the logits.
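As a rough sketch of one such training step (in PyTorch; the critic f, the optimizers, and the hyperparameters lam and lam2 are illustrative assumptions, not the paper's exact code):

```python
# One neural-SIC step: ascend the dependency term, penalize per-feature
# gradient energy via the eta trick, and update eta = softmax(rho) by
# gradient descent on the logits rho (mirror descent on the simplex).
import torch

def sic_step(f, rho, x, y, opt_f, opt_rho, lam=1.0, lam2=1.0):
    # Joint pairs (x_i, y_i) vs. scrambled pairs (x_i, y_{pi(i)}), which
    # simulate samples from the product of marginals.
    y_perm = y[torch.randperm(y.shape[0])]
    x_mu = x.clone().requires_grad_(True)

    f_joint = f(x, y)          # should be high on the joint
    f_prod = f(x_mu, y_perm)   # should be low on the product
    dep = f_joint.mean() - f_prod.mean()

    # a_j = E_mu[(df/dx_j)^2], estimated on the product-of-marginals samples.
    grads = torch.autograd.grad(f_prod.sum(), x_mu, create_graph=True)[0]
    a = (grads ** 2).mean(dim=0)

    eta = torch.softmax(rho, dim=0)   # eta lives on the simplex
    penalty = (a / eta).sum()         # eta-trick surrogate of (sum_j sqrt(a_j))^2

    loss = -(dep - lam * penalty - lam2 * (f_prod ** 2).mean())
    opt_f.zero_grad()
    opt_rho.zero_grad()
    loss.backward()
    opt_f.step()
    opt_rho.step()
    return eta.detach()               # eta_j: current feature importance scores
```

Here rho could be torch.zeros(d, requires_grad=True), with, for example, opt_f = torch.optim.Adam(f.parameters()) and opt_rho = torch.optim.Adam([rho]).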
Do you have any questions before I move to the second part of the talk? So far, we have a method that is able to give us feature importance scores that discover non-linear relationships between the x and the y. Now I'm going to move to the other part of the talk, where we want to use those feature importance scores to control the false discovery rate. So how do we do it?
So so far, we trained a method, for example, that gets us a critic function-- the function that's discriminating between the joint and product of the marginals. And it produces for us, also, feature importance scores eta j.
We could just say, OK, I will rank those eta j's, and I will keep the first 50, or the first 60, or the first 70 features as being the most important features. But to set this threshold is very arbitrary if I do that, right? So we need a principled way in order to say when I should say, this is important versus not important.
And this is where we are usually familiar with permutation tests, right? So we need a statistical test in order to tell if a feature is important or not. But as we will see in the next slide, permutation tests-- which consist of just taking a column of features and permuting it across the individuals-- will fail, and I will explain why.
So typically, what we want to know, is a feature j independent from the response y given all the other coordinates. So this is the null hypothesis that we want to test. So I want to know if a feature j is responsible for the response, conditioned on all the other features-- yes or no.
So this is equivalent to saying my joint distribution of x and y decouples this way. That coordinate xj depends on all the other coordinates that is noted here via x minus j. And y only depends on all the other coordinates x minus j. So we want to test if this holds true or not.
So the problem is that I don't have access to those things-- xj given all the other coordinates. What permutation tests do is ignore that you don't have this. They say: OK, all the features are independent, so to generate from xj given all the other features, I just randomly permute the feature j in my population.
So this is wrong, right? Because that coordinate might be dependent on the others. And in most cases, in practice, the coordinates are dependent-- the features are dependent. So permutation tests fail whenever your features are dependent.
And in order to do proper statistical testing for conditional independence, you need to be able to generate a coordinate xj given x minus j. In practice, we don't have those. But these days, we have a lot of powerful generative models to be able to generate such a distribution-- xj, given x minus j.
So the way we will do it is we'll train a generative model that, given all the other features except j, predicts the feature j. This is a conditional model that can be trained easily. So now, given that we have those conditional generative models, and that we have the feature importance scores and the function f that I was talking about from SIC, how can we control for false discovery?
So the method goes this way. You take a holdout set. This is the reality-- this is how the data is: I have those genomic features, and I have this response. Now you want to simulate the hypothesis H0 that you want to test, right?
So you take all your other coordinates, and you generate the scenario where the response does not depend on the coordinate j via your generator. So you create a fake set by generating how it would look under this hypothesis H0. Now, this f was the one that I was talking about, that was trained from our Sobolev independence criterion. So it was trained to have high value on the reality, and it should have low values on the not-reality-- this guy.
So now we have a way to compute the values and do proper statistical testing. We take the holdout set-- this is the reality-- and the fake set-- the fake reality that we want to test reality against, right? That's why I was saying f is doing some distance computation. It's not f alone-- it's the average of f on one set minus the average of f on the other; it's the sup at the end.
So we take the first set, and we get the mean of the function f over the holdout set. And we get the mean of the same function f on the fake set. And we do this for a bunch of simulations of the reality. Usually, when you do permutation tests, you just simulate the reality by permuting those coordinates here-- and as we said, that is wrong if your features are dependent.
Then we compute p-values, and the p-values count how many times, after doing this sampling under the hypothesis H0 n times, the fake scenario I've been simulating scores higher than the real scenario-- remember, f was trained to have high values on the reality and low values on the non-reality.
Then I take those p-values, I feed them to a very well-studied procedure for multiple-hypothesis testing, and I accept or reject whether the feature is important or not. The Benjamini-Hochberg procedure takes an FDR target-- a false discovery rate target that you want to control at-- and then accepts or rejects based on it.
So this is a full pipeline that guarantees you are doing feature selection under some target false discovery rate. And this is important in practice.
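A compact sketch of that pipeline (in NumPy; the trained critic f and the conditional sampler sample_xj, standing in for the SIC critic and the conditional generative model, are assumptions for illustration):

```python
# Holdout randomization test: compare the critic's mean score on the real
# holdout set against scores on sets where feature j is resampled under H0,
# then select features with Benjamini-Hochberg at a target FDR.
import numpy as np

def hrt_pvalue(f, sample_xj, X, y, j, n_sim=100):
    t_real = f(X, y).mean()                  # statistic on the real data
    count = 0
    for _ in range(n_sim):
        X_fake = X.copy()
        X_fake[:, j] = sample_xj(X, j)       # draw x_j ~ q(x_j | x_-j)
        count += f(X_fake, y).mean() >= t_real
    return (1 + count) / (1 + n_sim)         # smoothed one-sided p-value

def benjamini_hochberg(pvals, fdr=0.1):
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= fdr * np.arange(1, m + 1) / m
    k = np.nonzero(below)[0].max() + 1 if below.any() else 0
    keep = np.zeros(m, dtype=bool)
    keep[order[:k]] = True                   # features declared important
    return keep
```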
So this is one example-- a synthetic example where you have a very complicated non-linear function between y and x, and your x is 50-dimensional. And to make it even more difficult, we make all the coordinates correlated, with correlation rho 0.5. But y depends only on the first six coordinates, in this non-linear way, and we want to see how the other methods do and how our method does.
The very first thing here is elastic net, which is just a linear model. In all of those methods, we are using the same procedure that I was describing to control the false discovery rate. The false discovery rate is controlled at 0.1 here.
And if we use the elastic net, as you see, the false discovery rate is well-controlled-- but this is just thanks to the holdout randomization test, meaning that this procedure works to control the false discovery rate. But it takes a hit in the true positives, right? Because it's a linear method, it won't be able to find all the correct relationships-- all the coordinates responsible for the response-- so it takes a hit.
Random forests, for example-- the holdout randomization test using the generative models, et cetera, that we talked about is going to also be controlling the false discovery rate. It will have better true positive rate than elastic net. And by the way, those boxes are averaged on many data sets, just to show that the method is robust with respect to multiple tests where you are doing the training.
So now, if we go to our method that has two instances, we see that the false discovery rate is controlled. But at the same time, we still have a very high true positive rate, which is thanks to the representative power of neural networks.
So you are able to capture very non-linear dependencies thanks to the representation power of neural networks. So you have very high, still, true positive rate, and your false discovery rate is controlled thanks to the whole statistical pipeline that we talked about.
So this is all nice and cute in the synthetic experiments, because we know exactly what reality is-- what the important features are. In real data, we don't know, right? So it's a harder game in reality to know if a method is doing well or not. Here, we know x1 to x6 are responsible for the response, so this is a test to see if the method is worth it or not.
So the real data set we've been using here is the drug response data in the Cancer Cell Line Encyclopedia. You want to see how the anti-cancer drug response relates to genomic features from cell lines.
So again, here the problem is that we don't know what the true responsible features are. So what we do here is that we see how much of a drop there is in error once we select those features, using, for example, a neural network model, or using a random forest method.
So what you see is that on average, in this case, if your predictor was a neural network-- just to see, after you select it, if you still can explain your response-- elastic net here seems to be performing better in terms of MSE. If we use random forest, then you will see that our method would perform better.
And we have a similar thing on HIV. And here, we also achieved better true positive while controlling the false discovery. So this is yet another data set with a similar feature.
And thank you for your attention. If you want, the code for that paper and for all the experiments is available online, with both the false discovery control and the feature selection. Thank you for your attention.
TOMASO POGGIO: Thank you, Youssef. Questions?
AUDIENCE: So I have a question about this eta trick. I'm assuming this only works if you're not taking any derivatives of this expression, or anything like this?
YOUSSEF MROUEH: Can you repeat?
AUDIENCE: I'm assuming this trick only works, really, if you're not then doing any derivatives on top of this, or anything like that, no? Or you're taking--
YOUSSEF MROUEH: It works, because you're just-- remember, here it's a sup over f, right? You're just replacing the Omega_S(f) with its equivalent that is an inf, and then you just pull out the sups. So it's a sup over f and eta.
AUDIENCE: OK. Well let's say that I like this trick and I would like to apply it somewhere else. I'm assuming it only works if I do this kind of sup or infimum operations of my own.
YOUSSEF MROUEH: Yeah. So by the way, this was first introduced in a paper by Pontil-- by Massi Pontil. And then it was used later on to deal with L1/L2 norms, or with multiple kernel learning.
So here we just used it, and it happened that it was nice, because the auxiliary variables ended up being the feature importance scores. And there is a very nice blog post by Francis Bach explaining this eta trick. Let me show you this one-- Francis Bach's blog post.
If you want to read more about it, I recommend this very nice blog post from Francis Bach. We did the work before he publicized it. This was popular back in the group sparsity days, so it's not new.
TOMASO POGGIO: Youssef, have people already been using this software?
YOUSSEF MROUEH: We put it online, but I don't know if it has been widely used. We would like to get feedback, especially from people in biology or bioinformatics, because the method of choice in this domain is random forests-- there are very well-designed packages for it. So I would be curious to see it used in real scenarios, with really high-dimensional data where random forests excel.
TOMASO POGGIO: Yeah. But for instance, at IBM internally, is there a goal-- a need to use something like this or not?
YOUSSEF MROUEH: So we have also a lot of people that are doing genomics work inside IBM. We haven't yet reached out, but in our department, there is a big theme oriented on interpretability, so this was folding under interpretability.
But in order to be useful, I think it should be applied on really large genomics data to see how it compares to random forest. Really, random forests are very hard to beat on these kinds of tests.
TOMASO POGGIO: What about big data coming from the web? Any applications there?
YOUSSEF MROUEH: Yeah. There are people who do this too. I think they also use random forests on it. You take all the logs-- these would be very, very big vectors-- and you want to do some feature selection on them. Mainly, I think people there use linear models, because they are very fast and easy to train.
AUDIENCE: You also have a question from Gleb Zarin on the chat window. Gleb, are you able to ask your question?
TOMASO POGGIO: Sure.
AUDIENCE: Hello, and thank you for the presentation. Could you tell one more time how we can generate xj?
YOUSSEF MROUEH: Conditionally?
AUDIENCE: Yeah.
YOUSSEF MROUEH: So here, what I said is that we trained conditional generative models. We feed a one-hot vector indicating which coordinate we want-- this is how it was trained. So we quantized all the features, and we trained the model to predict the bins of the quantization. Then we mask a feature j, and we put a one-hot on j. And all the layers here have conditional batch norm.
And the goal of this model is to predict the feature j. So it's one model that, conditioned on the feature it should predict, predicts feature j given the other features. You could train it using a GAN or a VAE; we just trained it with cross-entropy here. Does this answer your question?
AUDIENCE: Yeah, thank you. And one more question. If I understand it correctly, you set up alpha before, so you can control how important a feature should be using your statistical test?
YOUSSEF MROUEH: Yeah. So yeah, the statistical test uses the function f that was trained to be sparse in terms of the features. And then this function is used-- these are the p-values right? So here, we were using, in this version, the function f.
So typically, imagine that you have 10,000 features. So then you want to know-- I want to keep like 200 or 100. First, we rank them with respect to eta, and then we apply only this statistical test on the top 100. So there are two steps. The first step is you rank them with respect to eta, because you don't want to apply this test on the 10,000 features. Otherwise it's very computationally expensive.
AUDIENCE: OK, thank you. Thank you very much.
TOMASO POGGIO: All right. Very good.
YOUSSEF MROUEH: OK, great.
TOMASO POGGIO: Thanks a lot, Youssef.
YOUSSEF MROUEH: Thank you, Tommy.
TOMASO POGGIO: We can see you again, and maybe not virtually.
YOUSSEF MROUEH: I hope so.
TOMASO POGGIO: Thanks a lot. Keep safe to everybody.
YOUSSEF MROUEH: Thanks. Bye-bye.