TOMASO POGGIO: I'm Tomaso Poggio. I'm the director of CBMM, the Center for Brains, Minds, and Machines. And I'm very happy to host today David Vogel, who is coming to speak to us about machine learning applications.
David was an MIT student in the math department. And he got involved in prediction, first in health, and then in the world of hedge funds. He started a hedge fund less than 10 years ago, which is now a multibillion dollar company. And he's not only the founder but also the chief scientist and mathematician.
And I think machine learning is at the core of artificial intelligence. It's at the core, in a sense, of science, itself. Because the paradigm of machine learning is the paradigm of modern science, being able, from data, to generate models that can make predictions.
That's why astronomy is different from astrology. And that's why the challenge of finance is one of the biggest challenges you have, from past data, to predict the future. And the test is pretty clear. Either you make money or lose money.
And so, it's very good to have David here. And he will tell you about some of the competitions he has won over the years. There were, initially, a few competitions in prediction, many years ago, which the Santa Fe Institute organized and, more recently, Netflix. And now there is a flood of them with the popularity and practical importance of machine learning.
So David, I'm happy to welcome you here.
DAVID S. VOGEL: Thank you very much for having me. I guess first and foremost on the intro is that I am an MIT graduate. And it's just such an awesome place here. And it's just so great to be here. So I'm really happy to have been invited to give this talk. I'll give a little bit about my background. Some of it was mentioned.
About 95% of my background is on the applied side. I have done a couple of peer-reviewed publications on techniques, but almost all has been on the application side. I've done predictive models in a pretty big variety of areas, many professionally and some just on the side.
I guess a lot of people, in their spare time, might enjoy golf. And some other people might enjoy fishing. And I enjoy analyzing data sets in my spare time. So you can see, I've done everything from sports predictions to medical outcomes.
In the process, I've actually developed a lot of predictive modeling software on my own, which is sort of unique to a lot of applied people. Because there are a lot of tools out there.
Although, on one hand, it slows you down to split your time, between application and creating the software tools, but having written the software tools also allows for tweaking algorithms and making them specific to certain applications. And you can get that little bit extra accuracy when you have that ability to tweak algorithms. So I have done quite a bit of my own software, before.
So in terms of competitions, I've participated in a lot of them over the years. I'll be talking about a couple of them. One thing that I love about competitions-- I encourage people to do them in general-- is they're just a great learning experience. You can learn a lot from publications, as well.
But I found that, if a certain new method is published and it's compared to other methods, there's always a little bit of bias. Because the author would know that technique the best. And so in participating in these competitions, you can get an idea of a fair comparison of all these different algorithms, from different places, being pitted against each other.
Currently, 90% of my time is as CEO and chief scientist of Voloridge Investment Management. So people know me mostly, now, as a hedge fund manager. But really, less than half my career has been in the hedge fund space. And I really consider myself still more of a data scientist and able to apply algorithms in a lot of different areas. I also, in 2014, started VoLo Foundation, and I'm on the board of a couple nonprofit organizations.
A little background on Voloridge firm, I'm not from the finance industry. I didn't have buddies with $10 million to run the hedge fund. So we started, actually, with $80,000 under management. And over the years, the returns started attracting more investors. We kept multiplying until, suddenly, somehow we got to where we're managing $1 and 1/2 billion.
Probably most important of those bullets is that we've got, since I'm here at MIT, seven MIT alumni. We try and hire the best and brightest for our company. And certainly, there's a huge amount of talent here at MIT.
Our investment strategy is completely quantitative. We use machine learning-based models not only for the predicting directions in stock and futures but also for the risk management. And I'll talk a little bit about that later on in the presentation.
And also, to the VoLo Foundation, we try to do some data driven-based decisions in terms of making a difference in the world. And climate change, we found to be probably the most challenging problem that needs a lot of scientific focus.
I'll start with just some conceptual overview of ensemble models. There was a great study a while back, by Elder Research, where five different modeling techniques were applied to six different data sets. And the y-axis is the accuracy of those models. It's actually error rates. So lower is actually better.
And you could see how no one technique worked best on all the data sets. So you got, for example, neural networks did well in most data sets but very poorly on the diabetes data. And logistic regression did well on diabetes but poorly on the investment data set.
So you have this problem with individual models, where there's weaknesses of those models. And you have to try a lot of different things, in general, to find what works best for that particular data set.
So now this next slide shows, if I were to ensemble or combine those models, you can see the error rate drop dramatically. So probably the most basic version of ensembling is just a straight average of those models. You can see that red line.
Over here is just a straight average, dumb average on those modeling techniques. And even that performed better, overall, than any of the individual techniques. And there are some more intelligent ways to ensemble those models that perform even better than that.
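As a minimal sketch of the straight average described above (not code or data from the talk), the following simulates five hypothetical models whose predictions are the true value plus independent noise, then averages them. The independence assumption is generous; real models tend to make correlated errors, so the gain is usually smaller than this toy shows.

```python
import random

def mse(pred, truth):
    """Mean squared error between two equal-length lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

rng = random.Random(0)
truth = [rng.uniform(0.0, 10.0) for _ in range(2000)]

# Five hypothetical "models": each is the truth plus its own independent noise.
models = [[t + rng.gauss(0.0, 1.0) for t in truth] for _ in range(5)]

# The "dumb average": equal-weight mean of the five predictions per record.
ensemble = [sum(preds) / len(preds) for preds in zip(*models)]

individual_errors = [mse(m, truth) for m in models]
ensemble_error = mse(ensemble, truth)
print("individual MSEs:", [round(e, 3) for e in individual_errors])
print("ensemble MSE:   ", round(ensemble_error, 3))
```

Weighting the models by their validation accuracy, instead of equally, is one of the "more intelligent" variants mentioned.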
So with that concept in mind, two of the most popularly used ensemble-based methods are bagging and boosting. I'll talk about those techniques. And then what I call miscellaneous is this group of sort of customizing an ensemble for a specific data set. There's really no generalizing on that.
So the basis of these methods is a decision tree. And decision tree, in its pure form, is probably the least accurate of any modeling technique. And the reason why? It's so discrete. And you run out of dimensions very fast.
If you had a million records in your data set, and you split the tree 10 times, on average, you're going to have only 1,000 left in each node. So you end up not being able to use the value of all your predictors. And so, by itself, a decision tree is a very poor modeling technique.
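The arithmetic behind that figure can be checked in a couple of lines. This is only a sketch that assumes every split is binary and perfectly balanced; the function name is made up for illustration.

```python
def average_records_per_node(n_records, n_splits):
    """Average node size after n_splits levels of balanced binary splits."""
    return n_records // (2 ** n_splits)

# One million records split 10 times gives 2**10 = 1024 nodes.
print(average_records_per_node(1_000_000, 10))  # 976, i.e. roughly 1,000
```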
But when you combine them-- this was discovered, in 2001, or at least published by Breiman. If you average a large set of trees, then, all of a sudden-- and you create some variability in those models, based on different subsets of records and different subsets of variables-- if you combine those, similar to that ensemble concept I showed, where a group of models performs better, the bagged decision trees perform extremely well.
And it's actually, of course, way, way more accurate than a single decision tree. And it's actually more accurate than regression and neural networks on most problems.
So this was published in 2001, by Leo Breiman. And I actually have since met other scientists who were using it before but sort of kept it proprietary and didn't publish the result, just kept it for their own, their own edge.
So to give you an idea of how many trees you need for the bagging to work, this example is from one of the competitions. Yahoo put on a competition on ranking search queries. They had 700 predictors and about half a million records.
And if you ran a single decision tree-- and actually, just to give a frame of reference, the neural networks and regression models got an out-of-sample r-squared of about 0.35 to 0.4. And so this single decision tree, as we expect, performs very poorly out of sample. The difference between training and testing shows that it's an extremely overfit model.
But as you start getting 10, 20, 30 models, you see the out of sample r-squared start increasing. And around 50 to 100, you're passing the accuracy of the neural network and regressions.
And then once you get to 400 or 500, at some point, all of these bagging models start leveling off. Because, at some point, you're just getting redundant trees averaging together. But more trees will never be worse, other than the processing power required to run and score them.
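A minimal, pure-Python sketch of bagging follows. It is not the Yahoo competition model: it assumes a toy data set with a single predictor, so it can only vary the subset of records (bootstrap samples), not the subset of variables. All function names and the data are illustrative.

```python
import math
import random

def fit_tree(xs, ys, depth):
    """Fit a small regression tree on one feature; returns a nested tuple."""
    if depth == 0 or len(set(xs)) < 2:
        return ("leaf", sum(ys) / len(ys))
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold)
    t = best[1]
    lo = [(x, y) for x, y in zip(xs, ys) if x < t]
    hi = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return ("split", t,
            fit_tree([x for x, _ in lo], [y for _, y in lo], depth - 1),
            fit_tree([x for x, _ in hi], [y for _, y in hi], depth - 1))

def predict_tree(tree, x):
    """Walk down the tree until a leaf is reached."""
    while tree[0] == "split":
        tree = tree[2] if x < tree[1] else tree[3]
    return tree[1]

def fit_bagged(xs, ys, n_trees, depth, rng):
    """Fit each tree on a bootstrap sample (records drawn with replacement)."""
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx], depth))
    return forest

def predict_bagged(forest, x):
    """The bagged prediction is the plain average over all trees."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)

rng = random.Random(0)
xs = [rng.uniform(0.0, 6.0) for _ in range(60)]
ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]  # noisy toy target

single = fit_tree(xs, ys, depth=4)
forest = fit_bagged(xs, ys, n_trees=50, depth=4, rng=rng)
print("single tree at x=1.0:", round(predict_tree(single, 1.0), 3))
print("bagged trees at x=1.0:", round(predict_bagged(forest, 1.0), 3))
```

Plotting the out-of-sample error of `predict_bagged` against `n_trees` on a held-out set would be the way to reproduce the kind of leveling-off curve described here.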
So gradient boosting is the second of the two very popular methods of ensembling. And instead of creating separate trees in parallel, you create them serially, where a small decision tree is made. The error is taken from that decision tree model. And from the residuals, the second tree is made. A learning rate is applied to avoid overfitting.
And this also is on par with the bagged decision trees. And I would say, in my experience, having just applied both of these models to many different data sets, about 75% of the time I can get a better result with GBMs. And then another 20%, 25%, I get a better result with the bagging.
And it's just very hard. I haven't been able to categorize what type of data set does better. I just try them both. But one thing about gradient boosting is, if you have too many trees, it will overfit, whereas bagging is much easier to tune. And if you're new to ensembling, that's probably the first one to try to get a good result.
But gradient boosting, your out of sample result will improve, improve, and then it will start to go down as you overfit. So it takes a little bit more experience to apply GBM. But, overall, it's a slightly favored method in my opinion.
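The serial procedure (fit a small tree, take the residuals, fit the next tree to them, shrink each step by a learning rate) can be sketched as below. This is an illustration for squared error with one-split trees on a made-up one-dimensional data set, not the implementation used in any of the competitions; the names `fit_stump`, `boost`, and `predict` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split tree (a very small decision tree) by squared error."""
    best = None
    for t in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]  # (threshold, left_value, right_value)

def boost(xs, ys, n_trees, learning_rate):
    """Build trees serially, each one fitted to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        # The learning rate shrinks each tree's contribution.
        preds = [p + learning_rate * (lv if x < t else rv)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lv if x < t else rv for t, lv, rv in stumps)

def train_mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]  # toy target

mean_y = sum(ys) / len(ys)
baseline_mse = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_1 = train_mse(boost(xs, ys, 1, 0.1), xs, ys)
mse_100 = train_mse(boost(xs, ys, 100, 0.1), xs, ys)
print(baseline_mse, mse_1, mse_100)
```

Training error keeps falling as trees are added, which is exactly why the number of trees has to be chosen on held-out data: the out-of-sample error eventually turns back up.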
So this was published by Jerome Friedman in 1999. It seemed like it was around 2005 where I recall a lot of people were using it, where it started becoming accepted as a very popular technique.
Now one interesting thing, when people ask, what does GBM stand for? In the original publication, it was Gradient Boosting Machines. And then in more modern software tools, they call it Generalized Boosting Models. But as far as I know, they're the same thing. So if you call them GBMs, then you can't go wrong.
I guess that's the key with GBM, because if you're doing them in parallel, then it becomes more of a random forest. So I suppose, by definition, GBM, you're always applying them to the residuals.
So with those basics on GBM, random forest, I'll talk a little about the Netflix competition. Even though it was 10 years ago, it seems like there's still a lot of interest, because it was the biggest competition in predictive modeling ever held, with the biggest prize of $1 million, and also the largest number of competitors. 41,000 teams entered.
So that was a competition we did not win. But we did rank as high as seventh, which hopefully qualifies me to talk about it. We kind of figured out, on a normal research salary, it would have cost them $12 billion to hire all those 41,000 teams. And if I were to flip that around and divide the $1 million across the 41,000 teams, the average researcher was making less than $0.10 an hour.
One thing I thought really neat about this competition, because it was so widespread, is it really, truly evolved what were the best practices. And that's why I'm always a big fan of competition, a level playing field for [AUDIO OUT] algorithms.
What was the most published algorithm and what the Netflix team was using was shown, very quickly, to not be the best algorithm.
So there were multiple techniques compared. These are just five of them. In the interest of time, I'm going to focus on just two of them, the highlighted ones. I'll talk about item-based collaborative filtering, because that was sort of the published, most accepted method in the industry. That's the method that everybody was trying to beat, and then matrix factorization, which is a more elegant numerical solution.
I'll show the difference in accuracy. So just to give a little bit of background on the problem setting. We were provided with 100 million records, basically, just a big, 4-column data set, with movie ID, user ID, rating, and the [AUDIO OUT] rating.
And so if every user had rated every movie, you'd have 9 billion movie user combinations. So it ended up being about a 1% sparse matrix if you were to make a grid between movies and users. And very minimal other information was provided, movie title, the date it came out, so nothing too significant in terms of adding to the accuracy.
And so the idea of item-based collaborative filtering is you take all the movies, you have this big giant correlation of every movie together. So in this case, there were 324 million movie combinations. And the way you correlate one movie to the other would be, for example, if I were correlating Lord of the Rings I to Lord of the Rings II, I would find all the users who rated both those movies, just run a straight correlation.
And the particular combination would be something like 0.9. So generally, people who like number one would like number two. Similarly, a movie like The Hobbit would rank something like 0.85 to Lord of the Rings II. A movie like Children of the Corn would rank about 0. So you had this similarity index or similarity matrix based on the correlation of the ratings.
Then the scoring of the model would simply be a weighted average of the most similar movies. So say a person that had watched Lord of the Rings I. I wanted to predict their rating for Lord of the Rings II. Because it's not similar, Children of the Corn would not be weighted in there.
But I would look at their rating on The Hobbit. I'd look at the rating on Lord of the Rings I and maybe 20 or so other movies. And that would generate a prediction.
So there's some bit of variability in how you weight based on correlation. But those methods, in that family, all basically follow this format.
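The two steps above (correlate movies over the users who rated both, then take a similarity-weighted average of the user's own ratings) can be sketched as follows. The ratings are invented so that the similarities land near the figures quoted in the talk; the `min_sim` cutoff and the function names are assumptions, and real systems differ in how they weight and shrink the correlations.

```python
from math import sqrt

# Hypothetical ratings: user -> {movie: rating on a 1-5 scale}.
ratings = {
    "u1": {"LOTR1": 5, "LOTR2": 5, "Hobbit": 4, "Corn": 4},
    "u2": {"LOTR1": 4, "LOTR2": 4, "Hobbit": 4, "Corn": 2},
    "u3": {"LOTR1": 2, "LOTR2": 1, "Hobbit": 2, "Corn": 4},
    "u4": {"LOTR1": 1, "LOTR2": 2, "Hobbit": 1, "Corn": 2},
    "u5": {"LOTR1": 5, "Hobbit": 4, "Corn": 3},  # has not rated LOTR2
}

def similarity(a, b):
    """Pearson correlation of two movies over users who rated both."""
    pairs = [(r[a], r[b]) for r in ratings.values() if a in r and b in r]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_a = sum(x for x, _ in pairs) / n
    mean_b = sum(y for _, y in pairs) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    var_a = sum((x - mean_a) ** 2 for x, _ in pairs)
    var_b = sum((y - mean_b) ** 2 for _, y in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def predict(user, target, min_sim=0.1):
    """Similarity-weighted average of the user's ratings on similar movies."""
    weighted = [(similarity(target, m), r)
                for m, r in ratings[user].items() if m != target]
    weighted = [(s, r) for s, r in weighted if s > min_sim]  # drop dissimilar
    if not weighted:
        return None
    return sum(s * r for s, r in weighted) / sum(s for s, _ in weighted)

print(round(similarity("LOTR2", "LOTR1"), 3))
print(round(similarity("LOTR2", "Hobbit"), 3))
print(round(similarity("LOTR2", "Corn"), 3))
print(round(predict("u5", "LOTR2"), 3))
```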
So now matrix factorization, and I wrote "Fantasy World" on top, because I start out with a hypothetical just to get the concept. Now let's pretend that, for every movie and every genre, we had a rating. So let's say we started with this data set, where Friday the 13th is listed. It is rated very high for Scary but almost 0 for Action and Comedy.
Then you've got The Hangover, which is top ranked for Comedy and low in the other areas. Now then you've got The Matrix, which is ranked top for Action and [AUDIO OUT] were low for the other categories.
So if you were to start out with that, you could then run a series of regressions on each individual person, based on their ratings, to calculate coefficients.
So there, B0 would sort of be the constant, their average rating overall. And B1 would be how that person likes Scary movies, how that person likes Action movies, how that person likes Comedy movies. And so you could basically fill in the blanks, pretty easily, if you had the left side as your starting point.
If a person, let's say Bob, had a high rating for Scary movies, his prediction, naturally, with this formula, would be high, if I said Scary movies, for Friday the 13th.
So the other hypothetical is, let's say, we didn't have the matrix on the left. But for each person, we started with a set of weights. So we know that Bob likes Scary movies. We know that Dave likes Action movies. And we know that Monica hates Scary movies and likes Action and Comedy somewhat.
So once you have that matrix, you could similarly run a linear regression to generate the weights on the left. And you'd end up with the same exact formula for the prediction, which just would be the sum products of the characteristics of the movies times what that person has as preferences.
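Written out, the prediction is the person's constant plus the sum product of their preference weights and the movie's genre scores. The numbers below are made up for illustration, with genres in the order (Scary, Action, Comedy).

```python
def predict_rating(intercept, preferences, genre_scores):
    """Constant (B0) plus the sum product of preferences and genre scores."""
    return intercept + sum(p * g for p, g in zip(preferences, genre_scores))

# Hypothetical values: Bob likes Scary movies, and Friday the 13th is
# scored as purely Scary.
bob_intercept = 3.0
bob_preferences = (1.5, 0.5, -0.5)
friday_the_13th = (1.0, 0.0, 0.0)

print(predict_rating(bob_intercept, bob_preferences, friday_the_13th))  # 4.5
```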
Now since, in the real world, you don't start out with any weights, matrix factorization starts with all random numbers. And then you iterate regression one way or the other. So first based on these random numbers, you can do a regression, generate these weights. Then you run linear regression, generate those weights.
And so what starts out as non-sensible numbers slowly iterates to where you have very meaningful coefficients for both the movies and the person. So you end up with this very elegant mathematical solution. And you've, in the process, created genre variables, this M1, M2, M3. And in the real world, there's maybe five or six main genres.
But mathematically, you can generate 40 or 50 genres and get an improved accuracy. So these numbers can really go out. So the result was a much higher accuracy. But it's very hard to quantify in terms of real world terms.
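The back-and-forth regression can be sketched as alternating least squares. This is an illustration under several assumptions that are not from the talk: a tiny invented rating set, two latent "genres", a small ridge penalty `lam` to keep each regression well-posed, and no constant term. Every user and movie must have at least one rating for this version to run.

```python
import random

def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ridge_fit(features, targets, lam):
    """Weights w minimizing sum (t - w.f)^2 + lam * |w|^2."""
    k = len(features[0])
    a = [[sum(f[i] * f[j] for f in features) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(f[i] * t for f, t in zip(features, targets)) for i in range(k)]
    return solve(a, b)

def factorize(ratings, n_users, n_movies, k=2, lam=0.1, n_iters=20, seed=0):
    rng = random.Random(seed)
    # Start from random numbers for both sides.
    users = [[rng.random() for _ in range(k)] for _ in range(n_users)]
    movies = [[rng.random() for _ in range(k)] for _ in range(n_movies)]
    for _ in range(n_iters):
        for u in range(n_users):   # regress each user's ratings on movie factors
            obs = [(m, r) for (uu, m), r in ratings.items() if uu == u]
            users[u] = ridge_fit([movies[m] for m, _ in obs],
                                 [r for _, r in obs], lam)
        for m in range(n_movies):  # regress each movie's ratings on user factors
            obs = [(u, r) for (u, mm), r in ratings.items() if mm == m]
            movies[m] = ridge_fit([users[u] for u, _ in obs],
                                  [r for _, r in obs], lam)
    return users, movies

def rmse(ratings, users, movies):
    errs = [(r - sum(a * b for a, b in zip(users[u], movies[m]))) ** 2
            for (u, m), r in ratings.items()]
    return (sum(errs) / len(errs)) ** 0.5

# Hypothetical sparse ratings: (user index, movie index) -> rating.
ratings = {
    (0, 0): 5, (0, 1): 1, (0, 3): 5, (0, 4): 2,
    (1, 0): 1, (1, 1): 5, (1, 2): 4,
    (2, 0): 4, (2, 2): 1, (2, 3): 5,
    (3, 1): 4, (3, 2): 5, (3, 3): 1, (3, 4): 5,
}
rmse_before = rmse(ratings, *factorize(ratings, 4, 5, n_iters=0))
rmse_after = rmse(ratings, *factorize(ratings, 4, 5, n_iters=20))
print(round(rmse_before, 3), round(rmse_after, 3))
```

Raising `k` corresponds to generating more "genres"; on real data the right value would be chosen by out-of-sample accuracy rather than training error.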
So what I did is I just ran a cluster [AUDIO OUT] on the movie coefficients, just a basic k-nearest neighbors. And I'm going to show just some of the clusters that just came out of this nothingness that started with random weights.
So in this particular cluster, you end up with a [AUDIO OUT] of action movies. Second cluster, you've got a huge concentration of cartoons. And you probably noticed some of the imperfections. That's mostly cartoons, but then you've got some movies, like Buffy the Vampire Slayer thrown in there.
It doesn't mean that the math was wrong. It just means that the same people who like The Tigger Movie ranked Buffy the Vampire Slayer. So it still is actually predictive, even though it may not make a lot of human sense.
Then you've got this sector. And it's just neat that all this came from randomness. They got a whole bunch of TV series in this one sector. Got a pretty nice concentration of [AUDIO OUT] in this other sector.
Then when you look graphically at the sectors, there's one that stands way out, so you could characterize this as the most different type of person who likes Dragon Ball. Probably the biggest outlier of this sector in terms of characteristics.
And so, just some of the results. My team, that I was on, was the Ensemble Experts. That was our team name. With just collaborative filtering methods, with traditional methods, we were able to beat the Netflix team by about 2% to 3%.
Using matrix factorization models, we were able to beat the Netflix team by up to 5%. Then, by sort of blending 20 models of all different kinds together, we were able to get as high as 7.07% improvement. So [INAUDIBLE] for the grand prize.
But then, of course, we saw what the winning teams did. They had blended over 200 models to get that 10% threshold. And so they were [INAUDIBLE] with it. So we did about 1/10 of the work necessary to get 70% of the way there.
Just another competition, this was one that I didn't participate in but helped organize. So I got to see very detailed descriptions of what people did. This was a telemarketing data set predicting churn, appetency, and up-selling. I won't go into the problem space, because I got a few other things to share.
But interestingly, bagging and boosting ranked third and second. And the first place team had done both bagging and boosting and included a bunch [AUDIO OUT] stuff in there, too. So this miscellaneous, a bunch of everything type of custom ensemble seems to work in many different fields. And it gives the ability to customize based on that particular data set.
The Heritage Health Prize, so having learned lessons from these other competitions that a bunch of everything-- I was determined, in this competition, to just generate as many models as I possibly could and have a pretty big team that could help with that.
So we had, actually, a very talented team of seven people. There was a pretty big prize, so you could call it maybe the second biggest predictive modeling competition. We were predicting hospital outcomes, so basically hospitalisations.
So we were given a set of historical variables. The idea was to use whatever algorithm you could and submit the number of days in the hospital for each person given to us.
So we were given three years of data to download. And then, of course, the fourth year the outcome was withheld. And that was for scoring. And so teams would make submissions based on what they believed the year four outcome would be in hospitalisations.
In the interest of having a lot of methods, there were actually two completely different ways we thought to put the data set together in preparation for the machine learning. So approach one was to create predictors based on year one. Use year two, length of stay in the hospital, as the dependent variable for [AUDIO OUT].
And similarly, you could take year two information, create the predictors, and use year three as your dependent variable. So you could train the model based on that. And then when you score the final data set, you'd take your year three information and create your predictors and generate year four. That was approach one.
Approach two was take year one and year two information to create predictors and use that to train on the year three outcome. So you had to use two years of input to predict one year. And then you score on year two, year three predictors to get year four outcome.
So there's a big advantage of each one. Approach two could have much richer predictors. You had a lot more information in the predictors but only half as many records to train with. Approach one had twice as many records but simpler predictors. So it was sort of a question mark, [AUDIO OUT] would perform better because of that trade-off.
As it turned out, they performed pretty similarly but had different enough results we were able to ensemble the two approaches to get an overall higher accuracy.
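The trade-off between the two layouts is easy to see in a small sketch. The member IDs, predictor values, and function names below are invented; the real predictors were far richer than two numbers per year.

```python
# Hypothetical per-member predictors by year, and days in hospital by year.
features = {
    "A": {1: [0.2, 1.0], 2: [0.4, 0.0], 3: [0.1, 1.0]},
    "B": {1: [0.9, 0.0], 2: [0.8, 1.0], 3: [0.7, 0.0]},
    "C": {1: [0.5, 0.0], 2: [0.5, 0.0], 3: [0.6, 1.0]},
}
days = {
    "A": {2: 0, 3: 2},
    "B": {2: 5, 3: 1},
    "C": {2: 0, 3: 0},
}

def approach_one_train(features, days):
    """One year of predictors -> next year's days (year 1->2 and 2->3)."""
    rows = []
    for member in features:
        for year in (1, 2):
            rows.append((features[member][year], days[member][year + 1]))
    return rows

def approach_two_train(features, days):
    """Two years of predictors (years 1 and 2) -> year 3 days."""
    return [(features[m][1] + features[m][2], days[m][3]) for m in features]

one = approach_one_train(features, days)
two = approach_two_train(features, days)
print(len(one), "rows with", len(one[0][0]), "predictors each")
print(len(two), "rows with", len(two[0][0]), "predictors each")
```

Scoring follows the same pattern: approach one builds predictors from year three alone, and approach two from years two and three, to predict year four.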
Later, we combined with-- originally our team had three. We combined with two other teams, who had similar accuracy, to further the ensemble idea that, if we had three models that were about the same accuracy, and we blended them together, blended it all together, we could get even higher accuracy.
So the final combined solution had over 100 models go into it and had several GBMs, with different parameters, and several random forests, just everything under the sun that we could think of to sort of emulate what the Netflix winners had done.
And the result was we actually won this competition by a pretty significant margin. You can see our error rate compared to [AUDIO OUT]
Again, that supports the idea that just lots and lots of models, blended into something, will really fine-tune the accuracy, getting better and better.
So I'll talk a little bit about financial models. I like to start with this graph. I've asked stock picking, chart pattern people, which stock do you think is the most likely to-- which one would you invest in, based on these charts, if I didn't know what they were?
And so you've got like trend-based people who like this one. Definitely trending upwards. And then you got value-type investors, who actually like this bottom one, because, even though it's gone down, it looks like the down period is ending. And you can buy it cheaply. You don't want to buy something after it's gone up. So there's some disagreement.
And then I'll reveal [AUDIO OUT] actually randomly generated these graphs using normally distributed noise. So punchline is, it's very, very tricky to model stocks at least with your brain. So using machine learning and using a lot of data, basically, let the data decide. Don't try to let your eyes fool you with the charts.
Use machine learning algorithms not-- I mean, there's plenty of other methods to do it, as well. Some quants, they come up with a preconceived notion of what might be a pattern. And they back test it. But what is common to quantitative strategies is adequate back testing, so that you can see that those patterns that you found work over a variety of market conditions.
So in terms of measuring success of a portfolio, I'll give sort of the most widely used [AUDIO OUT] ratios, the return divided by the risk. Certainly not the only measure, but what's convenient about it is risk is measured as the expected standard deviation. Therefore, the Sharpe ratio is analogous to a z-score.
And the nice thing about that is it doesn't consider the fact-- let's say, I had an S&P portfolio. The S&P, on average, returns 7.7% over the last 50 years and 15% standard deviation. So the S&P has a Sharpe ratio of about 0.5.
Now if I were to just evaluate based on total returns, I could [AUDIO OUT] 2 to 1 and get 15% return with 30% risk. So I haven't really done anything special. So just looking at pure return is not a good way to assess a portfolio. But the Sharpe ratio is a pretty good measure to factor in the risk.
In terms of minimizing risk, actually, just one important point on this. There's certainly a limit that, when you're doing predictive modeling, it's very hard to generate more returns. So there's certainly a limit to where you can only predict stocks so well.
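The point about leverage can be verified numerically. This sketch uses the simplified definition given in the talk (return divided by standard deviation, with no risk-free rate subtracted) and the S&P figures quoted above; the function name is illustrative.

```python
def sharpe_ratio(annual_return, annual_volatility):
    """Return divided by risk, as defined in the talk (no risk-free rate)."""
    return annual_return / annual_volatility

sp500 = sharpe_ratio(7.7, 15.0)            # unlevered S&P, in percent
levered = sharpe_ratio(2 * 7.7, 2 * 15.0)  # 2-to-1: double return, double risk

print(round(sp500, 3), round(levered, 3))
```

Both come out at about 0.51: doubling the exposure raises the return but leaves the ratio unchanged.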
And so you can actually do just as well, in the Sharpe ratio, by minimizing this risk. So I think it's an important portfolio management tool to use machine learning or predictive models in the risk management.
And so some of the risk management models that we integrate, we predict individual volatility risks. So, for example, a company like Walmart is much less volatile than Tesla, so we can, therefore, have a bigger position in our portfolio to have comparable risk and better overall Sharpe ratio.
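One simple way to turn predicted volatilities into position sizes is to weight each stock by the inverse of its volatility, so that every position carries comparable risk. This is a sketch of that idea only, not Voloridge's method, and the volatility numbers are invented.

```python
def inverse_volatility_weights(vols):
    """Weights proportional to 1/volatility, normalized to sum to 1."""
    inverse = {name: 1.0 / v for name, v in vols.items()}
    total = sum(inverse.values())
    return {name: x / total for name, x in inverse.items()}

# Hypothetical predicted annual volatilities.
vols = {"WMT": 0.15, "TSLA": 0.60}
weights = inverse_volatility_weights(vols)

for name, w in weights.items():
    # weight * volatility is the same for every position.
    print(name, round(w, 3), round(w * vols[name], 3))
```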
Sector diversification, you can use published GICS sectors. But I'll show a diagram where we actually correlate every stock to each other and create analytical sectors.
Zero beta-- beta is a measure of how correlated you are to the overall market, so minimizing exposure to the market. If there's a big crash, say, like in 2000 or 2008, where the market dropped 50%, you don't want to lose. You don't want that huge risk in your portfolio.
And the liquidity-based measures, as the fund gets bigger, and you're trading bigger positions, there is substantial risk of influencing prices as you trade.
So that's just a visual on our correlation matrix, where the thicker the line, the higher the correlation of all the tickers in our universe. And that allows you to partition into sectors [AUDIO OUT] sectors, which I found some interesting flaws in the published GICS sectors by doing this.
One example would be gold stocks are very, very highly correlated to each other. And they should actually be in one sector. But retail stocks, there is a lot not correlated to each other. For example, auto retail versus apparel, they don't correlate at all to each other. But a GICS sector might put them all together in the same risk management.
Another interesting one I've found is Brazilian companies. Brazilian banks correlate more to Brazilian oil companies than they do to US banks. So even though they're in banks, it's that volatility of Brazilian currency that is really driving the risk exposure to those markets. So doing everything analytically, I've found to be substantial in lowering exposure to sectors.
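The talk does not say how the analytical sectors are formed from the correlation matrix. As one possible illustration, the sketch below links any two tickers whose correlation is above a threshold and takes the connected groups as sectors. The tickers, correlations, and threshold are all invented.

```python
def analytical_sectors(tickers, corr, threshold):
    """Group tickers linked by a pairwise correlation >= threshold."""
    parent = {t: t for t in tickers}

    def find(t):
        # Follow parent links up to the group's representative.
        while parent[t] != t:
            t = parent[t]
        return t

    for (a, b), c in corr.items():
        if c >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for t in tickers:
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(g) for g in groups.values())

# Placeholder tickers and hypothetical correlations.
tickers = ["GOLD_A", "GOLD_B", "AUTO_RETAIL", "APPAREL",
           "BR_BANK", "BR_OIL", "US_BANK"]
corr = {
    ("GOLD_A", "GOLD_B"): 0.9,
    ("AUTO_RETAIL", "APPAREL"): 0.1,
    ("BR_BANK", "BR_OIL"): 0.7,
    ("BR_BANK", "US_BANK"): 0.4,
}
sectors = analytical_sectors(tickers, corr, threshold=0.6)
print(sectors)
```

With these made-up numbers the gold stocks share a sector, the two retailers do not, and the Brazilian bank is grouped with the Brazilian oil company rather than with the US bank.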
And finally, there's the beta groups. Predicting a correlation to the S&P by ticker, you can figure out your overall exposure and balance better. So beta is a measure of how much a stock will move relative to what the market moves.
So if a stock has a beta of 2, if the market moves 1% up, you'd expect that stock, on average, to move up 2%. Likewise, a beta of 0.5, you'd expect it to move 1/2 of what the market does.
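Historical beta is usually estimated as the covariance of the stock's returns with the market's returns, divided by the variance of the market's returns. The sketch below uses invented daily returns in which one stock moves exactly twice the market and another exactly half.

```python
def beta(stock_returns, market_returns):
    """Covariance of stock and market returns over variance of the market."""
    n = len(market_returns)
    mean_s = sum(stock_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((s - mean_s) * (m - mean_m)
              for s, m in zip(stock_returns, market_returns))
    var = sum((m - mean_m) ** 2 for m in market_returns)
    return cov / var

# Hypothetical daily returns.
market = [0.01, -0.02, 0.005, 0.015, -0.01]
high_beta_stock = [2.0 * m for m in market]
low_beta_stock = [0.5 * m for m in market]

print(round(beta(high_beta_stock, market), 6))
print(round(beta(low_beta_stock, market), 6))
```

A linear estimate like this assumes the same multiple applies at every size of market move, which is the assumption the observed data on large moves calls into question.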
And so you can see, this red line represents betas of approximately [AUDIO OUT]. The red line represents a perfect prediction. And the dots represent the actuals of what we observed over this nine year period.
And one thing important to note is, at the very high market moves-- this x-axis being the market move-- when the market moves 8%, these beta 2's don't move 16%. They move [AUDIO OUT] 13%.
So there's this regression to the mean of a beta that's very important in the risk management. Because you don't want to be market neutral, one day, and then not market neutral the day the market crashes. So very important to understand some of the non-linearity of predicting beta of a stock.
So my final application, I had to throw this out there, because, through the VoLo foundation, we do a lot of work to try and fight against global warming. And this is glacier data. So [AUDIO OUT] the ice samples that were drilled down.
It's interesting, you can drill down and get an ice sample, and glaciologists can pinpoint, pretty accurately, what time it is from, based on the decay of the beryllium and other chemicals in there. They can tell what the CO2 content [AUDIO OUT] was at that period in time, through the air bubbles trapped in the ice. And they can actually know, pretty accurately, what the average air temperature was in that region based on the density of the water.
So you have this correlation of 90% between carbon dioxide and temperature. So the importance of this, and the reason why we've gotten more into this, is that we're currently, over the last 100 years, [AUDIO OUT] pumped 150 parts per million CO2 into the atmosphere.
And so there's sort of the question whether we're heading for a temperature at the top graph or the bottom graph. So that's sort of the open-ended one. You have to figure out application going on [AUDIO OUT] data analysis standpoint.
So I guess, with that, I'll be happy to answer any questions.