Demis Hassabis: Towards General Artificial Intelligence (1:07:29)
Date Posted:
July 11, 2016
Date Recorded:
April 20, 2016
CBMM Speaker(s):
Demis Hassabis
Description:
Demis Hassabis is the Co-Founder and CEO of DeepMind, the world’s leading General Artificial Intelligence (AI) company, which was acquired by Google in 2014 in their largest ever European acquisition. Dr. Hassabis draws on his eclectic experiences as an AI researcher, neuroscientist, and video game designer to discuss what is happening at the cutting edge of AI research, including the recent historic AlphaGo match, and its future potential impact on fields such as science and healthcare, and how developing AI may help us better understand the human mind.
TOMASO POGGIO: I'm Tomaso Poggio. I am the Director of the Center for Brains, Minds, and Machines, which is a center between MIT and Harvard, located in BCS in Building 46.
And I have the pleasure of hosting Demis. I don't need to say much about him. If you look on Wikipedia or Financial Times, there's a very good caricature of Demis.
And you can find him everywhere. He was a chess child prodigy. He studied computer science in Cambridge. He started a couple of successful computer games companies. Then he became a neuroscientist, got a PhD at UCL in London, and then I was lucky enough that I can put on my CV that he was a post-doc of mine for a brief period-- between 2009 and 2010, I think.
And then we saw each other a couple of times. Once, he came to speak at one of the symposia for the MIT 150th birthday. This was 2011. The symposium was called "Brains, Minds, and Machines," and one session was titled "The Marketplace for Intelligence." And you spoke about DeepMind, which you had just started.
And so DeepMind is an amazing achievement. Demis managed to put together a company, sell it to Google. The company is also a great research lab, I would say the best one in AI these days, with high-impact papers in Nature and so on and achievements like AlphaGo winning against what is arguably the best player in the world, Lee Sedol.
I was in Seoul for the last game, the fifth game, and it was exciting and historic. And it's great to have Demis here kind of telling us about what went on and what was the background of it. Demis.
[APPLAUSE]
DEMIS HASSABIS: Thanks, Tommy, for that very generous introduction. Thank you all for coming. It's great being back at MIT. I always love coming back here and seeing and catching up with old friends.
So today, I'm going to split my talk into two. The first half of it is going-- I'm going to give you a kind of whirlwind overview of how we're approaching AI development at DeepMind and the kind of philosophy behind our approaches. And then the second half of the talk will be all about AlphaGo and the sort of combination of our work there and what we're going to do with it going forwards.
So DeepMind-- first of all, it was founded in 2010, and we joined forces with Google in an early part of 2014, so we've been there for just over two years now. One of the ways we think about DeepMind, and one of the ways I've described it, is as a kind of Apollo program for AI, Apollo program effort for AI.
Currently, we have more than 200 research scientists and engineers, so it's a pretty large team now, and we're growing all the time. So obviously, there's a lot of work going on, and I'm only going to be able to touch on a small fraction of it today.
So apart from experimenting on AI, which is obviously the main purpose of DeepMind, at least half of my job and half of my time is spent on thinking about how to organize the endeavor of science. And what we try to do at DeepMind is try to create an optimal environment for research to flourish in.
And the way-- I mean, that would be a whole talk in itself. But just to give you a one-line summary, what we try to do is fuse the best from Silicon Valley startup culture with the best from academia. So we've tried to combine the kind of blue-sky thinking and interdisciplinary research that you get in the best academic places with the focus and energy and resources and pace of a top startup. And I think this fusion has worked really well.
So our mission, as some of you have heard me state, the way I kind of articulate that is in two steps. So step one, try and fundamentally solve intelligence. And then if we were to do that, I think step two kind of follows naturally-- try and use that technology to solve everything else. Certainly, that's why I've always been obsessed with working on AI since I can remember, because I truly believe that it's one of the most important things that mankind could be working on and will end up being one of the most powerful technologies we ever invent.
So more prosaically, what we're trying to do at DeepMind-- what we're interested in doing-- is trying to build what we call general-purpose learning algorithms.
So the algorithms we create and develop at DeepMind-- you know, we're only interested in algorithms that can learn automatically for themselves, from raw inputs and raw experience, and they're not handcrafted or preprogrammed in any way.
The second important point is this idea of generality, so the idea that a single set of algorithms, or a single system, can operate out of the box across a wide range of tasks. In fact, this sort of connects with our operational definition of intelligence. I know that's kind of a big debate, and there isn't really a kind of consensus around what intelligence is. But operationally, we regard it as the ability to perform well across a wide range of tasks. So we really emphasize this flexibility and generality.
So we call this type of AI "artificial general intelligence" internally at DeepMind. And the hallmark of this kind of AI is that it's flexible and adaptive and possibly, you could argue, inventive. I'm going to come back to that at the end, once we've covered AlphaGo. And the key thing about it is that it's built from the ground up to deal with the unexpected and to flexibly deal with things that it's never potentially seen before.
So by contrast-- obviously AI is a huge buzzword at the moment and is hugely popular, both in academia and industry, but still a lot of the AI that we find around us, or that's labeled AI, is what I would call narrow AI. And that's really software that's been handcrafted for a particular purpose, and it's special-cased for that purpose.
And often the problem with those kinds of systems is that they're hugely brittle. As soon as the users interact with those systems in ways that the teams of programmers didn't expect, then obviously they just catastrophically fail.
Probably still the most famous example of that kind of system is Deep Blue. And obviously, that was a hugely impressive engineering feat back in the late 90s when it beat Garry Kasparov at chess. But Deep Blue, you know, it's arguable whether it really exhibited intelligence, in the sense that it wasn't able to do anything else at all, not even play strictly simpler games like tic-tac-toe. It would have to be preprogrammed again from scratch with expert knowledge.
So the way we think about AI and intelligence is actually through the prism of reinforcement learning. Most of you will probably be familiar with reinforcement learning, but I'm just going to cover it quickly here in this cartoon diagram for those of you who don't know what it is.
So you start off with an agent or an avatar. It finds itself in some kind of environment trying to achieve a goal in that environment. That environment can be, obviously, the real world, in which case the agent would be a robot. Or it could be a virtual environment, which is what we mostly use, in which case it's a kind of avatar of some sort.
Now, the agent only interacts with the environment in two ways. Firstly, it gets observations through its sensory apparatus and reward signals. And we mostly use vision, but we are looking to use other modalities pretty soon.
And the job of the agent system is kind of twofold. Firstly, it's got to try and build as accurate a statistical model as it can of the environment out there based on these noisy, incomplete observations that it's getting in real time. And once it's built the best model it can, then it has to decide what action to take from the set of actions that are available to it at that moment in time to best get it incrementally towards its goal.
So that's basically the essence of reinforcement learning. And this diagram is very simple, but of course it hides huge complexities and difficulties and challenges that would need to be solved to fully solve what's in this diagram. But we know that if we could solve all the issues and challenges behind this framework, then that would be enough for general intelligence, human-level general intelligence. And we know that because many animal systems, including humans, use reinforcement learning as part of their learning apparatus. In fact, the dopamine neurons in the brain implement a form of TD learning.
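To make the loop concrete, here is a minimal sketch of the agent-environment cycle described above. The Environment and Agent interfaces are illustrative placeholders, not DeepMind's actual API.

```python
# Minimal sketch of the reinforcement-learning loop described above.
# Environment and Agent are hypothetical interfaces, not DeepMind's actual API.

class Environment:
    def reset(self):
        """Return the initial observation (e.g. raw pixels)."""
        raise NotImplementedError

    def step(self, action):
        """Apply an action; return (observation, reward, done)."""
        raise NotImplementedError

class Agent:
    def act(self, observation):
        """Choose an action from the available set, given a (noisy) observation."""
        raise NotImplementedError

    def learn(self, observation, action, reward, next_observation, done):
        """Update the agent's internal model/policy from experience."""
        pass

def run_episode(env, agent):
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = agent.act(obs)                     # decide on an action
        next_obs, reward, done = env.step(action)   # observe outcome and reward
        agent.learn(obs, action, reward, next_obs, done)
        total_reward += reward
        obs = next_obs
    return total_reward
```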
So the second thing that we kind of committed to philosophically in terms of our approach [INAUDIBLE] at the beginning was this idea of grounded cognition. And this is the notion that a true thinking machine has to be grounded in a rich sensorimotor reality. But that doesn't mean it needs to be a physical robot. As long as you're strict with the inputs, you can use virtual worlds and treat these avatars and agents in virtual worlds like virtual robots, in the sense that the only access they have to the game state is via their sensory apparatus. So there's no cheating in terms of accessing the internal game code or game state underlying the game.
We think, if you treat games in that way, then they can be the perfect platform for developing and testing AI algorithms, and that's for many reasons. Firstly, you can create unlimited training data. There's no testing bias, in the sense that-- I think one of the challenges of AI is actually creating the right benchmarks, and very often building the benchmarks turns out to be an afterthought for an AI lab. We think crafting the right benchmarks is just as difficult as, maybe even more difficult than, coming up with the algorithms.
And games, of course, have been built for other purposes-- to entertain and challenge human players-- and they've been built by games designers, so they weren't built for testing AI programs. So, in that sense, they're really independent in terms of a testing/training ground for our AI ideas.
Obviously you can run millions of agents in parallel, and we do that on the Google Cloud. And most games have scores, so it is a convenient way to incrementally measure your progress and improvement of your AI algorithms. And I think that's very important when you're setting off on a very ambitious goal and mission like we have, which may be multi-decades. It's important to have good incremental measures that you're going in the right direction.
So this kind of commitment then leads to this idea of end-to-end learning agents and this notion of starting with raw pixels and going all the way to deciding on an action. At DeepMind, we're interested in that entire stack of problems, from perception to action. And I think, over the last five years that DeepMind's been going, we've pioneered this use of games for AI research. And I see many other research organizations now, and industrial groups, starting to use games themselves for their own AI development.
So I guess the first big breakthrough that we had at DeepMind was really starting this new field of deep reinforcement learning. And this is the idea of combining deep learning with reinforcement learning. And this allows reinforcement learning to really work at scale and tackle challenging problems.
Until we came up with this idea of deep reinforcement learning-- RL, of course, as a field, has been going for more than thirty years. But generally speaking, up till then, it had only been applied to toy problems, little grid-world problems. Nothing really challenging or impressive had been done with RL research, so we wanted to take that further and apply it to a really challenging domain.
So initially we picked the Atari 2600 platform, which is really the first iconic games platform from the '80s. And, conveniently, there's a nice open-source emulator, which we took and improved. And there are hundreds of different classic Atari games available on this emulator.
I'm just going to run you one video in a second showing you how the agent performs in these Atari environments. But before I do, just to confirm with you what you're going to see: the agents here only get the raw pixels as inputs. So the Atari screens are 200 by 150 pixels in size. There's about 30,000 pixels per frame.
And the goal here is simply to maximize the score. Everything else is learned from scratch. So the system is not told anything about the rules or the controls, or even the fact that pixels next to each other in video streams are correlated in time. It has to find all that structure for itself.
And then there's this notion again of generality-- one system able to play all the different Atari games out of the box. So we call this system DQN, and we think it really is a kind of general Atari player.
So this is a little medley of the same system out of the box, the same [INAUDIBLE] is playing all these very different games, very different rule sets, very different objectives, very different visuals out of the box with the same settings and the same architecture. And it performs better than top human players on more than half of the Atari games. And since our "Nature" paper, we've now increased that to about 95% of the Atari games.
And here's the boxing where it's the red boxer here, and it does a bit of sparring with the inbuilt AI and then eventually corners it and just racks up an infinite number of points. So if you want to know more about that work, you can see our "Nature" paper from last year. And the actual code is freely available as well, linked from the "Nature" site, so you can play around with the DQN algorithm yourselves.
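For readers who want a feel for what DQN adds to plain Q-learning, here is a hedged sketch of its two core ideas, experience replay and a periodically frozen target network. The q_net and target_net objects, and the update method on them, are placeholders; the published implementation uses a deep convolutional network.

```python
# Sketch of the core DQN ideas (experience replay + a TD target computed from
# a frozen target network). The network itself is abstracted as a callable.
import random
from collections import deque

GAMMA = 0.99
replay = deque(maxlen=100_000)           # experience replay buffer of transitions

def td_target(reward, next_state, done, target_net):
    # No bootstrap term if the episode ended at this transition.
    return reward if done else reward + GAMMA * max(target_net(next_state))

def train_step(q_net, target_net, batch_size=32):
    batch = random.sample(list(replay), batch_size)
    for state, action, reward, next_state, done in batch:
        target = td_target(reward, next_state, done, target_net)
        # Regress Q(state, action) towards the TD target; the update itself
        # depends on the network library and is abstracted as q_net.update.
        q_net.update(state, action, target)
```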
So two planks of our philosophy are grounded cognition and reinforcement learning. A third pillar, if you like, of our approach is the use of systems neuroscience. And as a neuroscientist myself, you know, I think this is going to play a very important part in understanding what intelligence is and then trying to recreate that artificially.
But when I talk about neuroscience, I really want to stress I'm talking about systems neuroscience. And what we mean by that is really the algorithms, the representations, and the architectures the brain uses, rather than the actual low-level synaptic details of how the neural substrate works. So we're really talking about this high level, this computational level, if you like, of how the brain functions.
Now, I haven't got time to really go into all the areas that we're using neuroscience inspiration for, but suffice it to say, some of the key areas that we're working on are memory, attention, concepts, planning, navigation, imagination-- all areas that we're pushing hard on now, going beyond the work we did for Atari.
And actually, the area of the brain that I studied for my PhD, the hippocampus-- which is the center part of the brain here in pink-- is actually implicated in many of these capabilities. So it seems like, perhaps the notion of creating an artificial hippocampus of some sort which mimics the functionality of the hippocampus, might be a good plan.
So I haven't got time to go through all of these different areas of the work we're doing here, but I'll just touch on a couple of the most interesting ones. So one big push that we have at the moment is adding memory to neural networks. And what we really want to do is add very large amounts of controllable memory.
So what we've done is created this system, which we are dubbing the Neural Turing Machine. What it effectively is, is you take a classical computer, you train a recurrent neural network on it from input-output examples, and that recurrent neural network you can think of as the CPU, effectively. And what we give this recurrent neural network is a huge memory store, a kind of KNN memory store, that it can learn to access and control.
And this whole system is differentiable from end to end, so the recurrent neural network can learn what to do through gradient descent. And really, those are then all the components of a von Neumann machine that you need, except here it's all neural and it's all been learned. So that's why we call it the Neural Turing Machine, because it has all the aspects you need for a true Turing machine.
So here's a little cartoon diagram of what the Neural Turing Machine does. You can think of this input tape, and then the CPU, which is this recurrent neural network that actually has LSTMs as part of it, and it's trying to produce the right output. And then it has this huge memory store to the side that it can learn to read and write elements, vectors, to. Now, with this kind of system, we can start moving towards symbolic reasoning using these kinds of neural systems, which is really one of the big holy grails of what we want to do.
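As a rough illustration of what "differentiable memory" means here, the sketch below shows content-based addressing: the controller emits a key, similarity against each memory row gives softmax attention weights, and the read is a weighted sum, so gradients can flow through it. This is a simplification (real Neural Turing Machines also have write heads and location-based addressing), and the sizes are arbitrary.

```python
# Sketch of a differentiable, content-based memory read in the spirit of the
# Neural Turing Machine; simplified and with illustrative sizes.
import numpy as np

def cosine_similarity(key, memory):
    key_norm = key / (np.linalg.norm(key) + 1e-8)
    mem_norm = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    return mem_norm @ key_norm                      # one score per memory row

def content_read(key, memory, sharpness=1.0):
    scores = sharpness * cosine_similarity(key, memory)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax attention weights
    return weights @ memory                         # differentiable read vector

memory = np.random.randn(128, 20)                   # 128 slots, 20-dim vectors
key = np.random.randn(20)                           # would be emitted by the controller
read_vector = content_read(key, memory)
```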
And, of course, there are many unsolved classic problems in AI. One of the problems we've applied this Neural Turing Machine to is inspired by the Shrdlu class of problems-- these block worlds from the '70s and '80s. And the idea here is to manipulate the blocks in some way and answer questions about the scene.
Like, put the red pyramid on the green cube. Or, what's next to the blue square? So it's both manipulating this world and also answering questions about it.
Now, we're not ready yet to-- Neural Turing Machines can't scale to the full complexity of the full Shrdlu problem. But we have cut it down to a 2D version, a blocks world version, where we can solve some quite interesting things. So we call this Mini-Shrdlu, and it has aspects of Tower of Hanoi and other problems in it.
And the idea here is that you've got this little blocks world that you're looking side on and all these different colored blocks, and you're given the start configuration here on the left-hand side and the goal configuration you want to reach. And what the system can do is lift one block from one column and put it down on the top of another column. That's the only moves you're allowed to do.
And it gets trained through seeing many start and end examples, and doing trial and error with reinforcement learning and improving itself over time. And then, once it's done its training, we test it on new start positions and goal positions that it's never seen before. And it has to try and solve these problems in an optimal number of moves.
So I'm just going to run this little video which will show you, going from that start position on the left to end up on the goal position. I think this one's about twelve moves. It's actually a pretty hard task to do in an optimum number of moves. It's really hard even for humans to do this.
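For intuition about the state and action space involved, here is a toy version of the blocks-world move described above; the representation (a list of colour labels per column) is purely illustrative.

```python
# Toy version of the Mini-Shrdlu move: the only action is to lift the top
# block of one column and drop it on top of another column.

def move(columns, src, dst):
    """Return a new state with the top block of column `src` moved to `dst`."""
    if not columns[src]:
        raise ValueError("no block to lift in the source column")
    new = [list(col) for col in columns]
    new[dst].append(new[src].pop())
    return new

start = [["red", "blue"], ["green"], []]
after = move(start, src=0, dst=2)      # lifts "blue" onto the empty column
print(after)                           # [['red'], ['green'], ['blue']]
```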
And so now it's solving pretty interesting logic puzzles. Also, what we've been using Neural Turing Machines to do recently is solve graph problems, which, as you all know, are a general class of problems. And we'll be publishing something pretty impressive, I think, in the later part of this year on this topic, to add to our arXiv paper that we already published last year.
Now, we're also experimenting with language as well. We've incorporated a cut-down version of language into these Shrdlu tasks. And here, the Neural Turing Machine is reading a set of constraints that are given to it in the code that you can see at the bottom of the screen. So here, each of the blocks is numbered, and there are some constraints that you want to satisfy with the goal configuration.
So, in this case, block three should be down from block five, four up from two, one up from four, and six down from three. And so it reads this in, character by character, remembers these instructions, and then starts executing the actions. And then it solves the puzzle, and this is the end position that satisfies all those constraints.
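One plausible way to check constraints of this kind is sketched below, interpreting "A down from B" as block A ending up at a lower height than block B. That reading, and the data layout, are assumptions for illustration only.

```python
# Hedged sketch of checking pairwise "down from" / "up from" constraints,
# assuming "A down from B" means A ends up at a lower height than B.

def height_of(columns, block):
    for col in columns:
        if block in col:
            return col.index(block)            # height of the block in its column
    raise ValueError("block {} not on the board".format(block))

def satisfies(columns, constraints):
    """constraints: list of (lower_block, upper_block) pairs."""
    return all(height_of(columns, lo) < height_of(columns, hi)
               for lo, hi in constraints)

# One configuration satisfying the four constraints read out in the talk.
goal = [[6, 3, 5], [2, 4, 1]]                  # bottom-to-top stacks per column
print(satisfies(goal, [(3, 5), (2, 4), (4, 1), (6, 3)]))   # True
```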
Another thing we're doing now-- there are still challenges to overcome in Atari, but we're also starting to move towards 3D environments. So we've repurposed the Quake III engine and added modifications to it. We call it Labyrinth. And we're starting to tackle all kinds of navigation problems and interesting 3D vision problems within this kind of labyrinth-like environment.
So I'll just roll the video of this agent finding its way through the 3D environment, picking up these green apples which are rewarding, and then trying to find its way to the exit point. And again, all of this behavior is learned just through-- the only inputs are the pixel inputs, and it has to learn how to control itself in this 3-D environment and find its way around and build maps of the world.
So here, for an agent like that, we're starting to integrate some of these different things together-- deep reinforcement learning with memory and 3D vision perception. As we take this forward, one of our goals over the next year is to create a rat-level AI-- an AI agent that's capable of doing all the things a rat can do. And, you know, rats are pretty smart, so that's quite a lot of things. So we're looking at the rat literature, actually, for experimental ideas, experimental tests that we can test our AI agent on.
So now I want to switch to AlphaGo, which is also part of this big push we're making to go beyond the Atari work. One of the reasons we took on AlphaGo is that we wanted to see how well these neural network approaches could be meshed with planning approaches. And Go is really the perfect game to test that out with.
So this is the game of Go, for those of you who don't play. This is what a board looks like. It's a 19 by 19 grid, and there are two sides-- black and white-- taking turns. And you can place your piece, which is called a stone, on any empty vertex on the board.
Now, Go has a long and storied tradition in Asia. It's more than 3,000 years old. Confucius wrote about it 2,000 years ago, and he actually talked about Go being one of the four arts you need to master to be a true scholar. So in Asia it's really regarded up there with poetry and calligraphy and other art forms.
There are 40 million active players today and more than 2,000 professionals, who start going to Go school before they're teenagers, from around the age of eight, nine, or ten. They go to special Go schools instead of normal schools.
And although the rules of Go are incredibly simple-- in fact, I'm going to teach you how to play Go in two slides in a minute-- they actually lead to profound complexity. One way of quickly illustrating that is that there are more than 10 to the power 170 possible board configurations. That's more than there are atoms in the universe by a large margin.
So the two rules are-- rule one, the capture rule. Stones are captured when they have no free vertices around them, and these free vertices are called liberties. So let's take a look at a position from an early part of a Go game, and let's zoom into the bottom right of the board to illustrate this first rule.
So here, you can see this white stone that's surrounded by the three black stones only has one remaining free vertex, one remaining free liberty. So if black was to play there, it would totally surround that white stone, and that white stone would be captured and removed from the board. And actually, big groups of stones can be captured in this way, not just one at a time. Whole large groups can be captured if you surround all of their empty vertices. So that's the first rule.
The second rule is called the ko rule. And that states that a repeated board position is not allowed. So let's imagine we're in this position now and it's white to play. Now, white could capture that black stone by playing here and taking that black stone off the board. So now it's black's move, and you might be wondering, well, can't black just capture back by replacing that stone and taking white?
So what happens if black were to play this? It's not allowed, because if black were to play back there and remove the white stone, you'll see that the position we're in now is identical to the position we started with. So that black move is not allowed. Black would have to play somewhere else first to break this symmetry and could then go back and recapture that stone.
And that's it. Those are the rules of Go. The idea of Go is that you obviously want to take your opponent's pieces by surrounding them. But, actually, the main thing you are trying to do is wall off parts of empty territory on the board. And then at the end of the game, when both players pass-- when they don't think they can improve their positions any further-- you count up the territory you've got and you add the prisoners that you've taken from your opponent. And the person with the most points wins the game.
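Since the rules are so simple, the capture rule is easy to state in code. The sketch below uses a flood fill to collect a connected group and its liberties; the board encoding (a dict mapping coordinates to 'B' or 'W', with empty points absent) is just one possible choice, not any particular Go engine's.

```python
# Capture rule as a flood fill: collect the connected group of one colour and
# the empty points (liberties) adjacent to it; a group with no liberties is removed.

def group_and_liberties(board, start, size=19):
    colour = board[start]
    group, liberties, stack = set(), set(), [start]
    while stack:
        point = stack.pop()
        if point in group:
            continue
        group.add(point)
        x, y = point
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not (0 <= nx < size and 0 <= ny < size):
                continue
            neighbour = board.get((nx, ny))
            if neighbour is None:
                liberties.add((nx, ny))          # empty vertex = a liberty
            elif neighbour == colour:
                stack.append((nx, ny))           # same-colour stone joins the group
    return group, liberties

def remove_if_captured(board, point):
    group, liberties = group_and_liberties(board, point)
    if not liberties:                            # no liberties left: whole group captured
        for p in group:
            del board[p]
        return len(group)
    return 0
```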
So the rules of Go are simple, but it's pretty much the most profound and elegant game I think that mankind has ever devised. And I say that as a chess player. You know, I think Go is really the pinnacle of perfect information games. It's definitely the most complex game that humans have spent a significant amount of time mastering, and it's still played at a very high professional level today. And because of this huge complexity, Go has been an outstanding grand challenge for AI for more than twenty years, especially since the Deep Blue match.
And the other interesting thing for us is that-- and I'm going to come back to this more in a minute-- that if you ask top Go players, they'll tell you that they rely on their intuition a lot to play Go. So Go really requires both intuition and calculation to play well. And we thought that mastering it, therefore, would involve combining pattern recognition techniques with planning.
So why is Go hard for computers to play? Well, the huge complexity means that brute force search is not tractable. And really, that breaks down into two main challenges. Firstly, the search space is really huge. There's a branching factor of more than 200 in an average position in Go.
And the second point, which is probably an even bigger problem, is that it was thought to be impossible to write an evaluation function to tell the computer system who is winning in a mid-game position. And without that evaluation function, it's very difficult to do efficient search.
So I'm just going to unpack these by comparing Go to chess, and you'll see the difference. So in chess, in an average position, there are about 20 possible moves. So the branching factor in chess is 20.
In Go, by contrast, as I just mentioned, it's more like 200. So there's an order of magnitude larger branching factor. Plus, Go games tend to last two to three times longer than chess games.
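A quick back-of-envelope calculation shows what those branching factors imply for brute-force search. The game lengths used here (roughly 80 plies for chess and 200 for Go) are rough assumptions, just to show the scale of the gap.

```python
# Back-of-envelope game-tree sizes implied by the branching factors above,
# with assumed game lengths in plies.
import math

def tree_size_log10(branching, plies):
    return plies * math.log10(branching)     # log10 of branching ** plies

print(round(tree_size_log10(20, 80)))        # chess: roughly 10^104 continuations
print(round(tree_size_log10(200, 200)))      # Go:    roughly 10^460 continuations
```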
The evaluation function-- why is this so difficult for Go? Well, we still believe, actually, that it's impossible to handcraft a set of rules to tell the system who's winning. So you can't really create an expert system for evaluating a Go position.
And the reasons are, there's no concept of materiality in Go. In chess, as a first approximation, you can just count up the value of the pieces on each side and that will tell you roughly who's winning. You can't do that in Go because, obviously, all the pieces are the same.
Secondly, Go is a constructive game, so the board starts completely empty and you build up the position move by move. So if you're going to try and evaluate a position halfway through or at the beginning of the game, it's very difficult because it involves a huge amount of prediction about what might happen in the future. If you contrast that with chess, which is a kind of destructive game, all the pieces start on the board and, actually, the game gets simplified as you move towards the endgame.
The other issue with Go is that it's very susceptible to local changes, very small local changes. So even moving one piece around out of this mass of pieces can actually completely change the evaluation of the position.
So Go is really a game about intuition, actually, rather than calculation. And because the possibility space is so huge, I think it's kind of at the limit of what humans can actually cope with and master. And, you know, I've talked to a lot of top Go players now, and when you ask them why they played a brilliant move, quite often they'll just tell you that it felt right-- and they'll use those words.
If you ask a chess grandmaster why they played a particular move, they'll usually be able to tell you exactly the reasons behind that move. You know, I played this move because I was expecting this, and if that happens, then I'm going to do this. And they'll be able to give you a very explicit plan of why that move was good.
And you can see that Go definitely has a sort of history and tradition of being intuitive rather than calculating because it has notions of things like the idea of a divine move. And actually, there are some famous games in history that get names, and within those games, there are famous moves. And those moves are sometimes named as well.
And if you talk to a top Go player, they dream about one day, at one point in their career, playing one of these divine moves, a move so profound it's almost as if it was divinely inspired. And you can look that up online. They have some really interesting stories from the Edo period in Japan of these incredible games played in front of the shogun and these divine moves being played, ghost moves.
So how did we decide to tackle this intuitive aspect of Go? Well, we turned to deep neural networks. And, in fact, what we did is, we used two deep neural networks. So I'm just going to take you through the training pipeline here.
We started off with human expert data-- we downloaded about 100,000 games of strong amateurs playing each other from internet Go servers. And we first of all trained, through supervised learning, what we called a policy network. This deep neural network was trained to mimic the human players.
So we gave it a position from one of those games. And, obviously, we know what the human player played. And we trained this network to predict and play the move the human player played.
And after a whole bunch of training, we could get reasonably accurate-- we could get to about 60% accuracy in terms of predicting the move that the human would play. But, obviously, we don't want to just mimic how human players play, especially not just amateur players. We want to get better than the human players.
So this is where reinforcement learning comes in. We then iterate this policy network through self-play many millions of times, playing against itself and incrementally improving the weights in the network to slowly increase its win rate. After millions of games of self-play, this new policy network has about an 80% win rate against the original supervised-learning policy network.
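The two training stages just described can be illustrated on a toy softmax policy: a cross-entropy step towards the human move for the supervised stage, and a policy-gradient step scaled by the final game outcome for the self-play stage. The linear model and the sizes below are stand-ins; the real policy network is a deep convolutional net.

```python
# Toy linear softmax policy illustrating (1) supervised imitation and
# (2) policy-gradient self-play updates scaled by the game outcome z = +1/-1.
import numpy as np

N_FEATURES, N_MOVES = 32, 361            # toy sizes; real inputs are board feature planes
W = np.zeros((N_MOVES, N_FEATURES))

def policy(features):
    logits = W @ features
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()               # probability distribution over moves

def supervised_step(features, human_move, lr=0.01):
    """Cross-entropy step towards the move the human actually played."""
    global W
    grad = -policy(features)
    grad[human_move] += 1.0              # gradient of log p(human_move) w.r.t. logits
    W += lr * np.outer(grad, features)

def reinforce_step(features, sampled_move, outcome, lr=0.01):
    """Policy-gradient step: reinforce moves from won games (outcome=+1),
    suppress moves from lost games (outcome=-1)."""
    global W
    grad = -policy(features)
    grad[sampled_move] += 1.0
    W += lr * outcome * np.outer(grad, features)
```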
Then we freeze that network and we play that network against itself 30 million times. And that generates our new Go data set. And we take a position from each of those 30 million games. And, obviously, we have the position, and we also know the outcome of the game. We know who finally won, black or white. And then, with that much data, we were finally able to crack the holy grail of creating an evaluation function.
So we created this second neural network, the value network, which is a learned evaluation function. It learned to take in board positions and try to accurately predict who is winning and by how much. So after all of this training-- which took a lot of compute power-- we end up finally with two neural networks. The policy network takes a board position [INAUDIBLE] as an input, and the output is a probability distribution over the likelihood of each of the moves in that position.
So the green bars here-- the height of the green bars on the green board-- represent the probability mass associated with each of the possible moves from that position. And then the second network is this value network here in pink. And, again, you take the board position as an input.
But, here, the output of the network is just a single real number between 0 and 1. And that indicates whether white or black is winning and by how much. So if it was 0, that means white would be completely winning. And 1, black would be totally winning. And 0.5, the position would be about equal.
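A companion sketch for the value network, again on a toy linear model: a sigmoid output in (0, 1) regressed towards the final outcome of each self-play game, using the same 0-for-white, 1-for-black convention as above. The real value network is, again, a deep convolutional net.

```python
# Toy value model: features -> a single number in (0, 1), regressed towards
# the observed game outcome (0 = white wins, 1 = black wins).
import numpy as np

N_FEATURES = 32
w = np.zeros(N_FEATURES)

def value(features):
    return 1.0 / (1.0 + np.exp(-(w @ features)))     # sigmoid output in (0, 1)

def value_step(features, outcome, lr=0.01):
    """One regression step towards the final outcome of the self-play game
    (gradient step for a cross-entropy loss with a sigmoid output)."""
    global w
    prediction = value(features)
    w += lr * (outcome - prediction) * features
```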
So we take those forwards-- but the neural networks are not enough on their own. We also need something to do the planning. And for that, we turn to Monte Carlo tree search to stitch this all together, and it uses the neural networks to make the search more efficient.
So I'm just going to show you how the search works here. So imagine that we're in the middle of pondering what to do in a particular position, and imagine that position is at the root node of this tree represented by the little mini Go board here. And perhaps we've done a few minutes or a few seconds of planning already, so we've already looked at a few different moves represented by the other leaf nodes here.
And what you do is, you've got two important numbers here-- Q is really the current action value of the move, the estimate of how good the move is. And P is the prior probability of the move from the policy network, in terms of how likely it is a human would play that move.
And let's imagine we're following the most promising path at the moment that we've found so far in the bold arrows here that are coming down, and we end up at a node, at a position that we haven't looked at so far. So what happens here is, we expand the tree. And we do that by first calling the policy network to find out which moves are most probable in this position.
So instead of having to look at 200 possible moves, all the different possible moves in this position, we just look at the top three or four that the policy network tells us are most likely. And so that expands the tree there. And then once we've expanded the tree, we evaluate the desirability of that path in two ways.
One is that we call the value network, and that gives us an instant estimate of the desirability of that position. And we also do a second evaluation routine using Monte Carlo rollouts-- we roll out maybe a few thousand games to the end of the game, and then we back up the statistics of that to this node.
And what we've found is that by combining these two evaluation strategies, we can get a really accurate evaluation of how desirable that position is. And, of course, one of the parameters we experiment with is the mixing ratio between what the rollouts are telling us and what the value network is telling us.
And as we improved AlphaGo, we trusted the value network more and more. So I think now the lambda parameter's about 0.8 in favor of trusting the value network, and when we started on this around last summer, it was about 0.5. So then, once you have that, you back the Q value up the tree. And once you've run out of your allocated time, you basically pick the move that has the highest Q value associated with it.
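Putting the pieces together, here is a hedged sketch of that search loop: policy-network priors restrict which children get expanded, leaf positions are scored by mixing the value network with rollouts (the talk puts the mixing weight at about 0.8 on the value network), and the resulting values are backed up the tree. The node fields, the exploration constant, and the policy_net, value_net, and rollout callables are illustrative, not AlphaGo's actual code.

```python
# Sketch of Monte Carlo tree search guided by policy priors and a mixed
# value-network / rollout leaf evaluation.
import math

class Node:
    def __init__(self, prior):
        self.prior = prior        # P from the policy network
        self.visits = 0
        self.total_value = 0.0
        self.children = {}        # move -> Node

    def q(self):
        return self.total_value / self.visits if self.visits else 0.0

def select(node, c_puct=1.0):
    """Pick the child maximising Q plus an exploration bonus weighted by P."""
    total = sum(child.visits for child in node.children.values())
    def score(child):
        return child.q() + c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visits)
    return max(node.children.items(), key=lambda kv: score(kv[1]))

def expand(node, position, policy_net, top_k=4):
    """Only add the few moves the policy network thinks are most likely."""
    for move, prior in sorted(policy_net(position), key=lambda mp: -mp[1])[:top_k]:
        node.children[move] = Node(prior)

def evaluate(position, value_net, rollout, mix=0.8):
    """Mix the value network with rollout statistics (mix ~0.8 per the talk)."""
    return mix * value_net(position) + (1 - mix) * rollout(position)

def backup(path, leaf_value):
    for node in path:
        node.visits += 1
        node.total_value += leaf_value
```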
So if we think about what these neural networks are doing then for us in terms of the search, you could think of it in this way. Imagine that this is the search tree from the current position. It's totally intractable. It's really huge. What we do is, we call the policy network to really cut down the width of that search, to narrow that down.
And the value network really cuts the depth of the search. So instead of having to search all the way to the end of the game and collect millions of statistics like that to be even reasonably accurate, we can truncate the search at any point we like and call the value network.
So once we built the AlphaGo system, it was time to evaluate how strong it was and test it out. The first thing we did was play it against the best commercially available Go programs out there. The two best ones are Crazy Stone and Zen. They've won all the recent computer Go competitions of the last few years, and they've reached about strong amateur level.
So in Go, you start off with this thing called a kyu rating-- K-Y-U-- and your kyu grade goes down as you improve as an amateur. Then, as you get better and become a strong amateur, you get a dan rating, which goes from one dan to about six or seven dan. And then, finally, you can become professional, and the dan ratings start again from one to nine.
So really, these programs were about the strength of strong amateurs, a strong club player. And AlphaGo did incredibly well against them. Of the 495 games we played, it won all but one. And it still had about a 75% win rate against these other programs even when they were given a four-move head start, which is huge in Go-- it's called a four-stone handicap.
And this graph here that I'm showing you is just the single-machine version of AlphaGo-- the distributed version was even stronger. And these Go rankings are quite subjective, so we actually created a numerical ranking, an Elo ranking-- that's on the y-axis on the left-hand side-- which is based on chess Elo ratings and is purely statistical in terms of the win rates of the different programs.
And what we found is that a gap of about 200 Elo points, or 250 Elo points, translates to about an 80% win rate. And AlphaGo was more than a thousand Elo points better than the other best programs. And so, this was back in October, so this is not the most recent version. And we beat all of these other programs, so it was time to test ourselves against some of the world's top human players.
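For reference, that Elo claim can be checked with the standard expected-score formula: a gap of roughly 240 rating points corresponds to about an 80% expected win rate.

```python
# Standard Elo expected-score formula for a rating gap of `delta` points.
def expected_win_rate(delta):
    return 1.0 / (1.0 + 10 ** (-delta / 400))

print(round(expected_win_rate(240), 2))   # ~0.80, i.e. roughly the 80% quoted
```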
So what we did back in October was challenge this lovely guy called Fan Hui, who's now based in France but was born and grew up in China. He's the current reigning three-time European champion. He's a two-dan professional. He started playing Go at seven and turned professional in China at age 16. It's very difficult to turn professional in China, so he was a top, top player before moving to France, and now he coaches the national French team. We played the challenge match in October, and this is what happened.
FAN HUI: I think after first game maybe it don't like fight, it like play slow. So it's why begin second game, I fight. It do mistake sometimes. This gives me confidence. I think maybe I'm right. It's why for another game, I fight all the time. Now it's complicated, now it's complicated. But I lose all my games.
DEMIS HASSABIS: So AlphaGo won five nil, much to our surprise, and became the first program ever to beat a professional at Go. And if you ask AI experts, even the top programmers of these other programs, even a year before, they were predicting this moment would be at least another decade away. So it happened about a decade earlier than the top experts in the field expected, and certainly a decade earlier than the Go world thought it was going to happen.
AUDIENCE: Was this the distributed version or is it single version?
DEMIS HASSABIS: This was the distributed version. And this story ends well, though-- he looks distraught here, but we ended up hiring him as a consultant on our team after this, and he joined our side of the project afterwards. One interesting point about this is that he then came into the office for about a week every month-- he was part of making sure we weren't overfitting in our self-play, by carrying on pitting our wits against him. And he felt that his play had improved by playing against AlphaGo. He actually went from being ranked about 600 in the world at that time in October to, by January or February, three or four months later, being ranked about 300 in the world.
And he'll tell you that it really opened his mind-- he said it freed his mind from the constraints of 3,000 years of tradition, to think in a different way about the game. So it's very interesting. Again, if you want to read the technical details of this, this is another Nature paper-- front cover-- from a couple of months ago. And I think it's caused a really big storm in the AI world and the Go world.
So then it was time to take on a kind of ultimate challenge, which was just a few weeks ago now. We decided to challenge Lee Sedol, who is an absolute legend of the game-- I call him the Roger Federer of Go. He's been indisputably the best player of the past decade, and he's won 18 world titles. He's also famed for his creative style and brilliance, so he was the perfect player for us to pit our wits against. And we played him in early March, for a million-dollar first prize, in Korea.
Now, just before I go to the results, I want to make a side note on compute power here, which I always get asked about. We used roughly the same compute power for this match as we did for the Fan Hui match-- around 50 to 60 GPUs' worth of compute. And you might ask, well, why don't we just use more?
Well, actually, the strength of the program asymptotes quite quickly with more compute power. One of the reasons is that it's actually quite hard to parallelize MCTS algorithms. They work much better, more efficiently, if you do them sequentially, and if you batch them across lots and lots of GPUs, you don't actually get that much more effectiveness out of it. And one measure of that is that the distributed version-- surprisingly, probably, to many of you-- only wins about 75% of the time against the single-machine version.
So we played the match, and as many of you will have seen, we actually won 4-1. And it was pretty outstanding to us, because even the day before the match, they interviewed Lee Sedol, and he was saying he was confident he was going to win five nil. And the whole Go world thought there was no chance we could win. Obviously, they were looking at the Fan Hui matches and trying to estimate-- maybe we'd improved 10-20% since then, and that would have stood no chance against Lee Sedol.
But actually, in the five months that we had between the two matches, the new version of AlphaGo could beat the old version of AlphaGo 99.9% of the time. So it was astoundingly much stronger. And it was an amazing experience out there-- I'll talk about the cultural significance in a second-- but one very nice thing is that the president of the Korean Go Association, in the middle there, awarded AlphaGo an honorary 9-dan certificate for its creative play. It was really beautiful; we have it framed up on the wall.
And I just want to touch on those themes of creativity and intuition. That's one of the reasons I explained to you how to play Go, because I want to try and explain to you some of the significance of what AlphaGo did. Now, chess is really my main game, but I play Go well enough to be able to appreciate what's going on. Probably the best move that AlphaGo played in the whole five-game series-- and maybe the Go world will decide to name this move-- is move 37 in game 2.
And this is the position-- AlphaGo was black, and it decided to play here. It's called a shoulder-hit move, this move here in red. And I'm going to try to explain to you why this move is so amazing by telling you a little bit about Go.
So there are two key lines in Go-- the third and the fourth lines of the board. Those are the critical lines. So here's the third line. If you play a stone on the third line, what you're really telling your opponent is, I'm interested in taking territory on the side of the board. That's what a third-line move means. A fourth-line move, by contrast-- this is the fourth line-- means I'm trying to take influence and power into the center of the board. So you're trying to influence the center of the board and radiate that influence across the board.
And the beauty of Go-- and, I think, one of the reasons why it ended up evolving to a 19 by 19 board-- is that playing on the third and fourth lines, going for territory or influence, is considered to be perfectly balanced. So the territory that you get by playing on the third line is about equal to what the opponent gets by playing on the fourth line and taking power and influence into the center.
The idea is that the influence and power you get, you store up for later, and eventually that will give you territory somewhere else on the board. So that's the classic 3,000 years of Go history-- and yet AlphaGo played on the fifth line, to take influence toward the center of the board. And this is kind of astounding. It goes against 3,000 years of Go history.
And just to show you how astounding that was to the Go fraternity, I want to show you a clip from the live commentary. There were lots of commentary channels-- actually 14 live channels in China, and all the national TV stations in Korea-- but we also had an English-language channel via YouTube. And we had this fantastic commentator called Michael Redmond, who is the only Westerner, the only English-speaking person, ever to get to 9-dan.
And look at his reaction to this move 37. Just to show you what happened-- about 50 moves later, that move ended up influencing the fight over in the bottom-left corner. You can't calculate that, because there are too many possibilities. That was the influence, the power, of that move. So this is Michael Redmond seeing this move.
MICHAEL REDMOND: The Google team was talking about is this evaluation-- the value of--
DEMIS HASSABIS: He doesn't even know where it is.
CHRIS GARLOCK: That's a very surprising move.
MICHAEL REDMOND: I thought it was a mistake.
CHRIS GARLOCK: Well, I thought it was a click miss.
MICHAEL REDMOND: If we were online Go, we'd call it clicko.
CHRIS GARLOCK: Yeah, it's a very strange move. Something like this would be a more normal move.
DEMIS HASSABIS: So I think he means a misclick as opposed to a click miss. He was thinking that our operator-- the person actually playing the moves for AlphaGo, [INAUDIBLE] Huang, the lead programmer-- had actually entered the move wrong into the machine, because it's that surprising a move. And this is what Lee Sedol thought of it-- he disappeared to the bathroom for 15 minutes. That's his empty seat there; no one knew what had happened to him. So maybe it will be called the face-washing move or something, because these moves are usually named after something that happened.
And actually, later, when we investigated the statistics behind this, we found that the policy network gave this move a prior probability of less than one in 10,000. So AlphaGo is not just repeating what it's seen in these professional games, because it would never have thought to play this move. Later on, some of the 9-dan pros commented that it's not a human move-- no human would ever have played this move. So it's really kind of an original move, if you like. And one thing that we think is going on here is that--
AUDIENCE: Do you have data that [INAUDIBLE]?
DEMIS HASSABIS: No, we can't yet. We need to build more visualization tools to actually do that; we're building those at the moment. It's pretty hard for us to explain why it's done that.
So here, in terms of these surprising moves, I think this shows some kind of originality. And what it might mean-- when I talked to Michael Redmond about this, he said that AlphaGo has this very light touch. It doesn't commit to territory early, and what we think is going on is that AlphaGo really likes influence in the center of the board. It likes it so much, and it's so good at ultimately making that influence pay later on in the game, that it actually thinks fifth-line influence is good enough. So this may cause a whole rethink in the game of Go as to what's an acceptable trade.
Then, I must say, the other really spectacular move was played by Lee Sedol in game 4. So we won the first three games, and then Lee Sedol came back strongly, because he's an incredible games player. I've met many of the best games players in the world, Garry Kasparov and others, but I put Lee Sedol at the top of all the games players I've met in terms of his creativity and fighting spirit. And he won game 4 by playing this incredible move, move 78.
I haven't got time to go into why this is so special, but basically, when we looked at the data on this as well, we found that AlphaGo thought the probability of this move was also less than one in 10,000. So it was totally unexpected for AlphaGo, and that meant all the pondering and search it had done up to the move prior to this ended up having to be thrown away. It basically had to start again as soon as this move happened, and for some reason this caused a misevaluation in the value net-- we're still investigating what happened there.
So the cultural impact of this match was huge. We had 280 million viewers-- that's more than the Super Bowl-- and 60 million viewers just in China for the first game. We were being stopped in the streets in Korea; it was pretty crazy. There were 35,000 press articles, and it was literally front page of all the newspapers in Korea every day. And the thing I liked most, actually, was that it popularized Go in the West-- there's been a worldwide shortage of Go boards in the weeks since, still now, I think. If you're trying to order a Go board, you might have trouble because of this match, which is fantastic to see. The press coverage was just insane.
These are pictures of the press room-- just a scrum, with 50 live TV cameras in the back. It was on all the national TV stations and on jumbo screens in the shopping districts. It was pretty crazy, amazing to see. I think for Korea it was the perfect match-up-- they love technology, they love AI, and they love Go-- so for them it was the perfect storm.
And this is one interesting thing that I want to show-- the rate of progress of AlphaGo. We started this project only just over 18 months ago, and the progress has been relentless from the beginning. These techniques can improve themselves: you can create more data, train new versions, and those versions create better, higher-quality data. That virtuous cycle has delivered about a one-rank improvement per month, which is pretty astounding.
And the interesting thing is we haven't really seen any asymptote yet, so we're quite anxious to see how far this can go-- what optimal, or near-optimal, play in Go looks like, and how much further there is to go. Actually, I think most of the Go professionals are really interested in this question as well. And I'm pretty sure that, just like with Fan Hui, when we ultimately release AlphaGo in some way to the public, it will improve the standard of Go and bring in whole new ideas.
So after the heat of battle, I had a great dinner catch-up with Lee Sedol, who's also an amazing and lovely guy, and we talked about the match. He told me that it was one of the greatest experiences of his life, and that just the five games he played had totally rejuvenated his passion for Go, and his ideas and creativity about what could be done.
AUDIENCE: How many games a day does the machine play against itself?
DEMIS HASSABIS: It's playing a few thousand a day, depending on how many machines we use.
AUDIENCE: So [INAUDIBLE] is when human experience is at?
DEMIS HASSABIS: Potentially. I mean, these pros play several thousand games a year-- probably about 1,000 to 2,000 when they're training-- so it's quite a lot. Plus they read a lot about all the ancient games.
AUDIENCE: Do you think the strong culture in Go has forced human play into a corner instead of--
DEMIS HASSABIS: I don't think so, because there are three different schools of Go, the Japanese, the Koreans, and the Chinese. And they're very competitive against each other. And they approach the game differently, and I think that that creative tension has forced them out of local maxima, I would say.
So just to compare Deep Blue with AlphaGo, to be clear again about the differences. Deep Blue-- again, not to take away from the immense achievement it was for its time, absolutely incredible-- used handcrafted chess knowledge. By contrast, AlphaGo has no handcrafted knowledge; all the knowledge it has, it's learned from expert games and through self-play.
Deep Blue did full-width search-- it pretty much looked at all the alternatives-- and that's why it needed to crunch 200 million positions per second. By contrast, AlphaGo uses these two neural networks to guide the search in a highly selective manner, and that means we only need to look at 100,000 positions per second to deliver this kind of performance.
So I just want to finish with a couple of words on intuition and creativity. This may be a little bit controversial-- I'm not saying this is the full truth of the matter, or that it fully encompasses everything to do with intuition and creativity, but I think these are interesting thoughts. We have to define a little bit what we mean by intuition. The way I think about it for Go is as the implicit knowledge that humans have acquired through the experience of playing Go, but that is not consciously accessible or expressible-- certainly not communicable to someone else, but not even to themselves.
But we know this knowledge is there, and we know it's of very high quality, because we can test it and verify it behaviorally-- obviously through the moves that the player plays. Secondly, what is creativity? I'm sure everyone in this room has their own pet definition, but I think it definitely encompasses the ability to synthesize the knowledge you've accumulated and use it to produce novel or original ideas. And I think, at least within the admittedly constrained domain of Go, AlphaGo has pretty clearly demonstrated these two abilities.
And obviously, while playing games is a lot of fun, and I believe the most efficient way to go about AI development, that's not the end goal for us. We want to take the technologies that we've built as part of AlphaGo, which we believe are pretty general-purpose, extend them, use components of them, and apply them to have an impact on big, challenging, real-world problems. And we're looking at all sorts of areas at the moment, like healthcare, robotics, and personal assistants.
So I just want to thank the amazing AlphaGo team, who did all this incredible work-- really incredible engineering and research efforts. And again, I want to stress that all the work I've shown you today is probably less than a tenth of the work that we're doing at DeepMind. If you're interested in seeing all of our publications, they're on our website-- there are about 70 to 80 publications there now of all of our latest work. And of course, I must mention, if you want to get involved, we are hiring both research scientists and software engineers. Thanks for listening.
[APPLAUSE]
DEMIS HASSABIS: Yeah?
TOMASO POGGIO: You have--
DEMIS HASSABIS: Yeah, go for it. Do you want to use this?
TOMASO POGGIO: For a second. Thank you. So let's have a couple of questions. Anybody? Yeah? OK. Let me--
AUDIENCE: If groups of people play together could they beat AlphaGo?
DEMIS HASSABIS: I think the question was, can a group of players playing together beat AlphaGo? Maybe. That's something we might play in the future, actually-- a group of top professionals versus AlphaGo. It'd be quite interesting to see, because it's known that some of these top players are really good at the opening or the middle game or the endgame, and you could switch between them. And I'm sure they'd be a lot stronger together. So maybe we'll do that towards the end of the year or next year. Yes, behind you. Yeah.
AUDIENCE: You mentioned earlier using visualization to better understand why AlphaGo--
DEMIS HASSABIS: Yeah.
AUDIENCE: [INAUDIBLE] Can you talk about that?
DEMIS HASSABIS: Yeah--
AUDIENCE: Can you repeat the question?
DEMIS HASSABIS: Yes, so the question was about using visualizations to understand better how AlphaGo works. We think this is a huge issue with the whole deep learning field, actually-- how can we better understand these black boxes that are doing these amazing things, but quite opaquely? And I think what we need is a whole new suite of analysis tools and statistical tools and visualization tools to do that. Again, I look to my neuroscience background for inspiration-- for those of you who do fMRI or that kind of analysis, I think we need the equivalent of SPM for a virtual brain.
So we actually have a project called virtual brain analytics, which is around building these kinds of tools so that we can better understand what representations these networks are building. So hopefully in the next year or so we'll have something much more to say about that. Yeah?
AUDIENCE: So you mentioned that Deep Blue used sort of human crafted moves, which sort of helped them. And then AlphaGo didn't have that, but it still learned from moves and experiences of the game.
DEMIS HASSABIS: Yeah.
AUDIENCE: Is there any sort of hope for completely reinforced learning--
DEMIS HASSABIS: Yeah.
AUDIENCE: In Go or even in other agents. What is the--
DEMIS HASSABIS: Yeah, so it's a really good question, actually. The question is, can we do away with the supervised learning part, and just go all the way from literally random, using reinforcement learning, up to expert. We plan to do this experiment actually. So we think it will be fine, but it will take a lot longer to train, obviously, without bootstrapping with the human expert play. So until now, we've been just concentrating on trying to build the strongest program we can in the fastest time. So we haven't had time to experiment with that, but there are a number of experiments like that that we want to go back to and try.
I will say that a very smart master's student from Imperial College London did do this for chess from scratch, and it got to international master standard. So it seems like this is definitely possible. And actually we've hired him now-- Matthew Lai, he's called-- so he may end up being the person who looks at this as well. So maybe someone from near the back. Yeah?
AUDIENCE: So [INAUDIBLE]
DEMIS HASSABIS: Sorry?
AUDIENCE: [INAUDIBLE]
DEMIS HASSABIS: Yes.
AUDIENCE: That algorithm [INAUDIBLE].
DEMIS HASSABIS: Yes, potentially. So we're thinking about adding learning into that part too. And also maybe there are ways of doing away with some of that [INAUDIBLE] search too, there are other ways of doing that search, more like imagination based planning. So we're thinking about that as well. Maybe back there, yeah?
AUDIENCE: [INAUDIBLE]
DEMIS HASSABIS: So I think the question, if I understand it correctly, is that if agents play games well, is that AI? Is that what you're asking, or is that--
AUDIENCE: Yes. Can AI [INAUDIBLE]
DEMIS HASSABIS: Well, obviously, that's our thesis-- that this will work. But I think you have to be careful how you build the AI. There are many ways you could build AI for games that would not be generalizable, and I think that's been the history with commercial games-- I've also helped make lots of commercial games, which have AI in them. Usually the in-built AI is a special case, usually finite state machines or something for the game, and it utilizes all kinds of game-state information that you wouldn't have access to if you were just using perception.
So I think you have to be careful to use games in the right way, and to treat the agent really as a virtual robot, with all that that entails in terms of what it has access to. As long as you're careful with that, then it's fine. And one way we enforce this is that we have a whole separate evaluation team of amazing programmers-- most of them are ex-games programmers-- who build the environments and the APIs to the environments and so on. They're entirely separate from the algorithm development teams, and the only way the AIs can interface with the games is through these very thin APIs. So we know there's no way, even if a researcher was to be lax about this, that the agents can access things they're not supposed to. Let's pick from the left-- we'll just go around. Any questions? Yeah, here.
AUDIENCE: Why does AlphaGo improve? Is it [INAUDIBLE], self-training, or do you tweak it?
DEMIS HASSABIS: Well, we're doing both, actually. There's self-training, in the sense that it's producing high-quality data and tweaking itself through deep reinforcement learning, and we're also actively doing tons of research on new architectures and parameters and other things. So it's all of the above-- we really threw everything at it.