Tutorial: Reinforcement Learning (1:07:33)
Date Posted:
August 15, 2018
Date Recorded:
August 15, 2018
CBMM Speaker(s):
Yen-Ling Kuo ,
Xavier Boix
Description:
Xavier Boix & Yen-Ling Kuo, MIT
Introduction to reinforcement learning, its relation to supervised learning, and value-, policy-, and model-based reinforcement learning methods. Hands-on exploration of the Deep Q-Network and its application to learning the game of Pong.
Slides and Code example
XAVIER BOIX: OK, reinforcement learning, so what is this? So this is the talk we are going to have. I'm going to go into some detail on this, but the setup we have here is an agent that is going to learn by interacting with the environment. We find this setup in video games, robotics, board games. There are a ton of different setups we'll find that have had successful applications of this.
What do we mean exactly by this? We'll conceptualize this environment in the following way. We have an agent, which is going to be an AI, the AI that is going to be learning by interacting with the world. The way the agent interacts is by taking some actions. These actions, as we'll see later, could be, for example, if the agent is a robot, moving in the environment or picking things up in the environment. If it's in a video game, it could be just pressing a button on the keyboard in order to interact with the video game.
From the world, the agent will make observations. If the agent is a robot and it has eyes, a camera, it will be able to observe what's going on around it, and there will be rewards coming to the agent. So the agent will have some sense of what is good for it, and as you can expect, the agent will try to maximize those rewards.
Also, an important detail: we are going to consider discrete timing, so all of this will happen in discrete time steps. At time step t, the agent receives an observation and a reward, and with this information, the agent decides to take an action in the world. The world receives the action, something happens in the world, and the world produces a new observation and a new reward at time t plus 1. So there's going to be continuous interaction between the agent and the world.
And we'll consider that the agent accumulates an experience, which we can express through these concepts. An experience is simply a sequence of observations, rewards, and actions. So a sequence of these, we'll consider an experience. An important concept in reinforcement learning is the state. The state is the current situation of the world, and given the state, the agent will have to decide which action to take.
So we can conceptualize the state in this way: it's a function that depends on the whole experience, on that sequence. And we'll make an assumption here that the state only depends on the current observation.
An example of this is the setup in a video game. Imagine that you want to play Breakout, and you want to design an AI that is able to learn to play Breakout by itself. No one will tell the AI how to play Breakout; the AI, by interacting with the game, will come up with a strategy to play Breakout and be successful at it.
So the agent, as we said, will have the ability to take some actions that will produce something in the environment, in the video game. In this video game, those actions are that it can only go right or left. So the output of the agent is going to be, OK, now I move right, I move left, or maybe I do nothing.
And then the agent will observe something from the video game, and it will get some rewards. In the case of this video game, these rewards will be the score. So the agent will try to produce actions that make the score as high as possible. That's going to be the goal of the learning. And at the same time, the agent will be observing the video game.
I put four frames here because in order to have a notion of the motion that's going on in the video game, you need more than one frame. So in the case of this video game, usually the agent gets multiple frames as its observation in order to capture the motion that has been happening during the game. That's the basic setup of reinforcement learning, and now we are going to try to see how the agent can come up with a strategy to learn to produce actions.
And as you can imagine, one option for the agent is a neural network. That is the case we're going to study. We are going to see it in practice: we are going to build a big neural network that will learn to play video games at the end of this tutorial.
So in the last few days, you have been studying supervised learning, if you hadn't been before. So how does reinforcement learning relate to supervised learning, the machine learning that you have been studying in the previous tutorials? This is the relationship. For example, imagine that you have a data set of dogs and cats. You have images of dogs and cats, and the task of the agent is to say if the image is a dog or a cat. With supervised learning, the agent is given examples of dogs and cats, each of them with a label, and the agent has to learn to produce the correct label for each image, for unseen images. That's the paradigm of supervised learning.
So you could cast this in the scheme of reinforcement learning in the following way. The agent can take two actions: the agent can say, well, this image I've just seen is a dog, or this image is a cat. That would be the action. The reward would be the classification score for that particular image. When the agent says it's a dog, the environment, the world, will say, well, yeah, you were correct, or you were wrong. And the observation, of course, is the image.
So there are two differences between this and reinforcement learning. One is the sequence: here, the sequence of experience is just one step. The agent receives one image, produces one action, and that's it. Then you can start a new experience. And one image doesn't depend on previous images; the data set provides images independently from previous actions.
And the second difference is that there is immediate feedback. So when the agent says there is a dog, the world immediately tells it, OK, you were right or you were wrong. In reinforcement learning, this doesn't need to be the case. You might be playing a video game, you may be pressing left, you may be moving left, and maybe the score doesn't increase or decrease.
So you don't have feedback immediately. The feedback will come later. You press buttons while playing a video game, and the feedback comes later. So I would say it's a more challenging situation than supervised learning.
Let's go a bit more into detail about reinforcement learning, with the concept of a policy. The policy is the strategy for saying which action to take, and this will depend on the state. This will depend on the current situation that the world is in. So essentially, you can define pi as a function that maps the state to an action.
We can consider that there are two possibilities. One is deterministic: there is just a mapping. And the other is that there is some randomness in the policy; depending on the situation, it might be useful to consider some randomness in case you want to explore sequences randomly. But the whole concept here is that the policy maps the state, the current situation, to an action.
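In the usual notation (the slide itself is not reproduced here), these two cases can be written as:

```latex
\pi(s_t) = a_t \quad \text{(deterministic policy)}
\qquad\qquad
\pi(a_t \mid s_t) = P(a_t \mid s_t) \quad \text{(stochastic policy)}
```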
And another central concept in reinforcement learning is the value function. We're talking about an agent that wants to maximize reward, wants to get as much reward as possible, but what you don't want is to just maximize the immediate reward. You don't produce actions just to get reward immediately; you want the reward that you get over the whole sequence to be as big as possible, and we can express this in the following way. You can talk about the reward that comes at this time, at time step t, and the future rewards.
Usually, we'll consider the current reward to be more valuable than the future rewards by adding a weight, gamma, and this weight will decrease over time. And there is an expectation here because we are not sure about the reward. Depending on which policy we play in the world, we'll have more reward or less reward, so we will never be sure about what reward we are going to get. So essentially, we have this value function that depends on the current state and that measures what reward we expect to get if we are in that state.
And I want you to notice the following thing: in this function, there is recursion. So there is the reward at this time step and the rewards at future time steps, but all of this here also defines a new value function, in this case at state t plus 1. So the value function of state t starts with reward t, but then there is some recursion, in which the value function of state t plus 1 appears.
Also, we introduce this probability, the probability of transitioning into a new state given the current state and an action. So here there is a Markovian assumption, meaning that the next state only depends on, or can be modeled only by using, the current state and the action. By introducing this and taking it into account, we can write the value function in the following, recursive way.
So we have the value function here; essentially, we've put all of this into this term here. And you can see that a new value function appears here, in this case at state t plus 1. So there is some recursion: the value function depends on itself at the next state. And all of this is an expectation, because we don't know what the state is going to be in the future, so this is an expectation over the probability of transitioning.
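In standard notation (the slide equations are not shown here), the value function and its recursive form just described are:

```latex
V^{\pi}(s_t) = \mathbb{E}\!\left[\, r_t + \gamma\, r_{t+1} + \gamma^2 r_{t+2} + \cdots \,\middle|\, s_t \right]
```

```latex
V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi}\!\left[\, r_t + \gamma \sum_{s_{t+1}} P(s_{t+1} \mid s_t, a_t)\, V^{\pi}(s_{t+1}) \right]
```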
And finally, before Yen-Ling goes into the different methods: a similar concept to the value function is the Q-function, and essentially it is the same thing. The only difference is that the value function only depends on the state, and there is an expectation over the actions, while the Q-function is the same but with an action that we have already decided. We say, OK, what is the value, what is the expected reward, if I am in this state and I take this action.
Essentially, it's exactly the same as the value function, but without the expectation over actions, because the action is given. And as you will see later, this will be very useful to introduce DQN, the algorithm that we are going to study with deep learning.
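In the same notation, the Q-function described here is:

```latex
Q^{\pi}(s_t, a_t) = \mathbb{E}\!\left[\, r_t + \gamma\, V^{\pi}(s_{t+1}) \,\middle|\, s_t, a_t \right]
```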
So just a small recap, in this crash course-- Yeah, go ahead.
AUDIENCE: So the difference between the value function and the Q-function, just to clarify, would be that the value function is linked to the expectation of what the outcome of an action would be? And the Q-function seems to be after the action is taken?
XAVIER BOIX: Yeah. Both are expected rewards; in the value function, the expectation is also over the action, but here the action is given. It's fixed. So essentially, both of them measure expected reward, one given the action, the other without. If you don't have the action given, you also have to take the expectation over actions, which depends on the policy.
Just to recap, what we're saying is that reinforcement learning is trying to learn a policy that maximizes the expected reward. Those are the two key concepts, and that's what we are going to see. Now, Yen-Ling is going to give an overview of the big landscape of reinforcement learning in the literature.
YEN-LING KUO: So as a recap of this setup: in reinforcement learning, we have rewards, a model of the world, and actions. That means there are several things we can try to optimize or learn to figure out what is the best policy to take in this kind of setting. One thing, continuing from the value function we just discussed, is that when you have an expectation at each state, you can compute the expected value and then select the optimal policy based on that expectation.
But sometimes it is not very easy to figure out what's going to happen in the future and compute all the expectations. So some people may think, OK, what about optimizing the policy directly? We just randomly sample or choose some policy, and then we optimize, moving slowly from a bad one to a good one. And once that actually works, I can get a reward and a good policy.
And lastly, since we are interacting with the environment, sometimes the world is simple or easy, like when you are playing games. So we can actually know the rules and learn the interaction model or transition model of a simple or easy-to-learn world. Once we learn the model, we know the transitions, and then we can plan actions using the world model. So I'm going to talk in a little bit of detail about each of these reinforcement learning types.
So first, value-based reinforcement learning. It is built on top of the value function we just talked about. I'll give some examples of how to evaluate the expected reward at each state. Take this simple grid-world example. Minion, a cartoon character, really likes bananas. So he gets a reward for getting more bananas, and if he loses bananas, he is not so happy.
But sometimes on the way, there is an opponent who will take all his bananas, so he will lose, and he would rather avoid that state. So now he's at state t, and we want to estimate the expected value at this state. Let's take a look at the policy, all the actions he can take.
So now we take a sample of the actions from the policy: he can go up, left, right, and down, and these are the probabilities of selecting each action. From the table lookup, we can see that this cell has an expected value of around 6, because it's closer to the huge reward. And this one, even though there is an immediate threat, the expected value is probably not so bad, because the future return from getting the banana is also high, so you can take some risk for it.
So once we have these values, we can compute the expected value from all the values at the next time step to be the one we use at this state. So to choose a policy, what can we do? How can we choose a policy using these value functions?
One thing we can do first is evaluate the Q-value. Here we are using the Q-value instead of just the value function because we are considering the action. So we want to know, at this state, what action is more likely to give me the better return. And then once we have filled in the expected reward for each of the cells, we can take a step to select the policy based on the Q-values we know.
So I choose the actions that have a better expected reward under my current estimation, and then I get a new policy. And then, of course, I can do this over and over: once I get a new expected reward, I have a new policy, I update to a more accurate reward estimate, and then I can improve my policy again.
This algorithm is called policy iteration. We iterate to find a good policy given the value functions over the state space.
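As a rough illustration of this idea, here is a minimal sketch of policy iteration on a tiny, fully known world. The states, actions, transitions, and rewards below are made-up placeholders, not the Minion grid world from the slides:

```python
# Toy deterministic world: P[s][a] = (next_state, reward). Placeholder values only.
P = {
    0: {"right": (1, 0.0), "down": (2, 0.0)},
    1: {"left": (0, 0.0), "down": (3, 1.0)},   # reaching state 3 gives the big reward
    2: {"right": (3, 1.0), "up": (0, 0.0)},
    3: {"stay": (3, 0.0)},                     # absorbing goal-like state
}
gamma = 0.9

def evaluate_policy(policy, n_sweeps=100):
    """Iteratively estimate V^pi for a fixed policy."""
    V = {s: 0.0 for s in P}
    for _ in range(n_sweeps):
        for s in P:
            s_next, r = P[s][policy[s]]
            V[s] = r + gamma * V[s_next]
    return V

def improve_policy(V):
    """Greedy improvement: pick the action with the best one-step lookahead value."""
    return {s: max(P[s], key=lambda a: P[s][a][1] + gamma * V[P[s][a][0]])
            for s in P}

# Policy iteration: evaluate, then improve, until the policy stops changing.
policy = {s: next(iter(P[s])) for s in P}      # arbitrary initial policy
while True:
    V = evaluate_policy(policy)
    new_policy = improve_policy(V)
    if new_policy == policy:
        break
    policy = new_policy
print(policy, V)
```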
As we said, sometimes you don't really want to compute the expected value, especially when the state space is huge: you often cannot enumerate the actions or possible states. Or sometimes, in robotics, you are dealing with a continuous world, so there is no discrete state or action space, and you cannot compute the value exactly for each continuous action along the way.
In this case, people usually want to optimize the policy, the actions, directly. By selecting a policy, what we mean is: given the state, what is the best action to take at this state. So how can we optimize it? First, we can directly parameterize the policy with some parameter, theta. The goal is to find a good theta that gives us a good policy.
Let's take another concrete example. Minion wants to get the banana again. Initially, we have some policy like this. This policy is a distribution parameterized by theta. We know, OK, this policy distribution is probably not so good because there is a threat on the way, so we want to improve on top of it.
So how can we improve? We need a way to evaluate how good this policy is. One lesson we learned from the optimization class is that we need to define an optimization objective to know what we are going to optimize. In this case, given the policy, we can sample a lot of trajectories from this distribution. And for each of these sampled trajectories, we know the actions, so we can compute exactly what reward we would get along the way.
So we know, OK, what is the expected reward for this trajectory? And then, as in optimization, we optimize this parameter, theta, step by step. As a side note, in practice we usually don't compute the expectation against this distribution exactly, because it is probably some really complex distribution that you cannot compute. So practically, people just sample from the distribution, depending on their computation power, to do the estimation.
You may also have some other computational ways to get a better estimate of the reward for a given policy. To optimize theta, one thing we can do is take a gradient step: we compute the gradient with respect to the parameter, theta, and take a step to optimize the parameter.
And then when we have a new theta, which probably moves a little bit more towards here, we can evaluate the policy again. And then again, we have an updated theta, and we can evaluate again. So we iteratively do this to improve the theta that parameterizes the policy.
So we may move from this initially very poor policy to one that is really optimized and takes a short path. This is the REINFORCE algorithm you may hear a lot of people talk about when they talk about reinforcement learning. And many deep learning approaches doing policy gradients or policy-based RL improve on top of the REINFORCE algorithm.
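As a sketch of what REINFORCE looks like in code, here is a minimal version assuming a classic Gym-style environment (reset()/step() interface) and a small, hypothetical PyTorch policy network with 4-dimensional observations and 2 discrete actions; it is an illustration of the idea above, not the exact code from the slides:

```python
import torch
import torch.nn as nn

# Hypothetical policy network pi_theta(a | s) for a small discrete-action task.
policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(),
                       nn.Linear(32, 2), nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma = 0.99

def run_episode(env):
    """Sample one trajectory with the current policy, keeping log-probs and rewards."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    return log_probs, rewards

def reinforce_update(log_probs, rewards):
    """One gradient step on theta: push up log-probs of actions, weighted by the return."""
    returns, G = [], 0.0
    for r in reversed(rewards):          # discounted return following each time step
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```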
And finally, model-based RL. There are many situations, like when you play Go or play video games, or some simple navigation tasks where you know the rules of the road. When these rules are known or easy to learn, we can actually learn a model of how the world behaves.
And to be exact, the model we want to learn is basically the transition from state to state, and what kind of reward we get in each state. And of course, you can parameterize each of these with another set of parameters, theta, so we can optimize them using a neural network or other approximators.
And learning this model actually becomes a supervised learning problem, because from all the experiences, if we want to learn the reward, it can be a regression from the data points to the value of the state. And then once we learn the world model, how can we select the policy based on the learned model?
One thing you can do is what we did before: compute the value function, the value of each state, and select the optimal action based on that, iteratively. Another thing many people do is, since we have the distribution, simply sample possible trajectories from the distribution and then score them to find the best one.
So for example, if you're in a state, you can expand what the possible futures could be, to give a tree like this. And then you can evaluate each terminal node to find the best path to get to the goal, and select it.
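Here is a minimal sketch of the second option, sampling rollouts from a learned model and scoring them. The transition and reward models below are toy placeholder functions standing in for models that would in practice be fitted to experience, as described above:

```python
import random

def model_step(state, action):
    """Placeholder learned transition model: predicts the next state."""
    return state + action             # toy 1-D dynamics, purely illustrative

def model_reward(state, action):
    """Placeholder learned reward model."""
    return -abs(state + action - 10)  # toy reward: get close to 10

ACTIONS = [-1, 0, 1]
gamma = 0.95

def rollout_return(state, horizon=10):
    """Sample one random rollout inside the model; return its discounted score."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = random.choice(ACTIONS)
        total += discount * model_reward(state, action)
        state = model_step(state, action)
        discount *= gamma
    return total

def plan_action(state, n_rollouts=100):
    """Pick the first action whose sampled rollouts score best on average."""
    best_action, best_score = None, float("-inf")
    for first_action in ACTIONS:
        scores = [model_reward(state, first_action)
                  + gamma * rollout_return(model_step(state, first_action))
                  for _ in range(n_rollouts)]
        avg = sum(scores) / len(scores)
        if avg > best_score:
            best_action, best_score = first_action, avg
    return best_action

print(plan_action(0))
```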
So these are the three types of reinforcement learning people talk about most of the time. And in all of these problems, we deal with the exploration and exploitation dilemma. For example, when you have a reinforcement learning agent trying to learn a policy in this game versus this game. Usually, we'll say, well, this one is very easy; you don't need to explore much and you can figure out the rules and how to play [INAUDIBLE].
But this one is probably not so easy, because you need to explore a lot more. For example, you need to explore to figure out what each of the sprites means, and, for example, to get to the key and open the door. This is also a composition of tasks, which is probably not so easy to explore and solve. And also, you may need to sacrifice some short-term returns in order to find something big in the future. For example, you may go down the path with the skull and lose some points, but you can get a key on the way.
This trade-off is between exploration and exploitation. Exploitation is using the best policy you already have and taking actions with the current information. And exploration is trying something you're probably not that familiar with, or haven't tried before, in order to collect more information and see if you can exploit it in the future to get something big.
And usually, there are some approaches to trade off between these two. Sometimes people just act greedily with some noise: every time, you have an epsilon probability of choosing something new to try out, to see what you can learn about the policy or the state transitions. Another thing people do is probability matching: instead of picking the action or policy with the highest probability, we sample from the probability distribution, so the options that are not the best still have a chance of being tried. A small sketch contrasting the two appears below.
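The following sketch contrasts epsilon-greedy selection with probability matching (here implemented as sampling from a softmax over the value estimates); the Q-value array is just a placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
q_values = np.array([1.0, 2.5, 0.5])   # placeholder Q estimates for 3 actions

def epsilon_greedy(q, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def probability_matching(q, temperature=1.0):
    """Sample actions in proportion to a softmax of the values, so non-greedy
    actions still get tried occasionally."""
    probs = np.exp(q / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(q), p=probs))
```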
Another important question about RL is sample efficiency. When you are learning, you are actually trying to explore, try things out, and get data and information to build the model or learn the policy. So the question is, how many samples do we need to learn a good policy? We know that in deep learning, you need a lot of samples to learn a good classification model.
And in RL it's probably even worse, because you need to explore more, and the learning signal is weaker because you sometimes only get the reward at the end. There are two different kinds of learners in RL. One is the on-policy learner and the other is the off-policy learner. On-policy is something like what we saw with the policy-based learner: whenever we change to a new policy, we need to generate new samples to evaluate whether the policy is good or bad. So you need to generate new samples every time you want to evaluate a policy.
And off-policy is something like when we want to build a model of the environment: you may just use whatever information you have, no matter if it's current or past, and use that to build a good transition model. So if you put it on a spectrum, some RL algorithms, like model-based ones, are more efficient in data and samples, and some methods, like policy gradients, need a lot of samples whenever they want to evaluate something new.
So you may have a question like, OK, if this is not so efficient in data usage, why do we still need methods like this? To deal with different kinds of problems, you need different reinforcement learning algorithms. For example, for some problems, like holding a cup, it is probably easier to learn the moving policy, OK, you need to approach the cup, move a joint, and make contact, rather than learning the whole dynamics or physics of the interaction between the cup and your hand.
So this is a brief discussion of the different types of RL algorithms. Of course, there are several model families we haven't talked about. And one thing people usually try is to combine different types of reinforcement learning algorithms together to get the best of both worlds.
One example is the actor-critic algorithm you may have heard of somewhere. It starts from the policy-based method, but instead of computing the exact reward for each policy, there is also a critic that learns the value function of the current policy. So it replaces this term with a Q-function or value function, and then when the actor decides to update the policy, it optimizes against that value, improving not just the policy parameters but also the value function.
Another point, when we talk about the different methods: basically, we are optimizing against the value function, the policy function, or the transition model and the reward model. And now, with deep learning, we can approximate each of these functions with a different network.
And one example you may have seen before is that if you have a Deep Q-Network, you can use it to play the Atari games and get a very good score across several games. And now we will have Xavier talk about what exactly is happening inside the Deep Q-Network.
XAVIER BOIX: Thank you. So as you can see, the landscape of reinforcement learning is quite complex. There are a ton of different options and variations and so on. We'll go through this particular example. This is a paper that was published in 2015 in Nature by DeepMind, and it started this big boom of reinforcement learning plus deep learning together. So we're going to go through the details so that you have in mind, or can see, one particular example of reinforcement learning.
Just a small recap: that's the slide we showed before Yen-Ling took over. Essentially, the value function is the expected reward; the Q-function is the expected reward given a particular action. Now we're going to work with the max: the goal of this algorithm is going to be to pick an action that maximizes the Q-function. So essentially, we want to find a policy that makes this Q-function maximal, and we are going to call this Q star.
So Q star comes from the policy that produces the maximum value; essentially, it is the maximum achievable value. If you have the best policy overall, that's the maximum achievable value you can have. And this policy, pi star, is essentially based on picking the action that maximizes Q star. So at every time step, if we are able to estimate this quantity here, we want to take the action that maximizes Q star.
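In symbols (standard notation for what was just said, not copied from the slide):

```latex
Q^*(s_t, a_t) = \max_{\pi} Q^{\pi}(s_t, a_t),
\qquad
\pi^*(s_t) = \arg\max_{a} Q^*(s_t, a)
```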
So how do we get Q star? First of all, we are going to work a bit more with the max, just doing some simple steps, and then we're going to see that we can approximate this Q star with a deep neural network. The whole business of DQN is going to be approximating this with a deep network. So this is the previous equation I showed; essentially, we want to find the policy that maximizes Q pi.
Just by operating on it, you get that you need to apply a maximum to this term and the value function, and this is essentially equal to this. I'll go quickly, because I think the important part is what I want to say later. Essentially, you arrive at this equation, which is called the Bellman equation. And what you can see here is that the optimal Q-function is also a recursive function, which depends on the Q-function at the next state.
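In its usual form, the Bellman equation referred to here is:

```latex
Q^*(s_t, a_t) = \mathbb{E}_{s_{t+1}}\!\left[\, r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\middle|\, s_t, a_t \right]
```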
So, Deep Q-learning. Here is where Deep Q-learning starts. That's the Bellman equation; there are many algorithms that are based on it, but what is particular about Deep Q-learning? Deep Q-learning says that we can approximate all of this right here with a new function, which is going to be a deep neural network.
So we are going to use the same strategies as you have seen in previous tutorials, where you have a deep network and you show examples to the network; we'll see how we do this later. The goal of the network is going to be approximating the Q-function, the expected reward.
In other terms, you can understand the Q-function in the following way. It's a function that takes the state as input, so you can enumerate all the states, all the possible situations in the game. And for each of these states, you can take an action, and for that action taken in that state, you might get a reward.
And the whole point here is to play the video game, interact with the environment, and collect examples for this table. We'll be getting examples. And with a deep neural network, we want to generalize across all of these states. During training, we'll be given some subset of this table, and at test time we'll be in a new situation we have never been in, and the deep neural network will hopefully be able to approximate, to generalize well, in that new situation, given the situations we have seen before. So f_w is a deep neural network, and w are the parameters of that deep network.
So we have this setup: we have a video game, we have a deep neural network, and the input to the deep neural network is going to be the last four frames, in order to capture the motion, essentially what we observe from the video game. And the output of the deep neural network will be a set of neurons, one for each different action.
So if you have a game in which you can move left and right, you have two outputs here. Essentially, you will map a set of frames to a value for each action. And once you have these values, you will pick the best one, the one that is maximum, and that one will be applied in the video game, and then you will get new frames. From these new frames, you will do this again, over and over.
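A rough sketch of that mapping, assuming a stack of four grayscale 84x84 frames and a small convolutional network (the exact architecture in the Nature paper differs; this is only to show the input and output shapes):

```python
import torch
import torch.nn as nn

N_ACTIONS = 2  # e.g. move left / move right

# Q-network: input is a stack of 4 frames, output is one Q-value per action.
q_net = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),   # -> 16 x 20 x 20
    nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # -> 32 x 9 x 9
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
    nn.Linear(256, N_ACTIONS),
)

frames = torch.zeros(1, 4, 84, 84)      # a batch with one observation (4 stacked frames)
q_values = q_net(frames)                # shape (1, N_ACTIONS)
action = q_values.argmax(dim=1).item()  # act greedily: pick the highest predicted Q-value
```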
Now the whole question is, OK, how do I train this neural network? What exactly is the data? What exactly is the loss function that I want to optimize? And this guy, Watkins, in '89, proved that if you use this loss function, if your goal is to minimize this, and I will explain why in a moment, if the neural network minimizes this squared loss here, then under some assumptions you will converge to the optimal Q-function. And what does this mean here?
First of all, what is important to note here is that this is the received reward. So this is a situation where the agent has played, has taken an action, and has just received a reward. This term is the value approximated by the neural network at the new state that appeared after playing the action, and this is the neural network's value before playing the action. So we have all of these terms: the received reward, the neural network at the new state, and the neural network before the action. Essentially, it is just a matter of minimizing these terms.
So why this function? Why this terrible equation? It's not so terrible if we look a bit more carefully. If we go back to what it means mathematically, these are the previous equations that I showed before. Again, this is approximated by the neural network, so the reward here is approximated by the neural network. And what you can see is that there is this term inside: this is equal to this, and inside here there is this term, and this term is equal to this.
So essentially, this one and this one cancel; there is a minus sign here. This expectation doesn't apply, because at this point we know the next state, since we just played, so we have the state. So what we end up with, when you do this minus this, is the predicted reward according to the neural network. So essentially, what you are minimizing here, what this equation is doing, is minimizing the difference between the real reward you just got and the predicted reward the neural network thinks it's going to get.
When you calculate this, that's essentially what you are doing, and this makes a lot of sense: you want the reward the neural network thinks it's going to get to be similar to the reward you actually just got. This strategy is also known as minimizing the temporal difference error. So essentially, when we do backpropagation, this is what we are going to minimize. Yeah.
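Put together, the loss being described (before the practical modifications mentioned below) is the squared temporal difference error:

```latex
L(w) = \Big( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; w) \; - \; Q(s_t, a_t; w) \Big)^2
```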
AUDIENCE: [INAUDIBLE]. We're always trying to minimize this difference, even though the reward might be-- the real reward might be much higher than the predicted reward?
XAVIER BOIX: Well, this is the reward-- what do you mean, much higher than the predicted reward? Oh, yeah. I think there are some strategies, too. In practice, as you will see in the practical case, that's not exactly what is minimized: in order to handle outliers, there are some non-linearities applied here. But conceptually, that's what we are going to do.
Later, you will see in practice that this is modified in order to avoid outliers in the case you just mentioned, when maybe the real reward is very, very different from the predicted reward. But just for understanding, that's the theoretical goal. Then, in practice, there are many hacks that are applied, and in the code you will see later, that's one of the cases.
Just to insist one more time: these Q-functions here are neural networks, so we get all of this from the neural network. And in the DeepMind Nature paper, all of this was not invented there; that's not new. In 2015, all of this was already known. There are two novelties in that paper that make this algorithm really shine in the video game setup.
The first novelty is the following. They introduce two neural networks: one is the target one, which they call w minus in the paper, and one is the actual one that you are doing backprop on and updating at every iteration. Essentially, the target one is, again, a neural network, essentially the same one, but they don't update it every time you do an update through backprop. They update it every n time steps.
And they argue in the paper that when they do this, they observe that the algorithm becomes more stable. So the argument for why they introduce the change is more an empirical argument: they found that this tends to work well. So there are two networks, one updated every time you get some sequence, the other updated every n time steps.
Another important component is what they call experience replay. Instead of training only with the most recent experience from the last games, where all the sequences you train on might be very similar to each other, they store in memory all the sequences the algorithm has been playing.
And then, during training, they sample some of those sequences. In this way, you have a varied set of sequences, which are different from each other; you have data with more variety and you break the correlations between data points. They also observed that this produces a big improvement in the algorithm. And one more detail: they also use epsilon-greedy exploration. This means that when the algorithm is playing, sometimes it doesn't follow the policy given by the network; with a probability epsilon, it just chooses an action randomly in order to explore the space better.
Now, we'll go into the demos and the hands-on, and you can see how this algorithm works in practice.
YEN-LING KUO: So for the video, yesterday I implemented DQN and had it play Pong. This is my agent on the left. So I can show you how it looks now. Sorry. How can I move? I don't know how to move to the screen. Sorry.
All right, let me play it again. Yeah, it is quite impressive, but it took a lot of training time. Yesterday, it took about eight hours to get this kind of performance. So you will need some time to train a reinforcement learning model, because you need to explore and collect the data.
So the example we are going to have you work on is not this one, because it would take too much time to see the result. Instead, we'll have you try a simpler cart-pole example. It is a classical reinforcement learning example, and it looks like this. You have a cart you can control; you can go left or right. And there is a pendulum on top, so if you don't move, it will just fall. So you need to counterbalance it to keep it upright.
Your goal is just to keep it up as long as you can. And we can actually implement a Deep Q-Network to perform this task. This Colab example I took from the PyTorch website. How can I make it bigger? Is it big enough? Oh, too big.
All right, it's the right size. So in this tutorial, we'll go through how to implement a replay memory and the Deep Q-Network, what the optimization method is, and how to train the network. And before that, because we are running in Colab, there are a lot of things you need to do to set up the environment. But you can just ignore this cell and jump to the tutorial first.
The first thing is, we are using OpenAI Gym. There are several Atari games and several other examples you can play around with, so if you want to try reinforcement learning on other examples, it's very easy to use. Here we are starting with the CartPole example.
And to implement the replay memory, you can actually consider it as just an array of recent state and action transitions. For the memory, we implement a few methods. One is to push a new state or new experience into the memory; but because the memory has a fixed capacity, when it is full, the oldest memory gets overwritten.
And then, at every training step, we also want to sample from the memory to feed into our training process, so there is a sample method here as well. And for the Deep Q-Network, we can skip most of this because we just talked about it. Concretely, it is a stack of convolutional layers: for example, here we have three convolutional layers, each followed by batch normalization to have better regularization. And we are going to use this to approximate the Q-values.
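A minimal sketch of such a replay memory, close in spirit to the one in the PyTorch Colab (the Transition field names here are assumptions, not necessarily the ones in the notebook):

```python
import random
from collections import deque, namedtuple

# One stored step of experience; the field names are illustrative.
Transition = namedtuple("Transition", ("state", "action", "next_state", "reward"))

class ReplayMemory:
    """Fixed-capacity buffer; when full, the oldest transitions are dropped."""
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def push(self, *args):
        """Store a new transition."""
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        """Draw a random batch to decorrelate the training data."""
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```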
And then there are some inputs we want to extract from the environment. The first thing is the position of the cart, because you want to know what the observation is and what the result of your action is. And we also want to take a screenshot to know the current game state, so we take the image from the game environment as we follow the actions. This is one example of what you get from the screen capture.
There are some utilities here, because at each step we want to sample an action. So you need to implement this sampling function, and like we just discussed, there is an epsilon threshold: with a small probability you will try to explore other actions rather than choosing from your current policy from the Q-network. A sketch of this function follows.
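A sketch of that sampling function, assuming an exponentially decaying epsilon threshold and a Q-network like the one sketched earlier; the exact constants and schedule in the Colab may differ:

```python
import math
import random
import torch

EPS_START, EPS_END, EPS_DECAY = 0.9, 0.05, 200
steps_done = 0

def select_action(state, policy_net, n_actions):
    """Epsilon-greedy selection with a decaying exploration threshold."""
    global steps_done
    eps = EPS_END + (EPS_START - EPS_END) * math.exp(-steps_done / EPS_DECAY)
    steps_done += 1
    if random.random() < eps:
        # Explore: pick a random action.
        return torch.tensor([[random.randrange(n_actions)]], dtype=torch.long)
    with torch.no_grad():
        # Exploit: greedy action under the current Q-network.
        return policy_net(state).argmax(dim=1, keepdim=True)
```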
And in the training loop, one thing to note is that we are actually maintaining two networks, as Xavier just said. One is the policy net and the other is the target net. We update the target network only every few steps, to make the training process more stable.
And then finally, this is the combination of all of these components. For each episode, you need to reset the environment, take a picture of what you have in the environment along with the cart position, then sample an action, follow it, evaluate using the network, and do the optimization.
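A compressed sketch of the optimization step with the two networks, roughly in the style of the PyTorch tutorial. It reuses the Transition/ReplayMemory sketch above, assumes each stored field is a tensor with a leading batch dimension of 1, and omits terminal-state handling for brevity:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99
BATCH_SIZE = 32

def optimize_model(memory, policy_net, target_net, optimizer):
    """One DQN update step; TD targets come from the periodically frozen target network."""
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    batch = Transition(*zip(*transitions))           # Transition from the replay-memory sketch
    states = torch.cat(batch.state)
    actions = torch.cat(batch.action)                # shape (B, 1), long
    rewards = torch.cat(batch.reward).unsqueeze(1)   # shape (B, 1)
    next_states = torch.cat(batch.next_state)

    # Q(s_t, a_t) predicted by the network being trained.
    q_values = policy_net(states).gather(1, actions)

    # TD target: r_t + gamma * max_a' Q_target(s_{t+1}, a'), with the target net frozen.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1, keepdim=True).values
    targets = rewards + GAMMA * next_q

    loss = F.smooth_l1_loss(q_values, targets)       # Huber loss, to soften outliers
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every few episodes, copy the trained weights into the frozen target network:
# target_net.load_state_dict(policy_net.state_dict())
```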
If you run the training process for not so many episodes, you can use a script to visualize what your cart-pole looks like, and this is what I got with just 50 episodes. You can try playing with different parameters and a more complicated network to see how you can solve the problem. And as an exercise, you can pick whatever problem you're interested in from OpenAI Gym, or you can pick one like Mountain Car.
Mountain Car is very similar to the cart-pole example. You control a car, but the goal is to go up the hill to reach the goal at the top. And there is some momentum involved in how much you want to accelerate the car, which you need to control.
So with this, I think we can summarize with a few discussion topics. People are always asking why RL agents are so successful and in what kinds of cases they work well. So far, they work well in many domains where we know the rules, or there are simple rules or dynamics: for example, games, different types of video games, or simple environments.
Another one is that you can actually learn motor skills from raw sensory input, but you need to give it enough data. By enough, I mean, let's take this OpenAI dexterity example that just came out several weeks ago. If you haven't checked this blog post, you can read it after today, and you will understand.
The goal is, they built this robot arm that can move an object and control the object's pose. The robot is given a goal pose for the object, and it needs to match the pose of the object in its hand with the target pose it was given.
To do this, the robot needs to explore to learn how to control the fingers and how to turn the object over. It is very easy for humans, but in this case, they actually trained with about 6,000 CPU cores and 8 GPUs to collect around 100 years' worth of data in order to do the task. But the result is quite impressive, and we can take a look.
Yeah. So this is the goal pose of the object, and this is the robot; they trained in simulation and then turn the object to match the one shown as the goal. And you can have the robot do it over and over again, because in simulation it never gets tired. You can just keep asking it to do it for 100 years, and then you get pretty good results.
AUDIENCE: Did they just swap out the cube?
YEN-LING KUO: Hm?
AUDIENCE: Did they just swap the cube?
YEN-LING KUO: Yeah, they just swapped the cube. Yeah. I think one important thing this work demonstrates is that they train in simulation, so everything is learned from simulation, and then they bring the model to run on the real robot with the real object. But of course, in the simulation, they also built a network to predict the object pose and to learn the policy based on the object pose they have right now.
AUDIENCE: So we're looking at the actual robot now, right?
YEN-LING KUO: Yeah, this is the actual robot, but they trained in simulation.
AUDIENCE: But you didn't have to re-train on the robot afterwards, or was it already [INAUDIBLE]?
YEN-LING KUO: One thing you can do is something like fine-tuning. Usually, people do their training in simulation, where you have simpler rules to simulate the environment. Then you bring it to the real robot and fine-tune on real-world noise so the robot adapts to those noises.
AUDIENCE: [INAUDIBLE].
YEN-LING KUO: Yeah.
AUDIENCE: So we have to simplify [INAUDIBLE].
YEN-LING KUO: Sorry, I didn't get--
AUDIENCE: [INAUDIBLE].
YEN-LING KUO: Yeah, I think it's a little one. Yeah. So finally, there are still some unsolved challenges that are really interesting to a lot of people in the reinforcement learning community. For example, the first is that we know humans usually learn very quickly on new tasks, but most RL agents are slow or need a lot of data, so most of the time we can only do it in simulation.
Second, as humans, we can use previous knowledge to make learning faster in the future, and we don't really have a good idea how to do that; there are some proposed solutions, but in general, no good way to use past knowledge. Another thing is transfer to other tasks, or transfer from simulation to the real environment.
Another thing is how to define the reward: we always take the reward function and reward values as something given that we know. But actually, we don't have a good answer for how to define a reward function, and sometimes there isn't a very good reward function for a task. Another is composition of tasks: you need to combine different actions in order to achieve larger goals, and currently not many RL agents can handle complex, compositional tasks.
And there is also a lot of safety discussion about reinforcement learning agents, because the agent explores to learn the policy, so there's no control over what kind of policy it learns and whether that policy will help people or hurt people. There are several research topics working on this right now. And there is also the relationship between reinforcement learning agents and neuroscience or biology, including these different types of learning paradigms and how they are actually implemented in the brain.
So people are interested in figuring out how these relate to the brain and to neuroscience or cognitive science. With that, that's all of our tutorial on reinforcement learning. You can stay to finish the DQN exercise, and we will be here to help you if you have any questions. Or let us know if you have any questions, and we will finish here.