Tutorial: Deep Learning (1:07:33)
August 12, 2018
All Captioned Videos Brains, Minds and Machines Summer Course 2018
Eugenio Piasini, UPenn & Yen-Ling Kuo, MIT
Overview of supervised learning with neural networks, convolutional neural networks for object recognition, and recurrent neural networks, with a brief introduction to other deep learning models such as auto-encoders and generative models.
EUGENIO PIASINI: And it's given by me and Yen-Ling. We're going to be doing an introduction to some ideas in deep learning. So the roadmap for this tutorial is to have a look at general concepts of supervised learning with neural nets. Then we're going to switch to presenting a basic introduction to convolutional neural networks. We're going to talk about recurrent nets, and we're going to finish with an overview of some other interesting deep learning models.
So supervised learning. I think most of you are familiar with the concept of supervised learning, but since you haven't had the machine learning tutorial yet, let's go over the basic idea first. So say that you are given some example input-output pairs, x and y. So in this case it would be-- say that x is the position of dots on this square, and y is the color of the dots. Say blue is one, red is zero.
What you want to learn is to predict the association between x and y. So if I give you a new point x, is this going to be blue or red? So the general idea is that you probably want to learn some sort of rule that allows you to map from the x, so the position, to the color, the y, such as this kind of coloring of the square.
As you might be familiar with, there are many different approaches from statistical learning, machine learning, et cetera, that deal with supervised learning. Some of the names in this space are various types of regression. So like logistic regression, support vector machines, decision trees, and neural networks, et cetera. So today, we're going to focus on obviously, the last element of this list.
So I'm going to start with the simplest possible artificial neuron that can solve the type of problem that I just described here. So the simple perceptron. The simple perceptron is something that is mathematically described by this expression here, which is the linear threshold unit, which is something that was presented already by [INAUDIBLE] in the '40s as a first initial model of how a neuron might work, as composed of some inputs that could correspond to the synaptic inputs coming in on [INAUDIBLE], a linear integration, and a nonlinearity applied to it.
So in this case, you would have the neuron that performs a linear combination of the inputs, possibly with a bias term, and then applies some sort of nonlinearity, which is represented here by g. In this case, for the problem that we're showing here, the nonlinearity would be like a step function. So zero if the argument is negative, or one if the argument is positive. This particular functional form can learn to perform this type of discrimination.
So in fact, in this case, you would have that if your vector, w, of your weights is this vector here, the simple perceptron will learn to assign the value one to the dots in the blue area of the plot, and the value zero to the dots in the red area of the plot, thus solving our initial problem. Rosenblatt already in the late '50s and '60s showed that if you have a problem like this, you can train a simple perceptron to solve it by applying a learning rule that he expressed like this, where eta is some sort of learning rate parameter, t is your target value, and y is the current output of the neuron.
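As a rough illustration of the learning rule just described, here is a minimal sketch in plain Python. The slide's exact formula is not in the transcript, so the update `w <- w + eta * (t - y) * x` is the standard textbook form and is an assumption here, as are the toy data points, the starting weights, and the names `step`, `predict`, and `train_perceptron`.

```python
def step(z):
    """Step nonlinearity g: 1 if the argument is positive, otherwise 0."""
    return 1 if z > 0 else 0


def predict(w, b, x):
    """Simple perceptron: linear combination of the inputs plus a bias, then step."""
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_perceptron(data, eta=0.1, max_epochs=1000):
    """Apply w <- w + eta * (t - y) * x until every point is classified correctly."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in data:
            y = predict(w, b, x)
            if y != t:
                mistakes += 1
                # (t - y) is +1 or -1 on a mistake and 0 otherwise.
                w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]
                b += eta * (t - y)
        if mistakes == 0:
            break
    return w, b


# Hypothetical toy data: position (x1, x2) -> colour (blue = 1, red = 0).
# Points are blue when x1 + x2 > 1, so the two classes are linearly separable.
data = [((0.0, 0.0), 0), ((0.2, 0.3), 0), ((0.4, 0.1), 0),
        ((1.0, 1.0), 1), ((0.9, 0.6), 1), ((0.3, 1.2), 1)]

w, b = train_perceptron(data)
predictions = [predict(w, b, x) for x, _ in data]
print(predictions)
```

Because the toy classes can be split by a straight line, the loop stops once a full pass produces no mistakes.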
But this leads us to talking about linear separability, because the example we just presented is kind of ideally suited to the very simple neuron that we were discussing. Because the different classes that we want to distinguish can be separated by some sort of linear boundary. In fact, it turns out that this type of neuron can only learn to solve this particular class of problems. As soon as you have a problem which is more complicated-- like the one that we present here, where there is no single linear boundary that divides the blue from the red-- the perceptron fails miserably at distinguishing them.
Maybe one perceptron can learn this boundary here. And then another perceptron can learn this other boundary here. But really, if you want to distinguish these four quadrants, what you have to do is to kind of take the outputs of the two perceptrons and put them together. You want to see if this perceptron is happy and this perceptron is also happy, maybe you're in this area. If they're both sad, you're here, and combinations thereof.
So this suggests the fact that we can solve more complex problems than linearly separable ones just by combining multiple units across multiple layers, as we have schematized here. This is precisely what is done in what we call a multilayer perceptron, which is just an arrangement of many simple perceptrons in several layers arranged in a feedforward architecture. So a bit of terminology. We call this the input layer, composed of our input data. This is the output layer. And the things that we have in the middle, we call them hidden layers. That's why we call them h. So this is the first neuron of the second hidden layer and so on.
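To make the four-quadrant idea concrete, here is a small sketch with hand-picked weights (nothing is learned). It assumes the blue quadrants are the two where both coordinates have the same sign, which the transcript does not spell out. All names and thresholds are hypothetical. Two units detect the two boundaries, and further units combine their outputs.

```python
def step(z):
    """Step nonlinearity: 1 if the argument is positive, otherwise 0."""
    return 1 if z > 0 else 0


def four_quadrant_net(x1, x2):
    # First layer: one perceptron per linear boundary.
    right = step(x1)   # "happy" when the point is right of the vertical axis
    upper = step(x2)   # "happy" when the point is above the horizontal axis

    # Second layer: combine the two outputs.
    both_happy = step(right + upper - 1.5)   # fires only if both are 1
    both_sad = step(-right - upper + 0.5)    # fires only if both are 0

    # Output unit: blue (1) if either combination fired.
    return step(both_happy + both_sad - 0.5)


points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [four_quadrant_net(x1, x2) for x1, x2 in points]
print(labels)
```

No single step unit on `(x1, x2)` can produce this labelling, but three layers of the same simple unit can.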
So the way in which these multilayer perceptrons work is totally not surprising given the symbolic representation. So each of these units is just a simple perceptron, and just computes its activation from its inputs. So the units in the first layer just compute their activations from all the inputs from the input layer. And the units in the second layer use the activations of the first layer as their inputs, and so on and so forth until we get to the end. One might have several of these output units. Here, I have represented only one for simplicity.
This procedure of passing information forward from the input layer through the hidden layers and over to the output layer is called forward propagation. Because as you can see, information kind of flows forwards from the input to the output. So one cool thing about multilayer perceptrons in general is that they are universal function approximators. That means that given enough simple units, you can use them to approximate arbitrarily well any function that is reasonably well behaved.
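Forward propagation as described above can be sketched in a few lines. The layer sizes, the weight values, and the choice of `tanh` as the nonlinearity are illustrative assumptions, and `layer_forward` and `forward_propagate` are hypothetical helper names.

```python
import math


def layer_forward(weights, biases, inputs, g=math.tanh):
    """Each unit applies g to a linear combination of the previous layer's activations."""
    return [g(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward_propagate(layers, x):
    """Return the activations of every layer, starting with the input itself."""
    activations = [list(x)]
    for weights, biases in layers:
        activations.append(layer_forward(weights, biases, activations[-1]))
    return activations


# Arbitrary example weights: one row of weights per unit in the layer.
layers = [
    ([[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]], [0.0, 0.1, -0.1]),  # 2 inputs -> 3 units (h1)
    ([[0.3, -0.6, 0.9], [0.4, 0.4, -0.2]], [0.05, 0.0]),         # 3 units -> 2 units (h2)
    ([[1.0, -1.0]], [0.0]),                                       # 2 units -> 1 output
]

activations = forward_propagate(layers, [1.0, -2.0])
y = activations[-1][0]
print([len(a) for a in activations], y)
```

Keeping every layer's activations around, rather than only the output, is what later makes the backward pass cheap.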
Of course, under certain assumptions, as well. You're kind of free to pick your nonlinearity from a very broad class of nonlinearities, but it needs to be a nonlinearity. For instance, you can try to show-- it's a simple exercise-- what happens if g is a linear operation. And you will see that the expressive power of this network essentially collapses down to the expressive power of a simple perceptron.
So we have seen this very nice property, that multilayer perceptrons with, say, one or two hidden layers can represent any function you throw at them. Then why are we talking about deep learning at all? Why aren't we just happy about fitting simple shallow networks with one hidden layer? So there are two problems.
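The exercise mentioned here can be checked numerically. In this sketch g is the identity and biases are left out for brevity (both are simplifying assumptions), and the integer weights are arbitrary. Two stacked linear layers then give the same output as one layer whose weight matrix is the product of the two.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


W1 = [[1, 2], [0, -1], [3, 1]]   # 2 inputs -> 3 hidden units
W2 = [[2, -1, 1]]                # 3 hidden units -> 1 output
x = [4, -3]

two_layer = matvec(W2, matvec(W1, x))   # hidden layer with linear g
one_layer = matvec(matmul(W2, W1), x)   # single equivalent linear layer
print(two_layer, one_layer)
```

The hidden layer adds nothing here: whatever the two-layer linear network computes, a single linear unit computes as well.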
One is that this is just an existence proof. It just tells us there is some network out there that can perform what we want. But it doesn't tell us anything about the fact that the number of required units for implementing this function should be reasonably small, or achievably small. So a problem of expressivity of your network. And on the other hand, we have no guarantee that actually we have a way of computing this network. So we know that maybe it exists in a mathematical sense, but we have no way of deriving it.
So this is something that leads us in the direction of deep networks, which in fact, have two interesting properties. So there is like two arguments to be made in favor of structuring your network with multiple layers. So the first is statistical. Deep networks-- if you imagine having a network with many layers computing one from the output of the previous one-- are compositional. In the sense that they compute features, and then features of features, and then features of features of features.
So this is naturally well suited if you think about the fact that in our world, perhaps much of the interesting data that you have, and much of the interesting statistical structure that you might be looking at, has similar compositional properties. So for instance, in vision, you have edges. And you can compose edges to form simple shapes. And you can compose simple shapes to form more complex shapes, et cetera, et cetera. So in this sense, the architecture of deep networks can reflect something that is going on in our world. And I believe perhaps next segment might say something about that in his talk later during the school.
And the other reason for which deep nets are useful is also computational. Under certain conditions, you can show that deep architectures are more expressive. So for a given number of hidden units that you have for assembling your network, you can basically learn more patterns if you arranged them in a hierarchical way. So we have seen why we care about networks with many layers. And in the previous lecture, I assume you have learned about how you optimize things.
And you have learned that for optimizing things, it's good to do-- gradient descent is a good approach. So say we have a particular network, and we can define some loss for our supervised learning problem. We want generally to compute the derivative of this loss with respect to the parameters of this example because we want to do gradient descent on it. So the classic way in which we do this computation is called back propagation.
Back propagation is really a mathematical trick. It's an application of the chain rule. And it's just a way of computing derivatives in a simple and manageable way across an arbitrarily complex graph of computations defined by your network. The two key intuitions that support back propagations are the following. Not intuitions, facts.
The first is that if you want to compute the derivative of the loss of your network with respect to, say, the weights that parameterize a specific hidden unit, you know that the loss will depend on these weights only through the activation of that unit. The second key fact is that the derivative of the loss with respect to that activation will only depend on the activations of the units that are downstream from the first one.
But OK, let's see how it works in practice. So remember, you want to compute all the derivatives like d of l over the w's, where the w's are inside these nodes here. So you start by computing your first-- to compute the derivative of the loss with respect to the weights in the third layer here, so the weights are inside here. You can just do it very simply by applying once the chain rule. You just derive the loss with respect to the output value, and then you derive the output value with respect to the weights, and you're done. Because that's just one step.
Then the interesting thing is that when you keep applying the chain rule backwards and backwards to compute derivatives with respect to the units in the previous layers. You see that basically, you can express the derivative of the loss with respect to the activations in the second layer as the product of the derivative of the loss with respect to the third layer by the derivative of the activation of the third layer with respect to the activation of the second layer.
And, again, exactly as above. Every time that we have the derivative of the loss with respect to the activation in one layer, then we can directly plug it in, and an expression like that immediately gives us what we care about, so the derivative of the loss with respect to the weights. Again, for the second layer, the derivative of the loss with respect to the weights is, again, just the derivative of the loss with respect to the activations in that layer multiplied by the derivative of the activations with respect to the weights.
So you can keep doing this over and over again. And the cool thing is that as long as you keep these values here that we call the errors, so the derivatives of the loss with respect to the activations, as you can see here with respect to y, and h2, and h1. As long as you kind of propagate them back through the network as you're taking these derivatives, you can always compute these quantities in a fully local way. You see here. You only need this quantity that came from the layer above. And this is an entirely local quantity. It's something that only has to do with the relationship between the activation in the second layer and the activation in the first layer, as you can see here with these blue lines.
So yeah, this is essentially just a convenient and principled way of computing all of these derivatives, which is what you care about for gradient descent, through an arbitrarily complex graph of forward computations. All right, we can actually make a more concrete example. Just to give you a better idea.
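The recursion just described can be written compactly. The slides are not part of the transcript, so the notation below is an assumption: $h^{(l)}_i$ is the activation of unit $i$ in layer $l$, $w^{(l)}_{ik}$ is the weight from unit $k$ in layer $l-1$ to unit $i$ in layer $l$, and $L$ is the loss.

```latex
% Errors are propagated backwards from layer l to layer l-1:
\frac{\partial L}{\partial h^{(l-1)}_k}
  = \sum_i \frac{\partial L}{\partial h^{(l)}_i}\,
           \frac{\partial h^{(l)}_i}{\partial h^{(l-1)}_k}

% Once the error at layer l is known, the weight gradient is local:
\frac{\partial L}{\partial w^{(l)}_{ik}}
  = \frac{\partial L}{\partial h^{(l)}_i}\,
    \frac{\partial h^{(l)}_i}{\partial w^{(l)}_{ik}}
```

The first factor in each product comes from the layer above. The second depends only on the two layers involved.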
Say that we have a quadratic loss. That is something that looks like that. That's something that you define. Say that your nonlinearity is a hyperbolic tangent. So this is just kind of a translated sigmoid. If you just plug this in into this machinery here, you will see that you can compute all the numbers you care for. So because of the definition of this way in which activations are coming from the previous layers, you can compute directly the derivatives of, say, activation in layer L with respect to activation layer L minus 1.
And this is just given by this expression. Remember that the derivative of the hyperbolic tangent is just 1 minus the square of the hyperbolic tangent itself. So you get 1 minus the square of the activation, multiplied by the derivative of what's inside the hyperbolic tangent, so you have this w that comes out. And the same thing symmetrically, when you take the derivative of this with respect to these weights, w_ik. You get, again, this term 1 minus the square, multiplied by the activation in the previous layer.
So you have these expressions here. You can just plug them in. And you can see that when you take the first step, you take the derivative of the loss with respect to the output. And it's just y minus t. And you can see why they're calling this thing the error that we're backpropagating. I mean, this is really an error term.
And then you can just go backwards. This is a number now. Because say that, for instance, for a particular data point, your target value was one for the classification, and your output value was one. This number here would be zero. If instead it was one and zero, this would be, say, minus one, something like that. So this is a number now. And then you can plug this number into this other expression. You can compute this because these are also numbers.
Because previously, when you had forward propagated the information from the input to the outputs, you have computed you know what is y, and what is the activation of the previous layer. So these are all numbers, and you can compute this is a concrete number that you can plug in into your gradient descent procedure. And you can just do that over and over until you have everything that you need to take a step in your gradient descent. Yes.
EUGENIO PIASINI: Sorry, I'm going to come closer and then I'm going to repeat the question.
AUDIENCE: Sorry, I just want to make sure that I understand conceptually what's happening. I don't know if everyone can hear me. So because you're getting an error term at the output that adjusts the weights right before it. Because those weights need to adjust, the weights before that also adjust to give you--
EUGENIO PIASINI: Yeah. Yeah, it's basically if you write down. You're just computing derivatives here, right? I mean, leave alone the idea of adjusting weights. So you just want to compute the derivative of this with respect to that. And then if you think about the chain rule of the derivative, the derivative of essentially in spirit. The derivative of this with respect to that would be the product of the derivative of this with respect to that. And then you're going to have this with respect to that, et cetera, et cetera.
And that's why you get all these terms that propagate back, basically. Because you see that this derivative here will appear. This term here will appear down here, et cetera. And then you will take this one and plug it here. Cool. Any other question? Yeah. OK, if you want to, for fun, you can derive the gradient with respect to the bias terms b that I have ignored until now, if you want to try to see how this works.
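The tanh and quadratic-loss example worked through above can be sketched as follows. This is an illustration under assumptions, not the slide's exact setup. The network is a tiny 2-2-1 net with arbitrary weights, the loss is taken to be `0.5 * (y - t) ** 2` so that its derivative with respect to the output is `y - t`, and biases are ignored as in the talk. A finite-difference estimate is included as a sanity check on one weight.

```python
import math


def forward(params, x):
    """Forward pass: tanh hidden layer, tanh output unit, no biases."""
    W1, W2 = params
    h = [math.tanh(sum(w * xk for w, xk in zip(row, x))) for row in W1]
    y = math.tanh(sum(w * hi for w, hi in zip(W2, h)))
    return h, y


def loss(params, x, t):
    _, y = forward(params, x)
    return 0.5 * (y - t) ** 2


def backprop(params, x, t):
    """Gradients of the loss with respect to W1 and W2 via the chain rule."""
    W1, W2 = params
    h, y = forward(params, x)
    dL_dy = y - t                                   # the "error" at the output
    delta_out = dL_dy * (1 - y ** 2)                # through the output tanh
    grad_W2 = [delta_out * hi for hi in h]          # times previous activation
    dL_dh = [delta_out * w for w in W2]             # error propagated to hidden layer
    grad_W1 = [[dL_dh[i] * (1 - h[i] ** 2) * xk for xk in x]
               for i in range(len(h))]
    return grad_W1, grad_W2


def numerical_grad_W1(params, x, t, i, k, eps=1e-6):
    """Central finite difference for a single entry of W1."""
    W1, W2 = params
    plus = [row[:] for row in W1]
    minus = [row[:] for row in W1]
    plus[i][k] += eps
    minus[i][k] -= eps
    return (loss((plus, W2), x, t) - loss((minus, W2), x, t)) / (2 * eps)


params = ([[0.5, -0.4], [0.3, 0.8]], [0.7, -0.2])   # arbitrary example weights
x, t = [1.0, 2.0], 1.0

grad_W1, grad_W2 = backprop(params, x, t)
check = numerical_grad_W1(params, x, t, 0, 1)
print(grad_W1[0][1], check)
```

Every quantity used in the backward pass (`y`, `h`, `x`) is a number already computed during the forward pass, which is the point made in the talk.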
OK, so this was the last slide of our multilayer perceptron stuff. But I wanted to conclude it with just a couple of citations here, just for historical perspective, talking about cycles of hype. So this was an excerpt from The New York Times in 1958 talking about the simple perceptron. "The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself, and be conscious of its existence." Dr. Frank Rosenblatt, a research psychologist at the Cornell Aeronautical Laboratory in Buffalo, said perceptrons might be fired to the planets as mechanical space explorers.
So this is for the simple perceptron that does linearly separable problems. So of course, people got very excited. And then 10 years later, essentially-- I'm not going to read through this, but-- the community recognized that there were several limitations to these devices. And if you've ever heard of the first or second AI winters, this was the first AI winter, which happened when people realized that it was hard to train multilayer perceptrons without knowing about back propagation. And people were not really sure about whether they were ever going to be useful for anything.
So now, let's switch gears and say something about conv nets. So in general, if we think about the problems in vision, say, object detection and object recognition, the traditional approach in the computer vision community was to essentially spend a lot of time and effort from very smart people to devise useful ways of reducing the dimensionality of images, and representing them with some features that would be useful for the purposes of, say, image recognition. Or, say, the problem that is illustrated in this picture is, oh, find the toy train that is hidden in this picture.
Actually, there's two of them. One is here and one is there. And there is a complicated algorithm to compute features from these pictures that allows you to say, oh, look, this actually matches that, even though it's distorted, it's rotated, it's partially occluded. And you can do that if you spend a lot of time thinking carefully about how you're going to represent your images. Also, parts-based models were something that was used. So the idea was to really put a lot of thought into how to represent your input images.
But in deep learning, we're lazy, and we don't want to spend time doing that. We want our machines to automatically learn useful representations of our input data. In particular, in vision, the way we're doing this comes from an analogy with neuroscience. In particular, the thing we're thinking about is the work by Hubel and Wiesel. And of all their incredible and fundamental work, the main ideas that we care about today are, first, the idea that in visual cortex, connections and activities are organized topographically. That is to say that activities of neurons in visual cortex retain to some extent the spatial organization of the stimuli impinging on the retina.
And on the other hand, that there is a hierarchical organization of different types of cells. You could think of simple cells as being essentially a linear combination of whatever output comes out of appropriately arranged cells from the LGN. And then you have complex cells, which are some more complicated combination of outputs of simple cells.
So these two ideas are what we care about today. And already in the early '80s, there is this work by Fukushima with a network called Neocognitron, schematized here. It was an awesome name, by the way. He took direct inspiration from the work of Hubel and Wiesel, and implemented this network where the idea of a hierarchical organization of different types of cells is represented by the fact that you have several layers on top of each other. And each layer is connecting to the next layer in a feedforward fashion.
And he called these layers simple layer, complex layer, simple, complex, simple, complex, which would correspond in modern terminology to convolutional, pooling, convolutional, pooling, et cetera. We will see what that means. And the other idea about the topographic organization of the connections is that these connections across layers have some element of spatial locality. So there is this idea of basically performing the same operation over and over again across multiple locations to go from one layer to the next.
But this is just to give the intuition for the connection to neuroscience. Let's see what the ingredients of a canonical modern convolutional network are. So this is the recipe, which we're now going to discuss in detail. But basically, the ingredients are these. You need to be able to perform convolutions, to apply nonlinearities, to do something called pooling, and to have fully connected layers.
Now, fully connected layers are basically what we have discussed earlier, with the multilayer perceptron. It's the same thing. And now, we're going to go into a bit of detail concerning the other three ingredients to see what they are, basically. And the way in which these ingredients are arranged is schematized by this figure here from the paper by Yann LeCun in 1998. As you can see here, there is a convolution, pooling, convolution, nonlinearity, pooling, convolution, nonlinearity, pooling, fully connected output.
So what is a convolution? Intuitively, the idea of a convolution is taking, say, some small filter and applying it over and over again at all possible positions of an image. And taking the output of that as the output of your operation. So an example of this type of operation would be blurring an image by replacing each pixel in the image by an average of its neighbors.
So how will you do that? This is a mathematical form of what I just explained. If this is your image. So these would represent numbers, and these colors would represent grayscale values. Essentially, you convolve your input with this filter here. What that means is that you can imagine taking this filter, which is just a bunch of ones divided by nine. So it's like 1/9, 1/9, 1/9, 1/9, et cetera. Kind of superimpose it to your input image, and then basically take the dot product between all these elements, and the elements in the picture.
Because these elements are all the same, it just takes an average across all these pixels. So when you put it here on top of this area, the values in the original image are all zero. So you're taking average of a bunch of values that are all zeros, so the output is just a zero. There should be a zero in here.
Then when you move forward, you're applying now your convolution to the next position to the right. And you are taking an average across eight pixels that have zero value, and one pixel that has a value of 90, so the result is 10, and so on. So when you move again, now the average here is 20.
And in the same way, you can see. When you get to, say, computing the average for this point, you can see that the average should be now 90 because all these elements are 90. So as you apply this operation sliding this filter all over an image. You can see this is your final output. So this is your smoother version of your input image. Is it more or less clear? OK.
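The blurring example can be reproduced with a short sketch. The slide's image is not in the transcript, so the 7 by 7 image below (zeros with a block of 90s) is a hypothetical stand-in chosen to give the averages 0, 10, 20 and 90 mentioned in the talk. `convolve2d_valid` is an illustrative helper that only visits positions where the filter fits entirely inside the image.

```python
def convolve2d_valid(image, kernel):
    """Slide a square kernel over the image and take the dot product at each position.

    The kernel is not flipped, which makes no difference for a symmetric kernel
    such as the box filter used here.
    """
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(rows - k + 1):
        out_row = []
        for c in range(cols - k + 1):
            out_row.append(sum(kernel[i][j] * image[r + i][c + j]
                               for i in range(k) for j in range(k)))
        out.append(out_row)
    return out


# Hypothetical image: zeros everywhere except a block of 90s in one corner.
image = [[90 if r >= 3 and c >= 3 else 0 for c in range(7)] for r in range(7)]

# 3 x 3 box filter: nine entries, each equal to 1/9, so the output is a local average.
box = [[1 / 9] * 3 for _ in range(3)]

blurred = convolve2d_valid(image, box)
for row in blurred:
    print([round(v) for v in row])
```

The output is 5 by 5 rather than 7 by 7, which is the border effect discussed next.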
So one tricky thing that you can notice about this is that the output image actually has this border here. So the real output image is just this part. It doesn't really have the same size as the input image. And this is just purely due to geometry. Because there is no way of taking this filter here, and applying it to every possible position of the input image in such a way that the filter doesn't kind of spill over the borders of the original image.
So because of just this geometrical constraint, unless you're padding your input image with some values, extending it somehow, your output size will generally change according to this formula, which, if you spend a minute thinking about it, you can convince yourself is true. The size of the output will be given essentially by one plus the linear size of your input minus the size of your kernel, or your filter, divided by the stride.
So I haven't said what the stride is. The stride is just the size of the step you take in the input for every step you take on the output. So here, the stride is one. Because every time we move by one in the output, we are also moving by one in the input. If the stride was two, when we move by one here in the output, we would be basically skipping a pixel. So we would be going, say, from here to here. So it's just a measure of how far you jump on the input every time you move one step on the output. In our example, anyway, this is one, so it doesn't really matter in this formula.
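The output-size formula can be written as a one-line helper. The function name is hypothetical, and rounding down when the stride does not divide evenly is an assumption that the talk does not address.

```python
def conv_output_size(input_size, kernel_size, stride=1):
    """Output size without padding: 1 + (input - kernel) / stride, rounded down."""
    return 1 + (input_size - kernel_size) // stride


print(conv_output_size(10, 3))            # 3 x 3 filter on a 10-pixel-wide image
print(conv_output_size(32, 5))            # 5 x 5 filter on a 32-pixel-wide image
print(conv_output_size(4, 2, stride=2))   # 2 x 2 window moved in steps of two
```

The same helper applies independently to the height and the width of the input.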
Another thing about convolutions is that we have seen this example that was done on just grayscale images. But in general, when you have an image, you have-- most images, you will have three channels because RGB, it's a color image. And also, if you're applying a convolution in the middle of a neural net, you will have an arbitrary number of channels in general.
So your filters actually, they're not just two dimensional things, but they are three dimensional. And they have also a depth dimension. And this depth dimension must match the depth dimension of your input layer. So for instance, in this case where we have an image which is 32 by 32, and, say, three channels for RGB. Our convolution, the operation, what it does is it convolves a full slice of this input with a 5 by 5, say, by three filter, and it outputs just one single value.
So it collapses the depth of the input onto just one value. So imagine sliding this thing all over the image. You will get that OK, according to the formula that we just saw in the previous slide. You will get something which is 28 by 28 by 1. Because every possible position of this filter gives you just one output value.
And then what you can do is you can have multiple filters. So, say, if these are feature detectors, you might imagine looking for different types of features. And so you change the type of filter. You recompute the whole thing. You get another 28 by 28 by 1 bunch of numbers. And you just stack it on top of the previous one. And then you do it again, and then you do it again.
So what happens in the end is that the depth of the output of your convolutional layer will be equal to the number of filters you're applying. So you have one element in this depth direction for each different filters. And you call each of these colored things, you call them feature maps. So each of them corresponds to a representation of your input seen through the particular filter that you are applying at that point.
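A sketch of a convolutional layer with several filters, to show where the output depth comes from. The 32 by 32 by 3 input and the 5 by 5 by 3 filters follow the numbers in the talk. The choice of four filters, the random values, and the name `conv_layer` are assumptions for illustration.

```python
import random


def conv_layer(volume, filters):
    """Apply each filter over the full depth of the input and stack the results.

    volume is indexed as volume[row][col][channel]; each filter is indexed as
    filt[i][j][channel] and must have the same depth as the input.
    """
    height, width, depth = len(volume), len(volume[0]), len(volume[0][0])
    feature_maps = []
    for filt in filters:
        k = len(filt)
        fmap = [[sum(filt[i][j][ch] * volume[r + i][c + j][ch]
                     for i in range(k) for j in range(k) for ch in range(depth))
                 for c in range(width - k + 1)]
                for r in range(height - k + 1)]
        feature_maps.append(fmap)   # one feature map per filter
    return feature_maps


rng = random.Random(0)
image = [[[rng.random() for _ in range(3)] for _ in range(32)] for _ in range(32)]
filters = [[[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]
            for _ in range(5)]
           for _ in range(4)]

feature_maps = conv_layer(image, filters)
shape = (len(feature_maps[0]), len(feature_maps[0][0]), len(feature_maps))
print(shape)
```

Each filter collapses the three input channels to a single number per position, so the output depth equals the number of filters, not the input depth.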
So how this might look with a simple example is that, say that you have one filter, which is just kind of like a horizontal stretching. And the output would be something like that. And then you might have, say, another filter, which is just vertical kind of stretching operation. And the output might be looking something like that. And you might [AUDIO OUT] stacking these images. And this way, so you can have multiple filters and stacking the outputs to stack in the feature maps.
So something that is very good about convolutions is that as we have seen, the dependency. So the output of your convolutions preserve-- because the dependencies are all local, they preserve somehow the spatial structure of your input. You can kind of still see that pixels that are nearby here are related to pixels that are nearby there.
And at the same time, the fact that we apply the same operation over and over again means that we have relatively fewer parameters to learn, as opposed to, say, having a fully connected network. Because we now only need to know what's in the filter. And then the filter is the same for all positions.
So to make this a little bit more concrete, take for example, an image. Say, if you have an image which is 200 by 200, grayscale, say. And then you want to map this thing to a fully connected layer with, say, 40,000 hidden units. That would amount to about two billion parameters. Whereas if you have the same image, but you map it to a convolutional layer which, by appropriately choosing the parameters in a reasonable way, also has 40,000 hidden units, you would get that the number of parameters would be only 4 million. So it's a dramatic difference.
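The arithmetic behind these numbers can be sketched as below. The talk does not state the filter size; a 10 by 10 receptive field per hidden unit is assumed here because it reproduces the 4 million figure when each unit keeps its own local weights. The last line shows what full weight sharing with a single filter would give under the same assumption.

```python
image_pixels = 200 * 200          # grayscale image, one value per pixel
hidden_units = 40_000

# Fully connected: every hidden unit has one weight per input pixel.
fully_connected = image_pixels * hidden_units

# Local connectivity (assumed 10 x 10 receptive field): one weight per pixel in the patch.
receptive_field = 10 * 10
locally_connected = hidden_units * receptive_field

# Sharing the same 10 x 10 filter across all positions leaves just the filter itself.
shared_single_filter = receptive_field

print(fully_connected, locally_connected, shared_single_filter)
```

Under these assumptions the fully connected count is 1.6 billion, in line with the "about two billion" quoted in the talk. Bias terms are ignored throughout.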
So this is one nice thing, because we have way fewer parameters to reason about. And the other thing is that actually, this way of reducing the number of parameters makes sense, especially if you think about vision. Because if you think of these filters being, again, like some sort of feature detector. To a very rough first approximation, it kind of makes sense to think that the type of features that you're looking at, say, in the top left corner of your image, might be kind of similar to the type of features you're looking for in your bottom right corner.
To higher order, you can think about why this might be different given the type of images you have. But in general, you can assume some sort of stationarity of your data in that sense. So it is a very clever thing to do, to just share parameters across different positions. So that was for the convolutional layers.
The next ingredient of a conv net-- or really, any artificial neural network-- is you need to choose an appropriate nonlinearity to put just after your convolutional layers. And in this case, what we use generally is the so-called ReLU. The ReLU unit, which is a rectified linear unit, which has this activation form. So if you remember earlier in the multilayer perceptron example, I had mentioned the hyperbolic tangent, which would be something like that. I can [INAUDIBLE] nonlinearity.
The advantage of using a ReLU instead is that it doesn't saturate for very high values of the input, which means that your gradient never goes to zero when you have very strong values of your input. And in general, you probably want to avoid having zero values of the gradients. Because if you go back and look at what we have written for back propagation, if at some point you have a derivative which is zero, it basically kills all the other derivatives that come before that in the computational graph. So that basically means that you're not learning, so we like gradients.
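A quick numerical comparison of the two nonlinearities and their derivatives. The function names are illustrative. Note that the ReLU derivative is zero for negative inputs, so the advantage described here concerns large positive inputs.

```python
import math


def relu(z):
    """Rectified linear unit: keeps positive values, sets negative ones to zero."""
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 for negative inputs."""
    return 1.0 if z > 0 else 0.0


def tanh_grad(z):
    """Derivative of tanh: 1 minus the square of tanh."""
    return 1.0 - math.tanh(z) ** 2


for z in [0.5, 2.0, 10.0]:
    print(z, relu_grad(z), tanh_grad(z))
```

For a large input such as 10, the tanh derivative is vanishingly small while the ReLU derivative is still 1, so the factor that multiplies the backpropagated error does not shrink.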
And this is the effect, just to visualize what it does. Say if this is a picture where black points are negative and white are positive. It only keeps the positive ones. So the final operation we need to implement components is pooling So pooling is a coarsening operation. What is usually done is a very popular thing to do is, say, max pooling is a popular way of performing pooling.
In this particular example, this is done with filters that are two by two and stride two. Let's see how it works. So the fact that the filters are two by two. It means that we are looking at one area of the input, which is a two by two. And then we take the maximum of that, and we discard all the other value. So we take this red area. We look oh, the maximum is six. And then in our output, we write six.
Then if you remember the definition of what a stride is, to jump by one to the right in the output, we jump by two to the right in the input. So we go here where the green square is. So there is no overlap in this case between the red and the green. And we perform the same operation. We discard the seven, the two, and the four. We keep the eight, and we write it down. And we do the same for the other two quadrants.
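The pooling walk-through above can be sketched in a few lines of plain Python. Only the top two quadrants follow the talk (a maximum of six in the red area; seven, two, four, and eight in the green area). The remaining input values and the function name `max_pool_2x2` are assumptions made for illustration.

```python
def max_pool_2x2(image):
    """Max pooling with 2x2 windows and stride 2 on a list-of-lists image."""
    rows, cols = len(image), len(image[0])
    output = []
    for r in range(0, rows - 1, 2):          # jump by two rows in the input
        out_row = []
        for c in range(0, cols - 1, 2):      # jump by two columns in the input
            window = [image[r][c], image[r][c + 1],
                      image[r + 1][c], image[r + 1][c + 1]]
            out_row.append(max(window))      # keep the maximum, discard the rest
        output.append(out_row)
    return output

image = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],   # bottom half: illustrative values only
    [1, 2, 3, 4],
]
pooled = max_pool_2x2(image)
print(pooled)
```

With stride two and two-by-two windows, each input value falls into exactly one window, so the four-by-four input shrinks to a two-by-two output.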
So this is a way of shrinking down your inputs, and it has two advantages. One is that it reduces the size of the representations in the layer that follows. This makes your data more manageable. It reduces the number of activations and parameters that you have later on. And at the same time, it introduces some invariance to small translations. Because if you move these values a little, say you move this thing up, the maximum is always going to be eight for a small shift of these pixels. So this introduces some invariance.
OK, and just to wrap it up. When we have finished assembling our convolutional neural network using the ingredients that we have just discussed, then we can just train it. We define a loss, and we train it by gradient descent and back propagation, as we have seen for the multilayer perceptron. So just before I finish, a very quick historical mention.
So these are the key evolutionary steps that brought us to the modern idea of convolutional networks. Of course, there are many more, but these are just some that I picked. So the first is the idea of taking inspiration from neuroscience to build the Neocognitron. So the idea of this convolutional structure, of applying the same operation over and over again across space by using local connectivity, and alternating with layers that effectively perform pooling.
Then 20 years later, the LeNet by Yann LeCun, which starts to show the structure of modern convolutional networks and, very importantly, uses the idea of supervised learning with back propagation and gradient descent to learn the weights. So that was a key advance. And then really, this is the prototypical fully modern convolutional network, AlexNet. It was published 14 years later and has essentially the same structure as this other one, the main difference being scale.
So what happened between 1998 and 2012? Essentially, digital photography and the internet happened. That meant that people were able to download massive amounts of pictures from the internet and curate these huge training data sets using tools such as Amazon Mechanical Turk. And that was the data that was needed to actually train much larger networks, together with dramatically increased availability of compute, and in particular, the availability of general purpose GPU computing.
So these were the key steps to arriving at the modern conception of a convnet. OK, then just to finish. We have been talking about this in the context of image classification. But of course, these tools are very useful, as you all know. We can use convolutional structures for many, many things.
This is an example of image retrieval, where you just take the internal representation of an image that is computed by the network. And you use it as a representation to compute Euclidean distances between images, to say something like, oh, give me pictures that look like this. And this is what you get. Other applications, of course, include object detection, image segmentation, captioning, and so on and so forth. So I think this is it for the convnet part.
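A minimal sketch of the retrieval idea mentioned above: rank stored images by the Euclidean distance between their feature vectors and the query's feature vector. The three-dimensional vectors, the image names, and the helper `retrieve` are hypothetical stand-ins for representations that a trained network would compute.

```python
import math

def euclidean(a, b):
    # Euclidean distance between two feature vectors of equal length.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query, gallery, k=2):
    """Return the names of the k gallery items closest to the query."""
    ranked = sorted(gallery, key=lambda name: euclidean(query, gallery[name]))
    return ranked[:k]

# Hypothetical internal representations (in practice, taken from a late layer).
gallery = {
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.9, 0.8],
}
query = [1.0, 0.0, 0.0]
nearest = retrieve(query, gallery)
print(nearest)
```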
YEN-LING KUO: All right, so far we have been talking about information that is spatially located, represented in 2D x-y dimensions. When we talk about recurrent neural networks, we are talking about sequential information processing. Why do we care about this? Because there are actually several applications where the data are not just two dimensional or three dimensional. We are talking about time series.
For example, when we do things like natural language processing, we are looking at sequences of sentences or paragraphs. Or speech: speech signals are laid out over time. And of course, when you are watching videos, the actions in them also roll out over time. And when we do captioning, you describe the videos according to how the actions happen over time. And of course, if you are doing biology, there are a lot of examples, like protein sequences, or molecular structures or activations.
I know some people probably come from a physics background. So one view a lot of people take when dealing with sequential information processing comes from classical dynamics. We usually describe the world as a state. So for example, the state of an object is its position, velocity, or acceleration. And then we can apply some dynamics.
For example, we know classical Newtonian dynamics. We apply it to the state, and we see what the new position, velocity, and acceleration of the object are. So basically, we just do this over and over and see the new observations. And of course, it's not just a closed system.
So you can also think about some external force applied to the system, to the dynamics, so you can make more interesting changes. Another view looks at this from the data generation point of view. Yesterday, we did the probability tutorial. So you can think that when you have this sequence, there is actually some hidden state underlying the data generation process. One common example is in speech processing: there are hidden states that describe how your mouth moves to generate the phones that you listen to.
So there are transition probabilities to move between these hidden states, and also emission probabilities to generate the data you are looking at. So the hidden states are organized along time, and the emissions generate the observations. So when we talk about recurrent neural networks, I would like to take them as a pretty general form to describe and process these sequences.
It has a really generic formula like this. You take an input, and then we apply the recurrent formula, which could be the [INAUDIBLE] neural network or functions to do the recurrent processing. And when we apply it, it basically takes the old state and gives you the new hidden state. And then you can get the output.
So what is very different from the views we saw before is that now the state consists only of the hidden vector h. So there's no explicit description of the state, like the position or velocity, or a category of the state. It is just an arbitrary vector. But we can think of it as a summary of the information accumulated up to the time we are looking at.
So the general form is just what I talked about. You take the old state and the input at time t, and you output the new state with the updated information. And of course, when you want an observation, you can apply another neural network to read out the value, to figure out the transformation from the state to the value you care about. So we can take a more concrete example.
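The general form just described (old state plus input gives new state, then a readout gives the output) can be sketched as follows. This is one simple choice among many, not the form used on the slides: a tanh update with small hand-picked weight matrices `W_h`, `W_x`, and `W_y`, all of which are illustrative assumptions.

```python
import math

def rnn_step(h_old, x, W_h, W_x):
    """One recurrent update: h_new = tanh(W_h h_old + W_x x)."""
    h_new = []
    for i in range(len(h_old)):
        total = sum(W_h[i][j] * h_old[j] for j in range(len(h_old)))
        total += sum(W_x[i][j] * x[j] for j in range(len(x)))
        h_new.append(math.tanh(total))
    return h_new

def readout(h, W_y):
    """Linear readout from the hidden state to the output value(s)."""
    return [sum(W_y[i][j] * h[j] for j in range(len(h)))
            for i in range(len(W_y))]

# Illustrative weights: hidden size 2, input size 1, output size 1.
W_h = [[0.5, -0.3], [0.2, 0.1]]
W_x = [[1.0], [-1.0]]
W_y = [[1.0, 1.0]]

h = [0.0, 0.0]                      # initial hidden state: all zeros
outputs = []
for x in ([1.0], [0.5], [-1.0]):    # the same weights are reused at every step
    h = rnn_step(h, x, W_h, W_x)
    outputs.append(readout(h, W_y))
print(outputs)
```

The same `rnn_step` is applied at every position, so the loop works unchanged for sequences of any length.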
I hope this is true: I like this course. This is the kind of sentence people work with in natural language processing. And they usually want to tag each word with, for example, its part of speech, or sentiment, or how to pronounce the word. So for a neural network, we can first do a text embedding to represent each word as a feature vector. And then initially we have a hidden state. But of course, at the beginning, we know nothing about the sequence or the word. So we can just initialize the hidden state as an all-zero vector.
And then to do the processing, we feed the input and the hidden state into a neural network, which gives you the updated hidden state. And then we can apply another fully connected layer to predict whatever labels we are interested in. And then we can do it over and over again, unrolling it over the input to generate all the predictions we care about. And as I said, there are some labels, like the part of speech. We want to say, OK, this is a pronoun. This is a verb. This is a determiner, and this is a noun.
So to do this, we can have an evaluation function that compares the predictions with the labels to compute a loss. Here we are taking advantage of the fact that it's the same processing at every step. So instead of just looking at one loss, we accumulate all the losses together. And then when we do back propagation, the gradients propagate through all the units over time.
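A small sketch of accumulating the loss over time steps for the part-of-speech example above. The talk only mentions an evaluation function; cross-entropy is assumed here as one common choice, and the predicted probabilities are made-up numbers for illustration.

```python
import math

def cross_entropy(probs, target_index):
    # Negative log-probability assigned to the correct label.
    return -math.log(probs[target_index])

tags = ["pronoun", "verb", "determiner", "noun"]

# Hypothetical per-word predictions for "I like this course",
# one probability distribution over the four tags per time step.
predictions = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.1, 0.2],
    [0.2, 0.1, 0.5, 0.2],
    [0.1, 0.1, 0.1, 0.7],
]
targets = [0, 1, 2, 3]   # indices into tags: pronoun, verb, determiner, noun

step_losses = [cross_entropy(p, t) for p, t in zip(predictions, targets)]
total_loss = sum(step_losses)   # one scalar that is back-propagated through time
print(step_losses, total_loss)
```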
So this is a very general and very good property for learning. Rather than learning the network at each time step as a separate function, we learn a single one. This gives us the ability to process sequences of different lengths and different time durations. So take an analogy with the convolutional network we just talked about. In the convolutional network, we talked about filters, so we can do parameter sharing across local patches.
And similarly, in a recurrent network, we're also doing this parameter sharing. But it is sharing the parameters along time. So the recurrent formula is like the filter we use in 2D space. And as I said, we also accumulate the loss, so there is more information. And it can also generalize better to different kinds of sequences.
We talked about a lot of good things you can do with recurrent neural networks. But one very big problem people face when training recurrent networks is the vanishing gradient problem. So take this very simple three layer recurrent net, for example. For the inputs, you apply the weights and the recurrent formula over and over again, three times, to get a loss.
So we know how to do it for the very last layer. You just take the loss, and you want to update the weights at the third layer. And you just use the chain rule to take the derivative of the loss with respect to the recurrent formula and the weights. And this is easy. But the problem happens when we go back in time, like 10 time steps or 20 time steps ago. Then we're actually unrolling this derivative over and over.
So you can see it expands really quickly, and it could be chained up to 50 or 70 times. You can imagine: if this number is larger than one, when you do several multiplications, it just becomes a giant value. And when you take a gradient step with this giant value, you basically see the weights just jumping from here to there, here, there. We basically learn nothing from the gradients.
And another problem is if the value is very small. When the value is very small and you chain it up, it becomes an even smaller number, close to zero. So basically, when you update the network, the weights are basically flat and don't move anymore. So you can think of it as the information not really flowing along the network. So it doesn't really take advantage of the sequence and learn the dependency between the information at time 100 and time 10, for example.
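The effect of chaining the same factor many times can be seen with plain arithmetic. In this sketch, a single scalar factor stands in for the per-step derivative; the values 0.9 and 1.1 are arbitrary illustrative choices.

```python
def repeated_product(factor, steps):
    # Multiply the same per-step derivative `steps` times, as the chain rule
    # does when a gradient is propagated back through `steps` time steps.
    result = 1.0
    for _ in range(steps):
        result *= factor
    return result

for factor in (0.9, 1.1):
    for steps in (3, 10, 50):
        print(f"factor={factor}  steps={steps:2d}  "
              f"product={repeated_product(factor, steps):.4g}")
```

A factor only slightly below one shrinks toward zero over 50 steps, while a factor only slightly above one grows past one hundred.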
So this makes it hard to learn long dependencies. But sometimes we really want long dependencies, for example, when we talk about sentences or paragraphs. A pronoun two sentences from now may depend on the one I just talked about. So there are several approaches people have discovered to fix this problem. For gradient explosion, the fix is very simple. We just clip the gradient to a maximum value, so it doesn't explode. So I won't talk too much about that.
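A minimal sketch of gradient clipping. The talk only says the gradient is clipped to a maximum value; rescaling by the gradient's norm, shown here, is one common variant and is an assumption, as are the function name and the threshold.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)            # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]  # same direction, bounded length

exploding = [30.0, 40.0]             # norm 50
clipped = clip_by_norm(exploding, max_norm=5.0)
print(clipped)
```

Clipping by norm keeps the direction of the update and only limits its size.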
And for vanishing gradients, where the information is basically not flowing along time, people have been designing new neural network architectures that keep the gradients in the cell, in the network itself. So when we do the update, we can keep information from a few time steps ago and still update with the gradients, instead of getting zero all the time. One very famous architecture, which you have probably also heard about, is the long short-term memory. So basically, they are introducing memory into the neural network.
And there are other variations that prove to have very similar performance and are easier to train, such as gated recurrent units. Let's take a simple look at this long short-term memory. What they do is introduce gates to decide whether to let the information flow through. So in addition to the hidden state I just talked about, which summarizes the information so far, they introduce another state called the cell state.
So the cell state basically decides, OK, what information do I want to keep, and how much information will I release. They have three different gates to control this cell state. The first one is the forget gate. When an input gets into the cell, the forget unit compares the input and the cell state to decide which pieces of information we want to forget because they are irrelevant to what I'm looking at right now.
And then they have the input gate, which selects which elements of the input we want to use to update the current cell. And then finally, the output gate decides which part of the information we want to output. And this has nicely allowed the neural network to train even when the sequence is very long. So it was successfully used first in things like speech and translation. And then it kept improving, and other people tried to simplify it a little bit. That is what I mentioned, the gated recurrent unit.
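The three gates described above can be sketched with a scalar LSTM cell. This follows the standard textbook formulation, in which each gate looks at the current input and the previous hidden state; that choice, the parameter layout, and the numbers are assumptions made for illustration, not the version on the slides.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_old, c_old, p):
    """One step of a scalar LSTM cell; p maps a name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_old + b

    f = sigmoid(pre("forget"))               # how much old cell state to keep
    i = sigmoid(pre("input"))                # how much new content to write
    o = sigmoid(pre("output"))               # how much cell state to expose
    candidate = math.tanh(pre("candidate"))  # proposed new content
    c_new = f * c_old + i * candidate        # updated cell state (the memory)
    h_new = o * math.tanh(c_new)             # updated hidden state
    return h_new, c_new

# Illustrative parameters: forget gate nearly open, input gate nearly closed,
# so the cell should hold on to what it already stores.
params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}
h, c = 0.0, 1.0
for _ in range(100):
    h, c = lstm_step(0.5, h, c, params)
print(h, c)
```

With these gate settings the cell state is still close to its starting value after 100 steps, which is the kind of long-range memory the gates are meant to allow.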
So when you are doing projects, you will probably choose between different cells and compare their performance. I would say the recurrent network is a flexible way to assemble different architectures to deal with your sequential data. What we talked about is the vanilla recurrent network, where you have an input and map it to the output. But actually, you can be very creative in dealing with sequences of inputs and outputs.
So for example, in image captioning, you just have one image as input, but you output a sequence of words to describe the object or action in the image. You can do it inversely. For example, you have a sequence of words or sentences, and at the end you just want to output one classification label to say whether the sentence is happy or sad. And of course, when you are using Google Translate, they are also doing this kind of sequence mapping.
But the translation between different languages is not an exact mapping. So they have a many-to-many mapping, but they may skip some of the words or rearrange the words between the input and the output. And of course, the example we just looked at maps each word to its label, like the part of speech. And we can go on and on with different examples. Especially when we talk about actions and reinforcement learning, people like to use this kind of sequence model to generate actions or motor controls.
We have been talking about recurrent networks and convolutional nets. In the training paradigm we have talked about so far, we have a label, and we want to make a prediction that matches this label. That is supervised learning. But actually, there are some different kinds of learning we want to look at. In particular, we want to learn in an unsupervised way, where we don't have labels, but we still want to learn good representations, or good features, of the input.
One thing I want to talk about is the auto-encoder. Yann LeCun had this idea in the 1980s. When we have this input, if we don't know the label, how can we know that our network is learning something that describes the data? One trivial idea, well, not that trivial, is to have the network learn to reconstruct the image. And the auto-encoder actually does it in two steps.
So first, there is the encoder step, which encodes the input image into a learned representation, a low dimensional vector. And then we have a decoder network to decode what is encoded in the representation back into the original image. And then for training, we're trying to figure out how to update the weights to minimize this reconstruction loss. So mathematically, this will be the formula to generate the representation, use it to reconstruct the image, and then compare against your input.
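The encode, decode, and compare steps can be written compactly. The symbols below are assumptions introduced for illustration ($f_\theta$ for the encoder, $g_\phi$ for the decoder, $z$ for the learned representation), and the squared error is one common choice of reconstruction loss, not necessarily the one on the slide.

```latex
z = f_\theta(x), \qquad
\hat{x} = g_\phi(z), \qquad
\mathcal{L}(\theta, \phi) = \lVert x - g_\phi(f_\theta(x)) \rVert^2
```

Training adjusts $\theta$ and $\phi$ to make $\mathcal{L}$ small on the training inputs, with no labels involved.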
So in this way, we're actually learning the representation in the middle of the structure. And the thinking is that if we learn a good representation, it keeps the information well enough to regenerate whatever input we are given. There is some question about, OK, what is this learned representation about? So there are several views.
So first, some people may think about it from the linear algebra point of view. They can think about this as manifolds, high dimensional manifolds. Or if you are familiar with PCA, they could be the reduced dimensions in the PCA space. But there is actually another way of looking at this problem. Some people in generative modeling think about this as the latent variables that generate the data. To make it more concrete, we can look at an example from computer vision and graphics.
When you know this kind of latent variable, like color, shape, and position, you can regenerate the shapes and objects in the scene. So when we observe data, the goal of learning is actually to uncover the distribution of the data so we can resample and regenerate it. And one very desirable property, and the reason we want this generative model, is that once we have a model, we can do a lot of interesting things with it.
For example, you can sample a new data point that you have never seen in the data. Or when you get a new instance, you can check whether it can be described using these variables, or comes from a similar distribution. You can evaluate the likelihood of the data. Or more importantly, when people are talking about representation, you can [INAUDIBLE] the latent features from the network, using the latent variables to describe the model and the data.
But there is a really big problem. This all sounds so nice when we talk about intuition, but it is a problem to [INAUDIBLE] the model and the distribution. Because it is really hard to compute the posterior exactly, especially for distributions that are really complex. And it is hard because we need to sample through all possible latent values and then marginalize over the latent space, which makes it intractable.
So there are some ideas people have proposed in recent years. One is the variational autoencoder. Going back to the generative model we just described, you can think that there is a latent variable, and there is a decoder network to generate the observed data. But it is really hard to compute the distribution. So one idea is that, instead of computing the exact distribution of the latent variable, we approximate it with a much simpler and tractable distribution. It could be a Gaussian, or some other easier distribution that we can manage.
And then the goal is to learn this approximate inference network from the data, and then use it to reconstruct the image. So the learning objective now has two parts. First, we still have a reconstruction error that measures the difference between the input and the reconstruction. But there is another term that measures how close this approximate distribution, given by the approximate network, is to the real distribution of the latent variables.
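The two-part objective can be written in the form commonly used for variational autoencoders. The notation is an assumption made for illustration: $q_\phi(z \mid x)$ is the approximate inference network, $p_\theta(x \mid z)$ is the decoder, and $p(z)$ is a simple prior such as a standard Gaussian.

```latex
\mathcal{L}(\theta, \phi; x) =
  \underbrace{-\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction error}}
  + \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{distance to the prior}}
```

This loss is related to the true posterior through the identity $\log p_\theta(x) = -\mathcal{L}(\theta, \phi; x) + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$, so for a fixed decoder, lowering $\mathcal{L}$ brings the approximate distribution closer to the real distribution of the latent variables.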
So this allows us to train the model to generate the data. When you are only looking at this part of the network, we are doing inference from the data to figure out which latent variables are important. And when you are using this other part of the network, we are using the learned generative model to generate the data and evaluate the observations.
Of course, there's also another formulation you may already have heard about many times in the media. Instead of formulating the problem in terms of probability distributions, with an exact likelihood and posterior, another idea is adversarial training. Adversarial training gives you an implicit generative model. Instead of saying which distribution we want to approximate, it models the problem as a minimax game.
In addition to a generator, they introduce a discriminator to guide how the distribution changes. The discriminator wants to distinguish whether the data are the original data or the generated data. Are they real or fake samples? And for the generator, the goal is basically to generate fake data that is close to the original distribution, to fool the discriminator. So take this example they have in the paper.
The black dots are the original data distribution. And we can initialize the discriminator and generator randomly at the beginning. When learning starts, the discriminator classifies whether the data points are fake or real. And then the generator takes the feedback, takes the gradient, and learns to get closer to the original data distribution. And then they do it over and over again until convergence.
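The minimax game between the two networks is usually written as below, following the original generative adversarial networks paper. Here $D(x)$ is the discriminator's estimated probability that $x$ is a real sample, $G(z)$ is the generator's output for a noise input $z$, and $p_z$ is the noise distribution; the notation is added here for illustration and is not read off the slides.

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator pushes this value up by classifying real and fake samples correctly, while the generator pushes it down by producing samples the discriminator accepts as real.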
And the final learning goal for the generator is to maximize the loss of the discriminator. Because by that time, you basically cannot tell what comes from the generative model and what comes from the original data distribution. So this is just some idea of how people think about learning from data in an unsupervised way, or from a different perspective, like generative models, which allows us to do more tasks. And of course, this is just a quick, high level introduction to different kinds of networks. We are only scratching the surface.
So feel free to read the original papers, or some of the links we included in the slides. And that is all for today's tutorial. Before we take questions: we will have a hands-on session tomorrow, and we will do PyTorch tutorials on all the topics we talked about today, well, except the generative models. And if you want to run that on your computer, you can install PyTorch and Jupyter Notebook.
But if you don't want to install it on your computer, or it takes too much time to install, don't bother. We are going to run it using Google Colab. So you can run the Jupyter Notebook and install everything from there. We will have instructions for that tomorrow.
EUGENIO PIASINI: As long as you have a Google account.
YEN-LING KUO: Yeah, so if you don't want to install anything, make sure you have your Google account, so you can do all the examples we give you. So that's all. Any questions? All right.