Tutorial: Linear Algebra (48:39)
August 11, 2018
August 11, 2018
All Captioned Videos Brains, Minds and Machines Summer Course 2018
Andrzej Banburski, MIT
Introduction to concepts from linear algebra needed to understand Principal Components Analysis (PCA): vectors, matrices, matrix multiplication and other operations, data transformations, solving linear equations, and PCA.
Download the tutorial slides (PDF)
ANDRZEJ BANBURSKI: So the point of this tutorial is that we want to get to something called the Principle Component Analysis. And if you know everything that you need to know about PCA, then I would suggest just step out. Because this is going to be worse than what you know.
But I'm going to start from the very, very basics and tell you what vectors are, what matrices are, and so on. So a vector is a collection of numbers, n tuple for n dimensional vectors. For example in two dimensions, a vector is just going to be two points and you think of it as an arrow. But it's really a collection of two numbers or an array of numbers. And it's a one dimensional array.
And then, there is an obvious generalization. The matrices which are two dimensional arrays of numbers, basically a collection of numbers put in some kind of pattern. Before we get ahead and talk about products or anything like that, it's just some array that has some kind of structure to it. And you can think of that each column here is an n dimensional vector. Or each row is also, in this case, three dimensional vector. And we'll get to how we can relate the columns and rows in a second.
So for matrices, there is this operation called, or even a for vectors, there is this operation called the transpose. And the transposer matrix swaps rows and columns. So you can see for yourself. You'll basically sweep it this way.
When we talk about matrices, one very specific table matrices that is useful to think about is a symmetric matrix for which the transpose and itself are equal. And that means that the off-diagonal entries are basically equal to each other. There are some other examples of matrices below that are symmetric.
But so far, we've just been talking about these collections of numbers that have no meaning whatsoever. And the meaning that you get is by deciding what's the operations that you can do with these. And for matrices, the thing that really defines them is the matrix multiplication, is this fact that you take one matrix, you multiply by another matrix, and you do so by this operation where if you want to find what is the output of a matrix in this space, you multiply a-11 and this b-11 plus a-12 times b-21 and so on.
In general, if you have an m by p dimensional matrix, you can only multiply it by a matrix that has the first dimension equal to the second dimension of the first matrix. And the output of that is going to be that these two are summed over and you have the outer dimensions. The first dimension is here, second dimension is this way.
So maybe I a quick question to someone who is not proficient in linear algebra, if you multiply a times a b, is it equal to b times a. In general, we say that the multiplication of matrices is not commutative. So--
ANDRZEJ BANBURSKI: In most in most cases, it is actually not case-- not true. It's very easy to see cases where it is true. For example, for diagonal matrices so matrices that only have entries on the diagonal and everything else is 0. Those are very straightforwardly always commutative.
And actually, a diagonal matrix always commutes with any other matrix. But that's the only cases, apart from some special situations. So we can also do this matrix multiplication between matrix and a vector. Because, as you see, a vector is just a special case of a matrix where the dimension of the vector is n1. So we can do this kind concatenation again. And it's much simpler in the case of vectors because you just do the summation without extra columns in the vector case.
So obviously, you can also do multiplication of vectors. In this case, remember that we have to have the inner dimension match. So we will have to take the transpose of one of these vectors for this to happen. And you can immediately see that whether you take the first one as a transpose, the second one as a transpose will give you a different dimensionality of the matrix [INAUDIBLE].
Say that you have an n dimensional vector. So y-x transpose is going to be n by n dimensional matrix. But x transpose y is going to be just a number. And this number is known as the dot product. So it's basically the summation of all the entries. So x1 times y1 plus x2 times y2 and so on.
And this dot product basically gives us the notion of an overlap between these two vectors. If the dot product is 0, then that means these two vectors are perpendicular to each other or orthogonal. And if the dot product is just equal to-- if this angle between them is 0, if they just align, then you basically get the fact that x dot y is proportional to the cosine of the angle. And these are the length of the vectors.
And the length of the vector, I'm going to describe in a second how we define this. But you can immediately-- you probably remember how it works. But we can immediately get to something that actually is applicable to say, machine learning. And this is an example of a simple linear and [INAUDIBLE] perception in which you have an output and say this output is just one number. You have some inputs, x1 to xn. Then this output is just multiplication of x1 through xn times the weights w1 to w-n. And this is just an inner product [INAUDIBLE] a vector x, another vector lw.
If the output was m dimensional, then in general this would be, again we have an input, which is a vector, n dimensional vector. And then this has to be some matrix that takes an n dimensional vector to an m dimensional vector. So it's a matrix of the dimension n by m. So I guess this is all kind of straightforward.
For those of you who are kind of proficient in linear algebra, turns out that basically any operation that we're doing in a neural network can be put as this matrix multiplication. For example, convolutions are matrix multiplications but in a very non-trivial way. So a convolution operation, if you know what it is, is this sliding of a filter across, say, an image. And that is this convolution operation which basically achieves doing this.
And it turns out that you can put this as just creating this thing called a topless matrix. So your original filter of [INAUDIBLE] is this k1, k2, k3, k4. In my case, this is slightly-- they make it slightly higher. But you can create this topless matrix that basically is the sliding effect. If you don't know what convolutions are, don't worry now. You're going to pick up on this later. If you do then this might be an interesting fact.
But now getting back to the thing that's more relevant, I was talking about the length or size of a vector. And the size of a vector is called the norm. And there is infinitely many norms [INAUDIBLE]. And typically we talk about the LP norms, which are take each entry of the vector, take the absolute value of it [INAUDIBLE] sum of overall the whole vector. And then take and raise it to power of 1 over p.
So more generally, you can define other norms. And they will have to satisfy these quantities that the norm is a function which takes, say, the vector. And if this norm is 0, then the vector has to be a 0 vector. And a 0 vector is an vector that just has entry 0. It also has to satisfy this triangle inequality, which is that if you add two vectors, the norm has to be smaller than the sum of the norms of each vector. And this you saw from the fact that we have this cosine of the angle between the vectors if the dot product is-- if they're perpendicular, the dot product is 0 for example. So you can see it from this kind of geometrical thing.
And most importantly, they satisfy this homogeneity property that if you multiply the vector by some number alpha, then what pops out is the absolute value of alpha times the normal. So most commonly, we use the L2 norm which is basically compatible with inner product because the L2 norm is just the square root of the inner product of x with itself. And throughout this summer school, you might also see the L infinity norm, which is also known as the max norm and is the maximal entry in the vector. And the absolute value of that.
Then we could also ask what's the size of a matrix. This is something when I was learning linear algebra never thought about because there is no obvious way of defining the size. But the most natural one is something called the Frobenius norm. And it's kind of like the L2 norm. You take each entry of the matrix, you square, you sum it, and take a square root of it. So you just basically see the matrix as bigger vector. So any questions about this? Yes?
AUDIENCE: Do we need to worry about what [INAUDIBLE]. It's not just the dimensionality of it?
ANDRZEJ BANBURSKI: It's the length of the vector. So say I define a vector which is this point and this point. So this will basically define me a vector which is this line. Then the normal of the vector is this length in the most natural way. But that will be for the L2 norm. The L2 norm is the Euclidean distance in space. The other norms are not. And they define your different geometries. But the one that [INAUDIBLE] in normal life of measuring distance is the L2 norm. Anything else? Yeah?
AUDIENCE: So it can actually be like the magnitude.
ANDRZEJ BANBURSKI: Yeah Yeah. It actually is the maximum entry is the biggest entry in the vector. So it gives you, say that you have a vector that has some small entries but there is one that really peaks out, it's going to show you that one. It's useful for seeing what's the possible variation in your data, for example.
That gives you how far your data actually spans. You can also define the same L infinity norm for the matrices. And it just picks out the-- well, something we'll talk about later. It picks out the biggest eigenvalue of the matrix. But we'll get to eigenvalues [INAUDIBLE]
So there is two operations on matrices that are kind of useful to recall. One is the trace of a matrix, which is the sum of its diagonal entries. And it's just that. It's useful to know that because the diagonal is not touched by the transpose, the trace of the matrix is equal to the trace of the transpose. And if these matrices somehow match up in size, for example there are square matrices. So there are n by n matrices.
Then the trace as the cyclic property, the trace of ABC, is the same as trace of CAB and BCA. So basically you can cycle through it. So this is a simple operator on the matrices. But then there's also the determinant. And the determinant is some of the more complicated because it's this value that you're trying to get from a square matrix. And this only works for square matrices.
You might kind of remember determinant. And you remember that is this thing that, for a 2 by 2 matrix, you calculate a times d minus c times b. For higher dimensional things, it's not that simple that there's just these diagonals because there is a lot more entries. And the rules get-- you can draw yourself these nice graphs but they get really complicated when you get to really huge matrices.
And in the summer school, you'll be dealing with really, really big matrices, not 3 by 3 but million by million. In which case, it really is not useful to do it this way. And for the [INAUDIBLE] of this thing, say that we do 3 by 3 matrices. And you have this shape that is defined by a vector R1, R2, and R3. In that case the determinant or the absolute value of the determinant is going to be the volume of this [INAUDIBLE] that's defined by these three vectors.
In general, you can find an expression for the determinant. And it's actually complicated to calculate it. Because you have to take each entry in the matrix, take their products and take all the possible permutations of them. So in general, for n dimensional matrices, this becomes a really big operation. I'll show you later how there is a much more simpler expression than this one. So you don't have to note it down because no one actually uses this. But this is the definition.
So now we've got them through the basics of what we can do with matrices and vectors, let's say to do something actually useful with them. So let's do the very simple thing and talk about linear equations. So we're going to have input x1 to x-n as an n dimensional vector. And [INAUDIBLE] For simplicity, let's choose n equal to m equal to 2. And we want to solve a linear equation where y is some linear combination of x.
So in general, this linear combination is going to have four possible entries here. And this is going to be the matrix a, a-1, a-11, a-21, a-22. And the linear equation that you have written in this way can be just put in the simple form on that side, which is basically matrix multiplication. This is slightly, I mean, maybe more obviously written in this way that you have on the right. So in general, you can think of this matrix, any matrix A, as a transformation of vectors. At least, in the sense of linear equations.
So maybe let's get to some examples apart from [INAUDIBLE] kind of abstract way. Let's look at stretching of data. Say that we have this matrix, which is diagonal. And it has 1 plus delta where delta is positive on one of the entries and then the one on the other diagonal and 0 everywhere else. What this will do is take your original data, which is in blue, and stretch it out.
And you can easily see this because you can take your data, which is going to be, in this case, a two dimensional point-- x1 and x2. And you can see that the stretching is going to leave this height-- this is x2, this is x1. This height is unchanged. But it's just stretching points in this direction. Similarly you can do the stretching in the other direction. And I am assuming this is kind of obvious.
You can also that this stretching uniformly. And then you're basically blowing up points. This is the kind of transformation that are kind of useful when you try to look at data to see some patterns. Equivalently, you can also do this kind of shrinking where instead of multiplying by bigger numbers, you're multiplying by something-- dividing by something that's bigger than 1. And this collapses your points. This is actually useful in the sense that some things, we might have variance of data that's too big. And you want to shrink it down so that your variance, for example, is 1 or something normalized.
Some other operations that are kind of common are reflections, which is basically flipping points along one direction. You can also do flipping in the other direction by just choosing minus 1 on the other diagonal, or complete the reflection if both of the entries are minus 1.
So another of these kind of transformations is shear, which is basically shearing data. And it's like if you're taking a sponge and kind of like pulling it in this way. Basically, pushing some points in this direction, pushing the other oriented points in the other direction. In they'll similarly share and depending on where you put this off diagonal entry.
Or you can do shear on putting both entries on the diagonal but switching sides. And this creates this kind of spinning effect which you can basically rotate your data in this way. This is not exactly rotation. Actually, it is a rotation matrix. And that's because the general rotation matrix for small angles is what I just showed you.
And for big angles, it's actually something given like this. So cosine so [INAUDIBLE] This is only true for two dimensional rotations. In higher dimensions, this is slightly more complicated. Well, what this matrix does is take a point which is, for example, the (1, 0) vector and pushes it into the cosine and sine of the angle.
So for example, if you have your data somehow aligned in this way in a two dimensional plane, then it's going to rotate it like that. So I think that's enough about this kind of common operations that you can do with it.
One thing that we didn't talk about is the inverse matrix. So you know that functions can have inverses. The inverse of the function is such that if you have the function of x, we apply the inverse of that, you usually get x back. So the inverse of the function is identity.
And for matrices, we have the same thing. So there exists this A inverse, which when you apply it to A times x, gives you just x back. So this basically means that A inverse times A is the same as A times A inverse. And this is an example of another matrix that is commutative with itself. And this will give you the identity matrix. And the identity matrix is just a matrix that has 1 through the diagonal and 0 everywhere else.
So some examples, say that we have a matrix which was this one. This was the stretching matrix. Then the inverse of that is just the 1 over 1 plus delta. And you can easily verify this by multiplying this matrix for yourself. It is done here. And you just get back the identity matrix.
Obviously, these are like opposite operations in some way. This was a simple example for the shear. Sorry. Not the shear. Actually this is-- strange-- yeah, is the shear. The operation is just the shear in the opposite direction. And this is kind of obvious. You just multiply by the opposite thing of what you just did.
But it turns out that not every operation is invertible. For example, if you take the x squared function, the inverse is kind of ambiguous. There's two points that get mapped to one. So it's kind of not obvious what you should be doing. And in general, not every matrix has an inverse.
So for example, the projection matrix which is just 1 and then 0 on all the other places, will project the two dimensional data to one dimensional data in this case. And this thing doesn't have an inverse. Because there is no other matrix that you can multiply by it that gives you an identity back. And you can think of it that once you've collapsed this data, then there is no obvious prescription for how to uncollapse it. Because there is just no matrix that is going to store all this information.
ANDRZEJ BANBURSKI: You can ask yourself, when does a matrix have an inverse. Right? And this is the next slide. So in general, first of all, for an inverse matrix to exist the matrix has to be square. So it has to be n by n dimensional matrix. And its columns, which I told you were actually vectors, have to be linearly independent, have to be forming in linearly independence of the vectors.
And what this means is that no single column can be a linear combination of the other columns. So you cannot take-- if you have one column that's basically column 2 plus B, then this is not going to be invertible. How to figure out whether any column is linearly independent or not is somewhat complicated. And you just have to play through it to convince yourself. But it is also some prescriptions.
For example, a necessary and sufficient condition is that the determinant is not 0. So once you have a non-zero determinant, you can calculate then. You can obtain the inverse. And the thing now is, this is one of the big things why computations in general are complicated. It's because finding the inverse is very complicated.
Now I gave you these 2 by 2 matrices. And for 2 by 2 matrices, it's very simple to find it. But for higher dimensional matrices in general is a very arduous process. So there exists an expression for the inverse, which is in terms of 1 over the determinant and some products of the matrix times the trace of the matrix raised to some powers and so on. It's a really complicated thing to obtain. And in general, it's an NP-hard problem. Now it takes a long time to calculate these kind of things.
I'm not 100% sure it's NP-hard, but this is just for the recording. I might have screwed up here. But I think it is NP-hard.
So that is an interesting thing that there are these matrices called orthogonal matrices for which the transpose is the inverse. And the interesting point about these matrices, so busy you have Q transpose times Q's identity. These matrices are interesting because first of all, a square matrix is orthogonal, again, with this. And it turns out that orthogonal matrices are rotation matrices. Or more precisely, the set of orthogonal matrices is basically built out of the set of rotation matrices plus the identity plus the total reflection.
And this together form something called the orthogonal group. So how we can think about these orthogonal matrices is that they change your coordinate systems. So say that you have one coordinate system xyz, if you multiply this now by orthogonal transformation, that's just going to rotate these coordinates. So maybe a slightly clearer way, say that we have a basis of vectors, which is 1 0 0, 0 1 0, and 0 0 1, that forms you an orthogonal or orthonormal basis of vectors.
And then if you apply to it a rotation-- well, if you want to rotate it then by, say, an angle theta, then you can see that the correct new basis is this one. And the equivalent rotation matrix is just the simple like two dimensional matrix times identity in the direction that we didn't change anything.
So I think that's enough of explicit matrices that we can construct that are useful to deal with. And let's go back and get to the more interesting thing, which is eigenvalues and eigenvectors. So eigenvectors are basically-- an eigenvector of a matrix A is a vector that satisfies this condition that A times x is some number times x. So this number is called the eigenvalue and is basically the number by which, if you multiply the matrix to this vector, is just re-scales it.
So what this basically means is that for a given matrix, its eigenvectors get only re-scaled by multiplication of this matrix. So let's look at an example. Again, this matrix that re-scales things, it has two eigenvectors. So a 2 by 2 dimensional matrix can have, at most, two eigenvectors. And those eigenvectors are just 1 0, 0 1. They're basically these two vectors. And this direction gets unchanged and this direction gets to be scaled. So the eigenvalue of this vector is 1 plus delta. This vector, just 1.
Similarly, we can look at this non-invertible matrix. It also has a eigenvalues and eigenvectors. And they're exactly the same set of eigenvectors. The difference is that the eigenvalue for this guy is 1. [INAUDIBLE] doesn't get changed in this direction. But for that guy, the eigenvalue is 0. So it just re-scales the direction down to a point.
So before I kind of continue with this line of thought, I was talking about some symmetric matrices which are the matrices that have off diagonal-- the transpose is equal to itself. So the off diagonal entries are equal to each other. So the eigenvectors of a symmetric matrix are orthogonal to each other. This means that they're always going to have inner product with each other 0.
You can immediately see-- you can go through the proof of this. But I'm not going to. But this immediately tells you that if you have a symmetric matrix, it defines you a basis or a coordinate system. And more importantly, because these eigenvectors are orthogonal, this means that symmetric matrices are diagonalizable. And what this means is that we can find-- so consider that we have an n by n matrix. Let's make this thing UI to be an eigenvector of A such that A times UI is lambda-I times UI. And lambda-I is the eigenvalue associated with the eigenvector.
Let's assume that these UIs form an orthonormal basis. So they are all orthogonal to each other, which is the case for symmetric matrices. Then you can create this matrix made out of column vectors. And then what happens is that you have this U transpose times A times U as this expression. Now remember that when we have these-- when we have A times U is lambda times U, this basically means that this product becomes U transpose times lambda-U. And what we get is just a matrix that has the eigenvalues on the diagonal and all the other entries 0.
This is what we call it diagonalizing of a matrix. And this is actually quite useful because this allows you to remove this tilde matrix that the most important parts. Because you know that the eigenvectors are the thing that kind of defines you the directions in space. And the eigenvalues define you the size. And in some way, if you know that your matrix is diagonalizable, then all the information you need about it is just on this diagonal.
And maybe before I go on, it's important to notice then that the trace of the matrix, for example, is unchanged under this diagonalization because trace had this property that you can cycle things through. U transpose times U is 1 because this is an orthogonal matrix. So what you have is that the trace of matrix is equal to the sum of its eigenvalues.
So this immediately gives you some kind of handle on the fact that you can estimate at least one number about this eigenvalue. So you know what is the sum of eigenvalues just by looking at the matrix immediately because that gets unchanged. So in general, if we have a square matrix and it has n linearly independent eigenvectors, so it's invertible, we can always diagonalize it. And this diagonalization is such that the matrix A is some orthogonal matrix U times diagonal matrix that has the eigenvalues on the diagonal times the U transpose.
And immediately we get that for these matrices the inverse is very simple to get. Because it's just the same orthonormal matrix times the diagonal with 1 over the eigenvalues times U transpose. And this is actually if your matrix has these [INAUDIBLE] independent eigenvectors, then this is the actual way to calculate the inverse because finding the eigenvalues is slightly-- it's more immediate how we actually do this. I don't actually say how you find these eigenvalues. But you can write down this, polynomial equations that basically give you the eigenvalues.
Now the thing is, I was talking about square matrices. But you actually will be dealing a lot with non-square matrices. And you cannot do most of this stuff that I've been talking about. For example, an m by n matrix, the best we can do instead of this diagonalization. is something called single value decomposition. And at best you can do that A is U times D times V transpose.
U and V are orthogonal matrices. But they're of different size. U is m by m. V is n by n. And D is diagonal matrix. But it's non-squared diagonal matrix. So some of the entries are going to be 0 [INAUDIBLE]. And the entries of D are not eigenvalues but they're known as singular values. Question?
AUDIENCE: For U and D, they're orthogonals of each other?
ANDRZEJ BANBURSKI: No. So U and V cannot be--
AUDIENCE: [INAUDIBLE] orthogonal to what?
ANDRZEJ BANBURSKI: They're orthogonal matrices. And why I said orthogonal matrix is that the transpose is equal to the inverse. So U then U times U transpose is identity. And the V times V transpose is the identity. But they're not orthogonal to each other. They're just called orthogonal because their columns basically define, you know, they're matrices that have eigenvectors that are orthogonal to each other. And they define the z rotation matrices.
So we call them orthogonal matrices because they have these properties that are useful to talk about orthogonal directions. But there is no sense in which U and V are orthogonal to each other, apart from the fact that you cannot multiply them. Because they're different size. So in some sense, maybe, you can call them orthogonal. But let's not get there.
So the one thing that I was talking about is that it's useful to find the inverse because it allows you, for example, to solve linear equations. But again, this only works if you have a square matrix. And in most cases where you'll be doing dealing in this course, it's not square. And then we have to go to something called the Moore-Penrose pseudoinverse. And this is defined for non-square matrices.
Say that we have a linear equation, A times x is equal to y. And we want to solve for A, for x. So we immediately, if you'll remember something about linear equations you know that if A is taller than y there, there might be no solutions. If A is wider than it's taller, then there might be many solutions or infinitely many solutions. So you can immediately see that there is no obvious way in which you can multiply A by something else so that you solve for something that is the inverse.
But you can define-- the closest you can get to it is the pseudoinverse. And the pseudoinverse is defined in this funny way where you take A transform times A. You add identity with some scaling alpha. You take the inverse of that because that matrix is invertible now. Because A transpose times A is square. So you take the inverse of that.
You take the limit as alpha goes to 0 and multiply it by A transpose again. This is very unintuitive. And in practice you don't calculate it this way. In practice, you do this SVD, where you say that the pseudoinverse is V times the pseudoinverse of the diagonal thing times U transpose. And the [INAUDIBLE] the diagonal is just the-- you have the single values on the diagonal. So you take one over the singular value whenever the singular value is not 0. If it's 0 then you just keep it 0.
So you can immediately see that this thing is kind of discontinuous. Because it kind of, it flips big numbers to small numbers but flips-- and small numbers to big numbers. But 0 stays 0. So there is some very interesting features of this that you can analyze.
We're kind of getting close to the end. And this is good. So we'll get to finally the thing that we wanted to talk about which is Principal Component Analysis, PCA. So the goal of PCA is to visualize and find structure in data. And this is the thing that you really care about. And the most important thing is that it's difficult [INAUDIBLE] in high dimensional situations.
When you have two dimensional data, yeah, it's easy to see a pattern and a structure to it because you can plot it. But say that you have several 784 dimensional thing. That's going to be difficult to what happens in this data. You can immediately see the two dimensional case that if you have your coordinate system set up somehow like this, you can rotate it. And then you've somehow aligned yourself with the data properly.
So the assumption that we're going to have for PCA is that the relevant dimensions to describe the data are somehow linear combinations of variables you measure in your-- and how you collected the data. And another assumption is that the relevant variables are orthogonal to each other. And this is a big assumption. And I'll give you an example where this doesn't hold.
But let's say that you experimentally measure n variables n times. Say that you have n neurons and m measurements or trials. Then you can create this i-th neuron, which is just an m dimensional vector. Because that is m measures of the neuron i over all trials.
So assume for now that this x has 0 mean. If it doesn't have 0 mean, then you can make it 0 mean. Then the variance and how you make it 0 mean, your basically subtract the mean of the data. Then the variance of the neuron i is just going to be some normalization vector 1 over m minus 1 times the sum of the squares of the entries.
And we can write this as x-i transpose times x-i. Then you can also talk about the covariance, so how are the two different neurons are related to each other. And this is just x-a transpose times x-j, with this minimization factor 1 over m minus 1. Maybe let's do this in the matrix form.
Say that we have n neurons. These together form an n by m dimensional matrix because we have m measurements. So we can talk about the covariance matrix, which is, again, this normalization factor 1 over m minus 1 times x times x transpose. And now if you think about what is the dimension-- maybe someone could tell me, what is dimension of this thing. Remember that x is this thing. So it's n dimensions this way, n dimensions this way. So any takers?
ANDRZEJ BANBURSKI: I mean, it's kind of cheating, because I already wrote it down. But anyway the important thing is that this matrix is symmetric. So x, x transpose of that is equal to itself. So it gives you how the different entries in this matrix are basically correlated, how different activations of the neurons in the trials are correlated with each other.
So it's a symmetric matrix. Now recall to what I was talking about earlier. A symmetric matrix has orthonormal eigenvectors. So all the eigenvectors are orthogonal to each other. So you can create a matrix U, which is the orthogonal matrix whose columns are the eigenvectors of this covariance matrix. And the eigenvalues are lambda-i.
Now we can choose-- one thing that I didn't say is that we can choose-- there is a freedom to choose this orthogonal matrix up to such that we can kind of choose where we put the eigenvalues on the diagonal. And let's just choose such that the first entry is the biggest, the second is the second biggest, and so on.
In that case we have n vectors. And we have that U1 points in the direction of the greatest variance of the data. Because that's the biggest eigenvalue. U2 will point in the direction of the next greatest variance, and so on.
So we can immediately see that if we write that y is U transpose times the x and x was the original data, then we see that this is a transformation on the data. And these new variables to i are uncorrelated from each other. So you can immediately see that the covariance matrix of y is just the diagonal. Is the diagonal of this eigenvalues. And these eigenvectors are something we call the principal components. And this is where the principal component analysis comes in. We say that U1 is the first principal component because this is the direction in which the data varies the most. U2 is the second principal component and so on.
So you're going to immediately think that if your data varies just in a few directions most and then the other directions are irrelevant, then you might be able to do something interesting with this. So if you're familiar with MATLAB, there's a very simple way in which you can implement a PCA in MATLAB.
You basically take your matrix of data. You multiply it by some normalization factor, whether it's m or m minus 1 doesn't really matter. You find just the eigenvalues of this covariance matrix. This is very simple in MATLAB. You can re-order things such that the diagonal entries are ordered. And then you can just read out the principal components rather quickly.
Now I mentioned the PCA sometimes fails. And I'll give you a demo of how PCA actually looks like. But PCA can fail. And this is an example of how PCA can fail. Say that we have a Ferris wheel. And we are tracking a person on it. And we have these data points.
You know that the dynamics of this can be described by this phase because you're just rotating on a Ferris wheel. But it's a nonlinear combination of the basis vectors. And this nonlinearity is what screws everything up. Because you have these non-orthogonal axes and PCA basically fails. So you have to be really careful about these kind of things. So PCA sometimes might fail.
But it allows you to do this impressive thing because it dimensionally reduces your data. Say that you only want to look at the few largest directions the data varies. You might be able to explain at least most of the variance and be able to visualize it somehow. So I'll show you an example of this now. We only have a few minutes.
I'll give you direction to a code that does this kind of thing. So this is in TensorFlow in Python. So there is this very simple way of let's say that we want to take a data set. And I'm going to take a data set that we use in machine learning all the time. It's called [INAUDIBLE]. It's the data set of handwritten digits. 0 to 9.
So you can set up your TensorFlow read in your [INAUDIBLE] data set. These are very simple-- create some simple embedding variables. And you save the data. And then TensorFlow has this cool associative tool called TensorBoard in which you can take your data and try to visualize it. And it immediately does this for you. And I've already loaded this. And this is a subset of the [INAUDIBLE] digits. This is just 700 of them. But the dimensionality of the space is 784 because [INAUDIBLE] digits are 29 by 29 pixel images. So you can immediately calculate and they're black and white. These are examples.
You can't really see this clearly now. But I'm going to color by labels. And you can immediately start seeing some kind of structure emerging. And I've just took three highest components of PCA. So let's analyze this slightly. We immediately see that-- I mean, the colors are [INAUDIBLE]. But once the digits once have all clumped together, the other ones that are kind of clear are 0's. They also kind of sit together.
There is a very interesting thing that you can learn from the data, for example, that [INAUDIBLE] four and nine are kind of similar to each other. And you can kind of see it when you write it. You can also see sometimes that there is some failure cases. And you see that there is like 7 that looks like a 1, and so on.
And in fact this PCA is not going to show you all the variation on the data. But in this case, PCA in three directions gives you the variation-- it explains 25% of the variation in the data. So it allows you to get some handle on how the data looks like and how it varies.
There are better techniques for this. But they're not based on linear algebra. So there is this thing called [INAUDIBLE] that is more probabilistic and actually works way better. So you can, for example, and Tensor Board does it automatically for you. So let's see how it looks like. And this thing runs and runs until you find some nice pattern that [INAUDIBLE] I'm going to stop here.
So you can immediately see that 1's are kind of clumped with each other. 7's are there. This is 2, 3, and so on. So this thing is not based on linear algebra. And this does nonlinear separation of the data. And this is one of the reasons why it's useful to do nonlinear neural networks where you do these nonlinear transformations of the data.
PCA is just a purely linear algebra operation that does non-nonlinearities and still manages to get something. And I'm going to send you a link to the thing as well. So that you experiment with it yourself. That's it.