Connections between physics and deep learning (51:45)
July 5, 2017
August 29, 2016
All Captioned Videos Brains, Minds and Machines Summer Course 2016
Max Tegmark, MIT
The study of deep learning lies at the intersection between AI and machine learning, physics, and neuroscience. Exploring connections between physics and deep learning can yield important insights about the theory and behavior of deep neural networks, such as their expressibility, efficiency, learnability, and robustness.
Lin, H. W. & Tegmark, M (2017)
Journal of Statistical Physics 168(6):1223-1247. Why does deep and cheap learning work so well?
MAX TEGMARK: So I'm going to talk about the connections between physics and deep learning. And I'm going to talk about this because I feel that there's a lot of really wonderful science to be done here in the middle of this Venn diagram, at the intersection between physics, neuroscience and AI. I'm going to give two different examples, in fact, of cool things exactly in that area.
And I'm going to start with this question that Gabriel raised. Why is it that deep learning works so well? And moreover, why does cheap learning works well? I'll explain what I mean by that in a moment. So you've all heard a bunch of cool things about deep learning being successful here at the meeting. And Haim Sompolinsky also told you about why deep learning is sometimes better, and gave some very interesting partial answers. I'm going to argue that the full answer to this question cannot be found in mathematics and computer science alone, though. Because part of the answer is actually in physics.
But before we go there, let's just remind ourselves, so we're all on the same page, what we mean by deep learning. So a feedforward neural network is just a function. A particular function, of course, where you send something in, in an input layer, and something comes out. Data in, data out. Bits in, bits out. Or vector in, vector out, whatever. Feedforward neural nets are a very particular class of functions, where you alternatingly multiply by-- these are affine transformations, so you multiply by some matrix and add some constants, and alternate-- and then do some simple nonlinear operation, often just apply non-linear function to each number, sometimes you do other simple things.
So what we know empirically is that by just alternating linear stuff with simple not linear, you can do all kinds of very, very cool things. You've seen a lot of examples in this school. Just give you a few that you're familiar with, this is a function where you send in an image. So maybe you send in a million pixels, which each has an R, G, and a B number that specifies its color. And what comes out of the function is a caption for the image, a group of young people playing a game of frisbee. Or you take the same function exactly, the same feedforward neural net, no more training. Feed in this image, out comes this caption.
Another example of a deep learning function, which has been in the news recently, is the fact that AlphaGo that beat Lee Sedol spectacularly in Seoul this year, it has-- it doesn't only do deep learning of course, it also has reinforcement learning, but it has a deep learning function where you just feed in a position. You feed in a 19 by 19 goal board, with each of these little squares can have one of three values, blank, white, or black. And it returns the probability that white wins this game. Again, it was very good at evaluating this function as evidenced by the results.
Now, deep learning doesn't exactly implement the optimal functions you want, but it generally approximates the functions we were interested in really, really, really well. Typically, the functions we try to approximate are probability distributions. So if you have some variable x which predicts, in some stochastic way, some variable y for what happens maybe next, then if you're trying to predict the probability of y given x, we call that prediction. That might be, for example, that you're texting your friend on your cell phone and you're trying to predict, given what you typed so far, what is the word you're trying to type next.
Or if instead you know what came out, and you're trying to predict what went in, we call that classification. So for example, if x is the name of one of your friends, and y is the pixelized image of what that friend would look like in some random lighting conditions, you can get given y and try to predict the probability for which person this is.
And then another kind of function we often want to predict is just the probability distribution itself. So if you just have some variables, no assumptions about any causality, you just look at a bunch of data and try to find patterns and see what is this? What probability distribution is this drawn from?
So how good is a neural network then at approximating functions? Well, good, what is good? It breaks down into three parts. The first is expressibility. So what class of functions can it express? And then the second is efficiency. How many resources you need to do this? And the third is learnability. So how rapidly can this network actually learn good parameters for approximating the function? And I'm going to talk about the first two, mainly the second.
So for the first one, expressibility, it's clear that if you're allowed to just keep tacking on more and more layers, the answer to the first question of what kind of functions and it can exactly express, it must be a class of functions, which is invariant under composition. Because if you take the neural network to implement some functioning, and you just add another neural network on top of it, you've just composed the functions. You've put the output from the first guy into the second guy.
And we know about a lot of function classes like that. For example, linear functions are like that. So if you throw out all your nonlinear things, you're just going to get linear functions. Nothing else. I find functions, linear plus constant, are like that. Piece-wise, linear functions are like that. You can get them by just having these so-called [INAUDIBLE] gates, which are constant, and then go linear when they put on the positive side.
Polynomials is another kind of functions like this. If you plug a polynomial into a polynomial, you get a polynomial. Continuous functions is another class that's closed under composition or various kinds of smooth functions. And this is basically has solve problems many years back. There's a lot for example, raise your hand if you have a math background. All right. Raise your hand if you know of the Stone-Weierstrass theorem. All right. Yeah.
So it basically says that polynomials can approximate arbitrarily well any continuous functions. And there's been a lot of famous theorem is proven in the machine learning literature that says that pretty much whatever arbitrary function you take for the non-linearity in the network, you can approximate arbitrarily well any function. Haim Sompolinsky talked about this in his talk. It's a lot of very beautiful work there.
So we shouldn't be so shocked, maybe, that neural networks can do well in principle if they're allowed to be arbitrarily big. But how efficient are they? How many can you actually do? How well can you do if you're limited to how many neurons you're allowed to have or how many layers you're allowed to have or how many synapses, in other words how many connections you're allowed to have here? And this is what I want to mainly talk about now for the next bit.
And this is the focus of this paper we have. Because if you start to think about this question of in the red box here, after a while, you realize that it's kind of shocking that deep learning works at all, that it doesn't epically suck. Why do I say that? Well, let's just come back and look at this picture again. So suppose you take a very simple example where what you send in is a pixelized image. And what you want to get out is a probability distribution over your different friends, which person is this.
Let's make it really simple. Let's say this is a million pixels, 1,000 by 1,000. Suppose it's just black or white pixels. How many images are there that you could possibly feed in to this function? There's 2 to the power 1 million. That's more images than there are particles in our universe, which is a tenth of the 89. So there's a huge number of images. And if I want to be able to specify an arbitrary function, even if I've narrowed it down to two of my friends, I want to know, is it Gabriel or Maya in the picture. Then I need to for each image compute the probability that it's Gabriel.
How many probabilities do we have to store, how many parameters do we have to store? 2 to the power 1 million parameters. 2 to the power of 1 million probabilities. More probabilities, more parameters, than there are particles in our universe. Clearly impossible to ever do good image recognition on a neural network, right? Wrong. So what's the swindle? How does it do this?
Somehow deep learning pulls off this combinatorial swindle where it replaces exponentiation by a kind of multiplication. Because if you send in, for example, a million parameters in the input, and then you take some modest number of layers, 3 layers or 200 layers, or whatever, that are maybe a million in size, ballpark, that means you still have [INAUDIBLE] a million-- [INAUDIBLE] a million parameters, maybe times 1,000 or times a million. But you certainly don't have 2 to the power a million.
So how can it be that we can do so well approximating all the functions we care about when we just proved to ourselves that it's completely impossible to approximate any functions except a tiny set that measures zero in the mathematics sense. So basically all functions that you can imagine writing down, you can't do. How can that be? It can't be just because of math because math just says, forget it. It only works for a tiny subset.
It must have something to do with physics. That's what I'm going to argue, that physics, the images we care about, are very, very special kinds of images. They don't look at all like typical images. A random image with a million pixels looks indistinguishable from pure noise. You can even prove that. This Italian mathematician Borelli proved that over 100 years ago already. The proof is the same as if you just take a random number between zero and 1, and you write it out in binary, you look at all its bits.
He proved that almost all numbers have that binary-- the decimals just look indistinguishable from just pure random. They'll pass any randomness tests. And so that's not what Gabriel looks like, that's not what Maya looks like, fortunately. So let's look into this a little bit more closely and see what is a special class of functions that neural nets can do so well. And then we'll talk about what is a special class of functions of physics and things inspired by real world images, like drawings and cartoons and so on, represent.
And then I'm going to try to persuade you that they're actually more or less the same, which is why deep learning is so successful. So suppose I want to figure out whether the picture is of Gabriel or Maya. Suppose I know what the probability is, what the image is going to look like given x. X is now Gabriel or Maya, to make it simple. I have some prior probability distribution for whether it was Maya or Gabriel. And then I can use Bayes' theorem to say, OK, this is then the probability for who it is given what I saw.
Now first let's make this look a little more like physics by just taking the logarithms of the probabilities. OK? So just take the logarithm of the probabilities. Then the product, of course, becomes a sum and the exponent. And when you look at this, it starts to look suddenly a lot like a Boltzmann Distribution in statistical mechanics. And in particular, a lot of times in physics, the way in which probability distributions of many, many variables is simple, is that it ends up being some things are independent of some things.
So you can write down the whole thing down as some complicated product of stuff of conditional probabilities. And then when you take the logarithm, just becomes a sum of those things. It's very much like in physics where the energy function that we call the Hamiltonian tends to be the sum of a lot of different contributions to the energy. Now if you stare at this a little bit more and just put on your machine learning hat again, this thing here, which is just Bayes' Theorem written in physics denotation, is actually exactly what machine learners call a softmax function.
One of the fun, little nonlinear things that we often do in deep learning, is we just exponentiate all the numbers we have. And then we just re-scale them so they sum up to 1. So we can think of them as probabilities. That's called the softmax. When I write a function of a vector, I mean I just apply the function separately to each element of the vector. What I did also here is I took the x, which was Gabriel or Maya, and I made it a little subscript.
So I think of it as being one energy function or what we in statistics call the surprisal, minus log probability, in the case of it being Gabriel and when the case of it being Maya, here. And then I made a little vector out of this. So in this case H is a vector with 2 elements in it. The Maya function and the one from entry from Maya, x equals Maya, and one for x is equal to Gabriel.
So what this is saying in plain English is that if I can somehow build the neural network that will really, really accurately approximate the Hamiltonian, or the log probability in other words, for these pictures of Maya or pictures of Gabriel or whatever else it is I have, then I'm done. I can actually approximate the exact thing I want to Bayes' theorem by just putting a single little layer at the end, which is the softmax operation. I will have exactly optimally predicted the probability that I'm looking at my wife.
Now is there any reason why I should expect it to be simple to calculate this Hamiltonian? Well, let's take a step back here. In physics, a good fraction of what we do is we look at systems in the real world and try to figure out what their Hamiltonian is. In other words, what is the function that gives me the energy of this system as a function of its parts and particle positions, whatever. And guess what? We have found that the Hamiltonian of our universe is absolutely not some random function. It's an incredibly simple function, very, very special function.
In fact, if you go look at the Hamiltonian of particle physics, standard model of particle physics that we use at CERN to calculate stuff about the Higgs boson and whatever, it's not any old function. It's a polynomial. And moreover, it's not any old polynomial, either. It's a polynomial a very low degree, a degree 4. The fourth power thing comes from the Higgs particle and a lot of it is this quadratic or cubic.
And it's simple also for a different reason, which I'll get back to you in a moment on the next slide here. So to make this a bit less abstract, let's just take a much simpler example. Suppose I have some simple one-dimensional system that I want to model. Suppose I'm not taking photos of you, but I'd listen to you. I record each one of you speaking for a little bit. And then I want to write the neural net to figure out whose voice that was.
So I measure the air pressure with my microphone at a lot of different time sequences. I get a little data vector x1, x2, x3, et cetera. And I want to know, OK, what's the probability distribution of this? I take the logarithm. Again, they call that the Hamiltonian. How many parameters do I need to specify this, an arbitrary function of 10 variables? Well, infinitely many parameters. If it's a polynomial, the number of parameters depends on the degree. And if it's not a polynomial, you need infinitely many. So that sucks.
But suppose I just say that I bet I can get away with a polynomial. Maybe it's a quadratic polynomial. Now I only need 66 parameters. Each thing can be multiplied by each other thing. And there's exactly 66 pairs of things like this I can take. That was nice. But the real world Hamiltonian we tend to encounter, tend to be even simpler than that. Because we also find the physical systems tend to only interact with things that are nearby. This is locality.
Information doesn't propagate at infinite speed. It goes only at the speed of light. So things can only affect their neighbors. So if you only allow each thing to be multiplied by its neighbors, then there's some interaction between these two and between these two and these two. So the number of parameters is just linear. And parameters basically-- each thing can also interact with itself. So 20 parameters.
But the real world is even simpler than that because we tend to also have symmetry. The laws of physics don't depend on where you are or when you are. So for example, if it sounds we're dealing with, it's quite likely that the correlation between the air pressure from Gabriel's voice now and the air pressure a tenth of a second later doesn't depend on average on what time it was. So if we insist on this kind of symmetry that the coupling between x1 and x2 is the same as that between x7 and x8, now there are only two parameters left, the coupling to the nearest neighbor and to itself.
So very few parameters. And we can look at not just fundamental stuff like the standard model of particle physics for these sort of counting, but we can look at all sorts of approximations that we use in physics, also. Suppose you just want to study the Navier Stokes equation that governs fluids because you want to model the water on the beach and think what should a random photo of the beach, the water look like, what's the probability distribution?
There again, the Hamiltonian is the polynomial, order 3 in this case. Again, has symmetry, locality very, very nice. The Maxwell equations of light, super simple. There is a quadratic polynomial that's even easier. And pretty much all of these approximate things you write down in physics are very, very special, very simple functions. And they tend to be very super sparse polynomials. Question?
AUDIENCE: Can you just [INAUDIBLE]
MAX TEGMARK: Oh, yeah. I'm just defining the minus the logarithm of the probability distribution. I'm calling that the Hamiltonian because in physics, the probabilities in physics-- and thank you for asking this-- tend to always be of the form e to the power, the energy, divided by the temperature if you're in thermal equilibrium. But you don't have to have it be a simple physics problem like that, either.
It turns out that there is many, many other cases where the logarithm of the probability is often an easier function than the probability itself. For example, suppose you're just looking at some random system that you're interested in studying. And suppose you think that well, there are many different independent things going on that have sort of combined together in some linear way. So maybe we can use the central limit theorem and say everything is Gaussian.
Whoo hoo. If it's Gaussian, multivariate Gaussian, that is just e to the power of a quadratic function. That's what Gaussians look like. So the Hamiltonian now is quadratic. Yay. Awesome. Another thing that's often done in machine learning is you do some maximum entropy approximation where you say, OK, I don't know exactly what my distribution is. But I'm going to just take the simplest one that has the minimum amount of assumptions, the maximum entropy, given that I know what some of the moments are, like the second moments, third moments, fourth moments, whatever.
Then you can also prove that it's just e to some polynomial. So the log of the probability will just be a polynomial. So to summarize all of this for now, I just want to say that both in images we get in the real world, like if I take a photo of a galaxy and I asked you to write down what's the probability distribution for all galaxy images, I know it came out of the laws of physics. And ultimately that came out of this Hamiltonian, which is very simple or it's some sort of machine learning example.
The log probability is very often the polynomial. And if it is a polynomial, then you can totally trust the classification problem if your own neural network is good at approximating polynomials. OK? So let's ask ourselves, is it? Are they? [? Chaim ?] told you a lot of great stuff about what neural networks can do. Polynomials are, it turns out, incredibly easy to not just approximate, but actually do exactly with very simple neural networks.
There are a lot of very powerful theorems that prove what you can do if you have as big neural network as you want. But let me just show you how with just four neurons and the hidden layer you can already implement multiplication of two numbers perfectly. So in this little drawing I made here, the yellow circles mean you sum all the lines that go into them. The square boxes mean that you apply your nonlinear sigma function to whatever goes through them.
And if there's a line with a number on it like Lambda or mu, you're supposed to just multiply by that. So those are the [INAUDIBLE] matrices. If you take any function that isn't a linear function, you can always by shifting it sideways for the bias make sure that it has a second derivative that isn't zero at the origin. And then if you just tailor expand it, it's trivial to see that if you just plug this in and check, that what you get if you implement this little graph, you take the sum of these four combinations, you will just get the product of the two numbers that went in plus the fourth order term.
But if you take this number Lambda to be kind of small, then you're making sure that all the numbers are very small that go in. So the error term becomes negligible. And then you can just scale it up again. So this is a multiplication gate. And once you can multiply numbers, of course, addition comes for free in neural network. It's trivial with one layer. Then you can implement any polynomial you want by just doing many layers with multiplications and summations.
If you don't have continuous numbers, but instead just bits that are either zero or 1, it becomes even easier to do multiplication. This thing will only multiply two numbers at a time. So if you want to take u times v times w you have to add another layer. But if it's just bits, you can do it even with just one neuron in the hidden layer. Because if they're all zero or 1 and you want to multiply things together, the only way the product of a lot of bits can equal 1 is if they're all equal to 1.
So you just have to take the sum of them. If there are three of them, the sum better equal 3. And it's very easy to check if the if the sum equals 3. If you have some sort of sigmoid activation function, you just make sure that you scale things that will come out so the 3 is over here, 2 is down there and below. Now you can even multiply an arbitrary number of things together trivially.
So in summary, the natural world gives us probability distributions for images, for sounds, for whatever, that are not just randomly weird probability distributions. They give us this very, very special kinds of images and sounds. And neural networks are very, very good at approximating almost no functions. But they're really, really good at approximating this very special kind that nature feeds us. So this is a very, very nice coincidence, except that I don't think it is a coincidence.
We are very excited to be humans because we invented deep learning. But of course, evolution scooped us. We've heard a lot of great stuff in this school about how the parts of our brain we understand best, things like visual system, are making heavy use of these kind of techniques. And it's pretty clear that evolution did it because it was useful and tapped into the regularities that were there in the real world, that there were maybe a lot of things that were in this class that but physics offered us that you could model accurately that way.
My hope is that by studying more of the kinds of things that physics throws at us, and also the kinds of things that are inspired by physics, like handwriting and drawings and so on, we can get even better insights into how to further optimize actual neural networks that we're trying to build. So I'm going to say a little bit more about this also connecting with another issue that Haim told you about, which is it's not just that neural networks are good, it's not just that you can do cheap learning, by which I mean you can do well with a very small number of parameters, millions rather than a googolplex or whatever.
So cheap learning works. But Haim also showed you that in many cases, going deep with more layers is better. Now why might that be? So let me tell you a little bit about what physics has to say about that also. So if you think about the way in which data sets in physics, in which our universe creates data sets, it typically involves a series of steps. First this happens and then this happens and then this happens and then this happens and so on. So if each of the steps are simple, then we know that deep learning can model the different steps well.
We prove in the paper that in general, if you have a hierarchical, generative process like this, either that makes natural images or data sets, or whether it's fake data sets like this little ray-traced cat here, or it's a series of steps. First, you pick what animal it's going to be. Then you pick your parameters for your solid modeling. And then you do your ray-trace, whatever. We prove that there is always an optimal way of undoing this.
When you go down the hierarchy, often you lose a little bit of information about what the truth was because you added noise and stuff like that. But there's always an optimal way we show to undo this that doesn't lose any more information any more about what you started with. There's a rich area of statistics called sufficient statistics where people study exactly this, what's the best way of estimating certain things. And what you find is, not surprisingly at all, that the best way of analyzing data is to just undo the complications that nature made for you one step at a time.
And as long as the steps are simple, as we talked about earlier, then we know that deep learning can do each of those inversions efficiently. So by default, you end up with a network maybe that's not super deep, but maybe it has five layers or 20 layers or how ever many steps there were in your process. So if it's true that deep is better than shallow, that must mean that in many cases if you have a really efficient network with 20 layers, that somehow it's impossible to flatten that network without losing efficiency.
And there's a very nice literature on that topic. I like to call this no flattening theorems. So you have a black box neural network with five layers, it does something really awesome with this many neurons in this many synaptic weights. And if you can prove that it's impossible to do that, well, with fewer layers, that's a no flattening theorem. Tommy Poggio, with collaborators for example, has written a series of beautiful papers showing that if you have a hierarchical process where you compose functions, there are many situations where you cannot flatten them without having to have exponentially more resources.
And there's a lot of other really nice work of that kind also. And just to close out this topic, let me give you an even simpler example of why deep is better, where you don't want to flatten. One which is so simple, you might think there's nothing interesting to say about it, namely entirely linear networks. So suppose I tell you I have this cool deep network that does this cool thing. And it's entirely linear. All my layers are just multiplying my matrices.
You'd be like, get out of here. I can't believe Gabriel let Max waste 37 minutes of our time. This is so retarded. If it's just a bunch of matrix multiplications, you can just multiply all those matrices together into one matrix. Of course, that's going to be just as well. But before you laugh me out of the room, let me defend myself a little bit. Yes, of course, if you flatten it like that, you haven't lost any expressivity. The neural network performs exactly the same. You've also reduced maybe the number of neurons you need.
But actually you might have just screwed yourself over and created a network that takes much more synapses and much more parameters to implement. Think for example, of the Fast Fourier Transform, the FFT. That corresponds to just multiplying by a big matrix. Suppose you have 1,024 numbers and you want to do an FFT. If you do it in one layer, then it's 1,024 by 1,024 matrix full of complex numbers you multiply by.
But the beauty of the FFT thing is that you take that big matrix and you factor it into 10 matrices, 10 being the logarithm of 1,024, base 2. And those matrices are all uber sparse and have just like two end parameters in each. So each time you're really just taking some sums and differences. So now the total number of multiplications you have to do isn't n squared. It's n times log n. So if you have a nice deep network with 10 layers that does the FFT, if you flatten it, now it's going to be n squared. Almost 1,000 times slower. 100 times slower. That sucked.
So you see even four linear networks sometimes deep is much better. And it's not just the FFT. There are lots of other things of this type, discrete wavelet transform, a discrete this and discrete that. Whenever you do a convolution, of course you do it with FFTs. So that, again, is something you don't want to flatten. It's much more efficient to go deep for the FFT, multiply by something, and go back.
Even matrix multiplication is like this. But normally you learn in kindergarten and multiplying by matrix. Well, a matrix by another matrix is an n cubed operation. Because you have to calculate n squared things. And for each one of them, you have to do n multiplications. But it's been proven now you only need, not n cubed, but n to the 2.4 operations if you factor this into many, many layers. Or if you have very sparse matrices, again, it turns out it's much better to go deep.
And finally this has a lot to do with physics. Again, I'm not going to bore you with this now. You can ask me over a beer later. But the whole business with renormalization in physics, which is the process by which we extract information we care about while ignoring all the things we don't care about is very analogous to this process of analyzing data and deep learning. We try to extract features and distill them out and throw away the noise. And we talked about how that's related and different from all of these things.
So I have to summarize a few little links between physics and machine learning. But you can ask me about it a little bit later. So that was the main part of what I wanted to share with you here, fun connection between physics on one hand and AI and neuroscience on the other hand. That helps see why it is that deep learning works so well. And my hope is that the more we can understand why deep learning works so well, the greater the chances are that we can make it work even better also.
Both better in the sense of learning faster and approximating things better with less resources, and also understanding better in the sense that we can make it more robust so it actually does what we want it to do. And it doesn't get fooled into thinking that a person is a starfish, for those of you who heard of this famous adversarial training example. Let me take just a little bit of time also to tell you about one other fun connection between physics and deep learning, which has to do with phase transitions and so-called critical behavior.
Raise your hand if you've heard that, if you know what critical phenomena are. Those of you who didn't, worry not. I will explain. First I just want to show you a really weird plot. What's really weird about this, is that on the same x and y-axis, you have things as disparate as French, English, Bach music, the human genome, and some physics stuff with magnets, the 2D Ising model. What?
I should say that both this and all the previous stuff, work I've done together with Henry Lin, a student, he works with me at MIT. He's from Harvard. And what you're looking at here, first of all, is a bit like correlation functions, except to calculate the correlation between two different things, you have to be able to multiply the thing. So that works fine if it's like some magnetization or whatever. But you can't multiply two musical notes by each other.
So we generalize this to showing the mutual information instead. So what's showing in here is, if you're, for example, listening to this music by Bach, and you try to predict what tone is going to come say 10 notes later, how many bits of information do you actually have about that based on the present tone you're on? And what you see is in all cases, less and less of course, the farther into the future you're trying to predict.
And it falls off kind of like a parallel, roughly with the slope of minus one half, which is coincidentally what happens when you have the so-called 2D Ising model which is a two-dimensional magnet when it's exactly at the phase transition on the verge of becoming demagnetized. So we were very shocked, honestly, when we looked at all these curves. Why are they looking so similar? What's going on?
And to give you a little more intuition, let's look, for example, at the physics case here. I put the magnets. So here we have a two-dimensional grid of little spins, as we call them, magnets that can point up or down. And if it's very hot, then they're just going to point randomly in whatever direction they point. And I seem to no longer be able to play the movie of this. Let's see if I can quickly repair my Ising model. Maybe this will go better. Maybe it wants to be center stage. Nope. It doesn't want that. Oh, there we have it. Oh, I see. It was actually working. Just was working for you but not for me.
So, if you cool this off gradually, the little magnets actually want to line up with their neighbors. But they won't if it's too hot. When it gets cooler, they start lining up more and more. And there's a magic temperature where they start really lining up very, very well called the phase transition. It's exactly analogous to if you have water and you make it colder and colder. Suddenly the water gets more organized and it all turns into ice.
At exactly this critical temperature you can calculate that the information they have about the neighbors falls off like this one over the square of the distance parallel. Otherwise what typically happens is that these correlations always die off exponentially. Things just don't care about what's happening very far away. This is also what happens with any Markov process where if what comes next only depends on what happened right before. So somehow all these systems are showing long range correlations. There's a large scale, long range structure in there in them.
If you listen to this music by Bach [PLAYS MUSIC] and I was very impressed that Gabriel both knows and loves the composer also who is-- so the performer. Sorry. The composer is Bach. Hillary Hand. Yeah. If you listen to this, it's clear there is a lot of structure in this. It's not just random white noise. There is very short-term structure. You can guess quite well what the next tone is going to be. But there is also things that repeat and then certain themes that come back and then larger scale themes that come back much, much later.
So there's sort of a hierarchy of things like this. Similarly in the English text here, for example, it's clearly like that. If you start reading for example, O-P-O-S-S, you figure what's the next letter, probably? U. It's probably going to be an opossum. Because you have to have structure at the level of phonemes, that level of words, but you also have a lot of structure at the level of sentences, and then the level of paragraphs, the level of ideas.
So there's this almost fractal nature of English, whether it's structure at all scales, just like there is in music. Apparently there is in our genome, also. This is the fifth chromosome. We just did it. Somehow maybe there are little amino acid codes and then this stuff at the gene level and maybe a higher level. So we were very intrigued. Why is it that this is happening? We proved that for any marker process, this curve always dies off exponentially.
And then we started to think, OK, what kind of processes could give this sort of scale and behavior with long range correlations? In physics there's a famous theorem that says this can only happen for systems that are two-dimensional or higher, or three-dimensional. That can't happen for one-dimensional systems. That's so weird because English and music and the genome, they're all one-dimensional sequences. So what if there is in some sense some hidden dimensions of--
So what we found was there's a very simple class of models which always predicts the sort of scale invariant critical behaviors that we saw here. Anything that's recursive where you have something that creates something that's smaller that creates something smaller in turn, on and on and on and on. So English language is obviously very much like that. If you're going to write-- are they going to have to write reports for their projects here? They give talks, right? They give talks. They give presentations, right?
So when you prepare your presentation, how are you going to do it? You start with the first letter and then try to decide the next letter and the next letter? Probably not. You probably first come up with the idea for the whole thing. And then you're like, OK, how can I break this into sub ideas in my presentation? And then how can I break each of those into smaller ideas and compartments? And then how do I do this and how do this and what slides there I want to put in my keynote?
So the causal hierarchy is more like a tree. You start at the top level and then gradually it filters down like this. And similarly, I can't compose music, but I imagine when Bach composed that piece, he had some vision for the whole thing. And he said, I want this and this. And then it gradually filtered into all the way down to the small-scale sort of details. And we built some simple toy models like this, which are very much also like universal grammar, Noam Chomsky and stuff like that in English.
Whenever you have a noun phrase with some thing, and you can replace that by another little sub-tree and so on. And prove that these always give scale invariant behavior. And it's easy to see why from these little drawings. So if you first just accept that for any markup process where what happens next only depends on what happened previously, these correlations decay exponentially sparing you the proof. It's easy to see intuitively why that would be.
Suppose you've only managed to keep 90% of the information each time. Now after 10 steps you've lost 0.9% to the power 10 of the information. So it's the amount of information you have left just decays exponentially. But suppose instead you have some sequence here like the music we listened to or your presentation you're going to give at the end of the school, it wasn't created in a linear fashion. It was created in a hierarchical fashion. What is the actual logical distance between here and here?
Well, you just have to go back. So the distance between this and this until you have common ancestor. It was this idea here, 3, that led to both this and this. So if you look how far away you are, the numbers are written at the bottom that show how many links you have to traverse to get to the next guy. This only grows logarithmically with time here because of all this dividing by 2 dividing by 2.
So how is that going to be correlated? Well, it's going to fall off exponentially with how far you have to go but that in turn just a logarithm of the time. So you take the exponential of a logarithm and poof, you get a parallel. Works beautifully. Interestingly in physics if you look at parallels, they're always, as far as I can tell, also created by almost the same kind of thing. For example, my colleague Alan Guth at MIT invented this theory of inflation making [INAUDIBLE] universe by taking a tiny piece of space and having it double double double double.
It produces a beautiful parallel if you calculate the clumpiness of our universe for how correlated the density here is or the density somewhere else. Why is that? It's for this very reason. First you have a little piece of space. And you make some random fluctuations in it. Then you double it. And now each of the two parts just does the same thing all over again. And add some fluctuations here that are inherited by all parts. And then it doubles and its children do the same, its children do the same. Poof. And we get a parallel.
In turbulence Kolmogorov found that you get parallels. Again, if you just look at the water coming out of your faucet or something or look at a waterfall and calculate the correlation between the water speed here and there. Why is that? Well, Kolmogorov realized that you get these little vortices or eddies and they tend to have babies. So one little big vortex gives rise to maybe two baby vortices, which give rise in turn to baby vortices. They're half the size and it just cascades all the way down.
So it's exactly this recursive kind of fractal process again. And sure enough, you get beautiful parallels. And what does that have to do with deep learning and machine learning? I'll get to you in just a sec. Raise your hand if you know what an LSTM is. Long and short term memory. Yeah, it's a tweak, a kind of recurrent neural network, which is proven incredibly useful, which is all the rage and a lot of companies these days.
And I will spare you the details but it's very easy to make an LSTM that exactly implements this kind of thing where the higher up you go, each layer you go, things are remembered for twice as long. And this will very naturally produce these sort of things. And finally, I just to say that this is actually-- the idea of going in measuring this function about how much information things have with what's farther down the road and plotting a curve, is something I think is even kind of useful in practice for any one of you trying to optimize some neural net to actually do stuff.
Suppose you try to make a neural network. Suppose you, for example, want to predict, you want to do auto correction for your texting or you want to predict what's going to come next in some data file for good data compression or whatever. How would you do this? Well, you would write a neural network. It's supervised learning. It tries to always predict what comes next and it gets punished or rewarded depending on if it gets it right or wrong.
And then you find, OK, it does this well. I get it right. I predict correctly 90% of the time or you can quantify it with something more fancy, some KL divergence or whatever if you want. But how do if what you've done is the best possible? For example, there's a famous competition called the Hutter Prize to see who can data compress Wikipedia the most. And right now the record is 16%. But how do you know? Maybe that's the best you can do with English. Or maybe you can do better.
How can you tell? If you only know that you accomplished 16% with your neural network and you trained really hard, there's no real way of knowing. Well, this graph shows that what we have today is not the best you can do. You take your neural network, train it to predict as well as it can in the usual way. And then you measure the correlations for the actual English text in Wikipedia.
If you did a Markov process, you'd be doomed. But if you do add one of these LSTM RNNs, spend some very nice work showing that this is quite competitive where the world record data compression, and you train this, and then you have it just keep predicting and predicting, sort of hallucinating what's going to come next, you can calculate exactly the same curve that we've talked about from that. And you can see it's totally underpredict-- it's doing very well on predicting what comes next. But it's falling way short by a factor of 10 or so in the information it has about longer range structure and the text.
So I think this is another fun example where by being kind of here and connecting ideas from physics with ideas for machine learning, you can not only surprise yourself by finding interesting patterns in nature, but you can actually get practical tools to diagnose what you're doing and hopefully learn to do better. So I want to end by in welcoming you all to this wonderful place here in the middle, at the interface, between physics, neuroscience and AI, I feel that there's a lot of wonderful stuff left to do. Thank you.