TOMASO POGGIO: So I'm Tomaso Poggio and very glad to introduce Klaus. I didn't know you were such a pop-- a rock star, but-- [LAUGHS] but we are finding out. So he has been working on machine learning for quite some time.
I think he's-- he holds a chair in machine learning. There are not many machine learning chairs in the world. So I'm sure there will be many more in the future.
So he is a member, among other things, of the [INAUDIBLE] Academy, which is a quite old and unique honor to have. And he organized, among other things, last year's incarnation of a series on deep learning that started several years earlier in Tokyo. This was organized in Berlin during last summer. It was a great workshop.
But more importantly, I think Klaus is really a solar presence this morning. I bumped into him in the CBMM space upstairs. And he was there, sipping coffee, writing equations on his laptop, listening to Schubert.
He was basically back being a student, away from organizational duties in Berlin, and really, in paradise. So that's great. And so Klaus will speak about machine learning and AI for the sciences, towards understanding.
KLAUS ROBERT MULLER: OK. Yeah. Thank you, Tommy, for bringing me here. And I'm frightened by this crowd. Although it actually reminds-- I mean, it reminds me of my lecture. I have a specialized lecture on machine learning, currently, at TU Berlin, which has 500 people attending. It's a specialized one. So it feels a bit like home, right? Although the lecture hall is a bit larger.
Yeah. So it is wonderfully inspiring to be here. So I got inspired by MIT and Schubert this morning. So this is good. So I will, in the next hour, try to talk about some technical aspects of machine learning and AI.
I will give a lot of applications in the sciences, which I find important. So there are some technical contributions and some contributions that are nice for scientists. So you can choose whatever you are, right?
So I-- one of the main points is trying to understand single decisions for non-linear learners. And I will introduce some algorithm that we have come up with in 2015, which is called Layer-wise Relevance Propagation. And as I said, I will give some scientific applications.
So quite often these days, people use machine learning in different types of sciences. And quite often, they use linear methods. And you wonder why because the nonlinear ones are so much better. And the reason is very simple. Because they want to understand their stuff.
They use machine learning as a tool, but they want to find some scientific insight. And the reason they cannot use machine learning, so far, in particular the nonlinear part, was that it's considered a black box. You don't know what's happening inside.
And I would like to convince you that this is not the case anymore. So it's not necessary to use linear methods anymore if you want to understand stuff. OK? And I will give you an example.
So maybe I will just stay here. OK. So assume that you have trained a large neural network. Or somebody has. So for example, this is AlexNet that we are using. And so if you put this picture into AlexNet, it will answer that this is a rooster, because it's a rooster. OK?
And so then you wonder, why does this-- you know, it's AlexNet, so it's a very complicated neural network. So you wonder what is happening in AlexNet. What makes AlexNet think that this is a rooster. And that's a problem. Because usually you can't go backwards through non-linearity.
And Sebastian Bach, OK, who now has changed his name to Lapuschkin, has found a solution to that. And I'm also a part of this paper. And so basically, given the single decision for this single picture of this trained network, you can go backwards and create something which we call the heat map. And this heat map shows you what part is relevant for this single decision. OK?
And green is neutral, and red speaks for the rooster. And so if you take a look at it, it will be the red stuff that sits on the rooster's head. This is the rooster-ish part of the rooster. OK? This is what the neural network thinks. And I sympathize.
OK. So let me just briefly tell you why this is actually a difficult endeavor. So I already mentioned that if you do linear classification, things are simpler. So here is a classical data set. It's the Iris data set.
And you have different types of irises, and that's the sepal, here. And you have the sepal length and the width of these different iris types. And you plot them in 2D. So it's not a very fascinating data set, but it's a 2D data set. So we can learn something from it.
And if you take a linear method, that clearly doesn't separate well between the irises. You know that in this direction, which is sepal width, there's a difference between these different classes. Because you know this is the separation, this way is what explains this classification. OK?
As you can see, this is a nonlinear problem. And the linear classification doesn't lead you much anywhere. So if you have a nonlinear classification-- so maybe this is your classifier now.
So the reason why this is considered-- you know, the setosa, the red point here, in this case, is the sepal width. In this case, it's the mix of the width and the length. And in this case of the green, the virginica, it's really only the length that is the difference.
So the explanation for every single iris in this case is a different one. So it's not one single thing. It's a local thing, and this is why it's difficult to actually explain non-linear methods. And you could just get an intuition for that from the picture.
Now, if you think about more high dimensional things that I cannot plot into 2D anymore. So boats. OK? So assume that you would like to classify this as a boat. Then we will use this LRP thing that also gave us the nice heat map of the rooster.
And this boat is considered a boat because it's the wheel house, essentially, that the neural network finds interesting to classify that. And here, it's more-- it's a different type of boat, and clearly the sails are the most important thing. By the way, thanks for coming here, this great wind. I mean, and not sailing.
But here there's another boat which is in the desert. Right? So it's not even on water. But it has a characteristic bow. So the neural network actually thinks that this is a boat because of the bow. OK?
Now, you can see, you do the classification for every single data point, and you explain for every single data point. This is much different from what you learned from feature extraction. Feature selection. Sorry.
So in feature selection, you look at the whole ensemble of the data of one class, and then you ask, what are the features that make this whole ensemble, all of the boats, what make them a boat. OK? What are the features that are most likely on if you classify this as a boat. But it's an ensemble view, so you need all your-- the whole data for your whole class.
So if you apply this, then you get this type of picture. OK? So this is why the whole ensemble is considered a boat. OK? So what does it tell you? It just tells you that boats in pictures are typically in the middle. OK?
So I'm just showing you this result to make the difference between the single classification and the ensemble view and feature selection. So feature selection doesn't allow you this part. And I will try to explain to you how to get there. And why we can wave goodbye to the black box machine learning.
And by the way, I'm talking about neural networks, but you can also apply this to any non-linear learning machine. So support vector machines, kernel methods. You can basically do this explanation thing for all of them. OK. So.
So clearly, we didn't start this completely. Although our first paper was in 2010 on that. But here, this is a bit of a picture of the scene. And most people have always looked at gradients, at saliencies, right? And I will explain what this is. And I will explain why this LRP algorithm is a different story and ask a different question. OK?
Let me just make a general remark as well. So it depends, really, on who you are when solving a problem whether the ensemble point of view is an interesting one, or the individual one is an interesting one. So say you're a doctor, and you would like to diagnose. Then you basically, if you look it up in the books, then what the WHO says, that's the ensemble view which holds for all patients of that sickness.
But if you are the patient, you would like to have your diagnosis correct for yourself. You don't care about the ensemble of all the other people. You couldn't care less.
So OK. So let's just talk a bit about this layer-wise relevance propagation. So it's something where we have created a theory. So it's not only an algorithm.
There's lots of algorithms where people try to make some explanations. But there's a few theories out there where you can say what the heck you're doing. So what I try to say about explanation is-- and I showed you the pictures-- is which pixels contribute how much to the classification.
This is why you get the rooster thing and the boat bow and all that. And sensitivity or saliency asks a different question. It asks which pixels lead to an increase or decrease of the prediction score when changed. OK?
And so this is just taking the gradient of the function that you have estimated, or that your neural network is implementing. And deconvolution is, again, something else where you have rather the ensemble view again, and you try to find some underlying pattern of your class that represents this best.
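To make the sensitivity question concrete, here is a minimal sketch in Python. It assumes a hypothetical toy scoring function `f` standing in for a trained network, and it estimates the gradient by central finite differences rather than by backpropagation. None of the names come from the talk.

```python
import numpy as np

def f(x):
    # Hypothetical toy "network": one ReLU unit plus a linear term.
    return max(0.0, 2.0 * x[0] - 1.0 * x[1]) + 0.5 * x[2]

def sensitivity(func, x, eps=1e-5):
    """Central finite-difference estimate of the gradient of func at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (func(x + step) - func(x - step)) / (2 * eps)
    return grad

x = np.array([1.0, 0.5, 0.0])
saliency = sensitivity(f, x)   # approximately [2.0, -1.0, 0.5]
```

Note that the third input is zero and so contributes nothing to `f(x)`, yet its sensitivity is non-zero: changing it would change the score. That is the sense in which sensitivity answers a different question from "which inputs contributed how much".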
All of these methods solve different problems, and I think this should be noted first. And we are now looking at the first one because that's what I found most relevant. OK. So how does LRP work?
So assume that you have a neural network. So you put in some picture, and some people like ladybugs. So they put in pictures of ladybugs, and the neural network should, sure enough, classify this as a ladybug, and not as a cat or a dog or a car or whatnot.
So for this, you let the deep network, the ladybug to rattle through the deep network, and then you see what the network says. And then sure enough, it says ladybug, and that's it.
So now how to get backwards. So getting backwards through the non-linearity is non-trivial, as we already saw. Just computing the gradient is not good enough.
So this means that we need to find a way to go through the non-linearity and to decompose this whole non-linearity in a meaningful way so that we can get this heat map. And we start by taking what the neural network has provided us as a prediction. And we call this the relevances, because we are doing layer-wise relevance propagation.
And then we take all these relevances. We take the neural network that has been trained. So these are the weights of the neural networks that have been trained. And we just multiply the weights of the trained network with the relevances that the network has computed in the forward pass and compute the relevance in the next layer backwards.
We normalize this, and we multiply this with the forward pass activity. So you can do this in every layer backwards. It's like error backpropagation, but we are propagating relevances back. And the formulas are not the same as error backpropagation, but if you want to understand what is the theoretical interpretation of this, I will do some hand-waving, because that would be a very long lecture, otherwise.
So you can read this up in Montavon et al., Pattern Recognition, 2017. So think about this non-linear surface in some high-dimensional space that the neural network or the learning machine implements. So you could do a Taylor expansion of that. A global Taylor expansion.
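The backward step described here (weights times relevance, normalized, times the forward activity) can be sketched as follows. This is a minimal sketch and not the speaker's implementation: it assumes dense layers without biases, a ReLU hidden layer, and an epsilon-stabilized propagation rule. The weights `W1`, `W2` and the input `x` are made-up illustration values.

```python
import numpy as np

def lrp_dense(a, W, R_out, eps=1e-9):
    """Redistribute relevance from a dense layer's outputs to its inputs.

    a:     input activations from the forward pass, shape (n_in,)
    W:     weights, shape (n_in, n_out)
    R_out: relevance of the layer's outputs, shape (n_out,)
    """
    z = a @ W                                    # total contribution per output
    z = z + eps * np.where(z >= 0, 1.0, -1.0)    # stabilizer, avoids division by zero
    s = R_out / z                                # normalize the relevance
    return a * (W @ s)                           # weights times relevance, times activity

# Hypothetical "trained" weights and one input example.
W1 = np.array([[1.0, -0.5, 0.2],
               [0.5, 1.0, -1.0],
               [-1.0, 0.3, 0.8],
               [0.2, -0.7, 0.5]])
W2 = np.array([[1.0, -1.0],
               [0.5, 0.5],
               [-0.3, 2.0]])
x = np.array([1.0, 0.5, 0.2, 0.0])

# Forward pass.
h = np.maximum(0.0, x @ W1)
out = h @ W2

# Start from the score of the predicted class and propagate backwards.
k = int(np.argmax(out))
R_out = np.zeros_like(out)
R_out[k] = out[k]
R_h = lrp_dense(h, W2, R_out)
R_x = lrp_dense(x, W1, R_h)    # input-level "heat map"
```

In this toy setting the input relevances sum to the score of the predicted class, and the input that is zero receives zero relevance.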
So that is a bit of a problematic thing because you need all these higher orders of the Taylor expansion. So this is difficult to estimate, and it won't cut it. Right? So therefore, you need to do something else.
So you could say, why not do a local Taylor expansion at every neuron around some local root point that you choose appropriately? Because a local linear Taylor expansion can be shown to be equivalent to a global nonlinear one, just not on this slide. OK?
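In symbols, the local expansion can be sketched as below. The notation is assumed here and is not taken from the slides: $f$ is the function implemented by a neuron (or the network), $x$ is its input, $\tilde{x}$ is the chosen root point with $f(\tilde{x}) = 0$, and $\varepsilon$ collects the higher-order terms.

```latex
f(x) = f(\tilde{x})
     + \sum_i \left.\frac{\partial f}{\partial x_i}\right|_{x=\tilde{x}} (x_i - \tilde{x}_i)
     + \varepsilon,
\qquad
R_i = \left.\frac{\partial f}{\partial x_i}\right|_{x=\tilde{x}} (x_i - \tilde{x}_i)
```

With $f(\tilde{x}) = 0$, the first-order terms $R_i$ are what gets assigned to each input as its relevance.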
So that's the idea. OK? So then that's the metaphor. So we are trying to do a global Taylor expansion in a very smart way, decomposing locally. OK? And this is why we call this deep Taylor. Because it's so deep.
OK. So now we go from layer to layer. We compute these relevances, and we make sure that no relevance is added and none is subtracted. So there's the relevance conservation principle.
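Written out, the conservation principle says that the total relevance is the same in every layer and equals the network output. The symbols are assumed for illustration: $R_i^{(l)}$ is the relevance of neuron $i$ in layer $l$, and $f(x)$ is the score being explained.

```latex
\sum_i R_i^{(1)} = \dots = \sum_j R_j^{(l)} = \sum_k R_k^{(l+1)} = \dots = f(x)
```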
So now we can look at the results, and we can see already, we've seen some results before of boats, but we have never seen this contrasted with the sensitivity, which looks at the gradients. So the difference here, for this picture, which is the scooter class. So the neural network considers this a scooter picture.
So now if we look at the gradients only, it looks like a mess. And in particular, there's activity just in the middle of the road, where there's obviously no scooter. Why? Because sensitivity asks, how do I change the pixel in order to make this more or less a scooter? And obviously, we change the pixels on the street in order to make this a scooter. Because typically, scooters are not in the sky.
Oh, here, LRP, it gives a different answer. And you see the tires and parts of the back light and things like that. So I think that it's quite clear that this is a nicer explanation, because it grasps the concept of the scooter on this [INAUDIBLE].
So maybe this is a bit of a technical thing. If you take a very simple example, sensitivity actually has a problem-- or gradients have the problem if there's some non-continuous changes of the error function, which gives a lot of noise. LRP is continuous.
So now we can look at what does a neural network say to this three? Why is it a three? Because there's nothing here and nothing here, and there's something here. And this is why this is a three. And for this three, it's a different explanation. For this one, again, it's a different explanation.
If you do deconvolution, it's always the same thing. But you can also ask, what makes this a three? Or why is this a nine? OK? And clearly, it's not a nine. So the neural network thinks this is a three.
But you could ask, if it was a nine, what would speak for it, and what would speak against it? And this part is blue. It speaks against it. OK? Because there's nothing. In nine, there should be something. OK?
The nice thing about the LRP construction is that, by construction, you have relevance conservation. So you [AUDIO OUT] do statistics about it. Because it's well-normalized. You can't do this with the other methods because it's not well-normalized.
OK. Now you can look at many pictures, like pictures of dogs, birds, and you can compare all this, and you can [AUDIO OUT] because it looks nice. But this is not very scientific. If you look at the big databases, they have 30 million pictures. Would you want to look at all these?
So some more. OK. So is there a systematic way to compare different ways of explaining? OK? And there's a very simple idea. So this model gives us a heat map. What is important for this three. What is a good explanation.
And you could just say, well, let's just flip these pixels where there's a lot of heat, and let's see what the classifier does, OK? As opposed to random pixel flipping. So now I'm flipping some pixels. OK?
Here's the picture, and I'm flipping pixels. OK. So the classification is still very good. Still good. OK? Bang, it goes down. So now it's not a three anymore. OK?
So this heat map is not bad. It makes sense. But if I look at this heat map, and I do this exercise, and I can flip any old random pixel, I can contrast these two curves, and I can compute the area between these curves. I can average, and this is a well-defined quantity. So I can judge how good these methods are. I don't have to look at 13 million pictures. So that's good.
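The pixel-flipping comparison can be sketched like this. It is a toy version under stated assumptions: the "classifier" is a hypothetical fixed linear score on a six-pixel binary image, flipping means replacing a pixel value `v` by `1 - v`, and the "area between the curves" is taken as the mean gap between the averaged random curve and the heat-map-guided curve. The exact metric in the published evaluation may be defined differently.

```python
import numpy as np

def flipping_curve(score, x, order):
    """Classifier score after flipping binary pixels one by one in the given order."""
    x = x.copy()
    curve = [score(x)]
    for i in order:
        x[i] = 1.0 - x[i]              # flip one binary pixel
        curve.append(score(x))
    return np.array(curve)

def area_between(score, x, heatmap, n_random=50, seed=0):
    """Mean gap between random flipping and heat-map-guided flipping."""
    rng = np.random.default_rng(seed)
    guided = flipping_curve(score, x, np.argsort(-heatmap))   # hottest pixels first
    random_curves = [flipping_curve(score, x, rng.permutation(x.size))
                     for _ in range(n_random)]
    return float(np.mean(np.mean(random_curves, axis=0) - guided))

# Hypothetical toy classifier and image.
w = np.array([3.0, 2.0, 1.0, 0.0, -1.0, -2.0])

def score(x):
    return float(w @ x)

x = np.ones(6)
heatmap = w * x                         # each pixel's contribution to the score
gap = area_between(score, x, heatmap)
```

A larger `gap` means the heat map finds the pixels that matter faster than chance does. Averaging this quantity over many images gives one number per explanation method.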
Now I can do that. And I can look at different data sets. I mean, real data sets, not toy data sets. Like MIT Places. You can play with the parameters. And the higher the curve is, the bigger the difference from random pixel flipping. OK?
So you see that this is much better, these red ones. And this is sensitivity. Green, deconvolution. OK. And the dashed line is random. So what can we do with this tool? We can do a lot of things.
So first of all, it gives us a way to understand what these methods do and also how they solve problems. So here is a data set where every object has a bounding box around it. And so we can take some algorithm of someone.
And so this is the AlexNet again. This is the Fisher vector algorithm. Two popular computer vision algorithms. And we can see them explain the correct classification. Both algorithms say this is a boat, but this one says it's a boat because that's water, and this one says it's a boat because there's this bridge.
And so now we can measure how much of the activity is within the bounding box of the object, and we can understand how much the methods use context and non-context. And we can judge this and gauge this, and we can see that, for example, for boats, the Fisher vector algorithm uses context a lot, as you can see, whereas the deep neural network doesn't. And there are different usages of context in these methods, and we can try to understand that. OK?
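One simple way to put a number on "how much of the activity is within the bounding box" is the fraction of positive relevance that falls inside the box. The sketch below assumes this particular measure and a made-up 4x4 heat map; the measure used in the actual study may differ.

```python
import numpy as np

def relevance_inside_box(heatmap, box):
    """Fraction of positive relevance that falls inside a bounding box.

    box = (row_min, row_max, col_min, col_max), with max indices exclusive.
    """
    r0, r1, c0, c1 = box
    positive = np.clip(heatmap, 0.0, None)   # ignore evidence against the class
    total = positive.sum()
    if total == 0:
        return 0.0
    return float(positive[r0:r1, c0:c1].sum() / total)

# Hypothetical heat map: relevance on the object plus some on the context.
heatmap = np.zeros((4, 4))
heatmap[1:3, 1:3] = 1.0      # relevance on the object (inside the box)
heatmap[0, 0] = 1.0          # relevance on the context (e.g. water)
inside = relevance_inside_box(heatmap, (1, 3, 1, 3))   # 4 of 5 units are inside
```

A value close to 1 means the classifier relies on the object itself. A low value means it leans on context.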
Now we can ask a fun question. OK? So in my class-- and I'm sure at MIT it's the same-- I teach in machine learning that it's the generalization error that we should get up-- that we should get low. OK? Not the training error, the generalization error.
So the metaphor is that you take the data. You do your machine learning model. You do your predictions, and then you get the generalization error. And a lot of papers in computer vision, in machine learning in general, NIPS, ICML, ICLR, they are all of this structure.
You take some data set. You compare n methods. You have a table. Your method is best. Your paper's accepted. OK? Of course you do generalization error. Right?
So now I'm asking the question, is the generalization error all that we need? And I'm not asking this question from the theoretical perspective, I'm asking it from the practical perspective. Because we have data sets that are very large, but are they large enough to capture everything? Because we are not solving the integral that you would have to solve in order to compute the generalization error.
And here is a fun example. OK? We came-- this is from a CVPR paper of Sebastian Bach, now Lapuschkin. OK? The same guy. So he explained two different things. So he had the Fisher vector algorithm and the deep net. Again, somebody's deep net. I think it was AlexNet.
It was compared on a large data set with 20,000 classes, 30 million data points, something like that. And these are classification results, out of sample. So this is the generalization.
So for example, for the horse class, Fisher and deep networks don't-- there's no difference. This is not statistically significant. Now, you can ask, given an image of this horse class, of this generally used database, how do these methods solve the problem?
So this is the deep network that looks a bit to the horse butt, to the rider, and to the nose of the horse. OK? The Fisher vector also looks a bit to the horse butt, and it looks around. And it looks at the lower left corner. And you wonder what is happening. Right?
So then you go and look at this, and it says, [SPEAKING GERMAN] OK, which is German, and it says, "horse pictures archive."
Now, this is not our data set, just-- but it's the generally used data set in computer vision that everybody is doing its generalization game on. OK? So this is complete nonsense. Nobody looks at this data anymore.
I mean, people have now terabytes of data. Petabytes. It's Google-sized data. Nobody looks at that data. And you have all sorts of nonsense in these classes. And of course, new methods do something. They have a way to solve the problem.
Of course there's AI, so you're trying to be intelligent. These are intelligent machines. Is this really an intelligent machine? Of course it is. It does cheating.
But would you like to be diagnosed by this type of machine? No. Would you like to use this kind of thing for science? No. You have to understand methods. Really understand how they solve a problem in order to judge whether something is intelligent or not.
And if you want to do science, you actually need to understand. So I think generalization error is not what we need only, but we also need an understanding of how the problem is solved. And of course, from the theory point of view, I could say, if I have a zillion data points, and they are even sampling [AUDIO OUT] so that the probability distribution is nicely maintained, I can just trust in my generalization error. OK?
But I want to catch these. And sure enough, there's lots of these problems. You can take anybody's network, because they have put it online, and you can find these kinds of fun things. So a lot of people are horsing around these days. OK.
Of course, it's not only about classification, but also computer vision. You can, say, do age estimation. So explanation tells you what your network says and thinks about why this face is a bit older. Well, it's the ear lobes, right? OK?
So if you think about it's the ear lobes that make the age shine, attractiveness, sadness, these are things that you can analyze. You can try to understand how these methods solve these problems. You can apply this to text.
So these are [AUDIO OUT] text, and you wonder whether this is medical, or whether this is motorcycle, or whether this is space. And you can say, what are the relevant features that make this a space text? NASA astronauts, earth. OK. Ride, not necessarily.
So motorcycle, ride is a good word for that, motorcycle. Medicine, discomfort, sickness. OK. So you can play around with these things in natural language processing. Because people have wonderful methods to do natural language processing, you can understand them.
Now some more. So remember, there was this wonderful paper in Nature by our friends from DeepMind playing Atari games. So you can try to see what these methods are doing. All of that is transparent. You can reconstruct these models.
And now you can look at a model that has been trained for some iterations, so it's a good model. And you can see how this explains itself. OK? So maybe it's better to show a video. OK?
So here, this is the LRP side, and this is the sensitivity side. As I said, it's not really something meaningful. It's about the same network. So clearly, network has learned that the ball is something important. And that this thing is important, right? So it's clear. So sorry for this.
So you can see that this [AUDIO OUT] network has learned what is up there. So this is actually a smart behavior. It's really intelligent. But how can you judge? You can judge from the quality that the network has. From how good it plays the game. But also you can judge from the strategic value of how this model actually plays the game.
And there are many things to show that I don't want to show here because I'm running out of time otherwise. So because I want to briefly talk [AUDIO OUT] machine learning now in the sciences. This is like my ultimate hobby. I'm a physicist by training, I'm also a computer scientist. But in my heart, I try to understand the world. I can't help it, right?
So and I try to use machine learning for doing that. OK. So one of my longest hobbies has been building a brain-computer interface. And so we are using EEG. It's the Berlin Brain-Computer Interface. So we have a multichannel EEG.
We do some feature extraction, and we do some classification. Done. OK? Then we can do all sorts of things. So it's clear, brain computer interfacing, as a field, has been around for a long time. It came mainly from the medical side, where the idea was to say, people who are strongly disabled, ALS patients, for example, who are locked in, need to have a means to communicate. OK? And the idea was to re-install this means of communicating, decoding brain activity in a meaningful way.
And when I started engaging in this, the subject had to learn. So in other words, they had to learn for about 100 or up to 300 hours to change their brain signals such that the state of the art of signal processing at that time could decode them.
So we brought machine learning to this so the subjects would think whatever they think, and then they-- we would just decode it. And we could get this to work, and the patient training was reduced from 300 hours to something like five minutes. Now you can do [AUDIO OUT] any training.
So the original motivation to help patients also got-- it's still there, but also there are other things that you can do as well. So I'm just showing you an example of this skill. So you see me a couple of years ago, wearing the EEG cap, sitting there, comfortable in the chair. This is the amplifier. This goes into the laptop, and this controls the cursor down here.
I'm imagining left and right, because left and right imagination [INAUDIBLE] my different motor cortices. OK. So here you go. So I can tell you that it's profoundly fun.
You know, there's a certain saying that-- I mean, I'm-- actually, in this experiment, I'm actually using the imagination of pulling strings with my left hand and my right hand. OK? And so just imagine it.
And so German professors seem to have some skills in that. So there's hope by this technique for patients. And really, it's hope. Right? And there's a huge population of people who have stroke, and nowadays, people like Niels Birbaumer and others are trying to use this technique to do better, more efficient stroke rehabilitation. So this now has become a field of its own. So at that time when we started, there were about a dozen groups in the world doing that, and now there's about 450, because it's so easy for people to start this.
Now, of course you can ask new neuroscience questions. You can ask-- you can do the Libet experiment with this kind of technique, watching the thinking and behaving brain in real time. But you can also do something really strange. And I will share this with you because you would probably never think about that.
So some time after coffee, I was wandering with my colleague, Thomas Wiegand, in Berlin, next to the river. And we talked about some nonsensical projects that we could do together. So Thomas Wiegand, he is one of the fathers of H.264, which is the video coding standard. So every second bit in the internet is coded by that.
And so the idea that came up was I said to Thomas, well, why not have something like mp3 [AUDIO OUT]? Just code the stuff that we actually perceive. OK? He said, well, that would be great. But how can we measure this? Well, let's use a BCI.
And of course, we were stupid enough to put this in a grant proposal, and the reviewers said, this is complete nonsense. It will never work. Now, it's-- as you know, some reviewers are nice people, but they're not always right.
And so we started this. And trying to measure with the BCI how the brain reacted to changes in the compression quality. And the idea was to actually use this as a basic neuroscience experiment to understand how things are perceived.
So we did SSVEP experiments. We did ERP experiments. And the bottom line is what we learned [AUDIO OUT] of compressed video data was enough to make these engineers tweak their coding to the point that they could save some bits.
So about two years ago in the Champions League Final in Berlin, that was broadcast by this changed code that they have constructed at the Heinrich Hertz Institute. The interesting part of this is-- and maybe this is for the-- you don't have a feeling for this.
I mean, we did basic neuroscience experiments. But if you save 5% bits or 10% bits, this amounts to a reasonable number of nuclear power plants on the global scale. So it's usually-- it's a very-- I mean, think about it. It's a very strange idea. That it actually shifts to something interesting in the end. Useful, at least.
Now, one thing is that we can do-- and this is why I put this into this talk-- is you can also use a neural network for classifying. It actually works very well. But now we can also apply this explaining method.
And the good thing is that the explaining method can be applied in every single trial. So that's interesting. Because a lot of times, you do your experiments. You do your cognitive experiments. And you get some results from your decoder, and it's wrong, but you don't know for what reason.
Sometimes the subject is just not paying attention. Sometimes it's sleeping. Sometimes it's just doing something else. Sometimes whatever. Didn't understand the instruction. So there's different reasons for that. We can sort this out.
If we average over all this, then we get the very beautiful pictures of the motor cortices, which are the relevant explanations for left and right imaginations. But you can apply it to every single trial. OK?
Now I change gears, now. I'm aware-- I mean, this may be a machine learning crowd, a neuroscience crowd. I'm not sure whether there are some physicists in the audience. So I-- yay. [LAUGHS]
So I-- OK. So this goes back to 2011. OK? So there's a place at UCLA that's called IPAM, Institute for Pure And Applied Mathematics. It's a wonderful place, in particular for Europeans who have a very gloomy winter.
And my [AUDIO OUT] up, and my wife said, well, you go. You can't do anything wrong in California for a few months. And then I went there. And I didn't look carefully at the program, and it turned out that this program was on quantum chemistry. Which is a bit far from machine learning, somehow. And I was the only machine learning dude there.
And so I met with these and many other guys, and we started something adventurous. And so the idea that I had then was the following. There's this innocent-looking equation that Schrödinger did. Right? The Schrödinger equation, which gives the quantum mechanics of molecules and materials and everything.
So the problem is it looks innocent, but it's really hard to compute. So you can only [AUDIO OUT] compute it, and then you still need supercomputers. So for a small molecule, you need about five hours of computing time and a decent approximation. So if you have a better approximation, you may need seven days per sample.
So I [AUDIO OUT] perhaps it may be a good idea to get a lot of training data from the approximations of the Schrödinger equation. And instead of solving the Schrödinger equation by simulation and by approximating it, we treat it as a stochastic problem. Predict the outcome of the equation. OK?
So we are just completely ignorant to mathematics, because mathematicians solve their equations, typically, and we just predict the outcome of the equation. And of course, people have tried to kill me in various ways, but they couldn't make up their minds. That's good.
And we just did it. Right? And the astonishing thing was that this actually works. And it's also a bit frightening, and I will discuss this in a moment. So how do we do [AUDIO OUT]?
So first of all, we need to get some training data. And so this is like a molecule. And if you do machine learning, you need to say what is the representation. So now here's the atoms, right? Nuclear charges. And they have a 3D coordinate in space.
So now we describe every molecule as a matrix where M_ij, the matrix element between the i-th atom and the j-th atom, is just the Coulomb force. OK? And on the diagonal, we have some Z_i to the 2.4, and 2.4 is due to Anatole. And if you want to know about that, you have to invite Anatole von Lilienfeld.
So this is a matrix that now represents this molecule. Now we can-- machine learning, we compare things. Compare objects. We can also compare matrices. We take the Frobenius norm for this.
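A minimal sketch of this representation and of the matrix comparison follows. The off-diagonal entries are taken as Z_i * Z_j / |R_i - R_j|, and the diagonal as 0.5 * Z_i^2.4. The prefactor 0.5 follows the published Coulomb-matrix definition and is not stated in the talk. The two toy "molecules" are made-up coordinates, not real geometries.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix of a molecule.

    Z: nuclear charges, shape (n,)
    R: 3D coordinates, shape (n, 3)
    Off-diagonal: Z_i * Z_j / |R_i - R_j|; diagonal: 0.5 * Z_i ** 2.4.
    """
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)              # placeholder, avoids division by zero
    M = np.outer(Z, Z) / dist
    np.fill_diagonal(M, 0.5 * Z ** 2.4)      # overwrite the diagonal
    return M

def frobenius_distance(M1, M2):
    """Frobenius norm of the difference of two equally sized matrices."""
    return float(np.linalg.norm(M1 - M2, ord="fro"))

# Two hypothetical two-atom molecules that differ only in bond length.
M_a = coulomb_matrix([1, 1], [[0, 0, 0], [0, 0, 2.0]])
M_b = coulomb_matrix([1, 1], [[0, 0, 0], [0, 0, 1.0]])
d = frobenius_distance(M_a, M_b)
```

The sketch ignores two practical issues: molecules with different numbers of atoms give matrices of different size, and the matrix depends on the order in which the atoms are listed. Both have to be handled before distances between arbitrary molecules make sense.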
Now, of course, people there wanted to be a bit adventurous. They said, well, let's use a neural network. Let's do something really [AUDIO OUT]. Some great machine learning approach.
And I said, no, no. Let's just do something simple. Let's do kernel ridge regression. Because if it doesn't work with kernel ridge regression, it doesn't work with anything.
And so we just put Gaussian bumps onto the molecules. The distances between the molecules. And then you get your estimate of the energy of the new molecule by just summing over all the Gaussian bumps with their respective alphas. And you get them by just inverting your matrix.
So what we actually did was-- in fact, Alex Tkatchenko did this. He used the Max Planck supercomputer. Let it run for 7,000 times 5 hours. Then we have some data.
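The "Gaussian bumps with their respective alphas" can be sketched as kernel ridge regression with a Gaussian kernel. This is a minimal sketch under assumptions, not the code used in the work described: the names `fit_krr`, `predict_krr`, `sigma`, and `lam` are hypothetical, the linear system is solved with a small hand-written elimination routine instead of a numerical library, and the toy data are scalars with absolute difference as the distance, where the talk uses distances between molecule matrices.

```python
import math

def gaussian_kernel(d, sigma):
    """Gaussian bump evaluated at distance d."""
    return math.exp(-d * d / (2.0 * sigma * sigma))

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def fit_krr(train_x, train_y, distance, sigma, lam):
    """Return the alphas solving (K + lam * I) alpha = y."""
    n = len(train_x)
    k = [[gaussian_kernel(distance(train_x[i], train_x[j]), sigma)
          for j in range(n)] for i in range(n)]
    for i in range(n):
        k[i][i] += lam  # ridge term on the diagonal
    return solve(k, train_y)

def predict_krr(x, train_x, alphas, distance, sigma):
    """Sum the Gaussian bumps centred on the training points, weighted by alpha."""
    return sum(a * gaussian_kernel(distance(x, xi), sigma)
               for a, xi in zip(alphas, train_x))

def abs_distance(a, b):
    return abs(a - b)

# Toy 1-D data standing in for molecules and their energies (hypothetical)
train_x = [0.0, 1.0, 2.0, 3.0]
train_y = [0.0, 1.0, 4.0, 9.0]
alphas = fit_krr(train_x, train_y, abs_distance, sigma=1.0, lam=1e-8)
```

With a very small `lam` the fit nearly interpolates the training targets; in practice `sigma` and `lam` would be chosen by validation on held-out data.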
We take, say, 1,000 data points out of it. Use this for training. And then we look at the remaining 6,000 and see whether [AUDIO OUT] we get a good result, in terms of the prediction quality.
And then you can see, this is the number of samples. This is the mean absolute error. In kcal per mole, which for chemists-- it's the relevant quantity to measure.
So in the program was [AUDIO OUT] 11. In the last week of the program, we submitted a paper to PRL, Physical Review Letters. And then it was-- it appeared in March. So 10 kcal per mole was the result. And later, we did something at NIPS, and we got 3 kcal per mole. Now we have 1.3 kcal per mole. And our recent NIPS submission is 0.3 kcal per mole.
Chemists are happy if this is below 1 kcal per mole. This is what they call chemical accuracy. So the interest-- the worrying thing about this is that why should this work? OK? I mean, I'm a theoretical physicist by training, so I'm deeply worried about this.
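The evaluation protocol described here, holding out most of the data and reporting the mean absolute error on it, could look roughly like the sketch below. The helper names, the random seed, and the four energy values are invented for illustration and are not results from the talk.

```python
import random

def holdout_split(samples, n_train, seed=0):
    """Shuffle a copy of the samples and split it into training and test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def mean_absolute_error(predicted, reference):
    """Average absolute deviation between predictions and reference values."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(reference)

# 7,000 sample indices, 1,000 for training and the rest for testing
train_idx, test_idx = holdout_split(list(range(7000)), n_train=1000)

# Made-up energies in kcal per mole, only to exercise the error measure
reference = [-10.0, -12.5, -8.0, -15.0]
predicted = [-9.0, -13.0, -8.5, -14.0]
mae = mean_absolute_error(predicted, reference)
```

Repeating this for growing training set sizes gives the kind of curve of error against number of samples that the slide shows.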
But unfortunately, I cannot answer this. This is a deeper question that we may need some more years to figure out. But if we figure it out, you will hear [AUDIO OUT].
So let's say we're more adventurous now with the methods. And one of the methods that we used was a deep neural network. It was a tensor neural network. So this was a-- if you want to read up on this, this was published in Nature Communications earlier this year.
So the idea is, again, you take the molecules. You get features from it. So this is like the Coulomb distance. Then the model looks a bit complicated now. OK? And I think I'm-- I started a bit later. But I may have to come to an end also.
So you can ask me about the details in the questions afterwards. And so the idea is [AUDIO OUT] so many of you may know word2vec, right? So word2vec is the algorithm, state of the art, that translates representation from text, including the context, to a vectorial one.
So this is very complicated because text is a complicated beast. Natural language processing people know that, of course. That's their business. So chemistry is also a complicated beast. Because you have an atom and its environment. So the atom sits and interacts with the other atoms around it, but it's not only the atoms around it, but it's also the atoms of the whole molecule. And of course, if it's in solution, then it's even worse. Right?
So the idea of this model was actually to find-- to learn a representation. So the Coulomb matrix representation was a very innocent one. [AUDIO OUT] one. Here you would like to learn the representation in a similar way as word2vec, but in chemical space. OK?
So we do this in the following. So this structure here is repetitive. OK? So we have the distance. So this represents one atom. And you can repeat this graph for how many atoms you have in the molecule.
So you basically try to approximate in the low dimensional space, in the 3-dimensional space, how the distances between these atoms [AUDIO OUT] the molecules are represented. How they are distributed, right? And then you expand them with some parameters. OK?
So then you put this into what we call an interaction layer. So we take this information, and then in the first layer of this, we look at the atoms and just the atomic representation. Then we do the same story again. Feed in what we got from the atomic representation and get the interactions between two atoms and more, and three, and four, and we do this for a while. OK?
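One common way to "expand" a distance with some parameters is to evaluate it against a set of Gaussians with different centers, so a single number becomes a feature vector that a network can consume. The talk does not spell out the expansion, so the Gaussian form, the name `expand_distance`, and the values of `centers` and `gamma` below are all assumptions chosen for illustration.

```python
import math

def expand_distance(d, centers, gamma):
    """Turn one interatomic distance into a vector of Gaussian features."""
    return [math.exp(-gamma * (d - mu) ** 2) for mu in centers]

# Hypothetical grid of centers from 0.0 to 3.0 in steps of 0.5
centers = [0.5 * k for k in range(7)]
features = expand_distance(1.0, centers, gamma=10.0)
```

The feature whose center is closest to the actual distance is the largest, so the vector encodes the distance softly rather than as a single raw number.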
So I will skip this part, which OK. So this is some equations. But practically, this is like your representation that you're building for one [AUDIO OUT] for one atom. This is fed into the next layer that's exactly the same layer, with weight sharing and everything. And then you have the interaction between atoms.
And then you do this again many times. OK? So then you can see, this is a well-scaling model. So if you happen to have more examples, then just taking the atomic environment gets you to a 1.5 kcal. Taking the interactions gets you lower. Right? Which makes a lot of sense because chemistry is about bonds and everything.
So with this method, we can train it on some data set. And for example, on 25,000 compounds, and then we can predict whatever [AUDIO OUT] chemical compounds space. But with the same model, we can also do something that is called molecular dynamics.
So this is for one molecule. The molecule wiggles around. OK? And so you have this molecular dynamics trajectory over time. So again, this is super expensive to get. Right?
And then with a few data points, you try to learn in a similar way. To do the molecular dynamics simulation. Of course, you can do the molecular dynamics simulation very quickly then, with the neural net. And if you look at this, you can see the predicted and the true curve. OK? So well, you can hardly see any difference.
Now, the interesting question is did this model that is trained [AUDIO OUT] energies, did this model actually learn something about physics or chemistry? OK? Because in the end, it would be nice to have a very well predicting model and to get from it some chemical insight.
Perhaps [AUDIO OUT] there as well. And somebody lost a tag in DFT. But it didn't seem so. So we did use the model that we have trained.
And now we can put something like a test charge on the neural network model. And with this, we can compute the potential, and the potential is now in quotes. The chemical potential. Because it's explosive terrain in chemistry if you use this word and not put it in quotes.
So assume that you have a hydrogen atom. And you can now see, look at benzene and see where would this hydrogen atom like to bind? Where would the carbon atom like to bind? And so you can ask all these questions. And note, we have not told this model how to do chemistry. It just predicts the energy from the coordinates. OK?
So interestingly, we can do more interesting things than binding, because that can be done otherwise as well. For example, we can take all the compounds that have a benzene ring in them. They're called aromatic. We can look at the aromaticity of them, and we can order them according to our model.
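The probing idea, moving a test atom around a fixed molecule and asking a trained energy model where the energy is lowest, can be sketched as a simple scan. Everything concrete here is a stand-in: `toy_energy` is a hand-written pair potential with a minimum at distance 1, used only so the example runs, and is not the neural network from the talk. The function names and the scan grid are hypothetical.

```python
import math

def toy_energy(charges, coords):
    """Stand-in for a trained energy model: a simple pair potential.

    Each pair contributes 1/r^12 - 2/r^6, which has its minimum at r = 1.
    The charges are ignored by this toy model.
    """
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            total += 1.0 / r ** 12 - 2.0 / r ** 6
    return total

def scan_probe(energy_fn, charges, coords, probe_charge, probe_positions):
    """Evaluate the model with a probe atom placed at each candidate position."""
    results = []
    for pos in probe_positions:
        e = energy_fn(charges + [probe_charge], coords + [pos])
        results.append((e, pos))
    return results

# One atom at the origin, probe hydrogen moved along the x axis
positions = [(k / 10, 0.0, 0.0) for k in range(6, 21)]
scan = scan_probe(toy_energy, [6], [(0.0, 0.0, 0.0)], 1, positions)
best_energy, best_position = min(scan)
```

With a real trained model in place of `toy_energy`, the same scan over a 3D grid would give the kind of "potential" map described in the talk.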
This has not been trained at all. It's implicitly learned. And chemistry people are happy about what our model does. I can't tell you much about it. Because I'm not a chemist. Nobody's perfect.
So but interestingly, we [AUDIO OUT] do that, but we can also think about there's a group, right? And we'd like to think about the molecules that contain this group that are most stable, or something like that.
OK. But anyway, so I will come to an end now. I will try to say why explaining non-linear models is essential. And it's also orthogonal to improving neural network models.
So we-- now I'll do some shameless self-advertisement. We will have a workshop at NIPS that asks the question, now we have an interpretable model, now what? So what can we do with it? So we can look at nice pictures. We can convince ourselves that the models are doing the right thing. We can understand how they solve problems.
But maybe there's more. And that's being discussed at NIPS. [AUDIO OUT] So there's absolutely a need to open black boxes in the sciences and in medicine.
There's theory. This is the deep Taylor expansion that I briefly alluded to. And I think understanding models is really essential for the progress of AI and science and for understanding intelligence in the first place. So with this, I just end my talk and ask for questions.