DeepONet: Learning nonlinear operators based on the universal approximation theorem of operators.
- All Captioned Videos
- Brains, Minds and Machines Seminar Series
KENNETH BLUM: Welcome, and welcome to our speaker George Karniadakis. George is probably really nervous, I'm guessing. He is an applied mathematician with wide ranging interests. They range from stochastic differential equations applied to various physics problems and life science problems-- computational fluid dynamics figures in there heavily-- and more recently, meaning in the past decade plus, a lot of work on machine learning for scientific applications. And that, I think, will be the category that encompasses today's talk.
He was an undergraduate at the National Technical University in Athens in mechanical engineering and came to MIT to get his master's and PhD, also in mechanical engineering. I think he did these under Anthony Patera and Borivoje Mikic. He then held a number of positions, including a faculty position at Princeton, and then ended up where he is now, at Brown University as a professor in applied mathematics, a fancy named professorship but with two long names that I can't remember at the moment. And we're very happy that he's going to talk to us about what he has dubbed DeepONet. George, welcome.
GEORGE KARNIADAKIS: Thank you very much. I believe that we are at the crossroads of AI right now. And if we want to be critical, I would say we are at the stagnation point.
And I would like to give you an example of that. It's the recent GPT-3 from OpenAI. It's roughly 100 times bigger than GPT-2-- that's a good thing-- and it has 175 billion, with a B, parameters to train. It takes about 355 GPU-years to train it and, in terms of money, about $5 million. And their savvy CEO, Sam Altman, tweeted recently that GPT-3 still makes silly mistakes.
So true intelligence requires really-- scaling up is one avenue, but there are limits. It requires a higher level of abstraction. And such abstractions-- it's my thesis-- can be effectively represented by nonlinear operators, that is, nonlinear mappings from one or multiple functional spaces to another. So imagine, for example, here in this picture that I have the robot, and you try to endow this robot with mathematical intelligence. Well, would you teach this robot calculus?
I asked my daughter. She says, "God, no," as she's finishing up high school. Because that's very tedious. Well, if you do that, then let's say the robot will try to use what I used to do, solve numerical PDEs and so on, to predict something, to move somewhere, to check the weather and so on.
But that will take an enormous amount of computing. So this robot has to go around with an exascale computer on its head. That's a lot of energy. That's a lot of power. And you know, you all know here-- you are the guys who are deconstructing the brain to understand it-- that even a small chocolate bar provides sufficient energy for human intelligence, for all sorts of operations. So it's also the energetics coming in.
So let me go to the second slide. I hope you see my second slide. It's the universal approximation theorem for functions. And almost every paper on neural networks published today-- about 100 or maybe 200 today, about 1,000 per week; last year, I checked, it was 100,000 papers on neural networks-- is based on the universal approximation theorem for functions. But what I want to tell you today is something else. I want to give you a higher level of approximation, the universal approximation theorem for functionals and nonlinear operators.
So why is it different? Well, if you look at what we are doing today for image classification, we take an image like that, a vector in R^d1, and we map it to R^d2. When you deal with an operator, you have a function, and you go from an infinite-dimensional space to an infinite-dimensional space. So you can have multiple functions as input, multiple functions as output. So it's a very, very different setup. Now, it's obviously a higher level of abstraction.
Now, what this operator is, this operator could be as simple as a derivative-- talking about calculus-- an integral, and I'll show you some examples. It could be a complex dynamical system. It could be an ODE, Ordinary Differential Equation, partial differential equation, stochastic differential equation, fractional differential equation. If I have time, I'll show you all that.
But it could also be another biological system we don't quite understand. We map a function x(t) in one space to a function y(t) in another space. It could be a social system. It could be a system of systems, and I'll try to give you an example of that at the end.
So then, can we learn these operators with neural networks, and how do we do that? And so in some sense, broadly speaking-- and I would like to get to this question: can we learn operators? How do we do it? How fast can we do it? And so on-- but before I get there, I want to give you a couple of teasers, because I know there are some groups here working on generalization, and generalization is a big question and so on.
So I would like to address the generalization question, which is-- that's also what I've tried to do with these operators. I want to extrapolate. I want to go outside the distribution space, and I want to have a small generalization error there.
So let me give you just a very-- before I get to the operators, I want to give you a very brief overview of two topics that I've been working on recently. One is just a standard classification problem. You can see here, k categories. Now, I'm trying to quantify generalization there, but I will do it in a very different way than you have usually seen, for example, via the stability of stochastic gradient descent or some other methods. I will try something different here. I want to give you some introduction to that, and then hopefully, you will be interested in our paper that was published recently.
So this schematic here shows the error, which can be broken down. This is the hypothesis space. This is the approximation error here, which depends on the network size. You can make that smaller. But this is the big elephant, of course, the generalization error. How can you handle this?
So I'm going to look at that-- so no operators here yet, no physics, just pure classification. How do we approach it? And as I said, we approached it from a very different point of view. And just as the title of the paper says, we will try to quantify the data distribution and also the smoothness of the network. And we introduced these new concepts, if you like.
So for example, we introduce the probability of a neighborhood of the training set. In this panel A, if we have a datum here, we take a ball of radius r around it. And the probability of that neighborhood is what is plotted here. And the area under this curve, starting from r equal to 0, will give me what I call the total cover.
If I have two classes, the red and the blue, then I want to introduce this self cover. T_i is the training set for the i-th class. Mu_i is the corresponding measure. So the self cover would be something like this. It's just notation for now.
And then correspondingly, in panel D, I'll introduce also the mutual cover between-- now you can see, the red class interacts with the blue class. I also need to see how sparse are my data, how far is label one from label two. Ideally, I would like to know the minimum distance delta 0, but I don't have enough data to be very precise, so I will introduce this empirical distance delta T. And finally-- so this is the data distribution.
And finally, I will introduce something about the inverse modulus of continuity of the network f. So if I allow a change epsilon in the output of the network, I get a delta_f corresponding to that epsilon. So delta_f here is the inverse of the modulus of continuity.
So what we did, we proved the following theorem. First, we define the cover difference, which is basically an average over the number of classes here: the average of the self cover minus the average of the mutual cover. And then I define the cover complexity. If it's 0, it means that the problem is very, very easy to predict, and so on.
So here's the first theorem, which uses this assumption. In the paper, we justify this assumption. One thing that we use here, a kind of strong assumption, is that the maximum cross-entropy loss is bounded. But actually, in the experiments, we just took the average. We don't need the maximum, because that's a very strict constraint.
But here is-- the main result is somewhere here in the middle bullet, which says that the error is bounded by a coefficient that depends on the training set and the smoothness, it turns out, times this cover complexity that we defined. Just to note that alpha of T depends on the weight parameters and the smoothness, and I will connect that. I just want to keep this simple. I cannot explain everything here.
But just to show you what happens in this framework, we found, surprisingly, that the error, just as the theorem predicts, for all these familiar benchmarks, grows linearly with the cover complexity, as you can see here, for the different numbers of classes, 10 classes, 20 classes, 100 classes. So the theorem provides that. Now, empirically, we then found that if we normalize the error with the square root of the number of classes, everything collapses onto one curve. So we have a universal master curve for all these cases.
Is that universal for everything? We don't know. There are some gaps in the theory. The theory is not totally complete yet, but I think it's a new theory.
Now, I want to connect-- so this is a distribution of data, cover complexity. We also tried to connect this to the smoothness of the network. Of course, it's very difficult to characterize delta f directly, but in this inequality that we proved, we can find a lower bound for this modulus of continuity in terms of the loss function and the weights that we can compute. So this capital delta f is a computable quantity.
So what I have plotted here for this MNIST training set is actually the testing loss. It's the blue curve. You can see the minimum point, and then we can see the overfitting.
Now, how does this relate to the smoothness of the network? Well, the red curve shows you that, because the red curve is a measure of-- of course, it has the loss in there, but the loss matters here mainly when it's big. Right here, where the red curve starts decaying, if you can see, what we have is a loss of smoothness of the network.
And right here-- it's not a coincidence-- where the overfitting starts is where the smoothness of the network drops. So if you go back, you can basically relate this delta to the constant that we talked about here, which is hidden somewhere. And therefore, one can connect those two things, namely the data distribution and the smoothness of the network. Again, the theorem is not complete, but I was hoping to show this here, so one of the smart MIT students can take it and advance it further. So that's one type of generalization-- just a different approach to the same problem.
The next generalization is something different. It's what Kenny said earlier, that I'm interested in physical laws and unsupervised learning. So we published this paper. We call them PINNs, Physics-Informed Neural Networks. They are being used now in many different industries, from NVIDIA-- which has a parallel code on this-- to Ansys, the biggest simulation software company in the world, and so on, for physical problems. It's agnostic to the type of physics, actually. The physics comes in through regularization.
So what is a PINN? Actually, I wear a pin. I don't know if you can see my t-shirt, but it's a very simple thing. It's a neural network. Let's say you're trying to learn u of x and t, and you have lots of data. Then you have a corresponding loss, mismatch of the data, and so on.
But we don't have enough data in science. We never have enough data. You know that. That's very expensive, and they're not reproducible. So we have very little data.
But what we have is conservation of mass, momentum, energy, and so on. Here, I show an example of a parameterized [INAUDIBLE] equation. So you have to satisfy that. By insisting on this, I get another loss-- the residual of this conservation law-- which I can weight and add to the total loss, and then I can compensate for not having data.
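For readers who want to see the mechanics, here is a minimal sketch of such a composite loss in PyTorch. The specific parameterized equation on the slide is inaudible above, so the viscous Burgers equation u_t + u u_x = nu u_xx is assumed purely as an illustration; the network size and the equal weighting of the two terms are arbitrary choices.

```python
import torch

# Small network approximating u(x, t); sizes are illustrative only.
net = torch.nn.Sequential(
    torch.nn.Linear(2, 50), torch.nn.Tanh(),
    torch.nn.Linear(50, 50), torch.nn.Tanh(),
    torch.nn.Linear(50, 1),
)

def pinn_loss(x_data, t_data, u_data, x_col, t_col, nu=0.01):
    # Data mismatch term: the few measurements we actually have.
    u_pred = net(torch.stack([x_data, t_data], dim=-1)).squeeze(-1)
    loss_data = torch.mean((u_pred - u_data) ** 2)

    # PDE residual term at collocation points, via automatic differentiation.
    x = x_col.clone().requires_grad_(True)
    t = t_col.clone().requires_grad_(True)
    u = net(torch.stack([x, t], dim=-1)).squeeze(-1)
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    residual = u_t + u * u_x - nu * u_xx     # assumed Burgers residual
    loss_pde = torch.mean(residual ** 2)

    # The two losses can also be weighted relative to each other.
    return loss_data + loss_pde
```

The conservation-law residual acts as a regularizer, which is how the method compensates for having very little data.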
I gave a talk recently at the Army here at Natick. They have an installation. They were talking about autonomy and can you actually be autonomous without any physics at all. I told them no.
Here's a simple example. You can learn how to solve this ODE and predict inside the domain of training. But if you go outside the domain, there's a huge error. It turns out that if you use the PINN approach, there is no problem, because in this unsupervised setting you follow exactly the trajectory that you want of your vehicle, so to speak.
Now, what is interesting is that if you are outside the parametric space-- I have lambda as a parameter-- if you train on a certain parametric set and you are outside it, the errors are not as catastrophic, as you can see. Some errors, but not big. So it's different from what happens in the domain of parameters and so on. So it's better to PINN it if you can.
And here's an example we published recently in Science, something very timely. It's what can you do if you have this type of approach? And I call this hidden fluid mechanics because it's a hidden Markov process type of thing.
I used auxiliary data like smoke or some thermal gradients from your breathing or from your coffee, and from there-- this is one of our collaborators from LaVision, a German company-- I can tell you-- I don't know if you can see the movies playing-- but I can tell you what the pressure is and what the velocity is, just using this PINN approach, combining physics and data. The only data I used is the data of the video, but then I can infer a lot of other things. Did you see the movie, Kenny?
KENNETH BLUM: Yeah, looked good.
GEORGE KARNIADAKIS: OK, and I prepared this espresso, I said, for Tommy. Professor Poggio, I know he likes espresso like me. So recently, I was doing this project with LaVision. They took a stereo photograph, in 3D, over an espresso cup, and we were so curious to see what the maximum velocity is and what the pressure is above it. Kind of a physics question. [CHUCKLES] But you can see here--
KENNETH BLUM: Have you compared it to Greek coffee?
GEORGE KARNIADAKIS: [LAUGHS] No, but there was a controversy, because we predicted that the maximum velocity was 0.4 meters per second, which sounds really, really fast. And they didn't believe us, so they went back and did an experiment with particle image velocimetry, and indeed, they found 0.45, close to our prediction. So anyway, we could infer, again, the velocity and pressure and so on. So this is just a fun project.
We're doing more biomedical projects. I will skip that because-- this is a brain aneurysm from children's hospital data. I'll skip that. Same idea.
I want to go back to operators. So what I have basically said so far is that I'm interested in generalization, like everyone else. There are ways to generalize. It's difficult to find generalization errors. And I want to resort to operators to make the big jump.
So you should be seeing now a slide that says "Problem setup." So here we are. G is the operator I'm looking for. u is a function in some compact domain, which I will define later. But we have this mapping from u to G of u at y. G of u at y, that's the output of the operator.
So the setup is the following. We will train-- there's no physics now. All this is data-driven. We'll train this system with functions u, a lot of them, first one, second one, third one.
We will observe the output at some points, and then you will give me another function from that space. That I have to define-- have yet to define. And then I have to be able to give you the G of u of y. So this will establish that I have learned a mapping between u and the output of the operator in the space of y.
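In other words, the training set can be thought of as a collection of triplets; the notation below is a paraphrase of the setup just described, not taken from the slide.

```latex
% One training example: the input function u^{(i)} observed at m fixed sensors,
% a query location y_j in the output domain, and the corresponding label.
\mathcal{D} \;=\; \Big\{\, \big(\underbrace{u^{(i)}(x_1), \dots, u^{(i)}(x_m)}_{\text{input function at } m \text{ sensors}},\;\; y_j,\;\; G\big(u^{(i)}\big)(y_j)\big) \,\Big\}_{i,j}
```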
So now, I went back, and I looked at the literature. And I found this theorem. And I don't know how many of you who have been doing theoretical machine learning have ever run into it, but I asked one of my collaborators who was doing-- taking a course at MIT on machine learning and asked the instructor, and the instructor had no idea that you can actually approximate functionals and operators.
But Chen and Chen, back in Fudan University in the early '90s, developed this theory first for functionals. Here, I show you for system identification, nonlinear operators. So basically, the theorem says the following. Imagine you have a compact space V. Remember I showed you the function u? This function u would be in this compact space V.
And you're trying to identify this G nonlinear continuous operator. Here, G could be an explicit operator, an implicit operator, or a totally indescribable operator. I'll show you some examples.
But basically, the theorem-- now, remember, Cybenko, Hornik, and other people at that time were developing the theory for function approximation-- Chen and Chen developed this theorem, which shows that a single neural network like this-- actually, two neural networks, each with a single hidden layer-- can approximate, arbitrarily closely, this continuous operator. G of u at y will be approximated by a branch and a trunk.
Notice that this is one layer for the output and one layer for the input. These are two different networks. We are calling them the branch and the trunk. This can be done for any u in this compact space V, and any y in this set K2, which is in R^d.
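Written out, the approximation guaranteed by the Chen and Chen theorem takes the following form, as quoted in the DeepONet paper: for any epsilon > 0 there exist integers n, p, m, constants c_i^k, xi_ij^k, theta_i^k, zeta_k, vectors w_k, and sensor locations x_j such that

```latex
\Bigl|\, G(u)(y) \;-\; \sum_{k=1}^{p}
\underbrace{\sum_{i=1}^{n} c_i^{k}\,
\sigma\!\Bigl(\sum_{j=1}^{m}\xi_{ij}^{k}\,u(x_j)+\theta_i^{k}\Bigr)}_{\text{branch: acts on } u(x_1),\dots,u(x_m)}
\;\underbrace{\sigma\bigl(w_k\cdot y+\zeta_k\bigr)}_{\text{trunk: acts on } y}
\,\Bigr| \;<\; \epsilon
\qquad \text{for all } u\in V,\; y\in K_2 .
```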
What does it mean? I interpret it here. It means that I can think of this as a dot product of the outputs of the trunk and the branch. So if we look at panel D, what we have here is a branch network, where we take this function u and we observe it at m points-- we call them sensors. Let's say you have m sensors-- you observe them. And then that feeds the branch network.
But we also need to say something about the output space, because we need to have labeled data, so I need to provide some G of u at y. By doing that-- so you see, I have p outputs; I have m points where I observe the input; and n is the number of neurons, if you like, for this network. So I pipe them through these two different networks, I take the dot product, and I find the output.
So let's review again-- oh, now, this is a single-layer neural network. What my team did recently, under a grant from DARPA, was extend this to deep neural networks, basically replacing the single-layer branch network with a g_N, another neural network, and the trunk with an f_N. And these can now be very general neural networks-- any class of functions, in fact, that satisfies the classical universal approximation theorem.
So the classical approximation theorem now goes into our neural networks, but of course, this is a kind of network of networks, if you like. It is a composite network. I'll show it to you again.
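As a concrete illustration of the unstacked branch-trunk architecture just described, here is a minimal sketch in PyTorch; the layer sizes, activations, and added scalar bias are illustrative choices, not the settings used in the talk.

```python
import torch

class DeepONet(torch.nn.Module):
    """Minimal unstacked DeepONet sketch: G(u)(y) ~ sum_k b_k(u) * t_k(y) + b0."""
    def __init__(self, m=100, p=40, width=40, y_dim=1):
        super().__init__()
        # Branch net: encodes the input function u sampled at m fixed sensors.
        self.branch = torch.nn.Sequential(
            torch.nn.Linear(m, width), torch.nn.Tanh(),
            torch.nn.Linear(width, p),
        )
        # Trunk net: encodes the query location y in the output domain.
        self.trunk = torch.nn.Sequential(
            torch.nn.Linear(y_dim, width), torch.nn.Tanh(),
            torch.nn.Linear(width, p),
        )
        self.bias = torch.nn.Parameter(torch.zeros(1))

    def forward(self, u_sensors, y):
        b = self.branch(u_sensors)              # (batch, p)
        t = self.trunk(y)                       # (batch, p)
        return (b * t).sum(dim=-1) + self.bias  # dot product over the p features

# Usage sketch: u_sensors has shape (batch, m), y has shape (batch, 1);
# training minimizes the mean squared error against the labeled G(u)(y) values.
```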
But first, you have to define this input space. The input space V is a compact space for the theorem. It turns out, in practice, it does not need to be a space like that. So for example, I can do Laplace transform. And in fact, as you can see here, I use Gaussian random fields to approximate my space V.
But I want to see, if you commit an error in representing the space V because you cannot exhaust that space, right? It's an [INAUDIBLE] space. How big is that error?
So I take this ODE, ds/dx equal to u of x, with u on the right-hand side. I observe-- I sample my u uniformly. And the real u is this curve, for example. And then, for the special case of a Gaussian process with a Gaussian kernel, the squared-exponential kernel, with a correlation length l, I can find that this constant kappa, which is based on the space V and the number of sensors, is basically quadratic in the number of points, 1 over m squared, and quadratic in the correlation length.
And then I can prove the theorem, for this case only, that indeed the error of this neural network approximation is bounded above by the error with which we sampled the space V. So it's quadratic in the number of observation points for that function and quadratic in the correlation length. So that makes sense.
Of course, there are many different ways of representation. For example, you can imagine that the V space could be a neural network itself. You can imagine that V could be wavelets. It could be a radial basis function. It could be spectral expansions, and I'll show you-- if I have time, I'll show you some of it.
So let's recap what we have. We want to find the operator that shows that nonlinear mapping in general from u to G of u, from Rd to R, let's say. So down here, in the panel B, it gives you an idea of what we have. Here, I have a summary of what I told you already.
So in the left column, it says training data. I observe one function at that point. I observe another function at that point. I may observe 10,000 functions, OK?
Now, correspondingly, I have to observe the output G of u of y, but you can see, I may have 100 points to observe the input and only two or three or four points to observe the output. So very laconic, very Spartan on the output. I'm from Crete, actually, but I use Spartan here as an analogy of just a little data.
And then what I have here on the left is the input function u, the space V, and the output G of u of y. I think that's pretty simple. That's what I have in mind, is a simple ODE to explain to you what we're doing. And here's an example.
Here is an example. So I will compare different neural networks that are out there. So let's say I want to find the integral operator. I want to build a neural network that approximates the-- it's one dimensional, but other people have done multidimensional. So don't worry about the complexity of this. Just as a pedagogical example.
I want to find the integral from 0 to x; x could be in some range. So here, the integrand u of x goes into the integral, which is s of x, and depends on x. So it's a map from u of x to s of x, and capital G is that integral, really. That comes, of course, from this derivative definition.
So how do we do that in practice? Well, I take one function. I represent it with 100 points-- this is the simplest possible case, just to introduce the concept. I take 10,000 functions, and I only observe the output s of x at some random point. So only one point per function, OK?
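A sketch of how such training triplets might be generated, with the input space V approximated by a Gaussian random field as described earlier; the sizes, the correlation length, and the crude Riemann-sum quadrature are illustrative assumptions, not the exact recipe from the talk.

```python
import numpy as np

def make_antiderivative_data(n_funcs=10000, m=100, length_scale=0.2, seed=0):
    """Training triplets for the antiderivative operator G(u)(x) = int_0^x u(s) ds:
    each input function u is drawn from a GRF and observed at m sensors on [0, 1],
    and the output s = G(u) is observed at a single random point per function."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, m)
    # Draw n_funcs input functions from a GRF with squared-exponential kernel.
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2.0 * length_scale ** 2))
    L = np.linalg.cholesky(K + 1e-10 * np.eye(m))          # jitter for stability
    U = (L @ rng.standard_normal((m, n_funcs))).T          # (n_funcs, m)
    # Antiderivative by a crude left Riemann sum (enough for an illustration).
    S = np.cumsum(U, axis=1) * (x[1] - x[0])
    # One random observation of the output per input function.
    idx = rng.integers(0, m, size=n_funcs)
    y = x[idx][:, None]                                    # trunk input, (n_funcs, 1)
    s_y = S[np.arange(n_funcs), idx]                       # labels, (n_funcs,)
    return U, y, s_y                                       # branch input, trunk input, label
```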
Now, here is a summary of what I got with my best network. That best network is what I just showed you, the unstacked DeepONet. The mean squared errors, training and testing, as you can see, are almost on top of each other. And the error goes down to 10 to the minus 5.
I compare lots of networks. I compare-- the best network here is, of course, the one that I show you. That's why I show it to you. Because the generalization there is very small, the difference between the training and the testing error.
If I use a standard fully connected network (FNN) like that, or if I do a ResNet, it's similar to the FNN. I did a sequence-to-sequence model. One of the reviewers said, "Well, sequence to sequence works well." It does a little bit better than the FNN, but it doesn't do as well as this, the unstacked network.
So now, what happens if the space, this simple space V, is very poorly represented? For example, as I said, the first violation is that V is supposed to be compact, and I make it non-compact by just taking a GRF. Then I fix the correlation length to 0.5. And if you see what I have in my basket-- that's the space V-- it's these functions.
So you come along and you say, "Can you integrate this function? This will be my u of x. Can you integrate using a neural network?" Needless to say that if you train the network, you can spit out the answer in a fraction of a second. So all the cost is amortized already if we train it.
So the answer to this is, it depends actually on l-- if I go outside the distribution, it depends on how well I did with l. Here, if I have a very small correlation length, right here, my error, although I'm outside the distribution, is pretty small. If I take a correlation length of 0.5, my error is pretty big. So obviously, you want to be careful with the space. How rich that input space is, is very, very important.
Now, what happens if you get lazy, or if you don't have enough data and so on to train in this case? So we pre-trained the DeepONet. In step one, we use supervised learning to pre-train it-- what I told you about.
Then in step two, you have two options. One is, actually, if you know any physics, any constraints, you can do what I told you about with PINNs, but do it for a very, very short time, let's say 10, 20, 100 iterations, not a million iterations. Otherwise, somebody gives you data-- just a few data points but not a lot. Then you can just attach an external neural network and use DeepONet as a pre-trained neural network.
We've done this. I will not bother you. The results look good. For example, if my correlation length is small, you can see, I start with a 2% error. I can improve it and so on.
Now, here's another example and a surprise, big surprise to us. This is now u of t, one input function, and two outputs, s1 and s2. It's a nonlinear problem. It's a nonlinear operator.
I show you here three examples with three different networks. They all have the same depth but different widths. So if I take the middle one, I plot the error versus the number of training data. And you can see one thing, that the testing error here drops very fast-- in fact, exponentially fast initially.
Then it transitions to a slower, algebraic rate, like Monte Carlo-type sampling. Now, you can see this transition point from the exponential convergence. This is great, because I train operators. If I can do it exponentially fast, it would be great.
Now, we haven't got there yet, but one observation is that if we make these networks bigger, let's say from width of 50 to 200, you can see that the transition point moves to the right. So my exponential range is much bigger. So again, I'm looking for someone, an MIT guy who is very smart, to come up and take this and make it a really, really good network that will have exponential convergence in training and testing for all sizes.
We can do that for PDEs, advection-diffusion-reaction systems. You can find that in the brain. You can find it in biological systems. Now, you have very few points that you observe in space time, and you can do the same thing. Again, you can find exponential convergence. I will not bore you with that same idea.
With this DeepONet, now, you learn how to solve this PDE. If you have data from an advection-diffusion system and you train that network, then you can change your initial conditions, boundary conditions, and so on. And then you can solve this PDE in real time, in a fraction of a second. I have spent 35 years working on numerical methods for PDEs. I cannot find a method that competes with this.
Not only that, you will learn an operator that is very general. For example, you learn implicitly this operator. Now, how do you explain the data, let's say?
Here, I have an example where I fed it with data from this advection-diffusion system, from the old, boring integer calculus-- which I don't like anymore; I like fractional calculus-- so I used, as a dictionary, fractional operators. Using the learned operator, I spit out values. I found a new equation that described my data equally well. So I can explain it with integer calculus, the boring one, or with fractional calculus, the exciting one. I can do lots of different things.
And talking about fractional calculus, I like fractional calculus. I like fractional calculus because it's as expressive as neural networks. So let me give you an example. Let's say I try to learn this operator. This is a fractional derivative, but it's actually an integral, because it has memory. It's a nonlocal thing. It goes back to Riemann, [INAUDIBLE] and so on.
But trying to-- I learned the integral before. Can you learn the fractional derivative? So here's the idea again. I take all sorts of functions. I'm trying to train a neural network to learn a fractional operator. I was trying to really push the DeepONet. And I used known formulas and so on, just built a library, and you can do it.
Here, I do it specifically for what's called the Caputo derivative, which is used for time-fractional problems. But the main point I want to make here is that I can learn this really, really well. And there are three curves here that show how well.
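For reference, the Caputo fractional derivative referred to here is, for order 0 < alpha < 1, the memory integral

```latex
{}^{C}\!D_t^{\alpha}\,u(t) \;=\; \frac{1}{\Gamma(1-\alpha)}
\int_0^{t} \frac{u'(s)}{(t-s)^{\alpha}}\, ds ,
\qquad 0 < \alpha < 1 ,
```

so, despite being called a derivative, it is a nonlocal integral over the whole history of u-- exactly the memory mentioned above.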
It all depends on the space V, the input space. For example, if I represented my functions with spectral expansions, I can do a really, really, really good job. If I use GRF, which I used before, I still get a good accuracy, but 10 to the minus 3, not 10 to the minus 6 that I would like to. So your space V is very important, and that's what I demonstrate here. And I'm sure there are better ways to represent spaces.
Talking about spaces and difficult operators, one of the most difficult operators to compute is the fractional Laplacian, which gives you anomalous transport. I am 100% sure that diffusive transport in the brain is anomalous, so it will be described by a three-dimensional fractional Laplacian, but that's a different topic. But here, I represent my input space V with Zernike polynomials, which are orthogonal polynomials on a disk. Now, the reason I include this result here is that some of you may be using phase contrast microscopy, and you may know who Frits Zernike was. He's the Nobel Prize winner of 1953 who invented phase contrast microscopy, and he was the one who actually introduced these Zernike polynomials.
So I used this to represent my input space, and I learned the fractional Laplacian really well. And after you learn the fractional Laplacian, instead of a few hours to compute it on your laptop, it takes about 0.01 second to compute it, for any function. You can see that for any different functions.
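For context, one standard definition of the fractional Laplacian of order s in (0, 1) is the singular integral

```latex
(-\Delta)^{s} u(x) \;=\; C_{d,s}\;\mathrm{p.v.}\!\int_{\mathbb{R}^{d}}
\frac{u(x)-u(y)}{|x-y|^{\,d+2s}}\, dy ,
```

where C_{d,s} is a normalizing constant; the nonlocal kernel is what makes it expensive to evaluate by conventional means and a natural target for a pre-trained operator.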
You can do stochastic ODEs as operators. This is a very simple example, deceptively simple, in fact: dy/dt equals k times y, but k is a stochastic process. It could be white noise or partially correlated and so on. If there's a little bit of correlation, I can use it: I can do a Karhunen-Loeve expansion on this, which I do here.
And now my branch and my trunk will change, because now I'm in high dimensions. I'm sort of deterministic now, because I take advantage of the colored noise, but now I have a much bigger input, and also the trunk, which handles the output, has dimension equal to the number of modes plus 1. So if I keep 10 modes, I'll have 11 dimensions. If I keep 20, I'll have 21 dimensions, and so on.
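A sketch of the Karhunen-Loeve reduction for a colored-noise coefficient k(t), using a standard eigendecomposition of the covariance matrix; the squared-exponential kernel, the interval, and the sizes are illustrative assumptions, not the exact setup from the talk.

```python
import numpy as np

def kl_expansion(m=128, n_modes=10, length_scale=0.5):
    """Karhunen-Loeve expansion sketch: represent a colored-noise process k(t)
    on [0, 1] with squared-exponential covariance by its leading n_modes
    eigenpairs, so each sample path is parameterized by n_modes i.i.d. normals."""
    t = np.linspace(0.0, 1.0, m)
    dt = t[1] - t[0]
    C = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2.0 * length_scale ** 2))
    eigvals, eigvecs = np.linalg.eigh(C)                # ascending order
    lam = eigvals[::-1][:n_modes] * dt                  # approximate KL eigenvalues
    phi = eigvecs[:, ::-1][:, :n_modes] / np.sqrt(dt)   # approximate eigenfunctions

    def sample(xi):
        # xi: (n_modes,) standard normals -> one realization k(t) on the grid.
        return phi @ (np.sqrt(lam) * xi)

    return t, sample

# Usage sketch: t, sample = kl_expansion(); k_path = sample(np.random.randn(10)).
# The 10 (or 20) xi values, together with time, form the higher-dimensional input.
```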
It's a little difficult to train, but it turns out that you can find not only the statistics of the stochastic operator, but also individual trajectories. As you can see here, I have 10 samples, 10 different trajectories. DeepONet, in a split second-- a split of a split of a second-- can reach an accuracy of 10 to the minus 5. The accuracy, as you may guess, depends actually just on the optimization, nothing else. You can get better accuracy than that.
You can-- I have some math that explains why I can do that, what the error breakdown is, and so on. I will skip that. But you can apply this also to PDEs. This is a tricky PDE. It has an exponential nonlinearity, as you can see here. It's nonlinear in the stochastic domain. Again, you take advantage of the KL expansion. It's a high-dimensional space, but you can do it.
If it's white noise, you have to do something different. But for colored noise-- most physical and biological processes are governed by some colored noise-- you can do it with DeepONet. You can get all the statistics, the standard deviation, and even trajectories, as before.
So I know you don't do physics, but I want to show you a really, really difficult case that I'm doing now with DARPA. There's a lot of interest in hypersonics recently because of the Russians, as they say. And so we were asked to provide some fast ways of predicting the trajectories of hypersonic vehicles.
So what I show here is the Euler equations, set up as a Riemann problem. You start with some discontinuities. The Euler equations develop shocks, discontinuities, contacts, expansion fans, all the crazy stuff. And on top of that, the air dissociates, because you're flying at Mach number 8 to 10.
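For reference, the one-dimensional compressible Euler equations (without the high-temperature air chemistry mentioned above) can be written in conservation form as

```latex
\partial_t \begin{pmatrix} \rho \\ \rho u \\ E \end{pmatrix}
\;+\;
\partial_x \begin{pmatrix} \rho u \\ \rho u^{2} + p \\ u\,(E + p) \end{pmatrix}
\;=\; 0 ,
```

with density rho, velocity u, pressure p, and total energy E; a Riemann problem starts from piecewise-constant initial data with a jump, which is what produces the shocks, contacts, and expansion fans.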
So the question is, can you pre-train a neural network so that when you have some real data on the fly-- literally on the fly-- you can correct your trajectories? So here's what we do. We take not one but one, two, three, four, five neural networks, DeepONets. We pre-train them, and the idea is that if you pre-train them, then with just a little bit of extra data and so on, you can predict entirely what's going on.
These are literally hypersonic speeds, not supersonic-- Mach number 10, as I said. So the parameterization of the problem is the initial conditions, which are down here. So I can start with very steep initial conditions for the Riemann problem, or very shallow ones, and so on. And I can get all sorts of different solutions. I don't have one solution. I have millions of solutions, and DeepONet will encapsulate all that into one pre-trained network, OK?
And we did that in phase one, which, in the timescale of DARPA, means two months. [CHUCKLES] So here are some results we got for one specific case. We got an accuracy of 10 to the minus 5, starting from the initial conditions, which are in blue here. You can see it converges and so on.
So this shows you that this DeepONet can be used in any type of situation. We have many more cases, mostly from biological, but we're moving to the biomedical domain. And I have one more slide, this one, and the conclusion.
And this is a sweet new concept. We call it DeepM&M. M stands for Multi-physics. The other M stands for multi-scale. And the idea is that any complex problem can be built up from DeepONet.
So for example, imagine you have three different fields, temperature, velocity, and magnetic fields. They're coupled through some physics, or through observations, you know what it is. You train DeepONet for each one of them, where the other two are the input functions.
Remember, DeepONet produces functions. So now, I have all this. This is sort of the LEGO approach to doing multi-physics. Off the shelf, you get your DeepONets. They are pre-trained. You have new data. You have an overall neural network, the DeepM&M.
You give it a little bit of data from the true multi-physics problem. And then you're basically done. So it's 99% pre-trained, and we have done this for many, many applications. I will not bother you anymore with all this physics, because you may already be bored. But I just wanted to demonstrate that.
So from DeepM&M, I want to show you my current center, which is the biggest center on physics-informed learning machines. It started a few years ago. MIT and CSAIL are participating; Costis Daskalakis is one of my co-PIs. Stanford has representation, and the National Labs and so on. The idea is to use the types of networks that I showed you, primarily PINNs but also now DeepONet, to build new ways of approaching the modeling of complex multi-physics, multi-scale problems. I think I will stop here. Thank you very much for your attention, and I'm happy to take any questions.
KRIS BREWER: Great, thanks very much for that wonderful talk. And George, we have our first question. They ask, "You mentioned these neural network methods are the best you've ever seen for these problems. Can you give some intuition for why these neural networks are so good at them, compared to classical methods?"
GEORGE KARNIADAKIS: I should have qualified this: when we have some data. Well, they are not as good-- OK, so I did my thesis, actually, on the spectral element method, which is probably a very, very accurate method, a combination of finite elements and spectral methods. So you cannot beat that accuracy, but those are very, very slow methods.
These DeepONets-- you can literally use them on the fly. Some of the physical applications we have, for example the hypersonic one, may take several days for one simulation. Here, we compute-- we predict the right answer in a clock time of 0.01 seconds on a postdoc's old laptop.
So the main thing, especially with DARPA and so on, is that they're interested in speed. They're not interested in 10 to the minus 16 accuracy. They're interested in reasonable accuracy, like I showed, 10 to the minus 5. But they're really, really interested in incorporating new data quickly. So in some sense, for those kinds of examples, I should have said, these methods are basically unbeatable.
KRIS BREWER: From Rada Abdul Kalaf, "Does this apply to convolutional neural networks?"
GEORGE KARNIADAKIS: It's a good question. Actually, I forgot to say that, but with DeepONet-- in one of the cases, when I did the fractional Laplacian, which I called a very complex operator, if you have a nice domain, a square domain, you can treat the input as an image. And then a CNN works really well there. You make the input an image, and it's fast and works very well.
For the first part of the talk, when I was talking about PINNs-- in PINNs, actually, I use automatic differentiation to avoid any grids and so on. So I abandon the numerical methods entirely, because I use the same technology that is used for backpropagation-- I use it for the differential operators. In a CNN, you don't have that. You need to use finite differences and so on. So then you go back to the old problem of having numerical methods and their artifacts, with errors of numerical diffusion and so on.
But yes, a CNN can be used in DeepONet if the domain is simple. The domain does not have to be simple-- it doesn't have to be rectangular, it doesn't have to be 2D, and so on. But in some cases, you can use a CNN. The answer is yes.
KRIS BREWER: Great, thanks. The next one is from Zhangyi Li. "Can we bound the error in terms of the operator norm?"
GEORGE KARNIADAKIS: That's a very good question. I didn't show that. I just wrote the report to DARPA. For some-- so you have to be very specific. If you're talking about the DeepONet, yes, the answer is you have to be specific. So you take a class of, let's say, hyperbolic problems or conservation laws, and then you use these properties to show the error. In fact, some of the stuff that I skipped on the stochastic ODE was attempting to do exactly that. And the answer is yes.
You need to use-- you need to assume Hölder continuity with alpha less than 1, and then you can prove it. But you can use some sort of equivalence of norms. You can also use Gamma-convergence and so on to prove that. The answer is yes.
But you have to go class by class. It's not one size fits all. Yeah, good question.
KRIS BREWER: Great, the next one is from Christian Ueno. "Thank you for the talk. You mentioned that DeepONet gives good performance even when you move away from compactness assumption of the theorem. Could you say a little more on that?"
GEORGE KARNIADAKIS: Yes, so as you probably know, most of this type of theorems are for compact spaces. It's very difficult to show theoretically non-compact sets. But almost all our examples are for non-compact sets. I didn't show it here, but in the paper-- we have a paper that will appear in Nature Machine Intelligence, I think-- we have like 16 different cases. Most of them are actually for non-compact cases, including a Laplace transform, I think, and Legendre transform and so on.
But it's all empirical. I don't-- I don't know if you are a theoretical person, but it's very difficult to prove for-- even for functional approximations, it's very difficult to do proofs with non-compact sets. I hope I answered the question, if that's what you mean. If you mean to extrapolate outside, way outside the distribution space, that's a different question. But I answered the question about compactness.
But lots of people, lots of mathematicians, actually didn't think very highly of this theorem, because they thought it was so restrictive. And actually, it's not. That's what I wanted to show you.
KRIS BREWER: Thanks, and the next one is from anonymous. "Follow-up question to the first. What allows these networks to approximate exact solutions so fast? Do we have some understanding of how the implicit prior in these networks helps them approximate our desired physical function/operators efficiently?"
GEORGE KARNIADAKIS: That's a really good question. It's-- because I have observed this, actually. I have another case where I feed the network with molecular dynamics data, and I have stochastic fluctuations. And it learned the stochastic fluctuations. So I don't really know.
You know, when I showed that theorem, which is a rigorous theorem, when we went from a single layer to a deep neural network? The deeper you go, the better, just like in neural networks. And then, how rich the space is and how well you represent V-- how the representation is done-- is very important. That's why I showed you the example with the Caputo derivative. I had something there that I called poly-fractonomials, which are some functions I discovered, exact functions where I combined polynomials with fractional exponents. And these are the best to represent this type of solution. So if you do that, you gain like a factor of 10, or sometimes 100, in terms of accuracy.
So I don't know. I think representing the space V is very important, but I don't have the full answer. As I said, I don't really have an intuition yet. Qualitatively, I would say, yes, we used good priors. But I put emphasis on the input space V and how well I represent it. And there may be better representations out there.
And the other thing in terms of training, I talked about a balanced network. If I have a balanced network, I can learn exponentially fast, which is a big deal, as you can tell, for this type of things which need a lot of training. It was a good question. I don't know the answer.
KRIS BREWER: All right, the next is from Urich Malik. "What about arbitrary operators with DeepONet? Can it learn complex user-defined operators?"
GEORGE KARNIADAKIS: Yes-- it doesn't matter, actually. That's a good question, and that's why, in the beginning, I put up a biological system, a social system, and a system of systems. Because it doesn't really care. So for example, I have this advection-diffusion system, right? So I generate data from an integer-order equation. I learn the operator. Then the operator-- now, it's implicit, right?
Now, how would you represent it if you don't know that the data came from there? You like integer calculus, and you use a dictionary of integer derivatives. I used fractional derivatives, and I came up with a different operator. You may remember, I found a derivative, d to the 1.5, which is a combination of second and first derivative.
So yes, fractional operators are exactly that. They are very, very general operators. You can have variable order. You can have distributed order. So if you don't want to be totally arbitrary, you can use fractional operators, which are extremely expressive, and you can express things with them. But yes, any operator which is nonlinear-- it has to be a continuous operator-- can be represented, yes.
KRIS BREWER: Great, and thanks to David for helping Marty submit his question. Marty's question is, "Did I hear you mention the possibility of using wavelets instead of sigmoids? Have you tried this option yet?" And then a follow up question, "Could you envision using Mallat's deep scattering networks as trunk networks?"
GEORGE KARNIADAKIS: On the first one, I probably misspoke, maybe. I meant that the wavelets would be a wavelet representation of the space V. So imagine you have dilations and so on, or multi-scale problems. Maybe wavelets are the best way to represent the space V-- just like I use my Legendre polynomials and so on-- not to replace the activation function.
If you are interested in activation function, actually, we have something really nice that we produced with one of the postdocs from CSAIL. Kenji is his name. And we did this adaptive activation functions or what we call rowdy activation functions. They worked really, really well for training. But no, I didn't mean to replace the activation function with wavelets. I meant to represent the space V.
I'm not sure about the second part. Is this the Stéphane Mallat method? I don't actually know the scattering networks, so I won't be able to answer that part. The trunk network-- it's really interesting to target the trunk. Because when we started doing this, especially when I use DeepONet as a pre-trained network and then train it a little further with some data, I thought that I would need to fine-tune my branch. But it turns out that in this DeepONet, the trunk is actually the most sensitive part.
So I think if there are ways to improve that trunk, like fine-tune it with better parameters or with a different architecture, that's a place to target. That's kind of our experience so far, but next year, I may say something else. Sorry, I cannot answer the second part of the question.
KRIS BREWER: No worries at all. The next question is, "In FEM, there is a lot you can do with your choice of FEM space. Consequently, there is a lot of work on which spaces to choose for certain types of PDEs, e.g., finite element exterior calculus and so forth. It seems like you are picking a function space in some of these examples. What should I do to pick my space? E.g., how did you choose that space for the fractional Laplacian example?"
GEORGE KARNIADAKIS: Good question. [LAUGHS] I didn't show that result-- the results actually that point to this. But for-- this is a different thing. This is-- the V space is the input space of functions and how you represent it. In terms of the PINNs and how we do it, we have something called variational PINNs.
Now, since you talked about finite elements: finite elements use the Galerkin method. The trial space and the test space are the same. But in variational PINNs, what I have is, my trial space is the neural network space, which is a nonlinear approximation-- so this will be the trial space. But my test space can be a polynomial space. I use Legendre polynomials or monomials, just like in finite elements.
So variational PINNs-- if you Google VPINNs, you will see a paper on VPINN. You have the best of two worlds. You have the neural network approximation, which, by now, we know acts like an adaptive finite element.
And then you have nicely-- you're testing this in subdomains, just like in finite elements, but arbitrary subdomains with nice, smooth functions. So if you integrate by parts, you can transfer the non-smoothness to the smooth part, and you can do lots of great things. You don't have to take high-order derivatives. Because every time I take a derivative for my physics terms, I double the length of the neural network. So it's not very good to go very deep from a training point of view.
So you can do-- you can choose the space for a purpose. You can construct FEM-type and least-squares-type solutions here. And one of my collaborators from Sandia is working on exterior calculus and what are called generalized least-squares nets, since you mentioned exterior calculus. So one can mix and match in a Petrov-Galerkin-type framework, perhaps. So there are lots of possibilities.
KRIS BREWER: This is from Ram. "Comparing this approach to real neurons, is branch a dendritic branch? Trunk is axonal spiking. Cross product is a recurrent feedback. So the hidden layer is simultaneously both fragmenting data and categorizing it and using categorization to guide fragmentation? These two networks compete and cooperate simultaneously. I wonder if the model can be run backwards as well?"
GEORGE KARNIADAKIS: That's a really good question. We should talk. Because I know-- actually, I know what synapses are, but I have no clue beyond that. Actually, maybe I shouldn't be giving a talk to a neuroscience group today. But it could be.
I mean, yes, the trunk and the branch have to work in sync. They have to be balanced and so on. We have seen some of that, but I don't have the knowledge to draw the analogy that you are presenting. On the arXiv, we have the DeepONet paper, and it's a Spartan version.
But I can make the long version available to you with all the theory and so on. And then if you can come up with an analogy like that, that would be actually great because I just don't have an intuition for it. And I just do these operations. But that sounds great if that's true. I hope it's true. [LAUGHS]
KRIS BREWER: And we actually have a follow-up question from Christian. He's asking a follow-up on his compactness question and the Laplace transform example. "Do you learn the Laplace transform for functions supported on, say, the unit interval and then apply the learned transform to functions with larger and larger support, but continue to see the same L-infinity error? Is this how you would test this idea?"
GEORGE KARNIADAKIS: Yeah, kind of. Yes, exactly. That's exactly right. This is how you do it. And of course, as you increase, you have to increase your training and so on. But you can-- but yes, exactly. That's exactly right.
So the GRF was just for the input space, right? Basically, you take a hat, and you pull functions from that hat. I was saying you can replace it with other spaces and so on. So this is just experimenting with representations of V, of that compact space V, which doesn't have to be compact, as I said. I mean, there's a lot out there, actually, that one can use for that.
This hasn't been around for very long, so we have a lot of examples-- lots of physical and biological cases that we have tested. On the theory side, we proved just the extension to deep neural networks and then some error bounds. But we haven't done a lot yet.
The reason I chose this topic today is because it's kind of at a high level abstraction, so I thought maybe you guys could take a different angle of it. And I hear some good questions. So it's wide open. I know DARPA is very interested in this, and they want to start a whole new program just on DeepONets, on all sorts.
And they ask me the same questions. Can we do this for social systems? Can we do for flocking? Can we do it for our troops? Can we do-- because it's agnostic to different-- to all sort of specifics. Of course, you have to have data to train all this but--
KRIS BREWER: Ram just stated that-- I think this might be a clarification-- "Trunk is inhibition. Branch is excitation. Minus some is difference of Gaussian."
GEORGE KARNIADAKIS: Yeah, so the branch is the excitation. It's the input, the excitation force. The trunk is the output, that's correct. But this is a nonlinear system, so there's nothing Gaussian, even if you feed in a Gaussian. These are all nonlinear systems, so there's nothing Gaussian in the output.
TOMASO POGGIO: Great talk. Thank you, George.
GEORGE KARNIADAKIS: Thank you very much. Thank you very much, Tommy, thank you.
It is widely known that neural networks (NNs) are universal approximators of continuous functions; however, a less known but powerful result is that a NN with a single hidden layer can approximate accurately any nonlinear continuous operator. This universal approximation theorem of operators is suggestive of the potential of NNs in learning from scattered data any continuous operator or complex system. To realize this theorem, we design a new NN with small generalization error, the deep operator network (DeepONet), consisting of a NN for encoding the discrete input function space (branch net) and another NN for encoding the domain of the output functions (trunk net). We demonstrate that DeepONet can learn various explicit operators, e.g., integrals and fractional Laplacians, as well as implicit operators that represent deterministic and stochastic differential equations. We study, in particular, different formulations of the input function space and its effect on the generalization error.