Liquid Neural Networks
October 8, 2021
October 5, 2021
Ramin Hasani, Daniela Rus
All Captioned Videos CBMM Special Seminars
Ramin Hasani, MIT - intro by Daniela Rus, MIT
Abstract: In this talk, we will discuss the nuts and bolts of the novel continuous-time neural network models: Liquid Time-Constant (LTC) Networks. Instead of declaring a learning system's dynamics by implicit nonlinearities, LTCs construct networks of linear first-order dynamical systems modulated via nonlinear interlinked gates. LTCs represent dynamical systems with varying (i.e., liquid) time-constants, with outputs being computed by numerical differential equation solvers. These neural networks exhibit stable and bounded behavior, yield superior expressivity within the family of neural ordinary differential equations, and give rise to improved performance on time-series prediction tasks compared to advanced recurrent network models.
Dr. Daniela Rus is the Andrew (1956) and Erna Viterbi Professor of Electrical Engineering and Computer Science and Director of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. Rus’s research interests are in robotics, mobile computing, and data science. Rus is a Class of 2002 MacArthur Fellow, a fellow of ACM, AAAI and IEEE, and a member of the National Academy of Engineering and the American Academy of Arts and Sciences. She earned her PhD in Computer Science from Cornell University. Prior to joining MIT, Rus was a professor in the Computer Science Department at Dartmouth College.
Dr. Ramin Hasani is a postdoctoral associate and a machine learning scientist at MIT CSAIL. His primary research focus is on the development of interpretable deep learning and decision-making algorithms for robots. Ramin received his Ph.D. with honors in Computer Science at TU Wien, Austria. His dissertation on liquid neural networks was co-advised by Prof. Radu Grosu (TU Wien) and Prof. Daniela Rus (MIT). Ramin is a frequent TEDx speaker. He has completed an M.Sc. in Electronic Engineering at Politecnico di Milano (2015), Italy, and has got his B.Sc. in Electrical Engineering – Electronics at Ferdowsi University of Mashhad, Iran (2012).
PRESENTER: So welcome to today's CBMM talk. It's great, really great, to have Daniela Rus coming here. She's, of course, the director of CSAIL, a great leader. I think you all know her. And from time to time, she has these great, wonderful, simple, beautiful ideas in robotics, which we read in papers and in the news, in the tech news.
And she's also a great friend of CBMM, has been a great advisor for me. And it's somebody who really likes the problem of the brain and not just artificial intelligence, although artificial intelligence, of course, is also a great problem.
DANIELA RUS: Thank you for this kind introduction. It's really a great pleasure to be here to share some of our ideas with the CBMM community. And so today, we will tell you about a new idea we have been pursuing, together with Dr. Ramin Hasani, who will present most of the talk.
And the basic idea we want to describe with you aims to bring the natural world and the engineering world closer together. And Ramin and I are going at this problem, in part because we have a general curiosity and desire to understand intelligence, in part because when I look at the state of the art in the field of artificial intelligence, I see a lot of advancements.
And I see that these advancements are really using decades-old ideas that are enhanced by computation and data. And so a natural question is whether this is intelligence. Another question is, are there other ideas? Can we use the natural world to inspire us to think differently?
Because I believe if we don't come up with new ideas, then our results are going to become increasingly more incremental. Because more and more people will be plowing the same field. And so the field really desperately needs some new ideas.
And the idea that Ramin will describe today aims to build machine learned models that are much more compact, much more sustainable, and much more explainable than the models that are based on deep neural networks. And so let me just say that much.
And now, it is my great pleasure to introduce more formally Dr. Ramin Hasani. Ramin is a postdoc in my group. Prior to joining my group, he was a PhD student at the Technical University in Vienna. And prior to that, he did his master's degree at Politecnico di Milano. And so with that, Ramin, please join us and tell us about your vision and results.
RAMIN HASANI: So hi, everyone. Thanks, Daniela, for the introduction. And thanks, Professor Poggio. All right, I'm very excited to be here, presenting liquid neural networks, a class of artificial intelligence algorithms that tries to bring a little bit of neuroscience in a structured way to machine learning.
So if you look at neural activity in brains, in general, on the left side, you see the brain activity of a mouse, and on the right side, you see one of the networks that we trained end to end-- a controller for controlling an autonomous car. We see that, basically, the activation of the patterns and activations maybe, superficially, look very similar.
But in principle, there are fundamental differences. There are huge gaps between intelligence as we know them in brains compared to deep models, in particular, representation learning capacities-- how natural brains actually approach the organization of the world around them to make use of them, to be able to control them to achieve their goals.
So we know that natural brains interact highly with their environments in order to understand their world. So by understanding-- I mean when they can actually interact with the world and to capture causality, basically, like the causal structure of the task that they are performing.
And this is one of the reasons where natural brains can actually go out of distribution, where statistical machine learning, by definition, will stay in IID, right? And this is one area that would be extremely beneficial if we can explore more and maybe bring some of those insights from natural brains back to artificial intelligence.
And at the same time, we know that brains are much more robust and much more flexible in terms of a perturbation or environments that they are getting into. And finally, efficiency of the models. So a network is not always active, so there is always some part of the network that is taking care of the computations that is on demand.
So allow me to demonstrate this kind of typical, statistical end-to-end machine learning system, where you have inputs that come from a camera. And then you have a deep neural network that takes care of the, let's say, steering angle of a car.
So in this kind of framework, what we are seeing is the activity of the network. And this network is actually real-world tested on a real car. These are demonstrations from the test set, where the networks are actually deployed in the environment. They have been trained using human data, and they are now deployed.
So one of the things that we actually looked into is, basically, the attention of this network, like what kind of representation has been learned? What pixels are the most important pixels when a driving decision is being made? So this CNN actually learned to attend to the sides of the road, where we see lighter regions in this attention map, in order to take driving decisions.
And that's not an actual causation. When you're driving, you're not just looking around, right? You're looking into the road and in front of you. So you want to actually have your focus on that perspective. So the causal structure here is missing, although the task is being completed by the network.
Now, if you add some noise on top of the image, like a little bit of noise, we see that this attention map is not even reliable anymore. Even if this noise is kind of a small Gaussian perturbation, you can see that it has huge influence on the decisions and the consistency of the decisions that the network makes.
So how can we improve this by bringing neuroscience in? Marr and Poggio set up a framework for us: if you want to explain a biological system, you can look at it from a system level and find out what the goals of the system are and what kind of mechanisms actually get you to those goals. That's the system level.
And then you can also have this view of looking into building blocks of these things, going down and looking into how intelligence emerges from cells. You can go down and basically use computational models, precise mechanisms that exist in biology.
So having this kind of framework in mind, what we can do-- and that's what we did, just showing you an outline of how this research is a summary of what this research is about. So we looked into nervous system of a small species. And we got down into neural circuit level.
And even for understanding neural circuits, we actually went into the neuron and synapse level even further to explain, to really fundamentally figure out, what are the building blocks there. And you know that you can even go lower than that and computational model down to atoms.
But there is actually a level that you have to satisfy yourself that you don't want to go below that in order to actually get there and then take this model and see what kind of capabilities you can have using the engineering, super-advanced machine learning frameworks that recently got developed. So we stopped at a certain level, which I'm going to explain throughout the talk.
And we saw that these models are much more expressive than their counterparts in deep learning, although the kind of abstraction that we did is really simple. But in terms of how much capacity these networks can generate, they are much more expressive. And I'm going to show you the math behind and also the experimental evidence for that.
These systems can handle memory, with explicit and implicit memory mechanisms that I will explain throughout the talk. More importantly, these systems can capture the true causal structure of the data. And that's part of the reason why these systems actually can be helpful in these kinds of closed-loop, real-world decision-making processes.
The systems are basically robust to perturbations. And we can use them for generative modeling. We can even use them for extrapolation. You can go out of distribution with these type of networks. Because if some process can capture the causal structure of the data and you can prove that that's the case, then the system is being able to actually go even out of distribution.
And with that in mind, we actually try to perform decision making in real-world robotics. We are distributed robotics lab, and we want to bring these insights into the brains. Now, to show you what kind of change we have done, you can look at this system.
This system has now, on the right-hand side, what you see is the 19 nodes of the system that is sparsely connected together. And this is described by that model that, actually, we developed.
And then you can actually get into attention maps that are much more focused on the true causal structure of the task. And this is not just on this task. But we can actually see more throughout the talk.
Well, how do you get started creating a model? Let's look into the interaction of two neurons and the information propagation through the synapse between the two. So neural dynamics, unlike deep learning systems, are typically given by continuous processes. And they are described by differential equations.
So synaptic release is not just the scalar rate. So synaptic release can be modeled with much more sophisticated kind of mechanisms. So you can really get down to probability of if a neurotransmitter is actually going to stick to the receptors of the second neuron. So you can really get into the process, how much complexity. You can really add nonlinearity to the system. And there are also recurrence in the structure, there's memory, and there is a sparsity all over the place in neural circuits.
So having these principles in mind, the goal is to actually incorporate these small principles that I mentioned into improving representation learning, improving the robustness of machine learning model and the statistical models, and, at the same time, improving their interpretability. So to get into a common ground between the computational work of neuroscience and the machine learning systems, I would like to start exploring where do we have continuous dynamics.
So let's start with these processes that have recently been brought up in the machine learning community-- continuous-time, or continuous-depth, models. So a continuous-time neural network is basically a neural network f that has a certain number of layers, has a certain width, and has an activation function of choice. It is a function of its hidden state and its inputs. And it's parameterized by parameters theta.
So if a neural network f parameterizes the derivatives of the hidden state, then you would have a continuous time process. Now, it's going to be a continuous time neural network. With this representation, you can go from a discrete computational graph, like in residual networks that we have. Like, you would actually take a computation step each layer.
Now, if you define your system like the way we show it here, the depth dimension of your system becomes continuous. And when you have a continuous-time system, then you would have a lot of advantages.
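In symbols, the step from a discrete computation graph to a continuous one can be sketched as follows. The symbols x, I, and theta follow the talk; the layer index k is introduced here only for illustration.

```latex
% Residual network: one discrete computation step per layer k
x_{k+1} = x_k + f(x_k, \theta_k)

% Continuous-time network: f parameterizes the derivative of the hidden state
\frac{dx(t)}{dt} = f\big(x(t), I(t), t, \theta\big)
```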
First of all, the space of possible functions that you could actually explore and generate is much more than that of the discrete representations. Second advantage is the arbitrary computation. So you don't need to perform computation at every time step. You can have arbitrary step time computation.
So your depth becomes very variable, basically. So it can be infinitely depths kind of networks with one process. And this would naturally, this continuous process, would be a natural fit for modeling sequential behavior.
So let's say, compared to the normal recurrent neural networks that you know, the updated state of a neural network is actually given with this discretization. If you have a neural ODE and, basically, a more stable version of that where it has a damping factor, then you can use this also as a recurrent neural network.
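One common way to write the three update rules being compared here is given below. This is a sketch in standard notation, assuming a single time constant tau for the damping term; the discrete index t+1 is used only for the ordinary recurrent network.

```latex
% Discrete recurrent network: the next state is computed directly
x_{t+1} = f(x_t, I_t, \theta)

% Neural ODE: f gives the derivative of the state
\frac{dx(t)}{dt} = f\big(x(t), I(t), t, \theta\big)

% Continuous-time RNN: same as above plus a damping term -x/\tau
\frac{dx(t)}{dt} = -\frac{x(t)}{\tau} + f\big(x(t), I(t), t, \theta\big)
```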
On the top row, you see the interpolation and extrapolation capability of a recurrent neural network on irregularly sampled data that are put around the spiral. And we see that the red line in between is actually extrapolation capability of this model, where it cannot actually capture the dynamics very well. But on the bottom row, you would actually see that the dynamic process generated by a continuous time recurrent neural network actually captures those dynamics properly and even extrapolates to that. So this is nice.
Now, how do we implement these things? I'm just going through the details of how to implement these types of models. So because they are ODEs, you want to use numerical ODE solvers. So you basically unroll this differential equation.
And then you can use any type of numerical ODE solver. So let's say we use an explicit Euler solver. And then, there, you can actually create the forward pass of your network based on this unrolled version of your network. And then, the choice of ODE solver will actually define the complexity of your map. You can use more complex adaptive solvers that have adaptive step sizes to have a more accurate forward pass.
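A minimal sketch of such an unrolled forward pass, assuming a toy one-unit network with made-up parameters (`w`, `v`, `b` are hypothetical) and an explicit Euler solver. Because each step uses its own `dt`, the time stamps may be irregularly spaced; passing `tau` adds the damping term of the continuous-time RNN form.

```python
import math


def f(x, u, w=0.5, v=1.0, b=0.0):
    """Toy one-unit 'network' tanh(w*x + v*u + b); parameters are illustrative only."""
    return math.tanh(w * x + v * u + b)


def euler_forward(x0, inputs, times, tau=None):
    """Unroll dx/dt = f(x, u), or dx/dt = -x/tau + f(x, u) if tau is given,
    with explicit Euler steps over possibly irregular time stamps."""
    x = x0
    trajectory = [x]
    for k in range(1, len(times)):
        dt = times[k] - times[k - 1]      # step size follows the data
        dx = f(x, inputs[k - 1])
        if tau is not None:
            dx -= x / tau                 # damping term of the CT-RNN form
        x = x + dt * dx                   # explicit Euler update
        trajectory.append(x)
    return trajectory


times = [0.0, 0.1, 0.25, 0.3, 0.6]        # irregular sampling
inputs = [1.0, 1.0, 0.0, 0.0, 0.0]
node_traj = euler_forward(0.0, inputs, times)
ctrnn_traj = euler_forward(0.0, inputs, times, tau=1.0)
```

An adaptive solver would replace the fixed Euler update with a step whose size is chosen from an error estimate; that is not shown here.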
How do you now do the backward pass? You can use the mathematically well-known adjoint sensitivity method, where, let's say you have a loss function, and your dynamic is given by a neural ODE.
So your loss function, basically, if you have the dynamic of your system starting from t0, given by this time, and you have labeled data, you can compute the output dynamic to compute a loss. And this loss is getting computed by running this ODE solver, which basically gives you this trajectory.
And then, the adjoint method actually creates a new state, an auxiliary differential equation, that connects the dynamics of the loss with respect to the state of the system. And then you can run this ODE backward one step at a time to get the gradients of the loss with respect to the state of the system. And at the same time, you would be able to also get the gradient of the loss with respect to the parameters of the system.
So this adjoint sensitivity method on the backward pass would give you a constant-memory propagation. Because it actually forgets the previous states, and it just does one-step-at-a-time computation when it does backpropagation.
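For reference, the adjoint sensitivity method is usually stated as below. This is the standard textbook form, not copied from the slides; a(t) is the auxiliary (adjoint) state and t_1 is the time at which the loss is evaluated.

```latex
% State dynamics and loss evaluated at the end of the trajectory
\frac{dx(t)}{dt} = f\big(x(t), t, \theta\big), \qquad L = L\big(x(t_1)\big)

% Adjoint state: sensitivity of the loss to the hidden state
a(t) = \frac{\partial L}{\partial x(t)}

% Auxiliary ODE, integrated backward in time from t_1 to t_0
\frac{da(t)}{dt} = -a(t)^{\top} \frac{\partial f\big(x(t), t, \theta\big)}{\partial x}

% Gradient of the loss with respect to the parameters
\frac{dL}{d\theta} = -\int_{t_1}^{t_0} a(t)^{\top} \frac{\partial f\big(x(t), t, \theta\big)}{\partial \theta}\, dt
```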
You can also train this network with backpropagation through time, gradient-based. What you do is perform one forward pass, and then, based on the chain rule, you can actually compute your derivatives. And you can update your parameters.
This way, you are actually not treating the solver in a black box manner. So you are actually going through the solver. So the dynamics of the solver becomes part of your gradient, as well. So you need to be careful about that. But at the same time, the memory complexity of this method is really high. But it is much more accurate than the adjoint method if you use it in a vanilla sense.
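A minimal sketch of backpropagating through the solver, assuming a scalar toy ODE dx/dt = -theta * x unrolled with explicit Euler and a squared-error loss on the final state (all of these choices are illustrative). The forward pass stores every intermediate state, which is where the memory cost comes from; the backward pass applies the chain rule through each Euler step.

```python
def forward(theta, x0, dt, steps):
    """Explicit Euler unroll of dx/dt = -theta * x; keeps every state."""
    states = [x0]
    for _ in range(steps):
        states.append(states[-1] + dt * (-theta * states[-1]))
    return states


def loss_and_grad(theta, x0, y, dt, steps):
    """Loss 0.5 * (x_N - y)^2 and its gradient w.r.t. theta via the chain rule."""
    states = forward(theta, x0, dt, steps)      # memory grows with `steps`
    loss = 0.5 * (states[-1] - y) ** 2
    grad_x = states[-1] - y                     # dL/dx_N
    grad_theta = 0.0
    for k in reversed(range(steps)):
        # x_{k+1} = x_k * (1 - dt * theta)
        grad_theta += grad_x * (-dt * states[k])   # direct dependence on theta
        grad_x *= (1.0 - dt * theta)               # propagate to dL/dx_k
    return loss, grad_theta


loss, grad = loss_and_grad(theta=0.7, x0=1.0, y=0.2, dt=0.05, steps=40)

# Finite-difference check of the hand-written backward pass
eps = 1e-6
loss_plus, _ = loss_and_grad(0.7 + eps, 1.0, 0.2, 0.05, 40)
loss_minus, _ = loss_and_grad(0.7 - eps, 1.0, 0.2, 0.05, 40)
numeric_grad = (loss_plus - loss_minus) / (2 * eps)
```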
So I told you how these models are getting implemented forward and backward. Now, we have this neural ODE. So we said the continuous-time processes, and this representation actually can have a spatiotemporal kind of data processing powers. And it actually has a really good potential.
But we didn't define any biological process there. We didn't actually get any inspiration from the biological insights that I talked about before. And a really funny fact is that when you deploy them in the real world, they're even worse than a simple long short-term memory network, right?
So basically, what's the point, right? If you define a really fancy equation that cannot even work in real-world applications very well, then what are we even doing? So let's improve.
Now, by this improvement, what we want to do, we want to get into biology. I told you that activity of neurons are described by differential equations. And you can actually model the dynamics of a cell or of a membrane as a leaky integrator and with these simple linear dynamics.
And the more important part is the conductance-based synapse model, where you can have a nonlinearity included in the synapse of the system and not in the neurons of the system. So basically, the interaction between two nodes or two differential equations is given by a nonlinearity. And this is what is inspired by channel modeling behavior of Hodgkin and Huxley when they did channel modeling of ion channels.
So you can actually get into this kind of steady-state behavior from those differential equations of Hodgkin-Huxley. You can reduce them into this abstract form. And the nonlinearities look like a sigmoid activation function. So you can, in principle, bring artificial neural networks inside the representation of a synapse.
Now, putting these two systems-- very simple things, has been there for over a century-- together, you will get a dynamical system of such. And this dynamical system has certain properties and certain advantages. It's obviously a neural ODE. It's an ODE-based neural network.
It has a component neural network f and nonlinearity that appears in the coefficient of x of t, or a state of your system, and in the state of the system itself. So there is a coupling between the state and the time constant of your differential equation.
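Written out, combining the leaky integrator with the conductance-based synapse gives the form described here. This is a sketch in the usual liquid time-constant notation: tau is the membrane time constant and A is the synaptic parameter mentioned later in the talk.

```latex
% Leaky-integrator membrane with total synaptic input S(t)
\frac{dx(t)}{dt} = -\frac{x(t)}{\tau} + S(t)

% Conductance-based synapse: the nonlinearity f gates the drive toward A
S(t) = f\big(x(t), I(t), t, \theta\big)\,\big(A - x(t)\big)

% Substituting S(t): f appears in the coefficient of x(t) and in the drive term
\frac{dx(t)}{dt} = -\left[\frac{1}{\tau} + f\big(x(t), I(t), t, \theta\big)\right] x(t)
                   + f\big(x(t), I(t), t, \theta\big)\, A
```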
So at the same time that f for that linear-- let's say I don't have recurrent connections. So x of t in that f is 0. Then f becomes only a function of I, or the inputs of the system. Then the whole system becomes a linear system.
Now, if you have that linear system, the coefficient of x of t is input-dependent. So if the inputs of the system is changing, then the kind of behavior of the differential equations changes. Because that defines the damping factor of your very simple neural network that you have and very simple dynamical system that you have.
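A small sketch of this input dependence, assuming no recurrent connection and a sigmoid gate with made-up weights (`w` and `b` are hypothetical). The effective time constant is the inverse of the coefficient of x(t), so it shifts as the input changes.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def effective_time_constant(tau, u, w=2.0, b=0.0):
    """Inverse of the coefficient of x(t): 1 / (1/tau + f(u)),
    with f(u) = sigmoid(w * u + b) standing in for the synaptic nonlinearity."""
    f = sigmoid(w * u + b)
    return tau / (1.0 + tau * f)


# The same neuron responds faster (smaller time constant) under a strong input
slow = effective_time_constant(tau=1.0, u=-5.0)
fast = effective_time_constant(tau=1.0, u=5.0)
```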
So just to show you a block diagram, like how does it look like, in a standard neural network, the range of possible connections that you might have is basically you can have-- let's say you have two neurons. They have activation function. You might be able to have reciprocal connections. You might have feedback. You might have an external input to the system, and they have their own scalar weights.
Now, in a liquid network, you would have the same kind of a structure but, at the same time, you have a nonlinearity that controls the interaction of two differential equations. So the difference here is that activations are changed to differential equations. And their interactions are given by a nonlinearity that can be a neural network.
So in terms of what does it represent, let's say I trained a neural network for driving, for autonomous driving, from visual data. I'm showing the visual data in the middle. I did that with a standard neural network that has a constant time constant. And I did that with a liquid network.
What we are seeing on the x-axis is 1 over tau. That means 1 over the time constant of the system. And on the y-axis, what we see is the steering angle of the car. And the color shows left for blue and yellow for turning right. And in the middle, you have the middle part.
So now, we see that a neuron actually learned to associate its behavior, its timing behavior-- without any prior, just to plug in those very simple building blocks together-- actually learned to associate the dynamics of the task to its behavior. So that's one of the advantages that you receive from these type of networks.
Another property of these networks is that the state of these systems is stable. And their time constant and their behavior are stable. So if you define the time constant of the system as that expression that is the coefficient of x of t, or the hidden states, then you can actually write that down by relaxing to the case of not having a recurrent connection. Let's say, x of t is out. Then you would be able to bound the time constant of the system. And these are actually the bounds that you can have. So the network cannot go unstable.
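The bound can be sketched as follows, under the assumption that the nonlinearity is bounded, 0 <= f <= W for some constant W (for a single sigmoid gate, W = 1). The symbol tau_sys for the effective time constant is introduced here for illustration.

```latex
% Effective time constant: inverse of the coefficient of x(t)
\tau_{\mathrm{sys}} = \frac{\tau}{1 + \tau\, f\big(I(t), \theta\big)}

% With 0 \le f \le W the time constant stays in a fixed, finite range
\frac{\tau}{1 + \tau W} \;\le\; \tau_{\mathrm{sys}} \;\le\; \tau
```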
You can also bound the state of the system. Let's say a neuron is receiving many synaptic connections. A, in this representation, is a synaptic parameter, and its synapse is specific. So each synapse has a bias, or has an A, that actually has a connection to this neuron.
And now, basically, you can say the maximum of the A parameter would be the maximum amount that your state can actually reach. And the minimum of that, the one that has the least one, actually has the least amount of impact on your activity of your differential equation.
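A numerical sanity check of this state bound, assuming one neuron with several sigmoid-gated synapses, each with its own A, driven by random inputs and integrated with small explicit Euler steps. The weights, A values, and input range are made up for illustration. The bound being checked is that the state stays between min(0, smallest A) and max(0, largest A).

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_neuron(A, weights, tau=1.0, dt=0.01, steps=2000, seed=0):
    """Euler-integrate dx/dt = -x/tau + sum_i f_i(u) * (A_i - x) under random
    inputs u and return the smallest and largest state values visited."""
    rng = random.Random(seed)
    x = 0.0
    lo = hi = x
    for _ in range(steps):
        u = rng.uniform(-3.0, 3.0)
        gates = [sigmoid(w * u) for w in weights]   # each gate lies in (0, 1)
        dx = -x / tau + sum(g * (a - x) for g, a in zip(gates, A))
        x += dt * dx
        lo, hi = min(lo, x), max(hi, x)
    return lo, hi


A = [-1.5, 0.5, 2.0]
weights = [1.0, -2.0, 0.7]
lo, hi = simulate_neuron(A, weights)
```

With a small enough step, each Euler update is a weighted average of the current state, zero, and the A values, which is why the state cannot leave that range.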
We can also show that this biologically inspired system is actually a universal approximator. You can use function approximation methods to prove that this expression can approximate any given dynamics with arbitrary precision, given a sufficient number of cells. But to truly find out how expressive a neural network is from the theoretical standpoint, we want to get down to a more fine-tuned expression.
So for example, there are more measures of expressivity of neural networks that we can use for measuring expressivity of a network-- for example, the trajectory lengths. Imagine I have a circular trajectory, and I input this circular trajectory to a deep neural network. I'm just defining what is this trajectory length measure.
You input this to a neural network. This neural network is parameterized. And then we can observe that, at every layer of the network, this trajectory gets deformed, gets more complex and the lengths of the trajectory getting more complex and complex. And it actually increased exponentially.
You can measure that length of this trajectory with an arc length measure. And you can actually find the lower bound for the expressivity of the neural network. Given its depth, you can actually measure the expressivity of a neural network by its parameterization, properties of its synaptic parameterization, the width of the network, and the depth of the network, basically.
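A minimal sketch of the trajectory length measure, assuming a small random tanh network with hypothetical width, depth, and weight scale. It feeds a unit circle through the layers and records the arc length after each one; how fast the length grows depends on these choices, and the sketch makes no claim about the rate.

```python
import math
import random


def arc_length(points):
    """Sum of straight-line distances between consecutive points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def random_layer(width_in, width_out, sigma_w, rng):
    """Random weight matrix with entries drawn from N(0, sigma_w^2 / width_in)."""
    scale = sigma_w / math.sqrt(width_in)
    return [[rng.gauss(0.0, scale) for _ in range(width_in)]
            for _ in range(width_out)]


def apply_layer(W, point):
    return [math.tanh(sum(w * p for w, p in zip(row, point))) for row in W]


rng = random.Random(0)
n = 200
circle = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
          for k in range(n + 1)]

lengths = [arc_length(circle)]          # length of the input trajectory
points = circle
for _ in range(4):                      # depth 4, width 16 (illustrative)
    W = random_layer(len(points[0]), 16, sigma_w=4.0, rng=rng)
    points = [apply_layer(W, p) for p in points]
    lengths.append(arc_length(points))  # length after each layer
```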
So we actually did use this expressivity measure. Because this actually draws a boundary between shallow networks and deep networks. The deeper you get, the more expressive you can get based on this measure.
Now, in our space, we have continuous-time processes, let's say, liquid time constant networks, or LTCs. We have continuous time neural networks. And we have neural ODE representations. Now, if we give the same neural networks-- we parameterize this neural network f for all of these processes, given their representation of differential equation-- we see that we consistently get longer and more complex trajectories out of the LTC network.
Now, we systematically analyzed this in an empirical fashion, where we changed, basically-- like, on the x-axis, you see different types of ODE solvers for these three types of networks. Neural ODEs, CTRNNs, and LTCs. And we see that the yellow line actually shows the trajectory lengths.
For these LTC networks, we see that, even if you change the width of a network, on the x-axis, you see that the trajectory length is always higher. And we can see that if the initialization of your network is actually changing, you also have a dependency on that.
Now, we also figured out, theoretically, a lower bound for the expressivity of these types of networks, where the lower bound is a function of weight scale, bias scale, width of the network, depth of the network, and the number of discretization steps that you're taking for your ODE. And we also implemented that for LTCs. You cannot compare lower bounds to say that, yeah, so this network is more expressive than the other one. But it's just a good measure to see where we are standing in terms of this type of behavior.
Now that we have this type of measure and theoretically evaluated them, let's really put these networks in action, and let's see how good they are in representation learning. So one of the things we start with modeling physical dynamics. When I told you that a neural ODE cannot beat an LSTM network, you see that here. And you see that we can actually get better performances while using these networks.
You can compare them across a large series of advanced RNNs. And this [INAUDIBLE] inspired network is actually beating them, even on person activity, a real example, with irregularly sampled data.
We also performed some analysis on some real-world examples. And we saw that, on most of these tasks, LTCs are better. On one task, LSTM is better, and that's the task where we have longer-term dependency.
And that's one of the issues that you have to solve: gradient propagation in continuous-time processes is problematic. So you always have to take care. If you actually wrap them inside a kind of well-behaved gradient propagation, then you would be also getting a better performance there.
We didn't stop there. And we actually scaled the applications to this end-to-end autonomous driving that, at the beginning, I showed you. We have human-collected data. And we trained deep learning models.
Typically, a deep learning pipeline actually looks like that when you want to have a set of convolutional heads. And then you would have fully connected networks that has, basically, the over-parameterization part of their network is actually there, in the hidden layers. Between five to 100 million parameters it takes to actually perform lane-keeping, or this type of task, if you have this type of networks.
What we did, we said that let's replace the fully connected networks by continuous-time processes, and let's see what kind of behavior we get. So we get four types of variants. We take a neural circuit policy, which is the first one, NCP.
That has a four-layer architecture-- again, nature-inspired-- that has interneurons, command neurons, and motor neurons, all LTC-based neurons based on the math I showed you before. You can replace those fully connected layers with LSTMs and CTRNNs, and you have the convolutional neural network. So I'm going to talk about differences of these four variants.
So the first thing, the number of parameters required to actually perform autonomous driving is basically significantly reduced when you're using these types of networks. Now, remember the representation of the network where I was showing that a convolutional network with fully connected layers can get perturbed, the kind of representation it learns.
And now, with LTCs, we would be able to have 19 neurons at control. And then we perform and see that the convolutional part of it-- so what I'm showing in the attention map, we are not changing the convolutional neural network structure of these variants, of these network variants that I showed you. We see that this architecture imposes an inductive bias on the convolutional networks that let them learn a causal structure.
Now, if you add, even, noise, we see that the explanations are not scattered as much as it was for convolutional neural networks. We also take to a real-world measure of this, like how many crashes would you have if you increase the amount of input noise? And you will see that these kind of networks are basically much more robust to this type of perturbations.
And now let's look at the convolutional neural network attention of these end-to-end trained networks when their heads are different-- when they had a CTRNN, when they had a LSTM, and when they had our LTC-based model. And we see that the kind of prior that the recurrent neural network had put on convolutional neural networks makes them learn different types of weights. So the representations that are learned out of this system are completely different from each other.
And we see that the only one that has a consistent behavior is the CNN itself in our solution. But CNN actually focuses consistently on the outside of the road, so we don't want that. LSTM is actually giving you a good-- most of the time-- a good representation. But it is actually sensitive to lighting condition.
So if I stop the video in some parts, you will see that when the shading areas are not good, the attention of that LSTMs are actually getting scattered. And the CTRNN, or the neural ODEs, basically cannot actually gain a nice representation in this task.
Now, why is this the case? Now, let's explore the why of this. So if you look at the taxonomy of possible modeling frameworks, at the bottom at one end of this-- I don't want to call it the bottom-- at one end of the spectrum, we have the statistical models where statistical models are amazing in learning from data and, at the same time, basically performing inference in IID, so predicting in IID. So this is actually what the statistical models can do.
On the other side of the spectrum, we have physical models. So physical models are basically described, usually, by differential equations. When you have differential equations that describes the dynamics of your system, they can actually answer questions. They can account for interventions in the system.
So if you can actually design a universal approximator that is closer to the physical kind of models, then you would actually get into a more causal structure by nature. And also, you're able to actually get insights about the system. You can learn from data. You can answer counterfactual questions and predict in IID and out of distribution.
So as I said, physical dynamics can be modeled by ODEs. And this set of ODEs can actually predict the future evolution of your system. They can describe the results of interventions in the system. And the coupled-time evolution helps us define averaging mechanisms for capturing the statistical dependencies in data. And it enhances our understanding of the physical phenomena. And because of that, they are actually causal structures.
So now, let me get more formal about this. Let's say we have a differential equation given by dx over dt equal to g of x, where g is basically the nonlinearity of the system. So we have the Picard-Lindelöf theorem that actually shows that this kind of differential equation would have a unique solution if the nonlinearity is Lipschitz.
Now, if you unroll this system with Euler, then the representation, the underlying representation under this uniqueness condition, would be a causal mapping. Why? Because you can actually say what happens in a future event, which is x at t plus dt, based on the previous events.
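The Euler unrolling described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name euler_unroll, the step size, and the example right-hand side g(x) = -x are not from the talk. The point it shows is that each new state x(t + dt) is computed only from the state that came before it.

```python
import numpy as np

def euler_unroll(g, x0, dt, steps):
    """Unroll dx/dt = g(x) with explicit Euler and return all visited states."""
    x = np.asarray(x0, dtype=float)
    trajectory = [x]
    for _ in range(steps):
        # The next state depends only on the current (past) state.
        x = x + dt * g(x)
        trajectory.append(x)
    return np.stack(trajectory)

# Hypothetical example with a Lipschitz right-hand side: linear decay g(x) = -x.
traj = euler_unroll(lambda x: -x, x0=[1.0], dt=0.1, steps=10)
```

With this choice of g, every Euler step multiplies the state by (1 - dt), so the trajectory decays geometrically from the initial value.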
Now, there is a framework within this spectrum of causal models. It's called the dynamic causal model. So a dynamic causal model has the nonlinearity of the shape that you're seeing. It does take a bilinear approximation, or a second-order Taylor approximation, of that ODE. And it gives you these coefficients for the system.
So coefficient A controls the internal coupling of the system. Coefficient B controls the coupling sensitivity among network nodes. So it actually accounts for internal interactions and interventions. And coefficient C regulates the external inputs.
This framework is actually a graphical model that is implemented by ODEs. So you can put these things together to actually create this system. They allow for feedback, as opposed to the kind of Bayesian network architectures.
Now, if we look at the liquid neural networks, or the representation that we gain from them, we need two conditions: that f is a C1 mapping-- that means, like, f is Lipschitz-continuous, basically, and is bounded-- I didn't write the bounded, no? No, I didn't write that, so it has to be, also, bounded-- and that tau is positive. And if you have a strictly positive tau, then this network would also have a unique solution.
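The slide with the equation is not part of the transcript. As a hedged reconstruction, the bilinear form commonly used in the dynamic causal modeling literature is written below; the input symbol u and the index j are assumptions introduced here, and g is taken to depend on both the state x and the input u.

```latex
\frac{dx}{dt} = \Big(A + \sum_{j} u_j B^{(j)}\Big)\, x + C u,
\qquad
A = \frac{\partial g}{\partial x}, \quad
B^{(j)} = \frac{\partial^2 g}{\partial x\, \partial u_j}, \quad
C = \frac{\partial g}{\partial u}
```

Read this way, A is the fixed internal coupling, each B^{(j)} describes how input u_j changes the coupling among nodes, and C carries the direct effect of the external inputs.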
Now, let's say I assume that this f, the nonlinearity, is given by a hyperbolic tangent. It has recurrent connections. And it has weights for an input mapping. And then, with this nonlinearity, I would be able to compute the coefficients.
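A minimal sketch of such a network follows, assuming the published LTC form dx/dt = -(1/tau + f) * x + f * A with f = tanh(W_rec x + W_in u + b). All parameter names, sizes, and values here are hypothetical, and the explicit Euler step is a simplification chosen for brevity rather than the solver used in the talk.

```python
import numpy as np

def ltc_derivative(x, u, W_rec, W_in, b, tau, A):
    """Right-hand side dx/dt = -(1/tau + f) * x + f * A, with f = tanh(...)."""
    # f has recurrent connections (W_rec) and an input mapping (W_in).
    f = np.tanh(W_rec @ x + W_in @ u + b)
    return -(1.0 / tau + f) * x + f * A

def ltc_euler_step(x, u, params, dt):
    """One explicit Euler step of the LTC state (illustrative solver choice)."""
    return x + dt * ltc_derivative(x, u, **params)

rng = np.random.default_rng(0)
n, m = 4, 2  # hypothetical state and input sizes
params = dict(
    W_rec=0.1 * rng.standard_normal((n, n)),
    W_in=0.1 * rng.standard_normal((n, m)),
    b=np.zeros(n),
    tau=np.ones(n),  # tau = 1 keeps 1/tau + f positive, since tanh > -1
    A=np.ones(n),
)
x = np.zeros(n)
for _ in range(50):
    x = ltc_euler_step(x, u=np.array([1.0, -1.0]), params=params, dt=0.05)
```

The term (1/tau + f) acts as an input-dependent decay rate, which is where the "liquid" time constant comes from.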
If you look at the coefficients for causal models, we can compute the coefficients of this causal behavior. So that means there are certain parameters of the system that are responsible for a certain type of intervention in the system-- internal intervention and external intervention in the system.
Just from the diagram perspective-- going back to our diagram-- we will actually have a dynamic causal model that can have the parameter B that controls the amount of collaboration of two nodes with each other, or interactions of two nodes, and coefficient C that controls the inputs, or external inputs, to the system. You would have the same type of behavior-- it's a nonlinear version of that dynamic causal model-- that actually performs the same thing. And they have more sophisticated causal structures.
Now, with that, we did some experiments. They are behavioral cloning kind of experiments where we have drone agents that are moving in the environment. And they are given-- visually, there is actually a target in the environment.
And we ask the drones-- so actually, we drive the drones towards that target. And with this visual demonstration, what we want to do, we want to learn this behavior and gain agents that are good in closed loop when they're interacting with the environment.
We see that this is actually a learned behavior of this system, where as soon as the target becomes apparent, then we see that this neural network actually learned to focus on that target. Because that's the important thing in this kind of task. So basically, the causal structure of the task is learned by these drone agents.
Now, if you compare the kind of focus, or attention, of these networks to other neural networks, we see that the only representation where we actually see this type of behavior is the liquid network-based solution, whereas this attention is not persistent in the other ones. So we cannot say that the other systems actually learned to navigate towards the target and understood what they were doing.
We also did that in multi-agent. Right now, you're a follower drone. And there is a leader drone in front of it. And the task is basically to follow this drone. And in this type of environment, also, we observe that the attention of the network is, actually, always on the second drone, basically. So that means the causal structure is actually captured.
Now, how can you show this even more quantitatively? Then we looked into closed-loop interaction. We trained these networks in open loop and from training data. Now, we deploy them, actually, in that environment. And we measure the success rate that they can have in different types of tasks in closed loop.
So if they do not have the true causal structure of the task, they wouldn't be able to perform this task very well. And we did this across a spectrum of different kinds of perturbations on the system. We see that these systems are able to perform much better than the other ones.
Of course, there is always room for improvement, even for these systems. Because we didn't add any kind of constraint on helping these systems to learn more and more. So we were just trying to see what's the gap between these types of networks and the others.
So obviously, these types of networks come with certain limitations. So the complexity of the networks is basically tied to the complexity of their ODE solver. So as a result, you might have longer training times and longer test times if you use these networks.
You can have a solution for that. You can use the fixed-step ODE solvers. You can use the sparse flows. You can use a sparsity-- and the process that optimizes sparse neural networks-- on, let's say, CPUs or any kind of hardware that you're running or GPUs.
And then you can use hypersolvers. And these are the class of solvers where they can actually integrate everything together, and they can actually run much faster when you have differential equations. You can also use closed-form variants in these kind of scenarios.
So you can use the closed form-- if you solve these differential equations in closed form, then you can end up with a nicer representation. And that's one of the things that we did and we're very excited about.
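To make the fixed-step option mentioned above concrete, here is a small sketch of a classical fourth-order Runge-Kutta solver with a fixed number of steps. The function names and the test equation dx/dt = -x are assumptions for illustration. The counter shows the property that matters for runtime: the number of right-hand-side evaluations is fixed in advance (four per step), instead of being decided by an adaptive solver at run time.

```python
import math

def rk4_fixed(f, x0, t0, t1, n_steps):
    """Integrate dx/dt = f(t, x) from t0 to t1 with n_steps RK4 steps."""
    h = (t1 - t0) / n_steps
    t, x = t0, x0
    for _ in range(n_steps):
        k1 = f(t, x)
        k2 = f(t + h / 2, x + h / 2 * k1)
        k3 = f(t + h / 2, x + h / 2 * k2)
        k4 = f(t + h, x + h * k3)
        x = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + h
    return x

counter = {"calls": 0}

def decay(t, x):
    """Hypothetical test dynamics dx/dt = -x that also counts evaluations."""
    counter["calls"] += 1
    return -x

x_end = rk4_fixed(decay, 1.0, 0.0, 1.0, n_steps=10)
```

Here the cost is exactly 4 * n_steps evaluations, so choosing n_steps trades accuracy against a predictable compute budget.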
So there's another limitation of these ODE-based networks. They might also exhibit the vanishing gradient problem. Because they're continuous systems, and their memory is given by an exponential decay. So then, you would face difficulty learning long-term dependencies.
So the solution is that you wrap it inside a well-behaved kind of process-- for example, a gating mechanism that you can actually put these networks together with-- for example, if you have the state of an LSTM network defined by an LTC network. So if you do that, then you would have a gating mechanism, and you have a gradient propagation that preserves the gradients.
Now, in summary, what I showed you is that you can acquire knowledge by these flexible neural models that can perform inference model-free. They can really capture the temporal aspects of the task that is at hand better than-- the tasks that require temporal kind of data processing, they can actually infer the-- and these are all thanks to their causal structure. And they would be able to perform credit assignment better than the other models that are out there.
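The link between exponential decay and vanishing gradients can be seen in the simplest case. The derivation below assumes a single leaky unit with a constant time constant tau and no input, which is a simplification of the networks in the talk and not their full dynamics.

```latex
\frac{dx}{dt} = -\frac{x}{\tau}
\;\Longrightarrow\;
x(t) = x(0)\, e^{-t/\tau}
\;\Longrightarrow\;
\frac{\partial x(t)}{\partial x(0)} = e^{-t/\tau} \to 0 \text{ as } t \to \infty
```

So the sensitivity of the current state to a state far in the past shrinks exponentially with the elapsed time, and the gradient signal for long-term dependencies shrinks with it.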
So you might use them for generative modeling. And if you want to model the world, you basically can use these representations or also get representation of your world in order to do further inference from those kind of models.
So there are certain properties that I mentioned-- the compositionality of, layer-wise, these networks, you can actually put them in different architectures. And you can connect them in a sparse fashion. And the network is actually differentiable. And you can use this.
And if you're dealing with visual data or video data, you would be adding CNN heads or perception modules. And then this can act as your decision-making engine. They're expressive, they're causal, and they add more to the interpretability of the networks.
So some of the perspectives that we have is that there is-- I just put two different hundred-year-old models together, and these are all the kinds of properties that emerge from those kind of things. And you can see how much potential is actually in this type of research that you can put, and you can really explore what's going on in the brain.
And why do you need to do that? Because, basically, the search space is huge if you just want to algorithmically implement something intelligent, right? So you would narrow it down if you actually focus on brains and how they acquire knowledge. And definitely, because we have these machine learning tools these days, you would be able to actually do much more than was possible before.
We can also work with the objective functions. In this talk, in this research that I showed, we just focused on the model and the properties of the model in a structured fashion. So you can also work with the objective function of your learning problem.
You can also, for learning processes, you can use physics-informed kind of learning processes in order to perform this type of learning. You can do causal entropic forces, for example. This is like defining intelligence as a force that maximizes the future freedom of action.
So that would be a new way of formulating intelligence. And then, from there, you would be able to actually get into much more. So this is actually an exciting area of research that could be enabled and scaled by what we showed today.
And as I said, one of the properties that we showed today is that there are certain structures that can emerge from these liquid networks. And those structures are good. So you would be able to use these for more complex tasks.
So these are good candidates-- this could be giving you some candidates for performing decision-making, better decision-making, based on these selective computations. With that, I would like to thank you for your attention. And all this technology is open source. You can actually get them online.