Principles and applications of relational inductive biases in deep learning
April 19, 2019
April 11, 2019
Kelsey Allen, MIT
Common intuition posits that deep learning has succeeded because of its ability to assume very little structure in the data it receives, instead learning that structure from large numbers of training examples. However, recent work has attempted to bring structure back into deep learning, via a new set of models known as "graph networks". Graph networks allow for "relational inductive biases" to be introduced into learning, i.e., explicit reasoning about relationships between entities. In this talk, I will introduce graph networks and one application of them to a physical reasoning task where an agent and human participants were asked to glue together pairs of blocks to stabilize a tower. We will go through DeepMind's recently released graph networks library (implemented in TensorFlow) to see how to set up different graph models, and train some simple models on some simple tasks.
Kelsey Allen is a PhD student working with Josh Tenenbaum on the problems of structured physical reasoning, planning, and learning from limited data.
Computational tutorial references and videos can be found on our Stellar site (https://stellar.mit.edu/S/project/bcs-comp-tut/index.html) or on the CBMM learning hub (https://cbmm.mit.edu/learning-hub/tutorials#block-views-learning-hub-blo...)
PRESENTER: So today we have Kelsey Allen talking about graph nets. Kelsey is a grad student in the Tenenbaum lab. And so I'll pass it off to her.
KELSEY ALLEN: Hey, everybody. Thanks for coming out this morning. As Jenelle mentioned, I'm a grad student with Josh Tenenbaum. And I'm going to be talking today about work that I did while I was an intern at DeepMind on a suite of approaches that we call graph networks. The talk is going to be about-- how do I get to the-- there we go.
I'm going to give just about a 30 minute introduction to graph networks, including an introduction to what they are generally, talking specifically about what graph networks are mathematically and how they're applied, and then giving a specific example that I worked on while I was at DeepMind on using graph networks to do physical reasoning for different kinds of construction tasks. And then the rest of the time of this tutorial, which I think is going to be the most useful, is actually getting some hands-on experience with the graph nets library that DeepMind released, which is all in TensorFlow.
So hopefully, those of you who have laptops can sort of walk through this, and it should be very useful, I think, to go through it together. And then at the very end, we could always have a discussion. I'm very excited about talking to people about ways in which you might want to use graph networks for your own work and also what graph networks, what kinds of extensions we might think about for future work.
So just to give an introduction, so when I was at DeepMind, I was working in the cognitive science group. And so a lot of the work that we did is influenced by thinking about what kinds of things humans need to be able to reason about in order to interact with the world. So in most of our daily experience, we interact with different kinds of very structured scenes. So on the left, you have a structured tower that's made of different blocks that can be connected to make even taller towers, which you can then play with as an infant.
And on the right, we often think about graphs of communities, like all of our social media connections, or inside our families who's connected to who, or inside this department who's connected to whom. And this is just to give two different examples of graph structured kinds of reasoning in two very different kinds of domains. And so we would like approaches that can easily handle these kinds of structures that we see all the time.
To give some other examples, computationally and empirically, of what people have worked on in the realm of cognitive science that is related to some of the graph network stuff we've done, early on in the late 1990s, Dedre Gentner has done really fabulous work looking at how people do analogical reasoning using relationships. There's been a whole lot of work on hierarchical planning and composing different kinds of routines, which you could also think of as a sort of tree shaped graph for how you might combine lower level primitives into higher level plans, as well as things like discovering relational structure, like Charles Kemp and Josh Tenenbaum's classic work on trying to discover graphs of relationships between different kinds of, for example, animals or other kinds of semantic information.
To go back and look at sort of the history of AI, early on many classic approaches were focused on discovering and using structure for reasoning. So things like logic, grammars, graphical models, et cetera are just some examples. And these were really critical when data was sparse, because using structure affords us very strong inductive biases, which make learning from sparse data reasonably tractable.
And so early on, these connectionist models that have now become very popular often failed because they didn't have these strong biases which allowed them to learn effectively from sparse data. They required more data to make up for the lack of biases in order to try to learn that structure, which they didn't have at that time.
However, when data became more available, it became clear that some of our structural assumptions that we were using in these early models were incorrect. And so at this point, and as we all know now, these connectionist approaches that are now termed broadly deep learning have been employed to great effect in all kinds of different scenarios. And in particular, I think the most compelling way it's been applied has been in vision. And I think the reason for that is because we don't really have a good understanding of the underlying structure in vision that is in some sense correct. And so learning the structure from massive amounts of data has proven to be substantially more effective.
However, now we've taken again a turn in deep learning in thinking about how we can bring structure back to these kinds of approaches and more meaningfully integrate learning at really large scales with the kinds of structure that we expect to see in the world. And so deep learning has now been used for several different approaches combining sort of classical AI and more recent work. Things like deep learning for logical reasoning. And a lot of the work the Tenenbaum Lab has done is in using deep networks for initializing inference or for learning proposals for graphical models. And so in this talk, I'm going to be talking about a similar sort of theme but particularly in encoding deep networks with graph-structured data.
So why might we actually want to use graph-structured data? Well, in a variety of different kinds of things we might care about, graphs are really popping up everywhere. So to just give some examples that I think cover a very wide range of different ways in which you could think about the world as graphs. On the upper left here, Qui et al. used graph networks to try to predict traffic flow. So you could imagine a city map as a graph where you have different nodes as being the hubs that people are moving between and the roads as the edges in that graph which you're trying to predict the congestion of, for example.
You could also consider organic chemistry to be a kind of graph where a molecule is described as a graph where each atom is a node and the bonds connecting those atoms are the edges within that graph. And you could use that graph structured representation to try to predict, for example, the energy or the fundamental frequency of that resulting molecule.
On the upper right, you can use graph networks to describe the bodies of agents. You can describe the body of a particular agent as a graph in that each limb is a node within that graph and the joints connecting those limbs are the edges. And you can do things like try to predict how that body will move when a given limb or joint is perturbed.
And then the work that I was mostly focused on in my internship was on using graphs for physical prediction. So you can imagine in a physical scene that each object is a different node in that graph and the edges connecting those objects are the forces. So things like gravity or elasticity or anything else. And you can then use that to try to predict the motion of objects in different physical scenes.
So to give even just a couple more examples on the bottom two rows, you can also think about tree structured data as being a kind of graph. So you could use graph networks to try to do things like, for example, semantic parsing, since a parse tree is just another kind of graph.
You can also not make any commitment to where the objects are and instead just break an image into different grid cells that could be fully connected, and that could also be seen as a graph. So in all of these cases, we can think about-- I'm going to use the terms entities and nodes somewhat interchangeably, and likewise relations and edges. So the relations are the connections between the entities.
To interpret some of our standard deep learning tools with graph terminology I think is somewhat informative in thinking about the kinds of inductive biases the different layers afford us. So at the weakest level, we can imagine a fully connected layer, like the one on the left. And here the entities are just the units in the neural network. So each of these different blocks is an entity, and the relations are all to all connections.
And so the inductive bias afforded by this is relatively weak because we haven't assumed any inherent structure in that setup. Making it a little bit more structured, we could use things like convolutional layers where now the individual entities are the grid elements in your image and the relations are local. So this affords an inductive bias of locality and now makes a convolutional layer spatially invariant. So you can do any different kinds of spatial translations, and it will be robust to that.
Recurrence is another kind of-- you could imagine that as another form of graph where now you have entities as being the time steps and the relations are sequential. So time step 0 is connected to time step 1. And so this gives you the inductive bias of sequentiality and the invariance of time translation.
And finally, graph networks are in some sense just an extension of these various ideas. But now we'll talk about general nodes as the entities, general edges as the relations, and the inductive biases that you can now get are arbitrary and depend on the structure of your graph. And the invariances you get are to node and edge permutations.
To go into a little bit more intuition on that and really hammer home this idea of being invariant to the order of the entities, I want you to consider this solar system. So we have a sun in the center and a bunch of planets orbiting that might each have a moon. And we want to, for example, predict the center of mass of the system. So critically, the order in which we consider the planets should not matter. It shouldn't matter if I represent this scene as being this feature concatenated with this feature concatenated with this feature instead of this one and then this one and then this one.
But if we were trying to represent this scene in a classic sense, we would have to commit to some ordering of these entities in order to use a standard deep learning kind of approach. And so a standard, for example, multi-layer perceptron trained on the features of the entire scene won't be order invariant. And so instead what we would like, as in physics, is to apply the same function to all the objects and interactions in the scene and then aggregate that information to make predictions. And so this is really at the core of what graph networks do.
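To make the order-invariance point concrete, here is a minimal NumPy sketch. The body values, the random linear layer standing in for an MLP, and all function names are assumptions made up for illustration; none of them come from the talk or from the graph nets library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Four bodies, each described by [mass, x, y] (made-up values).
bodies = np.array([
    [10.0, 0.0, 0.0],
    [1.0, 2.0, 0.0],
    [2.0, 0.0, 3.0],
    [0.5, -4.0, 1.0],
])

def center_of_mass(objs):
    """Apply the same per-object function, then aggregate with sums."""
    weighted = objs[:, :1] * objs[:, 1:]          # per object: mass * position
    return weighted.sum(axis=0) / objs[:, 0].sum()

# Stand-in for an MLP: one random linear layer on the concatenated features.
W = rng.normal(size=(bodies.size, 2))

def mlp_on_concatenation(objs):
    return objs.reshape(-1) @ W

shuffled = bodies[[2, 0, 3, 1]]                   # same scene, different order

same_com = np.allclose(center_of_mass(bodies), center_of_mass(shuffled))
same_mlp = np.allclose(mlp_on_concatenation(bodies), mlp_on_concatenation(shuffled))
print("sum-based prediction unchanged:", same_com)
print("concatenation-based prediction unchanged:", same_mlp)
```

The per-object-then-sum computation gives the same answer for any ordering by construction, while the layer on concatenated features has a separate weight for every slot and so generally does not.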
In order to get into a little bit more of the mathematical details of how these things work, I'm going to use this general graph definition. So actually, before I go any further, does anyone have any questions so far? Yeah?
AUDIENCE: Can you just rehash the explanation of why the multilayer perceptron has the order--
KELSEY ALLEN: Has this order effect? OK. Yeah. Maybe I can draw it. So imagine we're going to call this 1, 2, and 3 and 4 as our entities. And I'll forget about the moons for now. And so we'll have-- I guess I can't easily write anywhere.
AUDIENCE: Do you want that board out there?
KELSEY ALLEN: Maybe.
KELSEY ALLEN: I'll try to explain at a high level while they're getting the board. So if you wanted to apply a standard deep learning kind of approach, you need to represent this scene as some vector, as a single vector representation. And so in order to do that-- thank you. Yeah. Oh, green is terrible.
So imagine this has some position that's like-- actually I'll just call this position one, position two, position three, position four. So if you wanted to predict the center of mass of this whole system, then you need to represent the scene in some way. So the standard thing to do would be to concatenate your features and then run this through some network and then predict center of mass. Does this make sense?
But you should equally be able to do this and predict that same center of mass. But this feature representation is completely different from this. And so if you were to train it such that you always flip the order of these features, you could learn to become invariant to the order in which you present them, but it's not inherent to the network.
So what we would like is to not have this issue, where we would need to know in advance to train on every reordering of our features in order to always get the same center of mass. Does that make intuitive sense? Yeah?
AUDIENCE: So it's actually about the trained network and not necessarily about the order that you choose at the beginning?
KELSEY ALLEN: Can you explain what you mean by that?
AUDIENCE: So you could make a network that works on any of these features, but then you can't switch it after you've already trained it to work on a new set of features.
KELSEY ALLEN: Exactly. So you want to have something that can always be trained where you don't need to know the order in which you will show the features to that network in the future. So if I also, for example, always knew that I was going to have this object in this first position, that would also be fine for a classic approach. Does that make sense to people? OK.
AUDIENCE: I guess what's not clear to me is why it's a problem to know that-- to fix the order in which you put things?
AUDIENCE: The thing is, the way these things will work, I guess, are just [INAUDIBLE]. So just allow a class of functions that are invariant.
KELSEY ALLEN: That's right. Yeah.
AUDIENCE: The permutation or whatever structure there is. So it's more of an example of where-- this is an example of a problem where you might kind of get an advantage out of this invariance under permutations. So I guess--
KELSEY ALLEN: Yeah. Another thing, by the way, that the graph networks can do that an MLP could not is add another guy here. So this is maybe more clear. But if I wanted to add a fifth object to this system and still compute the center of mass, I now don't have the same sized feature vectors. So I can't use the same network at all. Does that make sense to people? OK.
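The fixed-size problem can be sketched the same way. This is an illustration under assumed values: the weight matrix `W` is a hypothetical stand-in for the first layer of an MLP sized for four objects, and the fifth body is invented.

```python
import numpy as np

def center_of_mass(objs):
    """Per-object function followed by a sum, so any number of rows works."""
    weighted = objs[:, :1] * objs[:, 1:]
    return weighted.sum(axis=0) / objs[:, 0].sum()

four = np.array([
    [10.0, 0.0, 0.0],
    [1.0, 2.0, 0.0],
    [2.0, 0.0, 3.0],
    [0.5, -4.0, 1.0],
])
five = np.vstack([four, [3.0, 1.0, 1.0]])   # add a fifth object

# Weights sized for exactly 4 objects x 3 features.
W = np.ones((four.size, 2))

com_four = center_of_mass(four)
com_five = center_of_mass(five)              # same function, no change needed

try:
    five.reshape(-1) @ W                     # 15 features into a 12-input layer
    mlp_accepts_five = True
except ValueError:
    mlp_accepts_five = False

print(com_four, com_five, mlp_accepts_five)
```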
All right, so to get more into the details of that, we're going to use this general graph definition. So on the left there's a graph. And I'm going to say that the graph has nodes v. vi is going to be node i with different kinds of attributes. Edge k is going to denote a particular edge in the graph which is related to a sender node sk and a receiver node rk and then a u, which is a global graph variable with attributes.
So each of these things has a feature vector, which holds the attributes of that object in the graph. So for example, in the solar system example, the nodes would potentially have the positions and the masses as their attributes. The edges could have something like gravity as their attributes.
To give you just an example when we're walking through this, I'm going to use this mass-spring system where I'm assuming that the nodes vi are the masses in the mass-spring system and the attributes could be the mass, position, and velocity of those objects. The edges ek are going to represent the possible interactions between the masses. In this case, springs. And every edge is directed.
So in this graph, you would have four edges, one from node 1 to node 2, one from node 2 to node 1, one from node 2 to node 3, and one from node 3 to node 2. And then the global properties u of this graph could, for example, be the total energy of this mass-spring system. So now to get suddenly-- yep.
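One way to write this three-mass, two-spring graph down as data is sketched below. The dictionary layout, the key names, and every numeric value are assumptions chosen for illustration; nodes are numbered from 0 here, so node 1 in the talk is index 0.

```python
import numpy as np

mass_spring_graph = {
    # One row per mass: [mass, position, velocity] (1-D positions for brevity).
    "nodes": np.array([
        [1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0],
        [2.0, 2.5, 0.0],
    ]),
    # One row per directed edge: [spring constant, rest length].
    # Each physical spring appears twice, once per direction.
    "edges": np.array([
        [50.0, 1.0],
        [50.0, 1.0],
        [20.0, 1.0],
        [20.0, 1.0],
    ]),
    # Edge k goes from senders[k] to receivers[k].
    "senders": np.array([0, 1, 1, 2]),
    "receivers": np.array([1, 0, 2, 1]),
    # A single graph-level attribute, e.g. the total energy.
    "globals": np.array([0.0]),
}

n_nodes = len(mass_spring_graph["nodes"])
n_edges = len(mass_spring_graph["edges"])
print(n_nodes, "nodes,", n_edges, "directed edges")
```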
AUDIENCE: Sorry, this might be a little bit [INAUDIBLE] but are graph networks typically always direct-- the edges are directional?
KELSEY ALLEN: Yeah.
AUDIENCE: OK. Cool. So you would usually add four as opposed to two edges for the two springs. [INAUDIBLE].
KELSEY ALLEN: Yes. Yeah. You could potentially change that by having different kinds of aggregation functions, which I'll talk about in a second. But generally, it's easiest to just describe it as two different edges.
So this is an entire graph network block. And what's really critical is that the graph network takes as input a graph. So it takes a set of nodes, a set of edges, and a global property and it outputs a graph. So it outputs an updated set of nodes, an updated set of edges, and an updated set of global properties. And when I say updated, I mean that the attributes for the nodes, edges, and global properties will change, but the structure of the graph will remain constant. So that's really critical, and I'll talk about that a little bit more in a bit.
The six functions that we have in a standard graph network block are these three learnable functions, phi e, which applies to the edges of a graph, phi v, which applies to the nodes of a graph, and phi u, which is going to update the global properties of the graph. We also have these aggregation functions, which are going to allow us to take the structure of the graph into account when updating these properties, which are an aggregation function that goes from edges to the nodes, an aggregation from the edges to the globals, and an aggregation from the nodes to the globals. And on the next slide, I'm going to walk through how each of these operate in a particular graph. Any questions about this before I go to the next slide? OK.
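Written out as equations, the six functions can be summarized as follows. The talk names the aggregation functions only in words, so the symbol rho, the bars for aggregated quantities, and the set notation are notation assumed here for compactness.

```latex
\begin{aligned}
e'_k       &= \phi^{e}\left(e_k,\; v_{r_k},\; v_{s_k},\; u\right)
            && \text{edge update} \\
\bar{e}'_i &= \rho^{e \to v}\left(E'_i\right), \quad E'_i = \{\, e'_k : r_k = i \,\}
            && \text{aggregate edges per receiving node} \\
v'_i       &= \phi^{v}\left(\bar{e}'_i,\; v_i,\; u\right)
            && \text{node update} \\
\bar{e}'   &= \rho^{e \to u}\left(E'\right)
            && \text{aggregate all edges} \\
\bar{v}'   &= \rho^{v \to u}\left(V'\right)
            && \text{aggregate all nodes} \\
u'         &= \phi^{u}\left(\bar{e}',\; \bar{v}',\; u\right)
            && \text{global update}
\end{aligned}
```

Here the three phi functions are the learnable ones and the three rho functions are fixed aggregations such as a sum.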
All right, so I'm going to talk about just one particular graph update order for one particular graph. The first step that we're going to do is apply this edge function to the property of the edge attribute, the node attribute, which is the receiver node for that edge, and the node attribute, which is the sender node for that edge, and the global properties. So in the mass-spring system, this could be something like computing the forces in the graph for each different interaction edge. So like the spring constant or an updated version of that, which will allow you to propagate that information. So that gives us a set of values ek prime or updated edges for each edge in the graph.
Then we're going to apply the aggregation function, which takes the edges and computes the set of ei prime with this hat on top of it, which is going to, in this case, sum all of the inputs to a particular node. So when we apply that aggregation function for this node, for example, what we're going to do is sum up these different edges for that node. And that's what our aggregation function is doing.
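A small sketch of these first two steps on the mass-spring example. In a graph network the edge function is learned; here a hand-written Hooke's law stands in for it, and the positions, spring constants, and rest lengths are invented values.

```python
import numpy as np

positions = np.array([0.0, 1.0, 2.5])         # node attribute: 1-D position
senders = np.array([0, 1, 1, 2])
receivers = np.array([1, 0, 2, 1])
spring_k = np.array([50.0, 50.0, 20.0, 20.0])  # edge attribute: spring constant
rest_len = np.array([1.0, 1.0, 1.0, 1.0])      # edge attribute: rest length

def phi_e(k, rest, x_receiver, x_sender):
    """Stand-in for the learned edge function: spring force on the receiver."""
    displacement = x_sender - x_receiver
    stretch = np.abs(displacement) - rest
    return k * stretch * np.sign(displacement)

# Step 1: update every edge from its own attributes and its two endpoint nodes.
updated_edges = phi_e(spring_k, rest_len, positions[receivers], positions[senders])

# Step 2: for each node, sum the updated edges that point into it.
aggregated = np.zeros_like(positions)
np.add.at(aggregated, receivers, updated_edges)

print(aggregated)   # net force on each mass
```

The first spring is at its rest length and contributes nothing, while the second is stretched by 0.5 and pulls its two masses toward each other.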
KELSEY ALLEN: Yeah.
AUDIENCE: So in the first step is the-- so this phi e.
KELSEY ALLEN: Yeah.
AUDIENCE: So that would be kind of start node, end nodes, then you look up in the u it could be the position and then computing the mass from that and take the attribute of the edge to be--
KELSEY ALLEN: So the u is actually just some global property. So often we think about that like gravity or something that's affecting the entire graph. So it's not really looking up anything within the u. It's perhaps even easier to not even think about the u at the moment. But in order to update the edge, we take the edge's current value as well as the nodes that are connecting-- that that edge is connecting and compute the updated edge value.
AUDIENCE: I'm just wondering where the-- so the position of the objects in your system--
KELSEY ALLEN: That's coming next. Yeah, so once we've updated the edges, we then update the nodes.
AUDIENCE: So in your paradigm that we are talking about the forces, we don't need previous values of edges to compute the new ones. We just, in principle, need the position or the attributes of the nodes, right?
KELSEY ALLEN: That's right. So in this case, we would be-- the simplest thing that we could learn would hopefully be just, yes, that.
AUDIENCE: In general, the update may use the previous edges. It's just that in this case, we don't need them.
KELSEY ALLEN: Yes, right.
AUDIENCE: This way.
KELSEY ALLEN: Yeah. So critically, that node rk and node sk are not the indices of the nodes. They're the actual node attributes. So the positions of the nodes that that edge is connecting. So once we have that representation, we then apply our node function to the aggregated set of edges for that node as well as the node's previous value and the global attribute. So now we have updated node attributes. And that's something like computing the new positions, velocities, and kinetic energies for a particular node in the graph.
And finally, if we're using these global properties, we would aggregate all of the edges and all of the nodes in the graph and then apply our learned global function to the aggregated edges and aggregated nodes and the previous global value to get an updated global value, like an updated energy. Yeah?
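Putting the whole update order together, a full block might be sketched as below. Random linear maps stand in for the three learned functions, sums are used for all three aggregations, and the names `linear` and `gn_block` are hypothetical helpers, not the graph nets library API.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random linear map standing in for a learned network."""
    W = rng.normal(size=(in_dim, out_dim))
    return lambda x: x @ W

def gn_block(nodes, edges, senders, receivers, u, phi_e, phi_v, phi_u):
    n_nodes, n_edges = len(nodes), len(edges)
    # 1. Edge update from edge, receiver node, sender node, and global attributes.
    u_per_edge = np.tile(u, (n_edges, 1))
    edge_in = np.concatenate([edges, nodes[receivers], nodes[senders], u_per_edge], axis=1)
    new_edges = phi_e(edge_in)
    # 2. Sum the updated edges arriving at each node.
    agg = np.zeros((n_nodes, new_edges.shape[1]))
    np.add.at(agg, receivers, new_edges)
    # 3. Node update from aggregated edges, previous node, and global attributes.
    u_per_node = np.tile(u, (n_nodes, 1))
    new_nodes = phi_v(np.concatenate([agg, nodes, u_per_node], axis=1))
    # 4-6. Sum all edges and all nodes, then update the global attribute.
    new_u = phi_u(np.concatenate([new_edges.sum(axis=0), new_nodes.sum(axis=0), u]))
    return new_nodes, new_edges, new_u

nodes = rng.normal(size=(3, 3))     # 3 nodes, 3 attributes each
edges = rng.normal(size=(4, 2))     # 4 directed edges, 2 attributes each
senders = np.array([0, 1, 1, 2])
receivers = np.array([1, 0, 2, 1])
u = np.array([0.0])                 # 1 global attribute

phi_e = linear(2 + 3 + 3 + 1, 4)
phi_v = linear(4 + 3 + 1, 5)
phi_u = linear(4 + 5 + 1, 2)

new_nodes, new_edges, new_u = gn_block(nodes, edges, senders, receivers, u, phi_e, phi_v, phi_u)
print(new_nodes.shape, new_edges.shape, new_u.shape)
```

The output has the same connectivity as the input; only the attribute vectors change, which is what lets blocks be chained.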
AUDIENCE: This is kind of taking a step back. But can you remind me which of these values are being learned and which of them are I guess the inputs?
KELSEY ALLEN: So phi u, phi v, and phi e are learned. The aggregation functions are constant and the graph structure itself is constant.
AUDIENCE: So then all the arguments of those functions are--
KELSEY ALLEN: The k's are all constant. The actual connectivity of the graph, but the attributes of each of those things in the graph is changing. Yeah?
AUDIENCE: So the edge update is always dependent on rk and sk attributes?
KELSEY ALLEN: Yes.
AUDIENCE: And the node is always only dependent on the incoming edges?
KELSEY ALLEN: Yeah. And you could define alternative update schemes here. The really critical part is these phi functions are only applying to-- are applying equally to all nodes in the graph or all edges in the graph. Yeah?
AUDIENCE: Is it possible-- maybe I'm looking a little bit ahead. But is it possible to have a dynamic kind of graph instead of having to replace the graph?
KELSEY ALLEN: So dynamic in what sense?
AUDIENCE: Maybe the model will tell you for this kind of thing-- maybe you need a new node to explain this thing.
KELSEY ALLEN: Yeah, so you would need a separate learning scheme for that. There's nothing that says you can't change the structure from one iteration to the next, but this will not predict what that structure should be. Yeah?
AUDIENCE: Just have a naive question. Because I saw there is two-- one pair of nodes that has two connect-- actually--
KELSEY ALLEN: This one?
AUDIENCE: Yeah, yeah. Just wondering, can two nodes have two edges or more edges?
KELSEY ALLEN: Yeah, there's no limit on the number of edges. These could each have a different edge attribute, for example, and then they could encode different kinds of connections. Like if you had two different springs connecting the same nodes with different spring constants, then you might want two different edges between those same two nodes.
AUDIENCE: Follow up question. So you mean you have to train a new model for that other kind of graph if you have a new node or you have a couple of new nodes--
KELSEY ALLEN: No. No, no, no. The model that you're learning does not care about the graph structure. Because each of these functions is not dependent on the particular values of rk and sk. They take the node attributes from rk and sk as input, but the structure of the graph just determines the propagation and the aggregation functions, which are independent of the-- in some sense, independent of the structure. So you can learn-- when you learn a graph network, you can apply it to any structured graph.
AUDIENCE: But how did the performance change in that case?
KELSEY ALLEN: In practice, you look at that. I will show some examples of having, for example, you can train on towers with different numbers of blocks. Yeah?
AUDIENCE: Well, I don't know if you're going to give an example. But previously you said there's one example where you can use those to train on some image and basically use that as point [INAUDIBLE] in graph. But now is it that between one pair of nodes you can have arbitrary number of edges? And so in those cases where there isn't actually a theory after how many edges there is, like how do you determine what kind of graph? Because now you can have arbitrary number so the fully connected can be all kinds of different fully connected.
KELSEY ALLEN: Yeah. So in general, in practice what I have seen is that people just assume there's one edge between two nodes in each direction. But you could play around with that. It's, again, not something that this model will be able to predict for you unless we can talk about some ways of extending that at the end. But there's no obvious way of doing that with this. Yes?
AUDIENCE: And I guess next [INAUDIBLE] to this, the graphs that you're treating here cannot be expressed as simply a matrix with nodes and the values of the elements being the edges. It's more general here because you can have more edges with the same destination.
KELSEY ALLEN: Yeah, yeah, yeah. So this is more general than just an adjacency matrix, for example. But you could convert an adjacency matrix into a graph. Yeah?
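As a sketch of that conversion, an adjacency matrix can be turned into sender and receiver lists with `np.nonzero`. The matrix below is an invented example, and the convention that row i, column j means an edge from i to j is an assumption of this sketch.

```python
import numpy as np

# adjacency[i, j] != 0 means there is a directed edge from node i to node j.
adjacency = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])

senders, receivers = np.nonzero(adjacency)
edge_weights = adjacency[senders, receivers]   # one attribute per edge

# The list form can also hold a second edge between the same pair of nodes,
# which a single matrix entry cannot express.
senders_multi = np.append(senders, 0)
receivers_multi = np.append(receivers, 1)

print(senders.tolist(), receivers.tolist())
print(senders_multi.tolist(), receivers_multi.tolist())
```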
AUDIENCE: Another question. So is ek here learnable or not?
KELSEY ALLEN: So ek is the input edge representation. So to go back two slides, the edges-- actually, one more. The edges ek are going to start with some attributes. And what you're learning is a transformation on these attributes.
AUDIENCE: So these attributes are interpretable?
KELSEY ALLEN: So when you put them in, they will be interpretable. When you then run your network, it could embed this in some high dimensional space, and it could become uninterpretable but useful. And then you can use things like your standard deep learning visualization tools to try to figure out what those high dimensional vectors are representing.
AUDIENCE: The same thing for vi and u. So they are initialized as interpretable vectors.
KELSEY ALLEN: And then we project them to high dimensional magic deep learning space and then something happens in some-- yeah, right.
All right. So that really is the core of graph networks, that one slide. So now that we've developed a particular graph network block, we can actually compose them in all different kinds of ways, because each graph network block takes as input a graph and outputs a graph. And so you can connect them in somewhat arbitrary ways.
So for example, one of the classic things that we use, and the thing that I used for my internship, are these encode-process-decode models, where you're going to take your input graph, encode those nodes, edges, and globals into some high dimensional representation, and then run multiple steps of graph net processing and finally decode to something you can then understand, like the new positions of the nodes.
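A heavily simplified sketch of the encode-process-decode pattern follows. It is an assumption-laden illustration: it tracks node features only (no edge or global attributes), uses random untrained weights, and the helper names `dense` and `message_passing_step` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    """Random tanh layer standing in for a learned network."""
    W = rng.normal(size=(in_dim, out_dim)) * 0.1
    return lambda x: np.tanh(x @ W)

def message_passing_step(h, senders, receivers, phi_e, phi_v):
    """One round of processing on latent node features h."""
    messages = phi_e(np.concatenate([h[receivers], h[senders]], axis=1))
    agg = np.zeros((len(h), messages.shape[1]))
    np.add.at(agg, receivers, messages)          # sum messages per receiver
    return phi_v(np.concatenate([agg, h], axis=1))

nodes = rng.normal(size=(3, 3))                  # interpretable input attributes
senders = np.array([0, 1, 1, 2])
receivers = np.array([1, 0, 2, 1])

latent = 8
encoder = dense(3, latent)                       # input space -> latent space
core_e = dense(2 * latent, latent)               # shared across processing steps
core_v = dense(2 * latent, latent)
decoder = dense(latent, 1)                       # latent space -> readable output

h = encoder(nodes)
for _ in range(3):                               # three steps, same core weights
    h = message_passing_step(h, senders, receivers, core_e, core_v)
prediction = decoder(h)

print(prediction.shape)                          # one output value per node
```

There are three sets of functions here: one that maps into the latent space, one that is applied repeatedly inside it, and one that maps back out.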
AUDIENCE: So that means in practice, you have, I guess, two sets or three sets of--
KELSEY ALLEN: Of learnable functions. Yeah.
AUDIENCE: Takes to a high dimensional space, one that processes in that high dimensional space, and another that--
KELSEY ALLEN: Exactly. Yes. So to give you some intuition for why we might want multiple steps of processing, when you define a graph, if you take just one step of propagation, then for example, to get this information from this node to the rest of the graph, after one step, you're just going to be able to propagate it to its direct neighbors. After two steps, you'll be able to propagate it to all the neighbors. And after three steps, you'll also be able to propagate it to all the edges.
But here's an example where that doesn't quite happen. So at the first step, you're only able to propagate information to this one other node. And even after three steps, you actually only reach three of the other nodes. So if it's important that the information from one part of your graph gets to all the rest of your graph, you might need to take multiple steps of propagation. So here is just-- yep?
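The effect of the number of propagation steps can be checked with a few lines of plain Python. The directed chain below is an invented example, not the graph on the slide, and `reached_after` is a hypothetical helper.

```python
def reached_after(senders, receivers, source, steps):
    """Nodes that can have received information from `source` after `steps` rounds."""
    reached = {source}
    for _ in range(steps):
        # Each round, information crosses one edge out of every reached node.
        reached |= {r for s, r in zip(senders, receivers) if s in reached}
    return reached

# A directed chain 0 -> 1 -> 2 -> 3 -> 4.
senders = [0, 1, 2, 3]
receivers = [1, 2, 3, 4]

after_one = reached_after(senders, receivers, source=0, steps=1)
after_three = reached_after(senders, receivers, source=0, steps=3)
after_four = reached_after(senders, receivers, source=0, steps=4)

print(sorted(after_one), sorted(after_three), sorted(after_four))
```

In this chain, one step reaches a single other node, three steps reach three other nodes, and the last node only hears from node 0 after four steps.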
AUDIENCE: Go back again. OK. Yeah.
KELSEY ALLEN: So here's just a few different examples of graph network blocks that are different versions of the full block. So in the simplest case, you can imagine an independent recurrent block, which is not actually using the graph structure at all. It's just assuming that everything is independent and then you're going to update the edges, the nodes, and the globals independently.
So the graph structure will never affect anything. Here's a message passing neural network, which was also published around the same time. And here they don't use globals in the input representation, but they do try to predict globals from the graph. So it's a sort of minimal change from the full graph network thing.
You can also imagine things like deep sets which, again, are just assuming that we're going to learn a single or two functions on the nodes and the global properties of the graph but not assume any connections between them. And so all of these things are representable in the graph nets library that we're going to go through. We'll talk about them a bit more then.
So how might we actually use this in a system? So with the graph, we can define targets that we might want over the nodes, edges, or the globals of a graph, since we'll get updated representations for each of these things. So node centric could be something like trying to read off the inferred mass of an object or the positions and velocities at the next time step. Edge centric could be something like trying to predict whether or not two objects are in contact.
And global centric could be something like predicting the energy of a system. And the input graphs could be pretty much anything. You could have it being structured with node attributes given by known quantities, and you could include sparse connectivity information or you could include all to all unstructured graphs or hierarchical things for tree to tree learnable networks.
So the biggest limitation is that the structure of the graph is typically not learned. So at no point are we changing the receivers and the senders for a given edge. And also the structure of the graph is typically not changed as we unroll something. So inside that recurrent block, we're not somehow changing the actual structure of the graph within that. And if you wanted to do so, you would need some other mechanism to possibly delete edges or nodes or add edges or nodes, which is something people have been thinking about.
Another just thing I want to say is that graph networks will not cover absolutely everything. They won't handle things like recursion, control flow, or conditional iteration. And for those kinds of things, you might want to use program induction instead. And so something to just think about as we're going through some of these examples is how useful is this approach if we can't learn the structure? Because some people would say that learning the structure is really the core of the problem. So just something to consider. And if you're curious, I have some references of people who are trying to learn structure. Do you have a question?
AUDIENCE: Yeah. I missed the original definition. Can you redefine what an edge is?
KELSEY ALLEN: Yeah. It's just denoting that there is a connection between some sender node and some receiver node. And it has a certain attribute vector, but this could be initialized to be empty.
AUDIENCE: OK, so it's like, it's a particular, I guess, it takes as input a node, applies some function to it, and then outputs a node? And I guess the function that applies to it is of a fixed form?
KELSEY ALLEN: So there is a difference between an edge existing in the graph and the edge function that you're learning. So the function that you're learning takes in as input the current edge representation and then the node representations for the nodes it's connecting and outputs an updated edge representation.
AUDIENCE: OK. And is the edge function, does that have a particular form? Is that something with a linear map or something like that?
KELSEY ALLEN: So it's a learnable network. So it's a set of weights and biases. So that's the part that's being learned.
AUDIENCE: OK. And then so I guess my actual question was when you say that there are things that you can't learn or you can't represent with a graph network. That means that the graph network itself can't do something like control flow?
KELSEY ALLEN: Yeah, there isn't a graph representation that will give you control flow. It's actually independent of the network part.
AUDIENCE: And is that-- is that anything that looks like control flow? Or is it--
KELSEY ALLEN: I'm not sure. We should talk more offline. Yeah.
KELSEY ALLEN: I should also say that there is one way you could imagine learning structure in these graphs, which is to always assume everything is fully connected and give every edge a weight and try to learn those weights. But that's computationally slow. So when I'm talking about learning the structure, I'm talking about really sparse learning of the structure.