What makes high-dimensional networks produce low-dimensional activity?
Date Posted:
September 21, 2018
Date Recorded:
September 21, 2018
Speaker(s):
Eric Shea-Brown, University of Washington
All Captioned Videos Brains, Minds and Machines Seminar Series
Description:
Abstract: There is an avalanche of new data on the brain’s activity, revealing the collective dynamics of vast numbers of neurons. In principle, these collective dynamics can be of almost arbitrarily high dimension, with many independent degrees of freedom — and this may reflect powerful capacities for general computing or information. In practice, datasets reveal a range of outcomes, including collective dynamics of much lower dimension — and this may reflect the structure of tasks or latent variables. For what networks does each case occur? Our contribution to the answer is a new framework that links tractable statistical properties of network connectivity with the dimension of the activity that they produce. I’ll describe where we have succeeded, where we have failed, and the many avenues that remain.
Short Bio: Eric Shea-Brown is a professor at the University of Washington Applied Mathematics Department, an affiliate investigator at the Allen Institute for Brain Science, and is adjunct faculty in the Department of Physiology and Biophysics and a member of the Program in Neuroscience.
FREDERICO AZEVEDO: So Hi. I'm a postdoc here at the Center for Brains, Minds, and Machines. For those who don't know me, I am Frederico Azevedo. And today, I have the pleasure to introduce you all to Eric Shea-Brown. And Eric started his undergrad in engineering physics in the UC Berkeley. And then he had his first contact with a lab of neuroscience. And since then he got fascinated and could never stop learning it.
So he followed a PhD in Princeton in the Department of Applied Mathematics and Computation. But he still worked in neuroscience with neural oscillators and integrators. And after a few postdocs, he became a professor in the Department of Applied Mathematics in the University of Washington. And he's also an affiliated investigator in the Allen Institute. And today, he's going to tell us everything we can know about what makes high-dimensional networks produce low-dimensional activity. So without further ado, please.
[APPLAUSE]
ERIC SHEA-BROWN: Thank you. Thank you so much for having me. It's a real honor to be with you-- grateful for the opportunity. This is joint work with a lot of people. Stefano Recanatesi, Matt Farrell, and Merav Stern at the University of Washington, Michael Buice, Gabe Ocker, and Sahar Manavi and Shawn Olsen at the Allen Institute, Guillaume Lajoie in Montreal. And I want to call out Yu Hu here because Yu was a postdoc with Haim Sompolinsky, who was here at Harvard until very recently, when he started out his own group across the Pacific in Hong Kong.
So the topic is something very familiar to all of us, doubtless, in the room these days, right? We all make recordings from tons of cells at once or are lucky enough to collaborate with people who do. And when we describe what those responses look like, they are, therefore, a lot of numbers at once, for example, a vector of spike counts coming out of each of however many cells I'm recording simultaneously. And we have one such vector, either for every different visual stimulus that's presented or perhaps across multiple representations of the same visual stimulus, thereby representing variability intrinsic in the brain's dynamics.
Anyway, however you slice it, these days neuroscience looks like this, right-- a big list of long spike count vectors. And the question is, well, what do we do with them? The first thing you might do, at least if you had a high-dimensional brain or a high-dimensional screen, would be to simply plot those points in high dimensions.
There's one. There's the other. Eventually you have some set of different representations of what the brain is doing. And the first thing to notice, of course, is that we're lying if we pretend that we can fit those on a projector screen. And this raises the question that many are familiar with now, right? How many dimensions do all these spike responses really fill up? In other words, how much am I lying by pretending that I can even plot those points on the screen with you today?
So a way to quantify the answer to that question was put forward by these authors in the context of neuroscience, borrowing results from statistics that are much earlier. And they suggest that we quantify that dimension, right, in other words, the inverse of how much I'm lying to you by squishing all those points onto the screen via something that results from an elliptical approximation of all of the points, right? So got this big point cloud approximated by some high-dimensional ellipse, measure all the major axes of that ellipse. Those are the lambdas.
Aficionados will immediately recognize those are also the eigenvalues of the covariance matrix of the data. And write down these quantities, these lambda tildes. Each one of those lambda tilde is how long one of the principal axes is, normalized by the sum of all of the length that the system has. It turns out that the dimensionality of these data can be well-described by this number, which involves those normalized lengths of these axes. Let's go through a couple of cases to give us intuition for that.
Let's say that really I wasn't lying that much. Even though you've got 70 different cells at once, when I look at the ensemble of points, they really just fill up a three-dimensional sphere. In that case, three of these normalized axes lengths would be the same, all about a third. The rest are zero.
If you plug these normalized axis lengths into this formula, you'll get 1/9 plus 1/9 plus 1/9. 1 over that is 3. So you recover the three-dimensional sphere. Likewise, if you have a hoagie roll instead of a cream puff here, with one dominant dimension and then a little bit of spread in another couple of dimensions, you'll get some fractional dimension, which is about what you would expect.
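The dimension measure described above (commonly called the participation ratio) can be sketched in a few lines. This is a minimal illustration, not the speaker's code: the function names, the 70-cell setup, and the "hoagie" axis lengths are assumptions chosen for the example.

```python
import numpy as np

def participation_ratio(eigenvalues):
    """Dimension = 1 / sum(lambda_tilde**2), with lambda_tilde = lambda / sum(lambda)."""
    lam = np.asarray(eigenvalues, dtype=float)
    lam_tilde = lam / lam.sum()
    return 1.0 / np.sum(lam_tilde ** 2)

def dimension_of_responses(responses):
    """responses: array of shape (n_samples, n_cells), e.g. spike-count vectors."""
    cov = np.cov(responses, rowvar=False)
    return participation_ratio(np.linalg.eigvalsh(cov))

# 70 cells whose responses fill a 3-dimensional sphere: three equal axes, the rest zero.
sphere_dim = participation_ratio([1.0] * 3 + [0.0] * 67)

# A "hoagie roll": one dominant axis plus a little spread in two more (illustrative values).
hoagie_dim = participation_ratio([1.0, 0.1, 0.1] + [0.0] * 67)
```

Here `sphere_dim` comes out as 3, matching the 1/9 plus 1/9 plus 1/9 calculation, and `hoagie_dim` is a fractional value between 1 and 3 (about 1.4 for these made-up axis lengths).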
So all of us have high-dimensional data, at least in principle. We'd like to know what the dimensionality is in practice. We have a way to measure it. Well, would you really like to know what it is? Is there any point in you staying for the remainder of this talk? Why are people and why is this person talking about dimensionality in the first place?
Well, as we see reviewed in this paper and studied by these authors, as well as many others, if it turns out that the dimensionality of these data is a lot lower than it might at first seem-- in other words, it's a lot lower than the number of variables or neurons, say, that I'm recording at once-- there are advantages. First is visualization. I'm not lying too much when I actually try to look at the data.
But things get a lot deeper. Think about denoising. If what's really happening is that these data lie on some low-dimensional set-- and I've got a lot of noisy neurons working together to represent just a couple of variables-- fluctuations in those individual neurons can cancel out and give me a much more accurate representation of the variable at hand.
Likewise, if I, as an experimentalist, or if I, as a downstream brain area, want to read out that signal, I don't have to sample from all these cells. And maybe I don't have to be particularly choosy about what subset of cells I'm decoding from. There's a machine learning justification as well. And that's that if the data are eventually represented in some way that doesn't require too many parameters, I don't need too many samples to learn what the representation, say, of a particular object category is. And a system can learn based on fewer examples.
Well, the alternative outcome, though, has an advantage, too. What if things are very high dimensional? This is a classic picture. Well, then, maybe the representations that the brain is forming enable for easy categorization or classification of different stimulus categories.
For example, let's say I have a bunch of stimuli that correspond to one category. Those are Os, maybe images of Tommy, X's, images of Haim. My brain wants to learn how to classify those. I'm going to need some complicated classifier if the dimension of that representation is low. The classic picture is if we lift to some higher dimensional representation-- then classification becomes easy.
Sometimes this is called Cover's theorem. And sometimes this is called the Kernel Trick. And there are beautiful reviews of this idea, as well as applications in neuroscience, some coauthored by members of the audience. And in between, there is something that may be the best of both worlds, right, the type of representation that lets us do a task efficiently-- categorize, for example, different colleagues-- but also allows us to do so in a way that permits good generalization in some of these other properties here.
Reviews of these ideas are included in Chung and Sompolinsky et al., as well as this beautiful review by Bengio on representations, but are also contained from a different angle in work by your other colleagues here, such as that paper listed. So that's why I think you should care about this. If you disagree, you can bail. But that's the motivation slide for why it is that we should be thinking about representation dimension, really, in brains in the first place.
So this is a big field. Our one group will just advance on a couple of the small paths that are in this big landscape today. And those questions come at this question of dimensionality of brain representations from two points of view. The first is bottom up-- extremely simple conceptually.
Just give me a network of spiking cells with some sort of architecture. Can we say in some reasonably simple way what it is about the connectivity, the wiring diagram, that determines whether said spiking network will produce high- or low-dimensional activity patterns? That's it. What matters about connectivity?
The next is a top-down point of view, which is say, OK, I'm not just going to take any connectivity matrix off the shelf and try and identify the features that matter in general. Let's consider the type of connectivities that result from training, using basic learning rules and network to solve some extremely simple tasks. What is it about those connectivities that determines the dimensionality of the representations of neural activity? And these are the people who are leading the work I'll present-- Stefano Recanatesi, Merav Stern, and Matt Farrell. And it's the work of these people and their colleagues.
Bottom up-- so let me remind you what it is that we're studying one more time. You've got a bunch of cells that are spiking away, make a simultaneous measurement of that spiking activity coming from more than one cell at once, plot it, approximated by a football, measure its relative axes, and quantify its dimensionality, right? That's our game.
Now, I already mentioned that this football approximation is related to the covariance matrix of the cells spiking. As a matter of fact, as I mentioned, these lambdas are exactly the eigenvalues of this covariance matrix. I want to bring up another measure of how coordinated or how collective neural activity is, that it's also based on the covariance matrix, which is used a lot in the field. And that's basic measures of the pairwise correlations among pairs of cells.
For example, another way of saying how collective, how coordinated, how low-dimensional, if you will, neural activity is, is to talk about on average how correlated pairs of cells are. Are they doing the same thing, like the [? baguette ?], or flying all over the place, as in a fully dimensional representation? So I'll talk about this, these pairwise correlations, as Cij, defined exactly as you would think, and their averages as such as well. So that's the quantity that we're studying.
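A minimal sketch of the average pairwise correlation follows, assuming Cij denotes the usual correlation coefficient computed from a covariance matrix. The function name and the three-cell example matrix are illustrative, not from the talk.

```python
import numpy as np

def mean_pairwise_correlation(cov):
    """Average of C_ij / sqrt(C_ii * C_jj) over all pairs of distinct cells i != j."""
    cov = np.asarray(cov, dtype=float)
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    off_diagonal = ~np.eye(cov.shape[0], dtype=bool)
    return corr[off_diagonal].mean()

# Two cells sharing half their variance, plus one independent cell (illustrative numbers).
example_cov = np.array([[1.0, 0.5, 0.0],
                        [0.5, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])
example_mean = mean_pairwise_correlation(example_cov)
```

For this example only one of the three cell pairs is correlated, so the average is 0.5 divided by 3, or one sixth.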
So our question, right, this is a bottom-up question. Got some network, might be complicated, spiking away-- what's the dimensionality of the activity that it produces? And what is it about the connectivity that determines that answer?
Let's just run some experiments. So big complicated network, excitatory inhibitory cells for now-- let's start out by making these networks at random, so subject to the fact that they're excitatory and inhibitory. Takes some number of connections, where the number's determined by the connection probability p and put those connections down completely at random.
What are the dynamics? Well, this is a spiking network producing spikes, like you can see here. And the cells are spiking in this model as inhomogeneous processes that instantaneously have a firing rate, the chance of emitting spikes in some small time bin, which has some baseline level, modulated by all the other spikes in the system through a kind of connectivity matrix Wij.
So this is my simple model for spiking networks. Cells are spiking at random. And they're impacting each other's firing rates through the key object of study, W, in the entire talk, right, which is the connectivity matrix, or the wiring diagram of the circuit.
So many of you will recognize this model. This is often called a GLM in the computational neuroscience literature, schematized by diagrams like this made by Pillow and Park and Paninski, and colleagues. But this is really the diagrammatic form of this model.
OK, red alert-- so I'll try and be very clear about this. This is a linearized model. So in this type, the general version of a neural model of this type, inputs will come into a cell and pass through some non-linearity. And then that will be the firing rate. Here we've linearized that around some operating point to develop the results that I'll discuss with you in a moment.
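One way to write the model described above is sketched here. The slide's exact notation is not in the transcript, so the symbols are assumptions: s_j(t) is the spike train of cell j, a is a causal coupling kernel normalized to unit integral, phi is the nonlinearity, and b_i is the baseline (redefined at the operating point in the linearized version).

```latex
% General (nonlinear) version, then the linearization around an operating point
\lambda_i(t) = \phi\!\Big( b_i + \sum_{j} W_{ij}\,(a * s_j)(t) \Big)
\qquad\longrightarrow\qquad
\lambda_i(t) = b_i + \sum_{j} W_{ij}\,(a * s_j)(t),
\qquad \int_0^\infty a(\tau)\,d\tau = 1 .
```

In the linearized form nothing stops the right-hand side from going negative, which is why the model only makes sense when the baseline is high enough relative to the inhibitory input.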
OK, is this reasonable? So simple model-- throw down a network, random connectivity, what's the dimension that it produces? Yeah?
AUDIENCE: W could be negative?
ERIC SHEA-BROWN: A W, in this case, would be, yes, negative for the inhibitory cells and positive for the excitatory.
AUDIENCE: [INAUDIBLE].
ERIC SHEA-BROWN: Yeah, so you don't want-- you don't want negative firing rates? I thought you were a theorist. Good point. So this model is going to be broken unless the operating point-- or the mean firing rates of the cells is sufficiently high so that the inhibitory inputs could not push the cell to negative firing rates. So this is only going to work for reasonably high baseline firing rates-- excellent point by Haim.
So we do this with the yellow caveat. And we just simulate one of these networks with excitation and inhibition, as we just described. The one parameter-- remember, all the connections are thrown down completely at random. So the only feature describing the connectivity so far is this p, right. Just how connected is the network?
First of all, let's measure this easier-to-think-about mean pairwise correlation. When there are no connections, then there are no correlations either. All the cells are just firing away independently. And as you turn up the connection probability, cells get more and more correlated. Fine.
Let's measure the subject of the talk at the same time for exactly the same network. So every single dot here is one realization of a random network with a particular connection probability p, as I described. So let's plot not just the average pairwise correlation of all of these networks. That's what I've done here.
Let's also plot their dimensionality on this other axis here. And we see that as I make the cells, the networks, more and more connected via p, the dimensionality also goes down. These things are no longer firing independently.
What I find interesting about this plot is the relationship between these two quantities. We know one's going to go up, one's going to go down. But notice this, please. The dimensionality here, measured as a fraction of the full dimensionality, which would be N if I have N cells in the system-- as per here, so it's a little tricky to deal with. The dimensionality of the system plummets to quite low values, say, 10% or 20% of the full dimensionality, even when, from the perspective of the mean pairwise correlations in this system, there is not that much statistical interaction among the cells-- 4% or so correlations on average.
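The experiment described above can be sketched as follows. This is a hedged illustration rather than the speaker's simulation: it uses the long-window covariance of the linearized model with unit rates, (I - W)^{-1} (I - W)^{-T}, instead of simulating spikes, and the network size, weights, and excitatory fraction are assumptions. The weights are kept weak so the linear model stays stable, and no attempt is made to reproduce the numbers on the slide.

```python
import numpy as np

def random_ei_network(n_cells, p, rng, w_exc=0.03, frac_exc=0.8):
    """W[i, j] is the effect of cell j on cell i; columns carry the presynaptic sign."""
    n_exc = int(frac_exc * n_cells)
    # Inhibitory weight chosen so mean excitation and inhibition cancel (an assumption).
    w_inh = -w_exc * n_exc / (n_cells - n_exc)
    signs = np.where(np.arange(n_cells) < n_exc, w_exc, w_inh)
    mask = rng.random((n_cells, n_cells)) < p
    np.fill_diagonal(mask, False)
    return mask * signs[np.newaxis, :]

def long_window_covariance(W):
    """(I - W)^{-1} (I - W)^{-T}: linearized covariance with unit rates (assumed form)."""
    n = W.shape[0]
    if np.max(np.abs(np.linalg.eigvals(W))) >= 1.0:
        raise ValueError("W must have spectral radius below 1 for the linear model")
    gain = np.linalg.inv(np.eye(n) - W)
    return gain @ gain.T

def summarize(W):
    """Return (mean pairwise correlation, dimension as a fraction of N)."""
    cov = long_window_covariance(W)
    n = cov.shape[0]
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    mean_corr = corr[~np.eye(n, dtype=bool)].mean()
    dim_fraction = np.trace(cov) ** 2 / np.trace(cov @ cov) / n
    return mean_corr, dim_fraction

rng = np.random.default_rng(0)
results = {p: summarize(random_ei_network(100, p, rng)) for p in (0.0, 0.1, 0.3, 0.5)}
```

With no connections (p = 0) the covariance is the identity, so the mean correlation is zero and the dimension fraction is one. Each entry of `results` gives one point of the kind plotted in the talk, one per random network.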
So from this, I get the first answer of those I'd like you to remember from this talk, how does connectivity influence dimension? Answer-- by a lot, even when you might not think so from looking at pairwise statistics alone. To me, the upshot of this is that you're not yet, at least, wasting your time in this talk.
This quantity is worth studying, even though there are many papers I could have listed-- 50 here. I just trimmed a few-- on the input and output and really of the brain that indicate that average pairwise correlations really are quite low in the brain. There's also a beautiful theory of why that is the case from these and other authors. But again, even when these pairwise statistics make it look like the system's pretty independent, it turns out that you can have quite a low dimensionality in the system as a whole.
I think the reason for this is as follows. These authors have shown us that the formula for a dimension that I mentioned a moment ago-- excuse me, the expression for dimension depends not just on the average level of correlation between pairs of cells, but it also depends on the variance in that level of correlation from one cell pair to the next. And more than just that extra dependence, it depends on this in a very sensitive way.
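The dependence on both the mean and the variance of the correlations can be seen in a short derivation. It is a sketch under a simplifying assumption that is not stated in the talk: every cell has the same variance sigma^2, so that C_ij = sigma^2 rho_ij. The formula on the slide may be more general.

```latex
% \bar{\rho}: mean, \delta\rho^2: variance of the off-diagonal correlations \rho_{ij}
\operatorname{tr} C = N\sigma^2, \qquad
\operatorname{tr} C^2 = \sum_{i,j} C_{ij}^2
  = \sigma^4\Big( N + \sum_{i \neq j} \rho_{ij}^2 \Big)
  = \sigma^4\Big( N + N(N-1)\big(\bar{\rho}^{\,2} + \delta\rho^2\big) \Big)

D = \frac{(\operatorname{tr} C)^2}{\operatorname{tr} C^2}
  = \frac{N}{1 + (N-1)\big(\bar{\rho}^{\,2} + \delta\rho^2\big)}
```

The factor N - 1 multiplying the correlation terms is what makes the dependence so sensitive: correlations that look negligible pair by pair are amplified by the number of cells.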
How sensitive? The archetypal case of an almost uncorrelated network, or almost independent network, in computational neuroscience and increasingly in experimental neuroscience, too, is something called a balanced network. This was actually-- lots of things came out of this group originally studied by Haim and colleagues. And then in the context of pairwise correlations, this famous paper here indicates that, man, these networks are really uncorrelated from the point of view as follows.
The correlations as you grow these networks bigger and bigger on average go away, and also they're very tightly distributed around zero. And here would be the network size. But it turns out that if you plug these two numbers into the formula of these authors, you get an answer that the dimensionality of the network itself is still constrained.
In other words, what really matters is the pre-factors in front of these very small pairwise correlations in terms of determining the dimensionality of a system as a whole. This message that I'm trying to offer echoes one which is already in this well-known paper in the literature, which is that if you look at neural systems, or really any systems, one pair at a time, even if it appears that the statistical interactions are weak, the system is uncorrelated. All possible states could be represented.
That's not true at the level of the population as a whole. There could be strong constraints here represented by low dimensionality. OK, so that's the first point. So we've established that even when you might not think at first glance, connected neural networks can have very low dimension.
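The role of the pre-factors can be checked numerically with the equal-variance expression D/N = 1 / (1 + (N - 1)(mean^2 + variance)). This is a sketch only: the scalings below (mean correlation shrinking like a/N and variance like c/N) and the constants a and c are assumptions chosen for illustration, not values from the talk.

```python
import numpy as np

def dimension_fraction(n_cells, mean_corr, var_corr):
    """D / N for equal-variance cells: 1 / (1 + (N - 1) * (mean_corr**2 + var_corr))."""
    return 1.0 / (1.0 + (n_cells - 1) * (mean_corr ** 2 + var_corr))

# Assumed scalings for illustration: mean correlation ~ a / N, variance ~ c / N.
a, c = 1.0, 4.0
sizes = [100, 1_000, 10_000, 100_000]
fractions = [dimension_fraction(n, a / n, c / n) for n in sizes]
limit = 1.0 / (1.0 + c)
```

Under these assumptions the pairwise correlations vanish as the network grows, yet the dimension stays pinned near the fraction 1/(1 + c) of N, set entirely by the pre-factor c.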
What is it about that connectivity that matters? We haven't explored much so far. We've just been looking at that connection probability alone in the context of otherwise randomly connected networks, and observe that as you make, again, a system more and more connected, its dimensionality goes down. Let's specialize to the case of excitatory-only networks, for reasons that you'll see in a moment, and compare what you get as a function of connection probability p, right?
So you can think about this for a fixed budget of different connections that you could put down in a general neural network. Let's compare the dimensionality that comes out of the randomly connected systems-- you've just seen that. Those are also called Erdos-Renyi networks-- with the type of dimensionality that you get by spending that budget of connections, spending a fixed budget here of roughly 50% of possible connections being present, spending that budget in a different way, for example, rearranging those connections according to the famous scale-free architecture, popularized by Barabasi and colleagues. That's where you have certain hub neurons that are using more of the p, or more of the connectivity than other cells.
Another classic way in which these connections might be arranged is according to a small-world architecture, in which you have a preference for local connections, but there are a few long-range connections. Investigating these plots, we see that the situation is pretty interesting. There are some ways of rearranging connections from the purely random case that cause the dimensionality to plummet. They have a very strong control on this.
But there are other things that seem sort of equally dramatic at first blush, like going for a small-world architecture with lots of local connectivity that don't do anything at all in terms of the dimensionality or the degrees of freedom the system explores. So for me, this just says there's more than just connection probability that matters. And the answers are likely to be quite interesting. So let's see what they are.
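A fixed-budget comparison of this kind can be sketched as below. Everything here is an illustrative assumption: the hub network is generated by a simple out-degree propensity rather than the Barabasi construction, the budget is about 20% of possible connections, dimension is computed from the linearized covariance with unit rates, and one shared excitatory weight is chosen to keep both networks stable. It shows how to set up the comparison, not what the slide's curves look like.

```python
import numpy as np

def sample_network(n_cells, n_edges, out_propensity, rng):
    """Place exactly n_edges connections; column j is favored in proportion to out_propensity[j]."""
    probs = np.tile(np.asarray(out_propensity, dtype=float), (n_cells, 1))
    np.fill_diagonal(probs, 0.0)  # no self-connections
    probs = probs.ravel() / probs.sum()
    chosen = rng.choice(n_cells * n_cells, size=n_edges, replace=False, p=probs)
    adjacency = np.zeros(n_cells * n_cells)
    adjacency[chosen] = 1.0
    return adjacency.reshape(n_cells, n_cells)

def dimension(W):
    """Participation ratio of (I - W)^{-1} (I - W)^{-T} (assumed covariance form)."""
    n = W.shape[0]
    gain = np.linalg.inv(np.eye(n) - W)
    cov = gain @ gain.T
    return np.trace(cov) ** 2 / np.trace(cov @ cov)

rng = np.random.default_rng(1)
n_cells, n_edges = 100, 2000  # same connection budget for both networks
uniform = sample_network(n_cells, n_edges, np.ones(n_cells), rng)
hubs = sample_network(n_cells, n_edges, 1.0 / np.sqrt(np.arange(1, n_cells + 1)), rng)

# One shared excitatory weight, set so the more excitable network has spectral radius 0.5.
radius = max(np.max(np.abs(np.linalg.eigvals(adj))) for adj in (uniform, hubs))
weight = 0.5 / radius
dims = {"random": dimension(weight * uniform), "hubs": dimension(weight * hubs)}
```

Because both adjacency matrices contain exactly the same number of connections, any difference between `dims["random"]` and `dims["hubs"]` comes from how the budget is arranged, not from how large it is.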
So what is it, if it's not just connection probability that actually determines, in the end, the dimensionality of the activity that a neural network will produce? Well, to get at that, let's see what the ingredients of a general neural network are. The first ingredient, indeed, is this connection probability, right?
I can go through any general complicated network and count how many links it has. Sounds like a good descriptor of a network. But if I zoom in, there's more than that, right? There are these little paths, like the purple one here, which are chains of two connections. I could enumerate those as well.
There are these diverging pathways coming out of one connection in the middle. And I could enumerate those as ingredients as well. And those are often called network motifs.
If these look familiar, that's not an accident. Turns out that in this famous paper from 2005, exactly this type of network motifs were quantified in cortex, using multi-cell patch clamp, and found to be present at levels not expected by chance. And this type of work was followed up by a host of other studies across scales and species.
So a question that I want to pose to us now is, well, what's going on? We've got that low dimensionality can often occur in neural networks. We see that it's more than just connection probability that determines this low dimensionality.
Is there something comprehensible? Perhaps the content of the network in terms of these motifs is good enough in order to predict what the dimension will be. And the answer that I'll present to you is we think that this is the right approach or a useful approach-- a right approach-- at least in the case when you have networks that are dominated by one sign of connections.
Let's go through that in detail. And we'll see some successes and failures. So how do we get to that approach? Is it clear what I'm aiming at? Any questions, requests to slow down? All right. Let's see if this works.
All right, so I'm walking up to a general neural network like this. And I want to know what the dimensionality, again, is of the activity that it produces. Here's my model for its dynamics. This is exactly what I showed you on the third slide. Here is my honesty point in yellow. And it turns out that for these type of systems-- we chose them, well, because they're widely used neural models inferred from data and also because they are analytically [INAUDIBLE].
So we can have an explicit expression for what this key thing is, this covariance matrix on which my lambdas and, therefore, my dimensionality are based. We can have an explicit expression for this covariance matrix in terms of the W. As a matter of fact, it looks like this problem is a cakewalk, right?
You want to know what the dimensionality is that a network with a particular W produces? Plop it into this formula. But our goal is not to just have a formula that you can plug big N squared connectivity matrices into and get numbers out of. We want to have some sort of understanding, again, of why it is that you get particular numbers that might be high in some cases and relatively low in others.
So we don't want to stop here in the literature. Let's go back to the '70s, retro style for Terry Sejnowski, who I think is the first to write down this formula in the context of neuroscience, and note that Benji Lindner has also used this formula and many others that followed. So what are we going to do with this covariance matrix?
Well, let's remember that our dimensionality is based on it through its eigenvalues. It turns out that the dimensionality is based on some simple matrix statistics of that covariance matrix involving the trace. So this covariance matrix is going to be useful. What's the next step that I can take to get a formula, again, which is not-- oops, yeah. Go ahead.
AUDIENCE: [INAUDIBLE] there is some [INAUDIBLE]?
ERIC SHEA-BROWN: This is the identity. Sorry. This comes from the baseline rate. Thank you. So this is what you would get if the baseline rate was one.
AUDIENCE: [INAUDIBLE]. The baseline is an external source.
ERIC SHEA-BROWN: Yes, I guess that's right. The baseline is the rate at which a cell would fire in the absence of recurrent coupling from the system, or like a con-- this is the identity matrix. Yeah.
AUDIENCE: [INAUDIBLE]?
ERIC SHEA-BROWN: Let's see here. Yeah, so what I've done is I've taken the system, and I've linearized it around some point.
AUDIENCE: [INAUDIBLE]?
ERIC SHEA-BROWN: It is already linear, so a nonlinear system linearized around a point would give you this. Or you could just take this off the shelf. And it's linear.
AUDIENCE: [INAUDIBLE].
ERIC SHEA-BROWN: Yeah, that's right.
AUDIENCE: [INAUDIBLE].
ERIC SHEA-BROWN: So if you would solve for-- right, so take the average of this equation very quickly. You get rate on the left, average of a Poisson spike train if you interpret it correctly is a rate on the right, so. You have rate vector is equal to some constant plus W times the rate.
Obviously, you solve that, right? And you'll get that the rate is equal to the identity minus the weight matrix inverse times the baseline. Well, it's true. Well, you can do it on-- OK, anyway, they claim it--
AUDIENCE: [INAUDIBLE].
ERIC SHEA-BROWN: Yeah, that's for the instantaneous rate. So you need to do a general version of that by basically multiplying the rate equation by itself.
AUDIENCE: [INAUDIBLE].
ERIC SHEA-BROWN: Pardon me?
AUDIENCE: [INAUDIBLE].
ERIC SHEA-BROWN: Yeah, I think--
AUDIENCE: [INAUDIBLE] sequence.
ERIC SHEA-BROWN: All right, thanks, zero frequency or at any frequency. Good point by Haim, I think. So yes, so, correct, right? If you want to solve-- right. So formally to get this expression-- this is a good point. I appreciate the clarification. Haim's right here.
So this is the covariance matrix formally of the spike count. That's what we're talking about, right, our vectors of population-wide activity are vectors of [INAUDIBLE], right? Correct point-- spike count over what window? I think that's at the core of this question. Answer-- this is the formula, exact formula, for the covariance matrix over a spike count of infinite duration-- in windows of infinite duration.
That sounds really stupid, right? I mean, that means like you have your entire brain does one thing, and it's only [INAUDIBLE]. But it turns out that that's also a very accurate approximation for the spike count over any window that contains all of the temporal structure and the auto and the cross-correlation functions of individual neurons. So for example, [INAUDIBLE] updated in a J Neuro Sci paper in 2001 or 2006. This is a good estimate of the spike count correlation over windows of 32 milliseconds or longer in primate visual cortex.
So this is covariance matrix of a spike count in sufficiently long windows. It's given by this formula. And I'm expressing that I'm not satisfied with this formula. Because although it's something that you could plug a W into, I don't find it very intuitive to think about what happens when you take a bunch of inverses and transposes and multiply them. I don't see what the key features of the W are that give rise to high or low dimension. And I especially don't see what that is when you plug it into some trace formula and then take all sorts of squares and things like that.
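The formulas referred to in this exchange are not reproduced in the transcript. A standard form for this class of linear models, offered here as an assumption about what the slide shows, is the following, with r the vector of stationary rates, b the baseline, and the unit-rate case giving the identity in the middle:

```latex
% Stationary rates, long-window spike-count covariance, and dimension
r = b + W r \;\;\Longrightarrow\;\; r = (I - W)^{-1} b

C = (I - W)^{-1}\,\operatorname{diag}(r)\,(I - W)^{-\mathsf{T}}
\;\;\overset{r_i = 1}{=}\;\; (I - W)^{-1}(I - W)^{-\mathsf{T}}

D = \frac{(\operatorname{tr} C)^2}{\operatorname{tr}\,(C^2)}
  = \frac{\big(\sum_i \lambda_i\big)^2}{\sum_i \lambda_i^2}
```

The last line is the same dimension measure as before, written with traces instead of normalized eigenvalues.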
So this is not an intuitively to me-- well, this is not an intuitively or conceptually useful expression yet, or at least we could make it more so. To do so, here's the math. Take this thing and realize that even though this has matrices inside, you can do a Taylor series expansion that's valid for matrices, just like you could for scalar equations.
You get a bunch of terms out of this. I've dropped proportionality constants and left off some of the terms. But you can see that the type of terms you get are going to look like this, so various products of W, the connectivity matrix itself, as well as its transpose.
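Written out, the expansion looks like the following sketch. It assumes unit rates (so the middle factor of the covariance is the identity) and a spectral radius of W below one, which is what makes the series converge.

```latex
(I - W)^{-1} = \sum_{k=0}^{\infty} W^{k}
\quad\Longrightarrow\quad
C = \sum_{m=0}^{\infty}\sum_{n=0}^{\infty} W^{m}\,(W^{\mathsf{T}})^{n}
  = I + W + W^{\mathsf{T}} + W W^{\mathsf{T}} + W^{2} + (W^{\mathsf{T}})^{2} + \cdots
```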
Now, why do we care about this, right? So I'm just replacing some formula with some infinite expansion of it. The reason we care about this and why this can get us on the track to a conceptual viewing of this problem is due to some insights from these authors. And what they've done is notice, well, hey, what does this W really correspond to?
Well, W, again, is just all the connections in the matrix. So my blue W corresponds to-- the entries of that correspond to all my blue areas-- arrows. But how about these other more mysterious looking product terms, like the green one? Well, it turns out if I look at nonzero entries in this green product, those nonzero entries are proportional to the probability of having this kind of very interesting connectivity motifs, specifically a diverging connectivity motif like that. And this statement about the equivalence between the matrix products that show up naturally and the probability of having connectivity motifs in your network is true for all of the terms in this expansion, as we'll see in a minute.
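The link between matrix products and motifs can be checked directly on a small network. The sketch below assumes a binary adjacency matrix with A[i, j] = 1 meaning cell j connects to cell i (the orientation is an assumption), and compares the product A A^T against a brute-force count of diverging motifs.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(2)
n_cells = 6
# A[i, j] = 1 means cell j connects to cell i (orientation is an assumption of this sketch).
A = (rng.random((n_cells, n_cells)) < 0.4).astype(int)
np.fill_diagonal(A, 0)

# Entry (i, j) of A @ A.T counts cells k that project to both i and j: diverging motifs.
diverging_by_product = A @ A.T

# Brute-force count of the same motifs over triplets of distinct cells, for comparison.
diverging_by_loop = np.zeros_like(diverging_by_product)
for i, j, k in permutations(range(n_cells), 3):
    diverging_by_loop[i, j] += A[i, k] * A[j, k]

off_diagonal = ~np.eye(n_cells, dtype=bool)
counts_agree = np.array_equal(diverging_by_product[off_diagonal],
                              diverging_by_loop[off_diagonal])
```

The same check works for the other second-order terms: `A @ A` counts two-step chains and `A.T @ A` counts converging pairs.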
So I can have an expression of my covariance matrix in terms of all the motif probabilities, which are proportional to the counts, or numbers of them that exist in a system. I can plug into this formula, somehow deal with the trace operations, and end up with a formula of this general form. You want to know what dimensionality a given neural network produces? Tell me its motif content, plug it into some huge formula like this, and I'll return to you the dimension.
So this has reduced our problem of figuring out what matters about a matrix's-- about a system's or circuit's connectivity to the question of what motifs it contains. The problem with this type of a formula is this infinite sum. It turns out that you cannot truncate this expression well, and you need information about very long motifs in order to get a good approximation.
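How slowly the expansion converges can be illustrated numerically. This is a sketch under assumptions not taken from the talk: an excitatory-only random network, unit rates, and weights rescaled so the spectral radius is 0.9. The truncation keeps powers of W up to a given order before forming the covariance.

```python
import numpy as np

def truncated_covariance(W, order):
    """Keep powers of W up to `order` in the expansion of (I - W)^{-1}."""
    n = W.shape[0]
    partial_sum = np.eye(n)
    power = np.eye(n)
    for _ in range(order):
        power = power @ W
        partial_sum = partial_sum + power
    return partial_sum @ partial_sum.T

rng = np.random.default_rng(3)
n_cells = 100
adjacency = (rng.random((n_cells, n_cells)) < 0.5).astype(float)
np.fill_diagonal(adjacency, 0.0)
# Excitatory-only weights, rescaled to spectral radius 0.9 (an illustrative choice).
W = 0.9 * adjacency / np.max(np.abs(np.linalg.eigvals(adjacency)))

gain = np.linalg.inv(np.eye(n_cells) - W)
exact = gain @ gain.T
orders = [1, 2, 4, 8, 16, 32, 64]
relative_errors = [
    np.linalg.norm(exact - truncated_covariance(W, k)) / np.linalg.norm(exact)
    for k in orders
]
```

In this setting the low-order truncations miss most of the covariance, and the error only becomes small once paths many connections long are included, which is the practical obstacle described above.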
So we've reduced this down to something that is intuitively useful. But if you want to design a-- whoops-- an experimental protocol, for example, to measure the information you needed to predict dimensionality, you'd have to have a multi-cell patch clamp setup with 100 simultaneous electrodes, right? So you're going to fire me as your theory colleague, as I tell you that's what you need to do.
So here is where the ideas of Yu Hu, Haim's former postdoc, now in Hong Kong, my former graduate student, come in. So Yu Hu saves the day. And here's what Yu Hu does-- not on this project. But this is his general theoretical idea which he used to advance other aspects of network dynamics in past work that he led.
So what Yu Hu does is he says this, OK, what are some of the ingredients here? Well, I got motif counts, right? And these are proportional to the probabilities of connections. Is there another way of talking about the probability of these motifs occurring, about the density of these ingredients in my network?
Maybe I can say, well, probability of, for example, this green motif occurring is whatever I would have expected by knowing about shorter motifs, right? So knowing about connection probabilities alone, I would have said, all right, what are my chances of getting one of these green things? Anybody want to shout it out? It's lonely up here.
Connection probability is k. What are the chances that if I look at a given triplet of cells, just knowing that alone, that I would get two connections present at once to form this diverging motif?
AUDIENCE: [INAUDIBLE].
ERIC SHEA-BROWN: I'm not sure what that means.
AUDIENCE: Without the [INAUDIBLE].
ERIC SHEA-BROWN: Yeah, just based on that information alone, like what you'd expect from the simplest--
AUDIENCE: [INAUDIBLE].
ERIC SHEA-BROWN: Yeah, k squared. Yeah, exactly, k squared and then if you don't distinguish between self coupling, just k squared. But then there's some extra amount by which those particular green motifs might be present in a given network, right? It doesn't have to be just k squared. There could be more of them.
Well, I can do that for this type of motif. And this extra amount is something that you call motif cumulants. It's not the count of these structures, but it's how much extra you have beyond a baseline statistical assumption based on the shorter, or simpler, paths in the network. You can [INAUDIBLE] for converging motifs, chain motifs.
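As a rough illustration of the "extra amount" idea, here is a sketch of a diverging-motif cumulant: the empirical probability of the motif minus the k squared expected from connection probability alone. The normalization (all index pairs, self-pairs included, in line with the remark about not distinguishing self coupling) is an assumption, and both example networks are made up.

```python
import numpy as np

def diverging_cumulant(W):
    """Excess probability of diverging motifs beyond the k^2 baseline."""
    n = W.shape[0]
    k = W.sum() / n**2                 # connection probability
    p_div = (W @ W.T).sum() / n**3     # diverging-motif probability
    return p_div - k**2

n = 6
# Hub network: cell 0 projects to every other cell, so diverging motifs abound.
hub = np.zeros((n, n))
hub[1:, 0] = 1
# Chain network with the same number of connections but no shared inputs.
chain = np.zeros((n, n))
for j in range(n - 1):
    chain[j + 1, j] = 1

print(diverging_cumulant(hub), diverging_cumulant(chain))
```

Both networks have the same connection probability, so the difference between the two numbers reflects only how the connections are arranged.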
You can even do this for this type of red motifs that corresponded to the red term on the previous slide. What do you do next? Well, you have this formula, remember. This is what we want. It's written in terms of motif counts, which are proportional to motif probabilities.
So you plug in all of these expansions of motif probabilities into the formula that we have in the first place. Now, does this sound like a good idea or a terrible idea?
AUDIENCE: [INAUDIBLE].
ERIC SHEA-BROWN: I vote terrible because I was complaining about the fact that these are really long series. And all I'm doing so far is bringing in lots of terms, like all of these red terms in for every individual term in this series. So what you get is just replacing one infinite series with another infinite series. It's actually doubly infinite. And it looks like you're going backwards fast.
So it turns out-- and this is what Yu figured out, that we were able to generalize to the dimension case as well-- that this unpleasant looking series is actually very pleasant because there are ways of using nonlinear re-summing tricks in order to rewrite this ratio in terms of another function, which involves the sum over just these motif cumulants. Remember, this is just these [INAUDIBLE] probabilities not baseline probabilities, but extra probabilities of having these little sub-graph structures. And it turns out that this series, this expression, can be truncated efficiently.
So it doesn't work if you just take the basic thing and try to put in small motif counts, the type of things that you could measure without driving yourself crazy. But it can work when you do this rearranging and rewriting of these expressions in terms of the extra counts. And this makes some sense when we think about the extras decaying more quickly, for example, with length than basic counts of sub-graphs.
So does this work or not, right? So this is the type of thing that works. And it doesn't always work. It solves our problem, right? It says, again, remember, question, right? Give me a W. What about the W matters for the dimension? Answer-- just measure short little things, just like Song did in 2005, plug it into some formula, and I'll tell you what the dimensionality is.
That's useful. That's compact. And that's simple. Does it work? Sometimes, and sometimes not. In the case of purely excitatory systems, or more generally systems that are dominated by one sign of connections, it works great.
So this is the plot that I showed you before, Erdos-Renyi or random networks, small-world, scale-free. If we look at the full expression for their dimensionality, right, [INAUDIBLE] this from knowing absolutely everything about their connectivity, I get the lines that I showed you. If you take a look at an approximation which says, just do three-cell patch-clamp experiments and figure out what the statistics of the low order connection motifs are, we also get a very good approximation to the result.
This holds, also-- I should have made this yellow. But I can color it in-- for a purely excitatory system of more general, sort of, exponential family of random graph forms like this. We get very good approximations to the exact result for dimensionality just by measuring a few observable properties of connectivity. That is the type of thing that is useful.
Failure-- oh, so it works for excitatory systems. It seems to fail in our hands for-- I'll admit to you-- the type of systems that matter the most for describing cortex, which are these strongly balanced excitatory inhibitory circuits. Our expression in terms of an expansion in terms of motif cumulants is exact. That's not the problem.
The problem is that we need more of these motifs than those involving at least three cells. And obviously we need to systematically see, which we haven't done yet, whether we succeed in order four or five or how far out that we have to go. But all I can say at this moment is you need more than just three cells at once to predict the dimensionality in E/I systems. So no, at least at second order.
Why are we-- yeah, sure. You want me to put my [INAUDIBLE]? Hi, mom. Here's what you bought for your education. Yeah, OK, good.
AUDIENCE: These networks [INAUDIBLE], are you separating [INAUDIBLE] for different types of [INAUDIBLE]?
ERIC SHEA-BROWN: No. I love that question. That's very nice. So the answer is no, we just have E and I, and we have some connection probability and some motif probability that goes into this quadratic family of random exponential quadratic-- a family of random graphs. So we over-expressed motifs [INAUDIBLE] just E, and E and I, and just I and I. But we don't have separate subtypes.
What you're getting at though is, I think, a very nice question-- what is a nice situation to try to apply this theory to? Answer-- the case in which you have multiple subtypes of both E and I cells, each of which has a different connection probability. If you re-homogenize over all those cell types, right, then you get something which would involve a lot of interesting motifs.
So having different cell types is a natural way to generate departures from purely random graph models that would give you the types of motifs for which you could make predictions like this or like this. The question in the back.
AUDIENCE: Does it all depend on the covariance matrix--
ERIC SHEA-BROWN: Yeah, that's it.
AUDIENCE: [INAUDIBLE]. So if there is a hidden [INAUDIBLE], which is highly correlated with all other [INAUDIBLE], then the covariance matrix--
ERIC SHEA-BROWN: So I think the question is, this was fun. Thanks for your lecture about-- thanks for this formula, for example, whoops, which seems to be just the intrinsic activity of the cells, right? And you saw that also on my first slide, where I was sort of introducing things. If the connection probability is zero, then the correlation is zero. Everything's independent.
It turns out that if you [INAUDIBLE] signal term, here let's call it s sub i of t, that may be either independent from one cell to the next or are strongly correlated, you can recycle exactly the same analysis. You just have the covariance matrix of that signal term in the middle. In your case, in which there was like a rank-one, sort of, component or a single factor that's driving all the cells, that would be a rank-one covariance matrix that gets slapped in the middle.
You can do all the same machinery as before. It turns out that the motifs that you end up pulling out to make these predictions are-- they appear in the same form, same re-summing, same truncations, same everything. The difference is that [INAUDIBLE] weighted. So that cells that might be listening most strongly to that common input will contribute not just according to their cardinality or count, but according to their stimulus-driven-- or proportional to their stimulus-driven variance.
So you can account for all of those factors exactly in this theory. It just comes down to re-weighting the neurons so that the motifs that contain neurons that are most strongly driven will contribute to a larger numerical extent to the kappas or the K's in the motif cumulants. There'll be another comment on that at the end.
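One way to write down the "covariance in the middle" statement, as a sketch only: it assumes the linear model x = W x + (input), with unit-variance private noise plus a single shared signal. The symbols b (loadings of each cell on the shared signal) and sigma_s (its strength) are introduced here for illustration and do not come from the talk.

```latex
% Assumed linear model: x = W x + \text{input}, so x = (I - W)^{-1}\,\text{input}
\begin{align}
  C &= (I - W)^{-1}\, \Sigma_{\mathrm{in}}\, (I - W)^{-\top}, \\
  % private unit-variance noise plus one shared factor gives a rank-one addition
  \Sigma_{\mathrm{in}} &= I + \sigma_s^2\, b\, b^{\top}.
\end{align}
```

With no shared signal the middle matrix reduces to the identity, which recovers the case of independent intrinsic activity.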
So this is an exact approach-- write down things in terms of cumulants. It's going to converge faster. It's going to be very useful, I think, for one-sign networks. The jury is out for networks that are close to a balanced case.
Why are we doing this? Again, if we just wanted formulas, just shove a W in, we had that 10 slides ago. [INAUDIBLE] is we're trying to develop some intuition as to what's going on. And the ultimate form of that, or the simplest form of that, would be look, take all these measurements right from your patch clamp or whatever, a simple way of characterizing a graph, toss them into this formula with some linear weights in front, and I'll tell you what the dimension is.
Now, of course, our theory, our approximate theory, is some sort of very nonlinear expression. But we can do a Taylor series expansion of this nonlinear expression if I feel like it. OK, just tell me what the sensitivity is. Tell me what this alpha is. Tell me the sensitivity, the derivative of the dimensionality with respect to the chain motif content.
I can compute that explicitly. It's some sort of formula. The fun thing is that you can prove that it's negative. That says that you take any baseline system, add more chains to it-- that could be in terms of cardinality or in terms of their strength, as per the question in the back of the room-- and you'll always decrease the dimensionality of the system.
The same is true for diverging [INAUDIBLE] converging motifs as well. And in the end, I can come up with a simple plot, which tells me, again, how much adding connections or not adding connections but rearranging them to boost up the effective content of chains or diverging motifs or whatever will all factor in together to determine the dimensionality of the network activity as a whole. And this can explain the type of results that we saw on our motivating slide here.
If I measure the motif content in a totally random network-- that's Erdos-Renyi-- or a small-world network, it turns out that the motifs that matter, according to this dimension theory, are exactly the same in both cases. But in a scale-free network, I have more of the type of structure that matters, which is these three types of second-order motifs. And that is enough, in this case, to drive the dimensionality in the direction that we expect, which is down.
That's the bottom-up part of the talk. [INAUDIBLE] to connectomes and circuits-- well, please. If I care about the dimensionality, what is it about the connectome, hopefully not its entire complexity, what are some simple properties of this that determine the dimensionality of the representation or the activity that a neural network with that connectome or with that circuit will produce? The first answer that I came to was that the effects can be dramatic, even when pairwise measurements of correlations are low.
We suggested that network motifs are the right quantity to look at. And they become efficient only when you use the cumulant technology of Hu et al. We found that short motifs, making for a genuinely useful theory, work for excitatory networks. For E/I networks, we had the failure I showed you.
Presumably we can fix that with longer motifs. But we don't yet know how long. There's the caveat in yellow and the caveat that Haim's question underlined that this is for linear or linearized systems. You can do extensions with nonlinearities that work. They just involve more higher order powers of the connectivity matrix W, making for much more complicated expressions. And this was a question of the gentleman in the back.
The same thing works in the presence of stimulus drive. More caveats-- the red or yellows are failures or things I want to draw attention to so you don't get too enthusiastic about my talk. I think in the presence of stimulus drive, we have this case in which we need longer than two connection motifs again. I think that's a typical case.
Highlighting some [INAUDIBLE] on dimensionality and connectivity in the literature-- Mastrogiuseppe and Ostojic have a beautiful paper about low-rank perturbations to connectivity matrices, a different approach to this problem but beautiful work. And it actually uses nonlinear dynamic mean field theory, made up by yet another reference to somebody in the audience here, in order to solve this question. Brent Doiron, Chengcheng Huang, and colleagues also have work on dimensionality and recurrent circuits, where the critical factor here is spatial and temporal scales of the connectivity.
So in the end, to answer this bottom-up question, sometimes this low-rank, or spatial scale approach will be better. You saw some failures of our approach. Sometimes probably this connection statistic approach that I introduced here will be best. And these are all sort of different sides of the same coin. So we'll be working together all of us, I think, to try to unify these approaches in the future.
Top down, OK? So let's not just take any connectome off because somebody happened to measure it, or some cell types happened to give it to me. Let's take the type of connectivity that comes out of training a network to do something. Is there anything systematic that we can say?
I won't say much. But I'll do my best. And as I admitted to Tommy, and Tommy didn't disinvite me, this sort of top-down approach is new to us. So I want your feedback and your criticism. And seriously, if this is well-known, interrupt, stop. Let the audience know. They, at least, deserve to know that. This is new to us.
Let's try and get into this. So simplest task-- so right, the approach is common to many, right, pioneered, in some ways, here by the DiCarlo Lab, but also others. Omri Barak, Sussillo, Abbott, others have worked in this area very productively-- train an artificial neural network to solve a task and then look at its dynamics. In this case, the aspect of the dynamics that we want to study, which we think is reasonably new, [INAUDIBLE] dimensionality of those dynamics.
So stimulus comes in, goes into recurrent network, does function-- all right, so what is that? So the simplest function we could come up with for recurrent neural network is a classification task. I've got a bunch of stimuli. Here are my pictures of Tommy again, and here is Haim, and here's Chris in the back as a triangle or an X or whatever, right? You get it.
Vectors come in in pixel space or whatever, get bounced around in the network for some number of time steps. After some predefined number of time steps, the system has to read out the correct category-- in other words, category circle. Specifically, this is implemented via some sort of readout weights, which are also trained in the network with one-hot encoding. So a one in the first unit, for example, is the training objective if a circle was given into the system.
So that's my question. Yes, there is a recurrent network. Get the W out according to some learning algorithm, and that'll turn out to be important. And then what kind of dimensionality in the network dynamics results from said W? And we're going to concentrate entirely for this talk, although we've worked on a bunch of other cases, in the case which I call easy classification.
So the stimulus itself lies in high dimensions. That's just like what I had on slide 2, right? So in particular, the stimuli lie in such high dimensions compared to the number of categories, the number of clusters here-- thank you, Cover's theorem-- that actually, this network-- this is important-- could perfectly perform this classification task by doing nothing to the stimuli.
Leave them alone! They're linearly separable. Maybe act on them with W's equal to the identity or something equivalent. Then train these readout weights. You can easily separate this type of thing with a hyperplane and get 100% accurate classification.
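A small sketch of why such a problem is "easy": with far more dimensions than stimuli, even arbitrarily labelled random points can be split by a hyperplane, so a linear readout alone reaches 100% accuracy. The sizes, the Gaussian stimuli, and the least-squares readout are illustrative choices, not the setup used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dim, n_points = 200, 50                          # many more dimensions than stimuli
X = rng.standard_normal((n_points, n_dim))         # one stimulus per row
labels = rng.choice([-1.0, 1.0], size=n_points)    # arbitrary two-class labels

# The network "does nothing" (W = identity); only a linear readout is fit.
readout, *_ = np.linalg.lstsq(X, labels, rcond=None)
accuracy = np.mean(np.sign(X @ readout) == labels)
print(accuracy)
```

Because there are fewer points than dimensions, the least-squares system has an exact solution, which is the sense in which the recurrent weights have no work to do.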
So I'm not solving a stunning machine learning problem. Instead I'm trying to ask how learning rules cause recurrent neural networks to find particular solutions that may have a particular dimensionality to such problems. And by choosing such an easy problem, I'm attempting to isolate the role of learning rules and the fact that we're solving this problem using a recurrent network. Yes, please?
AUDIENCE: [INAUDIBLE].
ERIC SHEA-BROWN: I do. Yeah. Yeah. So, right. So with these type of recurrent network dynamics-- it's a good point. The dynamics are H or the activity of the hidden layer, the activity in all the units at the next time step is f of W times h of the previous time step.
So if you're operating in the linear part of the rectified nonlinearities, W equal to identity would give you just preservation of the input perfectly from one time step to the next. So that's a good point. So it's not exactly a dynamical system with leak, which would need a line-- or a plane attractor or something for that. Yeah, thanks for that clarification.
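The point about the identity matrix can be checked in a few lines. This sketch assumes the discrete-time update h_{t+1} = f(W h_t) with a rectified-linear f and a nonnegative input, so the network stays in the linear part of the nonlinearity; the function names are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def run(W, h0, n_steps):
    """Iterate h_{t+1} = relu(W h_t) for n_steps steps."""
    h = h0
    for _ in range(n_steps):
        h = relu(W @ h)
    return h

rng = np.random.default_rng(0)
h0 = rng.random(5)                         # nonnegative input
h_final = run(np.eye(5), h0, n_steps=10)   # W = identity
print(np.allclose(h_final, h0))
```

No attractor is needed for this: without a leak term, the identity map simply copies the state from one step to the next.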
It's a recurrent network in the sense of machine learning people, who don't seem to be very realistic about dynamical-- differential equations per se, a discrete time recurrent neural network. Thank you for the question. It's helpful.
So again, simple problem, but what happens in one of these recurrent neural networks when you train the network to solve this task? And in this case by train, I mean, the standard thing, this is back propagation through time implemented through RMSProp in TensorFlow, for people who care, in order to do stochastic gradient descent with some batch size on samples that look something like this. So what is the dimensionality that this network learns?
Well, let's look at this for a bunch of different classification problems. So these are all classification problems in which I have just two categories, circles and X's, Haims and Tommys. But I might increase the number of clusters of different Haims-- Haim wearing a white shirt, Haim up close, et cetera. So that's what's going along in this axis. And what do you see in blue, heavens to Betsy, the machine learning-- I'm going to give up.
What you see in blue is the dimensionality in the input space, right, so more and more clusters, more and more copies of Haim, higher and higher dimensional inputs. But our question's not about inputs. It's about the response. And this is what you get for a network which is trained to solve this classification problem.
The answer is that you get extremely low dimensional representations. You absolutely do not do the trivial and immediate thing, which is just copy the input into the network and classify it with 100% accuracy by training the readout weights. When you do a joint optimization, at least, of these readout weights, the representation in the neural network, because the neural network [INAUDIBLE], massively compressed the inputs.
So in the end, what we end up with is the dimensionality that, in this case, is two. Does this make sense? So you have a linearly separable problem in the input space. You could just solve it right there if you want to. But that is not what recurrent neural networks do when you train them on back prop through time.
Instead they seem to do a gigantic squashing of all of the different Haims and all the different Tommys into something which looks a lot like a point for each class. That is something that occurs not due to random initialization of the network. Random initialization preserves the dimensionality, as promised, for the trivial type of case that I described.
It happens across stages of learning-- blue to yellow-- and across stages of time. So this is a dynamical compression that the network is really doing to the stimulus itself. And the type of compression that you have really is, according to the example I mentioned where you would take all the examples of colleague one and collapse them to one point and all the examples of colleague two and collapse them to another, the dimensionality of the representation in the system is close to what you would get if you took one point for every class and just scattered it around randomly in general position in the state space that you see in the black line.
Is this clear what I'm claiming, what the numerics suggest? OK. So this is my yellow point-- not a warning, but something that I find quite interesting. For easy problems, for which there's no reason, from a classification point of view, to compress the dimensionality or to [INAUDIBLE] dimensional representations, recurrent neural networks trained through back prop do this anyway, and they do so very strongly.
So can we have some intuition or build a toy model as to why this is the case? Let's build up said model. So first stage in building up a toy model to try to explain where the necessary dimensionality comes from-- and again, interrupt me if anything's unclear or if you want to object. If this looks like your homework problem or something, let me know.
First stage in building up a toy model is as follows. Take all the example points of one category, say, all the Tommys, all the circles, and replace them just by a grid of points instead of a bunch of clusters. The only point of this is so we can easily visualize what's going on.
Take all the Haims and replace them with X's. So we just have changed things around so we can see some grids. It's not really that toy. This is really toy.
Next we're going to say that the recurrent network dynamics can be completely linearized. So whatever is [INAUDIBLE] this fully nonlinear W over multiple stages with rectified nonlinearities is approximated by another W-- abuse of notation, not going to be exactly the same W-- is approximated by some sort of approximate matrix or some linearization of the dynamics acting on the input set. So my dynamics in my ultra toy model are multiply by a matrix W, your grid of input points, get something out. That is what the network is trained to do.
Another aspect of my toy model-- boy, are we getting toy here-- is we're just going to consider the action on one of these sets of points at once, so all the Tommys. And just to introduce some notation, H is what we're going to call the hidden unit states. Those are just the dynamical variables that characterize all the units in my recurrent network.
OK, is the toy model clear? Take some grid of examples, W them, get some hidden states. And the objective, right, as in this classification, is I'm going to take all the network operated upon or all of the hidden states corresponding to one of these particular categories. I'm going to read them out according to some readout vector R. And I'd better get the right number, plus 1, for example, for all of these stimuli or all of these inputs that are in category number one.
So copying that toy model again, I've got a bunch of examples of inputs that lie in a particular category. I act on them by some effective linearized network dynamics. I get something out, right? This is huge. So you can see what's going on.
That is the set of all states of the neural network across all the different inputs. I'm going to read out that state according to some readout vector R. And I'd better get one out. That is the objective of the classification.
In particular, I have a loss function here, which is how well this system is, indeed, able to categorize these points, how closely the hidden states, when read out according to R, give me the desired output, the category 1. Does this make sense? So a couple of lines-- if all these hidden states lie on the, in general, hyperplane that is in blue here, right, that is what you need to do to have zero loss, right, to perfectly classify it as one. And that corresponds to a case in which you might have a boundary somewhere else at zero in the classical case, in which all the Os lie on one side. The margin is the distance to the blue line, et cetera.
So the point is what does the system do, right? Notice my setup. All it says in this easy case, all these samples, all these hidden states are already on the right side of the classification boundary. I just need to minimize this loss.
Well, let's minimize that loss according to back propagation through time. So this is a gradient or a stochastic gradient descent. I'm going to update my W in order to minimize this loss. It turns out we can write that down explicitly for a linear model like this.
I get that the delta W, or the increment in my weight matrix, no matter what it starts with, in order to maximally reduce this loss is proportional to this. What is this? This is the outer product of some undefined vector V or some general vector V and R, where R is the readout vector. So think about the consequence of this.
If I work in this toy model, and I update my W matrix according to back propagation through time, every [INAUDIBLE] by increment W, it'll be according to the outer product of a different vector and the same readout vector R over and over again. The consequence of this is as follows. Any change that delta W, or any change to this learning rule can enact in terms of how it moves my endpoints, can only point in the direction of R.
Think about it. Take any point. Project it onto a random vector. That gives you a scalar. And then multiply it by R. So all changes that result from back prop through time can only move around points in the direction of the readout vector.
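The rank-one form of the update can be written out explicitly. The derivation below is a sketch that assumes a squared-error loss on the readout with target 1 and a learning rate eta; the loss used on the slide is not given in the transcript.

```latex
% Linear toy model: h_i = W x_i, readout r^\top h_i, target output 1 (assumed squared error)
\begin{align}
  L(W) &= \tfrac{1}{2} \sum_i \left( r^{\top} W x_i - 1 \right)^2, \\
  % the gradient with respect to W is an outer product of r with a data-dependent vector v
  \frac{\partial L}{\partial W} &= \sum_i \left( r^{\top} W x_i - 1 \right) r\, x_i^{\top}
    = r\, v^{\top}, \qquad v = \sum_i \left( r^{\top} W x_i - 1 \right) x_i, \\
  % one gradient step therefore moves any hidden state only along r
  \Delta W &= -\eta\, r\, v^{\top}
  \quad\Longrightarrow\quad
  \Delta h = \Delta W\, x = -\eta\, \left( v^{\top} x \right) r .
\end{align}
```

The scalar in the last line is the projection of the input onto v, and the displacement it produces is always a multiple of r, matching the "project, then multiply by R" description.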
The consequence is then as I minimize this loss according to that back propagation through time, I will have a compression of the dimensionality but in just one direction-- only in the direction in which I'm reading out this system. I will do nothing. I can do nothing to all of the other N minus 1 dimensions that characterize these states. So this is not a good mechanism for compression of all these states except for along a single dimension-- yeah.
But what if the following is true. What if due to different effects of linearizing the system around different points and due to the fact that the readouts in systems are generally trained at the same time as the recurrent weights, I don't have just one readout vector that I'm using over and over again on every step through stochastic gradient descent. What if this readout vector is effectively jiggled around as well?
Sometimes it might point like this, sometimes like this. And every step, I'm moving all my points, remember, in the one direction of this readout vector to try to lie on the blue line. Well, what's at the intersection of all the blue lines, or what's in general at the intersection of very many hyperplanes? An object of much lower dimensionality, in general, have dimensionality zero, a point.
So if I look at a simulation of all these points evolving under my linear time model, my linear model, where at every step I'm updating my weight matrix according to back propagation through time, now something very different happens. The points are collapsed. But they're not collapsed in just one direction. They're collapsed in effectively all directions at once.
So that if I run this forward through time, I'll have compression of all of the dimensions at once. And we think this is the mechanism that underlies compression in recurrent networks, even when you see no reason for this compression to occur according to the original definition of the classification problem. In short, doing gradient descent, as you would in back prop through time, combined with a fluctuating readout direction or fluctuating effective readout direction, even if you're not training the readout weights, but the readout direction under this linear approximation changes throughout time is going to lead to a compressive W matrix that has to squeeze the points in all directions at once in order to minimize the loss.
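A toy simulation in the spirit of this argument, not the speaker's own code. Assumptions, all chosen to keep the example clean: the inputs are orthonormal vectors, the loss is squared error on the readout with target 1, and the jitter added to the mean readout `r0` is orthogonal to `r0` so that all the hyperplanes share one common point. With a fixed readout only the spread along `r0` collapses; with a jittered readout the spread collapses in every direction.

```python
import numpy as np

def spread(H, r0):
    """Return (variance along r0, variance orthogonal to r0) of the columns of H."""
    Hc = H - H.mean(axis=1, keepdims=True)
    along = r0 @ Hc
    orth = Hc - np.outer(r0, along)
    return float(np.sum(along ** 2)), float(np.sum(orth ** 2))

def train(X, r0, jitter, steps=3000, lr=0.2, seed=1):
    """Gradient descent on 0.5 * sum_i (r . W x_i - 1)^2 with a jittered readout r."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    W = np.eye(n)                        # start from "do nothing"
    for _ in range(steps):
        xi = rng.standard_normal(n)
        xi -= r0 * (r0 @ xi)             # keep the jitter orthogonal to r0
        r = r0 + jitter * xi             # this step's effective readout
        err = r @ W @ X - 1.0            # readout error for every input
        W -= lr * np.outer(r, X @ err)   # rank-one update: outer product with r
    return W

rng = np.random.default_rng(0)
n, p = 20, 10
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # p orthonormal input points
r0 = rng.standard_normal(n)
r0 /= np.linalg.norm(r0)

before = spread(X, r0)
fixed = spread(train(X, r0, jitter=0.0) @ X, r0)
jittered = spread(train(X, r0, jitter=0.3) @ X, r0)
print(before, fixed, jittered)
```

In the fixed-readout run the orthogonal spread is untouched because every update is a multiple of `r0`; the jittered run differs only in that the direction of each update changes from step to step.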
The signature of this type of compression in a fully nonlinear system, getting away from our linearized toy models, is the idea of negative Lyapunov exponents. So these generalize eigenvalues to the case of trajectories that don't necessarily settle down to fixed points. And we see if we look at the eigenvalues of the W matrix in our fully nonlinear systems-- sorry, the Lyapunov exponents of our fully nonlinear system before training, they're all neutral, indicating no particular compression.
But as I allow training to occur, moving through different stages of stochastic gradient descent, eventually all of the Lyapunov exponents [INAUDIBLE] eigenvalues through a fixed point. If you're not used to these, all of the Lyapunov exponents become very strongly negative. So this we think is the general signature to look for in terms of training according to stochastic gradient descent. We expect, according to this work at least, that Lyapunov exponents for this system will become negative.
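For readers who want to compute this signature, here is a sketch of the standard QR-based estimate of Lyapunov exponents from a sequence of one-step Jacobians. For the rectified network discussed earlier, the Jacobian at each step would be the diagonal matrix of active units times W; the two constant-Jacobian examples below are illustrative stand-ins for "before" and "after" training, not results from the talk.

```python
import numpy as np

def lyapunov_exponents(jacobians):
    """Estimate Lyapunov exponents from a sequence of one-step Jacobians via QR."""
    n = jacobians[0].shape[0]
    Q = np.eye(n)
    log_growth = np.zeros(n)
    for J in jacobians:
        # Re-orthogonalize the propagated frame and record how much each axis grew.
        Q, R = np.linalg.qr(J @ Q)
        log_growth += np.log(np.abs(np.diag(R)))
    return log_growth / len(jacobians)

n, steps = 4, 100
neutral = lyapunov_exponents([np.eye(n)] * steps)            # no compression
compressed = lyapunov_exponents([0.5 * np.eye(n)] * steps)   # uniform squeeze
print(neutral, compressed)
```

Exponents of zero mean that nearby trajectories neither converge nor diverge; negative exponents in every direction are the fully nonlinear counterpart of points being squeezed in all directions at once.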
I want to highlight some other work of Sussillo and Barak, who I admire, and who also take dynamical systems approaches to trained networks. The last question in yellow, and we'll be done. So this is a suggestion that when neural networks are trained to do even really, really subtle things, right, that would put a machine learning person to sleep in like 1970, when they're trained using something that looks like back prop, and they're recurrent-- I'm not sure how much the recurrence matters-- then the dimension will go down.
So question, is this the kind of thing that is worth looking for in neural data? These colleagues, Matt Valley, Shawn Olsen, Sahar Manavi, Doug Ollerenshaw, Katherine Champion, and Merav Stern, asked this question in the context of this task. This is an experiment at the Allen Institute for Brain Science.
You are a mouse, and you will receive a reward whenever a stimulus here, a very obvious grating pattern, sometimes natural scenes, changes. So here-- vertical, vertical, vertical-- boring. You can see why it's hard to train the mice to do this.
What an awesome talk.
[LAUGHTER]
OK, there's the switch, and then the mouse is receiving the reward. So first of all, our colleagues have trained many mice to perform tasks like this. Let's see whether this slightly dynamical task-- so it's a change detection. It's one step in time more complicated than the categorization task I just introduced. Question one, will you see similar reduction in dimensionality over stages of learning for this temporal task? And the answer is yes. It seems to be in the same category of problem.
Second question is, well, let's look at brain-wide activity simultaneously measured while the animals learn to perform this task. This is an awesome experiment by these colleagues. They have the animal under the wide field, GCaMP6s, this case, expressing fluorescent-- sorry, that wasn't very clear. These mice have GCaMP6s. It's pan-excitatory.
This is the type of fluorescent signal that you'd see in a classic 2P experiment, except they zoom the camera out and capture also just one photon at a time. So you see activity, obviously not at single-cell resolution, but across the entire dorsal surface of the cortex. And you do this day one when the animal doesn't know how to do the task, on day two, on day three, on day four, all the way up through about a month of imaging. And you see how this brain-wide activity changes during a month of task acquisition. And here are the preliminary data.
Isn't this cool? So this is the dimensionality as measured using basically the same metric that I mentioned. I want to say, for honesty, we throw out the largest principal component in doing this analysis. That seems to be some sort of brain-wide fluctuation pattern, so just honesty, full disclosure.
Throwing that away, we see that the dimension in the residual directions collapses across stages of learning. This is obviously very suggestive, in accord with what you see in an artificial neural network model that's also trained on the task. We need more N. This was just a couple of mice.
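To make the measurement concrete, here is a minimal sketch of a dimensionality estimate with the largest principal component discarded. It assumes the metric is the participation ratio of the covariance eigenvalues, (sum of eigenvalues)^2 / (sum of squared eigenvalues); the transcript only says "the same metric that I mentioned," so that choice is an assumption. The function name `participation_ratio`, the `drop_top` parameter, and the synthetic "early" and "late" data are hypothetical illustrations, not the Allen Institute analysis or data.

```python
import numpy as np

def participation_ratio(activity, drop_top=0):
    """Participation-ratio dimension of activity with shape (n_samples, n_channels).

    Computes (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the
    channel covariance, after discarding the `drop_top` largest eigenvalues.
    """
    centered = activity - activity.mean(axis=0, keepdims=True)
    cov = np.cov(centered, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # sort in descending order
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative values
    kept = eigvals[drop_top:]
    return kept.sum() ** 2 / np.square(kept).sum()

rng = np.random.default_rng(0)
n_samples, n_channels = 2000, 50

def synthetic_activity(n_latents):
    """Hypothetical data: one strong shared fluctuation plus n_latents latent dimensions."""
    shared = 5.0 * rng.standard_normal((n_samples, 1)) * np.ones((1, n_channels))
    latents = rng.standard_normal((n_samples, n_latents)) @ rng.standard_normal((n_latents, n_channels))
    noise = 0.1 * rng.standard_normal((n_samples, n_channels))
    return shared + latents + noise

early = synthetic_activity(10)   # stand-in for an early stage of learning
late = synthetic_activity(3)     # stand-in for a late stage of learning

pr_early = participation_ratio(early, drop_top=1)
pr_late = participation_ratio(late, drop_top=1)
print(f"residual dimension, early: {pr_early:.2f}, late: {pr_late:.2f}")
```

With `drop_top=0` the shared fluctuation dominates the estimate in both synthetic conditions, which is one reason to report the residual dimension separately, as described above.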
We also need to check to be careful that the collapse in dimensionality really is the type of collapse [INAUDIBLE] by our theory. In other words, if we look at the direction normal to, right-- normal to the R direction, which we can use to decode the mouse's behavior, for example, the fluctuations or different stimulus presentation or the representations elicited by different stimuli in those orthogonal directions, those are the ones that are getting squashed, right? That would be a proper test of the theory, and we haven't done that yet. Anyway, at this stage, this is suggestive that this dimension collapse is really something we can look for in learning, at least these simple types of tasks.
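The test described here, which the speaker says has not yet been done, could be sketched as follows: project the activity onto the subspace orthogonal to the readout direction R and compare the variance that remains across learning stages. This is only an illustration under stated assumptions. The names `variance_off_readout` and `squash_orthogonal` are hypothetical, the readout vector is random, and the "early" and "late" data are synthetic, constructed so that the orthogonal directions are compressed by a known factor.

```python
import numpy as np

def variance_off_readout(activity, readout):
    """Total variance of activity (n_samples, n_units) orthogonal to the readout.

    `readout` has shape (n_units,) or (n_units, n_outputs); its columns span
    the directions used to decode the behavior.
    """
    R = readout.reshape(readout.shape[0], -1)
    Q, _ = np.linalg.qr(R)                       # orthonormal basis for the readout subspace
    centered = activity - activity.mean(axis=0, keepdims=True)
    residual = centered - (centered @ Q) @ Q.T   # remove the readout component
    return residual.var(axis=0).sum()

def squash_orthogonal(activity, readout, factor):
    """Hypothetical generator: scale everything orthogonal to the readout by `factor`."""
    q = readout / np.linalg.norm(readout)
    along = np.outer(activity @ q, q)
    return along + factor * (activity - along)

rng = np.random.default_rng(1)
n_samples, n_units = 500, 20
readout = rng.standard_normal(n_units)           # assumed decoding direction R
base = rng.standard_normal((n_samples, n_units))

early = squash_orthogonal(base, readout, factor=1.0)
late = squash_orthogonal(base, readout, factor=0.2)

off_early = variance_off_readout(early, readout)
off_late = variance_off_readout(late, readout)
print(f"off-readout variance ratio (late / early): {off_late / off_early:.3f}")
```

In real data the readout would have to be estimated, for example with a decoder fit to the animal's behavior, and that estimation step is not shown here.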
Let's end the talk. Summary-- so I had two points of view, top down and bottom up. We'll review the top-down one first. All we said is one thing, which is for classification problems for which you wouldn't expect that the network has to do anything in order to solve the task, instead gradient descent and recurrent dynamics, or at least network multiplication-style dynamics of weights, seemed to lead to strong compression of dimensions.
This doesn't randomly happen. This happens in the directions that are orthogonal to R, so orthogonal to the directions in which the task is being read out. This is a complementary viewpoint on dimension to that in a really beautiful paper-- I suggest it-- by Gao and Ganguli, who studied the task dimensionality. That would be dimensionality of representation [INAUDIBLE] direction of R.
Claim-- this is a useful signature of some learning rules and neural data. So I talked about stochastic gradient descent. You see the same thing. And that was actually the network that Merav [INAUDIBLE] and the change detection task was, the FORCE learning. That's sort of like a rank-one version with large updates in weights.
Anyway, so lots of learning rules give rise to this dimension reduction but not all. If you look at work by Todorov and colleagues about optimal control or alternative network construction schemes from Druckmann and Chklovskii and colleagues, you'll see that there are some ways of constructing your learning networks that will leave alone [INAUDIBLE] dimensional representations in the orthogonal space.
This is maybe good news, in that this dimension could be a useful signature of what is and is not happening in terms of qualitative properties of learning rules. The hope is that when this dimension reduction does happen, this isn't just some sort of factoid. But this is actually useful from the perspective of pushing us towards this Goldilocks-type of representation, in which learning can proceed from relatively few examples.
I think this is also related to the ideas of the information bottleneck, in which networks trained under stochastic gradient descent are found, by Tishby and colleagues, to throw away lots of information about the inputs except the label. Compression, to a point, certainly does that. And negative Lyapunov exponents would be a mechanism for that.
Summarizing again the first half of the talk, we asked, not for specific weight matrices that come from different learning rules, but in general, can we say something about what the features of a connectome are for dimensionality? Our answers were don't give up on dimensionality, even if pairwise correlation is low. There may be very dramatic effects at the population level. And we have a general approach to attacking this in terms of small substructures of networks that we call motifs that fails in some cases and succeeds in others.
I want to close by mentioning that I think there's a bridge between these levels of description that I'd like to leave you with. Remember, to the extent to which our bottom-up theory worked, we said that you need to measure only very local features of connectivity involving a couple, or maybe in E/I cases longer, chains of connections among just a few cells. Well, it turns out, right, that this local level is where a lot of plasticity rules actually operate.
A hope is therefore that if dimensionality is something that actually matters, as per these two points, in terms of learning useful representations, that it may be under local [INAUDIBLE], which could be driven by the type of plasticity mechanisms that we already know about. On that optimistic note, let's thank everybody-- Stefano Recanatesi, Matt Farrell, Merav Stern, Sahar Manavi, Shawn Olsen, Matt Valley, Doug Ollerenshaw, as well as Guillaume Lajoie and the ghost, the friendly ghost of Yu Hu. Let's thank sources of funding and let's thank the Allen Institute for existing and supporting a lot of this work as well. And thank you so much for the questions and for your kind attention.
[APPLAUSE]